Haplotype estimation algorithm for trios using NGS reads

TriPoly, version 1.5, Ehsan Motazedi (ehsan.motazedi@gmail.com), last modified: March 28 2018
Introduction
============
**TriPoly** \(executable *main.py*\) is an efficient tool for haplotype estimation in polyploid F1 populations with two parents and one or several offspring, using next-generation sequencing reads. The method considers each offspring together with the parents as a trio and tries to simultaneously maximize the likelihood of all of the trio haplotypes in the population, using a Bayesian framework that incorporates the sequence data and the recombination probability in the likelihood function. It can handle multi-allelic markers and discontinuous haplotype blocks.
The main parameters of the algorithm, i.e. the variant calling error rate, the recombination probability between adjacent markers \(τ\), the branching threshold \(ρ\) and the pruning rate \(κ\) \(*Motazedi et al.* 2017\), can be passed as optional arguments to TriPoly. TriPoly accepts a multi-sample BAM file \(the first two sample names being those of the parents and the rest those of the progeny\), from which the so-called SNP-fragment matrix is constructed for haplotype estimation. To build this matrix from the sequence reads, the minimum acceptable mapping and base quality scores, as well as the maximum acceptable insert size for treating paired reads as a single fragment, can be passed as parameters.
It is important that the parent samples appear before the offspring in the BAM header, as the first two sample names detected in the read group lines of the BAM header are considered the parents and the rest the offspring by TriPoly. Also, the VCF file must contain only SNPs, i.e. no complex variants or indels. This can be achieved by running the *break_vcf.sh* script on the original VCF file, as shown below.
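***Example \(VCF preprocessing, with arbitrary file names\):***
./break_vcf.sh raw.vcf raw.snps.vcf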
An extra option, -*a*, --*aggressive*, specifies an aggressive pruning rate, α, so that at each step \(100×α\)% of the candidate extensions are pruned if the original pruning scheme is unable to rule out at least \(100×α\)% of these extensions. Setting a high α can greatly increase the speed of the algorithm. Finally, a warm-up length \(in base pairs\) can be set, measured from the start of each haplotype block, within which no branching or pruning is performed. With the --*nogen* option, it is also possible to estimate the haplotypes from the reads without restricting the allele dosages to those detected in the VCF file.
By default, TriPoly reports all of the haplotypes that have survived the successive pruning steps, together with their relative likelihoods. The command line option -t, --top changes this behavior to report only the most likely haplotype (or a random haplotype if several haplotypes have a likelihood equal to the maximum).
TriPoly is written in *Python* and is compatible with both Python 2 and Python 3.
Citation:
=====================
To cite TriPoly, please refer to *TriPoly: a haplotype estimation approach for polyploids using sequencing data of related individuals*, Ehsan Motazedi, Dick de Ridder, Richard Finkers and Chris Maliepaard, 2017, bioRxiv 163162; *doi: https://doi.org/10.1101/163162*
Input parameters:
=====================
For the input parameters, see the help of the software:
./main.py --help, -h
***Example:***
./main.py bamfile vcffile output_dirname --kappa 0.85 -a 0.8 --rho 0.2 --top
Download:
=====================
The software is available on the GitLab server of Wageningen UR:
git clone git@git.wageningenur.nl:motaz001/TriPoly.git --recursive
Copyright and License:
=====================
Copyright (c) 2017, Ehsan Motazedi, Wageningen UR.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files \(the "Software"\), to deal in the Software without restriction, including without limitation the rights to use, copy, share and modify the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. This permission is only valid for academic and personal use, i.e. not for commercial purposes.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
# Make the SNP-fragment list from a multi-sample bam file and its corresponding VCF file with input options: mmq, mbq, maxIS and qoffset.
# Written by Ehsan Motazedi, Wageningen UR, 04-11-2016.
# Last modified: 22-02-2018.
import copy
import pysam
import re
import subprocess
from genotypes import getAllelesPop, getGenotypesPop
from math import log
from reads import Read
def adjust_seq(read):
""" Adjust the read sequence according to the mapped positions, i.e. get read of the insertions and clipped bases."""
cig = list(_x for _x in re.split('[0-9]{1,}', read.cigarstring) if _x)
cign = list(_x for _x in re.split('[^0-9]', read.cigarstring) if _x)
cig = list(cig[_n]*int(cign[_n]) for _n in range(0, len(cig)))
cig = ''.join(_x for _x in cig if 'D' not in _x) # deleted nucleotides from the reference are not present in the read sequence
adj_seq = []
adj_qual = []
for _n, _x in enumerate(read.seq):
if cig[_n]=='M': # Allow only match/mismatch of the read nucleotides, i.e. no clipping, no insertion.
adj_qual.append(read.qual[_n])
adj_seq.append(_x)
return ''.join(adj_seq), adj_qual
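# A minimal sketch (not part of TriPoly) of adjust_seq: MockRead is a hypothetical stand-in
# exposing the pysam read attributes the function uses. The inserted base ('G', quality 'J')
# is dropped together with its quality score, while matched/mismatched bases are kept.
from collections import namedtuple
MockRead = namedtuple('MockRead', ['cigarstring', 'seq', 'qual'])
read_example = MockRead(cigarstring='2M1I3M', seq='ACGTAG', qual='IIJIII')
print(adjust_seq(read_example)) # ('ACTAG', ['I', 'I', 'I', 'I', 'I'])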
def frag_gen(varpos, allelelst, genolst, coordinates, nucleotides, qscores):
""" Generate SNP-fragments from (paired-end) reads."""
if (not coordinates) or (not nucleotides):
return Read(), {}
var_codes = []
var_num = []
var_q = []
for _n, _x in enumerate(varpos):
if _x < coordinates[0]:
continue
if _x > coordinates[-1]:
break
if _x in coordinates:
if set(genolst[_n].GetGenes()).intersection(set(['.','-'])): # Throw away missing genotypes or genotypes with one or more missing allele(s)
continue
if len(set(genolst[_n].GetGenes()))<2: # Do not include a member's homozygous sites in its SNP-fragments
continue
try:
var_codes.append(allelelst[_n][2][nucleotides[coordinates.index(_x)]])
var_num.append(allelelst[_n][0])
var_q.append(qscores[coordinates.index(_x)])
except KeyError: # if the called nucleotide is wrong, i.e. does not exist in VCF alleles
pass
try: # return the reads {SNP number: allele} and the associated quality scores {SNP number: Qscore}
return Read({_x:str(_y) for _x, _y in zip([var_num[0], var_num[1]]+var_num[2:], var_codes)}), {_x:str(_y) for _x, _y in zip(var_num, var_q)} # indexing var_num[0] and var_num[1] deliberately raises IndexError if the fragment covers fewer than 2 SNPs
except IndexError: # throw away reads with less than 2 SNPs
return Read(), {}
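# A self-contained sketch of frag_gen (toy values; assumes the repository's Read and Genotype
# classes): two heterozygous SNPs at 1-based positions 1040 and 1112, both covered by the read.
from genotypes import Genotype
varpos_ex = [1040, 1112]
allelelst_ex = [(0, 1040, {'A': 0, 'C': 1}), (1, 1112, {'G': 0, 'T': 1})]
genolst_ex = [Genotype(0, 1040, '0', '1'), Genotype(1, 1112, '0', '1')]
frag_ex, q_ex = frag_gen(varpos_ex, allelelst_ex, genolst_ex, [1040, 1112], ['C', 'G'], [30, 30])
# frag_ex maps SNP 0 to allele '1' and SNP 1 to allele '0'; q_ex is {0: '30', 1: '30'}.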
class InputError(Exception):
""" Handle invalid input specifications."""
def __init__(self, msg, *args):
super(InputError, self).__init__(args)
self.msg = msg
def __str__(self):
return "InputError: {}\n".format(self.msg)
def __repr__(self):
return (self.msg,)+self.args+('\n',)
def get_frags(bamfile, vcffile, maxIS=3000, mbq=13, mmq=20, qoffset=33):
"""
mmq : minimum read mapping quality to consider a read for phasing, default 20\n
qoffset <33/64> : quality value offset, 33 or 64 depending on how the quality values were encoded, default 33\n
mbq : minimum base quality to consider a base for a haplotype fragment, default 13\n
maxIS : maximum insert size for a paired-end read to be considered as a single fragment for phasing, default 3000.\n
"""
try:
all_reads = pysam.Samfile(bamfile, 'rb')
except IOError:
raise InputError('The input BAM file was not found!')
ReadHeader = subprocess.Popen(["samtools","view","-H", bamfile], shell=False, stdout=subprocess.PIPE, stderr=subprocess.PIPE, close_fds=True)
header, err_header = ReadHeader.communicate()
if ReadHeader.returncode!=0:
raise InputError('Failed to read the header from the BAM file! Original error message:\n'+(bytes.decode(err_header) if isinstance(err_header, bytes) else err_header))
if isinstance(header, bytes):
header=bytes.decode(header)
else:
pass
RGIDs, SMnames = [], []
for _headerline in header.splitlines(): # parse the header of the bam file to extract the ID and SM fields of each Read Group
if _headerline[0:3]=='@RG':
RGID_added, SM_added = False, False
for _n, _field in enumerate(_headerline.split()):
if 'ID' in _field:
if not RGID_added:
RGIDs.append(''.join(_headerline.split()[_n].split(':')[1:])) # add read group ID
RGID_added = True
else:
raise InputError('Double ID fields detected in @RG header line!')
elif 'SM' in _field:
if not SM_added:
SMnames.append(''.join(_headerline.split()[_n].split(':')[1:])) # add the sample name associated with the read group ID
SM_added = True
else:
raise InputError('Double SM fields detected in @RG header line!')
if SM_added and RGID_added:
pass
elif SM_added:
raise InputError('ID field missing in @RG header line!')
elif RGID_added:
raise InputError('SM field missing in @RG header line!')
else:
raise InputError('ID and SM fields missing in @RG header line!')
if len(RGIDs)!=len(set(RGIDs)):
raise InputError('Duplicate read group IDs detected in the bam header!')
GroupedReadsWithID = [[] for _id in RGIDs] # separate reads belonging to each Read Group
for _read in all_reads:
GroupedReadsWithID[RGIDs.index(dict(_read.get_tags())['RG'])].append(_read)
GroupedReads, GroupedSM = [], [] # combine the reads with different RGID but the same SM as they are assumed to belong to the same sample
for _SMname, _ReadGroup in zip(SMnames, GroupedReadsWithID):
if _SMname not in GroupedSM:
GroupedReads.append(_ReadGroup)
GroupedSM.append(_SMname)
else:
GroupedReads[GroupedSM.index(_SMname)]+=_ReadGroup
del GroupedReadsWithID
try:
genolst = getGenotypesPop(vcffile, GroupedSM)
allelelst = getAllelesPop(vcffile, GroupedSM)
except IOError:
raise InputError('The VCF file was not found!')
except:
raise
Frag_lsts, Q_lsts = [], []
varpos = list(_x[1] for _x in allelelst)
for _group, reads in enumerate(GroupedReads):
frag_lst, q_lst = [], []
reads = sorted(reads, key= lambda _x: (_x.qname, _x.flag & 0x900)) # sort the alignments using their names, with the primary alignments being placed first.
_rnum = 0
rNUM = len(reads)
while _rnum < rNUM: # scan through the alignments to find the pairs/singles
break_mate = False
is_proper_pair = False
read = copy.deepcopy(reads[_rnum])
if read.is_unmapped or read.is_duplicate or (read.flag & 0x900): # throw away unmapped reads, duplicates and secondary/supplementary alignments
_rnum+=1
continue
try:
if read.qname == reads[_rnum+1].qname: # means the read is paired to a mate or has multiple/supplementary alignments
if reads[_rnum+1].is_unmapped or reads[_rnum+1].is_duplicate or (reads[_rnum+1].flag & 0x900): # if the next read is unmapped, a duplicate or not primary, skip it
pass
else:
is_proper_pair = True # means the read is paired to a proper mate
mate = copy.deepcopy(reads[_rnum+1])
_rnum+=2
else: # means the read is single
_rnum+=1
except IndexError: # could occur for the last read in the alignments' list
_rnum+=1
if is_proper_pair:
if (max(mate.positions+read.positions)-min(mate.positions+read.positions)+1)>maxIS: # Check the maximum insert-size to consider the mates as a single fragment
break_mate = True
if read.mapping_quality >= mmq:
try:
adj_seq, adj_qual = adjust_seq(read)
coordinates, nucleotides, quals = list(zip(*[(int(_x), _y, _z) for _x, _y, _z in zip(read.positions, adj_seq.upper(), list(ord(_x)-qoffset for _x in adj_qual)) if _z>=mbq]))
except ValueError as e:
if e.args[0][0:len("need more than 0 values to unpack")]=="need more than 0 values to unpack" or e.args[0][0:len("not enough values to unpack")]=="not enough values to unpack":
coordinates, nucleotides, quals= [(), (), ()]
else:
raise
else:
coordinates, nucleotides, quals = [(), (), ()]
if mate.mapping_quality >= mmq:
try:
adj_seq, adj_qual = adjust_seq(mate)
coordinates_mate, nucleotides_mate, quals_mate = list(zip(*[(int(_x), _y, _z) for _x, _y, _z in zip(mate.positions, adj_seq.upper(), list(ord(_x)-qoffset for _x in adj_qual)) if _z>=mbq]))
except ValueError as e:
if e.args[0][0:len("need more than 0 values to unpack")]=="need more than 0 values to unpack" or e.args[0][0:len("not enough values to unpack")]=="not enough values to unpack":
coordinates_mate, nucleotides_mate, quals_mate = [(), (), ()]
else:
raise
else:
coordinates_mate, nucleotides_mate, quals_mate = [(), (), ()]
if break_mate:
pass
else: # merge the sub-reads if the insert-size is less than maxIS
try:
coordinates, nucleotides, quals = list(zip(*sorted(zip(coordinates+coordinates_mate, nucleotides + nucleotides_mate, quals+quals_mate), key = lambda x: x[0])))
except ValueError as e:
if e.args[0][0:len("need more than 0 values to unpack")]=="need more than 0 values to unpack" or e.args[0][0:len("not enough values to unpack")]=="not enough values to unpack":
coordinates, nucleotides, quals = [(), (), ()]
else:
raise
else:
break_mate = True
if read.mapping_quality >= mmq:
try:
adj_seq, adj_qual = adjust_seq(read)
coordinates, nucleotides, quals = list(zip(*[(int(_x), _y, _z) for _x, _y, _z in zip(read.positions, adj_seq.upper(), list(ord(_x)-qoffset for _x in adj_qual)) if _z>=mbq]))
except ValueError as e:
if e.args[0][0:len("need more than 0 values to unpack")]=="need more than 0 values to unpack" or e.args[0][0:len("not enough values to unpack")]=="not enough values to unpack":
coordinates, nucleotides, quals = [(), (), ()]
else:
raise
else:
coordinates, nucleotides, quals = [(), (), ()]
coordinates_mate, nucleotides_mate, quals_mate = [(), (), ()]
if break_mate:
pass
else:
unique_q = []
unique_c = []
unique_n = []
for _n, _c in enumerate(coordinates): # remove the duplicates from overlapping positions
try:
if unique_c[-1]!=_c:
unique_c.append(_c)
unique_n.append(nucleotides[_n])
unique_q.append(quals[_n])
elif unique_n[-1]==nucleotides[_n]:
unique_q[-1] = min(126-qoffset, unique_q[-1]+quals[_n])
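# e.g. agreeing overlap calls with Q20 and Q30 combine to Q50 (their error probabilities multiply), capped at 126-qoffset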
else: # if the called nucleotides differ at overlapping sites, use the one with the highest Phred score and adjust the Phred score.
if quals[_n]>unique_q[-1]:
_new_q_score = round(-10*log(1-10**(-unique_q[-1]/10)*(1-10**(-quals[_n]/10)), 10), 5) # Q=-10log(p,10)
if _new_q_score >= mbq:
unique_n[-1] = nucleotides[_n]
unique_q[-1] = _new_q_score
else:
del(unique_c[-1], unique_n[-1], unique_q[-1])
else:
_new_q_score = round(-10*log(1-(1-10**(-unique_q[-1]/10))*10**(-quals[_n]/10), 10), 5)
if _new_q_score >= mbq:
unique_q[-1] = _new_q_score
else:
del(unique_c[-1], unique_n[-1], unique_q[-1])
except IndexError:
unique_c.append(_c)
unique_n.append(nucleotides[_n])
unique_q.append(quals[_n])
coordinates, nucleotides, quals = [unique_c, unique_n, unique_q]
coordinates = list(_x+1 for _x in coordinates) # Convert the zero-based BAM coordinates to 1-based, as the coordinates are 1-based in the VCF (like the SAM format).
new_frag, new_q = frag_gen(varpos, allelelst, genolst[_group], coordinates, nucleotides, quals)
frag_lst.append(new_frag)
q_lst.append(new_q)
if break_mate:
coordinates_mate = list(_x+1 for _x in coordinates_mate)
new_frag_mate, new_q_mate = frag_gen(varpos, allelelst, genolst[_group], coordinates_mate, nucleotides_mate, quals_mate)
frag_lst.append(new_frag_mate)
q_lst.append(new_q_mate)
try:
frag_lst, q_lst = [_lst for _lst in zip(*[(_x, _y) for _x, _y in zip(frag_lst, q_lst) if not _x.isNULL()])]
except ValueError as e:
if e.args[0][0:len("need more than 0 values to unpack")]=="need more than 0 values to unpack" or e.args[0][0:len("not enough values to unpack")]=="not enough values to unpack":
frag_lst, q_lst = [], []
Frag_lsts.append(frag_lst)
Q_lsts.append(q_lst)
return Frag_lsts, Q_lsts, GroupedSM
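# A minimal usage sketch (hypothetical file names): one SNP-fragment list and one quality-score
# list are returned per sample, with sample names in BAM-header order (parents first for TriPoly).
Frag_lsts_ex, Q_lsts_ex, samples_ex = get_frags('pop.bam', 'pop.vcf', maxIS=3000, mbq=13, mmq=20, qoffset=33)
print(samples_ex[0], len(Frag_lsts_ex[0])) # number of SNP-fragments extracted for the first parent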
# Break complex variants and throw out indels from the input vcf file.
# EX: ./break_vcf.sh my.vcf out.vcf
# Written by Ehsan Motazedi, Wageningen UR, 17-10-2017
# Last Modified: 27-07-2018
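# A sketch of the recoding (hypothetical tetraploid record): a complex variant with
# REF=ACG and ALT=ATG,ACA at POS 1000 is broken into two bi-allelic SNPs, and the
# genotype field of each sample is recoded accordingly:
#   input : chr1 1000 . ACG ATG,ACA 50 PASS . GT 0/1/2/2
#   output: chr1 1001 . C  T  50 PASS . GT 0/1/0/0
#           chr1 1002 . G  A  50 PASS . GT 0/0/1/1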
file=$1
outfile=$2
tmpfile=$(mktemp /tmp/dummy.XXXXXX)
exec 3> "$tmpfile" # open tmpfile for writing on file descriptor 3
python << EOS
import re
with open("$file", 'rU') as vcf, open("$tmpfile", "w") as recoded:
detected_SNPs = dict()
for line in vcf:
if line.lstrip()[0]=='#':
recoded.write(line)
else:
fields = line.split()
contig = fields[0]
if contig not in detected_SNPs.keys():
detected_SNPs[contig] = []
if '.' in fields[3] or '.' in fields[4]: #indels
pass
elif ',' in fields[4]: # multi-allelic variants
_temp_alts = fields[4].split(',')
if not all(len(_alt)==len(fields[3]) for _alt in _temp_alts): # do not parse complex variants that contain indels
pass
else: # break complex variants into bi-allelic SNPs if possible
difflocation = []
diff_ref_alts = []
diff_ref_alts_dic = []
for _base in range(0, len(fields[3])):
ref_alt = [fields[3][_base]] + [_x for _x in set(_alt[_base] for _alt in _temp_alts)-set([fields[3][_base]])] # check at each position within the complex variant if ref, alt corresponds to a SNP
if len(ref_alt)>1:
ref_alt_dic = {ref_alt[_n]:str(_n) for _n in range(0, len(ref_alt))}
ref_alt_dic['.']='.' # missing genotype added
difflocation.append(_base)
diff_ref_alts.append(ref_alt)
diff_ref_alts_dic.append(ref_alt_dic)
for _n in range(0, len(difflocation)):
_pos = str(int(fields[1])+difflocation[_n])
if (_pos not in detected_SNPs[contig]):
recoded.write(fields[0]+'\t'+_pos+'\t'+fields[2]+'\t'+diff_ref_alts[_n][0]+'\t'+','.join(diff_ref_alts[_n][1:])+'\t'+'\t'.join(fields[5:9]))
for _genotype in fields[9:]:
_dosage = re.split('/|\|', _genotype.split(':')[0])
_old_dic = {'0':diff_ref_alts[_n][0], '.':'.'} #'.' stands for missing genotypes
_old_dic.update({str(_m+1):_temp_alts[_m][difflocation[_n]] for _m in range(0, len(_temp_alts))})
_new_dosage = '/'.join(diff_ref_alts_dic[_n][_old_dic[_allele]] for _allele in _dosage)
recoded.write('\t'+_new_dosage+':'+':'.join(_genotype.split(':')[1:]))
recoded.write('\n')
detected_SNPs[contig].append(_pos)
else:
pass
elif len(fields[3])!=len(fields[4]): #indels
pass
elif len(fields[3])==1: #biallelic SNPs
if (fields[1] not in detected_SNPs[contig]):
recoded.write(line)
detected_SNPs[contig].append(fields[1])
else:
pass
else: # MNPs
difflocation = []
for _n in range(0, len(fields[3])):
if fields[3][_n]!=fields[4][_n]:
difflocation.append(_n)
for _n in range(0, len(difflocation)):
_pos = str(int(fields[1])+difflocation[_n])
if (_pos not in detected_SNPs[contig]):
recoded.write(fields[0]+'\t'+_pos+'\t'+fields[2]+'\t'+fields[3][difflocation[_n]]+'\t'+fields[4][difflocation[_n]]+'\t'+'\t'.join(fields[5:])+'\n')
detected_SNPs[contig].append(_pos)
else:
pass
EOS
cp "$tmpfile" "$outfile"
status=$?
if [ $status -eq 0 ]; then
rm "$tmpfile"
fi
# Written by Ehsan Motazedi, Wageningen UR, 04-11-2016.
# Last Updated: 30-08-2018
import copy
import re
import sys
from collections import Counter
from Bio import SeqIO
class Genotype(object):
def __init__(self, numS, pos, *genotype):
try:
self.s = int(numS)
self.pos = int(pos)
except (ValueError, TypeError) as e:
e.args+=("The SNP id and SNP position must be integers!",)
raise
self.genotype = copy.deepcopy(genotype)
def __repr__(self):
return '['+','.join(str(_x) for _x in (self.s, self.pos))+','+','.join(str(_x) for _x in self.genotype)+']'
def __str__(self):
return "SNP number = {}, SNP position = {}, Genotype = {}".format(self.s, self.pos,
'/'.join(str(_x) for _x in self.genotype))
def __eq__(self, other):
return self.pos==other.pos and self.s==other.s and Counter(str(_x) for _x in self.genotype)==Counter(str(_x) for _x in other.genotype)
def isMISSING(self):
return set(self.genotype).issubset(set(['.','-','NA','na','Na']))
def GetGenes(self):
return self.genotype
def GetPos(self):
return self.pos
def GetS(self):
return self.s
def dropHomozygous(allgenolst):
"""Eliminate homozygous genotypes from a list of genotype objects."""
_s=-1 # genotypes will be numbered starting from 0 (_s+=1)
sorted_genolst = sorted(allgenolst, key = lambda x: x.GetPos()) # sort the input list upon the SNP position
het_genolist=[]
for _geno in sorted_genolst:
if len(set(_geno.GetGenes()))>=2:
_s+=1
het_genolist.append(Genotype(_s, _geno.GetPos(), *_geno.GetGenes()))
return het_genolist
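# A small sketch (toy values) of the Genotype wrapper and the renumbering done by dropHomozygous:
g0_ex = Genotype(0, 500, '0', '0', '0', '0') # homozygous tetraploid site, dropped
g1_ex = Genotype(1, 812, '0', '0', '1', '1') # heterozygous site, kept and renumbered from 0
print(dropHomozygous([g0_ex, g1_ex])[0]) # SNP number = 0, SNP position = 812, Genotype = 0/0/1/1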
def getGenotypes(vcffile, Pop=False, contig=False):
"""Extract the genotypes from a VCF file. Return the genotypes as a list of Genotypes objects."""
_s=-1 # genotypes will be numbered starting from 0 (_s+=1)
genolist=[]
if contig:
contig_names = [] # contig names in the vcf file, should be just one as otherwise haplotyping will be nonsense!
with open(vcffile, 'rU') as vfile:
for _geno in vfile:
if _geno[0]!='#':
_genolst=_geno.split()
if Pop or len(set(re.split('/|\|', _genolst[-1].split(':')[0])))>1: # If the vcf file belongs to a member of a population, keep all of the SNPs, even the homozygous ones.
_s+=1 # Throw away homozygous SNPs otherwise.
genolist.append(Genotype(_s, _genolst[1], *re.split('/|\|', _genolst[-1].split(':')[0]))) # Extract genotypes
if contig:
contig_names.append(_genolst[0])
if contig:
return genolist, contig_names
return genolist
def getGenotypesPop(vcffile, SampleNames, contig=False):
"""Extract the genotypes from a multiple-sample VCF file of a population with given sample names. Return the genotypes as lists of Genotype objects for each sample in the population."""
if not SampleNames:
raise ValueError('No sample names were given to getGenotypesPop to extract their genotypes!')
_s=-1 # genotypes will be numbered starting from 0 (_s+=1)
SampleNames = [_sample for _sample in set(SampleNames)]
_header = False
genolist, SMnames, map_dict = [[] for _sm in SampleNames], [], dict()
if contig:
contig_names = [] # contig names in the vcf file, should be just one as otherwise haplotyping will be nonsense!
with open(vcffile, 'rU') as vfile:
for _geno in vfile:
if _geno[0]!='#':
if not _header:
raise ValueError('No header line was detected in the VCF file before the variant records!')
_genolst=_geno.split()
_s+=1
_allalleles_at_s = "" # Collect all of the alleles at s to throw away variants homozygous for all population members
sample_alleles=[[] for _sample in SMnames]
for _sample in SMnames:
sample_alleles[map_dict[_sample]] = re.split('/|\|', _genolst[9:][map_dict[_sample]].split(':')[0]) # Extract the alleles present for each member of the population
_allalleles_at_s+="".join(sample_alleles[map_dict[_sample]])
if len(set(_allalleles_at_s)-set(["."]))>1: # Throw away variants homozygous in the population
for _n, _sample in enumerate(SMnames):
genolist[map_dict[_sample]].append(Genotype(_s, _genolst[1], *sample_alleles[map_dict[_sample]])) # Extract the genotypes of each member in the population
if contig:
contig_names.append(_genolst[0])
elif _geno[1]!='#':
if set(['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT']).issubset(set(_geno.lstrip('#').split())):
_header = True
SMnames = _geno.split()[9:] # extract sample names in the VCF header
if len(set(SMnames))<len(SMnames):
raise ValueError('Duplicate sample names observed in the VCF header!')
if set(SMnames).symmetric_difference(set(SampleNames)):
raise ValueError('The given sample names do not match those in the VCF header!')
map_dict = dict(zip(SMnames, [SampleNames.index(_sample) for _sample in SMnames]))
else:
pass
if contig:
return genolist, contig_names
return genolist
def getAlleles(vcffile, Pop=False):
"""Extract the alleles and their numerical code at (heterozygous) sites from a VCF file. Return\n\
the number, position and a dictionary, for the alleles and their coding as a list of triple tuples. If\n\
the VCF file belongs to a member of a population, keep all of the SNPs. Otherwise, throw away the homozygous ones."""
_s=-1
alllist=[]
with open(vcffile, 'rU') as vfile:
for _geno in vfile:
if _geno[0]!='#':
_alllst=_geno.split()
if Pop or len(set(re.split('/|\|', _alllst[-1].split(':')[0])))>1:
_s+=1
alleles = [_alllst[3]]+re.split("\ *,\ *|,\ *|\ *,|,", _alllst[4])
codes = sorted(set(re.split('/|\|', _alllst[-1].split(':')[0])), key = lambda x: int(x))
if '0' not in codes: # the reference allele should have been coded as 0
codes = ['0'] + codes # the reference and alternative nucleotides should have been CONSISTENTLY coded in all population files to allow merging later.
alllist.append((_s, int(_alllst[1]), {_x.upper():int(_y) for _x, _y in zip(alleles, codes)}))
return alllist
def getAllelesPop(vcffile, SampleNames):
"""Extract the alleles and their numerical code from a multiple-sample VCF file of a population with given sample names. Return\n\
the SNP number, the SNP position and the dictionary that specifies the reference and the observed alternative alleles for the SNP,
as a list of tuples."""
_s=-1
alllist, SMnames = [], []
_header = False
map_dict = dict()
if not SampleNames:
raise ValueError('No sample names were given to getAllelesPop to extract their alleles!')
SampleNames = [_sample for _sample in set(SampleNames)] # get rid of duplicate sample names
with open(vcffile, 'rU') as vfile:
for _geno in vfile:
if _geno[0]!='#':
if not _header:
raise ValueError('No header line was detected in the VCF file before the variant records!')
_alllst = _geno.split()
_s+=1
_allalleles_at_s = "" # Collect all of the alleles at s to throw away variants homozygous for all population members
sample_alleles=[[] for _sample in SMnames]
for _sample in SMnames:
sample_alleles[map_dict[_sample]] = re.split('/|\|', _alllst[9:][map_dict[_sample]].split(':')[0]) # Collect the alleles for each member of the population
_allalleles_at_s+="".join(sample_alleles[map_dict[_sample]])
if len(set(_allalleles_at_s)-set(["."]))>1: # Throw away variants homozygous in the population
alleles = [_alllst[3]]+re.split("\ *,\ *|,\ *|\ *,|,", _alllst[4])
codes=[]
for _alleles in sample_alleles:
codes+=_alleles
codes = sorted(set(codes)-set(['.','-']), key = lambda x: int(x)) # '.', '-' corresponds to missing genotypes
if '0' not in codes:
codes = ['0'] + codes
alllist.append((_s, int(_alllst[1]), {_x.upper():int(_y) for _x, _y in zip(alleles, codes)}))
elif _geno[1]!='#':
if set(['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT']).issubset(set(_geno.lstrip('#').split())):
_header = True
SMnames = _geno.split()[9:] # extract sample names in the VCF header
if len(set(SMnames))<len(SMnames):
raise ValueError('Duplicate sample names observed in the VCF header!')
if set(SMnames).symmetric_difference(set(SampleNames)):
raise ValueError('The given sample names do not match those in the VCF header!')
map_dict = dict(zip(SMnames, [SampleNames.index(_sample) for _sample in SMnames]))
else:
pass
return alllist
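# A usage sketch (hypothetical file and sample names) mirroring how get_frags calls these two
# functions on one multi-sample VCF; the sample names must match the SM fields of the VCF header.
samples_ex = ['P1', 'P2', 'F1_01']
genolst_ex = getGenotypesPop('pop.vcf', samples_ex) # one list of Genotype objects per sample
allelelst_ex = getAllelesPop('pop.vcf', samples_ex) # list of (SNP number, position, {allele: code}) tuples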
def mergeVCFs(reference, output, *vcffiles):
"""Merge several vcf files in a single multi sample one, so that all of the lists have genotypes for the same positions\n\
and with the same numbering, starting with zero for the allele in the reference sequence at variant positions. It is assumed that\n\
both VCF and the reference contain only one and the same contig."""
P = [4, 2, 3] # In case a VCF file is empty, the ploidy level cannot be estimated from it. This ad hoc solution should then be used!
with open(reference,'rU') as handle, open(reference+".fai",'rU') as index:
for record in SeqIO.parse(handle,"fasta"):
_seq=record.seq
_id=record.id
_descrip=record.description
line_fai=index.readline()
columns=re.split('\s+',line_fai.rstrip())
contig_id= columns[0] # id derived from .fai
contig_length= int(columns[1]) # seq length defined from .fai
offset = int(columns[2]) # offset derived from .fai
if contig_id!=_id:
raise ValueError("Contig name in the fasta file does not match the name in its index!")
if contig_length!=len(_seq):
raise ValueError("Contig length in the fasta file does not match the length specified in its index!")
SampleNames=[]
for _file in vcffiles:
with open(_file, 'rU') as handle:
for _line in handle:
if _line[0]=='#' and _line[1]!='#' and set(['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT']).issubset(set(_line.lstrip('#').split())):
SampleNames.append(_line.rstrip().split()[-1])
break
genolists = [getGenotypes(_file, Pop=True) for _file in vcffiles]
allelelists = [getAlleles(_file, Pop=True) for _file in vcffiles]