Commit d87894ff authored by Nijsse, Bart

silva lineage to genome ids conversion

parent bf8bcf12
Bacteria;Actinobacteriota;Coriobacteriia;Coriobacteriales;Atopobiaceae;uncultured;uncultured bacterium MGYG-HGUT-00341,MGYG-HGUT-02094,MGYG-HGUT-00819
Bacteria;Firmicutes;Clostridia;Peptostreptococcales-Tissierellales;Family MGYG-HGUT-04314,MGYG-HGUT-01014
Bacteria;Firmicutes;Bacilli;Erysipelotrichales;Erysipelotrichaceae;Faecalitalea;uncultured bacterium MGYG-HGUT-00930,MGYG-HGUT-01177
Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Marinifilaceae;Sanguibacteroides;uncultured bacterium MGYG-HGUT-01608
Bacteria;Campylobacterota;Campylobacteria;Campylobacterales;Campylobacteraceae MGYG-HGUT-02429
Bacteria;Actinobacteriota;Coriobacteriia;Coriobacteriales;Eggerthellaceae;uncultured;uncultured bacterium MGYG-HGUT-00767,MGYG-HGUT-01153,MGYG-HGUT-04112,MGYG-HGUT-04183,MGYG-HGUT-04554,MGYG-HGUT-04635,MGYG-HGUT-01600,MGYG-HGUT-02957,MGYG-HGUT-01740,MGYG-HGUT-02023,MGYG-HGUT-00611,MGYG-HGUT-03089,MGYG-HGUT-00595,MGYG-HGUT-04592
Bacteria;Actinobacteriota;Coriobacteriia;Coriobacteriales;Eggerthellaceae MGYG-HGUT-02441,MGYG-HGUT-02451,MGYG-HGUT-02509,MGYG-HGUT-02442
Bacteria;Firmicutes;Bacilli;Erysipelotrichales;Erysipelotrichaceae;Merdibacter;uncultured bacterium MGYG-HGUT-03087,MGYG-HGUT-00380,MGYG-HGUT-02014,MGYG-HGUT-03874
Bacteria;Actinobacteriota;Coriobacteriia;Coriobacteriales;Coriobacteriaceae;Enorma;uncultured bacterium MGYG-HGUT-04584,MGYG-HGUT-01605
Bacteria;Firmicutes;Clostridia;Peptostreptococcales-Tissierellales;Family XI;Anaerococcus MGYG-HGUT-01424
Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Sutterellaceae;Parasutterella;Turicimonas MGYG-HGUT-02643,MGYG-HGUT-00796
Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Sutterellaceae;Sutterella;Dakarella MGYG-HGUT-03441
Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Sutterellaceae;Sutterella;Duodenibacillus MGYG-HGUT-00564,MGYG-HGUT-03318,MGYG-HGUT-00624,MGYG-HGUT-01113,MGYG-HGUT-03017,MGYG-HGUT-03288,MGYG-HGUT-00525,MGYG-HGUT-03879,MGYG-HGUT-00717
Bacteria;Firmicutes;Clostridia;Oscillospirales;Butyricicoccaceae;Butyricicoccus;Agathobaculum MGYG-HGUT-02863,MGYG-HGUT-03909,MGYG-HGUT-02138,MGYG-HGUT-04135
Bacteria;Firmicutes;Clostridia;Oscillospirales;Oscillospiraceae MGYG-HGUT-02327,MGYG-HGUT-03686
Bacteria;Firmicutes;Clostridia;Lachnospirales;Lachnospiraceae;Fusicatenibacter;uncultured bacterium GA21 MGYG-HGUT-03676,MGYG-HGUT-02835,MGYG-HGUT-04233
Bacteria;Verrucomicrobiota;Verrucomicrobiae;Verrucomicrobiales;Akkermansiaceae MGYG-HGUT-02378
Bacteria;Firmicutes;Clostridia;Oscillospirales;Ruminococcaceae;Angelakisella;uncultured bacterium MGYG-HGUT-03311,MGYG-HGUT-04111,MGYG-HGUT-00689,MGYG-HGUT-02102
Bacteria;Actinobacteriota;Actinobacteria;Propionibacteriales;Propionibacteriaceae;Cutibacterium;uncultured bacterium MGYG-HGUT-03135
Bacteria;Firmicutes;Bacilli;Erysipelotrichales;Erysipelotrichaceae;Faecalicoccus;uncultured bacterium MGYG-HGUT-04288,MGYG-HGUT-04610
Bacteria;Firmicutes;Clostridia;Peptostreptococcales-Tissierellales;Anaerovoracaceae;[Eubacterium] MGYG-HGUT-02217,MGYG-HGUT-01953
Bacteria;Firmicutes;Clostridia;Peptostreptococcales-Tissierellales;Anaerovoracaceae;Mogibacterium;Mobilibacterium MGYG-HGUT-00601,MGYG-HGUT-02597
Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Muribaculaceae MGYG-HGUT-00055
Bacteria;Actinobacteriota;Coriobacteriia;Coriobacteriales;Atopobiaceae;Libanicoccus;uncultured bacterium MGYG-HGUT-03298,MGYG-HGUT-02914,MGYG-HGUT-02106,MGYG-HGUT-00620,MGYG-HGUT-01213,MGYG-HGUT-02801,MGYG-HGUT-00614,MGYG-HGUT-04257
Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Mediterranea MGYG-HGUT-04185,MGYG-HGUT-04188,MGYG-HGUT-04019
Bacteria;Firmicutes;Clostridia;Clostridiales;Caloramatoraceae;Clostridium MGYG-HGUT-01699
Bacteria;Firmicutes;Negativicutes;Veillonellales-Selenomonadales;Selenomonadaceae;uncultured;uncultured bacterium MGYG-HGUT-01944,MGYG-HGUT-03475,MGYG-HGUT-03625
Bacteria;Firmicutes;Bacilli;Erysipelotrichales;Erysipelatoclostridiaceae;Erysipelatoclostridium;Beduini MGYG-HGUT-00908
Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Rikenellaceae;Alistipes MGYG-HGUT-01562,MGYG-HGUT-01420
Bacteria;Firmicutes;Clostridia;Peptostreptococcales-Tissierellales;Peptostreptococcaceae;Clostridioides;uncultured bacterium MGYG-HGUT-01623
Bacteria;Firmicutes;Clostridia;Lachnospirales;Lachnospiraceae;Epulopiscium;Niameybacter MGYG-HGUT-00919,MGYG-HGUT-00989
Bacteria;Firmicutes;Clostridia;Peptostreptococcales-Tissierellales;Peptostreptococcaceae;Criibacterium;uncultured bacterium MGYG-HGUT-03753
Bacteria;Desulfobacterota;Desulfovibrionia;Desulfovibrionales;Desulfovibrionaceae;Mailhella;uncultured bacterium MGYG-HGUT-00569,MGYG-HGUT-00566,MGYG-HGUT-00551,MGYG-HGUT-03924,MGYG-HGUT-03588,MGYG-HGUT-03918
Bacteria;Actinobacteriota;Coriobacteriia;Coriobacteriales;Atopobiaceae MGYG-HGUT-01519,MGYG-HGUT-02450,MGYG-HGUT-02405
Bacteria;Firmicutes;Bacilli;Erysipelotrichales;Erysipelotrichaceae;Ileibacterium;uncultured bacterium MGYG-HGUT-03758,MGYG-HGUT-03748
Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Franconibacter;uncultured bacterium MGYG-HGUT-03115
Archaea;Thermoplasmatota;Thermoplasmata;Methanomassiliicoccales;Methanomassiliicoccaceae;Methanomassiliicoccus MGYG-HGUT-02160
Bacteria;Firmicutes;Clostridia;Lachnospirales;Lachnospiraceae;Lachnoclostridium;Mordavella MGYG-HGUT-04263,MGYG-HGUT-02202
Bacteria;Firmicutes;Bacilli;Erysipelotrichales;Erysipelotrichaceae;Holdemania MGYG-HGUT-01430
Bacteria;Firmicutes;Clostridia;Lachnospirales;Lachnospiraceae;[Ruminococcus] MGYG-HGUT-02945
Bacteria;Firmicutes;Clostridia;Oscillospirales;Ruminococcaceae;Pygmaiobacter;uncultured bacterium MGYG-HGUT-01106,MGYG-HGUT-00319
Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus MGYG-HGUT-01428
Bacteria;Firmicutes;Bacilli;Erysipelotrichales;Erysipelotrichaceae;Dielma;Traorella MGYG-HGUT-04077
Bacteria;Firmicutes;Clostridia;Peptostreptococcales-Tissierellales;Family XI;Peptoniphilus MGYG-HGUT-01425,MGYG-HGUT-01414
Bacteria;Firmicutes;Negativicutes;Veillonellales-Selenomonadales;Veillonellaceae;Negativicoccus;uncultured bacterium MGYG-HGUT-00757,MGYG-HGUT-02903,MGYG-HGUT-02820
Bacteria;Firmicutes;Bacilli;Erysipelotrichales;Erysipelotrichaceae;Holdemanella;uncultured bacterium MGYG-HGUT-04375,MGYG-HGUT-00655
Bacteria;Firmicutes;Clostridia;Lachnospirales;Lachnospiraceae;Anaerocolumna;uncultured bacterium MGYG-HGUT-02836
Bacteria;Firmicutes;Clostridia;Oscillospirales;Ruminococcaceae;Phocea;uncultured bacterium MGYG-HGUT-01611
Bacteria;Firmicutes;Clostridia;Peptostreptococcales-Tissierellales;Family XI;Murdochiella;uncultured bacterium MGYG-HGUT-03750
Bacteria;Campylobacterota;Campylobacteria;Campylobacterales;Helicobacteraceae MGYG-HGUT-02443
Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Yersiniaceae MGYG-HGUT-02474
Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus MGYG-HGUT-02313
Bacteria;Actinobacteriota;Actinobacteria;Propionibacteriales;Nocardioidaceae;Aeromicrobium MGYG-HGUT-01418
Bacteria;Firmicutes;Bacilli;Erysipelotrichales;Erysipelatoclostridiaceae;Candidatus MGYG-HGUT-01452
Bacteria;Firmicutes;Clostridia;Lachnospirales;Lachnospiraceae;Lachnoclostridium;Merdimonas MGYG-HGUT-00987
Bacteria;Actinobacteriota;Actinobacteria;Frankiales;Geodermatophilaceae;Blastococcus MGYG-HGUT-01468
Bacteria;Firmicutes;Bacilli;Paenibacillales;Paenibacillaceae;Paenibacillus MGYG-HGUT-01404
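Each row above pairs a SILVA lineage with a comma-separated list of MGnify genome IDs. The script that writes these rows puts a tab between the two fields. A minimal sketch of reading one row back follows. `parse_row` is a hypothetical helper, and it assumes the genome IDs never contain whitespace (true of the rows shown here), so it splits once from the right and works whether the separator is a tab or a space.

```python
def parse_row(row):
    """Split one output row into (lineage, [genome IDs]).

    The lineage may contain spaces ("uncultured bacterium"), so split only
    once, from the right, on any whitespace.
    """
    lineage, ids = row.rstrip("\n").rsplit(None, 1)
    return lineage, ids.split(",")


row = (
    "Bacteria;Firmicutes;Clostridia;Lachnospirales;Lachnospiraceae;"
    "Fusicatenibacter;uncultured bacterium GA21"
    "\tMGYG-HGUT-03676,MGYG-HGUT-02835,MGYG-HGUT-04233"
)
lineage, mags = parse_row(row)
print(lineage.split(";")[-1])
print(mags)
```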
#!/usr/bin/python3
# From SILVA fasta headers obtain (cleaned, no dot) SILVA-id's and lineage SILVA-LINEAGE
# sed 's/\./ /' SILVA_138.1_HEADERS| cut -d" " -f1,3- > SILVA-ID_LINEAGE
# From SILVA metadata only obtain SILVA-ID and NCBI-taxonID
# awk '{print $1,$NF}' SILVA_138.1_SSURef.full_metadata > SILVA-ID_NCBI-TAX
# FINAL OUTPUT:
# LINEAGE SILVA <tab> MGnify_ID
# SILVA-LINEAGE
# SILVA-ID_NCBI-TAX
# MagID_TaxID
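# Manually curated lineages: key is the first token of each line.
# Note: the value is only the second whitespace-delimited token of the line.
manual_lineages = {}
for line in open("manual_lineages"):
    manual_lineages[line.split()[0]] = line.strip().split()[1]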
silva_lineages = {}
silva_genus = {}
silva_family = {}
with open("SILVA-ID_LINEAGE_SSUParc") as infile:
    for line in infile:
        id = line.split()[0]
        lineage = " ".join(line.strip().split()[1:])
        silva_lineages[id] = lineage
        genus = lineage.split(";")[-2]
        silva_genus[genus] = ";".join(lineage.split(";")[0:-1])
        if "Eukaryota" not in lineage or "Archaea" in lineage:
            if len(lineage.split(";")) > 6:
                family = lineage.split(";")[-3]
                silva_family[family] = ";".join(lineage.split(";")[0:-2])
tax_silva = {}
for line in open("SILVA-ID_NCBI-TAX_SSUParc.full", "r").readlines():
#for line in open("SILVA-ID_NCBI-TAX", "r").readlines():
    sline = line.strip().split()
    taxon = sline[1]
    acc = sline[0]
    if taxon not in tax_silva.keys():
        tax_silva[taxon] = [acc]
    else:
        tax_silva[taxon].append(acc)
c = 0
taxon_hits_conv = {}
nontaxon_hits_conv = {}
all_mags = []
hit = 0
taxhit = 0
notaxhit = 0
for line in open("magID_taxID.tsv", "r").readlines():
    taxon = line.strip().split()[0]
    magID = line.split()[1]
    name = line.strip().split("\t")[-1]
    all_mags.append(magID)
    # direct with Silva database
    if taxon in tax_silva.keys():
        taxhit += 1
        for acc in tax_silva[taxon]:
            if acc in silva_lineages.keys():
                hit += 1
                lineage = silva_lineages[acc]
                if lineage not in taxon_hits_conv.keys():
                    taxon_hits_conv[lineage] = [magID]
                else:
                    taxon_hits_conv[lineage].append(magID)
                break
    # Non direct hit. Using organism name given with the UHGG genome to search in the Silva Lineages
    else:
        uncultured = False
        if "uncultured" in name:
            uname = name.replace("uncultured ", "")
            uncultured = True
        elif "Candidatus" in name:
            uname = name.replace("Candidatus ", "")
        else:
            uname = name
        sname = uname.split()[0]
        lineage = ""
        if sname in silva_genus.keys():
            lineage = silva_genus[sname]
        elif sname in silva_family.keys():
            lineage = silva_family[sname]
        if lineage:
            notaxhit += 1
            if uncultured:
                for lin in silva_lineages.values():
                    if sname+";uncultured bacterium" in lin or sname+";uncultured;uncultured bacterium" in lin:
                        lineage = lin
                        break
        # lookup in the manual curated lineages for the genomes
        elif sname in manual_lineages.keys():
            lineage = manual_lineages[sname]
            notaxhit += 1
        else:
            c += 1
            print(taxon, magID, name, sname)
        if lineage:
            if lineage not in nontaxon_hits_conv.keys():
                nontaxon_hits_conv[lineage] = [magID]
            else:
                nontaxon_hits_conv[lineage].append(magID)
print(len(all_mags), taxhit, notaxhit, c)
taxon_hits_conv_f = open("Silva138.1-Lineage_UHGG-mags_taxonhits.tsv","w")
for lin in taxon_hits_conv:
    taxon_hits_conv_f.write(lin+"\t"+",".join(taxon_hits_conv[lin])+"\n")
taxon_hits_conv_f.close()
nontaxon_hits_conv_f = open("Silva138.1-Lineage_UHGG-mags_non-taxonhits.tsv","w")
for lin in nontaxon_hits_conv:
    nontaxon_hits_conv_f.write(lin+"\t"+",".join(nontaxon_hits_conv[lin])+"\n")
nontaxon_hits_conv_f.close()
\ No newline at end of file
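For genomes without a direct taxon ID match, the script above falls back to a name-based lookup: it strips an "uncultured " or "Candidatus " prefix, takes the first word of the organism name, and tries the genus table, then the family table, then the manually curated table. A minimal sketch of that lookup order follows. The function names and the organism names in the example are hypothetical, the tables are tiny stand-ins for the real SILVA-derived dictionaries, and the extra scan for "uncultured bacterium" lineages is left out.

```python
def first_name_token(name):
    """Drop an 'uncultured ' or 'Candidatus ' prefix and return the first word."""
    if "uncultured" in name:
        name = name.replace("uncultured ", "")
    elif "Candidatus" in name:
        name = name.replace("Candidatus ", "")
    return name.split()[0]


def resolve_lineage(name, silva_genus, silva_family, manual_lineages):
    """Return the first lineage found for the name, or "" when nothing matches."""
    sname = first_name_token(name)
    # Same precedence as the script: genus, then family, then manual table.
    for table in (silva_genus, silva_family, manual_lineages):
        if sname in table:
            return table[sname]
    return ""


# Toy tables built from lineages that appear in this commit.
silva_genus = {"Alistipes": "Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Rikenellaceae;Alistipes"}
silva_family = {"Muribaculaceae": "Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Muribaculaceae"}
manual = {"Dakarella": "Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Sutterellaceae;Sutterella;Dakarella"}

# Illustrative organism names only; they are not taken from the real input file.
for organism in ("uncultured Alistipes sp.", "Muribaculaceae bacterium",
                 "Candidatus Dakarella sp.", "Unknownus sp."):
    print(organism, "->", resolve_lineage(organism, silva_genus, silva_family, manual) or "(no hit)")
```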
Mediterranea Bacteria;Bacteroidota;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;Mediterranea
Beduini Bacteria;Firmicutes;Bacilli;Erysipelotrichales;Erysipelatoclostridiaceae;Erysipelatoclostridium;Beduini
Traorella Bacteria;Firmicutes;Bacilli;Erysipelotrichales;Erysipelotrichaceae;Dielma;Traorella
Niameybacter Bacteria;Firmicutes;Clostridia;Lachnospirales;Lachnospiraceae;Epulopiscium;Niameybacter
Merdimonas Bacteria;Firmicutes;Clostridia;Lachnospirales;Lachnospiraceae;Lachnoclostridium;Merdimonas
Mordavella Bacteria;Firmicutes;Clostridia;Lachnospirales;Lachnospiraceae;Lachnoclostridium;Mordavella
Agathobaculum Bacteria;Firmicutes;Clostridia;Oscillospirales;Butyricicoccaceae;Butyricicoccus;Agathobaculum
Mobilibacterium Bacteria;Firmicutes;Clostridia;Peptostreptococcales-Tissierellales;Anaerovoracaceae;Mogibacterium;Mobilibacterium
Turicimonas Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Sutterellaceae;Parasutterella;Turicimonas
Dakarella Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Sutterellaceae;Sutterella;Dakarella
Duodenibacillus Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Sutterellaceae;Sutterella;Duodenibacillus
Urmitella Bacteria;Firmicutes;Clostridia;Peptostreptococcales-Tissierellales;Family XI;Tissierella;Urmitella
Emergencia Bacteria;Firmicutes;Clostridia;Peptostreptococcales-Tissierellales;Anaerovoracaceae;[Eubacterium] nodatum group;Emergencia
Bariatricus Bacteria;Firmicutes;Clostridia;Lachnospirales;Lachnospiraceae;[Ruminococcus] torques group;Bariatricus
Stoquefichus Bacteria;Firmicutes;Bacilli;Erysipelotrichales;Erysipelatoclostridiaceae;Candidatus Stoquefichus;Candidatus Stoquefichus
\ No newline at end of file
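Several lineages in this table contain spaces ("Family XI", "[Eubacterium] nodatum group", "Candidatus Stoquefichus"). The conversion script reads the file with `line.strip().split()[1]`, which keeps only the first whitespace-delimited token after the key, so those values stop at the first space. That matches output rows in this commit that end in `;Family`, `;[Eubacterium]`, `;[Ruminococcus]` and `;Candidatus`. The sketch below shows the difference on one entry. Using `split(maxsplit=1)` is one possible alternative and an assumption about the intended behaviour, not the committed code.

```python
line = ("Emergencia Bacteria;Firmicutes;Clostridia;"
        "Peptostreptococcales-Tissierellales;Anaerovoracaceae;"
        "[Eubacterium] nodatum group;Emergencia\n")

# As in the script: second whitespace-delimited token only.
truncated = line.strip().split()[1]

# Alternative: split once, so everything after the key is kept.
key, full = line.strip().split(maxsplit=1)

print(truncated)
print(full)
```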
#!/usr/bin/python3
from os import listdir
from lxml import etree
wfile = open("magID_taxID.tsv","w")
for xml in listdir("xml"):
    xmlfile = etree.parse("xml/"+xml)
    root= xmlfile.getroot()
    for t in root.iter('TAXON_ID'): taxonID = t.text
    for id in root.iter('SUBMITTER_ID'): magID = id.text
    for id in root.iter('SCIENTIFIC_NAME'): name = id.text
    wfile.write(taxonID+"\t"+magID+"\t"+name+"\n")
wfile.close()
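The script above pulls three fields out of each XML file by iterating over matching elements. A minimal sketch of the same extraction follows, using the standard-library `xml.etree.ElementTree` in place of `lxml` so that it runs without extra dependencies. `extract_fields` is a hypothetical helper, and the sample document (its nesting, taxon ID and organism name) is invented for illustration; only the three tag names come from the script. Unlike the script, which keeps the last matching element, this sketch takes the first one and raises an error when a tag is missing.

```python
import xml.etree.ElementTree as ET


def extract_fields(xml_text):
    """Return (taxon ID, genome ID, organism name) from one XML document."""
    root = ET.fromstring(xml_text)
    values = []
    for tag in ("TAXON_ID", "SUBMITTER_ID", "SCIENTIFIC_NAME"):
        # Take the first matching element anywhere in the tree.
        node = next(root.iter(tag), None)
        if node is None or node.text is None:
            raise ValueError(f"missing <{tag}>")
        values.append(node.text.strip())
    return tuple(values)


# Invented sample; the real files may be structured differently.
sample = """<SAMPLE_SET><SAMPLE>
  <IDENTIFIERS><SUBMITTER_ID>MGYG-HGUT-01608</SUBMITTER_ID></IDENTIFIERS>
  <SAMPLE_NAME>
    <TAXON_ID>12345</TAXON_ID>
    <SCIENTIFIC_NAME>uncultured Sanguibacteroides sp.</SCIENTIFIC_NAME>
  </SAMPLE_NAME>
</SAMPLE></SAMPLE_SET>"""

print("\t".join(extract_fields(sample)))
```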