- 28 Apr, 2017 1 commit
-
-
Jorge Navarro Muñoz authored
is needed, only use a translation table (transl_table) from the GenBank if it really is available
-
- 21 Apr, 2017 1 commit
-
-
Jorge Navarro Muñoz authored
-
- 16 Mar, 2017 3 commits
-
-
Jorge Navarro Muñoz authored
-
Jorge Navarro Muñoz authored
-
Jorge Navarro Muñoz authored
This new parameter will try to add PKS/NRP Hybrids to the NRPS, PKSI or PKSother classes as well (if they are not on the banned_classes list)
-
- 15 Mar, 2017 1 commit
-
-
Jorge Navarro Muñoz authored
In this mode, BiG-SCAPE assumes that the shorter BGC of a pair is probably fragmented. It slices the larger BGC so that the slice shares the largest number of different domains with the shorter one, then uses only that slice of the larger BGC to calculate distance. Also introduces the --banned_classes parameter to control which classes are analyzed
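A minimal sketch of the slicing idea described above, not the actual BiG-SCAPE code. The function name `best_slice` and the choice of a window as wide as the shorter BGC's domain list are assumptions made for illustration.

```python
def best_slice(larger, shorter):
    """Return (start, end) of the window in `larger` that shares the most
    distinct domains with `shorter`.

    Assumption: the window is as wide as the shorter domain list.
    """
    target = set(shorter)
    width = min(len(shorter), len(larger))
    best_start, best_shared = 0, -1
    for start in range(len(larger) - width + 1):
        # Count distinct domains present in both the window and the shorter BGC
        shared = len(target & set(larger[start:start + width]))
        if shared > best_shared:  # ties keep the leftmost window
            best_start, best_shared = start, shared
    return best_start, best_start + width


larger_bgc = ["A", "B", "C", "D", "E", "F"]
shorter_bgc = ["D", "E", "X"]
start, end = best_slice(larger_bgc, shorter_bgc)
sliced = larger_bgc[start:end]  # only this slice would enter the distance calculation
```

Keeping the leftmost window on ties is an arbitrary choice; any deterministic tie-break would work for the sketch.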
-
- 10 Mar, 2017 1 commit
-
-
Jorge Navarro Muñoz authored
-
- 09 Mar, 2017 1 commit
-
-
Jorge Navarro Muñoz authored
-
- 03 Mar, 2017 1 commit
-
-
Jorge Navarro Muñoz authored
-
- 02 Mar, 2017 1 commit
-
-
Jorge Navarro Muñoz authored
alignment phase is resumable. Otherwise, sequences are re-appended and there are problems with hmmalign further on.
  - Introduce Terpene Class. It uses the default weights for --mix case for now
  - Extend anchor domain list
-
- 01 Mar, 2017 1 commit
-
-
Jorge Navarro Muñoz authored
another set of files (.stk). Able to work with non-standard amino acid characters (e.g. 'U' for Selenocysteine does not appear in final alignment). For now, same strategy as hmmscan (many 1-thread processes). Activate with '--use_hmmalign'
  - Improved case where translation of CDS is not available in the GenBank file:
    - Translate only until stop codon regardless of sequence's location (see NZ_JMEU01000004-181881-306881_ORF39)
    - Use fuzzy start/end positions to know if trimming should be done at the start or at the end of the sequence (previously, used gene_start == 0)
    - It fixes a trimming issue that was silently (not visible because this part is parallelized) reported by biopython (BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future. BiopythonWarning)
    - Note: In the cases where manual translation is needed, alternative start codons could be told to be translated as Methionine, but I've chosen not to. I'm assuming that when the translation is not present, then we are dealing with a random sequence.
  - gid and pid fields in sequences' identifiers are no longer enclosed in brackets (i.e. they are properly extracted strings and not stringified lists). THIS (along with the change in domain sequences identifiers) LIKELY BREAKS BACKWARDS COMPATIBILITY. Adding new BGCs to a previous analysis and running again is NOT encouraged.
  - Also deleted code for the old, unmaintained alternate distance calculation (domaindist) as well as GK code.
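The trimming rule for untranslated CDS features can be sketched as follows. This is an illustration under assumptions, not the committed code: the helper names `trim_to_codons` and `cut_at_stop` are hypothetical, the fuzzy flags are assumed to be plain booleans already extracted from the feature location, and `*` is assumed to mark a stop codon in the translated string.

```python
def trim_to_codons(nt_seq, fuzzy_start, fuzzy_end):
    """Drop leftover bases so that len(nt_seq) is a multiple of three.

    Assumption: a fuzzy (incomplete) start means the frame is anchored at
    the 3' end, so the extra bases are removed from the beginning.
    """
    remainder = len(nt_seq) % 3
    if remainder == 0:
        return nt_seq
    if fuzzy_start and not fuzzy_end:
        return nt_seq[remainder:]
    # Otherwise treat the 3' end as incomplete and drop trailing bases
    return nt_seq[:len(nt_seq) - remainder]


def cut_at_stop(protein):
    """Keep the translation only up to (not including) the first stop."""
    return protein.split("*", 1)[0]
```

Trimming before translation is what avoids the partial codon warning quoted above, since the translated sequence length is then always a multiple of three.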
-
- 23 Feb, 2017 3 commits
-
-
Jorge Navarro Muñoz authored
-
Jorge Navarro Muñoz authored
See HMMER manual 3.1b2, page 14
-
Jorge Navarro Muñoz authored
  - Don't crash. As the distance calculation function is within a parallelized part of the code, sys.exit'ing hangs the whole script.
  - Make distance 0.0 in any case
  - If negative distance is "significant" (<0.000001), report it
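A small sketch of this behaviour, assuming that "significant" means more negative than a tolerance of 0.000001. The function name `clamp_distance` and its parameters are hypothetical.

```python
import sys


def clamp_distance(distance, name_a, name_b, tolerance=1e-6):
    """Return a non-negative distance without ever exiting the process."""
    if distance < 0.0:
        if distance < -tolerance:
            # Report instead of calling sys.exit(), which would hang the
            # parallelized part of the script
            print(
                f"Warning: negative distance {distance} between {name_a} and {name_b}",
                file=sys.stderr,
            )
        return 0.0
    return distance
```

Tiny negative values are treated here as floating-point noise and are silently set to 0.0; only larger ones are reported.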
-
- 22 Feb, 2017 2 commits
-
-
Jorge Navarro Muñoz authored
  - Assume that all records have the same Definition (and use the first one)
  - For the Group: make a set of Products from all records.
    * If there is only one kind of Product, use it
    * If there are two Products and one is "other", set Group to the remaining one
    * Other cases: discard "other" and join remaining ones using dashes (it will probably be classified as a hybrid)
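The Group rules above could look roughly like this. It is a sketch: the function name `group_from_products` is hypothetical, and sorting the joined products is an assumption added only to make the result deterministic.

```python
def group_from_products(products):
    """Derive a single Group label from the Products of all records."""
    kinds = set(products)
    if len(kinds) == 1:
        # Only one kind of Product: use it as is (even if it is "other")
        return next(iter(kinds))
    # Several kinds: "other" carries no information, so drop it and join
    # whatever remains with dashes
    kinds.discard("other")
    return "-".join(sorted(kinds))
```

The case of two Products where one is "other" falls out of the general rule, because joining a single remaining product yields that product unchanged.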
-
Jorge Navarro Muñoz authored
Further fixes to --samples mode. Seems to be working fine now.
Minor improvement in bgc class (added product type 'hglks')
-
- 20 Feb, 2017 1 commit
-
-
Jorge Navarro Muñoz authored
to run multiple alignment again!)
-
- 17 Feb, 2017 1 commit
-
-
Jorge Navarro Muñoz authored
...but more tests are needed (starting with a --samples case)
-
- 15 Feb, 2017 1 commit
-
-
Jorge Navarro Muñoz authored
of the input files (PKS_I, PKS Others, NRPS, RiPPs, Saccharides, PKS/NRPS Hybrids and Others). There are MAJOR changes in the code and it's still not usable
-
- 13 Feb, 2017 2 commits
-
-
Jorge Navarro Muñoz authored
Optimization: don't align domain sequences when they are only copies within a single BGC
-
Jorge Navarro Muñoz authored
-
- 10 Feb, 2017 3 commits
-
-
Jorge Navarro Muñoz authored
We found that the GK index did not correlate with chemical distance (whereas Jaccard and DDS do). We propose a new index: the Adjacency Index. This new index measures the Tanimoto coefficient of pairs of adjacent domains without taking the order into account. Also, more code cleanup.
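One way to read the Adjacency Index definition is sketched below. For sets, the Tanimoto coefficient is the size of the intersection divided by the size of the union. The function name `adjacency_index` and the value returned when neither BGC has an adjacent pair (0.0) are assumptions.

```python
def adjacency_index(domains_a, domains_b):
    """Tanimoto coefficient over unordered pairs of adjacent domains."""
    # frozenset makes ("KS", "AT") and ("AT", "KS") the same pair
    pairs_a = {frozenset(pair) for pair in zip(domains_a, domains_a[1:])}
    pairs_b = {frozenset(pair) for pair in zip(domains_b, domains_b[1:])}
    union = pairs_a | pairs_b
    if not union:
        return 0.0  # assumption: no adjacent pairs means no shared adjacency
    return len(pairs_a & pairs_b) / len(union)


# Shares only the unordered pair {KS, AT}: 1 shared pair out of 3 distinct pairs
score = adjacency_index(["KS", "AT", "ACP"], ["AT", "KS", "TE"])
```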
-
Jorge Navarro Muñoz authored
-
Jorge Navarro Muñoz authored
-
- 09 Feb, 2017 1 commit
-
-
Jorge Navarro Muñoz authored
aligned sequences of different length (yes, different variables than before)
-
- 08 Feb, 2017 3 commits
-
-
Jorge Navarro Muñoz authored
-
Jorge Navarro Muñoz authored
Experimental no dms
See merge request !2
-
Jorge Navarro Muñoz authored
-
- 07 Feb, 2017 1 commit
-
-
Jorge Navarro Muñoz authored
printed used a variable that wasn't declared. In theory, this situation should not happen, but it was reported by a user (Ghofran O.)
Deleted function: genbank_grp_dict() was not used anymore as the (AntiSMASH) group annotation as well as the definition are being read in get_gbk_files().
Warning message: be more explicit if BiG-SCAPE cannot find Pfam files.
-
- 13 Jan, 2017 2 commits
-
-
Jorge Navarro Muñoz authored
Both when a list of ordered domains is not found for a particular BGC and when aligned domain sequences are not found. In the latter case, though, the messages will be printed several times (one for each distance calculation that needs it).
Both messages are controlled by the --verbose parameter
-
Jorge Navarro Muñoz authored
* Currently using the old Hungarian algorithm approach (munkres.py), though Emzo was using something a bit different (matrix is a numpy array, scipy's linear_sum_assignment as the implementation of the Hungarian algorithm). We can later analyze if either approach is better. (this should've been reported in the previous commit)
* The ordered list of domains for each BGC is loaded into memory in DomainList, instead of reading both pfs files at the moment of distance calculation (but falls back to that if it can't find the lists in DomainList)
* Shaved a couple of seconds from the aligned sequence similarity calculation (tip from one of Marnix's [links](http://stackoverflow.com/questions/16266622/in-python-calculate-percent-identity-between-two-strings))

Original:
```python
matches = 0
length = 0
for position in range(seq_length):
    if aligned_seqA[position] == aligned_seqB[position]:
        # Count identical residues; columns where both are gaps are skipped
        if aligned_seqA[position] != "-":
            matches += 1
            length += 1
    else:
        length += 1
# Fraction of non-matching positions (1 - identity)
similarity = 1 - ( float(matches)/float(length) )
```

New:
```python
matches = 0
gaps = 0
for position in range(seq_length):
    if aligned_seqA[position] == aligned_seqB[position]:
        if aligned_seqA[position] != "-":
            matches += 1
        else:
            # Both sequences have a gap in this column
            gaps += 1
# Same value as before: shared-gap columns are excluded from the length
similarity = 1 - ( float(matches)/float(seq_length-gaps) )
```

* There is a difference in the size of the network files between the original version and the new experimental ones. This seems to be due to the way the genbank files were parsed. Originally, only the line with DEFINITION would be used, but some files (e.g. BGC0001277) have a multiple-line definition value. This is corrected now as we are using BioPython to read the DEFINITION section.
-
- 11 Jan, 2017 1 commit
-
-
Jorge Navarro Muñoz authored
The DMS structure that held the precalculated sequence similarity between all pairs of aligned domain sequences (for each domain) grew quadratically with the number of input files. On top of this, the structure seemed to be copied for parallelized calculation of pairwise distances.

Emzo proposed (in the direct_align branch) to stop using MAFFT for domain sequence alignment and to stop storing the sequence similarity in the DMS dictionary, and instead calculate both things on the fly, at the moment of doing the distance calculation.

In this branch, I'm keeping the multiple alignment part with MAFFT but the sequence similarity is calculated on the fly. This increase in computing time each time the script needs to be re-run is a tradeoff for getting rid of DMS (both in RAM and on disk, for it was also kept as a file).

In summary:
* Eliminated DMS usage. Using --skip_mafft avoids calling MAFFT, but otherwise only the aligned domain sequences (.algn) in the domains folder are read and kept in memory.
* Eliminated the --use_mafft_distout parameter (only internal sequence similarity is used. We could also avoid generating the .hat2 files in the future)
* Dropped the ">" character from the keys of the dictionary returned by fasta_parser()
* If running with the --skip_mafft parameter, BiG-SCAPE will not re-generate the domain fasta files (take into account that if the user is adding new files to her input directory, she should not use this parameter, or we should take care to track which domains are affected and process the domain fasta files + mafft-align only for those domains)
* Moved the extraction of the gbk_group information (BGC definition + antiSMASH group annotation) to the first time that the GenBank files are opened (when collecting the files and doing basic filtering). Also in that routine (get_gbk_files), each file is only opened once.
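To illustrate the change to the dictionary keys, here is a minimal FASTA reader whose keys omit the leading ">". It is only a sketch: the real fasta_parser() signature is not shown in this log, so the argument (an open text handle) and the return shape (header to sequence string) are assumptions.

```python
import io


def fasta_parser(handle):
    """Read FASTA records into a dict keyed by header without the '>'."""
    sequences = {}
    header = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]  # drop the ">" so keys are plain identifiers
            sequences[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped; concatenate them
            sequences[header] += line
    return sequences


aligned = fasta_parser(io.StringIO(">seq1\nAC-\nGT\n>seq2\nTT\n"))
```

With plain identifiers as keys, the aligned sequences kept in memory can be looked up directly when a distance calculation needs them.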
-
- 22 Dec, 2016 1 commit
-
-
Jorge Navarro Muñoz authored
This commit fixes some bugs in file manipulation in the first stage of BiG-SCAPE when re-using the output folder. For example, if using the `--force-hmmscan` parameter, hmmscan would be used on ALL fasta files found in the output folder, instead of only those corresponding to the input files. The bug also caused other bad behaviour, like trying to discard files without predicted domains that weren't in the input file list (but that were marked because they had not been processed, i.e. they don't have pfd counterparts)
-
- 18 Nov, 2016 1 commit
-
-
Jorge Navarro Muñoz authored
In the previous commit, a new method for combining DDS sub-components was introduced:

    DDS = (1-anchorweight)*non_anchor_prct*rDDSna + (1+anchorweight)*anchor_prct*rDDSa

with

    anchor_prct = S_anchor / (S + S_anchor)
    non_anchor_prct = S / (S + S_anchor)

However, the final weight was not really normalized (in some instances making the DDS_anchor component smaller than it should be, in other instances making it too large, even making the final DDS score less than 0!)

As a solution, the new weighting system effectively 'boosts' the perceived number of anchor domains:

    non_anchor_weight = non_anchor_prct / (anchor_prct*anchorweight + non_anchor_prct)
    anchor_weight = anchor_prct*anchorweight / (anchor_prct*anchorweight + non_anchor_prct)
    DDS = (non_anchor_weight*DDS_non_anchor) + (anchor_weight*DDS_anchor)
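The normalized weighting can be written as a short function. This is a sketch of the formulas in this message rather than the code in the repository; the function name `combine_dds` and its argument order are hypothetical, and it assumes at least one domain was analyzed (S + S_anchor > 0).

```python
def combine_dds(dds_non_anchor, dds_anchor, s, s_anchor, anchorweight):
    """Combine both DDS components with weights that always sum to 1."""
    total = s + s_anchor
    non_anchor_prct = s / total
    anchor_prct = s_anchor / total
    # Anchor domains are counted as if there were `anchorweight` times more
    denominator = anchor_prct * anchorweight + non_anchor_prct
    non_anchor_weight = non_anchor_prct / denominator
    anchor_weight = anchor_prct * anchorweight / denominator
    return non_anchor_weight * dds_non_anchor + anchor_weight * dds_anchor


# 3 non-anchor domains, 1 anchor domain, boost of 2: weights are 0.6 and 0.4
dds = combine_dds(0.5, 1.0, s=3, s_anchor=1, anchorweight=2.0)
```

Because the two weights share a denominator, they sum to 1, so the combined value stays between the two components for any positive anchorweight.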
-
- 04 Nov, 2016 1 commit
-
-
Jorge Navarro Muñoz authored
Originally, DDS sub-components (one for each type of domain: domains from the list in the anchorfile and the non-anchor ones) were combined in the final DDS score with a fixed weight (note that this DDS is measuring *difference* rather than similarity between domains from both BGCs. It is later converted to similarity using the complement DDS = 1 - DDS):

    DDS = anchorweight*rDDSa + (1-anchorweight)*rDDSna

with rDDSa as the raw DDS component for anchor domains and rDDSna as the one for non-anchor domains.

In this new version of BiG-SCAPE, this has changed to give each component a weight that depends on the actual percentage of anchor and non-anchor domains:

    anchor_prct = S_anchor / (S + S_anchor)
    non_anchor_prct = S / (S + S_anchor)

where S_anchor is the number of different anchor domains analyzed between both BGCs (and S the corresponding number of non-anchor domains).

The anchorweight parameter changed its name to 'anchorboost' (though internally it's still called 'anchorweight'); it tries to increase the perceived amount of anchor domains (so that rDDSa has an increased weight). The DDS score is then:

    DDS = (1-anchorweight)*non_anchor_prct*rDDSna + (1+anchorweight)*anchor_prct*rDDSa

Also in this commit, new columns are added to the output network file: each DDS subcomponent as well as the number of anchor and non-anchor domains analyzed
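A direct transcription of the percentage-based formula from this commit, as a sketch. The function name `combine_dds_by_percentage` and its arguments are hypothetical, and S + S_anchor is assumed to be greater than zero.

```python
def combine_dds_by_percentage(r_dds_na, r_dds_a, s, s_anchor, anchorweight):
    """Weight each raw DDS component by its share of analyzed domains."""
    total = s + s_anchor
    anchor_prct = s_anchor / total
    non_anchor_prct = s / total
    return ((1 - anchorweight) * non_anchor_prct * r_dds_na
            + (1 + anchorweight) * anchor_prct * r_dds_a)


# 3 non-anchor domains and 1 anchor domain, with a small boost
value = combine_dds_by_percentage(0.5, 1.0, s=3, s_anchor=1, anchorweight=0.1)
```

Note that the two coefficients in this formula do not sum to 1 in general, so the result is not guaranteed to stay within [0, 1] for every choice of anchorweight.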
-
- 02 Nov, 2016 5 commits
-
-
Jorge Navarro Muñoz authored
-
Jorge Navarro Muñoz authored
-
Jorge Navarro Muñoz authored
Also, minor fixes to a couple of messages for the user
-
Jorge Navarro Muñoz authored
-
Jorge Navarro Muñoz authored
Mp hmm

This branch, contributed by Emzo, introduces a new workflow for the first phase of BiG-SCAPE; namely, the processing of input files and domain prediction using `hmmscan`. The two most notable characteristics of this branch are:

* A better distribution of parallelized work when predicting domains using `hmmscan`. In the original workflow, parallelization was left to `hmmscan` using the `--cpu` parameter but this approach's performance did not scale with the number of CPUs (given by the input parameter `--cores` in BiG-SCAPE). The new workflow changes this to as many one-thread `hmmscan` jobs as the number of `--cores` available, making it more efficient with the available resources.
* BiG-SCAPE now runs domain prediction and other input-parsing tasks based only on the non-processed files in the output directory. This means that if BiG-SCAPE terminates early for some reason, it is able to resume work (especially with domain prediction, the most intensive task in this phase).

See merge request !1
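The two ideas in this branch (many single-threaded jobs, and skipping files that were already processed) can be sketched as below. This is not the branch's code: all function names, the `.domtable` output suffix and the directory layout are hypothetical, and `--cpu 0` is assumed to be the way to ask `hmmscan` for a serial run.

```python
import os
import subprocess
from multiprocessing import Pool


def pending_fastas(fasta_paths, output_dir):
    """Keep only the FASTA files that have no result in output_dir yet."""
    pending = []
    for path in fasta_paths:
        base = os.path.splitext(os.path.basename(path))[0]
        if not os.path.isfile(os.path.join(output_dir, base + ".domtable")):
            pending.append(path)
    return pending


def hmmscan_command(fasta_path, output_dir, hmm_db):
    """Build the command line for one single-threaded hmmscan job."""
    base = os.path.splitext(os.path.basename(fasta_path))[0]
    table = os.path.join(output_dir, base + ".domtable")
    return ["hmmscan", "--cpu", "0", "--domtblout", table, hmm_db, fasta_path]


def run_hmmscan(job):
    """Run one job; `job` is a (fasta_path, output_dir, hmm_db) tuple."""
    subprocess.check_call(hmmscan_command(*job), stdout=subprocess.DEVNULL)


def predict_domains(fasta_paths, output_dir, hmm_db, cores):
    """Run as many one-thread hmmscan jobs at a time as there are cores."""
    jobs = [(path, output_dir, hmm_db)
            for path in pending_fastas(fasta_paths, output_dir)]
    with Pool(cores) as pool:
        pool.map(run_hmmscan, jobs)
```

Since the pending list is recomputed from the output directory on every start, an interrupted run picks up only the files that still lack a result.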
-