- 08 Feb, 2017 2 commits
-
-
Jorge Navarro Muñoz authored
Use is still able to tweak the mafft_threads parameter to provide a different number of threads.
-
Jorge Navarro Muñoz authored
-
- 07 Feb, 2017 1 commit
-
-
Jorge Navarro Muñoz authored
printed used a variable that wasn't declared. In theory, this situation should not happen, but it was reported by a user (Ghofran O.) Deleted function: genbank_grp_dict() was not used anymore as the (AntiSMASH) group annotation as well as the definition are being read in get_gbk_files(). Warning message: be more explicit if BiG-SCAPE cannot find Pfam files.
-
- 13 Jan, 2017 2 commits
-
-
Jorge Navarro Muñoz authored
Both when a list of ordered domains is not found for a particular BGC and also for when aligned domain sequences are not found. In the latter case, though, the messages will be printed several times (one for each distance calculation that needs it) Both messages are controlled by the --verbose parameter
-
Jorge Navarro Muñoz authored
* Currently using the old Hungarian algorithm approach (munkres.py), though Emzo was using something a bit different (matrix is a numpy array, scipy's linear_sum_assignment as the implementation of the Hungarian algorithm). We can later analyze if either approach is better. (this should've been reported in the previous commit) * The ordered list of domains for each BGC is loaded into memory in DomainList, instead of reading both pfs files at the moment of distance calculation (but falls back to that if it can't find the lists in DomainList) * Shaved a couple of seconds from the aligned sequence similarity calculation (tip from one of Marnix's [links](http://stackoverflow.com/questions/16266622/in-python-calculate-percent-identity-between-two-strings) ) Original: ```python matches = 0 length = 0 for position in range(seq_length): if aligned_seqA[position] == aligned_seqB[position]: if aligned_seqA[position] != "-": matches += 1 length += 1 else: length += 1 similarity = 1 - ( float(matches)/float(length) ) ``` New: ```python matches = 0 gaps = 0 for position in range(seq_length): if aligned_seqA[position] == aligned_seqB[position]: if aligned_seqA[position] != "-": matches += 1 else: gaps += 1 similarity = 1 - ( float(matches)/float(seq_length-gaps) ``` * There is a difference in the size of the network files between the original version and the new experimental ones. This seems to be due to the way the genbank files were parsed. Originally, only the line with DEFINITION would be used, but some files (e.g. BGC0001277) have a multiple-line definition value. This is corrected now as we are using BioPython to read the DEFINITION section.
-
- 11 Jan, 2017 1 commit
-
-
Jorge Navarro Muñoz authored
The DMS structure that held the precalculated sequence similarity between all pairs of aligned domain sequences (for each domain) grew exponentially with the number of input files. On top of this, the structure seemed to be copied for parallelized calculation of pairwise distances. Emzo proposed (in the direct_align branch) to avoid using MAFFT for domain sequence alignment, as well as the storing the sequence similarity in the DMS dictionary, and instead calculate both things on-the-fly, at the moment of doing the distance calculation. In this branch, I'm keeping the multiple alignment part with MAFFT but the sequence similarity is left to do on-the-fly. This increase in computing time each time the script needs to be re-run is a tradeoff for getting rid of DMS (both in RAM-space as well as in disk-space, for it was also kept as a file). In summary: * Eliminated DMS usage. Using --skip_mafft avoids calling MAFFT, but otherwise only the aligned domain sequences (.algn) in the domains folder are read and kept in memory. * Eliminated the --use_mafft_distout parameter (only internal sequence similarity is used. We could as well avoid generating the .hat2 files in the future as well) * Dropped the ">" character from the keys of the dictionary returned by fasta_parser() * If running with the --skip_mafft parameter, BiG-SCAPE will not re-generate the domain fasta files (take into account that if the user is adding new files to her input directory, she should not use this parameter, or we should take care to track which domains are affected and process the domain fasta files + mafft-align only for those domains) * Moved the extraction of the gbk_group information (BGC definition + antiSMASH group annotation) to the first time that the GenBank files are opened (when collecting the files and doing basic filtering). Also in that routine (get_gbk_files), each file is only opened once.
-
- 22 Dec, 2016 1 commit
-
-
Jorge Navarro Muñoz authored
This commit fixes some bugs in file manipulation in the first stage of BiG-SCAPE when re-using the output folder. For example, if using the `--force-hmmscan` parameter, hmmscan would be used on ALL fasta files found in the output folder, instead of only those corresponding to the input files. The bug also caused other bad behaviour, like trying to discard files without predicted domains that weren't in the input file list (but that marked because they had not been processed -i.e. they don't have pfd counterparts)
-
- 18 Nov, 2016 1 commit
-
-
Jorge Navarro Muñoz authored
In the previous commit, a new method for combining DDS sub-components was introduced: DDS = (1-anchorweight)*non_anchor_prct*rDDSna + (1+anchorweight)*anchor_prct*rDDSa with anchor_prct = S_anchor / (S + S_anchor) non_anchor_prct = S / (S + S_anchor) However, the final weight was not really normalized (in some instances making the DDS_anchor component shorter than it should, in other instances making it too large, even making the final DDS score lesser than 0!) As a solution, the new weighting system effectively 'boosts' perceived number of ancohr domains: non_anchor_weight = non_anchor_prct / (anchor_prct*anchorweight + non_anchor_prct) anchor_weight = anchor_prct*anchorweight / (anchor_prct*anchorweight + non_anchor_prct) DDS = (non_anchor_weight*DDS_non_anchor) + (anchor_weight*DDS_anchor)
-
- 04 Nov, 2016 1 commit
-
-
Jorge Navarro Muñoz authored
Originally, DDS sub components (one for each type of domain: domains from the list in the anchorfile and the non-achor ones) where combined in the final DDS score with a fixed weight (note that this DDS is measuring *difference* rather than similitude between domains from both BGCs. It is later converted to similarity using the complement DDS = 1 - DDS): DDS = anchorweight*rDDSa + (1-anchorweight)*rDDSna with rDDSa as the raw DDS component for anchor domains. In this new version of BiG-SCAPE, this has changed to give each component a weight that depends on the actual percentage of anchor and non-anchor domains: anchor_prct = S_anchor / (S + S_anchor) non_anchor_prct = S / (S + S_anchor) where S_anchor is the number of different anchor domains analyzed between both BGCs. The anchorweight parameter changed its name 'anchorboost' (though internally it's still called 'anchorweight'), a variable that tries to increase the perceived amount of anchor domains (so the rDDSa has an increased weight). The DDS score is then: DDS = (1-anchorweight)*non_anchor_prct*rDDSna + (1+anchorweight)*anchor_prct*rDDSa Also in this commit, new columns are added to the output network file: each DDS subcomponent as well as the number of anchor and non-anchor domains analyzed
-
- 02 Nov, 2016 6 commits
-
-
Jorge Navarro Muñoz authored
-
Jorge Navarro Muñoz authored
-
Jorge Navarro Muñoz authored
Also, minor fixes to a couple of messages for the user
-
Jorge Navarro Muñoz authored
-
Jorge Navarro Muñoz authored
Mp hmm This branch, contributed by Emzo, introduces a new workflow for the first phase of BiG-SCAPE; namely, the processing of input files and domain prediction using `hmmscan`. The two most notable characteristics of this branch are: * A better distribution of parallelized work when predicting domains using `hmmscan`. In the original workflow, parallelization was left to `hmmscan` using the `--cpu` parameter but this approach's performance did not scale with the number of CPUs (given by the input parameter `--cores` in BiG-SCAPE). The new workflow changes this to as many one-thread `hmmscan` jobs as the number of `--cores` available, making it more efficient with the available resources. * BiG-SCAPE now runs domain prediction and other input-parsing tasks based only on the non-processed files in the output directory. This means that if BiG-SCAPE terminates early for some reason, it is able to resume work (specially with domain prediction, the most intensive task in this phase). See merge request !1
-
Jorge Navarro Muñoz authored
# Conflicts: # functions.py
-
- 01 Nov, 2016 2 commits
-
-
Jorge Navarro Muñoz authored
BiG-SCAPE now uses exclusively the new workflow contributed by Emzo Plus, a number of minor editions like improved verification of the existance of some files (e.g. the already calculated network files)
-
Jorge Navarro Muñoz authored
I'll work to erase the domain directory each time BiG-SCAPE is run, so this issue shouldn't happen, but if for any reason the same sequence is read and appended more than once to the same key in its domain sequence file, the sequences' length mismatch would throw an IndexError exception. This tries to reimplement commit b12bfb7d from the main branch.
-
- 31 Oct, 2016 1 commit
-
-
Jorge Navarro Muñoz authored
into main branch. Numerous tiny changes and a couple of bugs fixed. After some successful testing has gone through, the new workflow will be chosen as the default and this branch will be integrated with main.
-
- 10 Oct, 2016 1 commit
-
-
Jorge Navarro Muñoz authored
Re-added math functions for a couple of instances, added a few comments and other minor details.
-
- 05 Oct, 2016 1 commit
-
-
Jorge Navarro Muñoz authored
Currently, if there are duplicated files in the run (i.e. same file in different folders) domain sequences will be appended twice and the fasta parser will join them in one single sequence -increasing that particular sequence's length. This commit just warns the user and continues. In theory this should still give correct results.
-
- 03 Oct, 2016 1 commit
-
-
Emzo authored
fixed bugs from duplicate genbanks issue by reformatting samples/cluster data structure, clusters now are stored in a dictionary with a unique path to them and the samples they are a part of
-
- 14 Sep, 2016 2 commits
-
-
Emzo authored
-
Jorge Navarro Muñoz authored
-
- 12 Sep, 2016 1 commit
-
-
Emzo authored
-
- 16 Aug, 2016 3 commits
-
-
-
Jorge Navarro Muñoz authored
-
Jorge Navarro Muñoz authored
-
- 04 Aug, 2016 2 commits
-
-
Jorge Navarro Muñoz authored
-
Jorge Navarro Muñoz authored
When sorting pfd_matrix rows, the sorting should correspond to the absolute positions of the predicted domains. This positions have to take into account the strand of the gene from where they were predicted (the 'env' coordinate in the .domtable file is with respect to the start of the gene); all domains from a gene in the complementary strand should be reversed in order. This change means that the .pfd files must be rewritten, so, unfortunately, a re-run from scratch is necessary. Also closed BGC.dict and DMS.dict files after dump/load.
-
- 01 Aug, 2016 1 commit
-
-
Jorge Navarro Muñoz authored
Specifically, fixed the length of the alignment. This should improve DDS (e.g. when comparing the exact same Gene Clusters, DDS will now be 1). Run with --skip_hmmscan
-
- 27 Jul, 2016 1 commit
-
-
Jorge Navarro Muñoz authored
- Better choosing of sequence-pairs for calculation of sequence similarity
-
- 26 Jul, 2016 1 commit
-
-
Jorge Navarro Muñoz authored
-
- 19 Jul, 2016 3 commits
-
-
Jorge Navarro Muñoz authored
-
Jorge Navarro Muñoz authored
- Order list of domains by their absolute position (not their internal position within the feature) - Choose all possible pairs of domains in GK calculation. Before, the last nbhood-1 domains were missing from the pair-choosing - Fixed skewed values in the GK index. The absolute value of Ns-Nr meant that only values in the range [0.5,1.0] were being obtained. Values close to 0 mean the most difference in order of domains (all shared pairs are reversed) whereas values close to 1 mean least difference (all shared pairs are in the same order)
-
Jorge Navarro Muñoz authored
Use parameter --pfam_dir to specify location of hmmpress-procesed pfam files (.h3f, .h3i, .h3m and .h3p). If parameter is not given, default is to look in the same place as the BiG-SCAPE script. Thanks to Alex Cristofaro for helping with this!
-
- 12 Jul, 2016 1 commit
-
-
Jorge Navarro Muñoz authored
-
- 11 Jul, 2016 2 commits
-
-
Jorge Navarro Muñoz authored
-
Jorge Navarro Muñoz authored
- Parameter --skip_mafft is intended to be used when it's necessary to recalculate distance (most probably by changing the nbhood parameter). It's necessary to have: the original gbk files; the .pfs files; the .domtable files; the BGCs.dict and DMS.dict files - If any of the 'skip' parameters is activated, genbank_parser_hmmscan() no longer reads and parses fasta sequences from the genbank files. - --skip_all now recalculates logscore, distance and squared similarity in case that the user has submitted new weights for the Jaccard, DDS and GK indices - Minor other improvements in code, comments etc.
-
- 01 Jul, 2016 1 commit
-
-
Jorge Navarro Muñoz authored
-
- 30 Jun, 2016 1 commit
-
-
Jorge Navarro Muñoz authored
- Also fixed a potential bug: when an outlier distance pair was processed for disc nodes inclusion, only the first node would be added to the network, but not the second one (it had an 'elif' check for the second node) - Also, fixed a small bug when reporting number of cores in parameters.txt
-