1. 08 Feb, 2017 2 commits
  2. 07 Feb, 2017 1 commit
    • Jorge Navarro Muñoz's avatar
      Bugfix: If aligned sequences' lengths are different, the warning message being · 681d99a9
      Jorge Navarro Muñoz authored
      printed used a variable that wasn't declared. In theory, this situation should
      not happen, but it was reported by a user (Ghofran O.)
      Deleted function: genbank_grp_dict() was not used anymore as the (AntiSMASH)
      group annotation as well as the definition are being read in get_gbk_files().
      Warning message: be more explicit if BiG-SCAPE cannot find Pfam files.
      681d99a9
  3. 13 Jan, 2017 2 commits
    • Jorge Navarro Muñoz's avatar
      Minor fix: prevent the terminal to be clogged by warning messages · fbe44d5c
      Jorge Navarro Muñoz authored
      Both when a list of ordered domains is not found for a particular BGC and
      also for when aligned domain sequences are not found. In the latter case,
      though, the messages will be printed several times (one for each distance
      calculation that needs it)
      Both messages are controlled by the --verbose parameter
      fbe44d5c
    • Jorge Navarro Muñoz's avatar
      Some minor optimizations · 12140509
      Jorge Navarro Muñoz authored
      * Currently using the old Hungarian algorithm approach (munkres.py), though
      Emzo was using something a bit different (matrix is a numpy array, scipy's
      linear_sum_assignment as the implementation of the Hungarian algorithm).
      We can later analyze if either approach is better. (this should've been
      reported in the previous commit)
      * The ordered list of domains for each BGC is loaded into memory in
      DomainList, instead of reading both pfs files at the moment of distance
      calculation (but falls back to that if it can't find the lists in
      DomainList)
      * Shaved a couple of seconds from the aligned sequence similarity
      calculation (tip from one of Marnix's [links](http://stackoverflow.com/questions/16266622/in-python-calculate-percent-identity-between-two-strings) )
      
      Original:
      ```python
      matches = 0
      length = 0
      for position in range(seq_length):
          if aligned_seqA[position] == aligned_seqB[position]:
              if aligned_seqA[position] != "-":
                  matches += 1
                  length += 1
          else:
              length += 1
      similarity = 1 - ( float(matches)/float(length) )
      ```
      
      New:
      ```python
      matches = 0
      gaps = 0
      for position in range(seq_length):
          if aligned_seqA[position] == aligned_seqB[position]:
              if aligned_seqA[position] != "-":
                  matches += 1
              else:
                  gaps += 1
      
      similarity = 1 - ( float(matches)/float(seq_length-gaps)
      ```
      
      * There is a difference in the size of the network files between the original
      version and the new experimental ones. This seems to be due to the way the
      genbank files were parsed. Originally, only the line with DEFINITION would be
      used, but some files (e.g. BGC0001277) have a multiple-line definition value.
      This is corrected now as we are using BioPython to read the DEFINITION section.
      12140509
  4. 11 Jan, 2017 1 commit
    • Jorge Navarro Muñoz's avatar
      New experimental branch: No DMS.dict · 35188f2e
      Jorge Navarro Muñoz authored
      The DMS structure that held the precalculated sequence similarity between all
      pairs of aligned domain sequences (for each domain) grew exponentially with
      the number of input files. On top of this, the structure seemed to be copied
      for parallelized calculation of pairwise distances.
      Emzo proposed (in the direct_align branch) to avoid using MAFFT for domain
      sequence alignment, as well as the storing the sequence similarity in the DMS
      dictionary, and instead calculate both things on-the-fly, at the moment of
      doing the distance calculation.
      In this branch, I'm keeping the multiple alignment part with MAFFT but the
      sequence similarity is left to do on-the-fly. This increase in computing time
      each time the script needs to be re-run is a tradeoff for getting rid of DMS
      (both in RAM-space as well as in disk-space, for it was also kept as a file).
      
      In summary:
      * Eliminated DMS usage. Using --skip_mafft avoids calling MAFFT, but otherwise
      only the aligned domain sequences (.algn) in the domains folder are read and
      kept in memory.
      * Eliminated the --use_mafft_distout parameter (only internal sequence
      similarity is used. We could as well avoid generating the .hat2 files in the
      future as well)
      * Dropped the ">" character from the keys of the dictionary returned by
      fasta_parser()
      * If running with the --skip_mafft parameter, BiG-SCAPE will not re-generate
      the domain fasta files (take into account that if the user is adding new files
      to her input directory, she should not use this parameter, or we should take
      care to track which domains are affected and process the domain fasta files +
      mafft-align only for those domains)
      * Moved the extraction of the gbk_group information (BGC definition + antiSMASH
      group annotation) to the first time that the GenBank files are opened (when
      collecting the files and doing basic filtering). Also in that routine
      (get_gbk_files), each file is only opened once.
      35188f2e
  5. 22 Dec, 2016 1 commit
    • Jorge Navarro Muñoz's avatar
      Fixed a bug when processing input files · d453df5d
      Jorge Navarro Muñoz authored
      This commit fixes some bugs in file manipulation in the first stage of
      BiG-SCAPE when re-using the output folder. For example, if using the
      `--force-hmmscan` parameter, hmmscan would be used on ALL fasta files found in
      the output folder, instead of only those corresponding to the input files.
      The bug also caused other bad behaviour, like trying to discard files without
      predicted domains that weren't in the input file list (but that marked because
      they had not been processed -i.e. they don't have pfd counterparts)
      d453df5d
  6. 18 Nov, 2016 1 commit
    • Jorge Navarro Muñoz's avatar
      Bugfix in DDS sub-component weighting logic · 26a3c5a2
      Jorge Navarro Muñoz authored
      In the previous commit, a new method for combining DDS sub-components was
      introduced:
      DDS = (1-anchorweight)*non_anchor_prct*rDDSna +
            (1+anchorweight)*anchor_prct*rDDSa
      with
      anchor_prct = S_anchor / (S + S_anchor)
      non_anchor_prct = S / (S + S_anchor)
      However, the final weight was not really normalized (in some instances making
      the DDS_anchor component shorter than it should, in other instances making
      it too large, even making the final DDS score lesser than 0!)
      As a solution, the new weighting system effectively 'boosts' perceived number
      of ancohr domains:
      non_anchor_weight = non_anchor_prct /
                          (anchor_prct*anchorweight + non_anchor_prct)
      anchor_weight = anchor_prct*anchorweight /
                          (anchor_prct*anchorweight + non_anchor_prct)
      DDS = (non_anchor_weight*DDS_non_anchor) + (anchor_weight*DDS_anchor)
      26a3c5a2
  7. 04 Nov, 2016 1 commit
    • Jorge Navarro Muñoz's avatar
      Important changes to DDS index · 85ca13b8
      Jorge Navarro Muñoz authored
      Originally, DDS sub components (one for each type of domain: domains from
      the list in the anchorfile and the non-achor ones) where combined in the final
      DDS score with a fixed weight (note that this DDS is measuring *difference*
      rather than similitude between domains from both BGCs. It is later converted
      to similarity using the complement DDS = 1 - DDS):
      DDS = anchorweight*rDDSa + (1-anchorweight)*rDDSna
      with rDDSa as the raw DDS component for anchor domains.
      
      In this new version of BiG-SCAPE, this has changed to give each component a
      weight that depends on the actual percentage of anchor and non-anchor domains:
      anchor_prct = S_anchor / (S + S_anchor)
      non_anchor_prct = S / (S + S_anchor)
      where S_anchor is the number of different anchor domains analyzed between both
      BGCs.
      The anchorweight parameter changed its name 'anchorboost' (though internally
      it's still called 'anchorweight'), a variable that tries to increase the
      perceived amount of anchor domains (so the rDDSa has an increased weight).
      
      The DDS score is then:
      DDS = (1-anchorweight)*non_anchor_prct*rDDSna +
       (1+anchorweight)*anchor_prct*rDDSa
      
      Also in this commit, new columns are added to the output network file: each
      DDS subcomponent as well as the number of anchor and non-anchor domains
      analyzed
      85ca13b8
  8. 02 Nov, 2016 6 commits
  9. 01 Nov, 2016 2 commits
    • Jorge Navarro Muñoz's avatar
      Make Emzo's workflow default · 07feda76
      Jorge Navarro Muñoz authored
      BiG-SCAPE now uses exclusively the new workflow contributed by Emzo
      Plus, a number of minor editions like improved verification of the existance
      of some files (e.g. the already calculated network files)
      07feda76
    • Jorge Navarro Muñoz's avatar
      Try to catch problems in calc_perc_identity due to sequences' lengths mismatch · 8fef3741
      Jorge Navarro Muñoz authored
      I'll work to erase the domain directory each time BiG-SCAPE is run, so
      this issue shouldn't happen, but if for any reason the same sequence is read
      and appended more than once to the same key in its domain sequence file, the
      sequences' length mismatch would throw an IndexError exception.
      This tries to reimplement commit b12bfb7d from
      the main branch.
      8fef3741
  10. 31 Oct, 2016 1 commit
  11. 10 Oct, 2016 1 commit
  12. 05 Oct, 2016 1 commit
    • Jorge Navarro Muñoz's avatar
      Be more clear if there is a problem calculating seq. identity · b12bfb7d
      Jorge Navarro Muñoz authored
      Currently, if there are duplicated files in the run (i.e. same file
      in different folders) domain sequences will be appended twice and
      the fasta parser will join them in one single sequence -increasing
      that particular sequence's length.
      This commit just warns the user and continues. In theory this should
      still give correct results.
      b12bfb7d
  13. 03 Oct, 2016 1 commit
  14. 14 Sep, 2016 2 commits
  15. 12 Sep, 2016 1 commit
  16. 16 Aug, 2016 3 commits
  17. 04 Aug, 2016 2 commits
    • Jorge Navarro Muñoz's avatar
      0327fcfd
    • Jorge Navarro Muñoz's avatar
      Bug fix in GK calculation: · 890f405f
      Jorge Navarro Muñoz authored
      When sorting pfd_matrix rows, the sorting should correspond to the absolute
      positions of the predicted domains. This positions have to take into account
      the strand of the gene from where they were predicted (the 'env' coordinate
      in the .domtable file is with respect to the start of the gene); all domains
      from a gene in the complementary strand should be reversed in order.
      This change means that the .pfd files must be rewritten, so, unfortunately,
      a re-run from scratch is necessary.
      Also closed BGC.dict and DMS.dict files after dump/load.
      890f405f
  18. 01 Aug, 2016 1 commit
  19. 27 Jul, 2016 1 commit
  20. 26 Jul, 2016 1 commit
  21. 19 Jul, 2016 3 commits
    • Jorge Navarro Muñoz's avatar
      Slight update in README · 3fd6d850
      Jorge Navarro Muñoz authored
      3fd6d850
    • Jorge Navarro Muñoz's avatar
      Fixed some bugs in the GK calculation: · 69bbc420
      Jorge Navarro Muñoz authored
      - Order list of domains by their absolute position (not their internal
      position within the feature)
      - Choose all possible pairs of domains in GK calculation. Before, the last
      nbhood-1 domains were missing from the pair-choosing
      - Fixed skewed values in the GK index. The absolute value of Ns-Nr meant that
      only values in the range [0.5,1.0] were being obtained. Values close to 0
      mean the most difference in order of domains (all shared pairs are reversed)
      whereas values close to 1 mean least difference (all shared pairs are in the
      same order)
      69bbc420
    • Jorge Navarro Muñoz's avatar
      Specify location of Pfam files: · 858cdf27
      Jorge Navarro Muñoz authored
      Use parameter --pfam_dir to specify location of hmmpress-procesed pfam files
      (.h3f, .h3i, .h3m and .h3p). If parameter is not given, default is to look
      in the same place as the BiG-SCAPE script.
      Thanks to Alex Cristofaro for helping with this!
      858cdf27
  22. 12 Jul, 2016 1 commit
  23. 11 Jul, 2016 2 commits
    • Jorge Navarro Muñoz's avatar
      Updated README · d129fa41
      Jorge Navarro Muñoz authored
      d129fa41
    • Jorge Navarro Muñoz's avatar
      Added parameter --skip_mafft and a few other things: · 10e83013
      Jorge Navarro Muñoz authored
      - Parameter --skip_mafft is intended to be used when it's necessary to
      recalculate distance (most probably by changing the nbhood parameter). It's
      necessary to have: the original gbk files; the .pfs files; the .domtable files;
      the BGCs.dict and DMS.dict files
      - If any of the 'skip' parameters is activated, genbank_parser_hmmscan() no
      longer reads and parses fasta sequences from the genbank files.
      - --skip_all now recalculates logscore, distance and squared similarity in
      case that the user has submitted new weights for the Jaccard, DDS and GK
      indices
      - Minor other improvements in code, comments etc.
      10e83013
  24. 01 Jul, 2016 1 commit
  25. 30 Jun, 2016 1 commit