1. 28 Apr, 2017 1 commit
  2. 21 Apr, 2017 1 commit
  3. 16 Mar, 2017 3 commits
  4. 15 Mar, 2017 1 commit
    • New experimental Metagenomic mode · 4ff8f505
      Jorge Navarro Muñoz authored
      In this mode, BiG-SCAPE assumes that the shorter BGC of a pair is probably
      fragmented. It slices the larger BGC so that the slice shares the largest
      number of different domains with the shorter one. Then it uses only that
      slice of the larger BGC to calculate the distance.
      Also introduces the --banned_classes parameter to specify on which classes to
      perform analysis
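
The slicing step described above could be sketched as a sliding window over the larger BGC's ordered domain list. This is only an illustration: the name `best_slice`, the window length (taken here to be the length of the shorter BGC's domain list), and the tie-breaking rule (the first best window wins) are assumptions, not BiG-SCAPE's actual code.

```python
def best_slice(short_domains, long_domains):
    """Return the slice of long_domains sharing the most distinct domains
    with short_domains (hypothetical helper, window size is an assumption)."""
    window = len(short_domains)
    target = set(short_domains)
    best_start, best_shared = 0, -1
    # Slide a fixed-size window along the larger BGC's domain list.
    for start in range(len(long_domains) - window + 1):
        candidate = set(long_domains[start:start + window])
        shared = len(target & candidate)  # number of distinct shared domains
        if shared > best_shared:
            best_start, best_shared = start, shared
    # If the "larger" list is shorter than the window, this returns all of it.
    return long_domains[best_start:best_start + window]
```

Only the returned slice would then be compared against the shorter BGC when computing the distance.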
  5. 10 Mar, 2017 1 commit
  6. 09 Mar, 2017 1 commit
  7. 03 Mar, 2017 1 commit
  8. 02 Mar, 2017 1 commit
  9. 01 Mar, 2017 1 commit
    • - Introduced hmmalign as an alternative to MAFFT. Much faster. Generates · 24cf4d13
      Jorge Navarro Muñoz authored
      another set of files (.stk). Able to work with non-standard amino acid
      characters (e.g. 'U' for Selenocysteine does not appear in final alignment).
      For now, same strategy as hmmscan (many 1-thread processes). Activate with
      '--use_hmmalign'
      - Improved case where translation of CDS is not available in the GenBank file:
       - Translate only until stop codon regardless of sequence's location (see
      NZ_JMEU01000004-181881-306881_ORF39)
       - Use fuzzy start/end positions to know if trimming should be done at the
      start or at the end of the sequence (previously, used gene_start == 0)
       - It fixes a trimming issue that was silently (not visible because this part
      is parallelized) reported by biopython (BiopythonWarning: Partial codon,
      len(sequence) not a multiple of three. Explicitly trim the sequence or add
      trailing N before translation. This may become an error in future.
      BiopythonWarning)
       - Note: In the cases where manual translation is needed, alternative start
      codons could be told to be translated as Methionine, but I've chosen not to.
      I'm assuming that when the translation is not present, then we are dealing
      with a random sequence.
       - gid and pid fields in sequences' identifiers are no longer enclosed in
      brackets (i.e. they are properly extracted strings and not stringified lists).
      THIS (along with the change in domain sequences identifiers) LIKELY BREAKS
      BACKWARDS COMPATIBILITY. Adding new BGCs to a previous analysis and running
      again is NOT encouraged.
      - Also deleted code for the old, unmaintained alternate distance calculation
      (domaindist) as well as GK code.
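
The trimming rule for untranslated CDS features could look roughly like the sketch below. The function name `trim_partial_codon` and the boolean `start_is_fuzzy` are hypothetical; the commit only states that fuzzy start/end positions decide which side is trimmed.

```python
def trim_partial_codon(nt_seq, start_is_fuzzy):
    """Trim a nucleotide sequence to a multiple of three (illustrative only).

    start_is_fuzzy: assumed flag, True when the feature's start position is
    fuzzy (the gene is cut off at its 5' end), so extra bases are dropped there.
    """
    remainder = len(nt_seq) % 3
    if remainder == 0:
        return nt_seq
    if start_is_fuzzy:
        # Incomplete start: drop leading bases to restore the reading frame.
        return nt_seq[remainder:]
    # Otherwise drop the trailing partial codon.
    return nt_seq[:len(nt_seq) - remainder]

# The trimmed sequence can then be translated up to the first stop codon,
# e.g. with Biopython: Seq(trimmed).translate(to_stop=True)
```

Trimming before translation is what avoids the "Partial codon" BiopythonWarning quoted above.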
  10. 23 Feb, 2017 3 commits
  11. 22 Feb, 2017 2 commits
    • BiG-SCAPE now accepts multi-record genbank files. · dd673e21
      Jorge Navarro Muñoz authored
      - Assume that all records have the same Definition (and use the first one)
      - For the Group: make a set of Products from all records.
       * If there is only one kind of Product, use it
       * If there are two Products and one is "other", set Group to the remaining one
       * Other cases: discard "other" and join remaining ones using dashes (it will
         probably be classified as a hybrid)
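
The three rules for deriving the Group can be written down as a small function. This is a sketch under assumptions: the name `group_from_products`, the sorted join order, and the handling of an empty input are not taken from the commit.

```python
def group_from_products(products):
    """Derive a single Group label from the Products of all records."""
    unique = set(products)
    if not unique:
        return ""  # assumption: no annotation yields an empty label
    if len(unique) == 1:
        # Only one kind of Product: use it (even if it is "other").
        return next(iter(unique))
    # Two or more Products: discard "other" and keep the rest.
    remaining = sorted(unique - {"other"})  # sorted only for determinism
    # One remaining Product is used as-is; several are joined with dashes.
    return "-".join(remaining)
```

For example, `["t1pks", "other"]` gives `t1pks`, while `["t1pks", "nrps", "other"]` gives `nrps-t1pks`.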
    • Possible to resume alignment phase · 02d6cfca
      Jorge Navarro Muñoz authored
      Further fixes to --samples mode. Seems to be working fine now
      Minor improvement in bgc class (added product type 'hglks')
  12. 20 Feb, 2017 1 commit
  13. 17 Feb, 2017 1 commit
  14. 15 Feb, 2017 1 commit
  15. 13 Feb, 2017 2 commits
  16. 10 Feb, 2017 3 commits
  17. 09 Feb, 2017 1 commit
  18. 08 Feb, 2017 3 commits
  19. 07 Feb, 2017 1 commit
    • Bugfix: If aligned sequences' lengths are different, the warning message being · 681d99a9
      Jorge Navarro Muñoz authored
      printed used a variable that wasn't declared. In theory, this situation should
      not happen, but it was reported by a user (Ghofran O.)
      Deleted function: genbank_grp_dict() was not used anymore as the (AntiSMASH)
      group annotation as well as the definition are being read in get_gbk_files().
      Warning message: be more explicit if BiG-SCAPE cannot find Pfam files.
  20. 13 Jan, 2017 2 commits
    • Minor fix: prevent the terminal from being clogged by warning messages · fbe44d5c
      Jorge Navarro Muñoz authored
      Both when a list of ordered domains is not found for a particular BGC and
      also for when aligned domain sequences are not found. In the latter case,
      though, the messages will be printed several times (one for each distance
      calculation that needs it)
      Both messages are controlled by the --verbose parameter
    • Some minor optimizations · 12140509
      Jorge Navarro Muñoz authored
      * Currently using the old Hungarian algorithm approach (munkres.py), though
      Emzo was using something a bit different (matrix is a numpy array, scipy's
      linear_sum_assignment as the implementation of the Hungarian algorithm).
      We can later analyze if either approach is better. (this should've been
      reported in the previous commit)
      * The ordered list of domains for each BGC is loaded into memory in
      DomainList, instead of reading both pfs files at the moment of distance
      calculation (but falls back to that if it can't find the lists in
      DomainList)
      * Shaved a couple of seconds from the aligned sequence similarity
      calculation (tip from one of Marnix's [links](http://stackoverflow.com/questions/16266622/in-python-calculate-percent-identity-between-two-strings) )
      
      Original:
      ```python
      matches = 0
      length = 0
      for position in range(seq_length):
          if aligned_seqA[position] == aligned_seqB[position]:
              if aligned_seqA[position] != "-":
                  matches += 1
                  length += 1
          else:
              length += 1
      similarity = 1 - ( float(matches)/float(length) )
      ```
      
      New:
      ```python
      matches = 0
      gaps = 0
      for position in range(seq_length):
          if aligned_seqA[position] == aligned_seqB[position]:
              if aligned_seqA[position] != "-":
                  matches += 1
              else:
                  gaps += 1
      
      similarity = 1 - ( float(matches)/float(seq_length-gaps) )
      ```
      
      * There is a difference in the size of the network files between the original
      version and the new experimental ones. This seems to be due to the way the
      genbank files were parsed. Originally, only the line with DEFINITION would be
      used, but some files (e.g. BGC0001277) have a multiple-line definition value.
      This is corrected now as we are using BioPython to read the DEFINITION section.
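
For reference, the "New" loop above can be wrapped into a self-contained function. This is a sketch, not the code in the repository: the name `aligned_distance`, the length check, and the value returned when every column is a shared gap are assumptions. As in the snippets above, the result is 1 minus the fraction of identical non-gap columns.

```python
def aligned_distance(aligned_seqA, aligned_seqB):
    """Return 1 - identity for two aligned sequences of equal length."""
    if len(aligned_seqA) != len(aligned_seqB):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    gaps = 0
    for a, b in zip(aligned_seqA, aligned_seqB):
        if a == b:
            if a == "-":
                gaps += 1      # column where both sequences have a gap
            else:
                matches += 1   # identical residues
    length = len(aligned_seqA) - gaps
    if length == 0:
        return 1.0  # assumption: nothing to compare counts as fully different
    return 1 - matches / length
```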
  21. 11 Jan, 2017 1 commit
    • New experimental branch: No DMS.dict · 35188f2e
      Jorge Navarro Muñoz authored
      The DMS structure that held the precalculated sequence similarity between all
      pairs of aligned domain sequences (for each domain) grew quadratically with
      the number of input files. On top of this, the structure seemed to be copied
      for parallelized calculation of pairwise distances.
      Emzo proposed (in the direct_align branch) to avoid using MAFFT for domain
      sequence alignment, as well as storing the sequence similarity in the DMS
      dictionary, and instead calculate both things on-the-fly, at the moment of
      doing the distance calculation.
      In this branch, I'm keeping the multiple alignment part with MAFFT but the
      sequence similarity is left to do on-the-fly. This increase in computing time
      each time the script needs to be re-run is a tradeoff for getting rid of DMS
      (both in RAM-space as well as in disk-space, for it was also kept as a file).
      
      In summary:
      * Eliminated DMS usage. Using --skip_mafft avoids calling MAFFT, but otherwise
      only the aligned domain sequences (.algn) in the domains folder are read and
      kept in memory.
      * Eliminated the --use_mafft_distout parameter (only internal sequence
      similarity is used. We could also avoid generating the .hat2 files in the
      future)
      * Dropped the ">" character from the keys of the dictionary returned by
      fasta_parser()
      * If running with the --skip_mafft parameter, BiG-SCAPE will not re-generate
      the domain fasta files (take into account that if the user is adding new files
      to her input directory, she should not use this parameter, or we should take
      care to track which domains are affected and process the domain fasta files +
      mafft-align only for those domains)
      * Moved the extraction of the gbk_group information (BGC definition + antiSMASH
      group annotation) to the first time that the GenBank files are opened (when
      collecting the files and doing basic filtering). Also in that routine
      (get_gbk_files), each file is only opened once.
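
To make the `fasta_parser()` change concrete, here is a minimal parser whose dictionary keys omit the leading ">". It is a sketch written for illustration; the real function's signature and behaviour may differ, and taking an open file handle as input is an assumption.

```python
def fasta_parser(handle):
    """Parse FASTA lines into {header_without_gt: sequence}."""
    sequences = {}
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                sequences[header] = "".join(chunks)
            header = line[1:]  # drop the ">" so keys are plain identifiers
            chunks = []
        else:
            chunks.append(line)
    if header is not None:
        sequences[header] = "".join(chunks)  # store the last record
    return sequences
```

With plain keys, an identifier read from another file can be used for lookups directly, without prepending ">".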
  22. 22 Dec, 2016 1 commit
    • Fixed a bug when processing input files · d453df5d
      Jorge Navarro Muñoz authored
      This commit fixes some bugs in file manipulation in the first stage of
      BiG-SCAPE when re-using the output folder. For example, if using the
      `--force_hmmscan` parameter, hmmscan would be used on ALL fasta files found in
      the output folder, instead of only those corresponding to the input files.
      The bug also caused other bad behaviour, like trying to discard files without
      predicted domains that weren't in the input file list (but that were marked
      because they had not been processed, i.e. they don't have pfd counterparts)
  23. 18 Nov, 2016 1 commit
    • Bugfix in DDS sub-component weighting logic · 26a3c5a2
      Jorge Navarro Muñoz authored
      In the previous commit, a new method for combining DDS sub-components was
      introduced:
      DDS = (1-anchorweight)*non_anchor_prct*rDDSna +
            (1+anchorweight)*anchor_prct*rDDSa
      with
      anchor_prct = S_anchor / (S + S_anchor)
      non_anchor_prct = S / (S + S_anchor)
      However, the final weight was not really normalized (in some instances making
      the DDS_anchor component smaller than it should be, in other instances making
      it too large, even making the final DDS score less than 0!)
      As a solution, the new weighting system effectively 'boosts' the perceived
      number of anchor domains:
      non_anchor_weight = non_anchor_prct /
                          (anchor_prct*anchorweight + non_anchor_prct)
      anchor_weight = anchor_prct*anchorweight /
                          (anchor_prct*anchorweight + non_anchor_prct)
      DDS = (non_anchor_weight*DDS_non_anchor) + (anchor_weight*DDS_anchor)
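
The difference between the two weighting schemes is easy to check numerically. The functions below are a direct transcription of the formulas in this message; the function names and the guard against an empty domain set are assumptions added for the example.

```python
def old_dds(S, S_anchor, anchorweight, rDDSna, rDDSa):
    """Previous (unnormalized) combination of the DDS sub-components."""
    anchor_prct = S_anchor / (S + S_anchor)
    non_anchor_prct = S / (S + S_anchor)
    return ((1 - anchorweight) * non_anchor_prct * rDDSna
            + (1 + anchorweight) * anchor_prct * rDDSa)

def new_dds(S, S_anchor, anchorweight, DDS_non_anchor, DDS_anchor):
    """Normalized combination: the two weights always sum to 1."""
    if S + S_anchor == 0:
        raise ValueError("no domains to compare")  # assumed guard
    anchor_prct = S_anchor / (S + S_anchor)
    non_anchor_prct = S / (S + S_anchor)
    denominator = anchor_prct * anchorweight + non_anchor_prct
    non_anchor_weight = non_anchor_prct / denominator
    anchor_weight = anchor_prct * anchorweight / denominator
    return non_anchor_weight * DDS_non_anchor + anchor_weight * DDS_anchor
```

With S = 3, S_anchor = 1 and anchorweight = 2, the new weights are 0.6 and 0.4. For sub-components 0.5 (non-anchor) and 0.0 (anchor), the old formula yields -0.375, while the new one yields 0.3 and stays within [0, 1].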
  24. 04 Nov, 2016 1 commit
    • Important changes to DDS index · 85ca13b8
      Jorge Navarro Muñoz authored
      Originally, DDS sub components (one for each type of domain: domains from
      the list in the anchorfile and the non-anchor ones) were combined in the final
      DDS score with a fixed weight (note that this DDS is measuring *difference*
      rather than similarity between domains from both BGCs. It is later converted
      to similarity using the complement DDS = 1 - DDS):
      DDS = anchorweight*rDDSa + (1-anchorweight)*rDDSna
      with rDDSa as the raw DDS component for anchor domains.
      
      In this new version of BiG-SCAPE, this has changed to give each component a
      weight that depends on the actual percentage of anchor and non-anchor domains:
      anchor_prct = S_anchor / (S + S_anchor)
      non_anchor_prct = S / (S + S_anchor)
      where S_anchor is the number of different anchor domains analyzed between both
      BGCs.
      The anchorweight parameter changed its name to 'anchorboost' (though internally
      it's still called 'anchorweight'), a variable that tries to increase the
      perceived amount of anchor domains (so the rDDSa has an increased weight).
      
      The DDS score is then:
      DDS = (1-anchorweight)*non_anchor_prct*rDDSna +
       (1+anchorweight)*anchor_prct*rDDSa
      
      Also in this commit, new columns are added to the output network file: each
      DDS subcomponent as well as the number of anchor and non-anchor domains
      analyzed
  25. 02 Nov, 2016 5 commits
    • 67f17d5d
    • minor update to README · f1f55621
      Jorge Navarro Muñoz authored
    • --force_hmmscan will also force BiG-SCAPE to re-process domtable files · 12dc97d6
      Jorge Navarro Muñoz authored
      Also, minor fixes to a couple of messages for the user
    • Updated README file · b79820ba
      Jorge Navarro Muñoz authored
    • Merge branch 'mp-hmm' into 'master' · f4165c1e
      Jorge Navarro Muñoz authored
      Mp hmm
      
      This branch, contributed by Emzo, introduces a new workflow for the first phase of BiG-SCAPE; namely, the processing of input files and domain prediction using `hmmscan`.
      The two most notable characteristics of this branch are:
      
      * A better distribution of parallelized work when predicting domains using `hmmscan`. In the original workflow, parallelization was left to `hmmscan` using the `--cpu` parameter but this approach's performance did not scale with the number of CPUs (given by the input parameter `--cores` in BiG-SCAPE). The new workflow changes this to as many one-thread `hmmscan` jobs as the number of `--cores` available, making it more efficient with the available resources.
      * BiG-SCAPE now runs domain prediction and other input-parsing tasks based only on the non-processed files in the output directory. This means that if BiG-SCAPE terminates early for some reason, it is able to resume work (especially with domain prediction, the most intensive task in this phase).
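
The two points above (many one-thread `hmmscan` jobs, and resuming from unprocessed files) could be sketched with `multiprocessing.Pool` as follows. All function names, the `.domtable` output naming, and the `--cpu 1` value are assumptions made for this example; it is not the code from the `mp-hmm` branch.

```python
import os
import subprocess
from multiprocessing import Pool

def domtable_path(fasta, output_dir):
    """Assumed naming scheme: <output_dir>/<fasta basename>.domtable"""
    base = os.path.splitext(os.path.basename(fasta))[0]
    return os.path.join(output_dir, base + ".domtable")

def pending_fastas(fasta_files, output_dir):
    """Keep only the input files that have no result yet (resume support)."""
    return [f for f in fasta_files
            if not os.path.isfile(domtable_path(f, output_dir))]

def hmmscan_command(fasta, pfam_hmm, output_dir):
    """Build one hmmscan call limited to a single worker thread."""
    return ["hmmscan", "--cpu", "1",
            "--domtblout", domtable_path(fasta, output_dir),
            pfam_hmm, fasta]

def run_hmmscan(job):
    # job is a (fasta, pfam_hmm, output_dir) tuple
    subprocess.check_call(hmmscan_command(*job), stdout=subprocess.DEVNULL)

def predict_domains(fasta_files, pfam_hmm, output_dir, cores):
    """Run as many simultaneous one-thread hmmscan jobs as there are cores."""
    jobs = [(f, pfam_hmm, output_dir)
            for f in pending_fastas(fasta_files, output_dir)]
    with Pool(cores) as pool:
        pool.map(run_hmmscan, jobs)
```

Because each job handles one file and writes its own output, an interrupted run only needs to redo the files that are still missing a result.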
      
      See merge request !1