1. 24 Nov, 2016 2 commits
  2. 23 Nov, 2016 1 commit
  3. 18 Nov, 2016 2 commits
    • ** bigscape.py · 5f8e7033
      Emzo de los Santos authored
      added --pairwise option that doesn't need mafft and calculates alignments on the fly
      
      TO DO: Change network matrix structure for bigger files
      
      ** functions.py
      
      added parsePFD function
    • Bugfix in DDS sub-component weighting logic · 26a3c5a2
      Jorge Navarro Muñoz authored
      In the previous commit, a new method for combining DDS sub-components was
      introduced:
      DDS = (1-anchorweight)*non_anchor_prct*rDDSna +
            (1+anchorweight)*anchor_prct*rDDSa
      with
      anchor_prct = S_anchor / (S + S_anchor)
      non_anchor_prct = S / (S + S_anchor)
      However, the final weights were not normalized: in some instances the
      DDS_anchor component came out smaller than it should, in others too large,
      and the final DDS score could even drop below 0.
      As a solution, the new weighting system effectively 'boosts' the perceived
      number of anchor domains:
      non_anchor_weight = non_anchor_prct /
                          (anchor_prct*anchorweight + non_anchor_prct)
      anchor_weight = anchor_prct*anchorweight /
                          (anchor_prct*anchorweight + non_anchor_prct)
      DDS = (non_anchor_weight*DDS_non_anchor) + (anchor_weight*DDS_anchor)
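      The normalized weighting above can be sketched as follows. This is a
      minimal illustration, not the BiG-SCAPE implementation: the function name,
      argument names and error handling are assumptions.

```python
def combine_dds(dds_non_anchor, dds_anchor, s, s_anchor, anchorweight):
    """Combine the two DDS sub-components with normalized weights.

    Hypothetical helper: s and s_anchor are the numbers of non-anchor and
    anchor domains, anchorweight is the boost applied to anchor domains.
    """
    total = float(s + s_anchor)
    if total == 0:
        raise ValueError("no domains to compare")

    anchor_prct = s_anchor / total
    non_anchor_prct = s / total

    # Shared denominator: this is what makes the two weights sum to 1.
    denominator = anchor_prct * anchorweight + non_anchor_prct
    if denominator == 0:
        raise ValueError("weights are undefined for these inputs")

    non_anchor_weight = non_anchor_prct / denominator
    anchor_weight = anchor_prct * anchorweight / denominator

    return non_anchor_weight * dds_non_anchor + anchor_weight * dds_anchor
```

      Because the two weights always sum to 1, the combined value stays between
      the two sub-component values, so it cannot leave the [0, 1] range when
      both sub-components are in that range.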
  4. 04 Nov, 2016 1 commit
    • Important changes to DDS index · 85ca13b8
      Jorge Navarro Muñoz authored
      Originally, the DDS sub-components (one for each type of domain: domains from
      the list in the anchorfile and the non-anchor ones) were combined in the final
      DDS score with a fixed weight:
      DDS = anchorweight*rDDSa + (1-anchorweight)*rDDSna
      with rDDSa as the raw DDS component for anchor domains and rDDSna as the raw
      component for non-anchor domains. Note that this DDS measures *difference*
      rather than similarity between the domains of both BGCs; it is later
      converted to similarity using the complement, 1 - DDS.
      
      In this new version of BiG-SCAPE, this has changed to give each component a
      weight that depends on the actual percentage of anchor and non-anchor domains:
      anchor_prct = S_anchor / (S + S_anchor)
      non_anchor_prct = S / (S + S_anchor)
      where S_anchor is the number of different anchor domains analyzed between both
      BGCs and S is the corresponding number of non-anchor domains.
      The anchorweight parameter was renamed to 'anchorboost' (though internally
      it is still called 'anchorweight'). It tries to increase the perceived
      number of anchor domains, so that rDDSa gets an increased weight.
      
      The DDS score is then:
      DDS = (1-anchorweight)*non_anchor_prct*rDDSna +
       (1+anchorweight)*anchor_prct*rDDSa
      
      Also in this commit, new columns are added to the output network file: each
      DDS subcomponent as well as the number of anchor and non-anchor domains
      analyzed
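
      As a rough illustration of the formula in this commit, the sketch below
      evaluates it for given domain counts. The function and argument names are
      hypothetical. Note that the two effective weights are not forced to sum
      to 1, which is the issue addressed later in commit 26a3c5a2.

```python
def dds_with_anchorboost(r_dds_na, r_dds_a, s, s_anchor, anchorweight):
    """DDS as written in commit 85ca13b8 (hypothetical helper name).

    r_dds_na / r_dds_a: raw DDS components for non-anchor / anchor domains.
    s / s_anchor: number of non-anchor / anchor domains analyzed.
    """
    total = float(s + s_anchor)
    if total == 0:
        raise ValueError("no domains to compare")

    anchor_prct = s_anchor / total
    non_anchor_prct = s / total

    return ((1 - anchorweight) * non_anchor_prct * r_dds_na
            + (1 + anchorweight) * anchor_prct * r_dds_a)


# With mostly anchor domains and maximal raw differences, the result
# exceeds 1, so the similarity 1 - DDS becomes negative.
example = dds_with_anchorboost(1.0, 1.0, s=1, s_anchor=3, anchorweight=0.5)
```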
  5. 02 Nov, 2016 6 commits
  6. 01 Nov, 2016 2 commits
    • Make Emzo's workflow default · 07feda76
      Jorge Navarro Muñoz authored
      BiG-SCAPE now exclusively uses the new workflow contributed by Emzo.
      This commit also includes a number of minor edits, such as improved
      verification of the existence of some files (e.g. the already calculated
      network files).
    • Try to catch problems in calc_perc_identity due to sequences' lengths mismatch · 8fef3741
      Jorge Navarro Muñoz authored
      I'll work to erase the domain directory each time BiG-SCAPE is run, so
      this issue shouldn't happen, but if for any reason the same sequence is read
      and appended more than once to the same key in its domain sequence file, the
      sequences' length mismatch would throw an IndexError exception.
      This tries to reimplement commit b12bfb7d from
      the main branch.
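
      A minimal sketch of this kind of guard is shown below. It is not the
      actual calc_perc_identity code: the function name, its signature and the
      choice to compare only the common prefix are assumptions made for
      illustration.

```python
import warnings


def percent_identity(seq1, seq2, name1="seq1", name2="seq2"):
    """Fraction of identical positions between two aligned sequences.

    Hypothetical stand-in for calc_perc_identity. Indexing both sequences
    up to the longer length would raise IndexError when the lengths differ,
    so this version warns and compares only the common prefix (assumption).
    """
    if len(seq1) != len(seq2):
        warnings.warn(
            "Length mismatch between %s (%d) and %s (%d); "
            "comparing the first positions only"
            % (name1, len(seq1), name2, len(seq2))
        )

    length = min(len(seq1), len(seq2))
    if length == 0:
        return 0.0

    matches = sum(1 for a, b in zip(seq1, seq2) if a == b)  # zip stops at the shorter one
    return matches / float(length)
```

      If a sequence was appended twice to the same key, its first half is still
      the original sequence, which is why comparing the common prefix is a
      plausible fallback here.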
  7. 31 Oct, 2016 1 commit
  8. 10 Oct, 2016 1 commit
  9. 05 Oct, 2016 1 commit
    • Be more clear if there is a problem calculating seq. identity · b12bfb7d
      Jorge Navarro Muñoz authored
      Currently, if there are duplicated files in the run (i.e. same file
      in different folders) domain sequences will be appended twice and
      the fasta parser will join them into one single sequence, increasing
      that particular sequence's length.
      This commit just warns the user and continues. In theory this should
      still give correct results.
  10. 03 Oct, 2016 1 commit
  11. 14 Sep, 2016 2 commits
  12. 12 Sep, 2016 1 commit
  13. 16 Aug, 2016 3 commits
  14. 04 Aug, 2016 2 commits
    • 0327fcfd
      Jorge Navarro Muñoz authored
    • Bug fix in GK calculation: · 890f405f
      Jorge Navarro Muñoz authored
      When sorting pfd_matrix rows, the order should follow the absolute
      positions of the predicted domains. These positions have to take into account
      the strand of the gene from which they were predicted (the 'env' coordinate
      in the .domtable file is relative to the start of the gene); all domains
      from a gene on the complementary strand should be reversed in order.
      This change means that the .pfd files must be rewritten, so, unfortunately,
      a re-run from scratch is necessary.
      Also closed BGC.dict and DMS.dict files after dump/load.
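
      The strand-aware ordering described above could look roughly like this.
      The Domain record, its field names and the conversion of 'env'
      coordinates (assumed to be amino acid positions, hence the factor 3) are
      assumptions for illustration, not the layout of the .pfd files.

```python
from collections import namedtuple

# Hypothetical record; field names are not taken from BiG-SCAPE.
Domain = namedtuple(
    "Domain", "pfam gene_start gene_end strand env_start env_end"
)


def absolute_start(domain):
    """Approximate position of a domain on the cluster's forward axis."""
    if domain.strand == "+":
        # env coordinates count from the gene start, left to right.
        return domain.gene_start + 3 * domain.env_start
    # On the complementary strand the gene starts at its right end,
    # so env coordinates count from right to left.
    return domain.gene_end - 3 * domain.env_end


def sort_domains(domains):
    """Sort domains by absolute position rather than position within the gene."""
    return sorted(domains, key=absolute_start)


domains = [
    Domain("A", 1000, 1900, "-", 10, 50),
    Domain("B", 1000, 1900, "-", 100, 200),
    Domain("C", 100, 700, "+", 5, 60),
]
ordered = [d.pfam for d in sort_domains(domains)]
```

      In this example domains A and B come from the same gene on the
      complementary strand, so their order is reversed relative to their 'env'
      coordinates.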
  15. 01 Aug, 2016 1 commit
  16. 27 Jul, 2016 1 commit
  17. 26 Jul, 2016 1 commit
  18. 19 Jul, 2016 3 commits
    • Slight update in README · 3fd6d850
      Jorge Navarro Muñoz authored
    • Fixed some bugs in the GK calculation: · 69bbc420
      Jorge Navarro Muñoz authored
      - Order list of domains by their absolute position (not their internal
      position within the feature)
      - Choose all possible pairs of domains in GK calculation. Before, the last
      nbhood-1 domains were missing from the pair-choosing
      - Fixed skewed values in the GK index. Taking the absolute value of Ns-Nr
      meant that only values in the range [0.5,1.0] were being obtained. Values
      close to 0 now indicate the largest difference in domain order (all shared
      pairs are reversed), whereas values close to 1 indicate the smallest
      difference (all shared pairs are in the same order).
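
      The effect of the absolute value can be illustrated with a small sketch.
      The exact formula is not given in this commit message; the rescaling
      below, GK = (1 + (Ns - Nr)/(Ns + Nr)) / 2, is an assumption chosen only
      because it reproduces the ranges described above, with Ns and Nr taken to
      be the numbers of shared pairs in the same and in reversed order.

```python
def gk_index(ns, nr):
    """Rescaled order index in [0, 1] (assumed formula, see lead-in)."""
    if ns + nr == 0:
        raise ValueError("no shared pairs of domains")
    return (1 + (ns - nr) / float(ns + nr)) / 2


def gk_index_with_abs(ns, nr):
    """Same formula with abs(Ns - Nr), showing the skewed behaviour."""
    if ns + nr == 0:
        raise ValueError("no shared pairs of domains")
    return (1 + abs(ns - nr) / float(ns + nr)) / 2
```

      With the absolute value, four reversed pairs (ns=0, nr=4) score the same
      as four pairs in the same order, and no input can score below 0.5.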
    • Specify location of Pfam files: · 858cdf27
      Jorge Navarro Muñoz authored
      Use the parameter --pfam_dir to specify the location of the
      hmmpress-processed Pfam files (.h3f, .h3i, .h3m and .h3p). If the parameter
      is not given, the default is to look in the same place as the BiG-SCAPE
      script.
      Thanks to Alex Cristofaro for helping with this!
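
      A minimal sketch of such an option, assuming argparse and a
      'Pfam-A.hmm' base name for the pressed files. Both are assumptions, as
      are the helper names; the actual option parsing in BiG-SCAPE may differ.

```python
import argparse
import os

PFAM_EXTENSIONS = (".h3f", ".h3i", ".h3m", ".h3p")


def build_parser(script_dir):
    """Parser whose --pfam_dir defaults to the script's own directory."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--pfam_dir",
        default=script_dir,
        help="Directory with the hmmpress-processed Pfam files",
    )
    return parser


def missing_pfam_files(pfam_dir, basename="Pfam-A.hmm"):
    """Return the expected pressed files that are not present in pfam_dir."""
    return [
        basename + ext
        for ext in PFAM_EXTENSIONS
        if not os.path.isfile(os.path.join(pfam_dir, basename + ext))
    ]
```

      In a script, script_dir would typically be
      os.path.dirname(os.path.realpath(__file__)), and a non-empty result from
      missing_pfam_files() would be reported before running hmmscan.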
  19. 12 Jul, 2016 1 commit
  20. 11 Jul, 2016 2 commits
    • Updated README · d129fa41
      Jorge Navarro Muñoz authored
    • Added parameter --skip_mafft and a few other things: · 10e83013
      Jorge Navarro Muñoz authored
      - Parameter --skip_mafft is intended for cases where the distance has to be
      recalculated (most probably after changing the nbhood parameter). It
      requires the original gbk files, the .pfs files, the .domtable files, and
      the BGCs.dict and DMS.dict files.
      - If any of the 'skip' parameters is active, genbank_parser_hmmscan() no
      longer reads and parses fasta sequences from the genbank files.
      - --skip_all now recalculates logscore, distance and squared similarity in
      case the user has submitted new weights for the Jaccard, DDS and GK
      indices.
      - Other minor improvements in code, comments, etc.
  21. 01 Jul, 2016 1 commit
  22. 30 Jun, 2016 1 commit
  23. 29 Jun, 2016 2 commits
  24. 24 Jun, 2016 1 commit