1. 31 Oct, 2017 2 commits
  2. 18 Oct, 2017 6 commits
  3. 17 Oct, 2017 2 commits
  4. 16 Oct, 2017 3 commits
  5. 11 Oct, 2017 1 commit
    • Small speedup for LCS + planning for BiG-SCAPE's visualization · 2aacd067
      Jorge Navarro Muñoz authored
      - Small speedup for LCS: trim the "PF" prefix from pfam identifiers when forming
      words for difflib's SequenceMatcher
      - The distance calculation stage now outputs the number of the gene where the LCS
      seed starts in both BGCs, plus the correct orientation of the second BGC. This
      will eventually be exported to the JSON file used for BiG-SCAPE's visualization
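
A minimal sketch of the prefix trimming described in this commit, not the actual BiG-SCAPE code. The helper name `trim_pfam_prefix` and the example identifiers are hypothetical; only `difflib.SequenceMatcher` and its `find_longest_match` method are real library calls.

```python
from difflib import SequenceMatcher


def trim_pfam_prefix(pfam_id):
    """Drop a leading 'PF' so the words handed to SequenceMatcher are shorter.

    Hypothetical helper: the name and exact behaviour are assumptions.
    """
    return pfam_id[2:] if pfam_id.startswith("PF") else pfam_id


# Hypothetical domain lists for two BGCs
domains_a = ["PF00109", "PF02801", "PF00698"]
domains_b = ["PF02801", "PF00698", "PF08659"]

words_a = [trim_pfam_prefix(d) for d in domains_a]
words_b = [trim_pfam_prefix(d) for d in domains_b]

# autojunk=False disables difflib's "popular element" heuristic
matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
match = matcher.find_longest_match(0, len(words_a), 0, len(words_b))
```

Here `match.a` and `match.b` give the start of the common run in each list and `match.size` its length.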
  6. 10 Oct, 2017 2 commits
  7. 29 Sep, 2017 1 commit
  8. 28 Sep, 2017 2 commits
  9. 27 Sep, 2017 1 commit
  10. 26 Sep, 2017 1 commit
    • Improvements and bugfixes · c428ec54
      Jorge Navarro Muñoz authored
      Bugfix: A BGC from the Others class would get added twice to that class. This
      happened once when finding BGCs with mixed-type annotations, then again if
      any of its sub-annotations was also from the Others class.
      Improvement: Only read fasta and pfd files once when calling GCFs (instead of
      doing it at each cutoff)
      Improvement: Pass reduced distance matrix to clusterJsonBatch (so we can get
      rid of it quickly)
      Improvement: Pass already sorted list of BGCs to clusterJsonBatch
      Improvement: Improved human readability of the .js file
      Improvement: Cleaned and re-ordered arguments (a bit)
      Other: Changed clan-calling parameter to '--clans'
  11. 13 Sep, 2017 5 commits
  12. 12 Sep, 2017 5 commits
  13. 07 Sep, 2017 1 commit
    • Implement LCS mode. Use to try to cope with fragmented BGCs that may overlap in their start/end positions · deba3829
      Jorge Navarro Muñoz authored
      The number of domains in each gene and the orientation of each gene must be
      obtained first (once the pfd information is available).
      Prepare A_string and B_string.
      Each is a list of concatenated pfam ids, one entry per gene. Concatenation
      follows the downstream orientation of each gene:
      A_domlist = a b c d e f g   List of pfam ids as found in the BGC
      dcg_a     = 1 3 1 2         Number of domains in each gene of the BGC
      go_a      = 1 -1 -1 1       Orientation of each gene
      A_string  = a dcb e fg      List of concatenated domains (note the reverse order of bcd)
      SequenceMatcher from difflib is used to find the longest common slice between
      A_string and B_string. reverse(B_string) is also tested, and the best
      orientation is kept.
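
The string preparation and seed search above can be sketched as follows. This is an illustration under stated assumptions, not the commit's code: the function names are hypothetical, `reverse(B_string)` is assumed to mean reversing the order of the per-gene strings, and ties are resolved in favour of the forward orientation.

```python
from difflib import SequenceMatcher


def build_gene_strings(domlist, dcg, go):
    """Concatenate the domains of each gene; genes with orientation -1 are reversed."""
    strings = []
    start = 0
    for n_domains, orientation in zip(dcg, go):
        gene = domlist[start:start + n_domains]
        if orientation == -1:
            gene = gene[::-1]
        strings.append("".join(gene))
        start += n_domains
    return strings


def longest_common_slice(a_string, b_string):
    """Return (match, reversed_flag) for the better of the two orientations of B."""
    def lcs(x, y):
        sm = SequenceMatcher(None, x, y, autojunk=False)
        return sm.find_longest_match(0, len(x), 0, len(y))

    forward = lcs(a_string, b_string)
    backward = lcs(a_string, b_string[::-1])  # assumption: reverse the gene order
    if backward.size > forward.size:
        return backward, True
    return forward, False


# The example from the commit message
a_string = build_gene_strings(list("abcdefg"), [1, 3, 1, 2], [1, -1, -1, 1])

# Hypothetical second BGC that holds the same genes in the opposite order
b_string = ["fg", "e", "dcb"]
seed, b_reversed = longest_common_slice(a_string, b_string)
```

When `b_reversed` is true, `seed.b` indexes into the reversed copy of `b_string`, so a caller would need to map it back to the original gene numbering.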
      Extension of slices:
      As extension is relatively costly, a minimum of 3 overlapping genes is required
      in the seed subcluster.
       Extension occurs as follows:
       For each upstream/downstream side:
        Find the BGC whose slice is closest to the end.
        The slice of this BGC is extended until the end.
        The slice of the other BGC is extended according to max_score
        (if both slices have the same length, the extension with the best score is
        kept)
         Input: slice of 'other' BGC and slice of extended BGC as reference.
         For each gene in 'other':
          Try to find gene in reference slice, starting in 'pos_y'
          If not found, decrease score by 'mismatch'
          If found, update score with 'match' + 'match_position'*'gap' and update
          If score >= max_score, update max_score and current position
         Output: position where max_score occurred, max_score.
         Match = +5
         Gap = -2
         Mismatch = -3
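
A rough sketch of the scoring loop described above, using the listed Match/Gap/Mismatch values. Several details are assumptions rather than facts from the commit: the function name `score_extension` is hypothetical, 'match_position' is taken to be the offset of the hit relative to 'pos_y', 'pos_y' is assumed to advance past each hit, and the returned position counts how many genes of 'other' are included.

```python
MATCH = 5
GAP = -2
MISMATCH = -3


def score_extension(other_genes, ref_genes, match=MATCH, gap=GAP, mismatch=MISMATCH):
    """Return (position, max_score) for extending 'other' against the reference slice."""
    score = 0
    max_score = 0
    best_pos = 0  # number of genes from 'other' to include in the extension
    pos_y = 0     # where the search in the reference slice starts
    for pos_x, gene in enumerate(other_genes, start=1):
        try:
            hit = ref_genes.index(gene, pos_y)
        except ValueError:
            # Gene not found in the remaining reference slice
            score += mismatch
        else:
            # Assumption: the gap penalty scales with the number of skipped genes
            score += match + (hit - pos_y) * gap
            pos_y = hit + 1
        if score >= max_score:
            max_score = score
            best_pos = pos_x
    return best_pos, max_score


# Hypothetical per-gene domain strings
other = ["x1", "q", "x2", "x3"]
reference = ["x1", "x2", "z", "x3"]
position, best = score_extension(other, reference)
```

With these inputs the running score is 5, 2, 7, 10, so the whole of `other` is kept; a slice with no hits at all returns position 0 and score 0.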
      If smallest resulting slice has size 5 or bigger:
          If Biosynthetic Genes are in the final slices of both BGCs:
              Use slices for (normal) distance calculation.
          Else:
              Use the whole range of domains
      Else:
          Use the whole range of domains
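
One possible reading of this final decision, sketched as a small predicate. It treats both "Use the whole range of domains" lines as fallbacks, which is an interpretation of the text above. The function name, the (start, stop) slice representation, and the rule that a single biosynthetic gene inside each slice is enough are all assumptions for illustration.

```python
def use_slices(slice_a, slice_b, core_a, core_b, min_size=5):
    """Decide whether the extended slices should be used for distance calculation.

    slice_a, slice_b: (start, stop) index ranges, stop exclusive (assumed layout).
    core_a, core_b: indices of biosynthetic genes in each BGC (assumed layout).
    Returns False when the whole range of domains should be used instead.
    """
    size_a = slice_a[1] - slice_a[0]
    size_b = slice_b[1] - slice_b[0]
    if min(size_a, size_b) < min_size:
        return False
    has_core_a = any(slice_a[0] <= i < slice_a[1] for i in core_a)
    has_core_b = any(slice_b[0] <= i < slice_b[1] for i in core_b)
    return has_core_a and has_core_b


decision = use_slices((2, 8), (0, 6), core_a={3}, core_b={5})
```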
  14. 28 Aug, 2017 1 commit
  15. 22 Aug, 2017 1 commit
  16. 21 Aug, 2017 2 commits
  17. 18 Aug, 2017 1 commit
  18. 10 Aug, 2017 1 commit
  19. 08 Aug, 2017 2 commits