
Add (sub)graph export (AKA region of interest)

Workum, Dirk-Jan van requested to merge add_gfa_export into develop

NB: Still under active development (this branch is a small side project of mine).

This merge request adds a subcommand to export the compacted de Bruijn graph (cDBG) from PanTools in GFA format (and possibly other formats).

TODO:

  • Check accuracy of GFA v1 output
  • Whole pangenome export
    • Decide on subcommand name
    • Decide which output formats should be supported (currently only GFA, which is slow)
    • Check speed on large pangenomes
  • Add subcommand for building nucleotide layer from existing graph (GFA v1 format)
    • => edit: to be done with !198
  • Add subcommand for extracting a subgraph in GFA format, including annotations for Bandage
    • Get separate subcommand for regions only
    • Define outputs for region (see below for implementation status)
  • Write all output formats
    • GFA v1
    • Include Bandage annotation CSV for outputs
    • FASTA for each genome
    • GFF3 for each genome
    • PAV for each homology group
    • PAV for each kmer/node
    • Collinearity file (/visualization)
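
For reference while checking the GFA v1 output, here is a minimal sketch of the record layout I expect: a header line, one `S` line per node and one `L` line per edge. It is not the PanTools code. The function names, the toy graph and the CSV column headers are made up for illustration, and the `3M` overlap assumes neighbouring nodes overlap by k - 1 bases with k = 4. As far as I understand, Bandage matches CSV rows to nodes by the name in the first column.

```python
import csv
import io


def write_gfa1(segments, links, out):
    """Write segments and links as GFA v1 text.

    segments: dict mapping segment name -> nucleotide sequence
    links: iterable of (from_name, from_orient, to_name, to_orient, overlap_cigar)
    """
    out.write("H\tVN:Z:1.0\n")
    for name, seq in segments.items():
        out.write(f"S\t{name}\t{seq}\n")
    for src, src_orient, dst, dst_orient, overlap in links:
        # A link to a segment that was never written would give an invalid graph.
        if src not in segments or dst not in segments:
            raise ValueError(f"link references unknown segment: {src} -> {dst}")
        out.write(f"L\t{src}\t{src_orient}\t{dst}\t{dst_orient}\t{overlap}\n")


def write_annotation_csv(labels, out):
    """Write one row per node: node name first, then a free-text label.

    The header names are placeholders, not a fixed Bandage requirement.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Name", "Annotation"])
    for name, label in labels.items():
        writer.writerow([name, label])


# Toy graph (hypothetical): k = 4, so the two nodes share 3 bases ("GTA").
segments = {"1": "ACGTA", "2": "GTAC"}
links = [("1", "+", "2", "+", "3M")]

gfa = io.StringIO()
write_gfa1(segments, links, gfa)
gfa_text = gfa.getvalue()

annotations = io.StringIO()
write_annotation_csv({"1": "geneA", "2": "intergenic"}, annotations)
csv_text = annotations.getvalue()
```

Comparing the real export against a hand-written toy graph like this one may be the quickest way to tick off the accuracy check.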

TODO after commit c565cb45 (where the 'novel' algorithm, which combines the kmer and alignment approaches, has been implemented and tested):

  • Add parameter for minimal number of kmers in a block for the 'novel' algorithm
  • Make 'novel' algorithm default and rename to more sensible name
  • Remove other algorithms
  • Create homology-based search using the 'novel' algorithm
  • Use simple (NJ?) clustering on kmer PAV for ordering the output
  • Add a new parameter --flanking to include additional flanking sequence after the ROI-finding algorithm has run
  • Clean up unused code
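
The --flanking item above could be as simple as widening each region found by the algorithm and clamping it to the sequence ends. A rough sketch, assuming 1-based inclusive coordinates (as in GFF3) and the same flank on both sides. The function name and signature are hypothetical and not part of PanTools.

```python
def add_flanking(start, end, flanking, seq_length):
    """Extend a 1-based inclusive region by `flanking` bases on each side.

    The result is clamped so it never runs past either end of the sequence.
    """
    if flanking < 0:
        raise ValueError("flanking must be non-negative")
    if not 1 <= start <= end <= seq_length:
        raise ValueError("region must lie within the sequence")
    return max(1, start - flanking), min(seq_length, end + flanking)


# A region in the middle of a 1000 bp sequence grows by 50 bp on both sides.
middle = add_flanking(100, 200, 50, 1000)
# A region close to both ends is clamped to the sequence boundaries.
clamped = add_flanking(10, 990, 50, 1000)
```

Applying the flank after the ROI search, rather than before, keeps the search itself unchanged and only affects what gets exported.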
Edited by Workum, Dirk-Jan van
