Workum, Dirk-Jan van requested to merge fix_distinct_count_kmers into develop Sep 22, 2023

We noticed that pantools kmer_classification did not report the correct number of distinct kmers for subsets of genomes. This bug was that the skip_array was completely ignored for distinct kmers and only used for the total. This merge request fixes that by using the already present counter for each kmer and checks if it is larger than 0 after taking the skip_array into account.

Evidence that the counting is correct now

KMC counting of kmers

This is the kmc output for a random set of four small sequences:

% kmc -cs127 -k13 -ci1 -fm @genome_locations.txt genome .
Stage 1: 100%
1st stage: 0.255s
2nd stage: 1.07611s
Total    : 1.33112s
Tmp size : 0MB

Stats:
   No. of k-mers below min. threshold :            0
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :        29862
   No. of unique counted k-mers       :        29862
   Total no. of k-mers                :       119423
   Total no. of sequences             :            4
   Total no. of super-k-mers          :            0

And this is the kmc output for three small sequences (removing the fourth sequence from the one above):

% kmc -cs127 -k13 -ci1 -fm @subset_locations.txt subset .
Stage 1: 100%
1st stage: 0.69278s
2nd stage: 1.11376s
Total    : 1.80654s
Tmp size : 0MB

Stats:
   No. of k-mers below min. threshold :            0
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :        29836
   No. of unique counted k-mers       :        29836
   Total no. of k-mers                :        89583
   Total no. of sequences             :            3
   Total no. of super-k-mers          :            0

PanTools kmer counting (current develop)

When running PanTools with pantools kmer_classification tmp_DB, we get this:

% head tmp_DB/kmer_classification/kmer_classification_overview.txt
Uncompressed the k-mers. K-mer size is 13, extracting 12 of every compressed k-mer
To obtain the distinct k-mer counts, the frequency of k-mers was ignored

Total k-mers: 119423
Total degenerate k-mers: 0

Distinct k-mers: 29862
Distinct degenerate k-mers: 0
Core: 29701 (99.46%)
Accessory: 110 (0.37%)

And for the same subset as with kmc, we run PanTools like so: pantools kmer_classification -e tmp_DB and we get this:

% head tmp_DB/kmer_classification/kmer_classification_overview.txt
Uncompressed the k-mers. K-mer size is 13, extracting 12 of every compressed k-mer
To obtain the distinct k-mer counts, the frequency of k-mers was ignored

Total k-mers: 89583
Total degenerate k-mers: 0

Distinct k-mers: 29862
Distinct degenerate k-mers: 0
Core: 29727 (99.55%)
Accessory: 84 (0.28%)

PanTools kmer counting (this merge request)

When running PanTools with pantools kmer_classification tmp_DB, we get this:

% head tmp_DB/kmer_classification/kmer_classification_overview.txt
Uncompressed the k-mers. K-mer size is 13, extracting 12 of every compressed k-mer
To obtain the distinct k-mer counts, the frequency of k-mers was ignored

Total k-mers: 119423
Total degenerate k-mers: 0

Distinct k-mers: 29862
Distinct degenerate k-mers: 0
Core: 29701 (99.46%)
Accessory: 110 (0.37%)

And for the same subset as with kmc, we run PanTools like so: pantools kmer_classification -e tmp_DB and we get this:

% head tmp_DB/kmer_classification/kmer_classification_overview.txt
Uncompressed the k-mers. K-mer size is 13, extracting 12 of every compressed k-mer
To obtain the distinct k-mer counts, the frequency of k-mers was ignored

Total k-mers: 89583
Total degenerate k-mers: 0

Distinct k-mers: 29836
Distinct degenerate k-mers: 0
Core: 29727 (99.55%)
Accessory: 84 (0.28%)

These final numbers are identical to the kmc output.

Fix distinct count kmers in kmer_classification for subsets of genomes

Evidence that the counting is correct now

KMC counting of kmers

PanTools kmer counting (current develop)

PanTools kmer counting (this merge request)

Merge request reports