Optimize read mapping

The Map class is responsible for aligning reads to the pangenome. One or more instances are run in parallel in an executor service. This process can be optimized by splitting the work across threads as follows.

IO-bound work consists of reading from FASTQ files and writing to SAM files (ignoring compression of input and output), as well as Neo4j queries when finding candidate locations. CPU-bound work consists of clustering candidate locations and aligning reads against them. All IO- and CPU-bound work is lumped into the same thread pool, with each thread responsible for reading reads, aligning them, and writing to each relevant SAM file.

What's more, reading a single read and writing a single SAM record are each synchronized to prevent race conditions. Reads are not read in batches, and SAM records are not written in batches either. Without profiling in a number of scenarios it is difficult to determine the impact on performance, but it is likely to be substantial. Although SAM file writing is locked per file writer (rather than locking all file writers on each addition of a SAM record), locking a FastqReader for every single read will be inefficient.

It would be better to read FASTQ records in batches in one thread (or optionally two if paired-end, collating in some way), and put each batch in a queue that will be read by one or more aligner threads. Each aligner thread processes a batch of reads by clustering candidate locations and aligning against them. Alignment results are sent via a queue (batched) to one or more output threads responsible for writing SAM records to the different SAM files.
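The proposed layout could be sketched roughly as follows. This is a minimal illustration, not the existing Map implementation: the class name `BatchedPipelineSketch`, the `align` placeholder, the queue capacity of 16 and the use of plain strings for reads and SAM records are all assumptions made for the example. An empty sentinel list marks the end of the stream for each stage.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: one reader thread, N aligner threads, one writer thread.
final class BatchedPipelineSketch {
    // End-of-stream marker, compared by identity (real batches are never this instance).
    private static final List<String> END = new ArrayList<>();

    // Placeholder for clustering candidate locations and aligning against them.
    static String align(String read) {
        return read + "\taligned";
    }

    static List<String> run(List<String> reads, int batchSize, int alignerThreads) {
        if (batchSize < 1 || alignerThreads < 1) {
            throw new IllegalArgumentException("batchSize and alignerThreads must be >= 1");
        }
        // Bounded queues give back-pressure: a fast reader cannot fill memory.
        BlockingQueue<List<String>> readQueue = new ArrayBlockingQueue<>(16);
        BlockingQueue<List<String>> resultQueue = new ArrayBlockingQueue<>(16);
        // Stand-in for a SAM file; only the writer thread touches it, so no lock is needed.
        List<String> written = new ArrayList<>();

        Callable<Void> reader = () -> {
            for (int i = 0; i < reads.size(); i += batchSize) {
                int end = Math.min(i + batchSize, reads.size());
                readQueue.put(new ArrayList<>(reads.subList(i, end)));
            }
            // One end marker per aligner so every aligner terminates.
            for (int i = 0; i < alignerThreads; i++) readQueue.put(END);
            return null;
        };
        Callable<Void> aligner = () -> {
            for (List<String> batch = readQueue.take(); batch != END; batch = readQueue.take()) {
                List<String> records = new ArrayList<>(batch.size());
                for (String read : batch) records.add(align(read));
                resultQueue.put(records);
            }
            resultQueue.put(END);
            return null;
        };
        Callable<Void> writer = () -> {
            int finishedAligners = 0;
            while (finishedAligners < alignerThreads) {
                List<String> records = resultQueue.take();
                if (records == END) finishedAligners++;
                else written.addAll(records);
            }
            return null;
        };

        ExecutorService pool = Executors.newFixedThreadPool(alignerThreads + 2);
        try {
            List<Future<Void>> tasks = new ArrayList<>();
            tasks.add(pool.submit(reader));
            for (int t = 0; t < alignerThreads; t++) tasks.add(pool.submit(aligner));
            tasks.add(pool.submit(writer));
            for (Future<Void> task : tasks) task.get(); // propagate failures
            return written;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("pipeline failed", e);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

With more than one aligner thread the output order is no longer the input order; whether that matters for the SAM files is a separate decision. Synchronization now happens once per batch (inside the queues) instead of once per read or record.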

Considerations

Each aligner thread processes batches of reads, finding candidate locations first, then clustering them and aligning against them. The first step is DB-intensive, resulting in many database queries. With a high level of parallelism this might overload the database and (drastically) lower performance. This will need to be tested first but, if it is the case, it would be best to introduce a separate set of threads for the candidate lookup, ahead of the clustering and alignment phase. These threads will have a limited level of parallelism. Another queue would hold the candidate locations for each read, and a number of aligner threads would perform the (CPU-bound) clustering and alignment.
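One way to express such a split is sketched below, assuming two separate thread pools: a small one that caps concurrent database lookups and a larger one for CPU-bound work. All names (`StagedMappingSketch`, `findCandidates`, `clusterAndAlign`) are hypothetical, the database call is simulated with a short sleep, and the hand-off between stages uses the executors' internal (unbounded) work queues rather than an explicit bounded queue, which a real implementation would probably want.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: DB lookups run on a small pool, alignment on a separate pool.
final class StagedMappingSketch {
    // Counters that record how many simulated queries ran at the same time.
    static final AtomicInteger activeQueries = new AtomicInteger();
    static final AtomicInteger peakQueries = new AtomicInteger();

    // Stand-in for the Neo4j candidate lookup (IO-bound).
    static List<String> findCandidates(String read) {
        int now = activeQueries.incrementAndGet();
        peakQueries.accumulateAndGet(now, Math::max);
        try {
            Thread.sleep(5); // simulate query latency
            return Arrays.asList(read + "@locA", read + "@locB");
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            activeQueries.decrementAndGet();
        }
    }

    // Stand-in for clustering and alignment (CPU-bound).
    static String clusterAndAlign(String read, List<String> candidates) {
        return read + "->" + candidates.get(0);
    }

    static List<String> map(List<String> reads, int dbThreads, int cpuThreads) {
        ExecutorService dbPool = Executors.newFixedThreadPool(dbThreads);   // caps DB parallelism
        ExecutorService cpuPool = Executors.newFixedThreadPool(cpuThreads); // alignment workers
        try {
            List<CompletableFuture<String>> pending = new ArrayList<>();
            for (String read : reads) {
                pending.add(CompletableFuture
                        .supplyAsync(() -> findCandidates(read), dbPool)
                        .thenApplyAsync(c -> clusterAndAlign(read, c), cpuPool));
            }
            List<String> results = new ArrayList<>();
            for (CompletableFuture<String> f : pending) results.add(f.join());
            return results; // same order as the input reads
        } finally {
            dbPool.shutdown();
            cpuPool.shutdown();
        }
    }
}
```

Because the lookup pool has a fixed size, the number of simultaneous queries never exceeds `dbThreads`, regardless of how many aligner threads are configured. The right value for `dbThreads` would have to come from the testing mentioned above.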

If we're writing to BAM files, serialization and compression might take up a bit of CPU time as well. How much will need to be assessed, but if it is a significant amount a case can be made for using more than one output thread, each taking a share of the output files.
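If more than one output thread turns out to be worthwhile, each output file could be assigned to exactly one thread so that no file writer needs a lock. The sketch below shows a simple round-robin assignment; the class name `OutputShardingSketch` and the file names are made up for illustration, and a real scheme might weight files by expected record count instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: give each output thread its own share of the output files.
final class OutputShardingSketch {
    // Returns one list of files per writer thread, assigned round-robin.
    static List<List<String>> assign(List<String> files, int writerThreads) {
        if (writerThreads < 1) {
            throw new IllegalArgumentException("writerThreads must be >= 1");
        }
        List<List<String>> shares = new ArrayList<>();
        for (int i = 0; i < writerThreads; i++) {
            shares.add(new ArrayList<>());
        }
        for (int i = 0; i < files.size(); i++) {
            shares.get(i % writerThreads).add(files.get(i));
        }
        return shares;
    }
}
```

For example, five files split over two writer threads give the first thread the first, third and fifth file and the second thread the other two.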

Edited May 10, 2022 by Moed, Matthijs