Concepts on fastqrab documentation

Mon, 01 Jan 0001 00:00:00 +0000

Philosophy #

fastqrab transforms (DNA) sequencing reads for downstream analysis.

It focuses on:

correctness
reproducibility
a lack of surprises
friendliness
speed

Correctness #

We strive to do the right thing, always.

To that end, fastqrab is tested with more than 500 end-to-end, input-to-output tests, both during development and via continuous integration.

Reproducibility #

Repeated runs on the same bits (input data & configuration) must deliver the same output bits. Every time.
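One way to check this property from the outside is to hash the output of repeated runs and compare the digests. The sketch below is illustrative only: `is_reproducible` and `toy_run` are hypothetical names, and `toy_run` stands in for a real pipeline invocation.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of an output's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def is_reproducible(run, inputs: bytes, config: str, repeats: int = 3) -> bool:
    """Call `run` several times on identical input and compare output digests."""
    digests = {digest(run(inputs, config)) for _ in range(repeats)}
    return len(digests) == 1


def toy_run(inputs: bytes, config: str) -> bytes:
    # Hypothetical stand-in for a pipeline run: deterministic by construction.
    return inputs.upper() + config.encode()
```

Comparing digests rather than parsed records matches the "same bits" wording: any byte-level difference, including ordering or compression metadata, would be detected.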

Parser Architecture #

Overview #

fastqrab uses a custom-built parser designed for high performance and correctness when processing FASTQ files. The parser’s design emphasizes:

Zero-copy parsing where possible to minimize memory allocations
Streaming architecture to handle files of any size
Transparent compression support (raw, gzip, zstd)
Cross-platform compatibility (Unix/Windows line endings)

(FASTA and BAM files are processed differently; see below.)
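The design goals above can be illustrated with a small Python sketch. It is not fastqrab's parser: `open_transparent` and `read_fastq` are hypothetical names, the reader assumes strict four-line records, and zstd is only detected, not decoded, because the standard library used here has no zstd decoder.

```python
import gzip
import io
from typing import BinaryIO, Iterator, NamedTuple

GZIP_MAGIC = b"\x1f\x8b"
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"


class Record(NamedTuple):
    name: bytes
    sequence: bytes
    qualities: bytes


def open_transparent(raw: BinaryIO) -> BinaryIO:
    """Pick a decompressor by peeking at the stream's magic bytes."""
    buffered = io.BufferedReader(raw)
    head = buffered.peek(4)[:4]
    if head[:2] == GZIP_MAGIC:
        return gzip.GzipFile(fileobj=buffered, mode="rb")
    if head == ZSTD_MAGIC:
        raise NotImplementedError("zstd decoding is omitted in this sketch")
    return buffered


def _strip(line: bytes) -> bytes:
    """Remove a trailing LF or CRLF so both line-ending styles parse alike."""
    return line.rstrip(b"\r\n")


def read_fastq(stream: BinaryIO) -> Iterator[Record]:
    """Yield one record at a time; memory use does not grow with file size."""
    while True:
        header = stream.readline()
        if not header:
            return  # end of stream
        if not header.strip():
            continue  # tolerate blank lines between records
        sequence = stream.readline()
        stream.readline()  # the '+' separator line carries no data we keep
        qualities = stream.readline()
        yield Record(_strip(header)[1:], _strip(sequence), _strip(qualities))
```

Because the caller only ever sees an iterator of records, the same loop works for raw and gzip input and for Unix or Windows line endings.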

The Zero-Copy Challenge with Compressed Files #

Why Not Pure Zero-Copy? #

A common optimization in bioinformatics tools is “zero-copy” parsing, where the parser operates directly on memory-mapped file contents without allocating separate buffers. This works well for uncompressed files stored on fast storage in suitable file formats.
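As a rough illustration of the technique (not fastqrab's code), the following Python sketch memory-maps an uncompressed file and hands out `memoryview` slices of the sequence lines instead of copying them. The names `sequence_views` and `count_bases` are hypothetical, and the sketch assumes non-empty files with four-line records and LF line endings.

```python
import mmap


def sequence_views(buf):
    """Yield memoryview slices of each sequence line without copying bytes.

    `buf` may be an mmap object or plain bytes; both support find() and
    the buffer protocol.
    """
    view = memoryview(buf)
    pos, line_no, size = 0, 0, len(view)
    while pos < size:
        end = buf.find(b"\n", pos)
        if end == -1:
            end = size  # last line without a trailing newline
        if line_no % 4 == 1:  # second line of each record is the sequence
            yield view[pos:end]
        pos, line_no = end + 1, line_no + 1


def count_bases(path: str) -> int:
    """Sum sequence lengths straight from the mapped file contents."""
    with open(path, "rb") as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Each slice is dropped right after len(), so no view outlives the map.
            return sum(len(v) for v in sequence_views(mapped))
```

The slices point into the mapped pages, which is exactly why this approach stops working once the bytes on disk are compressed: the parser then needs a decompressed buffer to point into.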

Segments #

Modern sequencers, particularly Illumina sequencers, can read multiple times from one (amplified) DNA molecule, producing multiple ‘segments’ (often called ‘reads’) that together form a ‘molecule’ or ‘fragment’.

Definition and Configuration #

Segments are defined in the [input] section of your TOML configuration. Each segment corresponds to one FASTQ file (or stream in interleaved formats), and segment names are arbitrary but should be meaningful.

[input]
read1 = ["sample_R1.fq.gz"]
read2 = ["sample_R2.fq.gz"]
index1 = ["sample_I1.fq.gz"]

In this example, three segments are defined: read1, read2, and index1.

Source #

When a step refers to a ‘source’ (instead of a segment), it means the step can read from multiple types of data: segment sequences, segment names, or tag values.

Overview #

The source parameter generalizes the segment parameter, allowing steps to operate on different kinds of string data within a fragment. This flexibility enables advanced workflows like extracting patterns from read names, processing tag-derived sequences, or combining multiple data sources.
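The idea of one parameter selecting among several kinds of string data can be sketched as a small dispatcher. The spelling used here ("read1", "name:read1", "tag:umi") is a hypothetical convention chosen for the example and should not be read as fastqrab's actual source syntax; `Read`, `Fragment`, and `resolve_source` are likewise illustrative names.

```python
from dataclasses import dataclass, field


@dataclass
class Read:
    name: str
    sequence: str


@dataclass
class Fragment:
    segments: dict
    tags: dict = field(default_factory=dict)


def resolve_source(fragment: Fragment, source: str) -> str:
    """Return the string a source refers to.

    Hypothetical spelling, for illustration only:
    "read1" -> segment sequence, "name:read1" -> segment name,
    "tag:umi" -> tag value.
    """
    kind, _, key = source.rpartition(":")
    if kind == "":
        return fragment.segments[key].sequence
    if kind == "name":
        return fragment.segments[key].name
    if kind == "tag":
        return fragment.tags[key]
    raise ValueError(f"unknown source kind: {kind!r}")
```

A step written against `resolve_source` does not need to know whether its input came from a sequence, a read name, or a tag, which is the generalization the paragraph above describes.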

Step #

A step is one coherent manipulation of the FASTQ stream and its associated data.

Overview #

Steps are the building blocks of a processing pipeline. Each step is declared as a [[step]] entry in the TOML configuration file, and the complete pipeline executes steps sequentially from top to bottom.

Every step operates on complete fragments (molecules), ensuring that paired segments remain synchronized. If a filtering step removes a fragment based on criteria from read1, the corresponding read2, index1, and any other segments are automatically removed alongside it.
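The two properties described here, sequential execution and whole-fragment filtering, can be sketched in a few lines of Python. The step constructors `filter_min_length` and `truncate` are hypothetical examples, not fastqrab step names, and a fragment is simplified to a dictionary from segment name to sequence.

```python
def filter_min_length(segment, minimum):
    """Build a step that drops fragments whose `segment` is too short."""
    def step(fragments):
        for fragment in fragments:
            if len(fragment[segment]) >= minimum:
                yield fragment  # the whole fragment survives or is dropped
    return step


def truncate(segment, length):
    """Build a step that shortens one segment and leaves the others alone."""
    def step(fragments):
        for fragment in fragments:
            yield {**fragment, segment: fragment[segment][:length]}
    return step


def run_pipeline(fragments, steps):
    """Apply steps in declaration order, each consuming the previous output."""
    for step in steps:
        fragments = step(fragments)
    return list(fragments)


reads = [
    {"read1": "ACGTACGT", "read2": "TTTT"},
    {"read1": "AC", "read2": "GGGGGG"},
]
kept = run_pipeline(reads, [filter_min_length("read1", 4), truncate("read2", 2)])
```

Because the filter yields or drops the whole dictionary, the `read2` of the second fragment disappears together with its short `read1`, so the segments can never fall out of sync.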

Tag / Label #

A regular tag is a piece of fragment-derived metadata that one step in the pipeline produces, and other steps may consume, transform, or export.

A virtual tag is created on the fly, exists only for the step that uses it, and disappears right afterwards.

Overview - Regular tags #

Tags enable sophisticated workflows by decoupling data extraction from data usage. Instead of hardcoding logic like “trim adapters AND filter by adapter presence” into a single step, you extract adapter locations as a tag, then use that tag in multiple downstream operations.
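The adapter example can be sketched as three small steps: one that produces a tag and two that consume it. All names here (`tag_adapter_start`, `trim_at_tag`, `keep_if_tag_present`, the `adapter_start` label) are hypothetical, and the exact-match search is a simplification chosen for the sketch.

```python
def tag_adapter_start(fragments, adapter, label="adapter_start"):
    """Producer: record where `adapter` starts in read1, or None if absent."""
    for fragment in fragments:
        position = fragment["read1"].find(adapter)
        tags = {**fragment.get("tags", {}), label: position if position >= 0 else None}
        yield {**fragment, "tags": tags}


def trim_at_tag(fragments, label="adapter_start"):
    """Consumer 1: cut read1 at the tagged position, if there is one."""
    for fragment in fragments:
        start = fragment["tags"][label]
        if start is not None:
            fragment = {**fragment, "read1": fragment["read1"][:start]}
        yield fragment


def keep_if_tag_present(fragments, label="adapter_start"):
    """Consumer 2: keep only fragments in which the adapter was found."""
    for fragment in fragments:
        if fragment["tags"][label] is not None:
            yield fragment


reads = [{"read1": "ACGTAGATCGGAAG"}, {"read1": "ACGTACGT"}]
tagged = list(tag_adapter_start(reads, "AGATCGG"))
trimmed = list(trim_at_tag(tagged))
with_adapter = list(keep_if_tag_present(trimmed))
```

The adapter search runs once; trimming and filtering both reuse its result, and either consumer can be dropped or reordered without touching the extraction step.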