High level #
mbf-fastq-processor ingests any number of FastQ files, applies a user-defined sequence of steps, and emits transformed FastQs and/or structured reports.
FastQ segments ──> [extract | modify | filter | report] ──> FastQs / tables / HTML reports
Each step is explicit: there are no hidden defaults, and order matters.
Terminology #
- Fragment / molecule – the logical sequencing record composed of one or more segments (e.g., read1 / read2 / index1 / index2). The piece of DNA the sequencer operated on.
- Segment – one ‘read’ from a fragment. Segment streams are named in the [input] section (commonly read1, read2, index1, etc.). Many steps operate on a specific segment.
- Tag – metadata derived from a fragment and stored under a label; later steps may consume, modify, or filter on it.
- Step – an entry in the [[step]] array that mutates, filters, validates, or reports on fragments.
Parameterisation #
Pipelines live in a TOML document. Steps execute top-to-bottom, and you may repeat a step type any number of times (for example, collect a report both before and after filtering).
Values in the TOML file are explicit by design. Where defaults exist, they are documented in the reference.
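The top-to-bottom execution model can be illustrated with a minimal Python sketch. This is not how mbf-fastq-processor is implemented: the dict-based fragment model and the names `count_report`, `min_length_filter`, and `run_pipeline` are hypothetical, chosen only to show the same step type running once before and once after a filter.

```python
from typing import Callable, Dict, List

# Hypothetical model: a fragment maps segment names to sequences.
Fragment = Dict[str, str]
Step = Callable[[List[Fragment]], List[Fragment]]


def count_report(label: str, counts: Dict[str, int]) -> Step:
    """Return a step that records how many fragments it saw, unchanged."""
    def step(fragments: List[Fragment]) -> List[Fragment]:
        counts[label] = len(fragments)
        return fragments
    return step


def min_length_filter(segment: str, min_len: int) -> Step:
    """Return a step that keeps fragments whose segment is long enough."""
    def step(fragments: List[Fragment]) -> List[Fragment]:
        return [f for f in fragments if len(f[segment]) >= min_len]
    return step


def run_pipeline(fragments: List[Fragment], steps: List[Step]) -> List[Fragment]:
    """Apply the steps strictly in the order they are listed."""
    for step in steps:
        fragments = step(fragments)
    return fragments


counts: Dict[str, int] = {}
steps = [
    count_report("before", counts),   # sees every fragment
    min_length_filter("read1", 4),    # drops fragments with a short read1
    count_report("after", counts),    # sees only the survivors
]
fragments = [{"read1": "ACGTAC"}, {"read1": "AC"}, {"read1": "ACGT"}]
kept = run_pipeline(fragments, steps)
print(counts)  # {'before': 3, 'after': 2}
```

Swapping the order of the list entries changes what each report observes, which is why step order is part of the configuration rather than an implementation detail.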
Input files #
mbf-fastq-processor reads uncompressed, gzipped, or zstd-compressed FastQ files. Multiple files can be concatenated per segment. Every segment must supply the same number of reads to preserve fragment pairing.
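To see why equal read counts matter, consider a rough Python sketch of per-segment concatenation followed by positional pairing. The function `pair_segments` and its in-memory inputs are assumptions made for illustration; the actual tool reads (possibly compressed) files and may report mismatches differently.

```python
from itertools import chain, zip_longest
from typing import Dict, Iterable, Iterator, List

_MISSING = object()  # sentinel marking a segment that ran out of reads


def pair_segments(
    sources: Dict[str, List[Iterable[str]]],
) -> Iterator[Dict[str, str]]:
    """Concatenate the sources of each segment, then pair reads by position.

    `sources` maps a segment name to a list of read iterables (one per file).
    Raises ValueError if the segments do not yield the same number of reads.
    """
    names = list(sources)
    streams = [chain.from_iterable(sources[name]) for name in names]
    for index, reads in enumerate(zip_longest(*streams, fillvalue=_MISSING)):
        if any(read is _MISSING for read in reads):
            raise ValueError(f"segment read counts differ at fragment {index}")
        yield dict(zip(names, reads))


fragments = list(pair_segments({
    "read1": [["r1_a", "r1_b"], ["r1_c"]],  # two "files", concatenated
    "read2": [["r2_a"], ["r2_b", "r2_c"]],  # different split, same total
}))
print(fragments[2])  # {'read1': 'r1_c', 'read2': 'r2_c'}
```

Note that the individual files of a segment may differ in size from their counterparts; only the per-segment totals have to agree in this sketch.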
Interleaved FastQ files are also supported: declare a single source and enumerate segment names via interleaved = [...] (see the input section reference).
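Conceptually, de-interleaving groups consecutive records of one stream into fragments. The sketch below assumes that records cycle through the segment names in the order they are listed; `deinterleave` is a hypothetical helper, not part of mbf-fastq-processor.

```python
from typing import Dict, Iterable, Iterator, List


def deinterleave(
    records: Iterable[str], segment_names: List[str]
) -> Iterator[Dict[str, str]]:
    """Group consecutive records into fragments, one record per segment name."""
    batch: List[str] = []
    for record in records:
        batch.append(record)
        if len(batch) == len(segment_names):
            yield dict(zip(segment_names, batch))
            batch = []
    if batch:
        # A trailing partial group means the source is truncated or mis-declared.
        raise ValueError("interleaved source ended in the middle of a fragment")


stream = ["frag1/1", "frag1/2", "frag2/1", "frag2/2"]
fragments = list(deinterleave(stream, ["read1", "read2"]))
print(fragments)
```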
FastQ files should comply with the format described by Cock et al. Data on the + line is ignored during parsing (and hence omitted from outputs).
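The handling of the + line can be pictured with a small Python sketch. It assumes simple four-line records (no wrapped sequences) and uses hypothetical names (`Read`, `parse_fastq`, `format_fastq`); it is an illustration of the described behaviour, not the tool's parser.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class Read(NamedTuple):
    name: str
    sequence: str
    qualities: str


def parse_fastq(handle: TextIO) -> Iterator[Read]:
    """Yield reads from four-line FastQ records, discarding '+' line data."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qualities = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FastQ record near {header!r}")
        if len(sequence) != len(qualities):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Anything after '+' is intentionally not stored.
        yield Read(header[1:], sequence, qualities)


def format_fastq(read: Read) -> str:
    """Serialise a read with a bare '+' separator line."""
    return f"@{read.name}\n{read.sequence}\n+\n{read.qualities}\n"


text = "@read_a\nACGT\n+read_a\nIIII\n"
reads = list(parse_fastq(io.StringIO(text)))
print(format_fastq(reads[0]), end="")
```

The round trip keeps name, sequence, and qualities, while the repeated name on the + line is dropped.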
Output files #
Output filenames derive from the configured prefix plus segment names (for example, output_read1.fq.gz). Interleaved outputs use interleaved as a segment name.
Reports use prefix.html / prefix.json. Additional artifacts such as checksums or per-barcode files are controlled by specific steps and [options] entries.
See the output section reference for supported formats and modifiers.
Steps and targets #
Every step sees whole fragments so paired segments stay in lock-step: if you filter a fragment based on read1, the associated read2 and any index reads disappear alongside it.
Many steps accept a segment argument to restrict their work to a specific input stream, while still retaining awareness of the whole fragment.
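A minimal Python sketch of this lock-step behaviour, under the assumption that a fragment can be modelled as a dict of segment name to sequence (`filter_fragments` is a hypothetical helper): the decision looks at one segment only, but the whole fragment is kept or dropped.

```python
from typing import Callable, Dict, List

Fragment = Dict[str, str]


def filter_fragments(
    fragments: List[Fragment], segment: str, keep: Callable[[str], bool]
) -> List[Fragment]:
    """Decide using one segment, but keep or drop the whole fragment."""
    return [fragment for fragment in fragments if keep(fragment[segment])]


fragments = [
    {"read1": "ACGTACGT", "read2": "TTGCA", "index1": "AAAA"},
    {"read1": "NNNNNNNN", "read2": "GGCAT", "index1": "CCCC"},
]
# Drop fragments whose read1 contains an N; read2 and index1 go with it.
kept = filter_fragments(fragments, "read1", lambda seq: "N" not in seq)
print(kept)
```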
Tag-generating steps must be paired with consumers: mbf-fastq-processor will error if a label is produced but never used, which helps you catch typos early.
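The idea behind this check can be sketched in a few lines of Python. The step dictionaries and the keys `produces` and `consumes` are hypothetical stand-ins for illustration and are not the tool's configuration keys.

```python
from typing import Dict, List, Set


def unused_labels(steps: List[Dict[str, str]]) -> Set[str]:
    """Return labels that some step produces but no step consumes."""
    produced = {s["produces"] for s in steps if "produces" in s}
    consumed = {s["consumes"] for s in steps if "consumes" in s}
    return produced - consumed


steps = [
    {"name": "extract", "produces": "barcode"},
    {"name": "filter", "consumes": "barcod"},  # typo: never matches 'barcode'
]
leftover = unused_labels(steps)
print(leftover)  # {'barcode'}
```

Because the misspelled label leaves `barcode` without a consumer, a validation pass of this kind would flag the configuration before any data is processed.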
Demultiplexing #
Demultiplexing splits the fragment stream into multiple outputs, e.g. on ‘barcodes’, on length, or on any tag.
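As a rough illustration, demultiplexing amounts to bucketing fragments by a key. In the Python sketch below, `demultiplex` and the key functions are hypothetical, and the barcode is simply read from an `index1` segment rather than from a tag.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List

Fragment = Dict[str, str]


def demultiplex(
    fragments: Iterable[Fragment], key: Callable[[Fragment], Hashable]
) -> Dict[Hashable, List[Fragment]]:
    """Split one fragment stream into buckets chosen by `key`."""
    buckets: Dict[Hashable, List[Fragment]] = defaultdict(list)
    for fragment in fragments:
        buckets[key(fragment)].append(fragment)
    return dict(buckets)


fragments = [
    {"read1": "ACGTTT", "index1": "AAAA"},
    {"read1": "GG", "index1": "CCCC"},
    {"read1": "TTGACA", "index1": "AAAA"},
]
by_barcode = demultiplex(fragments, lambda f: f["index1"])
by_length = demultiplex(
    fragments, lambda f: "long" if len(f["read1"]) >= 4 else "short"
)
print(sorted(by_barcode))  # ['AAAA', 'CCCC']
```

Each bucket would correspond to its own set of output files; whole fragments are routed together, so segment pairing is preserved within every bucket.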
Further reading #
Continue with the Reference for exhaustive configuration details, or explore integration scenarios in the How-To collection.