Reference on mbf-fastq-processor documentation

Mon, 01 Jan 0001 00:00:00 +0000

Command line interface #

mbf-fastq-processor is configured exclusively through a TOML document. The CLI is therefore intentionally minimal and focuses on selecting the configuration and the working directory.

Usage #

mbf-fastq-processor process [config.toml] [--allow-overwrite]
mbf-fastq-processor template
mbf-fastq-processor interactive [config.toml]

Process #

Process FASTQ as described in <config.toml> (see the TOML format reference). Relative paths are resolved against the current shell directory.

The config.toml argument can be omitted if and only if there is exactly one .toml file in the current directory and it contains both an [input] and an [output] section.

Input section #

The [input] table enumerates all read sources that make up a fragment. At least one segment must be declared.

[input]
 read1 = ['fileA_1.fastq', 'fileB_1.fastq.gz', 'fileC_1.fastq.zst'] # required: one or more paths
 read2 = "fileA_2.fastq.gz" # optional
 index1 = ['index1_A.fastq.gz'] # optional
 # interleaved = [...] # optional, see below

Key	Required	Value type	Notes
segment name (e.g. `read1`)	Yes (at least one)	string or array of strings	Each unique key defines a segment; arrays concatenate multiple files in order.
`interleaved`	No	array of strings	Enables interleaved reading; must list segment names in their in-file order.
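
To illustrate why `interleaved` must list segment names in their in-file order, here is a minimal sketch of de-interleaving. It assumes the usual meaning of an interleaved file (records of one fragment stored consecutively, one record per segment); the function name `deinterleave` and the use of plain strings as records are hypothetical and not taken from the tool.

```python
from itertools import islice


def deinterleave(records, segment_names):
    """Group a flat stream of records into fragments keyed by segment name.

    The i-th record of each group is assigned to segment_names[i], so the
    order of segment_names has to match the order inside the file.
    """
    n = len(segment_names)
    iterator = iter(records)
    while True:
        group = list(islice(iterator, n))
        if not group:
            return  # input exhausted on a fragment boundary
        if len(group) != n:
            raise ValueError("record count is not a multiple of the segment count")
        yield dict(zip(segment_names, group))


fragments = list(deinterleave(["a/1", "a/2", "b/1", "b/2"], ["read1", "read2"]))
# fragments[0] == {"read1": "a/1", "read2": "a/2"}
```

Swapping the names to `["read2", "read1"]` would silently assign every record to the wrong segment, which is why the order matters.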

Additional points:

Output section #

The [output] table controls how transformed reads and reporting artefacts are written.

[output]
 prefix = "output" # required.
 format = "Fastq" # (optional) output format, defaults to 'Fastq'
 # Valid values are: Fastq, Fasta, Bam and None (for no sequence output)
 compression = "Gzip" # Raw | Uncompressed | Gzip | Zstd | None (default: Raw)
 suffix = ".fq.gz" # optional override; inferred from format when omitted
 compression_level = 6 # gzip: 0-9, zstd: 1-22, bam: 0-9 (BGZF); defaults are gzip=6, zstd=5
 ix_separator = "_" # optional separator between prefix, infixes, and segments. Defaults to '_'

 report_json = false # write prefix.json
 report_html = true # write prefix.html

 output = ["read1", "read2"] # limit which segments become FASTQ files
 interleave = false # emit a single interleaved FASTQ
 stdout = false # stream to stdout instead of files
 chunk_size = 100000 # Write multiple, numbered output files, each a maximum of chunk_size reads/molecules.

 output_hash_uncompressed = false
 output_hash_compressed = false

Key	Default	Description
`prefix`	`"output"`	Base name for all files produced by the run.
`format`	`"Fastq"`	Output format. Valid values are: `Fastq`, `Fasta`, `Bam`, and `None` (for no sequence output).
`compression`	`"Uncompressed"`	Compression format for read outputs. Valid values are: `Gzip`, `Zstd`, `Uncompressed` (alias: `"Raw"`). Must not be set when `format` is `Bam`.
`suffix`	derived from format	Override file extension when interop with other tooling demands a specific suffix.
`compression_level`	gzip: 6, zstd: 5	Fine-tune compression effort. Ignored for `Raw`/`None`. `Bam` maps directly to the BGZF level (0–9).
`report_json` / `report_html`	`false`	Toggle structured or interactive reports.
`output`	all input segments	Restrict the subset of segments written to disk. Use an empty list to suppress FASTQs while still running steps that depend on fragment data.
`interleave`	`false`	Generate a single interleaved FASTQ (`{prefix}_interleaved.fq*`).
`stdout`	`false`	Write to stdout. Forces `compression = "Raw"`. Sets `interleave = true` if more than one segment is listed in `output`.
`output_hash_uncompressed` / `output_hash_compressed`	`false`	Emit SHA-256 checksums.
`ix_separator`	`"_"`	Separator inserted between `prefix`, any infix (demultiplex labels, inspect names, etc.), and segment names.
`chunk_size`	(unlimited)	Split outputs into multiple files, each containing at most `chunk_size` reads/molecules. Non-interleaved files hold at most `chunk_size` reads; interleaved files hold at most `chunk_size` molecules. As a result, mixing interleaved and non-interleaved output yields the same number of files for both. Files are numbered sequentially, e.g. `output_read1_0.fq.gz`, … Numbers start at 0 and use the minimum number of (base 10) digits necessary for alphabetical sorting; already produced files are renamed whenever an extra digit is needed.
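
The chunk numbering described for `chunk_size` can be sketched like this. It is an illustration under one assumption: "minimum number of digits necessary for alphabetical sorting" is read as zero-padding every number to the width of the largest one. The function names are hypothetical.

```python
import math


def chunk_count(total_reads, chunk_size):
    """Number of files needed when each holds at most chunk_size reads."""
    return math.ceil(total_reads / chunk_size)


def chunk_labels(total_chunks):
    """Labels 0..total_chunks-1, zero-padded so they sort alphabetically."""
    if total_chunks < 1:
        return []
    width = len(str(total_chunks - 1))  # digits of the largest number
    return [str(i).zfill(width) for i in range(total_chunks)]


labels = chunk_labels(chunk_count(1_050_000, 100_000))
# 11 chunks: labels run from "00" to "10"
```

With ten or fewer chunks a single digit suffices; the eleventh chunk forces a second digit, which is the point at which earlier files would have to be renamed.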

Generated filenames join these components with ix_separator (default _), e.g. {prefix}_{segment}{suffix}. Interleaving replaces segment with interleaved; demultiplexing adds per-barcode infixes before the segment. Checksums use .uncompressed.sha256 or .compressed.sha256 suffixes.
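
The naming scheme above can be summarised in a short sketch. The function `output_filename` and its parameters are hypothetical; the position of the chunk number follows the `output_read1_0.fq.gz` example in the table, and the real tool may differ in details.

```python
def output_filename(prefix, segment, suffix, ix_separator="_",
                    infix=None, chunk=None, interleave=False):
    """Join prefix, optional infix, segment and optional chunk number."""
    parts = [prefix]
    if infix is not None:
        parts.append(infix)  # e.g. a demultiplex barcode name
    # Interleaving replaces the segment name with the word "interleaved".
    parts.append("interleaved" if interleave else segment)
    if chunk is not None:
        parts.append(str(chunk))  # only present when chunk_size is set
    return ix_separator.join(parts) + suffix


print(output_filename("output", "read1", ".fq.gz"))
# output_read1.fq.gz
print(output_filename("output", "read1", ".fq.gz", infix="sample-1"))
# output_sample-1_read1.fq.gz
```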

Demultiplexed output #

Demultiplex is a magic transformation that forks the output.

You receive one set of output files per barcode (combination) defined.

Transformations downstream are (virtually) duplicated, so you can, for example, filter to the head reads in each barcode and get reports for both all reads and each separate barcode.

Demultiplexing can be done on barcodes, or on boolean tags.

Based on barcodes #

[[step]]
 action = "Demultiplex"
 in_label = "mytag"
 barcodes = "mybarcodes"
 output_unmatched = true # if set, write reads not matching any barcode
 # to a file like output_prefix_no-barcode_1.fq

[barcodes.mybarcodes] # can appear before or after the step that uses it.
# separate multiple regions with a _
# a mapping of barcode -> output name.
AAAAAA_CCCCCC = "sample-1" # output files are named prefix{ix_separator}barcode_prefix{ix_separator}segment.suffix
 # with the separator defaulting to '_', e.g. output_sample-1_1.fq.gz
 # or output_sample-1_report.fq.gz
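
How a barcode table of this shape routes a fragment can be sketched as below. This is a simplified illustration: it assumes exact string matching only (the tool's actual matching rules are not described here), and the function name `route_fragment` is hypothetical. The `no-barcode` label is taken from the comment in the example above.

```python
def route_fragment(regions, barcodes, output_unmatched=True):
    """Return the output name for the extracted barcode regions.

    regions:  list of region sequences, e.g. ["AAAAAA", "CCCCCC"]
    barcodes: mapping like the [barcodes.mybarcodes] table
    Returns None when the fragment should be dropped.
    """
    key = "_".join(regions)  # multiple regions are joined with "_"
    if key in barcodes:
        return barcodes[key]
    return "no-barcode" if output_unmatched else None


barcodes = {"AAAAAA_CCCCCC": "sample-1"}
print(route_fragment(["AAAAAA", "CCCCCC"], barcodes))  # sample-1
print(route_fragment(["GGGGGG", "CCCCCC"], barcodes))  # no-barcode
```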

Based on boolean tags #

[[step]]
 segment = "read1"
 action = "TagOtherFileByName"
 out_label = "a_bool_tag"
 filename = "path/to/boolean_tags.tsv"
 false_positive_rate = 0

[[step]]
 action = "Demultiplex"
 in_label = "a_bool_tag"

Note that this does not extract the barcodes from the read (use an extract step, such as ExtractRegion).

Options #

There is a small set of runtime knobs exposed under [options]. Most workflows can rely on the defaults.

[options]
 thread_count = -1
 block_size = 10000
 buffer_size = 102400
 accept_duplicate_files = false
 spot_check_read_pairing = true

Key	Default	Description
`thread_count`	`-1`	Worker threads for transformations. `-1` autotunes per CPU; most runtime is still dominated by decompression threads, so gains are modest.
`block_size`	`10000`	Number of fragments pulled per batch. Increase for very large runs when IO is abundant; decrease to reduce peak memory use.
`buffer_size`	`102400`	Initial bytes reserved per block. The allocator grows buffers on demand, so tuning is rarely necessary.
`accept_duplicate_files`	`false`	Permit the same path to appear multiple times across segments. Useful for fixtures or synthetic tests; keep disabled to catch accidental copy/paste errors.
`spot_check_read_pairing`	`true`	Sample every 1000th fragment to ensure paired reads still share a name prefix; disable when names are intentionally divergent or rely on `ValidateName` to customise the separator.

Changing these knobs can affect memory pressure and concurrency behaviour. Measure before and after if you deviate from defaults.
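
As an illustration of what `spot_check_read_pairing` guards against, here is a minimal sketch of a sampled name check. Several details are assumptions made for the example: the name prefix is taken to end at the first `/` or whitespace, sampling starts at fragment 0, and both function names are hypothetical.

```python
import re


def name_prefix(read_name):
    """Part of a read name before the first '/' or whitespace (assumed rule)."""
    return re.split(r"[/\s]", read_name, maxsplit=1)[0]


def spot_check_pairing(fragments, every=1000):
    """Check every `every`-th fragment; raise if its read names disagree.

    fragments: iterable of tuples holding one read name per segment.
    """
    for index, names in enumerate(fragments):
        if index % every != 0:
            continue  # only a sample is inspected, not every fragment
        if len({name_prefix(name) for name in names}) > 1:
            raise ValueError(f"fragment {index}: read names disagree: {names}")


spot_check_pairing([("frag1/1", "frag1/2"), ("frag2/1", "frag2/2")])  # passes
```

Because only a sample is inspected, a mismatch between sampled positions goes unnoticed; the check is meant to catch files that are out of step, not to validate every name.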

Out of scope #

Things mbf-fastq-processor will explicitly not do and that won’t be implemented.

Anything based on averaging phred scores #

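This covers anything that trims or filters based on the average quality in a sliding window. Phred scores are logarithmic (Q = -10 * log10(p), where p is the error probability), so their arithmetic mean does not correspond to the mean error probability. Arithmetic averaging of phred scores is wrong.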

See ExtractMeanQuality.
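
A small worked example shows how far the two kinds of average can diverge. It uses only the standard Phred definition Q = -10 * log10(p); the function names are hypothetical and nothing here describes how any mbf-fastq-processor step is implemented.

```python
import math


def phred_to_error(q):
    """Error probability for a Phred score."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred score for an error probability."""
    return -10 * math.log10(p)


def naive_mean_quality(scores):
    """Arithmetic mean of the Phred scores (the approach rejected above)."""
    return sum(scores) / len(scores)


def mean_error_quality(scores):
    """Phred score of the mean error probability."""
    mean_p = sum(phred_to_error(q) for q in scores) / len(scores)
    return error_to_phred(mean_p)


window = [40, 40, 40, 2]
print(naive_mean_quality(window))            # 30.5
print(round(mean_error_quality(window), 1))  # roughly 8
```

One very poor base dominates the expected error rate of the window, yet the arithmetic mean of the scores still looks like high quality.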

Corresponding options in other software #

Trimmomatic SLIDINGWINDOW
fastp --cut_front
fastp --cut_tail
fastp --cut_right

Fast5 #

Oxford Nanopore squiggle data. Apparently there is no formal spec. See https://medium.com/@shiansu/a-look-at-the-nanopore-fast5-format-f711999e2ff6

kallisto BUS format #

- a brief barcode/UMI format for single-cell RNA-seq
- needs an 'equivalence class', i.e. at least pseudo-alignment
- weird length restrictions on barcodes and UMIs (1(!)-32),
 but stores the length in a uint32...