Input section #
The [input] table enumerates all read sources that make up a fragment.
At least one segment must be declared.
[input]
read1 = ['fileA_1.fastq', 'fileB_1.fastq.gz', 'fileC_1.fastq.zst'] # required: one or more paths
read2 = "fileA_2.fastq.gz" # optional
index1 = ['index1_A.fastq.gz'] # optional
# interleaved = [...] # optional, see below
| Key | Required | Value type | Notes |
|---|---|---|---|
| segment name (e.g. read1) | Yes (at least one) | string or array of strings | Each unique key defines a segment; arrays concatenate multiple files in order. |
| interleaved | No | array of strings | Enables interleaved reading; must list segment names in their in-file order. |
Additional points:
- Segment names are user-defined and case sensitive. Common conventions include read1, read2, index1, and index2. They must conform to [a-zA-Z0-9_]+$.
- Compression is auto-detected by inspecting file headers.
- Supported file formats are FASTQ, FASTA, and BAM. See Input options below for format-specific settings.
- Every segment must provide the same number of reads. Cardinality mismatches raise a validation error.
- Multiple files per segment are concatenated virtually; the processor streams them sequentially.
- The names ‘All’ and ‘options’ cannot be used as segment names.
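
The naming rules above can be illustrated with a small Python sketch. This is not the processor's own validation code: the function name `is_valid_segment_name` is hypothetical, and the sketch assumes the pattern must match the entire name. Whether differently-cased variants of the reserved names are also rejected is not stated here, so the sketch only rejects the exact spellings.

```python
import re

# Assumption: the whole name must consist of these characters.
SEGMENT_NAME_RE = re.compile(r"[a-zA-Z0-9_]+")
# Reserved names listed in the documentation ('All' and 'options').
RESERVED_NAMES = {"All", "options"}


def is_valid_segment_name(name: str) -> bool:
    """Return True if `name` passes the naming rules sketched above."""
    if name in RESERVED_NAMES:
        return False
    # fullmatch rejects empty names and names with any other character.
    return SEGMENT_NAME_RE.fullmatch(name) is not None
```

For example, `is_valid_segment_name("read1")` is true, while `"read-1"`, `"All"`, and the empty string are rejected.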
File Formats #
mbf-fastq-processor supports FASTQ, FASTA, and BAM (aligned & unaligned) input formats.
Input options #
Format-specific behaviour is configured via the optional [input.options] table. These knobs are required when the corresponding file types are present and ignored otherwise.
[input]
read1 = ["reads.fasta"]
[input.options]
fasta_fake_quality = 'a' # required for FASTA inputs: synthetic Phred score to apply to every base. Used verbatim without further shifting.
bam_include_mapped = true # required for BAM inputs: include reads with a reference assignment
bam_include_unmapped = true # required for BAM inputs: include reads without a reference assignment
read_comment_char = ' ' # defaults to ' '. The character separating read name from the 'read comment'.
- fasta_fake_quality accepts a byte character or a number and is used verbatim. Stick to Phred+33 encoding, where ’!’ (byte 33) is the worst quality. The value must be supplied whenever any FASTA source is detected.
- bam_include_mapped and bam_include_unmapped must both be defined when reading BAM files. At least one of them has to be true; disabling both would discard every record.
- Format detection is automatic and based on magic bytes: BAM (BAM\x01), FASTA (>), and FASTQ (@).
- The read_comment_char is used for input reads (e.g. when using TagDeduplicate with a name: source). The output steps (StoreTagInComment, StoreTagLocationInComment) default to this setting, but allow overwriting.
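
To make the magic-byte detection and the verbatim quality byte concrete, here is a minimal Python sketch. It is an illustration only, not the processor's implementation. The function names `detect_format` and `fake_quality_line` are hypothetical, and the sketch assumes it is handed the first bytes of an already decompressed stream (BAM files are BGZF-compressed on disk, so the BAM\x01 magic only becomes visible after decompression).

```python
def detect_format(head: bytes) -> str:
    """Guess the format from the first bytes of a decompressed stream."""
    if head.startswith(b"BAM\x01"):
        return "bam"
    if head.startswith(b">"):
        return "fasta"
    if head.startswith(b"@"):
        return "fastq"
    # Assumption: anything else is treated as an error in this sketch.
    raise ValueError("unrecognised input format")


def fake_quality_line(value, seq_len: int) -> bytes:
    """Repeat the configured quality byte verbatim, once per base.

    `value` may be a one-character string or a number, mirroring the
    two accepted forms of fasta_fake_quality. No offset is added.
    """
    byte = value if isinstance(value, int) else ord(value)
    return bytes([byte]) * seq_len
```

With these definitions, `fake_quality_line('a', 4)` gives `b"aaaa"` and `fake_quality_line(33, 3)` gives `b"!!!"`, showing that a number and a character are two spellings of the same byte.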
Interleaved input #
Some datasets store all segments in a single file. Activate interleaved mode and describe how the segments are ordered:
[input]
source = ['interleaved.fq'] # this 'virtual' segment will not be available for steps downstream
interleaved = ["read1", "read2", "index1", "index2"]
Rules for interleaving:
- The [input] table must contain exactly one data source when interleaved is present.
- The interleaved list dictates how reads are grouped into fragments. The length of the list equals the number of segments.
- Downstream steps reference the declared segment names exactly as written in the list.
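
The grouping rule can be sketched in Python as follows. This is a conceptual illustration, not the processor's reader: `group_interleaved` is a hypothetical helper, records are represented as plain strings, and raising an error on an incomplete trailing group is an assumption of the sketch rather than documented behaviour.

```python
from typing import Dict, Iterable, Iterator, List


def group_interleaved(
    records: Iterable[str], segments: List[str]
) -> Iterator[Dict[str, str]]:
    """Yield one fragment per len(segments) consecutive records.

    Each fragment maps a segment name to its record, following the
    order given in the `interleaved` list.
    """
    batch: List[str] = []
    for record in records:
        batch.append(record)
        if len(batch) == len(segments):
            yield dict(zip(segments, batch))
            batch = []
    if batch:
        # Assumption: a partial final fragment is treated as an error.
        raise ValueError("record count is not a multiple of the segment count")
```

With `segments = ["read1", "read2"]`, the records `a/1, a/2, b/1, b/2` become two fragments, each with a read1 and a read2 entry.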
Automatic segment (pair) name checking. #
By default, if multiple segments are defined, every 1000th read pair is checked to confirm that the read name prefixes (everything up to the first /) match, which guards against mispaired reads.
This assumes Illumina-style read names ending in, for example, ‘/1’ and ‘/2’.
The automatic check can be disabled with:
[options]
spot_check_read_pairing = false
To influence the character that delimits the read name prefix, or the sampling rate,
add an explicit SpotCheckReadPairing step.
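
The idea behind the spot check can be sketched in Python. This is a simplified illustration under stated assumptions, not the processor's code: `name_prefix` and `spot_check_pairing` are hypothetical names, and exactly which pairs are sampled (here the 1000th, 2000th, and so on) is an assumption.

```python
def name_prefix(name: str, delimiter: str = "/") -> str:
    """Return the read name up to, but not including, the first delimiter."""
    return name.split(delimiter, 1)[0]


def spot_check_pairing(names1, names2, every: int = 1000) -> None:
    """Compare name prefixes for every `every`-th pair; raise on mismatch."""
    for i, (n1, n2) in enumerate(zip(names1, names2), start=1):
        # Only sampled pairs are compared, which keeps the check cheap.
        if i % every == 0 and name_prefix(n1) != name_prefix(n2):
            raise ValueError(f"pair {i}: {n1!r} and {n2!r} do not match")
```

Because only a sample is inspected, a mismatch in an unsampled pair goes unnoticed in this sketch; lowering `every` trades speed for coverage.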
Named pipe input #
Input files may be named pipes (FIFOs), but only FASTQ-formatted data is supported in that case.
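
If it is unclear whether a given path is a named pipe, a check along the following lines may help before wiring it into a configuration. This is a generic Python sketch using the standard library, unrelated to the processor's internals; `is_named_pipe` is a hypothetical helper name.

```python
import os
import stat


def is_named_pipe(path: str) -> bool:
    """Return True if `path` exists and is a FIFO (named pipe)."""
    return stat.S_ISFIFO(os.stat(path).st_mode)
```

A regular FASTQ file yields false here, while a path created with mkfifo yields true.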