Input Section

Input section #

The [input] table enumerates all read sources that make up a fragment. At least one segment must be declared.

[input]
    read1 = ['fileA_1.fastq', 'fileB_1.fastq.gz', 'fileC_1.fastq.zst'] # required: one or more paths
    read2 = "fileA_2.fastq.gz"                                      # optional
    index1 = ['index1_A.fastq.gz']                                   # optional
    # interleaved = [...]                                            # optional, see below

| Key | Required | Value type | Notes |
|-----|----------|------------|-------|
| segment name (e.g. read1) | Yes (at least one) | string or array of strings | Each unique key defines a segment; arrays concatenate multiple files in order. |
| interleaved | No | array of strings | Enables interleaved reading; must list segment names in their in-file order. |

Additional points:

  • Segment names are user-defined and case-sensitive. Common conventions include read1, read2, index1, and index2. They must conform to [a-zA-Z0-9_]+$ (letters, digits, and underscores).
  • Compression is auto-detected by inspecting file headers.
  • Supported file formats are FASTQ, FASTA, and BAM. See Input options below for format-specific settings.
  • Every segment must provide the same number of reads. Cardinality mismatches raise a validation error.
  • Multiple files per segment are concatenated virtually; the processor streams them sequentially.
  • The names ‘All’ and ‘options’ cannot be used as segment names.
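
The naming rules above can be sketched as a small check. This is an illustrative Python sketch, not the processor's own validation code: the function name validate_segment_names is hypothetical, the pattern is assumed to apply to the whole name, and the reserved names are assumed to be compared exactly as written.

```python
import re

# Pattern quoted in the text; fullmatch() applies it to the whole name (assumption).
SEGMENT_NAME = re.compile(r"[a-zA-Z0-9_]+")
# Reserved names from the list above, compared exactly as written (assumption).
RESERVED = {"All", "options"}


def validate_segment_names(names):
    """Return a list of human-readable problems; an empty list means 'valid'."""
    problems = []
    if not names:
        problems.append("at least one segment must be declared")
    for name in names:
        if name in RESERVED:
            problems.append(f"{name!r} is reserved and cannot be a segment name")
        elif not SEGMENT_NAME.fullmatch(name):
            problems.append(f"{name!r} contains characters outside [a-zA-Z0-9_]")
    return problems


if __name__ == "__main__":
    print(validate_segment_names(["read1", "read-2", "options"]))
```

Because names are case-sensitive, read1 and Read1 would count as two different segments under this sketch.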

File Formats #

mbf-fastq-processor supports FASTQ, FASTA, and BAM (aligned and unaligned) input formats.

Input options #

Format-specific behaviour is configured via the optional [input.options] table. The FASTA and BAM options are required when the corresponding file types are present and ignored otherwise; read_comment_char always has a default.

[input]
    read1 = ["reads.fasta"]

[input.options]
    fasta_fake_quality = 'a'        # required for FASTA inputs: synthetic Phred score to apply to every base. Used verbatim without further shifting.
    bam_include_mapped = true      # required for BAM inputs: include reads with a reference assignment
    bam_include_unmapped = true    # required for BAM inputs: include reads without a reference assignment
    read_comment_char = ' '        # defaults to ' '. The character separating the read name from the 'read comment'.

  • fasta_fake_quality accepts a byte character or a number and is used verbatim, without further shifting, so supply a Phred+33 encoded value (‘!’ / 33 is the worst quality). It must be supplied whenever any FASTA source is detected.
  • bam_include_mapped and bam_include_unmapped must both be defined when reading BAM files. At least one of them has to be true; disabling both would discard every record.
  • Format detection is automatic and based on magic bytes: BAM (BAM\x01), FASTA (>), and FASTQ (@).
  • read_comment_char applies when parsing input read names (e.g. when TagDeduplicate is used with a name: source). The output steps (StoreTagInComment, StoreTagLocationInComment) default to this setting but allow overriding it.
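
To make the option rules concrete, here is a minimal Python sketch of the behaviour described above. It is not the processor's implementation: the function names detect_format, fake_quality_line, and check_bam_options are hypothetical, and the bytes passed to detect_format are assumed to be already decompressed.

```python
def detect_format(first_bytes):
    """Classify input by the magic bytes listed above (decompressed data assumed)."""
    if first_bytes.startswith(b"BAM\x01"):
        return "bam"
    if first_bytes.startswith(b">"):
        return "fasta"
    if first_bytes.startswith(b"@"):
        return "fastq"
    raise ValueError("unrecognised input format")


def fake_quality_line(sequence, fake_quality):
    """Repeat the configured quality byte once per base, with no shifting."""
    if isinstance(fake_quality, int):
        byte = bytes([fake_quality])          # e.g. 33 -> b'!'
    else:
        byte = fake_quality.encode("ascii")   # e.g. 'a' -> b'a'
    if len(byte) != 1:
        raise ValueError("fasta_fake_quality must be a single byte")
    return byte * len(sequence)


def check_bam_options(options):
    """Both BAM keys must be present and at least one must be true."""
    keys = ("bam_include_mapped", "bam_include_unmapped")
    for key in keys:
        if key not in options:
            raise ValueError(f"{key} must be set when reading BAM input")
    if not any(options[key] for key in keys):
        raise ValueError("disabling both BAM options would discard every record")


if __name__ == "__main__":
    print(detect_format(b">seq1\nACGT\n"))
    print(fake_quality_line("ACGT", "a"))
```

With fasta_fake_quality = 'a' as in the configuration above, every base in this sketch receives the byte a; passing the number 33 instead yields ! for every base.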

Interleaved input #

Some datasets store all segments in a single file. Activate interleaved mode and describe how the segments are ordered:

[input]
    source = ['interleaved.fq'] # this 'virtual' segment will not be available for steps downstream
    interleaved = ["read1", "read2", "index1", "index2"]

Rules for interleaving:

  • The [input] table must contain exactly one data source when interleaved is present.
  • The interleaved list dictates how reads are grouped into fragments. The length of the list equals the number of segments.
  • Downstream steps reference the declared segment names exactly as written in the list.
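
The grouping rule can be illustrated with a short Python sketch. It assumes records arrive as an already-parsed sequence and that a trailing incomplete group is an error; the function name group_fragments is hypothetical and not part of the tool.

```python
from itertools import islice


def group_fragments(records, segment_names):
    """Yield one dict per fragment, mapping each segment name to its record.

    Records are consumed in file order, len(segment_names) at a time.
    """
    size = len(segment_names)
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        if len(chunk) != size:
            # Assumption: a partial final fragment is treated as invalid input.
            raise ValueError("record count is not a multiple of the segment count")
        yield dict(zip(segment_names, chunk))


if __name__ == "__main__":
    reads = ["a/1", "a/2", "b/1", "b/2"]
    for fragment in group_fragments(reads, ["read1", "read2"]):
        print(fragment)
```

With interleaved = ["read1", "read2", "index1", "index2"], every four consecutive records would form one fragment in this sketch.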

Automatic segment (pair) name checking #

By default, if multiple segments are defined, every 1000th read pair is checked to confirm that the read name prefixes (everything up to the first /) match, which verifies that the reads are correctly paired.

This assumes Illumina-style read names ending in, for example, ‘/1’ and ‘/2’.

The automatic check can be disabled with:

[options]
    spot_check_read_pairing = false

To change the character that delimits the read name prefix, or the sampling rate, add an explicit SpotCheckReadPairing step.
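
As an illustration of the check described in this section, the following Python sketch compares name prefixes on a sample of fragments. It is not the tool's code: the function names are hypothetical, and which fragments count as "every 1000th" (here indices 0, 1000, 2000, ...) is an assumption.

```python
def name_prefix(name, delimiter="/"):
    """Return the read name up to (not including) the first delimiter."""
    return name.split(delimiter, 1)[0]


def spot_check_pairing(fragments, every=1000, delimiter="/"):
    """Raise ValueError if a sampled fragment has mismatching name prefixes.

    fragments: iterable of tuples holding one read name per segment.
    """
    for index, names in enumerate(fragments):
        if index % every != 0:
            continue  # only a sample is inspected, keeping the check cheap
        prefixes = {name_prefix(name, delimiter) for name in names}
        if len(prefixes) > 1:
            raise ValueError(f"read names differ in fragment {index}: {names}")


if __name__ == "__main__":
    spot_check_pairing([("read7/1", "read7/2"), ("read8/1", "read8/2")])
    print("pairing looks consistent")
```

Sampling means a mismatch between two checked fragments can go unnoticed; the check is a cheap safeguard against swapped or truncated files rather than a full validation.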

Named pipe input #

Input files may be named pipes (FIFOs), but only FASTQ-formatted data is supported in that case.
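
A minimal Python sketch of how a producer might feed FASTQ records into a FIFO on a POSIX system. The path and helper names are hypothetical, and the reader in demo() merely stands in for the processor, whose invocation is not shown here.

```python
import os
import tempfile
import threading


def feed_fifo(path, records):
    """Write (name, sequence, quality) tuples to the pipe as FASTQ records."""
    with open(path, "w") as fifo:  # blocks until a reader opens the pipe
        for name, sequence, quality in records:
            fifo.write(f"@{name}\n{sequence}\n+\n{quality}\n")


def demo():
    workdir = tempfile.mkdtemp()
    fifo_path = os.path.join(workdir, "reads.fastq")  # hypothetical path
    os.mkfifo(fifo_path)
    writer = threading.Thread(
        target=feed_fifo, args=(fifo_path, [("r1", "ACGT", "IIII")])
    )
    writer.start()
    with open(fifo_path) as reader:  # stand-in for the consuming process
        data = reader.read()
    writer.join()
    os.remove(fifo_path)
    os.rmdir(workdir)
    return data


if __name__ == "__main__":
    print(demo(), end="")
```

In practice the FIFO path would be listed in [input] like any other file, with the producer running as a separate process.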