TOML format

TOML file format #

mbf-fastq-processor pipelines are defined in a single TOML 1.0 formatted document. The format favours explicitness: every field is named, order is preserved, and unknown keys are rejected with a descriptive error.

Canonical template #

The repository maintains an authoritative configuration scaffold (the same content emitted by `mbf-fastq-processor template`).

Its contents are included below for reference and for easy consumption by an LLM.

Structure overview #

| Section           | Required    | Purpose                                                   |
|-------------------|-------------|-----------------------------------------------------------|
| `[input]`         | Yes         | Declare the FASTQ segments and associated source files    |
| `[input.options]` | Conditional | Configure format-specific input toggles (FASTA/BAM)       |
| `[output]`        | Yes         | Configure how processed reads and reports are written     |
| `[[step]]`        | No*         | Define transformations, filters, tag operations, reports  |
| `[options]`       | No          | Tune runtime knobs such as buffer sizes                   |
| `[barcodes.*]`    | Conditional | Supply barcode tables for demultiplexing                  |

[[step]] entries are optional in the technical sense—an empty pipeline simply copies data between input and output—but in practice most configurations contain at least one transformation or report.
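
For orientation, here is a sketch of how these sections fit together. The values are illustrative, not a complete working pipeline; real keys are covered in the sections and template below.

[input]
    read1 = "sample_1.fq"

[input.options]
    # format-specific toggles, only needed for FASTA/BAM input

[output]
    prefix = "out"

[[step]]
    action = "CutStart"
    segment = "read1"
    n = 3

[options]
    block_size = 10000

[barcodes.example] # referenced by name from a Demultiplex step
    AAAA = "sample-1"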

Minimal example #

[input]
    read1 = "file_1.fq"
    read2 = ["file_2.fq"] # lists concatenate multiple files for a segment

[output]
    prefix = "processed"
    format = "Raw" # or Gzip/Zstd/Bam/None

[[step]]
    action = "CutStart"
    segment = "read1"
    n = 3

Key rules:

  • Steps execute top-to-bottom, exactly as written.
  • Field names are case-insensitive when matching segments, but consistent casing improves readability.
  • Arrays of tables ([[step]]) must come after their shared configuration; intervening scalar keys apply to the most recent table (see the sketch below).
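
For illustration, a minimal sketch of how keys bind to the most recent [[step]] table:

[[step]]
    action = "CutStart"
    segment = "read1"
    n = 3 # belongs to the CutStart step above

[[step]]
    action = "CutEnd"
    segment = "read1"
    n = 10 # belongs to the CutEnd step, the most recent table header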

Refer to the Input section and Output section pages for exhaustive key listings, supported compression formats, and compatibility notes.

Additional tables #

Some steps require additional tables outside the main [[step]] list—for example Demultiplex expects [barcodes.<name>] definitions. Place those tables anywhere in the file; they are parsed before execution begins, so forward and backward references are both valid.
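
A sketch (the Demultiplex keys are explained in the template below):

[[step]]
    action = "Demultiplex"
    in_label = "mytag"
    barcodes = "mybarcodes"

[barcodes.mybarcodes] # could equally be placed before the step
    AAAAAA = "sample-1"
    TTTTTT = "sample-2"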

Comments and formatting #

TOML supports # comments. Leverage them to annotate why a step exists or to document barcode provenance. The parser enforces strict validation: spelling mistakes such as actionn = "CutStart" will cause an immediate error instead of being silently ignored.
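
For example, this step fails fast because of the misspelled key:

[[step]]
    actionn = "CutStart" # typo: rejected with an error naming the unknown key
    segment = "read1"
    n = 3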

Why TOML? #

We deliberately avoided deep CLI flag hierarchies and configuration formats without comments. TOML offers ordered arrays for sequencing steps, nested tables for barcode definitions, and human-friendly syntax that is widely adopted in both Python and Rust ecosystems.

Curious about complex structures? The Demultiplex reference showcases nested tables and arrays combined with the TOML array-of-tables syntax.

Maximal example (template) #


# mbf-fastq-processor Configuration Template

# This template includes all available transformation steps with explanations
# it is therefore very comprehensive and long.

# To get started, use mbf-fastq-processor cookbooks instead.

# == Input ==
# From input section documentation:
[input]
    read1 = ['fileA_1.fastq', 'fileB_1.fastq.gz', 'fileC_1.fastq.zstd'] # one is required
    #read2 = ['fileA_2.fastq', 'fileB_2.fastq.gz', 'fileC_2.fastq.zstd'] # (optional)
    #index1 = ['index1_A.fastq', 'index1_B.fastq.gz', 'index1_C.fastq.zstd'] # (optional)
    #index2 = ['index2_A.fastq', 'index2_B.fastq.gz', 'index2_C.fastq.zstd'] # (optional)
    # interleaved = [...] # Activates interleaved reading, see below

## A mapping from segment names to files to read.
## Compression is auto-detected.
## File format is auto-detected (see options below).
## The number of files per segment must match.

[input.options]
    # fasta_fake_quality = 30      # required for FASTA inputs: synthetic Phred score (0-93)
    # bam_include_mapped = true    # required for BAM inputs: keep reads with alignments
    # bam_include_unmapped = true  # required for BAM inputs: keep reads without alignments
    # read_comment_char = ' '      # defaults to ' '. The character separating the read name from the 'read comment'.

## Interleaved mode:
## specify exactly one key=[files] value, and set interleaved to ['read1','read2',...]
## Further steps use the segment names from interleaved
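
## Example sketch (the key holding the files is illustrative; downstream steps see 'read1'/'read2'):
# [input]
#     reads = ['interleaved.fastq.gz'] # exactly one key with the file list
#     interleaved = ['read1', 'read2']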

## By default, (paired) input reads are spot checked for their names matching.
## see options.spot_check_read_pairing at the end of this example.

# == Output ==

[output]
     prefix = "output" # files get named {prefix}_{segment}{suffix}, e.g. "output_read1.fq.gz"
     format = "Fastq", # (optional) output format, defaults to 'Fastq'
					  # Valid values are: Fastq, Fasta, BAM and None (for no sequence output, just reports)
     compression = "Gzip" # (optional), defaults to 'uncompressed'
                     # Valid values are uncompressed, Gzip, Zstd.
#     suffix = ".fq.gz" # optional, determined by the format if left off.
#     compression_level = 6 # optional compression level for gzip (0-9) or zstd (1-22)
                          # defaults: gzip=6, zstd=5

     report_json = true # (optional) write a json report file ($prefix.json)?
     report_html = true # (optional) write an interactive html report report file ($prefix.html)?

#     stdout = false # write read1 to stdout, do not produce other fastq files.
#                    # sets interleave to true (if Read2 is in input),
#                    # format to Raw
#                    # You still need to set a prefix for
#                    # Reports/keep_index/Inspect/QuantifyRegion(s)
#                    # Incompatible with a Progress Transform that's logging to stdout
#
#     interleave = false # (optional) interleave fastq output, producing
#                        # only a single output file for read1/read2
#                        # (with infix _interleaved instead of '_1', e.g. 'output_interleaved.fq.gz')
#     keep_index = false # (optional) write index segments to files as well?
#                        # (independent of the interleave setting)
#     output_hash_uncompressed = false # (optional) write a {prefix}_{1|2|i1|i2}.uncompressed.sha256
#                                    # with a hexdigest of the uncompressed data's sha256,
#                                    # similar to what sha256sum would do on the raw FASTQ
#     output_hash_compressed = false   # (optional) write a {prefix}_{1|2|i1|i2}.compressed.sha256
#                                    # with a hexdigest of the compressed output file's sha256,
#                                    # allowing verification with sha256sum on the actual output files
#     output = ["read1", "read2"] # (optional) which segments to write. Defaults to all segments defined in [input]. Set to an empty list to suppress output (equivalent to format="None").
#     ix_separator = "_" # (optional, default '_') separator inserted between prefix, infix, and segment names
#     Chunksize = 1_000_000 # (optional) maximum number of molecules per output file. When set, chunk indexes are appended to filenames.
#

# == Tagging ==

# Extract data from / about sequences.
# Tags get temporarily stored in memory under a 'label'
# and can then be used in other steps.

# There are three kinds of tags:
#  - location based string tags (think search query results)
#  - numeric tags (e.g. length, GC content)
#  - boolean tags (e.g. is this a duplicate?)
# All tags can be stored within the fastq or separately (see below)
# Filtering is available as FilterByTag, FilterByNumericTag, FilterByBoolTag

# === String tags ===

# ==== ExtractIUPAC ====
# # Extract an IUPAC string.
# [[step]]
#    action = "ExtractIUPAC"
#    out_label = "mytag"
#    search = 'CTN' # what we are searching for
#    max_mismatches = 1 # how many mismatches are allowed.
#    anchor = 'Anywhere' # Left | Right | Anywhere - Where to search.
                         # Left only matches at the start of the read, etc.
#    segment = "read1" # Any of your input segments

# ==== ExtractIUPACWithIndel ====
# # Extract an IUPAC string while allowing small insertions/deletions.
# [[step]]
#    action = "ExtractIUPACWithIndel"
#    out_label = "mytag"
#    search = 'CTN' # what we are searching for
#    max_mismatches = 1 # how many mismatches are allowed.
#    max_indel_bases = 1 # how many inserted or deleted bases are allowed in total.
#    max_total_edits = 2 # optional overall edit budget (mismatches + indels).
#    anchor = 'Anywhere' # Left | Right | Anywhere - Where to search.
#    segment = "read1" # Any of your input segments

# ==== ExtractIUPACSuffix ====
## Extract an IUPAC string at the end of a read.
## Only requires a configurable number of bases to match,
## i.e. can trim partially present adapters.

# [[step]]
#    action = "ExtractIUPACSuffix"
#    out_label = "mytag"
#    query = "AGTCA"  # the adapter to trim. Straigth bases only, no IUPAC.
#    segment = "read1"   # Any of your input segments (default: read1)
#    min_length = 3     # uint, the minimum length of match between the end of the read and
#                       # the start of the adapter
#    max_mismatches = 1 # How many mismatches to accept
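
## Example (sketch): remove the matched adapter suffix with TrimAtTag (documented below):
# [[step]]
#    action = "TrimAtTag"
#    in_label = "mytag"
#    direction = "End"
#    keep_tag = false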

# ==== ExtractRegex ====
## Extract a regexp result. Stores an empty string if not found.
# [[step]]
#    action = "ExtractRegex"
#    out_label = "mytag"
#    search = '^CT(..)CT'
#    replacement = "$1"  # standard regex replacement syntax
#    source = "read1" # An input segment (to read from sequence), or name:<segment> to read from a tag


# ==== ExtractRegion ====
## Extract a fixed position region
# [[step]]
#    action = "ExtractRegion"
#    start = 5
#    len = 8
#    segment = "read1" # Any of your input segments
#    out_label = "umi"

# ==== ExtractRegions ====
## Extract from fixed position regions
# [[step]]
#    action = "ExtractRegions"
#    regions = [
#       {segment= "read1", start = 0, length = 8},
#       {segment= "read1", start = 12, length = 4},
#    ]
#    out_label = "barcode"


# ==== ExtractLowQualityStart ====
## Extract a region with all the low quality bases at the start of the read.
## use with TrimAtTag(direction="Start", keep_tag=false) to trim low quality ends.
# [[step]]
#    action = "ExtractLowQualityStart"
#    min_qual = 'C' # minimum quality score
#    segment = "read1" # Any of your input segments
#    out_label = "low_quality_start"

# ==== ExtractLowQualityEnd ====
## Extract a region with all the low quality bases at the end of the read.
## use with TrimAtTag(direction="End", keep_tag=false) to trim low quality ends.
# [[step]]
#    action = "ExtractLowQualityEnd"
#    min_qual = 'C' # minimum quality score
#    segment = "read1" # Any of your input segments
#    out_label = "low_quality_end"

# ==== ExtractRegionsOfLowQuality ====
## Extract all regions (min size: 1 bp) where bases have quality scores below threshold
# [[step]]
#    action = "ExtractRegionsOfLowQuality"
#    segment = "read1" # Any of your input segments
#    min_quality = 60  # Quality threshold, in the file's encoding.
#                      # See https://en.wikipedia.org/wiki/Phred_quality_score#Symbols
#                      # Example: 60 is '<'; in Sanger/Illumina1.8 encoding that is Phred 27,
#                      # so bases with a 'probability of incorrect base call' > 0.002 are tagged.
#    out_label = "low_quality_regions"


# ==== ExtractPolyTail ====
## Identify either a specific base repetition, or any base repetition at the end of the read.
## Use with TrimAtTag to trim polyA/T/C/G/N tails.
# [[step]]
#    action = "ExtractPolyTail"
#    out_label = "tag_label"
#    segment = "read1" # Any of your input segments (default: read1)
#    min_length = 5 # positive integer, the minimum number of repeats of the base
#    base = "A" # one of AGTCN., the 'base' to trim (or . for 'any repeated base'. 'N' explicitly looks for NNNN, not for 'any repeated base'.)
#    max_mismatch_rate = 0.1 # float 0.0..=1.0, how many mismatches are allowed in the repeat
#    max_consecutive_mismatches = 3 # how many consecutive mismatches are allowed
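
## Example (sketch): trim the tagged poly-A tail with TrimAtTag (documented below):
# [[step]]
#    action = "TrimAtTag"
#    in_label = "tag_label"
#    direction = "End"
#    keep_tag = false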

# ==== ExtractLongestPolyX ====
## Identify the longest homopolymer anywhere in the read (useful for internal poly-runs).
# [[step]]
#    action = "ExtractLongestPolyX"
#    out_label = "tag_label"
#    segment = "read1" # Any of your input segments (default: read1)
#    min_length = 5 # positive integer, the minimum number of repeats of the base
#    base = "." # one of AGTCN.; '.' searches all bases and picks the longest run
#    max_mismatch_rate = 0.1 # float 0.0..=1.0, how many mismatches are allowed in the run
#    max_consecutive_mismatches = 3 # how many consecutive mismatches are allowed


# ==== ExtractAnchor ====
## Extract regions relative to a previously tagged anchor position.
## Uses the leftmost position of a previously established tag as the anchor.
## First create an anchor tag using ExtractIUPAC, ExtractRegions, etc.
# [[step]]
#    action = "ExtractIUPAC"
#    search = "CAYA"
#    out_label = "anchor_tag"
#    segment = "read1"
#    anchor = "Anywhere"
#    max_mismatches = 0
#
# [[step]]
#    action = "ExtractAnchor"
#    out_label = "mytag"
#    in_label = "anchor_tag" # tag that provides the anchor position
#    regions = [[-2,4], [4,1]] # start, length.
                               # Start relative to the anchor's leftmost position
#    region_separator = "_"

## == Numeric tags ==

# ==== CalcLength ====
## Extract the length of a read as a tag
# [[step]]
#    action = "CalcLength"
#    out_label = "mytag"
#    segment = "read1" # Segment

# ==== CalcKmers ====
## Count how many kmers from a read match those in a database built from reference sequences
# [[step]]
#    action = "CalcKmers"
#    out_label = "mytag"
#    segment = "read1" # Any of your input segments, or 'All'
#    files = ['reference.fa', 'database.fq']  # Sequence files to build kmer database from
#    count_reverse_complement = true # whether to also include each revcomp of a kmer in the database ('canonical kmers')
#    k = 21  # Kmer length
#    min_count = 2  # (optional, default: 1) Minimum occurrences (forward+reverse if count_reverse_complement is set) in reference to include kmer

# ==== CalcNCount ====
## Calc the number of Ns in a read (wrapper around CalcBaseContent).
# [[step]]
#    action = "CalcNCount"
#    out_label = "ncount"
#    segment = "read1" # Any of your input segments, or 'All'

# ==== CalcBaseContent ====
## Calc the percentage of specified bases, ignoring any bases you choose.
# [[step]]
#    action = "CalcBaseContent"
#    out_label = "base_content"
#    bases_to_count = "AT"
#    bases_to_ignore = "N"
#    relative = true # set to false for absolute counts (bases_to_ignore must be omitted)
#    segment = "read1" # Any of your input segments, or 'All'

# ==== CalcGCContent ====
## alias for `CalcBaseContent`; converted automatically during expansion.
# [[step]]
#    action = "CalcGCContent"
#    out_label = "gc_content"
#    segment = "read1" # Any of your input segments, or 'All'

# ==== CalcQualifiedBases ====
## Count number of high-quality bases
# [[step]]
#    action = "CalcQualifiedBases"
#    threshold = 'C' # minimum quality score for a base to be considered qualified
#    op = 'below' # Do we count phred scores better (below) or worse (above) than the threshold?
#    segment = "read1" # Any of your input segments, or 'All'
#    out_label = "tag_name"

## op also takes the values
## * worse / above / > / gt
## * worse_or_equal / above_or_equal / >= / gte
## * better / below / < / lt
## * better_or_equal / below_or_equal / <= / lte
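
## Example (sketch): keep only reads with at least 50 qualified bases,
## using FilterByNumericTag (documented under Filters below):
# [[step]]
#    action = "FilterByNumericTag"
#    in_label = "tag_name"
#    min_value = 50.0
#    keep_or_remove = "Keep"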

# ==== EvalExpression ====
## Calculate a numeric expression based on existing numeric tags

# [[step]]
#    action = "EvalExpression"
#    expression = '''
#        mytag >= 50
#    '''
#    out_label = "outtag"
#    result_type = 'bool'

# ==== ConvertRegionsToLength ====
## Summarize the span of region tags as a numeric length tag.
# [[step]]
#    action = "ConvertRegionsToLength"
#    out_label = "region_length"
#    in_label = "mytag" # region tag produced by ExtractRegion(s) or similar

# ==== CalcExpectedError ====
## Aggregate per-base error probabilities (PHRED+33) for each read.
# [[step]]
#    action = "CalcExpectedError"
#    out_label = "expected_error"
#    aggregate = "sum" # or "max"
#    segment = "read1" # Any of your input segments, or 'All'

## == Boolean tags ==

# ==== TagDuplicates ====
## Marks 2nd and further duplicates of reads.
## Detects duplicates with a cuckoo filter,
## or an exact hash if false_positive_rate is set to 0
## (beware memory usage).
# [[step]]
#    action = "TagDuplicates"
#    out_label = "tag_label"
#    false_positive_rate = 0.01 # false positive rate (0..1)
#    seed = 42 # seed for randomness (if false_positive_rate > 0)
#    source = 'read1' # any segment, 'All', 'tag:<tag-name>', or 'name:<segment>'
#    # split_character = "/" # required, and accepted, only if source uses name:<segment>


# ==== TagOtherFileByName ====
## Marks reads based on names present in another file
## With false_positive_rate > 0, uses a cuckoo filter, otherwise an exact hash set.
# [[step]]
#    action = "TagOtherFileByName"
#    out_label = "present_in_other" # which tag to store the result in
#    segment = "read1" # which name to use
#    filename = "names.fastq" # Can read FASTQ (also compressed), or sam/bam files.
#    false_positive_rate = 0.01 # false positive rate (0..1)
#    seed = 42 # seed for randomness
#    ignore_unaligned = false # in case of BAM/SAM, whether to ignore unaligned reads
#    fastq_readname_end_char = " " # (optional) char (byte value) at which to cut FASTQ read names before comparing.
#    reference_readname_end_char = "/" # (optional) char (byte value) at which to cut reference read names before storing.
##                                         Leave either unset to keep the original names intact.

# ==== TagOtherFileBySequence ====
## Tag reads based on sequences present in another file
# [[step]]
#    action = "TagOtherFileBySequence"
#    out_label = "present_in_other" # which tag to store the result in
#    filename = "sequences.fastq" # fastq (also compressed), or sam/bam files.
#    segment = "read1" # Any of your input segments
#    false_positive_rate = 0.01 # false positive rate (0..1)
#    seed = 42 # seed for randomness
#    ignore_unaligned = false # in case of BAM/SAM, whether to ignore unaligned reads



# === Other tag manipulations ===

# ==== ForgetAllTags ====
## forget about every tag currently stored
## useful when downstream steps should not see previous tags
# [[step]]
#    action = "ForgetAllTags"

# ==== ForgetTag ====
## forget about a tag
## useful if you want to store tags in a table,
## but not this one
# [[step]]
#    action = "ForgetTag"
#    in_label = "mytag"



# == Filters ==

# Most filters come with a keep_or_remove option,
# that decides whether you keep reads that match the filter,
# or whether you keep those that don't match (= remove those that match).

# === Tag filters ===

# ==== FilterByTag ====
## Remove sequences that have (or don't have) a 'region' tag
# [[step]]
#    action = "FilterByTag"
#    in_label = "mytag"
#    keep_or_remove = "Keep" # or "Remove"


# ==== FilterByNumericTag ====
## Keep or remove sequences based on a numeric tag's value
# [[step]]
#    action = "FilterByNumericTag"
#    in_label = "mytag"
#    keep_or_remove = "Keep" # or "Remove"
#    min_value = 0.0 # (optional) minimum value (inclusive)
#    max_value = 10.0 # (optional) maximum value (exclusive)
## Note: You can filter either min, max, or both, but one of them must be set.


# === Read 'number' based filters ===

# ==== Head ====
## Keep only the first N reads
# [[step]]
#    action = "Head"
#    n = 1000 # number of reads to keep

# ==== Skip ====
## Skip the first N reads
# [[step]]
#    action = "Skip"
#    n = 100 # number of reads to skip

# ==== FilterSample ====
## Randomly sample a subset of reads
# [[step]]
#    action = "FilterSample"
#    p = 0.1 # probability to keep any read (0..1)
#    seed = 42 # (optional) random seed for reproducibility


# ==== FilterReservoirSample ====
# [[step]]
# action = "FilterReservoirSample"
# n = 10_000
# seed = 59

## Keep a fixed number of reads via [reservoir sampling](https://en.wikipedia.org/wiki/Reservoir_sampling); that is, every read has an equal probability of being selected.



# ==== FilterEmpty ====
## Remove reads that are empty (zero length)
# [[step]]
#    action = "FilterEmpty"
#    segment = "All" # Any of your input segments, or 'All'

## On segment='All', only filters reads that are empty in all parts.
## Use multiple FilterEmpty to filter if any part is empty.
## Note: FilterEmpty is a convenience wrapper around CalcLength + FilterByNumericTag(min=1)

# ==== CalcComplexity ====
## Extract sequence complexity (transition ratio) as a numeric tag
# [[step]]
#    action = "CalcComplexity"
#    out_label = "complexity"
#    segment = "read1" # Any of your input segments, or 'All'
#
# # Filter based on the complexity score
# [[step]]
#    action = "FilterByNumericTag"
#    in_label = "complexity"
#    min_value = 0.3  # minimum complexity score (0-1)
#    keep_or_remove = "Keep"




# == Edits ==

# ==== ReplaceTagWithLetter ====
## Replace sequence bases in tagged regions with a specified letter
## Useful for example to mask low-quality regions as 'N'
# [[step]]
#    action = "ReplaceTagWithLetter"
#    in_label = "mytag"  # Tag containing regions to replace
#    letter = "N"  # Replacement character (defaults to 'N')

# ==== StoreTagInSequence ====
## Store the tag's replacement in the sequence,
## replacing the original sequence at that location.
# [[step]]
#    action = "StoreTagInSequence"
#    in_label = "mytag"
#    ignore_missing = true # if false, an error is raised if the tag is missing


# ==== StoreTagInComment ====

## Store currently present tags as comments on read names.
## Comments are key=value pairs, separated by `comment_separator`
## which defaults to '|'.
## They get inserted at the first `comment_insert_char`,
## which defaults to space; the comment goes in front of it,
## so the `comment_insert_char` (and everything after it) shifts to the right.
##
## That means a read name like
## @ERR12828869.501 A00627:18:HGV7TDSXX:3:1101:10502:5274/1
## becomes
## @ERR12828869.501|key=value|key2=value2 A00627:18:HGV7TDSXX:3:1101:10502:5274/1
##
## This way, your added tags will survive STAR alignment.
## (STAR always cuts at the first space, and by default also on /)
##
## (If the comment_insert_char is not present, we simply add at the right)
##
##
## By default, comments are only placed on read1.
## If you need them somewhere else, or on all reads, change the segment (e.g. to "All").
# [[step]]
#    action = "StoreTagInComment"
#    in_label = "mytag" # if set, only store this tag
#    segment = "read1" # Any of your input segments, or 'All'
#    comment_insert_char = ' ' # (optional) char at which to insert comments
#    comment_separator = '|' # (optional) char to separate comments
#    region_separator = '_' # (optional) char to separate regions in a tag, if it has multiple

# ==== StoreTagLocationInComment ====
## store the coordinates of a tag in the comment
## start-end, 0-based, half-open
# [[step]]
#    action = "StoreTagLocationInComment"
#    in_label = "mytag"
#    segment = "read1" # Any of your input segments, or 'All'
#    comment_insert_char = ' ' # (optional) char at which to insert comments
#    comment_separator = '|' # (optional) char to separate comments

# ==== HammingCorrect ====
## Correct a tag to one of a predefined set of 'barcodes' using closest hamming distance.
#
# [[step]]
#    action = "HammingCorrect"
#    in_label = "mytag"
#    out_label = "my_corrected_tag"
#    barcodes = "mybarcodelist"
#    max_hamming_distance = 1
#    on_no_match = "remove" # 'remove', 'empty', 'keep'
#
#[barcodes.mybarcodelist]
#    "AAAA" = "ignored" # only read when demultiplexing

## on_no_match controls what happens if the tag cannot be corrected within the max_hamming_distance:
##
## * remove: Remove the hit (location and sequence), useful for FilterByTag later.
## * keep: Keep the original tag (and location)
## * empty: Keep the original location, but set the tag to empty.
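
## Example (sketch): with on_no_match = "remove", uncorrected reads can be dropped afterwards:
# [[step]]
#    action = "FilterByTag"
#    in_label = "my_corrected_tag"
#    keep_or_remove = "Keep"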


# ==== LowercaseTag ====
## turns a tag into lowercase
# [[step]]
#    action = "LowercaseTag"
#    in_label = "mytag"

## You still want to StoreTagInSequence after this to actually change the sequence.

# ==== UppercaseTag ====
## turns a tag into uppercase
# [[step]]
#    action = "UppercaseTag"
#    in_label = "mytag"
## You still want to StoreTagInSequence after this to actually change the sequence.


# ==== LowercaseSequence ====
## turns the complete sequence into lowercase
# [[step]]
#    action = "LowercaseSequence"
#    segment = "read1" # Any of your input segments, or 'All'

# ==== UppercaseSequence ====
## turns the complete sequence into uppercase
# [[step]]
#    action = "UppercaseSequence"
#    segment = "read1" # Any of your input segments, or 'All'



# ==== TrimAtTag ====
## Trim the read at the position of a tag
# [[step]]
#    action = "TrimAtTag"
#    in_label = "mytag"
#    direction = "Start" # or "End"
#    keep_tag = false # if true, the tag sequence is kept in the read,
#                     # which swaps whether we trim at the start or the end of the tag.

# ==== ConvertQuality ====
## Convert quality scores between different encodings.
# [[step]]
#    action = "ConvertQuality"
#    from = "Illumina1.8" # Illumina1.8|Illumina1.3|Sanger|Solexa
#    to = "Solexa" # same options as from. Illumina1.8 is an alias for Sanger

## `to` must be different from `from`.
## Automatically adds a ValidateQuality for the `from` encoding before this step.
## See https://en.wikipedia.org/wiki/Phred_quality_score



# ==== CutEnd ====
## Remove a fixed number of bases from the end of reads
# [[step]]
#    action = "CutEnd"
#    n = 10 # number of bases to remove from end
#    segment = "read1" # Any of your input segments

# ==== CutStart ====
## Remove a fixed number of bases from the start of reads
# [[step]]
#    action = "CutStart"
#    n = 5 # number of bases to remove from start
#    segment = "read1" # Any of your input segments

# ==== Truncate ====
## Truncate reads to maximum length
# [[step]]
#    action = "Truncate"
#    n = 150 # maximum length to keep
#    segment = "read1" # Any of your input segments

# ==== Prefix ====
## Add text to the beginning of read sequences
# [[step]]
#    action = "Prefix"
#    seq = "agtTCAa" # DNA sequence to add at beginning of read names. Checked to be agtcn
#    qual = "IIIBIII" # same length as seq. Your responsibility to have valid phred values.
#    segment = "read1" # Any of your input segments

# ==== Postfix ====
## Add DNA to the end of read sequences
# [[step]]
#    action = "Postfix"
#    seq = "agtc" # DNA sequence to add at end of read names. Checked to be agtcn
#    qual = "IIII" # same length as seq. Your responsibility to have valid phred values.
#    segment = "read1" # Any of your input segments


# ==== Rename ====
## Rename reads using a pattern
# [[step]]
#    action = "Rename"
#    search = "read_(.+)" # regex to search for in read names
#    replacement = "READ_$1"
## Applies to all segments at once.
## After regex replacement, {{READ_INDEX}} is replaced with a unique (increasing, 0-based) number per read.

# ==== ReverseComplement ====
## Convert sequences to their reverse complement
# [[step]]
#    action = "ReverseComplement"
#    segment = "read1" # Any of your input segments

# ==== ReverseComplementConditional ====
## Conditionally reverse complement based on a boolean tag
# [[step]]
#    action = "ReverseComplementConditional"
#    in_label = "mytag"  # Boolean tag that determines whether to reverse complement
#    segment = "read1"       # Any of your input segments (default: read1)

# ==== MergeReads ====
## Merge paired-end reads by detecting overlap and resolving mismatches
## Supported algorithms: FastpSeemsWeird
# [[step]]
#    action = "MergeReads"
#    reverse_complement_segment2 = true    # Whether to RC segment2 (suggested: true)
#    segment1 = "read1"                    # First segment (suggested: "read1")
#    segment2 = "read2"                    # Second segment (suggested: "read2")

#    algorithm = "FastpSeemsWeird"        # Algorithm: "fastp_seems_weird", 
#    min_overlap = 30                      # Minimum overlap length required (suggested: 30)
#    max_mismatch_rate = 0.2               # Maximum allowed mismatch rate 0.0-1.0 (suggested: 0.2)
#    max_mismatch_count = 5                # Maximum allowed absolute mismatches (suggested: 5)
#                                          # At least one of max_mismatch_rate or max_mismatch_count required
#    no_overlap_strategy = "as_is"         # "as_is" or "concatenate" (suggested: "as_is")
##    out_label = "merged"                      # (optional) Tag label for boolean merge status (suggested: "merged")
#    concatenate_spacer = "NNNN"           # (optional) Required if no_overlap_strategy = "concatenate"
#    spacer_quality_char = 33              # (optional) Quality score for spacer bases (suggested: 33)
#
## Takes optional reverse complement of segment2, searches for overlap with segment1
## If overlap found: merges using selected algorithm, places result in segment1, empties segment2
## If no overlap:
##   - "as_is": leaves reads unchanged
##   - "concatenate": joins segment1 + spacer + processed segment2 into segment1, empties segment2
## If out_label specified: creates boolean tag (true=merged, false=not merged)
## See documentation for full algorithmic details

# ==== Swap ====
## Swap the contents of two segments
# [[step]]
#    action = "Swap"
#    segment_a = "read1"
#    segment_b = "read2"

# ==== SwapConditional ====
## Conditionally swap segments based on a boolean tag
# [[step]]
#    action = "SwapConditional"
#    in_label = "mytag"  # Boolean tag that determines whether to swap
#    segment_a = "read1"       # Optional - only needed if more than 2 segments
#    segment_b = "read2"       # Optional - only needed if more than 2 segments


# == Validation ==

# ==== ValidateQuality ====
## Validate that quality scores are in valid range
# [[step]]
#    action = "ValidateQuality"
#    segment = "All" # Any of your input segments, or 'All'
#    encoding = 'Illumina1.8' # 'Illumina1.8|Illumina1.3|Sanger|Solexa' # define the range of allowed values.

## see https://pmc.ncbi.nlm.nih.gov/articles/PMC2847217/ table 1
## Use ConvertQuality to convert between encodings


# ==== ValidateSeq ====
## Validate that sequences contain only valid DNA bases
# [[step]]
#    action = "ValidateSeq"
#    allowed = "agtc" # Which characters are allowed? no default, case sensitive
#    segment = "read1" # Any of your input segments, or 'All'



# ==== SpotCheckReadPairing ====
# [[step]]
#     action = "SpotCheckReadPairing"
#     sample_stride = 1000 # check every nth fragment; default 1000, must be > 0
#     readname_end_char = '/' # u8/byte-char, defaults to '/' for Illumina.


## Verify that (a subset of) read names in pairs match.

## Sample paired reads every `sample_stride` fragments and confirm that each segment shares the
## same read name prefix (part before 'readname_end_char').

## This step is injected automatically after your transformations when

##  - more than one segment is defined
##  - and `options.spot_check_read_pairing` is set to `true` (the default)
##  - and no explicit `SpotCheckReadPairing` or `ValidateName` step is present.

## The automatic readname_end_char is configured for Illumina-style read names ending
## with /1 or /2, splitting on '/'.

## Disable the sampling entirely via options.spot_check_read_pairing = false


# ==== ValidateName ====
## Validate that all segments expose the same read name or prefix
# [[step]]
#    action = "ValidateName"
#    readname_end_char = "_" # Optional single separator character; leave off for exact match

# ==== ValidateAllReadsSameLength ====
# [[step]]
#    action = "ValidateAllReadsSameLength"
#    source = "read1" # Any segment, All, tag:<name> or 'name:segment>'

## Validates that all reads have the same sequence/tag/read length.
## Provided for your sanity checking


# == Reporting ==

# ==== Report ====
## Generate a processing report at this point in the pipeline.
## All reports get merged into one JSON/HTML output.
# [[step]]
#    action = "Report"
#    name = "before processing" # key to identify this section of your
#    count = true # whether to include the read counts
#    base_statistics = true # whether to include base statistics
#    length_distribution = true # whether to include length distribution
#    duplicate_count_per_read = true # whether to include duplicate counts per read (approximate, cuckoo filter)
#    duplicate_count_per_fragment = true # duplicate counts per fragment (read1&2&i1&2, approximate, cuckoo filter)
#    count_oligos = ["AGTC","ACCCCC"] # count occurrences of these oligos
#    count_oligos_segment = "read1" # Any of your input segments, or 'All' # where to look for the oligos to count


# ==== Progress ====
## Report progress (and speed) during processing
# [[step]]
#    action = "Progress"
#    n = 1000000 # report progress every N reads
#    output_infix = "filename" # (optional) write progress to a file {prefix}{ix_separator}{infix}.progress instead of stdout.
## Progress to stdout is incompatible with output.stdout = true.


# ==== Inspect ====
## Output detailed information about reads for debugging
# [[step]]
#    action = "Inspect"
#    n = 10 # number of molecules to inspect
#    infix = "my_inspection" # writes to {output.prefix}_{infix}_{segment}.{suffix} (or _interleaved when segment = "all")
#    segment= "read1" # Any of your input segments, or "all" to interleave all segments
#    suffix = "compressed" # (optional) custom suffix for filename
#    compression = "gzip" # (optional) compression format: raw, gzip, zstd (defaults to raw)
#    compression_level = 6 # (optional) compression level for gzip (0-9) or zstd (1-22)
                          # defaults: gzip=6, zstd=5

# ==== Demultiplex ====
## Uncomment to demultiplex samples based on tags.
## See HammingCorrect for barcode correction options.

# [[step]]
#    action = "Demultiplex"
#    in_label = "mytag"
#    barcodes = "mybarcodes"
#    output_unmatched  = true # if set, write reads not matching any barcode
#                             #  to a file like output_prefix_no-barcode_1.fq
#
#[barcodes.mybarcodes] # may appear before or after the step.
## separate multiple regions with a _
## a Mapping of barcode -> output name.
#AAAAAA_CCCCCC = "sample-1" # output files will be named prefix.barcode_prefix.infix.suffix
#                           # e.g. output_sample-1_1.fq.gz
#                           # e.g. output_sample-1_report.fq.gz
#AAAAAA_CCCCTT = "sample-1" # multiple barcodes can lead to the same output
#TTTTTT_CCCCTT = "sample-2" #

## You can also demultiplex based on boolean tags. To do so, set in_label to a boolean tag and leave off the barcodes option.


# == Others ==

# ==== StoreTagsInTable ====
## store the tags in a tsv table
# [[step]]
#    action = "StoreTagsInTable"
#    infix = "tags"
#    compression = "Raw" # Raw, Gzip, Zstd
#    region_separator = "_" # (optional) char to separate regions in a tag, if it has multiple


# ==== StoreTagInFastQ ====
## Store the content of a tag in a FASTQ file.
## Needs a 'location' tag.
## Can store other tags in the read name.
## Quality scores are set to '~'.
## With demultiplexing: creates separate files per barcode.
# [[step]]
#    action = "StoreTagInFastQ"
#    in_label = "mytag" # tag to store. 
#    compression = "Raw" # Raw, Gzip, Zstd
##   compression_level = 6 # (optional) compression level for gzip (0-9) or zstd (1-22)
                          # defaults: gzip=6, zstd=5
#    comment_tags = []# e.g. ["other_tag"] # see StoreTagInComment
#    comment_location_tags = ["mytag"] # (optional) tags to add location info for, defaults to [label]
#                                      # set to [] to disable location tracking
#    comment_insert_char = ' ' # (optional) char at which to insert comments
#    comment_separator = '|' # (optional) char to separate comments
#    region_separator = "_" # (optional) char to separate regions in a tag, if it has multiple


# ==== QuantifyTag ====
## Count the occurrences of each tag-sequence
# [[step]]
#    action = "QuantifyTag"
#    in_label = "mytag"
#    infix = "tagcount" # output file is output{ix_separator}tagcount.qr.json


# == Options ==
# [options]
#   spot_check_read_pairing = true # whether to spot check read pair names. See the SpotCheckReadPairing step
#   thread_count = -1 # only for the steps supporting multi-core.
#   block_size = 10000 # how many reads per block?
#   buffer_size = 102400 # how many bytes of buffer. Will extend if we can't get block_size reads in there.
#    accept_duplicate_files = false # for testing purposes.

Isn’t this awfully verbose? #

An understandable configuration matters much more than a terse one, and that is what we strive for.

It is usually written (or copy/pasted) with the documentation at hand, so typing effort is not a limiting factor.

Our anti-example are tools that end up being called like this (no shade on fastp - bioinformatic tools are overwhelmingly like this):

fastp \
    --in1 r1.fastq.gz \
    --in2 r2.fastq.gz \
    -m \
    --merged_out merged.fastp.gz \
    --out1 read1.fastp.gz \
    --out2 read2.fastp.gz \
    -A -G -Q -L

Which is reasonably clear, until you get to the one-letter-options. In this case, they turn on ‘merge mode’ (’-m’, which you might have guessed) and disable some default processing steps (’-A -G -Q -L’).

Here’s the mbf-fastq-processor equivalent, which we think is more maintainable:

[input]
    read1 = 'r1.fastq.gz'
    read2 = 'r2.fastq.gz'

[[step]]
    action = 'MergeReads'
    algorithm = "FastpSeemsWeird"
    min_overlap = 30
    max_mismatch_rate = 0.2
    max_mismatch_count = 5
    no_overlap_strategy = 'AsIs'
    reverse_complement_segment2 = true
    segment1 = 'read1'
    segment2 = 'read2'

[output]
    prefix = "output"
    compression = "gzip"

It also illustrates our stance on configuration defaults: Keep them minimal. You can never change a default without unexpectedly and silently breaking some user’s pipeline.

For example, if fastp added another default processing step in a future version, users of fastp would have to add another ‘disable’ command line flag to their invocations to keep the same behaviour as before.

It’s much better to make them write it down explicitly in the first place.

To make getting started easier, we allow a number of aliases and spelling variations, and we only provide defaults when they're absolutely obvious.