Segments #
Modern sequencers, particularly Illumina sequencers, can read multiple times from one (amplified) DNA molecule, producing multiple ‘segments’ (often called ‘reads’) that together form a ‘molecule’ or ‘fragment’.
Definition and Configuration #
Segments are defined in the [input] section of your TOML configuration. Each segment corresponds to one FASTQ file (or stream in interleaved formats), and segment names are arbitrary but should be meaningful.
[input]
read1 = ["sample_R1.fq.gz"]
read2 = ["sample_R2.fq.gz"]
index1 = ["sample_I1.fq.gz"]
In this example, three segments are defined: read1, read2, and index1.
Common Segment Naming Conventions #
While segment names are user-defined, certain conventions align with sequencing technologies:
Paired-End Sequencing #
[input]
read1 = ["lib_R1.fq.gz"] # Forward read
read2 = ["lib_R2.fq.gz"] # Reverse read
Common in RNA-seq. Often the reads are on opposing strands.
Single-End Sequencing #
[input]
read1 = ["lib.fq.gz"] # Single read
Common in ChIP-seq or targeted sequencing.
Indexed Libraries (Multiplexed Samples) #
[input]
read1 = ["run_R1.fq.gz"]
read2 = ["run_R2.fq.gz"]
index1 = ["run_I1.fq.gz"] # i7 index (first barcode)
index2 = ["run_I2.fq.gz"] # i5 index (second barcode)
Index reads contain sample barcodes for demultiplexing multiple samples from a single sequencing run. Dual indexing reduces barcode collisions and enables higher multiplexing.
Custom Naming #
You can use any naming scheme that suits your workflow.
Note that these end up in the output file names as well.
[input]
fwd = ["lib_F.fq.gz"]
rev = ["lib_R.fq.gz"]
umi = ["lib_UMI.fq.gz"] # Unique Molecular Identifier read
Segment Synchronization #
Critical: All segments must contain the same number of reads, in the same order. The processor validates this during execution by spot checking the read names.
When a step filters a molecule, all segments for that fragment are removed together, maintaining synchronization.
Segment Targeting in Steps #
Many steps operate on specific segments via the segment parameter:
[[step]]
action = "CutStart"
segment = "read1" # trim read1
n = 10
[[step]]
action = "ValidateSeq"
segment = "index1" # Only validate index1 sequences
The “All” Pseudo-Segment #
Some steps support segment = "All" to operate across all defined segments:
[[step]]
action = "CalcLength"
segment = "All" # Check all segments
out_label = "sum_len"
When using "All", the step evaluates criteria across every segment and operates on the entire fragment.
Segments vs Sources #
When a step accepts a source parameter instead of segment, it can read from:
- Segments (e.g.,
"read1") - Segment names (e.g.,
"name:read1") - Tag values (e.g.,
"tag:barcode")
This provides greater flexibility for complex workflows involving metadata.
Interleaved Segments #
Interleaved FASTQ files combine multiple segments into a single file, alternating records:
[input]
source = ["interleaved.fq.gz"]
interleaved = ["read1", "read2"]
This declares two segments (read1 and read2) from one file, where records alternate: fragment 1 read1, fragment 1 read2, fragment 2 read1, fragment 2 read2, etc.
See Also #
- Input section reference for detailed configuration syntax
- Source concept for understanding source parameters
- Step concept for how steps interact with segments