Cookbook 08: Read Length Filtering and Truncation #
Use Case #
You have sequencing data with variable read lengths and need to:
- Remove reads that are too short (may align poorly or represent artifacts)
- Remove reads that are too long (may indicate technical issues)
- Truncate all reads to a uniform length (required by some downstream tools)
Read length filtering is important for:
- Quality control after adapter trimming
- Preparing data for tools that require uniform read lengths
- Removing degraded or artifactual sequences
What This Pipeline Does #
This cookbook demonstrates comprehensive read length management:
- Calculate Read Length: Uses CalcLength to tag each read with its length
- Filter by Minimum Length: Uses FilterByNumericTag to remove short reads
- Filter by Maximum Length: Uses FilterByNumericTag to remove long reads
- Truncate to Uniform Length: Uses Truncate to trim all reads to the same size
- Generate Reports: Creates before/after statistics
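The list above maps one-to-one onto the step blocks in the configuration file at the bottom of this page. Stripped of the Report steps, the core of the pipeline is:
[[step]]
action = 'CalcLength'          # tag each read with its length
segment = 'read1'
out_label = 'length'
[[step]]
action = 'FilterByNumericTag'  # remove reads shorter than 30bp
in_label = 'length'
min_value = 30
keep_or_remove = 'keep'
[[step]]
action = 'FilterByNumericTag'  # remove reads longer than 150bp
in_label = 'length'
max_value = 150
keep_or_remove = 'keep'
[[step]]
action = 'Truncate'            # trim all remaining reads to at most 100bp
segment = 'read1'
n = 100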
Understanding Read Length #
Why read length matters:
- Too short: May align to multiple locations (multimapping)
- Too long: May indicate incomplete adapter trimming or concatenated sequences
- Variable length: Some tools (e.g., older aligners) require uniform lengths
- Optimal range: Depends on application (typically 25-150bp for RNA-seq)
Common scenarios:
- After adapter trimming: Reads become variable length; filter out very short ones (see the sketch after this list)
- Amplicon sequencing: Expected length range is tight (e.g., 250-280bp)
- Small RNA: Keep short reads (18-30bp) while filtering longer contamination
- Assembly: Longer reads generally better, but quality matters more
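For the adapter-trimming scenario, a minimum-length filter on its own is often enough. A minimal sketch, reusing the CalcLength and FilterByNumericTag steps from the main configuration and the 25bp cutoff suggested for RNA-seq under Customization below:
[[step]]
action = 'CalcLength'
segment = 'read1'
out_label = 'length'
[[step]]
# Discard reads that became very short after adapter removal
action = 'FilterByNumericTag'
in_label = 'length'
min_value = 25
keep_or_remove = 'keep'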
Input Files #
input/variable_length_R1.fq - Reads with varying lengths (20-160bp)
Output Files #
output_read1.fq - Reads filtered to the 30-150bp range and truncated to at most 100bp
Expected Results #
With the provided sample data:
- Input: 10 reads with lengths ranging from 20 to 160bp
- After min filter: Removes reads < 30bp (e.g., 20bp, 25bp reads removed)
- After max filter: Removes reads > 150bp (e.g., 160bp read removed)
- After truncate: Reads longer than 100bp are trimmed to exactly 100bp; reads already at or below 100bp keep their length
Workflow Details #
Example transformations:
| Read ID | Original Length | After Min Filter | After Max Filter | After Truncate |
|---|---|---|---|---|
| READ1 | 25bp | Removed | - | - |
| READ2 | 40bp | Kept | Kept | → 40bp (kept) |
| READ3 | 100bp | Kept | Kept | → 100bp |
| READ4 | 120bp | Kept | Kept | → 100bp |
| READ5 | 160bp | Kept | Removed | - |
Note on Truncate behavior:
- If a read is longer than the target: it is trimmed to the target length
- If a read is shorter than the target: it keeps its original length (Truncate does not pad)
- To enforce an exact uniform length, filter first with min_value set to the target length (see the 'Uniform length required' example under Customization below)
Customization #
Adjust parameters based on your application:
RNA-seq (general):
min_value = 25 # Minimum for reliable alignment
max_value = 200 # Filter abnormally long reads
# No truncation usually needed
Amplicon sequencing (expected 250bp):
min_value = 240 # Tight range around expected
max_value = 260
# Truncate to 250 for uniformity
Small RNA-seq (miRNA):
min_value = 18 # Shortest mature miRNA
max_value = 30 # Longest miRNA + some tolerance
# No truncation
ChIP-seq or ATAC-seq:
min_value = 25
max_value = 150
# Optional truncation to reduce file size
Uniform length required:
# First filter to ensure all reads are at least target length
min_value = 100
# Then truncate to exact length
[[step]]
action = 'Truncate'
segment = 'read1'
n = 100
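As a concrete worked example, the amplicon scenario above expands to the following steps. This is a sketch that mirrors the commented-out combined min/max filter in the configuration file below, followed by a Truncate step; reads between 240 and 250bp would keep their original length, since Truncate does not pad:
[[step]]
action = 'CalcLength'
segment = 'read1'
out_label = 'length'
[[step]]
# Keep only reads in the expected amplicon size window
action = 'FilterByNumericTag'
in_label = 'length'
min_value = 240
max_value = 260
keep_or_remove = 'keep'
[[step]]
# Trim everything down to a uniform 250bp
action = 'Truncate'
segment = 'read1'
n = 250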
When to Use This #
- After adapter trimming to remove reads that became too short
- Quality control to filter abnormally long/short reads
- Before tools that require uniform read lengths
- Amplicon analysis to enforce expected size range
- Small RNA analysis to select specific size classes
Alternative Approaches #
Using CutStart/CutEnd instead of Truncate:
# Remove first 10bp and last 10bp
[[step]]
action = 'CutStart'
segment = 'read1'
n = 10
[[step]]
action = 'CutEnd'
segment = 'read1'
n = 10
Filtering paired-end reads by combined length:
# Calculate both read lengths
[[step]]
action = 'CalcLength'
segment = 'read1'
out_label = 'len1'
[[step]]
action = 'CalcLength'
segment = 'read2'
out_label = 'len2'
# Use EvalExpression to filter based on combined length
[[step]]
action = 'EvalExpression'
expression = 'len1 + len2 >= 50'
out_label = 'long_enough'
result_type = 'bool'
[[step]]
action = 'FilterByTag'
in_label = 'long_enough'
keep_or_remove = 'keep'
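Note that every step in this cookbook operates on read1 only. For paired-end data you would presumably repeat the per-segment steps for read2 as well; a sketch for truncation, assuming Truncate accepts read2 as a segment the same way CalcLength does above:
[[step]]
action = 'Truncate'
segment = 'read1'
n = 100
[[step]]
# Assumption: read2 can be truncated independently with its own step
action = 'Truncate'
segment = 'read2'
n = 100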
Downstream Analysis #
After length filtering:
- Verify the length distribution with quality control tools (e.g., FastQC)
- Align to the reference genome (mapping rates should improve)
- Run quantification or other downstream analysis
- Compare results with and without filtering to assess the impact
Quality Metrics #
Monitor these metrics after length filtering:
- Percentage of reads retained: Most reads should pass the filters (>80% is typical)
- Mean read length: Should match expected for your protocol
- Mapping rate: Often improves after filtering too-short reads
- Alignment quality: Fewer multimapping reads
Download #
Download 08-length-filtering.tar.gz for a complete, runnable example including expected output files.
Configuration File #
[input]
read1 = 'input/variable_length_R1.fq'
[[step]]
# Generate initial report to see input length distribution
action = 'Report'
name = 'initial'
base_statistics = true
[[step]]
# Calculate the length of each read and store in a tag
action = 'CalcLength'
segment = 'read1'
out_label = 'length'
[[step]]
# Remove reads shorter than 30bp
# Short reads after adapter trimming often align poorly
action = 'FilterByNumericTag'
in_label = 'length'
min_value = 30
keep_or_remove = 'keep'
[[step]]
# Remove reads longer than 150bp
# Unusually long reads may indicate incomplete adapter trimming
action = 'FilterByNumericTag'
in_label = 'length'
max_value = 150
keep_or_remove = 'keep'
# Alternative: filter both short and long reads in a single step
#[[step]]
# # Remove reads shorter than 30bp
# # Remove reads longer than 150bp
# # Unusually long reads may indicate incomplete adapter trimming
# action = 'FilterByNumericTag'
# in_label = 'length'
# min_value = 30
# max_value = 150
# keep_or_remove = 'keep'
[[step]]
# Truncate all reads to exactly 100bp
# Some downstream tools require uniform read lengths
# Reads shorter than 100bp are kept at their original length
action = 'Truncate'
segment = 'read1'
n = 100
[[step]]
# Generate report after filtering and truncation
action = 'Report'
name = 'filtered'
base_statistics = true
[output]
prefix = 'reference_output/output'
format = "FASTQ"
report_html = true