Cookbook 09: Fastp-Equivalent Workflow
Use Case
You want to replicate the default behavior of fastp — a popular all-in-one FASTQ preprocessor — using a configurable pipeline. This is useful when you need reproducible, step-by-step control over each filtering stage, or want to extend the workflow beyond what fastp offers.
What This Pipeline Does
This cookbook replicates fastp’s default single-end processing pipeline:
- PolyG Trimming: Uses ExtractPolyTail + TrimAtTag to remove polyG tails (an Illumina NextSeq/NovaSeq artifact)
- Adapter Trimming: Uses ExtractIUPAC + TrimAtTag to remove the Illumina TruSeq R1 adapter
- N-base Filtering: Uses CalcNCount + FilterByNumericTag to remove reads with too many ambiguous bases (--n_base_limit 5)
- Quality Filtering: Uses CalcQualifiedBases + FilterByNumericTag to remove reads with too many low-quality bases (--qualified_quality_phred 15, --unqualified_percent_limit 40)
- Length Filtering: Uses CalcLength + FilterByNumericTag to remove reads shorter than 15bp (--length_required 15)
Fastp Defaults Replicated
| Fastp parameter | Value | Pipeline step |
|---|---|---|
| --poly_g_min_len | 10 | ExtractPolyTail min_length = 10 |
| --adapter_sequence | AGATCGGAAGAGCACACGTCTGAACTCCAGTCA | ExtractIUPAC query = ... |
| --n_base_limit | 5 | FilterByNumericTag max_value = 6 (reads with more than 5 N bases are removed) |
| --qualified_quality_phred | 15 (Phred) → 48 (ASCII) | CalcQualifiedBases threshold = 48 |
| --unqualified_percent_limit | 40% → keep if ≥ 60% qualified | FilterByNumericTag min_value = 0.60 |
| --length_required | 15 | FilterByNumericTag min_value = 15 |
Note on quality scores: Quality values in FASTQ files are ASCII-encoded with an offset of 33 (Phred+33). Phred Q15 therefore corresponds to ASCII code 48 (15 + 33 = 48), which is the character '0'.
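The same offset applies to any other cutoff. As a minimal sketch, assuming you wanted the equivalent of --qualified_quality_phred 20 instead of the fastp default (a hypothetical choice, not part of this cookbook), the threshold would become 20 + 33 = 53. The keys below are copied from this cookbook's configuration file; only the threshold value differs.

```toml
# Hypothetical variant: Q20 instead of the default Q15
[[step]]
action = 'CalcQualifiedBases'
out_label = 'qual_base_rate'
threshold = 53 # Phred 20 + 33 (ASCII offset) = 53
operator = ">="
relative = true # report a rate, as in the default pipeline
```

The downstream FilterByNumericTag step on qual_base_rate would stay unchanged, since it only looks at the resulting rate.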
Fastp Considerations
PolyG trimming is automatically enabled in fastp when the read header starts with an instrument prefix indicating NextSeq or NovaSeq data (e.g., @NS, @FS, @MN, @NB). In this cookbook we enable it explicitly.
Adapter detection in fastp is automatic but requires several thousand reads. Here we specify the adapter sequence directly: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA (Illumina TruSeq R1). See the adapters reference for other common sequences.
Filtering order: fastp applies trimming (polyG, adapter) before calculating quality metrics. This pipeline follows the same order.
Input Files
- R1.fq: Single-end FASTQ reads
Output Files
- output_r1.fq: Processed reads after all trimming and filtering steps
Expected Results
With the provided sample data:
- Reads with polyG tails have tails removed
- Reads containing the TruSeq adapter are trimmed at the adapter site
- Reads with more than 5 N bases are removed
- Reads where fewer than 60% of bases meet Q15 are removed
- Reads shorter than 15bp (including those shortened by trimming) are removed
Customization
Paired-end data: Add r2 to the input section and repeat the matching steps with segment = 'read2'.
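A minimal sketch of what that might look like for the polyG stage, assuming a second input file named R2.fq and a label polyG_r2 (both names are hypothetical, chosen only for illustration). The step keys mirror the read1 steps in this cookbook's configuration file, with the segment setting described above added; this fragment has not been run against the sample data.

```toml
[input]
r1 = 'R1.fq'
r2 = 'R2.fq' # hypothetical file name for the second mate

# Locate polyG tails on read2, mirroring the read1 step
[[step]]
action = "ExtractPolyTail"
segment = 'read2'
out_label = "polyG_r2" # hypothetical label, distinct from the read1 label
min_length = 10
base = "G"
max_mismatch_rate = 0.125
max_consecutive_mismatches = 5

# Trim read2 at the tagged position
[[step]]
action = 'TrimAtTag'
in_label = 'polyG_r2'
direction = 'end'
keep_tag = false
```

The adapter, N-count, quality, and length steps would be duplicated the same way. Note that read2 usually carries a different adapter than read1; for TruSeq libraries the R2 adapter is commonly given as AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT, but check the adapters reference for your library kit.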
Adapter sequences: Replace the query value in the ExtractIUPAC step with your adapter. For common adapters see the adapters reference.
Stricter quality filtering:
min_value = 0.80 # keep reads with ≥80% bases above threshold
Longer minimum length (e.g., for alignment):
min_value = 25
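Both tweaks only change min_value in an existing FilterByNumericTag step. As a sketch, using the labels from this cookbook's configuration file, the two edited steps might look like this when both changes are applied together:

```toml
# Stricter quality filter: keep reads where >= 80% of bases meet the threshold
[[step]]
action = 'FilterByNumericTag'
in_label = 'qual_base_rate'
min_value = 0.80
keep_or_remove = 'keep'

# Longer minimum length after trimming
[[step]]
action = 'FilterByNumericTag'
in_label = 'length_r1'
min_value = 25
keep_or_remove = 'keep'
```

The Calc* steps that produce qual_base_rate and length_r1 do not need to change.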
When to Use This
- When you want to match fastp’s default preprocessing for comparison or reproducibility
- As a starting point for custom workflows based on fastp’s defaults
- When you need to inspect or audit each filtering step independently
Downstream Analysis
After preprocessing:
- Alignment to reference genome (BWA, STAR, Bowtie2)
- Quantification of gene expression
- Variant calling with improved accuracy from cleaner input data
Download
Download 09-fastp-equivalent.tar.gz for a complete, runnable example including expected output files.
Configuration File
[input]
r1 = 'R1.fq'
# Equivalent to fastp's -g / --trim_poly_g.
# First, find out where the polyG tails are.
[[step]]
action = "ExtractPolyTail"
out_label = "polyG"
min_length = 10 # --poly_g_min_len, default is 10 in fastp
base = "G"
max_mismatch_rate = 0.125 # not configurable in fastp; 1/8 is the value it uses internally
max_consecutive_mismatches = 5 # not configurable in fastp
# Then cut them off.
[[step]]
action = 'TrimAtTag'
in_label = 'polyG'
direction = 'end'
keep_tag = false
# --adapter_sequence AGAT...
# (auto-detected in fastp after a few thousand reads)
[[step]]
action = 'ExtractIUPAC'
out_label = 'adapter'
query = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
anchor = 'anywhere'
max_mismatches = 5
[[step]]
action = 'TrimAtTag'
in_label = 'adapter'
direction = 'end'
keep_tag = false
# fastp trims first, then calculates the quality metrics.
# --qualified_quality_phred 15
[[step]]
action = 'CalcQualifiedBases'
out_label = 'qual_base_rate'
threshold = 48 # qualities are compared as FASTQ ASCII codes, not Phred scores: 15 + 33 = 48
operator = ">="
relative = true # we want a rate
[[step]]
action = 'CalcNCount'
out_label = 'n_count'
relative = false
[[step]]
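action = 'FilterByNumericTag'
max_value = 6 # --n_base_limit 5: reads with more than 5 N bases are removed
keep_or_remove = 'keep'
in_label = 'n_count'
[[step]]
action = 'FilterByNumericTag'
min_value = 0.60 # --unqualified_percent_limit 40: keep reads with at least 60% qualified bases
keep_or_remove = 'keep'
in_label = 'qual_base_rate'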
[[step]]
action = 'CalcLength'
out_label = 'length_r1'
[[step]]
action = 'FilterByNumericTag'
min_value = 15 # --length_required 15
keep_or_remove = 'keep'
in_label = 'length_r1'
[output]
prefix = 'output'