Cookbook 09: Fastp-Equivalent Workflow #

Use Case #

You want to replicate the default behavior of fastp — a popular all-in-one FASTQ preprocessor — using a configurable pipeline. This is useful when you need reproducible, step-by-step control over each filtering stage, or want to extend the workflow beyond what fastp offers.

What This Pipeline Does #

This cookbook replicates fastp’s default single-end processing pipeline:

  1. PolyG Trimming: Uses ExtractPolyTail + TrimAtTag to remove polyG tails (Illumina NextSeq/NovaSeq artifact)
  2. Adapter Trimming: Uses ExtractIUPAC + TrimAtTag to remove the Illumina TruSeq R1 adapter
  3. N-base Filtering: Uses CalcNCount + FilterByNumericTag to remove reads with too many ambiguous bases (--n_base_limit 5)
  4. Quality Filtering: Uses CalcQualifiedBases + FilterByNumericTag to remove reads with too many low-quality bases (--qualified_quality_phred 15, --unqualified_percent_limit 40)
  5. Length Filtering: Uses CalcLength + FilterByNumericTag to remove reads shorter than 15bp (--length_required 15)

Fastp Defaults Replicated #

| Fastp parameter | Value | Pipeline step |
| --- | --- | --- |
| --poly_g_min_len | 10 | ExtractPolyTail min_length = 10 |
| --adapter_sequence | AGATCGGAAGAGCACACGTCTGAACTCCAGTCA | ExtractIUPAC query = ... |
| --n_base_limit | 5 | FilterByNumericTag max_value = 6 |
| --qualified_quality_phred | 15 (Phred) → 48 (ASCII) | CalcQualifiedBases threshold = 48 |
| --unqualified_percent_limit | 40% → keep if ≥ 60% qualified | FilterByNumericTag min_value = 0.60 |
| --length_required | 15 | FilterByNumericTag min_value = 15 |

Note on quality scores: Quality values in FASTQ files are ASCII-encoded. With the standard Phred+33 encoding, Phred Q15 corresponds to ASCII code 48 (15 + 33 = 48), which is the character '0'.
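The conversion in the note above can be checked with a few lines of Python. This is only an illustration of the arithmetic, assuming Phred+33 encoding; the helper names are hypothetical and not part of the pipeline or of fastp.

```python
PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def phred_to_ascii(phred: int, offset: int = PHRED_OFFSET) -> int:
    """Convert a Phred score to the ASCII code used in the FASTQ quality line."""
    return phred + offset


def ascii_to_phred(char: str, offset: int = PHRED_OFFSET) -> int:
    """Convert a single FASTQ quality character back to its Phred score."""
    return ord(char) - offset


threshold = phred_to_ascii(15)
print(threshold, repr(chr(threshold)))
```

The value of `threshold` is what the configuration passes to CalcQualifiedBases as `threshold = 48`.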

Fastp Considerations #

PolyG trimming is automatically enabled in fastp when the read header starts with an instrument prefix that indicates NextSeq or NovaSeq data (e.g., @NS, @FS, @MN, @NB). In this cookbook we enable it explicitly.
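To make the idea of a polyG tail with tolerated mismatches concrete, here is a simplified Python sketch. It is not fastp's algorithm and not the implementation behind ExtractPolyTail: it scans the whole read from the 3' end, keeps the longest tail that starts on a G and stays within the mismatch rate, and does not model `max_consecutive_mismatches`. The function names are hypothetical.

```python
from typing import Optional, Tuple


def polyg_tail_start(seq: str, min_length: int = 10,
                     max_mismatch_rate: float = 0.125) -> Optional[int]:
    """Return the index where a polyG tail starts, or None if none qualifies."""
    best_start = None
    mismatches = 0
    # Grow the candidate tail one base at a time, starting at the 3' end.
    for tail_len in range(1, len(seq) + 1):
        pos = len(seq) - tail_len
        if seq[pos] != "G":
            mismatches += 1
        elif tail_len >= min_length and mismatches <= tail_len * max_mismatch_rate:
            # The tail begins on a G, is long enough, and is within the rate.
            best_start = pos
    return best_start


def trim_polyg(seq: str, qual: str) -> Tuple[str, str]:
    """Cut sequence and qualities at the start of the polyG tail, if any."""
    start = polyg_tail_start(seq)
    if start is None:
        return seq, qual
    return seq[:start], qual[:start]


read = "ACTACTACTA" + "G" * 12
print(trim_polyg(read, "I" * len(read)))
```

With `min_length = 10`, a read ending in only five Gs is left untouched, while a 12-base tail containing a single non-G base is still removed because 1 mismatch in 12 bases is below the 1/8 rate.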

Adapter detection in fastp is automatic but requires several thousand reads. Here we specify the adapter sequence directly: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA (Illumina TruSeq R1). See the adapters reference for other common sequences.
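The effect of searching for the adapter anywhere in the read and trimming from the match onward can be sketched in Python as follows. This is a minimal illustration under stated assumptions: it only considers full-length adapter occurrences, counts plain character mismatches, and ignores IUPAC ambiguity codes and partial adapters at the read end. It is not the ExtractIUPAC implementation, and the function names are hypothetical.

```python
from typing import Optional, Tuple

ADAPTER = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"  # Illumina TruSeq R1, from the text above


def find_adapter(seq: str, adapter: str = ADAPTER,
                 max_mismatches: int = 5) -> Optional[int]:
    """Return the first position where the full adapter matches, or None."""
    for start in range(len(seq) - len(adapter) + 1):
        window = seq[start:start + len(adapter)]
        mismatches = sum(a != b for a, b in zip(window, adapter))
        if mismatches <= max_mismatches:
            return start
    return None


def trim_at_adapter(seq: str, qual: str) -> Tuple[str, str]:
    """Drop the adapter and everything after it from sequence and qualities."""
    start = find_adapter(seq)
    if start is None:
        return seq, qual
    return seq[:start], qual[:start]


insert = "ACGT" * 5
read = insert + ADAPTER + "TTTT"
print(trim_at_adapter(read, "I" * len(read)))
```

Because everything from the match position to the read end is removed, any bases sequenced past the adapter are discarded together with it.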

Filtering order: fastp applies trimming (polyG, adapter) before calculating quality metrics. This pipeline follows the same order.
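The order matters because trimming changes the metrics that the filters see. The hypothetical read below has 20 high-quality bases followed by a 30-base low-quality tail; the Python sketch assumes Phred+33 encoding and that the tail is exactly what a trimming step would remove.

```python
def qualified_fraction(qual: str, ascii_threshold: int = 48) -> float:
    """Fraction of quality characters at or above the ASCII threshold."""
    if not qual:
        return 0.0
    return sum(ord(c) >= ascii_threshold for c in qual) / len(qual)


# 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
qual = "I" * 20 + "#" * 30

before_trimming = qualified_fraction(qual)       # tail still attached
after_trimming = qualified_fraction(qual[:20])   # tail removed first

print(before_trimming, after_trimming)
```

Measured before trimming, only 40% of the bases qualify and the read would be dropped by the 60% threshold; measured after trimming, every remaining base qualifies and the read is kept.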

Input Files #

  • R1.fq — Single-end FASTQ reads

Output Files #

  • output_r1.fq — Processed reads after all trimming and filtering steps

Expected Results #

With the provided sample data:

  • Reads with polyG tails have tails removed
  • Reads containing the TruSeq adapter are trimmed at the adapter site
  • Reads with more than 5 N bases are removed
  • Reads where fewer than 60% of bases meet Q15 are removed
  • Reads shorter than 15bp (including those shortened by trimming) are removed
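The three filters listed above can be summarised as a single Python function that reports why a read would be removed. This is a sketch of the decision logic only, assuming Phred+33 encoding and equal-length sequence and quality strings; `rejection_reason` and its labels are hypothetical names, not part of the pipeline.

```python
from typing import Optional


def rejection_reason(seq: str, qual: str,
                     n_base_limit: int = 5,
                     ascii_threshold: int = 48,
                     min_qualified_fraction: float = 0.60,
                     min_length: int = 15) -> Optional[str]:
    """Return a label for the first failed filter, or None if the read is kept."""
    # N-base filter: more than n_base_limit ambiguous bases.
    if seq.upper().count("N") > n_base_limit:
        return "too_many_n"
    # Quality filter: too few bases at or above the ASCII threshold.
    qualified = sum(ord(c) >= ascii_threshold for c in qual)
    if not qual or qualified / len(qual) < min_qualified_fraction:
        return "low_quality"
    # Length filter: applied to the read as it is after trimming.
    if len(seq) < min_length:
        return "too_short"
    return None


print(rejection_reason("ACGT" * 5, "I" * 20))
print(rejection_reason("N" * 6 + "ACGT" * 5, "I" * 26))
```

Returning a reason instead of a boolean makes it easier to count how many reads each stage removes when auditing a run.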

Customization #

For paired-end data, add r2 to the input and apply matching steps to segment = 'read2'.

Adapter sequences: Replace the query value in the ExtractIUPAC step with your adapter. For common adapters see the adapters reference.

Stricter quality filtering:

min_value = 0.80  # keep reads with ≥80% bases above threshold

Longer minimum length (e.g., for alignment):

min_value = 25

When to Use This #

  • When you want to match fastp’s default preprocessing for comparison or reproducibility
  • As a starting point for custom workflows based on fastp’s defaults
  • When you need to inspect or audit each filtering step independently

Downstream Analysis #

After preprocessing:

  1. Alignment to reference genome (BWA, STAR, Bowtie2)
  2. Quantification of gene expression
  3. Variant calling with improved accuracy from cleaner input data

Download #

Download 09-fastp-equivalent.tar.gz for a complete, runnable example including expected output files.

Configuration File #

[input]
  r1 = 'R1.fq'



# Equivalent to -g | --trim_poly_g.
# First we must find out where the polyG tails are.
[[step]]
  action = "ExtractPolyTail"
  out_label = "polyG"
  min_length = 10 # --poly_g_min_len, default is 10 in fastp
  base = "G"
  max_mismatch_rate = 0.125 # not configurable in fastp, but 1/8 is what it uses internally
  max_consecutive_mismatches = 5 # not configurable in fastp

# Then cut them off.
[[step]]
  action = 'TrimAtTag'
  in_label = 'polyG'
  direction = 'end'
  keep_tag = false

# --adapter_sequence AGAT...
# (auto-detected in fastp after a few thousand reads)
[[step]]
  action = 'ExtractIUPAC'
  out_label = 'adapter'
  query = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
  anchor = 'anywhere'
  max_mismatches = 5

[[step]]
  action = 'TrimAtTag'
  in_label = 'adapter'
  direction = 'end'
  keep_tag = false

# fastp trims first, then calculates the quality metrics.

# --qualified_quality_phred 15
[[step]]
  action = 'CalcQualifiedBases'
  out_label = 'qual_base_rate'
  threshold = 48 # qualities are ASCII-encoded FASTQ values, not Phred scores (15 + 33 = 48)
  operator = ">="
  relative = true # we want a rate, not a count


# --n_base_limit 5
[[step]]
  action = 'CalcNCount'
  out_label = 'n_count'
  relative = false

[[step]]
  action = 'FilterByNumericTag'
  max_value = 6
  keep_or_remove = 'keep'
  in_label = 'n_count'

# --unqualified_percent_limit 40: keep reads with at least 60% qualified bases
[[step]]
  action = 'FilterByNumericTag'
  min_value = 0.60
  keep_or_remove = 'keep'
  in_label = 'qual_base_rate'

# --length_required 15
[[step]]
  action = 'CalcLength'
  out_label = 'length_r1'

[[step]]
  action = 'FilterByNumericTag'
  min_value = 15
  keep_or_remove = 'keep'
  in_label = 'length_r1'

[output]
  prefix = 'output'