Cookbook 05: Quality Filtering

Use Case

You have sequencing data of varying quality and want to remove low-quality reads before downstream analysis. Low-quality reads can introduce errors into variant calling, assembly, and other analyses.

What This Pipeline Does

This cookbook demonstrates quality-based filtering using expected-error calculation (the complete configuration is listed at the end of this page):

  1. Calculate Expected Errors: Uses CalcExpectedError to compute the expected number of base call errors per read based on quality scores
  2. Filter Low-Quality Reads: Uses FilterByNumericTag to remove reads exceeding an error threshold
  3. Generate Reports: Creates quality reports before and after filtering to show improvement

Understanding Expected Error

Expected error (EE) is a more nuanced quality metric than the average quality score:

  • Formula: EE = sum of the per-base error probabilities, where a base with Phred quality score Q has error probability 10^(-Q/10)
  • Example: A read with quality scores Q30, Q30, Q20, Q30 has EE = 0.001 + 0.001 + 0.01 + 0.001 = 0.013
  • Interpretation: Lower EE = higher-confidence read
  • Threshold: A common threshold is EE ≤ 1.0 (expect at most one error per read)

Quality scores and their corresponding error probabilities (see the sketch after this list):

  • Q20 = 1% error rate (0.01 probability)
  • Q30 = 0.1% error rate (0.001 probability)
  • Q40 = 0.01% error rate (0.0001 probability)
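
The conversion is simple enough to check by hand. Here is a minimal Python sketch of the expected-error calculation, for illustration only; in the pipeline itself this step is performed by CalcExpectedError:

def expected_error(quality_scores):
    # A Phred quality score Q corresponds to an error probability of 10^(-Q/10);
    # the expected error is the sum of these probabilities over all bases.
    return sum(10 ** (-q / 10) for q in quality_scores)

# The example read from above: Q30, Q30, Q20, Q30
print(expected_error([30, 30, 20, 30]))  # ≈ 0.013, passes an EE ≤ 1.0 filter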

Expected Results

With the provided sample data:

  • Input: 10 reads (5 high-quality, 5 low-quality)
  • Output: 5 high-quality reads (low-quality reads removed)
  • Reports: Before/after quality comparison showing improved quality metrics

Customization

Adjust the filtering threshold based on your application (see the sketch after this list):

  • Strict filtering (EE ≤ 0.5): For applications requiring the highest accuracy (variant calling, metagenomics)
  • Standard filtering (EE ≤ 1.0): General-purpose filtering (shown in this cookbook)
  • Relaxed filtering (EE ≤ 2.0): When read depth is more important than individual read accuracy
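
Changing the threshold only means editing max_value in the FilterByNumericTag step of the configuration below; the strict variant, for example, would look like this:

[[step]]
    # Stricter filtering: keep only reads with EE ≤ 0.5
    action = 'FilterByNumericTag'
    in_label = 'expected_error'
    max_value = 0.5
    keep_or_remove = 'keep'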

You can also filter by other quality metrics (sketched after this list):

  • CalcQualifiedBases: Count bases above a quality threshold
  • FilterByNumericTag with min_value: Keep reads with enough high-quality bases
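
A sketch of that alternative follows. It assumes CalcQualifiedBases takes a segment and an out_label analogous to CalcExpectedError; the min_quality parameter name is an assumption made for illustration, so check the tool's reference for the exact spelling:

[[step]]
    # Count bases at or above the quality threshold for each read
    action = 'CalcQualifiedBases'
    segment = 'read1'
    out_label = 'qualified_bases'
    min_quality = 20    # hypothetical parameter name - consult the tool's reference

[[step]]
    # Keep reads with at least 50 bases above the threshold
    action = 'FilterByNumericTag'
    in_label = 'qualified_bases'
    min_value = 50
    keep_or_remove = 'keep'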

When to Use This

  • After initial quality assessment (see Cookbook 01)
  • Before alignment or assembly
  • When downstream tools are sensitive to sequencing errors
  • To reduce computational burden by removing unreliable data

Downstream Analysis

After quality filtering:

  1. Alignment to reference genome (BWA, Bowtie2, STAR)
  2. Variant calling with higher confidence
  3. Assembly with cleaner input data
  4. Quantification with reduced noise

Download

Download 05-quality-filtering.tar.gz for a complete, runnable example including expected output files.

Configuration File

[input]
    read1 = 'input/sample_R1.fq'

[[step]]
    # Generate initial quality report to assess data quality
    action = 'Report'
    name = 'initial'
    base_statistics = true

[[step]]
    # Calculate expected error (EE) for each read
    # EE = sum of error probabilities across all bases
    # Lower EE indicates higher quality reads
    action = 'CalcExpectedError'
    segment = 'read1'
    out_label = 'expected_error'
    aggregate = 'max'

[[step]]
    # Filter reads based on expected error threshold
    # Keep only reads with EE ≤ 1.0 (expect ≤1 error per read)
    action = 'FilterByNumericTag'
    in_label = 'expected_error'
    max_value = 1.0
    keep_or_remove = 'keep'

[[step]]
    # Generate report after filtering to show quality improvement
    action = 'Report'
    name = 'filtered'
    base_statistics = true

[output]
    prefix = 'reference_output/output'
    format = 'Fastq'
    report_html = true # reports require setting at least one of report_html or report_json