Cookbook 04: PhiX Removal #
Use Case #
You have Illumina PhiX spike-in sequences in your dataset and want to remove those contaminating reads before downstream analysis. PhiX is commonly added as a control to increase base diversity during sequencing runs.
What This Pipeline Does #
This cookbook demonstrates how to identify and remove PhiX contamination using k-mer counting:
- Count k-mers: Uses
CalcKmersto count how many 30-mers from each read match the PhiX genome - Export data: Saves k-mer counts to a TSV table for analysis
- Filter reads: Removes reads with high PhiX k-mer counts (≥25 matching k-mers)
Understanding the Approach #
K-mer Counting #
The CalcKmers step counts how many k-mers (short subsequences of length k) from each read are present in the PhiX reference genome:
- k = 30: Uses 30-base-pair k-mers (longer k-mers = more specific matching)
- canonical = true: Counts both forward and reverse complement k-mers
- Reads stemming from PhiX will have many matching k-mers
- Clean reads will have very few or zero matching k-mers
Analyzing the Data: Histogram and Threshold Selection #
After running the initial pipeline with StoreTagsInTable,
examine the output TSV file to understand your data distribution:
# View the k-mer histogram with common command-line tools
# get second column, throw away header line, sort numerically, count unique occurrences
cut -f2 output_without_phix_kmer_analysis.tsv | tail -n +2 | sort -n | uniq -c
# alternativly, use awk for steps 1 & 2
awk 'NR>1 {print $2}' output_without_phix_kmer_analysis.tsv | sort -n | uniq -c
Example output interpretation:
4 0 # 4 reads with 0 PhiX k-mers (clean reads)
4 32 # 4 reads with 32 PhiX k-mers (PhiX contamination)
This bimodal distribution (two clear peaks) makes it easy to choose a threshold. Here, any value between 1-31 would work; we chose 25 as a conservative threshold.
For larger datasets (first 1000 reads recommended):
- Run pipeline with
Headstep to sample reads:[[step]]withaction = "Head"andn = 1000 - Export the table and analyze the distribution in your preferred tool
- Look for a clear separation between clean and contaminated reads
- Choose a threshold in the “valley” between peaks
Filtering Approaches #
This cookbook demonstrates three equivalent ways to filter PhiX-contaminated reads:
Approach 1: FilterByNumericTag (Simplest, shown in input.toml) #
[[step]]
action = 'FilterByNumericTag'
in_label = "phix_kmer_count"
min_value = 25
keep_or_remove = 'remove' # Remove reads with ≥25 k-mers
Best for: Direct numeric filtering with a single threshold.
Approach 2: EvalExpression + Demultiplex (Most Flexible) #
# Create a boolean tag based on k-mer count threshold
[[step]]
action = "EvalExpression"
expression = "phix_kmer_count >= 25"
out_label = "is_phix"
result_type = "bool"
# Separate clean reads from PhiX-contaminated reads
# For boolean tags, Demultiplex creates files with pattern: {prefix}_{label}={value}_{segment}.fq
[[step]]
action = "Demultiplex"
in_label = "is_phix"
# No [barcodes] section needed for boolean demultiplexing!
# Output files will be:
# - output_is_phix=true_read1.fq (PhiX-contaminated reads)
# - output_is_phix=false_read1.fq (clean reads)
Best for:
- Separating reads into multiple output files (contaminated vs. clean)
- Complex boolean expressions combining multiple criteria
- When you want to keep both categories for quality control
Usage #
Using the Standard Approach (FilterByNumericTag) #
# Run the pipeline
cd 04-phiX-removal
mbf-fastq-processor input.toml
# Check the results
head output_without_phix_kmer_analysis.tsv # View k-mer counts
grep -c "^@" output_without_phix_read1.fq # Count output reads (should be 4)
Using the Demultiplex Approach (Alternative) #
To try the EvalExpression + Demultiplex approach that separates reads into two files:
# Run the alternative pipeline
mbf-fastq-processor input_demultiplex.toml
# Check the results
head output_kmer_analysis.tsv # View k-mer counts
grep -c "^@" output_is_phix=false_read1.fq # Count clean reads (should be 4)
grep -c "^@" output_is_phix=true_read1.fq # Count PhiX reads (should be 4, have phiX in name.)
Expected Results #
With the provided sample data:
- Input: 8 reads (4 PhiX, 4 clean)
- Output: 4 clean reads (PhiX reads removed)
- Table: Shows k-mer counts for all input reads (32 for PhiX, 0 for clean)
Customization #
Adjust these parameters based on your data:
- k: Larger values (e.g., 35) = more specific; smaller values (e.g., 21) = more sensitive
- min_count: In
CalcKmers, filters rare k-mers from the reference (default: 1) - threshold: Adjust
min_valueinFilterByNumericTagbased on your histogram analysis
Download #
Download 04-phiX-removal.tar.gz for a complete, runnable example including expected output files.
Configuration File #
[input]
read1 = 'input/phix_sample.fq'
[[step]]
# Count k-mers matching the PhiX genome
# Reads with many matching k-mers are likely PhiX contamination
action = 'CalcKmers'
out_label = 'phix_kmer_count'
segment = 'read1'
canonical = true # Count both forward & reverse complement k-mers
files = ["input/NC_001422.1.fasta"]
k = 30
[[step]]
# Export k-mer counts to TSV for histogram analysis
action = 'StoreTagsInTable'
infix = 'kmer_analysis'
[[step]]
# Remove reads with ≥25 matching PhiX k-mers
action = 'FilterByNumericTag'
in_label = "phix_kmer_count"
min_value = 25
keep_or_remove = 'Remove'
[output]
prefix = 'reference_output_without_phix/output'
format = "Fastq"