Cookbooks on mbf-fastq-processor documentation

Mon, 01 Jan 0001 00:00:00 +0000

Cookbook 01: Basic Quality Report #

Use Case #

You have FASTQ files from a sequencing run and want to generate comprehensive quality reports to assess:

- Read quality scores
- Base composition
- Read length distribution
- Duplicate read counts

This is typically the first step in any sequencing data analysis: it establishes data quality before any downstream processing.

What This Pipeline Does #

1. Reads input FASTQ file(s)
2. Generates a comprehensive quality report including:
   - Base quality statistics
   - Base distribution across positions
   - Read length distribution
   - Duplicate read counting
3. Outputs reports in both HTML (human-readable) and JSON (machine-readable) formats
4. Passes through all reads unchanged (no filtering)
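
To make the reported statistics concrete, here is a minimal Python sketch of how a read length distribution, a duplicate count, and a mean base quality could be derived from FASTQ records. It is an illustration only, not the tool's implementation: the function names `read_fastq` and `summarize`, the returned dictionary keys, the Phred+33 offset, and the exact in-memory duplicate counting are all assumptions made for this example.

```python
import io
from collections import Counter


def read_fastq(handle):
    """Yield (name, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        name = handle.readline().rstrip("\n")
        if not name:  # end of input
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield name, seq, qual


def summarize(records, phred_offset=33):
    """Collect simple QC statistics (hypothetical helper, not the tool's API)."""
    lengths = Counter()  # read length -> number of reads
    seen = Counter()     # sequence -> number of occurrences
    qual_sum = 0
    n_bases = 0
    for _name, seq, qual in records:
        lengths[len(seq)] += 1
        seen[seq] += 1
        qual_sum += sum(ord(c) - phred_offset for c in qual)
        n_bases += len(qual)
    # Every occurrence of a sequence beyond the first counts as a duplicate.
    duplicates = sum(n - 1 for n in seen.values())
    return {
        "length_distribution": dict(lengths),
        "duplicate_reads": duplicates,
        "mean_quality": qual_sum / n_bases if n_bases else 0.0,
    }


# Three toy records: two identical 4 bp reads and one 5 bp read.
example = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nACGT\n+\nIIII\n"
    "@r3\nACGTA\n+\nIIIII\n"
)
summary = summarize(read_fastq(example))
print(summary)
```

Exact counting like this keeps every distinct sequence in memory, so it only suits small inputs; it is meant to show what the numbers mean, not how to compute them at scale.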

Input Files #

input/sample_R1.fq - Forward reads (Read 1) from paired-end sequencing

Output Files #

output_R1.fq - Passed-through reads (identical to input)
output.report_initial.html - HTML quality report
output.report_initial.json - JSON quality report with detailed statistics

When to Use This #

- First analysis of new sequencing data
- Quality control before committing to expensive downstream analysis
- Comparing data quality across different sequencing runs
- Identifying potential issues (adapter contamination, quality drop-off, etc.)

Download #

Download 01-basic-quality-report.tar.gz for a complete, runnable example including expected output files.

Cookbook 02: UMI Extraction #

Use Case #

You have sequencing data with Unique Molecular Identifiers (UMIs) embedded in the reads. UMIs are short random barcodes added during library preparation that allow you to:

- Identify and remove PCR duplicates
- Distinguish true biological duplicates from amplification artifacts
- Improve accuracy in quantitative analyses (RNA-seq, ATAC-seq, etc.)

What This Pipeline Does #

1. Reads an input FASTQ file with UMIs at the start of read1
2. Extracts the UMI sequence (first 8 bases) and creates a tag
3. Stores the UMI in the read comment (FASTQ header)
4. Removes the UMI bases from the read sequence (so they don’t interfere with alignment)
5. Outputs modified reads with the UMI preserved in the header

Input Files #

input/sample_R1.fq - Reads with 8bp UMI at the start

Output Files #

output_R1.fq - Reads with UMI in comment, UMI bases removed from sequence

Configuration Highlights #

[[step]]
 # Extract UMI from positions 0-7 (8 bases)
 action = 'ExtractRegions'
 label = 'umi'
 regions = [{segment = 'read1', start = 0, length = 8}]

[[step]]
 # Store UMI in the FASTQ comment
 action = 'StoreTagInComment'
 label = 'umi'

[[step]]
 # Remove the UMI bases from the read
 action = 'CutStart'
 target = 'Read1'
 n = 8
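
The effect of these three steps on a single record can be sketched in Python. This is an illustration under stated assumptions, not the tool's code: the helper name `extract_umi` is hypothetical, and the ` umi=<sequence>` comment layout is assumed for the example, so the header text the tool actually writes may differ.

```python
def extract_umi(name, seq, qual, umi_len=8):
    """Move the leading UMI from the sequence into the header comment.

    Hypothetical helper mirroring the three configured steps:
    extract the region, store it in the comment, cut the start.
    """
    umi = seq[:umi_len]                # ExtractRegions: start = 0, length = 8
    new_name = f"{name} umi={umi}"     # StoreTagInComment (assumed format)
    # CutStart: drop the UMI bases AND their quality scores so that
    # sequence and quality stay the same length.
    return new_name, seq[umi_len:], qual[umi_len:]


record = extract_umi("@read1", "AACCGGTTACGTACGT", "IIIIIIIIFFFFFFFF")
print(record)
```

Trimming the quality string together with the sequence matters: a FASTQ record whose sequence and quality lengths disagree is invalid for most aligners.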

Workflow Details #

Before processing:

Cookbook 03: Lexogen QuantSeq Processing #

Use Case #

Lexogen QuantSeq is a popular 3’ mRNA sequencing protocol optimized for gene expression profiling. The library structure includes:

- First 8 bases: UMI (Unique Molecular Identifier) for deduplication
- Next 6 bases: Random hexamer primer sequence (needs removal)
- Remaining sequence: Actual cDNA from the 3’ end of transcripts

This cookbook demonstrates the standard preprocessing for QuantSeq data before alignment.

What This Pipeline Does #

1. Extracts the 8bp UMI from the start of reads
2. Stores the UMI in the read comment (FASTQ header)
3. Removes the first 14 bases total (8bp UMI + 6bp random hexamer)
4. Outputs processed reads ready for alignment
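
The read layout described above (8 bases of UMI, 6 bases of random hexamer, then cDNA) can be made explicit with a small Python sketch. The names `QuantSeqRead` and `split_quantseq_read` are hypothetical, and raising an error on reads shorter than 14 bases is a choice made for this example; how the tool itself treats such reads is not covered here.

```python
from typing import NamedTuple

UMI_LEN = 8      # first 8 bases: UMI
HEXAMER_LEN = 6  # next 6 bases: random hexamer primer sequence


class QuantSeqRead(NamedTuple):
    umi: str
    hexamer: str
    cdna: str
    cdna_qual: str


def split_quantseq_read(seq: str, qual: str) -> QuantSeqRead:
    """Split a raw read into UMI, hexamer, and cDNA (illustrative only)."""
    prefix_len = UMI_LEN + HEXAMER_LEN  # 14 bases removed in total
    if len(seq) != len(qual):
        raise ValueError("sequence and quality must have the same length")
    if len(seq) < prefix_len:
        raise ValueError(f"read is shorter than {prefix_len} bases")
    return QuantSeqRead(
        umi=seq[:UMI_LEN],
        hexamer=seq[UMI_LEN:prefix_len],
        cdna=seq[prefix_len:],
        cdna_qual=qual[prefix_len:],
    )


parts = split_quantseq_read("AAAAAAAACCCCCCGGGGTTTT", "IIIIIIIIFFFFFF########")
print(parts)
```

Only the UMI is kept (in the header) for later deduplication; the hexamer is discarded along with it, which is why the cut length is 14 rather than 8.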

Input Files #

input/quantseq_sample.fq - Raw QuantSeq reads with UMI and random hexamer

Output Files #

output_read1.fq - Processed reads with:
- UMI stored in comment
- First 14bp removed
- Ready for alignment to reference genome

Workflow Details #

Raw read structure:

Cookbook 04: PhiX Removal #

Use Case #

You have Illumina PhiX spike-in sequences in your dataset and want to remove those contaminating reads before downstream analysis. PhiX is commonly added as a control to increase base diversity during sequencing runs.

What This Pipeline Does #

This cookbook demonstrates how to identify and remove PhiX contamination using k-mer counting:

1. Count k-mers: Uses CalcKmers to count how many 30-mers from each read match the PhiX genome
2. Export data: Saves k-mer counts to a TSV table for analysis
3. Filter reads: Removes reads with high PhiX k-mer counts (≥25 matching k-mers)
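
The count-then-filter logic can be sketched in a few lines of Python. This is a simplified illustration, not how CalcKmers is implemented: the function names are hypothetical, the reference is a toy 60-base string rather than the PhiX genome, and reverse-complement matching is deliberately left out.

```python
def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_index(reference, k):
    """Set of all k-mers in the reference (forward strand only; an assumption)."""
    return set(kmers(reference.upper(), k))


def count_matching_kmers(read, index, k):
    """Number of k-mers in the read that also occur in the reference."""
    return sum(1 for kmer in kmers(read.upper(), k) if kmer in index)


def keep_read(read, index, k=30, min_hits=25):
    """Keep the read unless it has at least min_hits matching k-mers."""
    return count_matching_kmers(read, index, k) < min_hits


# Toy stand-in for the PhiX genome (NOT real PhiX sequence).
reference = "ACGT" * 15
index = build_index(reference, 30)

contaminant = reference[3:57]   # 54 bases -> 54 - 30 + 1 = 25 k-mers, all match
short_match = reference[:53]    # 53 bases -> only 24 k-mers, below threshold
unrelated = "A" * 54            # no k-mer occurs in the reference

for read in (contaminant, short_match, unrelated):
    print(count_matching_kmers(read, index, 30), keep_read(read, index))
```

A read of length L contains L - 30 + 1 overlapping 30-mers, so with a threshold of 25 only reads of at least 54 bases can ever be removed in this sketch; shorter reads would need a lower threshold.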

Understanding the Approach #

K-mer Counting #

The CalcKmers step counts how many k-mers (short subsequences of length k) from each read are present in the PhiX reference genome: