Cookbooks on fastqrab documentation

Mon, 01 Jan 0001 00:00:00 +0000

Cookbook 01: Basic Quality Report #

Use Case #

You have FastQ files from a sequencing run and want to generate comprehensive quality reports to assess:

Read quality scores
Base composition
Read length distribution
Duplicate read counts

This is typically the first step in any sequencing data analysis to understand data quality before downstream processing.

What This Pipeline Does #

Reads input FastQ file(s)
Generates a comprehensive quality report including:
  - Base quality statistics
  - Base distribution across positions
  - Read length distribution
  - Duplicate read counting
Outputs reports in both HTML (human-readable) and JSON (machine-readable) formats
Passes through all reads unchanged (no filtering)
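
To make the report contents concrete, here is a minimal Python sketch of the kind of statistics listed above. It is not the tool's implementation: the function name `summarize_reads`, the in-memory list of reads, the Phred+33 encoding, and the exact-match definition of a duplicate are all assumptions made for illustration.

```python
from collections import Counter

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def summarize_reads(reads):
    """Summarize a list of (sequence, quality_string) tuples (hypothetical helper)."""
    # Read length distribution: length -> number of reads.
    length_dist = Counter(len(seq) for seq, _ in reads)

    # Exact-match duplicates: every copy beyond the first counts as one duplicate.
    seq_counts = Counter(seq for seq, _ in reads)
    duplicates = sum(n - 1 for n in seq_counts.values())

    # Base composition per position: position -> Counter of bases.
    base_by_pos = {}
    for seq, _ in reads:
        for pos, base in enumerate(seq):
            base_by_pos.setdefault(pos, Counter())[base] += 1

    # Mean Phred score over all bases in all reads.
    scores = [ord(c) - PHRED_OFFSET for _, qual in reads for c in qual]
    mean_quality = sum(scores) / len(scores) if scores else 0.0

    return {
        "length_distribution": dict(length_dist),
        "duplicates": duplicates,
        "base_by_position": base_by_pos,
        "mean_quality": mean_quality,
    }


example = summarize_reads([("ACGT", "IIII"), ("ACGT", "IIII"), ("AC", "II")])
```

A real report tool may estimate duplicates approximately to save memory; the exact counting here only illustrates the idea.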

Input Files #

input/sample_R1.fq - Forward reads (Read 1) from paired-end sequencing

Output Files #

output_R1.fq - Passed-through reads (identical to input)
output.report_initial.html - HTML quality report
output.report_initial.json - JSON quality report with detailed statistics

When to Use This #

First analysis of new sequencing data
Quality control before committing to expensive downstream analysis
Comparing data quality across different sequencing runs
Identifying potential issues (adapter contamination, quality drop-off, etc.)

Download #

Download 01-basic-quality-report.tar.gz for a complete, runnable example including expected output files.

Cookbook 02: UMI Extraction #

Use Case #

You have sequencing data with Unique Molecular Identifiers (UMIs) embedded in the reads. UMIs are short random barcodes added during library preparation that allow you to:

Identify and remove PCR duplicates
Distinguish true biological duplicates from amplification artifacts
Improve accuracy in quantitative analyses (RNA-seq, ATAC-seq, etc.)

What This Pipeline Does #

Reads input FastQ file with UMIs at the start of read1
Extracts the UMI sequence (first 8 bases) and creates a tag
Stores the UMI in the read comment (FASTQ header)
Removes the UMI bases from the read sequence (so they don’t interfere with alignment)
Outputs modified reads with UMI preserved in the header

Input Files #

input/sample_R1.fq - Reads with 8bp UMI at the start

Output Files #

output_R1.fq - Reads with UMI in comment, UMI bases removed from sequence

Configuration Highlights #

[[step]]
 # Extract UMI from positions 0-7 (8 bases)
 action = 'ExtractRegions'
 label = 'umi'
 regions = [{source = 'read1', start = 0, length = 8, anchor="Start"}]

[[step]]
 # Store UMI in the FASTQ comment
 action = 'StoreTagInComment'
 label = 'umi'

[[step]]
 # Remove the UMI bases from the read
 action = 'CutStart'
 target = 'Read1'
 n = 8
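
The combined effect of the three steps above can be sketched in plain Python. This is an illustration only, not the tool's code: the function name `extract_umi` and the ` umi=<sequence>` comment format are assumptions, and the actual header layout written by `StoreTagInComment` may differ.

```python
def extract_umi(header, seq, qual, umi_len=8):
    """Move the first `umi_len` bases of a read into its header (hypothetical helper).

    Returns the modified (header, sequence, quality) triple.
    """
    # Step 1: take the UMI from positions 0..umi_len-1 of the read.
    umi = seq[:umi_len]

    # Step 2: store it in the FASTQ comment.
    # Assumption: the comment is appended as " umi=<sequence>".
    new_header = f"{header} umi={umi}"

    # Step 3: cut the UMI bases from the start of the read.
    # Quality values must be trimmed by the same amount to stay in sync.
    return new_header, seq[umi_len:], qual[umi_len:]


record = extract_umi("@read1", "ACGTACGTTTTTGGGG", "IIIIIIIIFFFFFFFF")
```

Trimming sequence and quality together matters: a FASTQ record with mismatched sequence and quality lengths is invalid.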

Workflow Details #

Before processing:

Cookbook 03: Lexogen QuantSeq Processing #

Use Case #

Lexogen QuantSeq is a popular 3’ mRNA sequencing protocol optimized for gene expression profiling. The library structure includes:

First 8 bases: UMI (Unique Molecular Identifier) for deduplication
Next 6 bases: Random hexamer primer sequence (needs removal)
Remaining sequence: Actual cDNA from the 3’ end of transcripts

This cookbook demonstrates the standard preprocessing for QuantSeq data before alignment.

What This Pipeline Does #

Extracts the 8bp UMI from the start of reads
Stores the UMI in the read comment (FASTQ header)
Removes the first 14 bases total (8bp UMI + 6bp random hexamer)
Outputs processed reads ready for alignment
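
As a rough illustration of the read layout described in this cookbook, the following Python sketch splits a raw read into its three parts. The names `QuantSeqParts` and `split_quantseq_read` are hypothetical, and the 8 + 6 layout is taken from the description above rather than from any Lexogen specification.

```python
from typing import NamedTuple

UMI_LEN = 8      # first 8 bases: UMI
HEXAMER_LEN = 6  # next 6 bases: random hexamer primer


class QuantSeqParts(NamedTuple):
    umi: str
    hexamer: str
    cdna: str


def split_quantseq_read(seq):
    """Split a raw read into UMI, random hexamer, and cDNA (hypothetical helper)."""
    cut = UMI_LEN + HEXAMER_LEN  # 14 bases removed in total
    return QuantSeqParts(
        umi=seq[:UMI_LEN],
        hexamer=seq[UMI_LEN:cut],
        cdna=seq[cut:],
    )


parts = split_quantseq_read("AAAAAAAA" + "CCCCCC" + "GATTACA")
```

Only the `cdna` part would go on to alignment; the `umi` part is what gets stored in the read comment.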

Input Files #

input/quantseq_sample.fq - Raw QuantSeq reads with UMI and random hexamer

Output Files #

output_read1.fq - Processed reads with:
- UMI stored in comment
- First 14bp removed
- Ready for alignment to reference genome

Workflow Details #

Raw read structure:

Cookbook 04: PhiX Removal #

Use Case #

You have Illumina PhiX spike-in sequences in your dataset and want to remove those contaminating reads before downstream analysis. PhiX is commonly added as a control to increase base diversity during sequencing runs.

What This Pipeline Does #

This cookbook demonstrates how to identify and remove PhiX contamination using k-mer counting:

Count k-mers: Uses CalcKmers to count how many 30-mers from each read match the PhiX genome
Export data: Saves k-mer counts to a TSV table for analysis
Filter reads: Removes reads with high PhiX k-mer counts (≥25 matching k-mers)
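
The count-then-filter idea can be sketched in a few lines of Python. This is a simplified illustration, not how `CalcKmers` is implemented: the helper names are hypothetical, and the sketch assumes forward-strand matching only (a real tool may also index reverse-complement k-mers).

```python
def kmers(seq, k):
    """Yield every overlapping k-mer of `seq` (nothing if the read is shorter than k)."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_kmer_index(reference, k=30):
    """Collect all k-mers of the reference genome into a set for fast lookup."""
    return set(kmers(reference, k))


def count_matching_kmers(read, index, k=30):
    """Count how many k-mers of the read occur in the reference index."""
    return sum(1 for kmer in kmers(read, k) if kmer in index)


def is_contaminant(read, index, k=30, min_hits=25):
    """Flag a read for removal when it shares at least `min_hits` k-mers."""
    return count_matching_kmers(read, index, k) >= min_hits


# Toy example with k=4 so the numbers are easy to check by hand.
toy_index = build_kmer_index("ACGTACGGTCA", k=4)
toy_hits = count_matching_kmers("ACGTAC", toy_index, k=4)
```

A read of length L contains L - k + 1 overlapping k-mers, so with k = 30 a threshold of 25 matches can only be reached by reads of at least 54 bases.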

Understanding the Approach #

K-mer Counting #

The CalcKmers step counts how many k-mers (short subsequences of length k) from each read are present in the PhiX reference genome:

Cookbook 05: Quality Filtering #

Use Case #

You have sequencing data with varying quality and want to remove low-quality reads before downstream analysis. Poor quality reads can introduce errors in variant calling, assembly, and other analyses.

What This Pipeline Does #

This cookbook demonstrates quality-based filtering using expected error calculation:

Calculate Expected Errors: Uses CalcExpectedError to compute the expected number of base call errors per read based on quality scores
Filter Low-Quality Reads: Uses FilterByNumericTag to remove reads exceeding an error threshold
Generate Reports: Creates quality reports before and after filtering to show improvement
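
For readers who want to see the arithmetic, here is a minimal Python sketch of an expected-error calculation. It uses the standard definition, where a Phred score Q corresponds to an error probability of 10^(-Q/10) and the expected error is the sum of those probabilities over the read. The function names, the Phred+33 offset, and the `max_ee` threshold value are assumptions for illustration, not the tool's defaults.

```python
def expected_error(qual, offset=33):
    """Sum of per-base error probabilities for one quality string.

    Assumes Phred+33 encoding: Q = ord(char) - 33, P(error) = 10 ** (-Q / 10).
    """
    return sum(10 ** (-(ord(c) - offset) / 10) for c in qual)


def keep_read(qual, max_ee=1.0):
    """Keep a read whose expected error does not exceed `max_ee`.

    `max_ee` is a hypothetical threshold; whether the real filter treats
    the boundary as inclusive is also an assumption.
    """
    return expected_error(qual) <= max_ee


ee_high_quality = expected_error("IIII")  # four bases at Q40
ee_low_quality = expected_error("++")     # two bases at Q10
```

Because the probabilities are summed rather than averaged, a single very poor base contributes far more than several mediocre ones, which an average quality score would hide.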

Understanding Expected Error #

Expected error (EE) is a more nuanced quality metric than average quality score:

Cookbook 06: Adapter Trimming with PolyA Tail Removal #

Use Case #

You have RNA-seq data that contains:

PolyA tails: Stretches of A bases at the 3’ end (or polyT at 5’ for reverse strand)
Sequencing adapters: Illumina or other adapter sequences that need removal before alignment

These artifacts can interfere with alignment and downstream analysis if not removed.

What This Pipeline Does #

This cookbook demonstrates a complete adapter and polyA trimming workflow:

Cookbook 07: Demultiplexing by Inline Barcode #

Use Case #

You have pooled sequencing data from multiple samples that were tagged with unique barcode sequences during library preparation, and the data has not been demultiplexed by your sequencing facility.

You need to:

Extract the barcode(s) from each read
Correct sequencing errors in barcodes
Separate reads into individual files per sample

This is common in multiplexed sequencing runs to maximize sequencing efficiency and reduce costs.
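
The three tasks listed above (extract, correct, separate) can be sketched as follows. This is an illustration under stated assumptions, not the cookbook's configuration: the helper names are hypothetical, the barcode is assumed to sit at the start of the read, correction is assumed to use Hamming distance with at most one mismatch, and the barcode is assumed to be removed from the output read.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def correct_barcode(observed, known, max_dist=1):
    """Return the single known barcode within `max_dist` mismatches, else None."""
    hits = [bc for bc in known
            if len(bc) == len(observed) and hamming(observed, bc) <= max_dist]
    # Ambiguous matches (more than one candidate) are left unassigned.
    return hits[0] if len(hits) == 1 else None


def demultiplex(reads, barcodes, bc_len):
    """Group reads by sample. `barcodes` maps barcode sequence -> sample name."""
    bins = {}
    for seq in reads:
        bc = correct_barcode(seq[:bc_len], barcodes)
        sample = barcodes[bc] if bc is not None else "unassigned"
        # Assumption: the barcode bases are removed from the output read.
        bins.setdefault(sample, []).append(seq[bc_len:])
    return bins


samples = {"ACGT": "s1", "TTGG": "s2"}
bins = demultiplex(["ACGTAAAA", "ACGAAAAA", "TTGGCCCC", "GGGGCCCC"], samples, bc_len=4)
```

Here the second read carries one sequencing error in its barcode (`ACGA`) and is still assigned to `s1`, while `GGGG` is too far from both barcodes and lands in `unassigned`.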

Cookbook 08: Read Length Filtering and Truncation #

Use Case #

You have sequencing data with variable read lengths and need to:

Remove reads that are too short (may align poorly or represent artifacts)
Remove reads that are too long (may indicate technical issues)
Truncate all reads to a uniform length (required by some downstream tools)

Read length filtering is important for:

Quality control after adapter trimming
Preparing data for tools that require uniform read lengths
Removing degraded or artifactual sequences
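
A minimal Python sketch of the filter-then-truncate logic described above follows. The function name and the three parameters are hypothetical and do not correspond to configuration keys of the tool.

```python
def filter_and_truncate(reads, min_len, max_len, truncate_to):
    """Drop reads outside [min_len, max_len], then cut survivors to `truncate_to`.

    `reads` is a list of (sequence, quality_string) tuples.
    """
    kept = []
    for seq, qual in reads:
        if not (min_len <= len(seq) <= max_len):
            continue  # too short or too long: discard
        # Truncate sequence and quality together so they stay the same length.
        kept.append((seq[:truncate_to], qual[:truncate_to]))
    return kept


reads = [
    ("ACGTACGT", "IIIIIIII"),  # length 8: kept, truncated to 5
    ("ACG", "III"),            # length 3: too short
    ("A" * 20, "I" * 20),      # length 20: too long
]
result = filter_and_truncate(reads, min_len=5, max_len=10, truncate_to=5)
```

The output has uniform length only if `truncate_to` is no larger than `min_len`; otherwise reads shorter than `truncate_to` pass through at their original length.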

What This Pipeline Does #

This cookbook demonstrates comprehensive read length management:

Cookbook 09: Fastp-Equivalent Workflow #

Use Case #

You want to replicate the default behavior of fastp — a popular all-in-one FASTQ preprocessor — using a configurable pipeline. This is useful when you need reproducible, step-by-step control over each filtering stage, or want to extend the workflow beyond what fastp offers.

What This Pipeline Does #

This cookbook replicates fastp’s default single-end processing pipeline:

PolyG Trimming: Uses ExtractPolyTail + TrimAtTag to remove polyG tails (Illumina NextSeq/NovaSeq artifact)
Adapter Trimming: Uses ExtractIUPAC + TrimAtTag to remove the Illumina TruSeq R1 adapter
N-base Filtering: Uses CalcNCount + FilterByNumericTag to remove reads with too many ambiguous bases (--n_base_limit 5)
Quality Filtering: Uses CalcQualifiedBases + FilterByNumericTag to remove reads with too many low-quality bases (--qualified_quality_phred 15, --unqualified_percent_limit 40)
Length Filtering: Uses CalcLength + FilterByNumericTag to remove reads shorter than 15bp (--length_required 15)

Fastp Defaults Replicated #

| Fastp parameter | Value | Pipeline step |
| --- | --- | --- |
| `--poly_g_min_len` | 10 | `ExtractPolyTail min_length = 10` |
| `--adapter_sequence` | `AGATCGGAAGAGCACACGTCTGAACTCCAGTCA` | `ExtractIUPAC query = ...` |
| `--n_base_limit` | 5 | `FilterByNumericTag max_value = 6` |
| `--qualified_quality_phred` | 15 (Phred) → 48 (ASCII) | `CalcQualifiedBases threshold = 48` |
| `--unqualified_percent_limit` | 40% → keep if ≥ 60% qualified | `FilterByNumericTag min_value = 0.60` |
| `--length_required` | 15 | `FilterByNumericTag min_value = 15` |

Note on quality scores: Quality values in FASTQ files are ASCII-encoded. Phred Q15 corresponds to ASCII character 48 (15 + 33 = 48).
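
The conversion in the note, and the "at least 60% qualified" rule from the table, can be checked with a short Python sketch. The function names are hypothetical, and treating the threshold as inclusive (a base at exactly Q15 counts as qualified) is an assumption based on fastp's behavior rather than on the tool's documentation.

```python
PHRED_OFFSET = 33  # Phred+33 encoding used by modern FASTQ files


def phred_to_ascii(q):
    """Return the FASTQ quality character for a Phred score."""
    return chr(q + PHRED_OFFSET)


def qualified_fraction(qual, threshold_ascii=48):
    """Fraction of bases whose quality character is at or above the threshold.

    Assumption: the comparison is inclusive, so Q15 itself counts as qualified.
    """
    if not qual:
        return 0.0
    return sum(ord(c) >= threshold_ascii for c in qual) / len(qual)


q15_char = phred_to_ascii(15)               # the character with ASCII code 48
passing = qualified_fraction("000000////")  # 6 of 10 bases at Q15, 4 at Q14
failing = qualified_fraction("0000//////")  # 4 of 10 bases at Q15
```

With a minimum of 0.60, the first read is kept and the second is removed.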

Cookbook 10: Adapter Identification #

Use Case #

You have a FASTQ file and want to identify which sequencing adapter is present before trimming — or to confirm no adapter contamination remains after trimming. This is useful when the adapter type is unknown, when working with data from multiple library prep kits, or when validating a trimming step.

What This Pipeline Does #

Runs a single Report step that counts exact occurrences of each common adapter sequence in every read (count_oligos)
Writes an HTML and JSON report — no reads are filtered or written to disk

How count_oligos Works #

count_oligos performs exact, full-sequence matching across every read. A read is counted if the probe sequence appears verbatim anywhere within it. There are no mismatches and no IUPAC wildcards. A non-zero count means reads carry at least one complete copy of that adapter.
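
The matching rule described above amounts to a plain substring test. The Python sketch below illustrates it; the function name `count_reads_with_oligo` and the probe label are hypothetical, and counting each read at most once per probe is an assumption drawn from the description, not from the tool's source.

```python
def count_reads_with_oligo(reads, probes):
    """Count, per probe, how many reads contain it verbatim.

    `probes` maps a label to a probe sequence. Matching is exact:
    no mismatches, no IUPAC wildcards.
    """
    counts = {name: 0 for name in probes}
    for seq in reads:
        for name, probe in probes.items():
            if probe in seq:  # exact substring match anywhere in the read
                counts[name] += 1
    return counts


# Hypothetical probe: the first 13 bases of the Illumina TruSeq R1 adapter.
probes = {"truseq_r1_prefix": "AGATCGGAAGAGC"}
reads = [
    "TTTTAGATCGGAAGAGCAAA",  # contains the probe: counted
    "AGATCGGAAGAGG",         # one mismatch at the last base: not counted
    "ACGT",                  # no adapter
]
counts = count_reads_with_oligo(reads, probes)
```

Because matching is exact, adapters that are truncated at the read end or carry a sequencing error are missed, so a zero count suggests, but does not prove, that no adapter is present.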