Cookbook 03: Lexogen QuantSeq Processing #
Use Case #
Lexogen QuantSeq is a popular 3’ mRNA sequencing protocol optimized for gene expression profiling. The library structure includes:
- First 6 bases: UMI (Unique Molecular Identifier) for deduplication
- Next 4 bases: TATA spacer (needs removal)
- Remaining sequence: Actual cDNA from the 3’ end of transcripts
This cookbook demonstrates the standard preprocessing for QuantSeq data before alignment.
What This Pipeline Does #
- Extracts the 6bp UMI from the start of each read
- Stores the UMI in the read comment (FASTQ header)
- Removes the first 10 bases in total (6bp UMI + 4bp TATA spacer)
- Outputs processed reads ready for alignment
Input Files #
- input/quantseq_sample.fq: Raw QuantSeq reads that still carry the UMI and spacer
Output Files #
- output_read1.fq: Processed reads with:
  - UMI stored in comment
  - First 10bp removed
  - Ready for alignment to reference genome
Workflow Details #
Raw read structure:
@READ1
ATCGATTATAACTGTACTGTACTGTAC...
^^^^^^                         <- UMI (6bp): copied to the comment, then removed
      ^^^^                     <- TATA spacer (4bp): removed
          ^^^^^^^^^^^^^^^^^... <- cDNA: stays for alignment
After processing:
@READ1 umi:ATCGAT
ACTGTACTGTACTGTAC...
The UMI is preserved in the comment for downstream deduplication, and both the UMI and the spacer bases are removed from the sequence.
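The per-read transformation can be sketched in a few lines of Python. This is only an illustration of the logic, not how the tool is implemented: the function name `preprocess_read` is hypothetical, the lengths (6-base UMI, 10 bases cut) are taken from the configuration file in this cookbook, and the `umi:` comment format is copied from the example and may differ from the tool's exact output.

```python
def preprocess_read(header, seq, qual, umi_len=6, cut_len=10):
    """Mimic ExtractRegions -> StoreTagInComment -> CutStart for one read.

    Hypothetical helper for illustration; defaults follow the cookbook config.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings must have equal length")
    # Step 1: copy the UMI from the start of the read
    umi = seq[:umi_len]
    # Step 2: append it to the header as a comment (format assumed from the example)
    new_header = f"{header} umi:{umi}"
    # Step 3: cut the UMI and spacer from both sequence and qualities
    return new_header, seq[cut_len:], qual[cut_len:]


raw_seq = "ATCGATTATAACTGTACTGTACTGTAC"
new_header, new_seq, new_qual = preprocess_read("@READ1", raw_seq, "I" * len(raw_seq))
print(new_header)
print(new_seq)
```

Note that the sequence and quality strings are trimmed together, so each processed record stays a valid FASTQ entry.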
When to Use This #
- Processing Lexogen QuantSeq FWD/REV libraries
- Any 3’ RNA-seq protocol with a similar UMI + spacer layout at the start of the read
- Before aligning to reference genome for gene expression analysis
Downstream Analysis #
After processing with this cookbook:
- Align to reference genome using STAR, HISAT2, or similar
- Assign to genes using mbf-bam-quantifier, which also does UMI dedup
- Or deduplicate by UMI with tools like:
  - umi_tools dedup (extracts UMI from comment)
  - fgbio GroupReadsByUmi
- Quantify differential gene expression with standard DE tools (DESeq2, edgeR)
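To make the deduplication step concrete, here is a minimal Python sketch of the underlying idea: alignments that share the same position, strand, and UMI are treated as PCR duplicates, and one is kept. The function `dedup_by_umi` and the dictionary fields are hypothetical, and the exact-match grouping is a simplification. Real tools such as umi_tools also account for sequencing errors in the UMI.

```python
def dedup_by_umi(alignments):
    """Keep the highest-MAPQ alignment per (chrom, pos, strand, umi) group.

    Hypothetical, simplified helper: UMIs must match exactly.
    """
    best = {}
    for aln in alignments:
        key = (aln["chrom"], aln["pos"], aln["strand"], aln["umi"])
        # First alignment seen for this key, or a better-mapped duplicate
        if key not in best or aln["mapq"] > best[key]["mapq"]:
            best[key] = aln
    return list(best.values())


alignments = [
    {"name": "r1", "chrom": "chr1", "pos": 100, "strand": "+", "umi": "ATCGAT", "mapq": 60},
    {"name": "r2", "chrom": "chr1", "pos": 100, "strand": "+", "umi": "ATCGAT", "mapq": 30},  # duplicate of r1
    {"name": "r3", "chrom": "chr1", "pos": 100, "strand": "+", "umi": "GGCATC", "mapq": 60},  # same position, other UMI
    {"name": "r4", "chrom": "chr1", "pos": 250, "strand": "+", "umi": "ATCGAT", "mapq": 60},  # same UMI, other position
]
kept_names = [aln["name"] for aln in dedup_by_umi(alignments)]
print(kept_names)
```

Reads r1 and r3 map to the same position but carry different UMIs, so both are kept as independent molecules. Without UMIs they would be indistinguishable from PCR duplicates.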
Important Notes #
- Random priming during library preparation introduces sequence bias; UMI-based deduplication helps mitigate this
- QuantSeq reads are strand-specific (typically R2/reverse strand)
- Read lengths will be 10bp shorter after processing
- Quality filtering may be beneficial after trimming (see cookbook 03-quality-filtering)
Download #
Download 03-lexogen-quantseq.tar.gz for a complete, runnable example including expected output files.
Configuration File #
[input]
# QuantSeq produces single-end reads
read1 = 'input/quantseq_sample.fq'
[[step]]
# Extract the 8bp UMI from the start of each read
# QuantSeq uses a 6bp random UMI for PCR duplicate identification
action = 'ExtractRegions'
out_label = 'umi'
regions = [{source = 'read1', start = 0, length = 6, anchor="Start"}]
[[step]]
# Store the UMI in the FASTQ comment
# This preserves it for downstream deduplication with umi_tools or similar
action = 'StoreTagInComment'
in_label = 'umi'
[[step]]
# Remove the first 10 bases from reads:
# - 6bp UMI
# - 4bp TATA spacer
# What remains is the actual cDNA sequence for alignment
action = 'CutStart'
segment = 'read1'
n = 10
[output]
prefix = 'reference_output/output'
format = "FASTQ"