Cookbook 03: Lexogen QuantSeq Processing #

Use Case #

Lexogen QuantSeq is a popular 3’ mRNA sequencing protocol optimized for gene expression profiling. The library structure includes:

First 6 bases: UMI (Unique Molecular Identifier) for deduplication
Next 4 bases: TATA spacer (needs removal)
Remaining sequence: Actual cDNA from the 3’ end of transcripts

This cookbook demonstrates the standard preprocessing for QuantSeq data before alignment.

What This Pipeline Does #

Extracts the 6bp UMI from the start of reads
Stores the UMI in the read comment (FASTQ header)
Removes the first 10 bases total (6bp UMI + 4bp TATA spacer)
Outputs processed reads ready for alignment

Input Files #

input/quantseq_sample.fq - Raw QuantSeq reads with UMI and random hexamer

Output Files #

output_read1.fq - Processed reads with:
- UMI stored in comment
- First 10bp removed
- Ready for alignment to reference genome

Workflow Details #

Raw read structure:

@READ1
ATCGATTATAACTGTACTGTACTGTAC...
^^^^^^ UMI (6bp) <- These get removed from the sequence
      ^^^^ TATA spacer (4bp) <- These get removed
          ^^^^^^^^^^^^^^^^^... <- This stays for alignment

After processing:

@READ1 umi:ATCGAT
ACTGTACTGTACTGTAC...

The UMI is preserved in the comment for downstream deduplication, and the UMI and spacer bases are removed from the sequence.
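The transformation can be sketched in a few lines of Python. This is only an illustration of what happens to a single record, assuming the values used in the configuration file (a 6 bp UMI and 10 bases removed in total). It is not how the tool is implemented; the function name `process_record` and the `umi:` comment layout are hypothetical and simply mirror the illustration above.

```python
def process_record(header, seq, qual, umi_len=6, cut_len=10):
    """Copy the leading UMI into the header comment and trim the read start.

    umi_len and cut_len default to the values assumed from the
    configuration file (6 bp UMI, 10 bases removed in total).
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    if len(seq) < cut_len:
        raise ValueError("read is shorter than the region to remove")

    umi = seq[:umi_len]
    # Hypothetical comment layout, chosen to match the illustration;
    # the real tool may format the comment differently.
    new_header = f"{header} umi:{umi}"
    # Sequence and qualities must be trimmed by the same amount.
    return new_header, seq[cut_len:], qual[cut_len:]


raw_seq = "ATCGATTATAACTGTACTGTACTGTAC"
raw_qual = "I" * len(raw_seq)
header, seq, qual = process_record("@READ1", raw_seq, raw_qual)
print(header)  # @READ1 umi:ATCGAT
print(seq)     # ACTGTACTGTACTGTAC
```

Note that the quality string is trimmed together with the sequence; a record whose two strings differ in length is not valid FASTQ.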

When to Use This #

Processing Lexogen QuantSeq FWD/REV libraries
Any 3’ RNA-seq protocol whose reads start with a UMI followed by a fixed-length sequence to discard
Before aligning to reference genome for gene expression analysis

Downstream Analysis #

After processing with this cookbook:

Align to reference genome using STAR, HISAT2, or similar
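Assign reads to genes using mbf-bam-quantifier, which also performs UMI deduplication
Alternatively, deduplicate by UMI with tools like:
- umi_tools dedup (by default it reads the UMI from the read name, so check its --extract-umi-method and --umi-separator options)
- fgbio GroupReadsByUmi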
Quantify differential gene expression with standard DE tools (DESeq2, edgeR)
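To make the deduplication step concrete, here is a minimal sketch of the underlying idea: reads that align to the same position and strand and carry the same UMI are treated as PCR duplicates, and only one of them is kept. The `Alignment` tuple and the `dedup_by_umi` function are hypothetical names for illustration. This exact-match approach is a simplification; dedicated tools such as umi_tools additionally account for sequencing errors within the UMI.

```python
from collections import namedtuple

# Hypothetical, simplified view of an aligned read.
Alignment = namedtuple("Alignment", "name chrom pos strand umi mapq")


def dedup_by_umi(alignments):
    """Keep one alignment per (chrom, pos, strand, UMI) group.

    Within a group the alignment with the highest mapping quality is
    kept; ties keep the first one seen.
    """
    best = {}
    for aln in alignments:
        key = (aln.chrom, aln.pos, aln.strand, aln.umi)
        kept = best.get(key)
        if kept is None or aln.mapq > kept.mapq:
            best[key] = aln
    # dicts preserve insertion order, so groups come out in first-seen order
    return list(best.values())


alignments = [
    Alignment("r1", "chr1", 100, "+", "ATCGAT", 60),
    Alignment("r2", "chr1", 100, "+", "ATCGAT", 30),  # duplicate of r1
    Alignment("r3", "chr1", 100, "+", "GGCATA", 60),  # same position, other UMI
    Alignment("r4", "chr1", 250, "+", "ATCGAT", 60),  # same UMI, other position
]
print([a.name for a in dedup_by_umi(alignments)])  # ['r1', 'r3', 'r4']
```

Reads r3 and r4 survive because either the UMI or the position differs from r1, which is what distinguishes independent molecules from PCR copies.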

Important Notes #

The UMI and TATA spacer at the start of each read are not transcript sequence and must be removed before alignment; UMI-based deduplication then helps mitigate PCR amplification bias
QuantSeq reads are strand-specific; the orientation depends on the kit (FWD reads match the transcript strand, REV reads are its reverse complement)
Read lengths will be 10bp shorter after processing
Quality filtering may be beneficial after trimming (see cookbook 03-quality-filtering)
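As a rough illustration of why filtering after trimming can matter: once the leading bases are removed, some reads may be too short or of too low quality to align reliably. The sketch below checks a trimmed read against a minimum length and a minimum mean Phred score. The thresholds and function names are hypothetical, and Phred+33 quality encoding is assumed; suitable values depend on the data and the aligner.

```python
def mean_phred(qual, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(c) - offset for c in qual) / len(qual)


def passes_filter(seq, qual, min_len=20, min_mean_q=20.0):
    """Return True if a trimmed read is long enough and of sufficient quality.

    min_len and min_mean_q are illustrative thresholds, not recommendations.
    """
    if len(seq) < min_len:
        return False
    return mean_phred(qual) >= min_mean_q


print(passes_filter("A" * 30, "I" * 30))  # long, high quality: kept
print(passes_filter("A" * 10, "I" * 10))  # too short after trimming: dropped
print(passes_filter("A" * 30, "#" * 30))  # low mean quality: dropped
```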

References #

Lexogen QuantSeq 3’ mRNA-Seq Library Prep Kit

Download #

Download 03-lexogen-quantseq.tar.gz for a complete, runnable example including expected output files.

Configuration File #

[input]
    # QuantSeq produces single-end reads
    read1 = 'input/quantseq_sample.fq'

[[step]]
    # Extract the 6bp UMI from the start of each read
    # QuantSeq uses a 6bp random UMI for PCR duplicate identification
    action = 'ExtractRegions'
    out_label = 'umi'
    regions = [{source = 'read1', start = 0, length = 6, anchor="Start"}]

[[step]]
    # Store the UMI in the FASTQ comment
    # This preserves it for downstream deduplication with umi_tools or similar
    action = 'StoreTagInComment'
    in_label = 'umi'

[[step]]
    # Remove the first 10 bases from reads:
    # - 6bp UMI
    # - 4bp TATA spacer
    # What remains is the actual cDNA sequence for alignment
    action = 'CutStart'
    segment = 'read1'
    n = 10

[output]
    prefix = 'reference_output/output'
    format = "FASTQ"