Cookbook 02: UMI Extraction

Use Case

You have sequencing data with Unique Molecular Identifiers (UMIs) embedded in the reads. UMIs are short random barcodes added during library preparation that allow you to:

Identify and remove PCR duplicates
Distinguish true biological duplicates from amplification artifacts
Improve accuracy in quantitative analyses (RNA-seq, ATAC-seq, etc.)

What This Pipeline Does

Reads the input FASTQ file, in which every read1 sequence starts with a UMI
Extracts the UMI sequence (the first 8 bases) into a tag labeled umi
Stores the UMI in the read comment (FASTQ header)
Removes the UMI bases from the read sequence (so they don’t interfere with alignment)
Outputs modified reads with UMI preserved in the header
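
To make the steps above concrete, here is a minimal Python sketch of the same transformation applied to a single record. It is not how the tool is implemented; the `FastqRecord` type and `extract_umi` function are hypothetical names, and the `umi:` comment format is an assumption copied from the example output shown in this cookbook.

```python
from typing import NamedTuple

UMI_LENGTH = 8  # matches the 8-base UMI used in this cookbook


class FastqRecord(NamedTuple):
    name: str       # header line without the leading '@'
    sequence: str
    quality: str


def extract_umi(record: FastqRecord, umi_length: int = UMI_LENGTH) -> FastqRecord:
    """Move the leading UMI from the sequence into the header comment."""
    if len(record.sequence) < umi_length:
        raise ValueError(f"read {record.name!r} is shorter than the UMI")
    umi = record.sequence[:umi_length]
    return FastqRecord(
        # Assumed comment format, taken from this cookbook's example output.
        name=f"{record.name} umi:{umi}",
        # Sequence and quality must be trimmed together to stay the same length.
        sequence=record.sequence[umi_length:],
        quality=record.quality[umi_length:],
    )


before = FastqRecord("READ1", "ATCGATCGACTGTACTGTACTGTACTGTACTG", "I" * 32)
after = extract_umi(before)
print(f"@{after.name}\n{after.sequence}\n+\n{after.quality}")
```

The quality string is cut by the same amount as the sequence; a record whose two strings differ in length is not valid FASTQ.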

Input Files

input/sample_R1.fq - Reads with an 8 bp UMI at the start

Output Files

output_R1.fq - Reads with UMI in comment, UMI bases removed from sequence

Configuration Highlights

[[step]]
    # Extract UMI from positions 0-7 (8 bases)
    action = 'ExtractRegions'
    out_label = 'umi'
    regions = [{segment = 'read1', start = 0, length = 8}]

[[step]]
    # Store UMI in the FASTQ comment
    action = 'StoreTagInComment'
    in_label = 'umi'

[[step]]
    # Remove the UMI bases from the read
    action = 'CutStart'
    segment = 'read1'
    n = 8

Workflow Details

Before processing:

@READ1
ATCGATCGACTGTACTGTACTGTACTGTACTG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

After processing:

@READ1 umi:ATCGATCG
ACTGTACTGTACTGTACTGTACTG
+
IIIIIIIIIIIIIIIIIIIIIIII

The UMI ATCGATCG now sits in the header comment, and the sequence and quality lines are each 8 characters shorter.
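
Downstream tools need to read the UMI back out of the header. The following sketch shows one way to do that, assuming the `umi:<sequence>` comment format shown in the example above; the `umi_from_header` helper is a hypothetical name, and the real separator may differ depending on how the tool is configured.

```python
from typing import Optional


def umi_from_header(header: str, label: str = "umi") -> Optional[str]:
    """Return the UMI stored as '<label>:<sequence>' in the header comment, if any."""
    # The read name is the first space-separated token; the rest is the comment.
    _, _, comment = header.lstrip("@").partition(" ")
    for field in comment.split():
        key, sep, value = field.partition(":")
        if sep and key == label:
            return value
    return None


print(umi_from_header("@READ1 umi:ATCGATCG"))  # ATCGATCG
print(umi_from_header("@READ1"))               # None
```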

When to Use This

Single-cell RNA-seq with UMIs
ATAC-seq with UMI-based deduplication
Any protocol using unique molecular identifiers
Before alignment when you need to preserve UMIs for downstream duplicate marking
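
As an illustration of why the UMI is worth preserving, the sketch below flags duplicates among aligned reads by comparing position and UMI together. It is a simplified assumption of how duplicate marking works, not the method of any particular tool: `AlignedRead` and `mark_duplicates` are hypothetical names, the reads are made-up data, and only exact UMI matches are considered.

```python
from typing import NamedTuple


class AlignedRead(NamedTuple):
    name: str
    chrom: str
    pos: int
    umi: str


def mark_duplicates(reads: list[AlignedRead]) -> dict[str, bool]:
    """Flag every read after the first with the same (chrom, pos, umi) as a duplicate."""
    seen: set[tuple[str, int, str]] = set()
    is_duplicate: dict[str, bool] = {}
    for read in reads:
        key = (read.chrom, read.pos, read.umi)
        is_duplicate[read.name] = key in seen
        seen.add(key)
    return is_duplicate


reads = [
    AlignedRead("r1", "chr1", 100, "ATCGATCG"),
    AlignedRead("r2", "chr1", 100, "ATCGATCG"),  # same position and UMI: PCR duplicate
    AlignedRead("r3", "chr1", 100, "GGCATTAC"),  # same position, new UMI: distinct molecule
]
print(mark_duplicates(reads))
```

Without the UMI, r3 would look like a duplicate of r1 because both align to the same position.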

Download

Download 02-umi-extraction.tar.gz for a complete, runnable example including expected output files.

Configuration File

[input]
    read1 = 'input/sample_R1.fq'

[[step]]
    # Extract UMI from the first 8 bases of read1
    action = 'ExtractRegions'
    out_label = 'umi'
    regions = [{segment = 'read1', start = 0, length = 8}]

[[step]]
    # Store the UMI tag in the FASTQ comment
    # This preserves it through alignment and enables UMI-aware deduplication
    action = 'StoreTagInComment'
    in_label = 'umi'

[[step]]
    # Remove the UMI bases from the read sequence
    # Important: Do this AFTER storing the UMI in the comment
    action = 'CutStart'
    segment = 'read1'
    n = 8

[output]
    prefix = 'output'
    format = "FASTQ"