Cookbook 02: UMI Extraction
Use Case
You have sequencing data with Unique Molecular Identifiers (UMIs) embedded in the reads. UMIs are short random barcodes added during library preparation that allow you to:
- Identify and remove PCR duplicates
- Distinguish true biological duplicates from amplification artifacts
- Improve accuracy in quantitative analyses (RNA-seq, ATAC-seq, etc.)
What This Pipeline Does
- Reads an input FASTQ file with UMIs at the start of read1
- Extracts the UMI sequence (the first 8 bases) into a tag
- Stores the UMI in the read comment (FASTQ header)
- Removes the UMI bases from the read sequence (so they don’t interfere with alignment)
- Writes the modified reads with the UMI preserved in the header
Input Files
input/sample_R1.fq - Reads with an 8 bp UMI at the start
Output Files
output_R1.fq - Reads with the UMI in the comment and the UMI bases removed from the sequence
Configuration Highlights
[[step]]
# Extract UMI from positions 0-7 (8 bases)
action = 'ExtractRegions'
out_label = 'umi'
regions = [{source = 'read1', start = 0, length = 8, anchor = "Start"}]
[[step]]
# Store UMI in the FASTQ comment
action = 'StoreTagInComment'
in_label = 'umi'
[[step]]
# Remove the UMI bases from the read
action = 'CutStart'
segment = 'read1'
n = 8
Workflow Details
Before processing:
@READ1
ATCGATCGACTGTACTGTACTGTACTGTACTG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
After processing:
@READ1 umi:ATCGATCG
ACTGTACTGTACTGTACTGTACTG
+
IIIIIIIIIIIIIIIIIIIIIIII
The UMI ATCGATCG is now in the comment, and its 8 bases and their quality scores have been removed from the read, leaving 24 bases.
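To make the transformation concrete, here is a minimal Python sketch of the same three steps applied to one record. It is an illustration only, not how the tool is implemented: `FastqRecord` and `move_umi_to_comment` are hypothetical names, and the `umi:` header format is simply copied from the example above.

```python
from typing import NamedTuple


class FastqRecord(NamedTuple):
    """One FASTQ record (hypothetical helper type for this sketch)."""
    name: str       # header line without the leading '@'
    sequence: str
    quality: str


def move_umi_to_comment(record: FastqRecord, umi_length: int = 8,
                        label: str = "umi") -> FastqRecord:
    """Copy the first `umi_length` bases into the header, then trim them."""
    if len(record.sequence) < umi_length:
        raise ValueError(f"read {record.name!r} is shorter than the UMI length")
    umi = record.sequence[:umi_length]
    return FastqRecord(
        name=f"{record.name} {label}:{umi}",
        sequence=record.sequence[umi_length:],
        # Quality scores are trimmed in step with the bases they describe.
        quality=record.quality[umi_length:],
    )


before = FastqRecord("READ1", "ATCGATCGACTGTACTGTACTGTACTGTACTG", "I" * 32)
after = move_umi_to_comment(before)
print(after.name)
print(after.sequence)
```

Note that the UMI is captured before the sequence is trimmed, and that the sequence and quality strings stay the same length.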
When to Use This
- Single-cell RNA-seq with UMIs
- ATAC-seq with UMI-based deduplication
- Any protocol using unique molecular identifiers
- Before alignment when you need to preserve UMIs for downstream duplicate marking
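As a rough illustration of why keeping the UMI in the header helps downstream, the sketch below groups aligned reads by position and UMI. Everything here is an assumption made for the example: the `(header, chrom, position, strand)` tuple layout, the function names, and the `umi:` header format taken from this cookbook's example output. It uses exact UMI matching only, whereas dedicated deduplication tools typically also tolerate sequencing errors in the UMI.

```python
import re
from collections import defaultdict

# Assumes the "umi:<bases>" comment format shown in this cookbook's example.
UMI_PATTERN = re.compile(r"umi:([ACGTN]+)")


def umi_from_header(header: str) -> str:
    """Pull the UMI back out of a read header."""
    match = UMI_PATTERN.search(header)
    if match is None:
        raise ValueError(f"no UMI found in header {header!r}")
    return match.group(1)


def group_by_molecule(alignments):
    """Group reads that share an alignment position and an identical UMI.

    `alignments` is an iterable of (header, chrom, position, strand) tuples,
    a hypothetical layout chosen for this sketch.
    """
    groups = defaultdict(list)
    for header, chrom, position, strand in alignments:
        key = (chrom, position, strand, umi_from_header(header))
        groups[key].append(header)
    return groups


alignments = [
    ("READ1 umi:ATCGATCG", "chr1", 1000, "+"),
    ("READ2 umi:ATCGATCG", "chr1", 1000, "+"),  # same position and UMI: likely PCR duplicate
    ("READ3 umi:GGCATTAC", "chr1", 1000, "+"),  # same position, different UMI: distinct molecule
]
groups = group_by_molecule(alignments)
print(len(groups))  # number of distinct molecules under exact matching
```

Without the UMI, all three reads would look like duplicates of one another because they share the same alignment position.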
Download
Download 02-umi-extraction.tar.gz for a complete, runnable example including expected output files.
Configuration File
[input]
read1 = 'input/sample_R1.fq'
[[step]]
# Extract UMI from the first 8 bases of read1
action = 'ExtractRegions'
out_label = 'umi'
regions = [{source = 'read1', start = 0, length = 8, anchor="Start"}]
[[step]]
# Store the UMI tag in the FASTQ comment
# This preserves it through alignment and enables UMI-aware deduplication
action = 'StoreTagInComment'
in_label = 'umi'
[[step]]
# Remove the UMI bases from the read sequence
# Important: Do this AFTER storing the UMI in the comment
action = 'CutStart'
segment = 'read1'
n = 8
[output]
prefix = 'reference_output/cookbook-02'
format = "FASTQ"