Cookbook 06: Adapter Trimming with PolyA Tail Removal #
Use Case #
You have RNA-seq data that contains:
- PolyA tails: Stretches of A bases at the 3’ end (or polyT at 5’ for reverse strand)
- Sequencing adapters: Illumina or other adapter sequences that need removal before alignment
These artifacts can interfere with alignment and downstream analysis if not removed.
What This Pipeline Does #
This cookbook demonstrates a complete adapter and polyA trimming workflow:
- Extract PolyA Tail: Uses ExtractPolyTail to find polyA/T stretches
- Trim PolyA: Uses TrimAtTag to remove the polyA tail and everything after it
- Extract Adapter: Uses ExtractIUPAC to find Illumina adapter sequences
- Trim Adapter: Uses TrimAtTag to remove adapter contamination
- Filter Short Reads: Uses CalcLength and FilterByNumericTag to remove reads that became too short after trimming
Understanding PolyA/T Tails #
PolyA tails are natural features of mRNA:
- Biological: mRNA molecules have polyA tails added to their 3’ end by polyadenylation during mRNA processing
- Sequencing artifact: If the read extends past the transcript end, it captures the polyA tail
- Impact: Can interfere with alignment if not removed
- PolyT: Reverse strand sequences show polyT instead of polyA
Input Files #
- input/rna_sample_R1.fq: RNA-seq reads with polyA tails and adapters
Output Files #
- output_read1.fq: Trimmed reads with polyA tails and adapters removed
Expected Results #
With the provided sample data:
- Input: 8 reads with various combinations of polyA tails and adapters
- Output: Trimmed reads with polyA and adapters removed
- Reads that became too short (< 25bp) after trimming are filtered out
Workflow Details #
Example read transformation:
Adapter:
AGATCGGAAGAGC
Before:
@READ1
ACTGACTGACTGACTGAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
                ^ PolyA starts here (position 17); the adapter follows the polyA run
After polyA trimming:
ACTGACTGACTGACTG
Customization #
Adjust parameters based on your data:
PolyA Detection:
- min_length: Minimum polyA length
- max_mismatch_rate: Allow some misread (non-A) bases in the polyA tail
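As a sketch of how these two parameters might be tightened, the step below reuses the keys from this cookbook's configuration with stricter values. The numbers 12 and 0.05 are illustrative assumptions, not recommendations; pick values that match the tail lengths and error rates seen in your own data.

```toml
[[step]]
# Stricter polyA detection than the cookbook default (illustrative values)
action = 'ExtractPolyTail'
base = 'A'
min_length = 12              # assumption: require a longer run before calling it a tail
max_mismatch_rate = 0.05     # assumption: tolerate fewer non-A bases within the tail
max_consecutive_mismatches = 3
out_label = 'polya'
```

Raising min_length reduces the chance of trimming genuine A-rich sequence, at the cost of leaving short tails in place.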
Adapter Sequences:
- See adapter sequences for common adapters.
- Use max_mismatches to allow for sequencing errors in the adapter
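A minimal sketch of a more tolerant adapter search, using the same ExtractIUPAC keys as the configuration in this cookbook. The value 2 is an assumption chosen for illustration; the cookbook itself uses 1.

```toml
[[step]]
# Adapter search that tolerates more sequencing errors (illustrative value)
action = 'ExtractIUPAC'
pattern = 'AGATCGGAAGAGC'
anchor = "anywhere"
max_mismatches = 2           # assumption: cookbook default is 1
out_label = 'adapter'
```

With a pattern this short, each additional allowed mismatch also raises the chance of matching non-adapter sequence, so it is worth checking how many reads are trimmed before and after the change.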
Length Filtering:
- min_value: Minimum read length to keep (adjust based on alignment requirements)
- For RNA-seq: typically 25-50bp minimum
- For miRNA: might keep shorter reads (18-22bp)
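For a small-RNA library, the length filter could be relaxed roughly as follows. This is a sketch that reuses the CalcLength and FilterByNumericTag steps from this cookbook; the threshold of 18 is an assumption taken from the lower end of the miRNA range mentioned above.

```toml
[[step]]
# Calculate read length after trimming
action = 'CalcLength'
segment = 'read1'
out_label = 'length'

[[step]]
# Keep reads of at least 18bp (illustrative miRNA threshold)
action = 'FilterByNumericTag'
in_label = 'length'
min_value = 18
keep_or_remove = 'keep'
```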
When to Use This #
- RNA-seq data before alignment
- Any protocol where reads may extend past the insert (polyA capture)
- When adapter contamination is detected in quality reports
- Before transcriptome assembly or quantification
Alternative Approaches #
This cookbook uses a two-step approach (extract → trim). You can also use:
- ExtractLongestPolyX: Finds the longest stretch of any repeated nucleotide
- ExtractAnchor: More flexible pattern matching with orientation
- Multiple TrimAtTag steps for different adapter types
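The last option could look roughly like the sketch below: a second extract and trim pair added after the existing adapter steps. The Nextera transposase sequence CTGTCTCTTATACACATCT is swapped in here purely as an example of a second adapter type, and the label name adapter2 is hypothetical; only keys already used in this cookbook's configuration appear.

```toml
[[step]]
# Look for a second adapter type (example: Nextera transposase sequence)
action = 'ExtractIUPAC'
pattern = 'CTGTCTCTTATACACATCT'
anchor = "anywhere"
max_mismatches = 1
out_label = 'adapter2'       # hypothetical label name

[[step]]
# Trim the second adapter and everything after it
action = 'TrimAtTag'
in_label = 'adapter2'
keep_tag = false
direction = 'end'
```

Placing these steps before CalcLength means the length filter still sees the final, fully trimmed read.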
Downstream Analysis #
After trimming:
- Alignment to reference genome (STAR, HISAT2)
- Quantification of gene/transcript expression
- Quality control to verify adapter removal (FastQC, MultiQC)
Download #
Download 06-adapter-trimming.tar.gz for a complete, runnable example including expected output files.
Configuration File #
[input]
read1 = 'input/rna_sample_R1.fq'
[[step]]
# Find polyA tails
# Looks for stretches of ≥8 A's with up to 10% mismatches
action = 'ExtractPolyTail'
base = 'A'
min_length = 8
max_mismatch_rate = 0.1
max_consecutive_mismatches = 3
out_label = 'polya'
[[step]]
# Trim the polyA tail and everything after it
# keep_tag = false means the matched region itself is also removed
action = 'TrimAtTag'
in_label = 'polya'
keep_tag = false
direction = 'end'
[[step]]
# Find Illumina TruSeq Universal Adapter sequence
# AGATCGGAAGAGC is the start of the Illumina adapter
action = 'ExtractIUPAC'
pattern = 'AGATCGGAAGAGC'
anchor = "anywhere"
max_mismatches = 1
out_label = 'adapter'
[[step]]
# Trim adapter and everything after it
action = 'TrimAtTag'
in_label = 'adapter'
keep_tag = false
direction = 'end'
[[step]]
# Calculate read length after trimming
action = 'CalcLength'
segment = 'read1'
out_label = 'length'
[[step]]
# Filter out reads that are too short after trimming
# Keep only reads ≥ 25bp
action = 'FilterByNumericTag'
in_label = 'length'
min_value = 25
keep_or_remove = 'keep'
[output]
prefix = 'reference_output/output'
format = "FASTQ"