Cookbook 06: Adapter Trimming with PolyA Tail Removal #

Use Case #

You have RNA-seq data that contains:

PolyA tails: Stretches of A bases at the 3’ end of a read (or polyT at the 5’ end for reads from the reverse strand)
Sequencing adapters: Illumina or other adapter sequences that need removal before alignment

These artifacts can interfere with alignment and downstream analysis if not removed.

What This Pipeline Does #

This cookbook demonstrates a complete adapter and polyA trimming workflow:

Extract PolyA Tail: Uses ExtractPolyTail to find polyA/T stretches
Trim PolyA: Uses TrimAtTag to remove the polyA tail and everything after it
Extract Adapter: Uses ExtractIUPAC to find Illumina adapter sequences
Trim Adapter: Uses TrimAtTag to remove adapter contamination
Filter Short Reads: Uses CalcLength and FilterByNumericTag to remove reads that became too short after trimming

Understanding PolyA/T Tails #

PolyA tails are natural features of mRNA:

Biological: Most eukaryotic mRNAs receive a polyA tail at their 3’ end through polyadenylation during mRNA processing
Sequencing artifact: If the read extends past the transcript end, it captures the polyA tail
Impact: Can interfere with alignment if not removed
PolyT: Reverse strand sequences show polyT instead of polyA
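
The polyA/polyT relationship can be illustrated with a reverse complement. This is a minimal Python sketch that is independent of the tool; the `reverse_complement` helper and the example sequence are made up for illustration.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A<->T, C<->G)."""
    complement = str.maketrans("ACGTN", "TGCAN")
    return seq.translate(complement)[::-1]


# A read covering a transcript end on the forward strand finishes with polyA ...
forward = "ACTGACTG" + "A" * 10
# ... while the same molecule read from the opposite strand starts with polyT.
reverse = reverse_complement(forward)
print(reverse)
```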

Input Files #

input/rna_sample_R1.fq - RNA-seq reads with polyA tails and adapters

Output Files #

output_read1.fq - Trimmed reads with polyA tails and adapters removed

Expected Results #

With the provided sample data:

Input: 8 reads with various combinations of polyA tails and adapters
Output: Trimmed reads with polyA and adapters removed
Reads that became too short (< 25bp) after trimming are filtered out

Workflow Details #

Example read transformation:

Adapter: AGATCGGAAGAGC

Before:

@READ1
ACTGACTGACTGACTGAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
                ^^^^^^^^^^
                PolyA ↑
                         ^^^^^^^^^^^^^
                         Adapter ↑

After polyA trimming (the adapter lies downstream of the polyA tail, so this step removes it as well):

ACTGACTGACTGACTG
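
The transformation above can be reproduced with a small Python sketch. It only illustrates the extract → trim idea and is not the tool's implementation: `first_poly_a` and `trim_at_end` are hypothetical helpers, and the exact-match regular expression ignores the mismatch tolerance configured for ExtractPolyTail.

```python
import re

READ1 = "ACTGACTGACTGACTGAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"


def first_poly_a(seq, min_length=8):
    """Return (start, end) of the first run of >= min_length A's, or None."""
    match = re.search(r"A{%d,}" % min_length, seq)
    return match.span() if match else None


def trim_at_end(seq, span, keep_tag=False):
    """Cut the read at a tagged region, dropping everything after it.

    With keep_tag=False the tagged region itself is removed as well.
    """
    if span is None:
        return seq  # nothing tagged, the read stays untouched
    start, end = span
    return seq[:end] if keep_tag else seq[:start]


trimmed = trim_at_end(READ1, first_poly_a(READ1))
print(trimmed, len(trimmed))
```

The trimmed read is 16 bases long, so with the 25 bp cutoff used in this cookbook it would be dropped by the final length filter.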

Customization #

Adjust parameters based on your data:

PolyA Detection:

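min_length: Minimum number of bases a polyA stretch must span to be reported (8 in this cookbook)
max_mismatch_rate: Fraction of misread (non-A) bases tolerated within the polyA tail (0.1 = 10%)
max_consecutive_mismatches: Longest run of adjacent non-A bases tolerated within the tail (3 in this cookbook)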
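
To make the interplay of `min_length` and `max_mismatch_rate` concrete, here is a simplified Python sketch. How ExtractPolyTail applies these parameters internally is not described here, so the sketch rests on assumptions: the tail must reach the 3’ end of the read, it must begin on the tail base, and `max_consecutive_mismatches` is left out. `find_poly_tail` is a hypothetical helper.

```python
def find_poly_tail(seq, base="A", min_length=8, max_mismatch_rate=0.1):
    """Return the start index of the longest acceptable tail, or None.

    A suffix is accepted when it begins with `base`, spans at least
    `min_length` bases, and its share of non-`base` positions does not
    exceed `max_mismatch_rate`.
    """
    best_start = None
    mismatches = 0
    # Grow the candidate tail one base at a time, starting at the 3' end.
    for start in range(len(seq) - 1, -1, -1):
        if seq[start] != base:
            mismatches += 1
            continue  # a tail may not begin on a mismatch
        length = len(seq) - start
        if length >= min_length and mismatches <= max_mismatch_rate * length:
            best_start = start
    return best_start


read = "ACGTACGTACGT" + "AAAAAGAAAA"  # 10-base tail with one misread base
start = find_poly_tail(read)
trimmed = read if start is None else read[:start]
print(start, trimmed)
```

With `max_mismatch_rate = 0.0` the same read yields no tail, because the clean run of four A's at the very end is shorter than `min_length`.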

Adapter Sequences:

See the adapter sequences reference for common adapters.

Use max_mismatches to tolerate sequencing errors within the adapter match
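
The effect of `max_mismatches` can be sketched as a sliding-window comparison in Python. This is an illustration under assumptions, not the ExtractIUPAC implementation: `find_pattern` is a hypothetical helper, only substitutions are counted (no insertions or deletions), and the first acceptable window wins.

```python
# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def find_pattern(seq, pattern, max_mismatches=1):
    """Return the first index where `pattern` matches `seq`, or None.

    A position counts as a mismatch when the read base is not in the
    set of bases that the pattern's IUPAC code stands for.
    """
    for start in range(len(seq) - len(pattern) + 1):
        mismatches = 0
        for code, base in zip(pattern, seq[start:start + len(pattern)]):
            if base not in IUPAC[code]:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # this window is over budget, try the next one
        else:
            return start  # window finished within the mismatch budget
    return None


# Adapter with one substituted base (A -> T) embedded in a made-up read.
read = "ACGTACGT" + "AGATCGGTAGAGC" + "TTTT"
hit = find_pattern(read, "AGATCGGAAGAGC", max_mismatches=1)
print(hit, read[:hit])
```

Raising the mismatch budget finds more damaged adapters but also increases the chance of trimming genuine sequence that happens to resemble the adapter.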

Length Filtering:

min_value: Minimum read length to keep (adjust based on alignment requirements)
For RNA-seq: typically 25-50bp minimum
For miRNA: shorter reads (18-22bp) may need to be kept, so lower the threshold accordingly
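
As a rough Python illustration of the length filter, the sketch below keeps reads whose length is at least `min_value`, matching the "≥ 25bp" rule of this cookbook. `filter_by_length` and the read names are hypothetical.

```python
def filter_by_length(reads, min_value=25):
    """Keep only (name, sequence) pairs with at least min_value bases."""
    return [(name, seq) for name, seq in reads if len(seq) >= min_value]


reads = [
    ("long_enough", "ACGT" * 8),        # 32 bases, kept
    ("at_threshold", "A" * 25),         # exactly 25 bases, kept
    ("too_short", "ACTGACTGACTGACTG"),  # 16 bases, removed
]
kept = filter_by_length(reads)
print([name for name, _ in kept])
```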

When to Use This #

RNA-seq data before alignment
Any protocol where reads may extend past the insert (e.g. polyA-capture libraries)
When adapter contamination is detected in quality reports
Before transcriptome assembly or quantification

Alternative Approaches #

This cookbook uses a two-step approach (extract → trim). You can also use:

ExtractLongestPolyX: Finds the longest stretch of any repeated nucleotide
ExtractAnchor: More flexible pattern matching with orientation
Multiple TrimAtTag steps for different adapter types
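
The idea behind searching for the longest stretch of any repeated nucleotide can be sketched in Python as follows. This only illustrates the concept and makes no claim about how ExtractLongestPolyX works internally (it ignores mismatches, for instance); `longest_homopolymer` is a hypothetical helper.

```python
from itertools import groupby


def longest_homopolymer(seq):
    """Return (base, start, length) of the longest run of one base.

    Ties go to the leftmost run; an empty sequence gives ("", 0, 0).
    """
    best = ("", 0, 0)
    position = 0
    for base, group in groupby(seq):
        length = sum(1 for _ in group)
        if length > best[2]:
            best = (base, position, length)
        position += length
    return best


run = longest_homopolymer("ACG" + "T" * 6 + "GG" + "AAA" + "C")
print(run)
```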

Downstream Analysis #

After trimming:

Alignment to reference genome (STAR, HISAT2)
Quantification of gene/transcript expression
Quality control to verify adapter removal (FastQC, MultiQC)

Download #

Download 06-adapter-trimming.tar.gz for a complete, runnable example including expected output files.

Configuration File #

[input]
    read1 = 'input/rna_sample_R1.fq'

[[step]]
    # Find polyA tails
    # Looks for stretches of ≥8 A's with up to 10% mismatches
    action = 'ExtractPolyTail'
    base = 'A'
    min_length = 8
    max_mismatch_rate = 0.1
    max_consecutive_mismatches = 3
    out_label = 'polya'

[[step]]
    # Trim the polyA tail and everything after it
    # keep_tag = false means remove the matched region
    action = 'TrimAtTag'
    in_label = 'polya'
    keep_tag = false
    direction = 'end'

[[step]]
    # Find Illumina TruSeq Universal Adapter sequence
    # AGATCGGAAGAGC is the start of the Illumina adapter
    action = 'ExtractIUPAC'
    pattern = 'AGATCGGAAGAGC'
    anchor = 'anywhere'
    max_mismatches = 1
    out_label = 'adapter'

[[step]]
    # Trim adapter and everything after it
    action = 'TrimAtTag'
    in_label = 'adapter'
    keep_tag = false
    direction = 'end'

[[step]]
    # Calculate read length after trimming
    action = 'CalcLength'
    segment = 'read1'
    out_label = 'length'

[[step]]
    # Filter out reads that are too short after trimming
    # Keep only reads ≥ 25bp
    action = 'FilterByNumericTag'
    in_label = 'length'
    min_value = 25
    keep_or_remove = 'keep'

[output]
    prefix = 'reference_output/output'
    format = 'FASTQ'