Cookbook 10: Adapter Identification #
Use Case #
You have a FASTQ file and want to identify which sequencing adapter is present before trimming — or to confirm no adapter contamination remains after trimming. This is useful when the adapter type is unknown, when working with data from multiple library prep kits, or when validating a trimming step.
What This Pipeline Does #
- Runs a single Report step that counts exact occurrences of each common adapter sequence in every read (count_oligos)
- Writes an HTML and JSON report; no reads are filtered or written to disk
How count_oligos Works #
count_oligos performs exact, full-sequence matching across every read. A read
is counted if the probe sequence appears verbatim anywhere within it. There are
no mismatches and no IUPAC wildcards. A non-zero count means reads carry at
least one complete copy of that adapter.
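The matching rule described above can be approximated in a few lines of Python. This is a minimal sketch for intuition and for cross-checking a report by hand, not the tool's actual implementation. The helper names `read_sequences` and `count_reads_with_probe` are hypothetical, and the sketch assumes plain 4-line FASTQ records and that each read is counted at most once per probe.

```python
from typing import Dict, Iterable, Iterator


def read_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line (2nd of every 4 lines) of each FASTQ record."""
    for line_number, line in enumerate(lines):
        if line_number % 4 == 1:
            yield line.strip()


def count_reads_with_probe(
    sequences: Iterable[str], probes: Dict[str, str]
) -> Dict[str, int]:
    """Count, per probe, the reads that contain the probe verbatim."""
    counts = {name: 0 for name in probes}
    for sequence in sequences:
        for name, probe in probes.items():
            # Plain substring test: no mismatches, no IUPAC expansion.
            if probe in sequence:
                counts[name] += 1
    return counts


# Two of the probes from the configuration file in this cookbook.
PROBES = {
    "Illumina TruSeq R1/miRNA": "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA",
    "Illumina TruSeq R2": "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT",
}
```

To try this on the sample data, the lines could come from `gzip.open('input/fastp_606.fq.gz', 'rt')` in the standard library. A read that differs from a probe by even one base adds nothing to that probe's count.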
Because the probe must appear in full, very short reads that were already partially trimmed will not match. Use a shorter prefix of the adapter (e.g. the first 15–20 bp) as an additional probe if you expect heavily trimmed data.
Every adapter is scored independently, so a read that matches more than one probe (for example, probes whose sequences overlap) adds to the count of each probe it matches.
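The effect of adding shorter prefix probes, and of scoring every probe on its own, can be illustrated with a small Python sketch. It is an illustration under assumptions, not the tool's code: `with_prefix_probes` and `matching_probes` are hypothetical helpers, the 20 bp prefix length is just one value from the suggested 15–20 bp range, and the example reads are made up.

```python
from typing import Dict, List

# Two of the adapters from the configuration file in this cookbook.
ADAPTERS = {
    "Illumina TruSeq R1/miRNA": "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA",
    "Illumina TruSeq R2": "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT",
}


def with_prefix_probes(adapters: Dict[str, str], prefix_length: int = 20) -> Dict[str, str]:
    """Return the full-length probes plus one shorter prefix probe per adapter."""
    probes = dict(adapters)
    for name, sequence in adapters.items():
        if len(sequence) > prefix_length:
            probes[f"{name} (first {prefix_length} bp)"] = sequence[:prefix_length]
    return probes


def matching_probes(read: str, probes: Dict[str, str]) -> List[str]:
    """List every probe found verbatim in the read; each is scored on its own."""
    return [name for name, probe in probes.items() if probe in read]


PROBES = with_prefix_probes(ADAPTERS)

# Made-up read in which only the first 22 adapter bases survived trimming.
partially_trimmed = "ACGTACGTAC" + ADAPTERS["Illumina TruSeq R2"][:22]

# Made-up read that still carries the complete adapter.
untrimmed = "ACGT" + ADAPTERS["Illumina TruSeq R2"]
```

Here `matching_probes(partially_trimmed, PROBES)` reports only the 20 bp TruSeq R2 prefix, while `matching_probes(untrimmed, PROBES)` reports both the full-length probe and its prefix, so the same read contributes to two counts. The two TruSeq adapters share their first 13 bases (AGATCGGAAGAGC), so a prefix needs at least 14 bp to tell them apart.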
Input Files #
- input/fastp_606.fq.gz: Single-end reads containing the Illumina TruSeq Read 2 adapter
Output Files #
- reference_output/output.report_adapter_check.html: HTML report with oligo counts
- reference_output/output.report_adapter_check.json: JSON report with oligo counts
No FASTQ output is written (format = 'None').
Expected Results #
With the provided sample data the report shows:
| Adapter | Count |
|---|---|
| illumina_truseq_r2 | 1393 |
| all others | 0 |
This identifies the library as using the Illumina TruSeq Read 2 adapter (AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT).
Next Steps #
Once you have identified the adapter, add an ExtractIUPAC + TrimAtTag step to your pipeline. See Cookbook 06: Adapter Trimming for a complete trimming example, and the adapters reference for copy/pastable count_oligos and ExtractIUPAC snippets.
When to Use This #
- Before trimming, to confirm which adapter is present
- After trimming, to verify that adapter contamination has been removed (counts should drop to zero)
- When processing data from an unknown or mixed source
Download #
Download 10-adapter-identification.tar.gz for a complete, runnable example including expected output files.
Configuration File #
[input]
read1 = 'input/fastp_606.fq.gz'
# Probe for common adapters using count_oligos.
# count_oligos performs exact full-sequence matching — no mismatches, no IUPAC.
# A count > 0 identifies reads that still carry adapter sequence.
# See the adapters reference for the full list of sequences.
[[step]]
action = 'Report'
name = 'adapter_check'
count = true
count_oligos = {
# https://support-docs.illumina.com/SHARE/AdapterSequences/adapter-sequences.htm
"Illumina Nextera/AmpliSeq" = "CTGTCTCTTATACACATCT",
"Illumina TruSeq R1/miRNA" = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA",
"Illumina TruSeq R2" = "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT",
# http://seqanswers.com/forums/showthread.php?t=87647 (2nd post)
"BGI Forward" = "AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA",
"BGI Reverse" = "AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG",
}
[output]
prefix = 'reference_output/output'
report_html = true
report_json = true
format = 'None'