Cookbook 07: Demultiplexing by Inline Barcode #

Use Case #

You have pooled sequencing data from multiple samples that were tagged with unique barcode sequences during library preparation, and your sequencing facility has not demultiplexed them.

You need to:

Extract the barcode(s) from each read
Correct sequencing errors in barcodes
Separate reads into individual files per sample

Pooling samples this way is common in multiplexed sequencing runs because it maximizes sequencing efficiency and reduces costs.

What This Pipeline Does #

This cookbook demonstrates a complete demultiplexing workflow:

Extract Barcode: Uses ExtractRegion to extract the inline barcode from the start of each read (fixed position)
Correct Errors: Uses HammingCorrect to fix single-base errors in barcodes
Remove Barcode from Sequence: Uses CutStart to trim the barcode bases from reads
Demultiplex: Uses Demultiplex to split reads into separate files per sample
Generate Report: Creates summary statistics for each sample
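
The extract, correct, trim and split steps above can be pictured with a small Python sketch. It is only an illustration of the logic, not how the tool is implemented: the function names (hamming, correct, demultiplex), the in-memory read tuples and the three example reads are all assumptions made for this sketch. The barcode table and the "no-barcode" bin name are taken from this cookbook.

```python
from collections import defaultdict

# Barcode -> sample mapping taken from the configuration in this cookbook.
BARCODES = {
    "ATCACG": "sample1",
    "CGATGT": "sample2",
    "TTAGGC": "sample3",
    "TGACCA": "sample4",
}
BARCODE_LEN = 6


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def correct(observed, max_distance=1):
    """Return the single known barcode within max_distance, else None."""
    hits = [bc for bc in BARCODES if hamming(observed, bc) <= max_distance]
    return hits[0] if len(hits) == 1 else None


def demultiplex(reads):
    """Group (name, sequence, quality) reads by sample and trim the barcode."""
    bins = defaultdict(list)
    for name, seq, qual in reads:
        barcode = correct(seq[:BARCODE_LEN])          # extract + correct
        sample = BARCODES.get(barcode, "no-barcode")  # unmatched reads
        # Trim the barcode bases from sequence and quality alike.
        bins[sample].append((name, seq[BARCODE_LEN:], qual[BARCODE_LEN:]))
    return dict(bins)


# Hypothetical reads, invented for this sketch.
reads = [
    ("r1", "ATCACGGGGTTTAA", "I" * 14),  # exact match for sample1
    ("r2", "CGATGAACGTACGT", "I" * 14),  # one mismatch against CGATGT
    ("r3", "GGGGGGACGTACGT", "I" * 14),  # no barcode within distance 1
]
bins = demultiplex(reads)
for sample, members in sorted(bins.items()):
    print(sample, [name for name, _, _ in members])
```

Here r2 is rescued by single-base correction and lands in sample2, while r3 ends up in the no-barcode bin. As in the real pipeline, the barcode bases are trimmed from every read, matched or not.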

Understanding Barcodes #

Inline barcodes are short DNA sequences (4-12bp) added to the start or end of reads:

Purpose: Uniquely identify which sample each read came from
Location: Typically at the 5’ end of read1 or in a separate index read (we support any number of segments!)
Errors: Sequencing errors can cause misassignment; error correction is helpful
Hamming distance: Number of positions at which sequences differ
- Hamming distance = 1: One base different (e.g., ATCG vs ACCG)
- Good barcode sets have a pairwise Hamming distance ≥ 3, so a single-base error still leaves the read closest to exactly one valid barcode
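
As a minimal illustration of the definition, the Hamming distance between two equal-length sequences can be computed in a few lines of Python. The function name hamming_distance is chosen for this sketch and is not part of the tool.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


print(hamming_distance("ATCG", "ACCG"))      # 1 (only the second base differs)
print(hamming_distance("ATCACG", "ACAGTG"))  # 4
```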

Input Files #

input/pooled_R1.fq - Pooled reads from multiple samples with inline barcodes

Output Files #

output_sample1_read1.fq - Reads belonging to sample1
output_sample2_read1.fq - Reads belonging to sample2
output_sample3_read1.fq - Reads belonging to sample3
output_sample4_read1.fq - Reads belonging to sample4
output_no-barcode_read1.fq - Reads with unrecognized barcodes

Expected Results #

With the provided sample data:

Input: 12 reads (from 4 samples plus some with errors)
Output: Separate files for each sample, with barcode sequences removed
Barcodes with 1 error are corrected to the nearest valid barcode
Reads whose barcode has more than one error, or matches no barcode at all, go to the unmatched file (output_no-barcode_read1.fq)

Barcode Design Considerations #

When designing barcodes:

Hamming distance ≥ 3: Allows single-error correction
Balanced GC content: Improves sequencing quality
Avoid homopolymers: AAAA, TTTT, etc. cause sequencing errors
Distinct patterns: Avoid similar-looking barcodes

Example good barcode set (6bp, Hamming ≥ 3):

ATCACG
CGATGT
TTAGGC
TGACCA
ACAGTG
GCCAAT
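
The design criteria above can be checked programmatically before ordering barcodes. The following Python sketch is one possible way to do that, run here on the example set; the helper names (min_pairwise_distance, gc_fraction, longest_homopolymer) are hypothetical and not provided by the tool.

```python
from itertools import combinations

BARCODES = ["ATCACG", "CGATGT", "TTAGGC", "TGACCA", "ACAGTG", "GCCAAT"]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes):
    """Smallest Hamming distance between any two barcodes in the set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


def gc_fraction(seq):
    """Fraction of bases that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_homopolymer(seq):
    """Length of the longest run of one repeated base."""
    longest = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest


print("minimum pairwise distance:", min_pairwise_distance(BARCODES))
for bc in BARCODES:
    print(bc, "GC =", gc_fraction(bc), "longest run =", longest_homopolymer(bc))
```

For this set the minimum pairwise distance comes out as 4, every barcode has a GC fraction of 0.5, and no base repeats more than twice in a row.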

Customization #

Adjust parameters based on your experimental design:

Barcode Location: Examples

Start of read1: segment = 'read1', start = 0, anchor='Start'
End of read1: segment = 'read1', start = -6, anchor='End' (for 6bp barcode)
Separate index read: segment = 'index1'
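
To make the two anchors concrete, here is a rough Python analogy using string slicing. It assumes that a negative start with the End anchor counts back from the last base, as the 6bp example above suggests; extract_region is a hypothetical helper written for this sketch, not the tool's implementation, and the read is invented.

```python
def extract_region(seq, start, length, anchor="Start"):
    """Return `length` bases beginning at `start`, relative to the anchor.

    anchor="Start": start counts from the first base (0-based).
    anchor="End":   start is negative and counts back from the read end.
    Returns None if the requested region does not fit inside the read.
    """
    if anchor == "Start":
        begin = start
    elif anchor == "End":
        begin = len(seq) + start
    else:
        raise ValueError(f"unknown anchor: {anchor!r}")
    if begin < 0 or begin + length > len(seq):
        return None
    return seq[begin:begin + length]


read = "ATCACGGGGTTTAACGATGT"  # hypothetical 20bp read
print(extract_region(read, 0, 6, anchor="Start"))  # ATCACG
print(extract_region(read, -6, 6, anchor="End"))   # CGATGT
```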

Error Correction:

max_hamming_distance = 0: No error correction; in this case you can leave out the HammingCorrect step entirely
max_hamming_distance = 1: Correct single-base errors (recommended)
max_hamming_distance = 2: Correct two errors (requires Hamming ≥ 5 barcode set)
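
The link between the correction setting and the barcode set follows from a standard coding-theory rule: a set with minimum pairwise Hamming distance d can unambiguously correct up to floor((d - 1) / 2) errors. The Python sketch below spells this out; both function names are made up for illustration.

```python
def max_correctable_errors(min_distance: int) -> int:
    """Errors that can be corrected unambiguously for a given minimum distance."""
    return (min_distance - 1) // 2


def required_min_distance(max_hamming_distance: int) -> int:
    """Minimum pairwise distance a barcode set needs for a correction setting."""
    return 2 * max_hamming_distance + 1


for d in (3, 4, 5):
    print(f"minimum distance {d}: corrects up to {max_correctable_errors(d)} error(s)")
```

A set with minimum distance 4 therefore still corrects only one error safely; the extra distance helps detect, but not correct, two-base errors.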

Unmatched Reads:

output_unmatched = true: Save unmatched reads for QC
output_unmatched = false: Discard unmatched reads
on_no_match = 'remove': Set tags that do not match to ‘missing’ (useful for FilterByTag)
on_no_match = 'empty': Set tags that do not match to "" (but keep the location data)
on_no_match = 'keep': Keep the non-matching barcode. Combine with tag_histogram in the report to find used-but-undocumented barcodes.

When to Use This #

Multiplexed sequencing runs with inline barcodes
Single-cell experiments with cell barcodes
Pooled CRISPR screens with guide barcodes
Any application where multiple samples are sequenced together

Alternative Approaches #

Index reads instead of inline barcodes: If barcodes are in a separate index file rather than inline:

[[step]]
    action = 'ExtractRegion'
    segment = 'index1'  # Use index read instead
    start = 0
    length = 8
    out_label = 'barcode'

Dual indexing: For higher multiplexing, use two index reads:

# Extract from index1
[[step]]
    action = 'ExtractRegion'
    segment = 'index1'
    start = 0
    length = 8
    out_label = 'i7'

# Extract from index2
[[step]]
    action = 'ExtractRegion'
    segment = 'index2'
    start = 0
    length = 8
    out_label = 'i5'

# Concatenate barcodes
[[step]]
    action = 'ConcatTags'
    in_labels = ['i7', 'i5']
    out_label = 'barcode'
    separator = '_'

# Then demultiplex on concatenated barcode

Downstream Analysis #

After demultiplexing:

Quality control per sample (Report)
Alignment to reference genome
Sample-specific analysis (variant calling, expression quantification)
Combine results across samples for comparative analysis

Quality Control #

Check demultiplexing quality by examining:

Reads per sample: Should be roughly balanced (unless intentionally unequal)
Unmatched rate: High rates (>10%) suggest barcode design or sequencing issues
Error correction rate: Monitor how many barcodes required correction
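
Once per-sample read counts are available, the first two checks are simple arithmetic. The Python sketch below shows one way to compute them; the demux_qc helper and the counts are invented for illustration and are not output of this cookbook.

```python
def demux_qc(counts, unmatched_key="no-barcode"):
    """Summarize demultiplexing quality from a {sample: read_count} mapping."""
    total = sum(counts.values())
    unmatched = counts.get(unmatched_key, 0)
    matched = [n for name, n in counts.items() if name != unmatched_key]
    return {
        "total_reads": total,
        # Fraction of reads that could not be assigned to any sample.
        "unmatched_rate": unmatched / total if total else 0.0,
        # Largest sample divided by smallest sample; 1.0 is perfectly balanced.
        "balance_ratio": max(matched) / min(matched) if matched and min(matched) else None,
    }


# Hypothetical counts, not real results.
counts = {
    "sample1": 2600,
    "sample2": 2400,
    "sample3": 2500,
    "sample4": 2000,
    "no-barcode": 500,
}
qc = demux_qc(counts)
print(qc)
if qc["unmatched_rate"] > 0.10:
    print("Warning: more than 10% of reads are unmatched")
```

With these made-up numbers the unmatched rate is 5% and the largest sample has 1.3 times the reads of the smallest, so neither check would raise a flag.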

Download #

Download 07-demultiplexing.tar.gz for a complete, runnable example including expected output files.

Configuration File #

[input]
    read1 = 'input/pooled_R1.fq'

[[step]]
    # Extract 6bp inline barcode from the start of read1
    action = 'ExtractRegion'
    segment = 'read1'
    start = 0
    length = 6
    out_label = 'barcode'
    anchor = "start"

[[step]]
    # Correct single-base sequencing errors in barcodes
    # max_hamming_distance = 1 allows correction of 1 mismatched base
    action = 'HammingCorrect'
    in_label = 'barcode'
    out_label = 'barcode_corrected'
    barcodes = 'sample_barcodes'
    max_hamming_distance = 1
    on_no_match = 'keep'  # Keep the original barcode when no match is found

[[step]]
    # Remove the barcode sequence from the reads
    # This leaves only the biological sequence for alignment
    action = 'CutStart'
    segment = 'read1'
    n = 6

[[step]]
    # Split reads into separate files per sample
    # Creates output files: output_sample1_read1.fq, output_sample2_read1.fq, etc.
    action = 'Demultiplex'
    in_label = 'barcode_corrected'
    barcodes = 'sample_barcodes'
    output_unmatched = true  # Save unmatched reads to output_no-barcode_read1.fq

# Define the barcode → sample mapping
# Barcode sequences are chosen to have Hamming distance ≥ 3
# This allows reliable single-error correction
[barcodes.sample_barcodes]
    ATCACG = 'sample1'
    CGATGT = 'sample2'
    TTAGGC = 'sample3'
    TGACCA = 'sample4'

[output]
    prefix = 'reference_output/output'
    format = "FASTQ"