Assign to Reference

AssignToReference #

Assign each query sequence to the closest entry in a barcodes section, using Hamming distance.

(HammingCorrect, by contrast, corrects the tag to the closest barcode sequence instead of reporting the entry's name.)

At start-up, the step builds an efficient Hamming-distance index over the reference sequences. For every read, the tag supplied in in_label is looked up in the index, and the name of the closest matching reference entry is written to out_label as a string tag.

When no reference entry falls within max_hamming_distance, the output tag is set to Missing, which can be acted on by a downstream FilterByTag step.
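The lookup described above can be illustrated with a minimal Python sketch. It is not the step's actual implementation: it uses a plain linear scan instead of an index, the function names and the sample sequences are hypothetical, `None` stands in for the Missing tag, and ties are resolved by taking the first candidate, which is an assumption rather than documented behaviour.

```python
from typing import Dict, Optional


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def assign_to_reference(
    query: str, references: Dict[str, str], max_hamming_distance: int
) -> Optional[str]:
    """Return the name of the closest reference, or None (standing in for Missing)."""
    best_name = None
    best_dist = max_hamming_distance + 1  # anything beyond the limit is rejected
    for seq, name in references.items():
        dist = hamming(query, seq)
        if dist < best_dist:  # strict '<' keeps the first candidate on ties (assumption)
            best_name, best_dist = name, dist
    return best_name


# Hypothetical reference: sequence -> entry name
references = {"ACGTACGT": "sample_a", "TTTTCCCC": "sample_b"}

print(assign_to_reference("ACGTACGA", references, 2))  # one mismatch to sample_a
print(assign_to_reference("GGGGGGGG", references, 2))  # nothing within distance 2
```

The first call prints `sample_a` and the second prints `None`, mirroring a read that would carry a Missing tag.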

Missing tags lead to empty strings when exported.

For BAM input, all ‘reads’ in the BAM file are included, regardless of alignment status.

# 1. Extract the region you want to compare against the reference.
# (Do this even if you're querying with the complete sequence.)
[[step]]
    action = "ExtractRegion"
    source = "read1"
    start = 0
    length = 50
    anchor = "Start"
    out_label = "query_seq"

# 2. Look up the extracted sequence in the reference database.
[[step]]
    action = "AssignToReference"
    in_label = "query_seq"
    out_label = "ref_name"
    max_hamming_distance = 2
    barcodes = 'reference_barcodes'
    name_split_character = ' ' # Optional, see Notes.

[barcodes.reference_barcodes]
    from_file = { filename = "reference.fa" }

# 3. (Optional) Discard reads that did not match any reference entry.
[[step]]
    action = "FilterByTag"
    in_label = "ref_name"
    keep_or_remove = "keep"

# 4. (Optional) Store the reference name in the read comment.
[[step]]
    action = "StoreTagInComment"
    in_label = "ref_name"

Parameters #

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| in_label | tag name | yes | Tag holding the query sequence (String or Location tag). |
| out_label | tag name | yes | Output tag for the matched reference name (String). |
| max_hamming_distance | integer | yes | Maximum allowed Hamming distance. Use 0 for exact matches only. |
| barcodes | String | yes | Which barcode section to reference. |
| name_split_character | Char/byte | no | Split barcode names at this character when considering equivalence classes (see below). |

Notes #

  • All sequences in the reference file must have the same length as the query sequences.
  • If the reference contains duplicate sequences, an error occurs while the reference is first read. Multiple distinct sequences leading to the same label are ok.
  • The in_label tag can be a String tag (e.g. from ExtractRegion) or a Location tag (e.g. from ExtractIUPAC).
  • Use HammingCorrect instead when you want to correct a tag to one of the barcode sequences.
  • Every barcode sequence must lead to exactly one barcode label: the Hamming-based equivalence classes must be disjoint. You can influence this with name_split_character; fastqrab considers labels equivalent if they are equal up to the first name_split_character. The stored sequence is then always the first one defined in your barcode section.
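
The label rules in the notes above can be sketched in Python as follows. This is an illustration under assumptions, not the tool's code: `label_key` and `group_by_label_key` are hypothetical helpers, the entry names are made up, and the sketch covers only the duplicate-sequence check and the "equal up to the first name_split_character" comparison. It does not model how the disjointness of the equivalence classes is verified.

```python
from typing import Dict, List, Optional, Tuple


def label_key(name: str, name_split_character: Optional[str] = None) -> str:
    """Part of an entry name that decides whether two labels count as equivalent."""
    if name_split_character is None:
        return name
    # Keep everything before the first occurrence of the split character.
    return name.split(name_split_character, 1)[0]


def group_by_label_key(
    entries: List[Tuple[str, str]], name_split_character: Optional[str] = None
) -> Dict[str, List[str]]:
    """Group reference sequences by label key; reject duplicate sequences."""
    seen = set()
    groups: Dict[str, List[str]] = {}
    for name, seq in entries:
        if seq in seen:
            raise ValueError(f"duplicate reference sequence: {seq}")
        seen.add(seq)
        groups.setdefault(label_key(name, name_split_character), []).append(seq)
    return groups


# Hypothetical reference entries: (name, sequence)
entries = [
    ("geneA variant1", "ACGTACGT"),
    ("geneA variant2", "ACGTACGA"),
    ("geneB", "TTTTCCCC"),
]

groups = group_by_label_key(entries, " ")
print(groups)
```

With a space as the split character, both `geneA` entries fall under one key, and the first sequence listed for a key corresponds to the one the notes describe as stored. Without a split character, the two `geneA` names would be treated as different labels.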