Tag / Label #
A tag is a piece of fragment-derived metadata that one step in the pipeline produces, and other steps may consume, transform, or export.
Overview #
Tags enable sophisticated workflows by decoupling data extraction from data usage. Instead of hardcoding logic like “trim adapters AND filter by adapter presence” into a single step, you extract adapter locations as a tag, then use that tag in multiple downstream operations.
Tags are identified by labels (arbitrary names following the pattern [a-zA-Z_][a-zA-Z0-9_]*) and carry typed values that describe properties of each fragment.
Tag Types #
mbf-fastq-processor supports four tag types.
(None of the step listings below is exhaustive.)
Location+Sequence Tags #
Represent a region within a segment, storing:
- A segment reference
- Start position (0-based, inclusive)
- End position (0-based, exclusive)
- The extracted sequence (which may be changed by downstream steps)
If you modify the segment's sequence, the tag's positions may no longer point at the original region. The extracted sequence, however, is retained.
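To make the position caveat concrete, here is a minimal Python sketch of what such a tag holds. The `LocationTag` class and its field names are hypothetical illustrations, not the processor's internal types.

```python
from dataclasses import dataclass


@dataclass
class LocationTag:
    """Hypothetical model of a location+sequence tag (not the tool's real type)."""
    segment: str   # name of the segment the region refers to, e.g. "read1"
    start: int     # 0-based, inclusive
    end: int       # 0-based, exclusive
    sequence: str  # copy of the bases found at [start, end) when extracted


read1 = "ACGTAGGGGTTTT"
match_start = read1.find("AGGGG")
tag = LocationTag("read1", match_start, match_start + len("AGGGG"), "AGGGG")

# While the segment is unchanged, the coordinates still point at the match.
coords_valid_before = read1[tag.start:tag.end] == tag.sequence

# Removing the first two bases shifts every position. The stored coordinates
# now point at different bases, but the stored sequence is unchanged.
read1 = read1[2:]
coords_valid_after = read1[tag.start:tag.end] == tag.sequence
```

In this sketch `coords_valid_before` is true and `coords_valid_after` is false, while `tag.sequence` still holds the originally extracted bases.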
Created by:
- ExtractIUPAC: Find IUPAC patterns (e.g., adapters, barcodes)
- ExtractRegex: Find regex patterns
- ExtractRegion: Extract fixed coordinate regions
Used by, for example:
- TrimAtTag: Cut segment at tag location
- Lowercase: Lowercase sequences, tags, or names
- Uppercase: Uppercase the stored sequence using target = "tag:…" (follow with StoreTagInSequence)
- FilterByTag: Keep/remove fragments based on tag presence
- QuantifyTag: Generate histograms and statistics
- StoreTagInComment: Append tag sequence to read name
- StoreTagsInTable: Export to TSV
Sequence-Only Tags #
Store just a sequence string without positional information.
Created by:
- ExtractRegex with a name or tag source.
Used by:
- FilterByTag
- StoreTagInComment: Append tag sequence to read name
- StoreTagsInTable: Export to TSV
Numeric Tags #
Store floating-point or integer values representing computed metrics.
Created by:
- CalcMeanQuality: Average quality score
- CalcGCContent: GC percentage
- CalcLength: Sequence length
- EvalExpression: Compute from other tags (if result_type = 'numeric')
Used by:
- FilterByNumericTag: Threshold filtering
- EvalExpression: Combine in calculations
- StoreTagsInTable: Export to TSV
Boolean Tags #
Store true/false flags indicating fragment properties.
Created by:
- EvalExpression: Compute from other tags (if result_type = 'bool')
Used by:
- FilterByTag
- StoreTagsInTable: Export flags
Tag Lifecycle #
Tags follow a strict lifecycle enforced by the processor:
- Definition: A step with out_label creates a tag
- Consumption: Steps with in_label or in_labels read the tag
- Transformation: Convert steps modify tags into new tags
- Removal: Consuming steps may delete tags (e.g., ForgetTag)
Validation: At startup, the processor verifies:
- Every tag is defined before use
- Every defined tag is eventually consumed
- The types of consumed tags match step expectations
- Tag names follow the naming rules
This catches typos (e.g., in_label = "adaptor" when you created out_label = "adapter") before processing begins.
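The first two checks can be sketched in a few lines of Python. This is an illustration under simplifying assumptions (each step is a plain dict with at most one `in_label` and one `out_label`), and `validate_tags` is a hypothetical helper, not the processor's actual validation code.

```python
def validate_tags(steps):
    """Return a list of problems found in a simplified step list.

    Covers only two checks: every tag is defined before use, and every
    defined tag is consumed by some later step.
    """
    defined = []      # labels in the order they are defined
    consumed = set()  # labels read by at least one step
    problems = []
    for index, step in enumerate(steps):
        label_in = step.get("in_label")
        if label_in is not None:
            if label_in in defined:
                consumed.add(label_in)
            else:
                problems.append(f"step {index}: tag '{label_in}' used before definition")
        label_out = step.get("out_label")
        if label_out is not None:
            defined.append(label_out)
    for label in defined:
        if label not in consumed:
            problems.append(f"tag '{label}' is defined but never consumed")
    return problems


# The typo from the text: the tag is created as "adapter" but read as "adaptor".
typo_steps = [
    {"action": "ExtractIUPAC", "out_label": "adapter"},
    {"action": "TrimAtTag", "in_label": "adaptor"},
]
typo_problems = validate_tags(typo_steps)

# With the label spelled consistently, no problems are reported.
fixed_steps = [
    {"action": "ExtractIUPAC", "out_label": "adapter"},
    {"action": "TrimAtTag", "in_label": "adapter"},
]
fixed_problems = validate_tags(fixed_steps)
```

Note that a single typo trips both checks in this sketch: the misspelled label is used without a definition, and the correctly spelled one is never consumed.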
Tag Naming Rules #
Tag labels must:
- Match the regex [a-zA-Z_][a-zA-Z0-9_]*
- Be case-sensitive (mean_q ≠ Mean_Q)
- Not be ReadName (reserved for table output)
- Not start with len_ (reserved for virtual tags in EvalExpression)
Good names:
- adapter_r1
- barcode_fwd
- mean_quality_passing
- gc_content
Invalid names:
- mean-quality (contains hyphen)
- 2adapter (starts with number)
- ReadName (reserved)
- len_adapter (reserved prefix)
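The naming rules above translate directly into a small checker. The following Python sketch is illustrative only: `check_label` and its messages are hypothetical, and the real processor may report these errors differently.

```python
import re

# The pattern stated in the naming rules.
LABEL_PATTERN = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def check_label(label):
    """Return None if the label is acceptable, otherwise a short reason."""
    if LABEL_PATTERN.fullmatch(label) is None:
        return "does not match [a-zA-Z_][a-zA-Z0-9_]*"
    if label == "ReadName":
        return "reserved for table output"
    if label.startswith("len_"):
        return "reserved prefix for virtual tags"
    return None


candidates = ["adapter_r1", "gc_content", "mean-quality", "2adapter", "ReadName", "len_adapter"]
results = {name: check_label(name) for name in candidates}
```

`fullmatch` is used rather than `match` so that a label with a valid prefix but an invalid tail (such as `mean-quality`) is rejected.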
Advanced Usage #
Virtual Tags in EvalExpression #
When using EvalExpression, you can reference tag lengths with len_<tagname>:
[[step]]
action = "ExtractIUPAC"
segment = "read1"
search = "NNNN"
anchor = "anywhere"
max_mismatches = 0
out_label = "umi"
[[step]]
action = "EvalExpression"
expression = "len_umi == 4" # Virtual tag: length of UMI
out_label = "correct_umi_length"
result_type = 'bool'
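One way to picture the virtual tag is as an extra variable derived from each tag before the expression is evaluated. The Python sketch below assumes that model. `expression_variables` is a hypothetical helper, and the actual expression engine may resolve `len_<tagname>` differently.

```python
def expression_variables(tags):
    """Build the name -> value mapping an expression could see.

    Every tag is exposed under its own label. Tags holding a sequence
    additionally get a virtual len_<label> entry with the sequence length.
    """
    variables = {}
    for label, value in tags.items():
        variables[label] = value
        if isinstance(value, str):
            variables["len_" + label] = len(value)
    return variables


variables = expression_variables({"umi": "ACGT"})
correct_umi_length = variables["len_umi"] == 4
```

Under this model, a real tag named `len_umi` would collide with the virtual entry, which is consistent with the `len_` prefix being reserved.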
Conditional Processing #
Modifying steps can be applied conditionally, based on a tag:
# Tag short reads
[[step]]
action = "EvalExpression"
expression = "len_read1 < 100"
out_label = "is_short"
result_type = 'bool'
# Modify only the fragments for which the tag is true
[[step]]
action = "Postfix"
seq = "AGGGG"
qual = "#####"
segment = 'read1'
if_tag = "is_short" # Append postfix only to short reads
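The effect of the two steps above on individual reads can be sketched in Python. The functions here are hypothetical stand-ins for the EvalExpression and Postfix steps, and reads are modeled as simple (sequence, quality) pairs.

```python
def is_short(sequence, threshold=100):
    """Stand-in for the boolean tag computed by 'len_read1 < 100'."""
    return len(sequence) < threshold


def postfix_if(read, condition, seq="AGGGG", qual="#####"):
    """Append seq and qual to a (sequence, quality) pair only if condition holds."""
    if not condition:
        return read
    sequence, quality = read
    return (sequence + seq, quality + qual)


short_read = ("ACGT" * 10, "I" * 40)    # 40 bases: tagged as short
long_read = ("ACGT" * 30, "I" * 120)    # 120 bases: not tagged

processed = [postfix_if(read, is_short(read[0])) for read in (short_read, long_read)]
```

Only the 40-base read gains the postfix, and its sequence and quality strings stay the same length as each other. The 120-base read passes through unchanged.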
See Also #
- Tag extraction reference for all tag-generating steps
- Step concept for tag lifecycle validation
- Source concept for using tags as data sources