Tag / Label #
A regular tag is a piece of fragment-derived metadata that one step in the pipeline produces, and other steps may consume, transform, or export.
A virtual tag is a tag created on the fly that exists only for the step using it and disappears right afterwards.
Overview - Regular tags #
Tags enable sophisticated workflows by decoupling data extraction from data usage. Instead of hardcoding logic like “trim adapters AND filter by adapter presence” into a single step, you extract adapter locations as a tag, then use that tag in multiple downstream operations.
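As a rough sketch of this decoupling, the configuration below extracts an adapter location once and then reuses the resulting tag in two downstream steps. The ExtractIUPAC parameters follow the EvalExpression example later on this page. The search sequence and the label adapter are chosen purely for illustration, and any further parameters that FilterByTag and TrimAtTag may require are deliberately left out.

```toml
# Illustrative sketch only: sequence and label are hypothetical.
[[step]]
action = "ExtractIUPAC"
segment = "read1"
search = "AGATCGGAAGAGC"   # example adapter sequence, not taken from this page
anchor = "anywhere"
max_mismatches = 1
out_label = "adapter"      # the tag is defined once, here

[[step]]
action = "FilterByTag"     # first consumer: keep/remove fragments by tag presence
in_label = "adapter"
# (further FilterByTag parameters omitted)

[[step]]
action = "TrimAtTag"       # second consumer: cut the segment at the tag location
in_label = "adapter"
# (further TrimAtTag parameters omitted)
```

Neither consumer needs to know how the adapter was found, so the extraction step can be swapped (for example, for ExtractRegex) without touching them.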
Tags are identified by labels (arbitrary names following the pattern
[a-zA-Z_][a-zA-Z0-9_]*) and carry typed values that describe properties of each
fragment.
Virtual tags are identified by a specific ‘xyz_’ prefix (see below). You cannot declare a tag with such a prefix as the out_label of any step.
(The tag ‘ReadName’ is also reserved for use in StoreTagsInTable’s index column.)
Tag Types #
fastqrab supports four tag types:
(None of the step listings below are exhaustive.)
Location+Sequence Tags #
Represent a region within a segment, storing:
- A segment reference
- Start position (0-based, inclusive)
- End position (0-based, exclusive)
- The extracted sequence (which may be changed by downstream steps)
If you modify the segment’s sequence, tag positions may become invalid. The extracted sequence, however, is retained.
Created by:
- ExtractIUPAC - Find IUPAC patterns (e.g., adapters, barcodes)
- ExtractRegex - Find regex patterns
- ExtractRegion - Extract fixed coordinate regions
Used for example by:
- TrimAtTag - Cut segment at tag location
- Lowercase - Lowercase sequences, tags, or names
- Uppercase - Uppercase the stored sequence using target = "tag:…" (follow with StoreTagBackInSequence)
- FilterByTag - Keep/remove fragments based on tag presence
- QuantifyTag - Generate histograms and statistics
- StoreTagInComment - Append tag sequence to read name
- StoreTagsInTable - Export to TSV
Sequence-Only Tags #
Store just a sequence string without positional information.
Created by:
- ExtractRegex with a name or tag source.
Used by:
- FilterByTag
- StoreTagInComment - Append tag sequence to read name
- StoreTagsInTable - Export to TSV
Numeric Tags #
Store floating-point or integer values representing computed metrics.
Some steps declare a range on the tag (lower..=upper, inclusive on both ends); for example, CalcGCContent declares 0..=1 if relative = true. The thresholds in FilterByNumericTag are then checked against these limits.
Created by:
- CalcMeanQuality - Average quality score
- CalcGCContent - GC percentage
- CalcLength - Sequence length
- EvalExpression - Compute from other tags (if result_type == 'numeric')
Used by:
- FilterByNumericTag - Threshold filtering
- EvalExpression - Combine in calculations
- StoreTagsInTable - Export to TSV
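A minimal sketch of a numeric tag being produced and then reused in a calculation might look like the following. The segment parameter on CalcGCContent is an assumption made by analogy with the other steps on this page, and the labels and the 0.6 threshold are arbitrary.

```toml
# Illustrative sketch only: labels and threshold are hypothetical.
[[step]]
action = "CalcGCContent"
segment = "read1"          # assumed parameter, by analogy with other steps
relative = true            # declared range of the tag is then 0..=1
out_label = "gc"

[[step]]
action = "EvalExpression"
expression = "gc < 0.6"    # the numeric tag used in a calculation
out_label = "moderate_gc"
result_type = 'bool'
# A later step still has to consume moderate_gc (see Tag Lifecycle).
```

Because the tag declares the range 0..=1, a threshold such as 60 (a percentage rather than a fraction) would lie outside the declared limits.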
Boolean Tags #
Store true/false flags indicating fragment properties.
Created by:
- EvalExpression - Compute from other tags (if result_type == 'bool')
Used by:
- FilterByTag
- StoreTagsInTable - Export flags
Tag Lifecycle #
Tags follow a strict life-cycle enforced by the processor:
- Definition: A step with out_label creates a tag
- Consumption: Steps with in_label or in_labels read the tag
- Transformation: Convert steps modify tags into new tags
- Removal: Consuming steps may delete tags (e.g., ForgetTag)
Validation: At startup, the processor verifies:
- Every tag is defined before use
- Every defined tag is eventually consumed
- The types of consumed tags match step expectations
- Tag names follow the naming rules
This catches typos (e.g., in_label = "adaptor" when you created out_label = "adapter") before processing begins.
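To make the typo case concrete, a configuration along the following lines would be expected to fail validation at startup: the label adaptor is never defined, and the label adapter is never consumed. The search sequence is illustrative, further FilterByTag parameters are omitted, and the exact error message is not shown because it is not documented here.

```toml
# Illustrative sketch of a configuration that should be rejected at startup.
[[step]]
action = "ExtractIUPAC"
segment = "read1"
search = "AGATCGGAAGAGC"   # example sequence, not taken from this page
anchor = "anywhere"
max_mismatches = 1
out_label = "adapter"      # defined here ...

[[step]]
action = "FilterByTag"
in_label = "adaptor"       # ... but misspelled here: no step defines "adaptor"
```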
Tag Naming Rules #
Tag labels must:
- Match the regex [a-zA-Z_][a-zA-Z0-9_]*
- Be treated as case-sensitive (mean_q ≠ Mean_Q)
- Not be ReadName (reserved for table output)
- Not start with len_ (reserved for virtual tags in EvalExpression)
Good names:
adapter_r1, barcode_fwd, mean_quality_passing, gc_content
Invalid names:
- mean-quality (contains hyphen)
- 2adapter (starts with number)
- ReadName (reserved)
- len_adapter (reserved prefix)
Virtual tags #
Any place you can use a tag, you can also use virtual tags.
The following virtual tags are supported:
- read_no - the sequential number of the molecule in the input.
- len_<segment|all> - the length of the given segment (or, for len_all, of the whole molecule) at this step in the pipeline.
- len_<tag_name> - the length of a tag’s string value (for location tags, that is the length after regex replacement etc.). Requires a string or location typed tag.
- location_<tag_name> - the location of a (location) tag, as a string of the form segment:start..end (0-based, left inclusive, right exclusive).
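As a hedged sketch, the virtual tags read_no and len_all could be used in EvalExpression like this. The labels and thresholds are invented for illustration, only the < comparison already shown on this page is used, and whether read_no starts at 0 or 1 is not stated here.

```toml
# Illustrative sketch only: labels and thresholds are hypothetical.
[[step]]
action = "EvalExpression"
expression = "read_no < 1000"   # virtual tag: position of the molecule in the input
out_label = "early_read"
result_type = 'bool'

[[step]]
action = "EvalExpression"
expression = "len_all < 50"     # virtual tag: length of the whole molecule at this step
out_label = "short_molecule"
result_type = 'bool'
```

Neither read_no nor len_all is declared by an earlier step; both exist only while the step that references them runs.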
Example Len Tags in EvalExpression #
When using EvalExpression, you can reference tag lengths with len_<tag_name>:
[[step]]
action = "ExtractIUPAC"
segment = "read1"
search = "NNNN"
anchor = "anywhere"
max_mismatches = 0
out_label = "umi"
[[step]]
action = "EvalExpression"
expression = "len_umi == 4" # Virtual tag: length of UMI
out_label = "correct_umi_length"
result_type = 'bool'
Conditional Processing #
Modifying steps can be applied conditionally, based on a boolean tag:
# Tag short reads
[[step]]
action = "EvalExpression"
expression = "len_read1 < 100"
out_label = "is_short"
result_type = 'bool'
# Modify only those reads for which the boolean tag is true
[[step]]
action = "Postfix"
seq = "AGGGG"
qual = "#####"
segment = 'read1'
if_tag = "is_short" # Append postfix only to short reads
See Also #
- Tag extraction reference for all tag-generating steps
- Step concept for tag lifecycle validation
- Source concept for using tags as data sources