Tag / Label #

A tag is a piece of fragment-derived metadata that one step in the pipeline produces, and other steps may consume, transform, or export.

Overview #

Tags enable sophisticated workflows by decoupling data extraction from data usage. Instead of hardcoding logic like “trim adapters AND filter by adapter presence” into a single step, you extract adapter locations as a tag, then use that tag in multiple downstream operations.

Tags are identified by labels (arbitrary names following the pattern [a-zA-Z_][a-zA-Z0-9_]*) and carry typed values that describe properties of each fragment.

Tag Types #

mbf-fastq-processor supports four tag types:

(The step listings below are not exhaustive.)

Location+Sequence Tags #

Represent a region within a segment, storing:

  • A segment reference
  • Start position (0-based, inclusive)
  • End position (0-based, exclusive)
  • The extracted sequence (which may be changed by downstream steps)

If a later step modifies the segment’s sequence, the tag’s positions may no longer point at the same bases. The extracted sequence, however, is retained.
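To make the position caveat concrete, here is a minimal Python sketch of a location+sequence tag. It is a conceptual model only, not mbf-fastq-processor's internal representation: the class name `LocationTag`, its field names, and the `extract` helper are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class LocationTag:
    """Hypothetical model of a location+sequence tag (illustration only)."""
    segment: str   # name of the segment the region belongs to, e.g. "read1"
    start: int     # 0-based, inclusive
    end: int       # 0-based, exclusive
    sequence: str  # copy of the bases at [start, end) when the tag was created


def extract(segment_name: str, read: str, start: int, end: int) -> LocationTag:
    """Create a tag covering read[start:end]."""
    return LocationTag(segment_name, start, end, read[start:end])


read1 = "ACGTAGGGGTTTT"
tag = extract("read1", read1, 4, 9)  # covers "AGGGG"

# Trimming the first four bases changes the segment, but not the stored tag.
trimmed = read1[4:]

# The stored positions now select different bases in the trimmed read ...
stale = trimmed[tag.start:tag.end]
# ... while the sequence captured at extraction time is still available.
kept = tag.sequence
```

In this sketch `stale` no longer equals `kept`, which is the situation the paragraph above warns about.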

Created by:

Used for example by:

Sequence-Only Tags #

Store just a sequence string without positional information.

Created by:

Used by:

Numeric Tags #

Store floating-point or integer values representing computed metrics.

Created by:

Used by:

Boolean Tags #

Store true/false flags indicating fragment properties.

Created by:

Used by:

Tag Lifecycle #

Tags follow a strict lifecycle enforced by the processor:

  1. Definition: A step with out_label creates a tag
  2. Consumption: Steps with in_label or in_labels read the tag
  3. Transformation: Convert steps modify tags into new tags
  4. Removal: Consuming steps may delete tags (e.g., ForgetTag)

Validation: At startup, the processor verifies:

  • Every tag is defined before use
  • Every defined tag is eventually consumed
  • The types of consumed tags match step expectations
  • Tag names follow the naming rules

This catches typos (e.g., in_label = "adaptor" when you created out_label = "adapter") before processing begins.
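The first two checks can be sketched in a few lines of Python. This is an illustrative model under simplifying assumptions, not the processor's actual validation code: steps are represented as plain dictionaries standing in for `[[step]]` entries, only `out_label`, `in_label`, and `in_labels` are considered, and type checking and naming rules are left out. The function name `validate_tags` is hypothetical, and `ForgetTag` taking `in_label` is assumed here.

```python
def validate_tags(steps):
    """Return a list of error messages for a list of step dictionaries.

    A step may define a tag via "out_label" and consume tags via
    "in_label" or "in_labels". Everything else is ignored in this sketch.
    """
    errors = []
    defined = []      # labels in the order they were defined
    consumed = set()  # labels read by at least one step
    for index, step in enumerate(steps):
        # Collect every label this step reads.
        used = list(step.get("in_labels", []))
        if "in_label" in step:
            used.append(step["in_label"])
        for label in used:
            if label not in defined:
                errors.append(f"step {index}: tag '{label}' used before definition")
            consumed.add(label)
        # Register the label only after checking the step's inputs.
        if "out_label" in step:
            defined.append(step["out_label"])
    for label in defined:
        if label not in consumed:
            errors.append(f"tag '{label}' is defined but never consumed")
    return errors


steps = [
    {"action": "ExtractIUPAC", "out_label": "adapter"},
    {"action": "ForgetTag", "in_label": "adaptor"},  # typo: "adaptor"
]
problems = validate_tags(steps)
for message in problems:
    print(message)
```

With the misspelled label, the sketch reports two problems: `adaptor` is used without being defined, and `adapter` is defined but never consumed.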

Tag Naming Rules #

Tag labels must:

  • Match the regex [a-zA-Z_][a-zA-Z0-9_]*
  • Be case-sensitive (mean_q ≠ Mean_Q)
  • Not be ReadName (reserved for table output)
  • Not start with len_ (reserved for virtual tags in EvalExpression)

Good names:

  • adapter_r1
  • barcode_fwd
  • mean_quality_passing
  • gc_content

Invalid names:

  • mean-quality (contains hyphen)
  • 2adapter (starts with a digit)
  • ReadName (reserved)
  • len_adapter (reserved prefix)
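If you generate configurations programmatically, the rules above are easy to check up front. The following Python sketch applies them as stated on this page. The function name `label_problem` is hypothetical, and the processor's own checks may differ in detail (for example, in how the reserved name is compared).

```python
import re

# Pattern taken from the naming rules above.
LABEL_PATTERN = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def label_problem(label: str):
    """Return a reason why `label` is invalid, or None if it passes."""
    if not LABEL_PATTERN.fullmatch(label):
        return "does not match [a-zA-Z_][a-zA-Z0-9_]*"
    if label == "ReadName":
        return "reserved for table output"
    if label.startswith("len_"):
        return "reserved prefix for virtual tags"
    return None


names = ["adapter_r1", "gc_content", "mean-quality", "2adapter", "ReadName", "len_adapter"]
for name in names:
    print(name, "->", label_problem(name) or "ok")
```

Because labels are case-sensitive, `mean_q` and `Mean_Q` both pass this check and would be treated as two different tags.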

Advanced Usage #

Virtual Tags in EvalExpression #

When using EvalExpression, you can reference tag lengths with len_<tagname>:

[[step]]
    action = "ExtractIUPAC"
    segment = "read1"
    search = "NNNN"
    anchor = "anywhere"
    max_mismatches = 0
    out_label = "umi"

[[step]]
    action = "EvalExpression"
    expression = "len_umi == 4"    # Virtual tag: length of UMI
    out_label = "correct_umi_length"
    result_type = 'bool'

Conditional Processing #

Modifying steps can be applied conditionally, based on a boolean tag:

# Tag short reads
[[step]]
    action = "EvalExpression"
    expression = "len_read1 < 100" 
    out_label = "is_short"
    result_type = 'bool'

# Modify reads conditionally, based on the boolean tag
[[step]]
    action = "Postfix"
    seq = "AGGGG"
    qual = "#####"
    segment = 'read1'
    if_tag = "is_short"  # Append postfix only to short reads

See Also #