Store Single Cell Matrix

StoreSingleCellMatrix #

Collect per-read (gene, cell barcode, UMI) index triples in memory and write a count matrix in MatrixMarket (.mtx) format.

[[step]]
    action = "StoreSingleCellMatrix"
    cell_tag = "cell_bc"           # tag carrying the cell-barcode sequence or label
    gene_tag = "gene_bc"           # tag carrying the gene-barcode sequence
    umi_tag = "umi"                # tag carrying the raw UMI sequence
    cell_barcodes = "cells"        # [barcodes.cells] section (sequence → name)
    gene_barcodes = "genes"        # [barcodes.genes] section (sequence → name)
    cell_tag_contains_barcode = true    # (optional) see below; auto-detected if omitted
    gene_tag_contains_barcode = true    # (optional) see below; auto-detected if omitted
    infix = ""                     # (optional) filename infix. 
    compression = "Raw"            # (optional) Raw, Gzip, Zstd — for all output files.
    umi_aggregation = "Exact"       # How to handle duplicate UMIs. See below

Inputs #

cell_tag must carry either a Location tag (from ExtractRegion) or a String tag whose value is either a barcode sequence or a corrected label (e.g. from [AssignByHalves](/fastqrab/v0.9.0/docs/redirects/AssignByHalves/) or [HammingCorrect](/fastqrab/v0.9.0/docs/redirects/HammingCorrect/) with output = 'label').

gene_tag must carry a Location or String tag whose value is a barcode sequence.

Both are looked up against the corresponding [barcodes.*] section.

(cell|gene)_tag_contains_barcode #

Controls how cell_tag or gene_tag values are resolved:

| Value | Behaviour |
| --- | --- |
| true | Value is a barcode sequence; looked up by sequence |
| false | Value is a corrected label (e.g. from AssignByHalves); looked up by name |
| (omitted) | Auto-detected: Location input tags → true, String tags → false |

Unrecognised sequences are assigned index 0 (“unmatched”).

Real barcodes are 1-indexed in the order they appear in the [barcodes.*] table.
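The indexing rule can be sketched as follows. This is a minimal illustration, not the tool's implementation: the barcode table, the function names `build_index` and `resolve`, and the one-name-per-sequence layout are assumptions made for the example. It also assumes that an unknown label, like an unrecognised sequence, falls back to index 0.

```python
def build_index(barcodes):
    """Map sequences and names to 1-based indices, in table order.

    `barcodes` is a hypothetical dict of sequence -> name standing in for
    a [barcodes.*] section; index 0 is reserved for "unmatched".
    """
    by_sequence = {}
    by_name = {}
    for position, (sequence, name) in enumerate(barcodes.items(), start=1):
        by_sequence[sequence] = position
        by_name[name] = position
    return by_sequence, by_name


def resolve(value, by_sequence, by_name, contains_barcode):
    """Return the index for a tag value; 0 means "unmatched"."""
    table = by_sequence if contains_barcode else by_name
    return table.get(value, 0)


# Made-up example table
cells = {"AAAC": "cell_a", "CCCG": "cell_b"}
by_seq, by_name = build_index(cells)

print(resolve("CCCG", by_seq, by_name, contains_barcode=True))
print(resolve("cell_a", by_seq, by_name, contains_barcode=False))
print(resolve("GGGG", by_seq, by_name, contains_barcode=True))
```

With `contains_barcode=True` the value is treated as a sequence, with `False` as a label, mirroring the table above.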

umi_tag accepts Location or String. The UMI is 2-bit encoded (A=0, C=1, G=2, T=3; any other → u32::MAX). Maximum UMI length is 16 bp.
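To make the 2-bit encoding concrete, here is a small sketch. The base codes and the u32::MAX sentinel come from the description above; the packing order (first base in the most significant position) and the function name `encode_umi` are assumptions for illustration only.

```python
U32_MAX = 0xFFFFFFFF  # sentinel for UMIs containing a non-ACGT character
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_umi(umi):
    """Pack a UMI into an integer using 2 bits per base.

    Assumption: the first base ends up in the most significant position.
    16 bases x 2 bits = 32 bits, hence the 16 bp limit.
    """
    if len(umi) > 16:
        raise ValueError("UMI longer than 16 bp does not fit into 32 bits")
    value = 0
    for base in umi:
        code = CODES.get(base)
        if code is None:
            return U32_MAX  # any other character, e.g. N
        value = (value << 2) | code
    return value


print(encode_umi("ACGT"))              # 0b00_01_10_11
print(encode_umi("ACNT") == U32_MAX)   # contains N
print(encode_umi("T" * 16) == U32_MAX) # 16 T's collide with the sentinel
```

Under this packing, only a full-length 16 bp all-T UMI produces the same value as the sentinel; shorter all-T UMIs do not.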

UMI aggregation #

Depending on the umi_aggregation setting, different UMI->count algorithms are used.

Possible Values:

  • None - Do not aggregate UMIs, report read count. UMIs with N are counted.
  • Exact - Each UMI counts at most once per gene & cell. UMIs with any N are not counted.
  • Cluster - We count the 1-hamming-distance connected components of the observed UMIs per gene & cell. UMIs with any N are not counted.

(Internally, a UMI containing any N is encoded as 16 T's (“TTTTTTTTTTTTTTTT”), which is then not counted. As a consequence, with 16 bp UMIs a genuine all-T UMI is also not counted. With shorter UMIs this does not affect poly-T counting.)
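The three modes can be compared on a toy input. The sketch below works on plain UMI strings rather than the encoded integers, and uses a simple pairwise union-find for the Cluster mode; the function names and the example reads are made up, and the tool's actual algorithm may differ.

```python
from collections import defaultdict


def hamming1(a, b):
    """True if two equal-length UMIs differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def count_components(umis):
    """Count connected components when UMIs at Hamming distance 1 are linked."""
    umis = list(umis)
    parent = list(range(len(umis)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(umis)):
        for j in range(i + 1, len(umis)):
            if hamming1(umis[i], umis[j]):
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(umis))})


def aggregate(reads, mode):
    """reads: iterable of (gene, cell, umi); returns {(gene, cell): count}."""
    if mode == "None":
        counts = defaultdict(int)
        for gene, cell, _umi in reads:
            counts[(gene, cell)] += 1  # every read counts, N or not
        return dict(counts)
    groups = defaultdict(set)
    for gene, cell, umi in reads:
        if "N" not in umi:  # UMIs with any N are not counted
            groups[(gene, cell)].add(umi)
    if mode == "Exact":
        return {key: len(umis) for key, umis in groups.items()}
    if mode == "Cluster":
        return {key: count_components(umis) for key, umis in groups.items()}
    raise ValueError(f"unknown mode: {mode}")


reads = [(1, 1, "AAAA"), (1, 1, "AAAA"), (1, 1, "AAAT"),
         (1, 1, "AATT"), (1, 1, "GGGG"), (1, 1, "ANAA")]
for mode in ("None", "Exact", "Cluster"):
    print(mode, aggregate(reads, mode))
```

In this example AAAA, AAAT and AATT form one chain of single-mismatch neighbours and GGGG stands alone, so Cluster reports fewer molecules than Exact, which in turn reports fewer than the raw read count.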

Output files #

Four files are written per run:

| File | Description |
| --- | --- |
| {prefix}_{infix}scd.matrix.mtx(.gz) | MatrixMarket file |
| {prefix}_{infix}scd.matrix.mtx.stats.txt(.gz) | Statistics |
| {prefix}_{infix}scd.barcodes.txt(.gz) | Cell name / barcode lookup (line 0 = “unmatched”) |
| {prefix}_{infix}scd.features.txt(.gz) | Gene name lookup (line 0 = “unmatched”) |

Interaction with demultiplexing #

When a Demultiplex step precedes this step, a separate .mtx file is written for each barcode group. The lookup tables are shared (singleton) across all groups:

{prefix}_{infix}_scd_{sample_name}.mtx   # one per demultiplex group
{prefix}_{infix}_scd_{sample_name}.mtx.stats.txt   # one per demultiplex group
{prefix}_{infix}_scd.barcodes.txt
{prefix}_{infix}_scd.features.txt