Input section #

The input section is mandatory.

[input]
    bam = "filename.bam"  # What BAM file to quantify
    correct_reads_for_clipping = false # optional see below.
    max_skip_length = 150 # optional see section on correct_reads_for_clipping

# one of
[input.source]
    mode = 'bam_references'
    # uses 0..length, forward for each 'reference' in the BAM file.
    # (see the References filter if you want just some of them)

[input.source]
    # assume the BAM is already annotated with the regions we're supposed to count.
    mode = 'bam_tag'
    tag = 'XX' 

[input.source]
    mode = 'gtf'
    filename` = "something.gtf" # may be gz
    subformat = "GTF|GFF" 
    feature = "exon" # what feature are we counting
    id_attribute = "exon_id" # the attribute to use as the ID of the feature
    aggr_id_attribute = "gene_id" # optional. If set & different from id_attribute, we sum over this.
    duplication_handling = "Collapse|Remove" # what to do with regions that occur multiple times in the GTF
                                             # e.g. exons that occur in multiple transcripts.
                                             # Defaults to Collapse.

Bam files #

BAMs must be sorted and indexed in order to be quantified. You can use samtools sort and samtools index to do this.

Read positon correction #

The argument correct_reads_for_clipping controls whether we adjust read positions for (soft) clipping, like umi-tools does. Only relevant if you deduplicate PerPosition.

Does not influence the matching of reads to regions (genes).

max_skip_length controls the maximum length of a soft clip that you expect to see in your BAM. We can only ‘finish’ processing a block of reads that had BAM positions of x once we’re past x + max_skip_length. You therefore want to keep this lowish to lower memory usage.

Setting this too low (and enabling correct_reads_for_clipping) will result in a runtime failure, not silently wrong results.

Sources #

Bam_tag #

If you already have a bam tag that defines what region this read belongs to, you can use the bam_tag source.

GTF #

mbf-bam-quantifier can read (GTF and GFF)[https://www.ensembl.org/info/website/upload/gff.html] files (subformat is not not auto-detected at the moment). Trigger both with mode = 'GTF' and the appropriate subformat.

Duplicate region handling #

GTFs that contain transcripts have repeated regions, e.g. exons that occur in multiple transcripts.

By default, we ‘collapse’ them - identical exons (with identical exon_id) will be counted once, and will be summed into the final per gene count (per aggr_id_attribute, more precisely).

The other option is to ‘remove’, which will ignore any reads in these exons (And report them multiple times in the output, with 0 reads each if aggr_id_attribute is not set!). That’s what featureCounts does. (I’m not convinced this is actually a useful mode, but we want to be able to emulate featureCounts).