FilterDuplicates #
[[step]]
action = "TagDuplicates"
false_positive_rate = 0.00001 # the false positive rate of the filter, 0..1
seed = 59 # required!
source = "All" # Any input segment, 'All', 'tag:<tag-name>' or 'name:<segment>'
# split_character = "/" # required (and only accepted) when using name:<segment>
out_label = "dups"
# initial_filter_capacity = 10_000_000 # optional. Auto detected by default
[[step]]
action = "FilterByTag"
in_label = "dups"
keep_or_remove = "Remove" # Keep|Remove
Tags duplicates (second occurrence onwards) in the stream using a Cuckoo filter.
A Cuckoo filter is a probabilistic data structure, so it has a false positive rate and a tunable memory requirement.
It needs a seed for the random number generator, and a source
that tells it which values to consider for deduplication (the complete molecule is filtered, as with
all other filters of course). Sources can be a segment name, All (to combine every
segment in the molecule), another tag via tag:<tag-name>, or a read name prefix using
name:<segment>. The name: form requires split_character to define where to split the
read name, matching the semantics of readname_end_char elsewhere in the tool. When
referencing an existing tag, every tag value is converted into a binary representation
before entering the filter, allowing numeric, boolean, and sequence tags to participate.
Because these prefixes are reserved, the output label must not begin with name: or
tag:.
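For example, to deduplicate on a read name prefix, a minimal sketch could look like this (the segment name read1 and the "/" split character are illustrative, not prescribed):

[[step]]
action = "TagDuplicates"
false_positive_rate = 0.00001
seed = 59
source = "name:read1" # compare read names of the 'read1' segment
split_character = "/" # the read name prefix up to this character is used
out_label = "name_dups"

[[step]]
action = "FilterByTag"
in_label = "name_dups"
keep_or_remove = "Remove"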
The lower you set the false positive rate, the higher your memory requirements will be. 0.00001 might be a good place to start.
If you set the false positive rate to 0.0, a HashSet will be used instead, which will produce exact results, albeit at the expense of keeping a copy of all reads in memory!
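Where exact results matter more than memory, a minimal sketch of the exact mode (otherwise identical to the example above):

[[step]]
action = "TagDuplicates"
false_positive_rate = 0.0 # exact deduplication via a HashSet; keeps all seen values in memory
seed = 59 # required!
source = "All"
out_label = "dups"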
Please note our remarks about cuckoo filters.
If the source is a tag, missing values (e.g. non-matching regex results) will always be treated as unique. Only Location/String tags are supported for TagDuplicates.
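A sketch of deduplicating on an existing tag; the step that defines the umi tag is assumed to run earlier in the pipeline and is not shown here:

[[step]]
action = "TagDuplicates"
false_positive_rate = 0.00001
seed = 59
source = "tag:umi" # 'umi' is a hypothetical tag set by an earlier step
out_label = "umi_dups"

[[step]]
action = "FilterByTag"
in_label = "umi_dups"
keep_or_remove = "Remove"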
The initial_filter_capacity is typically auto detected from the input size,
estimated from the average read length, the total (compressed) file size,
and a compression dependent factor. If no file size is available (e.g. streams or pipes),
this will default to ~134 million reads.
Underestimation will lead to increased compute time. Overestimation will lead to increased memory usage (and a false positive rate better than the requested one). Our cuckoo filters work on power-of-two sized capacities, so there is some leeway. If the auto detected or default capacity does not match your data, you can set it manually.
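For piped input, a minimal sketch of setting the capacity by hand (the value below is purely illustrative; use your expected read count):

[[step]]
action = "TagDuplicates"
false_positive_rate = 0.00001
seed = 59
source = "All"
out_label = "dups"
initial_filter_capacity = 50_000_000 # roughly the number of reads you expect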
Interaction with demultiplex #
Duplicates are measured per demultiplexed stream.
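A hypothetical sketch of the step ordering only; the Demultiplex action name and its parameters are assumptions here, not taken from this page:

[[step]]
action = "Demultiplex"
# ... barcode configuration, see the demultiplex documentation ...

[[step]]
action = "TagDuplicates" # duplicates are counted separately within each demultiplexed stream
false_positive_rate = 0.00001
seed = 59
source = "All"
out_label = "dups"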