Boolean Tag Generation on mbf-fastq-processor documentation

Mon, 01 Jan 0001 00:00:00 +0000

TagOtherFileByName #

Marks reads based on whether their names are present in another file.

[[step]]
 action = "TagOtherFileByName"
 segment = "read1" # the segment whose read names are compared
 out_label = "present_in_other"
 filename = "names.fastq" # Can read fastq (also compressed), or SAM/BAM, or fasta files
 false_positive_rate = 0.01 # false positive rate (0..1)
 seed = 42 # seed for randomness
 ignore_unaligned = false # in case of BAM/SAM, whether to ignore unaligned reads. Mapped reads are always considered
 fastq_readname_end_char = " " # (optional) char (byte value) at which to cut input fastq read names before comparing. If not set, no cutting is done.
 reference_readname_end_char = "/" # (optional) char (byte value) at which to cut reference read names before storing them.

This step marks reads by comparing their names against names from another file.
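As a hedged sketch of how the resulting tag might be used: the fragment below tags reads whose names occur in a reference file and then drops them with FilterByTag, the step that is paired with TagDuplicates further down this page. The file name `unwanted_reads.bam` is a placeholder, and it is an assumption (not confirmed in this section) that FilterByTag consumes this tag the same way it consumes the duplicates tag.

```toml
[[step]]
action = "TagOtherFileByName"
segment = "read1"
out_label = "present_in_other"
filename = "unwanted_reads.bam" # placeholder file name, not from the original docs
false_positive_rate = 0.01
seed = 42
ignore_unaligned = true # skip unaligned BAM records; mapped reads are always considered
fastq_readname_end_char = " " # compare input read names only up to the first space

[[step]]
action = "FilterByTag"
in_label = "present_in_other" # must match out_label above
keep_or_remove = "Remove" # drop reads whose name was found in the other file
```

Setting `keep_or_remove = "Keep"` instead would presumably retain only the reads that were found.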

TagOtherFileBySequence #

Marks reads based on whether their sequences are present in another file.

[[step]]
 action = "TagOtherFileBySequence"
 out_label = "present_in_other_file"
 filename = "names.fastq" # Can read fastq (also compressed), or SAM/BAM, or fasta files
 segment = "read1" # Any of your input segments
 false_positive_rate = 0.01 # false positive rate (0..1)
 seed = 42 # seed for randomness
 ignore_unaligned = false # in case of BAM/SAM, whether to ignore unaligned reads. Mapped reads are always considered

This step annotates reads by comparing their sequences against sequences from another file.
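A minimal sketch of the opposite use, assuming the tag can be fed to FilterByTag with `keep_or_remove = "Keep"`: only reads whose sequence occurs in a reference file are retained. The file name `expected_sequences.fasta` is a placeholder chosen for illustration.

```toml
[[step]]
action = "TagOtherFileBySequence"
out_label = "present_in_other_file"
filename = "expected_sequences.fasta" # placeholder file name, not from the original docs
segment = "read1"
false_positive_rate = 0.01 # a small share of absent sequences may still be tagged as present
seed = 42
ignore_unaligned = false # only meaningful for BAM/SAM input

[[step]]
action = "FilterByTag"
in_label = "present_in_other_file" # must match out_label above
keep_or_remove = "Keep" # retain only reads tagged as present
```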

FilterDuplicates #

[[step]]
 action = "TagDuplicates"
 false_positive_rate = 0.00001 # false positive rate of the filter (0..1)
 seed = 59 # required!
 source = "All" # Any input segment, 'All', 'tag:<tag-name>' or 'name:<segment>'
 # split_character = "/" # required (and only accepted) when using name:<segment>
 out_label = "dups"

[[step]]
 action = "FilterByTag"
 in_label = "dups"
 keep_or_remove = "Remove" # Keep|Remove

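Tags duplicates (the second occurrence onwards) in the stream using a Cuckoo filter; the FilterByTag step then removes the tagged reads.

A Cuckoo filter is a probabilistic data structure, so it has a false positive rate and a tunable memory requirement. A false positive here means a unique read is wrongly tagged as a duplicate.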
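The configuration comments mention a `name:<segment>` source that requires `split_character`. A hedged sketch of that variant follows. This section does not spell out how `split_character` is applied; the reading that the read name is cut at that character before comparison is an assumption.

```toml
[[step]]
action = "TagDuplicates"
false_positive_rate = 0.00001
seed = 59 # required
source = "name:read1" # deduplicate on the read name of segment read1
split_character = "/" # required with name:<segment>; assumed to cut the name at "/"
out_label = "dups"

[[step]]
action = "FilterByTag"
in_label = "dups"
keep_or_remove = "Remove" # drop everything tagged as a duplicate
```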