The Nucleotide Sequence algorithm for removing duplicate reads
If Remove duplicate reads is selected, then by default, the Sequence Operation Tool uses a Nucleotide Sequence algorithm that assigns a numerical value to every base in a read, where A = 0, C = 1, G = 2, and T = 3. A hash value is then calculated for every read according to the following formula:
sum(Base’s code*(4^Base’s position))
where the first base is at position 0. For example, for the sequence ATTC, the hash value is calculated as:
0*(4^0) + 3*(4^1) + 3*(4^2) + 1*(4^3) = (0*1) +(3*4) + (3*16) + (1*64) = 124
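The formula above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the tool's actual code; the names `BASE_CODES` and `sequence_hash` are hypothetical, and the handling of bases other than A, C, G, and T (for example, N) is not described here, so the sketch simply raises an error for them.

```python
# Base codes as given in the text: A = 0, C = 1, G = 2, T = 3.
BASE_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}


def sequence_hash(read: str) -> int:
    """Return sum(code * 4**position), with the first base at position 0.

    Assumption: any base outside A/C/G/T raises KeyError, because the
    text does not say how such bases are treated.
    """
    return sum(BASE_CODES[base] * 4 ** pos for pos, base in enumerate(read.upper()))


print(sequence_hash("ATTC"))  # 0*1 + 3*4 + 3*16 + 1*64 = 124
```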
If multiple reads have the same hash value, indicating identical sequences and identical sequence length, then a single copy of this sequence is kept. For paired reads, if there are multiple pairs where both forward reads have the same hash value, and both reverse reads have the same hash value, indicating identical sequences and identical sequence lengths, then only one pair of the reads is kept. For example, if Read 1F = Read 2F and Read 1R = Read 2R, then only one pair of reads is kept; however, if Read 1F = Read 2F, but Read 1R ≠ Read 2R, then both pairs of reads are kept.
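The pair rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's implementation: `read_key` and `dedupe_pairs` are hypothetical names, and the read length is stored alongside the hash as an assumption of this sketch, because the formula taken literally gives a trailing A a contribution of 0 (so AT and ATA would otherwise share a hash value).

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}


def read_key(seq):
    """Return (length, hash) for one read.

    The hash follows the formula in the text. Including the length is an
    assumption of this sketch so that equal keys imply equal length.
    """
    return (len(seq), sum(CODES[b] * 4 ** i for i, b in enumerate(seq)))


def dedupe_pairs(pairs):
    """Keep the first pair for each (forward key, reverse key) combination."""
    seen = set()
    unique, duplicates = [], []
    for fwd, rev in pairs:
        key = (read_key(fwd), read_key(rev))
        if key in seen:
            duplicates.append((fwd, rev))  # both mates match an earlier pair
        else:
            seen.add(key)
            unique.append((fwd, rev))
    return unique, duplicates


pairs = [
    ("ATTC", "GGCA"),  # pair 1
    ("ATTC", "GGCA"),  # same forward and same reverse as pair 1: duplicate
    ("ATTC", "GGCT"),  # same forward, different reverse: kept
]
unique, duplicates = dedupe_pairs(pairs)
```

With this input, `unique` holds the first and third pairs and `duplicates` holds the second, which matches the Read 1F/Read 1R example above.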
1. In the Input pane, click Add to browse to and select the .fasta or .fastq files for which the duplicate reads are to be removed.
2. Optionally, specify the settings for removing the duplicate reads.
Setting | Description |
---|---|
Check 5’ end only for paired reads | If this option is selected, then only the first 32 base pairs at the 5’ end of both paired reads must be identical to be considered duplicates. |
Check After 1st Homopolymer | Available only if Check 5’ end only for paired reads is selected. Select this option to check for duplicate reads based on the first 32 base pairs after the first homopolymer sequence. |
Merge Overlap PE | Checks pairs for overlap. If the pairs overlap, then each end is extended with the sequence of its pair. Note: This is a normalization effort designed to increase coverage in low-expressed regions. (High-coverage regions see less of an effect from this option because these regions already have adequate coverage.) This option is particularly useful for RNA-Seq users. |
3. In the Output field, you can leave the default value for the location of the output files as is (the default value is the directory path for the input file), or you can click Set to select a different location.
4. Optionally, before you process the files, click Save to save the settings that you have specified to a Settings file (.ini file).
You can always load this file at a later date and process other data files according to the saved settings in the file.
5. Click OK.
A message opens when the process is completed.
Two data output files are created: _Duplicate.fasta, which contains the duplicate reads that were discarded from the analysis, and _Unique.fasta, which contains a single copy of each duplicated read as well as all reads that were not duplicated. A log file, RemoveDuplicates_Log.txt, is also created. The log file contains information about the input file, the reads (number of total reads, number of unique reads, and number of duplicate reads), and the distribution of the reads and their counts.
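For reference, the Check 5’ end only for paired reads setting described in the table above can be pictured as comparing only a 32-base prefix of each mate. The sketch below is a hedged illustration of that idea; `five_prime_key` and `is_duplicate_pair` are hypothetical names, and the treatment of reads shorter than 32 bases (here, the whole read is used) is an assumption.

```python
PREFIX_LENGTH = 32  # number of 5' bases compared, as stated in the settings table


def five_prime_key(forward, reverse, n=PREFIX_LENGTH):
    """Return the first n bases of each mate as the comparison key.

    Assumption: a read shorter than n bases contributes its whole sequence.
    """
    return (forward[:n], reverse[:n])


def is_duplicate_pair(pair_a, pair_b):
    """Two pairs are duplicates if both 5' prefixes match."""
    return five_prime_key(*pair_a) == five_prime_key(*pair_b)


shared_fwd = "ACGT" * 8  # 32 bases
shared_rev = "TGCA" * 8  # 32 bases
pair_a = (shared_fwd + "AAAA", shared_rev + "CC")
pair_b = (shared_fwd + "GGGG", shared_rev + "TT")
print(is_duplicate_pair(pair_a, pair_b))  # True: the pairs differ only after base 32
```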