The UMI algorithm for removing duplicate reads
Unique molecular identifiers (UMIs) are random base sequences that are used to tag each molecule (fragment) of DNA before library amplification, which helps identify PCR duplicates. Illumina instruments generate an I2 Index File for paired-end runs, and this file can store the UMIs. NextGENe can read an Illumina I2 Index File that contains UMIs. After reading the file, NextGENe identifies all the reads that share the same UMI and retains only the read that has the highest total quality score for continued processing. All other reads that share that UMI are classified as PCR duplicates and discarded.
To use the UMI algorithm to remove duplicate reads, carry out the following steps before any Format Conversion preprocessing steps:
1. Select Process UMIs based on I2 Index File.
2. In the Input pane, click Add to browse to and select the R1 and R2 .fastq files from which the duplicate reads are to be removed.
3. In the Input pane, click Add to browse to and select the required I2 file, and optionally, an I1 file.
4. Optionally, specify the settings for removing the duplicate reads.
Setting | Description |
---|---|
Check 5’ End [ ] (bps) for similarity | If this option is selected, only the indicated number of base pairs at the 5’ end of the paired reads is checked for duplication. The default value is 15 bps. |
Allow for [ ] mismatch | The maximum number of mismatched bps allowed within the checked 5’ region for the paired reads to still be called duplicate reads. The default value is one. |
5. In the Output field, you can leave the default value for the location of the output files as is (the default value is the directory path for the input file), or you can click Set to select a different location.
6. Optionally, before you process the files, click Save to save the settings that you have specified to a Settings file (.ini file).
You can always load this file at a later date and process other data files according to the saved settings in the file.
7. Click OK.
A message opens when the process is completed.
Two data output files are created: _Removed.fastq, which contains the reads with duplicate UMIs that were discarded from analysis, and _Processed.fastq, which contains a single read for each duplicated UMI as well as all reads whose UMIs were not duplicated. A log file, RemoveDuplicates_Log.txt, is also created. The log file contains information about the input file, the reads (number of total reads, number of unique reads, and number of duplicate reads), and the distribution of the reads and their counts.
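The selection rule described in this section (group reads by UMI and keep the read with the highest total quality score) can be illustrated with a short Python sketch. This is not NextGENe's implementation. The function names, the in-memory tuple layout, the Phred+33 quality encoding, and the rule that the first read seen wins a tie are all assumptions made for illustration. The helper five_prime_similar only shows what a 5’ end comparison with a mismatch allowance could look like. How NextGENe combines that check with the UMI grouping is not specified here.

```python
def total_quality(qual, offset=33):
    """Sum the Phred scores of a FASTQ quality string (Phred+33 assumed)."""
    return sum(ord(ch) - offset for ch in qual)


def dedupe_by_umi(reads):
    """Split reads into (kept, removed) lists.

    reads: iterable of (umi, sequence, quality) tuples (hypothetical layout).
    For each UMI, the read with the highest total quality is kept;
    on a tie, the first read seen is kept (an assumption).
    """
    best = {}      # UMI -> best read seen so far
    removed = []   # reads classified as duplicates
    for read in reads:
        umi, _seq, qual = read
        current = best.get(umi)
        if current is None:
            best[umi] = read
        elif total_quality(qual) > total_quality(current[2]):
            removed.append(current)
            best[umi] = read
        else:
            removed.append(read)
    return list(best.values()), removed


def five_prime_similar(seq_a, seq_b, length=15, max_mismatch=1):
    """Compare the first `length` bases of two reads, allowing mismatches.

    The defaults mirror the documented defaults (15 bps, one mismatch).
    """
    a, b = seq_a[:length], seq_b[:length]
    if len(a) != len(b):
        return False
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches <= max_mismatch


if __name__ == "__main__":
    example = [
        ("ACGT", "TTGACCA", "IIIIIII"),  # highest quality for UMI ACGT
        ("ACGT", "TTGACCA", "IIII###"),  # lower quality, same UMI
        ("GGCA", "CATTGAC", "FFFFFFF"),  # UMI seen only once
    ]
    kept, removed = dedupe_by_umi(example)
    print("total:", len(example), "unique:", len(kept), "duplicate:", len(removed))
```

In this sketch, the kept list corresponds to the contents of _Processed.fastq and the removed list to the contents of _Removed.fastq. The three counts printed at the end correspond to the total, unique, and duplicate read counts reported in the log file.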