Illumina, SOLiD System and Ion Torrent data

If you are analyzing Illumina data, SOLiD System data, or Ion Torrent data, then all three condensation methods (Consolidation, Elongation, and Error Correction) are available, and all three use the same general approach for clustering similar reads and generating a consensus sequence. Reads are evaluated for common indices, or anchor sequences, that can be found in multiple sequencing reads. All sequence reads that contain an identical 12 bp anchor sequence form a group. Because this sequence might not be unique within the genome, each group is organized into separate subgroups based on the anchor’s flanking shoulder sequences, which are the bases immediately adjacent to the left and right of the anchor sequence. Reads that contain, at a minimum, both shoulder sequences are called bridge reads. Bridge reads can also extend past, or “bridge,” both shoulder sequences. A minimum number of bridge reads is required to form a subgroup. By evaluating the shoulder sequences on either side of the anchor sequence, a single group can be divided into multiple subgroups that share an identical anchor sequence but have different shoulder sequences. Multiple subgroups might exist for the same 12 bp anchor sequence either because of a variant or polymorphism within a shoulder sequence or because the anchor sequence occurs more than once in different regions of the genome.
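The grouping described above can be illustrated with a minimal Python sketch. It is not NextGENe's implementation. The function name subgroup_reads, the shoulder length of 4 bases, and the minimum of 2 bridge reads are hypothetical values chosen for illustration, because this section does not state the actual settings. The sketch also takes the anchor sequence as given and looks only at its first occurrence in each read.

```python
from collections import defaultdict


def subgroup_reads(reads, anchor, shoulder_len=4, min_bridge_reads=2):
    """Split the reads that share `anchor` into subgroups by shoulder sequences.

    shoulder_len and min_bridge_reads are illustrative assumptions,
    not documented NextGENe defaults.
    """
    subgroups = defaultdict(list)
    for read in reads:
        pos = read.find(anchor)  # first occurrence only, for simplicity
        if pos == -1:
            continue  # read does not belong to this anchor's group
        left_start = pos - shoulder_len
        right_end = pos + len(anchor) + shoulder_len
        if left_start < 0 or right_end > len(read):
            continue  # read does not span both shoulders: not a bridge read
        left = read[left_start:pos]
        right = read[pos + len(anchor):right_end]
        subgroups[(left, right)].append(read)
    # Keep only subgroups supported by enough bridge reads.
    return {
        shoulders: members
        for shoulders, members in subgroups.items()
        if len(members) >= min_bridge_reads
    }


anchor = "GATTACAGGCTA"  # example 12 bp anchor
reads = [
    "TTGA" + anchor + "CCAT",
    "GGTTGA" + anchor + "CCATAA",  # extends past both shoulders
    "TTGA" + anchor + "CGAT",      # variant in the right shoulder
    "ATTGA" + anchor + "CGATG",
    "GA" + anchor + "CCAT",        # left shoulder too short: not a bridge read
    "TTGA" + anchor + "TTTT",      # only one read has these shoulders
]
result = subgroup_reads(reads, anchor)
for shoulders, members in result.items():
    print(shoulders, len(members))
```

In this example the six reads share one anchor, but only two subgroups survive: one with the right shoulder CCAT and one with the right shoulder CGAT, each supported by two bridge reads.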

Each subgroup can be used to generate a consensus sequence. For Illumina data, SOLiD System data, and Ion Torrent data, it is assumed that bases at the 5’ end of each read have quality scores higher than Phred 20 and that the remainder of the read is of lower quality. As a result, base calls at the 5’ end of a read are given a higher weight. The consensus base calls are calculated by scoring each nucleotide that is seen at a given position according to the following rules:

• 5’ sequences are assigned a higher weight than 3’ sequences.

• Each 5’ read with a given nucleotide is assigned a score of 7.

• Each 3’ read with the same given nucleotide is assigned a score of 2.

• Scores for all the reads with the same nucleotide are summed to provide the score for the nucleotide.

Score for Nucleotide “x” = (7 x No. of 5’ reads) + (2 x No. of 3’ reads)

For example, consider a position within a subgroup at which some reads show a “T” and other reads show a “C.” The “T” nucleotide is seen in the 5’ end of two reads and in the 3’ end of six reads. The “C” nucleotide is seen in the 5’ end of four reads and in the 3’ end of two reads. To determine the consensus base call, scores are calculated for both the “T” and “C” nucleotides as follows:

• Score for the “T” nucleotide = (7 x 2) + (2 x 6) = 26

• Score for the “C” nucleotide = (7 x 4) + (2 x 2) = 32

Because the score for the “C” nucleotide is greater than the score for the “T” nucleotide, the consensus sequence includes a “C” nucleotide at this position.
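The scoring rules and the worked example above can be reproduced with a short Python sketch. The function names nucleotide_score and consensus_base are hypothetical, and the sketch takes the 5’ and 3’ read counts as inputs, because this section does not describe how a base is classified as belonging to the 5’ or 3’ end of a read. Tie handling is also not described here. In this sketch a tie goes to the first nucleotide listed.

```python
WEIGHT_5_PRIME = 7  # score per read showing the nucleotide in its 5' end
WEIGHT_3_PRIME = 2  # score per read showing the nucleotide in its 3' end


def nucleotide_score(n_5prime_reads, n_3prime_reads):
    """Score = (7 x No. of 5' reads) + (2 x No. of 3' reads)."""
    return WEIGHT_5_PRIME * n_5prime_reads + WEIGHT_3_PRIME * n_3prime_reads


def consensus_base(counts):
    """Return the highest-scoring nucleotide and all scores.

    counts maps each nucleotide to a tuple:
    (No. of 5' reads, No. of 3' reads) at one position.
    """
    scores = {base: nucleotide_score(*pair) for base, pair in counts.items()}
    # max() returns the first key with the top score, so ties favor
    # the nucleotide listed first (an assumption of this sketch).
    best = max(scores, key=scores.get)
    return best, scores


# Worked example: "T" in 2 5' reads and 6 3' reads; "C" in 4 and 2.
base, scores = consensus_base({"T": (2, 6), "C": (4, 2)})
print(scores)  # {'T': 26, 'C': 32}
print(base)    # C
```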