Sequence Condensation Tool - Advanced Settings for Illumina Data, SOLiD System Data, or Ion Torrent Data

For the Illumina, SOLiD System, and Ion Torrent instrument types, the available settings are the same, and the default values for the advanced settings are populated based on the Read Lengths and Expected Depth of Coverage values that were set in Sequence Condensation Tool - General Settings. You can leave these settings as is, or you can modify the settings. At any time, you can click Default Settings to automatically reset all the values to SoftGenetics’s default values.

• Number of Cycles – The default value is 1. After one cycle, many of the instrument’s base call errors are corrected, which is ideal for applications such as SNP/Indel discovery. Additional cycles help to remove some of the systematic instrument errors and low frequency variations. Additional cycles also further elongate the reads while correcting some of the discrepant variations between the reads. Four cycles of condensation can lengthen many reads from 35 bp to more than 150 bp, which is ideal for applications such as de novo assembly or the discovery of large indels.

If more than one condensation cycle is used, you can specify the values for the advanced settings for each cycle independently.

• Memory Ratio – Available only for 32-bit operating systems. Because of memory constraints, the Condensation Tool partitions large sample datasets as needed and processes each partition separately. When the Memory Ratio is set to 1.00, the software loads a preset number of sequence reads. If you increase the Memory Ratio, more reads are loaded into memory, but this might limit the computer resources that remain available and prevent you from using your computer for other functions.

• View Condensation Results – Select this option to view the condensation results in the Condensation Results tool when Consolidation is the selected method. See The NextGENe Condensation Results Tool.

• Minimum Read Length for Condensation – Excludes sequence reads that are shorter than the specified value from the condensation. The minimum value allowed is 14 bp.

• Range in Read to Index [x] Bases to Length minus [y] Bases – Ignores the lower quality bases at the ends of reads during indexing. These bases are still used for the condensation, but they are not included in anchor sequences. For example, if x=1 and y=3, all bases from the first base through the fourth base from the end are used for indexing, and the last three bases are excluded. To allow indexing of all bases, set x=1 and y=0.
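
The range can be pictured as a simple slice of each read. The following Python sketch is illustrative only: the function name indexable_span is hypothetical, and it assumes that x is a 1-based start position and that the last y bases are skipped. It is not NextGENe's own code.

```python
def indexable_span(read: str, x: int, y: int) -> str:
    """Return the part of a read that is eligible for anchor indexing.

    Assumption: x is the 1-based position of the first indexed base and
    the last y bases of the read are excluded from indexing.
    """
    if x < 1 or y < 0:
        raise ValueError("x must be >= 1 and y must be >= 0")
    return read[x - 1 : len(read) - y]


read = "ACGTACGTAC"
print(indexable_span(read, 1, 3))  # ACGTACG
print(indexable_span(read, 1, 0))  # ACGTACGTAC
```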

• Auto Indexing Based on Expected Coverage = [x] – Recommended only for high coverage datasets (average coverage > 500). Set “x” equal to the expected average coverage. This provides an alternative to individually specifying values for each of the next four coverage settings. The Condensation Tool can then use the expected average coverage to calculate appropriate coverage requirements.

The minimum allowable value for this setting is 500. With an expected coverage of less than 500x, auto-indexing is less accurate and is not recommended.

• Reads Required for Each Group in One Direction [x] to [y] – Prevents the indexing of fragments that might have errors, repeats, and redundancies. The number of reads with a given anchor sequence in the same direction (either forward or reverse) must be within this range. An anchor sequence is added to the index table and used to form a group when the exact anchor sequence is found in a number of reads that have the same direction, and that number is greater than or equal to the lower limit and less than or equal to the upper limit.

For example, consider a case where the lower and upper indexing limits are set to 10 and 6000, respectively. In this case, the 12 base pair anchor sequence of ACCAGAAGTTTA is added to the index table only if it is found in at least 10 forward reads or at least 10 reverse reads, but in no more than 6000 sequence reads in the same direction. If this index is found in fewer than 10 reverse reads and fewer than 10 forward reads, then it is considered noise and is not needed in the index table. If the sequence is found in more than 6000 reads in the same direction, then it is a fragment that is difficult to assemble (often because of a repeat) and it also is not added to the index table.
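
The rule in this example can be sketched in Python as follows. The function name and the exact handling of the limits are assumptions made for illustration: the sketch accepts an anchor when at least one direction reaches the lower limit and neither direction exceeds the upper limit, which is one reading of the description above.

```python
def passes_one_direction(forward: int, reverse: int, lower: int, upper: int) -> bool:
    """Decide whether an anchor sequence qualifies for the index table.

    Assumed reading of the rule: reject the anchor if either direction
    exceeds the upper limit (likely a repeat); otherwise accept it if at
    least one direction reaches the lower limit (not noise).
    """
    if forward > upper or reverse > upper:
        return False
    return forward >= lower or reverse >= lower


print(passes_one_direction(12, 3, 10, 6000))    # True
print(passes_one_direction(4, 5, 10, 6000))     # False (noise)
print(passes_one_direction(7000, 50, 10, 6000)) # False (likely repeat)
```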

• Reads Required for Each Group in Each Direction [x] to [y] – Specifies the number of reads that are required to match an anchor sequence in both directions for it to be included in the index table. The number of forward reads and the number of reverse reads that match the anchor sequence must be within this range. For data that is either completely one-directional or primarily one-directional, set this value equal to -1.

• Bridge Reads Required for Each Subgroup: [x] and [y%] – “x” indicates the minimum count of bridge reads required to form a subgroup. “y” indicates the minimum percentage of reads within the subgroup that must be bridge reads. For data that is either completely one-directional or primarily one-directional, set both of these values equal to -1.

For example, consider this setting with values of 2 and 1%. For the ACCAGAAGTTTA index, 1000 reads contain this anchor sequence. Of these 1000 reads, a total of 150 reads match at least one of the shoulder sequences. Twenty reads out of these 150 reads contain the same eight nucleotides of CGGATTCC to the left of the index and the same eight nucleotides of TGCCATGC to the right of the index. These shoulder sequences are therefore used to form a subgroup with these 150 reads because at least two reads (20 in this example) and at least 1% (13% in this example) of the reads are bridge reads.
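
As a rough illustration of this check, the following Python sketch tests both thresholds for the numbers in the example. The function and parameter names are hypothetical, and the sketch assumes that both thresholds are inclusive minimums.

```python
def bridge_reads_ok(bridge_reads: int, subgroup_reads: int,
                    min_count: int, min_percent: float) -> bool:
    """Check the bridge read count and percentage for one subgroup.

    Assumption: both thresholds are inclusive minimums. With both set
    to -1, any non-empty subgroup passes.
    """
    if subgroup_reads <= 0:
        return False
    percent = 100.0 * bridge_reads / subgroup_reads
    return bridge_reads >= min_count and percent >= min_percent


# 20 bridge reads among the 150 reads that share a shoulder sequence
print(bridge_reads_ok(20, 150, 2, 1.0))  # True
print(bridge_reads_ok(1, 150, 2, 1.0))   # False
```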

• Total Reads Required for Each Subgroup: [x] and [y%] – The number of reads that have an identical anchor sequence and that contain similar shoulder sequences must be within the specified range to form a subgroup.

• Recover Best SubGroup for Repeated Indices – Only the first instance (from the 5’ end) of the repeat is indexed and only the unique shoulder sequence is used for repeat indices.

• Forward and Reverse Balance – Sequencing artifacts can produce significant imbalances between the number of reads in each direction. If selected, false positives due to PCR bias or other directional bias are reduced. Indices are checked for the number of forward reads and the number of reverse reads that match the anchor sequence. Indices are excluded from the index table if the ratio of the number of reads in either direction to the number of reads in the other direction is below a set threshold. Clear this option for data that is either completely one-directional or primarily one-directional.

For example, if an index contains 100 forward reads and 10 reverse reads, then the ratio of reverse reads to forward reads is 0.1. If this option is set to a value of 0.2, then this index is removed from the index table and no condensed read is produced for the index.
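
The balance test in this example amounts to comparing the smaller read count with the larger one. A minimal Python sketch, using a hypothetical function name and assuming that an index is kept when the ratio equals the threshold exactly:

```python
def is_balanced(forward: int, reverse: int, threshold: float) -> bool:
    """Return True if the index passes the forward/reverse balance check.

    The ratio of the smaller count to the larger count is compared with
    the threshold; an index with no reads at all is treated as failing.
    """
    larger = max(forward, reverse)
    if larger == 0:
        return False
    return min(forward, reverse) / larger >= threshold


print(is_balanced(100, 10, 0.2))  # False: ratio 0.1 is below 0.2
print(is_balanced(100, 40, 0.2))  # True: ratio 0.4
```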

• Remove Indices with PCR bias: Min. Ratio = [x] Min. Coverage = [y] – Amplification bias is sequence dependent, which results in some anchor sequences containing a large number of sequence reads in disproportionate levels. If selected, reads that meet or exceed the specified threshold settings are not used for indexing.

• Fixed Shoulder Length Sequence = [x] bases – Evaluates shoulder sequences of a set length. All reads within a single group contain the identical 12 base pair index, but reads within the group can vary within the shoulder sequences. Reads that are used to create a consensus sequence must contain an identical (2 × “x” + 12) bp sequence. For example, if this value is set to 8, then the reads used for creating a consensus sequence must contain an identical 28 base anchor: 8 bases to the left of the index, the 12 base index, and 8 bases to the right of the index.
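
To make the 28 base window in the example concrete (8 + 12 + 8), the following Python sketch extracts the window that would have to be identical across reads. The function name, the 0-based index_start argument, and the sample read are assumptions made for illustration.

```python
from typing import Optional

INDEX_LENGTH = 12  # length of the index (anchor) sequence, per the text


def shoulder_window(read: str, index_start: int, shoulder: int) -> Optional[str]:
    """Return left shoulder + index + right shoulder, or None if the
    read is too short to contain both shoulders.

    index_start is the 0-based position of the first index base.
    """
    start = index_start - shoulder
    end = index_start + INDEX_LENGTH + shoulder
    if start < 0 or end > len(read):
        return None
    return read[start:end]


# Hypothetical read: 2 extra bases, left shoulder, index, right shoulder, 2 extra bases
read = "TT" + "CGGATTCC" + "ACCAGAAGTTTA" + "TGCCATGC" + "GG"
window = shoulder_window(read, index_start=10, shoulder=8)
print(window)       # CGGATTCCACCAGAAGTTTATGCCATGC
print(len(window))  # 28
```

Reads whose windows are identical strings would then be candidates for the same consensus.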

• Fixed, then Extended Shoulder Length = [x] Bases and Score <= [y] – This option is useful for assembling condensed reads that have been run through at least one condensation cycle. The fixed shoulder length is checked first, and then the reads are rescanned with some variation being tolerated. If the shoulder bases are the same, then all corresponding bases between the reads are checked. A score is calculated to determine the amount of variation among the reads. A one base difference yields a score of 1 for the position if it is not at the end of a read. The score for a difference in the 1st and last 3 bases is 1/2. The score must not exceed the set threshold for the read to be used in the subgroup. If the score is set to 1.01 (the default value), then the tool condenses reads that contain up to two differences at the ends or just one difference in the middle bases.
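
The scoring can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's implementation: the function name is hypothetical, the read is compared against a reference sequence of the same length, and the "ends" are taken to be the first base and the last three bases (the head and tail parameters make that assumption adjustable).

```python
def variation_score(read: str, reference: str, head: int = 1, tail: int = 3) -> float:
    """Score the differences between a read and a reference sequence.

    Assumption: a mismatch in the first `head` bases or the last `tail`
    bases scores 0.5; any other mismatch scores 1.0.
    """
    if len(read) != len(reference):
        raise ValueError("sequences must have the same length")
    n = len(read)
    score = 0.0
    for i, (a, b) in enumerate(zip(read, reference)):
        if a != b:
            at_end = i < head or i >= n - tail
            score += 0.5 if at_end else 1.0
    return score


reference = "ACGTACGTACGT"
threshold = 1.01
print(variation_score("TCGTACGTACGA", reference) <= threshold)  # True: two end differences
print(variation_score("ACGTAGGTACGT", reference) <= threshold)  # True: one middle difference
print(variation_score("TCGTAGGTACGT", reference) <= threshold)  # False: one end plus one middle
```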

• Flexible Sequence Length = [x], [y], [z] – Sets less stringent criteria for shoulder sequence length. Specify the values from largest to smallest, for example, “10, 8, 6.” Given these settings, the Condensation Tool initially attempts to find sequences with 10 bp matching shoulder sequences; however, it also looks for sequences that have 8 bp matching shoulder sequences and then finally, 6 bp matching shoulder sequences.

• Homopolymer Index Checking – Reduces the size of the index table that is generated for condensation. Instead of indexing every 12 bp anchor sequence, only 12 bp sequences that occur before and after homopolymers of three or more bases are used. The regions that are adjacent to homopolymers are also used for shoulder sequences instead of the regions that are directly adjacent to the anchor sequence.

• Start Index at [x] (2 or 3) Homopolymers or [ ] AT, GC, ATT . . . Complements – Evaluates anchor sequences starting at positions where a homopolymer of two or three bases (as determined by the value set for [x]) is found. Anchor sequences begin at the second base of the homopolymer. For instance, where a sequence of “AACTGTC…” occurs, the anchor sequence begins as “ACTGTC…” To provide a sufficient number of anchor sequences, combinations of “GC,” “CG,” “AT,” and “TA” are also used to indicate the start of an anchor sequence. With both of these options selected, the condensation speed is increased by using an average of 1/2 as many anchor sequences. To index only homopolymers, clear the “AT, GC, ATT . . . Complements” option. With only the Start Index option selected, the condensation speed is increased by using an average of 1/4 as many anchor sequences.

• Use Only 5’ Bases for Consensus – Uses only the 5’ bases of reads to determine the consensus base at each position.

Elongation starts from the center of the anchor and works outward.

• Remove Low Quality Ends when Score <= [x] – Assigns a quality score to each base of each read relative to the number of variations within the group of reads being condensed. For the bases on both ends of a given condensed read (bases outside of the anchor and shoulder sequences), if the score is less than the defined score, the end is regarded as low quality and is trimmed from the read starting from the low quality base.

Quality scores for each base are calculated by comparing the number of reads that match the consensus sequence to the number of reads that differ from the consensus at the given position. Reads that are aligned to the position on the 5’ end from the shoulder sequence are given a higher weight than reads that align on the 3’ end from the shoulder sequence. A score of seven is assigned to each read that aligns at the position on the 5’ end. A score of two is assigned to each read that aligns at the position on the 3’ end. The value is considered positive for all reads that match the consensus base and negative for all reads that differ from the consensus base. Additionally, for base calls that differ from the consensus, the score is multiplied by a penalty value of 1.7, so the calculation for differing reads is one of the following:

• Number of reads with differing base calls x 7 x 1.7

• Number of reads with differing base calls x 2 x 1.7

For example, consider a position where nine total reads are aligned. Three reads are aligned at the 5’ end with a base call of “C,” four reads are aligned at the 3’ end with a base call of “A,” and two reads are aligned at the 3’ end with a matching base call of “C.” The score is calculated as: (3 x 7) + (2 x 2) – (4 x 2 x 1.7) = 21 + 4 – 13.6 = 11.4, where:

• (3 x 7) represents the number of matching 5’ reads times the score of 7.

• (2 x 2) represents the number of matching 3’ reads times the score of 2.

• (4 x 2 x 1.7) represents the number of differing 3’ reads times the score of 2 times the penalty of 1.7.
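
The weighting described above can be written as a small Python function. The names are hypothetical; the weights of 7 and 2 and the penalty of 1.7 come from the text. The last line evaluates the expression (3 x 7) + (2 x 2) – (4 x 2 x 1.7) from the example.

```python
FIVE_PRIME_WEIGHT = 7    # score per read aligned on the 5' end
THREE_PRIME_WEIGHT = 2   # score per read aligned on the 3' end
MISMATCH_PENALTY = 1.7   # multiplier for reads that differ from the consensus


def position_score(match_5p: int, match_3p: int,
                   differ_5p: int, differ_3p: int) -> float:
    """Quality score for one position of a condensed read."""
    positive = match_5p * FIVE_PRIME_WEIGHT + match_3p * THREE_PRIME_WEIGHT
    negative = (differ_5p * FIVE_PRIME_WEIGHT
                + differ_3p * THREE_PRIME_WEIGHT) * MISMATCH_PENALTY
    return positive - negative


# Three matching 5' reads, two matching 3' reads, four differing 3' reads
print(position_score(match_5p=3, match_3p=2, differ_5p=0, differ_3p=4))
```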

This setting can be very useful when using condensation to prepare reads for assembly by removing low quality calls at the ends of reads. It is also useful for low coverage regions. When the minimum coverage of the data is around three or four reads, specify a value of two or three. For a value of three, at least two reads are required to have the same base call at the 3’ end. For higher coverage data, specify a larger value. For example, if the minimum coverage is about 10 reads, and the average coverage is approximately 50 reads, specify a value of 10.

• Require Bridge Read Covering Middle [x%] – Requires that, for at least one read in the subgroup, the total length of the “bridge” region (the extension beyond the left shoulder sequence, the left shoulder sequence, the anchor sequence, the right shoulder sequence, and the extension beyond the right shoulder sequence) is at least x% of the total read length. This setting is useful when multiple condensation cycles are used.

• Index Error Correction if Frequency <= [x%] of Majority Index – This setting is useful for transcriptome analysis or other types of analyses in which expression levels vary drastically. For very highly expressed sequences, errors are found at a high frequency. Without this setting, these errors would not be corrected and could instead be used as separate anchor sequences. This setting allows for reads with two different index (anchor) sequences to be combined into one group. If two anchor sequences differ by only one base and have identical shoulder sequences, they are clustered into one group if the count for either of these anchor sequences is less than or equal to x% of the total reads in the resulting group. The majority index is the index that has the greater number of reads. The minority index is the index that has fewer reads. By “correcting” the minority index to match the majority index, the minority sequence is prevented from being used as an index.
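
A minimal Python sketch of this merge decision follows. The function name and arguments are hypothetical, and the sketch follows the description above by comparing the smaller count with x% of the combined read count; the tool's exact comparison is not documented here.

```python
def should_merge(anchor_a: str, count_a: int, anchor_b: str, count_b: int,
                 max_percent: float, same_shoulders: bool = True) -> bool:
    """Decide whether two anchor sequences are clustered into one group.

    Assumptions: the anchors must differ at exactly one position, the
    shoulder sequences must be identical, and the minority count must be
    at most max_percent of the combined read count.
    """
    if not same_shoulders or len(anchor_a) != len(anchor_b):
        return False
    mismatches = sum(1 for p, q in zip(anchor_a, anchor_b) if p != q)
    if mismatches != 1:
        return False
    total = count_a + count_b
    return min(count_a, count_b) * 100 <= max_percent * total


# Hypothetical counts: 980 reads for the majority index, 20 for a one-base variant
print(should_merge("ACCAGAAGTTTA", 980, "ACCAGAAGTTTC", 20, 5))   # True
print(should_merge("ACCAGAAGTTTA", 600, "ACCAGAAGTTTC", 400, 5))  # False
```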