Ambiguous Gain penalty/Ambiguous Loss penalty
Ambiguity at the position where a variant is called can result from many factors, including pseudogenes and other repetitive elements, and the location of the variant (at the 5’ end, at the 3’ end, or in a central location). The Ambiguous Gain penalty and Ambiguous Loss penalty quantify the ambiguity relative to the region where a variant is called.
To calculate these penalties, NextGENe first generates multiple short synthetic reads for every location at which a variant was called. These synthetic reads are based on the consensus sequence for the region where the variant was called. The reads are generated in both the forward and reverse directions, and are designed so that the variant call is found at the beginning of some of the reads, at the end of some of the reads, and at several central locations in other reads. NextGENe then aligns these reads to the reference sequence and determines the number of synthetic reads that can be aligned at each variant position in the reference sequence. The Ambiguous Gain/Loss penalties are calculated from the results of these alignments.
The Ambiguous Gain penalty has no fixed upper bound (its range is 0 to n), and the Ambiguous Loss penalty has a range of 0 to 1. For both penalties, a value closer to zero indicates that the region where the variant was called has a more unique sequence (the expected number of synthetic reads aligned to the position). Conversely, for both penalties, a larger value indicates that the region where the variant was called is not unique. For the Ambiguous Gain penalty, a larger value indicates that more synthetic reads than expected aligned to the region where the variant was called. For the Ambiguous Loss penalty, a value closer to one indicates that fewer synthetic reads than expected aligned to the region where the variant was called.
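The synthetic read generation step described above can be sketched roughly as follows. This is only an illustration of the idea, not NextGENe's actual implementation: the function names, the read length, and the step between read start positions are all hypothetical, and the manual does not state how many reads are generated or how they are spaced.

```python
# Illustrative sketch only; names and parameters are assumptions,
# not part of NextGENe.

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def synthetic_reads(consensus, variant_pos, read_len=10, step=3):
    """Tile short reads over variant_pos (0-based) in both directions.

    Returns (strand, start, sequence) tuples. The earliest start places
    the variant at the last base of the read, the latest start places it
    at the first base, and the starts in between place it centrally.
    """
    first = max(0, variant_pos - read_len + 1)
    last = min(variant_pos, len(consensus) - read_len)
    starts = list(range(first, last + 1, step))
    # Make sure the read that begins at the variant itself is included.
    if starts and starts[-1] != last:
        starts.append(last)

    reads = []
    for start in starts:
        forward = consensus[start:start + read_len]
        reads.append(("+", start, forward))
        reads.append(("-", start, reverse_complement(forward)))
    return reads
```

For a 30-base consensus and a variant at index 15, this sketch yields reads starting at indices 6, 9, 12, and 15 on each strand, so the variant falls at the end, in the middle, and at the beginning of different reads.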
For example, consider the scenario in which variant calls were made at Positions A, B, and C in a sample file and NextGENe generates 30 synthetic reads for each position. If, after aligning the synthetic reads, NextGENe determines that 30 reads aligned at Position A, 30 reads aligned at Position B, and 30 reads aligned at Position C, then both the Ambiguous Gain and Ambiguous Loss penalties would have a value of zero for all positions. If, however, NextGENe determines that 60 reads aligned at Position A and 15 reads aligned at Position B, then the Ambiguous Gain penalty for Position A would be 2, and the Ambiguous Loss penalty for Position B would be 0.5.
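The arithmetic in this example can be reproduced with a small sketch. The exact formula NextGENe uses is not given here, so the rule below is an assumption chosen only because it matches the example values: the Gain penalty is taken as the ratio of aligned to expected reads when more reads than expected align, and the Loss penalty as the missing fraction when fewer align. The function name is hypothetical.

```python
# Illustrative sketch only; the formula is an assumption that is
# consistent with the worked example, not a documented NextGENe rule.

def ambiguity_penalties(expected, aligned):
    """Return (gain_penalty, loss_penalty) for one variant position."""
    if expected <= 0:
        raise ValueError("expected must be a positive read count")
    ratio = aligned / expected
    # Gain: more reads than expected aligned, so the region is not unique.
    gain = ratio if aligned > expected else 0.0
    # Loss: fewer reads than expected aligned; bounded between 0 and 1.
    loss = 1.0 - ratio if aligned < expected else 0.0
    return gain, loss


for name, aligned in [("A", 60), ("B", 15), ("C", 30)]:
    gain, loss = ambiguity_penalties(30, aligned)
    print(f"Position {name}: gain={gain}, loss={loss}")
```

Under this assumed rule, Position A (60 of 30 expected reads) has a Gain penalty of 2, Position B (15 of 30) has a Loss penalty of 0.5, and a position where exactly 30 reads align has both penalties equal to zero.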