To generate the CNV Tool report


The following procedure describes how to generate a new CNV Tool report. Optionally, you can click Load Settings to browse to and select a Settings file (.ini file) to generate the report based on the saved settings in the file. As you create a new report, at any time, you can click Default to return all values on all tabs to their default values.

1. On the Comparisons menu, select CNV Tool.

The CNV Tool window opens. The Method Selection tab is the open tab.

2. Do one of the following:

• Select Dispersion and HMM, and then leave Normalized Counts selected or select RPKM.

• Select SNP-Based Normalization with smoothing.

3. Open the Data Input tab.

4. Load the Sample project (*.pjt) files:

• To add a single sample project file, or multiple sample project files one at a time, under the Sample pane, click Add, and in the Open dialog box, browse to and select the .pjt file, and then click Open to add the sample file.

• To add multiple sample project files, under the Sample pane, click Batch Add, and then in the Browse for Folder dialog box, browse to and select the folder that contains all the sample .pjt files, and then click OK to add all the files in a single step.

You can add up to 48 sample project (.pjt) files. If you use the Batch Add option, and the folder that you select contains more than 48 .pjt files, then an error message opens indicating this, and only the first 48 .pjt files in the folder are added. The remaining .pjt files are not added.

5. Load the Control project (*.pjt) files:

• To add a single control project file, or multiple control project files one at a time, under the Control pane, click Add, and in the Open dialog box, browse to and select the .pjt file, and then click Open to add the control file.

• To add multiple control project files, under the Control pane, click Batch Add, and then in the Browse for Folder dialog box, browse to and select the folder that contains all the control .pjt files, and then click OK to add all the files in a single step.

You can add up to 24 control project (.pjt) files. If you use the Batch Add option, and the folder that you select contains more than 24 .pjt files, then an error message opens indicating this, and only the first 24 .pjt files in the folder are added. The remaining .pjt files are not added.

6. In the Control Option pane, do the following:

• If you loaded only a single Control project file, select Single Control.

• If you loaded multiple Control project files, select Multiple Controls, and then indicate how the control values are to be determined:

Control	Description
Best Match	Use the single control project whose per-region coverage correlates best with the sample project as the control. The other control projects are ignored.
Average Controls	Use the average coverage in each region across all control projects as the control value.
Median Controls	Use the median coverage in each region across all control projects as the control value.
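
The three Multiple Controls options can be illustrated with a minimal Python sketch. It assumes that per-region coverage is already available as plain lists of numbers, one list per project. The function names, the option strings, and the use of Pearson correlation for Best Match are assumptions made for illustration; the procedure above does not state which correlation measure NextGENe uses.

```python
from statistics import mean, median

def pearson(x, y):
    """Pearson correlation of two equal-length coverage lists (assumed measure)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def control_values(sample, controls, option):
    """Return one control coverage value per region.

    sample   -- per-region coverage of the sample project
    controls -- list of per-region coverage lists, one per control project
    option   -- "best_match", "average", or "median" (hypothetical names)
    """
    if option == "best_match":
        # Keep only the control that correlates best with the sample.
        return list(max(controls, key=lambda c: pearson(sample, c)))
    if option == "average":
        # zip(*controls) groups the values of all controls region by region.
        return [mean(region) for region in zip(*controls)]
    if option == "median":
        return [median(region) for region in zip(*controls)]
    raise ValueError(f"unknown control option: {option}")

sample = [10, 20, 30, 40]
controls = [[40, 30, 20, 10], [11, 19, 33, 41], [12, 25, 27, 50]]
best = control_values(sample, controls, "best_match")
med = control_values(sample, controls, "median")
```

With this toy data, Best Match keeps the second control, and Median Controls takes the middle value of the three controls in each region.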

7. Specify the output options for the report.

When you run the tool, the selected reports (CNV Report and/or the Block Report) are automatically saved in the specified output folder in the selected formats (text and/or GVF).

8. Open the Basic Settings tab.

9. Indicate how to define the segments that are to be analyzed and reported on by the tool.

Option	Description
Use segments as defined in the reference files.	• CDS - Report coverage levels for each coding region. • Exon - Report coverage levels for each mRNA region. (Coding and non-coding exons.) • Continuous Exon - Report coverage levels for the entire mRNA for a gene, one region per gene. • Continuous CDS - Report coverage levels for the entire coding region for a gene, one region per gene. • ROI - Report coverage levels based on Regions of Interest that are defined in a GenBank reference file. Note: For information about defining Regions of Interest in a GenBank reference file, see Advanced GBK Editor tool.
Set incremental segment length	Specify the segment length, relative to either the reference positions in the contig or the chromosome positions.
Input region of interest (*.bed)	You can upload a Region of Interest file in a BED format.
Limit reporting to BED regions with specific descriptions	Optional. If you uploaded a Region of Interest file in BED format, then you can create a text file that lists the descriptions for selected regions of this file. (Each description must be on its own line.) You can then upload this text file. All calculations are still carried out on the whole BED file, but only those regions of the uploaded BED file whose descriptions match those in the text file are included in the CNV report. Note: BED file descriptions are optional. If they are included in a BED file, then they are located in Column 4.
Exclude Chr X	Optional. Select this chromosome to be excluded from the comparison.
Exclude Chr Y	Optional. Select this chromosome to be excluded from the comparison.
Exclude Chr M	Optional. Select this chromosome to be excluded from the comparison.
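
The description-based limit on BED regions can be pictured with a small Python sketch. It assumes a tab-delimited BED file with the description in Column 4 and a text file with one description per line, as described above. The function name and the exact-match comparison are assumptions; the procedure does not say whether NextGENe matches descriptions exactly or case-sensitively.

```python
def filter_bed_by_description(bed_lines, wanted_descriptions):
    """Return the BED records whose Column 4 description is in the wanted list.

    bed_lines           -- iterable of lines from a tab-delimited BED file
    wanted_descriptions -- iterable of lines from the description text file
    """
    # One description per line; blank lines in the text file are ignored.
    wanted = {d.strip() for d in wanted_descriptions if d.strip()}
    kept = []
    for line in bed_lines:
        # Skip blank lines and the optional BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        # Descriptions are optional, so a record may have only three columns.
        if len(fields) >= 4 and fields[3] in wanted:
            kept.append(fields)
    return kept

bed = [
    "chr1\t100\t200\tBRCA1_ex1\n",
    "chr1\t300\t400\tTP53_ex2\n",
    "chr2\t10\t50\n",
]
selected = filter_bed_by_description(bed, ["TP53_ex2\n", "\n"])
```

Only the reporting step would use the filtered list. As noted in the table, the calculations themselves run on the whole BED file.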

10. Optionally, open the Advanced Settings tab, select the appropriate fitting method, and then modify any of the default values as needed.

Fitting Method	Description
Note: If you make a change to any of the values that are listed below, then at any time, you can click Default to return all values on all tabs on the dialog box to their default values.
Auto fitting	Selected by default. Automatic fitting is the recommended approach for large panels (thousands of regions/exons) and whole exome sequencing. With this method, a line is automatically fit to the dispersion fitting points. Manual fitting is recommended for small targeted panels (< hundreds of regions/exons), especially if the data does not have a lot of noise. The number of points for automatic fitting should be set so that each fitting point accurately reflects a sufficient number of raw data points. • If Custom fitting point number is not selected, then NextGENe automatically selects the appropriate number of points based on the regions. If Custom fitting point number is selected, then typically, the default value of 15 fitting points is acceptable for most data for large panels; however, if you have a small number of raw data points, then the rule of thumb is one fitting point for every 100 raw data points, so you can decrease this value as needed. For example, if your data has 375 regions, then you would set the number of points to three or four fitting points for Auto fitting. Even with a smaller number of regions, the number of points for Auto fitting should never be less than three. Note: Typically, even if you know that a manual fitting or a manual dispersion is the appropriate approach for your data, you should run an automatic fitting first, and then view the resulting data so that you have an idea of how to modify all the fitting settings for either method.
Manual fitting	For Manual fitting, "a" and "b" represent the values for the line that is fit to the dispersion fitting points. These values are automatically populated after an Automatic fitting. You must modify these values for a Manual fitting. The Minimum Dispersion value is the minimum threshold for the dispersion of the data, regardless of the value that is set for "a." As with Auto fitting, the number of points for manual fitting should be set so that each fitting point accurately reflects a sufficient number of raw data points. • If Custom fitting point number is not selected, then NextGENe automatically selects the appropriate number of points based on the regions. If Custom fitting point number is selected, then typically, the default value of 15 fitting points is acceptable for most data for large panels; however, if you have a small number of raw data points, then, again, the rule of thumb is one fitting point for every 100 raw data points, so you can decrease this value as needed.
Fixed dispersion value	Select this option to use a single dispersion value for all regions in lieu of fitting a line to all the dispersion points. As with the other fitting methods, the number of points for manual dispersion should be set so that each fitting point accurately reflects a sufficient number of raw data points. • If Custom fitting point number is not selected, then NextGENe automatically selects the appropriate number of points based on the regions. If Custom fitting point number is selected, then typically, the default value of 15 fitting points is acceptable for most data for large panels; however, if you have a small number of raw data points, then, again, the rule of thumb is one fitting point for every 100 raw data points, so you can decrease this value as needed. Note: The Fixed dispersion option is useful for targeted panels where the dispersion (noise) is relatively low. • Auto-Detect: The manual dispersion value is automatically adjusted. This automatically chosen value works well in most cases, but you can modify this value as needed. You can select this value to be displayed in the CNV report.
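
The rule of thumb for the Custom fitting point number can be written as a short Python helper. This is only a sketch of the guidance in the table above: one fitting point per 100 raw data points, never fewer than three, with the default of 15 treated as the upper value. The function name, the rounding, and the cap at 15 are assumptions for illustration and are not taken from NextGENe itself.

```python
def suggested_fitting_points(n_raw_points, per_point=100, minimum=3, default=15):
    """Suggest a Custom fitting point number from the number of raw data points.

    Applies the rule of thumb of one fitting point per `per_point` raw data
    points, clamps the result so it is never below `minimum`, and (as an
    assumption) never above the default value used for large panels.
    """
    by_rule = round(n_raw_points / per_point)
    return max(minimum, min(default, by_rule))

small_panel = suggested_fitting_points(375)    # 3.75 rounds to 4
tiny_panel = suggested_fitting_points(120)     # rule gives 1, clamped up to 3
large_panel = suggested_fitting_points(5000)   # rule gives 50, capped at 15
```

For 375 regions the helper returns four, which falls within the "three or four" guidance given for Auto fitting.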

11. Leave the default values for the other settings as-is, or modify them as needed.

• HMM and Dispersion method, Advanced Settings

Setting	Description
Note: If you make a change to any of the values below, at any time, you can click Default to return all values on all tabs on the dialog box to their default values.
Minimum Normalized Read Counts	Applicable only if Normalized Counts is selected. Any regions where the total Normalized Read Counts fall below this value are labeled as Uncalled in the CNV Tool report.
Minimum RPKM	Applicable only if RPKM is selected. Any regions where the total RPKM falls below this value are labeled as Uncalled in the CNV Tool report.
Minimum region length	Minimum size of a region in base pairs for the region to be included in the CNV Tool report.
Expected CNV Percentage [5.00]%	Indicates the percentage of regions in which CNV calls are expected to be made. Note: Typically, the default value of 5% is acceptable for most data. If the data is confident (not noisy), then increasing this value does not significantly increase the percentage of regions in which CNV calls are made. If the data is not confident (noisy), then increasing this value increases the percentage of regions in which CNV calls are made.
Estimated sample purity	If the sample is mixed, or it has possible contamination, then enter an appropriate sample purity to adjust the calculations accordingly.

• SNP-Based Normalization with Smoothing

Setting	Description
Note: If you make a change to any of the values below, at any time, you can click Default to return all values on all tabs on the dialog box to their default values.
Neighbor ratio settings
Perfect heterozygote SNP	Indicates the frequency requirements for perfect heterozygote SNP positions. Both the reference and variant allele must be found at a frequency that is above the specified threshold, or the SNP is not used to determine the median coverage for the region. The default value is 40%, which means that any variant that is found at a frequency between 40% and 60% is considered to be a perfect heterozygote SNP.
Smooth Log2Ratio	Selected by default. You can clear this option to omit the step of checking Neighbor Ratios.
• High Resolution (3)	• Optimizes the detection sensitivity to call CNVs for smaller regions, such as CNVs that include only part of a gene. Considers three regions total - the region itself and one neighbor region on each side.
• Low Resolution (41)	• Optimizes the detection to call larger CNVs, such as CNVs that include multiple genes or a whole chromosome. Considers 41 regions total - the region itself and 20 neighbor regions on each side.
• Customized resolution	• Specify the number of regions that are to be considered for making the CNV call, where the number must reflect the region itself and the same number of neighbor regions on each side.
Deletion and duplication calls using log2 ratio
• Auto • Manual	• NextGENe automatically makes the calls. • Manually define the range of log2ratio values for regions to be reported as Normal (default is -0.60 to 0.50). Regions with log2ratio values outside of this range are reported as Deletion or Duplication.
Display limits	Available only if Manual is selected for the call option. Set the range that is displayed for the minimum and maximum values on the Log2ratio (Y) axis for the CNV graph relative to values defined for the Manual option. See CNV graphs.
• Large block • Single point	• Detect and report on CNV regions that are > 3 regions. • Detect and report on CNV regions that are single regions or larger.
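
The SNP-Based Normalization settings above can be sketched in Python. The sketch covers three things: the perfect heterozygote check, the set of regions that a given resolution considers, and a Manual call from the default Normal range of -0.60 to 0.50. All function names are hypothetical. How NextGENe treats boundary values, how it handles regions near the ends of the data, and how it combines neighbor ratios are not described here, so the choices below (strict comparisons and a window truncated at the edges) are assumptions.

```python
def is_perfect_heterozygote(ref_freq, var_freq, threshold=0.40):
    """Both alleles must be found above the threshold (default 40%)."""
    return ref_freq > threshold and var_freq > threshold

def neighbor_window(index, n_regions, resolution=3):
    """Indexes of the regions considered for the region at `index`.

    `resolution` counts the region itself plus the same number of neighbors
    on each side: 3 for High Resolution, 41 for Low Resolution.
    Truncating the window at the first and last region is an assumption.
    """
    half = resolution // 2
    return list(range(max(0, index - half), min(n_regions, index + half + 1)))

def manual_call(log2_ratio, normal_min=-0.60, normal_max=0.50):
    """Classify a region from its log2 ratio using the Manual Normal range."""
    if log2_ratio < normal_min:
        return "Deletion"
    if log2_ratio > normal_max:
        return "Duplication"
    return "Normal"

high_res = neighbor_window(5, 100, resolution=3)
low_res = neighbor_window(50, 100, resolution=41)
calls = [manual_call(r) for r in (-1.0, 0.0, 0.58)]
```

A resolution of 3 yields the region and one neighbor on each side, and a resolution of 41 yields the region and 20 neighbors on each side, matching the High Resolution and Low Resolution descriptions.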

12. Optionally, open the Report Settings tab and do either one or both of the following:

• For the Display settings, select the columns that are to be included in the report, or clear the options for the columns that are not to be included.

Setting	Description
Common Display Settings
Index	An ordered count of the segments that are used in the report.
Chr • Name • Number	• The name of the chromosome that the segment is on. • The number of the chromosome that the segment is on.
Chr Position Start	The base number that indicates where the segment starts in the chromosome.
Chr Position End	The ending base number that indicates where the segment ends in the chromosome.
Entrez Gene ID	The unique integer ID generated for the gene by Entrez Gene.
Gene	The gene name for the segment when the segment is the whole gene or the name of the gene on which the segment is found.
Exon	The exon number where the segment is found. This number includes non-coding exons.
CDS	The coding sequence number for the segment.
RNA Accession	The RNA accession for the gene from NCBI.
Protein Accession	The protein accession for the gene from NCBI.
Description	Available if the reference file is a .fasta file with multiple segments. Select this option to display the title line for each segment in the Description column.
Contig	The contig on which the segment is located. The contig is based on the genome assembly from the NCBI.
Locus Tag	An alternate way to identify the gene.
Start	The starting location for the reference region.
End	The ending location for the reference region.
Length	The total length of the reference region, which provides for easy identification of expressed regions by size (such as when locating small RNA transcripts).
Dispersion	The dispersion value for the region. N/A for Uncalled regions.
Normalized Likelihoods	The normalized likelihood value for each potential CNV call (duplication, deletion, or normal). A likelihood value closer to zero indicates an increased likelihood for the call.
Dispersion and HMM Display Settings, Normalized Counts selected
Normalized Read Count	The Normalized Read Counts for both the sample and the control.
Ratio	The ratio of the sample RPKM to total RPKM for the region.
Total Read Counts	The sum of the Sample read counts and the Control read counts.
Dispersion and HMM Display Settings, RPKM selected
RPKM	Reads per Kilobase Exon Model per Million mapped reads. RPKM = 10^9 * R / (T*L) where: • R = Number of mapped reads in a region • T = Total number of mapped reads. • L = Length of the region. Normalizes the expression levels based on the length of the reference region and the total number of aligned reads.
FPKM	Applicable only if the project used paired-end data. Fragments per Kilobase of exon per Million mapped reads. FPKM = 10^9 * F / (T*L) where: • F = Number of mapped fragments in a region and: • A “fragment” corresponds to a pair of reads. • Single reads are not counted. • The position of a fragment is the location between the two 5’ ends of the pairs. • T = Total number of mapped fragments. • L = Length of the region. Normalizes the expression levels for paired end data based on the length of the reference region and the total number of aligned reads.
Ratio	The ratio of the sample RPKM to total RPKM for the region.
Total RPKM	The sum of the Sample RPKM and the Control RPKM.
SNP-Based Normalization Display Settings
Position Selected	The chromosome position for the heterozygous SNP at which the median coverage value is obtained.
Original Coverage	The un-normalized median coverage values for the region for the sample and control.
Normalized Coverage	The median coverage following global normalization for the region for the sample and the control.
Control Allele	Read count for the alleles at the Position Selected in the control project. If there are more than two alleles, then only the two most frequent alleles are reported.
Sample Allele	Read count for the alleles at the Position Selected in the sample project. If there are more than two alleles, then only the two most frequent alleles are reported.
Log2 Ratio	The Log2 of the ratio of the normalized coverages of the two sample files.
Neighbor Ratios	The Log2 ratios for the current region followed by the Log2 ratios of the neighbor regions.
Dispersion HMM	Select this option to include the Dispersion HMM analysis in the report results. Note: Neighbor Ratios must also be selected.
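
The RPKM and FPKM formulas in the table above translate directly into code. The following Python sketch implements RPKM = 10^9 * R / (T*L), reuses the same expression for FPKM with fragment counts, and computes the Ratio column as the sample value divided by the total of sample and control. The function names are hypothetical, and the sketch is a worked example of the formulas as written, not NextGENe's own implementation.

```python
def rpkm(reads_in_region, total_mapped_reads, region_length_bp):
    """RPKM = 10^9 * R / (T * L).

    R -- number of mapped reads in the region
    T -- total number of mapped reads
    L -- length of the region in base pairs
    """
    return 1e9 * reads_in_region / (total_mapped_reads * region_length_bp)

def fpkm(fragments_in_region, total_mapped_fragments, region_length_bp):
    """FPKM uses the same formula with fragments (read pairs) instead of reads."""
    return rpkm(fragments_in_region, total_mapped_fragments, region_length_bp)

def ratio_to_total(sample_value, control_value):
    """Ratio of the sample value to the total (sample plus control)."""
    return sample_value / (sample_value + control_value)

# 500 reads in a 2,000 bp region, out of 10 million mapped reads.
sample_rpkm = rpkm(500, 10_000_000, 2000)
ratio = ratio_to_total(25.0, 75.0)
```

Because T and L both appear in the denominator, doubling either the sequencing depth or the region length halves the RPKM for the same read count, which is the normalization the table describes.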

• For the Filter settings, specify the thresholds for the regions that are to be included in the report.


Common Filter Settings
Display Deletion	Selected by default. Show CNVs that are classified as Deletions. Clear this option to hide this classification from the CNV Tool report.
Display Normal	Selected by default. Show regions that are classified as Normal (little evidence of a CNV). Clear this option to hide this classification from the CNV Tool report.
Display Duplication	Selected by default. Show CNVs that are classified as Duplications. Clear this option to hide this classification from the CNV Tool report.
Display Uncalled	Selected by default. Show regions that are classified as Uncalled. Clear this option to hide this classification from the CNV Tool report.
In BED region	Not available if a BED was loaded on the Basic Settings tab. Filters the CNV Tool report to include only those regions that are contained in the BED file. Click Set to browse to and select the appropriate BED file. Note: A BED file is a tab-delimited text file. You can upload a BED file only if the reference sequence contains chromosome information, which means that the reference sequence must be either a preloaded reference that NextGENe supplies, or a GenBank reference file that contains chromosome information.
HMM and Dispersion Filter Settings
Score	Filter the calls shown based on their respective scores. (Deletion, Normal, and Duplication.) The default value is 1.000, which means that all calls with a score > 1.000 are shown in the report. You can modify this value as needed.
SNP-Based Normalization with smoothing Filter Settings
Log2 Ratio <= [-0.700] or >= [0.700]	Display only those regions where the Log2 of the ratio of the normalized coverages of the two sample files is at or below the lower threshold or at or above the upper threshold.
Scores >= [3.000]	Show only regions where the Phred-scaled score for at least one potential call (duplication, deletion, or normal) meets or exceeds the set threshold.
Minimum Coverage At Least For One Project >= [5.00]	Default value is 30. At least one project (sample file) must contain at least the minimum read count in the selected regions, or the CNV calculations are not carried out for the region and the region is not included in the report.
Show regions with low coverage	Include regions whose coverage falls below the indicated minimum coverage in the report. N/A is displayed for the Log2 Ratio value for these regions.

13. Optionally, click Save Settings to save these settings to a Settings file (.ini file).

You can click Load Settings to select this Settings file at a later date and generate the report according to the saved settings in the file.

14. Click Run.

The CNV Tool report is automatically generated and displayed. If you set the Output options on the Data Input tab, then the selected reports (CNV Report and/or the Block Report) are also automatically saved in the specified output folder in the selected formats (text and/or GVF).

For the SNP-Based Normalization with smoothing CNV Tool report, percentile information for the normal distribution of the Log2 ratios is displayed above the report columns. The delSigma value is one standard deviation below the 50th percentile. The delSigma value represents the required value for the Log2 ratio to call a deletion for a given region. The dupSigma value is one standard deviation above the 50th percentile. The dupSigma value represents the required value for the Log2 ratio to call a duplication for a given region. The other percentile values represent the required values for the Log2 ratios to place a region in the indicated percentile. For example, 32percentile: -0.0529 means that the Log2 ratio for a given region must equal -0.0529 for the region to be placed in the 32nd percentile of all regions.
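
The delSigma and dupSigma values can be illustrated with a short Python sketch. It takes the 50th percentile as the median of the Log2 ratios and places the two thresholds one standard deviation below and above it, as described in the paragraph above. The function name is hypothetical, and the use of the population standard deviation over all regions is an assumption; the exact estimator that NextGENe uses is not stated here.

```python
from statistics import median, pstdev

def sigma_thresholds(log2_ratios):
    """Return (delSigma, dupSigma) for a list of per-region Log2 ratios.

    delSigma -- one standard deviation below the 50th percentile
    dupSigma -- one standard deviation above the 50th percentile
    """
    center = median(log2_ratios)   # 50th percentile
    spread = pstdev(log2_ratios)   # population standard deviation (assumed)
    return center - spread, center + spread

ratios = [-0.2, -0.1, 0.0, 0.1, 0.2]
del_sigma, dup_sigma = sigma_thresholds(ratios)
```

For this toy data the median is 0.0 and the standard deviation is about 0.141, so the two thresholds sit symmetrically at roughly -0.141 and 0.141.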

The CNV Tool report is interactive:

Option	Description
Select a different sample to view in the report display	If you selected multiple samples for the report, then the report toolbar displays a Sample dropdown list of all the samples that were analyzed for the report. You can select a different sample on this list to update the report display accordingly.
View the region of the genomic database in the Database of Genomic Variants (DGV) for which the call was made	Click the call type in the HMM Calls column.
Load different projects and/or change the project settings	On the report toolbar, click the Load Projects icon to open the CNV Tool window, and then make the appropriate changes.
Modify the CNV report settings	On the report toolbar, click the Settings icon to open the CNV Report Settings dialog box and modify the report settings as needed. The report display is dynamically updated after you save the modifications.
Save the report to a text file	On the report toolbar, click the Save Report icon. A default name (<project_name>_CNVReport) and location (project folder) are provided for the file, but you can change both of these values.
Generate the Gene CNV report	Applicable only for SNP-Based Normalization with smoothing. On the report toolbar, click the Gene CNV report icon. See Gene CNV report.
Generate the Block CNV report	On the report toolbar, click the Block CNV report icon. See Block CNV report.
Generate the graphical display of the data	On the report toolbar, click the CNV Graphs icon. See CNV graphs.