PE assembly method for Roche/454, Illumina, and Ion Torrent data


The PE assembly method is a paired-end assembly algorithm developed by SoftGenetics. It is designed to produce accurate assemblies even when the data contain repeat regions, provided that the repeats are smaller than the paired-end library size.

The method begins with a traditional scaffolding assembly algorithm. Short “words” within reads are used to find overlaps, and the overlapping reads form the scaffold. This step generates initial assemblies that stop at repetitive regions. These initial assemblies are referred to as scaffold contigs. (NextGENe places these contigs in the ScaffoldContigs.fasta file. You can use this file to manually select which scaffold contigs are to be linked together. See The NextGENe Long PE Assembly Mapping Tool.)

When paired reads are used, the pairing information is used to continue the assemblies past the repetitive regions, which produces larger contigs than scaffolding alone could assemble. Although you can use the PE assembly method to assemble single-read data, it is most effective for paired reads with relatively small library sizes, such as paired-end Illumina reads from a 200 bp library.
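The idea of using short words to find overlaps between reads can be illustrated with a small Python sketch. This is a generic word-index approach written for illustration only; it is not NextGENe's implementation, and the function names (`build_word_index`, `suffix_prefix_overlap`, `candidate_overlaps`) and the example reads are hypothetical.

```python
from collections import defaultdict


def build_word_index(reads, word_length):
    """Map every word (substring of word_length bases) to the reads containing it."""
    index = defaultdict(set)
    for read_id, read in enumerate(reads):
        for start in range(len(read) - word_length + 1):
            index[read[start:start + word_length]].add(read_id)
    return index


def suffix_prefix_overlap(left, right, min_length):
    """Return the longest suffix of `left` that equals a prefix of `right`.

    Overlaps shorter than min_length are reported as 0.
    """
    for size in range(min(len(left), len(right)), min_length - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def candidate_overlaps(reads, word_length):
    """Find exact suffix-prefix overlaps between reads that share at least one word.

    Illustrative only: a real assembler also handles sequencing errors,
    reverse complements, and repeats.
    """
    index = build_word_index(reads, word_length)
    overlaps = {}
    for read_ids in index.values():
        for i in read_ids:
            for j in read_ids:
                if i != j and (i, j) not in overlaps:
                    size = suffix_prefix_overlap(reads[i], reads[j], word_length)
                    if size:
                        overlaps[(i, j)] = size
    return overlaps


# Hypothetical toy reads; real reads are far longer than the word length.
reads = ["ACGTACGGA", "ACGGATTCA", "TTCAGGGCC"]
overlaps = candidate_overlaps(reads, word_length=4)
```

In this toy example, read 0 overlaps read 1 by 5 bases and read 1 overlaps read 2 by 4 bases, so the three reads chain into one contig. Only reads that share a word are compared, which is why indexing words keeps the overlap search manageable.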

Setting	Description
Paired End Data	Select this option if you are assembling paired end data.
• Library Size	• The size of the fragment that is being sequenced.
• Long Library Size (> 1000 Bases)	• If the library is greater than 1000 bases, then in addition to specifying the library size, you must also select this option.
Section Size	Available only if Long Library is selected. Scaffold contigs are broken into sections when they are being assembled so that the distance between the contigs can be estimated. For the majority of datasets, the default value of 400 is recommended.
Minimum Scaffold Length	Available only if Long Library is selected. Any scaffold contigs that are shorter than the specified Minimum Scaffold Length are discarded and are not used in the generation of the final contigs.
Word Length	The word length that is used for scaffolding. This value is determined by the average depth of coverage for the data. The lower the average depth of coverage for the data, the shorter this value should be. Conversely, the higher the average depth of coverage for the data, the longer this value should be. (Longer word lengths result in greater noise reduction.) If coverage falls within the range of 20-30x, the recommended word length is 23. If coverage is approximately 50x, the recommended word length is 29. The maximum recommended value for word length is 31.
High Coverage Limited: Max Coverage = [x]	The maximum coverage that is to be used for assembly. For sequences with higher coverage, only reads up to the maximum coverage are used. Additional reads that cover the same sequence are ignored, which increases processing speed.
Final Contig Merging	Merges any overlapping contigs that were found after scaffolding and linking with the paired reads are complete.
Reduce Memory Usage	When this option is selected, only the 5’ end of the read is used to create “words” for indexing (to determine overlaps). The number of bases used for indexing is (0.5 + 20/L) × L, which simplifies to 0.5L + 20, where L = the average read length. Note: The memory that is conserved by this method is more significant for longer reads. For 36 bp reads, the formula gives 38 bases, which is more than the read length, so there is no difference in the memory that is used.
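The arithmetic behind the Word Length and Reduce Memory Usage settings can be sketched in Python. The following is a minimal illustration, not part of NextGENe: all function names are hypothetical, the cap at the read length is an assumption inferred from the note about 36 bp reads, coverage is estimated with the usual definition (total sequenced bases divided by genome size), and the ±5x tolerance used for "approximately 50x" is an assumption.

```python
def indexed_bases(avg_read_length):
    """Bases at the 5' end used for indexing: (0.5 + 20/L) * L = 0.5*L + 20.

    Assumption: the result is capped at the read length, since a read
    cannot contribute more bases than it contains.
    """
    if avg_read_length <= 0:
        raise ValueError("average read length must be positive")
    return min(0.5 * avg_read_length + 20, avg_read_length)


def estimate_coverage(read_count, avg_read_length, genome_size):
    """Average depth of coverage = total sequenced bases / genome size."""
    return read_count * avg_read_length / genome_size


def documented_word_length(coverage, tolerance=5):
    """Return the word length stated in the settings table, or None.

    Only the two documented cases are covered: 20-30x gives 23, and
    approximately 50x gives 29. The tolerance around 50x is an assumption;
    no value is invented for other coverages.
    """
    if 20 <= coverage <= 30:
        return 23
    if abs(coverage - 50) <= tolerance:
        return 29
    return None


# Hypothetical dataset: one million 100 bp reads from a 4 Mb genome.
coverage = estimate_coverage(1_000_000, 100, 4_000_000)
word_length = documented_word_length(coverage)
bases_for_index = indexed_bases(100)
```

With these hypothetical numbers the coverage is 25x, which falls in the 20-30x range and maps to a word length of 23, and 70 of the 100 bases in each read are indexed. Because 0.5L + 20 equals L at L = 40, reads of 40 bp or shorter are indexed in full under this formula, which is consistent with the note that 36 bp reads see no memory saving.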