Assembly merge
Revision as of 17:42, 25 April 2008
Assemblers
Denovo
Minimus
* hash-overlap minimum overlap length:
  - 40 bp default: too large for contig assemblies
  - 20 bp minimum overlap; minimizer window length must be >= 15 bp; could these values be dropped lower?
* very slow on large sequences (e.g. Ps.fasta, Ps.plasmid.fasta) even if USE_SIMPLE_OVERLAP=1; the reason is unclear
* analyzes internal overlaps, not only the ones at the sequence ends
Ssake
- Short Sequence Assembly by K-mer search and 3' read Extension (SSAKE)
- Rene L Warren, Granger G Sutton, Steven JM Jones, Robert A Holt. "Assembling millions of short DNA sequences using SSAKE" Bioinformatics. Vol. 23 no. 4 2007, pages 500–501
- Current release: SSAKE 3.2 (2007-12-07)
- written in Perl; runs on Linux.
- Greedy assembler: progressively searches for perfect 3’-most k-mers using a DNA prefix tree to identify overlaps between any two sequences.
- stringently clusters short reads into contigs that can be used to characterize novel sequencing targets.
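The greedy 3' extension idea can be illustrated with a toy sketch. It is not SSAKE's implementation: a linear scan stands in for the prefix tree, reverse complements and coverage are ignored, and the function name `greedy_extend` and the `min_overlap` parameter are hypothetical.

```python
def greedy_extend(seed, reads, min_overlap):
    """Extend `seed` at its 3' end, always taking the read whose prefix
    exactly matches the longest 3'-most k-mer of the growing contig."""
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    unused = set(reads) - {seed}
    contig = seed
    extended = True
    while extended:
        extended = False
        longest_read = max((len(r) for r in unused), default=0)
        max_k = min(len(contig), longest_read - 1)
        # Try the longest overlap first, then progressively shorter ones.
        for k in range(max_k, min_overlap - 1, -1):
            suffix = contig[-k:]
            candidates = [r for r in unused if len(r) > k and r.startswith(suffix)]
            if candidates:
                read = sorted(candidates)[0]  # deterministic pick among ties
                contig += read[k:]            # append only the non-overlapping part
                unused.discard(read)
                extended = True
                break
    return contig


if __name__ == "__main__":
    print(greedy_extend("ACGTAC", ["ACGTAC", "GTACGG", "CGGTTA"], 3))
```

Each read is used at most once, so the loop stops as soon as no unused read overlaps the current 3' end by at least `min_overlap` bases.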
Example:
$ grep -c "^>" Pa.seq  # total number of reads
$ (time ssake_3.0.pl -f Pa.seq) > ssake_3.0.log  # version 3.0
$ (time ssake_3.2.pl -f Pa.seq -t 1) > ssake_3.2.log  # version 3.2
$ ln -s Pa*contigs Pa.fasta
$ ln -s Pa*singlets Pa.singlets.seq
$ cat Pa.fasta | grep "^>" | perl -ane '/read(\d+)/; print $1, "\n";' | getSummaryDescriptive.pl -t assembled_reads  # number of assembled reads
$ cat Pa.singlets.seq | grep "^>" | perl -ane '/read(\d+)/; print $1, "\n";' | getSummaryDescriptive.pl -t singletons  # number of singletons
$ grep -v "^>" Pa.seq | grep -c N  # number of reads that contain ambiguities and are discarded from the start (should be added to the singletons)
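The read counts used above (total reads and reads containing ambiguous bases) can also be computed without relying on one sequence line per read. This is a minimal sketch; the helper names `read_fasta` and `count_reads` are hypothetical and not part of SSAKE or AMOS.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_reads(lines):
    """Return (total reads, reads containing at least one N)."""
    total = ambiguous = 0
    for _, seq in read_fasta(lines):
        total += 1
        if "N" in seq.upper():
            ambiguous += 1
    return total, ambiguous


if __name__ == "__main__":
    example = [">read1", "ACGT", ">read2", "ACNT", ">read3", "NNNN"]
    print(count_reads(example))
```

For a real file, pass an open file handle, for example `count_reads(open("Pa.seq"))`.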
Velvet
- "Velvet: Algorithms for De Novo Short Read Assembly Using De Bruijn Graphs", Genome Res. published online Mar 18, 2008; Daniel R. Zerbino and Ewan Birney
- manipulates de Bruijn graphs efficiently to both eliminate errors and resolve repeats.
- These two tasks are done separately: first the error correction algorithm merges sequences which belong together, then the repeat solver separates paths sharing local overlaps
* minimum overlap length (30-40X coverage data set):
  - 18bp usually gives the fewest contigs
  - 24bp is also ok
  - 15bp is too low => too many short contigs
* velvet_assy.afg lists the assembled reads (TLE messages)
* some reads extend beyond the contig ends
* there are some reads shared by multiple contigs (status:D); at least one of the instances occurs at a contig end (goes beyond the contig end)
* many contigs overlap by a few bp
Example:
$ velveth . ovl prefix.seq  # ovl should be an odd integer like 21, 23
$ velvetg . -read_trkg yes
$ bank-transact -c -z -b prefix.bnk -m velvet_assy.afg
$ bank2contig prefix.bnk > prefix.contig
$ bank2fasta -b prefix.bnk > prefix.fasta
$ listReadPlacedStatus -S -E prefix.bnk > prefix.singletons
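The hash length passed to velveth is odd because a k-mer of odd length can never equal its own reverse complement: its middle base would have to pair with itself. The sketch below checks this by brute force for small k; the function names are hypothetical and the code is independent of Velvet.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_palindromic_kmers(k):
    """Count k-mers over ACGT that equal their own reverse complement."""
    count = 0
    for bases in product("ACGT", repeat=k):
        kmer = "".join(bases)
        if kmer == reverse_complement(kmer):
            count += 1
    return count


if __name__ == "__main__":
    for k in (3, 4, 5):
        print(k, count_palindromic_kmers(k))
```

Even values of k always admit such palindromic k-mers (for example `ACGT`), while odd values admit none.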
Edena
- "De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer." Genome Res. Published April 3, 2008, D. Hernandez, P. François, L. Farinelli, M. Osteras, and J. Schrenzel.
- based on the traditional overlap layout paradigm.
- All exact overlaps between any pair of reads are computed and structured in a graph (overlap step).
- the reads are indexed in a prefix array and overlaps are revealed by dichotomic search in the arrays.
- The graph is then analyzed to remove transitive and spurious edges (layout step).
- Finally, contigs that can be assembled by following unambiguous paths in the graph are given as output.
- Latest version is 2.1.1 (March 17, 2008)
- Compiled for Linux-32 and Linux-64
- contigs don't overlap
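The overlap step described above (a sorted read array searched by dichotomic search) can be sketched as follows. This is a toy illustration, not Edena's code: it handles the forward strand only, and the function name `exact_overlaps` and its `min_overlap` parameter are hypothetical.

```python
from bisect import bisect_left


def exact_overlaps(reads, min_overlap):
    """Return (a, b, length) for every exact overlap where the last
    `length` bases of read a equal the first `length` bases of read b."""
    ordered = sorted(reads)  # reads sharing a prefix are adjacent once sorted
    overlaps = []
    for a in reads:
        for length in range(min_overlap, len(a)):
            suffix = a[-length:]
            # Binary search for the first read >= suffix, then walk the
            # contiguous block of reads that start with that suffix.
            i = bisect_left(ordered, suffix)
            while i < len(ordered) and ordered[i].startswith(suffix):
                b = ordered[i]
                if b != a:
                    overlaps.append((a, b, length))
                i += 1
    return overlaps


if __name__ == "__main__":
    for overlap in exact_overlaps(["ACGTAC", "GTACGG", "CGGTTA"], 3):
        print(overlap)
```

The resulting triples are the edges of the overlap graph; removing transitive and spurious edges (the layout step) is not shown.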
Example:
$ edena -r prefix.seq -p ovl
$ edena -e prefix.ovl -p prefix
$ ln -s prefix_contigs.fasta prefix.fasta
$ cat *_info.txt  # assembly stats
Comparative
AMOScmp
- increase the make-consensus error rate if the reference and the query are not very similar
- the assembly is very fragmented if %id < 98
- Pa 10K subset:
  <98% : 2X contigs
  <96% : 10X contigs
  <96% : 50X contigs
- not all reads from the layout are included in the consensus
- tigger handles overlaps as short as 5bp well
AMOScmp-shortReads-alignmentTrimmed
Defaults:
MINCLUSTER  = 16
MINMATCH    = 16
MINLEN      = 24  # delta-filter -l 24
MINOVL      = 5
MAXTRIM     = 10
MAJORITY    = 50
CONSERR     = 0.06
ALIGNWIGGLE = 2
* Reads are trimmed based on their nucmer alignment to a reference genome; reads adjacent to 0-coverage regions are not trimmed
* If reads are not trimmed, CONSERR should be increased; otherwise make-consensus fails
* If reads are not trimmed, the average contig length is significantly shorter than when reads are trimmed based on alignments => read trimming is important
AMOScmp-shortReads
Defaults:
MINCLUSTER  = 20
MINMATCH    = 20
MINOVL      = 5
MAXTRIM     = 10
MAJORITY    = 50
CONSERR     = 0.06
ALIGNWIGGLE = 2
* No read trimming is done
Other Scripts
EMBOSS megamerge
* Merges contigs with short overlaps well
* Does not work well when the overlap identity is low
* Min ovl length = 2
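Merging two contigs on a short end overlap can be sketched as below. This is not how EMBOSS megamerge works internally (it aligns the sequences); the sketch only accepts exact suffix/prefix matches, which is the simplest case, and the function name `merge_by_overlap` is hypothetical.

```python
def merge_by_overlap(left, right, min_overlap=2):
    """Join `left` and `right` on their longest exact suffix/prefix overlap.

    Returns the merged sequence, or None when no overlap of at least
    `min_overlap` bases exists.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    longest = min(len(left), len(right))
    for length in range(longest, min_overlap - 1, -1):
        if left[-length:] == right[:length]:
            return left + right[length:]
    return None


if __name__ == "__main__":
    print(merge_by_overlap("ACGTTG", "TGCA"))
    print(merge_by_overlap("ACGT", "TTTT"))
```

With a minimum as low as 2 bp, many overlaps will match by chance, so the merged result should be checked against other evidence.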
Cases
No reference sequence
One data set, multiple denovo assemblers
Example:
* Solexa data
* edena & velvet assemblers
Solution:
* combine the contigs of the 2 assemblies
* run minimus on them
Multiple data sets, one (or multiple) denovo assemblers
Example:
* Solexa & 454 data
* velvet assembler for each set
One reference sequence
Few indels, few rearrangements
Solution:
* AMOScmp
* If there are many negative gaps, try to join the contigs further (fastaMerge.pl $PREFIX.fasta)
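A negative gap is read here as two neighbouring contigs whose placements on the reference overlap, which makes them candidates for joining. The sketch below lists such pairs; the `(name, start, end)` tuples with an exclusive end coordinate and both function names are assumptions made for illustration, not AMOScmp output.

```python
def gaps_between(placements):
    """Given (name, start, end) contig placements on one reference,
    return (left_name, right_name, gap) for each adjacent pair."""
    ordered = sorted(placements, key=lambda p: p[1])
    return [
        (a[0], b[0], b[1] - a[2])  # gap = next start minus previous end
        for a, b in zip(ordered, ordered[1:])
    ]


def negative_gaps(placements):
    """Keep only the adjacent pairs that overlap on the reference."""
    return [g for g in gaps_between(placements) if g[2] < 0]


if __name__ == "__main__":
    contigs = [("c2", 950, 2000), ("c1", 0, 1000), ("c3", 2100, 3000)]
    print(negative_gaps(contigs))
```

Here `c1` and `c2` overlap by 50 bases on the reference, while `c2` and `c3` are separated by a true gap of 100 bases.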
Many indels, few rearrangements
Few indels, many rearrangements
Multiple reference sequences
Examples
Pseudomonas_syringae
Reference:
Name         Length   %GC
NC_004578.1  6397126  58.40
NC_004633.1  73661    55.15
NC_004632.1  67473    56.17
Repeats:
desc    #repeats  min  max   mean    stdev    sum
50bp+   991       50   7362  393.73  792.41   390192
100bp+  429       100  7362  815.36  1060.29  349793
Solexa reads
Type    #reads   min  max  mean  coverage
Solexa  6340136  32   32   32    ~31x
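The coverage figure follows from number of reads x read length / genome length. A minimal check using the reference lengths and read counts listed on this page (the function name `coverage` is just an illustrative helper):

```python
def coverage(n_reads, read_length, genome_length):
    """Average depth of coverage for fixed-length reads."""
    return n_reads * read_length / genome_length


# Chromosome plus the two plasmids of the P. syringae reference.
reference_lengths = [6397126, 73661, 67473]
genome_length = sum(reference_lengths)

solexa_coverage = coverage(6340136, 32, genome_length)

if __name__ == "__main__":
    print(f"genome length: {genome_length}")
    print(f"Solexa coverage: {solexa_coverage:.2f}x")
```

The same formula applied to the P. aeruginosa reads (8627900 reads of 33 bp against the 6537648 bp PA14 reference) gives roughly 43x.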
Assemblies:
Assembler  type         input-data  #reads   #ctgs  min  max     mean      stdev     ctgs-sum  #singletons
AMOScmp    comparative  Solexa      6340136  187    20   577929  34863.06  91692.34  6519394   698638 (11%)
velvet     denovo       Solexa      6340136  25161  45   5057    241.83    212.61    6084887
edena      denovo       Solexa      6340136  14084  100  5075    210.92    145.68    2970720   4893301 (77%)
Merged assemblies(contigs&singletons):
Assembler       type    input-data             #input-ctgs  #ctgs  min  max     mean     stdev   ctgs-sum  comments
AMOScmp-merged  ?       AMOScmp(contigs)       187          166    20   804024  39272.2  121124  6519189   #merged 187-166=21 negative gaps out of a total of 32
minimus(ovl20)  denovo  velvet(contigs)        25161        19121  45   5057    311.3    297.27  5952381   #merged 25161-19121=6040 (24%) gaps
minimus(ovl15)  denovo  velvet(contigs)        25161        16343  45   9903    361.32   359.78  5905143   #merged 25161-16343=8818 (35%) gaps
minimus(ovl40)  denovo  edena+velvet(contigs)  39245        23644  45   6688    257.15   232.94  6080063   very few 40bp overlaps are found
minimus(ovl20)  denovo  edena+velvet(contigs)  39245        18603  45   6688    322.32   311.02  5996244
Simulated 32bp exact match reads
Type         #reads   min  max  mean  coverage
Sim(ulated)  6538167  32   32   32    32x
Single assemblies:
Assembler   type    input-data  #reads   #ctgs  min  max    mean     stdev    ctgs-sum  #singletons
edena-sim   denovo  Sim         6538167  2068   100  47881  2994.03  4857.76  6191673   198699 (3%)
velvet-sim  denovo  Sim         6538167  2207   45   56810  2820.91  5348.36  6225757   123591 (2%)
454 reads
Type  #reads  min  max  mean
454   77466   35   371  240
Pseudomonas aeruginosa b1 (PAb1)
References:
Name   Length   %GC
PA14   6537648  66.29
PACS2  6492423  66.33
PAO1   6264404  66.56
...
Solexa reads
Type    #reads   min  max  mean  coverage
Solexa  8627900  33   33   33    ~43X
Assemblies:
All contigs:
Assembler       type         #ctgs   min  max     mean     stdev     ctgs-sum  #singletons    comments
AMOScmp-PA14    comparative  2053    17   170485  3011.84  11917.53  6183320   1127399
AMOScmp-PAO1    comparative  2797    17   75626   2161.19  5812.2    6044851   1592525
AMOScmp-PA2192  comparative  5816    17   133129  1072.8   3725.22   6239454   1601299        largest assembly
maq-PA14        comparative  991     33   155551  6199.79  17445.05  6143996   1197385
velvet          denovo       10684   45   16239   640.34   825.24    6841458   1241079        much better than Ps
edena           denovo       11180   100  11300   552.36   610.52    6175460   3955865 (46%)  much better than Ps
ssake           denovo       185030  34   5490    77.21    141.23    14287079  3056893
200bp+ contigs:
Assembler       type         #ctgs  min  max     mean      stdev     ctgs-sum
AMOScmp-PA14    comparative  428    203  170485  14262.09  22852.74  6104175
AMOScmp-PAO1    comparative  865    200  75626   6893.96   8766.63   5963278
AMOScmp-PA2192  comparative  1299   200  133129  4683.46   6735.52   6083817
maq-PA14        comparative  368    200  155551  16581.7   25475.92  6102067
velvet          denovo       7382   200  16239   877.05    896.35    6474426
edena           denovo       8316   200  11300   692.54    651.24    5759209
ssake           denovo       12532  200  5490    486       329.93    6090567
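The columns in these tables (#ctgs, min, max, mean, stdev, ctgs-sum), with an optional minimum contig length such as 200 bp, can be reproduced from a list of contig lengths. The sketch below is a stand-in and not the script used to build the tables; the function name `summarize` is hypothetical, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def summarize(lengths, min_length=0):
    """Summary statistics for contig lengths of at least `min_length`."""
    kept = [n for n in lengths if n >= min_length]
    if not kept:
        return None
    return {
        "count": len(kept),
        "min": min(kept),
        "max": max(kept),
        "mean": round(mean(kept), 2),
        # Sample standard deviation (assumption); undefined for one value.
        "stdev": round(stdev(kept), 2) if len(kept) > 1 else 0.0,
        "sum": sum(kept),
    }


if __name__ == "__main__":
    print(summarize([100, 200, 300, 50], min_length=100))
```

Calling `summarize(lengths, min_length=200)` corresponds to the 200bp+ view of an assembly.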
Merged assemblies(contigs&singletons):
Assembler           type  input-data          #input-ctgs  #ctgs  min  max     mean    stdev     ctgs-sum  comments
AMOScmp-PA14-merge  ?     AMOScmp-PA14(ctgs)  2053         1931   17   170485  3201.7  12981.57  6182486   2053-1931=122 gaps closed