Assembly merge
Assemblers
De novo
Minimus
* hash-overlap minimum overlap: 40 bp by default, which is too large for merging contig assemblies
* 20 bp is the lowest minimum overlap accepted, and the minimizer window length must be >= 15 bp; could these limits be lowered?
* very slow on large sequences (e.g. Ps.fasta, Ps.plasmid.fasta), even with USE_SIMPLE_OVERLAP=1; the reason is still unclear
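To see why a 40 bp minimum can be too strict for short contigs, here is a minimal Python sketch of an exact suffix-prefix overlap test. It is not the hash-overlap algorithm itself; the function name `suffix_prefix_overlap` and the toy sequences are made up for illustration.

```python
def suffix_prefix_overlap(a: str, b: str, min_overlap: int = 20) -> int:
    """Length of the longest exact overlap between a suffix of `a` and a
    prefix of `b` that is at least `min_overlap` long, or 0 if none exists."""
    max_len = min(len(a), len(b))
    # Try the longest candidate first so the first hit is the best one.
    for length in range(max_len, min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


# Toy contigs (hypothetical) that share a 23 bp end.
shared = "TTAGCCGATAGGCTAACGTCAGT"
contig_a = "GGGGACCA" + shared
contig_b = shared + "CCCTTTAA"

overlap_20 = suffix_prefix_overlap(contig_a, contig_b, min_overlap=20)
overlap_40 = suffix_prefix_overlap(contig_a, contig_b, min_overlap=40)
print(overlap_20, overlap_40)  # the 23 bp overlap is found at 20, missed at 40
```

With a 20 bp threshold the two toy contigs can be joined; with a 40 bp threshold the same pair is left unmerged.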
Velvet
* overlap: 18 bp usually gives the fewest contigs
* 15 bp is too low => too many short contigs
Edena
* contigs don't overlap
Comparative
AMOScmp
Cases
No reference sequence
One data set, multiple de novo assemblers
Example:
* Solexa data
* edena & velvet assemblers
Solution:
* merge the contigs of the 2 assemblies into one set
* run minimus on them
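The first step can be sketched in Python as below. This assumes the two assemblers may emit clashing contig names, so each header is prefixed with the assembler name before the sets are combined; the helper names `tag_fasta_ids` and `combine_assemblies` are hypothetical, and the step of running minimus on the result is not shown.

```python
from typing import Dict


def tag_fasta_ids(fasta_text: str, tag: str) -> str:
    """Prefix every FASTA header with `tag` so contig names stay unique."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            out.append(f">{tag}_{line[1:]}")
        else:
            out.append(line)
    return "\n".join(out) + "\n"


def combine_assemblies(assemblies: Dict[str, str]) -> str:
    """Concatenate several contig sets (FASTA text keyed by assembler name)."""
    return "".join(tag_fasta_ids(text, tag) for tag, text in assemblies.items())


# Toy input (hypothetical): both assemblers happen to name a contig "contig1".
combined = combine_assemblies({
    "edena": ">contig1\nACGT\n",
    "velvet": ">contig1\nTTGA\n",
})
print(combined)
```

The combined text can then be written to a single FASTA file and used as the input to the merge step.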
Multiple data sets, one (or multiple) de novo assemblers
Example:
* Solexa & 454 data
* velvet assembler for each set
One reference sequence
Few indels, few rearrangements
Solution:
* AMOScmp
* If there are many negative gaps, try to join the contigs further (fastaMerge.pl $PREFIX.fasta)
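A rough way to count negative gaps, sketched in Python. It assumes that a negative gap means two neighbouring contigs overlap in reference coordinates, and that each contig placement is a 0-based half-open `(start, end)` interval; the function name and the toy coordinates are made up for illustration.

```python
from typing import List, Tuple


def gaps_between(placements: List[Tuple[int, int]]) -> List[int]:
    """Gap sizes between consecutive contigs placed on one reference.

    A negative value means the two neighbouring contigs overlap.
    """
    ordered = sorted(placements)
    return [nxt[0] - prev[1] for prev, nxt in zip(ordered, ordered[1:])]


# Toy placements (hypothetical): the first two contigs overlap by 50 bp.
gaps = gaps_between([(0, 1000), (950, 2000), (2100, 3000)])
negative = [g for g in gaps if g < 0]
print(gaps, len(negative))
```

Contig pairs separated by a negative gap are the candidates for further joining.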
Many indels, few rearrangements
Few indels, many rearrangements
Multiple reference sequences
Examples
Pseudomonas_syringae
Reference:
Name         Length   %GC
NC_004578.1  6397126  58.40
NC_004633.1  73661    55.15
NC_004632.1  67473    56.17
Repeats:
desc    #repeats  min  max   mean    stdev    sum
50bp+   991       50   7362  393.73  792.41   390192
100bp+  429       100  7362  815.36  1060.29  349793
Solexa reads
Type    #reads   min  max  mean
Solexa  6340136  32   32   32    (~31x coverage)
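The coverage figure follows from reads x read length / reference length. A small Python check, assuming the reference length is the sum of the three NC_* sequences listed under Reference (the function name is hypothetical):

```python
def coverage(num_reads: int, read_len: float, genome_len: int) -> float:
    """Average depth of coverage: total sequenced bases / reference length."""
    return num_reads * read_len / genome_len


# Chromosome plus the two smaller sequences from the reference table.
genome_len = 6397126 + 73661 + 67473

solexa_cov = coverage(6340136, 32, genome_len)
print(round(solexa_cov, 2))  # about 31x
```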
Assemblies:
Assembler  type         input-data  #reads   #ctgs  min  max     mean      stdev     ctgs-sum  #singletons
AMOScmp    comparative  Solexa      6340136  187    20   577929  34863.06  91692.34  6519394   698638 (11%)
velvet     denovo       Solexa      6340136  25161  45   5057    241.83    212.61    6084887
edena      denovo       Solexa      6340136  14084  100  5075    210.92    145.68    2970720   4893301 (77%)
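The per-assembly columns (#ctgs, min, max, mean, stdev, ctgs-sum) can be recomputed from a list of contig lengths. The Python sketch below assumes a sample standard deviation; whether the tables on this page use the sample or the population form is not stated. The function name and the toy lengths are made up for illustration.

```python
import statistics
from typing import Dict, List


def contig_stats(lengths: List[int]) -> Dict[str, float]:
    """Summary statistics for a set of contig lengths."""
    return {
        "ctgs": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        # Sample standard deviation (assumption); needs at least 2 contigs.
        "stdev": statistics.stdev(lengths) if len(lengths) > 1 else 0.0,
        "sum": sum(lengths),
    }


# Toy contig lengths (hypothetical), not taken from the assemblies above.
stats = contig_stats([100, 200, 300, 400])
print(stats)
```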
Merged assemblies(contigs&singletons):
assemblers      type    input-data              #input-ctgs  #ctgs  min  max     mean     stdev   ctgs-sum  comments
AMOScmp-merged  ?       AMOScmp(contigs)        187          166    20   804024  39272.2  121124  6519189   #merged 187-166=21 negative gaps out of a total of 32
minimus(ovl20)  denovo  velvet(contigs)         25161        19121  45   5057    311.3    297.27  5952381   #merged 25161-19121=6040 (24%) gaps
minimus(ovl15)  denovo  velvet(contigs)         25161        16343  45   9903    361.32   359.78  5905143   #merged 25161-16343=8818 (35%) gaps
minimus(ovl40)  denovo  edena+velvet(contigs)   39245        23644  45   6688    257.15   232.94  6080063   #very few 40bp overlaps are found
minimus(ovl20)  denovo  edena+velvet(contigs)   39245        18603  45   6688    322.32   311.02  5996244
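The "#merged" figures are the drop in contig count after merging, also expressed as a share of the input contigs. A short Python sketch of that arithmetic (the function name is hypothetical; the counts are the velvet rows above):

```python
from typing import Tuple


def merge_reduction(ctgs_before: int, ctgs_after: int) -> Tuple[int, float]:
    """Number of joins and their share of the input contigs, in percent."""
    merged = ctgs_before - ctgs_after
    return merged, 100.0 * merged / ctgs_before


merged_20, pct_20 = merge_reduction(25161, 19121)  # minimus, 20 bp overlap
merged_15, pct_15 = merge_reduction(25161, 16343)  # minimus, 15 bp overlap
print(merged_20, round(pct_20, 1))
print(merged_15, round(pct_15, 1))
```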
Simulated 32bp exact match reads
Type         #reads   min  max  mean
Sim(ulated)  6538167  32   32   32    (32x coverage)
Single assemblies:
Assembler   type    input-data  #reads   #ctgs  min  max    mean     stdev    ctgs-sum  #singletons
edena-sim   denovo  Sim         6538167  2068   100  47881  2994.03  4857.76  6191673   198699 (3%)
velvet-sim  denovo  Sim         6538167  2207   45   56810  2820.91  5348.36  6225757   123591 (2%)
454 reads
Type  #reads  min  max  mean
454   77466   35   371  240
Pseudomonas aeruginosa b1
References:
Name   Length   %GC
PA14   6537648  66.29
PACS2  6492423  66.33
PAO1   6264404  66.56
...
Solexa reads
Type    #reads   min  max  mean
Solexa  8627900  33   33   33
Assemblies:
Assembler     type         input-data  #reads   #ctgs  min  max     mean     stdev     ctgs-sum  #singletons
AMOScmp-PA14  comparative  Solexa      8627900  2053   17   170485  3011.84  11917.53  6183320   1127399
AMOScmp-PAO1  comparative  Solexa      8627900  2797   17   75626   2161.19  5812.2    6044851   1592525
velvet        denovo       Solexa      8627900  10684  45   16239   640.34   825.24    6841458
edena         denovo       Solexa      8627900  11180  100  11300   552.36   610.52    6175460