Assembly merge
		
		
		
Revision as of 16:35, 28 March 2008
Assemblers
Denovo
Minimus
* hash-overlap minimum overlap:
    40 bp (default): too large for merging contig assemblies
    20 bp: minimum overlap allowed; the minimizer window length must be >= 15 bp (could these limits be lowered?)
    very slow on large sequences (e.g. Ps.fasta, Ps.plasmid.fasta), even with USE_SIMPLE_OVERLAP=1; cause unknown
Velvet
* overlap:
    18bp usually gives fewest contigs
    15bp is too low => too many short contigs
Edena
* contigs don't overlap
Comparative
AMOScmp
Cases
No reference sequence
One data set, multiple denovo assemblers
Example:
* Solexa data
* edena & velvet assemblers
Solution:
* merge the contigs of the 2 assemblies
* run minimus on them
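The first step above (pooling the contigs of two assemblies into one input set) can be sketched as follows. This is a minimal illustration, not the script actually used here: the function names `read_fasta`, `pool_contigs` and `to_fasta` are hypothetical, and prefixing each contig name with its assembler label is an assumption made to keep names unique when both assemblers emit, say, `contig1`. Converting the pooled FASTA to the AMOS format and running minimus is not shown.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the contig name.
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def pool_contigs(sets: Dict[str, Iterable[str]]) -> List[Tuple[str, str]]:
    """Combine several contig sets; prefix names with the assembler label (assumption)."""
    pooled = []
    for label, lines in sets.items():
        for name, seq in read_fasta(lines):
            pooled.append((f"{label}_{name}", seq))
    return pooled


def to_fasta(records: Iterable[Tuple[str, str]]) -> str:
    """Serialize (name, sequence) pairs as single-line FASTA records."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


if __name__ == "__main__":
    # Toy input; in practice the values would be open file handles.
    edena = [">contig1", "ACGT", "AC"]
    velvet = [">contig1 len=4", "TTGA"]
    print(to_fasta(pool_contigs({"edena": edena, "velvet": velvet})), end="")
```

The pooled file is then the single contig set that minimus would be run on.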
Multiple data sets, one (or multiple) denovo assemblers
Example:
* Solexa & 454 data
* velvet assemblers for each set
One reference sequence
Few indels, few rearrangements
Solution:
* AMOScmp
* If there are many negative gaps, try to further join contigs (fastaMerge.pl $PREFIX.fasta)
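To decide whether further joining is worthwhile, the negative gaps can be counted from the contig placements on the reference. The sketch below assumes that a "negative gap" means two neighbouring contigs whose reference placements overlap, and that placements are available as (name, start, end) tuples with an exclusive end. Both the input layout and the function names are hypothetical, not taken from AMOScmp output.

```python
from typing import List, Tuple

# (contig name, start, end) on the reference; end is exclusive (assumed layout)
Placement = Tuple[str, int, int]


def gaps_between(placements: List[Placement]) -> List[Tuple[str, str, int]]:
    """Return (left contig, right contig, gap size) for neighbouring contigs."""
    ordered = sorted(placements, key=lambda p: (p[1], p[2]))
    gaps = []
    for (left, _, left_end), (right, right_start, _) in zip(ordered, ordered[1:]):
        # A negative value means the two placements overlap on the reference.
        gaps.append((left, right, right_start - left_end))
    return gaps


def negative_gaps(placements: List[Placement]) -> List[Tuple[str, str, int]]:
    """Keep only the neighbouring pairs that overlap (candidates for joining)."""
    return [gap for gap in gaps_between(placements) if gap[2] < 0]


if __name__ == "__main__":
    placements = [("c1", 0, 100), ("c2", 90, 200), ("c3", 250, 300)]
    print(negative_gaps(placements))
```

Each pair reported by `negative_gaps` is a candidate for a join, since the two contigs cover a common stretch of the reference.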
Many indels, few rearrangements
Few indels, many rearrangements
Multiple reference sequences
Examples
Pseudomonas_syringae
Reference:
   Name         Length   %GC
   NC_004578.1  6397126  58.40
   NC_004633.1  73661    55.15
   NC_004632.1  67473    56.17
Repeats:
   desc    #repeats  min  max   mean    stdev    sum
   50bp+   991       50   7362  393.73  792.41   390192
   100bp+  429       100  7362  815.36  1060.29  349793
Solexa reads
   Type    #reads   min  max  mean
   Solexa  6340136  32   32   32    (~31x coverage)
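The coverage figure can be reproduced as reads × read length / genome length. The sketch below assumes the genome length is the sum of the three reference sequences listed above (NC_004578.1, NC_004633.1 and NC_004632.1); the function name `coverage` is only illustrative.

```python
def coverage(num_reads: int, read_length: float, genome_length: int) -> float:
    """Average depth of coverage: total sequenced bases over genome length."""
    return num_reads * read_length / genome_length


# Lengths of the three reference sequences (assumed to make up the genome).
reference_lengths = [6397126, 73661, 67473]
genome_length = sum(reference_lengths)

solexa = coverage(6340136, 32, genome_length)

if __name__ == "__main__":
    print(f"genome length: {genome_length} bp")
    print(f"Solexa coverage: {solexa:.1f}x")
```

With these inputs the result is about 31x, matching the figure quoted in the table.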
Assemblies:
   Assembler   type         input-data  #reads   #ctgs  min  max     mean      stdev     ctgs-sum  #singletons
   AMOScmp     comparative  Solexa      6340136  187    20   577929  34863.06  91692.34  6519394   698638(11%)
   velvet      denovo       Solexa      6340136  25161  45   5057    241.83    212.61    6084887
   edena       denovo       Solexa      6340136  14084  100  5075    210.92    145.68    2970720   4893301(77%)
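The per-assembly columns (#ctgs, min, max, mean, stdev, ctgs-sum) can be derived from the list of contig lengths. This is a minimal sketch, not the script that produced the table: the function name `contig_stats` is hypothetical, and the use of the sample standard deviation is an assumption, since the notes do not say which estimator was used.

```python
import statistics
from typing import Dict, List


def contig_stats(lengths: List[int]) -> Dict[str, float]:
    """Summarize contig lengths using the column names of the assembly tables."""
    if not lengths:
        raise ValueError("no contigs given")
    return {
        "#ctgs": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": round(statistics.mean(lengths), 2),
        # Sample standard deviation (assumption); undefined for a single contig.
        "stdev": round(statistics.stdev(lengths), 2) if len(lengths) > 1 else 0.0,
        "ctgs-sum": sum(lengths),
    }


if __name__ == "__main__":
    print(contig_stats([100, 200, 300]))
```

A quick consistency check on any row is that #ctgs × mean should be close to ctgs-sum.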
Merged assemblies(contigs&singletons):
   assemblers      type    input-data             #reads  #ctgs  min  max     mean     stdev   ctgs-sum  comments
   AMOScmp-merged  ?       AMOScmp(contigs)       187     166    20   804024  39272.2  121124  6519189   #merged 187-166=21 negative gaps out of a total of 32
   minimus(ovl20)  denovo  velvet(contigs)        25161   19121  45   5057    311.3    297.27  5952381   #merged 25161-19121=6040 (24%) gaps
   minimus(ovl15)  denovo  velvet(contigs)        25161   16343  45   9903    361.32   359.78  5905143   #merged 25161-16343=8818 (35%) gaps
   minimus(ovl40)  denovo  edena+velvet(contigs)  39245   23644  45   6688    257.15   232.94  6080063   #very few 40bp overlaps are found
   minimus(ovl20)  denovo  edena+velvet(contigs)  39245   18603  45   6688    322.32   311.02  5996244
Simulated 32bp exact match reads
   Type         #reads   min  max  mean
   Sim(ulated)  6538167  32   32   32    (32x coverage)
Single assemblies:
   Assembler   type    input-data  #reads   #ctgs  min  max    mean     stdev    ctgs-sum  #singletons
   edena-sim   denovo  Sim         6538167  2068   100  47881  2994.03  4857.76  6191673   198699(3%)
   velvet-sim  denovo  Sim         6538167  2207   45   56810  2820.91  5348.36  6225757   123591(2%)
454 reads
   Type  #reads  min  max  mean
   454   77466   35   371  240
Pseudomonas aeruginosa b1
References:
   Name   Length   %GC
   PA14   6537648  66.29
   PACS2  6492423  66.33
   PAO1   6264404  66.56
   ...
Solexa reads
   Type    #reads   min  max  mean
   Solexa  8627900  33   33   33
Assemblies:
Assembler type input-data #reads #ctgs min max mean stdev ctgs-sum #singletons