Assembly merge

Revision as of 16:35, 28 March 2008

Assemblers

Denovo

Minimus

* hash-overlap:
    40 bp default: too large for merging contig assemblies
    20 bp minimum overlap; minimizer window length must be >= 15 bp; could these values be dropped lower?
    very slow on large sequences (e.g. Ps.fasta, Ps.plasmid.fasta) even with USE_SIMPLE_OVERLAP=1; cause unknown
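
To illustrate why the minimum-overlap setting matters when merging short contigs, here is a minimal sketch of an exact suffix-prefix overlap check. It is not how minimus's hash-overlap works (that tool tolerates mismatches and uses minimizers); the function name and the toy sequences are made up for illustration.

```python
def suffix_prefix_overlap(a: str, b: str, min_overlap: int) -> int:
    """Length of the longest exact overlap between a suffix of `a`
    and a prefix of `b`; 0 if no overlap reaches `min_overlap`."""
    max_len = min(len(a), len(b))
    # Try the longest candidate first and stop at the first match.
    for length in range(max_len, min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


# Two toy contigs that share exactly 25 bp (hypothetical data).
shared = "ACGTTGCAAGCTTAGGCATCGATCC"
left = "TTTTTTTTTT" + shared
right = shared + "GGGGGGGGGG"

print(suffix_prefix_overlap(left, right, min_overlap=40))  # overlap too short for a 40 bp cutoff
print(suffix_prefix_overlap(left, right, min_overlap=20))  # accepted with a 20 bp cutoff
```

With a 40 bp cutoff the 25 bp overlap is rejected, while a 20 bp cutoff accepts it. Contigs that are only a few hundred bp long rarely share 40 bp ends.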

Velvet

* overlap:
    18bp usually gives fewest contigs
    15bp is too low => too many short contigs

Edena

 * contigs don't overlap

Comparative

AMOScmp

Cases

No reference sequence

One data set, multiple denovo assemblers

Example:

 * Solexa data
 * edena & velvet assemblers

Solution:

 * merge 2 assembly contigs
 * run minimus on them
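
A possible sketch of the first step, pooling the contigs of two assemblies into one set before handing them to minimus. Contig names are prefixed with an assembler label so that identical IDs from different assemblers do not collide. The helper names and the toy records are hypothetical; the minimus run itself is not shown.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            # Keep only the ID, drop the free-text description.
            name, chunks = (fields[0] if fields else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def combine_assemblies(assemblies: Dict[str, Iterable[str]]) -> List[Tuple[str, str]]:
    """Pool contigs from several assemblies, prefixing each ID with its label."""
    combined = []
    for label, lines in assemblies.items():
        for name, seq in read_fasta(lines):
            combined.append((f"{label}_{name}", seq))
    return combined


# Toy input: both assemblers happen to call their first contig "c1".
pooled = combine_assemblies({
    "edena": [">c1", "ACGT", "AC"],
    "velvet": [">c1 len=4", "GGTT"],
})
for name, seq in pooled:
    print(f">{name}\n{seq}")
```

In practice the `lines` values would be open file handles for the edena and velvet contig files.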

Multiple data sets, one (or multiple) denovo assemblers

Example:

 * Solexa & 454 data
 * velvet assembler for each set

One reference sequence

Few indels, few rearrangements

Solution:

 * AMOScmp
 * If there are many negative gaps, try to join the contigs further (fastaMerge.pl $PREFIX.fasta)

Many indels, few rearrangements

Few indels, many rearrangements

Multiple reference sequences


Examples

Pseudomonas_syringae

Reference:

 Name           Length  %GC
 NC_004578.1    6397126 58.40
 NC_004633.1    73661   55.15
 NC_004632.1    67473   56.17

Repeats:

 desc    #repeats   min     max     mean    stdev    sum
 50bp+   991        50      7362    393.73  792.41   390192
 100bp+  429        100     7362    815.36  1060.29  349793

Solexa reads

 Type            #reads       min     max     mean
 Solexa          6340136      32      32      32   (~31x coverage)
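
The coverage figure can be checked as reads x read length / genome length. The sketch below assumes the genome length is the sum of the three reference sequences listed above (chromosome plus both plasmids); the page does not say which length was actually used.

```python
# Reference lengths copied from the table above.
REFERENCE_LENGTHS = {
    "NC_004578.1": 6397126,
    "NC_004633.1": 73661,
    "NC_004632.1": 67473,
}


def coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    """Average depth of coverage, assuming every read is full length."""
    return num_reads * read_length / genome_length


genome = sum(REFERENCE_LENGTHS.values())
print(f"{coverage(6340136, 32, genome):.1f}x")  # 31.0x
```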

Assemblies:

 Assembler   type         input-data  #reads         #ctgs   min     max     mean      stdev     ctgs-sum      #singletons       
 AMOScmp     comparative  Solexa      6340136        187     20      577929  34863.06  91692.34  6519394       698638(11%)
 edena       denovo       Solexa      6340136        14084   100     5075    210.92    145.68    2970720       4893301(77%)
 velvet      denovo       Solexa      6340136        25161   45      5057    241.83    212.61    6084887
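
For reference, the length columns (#ctgs, min, max, mean, stdev, ctgs-sum) can be computed from a list of contig lengths as sketched below. Whether the tables on this page use the sample or the population standard deviation is not stated; the sketch assumes the sample version, and the function name is made up.

```python
import statistics
from typing import Dict, List


def contig_stats(lengths: List[int]) -> Dict[str, float]:
    """Summary statistics matching the column names used in the tables."""
    return {
        "#ctgs": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": round(statistics.mean(lengths), 2),
        # Sample standard deviation (assumption); undefined for a single contig.
        "stdev": round(statistics.stdev(lengths), 2) if len(lengths) > 1 else 0.0,
        "ctgs-sum": sum(lengths),
    }


# Toy contig lengths, not taken from any assembly on this page.
print(contig_stats([100, 200, 300]))
```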

Merged assemblies(contigs&singletons):

 assemblers     type         input-data                #reads  #ctgs   min     max     mean     stdev   ctgs-sum   comments    
 AMOScmp-merged ?            AMOScmp(contigs)          187     166     20      804024  39272.2  121124  6519189    #merged 187-166=21 negative gaps out of a total of 32; min 5 bp overlap & 80% identity required
 minimus(ovl40) denovo       edena+velvet(contigs)     39245   23644   45      6688    257.15   232.94  6080063    #very few 40bp overlaps are found
 minimus(ovl20) denovo       edena+velvet(contigs)     39245   18603   45      6688    322.32   311.02  5996244
 minimus(ovl20) denovo       velvet(contigs)           25161   19121   45      5057    311.3    297.27  5952381    #merged 25161-19121=6040 (24%) gaps
 minimus(ovl15) denovo       velvet(contigs)           25161   16343   45      9903    361.32   359.78  5905143    #merged 25161-16343=8818 (35%) gaps

Simulated 32bp exact match reads

 Type            #reads       min     max     mean
 Sim(ulated)     6538167      32      32      32   (32x coverage)
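
The page does not say how the simulated reads were generated. One simple scheme, sketched here as an assumption, is to take one error-free forward-strand read at every start position of each linear reference sequence. The read count this gives for the three reference lengths listed above is the same as the #reads value in the table, which is consistent with that scheme but does not prove it was used.

```python
from typing import Iterator


def exact_reads(sequence: str, read_length: int = 32) -> Iterator[str]:
    """Yield every substring of `read_length` bp, one per start position."""
    for start in range(len(sequence) - read_length + 1):
        yield sequence[start:start + read_length]


def num_reads(seq_length: int, read_length: int = 32) -> int:
    """Number of reads exact_reads() yields for a sequence of this length."""
    return max(seq_length - read_length + 1, 0)


# Reference lengths from the table above (treated as linear, an assumption).
lengths = [6397126, 73661, 67473]
total = sum(num_reads(n) for n in lengths)
print(total)  # 6538167

# Tiny demonstration on a toy sequence.
print(list(exact_reads("ACGTAC", read_length=4)))  # ['ACGT', 'CGTA', 'GTAC']
```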

Single assemblies:

 Assembler   type         input-data  #reads         #ctgs   min     max     mean      stdev     ctgs-sum      #singletons       
 edena-sim   denovo       Sim         6538167        2068    100     47881   2994.03   4857.76   6191673       198699(3%)
 velvet-sim  denovo       Sim         6538167        2207    45      56810   2820.91   5348.36   6225757       123591(2%)

454 reads

 Type            #reads       min     max     mean
 454             77466        35      371     240


Pseudomonas aeruginosa b1

References:

 Name           Length  %GC
 PA14           6537648 66.29
 PACS2          6492423 66.33
 PAO1           6264404 66.56
 ...

Solexa reads

 Type            #reads       min     max     mean
 Solexa          8627900      33      33      33 

Assemblies:

 Assembler   type         input-data  #reads         #ctgs   min     max     mean      stdev     ctgs-sum      #singletons