Assembly merge: Difference between revisions

Revision as of 16:43, 28 March 2008

Assemblers

Denovo

Minimus

* hash-overlap minimum overlap:
    40 bp default: too large for contig assemblies
    20 bp minimum overlap; the minimizer window length must be >= 15 bp; could these values be lowered?
    very slow on large sequences (e.g. Ps.fasta, Ps.plasmid.fasta) even with USE_SIMPLE_OVERLAP=1; the cause is unknown
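
To illustrate why the minimum-overlap threshold matters when joining contigs, here is a minimal sketch of an exact suffix/prefix overlap check. It is not how hash-overlap works internally (that tool uses minimizers and tolerates mismatches); the function name and the brute-force search are assumptions made for illustration only.

```python
def suffix_prefix_overlap(left: str, right: str, min_overlap: int = 20) -> int:
    """Length of the longest exact match between the end of `left` and the
    start of `right`, or 0 if no match of at least `min_overlap` bases exists."""
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    max_len = min(len(left), len(right))
    # Try the longest candidate first so the first hit is the best one.
    for length in range(max_len, min_overlap - 1, -1):
        if left[-length:] == right[:length]:
            return length
    return 0
```

With two contigs that share only 14 bases, a 20 bp threshold reports no overlap while a 10 bp threshold finds it, which is the trade-off the notes above are asking about.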

Velvet

* overlap:
    18 bp usually gives the fewest contigs
    15 bp is too low => too many short contigs

Edena

 * contigs don't overlap

Comparative

AMOScmp

Cases

No reference sequence

One data set, multiple denovo assemblers

Example:

 * Solexa data
 * edena & velvet assemblers

Solution:

 * merge 2 assembly contigs
 * run minimus on them
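
The first step (pooling the contigs of two assemblers into one input file) could be sketched as below. This is a hedged example, not the script used for these notes: the function name, the label prefixes and the file names are hypothetical, and the conversion to the AMOS format and the Minimus run itself are left out. Contig IDs are prefixed with the assembler label so that identically named contigs from the two assemblers do not collide.

```python
def merge_contig_fastas(inputs, output):
    """Concatenate several contig FASTA files into one.

    inputs: dict mapping an assembler label to a FASTA path.
    output: path of the combined FASTA file.
    Returns the number of contig records written.
    """
    count = 0
    with open(output, "w") as out:
        for label, path in inputs.items():
            with open(path) as handle:
                for raw in handle:
                    line = raw.rstrip("\n")
                    if not line:
                        continue  # skip blank lines
                    if line.startswith(">"):
                        # Prefix the ID with the assembler label.
                        out.write(f">{label}_{line[1:]}\n")
                        count += 1
                    else:
                        out.write(line + "\n")
    return count
```

A call such as `merge_contig_fastas({"edena": "edena.fasta", "velvet": "velvet.fasta"}, "merged.fasta")` would produce the pooled contig file; the paths here are placeholders.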

Multiple data sets, one (or multiple) denovo assemblers

Example:

 * Solexa & 454 data
 * velvet assembler run on each set

One reference sequence

Few indels, few rearrangements

Solution:

 * AMOScmp
 * If there are many negative gaps, try to join the contigs further (fastaMerge.pl $PREFIX.fasta)
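
A negative gap here means that two neighbouring contigs overlap when placed on the reference. As a rough sketch of how such gaps could be counted, assuming contig placements are available as (start, end) coordinates with an exclusive end (this input format and the function name are assumptions, not AMOScmp output):

```python
def count_negative_gaps(placements):
    """Count gaps between consecutive contigs laid out on a reference.

    placements: list of (start, end) tuples, end exclusive.
    Returns (number_of_negative_gaps, total_number_of_gaps).
    """
    ordered = sorted(placements)
    # Gap size = start of the next contig minus end of the current one.
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(ordered, ordered[1:])]
    negative = sum(1 for gap in gaps if gap < 0)
    return negative, len(gaps)
```

Each negative gap is a candidate pair for a further join.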

Many indels, few rearrangements

Few indels, many rearrangements

Multiple reference sequences


Examples

Pseudomonas_syringae

Reference:

 Name           Length  %GC
 NC_004578.1    6397126 58.40
 NC_004633.1    73661   55.15
 NC_004632.1    67473   56.17

Repeats:

 desc    #repeats   min     max     mean    stdev    sum
 50bp+   991        50      7362    393.73  792.41   390192
 100bp+  429        100     7362    815.36  1060.29  349793

Solexa reads

 Type            #reads       min     max     mean
 Solexa          6340136      32      32      32   (~31x coverage)
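
The coverage estimate can be reproduced from the numbers above, assuming the genome length is the sum of the three reference sequences (chromosome plus two plasmids). The function name is illustrative.

```python
def fold_coverage(num_reads, read_length, genome_length):
    """Average depth of coverage: total sequenced bases / genome length."""
    return num_reads * read_length / genome_length

# Reference lengths from the table above.
genome_length = 6397126 + 73661 + 67473
print(round(fold_coverage(6340136, 32, genome_length), 1))  # 31.0
```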

Assemblies:

 Assembler   type         input-data  #reads         #ctgs   min     max     mean      stdev     ctgs-sum      #singletons       
 AMOScmp     comparative  Solexa      6340136        187     20      577929  34863.06  91692.34  6519394       698638(11%)
 velvet      denovo       Solexa      6340136        25161   45      5057    241.83    212.61    6084887
 edena       denovo       Solexa      6340136        14084   100     5075    210.92    145.68    2970720       4893301(77%)
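
Statistics like the ones in this table (#ctgs, min, max, mean, stdev, ctgs-sum) can be derived from a contig FASTA file. The sketch below is an assumption about how that might be done, not the script that produced the table; in particular it uses the sample standard deviation, and the notes do not say which definition was used.

```python
import statistics

def fasta_lengths(lines):
    """Return the sequence length of each record in an iterable of FASTA lines."""
    lengths = []
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            lengths.append(0)           # start a new record
        elif line and lengths:
            lengths[-1] += len(line)    # extend the current record
    return lengths

def contig_stats(lengths):
    """Summarise contig lengths in the same columns as the table."""
    return {
        "ctgs": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": round(statistics.mean(lengths), 2),
        # Sample standard deviation; needs at least two contigs.
        "stdev": round(statistics.stdev(lengths), 2) if len(lengths) > 1 else 0.0,
        "sum": sum(lengths),
    }
```

Usage would be along the lines of `contig_stats(fasta_lengths(open(path)))`, where `path` points to a contig FASTA file.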

Merged assemblies(contigs&singletons):

 assemblers     type         input-data                #reads  #ctgs   min     max     mean     stdev   ctgs-sum   comments    
 AMOScmp-merged ?            AMOScmp(contigs)          187     166     20      804024  39272.2  121124  6519189    #merged 187-166=21 negative gaps  out of a total of 32
 minimus(ovl20) denovo       velvet(contigs)           25161   19121   45      5057    311.3    297.27  5952381    #merged 25161-19121=6040 (24%) gaps
 minimus(ovl15) denovo       velvet(contigs)           25161   16343   45      9903    361.32   359.78  5905143    #merged 25161-16343=8818 (35%) gaps
 minimus(ovl40) denovo       edena+velvet(contigs)     39245   23644   45      6688    257.15   232.94  6080063    #very few 40bp overlaps are found
 minimus(ovl20) denovo       edena+velvet(contigs)     39245   18603   45      6688    322.32   311.02  5996244
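
The "#merged" figures in the comments column are the drop in contig count after the Minimus run, expressed as a share of the input contigs. A small sketch of that arithmetic (the function name is illustrative):

```python
def merge_reduction(contigs_before, contigs_after):
    """Return (contigs merged away, percentage of the input contigs)."""
    merged = contigs_before - contigs_after
    return merged, round(100 * merged / contigs_before)

# minimus(ovl15) on the velvet contigs, numbers from the table above.
print(merge_reduction(25161, 16343))  # (8818, 35)
```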

Simulated 32bp exact match reads

 Type            #reads       min     max     mean
 Sim(ulated)     6538167      32      32      32   (32x coverage)

Single assemblies:

 Assembler   type         input-data  #reads         #ctgs   min     max     mean      stdev     ctgs-sum      #singletons       
 edena-sim   denovo       Sim         6538167        2068    100     47881   2994.03   4857.76   6191673       198699(3%)
 velvet-sim  denovo       Sim         6538167        2207    45      56810   2820.91   5348.36   6225757       123591(2%)

454 reads

 Type            #reads       min     max     mean
 454             77466        35      371     240


Pseudomonas aeruginosa b1

References:

 Name           Length  %GC
 PA14           6537648 66.29
 PACS2          6492423 66.33
 PAO1           6264404 66.56
 ...

Solexa reads

 Type            #reads       min     max     mean
 Solexa          8627900      33      33      33    (~43X coverage)

Assemblies:

 Assembler      type         input-data  #reads         #ctgs   min     max     mean      stdev      ctgs-sum      #singletons
 AMOSCmp-PA14   comparative  Solexa      8627900        2053    17      170485  3011.84   11917.53   6183320       1127399
 AMOSCmp-PAO1   comparative  Solexa      8627900        2797    17      75626   2161.19   5812.2     6044851       1592525  
 velvet         denovo       Solexa      8627900        10684   45      16239   640.34    825.24     6841458       ?
 edena          denovo       Solexa      8627900        11180   100     11300   552.36    610.52     6175460       3955865 (46%)