Assembly merge

From Cbcb
Revision as of 15:58, 28 March 2008

Assemblers

Denovo

Minimus

* hash-overlap minimum overlap:
    40 bp default: too large for contig assemblies
    20 bp minimum overlap; minimizer window length must be >= 15 bp; could these values be lowered?

Velvet

* overlap:
    18bp usually gives fewest contigs
    15bp is too low => too many short contigs

Edena

 * contigs don't overlap

Comparative

AMOScmp

Cases

No reference sequence

One data set, multiple denovo assemblers

Example:

 * Solexa data
 * edena & velvet assemblers

Solution:

 * merge 2 assembly contigs
 * run minimus on them
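
The "merge 2 assembly contigs" step amounts to pooling both contig sets into one FASTA file before minimus is run on them. A minimal sketch, assuming plain FASTA input: the `pool_contigs` helper and the file names are hypothetical, and the ID prefixes are added only so that identically named contigs from the two assemblers do not collide.

```python
from pathlib import Path


def pool_contigs(inputs, out_path):
    """Concatenate contig FASTA files, prefixing each ID with a label.

    inputs: list of (label, fasta_path) pairs, e.g. [("edena", "edena.fasta")].
    Returns the number of contigs written.
    """
    count = 0
    with open(out_path, "w") as out:
        for label, path in inputs:
            for line in Path(path).read_text().splitlines():
                if line.startswith(">"):
                    # Keep the original header, but make the ID unique per assembler.
                    out.write(f">{label}_{line[1:]}\n")
                    count += 1
                elif line.strip():
                    out.write(line + "\n")
    return count
```

A call such as `pool_contigs([("edena", "edena.fasta"), ("velvet", "velvet.fasta")], "merged.fasta")` would return the pooled contig count. For the Pseudomonas example below that would be 14084 + 25161 = 39245 input sequences.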

Multiple data sets, one (or multiple) denovo assemblers

Example:

 * Solexa & 454 data
 * velvet assembler for each set

One reference sequence

Few indels, few rearrangements

Solution:

 * AMOScmp
 * If there are many negative gaps, try to join contigs further (fastaMerge.pl $PREFIX.fasta)
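
A "negative gap" is taken here to mean that two neighbouring contigs overlap in reference coordinates, so the gap between them has negative length. A minimal sketch of counting such gaps, assuming each contig placement is a (start, end) pair with an exclusive end on a single reference sequence. The `gap_lengths` helper and the sample coordinates are hypothetical and are not part of AMOScmp or fastaMerge.pl.

```python
def gap_lengths(placements):
    """Return the gap length between each pair of neighbouring contigs.

    placements: iterable of (start, end) reference coordinates, end exclusive.
    A negative value means the two contigs overlap on the reference.
    """
    ordered = sorted(placements)
    return [nxt[0] - prev[1] for prev, nxt in zip(ordered, ordered[1:])]


# Hypothetical placements: the first two contigs overlap by 12 bp.
gaps = gap_lengths([(0, 1000), (988, 2500), (2600, 4000)])
negative = [g for g in gaps if g < 0]
print(gaps, len(negative))
```

Contig pairs with a negative gap are the natural candidates for a further joining step.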

Many indels, few rearrangements

Few indels, many rearrangements

Multiple reference sequences


Examples

Pseudomonas_syringae

Reference:

 Name           Length  %GC
 NC_004578.1    6397126 58.40
 NC_004633.1    73661   55.15
 NC_004632.1    67473   56.17

Repeats:

 desc    #repeats   min     max     mean    stdev    sum
 50bp+   991        50      7362    393.73  792.41   390192
 100bp+  429        100     7362    815.36  1060.29  349793

Solexa reads

 Type            #reads       min     max     mean
 Solexa          6340136      32      32      32   (~31x coverage)
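
The coverage figure follows from reads × read length / reference length. A quick check, assuming that coverage is computed against all three reference sequences listed above (chromosome plus both plasmids); the `coverage` helper is only illustrative.

```python
# Reference lengths (bp) from the reference table above.
reference_lengths = [6397126, 73661, 67473]
genome_size = sum(reference_lengths)


def coverage(n_reads, read_length, genome_size):
    """Average depth of coverage: total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


solexa_cov = coverage(6340136, 32, genome_size)
print(f"{solexa_cov:.2f}x")
```

Under that assumption the Solexa set comes out at roughly 31x, in line with the value quoted in the table.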

Single assemblies:

 Assembler   type         input-data  #reads         #ctgs   min     max     mean      stdev     ctgs-sum      #singletons       
 edena       denovo       Solexa      6340136        14084   100     5075    210.92    145.68    2970720       4893301(77%)
 velvet      denovo       Solexa      6340136        25161   45      5057    241.83    212.61    6084887
 AMOScmp     comparative  Solexa      6340136        187     20      577929  34863.06  91692.34  6519394       698638(11%)
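
The contig columns of this table (#ctgs, min, max, mean, stdev, ctgs-sum) can be recomputed from a contig FASTA file. A minimal sketch, assuming plain FASTA input: `contig_lengths` and `contig_stats` are hypothetical helpers, and the use of the sample standard deviation is an assumption, since the notes do not say which variant the stdev column uses.

```python
import statistics


def contig_lengths(fasta_lines):
    """Yield the sequence length of each record in an iterable of FASTA lines."""
    length = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def contig_stats(fasta_lines):
    """Return #ctgs, min, max, mean, stdev and summed length of the contigs."""
    lengths = list(contig_lengths(fasta_lines))
    return {
        "ctgs": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        # Sample standard deviation (assumption); needs at least two contigs.
        "stdev": statistics.stdev(lengths) if len(lengths) > 1 else 0.0,
        "sum": sum(lengths),
    }
```

An open file handle can be passed directly, for example `with open("contigs.fasta") as fh: print(contig_stats(fh))`, where the file name is a placeholder.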

Merged assemblies (contigs & singletons):

 assemblers     type         input-data                #input  #ctgs   min     max     mean     stdev   ctgs-sum   comments
 minimus(ovl40) denovo       edena+velvet(contigs)     39245   23644   45      6688    257.15   232.94  6080063    #very few 40bp overlaps are found
 minimus(ovl20) denovo       edena+velvet(contigs)     39245   18603   45      6688    322.32   311.02  5996244
 minimus(ovl20) denovo       velvet(contigs)           25161   19121   45      5057    311.3    297.27  5952381    #merged 25161-19121=6040 (24%) gaps
 minimus(ovl15) denovo       velvet(contigs)           25161   16343   45      9903    361.32   359.78  5905143    #merged 25161-16343=8818 (35%) gaps
 AMOScmp-merged ?            AMOScmp(contigs)          187     166     20      804024  39272.2  121124  6519189    #merged 187-166=21 negative gaps  out of a total of 32
                                                                                                                   #  min 5 bp overlap & 80% identity required
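
The joining criterion in the last row (at least 5 bp of overlap at 80% identity or better) can be illustrated with an ungapped suffix-prefix comparison. This is only a sketch of the idea: `best_overlap` and `join` are hypothetical helpers, indels are ignored, and this is not claimed to be the algorithm used by fastaMerge.pl, which these notes do not describe.

```python
def best_overlap(left, right, min_overlap=5, min_identity=0.80):
    """Longest ungapped overlap between the end of `left` and the start of `right`.

    Returns (overlap_length, identity), or None if no overlap of at least
    `min_overlap` bases reaches `min_identity`.
    """
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        suffix, prefix = left[-size:], right[:size]
        matches = sum(a == b for a, b in zip(suffix, prefix))
        identity = matches / size
        if identity >= min_identity:
            return size, identity
    return None


def join(left, right, **kwargs):
    """Merge two contigs at their best overlap, keeping the bases of `left`."""
    hit = best_overlap(left, right, **kwargs)
    if hit is None:
        return None
    return left + right[hit[0]:]
```

With thresholds this permissive, a 5 bp overlap passes with only 4 matching bases, so short joins of this kind are best restricted to contigs that are already known to be adjacent.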

Simulated 32bp exact match reads

 Type            #reads       min     max     mean
 Sim(ulated)     6538167      32      32      32   ( 32x coverage)
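
The simulated read count equals the number of 32 bp windows in the three reference sequences. That suggests one exact-match read per start position on each sequence, but this is an inference from the numbers, not something the notes state. A quick check:

```python
reference_lengths = [6397126, 73661, 67473]  # bp, from the reference table
read_length = 32

# A linear sequence of length L has L - k + 1 windows of length k.
windows = sum(length - read_length + 1 for length in reference_lengths)
print(windows)  # 6538167, the #reads value in the table

sim_coverage = windows * read_length / sum(reference_lengths)
print(f"{sim_coverage:.2f}x")  # 32.00x
```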

Single assemblies:

 Assembler   type         input-data  #reads         #ctgs   min     max     mean      stdev     ctgs-sum      #singletons       
 edena-sim   denovo       Sim         6538167        2068    100     47881   2994.03   4857.76   6191673       198699(3%)
 velvet-sim  denovo       Sim         6538167        2207    45      56810   2820.91   5348.36   6225757       123591(2%)

454 reads

 Type            #reads       min     max     mean
 454             77466        35      371     240