Assembly merge

From Cbcb
Jump to navigation Jump to search

Assemblers

Denovo

Minimus

* hash-overlap minimum overlap length: 
    40 bp default : too large for contig assemblies
    20 bp minimum overlap; minimizer window length must be >=15bp; could these values be dropped lower?
    very slow on large sequences (ex Ps.fasta, Ps.plasmid.fasta) even if USE_SIMPLE_OVERLAP=1 !!! WHY???
    analyzes internal overlaps, not the sequence end ones only
* tigger can handle "well" overlaps as short as 5bp

SSAKE

  • Short Sequence Assembly by K-mer search and 3' read Extension
  • Rene L Warren, Granger G Sutton, Steven JM Jones, Robert A Holt. "Assembling millions of short DNA sequences using SSAKE" Bioinformatics. Vol. 23 no. 4 2007, pages 500–501
  • Uses a greedy assembly algorithm: progressively searches for perfect 3’-most k-mers using a DNA prefix tree to identify overlaps between any two sequences; stringently clusters short reads into contigs
  • Copyright (c) 2006-2008 Canada's Michael Smith Genome Science Centre.
  • Current release: SSAKE 3.2 (12/07/2007)
  • Distribution: PERL script; runs on Linux systems
  • Parameters:
 -m  Minimum number of reads needed to call a base during overhang consensus build up (default -m 16)
 -o  Minimum number of reads needed to call a base during an extension (default -o 2)
 -r  Minimum base ratio used to accept a overhang consensus base (default -r 0.6)

Example:

 $ grep -c "^>" Pa.seq           # total number of reads
 $ (time ssake_3.0.pl -f Pa.seq) > ssake_3.0.log             # version 3.0
 $ (time ssake_3.2.pl -f Pa.seq -t 1) > ssake_3.2.log        # version 3.2
 $ ln -s Pa*contigs  Pa.fasta
 $ ln -s Pa*singlets Pa.singlets.seq
 $ cat Pa.fasta        | grep "^>" | perl -ane '/read(\d+)/; print $1, "\n";' | getSummaryDescriptive.pl -t assembled_reads    # number of assembled reads
 $ cat Pa.singlets.seq | grep "^>" | perl -ane '/read(\d+)/; print $1, "\n";' | getSummaryDescriptive.pl -t singletons         # number of singletons
 $ grep -v N Pa.seq              # number of reads which contain ambiguities and will be discarder from the beginning (should be added to the singletons)

Velvet

  • Sequence assembler for very short reads
  • Daniel R. Zerbino and Ewan Birney, "Velvet: Algorithms for De Novo Short Read Assembly Using De Bruijn Graphs", Genome Res. published online Mar 18, 2008
  • Uses Bruijn graphs to assemble sequences: first uses an error correction algorithm to merge sequences which belong together, then a repeat solver separates paths sharing local overlaps
  • Distribution: C program, to be compiled with gcc
  • Open Source, GPL agreement, EBI.
  • Current release: 0.5.07 (27/02/2008)
  • Parameters:
 - hash_length: odd integer (if even, it will be decremented) <= 31 (if above, will be reduced)
* minimum overlap length (30-40X coverage data set):
    18bp usually gives the fewest contigs
    24bp is also ok
    15bp is too low => too many short contigs
 
* velvet_assy.afg lists the assembled reads (TLE messages)
* some reads extend beyond the contig ends
* there are some reads shared by multiple contigs (status:D) ; at least one of the instances occurs at a contig end (goes beyond the contig end)
* many contigs overlap by a few bp

Example:

 $ velveth . ovl prefix.seq                                # ovl should be an odd integer like 21, 23 
 $ velvetg . -read_trkg yes
 $ bank-transact -c -z -b prefix.bnk -m velvet_assy.afg
 $ bank2contig prefix.bnk  > prefix.contig
 $ bank2fasta -b prefix.bnk > prefix.fasta
 $ listReadPlacedStatus -S -E prefix.bnk > prefix.singletons

Edena

  • Edena: Exact DE Novo Assembler
  • Hernandez, P. François, L. Farinelli, M. Osteras, and J. Schrenzel, "De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer." Genome Res. Published April 3, 2008
  • Genomic Research Laboratory, Infectious Diseases Service, Geneva University Hospitals, Switzerland
  • Current release: 2.1.1 (03/17/2008)
  • Distribution: Compiled executables for Linux-32,Linux-64
  • Uses a traditional read overlap-layout paradigm (adjusted for short reads)
  • Parameters:
 --overlapCutoff [int]    Only consider overlaps >= than the specified size. (default 22)
 --minContigSize [int]    Minimum size of the contigs to output. (default 100)
  • contigs don't overlap

Example:

 $ edena -r prefix.seq -p ovl
 $ edena -e prefix.ovl -p prefix
 $ ln -s prefix_contigs.fasta prefix.fasta
 $ cat *_info.txt                                 # assembly stats
 $ grep assembled Pa_info.txt
 $ cat Pa_contigs.fasta | grep "^>" | perl -ane '/(\d+)$/; print $1, "\n";' | getSummaryDescriptive.pl -t assembled

Comparative

AMOScmp

  • one should increase the make-consensus error rate if ref and qry are not that similar
  • all reads from layout are included in consensus
  • casm-layout: if one sequence end does not align it is left as singleton; the adjacent read to that end is left as singletons if they should "theoretically" overlap;
  • all CONTAINED reads should be assembled but the one that are adjacent to dirty reads are left as singletons as well !!!
  • There are some large negative gaps in scaffold file (Solexa read assemblies); Some contigs are "CONTAINED" in the contig just before the large negative gap; all these contig id's are larger than the .scaff last contig ; Contig alignments to the ref have breaks
  • casm-layout -g 10000 (defaul max gap) seems to produce the problem above; this value should be reduced (Ex 100)

Example:

 $ sort -nk4 AMOS.scaff | head
 322 BE 87996 -24519
 210 BE 7460 -7324
 1271 BE 60490 -4112
 17 BE 1022 -944
 ...
 $more AMOS.scaff
 ...
 322 BE 87996 -24519     # 322 contains contigs 1770, 1771 ...
 1770 BE 39 1100
 1771 BE 130 738
 ...
 1780 BE 72 25004
 $ tail -1 AMOS.scaff
 1767 BE 39264 0

AMOScmp-shortReads-alignmentTrimmed

Defaults:

  MINCLUSTER  = 16
  MINMATCH    = 16
  MINLEN      = 24  # delta-filter -l 24
  MINOVL      = 10  # previously 5 ; mis-assemblies occurred
  MAXTRIM     = 10
  MAXGAP      = 100 # default 10,000 (casm-layout -g) creates some incorrect contigs
  MAJORITY    = 70  # previously 50 
  CONSERR     = 0.06
  ALIGNWIGGLE = 2
 * Reads are trimmed based on nucmer alignment to a reference genome; reads that are adjacent to 0 cvg regions are not trimmed
 * If reads are not trimmed CONSERR should be increased, otherwise make-consensus fails
 * If reads are not trimmed, the avg contig length is significantly shorter than if reads are trimmed based on alignments => 
   !!! read trimming is important.
 * assembly is very fragmented if the percent identity between reads and the reference is less than 98%
 * simulations run on a PA-b1 10K read subset using PA14 as reference; random snps were introduced into PA14 to decrease the %identity; (MINCLUSTER=12, MINMATCH=16, MINLEN=20)
   99.4% : 1 contig  (original data, no snps added to PA14)
   98%   : 2 contigs
   96%   : 10 contigs
   93%   : 50 contigs
   93%   : 37 contigs (MINMATCH=14)

AMOScmp-shortReads

Defaults

  MINCLUSTER  = 20
  MINMATCH    = 20
  MINOVL      = 5
  MAXTRIM     = 10
  MAJORITY    = 50
  CONSERR     = 0.06
  ALIGNWIGGLE = 2
 * No read trimming is done

maq

  • At high %identity (close to 100%) maq outperforms AMOScmp-shortReads-alignmentTrimmed
  • At lower % identity (<97%) AMOScmp-shortReads-alignmentTrimmed outperforms maq

Other Scripts

EMBOSS megamerge

 * Can merge well short overlapping contigs
 * Does not work well when the overlap identity is low
 * Min ovl length=2

~dpuiu/bin/nucmer.pl

  • It is a nucmer wrapper which runs nucmer with decreasingly alignment size on the reads not aligned at the previous iterations:
  • Shorter run time, less disk space
  • Default parameters:
 -len 20,16,12         : nucmer minmatch (-l)
 -cluster 20,16,14     : nucmer mincluster (-c)
 -filter 20,20,20      : delta-filter minlen (-l)

Cases

No reference sequence

One data set, multiple denovo assemblers

Example:

 * Solexa data
 * edena & velvet assemblers

Solution:

 * run minimus on the contigs

Multipls data sets, one(multiple) denovo assemblers

Example:

* Solexa & 454 data
* velvet assembly on Solexa
* newbler assembly on 454
 

Solution:

 * run minimus on the contigs

One reference sequence

Example:

 * Solexa data
 * AMOScmp on Solexa
 * velvet on Solexa  

Solution:

 * remove the redundant velvet contigs (CONTAINED in AMOScmp contigs)
 * run AMOScmp on AMOScmp & velvet unique contigs => join some AMOScmp gaps
 * run minimus to further merge contigs

Multiple reference sequences

Example:

 * Solexa data
 * AMOScmp on Solexa using Ref1
 * AMOScmp on Solexa using Ref2

Solution:

 * remove the redundant AMOScmp Ref2 contigs (CONTAINED in AMOScmp contigs)
 * run AMOScmp on AMOScmp Ref1 & Ref2 contigs => join some AMOScmp Ref1 gaps
 * run minimus to further merge contigs

Examples

Pseudomonas_syringae

Reference:

 Name           Length  %GC
 NC_004578.1    6397126 58.40
 NC_004633.1    73661   55.15
 NC_004632.1    67473   56.17

Repeats:

 desc    #repeats   min     max     mean    stdev    sum
 50bp+   991        50      7362    393.73  792.41   390192
 100bp+  429        100     7362    815.36  1060.29  349793

Solexa reads

 Type            #reads       min     max     mean
 Solexa          6340136      32      32      32   (~31x coverage)

Assemblies:

 Assembler   type         input-data  #reads         #ctgs   min     max     mean      stdev     ctgs-sum      #singletons       
 AMOScmp     comparative  Solaxa      6340136        187     20      577929  34863.06  91692.34  6519394       698638(11%)
 velvet      denovo       Solaxa      6340136        25161   45      5057    241.83    212.61    6084887
 edena       denovo       Solaxa      6340136        14084   100     5075    210.92    145.68    2970720       4893301(77%)

Merged assemblies(contigs&singletons):

 assemblers     type         input-data                #reads  #ctgs   min     max     mean     stdev   ctgs-sum   comments    
 AMOScmp-merged ?            AMOScmp(contigs)          187     166     20      804024  39272.2  121124  6519189    #merged 187-166=21 negative gaps  out of a total of 32
 minimus(ovl20) denovo       velvet(contigs)           25161   19121   45      5057    311.3    297.27  5952381    #merged 25161-19121=6040 (25%) gaps
 minimus(ovl15) denovo       velvet(contigs)           25161   16343   45      9903    361.32   359.78  5905143    #merged 25161-16343=8818 (35%) gaps
 minimus(ovl40) denovo       edena+velvet(contigs)     39245   23644   45      6688    257.15   232.94  6080063    #very few 40bp overlaps are found
 minimus(ovl20) denovo       edena+velvet(contigs)     39245   18603   45      6688    322.32   311.02  5996244

Simulated 32bp exact match reads

 Type            #reads       min     max     mean
 Sim(ulated)     6538167      32      32      32   ( 32x coverage)

Single assemblies:

 Assembler   type         input-data  #reads         #ctgs   min     max     mean      stdev     ctgs-sum      #singletons       
 edena-sim   denovo       Sim         6538167        2068    100     47881   2994.03   4857.76   6191673       198699(3%)
 velvet-sim  denovo       Sim         6538167        2207    45      56810   2820.91   5348.36   6225757       123591(2%)

454 reads

 Type            #reads       min     max     mean
 454             77466        35      371     240


Pseudomonas aeruginosa b1 (PAb1)

References:

 Name           Length  %GC
 PA14           6537648 66.29
 PACS2          6492423 66.33
 PAO1           6264404 66.56
 ...

Solexa reads

 Type            #reads       min     max     mean
 Solexa          8627900      33      33      33    (~43X coverage)

Assemblies:

All contigs:

 Assembler      type         #ctgs   min     max     mean      stdev      ctgs-sum      #singletons    comments
 AMOSCmp-PA14   comparative  2053    17      170485  3011.84   11917.53   6183320       1127399
 AMOSCmp-PAO1   comparative  2797    17      75626   2161.19   5812.2     6044851       1592525
 AMOScmp-PA2192 comparative  5816    17      133129  1072.8    3725.22    6239454       1601299        largest assembly
 maq-PA14       comparative  991     33      155551  6199.79   17445.05   6143996       1197385
 velvet         denovo       10684   45      16239   640.34    825.24     6841458       1241079        much better than Ps !!!
 edena          denovo       11180   100     11300   552.36    610.52     6175460       3955865 (46%)  much better than Ps !!!
 ssake          denovo       185030  34      5490    77.21     141.23     14287079      3056893 

200bp+ contigs:

 Assembler      type         #ctgs   min     max     mean      stdev      ctgs-sum
 AMOSCmp-PA14   comparative  428     203     170485  14262.09  22852.74   6104175
 AMOSCmp-PAO1   comparative  865     200     75626   6893.96   8766.63    5963278
 AMOSCmp-PA2192 comparative  1299    200     133129  4683.46   6735.52    6083817
 maq-PA14       comparative  368     200     155551  16581.7   25475.92   6102067
 velvet         denovo       7382    200     16239   877.05    896.35     6474426
 edena          denovo       8316    200     11300   692.54    651.24     5759209
 ssake          denovo       12532   200     5490    486       329.93     6090567

Merged assemblies(contigs&singletons):

 Assembler           type   input-data          #reads   #ctgs   min   max     mean    stdev      ctgs-sum       comments
 AMOSCmp-PA14-merge  ?      AMOSCmp-PA14(ctgs)  2053     1931    17    170485  3201.7  12981.57   6182486        2053-1931=122 gaps closed


Staphylococcus aureus MW2

Data

Reads:

 3857879 35bp Solexa reads

Reference:

 NC_003923      2820462 32.83  Staphylococcus aureus subsp. aureus MW2, complete genome.
 AP004832       20654  28.38  Staphylococcus aureus plasmid pMW2 DNA, complete sequence

Other strains:

 NC_002952.2    2902619 32.81  Staphylococcus aureus subsp. aureus MRSA252, complete genome
 NC_002953.3    2799802 32.85  Staphylococcus aureus subsp. aureus MSSA476, complete genome
 NC_002951.2    2809422 32.82  Staphylococcus aureus subsp. aureus COL, complete genome
 NC_009632.1    2906507 32.95  Staphylococcus aureus subsp. aureus JH1, complete genome
 NC_009487.1    2906700 32.95  Staphylococcus aureus subsp. aureus JH9, complete genome
 NC_009782.1    2880168 32.88  Staphylococcus aureus subsp. aureus Mu3, complete genome
 NC_002758.2    2878529 32.88  Staphylococcus aureus subsp. aureus Mu50, complete genome
 NC_003923.1    2820462 32.83  Staphylococcus aureus subsp. aureus MW2, complete genome
 NC_002745.2    2814816 32.84  Staphylococcus aureus subsp. aureus N315, complete genome
 NC_007795.1    2821361 32.87  Staphylococcus aureus subsp. aureus NCTC 8325, complete genome
 NC_009641.1    2878897 32.89  Staphylococcus aureus subsp. aureus str. Newman, complete genome
 NC_007622.1    2742531 32.78  Staphylococcus aureus RF122, complete genome
 NC_007793.1    2872769 32.75  Staphylococcus aureus subsp. aureus USA300, complete genome
 NC_010079.1    2872915 32.76  Staphylococcus aureus subsp. aureus USA300_TCH1516, complete genome

Sequence assemblies

velvet

OVL=23 bp

Output stats:

 All:
               #elem   min     max     mean    median  n50     sum
 ctg           1777    45      22892   1583    252     5337    2813162
 ctg(200bp+)   939     200     22892   2924    1764    5512    2745649
 singl         289207  35      35      35      35      35      10122245
 breaks 0

0cvg regions

 $show-coords Sa.all.delta | ~/bin/getNucmerCoverage.pl -M 0 | getSummary.pl -c 3
       #elem   min     max     mean    median  n50     sum
 gap   305     1       570     34      19      70      10521


Contig assemblies

AMOScmp on velvet contigs

 All:
       #elem   min     max     mean    median  n50     sum
 ctg   567     45      73756   4890    455     20846   2772573
 singl 5       45      11556   3159    58      11556   15794
 all   572     45      73756   4875    455     20846   2788367
 breaks 2
 200 bp+:
       #elem   min     max     mean    median  n50     sum
 ctg   339     201     73756   8118    3413    20846   2752170
 singl 2       4086    11556   7821    11556   11556   15642
 all   341     201     73756   8117    3494    20846   2767812
 breaks 2


minimus on velvet contigs

Ovl: 20bp

 All:
       #elem   min     max     mean    median  n50     sum
 ctg   303     45      32809   5437    3023    11984   1647554
 singl 862     45      22892   1333    189     5202    1149229
 all   1165    45      32809   2401    335     8482    2796783
 breaks 18
 200 bp+:
       #elem   min     max     mean    median  n50     sum
 ctg   248     202     32809   6618    4726    11984   1641333
 singl 423     200     22892   2625    1398    5329    1110266
 all   671     200     32809   4101    2258    8543    2751599
 breaks 18

minimus1 on velvet contigs

 All:
       #elem   min     max     mean    median  n50     sum
 ctg   331     71      29358   4398    2359    9399    1455581
 singl 923     45      22892   1458    189     5635    1346176
 all   1254    45      29358   2234    341     7722    2801757
 breaks 54
 200 bp+:
       #elem   min     max     mean    median  n50     sum
 ctg   272     216     29358   5324    3700    9496    1448112
 singl 454     200     22892   2885    1724    5905    1309809
 all   726     200     29358   3799    2120    7875    2757921
 breaks 49