Assembly merge: Difference between revisions

From Cbcb
Jump to navigation Jump to search
No edit summary
Line 10: Line 10:
     very slow on large sequences (ex Ps.fasta, Ps.plasmid.fasta) even if USE_SIMPLE_OVERLAP=1 !!! WHY???
     very slow on large sequences (ex Ps.fasta, Ps.plasmid.fasta) even if USE_SIMPLE_OVERLAP=1 !!! WHY???
     analyzes internal overlaps, not the sequence end ones only
     analyzes internal overlaps, not the sequence end ones only
* tigger can handle "well" overlaps as short as 5bp


=== SSAKE ===
=== SSAKE ===
Line 91: Line 93:


* one should increase the make-consensus error rate if ref and qry are not that similar
* one should increase the make-consensus error rate if ref and qry are not that similar
* tigger can handle "well" overlaps as short as 5bp
* all reads from layout are included in consensus
* all reads from layout are included in consensus
* casm-layout: if one sequence end does not align it is left as singleton; the adjacent read to that end is left as singletons if they should "theoretically" overlap;
* all CONTAINED reads should be assembled but the one that are adjacent to dirty reads are left as singletons as well !!!


=== AMOScmp-shortReads-alignmentTrimmed ===
=== AMOScmp-shortReads-alignmentTrimmed ===

Revision as of 14:24, 1 May 2008

Assemblers

Denovo

Minimus

* hash-overlap minimum overlap length: 
    40 bp default : too large for contig assemblies
    20 bp minimum overlap; minimizer window length must be >=15bp; could these values be dropped lower?
    very slow on large sequences (ex Ps.fasta, Ps.plasmid.fasta) even if USE_SIMPLE_OVERLAP=1 !!! WHY???
    analyzes internal overlaps, not the sequence end ones only
* tigger can handle "well" overlaps as short as 5bp

SSAKE

  • Short Sequence Assembly by K-mer search and 3' read Extension
  • Rene L Warren, Granger G Sutton, Steven JM Jones, Robert A Holt. "Assembling millions of short DNA sequences using SSAKE" Bioinformatics. Vol. 23 no. 4 2007, pages 500–501
  • Uses a greedy assembly algorithm: progressively searches for perfect 3’-most k-mers using a DNA prefix tree to identify overlaps between any two sequences; stringently clusters short reads into contigs
  • Copyright (c) 2006-2008 Canada's Michael Smith Genome Science Centre.
  • Current release: SSAKE 3.2 (12/07/2007)
  • Distribution: PERL script; runs on Linux systems
  • Parameters:
 -m  Minimum number of reads needed to call a base during overhang consensus build up (default -m 16)
 -o  Minimum number of reads needed to call a base during an extension (default -o 2)
 -r  Minimum base ratio used to accept a overhang consensus base (default -r 0.6)

Example:

 $ grep -c "^>" Pa.seq           # total number of reads
 $ (time ssake_3.0.pl -f Pa.seq) > ssake_3.0.log             # version 3.0
 $ (time ssake_3.2.pl -f Pa.seq -t 1) > ssake_3.2.log        # version 3.2
 $ ln -s Pa*contigs  Pa.fasta
 $ ln -s Pa*singlets Pa.singlets.seq
 $ cat Pa.fasta        | grep "^>" | perl -ane '/read(\d+)/; print $1, "\n";' | getSummaryDescriptive.pl -t assembled_reads    # number of assembled reads
 $ cat Pa.singlets.seq | grep "^>" | perl -ane '/read(\d+)/; print $1, "\n";' | getSummaryDescriptive.pl -t singletons         # number of singletons
 $ grep -v N Pa.seq              # number of reads which contain ambiguities and will be discarder from the beginning (should be added to the singletons)

Velvet

  • Sequence assembler for very short reads
  • Daniel R. Zerbino and Ewan Birney, "Velvet: Algorithms for De Novo Short Read Assembly Using De Bruijn Graphs", Genome Res. published online Mar 18, 2008
  • Uses Bruijn graphs to assemble sequences: first uses an error correction algorithm to merge sequences which belong together, then a repeat solver separates paths sharing local overlaps
  • Distribution: C program, to be compiled with gcc
  • Open Source, GPL agreement, EBI.
  • Current release: 0.5.07 (27/02/2008)
  • Parameters:
 - hash_length: odd integer (if even, it will be decremented) <= 31 (if above, will be reduced)
* minimum overlap length (30-40X coverage data set):
    18bp usually gives the fewest contigs
    24bp is also ok
    15bp is too low => too many short contigs
 
* velvet_assy.afg lists the assembled reads (TLE messages)
* some reads extend beyond the contig ends
* there are some reads shared by multiple contigs (status:D) ; at least one of the instances occurs at a contig end (goes beyond the contig end)
* many contigs overlap by a few bp

Example:

 $ velveth . ovl prefix.seq                                # ovl should be an odd integer like 21, 23 
 $ velvetg . -read_trkg yes
 $ bank-transact -c -z -b prefix.bnk -m velvet_assy.afg
 $ bank2contig prefix.bnk  > prefix.contig
 $ bank2fasta -b prefix.bnk > prefix.fasta
 $ listReadPlacedStatus -S -E prefix.bnk > prefix.singletons

Edena

  • Edena: Exact DE Novo Assembler
  • Hernandez, P. François, L. Farinelli, M. Osteras, and J. Schrenzel, "De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer." Genome Res. Published April 3, 2008
  • Genomic Research Laboratory, Infectious Diseases Service, Geneva University Hospitals, Switzerland
  • Current release: 2.1.1 (03/17/2008)
  • Distribution: Compiled executables for Linux-32,Linux-64
  • Uses a traditional read overlap-layout paradigm (adjusted for short reads)
  • Parameters:
 --overlapCutoff [int]    Only consider overlaps >= than the specified size. (default 22)
 --minContigSize [int]    Minimum size of the contigs to output. (default 100)
  • contigs don't overlap

Example:

 $ edena -r prefix.seq -p ovl
 $ edena -e prefix.ovl -p prefix
 $ ln -s prefix_contigs.fasta prefix.fasta
 $ cat *_info.txt                                 # assembly stats
 $ grep assembled Pa_info.txt
 $ cat Pa_contigs.fasta | grep "^>" | perl -ane '/(\d+)$/; print $1, "\n";' | getSummaryDescriptive.pl -t assembled

Comparative

AMOScmp

  • one should increase the make-consensus error rate if ref and qry are not that similar
  • all reads from layout are included in consensus
  • casm-layout: if one sequence end does not align it is left as singleton; the adjacent read to that end is left as singletons if they should "theoretically" overlap;
  • all CONTAINED reads should be assembled but the one that are adjacent to dirty reads are left as singletons as well !!!

AMOScmp-shortReads-alignmentTrimmed

Defaults:

  MINCLUSTER  = 16
  MINMATCH    = 16
  MINLEN      = 24  # delta-filter -l 24
  MINOVL      = 10  # previously 5 ; mis-assemblies occurred
  MAXTRIM     = 10
  MAJORITY    = 70  # previously 50 
  CONSERR     = 0.06
  ALIGNWIGGLE = 2
 * Reads are trimmed based on nucmer alignment to a reference genome; reads that are adjacent to 0 cvg regions are not trimmed
 * If reads are not trimmed CONSERR should be increased, otherwise make-consensus fails
 * If reads are not trimmed, the avg contig length is significantly shorter than if reads are trimmed based on alignments => 
   !!! read trimming is important.
 * assembly is very fragmented if the percent identity between reads and the reference is less than 98%
 * simulations run on a PA-b1 10K read subset using PA14 as reference; random snps were introduced into PA14 to decrease the %identity; (MINCLUSTER=12, MINMATCH=16, MINLEN=20)
   99.4% : 1 contig  (original data, no snps added to PA14)
   98%   : 2 contigs
   96%   : 10 contigs
   93%   : 50 contigs
   93%   : 37 contigs (MINMATCH=14)

AMOScmp-shortReads

Defaults

  MINCLUSTER  = 20
  MINMATCH    = 20
  MINOVL      = 5
  MAXTRIM     = 10
  MAJORITY    = 50
  CONSERR     = 0.06
  ALIGNWIGGLE = 2
 * No read trimming is done

maq

  • At high %identity (close to 100%) maq outperforms AMOScmp-shortReads-alignmentTrimmed
  • At lower % identity (<97%) AMOScmp-shortReads-alignmentTrimmed outperforms maq

Other Scripts

EMBOSS megamerge

 * Can merge well short overlapping contigs
 * Does not work well when the overlap identity is low
 * Min ovl length=2

~dpuiu/bin/nucmer.pl

  • It is a nucmer wrapper which runs nucmer with decreasingly alignment size on the reads not aligned at the previous iterations:
  • Shorter run time, less disk space
  • Default parameters:
 -len 20,16,12         : nucmer minmatch (-l)
 -cluster 20,16,14     : nucmer mincluster (-c)
 -filter 20,20,20      : delta-filter minlen (-l)

Cases

No reference sequence

One data set, multiple denovo assemblers

Example:

 * Solexa data
 * edena & velvet assemblers

Solution:

 * merge 2 assembly contigs
 * run minimus on them

Multipls data sets, one(multiple) denovo assemblers

Example:

 Solexa & 454 data
 velvet assemblers for each set

One reference sequence

Few indels, few rearrangements

Solution:

 * AMOScmp
 * If there are many negative gaps try to further join contigs (fastaMerge.pl $PREFIX.fasta) 

Many indels, few rearrangements

Few indels, many rearrangements

Multiple reference sequences


Examples

Pseudomonas_syringae

Reference:

 Name           Length  %GC
 NC_004578.1    6397126 58.40
 NC_004633.1    73661   55.15
 NC_004632.1    67473   56.17

Repeats:

 desc    #repeats   min     max     mean    stdev    sum
 50bp+   991        50      7362    393.73  792.41   390192
 100bp+  429        100     7362    815.36  1060.29  349793

Solexa reads

 Type            #reads       min     max     mean
 Solexa          6340136      32      32      32   (~31x coverage)

Assemblies:

 Assembler   type         input-data  #reads         #ctgs   min     max     mean      stdev     ctgs-sum      #singletons       
 AMOScmp     comparative  Solaxa      6340136        187     20      577929  34863.06  91692.34  6519394       698638(11%)
 velvet      denovo       Solaxa      6340136        25161   45      5057    241.83    212.61    6084887
 edena       denovo       Solaxa      6340136        14084   100     5075    210.92    145.68    2970720       4893301(77%)

Merged assemblies(contigs&singletons):

 assemblers     type         input-data                #reads  #ctgs   min     max     mean     stdev   ctgs-sum   comments    
 AMOScmp-merged ?            AMOScmp(contigs)          187     166     20      804024  39272.2  121124  6519189    #merged 187-166=21 negative gaps  out of a total of 32
 minimus(ovl20) denovo       velvet(contigs)           25161   19121   45      5057    311.3    297.27  5952381    #merged 25161-19121=6040 (25%) gaps
 minimus(ovl15) denovo       velvet(contigs)           25161   16343   45      9903    361.32   359.78  5905143    #merged 25161-16343=8818 (35%) gaps
 minimus(ovl40) denovo       edena+velvet(contigs)     39245   23644   45      6688    257.15   232.94  6080063    #very few 40bp overlaps are found
 minimus(ovl20) denovo       edena+velvet(contigs)     39245   18603   45      6688    322.32   311.02  5996244

Simulated 32bp exact match reads

 Type            #reads       min     max     mean
 Sim(ulated)     6538167      32      32      32   ( 32x coverage)

Single assemblies:

 Assembler   type         input-data  #reads         #ctgs   min     max     mean      stdev     ctgs-sum      #singletons       
 edena-sim   denovo       Sim         6538167        2068    100     47881   2994.03   4857.76   6191673       198699(3%)
 velvet-sim  denovo       Sim         6538167        2207    45      56810   2820.91   5348.36   6225757       123591(2%)

454 reads

 Type            #reads       min     max     mean
 454             77466        35      371     240


Pseudomonas aeruginosa b1 (PAb1)

References:

 Name           Length  %GC
 PA14           6537648 66.29
 PACS2          6492423 66.33
 PAO1           6264404 66.56
 ...

Solexa reads

 Type            #reads       min     max     mean
 Solexa          8627900      33      33      33    (~43X coverage)

Assemblies:

All contigs:

 Assembler      type         #ctgs   min     max     mean      stdev      ctgs-sum      #singletons    comments
 AMOSCmp-PA14   comparative  2053    17      170485  3011.84   11917.53   6183320       1127399
 AMOSCmp-PAO1   comparative  2797    17      75626   2161.19   5812.2     6044851       1592525
 AMOScmp-PA2192 comparative  5816    17      133129  1072.8    3725.22    6239454       1601299        largest assembly
 maq-PA14       comparative  991     33      155551  6199.79   17445.05   6143996       1197385
 velvet         denovo       10684   45      16239   640.34    825.24     6841458       1241079        much better than Ps !!!
 edena          denovo       11180   100     11300   552.36    610.52     6175460       3955865 (46%)  much better than Ps !!!
 ssake          denovo       185030  34      5490    77.21     141.23     14287079      3056893 

200bp+ contigs:

 Assembler      type         #ctgs   min     max     mean      stdev      ctgs-sum
 AMOSCmp-PA14   comparative  428     203     170485  14262.09  22852.74   6104175
 AMOSCmp-PAO1   comparative  865     200     75626   6893.96   8766.63    5963278
 AMOSCmp-PA2192 comparative  1299    200     133129  4683.46   6735.52    6083817
 maq-PA14       comparative  368     200     155551  16581.7   25475.92   6102067
 velvet         denovo       7382    200     16239   877.05    896.35     6474426
 edena          denovo       8316    200     11300   692.54    651.24     5759209
 ssake          denovo       12532   200     5490    486       329.93     6090567

Merged assemblies(contigs&singletons):

 Assembler           type   input-data          #reads   #ctgs   min   max     mean    stdev      ctgs-sum       comments
 AMOSCmp-PA14-merge  ?      AMOSCmp-PA14(ctgs)  2053     1931    17    170485  3201.7  12981.57   6182486        2053-1931=122 gaps closed