Assembly merge: Difference between revisions

From Cbcb
Jump to navigation Jump to search
Line 164: Line 164:
   $ ln -s prefix_contigs.fasta prefix.fasta
   $ ln -s prefix_contigs.fasta prefix.fasta
   $ cat *_info.txt                                # assembly stats
   $ cat *_info.txt                                # assembly stats
   $ grep assembled Pa_info.txt
   $ grep assembled prefix_info.txt
   $ cat Pa_contigs.fasta | grep "^>" | perl -ane '/(\d+)$/; print $1, "\n";' | getSummaryDescriptive.pl -t assembled
   $ cat prefix_contigs.fasta | grep "^>" | perl -ane '/(\d+)$/; print $1, "\n";' | getSummaryDescriptive.pl -t reads_assembled


== Comparative ==
== Comparative ==

Revision as of 14:54, 16 May 2008

Assemblers

Denovo

minimus

hash-overlap issues:

  • very slow on large sequences (runs Smith-Waterman)
  • finds all overlaps (not only CONTAINED|BEGIN|END); of no use to the assembler
  • default ovl: 40 bp (too large)
  • min ovl: 15 (set by minimizer)
  • 20 bp overlap ok;
  • the min ovl imposed by the next step (tigger) is 5bp

minimus1

  • hash-overlap was replaced with a nucmer based overlapper
  • nucmer takes long time and lots of space to find very short overlaps (< 16bp?)
  • We can have more control on the overlaps that are used (can easily filter the nucmer output)
  • can discard CONTAINED/IDENTITY alignments in order to avoid collapsing repeats
  • can mask the middle of the sequences to make the alignment process faster; (Issue: CONTAINED overlaps appear as BEGIN/END overlaps)
  • Example:
  ....
  OVERLAP=16
  MINID=94
  MAXTRIM=20
  ...
  20: $(NUCMER) -maxmatch -l $(OVERLAP) -c $(OVERLAP) $(SEQS) $(SEQS) -p $(PREFIX)
  24: $(SHOWCOORDS) -H -c -l -o -r -I $(MINID) $(ALIGN) | egrep 'BEGIN|END' > $(COORDS)
  25: $(SCRIPTDIR)/nucmer2ovl.pl -ignore $(MAXTRIM) -tab $(COORDS)> $(OVLTAB)
  26: $(SCRIPTDIR)/ovl2OVL.pl $(OVLTAB)  > $(OVL)
  28: $(BINDIR)/bank-transact -z -b $(BANK) -m $(OVL)

minimus2

  • assembles two sequence sets (S1,S2)
  • Instead of merging the sets and aligning all versus all (S1+S2 : S1+S2) it aligns one versus the other:
 S1:S2
 S1:S1 (optional)
 S2:S2 (optional)
  • Each alignment can be run with different parameters

Example 1: merge two de-novo assemblies

 ...
 TYPE1=N
 TYPE2=N
 ...
 OVERLAP11       = 16
 OVERLAP22       = 16
 OVERLAP12       = 40
 ...
 # align sets
 $(NUCMER) -maxmatch -l $(OVERLAP11) -c $(OVERLAP11) $(SEQS1) $(SEQS1) -p $(PREFIX11)
 $(NUCMER) -maxmatch -c $(OVERLAP12) $(SEQS1) $(SEQS2) -p $(PREFIX12)
 $(NUCMER) -maxmatch -l $(OVERLAP22) -c $(OVERLAP22) $(SEQS2) $(SEQS2) -p $(PREFIX22)
 ...
 # merge alignments
 $(SHOWCOORDS) $(ALIGN11) | egrep 'BEGIN|END'  > $(COORDS)   
 $(SHOWCOORDS) $(ALIGN12) >> $(COORDS)
 $(SHOWCOORDS) $(ALIGN22) | egrep 'BEGIN|END'  >> $(COORDS)
 ...
 nucmer2ovl.pl  $(COORDS) > $(OVLTAB)
 ...

Example 2: merge a comparative and a de-novo assembly

 ...
 TYPE1           = C
 TYPE2           = N
 ...
 # look only for short overlaps among adjacent sequences from S1
 $(SCRIPTDIR)/fasta2ovl.pl -tab $(SEQS1) > $(OVLTAB)  
 ...
 # look for all overlaps  among one sequence from S1 and one from S2
 $(NUCMER) -maxmatch -c $(OVERLAP12) $(SEQS1) $(SEQS2) -p $(PREFIX12) 
 $(SHOWCOORDS) $(ALIGN12) | $(SCRIPTDIR)/nucmer2ovl.pl >> $(OVLTAB) 
 ...
 # look for end overlaps among sequences from S2
 $(NUCMER) -maxmatch -l $(OVERLAP22) -c $(OVERLAP22) $(SEQS2) $(SEQS2) -p $(PREFIX22)  
 $(SHOWCOORDS) $(ALIGN22) | egrep 'BEGIN|END' > $(COORDS22) | $(SCRIPTDIR)/nucmer2ovl.pl  >> $(OVLTAB)
 ...

SSAKE

  • Short Sequence Assembly by K-mer search and 3' read Extension
  • Rene L Warren, Granger G Sutton, Steven JM Jones, Robert A Holt. "Assembling millions of short DNA sequences using SSAKE" Bioinformatics. Vol. 23 no. 4 2007, pages 500–501
  • Uses a greedy assembly algorithm: progressively searches for perfect 3’-most k-mers using a DNA prefix tree to identify overlaps between any two sequences; stringently clusters short reads into contigs
  • Copyright (c) 2006-2008 Canada's Michael Smith Genome Science Centre.
  • Current release: SSAKE 3.2 (12/07/2007)
  • Distribution: PERL script; runs on Linux systems
  • Parameters:
 -m  Minimum number of reads needed to call a base during overhang consensus build up (default -m 16)
 -o  Minimum number of reads needed to call a base during an extension (default -o 2)
 -r  Minimum base ratio used to accept a overhang consensus base (default -r 0.6)

Example:

 $ grep -c "^>" Pa.seq           # total number of reads
 $ (time ssake_3.0.pl -f Pa.seq) > ssake_3.0.log             # version 3.0
 $ (time ssake_3.2.pl -f Pa.seq -t 1) > ssake_3.2.log        # version 3.2
 $ ln -s Pa*contigs  Pa.fasta
 $ ln -s Pa*singlets Pa.singlets.seq
 $ cat Pa.fasta        | grep "^>" | perl -ane '/read(\d+)/; print $1, "\n";' | getSummaryDescriptive.pl -t assembled_reads    # number of assembled reads
 $ cat Pa.singlets.seq | grep "^>" | perl -ane '/read(\d+)/; print $1, "\n";' | getSummaryDescriptive.pl -t singletons         # number of singletons
 $ grep -v N Pa.seq              # number of reads which contain ambiguities and will be discarder from the beginning (should be added to the singletons)

Velvet

  • Sequence assembler for very short reads
  • Daniel R. Zerbino and Ewan Birney, "Velvet: Algorithms for De Novo Short Read Assembly Using De Bruijn Graphs", Genome Res. published online Mar 18, 2008
  • Uses Bruijn graphs to assemble sequences: first uses an error correction algorithm to merge sequences which belong together, then a repeat solver separates paths sharing local overlaps
  • Distribution: C program, to be compiled with gcc
  • Open Source, GPL agreement, EBI.
  • Current release: 0.5.07 (27/02/2008)
  • Parameters:
 - hash_length: odd integer (if even, it will be decremented) <= 31 (if above, will be reduced)
* minimum overlap length (30-40X coverage data set):
    18bp usually gives the fewest contigs
    24bp is also ok
    15bp is too low => too many short contigs
*  -long option for including "long sequences" (Sanger, 454 or even reference sequences) very slow


* velvet_assy.afg lists the assembled reads (TLE messages)
* some reads extend beyond the contig ends
* there are some reads shared by multiple contigs (status:D) ; at least one of the instances occurs at a contig end (goes beyond the contig end)
* many contigs overlap by a few bp
* minimum_ctg length =2*minimum_ovl-1 (23=>45, 21=>41 ...)

Example:

 $ velveth . ovl prefix.seq                                # ovl should be an odd integer like 21, 23 
 $ velvetg . -read_trkg yes
 
 $ bank-transact -c -z -b prefix.bnk -m velvet_assy.afg
 $ bank2contig prefix.bnk  > prefix.contig
 $ bank2fasta -b prefix.bnk > prefix.fasta
 
 $ listReadPlacedStatus -S -E prefix.bnk > prefix.singletons
 $ extractfromfastanames.pl prefix.seq prefix.singletons > prefix.singletons.seq
 
 $ listReadPlacedStatus prefix.bnk > prefix.status
 $ grep D prefix.status  | awk '{print $5,$6}' | sort -u > prefix.pairs

Edena

  • Edena: Exact DE Novo Assembler
  • Hernandez, P. François, L. Farinelli, M. Osteras, and J. Schrenzel, "De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer." Genome Res. Published April 3, 2008
  • Genomic Research Laboratory, Infectious Diseases Service, Geneva University Hospitals, Switzerland
  • Current release: 2.1.1 (03/17/2008)
  • Distribution: Compiled executables for Linux-32,Linux-64
  • Uses a traditional read overlap-layout paradigm (adjusted for short reads)
  • Parameters:
 --overlapCutoff [int]    Only consider overlaps >= than the specified size. (default 22)
 --minContigSize [int]    Minimum size of the contigs to output. (default 100)
  • contigs don't overlap

Example:

 $ edena -r prefix.seq -p ovl
 $ edena -e prefix.ovl -p prefix
 $ ln -s prefix_contigs.fasta prefix.fasta
 $ cat *_info.txt                                 # assembly stats
 $ grep assembled prefix_info.txt
 $ cat prefix_contigs.fasta | grep "^>" | perl -ane '/(\d+)$/; print $1, "\n";' | getSummaryDescriptive.pl -t reads_assembled

Comparative

AMOScmp

  • one should increase the make-consensus error rate if ref and qry are not that similar
  • all reads from layout are included in consensus

casm-layout issue:

  • if one sequence end does not align, the read is not assembled; the adjacent read is not assembled either if based on alignment to the ref. the two "theoretically" overlap;
  • all CONTAINED reads should be assembled but the ones that are adjacent to dirty reads are left as singletons as well !!!
  • option -r places repeats randomly; what if we want more control over this?
  • There are some large negative gaps in scaffold file (Solexa read assemblies); Some contigs are "CONTAINED" in the contig just before the large negative gap; all these contig id's are larger than the .scaff last contig ; Contig alignments to the ref have breaks
  • casm-layout -g 10000 (defaul max gap) seems to produce the problem above; this value should be reduced (Ex 100)

Example:

 $ sort -nk4 AMOS.scaff | head
 322 BE 87996 -24519
 210 BE 7460 -7324
 1271 BE 60490 -4112
 17 BE 1022 -944
 ...
 $more AMOS.scaff
 ...
 322 BE 87996 -24519     # 322 contains contigs 1770, 1771 ...
 1770 BE 39 1100
 1771 BE 130 738
 ...
 1780 BE 72 25004
 $ tail -1 AMOS.scaff
 1767 BE 39264 0

AMOScmp-shortReads-alignmentTrimmed

Defaults:

  MINCLUSTER  = 16
  MINMATCH    = 16
  MINLEN      = 20  # delta-filter -l 20
  MINOVL      = 10  # previously 5 ; mis-assemblies occurred
  MAXTRIM     = 10
  MAXGAP      = 100 # default 10,000 (casm-layout -g) creates some incorrect contigs
  MAJORITY    = 70  # previously 50 
  CONSERR     = 0.06
  ALIGNWIGGLE = 2   # different than the default AMOScmp 15
 * Reads are trimmed based on nucmer alignment to a reference genome; reads that are adjacent to 0 cvg regions are not trimmed
 * If reads are not trimmed CONSERR should be increased, otherwise make-consensus fails
 * If reads are not trimmed, the avg contig length is significantly shorter than if reads are trimmed based on alignments => 
   !!! read trimming is important.
 * assembly is very fragmented if the percent identity between reads and the reference is less than 98%
 * simulations run on a PA-b1 10K read subset using PA14 as reference; random snps were introduced into PA14 to decrease the %identity; (MINCLUSTER=12, MINMATCH=16, MINLEN=20)
   99.4% : 1 contig  (original data, no snps added to PA14)
   98%   : 2 contigs
   96%   : 10 contigs
   93%   : 50 contigs
   93%   : 37 contigs (MINMATCH=14)

AMOScmp-shortReads

Defaults

  MINCLUSTER  = 20
  MINMATCH    = 20
  MINOVL      = 5
  MAXTRIM     = 10
  MAJORITY    = 50
  CONSERR     = 0.06
  ALIGNWIGGLE = 2
 * No read trimming is done

AMOScmp2

  • Runs AMOScmp of 2 data sets at once (with different parameters)

Example: merge a comparative(AMOScmp) and a de-novo(velvet) assembly, using the same reference was used to get the comparative one

 ...
 TYPE1           = C  # Assembly 1 is a comparative one, already know the position of contigs; no need to align them to ref.
 TYPE2           = N  # Assembly 2 is a de-novo one, don't know the position of contigs; we need to align them to ref.
 ...
 $(SCRIPTDIR)/scaff2delta.pl $(SCAFF1) > $(ALIGN1)
 $(NUCMER) -maxmatch -c $(OVERLAP2) $(REF) $(SEQS2) -p=$(PREFIX2) 
 ...
 cp $(ALIGN1)$(ALIGN)
 cat $(ALIGN2) | grep -v $(PREFIX) | grep -v NUCMER  >> $(ALIGN)
 ....

maq

  • At high %identity (close to 100%) maq outperforms AMOScmp-shortReads-alignmentTrimmed
  • At lower % identity (<97%) AMOScmp-shortReads-alignmentTrimmed outperforms maq

Other Scripts

EMBOSS megamerge

 * Can merge well short overlapping contigs
 * Does not work well when the overlap identity is low
 * Min ovl length=2

nucmer.pl

  • It is a nucmer wrapper which runs nucmer with decreasingly alignment size on the sequences
  • Iteration n only tries to align sequences not aligned at iterations n-1,n-2,...1
  • Shorter run time, less disk space (than running nucmer with a small alignment/cluster size)
  • Default parameters:
 -len 20,16,12         : nucmer minmatch (-l)
 -cluster 20,16,14     : nucmer mincluster (-c)
 -filter 20,20,20      : delta-filter minlen (-l)
  • Could be used by AMOScmp-shortReads, AMOScmp-shortReads-alignmentTrimmed

fasta2ovl.pl

  • Program that finds short normal direction overlaps between two adjacent sequences in a multi-FASTA file.
  • It tests if the END of seqence N overlaps the BEGINING of sequence N+1.

nucmer2ovl.pl

  • Program that converts nucmer annotated alignments to an overlap file (either AMOS or TAB format)

nucmerAnnotate.pl

  • Program that parses a nucmer coords file and re-annotates sequence alignments
  • The maximum length of the end sequence to be ignored can be set as a parameter

scaff2delta.pl

  • Program that converts a scaff file to a delta file
  • Example: No need to run nucmer on the AMOScmp output contigs to get them aligned to reference since we already know their positions in the scaffold

Cases

No reference sequence

One data set, multiple denovo assemblers

Example:

 * Solexa data
 * edena & velvet assemblers

Solution:

 * run minimus on the contigs

Multipls data sets, one(multiple) denovo assemblers

Example:

* Solexa & 454 data
* velvet assembly on Solexa
* newbler assembly on 454
 

Solution:

 * run minimus on the contigs

One reference sequence

Example:

 * Solexa data
 * AMOScmp on Solexa
 * velvet on Solexa  

Solution:

 * remove the redundant velvet contigs (CONTAINED in AMOScmp contigs) (Optional; leaving the CONTAINED contigs had very similar results, CONSERR did not have to be reduced)
 * run AMOScmp on AMOScmp & velvet contigs => join some AMOScmp gaps
 * run minimus to further merge contigs

Multiple reference sequences

Example:

 * Solexa data
 * AMOScmp on Solexa using Ref1
 * AMOScmp on Solexa using Ref2

Solution:

 * remove the redundant AMOScmp Ref2 contigs (CONTAINED in AMOScmp contigs) (optional)
 * run AMOScmp on AMOScmp Ref1 & Ref2 contigs => join some AMOScmp Ref1 gaps
 * run minimus to further merge contigs

Examples

Pseudomonas_syringae

Reference:

 Name           Length  %GC
 NC_004578.1    6397126 58.40
 NC_004633.1    73661   55.15
 NC_004632.1    67473   56.17

Repeats:

 desc    #repeats   min     max     mean    stdev    sum
 50bp+   991        50      7362    393.73  792.41   390192
 100bp+  429        100     7362    815.36  1060.29  349793

Solexa reads

 Type            #reads       min     max     mean
 Solexa          6340136      32      32      32   (~31x coverage)

Assemblies:

 Assembler   type         input-data  #reads         #ctgs   min     max     mean      stdev     ctgs-sum      #singletons       
 AMOScmp     comparative  Solaxa      6340136        187     20      577929  34863.06  91692.34  6519394       698638(11%)
 velvet      denovo       Solaxa      6340136        25161   45      5057    241.83    212.61    6084887
 edena       denovo       Solaxa      6340136        14084   100     5075    210.92    145.68    2970720       4893301(77%)

Merged assemblies(contigs&singletons):

 assemblers     type         input-data                #reads  #ctgs   min     max     mean     stdev   ctgs-sum   comments    
 AMOScmp-merged ?            AMOScmp(contigs)          187     166     20      804024  39272.2  121124  6519189    #merged 187-166=21 negative gaps  out of a total of 32
 minimus(ovl20) denovo       velvet(contigs)           25161   19121   45      5057    311.3    297.27  5952381    #merged 25161-19121=6040 (25%) gaps
 minimus(ovl15) denovo       velvet(contigs)           25161   16343   45      9903    361.32   359.78  5905143    #merged 25161-16343=8818 (35%) gaps
 minimus(ovl40) denovo       edena+velvet(contigs)     39245   23644   45      6688    257.15   232.94  6080063    #very few 40bp overlaps are found
 minimus(ovl20) denovo       edena+velvet(contigs)     39245   18603   45      6688    322.32   311.02  5996244

Simulated 32bp exact match reads

 Type            #reads       min     max     mean
 Sim(ulated)     6538167      32      32      32   ( 32x coverage)

Single assemblies:

 Assembler   type         input-data  #reads         #ctgs   min     max     mean      stdev     ctgs-sum      #singletons       
 edena-sim   denovo       Sim         6538167        2068    100     47881   2994.03   4857.76   6191673       198699(3%)
 velvet-sim  denovo       Sim         6538167        2207    45      56810   2820.91   5348.36   6225757       123591(2%)

454 reads

 Type            #reads       min     max     mean
 454             77466        35      371     240


Pseudomonas aeruginosa b1 (PAb1)

References:

 Name           Length  %GC
 PA14           6537648 66.29
 PACS2          6492423 66.33
 PAO1           6264404 66.56
 ...

Solexa reads

 Type            #reads       min     max     mean
 Solexa          8627900      33      33      33    (~43X coverage)

Assemblies:

All contigs:

 Assembler      type         #ctgs   min     max     mean      stdev      ctgs-sum      #singletons    comments
 AMOSCmp-PA14   comparative  2053    17      170485  3011.84   11917.53   6183320       1127399
 AMOSCmp-PAO1   comparative  2797    17      75626   2161.19   5812.2     6044851       1592525
 AMOScmp-PA2192 comparative  5816    17      133129  1072.8    3725.22    6239454       1601299        largest assembly
 maq-PA14       comparative  991     33      155551  6199.79   17445.05   6143996       1197385
 velvet         denovo       10684   45      16239   640.34    825.24     6841458       1241079        much better than Ps !!!
 edena          denovo       11180   100     11300   552.36    610.52     6175460       3955865 (46%)  much better than Ps !!!
 ssake          denovo       185030  34      5490    77.21     141.23     14287079      3056893 

200bp+ contigs:

 Assembler      type         #ctgs   min     max     mean      stdev      ctgs-sum
 AMOSCmp-PA14   comparative  428     203     170485  14262.09  22852.74   6104175
 AMOSCmp-PAO1   comparative  865     200     75626   6893.96   8766.63    5963278
 AMOSCmp-PA2192 comparative  1299    200     133129  4683.46   6735.52    6083817
 maq-PA14       comparative  368     200     155551  16581.7   25475.92   6102067
 velvet         denovo       7382    200     16239   877.05    896.35     6474426
 edena          denovo       8316    200     11300   692.54    651.24     5759209
 ssake          denovo       12532   200     5490    486       329.93     6090567

Merged assemblies(contigs&singletons):

 Assembler           type   input-data          #reads   #ctgs   min   max     mean    stdev      ctgs-sum       comments
 AMOSCmp-PA14-merge  ?      AMOSCmp-PA14(ctgs)  2053     1931    17    170485  3201.7  12981.57   6182486        2053-1931=122 gaps closed


Staphylococcus aureus MW2

Data

Reads:

 3857879 35bp Solexa reads

Reference:

 NC_003923      2820462 32.83  Staphylococcus aureus subsp. aureus MW2, complete genome.
 AP004832       20654  28.38  Staphylococcus aureus plasmid pMW2 DNA, complete sequence
 total          2841116

Coverage: 47.52X

Other strains:

 NC_002952.2    2902619 32.81  Staphylococcus aureus subsp. aureus MRSA252, complete genome
 NC_002953.3    2799802 32.85  Staphylococcus aureus subsp. aureus MSSA476, complete genome
 NC_002951.2    2809422 32.82  Staphylococcus aureus subsp. aureus COL, complete genome
 NC_009632.1    2906507 32.95  Staphylococcus aureus subsp. aureus JH1, complete genome
 NC_009487.1    2906700 32.95  Staphylococcus aureus subsp. aureus JH9, complete genome
 NC_009782.1    2880168 32.88  Staphylococcus aureus subsp. aureus Mu3, complete genome
 NC_002758.2    2878529 32.88  Staphylococcus aureus subsp. aureus Mu50, complete genome
 NC_003923.1    2820462 32.83  Staphylococcus aureus subsp. aureus MW2, complete genome
 NC_002745.2    2814816 32.84  Staphylococcus aureus subsp. aureus N315, complete genome
 NC_007795.1    2821361 32.87  Staphylococcus aureus subsp. aureus NCTC 8325, complete genome
 NC_009641.1    2878897 32.89  Staphylococcus aureus subsp. aureus str. Newman, complete genome
 NC_007622.1    2742531 32.78  Staphylococcus aureus RF122, complete genome
 NC_007793.1    2872769 32.75  Staphylococcus aureus subsp. aureus USA300, complete genome
 NC_010079.1    2872915 32.76  Staphylococcus aureus subsp. aureus USA300_TCH1516, complete genome

Sequence assemblies

AMOScmp

OVL: 10 bp

Output stats:

 All:
               #elem   min     max     mean    median  n50     sum
 ctg           17      60      658745  167134  102144  485748  2841283
 ctg(200bp+)   15      2090    658745  189411  102265  485748  2841158
 singl         101422  23      35      35      35      35      3549538
 breaks        0
  • Only 2.62% of the reads are singletons
  • 101232 out of 101422 AMOS singletons (99.81%) are also velvet singletons

velvet

OVL: 23 bp

Output stats:

               #elem   min     max     mean    median  n50     sum
 ctg           1777    45      22892   1583    252     5337    2813162
 ctg(100bp+)   1146    100     22892   2421    1167    5472    2774471
 ctg(200bp+)   939     200     22892   2924    1764    5512    2745649
 singl         289207  35      35      35      35      35      10122245
 breaks 0
  • 7.49% reads are singletons
  • 101232 out of 289207 velvet singletons are also AMOS singletons
  • 13189 reads are duplicates, connect 1126 contig pairs (1242 unique ctgs in these pairs)
  • 1114 out of 1242 contig overlaps were identified by running nucmer (-c 12 -l 12)
 ovlSize count
 22      740
 21      99
 20      55
 19      36
 18      52
 17      33
 16      17
 15      29
 14      21
 13      15
 12      17
 Total   1114

0cvg regions

 $show-coords Sa.all.delta | ~/bin/getNucmerCoverage.pl -M 0 | getSummary.pl -c 3
       #elem   min     max     mean    median  n50     sum
 gap   305     1       570     34      19      70      10521
OVL: 21 bp
               #elem   min     max     mean    median  n50     sum
 all           2099    41      18254   1342    258     4384    2817753
 45bp+         1830    45      18254   1534    448     4391    2806596
 100bp+        1337    100     18254   2075    1083    4558    2774667
 200bp+        1115    200     18254   2461    1486    4625    2743892
 singl         224006

OVL: 23 bp seems to generate a better assembly on this data set !!!

edena

Output stats:

               #elem   min     max     mean    median  n50     sum
 ctg           1175    100     22893   2350    1094    5514    2760953
 ctg(200bp+)   935     204     22893   2918    1766    5594    2727929


 breaks 0

Contig assemblies

AMOScmp on velvet contigs

Min Ovl: 10bp Min Cluster(nucmer -c): 40 (many velvet contigs are 45-65 bp)

 All:
       #elem   min     max     mean    median  n50     sum
 ctg   567     45      73756   4890    455     20846   2772573
 singl 5       45      11556   3159    58      11556   15794
 all   572     45      73756   4875    455     20846   2788367
 breaks 2 (1 contig with 2 overlapping alignments; no repeat; why is nucmer breaking the alignment?)
 
 200 bp+:
       #elem   min     max     mean    median  n50     sum
 ctg   339     201     73756   8118    3413    20846   2752170
 singl 2       4086    11556   7821    11556   11556   15642
 all   341     201     73756   8117    3494    20846   2767812
 breaks 2
  • Reduced the number of contigs from 1777 to 567 !!!

minimus on velvet contigs

Min Ovl: 16bp

 All:
       #elem   min     max     mean    median  n50     sum
 ctg   316     45      39490   5812    3243    12483   1836527
 singl 797     45      22892   1204    161     4724    959356
 all   1113    45      39490   2512    299     9399    2795883
 breaks 20
 
 200 bp+:
       #elem   min     max     mean    median  n50     sum
 ctg   258     202     39490   7093    5069    12483   1829967
 singl 368     200     22892   2504    1347    5210    921478
 all   626     200     39490   4395    2313    9508    2751445
 breaks 19
  • Reduced the number of contigs from 1777 to 1113 !!!

minimus1 on velvet contigs

Min Ovl: 16bp

 All:
       #elem   min     max     mean    median  n50     sum
 ctg   277     70      32618   6623    4709    11914   1834597
 singl 1022    45      19128   948     96      5057    968872
 all   1299    45      32618   2158    154     9526    2803469
 breaks 15
 
 200 bp+:
       #elem   min     max     mean    median  n50     sum
 ctg   247     200     32618   7413    6004    11984   1830980
 singl 354     200     19128   2590    1397    5294    916800
 all   601     200     32618   4572    2495    9757    2747780
 breaks 15
  • Reduced the number of contigs from 1777 to 1299 !!!
  • Generated fewer, larger big contigs than minimus

Fewer contigs get merged if:

 * Min Ovl is dropped from 16bp to 14bp
 * Only the overlaps represented by Duplicate reads are used

minimus1 on edena contigs

 All:
       #elem   min     max     mean    median  n50     sum
 ctg   67      240     24824   5610    4814    8671    375847
 singl 1022    100     19924   2331    1084    5455    2382674
 all   1089    100     24824   2533    1183    5933    2758521
 breaks 27
 
 200 bp+:
       #elem   min     max     mean    median  n50     sum
 ctg   67      240     24824   5610    4814    8671    375847
 singl 811     204     19924   2902    1766    5588    2353676
 all   878     204     24824   3109    1862    5984    2729523
 breaks 27

Edena contig ends get trimmed. There are fewer overlap among them than velvet contigs!!!

minimus2 on velvet & edena contigs

Input

 All:
 desc          #elem   min     max     mean    stdev   sum
 velvet        1777    45      22892   1583    2769    2813162
 edena         1175    100     22893   2350    3152    2760953
 edena.uniq    360     100     19924   2475    3886    891123
 
 200bp+
 desc          #elem   min     max     mean    stdev   sum
 velvet        939     200     22892   2924    3272    2745649
 edena         935     204     22893   2918    3303    2727929
 edena.uniq    202     206     19924   4305    4394    869552

Contig stats:

 All:
       #elem   min     max     mean    median  n50     sum
 ctg   740     67      38980   3793    1290    10553   2807159
 singl 366     45      6795    265     63      3684    96822
 all   1106    45      38980   2626    263     10104   2903981
 breaks 45
 
 200 bp+:
       #elem   min     max     mean    median  n50     sum
 ctg   560     202     38980   4969    2583    10561   2782466
 singl 31      203     6795    2424    1379    5455    75150
 all   591     202     38980   4835    2574    10197   2857616
 breaks 44

Staphylococcus aureus MW2 "mutated" 10% of sequence are random SNPs

Sequence assemblies

AMOScmp

OVL: 10 bp ...

               #elem   min     max     mean    median  n50     sum
 all           22431   20      705     107     87      124     2409008
 200bp+        1997    200     705     267     246     260     533882
 singl         1934772

velvet

Same as the one above

Contig assemblies

AMOScmp on velvet contigs

Min Ovl: 10bp Min Match (nucmer -l): 20 Min Cluster (nucmer -c): 40

All:

       #elem   min     max     mean    median  n50     sum
 ctg   496     45      47387   5504    2076    14370   2729975
 singl 629     45      11556   112     60      131     70262
 all   1125    45      47387   2489    96      13785   2800237
 breaks 6

200 bp+:

       #elem   min     max     mean    median  n50     sum
 ctg   405     204     47387   6715    3277    14370   2719753
 singl 28      200     11556   1042    288     11556   29165
 all   433     200     47387   6349    2878    14036   2748918
 breaks 4

Min Ovl: 10bp Min Match (nucmer -l): 16 Min Cluster (nucmer -c): 32

 All:
       #elem   min     max     mean    median  n50     sum
 ctg   544     45      53996   5044    1215    16039   2744166
 singl 385     45      11556   133     53      5057    51297
 all   929     45      53996   3009    112     15857   2795463
 breaks 8
 
 200 bp+:
       #elem   min     max     mean    median  n50     sum
 ctg   391     201     53996   6980    3156    16443   2729048
 singl 8       222     11556   3543    4086    11556   28343
 all   399     201     53996   6911    3156    16039   2757391
 breaks 6