Assembly merge
Assemblers
Denovo
minimus
- hash-overlap minimum overlap length:
40 bp default : too large for contig assemblies 20 bp minimum overlap ok; minimizer window length must be >=15bp; could these values be dropped lower? very slow on large sequences (ex Ps.fasta, Ps.plasmid.fasta) even if USE_SIMPLE_OVERLAP=1 !!! WHY??? analyzes internal overlaps, not the sequence end ones only
- tigger can handle "well" overlaps as short as 5bp
minimus1
- hash-overlap was replaced with a nucmer based overlapper
- discards CONTAINED/IDENTITY alignments in order to avoid collapsing repeats
- the middle of the contigs could be masked to make the alignment process faster; but CONTAINED overlaps would appear as BEGIN?END
- Example:
.... OVERLAP=16 MINID=94 MAXTRIM=20 ... 20: $(NUCMER) -maxmatch -l $(OVERLAP) -c $(OVERLAP) $(SEQS) $(SEQS) -p $(PREFIX) 24: $(SHOWCOORDS) -H -c -l -o -r -I $(MINID) $(ALIGN) | egrep 'BEGIN|END' > $(COORDS) 25: $(SCRIPTDIR)/nucmer2ovl.pl -ignore $(MAXTRIM) -tab $(COORDS)> $(OVLTAB) 26: $(SCRIPTDIR)/ovl2OVL.pl $(OVLTAB) > $(OVL) 28: $(BINDIR)/bank-transact -z -b $(BANK) -m $(OVL)
minimus2
- merges 2 assemblies
...
SSAKE
- Short Sequence Assembly by K-mer search and 3' read Extension
- Rene L Warren, Granger G Sutton, Steven JM Jones, Robert A Holt. "Assembling millions of short DNA sequences using SSAKE" Bioinformatics. Vol. 23 no. 4 2007, pages 500–501
- Uses a greedy assembly algorithm: progressively searches for perfect 3’-most k-mers using a DNA prefix tree to identify overlaps between any two sequences; stringently clusters short reads into contigs
- Copyright (c) 2006-2008 Canada's Michael Smith Genome Science Centre.
- Current release: SSAKE 3.2 (12/07/2007)
- Distribution: PERL script; runs on Linux systems
- Parameters:
-m Minimum number of reads needed to call a base during overhang consensus build up (default -m 16) -o Minimum number of reads needed to call a base during an extension (default -o 2) -r Minimum base ratio used to accept a overhang consensus base (default -r 0.6)
Example:
$ grep -c "^>" Pa.seq # total number of reads $ (time ssake_3.0.pl -f Pa.seq) > ssake_3.0.log # version 3.0 $ (time ssake_3.2.pl -f Pa.seq -t 1) > ssake_3.2.log # version 3.2 $ ln -s Pa*contigs Pa.fasta $ ln -s Pa*singlets Pa.singlets.seq $ cat Pa.fasta | grep "^>" | perl -ane '/read(\d+)/; print $1, "\n";' | getSummaryDescriptive.pl -t assembled_reads # number of assembled reads $ cat Pa.singlets.seq | grep "^>" | perl -ane '/read(\d+)/; print $1, "\n";' | getSummaryDescriptive.pl -t singletons # number of singletons $ grep -v N Pa.seq # number of reads which contain ambiguities and will be discarder from the beginning (should be added to the singletons)
Velvet
- Sequence assembler for very short reads
- Daniel R. Zerbino and Ewan Birney, "Velvet: Algorithms for De Novo Short Read Assembly Using De Bruijn Graphs", Genome Res. published online Mar 18, 2008
- Uses Bruijn graphs to assemble sequences: first uses an error correction algorithm to merge sequences which belong together, then a repeat solver separates paths sharing local overlaps
- Distribution: C program, to be compiled with gcc
- Open Source, GPL agreement, EBI.
- Current release: 0.5.07 (27/02/2008)
- Parameters:
- hash_length: odd integer (if even, it will be decremented) <= 31 (if above, will be reduced)
* minimum overlap length (30-40X coverage data set): 18bp usually gives the fewest contigs 24bp is also ok 15bp is too low => too many short contigs * -long option for including "long sequences" (Sanger, 454 or even reference sequences) in the assembly; did not work on including AMOScmp contigs; Sanger& 454 not tested
* velvet_assy.afg lists the assembled reads (TLE messages) * some reads extend beyond the contig ends * there are some reads shared by multiple contigs (status:D) ; at least one of the instances occurs at a contig end (goes beyond the contig end) * many contigs overlap by a few bp * minimum_ctg length =2*minimum_ovl-1 (23=>45, 21=>41 ...)
Example:
$ velveth . ovl prefix.seq # ovl should be an odd integer like 21, 23 $ velvetg . -read_trkg yes $ bank-transact -c -z -b prefix.bnk -m velvet_assy.afg $ bank2contig prefix.bnk > prefix.contig $ bank2fasta -b prefix.bnk > prefix.fasta $ listReadPlacedStatus -S -E prefix.bnk > prefix.singletons $ extractfromfastanames.pl prefix.seq prefix.singletons > prefix.singletons.seq $ listReadPlacedStatus prefix.bnk > prefix.status $ grep D prefix.status | awk '{print $5,$6}' | sort -u > prefix.pairs
Edena
- Edena: Exact DE Novo Assembler
- Hernandez, P. François, L. Farinelli, M. Osteras, and J. Schrenzel, "De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer." Genome Res. Published April 3, 2008
- Genomic Research Laboratory, Infectious Diseases Service, Geneva University Hospitals, Switzerland
- Current release: 2.1.1 (03/17/2008)
- Distribution: Compiled executables for Linux-32,Linux-64
- Uses a traditional read overlap-layout paradigm (adjusted for short reads)
- Parameters:
--overlapCutoff [int] Only consider overlaps >= than the specified size. (default 22) --minContigSize [int] Minimum size of the contigs to output. (default 100)
- contigs don't overlap
Example:
$ edena -r prefix.seq -p ovl $ edena -e prefix.ovl -p prefix $ ln -s prefix_contigs.fasta prefix.fasta $ cat *_info.txt # assembly stats $ grep assembled Pa_info.txt $ cat Pa_contigs.fasta | grep "^>" | perl -ane '/(\d+)$/; print $1, "\n";' | getSummaryDescriptive.pl -t assembled
Comparative
AMOScmp
- one should increase the make-consensus error rate if ref and qry are not that similar
- all reads from layout are included in consensus
- casm-layout: if one sequence end does not align it is left as singleton; the adjacent read to that end is left as singletons if they should "theoretically" overlap;
- all CONTAINED reads should be assembled but the one that are adjacent to dirty reads are left as singletons as well !!!
- There are some large negative gaps in scaffold file (Solexa read assemblies); Some contigs are "CONTAINED" in the contig just before the large negative gap; all these contig id's are larger than the .scaff last contig ; Contig alignments to the ref have breaks
- casm-layout -g 10000 (defaul max gap) seems to produce the problem above; this value should be reduced (Ex 100)
Example:
$ sort -nk4 AMOS.scaff | head 322 BE 87996 -24519 210 BE 7460 -7324 1271 BE 60490 -4112 17 BE 1022 -944 ...
$more AMOS.scaff ... 322 BE 87996 -24519 # 322 contains contigs 1770, 1771 ... 1770 BE 39 1100 1771 BE 130 738 ... 1780 BE 72 25004
$ tail -1 AMOS.scaff 1767 BE 39264 0
AMOScmp-shortReads-alignmentTrimmed
Defaults:
MINCLUSTER = 16 MINMATCH = 16 MINLEN = 20 # delta-filter -l 20 MINOVL = 10 # previously 5 ; mis-assemblies occurred MAXTRIM = 10 MAXGAP = 100 # default 10,000 (casm-layout -g) creates some incorrect contigs MAJORITY = 70 # previously 50 CONSERR = 0.06 ALIGNWIGGLE = 2 # different than the default AMOScmp 15
* Reads are trimmed based on nucmer alignment to a reference genome; reads that are adjacent to 0 cvg regions are not trimmed * If reads are not trimmed CONSERR should be increased, otherwise make-consensus fails * If reads are not trimmed, the avg contig length is significantly shorter than if reads are trimmed based on alignments => !!! read trimming is important. * assembly is very fragmented if the percent identity between reads and the reference is less than 98% * simulations run on a PA-b1 10K read subset using PA14 as reference; random snps were introduced into PA14 to decrease the %identity; (MINCLUSTER=12, MINMATCH=16, MINLEN=20) 99.4% : 1 contig (original data, no snps added to PA14) 98% : 2 contigs 96% : 10 contigs 93% : 50 contigs 93% : 37 contigs (MINMATCH=14)
AMOScmp-shortReads
Defaults
MINCLUSTER = 20 MINMATCH = 20 MINOVL = 5 MAXTRIM = 10 MAJORITY = 50 CONSERR = 0.06 ALIGNWIGGLE = 2
* No read trimming is done
AMOScmp2
Runs AMOScmp of 2 data sets at once (with different paarmeters)
maq
- At high %identity (close to 100%) maq outperforms AMOScmp-shortReads-alignmentTrimmed
- At lower % identity (<97%) AMOScmp-shortReads-alignmentTrimmed outperforms maq
Other Scripts
EMBOSS megamerge
* Can merge well short overlapping contigs * Does not work well when the overlap identity is low * Min ovl length=2
~dpuiu/bin/nucmer.pl
- It is a nucmer wrapper which runs nucmer with decreasingly alignment size on the reads not aligned at the previous iterations:
- Shorter run time, less disk space
- Default parameters:
-len 20,16,12 : nucmer minmatch (-l) -cluster 20,16,14 : nucmer mincluster (-c) -filter 20,20,20 : delta-filter minlen (-l)
- Could be used by AMOScmp-shortReads, AMOScmp-shortReads-alignmentTrimmed
Cases
No reference sequence
One data set, multiple denovo assemblers
Example:
* Solexa data * edena & velvet assemblers
Solution:
* run minimus on the contigs
Multipls data sets, one(multiple) denovo assemblers
Example:
* Solexa & 454 data * velvet assembly on Solexa * newbler assembly on 454
Solution:
* run minimus on the contigs
One reference sequence
Example:
* Solexa data * AMOScmp on Solexa * velvet on Solexa
Solution:
* remove the redundant velvet contigs (CONTAINED in AMOScmp contigs) (Optional; leaving the CONTAINED contigs had very similar results, CONSERR did not have to be reduced) * run AMOScmp on AMOScmp & velvet contigs => join some AMOScmp gaps * run minimus to further merge contigs
Multiple reference sequences
Example:
* Solexa data * AMOScmp on Solexa using Ref1 * AMOScmp on Solexa using Ref2
Solution:
* remove the redundant AMOScmp Ref2 contigs (CONTAINED in AMOScmp contigs) (optional) * run AMOScmp on AMOScmp Ref1 & Ref2 contigs => join some AMOScmp Ref1 gaps * run minimus to further merge contigs
Examples
Pseudomonas_syringae
Reference:
Name Length %GC NC_004578.1 6397126 58.40 NC_004633.1 73661 55.15 NC_004632.1 67473 56.17
Repeats:
desc #repeats min max mean stdev sum 50bp+ 991 50 7362 393.73 792.41 390192 100bp+ 429 100 7362 815.36 1060.29 349793
Solexa reads
Type #reads min max mean Solexa 6340136 32 32 32 (~31x coverage)
Assemblies:
Assembler type input-data #reads #ctgs min max mean stdev ctgs-sum #singletons AMOScmp comparative Solaxa 6340136 187 20 577929 34863.06 91692.34 6519394 698638(11%) velvet denovo Solaxa 6340136 25161 45 5057 241.83 212.61 6084887 edena denovo Solaxa 6340136 14084 100 5075 210.92 145.68 2970720 4893301(77%)
Merged assemblies(contigs&singletons):
assemblers type input-data #reads #ctgs min max mean stdev ctgs-sum comments AMOScmp-merged ? AMOScmp(contigs) 187 166 20 804024 39272.2 121124 6519189 #merged 187-166=21 negative gaps out of a total of 32 minimus(ovl20) denovo velvet(contigs) 25161 19121 45 5057 311.3 297.27 5952381 #merged 25161-19121=6040 (25%) gaps minimus(ovl15) denovo velvet(contigs) 25161 16343 45 9903 361.32 359.78 5905143 #merged 25161-16343=8818 (35%) gaps minimus(ovl40) denovo edena+velvet(contigs) 39245 23644 45 6688 257.15 232.94 6080063 #very few 40bp overlaps are found minimus(ovl20) denovo edena+velvet(contigs) 39245 18603 45 6688 322.32 311.02 5996244
Simulated 32bp exact match reads
Type #reads min max mean Sim(ulated) 6538167 32 32 32 ( 32x coverage)
Single assemblies:
Assembler type input-data #reads #ctgs min max mean stdev ctgs-sum #singletons edena-sim denovo Sim 6538167 2068 100 47881 2994.03 4857.76 6191673 198699(3%) velvet-sim denovo Sim 6538167 2207 45 56810 2820.91 5348.36 6225757 123591(2%)
454 reads
Type #reads min max mean 454 77466 35 371 240
Pseudomonas aeruginosa b1 (PAb1)
References:
Name Length %GC PA14 6537648 66.29 PACS2 6492423 66.33 PAO1 6264404 66.56 ...
Solexa reads
Type #reads min max mean Solexa 8627900 33 33 33 (~43X coverage)
Assemblies:
All contigs:
Assembler type #ctgs min max mean stdev ctgs-sum #singletons comments AMOSCmp-PA14 comparative 2053 17 170485 3011.84 11917.53 6183320 1127399 AMOSCmp-PAO1 comparative 2797 17 75626 2161.19 5812.2 6044851 1592525 AMOScmp-PA2192 comparative 5816 17 133129 1072.8 3725.22 6239454 1601299 largest assembly maq-PA14 comparative 991 33 155551 6199.79 17445.05 6143996 1197385 velvet denovo 10684 45 16239 640.34 825.24 6841458 1241079 much better than Ps !!! edena denovo 11180 100 11300 552.36 610.52 6175460 3955865 (46%) much better than Ps !!! ssake denovo 185030 34 5490 77.21 141.23 14287079 3056893
200bp+ contigs:
Assembler type #ctgs min max mean stdev ctgs-sum AMOSCmp-PA14 comparative 428 203 170485 14262.09 22852.74 6104175 AMOSCmp-PAO1 comparative 865 200 75626 6893.96 8766.63 5963278 AMOSCmp-PA2192 comparative 1299 200 133129 4683.46 6735.52 6083817 maq-PA14 comparative 368 200 155551 16581.7 25475.92 6102067 velvet denovo 7382 200 16239 877.05 896.35 6474426 edena denovo 8316 200 11300 692.54 651.24 5759209 ssake denovo 12532 200 5490 486 329.93 6090567
Merged assemblies(contigs&singletons):
Assembler type input-data #reads #ctgs min max mean stdev ctgs-sum comments AMOSCmp-PA14-merge ? AMOSCmp-PA14(ctgs) 2053 1931 17 170485 3201.7 12981.57 6182486 2053-1931=122 gaps closed
Staphylococcus aureus MW2
Data
Reads:
3857879 35bp Solexa reads
Reference:
NC_003923 2820462 32.83 Staphylococcus aureus subsp. aureus MW2, complete genome. AP004832 20654 28.38 Staphylococcus aureus plasmid pMW2 DNA, complete sequence total 2841116
Coverage: 47.52X
Other strains:
NC_002952.2 2902619 32.81 Staphylococcus aureus subsp. aureus MRSA252, complete genome NC_002953.3 2799802 32.85 Staphylococcus aureus subsp. aureus MSSA476, complete genome NC_002951.2 2809422 32.82 Staphylococcus aureus subsp. aureus COL, complete genome NC_009632.1 2906507 32.95 Staphylococcus aureus subsp. aureus JH1, complete genome NC_009487.1 2906700 32.95 Staphylococcus aureus subsp. aureus JH9, complete genome NC_009782.1 2880168 32.88 Staphylococcus aureus subsp. aureus Mu3, complete genome NC_002758.2 2878529 32.88 Staphylococcus aureus subsp. aureus Mu50, complete genome NC_003923.1 2820462 32.83 Staphylococcus aureus subsp. aureus MW2, complete genome NC_002745.2 2814816 32.84 Staphylococcus aureus subsp. aureus N315, complete genome NC_007795.1 2821361 32.87 Staphylococcus aureus subsp. aureus NCTC 8325, complete genome NC_009641.1 2878897 32.89 Staphylococcus aureus subsp. aureus str. Newman, complete genome NC_007622.1 2742531 32.78 Staphylococcus aureus RF122, complete genome NC_007793.1 2872769 32.75 Staphylococcus aureus subsp. aureus USA300, complete genome NC_010079.1 2872915 32.76 Staphylococcus aureus subsp. aureus USA300_TCH1516, complete genome
Sequence assemblies
AMOScmp
OVL: 10 bp
Output stats:
All: #elem min max mean median n50 sum ctg 17 60 658745 167134 102144 485748 2841283 ctg(200bp+) 15 2090 658745 189411 102265 485748 2841158 singl 101422 23 35 35 35 35 3549538 breaks 0
- Only 2.62% of the reads are singletons
- 101232 out of 101422 AMOS singletons (99.81%) are also velvet singletons
velvet
OVL: 23 bp
Output stats:
#elem min max mean median n50 sum ctg 1777 45 22892 1583 252 5337 2813162 ctg(100bp+) 1146 100 22892 2421 1167 5472 2774471 ctg(200bp+) 939 200 22892 2924 1764 5512 2745649 singl 289207 35 35 35 35 35 10122245 breaks 0
- 7.49% reads are singletons
- 101232 out of 289207 velvet singletons are also AMOS singletons
- 13189 reads are duplicates, connect 1126 contig pairs (1242 unique ctgs in these pairs)
- 1114 out of 1242 contig overlaps were identified by running nucmer (-c 12 -l 12)
ovlSize count 22 740 21 99 20 55 19 36 18 52 17 33 16 17 15 29 14 21 13 15 12 17 Total 1114
0cvg regions
$show-coords Sa.all.delta | ~/bin/getNucmerCoverage.pl -M 0 | getSummary.pl -c 3 #elem min max mean median n50 sum gap 305 1 570 34 19 70 10521
OVL: 21 bp
#elem min max mean median n50 sum all 2099 41 18254 1342 258 4384 2817753 45bp+ 1830 45 18254 1534 448 4391 2806596 100bp+ 1337 100 18254 2075 1083 4558 2774667 200bp+ 1115 200 18254 2461 1486 4625 2743892 singl 224006
OVL: 23 bp seems to generate a better assembly on this data set !!!
edena
Output stats:
#elem min max mean median n50 sum ctg 1175 100 22893 2350 1094 5514 2760953 ctg(200bp+) 935 204 22893 2918 1766 5594 2727929
Contig assemblies
AMOScmp on velvet contigs
Min Ovl: 10bp Min Cluster(nucmer -c): 40 (many velvet contigs are 45-65 bp)
All: #elem min max mean median n50 sum ctg 567 45 73756 4890 455 20846 2772573 singl 5 45 11556 3159 58 11556 15794 all 572 45 73756 4875 455 20846 2788367 breaks 2 (1 contig with 2 overlapping alignments; no repeat; why is nucmer breaking the alignment?) 200 bp+: #elem min max mean median n50 sum ctg 339 201 73756 8118 3413 20846 2752170 singl 2 4086 11556 7821 11556 11556 15642 all 341 201 73756 8117 3494 20846 2767812 breaks 2
- Reduced the number of contigs from 1777 to 567 !!!
minimus on velvet contigs
Min Ovl: 16bp
All: #elem min max mean median n50 sum ctg 316 45 39490 5812 3243 12483 1836527 singl 797 45 22892 1204 161 4724 959356 all 1113 45 39490 2512 299 9399 2795883 breaks 20 200 bp+: #elem min max mean median n50 sum ctg 258 202 39490 7093 5069 12483 1829967 singl 368 200 22892 2504 1347 5210 921478 all 626 200 39490 4395 2313 9508 2751445 breaks 19
- Reduced the number of contigs from 1777 to 1113 !!!
minimus1 on velvet contigs
Min Ovl: 16bp
All: #elem min max mean median n50 sum ctg 277 70 32618 6623 4709 11914 1834597 singl 1022 45 19128 948 96 5057 968872 all 1299 45 32618 2158 154 9526 2803469 breaks 15 200 bp+: #elem min max mean median n50 sum ctg 247 200 32618 7413 6004 11984 1830980 singl 354 200 19128 2590 1397 5294 916800 all 601 200 32618 4572 2495 9757 2747780 breaks 15
- Reduced the number of contigs from 1777 to 1299 !!!
- Generated fewer, larger big contigs than minimus
minimus1 on velvet contigs
Min Ovl: 14bp; Uses only the overlaps represented by Duplicate reads
All: #elem min max mean median n50 sum ctg 240 70 27522 3955 1996 8075 949307 singl 1179 45 19128 1575 240 5337 1856945 all 1419 45 27522 1978 344 6270 2806252 breaks 11 200 bp+: #elem min max mean median n50 sum ctg 196 201 27522 4815 3010 8075 943828 singl 612 200 19128 2962 1857 5472 1813005 all 808 200 27522 3412 2055 6337 2756833 breaks 11
Fewer contigs get merged!!!
minimus1 on edena contigs
All: #elem min max mean median n50 sum ctg 67 240 24824 5610 4814 8671 375847 singl 1022 100 19924 2331 1084 5455 2382674 all 1089 100 24824 2533 1183 5933 2758521 breaks 27 200 bp+: #elem min max mean median n50 sum ctg 67 240 24824 5610 4814 8671 375847 singl 811 204 19924 2902 1766 5588 2353676 all 878 204 24824 3109 1862 5984 2729523 breaks 27
Edena contig ends get trimmed. There are fewer overlap between them than velvet contigs!!!
Staphylococcus aureus MW2 "mutated" 10% of sequence are random SNPs
Sequence assemblies
AMOScmp
OVL: 10 bp ...
#elem min max mean median n50 sum all 22431 20 705 107 87 124 2409008 200bp+ 1997 200 705 267 246 260 533882 singl 1934772
velvet
Same as the one above
Contig assemblies
AMOScmp on velvet contigs
Min Ovl: 10bp Min Match (nucmer -l): 20 Min Cluster (nucmer -c): 40
All:
#elem min max mean median n50 sum ctg 496 45 47387 5504 2076 14370 2729975 singl 629 45 11556 112 60 131 70262 all 1125 45 47387 2489 96 13785 2800237 breaks 6
200 bp+:
#elem min max mean median n50 sum ctg 405 204 47387 6715 3277 14370 2719753 singl 28 200 11556 1042 288 11556 29165 all 433 200 47387 6349 2878 14036 2748918 breaks 4
Min Ovl: 10bp Min Match (nucmer -l): 16 Min Cluster (nucmer -c): 32
All: #elem min max mean median n50 sum ctg 544 45 53996 5044 1215 16039 2744166 singl 385 45 11556 133 53 5057 51297 all 929 45 53996 3009 112 15857 2795463 breaks 8 200 bp+: #elem min max mean median n50 sum ctg 391 201 53996 6980 3156 16443 2729048 singl 8 222 11556 3543 4086 11556 28343 all 399 201 53996 6911 3156 16039 2757391 breaks 6