Assembly merge: Difference between revisions
No edit summary |
|||
(261 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
= Papers = | |||
* [http://genome.cshlp.org/cgi/content/full/18/2/324 Short read fragment assembly of bacterial genomes 2007] | |||
= Assemblers = | = Assemblers = | ||
== Denovo == | == Denovo == | ||
=== | === CA === | ||
See [[Dpuiu_CA|CA]] | |||
=== Abyss === | |||
* [http://www.bcgsc.ca/platform/bioinfo/software/abyss Abyss] | |||
=== Atlas === | |||
* [http://www.hgsc.bcm.tmc.edu/downloads/software/atlas/readme.html Atlas web site] | |||
=== Euler-SR === | |||
* [http://euler-assembler.ucsd.edu/ Euler-SR source] | |||
* [[Media:README.eulersr|README]] | |||
* [http://genome.cshlp.org/cgi/content/abstract/gr.7088808v1 Short read fragment assembly of bacterial genomes] (Genome Research 2008) | |||
* [http://genome.cshlp.org/cgi/content/full/14/9/1786 De Novo Repeat Classification and Fragment Assembly] (Genome Research 2004) | |||
* [http://genome.cshlp.org/cgi/content/abstract/12/8/1269 Automated De Novo Identification of Repeat Sequence Families in Sequenced Genomes] (Genome Research 2002) | |||
* [http://genome.cshlp.org/content/early/2008/12/03/gr.079053.108.abstract De novo fragment assembly with short mate-paired reads: Does the read length matter?] (Genome Research 2008) | |||
* Cmd: | |||
setenv EUSRC /fs/szdevel/dpuiu/euler-sr/ | |||
setenv MACHTYPE x86_64 | |||
set path=($path ${EUSRC}/assembly) | |||
set path=($path ${EUSRC}/assembly/${MACHTYPE}) | |||
/fs/szdevel/dpuiu/euler-sr/assembly/Assemble.pl readsFile vertexSize ruleFile | |||
* 35bp Solexa reads: vertexSize=25 vs vertexSize=21 | |||
vertexSize=21 | |||
more agreagte assembly; | |||
some tandem repeats in 2K+ contigs were collapsed (detected by aligning to Strep aureus finished genome) | |||
these repeats were present in small ctgs (~ 200bp) when euler was run with vertexSize=25 | |||
these repeats were not present in the velvet assembly | |||
=== newbler (454) === | |||
* Command: | |||
/fs/sz-user-supported/Linux-x86_64/packages/rig/bin/runAssembly -o dir prefix.sff | |||
=== minimus === | |||
hash-overlap issues: | |||
* very slow on large sequences (runs Smith-Waterman) | |||
* finds all overlaps (not only CONTAINED|BEGIN|END); of no use to the assembler | |||
* default ovl: 40 bp (too large) | |||
* min ovl: 15 (set by minimizer) | |||
* 20 bp overlap ok; | |||
* the min ovl imposed by the next step (tigger) is 5bp | |||
=== minimus1 === | |||
* hash-overlap was replaced with a nucmer based overlapper | |||
* nucmer takes long time and lots of space to find very short overlaps (< 16bp?) | |||
* We can have more control on the overlaps that are used (can easily filter the nucmer output) | |||
* can discard CONTAINED/IDENTITY alignments in order to avoid collapsing repeats | |||
* can mask the middle of the sequences to make the alignment process faster; (Issue: CONTAINED overlaps appear as BEGIN/END overlaps) | |||
* Example: | |||
.... | |||
OVERLAP=16 | |||
MINID=94 | |||
MAXTRIM=20 | |||
... | |||
20: $(NUCMER) -maxmatch -l $(OVERLAP) -c $(OVERLAP) $(SEQS) $(SEQS) -p $(PREFIX) | |||
24: $(SHOWCOORDS) -H -c -l -o -r -I $(MINID) $(ALIGN) | egrep 'BEGIN|END' > $(COORDS) | |||
25: $(SCRIPTDIR)/nucmer2ovl.pl -ignore $(MAXTRIM) -tab $(COORDS)> $(OVLTAB) | |||
26: $(SCRIPTDIR)/ovl2OVL.pl $(OVLTAB) > $(OVL) | |||
28: $(BINDIR)/bank-transact -z -b $(BANK) -m $(OVL) | |||
=== minimus2 === | |||
* assembles two sequence sets (S1,S2) | |||
* Instead of merging the sets and aligning all versus all (S1+S2 : S1+S2) it aligns one versus the other: | |||
S1:S2 | |||
S1:S1 (optional) | |||
S2:S2 (optional) | |||
* Each alignment can be run with different parameters | |||
Example 1: merge two de-novo assemblies | |||
... | |||
TYPE1=N | |||
TYPE2=N | |||
... | |||
OVERLAP11 = 16 | |||
OVERLAP22 = 16 | |||
OVERLAP12 = 40 | |||
... | |||
# align sets | |||
$(NUCMER) -maxmatch -l $(OVERLAP11) -c $(OVERLAP11) $(SEQS1) $(SEQS1) -p $(PREFIX11) | |||
$(NUCMER) -maxmatch -c $(OVERLAP12) $(SEQS1) $(SEQS2) -p $(PREFIX12) | |||
$(NUCMER) -maxmatch -l $(OVERLAP22) -c $(OVERLAP22) $(SEQS2) $(SEQS2) -p $(PREFIX22) | |||
... | |||
# merge alignments | |||
$(SHOWCOORDS) $(ALIGN11) | egrep 'BEGIN|END' > $(COORDS) | |||
$(SHOWCOORDS) $(ALIGN12) >> $(COORDS) | |||
$(SHOWCOORDS) $(ALIGN22) | egrep 'BEGIN|END' >> $(COORDS) | |||
... | |||
nucmer2ovl.pl $(COORDS) > $(OVLTAB) | |||
... | |||
Example 2: merge a comparative and a de-novo assembly | |||
... | |||
TYPE1 = C | |||
TYPE2 = N | |||
... | |||
# look only for short overlaps among adjacent sequences from S1 | |||
$(SCRIPTDIR)/fasta2ovl.pl -tab $(SEQS1) > $(OVLTAB) | |||
... | |||
# look for all overlaps among one sequence from S1 and one from S2 | |||
$(NUCMER) -maxmatch -c $(OVERLAP12) $(SEQS1) $(SEQS2) -p $(PREFIX12) | |||
$(SHOWCOORDS) $(ALIGN12) | $(SCRIPTDIR)/nucmer2ovl.pl >> $(OVLTAB) | |||
... | |||
# look for end overlaps among sequences from S2 | |||
$(NUCMER) -maxmatch -l $(OVERLAP22) -c $(OVERLAP22) $(SEQS2) $(SEQS2) -p $(PREFIX22) | |||
$(SHOWCOORDS) $(ALIGN22) | egrep 'BEGIN|END' > $(COORDS22) | $(SCRIPTDIR)/nucmer2ovl.pl >> $(OVLTAB) | |||
... | |||
=== SSAKE === | |||
* [http://www.bcgsc.ca/platform/bioinfo/software/ssake Web Site] | |||
* Short Sequence Assembly by K-mer search and 3' read Extension | |||
* Rene L Warren, Granger G Sutton, Steven JM Jones, Robert A Holt. "Assembling millions of short DNA sequences using SSAKE" Bioinformatics. Vol. 23 no. 4 2007, pages 500–501 | |||
* Uses a greedy assembly algorithm: progressively searches for perfect 3’-most k-mers using a DNA prefix tree to identify overlaps between any two sequences; stringently clusters short reads into contigs | |||
* Copyright (c) 2006-2008 Canada's Michael Smith Genome Science Centre. | |||
* Current release: SSAKE 3.2.1 (08/21/2008) | |||
* Distribution: PERL script; runs on Linux systems | |||
* Parameters: | |||
-m Minimum number of reads needed to call a base during overhang consensus build up (default -m 16) | |||
-o Minimum number of reads needed to call a base during an extension (default -o 2) | |||
-r Minimum base ratio used to accept a overhang consensus base (default -r 0.7) | |||
-t Max number of bases to trim (default 0) | |||
* "Especially" targeted towards Solexa seq; 454 reads too long & homopol problem | |||
* Can handle mated reads | |||
* TQS.py should be used for qual trimming | |||
Example: | |||
$ grep -c "^>" Pa.seq # total number of reads | |||
$ (time ssake_3.0.pl -f Pa.seq) > ssake_3.0.log # version 3.0 | |||
$ (time ssake_3.2.pl -f Pa.seq -t 1) > ssake_3.2.log # version 3.2 | |||
$ ln -s Pa*contigs Pa.fasta | |||
$ ln -s Pa*singlets Pa.singlets.seq | |||
$ cat Pa.fasta | grep "^>" | perl -ane '/read(\d+)/; print $1, "\n";' | getSummaryDescriptive.pl -t assembled_reads # number of assembled reads | |||
$ cat Pa.singlets.seq | grep "^>" | perl -ane '/read(\d+)/; print $1, "\n";' | getSummaryDescriptive.pl -t singletons # number of singletons | |||
$ grep -v N Pa.seq # number of reads which contain ambiguities and will be discarder from the beginning (should be added to the singletons) | |||
=== VCAKE === | |||
=== Velvet === | === Velvet === | ||
* Sequence assembler for very short reads | |||
* [http://www.genome.org/cgi/content/abstract/18/5/821 Velvet: Algorithms for De Novo Short Read Assembly Using De Bruijn Graphs] Genome Res. Mar 2008 | |||
* [http://www.ebi.ac.uk/~zerbino/velvet/ Project Web Site at EBI] Current version: 0.6 02/06/2008: Velvet 0.6 | |||
* Uses Bruijn graphs to assemble sequences: first uses an error correction algorithm to merge sequences which belong together, then a repeat solver separates paths sharing local overlaps | |||
* Distribution: C program, to be compiled with gcc | |||
* Open Source, GPL agreement, EBI. | |||
* Current release: 0.7.55 | |||
* Parameters: | |||
- hash_length: odd integer (if even, it will be decremented) <= 31 (if above, will be reduced) | |||
* minimum overlap length (30-40X coverage data set): | * minimum overlap length (30-40X coverage data set): | ||
17bp usually gives the fewest contigs (some case) | |||
23bp is also ok (used for Pa,Ps) | |||
15bp is too low => too many short contigs | 15bp is too low => too many short contigs | ||
* -long option for including "long sequences" (Sanger, 454 or even reference sequences) very slow | |||
* At ~30X, max ctg len is ~ 5KB | |||
* At >45X, max ctg len is ~ 20KB | |||
* Longest contig on simulated data is 56Kbp (Ps, no error 32bp reads, 32X cvg) | |||
* Max ovl=32 | |||
* For longer reads (66bp), some of the repeats ( ~100bp) get assembled partially; -cov_cutoff parameter must be used, otherwise low cvg/quality ctgs are reported | |||
* velvet_assy.afg lists the assembled reads (TLE messages) | |||
* some reads extend beyond the contig ends | |||
* there are some reads shared by multiple contigs (status:D) ; at least one of the instances occurs at a contig end (goes beyond the contig end) | |||
* many contigs overlap by a few bp; max val= min_ovl-1 | |||
* minimum_ctg length =2*minimum_ovl-1 (23=>45, 21=>41 ...) | |||
Example: | Example (unmated reads): | ||
$ velveth . ovl prefix.seq | $ velveth . -[short|shortPaired|long|longPaired] ovl prefix.seq # ovl should be an odd integer like 21, 23 | ||
$ velvetg . -read_trkg yes | $ velvetg . -read_trkg yes | ||
Example (mated reads): | |||
$ velveth . 23 -shortPaired prefix.seq | |||
$ velvetg . -ins_length 500 -exp_cov 35 | |||
$ ~/bin/velvet.sh prefix 23 500 35 # combines both; prefix.seq should alternate fwd & rev reads | |||
Example (processing results): | |||
$ bank-transact -c -z -b prefix.bnk -m velvet_assy.afg | $ bank-transact -c -z -b prefix.bnk -m velvet_assy.afg | ||
$ bank2contig prefix.bnk > prefix.contig | $ bank2contig prefix.bnk > prefix.contig | ||
$ bank2fasta -b prefix.bnk > prefix.fasta | $ bank2fasta -b prefix.bnk > prefix.fasta | ||
$ listReadPlacedStatus -S -E prefix.bnk > prefix.singletons | $ listReadPlacedStatus -S -E prefix.bnk > prefix.singletons | ||
$ extractfromfastanames.pl prefix.seq prefix.singletons > prefix.singletons.seq | |||
$ listReadPlacedStatus prefix.bnk > prefix.status | |||
$ grep D prefix.status | awk '{print $5,$6}' | sort -u > prefix.pairs | |||
=== Edena === | === Edena === | ||
* contigs don't overlap | * Edena: Exact DE Novo Assembler | ||
* [http://www.genome.org/cgi/content/abstract/gr.072033.107v2 De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer] Genome Res. Apr 2008 | |||
* [http://www.genomic.ch/edena.php Project Web Site] Current verision: 2.1.1 03/17/2008 | |||
* Genomic Research Laboratory, Infectious Diseases Service, Geneva University Hospitals, Switzerland | |||
* Current release: 2.1.1 (03/17/2008) | |||
* Distribution: Compiled executables for Linux-32,Linux-64 | |||
* Uses a traditional read overlap-layout paradigm (adjusted for short reads) | |||
* Parameters: | |||
--overlapCutoff [int] Only consider overlaps >= than the specified size. (default 22) | |||
--minContigSize [int] Minimum size of the contigs to output. (default 100) | |||
* contigs don't overlap (ends seem to be trimmed) | |||
* a significant number of reads don't get assembled; max % assembled : 69% for Staphylococcus_aureus | |||
* reads should be 1/line (+1 line for the header) | |||
* for 66bp reads, the longest ctgs were obtained for --overlapCutoff 35 | |||
Example: | Example: | ||
Line 40: | Line 231: | ||
$ edena -e prefix.ovl -p prefix | $ edena -e prefix.ovl -p prefix | ||
$ ln -s prefix_contigs.fasta prefix.fasta | $ ln -s prefix_contigs.fasta prefix.fasta | ||
$ cat *_info.txt # assembly stats | |||
$ grep assembled prefix_info.txt | |||
$ cat prefix_contigs.fasta | grep "^>" | perl -ane '/(\d+)$/; print $1, "\n";' | getSummaryDescriptive.pl -t reads_assembled | |||
=== SOAPdenovo === | |||
* http://soap.genomics.org.cn/soapdenovo.html | |||
* Release 1.03 , 06-09-2009 | |||
* Example config | |||
[LIB] # single reads | |||
f=in.fa | |||
[LIB] # paired reads | |||
avg_ins=180 | |||
f1=fwd.seq | |||
f2=rev.seq | |||
[LIB] # paired reads | |||
avg_ins=180 | |||
p=in.seq | |||
* Example run: | |||
$ cat SOAPdenovo.sh | |||
/nfshomes/dpuiu/szdevel/SOAPdenovo_Release1.03/SOAPdenovo all -s SOAPdenovo.config -o out | |||
* Example turkey: paires | |||
elem min q1 q2 q3 max mean n50 sum | |||
reads 88357456 76 76 76 76 76 76 76 6715166656 # 44178728 mates; 2.39X cvg | |||
ctg 4664917 24 25 36 156 13863 198 761 926637298 | |||
scf 982536 100 258 517 1080 16376 839 1411 825003003 | |||
85505015 out.readOnContig | |||
* Location (turkey) | |||
/scratch1/dpuiu/SOAPdenovo/turkey/ | |||
=== Mira === | |||
* http://www.chevreux.org/projects_mira.html | |||
* Sanger, 454 and Solexa /Illumina | |||
* supposed to be "very" slow | |||
== Others == | |||
* [http://www.genome.org/cgi/content/abstract/gr.7337908v1 ALLPATHS : De novo assembly of whole-genome shotgun microreads] Genome Res. Mar 2008 | |||
* [http://www.bcgsc.ca/platform/bioinfo/software/ssake SSAKE] | |||
* VCAKE | |||
* SHARCGS | |||
* [http://www.dnastar.com/products/SMGA.php SeqMan Genome Assembler (SMGA)] by DNAStar(commercial) | |||
* Euler-SR | |||
== Comparative == | == Comparative == | ||
Line 45: | Line 282: | ||
=== AMOScmp === | === AMOScmp === | ||
* [http://sourceforge.net/apps/mediawiki/amos/index.php?title=AMOScmp Web site] | |||
* one should increase the make-consensus error rate if ref and qry are not that similar | |||
* all reads from layout are included in consensus | |||
casm-layout issues: | |||
MINCLUSTER = 20 | * if one sequence end does not align, the read is not assembled; the adjacent read is not assembled either if based on alignment to the ref. the two "theoretically" overlap; | ||
* for a circular bacterial chromosome, the BEGIN/END reads don't get assembled => cannot get a circular contig !!! | |||
* all CONTAINED reads should be assembled but the ones that are adjacent to dirty reads are left as singletons as well !!! | |||
* option -r places repeats randomly; what if we want more control over this? | |||
* There are some large negative gaps in scaffold file (Solexa read assemblies); Some contigs are "CONTAINED" in the contig just before the large negative gap; all these contig id's are larger than the .scaff last contig ; Contig alignments to the ref have breaks | |||
* casm-layout -g 10000 (defaul max gap) seems to produce the problem above; this value should be reduced (Ex 100) | |||
show-coords issues: | |||
* sometimes long CONTAINED annotated contigs contain dirty ends | |||
show-snps issues: | |||
* ambiguities are not shown | |||
* maximum init_size = 10000 query seqs | |||
* does not work on long seqs | |||
Example: | |||
$ sort -nk4 AMOS.scaff | head | |||
322 BE 87996 -24519 | |||
210 BE 7460 -7324 | |||
1271 BE 60490 -4112 | |||
17 BE 1022 -944 | |||
... | |||
$more AMOS.scaff | |||
... | |||
322 BE 87996 -24519 # 322 contains contigs 1770, 1771 ... | |||
1770 BE 39 1100 | |||
1771 BE 130 738 | |||
... | |||
1780 BE 72 25004 | |||
$ tail -1 AMOS.scaff | |||
1767 BE 39264 0 | |||
=== AMOScmp-shortReads-alignmentTrimmed === | |||
Defaults: | |||
MINCLUSTER = 16 | |||
MINMATCH = 16 | |||
MINLEN = 20 # delta-filter -l 20 | |||
MINOVL = 10 # previously 5 ; mis-assemblies occurred | |||
MAXTRIM = 10 | |||
MAXGAP = 100 # default 10,000 (casm-layout -g) creates some incorrect contigs | |||
MAJORITY = 70 # previously 50 | |||
CONSERR = 0.06 | |||
ALIGNWIGGLE = 2 # different than the default AMOScmp 15 | |||
* Reads are trimmed based on nucmer alignment to a reference genome | * Reads are trimmed based on nucmer alignment to a reference genome; reads that are adjacent to 0 cvg regions are not trimmed | ||
* If reads are not trimmed CONSERR should be increased, otherwise make-consensus fails | * If reads are not trimmed CONSERR should be increased, otherwise make-consensus fails | ||
* If reads are not trimmed, the avg contig length is significantly shorter than if reads are trimmed based on alignments => | * If reads are not trimmed, the avg contig length is significantly shorter than if reads are trimmed based on alignments => | ||
!!! read trimming is important. | !!! read trimming is important. | ||
* assembly is very fragmented if the percent identity between reads and the reference is less than 98% | |||
* simulations run on a PA-b1 10K read subset using PA14 as reference; random snps were introduced into PA14 to decrease the %identity; (MINCLUSTER=12, MINMATCH=16, MINLEN=20) | |||
99.4% : 1 contig (original data, no snps added to PA14) | |||
98% : 2 contigs | |||
96% : 10 contigs | |||
93% : 50 contigs | |||
93% : 37 contigs (MINMATCH=14) | |||
=== AMOScmp-shortReads === | |||
Defaults | |||
MINCLUSTER = 20 | |||
MINMATCH = 20 | |||
MINOVL = 5 | |||
MAXTRIM = 10 | |||
MAJORITY = 50 | |||
CONSERR = 0.06 | |||
ALIGNWIGGLE = 2 | |||
* No read trimming is done | |||
=== AMOScmp2 === | |||
* Runs AMOScmp of 2 data sets at once (with different parameters) | |||
Example: merge a comparative(AMOScmp) and a de-novo(velvet) assembly, using the same reference was used to get the comparative one | |||
... | |||
TYPE1 = C # Assembly 1 is a comparative one, already know the position of contigs; no need to align them to ref. | |||
TYPE2 = N # Assembly 2 is a de-novo one, don't know the position of contigs; we need to align them to ref. | |||
... | |||
$(SCRIPTDIR)/scaff2delta.pl $(SCAFF1) > $(ALIGN1) | |||
$(NUCMER) -maxmatch -c $(OVERLAP2) $(REF) $(SEQS2) -p=$(PREFIX2) | |||
... | |||
cp $(ALIGN1)$(ALIGN) | |||
cat $(ALIGN2) | grep -v $(PREFIX) | grep -v NUCMER >> $(ALIGN) | |||
.... | |||
=== maq === | |||
* Version: 0.6.6 | |||
* [http://maq.sourceforge.net/ Web site] | |||
* At high %identity (close to 100%) maq outperforms AMOScmp-shortReads-alignmentTrimmed | |||
* At lower % identity (<97%) AMOScmp-shortReads-alignmentTrimmed outperforms maq | |||
=== Other Scripts === | === Other Scripts === | ||
==== EMBOSS | ==== delta-filter.pl ==== | ||
* Similar to MUMMer delta-filter | |||
* The difference is that for multi copy repeats, it tries to place each one uniquely on the reference | |||
* "delta-filter -q" collapses all in the same reference region | |||
* "delta-filter -r" aligns the same one to multiple reference regions | |||
==== fasta2ovl.pl ==== | |||
* Program that finds short normal direction overlaps between two adjacent sequences in a multi-FASTA file. | |||
* It tests if the END of seqence N overlaps the BEGINING of sequence N+1. | |||
==== megamerge (EMBOSS) ==== | |||
* Can merge well short overlapping contigs | * Can merge well short overlapping contigs | ||
* Does not work well when the overlap identity is low | * Does not work well when the overlap identity is low | ||
* Min ovl length=2 | * Min ovl length=2 | ||
==== nucmer2ovl.pl ==== | |||
* Program that converts nucmer annotated alignments to an overlap file (either AMOS or TAB format) | |||
==== nucmer.pl ==== | |||
* It is a nucmer wrapper which runs nucmer with decreasingly alignment size on the sequences | |||
* Iteration n only tries to align sequences not aligned at iterations n-1,n-2,...1 | |||
* Shorter run time, less disk space (than running nucmer with a small alignment/cluster size) | |||
* Default parameters: | |||
-len 20,16,12 : nucmer minmatch (-l) | |||
-cluster 20,16,14 : nucmer mincluster (-c) | |||
-filter 20,20,20 : delta-filter minlen (-l) | |||
* Could be used by AMOScmp-shortReads, AMOScmp-shortReads-alignmentTrimmed | |||
==== nucmerAnnotate.pl ==== | |||
* Program that parses a nucmer coords file and re-annotates sequence alignments | |||
* The maximum length of the end sequence to be ignored can be set as a parameter | |||
==== revDelta.pl ==== | |||
* Parses a delta file and switches the ref & query | |||
==== revseq.pl ==== | |||
* Revese complements a seq file | |||
* adds ".rev" suffix to seq id's | |||
==== scaff2delta.pl ==== | |||
* Program that converts a scaff file to a delta file | |||
* Example: No need to run nucmer on the AMOScmp output contigs to get them aligned to reference since we already know their positions in the scaffold | |||
== Subassembly == | |||
== Reconciliator == | |||
* [http://bioinformatics.oxfordjournals.org/cgi/reprint/btm542v1.pdf Article] | |||
* [http://www.genome.umd.edu/reconciliation_instructions.htm Download] | |||
= Cases = | = Cases = | ||
Line 82: | Line 456: | ||
Solution: | Solution: | ||
* run minimus on the contigs | |||
* run minimus on | |||
=== Multipls data sets, one(multiple) denovo assemblers === | === Multipls data sets, one(multiple) denovo assemblers === | ||
Example: | Example: | ||
* Solexa & 454 data | |||
* velvet assembly on Solexa | |||
* newbler assembly on 454 | |||
Solution: | |||
* run minimus on the contigs | |||
== One reference sequence == | == One reference sequence == | ||
Example: | |||
* Solexa data | |||
* AMOScmp on Solexa | |||
* velvet on Solexa | |||
Solution: | Solution: | ||
* AMOScmp | * remove the redundant velvet contigs (CONTAINED in AMOScmp contigs) (Optional; leaving the CONTAINED contigs had very similar results, CONSERR did not have to be reduced) | ||
* | * run AMOScmp on AMOScmp & velvet contigs => join some AMOScmp gaps | ||
* run minimus to further merge contigs | |||
== | == Multiple reference sequences == | ||
Example: | |||
* Solexa data | |||
* AMOScmp on Solexa using Ref1 | |||
* AMOScmp on Solexa using Ref2 | |||
= | Solution: | ||
* remove the redundant AMOScmp Ref2 contigs (CONTAINED in AMOScmp contigs) (optional) | |||
* run AMOScmp on AMOScmp Ref1 & Ref2 contigs => join some AMOScmp Ref1 gaps | |||
* run minimus to further merge contigs | |||
---- | ---- | ||
Line 121: | Line 508: | ||
50bp+ 991 50 7362 393.73 792.41 390192 | 50bp+ 991 50 7362 393.73 792.41 390192 | ||
100bp+ 429 100 7362 815.36 1060.29 349793 | 100bp+ 429 100 7362 815.36 1060.29 349793 | ||
=== Simulated 32bp exact match reads === | === Simulated 32bp exact match reads === | ||
Line 151: | Line 519: | ||
velvet-sim denovo Sim 6538167 2207 45 56810 2820.91 5348.36 6225757 123591(2%) | velvet-sim denovo Sim 6538167 2207 45 56810 2820.91 5348.36 6225757 123591(2%) | ||
=== | === Solexa reads === | ||
Type #reads min max mean | Type #reads min max mean | ||
Solexa 6340136 32 32 32 (~31x coverage) | |||
Assemblies: | |||
Assembler type input-data #reads #ctgs min max mean stdev ctgs-sum #singletons | |||
AMOScmp comparative Solaxa 6340136 187 20 577929 34863.06 91692.34 6519394 698638(11%) | |||
velvet denovo Solaxa 6340136 25161 45 5057 241.83 212.61 6084887 | |||
velvet(100bp+) Solexa 19534 100 5057 288 220 5633400 | |||
edena denovo Solaxa 6340136 14084 100 5075 210.92 145.68 2970720 4893301(77%) | |||
Assembly breaks: | |||
velvet: 31 | |||
edena : 38 | |||
Merged assemblies(contigs): | |||
assemblers type input-data #reads #ctgs min max mean stdev ctgs-sum comments | |||
minimus(ovl20) denovo velvet 25161 19121 45 5057 311.3 297.27 5952381 #merged 25161-19121=6040 (25%) gaps | |||
minimus(ovl20) denovo edena+velvet 39245 18603 45 6688 322.32 311.02 5996244 | |||
== Pseudomonas aeruginosa b1 (PAb1)== | == Pseudomonas aeruginosa b1 (PAb1)== | ||
Line 182: | Line 565: | ||
edena denovo 11180 100 11300 552.36 610.52 6175460 3955865 (46%) much better than Ps !!! | edena denovo 11180 100 11300 552.36 610.52 6175460 3955865 (46%) much better than Ps !!! | ||
ssake denovo 185030 34 5490 77.21 141.23 14287079 3056893 | ssake denovo 185030 34 5490 77.21 141.23 14287079 3056893 | ||
euler-sr denovo 30394 26 5004 184 138 5601125 3437156 | |||
200bp+ contigs: | 200bp+ contigs: | ||
Line 196: | Line 580: | ||
Assembler type input-data #reads #ctgs min max mean stdev ctgs-sum comments | Assembler type input-data #reads #ctgs min max mean stdev ctgs-sum comments | ||
AMOSCmp-PA14-merge ? AMOSCmp-PA14(ctgs) 2053 1931 17 170485 3201.7 12981.57 6182486 2053-1931=122 gaps closed | AMOSCmp-PA14-merge ? AMOSCmp-PA14(ctgs) 2053 1931 17 170485 3201.7 12981.57 6182486 2053-1931=122 gaps closed | ||
== Staphylococcus aureus MW2 == | |||
=== Data === | |||
Files: | |||
/fs/szasmg2/Bacteria/dpuiu/Solexa/Staphylococcus_aureus/ | |||
Reads: | |||
3857879 35bp Solexa reads | |||
22843 contain at least 1 N; 10455 contain more than 1 N; 8244 contain more than 2 N's | |||
Reference: | |||
NC_003923 2820462 32.83 Staphylococcus aureus subsp. aureus MW2, complete genome. | |||
AP004832 20654 28.38 Staphylococcus aureus plasmid pMW2 DNA, complete sequence | |||
total 2841116 | |||
Coverage: 47.52X | |||
Other strains: | |||
NC_002952.2 2902619 32.81 Staphylococcus aureus subsp. aureus MRSA252, complete genome | |||
NC_002953.3 2799802 32.85 Staphylococcus aureus subsp. aureus MSSA476, complete genome | |||
NC_002951.2 2809422 32.82 Staphylococcus aureus subsp. aureus COL, complete genome | |||
NC_009632.1 2906507 32.95 Staphylococcus aureus subsp. aureus JH1, complete genome | |||
NC_009487.1 2906700 32.95 Staphylococcus aureus subsp. aureus JH9, complete genome | |||
NC_009782.1 2880168 32.88 Staphylococcus aureus subsp. aureus Mu3, complete genome | |||
NC_002758.2 2878529 32.88 Staphylococcus aureus subsp. aureus Mu50, complete genome | |||
NC_003923.1 2820462 32.83 Staphylococcus aureus subsp. aureus MW2, complete genome | |||
NC_002745.2 2814816 32.84 Staphylococcus aureus subsp. aureus N315, complete genome | |||
NC_007795.1 2821361 32.87 Staphylococcus aureus subsp. aureus NCTC 8325, complete genome | |||
NC_009641.1 2878897 32.89 Staphylococcus aureus subsp. aureus str. Newman, complete genome | |||
NC_007622.1 2742531 32.78 Staphylococcus aureus RF122, complete genome | |||
NC_007793.1 2872769 32.75 Staphylococcus aureus subsp. aureus USA300, complete genome | |||
NC_010079.1 2872915 32.76 Staphylococcus aureus subsp. aureus USA300_TCH1516, complete genome | |||
=== Sequence alignments === | |||
==== soap ==== | |||
3735781 of 3857879 (96.76%) aligned by "soap -v 5 -g 3 -s 12" | |||
Mismatches: | |||
~ 10% of the aligned reads had more at least 2 mismatches | |||
~ 4% of the aligned reads had more at least 3 mismatches | |||
cat Sa.soap.v5.s12.g3 | awk '{print $10}' | count.pl | |||
# total % | |||
0 2752024 73.67 | |||
1 633043 16.95 | |||
2 201439 5.39 | |||
3 84258 2.26 | |||
4 41787 1.12 | |||
5 22629 0.61 | |||
201 525 0.01 | |||
101 45 | |||
202 20 | |||
103 5 | |||
203 4 | |||
102 2 | |||
Total 3735781 | |||
=== Sequence assemblies === | |||
==== AMOScmp ==== | |||
OVL: 10 bp | |||
Output stats: | |||
All: | |||
#elem min max mean median n50 sum | |||
ctg 17 60 658745 167134 102144 485748 2841283 | |||
ctg(200bp+) 15 2090 658745 189411 102265 485748 2841158 | |||
singl 101422 23 35 35 35 35 3549538 | |||
breaks 0 | |||
* Only 2.62% of the reads are singletons | |||
* 101232 out of 101422 AMOS singletons (99.81%) are also velvet singletons | |||
==== velvet ==== | |||
===== OVL: 23 bp ===== | |||
Output stats: | |||
#elem min max mean median n50 sum | |||
ctg 1777 45 22892 1583 252 5337 2813162 | |||
ctg(100bp+) 1146 100 22892 2421 1167 5472 2774471 | |||
ctg(200bp+) 939 200 22892 2924 1764 5512 2745649 | |||
singl 289207 35 35 35 35 35 10122245 | |||
breaks 0 | |||
* 7.49% reads are singletons | |||
* 101232 out of 289207 velvet singletons are also AMOS singletons | |||
* 13189 reads are duplicates, connect 1126 contig pairs (1242 unique ctgs in these pairs) | |||
* 1114 out of 1242 contig overlaps were identified by running nucmer (-c 12 -l 12) | |||
ovlSize count | |||
22 740 | |||
21 99 | |||
20 55 | |||
19 36 | |||
18 52 | |||
17 33 | |||
16 17 | |||
15 29 | |||
14 21 | |||
13 15 | |||
12 17 | |||
Total 1114 | |||
Contig coverage & alignments to the reference: | |||
$ grep "^##" Sa.contig | perl -ane 'print $F[1]*35/$F[2],"\n";' > Sa.cvg | |||
#ctgs min max mean median n50 sum | |||
1777 1.55 517.25 43 37.41 44.36 76159.86 | |||
$ show-coords -I 94 1con-contigs.delta | grep CONTAIN | awk '{print $19,$13}' | count.pl | sort -n >! 1con-contigs.qry_hits | |||
$ join 1con-contigs.qry_hits Sa.cvg | awk '{print $3,$4}' | ~/bin/getCorrelationDescriptive.pl | |||
0.61 # coeff. of correlation > 0 | |||
0.81 # coeff. of correlation for ctgs>100 bp | |||
0cvg regions | |||
$show-coords Sa.all.delta | ~/bin/getNucmerCoverage.pl -M 0 | getSummary.pl -c 3 | |||
#elem min max mean median n50 sum | |||
gap 305 1 570 34 19 70 10521 | |||
===== OVL: 21 bp ===== | |||
#elem min max mean median n50 sum | |||
all 2099 41 18254 1342 258 4384 2817753 | |||
45bp+ 1830 45 18254 1534 448 4391 2806596 | |||
100bp+ 1337 100 18254 2075 1083 4558 2774667 | |||
200bp+ 1115 200 18254 2461 1486 4625 2743892 | |||
singl 224006 | |||
OVL: 23 bp seems to generate a better assembly on this data set !!! | |||
==== edena ==== | |||
Output stats: | |||
#elem min max mean median n50 sum | |||
ctg 1175 100 22893 2350 1094 5514 2760953 | |||
ctg(200bp+) 935 204 22893 2918 1766 5594 2727929 | |||
breaks 0 | |||
assembled_reads 2676002 (69.8%) | |||
singletons 1181877 | |||
* In this case velvet & edena have similar contig stats !!! | |||
* All edena contigs align to velvet contigs | |||
* Not all edena contigs are CONTAINED/IDENTICAL to velvet contigs !!! | |||
#elem min max mean median n50 sum | |||
edena.uniq 360 100 19924 2475 273 7791 891123 | |||
$ show-coords -H velvet-edena.filter-q.delta | egrep 'CONTAINED|BEGIN|END' | awk '{print $19,$13}' | count.pl -m 2 | wc -l # 193 edena contigs CONTAIN multiple velvet contigs | |||
$ show-coords -H velvet-edena.filter-q.delta | egrep 'IDEN' | wc -l # 435 edena contigs are IDENTICAL to velvet contigs | |||
==== euler-sr ==== | |||
#elem min max mean median n50 sum | |||
all 1349 22 43140 2073 109 10523 2796347 | |||
200bp+ 566 201 43140 4859 2334 10662 2750118 | |||
=== Contig assemblies === | |||
==== AMOScmp on velvet contigs ==== | |||
Min Ovl: 10bp | |||
Min Cluster(nucmer -c): 40 (many velvet contigs are 45-65 bp) | |||
All: | |||
#elem min max mean median n50 sum | |||
ctg 567 45 73756 4890 455 20846 2772573 | |||
singl 5 45 11556 3159 58 11556 15794 | |||
all 572 45 73756 4875 455 20846 2788367 | |||
breaks 2 (1 contig with 2 overlapping alignments; no repeat; why is nucmer breaking the alignment?) | |||
200 bp+: | |||
#elem min max mean median n50 sum | |||
ctg 339 201 73756 8118 3413 20846 2752170 | |||
singl 2 4086 11556 7821 11556 11556 15642 | |||
all 341 201 73756 8117 3494 20846 2767812 | |||
breaks 2 | |||
* Reduced the number of contigs from 1777 to 567 !!! | |||
==== minimus on velvet contigs ==== | |||
Min Ovl: 16bp | |||
All: | |||
#elem min max mean median n50 sum | |||
ctg 316 45 39490 5812 3243 12483 1836527 | |||
singl 797 45 22892 1204 161 4724 959356 | |||
all 1113 45 39490 2512 299 9399 2795883 | |||
breaks 20 | |||
200 bp+: | |||
#elem min max mean median n50 sum | |||
ctg 258 202 39490 7093 5069 12483 1829967 | |||
singl 368 200 22892 2504 1347 5210 921478 | |||
all 626 200 39490 4395 2313 9508 2751445 | |||
breaks 19 | |||
* Reduced the number of contigs from 1777 to 1113 !!! | |||
==== minimus1 on velvet contigs ==== | |||
Min Ovl: 16bp | |||
All: | |||
#elem min max mean median n50 sum | |||
ctg 277 70 32618 6623 4709 11914 1834597 | |||
singl 1022 45 19128 948 96 5057 968872 | |||
all 1299 45 32618 2158 154 9526 2803469 | |||
breaks 15 | |||
200 bp+: | |||
#elem min max mean median n50 sum | |||
ctg 247 200 32618 7413 6004 11984 1830980 | |||
singl 354 200 19128 2590 1397 5294 916800 | |||
all 601 200 32618 4572 2495 9757 2747780 | |||
breaks 15 | |||
* Reduced the number of contigs from 1777 to 1299 !!! | |||
* Generated fewer, larger big contigs than minimus | |||
Fewer contigs get merged if: | |||
* Min Ovl is dropped from 16bp to 14bp | |||
* Only the overlaps represented by Duplicate reads are used | |||
==== minimus1 on edena contigs ==== | |||
All: | |||
#elem min max mean median n50 sum | |||
ctg 67 240 24824 5610 4814 8671 375847 | |||
singl 1022 100 19924 2331 1084 5455 2382674 | |||
all 1089 100 24824 2533 1183 5933 2758521 | |||
breaks 27 | |||
200 bp+: | |||
#elem min max mean median n50 sum | |||
ctg 67 240 24824 5610 4814 8671 375847 | |||
singl 811 204 19924 2902 1766 5588 2353676 | |||
all 878 204 24824 3109 1862 5984 2729523 | |||
breaks 27 | |||
Edena contig ends get trimmed. There are fewer overlap among them than velvet contigs!!! | |||
==== AMOScmp on velvet & edena contigs ==== | |||
Input | |||
All: | |||
desc #elem min max mean stdev sum | |||
velvet 1777 45 22892 1583 2769 2813162 | |||
edena 1175 100 22893 2350 3152 2760953 | |||
edena.uniq 360 100 19924 2475 3886 891123 | |||
200bp+ | |||
desc #elem min max mean stdev sum | |||
velvet 939 200 22892 2924 3272 2745649 | |||
edena 935 204 22893 2918 3303 2727929 | |||
edena.uniq 202 206 19924 4305 4394 869552 | |||
Output: | |||
All: | |||
#elem min max mean median n50 sum | |||
ctg 461 45 81678 6029 703 21943 2779591 | |||
singl 7 45 11556 4283 2634 11556 29984 | |||
all 468 45 81678 6003 719 21353 2809575 | |||
breaks 0 | |||
200 bp+: | |||
#elem min max mean median n50 sum | |||
ctg 289 205 81678 9565 4401 21943 2764375 | |||
singl 4 2634 11556 7458 11556 11556 29832 | |||
all 293 205 81678 9537 4401 21943 2794207 | |||
breaks 0 | |||
==== minimus2 on velvet & edena contigs ==== | |||
Contig stats: | |||
All: | |||
#elem min max mean median n50 sum | |||
ctg 740 67 38980 3793 1290 10553 2807159 | |||
singl 366 45 6795 265 63 3684 96822 | |||
all 1106 45 38980 2626 263 10104 2903981 | |||
breaks 45 | |||
200 bp+: | |||
#elem min max mean median n50 sum | |||
ctg 560 202 38980 4969 2583 10561 2782466 | |||
singl 31 203 6795 2424 1379 5455 75150 | |||
all 591 202 38980 4835 2574 10197 2857616 | |||
breaks 44 | |||
== Staphylococcus aureus MW2 "mutated" 10% of sequence are random SNPs == | |||
=== Sequence assemblies === | |||
==== AMOScmp ==== | |||
OVL: 10 bp | |||
... | |||
#elem min max mean median n50 sum | |||
all 22431 20 705 107 87 124 2409008 | |||
200bp+ 1997 200 705 267 246 260 533882 | |||
singl 1934772 | |||
==== velvet ==== | |||
Same as the one above | |||
=== Contig assemblies === | |||
==== AMOScmp on velvet contigs ==== | |||
Min Ovl: 10bp | |||
Min Match (nucmer -l): 20 | |||
Min Cluster (nucmer -c): 40 | |||
All: | |||
#elem min max mean median n50 sum | |||
ctg 496 45 47387 5504 2076 14370 2729975 | |||
singl 629 45 11556 112 60 131 70262 | |||
all 1125 45 47387 2489 96 13785 2800237 | |||
breaks 6 | |||
200 bp+: | |||
#elem min max mean median n50 sum | |||
ctg 405 204 47387 6715 3277 14370 2719753 | |||
singl 28 200 11556 1042 288 11556 29165 | |||
all 433 200 47387 6349 2878 14036 2748918 | |||
breaks 4 | |||
Min Ovl: 10bp | |||
Min Match (nucmer -l): 16 | |||
Min Cluster (nucmer -c): 32 | |||
All: | |||
#elem min max mean median n50 sum | |||
ctg 544 45 53996 5044 1215 16039 2744166 | |||
singl 385 45 11556 133 53 5057 51297 | |||
all 929 45 53996 3009 112 15857 2795463 | |||
breaks 8 | |||
200 bp+: | |||
#elem min max mean median n50 sum | |||
ctg 391 201 53996 6980 3156 16443 2729048 | |||
singl 8 222 11556 3543 4086 11556 28343 | |||
all 399 201 53996 6911 3156 16039 2757391 | |||
breaks 6 | |||
== Strep suis17 == | |||
=== Data === | |||
* 2726374 36bp Solexa reads | |||
* 2007491bp 41.30 %GC Reference | |||
* Coverage: 49X | |||
* Location: | |||
/fs/szdata/Solexa/Streptococcus_suis/ | |||
/fs/szasmg2/Bacteria/dpuiu/Solexa/Streptococcus_suis/ | |||
=== Sequence alignments === | |||
==== soap ==== | |||
2261784 of 2726374 (82.95%) aligned by "soap -v 5 -g 0 -s 10" | |||
Mismatches: | |||
more Ss.soap | awk '{print $10}' | count.pl | |||
mism. hits hits(%) | |||
0 1235237 54.61 | |||
1 548822 24.27 | |||
2 238795 10.56 | |||
3 120622 5.33 | |||
4 71451 3.16 | |||
5 46857 2.07 | |||
Total 2261784 | |||
=== Sequence assembly === | |||
==== Velvet ==== | |||
Contig stats: | |||
#elem min max mean median n50 sum | |||
all 917 45 24419 2154 278 6661 1975147 | |||
200bp+ 479 202 24419 4051 2940 6966 1940363 | |||
breaks 2 (1 contig aligned in 2 overlapping pieces; tandem repeat in the 217341-217404 region of Ss) | |||
Different coverage assemblies: | |||
cvg #ctgs min max mean median n50 sum %singletons breaks | |||
24.5 4630 45 5267 436 284 709 2016871 23.98 6 | |||
29.4 2797 45 7739 716 426 1333 2001964 23.46 0 | |||
34.3 1797 45 9347 1107 649 2230 1989102 23.34 0 | |||
39.2 1325 45 24419 1496 663 3724 1982292 23.32 2 | |||
44.1 1016 45 23272 1946 516 5290 1976890 23.30 2 | |||
49.0 917 45 24419 2154 278 6661 1975147 23.30 2 | |||
==== Edena ==== | |||
Contig stats: | |||
#elem min max mean median n50 sum | |||
all 1668 100 8829 1135 743 1968 1893815 | |||
200bp+ 1390 200 8829 1333 959 1985 1853354 | |||
breaks 2 (1 contig aligned in 2 overlapping pieces; tandem repeat in the 217341-217404 region of Ss) | |||
Singletons: 1582493 (58%) | |||
==== Euler-SR ==== | |||
Command: | |||
$ AssembleIllumina.pl Ss.seq working_dir -vertexSize 21 | |||
Contig stats: | |||
#elem min max mean median n50 sum | |||
all 2658 22 19768 757 42 4867 2013259 | |||
200bp+ 659 200 19768 2932 1973 5117 1932406 | |||
== Escherichia coli str. K-12 substr. MG1655 (Solexa-paired) == | |||
=== Data === | |||
* 7,047,668 + 10,408,224 paired-end 36bp Solexa reads => 105X + 161X cvg | |||
* 1M => 15.51X cvg | |||
* NC_000913.2 4,639,675 50.79%GC reference | |||
* Location: | |||
/fs/szasmg2/Bacteria/dpuiu/Solexa-pair/Ecoli | |||
* Formatting : split seq into 2 | |||
cat Ecoli.seq | ~/bin/solexap2amos.pl -len 36 > ! Ecoli.paired.afg | |||
cat Ecoli.seq | ~/bin/solexap2amos.pl -len 36 -noamos > ! Ecoli.paired.seq | |||
=== Sequence assembly (1M reads: 15X cvg) === | |||
* Velvet : hash length of 23 works best; also use -exp_cov 15 | |||
* AMOScmp: OVL=10; the smaller the better | |||
* Ctg stats: | |||
. elem min q1 q2 q3 max mean n50 sum | |||
AMOScmp-shortReads 331 40 1941 7826 18316 172,800 14022 26474 4641532 better !!! | |||
AMOScmp-shortReads-trimmed 372 40 1916 7226 17348 98,155 12477 24730 4641600 | |||
velvet.ctg 12073 45 121 250 461 4690 366 568 4430625 | |||
velvet.scf 715 45 58 70 77 313779 6671 104454 4769943 | |||
=== Sequence assembly (7+M reads: 100+X cvg) === | |||
* Velvet : hash length of 27 works best; also use -exp_cov 100 | |||
. elem min q1 q2 q3 max mean n50 sum | |||
velvet.ctg 375 53 63 82 6105 192151 12162 59972 4561051 | |||
velvet.scf 272 53 60 71 133 860695 16785 171744 4565616 | |||
== Escherichia coli str. K-12 substr. MG1655 (454-paired) == | |||
=== Data === | |||
* Formatting : remove 44 bp linker | |||
nucmer -c 10 -l 10 L454.seq Ecoli.merged.seq -p Ecoli --forward | |||
cat Ecoli.delta | perl ~/bin/filterDeltaAnnotated.pl | ~/bin/delta2alignRange.pl -qry >! Ecoli.pair.seqs | |||
difference.pl Ecoli.seqs Ecoli.pair.seqs >! Ecoli.single.seqs | |||
cat Ecoli.pair.seqs | p '$F[1]--; print "$F[0] $F[0].f 1 $F[1]\n"' >! Ecoli.pair.f.pos | |||
cat Ecoli.pair.seqs | p '$F[2]++; print "$F[0] $F[0].r $F[2] $F[3]\n"' >! Ecoli.pair.r.pos |
Latest revision as of 01:21, 2 September 2010
Papers
Assemblers
Denovo
CA
See CA
Abyss
Atlas
Euler-SR
- Euler-SR source
- README
- Short read fragment assembly of bacterial genomes (Genome Research 2008)
- De Novo Repeat Classification and Fragment Assembly (Genome Research 2004)
- Automated De Novo Identification of Repeat Sequence Families in Sequenced Genomes (Genome Research 2002)
- De novo fragment assembly with short mate-paired reads: Does the read length matter? (Genome Research 2008)
- Cmd:
setenv EUSRC /fs/szdevel/dpuiu/euler-sr/ setenv MACHTYPE x86_64 set path=($path ${EUSRC}/assembly) set path=($path ${EUSRC}/assembly/${MACHTYPE}) /fs/szdevel/dpuiu/euler-sr/assembly/Assemble.pl readsFile vertexSize ruleFile
- 35bp Solexa reads: vertexSize=25 vs vertexSize=21
vertexSize=21 more agreagte assembly; some tandem repeats in 2K+ contigs were collapsed (detected by aligning to Strep aureus finished genome) these repeats were present in small ctgs (~ 200bp) when euler was run with vertexSize=25 these repeats were not present in the velvet assembly
newbler (454)
- Command:
/fs/sz-user-supported/Linux-x86_64/packages/rig/bin/runAssembly -o dir prefix.sff
minimus
hash-overlap issues:
- very slow on large sequences (runs Smith-Waterman)
- finds all overlaps (not only CONTAINED|BEGIN|END); of no use to the assembler
- default ovl: 40 bp (too large)
- min ovl: 15 (set by minimizer)
- 20 bp overlap ok;
- the min ovl imposed by the next step (tigger) is 5bp
minimus1
- hash-overlap was replaced with a nucmer based overlapper
- nucmer takes long time and lots of space to find very short overlaps (< 16bp?)
- We can have more control on the overlaps that are used (can easily filter the nucmer output)
- can discard CONTAINED/IDENTITY alignments in order to avoid collapsing repeats
- can mask the middle of the sequences to make the alignment process faster; (Issue: CONTAINED overlaps appear as BEGIN/END overlaps)
- Example:
.... OVERLAP=16 MINID=94 MAXTRIM=20 ... 20: $(NUCMER) -maxmatch -l $(OVERLAP) -c $(OVERLAP) $(SEQS) $(SEQS) -p $(PREFIX) 24: $(SHOWCOORDS) -H -c -l -o -r -I $(MINID) $(ALIGN) | egrep 'BEGIN|END' > $(COORDS) 25: $(SCRIPTDIR)/nucmer2ovl.pl -ignore $(MAXTRIM) -tab $(COORDS)> $(OVLTAB) 26: $(SCRIPTDIR)/ovl2OVL.pl $(OVLTAB) > $(OVL) 28: $(BINDIR)/bank-transact -z -b $(BANK) -m $(OVL)
minimus2
- assembles two sequence sets (S1,S2)
- Instead of merging the sets and aligning all versus all (S1+S2 : S1+S2) it aligns one versus the other:
S1:S2 S1:S1 (optional) S2:S2 (optional)
- Each alignment can be run with different parameters
Example 1: merge two de-novo assemblies
... TYPE1=N TYPE2=N ... OVERLAP11 = 16 OVERLAP22 = 16 OVERLAP12 = 40 ... # align sets $(NUCMER) -maxmatch -l $(OVERLAP11) -c $(OVERLAP11) $(SEQS1) $(SEQS1) -p $(PREFIX11) $(NUCMER) -maxmatch -c $(OVERLAP12) $(SEQS1) $(SEQS2) -p $(PREFIX12) $(NUCMER) -maxmatch -l $(OVERLAP22) -c $(OVERLAP22) $(SEQS2) $(SEQS2) -p $(PREFIX22) ... # merge alignments $(SHOWCOORDS) $(ALIGN11) | egrep 'BEGIN|END' > $(COORDS) $(SHOWCOORDS) $(ALIGN12) >> $(COORDS) $(SHOWCOORDS) $(ALIGN22) | egrep 'BEGIN|END' >> $(COORDS) ... nucmer2ovl.pl $(COORDS) > $(OVLTAB) ...
Example 2: merge a comparative and a de-novo assembly
... TYPE1 = C TYPE2 = N ... # look only for short overlaps among adjacent sequences from S1 $(SCRIPTDIR)/fasta2ovl.pl -tab $(SEQS1) > $(OVLTAB) ... # look for all overlaps among one sequence from S1 and one from S2 $(NUCMER) -maxmatch -c $(OVERLAP12) $(SEQS1) $(SEQS2) -p $(PREFIX12) $(SHOWCOORDS) $(ALIGN12) | $(SCRIPTDIR)/nucmer2ovl.pl >> $(OVLTAB) ... # look for end overlaps among sequences from S2 $(NUCMER) -maxmatch -l $(OVERLAP22) -c $(OVERLAP22) $(SEQS2) $(SEQS2) -p $(PREFIX22) $(SHOWCOORDS) $(ALIGN22) | egrep 'BEGIN|END' > $(COORDS22) | $(SCRIPTDIR)/nucmer2ovl.pl >> $(OVLTAB) ...
SSAKE
- Web Site
- Short Sequence Assembly by K-mer search and 3' read Extension
- Rene L Warren, Granger G Sutton, Steven JM Jones, Robert A Holt. "Assembling millions of short DNA sequences using SSAKE" Bioinformatics. Vol. 23 no. 4 2007, pages 500–501
- Uses a greedy assembly algorithm: progressively searches for perfect 3’-most k-mers using a DNA prefix tree to identify overlaps between any two sequences; stringently clusters short reads into contigs
- Copyright (c) 2006-2008 Canada's Michael Smith Genome Science Centre.
- Current release: SSAKE 3.2.1 (08/21/2008)
- Distribution: PERL script; runs on Linux systems
- Parameters:
-m Minimum number of reads needed to call a base during overhang consensus build up (default -m 16) -o Minimum number of reads needed to call a base during an extension (default -o 2) -r Minimum base ratio used to accept a overhang consensus base (default -r 0.7) -t Max number of bases to trim (default 0)
- "Especially" targeted towards Solexa seq; 454 reads too long & homopol problem
- Can handle mated reads
- TQS.py should be used for qual trimming
Example:
$ grep -c "^>" Pa.seq # total number of reads $ (time ssake_3.0.pl -f Pa.seq) > ssake_3.0.log # version 3.0 $ (time ssake_3.2.pl -f Pa.seq -t 1) > ssake_3.2.log # version 3.2 $ ln -s Pa*contigs Pa.fasta $ ln -s Pa*singlets Pa.singlets.seq $ cat Pa.fasta | grep "^>" | perl -ane '/read(\d+)/; print $1, "\n";' | getSummaryDescriptive.pl -t assembled_reads # number of assembled reads $ cat Pa.singlets.seq | grep "^>" | perl -ane '/read(\d+)/; print $1, "\n";' | getSummaryDescriptive.pl -t singletons # number of singletons $ grep -v N Pa.seq # number of reads which contain ambiguities and will be discarder from the beginning (should be added to the singletons)
VCAKE
Velvet
- Sequence assembler for very short reads
- Velvet: Algorithms for De Novo Short Read Assembly Using De Bruijn Graphs Genome Res. Mar 2008
- Project Web Site at EBI Current version: 0.6 02/06/2008: Velvet 0.6
- Uses Bruijn graphs to assemble sequences: first uses an error correction algorithm to merge sequences which belong together, then a repeat solver separates paths sharing local overlaps
- Distribution: C program, to be compiled with gcc
- Open Source, GPL agreement, EBI.
- Current release: 0.7.55
- Parameters:
- hash_length: odd integer (if even, it will be decremented) <= 31 (if above, will be reduced) * minimum overlap length (30-40X coverage data set): 17bp usually gives the fewest contigs (some case) 23bp is also ok (used for Pa,Ps) 15bp is too low => too many short contigs * -long option for including "long sequences" (Sanger, 454 or even reference sequences) very slow
- At ~30X, max ctg len is ~ 5KB
- At >45X, max ctg len is ~ 20KB
- Longest contig on simulated data is 56Kbp (Ps, no error 32bp reads, 32X cvg)
- Max ovl=32
- For longer reads (66bp), some of the repeats ( ~100bp) get assembled partially; -cov_cutoff parameter must be used, otherwise low cvg/quality ctgs are reported
- velvet_assy.afg lists the assembled reads (TLE messages)
- some reads extend beyond the contig ends
- there are some reads shared by multiple contigs (status:D) ; at least one of the instances occurs at a contig end (goes beyond the contig end)
- many contigs overlap by a few bp; max val= min_ovl-1
- minimum_ctg length =2*minimum_ovl-1 (23=>45, 21=>41 ...)
Example (unmated reads):
$ velveth . -[short|shortPaired|long|longPaired] ovl prefix.seq # ovl should be an odd integer like 21, 23 $ velvetg . -read_trkg yes
Example (mated reads):
$ velveth . 23 -shortPaired prefix.seq $ velvetg . -ins_length 500 -exp_cov 35
$ ~/bin/velvet.sh prefix 23 500 35 # combines both; prefix.seq should alternate fwd & rev reads
Example (processing results):
$ bank-transact -c -z -b prefix.bnk -m velvet_assy.afg $ bank2contig prefix.bnk > prefix.contig $ bank2fasta -b prefix.bnk > prefix.fasta $ listReadPlacedStatus -S -E prefix.bnk > prefix.singletons $ extractfromfastanames.pl prefix.seq prefix.singletons > prefix.singletons.seq $ listReadPlacedStatus prefix.bnk > prefix.status $ grep D prefix.status | awk '{print $5,$6}' | sort -u > prefix.pairs
Edena
- Edena: Exact DE Novo Assembler
- De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer Genome Res. Apr 2008
- Project Web Site Current verision: 2.1.1 03/17/2008
- Genomic Research Laboratory, Infectious Diseases Service, Geneva University Hospitals, Switzerland
- Current release: 2.1.1 (03/17/2008)
- Distribution: Compiled executables for Linux-32,Linux-64
- Uses a traditional read overlap-layout paradigm (adjusted for short reads)
- Parameters:
--overlapCutoff [int] Only consider overlaps >= than the specified size. (default 22) --minContigSize [int] Minimum size of the contigs to output. (default 100)
- contigs don't overlap (ends seem to be trimmed)
- a significant number of reads don't get assembled; max % assembled : 69% for Staphylococcus_aureus
- reads should be 1/line (+1 line for the header)
- for 66bp reads, the longest ctgs were obtained for --overlapCutoff 35
Example:
$ edena -r prefix.seq -p ovl $ edena -e prefix.ovl -p prefix $ ln -s prefix_contigs.fasta prefix.fasta $ cat *_info.txt # assembly stats $ grep assembled prefix_info.txt $ cat prefix_contigs.fasta | grep "^>" | perl -ane '/(\d+)$/; print $1, "\n";' | getSummaryDescriptive.pl -t reads_assembled
SOAPdenovo
- http://soap.genomics.org.cn/soapdenovo.html
- Release 1.03 , 06-09-2009
- Example config
[LIB] # single reads f=in.fa [LIB] # paired reads avg_ins=180 f1=fwd.seq f2=rev.seq [LIB] # paired reads avg_ins=180 p=in.seq
- Example run:
$ cat SOAPdenovo.sh /nfshomes/dpuiu/szdevel/SOAPdenovo_Release1.03/SOAPdenovo all -s SOAPdenovo.config -o out
- Example turkey: paires
elem min q1 q2 q3 max mean n50 sum reads 88357456 76 76 76 76 76 76 76 6715166656 # 44178728 mates; 2.39X cvg ctg 4664917 24 25 36 156 13863 198 761 926637298 scf 982536 100 258 517 1080 16376 839 1411 825003003
85505015 out.readOnContig
- Location (turkey)
/scratch1/dpuiu/SOAPdenovo/turkey/
Mira
- http://www.chevreux.org/projects_mira.html
- Sanger, 454 and Solexa /Illumina
- supposed to be "very" slow
Others
- ALLPATHS : De novo assembly of whole-genome shotgun microreads Genome Res. Mar 2008
- SSAKE
- VCAKE
- SHARCGS
- SeqMan Genome Assembler (SMGA) by DNAStar(commercial)
- Euler-SR
Comparative
AMOScmp
- Web site
- one should increase the make-consensus error rate if ref and qry are not that similar
- all reads from layout are included in consensus
casm-layout issues:
- if one sequence end does not align, the read is not assembled; the adjacent read is not assembled either if based on alignment to the ref. the two "theoretically" overlap;
- for a circular bacterial chromosome, the BEGIN/END reads don't get assembled => cannot get a circular contig !!!
- all CONTAINED reads should be assembled but the ones that are adjacent to dirty reads are left as singletons as well !!!
- option -r places repeats randomly; what if we want more control over this?
- There are some large negative gaps in scaffold file (Solexa read assemblies); Some contigs are "CONTAINED" in the contig just before the large negative gap; all these contig id's are larger than the .scaff last contig ; Contig alignments to the ref have breaks
- casm-layout -g 10000 (defaul max gap) seems to produce the problem above; this value should be reduced (Ex 100)
show-coords issues:
- sometimes long CONTAINED annotated contigs contain dirty ends
show-snps issues:
- ambiguities are not shown
- maximum init_size = 10000 query seqs
- does not work on long seqs
Example:
$ sort -nk4 AMOS.scaff | head 322 BE 87996 -24519 210 BE 7460 -7324 1271 BE 60490 -4112 17 BE 1022 -944 ...
$more AMOS.scaff ... 322 BE 87996 -24519 # 322 contains contigs 1770, 1771 ... 1770 BE 39 1100 1771 BE 130 738 ... 1780 BE 72 25004
$ tail -1 AMOS.scaff 1767 BE 39264 0
AMOScmp-shortReads-alignmentTrimmed
Defaults:
MINCLUSTER = 16 MINMATCH = 16 MINLEN = 20 # delta-filter -l 20 MINOVL = 10 # previously 5 ; mis-assemblies occurred MAXTRIM = 10 MAXGAP = 100 # default 10,000 (casm-layout -g) creates some incorrect contigs MAJORITY = 70 # previously 50 CONSERR = 0.06 ALIGNWIGGLE = 2 # different than the default AMOScmp 15
* Reads are trimmed based on nucmer alignment to a reference genome; reads that are adjacent to 0 cvg regions are not trimmed * If reads are not trimmed CONSERR should be increased, otherwise make-consensus fails * If reads are not trimmed, the avg contig length is significantly shorter than if reads are trimmed based on alignments => !!! read trimming is important. * assembly is very fragmented if the percent identity between reads and the reference is less than 98% * simulations run on a PA-b1 10K read subset using PA14 as reference; random snps were introduced into PA14 to decrease the %identity; (MINCLUSTER=12, MINMATCH=16, MINLEN=20) 99.4% : 1 contig (original data, no snps added to PA14) 98% : 2 contigs 96% : 10 contigs 93% : 50 contigs 93% : 37 contigs (MINMATCH=14)
AMOScmp-shortReads
Defaults
MINCLUSTER = 20 MINMATCH = 20 MINOVL = 5 MAXTRIM = 10 MAJORITY = 50 CONSERR = 0.06 ALIGNWIGGLE = 2
* No read trimming is done
AMOScmp2
- Runs AMOScmp of 2 data sets at once (with different parameters)
Example: merge a comparative(AMOScmp) and a de-novo(velvet) assembly, using the same reference was used to get the comparative one
... TYPE1 = C # Assembly 1 is a comparative one, already know the position of contigs; no need to align them to ref. TYPE2 = N # Assembly 2 is a de-novo one, don't know the position of contigs; we need to align them to ref. ... $(SCRIPTDIR)/scaff2delta.pl $(SCAFF1) > $(ALIGN1) $(NUCMER) -maxmatch -c $(OVERLAP2) $(REF) $(SEQS2) -p=$(PREFIX2) ... cp $(ALIGN1)$(ALIGN) cat $(ALIGN2) | grep -v $(PREFIX) | grep -v NUCMER >> $(ALIGN) ....
maq
- Version: 0.6.6
- Web site
- At high %identity (close to 100%) maq outperforms AMOScmp-shortReads-alignmentTrimmed
- At lower % identity (<97%) AMOScmp-shortReads-alignmentTrimmed outperforms maq
Other Scripts
delta-filter.pl
- Similar to MUMMer delta-filter
- The difference is that for multi copy repeats, it tries to place each one uniquely on the reference
- "delta-filter -q" collapses all in the same reference region
- "delta-filter -r" aligns the same one to multiple reference regions
fasta2ovl.pl
- Program that finds short normal direction overlaps between two adjacent sequences in a multi-FASTA file.
- It tests if the END of seqence N overlaps the BEGINING of sequence N+1.
megamerge (EMBOSS)
* Can merge well short overlapping contigs * Does not work well when the overlap identity is low * Min ovl length=2
nucmer2ovl.pl
- Program that converts nucmer annotated alignments to an overlap file (either AMOS or TAB format)
nucmer.pl
- It is a nucmer wrapper which runs nucmer with decreasingly alignment size on the sequences
- Iteration n only tries to align sequences not aligned at iterations n-1,n-2,...1
- Shorter run time, less disk space (than running nucmer with a small alignment/cluster size)
- Default parameters:
-len 20,16,12 : nucmer minmatch (-l) -cluster 20,16,14 : nucmer mincluster (-c) -filter 20,20,20 : delta-filter minlen (-l)
- Could be used by AMOScmp-shortReads, AMOScmp-shortReads-alignmentTrimmed
nucmerAnnotate.pl
- Program that parses a nucmer coords file and re-annotates sequence alignments
- The maximum length of the end sequence to be ignored can be set as a parameter
revDelta.pl
- Parses a delta file and switches the ref & query
revseq.pl
- Revese complements a seq file
- adds ".rev" suffix to seq id's
scaff2delta.pl
- Program that converts a scaff file to a delta file
- Example: No need to run nucmer on the AMOScmp output contigs to get them aligned to reference since we already know their positions in the scaffold
Subassembly
Reconciliator
Cases
No reference sequence
One data set, multiple denovo assemblers
Example:
* Solexa data * edena & velvet assemblers
Solution:
* run minimus on the contigs
Multipls data sets, one(multiple) denovo assemblers
Example:
* Solexa & 454 data * velvet assembly on Solexa * newbler assembly on 454
Solution:
* run minimus on the contigs
One reference sequence
Example:
* Solexa data * AMOScmp on Solexa * velvet on Solexa
Solution:
* remove the redundant velvet contigs (CONTAINED in AMOScmp contigs) (Optional; leaving the CONTAINED contigs had very similar results, CONSERR did not have to be reduced) * run AMOScmp on AMOScmp & velvet contigs => join some AMOScmp gaps * run minimus to further merge contigs
Multiple reference sequences
Example:
* Solexa data * AMOScmp on Solexa using Ref1 * AMOScmp on Solexa using Ref2
Solution:
* remove the redundant AMOScmp Ref2 contigs (CONTAINED in AMOScmp contigs) (optional) * run AMOScmp on AMOScmp Ref1 & Ref2 contigs => join some AMOScmp Ref1 gaps * run minimus to further merge contigs
Examples
Pseudomonas_syringae
Reference:
Name Length %GC NC_004578.1 6397126 58.40 NC_004633.1 73661 55.15 NC_004632.1 67473 56.17
Repeats:
desc #repeats min max mean stdev sum 50bp+ 991 50 7362 393.73 792.41 390192 100bp+ 429 100 7362 815.36 1060.29 349793
Simulated 32bp exact match reads
Type #reads min max mean Sim(ulated) 6538167 32 32 32 ( 32x coverage)
Single assemblies:
Assembler type input-data #reads #ctgs min max mean stdev ctgs-sum #singletons edena-sim denovo Sim 6538167 2068 100 47881 2994.03 4857.76 6191673 198699(3%) velvet-sim denovo Sim 6538167 2207 45 56810 2820.91 5348.36 6225757 123591(2%)
Solexa reads
Type #reads min max mean Solexa 6340136 32 32 32 (~31x coverage)
Assemblies:
Assembler type input-data #reads #ctgs min max mean stdev ctgs-sum #singletons AMOScmp comparative Solaxa 6340136 187 20 577929 34863.06 91692.34 6519394 698638(11%) velvet denovo Solaxa 6340136 25161 45 5057 241.83 212.61 6084887 velvet(100bp+) Solexa 19534 100 5057 288 220 5633400 edena denovo Solaxa 6340136 14084 100 5075 210.92 145.68 2970720 4893301(77%)
Assembly breaks:
velvet: 31 edena : 38
Merged assemblies(contigs):
assemblers type input-data #reads #ctgs min max mean stdev ctgs-sum comments minimus(ovl20) denovo velvet 25161 19121 45 5057 311.3 297.27 5952381 #merged 25161-19121=6040 (25%) gaps minimus(ovl20) denovo edena+velvet 39245 18603 45 6688 322.32 311.02 5996244
Pseudomonas aeruginosa b1 (PAb1)
References:
Name Length %GC PA14 6537648 66.29 PACS2 6492423 66.33 PAO1 6264404 66.56 ...
Solexa reads
Type #reads min max mean Solexa 8627900 33 33 33 (~43X coverage)
Assemblies:
All contigs:
Assembler type #ctgs min max mean stdev ctgs-sum #singletons comments AMOSCmp-PA14 comparative 2053 17 170485 3011.84 11917.53 6183320 1127399 AMOSCmp-PAO1 comparative 2797 17 75626 2161.19 5812.2 6044851 1592525 AMOScmp-PA2192 comparative 5816 17 133129 1072.8 3725.22 6239454 1601299 largest assembly maq-PA14 comparative 991 33 155551 6199.79 17445.05 6143996 1197385 velvet denovo 10684 45 16239 640.34 825.24 6841458 1241079 much better than Ps !!! edena denovo 11180 100 11300 552.36 610.52 6175460 3955865 (46%) much better than Ps !!! ssake denovo 185030 34 5490 77.21 141.23 14287079 3056893 euler-sr denovo 30394 26 5004 184 138 5601125 3437156
200bp+ contigs:
Assembler type #ctgs min max mean stdev ctgs-sum AMOSCmp-PA14 comparative 428 203 170485 14262.09 22852.74 6104175 AMOSCmp-PAO1 comparative 865 200 75626 6893.96 8766.63 5963278 AMOSCmp-PA2192 comparative 1299 200 133129 4683.46 6735.52 6083817 maq-PA14 comparative 368 200 155551 16581.7 25475.92 6102067 velvet denovo 7382 200 16239 877.05 896.35 6474426 edena denovo 8316 200 11300 692.54 651.24 5759209 ssake denovo 12532 200 5490 486 329.93 6090567
Merged assemblies(contigs&singletons):
Assembler type input-data #reads #ctgs min max mean stdev ctgs-sum comments AMOSCmp-PA14-merge ? AMOSCmp-PA14(ctgs) 2053 1931 17 170485 3201.7 12981.57 6182486 2053-1931=122 gaps closed
Staphylococcus aureus MW2
Data
Files:
/fs/szasmg2/Bacteria/dpuiu/Solexa/Staphylococcus_aureus/
Reads:
3857879 35bp Solexa reads 22843 contain at least 1 N; 10455 contain more than 1 N; 8244 contain more than 2 N's
Reference:
NC_003923 2820462 32.83 Staphylococcus aureus subsp. aureus MW2, complete genome. AP004832 20654 28.38 Staphylococcus aureus plasmid pMW2 DNA, complete sequence total 2841116
Coverage: 47.52X
Other strains:
NC_002952.2 2902619 32.81 Staphylococcus aureus subsp. aureus MRSA252, complete genome NC_002953.3 2799802 32.85 Staphylococcus aureus subsp. aureus MSSA476, complete genome NC_002951.2 2809422 32.82 Staphylococcus aureus subsp. aureus COL, complete genome NC_009632.1 2906507 32.95 Staphylococcus aureus subsp. aureus JH1, complete genome NC_009487.1 2906700 32.95 Staphylococcus aureus subsp. aureus JH9, complete genome NC_009782.1 2880168 32.88 Staphylococcus aureus subsp. aureus Mu3, complete genome NC_002758.2 2878529 32.88 Staphylococcus aureus subsp. aureus Mu50, complete genome NC_003923.1 2820462 32.83 Staphylococcus aureus subsp. aureus MW2, complete genome NC_002745.2 2814816 32.84 Staphylococcus aureus subsp. aureus N315, complete genome NC_007795.1 2821361 32.87 Staphylococcus aureus subsp. aureus NCTC 8325, complete genome NC_009641.1 2878897 32.89 Staphylococcus aureus subsp. aureus str. Newman, complete genome NC_007622.1 2742531 32.78 Staphylococcus aureus RF122, complete genome NC_007793.1 2872769 32.75 Staphylococcus aureus subsp. aureus USA300, complete genome NC_010079.1 2872915 32.76 Staphylococcus aureus subsp. aureus USA300_TCH1516, complete genome
Sequence alignments
soap
3735781 of 3857879 (96.76%) aligned by "soap -v 5 -g 3 -s 12"
Mismatches:
~ 10% of the aligned reads had more at least 2 mismatches ~ 4% of the aligned reads had more at least 3 mismatches
cat Sa.soap.v5.s12.g3 | awk '{print $10}' | count.pl # total % 0 2752024 73.67 1 633043 16.95 2 201439 5.39 3 84258 2.26 4 41787 1.12 5 22629 0.61 201 525 0.01 101 45 202 20 103 5 203 4 102 2 Total 3735781
Sequence assemblies
AMOScmp
OVL: 10 bp
Output stats:
All: #elem min max mean median n50 sum ctg 17 60 658745 167134 102144 485748 2841283 ctg(200bp+) 15 2090 658745 189411 102265 485748 2841158 singl 101422 23 35 35 35 35 3549538 breaks 0
- Only 2.62% of the reads are singletons
- 101232 out of 101422 AMOS singletons (99.81%) are also velvet singletons
velvet
OVL: 23 bp
Output stats:
#elem min max mean median n50 sum ctg 1777 45 22892 1583 252 5337 2813162 ctg(100bp+) 1146 100 22892 2421 1167 5472 2774471 ctg(200bp+) 939 200 22892 2924 1764 5512 2745649 singl 289207 35 35 35 35 35 10122245 breaks 0
- 7.49% reads are singletons
- 101232 out of 289207 velvet singletons are also AMOS singletons
- 13189 reads are duplicates, connect 1126 contig pairs (1242 unique ctgs in these pairs)
- 1114 out of 1242 contig overlaps were identified by running nucmer (-c 12 -l 12)
ovlSize count 22 740 21 99 20 55 19 36 18 52 17 33 16 17 15 29 14 21 13 15 12 17 Total 1114
Contig coverage & alignments to the reference:
$ grep "^##" Sa.contig | perl -ane 'print $F[1]*35/$F[2],"\n";' > Sa.cvg #ctgs min max mean median n50 sum 1777 1.55 517.25 43 37.41 44.36 76159.86
$ show-coords -I 94 1con-contigs.delta | grep CONTAIN | awk '{print $19,$13}' | count.pl | sort -n >! 1con-contigs.qry_hits
$ join 1con-contigs.qry_hits Sa.cvg | awk '{print $3,$4}' | ~/bin/getCorrelationDescriptive.pl 0.61 # coeff. of correlation > 0 0.81 # coeff. of correlation for ctgs>100 bp
0cvg regions
$show-coords Sa.all.delta | ~/bin/getNucmerCoverage.pl -M 0 | getSummary.pl -c 3 #elem min max mean median n50 sum gap 305 1 570 34 19 70 10521
OVL: 21 bp
#elem min max mean median n50 sum all 2099 41 18254 1342 258 4384 2817753 45bp+ 1830 45 18254 1534 448 4391 2806596 100bp+ 1337 100 18254 2075 1083 4558 2774667 200bp+ 1115 200 18254 2461 1486 4625 2743892 singl 224006
OVL: 23 bp seems to generate a better assembly on this data set !!!
edena
Output stats:
#elem min max mean median n50 sum ctg 1175 100 22893 2350 1094 5514 2760953 ctg(200bp+) 935 204 22893 2918 1766 5594 2727929 breaks 0 assembled_reads 2676002 (69.8%) singletons 1181877
- In this case velvet & edena have similar contig stats !!!
- All edena contigs align to velvet contigs
- Not all edena contigs are CONTAINED/IDENTICAL to velvet contigs !!!
#elem min max mean median n50 sum edena.uniq 360 100 19924 2475 273 7791 891123
$ show-coords -H velvet-edena.filter-q.delta | egrep 'CONTAINED|BEGIN|END' | awk '{print $19,$13}' | count.pl -m 2 | wc -l # 193 edena contigs CONTAIN multiple velvet contigs $ show-coords -H velvet-edena.filter-q.delta | egrep 'IDEN' | wc -l # 435 edena contigs are IDENTICAL to velvet contigs
euler-sr
#elem min max mean median n50 sum all 1349 22 43140 2073 109 10523 2796347 200bp+ 566 201 43140 4859 2334 10662 2750118
Contig assemblies
AMOScmp on velvet contigs
Min Ovl: 10bp Min Cluster(nucmer -c): 40 (many velvet contigs are 45-65 bp)
All: #elem min max mean median n50 sum ctg 567 45 73756 4890 455 20846 2772573 singl 5 45 11556 3159 58 11556 15794 all 572 45 73756 4875 455 20846 2788367 breaks 2 (1 contig with 2 overlapping alignments; no repeat; why is nucmer breaking the alignment?) 200 bp+: #elem min max mean median n50 sum ctg 339 201 73756 8118 3413 20846 2752170 singl 2 4086 11556 7821 11556 11556 15642 all 341 201 73756 8117 3494 20846 2767812 breaks 2
- Reduced the number of contigs from 1777 to 567 !!!
minimus on velvet contigs
Min Ovl: 16bp
All: #elem min max mean median n50 sum ctg 316 45 39490 5812 3243 12483 1836527 singl 797 45 22892 1204 161 4724 959356 all 1113 45 39490 2512 299 9399 2795883 breaks 20 200 bp+: #elem min max mean median n50 sum ctg 258 202 39490 7093 5069 12483 1829967 singl 368 200 22892 2504 1347 5210 921478 all 626 200 39490 4395 2313 9508 2751445 breaks 19
- Reduced the number of contigs from 1777 to 1113 !!!
minimus1 on velvet contigs
Min Ovl: 16bp
All: #elem min max mean median n50 sum ctg 277 70 32618 6623 4709 11914 1834597 singl 1022 45 19128 948 96 5057 968872 all 1299 45 32618 2158 154 9526 2803469 breaks 15 200 bp+: #elem min max mean median n50 sum ctg 247 200 32618 7413 6004 11984 1830980 singl 354 200 19128 2590 1397 5294 916800 all 601 200 32618 4572 2495 9757 2747780 breaks 15
- Reduced the number of contigs from 1777 to 1299 !!!
- Generated fewer, larger big contigs than minimus
Fewer contigs get merged if:
* Min Ovl is dropped from 16bp to 14bp * Only the overlaps represented by Duplicate reads are used
minimus1 on edena contigs
All: #elem min max mean median n50 sum ctg 67 240 24824 5610 4814 8671 375847 singl 1022 100 19924 2331 1084 5455 2382674 all 1089 100 24824 2533 1183 5933 2758521 breaks 27 200 bp+: #elem min max mean median n50 sum ctg 67 240 24824 5610 4814 8671 375847 singl 811 204 19924 2902 1766 5588 2353676 all 878 204 24824 3109 1862 5984 2729523 breaks 27
Edena contig ends get trimmed. There are fewer overlap among them than velvet contigs!!!
AMOScmp on velvet & edena contigs
Input
All: desc #elem min max mean stdev sum velvet 1777 45 22892 1583 2769 2813162 edena 1175 100 22893 2350 3152 2760953 edena.uniq 360 100 19924 2475 3886 891123 200bp+ desc #elem min max mean stdev sum velvet 939 200 22892 2924 3272 2745649 edena 935 204 22893 2918 3303 2727929 edena.uniq 202 206 19924 4305 4394 869552
Output:
All:
#elem min max mean median n50 sum ctg 461 45 81678 6029 703 21943 2779591 singl 7 45 11556 4283 2634 11556 29984 all 468 45 81678 6003 719 21353 2809575 breaks 0
200 bp+:
#elem min max mean median n50 sum ctg 289 205 81678 9565 4401 21943 2764375 singl 4 2634 11556 7458 11556 11556 29832 all 293 205 81678 9537 4401 21943 2794207 breaks 0
minimus2 on velvet & edena contigs
Contig stats:
All: #elem min max mean median n50 sum ctg 740 67 38980 3793 1290 10553 2807159 singl 366 45 6795 265 63 3684 96822 all 1106 45 38980 2626 263 10104 2903981 breaks 45 200 bp+: #elem min max mean median n50 sum ctg 560 202 38980 4969 2583 10561 2782466 singl 31 203 6795 2424 1379 5455 75150 all 591 202 38980 4835 2574 10197 2857616 breaks 44
Staphylococcus aureus MW2 "mutated" 10% of sequence are random SNPs
Sequence assemblies
AMOScmp
OVL: 10 bp ...
#elem min max mean median n50 sum all 22431 20 705 107 87 124 2409008 200bp+ 1997 200 705 267 246 260 533882 singl 1934772
velvet
Same as the one above
Contig assemblies
AMOScmp on velvet contigs
Min Ovl: 10bp Min Match (nucmer -l): 20 Min Cluster (nucmer -c): 40
All:
#elem min max mean median n50 sum ctg 496 45 47387 5504 2076 14370 2729975 singl 629 45 11556 112 60 131 70262 all 1125 45 47387 2489 96 13785 2800237 breaks 6
200 bp+:
#elem min max mean median n50 sum ctg 405 204 47387 6715 3277 14370 2719753 singl 28 200 11556 1042 288 11556 29165 all 433 200 47387 6349 2878 14036 2748918 breaks 4
Min Ovl: 10bp Min Match (nucmer -l): 16 Min Cluster (nucmer -c): 32
All: #elem min max mean median n50 sum ctg 544 45 53996 5044 1215 16039 2744166 singl 385 45 11556 133 53 5057 51297 all 929 45 53996 3009 112 15857 2795463 breaks 8 200 bp+: #elem min max mean median n50 sum ctg 391 201 53996 6980 3156 16443 2729048 singl 8 222 11556 3543 4086 11556 28343 all 399 201 53996 6911 3156 16039 2757391 breaks 6
Strep suis17
Data
- 2726374 36bp Solexa reads
- 2007491bp 41.30 %GC Reference
- Coverage: 49X
- Location:
/fs/szdata/Solexa/Streptococcus_suis/ /fs/szasmg2/Bacteria/dpuiu/Solexa/Streptococcus_suis/
Sequence alignments
soap
2261784 of 2726374 (82.95%) aligned by "soap -v 5 -g 0 -s 10"
Mismatches:
more Ss.soap | awk '{print $10}' | count.pl mism. hits hits(%) 0 1235237 54.61 1 548822 24.27 2 238795 10.56 3 120622 5.33 4 71451 3.16 5 46857 2.07 Total 2261784
Sequence assembly
Velvet
Contig stats:
#elem min max mean median n50 sum all 917 45 24419 2154 278 6661 1975147 200bp+ 479 202 24419 4051 2940 6966 1940363 breaks 2 (1 contig aligned in 2 overlapping pieces; tandem repeat in the 217341-217404 region of Ss)
Different coverage assemblies:
cvg #ctgs min max mean median n50 sum %singletons breaks 24.5 4630 45 5267 436 284 709 2016871 23.98 6 29.4 2797 45 7739 716 426 1333 2001964 23.46 0 34.3 1797 45 9347 1107 649 2230 1989102 23.34 0 39.2 1325 45 24419 1496 663 3724 1982292 23.32 2 44.1 1016 45 23272 1946 516 5290 1976890 23.30 2 49.0 917 45 24419 2154 278 6661 1975147 23.30 2
Edena
Contig stats:
#elem min max mean median n50 sum all 1668 100 8829 1135 743 1968 1893815 200bp+ 1390 200 8829 1333 959 1985 1853354 breaks 2 (1 contig aligned in 2 overlapping pieces; tandem repeat in the 217341-217404 region of Ss)
Singletons: 1582493 (58%)
Euler-SR
Command:
$ AssembleIllumina.pl Ss.seq working_dir -vertexSize 21
Contig stats:
#elem min max mean median n50 sum all 2658 22 19768 757 42 4867 2013259 200bp+ 659 200 19768 2932 1973 5117 1932406
Escherichia coli str. K-12 substr. MG1655 (Solexa-paired)
Data
- 7,047,668 + 10,408,224 paired-end 36bp Solexa reads => 105X + 161X cvg
- 1M => 15.51X cvg
- NC_000913.2 4,639,675 50.79%GC reference
- Location:
/fs/szasmg2/Bacteria/dpuiu/Solexa-pair/Ecoli
- Formatting : split seq into 2
cat Ecoli.seq | ~/bin/solexap2amos.pl -len 36 > ! Ecoli.paired.afg cat Ecoli.seq | ~/bin/solexap2amos.pl -len 36 -noamos > ! Ecoli.paired.seq
Sequence assembly (1M reads: 15X cvg)
- Velvet : hash length of 23 works best; also use -exp_cov 15
- AMOScmp: OVL=10; the smaller the better
- Ctg stats:
. elem min q1 q2 q3 max mean n50 sum AMOScmp-shortReads 331 40 1941 7826 18316 172,800 14022 26474 4641532 better !!! AMOScmp-shortReads-trimmed 372 40 1916 7226 17348 98,155 12477 24730 4641600 velvet.ctg 12073 45 121 250 461 4690 366 568 4430625 velvet.scf 715 45 58 70 77 313779 6671 104454 4769943
Sequence assembly (7+M reads: 100+X cvg)
- Velvet : hash length of 27 works best; also use -exp_cov 100
. elem min q1 q2 q3 max mean n50 sum velvet.ctg 375 53 63 82 6105 192151 12162 59972 4561051 velvet.scf 272 53 60 71 133 860695 16785 171744 4565616
Escherichia coli str. K-12 substr. MG1655 (454-paired)
Data
- Formatting : remove 44 bp linker
nucmer -c 10 -l 10 L454.seq Ecoli.merged.seq -p Ecoli --forward cat Ecoli.delta | perl ~/bin/filterDeltaAnnotated.pl | ~/bin/delta2alignRange.pl -qry >! Ecoli.pair.seqs difference.pl Ecoli.seqs Ecoli.pair.seqs >! Ecoli.single.seqs cat Ecoli.pair.seqs | p '$F[1]--; print "$F[0] $F[0].f 1 $F[1]\n"' >! Ecoli.pair.f.pos cat Ecoli.pair.seqs | p '$F[2]++; print "$F[0] $F[0].r $F[2] $F[3]\n"' >! Ecoli.pair.r.pos