Assembly merge: Difference between revisions
| No edit summary | |||
| (168 intermediate revisions by the same user not shown) | |||
| Line 1: | Line 1: | ||
| = Papers = | |||
| * [http://genome.cshlp.org/cgi/content/full/18/2/324 Short read fragment assembly of bacterial genomes 2007] | |||
| = Assemblers = | = Assemblers = | ||
| == Denovo == | == Denovo == | ||
| === CA === | |||
| See [[Dpuiu_CA|CA]] | |||
| === Abyss === | |||
| * [http://www.bcgsc.ca/platform/bioinfo/software/abyss Abyss] | |||
| === Atlas === | |||
| * [http://www.hgsc.bcm.tmc.edu/downloads/software/atlas/readme.html Atlas web site] | |||
| === Euler-SR === | |||
| * [http://euler-assembler.ucsd.edu/ Euler-SR source] | |||
| * [[Media:README.eulersr|README]] | |||
| * [http://genome.cshlp.org/cgi/content/abstract/gr.7088808v1 Short read fragment assembly of bacterial genomes] (Genome Research 2008) | |||
| * [http://genome.cshlp.org/cgi/content/full/14/9/1786 De Novo Repeat Classification and Fragment Assembly] (Genome Research 2004) | |||
| * [http://genome.cshlp.org/cgi/content/abstract/12/8/1269 Automated De Novo Identification of Repeat Sequence Families in Sequenced Genomes] (Genome Research 2002) | |||
| * [http://genome.cshlp.org/content/early/2008/12/03/gr.079053.108.abstract De novo fragment assembly with short mate-paired reads: Does the read length matter?] (Genome Research 2008) | |||
| * Cmd:  | |||
|   setenv EUSRC  /fs/szdevel/dpuiu/euler-sr/ | |||
|   setenv MACHTYPE x86_64 | |||
|   set path=($path ${EUSRC}/assembly) | |||
|   set path=($path ${EUSRC}/assembly/${MACHTYPE}) | |||
|   /fs/szdevel/dpuiu/euler-sr/assembly/Assemble.pl readsFile vertexSize ruleFile | |||
| * 35bp Solexa reads: vertexSize=25 vs vertexSize=21 | |||
|   vertexSize=21  | |||
|     more agreagte assembly;  | |||
|     some tandem repeats in 2K+ contigs were collapsed (detected by aligning to Strep aureus finished genome) | |||
|     these repeats were present in small ctgs (~ 200bp) when euler was run with vertexSize=25 | |||
|     these repeats were not present in the velvet assembly | |||
| === newbler (454) === | |||
| * Command: | |||
|   /fs/sz-user-supported/Linux-x86_64/packages/rig/bin/runAssembly -o dir prefix.sff | |||
| === minimus === | === minimus === | ||
| Line 11: | Line 55: | ||
| * min ovl: 15 (set by minimizer) | * min ovl: 15 (set by minimizer) | ||
| * 20 bp overlap ok;   | * 20 bp overlap ok;   | ||
| * the min ovl imposed by the next step (tigger) is 5bp | |||
| * tigger  | |||
| === minimus1 === | === minimus1 === | ||
| * hash-overlap was replaced with a nucmer based overlapper | * hash-overlap was replaced with a nucmer based overlapper | ||
| * nucmer takes long time and lots of space to find very short overlaps (< 16bp?) | |||
| * We can have more control on the overlaps that are used (can easily filter the nucmer output) | |||
| * can discard CONTAINED/IDENTITY alignments in order to avoid collapsing repeats | * can discard CONTAINED/IDENTITY alignments in order to avoid collapsing repeats | ||
| * can mask the middle of the sequences to make the alignment process faster; (Issue: CONTAINED overlaps appear as BEGIN/END overlaps) | * can mask the middle of the sequences to make the alignment process faster; (Issue: CONTAINED overlaps appear as BEGIN/END overlaps) | ||
| Line 34: | Line 79: | ||
| === minimus2 === | === minimus2 === | ||
| * assembles  | * assembles two sequence sets (S1,S2) | ||
| * Instead of merging the  | * Instead of merging the sets and aligning all versus all (S1+S2 : S1+S2) it aligns one versus the other: | ||
|    S1:S2 |    S1:S2 | ||
|    S1:S1 (optional) |    S1:S1 (optional) | ||
|    S2:S2 (optional) |    S2:S2 (optional) | ||
| * Each  | * Each alignment can be run with different parameters | ||
| Example: | Example 1: merge two de-novo assemblies | ||
|   ... | |||
|   TYPE1=N | |||
|   TYPE2=N | |||
|    ... |    ... | ||
|    OVERLAP11       = 16 |    OVERLAP11       = 16 | ||
| Line 48: | Line 96: | ||
|    OVERLAP12       = 40 |    OVERLAP12       = 40 | ||
|    ... |    ... | ||
|   # align sets | |||
|    $(NUCMER) -maxmatch -l $(OVERLAP11) -c $(OVERLAP11) $(SEQS1) $(SEQS1) -p $(PREFIX11) |    $(NUCMER) -maxmatch -l $(OVERLAP11) -c $(OVERLAP11) $(SEQS1) $(SEQS1) -p $(PREFIX11) | ||
|    $(NUCMER) -maxmatch -c $(OVERLAP12) $(SEQS1) $(SEQS2) -p $(PREFIX12) |    $(NUCMER) -maxmatch -c $(OVERLAP12) $(SEQS1) $(SEQS2) -p $(PREFIX12) | ||
|    $(NUCMER) -maxmatch -l $(OVERLAP22) -c $(OVERLAP22) $(SEQS2) $(SEQS2) -p $(PREFIX22) |    $(NUCMER) -maxmatch -l $(OVERLAP22) -c $(OVERLAP22) $(SEQS2) $(SEQS2) -p $(PREFIX22) | ||
|   ... | |||
|   # merge alignments | |||
|   $(SHOWCOORDS) $(ALIGN11) | egrep 'BEGIN|END'  > $(COORDS)    | |||
|   $(SHOWCOORDS) $(ALIGN12) >> $(COORDS) | |||
|   $(SHOWCOORDS) $(ALIGN22) | egrep 'BEGIN|END'  >> $(COORDS) | |||
|   ... | |||
|   nucmer2ovl.pl  $(COORDS) > $(OVLTAB) | |||
|   ... | |||
| Example 2: merge a comparative and a de-novo assembly | |||
|   ... | |||
|   TYPE1           = C | |||
|   TYPE2           = N | |||
|   ... | |||
|   # look only for short overlaps among adjacent sequences from S1 | |||
|   $(SCRIPTDIR)/fasta2ovl.pl -tab $(SEQS1) > $(OVLTAB)   | |||
|   ... | |||
|   # look for all overlaps  among one sequence from S1 and one from S2 | |||
|   $(NUCMER) -maxmatch -c $(OVERLAP12) $(SEQS1) $(SEQS2) -p $(PREFIX12)  | |||
|   $(SHOWCOORDS) $(ALIGN12) | $(SCRIPTDIR)/nucmer2ovl.pl >> $(OVLTAB)  | |||
|   ... | |||
|   # look for end overlaps among sequences from S2 | |||
|   $(NUCMER) -maxmatch -l $(OVERLAP22) -c $(OVERLAP22) $(SEQS2) $(SEQS2) -p $(PREFIX22)   | |||
|   $(SHOWCOORDS) $(ALIGN22) | egrep 'BEGIN|END' > $(COORDS22) | $(SCRIPTDIR)/nucmer2ovl.pl  >> $(OVLTAB) | |||
|    ... |    ... | ||
| === SSAKE === | === SSAKE === | ||
| * [http://www.bcgsc.ca/platform/bioinfo/software/ssake Web Site] | |||
| * Short Sequence Assembly by K-mer search and 3' read Extension   | * Short Sequence Assembly by K-mer search and 3' read Extension   | ||
| * Rene L Warren, Granger G Sutton, Steven JM Jones, Robert A Holt. "Assembling millions of short DNA sequences using SSAKE" Bioinformatics. Vol. 23 no. 4 2007, pages 500–501   | * Rene L Warren, Granger G Sutton, Steven JM Jones, Robert A Holt. "Assembling millions of short DNA sequences using SSAKE" Bioinformatics. Vol. 23 no. 4 2007, pages 500–501   | ||
| * Uses a greedy assembly algorithm: progressively searches for perfect 3’-most k-mers using a DNA prefix tree to identify overlaps between any two sequences; stringently clusters short reads into contigs | * Uses a greedy assembly algorithm: progressively searches for perfect 3’-most k-mers using a DNA prefix tree to identify overlaps between any two sequences; stringently clusters short reads into contigs | ||
| * Copyright (c) 2006-2008 Canada's Michael Smith Genome Science Centre. | * Copyright (c) 2006-2008 Canada's Michael Smith Genome Science Centre. | ||
| * Current release: SSAKE 3.2 ( | * Current release: SSAKE 3.2.1 (08/21/2008) | ||
| * Distribution: PERL script; runs on Linux systems | * Distribution: PERL script; runs on Linux systems | ||
| * Parameters: | * Parameters: | ||
|    -m  Minimum number of reads needed to call a base during overhang consensus build up (default -m 16) |    -m  Minimum number of reads needed to call a base during overhang consensus build up (default -m 16) | ||
|    -o  Minimum number of reads needed to call a base during an extension (default -o 2) |    -o  Minimum number of reads needed to call a base during an extension (default -o 2) | ||
|    -r  Minimum base ratio used to accept a overhang consensus base (default -r 0. |    -r  Minimum base ratio used to accept a overhang consensus base (default -r 0.7) | ||
|   -t  Max number of bases to trim (default 0) | |||
| * "Especially" targeted towards Solexa seq; 454 reads too long & homopol problem | |||
| * Can handle mated reads | |||
| * TQS.py  should be used for qual trimming | |||
| Example:   | Example:   | ||
| Line 75: | Line 154: | ||
|    $ cat Pa.singlets.seq | grep "^>" | perl -ane '/read(\d+)/; print $1, "\n";' | getSummaryDescriptive.pl -t singletons         # number of singletons |    $ cat Pa.singlets.seq | grep "^>" | perl -ane '/read(\d+)/; print $1, "\n";' | getSummaryDescriptive.pl -t singletons         # number of singletons | ||
|    $ grep -v N Pa.seq              # number of reads which contain ambiguities and will be discarder from the beginning (should be added to the singletons) |    $ grep -v N Pa.seq              # number of reads which contain ambiguities and will be discarder from the beginning (should be added to the singletons) | ||
| === VCAKE === | |||
| === Velvet === | === Velvet === | ||
| * Sequence assembler for very short reads | * Sequence assembler for very short reads | ||
| *  | * [http://www.genome.org/cgi/content/abstract/18/5/821 Velvet: Algorithms for De Novo Short Read Assembly Using De Bruijn Graphs] Genome Res. Mar 2008   | ||
| * [http://www.ebi.ac.uk/~zerbino/velvet/ Project Web Site at EBI] Current version: 0.6 02/06/2008: Velvet 0.6 | |||
| * Uses Bruijn graphs to assemble sequences: first uses an error correction algorithm to merge sequences which belong together, then a repeat solver separates paths sharing local overlaps | * Uses Bruijn graphs to assemble sequences: first uses an error correction algorithm to merge sequences which belong together, then a repeat solver separates paths sharing local overlaps | ||
| * Distribution: C program, to be compiled with gcc | * Distribution: C program, to be compiled with gcc | ||
| * Open Source, GPL agreement, EBI. | * Open Source, GPL agreement, EBI. | ||
| * Current release: 0. | * Current release: 0.7.55  | ||
| * Parameters: | * Parameters: | ||
|    - hash_length: odd integer (if even, it will be decremented) <= 31 (if above, will be reduced) |    - hash_length: odd integer (if even, it will be decremented) <= 31 (if above, will be reduced) | ||
|   * minimum overlap length (30-40X coverage data set): |   * minimum overlap length (30-40X coverage data set): | ||
|       17bp usually gives the fewest contigs (some case) | |||
|       23bp is also ok (used for Pa,Ps) | |||
|       15bp is too low => too many short contigs |       15bp is too low => too many short contigs | ||
|   *  -long option for including "long sequences" (Sanger, 454 or even reference sequences)  |   *  -long option for including "long sequences" (Sanger, 454 or even reference sequences) very slow | ||
| * At ~30X, max ctg len is ~ 5KB | |||
| * At >45X, max ctg len is ~ 20KB | |||
| * Longest contig on simulated data is 56Kbp (Ps, no error 32bp reads, 32X cvg) | |||
| * Max ovl=32 | |||
| * For longer reads (66bp), some of the repeats ( ~100bp) get assembled partially; -cov_cutoff parameter must be used, otherwise low cvg/quality ctgs are reported | |||
| * velvet_assy.afg lists the assembled reads (TLE messages) | |||
| * some reads extend beyond the contig ends | |||
| * there are some reads shared by multiple contigs (status:D) ; at least one of the instances occurs at a contig end (goes beyond the contig end) | |||
| * many contigs overlap by a few bp; max val= min_ovl-1 | |||
| * minimum_ctg length =2*minimum_ovl-1 (23=>45, 21=>41 ...) | |||
| Example: | Example (unmated reads): | ||
|    $ velveth . ovl prefix.seq  |    $ velveth . -[short|shortPaired|long|longPaired] ovl prefix.seq        # ovl should be an odd integer like 21, 23   | ||
|    $ velvetg . -read_trkg yes |    $ velvetg . -read_trkg yes | ||
| Example (mated reads): | |||
|    $ velveth . 23 -shortPaired prefix.seq | |||
|   $ velvetg . -ins_length 500 -exp_cov 35 | |||
|   $ ~/bin/velvet.sh prefix 23 500 35                       # combines both; prefix.seq should alternate fwd & rev reads  | |||
| Example (processing results): | |||
|    $ bank-transact -c -z -b prefix.bnk -m velvet_assy.afg |    $ bank-transact -c -z -b prefix.bnk -m velvet_assy.afg | ||
|    $ bank2contig prefix.bnk  > prefix.contig |    $ bank2contig prefix.bnk  > prefix.contig | ||
| Line 117: | Line 210: | ||
| * Edena: Exact DE Novo Assembler | * Edena: Exact DE Novo Assembler | ||
| *  | |||
| * [http://www.genome.org/cgi/content/abstract/gr.072033.107v2 De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer] Genome Res. Apr 2008 | |||
| * [http://www.genomic.ch/edena.php Project Web Site] Current verision: 2.1.1 03/17/2008 | |||
| * Genomic Research Laboratory, Infectious Diseases Service, Geneva University Hospitals, Switzerland | * Genomic Research Laboratory, Infectious Diseases Service, Geneva University Hospitals, Switzerland | ||
| * Current release: 2.1.1 (03/17/2008) | * Current release: 2.1.1 (03/17/2008) | ||
| Line 125: | Line 220: | ||
|    --overlapCutoff [int]    Only consider overlaps >= than the specified size. (default 22) |    --overlapCutoff [int]    Only consider overlaps >= than the specified size. (default 22) | ||
|    --minContigSize [int]    Minimum size of the contigs to output. (default 100) |    --minContigSize [int]    Minimum size of the contigs to output. (default 100) | ||
| * contigs don't overlap (ends seem to be trimmed) | |||
| * a significant number of reads don't get assembled; max % assembled : 69% for Staphylococcus_aureus | |||
| *  | * reads should be 1/line (+1 line for the header) | ||
| * for 66bp reads, the longest ctgs were obtained for --overlapCutoff 35 | |||
| Example: | Example: | ||
| Line 134: | Line 232: | ||
|    $ ln -s prefix_contigs.fasta prefix.fasta |    $ ln -s prefix_contigs.fasta prefix.fasta | ||
|    $ cat *_info.txt                                 # assembly stats |    $ cat *_info.txt                                 # assembly stats | ||
|    $ grep assembled  |    $ grep assembled prefix_info.txt | ||
|    $ cat  |    $ cat prefix_contigs.fasta | grep "^>" | perl -ane '/(\d+)$/; print $1, "\n";' | getSummaryDescriptive.pl -t reads_assembled | ||
| === SOAPdenovo === | |||
| * http://soap.genomics.org.cn/soapdenovo.html | |||
| * Release 1.03 , 06-09-2009 | |||
| * Example config  | |||
|   [LIB]        # single reads | |||
|   f=in.fa | |||
|   [LIB]        # paired reads | |||
|   avg_ins=180 | |||
|   f1=fwd.seq | |||
|   f2=rev.seq | |||
|   [LIB]        # paired reads | |||
|   avg_ins=180 | |||
|   p=in.seq | |||
| * Example run: | |||
|   $ cat SOAPdenovo.sh | |||
|   /nfshomes/dpuiu/szdevel/SOAPdenovo_Release1.03/SOAPdenovo all -s SOAPdenovo.config -o out | |||
| * Example turkey: paires | |||
|              elem       min    q1     q2     q3     max        mean       n50        sum | |||
|   reads      88357456   76     76     76     76     76         76         76         6715166656  # 44178728 mates; 2.39X cvg  | |||
|   ctg        4664917    24     25     36     156    13863      198        761        926637298 | |||
|   scf        982536     100    258    517    1080   16376      839        1411       825003003 | |||
|   85505015  out.readOnContig | |||
| * Location (turkey) | |||
|   /scratch1/dpuiu/SOAPdenovo/turkey/ | |||
| === Mira === | |||
| * http://www.chevreux.org/projects_mira.html | |||
| * Sanger, 454 and Solexa /Illumina | |||
| * supposed to be "very" slow | |||
| == Others == | |||
| * [http://www.genome.org/cgi/content/abstract/gr.7337908v1 ALLPATHS : De novo assembly of whole-genome shotgun microreads] Genome Res. Mar 2008 | |||
| * [http://www.bcgsc.ca/platform/bioinfo/software/ssake SSAKE] | |||
| * VCAKE  | |||
| * SHARCGS | |||
| * [http://www.dnastar.com/products/SMGA.php SeqMan Genome Assembler (SMGA)] by DNAStar(commercial) | |||
| * Euler-SR | |||
| == Comparative == | == Comparative == | ||
| Line 141: | Line 282: | ||
| === AMOScmp === | === AMOScmp === | ||
| * [http://sourceforge.net/apps/mediawiki/amos/index.php?title=AMOScmp Web site] | |||
| * one should increase the make-consensus error rate if ref and qry are not that similar | * one should increase the make-consensus error rate if ref and qry are not that similar | ||
| * all reads from layout are included in consensus | * all reads from layout are included in consensus | ||
| * all CONTAINED reads should be assembled but the  | casm-layout issues:   | ||
| * if one sequence end does not align, the read is not assembled; the adjacent read is not assembled either if based on alignment to the ref. the two "theoretically" overlap; | |||
| * for a circular bacterial chromosome, the BEGIN/END reads don't get assembled => cannot get a circular contig !!! | |||
| * all CONTAINED reads should be assembled but the ones that are adjacent to dirty reads are left as singletons as well !!! | |||
| * option -r places repeats randomly; what if we want more control over this? | |||
| * There are some large negative gaps in scaffold file (Solexa read assemblies); Some contigs are "CONTAINED" in the contig just before the large negative gap; all these contig id's are larger than the  .scaff last contig ; Contig alignments to the ref have breaks   | * There are some large negative gaps in scaffold file (Solexa read assemblies); Some contigs are "CONTAINED" in the contig just before the large negative gap; all these contig id's are larger than the  .scaff last contig ; Contig alignments to the ref have breaks   | ||
| * casm-layout -g 10000 (defaul max gap) seems to produce the problem above; this value should be reduced (Ex 100) | * casm-layout -g 10000 (defaul max gap) seems to produce the problem above; this value should be reduced (Ex 100) | ||
| show-coords issues:  | |||
| * sometimes long CONTAINED annotated contigs contain dirty ends | |||
| show-snps issues:  | |||
| * ambiguities are not shown | |||
| * maximum  init_size = 10000 query seqs | |||
| * does not work on long seqs | |||
| Example: | Example: | ||
| Line 207: | Line 362: | ||
| === AMOScmp2 === | === AMOScmp2 === | ||
| Runs AMOScmp of 2 data sets at once (with different  | * Runs AMOScmp of 2 data sets at once (with different parameters) | ||
| Example: merge a comparative(AMOScmp) and a de-novo(velvet) assembly, using the same reference was used to get the comparative one | |||
|   ... | |||
|   TYPE1           = C  # Assembly 1 is a comparative one, already know the position of contigs; no need to align them to ref. | |||
|   TYPE2           = N  # Assembly 2 is a de-novo one, don't know the position of contigs; we need to align them to ref. | |||
|   ... | |||
|   $(SCRIPTDIR)/scaff2delta.pl $(SCAFF1) > $(ALIGN1) | |||
|   $(NUCMER) -maxmatch -c $(OVERLAP2) $(REF) $(SEQS2) -p=$(PREFIX2)  | |||
|   ... | |||
|   cp $(ALIGN1)$(ALIGN) | |||
|   cat $(ALIGN2) | grep -v $(PREFIX) | grep -v NUCMER  >> $(ALIGN) | |||
|   .... | |||
| === maq === | === maq === | ||
| * Version: 0.6.6 | |||
| * [http://maq.sourceforge.net/ Web site] | |||
| * At high %identity (close to 100%) maq outperforms AMOScmp-shortReads-alignmentTrimmed | * At high %identity (close to 100%) maq outperforms AMOScmp-shortReads-alignmentTrimmed | ||
| * At lower % identity (<97%) AMOScmp-shortReads-alignmentTrimmed outperforms maq | * At lower % identity (<97%) AMOScmp-shortReads-alignmentTrimmed outperforms maq | ||
| Line 216: | Line 386: | ||
| === Other Scripts === | === Other Scripts === | ||
| ==== EMBOSS  | ==== delta-filter.pl ==== | ||
| * Similar to MUMMer delta-filter | |||
| * The difference is that for multi copy repeats, it tries to place each one uniquely on the reference  | |||
| * "delta-filter -q" collapses all in the same reference region | |||
| * "delta-filter -r" aligns the same one to multiple reference regions | |||
| ==== fasta2ovl.pl ==== | |||
| * Program that finds short normal direction overlaps between two adjacent sequences in a multi-FASTA file. | |||
| * It tests if the END of seqence N overlaps the BEGINING of sequence N+1. | |||
| ==== megamerge (EMBOSS) ==== | |||
|    * Can merge well short overlapping contigs |    * Can merge well short overlapping contigs | ||
| Line 222: | Line 404: | ||
|    * Min ovl length=2 |    * Min ovl length=2 | ||
| ====  | ==== nucmer2ovl.pl ==== | ||
| * It is a nucmer wrapper which runs nucmer with decreasingly alignment size on the  | * Program that converts nucmer annotated alignments to an overlap file (either AMOS or TAB format) | ||
| * Shorter run time, less disk space | |||
| ==== nucmer.pl ==== | |||
| * It is a nucmer wrapper which runs nucmer with decreasingly alignment size on the sequences  | |||
| * Iteration n only tries to align sequences not aligned at iterations n-1,n-2,...1 | |||
| * Shorter run time, less disk space (than running nucmer with a small alignment/cluster size) | |||
| * Default parameters: | * Default parameters: | ||
|    -len 20,16,12         : nucmer minmatch (-l) |    -len 20,16,12         : nucmer minmatch (-l) | ||
| Line 231: | Line 418: | ||
|    -filter 20,20,20      : delta-filter minlen (-l) |    -filter 20,20,20      : delta-filter minlen (-l) | ||
| * Could be used by AMOScmp-shortReads, AMOScmp-shortReads-alignmentTrimmed | * Could be used by AMOScmp-shortReads, AMOScmp-shortReads-alignmentTrimmed | ||
| ==== nucmerAnnotate.pl ==== | |||
| * Program that parses a nucmer coords file and re-annotates sequence alignments  | |||
| * The maximum length of the end sequence to be ignored can be set as a parameter  | |||
| ==== revDelta.pl ====  | |||
| * Parses a delta file and switches the ref & query | |||
| ==== revseq.pl ==== | |||
| * Revese complements a seq file | |||
| * adds ".rev" suffix to seq id's | |||
| ==== scaff2delta.pl  ==== | |||
| * Program that converts a scaff file to a delta file | |||
| * Example:  No need to run nucmer on the AMOScmp output contigs to get them aligned to reference since we already know their positions in the scaffold | |||
| == Subassembly == | |||
| == Reconciliator == | |||
| * [http://bioinformatics.oxfordjournals.org/cgi/reprint/btm542v1.pdf Article] | |||
| * [http://www.genome.umd.edu/reconciliation_instructions.htm Download] | |||
| = Cases =   | = Cases =   | ||
| Line 295: | Line 508: | ||
|    50bp+   991        50      7362    393.73  792.41   390192 |    50bp+   991        50      7362    393.73  792.41   390192 | ||
|    100bp+  429        100     7362    815.36  1060.29  349793 |    100bp+  429        100     7362    815.36  1060.29  349793 | ||
| === Simulated 32bp exact match reads === | === Simulated 32bp exact match reads === | ||
| Line 325: | Line 519: | ||
|    velvet-sim  denovo       Sim         6538167        2207    45      56810   2820.91   5348.36   6225757       123591(2%) |    velvet-sim  denovo       Sim         6538167        2207    45      56810   2820.91   5348.36   6225757       123591(2%) | ||
| ===  | === Solexa reads === | ||
|    Type            #reads       min     max     mean |    Type            #reads       min     max     mean | ||
|    Solexa          6340136      32      32      32   (~31x coverage) | |||
| Assemblies: | |||
|   Assembler   type         input-data  #reads         #ctgs   min     max     mean      stdev     ctgs-sum      #singletons        | |||
|   AMOScmp     comparative  Solaxa      6340136        187     20      577929  34863.06  91692.34  6519394       698638(11%) | |||
|   velvet      denovo       Solaxa      6340136        25161   45      5057    241.83    212.61    6084887  | |||
|   velvet(100bp+)           Solexa                     19534   100     5057    288       220       5633400 | |||
|   edena       denovo       Solaxa      6340136        14084   100     5075    210.92    145.68    2970720       4893301(77%) | |||
| Assembly breaks: | |||
|   velvet: 31 | |||
|   edena : 38 | |||
| Merged assemblies(contigs): | |||
|   assemblers     type         input-data                #reads  #ctgs   min     max     mean     stdev   ctgs-sum   comments     | |||
|   minimus(ovl20) denovo       velvet                    25161   19121   45      5057    311.3    297.27  5952381    #merged 25161-19121=6040 (25%) gaps | |||
|   minimus(ovl20) denovo       edena+velvet              39245   18603   45      6688    322.32   311.02  5996244 | |||
| == Pseudomonas aeruginosa b1 (PAb1)== | == Pseudomonas aeruginosa b1 (PAb1)== | ||
| Line 356: | Line 565: | ||
|    edena          denovo       11180   100     11300   552.36    610.52     6175460       3955865 (46%)  much better than Ps !!! |    edena          denovo       11180   100     11300   552.36    610.52     6175460       3955865 (46%)  much better than Ps !!! | ||
|    ssake          denovo       185030  34      5490    77.21     141.23     14287079      3056893   |    ssake          denovo       185030  34      5490    77.21     141.23     14287079      3056893   | ||
|   euler-sr       denovo       30394   26      5004    184       138        5601125       3437156 | |||
| 200bp+ contigs: | 200bp+ contigs: | ||
| Line 370: | Line 580: | ||
|    Assembler           type   input-data          #reads   #ctgs   min   max     mean    stdev      ctgs-sum       comments |    Assembler           type   input-data          #reads   #ctgs   min   max     mean    stdev      ctgs-sum       comments | ||
|    AMOSCmp-PA14-merge  ?      AMOSCmp-PA14(ctgs)  2053     1931    17    170485  3201.7  12981.57   6182486        2053-1931=122 gaps closed |    AMOSCmp-PA14-merge  ?      AMOSCmp-PA14(ctgs)  2053     1931    17    170485  3201.7  12981.57   6182486        2053-1931=122 gaps closed | ||
| == Staphylococcus aureus MW2 == | == Staphylococcus aureus MW2 == | ||
| === Data === | === Data === | ||
| Files: | |||
|   /fs/szasmg2/Bacteria/dpuiu/Solexa/Staphylococcus_aureus/ | |||
| Reads: | Reads: | ||
|    3857879 35bp Solexa reads |    3857879 35bp Solexa reads | ||
|   22843 contain at least 1 N; 10455 contain more than 1 N; 8244 contain more than 2 N's | |||
| Reference: | Reference: | ||
| Line 401: | Line 613: | ||
|    NC_007793.1    2872769 32.75  Staphylococcus aureus subsp. aureus USA300, complete genome |    NC_007793.1    2872769 32.75  Staphylococcus aureus subsp. aureus USA300, complete genome | ||
|    NC_010079.1    2872915 32.76  Staphylococcus aureus subsp. aureus USA300_TCH1516, complete genome |    NC_010079.1    2872915 32.76  Staphylococcus aureus subsp. aureus USA300_TCH1516, complete genome | ||
| === Sequence alignments === | |||
| ==== soap ==== | |||
|   3735781 of 3857879 (96.76%) aligned by "soap -v 5 -g 3 -s 12" | |||
| Mismatches:  | |||
|   ~ 10% of the aligned reads had more at least 2 mismatches  | |||
|   ~  4% of the aligned reads had more at least 3 mismatches  | |||
|   cat Sa.soap.v5.s12.g3 | awk '{print $10}' | count.pl | |||
|   #       total        % | |||
|   0       2752024      73.67 | |||
|   1       633043       16.95 | |||
|   2       201439       5.39 | |||
|   3       84258        2.26 | |||
|   4       41787        1.12 | |||
|   5       22629        0.61 | |||
|   201     525          0.01 | |||
|   101     45 | |||
|   202     20 | |||
|   103     5 | |||
|   203     4 | |||
|   102     2 | |||
|   Total   3735781 | |||
| === Sequence assemblies === | === Sequence assemblies === | ||
| Line 448: | Line 685: | ||
|    12      17 |    12      17 | ||
|    Total   1114 |    Total   1114 | ||
| Contig coverage & alignments to the reference: | |||
|   $ grep "^##" Sa.contig | perl -ane 'print $F[1]*35/$F[2],"\n";' > Sa.cvg | |||
|   #ctgs   min         max     mean    median  n50     sum | |||
|   1777    1.55        517.25  43      37.41   44.36   76159.86 | |||
|   $ show-coords -I 94 1con-contigs.delta | grep CONTAIN | awk '{print $19,$13}' | count.pl | sort -n >! 1con-contigs.qry_hits | |||
|   $ join 1con-contigs.qry_hits Sa.cvg | awk '{print $3,$4}' | ~/bin/getCorrelationDescriptive.pl | |||
|   0.61 # coeff. of correlation > 0 | |||
|   0.81 # coeff. of correlation for ctgs>100 bp | |||
| 0cvg regions | 0cvg regions | ||
| Line 453: | Line 703: | ||
|          #elem   min     max     mean    median  n50     sum |          #elem   min     max     mean    median  n50     sum | ||
|    gap   305     1       570     34      19      70      10521 |    gap   305     1       570     34      19      70      10521 | ||
| ===== OVL: 21 bp ===== | ===== OVL: 21 bp ===== | ||
| Line 470: | Line 721: | ||
|    ctg           1175    100     22893   2350    1094    5514    2760953 |    ctg           1175    100     22893   2350    1094    5514    2760953 | ||
|    ctg(200bp+)   935     204     22893   2918    1766    5594    2727929 |    ctg(200bp+)   935     204     22893   2918    1766    5594    2727929 | ||
|   breaks        0 | |||
|   assembled_reads 2676002 (69.8%) | |||
|   singletons      1181877 | |||
| * In this case velvet & edena have similar contig stats !!! | |||
| * All edena contigs align to velvet contigs | |||
| * Not all edena contigs are CONTAINED/IDENTICAL to velvet contigs !!! | |||
|                 #elem   min     max     mean    median  n50     sum | |||
|   edena.uniq    360     100     19924   2475    273     7791    891123 | |||
|   $ show-coords -H velvet-edena.filter-q.delta | egrep 'CONTAINED|BEGIN|END' | awk '{print $19,$13}' | count.pl -m 2 | wc -l  # 193 edena contigs CONTAIN multiple velvet contigs | |||
|   $ show-coords -H velvet-edena.filter-q.delta | egrep 'IDEN' | wc -l                                                         # 435 edena contigs  are IDENTICAL to velvet contigs | |||
| ==== euler-sr ==== | |||
|           #elem   min     max     mean    median  n50     sum | |||
|   all     1349    22      43140   2073    109     10523   2796347 | |||
|   200bp+  566     201     43140   4859    2334    10662   2750118 | |||
| === Contig assemblies === | === Contig assemblies === | ||
| Line 549: | Line 820: | ||
|    breaks 27 |    breaks 27 | ||
| Edena contig ends get trimmed. There are fewer overlap  | Edena contig ends get trimmed. There are fewer overlap among them than velvet contigs!!! | ||
| ====  | ==== AMOScmp on velvet & edena contigs ==== | ||
| Input   | Input   | ||
| Line 565: | Line 836: | ||
|    edena         935     204     22893   2918    3303    2727929 |    edena         935     204     22893   2918    3303    2727929 | ||
|    edena.uniq    202     206     19924   4305    4394    869552 |    edena.uniq    202     206     19924   4305    4394    869552 | ||
| Output:  | |||
| All: | |||
|         #elem   min     max     mean    median  n50     sum | |||
|   ctg   461     45      81678   6029    703     21943   2779591 | |||
|   singl 7       45      11556   4283    2634    11556   29984 | |||
|   all   468     45      81678   6003    719     21353   2809575 | |||
|   breaks 0 | |||
| 200 bp+: | |||
|         #elem   min     max     mean    median  n50     sum | |||
|   ctg   289     205     81678   9565    4401    21943   2764375 | |||
|   singl 4       2634    11556   7458    11556   11556   29832 | |||
|   all   293     205     81678   9537    4401    21943   2794207 | |||
|   breaks 0 | |||
| ==== minimus2 on velvet & edena contigs ==== | |||
| Contig stats: | Contig stats: | ||
| Line 637: | Line 926: | ||
|    all   399     201     53996   6911    3156    16039   2757391 |    all   399     201     53996   6911    3156    16039   2757391 | ||
|    breaks 6 |    breaks 6 | ||
| == Strep suis17 == | |||
| === Data === | |||
| * 2726374  36bp Solexa reads | |||
| * 2007491bp 41.30 %GC Reference | |||
| * Coverage: 49X | |||
| * Location:  | |||
|   /fs/szdata/Solexa/Streptococcus_suis/ | |||
|   /fs/szasmg2/Bacteria/dpuiu/Solexa/Streptococcus_suis/ | |||
| === Sequence alignments === | |||
| ==== soap ==== | |||
|   2261784 of 2726374  (82.95%) aligned by "soap -v 5 -g 0 -s 10" | |||
| Mismatches:  | |||
|   more Ss.soap | awk '{print $10}' | count.pl | |||
|   mism.   hits     hits(%) | |||
|   0       1235237  54.61 | |||
|   1       548822   24.27 | |||
|   2       238795   10.56 | |||
|   3       120622   5.33 | |||
|   4       71451    3.16 | |||
|   5       46857    2.07 | |||
|   Total   2261784 | |||
| === Sequence assembly === | |||
| ==== Velvet ==== | |||
| Contig stats: | |||
|            #elem   min     max     mean    median  n50     sum | |||
|   all      917     45      24419   2154    278     6661    1975147 | |||
|   200bp+   479     202     24419   4051    2940    6966    1940363 | |||
|   breaks 2  (1 contig aligned in 2 overlapping pieces; tandem repeat in the 217341-217404 region of Ss) | |||
| Different coverage assemblies: | |||
|     cvg     #ctgs   min     max     mean    median  n50     sum     %singletons     breaks | |||
|     24.5    4630    45      5267    436     284     709     2016871 23.98           6 | |||
|     29.4    2797    45      7739    716     426     1333    2001964 23.46           0 | |||
|     34.3    1797    45      9347    1107    649     2230    1989102 23.34           0 | |||
|     39.2    1325    45      24419   1496    663     3724    1982292 23.32           2 | |||
|     44.1    1016    45      23272   1946    516     5290    1976890 23.30           2 | |||
|     49.0    917     45      24419   2154    278     6661    1975147 23.30           2 | |||
| ==== Edena ==== | |||
| Contig stats: | |||
|                 #elem   min     max     mean    median  n50     sum | |||
|   all           1668    100     8829    1135    743     1968    1893815 | |||
|   200bp+        1390    200     8829    1333    959     1985    1853354 | |||
|   breaks 2  (1 contig aligned in 2 overlapping pieces; tandem repeat in the 217341-217404 region of Ss) | |||
| Singletons: 1582493 (58%) | |||
| ==== Euler-SR ==== | |||
| Command: | |||
|   $ AssembleIllumina.pl Ss.seq working_dir -vertexSize 21 | |||
| Contig stats: | |||
|           #elem   min     max     mean    median  n50     sum | |||
|   all     2658    22      19768   757     42      4867    2013259 | |||
|   200bp+  659     200     19768   2932    1973    5117    1932406 | |||
| ==  Escherichia coli str. K-12 substr. MG1655 (Solexa-paired) == | |||
| === Data === | |||
| * 7,047,668 + 10,408,224  paired-end 36bp Solexa reads => 105X + 161X cvg | |||
| * 1M => 15.51X cvg | |||
| * NC_000913.2  4,639,675 50.79%GC reference | |||
| * Location:  | |||
|   /fs/szasmg2/Bacteria/dpuiu/Solexa-pair/Ecoli | |||
| * Formatting : split seq into 2  | |||
|   cat Ecoli.seq | ~/bin/solexap2amos.pl -len 36          > ! Ecoli.paired.afg | |||
|   cat Ecoli.seq | ~/bin/solexap2amos.pl -len 36 -noamos  > ! Ecoli.paired.seq | |||
| === Sequence assembly (1M reads: 15X cvg) === | |||
| * Velvet : hash length of 23 works best; also use -exp_cov 15 | |||
| * AMOScmp: OVL=10; the smaller the better  | |||
| * Ctg stats: | |||
|   .                            elem       min    q1     q2     q3     max        mean       n50        sum             | |||
|   AMOScmp-shortReads           331        40     1941   7826   18316  172,800    14022      26474      4641532  better !!!        | |||
|   AMOScmp-shortReads-trimmed   372        40     1916   7226   17348  98,155     12477      24730      4641600         | |||
|   velvet.ctg                   12073      45     121    250    461    4690       366        568        4430625         | |||
|   velvet.scf                   715        45     58     70     77     313779     6671       104454     4769943        | |||
| === Sequence assembly (7+M reads: 100+X cvg) === | |||
| * Velvet : hash length of 27 works best; also use -exp_cov 100 | |||
|   .                            elem       min    q1     q2     q3     max        mean       n50        sum             | |||
|   velvet.ctg                   375        53     63     82     6105   192151     12162      59972      4561051         | |||
|   velvet.scf                   272        53     60     71     133    860695     16785      171744     4565616 | |||
| ==  Escherichia coli str. K-12 substr. MG1655 (454-paired) == | |||
| === Data === | |||
| * Formatting : remove 44 bp linker | |||
|   nucmer -c 10 -l 10 L454.seq Ecoli.merged.seq -p  Ecoli --forward | |||
|   cat  Ecoli.delta | perl ~/bin/filterDeltaAnnotated.pl | ~/bin/delta2alignRange.pl -qry  >!  Ecoli.pair.seqs | |||
|   difference.pl  Ecoli.seqs Ecoli.pair.seqs >! Ecoli.single.seqs | |||
|   cat Ecoli.pair.seqs | p '$F[1]--; print "$F[0] $F[0].f 1 $F[1]\n"'     >! Ecoli.pair.f.pos | |||
|   cat Ecoli.pair.seqs | p '$F[2]++; print "$F[0] $F[0].r $F[2] $F[3]\n"' >! Ecoli.pair.r.pos | |||
Latest revision as of 01:21, 2 September 2010
Papers
Assemblers
Denovo
CA
See CA
Abyss
Atlas
Euler-SR
- Euler-SR source
- README
- Short read fragment assembly of bacterial genomes (Genome Research 2008)
- De Novo Repeat Classification and Fragment Assembly (Genome Research 2004)
- Automated De Novo Identification of Repeat Sequence Families in Sequenced Genomes (Genome Research 2002)
- De novo fragment assembly with short mate-paired reads: Does the read length matter? (Genome Research 2008)
- Cmd:
 setenv EUSRC  /fs/szdevel/dpuiu/euler-sr/
 setenv MACHTYPE x86_64
 set path=($path ${EUSRC}/assembly)
 set path=($path ${EUSRC}/assembly/${MACHTYPE})
 /fs/szdevel/dpuiu/euler-sr/assembly/Assemble.pl readsFile vertexSize ruleFile
- 35bp Solexa reads: vertexSize=25 vs vertexSize=21
vertexSize=21 more agreagte assembly; some tandem repeats in 2K+ contigs were collapsed (detected by aligning to Strep aureus finished genome) these repeats were present in small ctgs (~ 200bp) when euler was run with vertexSize=25 these repeats were not present in the velvet assembly
newbler (454)
- Command:
/fs/sz-user-supported/Linux-x86_64/packages/rig/bin/runAssembly -o dir prefix.sff
minimus
hash-overlap issues:
- very slow on large sequences (runs Smith-Waterman)
- finds all overlaps (not only CONTAINED|BEGIN|END); of no use to the assembler
- default ovl: 40 bp (too large)
- min ovl: 15 (set by minimizer)
- 20 bp overlap ok;
- the min ovl imposed by the next step (tigger) is 5bp
minimus1
- hash-overlap was replaced with a nucmer based overlapper
- nucmer takes long time and lots of space to find very short overlaps (< 16bp?)
- We can have more control on the overlaps that are used (can easily filter the nucmer output)
- can discard CONTAINED/IDENTITY alignments in order to avoid collapsing repeats
- can mask the middle of the sequences to make the alignment process faster; (Issue: CONTAINED overlaps appear as BEGIN/END overlaps)
- Example:
.... OVERLAP=16 MINID=94 MAXTRIM=20 ... 20: $(NUCMER) -maxmatch -l $(OVERLAP) -c $(OVERLAP) $(SEQS) $(SEQS) -p $(PREFIX) 24: $(SHOWCOORDS) -H -c -l -o -r -I $(MINID) $(ALIGN) | egrep 'BEGIN|END' > $(COORDS) 25: $(SCRIPTDIR)/nucmer2ovl.pl -ignore $(MAXTRIM) -tab $(COORDS)> $(OVLTAB) 26: $(SCRIPTDIR)/ovl2OVL.pl $(OVLTAB) > $(OVL) 28: $(BINDIR)/bank-transact -z -b $(BANK) -m $(OVL)
minimus2
- assembles two sequence sets (S1,S2)
- Instead of merging the sets and aligning all versus all (S1+S2 : S1+S2) it aligns one versus the other:
S1:S2 S1:S1 (optional) S2:S2 (optional)
- Each alignment can be run with different parameters
Example 1: merge two de-novo assemblies
... TYPE1=N TYPE2=N ... OVERLAP11 = 16 OVERLAP22 = 16 OVERLAP12 = 40 ... # align sets $(NUCMER) -maxmatch -l $(OVERLAP11) -c $(OVERLAP11) $(SEQS1) $(SEQS1) -p $(PREFIX11) $(NUCMER) -maxmatch -c $(OVERLAP12) $(SEQS1) $(SEQS2) -p $(PREFIX12) $(NUCMER) -maxmatch -l $(OVERLAP22) -c $(OVERLAP22) $(SEQS2) $(SEQS2) -p $(PREFIX22) ... # merge alignments $(SHOWCOORDS) $(ALIGN11) | egrep 'BEGIN|END' > $(COORDS) $(SHOWCOORDS) $(ALIGN12) >> $(COORDS) $(SHOWCOORDS) $(ALIGN22) | egrep 'BEGIN|END' >> $(COORDS) ... nucmer2ovl.pl $(COORDS) > $(OVLTAB) ...
Example 2: merge a comparative and a de-novo assembly
... TYPE1 = C TYPE2 = N ... # look only for short overlaps among adjacent sequences from S1 $(SCRIPTDIR)/fasta2ovl.pl -tab $(SEQS1) > $(OVLTAB) ... # look for all overlaps among one sequence from S1 and one from S2 $(NUCMER) -maxmatch -c $(OVERLAP12) $(SEQS1) $(SEQS2) -p $(PREFIX12) $(SHOWCOORDS) $(ALIGN12) | $(SCRIPTDIR)/nucmer2ovl.pl >> $(OVLTAB) ... # look for end overlaps among sequences from S2 $(NUCMER) -maxmatch -l $(OVERLAP22) -c $(OVERLAP22) $(SEQS2) $(SEQS2) -p $(PREFIX22) $(SHOWCOORDS) $(ALIGN22) | egrep 'BEGIN|END' > $(COORDS22) | $(SCRIPTDIR)/nucmer2ovl.pl >> $(OVLTAB) ...
SSAKE
- Web Site
- Short Sequence Assembly by K-mer search and 3' read Extension
- Rene L Warren, Granger G Sutton, Steven JM Jones, Robert A Holt. "Assembling millions of short DNA sequences using SSAKE" Bioinformatics. Vol. 23 no. 4 2007, pages 500–501
- Uses a greedy assembly algorithm: progressively searches for perfect 3’-most k-mers using a DNA prefix tree to identify overlaps between any two sequences; stringently clusters short reads into contigs
- Copyright (c) 2006-2008 Canada's Michael Smith Genome Science Centre.
- Current release: SSAKE 3.2.1 (08/21/2008)
- Distribution: PERL script; runs on Linux systems
- Parameters:
-m Minimum number of reads needed to call a base during overhang consensus build up (default -m 16) -o Minimum number of reads needed to call a base during an extension (default -o 2) -r Minimum base ratio used to accept a overhang consensus base (default -r 0.7) -t Max number of bases to trim (default 0)
- "Especially" targeted towards Solexa seq; 454 reads too long & homopol problem
- Can handle mated reads
- TQS.py should be used for qual trimming
Example:
$ grep -c "^>" Pa.seq # total number of reads $ (time ssake_3.0.pl -f Pa.seq) > ssake_3.0.log # version 3.0 $ (time ssake_3.2.pl -f Pa.seq -t 1) > ssake_3.2.log # version 3.2 $ ln -s Pa*contigs Pa.fasta $ ln -s Pa*singlets Pa.singlets.seq $ cat Pa.fasta | grep "^>" | perl -ane '/read(\d+)/; print $1, "\n";' | getSummaryDescriptive.pl -t assembled_reads # number of assembled reads $ cat Pa.singlets.seq | grep "^>" | perl -ane '/read(\d+)/; print $1, "\n";' | getSummaryDescriptive.pl -t singletons # number of singletons $ grep -v N Pa.seq # number of reads which contain ambiguities and will be discarder from the beginning (should be added to the singletons)
VCAKE
Velvet
- Sequence assembler for very short reads
- Velvet: Algorithms for De Novo Short Read Assembly Using De Bruijn Graphs Genome Res. Mar 2008
- Project Web Site at EBI Current version: 0.6 02/06/2008: Velvet 0.6
- Uses Bruijn graphs to assemble sequences: first uses an error correction algorithm to merge sequences which belong together, then a repeat solver separates paths sharing local overlaps
- Distribution: C program, to be compiled with gcc
- Open Source, GPL agreement, EBI.
- Current release: 0.7.55
- Parameters:
 - hash_length: odd integer (if even, it will be decremented) <= 31 (if above, will be reduced)
* minimum overlap length (30-40X coverage data set):
    17bp usually gives the fewest contigs (some case)
    23bp is also ok (used for Pa,Ps)
    15bp is too low => too many short contigs
*  -long option for including "long sequences" (Sanger, 454 or even reference sequences) very slow
- At ~30X, max ctg len is ~ 5KB
- At >45X, max ctg len is ~ 20KB
- Longest contig on simulated data is 56Kbp (Ps, no error 32bp reads, 32X cvg)
- Max ovl=32
- For longer reads (66bp), some of the repeats ( ~100bp) get assembled partially; -cov_cutoff parameter must be used, otherwise low cvg/quality ctgs are reported
- velvet_assy.afg lists the assembled reads (TLE messages)
- some reads extend beyond the contig ends
- there are some reads shared by multiple contigs (status:D) ; at least one of the instances occurs at a contig end (goes beyond the contig end)
- many contigs overlap by a few bp; max val= min_ovl-1
- minimum_ctg length =2*minimum_ovl-1 (23=>45, 21=>41 ...)
Example (unmated reads):
$ velveth . -[short|shortPaired|long|longPaired] ovl prefix.seq # ovl should be an odd integer like 21, 23 $ velvetg . -read_trkg yes
Example (mated reads):
$ velveth . 23 -shortPaired prefix.seq $ velvetg . -ins_length 500 -exp_cov 35
$ ~/bin/velvet.sh prefix 23 500 35 # combines both; prefix.seq should alternate fwd & rev reads
Example (processing results):
 $ bank-transact -c -z -b prefix.bnk -m velvet_assy.afg
 $ bank2contig prefix.bnk  > prefix.contig
 $ bank2fasta -b prefix.bnk > prefix.fasta
 
 $ listReadPlacedStatus -S -E prefix.bnk > prefix.singletons
 $ extractfromfastanames.pl prefix.seq prefix.singletons > prefix.singletons.seq
 
 $ listReadPlacedStatus prefix.bnk > prefix.status
 $ grep D prefix.status  | awk '{print $5,$6}' | sort -u > prefix.pairs
Edena
- Edena: Exact DE Novo Assembler
- De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer Genome Res. Apr 2008
- Project Web Site Current verision: 2.1.1 03/17/2008
- Genomic Research Laboratory, Infectious Diseases Service, Geneva University Hospitals, Switzerland
- Current release: 2.1.1 (03/17/2008)
- Distribution: Compiled executables for Linux-32,Linux-64
- Uses a traditional read overlap-layout paradigm (adjusted for short reads)
- Parameters:
--overlapCutoff [int] Only consider overlaps >= than the specified size. (default 22) --minContigSize [int] Minimum size of the contigs to output. (default 100)
- contigs don't overlap (ends seem to be trimmed)
- a significant number of reads don't get assembled; max % assembled : 69% for Staphylococcus_aureus
- reads should be 1/line (+1 line for the header)
- for 66bp reads, the longest ctgs were obtained for --overlapCutoff 35
Example:
$ edena -r prefix.seq -p ovl $ edena -e prefix.ovl -p prefix $ ln -s prefix_contigs.fasta prefix.fasta $ cat *_info.txt # assembly stats $ grep assembled prefix_info.txt $ cat prefix_contigs.fasta | grep "^>" | perl -ane '/(\d+)$/; print $1, "\n";' | getSummaryDescriptive.pl -t reads_assembled
SOAPdenovo
- http://soap.genomics.org.cn/soapdenovo.html
- Release 1.03 , 06-09-2009
- Example config
[LIB] # single reads f=in.fa [LIB] # paired reads avg_ins=180 f1=fwd.seq f2=rev.seq [LIB] # paired reads avg_ins=180 p=in.seq
- Example run:
$ cat SOAPdenovo.sh /nfshomes/dpuiu/szdevel/SOAPdenovo_Release1.03/SOAPdenovo all -s SOAPdenovo.config -o out
- Example turkey: paires
elem min q1 q2 q3 max mean n50 sum reads 88357456 76 76 76 76 76 76 76 6715166656 # 44178728 mates; 2.39X cvg ctg 4664917 24 25 36 156 13863 198 761 926637298 scf 982536 100 258 517 1080 16376 839 1411 825003003
85505015 out.readOnContig
- Location (turkey)
/scratch1/dpuiu/SOAPdenovo/turkey/
Mira
- http://www.chevreux.org/projects_mira.html
- Sanger, 454 and Solexa /Illumina
- supposed to be "very" slow
Others
- ALLPATHS : De novo assembly of whole-genome shotgun microreads Genome Res. Mar 2008
- SSAKE
- VCAKE
- SHARCGS
- SeqMan Genome Assembler (SMGA) by DNAStar(commercial)
- Euler-SR
Comparative
AMOScmp
- Web site
- one should increase the make-consensus error rate if ref and qry are not that similar
- all reads from layout are included in consensus
casm-layout issues:
- if one sequence end does not align, the read is not assembled; the adjacent read is not assembled either if based on alignment to the ref. the two "theoretically" overlap;
- for a circular bacterial chromosome, the BEGIN/END reads don't get assembled => cannot get a circular contig !!!
- all CONTAINED reads should be assembled but the ones that are adjacent to dirty reads are left as singletons as well !!!
- option -r places repeats randomly; what if we want more control over this?
- There are some large negative gaps in scaffold file (Solexa read assemblies); Some contigs are "CONTAINED" in the contig just before the large negative gap; all these contig id's are larger than the .scaff last contig ; Contig alignments to the ref have breaks
- casm-layout -g 10000 (defaul max gap) seems to produce the problem above; this value should be reduced (Ex 100)
show-coords issues:
- sometimes long CONTAINED annotated contigs contain dirty ends
show-snps issues:
- ambiguities are not shown
- maximum init_size = 10000 query seqs
- does not work on long seqs
Example:
$ sort -nk4 AMOS.scaff | head 322 BE 87996 -24519 210 BE 7460 -7324 1271 BE 60490 -4112 17 BE 1022 -944 ...
$more AMOS.scaff ... 322 BE 87996 -24519 # 322 contains contigs 1770, 1771 ... 1770 BE 39 1100 1771 BE 130 738 ... 1780 BE 72 25004
$ tail -1 AMOS.scaff 1767 BE 39264 0
AMOScmp-shortReads-alignmentTrimmed
Defaults:
MINCLUSTER = 16 MINMATCH = 16 MINLEN = 20 # delta-filter -l 20 MINOVL = 10 # previously 5 ; mis-assemblies occurred MAXTRIM = 10 MAXGAP = 100 # default 10,000 (casm-layout -g) creates some incorrect contigs MAJORITY = 70 # previously 50 CONSERR = 0.06 ALIGNWIGGLE = 2 # different than the default AMOScmp 15
* Reads are trimmed based on nucmer alignment to a reference genome; reads that are adjacent to 0 cvg regions are not trimmed * If reads are not trimmed CONSERR should be increased, otherwise make-consensus fails * If reads are not trimmed, the avg contig length is significantly shorter than if reads are trimmed based on alignments => !!! read trimming is important. * assembly is very fragmented if the percent identity between reads and the reference is less than 98% * simulations run on a PA-b1 10K read subset using PA14 as reference; random snps were introduced into PA14 to decrease the %identity; (MINCLUSTER=12, MINMATCH=16, MINLEN=20) 99.4% : 1 contig (original data, no snps added to PA14) 98% : 2 contigs 96% : 10 contigs 93% : 50 contigs 93% : 37 contigs (MINMATCH=14)
AMOScmp-shortReads
Defaults
MINCLUSTER = 20 MINMATCH = 20 MINOVL = 5 MAXTRIM = 10 MAJORITY = 50 CONSERR = 0.06 ALIGNWIGGLE = 2
* No read trimming is done
AMOScmp2
- Runs AMOScmp of 2 data sets at once (with different parameters)
Example: merge a comparative(AMOScmp) and a de-novo(velvet) assembly, using the same reference was used to get the comparative one
... TYPE1 = C # Assembly 1 is a comparative one, already know the position of contigs; no need to align them to ref. TYPE2 = N # Assembly 2 is a de-novo one, don't know the position of contigs; we need to align them to ref. ... $(SCRIPTDIR)/scaff2delta.pl $(SCAFF1) > $(ALIGN1) $(NUCMER) -maxmatch -c $(OVERLAP2) $(REF) $(SEQS2) -p=$(PREFIX2) ... cp $(ALIGN1)$(ALIGN) cat $(ALIGN2) | grep -v $(PREFIX) | grep -v NUCMER >> $(ALIGN) ....
maq
- Version: 0.6.6
- Web site
- At high %identity (close to 100%) maq outperforms AMOScmp-shortReads-alignmentTrimmed
- At lower % identity (<97%) AMOScmp-shortReads-alignmentTrimmed outperforms maq
Other Scripts
delta-filter.pl
- Similar to MUMMer delta-filter
- The difference is that for multi copy repeats, it tries to place each one uniquely on the reference
- "delta-filter -q" collapses all in the same reference region
- "delta-filter -r" aligns the same one to multiple reference regions
fasta2ovl.pl
- Program that finds short normal direction overlaps between two adjacent sequences in a multi-FASTA file.
- It tests if the END of seqence N overlaps the BEGINING of sequence N+1.
megamerge (EMBOSS)
* Can merge well short overlapping contigs * Does not work well when the overlap identity is low * Min ovl length=2
nucmer2ovl.pl
- Program that converts nucmer annotated alignments to an overlap file (either AMOS or TAB format)
nucmer.pl
- It is a nucmer wrapper which runs nucmer with decreasingly alignment size on the sequences
- Iteration n only tries to align sequences not aligned at iterations n-1,n-2,...1
- Shorter run time, less disk space (than running nucmer with a small alignment/cluster size)
- Default parameters:
-len 20,16,12 : nucmer minmatch (-l) -cluster 20,16,14 : nucmer mincluster (-c) -filter 20,20,20 : delta-filter minlen (-l)
- Could be used by AMOScmp-shortReads, AMOScmp-shortReads-alignmentTrimmed
nucmerAnnotate.pl
- Program that parses a nucmer coords file and re-annotates sequence alignments
- The maximum length of the end sequence to be ignored can be set as a parameter
revDelta.pl
- Parses a delta file and switches the ref & query
revseq.pl
- Revese complements a seq file
- adds ".rev" suffix to seq id's
scaff2delta.pl
- Program that converts a scaff file to a delta file
- Example: No need to run nucmer on the AMOScmp output contigs to get them aligned to reference since we already know their positions in the scaffold
Subassembly
Reconciliator
Cases
No reference sequence
One data set, multiple denovo assemblers
Example:
* Solexa data * edena & velvet assemblers
Solution:
* run minimus on the contigs
Multipls data sets, one(multiple) denovo assemblers
Example:
* Solexa & 454 data * velvet assembly on Solexa * newbler assembly on 454
Solution:
* run minimus on the contigs
One reference sequence
Example:
* Solexa data * AMOScmp on Solexa * velvet on Solexa
Solution:
* remove the redundant velvet contigs (CONTAINED in AMOScmp contigs) (Optional; leaving the CONTAINED contigs had very similar results, CONSERR did not have to be reduced) * run AMOScmp on AMOScmp & velvet contigs => join some AMOScmp gaps * run minimus to further merge contigs
Multiple reference sequences
Example:
* Solexa data * AMOScmp on Solexa using Ref1 * AMOScmp on Solexa using Ref2
Solution:
* remove the redundant AMOScmp Ref2 contigs (CONTAINED in AMOScmp contigs) (optional) * run AMOScmp on AMOScmp Ref1 & Ref2 contigs => join some AMOScmp Ref1 gaps * run minimus to further merge contigs
Examples
Pseudomonas_syringae
Reference:
Name Length %GC NC_004578.1 6397126 58.40 NC_004633.1 73661 55.15 NC_004632.1 67473 56.17
Repeats:
desc #repeats min max mean stdev sum 50bp+ 991 50 7362 393.73 792.41 390192 100bp+ 429 100 7362 815.36 1060.29 349793
Simulated 32bp exact match reads
Type #reads min max mean Sim(ulated) 6538167 32 32 32 ( 32x coverage)
Single assemblies:
Assembler type input-data #reads #ctgs min max mean stdev ctgs-sum #singletons edena-sim denovo Sim 6538167 2068 100 47881 2994.03 4857.76 6191673 198699(3%) velvet-sim denovo Sim 6538167 2207 45 56810 2820.91 5348.36 6225757 123591(2%)
Solexa reads
Type #reads min max mean Solexa 6340136 32 32 32 (~31x coverage)
Assemblies:
Assembler type input-data #reads #ctgs min max mean stdev ctgs-sum #singletons AMOScmp comparative Solaxa 6340136 187 20 577929 34863.06 91692.34 6519394 698638(11%) velvet denovo Solaxa 6340136 25161 45 5057 241.83 212.61 6084887 velvet(100bp+) Solexa 19534 100 5057 288 220 5633400 edena denovo Solaxa 6340136 14084 100 5075 210.92 145.68 2970720 4893301(77%)
Assembly breaks:
velvet: 31 edena : 38
Merged assemblies(contigs):
assemblers type input-data #reads #ctgs min max mean stdev ctgs-sum comments minimus(ovl20) denovo velvet 25161 19121 45 5057 311.3 297.27 5952381 #merged 25161-19121=6040 (25%) gaps minimus(ovl20) denovo edena+velvet 39245 18603 45 6688 322.32 311.02 5996244
Pseudomonas aeruginosa b1 (PAb1)
References:
Name Length %GC PA14 6537648 66.29 PACS2 6492423 66.33 PAO1 6264404 66.56 ...
Solexa reads
Type #reads min max mean Solexa 8627900 33 33 33 (~43X coverage)
Assemblies:
All contigs:
Assembler type #ctgs min max mean stdev ctgs-sum #singletons comments AMOSCmp-PA14 comparative 2053 17 170485 3011.84 11917.53 6183320 1127399 AMOSCmp-PAO1 comparative 2797 17 75626 2161.19 5812.2 6044851 1592525 AMOScmp-PA2192 comparative 5816 17 133129 1072.8 3725.22 6239454 1601299 largest assembly maq-PA14 comparative 991 33 155551 6199.79 17445.05 6143996 1197385 velvet denovo 10684 45 16239 640.34 825.24 6841458 1241079 much better than Ps !!! edena denovo 11180 100 11300 552.36 610.52 6175460 3955865 (46%) much better than Ps !!! ssake denovo 185030 34 5490 77.21 141.23 14287079 3056893 euler-sr denovo 30394 26 5004 184 138 5601125 3437156
200bp+ contigs:
Assembler type #ctgs min max mean stdev ctgs-sum AMOSCmp-PA14 comparative 428 203 170485 14262.09 22852.74 6104175 AMOSCmp-PAO1 comparative 865 200 75626 6893.96 8766.63 5963278 AMOSCmp-PA2192 comparative 1299 200 133129 4683.46 6735.52 6083817 maq-PA14 comparative 368 200 155551 16581.7 25475.92 6102067 velvet denovo 7382 200 16239 877.05 896.35 6474426 edena denovo 8316 200 11300 692.54 651.24 5759209 ssake denovo 12532 200 5490 486 329.93 6090567
Merged assemblies(contigs&singletons):
Assembler type input-data #reads #ctgs min max mean stdev ctgs-sum comments AMOSCmp-PA14-merge ? AMOSCmp-PA14(ctgs) 2053 1931 17 170485 3201.7 12981.57 6182486 2053-1931=122 gaps closed
Staphylococcus aureus MW2
Data
Files:
/fs/szasmg2/Bacteria/dpuiu/Solexa/Staphylococcus_aureus/
Reads:
3857879 35bp Solexa reads 22843 contain at least 1 N; 10455 contain more than 1 N; 8244 contain more than 2 N's
Reference:
NC_003923 2820462 32.83 Staphylococcus aureus subsp. aureus MW2, complete genome. AP004832 20654 28.38 Staphylococcus aureus plasmid pMW2 DNA, complete sequence total 2841116
Coverage: 47.52X
Other strains:
NC_002952.2 2902619 32.81 Staphylococcus aureus subsp. aureus MRSA252, complete genome NC_002953.3 2799802 32.85 Staphylococcus aureus subsp. aureus MSSA476, complete genome NC_002951.2 2809422 32.82 Staphylococcus aureus subsp. aureus COL, complete genome NC_009632.1 2906507 32.95 Staphylococcus aureus subsp. aureus JH1, complete genome NC_009487.1 2906700 32.95 Staphylococcus aureus subsp. aureus JH9, complete genome NC_009782.1 2880168 32.88 Staphylococcus aureus subsp. aureus Mu3, complete genome NC_002758.2 2878529 32.88 Staphylococcus aureus subsp. aureus Mu50, complete genome NC_003923.1 2820462 32.83 Staphylococcus aureus subsp. aureus MW2, complete genome NC_002745.2 2814816 32.84 Staphylococcus aureus subsp. aureus N315, complete genome NC_007795.1 2821361 32.87 Staphylococcus aureus subsp. aureus NCTC 8325, complete genome NC_009641.1 2878897 32.89 Staphylococcus aureus subsp. aureus str. Newman, complete genome NC_007622.1 2742531 32.78 Staphylococcus aureus RF122, complete genome NC_007793.1 2872769 32.75 Staphylococcus aureus subsp. aureus USA300, complete genome NC_010079.1 2872915 32.76 Staphylococcus aureus subsp. aureus USA300_TCH1516, complete genome
Sequence alignments
soap
3735781 of 3857879 (96.76%) aligned by "soap -v 5 -g 3 -s 12"
Mismatches:
~ 10% of the aligned reads had more at least 2 mismatches ~ 4% of the aligned reads had more at least 3 mismatches
 cat Sa.soap.v5.s12.g3 | awk '{print $10}' | count.pl
 #       total        %
 0       2752024      73.67
 1       633043       16.95
 2       201439       5.39
 3       84258        2.26
 4       41787        1.12
 5       22629        0.61
 201     525          0.01
 101     45
 202     20
 103     5
 203     4
 102     2
 Total   3735781
Sequence assemblies
AMOScmp
OVL: 10 bp
Output stats:
 All:
               #elem   min     max     mean    median  n50     sum
 ctg           17      60      658745  167134  102144  485748  2841283
 ctg(200bp+)   15      2090    658745  189411  102265  485748  2841158
 singl         101422  23      35      35      35      35      3549538
 breaks        0
- Only 2.62% of the reads are singletons
- 101232 out of 101422 AMOS singletons (99.81%) are also velvet singletons
velvet
OVL: 23 bp
Output stats:
#elem min max mean median n50 sum ctg 1777 45 22892 1583 252 5337 2813162 ctg(100bp+) 1146 100 22892 2421 1167 5472 2774471 ctg(200bp+) 939 200 22892 2924 1764 5512 2745649 singl 289207 35 35 35 35 35 10122245 breaks 0
- 7.49% reads are singletons
- 101232 out of 289207 velvet singletons are also AMOS singletons
- 13189 reads are duplicates, connect 1126 contig pairs (1242 unique ctgs in these pairs)
- 1114 out of 1242 contig overlaps were identified by running nucmer (-c 12 -l 12)
ovlSize count 22 740 21 99 20 55 19 36 18 52 17 33 16 17 15 29 14 21 13 15 12 17 Total 1114
Contig coverage & alignments to the reference:
$ grep "^##" Sa.contig | perl -ane 'print $F[1]*35/$F[2],"\n";' > Sa.cvg #ctgs min max mean median n50 sum 1777 1.55 517.25 43 37.41 44.36 76159.86
 $ show-coords -I 94 1con-contigs.delta | grep CONTAIN | awk '{print $19,$13}' | count.pl | sort -n >! 1con-contigs.qry_hits
 $ join 1con-contigs.qry_hits Sa.cvg | awk '{print $3,$4}' | ~/bin/getCorrelationDescriptive.pl
 0.61 # coeff. of correlation > 0
 0.81 # coeff. of correlation for ctgs>100 bp
0cvg regions
 $show-coords Sa.all.delta | ~/bin/getNucmerCoverage.pl -M 0 | getSummary.pl -c 3
       #elem   min     max     mean    median  n50     sum
 gap   305     1       570     34      19      70      10521
OVL: 21 bp
#elem min max mean median n50 sum all 2099 41 18254 1342 258 4384 2817753 45bp+ 1830 45 18254 1534 448 4391 2806596 100bp+ 1337 100 18254 2075 1083 4558 2774667 200bp+ 1115 200 18254 2461 1486 4625 2743892 singl 224006
OVL: 23 bp seems to generate a better assembly on this data set !!!
edena
Output stats:
#elem min max mean median n50 sum ctg 1175 100 22893 2350 1094 5514 2760953 ctg(200bp+) 935 204 22893 2918 1766 5594 2727929 breaks 0 assembled_reads 2676002 (69.8%) singletons 1181877
- In this case velvet & edena have similar contig stats !!!
- All edena contigs align to velvet contigs
- Not all edena contigs are CONTAINED/IDENTICAL to velvet contigs !!!
#elem min max mean median n50 sum edena.uniq 360 100 19924 2475 273 7791 891123
 $ show-coords -H velvet-edena.filter-q.delta | egrep 'CONTAINED|BEGIN|END' | awk '{print $19,$13}' | count.pl -m 2 | wc -l  # 193 edena contigs CONTAIN multiple velvet contigs
 $ show-coords -H velvet-edena.filter-q.delta | egrep 'IDEN' | wc -l                                                         # 435 edena contigs  are IDENTICAL to velvet contigs
euler-sr
#elem min max mean median n50 sum all 1349 22 43140 2073 109 10523 2796347 200bp+ 566 201 43140 4859 2334 10662 2750118
Contig assemblies
AMOScmp on velvet contigs
Min Ovl: 10bp Min Cluster(nucmer -c): 40 (many velvet contigs are 45-65 bp)
 All:
       #elem   min     max     mean    median  n50     sum
 ctg   567     45      73756   4890    455     20846   2772573
 singl 5       45      11556   3159    58      11556   15794
 all   572     45      73756   4875    455     20846   2788367
 breaks 2 (1 contig with 2 overlapping alignments; no repeat; why is nucmer breaking the alignment?)
 
 200 bp+:
       #elem   min     max     mean    median  n50     sum
 ctg   339     201     73756   8118    3413    20846   2752170
 singl 2       4086    11556   7821    11556   11556   15642
 all   341     201     73756   8117    3494    20846   2767812
 breaks 2
- Reduced the number of contigs from 1777 to 567 !!!
minimus on velvet contigs
Min Ovl: 16bp
 All:
       #elem   min     max     mean    median  n50     sum
 ctg   316     45      39490   5812    3243    12483   1836527
 singl 797     45      22892   1204    161     4724    959356
 all   1113    45      39490   2512    299     9399    2795883
 breaks 20
 
 200 bp+:
       #elem   min     max     mean    median  n50     sum
 ctg   258     202     39490   7093    5069    12483   1829967
 singl 368     200     22892   2504    1347    5210    921478
 all   626     200     39490   4395    2313    9508    2751445
 breaks 19
- Reduced the number of contigs from 1777 to 1113 !!!
minimus1 on velvet contigs
Min Ovl: 16bp
 All:
       #elem   min     max     mean    median  n50     sum
 ctg   277     70      32618   6623    4709    11914   1834597
 singl 1022    45      19128   948     96      5057    968872
 all   1299    45      32618   2158    154     9526    2803469
 breaks 15
 
 200 bp+:
       #elem   min     max     mean    median  n50     sum
 ctg   247     200     32618   7413    6004    11984   1830980
 singl 354     200     19128   2590    1397    5294    916800
 all   601     200     32618   4572    2495    9757    2747780
 breaks 15
- Reduced the number of contigs from 1777 to 1299 !!!
- Generated fewer, larger big contigs than minimus
Fewer contigs get merged if:
* Min Ovl is dropped from 16bp to 14bp * Only the overlaps represented by Duplicate reads are used
minimus1 on edena contigs
 All:
       #elem   min     max     mean    median  n50     sum
 ctg   67      240     24824   5610    4814    8671    375847
 singl 1022    100     19924   2331    1084    5455    2382674
 all   1089    100     24824   2533    1183    5933    2758521
 breaks 27
 
 200 bp+:
       #elem   min     max     mean    median  n50     sum
 ctg   67      240     24824   5610    4814    8671    375847
 singl 811     204     19924   2902    1766    5588    2353676
 all   878     204     24824   3109    1862    5984    2729523
 breaks 27
Edena contig ends get trimmed. There are fewer overlap among them than velvet contigs!!!
AMOScmp on velvet & edena contigs
Input
All: desc #elem min max mean stdev sum velvet 1777 45 22892 1583 2769 2813162 edena 1175 100 22893 2350 3152 2760953 edena.uniq 360 100 19924 2475 3886 891123 200bp+ desc #elem min max mean stdev sum velvet 939 200 22892 2924 3272 2745649 edena 935 204 22893 2918 3303 2727929 edena.uniq 202 206 19924 4305 4394 869552
Output:
All:
#elem min max mean median n50 sum ctg 461 45 81678 6029 703 21943 2779591 singl 7 45 11556 4283 2634 11556 29984 all 468 45 81678 6003 719 21353 2809575 breaks 0
200 bp+:
#elem min max mean median n50 sum ctg 289 205 81678 9565 4401 21943 2764375 singl 4 2634 11556 7458 11556 11556 29832 all 293 205 81678 9537 4401 21943 2794207 breaks 0
minimus2 on velvet & edena contigs
Contig stats:
 All:
       #elem   min     max     mean    median  n50     sum
 ctg   740     67      38980   3793    1290    10553   2807159
 singl 366     45      6795    265     63      3684    96822
 all   1106    45      38980   2626    263     10104   2903981
 breaks 45
 
 200 bp+:
       #elem   min     max     mean    median  n50     sum
 ctg   560     202     38980   4969    2583    10561   2782466
 singl 31      203     6795    2424    1379    5455    75150
 all   591     202     38980   4835    2574    10197   2857616
 breaks 44
Staphylococcus aureus MW2 "mutated" 10% of sequence are random SNPs
Sequence assemblies
AMOScmp
OVL: 10 bp ...
#elem min max mean median n50 sum all 22431 20 705 107 87 124 2409008 200bp+ 1997 200 705 267 246 260 533882 singl 1934772
velvet
Same as the one above
Contig assemblies
AMOScmp on velvet contigs
Min Ovl: 10bp Min Match (nucmer -l): 20 Min Cluster (nucmer -c): 40
All:
#elem min max mean median n50 sum ctg 496 45 47387 5504 2076 14370 2729975 singl 629 45 11556 112 60 131 70262 all 1125 45 47387 2489 96 13785 2800237 breaks 6
200 bp+:
#elem min max mean median n50 sum ctg 405 204 47387 6715 3277 14370 2719753 singl 28 200 11556 1042 288 11556 29165 all 433 200 47387 6349 2878 14036 2748918 breaks 4
Min Ovl: 10bp Min Match (nucmer -l): 16 Min Cluster (nucmer -c): 32
 All:
       #elem   min     max     mean    median  n50     sum
 ctg   544     45      53996   5044    1215    16039   2744166
 singl 385     45      11556   133     53      5057    51297
 all   929     45      53996   3009    112     15857   2795463
 breaks 8
 
 200 bp+:
       #elem   min     max     mean    median  n50     sum
 ctg   391     201     53996   6980    3156    16443   2729048
 singl 8       222     11556   3543    4086    11556   28343
 all   399     201     53996   6911    3156    16039   2757391
 breaks 6
Strep suis17
Data
- 2726374 36bp Solexa reads
- 2007491bp 41.30 %GC Reference
- Coverage: 49X
- Location:
/fs/szdata/Solexa/Streptococcus_suis/ /fs/szasmg2/Bacteria/dpuiu/Solexa/Streptococcus_suis/
Sequence alignments
soap
2261784 of 2726374 (82.95%) aligned by "soap -v 5 -g 0 -s 10"
Mismatches:
 more Ss.soap | awk '{print $10}' | count.pl
 mism.   hits     hits(%)
 0       1235237  54.61
 1       548822   24.27
 2       238795   10.56
 3       120622   5.33
 4       71451    3.16
 5       46857    2.07
 Total   2261784
Sequence assembly
Velvet
Contig stats:
#elem min max mean median n50 sum all 917 45 24419 2154 278 6661 1975147 200bp+ 479 202 24419 4051 2940 6966 1940363 breaks 2 (1 contig aligned in 2 overlapping pieces; tandem repeat in the 217341-217404 region of Ss)
Different coverage assemblies:
cvg #ctgs min max mean median n50 sum %singletons breaks 24.5 4630 45 5267 436 284 709 2016871 23.98 6 29.4 2797 45 7739 716 426 1333 2001964 23.46 0 34.3 1797 45 9347 1107 649 2230 1989102 23.34 0 39.2 1325 45 24419 1496 663 3724 1982292 23.32 2 44.1 1016 45 23272 1946 516 5290 1976890 23.30 2 49.0 917 45 24419 2154 278 6661 1975147 23.30 2
Edena
Contig stats:
#elem min max mean median n50 sum all 1668 100 8829 1135 743 1968 1893815 200bp+ 1390 200 8829 1333 959 1985 1853354 breaks 2 (1 contig aligned in 2 overlapping pieces; tandem repeat in the 217341-217404 region of Ss)
Singletons: 1582493 (58%)
Euler-SR
Command:
$ AssembleIllumina.pl Ss.seq working_dir -vertexSize 21
Contig stats:
#elem min max mean median n50 sum all 2658 22 19768 757 42 4867 2013259 200bp+ 659 200 19768 2932 1973 5117 1932406
Escherichia coli str. K-12 substr. MG1655 (Solexa-paired)
Data
- 7,047,668 + 10,408,224 paired-end 36bp Solexa reads => 105X + 161X cvg
- 1M => 15.51X cvg
- NC_000913.2 4,639,675 50.79%GC reference
- Location:
/fs/szasmg2/Bacteria/dpuiu/Solexa-pair/Ecoli
- Formatting : split seq into 2
cat Ecoli.seq | ~/bin/solexap2amos.pl -len 36 > ! Ecoli.paired.afg cat Ecoli.seq | ~/bin/solexap2amos.pl -len 36 -noamos > ! Ecoli.paired.seq
Sequence assembly (1M reads: 15X cvg)
- Velvet : hash length of 23 works best; also use -exp_cov 15
- AMOScmp: OVL=10; the smaller the better
- Ctg stats:
. elem min q1 q2 q3 max mean n50 sum AMOScmp-shortReads 331 40 1941 7826 18316 172,800 14022 26474 4641532 better !!! AMOScmp-shortReads-trimmed 372 40 1916 7226 17348 98,155 12477 24730 4641600 velvet.ctg 12073 45 121 250 461 4690 366 568 4430625 velvet.scf 715 45 58 70 77 313779 6671 104454 4769943
Sequence assembly (7+M reads: 100+X cvg)
- Velvet : hash length of 27 works best; also use -exp_cov 100
. elem min q1 q2 q3 max mean n50 sum velvet.ctg 375 53 63 82 6105 192151 12162 59972 4561051 velvet.scf 272 53 60 71 133 860695 16785 171744 4565616
Escherichia coli str. K-12 substr. MG1655 (454-paired)
Data
- Formatting : remove 44 bp linker
nucmer -c 10 -l 10 L454.seq Ecoli.merged.seq -p Ecoli --forward cat Ecoli.delta | perl ~/bin/filterDeltaAnnotated.pl | ~/bin/delta2alignRange.pl -qry >! Ecoli.pair.seqs difference.pl Ecoli.seqs Ecoli.pair.seqs >! Ecoli.single.seqs cat Ecoli.pair.seqs | p '$F[1]--; print "$F[0] $F[0].f 1 $F[1]\n"' >! Ecoli.pair.f.pos cat Ecoli.pair.seqs | p '$F[2]++; print "$F[0] $F[0].r $F[2] $F[3]\n"' >! Ecoli.pair.r.pos