Assembly merge: Difference between revisions

From Cbcb
Jump to navigation Jump to search
No edit summary
 
(265 intermediate revisions by the same user not shown)
Line 1: Line 1:
= Papers =
* [http://genome.cshlp.org/cgi/content/full/18/2/324 Short read fragment assembly of bacterial genomes 2007]
= Assemblers =
= Assemblers =


== Denovo ==
== Denovo ==


=== Minimus ===
=== CA ===
 
See [[Dpuiu_CA|CA]]
 
=== Abyss ===
* [http://www.bcgsc.ca/platform/bioinfo/software/abyss Abyss]
 
=== Atlas ===
 
* [http://www.hgsc.bcm.tmc.edu/downloads/software/atlas/readme.html Atlas web site]
 
=== Euler-SR ===
 
* [http://euler-assembler.ucsd.edu/ Euler-SR source]
* [[Media:README.eulersr|README]]
* [http://genome.cshlp.org/cgi/content/abstract/gr.7088808v1 Short read fragment assembly of bacterial genomes] (Genome Research 2008)
* [http://genome.cshlp.org/cgi/content/full/14/9/1786 De Novo Repeat Classification and Fragment Assembly] (Genome Research 2004)
* [http://genome.cshlp.org/cgi/content/abstract/12/8/1269 Automated De Novo Identification of Repeat Sequence Families in Sequenced Genomes] (Genome Research 2002)
* [http://genome.cshlp.org/content/early/2008/12/03/gr.079053.108.abstract De novo fragment assembly with short mate-paired reads: Does the read length matter?] (Genome Research 2008)
 
* Cmd:
  setenv EUSRC  /fs/szdevel/dpuiu/euler-sr/
  setenv MACHTYPE x86_64
  set path=($path ${EUSRC}/assembly)
  set path=($path ${EUSRC}/assembly/${MACHTYPE})
  /fs/szdevel/dpuiu/euler-sr/assembly/Assemble.pl readsFile vertexSize ruleFile
 
 
* 35bp Solexa reads: vertexSize=25 vs vertexSize=21
  vertexSize=21
    more agreagte assembly;
    some tandem repeats in 2K+ contigs were collapsed (detected by aligning to Strep aureus finished genome)
    these repeats were present in small ctgs (~ 200bp) when euler was run with vertexSize=25
    these repeats were not present in the velvet assembly
 
=== newbler (454) ===
 
* Command:
  /fs/sz-user-supported/Linux-x86_64/packages/rig/bin/runAssembly -o dir prefix.sff
 
=== minimus ===
 
hash-overlap issues:
* very slow on large sequences (runs Smith-Waterman)
* finds all overlaps (not only CONTAINED|BEGIN|END); of no use to the assembler
* default ovl: 40 bp (too large)
* min ovl: 15 (set by minimizer)
* 20 bp overlap ok;
* the min ovl imposed by the next step (tigger) is 5bp
 
=== minimus1 ===
 
* hash-overlap was replaced with a nucmer based overlapper
* nucmer takes long time and lots of space to find very short overlaps (< 16bp?)
* We can have more control on the overlaps that are used (can easily filter the nucmer output)
* can discard CONTAINED/IDENTITY alignments in order to avoid collapsing repeats
* can mask the middle of the sequences to make the alignment process faster; (Issue: CONTAINED overlaps appear as BEGIN/END overlaps)
 
* Example:
  ....
  OVERLAP=16
  MINID=94
  MAXTRIM=20
  ...
  20: $(NUCMER) -maxmatch -l $(OVERLAP) -c $(OVERLAP) $(SEQS) $(SEQS) -p $(PREFIX)
  24: $(SHOWCOORDS) -H -c -l -o -r -I $(MINID) $(ALIGN) | egrep 'BEGIN|END' > $(COORDS)
  25: $(SCRIPTDIR)/nucmer2ovl.pl -ignore $(MAXTRIM) -tab $(COORDS)> $(OVLTAB)
  26: $(SCRIPTDIR)/ovl2OVL.pl $(OVLTAB)  > $(OVL)
  28: $(BINDIR)/bank-transact -z -b $(BANK) -m $(OVL)
 
=== minimus2 ===
 
* assembles two sequence sets (S1,S2)
* Instead of merging the sets and aligning all versus all (S1+S2 : S1+S2) it aligns one versus the other:
  S1:S2
  S1:S1 (optional)
  S2:S2 (optional)
* Each alignment can be run with different parameters
 
Example 1: merge two de-novo assemblies
 
  ...
  TYPE1=N
  TYPE2=N
  ...
  OVERLAP11      = 16
  OVERLAP22      = 16
  OVERLAP12      = 40
  ...
  # align sets
  $(NUCMER) -maxmatch -l $(OVERLAP11) -c $(OVERLAP11) $(SEQS1) $(SEQS1) -p $(PREFIX11)
  $(NUCMER) -maxmatch -c $(OVERLAP12) $(SEQS1) $(SEQS2) -p $(PREFIX12)
  $(NUCMER) -maxmatch -l $(OVERLAP22) -c $(OVERLAP22) $(SEQS2) $(SEQS2) -p $(PREFIX22)
  ...
  # merge alignments
  $(SHOWCOORDS) $(ALIGN11) | egrep 'BEGIN|END'  > $(COORDS) 
  $(SHOWCOORDS) $(ALIGN12) >> $(COORDS)
  $(SHOWCOORDS) $(ALIGN22) | egrep 'BEGIN|END'  >> $(COORDS)
  ...
  nucmer2ovl.pl  $(COORDS) > $(OVLTAB)
  ...
 
Example 2: merge a comparative and a de-novo assembly
  ...
  TYPE1          = C
  TYPE2          = N
  ...
  # look only for short overlaps among adjacent sequences from S1
  $(SCRIPTDIR)/fasta2ovl.pl -tab $(SEQS1) > $(OVLTAB) 
  ...
  # look for all overlaps  among one sequence from S1 and one from S2
  $(NUCMER) -maxmatch -c $(OVERLAP12) $(SEQS1) $(SEQS2) -p $(PREFIX12)
  $(SHOWCOORDS) $(ALIGN12) | $(SCRIPTDIR)/nucmer2ovl.pl >> $(OVLTAB)
  ...
  # look for end overlaps among sequences from S2
  $(NUCMER) -maxmatch -l $(OVERLAP22) -c $(OVERLAP22) $(SEQS2) $(SEQS2) -p $(PREFIX22) 
  $(SHOWCOORDS) $(ALIGN22) | egrep 'BEGIN|END' > $(COORDS22) | $(SCRIPTDIR)/nucmer2ovl.pl  >> $(OVLTAB)
  ...
 
=== SSAKE ===
* [http://www.bcgsc.ca/platform/bioinfo/software/ssake Web Site]
* Short Sequence Assembly by K-mer search and 3' read Extension
* Rene L Warren, Granger G Sutton, Steven JM Jones, Robert A Holt. "Assembling millions of short DNA sequences using SSAKE" Bioinformatics. Vol. 23 no. 4 2007, pages 500–501
* Uses a greedy assembly algorithm: progressively searches for perfect 3’-most k-mers using a DNA prefix tree to identify overlaps between any two sequences; stringently clusters short reads into contigs
* Copyright (c) 2006-2008 Canada's Michael Smith Genome Science Centre.
* Current release: SSAKE 3.2.1 (08/21/2008)
* Distribution: PERL script; runs on Linux systems
* Parameters:
  -m  Minimum number of reads needed to call a base during overhang consensus build up (default -m 16)
  -o  Minimum number of reads needed to call a base during an extension (default -o 2)
  -r  Minimum base ratio used to accept a overhang consensus base (default -r 0.7)
  -t  Max number of bases to trim (default 0)
* "Especially" targeted towards Solexa seq; 454 reads too long & homopol problem
* Can handle mated reads
* TQS.py  should be used for qual trimming
 
Example:
  $ grep -c "^>" Pa.seq          # total number of reads
  $ (time ssake_3.0.pl -f Pa.seq) > ssake_3.0.log            # version 3.0
  $ (time ssake_3.2.pl -f Pa.seq -t 1) > ssake_3.2.log        # version 3.2
  $ ln -s Pa*contigs  Pa.fasta
  $ ln -s Pa*singlets Pa.singlets.seq
  $ cat Pa.fasta        | grep "^>" | perl -ane '/read(\d+)/; print $1, "\n";' | getSummaryDescriptive.pl -t assembled_reads    # number of assembled reads
  $ cat Pa.singlets.seq | grep "^>" | perl -ane '/read(\d+)/; print $1, "\n";' | getSummaryDescriptive.pl -t singletons        # number of singletons
  $ grep -v N Pa.seq              # number of reads which contain ambiguities and will be discarder from the beginning (should be added to the singletons)


* hash-overlap overlaper:
=== VCAKE ===
    40 bp default : too large for contig assemblies
    20 bp minimum overlap; minimizer window length must be >=15bp; could these values be dropped lower?
    very slow on large sequences (ex Ps.fasta, Ps.plasmid.fasta) even if USE_SIMPLE_OVERLAP=1 !!! WHY???
    analyzes internal overlaps, not the sequence end ones only


=== Velvet ===
=== Velvet ===


  * overlap:
* Sequence assembler for very short reads
     18bp usually gives fewest contigs
* [http://www.genome.org/cgi/content/abstract/18/5/821 Velvet: Algorithms for De Novo Short Read Assembly Using De Bruijn Graphs] Genome Res. Mar 2008
     24bp is also ok
* [http://www.ebi.ac.uk/~zerbino/velvet/ Project Web Site at EBI] Current version: 0.6 02/06/2008: Velvet 0.6
* Uses Bruijn graphs to assemble sequences: first uses an error correction algorithm to merge sequences which belong together, then a repeat solver separates paths sharing local overlaps
* Distribution: C program, to be compiled with gcc
* Open Source, GPL agreement, EBI.
* Current release: 0.7.55
* Parameters:
  - hash_length: odd integer (if even, it will be decremented) <= 31 (if above, will be reduced)
  * minimum overlap length (30-40X coverage data set):
     17bp usually gives the fewest contigs (some case)
     23bp is also ok (used for Pa,Ps)
     15bp is too low => too many short contigs
     15bp is too low => too many short contigs
 
*  -long option for including "long sequences" (Sanger, 454 or even reference sequences) very slow
* velvet_assy.afg lists the assembled reads (TLE messages)
* At ~30X, max ctg len is ~ 5KB
* some reads extend beyond the contig ends
* At >45X, max ctg len is ~ 20KB
* there are some reads shared by multiple contigs (status:D) ; at least one of the instances occurs at a contig end (goes beyond the contig end)
* Longest contig on simulated data is 56Kbp (Ps, no error 32bp reads, 32X cvg)
 
* Max ovl=32
* For longer reads (66bp), some of the repeats ( ~100bp) get assembled partially; -cov_cutoff parameter must be used, otherwise low cvg/quality ctgs are reported
 
* velvet_assy.afg lists the assembled reads (TLE messages)
* some reads extend beyond the contig ends
* there are some reads shared by multiple contigs (status:D) ; at least one of the instances occurs at a contig end (goes beyond the contig end)
* many contigs overlap by a few bp; max val= min_ovl-1
* minimum_ctg length =2*minimum_ovl-1 (23=>45, 21=>41 ...)


Example:
Example (unmated reads):
   $ velveth . ovl prefix.seq                               # ovl should be an odd integer like 21, 23  
   $ velveth . -[short|shortPaired|long|longPaired] ovl prefix.seq       # ovl should be an odd integer like 21, 23  
   $ velvetg . -read_trkg yes
   $ velvetg . -read_trkg yes
Example (mated reads):
  $ velveth . 23 -shortPaired prefix.seq
  $ velvetg . -ins_length 500 -exp_cov 35
  $ ~/bin/velvet.sh prefix 23 500 35                      # combines both; prefix.seq should alternate fwd & rev reads
Example (processing results):
   $ bank-transact -c -z -b prefix.bnk -m velvet_assy.afg
   $ bank-transact -c -z -b prefix.bnk -m velvet_assy.afg
   $ bank2contig prefix.bnk  > prefix.contig
   $ bank2contig prefix.bnk  > prefix.contig
   $ bank2fasta -b prefix.bnk > prefix.fasta
   $ bank2fasta -b prefix.bnk > prefix.fasta
 
   $ listReadPlacedStatus -S -E prefix.bnk > prefix.singletons
   $ listReadPlacedStatus -S -E prefix.bnk > prefix.singletons
  $ extractfromfastanames.pl prefix.seq prefix.singletons > prefix.singletons.seq
 
  $ listReadPlacedStatus prefix.bnk > prefix.status
  $ grep D prefix.status  | awk '{print $5,$6}' | sort -u > prefix.pairs


=== Edena ===
=== Edena ===


   * contigs don't overlap
* Edena: Exact DE Novo Assembler
 
* [http://www.genome.org/cgi/content/abstract/gr.072033.107v2 De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer] Genome Res. Apr 2008
* [http://www.genomic.ch/edena.php Project Web Site] Current verision: 2.1.1 03/17/2008
* Genomic Research Laboratory, Infectious Diseases Service, Geneva University Hospitals, Switzerland
* Current release: 2.1.1 (03/17/2008)
* Distribution: Compiled executables for Linux-32,Linux-64
* Uses a traditional read overlap-layout paradigm (adjusted for short reads)
* Parameters:
   --overlapCutoff [int]    Only consider overlaps >= than the specified size. (default 22)
  --minContigSize [int]    Minimum size of the contigs to output. (default 100)
* contigs don't overlap (ends seem to be trimmed)
* a significant number of reads don't get assembled; max % assembled : 69% for Staphylococcus_aureus
 
* reads should be 1/line (+1 line for the header)
* for 66bp reads, the longest ctgs were obtained for --overlapCutoff 35


Example:
Example:
Line 39: Line 231:
   $ edena -e prefix.ovl -p prefix
   $ edena -e prefix.ovl -p prefix
   $ ln -s prefix_contigs.fasta prefix.fasta
   $ ln -s prefix_contigs.fasta prefix.fasta
  $ cat *_info.txt                                # assembly stats
  $ grep assembled prefix_info.txt
  $ cat prefix_contigs.fasta | grep "^>" | perl -ane '/(\d+)$/; print $1, "\n";' | getSummaryDescriptive.pl -t reads_assembled
=== SOAPdenovo ===
* http://soap.genomics.org.cn/soapdenovo.html
* Release 1.03 , 06-09-2009
* Example config
  [LIB]        # single reads
  f=in.fa
  [LIB]        # paired reads
  avg_ins=180
  f1=fwd.seq
  f2=rev.seq
  [LIB]        # paired reads
  avg_ins=180
  p=in.seq
* Example run:
  $ cat SOAPdenovo.sh
  /nfshomes/dpuiu/szdevel/SOAPdenovo_Release1.03/SOAPdenovo all -s SOAPdenovo.config -o out
* Example turkey: paires
            elem      min    q1    q2    q3    max        mean      n50        sum
  reads      88357456  76    76    76    76    76        76        76        6715166656  # 44178728 mates; 2.39X cvg
  ctg        4664917    24    25    36    156    13863      198        761        926637298
  scf        982536    100    258    517    1080  16376      839        1411      825003003
  85505015  out.readOnContig
* Location (turkey)
  /scratch1/dpuiu/SOAPdenovo/turkey/
=== Mira ===
* http://www.chevreux.org/projects_mira.html
* Sanger, 454 and Solexa /Illumina
* supposed to be "very" slow
== Others ==
* [http://www.genome.org/cgi/content/abstract/gr.7337908v1 ALLPATHS : De novo assembly of whole-genome shotgun microreads] Genome Res. Mar 2008
* [http://www.bcgsc.ca/platform/bioinfo/software/ssake SSAKE]
* VCAKE
* SHARCGS
* [http://www.dnastar.com/products/SMGA.php SeqMan Genome Assembler (SMGA)] by DNAStar(commercial)
* Euler-SR


== Comparative ==
== Comparative ==
Line 44: Line 282:
=== AMOScmp ===
=== AMOScmp ===


=== AMOScmp-shortReads ===
* [http://sourceforge.net/apps/mediawiki/amos/index.php?title=AMOScmp Web site]
Defaults
* one should increase the make-consensus error rate if ref and qry are not that similar
 
* all reads from layout are included in consensus
  MINMATCH        = 20        # nucmer default; ideally would be read length/2 to align all reads that contain 1bp error/difference
 
                              # for Pa, dropping the value to 17 made a big difference
casm-layout issues:
   MINCLUSTER      = 20       # contigs start breaking if this value is too high
* if one sequence end does not align, the read is not assembled; the adjacent read is not assembled either if based on alignment to the ref. the two "theoretically" overlap;
  MINOVL         = 5         # noticed one case with Pa where an ~50bp insert was missed (adjacent contigs were merged)
* for a circular bacterial chromosome, the BEGIN/END reads don't get assembled => cannot get a circular contig !!!
                              # AMOScmp has this value=10; for Pa we got about twice as many contigs using 10 instead of 5
* all CONTAINED reads should be assembled but the ones that are adjacent to dirty reads are left as singletons as well !!!
  MAXTRIM        = 10
* option -r places repeats randomly; what if we want more control over this?
  MAJORITY       = 50        # reduced from 70 to avoid negative gaps
 
  ALIGNWIGGLE    = 2        # AMOScmp has this value=10; duplications inserted if this value is increased
* There are some large negative gaps in scaffold file (Solexa read assemblies); Some contigs are "CONTAINED" in the contig just before the large negative gap; all these contig id's are larger than the  .scaff last contig ; Contig alignments to the ref have breaks
  CONSERR         = 0.06
* casm-layout -g 10000 (defaul max gap) seems to produce the problem above; this value should be reduced (Ex 100)
 
show-coords issues:
* sometimes long CONTAINED annotated contigs contain dirty ends
 
show-snps issues:
* ambiguities are not shown
* maximum  init_size = 10000 query seqs
* does not work on long seqs
 
Example:
  $ sort -nk4 AMOS.scaff | head
  322 BE 87996 -24519
  210 BE 7460 -7324
  1271 BE 60490 -4112
  17 BE 1022 -944
  ...
 
  $more AMOS.scaff
  ...
  322 BE 87996 -24519    # 322 contains contigs 1770, 1771 ...
  1770 BE 39 1100
  1771 BE 130 738
  ...
  1780 BE 72 25004
 
  $ tail -1 AMOS.scaff
   1767 BE 39264 0
 
=== AMOScmp-shortReads-alignmentTrimmed ===
 
Defaults:
  MINCLUSTER = 16
  MINMATCH    = 16
  MINLEN     = 20 # delta-filter -l 20
  MINOVL     = 10  # previously 5 ; mis-assemblies occurred
  MAXTRIM    = 10
  MAXGAP      = 100 # default 10,000 (casm-layout -g) creates some incorrect contigs
  MAJORITY   = 70 # previously 50
  CONSERR     = 0.06
  ALIGNWIGGLE = 2  # different than the default AMOScmp 15


   * Reads are trimmed based on nucmer alignment to a reference genome.
   * Reads are trimmed based on nucmer alignment to a reference genome; reads that are adjacent to 0 cvg regions are not trimmed
   * If reads are not trimmed CONSERR should be increased, otherwise make-consensus fails
   * If reads are not trimmed CONSERR should be increased, otherwise make-consensus fails
   * If reads are not trimmed, the avg contig length is significantly shorter than if reads are trimmed based on alignments =>  
   * If reads are not trimmed, the avg contig length is significantly shorter than if reads are trimmed based on alignments =>  
     !!! read trimming is important.
     !!! read trimming is important.
  * assembly is very fragmented if the percent identity between reads and the reference is less than 98%
  * simulations run on a PA-b1 10K read subset using PA14 as reference; random snps were introduced into PA14 to decrease the %identity; (MINCLUSTER=12, MINMATCH=16, MINLEN=20)
    99.4% : 1 contig  (original data, no snps added to PA14)
    98%  : 2 contigs
    96%  : 10 contigs
    93%  : 50 contigs
    93%  : 37 contigs (MINMATCH=14)
=== AMOScmp-shortReads ===
Defaults
  MINCLUSTER  = 20
  MINMATCH    = 20
  MINOVL      = 5
  MAXTRIM    = 10
  MAJORITY    = 50
  CONSERR    = 0.06
  ALIGNWIGGLE = 2
  * No read trimming is done
=== AMOScmp2 ===
* Runs AMOScmp of 2 data sets at once (with different parameters)
Example: merge a comparative(AMOScmp) and a de-novo(velvet) assembly, using the same reference was used to get the comparative one
  ...
  TYPE1          = C  # Assembly 1 is a comparative one, already know the position of contigs; no need to align them to ref.
  TYPE2          = N  # Assembly 2 is a de-novo one, don't know the position of contigs; we need to align them to ref.
  ...
  $(SCRIPTDIR)/scaff2delta.pl $(SCAFF1) > $(ALIGN1)
  $(NUCMER) -maxmatch -c $(OVERLAP2) $(REF) $(SEQS2) -p=$(PREFIX2)
  ...
  cp $(ALIGN1)$(ALIGN)
  cat $(ALIGN2) | grep -v $(PREFIX) | grep -v NUCMER  >> $(ALIGN)
  ....
=== maq ===
* Version: 0.6.6
* [http://maq.sourceforge.net/ Web site]
* At high %identity (close to 100%) maq outperforms AMOScmp-shortReads-alignmentTrimmed
* At lower % identity (<97%) AMOScmp-shortReads-alignmentTrimmed outperforms maq


=== Other Scripts ===
=== Other Scripts ===


==== EMBOSS megamerge ====
==== delta-filter.pl ====
 
* Similar to MUMMer delta-filter
* The difference is that for multi copy repeats, it tries to place each one uniquely on the reference
* "delta-filter -q" collapses all in the same reference region
* "delta-filter -r" aligns the same one to multiple reference regions
 
==== fasta2ovl.pl ====
 
* Program that finds short normal direction overlaps between two adjacent sequences in a multi-FASTA file.
* It tests if the END of seqence N overlaps the BEGINING of sequence N+1.
 
==== megamerge (EMBOSS) ====


   * Can merge well short overlapping contigs
   * Can merge well short overlapping contigs
   * Does not work well when the overlap identity is low
   * Does not work well when the overlap identity is low
   * Min ovl length=2
   * Min ovl length=2
==== nucmer2ovl.pl ====
* Program that converts nucmer annotated alignments to an overlap file (either AMOS or TAB format)
==== nucmer.pl ====
* It is a nucmer wrapper which runs nucmer with decreasingly alignment size on the sequences
* Iteration n only tries to align sequences not aligned at iterations n-1,n-2,...1
* Shorter run time, less disk space (than running nucmer with a small alignment/cluster size)
* Default parameters:
  -len 20,16,12        : nucmer minmatch (-l)
  -cluster 20,16,14    : nucmer mincluster (-c)
  -filter 20,20,20      : delta-filter minlen (-l)
* Could be used by AMOScmp-shortReads, AMOScmp-shortReads-alignmentTrimmed
==== nucmerAnnotate.pl ====
* Program that parses a nucmer coords file and re-annotates sequence alignments
* The maximum length of the end sequence to be ignored can be set as a parameter
==== revDelta.pl ====
* Parses a delta file and switches the ref & query
==== revseq.pl ====
* Revese complements a seq file
* adds ".rev" suffix to seq id's
==== scaff2delta.pl  ====
* Program that converts a scaff file to a delta file
* Example:  No need to run nucmer on the AMOScmp output contigs to get them aligned to reference since we already know their positions in the scaffold
== Subassembly ==
== Reconciliator ==
* [http://bioinformatics.oxfordjournals.org/cgi/reprint/btm542v1.pdf Article]
* [http://www.genome.umd.edu/reconciliation_instructions.htm Download]


= Cases =  
= Cases =  
Line 81: Line 456:


Solution:
Solution:
  * merge 2 assembly contigs
   * run minimus on the contigs
   * run minimus on them


=== Multipls data sets, one(multiple) denovo assemblers ===
=== Multipls data sets, one(multiple) denovo assemblers ===


Example:  
Example:  
  Solexa & 454 data
* Solexa & 454 data
   velvet assemblers for each set
* velvet assembly on Solexa
* newbler assembly on 454
    
Solution:
  * run minimus on the contigs


== One reference sequence ==
== One reference sequence ==


=== Few indels, few rearrangements ===
Example:
  * Solexa data
  * AMOScmp on Solexa
  * velvet on Solexa 


Solution:
Solution:
   * AMOScmp
   * remove the redundant velvet contigs (CONTAINED in AMOScmp contigs) (Optional; leaving the CONTAINED contigs had very similar results, CONSERR did not have to be reduced)
   * If there are many negative gaps try to further join contigs (fastaMerge.pl $PREFIX.fasta)
   * run AMOScmp on AMOScmp & velvet contigs => join some AMOScmp gaps
  * run minimus to further merge contigs


=== Many indels, few rearrangements ===
== Multiple reference sequences ==


=== Few indels, many rearrangements ===
Example:
  * Solexa data
  * AMOScmp on Solexa using Ref1
  * AMOScmp on Solexa using Ref2


== Multiple reference sequences ==
Solution:
  * remove the redundant AMOScmp Ref2 contigs (CONTAINED in AMOScmp contigs) (optional)
  * run AMOScmp on AMOScmp Ref1 & Ref2 contigs => join some AMOScmp Ref1 gaps
  * run minimus to further merge contigs


----
----
Line 120: Line 508:
   50bp+  991        50      7362    393.73  792.41  390192
   50bp+  991        50      7362    393.73  792.41  390192
   100bp+  429        100    7362    815.36  1060.29  349793
   100bp+  429        100    7362    815.36  1060.29  349793
=== Solexa reads ===
  Type            #reads      min    max    mean
  Solexa          6340136      32      32      32  (~31x coverage)
Assemblies:
  Assembler  type        input-data  #reads        #ctgs  min    max    mean      stdev    ctgs-sum      #singletons     
  AMOScmp    comparative  Solaxa      6340136        187    20      577929  34863.06  91692.34  6519394      698638(11%)
  velvet      denovo      Solaxa      6340136        25161  45      5057    241.83    212.61    6084887
  edena      denovo      Solaxa      6340136        14084  100    5075    210.92    145.68    2970720      4893301(77%)
Merged assemblies(contigs&singletons):
  assemblers    type        input-data                #reads  #ctgs  min    max    mean    stdev  ctgs-sum  comments   
  AMOScmp-merged ?            AMOScmp(contigs)          187    166    20      804024  39272.2  121124  6519189    #merged 187-166=21 negative gaps  out of a total of 32
  minimus(ovl20) denovo      velvet(contigs)          25161  19121  45      5057    311.3    297.27  5952381    #merged 25161-19121=6040 (25%) gaps
  minimus(ovl15) denovo      velvet(contigs)          25161  16343  45      9903    361.32  359.78  5905143    #merged 25161-16343=8818 (35%) gaps
  minimus(ovl40) denovo      edena+velvet(contigs)    39245  23644  45      6688    257.15  232.94  6080063    #very few 40bp overlaps are found
  minimus(ovl20) denovo      edena+velvet(contigs)    39245  18603  45      6688    322.32  311.02  5996244


=== Simulated 32bp exact match reads ===
=== Simulated 32bp exact match reads ===
Line 150: Line 519:
   velvet-sim  denovo      Sim        6538167        2207    45      56810  2820.91  5348.36  6225757      123591(2%)
   velvet-sim  denovo      Sim        6538167        2207    45      56810  2820.91  5348.36  6225757      123591(2%)


=== 454 reads ===
=== Solexa reads ===


   Type            #reads      min    max    mean
   Type            #reads      min    max    mean
   454            77466       35     371     240
   Solexa          6340136      32      32      32  (~31x coverage)
 
Assemblies:
  Assembler  type        input-data  #reads        #ctgs  min    max    mean      stdev    ctgs-sum      #singletons     
  AMOScmp    comparative  Solaxa      6340136       187    20     577929  34863.06  91692.34  6519394      698638(11%)
  velvet      denovo      Solaxa      6340136        25161  45      5057    241.83    212.61    6084887
  velvet(100bp+)          Solexa                    19534  100    5057    288      220      5633400
  edena      denovo      Solaxa      6340136        14084  100     5075    210.92    145.68    2970720      4893301(77%)


Assembly breaks:
  velvet: 31
  edena : 38
 
Merged assemblies(contigs):
  assemblers    type        input-data                #reads  #ctgs  min    max    mean    stdev  ctgs-sum  comments   
  minimus(ovl20) denovo      velvet                    25161  19121  45      5057    311.3    297.27  5952381    #merged 25161-19121=6040 (25%) gaps
  minimus(ovl20) denovo      edena+velvet              39245  18603  45      6688    322.32  311.02  5996244


== Pseudomonas aeruginosa b1 (PAb1)==
== Pseudomonas aeruginosa b1 (PAb1)==
Line 181: Line 565:
   edena          denovo      11180  100    11300  552.36    610.52    6175460      3955865 (46%)  much better than Ps !!!
   edena          denovo      11180  100    11300  552.36    610.52    6175460      3955865 (46%)  much better than Ps !!!
   ssake          denovo      185030  34      5490    77.21    141.23    14287079      3056893  
   ssake          denovo      185030  34      5490    77.21    141.23    14287079      3056893  
  euler-sr      denovo      30394  26      5004    184      138        5601125      3437156


200bp+ contigs:
200bp+ contigs:
Line 195: Line 580:
   Assembler          type  input-data          #reads  #ctgs  min  max    mean    stdev      ctgs-sum      comments
   Assembler          type  input-data          #reads  #ctgs  min  max    mean    stdev      ctgs-sum      comments
   AMOSCmp-PA14-merge  ?      AMOSCmp-PA14(ctgs)  2053    1931    17    170485  3201.7  12981.57  6182486        2053-1931=122 gaps closed
   AMOSCmp-PA14-merge  ?      AMOSCmp-PA14(ctgs)  2053    1931    17    170485  3201.7  12981.57  6182486        2053-1931=122 gaps closed
== Staphylococcus aureus MW2 ==
=== Data ===
Files:
  /fs/szasmg2/Bacteria/dpuiu/Solexa/Staphylococcus_aureus/
Reads:
  3857879 35bp Solexa reads
  22843 contain at least 1 N; 10455 contain more than 1 N; 8244 contain more than 2 N's
Reference:
  NC_003923      2820462 32.83  Staphylococcus aureus subsp. aureus MW2, complete genome.
  AP004832      20654  28.38  Staphylococcus aureus plasmid pMW2 DNA, complete sequence
  total          2841116
Coverage: 47.52X
Other strains:
  NC_002952.2    2902619 32.81  Staphylococcus aureus subsp. aureus MRSA252, complete genome
  NC_002953.3    2799802 32.85  Staphylococcus aureus subsp. aureus MSSA476, complete genome
  NC_002951.2    2809422 32.82  Staphylococcus aureus subsp. aureus COL, complete genome
  NC_009632.1    2906507 32.95  Staphylococcus aureus subsp. aureus JH1, complete genome
  NC_009487.1    2906700 32.95  Staphylococcus aureus subsp. aureus JH9, complete genome
  NC_009782.1    2880168 32.88  Staphylococcus aureus subsp. aureus Mu3, complete genome
  NC_002758.2    2878529 32.88  Staphylococcus aureus subsp. aureus Mu50, complete genome
  NC_003923.1    2820462 32.83  Staphylococcus aureus subsp. aureus MW2, complete genome
  NC_002745.2    2814816 32.84  Staphylococcus aureus subsp. aureus N315, complete genome
  NC_007795.1    2821361 32.87  Staphylococcus aureus subsp. aureus NCTC 8325, complete genome
  NC_009641.1    2878897 32.89  Staphylococcus aureus subsp. aureus str. Newman, complete genome
  NC_007622.1    2742531 32.78  Staphylococcus aureus RF122, complete genome
  NC_007793.1    2872769 32.75  Staphylococcus aureus subsp. aureus USA300, complete genome
  NC_010079.1    2872915 32.76  Staphylococcus aureus subsp. aureus USA300_TCH1516, complete genome
=== Sequence alignments ===
==== soap ====
  3735781 of 3857879 (96.76%) aligned by "soap -v 5 -g 3 -s 12"
Mismatches:
  ~ 10% of the aligned reads had more at least 2 mismatches
  ~  4% of the aligned reads had more at least 3 mismatches
  cat Sa.soap.v5.s12.g3 | awk '{print $10}' | count.pl
  #      total        %
  0      2752024      73.67
  1      633043      16.95
  2      201439      5.39
  3      84258        2.26
  4      41787        1.12
  5      22629        0.61
  201    525          0.01
  101    45
  202    20
  103    5
  203    4
  102    2
  Total  3735781
=== Sequence assemblies ===
==== AMOScmp ====
OVL: 10 bp
Output  stats:
  All:
                #elem  min    max    mean    median  n50    sum
  ctg          17      60      658745  167134  102144  485748  2841283
  ctg(200bp+)  15      2090    658745  189411  102265  485748  2841158
  singl        101422  23      35      35      35      35      3549538
  breaks        0
* Only 2.62% of the reads are singletons
* 101232 out of 101422 AMOS singletons (99.81%) are also velvet singletons
==== velvet ====
===== OVL: 23 bp =====
Output  stats:
                #elem  min    max    mean    median  n50    sum
  ctg          1777    45      22892  1583    252    5337    2813162
  ctg(100bp+)  1146    100    22892  2421    1167    5472    2774471
  ctg(200bp+)  939    200    22892  2924    1764    5512    2745649
  singl        289207  35      35      35      35      35      10122245
  breaks 0
* 7.49% reads are singletons
* 101232 out of 289207 velvet singletons are also AMOS singletons
* 13189 reads are duplicates, connect 1126 contig pairs (1242 unique ctgs in these pairs)
* 1114 out of 1242 contig overlaps were identified by running nucmer (-c 12 -l 12)
  ovlSize count
  22      740
  21      99
  20      55
  19      36
  18      52
  17      33
  16      17
  15      29
  14      21
  13      15
  12      17
  Total  1114
Contig coverage & alignments to the reference:
  $ grep "^##" Sa.contig | perl -ane 'print $F[1]*35/$F[2],"\n";' > Sa.cvg
 
  #ctgs  min        max    mean    median  n50    sum
  1777    1.55        517.25  43      37.41  44.36  76159.86
  $ show-coords -I 94 1con-contigs.delta | grep CONTAIN | awk '{print $19,$13}' | count.pl | sort -n >! 1con-contigs.qry_hits
  $ join 1con-contigs.qry_hits Sa.cvg | awk '{print $3,$4}' | ~/bin/getCorrelationDescriptive.pl
  0.61 # coeff. of correlation > 0
  0.81 # coeff. of correlation for ctgs>100 bp
0cvg regions
  $show-coords Sa.all.delta | ~/bin/getNucmerCoverage.pl -M 0 | getSummary.pl -c 3
        #elem  min    max    mean    median  n50    sum
  gap  305    1      570    34      19      70      10521
===== OVL: 21 bp =====
                #elem  min    max    mean    median  n50    sum
  all          2099    41      18254  1342    258    4384    2817753
  45bp+        1830    45      18254  1534    448    4391    2806596
  100bp+        1337    100    18254  2075    1083    4558    2774667
  200bp+        1115    200    18254  2461    1486    4625    2743892
  singl        224006
OVL: 23 bp seems to generate a better assembly on this data set !!!
==== edena ====
Output  stats:
                #elem  min    max    mean    median  n50    sum
  ctg          1175    100    22893  2350    1094    5514    2760953
  ctg(200bp+)  935    204    22893  2918    1766    5594    2727929
  breaks        0
 
  assembled_reads 2676002 (69.8%)
  singletons      1181877
* In this case velvet & edena have similar contig stats !!!
* All edena contigs align to velvet contigs
* Not all edena contigs are CONTAINED/IDENTICAL to velvet contigs !!!
                #elem  min    max    mean    median  n50    sum
  edena.uniq    360    100    19924  2475    273    7791    891123
  $ show-coords -H velvet-edena.filter-q.delta | egrep 'CONTAINED|BEGIN|END' | awk '{print $19,$13}' | count.pl -m 2 | wc -l  # 193 edena contigs CONTAIN multiple velvet contigs
  $ show-coords -H velvet-edena.filter-q.delta | egrep 'IDEN' | wc -l                                                        # 435 edena contigs  are IDENTICAL to velvet contigs
==== euler-sr ====
          #elem  min    max    mean    median  n50    sum
  all    1349    22      43140  2073    109    10523  2796347
  200bp+  566    201    43140  4859    2334    10662  2750118
=== Contig assemblies ===
==== AMOScmp on velvet contigs ====
Min Ovl: 10bp
Min Cluster(nucmer -c): 40 (many velvet contigs are 45-65 bp)
  All:
        #elem  min    max    mean    median  n50    sum
  ctg  567    45      73756  4890    455    20846  2772573
  singl 5      45      11556  3159    58      11556  15794
  all  572    45      73756  4875    455    20846  2788367
  breaks 2 (1 contig with 2 overlapping alignments; no repeat; why is nucmer breaking the alignment?)
 
  200 bp+:
        #elem  min    max    mean    median  n50    sum
  ctg  339    201    73756  8118    3413    20846  2752170
  singl 2      4086    11556  7821    11556  11556  15642
  all  341    201    73756  8117    3494    20846  2767812
  breaks 2
* Reduced the number of contigs from 1777 to 567 !!!
==== minimus on velvet contigs ====
Min Ovl: 16bp
  All:
        #elem  min    max    mean    median  n50    sum
  ctg  316    45      39490  5812    3243    12483  1836527
  singl 797    45      22892  1204    161    4724    959356
  all  1113    45      39490  2512    299    9399    2795883
  breaks 20
 
  200 bp+:
        #elem  min    max    mean    median  n50    sum
  ctg  258    202    39490  7093    5069    12483  1829967
  singl 368    200    22892  2504    1347    5210    921478
  all  626    200    39490  4395    2313    9508    2751445
  breaks 19
* Reduced the number of contigs from 1777 to 1113 !!!
==== minimus1 on velvet contigs ====
Min Ovl: 16bp
  All:
        #elem  min    max    mean    median  n50    sum
  ctg  277    70      32618  6623    4709    11914  1834597
  singl 1022    45      19128  948    96      5057    968872
  all  1299    45      32618  2158    154    9526    2803469
  breaks 15
 
  200 bp+:
        #elem  min    max    mean    median  n50    sum
  ctg  247    200    32618  7413    6004    11984  1830980
  singl 354    200    19128  2590    1397    5294    916800
  all  601    200    32618  4572    2495    9757    2747780
  breaks 15
* Reduced the number of contigs from 1777 to 1299 !!!
* Generated fewer, larger big contigs than minimus
Fewer contigs get merged if:
  * Min Ovl is dropped from 16bp to 14bp
  * Only the overlaps represented by Duplicate reads are used
==== minimus1 on edena contigs ====
  All:
        #elem  min    max    mean    median  n50    sum
  ctg  67      240    24824  5610    4814    8671    375847
  singl 1022    100    19924  2331    1084    5455    2382674
  all  1089    100    24824  2533    1183    5933    2758521
  breaks 27
 
  200 bp+:
        #elem  min    max    mean    median  n50    sum
  ctg  67      240    24824  5610    4814    8671    375847
  singl 811    204    19924  2902    1766    5588    2353676
  all  878    204    24824  3109    1862    5984    2729523
  breaks 27
Edena contig ends get trimmed. There are fewer overlap among them than velvet contigs!!!
==== AMOScmp on velvet & edena contigs ====
Input
  All:
  desc          #elem  min    max    mean    stdev  sum
  velvet        1777    45      22892  1583    2769    2813162
  edena        1175    100    22893  2350    3152    2760953
  edena.uniq    360    100    19924  2475    3886    891123
 
  200bp+
  desc          #elem  min    max    mean    stdev  sum
  velvet        939    200    22892  2924    3272    2745649
  edena        935    204    22893  2918    3303    2727929
  edena.uniq    202    206    19924  4305    4394    869552
Output:
All:
        #elem  min    max    mean    median  n50    sum
  ctg  461    45      81678  6029    703    21943  2779591
  singl 7      45      11556  4283    2634    11556  29984
  all  468    45      81678  6003    719    21353  2809575
  breaks 0
200 bp+:
        #elem  min    max    mean    median  n50    sum
  ctg  289    205    81678  9565    4401    21943  2764375
  singl 4      2634    11556  7458    11556  11556  29832
  all  293    205    81678  9537    4401    21943  2794207
  breaks 0
==== minimus2 on velvet & edena contigs ====
Contig stats:
  All:
        #elem  min    max    mean    median  n50    sum
  ctg  740    67      38980  3793    1290    10553  2807159
  singl 366    45      6795    265    63      3684    96822
  all  1106    45      38980  2626    263    10104  2903981
  breaks 45
 
  200 bp+:
        #elem  min    max    mean    median  n50    sum
  ctg  560    202    38980  4969    2583    10561  2782466
  singl 31      203    6795    2424    1379    5455    75150
  all  591    202    38980  4835    2574    10197  2857616
  breaks 44
== Staphylococcus aureus MW2 "mutated" 10% of sequence are random SNPs ==
=== Sequence assemblies ===
==== AMOScmp ====
OVL: 10 bp
...
                #elem  min    max    mean    median  n50    sum
  all          22431  20      705    107    87      124    2409008
  200bp+        1997    200    705    267    246    260    533882
  singl        1934772
==== velvet ====
Same as the one above
=== Contig assemblies ===
==== AMOScmp on velvet contigs ====
Min Ovl: 10bp
Min Match  (nucmer -l): 20
Min Cluster (nucmer -c): 40
All:
        #elem  min    max    mean    median  n50    sum
  ctg  496    45      47387  5504    2076    14370  2729975
  singl 629    45      11556  112    60      131    70262
  all  1125    45      47387  2489    96      13785  2800237
  breaks 6
200 bp+:
        #elem  min    max    mean    median  n50    sum
  ctg  405    204    47387  6715    3277    14370  2719753
  singl 28      200    11556  1042    288    11556  29165
  all  433    200    47387  6349    2878    14036  2748918
  breaks 4
Min Ovl: 10bp
Min Match  (nucmer -l): 16
Min Cluster (nucmer -c): 32
  All:
        #elem  min    max    mean    median  n50    sum
  ctg  544    45      53996  5044    1215    16039  2744166
  singl 385    45      11556  133    53      5057    51297
  all  929    45      53996  3009    112    15857  2795463
  breaks 8
 
  200 bp+:
        #elem  min    max    mean    median  n50    sum
  ctg  391    201    53996  6980    3156    16443  2729048
  singl 8      222    11556  3543    4086    11556  28343
  all  399    201    53996  6911    3156    16039  2757391
  breaks 6
== Strep suis17 ==
=== Data ===
* 2726374  36bp Solexa reads
* 2007491bp 41.30 %GC Reference
* Coverage: 49X
* Location:
  /fs/szdata/Solexa/Streptococcus_suis/
  /fs/szasmg2/Bacteria/dpuiu/Solexa/Streptococcus_suis/
=== Sequence alignments ===
==== soap ====
  2261784 of 2726374  (82.95%) aligned by "soap -v 5 -g 0 -s 10"
Mismatches:
  more Ss.soap | awk '{print $10}' | count.pl
  mism.  hits    hits(%)
  0      1235237  54.61
  1      548822  24.27
  2      238795  10.56
  3      120622  5.33
  4      71451    3.16
  5      46857    2.07
  Total  2261784
=== Sequence assembly ===
==== Velvet ====
Contig stats:
          #elem  min    max    mean    median  n50    sum
  all      917    45      24419  2154    278    6661    1975147
  200bp+  479    202    24419  4051    2940    6966    1940363
  breaks 2  (1 contig aligned in 2 overlapping pieces; tandem repeat in the 217341-217404 region of Ss)
Different coverage assemblies:
    cvg    #ctgs  min    max    mean    median  n50    sum    %singletons    breaks
    24.5    4630    45      5267    436    284    709    2016871 23.98          6
    29.4    2797    45      7739    716    426    1333    2001964 23.46          0
    34.3    1797    45      9347    1107    649    2230    1989102 23.34          0
    39.2    1325    45      24419  1496    663    3724    1982292 23.32          2
    44.1    1016    45      23272  1946    516    5290    1976890 23.30          2
    49.0    917    45      24419  2154    278    6661    1975147 23.30          2
==== Edena ====
Contig stats:
                #elem  min    max    mean    median  n50    sum
  all          1668    100    8829    1135    743    1968    1893815
  200bp+        1390    200    8829    1333    959    1985    1853354
  breaks 2  (1 contig aligned in 2 overlapping pieces; tandem repeat in the 217341-217404 region of Ss)
Singletons: 1582493 (58%)
==== Euler-SR ====
Command:
  $ AssembleIllumina.pl Ss.seq working_dir -vertexSize 21
Contig stats:
          #elem  min    max    mean    median  n50    sum
  all    2658    22      19768  757    42      4867    2013259
  200bp+  659    200    19768  2932    1973    5117    1932406
==  Escherichia coli str. K-12 substr. MG1655 (Solexa-paired) ==
=== Data ===
* 7,047,668 + 10,408,224  paired-end 36bp Solexa reads => 105X + 161X cvg
* 1M => 15.51X cvg
 
* NC_000913.2  4,639,675 50.79%GC reference
* Location:
  /fs/szasmg2/Bacteria/dpuiu/Solexa-pair/Ecoli
* Formatting : split seq into 2
  cat Ecoli.seq | ~/bin/solexap2amos.pl -len 36          > ! Ecoli.paired.afg
  cat Ecoli.seq | ~/bin/solexap2amos.pl -len 36 -noamos  > ! Ecoli.paired.seq
=== Sequence assembly (1M reads: 15X cvg) ===
* Velvet : hash length of 23 works best; also use -exp_cov 15
* AMOScmp: OVL=10; the smaller the better
* Ctg stats:
  .                            elem      min    q1    q2    q3    max        mean      n50        sum           
  AMOScmp-shortReads          331        40    1941  7826  18316  172,800    14022      26474      4641532  better !!!     
  AMOScmp-shortReads-trimmed  372        40    1916  7226  17348  98,155    12477      24730      4641600       
  velvet.ctg                  12073      45    121    250    461    4690      366        568        4430625       
  velvet.scf                  715        45    58    70    77    313779    6671      104454    4769943     
=== Sequence assembly (7+M reads: 100+X cvg) ===
* Velvet : hash length of 27 works best; also use -exp_cov 100
  .                            elem      min    q1    q2    q3    max        mean      n50        sum           
  velvet.ctg                  375        53    63    82    6105  192151    12162      59972      4561051       
  velvet.scf                  272        53    60    71    133    860695    16785      171744    4565616
==  Escherichia coli str. K-12 substr. MG1655 (454-paired) ==
=== Data ===
* Formatting : remove 44 bp linker
  nucmer -c 10 -l 10 L454.seq Ecoli.merged.seq -p  Ecoli --forward
  cat  Ecoli.delta | perl ~/bin/filterDeltaAnnotated.pl | ~/bin/delta2alignRange.pl -qry  >!  Ecoli.pair.seqs
  difference.pl  Ecoli.seqs Ecoli.pair.seqs >! Ecoli.single.seqs
 
  cat Ecoli.pair.seqs | p '$F[1]--; print "$F[0] $F[0].f 1 $F[1]\n"'    >! Ecoli.pair.f.pos
  cat Ecoli.pair.seqs | p '$F[2]++; print "$F[0] $F[0].r $F[2] $F[3]\n"' >! Ecoli.pair.r.pos

Latest revision as of 01:21, 2 September 2010

Papers

Assemblers

Denovo

CA

See CA

Abyss

Atlas

Euler-SR

  • Cmd:
 setenv EUSRC  /fs/szdevel/dpuiu/euler-sr/
 setenv MACHTYPE x86_64
 set path=($path ${EUSRC}/assembly)
 set path=($path ${EUSRC}/assembly/${MACHTYPE})
 /fs/szdevel/dpuiu/euler-sr/assembly/Assemble.pl readsFile vertexSize ruleFile


  • 35bp Solexa reads: vertexSize=25 vs vertexSize=21
 vertexSize=21 
   more agreagte assembly; 
   some tandem repeats in 2K+ contigs were collapsed (detected by aligning to Strep aureus finished genome)
   these repeats were present in small ctgs (~ 200bp) when euler was run with vertexSize=25
   these repeats were not present in the velvet assembly

newbler (454)

  • Command:
 /fs/sz-user-supported/Linux-x86_64/packages/rig/bin/runAssembly -o dir prefix.sff

minimus

hash-overlap issues:

  • very slow on large sequences (runs Smith-Waterman)
  • finds all overlaps (not only CONTAINED|BEGIN|END); of no use to the assembler
  • default ovl: 40 bp (too large)
  • min ovl: 15 (set by minimizer)
  • 20 bp overlap ok;
  • the min ovl imposed by the next step (tigger) is 5bp

minimus1

  • hash-overlap was replaced with a nucmer based overlapper
  • nucmer takes long time and lots of space to find very short overlaps (< 16bp?)
  • We can have more control on the overlaps that are used (can easily filter the nucmer output)
  • can discard CONTAINED/IDENTITY alignments in order to avoid collapsing repeats
  • can mask the middle of the sequences to make the alignment process faster; (Issue: CONTAINED overlaps appear as BEGIN/END overlaps)
  • Example:
  ....
  OVERLAP=16
  MINID=94
  MAXTRIM=20
  ...
  20: $(NUCMER) -maxmatch -l $(OVERLAP) -c $(OVERLAP) $(SEQS) $(SEQS) -p $(PREFIX)
  24: $(SHOWCOORDS) -H -c -l -o -r -I $(MINID) $(ALIGN) | egrep 'BEGIN|END' > $(COORDS)
  25: $(SCRIPTDIR)/nucmer2ovl.pl -ignore $(MAXTRIM) -tab $(COORDS)> $(OVLTAB)
  26: $(SCRIPTDIR)/ovl2OVL.pl $(OVLTAB)  > $(OVL)
  28: $(BINDIR)/bank-transact -z -b $(BANK) -m $(OVL)

minimus2

  • assembles two sequence sets (S1,S2)
  • Instead of merging the sets and aligning all versus all (S1+S2 : S1+S2) it aligns one versus the other:
 S1:S2
 S1:S1 (optional)
 S2:S2 (optional)
  • Each alignment can be run with different parameters

Example 1: merge two de-novo assemblies

 ...
 TYPE1=N
 TYPE2=N
 ...
 OVERLAP11       = 16
 OVERLAP22       = 16
 OVERLAP12       = 40
 ...
 # align sets
 $(NUCMER) -maxmatch -l $(OVERLAP11) -c $(OVERLAP11) $(SEQS1) $(SEQS1) -p $(PREFIX11)
 $(NUCMER) -maxmatch -c $(OVERLAP12) $(SEQS1) $(SEQS2) -p $(PREFIX12)
 $(NUCMER) -maxmatch -l $(OVERLAP22) -c $(OVERLAP22) $(SEQS2) $(SEQS2) -p $(PREFIX22)
 ...
 # merge alignments
 $(SHOWCOORDS) $(ALIGN11) | egrep 'BEGIN|END'  > $(COORDS)   
 $(SHOWCOORDS) $(ALIGN12) >> $(COORDS)
 $(SHOWCOORDS) $(ALIGN22) | egrep 'BEGIN|END'  >> $(COORDS)
 ...
 nucmer2ovl.pl  $(COORDS) > $(OVLTAB)
 ...

Example 2: merge a comparative and a de-novo assembly

 ...
 TYPE1           = C
 TYPE2           = N
 ...
 # look only for short overlaps among adjacent sequences from S1
 $(SCRIPTDIR)/fasta2ovl.pl -tab $(SEQS1) > $(OVLTAB)  
 ...
 # look for all overlaps  among one sequence from S1 and one from S2
 $(NUCMER) -maxmatch -c $(OVERLAP12) $(SEQS1) $(SEQS2) -p $(PREFIX12) 
 $(SHOWCOORDS) $(ALIGN12) | $(SCRIPTDIR)/nucmer2ovl.pl >> $(OVLTAB) 
 ...
 # look for end overlaps among sequences from S2
 $(NUCMER) -maxmatch -l $(OVERLAP22) -c $(OVERLAP22) $(SEQS2) $(SEQS2) -p $(PREFIX22)  
 $(SHOWCOORDS) $(ALIGN22) | egrep 'BEGIN|END' > $(COORDS22) | $(SCRIPTDIR)/nucmer2ovl.pl  >> $(OVLTAB)
 ...

SSAKE

  • Web Site
  • Short Sequence Assembly by K-mer search and 3' read Extension
  • Rene L Warren, Granger G Sutton, Steven JM Jones, Robert A Holt. "Assembling millions of short DNA sequences using SSAKE" Bioinformatics. Vol. 23 no. 4 2007, pages 500–501
  • Uses a greedy assembly algorithm: progressively searches for perfect 3’-most k-mers using a DNA prefix tree to identify overlaps between any two sequences; stringently clusters short reads into contigs
  • Copyright (c) 2006-2008 Canada's Michael Smith Genome Science Centre.
  • Current release: SSAKE 3.2.1 (08/21/2008)
  • Distribution: PERL script; runs on Linux systems
  • Parameters:
 -m  Minimum number of reads needed to call a base during overhang consensus build up (default -m 16)
 -o  Minimum number of reads needed to call a base during an extension (default -o 2)
 -r  Minimum base ratio used to accept a overhang consensus base (default -r 0.7)
 -t  Max number of bases to trim (default 0)
  • "Especially" targeted towards Solexa seq; 454 reads too long & homopol problem
  • Can handle mated reads
  • TQS.py should be used for qual trimming

Example:

 $ grep -c "^>" Pa.seq           # total number of reads
 $ (time ssake_3.0.pl -f Pa.seq) > ssake_3.0.log             # version 3.0
 $ (time ssake_3.2.pl -f Pa.seq -t 1) > ssake_3.2.log        # version 3.2
 $ ln -s Pa*contigs  Pa.fasta
 $ ln -s Pa*singlets Pa.singlets.seq
 $ cat Pa.fasta        | grep "^>" | perl -ane '/read(\d+)/; print $1, "\n";' | getSummaryDescriptive.pl -t assembled_reads    # number of assembled reads
 $ cat Pa.singlets.seq | grep "^>" | perl -ane '/read(\d+)/; print $1, "\n";' | getSummaryDescriptive.pl -t singletons         # number of singletons
 $ grep -v N Pa.seq              # number of reads which contain ambiguities and will be discarder from the beginning (should be added to the singletons)

VCAKE

Velvet

  • Sequence assembler for very short reads
  • Velvet: Algorithms for De Novo Short Read Assembly Using De Bruijn Graphs Genome Res. Mar 2008
  • Project Web Site at EBI Current version: 0.6 02/06/2008: Velvet 0.6
  • Uses Bruijn graphs to assemble sequences: first uses an error correction algorithm to merge sequences which belong together, then a repeat solver separates paths sharing local overlaps
  • Distribution: C program, to be compiled with gcc
  • Open Source, GPL agreement, EBI.
  • Current release: 0.7.55
  • Parameters:
 - hash_length: odd integer (if even, it will be decremented) <= 31 (if above, will be reduced)
* minimum overlap length (30-40X coverage data set):
    17bp usually gives the fewest contigs (some case)
    23bp is also ok (used for Pa,Ps)
    15bp is too low => too many short contigs
*  -long option for including "long sequences" (Sanger, 454 or even reference sequences) very slow
  • At ~30X, max ctg len is ~ 5KB
  • At >45X, max ctg len is ~ 20KB
  • Longest contig on simulated data is 56Kbp (Ps, no error 32bp reads, 32X cvg)
  • Max ovl=32
  • For longer reads (66bp), some of the repeats ( ~100bp) get assembled partially; -cov_cutoff parameter must be used, otherwise low cvg/quality ctgs are reported
  • velvet_assy.afg lists the assembled reads (TLE messages)
  • some reads extend beyond the contig ends
  • there are some reads shared by multiple contigs (status:D) ; at least one of the instances occurs at a contig end (goes beyond the contig end)
  • many contigs overlap by a few bp; max val= min_ovl-1
  • minimum_ctg length =2*minimum_ovl-1 (23=>45, 21=>41 ...)

Example (unmated reads):

 $ velveth . -[short|shortPaired|long|longPaired] ovl prefix.seq        # ovl should be an odd integer like 21, 23 
 $ velvetg . -read_trkg yes

Example (mated reads):

 $ velveth . 23 -shortPaired prefix.seq
 $ velvetg . -ins_length 500 -exp_cov 35
 $ ~/bin/velvet.sh prefix 23 500 35                       # combines both; prefix.seq should alternate fwd & rev reads 

Example (processing results):

 $ bank-transact -c -z -b prefix.bnk -m velvet_assy.afg
 $ bank2contig prefix.bnk  > prefix.contig
 $ bank2fasta -b prefix.bnk > prefix.fasta
 
 $ listReadPlacedStatus -S -E prefix.bnk > prefix.singletons
 $ extractfromfastanames.pl prefix.seq prefix.singletons > prefix.singletons.seq
 
 $ listReadPlacedStatus prefix.bnk > prefix.status
 $ grep D prefix.status  | awk '{print $5,$6}' | sort -u > prefix.pairs

Edena

  • Edena: Exact DE Novo Assembler
 --overlapCutoff [int]    Only consider overlaps >= than the specified size. (default 22)
 --minContigSize [int]    Minimum size of the contigs to output. (default 100)
  • contigs don't overlap (ends seem to be trimmed)
  • a significant number of reads don't get assembled; max % assembled : 69% for Staphylococcus_aureus
  • reads should be 1/line (+1 line for the header)
  • for 66bp reads, the longest ctgs were obtained for --overlapCutoff 35

Example:

 $ edena -r prefix.seq -p ovl
 $ edena -e prefix.ovl -p prefix
 $ ln -s prefix_contigs.fasta prefix.fasta
 $ cat *_info.txt                                 # assembly stats
 $ grep assembled prefix_info.txt
 $ cat prefix_contigs.fasta | grep "^>" | perl -ane '/(\d+)$/; print $1, "\n";' | getSummaryDescriptive.pl -t reads_assembled


SOAPdenovo

 [LIB]        # single reads
 f=in.fa
 [LIB]        # paired reads
 avg_ins=180
 f1=fwd.seq
 f2=rev.seq
 [LIB]        # paired reads
 avg_ins=180
 p=in.seq
  • Example run:
 $ cat SOAPdenovo.sh
 /nfshomes/dpuiu/szdevel/SOAPdenovo_Release1.03/SOAPdenovo all -s SOAPdenovo.config -o out
  • Example turkey: paires
            elem       min    q1     q2     q3     max        mean       n50        sum
 reads      88357456   76     76     76     76     76         76         76         6715166656  # 44178728 mates; 2.39X cvg 
 ctg        4664917    24     25     36     156    13863      198        761        926637298
 scf        982536     100    258    517    1080   16376      839        1411       825003003
 85505015  out.readOnContig
  • Location (turkey)
 /scratch1/dpuiu/SOAPdenovo/turkey/

Mira

Others

Comparative

AMOScmp

  • Web site
  • one should increase the make-consensus error rate if ref and qry are not that similar
  • all reads from layout are included in consensus

casm-layout issues:

  • if one sequence end does not align, the read is not assembled; the adjacent read is not assembled either if based on alignment to the ref. the two "theoretically" overlap;
  • for a circular bacterial chromosome, the BEGIN/END reads don't get assembled => cannot get a circular contig !!!
  • all CONTAINED reads should be assembled but the ones that are adjacent to dirty reads are left as singletons as well !!!
  • option -r places repeats randomly; what if we want more control over this?
  • There are some large negative gaps in scaffold file (Solexa read assemblies); Some contigs are "CONTAINED" in the contig just before the large negative gap; all these contig id's are larger than the .scaff last contig ; Contig alignments to the ref have breaks
  • casm-layout -g 10000 (defaul max gap) seems to produce the problem above; this value should be reduced (Ex 100)

show-coords issues:

  • sometimes long CONTAINED annotated contigs contain dirty ends

show-snps issues:

  • ambiguities are not shown
  • maximum init_size = 10000 query seqs
  • does not work on long seqs

Example:

 $ sort -nk4 AMOS.scaff | head
 322 BE 87996 -24519
 210 BE 7460 -7324
 1271 BE 60490 -4112
 17 BE 1022 -944
 ...
 $more AMOS.scaff
 ...
 322 BE 87996 -24519     # 322 contains contigs 1770, 1771 ...
 1770 BE 39 1100
 1771 BE 130 738
 ...
 1780 BE 72 25004
 $ tail -1 AMOS.scaff
 1767 BE 39264 0

AMOScmp-shortReads-alignmentTrimmed

Defaults:

  MINCLUSTER  = 16
  MINMATCH    = 16
  MINLEN      = 20  # delta-filter -l 20
  MINOVL      = 10  # previously 5 ; mis-assemblies occurred
  MAXTRIM     = 10
  MAXGAP      = 100 # default 10,000 (casm-layout -g) creates some incorrect contigs
  MAJORITY    = 70  # previously 50 
  CONSERR     = 0.06
  ALIGNWIGGLE = 2   # different than the default AMOScmp 15
 * Reads are trimmed based on nucmer alignment to a reference genome; reads that are adjacent to 0 cvg regions are not trimmed
 * If reads are not trimmed CONSERR should be increased, otherwise make-consensus fails
 * If reads are not trimmed, the avg contig length is significantly shorter than if reads are trimmed based on alignments => 
   !!! read trimming is important.
 * assembly is very fragmented if the percent identity between reads and the reference is less than 98%
 * simulations run on a PA-b1 10K read subset using PA14 as reference; random snps were introduced into PA14 to decrease the %identity; (MINCLUSTER=12, MINMATCH=16, MINLEN=20)
   99.4% : 1 contig  (original data, no snps added to PA14)
   98%   : 2 contigs
   96%   : 10 contigs
   93%   : 50 contigs
   93%   : 37 contigs (MINMATCH=14)

AMOScmp-shortReads

Defaults

  MINCLUSTER  = 20
  MINMATCH    = 20
  MINOVL      = 5
  MAXTRIM     = 10
  MAJORITY    = 50
  CONSERR     = 0.06
  ALIGNWIGGLE = 2
 * No read trimming is done

AMOScmp2

  • Runs AMOScmp of 2 data sets at once (with different parameters)

Example: merge a comparative(AMOScmp) and a de-novo(velvet) assembly, using the same reference was used to get the comparative one

 ...
 TYPE1           = C  # Assembly 1 is a comparative one, already know the position of contigs; no need to align them to ref.
 TYPE2           = N  # Assembly 2 is a de-novo one, don't know the position of contigs; we need to align them to ref.
 ...
 $(SCRIPTDIR)/scaff2delta.pl $(SCAFF1) > $(ALIGN1)
 $(NUCMER) -maxmatch -c $(OVERLAP2) $(REF) $(SEQS2) -p=$(PREFIX2) 
 ...
 cp $(ALIGN1)$(ALIGN)
 cat $(ALIGN2) | grep -v $(PREFIX) | grep -v NUCMER  >> $(ALIGN)
 ....

maq

  • Version: 0.6.6
  • Web site
  • At high %identity (close to 100%) maq outperforms AMOScmp-shortReads-alignmentTrimmed
  • At lower % identity (<97%) AMOScmp-shortReads-alignmentTrimmed outperforms maq

Other Scripts

delta-filter.pl

  • Similar to MUMMer delta-filter
  • The difference is that for multi copy repeats, it tries to place each one uniquely on the reference
  • "delta-filter -q" collapses all in the same reference region
  • "delta-filter -r" aligns the same one to multiple reference regions

fasta2ovl.pl

  • Program that finds short normal direction overlaps between two adjacent sequences in a multi-FASTA file.
  • It tests if the END of seqence N overlaps the BEGINING of sequence N+1.

megamerge (EMBOSS)

 * Can merge well short overlapping contigs
 * Does not work well when the overlap identity is low
 * Min ovl length=2

nucmer2ovl.pl

  • Program that converts nucmer annotated alignments to an overlap file (either AMOS or TAB format)

nucmer.pl

  • It is a nucmer wrapper which runs nucmer with decreasingly alignment size on the sequences
  • Iteration n only tries to align sequences not aligned at iterations n-1,n-2,...1
  • Shorter run time, less disk space (than running nucmer with a small alignment/cluster size)
  • Default parameters:
 -len 20,16,12         : nucmer minmatch (-l)
 -cluster 20,16,14     : nucmer mincluster (-c)
 -filter 20,20,20      : delta-filter minlen (-l)
  • Could be used by AMOScmp-shortReads, AMOScmp-shortReads-alignmentTrimmed

nucmerAnnotate.pl

  • Program that parses a nucmer coords file and re-annotates sequence alignments
  • The maximum length of the end sequence to be ignored can be set as a parameter

revDelta.pl

  • Parses a delta file and switches the ref & query

revseq.pl

  • Revese complements a seq file
  • adds ".rev" suffix to seq id's

scaff2delta.pl

  • Program that converts a scaff file to a delta file
  • Example: No need to run nucmer on the AMOScmp output contigs to get them aligned to reference since we already know their positions in the scaffold

Subassembly

Reconciliator

Cases

No reference sequence

One data set, multiple denovo assemblers

Example:

 * Solexa data
 * edena & velvet assemblers

Solution:

 * run minimus on the contigs

Multipls data sets, one(multiple) denovo assemblers

Example:

* Solexa & 454 data
* velvet assembly on Solexa
* newbler assembly on 454
 

Solution:

 * run minimus on the contigs

One reference sequence

Example:

 * Solexa data
 * AMOScmp on Solexa
 * velvet on Solexa  

Solution:

 * remove the redundant velvet contigs (CONTAINED in AMOScmp contigs) (Optional; leaving the CONTAINED contigs had very similar results, CONSERR did not have to be reduced)
 * run AMOScmp on AMOScmp & velvet contigs => join some AMOScmp gaps
 * run minimus to further merge contigs

Multiple reference sequences

Example:

 * Solexa data
 * AMOScmp on Solexa using Ref1
 * AMOScmp on Solexa using Ref2

Solution:

 * remove the redundant AMOScmp Ref2 contigs (CONTAINED in AMOScmp contigs) (optional)
 * run AMOScmp on AMOScmp Ref1 & Ref2 contigs => join some AMOScmp Ref1 gaps
 * run minimus to further merge contigs

Examples

Pseudomonas_syringae

Reference:

 Name           Length  %GC
 NC_004578.1    6397126 58.40
 NC_004633.1    73661   55.15
 NC_004632.1    67473   56.17

Repeats:

 desc    #repeats   min     max     mean    stdev    sum
 50bp+   991        50      7362    393.73  792.41   390192
 100bp+  429        100     7362    815.36  1060.29  349793

Simulated 32bp exact match reads

 Type            #reads       min     max     mean
 Sim(ulated)     6538167      32      32      32   ( 32x coverage)

Single assemblies:

 Assembler   type         input-data  #reads         #ctgs   min     max     mean      stdev     ctgs-sum      #singletons       
 edena-sim   denovo       Sim         6538167        2068    100     47881   2994.03   4857.76   6191673       198699(3%)
 velvet-sim  denovo       Sim         6538167        2207    45      56810   2820.91   5348.36   6225757       123591(2%)

Solexa reads

 Type            #reads       min     max     mean
 Solexa          6340136      32      32      32   (~31x coverage)

Assemblies:

 Assembler   type         input-data  #reads         #ctgs   min     max     mean      stdev     ctgs-sum      #singletons       
 AMOScmp     comparative  Solaxa      6340136        187     20      577929  34863.06  91692.34  6519394       698638(11%)
 velvet      denovo       Solaxa      6340136        25161   45      5057    241.83    212.61    6084887 
 velvet(100bp+)           Solexa                     19534   100     5057    288       220       5633400
 edena       denovo       Solaxa      6340136        14084   100     5075    210.92    145.68    2970720       4893301(77%)

Assembly breaks:

 velvet: 31
 edena : 38
 

Merged assemblies(contigs):

 assemblers     type         input-data                #reads  #ctgs   min     max     mean     stdev   ctgs-sum   comments    
 minimus(ovl20) denovo       velvet                    25161   19121   45      5057    311.3    297.27  5952381    #merged 25161-19121=6040 (25%) gaps
 minimus(ovl20) denovo       edena+velvet              39245   18603   45      6688    322.32   311.02  5996244

Pseudomonas aeruginosa b1 (PAb1)

References:

 Name           Length  %GC
 PA14           6537648 66.29
 PACS2          6492423 66.33
 PAO1           6264404 66.56
 ...

Solexa reads

 Type            #reads       min     max     mean
 Solexa          8627900      33      33      33    (~43X coverage)

Assemblies:

All contigs:

 Assembler      type         #ctgs   min     max     mean      stdev      ctgs-sum      #singletons    comments
 AMOSCmp-PA14   comparative  2053    17      170485  3011.84   11917.53   6183320       1127399
 AMOSCmp-PAO1   comparative  2797    17      75626   2161.19   5812.2     6044851       1592525
 AMOScmp-PA2192 comparative  5816    17      133129  1072.8    3725.22    6239454       1601299        largest assembly
 maq-PA14       comparative  991     33      155551  6199.79   17445.05   6143996       1197385
 velvet         denovo       10684   45      16239   640.34    825.24     6841458       1241079        much better than Ps !!!
 edena          denovo       11180   100     11300   552.36    610.52     6175460       3955865 (46%)  much better than Ps !!!
 ssake          denovo       185030  34      5490    77.21     141.23     14287079      3056893 
 euler-sr       denovo       30394   26      5004    184       138        5601125       3437156

200bp+ contigs:

 Assembler      type         #ctgs   min     max     mean      stdev      ctgs-sum
 AMOSCmp-PA14   comparative  428     203     170485  14262.09  22852.74   6104175
 AMOSCmp-PAO1   comparative  865     200     75626   6893.96   8766.63    5963278
 AMOSCmp-PA2192 comparative  1299    200     133129  4683.46   6735.52    6083817
 maq-PA14       comparative  368     200     155551  16581.7   25475.92   6102067
 velvet         denovo       7382    200     16239   877.05    896.35     6474426
 edena          denovo       8316    200     11300   692.54    651.24     5759209
 ssake          denovo       12532   200     5490    486       329.93     6090567

Merged assemblies(contigs&singletons):

 Assembler           type   input-data          #reads   #ctgs   min   max     mean    stdev      ctgs-sum       comments
 AMOSCmp-PA14-merge  ?      AMOSCmp-PA14(ctgs)  2053     1931    17    170485  3201.7  12981.57   6182486        2053-1931=122 gaps closed

Staphylococcus aureus MW2

Data

Files:

 /fs/szasmg2/Bacteria/dpuiu/Solexa/Staphylococcus_aureus/

Reads:

 3857879 35bp Solexa reads
 22843 contain at least 1 N; 10455 contain more than 1 N; 8244 contain more than 2 N's

Reference:

 NC_003923      2820462 32.83  Staphylococcus aureus subsp. aureus MW2, complete genome.
 AP004832       20654  28.38  Staphylococcus aureus plasmid pMW2 DNA, complete sequence
 total          2841116

Coverage: 47.52X

Other strains:

 NC_002952.2    2902619 32.81  Staphylococcus aureus subsp. aureus MRSA252, complete genome
 NC_002953.3    2799802 32.85  Staphylococcus aureus subsp. aureus MSSA476, complete genome
 NC_002951.2    2809422 32.82  Staphylococcus aureus subsp. aureus COL, complete genome
 NC_009632.1    2906507 32.95  Staphylococcus aureus subsp. aureus JH1, complete genome
 NC_009487.1    2906700 32.95  Staphylococcus aureus subsp. aureus JH9, complete genome
 NC_009782.1    2880168 32.88  Staphylococcus aureus subsp. aureus Mu3, complete genome
 NC_002758.2    2878529 32.88  Staphylococcus aureus subsp. aureus Mu50, complete genome
 NC_003923.1    2820462 32.83  Staphylococcus aureus subsp. aureus MW2, complete genome
 NC_002745.2    2814816 32.84  Staphylococcus aureus subsp. aureus N315, complete genome
 NC_007795.1    2821361 32.87  Staphylococcus aureus subsp. aureus NCTC 8325, complete genome
 NC_009641.1    2878897 32.89  Staphylococcus aureus subsp. aureus str. Newman, complete genome
 NC_007622.1    2742531 32.78  Staphylococcus aureus RF122, complete genome
 NC_007793.1    2872769 32.75  Staphylococcus aureus subsp. aureus USA300, complete genome
 NC_010079.1    2872915 32.76  Staphylococcus aureus subsp. aureus USA300_TCH1516, complete genome

Sequence alignments

soap

 3735781 of 3857879 (96.76%) aligned by "soap -v 5 -g 3 -s 12"

Mismatches:

 ~ 10% of the aligned reads had more at least 2 mismatches 
 ~  4% of the aligned reads had more at least 3 mismatches 
 cat Sa.soap.v5.s12.g3 | awk '{print $10}' | count.pl
 #       total        %
 0       2752024      73.67
 1       633043       16.95
 2       201439       5.39
 3       84258        2.26
 4       41787        1.12
 5       22629        0.61
 201     525          0.01
 101     45
 202     20
 103     5
 203     4
 102     2
 Total   3735781

Sequence assemblies

AMOScmp

OVL: 10 bp

Output stats:

 All:
               #elem   min     max     mean    median  n50     sum
 ctg           17      60      658745  167134  102144  485748  2841283
 ctg(200bp+)   15      2090    658745  189411  102265  485748  2841158
 singl         101422  23      35      35      35      35      3549538
 breaks        0
  • Only 2.62% of the reads are singletons
  • 101232 out of 101422 AMOS singletons (99.81%) are also velvet singletons

velvet

OVL: 23 bp

Output stats:

               #elem   min     max     mean    median  n50     sum
 ctg           1777    45      22892   1583    252     5337    2813162
 ctg(100bp+)   1146    100     22892   2421    1167    5472    2774471
 ctg(200bp+)   939     200     22892   2924    1764    5512    2745649
 singl         289207  35      35      35      35      35      10122245
 breaks 0
  • 7.49% reads are singletons
  • 101232 out of 289207 velvet singletons are also AMOS singletons
  • 13189 reads are duplicates, connect 1126 contig pairs (1242 unique ctgs in these pairs)
  • 1114 out of 1242 contig overlaps were identified by running nucmer (-c 12 -l 12)
 ovlSize count
 22      740
 21      99
 20      55
 19      36
 18      52
 17      33
 16      17
 15      29
 14      21
 13      15
 12      17
 Total   1114

Contig coverage & alignments to the reference:

 $ grep "^##" Sa.contig | perl -ane 'print $F[1]*35/$F[2],"\n";' > Sa.cvg
 
 #ctgs   min         max     mean    median  n50     sum
 1777    1.55        517.25  43      37.41   44.36   76159.86
 $ show-coords -I 94 1con-contigs.delta | grep CONTAIN | awk '{print $19,$13}' | count.pl | sort -n >! 1con-contigs.qry_hits
 $ join 1con-contigs.qry_hits Sa.cvg | awk '{print $3,$4}' | ~/bin/getCorrelationDescriptive.pl
 0.61 # coeff. of correlation > 0
 0.81 # coeff. of correlation for ctgs>100 bp

0cvg regions

 $show-coords Sa.all.delta | ~/bin/getNucmerCoverage.pl -M 0 | getSummary.pl -c 3
       #elem   min     max     mean    median  n50     sum
 gap   305     1       570     34      19      70      10521


OVL: 21 bp
               #elem   min     max     mean    median  n50     sum
 all           2099    41      18254   1342    258     4384    2817753
 45bp+         1830    45      18254   1534    448     4391    2806596
 100bp+        1337    100     18254   2075    1083    4558    2774667
 200bp+        1115    200     18254   2461    1486    4625    2743892
 singl         224006

OVL: 23 bp seems to generate a better assembly on this data set !!!

edena

Output stats:

               #elem   min     max     mean    median  n50     sum
 ctg           1175    100     22893   2350    1094    5514    2760953
 ctg(200bp+)   935     204     22893   2918    1766    5594    2727929
 breaks        0
 
 assembled_reads 2676002 (69.8%)
 singletons      1181877
  • In this case velvet & edena have similar contig stats !!!
  • All edena contigs align to velvet contigs
  • Not all edena contigs are CONTAINED/IDENTICAL to velvet contigs !!!
               #elem   min     max     mean    median  n50     sum
 edena.uniq    360     100     19924   2475    273     7791    891123
 $ show-coords -H velvet-edena.filter-q.delta | egrep 'CONTAINED|BEGIN|END' | awk '{print $19,$13}' | count.pl -m 2 | wc -l  # 193 edena contigs CONTAIN multiple velvet contigs
 $ show-coords -H velvet-edena.filter-q.delta | egrep 'IDEN' | wc -l                                                         # 435 edena contigs  are IDENTICAL to velvet contigs

euler-sr

         #elem   min     max     mean    median  n50     sum
 all     1349    22      43140   2073    109     10523   2796347
 200bp+  566     201     43140   4859    2334    10662   2750118

Contig assemblies

AMOScmp on velvet contigs

Min Ovl: 10bp Min Cluster(nucmer -c): 40 (many velvet contigs are 45-65 bp)

 All:
       #elem   min     max     mean    median  n50     sum
 ctg   567     45      73756   4890    455     20846   2772573
 singl 5       45      11556   3159    58      11556   15794
 all   572     45      73756   4875    455     20846   2788367
 breaks 2 (1 contig with 2 overlapping alignments; no repeat; why is nucmer breaking the alignment?)
 
 200 bp+:
       #elem   min     max     mean    median  n50     sum
 ctg   339     201     73756   8118    3413    20846   2752170
 singl 2       4086    11556   7821    11556   11556   15642
 all   341     201     73756   8117    3494    20846   2767812
 breaks 2
  • Reduced the number of contigs from 1777 to 567 !!!

minimus on velvet contigs

Min Ovl: 16bp

 All:
       #elem   min     max     mean    median  n50     sum
 ctg   316     45      39490   5812    3243    12483   1836527
 singl 797     45      22892   1204    161     4724    959356
 all   1113    45      39490   2512    299     9399    2795883
 breaks 20
 
 200 bp+:
       #elem   min     max     mean    median  n50     sum
 ctg   258     202     39490   7093    5069    12483   1829967
 singl 368     200     22892   2504    1347    5210    921478
 all   626     200     39490   4395    2313    9508    2751445
 breaks 19
  • Reduced the number of contigs from 1777 to 1113 !!!

minimus1 on velvet contigs

Min Ovl: 16bp

 All:
       #elem   min     max     mean    median  n50     sum
 ctg   277     70      32618   6623    4709    11914   1834597
 singl 1022    45      19128   948     96      5057    968872
 all   1299    45      32618   2158    154     9526    2803469
 breaks 15
 
 200 bp+:
       #elem   min     max     mean    median  n50     sum
 ctg   247     200     32618   7413    6004    11984   1830980
 singl 354     200     19128   2590    1397    5294    916800
 all   601     200     32618   4572    2495    9757    2747780
 breaks 15
  • Reduced the number of contigs from 1777 to 1299 !!!
  • Generated fewer, larger big contigs than minimus

Fewer contigs get merged if:

 * Min Ovl is dropped from 16bp to 14bp
 * Only the overlaps represented by Duplicate reads are used

minimus1 on edena contigs

 All:
       #elem   min     max     mean    median  n50     sum
 ctg   67      240     24824   5610    4814    8671    375847
 singl 1022    100     19924   2331    1084    5455    2382674
 all   1089    100     24824   2533    1183    5933    2758521
 breaks 27
 
 200 bp+:
       #elem   min     max     mean    median  n50     sum
 ctg   67      240     24824   5610    4814    8671    375847
 singl 811     204     19924   2902    1766    5588    2353676
 all   878     204     24824   3109    1862    5984    2729523
 breaks 27

Edena contig ends get trimmed. There are fewer overlap among them than velvet contigs!!!

AMOScmp on velvet & edena contigs

Input

 All:
 desc          #elem   min     max     mean    stdev   sum
 velvet        1777    45      22892   1583    2769    2813162
 edena         1175    100     22893   2350    3152    2760953
 edena.uniq    360     100     19924   2475    3886    891123
 
 200bp+
 desc          #elem   min     max     mean    stdev   sum
 velvet        939     200     22892   2924    3272    2745649
 edena         935     204     22893   2918    3303    2727929
 edena.uniq    202     206     19924   4305    4394    869552

Output:

All:

       #elem   min     max     mean    median  n50     sum
 ctg   461     45      81678   6029    703     21943   2779591
 singl 7       45      11556   4283    2634    11556   29984
 all   468     45      81678   6003    719     21353   2809575
 breaks 0

200 bp+:

       #elem   min     max     mean    median  n50     sum
 ctg   289     205     81678   9565    4401    21943   2764375
 singl 4       2634    11556   7458    11556   11556   29832
 all   293     205     81678   9537    4401    21943   2794207
 breaks 0

minimus2 on velvet & edena contigs

Contig stats:

 All:
       #elem   min     max     mean    median  n50     sum
 ctg   740     67      38980   3793    1290    10553   2807159
 singl 366     45      6795    265     63      3684    96822
 all   1106    45      38980   2626    263     10104   2903981
 breaks 45
 
 200 bp+:
       #elem   min     max     mean    median  n50     sum
 ctg   560     202     38980   4969    2583    10561   2782466
 singl 31      203     6795    2424    1379    5455    75150
 all   591     202     38980   4835    2574    10197   2857616
 breaks 44

Staphylococcus aureus MW2 "mutated" 10% of sequence are random SNPs

Sequence assemblies

AMOScmp

OVL: 10 bp ...

               #elem   min     max     mean    median  n50     sum
 all           22431   20      705     107     87      124     2409008
 200bp+        1997    200     705     267     246     260     533882
 singl         1934772

velvet

Same as the one above

Contig assemblies

AMOScmp on velvet contigs

Min Ovl: 10bp Min Match (nucmer -l): 20 Min Cluster (nucmer -c): 40

All:

       #elem   min     max     mean    median  n50     sum
 ctg   496     45      47387   5504    2076    14370   2729975
 singl 629     45      11556   112     60      131     70262
 all   1125    45      47387   2489    96      13785   2800237
 breaks 6

200 bp+:

       #elem   min     max     mean    median  n50     sum
 ctg   405     204     47387   6715    3277    14370   2719753
 singl 28      200     11556   1042    288     11556   29165
 all   433     200     47387   6349    2878    14036   2748918
 breaks 4

Min Ovl: 10bp Min Match (nucmer -l): 16 Min Cluster (nucmer -c): 32

 All:
       #elem   min     max     mean    median  n50     sum
 ctg   544     45      53996   5044    1215    16039   2744166
 singl 385     45      11556   133     53      5057    51297
 all   929     45      53996   3009    112     15857   2795463
 breaks 8
 
 200 bp+:
       #elem   min     max     mean    median  n50     sum
 ctg   391     201     53996   6980    3156    16443   2729048
 singl 8       222     11556   3543    4086    11556   28343
 all   399     201     53996   6911    3156    16039   2757391
 breaks 6

Strep suis17

Data

  • 2726374 36bp Solexa reads
  • 2007491bp 41.30 %GC Reference
  • Coverage: 49X
  • Location:
 /fs/szdata/Solexa/Streptococcus_suis/
 /fs/szasmg2/Bacteria/dpuiu/Solexa/Streptococcus_suis/

Sequence alignments

soap

 2261784 of 2726374  (82.95%) aligned by "soap -v 5 -g 0 -s 10"

Mismatches:

 more Ss.soap | awk '{print $10}' | count.pl
 mism.   hits     hits(%)
 0       1235237  54.61
 1       548822   24.27
 2       238795   10.56
 3       120622   5.33
 4       71451    3.16
 5       46857    2.07
 Total   2261784


Sequence assembly

Velvet

Contig stats:

          #elem   min     max     mean    median  n50     sum
 all      917     45      24419   2154    278     6661    1975147
 200bp+   479     202     24419   4051    2940    6966    1940363
 breaks 2  (1 contig aligned in 2 overlapping pieces; tandem repeat in the 217341-217404 region of Ss)

Different coverage assemblies:

   cvg     #ctgs   min     max     mean    median  n50     sum     %singletons     breaks
   24.5    4630    45      5267    436     284     709     2016871 23.98           6
   29.4    2797    45      7739    716     426     1333    2001964 23.46           0
   34.3    1797    45      9347    1107    649     2230    1989102 23.34           0
   39.2    1325    45      24419   1496    663     3724    1982292 23.32           2
   44.1    1016    45      23272   1946    516     5290    1976890 23.30           2
   49.0    917     45      24419   2154    278     6661    1975147 23.30           2

Edena

Contig stats:

               #elem   min     max     mean    median  n50     sum
 all           1668    100     8829    1135    743     1968    1893815
 200bp+        1390    200     8829    1333    959     1985    1853354
 breaks 2  (1 contig aligned in 2 overlapping pieces; tandem repeat in the 217341-217404 region of Ss)

Singletons: 1582493 (58%)

Euler-SR

Command:

 $ AssembleIllumina.pl Ss.seq working_dir -vertexSize 21

Contig stats:

         #elem   min     max     mean    median  n50     sum
 all     2658    22      19768   757     42      4867    2013259
 200bp+  659     200     19768   2932    1973    5117    1932406


Escherichia coli str. K-12 substr. MG1655 (Solexa-paired)

Data

  • 7,047,668 + 10,408,224 paired-end 36bp Solexa reads => 105X + 161X cvg
  • 1M => 15.51X cvg
  • NC_000913.2 4,639,675 50.79%GC reference
  • Location:
 /fs/szasmg2/Bacteria/dpuiu/Solexa-pair/Ecoli
  • Formatting : split seq into 2
 cat Ecoli.seq | ~/bin/solexap2amos.pl -len 36          > ! Ecoli.paired.afg
 cat Ecoli.seq | ~/bin/solexap2amos.pl -len 36 -noamos  > ! Ecoli.paired.seq

Sequence assembly (1M reads: 15X cvg)

  • Velvet : hash length of 23 works best; also use -exp_cov 15
  • AMOScmp: OVL=10; the smaller the better
  • Ctg stats:
 .                            elem       min    q1     q2     q3     max        mean       n50        sum            
 AMOScmp-shortReads           331        40     1941   7826   18316  172,800    14022      26474      4641532  better !!!       
 AMOScmp-shortReads-trimmed   372        40     1916   7226   17348  98,155     12477      24730      4641600        
 velvet.ctg                   12073      45     121    250    461    4690       366        568        4430625        
 velvet.scf                   715        45     58     70     77     313779     6671       104454     4769943       

Sequence assembly (7+M reads: 100+X cvg)

  • Velvet : hash length of 27 works best; also use -exp_cov 100
 .                            elem       min    q1     q2     q3     max        mean       n50        sum            
 velvet.ctg                   375        53     63     82     6105   192151     12162      59972      4561051        
 velvet.scf                   272        53     60     71     133    860695     16785      171744     4565616

Escherichia coli str. K-12 substr. MG1655 (454-paired)

Data

  • Formatting : remove 44 bp linker
 nucmer -c 10 -l 10 L454.seq Ecoli.merged.seq -p  Ecoli --forward

 cat  Ecoli.delta | perl ~/bin/filterDeltaAnnotated.pl | ~/bin/delta2alignRange.pl -qry  >!  Ecoli.pair.seqs
 difference.pl  Ecoli.seqs Ecoli.pair.seqs >! Ecoli.single.seqs
 
 cat Ecoli.pair.seqs | p '$F[1]--; print "$F[0] $F[0].f 1 $F[1]\n"'     >! Ecoli.pair.f.pos
 cat Ecoli.pair.seqs | p '$F[2]++; print "$F[0] $F[0].r $F[2] $F[3]\n"' >! Ecoli.pair.r.pos