Revision as of 18:25, 15 April 2009

Papers

Short read fragment assembly of bacterial genomes 2007

Assemblers

Denovo

CA

Sourceforge

 cvs -z3 -d:pserver:anonymous@wgs-assembler.cvs.sourceforge.net:/cvsroot/wgs-assembler co -P wgs-assembler

Job submission

 cat runCA.sh   
   touch runCA.log
   echo --- runCA $$ restarted by $USER at `date` --- >>& runCA.log
   runCA -d . -p prefix -s spec prefix.frg >>& runCA.log

Sanger

runCA -d . -p prefix -s spec prefix.frg

doOverlapTrimming=1 (default)
low qual reads (ex: 20 or 'D') are discarded
astatLowBound 1 (default) ; can be lowered for low/un-uniform coverage
astatHighBound 5 (default)
computeInsertSize=0 (default)

454 & Hybrid

runCA -d . overlapper=mer unitigger=bog -p prefix *frg *.sff

gatekeeper -L : search for 454 paired end linker
AS_GKP_sff.c : Load_SFF function
for mated libs set mean=3000,stdev=300

Parameters

stopAfter:

   initialStoreBuilding
   overlapBasedTrimming | OBT
   overlapper
   unitigger
   consensusAfterUnitigger
   scaffolder
   consensusAfterScaffolder

cgwOutputIntermediate = 1 => "cgw -G"

 => .cgw & .cgw_contigs (generated only at the last doExtendClearRanges step)

doExtendClearRanges = 2 => cgw run 3 times, ECR 2 times

Log file & runtime (wgs-5.2)

Use ~dpuiu/bin/getCAruntimes.pl
runCA.log runCA.times bos taurus BCM.CLONEEND 16,875 Sanger reads
runCA.log runCA.times wolbachia symbioint of culex pipiens quinquefasciatus 36,767 Sanger reads

Gatekeeper

Dumping

gatekeeper -dumpfragments -withsequence prefix.gkpStore > prefix.gkpStore.seq
gatekeeper -dumpfrg prefix.gkpStore > prefix.gkpStore.frg
gatekeeper -dumpfragments -tabular prefix.gkpStore > prefix.gkpStore.tab

Updating

act:R !!!

 {DST
 act:R
 acc:10691
 mea:161662
 std:28760
 }

!!! No restrictions on lib stddev with --edit option

cat prefix.update
  lib     uid     19070   mean    167001
  lib     uid     19070   stddev  167001

gatekeeper --edit prefix.update prefix.gkpStore > prefix.gkpStore.seq

Deleting

DST messages can not be deleted
A FRG can be deleted only if there is no related LKG message; an abbreviated MSG

 {LKG
 act:D
 typ:M
 fg1:94376
 fg2:94654
 }
 {FRG
 act:D
 acc:94376
 }

Terminator

Getting output seq

asmOutputFasta -p prefix < prefix.asm

Getting surrogate ids's

 #most of them have negative aStat's
 $ cat *.asm | grep -B 8 ^sta:S | grep ^acc | perl -ane '/(\d+)/; print "utg".$1,"\n";'

Output processing

Get posMap files

buildPosMap -o prefix < prefix.asm

Get scaff files

 cavalidate -e 10 prefix
 bank2scaff prefix.bnk/ > prefix.scaff
 ca2scaff -i prefix.asm -o prefix.scaff

Get AGP files

 asmToAGP.pl < prefix.asm > prefix.agp

        /--- buildPosMap ------> .posmap
       /
  .asm /---- asmToAGP ------------------------------------------------> .AGP 
      \                                                              /   
       \---- cavalidate, bank2scaff?--> .scaff --- scaff2agp.pl ---

Other scripts:

 bank2scaff : incorrect gap sizes

 valiadteAgp.pl < prefix.agp

 infoseq 9-terminator/prefix.ctg.fasta | sed 's/ctg//' >! prefix.ctg.infoseq
 scaff2agp.pl -i prefix.ctg.infoseq < prefix.scaff > prefix.agp

Get new read clear ranges (latest=ECR2):

 gatekeeper -dumpfragments -tabular -clear [ORIG|QLT|VEC|OBTINI|OBT|UTG|ECR1|ECR2] prefix.gkpStore/  | perl -ane 'print join "\t", @F[0,-2,-1]; print "\n";' > prefix.ECR2.clr

Get new library insert sizes:

 asm2mdi.pl < prefix.asm > prefix.mdi

Get number of reads/ctg or scf (from the pos files)

 cat prefix.posmap.frgctg | count.pl > prefix.posmap.frgctg.count
 cat prefix.posmap.frgctg | count.pl > prefix.posmap.frgscf.count
 cat prefix.posmap.ctgscf | count.pl > prefix.posmap.ctgscf.count
 ...

Compute bp in scaffolds (remove N's)

Compute read/mate cvg (from the pos files)

 posmap2cvg.pl                            < prefix.posmap.frgscf > prefix.posmap.frgscf.frg_cvg
 posmap2cvg.pl -mates prefix.posmap.mates < prefix.posmap.frgscf > prefix.posmap.frgscf.mate_cvg

Compute ctg cvg(gaps & overlapping ctgs)

 posmap2cvg.pl < prefix.posmap.ctgscf > prefix.posmap.ctg_cvg

Convert ugs for bank features

 posmap2feat.pl < prefix.posmap.utgctg > prefix.posmap.utgctg.feat

Running subassemblies

Update frg file:

clr=ECR2
MDI -> DST

runCA parameters:

doOBT=no

Test datasets

Porphyromonas_gingivalis_W83 /fs/szdata/454p/Porphyromonas_gingivalis_W83/

 1 .frg & 4 .sff's
 
 wgs 5.1 failed in consensus
 log:
   003 failed -- no .success or no .cgi!
   012 failed -- no .success or no .cgi!
   step PostUnitiggerConsensus failed with '2 consensusAfterUnitigger jobs failed.  Good luck.'

Atlas

Atlas web site

Euler-SR

Euler-SR source
README
Short read fragment assembly of bacterial genomes (Genome Research 2008)
De Novo Repeat Classification and Fragment Assembly (Genome Research 2004)
Automated De Novo Identification of Repeat Sequence Families in Sequenced Genomes (Genome Research 2002)
De novo fragment assembly with short mate-paired reads: Does the read length matter? (Genome Research 2008)

Cmd:

 setenv EUSRC  /fs/szdevel/dpuiu/euler-sr/
 setenv MACHTYPE x86_64
 set path=($path ${EUSRC}/assembly)
 set path=($path ${EUSRC}/assembly/${MACHTYPE})
 /fs/szdevel/dpuiu/euler-sr/assembly/Assemble.pl readsFile vertexSize ruleFile

35bp Solexa reads: vertexSize=25 vs vertexSize=21

 vertexSize=21 
   more agreagte assembly; 
   some tandem repeats in 2K+ contigs were collapsed (detected by aligning to Strep aureus finished genome)
   these repeats were present in small ctgs (~ 200bp) when euler was run with vertexSize=25
   these repeats were not present in the velvet assembly

newbler (454)

minimus

hash-overlap issues:

very slow on large sequences (runs Smith-Waterman)
finds all overlaps (not only CONTAINED|BEGIN|END); of no use to the assembler
default ovl: 40 bp (too large)
min ovl: 15 (set by minimizer)
20 bp overlap ok;
the min ovl imposed by the next step (tigger) is 5bp

minimus1

hash-overlap was replaced with a nucmer based overlapper
nucmer takes long time and lots of space to find very short overlaps (< 16bp?)
We can have more control on the overlaps that are used (can easily filter the nucmer output)
can discard CONTAINED/IDENTITY alignments in order to avoid collapsing repeats
can mask the middle of the sequences to make the alignment process faster; (Issue: CONTAINED overlaps appear as BEGIN/END overlaps)

Example:

  ....
  OVERLAP=16
  MINID=94
  MAXTRIM=20
  ...
  20: $(NUCMER) -maxmatch -l $(OVERLAP) -c $(OVERLAP) $(SEQS) $(SEQS) -p $(PREFIX)
  24: $(SHOWCOORDS) -H -c -l -o -r -I $(MINID) $(ALIGN) | egrep 'BEGIN|END' > $(COORDS)
  25: $(SCRIPTDIR)/nucmer2ovl.pl -ignore $(MAXTRIM) -tab $(COORDS)> $(OVLTAB)
  26: $(SCRIPTDIR)/ovl2OVL.pl $(OVLTAB)  > $(OVL)
  28: $(BINDIR)/bank-transact -z -b $(BANK) -m $(OVL)

minimus2

assembles two sequence sets (S1,S2)
Instead of merging the sets and aligning all versus all (S1+S2 : S1+S2) it aligns one versus the other:

 S1:S2
 S1:S1 (optional)
 S2:S2 (optional)

Each alignment can be run with different parameters

Example 1: merge two de-novo assemblies

 ...
 TYPE1=N
 TYPE2=N
 ...
 OVERLAP11       = 16
 OVERLAP22       = 16
 OVERLAP12       = 40
 ...
 # align sets
 $(NUCMER) -maxmatch -l $(OVERLAP11) -c $(OVERLAP11) $(SEQS1) $(SEQS1) -p $(PREFIX11)
 $(NUCMER) -maxmatch -c $(OVERLAP12) $(SEQS1) $(SEQS2) -p $(PREFIX12)
 $(NUCMER) -maxmatch -l $(OVERLAP22) -c $(OVERLAP22) $(SEQS2) $(SEQS2) -p $(PREFIX22)
 ...
 # merge alignments
 $(SHOWCOORDS) $(ALIGN11) | egrep 'BEGIN|END'  > $(COORDS)   
 $(SHOWCOORDS) $(ALIGN12) >> $(COORDS)
 $(SHOWCOORDS) $(ALIGN22) | egrep 'BEGIN|END'  >> $(COORDS)
 ...
 nucmer2ovl.pl  $(COORDS) > $(OVLTAB)
 ...

Example 2: merge a comparative and a de-novo assembly

 ...
 TYPE1           = C
 TYPE2           = N
 ...
 # look only for short overlaps among adjacent sequences from S1
 $(SCRIPTDIR)/fasta2ovl.pl -tab $(SEQS1) > $(OVLTAB)  
 ...
 # look for all overlaps  among one sequence from S1 and one from S2
 $(NUCMER) -maxmatch -c $(OVERLAP12) $(SEQS1) $(SEQS2) -p $(PREFIX12) 
 $(SHOWCOORDS) $(ALIGN12) | $(SCRIPTDIR)/nucmer2ovl.pl >> $(OVLTAB) 
 ...
 # look for end overlaps among sequences from S2
 $(NUCMER) -maxmatch -l $(OVERLAP22) -c $(OVERLAP22) $(SEQS2) $(SEQS2) -p $(PREFIX22)  
 $(SHOWCOORDS) $(ALIGN22) | egrep 'BEGIN|END' > $(COORDS22) | $(SCRIPTDIR)/nucmer2ovl.pl  >> $(OVLTAB)
 ...

SSAKE

Web Site
Short Sequence Assembly by K-mer search and 3' read Extension
Rene L Warren, Granger G Sutton, Steven JM Jones, Robert A Holt. "Assembling millions of short DNA sequences using SSAKE" Bioinformatics. Vol. 23 no. 4 2007, pages 500–501
Uses a greedy assembly algorithm: progressively searches for perfect 3’-most k-mers using a DNA prefix tree to identify overlaps between any two sequences; stringently clusters short reads into contigs
Copyright (c) 2006-2008 Canada's Michael Smith Genome Science Centre.
Current release: SSAKE 3.2.1 (08/21/2008)
Distribution: PERL script; runs on Linux systems
Parameters:

 -m  Minimum number of reads needed to call a base during overhang consensus build up (default -m 16)
 -o  Minimum number of reads needed to call a base during an extension (default -o 2)
 -r  Minimum base ratio used to accept a overhang consensus base (default -r 0.7)
 -t  Max number of bases to trim (default 0)

"Especially" targeted towards Solexa seq; 454 reads too long & homopol problem
Can handle mated reads
TQS.py should be used for qual trimming

Example:

 $ grep -c "^>" Pa.seq           # total number of reads
 $ (time ssake_3.0.pl -f Pa.seq) > ssake_3.0.log             # version 3.0
 $ (time ssake_3.2.pl -f Pa.seq -t 1) > ssake_3.2.log        # version 3.2
 $ ln -s Pa*contigs  Pa.fasta
 $ ln -s Pa*singlets Pa.singlets.seq
 $ cat Pa.fasta        | grep "^>" | perl -ane '/read(\d+)/; print $1, "\n";' | getSummaryDescriptive.pl -t assembled_reads    # number of assembled reads
 $ cat Pa.singlets.seq | grep "^>" | perl -ane '/read(\d+)/; print $1, "\n";' | getSummaryDescriptive.pl -t singletons         # number of singletons
 $ grep -v N Pa.seq              # number of reads which contain ambiguities and will be discarder from the beginning (should be added to the singletons)

VCAKE

Velvet

Sequence assembler for very short reads
Velvet: Algorithms for De Novo Short Read Assembly Using De Bruijn Graphs Genome Res. Mar 2008
Project Web Site at EBI Current version: 0.6 02/06/2008: Velvet 0.6
Uses Bruijn graphs to assemble sequences: first uses an error correction algorithm to merge sequences which belong together, then a repeat solver separates paths sharing local overlaps
Distribution: C program, to be compiled with gcc
Open Source, GPL agreement, EBI.
Current release: 0.7.20 (12/01/2008?)
Parameters:

 - hash_length: odd integer (if even, it will be decremented) <= 31 (if above, will be reduced)
* minimum overlap length (30-40X coverage data set):
    17bp usually gives the fewest contigs (some case)
    23bp is also ok (used for Pa,Ps)
    15bp is too low => too many short contigs
*  -long option for including "long sequences" (Sanger, 454 or even reference sequences) very slow

At ~30X, max ctg len is ~ 5KB
At >45X, max ctg len is ~ 20KB
Longest contig on simulated data is 56Kbp (Ps, no error 32bp reads, 32X cvg)

Max ovl=32
For longer reads (66bp), some of the repeats ( ~100bp) get assembled partially; -cov_cutoff parameter must be used, otherwise low cvg/quality ctgs are reported

velvet_assy.afg lists the assembled reads (TLE messages)
some reads extend beyond the contig ends
there are some reads shared by multiple contigs (status:D) ; at least one of the instances occurs at a contig end (goes beyond the contig end)
many contigs overlap by a few bp; max val= min_ovl-1
minimum_ctg length =2*minimum_ovl-1 (23=>45, 21=>41 ...)

Example (unmated reads):

 $ velveth . ovl prefix.seq                                # ovl should be an odd integer like 21, 23 
 $ velvetg . -read_trkg yes

Example (mated reads):

 $ velveth . ovl -shortPaired  reads.fasta
 $ velvetg . -ins_length 1000 -exp_cov 40

Example (processing results):

 $ bank-transact -c -z -b prefix.bnk -m velvet_assy.afg
 $ bank2contig prefix.bnk  > prefix.contig
 $ bank2fasta -b prefix.bnk > prefix.fasta
 
 $ listReadPlacedStatus -S -E prefix.bnk > prefix.singletons
 $ extractfromfastanames.pl prefix.seq prefix.singletons > prefix.singletons.seq
 
 $ listReadPlacedStatus prefix.bnk > prefix.status
 $ grep D prefix.status  | awk '{print $5,$6}' | sort -u > prefix.pairs

Edena

Edena: Exact DE Novo Assembler

De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer Genome Res. Apr 2008
Project Web Site Current verision: 2.1.1 03/17/2008
Genomic Research Laboratory, Infectious Diseases Service, Geneva University Hospitals, Switzerland
Current release: 2.1.1 (03/17/2008)
Distribution: Compiled executables for Linux-32,Linux-64
Uses a traditional read overlap-layout paradigm (adjusted for short reads)
Parameters:

 --overlapCutoff [int]    Only consider overlaps >= than the specified size. (default 22)
 --minContigSize [int]    Minimum size of the contigs to output. (default 100)

contigs don't overlap (ends seem to be trimmed)
a significant number of reads don't get assembled; max % assembled : 69% for Staphylococcus_aureus

reads should be 1/line (+1 line for the header)
for 66bp reads, the longest ctgs were obtained for --overlapCutoff 35

Example:

 $ edena -r prefix.seq -p ovl
 $ edena -e prefix.ovl -p prefix
 $ ln -s prefix_contigs.fasta prefix.fasta
 $ cat *_info.txt                                 # assembly stats
 $ grep assembled prefix_info.txt
 $ cat prefix_contigs.fasta | grep "^>" | perl -ane '/(\d+)$/; print $1, "\n";' | getSummaryDescriptive.pl -t reads_assembled

Others

ALLPATHS : De novo assembly of whole-genome shotgun microreads Genome Res. Mar 2008
SSAKE
VCAKE
SHARCGS
SeqMan Genome Assembler (SMGA) by DNAStar(commercial)
Euler-SR

Comparative

AMOScmp

one should increase the make-consensus error rate if ref and qry are not that similar
all reads from layout are included in consensus

casm-layout issues:

if one sequence end does not align, the read is not assembled; the adjacent read is not assembled either if based on alignment to the ref. the two "theoretically" overlap;
for a circular bacterial chromosome, the BEGIN/END reads don't get assembled => cannot get a circular contig !!!
all CONTAINED reads should be assembled but the ones that are adjacent to dirty reads are left as singletons as well !!!
option -r places repeats randomly; what if we want more control over this?

There are some large negative gaps in scaffold file (Solexa read assemblies); Some contigs are "CONTAINED" in the contig just before the large negative gap; all these contig id's are larger than the .scaff last contig ; Contig alignments to the ref have breaks
casm-layout -g 10000 (defaul max gap) seems to produce the problem above; this value should be reduced (Ex 100)

show-coords issues:

sometimes long CONTAINED annotated contigs contain dirty ends

show-snps issues:

ambiguities are not shown
maximum init_size = 10000 query seqs
does not work on long seqs

Example:

 $ sort -nk4 AMOS.scaff | head
 322 BE 87996 -24519
 210 BE 7460 -7324
 1271 BE 60490 -4112
 17 BE 1022 -944
 ...

 $more AMOS.scaff
 ...
 322 BE 87996 -24519     # 322 contains contigs 1770, 1771 ...
 1770 BE 39 1100
 1771 BE 130 738
 ...
 1780 BE 72 25004

 $ tail -1 AMOS.scaff
 1767 BE 39264 0

AMOScmp-shortReads-alignmentTrimmed

Defaults:

  MINCLUSTER  = 16
  MINMATCH    = 16
  MINLEN      = 20  # delta-filter -l 20
  MINOVL      = 10  # previously 5 ; mis-assemblies occurred
  MAXTRIM     = 10
  MAXGAP      = 100 # default 10,000 (casm-layout -g) creates some incorrect contigs
  MAJORITY    = 70  # previously 50 
  CONSERR     = 0.06
  ALIGNWIGGLE = 2   # different than the default AMOScmp 15

 * Reads are trimmed based on nucmer alignment to a reference genome; reads that are adjacent to 0 cvg regions are not trimmed
 * If reads are not trimmed CONSERR should be increased, otherwise make-consensus fails
 * If reads are not trimmed, the avg contig length is significantly shorter than if reads are trimmed based on alignments => 
   !!! read trimming is important.
 * assembly is very fragmented if the percent identity between reads and the reference is less than 98%
 * simulations run on a PA-b1 10K read subset using PA14 as reference; random snps were introduced into PA14 to decrease the %identity; (MINCLUSTER=12, MINMATCH=16, MINLEN=20)
   99.4% : 1 contig  (original data, no snps added to PA14)
   98%   : 2 contigs
   96%   : 10 contigs
   93%   : 50 contigs
   93%   : 37 contigs (MINMATCH=14)

AMOScmp-shortReads

Defaults

  MINCLUSTER  = 20
  MINMATCH    = 20
  MINOVL      = 5
  MAXTRIM     = 10
  MAJORITY    = 50
  CONSERR     = 0.06
  ALIGNWIGGLE = 2

 * No read trimming is done

AMOScmp2

Runs AMOScmp of 2 data sets at once (with different parameters)

Example: merge a comparative(AMOScmp) and a de-novo(velvet) assembly, using the same reference was used to get the comparative one

 ...
 TYPE1           = C  # Assembly 1 is a comparative one, already know the position of contigs; no need to align them to ref.
 TYPE2           = N  # Assembly 2 is a de-novo one, don't know the position of contigs; we need to align them to ref.
 ...
 $(SCRIPTDIR)/scaff2delta.pl $(SCAFF1) > $(ALIGN1)
 $(NUCMER) -maxmatch -c $(OVERLAP2) $(REF) $(SEQS2) -p=$(PREFIX2) 
 ...
 cp $(ALIGN1)$(ALIGN)
 cat $(ALIGN2) | grep -v $(PREFIX) | grep -v NUCMER  >> $(ALIGN)
 ....

maq

At high %identity (close to 100%) maq outperforms AMOScmp-shortReads-alignmentTrimmed
At lower % identity (<97%) AMOScmp-shortReads-alignmentTrimmed outperforms maq

Other Scripts

delta-filter.pl

Similar to MUMMer delta-filter
The difference is that for multi copy repeats, it tries to place each one uniquely on the reference
"delta-filter -q" collapses all in the same reference region
"delta-filter -r" aligns the same one to multiple reference regions

fasta2ovl.pl

Program that finds short normal direction overlaps between two adjacent sequences in a multi-FASTA file.
It tests if the END of seqence N overlaps the BEGINING of sequence N+1.

megamerge (EMBOSS)

 * Can merge well short overlapping contigs
 * Does not work well when the overlap identity is low
 * Min ovl length=2

nucmer2ovl.pl

Program that converts nucmer annotated alignments to an overlap file (either AMOS or TAB format)

nucmer.pl

It is a nucmer wrapper which runs nucmer with decreasingly alignment size on the sequences
Iteration n only tries to align sequences not aligned at iterations n-1,n-2,...1
Shorter run time, less disk space (than running nucmer with a small alignment/cluster size)
Default parameters:

 -len 20,16,12         : nucmer minmatch (-l)
 -cluster 20,16,14     : nucmer mincluster (-c)
 -filter 20,20,20      : delta-filter minlen (-l)

Could be used by AMOScmp-shortReads, AMOScmp-shortReads-alignmentTrimmed

nucmerAnnotate.pl

Program that parses a nucmer coords file and re-annotates sequence alignments
The maximum length of the end sequence to be ignored can be set as a parameter

revDelta.pl

Parses a delta file and switches the ref & query

revseq.pl

Revese complements a seq file
adds ".rev" suffix to seq id's

scaff2delta.pl

Program that converts a scaff file to a delta file
Example: No need to run nucmer on the AMOScmp output contigs to get them aligned to reference since we already know their positions in the scaffold

Subassembly

Reconciliator

Cases

No reference sequence

One data set, multiple denovo assemblers

Example:

 * Solexa data
 * edena & velvet assemblers

Solution:

 * run minimus on the contigs

Multipls data sets, one(multiple) denovo assemblers

Example:

* Solexa & 454 data
* velvet assembly on Solexa
* newbler assembly on 454

Solution:

 * run minimus on the contigs

One reference sequence

Example:

 * Solexa data
 * AMOScmp on Solexa
 * velvet on Solexa

Solution:

 * remove the redundant velvet contigs (CONTAINED in AMOScmp contigs) (Optional; leaving the CONTAINED contigs had very similar results, CONSERR did not have to be reduced)
 * run AMOScmp on AMOScmp & velvet contigs => join some AMOScmp gaps
 * run minimus to further merge contigs

Multiple reference sequences

Example:

 * Solexa data
 * AMOScmp on Solexa using Ref1
 * AMOScmp on Solexa using Ref2

Solution:

 * remove the redundant AMOScmp Ref2 contigs (CONTAINED in AMOScmp contigs) (optional)
 * run AMOScmp on AMOScmp Ref1 & Ref2 contigs => join some AMOScmp Ref1 gaps
 * run minimus to further merge contigs

Examples

Pseudomonas_syringae

Reference:

 Name           Length  %GC
 NC_004578.1    6397126 58.40
 NC_004633.1    73661   55.15
 NC_004632.1    67473   56.17

Repeats:

 desc    #repeats   min     max     mean    stdev    sum
 50bp+   991        50      7362    393.73  792.41   390192
 100bp+  429        100     7362    815.36  1060.29  349793

Simulated 32bp exact match reads

 Type            #reads       min     max     mean
 Sim(ulated)     6538167      32      32      32   ( 32x coverage)

Single assemblies:

 Assembler   type         input-data  #reads         #ctgs   min     max     mean      stdev     ctgs-sum      #singletons       
 edena-sim   denovo       Sim         6538167        2068    100     47881   2994.03   4857.76   6191673       198699(3%)
 velvet-sim  denovo       Sim         6538167        2207    45      56810   2820.91   5348.36   6225757       123591(2%)

Solexa reads

 Type            #reads       min     max     mean
 Solexa          6340136      32      32      32   (~31x coverage)

Assemblies:

 Assembler   type         input-data  #reads         #ctgs   min     max     mean      stdev     ctgs-sum      #singletons       
 AMOScmp     comparative  Solaxa      6340136        187     20      577929  34863.06  91692.34  6519394       698638(11%)
 velvet      denovo       Solaxa      6340136        25161   45      5057    241.83    212.61    6084887 
 velvet(100bp+)           Solexa                     19534   100     5057    288       220       5633400
 edena       denovo       Solaxa      6340136        14084   100     5075    210.92    145.68    2970720       4893301(77%)

Assembly breaks:

 velvet: 31
 edena : 38

Merged assemblies(contigs):

 assemblers     type         input-data                #reads  #ctgs   min     max     mean     stdev   ctgs-sum   comments    
 minimus(ovl20) denovo       velvet                    25161   19121   45      5057    311.3    297.27  5952381    #merged 25161-19121=6040 (25%) gaps
 minimus(ovl20) denovo       edena+velvet              39245   18603   45      6688    322.32   311.02  5996244

Pseudomonas aeruginosa b1 (PAb1)

References:

 Name           Length  %GC
 PA14           6537648 66.29
 PACS2          6492423 66.33
 PAO1           6264404 66.56
 ...

Solexa reads

 Type            #reads       min     max     mean
 Solexa          8627900      33      33      33    (~43X coverage)

Assemblies:

All contigs:

 Assembler      type         #ctgs   min     max     mean      stdev      ctgs-sum      #singletons    comments
 AMOSCmp-PA14   comparative  2053    17      170485  3011.84   11917.53   6183320       1127399
 AMOSCmp-PAO1   comparative  2797    17      75626   2161.19   5812.2     6044851       1592525
 AMOScmp-PA2192 comparative  5816    17      133129  1072.8    3725.22    6239454       1601299        largest assembly
 maq-PA14       comparative  991     33      155551  6199.79   17445.05   6143996       1197385
 velvet         denovo       10684   45      16239   640.34    825.24     6841458       1241079        much better than Ps !!!
 edena          denovo       11180   100     11300   552.36    610.52     6175460       3955865 (46%)  much better than Ps !!!
 ssake          denovo       185030  34      5490    77.21     141.23     14287079      3056893 
 euler-sr       denovo       30394   26      5004    184       138        5601125       3437156

200bp+ contigs:

 Assembler      type         #ctgs   min     max     mean      stdev      ctgs-sum
 AMOSCmp-PA14   comparative  428     203     170485  14262.09  22852.74   6104175
 AMOSCmp-PAO1   comparative  865     200     75626   6893.96   8766.63    5963278
 AMOSCmp-PA2192 comparative  1299    200     133129  4683.46   6735.52    6083817
 maq-PA14       comparative  368     200     155551  16581.7   25475.92   6102067
 velvet         denovo       7382    200     16239   877.05    896.35     6474426
 edena          denovo       8316    200     11300   692.54    651.24     5759209
 ssake          denovo       12532   200     5490    486       329.93     6090567

Merged assemblies(contigs&singletons):

 Assembler           type   input-data          #reads   #ctgs   min   max     mean    stdev      ctgs-sum       comments
 AMOSCmp-PA14-merge  ?      AMOSCmp-PA14(ctgs)  2053     1931    17    170485  3201.7  12981.57   6182486        2053-1931=122 gaps closed

Staphylococcus aureus MW2

Data

Files:

 /fs/szasmg2/Bacteria/dpuiu/Solexa/Staphylococcus_aureus/

Reads:

 3857879 35bp Solexa reads
 22843 contain at least 1 N; 10455 contain more than 1 N; 8244 contain more than 2 N's

Reference:

 NC_003923      2820462 32.83  Staphylococcus aureus subsp. aureus MW2, complete genome.
 AP004832       20654  28.38  Staphylococcus aureus plasmid pMW2 DNA, complete sequence
 total          2841116

Coverage: 47.52X

Other strains:

 NC_002952.2    2902619 32.81  Staphylococcus aureus subsp. aureus MRSA252, complete genome
 NC_002953.3    2799802 32.85  Staphylococcus aureus subsp. aureus MSSA476, complete genome
 NC_002951.2    2809422 32.82  Staphylococcus aureus subsp. aureus COL, complete genome
 NC_009632.1    2906507 32.95  Staphylococcus aureus subsp. aureus JH1, complete genome
 NC_009487.1    2906700 32.95  Staphylococcus aureus subsp. aureus JH9, complete genome
 NC_009782.1    2880168 32.88  Staphylococcus aureus subsp. aureus Mu3, complete genome
 NC_002758.2    2878529 32.88  Staphylococcus aureus subsp. aureus Mu50, complete genome
 NC_003923.1    2820462 32.83  Staphylococcus aureus subsp. aureus MW2, complete genome
 NC_002745.2    2814816 32.84  Staphylococcus aureus subsp. aureus N315, complete genome
 NC_007795.1    2821361 32.87  Staphylococcus aureus subsp. aureus NCTC 8325, complete genome
 NC_009641.1    2878897 32.89  Staphylococcus aureus subsp. aureus str. Newman, complete genome
 NC_007622.1    2742531 32.78  Staphylococcus aureus RF122, complete genome
 NC_007793.1    2872769 32.75  Staphylococcus aureus subsp. aureus USA300, complete genome
 NC_010079.1    2872915 32.76  Staphylococcus aureus subsp. aureus USA300_TCH1516, complete genome

Sequence alignments

soap

 3735781 of 3857879 (96.76%) aligned by "soap -v 5 -g 3 -s 12"

Mismatches:

 ~ 10% of the aligned reads had more at least 2 mismatches 
 ~  4% of the aligned reads had more at least 3 mismatches

 cat Sa.soap.v5.s12.g3 | awk '{print $10}' | count.pl
 #       total        %
 0       2752024      73.67
 1       633043       16.95
 2       201439       5.39
 3       84258        2.26
 4       41787        1.12
 5       22629        0.61
 201     525          0.01
 101     45
 202     20
 103     5
 203     4
 102     2
 Total   3735781

Sequence assemblies

AMOScmp

OVL: 10 bp

Output stats:

 All:
               #elem   min     max     mean    median  n50     sum
 ctg           17      60      658745  167134  102144  485748  2841283
 ctg(200bp+)   15      2090    658745  189411  102265  485748  2841158
 singl         101422  23      35      35      35      35      3549538
 breaks        0

Only 2.62% of the reads are singletons
101232 out of 101422 AMOS singletons (99.81%) are also velvet singletons

velvet

OVL: 23 bp

Output stats:

               #elem   min     max     mean    median  n50     sum
 ctg           1777    45      22892   1583    252     5337    2813162
 ctg(100bp+)   1146    100     22892   2421    1167    5472    2774471
 ctg(200bp+)   939     200     22892   2924    1764    5512    2745649
 singl         289207  35      35      35      35      35      10122245
 breaks 0

7.49% reads are singletons
101232 out of 289207 velvet singletons are also AMOS singletons
13189 reads are duplicates, connect 1126 contig pairs (1242 unique ctgs in these pairs)
1114 out of 1242 contig overlaps were identified by running nucmer (-c 12 -l 12)

 ovlSize count
 22      740
 21      99
 20      55
 19      36
 18      52
 17      33
 16      17
 15      29
 14      21
 13      15
 12      17
 Total   1114

Contig coverage & alignments to the reference:

 $ grep "^##" Sa.contig | perl -ane 'print $F[1]*35/$F[2],"\n";' > Sa.cvg
 
 #ctgs   min         max     mean    median  n50     sum
 1777    1.55        517.25  43      37.41   44.36   76159.86

 $ show-coords -I 94 1con-contigs.delta | grep CONTAIN | awk '{print $19,$13}' | count.pl | sort -n >! 1con-contigs.qry_hits

 $ join 1con-contigs.qry_hits Sa.cvg | awk '{print $3,$4}' | ~/bin/getCorrelationDescriptive.pl
 0.61 # coeff. of correlation > 0
 0.81 # coeff. of correlation for ctgs>100 bp

0cvg regions

 $show-coords Sa.all.delta | ~/bin/getNucmerCoverage.pl -M 0 | getSummary.pl -c 3
       #elem   min     max     mean    median  n50     sum
 gap   305     1       570     34      19      70      10521

OVL: 21 bp

               #elem   min     max     mean    median  n50     sum
 all           2099    41      18254   1342    258     4384    2817753
 45bp+         1830    45      18254   1534    448     4391    2806596
 100bp+        1337    100     18254   2075    1083    4558    2774667
 200bp+        1115    200     18254   2461    1486    4625    2743892
 singl         224006

OVL: 23 bp seems to generate a better assembly on this data set !!!

edena

Output stats:

               #elem   min     max     mean    median  n50     sum
 ctg           1175    100     22893   2350    1094    5514    2760953
 ctg(200bp+)   935     204     22893   2918    1766    5594    2727929
 breaks        0
 
 assembled_reads 2676002 (69.8%)
 singletons      1181877

In this case velvet & edena have similar contig stats !!!
All edena contigs align to velvet contigs
Not all edena contigs are CONTAINED/IDENTICAL to velvet contigs !!!

               #elem   min     max     mean    median  n50     sum
 edena.uniq    360     100     19924   2475    273     7791    891123

 $ show-coords -H velvet-edena.filter-q.delta | egrep 'CONTAINED|BEGIN|END' | awk '{print $19,$13}' | count.pl -m 2 | wc -l  # 193 edena contigs CONTAIN multiple velvet contigs
 $ show-coords -H velvet-edena.filter-q.delta | egrep 'IDEN' | wc -l                                                         # 435 edena contigs  are IDENTICAL to velvet contigs

euler-sr

         #elem   min     max     mean    median  n50     sum
 all     1349    22      43140   2073    109     10523   2796347
 200bp+  566     201     43140   4859    2334    10662   2750118

Contig assemblies

AMOScmp on velvet contigs

Min Ovl: 10bp Min Cluster(nucmer -c): 40 (many velvet contigs are 45-65 bp)

 All:
       #elem   min     max     mean    median  n50     sum
 ctg   567     45      73756   4890    455     20846   2772573
 singl 5       45      11556   3159    58      11556   15794
 all   572     45      73756   4875    455     20846   2788367
 breaks 2 (1 contig with 2 overlapping alignments; no repeat; why is nucmer breaking the alignment?)
 
 200 bp+:
       #elem   min     max     mean    median  n50     sum
 ctg   339     201     73756   8118    3413    20846   2752170
 singl 2       4086    11556   7821    11556   11556   15642
 all   341     201     73756   8117    3494    20846   2767812
 breaks 2

Reduced the number of contigs from 1777 to 567 !!!

minimus on velvet contigs

Min Ovl: 16bp

 All:
       #elem   min     max     mean    median  n50     sum
 ctg   316     45      39490   5812    3243    12483   1836527
 singl 797     45      22892   1204    161     4724    959356
 all   1113    45      39490   2512    299     9399    2795883
 breaks 20
 
 200 bp+:
       #elem   min     max     mean    median  n50     sum
 ctg   258     202     39490   7093    5069    12483   1829967
 singl 368     200     22892   2504    1347    5210    921478
 all   626     200     39490   4395    2313    9508    2751445
 breaks 19

Reduced the number of contigs from 1777 to 1113 !!!

minimus1 on velvet contigs

Min Ovl: 16bp

 All:
       #elem   min     max     mean    median  n50     sum
 ctg   277     70      32618   6623    4709    11914   1834597
 singl 1022    45      19128   948     96      5057    968872
 all   1299    45      32618   2158    154     9526    2803469
 breaks 15
 
 200 bp+:
       #elem   min     max     mean    median  n50     sum
 ctg   247     200     32618   7413    6004    11984   1830980
 singl 354     200     19128   2590    1397    5294    916800
 all   601     200     32618   4572    2495    9757    2747780
 breaks 15

Reduced the number of contigs from 1777 to 1299 !!!
Generated fewer, larger big contigs than minimus

Fewer contigs get merged if:

 * Min Ovl is dropped from 16bp to 14bp
 * Only the overlaps represented by Duplicate reads are used

minimus1 on edena contigs

 All:
       #elem   min     max     mean    median  n50     sum
 ctg   67      240     24824   5610    4814    8671    375847
 singl 1022    100     19924   2331    1084    5455    2382674
 all   1089    100     24824   2533    1183    5933    2758521
 breaks 27
 
 200 bp+:
       #elem   min     max     mean    median  n50     sum
 ctg   67      240     24824   5610    4814    8671    375847
 singl 811     204     19924   2902    1766    5588    2353676
 all   878     204     24824   3109    1862    5984    2729523
 breaks 27

Edena contig ends get trimmed. There are fewer overlap among them than velvet contigs!!!

AMOScmp on velvet & edena contigs

Input

 All:
 desc          #elem   min     max     mean    stdev   sum
 velvet        1777    45      22892   1583    2769    2813162
 edena         1175    100     22893   2350    3152    2760953
 edena.uniq    360     100     19924   2475    3886    891123
 
 200bp+
 desc          #elem   min     max     mean    stdev   sum
 velvet        939     200     22892   2924    3272    2745649
 edena         935     204     22893   2918    3303    2727929
 edena.uniq    202     206     19924   4305    4394    869552

Output:

All:

       #elem   min     max     mean    median  n50     sum
 ctg   461     45      81678   6029    703     21943   2779591
 singl 7       45      11556   4283    2634    11556   29984
 all   468     45      81678   6003    719     21353   2809575
 breaks 0

200 bp+:

       #elem   min     max     mean    median  n50     sum
 ctg   289     205     81678   9565    4401    21943   2764375
 singl 4       2634    11556   7458    11556   11556   29832
 all   293     205     81678   9537    4401    21943   2794207
 breaks 0

minimus2 on velvet & edena contigs

Contig stats:

 All:
       #elem   min     max     mean    median  n50     sum
 ctg   740     67      38980   3793    1290    10553   2807159
 singl 366     45      6795    265     63      3684    96822
 all   1106    45      38980   2626    263     10104   2903981
 breaks 45
 
 200 bp+:
       #elem   min     max     mean    median  n50     sum
 ctg   560     202     38980   4969    2583    10561   2782466
 singl 31      203     6795    2424    1379    5455    75150
 all   591     202     38980   4835    2574    10197   2857616
 breaks 44

Staphylococcus aureus MW2 "mutated" 10% of sequence are random SNPs

Sequence assemblies

AMOScmp

OVL: 10 bp ...

               #elem   min     max     mean    median  n50     sum
 all           22431   20      705     107     87      124     2409008
 200bp+        1997    200     705     267     246     260     533882
 singl         1934772

velvet

Same as the one above

Contig assemblies

AMOScmp on velvet contigs

Min Ovl: 10bp Min Match (nucmer -l): 20 Min Cluster (nucmer -c): 40

All:

       #elem   min     max     mean    median  n50     sum
 ctg   496     45      47387   5504    2076    14370   2729975
 singl 629     45      11556   112     60      131     70262
 all   1125    45      47387   2489    96      13785   2800237
 breaks 6

200 bp+:

       #elem   min     max     mean    median  n50     sum
 ctg   405     204     47387   6715    3277    14370   2719753
 singl 28      200     11556   1042    288     11556   29165
 all   433     200     47387   6349    2878    14036   2748918
 breaks 4

Min Ovl: 10bp Min Match (nucmer -l): 16 Min Cluster (nucmer -c): 32

 All:
       #elem   min     max     mean    median  n50     sum
 ctg   544     45      53996   5044    1215    16039   2744166
 singl 385     45      11556   133     53      5057    51297
 all   929     45      53996   3009    112     15857   2795463
 breaks 8
 
 200 bp+:
       #elem   min     max     mean    median  n50     sum
 ctg   391     201     53996   6980    3156    16443   2729048
 singl 8       222     11556   3543    4086    11556   28343
 all   399     201     53996   6911    3156    16039   2757391
 breaks 6

Strep suis17

Data

2726374 36bp Solexa reads
2007491bp 41.30 %GC Reference
Coverage: 49X
Location:

 /fs/szdata/Solexa/Streptococcus_suis/
 /fs/szasmg2/Bacteria/dpuiu/Solexa/Streptococcus_suis/

Sequence alignments

soap

 2261784 of 2726374  (82.95%) aligned by "soap -v 5 -g 0 -s 10"

Mismatches:

 more Ss.soap | awk '{print $10}' | count.pl
 mism.   hits     hits(%)
 0       1235237  54.61
 1       548822   24.27
 2       238795   10.56
 3       120622   5.33
 4       71451    3.16
 5       46857    2.07
 Total   2261784

Sequence assembly

Velvet

Contig stats:

          #elem   min     max     mean    median  n50     sum
 all      917     45      24419   2154    278     6661    1975147
 200bp+   479     202     24419   4051    2940    6966    1940363
 breaks 2  (1 contig aligned in 2 overlapping pieces; tandem repeat in the 217341-217404 region of Ss)

Different coverage assemblies:

   cvg     #ctgs   min     max     mean    median  n50     sum     %singletons     breaks
   24.5    4630    45      5267    436     284     709     2016871 23.98           6
   29.4    2797    45      7739    716     426     1333    2001964 23.46           0
   34.3    1797    45      9347    1107    649     2230    1989102 23.34           0
   39.2    1325    45      24419   1496    663     3724    1982292 23.32           2
   44.1    1016    45      23272   1946    516     5290    1976890 23.30           2
   49.0    917     45      24419   2154    278     6661    1975147 23.30           2

Edena

Contig stats:

               #elem   min     max     mean    median  n50     sum
 all           1668    100     8829    1135    743     1968    1893815
 200bp+        1390    200     8829    1333    959     1985    1853354
 breaks 2  (1 contig aligned in 2 overlapping pieces; tandem repeat in the 217341-217404 region of Ss)

Singletons: 1582493 (58%)

Euler-SR

Command:

 $ AssembleIllumina.pl Ss.seq working_dir -vertexSize 21

Contig stats:

         #elem   min     max     mean    median  n50     sum
 all     2658    22      19768   757     42      4867    2013259
 200bp+  659     200     19768   2932    1973    5117    1932406

Assembly merge: Difference between revisions

Revision as of 18:25, 15 April 2009

Papers

Assemblers

Denovo

CA

Sourceforge

Job submission

Parameters

Log file & runtime (wgs-5.2)

Gatekeeper

Dumping

Updating

Deleting

Terminator

Output processing

Running subassemblies

Test datasets

Atlas

Euler-SR

newbler (454)

minimus

minimus1

minimus2

SSAKE

VCAKE

Velvet

Edena

Others

Comparative

AMOScmp

AMOScmp-shortReads-alignmentTrimmed

AMOScmp-shortReads

AMOScmp2

maq

Other Scripts

delta-filter.pl

fasta2ovl.pl

megamerge (EMBOSS)

nucmer2ovl.pl

nucmer.pl

nucmerAnnotate.pl

revDelta.pl

revseq.pl

scaff2delta.pl

Subassembly

Reconciliator

Cases

No reference sequence

One data set, multiple denovo assemblers

Multipls data sets, one(multiple) denovo assemblers

One reference sequence

Multiple reference sequences

Examples

Pseudomonas_syringae

Simulated 32bp exact match reads

Solexa reads

Pseudomonas aeruginosa b1 (PAb1)

Solexa reads

Staphylococcus aureus MW2

Data

Sequence alignments

soap

Sequence assemblies

AMOScmp

velvet

OVL: 23 bp

OVL: 21 bp

edena

euler-sr

Contig assemblies

AMOScmp on velvet contigs

minimus on velvet contigs

minimus1 on velvet contigs

minimus1 on edena contigs

AMOScmp on velvet & edena contigs

minimus2 on velvet & edena contigs

Staphylococcus aureus MW2 "mutated" 10% of sequence are random SNPs

Sequence assemblies

AMOScmp