Pseudomonas aeruginosa: Difference between revisions

From Cbcb
Jump to navigation Jump to search
Line 737: Line 737:
   all    561    201    253489  12123  31042  6800731
   all    561    201    253489  12123  31042  6800731


min OVL=20 : use contig:singleton & singleton:singleton alignments & adjacent contig ovelaps
min OVL=20 : use contig:singleton & singleton:singleton alignments & adjacent contig ovelaps (best)


   All:
   All:
Line 751: Line 751:
   all    538    200    253489  12662  33969  6812086
   all    538    200    253489  12662  33969  6812086


Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/AMOScmp-velvet2
Locations:  
  /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/AMOScmp-velvet2
  /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/AMOScmp-velvet2/AMOScmp-avgReads.unique/
  /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/AMOScmp-velvet2/AMOScmp-avgReads.unique/minimus3.20 (best)


==== PA14 Sanger maq on all reads ====
==== PA14 Sanger maq on all reads ====

Revision as of 18:15, 6 May 2008

Strain PA-b1

Data

NCBI

Strains:

 PA14        CP000438       6,537,648 bp, 66.29 %GC : most similar to PAb1
 PAO1        AE004091       6,264,404 bp, 66.56 %GC : rearrangement vs PA14
 PA7         CP000744       6,588,339 bp            : no rearrangement vs PA14
 PACS2       AAQW01000001.1 6,492,423 bp, 66.33% GC : rearrangement vs PA14
 C3719       .              6,222,097 bp, 66.30% GC : no rearrangement vs PA14; incomplete
 2192        .              6,905,121 bp, 65.99% GC : rearrangement vs PA14; incomplete

Incomplete strains details(Broad):

 Contig length stats:
 desc  #contigs   min     max     mean    stdev   sum
 C3719 124        2079    242903  49572   53770   6146998
 2192  82         2087    398738  83246   88681   682625
PAb1 project

5.7Kbp repeat: contains 16S rRNA gene; identical in all the strains

 Coordinates:
 
 PA14:
 732540   738302  +
 4951956  4957714 -  
 5535975  5541728 -
 6312434  6318199 -
 PAO1:
 721550   727325  +
 4788516  4794273 -
 5264042  5269801 -
 6039515  6045289 -
 PA7:
 806558   812299  +
 4982874  4988600 -
 5566182  5571924 -
 6353418  6359148 -

CBCB

File location:

 /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/
 

Files:

 s_1_sequence.txt  # contains seq & qual; seq names: HWI% ; all reads are 33 bp
 s_1_0001_prb.txt  # 27679 lines of 4*36 values in the -40..40 range  
 s_1_tag.txt       # unique seq count in s_1_sequence.txt
 
 s_7_sequence.txt  # contains seq & qual; seq names HWU%  ; all reads are 33 bp
 s_7_0300_prb.txt
 s_7_tag.txt    
 
 wc -l s_1_sequence.txt s_7_sequence.txt 
  4105993 s_1_sequence.txt
  4521907 s_7_sequence.txt
  8627900 total
 #all reads that contain at least 1 ambiguity
 grep -c N s_*seq
  s_1_sequence.seq:37933
  s_7_sequence.seq:69669
 #all reads that contain at least 2 ambiguities
 cat s_1_sequence.seq | perl -ane 'print $_ if(/N.+N/);' | wc -l  : 18771
 cat s_7_sequence.seq  | perl -ane 'print $_ if(/N.+N/);' | wc -l : 59589
 ...
 s_1 reads have significantly fewer N's than s_7 reads !!!
 s_1 reads align better to ref than s_2 reads !!!
  
 Conflicts with avg qualities: 28.26 for s_1, 31.05 for s_2 !!!
 
 Base qualities:
 .       #elem           #elem0  #elem<0 min     mean                    max     sum
 s_1     135497769       1863585 3231899 -5      28.2624918053079        40      3829504586
 s_7     149222931       1357384 3113806 -5      31.0508663175903        40      4633501282

Read trimming

No trimming

If no trimming is done and the default AMOScmp pipeline is used (ALIGNWIGGLE=15), the contigs end up containing reference duplications Make-consensus will also take a very long time to run

Duplication causes:

  1. Only the reference nucmer alignment coordinates are stored in the bank and used by casm-layout and make-consensus; the read alignment coordinates are not stored, just the read clear ranges (the ones given in the .afg file); read positions in the assembly layout are approximate
  2. reads can be shifted from their original alignment position by make-consensus by up to 2^3*15=120 bp (constants ALIGN_WIGGLE=15; MAX_ALIGN_ATTEMPTS=3)
  3. The Pseudomonas_aeruginosa reference contains multiple 2 copy 5-10 bp kmers, adjacent within a few dozen bp of each other; reads starting with these kmers align in 2 ways to the reference; if the reads contain errors or SNP's, the 2nd (shorter) alignment to the greedyly built consensus is chosen over the correct one and a duplication is introduced

Quality trimming

Art's script: qual-trim.awk

Min_Qual=15

 .               #elem   #elem0  #elem<0 min     median  max     sum             mean    stdev   n50
 qualTrim        8627900 1266    0       0       23      33      193999902       22.49   7.43    26

!!! Avg clr drops from 33 to 22 bp !!! Duplications still exist though at a lower rate

Alignment based trimming

PA14 is the most similar strain: more PAb1 reads aligned to PA14 than to PAO1 or PA7

 grep -c "^>" PA14-Pa.l-17_c-17.delta PAO1-Pa.l-17_c-17.delta PA7-Pa.l-17_c-17.delta
 PA14-Pa.l-17_c-17.delta:7261629
 PAO1-Pa.l-17_c-17.delta:7036243
 PA7-Pa.l-17_c-17.delta:5393144

Assemblies

Comparative assemblies

PA14 ref assembly

 Location: 2008_0109_AMOSCmp-PA14-relaxed-17-nucmer-redo2
 Command:   AMOScmp Pa -D MINCLUSTER=17 -D MINMATCH=17 -D MINOVL=5 -D MAJORITY=50 -D ALIGNWIGGLE=2
  1. a align all reads to the reference using nucmer. I initially used minmatch=17, mincluster=17 (nucmer -c 17 -l 17)
   8.62M reads, 4.10M HWI 5.32M HWU
   8.30M (...) total reads align to the reference (nucmer -c 14 -l 14)
   7.59M (...) total reads align to the reference (nucmer -c 16 -l 16)
   7.25M (...) total reads align to the reference (nucmer -c 16 -l 16; delta-filter -l 20)
   7.26M (84.16%, 83.50% HWI, 84.76% HWU) total reads align to the reference (nucmer -c 17 -l 17)
   6.54M (75.82%, 74.49% HWI, 77.02% HWU) total reads align to the reference (nucmer -c 20 -l 20)
   5.17M (...) total reads align to the reference (nucmer -c 28 -l 14)
   5.32M (73.27%) reads aligned on their full length (as opposed to 0.94M 33bp quality untrimmed reads)  (nucmer -c 33 -l 17)
   3.69M (42.79%, 38.24% HWI, 46.92% HWU) align exactly (33 bp, 100%id)
  1. b align all unaligned reads to the reference using nucmer. minmatch=14, mincluster=14 (-c 14 -l 14)
  2. Combine 1a & 1b
  3. trim reads according to their nucmer alignment coordinates; don't trim the ones adjacent to zero cvg regions
  4. assemble them using the AMOScmp pipeline(ALIGN_WIGGLE=2 instead of 15)
Contigs stats:
 desc    #elem   min     max     mean            stdev           sum
 all     2053    17      170485  3011.84         11917.53        6183320
 200     428     203     170485  14262.09        22852.74        6104175
 10K     157     10240   170485  35468.89        26531.33        5568616
 
 Singletons: 1127399
 Data accuracy
 Get all assembled bases with coverage>=20
 count_bases=5926977
 sum_bases=235619001
 sum_errors=2455670
 sum_errors/sum_bases=2455670/235619001=0.01042=1.042% error

SNP's in the 157 >10K contigs: T<->C & A<-G> the most common

  PA14 PAb1    count
   T   C       6153
   C   T       6100
   G   A       6075
   A   G       6021
   G   C       1357
   C   G       1309
   G   T       859
   C   A       846
   T   G       840
   A   C       836
   A   T       338
   T   A       308
   G   .       241
   C   .       196
   .   C       182
   .   G       152
   A   .       101
   T   .       88
   .   T       74
   .   A       70
  Total        32146

PA14 ref assembly (2)

Parameters:

 MINOVL=10    # previously 5, allowed for missasemblies
 MAJORITY=70  # previously 50 to reduce the number of negative gaps

Contig stats:

 desc    #elem   min     max     mean    stdev   sum
 all     1843    20      127984  3350.38 10776.95        6174765
 200bp+  542     200     127984  11245.21        17520.92        6094904
 
 1107200 singletons

PAO1 ref assembly

Same method except that 1.b was not used

Location: 2008_0124_AMOSCmp-PAO1-relaxed-17-nucmer-redo2

 Contigs stats:
 
 desc    #elem   min     max     mean            stdev   sum
 all     2797    17      75626   2161.19         5812.2  6044851
 200     865     200     75626   6893.96         8766.63 5963278
 10K     204     10016   75626   19016.22        10368   3879309
 
 Singletons: 1592525

PA14 & PAO1 merge

Use minimus to merge all contigs in the PA14 & PAO1 reference assemblies

Filter contigs that contain at least 2 adjacent PA14 merged by PAO1

 desc       #elem   min     max     mean            stdev           sum
 all        1850    17      236472  3400.31         16863.79        6290586
 200        306     204     236472  20318.3         37147.91        6217401
 10K        113     10520   236472  52647.45        45602.76        5949162
 
 Singletons: 1066226


De novo assembly of singletons using SSAKE Input: 1066226 singletons

 Ssake version 3.0 run with default parameters:
 -m  Minimum number of reads needed to call a base during overhang consensus build up (default -m 16)
 -o  Minimum number of reads needed to call a base during an extension (default -o 2)
 -r  Minimum base ratio used to accept a overhang consensus base (default -r 0.6) 
 Contigs stats:
 
 desc    #elem   min     max     mean    stdev   sum
 all     19795   34      2866    67.78   102.73  1341825
 200     879     200     2866    416.72  308.92  366304
 200 bp contigs stats
 
 desc.           #elem   min     mean    max     sum
 contig_len      879     200     416.72  2866    366304
 contig_reads    879     52      414.21  3157    364091
 contig_cvg      879     7.91    31.58   146.52  27764.51 
 
 Singletons:  702007
 Find contigs overlapping 12 bp or more and merge them using EMBOSS merger program:
 
 desc            #elem   min     max     mean    stdev   sum
 original        879     200     2866    416.72  308.92  366304
 new             670     200     4826    539.32  579.26  361350

Gene assemblies

Dan Sommer reduced number of contigs >=200bp from 306 to 120 /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Annotation/PAb1.fasta

 Contig stats:
 desc    #elem   min     max     mean    stdev   sum
 contigs 120     212     512638  51438.95        81999.91        6172675
 Average Gap: 105 nt bases
 Median Gap: 14 nt bases
 Largest Gap: 1095 nt bases

927 singletons assembled

Pa-b1 AMOScmp reference assembly

 Contig stats:
  
 desc                           #elem   min     max     mean      stdev           sum
 input(reference|final)         770     200     512638  8484.21   37210.58        6532844
 output(AMOScmp contigs)        936     21      260827  6988.98   22288.49        6541689
 
 Singletons: 1257963
 Snp's:      1941

Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0215_AMOScmp-PAb1

Use ALL strains as reference

Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0228_AMOSCmp-ALL

Steps:

 * Identify related strains (completed & draft assemblies): PA14, PAO1, PA7, PACS2, PA2192, PAC3719
 * Download & merge related strains into a multi-FASTA file
 * Align PAb1 reads to multi-FASTA file using nucmer (Ex: -l 17 -c 17)
 * Count the number of PAb1 reads aligned to each strain
    strain   #reads_aligned
    ======================
    PA14    7261629
    PA2192  7193960
    PACS2   7089455
    PAO1    7036243
    PAC3719 6928744
    PA7     5393144
    ======================
 * Sort the strains in descending order according to the number of reads aligned to each one
 * The strain with most hits is PA14 => closest relative
 * Assuming the order is PA14, PAO1, PA7, PACS2, PA2192, PAC3719 ...
 * Align strains to one other; identify unique regions in each genome (regions which are not present in the previously considered ones)
 
   Unique regions length summary: 
   Table 1:
   reference                             #regions  min     max     mean    stdev   sum
   =======================================================================================
   PA14                                  1       6537648 6537648 6537648 0       6537648
 
   PAO1-PA14                             115     1       19263   2038.58 3843.93 234437
   PA7-PA14                              511     1       46020   1876.96 4872.51 959129
   PACS2-PA14                            147     1       90591   3054.19 8766.3  448967
   PA2192-PA14                           231     1       92553   3746.3  11475   865396
   PAC3719-PA14                          272     1       14058   1164.78 2089.85 316822
   =======================================================================================
   ALL                                   1277                                     9362399
   Table 2:
   reference                             #regions  min     max     mean    stdev   sum
   =======================================================================================
   PA14                                    1       6537648 6537648 6537648 0       6537648
 
   PAO1-PA14                               115     1       19263   2038.58 3843.93 234437
   PA7-PAO1-PA14                           487     1       46020   1830.82 4910.86 891614
   PACS2-PA7-PAO1-PA14                     71      2       79031   4112.94 10618.2 292019
   PA2192-PACS2-PA7-PAO1-PA14              181     1       92066   3439.9  9500.07 622622 #some N's maybe?
   PAC3719-PA2192-PACS2-PA7-PAO1-PA14      173     1       11425   998.16  1888.94 172682 #some N's maybe?
   =======================================================================================
   ALL                                     1028    2       6537648 8513.66 203929  8752049
   Table 3:
   reference                             #regions  min     max     mean    stdev   sum
   =======================================================================================
   PA14                                    1       6537648 6537648 6537648 0       6537648
   
   PA14-PAO1                               123     1       43690   4009.28 7903.16 493142
   PA14-PA7                                516     1       31464   1768.16 3534.88 912375
   PA14-PACS2                              146     2       43690   3259.15 6129.12 475836
   PA14-PA2192                             222     1       34854   2189.76 4338.6  486127
   PA14-PAC3719                            277     1       43690   2206.38 5214.55 611170
   =======================================================================================
 * Extract unique regions(Table 2) and add them to the most similar genome
 * Assemble using all these regions as reference => 10,201 contigs; 1,052,178 singletons
   reference                        #ctgs   min     max     mean    stdev           sum
   =======================================================================================
   PA14                             2056    17      57476   2992.27 7078.25         6152122 
   PAO1                             877     17      8159    95.16   482.74          83456
   PA7                              4723    17      6822    41.7    128.87          196996
   PACS2                            914     17      3346    66.96   148.47          61207
   PA2192                           1528    17      49826   134.88  1734.19         206109
   PAC3719                          103     17      693     42.11   85.89           4338
   =======================================================================================
   ALL                              10201   17      57476   657.21  3457.1          6704228
 
 * Compare ALL assembly contigs to PA14 ref assembly :
   reference                        #ctgs   min     max     mean    stdev           sum        #singletons
   ========================================================================================================
   ALL                              10201   17      57476   657.21  3457.1          6704228    1052178 
   ALL(200bp+)                      1020    200     57476   6235.17 9220.76         6359881
   ALL(PA14)                        2056    17      57476   2992.27 7078.25         6152122 
   PA14                             1984    17      82196   3103.21 8274.53         6156788    1375004
   ========================================================================================================
   ALL-PA14                         8217     ......                                 547440                 # 0.5M extra bp
   ALL(PA14)-PA14                   72       ......                                 -4666                  # 4666 bp which were originally mapped to PA14 are now mapped to other strains
  * Compare ALL assembly contigs to ssake 200bp+ contigs :
 
   ssake 200bp+ contigs:
     strain    #ctgs   min     max     mean    stdev   sum
     PAb1      879     200     2866    416.72  308.92  366304
   ALL vs ssake alignments:
    desc    #elem   min     max     mean    stdev   sum
    ALL     265     23      49826   912.9   4187.17 241919
    ssake   306     200     2358    379.85  257.04  116235
  ALL vs ssake CONTAINED alignments:
    desc    #elem   min     max     mean    stdev   sum
    ALL     164     23      49826   1264.71 5268.16 207413
    ssake   260     200     2358    368.09  259.02  95704
  ~300 out of 879 ssake contigs were assembled using this method.
  The ALL assembly contigs ssake contigs align to are longer than the ssake contigs!
  * Compare ALL assembly contigs to final de novo contigs :
       id                          len         #de novo ctgs it CONTAINS
  ======================================================================
    1  2361_PA2192_2249428_2341494 49826       52
    2  2360_PA2192_2249428_2341494 42258       49
    3  2364_PA2192_2346558_2361947 15388       17
    4  10190_PAO1_747565_761293    8159        7
    5  2429_PA2192_2405271_2412668 6103        7
    6  3259_PA2192_6656583_6693732 2594        6
    7  2367_PA2192_2374235_2379148 4914        5
    8  6917_PA7_4563267_4570077    6822        5
    9  2958_PA2192_4496915_4500004 3090        4
   10  2375_PA2192_2383706_2386485 2789        3
   11  2366_PA2192_2366515_2369415 2901        2
   12  6919_PA7_4579629_4582081    1085        2
   13  5364_PA7_3138605_3164438    1931        2
   14  8663_PACS2_3494057_3573088  449         2
   15  9322_PACS2_982458_983976    1519        2
   16  2368_PA2192_2379432_2380661 1230        2
   17  2716_PA2192_3268239_3292117 567         2
   18  9587_PAO1_2439014_2458239   3513        1
   ...
   49  8384_PAC3719_5517906_5518496 591        1
  ======================================================================
      Total                                    201

De-novo assemblies

SSAKE

version 3.0 with default parameters

 Contigs stats:
 
 desc    #elem   min     max     mean    stdev   sum
 length  185030  34      5490    77.21   141.23  14287079
 reads   185030  2       13352   29.52   127.65  5463405 
 
 Singletons: 3,164,495  (singletons file + ambiguous reads)

Run times:

 18459.472u 31.849s 5:08:59.82 99.7%	0+0k 0+0io 0pf+0w

Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0214_ssake/

Latest version is 3.2: run on the same dataset generated fewer contigs and more singletons; also took longer to run

Velvet

Version 0.5.05 Ovl len=23

Commands:

 host: sycamore (Opteron)
 velveth . 23 Pa.seq 
 velvetg . -read_trkg yes

Run times:

 velveth : 88.614u  22.865s 2:16.35  81.7%  0+0k 0+0io 0pf+0w
 velvetg : 439.939u 19.258s 8:19.77  91.8%  0+0k 0+0io 0pf+0w
 total   : 528.553u 42.123s 10:36.12

Output files (used)

 contigs.fa : contig FASTA file

Contigs stats:

 desc           #elem   min     max     mean    stdev   sum
 contig         10684   45      16239   640.34  825.24  6841458
 contig(200bp+) 7382    200     16239   877.05  896.35  6474426
 1273164 singletons; 

Contigs aligned to the 4 copy 5.7Kbp PA14 rRNA repeat:

     [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS]
 ===============================================================================================================================
       23      409  |        1      388  |      387      388  |    97.94  |     5763      388  |     6.72   100.00  | rRNA        856     [CONTAINS]
      388     2132  |        1     1745  |     1745     1745  |    99.94  |     5763     1745  |    30.28   100.00  | rRNA        642     [CONTAINS]
     2111     2279  |      169        1  |      169      169  |   100.00  |     5763      169  |     2.93   100.00  | rRNA        3433    [CONTAINS]
     2258     5763  |        1     3506  |     3506     3506  |    99.89  |     5763     3506  |    60.84   100.00  | rRNA        260     [CONTAINS]

Are there some velvet contigs contained in each other? Many contigs seem to overlap by ~ 20bp

Some contigs share end reads labeled as D(uplicates)

 listReadPlacedStatus Pa.bnk | awk '{print $3}' | count.pl
 #       total
 P       7322684
 S       1273164
 D       32052     # 33 of them belong to 3 contigs
 Total   8627900

About ~4607 contig pairs could be merged:

 grep D Pa.status.detailed | awk '{print $5,$6}' | sort -u | wc -l
 4607

Example:

 Contigs 1915 & 3551 share 15 Duplicate reads
 cat Pa.status | grep 1915 | grep 3551$ | wc -l
 15 

They align to adjacent pos in PA14 :
 3743092 3745646 |  2555  1  | 2555  2555  |    99.49  |  6537648     2555  |     0.04   100.00  | 1      1915    [CONTAINS]
 3745625 3746382 |   758  1  |  758   758  |    99.60  |  6537648      758  |     0.01   100.00  | 1      3551    [CONTAINS]

411870 out of 1273164 singletons (~33%) align to PA14 (nucmer -maxmatch -c 20) 304162 out of 1273164 singletons (~25%) are also singletons in the AMOScmp-PA14 assembly; probably the common ones are low quality sequences that just don't assemble

9859 of the 10684 contigs (~92%) align to PA14 (nucmer -maxmatch -c 40):

 10684 Pa.fasta 
 9859   Pa.qry_hits       # 825 not aligned
 9682   Pa.CBE.qry_hits   # 177 partially aligned

Partially aligned

 #elem   min     max     mean    stdev   sum
 177     78      15342   2036.52 2145.37 360465
 
 All these contigs have AMOScmp assembly singletons aligned to them!!! 
 These velvet contigs seem correct.
 Example: There is a large insertion in PA14 at position 6700
 
 PA-b1 reads vs PA14 ref  
  find-query-breaks -B Pa.filter-q.delta
  => 3 reads end at base 6700
   
 velvet ctgs vs PA14 ref:
 ...
   6177     6700  |    1   524  |  524    524  | 99.43  |  6537648  875  | 0.01  59.89  | 1  900     # 431 out of 1054 seqs in this contigs are singletons in AMOScmp assembly
   6694     8045  |  743   2094 |  1352   1352 | 99.63  |  6537648  2097 | 0.02  64.47  | 1  1970    # 895 out of 2354 seqs in this contigs are singletons in AMOScmp assembly; unaligned regions aligns at low %id to a Pseudomonas fluorescens hypothetical protein 
 ...
  66764    67073  |    1   310  |  310    310  | 100.00 |  6537648  1282 | 0.00  24.18  | 1 320
  67181    67322  | 1136  1277  |  142    142  |  98.59 |  6537648  1282 | 0.00  11.08  | 1 320
 ...
 116429   117249  |    1   821  |  821    821  |  99.03 |  6537648  1457 | 0.01  56.35  | 1 1955
 120490   120573  |   84     1  |   84     84  | 100.00 |  6537648  1693 | 0.00  4.96   | 1 1827
...
 234641   235980  | 3294  1955  | 1340   1340  |  99.40 |  6537648  3294 | 0.02  40.68  | 1 139
 235981   237442  | 1466     5  | 1462   1462  |  99.32 |  6537648  3294 | 0.02  44.38  | 1 139
...
 286003   287915  | 1216  3128  | 1913   1913  |  99.22 |  6537648  3128 | 0.03  61.16  | 1 3068
...

Not aligned:

 #elem   min     max     mean    stdev   sum
 825     45      16239   618.3   1209.37 510101        # the longest velvet ctg does not align

Not CONTAINED:

 #elem   min     max     mean    stdev   sum
 1002    45      16239   868.82  1518.56 870566

Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24

Version 0.5.07 : does not "seem" to be an improvement over Version 0.5.05 Ovl len=23

 desc              #elem   min     max     mean    stdev   sum
 contig            11230   45      16239   611.64  814.09  6868784
 contig(200bp+)    7353    200     16239   878.61  897.06  6460467

Edena

Version 2.1.1 , default parameters

 Contigs stats:
 
 desc    #elem   min     max     mean    stdev   sum
 contig  11180   100     11300   552.36  610.52  6175460
 3955865 singletons
 Contig don't seem to overlap much 

Run times:

 ovl step:      1655.561u 7.560s  27:54.62 99.3%   0+0k 0+0io 0pf+0w
 assembly step: 56.141u   3.791s   1:03.82 93.9%   0+0k 0+0io 0pf+0w
 total:         1711.702u 11.351s 28:58.44     

Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0312_edena/

SSAKE/Edena/Velvet summary

host: sycamore.umiacs.umd.edu (Linux 64)

 program  version   run-time
 SSAKE    3.0       8459.472u 31.849s 5:08:59.82  
 Velvet   0.5.05    528.553u  42.123s   10:36.12
 Edena    2.11      1711.702u 11.351s   28:58.44     

Input: 8627900 33 bp Solexa reads(2 lanes)

Output: contigs+singletons

Contig statistics:

 program #seqs   min  max    mean  med  n50   sum
 SSAKE   185030  34   5490   77    44   87    14287079
 Velvet  10684   45   16239  640   363  1184  6841458
 Edena   11180   100  11300  552   356  837   6175460

200bp+ contig statistics:

 program #seqs   min  max    mean  med  n50   sum
 SSAKE   12532   200  5490   486   385  549   6090567
 Edena   8316    200  11300  692   488  902   5759209
 Velvet  7382    200  16239  877   586  1252  6474426

Singleton statistics:

 program #seqs
 SSAKE   3164495
 Velvet  1273164 
 Edena   3955865

Other assemblies

AMOScmp on the velvet contigs (1)

Input sequence is not trimmed

AMOScmp -D MINCLUSTER=40 -D MINOVL=5 -D CONSERR=0.12

 desc                  #elem   min     max     mean    stdev   sum
 reads(velvet ctgs)    10684   45      16239   640     .       6841458 
 ctg                   2559    45      26677   2223.04 2731.82 5688784
 singl                 1016    45      16239   822.7   1485.11 835866

Problems:

 * 33 singletons are CONTAINED in the reference. Why we they not assembled?
 show-coords -H -I 88 Pa.singletons.delta | ~/bin/nucmerAnnotate.pl -ignore 20 |  grep -c CONT # 33
 show-coords -H -I 98 Pa.singletons.delta | ~/bin/nucmerAnnotate.pl -ignore 20 |  grep -c CONT # 21

Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24/AMOScmp

AMOScmp on the velvet contigs (2)

Input sequence is trimmed

AMOScmp-shortReads-alignmentTrimmed -D MINCLUSTER=40 -D MINMATCH=20 -D CONSERR=0.12 Pa

 desc                  #elem   min     max     mean    stdev   sum
 reads(velvet ctgs)    10684   45      16239   640     .       6841458
 ctg                   2657    45      27957   2192.79 2697.71 5826256
 singl                 1013    45      16239   792.2   1431.28 802502

Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24/AMOScmp.redo


AMOScmp/minimus on the AMOScmp & velvet contigs (3)

Input:

 AMOScmp PA14 ref assembly (2) contigs
 velvet contigs
 !!! No singletons

AMOScmp steps:

  1. Align AMOScmp & velvet contigs (nucmer -c 40) to each other
  2. Identify velvet contigs CONTAINED in AMOScmp contigs(max trimming of 20)
  3. Remove CONTAINED contigs from the velvet set
  4. Create alignment file based to the AMOS.scaff file (if nucmer is repeat contigs are collapsed/no placed exactly)
  5. Align velvet contigs to reference using nucmer ; filter only the CONTAINED ones
  6. Merge AMOS & velvet alignments
  7. Run "casm-layout -S -r ..."
  8. Run "make-consensus ..."
  • All AMOS input contigs should be part of the new assembly
  • If the velvet CONTAINED alignments are not filtered, some end conflicts might cause some of the AMOS contigs end up as singletons

Input:

 All:
 desc            #elem   min     max     mean    stdev   sum
 AMOS            1843    20      127984  3350.38 10776   6174765
 velvet-uniq     1189    45      16239   997.85  1485.34 1186452  # 462 aligned to PA14 ref (nucmer -c 40); 284 CONTAIEND in the ref; 727 not aligned to the ref
 all             3032    20      127984  2427.84 8530.33 7361217
 200bp+:
 desc            #elem   min     max     mean    stdev   sum
 AMOS            542     200     127984  11245   17520   6094904
 velvet-uniq     848     200     16239   1354.36 1627.86 1148499  # 422 aligned to PA14 ref (nucmer -c 40); 249 CONTAIEND in the ref; 426 not aligned to the ref
 all             1390    200     127984  5211.08 12019.6 7243403

Output:

 All:
 desc            #elem   min     max     mean    stdev   sum
 ctg             1327    20      221554  4644.9  19079.  6163795    # two of them(530,531) contain only velvet seqs
                                                                    # 1210 contain only AMOS seqs ; 11 contain 2 AMOS seqs (negative gap got merged)
                                                                    # 115 contain both AMOS & velvet seqs        
 singl           817     45      16274   1053.94 1719.12 861076     # all but one(807) are velvet : see .conflict file
 all             2144    20      221554  3276.52 15146.  7024871
 200bp+:
 desc            #elem   min     max     mean    stdev   sum
 ctg             282     203     221554  21651   36731   6105733
 singl           571     200     16274   1461.55 1917.62 834546
 all             853     200     221554  8136.31 23189   6940279

minimus steps:

  1. Run minimus3 pipeline on the above contigs(1327) and singletons(817) ; Only uses contig:singletion overlaps (not all:all)
 $ ~dpuiu/bin/minimus3 -D BREAKID=1327 -D OVERLAP=40 Pa 

min OVL=40 : use contig:singleton alignments

Output:

 All:
 desc            #elem   min     max     mean    stdev   sum
 ctg             180     135     253489  22817.2 45339   4107096   # 218 of the PA14 unaligned velvet contigs assembled
 singl           1377    20      153036  1996.79 10534   2749587
 all             1557    20      253489  4403.77 19467   6856683

 200bp+:
 desc            #elem   min     max     mean    stdev   sum
 ctg             174     210     253489  23598.3 45919   4106116
 singl           488     200     153036  5519.51 17154   2693524
 all             662     200     253489  10271.3 28846   6799640

min OVL=20 : use contig:singleton alignments (not significantlty better than ,on OVL=40)

Output:

 All:
 desc    #elem   min     max     mean    stdev   sum
 ctg     279     63      253489  15557   37925   4340305 # 321 of the PA14 unaligned velvet contigs assembled
 singl   972     20      153036  2576    12252   2504000
 all     1251    20      253489  5471    21580   6844305
 200bp+:
 desc    #elem   min     max     mean    stdev   sum
 ctg     263     214     253489  16495   38868   4338061
 singl   393     200     153036  6261    18680   2460755
 all     656     200     253489  10364   28954   6798816


min OVL=20 : use contig:singleton & singleton:singleton alignments

 All:
 desc    #elem   min     max     mean    stdev   sum
 ctg     277     103     253489  15983   37991   4427292  # 556 of the PA14 unaligned velvet contigs assembled
 singl   808     20      153036  2986    13439   2412439  # 260 of them are velvet original contigs
 all     1085    20      253489  6304    23111   6839731
 200bp+:
 desc    #elem   min     max     mean    stdev   sum
 ctg     263     222     253489  16826   38812   4425275
 singl   298     201     153036  7971    21242   2375456  # 165 of them are velvet original contigs
 all     561     201     253489  12123   31042   6800731

min OVL=20 : use contig:singleton & singleton:singleton alignments & adjacent contig ovelaps (best)

 All:
 desc    #elem   min     max     mean    stdev   sum
 ctg     317     47      253489  16450   40806   5214579
 singl   703     20      112908  2324    11332   1633528
 all     1020    20      253489  6714    25448   6848107
 200bp+:
 desc    #elem   min     max     mean    stdev   sum
 ctg     291     200     253489  17909   42289   5211547
 singl   247     201     112908  6480    18431   1600539
 all     538     200     253489  12662   33969   6812086

Locations:

 /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/AMOScmp-velvet2
 /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/AMOScmp-velvet2/AMOScmp-avgReads.unique/ 
 /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/AMOScmp-velvet2/AMOScmp-avgReads.unique/minimus3.20 (best)

PA14 Sanger maq on all reads

 Contig stats:
 min_len    #contigs   min     max     mean     stdev           sum
 0          991        33      155551  6199.79  17445.05        6143996
 200        368        200     155551  16581.7  25475.92        6102067
 10K        149        10048   155551  38164.35 28513.51        5686489

1197385 singletons; SNP's in the contigs: T<->C & A<-G> the most common

Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0206_PA14_maq

Final assembly (v3)

Merged some of the 120 AMOScmp contigs from v2 using velvet contigs. Grouped overlapping contigs and used minimus to merge them.

 Contig length stats:
 
  desc            #ctgs   min     max     mean              stdev           sum
  final3          512     200     512638  13099.41          53526.8         6706902
  AMOScmp3        76      212     512638  82763.22          117199.28       6290005
  novo3           436     200     10493   956.18            1336.44         416897
 
  final3-10KB+    49      10493   512638  127086.12         125757.98       6227220 

Comment:

 ~ 53-74 (out of 436) the velvet denovo that contigs align to PA14.
 
 Length statistics:
                    #ctgs   min     max     mean    stdev   sum
   CONTAINED        53      204     7228    514.81  978.25  27285
   ALL              74      204     7228    622.85  926.13  46091 

Locations:

 /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/final3/
 /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24/nucmer/merge/

Annotation

5769 genes
/fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Annotation/merge/gene_list.ptt