Pseudomonas aeruginosa: Difference between revisions

From Cbcb
Jump to navigation Jump to search
Line 475: Line 475:


MINCLUSTER=40,MINOVL=10
MINCLUSTER=40,MINOVL=10
   desc         #elem  min    max    mean    stdev  sum
   desc                 #elem  min    max    mean    stdev  sum
   ctg           3054    45      24166  1845.66 2239.78 5636668
   ctg                   3054    45      24166  1845.66 2239.78 5636668
   singl         1030    45      16239  814.01  1476.66 838434
   singl                 1030    45      16239  814.01  1476.66 838434
   ctg+singl     4084    45      24166  1585.48 2121.65 6475102
   ctg+singl             4084    45      24166  1585.48 2121.65 6475102
  ctg+singl(200bp+)    3242    200    24166  1969.57 2225.87 6385364


Location:  /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24/AMOScmp
Location:  /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24/AMOScmp

Revision as of 15:32, 21 April 2008

Strain PA-b1

Data

NCBI

Complete strains:

 PA14        CP000438       6,537,648 bp, 66.29 %GC : most similar to PAb1
 PAO1        AE004091       6,264,404 bp, 66.56 %GC : rearrangement vs PA14
 PA7         CP000744       6,588,339 bp            : no rearrangement vs PA14
 PACS2       AAQW01000001.1 6,492,423 bp, 66.33% GC : rearrangement vs PA14

Incomplete strains(Broad):

 Contig length stats:
 desc  #contigs   min     max     mean    stdev   sum
 C3719 124        2079    242903  49572   53770   6146998  : no rearrangement vs PA14
 2192  82         2087    398738  83246   88681   6826253  : rearrangement vs PA14
 Scaffold length stats:
 desc   len     GC%
 C3719  6222097 66.30
 2192   6905121 65.99
PAb1 project

5.7Kbp repeat: contains 16S rRNA gene; identical in all the strains

 Coordinates:
 
 PA14:
 732540   738302  +
 4951956  4957714 -  
 5535975  5541728 -
 6312434  6318199 -
 PAO1:
 721550   727325  +
 4788516  4794273 -
 5264042  5269801 -
 6039515  6045289 -
 PA7:
 806558   812299  +
 4982874  4988600 -
 5566182  5571924 -
 6353418  6359148 -

CBCB

File location:

 /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/
 

Files:

 s_1_sequence.txt  # contains seq & qual; seq names: HWI% ; all reads are 33 bp
 s_1_0001_prb.txt  # 27679 lines of 4*36 values in the -40..40 range  
 s_1_tag.txt       # unique seq count in s_1_sequence.txt
 
 s_7_sequence.txt  # contains seq & qual; seq names HWU%  ; all reads are 33 bp
 s_7_0300_prb.txt
 s_7_tag.txt    
 
 wc -l s_1_sequence.txt s_7_sequence.txt 
  4105993 s_1_sequence.txt
  4521907 s_7_sequence.txt
  8627900 total
 #all reads that contain at least 1 ambiguity
 grep -c N s_*seq
  s_1_sequence.seq:37933
  s_7_sequence.seq:69669
 #all reads that contain at least 2 ambiguities
 cat s_1_sequence.seq | perl -ane 'print $_ if(/N.+N/);' | wc -l  : 18771
 cat s_7_sequence.seq  | perl -ane 'print $_ if(/N.+N/);' | wc -l : 59589
 ...
 s_1 reads have significantly fewer N's than s_7 reads !!!
 s_1 reads align better to ref than s_2 reads !!!
  
 Conflicts with avg qualities: 28.26 for s_1, 31.05 for s_2 !!!
 
 Base qualities:
 .       #elem           #elem0  #elem<0 min     mean                    max     sum
 s_1     135497769       1863585 3231899 -5      28.2624918053079        40      3829504586
 s_7     149222931       1357384 3113806 -5      31.0508663175903        40      4633501282

Assemblies

Untrimmed reads

All contain ref duplications (this section should be deleted eventually) Duplication causes

 1. Only the reference nucmer alignment coordinates are stored in the bank and used by casm-layout and make-consensus; 
    the read alignment coordinates are not stored, just the read clear ranges (the ones given in the .afg file); read positions in the assembly layout are approximate
 2. reads can be shifted from their original alignment position by make-consensus by up to 2^3*15=120 bp (constants ALIGN_WIGGLE=15; MAX_ALIGN_ATTEMPTS=3)
 3. The Pseudomonas_aeruginosa reference contains multiple 2 copy 5-10 bp kmers, adjacent within a few dozen bp of each other;  reads starting with these kmers align in 2 ways to the reference; 
    if the reads contain errors or SNP's, the 2nd (shorter) alignment to the greedyly built consensus is chosen over the correct one and a duplication is introduced

Trimmed reads

Quality trimming

Art's script: qual-trim.awk

Min_Qual=15

 .               #elem   #elem0  #elem<0 min     median  max     sum             mean    stdev   n50
 qualTrim        8627900 1266    0       0       23      33      193999902       22.49   7.43    26

!!! Avg clr drops from 33 to 22 bp !!! Duplications still exist though at a lower rate

Alignment based trimming

PA14 is the most similar strain: more PAb1 reads aligned to PA14 than to PAO1 or PA7

 grep -c "^>" PA14-Pa.l-17_c-17.delta PAO1-Pa.l-17_c-17.delta PA7-Pa.l-17_c-17.delta
 PA14-Pa.l-17_c-17.delta:7261629
 PAO1-Pa.l-17_c-17.delta:7036243
 PA7-Pa.l-17_c-17.delta:5393144

PA14 ref assembly

 Location: 2008_0109_AMOSCmp-PA14-relaxed-17-nucmer-redo2
 Command:   AMOScmp Pa -D MINCLUSTER=17 -D MINMATCH=17 -D MINOVL=5 -D MAJORITY=50 -D ALIGNWIGGLE=2

1.a align all reads to the reference using nucmer. I initially used minmatch=17, mincluster=17 (-c 17 -l 17)

      8.62M reads, 4.10M HWI 5.32M HWU
 
      7.26M (84.16%, 83.50% HWI, 84.76% HWU) total reads align to the reference (-c 17 -l 17)
      6.54M (75.82%, 74.49% HWI, 77.02% HWU) total reads align to the reference (-c 20 -l 20)
      5.32M (73.27%) reads aligned on their full length (as opposed to 0.94M 33bp quality untrimmed reads)  (-c 33 -l 17)
      3.69M (42.79%, 38.24% HWI, 46.92% HWU) align exactly (33 bp, 100%id)

1.b align all unaligned reads to the reference using nucmer. minmatch=14, mincluster=14 (-c 14 -l 14) Combine 1.a & 1.b

2. trim reads according to their nucmer alignment coordinates; don't trim the ones adjacent to zero cvg regions

3. assemble them using the AMOScmp pipeline(ALIGN_WIGGLE=2 instead of 15)

Contigs stats:
 
 desc    #elem   min     max     mean            stdev           sum
 all     2053    17      170485  3011.84         11917.53        6183320
 200     428     203     170485  14262.09        22852.74        6104175
 10K     157     10240   170485  35468.89        26531.33        5568616
 
 Singletons: 1127399
 Data accuracy
 Get all assembled bases with coverage>=20
 count_bases=5926977
 sum_bases=235619001
 sum_errors=2455670
 sum_errors/sum_bases=2455670/235619001=0.01042=1.042% error

SNP's in the 157 >10K contigs: T<->C & A<-G> the most common

  PA14 PAb1    count
   T   C       6153
   C   T       6100
   G   A       6075
   A   G       6021
   G   C       1357
   C   G       1309
   G   T       859
   C   A       846
   T   G       840
   A   C       836
   A   T       338
   T   A       308
   G   .       241
   C   .       196
   .   C       182
   .   G       152
   A   .       101
   T   .       88
   .   T       74
   .   A       70
  Total        32146

PAO1 ref assembly

Same method except that 1.b was not used

Location: 2008_0124_AMOSCmp-PAO1-relaxed-17-nucmer-redo2

 Contigs stats:
 
 desc    #elem   min     max     mean            stdev   sum
 all     2797    17      75626   2161.19         5812.2  6044851
 200     865     200     75626   6893.96         8766.63 5963278
 10K     204     10016   75626   19016.22        10368   3879309
 
 Singletons: 1592525

PA14 & PAO1 merge

Use minimus to merge all contigs in the PA14 & PAO1 reference assemblies

Filter contigs that contain at least 2 adjacent PA14 merged by PAO1

 desc       #elem   min     max     mean            stdev           sum
 all        1850    17      236472  3400.31         16863.79        6290586
 200        306     204     236472  20318.3         37147.91        6217401
 10K        113     10520   236472  52647.45        45602.76        5949162
 
 Singletons: 1066226
Ssake

Input: 1066226 singletons

 Ssake version 3.0 run with default parameters:
 -m  Minimum number of reads needed to call a base during overhang consensus build up (default -m 16)
 -o  Minimum number of reads needed to call a base during an extension (default -o 2)
 -r  Minimum base ratio used to accept a overhang consensus base (default -r 0.6) 
 Contigs stats:
 
 desc    #elem   min     max     mean    stdev   sum
 all     19795   34      2866    67.78   102.73  1341825
 200     879     200     2866    416.72  308.92  366304
 200 bp contigs stats
 
 desc.           #elem   min     mean    max     sum
 contig_len      879     200     416.72  2866    366304
 contig_reads    879     52      414.21  3157    364091
 contig_cvg      879     7.91    31.58   146.52  27764.51 
 
 Singletons:  702007
 Find contigs overlapping 12 bp or more and merge them using EMBOSS merger program:
 
 desc            #elem   min     max     mean    stdev   sum
 original        879     200     2866    416.72  308.92  366304
 new             670     200     4826    539.32  579.26  361350

Gene assemblies

Dan Sommer reduced number of contigs >=200bp from 306 to 120 /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Annotation/PAb1.fasta

 Contig stats:
 desc    #elem   min     max     mean    stdev   sum
 contigs 120     212     512638  51438.95        81999.91        6172675
 Average Gap: 105 nt bases
 Median Gap: 14 nt bases
 Largest Gap: 1095 nt bases

927 singletons assembled

Pa-b1 AMOScmp reference assembly

 Contig stats:
  
 desc                           #elem   min     max     mean      stdev           sum
 input(reference|final)         770     200     512638  8484.21   37210.58        6532844
 output(AMOScmp contigs)        936     21      260827  6988.98   22288.49        6541689
 
 Singletons: 1257963
 Snp's:      1941

Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0215_AMOScmp-PAb1

Use ALL strains as reference

Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0228_AMOSCmp-ALL

Steps:

 * Identify related strains (completed & draft assemblies): PA14, PAO1, PA7, PACS2, PA2192, PAC3719
 * Download & merge related strains into a multi-FASTA file
 * Align PAb1 reads to multi-FASTA file using nucmer (Ex: -l 17 -c 17)
 * Count the number of PAb1 reads aligned to each strain
    strain   #reads_aligned
    ======================
    PA14    7261629
    PA2192  7193960
    PACS2   7089455
    PAO1    7036243
    PAC3719 6928744
    PA7     5393144
    ======================
 * Sort the strains in descending order according to the number of reads aligned to each one
 * The strain with most hits is PA14 => closest relative
 * Assuming the order is PA14, PAO1, PA7, PACS2, PA2192, PAC3719 ...
 * Align strains to one other; identify unique regions in each genome (regions which are not present in the previously considered ones)
 
   Unique regions length summary: 
   Table 1:
   reference                             #regions  min     max     mean    stdev   sum
   =======================================================================================
   PA14                                  1       6537648 6537648 6537648 0       6537648
 
   PAO1-PA14                             115     1       19263   2038.58 3843.93 234437
   PA7-PA14                              511     1       46020   1876.96 4872.51 959129
   PACS2-PA14                            147     1       90591   3054.19 8766.3  448967
   PA2192-PA14                           231     1       92553   3746.3  11475   865396
   PAC3719-PA14                          272     1       14058   1164.78 2089.85 316822
   =======================================================================================
   ALL                                   1277                                     9362399
   Table 2:
   reference                             #regions  min     max     mean    stdev   sum
   =======================================================================================
   PA14                                    1       6537648 6537648 6537648 0       6537648
 
   PAO1-PA14                               115     1       19263   2038.58 3843.93 234437
   PA7-PAO1-PA14                           487     1       46020   1830.82 4910.86 891614
   PACS2-PA7-PAO1-PA14                     71      2       79031   4112.94 10618.2 292019
   PA2192-PACS2-PA7-PAO1-PA14              181     1       92066   3439.9  9500.07 622622 #some N's maybe?
   PAC3719-PA2192-PACS2-PA7-PAO1-PA14      173     1       11425   998.16  1888.94 172682 #some N's maybe?
   =======================================================================================
   ALL                                     1028    2       6537648 8513.66 203929  8752049
   Table 3:
   reference                             #regions  min     max     mean    stdev   sum
   =======================================================================================
   PA14                                    1       6537648 6537648 6537648 0       6537648
   
   PA14-PAO1                               123     1       43690   4009.28 7903.16 493142
   PA14-PA7                                516     1       31464   1768.16 3534.88 912375
   PA14-PACS2                              146     2       43690   3259.15 6129.12 475836
   PA14-PA2192                             222     1       34854   2189.76 4338.6  486127
   PA14-PAC3719                            277     1       43690   2206.38 5214.55 611170
   =======================================================================================
 * Extract unique regions(Table 2) and add them to the most similar genome
 * Assemble using all these regions as reference => 10,201 contigs; 1,052,178 singletons
   reference                        #ctgs   min     max     mean    stdev           sum
   =======================================================================================
   PA14                             2056    17      57476   2992.27 7078.25         6152122 
   PAO1                             877     17      8159    95.16   482.74          83456
   PA7                              4723    17      6822    41.7    128.87          196996
   PACS2                            914     17      3346    66.96   148.47          61207
   PA2192                           1528    17      49826   134.88  1734.19         206109
   PAC3719                          103     17      693     42.11   85.89           4338
   =======================================================================================
   ALL                              10201   17      57476   657.21  3457.1          6704228
 
 * Compare ALL assembly contigs to PA14 ref assembly :
   reference                        #ctgs   min     max     mean    stdev           sum        #singletons
   ========================================================================================================
   ALL                              10201   17      57476   657.21  3457.1          6704228    1052178 
   ALL(200bp+)                      1020    200     57476   6235.17 9220.76         6359881
   ALL(PA14)                        2056    17      57476   2992.27 7078.25         6152122 
   PA14                             1984    17      82196   3103.21 8274.53         6156788    1375004
   ========================================================================================================
   ALL-PA14                         8217     ......                                 547440                 # 0.5M extra bp
   ALL(PA14)-PA14                   72       ......                                 -4666                  # 4666 bp which were originally mapped to PA14 are now mapped to other strains
  * Compare ALL assembly contigs to ssake 200bp+ contigs :
 
   ssake 200bp+ contigs:
     strain    #ctgs   min     max     mean    stdev   sum
     PAb1      879     200     2866    416.72  308.92  366304
   ALL vs ssake alignments:
    desc    #elem   min     max     mean    stdev   sum
    ALL     265     23      49826   912.9   4187.17 241919
    ssake   306     200     2358    379.85  257.04  116235
  ALL vs ssake CONTAINED alignments:
    desc    #elem   min     max     mean    stdev   sum
    ALL     164     23      49826   1264.71 5268.16 207413
    ssake   260     200     2358    368.09  259.02  95704
  ~300 out of 879 ssake contigs were assembled using this method.
  The ALL assembly contigs ssake contigs align to are longer than the ssake contigs!
  * Compare ALL assembly contigs to final de novo contigs :
       id                          len         #de novo ctgs it CONTAINS
  ======================================================================
    1  2361_PA2192_2249428_2341494 49826       52
    2  2360_PA2192_2249428_2341494 42258       49
    3  2364_PA2192_2346558_2361947 15388       17
    4  10190_PAO1_747565_761293    8159        7
    5  2429_PA2192_2405271_2412668 6103        7
    6  3259_PA2192_6656583_6693732 2594        6
    7  2367_PA2192_2374235_2379148 4914        5
    8  6917_PA7_4563267_4570077    6822        5
    9  2958_PA2192_4496915_4500004 3090        4
   10  2375_PA2192_2383706_2386485 2789        3
   11  2366_PA2192_2366515_2369415 2901        2
   12  6919_PA7_4579629_4582081    1085        2
   13  5364_PA7_3138605_3164438    1931        2
   14  8663_PACS2_3494057_3573088  449         2
   15  9322_PACS2_982458_983976    1519        2
   16  2368_PA2192_2379432_2380661 1230        2
   17  2716_PA2192_3268239_3292117 567         2
   18  9587_PAO1_2439014_2458239   3513        1
   ...
   49  8384_PAC3719_5517906_5518496 591        1
  ======================================================================
      Total                                    201

Ssake on all reads

Default parameters

 Contigs stats:
 
 desc    #elem   min     max     mean    stdev   sum
 length  185030  34      5490    77.21   141.23  14287079
 reads   185030  2       13352   29.52   127.65  5463405 
 
 Singletons: 3,164,495

Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0214_ssake/

Edena on all reads

Default parameters

 Contigs stats:
 
 desc    #elem   min     max     mean    stdev   sum
 contig  11180   100     11300   552.36  610.52  6175460
 Contig don't seem to overlap much 

Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0312_edena/

Velvet on all reads

Version 0.5.05 Ovl len=24

Commands:

 velveth . 24 Pa.seq 
 velvetg . -read_trkg yes

Output files (used)

 contigs.fa : contig FASTA file
 
 velvet_assy.afg: contig AMOS message file
 {CTG    10684
 {RED    8627900
 {TLE    7386821 
 
 => 8627900-7386821=1241079 singletons

Contigs stats:

 desc    #elem   min     max     mean    stdev   sum
 contig  10684   45      16239   640.34  825.24  6841458

Many contigs seem to overlap by ~ 20bp:

Most of the 10684 contigs align to PA14 reference:

 9679 Pa.qry_hits
 9544 Pa.CBE.qry_hits
 9677 Pa.filter-q.qry_hits
 9543 Pa.filter-q.CBE.qry_hits

Not aligned: Pa.fasta-Pa.qry_hits

 #elem   min     max     mean    stdev   sum
 1005    45      16239   600.69  1149.49 603702

Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24

AMOScmp on the velvet contigs

MINCLUSTER=40,MINOVL=10

 desc                  #elem   min     max     mean    stdev   sum
 ctg                   3054    45      24166   1845.66 2239.78 5636668
 singl                 1030    45      16239   814.01  1476.66 838434
 ctg+singl             4084    45      24166   1585.48 2121.65 6475102
 ctg+singl(200bp+)     3242    200     24166   1969.57 2225.87 6385364

Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24/AMOScmp

PA14 Sanger maq on all reads

 Contig stats:
 min_len    #contigs   min     max     mean     stdev           sum
 0          991        33      155551  6199.79  17445.05        6143996
 200        368        200     155551  16581.7  25475.92        6102067
 10K        149        10048   155551  38164.35 28513.51        5686489

SNP's in the contigs: T<->C & A<-G> the most common

Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0206_PA14_maq

Final assembly (v3)

Merged some of the 120 AMOScmp contigs from v2 using velvet contigs. Grouped overlapping contigs and used minimus to merge them.

 Contig length stats:
 
  desc            #ctgs   min     max     mean              stdev           sum
  final3          512     200     512638  13099.41          53526.8         6706902
  AMOScmp3        76      212     512638  82763.22          117199.28       6290005
  novo3           436     200     10493   956.18            1336.44         416897
 
  final3-10KB+    49      10493   512638  127086.12         125757.98       6227220 

Comment:

 ~ 53-74 (out of 436) the velvet denovo that contigs align to PA14.
 
 Length statistics:
                    #ctgs   min     max     mean    stdev   sum
   CONTAINED        53      204     7228    514.81  978.25  27285
   ALL              74      204     7228    622.85  926.13  46091 

Locations:

 /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/final3/
 /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24/nucmer/merge/

Annotation

5769 genes
/fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Annotation/merge/gene_list.ptt