Pseudomonas aeruginosa

From Cbcb
Revision as of 15:39, 25 February 2008 by Dpuiu (talk | contribs) (→‎Ssake)
Jump to navigation Jump to search

Strain PA-b1

Data

NCBI

Complete strains:

 PA14        CP000438       6,537,648 bp, 66.29 %GC : most similar to PAb1
 PAO1        AE004091       6,264,404 bp, 66.56 %GC : rearrangement vs PA14
 PA7         CP000744       6,588,339 bp            : no rearrangement vs PA14
 PACS2       AAQW01000001.1 6,492,423 bp, 66.33% GC : rearrangement vs PA14

Incomplete strains(Broad):

 Contig length stats:
 desc  #contigs   min     max     mean    stdev   sum
 C3719 124        2079    242903  49572   53770   6146998  : no rearrangement vs PA14
 2192  82         2087    398738  83246   88681   6826253  : rearrangement vs PA14
 Scaffold length stats:
 desc   len     GC%
 C3719  6222097 66.30
 2192   6905121 65.99
PAb1 project

5.7Kbp repeat coordinates:

 PA14:
 732540   738302  +
 4951956  4957714 -  
 5535975  5541728 -
 6312434  6318199 -
 PAO1:
 721550   727325  +
 4788516  4794273 -
 5264042  5269801 -
 6039515  6045289 -
 PA7:
 806558   812299  +
 4982874  4988600 -
 5566182  5571924 -
 6353418  6359148 -

CBCB

File location:

 /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/
 

Files:

 s_1_0001_prb.txt
 s_1_sequence.txt  # contains seq & qual; seq names: HWI% ; all reads are 33 bp
 s_1_tag.txt         
 
 s_7_0300_prb.txt
 s_7_sequence.txt  # contains seq & qual; seq names HWU%  ; all reads are 33 bp
 s_7_tag.txt    
 
 wc -l s_1_sequence.txt s_7_sequence.txt 
  4105993 s_1_sequence.txt
  4521907 s_7_sequence.txt
  8627900 total
 #all reads that contain at least 1 ambiguity
 grep -c N s_*seq
  s_1_sequence.seq:37933
  s_7_sequence.seq:69669
 #all reads that contain at least 2 ambiguities
 cat s_1_sequence.seq | perl -ane 'print $_ if(/N.+N/);' | wc -l  : 18771
 cat s_7_sequence.seq  | perl -ane 'print $_ if(/N.+N/);' | wc -l : 59589
 ...
 s_1 reads have significantly fewer N's than s_7 reads !!!
 s_1 reads align better to ref than s_2 reads !!!
 
 conflicts with
 
 Avg qualities: 28.26 for s_1, 31.05 for s_2 !!!

Assemblies

Untrimmed reads

All contain ref duplications (this section should be deleted eventually) Duplication causes

 1. Only the reference nucmer alignment coordinates are stored in the bank and used by casm-layout and make-consensus; the read alignment coordinates are not stored, just the read clear ranges (the ones given in the .afg file); read positions in the assembly layout are approximate
 2. reads can be shifted from their original alignment position by make-consensus by up to 2^3*15=120 bp (constants ALIGN_WIGGLE=15; MAX_ALIGN_ATTEMPTS=3)
 3. The Pseudomonas_aeruginosa reference contains multiple 2 copy 5-10 bp kmers, adjacent within a few dozen bp of each other;  reads starting with these kmers align in 2 ways to the reference; if the reads contain errors or SNP's, the 2nd (shorter) alignment to the greedyly built consensus is chosen over the correct one and a duplication is introduced

Trimmed reads

Quality trimming

Art's script: qual-trim.awk

Min_Qual=15

 .               #elem   #elem0  #elem<0 min     median  max     sum             mean    stdev   n50
 qualTrim        8627900 1266    0       0       23      33      193999902       22.49   7.43    26

!!! Avg clr drops from 33 to 22 bp !!! Duplications still exist though at a lower rate

Alignment based trimming

PA14 ref assembly

Location: 2008_0109_AMOSCmp-PA14-relaxed-17-nucmer-redo2 Command: AMOScmp Pa -D MINCLUSTER=17 -D MINMATCH=17 -D MINOVL=5 -D MAJORITY=50 -D ALIGNWIGGLE=2

1.a align all reads to the reference using nucmer. I initially used minmatch=17, mincluster=17 (-c 17 -l 17)

      8.62M reads, 4.10M HWI 5.32M HWU
 
      7.26M (84.16%, 83.50% HWI, 84.76% HWU) total reads align to the reference (-c 17 -l 17)
      6.54M (75.82%, 74.49% HWI, 77.02% HWU) total reads align to the reference (-c 20 -l 20)
      5.32M (73.27%) reads aligned on their full length (as opposed to 0.94M 33bp quality untrimmed reads)  (-c 33 -l 17)
      3.69M (42.79%, 38.24% HWI, 46.92% HWU) align exactly (33 bp, 100%id)

1.b align all unaligned reads to the reference using nucmer. minmatch=14, mincluster=14 (-c 14 -l 14) Combine 1.a & 1.b

2. trim reads according to their nucmer alignment coordinates; don't trim the ones adjacent to zero cvg regions

3. assemble them using the AMOScmp pipeline(ALIGN_WIGGLE=2 instead of 15)

Contigs stats:
 
 desc    #elem   min     max     mean            stdev           sum
 all     2053    17      170485  3011.84         11917.53        6183320
 200     428     203     170485  14262.09        22852.74        6104175
 10K     157     10240   170485  35468.89        26531.33        5568616
 
 Singletons: 1127399
 Data accuracy
 Get all assembled bases with coverage>=20
 count_bases=5926977
 sum_bases=235619001
 sum_errors=2455670
 sum_errors/sum_bases=2455670/235619001=0.01042=1.042% error

PAO1 ref assembly

Same method except that 1.b was not used

Location: 2008_0124_AMOSCmp-PAO1-relaxed-17-nucmer-redo2

 Contigs stats:
 
 desc    #elem   min     max     mean            stdev   sum
 all     2797    17      75626   2161.19         5812.2  6044851
 200     865     200     75626   6893.96         8766.63 5963278
 10K     204     10016   75626   19016.22        10368   3879309
 
 Singletons: 1592525

PA14 & PAO1 merge

Use minimus to merge all contigs in the PA14 & PAO1 reference assemblies

Filter contigs that contain at least 2 adjacent PA14 merged by PAO1

 desc       #elem   min     max     mean            stdev           sum
 all        1850    17      236472  3400.31         16863.79        6290586
 200        306     204     236472  20318.3         37147.91        6217401
 10K        113     10520   236472  52647.45        45602.76        5949162
 
 Singletons: 1066226
Ssake

Input: 1066226 singletons

 Ssake version 3.0 run with default parameters:
 -m  Minimum number of reads needed to call a base during overhang consensus build up (default -m 16)
 -o  Minimum number of reads needed to call a base during an extension (default -o 2)
 -r  Minimum base ratio used to accept a overhang consensus base (default -r 0.6) 
 Contigs stats:
 
 desc    #elem   min     max     mean    stdev   sum
 all     19795   34      2866    67.78   102.73  1341825
 200     879     200     2866    416.72  308.92  366304
 200 bp contigs stats
 
 desc.           #elem   min     mean    max     sum
 contig_len      879     200     416.72  2866    366304
 contig_reads    879     52      414.21  3157    364091
 contig_cvg      879     7.91    31.58   146.52  27764.51 
 
 Singletons:  702007
 Find contigs overlapping 12 bp or more and merge them using EMBOSS merger program:
 
 desc            #elem   min     max     mean    stdev   sum
 original        879     200     2866    416.72  308.92  366304
 new             670     200     4826    539.32  579.26  361350

Gene assemblies

Dan Sommer reduced number of contigs >=200bp from 306 to 120 /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Annotation/PAb1.fasta

 Contig stats:
 desc    #elem   min     max     mean    stdev   sum
 contigs 120     212     512638  51438.95        81999.91        6172675
 Average Gap: 105 nt bases
 Median Gap: 14 nt bases
 Largest Gap: 1095 nt bases

927 singletons assembled

Pa-b1 AMOScmp reference assembly

 Contig stats:
  
 desc                           #elem   min     max     mean      stdev           sum
 input(reference)               770     200     512638  8484.21   37210.58        6532844
 output(AMOScmp contigs)        936     21      260827  6988.98   22288.49        6541689
 
 Singletons: 1257963
 Snp's:      1941

Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0215_AMOScmp-PAb1

Ssake on all reads

Default parameters

 Contigs stats:
 
 desc    #elem   min     max     mean    stdev   sum
 length  185030  34      5490    77.21   141.23  14287079
 reads   185030  2       13352   29.52   127.65  5463405 
 
 Singletons: 3,164,495

Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0214_ssake/