Pseudomonas aeruginosa

Revision as of 14:48, 12 February 2008

Strain PA-b1

Data

NCBI

Complete strains:

 PAO1 AE004091 1 chromosome, 6,264,404 bp, 66.56 %GC
 PA14 CP000438 1 chromosome, 6,537,648 bp, 66.29 %GC : most similar
 PA7 CP000744  1 chromosome, 6,588,339 bp

CBCB

File location:

 /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/
 

Files:

 s_1_0001_prb.txt
 s_1_sequence.txt  # contains seq & qual; seq names: HWI% ; all reads are 33 bp
 s_1_tag.txt         
 
 s_7_0300_prb.txt
 s_7_sequence.txt  # contains seq & qual; seq names: HWU% ; all reads are 33 bp
 s_7_tag.txt    
 
 wc -l s_1_sequence.txt s_7_sequence.txt 
  4105993 s_1_sequence.txt
  4521907 s_7_sequence.txt
  8627900 total
 #all reads that contain at least 1 ambiguity
 grep -c N s_*seq
  s_1_sequence.seq:37933
  s_7_sequence.seq:69669
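 #all reads that contain at least 2 ambiguities (/N.+N/ needs at least one character between the two N's, so a read whose only N's are an adjacent "NN" pair is not counted)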
 cat s_1_sequence.seq | perl -ane 'print $_ if(/N.+N/);' | wc -l  : 18771
 cat s_7_sequence.seq  | perl -ane 'print $_ if(/N.+N/);' | wc -l : 59589
 ...
 s_1 reads have significantly fewer N's than s_7 reads !!!
 s_1 reads align better to the reference than s_7 reads !!!
 
 This conflicts with the average qualities: 28.26 for s_1, 31.05 for s_7 !!!
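The ambiguity counts above can also be reproduced without regular expressions. A minimal sketch, assuming the reads are available as plain sequence strings; the function name and the toy reads are hypothetical and not taken from the s_1 / s_7 files:

```python
def count_ambiguous(reads):
    """Return (reads with at least 1 N, reads with at least 2 N's)."""
    at_least_one = sum(1 for r in reads if "N" in r)
    # Counting characters also treats an adjacent "NN" pair as 2 ambiguities.
    at_least_two = sum(1 for r in reads if r.count("N") >= 2)
    return at_least_one, at_least_two


# Toy reads, made up for illustration only.
reads = ["ACGT", "ACNT", "ANNT", "NACN"]
print(count_ambiguous(reads))  # (3, 2)
```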

Assemblies

Untrimmed reads

All contain ref duplications (this section should be deleted eventually)

Duplication causes

 1. Only the reference nucmer alignment coordinates are stored in the bank and used by casm-layout and make-consensus; the read alignment coordinates are not stored, just the read clear ranges (the ones given in the .afg file), so read positions in the assembly layout are approximate
 2. Reads can be shifted from their original alignment position by make-consensus by up to 2^3*15=120 bp (constants ALIGN_WIGGLE=15; MAX_ALIGN_ATTEMPTS=3)
 3. The Pseudomonas_aeruginosa reference contains multiple 2-copy 5-10 bp kmers, adjacent within a few dozen bp of each other; reads starting with these kmers align in 2 ways to the reference; if the reads contain errors or SNPs, the 2nd (shorter) alignment to the greedily built consensus is chosen over the correct one and a duplication is introduced
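The shift bound in cause 2 is plain arithmetic. The sketch below only restates the expression 2^3*15 given above; that the same expression applies for other ALIGN_WIGGLE values is an assumption, and the function name is hypothetical (it is not part of make-consensus):

```python
def max_shift(align_wiggle, max_align_attempts=3):
    """Upper bound on read shift, assuming 2^MAX_ALIGN_ATTEMPTS * ALIGN_WIGGLE."""
    return (2 ** max_align_attempts) * align_wiggle


print(max_shift(15))  # 120, the value quoted above
print(max_shift(2))   # 16, if the same expression holds for ALIGN_WIGGLE=2
```

Under that assumption, a smaller ALIGN_WIGGLE leaves far less room for a read to slide onto the second copy of a short repeated kmer.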

Trimmed reads

Quality trimming

Art's script: qual-trim.awk

Min_Qual=15

 .               #elem   #elem0  #elem<0 min     median  max     sum             mean    stdev   n50
 qualTrim        8627900 1266    0       0       23      33      193999902       22.49   7.43    26

!!! Avg clr (clear range) drops from 33 to 22.49 bp !!!

Duplications still exist, though at a lower rate.
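qual-trim.awk itself is not shown here, so the following is not its algorithm. It is a minimal sketch of one common way to quality-trim a read, assuming per-base integer qualities and trimming from both ends until a base with quality >= Min_Qual is reached; the function name is hypothetical:

```python
def qual_trim(quals, min_qual=15):
    """Return a 0-based, half-open clear range (start, end).

    Assumption: bases are trimmed from each end while their quality
    is below min_qual; low-quality bases inside the range are kept.
    """
    start, end = 0, len(quals)
    while start < end and quals[start] < min_qual:
        start += 1
    while end > start and quals[end - 1] < min_qual:
        end -= 1
    return start, end


print(qual_trim([5, 20, 30, 30, 10, 3]))  # (1, 4): clear range of 3 bp
print(qual_trim([1, 2, 3]))               # (3, 3): empty clear range
```

A read whose clear range comes out empty would correspond to the #elem0 column in the table above.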

Alignment based trimming

PA14 ref assembly

1. align all reads to the reference using nucmer. I initially used minmatch=17, mincluster=17 (-c 17 -l 17)

      8.62M reads: 4.10M HWI, 4.52M HWU
 
      7.26M (84.16%, 83.50% HWI, 84.76% HWU) total reads align to the reference (-c 17 -l 17)
      6.54M (75.82%, 74.49% HWI, 77.02% HWU) total reads align to the reference (-c 20 -l 20)
      5.32M (73.27% of the 7.26M aligned reads) aligned on their full length (as opposed to 0.94M 33bp quality untrimmed reads)  (-c 33 -l 17)
      3.69M (42.79%, 38.24% HWI, 46.92% HWU) align exactly (33 bp, 100%id)
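The overall percentages can be cross-checked against the per-lane rates. A small sketch, assuming the wc -l counts listed under Files are the per-lane read counts (one read per line; HWI = s_1, HWU = s_7); the variable and function names are hypothetical:

```python
# Per-lane read counts, taken from the wc -l output under Files.
lane_reads = {"HWI": 4105993, "HWU": 4521907}

# Per-lane alignment rates quoted above.
rate_c17 = {"HWI": 0.8350, "HWU": 0.8476}  # -c 17 -l 17
rate_c20 = {"HWI": 0.7449, "HWU": 0.7702}  # -c 20 -l 20


def combined(lane_reads, rates):
    """Return (number of aligned reads, overall aligned fraction)."""
    aligned = sum(lane_reads[lane] * rates[lane] for lane in lane_reads)
    return aligned, aligned / sum(lane_reads.values())


for rates in (rate_c17, rate_c20):
    aligned, frac = combined(lane_reads, rates)
    print(f"{aligned / 1e6:.2f}M {100 * frac:.2f}%")
# prints:
# 7.26M 84.16%
# 6.54M 75.82%
```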

2. trim reads according to their nucmer alignment coordinates; don't trim the ones adjacent to zero cvg regions

3. assemble them using the AMOScmp pipeline. A modified version of make-consensus was used (ALIGN_WIGGLE=2 instead of 15)
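The trimming rule in step 2 can be pictured as follows. This is only a sketch under stated assumptions: the alignment is given as a 0-based, half-open interval on the read, and whether a read borders a zero-coverage region is already known. The function and parameter names are hypothetical and are not part of nucmer or AMOS:

```python
def alignment_clear_range(read_len, aln_start, aln_end, next_to_zero_cvg):
    """Return the clear range (start, end) to use for a read.

    Reads next to a zero-coverage region keep their full length;
    all other reads are trimmed to the aligned part.
    """
    if next_to_zero_cvg:
        return 0, read_len
    return aln_start, aln_end


# A 33 bp read whose first 25 bases align to the reference:
print(alignment_clear_range(33, 0, 25, next_to_zero_cvg=False))  # (0, 25)
print(alignment_clear_range(33, 0, 25, next_to_zero_cvg=True))   # (0, 33)
```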

Location: 2008_0109_AMOSCmp-PA14-relaxed-17-nucmer-redo2

 command: /nfshomes/dpuiu/bin/AMOScmp Pa -D MINCLUSTER=17 -D MINMATCH=17 -D MINOVL=5 -D MAJORITY=50 -D ALIGNWIGGLE=2
 location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0109_AMOSCmp-PA14-relaxed-17-nucmer
 qc stats: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0109_AMOSCmp-PA14-relaxed-17-nucmer/Pa.qc
Contigs stats:
 
 desc    #elem   min     max     mean            stdev           sum
 all     2053    17      170485  3011.84         11917.53        6183320
 200     428     203     170485  14262.09        22852.74        6104175
 10K     157     10240   170485  35468.89        26531.33        5568616
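For reference, a table of this shape can be computed from a list of contig lengths. A minimal sketch, assuming the "200" and "10K" rows mean contigs of at least 200 bp and at least 10,000 bp (consistent with the min column), and assuming a sample standard deviation; the tool that produced the table is not named here, and the toy lengths are made up:

```python
from statistics import mean, stdev


def contig_stats(lengths, min_len=0):
    """Summary stats for contigs with length >= min_len (None if no contig qualifies)."""
    sel = [n for n in lengths if n >= min_len]
    if not sel:
        return None
    return {
        "#elem": len(sel),
        "min": min(sel),
        "max": max(sel),
        "mean": round(mean(sel), 2),
        # Sample stdev is an assumption; the original may use the population form.
        "stdev": round(stdev(sel), 2) if len(sel) > 1 else 0.0,
        "sum": sum(sel),
    }


# Toy contig lengths, for illustration only.
lengths = [17, 150, 203, 5000, 10240, 170485]
for label, cutoff in (("all", 0), ("200", 200), ("10K", 10000)):
    print(label, contig_stats(lengths, cutoff))
```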

PAO1 ref assembly

Same method

Location: 2008_0124_AMOSCmp-PAO1-relaxed-17-nucmer-redo2

 Contigs stats:
 
 desc    #elem   min     max     mean            stdev   sum
 all     2797    17      75626   2161.19         5812.2  6044851
 200     865     200     75626   6893.96         8766.63 5963278
 10K     204     10016   75626   19016.22        10368   3879309

PA14 & PAO1 merge

Use minimus

Filter contigs that contain at least 2 adjacent PA14 contigs merged by PAO1

 desc       #elem   min     max     mean            stdev           sum
 all        1850    17      236472  3400.31         16863.79        6290586
 200        306     204     236472  20318.3         37147.91        6217401
 10K        113     10520   236472  52647.45        45602.76        5949162
 singletons 1066226