Pseudomonas aeruginosa: Difference between revisions

From Cbcb
Jump to navigation Jump to search
Line 9: Line 9:
   [http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj&cmd=search&term=Pseudomonas%20aeruginosa%20PA14 PA14] [http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=CP000438 CP000438] 1 chromosome, 6,537,648 bp, 66.29 %GC : most similar
   [http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj&cmd=search&term=Pseudomonas%20aeruginosa%20PA14 PA14] [http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=CP000438 CP000438] 1 chromosome, 6,537,648 bp, 66.29 %GC : most similar
   [http://www.ncbi.nlm.nih.gov/sites/entrez?Db=genomeprj&cmd=ShowDetailView&TermToSearch=16720 PA7] [http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=CP000744 CP000744]  1 chromosome, 6,588,339 bp
   [http://www.ncbi.nlm.nih.gov/sites/entrez?Db=genomeprj&cmd=ShowDetailView&TermToSearch=16720 PA7] [http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=CP000744 CP000744]  1 chromosome, 6,588,339 bp
5.7Kbp repeat coordinates:
  PA14:
  732540  738302  +
  4951956  4957714 - 
  5535975  5541728 -
  6312434  6318199 -
  PAO1:
  721550  727325  +
  4788516  4794273 -
  5264042  5269801 -
  6039515  6045289 -


=== CBCB ===
=== CBCB ===

Revision as of 19:36, 12 February 2008

Strain PA-b1

Data

NCBI

Complete strains:

 PAO1 AE004091 1 chromosome, 6,264,404 bp, 66.56 %GC
 PA14 CP000438 1 chromosome, 6,537,648 bp, 66.29 %GC : most similar
 PA7 CP000744  1 chromosome, 6,588,339 bp

5.7Kbp repeat coordinates:

 PA14:
 732540   738302  +
 4951956  4957714 -  
 5535975  5541728 -
 6312434  6318199 -
 PAO1:
 721550   727325  +
 4788516  4794273 -
 5264042  5269801 -
 6039515  6045289 -

CBCB

File location:

 /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/
 

Files:

 s_1_0001_prb.txt
 s_1_sequence.txt  # contains seq & qual; seq names: HWI% ; all reads are 33 bp
 s_1_tag.txt         
 
 s_7_0300_prb.txt
 s_7_sequence.txt  # contains seq & qual; seq names HWU%  ; all reads are 33 bp
 s_7_tag.txt    
 
 wc -l s_1_sequence.txt s_7_sequence.txt 
  4105993 s_1_sequence.txt
  4521907 s_7_sequence.txt
  8627900 total
 #all reads that contain at least 1 ambiguity
 grep -c N s_*seq
  s_1_sequence.seq:37933
  s_7_sequence.seq:69669
 #all reads that contain at least 2 ambiguities
 cat s_1_sequence.seq | perl -ane 'print $_ if(/N.+N/);' | wc -l  : 18771
 cat s_7_sequence.seq  | perl -ane 'print $_ if(/N.+N/);' | wc -l : 59589
 ...
 s_1 reads have significantly fewer N's than s_7 reads !!!
 s_1 reads align better to ref than s_2 reads !!!
 
 conflicts with
 
 Avg qualities: 28.26 for s_1, 31.05 for s_2 !!!

Assemblies

Untrimmed reads

All contain ref duplications (this section should be deleted eventually) Duplication causes

 1. Only the reference nucmer alignment coordinates are stored in the bank and used by casm-layout and make-consensus; the read alignment coordinates are not stored, just the read clear ranges (the ones given in the .afg file); read positions in the assembly layout are approximate
 2. reads can be shifted from their original alignment position by make-consensus by up to 2^3*15=120 bp (constants ALIGN_WIGGLE=15; MAX_ALIGN_ATTEMPTS=3)
 3. The Pseudomonas_aeruginosa reference contains multiple 2 copy 5-10 bp kmers, adjacent within a few dozen bp of each other;  reads starting with these kmers align in 2 ways to the reference; if the reads contain errors or SNP's, the 2nd (shorter) alignment to the greedyly built consensus is chosen over the correct one and a duplication is introduced

Trimmed reads

Quality trimming

Art's script: qual-trim.awk

Min_Qual=15

 .               #elem   #elem0  #elem<0 min     median  max     sum             mean    stdev   n50
 qualTrim        8627900 1266    0       0       23      33      193999902       22.49   7.43    26

!!! Avg clr drops from 33 to 22 bp !!! Duplications still exist though at a lower rate

Alignment based trimming

PA14 ref assembly

Location: 2008_0109_AMOSCmp-PA14-relaxed-17-nucmer-redo2 Command: AMOScmp Pa -D MINCLUSTER=17 -D MINMATCH=17 -D MINOVL=5 -D MAJORITY=50 -D ALIGNWIGGLE=2

1.a align all reads to the reference using nucmer. I initially used minmatch=17, mincluster=17 (-c 17 -l 17)

      8.62M reads, 4.10M HWI 5.32M HWU
 
      7.26M (84.16%, 83.50% HWI, 84.76% HWU) total reads align to the reference (-c 17 -l 17)
      6.54M (75.82%, 74.49% HWI, 77.02% HWU) total reads align to the reference (-c 20 -l 20)
      5.32M (73.27%) reads aligned on their full length (as opposed to 0.94M 33bp quality untrimmed reads)  (-c 33 -l 17)
      3.69M (42.79%, 38.24% HWI, 46.92% HWU) align exactly (33 bp, 100%id)

1.b align all unaligned reads to the reference using nucmer. minmatch=14, mincluster=14 (-c 14 -l 14) Combine 1.a & 1.b

2. trim reads according to their nucmer alignment coordinates; don't trim the ones adjacent to zero cvg regions

3. assemble them using the AMOScmp pipeline(ALIGN_WIGGLE=2 instead of 15)

Contigs stats:
 
 desc    #elem   min     max     mean            stdev           sum
 all     2053    17      170485  3011.84         11917.53        6183320
 200     428     203     170485  14262.09        22852.74        6104175
 10K     157     10240   170485  35468.89        26531.33        5568616
 
 Singletons: 1127399
 Data accuracy
 Get all assembled bases with coverage>=20
 count_bases=5926977
 sum_bases=235619001
 sum_errors=2455670
 sum_errors/sum_bases=2455670/235619001=0.01042=1.042% error

PAO1 ref assembly

Same method except that 1.b was not used

Location: 2008_0124_AMOSCmp-PAO1-relaxed-17-nucmer-redo2

 Contigs stats:
 
 desc    #elem   min     max     mean            stdev   sum
 all     2797    17      75626   2161.19         5812.2  6044851
 200     865     200     75626   6893.96         8766.63 5963278
 10K     204     10016   75626   19016.22        10368   3879309
 
 Singletons: 1592525

PA14 & PAO1 merge

Use minimus to merge all contigs in the PA14 & PAO1 reference assemblies

Filter contigs that contain at least 2 adjacent PA14 merged by PAO1

 desc       #elem   min     max     mean            stdev           sum
 all        1850    17      236472  3400.31         16863.79        6290586
 200        306     204     236472  20318.3         37147.91        6217401
 10K        113     10520   236472  52647.45        45602.76        5949162
 
 Singletons: 1066226

Ssake

Input: 1066226 singletons

 Ssake version 3.0 run with default parameters:
 -m  Minimum number of reads needed to call a base during overhang consensus build up (default -m 16)
 -o  Minimum number of reads needed to call a base during an extension (default -o 2)
 -r  Minimum base ratio used to accept a overhang consensus base (default -r 0.6) 
 Contigs stats:
 
 desc    #elem   min     max     mean    stdev   sum
 all     19795   34      2866    67.78   102.73  1341825
 200     879     200     2866    416.72  308.92  366304
 200 bp contigs stats
 
 desc.           #elem   min     mean    max     sum
 contig_len      879     200     416.72  2866    366304
 contig_reads    879     52      414.21  3157    364091
 contig_cvg      879     7.91    31.58   146.52  27764.51 
 
 Singletons:  702007
 Find contigs overlapping 12 bp or more and merge them:
 
 desc            #elem   min     max     mean    stdev   sum
 original        879     200     2866    416.72  308.92  366304
 new             670     200     4826    539.32  579.26  361350