Pseudomonas aeruginosa
Jump to navigation
Jump to search
Strain PA-b1
Data
NCBI
Complete strains:
PAO1 AE004091 1 chromosome, 6,264,404 bp, 66.56 %GC PA14 CP000438 1 chromosome, 6,537,648 bp, 66.29 %GC : most similar PA7 CP000744 1 chromosome, 6,588,339 bp
CBCB
File location:
/fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/
Files:
s_1_0001_prb.txt s_1_sequence.txt # contains seq & qual; seq names: HWI% ; all reads are 33 bp s_1_tag.txt s_7_0300_prb.txt s_7_sequence.txt # contains seq & qual; seq names HWU% ; all reads are 33 bp s_7_tag.txt wc -l s_1_sequence.txt s_7_sequence.txt 4105993 s_1_sequence.txt 4521907 s_7_sequence.txt 8627900 total
#all reads that contain at least 1 ambiguity grep -c N s_*seq s_1_sequence.seq:37933 s_7_sequence.seq:69669
#all reads that contain at least 2 ambiguities cat s_1_sequence.seq | perl -ane 'print $_ if(/N.+N/);' | wc -l : 18771 cat s_7_sequence.seq | perl -ane 'print $_ if(/N.+N/);' | wc -l : 59589 ...
s_1 reads have significantly fewer N's than s_7 reads !!! s_1 reads align better to ref than s_2 reads !!! conflicts with Avg qualities: 28.26 for s_1, 31.05 for s_2 !!!
Assemblies
Untrimmed reads
All contain ref duplications (this section should be deleted eventually) Duplication causes
1. Only the reference nucmer alignment coordinates are stored in the bank and used by casm-layout and make-consensus; the read alignment coordinates are not stored, just the read clear ranges (the ones given in the .afg file); read positions in the assembly layout are approximate 2. reads can be shifted from their original alignment position by make-consensus by up to 2^3*15=120 bp (constants ALIGN_WIGGLE=15; MAX_ALIGN_ATTEMPTS=3) 3. The Pseudomonas_aeruginosa reference contains multiple 2 copy 5-10 bp kmers, adjacent within a few dozen bp of each other; reads starting with these kmers align in 2 ways to the reference; if the reads contain errors or SNP's, the 2nd (shorter) alignment to the greedyly built consensus is chosen over the correct one and a duplication is introduced
Trimmed reads
Quality trimming
Art's script: qual-trim.awk
Min_Qual=15
. #elem #elem0 #elem<0 min median max sum mean stdev n50 qualTrim 8627900 1266 0 0 23 33 193999902 22.49 7.43 26
!!! Avg clr drops from 33 to 22 bp !!! Duplications still exist though at a lower rate
Alignment based trimming
PA14 ref assembly
1. align all reads to the reference using nucmer. I initially used minmatch=17, mincluster=17 (-c 17 -l 17)
8.62M reads, 4.10M HWI 5.32M HWU 7.26M (84.16%, 83.50% HWI, 84.76% HWU) total reads align to the reference (-c 17 -l 17) 6.54M (75.82%, 74.49% HWI, 77.02% HWU) total reads align to the reference (-c 20 -l 20) 5.32M (73.27%) reads aligned on their full length (as opposed to 0.94M 33bp quality untrimmed reads) (-c 33 -l 17) 3.69M (42.79%, 38.24% HWI, 46.92% HWU) align exactly (33 bp, 100%id)
2. trim reads according to their nucmer alignment coordinates; don't trim the ones adjacent to zero cvg regions 3. assemble them using the AMOScmp pipeline. A modified version of make-consensus was used (ALIGN_WIGGLE=2 instead of 15)
2008_0109_AMOSCmp-PA14-relaxed-17-nucmer-redo2
command: /nfshomes/dpuiu/bin/AMOScmp Pa -D MINCLUSTER=17 -D MINMATCH=17 -D MINOVL=5 -D MAJORITY=50 -D ALIGNWIGGLE=2 location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0109_AMOSCmp-PA14-relaxed-17-nucmer qc stats: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0109_AMOSCmp-PA14-relaxed-17-nucmer/Pa.qc
Contigs stats: desc #elem min max mean stdev sum all 2053 17 170485 3011.84 11917.53 6183320 200 428 203 170485 14262.09 22852.74 6104175 10K 157 10240 170485 35468.89 26531.33 5568616