Pseudomonas aeruginosa
Strain PA-b1
Data
NCBI
Complete strains:
PA14 CP000438 6,537,648 bp, 66.29 %GC : most similar to PAb1
PAO1 AE004091 6,264,404 bp, 66.56 %GC : rearrangement vs PA14
PA7 CP000744 6,588,339 bp : no rearrangement vs PA14
PACS2 AAQW01000001.1 6,492,423 bp, 66.33 %GC : rearrangement vs PA14
Incomplete strains (Broad):
Contig length stats:
desc   #contigs  min   max     mean   stdev  sum
C3719  124       2079  242903  49572  53770  6146998  : no rearrangement vs PA14
2192   82        2087  398738  83246  88681  6826253  : rearrangement vs PA14
Scaffold length stats:
desc   len      GC%
C3719  6222097  66.30
2192   6905121  65.99
PAb1 project
5.7Kbp repeat coordinates:
PA14: 732540 738302 + 4951956 4957714 - 5535975 5541728 - 6312434 6318199 -
PAO1: 721550 727325 + 4788516 4794273 - 5264042 5269801 - 6039515 6045289 -
PA7: 806558 812299 + 4982874 4988600 - 5566182 5571924 - 6353418 6359148 -
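A quick check that the listed coordinates are consistent with a ~5.7 Kbp repeat. This is only a sketch: it assumes the coordinates are 1-based and inclusive, which the notes do not state, and the names REPEATS and repeat_lengths are made up for illustration.

```python
# Coordinates copied from the list above: (start, end, strand) per repeat copy.
REPEATS = {
    "PA14": [(732540, 738302, "+"), (4951956, 4957714, "-"),
             (5535975, 5541728, "-"), (6312434, 6318199, "-")],
    "PAO1": [(721550, 727325, "+"), (4788516, 4794273, "-"),
             (5264042, 5269801, "-"), (6039515, 6045289, "-")],
    "PA7": [(806558, 812299, "+"), (4982874, 4988600, "-"),
            (5566182, 5571924, "-"), (6353418, 6359148, "-")],
}


def repeat_lengths(copies):
    """Length of each copy, assuming 1-based inclusive coordinates."""
    return [end - start + 1 for start, end, _strand in copies]


for strain, copies in REPEATS.items():
    print(strain, repeat_lengths(copies))
```

Under that assumption every copy in all three strains comes out between 5.7 and 5.8 Kbp.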
CBCB
File location:
/fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/
Files:
s_1_0001_prb.txt
s_1_sequence.txt # contains seq & qual; seq names: HWI% ; all reads are 33 bp
s_1_tag.txt
s_7_0300_prb.txt
s_7_sequence.txt # contains seq & qual; seq names: HWU% ; all reads are 33 bp
s_7_tag.txt
wc -l s_1_sequence.txt s_7_sequence.txt
4105993 s_1_sequence.txt
4521907 s_7_sequence.txt
8627900 total
#all reads that contain at least 1 ambiguity
grep -c N s_*seq
s_1_sequence.seq:37933
s_7_sequence.seq:69669
#all reads that contain at least 2 ambiguities
#note: /N.+N/ requires at least one base between two N's, so a read whose only N's are adjacent is not counted
cat s_1_sequence.seq | perl -ane 'print $_ if(/N.+N/);' | wc -l : 18771
cat s_7_sequence.seq | perl -ane 'print $_ if(/N.+N/);' | wc -l : 59589
...
s_1 reads have significantly fewer N's than s_7 reads !!!
s_1 reads align better to ref than s_7 reads !!!
This conflicts with the avg qualities: 28.26 for s_1, 31.05 for s_7 !!!
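The ambiguity counts can also be taken by counting N characters per read instead of pattern matching. A minimal sketch on a toy list of reads; count_ambiguous is a made-up helper, and parsing of the actual s_*_sequence files is left out because their layout is not described here.

```python
def count_ambiguous(reads, min_n=1):
    """Count reads containing at least `min_n` 'N' characters."""
    return sum(1 for read in reads if read.upper().count("N") >= min_n)


# Toy reads, not taken from the real data.
reads = ["ACGTNACGT", "ACGTNNCGT", "ACGTACGTA", "NACGTACGN"]
print(count_ambiguous(reads, 1))  # reads with at least 1 N
print(count_ambiguous(reads, 2))  # reads with at least 2 N's, adjacent or not
```

Counting characters also picks up reads such as ACGTNNCGT, where the two N's are adjacent.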
Assemblies
Untrimmed reads
All contain ref duplications (this section should be deleted eventually)
Duplication causes
1. Only the reference nucmer alignment coordinates are stored in the bank and used by casm-layout and make-consensus; the read alignment coordinates are not stored, just the read clear ranges (the ones given in the .afg file); read positions in the assembly layout are approximate
2. Reads can be shifted from their original alignment position by make-consensus by up to 2^3*15=120 bp (constants ALIGN_WIGGLE=15; MAX_ALIGN_ATTEMPTS=3)
3. The Pseudomonas_aeruginosa reference contains multiple 2-copy 5-10 bp kmers, adjacent within a few dozen bp of each other; reads starting with these kmers align in 2 ways to the reference; if the reads contain errors or SNPs, the 2nd (shorter) alignment to the greedily built consensus is chosen over the correct one and a duplication is introduced
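To see how common the situation in cause 3 is, one could scan the reference for short kmers that occur twice within a few dozen bp. The sketch below is illustrative only: nearby_repeated_kmers, the defaults k=8 and max_dist=40, and the toy sequence are assumptions, not part of the AMOS pipeline.

```python
from collections import defaultdict


def nearby_repeated_kmers(seq, k=8, max_dist=40):
    """Return {kmer: [positions]} for kmers with two occurrences at most max_dist bp apart."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)  # positions are collected in increasing order
    hits = {}
    for kmer, pos in positions.items():
        # compare each occurrence with the next one along the sequence
        if any(b - a <= max_dist for a, b in zip(pos, pos[1:])):
            hits[kmer] = pos
    return hits


# Toy reference: the 8-mer ACCGGCAT occurs at offsets 3 and 23.
ref = "TTGACCGGCATCGAAGTCCTAGGACCGGCATTTGCA"
print(nearby_repeated_kmers(ref, k=8, max_dist=40))
```

A read that starts with such a kmer has two candidate placements on the reference, which is the ambiguity described above.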
Trimmed reads
Quality trimming
Art's script: qual-trim.awk
Min_Qual=15
.         #elem    #elem0  #elem<0  min  median  max  sum        mean   stdev  n50
qualTrim  8627900  1266    0        0    23      33   193999902  22.49  7.43   26
!!! Avg clr drops from 33 to 22 bp !!!
Duplications still exist, though at a lower rate
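The logic of qual-trim.awk is not recorded here, so the following is only one plausible way to quality-trim with Min_Qual=15: keep the longest run of consecutive bases at or above the threshold. The function name qual_trim and the longest-run rule are assumptions, not a description of Art's script.

```python
def qual_trim(quals, min_qual=15):
    """Return (start, end) of the longest run with quality >= min_qual (end exclusive)."""
    best = (0, 0)
    run_start = None
    # A sentinel below the threshold closes a run that reaches the read end.
    for i, q in enumerate(list(quals) + [min_qual - 1]):
        if q >= min_qual:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start > best[1] - best[0]:
                best = (run_start, i)
            run_start = None
    return best


print(qual_trim([30, 30, 10, 25, 25, 25, 25, 5, 20]))
print(qual_trim([5, 5, 5]))
```

A read with no base at or above the threshold gets an empty clear range (0, 0), which would correspond to the #elem0 column in the table above.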
Alignment based trimming
PA14 ref assembly
Location: 2008_0109_AMOSCmp-PA14-relaxed-17-nucmer-redo2
Command: AMOScmp Pa -D MINCLUSTER=17 -D MINMATCH=17 -D MINOVL=5 -D MAJORITY=50 -D ALIGNWIGGLE=2
1.a align all reads to the reference using nucmer. I initially used minmatch=17, mincluster=17 (-c 17 -l 17)
8.62M reads: 4.10M HWI, 4.52M HWU
7.26M (84.16%, 83.50% HWI, 84.76% HWU) total reads align to the reference (-c 17 -l 17)
6.54M (75.82%, 74.49% HWI, 77.02% HWU) total reads align to the reference (-c 20 -l 20)
5.32M (73.27% of the aligned reads) aligned on their full length (as opposed to 0.94M 33bp quality untrimmed reads) (-c 33 -l 17)
3.69M (42.79%, 38.24% HWI, 46.92% HWU) align exactly (33 bp, 100%id)
1.b align all unaligned reads to the reference using nucmer. minmatch=14, mincluster=14 (-c 14 -l 14)
Combine 1.a & 1.b
2. trim reads according to their nucmer alignment coordinates; don't trim the ones adjacent to zero cvg regions
3. assemble them using the AMOScmp pipeline (ALIGN_WIGGLE=2 instead of 15)
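Step 2 can be sketched as follows. This is not the script that was actually used: trim_to_alignment, the 0-based half-open coordinate convention, and the rule that "adjacent" means touching or overlapping an uncovered interval are all assumptions made for illustration.

```python
def trim_to_alignment(read_len, aln, zero_cvg):
    """Return the new clear range (start, end) of a read, end exclusive.

    aln      = (read_start, read_end, ref_start, ref_end), 0-based half-open
    zero_cvg = list of (start, end) reference intervals with no read coverage
    """
    read_start, read_end, ref_start, ref_end = aln
    for z_start, z_end in zero_cvg:
        # Alignment touches or overlaps an uncovered interval: keep the read whole,
        # so its unaligned end can still extend into the gap.
        if ref_start <= z_end and z_start <= ref_end:
            return (0, read_len)
    # Otherwise trim the read to the part that aligned.
    return (read_start, read_end)


print(trim_to_alignment(33, (0, 28, 1000, 1028), [(1028, 1100)]))
print(trim_to_alignment(33, (0, 28, 1000, 1028), [(5000, 5100)]))
```

The first read borders a zero coverage region and keeps all 33 bp; the second is trimmed to its 28 aligned bases.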
Contigs stats:
desc  #elem  min    max     mean      stdev     sum
all   2053   17     170485  3011.84   11917.53  6183320
200   428    203    170485  14262.09  22852.74  6104175
10K   157    10240  170485  35468.89  26531.33  5568616
Singletons: 1127399
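The columns of these contig tables can be reproduced from a list of contig lengths. A minimal sketch, assuming the 200 and 10K rows mean "contigs of at least 200 bp / 10,000 bp" and that stdev is the sample standard deviation; neither is stated in the notes, and length_stats is a made-up helper.

```python
import statistics


def length_stats(lengths, min_len=0):
    """Summary stats for contigs of at least min_len bp, or None if none qualify."""
    kept = [n for n in lengths if n >= min_len]
    if not kept:
        return None
    return {
        "#elem": len(kept),
        "min": min(kept),
        "max": max(kept),
        "mean": round(statistics.mean(kept), 2),
        # sample standard deviation (assumption); undefined for a single value
        "stdev": round(statistics.stdev(kept), 2) if len(kept) > 1 else 0.0,
        "sum": sum(kept),
    }


# Toy lengths, not the real assembly.
lengths = [17, 150, 203, 9000, 10240, 170485]
for label, cutoff in [("all", 0), ("200", 200), ("10K", 10000)]:
    print(label, length_stats(lengths, cutoff))
```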
Data accuracy
Get all assembled bases with coverage>=20
count_bases=5926977 sum_bases=235619001 sum_errors=2455670
sum_errors/sum_bases=2455670/235619001=0.01042=1.042% error
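The same arithmetic, plus the mean coverage implied by the two base counts. The interpretation of the three numbers in the comments is an assumption read off their names.

```python
count_bases = 5926977    # assembled positions with coverage >= 20
sum_bases = 235619001    # read bases piled up over those positions (assumed meaning)
sum_errors = 2455670     # read bases disagreeing with the consensus (assumed meaning)

error_rate = sum_errors / sum_bases
mean_coverage = sum_bases / count_bases

print(f"error rate: {error_rate:.3%}")
print(f"mean coverage: {mean_coverage:.2f}x")
```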
PAO1 ref assembly
Same method except that 1.b was not used
Location: 2008_0124_AMOSCmp-PAO1-relaxed-17-nucmer-redo2
Contigs stats:
desc  #elem  min    max    mean      stdev    sum
all   2797   17     75626  2161.19   5812.2   6044851
200   865    200    75626  6893.96   8766.63  5963278
10K   204    10016  75626  19016.22  10368    3879309
Singletons: 1592525
PA14 & PAO1 merge
Use minimus to merge all contigs in the PA14 & PAO1 reference assemblies
Filter for merged contigs in which at least 2 adjacent PA14 contigs were joined by a PAO1 contig
desc  #elem  min    max     mean      stdev     sum
all   1850   17     236472  3400.31   16863.79  6290586
200   306    204    236472  20318.3   37147.91  6217401
10K   113    10520  236472  52647.45  45602.76  5949162
Singletons: 1066226
Ssake
Input: 1066226 singletons
Ssake version 3.0 run with default parameters:
-m Minimum number of reads needed to call a base during overhang consensus build up (default -m 16)
-o Minimum number of reads needed to call a base during an extension (default -o 2)
-r Minimum base ratio used to accept a overhang consensus base (default -r 0.6)
Contigs stats:
desc  #elem  min  max   mean    stdev   sum
all   19795  34   2866  67.78   102.73  1341825
200   879    200  2866  416.72  308.92  366304
200 bp contigs stats:
desc          #elem  min   mean    max     sum
contig_len    879    200   416.72  2866    366304
contig_reads  879    52    414.21  3157    364091
contig_cvg    879    7.91  31.58   146.52  27764.51
Singletons: 702007
Find contigs overlapping 12 bp or more and merge them using the EMBOSS merger program:
desc      #elem  min  max   mean    stdev   sum
original  879    200  2866  416.72  308.92  366304
new       670    200  4826  539.32  579.26  361350
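The idea behind the overlap merge can be sketched with exact suffix/prefix matching. This is a simplification and not how EMBOSS merger works internally (merger aligns the two sequences and tolerates mismatches); overlap_len and merge_pair are made-up names, and only the forward strand is considered.

```python
def overlap_len(a, b, min_ovl=12):
    """Length of the longest suffix of a equal to a prefix of b, or 0 if shorter than min_ovl."""
    for n in range(min(len(a), len(b)), min_ovl - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def merge_pair(a, b, min_ovl=12):
    """Merge b onto the end of a if they overlap by at least min_ovl bp, else return None."""
    n = overlap_len(a, b, min_ovl)
    return a + b[n:] if n else None


# Toy contigs sharing the 12 bp ACGTAACCGGTT.
left = "GGGTTTACGTACGTAACCGGTT"
right = "ACGTAACCGGTTCCCAAA"
print(overlap_len(left, right))
print(merge_pair(left, right))
```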
Gene assemblies
Dan Sommer reduced the number of contigs >=200bp from 306 to 120
/fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Annotation/PAb1.fasta
Contig stats:
desc     #elem  min  max     mean      stdev     sum
contigs  120    212  512638  51438.95  81999.91  6172675
Average Gap: 105 nt
Median Gap: 14 nt
Largest Gap: 1095 nt
927 singletons assembled
Pa-b1 AMOScmp reference assembly
Contig stats:
desc                     #elem  min  max     mean     stdev     sum
input(reference)         770    200  512638  8484.21  37210.58  6532844
output(AMOScmp contigs)  936    21   260827  6988.98  22288.49  6541689
Singletons: 1257963
SNPs: 1941
Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0215_AMOScmp-PAb1
Ssake on all reads
Default parameters
Contigs stats:
desc    #elem   min  max    mean   stdev   sum
length  185030  34   5490   77.21  141.23  14287079
reads   185030  2    13352  29.52  127.65  5463405
Singletons: 3,164,495
Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0214_ssake/