Pseudomonas aeruginosa: Difference between revisions
Line 277: | Line 277: | ||
* Identify related strains (completed & draft assemblies): PA14, PAO1, PA7, PACS2, PA2192, PAC3719 | * Identify related strains (completed & draft assemblies): PA14, PAO1, PA7, PACS2, PA2192, PAC3719 | ||
* Download & merge related strains into a multi-FASTA file | * Download & merge related strains into a multi-FASTA file | ||
* Align PAb1 reads to multi-FASTA file using nucmer (Ex: -l | * Align PAb1 reads to multi-FASTA file using nucmer (Ex: -l 17 -c 17) | ||
* Count the number of PAb1 reads aligned to each strain | * Count the number of PAb1 reads aligned to each strain | ||
strain #reads_aligned | strain #reads_aligned | ||
====================== | ====================== | ||
PA14 | PA14 7261629 | ||
PA2192 7193960 | |||
PACS2 | PACS2 7089455 | ||
PAO1 7036243 | |||
PAC3719 | PAC3719 6928744 | ||
PA7 | PA7 5393144 | ||
====================== | ====================== | ||
Revision as of 18:36, 7 March 2008
Strain PA-b1
Data
NCBI
Complete strains:
PA14 CP000438 6,537,648 bp, 66.29 %GC : most similar to PAb1 PAO1 AE004091 6,264,404 bp, 66.56 %GC : rearrangement vs PA14 PA7 CP000744 6,588,339 bp : no rearrangement vs PA14 PACS2 AAQW01000001.1 6,492,423 bp, 66.33% GC : rearrangement vs PA14
Incomplete strains(Broad):
Contig length stats: desc #contigs min max mean stdev sum C3719 124 2079 242903 49572 53770 6146998 : no rearrangement vs PA14 2192 82 2087 398738 83246 88681 6826253 : rearrangement vs PA14
Scaffold length stats: desc len GC% C3719 6222097 66.30 2192 6905121 65.99
PAb1 project
5.7Kbp repeat: contains 16S rRNA gene; identical in all the strains
Coordinates: PA14: 732540 738302 + 4951956 4957714 - 5535975 5541728 - 6312434 6318199 -
PAO1: 721550 727325 + 4788516 4794273 - 5264042 5269801 - 6039515 6045289 -
PA7: 806558 812299 + 4982874 4988600 - 5566182 5571924 - 6353418 6359148 -
CBCB
File location:
/fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/
Files:
s_1_sequence.txt # contains seq & qual; seq names: HWI% ; all reads are 33 bp s_1_0001_prb.txt # 27679 lines of 4*36 values in the -40..40 range s_1_tag.txt # unique seq count in s_1_sequence.txt s_7_sequence.txt # contains seq & qual; seq names HWU% ; all reads are 33 bp s_7_0300_prb.txt s_7_tag.txt wc -l s_1_sequence.txt s_7_sequence.txt 4105993 s_1_sequence.txt 4521907 s_7_sequence.txt 8627900 total
#all reads that contain at least 1 ambiguity grep -c N s_*seq s_1_sequence.seq:37933 s_7_sequence.seq:69669
#all reads that contain at least 2 ambiguities cat s_1_sequence.seq | perl -ane 'print $_ if(/N.+N/);' | wc -l : 18771 cat s_7_sequence.seq | perl -ane 'print $_ if(/N.+N/);' | wc -l : 59589 ...
s_1 reads have significantly fewer N's than s_7 reads !!! s_1 reads align better to ref than s_2 reads !!! Conflicts with avg qualities: 28.26 for s_1, 31.05 for s_2 !!! Base qualities: . #elem #elem0 #elem<0 min mean max sum s_1 135497769 1863585 3231899 -5 28.2624918053079 40 3829504586 s_7 149222931 1357384 3113806 -5 31.0508663175903 40 4633501282
Assemblies
Untrimmed reads
All contain ref duplications (this section should be deleted eventually) Duplication causes
1. Only the reference nucmer alignment coordinates are stored in the bank and used by casm-layout and make-consensus; the read alignment coordinates are not stored, just the read clear ranges (the ones given in the .afg file); read positions in the assembly layout are approximate 2. reads can be shifted from their original alignment position by make-consensus by up to 2^3*15=120 bp (constants ALIGN_WIGGLE=15; MAX_ALIGN_ATTEMPTS=3) 3. The Pseudomonas_aeruginosa reference contains multiple 2 copy 5-10 bp kmers, adjacent within a few dozen bp of each other; reads starting with these kmers align in 2 ways to the reference; if the reads contain errors or SNP's, the 2nd (shorter) alignment to the greedyly built consensus is chosen over the correct one and a duplication is introduced
Trimmed reads
Quality trimming
Art's script: qual-trim.awk
Min_Qual=15
. #elem #elem0 #elem<0 min median max sum mean stdev n50 qualTrim 8627900 1266 0 0 23 33 193999902 22.49 7.43 26
!!! Avg clr drops from 33 to 22 bp !!! Duplications still exist though at a lower rate
Alignment based trimming
PA14 is the most similar strain: more PAb1 reads aligned to PA14 than to PAO1 or PA7
grep -c "^>" PA14-Pa.l-17_c-17.delta PAO1-Pa.l-17_c-17.delta PA7-Pa.l-17_c-17.delta PA14-Pa.l-17_c-17.delta:7261629 PAO1-Pa.l-17_c-17.delta:7036243 PA7-Pa.l-17_c-17.delta:5393144
PA14 ref assembly
Location: 2008_0109_AMOSCmp-PA14-relaxed-17-nucmer-redo2 Command: AMOScmp Pa -D MINCLUSTER=17 -D MINMATCH=17 -D MINOVL=5 -D MAJORITY=50 -D ALIGNWIGGLE=2
1.a align all reads to the reference using nucmer. I initially used minmatch=17, mincluster=17 (-c 17 -l 17)
8.62M reads, 4.10M HWI 5.32M HWU 7.26M (84.16%, 83.50% HWI, 84.76% HWU) total reads align to the reference (-c 17 -l 17) 6.54M (75.82%, 74.49% HWI, 77.02% HWU) total reads align to the reference (-c 20 -l 20) 5.32M (73.27%) reads aligned on their full length (as opposed to 0.94M 33bp quality untrimmed reads) (-c 33 -l 17) 3.69M (42.79%, 38.24% HWI, 46.92% HWU) align exactly (33 bp, 100%id)
1.b align all unaligned reads to the reference using nucmer. minmatch=14, mincluster=14 (-c 14 -l 14) Combine 1.a & 1.b
2. trim reads according to their nucmer alignment coordinates; don't trim the ones adjacent to zero cvg regions
3. assemble them using the AMOScmp pipeline(ALIGN_WIGGLE=2 instead of 15)
Contigs stats: desc #elem min max mean stdev sum all 2053 17 170485 3011.84 11917.53 6183320 200 428 203 170485 14262.09 22852.74 6104175 10K 157 10240 170485 35468.89 26531.33 5568616 Singletons: 1127399
Data accuracy Get all assembled bases with coverage>=20 count_bases=5926977 sum_bases=235619001 sum_errors=2455670
sum_errors/sum_bases=2455670/235619001=0.01042=1.042% error
SNP's in the 157 >10K contigs: T<->C & A<-G> the most common
PA14 PAb1 count T C 6153 C T 6100 G A 6075 A G 6021 G C 1357 C G 1309 G T 859 C A 846 T G 840 A C 836 A T 338 T A 308 G . 241 C . 196 . C 182 . G 152 A . 101 T . 88 . T 74 . A 70 Total 32146
PAO1 ref assembly
Same method except that 1.b was not used
Location: 2008_0124_AMOSCmp-PAO1-relaxed-17-nucmer-redo2
Contigs stats: desc #elem min max mean stdev sum all 2797 17 75626 2161.19 5812.2 6044851 200 865 200 75626 6893.96 8766.63 5963278 10K 204 10016 75626 19016.22 10368 3879309 Singletons: 1592525
PA14 & PAO1 merge
Use minimus to merge all contigs in the PA14 & PAO1 reference assemblies
Filter contigs that contain at least 2 adjacent PA14 merged by PAO1
desc #elem min max mean stdev sum all 1850 17 236472 3400.31 16863.79 6290586 200 306 204 236472 20318.3 37147.91 6217401 10K 113 10520 236472 52647.45 45602.76 5949162 Singletons: 1066226
Ssake
Input: 1066226 singletons
Ssake version 3.0 run with default parameters: -m Minimum number of reads needed to call a base during overhang consensus build up (default -m 16) -o Minimum number of reads needed to call a base during an extension (default -o 2) -r Minimum base ratio used to accept a overhang consensus base (default -r 0.6)
Contigs stats: desc #elem min max mean stdev sum all 19795 34 2866 67.78 102.73 1341825 200 879 200 2866 416.72 308.92 366304
200 bp contigs stats desc. #elem min mean max sum contig_len 879 200 416.72 2866 366304 contig_reads 879 52 414.21 3157 364091 contig_cvg 879 7.91 31.58 146.52 27764.51 Singletons: 702007
Find contigs overlapping 12 bp or more and merge them using EMBOSS merger program: desc #elem min max mean stdev sum original 879 200 2866 416.72 308.92 366304 new 670 200 4826 539.32 579.26 361350
Gene assemblies
Dan Sommer reduced number of contigs >=200bp from 306 to 120 /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Annotation/PAb1.fasta
Contig stats: desc #elem min max mean stdev sum contigs 120 212 512638 51438.95 81999.91 6172675
Average Gap: 105 nt bases Median Gap: 14 nt bases Largest Gap: 1095 nt bases
927 singletons assembled
Pa-b1 AMOScmp reference assembly
Contig stats: desc #elem min max mean stdev sum input(reference|final) 770 200 512638 8484.21 37210.58 6532844 output(AMOScmp contigs) 936 21 260827 6988.98 22288.49 6541689 Singletons: 1257963 Snp's: 1941
Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0215_AMOScmp-PAb1
Use ALL strains as reference
Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0228_AMOSCmp-ALL
Steps:
* Identify related strains (completed & draft assemblies): PA14, PAO1, PA7, PACS2, PA2192, PAC3719 * Download & merge related strains into a multi-FASTA file * Align PAb1 reads to multi-FASTA file using nucmer (Ex: -l 17 -c 17) * Count the number of PAb1 reads aligned to each strain
strain #reads_aligned ====================== PA14 7261629 PA2192 7193960 PACS2 7089455 PAO1 7036243 PAC3719 6928744 PA7 5393144 ======================
* Sort the strains in descending order according to the number of reads aligned to each one * The strain with most hits is PA14 => closest relative
* Assuming the order is PA14, PAO1, PA7, PACS2, PA2192, PAC3719 ... * Align strains to one other; identify unique regions in each genome (regions which are not present in the previously considered ones) Unique regions length summary:
Table 1: reference #regions min max mean stdev sum ======================================================================================= PA14 1 6537648 6537648 6537648 0 6537648 PAO1-PA14 115 1 19263 2038.58 3843.93 234437 PA7-PA14 511 1 46020 1876.96 4872.51 959129 PACS2-PA14 147 1 90591 3054.19 8766.3 448967 PA2192-PA14 231 1 92553 3746.3 11475 865396 PAC3719-PA14 272 1 14058 1164.78 2089.85 316822 ======================================================================================= ALL 1277 9362399
Table 2: reference #regions min max mean stdev sum ======================================================================================= PA14 1 6537648 6537648 6537648 0 6537648 PAO1-PA14 115 1 19263 2038.58 3843.93 234437 PA7-PAO1-PA14 487 1 46020 1830.82 4910.86 891614 PACS2-PA7-PAO1-PA14 71 2 79031 4112.94 10618.2 292019 PA2192-PACS2-PA7-PAO1-PA14 181 1 92066 3439.9 9500.07 622622 #some N's maybe? PAC3719-PA2192-PACS2-PA7-PAO1-PA14 173 1 11425 998.16 1888.94 172682 #some N's maybe? ======================================================================================= ALL 1028 2 6537648 8513.66 203929 8752049
Table 3: reference #regions min max mean stdev sum ======================================================================================= PA14 1 6537648 6537648 6537648 0 6537648 PA14-PAO1 123 1 43690 4009.28 7903.16 493142 PA14-PA7 516 1 31464 1768.16 3534.88 912375 PA14-PACS2 146 2 43690 3259.15 6129.12 475836 PA14-PA2192 222 1 34854 2189.76 4338.6 486127 PA14-PAC3719 277 1 43690 2206.38 5214.55 611170 =======================================================================================
* Extract unique regions(Table 2) and add them to the most similar genome
* Assemble using all these regions as reference => 10,201 contigs; 1,052,178 singletons
reference #ctgs min max mean stdev sum ======================================================================================= PA14 2056 17 57476 2992.27 7078.25 6152122 PAO1 877 17 8159 95.16 482.74 83456 PA7 4723 17 6822 41.7 128.87 196996 PACS2 914 17 3346 66.96 148.47 61207 PA2192 1528 17 49826 134.88 1734.19 206109 PAC3719 103 17 693 42.11 85.89 4338 ======================================================================================= ALL 10201 17 57476 657.21 3457.1 6704228 * Compare ALL assembly contigs to PA14 ref assembly :
reference #ctgs min max mean stdev sum #singletons ======================================================================================================== ALL 10201 17 57476 657.21 3457.1 6704228 1052178 ALL(200bp+) 1020 200 57476 6235.17 9220.76 6359881 ALL(PA14) 2056 17 57476 2992.27 7078.25 6152122 PA14 1984 17 82196 3103.21 8274.53 6156788 1375004 ======================================================================================================== ALL-PA14 8217 ...... 547440 # 0.5M extra bp ALL(PA14)-PA14 72 ...... -4666 # 4666 bp which were originally mapped to PA14 are now mapped to other strains
* Compare ALL assembly contigs to ssake 200bp+ contigs : ssake 200bp+ contigs: strain #ctgs min max mean stdev sum PAb1 879 200 2866 416.72 308.92 366304
ALL vs ssake alignments: desc #elem min max mean stdev sum ALL 265 23 49826 912.9 4187.17 241919 ssake 306 200 2358 379.85 257.04 116235
ALL vs ssake CONTAINED alignments: desc #elem min max mean stdev sum ALL 164 23 49826 1264.71 5268.16 207413 ssake 260 200 2358 368.09 259.02 95704
~300 out of 879 ssake contigs were assembled using this method. The ALL assembly contigs ssake contigs align to are longer than the ssake contigs!
* Compare ALL assembly contigs to final de novo contigs :
id len #de novo ctgs it CONTAINS ====================================================================== 1 2361_PA2192_2249428_2341494 49826 52 2 2360_PA2192_2249428_2341494 42258 49 3 2364_PA2192_2346558_2361947 15388 17 4 10190_PAO1_747565_761293 8159 7 5 2429_PA2192_2405271_2412668 6103 7 6 3259_PA2192_6656583_6693732 2594 6 7 2367_PA2192_2374235_2379148 4914 5 8 6917_PA7_4563267_4570077 6822 5 9 2958_PA2192_4496915_4500004 3090 4 10 2375_PA2192_2383706_2386485 2789 3 11 2366_PA2192_2366515_2369415 2901 2 12 6919_PA7_4579629_4582081 1085 2 13 5364_PA7_3138605_3164438 1931 2 14 8663_PACS2_3494057_3573088 449 2 15 9322_PACS2_982458_983976 1519 2 16 2368_PA2192_2379432_2380661 1230 2 17 2716_PA2192_3268239_3292117 567 2 18 9587_PAO1_2439014_2458239 3513 1 ... 49 8384_PAC3719_5517906_5518496 591 1 ====================================================================== Total 201
Ssake on all reads
Default parameters
Contigs stats: desc #elem min max mean stdev sum length 185030 34 5490 77.21 141.23 14287079 reads 185030 2 13352 29.52 127.65 5463405 Singletons: 3,164,495
Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0214_ssake/
PA14 Sanger maq on all reads
Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0206_PA14_maq
Contig stats: min_len #contigs min max mean stdev sum 0 991 33 155551 6199.79 17445.05 6143996 200 368 200 155551 16581.7 25475.92 6102067 10K 149 10048 155551 38164.35 28513.51 5686489
SNP's in the contigs: T<->C & A<-G> the most common
PA14 PAb1 count T C 6900 A G 6804 C T 6788 G A 6758 G C 1611 C G 1542 A C 1010 C A 980 G T 967 T G 962 A T 398 T A 369 G R 155 C Y 140 A R 83 C S 76 T Y 74 G S 67 C M 62 G K 48 A M 25 T K 17 A W 12 T W 11 C K 7 C R 4 G M 3 N C 2 A K 2 T S 2 T R 2 A Y 2 Total 35883
Annotation
5769 genes /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Annotation/merge/gene_list.ptt