Pseudomonas aeruginosa: Difference between revisions
Line 737: | Line 737: | ||
all 561 201 253489 12123 31042 6800731 | all 561 201 253489 12123 31042 6800731 | ||
min OVL=20 : use contig:singleton & singleton:singleton alignments & adjacent contig ovelaps | min OVL=20 : use contig:singleton & singleton:singleton alignments & adjacent contig ovelaps (best) | ||
All: | All: | ||
Line 751: | Line 751: | ||
all 538 200 253489 12662 33969 6812086 | all 538 200 253489 12662 33969 6812086 | ||
Locations: | |||
/fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/AMOScmp-velvet2 | |||
/fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/AMOScmp-velvet2/AMOScmp-avgReads.unique/ | |||
/fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/AMOScmp-velvet2/AMOScmp-avgReads.unique/minimus3.20 (best) | |||
==== PA14 Sanger maq on all reads ==== | ==== PA14 Sanger maq on all reads ==== |
Revision as of 18:15, 6 May 2008
Strain PA-b1
Data
NCBI
Strains:
PA14 CP000438 6,537,648 bp, 66.29 %GC : most similar to PAb1 PAO1 AE004091 6,264,404 bp, 66.56 %GC : rearrangement vs PA14 PA7 CP000744 6,588,339 bp : no rearrangement vs PA14 PACS2 AAQW01000001.1 6,492,423 bp, 66.33% GC : rearrangement vs PA14 C3719 . 6,222,097 bp, 66.30% GC : no rearrangement vs PA14; incomplete 2192 . 6,905,121 bp, 65.99% GC : rearrangement vs PA14; incomplete
Incomplete strains details(Broad):
Contig length stats: desc #contigs min max mean stdev sum C3719 124 2079 242903 49572 53770 6146998 2192 82 2087 398738 83246 88681 682625
PAb1 project
5.7Kbp repeat: contains 16S rRNA gene; identical in all the strains
Coordinates: PA14: 732540 738302 + 4951956 4957714 - 5535975 5541728 - 6312434 6318199 -
PAO1: 721550 727325 + 4788516 4794273 - 5264042 5269801 - 6039515 6045289 -
PA7: 806558 812299 + 4982874 4988600 - 5566182 5571924 - 6353418 6359148 -
CBCB
File location:
/fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/
Files:
s_1_sequence.txt # contains seq & qual; seq names: HWI% ; all reads are 33 bp s_1_0001_prb.txt # 27679 lines of 4*36 values in the -40..40 range s_1_tag.txt # unique seq count in s_1_sequence.txt s_7_sequence.txt # contains seq & qual; seq names HWU% ; all reads are 33 bp s_7_0300_prb.txt s_7_tag.txt wc -l s_1_sequence.txt s_7_sequence.txt 4105993 s_1_sequence.txt 4521907 s_7_sequence.txt 8627900 total
#all reads that contain at least 1 ambiguity grep -c N s_*seq s_1_sequence.seq:37933 s_7_sequence.seq:69669
#all reads that contain at least 2 ambiguities cat s_1_sequence.seq | perl -ane 'print $_ if(/N.+N/);' | wc -l : 18771 cat s_7_sequence.seq | perl -ane 'print $_ if(/N.+N/);' | wc -l : 59589 ...
s_1 reads have significantly fewer N's than s_7 reads !!! s_1 reads align better to ref than s_2 reads !!! Conflicts with avg qualities: 28.26 for s_1, 31.05 for s_2 !!! Base qualities: . #elem #elem0 #elem<0 min mean max sum s_1 135497769 1863585 3231899 -5 28.2624918053079 40 3829504586 s_7 149222931 1357384 3113806 -5 31.0508663175903 40 4633501282
Read trimming
No trimming
If no trimming is done and the default AMOScmp pipeline is used (ALIGNWIGGLE=15), the contigs end up containing reference duplications Make-consensus will also take a very long time to run
Duplication causes:
- Only the reference nucmer alignment coordinates are stored in the bank and used by casm-layout and make-consensus; the read alignment coordinates are not stored, just the read clear ranges (the ones given in the .afg file); read positions in the assembly layout are approximate
- reads can be shifted from their original alignment position by make-consensus by up to 2^3*15=120 bp (constants ALIGN_WIGGLE=15; MAX_ALIGN_ATTEMPTS=3)
- The Pseudomonas_aeruginosa reference contains multiple 2 copy 5-10 bp kmers, adjacent within a few dozen bp of each other; reads starting with these kmers align in 2 ways to the reference; if the reads contain errors or SNP's, the 2nd (shorter) alignment to the greedyly built consensus is chosen over the correct one and a duplication is introduced
Quality trimming
Art's script: qual-trim.awk
Min_Qual=15
. #elem #elem0 #elem<0 min median max sum mean stdev n50 qualTrim 8627900 1266 0 0 23 33 193999902 22.49 7.43 26
!!! Avg clr drops from 33 to 22 bp !!! Duplications still exist though at a lower rate
Alignment based trimming
PA14 is the most similar strain: more PAb1 reads aligned to PA14 than to PAO1 or PA7
grep -c "^>" PA14-Pa.l-17_c-17.delta PAO1-Pa.l-17_c-17.delta PA7-Pa.l-17_c-17.delta PA14-Pa.l-17_c-17.delta:7261629 PAO1-Pa.l-17_c-17.delta:7036243 PA7-Pa.l-17_c-17.delta:5393144
Assemblies
Comparative assemblies
PA14 ref assembly
Location: 2008_0109_AMOSCmp-PA14-relaxed-17-nucmer-redo2 Command: AMOScmp Pa -D MINCLUSTER=17 -D MINMATCH=17 -D MINOVL=5 -D MAJORITY=50 -D ALIGNWIGGLE=2
- a align all reads to the reference using nucmer. I initially used minmatch=17, mincluster=17 (nucmer -c 17 -l 17)
8.62M reads, 4.10M HWI 5.32M HWU 8.30M (...) total reads align to the reference (nucmer -c 14 -l 14) 7.59M (...) total reads align to the reference (nucmer -c 16 -l 16) 7.25M (...) total reads align to the reference (nucmer -c 16 -l 16; delta-filter -l 20) 7.26M (84.16%, 83.50% HWI, 84.76% HWU) total reads align to the reference (nucmer -c 17 -l 17) 6.54M (75.82%, 74.49% HWI, 77.02% HWU) total reads align to the reference (nucmer -c 20 -l 20) 5.17M (...) total reads align to the reference (nucmer -c 28 -l 14) 5.32M (73.27%) reads aligned on their full length (as opposed to 0.94M 33bp quality untrimmed reads) (nucmer -c 33 -l 17) 3.69M (42.79%, 38.24% HWI, 46.92% HWU) align exactly (33 bp, 100%id)
- b align all unaligned reads to the reference using nucmer. minmatch=14, mincluster=14 (-c 14 -l 14)
- Combine 1a & 1b
- trim reads according to their nucmer alignment coordinates; don't trim the ones adjacent to zero cvg regions
- assemble them using the AMOScmp pipeline(ALIGN_WIGGLE=2 instead of 15)
Contigs stats: desc #elem min max mean stdev sum all 2053 17 170485 3011.84 11917.53 6183320 200 428 203 170485 14262.09 22852.74 6104175 10K 157 10240 170485 35468.89 26531.33 5568616 Singletons: 1127399
Data accuracy Get all assembled bases with coverage>=20 count_bases=5926977 sum_bases=235619001 sum_errors=2455670
sum_errors/sum_bases=2455670/235619001=0.01042=1.042% error
SNP's in the 157 >10K contigs: T<->C & A<-G> the most common
PA14 PAb1 count T C 6153 C T 6100 G A 6075 A G 6021 G C 1357 C G 1309 G T 859 C A 846 T G 840 A C 836 A T 338 T A 308 G . 241 C . 196 . C 182 . G 152 A . 101 T . 88 . T 74 . A 70 Total 32146
PA14 ref assembly (2)
Parameters:
MINOVL=10 # previously 5, allowed for missasemblies MAJORITY=70 # previously 50 to reduce the number of negative gaps
Contig stats:
desc #elem min max mean stdev sum all 1843 20 127984 3350.38 10776.95 6174765 200bp+ 542 200 127984 11245.21 17520.92 6094904 1107200 singletons
PAO1 ref assembly
Same method except that 1.b was not used
Location: 2008_0124_AMOSCmp-PAO1-relaxed-17-nucmer-redo2
Contigs stats: desc #elem min max mean stdev sum all 2797 17 75626 2161.19 5812.2 6044851 200 865 200 75626 6893.96 8766.63 5963278 10K 204 10016 75626 19016.22 10368 3879309 Singletons: 1592525
PA14 & PAO1 merge
Use minimus to merge all contigs in the PA14 & PAO1 reference assemblies
Filter contigs that contain at least 2 adjacent PA14 merged by PAO1
desc #elem min max mean stdev sum all 1850 17 236472 3400.31 16863.79 6290586 200 306 204 236472 20318.3 37147.91 6217401 10K 113 10520 236472 52647.45 45602.76 5949162 Singletons: 1066226
De novo assembly of singletons using SSAKE
Input: 1066226 singletons
Ssake version 3.0 run with default parameters: -m Minimum number of reads needed to call a base during overhang consensus build up (default -m 16) -o Minimum number of reads needed to call a base during an extension (default -o 2) -r Minimum base ratio used to accept a overhang consensus base (default -r 0.6)
Contigs stats: desc #elem min max mean stdev sum all 19795 34 2866 67.78 102.73 1341825 200 879 200 2866 416.72 308.92 366304
200 bp contigs stats desc. #elem min mean max sum contig_len 879 200 416.72 2866 366304 contig_reads 879 52 414.21 3157 364091 contig_cvg 879 7.91 31.58 146.52 27764.51 Singletons: 702007
Find contigs overlapping 12 bp or more and merge them using EMBOSS merger program: desc #elem min max mean stdev sum original 879 200 2866 416.72 308.92 366304 new 670 200 4826 539.32 579.26 361350
Gene assemblies
Dan Sommer reduced number of contigs >=200bp from 306 to 120 /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Annotation/PAb1.fasta
Contig stats: desc #elem min max mean stdev sum contigs 120 212 512638 51438.95 81999.91 6172675
Average Gap: 105 nt bases Median Gap: 14 nt bases Largest Gap: 1095 nt bases
927 singletons assembled
Pa-b1 AMOScmp reference assembly
Contig stats: desc #elem min max mean stdev sum input(reference|final) 770 200 512638 8484.21 37210.58 6532844 output(AMOScmp contigs) 936 21 260827 6988.98 22288.49 6541689 Singletons: 1257963 Snp's: 1941
Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0215_AMOScmp-PAb1
Use ALL strains as reference
Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0228_AMOSCmp-ALL
Steps:
* Identify related strains (completed & draft assemblies): PA14, PAO1, PA7, PACS2, PA2192, PAC3719 * Download & merge related strains into a multi-FASTA file * Align PAb1 reads to multi-FASTA file using nucmer (Ex: -l 17 -c 17) * Count the number of PAb1 reads aligned to each strain
strain #reads_aligned ====================== PA14 7261629 PA2192 7193960 PACS2 7089455 PAO1 7036243 PAC3719 6928744 PA7 5393144 ======================
* Sort the strains in descending order according to the number of reads aligned to each one * The strain with most hits is PA14 => closest relative
* Assuming the order is PA14, PAO1, PA7, PACS2, PA2192, PAC3719 ... * Align strains to one other; identify unique regions in each genome (regions which are not present in the previously considered ones) Unique regions length summary:
Table 1: reference #regions min max mean stdev sum ======================================================================================= PA14 1 6537648 6537648 6537648 0 6537648 PAO1-PA14 115 1 19263 2038.58 3843.93 234437 PA7-PA14 511 1 46020 1876.96 4872.51 959129 PACS2-PA14 147 1 90591 3054.19 8766.3 448967 PA2192-PA14 231 1 92553 3746.3 11475 865396 PAC3719-PA14 272 1 14058 1164.78 2089.85 316822 ======================================================================================= ALL 1277 9362399
Table 2: reference #regions min max mean stdev sum ======================================================================================= PA14 1 6537648 6537648 6537648 0 6537648 PAO1-PA14 115 1 19263 2038.58 3843.93 234437 PA7-PAO1-PA14 487 1 46020 1830.82 4910.86 891614 PACS2-PA7-PAO1-PA14 71 2 79031 4112.94 10618.2 292019 PA2192-PACS2-PA7-PAO1-PA14 181 1 92066 3439.9 9500.07 622622 #some N's maybe? PAC3719-PA2192-PACS2-PA7-PAO1-PA14 173 1 11425 998.16 1888.94 172682 #some N's maybe? ======================================================================================= ALL 1028 2 6537648 8513.66 203929 8752049
Table 3: reference #regions min max mean stdev sum ======================================================================================= PA14 1 6537648 6537648 6537648 0 6537648 PA14-PAO1 123 1 43690 4009.28 7903.16 493142 PA14-PA7 516 1 31464 1768.16 3534.88 912375 PA14-PACS2 146 2 43690 3259.15 6129.12 475836 PA14-PA2192 222 1 34854 2189.76 4338.6 486127 PA14-PAC3719 277 1 43690 2206.38 5214.55 611170 =======================================================================================
* Extract unique regions(Table 2) and add them to the most similar genome
* Assemble using all these regions as reference => 10,201 contigs; 1,052,178 singletons
reference #ctgs min max mean stdev sum ======================================================================================= PA14 2056 17 57476 2992.27 7078.25 6152122 PAO1 877 17 8159 95.16 482.74 83456 PA7 4723 17 6822 41.7 128.87 196996 PACS2 914 17 3346 66.96 148.47 61207 PA2192 1528 17 49826 134.88 1734.19 206109 PAC3719 103 17 693 42.11 85.89 4338 ======================================================================================= ALL 10201 17 57476 657.21 3457.1 6704228 * Compare ALL assembly contigs to PA14 ref assembly :
reference #ctgs min max mean stdev sum #singletons ======================================================================================================== ALL 10201 17 57476 657.21 3457.1 6704228 1052178 ALL(200bp+) 1020 200 57476 6235.17 9220.76 6359881 ALL(PA14) 2056 17 57476 2992.27 7078.25 6152122 PA14 1984 17 82196 3103.21 8274.53 6156788 1375004 ======================================================================================================== ALL-PA14 8217 ...... 547440 # 0.5M extra bp ALL(PA14)-PA14 72 ...... -4666 # 4666 bp which were originally mapped to PA14 are now mapped to other strains
* Compare ALL assembly contigs to ssake 200bp+ contigs : ssake 200bp+ contigs: strain #ctgs min max mean stdev sum PAb1 879 200 2866 416.72 308.92 366304
ALL vs ssake alignments: desc #elem min max mean stdev sum ALL 265 23 49826 912.9 4187.17 241919 ssake 306 200 2358 379.85 257.04 116235
ALL vs ssake CONTAINED alignments: desc #elem min max mean stdev sum ALL 164 23 49826 1264.71 5268.16 207413 ssake 260 200 2358 368.09 259.02 95704
~300 out of 879 ssake contigs were assembled using this method. The ALL assembly contigs ssake contigs align to are longer than the ssake contigs!
* Compare ALL assembly contigs to final de novo contigs :
id len #de novo ctgs it CONTAINS ====================================================================== 1 2361_PA2192_2249428_2341494 49826 52 2 2360_PA2192_2249428_2341494 42258 49 3 2364_PA2192_2346558_2361947 15388 17 4 10190_PAO1_747565_761293 8159 7 5 2429_PA2192_2405271_2412668 6103 7 6 3259_PA2192_6656583_6693732 2594 6 7 2367_PA2192_2374235_2379148 4914 5 8 6917_PA7_4563267_4570077 6822 5 9 2958_PA2192_4496915_4500004 3090 4 10 2375_PA2192_2383706_2386485 2789 3 11 2366_PA2192_2366515_2369415 2901 2 12 6919_PA7_4579629_4582081 1085 2 13 5364_PA7_3138605_3164438 1931 2 14 8663_PACS2_3494057_3573088 449 2 15 9322_PACS2_982458_983976 1519 2 16 2368_PA2192_2379432_2380661 1230 2 17 2716_PA2192_3268239_3292117 567 2 18 9587_PAO1_2439014_2458239 3513 1 ... 49 8384_PAC3719_5517906_5518496 591 1 ====================================================================== Total 201
De-novo assemblies
SSAKE
version 3.0 with default parameters
Contigs stats: desc #elem min max mean stdev sum length 185030 34 5490 77.21 141.23 14287079 reads 185030 2 13352 29.52 127.65 5463405 Singletons: 3,164,495 (singletons file + ambiguous reads)
Run times:
18459.472u 31.849s 5:08:59.82 99.7% 0+0k 0+0io 0pf+0w
Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0214_ssake/
Latest version is 3.2: run on the same dataset generated fewer contigs and more singletons; also took longer to run
Velvet
Version 0.5.05 Ovl len=23
Commands:
host: sycamore (Opteron) velveth . 23 Pa.seq velvetg . -read_trkg yes
Run times:
velveth : 88.614u 22.865s 2:16.35 81.7% 0+0k 0+0io 0pf+0w velvetg : 439.939u 19.258s 8:19.77 91.8% 0+0k 0+0io 0pf+0w total : 528.553u 42.123s 10:36.12
Output files (used)
contigs.fa : contig FASTA file
Contigs stats:
desc #elem min max mean stdev sum contig 10684 45 16239 640.34 825.24 6841458 contig(200bp+) 7382 200 16239 877.05 896.35 6474426 1273164 singletons;
Contigs aligned to the 4 copy 5.7Kbp PA14 rRNA repeat:
[S1] [E1] | [S2] [E2] | [LEN 1] [LEN 2] | [% IDY] | [LEN R] [LEN Q] | [COV R] [COV Q] | [TAGS] =============================================================================================================================== 23 409 | 1 388 | 387 388 | 97.94 | 5763 388 | 6.72 100.00 | rRNA 856 [CONTAINS] 388 2132 | 1 1745 | 1745 1745 | 99.94 | 5763 1745 | 30.28 100.00 | rRNA 642 [CONTAINS] 2111 2279 | 169 1 | 169 169 | 100.00 | 5763 169 | 2.93 100.00 | rRNA 3433 [CONTAINS] 2258 5763 | 1 3506 | 3506 3506 | 99.89 | 5763 3506 | 60.84 100.00 | rRNA 260 [CONTAINS]
Are there some velvet contigs contained in each other? Many contigs seem to overlap by ~ 20bp
Some contigs share end reads labeled as D(uplicates)
listReadPlacedStatus Pa.bnk | awk '{print $3}' | count.pl # total P 7322684 S 1273164 D 32052 # 33 of them belong to 3 contigs Total 8627900
About ~4607 contig pairs could be merged:
grep D Pa.status.detailed | awk '{print $5,$6}' | sort -u | wc -l 4607
Example:
Contigs 1915 & 3551 share 15 Duplicate reads cat Pa.status | grep 1915 | grep 3551$ | wc -l 15 They align to adjacent pos in PA14 : 3743092 3745646 | 2555 1 | 2555 2555 | 99.49 | 6537648 2555 | 0.04 100.00 | 1 1915 [CONTAINS] 3745625 3746382 | 758 1 | 758 758 | 99.60 | 6537648 758 | 0.01 100.00 | 1 3551 [CONTAINS]
411870 out of 1273164 singletons (~33%) align to PA14 (nucmer -maxmatch -c 20) 304162 out of 1273164 singletons (~25%) are also singletons in the AMOScmp-PA14 assembly; probably the common ones are low quality sequences that just don't assemble
9859 of the 10684 contigs (~92%) align to PA14 (nucmer -maxmatch -c 40):
10684 Pa.fasta 9859 Pa.qry_hits # 825 not aligned 9682 Pa.CBE.qry_hits # 177 partially aligned
Partially aligned
#elem min max mean stdev sum 177 78 15342 2036.52 2145.37 360465 All these contigs have AMOScmp assembly singletons aligned to them!!! These velvet contigs seem correct.
Example: There is a large insertion in PA14 at position 6700 PA-b1 reads vs PA14 ref find-query-breaks -B Pa.filter-q.delta => 3 reads end at base 6700 velvet ctgs vs PA14 ref: ... 6177 6700 | 1 524 | 524 524 | 99.43 | 6537648 875 | 0.01 59.89 | 1 900 # 431 out of 1054 seqs in this contigs are singletons in AMOScmp assembly 6694 8045 | 743 2094 | 1352 1352 | 99.63 | 6537648 2097 | 0.02 64.47 | 1 1970 # 895 out of 2354 seqs in this contigs are singletons in AMOScmp assembly; unaligned regions aligns at low %id to a Pseudomonas fluorescens hypothetical protein ... 66764 67073 | 1 310 | 310 310 | 100.00 | 6537648 1282 | 0.00 24.18 | 1 320 67181 67322 | 1136 1277 | 142 142 | 98.59 | 6537648 1282 | 0.00 11.08 | 1 320 ... 116429 117249 | 1 821 | 821 821 | 99.03 | 6537648 1457 | 0.01 56.35 | 1 1955 120490 120573 | 84 1 | 84 84 | 100.00 | 6537648 1693 | 0.00 4.96 | 1 1827 ... 234641 235980 | 3294 1955 | 1340 1340 | 99.40 | 6537648 3294 | 0.02 40.68 | 1 139 235981 237442 | 1466 5 | 1462 1462 | 99.32 | 6537648 3294 | 0.02 44.38 | 1 139 ... 286003 287915 | 1216 3128 | 1913 1913 | 99.22 | 6537648 3128 | 0.03 61.16 | 1 3068 ...
Not aligned:
#elem min max mean stdev sum 825 45 16239 618.3 1209.37 510101 # the longest velvet ctg does not align
Not CONTAINED:
#elem min max mean stdev sum 1002 45 16239 868.82 1518.56 870566
Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24
Version 0.5.07 : does not "seem" to be an improvement over Version 0.5.05 Ovl len=23
desc #elem min max mean stdev sum contig 11230 45 16239 611.64 814.09 6868784 contig(200bp+) 7353 200 16239 878.61 897.06 6460467
Edena
Version 2.1.1 , default parameters
Contigs stats: desc #elem min max mean stdev sum contig 11180 100 11300 552.36 610.52 6175460
3955865 singletons
Contig don't seem to overlap much
Run times:
ovl step: 1655.561u 7.560s 27:54.62 99.3% 0+0k 0+0io 0pf+0w assembly step: 56.141u 3.791s 1:03.82 93.9% 0+0k 0+0io 0pf+0w total: 1711.702u 11.351s 28:58.44
Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0312_edena/
SSAKE/Edena/Velvet summary
host: sycamore.umiacs.umd.edu (Linux 64)
program version run-time SSAKE 3.0 8459.472u 31.849s 5:08:59.82 Velvet 0.5.05 528.553u 42.123s 10:36.12 Edena 2.11 1711.702u 11.351s 28:58.44
Input: 8627900 33 bp Solexa reads(2 lanes)
Output: contigs+singletons
Contig statistics:
program #seqs min max mean med n50 sum SSAKE 185030 34 5490 77 44 87 14287079 Velvet 10684 45 16239 640 363 1184 6841458 Edena 11180 100 11300 552 356 837 6175460
200bp+ contig statistics:
program #seqs min max mean med n50 sum SSAKE 12532 200 5490 486 385 549 6090567 Edena 8316 200 11300 692 488 902 5759209 Velvet 7382 200 16239 877 586 1252 6474426
Singleton statistics:
program #seqs SSAKE 3164495 Velvet 1273164 Edena 3955865
Other assemblies
AMOScmp on the velvet contigs (1)
Input sequence is not trimmed
AMOScmp -D MINCLUSTER=40 -D MINOVL=5 -D CONSERR=0.12
desc #elem min max mean stdev sum reads(velvet ctgs) 10684 45 16239 640 . 6841458 ctg 2559 45 26677 2223.04 2731.82 5688784 singl 1016 45 16239 822.7 1485.11 835866
Problems:
* 33 singletons are CONTAINED in the reference. Why we they not assembled? show-coords -H -I 88 Pa.singletons.delta | ~/bin/nucmerAnnotate.pl -ignore 20 | grep -c CONT # 33 show-coords -H -I 98 Pa.singletons.delta | ~/bin/nucmerAnnotate.pl -ignore 20 | grep -c CONT # 21
Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24/AMOScmp
AMOScmp on the velvet contigs (2)
Input sequence is trimmed
AMOScmp-shortReads-alignmentTrimmed -D MINCLUSTER=40 -D MINMATCH=20 -D CONSERR=0.12 Pa
desc #elem min max mean stdev sum reads(velvet ctgs) 10684 45 16239 640 . 6841458 ctg 2657 45 27957 2192.79 2697.71 5826256 singl 1013 45 16239 792.2 1431.28 802502
Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24/AMOScmp.redo
AMOScmp/minimus on the AMOScmp & velvet contigs (3)
Input:
AMOScmp PA14 ref assembly (2) contigs velvet contigs !!! No singletons
AMOScmp steps:
- Align AMOScmp & velvet contigs (nucmer -c 40) to each other
- Identify velvet contigs CONTAINED in AMOScmp contigs(max trimming of 20)
- Remove CONTAINED contigs from the velvet set
- Create alignment file based to the AMOS.scaff file (if nucmer is repeat contigs are collapsed/no placed exactly)
- Align velvet contigs to reference using nucmer ; filter only the CONTAINED ones
- Merge AMOS & velvet alignments
- Run "casm-layout -S -r ..."
- Run "make-consensus ..."
- All AMOS input contigs should be part of the new assembly
- If the velvet CONTAINED alignments are not filtered, some end conflicts might cause some of the AMOS contigs end up as singletons
Input:
All: desc #elem min max mean stdev sum AMOS 1843 20 127984 3350.38 10776 6174765 velvet-uniq 1189 45 16239 997.85 1485.34 1186452 # 462 aligned to PA14 ref (nucmer -c 40); 284 CONTAIEND in the ref; 727 not aligned to the ref all 3032 20 127984 2427.84 8530.33 7361217
200bp+: desc #elem min max mean stdev sum AMOS 542 200 127984 11245 17520 6094904 velvet-uniq 848 200 16239 1354.36 1627.86 1148499 # 422 aligned to PA14 ref (nucmer -c 40); 249 CONTAIEND in the ref; 426 not aligned to the ref all 1390 200 127984 5211.08 12019.6 7243403
Output:
All: desc #elem min max mean stdev sum ctg 1327 20 221554 4644.9 19079. 6163795 # two of them(530,531) contain only velvet seqs # 1210 contain only AMOS seqs ; 11 contain 2 AMOS seqs (negative gap got merged) # 115 contain both AMOS & velvet seqs singl 817 45 16274 1053.94 1719.12 861076 # all but one(807) are velvet : see .conflict file all 2144 20 221554 3276.52 15146. 7024871
200bp+: desc #elem min max mean stdev sum ctg 282 203 221554 21651 36731 6105733 singl 571 200 16274 1461.55 1917.62 834546 all 853 200 221554 8136.31 23189 6940279
minimus steps:
- Run minimus3 pipeline on the above contigs(1327) and singletons(817) ; Only uses contig:singletion overlaps (not all:all)
$ ~dpuiu/bin/minimus3 -D BREAKID=1327 -D OVERLAP=40 Pa
min OVL=40 : use contig:singleton alignments
Output:
All: desc #elem min max mean stdev sum ctg 180 135 253489 22817.2 45339 4107096 # 218 of the PA14 unaligned velvet contigs assembled singl 1377 20 153036 1996.79 10534 2749587 all 1557 20 253489 4403.77 19467 6856683 200bp+: desc #elem min max mean stdev sum ctg 174 210 253489 23598.3 45919 4106116 singl 488 200 153036 5519.51 17154 2693524 all 662 200 253489 10271.3 28846 6799640
min OVL=20 : use contig:singleton alignments (not significantlty better than ,on OVL=40)
Output:
All: desc #elem min max mean stdev sum ctg 279 63 253489 15557 37925 4340305 # 321 of the PA14 unaligned velvet contigs assembled singl 972 20 153036 2576 12252 2504000 all 1251 20 253489 5471 21580 6844305
200bp+: desc #elem min max mean stdev sum ctg 263 214 253489 16495 38868 4338061 singl 393 200 153036 6261 18680 2460755 all 656 200 253489 10364 28954 6798816
min OVL=20 : use contig:singleton & singleton:singleton alignments
All: desc #elem min max mean stdev sum ctg 277 103 253489 15983 37991 4427292 # 556 of the PA14 unaligned velvet contigs assembled singl 808 20 153036 2986 13439 2412439 # 260 of them are velvet original contigs all 1085 20 253489 6304 23111 6839731
200bp+: desc #elem min max mean stdev sum ctg 263 222 253489 16826 38812 4425275 singl 298 201 153036 7971 21242 2375456 # 165 of them are velvet original contigs all 561 201 253489 12123 31042 6800731
min OVL=20 : use contig:singleton & singleton:singleton alignments & adjacent contig ovelaps (best)
All: desc #elem min max mean stdev sum ctg 317 47 253489 16450 40806 5214579 singl 703 20 112908 2324 11332 1633528 all 1020 20 253489 6714 25448 6848107
200bp+: desc #elem min max mean stdev sum ctg 291 200 253489 17909 42289 5211547 singl 247 201 112908 6480 18431 1600539 all 538 200 253489 12662 33969 6812086
Locations:
/fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/AMOScmp-velvet2 /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/AMOScmp-velvet2/AMOScmp-avgReads.unique/ /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/AMOScmp-velvet2/AMOScmp-avgReads.unique/minimus3.20 (best)
PA14 Sanger maq on all reads
Contig stats: min_len #contigs min max mean stdev sum 0 991 33 155551 6199.79 17445.05 6143996 200 368 200 155551 16581.7 25475.92 6102067 10K 149 10048 155551 38164.35 28513.51 5686489
1197385 singletons; SNP's in the contigs: T<->C & A<-G> the most common
Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0206_PA14_maq
Final assembly (v3)
Merged some of the 120 AMOScmp contigs from v2 using velvet contigs. Grouped overlapping contigs and used minimus to merge them.
Contig length stats: desc #ctgs min max mean stdev sum final3 512 200 512638 13099.41 53526.8 6706902 AMOScmp3 76 212 512638 82763.22 117199.28 6290005 novo3 436 200 10493 956.18 1336.44 416897 final3-10KB+ 49 10493 512638 127086.12 125757.98 6227220
Comment:
~ 53-74 (out of 436) the velvet denovo that contigs align to PA14. Length statistics: #ctgs min max mean stdev sum CONTAINED 53 204 7228 514.81 978.25 27285 ALL 74 204 7228 622.85 926.13 46091
Locations:
/fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/final3/ /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24/nucmer/merge/
Annotation
5769 genes /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Annotation/merge/gene_list.ptt