Helicobacter pylori
Latest revision as of 19:27, 27 January 2010
Data
Wustl
NCBI complete genomes
- Genome info
nl  id           len      gc%    organism
1   NC_000915.1  1667867  38.87  Helicobacter pylori 26695
2   NC_000921.1  1643831  39.19  Helicobacter pylori J99
3   NC_008086.1  1596366  39.08  Helicobacter pylori HPAG1
4   NC_010698.2  1608548  38.91  Helicobacter pylori Shi470
5   NC_011333.1  1652982  38.89  Helicobacter pylori G27
6   NC_011498.1  1673813  38.81  Helicobacter pylori P12
7   NC_012973.1  1576758  39.16  Helicobacter pylori B38
- nucmer -c 40 => ~200 alignments & 93-95% identity between genomes
- SNPs are mostly substitutions
- Alignment info (NC_000915 0cvg regions): 5-10% of genomes are unique
nl  .                    elem  min  q1   q2   q3    max    mean  n50   sum
1   NC_000915-NC_000915  72    45   178  294  1890  10467  1013  1893  72976   #longest alignment has been removed
2   NC_000915-NC_000921  197   2    81   203  495   17816  644   2146  126988
3   NC_000915-NC_008086  151   3    103  242  894   26862  951   3103  143652
4   NC_000915-NC_010698  206   2    115  283  706   12779  726   1941  149744
5   NC_000915-NC_011333  138   2    111  260  695   7457   688   1941  95063
6   NC_000915-NC_011498  157   2    83   185  565   5362   505   1357  79337
7   NC_000915-NC_012973  140   2    108  239  526   37389  1018  5729  142568
- NC_000915 vs NC_000915 : nucmer -c 40
Align len
.             elem  min  q1   q2   q3    max    mean  n50   sum
nucmer -c 20  484   20   21   26   82    10467  196   1892  95007
nucmer -c 40  72    45   178  294  1890  10467  1013  1893  72976

Align %id
.      elem  min    q1     q2      q3      max     mean  n50  sum
-c 20  484   63.72  91.89  100.00  100.00  100.00  95    100  46014.71
-c 40  72    76.71  85.06  92.86   99.92   100.00  92    93   6639.6
Media:NC 000915-NC 000915.20.png, Media:NC 000915-NC 000915.40.png
- Repeats (NC_000915.1):
.    elem  min  q1   q2   q3   max   mean    n50   sum
36   269   36   45   70   197  4853  305.32  1187  82132
100  112   100  143  267  804  4853  659.07  1891  73816
NCBI SRA
- http://www.ncbi.nlm.nih.gov/sites/entrez?db=sra&term=SRP001104 (24 data sets; 10 not loaded)
- http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP001104
Other
- http://msbarker.com/software.htm
- The complete genome sequence of a chronic atrophic gastritis Helicobacter pylori strain: Evolution during disease progression Jun 2006 PNAS
- Response of Gastric Epithelial Progenitors to Helicobacter pylori Isolates Obtained from Swedish Patients with Chronic Atrophic Gastritis
Assemblies
Wustl
velvet contigs 100bp+ stats:
nl  assembly         ctgs  min  q1   q2    q3     max    mean  n50    sum      reads  0cvg
1   HPKX_1039_AG0C1  233   100  318  1700  9639   77743  7085  19912  1650899  6.4m   674288
2   HPKX_1039_AG0C2  420   101  341  1684  5301   36588  3914  9412   1644043  5.2M   729736
3   HPKX_1039_AG4C1  271   100  217  1421  8273   90368  6115  17093  1657230  6.0m   687296
4   HPKX_1039_AG4C2  365   100  301  1595  5658   51523  4522  11890  1650547  6.8M   708607
5   HPKX_1172_AG0C1  217   107  557  3370  10683  58848  7099  15527  1540507  8.6M   1068848
6   HPKX_1172_AG0C2  1170  100  264  717   1768   11444  1319  2661   1543511  7.2M   1110550
7   HPKX_1172_AG4C1  377   103  355  2178  6166   35180  4169  9160   1571948  8.5M   1106858
8   HPKX_1172_AG4C2  317   100  274  1540  6256   37505  4987  14946  1581161  6.0M   812671
9   HPKX_1259_NL0C1  1704  100  264  598   1274   7953   936   1606   1595297  4.2m   963211
10  HPKX_1259_NL0C2  410   102  240  928   4863   32792  3882  11295  1591864  3.6M   824474
11  HPKX_1259_NL4C1  283   100  224  1098  6814   98400  5634  18624  1594699  6.4m   797825
12  HPKX_1259_NL4C2  455   102  222  874   4348   32792  3520  11010  1601950  6.3M   833155
13  HPKX_1379_NL0C1  295   100  230  1243  7551   59556  5539  15858  1634019  6.1m   730236
14  HPKX_1379_NL0C2  416   100  216  1000  5177   53581  3931  11219  1635644  6.3M   754915
15  HPKX_1379_NL4C1  328   100  227  1084  6601   61090  4996  14203  1638925  5.0m   716540
16  HPKX_1379_NL4C2  291   100  231  1501  6751   64227  5539  15080  1612046  4.6M   785276
17  HPKX_345_AG4C1   251   100  241  1272  8265   97643  6534  19718  1640151  4.5m   727208
18  HPKX_345_AG4C2   .     .    .    .     .      .      .     .      .        12.1M  .
19  HPKX_345_NL0C1   305   100  243  1146  6718   59632  5360  15874  1634815  5.7m   759194
20  HPKX_345_NL0C2   283   100  254  2009  8300   59229  5629  13524  1593064  11.1M  1067288
21  HPKX_438_AG0C1   267   100  348  1710  8311   87876  6071  16918  1620975  5.8m   755933
22  HPKX_438_AG0C2   407   102  396  1777  5455   31183  3963  8830   1613167  6.3M   804474
23  HPKX_438_CA4C1   237   100  348  1580  8856   97139  6845  19582  1622487  6.0m   742559
24  HPKX_438_CA4C2   485   101  363  1502  4408   35471  3332  7779   1616183  4.1M   801123
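The columns used in these tables (ctgs, min, q1, q2, q3, max, mean, n50, sum) can be recomputed from a list of contig lengths. The sketch below is only an illustration: the function name `contig_stats` is hypothetical, the quartiles use a simple nearest-rank pick, and N50 follows the common definition (the length of the contig at which the running total, taken from the longest contig downwards, reaches half of the assembly size). The script that produced the tables on this page may use slightly different conventions.

```python
from typing import Dict, List


def contig_stats(lengths: List[int], min_len: int = 100) -> Dict[str, float]:
    """Summarise contig lengths, keeping only contigs of at least min_len bp."""
    kept = sorted(n for n in lengths if n >= min_len)
    if not kept:
        raise ValueError("no contigs pass the length cutoff")
    total = sum(kept)

    def quantile(frac: float) -> int:
        # Nearest-rank style pick; other tools may interpolate instead.
        return kept[min(len(kept) - 1, int(frac * len(kept)))]

    # N50: walk from the longest contig down until half of the bases are covered.
    running = 0
    n50 = kept[-1]
    for n in reversed(kept):
        running += n
        if 2 * running >= total:
            n50 = n
            break

    return {
        "ctgs": len(kept),
        "min": kept[0],
        "q1": quantile(0.25),
        "q2": quantile(0.50),
        "q3": quantile(0.75),
        "max": kept[-1],
        "mean": total / len(kept),
        "n50": n50,
        "sum": total,
    }
```

Calling `contig_stats(lengths, min_len=100)` mirrors the "contigs 100bp+" cutoff used in the tables.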
CBCB
Summary of what I did to assemble the 24 data sets:
- I downloaded the read sets and 7 complete related genomes from NCBI. The reads are 36bp and unmated; the original coverage ranges from 77X to 261X, which is very deep, so the reads could be filtered
- I ran velvet on each read set
- For the assemblies that were too fragmented and/or had high read coverage, I filtered out the low-quality reads, ran velvet on the remaining reads and checked whether the assembly improved. Some assemblies got better, some worse.
- I tried merging the velvet contigs based on direct overlaps (2 contig ends align to each other) or indirect overlaps (2 contigs align to the same complete genome region)
- Velvet assemblies 6, 9 and 18 were very fragmented, so I ran AMOScmp-shortReads on all the reads, using the best assembled related HPKX set as the reference
- I tried merging the velvet & AMOScmp contigs based on direct overlaps only
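The coverage range quoted in the first step can be sanity-checked with the usual estimate coverage = reads x read length / genome length. The sketch below assumes the NC_000915.1 length (1667867 bp) as the genome size and takes the smallest and largest read counts from the Wustl table (3.6M and 12.1M); the function name `read_coverage` is hypothetical.

```python
def read_coverage(n_reads: float, read_len: int, genome_len: int) -> float:
    """Expected average depth for n_reads reads of read_len bp over genome_len bp."""
    return n_reads * read_len / genome_len


GENOME_LEN = 1667867  # NC_000915.1, used here as an assumed genome size
READ_LEN = 36

low = read_coverage(3.6e6, READ_LEN, GENOME_LEN)    # smallest read set (set 10)
high = read_coverage(12.1e6, READ_LEN, GENOME_LEN)  # largest read set (set 18)
print(f"{low:.1f}X to {high:.1f}X")
```

Both values land within 1X of the quoted 77-261X range.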
Read filtering:
- q00: no reads were filtered out
- q20: reads that contained 1+ N's or had avg. quality < 20 were filtered out
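A minimal sketch of the q20 filter described above, assuming the reads have already been parsed into (name, sequence, quality list) tuples with the quality characters decoded to integers. The decoding offset depends on the Solexa/FASTQ variant and is deliberately left out; the names `Read`, `passes_filter` and `filter_reads` are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

Read = Tuple[str, str, List[int]]  # (name, sequence, per-base quality scores)


def passes_filter(seq: str, quals: List[int], min_avg_qual: float = 20.0) -> bool:
    """Keep a read only if it has no N and its average quality is >= min_avg_qual."""
    if "N" in seq.upper():
        return False
    if not quals:
        return False
    return sum(quals) / len(quals) >= min_avg_qual


def filter_reads(reads: Iterable[Read], min_avg_qual: float = 20.0) -> Iterator[Read]:
    """Yield the reads that survive the filter; min_avg_qual=0 only drops reads with N."""
    for name, seq, quals in reads:
        if passes_filter(seq, quals, min_avg_qual):
            yield name, seq, quals
```

Raising `min_avg_qual` to 28 or 30 would correspond to the stricter q28 and q30 settings that appear in the table below, assuming they follow the same rule.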
Velvet: hash_size=23
velvet contigs 100bp+ stats:
nl   assembly         ctgs  min  q1   q2    q3     max    mean  n50    sum      reads  0cvg     qual  comment
1b   HPKX_1039_AG0C1  208   101  390  2580  11299  90464  7919  19900  1647226  6.4M   692517   q00   better
2    HPKX_1039_AG0C2  417   101  730  2132  5402   34482  3937  8041   1642106  2.9M   722358   q28   similar
3    HPKX_1039_AG4C1  202   100  395  2535  12111  82857  8201  20617  1656737  6.0m   683554   q00   similar
4    HPKX_1039_AG4C2  250   100  287  2134  9314   90459  6604  18459  1651103  6.7M   703430   q00   better
5b   HPKX_1172_AG0C1  213   100  229  1521  9559   93714  7423  22783  1581145  7.1M   795376   q20   better
6*   HPKX_1172_AG0C2  1122  100  466  900   1639   7935   1243  1857   1395595  3.1M   1915029  q30   worse
7    HPKX_1172_AG4C1  313   100  373  2061  7176   67930  5131  13160  1606127  6.0M   811067   q30   better
8    HPKX_1172_AG4C2  218   102  280  1561  10678  79935  7225  22481  1575172  5.1M   802832   q20   better
9*   HPKX_1259_NL0C1  2348  100  242  465   897    8958   678   1061   1593194  4.2m   1060158  q00   worse
10   HPKX_1259_NL0C2  271   102  296  1687  7254   78486  5869  17821  1590749  3.1M   811063   q20   similar
11b  HPKX_1259_NL4C1  234   102  223  1293  8826   98849  6803  19887  1592056  6.4m   807774   q00   similar
12   HPKX_1259_NL4C2  311   101  214  901   6216   78488  5137  17546  1597850  5.5M   816013   q20   similar
13   HPKX_1379_NL0C1  243   101  212  1087  8703   79083  6723  20883  1633745  6.1m   731779   q00   similar
14b  HPKX_1379_NL0C2  237   101  213  1039  8905   79512  6874  22931  1629236  3.9M   744144   q20   similar
15   HPKX_1379_NL4C1  257   100  208  764   8197   59755  6382  22923  1640181  5.0m   706386   q00   similar
16   HPKX_1379_NL4C2  230   101  225  1440  9348   79083  7055  21887  1622754  4.8M   742039   q20   better
17b  HPKX_345_AG4C1   254   100  197  1075  8307   88027  6455  20055  1639650  5.5m   739554   q00   worse
18*  HPKX_345_AG4C2   1130  101  389  846   1732   14790  1349  2295   1525450  6.5M   1283074  q20   missing
19   HPKX_345_NL0C1   253   100  239  1052  8308   75943  6475  19602  1638277  5.7m   744044   q00   better
20   HPKX_345_NL0C2   260   101  216  1515  8697   59766  6286  18718  1634563  9.1M   750135   q20   better
21b  HPKX_438_AG0C1   229   101  358  1701  8830   98294  7075  20804  1620311  5.8m   750029   q00   better
22   HPKX_438_AG0C2   272   102  433  2271  7395   53896  5945  14906  1617041  4.7M   768889   q20   better
23   HPKX_438_CA4C1   224   102  370  1860  10798  53649  7230  20621  1619628  6.0m   755398   q00   similar
24   HPKX_438_CA4C2   356   102  432  2029  6132   36335  4546  11461  1618724  3.0M   775835   q20   better
Files: /fs/szasmg3/dpuiu/Helicobacter_pylori/HPKX_*/velvet/
velvet-merged contigs 100bp+ stats: (merged based on alignments to the 7 complete genomes) ; minOVL=5bp
Assembly pipelines:
- minimus2
  - ~/bin/AMOS/minimus2:
  - Align velvet contigs to one another => direct overlaps
  - Run tigger, make-consensus ...
- minimus3
  - ~/bin/AMOS/minimus3:
  - Align velvet contigs to one another => direct overlaps
  - Align velvet contigs to all complete reference genomes; find contigs that overlap same reference regions => indirect overlaps
  - Merge direct & indirect overlaps
  - Run tigger, make-consensus
nl   assembly         ctgs  min  q1   q2    q3     max     mean   n50    sum
1    HPKX_1039_AG0C1  155   101  438  2761  14251  139553  10627  28134  1647320
2    HPKX_1039_AG0C2  230   101  575  2665  10199  81516   7129   17062  1639841
3    HPKX_1039_AG4C1  169   101  383  2723  13391  126810  9811   25845  1658185
4    HPKX_1039_AG4C2  192   100  318  2402  12716  95840   8601   21767  1651530
5    HPKX_1172_AG0C1  161   100  229  1602  13093  98369   9826   33745  1582047
6*   HPKX_1172_AG0C2  275   100  774  2840  7251   45605   5732   12775  1576564  # merged AMOScmp (ref=HPKX_1172_AG0C1) and velvet contigs; did not use the complete genomes for alignments ; minOVL=40bp
7    HPKX_1172_AG4C1  204   103  415  2527  12920  76550   7865   19077  1604640
8    HPKX_1172_AG4C2  183   102  256  1344  13094  79936   8612   25128  1576072
9*   HPKX_1259_NL0C1  249   100  219  837   8006   98866   6485   21851  1614768  # merged AMOScmp (ref=HPKX_1259_NL4C1) and velvet contigs; did not use the complete genomes for alignments ; minOVL=40bp
10   HPKX_1259_NL0C2  198   102  296  1152  10414  91140   8036   24879  1591324
11   HPKX_1259_NL4C1  181   102  259  1136  10406  103908  8801   32108  1593057
12   HPKX_1259_NL4C2  255   101  216  729   8686   78488   6272   20602  1599364
13   HPKX_1379_NL0C1  191   101  223  976   10495  96103   8565   28132  1635927
14   HPKX_1379_NL0C2  183   101  226  1039  11932  96425   8912   32446  1630904
15   HPKX_1379_NL4C1  201   101  251  880   9338   100542  8167   27700  1641719
16   HPKX_1379_NL4C2  167   101  262  1511  13732  95996   9720   32534  1623349
17   HPKX_345_AG4C1   206   101  196  1044  10973  95672   7967   25867  1641285
18*  HPKX_345_AG4C2   287   101  341  2032  8153   55325   5683   15371  1631176  # merged AMOScmp (ref HPKX_345_NL0C2) and velvet contigs; did not use the complete genomes for alignments ; minOVL=40bp
19   HPKX_345_NL0C1   210   100  229  1131  10090  94992   7806   25980  1639377
20   HPKX_345_NL0C2   208   101  239  1581  10973  87982   7862   23967  1635408
21   HPKX_438_AG0C1   173   102  358  1842  11985  128137  9366   26698  1620475
22   HPKX_438_AG0C2   198   102  393  2767  10090  96481   8165   23774  1616761
23   HPKX_438_CA4C1   176   102  358  2036  13405  109077  9205   24628  1620172
24   HPKX_438_CA4C2   224   102  433  2320  10298  53648   7220   19679  1617399
Files:
/fs/szasmg3/dpuiu/Helicobacter_pylori/HPKX_*/velvet/minimus3/
/fs/szasmg3/dpuiu/Helicobacter_pylori/HPKX_*/minimus2/
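The indirect-overlap idea in the minimus3 pipeline (two contigs that align to the same region of a complete genome) can be illustrated with a small interval check. This is a sketch under stated assumptions, not the minimus3 code: the `Hit` record and `indirect_overlaps` function are hypothetical, alignments are assumed to be already parsed into reference coordinates, and the pairwise comparison is quadratic, which is acceptable for a few hundred contigs.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, NamedTuple, Set, Tuple


class Hit(NamedTuple):
    contig: str
    ref: str
    start: int  # 1-based, inclusive reference coordinate
    end: int    # 1-based, inclusive reference coordinate


def indirect_overlaps(hits: List[Hit], min_ovl: int = 5) -> Set[Tuple[str, str]]:
    """Return contig pairs whose alignments share >= min_ovl bp of one reference."""
    by_ref: Dict[str, List[Hit]] = defaultdict(list)
    for hit in hits:
        by_ref[hit.ref].append(hit)

    pairs: Set[Tuple[str, str]] = set()
    for ref_hits in by_ref.values():
        for a, b in combinations(ref_hits, 2):
            if a.contig == b.contig:
                continue
            # Length of the shared reference interval (<= 0 means no overlap).
            shared = min(a.end, b.end) - max(a.start, b.start) + 1
            if shared >= min_ovl:
                first, second = sorted((a.contig, b.contig))
                pairs.add((first, second))
    return pairs
```

The default `min_ovl=5` matches the minOVL=5bp setting quoted for the merged assemblies; the three AMOScmp-merged sets used 40bp instead.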
AMOScmp contigs 100bp+ stats:
Pipeline: AMOScmp-shortReads
- ~/bin/AMOS/AMOScmp-shortReads
  - Align reads to the "closest" best assembled HPKX reference using soap
  - Get overlaps
  - Run casm-layout, make-consensus
nl   assembly         ctgs  min  q1   q2    q3    max    mean  n50    sum      reads  qual  ref
2    HPKX_1039_AG0C2  286   100  341  2402  8354  90182  5718  12867  1635626  4.7M   q00   HPKX_1039_AG4C2
6**  HPKX_1172_AG0C2  367   100  300  1727  5675  37710  4294  10457  1575949  5.5M   q00   HPKX_1172_AG0C1
9    HPKX_1259_NL0C1  234   103  270  1292  8819  98848  6797  19198  1590643  4.1m   q00   HPKX_1259_NL4C1
18   HPKX_345_AG4C2   325   101  243  1541  6060  55323  5020  13569  1631537  10.5M  q00   HPKX_345_NL0C2
Files: /fs/szasmg3/dpuiu/Helicobacter_pylori/HPKX_*/AMOScmp
Best assembly files:
/fs/szasmg3/dpuiu/Helicobacter_pylori/HPKX/*
ftp://ftp.cbcb.umd.edu/pub/data/H_pylori_reassembly/*
- WUSTL assemblies : velvet_?
- CBCB assemblies : velvet_0.7.55 on Fasta seqs (Fastq: no difference)
6 HPKX_1172_AG0C2
Reads:
- all : 7.1M
- q30+: 3.1M
- aligned by soap Helicobacter pylori HPAG1 : 4.8M
- aligned by soap Helicobacter pylori HPKX_1172_AG0C1 : 5.5M
velvet
.          elem  min  q1   q2   q3    max   mean  n50   sum      reads  0cvg     qual
ctgs       1239  45   346  799  1528  7935  1132  1834  1403538  3.1M   1889408  q30
ctgs.100+  1122  100  466  900  1639  7935  1243  1857  1395595  3.1M   1915029  q30
AMOScmp-shortReads (ref HP_HPAG1)
.          elem  min  q1   q2   q3    max    mean  n50   sum      reads  0cvg     qual
ctgs.all   1334  36   78   238  1283  16118  1146  3978  1529868  4.8M   1137259  q00
ctgs.100+  905   100  223  728  2152  16133  1662  4073  1504146  .      .        q00
Directory:
/fs/szasmg3/dpuiu/Helicobacter_pylori/HPKX_1172_AG0C2.6/AMOScmp.HP_HPAG1
AMOScmp-shortReads (ref 5 HPKX_1172_AG0C1)
.          elem  min  q1   q2    q3    max    mean  n50    sum      reads  0cvg  qual
ctgs.all   392   37   227  1470  5481  37710  4024  10457  1577557  5.5M   .     q00
ctgs.100+  367   100  300  1727  5675  37710  4294  10457  1575949
ref        213   100  229  1521  9559  93714  7423  22783  1581145
Directory:
/fs/szasmg3/dpuiu/Helicobacter_pylori/HPKX_1172_AG0C2.6/AMOScmp.HPKX_1172_AG0C1
18 HPKX_345_AG4C2
Reads
- 12.1M Solexa 36bp unpaired
- cvg =~ 120X ?
Velvet
Ctg stats :
hash  #ctgs  min  q1   q2   q3    max    mean  n50   sum
23    1098   45   244  724  1799  25718  1367  2745  1501014
24 HPKX_438_CA4C2.solexa.txt.assembled-23-11
Reads
- 4.1M Solexa 36bp unpaired
- cvg =~ 80X
- ~9% of the reads contain at least one N
Quality QC:
.           elem     min  q1  q2  q3  max  mean  n50  sum
Ncount      4107397  0    0   0   0   35   1     34   5614614   # 3725799<=0 381598>0
avgQuality  4107397  0    19  26  29  34   22    28   91690471  # 118980<=20 3988417>20
Ncount==0 and avgQuality>=20 => 3013939 filtered reads (73%)
pos  elem     min  q1  q2  q3  max  mean  n50  sum
0    4107397  0    32  33  33  33   28    33   116108100
1    4107397  0    30  33  34  34   27    33   114845690
5    4107397  0    27  32  33  34   26    33   109832819
10   4107397  0    23  31  33  34   25    32   102860584
20   4107397  0    17  28  31  34   22    30   92139539
30   4107397  0    2   21  28  34   17    27   70231217
32   4107397  0    2   19  26  34   15    26   63156733
35   4107397  0    2   2   25  34   13    26   55261361
12mer counts: too much error???
meryl -C -B -m 12 -s prefix.seq -o prefix.12mers
meryl -Dh -s prefix.12mers | sort -nk2 -r | more
1   1876075  0.3452  0.0196
2   1009161  0.5308  0.0407
...
9   36227    0.7772  0.1017
10  25866    0.7819  0.1044
48  20812    0.8729  0.2727  # read cvg ???
49  20726    0.8768  0.2833
...
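For small inputs, a comparable 12-mer multiplicity histogram can be built in plain Python. The sketch below assumes that canonical k-mers are wanted (a k-mer and its reverse complement are counted together) and that k-mers containing N are skipped; it is an illustration of the idea rather than a reimplementation of meryl, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    revcomp = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, revcomp)


def kmer_histogram(reads: Iterable[str], k: int = 12) -> Dict[int, int]:
    """Map each multiplicity to the number of distinct canonical k-mers seen that often."""
    counts: Counter = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" in kmer:
                continue  # skip k-mers that span an ambiguous base
            counts[canonical(kmer)] += 1
    return dict(Counter(counts.values()))
```

In such a histogram, a large number of k-mers seen only once or twice usually points to sequencing errors, while a secondary peak at higher multiplicity gives a rough indication of read coverage.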
Velvet (all reads)
Ctg stats for different velveth hash_lengths:
hash  #ctgs  min  q1   q2    q3    max    mean  n50    sum
19    908    37   161  770   2289  21014  1732  3905   1572704
21    457    41   84   580   4156  49037  3548  12777  1621652
23    398    45   161  1435  5137  37278  4068  12278  1619323  (CBCB best*)
27    769    53   341  1163  2731  18704  2109  4319   1622389
?     485    101  363  1502  4408  35471  3332  7779   1616183  (WUSTL)
CBCB best* read cvg =~ 23; repeats at higher cvg
.    #ctgs  min  q1  q2  q3  max  mean  n50  sum
cvg  398    13   21  23  25  139  30    25   .
Velvet (filtered reads)
Hash_len=23
Ctg stats
filter          #ctgs  min  q1   q2    q3    max    mean  n50    sum      #reads   0cvg(all 6 genomes)
all             398    45   161  1435  5137  37278  4068  12278  1619323  4107397  765345
noN             420    45   115  1194  4898  37173  3863  11880  1622651  3725799  759343
avgqual20+      424    45   150  1302  4797  36335  3828  11749  1623136  3069665  757697
noN.avgqual20+  453    45   137  1093  4394  36335  3586  11461  1624653  3013939  756829  !!! least seq missing
AMOScmp
Ref : NC_000915
Ctg stats:
params               #ctgs  min  q1  q2   q3   max   mean  n50  sum      #readsInCtgs
-l 16 -c 32 -ovl 10  9533   36   59  96   168  3160  136   185  1302448  1,123,025 (~25%)  1,023,929:0SNP 154,958:1SNP 8,532:2SNP ...
-l 8 -c 24 -ovl 10   4429   36   62  152  430  5518  350   806  1554095  2,438,762 (~50%)  1,159,669:0SNP 977,687:1SNP 422,738:2SNP ...
-l 8 -c 24 -ovl 5    3880   36   61  158  492  5883  400   966  1553027  2,438,762 (~50%)
nucmer 0cvg stats:
params       #gaps  min  q1  q2  q3  max   mean  n50  sum
-l 16 -c 32  8708   2    10  19  41  9286  44    91   388608
-l 8 -c 24   2650   2    8   17  45  2347  56    207  150099
NC_011498
NC_011498.1 1673813 38.81
Reads
- 1.67M Simulated reads 36bp; unpaired; 100% correct;
- the reads were generated by breaking the genome in 36bp segments (35bp ovl)=>36X cvg
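The read simulation described above amounts to tiling the genome with a sliding window. A minimal sketch, assuming the genome is treated as a linear string with no wrap-around at the origin (the function name `tile_reads` is hypothetical):

```python
from typing import List


def tile_reads(genome: str, read_len: int = 36, step: int = 1) -> List[str]:
    """Cut error-free reads of read_len bp, starting every `step` bases.

    Consecutive reads overlap by read_len - step bases, so the defaults
    give a 35bp overlap and about read_len-fold coverage away from the ends.
    """
    if read_len <= 0 or step <= 0:
        raise ValueError("read_len and step must be positive")
    last_start = len(genome) - read_len
    return [genome[i:i + read_len] for i in range(0, last_start + 1, step)]
```

For the 1673813 bp NC_011498.1 sequence this yields 1673813 - 36 + 1 = 1673778 reads, in line with the 1.67M reads and 36X coverage quoted above.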
Velvet
- Ctg stats :
hash  #ctgs  min  q1  q2   q3    max    mean  n50    sum
23    292    45   67  164  3422  73108  5654  33268  1651121
Euler-sr
- Ctg stats :
vertex_size  #ctgs  min  q1  q2   q3    max    mean  n50    sum      #misassemblies  0cvg
23           366    24   36  92   931   83748  4596  41745  1682410  .
25           343    26   39  98   1125  83752  5075  42087  1740988  4               27377
27           331    28   43  109  1125  83756  5016  41753  1660506  4               27392
AMOScmp
- Ref : NC_000915
- Ctg stats:
params                              #ctgs  min  q1  q2   q3    max    mean  n50   sum      #readsInCtgs    misassemblies(<95%length match)
nucmer -l 16 -c 32 -ovl 10          8569   36   66  114  203   2897   164   231   1405371  836827 (~50%)   39
soap -v 5 -g 3 -s 12 -f 2; -ovl 10  1787   37   99  305  1129  14043  881   2283  1574554  1437078 (~85%)  69
soap -v 5 -g 0 -s 12 -f 2; -ovl 10  1789   37   98  304  1128  14043  880   2283  1574516  1437077         64
soap -v 3 -g 0 -s 12 -f 2; -ovl 10  3646   36   89  214  532   7184   424   857   1548982  .               55
soap -v 3 -g 0 -s 12 -f 2; -ovl 20  4957   36   81  174  389   4783   316   580   1567357  1353104         49
- 0cvg stats
params                              #gaps  min  q1  q2  q3  max   mean  n50  sum
nucmer -l 16 -c 32 -ovl 10          8026   2    8   15  34  5384  36    75   291959
soap -v 5 -g 3 -s 12 -f 2; -ovl 10  1042   2    6   19  46  5355  93    746  97176
minimus* on velvet contigs
.                                  ctgs       min  q1   q2    q3     max     mean   n50    sum
velvet                             292        45   67   164   3422   73108   5654   33268  1651121

.                                  ctgs+sing  min  q1   q2    q3     max     mean   n50    sum      misas.
minimus2(delta-filter -1; OVL=20)  191        45   108  465   10421  117862  8631   33268  1648580  17(6)  # ctgs-vs-ctgs; OVL=20

minimus3(delta-filter -q; OVL=20)  251        45   68   214   5486   73108   6573   33268  1650065  11()   # ctgs-vs-ref => ctgs-vs-ctgs
minimus3(delta-filter -q; OVL=5)   191        45   71   227   9222   73108   8631   41743  1648698  6(1)

minimus3( OVL=20)                  204        45   115  560   10394  117862  8072   33268  1646865  5(1)
minimus3( OVL=5)                   172        45   134  611   12957  122309  9572   37367  1646538  8(4)

minimus3(all ref; OVL=20)          231        45   99   357   7424   117862  7138   33268  1648917  5(1)
minimus3(all ref; OVL=5)           150        45   140  1177  15439  118850  10959  41743  1643998  15(7)