Salmonella
Data
From Washington Univ in St. Louis
Strains
Salmonella enterica subsp. enterica serovar Paratyphi A str. ATCC 9150 : B_SPA : 4,585,229 bp chromosome; no plasmid
Salmonella typhimurium LT2 : B_STM : 4,857,432 bp chromosome; 93,939 bp plasmid pSLT
Goals
1. Validate the assemblies
2. Submit traces to NCBI TA
   Problems:
   * some traces were edited (phd.2, phd.3, ...); should these edits appear in the SCF files?
3. Convert assemblies to XML format and submit them to NCBI AA
Data location
/fs/ftp-cbcb/pub/data/dsommer/
/fs/sztmpscratch/dsommer/backup_sal
/fs/szasmg/Bacteria/Salmonella/
/fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/
/fs/szasmg/Bacteria/Salmonella/S_typhimurium_LT2
/fs/szdata//ncbi/genomes/Bacteria/Salmonella_enterica_Paratypi_ATCC_9150/
/fs/szdata//ncbi/genomes/Bacteria/Salmonella_typhimurium_LT2/
WUSTL
Data Download
   name                                                 date         assembly  #contigs
1  Salmonella_enterica_serovar_Arizonae/                04-Apr-2006  yes       256
2  Salmonella_enterica_serovar_Diarizonae/              02-May-2006  yes       739
3  Salmonella_enterica_serovar_Paratyphi_A/             04-Apr-2006  no
4  Salmonella_enterica_serovar_Paratyphi_B/             27-Apr-2006  yes       187
5  Salmonella_enterica_serovar_Typhimurium_strain_LT2/  04-Apr-2006  no
Salmonella enterica serovar Paratyphi A str. ATCC 9150
NCBI
Genome project Taxonomy (TaxID: 295319)
Name         Length   %GC    Description
NC_006511.1  4585229  52.16  Salmonella enterica subsp. enterica serovar Paratyphi A str. ATCC 9150, complete genome
Traces:
All directories: 103,971 (unique)
B_SPA: 102,405 (unique) => 1,566 missing
~10X coverage
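The "missing" count above is the difference between two sets of unique trace names. A minimal sketch of how such a comparison could be done, assuming each set is available as a plain text file with one read name per line (the function name and the input file names in the usage note are hypothetical, not part of the original notes):

```shell
# Hypothetical helper: print read names that occur in the first list
# but not in the second. Both inputs are assumed to be plain text files
# with one read name per line; duplicates are collapsed first.
missing_reads() {
    all_sorted=$(mktemp)
    sub_sorted=$(mktemp)
    sort -u "$1" > "$all_sorted"
    sort -u "$2" > "$sub_sorted"
    # comm -23 keeps only the lines unique to the first (sorted) file
    comm -23 "$all_sorted" "$sub_sorted"
    rm -f "$all_sorted" "$sub_sorted"
}
```

For example, `missing_reads all_dirs.names B_SPA.names | wc -l` would give the number of missing traces for two such lists.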
Unmated reads:
Sequenced in 1999-06
*.s1
Mated reads:
The *.b1, *.g1 reads seem to be mated!
Sequenced in 2002-05, 2002-07
p(.*).[bg]1
oyg(.*).[bg]1
P_AA(.*).[bg]1
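The naming patterns above can be turned into a small classifier to check how many reads fall into each group. This is only a sketch: the function name and the output labels are made up for illustration, and the prefixes are taken as-is from the patterns listed above.

```shell
# Hypothetical helper: map a read name to a group label based on the
# naming patterns noted above. Labels are illustrative only.
classify_read() {
    case "$1" in
        *.s1)         echo "unmated" ;;   # single reads (1999-06)
        P_AA*.[bg]1)  echo "P_AA" ;;      # mated, P_AA prefix
        oyg*.[bg]1)   echo "oyg" ;;       # mated, oyg prefix
        p*.[bg]1)     echo "p" ;;         # mated, p prefix
        *)            echo "unknown" ;;
    esac
}
```

Feeding a list of read names through it, for instance `while read -r r; do classify_read "$r"; done < reads.names | sort | uniq -c`, would give per-group counts (`reads.names` is a placeholder file name).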
WUSTL assemblies
1. ace.83: (best assembly of reads; 2003-05-12)
/fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/edit_dir/B_SPA.fasta.screen.ace.83
$ grep ^CO *ace.83 | grep -v COMM | wc -l
571   # total number of contigs
Longest contig:
$ cat B_SPA.fasta.screen.ace.83
AS 571 89509   # 571 contigs, 89509 reads
...
CO Contig1368 4813926 88824 1869182 C
Contig1368 is 4,813,926 bp (GDE format), 4,579,713 bp (FASTA format)
Ends don't overlap
There are misoriented reads at the ends (=> circular)
Contains 88,824 reads
Other Salmonella strains are ~4.8 Mbp
Problem:
* Collapsed repeat: high coverage, misoriented mates in the 2076881-2079555 region
* Expanded into a 3-copy tandem repeat in the finished assembly
* 3 copies also in CA
2. Finished assembly: (assembly of contigs)
File: finished.fasta.screen.ace.0
1 contig
4,585,228 bp (FASTA format): 5,515 bp longer than ace.83 contig 571; ends don't overlap
11 long reads (contig reads)
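The contig count and longest contig of an ACE file such as ace.83 can also be pulled out in one pass over the AS and CO header lines. A rough sketch, assuming the usual ACE header layout (`AS <contigs> <reads>` and `CO <name> <bases> <reads> <segments> <U|C>`); the function name and output format are invented here, and the reported length is the value on the CO line, which includes alignment pads:

```shell
# Hypothetical helper: summarize an ACE file read from stdin.
# Only AS lines with 3 fields and CO lines with 6 fields are considered,
# so sequence and quality lines are ignored.
ace_summary() {
    awk '
        $1 == "AS" && NF == 3 { contigs = $2; reads = $3 }
        $1 == "CO" && NF == 6 {
            # keep the contig with the largest base count seen so far
            if ($3 + 0 > max) { max = $3 + 0; name = $2 }
        }
        END {
            printf "contigs=%d reads=%d longest=%s padded_len=%d\n",
                   contigs, reads, name, max
        }
    '
}
```

Usage would be along the lines of `ace_summary < B_SPA.fasta.screen.ace.83`.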
Estimate lib insert sizes:
$ toAmos -ace B_SPA.fasta.screen.ace.83
$ grep -c ^rds B_SPA.afg     # check if links were created
$ more toAmos.error          # check if there were any conversion errors
$ bank-transact -b B_SPA.bnk -m B_SPA.afg -c
$ bank2contig B_SPA.bnk > B_SPA.contig
$ cat B_SPA.contig | grep ^# | grep -v ^## | sort     # look at distances between mated reads
Create mate pair file (Bambus format, tab delimited)
$ cat B_SPA.mates
library small 2000 4000 (p).*
pair (p.*)\.b1$ (p.*)\.g1$
library medium 4500 5500 (oyg).*
pair (oyg.*).b1$ (oyg.*).g1$
library large 35000 45000 (P_AA).*
pair (P_AA.*).b1$ (P_AA.*).g1$
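Before relying on the pair rules above, it can help to check how many templates actually have both a .b1 and a .g1 read. A minimal sketch, assuming a list of read names on stdin, one per line (the function name is hypothetical):

```shell
# Hypothetical helper: count templates that have both a .b1 and a .g1 read.
# The template name is the read name with its .b1 / .g1 suffix removed.
count_mate_pairs() {
    awk '
        /\.b1$/ { t = $0; sub(/\.b1$/, "", t); b[t] = 1 }
        /\.g1$/ { t = $0; sub(/\.g1$/, "", t); g[t] = 1 }
        END {
            for (t in b) if (t in g) n++
            print n + 0
        }
    '
}
```

Comparing this count with the number of links reported after the toAmos conversion is one way to spot reads whose mates were dropped.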
Rerun conversion utilities:
$ toAmos -m B_SPA.mates -ace B_SPA.fasta.screen.ace.83 -o B_SPA.afg
$ bank-transact -b B_SPA.bnk -m B_SPA.afg -c
CBCB assemblies
1. CA default params
/fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/edit_dir/83/CA-qual
87 scaff, 194 contigs, 19K singletons, 4,425,716 bp
2. CA genomeSize=3M
/fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/edit_dir/83/CA-qual-3M
75 scaff, 183 contigs, 19K singletons, 4,515,434 bp
No rearrangements compared to finished genome
Significant number of SNPs
3. AMOSCmp Ref=finished assembly; 89,509 reads; .ace.83 trimming => 31 contigs; 4,579,852 bp
4. AMOSCmp Ref=finished assembly; 101,621 reads (.fasta.screen); nucmer trimming => 8 contigs; 4,583,946 bp
5. merge of 9 contigs using slice tools
/fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/edit_dir/final2/9r12345678-circ-rev-tr-recall.*
Steps:
* Recruit unassembled reads to span the Contig8.4 - Contig6.6 gap and assemble them into a new contig.
* The 9 overlapping contigs (8 provided by Damon + 1 I assembled) were merged using the slice tools (zipclap program) into one piece.
* The new contig was circularized, reversed and rotated to align to the published one.
* I also recalled the consensus due to some ambiguity codes introduced in the process.
* The new contig sequence is 70 bp shorter (4,585,158 bp vs 4,585,228 bp), but it aligns in one piece to the published contig.
6. merge of 9 contigs using slice tools (best)
/fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/edit_dir/final3/9r12345678-circ-rev-tr.*
Steps:
* Same as 5, but a modified version of "modContig --circularize" was called
* The circularization step did not recall the consensus
* Recall was not used in the end
* The new contig sequence is 5 bp shorter (4,585,223 bp vs 4,585,228 bp), but it aligns in one piece to the published contig.
* show-snps 1con-9r12345678-circ-rev-tr.delta | grep -c 9r12345678$ => 46 SNPs
Salmonella typhimurium LT2
NCBI
Genome project
Name         Length   %GC    Description
NC_003197.1  4857432  52.22  Salmonella typhimurium LT2, complete genome
NC_003277.1  93939    53.13  Salmonella typhimurium LT2 plasmid pSLT, complete sequence
Traces
From WUSTL:
total: 142,267 single reads (*.s1 117,524)
mate pairs: 7,236
Zero coverage regions:
gi|16763390|ref|NC_003197.1|  288829   288861   32   0
gi|16763390|ref|NC_003197.1|  691986   692019   33   0
gi|16763390|ref|NC_003197.1|  1582051  1582484  433  0
gi|16763390|ref|NC_003197.1|  2548470  2548631  161  0
gi|16763390|ref|NC_003197.1|  2703507  2703816  309  0
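The rows above have the layout reference, start, end, length, coverage. A small sketch for totalling the uncovered bases and checking that each length equals end minus start, assuming that whitespace-separated layout on stdin (the function name and output format are hypothetical):

```shell
# Hypothetical helper: total the zero-coverage spans read from stdin.
# Expected columns: reference, start, end, length, coverage.
zero_cov_total() {
    awk '
        NF >= 4 {
            span = $3 - $2
            if (span != $4) mismatches++   # listed length disagrees with end - start
            total += span
            rows++
        }
        END {
            printf "rows=%d total_bp=%d mismatches=%d\n", rows, total, mismatches
        }
    '
}
```

Applied to the five regions listed above, the spans add up to 968 bp with no length mismatches.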
Two sets of reads are not included in the WUSTL "final" assemblies:
12,912 in */Failures/ directories
28,852 in sub assemblies but not in the final ones
WUSTL assemblies
1. Phrap assembly (2001-12-23)
/fs/ftp-cbcb/pub/data/dsommer/B_STM/edit_dir/Phrap_Assembly_dir/B_STM.RAW.fasta.screen.ace
head -1 ...
AS 1226 113 663
# contigs: 1,226
max contig: 741,919 (larger than cap4 max)
singlets 563,308
contigs 1226 (696,553 bp), 1218 (186,475 bp) look rearranged compared to the reference
CBCB assemblies
1. CA default params
/fs/szasmg/Bacteria/Salmonella/S_typhimurium_LT2/Assembly/2007_1215_CA_e1.50
TotalScaffolds=1957
TotalContigsInScaffolds=1965
MaxContigLength=13910
2. CA e=6%
/fs/szasmg/Bacteria/Salmonella/S_typhimurium_LT2/Assembly/2007_1215_CA_e6.00
TotalScaffolds=1128
TotalContigsInScaffolds=1146
MaxContigLength=26183
3. AMOSCmp: read trimming used the OBT clear ranges (from 1)
-D MINOVL=3 -D MAXTRIM=50 -D MAJORITY=50
/fs/szasmg/Bacteria/Salmonella/S_typhimurium_LT2/Assembly/2007_1215_AMOSCmp-relaxed-OBT
128 contigs
Max 0 cvg area: 1581885-1582864, 979 bp (1con-sontigs.delta)
4. AMOSCmp: read trimming used the nucmer clear ranges
-D MINOVL=3 -D MAXTRIM=50 -D MAJORITY=50
Ran for 9 days !!!
/fs/szasmg/Bacteria/Salmonella/S_typhimurium_LT2/Assembly/2007_1215_AMOSCmp-relaxed-nucmer
Max 0 cvg area: 4325279-4340861, 15,582 bp (1con-contigs.delta)
Contig length stats:
desc     #elem  min    max      mean       stdev      sum
contigs  8      46765  2148437  617384.12  710006.86  4939073
5. AMOSCmp: read trimming was taken from the Phrap (ace) assembly
-D MINOVL=3 -D MAXTRIM=50 -D MAJORITY=50
/fs/szasmg/Bacteria/Salmonella/S_typhimurium_LT2/Assembly/2007_1215_AMOSCmp-relaxed-Phrap
Contig length stats:
desc     #elem  min   max      mean       stdev      sum
contigs  20     3631  1397417  246942.84  308507.06  4938857
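The "Contig length stats" lines above (#elem, min, max, mean, stdev, sum) could be reproduced from a list of contig lengths with a short awk pass. This is a sketch only: the function name is hypothetical, the input is assumed to be one length per line, and the sample standard deviation (n - 1 denominator) is an assumption, since the notes do not say which formula the original tool used.

```shell
# Hypothetical helper: print "#elem min max mean stdev sum" for a list of
# contig lengths on stdin (one per line). Uses the sample standard
# deviation (n - 1), which is an assumption about the original tool.
length_stats() {
    awk '
        NF == 0 { next }                       # skip blank lines
        {
            x = $1 + 0
            if (n == 0 || x < min) min = x
            if (n == 0 || x > max) max = x
            n++; sum += x; sumsq += x * x
        }
        END {
            if (n == 0) { print "no data"; exit }
            mean = sum / n
            var = (n > 1) ? (sumsq - n * mean * mean) / (n - 1) : 0
            if (var < 0) var = 0               # guard against rounding error
            printf "%d %d %d %.2f %.2f %d\n", n, min, max, mean, sqrt(var), sum
        }
    '
}
```

If the lengths are not already in a file, they would first have to be extracted from the contig FASTA file; that step is left out here.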
Other strains
Data
NCBI
Genome Projects
26 listed, 5 complete
1 Salmonella enterica subsp. enterica serovar 4,[5],12:i:- str. CVM23701 [TIGR]
2 Salmonella enterica subsp. enterica serovar Agona str. SL483 [J. Craig Venter Institute]
3 Salmonella enterica subsp. enterica serovar Choleraesuis str. SC-B67 [Chang Gung Memorial Hospital] complete
4 Salmonella enterica subsp. enterica serovar Dublin [University of Illinois at Urbana-Champaign]
5 Salmonella enterica subsp. enterica serovar Dublin str. CT_02021853 [TIGR]
6 Salmonella enterica subsp. enterica serovar Enteritidis str. LK5 [University of Illinois at Urbana-Champaign]
7 Salmonella enterica subsp. enterica serovar Heidelberg str. SL476 [J. Craig Venter Institute]
8 Salmonella enterica subsp. enterica serovar Heidelberg str. SL486 [TIGR/JCVI/J. Craig Venter Institute]
9 Salmonella enterica subsp. enterica serovar Javiana str. GA_MM04042433 [J. Craig Venter Institute]
10 Salmonella enterica subsp. enterica serovar Kentucky str. CDC 191 [J. Craig Venter Institute]
11 Salmonella enterica subsp. enterica serovar Kentucky str. CVM29188 [TIGR]
12 Salmonella enterica subsp. enterica serovar Newport str. SL254 [TIGR/J. Craig Venter Institute]
13 Salmonella enterica subsp. enterica serovar Newport str. SL317 [J. Craig Venter Institute] in TA but not AA; 63 contigs; should be submitted to AA!!
14 Salmonella enterica subsp. enterica serovar Paratyphi A str. ATCC 9150 [Washington University (WashU)] complete
15 Salmonella enterica subsp. enterica serovar Paratyphi C strain RKS4594 [Peking University Health Science Center]
16 Salmonella enterica subsp. enterica serovar Pullorum [University of Illinois at Urbana-Champaign]
17 Salmonella enterica subsp. enterica serovar Saintpaul str. SARA23 [TIGR]
18 Salmonella enterica subsp. enterica serovar Saintpaul str. SARA29 [TIGR]
19 Salmonella enterica subsp. enterica serovar Schwarzengrund str. CVM19633 [TIGR]
20 Salmonella enterica subsp. enterica serovar Schwarzengrund str. SL480 [J. Craig Venter Institute]
21 Salmonella enterica subsp. enterica serovar Typhi Ty2 [University of Wisconsin-Madison, USA] complete
22 Salmonella enterica subsp. enterica serovar Typhi str. CT18 [Sanger Institute] complete
23 Salmonella typhimurium DT104 [Sanger Institute]
24 Salmonella typhimurium LT2 [Washington University (WashU)] complete
25 Salmonella typhimurium SL1344 [Sanger Institute]
26 Salmonella typhimurium TR7095 [Washington University (WashU)]
TA:
1 salmonella_enterica_subsp__enterica_serovar_4__5__12_i___str__cvm23701
2 salmonella_enterica_subsp__enterica_serovar_agona_str__sl483
3 salmonella_enterica_subsp__enterica_serovar_dublin_str__ct_02021853
4 salmonella_enterica_subsp__enterica_serovar_hadar_str__ri_05p066 : not in Genome Projects/AA (JCVI MSC)
5 salmonella_enterica_subsp__enterica_serovar_heidelberg_str__sl476
6 salmonella_enterica_subsp__enterica_serovar_heidelberg_str__sl486
7 salmonella_enterica_subsp__enterica_serovar_javiana_str__ga_mm04042433
8 salmonella_enterica_subsp__enterica_serovar_kentucky_str__cdc_191
9 salmonella_enterica_subsp__enterica_serovar_kentucky_str__cvm29188
10 salmonella_enterica_subsp__enterica_serovar_newport_str__sl254
11 salmonella_enterica_subsp__enterica_serovar_newport_str__sl317
12 salmonella_enterica_subsp__enterica_serovar_saintpaul_str__sara23
13 salmonella_enterica_subsp__enterica_serovar_saintpaul_str__sara29
14 salmonella_enterica_subsp__enterica_serovar_schwarzengrund_str__cvm19633
15 salmonella_enterica_subsp__enterica_serovar_schwarzengrund_str__sl480
16 salmonella_enterica_subsp__enterica_serovar_virchow_str__sl491 : not in Genome Projects/AA (JCVI MSC)
17 salmonella_enterica_subsp__enterica_serovar_weltevreden_str__hi_n05_537 : not in Genome Projects/AA (JCVI MSC)
AA:
1 Salmonella enterica subsp. enterica serovar 4,[5],12:i:- str. CVM23701 TIGR 2740 440534 4,895,918 113 53,284 8.0X
2 Salmonella enterica subsp. enterica serovar Agona str. SL483 JCVI 2924 454166 4,835,750 56 51,307 9.5X
3 Salmonella enterica subsp. enterica serovar Dublin str. CT_02021853 TIGR 2741 439851 4,885,976 142 50,129 7.8X
4 Salmonella enterica subsp. enterica serovar Hadar str. RI_05P066 JCVI 2995 465516 4,793,325 50 50,470 9.6X
5 Salmonella enterica subsp. enterica serovar Heidelberg str. SL476 JCVI 2927 454169 5,083,392 49 54,058 9.2X
6 Salmonella enterica subsp. enterica serovar Heidelberg str. SL486 JCVI 2925 454164 4,728,232 48 53,785 10.1X
7 Salmonella enterica subsp. enterica serovar Javiana str. GA_MM04042433 JCVI 2921 454167 4,553,049 74 52,375 9.9X
8 Salmonella enterica subsp. enterica serovar Kentucky str. CDC 191 JCVI 2922 454231 4,696,566 53 51,826 9.6X
9 Salmonella enterica subsp. enterica serovar Kentucky str. CVM29188 TIGR 2737 439842 5,000,919 75 55,311 9.1X
10 Salmonella enterica subsp. enterica serovar Newport str. SL254 JCVI 2926 423368 4,831,246 2 50,473 8.8X
11 Salmonella enterica subsp. enterica serovar Saintpaul str. SARA23 TIGR 2735 439846 4,785,870 143 50,936 8.6X
12 Salmonella enterica subsp. enterica serovar Saintpaul str. SARA29 TIGR 2739 439847 4,928,961 182 50,405 7.9X
13 Salmonella enterica subsp. enterica serovar Schwarzengrund str. CVM19633 TIGR 2738 439843 4,734,042 160 49,533 7.4X
14 Salmonella enterica subsp. enterica serovar Schwarzengrund str. SL480 JCVI 2923 454165 4,761,576 67 50,418 9.1X
15 Salmonella enterica subsp. enterica serovar Virchow str. SL491 JCVI 2996 465517 4,858,188 73 54,841 10.3X
16 Salmonella enterica subsp. enterica serovar Weltevreden str. HI_N05-537 JCVI 2994 465518 5,047,463 81 54,390 9.8X
TIGR/JCVI
MSC
Sanger
Salmonella comparative sequencing project
Salmonella ftp
Salmonella paratyphi A project
Salmonella paratyphi A ftp
Salmonella typhi project
Salmonella typhi ftp (includes 454)

    Organism                       Size(Mb)  G+C     Status          Funding           Contigs  Traces  Name
1   Salmonella bongori             4.46      51.3%   Finished        Beowulf Genomics  1        75084   SB
2   Salmonella enteritidis PT4     4.686     52.17%  Finished        Beowulf Genomics  2        68660   PT4
3   Salmonella gallinarum 287/91   4.747     52.2%   Finished        Beowulf Genomics  2        80266   SG
4   Salmonella enterica Hadar      4.786     52.3%   Finished        Wellcome Trust    3        79998   HADAR
5   Salmonella enterica Infantis   4.711     ~52.3%  Finished        Wellcome Trust    1        98013   SIN
6   Salmonella paratyphi A         4.582     52.2%   Finished        Wellcome Trust    2        81421   spa
7   Salmonella typhimurium DT104   5.02      52.1%   Finished        Beowulf Genomics  2        76534   DT104
8   Salmonella typhimurium D23580  ~5.0      ~52%    Fin/gap clos.   Wellcome Trust    31       88293   D23580
9   Salmonella typhimurium DT2     ~5        ~52%    Shotgun Progr.  Wellcome Trust    17       23552   DT2
10  Salmonella typhimurium SL1344  5.067     52.2%   Finished        Beowulf Genomics  4        81802   SL1344
11  Salmonella typhi               4.81      52.1%   Finished        Beowulf Genomics  1        0       st