Other data:
NCBI:
[http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj&cmd=search&term=Salmonella%20enterica%20subsp.%20enterica%20serovar Genome Projects]
1 Salmonella enterica subsp. enterica serovar 4,[5],12:i:- str. CVM23701 [TIGR]
2 Salmonella enterica subsp. enterica serovar Agona str. SL483 [J. Craig Venter Institute]
3 Salmonella enterica subsp. enterica serovar Choleraesuis str. SC-B67 [Chang Gung Memorial Hospital] complete
4 Salmonella enterica subsp. enterica serovar Dublin [University of Illinois at Urbana-Champaign]
5 Salmonella enterica subsp. enterica serovar Dublin str. CT_02021853 [TIGR]
6 Salmonella enterica subsp. enterica serovar Enteritidis str. LK5 [University of Illinois at Urbana-Champaign]
7 Salmonella enterica subsp. enterica serovar Heidelberg str. SL476 [J. Craig Venter Institute]
8 Salmonella enterica subsp. enterica serovar Heidelberg str. SL486 [TIGR/JCVI/J. Craig Venter Institute]
9 Salmonella enterica subsp. enterica serovar Javiana str. GA_MM04042433 [J. Craig Venter Institute]
10 Salmonella enterica subsp. enterica serovar Kentucky str. CDC 191 [J. Craig Venter Institute]
11 Salmonella enterica subsp. enterica serovar Kentucky str. CVM29188 [TIGR]
12 Salmonella enterica subsp. enterica serovar Newport str. SL254 [TIGR/J. Craig Venter Institute]
13 Salmonella enterica subsp. enterica serovar Newport str. SL317 [J. Craig Venter Institute] in TA but not AA; 63 contigs; should be submitted to AA!!
14 Salmonella enterica subsp. enterica serovar Paratyphi A str. ATCC 9150 [Washington University (WashU)] complete
15 Salmonella enterica subsp. enterica serovar Paratyphi C strain RKS4594 [Peking University Health Science Center]
16 Salmonella enterica subsp. enterica serovar Pullorum [University of Illinois at Urbana-Champaign]
17 Salmonella enterica subsp. enterica serovar Saintpaul str. SARA23 [TIGR]
18 Salmonella enterica subsp. enterica serovar Saintpaul str. SARA29 [TIGR]
19 Salmonella enterica subsp. enterica serovar Schwarzengrund str. CVM19633 [TIGR]
20 Salmonella enterica subsp. enterica serovar Schwarzengrund str. SL480 [J. Craig Venter Institute]
21 Salmonella enterica subsp. enterica serovar Typhi Ty2 [University of Wisconsin-Madison, USA] complete
22 Salmonella enterica subsp. enterica serovar Typhi str. CT18 [Sanger Institute] complete
23 Salmonella typhimurium DT104 [Sanger Institute]
24 Salmonella typhimurium LT2 [Washington University (WashU)] complete
25 Salmonella typhimurium SL1344 [Sanger Institute]
26 Salmonella typhimurium TR7095 [Washington University (WashU)]
TA:
1 salmonella_enterica_subsp__enterica_serovar_4__5__12_i___str__cvm23701
2 salmonella_enterica_subsp__enterica_serovar_agona_str__sl483
3 salmonella_enterica_subsp__enterica_serovar_dublin_str__ct_02021853
4 salmonella_enterica_subsp__enterica_serovar_hadar_str__ri_05p066 : not in Genome Projects/AA (JCVI MSC)
5 salmonella_enterica_subsp__enterica_serovar_heidelberg_str__sl476
6 salmonella_enterica_subsp__enterica_serovar_heidelberg_str__sl486
7 salmonella_enterica_subsp__enterica_serovar_javiana_str__ga_mm04042433
8 salmonella_enterica_subsp__enterica_serovar_kentucky_str__cdc_191
9 salmonella_enterica_subsp__enterica_serovar_kentucky_str__cvm29188
10 salmonella_enterica_subsp__enterica_serovar_newport_str__sl254
11 salmonella_enterica_subsp__enterica_serovar_newport_str__sl317
12 salmonella_enterica_subsp__enterica_serovar_saintpaul_str__sara23
13 salmonella_enterica_subsp__enterica_serovar_saintpaul_str__sara29
14 salmonella_enterica_subsp__enterica_serovar_schwarzengrund_str__cvm19633
15 salmonella_enterica_subsp__enterica_serovar_schwarzengrund_str__sl480
16 salmonella_enterica_subsp__enterica_serovar_virchow_str__sl491 : not in Genome Projects/AA (JCVI MSC)
17 salmonella_enterica_subsp__enterica_serovar_weltevreden_str__hi_n05_537 : not in Genome Projects/AA (JCVI MSC)
AA:
1 Salmonella enterica subsp. enterica serovar 4,[5],12:i:- str. CVM23701 TIGR 2740 440534 4,895,918 113 53,284 8.0X
2 Salmonella enterica subsp. enterica serovar Agona str. SL483 JCVI 2924 454166 4,835,750 56 51,307 9.5X
3 Salmonella enterica subsp. enterica serovar Dublin str. CT_02021853 TIGR 2741 439851 4,885,976 142 50,129 7.8X
4 Salmonella enterica subsp. enterica serovar Hadar str. RI_05P066 JCVI 2995 465516 4,793,325 50 50,470 9.6X
5 Salmonella enterica subsp. enterica serovar Heidelberg str. SL476 JCVI 2927 454169 5,083,392 49 54,058 9.2X
6 Salmonella enterica subsp. enterica serovar Heidelberg str. SL486 JCVI 2925 454164 4,728,232 48 53,785 10.1X
7 Salmonella enterica subsp. enterica serovar Javiana str. GA_MM04042433 JCVI 2921 454167 4,553,049 74 52,375 9.9X
8 Salmonella enterica subsp. enterica serovar Kentucky str. CDC 191 JCVI 2922 454231 4,696,566 53 51,826 9.6X
9 Salmonella enterica subsp. enterica serovar Kentucky str. CVM29188 TIGR 2737 439842 5,000,919 75 55,311 9.1X
10 Salmonella enterica subsp. enterica serovar Newport str. SL254 JCVI 2926 423368 4,831,246 2 50,473 8.8X
11 Salmonella enterica subsp. enterica serovar Saintpaul str. SARA23 TIGR 2735 439846 4,785,870 143 50,936 8.6X
12 Salmonella enterica subsp. enterica serovar Saintpaul str. SARA29 TIGR 2739 439847 4,928,961 182 50,405 7.9X
13 Salmonella enterica subsp. enterica serovar Schwarzengrund str. CVM19633 TIGR 2738 439843 4,734,042 160 49,533 7.4X
14 Salmonella enterica subsp. enterica serovar Schwarzengrund str. SL480 JCVI 2923 454165 4,761,576 67 50,418 9.1X
15 Salmonella enterica subsp. enterica serovar Virchow str. SL491 JCVI 2996 465517 4,858,188 73 54,841 10.3X
16 Salmonella enterica subsp. enterica serovar Weltevreden str. HI_N05-537 JCVI 2994 465518 5,047,463 81 54,390 9.8X
TIGR/JCVI:
[http://msc.tigr.org/salmonella/index.shtml MSC]
Sanger:
[http://www.sanger.ac.uk/Projects/Salmonella/ Salmonella project]
[http://www.sanger.ac.uk/Projects/S_typhi/ Salmonella typhi project]
[ftp://ftp.sanger.ac.uk/pub/pathogens/Salmonella Salmonella ftp]
[ftp://ftp.sanger.ac.uk/pub/pathogens/st/ ST ftp] includes 454
Data
From Washington Univ in St. Louis
Strains:
Salmonella enterica subsp. enterica serovar Paratyphi A str. ATCC 9150: B_SPA
Salmonella typhimurium LT2 : B_STM
Goals:
1. Validate the assemblies
2. Submit traces to NCBI TA:
Problems:
* Some traces were edited (phd.2, phd.3, ...); should these edits appear in the SCF files?
3. Convert assemblies to XML format and submit them to NCBI AA
File locations:
/fs/ftp-cbcb/pub/data/dsommer/
/fs/sztmpscratch/dsommer/backup_sal
/fs/szasmg/Bacteria/Salmonella/
/fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/
SPA
NCBI
Genome
Taxonomy (TaxID: 295319)
Traces:
All directories: 103971 (unique)
B_SPA : 102405 (unique) => 1566 missing
~ 10X coverage
The *.b1 and *.g1 reads seem to be mated!
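The gap between the two counts above (103971 - 102405 = 1566) can be turned into an explicit list of missing trace names. The following is only a sketch: the function name `missing_traces` and the two input files (one trace name per line) are assumptions for illustration, not files from this project.

```shell
# Hypothetical helper: print names present in the first list but not in the second.
# $1 = all trace names found in all directories, $2 = trace names in B_SPA
missing_traces() {
    all_sorted=$(mktemp)
    sub_sorted=$(mktemp)
    # comm needs sorted input; sort -u also collapses duplicate names
    LC_ALL=C sort -u "$1" > "$all_sorted"
    LC_ALL=C sort -u "$2" > "$sub_sorted"
    # -23 keeps only the lines unique to the first file
    LC_ALL=C comm -23 "$all_sorted" "$sub_sorted"
    rm -f "$all_sorted" "$sub_sorted"
}

# Example with made-up names
demo_all=$(mktemp)
demo_sub=$(mktemp)
printf '%s\n' p001.b1 p001.g1 oyg002.b1 > "$demo_all"
printf '%s\n' p001.b1 oyg002.b1 > "$demo_sub"
missing_traces "$demo_all" "$demo_sub"
rm -f "$demo_all" "$demo_sub"
```

Piping the result through `wc -l` should reproduce the "missing" count if the two lists match the counts reported above.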
Mate pairs:
p(.*).[bg]1
oyg(.*).[bg]1
P_AA(.*).[bg]1
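To check how many reads actually have a mate under the naming patterns above, one option is to group read names by the stem that precedes the .b1/.g1 suffix. This is a minimal sketch; `list_mated_stems` and the sample read names are made up for illustration, and only the .b1/.g1 suffix convention comes from these notes.

```shell
# Hypothetical helper: read names on stdin, one per line.
# Prints every stem for which both a .b1 and a .g1 read were seen.
list_mated_stems() {
    awk '
        /\.[bg]1$/ {
            stem = substr($0, 1, length($0) - 3)   # drop the ".b1" / ".g1" suffix
            end  = substr($0, length($0) - 1, 1)   # "b" or "g"
            seen[stem] = seen[stem] end
        }
        END {
            for (s in seen)
                if (seen[s] ~ /b/ && seen[s] ~ /g/) print s
        }
    ' | LC_ALL=C sort
}

# Example with made-up read names: oyg17 has no .g1 mate and is not printed
printf '%s\n' pAA01.b1 pAA01.g1 oyg17.b1 P_AA0003.g1 P_AA0003.b1 | list_mated_stems
```

Filtering the input with `grep '^p'`, `grep '^oyg'` or `grep '^P_AA'` first would give a per-library count of mated stems.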
WUSTL assemblies:
1. ace.83: (best assembly of reads)
/fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/edit_dir/B_SPA.fasta.screen.ace.83
$ grep ^CO *ace.83 | grep -v COMM | wc -l
571 # total number of contigs
Longest contig:
$ cat B_SPA.fasta.screen.ace.83
AS 571 89509 # 571 contigs, 89509 reads
...
CO Contig1368 4813926 88824 1869182 C
Contig1368 is 4,813,926 bp (GDE format) or 4,579,713 bp (FASTA format)
Ends don't overlap
There are misoriented reads at the ends (=> circular)
Contains 88824 reads
Other Salmonella strains are ~ 4.8M
Problem:
* Collapsed repeat: high coverage, misoriented mates in the 2076881-2079555 region
* Expanded into 3 copy tandem repeat in the finished assembly
* 3 copies also in CA
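The longest contig can also be pulled out of the ACE file directly instead of being read off by eye. The sketch below relies on the standard ACE `CO` line layout (`CO <name> <bases> <reads> <base segments> <U|C>`); the function name `longest_contig` and the tiny sample file are made up for illustration. The length on the `CO` line counts pad characters, which appears to correspond to the larger "GDE format" figure quoted above.

```shell
# Hypothetical helper: print "name length reads" for the longest contig in an ACE file.
longest_contig() {
    awk '
        # match real contig headers only; lines such as COMMENT have a different first field
        $1 == "CO" && NF == 6 && $3 + 0 > max { max = $3 + 0; name = $2; reads = $4 }
        END { if (name != "") print name, max, reads }
    ' "$1"
}

# Example on a made-up two-contig file
demo_ace=$(mktemp)
cat > "$demo_ace" <<'EOF'
AS 2 5
CO Contig1 1200 3 10 U
CO Contig2 4813 2 7 C
EOF
longest_contig "$demo_ace"
rm -f "$demo_ace"
```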
2. Finished assembly: (assembly of contigs)
File: finished.fasta.screen.ace.0
1 contig
4,585,228 bp (FASTA format): 5,515 bp longer than ace.83 Contig1368; ends don't overlap
11 long reads(contig reads)
Estimate lib insert sizes:
$ toAmos -ace B_SPA.fasta.screen.ace.83
$ grep -c ^rds B_SPA.afg # check if links were created
$ more toAmos.error # check if there were any conversion errors
$ bank-transact -b B_SPA.bnk -m B_SPA.afg -c
$ bank2contig B_SPA.bnk > B_SPA.contig
$ cat B_SPA.contig | grep ^# | grep -v ^## | sort
# look at distances between mated reads
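Once read placements are available, the distances between mates can be summarised per pair rather than inspected line by line. The sketch below does not parse the bank2contig output itself; it assumes a hypothetical intermediate table with three whitespace-separated columns (read name, start, end) for reads placed in a single contig, and the name `mate_distances` is likewise made up.

```shell
# Hypothetical helper: stdin has "read_name start end" rows from ONE contig.
# Prints "stem outer_distance" for every stem with exactly two placed reads.
mate_distances() {
    awk '
        $1 ~ /\.[bg]1$/ {
            stem = substr($1, 1, length($1) - 3)   # drop ".b1" / ".g1"
            n[stem]++
            if (n[stem] == 1 || $2 + 0 < lo[stem]) lo[stem] = $2 + 0   # leftmost start
            if (n[stem] == 1 || $3 + 0 > hi[stem]) hi[stem] = $3 + 0   # rightmost end
        }
        END {
            for (s in n)
                if (n[s] == 2) print s, hi[s] - lo[s]
        }
    ' | LC_ALL=C sort
}

# Example with made-up coordinates: only p001 has both mates placed
printf '%s\n' 'p001.b1 100 800' 'p001.g1 2400 3100' 'oyg002.b1 5 700' | mate_distances
```

Sorting the second column for each read-name prefix gives a rough view of the insert size range per library, which is the information needed for the library lines of the mates file.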
Create mate pair file (Bambus format, tab delimited)
$ cat B_SPA.mates
library small 2000 4000 (p).*
pair (p.*)\.b1$ (p.*)\.g1$
library medium 4500 5500 (oyg).*
pair (oyg.*)\.b1$ (oyg.*)\.g1$
library large 35000 45000 (P_AA).*
pair (P_AA.*)\.b1$ (P_AA.*)\.g1$
Rerun conversion utilities:
$ toAmos -m B_SPA.mates -ace B_SPA.fasta.screen.ace.83 -o B_SPA.afg
$ bank-transact -b B_SPA.bnk -m B_SPA.afg -c
CBCB assemblies:
1. CA default params
/fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/edit_dir/83/CA-qual
87 scaff, 194 contigs, 19K singletons, 4,425,716 bp
2. CA genomeSize=3M
/fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/edit_dir/83/CA-qual-3M
75 scaff, 183 contigs, 19K singletons, 4,515,434 bp
No rearrangements compared to finished genome
Significant number of SNPs
3. AMOSCmp Ref=finished assembly; 89,509 reads; .ace.83 trimming => 31 contigs; 4,579,852 bp
4. AMOSCmp Ref=finished assembly; 101,621 reads (.fasta.screen); nucmer trimming => 8 contigs; 4,583,946 bp
5. merge of 9 contigs using slice tools
/fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/edit_dir/final2/9r12345678-circ-rev-tr-recall.*
Steps:
* recruit unassembled reads to span the Contig8.4 - Contig6.6 gap and assemble them into a new contig.
* The 9 overlapping contigs (8 provided by Damon + 1 I assembled) were merged using the slice tools (zipclap program) into one piece.
* The new contig was circularized, reversed and rotated to align to the published one.
* I also recalled the consensus due to some ambiguity codes introduced in the process.
* The new contig sequence is 70 bp shorter (4,585,158 bp vs 4,585,228), but it aligns in one piece to the published contig.
6. merge of 9 contigs using slice tools (best)
/fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/edit_dir/final3/9r12345678-circ-rev-tr.*
Steps:
* Same as 5, but a modified version of "modContig --circularize" was called
* The circularization step did not recall the consensus
* Recall was not used in the end
* The new contig sequence is 5 bp shorter (4,585,223 bp vs 4,585,228), but it aligns in one piece to the published contig.
* show-snps 1con-9r12345678-circ-rev-tr.delta | grep -c 9r12345678$ => 46 SNPs