Salmonella

Data

From Washington Univ in St. Louis

Strains

 Salmonella enterica subsp. enterica serovar Paratyphi A str. ATCC 9150: B_SPA : 4,585,229 chromosome;  no plasmid
 Salmonella typhimurium LT2                                            : B_STM : 4,857,432 chromosome;  93,939 plasmid pSLT

Goals

 1. Validate the assemblies
 2. Submit traces to NCBI TA: 
    Problems:
      * some traces were edited (phd.2,phd.3,...); should these edits appear in the SCF files?
 3. Convert assemblies to XML format and submit them to NCBI AA

Data location

 /fs/ftp-cbcb/pub/data/dsommer/
 /fs/sztmpscratch/dsommer/backup_sal
 
 /fs/szasmg/Bacteria/Salmonella/
 /fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/
 /fs/szasmg/Bacteria/Salmonella/S_typhimurium_LT2
 
 /fs/szdata//ncbi/genomes/Bacteria/Salmonella_enterica_Paratypi_ATCC_9150/
 /fs/szdata//ncbi/genomes/Bacteria/Salmonella_typhimurium_LT2/

WUSTL

 Data Download
       name                                                date                assembly     #contigs
    1  Salmonella_enterica_serovar_Arizonae/               04-Apr-2006         yes          256
    2  Salmonella_enterica_serovar_Diarizonae/             02-May-2006         yes          739
    3  Salmonella_enterica_serovar_Paratyphi_A/            04-Apr-2006         no
    4  Salmonella_enterica_serovar_Paratyphi_B/            27-Apr-2006         yes          187
    5  Salmonella_enterica_serovar_Typhimurium_strain_LT2/ 04-Apr-2006         no

Salmonella enterica serovar Paratyphi A str. ATCC 9150

NCBI

 Genome project
 Taxonomy (TaxID: 295319)
 Name           Length %GC    Description
 NC_006511.1    4585229 52.16  Salmonella enterica subsp. enterica serovar Paratyphi A str. ATCC 9150, complete genome

Traces:

 All directories: 103,971 (unique)
 B_SPA : 102,405  (unique) => 1,566 missing
 ~ 10X coverage
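
The missing-trace count above is the difference between the two unique name lists (103,971 - 102,405 = 1,566). A minimal sketch of how such a comparison could be made, assuming two hypothetical files with one read name per line (the file names in the usage line are illustrative, not paths from this project):

```shell
# missing_reads ALL SUBSET: print names present in ALL but absent from SUBSET.
# Both arguments are hypothetical files holding one read name per line.
missing_reads() {
    all_sorted=$(mktemp)
    sub_sorted=$(mktemp)
    # sort and comm must agree on collation, so force the C locale for both
    LC_ALL=C sort -u "$1" > "$all_sorted"
    LC_ALL=C sort -u "$2" > "$sub_sorted"
    # comm -23 keeps only the lines unique to the first file
    LC_ALL=C comm -23 "$all_sorted" "$sub_sorted"
    rm -f "$all_sorted" "$sub_sorted"
}
```

Usage could look like: missing_reads all_dirs.names B_SPA.names | wc -l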

Unmated reads:

Sequenced in 1999-06
*.s1

Mated reads:

 The *.b1 and *.g1 reads seem to be mated!
 Sequenced in 2002-05, 2002-07
 p(.*).[bg]1
 oyg(.*).[bg]1
 P_AA(.*).[bg]1
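
The naming patterns above can be used to sort read names into groups. The following is only a sketch: it assumes a hypothetical file with one read name per line, and it borrows the labels small, medium and large from the Bambus mate file on this page. The more specific prefixes are tested first so that the generic p prefix does not capture them.

```shell
# classify_reads FILE: tag each read name (one per line) with a library label.
classify_reads() {
    awk '
        /\.s1$/            { print $0 "\tunmated"; next }   # single reads
        /^P_AA.*\.[bg]1$/  { print $0 "\tlarge";   next }
        /^oyg.*\.[bg]1$/   { print $0 "\tmedium";  next }
        /^p.*\.[bg]1$/     { print $0 "\tsmall";   next }
                           { print $0 "\tother" }           # anything unrecognized
    ' "$1"
}
```

Piping the output through cut -f2 | sort | uniq -c gives a per-library read count.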

WUSTL assemblies

1. ace.83: (best assembly of reads; 2003-05-12)

 /fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/edit_dir/B_SPA.fasta.screen.ace.83 
 $ grep ^CO *ace.83 | grep -v COMM | wc -l
 571 # total number of contigs
 Longest contig: 
 $ cat B_SPA.fasta.screen.ace.83
 AS 571 89509                                # 571 contigs, 89509 reads
 ...
 CO Contig1368 4813926 88824 1869182 C       
 
 Contig1368 is 4,813,926 bp (GDE format); 4,579,713 bp (FASTA format)
 Ends don't overlap
 There are misoriented reads at the ends (=>circular)
 Contains 88824 reads
 Other Salmonella strains are ~ 4.8M
 Problem:
 * Collapsed repeat:  high coverage, misoriented mates in the 2076881-2079555 region
 * Expanded into 3 copy tandem repeat in the finished assembly
 * 3 copies also in CA
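
The contig count and the longest contig can also be read from the CO lines in one pass. A minimal sketch, assuming the usual ACE layout "CO name padded_length n_reads n_base_segments U|C"; matching on the first field being exactly CO skips COMMENT lines, much like the grep -v COMM above. The length reported is the padded one, so it will be larger than the FASTA length.

```shell
# ace_summary FILE: contig count plus the longest contig (by padded length).
ace_summary() {
    awk '
        $1 == "CO" {
            n++
            if ($3 + 0 > max + 0) { max = $3; name = $2; reads = $4 }
        }
        END {
            printf "contigs=%d longest=%s padded_len=%d reads=%d\n", n, name, max, reads
        }
    ' "$1"
}
```

Usage could look like: ace_summary B_SPA.fasta.screen.ace.83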

2. Finished assembly: (assembly of contigs)

 File: finished.fasta.screen.ace.0
 1 contig 
 4,585,228 bp (FASTA format): 5,515 bp longer than ace.83 Contig1368; ends don't overlap
 11 long reads (contig reads)

Estimate lib insert sizes:

 $ toAmos -ace B_SPA.fasta.screen.ace.83
 $ grep -c ^rds B_SPA.afg         # check if links were created
 $ more toAmos.error              # check if there were any conversion errors
 $ bank-transact -b B_SPA.bnk -m B_SPA.afg -c
 $ bank2contig B_SPA.bnk > B_SPA.contig
 $ cat B_SPA.contig | grep ^# | grep -v ^## | sort 
 # look at distances between mated reads
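
Once mate distances have been pulled out of the sorted listing, the library ranges can be estimated from their mean and spread. The sketch below does not parse the bank2contig output itself; it assumes a hypothetical two-column, tab-delimited table (library name, observed distance between mates) prepared by hand or by another script, and it reports the sample standard deviation.

```shell
# insert_stats FILE: per-library count, mean and sample std dev of mate distances.
# FILE is a hypothetical tab-delimited table: library<TAB>distance.
insert_stats() {
    LC_ALL=C awk -F'\t' '
        NF >= 2 { n[$1]++; s[$1] += $2; ss[$1] += $2 * $2 }
        END {
            for (lib in n) {
                mean = s[lib] / n[lib]
                var = (n[lib] > 1) ? (ss[lib] - n[lib] * mean * mean) / (n[lib] - 1) : 0
                if (var < 0) var = 0      # guard against rounding below zero
                printf "%s\t%d\t%.1f\t%.1f\n", lib, n[lib], mean, sqrt(var)
            }
        }
    ' "$1" | LC_ALL=C sort
}
```

A range such as mean minus and plus two or three standard deviations is one possible choice for the min and max columns of the mate file; the right width is a judgment call.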

Create mate pair file (Bambus format, tab delimited)

 $ cat B_SPA.mates
    library small   2000    4000    (p).*
    pair    (p.*)\.b1$      (p.*)\.g1$
    
    library medium  4500    5500    (oyg).*
    pair    (oyg.*).b1$     (oyg.*).g1$
    
    library large   35000   45000   (P_AA).*
    pair    (P_AA.*).b1$    (P_AA.*).g1$

Rerun conversion utilities:

 $ toAmos -m B_SPA.mates -ace B_SPA.fasta.screen.ace.83 -o B_SPA.afg 
 $ bank-transact -b B_SPA.bnk -m B_SPA.afg -c

CBCB assemblies

 1. CA default params
 /fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/edit_dir/83/CA-qual
 87 scaff, 194 contigs, 19K singletons, 4,425,716 bp
 2. CA genomeSize=3M /fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/edit_dir/83/CA-qual-3M
 75 scaff, 183 contigs, 19K singletons, 4,515,434 bp
 No rearrangements compared to finished genome
 Significant number of SNPs
 3. AMOSCmp Ref=finished assembly; 89,509 reads; .ace.83 trimming => 31 contigs; 4,579,852 bp
 4. AMOSCmp Ref=finished assembly; 101,621 reads (.fasta.screen); nucmer trimming => 8 contigs; 4,583,946 bp
 5. merge of 9 contigs using slice tools
    /fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/edit_dir/final2/9r12345678-circ-rev-tr-recall.* 
    Steps:
      * recruit unassembled reads to span the Contig8.4 - Contig6.6 gap and assemble them into a new contig.
      * The 9 overlapping contigs (8 provided by Damon + 1 I assembled) were merged using the slice tools (zipclap program) into one piece. 
      * The new contig was circularized, reversed and rotated to align to the published one. 
      * I also recalled the consensus due to some ambiguity codes introduced in the process.
      * The new contig sequence is 70 bp shorter (4,585,158 bp  vs 4,585,228), but it aligns in one piece to the published contig.
 6. merge of 9 contigs using slice tools (best)
    /fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/edit_dir/final3/9r12345678-circ-rev-tr.* 
    Steps:
      * Same as 5 but a modified version of "modContig --circularize" was called
      * The circularization step did not recall the consensus
      * Recall was not used in the end
      * The new contig sequence is 5 bp shorter (4,585,223  bp  vs 4,585,228), but it aligns in one piece to the published contig.
      * show-snps 1con-9r12345678-circ-rev-tr.delta | grep -c 9r12345678$ => 46 SNPs

Salmonella typhimurium LT2

NCBI

 Genome project
 Name           Length  %GC    Description
 NC_003197.1    4857432 52.22  Salmonella typhimurium LT2, complete genome
 NC_003277.1    93939   53.13  Salmonella typhimurium LT2 plasmid pSLT, complete sequence

Traces

From WUSTL:

 total: 142,267
 single reads (*.s1): 117,524
 mate pairs: 7,236

Zero coverage regions:

 gi|16763390|ref|NC_003197.1|    288829  288861  32      0
 gi|16763390|ref|NC_003197.1|    691986  692019  33      0
 gi|16763390|ref|NC_003197.1|    1582051 1582484 433     0
 gi|16763390|ref|NC_003197.1|    2548470 2548631 161     0
 gi|16763390|ref|NC_003197.1|    2703507 2703816 309     0
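
The table above lists reference, start, end, length and coverage. A small sketch that recomputes the lengths from the coordinates and summarizes them, assuming the rows are saved to a whitespace-delimited file (the file name in the usage line is hypothetical); for the five rows listed here it should report 968 bp in total with a longest region of 433 bp.

```shell
# zero_cov_summary FILE: count, total length and longest zero-coverage region.
# Expects whitespace-delimited rows: reference start end [length coverage].
zero_cov_summary() {
    awk '
        NF >= 3 {
            len = $3 - $2
            n++; total += len
            if (len > max) max = len
        }
        END { printf "regions=%d total_bp=%d longest_bp=%d\n", n, total, max }
    ' "$1"
}
```

Usage could look like: zero_cov_summary zero_cvg.txt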

Two sets of reads not included in the WUSTL "final" assemblies

   12,912  in */Failures/ directories
   28,852  in sub assemblies but not in the final ones

WUSTL assemblies

1. Phrap assembly (2001-12-23)

 /fs/ftp-cbcb/pub/data/dsommer/B_STM/edit_dir/Phrap_Assembly_dir/B_STM.RAW.fasta.screen.ace
 head -1 ...
   AS 1226 113 663 
 # contigs: 1,226 
 max contig: 741,919 (larger than cap4 max)
 singlets 563,308
 contigs 1226 (696,553 bp), 1218 (186,475 bp) look rearranged compared to the reference

CBCB assemblies

1. CA default params

 /fs/szasmg/Bacteria/Salmonella/S_typhimurium_LT2/Assembly/2007_1215_CA_e1.50
 TotalScaffolds=1957
 TotalContigsInScaffolds=1965
 MaxContigLength=13910

2. CA e=6%

 /fs/szasmg/Bacteria/Salmonella/S_typhimurium_LT2/Assembly/2007_1215_CA_e6.00
 TotalScaffolds=1128
 TotalContigsInScaffolds=1146
 MaxContigLength=26183

3. AMOSCmp: trimming was taken from the OBT clrs (from 1)

  -D MINOVL=3 -D MAXTRIM=50 -D MAJORITY=50
 /fs/szasmg/Bacteria/Salmonella/S_typhimurium_LT2/Assembly/2007_1215_AMOSCmp-relaxed-OBT
 128 contigs
 Max 0 cvg area: 1581885-1582864 979 bp (1con-contigs.delta)

4. AMOSCmp: trimming was taken from the read nucmer clrs

 -D MINOVL=3 -D MAXTRIM=50 -D MAJORITY=50
 Ran for 9 days !!!
 /fs/szasmg/Bacteria/Salmonella/S_typhimurium_LT2/Assembly/2007_1215_AMOSCmp-relaxed-nucmer
 Max 0 cvg area: 4325279-4340861 15,582 bp (1con-contigs.delta)
 Contig length stats:
 desc    #elem   min     max     mean            stdev           sum
 contigs 8       46765   2148437 617384.12       710006.86       4939073
 

5. AMOSCmp: trimming was taken from the Phrap (ace) assembly

 -D MINOVL=3 -D MAXTRIM=50 -D MAJORITY=50
 /fs/szasmg/Bacteria/Salmonella/S_typhimurium_LT2/Assembly/2007_1215_AMOSCmp-relaxed-Phrap
 Contig length stats:
 desc    #elem   min     max     mean            stdev           sum
 contigs 20      3631    1397417 246942.84       308507.06       4938857
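
A table in the same layout as the contig length stats above (desc, #elem, min, max, mean, stdev, sum) can be produced directly from a contig FASTA file. This is a sketch, not the tool that generated the numbers on this page; in particular it uses the sample standard deviation, which is an assumption about how the stdev column was computed.

```shell
# fasta_len_stats FILE: one tab-delimited row of contig length statistics.
fasta_len_stats() {
    LC_ALL=C awk '
        function add(x) {
            n++; sum += x; sumsq += x * x
            if (n == 1 || x < min) min = x
            if (n == 1 || x > max) max = x
        }
        /^>/ { if (seen) add(len); seen = 1; len = 0; next }   # new record
        seen { gsub(/[ \t\r]/, ""); len += length($0) }        # sequence line
        END {
            if (seen) add(len)
            if (n == 0) exit 1
            mean = sum / n
            var = (n > 1) ? (sumsq - n * mean * mean) / (n - 1) : 0
            if (var < 0) var = 0
            printf "contigs\t%d\t%d\t%d\t%.2f\t%.2f\t%d\n", n, min, max, mean, sqrt(var), sum
        }
    ' "$1"
}
```

Running it on the contig FASTA of each AMOSCmp run makes the assemblies easy to compare side by side.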

Other strains

Data

NCBI

 Genome Projects
 26 listed, 5 complete
    1  Salmonella enterica subsp. enterica serovar 4,[5],12:i:- str. CVM23701 [TIGR]
    2  Salmonella enterica subsp. enterica serovar Agona str. SL483 [J. Craig Venter Institute]
    3  Salmonella enterica subsp. enterica serovar Choleraesuis str. SC-B67 [Chang Gung Memorial Hospital]          complete
    4  Salmonella enterica subsp. enterica serovar Dublin [University of Illinois at Urbana-Champaign]
    5  Salmonella enterica subsp. enterica serovar Dublin str. CT_02021853 [TIGR]
    6  Salmonella enterica subsp. enterica serovar Enteritidis str. LK5 [University of Illinois at Urbana-Champaign]
    7  Salmonella enterica subsp. enterica serovar Heidelberg str. SL476 [J. Craig Venter Institute]
    8  Salmonella enterica subsp. enterica serovar Heidelberg str. SL486 [TIGR/JCVI/J. Craig Venter Institute]
    9  Salmonella enterica subsp. enterica serovar Javiana str. GA_MM04042433 [J. Craig Venter Institute]
   10  Salmonella enterica subsp. enterica serovar Kentucky str. CDC 191 [J. Craig Venter Institute]
   11  Salmonella enterica subsp. enterica serovar Kentucky str. CVM29188 [TIGR]
   12  Salmonella enterica subsp. enterica serovar Newport str. SL254 [TIGR/J. Craig Venter Institute]
   13  Salmonella enterica subsp. enterica serovar Newport str. SL317 [J. Craig Venter Institute]                   in TA but not AA; 63 contigs; should be submitted to AA!!
   14  Salmonella enterica subsp. enterica serovar Paratyphi A str. ATCC 9150 [Washington University (WashU)]       complete
   15  Salmonella enterica subsp. enterica serovar Paratyphi C strain RKS4594 [Peking University Health Science Center]
   16  Salmonella enterica subsp. enterica serovar Pullorum [University of Illinois at Urbana-Champaign]
   17  Salmonella enterica subsp. enterica serovar Saintpaul str. SARA23 [TIGR]
   18  Salmonella enterica subsp. enterica serovar Saintpaul str. SARA29 [TIGR]
   19  Salmonella enterica subsp. enterica serovar Schwarzengrund str. CVM19633 [TIGR]
   20  Salmonella enterica subsp. enterica serovar Schwarzengrund str. SL480 [J. Craig Venter Institute]
   21  Salmonella enterica subsp. enterica serovar Typhi Ty2 [University of Wisconsin-Madison, USA]                 complete
   22  Salmonella enterica subsp. enterica serovar Typhi str. CT18 [Sanger Institute]                               complete
   23  Salmonella typhimurium DT104 [Sanger Institute]
   24  Salmonella typhimurium LT2 [Washington University (WashU)]                                                   complete
   25  Salmonella typhimurium SL1344 [Sanger Institute]
   26  Salmonella typhimurium TR7095 [Washington University (WashU)]
 
 TA:
    1  salmonella_enterica_subsp__enterica_serovar_4__5__12_i___str__cvm23701
    2  salmonella_enterica_subsp__enterica_serovar_agona_str__sl483
    3  salmonella_enterica_subsp__enterica_serovar_dublin_str__ct_02021853
    4  salmonella_enterica_subsp__enterica_serovar_hadar_str__ri_05p066             :  not in Genome Projects/AA (JCVI MSC)
    5  salmonella_enterica_subsp__enterica_serovar_heidelberg_str__sl476
    6  salmonella_enterica_subsp__enterica_serovar_heidelberg_str__sl486
    7  salmonella_enterica_subsp__enterica_serovar_javiana_str__ga_mm04042433
    8  salmonella_enterica_subsp__enterica_serovar_kentucky_str__cdc_191
    9  salmonella_enterica_subsp__enterica_serovar_kentucky_str__cvm29188
   10  salmonella_enterica_subsp__enterica_serovar_newport_str__sl254
   11  salmonella_enterica_subsp__enterica_serovar_newport_str__sl317
   12  salmonella_enterica_subsp__enterica_serovar_saintpaul_str__sara23
   13  salmonella_enterica_subsp__enterica_serovar_saintpaul_str__sara29
   14  salmonella_enterica_subsp__enterica_serovar_schwarzengrund_str__cvm19633
   15  salmonella_enterica_subsp__enterica_serovar_schwarzengrund_str__sl480
   16  salmonella_enterica_subsp__enterica_serovar_virchow_str__sl491               :  not in Genome Projects/AA (JCVI MSC)
   17  salmonella_enterica_subsp__enterica_serovar_weltevreden_str__hi_n05_537      :  not in Genome Projects/AA (JCVI MSC)
 
 AA:
    1  Salmonella enterica subsp. enterica serovar 4,[5],12:i:- str. CVM23701  TIGR    2740    440534  4,895,918       113     53,284  8.0X
    2  Salmonella enterica subsp. enterica serovar Agona str. SL483            JCVI    2924    454166  4,835,750       56      51,307  9.5X
    3  Salmonella enterica subsp. enterica serovar Dublin str. CT_02021853     TIGR    2741    439851  4,885,976       142     50,129  7.8X
    4  Salmonella enterica subsp. enterica serovar Hadar str. RI_05P066        JCVI    2995    465516  4,793,325       50      50,470  9.6X
    5  Salmonella enterica subsp. enterica serovar Heidelberg str. SL476       JCVI    2927    454169  5,083,392       49      54,058  9.2X
    6  Salmonella enterica subsp. enterica serovar Heidelberg str. SL486       JCVI    2925    454164  4,728,232       48      53,785  10.1X
    7  Salmonella enterica subsp. enterica serovar Javiana str. GA_MM04042433  JCVI    2921    454167  4,553,049       74      52,375  9.9X
    8  Salmonella enterica subsp. enterica serovar Kentucky str. CDC 191       JCVI    2922    454231  4,696,566       53      51,826  9.6X
    9  Salmonella enterica subsp. enterica serovar Kentucky str. CVM29188      TIGR    2737    439842  5,000,919       75      55,311  9.1X
   10  Salmonella enterica subsp. enterica serovar Newport str. SL254          JCVI    2926    423368  4,831,246       2       50,473  8.8X
   11  Salmonella enterica subsp. enterica serovar Saintpaul str. SARA23       TIGR    2735    439846  4,785,870       143     50,936  8.6X
   12  Salmonella enterica subsp. enterica serovar Saintpaul str. SARA29       TIGR    2739    439847  4,928,961       182     50,405  7.9X
   13  Salmonella enterica subsp. enterica serovar Schwarzengrund str. CVM19633  TIGR  2738    439843  4,734,042       160     49,533  7.4X
   14  Salmonella enterica subsp. enterica serovar Schwarzengrund str. SL480   JCVI    2923    454165  4,761,576       67      50,418  9.1X
   15  Salmonella enterica subsp. enterica serovar Virchow str. SL491          JCVI    2996    465517  4,858,188       73      54,841  10.3X
   16  Salmonella enterica subsp. enterica serovar Weltevreden str. HI_N05-537 JCVI    2994    465518  5,047,463       81      54,390  9.8X

TIGR/JCVI

  MSC

Sanger

 Salmonella comparative sequencing project
 Salmonella ftp
 
 Salmonella paratyphi A project
 Salmonella paratyphi A ftp
 
 Salmonella typhi project
 Salmonella typhi ftp includes 454
 
       Organism                        Size(Mb)  G+C   Status          Funding              Contigs   Traces      Name
    1  Salmonella bongori              4.46    51.3%   Finished        Beowulf Genomics     1         75084       SB
    2  Salmonella enteritidis PT4      4.686   52.17%  Finished        Beowulf Genomics     2         68660       PT4
    3  Salmonella gallinarum 287/91    4.747   52.2%   Finished        Beowulf Genomics     2         80266       SG
    4  Salmonella enterica Hadar       4.786   52.3%   Finished        Wellcome Trust       3         79998       HADAR
    5  Salmonella enterica Infantis    4.711   ~52.3%  Finished        Wellcome Trust       1         98013       SIN
    6  Salmonella paratyphi A          4.582   52.2%   Finished        Wellcome Trust       2         81421       spa
    7  Salmonella typhimurium DT104    5.02    52.1%   Finished        Beowulf Genomics     2         76534       DT104
    8  Salmonella typhimurium D23580   ~5.0    ~52%    Fin/gap clos.   Wellcome Trust       31        88293       D23580
    9  Salmonella typhimurium DT2      ~5      ~52%    Shotgun Progr.  Wellcome Trust       17        23552       DT2
   10  Salmonella typhimurium SL1344   5.067   52.2%   Finished        Beowulf Genomics     4         81802       SL1344
   11  Salmonella typhi                4.81    52.1%   Finished        Beowulf Genomics     1         0           st