Salmonella

From Cbcb
Revision as of 17:15, 20 November 2007 by Dpuiu (talk | contribs)

Data

From Washington Univ in St. Louis

Strains:

 Salmonella enterica subsp. enterica serovar Paratyphi A str. ATCC 9150: B_SPA
 Salmonella typhimurium LT2                                            : B_STM

Other data:

NCBI:

 Genome Projects
    1  Salmonella enterica subsp. enterica serovar 4,[5],12:i:- str. CVM23701 [TIGR]
    2  Salmonella enterica subsp. enterica serovar Agona str. SL483 [J. Craig Venter Institute]
    3  Salmonella enterica subsp. enterica serovar Choleraesuis str. SC-B67 [Chang Gung Memorial Hospital]          complete
    4  Salmonella enterica subsp. enterica serovar Dublin [University of Illinois at Urbana-Champaign]
    5  Salmonella enterica subsp. enterica serovar Dublin str. CT_02021853 [TIGR]
    6  Salmonella enterica subsp. enterica serovar Enteritidis str. LK5 [University of Illinois at Urbana-Champaign]
    7  Salmonella enterica subsp. enterica serovar Heidelberg str. SL476 [J. Craig Venter Institute]
    8  Salmonella enterica subsp. enterica serovar Heidelberg str. SL486 [TIGR/JCVI/J. Craig Venter Institute]
    9  Salmonella enterica subsp. enterica serovar Javiana str. GA_MM04042433 [J. Craig Venter Institute]
   10  Salmonella enterica subsp. enterica serovar Kentucky str. CDC 191 [J. Craig Venter Institute]
   11  Salmonella enterica subsp. enterica serovar Kentucky str. CVM29188 [TIGR]
   12  Salmonella enterica subsp. enterica serovar Newport str. SL254 [TIGR/J. Craig Venter Institute]
   13  Salmonella enterica subsp. enterica serovar Newport str. SL317 [J. Craig Venter Institute]                   in TA but not AA; 63 contigs; should be submitted to AA
   14  Salmonella enterica subsp. enterica serovar Paratyphi A str. ATCC 9150 [Washington University (WashU)]       complete
   15  Salmonella enterica subsp. enterica serovar Paratyphi C strain RKS4594 [Peking University Health Science Center]
   16  Salmonella enterica subsp. enterica serovar Pullorum [University of Illinois at Urbana-Champaign]
   17  Salmonella enterica subsp. enterica serovar Saintpaul str. SARA23 [TIGR]
   18  Salmonella enterica subsp. enterica serovar Saintpaul str. SARA29 [TIGR]
   19  Salmonella enterica subsp. enterica serovar Schwarzengrund str. CVM19633 [TIGR]
   20  Salmonella enterica subsp. enterica serovar Schwarzengrund str. SL480 [J. Craig Venter Institute]
   21  Salmonella enterica subsp. enterica serovar Typhi Ty2 [University of Wisconsin-Madison, USA]                 complete
   22  Salmonella enterica subsp. enterica serovar Typhi str. CT18 [Sanger Institute]                               complete
   23  Salmonella typhimurium DT104 [Sanger Institute]
   24  Salmonella typhimurium LT2 [Washington University (WashU)]                                                   complete
   25  Salmonella typhimurium SL1344 [Sanger Institute]
   26  Salmonella typhimurium TR7095 [Washington University (WashU)]
 TA:
    1  salmonella_enterica_subsp__enterica_serovar_4__5__12_i___str__cvm23701
    2  salmonella_enterica_subsp__enterica_serovar_agona_str__sl483
    3  salmonella_enterica_subsp__enterica_serovar_dublin_str__ct_02021853
    4  salmonella_enterica_subsp__enterica_serovar_hadar_str__ri_05p066             :  not in Genome Projects/AA (JCVI MSC)
    5  salmonella_enterica_subsp__enterica_serovar_heidelberg_str__sl476
    6  salmonella_enterica_subsp__enterica_serovar_heidelberg_str__sl486
    7  salmonella_enterica_subsp__enterica_serovar_javiana_str__ga_mm04042433
    8  salmonella_enterica_subsp__enterica_serovar_kentucky_str__cdc_191
    9  salmonella_enterica_subsp__enterica_serovar_kentucky_str__cvm29188
   10  salmonella_enterica_subsp__enterica_serovar_newport_str__sl254
   11  salmonella_enterica_subsp__enterica_serovar_newport_str__sl317
   12  salmonella_enterica_subsp__enterica_serovar_saintpaul_str__sara23
   13  salmonella_enterica_subsp__enterica_serovar_saintpaul_str__sara29
   14  salmonella_enterica_subsp__enterica_serovar_schwarzengrund_str__cvm19633
   15  salmonella_enterica_subsp__enterica_serovar_schwarzengrund_str__sl480
   16  salmonella_enterica_subsp__enterica_serovar_virchow_str__sl491               :  not in Genome Projects/AA (JCVI MSC)
   17  salmonella_enterica_subsp__enterica_serovar_weltevreden_str__hi_n05_537      :  not in Genome Projects/AA (JCVI MSC)
 AA:
    1  Salmonella enterica subsp. enterica serovar 4,[5],12:i:- str. CVM23701  TIGR    2740    440534  4,895,918       113     53,284  8.0X
    2  Salmonella enterica subsp. enterica serovar Agona str. SL483            JCVI    2924    454166  4,835,750       56      51,307  9.5X
    3  Salmonella enterica subsp. enterica serovar Dublin str. CT_02021853     TIGR    2741    439851  4,885,976       142     50,129  7.8X
    4  Salmonella enterica subsp. enterica serovar Hadar str. RI_05P066        JCVI    2995    465516  4,793,325       50      50,470  9.6X
    5  Salmonella enterica subsp. enterica serovar Heidelberg str. SL476       JCVI    2927    454169  5,083,392       49      54,058  9.2X
    6  Salmonella enterica subsp. enterica serovar Heidelberg str. SL486       JCVI    2925    454164  4,728,232       48      53,785  10.1X
    7  Salmonella enterica subsp. enterica serovar Javiana str. GA_MM04042433  JCVI    2921    454167  4,553,049       74      52,375  9.9X
    8  Salmonella enterica subsp. enterica serovar Kentucky str. CDC 191       JCVI    2922    454231  4,696,566       53      51,826  9.6X
    9  Salmonella enterica subsp. enterica serovar Kentucky str. CVM29188      TIGR    2737    439842  5,000,919       75      55,311  9.1X
   10  Salmonella enterica subsp. enterica serovar Newport str. SL254          JCVI    2926    423368  4,831,246       2       50,473  8.8X
   11  Salmonella enterica subsp. enterica serovar Saintpaul str. SARA23       TIGR    2735    439846  4,785,870       143     50,936  8.6X
   12  Salmonella enterica subsp. enterica serovar Saintpaul str. SARA29       TIGR    2739    439847  4,928,961       182     50,405  7.9X
   13  Salmonella enterica subsp. enterica serovar Schwarzengrund str. CVM19633  TIGR  2738    439843  4,734,042       160     49,533  7.4X
   14  Salmonella enterica subsp. enterica serovar Schwarzengrund str. SL480   JCVI    2923    454165  4,761,576       67      50,418  9.1X
   15  Salmonella enterica subsp. enterica serovar Virchow str. SL491          JCVI    2996    465517  4,858,188       73      54,841  10.3X
   16  Salmonella enterica subsp. enterica serovar Weltevreden str. HI_N05-537 JCVI    2994    465518  5,047,463       81      54,390  9.8X

TIGR/JCVI

  MSC

Sanger:

 Salmonella project
 Salmonella typhi project
 Salmonella ftp
 ST ftp includes 454

Goals:

 1. Validate the assemblies
 2. Submit traces to NCBI TA: 
    Problems:
      * some traces were edited (phd.2, phd.3, ...); should these edits appear in the SCF files?
 3. Convert assemblies to XML format and submit them to NCBI AA

File locations:

 /fs/ftp-cbcb/pub/data/dsommer/
 /fs/sztmpscratch/dsommer/backup_sal
 /fs/szasmg/Bacteria/Salmonella/
 /fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/

SPA

NCBI

 Genome 
 Taxonomy (TaxID: 295319)

Traces:

 All directories: 103971 (unique)
 B_SPA : 102405  (unique) => 1566 missing
 ~ 10X coverage
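
The 1566 missing reads can be listed by name with a set difference. This is a minimal sketch, assuming two hypothetical text files with one read name per line (all trace names, and the names found in B_SPA); the function name missing_reads is also made up for illustration.

```shell
# missing_reads ALL SUBSET
# Print each read name that occurs in ALL but not in SUBSET (once, in ALL's order).
# Both arguments are hypothetical files with one read name per line.
missing_reads() {
    awk 'FILENAME == ARGV[1] { seen[$0] = 1; next }   # first file: names to exclude
         !($0 in seen) && !printed[$0]++              # second file: print unseen names once
        ' "$2" "$1"
}
```

Piping the output through wc -l gives the count to compare against the number above.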

The *.b1 and *.g1 reads appear to be mated.

Mate pairs:

 p(.*).[bg]1
 oyg(.*).[bg]1
 P_AA(.*).[bg]1
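
One way to check the naming patterns above is to count, per prefix, how many templates have both a .b1 and a .g1 read. This is only a sketch: it assumes read names arrive one per line on stdin, and the function name count_mated is hypothetical.

```shell
# count_mated: read names on stdin, one per line.
# Prints "<prefix> <templates with both a .b1 and a .g1 read>" for p, oyg and P_AA.
count_mated() {
    awk '
    $1 ~ /\.[bg]1$/ {
        name = $1
        dir  = substr(name, length(name) - 1, 1)      # "b" or "g"
        tmpl = substr(name, 1, length(name) - 3)      # name without ".b1" / ".g1"
        if      (tmpl ~ /^P_AA/) lib = "P_AA"
        else if (tmpl ~ /^oyg/)  lib = "oyg"
        else if (tmpl ~ /^p/)    lib = "p"
        else next                                     # unknown prefix: ignore
        seen[tmpl, dir] = 1
        libof[tmpl] = lib
    }
    END {
        for (t in libof)
            if (((t, "b") in seen) && ((t, "g") in seen)) n[libof[t]]++
        for (l in n) print l, n[l]
    }' | LC_ALL=C sort
}
```

Prefixes with no complete pair are not printed.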

WUSTL assemblies:

1. ace.83: (best assembly of reads)

 /fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/edit_dir/B_SPA.fasta.screen.ace.83 
 $ grep ^CO *ace.83 | grep -v COMM | wc -l
 571 # total number of contigs
 Longest contig: 
 $ cat B_SPA.fasta.screen.ace.83
 AS 571 89509                                # 571 contigs, 89509 reads
 ...
 CO Contig1368 4813926 88824 1869182 C       
 
 Contig1368 is 4,813,926 bp (GDE format) or 4,579,713 bp (FASTA format)
 Ends don't overlap
 There are misoriented reads at the ends (=> circular)
 Contains 88824 reads
 Other Salmonella strains are ~ 4.8M
 Problem:
 * Collapsed repeat: high coverage and misoriented mates in the 2076881-2079555 region
 * Expanded into a 3-copy tandem repeat in the finished assembly
 * 3 copies also in CA
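
The longest contig can be picked out of the ACE file without paging through it. A minimal sketch, assuming the ACE file is fed on stdin and that contig headers have the usual six-field form shown above (CO, name, length, reads, base segments, orientation); the function name ace_contigs is made up here.

```shell
# ace_contigs: ACE file on stdin.
# Prints "<contig> <length as given on the CO line> <reads>", longest contig first.
ace_contigs() {
    awk '$1 == "CO" && NF >= 6 && $3 ~ /^[0-9]+$/ { print $2, $3, $4 }' |
        LC_ALL=C sort -k2,2nr
}
```

For example, ace_contigs < B_SPA.fasta.screen.ace.83 | head -n 1 would print the longest contig, and piping to wc -l instead gives the contig count.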

2. Finished assembly: (assembly of contigs)

 File: finished.fasta.screen.ace.0
 1 contig 
 4,585,228 bp (FASTA format): 5,515 bp longer than ace.83 Contig1368; ends don't overlap
 11 long reads (contig reads)
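
The FASTA lengths quoted above (4,585,228 bp here, 4,579,713 bp for the ace.83 contig) can be recomputed directly from the sequence files. A small sketch, assuming FASTA on stdin; the function name fasta_lengths is hypothetical.

```shell
# fasta_lengths: FASTA on stdin.
# Prints "<record id> <number of bases>" for every record.
fasta_lengths() {
    awk '
    /^>/ { if (id != "") print id, n       # flush the previous record
           id = substr($1, 2); n = 0; next }
         { gsub(/[ \t\r]/, ""); n += length($0) }
    END  { if (id != "") print id, n }'
}
```

Running it on both consensus files and subtracting the two lengths should reproduce the 5,515 bp difference.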

Estimate lib insert sizes:

 $ toAmos -ace B_SPA.fasta.screen.ace.83
 $ grep -c ^rds B_SPA.afg         # check if links were created
 $ more toAmos.error              # check if there were any conversion errors
 $ bank-transact -b B_SPA.bnk -m B_SPA.afg -c
 $ bank2contig B_SPA.bnk > B_SPA.contig
 $ cat B_SPA.contig | grep ^# | grep -v ^## | sort 
 # look at distances between mated reads
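
Looking at the distances by eye can be replaced by a per-library summary. The sketch below does not parse the bank2contig output itself; it assumes the read placements have already been reduced to a hypothetical three-column layout (contig, read name, offset), and the function name mate_distances is made up. The distance reported is between read offsets, so it approximates the insert size without accounting for read length.

```shell
# mate_distances: lines of "<contig> <read> <offset>" on stdin (hypothetical layout).
# Prints "<library> <pairs> <mean> <min> <max>" for mates placed in the same contig.
mate_distances() {
    awk '
    $2 ~ /\.[bg]1$/ {
        tmpl = substr($2, 1, length($2) - 3)          # strip ".b1" / ".g1"
        if (!(tmpl in pos)) { pos[tmpl] = $3; ctg[tmpl] = $1; next }
        if (ctg[tmpl] != $1) next                     # mates in different contigs
        if      (tmpl ~ /^P_AA/) lib = "P_AA"
        else if (tmpl ~ /^oyg/)  lib = "oyg"
        else if (tmpl ~ /^p/)    lib = "p"
        else next
        d = $3 - pos[tmpl]; if (d < 0) d = -d
        n[lib]++; sum[lib] += d
        if (n[lib] == 1 || d < min[lib]) min[lib] = d
        if (n[lib] == 1 || d > max[lib]) max[lib] = d
    }
    END { for (l in n) printf "%s %d %.0f %d %d\n", l, n[l], sum[l] / n[l], min[l], max[l] }
    ' | LC_ALL=C sort
}
```

The min and max columns give a starting point for the library ranges in the mates file.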

Create mate pair file (Bambus format, tab delimited)

 $ cat B_SPA.mates
    library small   2000    4000    (p).*
    pair    (p.*)\.b1$      (p.*)\.g1$
    
    library medium  4500    5500    (oyg).*
    pair    (oyg.*)\.b1$    (oyg.*)\.g1$
    
    library large   35000   45000   (P_AA).*
    pair    (P_AA.*)\.b1$   (P_AA.*)\.g1$
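
Because the file must be tab delimited, it is safer to generate it with printf than to type it in an editor that may expand tabs. A sketch, assuming the library/pair layout shown above; the helper name bambus_library is hypothetical.

```shell
# bambus_library NAME MIN MAX PREFIX
# Emit a tab-delimited "library" line and the matching "pair" line for reads
# named <PREFIX>....b1 and <PREFIX>....g1.
bambus_library() {
    printf 'library\t%s\t%s\t%s\t(%s).*\n' "$1" "$2" "$3" "$4"
    printf 'pair\t(%s.*)\\.b1$\t(%s.*)\\.g1$\n' "$4" "$4"
}
```

Calling it once per library (small 2000 4000 p, medium 4500 5500 oyg, large 35000 45000 P_AA) and redirecting the output to B_SPA.mates reproduces the entries listed above.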

Rerun conversion utilities:

 $ toAmos -m B_SPA.mates -ace B_SPA.fasta.screen.ace.83 -o B_SPA.afg 
 $ bank-transact -b B_SPA.bnk -m B_SPA.afg -c

CBCB assemblies:

 1. CA default params
 /fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/edit_dir/83/CA-qual
 87 scaff, 194 contigs, 19K singletons, 4,425,716 bp
 2. CA genomeSize=3M /fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/edit_dir/83/CA-qual-3M
 75 scaff, 183 contigs, 19K singletons, 4,515,434 bp
 No rearrangements compared to finished genome
 Significant number of SNPs
 3. AMOSCmp Ref=finished assembly; 89,509 reads; .ace.83 trimming => 31 contigs; 4,579,852 bp
 4. AMOSCmp Ref=finished assembly; 101,621 reads (.fasta.screen); nucmer trimming => 8 contigs; 4,583,946 bp
 5. merge of 9 contigs using slice tools
    /fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/edit_dir/final2/9r12345678-circ-rev-tr-recall.* 
    Steps:
      * recruit unassembled reads to span the Contig8.4 - Contig6.6 gap and assemble them into a new contig.
      * The 9 overlapping contigs (8 provided by Damon + 1 I assembled) were merged using the slice tools (zipclap program) into one piece. 
      * The new contig was circularized, reversed and rotated to align to the published one. 
      * I also recalled the consensus due to some ambiguity codes introduced in the process.
      * The new contig sequence is 70 bp shorter (4,585,158 bp  vs 4,585,228), but it aligns in one piece to the published contig.
 6. merge of 9 contigs using slice tools (best)
    /fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/edit_dir/final3/9r12345678-circ-rev-tr.* 
    Steps:
      * Same as 5, but a modified version of "modContig --circularize" was called
      * The circularization step did not recall the consensus
      * Recall was not used in the end
      * The new contig sequence is 5 bp shorter (4,585,223  bp  vs 4,585,228), but it aligns in one piece to the published contig.
      * show-snps 1con-9r12345678-circ-rev-tr.delta | grep -c 9r12345678$ => 46 SNPs
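
The "reversed and rotated" steps in 5 and 6 amount to two simple operations on a circular sequence. The sketch below only illustrates them on a plain single-line sequence; it is not what modContig or the slice tools do internally, it ignores the read layout, and the function names revcomp and rotate are made up. It assumes the rev utility is available.

```shell
# revcomp: one sequence per line on stdin; prints the reverse complement.
revcomp() {
    rev | tr 'ACGTacgt' 'TGCAtgca'
}

# rotate K: one circular sequence per line on stdin;
# moves the first K bases to the end so that base K+1 becomes the new start.
rotate() {
    awk -v k="$1" '
    length($0) == 0 { print; next }
    { k = k % length($0); print substr($0, k + 1) substr($0, 1, k) }'
}
```

For example, printf 'AACGT\n' | revcomp prints ACGTT, and printf 'AACGT\n' | rotate 2 prints CGTAA.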