Salmonella: Difference between revisions

   Salmonella enterica subsp. enterica serovar Paratyphi A str. ATCC 9150: B_SPA
   Salmonella typhimurium LT2                                            : B_STM
Other data:
NCBI:
  [http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj&cmd=search&term=Salmonella%20enterica%20subsp.%20enterica%20serovar Genome Projects]
    1  Salmonella enterica subsp. enterica serovar 4,[5],12:i:- str. CVM23701 [TIGR]
    2  Salmonella enterica subsp. enterica serovar Agona str. SL483 [J. Craig Venter Institute]
    3  Salmonella enterica subsp. enterica serovar Choleraesuis str. SC-B67 [Chang Gung Memorial Hospital]          complete
    4  Salmonella enterica subsp. enterica serovar Dublin [University of Illinois at Urbana-Champaign]
    5  Salmonella enterica subsp. enterica serovar Dublin str. CT_02021853 [TIGR]
    6  Salmonella enterica subsp. enterica serovar Enteritidis str. LK5 [University of Illinois at Urbana-Champaign]
    7  Salmonella enterica subsp. enterica serovar Heidelberg str. SL476 [J. Craig Venter Institute]
    8  Salmonella enterica subsp. enterica serovar Heidelberg str. SL486 [TIGR/JCVI/J. Craig Venter Institute]
    9  Salmonella enterica subsp. enterica serovar Javiana str. GA_MM04042433 [J. Craig Venter Institute]
    10  Salmonella enterica subsp. enterica serovar Kentucky str. CDC 191 [J. Craig Venter Institute]
    11  Salmonella enterica subsp. enterica serovar Kentucky str. CVM29188 [TIGR]
    12  Salmonella enterica subsp. enterica serovar Newport str. SL254 [TIGR/J. Craig Venter Institute]
    13  Salmonella enterica subsp. enterica serovar Newport str. SL317 [J. Craig Venter Institute]                  in TA but not AA; 63 contigs; should be submitted to AA!
    14  Salmonella enterica subsp. enterica serovar Paratyphi A str. ATCC 9150 [Washington University (WashU)]      complete
    15  Salmonella enterica subsp. enterica serovar Paratyphi C strain RKS4594 [Peking University Health Science Center]
    16  Salmonella enterica subsp. enterica serovar Pullorum [University of Illinois at Urbana-Champaign]
    17  Salmonella enterica subsp. enterica serovar Saintpaul str. SARA23 [TIGR]
    18  Salmonella enterica subsp. enterica serovar Saintpaul str. SARA29 [TIGR]
    19  Salmonella enterica subsp. enterica serovar Schwarzengrund str. CVM19633 [TIGR]
    20  Salmonella enterica subsp. enterica serovar Schwarzengrund str. SL480 [J. Craig Venter Institute]
    21  Salmonella enterica subsp. enterica serovar Typhi Ty2 [University of Wisconsin-Madison, USA]                complete
    22  Salmonella enterica subsp. enterica serovar Typhi str. CT18 [Sanger Institute]                              complete
    23  Salmonella typhimurium DT104 [Sanger Institute]
    24  Salmonella typhimurium LT2 [Washington University (WashU)]                                                  complete
    25  Salmonella typhimurium SL1344 [Sanger Institute]
    26  Salmonella typhimurium TR7095 [Washington University (WashU)]
  TA:
    1  salmonella_enterica_subsp__enterica_serovar_4__5__12_i___str__cvm23701
    2  salmonella_enterica_subsp__enterica_serovar_agona_str__sl483
    3  salmonella_enterica_subsp__enterica_serovar_dublin_str__ct_02021853
    4  salmonella_enterica_subsp__enterica_serovar_hadar_str__ri_05p066            :  not in Genome Projects/AA (JCVI MSC)
    5  salmonella_enterica_subsp__enterica_serovar_heidelberg_str__sl476
    6  salmonella_enterica_subsp__enterica_serovar_heidelberg_str__sl486
    7  salmonella_enterica_subsp__enterica_serovar_javiana_str__ga_mm04042433
    8  salmonella_enterica_subsp__enterica_serovar_kentucky_str__cdc_191
    9  salmonella_enterica_subsp__enterica_serovar_kentucky_str__cvm29188
    10  salmonella_enterica_subsp__enterica_serovar_newport_str__sl254
    11  salmonella_enterica_subsp__enterica_serovar_newport_str__sl317
    12  salmonella_enterica_subsp__enterica_serovar_saintpaul_str__sara23
    13  salmonella_enterica_subsp__enterica_serovar_saintpaul_str__sara29
    14  salmonella_enterica_subsp__enterica_serovar_schwarzengrund_str__cvm19633
    15  salmonella_enterica_subsp__enterica_serovar_schwarzengrund_str__sl480
    16  salmonella_enterica_subsp__enterica_serovar_virchow_str__sl491              :  not in Genome Projects/AA (JCVI MSC)
    17  salmonella_enterica_subsp__enterica_serovar_weltevreden_str__hi_n05_537      :  not in Genome Projects/AA (JCVI MSC)
  AA:
    1  Salmonella enterica subsp. enterica serovar 4,[5],12:i:- str. CVM23701  TIGR    2740    440534  4,895,918      113    53,284  8.0X
    2  Salmonella enterica subsp. enterica serovar Agona str. SL483            JCVI    2924    454166  4,835,750      56      51,307  9.5X
    3  Salmonella enterica subsp. enterica serovar Dublin str. CT_02021853    TIGR    2741    439851  4,885,976      142    50,129  7.8X
    4  Salmonella enterica subsp. enterica serovar Hadar str. RI_05P066        JCVI    2995    465516  4,793,325      50      50,470  9.6X
    5  Salmonella enterica subsp. enterica serovar Heidelberg str. SL476      JCVI    2927    454169  5,083,392      49      54,058  9.2X
    6  Salmonella enterica subsp. enterica serovar Heidelberg str. SL486      JCVI    2925    454164  4,728,232      48      53,785  10.1X
    7  Salmonella enterica subsp. enterica serovar Javiana str. GA_MM04042433  JCVI    2921    454167  4,553,049      74      52,375  9.9X
    8  Salmonella enterica subsp. enterica serovar Kentucky str. CDC 191      JCVI    2922    454231  4,696,566      53      51,826  9.6X
    9  Salmonella enterica subsp. enterica serovar Kentucky str. CVM29188      TIGR    2737    439842  5,000,919      75      55,311  9.1X
    10  Salmonella enterica subsp. enterica serovar Newport str. SL254          JCVI    2926    423368  4,831,246      2      50,473  8.8X
    11  Salmonella enterica subsp. enterica serovar Saintpaul str. SARA23      TIGR    2735    439846  4,785,870      143    50,936  8.6X
    12  Salmonella enterica subsp. enterica serovar Saintpaul str. SARA29      TIGR    2739    439847  4,928,961      182    50,405  7.9X
    13  Salmonella enterica subsp. enterica serovar Schwarzengrund str. CVM19633  TIGR  2738    439843  4,734,042      160    49,533  7.4X
    14  Salmonella enterica subsp. enterica serovar Schwarzengrund str. SL480  JCVI    2923    454165  4,761,576      67      50,418  9.1X
    15  Salmonella enterica subsp. enterica serovar Virchow str. SL491          JCVI    2996    465517  4,858,188      73      54,841  10.3X
    16  Salmonella enterica subsp. enterica serovar Weltevreden str. HI_N05-537 JCVI    2994    465518  5,047,463      81      54,390  9.8X
TIGR/JCVI
  [http://msc.tigr.org/salmonella/index.shtml MSC]
Sanger:
  [http://www.sanger.ac.uk/Projects/Salmonella/ Salmonella project]
  [http://www.sanger.ac.uk/Projects/S_typhi/ Salmonella typhi project]
  [ftp://ftp.sanger.ac.uk/pub/pathogens/Salmonella Salmonella ftp]
  [ftp://ftp.sanger.ac.uk/pub/pathogens/st/ ST ftp] includes 454



Revision as of 19:17, 20 November 2007

Data

From Washington Univ in St. Louis

Strains:

 Salmonella enterica subsp. enterica serovar Paratyphi A str. ATCC 9150: B_SPA
 Salmonella typhimurium LT2                                            : B_STM

Goals:

 1. Validate the assemblies
 2. Submit traces to NCBI TA
    Problems:
      * some traces were edited (phd.2, phd.3, ...); should these edits appear in the SCF files?
 3. Convert assemblies to XML format and submit them to NCBI AA

File locations:

 /fs/ftp-cbcb/pub/data/dsommer/
 /fs/sztmpscratch/dsommer/backup_sal
 /fs/szasmg/Bacteria/Salmonella/
 /fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/

SPA

NCBI

 Genome 
 Taxonomy (TaxID: 295319)

Traces:

 All directories: 103971 (unique)
 B_SPA : 102405  (unique) => 1566 missing
 ~ 10X coverage
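
The 1566 missing traces are simply 103971 - 102405. To see which traces are missing rather than just how many, the two name lists can be compared. This is a minimal sketch, assuming each list is a plain-text file with one trace name per line; missing_names and the file names in the usage comment are hypothetical.

```shell
# Hypothetical helper: print names found in file $1 but not in file $2.
missing_names() {
  all_sorted=$(mktemp)
  sub_sorted=$(mktemp)
  LC_ALL=C sort -u "$1" > "$all_sorted"            # de-duplicate and sort the full list
  LC_ALL=C sort -u "$2" > "$sub_sorted"            # same for the subset
  LC_ALL=C comm -23 "$all_sorted" "$sub_sorted"    # keep lines unique to the first list
  rm -f "$all_sorted" "$sub_sorted"
}

# Usage (hypothetical file names):
#   missing_names all_traces.txt B_SPA_traces.txt | wc -l
```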

The *.b1 and *.g1 reads seem to be mated.

Mate pairs:

 p(.*)\.[bg]1
 oyg(.*)\.[bg]1
 P_AA(.*)\.[bg]1
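
The naming patterns above can be used to check which templates really have both mates. This is a minimal sketch, assuming read names arrive one per line and that a template is mated when both its .b1 and .g1 reads are present; pair_mates is a hypothetical helper, and the library labels are just the name prefixes.

```shell
# Hypothetical helper: read names on stdin, print "template<TAB>prefix"
# for every template that has both a .b1 and a .g1 read.
pair_mates() {
  awk '
    /\.[bg]1$/ {
      tmpl = $0; sub(/\.[bg]1$/, "", tmpl)      # strip the mate suffix
      end  = substr($0, length($0) - 1, 1)      # "b" or "g"
      seen[tmpl] = seen[tmpl] end
    }
    END {
      for (t in seen)
        if (seen[t] ~ /b/ && seen[t] ~ /g/) {
          lib = "unknown"
          if (t ~ /^P_AA/)     lib = "P_AA"
          else if (t ~ /^oyg/) lib = "oyg"
          else if (t ~ /^p/)   lib = "p"
          print t "\t" lib
        }
    }' | LC_ALL=C sort
}

# Usage (hypothetical input file):
#   pair_mates < read_names.txt | cut -f2 | sort | uniq -c
```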

WUSTL assemblies:

1. ace.83: (best assembly of reads)

 /fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/edit_dir/B_SPA.fasta.screen.ace.83 
 $ grep ^CO *ace.83 | grep -v COMM | wc -l
 571 # total number of contigs
 Longest contig: 
 $ cat B_SPA.fasta.screen.ace.83
 AS 571 89509                                # 571 contigs, 89509 reads
 ...
 CO Contig1368 4813926 88824 1869182 C       
 
 Contig1368 is 4,813,926 bp (GDE format), 4,579,713 bp (FASTA format)
 Ends don't overlap
 There are misoriented reads at the ends (=> circular)
 Contains 88824 reads
 Other Salmonella strains are ~ 4.8M
 Problem:
 * Collapsed repeat: high coverage, misoriented mates in the 2076881-2079555 region
 * Expanded into 3 copy tandem repeat in the finished assembly
 * 3 copies also in CA
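
The two lengths reported for Contig1368 presumably differ because the CO line of an ACE file counts pad characters ('*') in the consensus, while a FASTA export drops them. This is a minimal sketch for checking that per contig, assuming the usual ACE layout in which the consensus lines follow the CO line and end at a blank line; ace_contig_lengths is a hypothetical helper, not part of consed or AMOS.

```shell
# Hypothetical helper: print "name<TAB>padded_length<TAB>unpadded_length"
# for every contig in the ACE file given as $1.
ace_contig_lengths() {
  awk '
    function emit() { if (name != "") print name "\t" padded "\t" unpadded }
    $1 == "CO" && NF >= 6 {            # CO <name> <padded bases> <reads> <segments> <U|C>
      emit(); name = $2; padded = $3; unpadded = 0; in_seq = 1; next
    }
    in_seq && NF == 0 { in_seq = 0; next }    # blank line ends the consensus
    in_seq { s = $0; gsub(/\*/, "", s); unpadded += length(s) }
    END { emit() }
  ' "$1"
}

# Usage:
#   ace_contig_lengths B_SPA.fasta.screen.ace.83 | sort -k3,3nr | head -1
```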

2. Finished assembly: (assembly of contigs)

 File: finished.fasta.screen.ace.0
 1 contig 
 4,585,228 bp (FASTA format): 5,515 bp longer than ace.83 Contig1368; ends don't overlap
 11 long reads (contig reads)

Estimate lib insert sizes:

 $ toAmos -ace B_SPA.fasta.screen.ace.83
 $ grep -c ^rds B_SPA.afg         # check if links were created
 $ more toAmos.error              # check if there were any conversion errors
 $ bank-transact -b B_SPA.bnk -m B_SPA.afg -c
 $ bank2contig B_SPA.bnk > B_SPA.contig
 $ cat B_SPA.contig | grep ^# | grep -v ^## | sort 
 # look at distances between mated reads
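
Once the distances between mated reads have been collected, they can be summarized per library to choose the min/max insert sizes. This is a minimal sketch, assuming the distances were already extracted into a two-column, tab-delimited table (library name, distance in bp); extracting them from B_SPA.contig is not shown here, and insert_size_summary is a hypothetical helper.

```shell
# Hypothetical helper: read "library<TAB>distance" lines on stdin and print
# "library<TAB>count<TAB>mean<TAB>min<TAB>max" (mean truncated to an integer).
insert_size_summary() {
  awk -F'\t' '
    NF >= 2 {
      d = $2 + 0                                  # force numeric comparison
      n[$1]++; sum[$1] += d
      if (!($1 in min) || d < min[$1]) min[$1] = d
      if (!($1 in max) || d > max[$1]) max[$1] = d
    }
    END {
      for (l in n)
        printf "%s\t%d\t%d\t%d\t%d\n", l, n[l], sum[l] / n[l], min[l], max[l]
    }' | LC_ALL=C sort
}

# Usage (hypothetical input file):
#   insert_size_summary < mate_distances.tsv
```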

Create mate pair file (Bambus format, tab delimited)

 $ cat B_SPA.mates
    library small   2000    4000    (p).*
    pair    (p.*)\.b1$      (p.*)\.g1$
    
    library medium  4500    5500    (oyg).*
    pair    (oyg.*)\.b1$    (oyg.*)\.g1$
    
    library large   35000   45000   (P_AA).*
    pair    (P_AA.*)\.b1$   (P_AA.*)\.g1$
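
Before passing the mates file to toAmos, it may be worth checking it for obvious slips. This is a minimal sketch, assuming the tab-delimited layout shown above ("library name min max [regex]" and "pair forward_regex reverse_regex"); check_mates_file is a hypothetical helper and only tests field counts and that min is below max.

```shell
# Hypothetical helper: exit non-zero and report the line number if a record
# in the mates file $1 looks malformed.
check_mates_file() {
  awk -F'\t' '
    { sub(/^[ \t]+/, "") }                       # tolerate leading indentation
    $0 == "" { next }                            # skip blank separator lines
    $1 == "library" {
      if (NF < 4 || $3 + 0 >= $4 + 0) { print "bad library record on line " NR; bad = 1 }
      next
    }
    $1 == "pair" {
      if (NF != 3) { print "bad pair record on line " NR; bad = 1 }
      next
    }
    { print "unknown record on line " NR; bad = 1 }
    END { exit bad + 0 }
  ' "$1"
}

# Usage:
#   check_mates_file B_SPA.mates && echo "mates file looks consistent"
```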

Rerun conversion utilities:

 $ toAmos -m B_SPA.mates -ace B_SPA.fasta.screen.ace.83 -o B_SPA.afg 
 $ bank-transact -b B_SPA.bnk -m B_SPA.afg -c

CBCB assemblies:

 1. CA default params
 /fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/edit_dir/83/CA-qual
 87 scaff, 194 contigs, 19K singletons, 4,425,716 bp
 2. CA genomeSize=3M
 /fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/edit_dir/83/CA-qual-3M
 75 scaff, 183 contigs, 19K singletons, 4,515,434 bp
 No rearrangements compared to finished genome
 Significant number of SNPs
 3. AMOSCmp Ref=finished assembly; 89,509 reads; .ace.83 trimming => 31 contigs; 4,579,852 bp
 4. AMOSCmp Ref=finished assembly; 101,621 reads (.fasta.screen); nucmer trimming => 8 contigs; 4,583,946 bp
 5. merge of 9 contigs using slice tools
    /fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/edit_dir/final2/9r12345678-circ-rev-tr-recall.* 
    Steps:
      * recruit unassembled reads to span the Contig8.4 - Contig6.6 gap and assemble them into a new contig.
      * The 9 overlapping contigs (8 provided by Damon + 1 I assembled) were merged using the slice tools (zipclap program) into one piece. 
      * The new contig was circularized, reversed and rotated to align to the published one. 
      * I also recalled the consensus due to some ambiguity codes introduced in the process.
      * The new contig sequence is 70 bp shorter (4,585,158 bp  vs 4,585,228), but it aligns in one piece to the published contig.
 6. merge of 9 contigs using slice tools (best)
    /fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/edit_dir/final3/9r12345678-circ-rev-tr.* 
    Steps:
      * Same as 5, but a modified version of "modContig --circularize" was called
      * The circularization step did not recall the consensus
      * Recall was not used in the end
      * The new contig sequence is 5 bp shorter (4,585,223  bp  vs 4,585,228), but it aligns in one piece to the published contig.
      * show-snps 1con-9r12345678-circ-rev-tr.delta | grep -c 9r12345678$ => 46 SNPs
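
The grep -c count above includes every difference line, so it may mix substitutions with single-base indels. This is a minimal sketch for separating the two, assuming show-snps is run with -T (tab-delimited) and -H (no header) so that columns 2 and 3 hold the reference and query bases and "." marks a gap; classify_snps is a hypothetical helper.

```shell
# Hypothetical helper: read tab-delimited show-snps output on stdin and
# count substitutions versus indels ("." in either base column).
classify_snps() {
  awk -F'\t' '
    NF < 3 { next }                              # ignore blank or short lines
    $2 == "." || $3 == "." { indel++; next }
    { subst++ }
    END { printf "substitutions=%d indels=%d\n", subst, indel }
  '
}

# Usage:
#   show-snps -T -H 1con-9r12345678-circ-rev-tr.delta | classify_snps
```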