Salmonella: Difference between revisions

From Cbcb

Revision as of 16:30, 20 November 2007

Data

From Washington Univ in St. Louis

Strains:

 Salmonella enterica subsp. enterica serovar Paratyphi A str. ATCC 9150: B_SPA
 Salmonella typhimurium LT2                                            : B_STM

Other data:

NCBI:

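 Genome Projects (http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj&cmd=search&term=Salmonella%20enterica%20subsp.%20enterica%20serovar)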
    1  Salmonella enterica subsp. enterica serovar 4,[5],12:i:- str. CVM23701 [TIGR]
    2  Salmonella enterica subsp. enterica serovar Agona str. SL483 [J. Craig Venter Institute]
    3  Salmonella enterica subsp. enterica serovar Choleraesuis str. SC-B67 [Chang Gung Memorial Hospital]
    4  Salmonella enterica subsp. enterica serovar Dublin [University of Illinois at Urbana-Champaign]
    5  Salmonella enterica subsp. enterica serovar Dublin str. CT_02021853 [TIGR]
    6  Salmonella enterica subsp. enterica serovar Enteritidis str. LK5 [University of Illinois at Urbana-Champaign]
    7  Salmonella enterica subsp. enterica serovar Heidelberg str. SL476 [J. Craig Venter Institute]
    8  Salmonella enterica subsp. enterica serovar Heidelberg str. SL486 [TIGR/JCVI/J. Craig Venter Institute]
    9  Salmonella enterica subsp. enterica serovar Javiana str. GA_MM04042433 [J. Craig Venter Institute]
   10  Salmonella enterica subsp. enterica serovar Kentucky str. CDC 191 [J. Craig Venter Institute]
   11  Salmonella enterica subsp. enterica serovar Kentucky str. CVM29188 [TIGR]
   12  Salmonella enterica subsp. enterica serovar Newport str. SL254 [TIGR/J. Craig Venter Institute]
   13  Salmonella enterica subsp. enterica serovar Newport str. SL317 [J. Craig Venter Institute]
   14  Salmonella enterica subsp. enterica serovar Paratyphi A str. ATCC 9150 [Washington University (WashU)]
   15  Salmonella enterica subsp. enterica serovar Paratyphi C strain RKS4594 [Peking University Health Science Center]
   16  Salmonella enterica subsp. enterica serovar Pullorum [University of Illinois at Urbana-Champaign]
   17  Salmonella enterica subsp. enterica serovar Saintpaul str. SARA23 [TIGR]
   18  Salmonella enterica subsp. enterica serovar Saintpaul str. SARA29 [TIGR]
   19  Salmonella enterica subsp. enterica serovar Schwarzengrund str. CVM19633 [TIGR]
   20  Salmonella enterica subsp. enterica serovar Schwarzengrund str. SL480 [J. Craig Venter Institute]
   21  Salmonella enterica subsp. enterica serovar Typhi Ty2 [University of Wisconsin-Madison, USA]
   22  Salmonella enterica subsp. enterica serovar Typhi str. CT18 [Sanger Institute]
   23  Salmonella typhimurium DT104 [Sanger Institute]
   24  Salmonella typhimurium LT2 [Washington University (WashU)]
   25  Salmonella typhimurium SL1344 [Sanger Institute]
   26  Salmonella typhimurium TR7095 [Washington University (WashU)]

Goals:

 1. Validate the assemblies
 2. Submit traces to NCBI TA:
    Problems:
      * some traces were edited (phd.2, phd.3, ...); should these edits appear in the SCF files?
 3. Convert assemblies to XML format and submit them to NCBI AA

File locations:

 /fs/ftp-cbcb/pub/data/dsommer/
 /fs/sztmpscratch/dsommer/backup_sal
 /fs/szasmg/Bacteria/Salmonella/
 /fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/

SPA

NCBI

 Genome 
 Taxonomy (TaxID: 295319)

Traces:

 All directories: 103971 (unique)
 B_SPA : 102405  (unique) => 1566 missing
 ~ 10X coverage
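
The 1566 missing traces are the difference 103971 - 102405. To list which traces they are, one option is a set difference on the two name lists. A minimal sketch, assuming each list is a plain text file with one trace name per line; the function name and the file names in the usage comment are hypothetical:

```shell
# missing_traces ALL SUBSET
# Print the names that occur in file ALL but not in file SUBSET.
# Both files are assumed to hold one trace name per line (duplicates allowed).
missing_traces() {
  all_sorted=$(mktemp) || return 1
  sub_sorted=$(mktemp) || { rm -f "$all_sorted"; return 1; }
  # comm needs sorted input; -u also collapses duplicate names
  LC_ALL=C sort -u "$1" > "$all_sorted"
  LC_ALL=C sort -u "$2" > "$sub_sorted"
  # -2 and -3 suppress "only in SUBSET" and "in both", leaving "only in ALL"
  LC_ALL=C comm -23 "$all_sorted" "$sub_sorted"
  rm -f "$all_sorted" "$sub_sorted"
}

# Hypothetical usage:
#   missing_traces all_dirs.names B_SPA.names | wc -l
```

Piping the result through wc -l should reproduce the "missing" count if the two lists are the ones counted above.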

The *.b1, *.g1 reads seem to be mated!

Mate pairs:

 p(.*)\.[bg]1
 oyg(.*)\.[bg]1
 P_AA(.*)\.[bg]1
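
Under this naming convention, two reads are mates when they share everything before the .b1 / .g1 suffix. A minimal sketch of that pairing rule, assuming a list of read names on stdin; the function name and the example read names are made up for illustration:

```shell
# pair_mates: read one read name per line on stdin and print
# "<template>\t<b1 read>\t<g1 read>" for every template that has both
# a .b1 and a .g1 read. Reads without a partner are silently skipped.
pair_mates() {
  awk '
    /\.[bg]1$/ {
      template = substr($0, 1, length($0) - 3)   # strip ".b1" / ".g1"
      if ($0 ~ /\.b1$/) b[template] = $0
      else              g[template] = $0
    }
    END {
      for (t in b) if (t in g) print t "\t" b[t] "\t" g[t]
    }
  ' | LC_ALL=C sort
}

# Example with made-up read names; oygAB07 has no .g1 partner and is dropped.
printf '%s\n' pXY01.b1 pXY01.g1 oygAB07.b1 P_AAC12.g1 P_AAC12.b1 | pair_mates
```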

WUSTL assemblies:

1. ace.83: (best assembly of reads)

 /fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/edit_dir/B_SPA.fasta.screen.ace.83 
 $ grep ^CO *ace.83 | grep -v COMM | wc -l
 571 # total number of contigs
 Longest contig: 
 $ cat B_SPA.fasta.screen.ace.83
 AS 571 89509                                # 571 contigs, 89509 reads
 ...
 CO Contig1368 4813926 88824 1869182 C       
 
 Contig1368 is 4,813,926 bp (GDE format), 4,579,713 bp (FASTA format)
 Ends don't overlap
 There are misoriented reads at the ends (=> circular)
 Contains 88824 reads
 Other Salmonella strains are ~ 4.8M
 Problem:
 * Collapsed repeat: high coverage, misoriented mates in the 2076881-2079555 region
 * Expanded into a 3-copy tandem repeat in the finished assembly
 * 3 copies also in CA
 * 3 copies also in CA
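
The contig count and the longest contig can both be read off the CO lines in one pass. A minimal sketch, assuming the six-field CO line layout shown above (CO, name, bases, reads, segments, orientation); the function name and the second contig in the example are made up:

```shell
# ace_contig_summary: read an ACE file on stdin, count the CO lines and
# report the contig with the largest base count. The base count is taken
# from the CO line as-is.
ace_contig_summary() {
  awk '
    $1 == "CO" && NF == 6 {
      n++
      if ($3 + 0 > max) { max = $3 + 0; name = $2 }
    }
    END { printf "contigs=%d longest=%s bases=%d\n", n, name, max }
  '
}

# Example on a tiny stand-in for the real file:
printf '%s\n' \
  'AS 571 89509' \
  'CO Contig1368 4813926 88824 1869182 C' \
  'CO Contig12 1500 4 3 U' | ace_contig_summary

# Usage on the real file would look like:
#   ace_contig_summary < B_SPA.fasta.screen.ace.83
```

Matching on the first field being exactly CO avoids the separate grep -v COMM step.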

2. Finished assembly: (assembly of contigs)

 File: finished.fasta.screen.ace.0
 1 contig 
 4,585,228 bp (FASTA format): 5,515 bp longer than ace.83 Contig1368; ends don't overlap
 11 long reads (contig reads)

Estimate lib insert sizes:

 $ toAmos -ace B_SPA.fasta.screen.ace.83
 $ grep -c ^rds B_SPA.afg         # check if links were created
 $ more toAmos.error              # check if there were any conversion errors
 $ bank-transact -b B_SPA.bnk -m B_SPA.afg -c
 $ bank2contig B_SPA.bnk > B_SPA.contig
 $ cat B_SPA.contig | grep ^# | grep -v ^## | sort 
 # look at distances between mated reads
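
Once read positions are available, the distance between mates can be computed mechanically. A minimal sketch, assuming the positions have first been extracted into a hypothetical tab-delimited table of read name, start and end (this is not the bank2contig output format itself); the function name is also made up:

```shell
# mate_distances: read "read_name<TAB>start<TAB>end" lines on stdin
# (hypothetical table; positions are coordinates in one contig) and print
# "<template>\t<distance>" for each .b1/.g1 pair, where distance is the
# span from the leftmost to the rightmost base covered by the two reads.
# It does not check that both mates sit in the same contig.
mate_distances() {
  awk -F'\t' '
    $1 ~ /\.[bg]1$/ {
      t = $1; sub(/\.[bg]1$/, "", t)
      a = $2 + 0; b = $3 + 0
      lo = (a < b) ? a : b          # reversed reads may have start > end
      hi = (a < b) ? b : a
      if (t in left) {
        if (left[t] < lo)  lo = left[t]
        if (right[t] > hi) hi = right[t]
        print t "\t" (hi - lo)
        delete left[t]; delete right[t]
      } else {
        left[t] = lo; right[t] = hi
      }
    }
  '
}

# Example with made-up coordinates:
printf 'pA.b1\t100\t700\npA.g1\t3100\t2500\n' | mate_distances
```

Sorting the second column per library prefix gives a rough range to use for the library sizes.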

Create mate pair file (Bambus format, tab delimited)

 $ cat B_SPA.mates
    library small   2000    4000    (p).*
    pair    (p.*)\.b1$      (p.*)\.g1$
    
    library medium  4500    5500    (oyg).*
    pair    (oyg.*)\.b1$    (oyg.*)\.g1$
    
    library large   35000   45000   (P_AA).*
    pair    (P_AA.*)\.b1$   (P_AA.*)\.g1$
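
Before rerunning the conversion, it can help to check how many reads each library prefix captures and whether any read names fall outside all three. A minimal sketch, assuming a list of read names on stdin and the three prefixes used in the mates file; the function name and example read names are made up:

```shell
# count_by_library: read one read name per line on stdin and print
# "<library>\t<count>" using the prefixes from the mates file
# (p => small, oyg => medium, P_AA => large). Names matching none of
# the prefixes are counted as "unassigned".
count_by_library() {
  awk '
    /^P_AA/ { n["large"]++;  next }
    /^oyg/  { n["medium"]++; next }
    /^p/    { n["small"]++;  next }
            { n["unassigned"]++ }
    END { for (lib in n) print lib "\t" n[lib] }
  ' | LC_ALL=C sort
}

# Example with made-up read names:
printf '%s\n' pXY01.b1 pXY01.g1 oygAB07.b1 P_AAC12.g1 zzz.b1 | count_by_library
```

A non-zero "unassigned" count would point to reads that end up without a library in the Bambus-style file.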

Rerun conversion utilities:

 $ toAmos -m B_SPA.mates -ace B_SPA.fasta.screen.ace.83 -o B_SPA.afg 
 $ bank-transact -b B_SPA.bnk -m B_SPA.afg -c

CBCB assemblies:

 1. CA default params
 /fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/edit_dir/83/CA-qual
 87 scaff, 194 contigs, 19K singletons, 4,425,716 bp
 2. CA genomeSize=3M
 /fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/edit_dir/83/CA-qual-3M
 75 scaff, 183 contigs, 19K singletons, 4,515,434 bp
 No rearrangements compared to finished genome
 Significant number of SNPs
 3. AMOSCmp Ref=finished assembly; 89,509 reads; .ace.83 trimming => 31 contigs; 4,579,852 bp
 4. AMOSCmp Ref=finished assembly; 101,621 reads (.fasta.screen); nucmer trimming => 8 contigs; 4,583,946 bp
 5. merge of 9 contigs using slice tools
    /fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/edit_dir/final2/9r12345678-circ-rev-tr-recall.* 
    Steps:
      * recruit unassembled reads to span the Contig8.4 - Contig6.6 gap and assemble them into a new contig.
      * The 9 overlapping contigs (8 provided by Damon + 1 I assembled) were merged using the slice tools (zipclap program) into one piece. 
      * The new contig was circularized, reversed and rotated to align to the published one. 
      * I also recalled the consensus due to some ambiguity codes introduced in the process.
      * The new contig sequence is 70 bp shorter (4,585,158 bp  vs 4,585,228), but it aligns in one piece to the published contig.
 6. merge of 9 contigs using slice tools (best)
    /fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/edit_dir/final3/9r12345678-circ-rev-tr.* 
    Steps:
      * Same as 5, but a modified version of "modContig --circularize" was called
      * The circularization step did not recall the consensus
      * Recall was not used in the end
      * The new contig sequence is 5 bp shorter (4,585,223  bp  vs 4,585,228), but it aligns in one piece to the published contig.
      * show-snps 1con-9r12345678-circ-rev-tr.delta | grep -c 9r12345678$ => 46 SNPs