Salmonella
Revision as of 16:11, 3 March 2008
Data
From Washington Univ in St. Louis
Strains
Salmonella enterica subsp. enterica serovar Paratyphi A str. ATCC 9150 : B_SPA : 4,585,229 bp chromosome; no plasmid
Salmonella typhimurium LT2 : B_STM : 4,857,432 bp chromosome; 93,939 bp plasmid pSLT
Goals
1. Validate the assemblies
2. Submit traces to NCBI TA
Problems:
* some traces were edited (phd.2, phd.3, ...); should these edits appear in the SCF files?
3. Convert assemblies to XML format and submit them to NCBI AA
Data location
/fs/ftp-cbcb/pub/data/dsommer/
/fs/sztmpscratch/dsommer/backup_sal
/fs/szasmg/Bacteria/Salmonella/
/fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/
/fs/szasmg/Bacteria/Salmonella/S_typhimurium_LT2
/fs/szdata//ncbi/genomes/Bacteria/Salmonella_enterica_Paratypi_ATCC_9150/
/fs/szdata//ncbi/genomes/Bacteria/Salmonella_typhimurium_LT2/
WUSTL
Data Download
#  name                                                 date         assembly  #contigs
1  Salmonella_enterica_serovar_Arizonae/                04-Apr-2006  yes       256
2  Salmonella_enterica_serovar_Diarizonae/              02-May-2006  yes       739
3  Salmonella_enterica_serovar_Paratyphi_A/             04-Apr-2006  no
4  Salmonella_enterica_serovar_Paratyphi_B/             27-Apr-2006  yes       187
5  Salmonella_enterica_serovar_Typhimurium_strain_LT2/  04-Apr-2006  no
Salmonella enterica serovar Paratyphi A str. ATCC 9150
NCBI
Genome project Taxonomy (TaxID: 295319)
Name         Length   %GC    Description
NC_006511.1  4585229  52.16  Salmonella enterica subsp. enterica serovar Paratyphi A str. ATCC 9150, complete genome
Traces:
All directories : 103,971 (unique)
B_SPA           : 102,405 (unique) => 1,566 missing
~10X coverage
Unmated reads:
Sequenced in 1999-06
*.s1
Mated reads:
The *.b1 and *.g1 reads seem to be mated!
Sequenced in 2002-05, 2002-07
p(.*).[bg]1
oyg(.*).[bg]1
P_AA(.*).[bg]1
WUSTL assemblies
1. ace.83: (best assembly of reads; 2003-05-12)
/fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/edit_dir/B_SPA.fasta.screen.ace.83
$ grep ^CO *ace.83 | grep -v COMM | wc -l
571      # total number of contigs
Longest contig:
$ cat B_SPA.fasta.screen.ace.83
AS 571 89509      # 571 contigs, 89509 reads
...
CO Contig1368 4813926 88824 1869182 C
Contig1368 is 4,813,926 bp (GDE format), 4,579,713 bp (FASTA format)
Ends don't overlap
There are misoriented reads at the ends (=> circular)
Contains 88,824 reads
Other Salmonella strains are ~4.8 Mbp
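The contig and read counts above are read straight from the ACE header lines. A minimal Python sketch of the same check, assuming the usual ACE layout in which the AS line carries the contig and read totals and each CO line carries name, padded length, read count, base-segment count and orientation. The function name summarize_ace is hypothetical and is not part of any assembler package:

```python
def summarize_ace(lines):
    """Return (declared_contigs, declared_reads, contigs) from ACE header lines.

    contigs is a list of (name, padded_length, n_reads) taken from CO lines.
    """
    declared_contigs = declared_reads = None
    contigs = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "AS" and len(fields) >= 3:
            # AS <number of contigs> <number of reads>
            declared_contigs, declared_reads = int(fields[1]), int(fields[2])
        elif fields[0] == "CO" and len(fields) >= 6:
            # CO <name> <padded bases> <reads> <base segments> <U|C>
            # Matching the whole first field skips lines that merely start with "CO".
            contigs.append((fields[1], int(fields[2]), int(fields[3])))
    return declared_contigs, declared_reads, contigs


# Header lines quoted in the notes above.
sample = [
    "AS 571 89509",
    "",
    "CO Contig1368 4813926 88824 1869182 C",
]
n_contigs, n_reads, contigs = summarize_ace(sample)
longest = max(contigs, key=lambda c: c[1])
print(n_contigs, n_reads, longest)
```

On a real file one would pass an open file handle instead of the sample list; comparing len(contigs) with the AS count is a cheap consistency check.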
Problem:
* Collapsed repeat: high coverage and misoriented mates in the 2076881-2079555 region
* Expanded into a 3-copy tandem repeat in the finished assembly
* 3 copies also in CA
2. Finished assembly: (assembly of contigs)
File: finished.fasta.screen.ace.0
1 contig
4,585,228 bp (FASTA format): 5,515 bp longer than ace.83 contig 571; ends don't overlap
11 long reads (contig reads)
Estimate lib insert sizes:
$ toAmos -ace B_SPA.fasta.screen.ace.83
$ grep -c ^rds B_SPA.afg                 # check if links were created
$ more toAmos.error                      # check if there were any conversion errors
$ bank-transact -b B_SPA.bnk -m B_SPA.afg -c
$ bank2contig B_SPA.bnk > B_SPA.contig
$ cat B_SPA.contig | grep ^# | grep -v ^## | sort      # look at distances between mated reads
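Once the distances between mated reads have been collected, a per-library mean and standard deviation give a first estimate of the insert size. The sketch below assumes the distances were already extracted into (library, distance) pairs; parsing the bank2contig output is not shown, and the function name, library labels and numbers are made up for illustration:

```python
from collections import defaultdict
from statistics import mean, stdev


def insert_size_stats(distances):
    """Group (library, distance) pairs and return {library: (n, mean, stdev)}."""
    by_library = defaultdict(list)
    for library, distance in distances:
        by_library[library].append(distance)
    stats = {}
    for library, values in by_library.items():
        # Sample standard deviation needs at least two observations.
        spread = stdev(values) if len(values) > 1 else 0.0
        stats[library] = (len(values), mean(values), spread)
    return stats


# Made-up distances, for illustration only.
example = [
    ("small", 2900), ("small", 3100), ("small", 3000),
    ("large", 39000), ("large", 41000),
]
stats = insert_size_stats(example)
for library, (n, avg, spread) in sorted(stats.items()):
    print(library, n, avg, round(spread, 1))
```

A min/max range for a mate-pair file could then be chosen around the mean, for example a few standard deviations either side; that choice is a judgment call and not something these notes specify.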
Create mate pair file (Bambus format, tab delimited)
$ cat B_SPA.mates
library small   2000    4000    (p).*
pair    (p.*)\.b1$      (p.*)\.g1$
library medium  4500    5500    (oyg).*
pair    (oyg.*).b1$     (oyg.*).g1$
library large   35000   45000   (P_AA).*
pair    (P_AA.*).b1$    (P_AA.*).g1$
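To see what these rules do to a set of read names, the file can be parsed and applied in a few lines of Python. This is only a sketch: it assumes each pattern is anchored at the start of the read name (re.match), which may differ from how Bambus or toAmos apply them, and the read names in the example are invented:

```python
import re

# Same rules as B_SPA.mates; split() accepts tabs or spaces between fields.
MATES_TEXT = r"""
library small 2000 4000 (p).*
pair (p.*)\.b1$ (p.*)\.g1$
library medium 4500 5500 (oyg).*
pair (oyg.*).b1$ (oyg.*).g1$
library large 35000 45000 (P_AA).*
pair (P_AA.*).b1$ (P_AA.*).g1$
"""


def parse_mates(text):
    """Return (libraries, pairs) parsed from 'library' and 'pair' lines."""
    libraries, pairs = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "library":
            name, low, high = fields[1], int(fields[2]), int(fields[3])
            libraries.append((name, low, high, re.compile(fields[4])))
        elif fields[0] == "pair":
            pairs.append((re.compile(fields[1]), re.compile(fields[2])))
    return libraries, pairs


def _clone_names(read_names, pattern):
    """Map the captured clone name to the read name for every matching read."""
    names = {}
    for read in read_names:
        m = pattern.match(read)
        if m:
            names[m.group(1)] = read
    return names


def find_mates(read_names, pairs):
    """Pair forward and reverse reads that share a captured clone name."""
    mated = []
    for forward_re, reverse_re in pairs:
        forward = _clone_names(read_names, forward_re)
        reverse = _clone_names(read_names, reverse_re)
        mated += [(forward[k], reverse[k]) for k in sorted(forward.keys() & reverse.keys())]
    return mated


def library_of(read_name, libraries):
    """Return (name, min, max) of the first library whose pattern matches, else None."""
    for name, low, high, pattern in libraries:
        if pattern.match(read_name):
            return name, low, high
    return None


libraries, pairs = parse_mates(MATES_TEXT)
reads = ["pab12.b1", "pab12.g1", "oyg07.b1", "oyg07.g1", "P_AA03.b1", "xyz.s1"]  # invented names
mates = find_mates(reads, pairs)
print(mates)
print(library_of("P_AA03.b1", libraries), library_of("xyz.s1", libraries))
```

Reads such as P_AA03.b1 that match a library but have no partner stay unmated, which is the case to look for when the link count after conversion is lower than expected.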
Rerun conversion utilities:
$ toAmos -m B_SPA.mates -ace B_SPA.fasta.screen.ace.83 -o B_SPA.afg
$ bank-transact -b B_SPA.bnk -m B_SPA.afg -c
CBCB assemblies
1. CA default params
/fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/edit_dir/83/CA-qual
87 scaff, 194 contigs, 19K singletons, 4,425,716 bp
2. CA genomeSize=3M
/fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/edit_dir/83/CA-qual-3M
75 scaff, 183 contigs, 19K singletons, 4,515,434 bp
No rearrangements compared to the finished genome
Significant number of SNPs
3. AMOSCmp
Ref=finished assembly; 89,509 reads; .ace.83 trimming => 31 contigs; 4,579,852 bp
4. AMOSCmp
Ref=finished assembly; 101,621 reads (.fasta.screen); nucmer trimming => 8 contigs; 4,583,946 bp
5. merge of 9 contigs using slice tools
/fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/edit_dir/final2/9r12345678-circ-rev-tr-recall.*
Steps:
* Recruit unassembled reads to span the Contig8.4 - Contig6.6 gap and assemble them into a new contig.
* The 9 overlapping contigs (8 provided by Damon + 1 I assembled) were merged into one piece using the slice tools (zipclap program).
* The new contig was circularized, reversed and rotated to align to the published one.
* I also recalled the consensus because of some ambiguity codes introduced in the process.
* The new contig sequence is 70 bp shorter (4,585,158 bp vs 4,585,228 bp), but it aligns in one piece to the published contig.
6. merge of 9 contigs using slice tools (best)
/fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/edit_dir/final3/9r12345678-circ-rev-tr.*
Steps:
* Same as 5, but a modified version of "modContig --circularize" was called
* The circularization step did not recall the consensus
* Recall was not used in the end
* The new contig sequence is 5 bp shorter (4,585,223 bp vs 4,585,228 bp), but it aligns in one piece to the published contig.
* show-snps 1con-9r12345678-circ-rev-tr.delta | grep -c 9r12345678$ => 46 SNPs
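The "circularized, reversed and rotated" step amounts to choosing the strand and the start coordinate of a circular sequence so that it lines up with the published one. The sketch below illustrates the idea on plain strings. It is not the modContig implementation, all function names are hypothetical, and it assumes an exact-match anchor taken from the start of the reference:

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a DNA string (A/C/G/T/N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def rotate(seq, start):
    """Rotate a circular sequence so that index `start` becomes position 0."""
    start %= len(seq)
    return seq[start:] + seq[:start]


def align_to_reference_start(contig, anchor):
    """Try both strands and rotate so the contig begins with `anchor`.

    Returns the re-oriented contig, or None if the anchor is not found.
    """
    for candidate in (contig, reverse_complement(contig)):
        doubled = candidate + candidate  # lets the anchor span the circular junction
        pos = doubled.find(anchor)
        if 0 <= pos < len(candidate):
            return rotate(candidate, pos)
    return None


# Toy example: the contig is the reverse complement of a rotated copy of the reference.
reference = "AAACCGTG"
contig = "TTTCACGG"
fixed = align_to_reference_start(contig, reference[:5])
print(fixed)
```

For a real genome the anchor would come from an alignment (for example nucmer coordinates) rather than an exact string match, since SNPs near the origin would make an exact search fail.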
Salmonella typhimurium LT2
NCBI
Genome project
Name         Length   %GC    Description
NC_003197.1  4857432  52.22  Salmonella typhimurium LT2, complete genome
NC_003277.1  93939    53.13  Salmonella typhimurium LT2 plasmid pSLT, complete sequence
Traces
From WUSTL:
total: 142,267
single reads (*.s1): 117,524
mate pairs: 7,236
Zero coverage regions:
reference                     start    end      length  coverage
gi|16763390|ref|NC_003197.1|  288829   288861   32      0
gi|16763390|ref|NC_003197.1|  691986   692019   33      0
gi|16763390|ref|NC_003197.1|  1582051  1582484  433     0
gi|16763390|ref|NC_003197.1|  2548470  2548631  161     0
gi|16763390|ref|NC_003197.1|  2703507  2703816  309     0
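In each row above the length equals end minus start, so the regions read as the gaps left between read alignments on the reference. A minimal sketch of how such gaps can be derived, assuming alignments are given as half-open (start, end) intervals. The function name is hypothetical and the intervals are toy values chosen so that the first two gaps come out as in the table:

```python
def zero_coverage_regions(intervals, ref_length):
    """Return (gap_start, gap_end, gap_length) for every stretch with no alignment.

    intervals: iterable of half-open (start, end) alignment coordinates.
    """
    gaps = []
    covered_to = 0  # everything left of this coordinate is already accounted for
    for start, end in sorted(intervals):
        if start > covered_to:
            gaps.append((covered_to, start, start - covered_to))
        covered_to = max(covered_to, end)
    if covered_to < ref_length:
        gaps.append((covered_to, ref_length, ref_length - covered_to))
    return gaps


# Toy alignment intervals, not real data.
toy_intervals = [
    (0, 288829),
    (288861, 500000),
    (450000, 691986),
    (692019, 1000000),
]
gaps = zero_coverage_regions(toy_intervals, ref_length=1000000)
for gap in gaps:
    print(*gap)
```

Sorting first and tracking the furthest covered coordinate handles overlapping and nested alignments in a single pass.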
Two sets of reads are not included in the WUSTL "final" assemblies:
12,912 in */Failures/ directories
28,852 in sub assemblies but not in the final ones
WUSTL assemblies
1. Phrap assembly (2001-12-23)
/fs/ftp-cbcb/pub/data/dsommer/B_STM/edit_dir/Phrap_Assembly_dir/B_STM.RAW.fasta.screen.ace
head -1 ...
AS 1226 113 663
# contigs: 1,226
max contig: 741,919 (larger than cap4 max)
singlets: 563,308
Contigs 1226 (696,553 bp) and 1218 (186,475 bp) look rearranged compared to the reference
CBCB assemblies
1. CA default params
/fs/szasmg/Bacteria/Salmonella/S_typhimurium_LT2/Assembly/2007_1215_CA_e1.50
TotalScaffolds=1957
TotalContigsInScaffolds=1965
MaxContigLength=13910
2. CA e=6%
/fs/szasmg/Bacteria/Salmonella/S_typhimurium_LT2/Assembly/2007_1215_CA_e6.00
TotalScaffolds=1128
TotalContigsInScaffolds=1146
MaxContigLength=26183
3. AMOSCmp: trimming used the OBT clear ranges (from 1)
-D MINOVL=3 -D MAXTRIM=50 -D MAJORITY=50
/fs/szasmg/Bacteria/Salmonella/S_typhimurium_LT2/Assembly/2007_1215_AMOSCmp-relaxed-OBT
128 contigs
Max 0 cvg area: 1581885-1582864, 979 bp (1con-contigs.delta)
4. AMOSCmp: trimming used the read nucmer clear ranges
-D MINOVL=3 -D MAXTRIM=50 -D MAJORITY=50
Ran for 9 days!
/fs/szasmg/Bacteria/Salmonella/S_typhimurium_LT2/Assembly/2007_1215_AMOSCmp-relaxed-nucmer
Max 0 cvg area: 4325279-4340861, 15,582 bp (1con-contigs.delta)
Contig length stats:
desc     #elem  min    max      mean       stdev      sum
contigs  8      46765  2148437  617384.12  710006.86  4939073
5. AMOSCmp: trimming taken from the Phrap (ace) assembly
-D MINOVL=3 -D MAXTRIM=50 -D MAJORITY=50
/fs/szasmg/Bacteria/Salmonella/S_typhimurium_LT2/Assembly/2007_1215_AMOSCmp-relaxed-Phrap
Contig length stats:
desc     #elem  min   max      mean       stdev      sum
contigs  20     3631  1397417  246942.84  308507.06  4938857
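The columns in these summaries are simple to recompute from a list of contig lengths; the mean is just sum divided by count (for the 8-contig run, 4939073 / 8 = 617384.125). A small sketch follows, with a hypothetical function name and toy lengths. It assumes the sample standard deviation, which may not be the convention of the tool that produced the tables above:

```python
from statistics import mean, stdev


def contig_length_stats(lengths):
    """Return the #elem/min/max/mean/stdev/sum summary for a list of contig lengths."""
    return {
        "elem": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": round(mean(lengths), 2),
        # Assumption: sample standard deviation (n - 1 in the denominator).
        "stdev": round(stdev(lengths), 2) if len(lengths) > 1 else 0.0,
        "sum": sum(lengths),
    }


# Toy contig lengths, not taken from any assembly.
summary = contig_length_stats([100, 200, 300, 400])
print(summary)

# Consistency check against the 8-contig table: mean = sum / #elem.
reported_mean = 4939073 / 8
print(reported_mean)
```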