Salmonella: Difference between revisions

From Cbcb

Revision as of 17:08, 24 October 2007


From Washington Univ in St. Louis


 Salmonella enterica subsp. enterica serovar Paratyphi A str. ATCC 9150: B_SPA
 Salmonella typhimurium LT2                                            : B_STM


 1. Validate the assemblies
 2. Submit traces to NCBI TA
 3. Convert assemblies to Assembly.XML format and submit them NCBI AA

File locations:




 All directories: 103971 (unique)
 B_SPA : 102405  (unique) => 1566 missing
 ~ 10X coverage
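The missing reads can be listed by comparing two lists of read names. This is a minimal sketch, not the procedure used for the counts above: the function name `missing_reads` and the two input files (one read name per line) are hypothetical.

```shell
# Print names that occur in the first list but not in the second.
# Both inputs are assumed to hold one read name per line.
missing_reads() {
    tmp_all=$(mktemp) || return 1
    tmp_sub=$(mktemp) || return 1
    # comm needs sorted input; use the same collation for sort and comm
    LC_ALL=C sort -u "$1" > "$tmp_all"
    LC_ALL=C sort -u "$2" > "$tmp_sub"
    # -2 drops lines unique to the second list, -3 drops shared lines
    LC_ALL=C comm -23 "$tmp_all" "$tmp_sub"
    rm -f "$tmp_all" "$tmp_sub"
}
```

For example, `missing_reads all_reads.txt B_SPA_reads.txt | wc -l` would count the names that are absent from the second list (both file names are placeholders).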

The *.b1 and *.g1 reads seem to be mated!

Mate pairs:


WUSTL assemblies:

1. ace.83: (best assembly of reads)

 $ grep ^CO *ace.83 | grep -v COMM | wc -l
 571 # total number of contigs
 Longest contig: 
 $ cat B_SPA.fasta.screen.ace.83
 AS 571 89509                                # 571 contigs, 89509 reads
 CO Contig1368 4813926 88824 1869182 C       
 Contig1368 is 4,813,926 bp (GDE format), 4,579,713 bp (FASTA format)
 Ends don't overlap
 There are misoriented reads at the ends (=>circular)
 Contains 88824 reads
 Other Salmonella strains are ~ 4.8M
 * Collapsed repeat: high coverage, misoriented mates in the 2076881-2079555 region
 * Expanded into a 3-copy tandem repeat in the finished assembly
 * 3 copies also in CA
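The longest contig can also be pulled straight from the CO lines instead of paging through the file. A minimal sketch, assuming the standard ACE layout `CO <name> <bases> <reads> <base segments> <U|C>`; the function name `ace_contigs_by_length` is hypothetical. The length on a CO line counts padded consensus positions, which is presumably why it is larger than the FASTA length.

```shell
# Print "name padded_length reads" for every contig in an ACE file,
# longest contig first.
ace_contigs_by_length() {
    # Matching the first field exactly skips COMMENT lines, so no
    # separate "grep -v COMM" step is needed.
    awk '$1 == "CO" && NF == 6 { print $2, $3, $4 }' "$1" | sort -k2,2nr
}
```

Piping the output through `head -n 1` gives the longest contig, and `wc -l` gives the contig count.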

2. Finished assembly: (assembly of contigs)

 File: finished.fasta.screen.ace.0
 1 contig 
 4,585,228 bp (FASTA format) : 5,515 bp longer than ace.83 Contig1368; ends don't overlap
 11 long reads (contig reads)
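The FASTA lengths quoted for the two assemblies can be rechecked by summing the sequence lines of each record. A minimal sketch; the function name `fasta_lengths` is hypothetical and the input is assumed to be a plain (possibly multi-record) FASTA file.

```shell
# Print "record_name length" for each record in a FASTA file.
fasta_lengths() {
    awk '
        /^>/ {
            if (name != "") print name, len    # flush the previous record
            name = substr($1, 2)               # header word without ">"
            len = 0
            next
        }
        {
            gsub(/[ \t\r]/, "")                # ignore whitespace and CRs
            len += length($0)
        }
        END { if (name != "") print name, len }
    ' "$1"
}
```

Running it on the finished assembly and on the ace.83 consensus should reproduce the 5,515 bp difference (4,585,228 - 4,579,713).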

Estimate lib insert sizes:

 $ toAmos -ace B_SPA.fasta.screen.ace.83
 $ grep -c ^rds B_SPA.afg         # check if links were created
 $ more toAmos.error              # check if there were any conversion errors
 $ bank-transact -b B_SPA.bnk -m B_SPA.afg -c
 $ bank2contig B_SPA.bnk > B_SPA.contig
 $ cat B_SPA.contig | grep ^# | grep -v ^## | sort 
 # look at distances between mated reads
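Once the read placements are available as a table, the distances between mated reads can be computed rather than read off by eye. This is only a sketch under stated assumptions: it does not parse the bank2contig output directly, but expects a hypothetical tab-delimited table (read name, contig, start, end) prepared from it, and it pairs reads by the .b1/.g1 suffixes. The function name `mate_distances` is hypothetical.

```shell
# Input (hypothetical layout): read_name <TAB> contig <TAB> start <TAB> end
# Output: template <TAB> contig <TAB> outer distance between the two mates
mate_distances() {
    awk -F '\t' '
        {
            tmpl = $1
            # Strip the mate suffix; skip reads that are not .b1/.g1
            if (sub(/\.(b1|g1)$/, "", tmpl) == 0) next
            s = $3 + 0; e = $4 + 0             # force numeric comparison
            lo = (s < e) ? s : e
            hi = (s < e) ? e : s
            key = tmpl SUBSEP $2               # mates must share a contig
            if (key in minpos) {
                a = (lo < minpos[key]) ? lo : minpos[key]
                b = (hi > maxpos[key]) ? hi : maxpos[key]
                printf "%s\t%s\t%d\n", tmpl, $2, b - a
            } else {
                minpos[key] = lo
                maxpos[key] = hi
            }
        }
    ' "$1"
}
```

Sorting the third column per read-name prefix gives a rough view of each library's insert size range; mates placed in different contigs are ignored.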

Create mate pair file (Bambus format, tab-delimited):

 $ cat B_SPA.mates
    library small   2000    4000    (p).*
    pair    (p.*)\.b1$      (p.*)\.g1$
    library medium  4500    5500    (oyg).*
    pair    (oyg.*)\.b1$    (oyg.*)\.g1$
    library large   35000   45000   (P_AA).*
    pair    (P_AA.*)\.b1$   (P_AA.*)\.g1$
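Before rerunning the conversion it can help to check that the name patterns assign reads to the intended library. A minimal sketch using `grep -E`; the function name `read_library` and the example read names are hypothetical, and anchoring the patterns at the start of the name is an assumption rather than documented Bambus behaviour.

```shell
# Print the library (small, medium, large) whose prefix matches a read
# name, or "unknown" if no pattern applies. Matching is case-sensitive,
# so the P_AA prefix never collides with the lowercase p prefix.
read_library() {
    if   printf '%s\n' "$1" | grep -Eq '^P_AA.*\.(b1|g1)$'; then echo large
    elif printf '%s\n' "$1" | grep -Eq '^oyg.*\.(b1|g1)$';  then echo medium
    elif printf '%s\n' "$1" | grep -Eq '^p.*\.(b1|g1)$';    then echo small
    else echo unknown
    fi
}
```

For instance, `read_library P_AA07.b1` prints `large` and `read_library oyg45.g1` prints `medium`.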

Rerun conversion utilities:

 $ toAmos -m B_SPA.mates -ace B_SPA.fasta.screen.ace.83 -o B_SPA.afg 
 $ bank-transact -b B_SPA.bnk -m B_SPA.afg -c

CBCB assemblies:

1. CA default params

 /fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/edit_dir/83/CA-qual
 87 scaff, 194 contigs, 19K singletons, 4,425,716 bp

2. CA genomeSize=3M

 /fs/szasmg/Bacteria/Salmonella/S_enterica_paratyphi_A/edit_dir/83/CA-qual-3M
 75 scaff, 183 contigs, 19K singletons, 4,515,434 bp
 No rearrangements compared to finished genome
 Significant number of SNPs

3. AMOSCmp

 Ref=finished assembly; no read trimming; max dirty end seq=20 bp
 => 195 contigs

4. AMOSCmp

 Ref=finished assembly; no read trimming; max dirty end seq=50 bp
 => 122 contigs