NCBI submission

Revision as of 14:24, 23 May 2011 by Dpuiu (talk | contribs) (→‎dbGSS)

WGS/TPA

Links

Registration

  * search Genome Project for Center for Bioinformatics and Computational Biology[Sequencing Center]
  * Xanthomonas oryzae pv. oryzae PXO99A complete; /fs/szdata/ncbi/ftp.ncbi.nih.gov/genomes/Bacteria/Xanthomonas_oryzae_PXO99A
  * Xanthomonas oryzae pv. oryzicola BLS256 assembly

Output:

  • genome project id (5 digit); use it in e-mail correspondence
  • locus_tags (3+ letter/digit)

Requirements

  • ctg's: no gaps; .sqn format
  • annotation: either for ctg's or superctg's; .sqn format
  • superctg's: AGP format

Output:

  • 4-letter WGS project_ID : XXXX
  • project accession number : XXXX00000000 (4-letter ID followed by 8 0's)
  • 1st version: XXXX01000000
  • 1st version ctg's: XXXX01000001
  • CON record for superctg's
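The accession pattern listed above (4-letter ID, 2-digit version, 6-digit contig number) can be sketched as follows. This is only an illustration: the function name and the project ID `ABCD` are hypothetical, and NCBI assigns the real ID.

```python
import re


def wgs_accessions(project_id, version=1, n_contigs=3):
    """Build the accession strings implied by a 4-letter WGS project ID.

    Returns (project accession, versioned accession, list of contig accessions).
    """
    if not re.fullmatch(r"[A-Z]{4}", project_id):
        raise ValueError("expected a 4-letter upper-case WGS project ID")
    if not 1 <= version <= 99:
        raise ValueError("version must fit in two digits")
    project = f"{project_id}{0:08d}"                 # XXXX00000000
    versioned = f"{project_id}{version:02d}{0:06d}"  # XXXX01000000
    contigs = [
        f"{project_id}{version:02d}{i:06d}"          # XXXX01000001, ...
        for i in range(1, n_contigs + 1)
    ]
    return project, versioned, contigs


if __name__ == "__main__":
    # "ABCD" is a made-up ID used only for illustration.
    print(wgs_accessions("ABCD", version=1, n_contigs=2))
```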

Formatting

Metadata

  • multiple sequences
 /nfshomes/dpuiu/szdevel/sequin.8.10/sequin
 /nfshomes/dpuiu/szdevel/bin/sequin              # latest version
 !!! import /nfshomes/dpuiu/bin/seqin.sqn
 Form
  Submission:
   Immediately ...
   Tentative manuscript title: 
  Contact:
   Name:  Daniela Puiu
   Phone: 301.405.3403
   Fax:   301.314.1341 
   Email: dpuiu@umiacs.umd.edu
  Authors:
   Daniela Puiu
   Steven L. Salzberg
   ...
  Affiliation
   Institution: University of Maryland,  Center for Bioinformatics and Computational Biology , 3115 Biomolecular Sciences Building #296, College Park, MD 20742 , US
  Sequence format
   Batch submission
   FASTA
   Original submission
  ...
 Organism and Sequences:
   Nucleotide: can import from FASTA file
   Organism: strain, moltype
   Proteins
   Annotation

Export template => /nfshomes/dpuiu/bin/sequin.sbt : contains submission info

Tags

 $ addFastaTags.pl -s " [organism=...] [strain=wPip] [substrain=JBH] [tech=wgs]" prefix.fasta
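For readers without access to addFastaTags.pl, the tagging step can be approximated as below. This is a minimal sketch, not the script's actual implementation; the function name `add_fasta_tags` is hypothetical, and it assumes the tags are simply appended to every defline.

```python
def add_fasta_tags(lines, tags):
    """Append source modifiers such as '[strain=wPip]' to every FASTA defline.

    lines: iterable of FASTA lines; tags: dict of modifier name -> value.
    Sequence lines are passed through unchanged (trailing newline removed).
    """
    suffix = " " + " ".join(f"[{key}={value}]" for key, value in tags.items())
    for line in lines:
        line = line.rstrip("\n")
        yield line + suffix if line.startswith(">") else line


if __name__ == "__main__":
    fasta = [">ctg1\n", "ACGT\n"]
    for out in add_fasta_tags(fasta, {"strain": "wPip", "tech": "wgs"}):
        print(out)
```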

Annotation

  • Generating the .tbl format from a tab-delimited format
 $ ~dpuiu/bin/tab2annotation.pl -h
 
     # Example:
     tab2annotation.pl -ht "SeqId Location Strand Length Product" prefix.ptt > prefix.tbl
     tab2annotation.pl -hl 1 -SeqId NC_012456 prefix.ptt > prefix.tbl
 
     # INPUT
     SeqId   SeqIdLength     OrfId   Start   End     Length  Product
     1225    2425            002     422     706     285     malate synthase G
 
     # OUTPUT
     >Feature 1225
     422     706     gene
                             locus_tag       C1A_1225_002
     422     706     CDS
                             product malate synthase G
                             protein_id      gnl|cbcb|C1A_1225_002
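The INPUT to OUTPUT mapping shown above can be sketched in a few lines. This is an assumption-laden illustration, not the code of tab2annotation.pl: `tab_to_tbl`, `locus_prefix` and `db` are hypothetical names, the input is assumed to carry the header row from the example, and Start/End are written as given (a minus-strand feature would need them swapped in the table).

```python
import csv
import io


def tab_to_tbl(tab_text, locus_prefix="C1A", db="cbcb"):
    """Convert tab-delimited ORF rows into 5-column feature table text."""
    reader = csv.DictReader(io.StringIO(tab_text), delimiter="\t")
    out = []
    current_seq = None
    for row in reader:
        seq_id = row["SeqId"]
        if seq_id != current_seq:
            # One ">Feature" header per sequence; rows are assumed grouped by SeqId.
            out.append(f">Feature {seq_id}")
            current_seq = seq_id
        locus = f"{locus_prefix}_{seq_id}_{row['OrfId']}"
        start, end = row["Start"], row["End"]
        out.append(f"{start}\t{end}\tgene")
        out.append(f"\t\t\tlocus_tag\t{locus}")
        out.append(f"{start}\t{end}\tCDS")
        out.append(f"\t\t\tproduct\t{row['Product']}")
        out.append(f"\t\t\tprotein_id\tgnl|{db}|{locus}")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    example = (
        "SeqId\tSeqIdLength\tOrfId\tStart\tEnd\tLength\tProduct\n"
        "1225\t2425\t002\t422\t706\t285\tmalate synthase G\n"
    )
    print(tab_to_tbl(example), end="")
```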

Merge

 Input files: 
   prefix.sbt: submission file
   prefix.fsa: sequence : at most 10,000 sequences/file
   prefix.tbl: annotation
 
 $ tbl2asn -t prefix.sbt -V v -s -p . 
 $ tbl2asn -t prefix.sbt -V v -s -i prefix.fasta
 $ tbl2asn -t template.sbt -i prefix.fasta -V v -s
 * template (*.sbt)        Example: /nfshomes/dpuiu/bin/sequin.sbt
  comment: holds the article name
 * FASTA sequence (*.fsa) 
    >SeqID [organism=...] [strain=...] [tech=wgs] [chromosome=...][gcode=11]
 
 Adding tags:
  ~dpuiu/bin/addFastaTags.pl -s " [organism=...] [strain=wPip] [substrain=JBH] [tech=wgs]" wPipJBH.fasta
 * annotation table (*.tbl) (optional)  
    5-column table 
       locus_tag for genes 
       protein_id for proteins
       product for proteins 
  • Output files:
 * ASN.1 (*.sqn) for submission to GenBank.
 * .val: validation file; check it for errors
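Because each .fsa file may hold at most 10,000 sequences, a large assembly has to be split before running tbl2asn. One possible way to do that is sketched below; `split_fasta` is a hypothetical helper, and writing each chunk to its own prefix.N.fsa file is left to the caller.

```python
def split_fasta(lines, max_records=10_000):
    """Yield lists of FASTA lines, each holding at most max_records records."""
    chunk, count = [], 0
    for line in lines:
        if line.startswith(">"):
            if count == max_records:
                # Current chunk is full: emit it and start a new one.
                yield chunk
                chunk, count = [], 0
            count += 1
        chunk.append(line)
    if chunk:
        yield chunk


if __name__ == "__main__":
    demo = [">a", "AC", ">b", "GT", ">c", "TT"]
    for i, part in enumerate(split_fasta(demo, max_records=2), start=1):
        print(i, part)
```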

AGP

 $ scaff2agp.pl < prefix.scaff > prefix.agp
 $ infoseq2agp.pl prefix.infoseq > prefix.agp
 $ validateAgp.pl prefix.agp
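To show what the .agp output looks like, here is a rough sketch that lays contigs out along one superctg with fixed-size gaps. It does not reproduce scaff2agp.pl; the function name, the column layout (taken from the AGP 2.x specification), and the gap values (`scaffold`, `yes`, `paired-ends`, fixed length) are all assumptions to adjust for a real assembly.

```python
def scaffold_to_agp(scaffold_id, contigs, gap_len=100):
    """Return AGP lines for one scaffold.

    contigs: list of (contig_id, length, orientation) in scaffold order.
    A gap line (component type N) is placed between consecutive contigs.
    """
    rows, pos, part = [], 1, 1
    for i, (ctg_id, length, orient) in enumerate(contigs):
        if i > 0:
            # Gap line: object, beg, end, part, N, gap length, gap type,
            # linkage, linkage evidence (placeholder values).
            rows.append((scaffold_id, pos, pos + gap_len - 1, part,
                         "N", gap_len, "scaffold", "yes", "paired-ends"))
            pos += gap_len
            part += 1
        # Component line: the whole contig (1..length) in the given orientation.
        rows.append((scaffold_id, pos, pos + length - 1, part,
                     "W", ctg_id, 1, length, orient))
        pos += length
        part += 1
    return ["\t".join(str(field) for field in row) for row in rows]


if __name__ == "__main__":
    for line in scaffold_to_agp("scf1", [("ctg1", 100, "+"), ("ctg2", 50, "-")], gap_len=10):
        print(line)
```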

Submission

BankIt

  • BankIt
  • one or a few sequence submissions

Email

  • e-mail the output file to gb_sub@ncbi.nlm.nih.gov (deprecated)

GenomeMacroSend

Ftp

  • for large WGS projects
server: ftp-trace.ncbi.nlm.nih.gov
login:         cbcb_trc
password:      t@@GeaYF
center:        CBCB
directory:     test/ (do not use uploads/, which is used by SRA)

Updates

  • Updating
  • in .sbt file replace "subtype new" with "subtype update"

TA

Compressed archive containing 
  3 files: TRACEINFO.xml, MD5, README
  traces/ directory
  SCF format traces under traces/ or traces/*/
 
Each archive is a gzip file of 1-4 GB; include the center's name and the date in the file name.
Archives are accepted only via upload to the NCBI FTP server.
  server: ftp-trace.ncbi.nih.gov
  login: 
  passwd: 
  center: UMD

Scripts:

 /nfshomes/dpuiu/Archives/JCVI/bin/phred2xmlTrace.pl

Genbank & SRA

 server: ftp-trace.ncbi.nlm.nih.gov
 login:         cbcb_trc
 password:      t@@GeaYF
 
 Center_name (acronym): CBCB 
 Full name: Center for Bioinformatics and Computational Biology, University of Maryland
 
 Directory (Short reads):  short_read/ 
 Directory (Sanger reads): uploads/
 Directory (test):         test/ (Assembled sequences)
 
 Validation table

AA

 Compressed archive containing 2 files: ASSEMBLY.xml , MD5 
 Accepted only by uploading to NCBI FTP server.
   server: ftp-private.ncbi.nlm.nih.gov
   login: cbcb_trc
   passwd: t@@GeaYF
   center: UMD   
   description: University of Maryland
 ASSEMBLY XML Schema png 
 ASSEMBLY XML Schema xsd 

Use XContig package scripts

Files:

.contig      : contigs & underlying reads (use TRACE_NAME's or SEQ_NAME's) 
.seq         : read sequences (use TRACE_NAME's or SEQ_NAME's) 
.qual        : read qualities (use TRACE_NAME's or SEQ_NAME's) 
.ti2seq_name : (TI , TRACE_NAME or SEQ_NAME) : required if the contig file does not use the read ti's
 $ bank2contig -e       prefix.bnk > prefix.contig
 $ dumpreads   -e -r    prefix.bnk > prefix.seq
 $ dumpreads   -e -r -q prefix.bnk > prefix.qual

Example:

Xoo: /fs/szasmg/Bacteria/Xanthomonas/XOO/Xoo_PXO99A/FinalAsm_June2007/AA

Steps:

 1. makeConinfo ASSEMBLY.coninfo
 $ more ASSEMBLY.coninfo
 <coninfo>
 <meta name='center'>UMD</meta>
 <meta name='db'>Xoo</meta>
 <meta name='desc'>Xanthomonas oryzae pv. oryzae strain PXO99A</meta>
 <meta name='object'>ASSEMBLY</meta>
 <meta name='species_code'>Xanthomonas oryzae pv. oryzae strain PXO99A</meta>
 <meta name='structure'>Chromosome</meta>
 <meta name='subtype'>NEW</meta>
 <meta name='taxid'>360094</meta>
 <contig id="1106158952778_stitched" conformation="CIRCULAR" subtype="NEW"/>
 <contig id="... "/>
 <file src="Xoo.contig"/>
 <seq src="Xoo.seq"/>
 <qual src="Xoo.qual"/>
 <idmap  src="Xoo.ti2seq_name" direction="FORWARD"/>
 </coninfo>
 2. buildAssemblyArchive ASSEMBLY.coninfo --prompt --subname umd-20070816-125223
 problems:
    * submitter_reference="tigr...." : replace tigr with umd
    * conformation: always LINEAR    : replace LINEAR with CIRCULAR ???
 3. validate:
 oXygen: software used by NCBI; license required
 xmllint: open source
 $ xmllint --schema ~/bin/TraceAssembly.xsd umd-*/ASSEMBLY.xml > /dev/null
 umd-20070816-125223/ASSEMBLY.xml validates
 4. edit files
 $ rm *.tar.gz
 $ md5sum umd-*/ASSEMBLY.xml
 $ edit umd-*/MANIFEST         # update ASSEMBLY.xml md5sum 
 
 $ ls -1 umd-*
 umd-20070816-125223/
  1106158952778_stitched_20070817-141849.con       # Contig consensus
  1106158952778_stitched_20070817-141849.congap    # Contig gaps
  ASSEMBLY.xml                                     # Assembly XML
  MANIFEST                                         # MD5 sums
 5. create tarball
 $ tar czvf umd-20070816-125223.tar.gz umd-20070816-125223/
 6. upload tarball
   !!! contact trace@ncbi.nlm.nih.gov if login/password error
 
   $ ftp ftp-private.ncbi.nlm.nih.gov
   login: cbcb_trc
   passwd: t@@GeaYF
   $ cd assembly
   $ put *.tar.gz
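The checksum and tarball steps above can also be scripted. The sketch below is an assumption-based equivalent of `md5sum` and `tar czvf`; `md5_of` and `make_tarball` are hypothetical names, and the MANIFEST still has to be edited by hand because its layout is not described here.

```python
import hashlib
import tarfile
from pathlib import Path


def md5_of(path):
    """Return the hex MD5 digest of a file, read in 1 MiB blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def make_tarball(sub_dir):
    """Create <sub_dir>.tar.gz next to the submission directory and return its path."""
    sub_dir = Path(sub_dir)
    tar_path = sub_dir.parent / f"{sub_dir.name}.tar.gz"
    with tarfile.open(tar_path, "w:gz") as tar:
        # Store paths relative to the directory name, as `tar czvf dir.tar.gz dir/` does.
        tar.add(sub_dir, arcname=sub_dir.name)
    return tar_path


if __name__ == "__main__":
    for sub in sorted(Path(".").glob("umd-*")):
        if sub.is_dir() and (sub / "ASSEMBLY.xml").exists():
            # Print an md5sum-style line to copy into MANIFEST before packing.
            print(md5_of(sub / "ASSEMBLY.xml"), sub / "ASSEMBLY.xml")
            print("wrote", make_tarball(sub))
```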

dbGSS

  • 4 files: email to batch-sub@ncbi.nlm.nih.gov

1. Publication

 TYPE: Pub                         #required
 MEDUID: 92347897
 TITLE:                            #required
 Genomic sequences from a subtracted retinal pigment epithelium   library
 AUTHORS:                          #required
 Gieser,L.; Swaroop,A.
 JOURNAL: Genomics
 VOLUME: 13
 ISSUE: 2
 PAGES: 873-6
 YEAR:  1992                       #required
 STATUS: 4                         #required :1=unpublished, 2=submitted, 3=in press, 4=published
 ||

2. Library

 TYPE: Lib                         #required
 NAME:  Rat Lambda Zap Express Library
 ORGANISM: Rattus norvegicus
 STRAIN: Sprague-Dawley
 SEX: male
 STAGE: embryonic day 17 post-fertilization
 TISSUE: aorta
 CELL_TYPE: vascular smooth muscle
 DESCR: 
 Put description here.
 ||

3. Contact

 TYPE: Cont
 NAME: Sikela JM
 FAX: 303 270 7097
 TEL: 303 270 
 EMAIL: tjs@tally.hsc.colorado.edu
 LAB: Department of Pharmacology
 INST: University of Colorado Health Sciences Center
 ADDR: Box C236, 4200 E. 9th Ave., Denver, CO 80262-0236, USA
 ||

4. GSS sequence file

 TYPE: GSS                             #required
 STATUS:  New                          #required
 CONT_NAME: Sikela JM                  #required
 GSS#: Ayh00001                        #required
 CLONE: HHC189
 SOURCE: ATCC
 SOURCE_INHOST: 65128
 OTHER_GSS:  GSS00093, GSS000101
 CITATION:                             #required
 Genomic sequences from Human 
 brain tissue
 SEQ_PRIMER: M13 Forward
 P_END: 5'
 HIQUAL_START: 1
 HIQUAL_STOP: 285
 DNA_TYPE: Genomic
 CLASS: shotgun                                           #required
 LIBRARY: Hippocampus, Stratagene (cat. #936205)          #required
 PUBLIC:                                                  #required
 PUT_ID: Actin, gamma, skeletal
 COMMENT:
 SEQUENCE:
 AATCAGCCTGCAAGCAAAAGATAGGAATATTCACCTACAGTGGGCACCTCCTTAAGAAGCTG           #required
 ...
 ||
  • Matching strings (each pair must be identical):
 CONT_NAME field of the GSS file and NAME field of the Contact file
 LIBRARY field of the GSS file and NAME field of the Library file
 CITATION field of the GSS file and TITLE field of the Publication file
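A quick pre-submission check of the matching strings might look like the sketch below. It is not an NCBI tool: `parse_records` and `check_links` are hypothetical helpers, the parser assumes `FIELD: value` lines with `||` closing each record, and continuation lines are joined with a single space.

```python
import re

FIELD = re.compile(r"^([A-Z][A-Z_#]*):\s*(.*)$")


def parse_records(text):
    """Split dbGSS flat-file text into a list of field dicts (one per record)."""
    records, current, key = [], {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "||":                      # end of record
            if current:
                records.append(current)
            current, key = {}, None
            continue
        match = FIELD.match(line)
        if match:                             # start of a new field
            key = match.group(1)
            current[key] = match.group(2).strip()
        elif key and line:                    # continuation of the previous field
            current[key] = (current[key] + " " + line).strip()
    if current:
        records.append(current)
    return records


def check_links(gss, contact, library, publication):
    """Return a list of mismatches between a GSS record and the other three files."""
    problems = []
    pairs = (
        ("CONT_NAME", contact, "Contact", "NAME"),
        ("LIBRARY", library, "Library", "NAME"),
        ("CITATION", publication, "Publication", "TITLE"),
    )
    for gss_field, other, label, other_field in pairs:
        if gss.get(gss_field) != other.get(other_field):
            problems.append(f"GSS {gss_field} does not match {label} {other_field}")
    return problems


if __name__ == "__main__":
    pub = parse_records("TYPE: Pub\nTITLE:\nGenomic sequences from Human\nbrain tissue\n||\n")[0]
    gss = parse_records(
        "TYPE: GSS\nCONT_NAME: Sikela JM\nLIBRARY: Lib A\n"
        "CITATION:\nGenomic sequences from Human\nbrain tissue\n||\n"
    )[0]
    print(check_links(gss, {"NAME": "Sikela JM"}, {"NAME": "Lib B"}, pub))
```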