NCBI submission: Difference between revisions

Revision as of 14:52, 3 May 2011

WGS/TPA

Links

Registration

  * search Genome Project for Center for Bioinformatics and Computational Biology[Sequencing Center]
  * Xanthomonas oryzae pv. oryzae PXO99A complete; /fs/szdata/ncbi/ftp.ncbi.nih.gov/genomes/Bacteria/Xanthomonas_oryzae_PXO99A
  * Xanthomonas oryzae pv. oryzicola BLS256 assembly

Output:

  • genome project id (5 digit); use it in e-mail correspondence
  • locus_tags (3+ letter/digit)

Requirements

  • ctg's: no gaps; .sqn format
  • annotation: either for ctg's or superctg's; .sqn format
  • superctg's: AGP format

Output:

  • 4-letter WGS project_ID : XXXX
  • project accession number : XXXX00000000 (4-letter ID followed by eight zeros)
  • 1st version: XXXX01000000
  • 1st version ctg's: XXXX01000001
  • CON record for superctg's
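
The numbering scheme above can be sketched in Python. This is only an illustration: the function name wgs_accessions and its parameters are hypothetical, and it assumes a 2-digit version followed by a 6-digit contig counter, as the examples in the list suggest.

```python
def wgs_accessions(project_id, version=1, n_contigs=0):
    """Build the accession strings implied by a 4-letter WGS project ID.

    Assumed layout: 4 letters + 2-digit version + 6-digit contig number.
    """
    if len(project_id) != 4 or not project_id.isalpha():
        raise ValueError("WGS project ID must be exactly 4 letters")
    prefix = project_id.upper()
    return {
        # project accession: ID followed by eight zeros
        "project": f"{prefix}{0:08d}",
        # versioned master record: contig counter is all zeros
        "version": f"{prefix}{version:02d}{0:06d}",
        # one accession per contig, numbered from 1
        "contigs": [f"{prefix}{version:02d}{i:06d}" for i in range(1, n_contigs + 1)],
    }


acc = wgs_accessions("XXXX", version=1, n_contigs=2)
print(acc["project"])   # XXXX00000000
print(acc["version"])   # XXXX01000000
print(acc["contigs"])   # ['XXXX01000001', 'XXXX01000002']
```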

Formatting

Metadata

  • multiple sequences
 /nfshomes/dpuiu/szdevel/sequin.8.10/sequin
 /nfshomes/dpuiu/szdevel/bin/sequin              # latest version
 !!! import /nfshomes/dpuiu/bin/seqin.sqn
 Form
  Submission:
   Immediately ...
   Tentative manuscript title: 
  Contact:
   Name:  Daniela Puiu
   Phone: 301.405.3403
   Fax:   301.314.1341 
   Email: dpuiu@umiacs.umd.edu
  Authors:
   Daniela Puiu
   Steven L. Salzberg
   ...
  Affiliation
   Institution: University of Maryland, Center for Bioinformatics and Computational Biology, 3115 Biomolecular Sciences Building #296, College Park, MD 20742, US
  Sequence format
   Batch submission
   FASTA
   Original submission
  ...
 Organism and Sequences:
   Nucleotide: can import from FASTA file
   Organism: strain, moltype
   Proteins
   Annotation

Export template => /nfshomes/dpuiu/bin/sequin.sbt : contains submission info

Tags

 $ addFastaTags.pl -s " [organism=...] [strain=wPip] [substrain=JBH] [tech=wgs]" prefix.fasta
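
For readers without access to addFastaTags.pl, the tagging step can be approximated as below. This is a minimal sketch, not the script's actual logic: it assumes the tag string is simply appended to every FASTA header line, and the function name add_fasta_tags is hypothetical.

```python
def add_fasta_tags(lines, tags):
    """Append a source-modifier tag string to every FASTA header line.

    Sequence lines pass through unchanged. Assumption: tags are appended
    after the existing header text, separated by one space.
    """
    tags = tags.strip()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield f"{line} {tags}"
        else:
            yield line


# Usage sketch (the organism tag is elided in the notes above, so it is left out here):
# with open("prefix.fasta") as src, open("prefix.fsa", "w") as dst:
#     for out in add_fasta_tags(src, "[strain=wPip] [substrain=JBH] [tech=wgs]"):
#         dst.write(out + "\n")
```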

Annotation

  • Generating the .tbl format from a TAB-delimited format
 $ ~dpuiu/bin/tab2annotation.pl -h
 
     # Example:
     tab2annotation.pl -ht "SeqId Location Strand Length Product" prefix.ptt > prefix.tbl
     tab2annotation.pl -hl 1 -SeqId NC_012456 prefix.ptt > prefix.tbl
 
     # INPUT
     SeqId   SeqIdLength     OrfId   Start   End     Length  Product
     1225    2425            002     422     706     285     malate synthase G
 
     # OUTPUT
     >Feature 1225
     422     706     gene
                             locus_tag       C1A_1225_002
     422     706     CDS
                             product malate synthase G
                             protein_id      gnl|cbcb|C1A_1225_002
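
The INPUT to OUTPUT mapping shown above can be sketched as follows. This is not the code of tab2annotation.pl; it assumes the seven tab-separated columns of the example, builds the locus_tag as prefix_SeqId_OrfId, and uses hypothetical parameter names (locus_prefix, db). Strand handling is omitted because the example has no strand column.

```python
def tab_to_tbl(rows, locus_prefix="C1A", db="cbcb"):
    """Convert tab-separated ORF rows into feature-table (.tbl) lines.

    Expected columns: SeqId SeqIdLength OrfId Start End Length Product.
    """
    out = []
    current_seq = None
    for row in rows:
        row = row.rstrip("\n")
        # skip blanks, comments and the column header line
        if not row or row.startswith("#") or row.startswith("SeqId"):
            continue
        seq_id, _seq_len, orf_id, start, end, _length, product = row.split("\t", 6)
        if seq_id != current_seq:
            out.append(f">Feature {seq_id}")
            current_seq = seq_id
        locus_tag = f"{locus_prefix}_{seq_id}_{orf_id}"
        out.append(f"{start}\t{end}\tgene")
        out.append(f"\t\t\tlocus_tag\t{locus_tag}")
        out.append(f"{start}\t{end}\tCDS")
        out.append(f"\t\t\tproduct\t{product}")
        out.append(f"\t\t\tprotein_id\tgnl|{db}|{locus_tag}")
    return out


row = "1225\t2425\t002\t422\t706\t285\tmalate synthase G"
print("\n".join(tab_to_tbl([row])))
```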

Merge

 Input files: 
   prefix.sbt: submission file
   prefix.fsa: sequence : at most 10,000 sequences/file
   prefix.tbl: annotation
 
 $ tbl2asn -t prefix.sbt -V v -s -p . 
 $ tbl2asn -t prefix.sbt -V v -s -i prefix.fasta
 $ tbl2asn -t template.sbt -i prefix.fasta -V v -s
 * template (*.sbt)        Example: /nfshomes/dpuiu/bin/sequin.sbt
  comment : is the article name
 * FASTA sequence (*.fsa) 
    >SeqID [organism=...] [strain=...] [tech=wgs] [chromosome=...][gcode=11]
 
 Adding tags:
  ~dpuiu/bin/addFastaTags.pl -s " [organism=...] [strain=wPip] [substrain=JBH] [tech=wgs]" wPipJBH.fasta
 * annotation table (*.tbl) (optional)  
    5-column table 
       locus_tag for genes 
       protein_id for proteins
       product for proteins 
  • Output files:
 * ASN.1 (*.sqn) for submission to GenBank.
 * .val: validation file; check it for errors
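
Checking the .val file by eye gets tedious for large batches. The sketch below counts messages per severity. It assumes each validator line starts with an upper-case severity word followed by a colon (for example ERROR: or WARNING:); verify this against a real .val file before relying on it. The function name summarize_val is hypothetical.

```python
from collections import Counter


def summarize_val(lines):
    """Count validator messages by severity.

    Assumption: each message line looks like "SEVERITY: rest of message".
    Lines that do not match are ignored.
    """
    counts = Counter()
    for line in lines:
        severity, sep, _rest = line.partition(":")
        severity = severity.strip()
        if sep and severity.isalpha() and severity.isupper():
            counts[severity] += 1
    return counts


# Usage sketch:
# with open("prefix.val") as fh:
#     for severity, n in summarize_val(fh).most_common():
#         print(severity, n)
```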

AGP

 $ scaff2agp.pl < prefix.scaff > prefix.agp
 $ infoseq2agp.pl prefix.infoseq > prefix.agp
 $ validateAgp.pl prefix.agp
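
To show what the resulting .agp lines look like, here is a minimal sketch that lays out contigs and gaps along one scaffold. It does not reproduce scaff2agp.pl or infoseq2agp.pl. It assumes the 9-column AGP 2.0 layout, component type W for WGS contigs, and gap lines of type "scaffold" with "paired-ends" evidence; the input structure (parts) is hypothetical.

```python
def scaffold_to_agp(scaffold_id, parts):
    """Return AGP lines for one scaffold.

    parts is an ordered list of
      ("contig", contig_id, length, orientation)  or  ("gap", length).
    Coordinates are 1-based and inclusive.
    """
    lines = []
    pos = 1
    for part_number, part in enumerate(parts, start=1):
        if part[0] == "contig":
            _kind, contig_id, length, orientation = part
            end = pos + length - 1
            # the whole contig is used, so component coordinates are 1..length
            fields = [scaffold_id, pos, end, part_number, "W",
                      contig_id, 1, length, orientation]
        else:
            _kind, length = part
            end = pos + length - 1
            fields = [scaffold_id, pos, end, part_number, "N",
                      length, "scaffold", "yes", "paired-ends"]
        lines.append("\t".join(str(f) for f in fields))
        pos = end + 1
    return lines


layout = [("contig", "ctg1", 100, "+"), ("gap", 50), ("contig", "ctg2", 200, "-")]
print("\n".join(scaffold_to_agp("scf1", layout)))
```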

Submission

BankIt

  • BankIt
  • one or a few sequence submissions

Email

  • e-mail the output file to gb_sub@ncbi.nlm.nih.gov (deprecated)

GenomeMacroSend

Ftp

  • for large WGS projects
server: ftp-trace.ncbi.nlm.nih.gov
login:         cbcb_trc
password:      t@@GeaYF
center:        CBCB
directory:     test/ ;     do not use uploads/, which is reserved for SRA

Updates

  • Updating
  • in .sbt file replace "subtype new" with "subtype update"
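
The edit can also be scripted. A minimal sketch, assuming the .sbt template is plain text that contains the literal field "subtype new"; the function name mark_as_update is hypothetical.

```python
import re


def mark_as_update(sbt_text):
    """Replace "subtype new" with "subtype update" in .sbt template text.

    Raises ValueError if the field is not found, so a template that was
    already updated (or has a different layout) is not silently passed through.
    """
    new_text, n = re.subn(r"\bsubtype\s+new\b", "subtype update", sbt_text)
    if n == 0:
        raise ValueError('no "subtype new" field found in template')
    return new_text


# Usage sketch:
# with open("prefix.sbt") as fh:
#     updated = mark_as_update(fh.read())
# with open("prefix.update.sbt", "w") as fh:
#     fh.write(updated)
```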

TA

TA

Compressed archive containing 
  3 files: TRACEINFO.xml, MD5, README
  traces/ directory
  SCF format traces under traces/ or traces/*/
 
Each archive is a gzip file of 1-4 GB; include the center's name and the date in the file name.
Accepted only by uploading to NCBI FTP server.
  server: ftp-trace.ncbi.nih.gov
  login: 
  passwd: 
  center: UMD

Scripts:

 /nfshomes/dpuiu/Archives/JCVI/bin/phred2xmlTrace.pl

GenBank & SRA

 server: ftp-trace.ncbi.nlm.nih.gov
 login:         cbcb_trc
 password:      t@@GeaYF
 
 Center_name (acronym): CBCB 
 Full name: Center for Bioinformatics and Computational Biology, University of Maryland
 
 Directory (Short reads):  short_read/ 
 Directory (Sanger reads): uploads/
 Directory (test):         test/ (Assembled sequences)
 
 Validation table

AA

 Compressed archive containing 2 files: ASSEMBLY.xml , MD5 
 Accepted only by uploading to NCBI FTP server.
   server: ftp-private.ncbi.nlm.nih.gov
   login: cbcb_trc
   passwd: t@@GeaYF
   center: UMD   
   description: University of Maryland
 ASSEMBLY XML Schema png 
 ASSEMBLY XML Schema xsd 

Use XContig package scripts

Files:

.contig      : contigs & underlying reads (use TRACE_NAME's or SEQ_NAME's) 
.seq         : read sequences (use TRACE_NAME's or SEQ_NAME's) 
.qual        : read qualities (use TRACE_NAME's or SEQ_NAME's) 
.ti2seq_name : (TI , TRACE_NAME or SEQ_NAME) : required if the contig file does not use the read ti's
 $ bank2contig -e       prefix.bnk > prefix.contig
 $ dumpreads   -e -r    prefix.bnk > prefix.seq
 $ dumpreads   -e -r -q prefix.bnk > prefix.qual
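
Before building the archive it can help to confirm that every read in the .seq file has an entry in the .ti2seq_name map. The sketch below makes two assumptions that should be checked against the real files: the .seq file is FASTA with the read name as the first token of each header, and the map has two whitespace-separated columns (TI, then name). The function name reads_missing_ti is hypothetical.

```python
def reads_missing_ti(seq_lines, idmap_lines):
    """Return read names from a FASTA .seq file that have no TI mapping.

    Assumed map format: "<TI> <TRACE_NAME or SEQ_NAME>" per line.
    """
    mapped = set()
    for line in idmap_lines:
        fields = line.split()
        if len(fields) >= 2:
            mapped.add(fields[1])

    missing = []
    for line in seq_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens and tokens[0] not in mapped:
                missing.append(tokens[0])
    return missing


# Usage sketch:
# with open("prefix.seq") as seqs, open("prefix.ti2seq_name") as idmap:
#     for name in reads_missing_ti(seqs, idmap):
#         print("no TI for read", name)
```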

Example:

Xoo: /fs/szasmg/Bacteria/Xanthomonas/XOO/Xoo_PXO99A/FinalAsm_June2007/AA

Steps:

 1. makeConinfo ASSEMBLY.coninfo
 $ more ASSEMBLY.coninfo
 <coninfo>
 <meta name='center'>UMD</meta>
 <meta name='db'>Xoo</meta>
 <meta name='desc'>Xanthomonas oryzae pv. oryzae strain PXO99A</meta>
 <meta name='object'>ASSEMBLY</meta>
 <meta name='species_code'>Xanthomonas oryzae pv. oryzae strain PXO99A</meta>
 <meta name='structure'>Chromosome</meta>
 <meta name='subtype'>NEW</meta>
 <meta name='taxid'>360094</meta>
 <contig id="1106158952778_stitched" conformation="CIRCULAR" subtype="NEW"/>
 <contig id="... "/>
 <file src="Xoo.contig"/>
 <seq src="Xoo.seq"/>
 <qual src="Xoo.qual"/>
 <idmap  src="Xoo.ti2seq_name" direction="FORWARD"/>
 </coninfo>
 2. buildAssemblyArchive ASSEMBLY.coninfo --prompt --subname umd-20070816-125223
 problems:
    * submitter_reference="tigr...." : replace tigr with umd
    * conformation: always LINEAR    : replace LINEAR with CIRCULAR ???
 3. validate:
 oXygen: software used by NCBI; license required
 xmllint: open source
 $ xmllint --schema ~/bin/TraceAssembly.xsd umd-*/ASSEMBLY.xml > /dev/null
 umd-20070816-125223/ASSEMBLY.xml validates
 4. edit files
 $ rm *.tar.gz
 $ md5sum umd-*/ASSEMBLY.xml
 $ edit umd-*/MANIFEST         # update ASSEMBLY.xml md5sum 
 
 $ ls -1 umd-*
 umd-20070816-125223/
  1106158952778_stitched_20070817-141849.con       # Contig consensus
  1106158952778_stitched_20070817-141849.congap    # Contig gaps
  ASSEMBLY.xml                                     # Assembly XML
  MANIFEST                                         # MD5 sums
 5. create tarball
 $ tar czvf umd-20070816-125223.tar.gz umd-20070816-125223/
 6. upload tarball
   !!! contact trace@ncbi.nlm.nih.gov if login/password error
 
   $ ftp ftp-private.ncbi.nlm.nih.gov
   login: cbcb_trc
   passwd: t@@GeaYF
   $ cd assembly
   $ put *.tar.gz
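
The checksum, tarball and upload steps above can be scripted. The following is a sketch under stated assumptions, not a tested submission tool: the helper names are hypothetical, the MANIFEST layout is not reproduced (only the MD5 value is computed), and credentials are expected to come from the environment rather than from the script.

```python
import hashlib
import os
import tarfile


def md5_of(path):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_tarball(directory):
    """Create <directory>.tar.gz next to the submission directory."""
    directory = directory.rstrip("/")
    tar_path = directory + ".tar.gz"
    with tarfile.open(tar_path, "w:gz") as tar:
        # store paths relative to the submission directory name
        tar.add(directory, arcname=os.path.basename(directory))
    return tar_path


def upload(ftp, tar_path, remote_dir="assembly"):
    """Upload the tarball in binary mode using an already logged-in ftplib.FTP."""
    ftp.cwd(remote_dir)
    with open(tar_path, "rb") as fh:
        ftp.storbinary(f"STOR {os.path.basename(tar_path)}", fh)


# Usage sketch (NCBI_FTP_USER / NCBI_FTP_PASSWORD are hypothetical variable names):
# from ftplib import FTP
# print(md5_of("umd-20070816-125223/ASSEMBLY.xml"))   # value to record in MANIFEST
# tar_path = make_tarball("umd-20070816-125223")
# with FTP("ftp-private.ncbi.nlm.nih.gov") as ftp:
#     ftp.login(os.environ["NCBI_FTP_USER"], os.environ["NCBI_FTP_PASSWORD"])
#     upload(ftp, tar_path)
```

Passing the FTP connection in as an argument keeps the login step separate, so the packaging functions can be run and checked without network access.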