NCBI submission: Difference between revisions

* [http://www.ncbi.nlm.nih.gov/genome/assembly/agp/AGP_Specification.shtml AGP format]
* [http://www.ncbi.nlm.nih.gov/Genbank/genomesubmit.html Bacterial Genomes]
* [http://www.ncbi.nlm.nih.gov/Genbank/genomesubmit_annotation.html Annotation]
* [http://www.ncbi.nlm.nih.gov/projects/collab/FT/index.html Annotation Info]
* [http://www.ncbi.nlm.nih.gov/genomes/locustag/Proposal.pdf Locus tag extra info]

== Registration ==

   * [http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj&cmd=Retrieve&dopt=Overview&list_uids=16740 Xanthomonas oryzae pv. oryzicola BLS256] assembly

* [http://www.ncbi.nlm.nih.gov/genomes/static/links.html Sequencing center list]
Output:
* genome project id (5 digit); use it in e-mail correspondence
* locus_tags (3+ letter/digit)

* [http://www.ncbi.nlm.nih.gov/genomes/lltp.cgi Locus tag database] to check if the chosen locus tag is available

== Requirements ==

   /nfshomes/dpuiu/szdevel/sequin.8.10/sequin
   /nfshomes/dpuiu/szdevel/bin/sequin              # latest version
   !!! import /nfshomes/dpuiu/bin/seqin.sqn

     Annotation

Export template => /nfshomes/dpuiu/bin/sequin.sbt : contains submission info

=== Tags ===

=== Annotation ===
* Locus tag examples:
  ABC_I00001 for gene 1, chromosome I
  ABC_II00001 for gene 1, chromosome II
  ABC_r1112 for ribosomal RNA genes
  ABC_t1113 for tRNA genes
* [http://www.ncbi.nlm.nih.gov/genomes/frameshifts/frameshifts.cgi frameshifts]

* Generating the .tbl format from a TAB-delimited format

=== Merge ===
* tbl2asn: command line
* [[Media:tbl2asn.txt|tbl2asn man]]

   Input files:

* Input files: [[Media:sequin.sbt|sequin.sbt]]
   * template (*.sbt)        Example: /nfshomes/dpuiu/bin/sequin.sbt
  comment : is the article name

   * FASTA sequence (*.fsa)

=== AGP ===

* [http://www.ncbi.nlm.nih.gov/projects/genome/assembly/agp/AGP_Specification.shtml AGP_Specification.shtml]
* Sequence gaps : "fragment yes"
* Scaffold gaps : "contig  no"

* [http://www.ncbi.nlm.nih.gov/Genbank/genomesubmit.html#updating Updating]
* in .sbt file replace "subtype new" with "subtype update"

= TA =

   /nfshomes/dpuiu/Archives/JCVI/bin/phred2xmlTrace.pl

= Genbank & SRA =

   server: ftp-trace.ncbi.nlm.nih.gov

   Directory (Short reads):  short_read/
   Directory (Sanger reads): uploads/
   Directory (test):        test/ (Assembled sequences)

   [http://www.ncbi.nlm.nih.gov/Traces/field_matrix_current.xls Validation table]

     $ cd assembly
     $ put *.tar.gz

Latest revision as of 14:25, 1 August 2011

WGS/TPA

Links

Registration

  * search Genome Project for Center for Bioinformatics and Computational Biology[Sequencing Center]
  * Xanthomonas oryzae pv. oryzae PXO99A complete; /fs/szdata/ncbi/ftp.ncbi.nih.gov/genomes/Bacteria/Xanthomonas_oryzae_PXO99A
  * Xanthomonas oryzae pv. oryzicola BLS256 assembly

Output:

  • genome project id (5 digit); use it in e-mail correspondence
  • locus_tags (3+ letter/digit)

Requirements

  • ctg's: no gaps; .sqn format
  • annotation: either for ctg's or superctg's ; .sqn format
  • superctg's: AGP format

Output:

  • 4-letter WGS project_ID : XXXX
  • project accession number : XXXX00000000 (4-letter ID followed by 8 0's)
  • 1st version: XXXX01000000
  • 1st version ctg's: XXXX01000001
  • CON record for superctg's
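
The numbering scheme in this list can be sketched in a few lines of Python. The helper name `wgs_accession` is hypothetical, and the layout it assumes (4-letter ID, then a 2-digit version, then a 6-digit contig index) is read off the examples above rather than taken from an NCBI definition.

```python
def wgs_accession(project_id: str, version: int = 0, contig: int = 0) -> str:
    """Build a WGS-style accession string.

    Assumed layout (inferred from the examples on this page):
    4-letter project ID + 2-digit version + 6-digit contig index.
    """
    if len(project_id) != 4 or not project_id.isalpha():
        raise ValueError("project ID must be exactly 4 letters")
    if not (0 <= version <= 99 and 0 <= contig <= 999999):
        raise ValueError("version or contig index out of range")
    return f"{project_id.upper()}{version:02d}{contig:06d}"


if __name__ == "__main__":
    print(wgs_accession("XXXX"))        # project accession
    print(wgs_accession("XXXX", 1))     # 1st version
    print(wgs_accession("XXXX", 1, 1))  # 1st contig of the 1st version
```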

Formatting

Metadata

  • multiple sequences
 /nfshomes/dpuiu/szdevel/sequin.8.10/sequin
 /nfshomes/dpuiu/szdevel/bin/sequin              # latest version
 !!! import /nfshomes/dpuiu/bin/seqin.sqn
 Form
  Submission:
   Immediately ...
   Tentative manuscript title: 
  Contact:
   Name:  Daniela Puiu
   Phone: 301.405.3403
   Fax:   301.314.1341 
   Email: dpuiu@umiacs.umd.edu
  Authors:
   Daniela Puiu
   Steven L. Salzberg
   ...
  Affiliation
   Institution: University of Maryland,  Center for Bioinformatics and Computational Biology , 3115 Biomolecular Sciences Building #296, College Park, MD 20742 , US
  Sequence format
   Batch submission
   FASTA
   Original submission
  ...
 Organism and Sequences:
   Nucleotide: can import from FASTA file
   Organism: strain, moltype
   Proteins
   Annotation

Export template => /nfshomes/dpuiu/bin/sequin.sbt : contains submission info

Tags

 $ addFastaTags.pl -s " [organism=...] [strain=wPip] [substrain=JBH] [tech=wgs]" prefix.fasta
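
The source of addFastaTags.pl is not shown on this page, so the following Python sketch only illustrates the assumed effect: the tag string is appended to every FASTA header line and sequence lines are left untouched. The function name `add_fasta_tags` is hypothetical.

```python
from typing import Iterable, Iterator


def add_fasta_tags(lines: Iterable[str], tags: str) -> Iterator[str]:
    """Append `tags` to each FASTA header line; pass other lines through.

    Assumption: this mirrors what addFastaTags.pl -s "<tags>" does.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield line + tags
        else:
            yield line


if __name__ == "__main__":
    fasta = [">ctg1", "ACGT", ">ctg2", "GGCC"]
    tags = " [strain=wPip] [substrain=JBH] [tech=wgs]"
    for out in add_fasta_tags(fasta, tags):
        print(out)
```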

Annotation

  • Locus tag examples:
 ABC_I00001 for gene 1, chromosome I
 ABC_II00001 for gene 1, chromosome II
 ABC_r1112 for ribosomal RNA genes
 ABC_t1113 for tRNA genes
  • Generating the .tbl format from a TAB-delimited format
 $ ~dpuiu/bin/tab2annotation.pl -h
 
     # Example:
     tab2annotation.pl -ht "SeqId Location Strand Length Product" prefix.ptt > prefix.tbl
     tab2annotation.pl -hl 1 -SeqId NC_012456 prefix.ptt > prefix.tbl
 
     # INPUT
     SeqId   SeqIdLength     OrfId   Start   End     Length  Product
     1225    2425            002     422     706     285     malate synthase G
 
     # OUTPUT
     >Feature 1225
     422     706     gene
                             locus_tag       C1A_1225_002
     422     706     CDS
                             product malate synthase G
                             protein_id      gnl|cbcb|C1A_1225_002
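
The INPUT/OUTPUT pair above can be reproduced with a short Python sketch. It is not the tab2annotation.pl implementation: the function name `tab_to_tbl` is hypothetical, strand handling is omitted, and the locus tag pattern (prefix_SeqId_OrfId) and the `gnl|cbcb|` protein ID form are inferred from the example output.

```python
import csv
import io


def tab_to_tbl(tab_text: str, tag_prefix: str, db: str) -> str:
    """Convert a TAB-delimited ORF table (with a header row) to .tbl text.

    Expected columns: SeqId, OrfId, Start, End, Product (others are ignored).
    """
    reader = csv.DictReader(io.StringIO(tab_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    out = []
    current_seq = None
    for row in reader:
        # Start a new feature block whenever the sequence ID changes.
        if row["SeqId"] != current_seq:
            current_seq = row["SeqId"]
            out.append(f">Feature {current_seq}")
        locus_tag = f"{tag_prefix}_{row['SeqId']}_{row['OrfId']}"
        start, end = row["Start"], row["End"]
        out.append(f"{start}\t{end}\tgene")
        out.append(f"\t\t\tlocus_tag\t{locus_tag}")
        out.append(f"{start}\t{end}\tCDS")
        out.append(f"\t\t\tproduct\t{row['Product']}")
        out.append(f"\t\t\tprotein_id\tgnl|{db}|{locus_tag}")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    table = ("SeqId\tSeqIdLength\tOrfId\tStart\tEnd\tLength\tProduct\n"
             "1225\t2425\t002\t422\t706\t285\tmalate synthase G\n")
    print(tab_to_tbl(table, tag_prefix="C1A", db="cbcb"), end="")
```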

Merge

 Input files: 
   prefix.sbt: submission file
   prefix.fsa: sequence : at most 10,000 sequences/file
   prefix.tbl: annotation
 
 $ tbl2asn -t prefix.sbt -V v -s -p . 
 $ tbl2asn -t prefix.sbt -V v -s -i prefix.fasta
 $ tbl2asn -t template.sbt -i prefix.fasta -V v -s
 * template (*.sbt)        Example: /nfshomes/dpuiu/bin/sequin.sbt
  comment : is the article name
 * FASTA sequence (*.fsa) 
    >SeqID [organism=...] [strain=...] [tech=wgs] [chromosome=...][gcode=11]
 
 Adding tags:
  ~dpuiu/bin/addFastaTags.pl -s " [organism=...] [strain=wPip] [substrain=JBH] [tech=wgs]" wPipJBH.fasta
 * annotation table (*.tbl) (optional)  
    5-column table 
       locus_tag for genes 
       protein_id for proteins
       product for proteins 
  • Output files:
 * ASN.1 (*.sqn) for submission to GenBank.
 * .val: validation file; check it for errors

AGP

 $ scaff2agp.pl < prefix.scaff > prefix.agp
 $ infoseq2agp.pl prefix.infoseq > prefix.agp
 $ valiadteAgp.pl prefix.agp
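
For orientation, here is a minimal Python sketch of what an AGP line looks like. It is not the scaff2agp.pl logic; the function name `scaffold_to_agp` and the input tuples are hypothetical. The column layout is assumed from the AGP 1.1 specification (component lines use type `W` with 9 columns, gap lines use type `N` with gap length, gap type and linkage), and the gap values follow the "fragment yes" / "contig no" convention noted on this page. Check the current AGP specification before relying on it.

```python
from typing import List, Sequence


def scaffold_to_agp(scaffold_id: str, parts: Sequence[tuple]) -> List[str]:
    """Return AGP 1.1-style lines for one scaffold.

    parts: ("contig", contig_id, length, orientation)
        or ("gap", length, gap_type, linkage)
    """
    lines = []
    pos = 1  # AGP coordinates are 1-based and inclusive
    for part_number, part in enumerate(parts, start=1):
        if part[0] == "contig":
            _, ctg_id, length, orient = part
            end = pos + length - 1
            cols = [scaffold_id, pos, end, part_number, "W",
                    ctg_id, 1, length, orient]
        else:
            _, length, gap_type, linkage = part
            end = pos + length - 1
            cols = [scaffold_id, pos, end, part_number, "N",
                    length, gap_type, linkage]
        lines.append("\t".join(str(c) for c in cols))
        pos = end + 1
    return lines


if __name__ == "__main__":
    demo = [("contig", "ctg1", 100, "+"),
            ("gap", 50, "fragment", "yes"),
            ("contig", "ctg2", 200, "-")]
    print("\n".join(scaffold_to_agp("scf1", demo)))
```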

Submission

BankIt

  • BankIt
  • one or a few sequence submissions

Email

  • e-mail the output file to gb_sub@ncbi.nlm.nih.gov (deprecated)

GenomeMacroSend

Ftp

  • for large WGS projects
server: ftp-trace.ncbi.nlm.nih.gov
login:         cbcb_trc
password:      t@@GeaYF
center:        CBCB
directory:     test/ ;     don't use uploads/, which is used by SRA

Updates

  • Updating
  • in .sbt file replace "subtype new" with "subtype update"

TA

TA

Compressed archive containing 
  3 files: TRACEINFO.xml, MD5, README
  traces/ directory
  SCF format traces under traces/ or traces/*/
 
Each archive is a gzip file of 1-4GB; include the center's name and the date in the file name.
Accepted only by uploading to NCBI FTP server.
  server: ftp-trace.ncbi.nih.gov
  login: 
  passwd: 
  center: UMD

Scripts:

 /nfshomes/dpuiu/Archives/JCVI/bin/phred2xmlTrace.pl

Genbank & SRA

 server: ftp-trace.ncbi.nlm.nih.gov
 login:         cbcb_trc
 password:      t@@GeaYF
 
 Center_name (acronym): CBCB 
 Full name: Center for Bioinformatics and Computational Biology, University of Maryland
 
 Directory (Short reads):  short_read/ 
 Directory (Sanger reads): uploads/
 Directory (test):         test/ (Assembled sequences)
 
 Validation table

AA

 Compressed archive containing 2 files: ASSEMBLY.xml , MD5 
 Accepted only by uploading to NCBI FTP server.
   server: ftp-private.ncbi.nlm.nih.gov
   login: cbcb_trc
   passwd: t@@GeaYF
   center: UMD   
   description: University of Maryland
 ASSEMBLY XML Schema png 
 ASSEMBLY XML Schema xsd 

Use XContig package scripts

Files:

.contig      : contigs & underlying reads (use TRACE_NAME's or SEQ_NAME's) 
.seq         : read sequences (use TRACE_NAME's or SEQ_NAME's) 
.qual        : read qualities (use TRACE_NAME's or SEQ_NAME's) 
.ti2seq_name : (TI , TRACE_NAME or SEQ_NAME) : required if the contig file does not use the read ti's
 $ bank2contig -e       prefix.bnk > prefix.contig
 $ dumpreads   -e -r    prefix.bnk > prefix.seq
 $ dumpreads   -e -r -q prefix.bnk > prefix.qual

Example:

Xoo: /fs/szasmg/Bacteria/Xanthomonas/XOO/Xoo_PXO99A/FinalAsm_June2007/AA

Steps:

 1. makeConinfo ASSEMBLY.coninfo
 $ more ASSEMBLY.coninfo
 <coninfo>
 <meta name='center'>UMD</meta>
 <meta name='db'>Xoo</meta>
 <meta name='desc'>Xanthomonas oryzae pv. oryzae strain PXO99A</meta>
 <meta name='object'>ASSEMBLY</meta>
 <meta name='species_code'>Xanthomonas oryzae pv. oryzae strain PXO99A</meta>
 <meta name='structure'>Chromosome</meta>
 <meta name='subtype'>NEW</meta>
 <meta name='taxid'>360094</meta>
 <contig id="1106158952778_stitched" conformation="CIRCULAR" subtype="NEW"/>
 <contig id="... "/>
 <file src="Xoo.contig"/>
 <seq src="Xoo.seq"/>
 <qual src="Xoo.qual"/>
 <idmap  src="Xoo.ti2seq_name" direction="FORWARD"/>
 </coninfo>
 2. buildAssemblyArchive ASSEMBLY.coninfo --prompt --subname umd-20070816-125223
 problems:
    * submitter_reference="tigr...." : replace tigr with umd
    * conformation: always LINEAR    : replace LINEAR with CIRCULAR ???
 3. validate:
 oXygen: software used by NCBI; license required
 xmllint: open source
 $ xmllint --schema ~/bin/TraceAssembly.xsd umd-*/ASSEMBLY.xml > /dev/null
 umd-20070816-125223/ASSEMBLY.xml validates
 4. edit files
 $ rm *.tar.gz
 $ md5sum umd-*/ASSEMBLY.xml
 $ edit umd-*/MANIFEST         # update ASSEMBLY.xml md5sum 
 
 $ ls -1 umd-*
 umd-20070816-125223/
  1106158952778_stitched_20070817-141849.con       # Contig consensus
  1106158952778_stitched_20070817-141849.congap    # Contig gaps
  ASSEMBLY.xml                                     # Assembly XML
  MANIFEST                                         # MD5 sums
 5. create tarball
 $ tar czvf umd-20070816-125223.tar.gz umd-20070816-125223/
 6. upload tarball
   !!! contact trace@ncbi.nlm.nih.gov if login/password error
 
   $ ftp ftp-private.ncbi.nlm.nih.gov
   login: cbcb_trc
   passwd: t@@GeaYF
   $ cd assembly
   $ put *.tar.gz
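
The checksum and tarball steps above can also be scripted. The Python sketch below is an illustration only: the helper names `md5_of` and `make_tarball` are hypothetical, and because the MANIFEST layout is not shown on this page, the sketch prints the checksum instead of editing MANIFEST.

```python
import hashlib
import sys
import tarfile
from pathlib import Path


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a file, read in 1 MB chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_tarball(submission_dir: Path) -> Path:
    """Create <dir>.tar.gz next to the submission directory (like tar czf)."""
    tar_path = submission_dir.parent / (submission_dir.name + ".tar.gz")
    with tarfile.open(str(tar_path), "w:gz") as tar:
        tar.add(str(submission_dir), arcname=submission_dir.name)
    return tar_path


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: script.py <submission_dir>")
    sub_dir = Path(sys.argv[1])
    print(md5_of(sub_dir / "ASSEMBLY.xml"), "ASSEMBLY.xml")
    print("wrote", make_tarball(sub_dir))
```

Run it with the submission directory as the only argument, then update MANIFEST by hand with the printed checksum before uploading.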

dbGSS

  • 4 files: email to batch-sub@ncbi.nlm.nih.gov

1. Publication

 TYPE: Pub                         #required
 MEDUID: 92347897
 TITLE:                            #required
 Genomic sequences from a subtracted retinal pigment epithelium   library
 AUTHORS:                          #required
 Gieser,L.; Swaroop,A.
 JOURNAL: Genomics
 VOLUME: 13
 ISSUE: 2
 PAGES: 873-6
 YEAR:  1992                       #required
 STATUS: 4                         #required :1=unpublished, 2=submitted, 3=in press, 4=published
 ||

2. Library

 TYPE: Lib                         #required
 NAME:  Rat Lambda Zap Express Library
 ORGANISM: Rattus norvegicus
 STRAIN: Sprague-Dawley
 SEX: male
 STAGE: embryonic day 17 post-fertilization
 TISSUE: aorta
 CELL_TYPE: vascular smooth muscle
 DESCR: 
 Put description here.
 ||

3. Contact

 TYPE: Cont
 NAME: Sikela JM
 FAX: 303 270 7097
 TEL: 303 270 
 EMAIL: tjs@tally.hsc.colorado.edu
 LAB: Department of Pharmacology
 INST: University of Colorado Health Sciences Center
 ADDR: Box C236, 4200 E. 9th Ave., Denver, CO 80262-0236, USA
 ||

4. GSS sequence file

 TYPE: GSS                             #required
 STATUS:  New                          #required
 CONT_NAME: Sikela JM                  #required
 GSS#: Ayh00001                        #required
 CLONE: HHC189
 SOURCE: ATCC
 SOURCE_INHOST: 65128
 OTHER_GSS:  GSS00093, GSS000101
 CITATION:                             #required
 Genomic sequences from Human brain tissue
 SEQ_PRIMER: M13 Forward
 P_END: 5'
 HIQUAL_START: 1
 HIQUAL_STOP: 285
 DNA_TYPE: Genomic
 CLASS: shotgun                                           #required
 LIBRARY: Hippocampus, Stratagene (cat. #936205)          #required
 PUBLIC:                                                  #required
 PUT_ID: Actin, gamma, skeletal
 COMMENT:
 SEQUENCE:
 AATCAGCCTGCAAGCAAAAGATAGGAATATTCACCTACAGTGGGCACCTCCTTAAGAAGCTG           #required
 ...
 ||
  • Matching strings:
 CONT_NAME of GSS file and NAME field of the Contact file
 LIBRARY field of GSS file and NAME field of the Library file
 CITATION field of GSS file and TITLE field of the Publication file
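
The three matching rules can be checked before e-mailing the files. The Python sketch below is a rough illustration: `parse_records` and `check_links` are hypothetical names, the parser assumes records end with `||`, that field names are upper-case `KEY:` prefixes, and that the `#required` annotations shown on this page are not part of the real files.

```python
import re

FIELD = re.compile(r"^([A-Z_#]+):\s*(.*)$")


def parse_records(text):
    """Split flat-file text into dicts, one per '||'-terminated record."""
    records, current, key = [], {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "||":
            if current:
                records.append(current)
            current, key = {}, None
        elif line:
            match = FIELD.match(line)
            if match:
                key = match.group(1)
                current[key] = match.group(2).strip()
            elif key is not None:
                # Continuation line: join to the previous field with a space.
                current[key] = (current[key] + " " + line).strip()
    if current:
        records.append(current)
    return records


def check_links(records):
    """Return (GSS#, field) pairs whose value has no matching record."""
    by_type = {}
    for rec in records:
        by_type.setdefault(rec.get("TYPE"), []).append(rec)
    targets = {
        "CONT_NAME": {r.get("NAME") for r in by_type.get("Cont", [])},
        "LIBRARY": {r.get("NAME") for r in by_type.get("Lib", [])},
        "CITATION": {r.get("TITLE") for r in by_type.get("Pub", [])},
    }
    problems = []
    for gss in by_type.get("GSS", []):
        for field, allowed in targets.items():
            if gss.get(field) not in allowed:
                problems.append((gss.get("GSS#", "?"), field))
    return problems


if __name__ == "__main__":
    sample = """TYPE: Cont
NAME: Sikela JM
||
TYPE: Lib
NAME: Rat Lambda Zap Express Library
||
TYPE: Pub
TITLE:
Genomic sequences from Human brain tissue
||
TYPE: GSS
CONT_NAME: Sikela JM
GSS#: Ayh00001
CITATION:
Genomic sequences from Human brain tissue
LIBRARY: Hippocampus, Stratagene (cat. #936205)
||
"""
    for gss_id, field in check_links(parse_records(sample)):
        print(f"{gss_id}: {field} has no matching record")
```

In the sample, the LIBRARY value does not equal the NAME of the Library record, so that link is reported; the strings must match exactly.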