NCBI submission: Difference between revisions
		
		
		
		
| (14 intermediate revisions by the same user not shown) | |||
| Line 9: | Line 9: | ||
| * [http://www.ncbi.nlm.nih.gov/genome/assembly/agp/AGP_Specification.shtml AGP format] | * [http://www.ncbi.nlm.nih.gov/genome/assembly/agp/AGP_Specification.shtml AGP format] | ||
| * [http://www.ncbi.nlm.nih.gov/Genbank/genomesubmit.html Bacterial Genomes] | * [http://www.ncbi.nlm.nih.gov/Genbank/genomesubmit.html Bacterial Genomes] | ||
| * [http://www.ncbi.nlm.nih.gov/Genbank/genomesubmit_annotation.html Annotation] [http://www.ncbi.nlm.nih.gov/projects/collab/FT/index.html Annotation Info]  | * [http://www.ncbi.nlm.nih.gov/Genbank/genomesubmit_annotation.html Annotation]   | ||
| * [http://www.ncbi.nlm.nih.gov/projects/collab/FT/index.html Annotation Info]   | |||
| * [http://www.ncbi.nlm.nih.gov/genomes/locustag/Proposal.pdf Locus tag extra info] | |||
| == Registration == | == Registration == | ||
| Line 19: | Line 21: | ||
|     * [http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj&cmd=Retrieve&dopt=Overview&list_uids=16740 Xanthomonas oryzae pv. oryzicola BLS256] assembly |     * [http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj&cmd=Retrieve&dopt=Overview&list_uids=16740 Xanthomonas oryzae pv. oryzicola BLS256] assembly | ||
| * [http://www.ncbi.nlm.nih.gov/genomes/static/links.html Sequencing center list] | |||
| Output: | Output: | ||
| * genome project id (5 digit); use it in e-mail correspondence | * genome project id (5 digit); use it in e-mail correspondence | ||
| * locus_tags (3 letter/digit) | * locus_tags (3+ letter/digit) | ||
| * [http://www.ncbi.nlm.nih.gov/genomes/lltp.cgi Locus tag database] to check if the chosen locus tag is available | |||
| == Requirements == | == Requirements == | ||
| Line 47: | Line 52: | ||
|    /nfshomes/dpuiu/szdevel/sequin.8.10/sequin |    /nfshomes/dpuiu/szdevel/sequin.8.10/sequin | ||
|   /nfshomes/dpuiu/szdevel/bin/sequin              # latest version | |||
|    !!! import /nfshomes/dpuiu/bin/seqin.sqn |    !!! import /nfshomes/dpuiu/bin/seqin.sqn | ||
| Line 76: | Line 82: | ||
|      Annotation |      Annotation | ||
| Export template =>  | Export template => /nfshomes/dpuiu/bin/sequin.sbt : contains submission info | ||
| === Tags === | === Tags === | ||
| Line 83: | Line 89: | ||
| === Annotation === | === Annotation === | ||
| * Locus tag examples: | |||
|   ABC_I00001 for gene 1, chromosome I | |||
|   ABC_II00001 for gene 1, chromosome II | |||
|   ABC_r1112 for ribosomal RNA genes | |||
|   ABC_t1113 for tRNA genes | |||
| * [http://www.ncbi.nlm.nih.gov/genomes/frameshifts/frameshifts.cgi frameshifts]  | |||
| * Generating the .tbl format from a TAB delimited format | * Generating the .tbl format from a TAB delimited format | ||
| Line 105: | Line 119: | ||
| === Merge === | === Merge === | ||
| * tbl2asn: command line | * tbl2asn: command line | ||
| * [[Media:tbl2asn.txt|tbl2asn man]] | |||
|    Input files:   |    Input files:   | ||
| Line 118: | Line 133: | ||
| * Input files: [[Media:sequin.sbt|sequin.sbt]] | * Input files: [[Media:sequin.sbt|sequin.sbt]] | ||
|    * template (*.sbt)        Example: /nfshomes/dpuiu/bin/sequin.sbt |    * template (*.sbt)        Example: /nfshomes/dpuiu/bin/sequin.sbt | ||
|    comment : is the article name | |||
|    * FASTA sequence (*.fsa)   |    * FASTA sequence (*.fsa)   | ||
| Line 137: | Line 153: | ||
| === AGP === | === AGP === | ||
| * [http://www.ncbi.nlm.nih.gov/projects/genome/assembly/agp/AGP_Specification.shtml AGP_Specification.shtml] | |||
| * Sequence gaps : "fragment yes" | * Sequence gaps : "fragment yes" | ||
| * Scaffold gaps : "contig   no" | * Scaffold gaps : "contig   no" | ||
| Line 192: | Line 209: | ||
|    /nfshomes/dpuiu/Archives/JCVI/bin/phred2xmlTrace.pl |    /nfshomes/dpuiu/Archives/JCVI/bin/phred2xmlTrace.pl | ||
| = SRA = | = Genbank & SRA = | ||
|    server: ftp-trace.ncbi.nlm.nih.gov |    server: ftp-trace.ncbi.nlm.nih.gov | ||
| Line 203: | Line 220: | ||
|    Directory (Short reads):  short_read/   |    Directory (Short reads):  short_read/   | ||
|    Directory (Sanger reads): uploads/ |    Directory (Sanger reads): uploads/ | ||
|    Directory (test):         test/ ( |    Directory (test):         test/ (Assembled sequences) | ||
|    [http://www.ncbi.nlm.nih.gov/Traces/field_matrix_current.xls Validation table] |    [http://www.ncbi.nlm.nih.gov/Traces/field_matrix_current.xls Validation table] | ||
| Line 293: | Line 310: | ||
|      $ cd assembly |      $ cd assembly | ||
|      $ put *.tar.gz |      $ put *.tar.gz | ||
Latest revision as of 14:25, 1 August 2011
WGS/TPA
Links
- Info
- wgs
- tbl2asn
- Sequence modifiers
- AGP format
- Bacterial Genomes
- Annotation
- Annotation Info
- Locus tag extra info
Registration
- Register WGS form
- CBCB Genome Projects:
 * search Genome Project for Center for Bioinformatics and Computational Biology[Sequencing Center]
 * Xanthomonas oryzae pv. oryzae PXO99A complete; /fs/szdata/ncbi/ftp.ncbi.nih.gov/genomes/Bacteria/Xanthomonas_oryzae_PXO99A
 * Xanthomonas oryzae pv. oryzicola BLS256 assembly
Output:
- genome project id (5 digit); use it in e-mail correspondence
- locus_tags (3+ letter/digit)
- Locus tag database to check if the chosen locus tag is available
Requirements
- ctg's: no gaps; .sqn format
- annotation: either for ctg's or superctg's ; .sqn format
- superctg's: AGP format
Output:
- 4-letter WGS project_ID : XXXX
- project accession number : XXXX00000000 (4-letter ID followed by 8 0's)
- 1st version: XXXX01000000
- 1st version ctg's: XXXX01000001
- CON record for superctg's
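The accession pattern listed above can be sketched in a few lines of Python. The function name `wgs_accessions` is hypothetical, and the numbering of contigs beyond the first one is an assumption that simply continues the XXXX01000001 pattern.

```python
def wgs_accessions(project_id, version=1, n_contigs=1):
    """Build WGS accession strings from a 4-letter project ID (illustrative only)."""
    if len(project_id) != 4 or not project_id.isalpha():
        raise ValueError("expected a 4-letter WGS project ID")
    prefix = project_id.upper()
    return {
        # 4-letter ID followed by 8 zeros
        "project": f"{prefix}{0:08d}",
        # 2-digit version followed by 6 zeros
        "version": f"{prefix}{version:02d}{0:06d}",
        # assumption: contigs are numbered consecutively from 000001
        "contigs": [f"{prefix}{version:02d}{i:06d}" for i in range(1, n_contigs + 1)],
    }
```

For example, `wgs_accessions("ABCD", 1, 2)` gives the project accession ABCD00000000, the first version ABCD01000000, and the contigs ABCD01000001 and ABCD01000002.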
Formatting
Metadata
- .sbt file generated by Sequin
- Sequin
- QuickGuide
- multiple sequences
 /nfshomes/dpuiu/szdevel/sequin.8.10/sequin
 /nfshomes/dpuiu/szdevel/bin/sequin              # latest version
 !!! import /nfshomes/dpuiu/bin/seqin.sqn
 Form
   Submission: Immediately
   ...
   Tentative manuscript title:
 Contact:
   Name: Daniela Puiu
   Phone: 301.405.3403
   Fax: 301.314.1341
   Email: dpuiu@umiacs.umd.edu
 Authors:
   Daniela Puiu
   Steven L. Salzberg
   ...
 Affiliation
   Institution: University of Maryland, Center for Bioinformatics and Computational Biology , 3115 Biomolecular Sciences Building #296, College Park, MD 20742 , US
 Sequence format
   Batch submission
   FASTA
   Original submission
   ...
 Organism and Sequences:
   Nucleotide: can import from FASTA file
   Organism: strain, moltype
   Proteins
   Annotation
Export template => /nfshomes/dpuiu/bin/sequin.sbt : contains submission info
Tags
$ addFastaTags.pl -s " [organism=...] [strain=wPip] [substrain=JBH] [tech=wgs]" prefix.fasta
Annotation
- Locus tag examples:
 ABC_I00001 for gene 1, chromosome I
 ABC_II00001 for gene 1, chromosome II
 ABC_r1112 for ribosomal RNA genes
 ABC_t1113 for tRNA genes
- Generating the .tbl format from a TAB delimited format
 $ ~dpuiu/bin/tab2annotation.pl -h
 
     # Example:
     tab2annotation.pl -ht "SeqId Location Strand Length Product" prefix.ptt > prefix.tbl
     tab2annotation.pl -hl 1 -SeqId NC_012456 prefix.ptt > prefix.tbl
 
     # INPUT
     SeqId   SeqIdLength     OrfId   Start   End     Length  Product
     1225    2425            002     422     706     285     malate synthase G
 
     # OUTPUT
     >Feature 1225
     422     706     gene
                             locus_tag       C1A_1225_002
     422     706     CDS
                             product malate synthase G
                             protein_id      gnl|cbcb|C1A_1225_002
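The INPUT to OUTPUT mapping above can be approximated in Python. This is only a sketch of the transformation shown in the example, not the actual tab2annotation.pl: the function name `tab_to_tbl`, the `tag_prefix` and `db` parameters, and the omission of strand handling are all assumptions.

```python
import csv
import io
from itertools import groupby


def tab_to_tbl(tab_text, tag_prefix="C1A", db="cbcb"):
    """Convert a tab-delimited ORF table (with header row) to feature-table text.

    Assumes the columns SeqId, OrfId, Start, End and Product are present and
    that rows belonging to the same SeqId are adjacent.
    """
    reader = csv.DictReader(io.StringIO(tab_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    out = []
    for seq_id, rows in groupby(reader, key=lambda row: row["SeqId"]):
        out.append(f">Feature {seq_id}")
        for row in rows:
            # locus_tag pattern taken from the example: prefix_SeqId_OrfId
            locus_tag = f"{tag_prefix}_{seq_id}_{row['OrfId']}"
            start, end = row["Start"], row["End"]
            out.append(f"{start}\t{end}\tgene")
            out.append(f"\t\t\tlocus_tag\t{locus_tag}")
            out.append(f"{start}\t{end}\tCDS")
            out.append(f"\t\t\tproduct\t{row['Product']}")
            out.append(f"\t\t\tprotein_id\tgnl|{db}|{locus_tag}")
    return "\n".join(out) + "\n"
```

Qualifier lines start with three tabs, so the qualifier name lands in the fourth column and its value in the fifth of the 5-column table.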
Merge
- tbl2asn: command line
- tbl2asn man
 Input files:
   prefix.sbt: submission file
   prefix.fsa: sequence : at most 10,000 sequences/file
   prefix.tbl: annotation
 $ tbl2asn -t prefix.sbt -V v -s -p .
 $ tbl2asn -t prefix.sbt -V v -s -i prefix.fasta
$ tbl2asn -t template.sbt -i prefix.fasta -V v -s
- Input files: sequin.sbt
 * template (*.sbt)        Example: /nfshomes/dpuiu/bin/sequin.sbt
   comment : is the article name
 * FASTA sequence (*.fsa) 
    >SeqID [organism=...] [strain=...] [tech=wgs] [chromosome=...][gcode=11]
 
 Adding tags:
  ~dpuiu/bin/addFastaTags.pl -s " [organism=...] [strain=wPip] [substrain=JBH] [tech=wgs]" wPipJBH.fasta
 * annotation table (*.tbl) (optional)  
    5-column table 
       locus_tag for genes 
       protein_id for proteins
       product for proteins 
- Output files:
 * ASN.1 (*.sqn) for submission to GenBank.
 * .val: validation file; check it for errors
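The bracketed modifiers on the FASTA definition lines listed among the input files can also be appended with a few lines of Python. This is a rough stand-in for what addFastaTags.pl appears to do, not its actual implementation; the function name `add_fasta_tags` and the dictionary interface are assumptions.

```python
def add_fasta_tags(fasta_text, tags):
    """Append ' [key=value]' modifiers to every FASTA definition line."""
    suffix = "".join(f" [{key}={value}]" for key, value in tags.items())
    out = []
    for line in fasta_text.splitlines():
        # only definition lines (starting with '>') receive the modifiers
        out.append(line.rstrip() + suffix if line.startswith(">") else line)
    return "\n".join(out) + "\n"
```

Calling `add_fasta_tags(text, {"strain": "wPip", "tech": "wgs"})` leaves the sequence lines untouched and extends each definition line with ` [strain=wPip] [tech=wgs]`.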
AGP
- AGP_Specification.shtml
- Sequence gaps : "fragment yes"
- Scaffold gaps : "contig no"
 $ scaff2agp.pl < prefix.scaff > prefix.agp
 $ infoseq2agp.pl prefix.infoseq > prefix.agp
 $ valiadteAgp.pl prefix.agp
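To make the gap conventions above concrete, the sketch below writes AGP lines for one scaffold, using the "fragment yes" gap type and linkage for gaps inside the scaffold. It is not the scaff2agp.pl script; the function name `scaffold_to_agp`, the input tuple layout, the "W" (WGS contig) component type, the "+" orientation, and the empty ninth column on gap lines are assumptions based on AGP 1.1.

```python
def scaffold_to_agp(scaffold_id, parts, gap_type="fragment", linkage="yes"):
    """Return tab-delimited AGP lines for one scaffold.

    parts: list of (contig_id, contig_length, gap_length_after);
           use 0 as gap_length_after when no gap follows the contig.
    """
    lines, pos, part_no = [], 1, 0
    for contig_id, length, gap_after in parts:
        # component line: the contig placed on the scaffold
        part_no += 1
        end = pos + length - 1
        fields = [scaffold_id, pos, end, part_no, "W", contig_id, 1, length, "+"]
        lines.append("\t".join(map(str, fields)))
        pos = end + 1
        if gap_after > 0:
            # gap line: length, gap type and linkage replace the component columns
            part_no += 1
            end = pos + gap_after - 1
            fields = [scaffold_id, pos, end, part_no, "N", gap_after, gap_type, linkage, ""]
            lines.append("\t".join(map(str, fields)))
            pos = end + 1
    return lines
```

Passing `gap_type="contig", linkage="no"` would produce the scaffold-gap form instead, for example when scaffolds are placed on a chromosome-level object.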
Submission
BankIt
- BankIt
- one or a few sequence submissions
- e-mail the output file to gb_sub@ncbi.nlm.nih.gov (deprecated)
GenomeMacroSend
- GenomeMacroSend : submit *.sqn, *.tbl, *.fsa, *.agp files
Ftp
- for large WGS projects
 server:    ftp-trace.ncbi.nlm.nih.gov
 login:     cbcb_trc
 password:  t@@GeaYF
 center:    CBCB
 directory: test/ ; don't use uploads/, which is used by SRA
Updates
- Updating
- in .sbt file replace "subtype new" with "subtype update"
TA
 Compressed archive containing
   3 files: TRACEINFO.xml, MD5, README
   traces/ directory
 SCF format traces under traces/ or traces/*/
 The archives are gzip files of 1-4GB; include the center's name and the date in the file names
 Accepted only by uploading to NCBI FTP server.
 server: ftp-trace.ncbi.nih.gov
 login:
 passwd:
 center: UMD
Scripts:
/nfshomes/dpuiu/Archives/JCVI/bin/phred2xmlTrace.pl
Genbank & SRA
 server:                   ftp-trace.ncbi.nlm.nih.gov
 login:                    cbcb_trc
 password:                 t@@GeaYF
 Center_name (acronym):    CBCB
 Full name:                Center for Bioinformatics and Computational Biology, University of Maryland
 Directory (Short reads):  short_read/
 Directory (Sanger reads): uploads/
 Directory (test):         test/ (Assembled sequences)
 Validation table
AA
- !!! Need a TaxId before formatting the files
- AA
- AA submission info
 Compressed archive containing 2 files: ASSEMBLY.xml , MD5
 Accepted only by uploading to NCBI FTP server.
 server: ftp-private.ncbi.nlm.nih.gov
 login: cbcb_trc
 passwd: t@@GeaYF
 center: UMD
 description: University of Maryland
- ASSEMBLY XML Schema png
- ASSEMBLY XML Schema xsd
Use XContig package scripts
Files:
 .contig      : contigs & underlying reads (use TRACE_NAME's or SEQ_NAME's)
 .seq         : read sequences (use TRACE_NAME's or SEQ_NAME's)
 .qual        : read qualities (use TRACE_NAME's or SEQ_NAME's)
 .ti2seq_name : (TI , TRACE_NAME or SEQ_NAME) : required if the contig file does not use the read ti's
 $ bank2contig -e prefix.bnk > prefix.contig
 $ dumpreads -e -r prefix.bnk > prefix.seq
 $ dumpreads -e -r -q prefix.bnk > prefix.qual
Example:
Xoo: /fs/szasmg/Bacteria/Xanthomonas/XOO/Xoo_PXO99A/FinalAsm_June2007/AA
Steps:
 1. makeConinfo ASSEMBLY.coninfo
 $ more ASSEMBLY.coninfo
 <coninfo>
   <meta name='center'>UMD</meta>
   <meta name='db'>Xoo</meta>
   <meta name='desc'>Xanthomonas oryzae pv. oryzae strain PXO99A</meta>
   <meta name='object'>ASSEMBLY</meta>
   <meta name='species_code'>Xanthomonas oryzae pv. oryzae strain PXO99A</meta>
   <meta name='structure'>Chromosome</meta>
   <meta name='subtype'>NEW</meta>
   <meta name='taxid'>360094</meta>
   <contig id="1106158952778_stitched" conformation="CIRCULAR" subtype="NEW"/>
   <contig id="... "/>
   <file src="Xoo.contig"/>
   <seq src="Xoo.seq"/>
   <qual src="Xoo.qual"/>
   <idmap src="Xoo.ti2seq_name" direction="FORWARD"/>
 </coninfo>
 2. buildAssemblyArchive ASSEMBLY.coninfo --prompt --subname umd-20070816-125223
 problems:
    * submitter_reference="tigr...." : replace tigr with umd
    * conformation: always LINEAR    : replace LINEAR with CIRCULAR ???
 3. validate:
    oXygen: software used by NCBI; license required
    xmllint: open source
    $ xmllint --schema ~/bin/TraceAssembly.xsd umd-*/ASSEMBLY.xml > /dev/null
    umd-20070816-125223/ASSEMBLY.xml validates
 4. edit files
    $ rm *.tar.gz
    $ md5sum umd-*/ASSEMBLY.xml
    $ edit umd-*/MANIFEST                                 # update ASSEMBLY.xml md5sum
    $ ls -1 umd-*
    umd-20070816-125223/
    1106158952778_stitched_20070817-141849.con            # Contig consensus
    1106158952778_stitched_20070817-141849.congap         # Contig gaps
    ASSEMBLY.xml                                          # Assembly XML
    MANIFEST                                              # MD5 sums
 5. create tarball
    $ tar czvf umd-20070816-125223.tar.gz umd-20070816-125223/
 6. upload tarball
    !!! contact trace@ncbi.nlm.nih.gov if login/password error
    $ ftp ftp-private.ncbi.nlm.nih.gov
    login: cbcb_trc
    passwd: t@@GeaYF
    $ cd assembly
    $ put *.tar.gz
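The checksum and tarball steps can be scripted as well. The sketch below is a minimal Python equivalent of the `md5sum` and `tar czvf` commands above; the function names `md5_of` and `package_submission` are hypothetical, and because the MANIFEST layout is not described here, the MD5 value is only returned so it can be pasted in by hand.

```python
import hashlib
import tarfile
from pathlib import Path


def md5_of(path):
    """Return the hex MD5 digest of a file, reading it in 1 MB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def package_submission(sub_dir):
    """Compute the MD5 of ASSEMBLY.xml and gzip-tar the submission directory.

    The tarball is written next to the directory as <dirname>.tar.gz.
    """
    sub_dir = Path(sub_dir)
    checksum = md5_of(sub_dir / "ASSEMBLY.xml")
    tarball = sub_dir.parent / (sub_dir.name + ".tar.gz")
    with tarfile.open(tarball, "w:gz") as tar:
        # keep the directory name as the top-level entry, as 'tar czvf dir/' does
        tar.add(sub_dir, arcname=sub_dir.name)
    return checksum, tarball
```

Update the MANIFEST with the returned checksum before calling the packaging step a second time, so the archive that gets uploaded contains the corrected MANIFEST.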
dbGSS
- 4 files: email to batch-sub@ncbi.nlm.nih.gov
1. Publication
  TYPE: Pub                         #required
  MEDUID: 92347897
  TITLE:                            #required
  Genomic sequences from a subtracted retinal pigment epithelium library
  AUTHORS:                          #required
  Gieser,L.; Swaroop,A.
  JOURNAL: Genomics
  VOLUME: 13
  ISSUE: 2
  PAGES: 873-6
  YEAR:  1992                       #required
  STATUS: 4                         #required :1=unpublished, 2=submitted, 3=in press, 4=published
  ||
2. Library
  TYPE: Lib                         #required
  NAME:  Rat Lambda Zap Express Library
  ORGANISM: Rattus norvegicus
  STRAIN: Sprague-Dawley
  SEX: male
  STAGE: embryonic day 17 post-fertilization
  TISSUE: aorta
  CELL_TYPE: vascular smooth muscle
  DESCR:
  Put description here.
  ||
3. Contact
  TYPE: Cont
  NAME: Sikela JM
  FAX: 303 270 7097
  TEL: 303 270
  EMAIL: tjs@tally.hsc.colorado.edu
  LAB: Department of Pharmacology
  INST: University of Colorado Health Sciences Center
  ADDR: Box C236, 4200 E. 9th Ave., Denver, CO 80262-0236, USA
  ||
4. GSS sequence file
  TYPE: GSS                             #required
  STATUS:  New                          #required
  CONT_NAME: Sikela JM                  #required
  GSS#: Ayh00001                        #required
  CLONE: HHC189
  SOURCE: ATCC
  SOURCE_INHOST: 65128
  OTHER_GSS:  GSS00093, GSS000101
  CITATION:                             #required
  Genomic sequences from Human brain tissue
  SEQ_PRIMER: M13 Forward
  P_END: 5'
  HIQUAL_START: 1
  HIQUAL_STOP: 285
  DNA_TYPE: Genomic
  CLASS: shotgun                                           #required
  LIBRARY: Hippocampus, Stratagene (cat. #936205)          #required
  PUBLIC:                                                  #required
  PUT_ID: Actin, gamma, skeletal
  COMMENT:
  SEQUENCE:
  AATCAGCCTGCAAGCAAAAGATAGGAATATTCACCTACAGTGGGCACCTCCTTAAGAAGCTG           #required
  ...
  ||
- Matching strings:
  CONT_NAME of GSS file and NAME field of the Contact file
  LIBRARY field of GSS file and NAME field of the Library file
  CITATION field of GSS file and TITLE field of the Publication file
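The matching-string rule is easy to get wrong by a stray space, so it may be worth checking before e-mailing the files. The sketch below is a minimal Python check, assuming records laid out as in the examples above ("FIELD: value" lines, values optionally continued on the following lines, "||" as terminator). The names `parse_record` and `check_links` are hypothetical, and the "#required" annotations from these notes are stripped only so the examples parse as written.

```python
import re

# a field line looks like 'CONT_NAME: Sikela JM' or 'GSS#: Ayh00001'
FIELD = re.compile(r"^([A-Z_#]+):\s*(.*)$")

# (field in GSS file, which other file, field in that file)
LINKS = [
    ("CONT_NAME", "contact", "NAME"),
    ("LIBRARY", "library", "NAME"),
    ("CITATION", "publication", "TITLE"),
]


def parse_record(text):
    """Parse one dbGSS-style record into a dict of field name -> value."""
    record, key = {}, None
    for raw in text.splitlines():
        line = raw.split("#required")[0].strip()
        if not line:
            continue
        if line == "||":
            break
        match = FIELD.match(line)
        if match:
            key = match.group(1)
            record[key] = match.group(2).strip()
        elif key is not None:
            # continuation line: the value started on an earlier line
            record[key] = (record[key] + " " + line).strip()
    return record


def check_links(gss, contact, library, publication):
    """Return a list of messages for GSS fields that do not match exactly."""
    others = {"contact": contact, "library": library, "publication": publication}
    problems = []
    for gss_field, kind, other_field in LINKS:
        if gss.get(gss_field) != others[kind].get(other_field):
            problems.append(f"{gss_field} does not match {other_field} of the {kind} file")
    return problems
```

An empty list from `check_links` means all three strings agree. A continuation line that itself looks like "WORD: text" in capitals would be misread as a new field, which is a known limitation of this simple parser.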