NCBI submission: Difference between revisions
Email
No edit summary |
|||
(33 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
= | = WGS/TPA = | ||
== Links == | |||
* [http://www.ncbi.nlm.nih.gov/Genbank/genomesubmit.html Info] | * [http://www.ncbi.nlm.nih.gov/Genbank/genomesubmit.html Info] | ||
* [http://www.ncbi.nlm.nih.gov/Genbank/wgs.html wgs] | * [http://www.ncbi.nlm.nih.gov/Genbank/wgs.html wgs] | ||
Line 8: | Line 9: | ||
* [http://www.ncbi.nlm.nih.gov/genome/assembly/agp/AGP_Specification.shtml AGP format] | * [http://www.ncbi.nlm.nih.gov/genome/assembly/agp/AGP_Specification.shtml AGP format] | ||
* [http://www.ncbi.nlm.nih.gov/Genbank/genomesubmit.html Bacterial Genomes] | * [http://www.ncbi.nlm.nih.gov/Genbank/genomesubmit.html Bacterial Genomes] | ||
* [http://www.ncbi.nlm.nih.gov/Genbank/genomesubmit_annotation.html Annotation] [http://www.ncbi.nlm.nih.gov/projects/collab/FT/index.html Annotation Info] | * [http://www.ncbi.nlm.nih.gov/Genbank/genomesubmit_annotation.html Annotation] | ||
* [http://www.ncbi.nlm.nih.gov/projects/collab/FT/index.html Annotation Info] | |||
* [http://www.ncbi.nlm.nih.gov/genomes/locustag/Proposal.pdf Locus tag extra info] | |||
== Registration == | |||
* [http://www.ncbi.nlm.nih.gov/genomes/mpfsubmission.cgi Register WGS form] | * [http://www.ncbi.nlm.nih.gov/genomes/mpfsubmission.cgi Register WGS form] | ||
* CBCB Genome Projects: | * CBCB Genome Projects: | ||
Line 17: | Line 21: | ||
* [http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj&cmd=Retrieve&dopt=Overview&list_uids=16740 Xanthomonas oryzae pv. oryzicola BLS256] assembly | * [http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj&cmd=Retrieve&dopt=Overview&list_uids=16740 Xanthomonas oryzae pv. oryzicola BLS256] assembly | ||
* [http://www.ncbi.nlm.nih.gov/genomes/static/links.html Sequencing center list] | |||
Output: | Output: | ||
* genome project id (5 digit); use it in e-mail correspondence | * genome project id (5 digit); use it in e-mail correspondence | ||
* locus_tags (3 letter/digit) | * locus_tags (3+ letter/digit) | ||
* [http://www.ncbi.nlm.nih.gov/genomes/lltp.cgi Locus tag database] to check if the chosen locus tag is available | |||
== | == Requirements == | ||
* ctg's: no gaps | * ctg's: no gaps; .sqn format | ||
* annotation: either for ctg's or superctg's ; .sqn format | |||
* superctg's: AGP format | * superctg's: AGP format | ||
Output: | Output: | ||
Line 34: | Line 41: | ||
* CON record for superctg's | * CON record for superctg's | ||
== | == Formatting == | ||
=== Metadata === | |||
* .sbt file generated by Sequin | |||
* [http://www.ncbi.nlm.nih.gov/Sequin/index.html Sequin] | * [http://www.ncbi.nlm.nih.gov/Sequin/index.html Sequin] | ||
* [http://www.ncbi.nlm.nih.gov/Sequin/QuickGuide/sequin.htm QuickGuide] | * [http://www.ncbi.nlm.nih.gov/Sequin/QuickGuide/sequin.htm QuickGuide] | ||
Line 62: | Line 52: | ||
/nfshomes/dpuiu/szdevel/sequin.8.10/sequin | /nfshomes/dpuiu/szdevel/sequin.8.10/sequin | ||
/nfshomes/dpuiu/szdevel/bin/sequin # latest version | |||
!!! import /nfshomes/dpuiu/bin/seqin.sqn | !!! import /nfshomes/dpuiu/bin/seqin.sqn | ||
Line 91: | Line 82: | ||
Annotation | Annotation | ||
Export template => /nfshomes/dpuiu/bin/sequin.sbt : contains submission info | |||
=== Tags === | |||
$ addFastaTags.pl -s " [organism=...] [strain=wPip] [substrain=JBH] [tech=wgs]" prefix.fasta | |||
=== Annotation === | |||
* Locus tag examples: | |||
ABC_I00001 for gene 1, chromosome I | |||
ABC_II00001 for gene 1, chromosome II | |||
ABC_r1112 for ribosomal RNA genes | |||
ABC_t1113 for tRNA genes | |||
* [http://www.ncbi.nlm.nih.gov/genomes/frameshifts/frameshifts.cgi frameshifts] | |||
* Generating the .tbl format from a TAB delimited format | |||
$ ~dpuiu/bin/tab2annotation.pl -h | |||
# Example: | |||
tab2annotation.pl -ht "SeqId Location Strand Length Product" prefix.ptt > prefix.tbl | |||
tab2annotation.pl -hl 1 -SeqId NC_012456 prefix.ptt > prefix.tbl | |||
# INPUT | |||
SeqId SeqIdLength OrfId Start End Length Product | |||
1225 2425 002 422 706 285 malate synthase G | |||
# OUTPUT | |||
>Feature 1225 | |||
422 706 gene | |||
locus_tag C1A_1225_002 | |||
422 706 CDS | |||
product malate synthase G | |||
protein_id gnl|cbcb|C1A_1225_002 | |||
=== Merge === | |||
* tbl2asn: command line | |||
* [[Media:tbl2asn.txt|tbl2asn man]] | |||
Input files: | Input files: | ||
prefix.sbt: submission file | prefix.sbt: submission file | ||
Line 105: | Line 129: | ||
$ tbl2asn -t prefix.sbt -V v -s -i prefix.fasta | $ tbl2asn -t prefix.sbt -V v -s -i prefix.fasta | ||
tbl2asn -t template.sbt -i prefix.fasta -V v -s | $ tbl2asn -t template.sbt -i prefix.fasta -V v -s | ||
* Input files: [[Media:sequin.sbt|sequin.sbt]] | |||
* template (*.sbt) Example: /nfshomes/dpuiu/bin/sequin.sbt | |||
comment : is the article name | |||
* FASTA sequence (*.fsa) | * FASTA sequence (*.fsa) | ||
>SeqID [organism=...] [strain=...] [tech=wgs] [chromosome=...][gcode=11] | >SeqID [organism=...] [strain=...] [tech=wgs] [chromosome=...][gcode=11] | ||
Line 125: | Line 151: | ||
* .val: validation file; check it for errors | * .val: validation file; check it for errors | ||
=== AGP === | |||
$ | |||
* [http://www.ncbi.nlm.nih.gov/projects/genome/assembly/agp/AGP_Specification.shtml AGP_Specification.shtml] | |||
* Sequence gaps : "fragment yes" | |||
* Scaffold gaps : "contig no" | |||
$ scaff2agp.pl < prefix.scaff > prefix.agp | |||
$ infoseq2agp.pl prefix.infoseq > prefix.agp | |||
$ valiadteAgp.pl prefix.agp | |||
== Submission == | |||
=== BankIt === | |||
* [http://www.ncbi.nlm.nih.gov/BankIt/ BankIt] | |||
* one or a few sequence submissions | |||
=== Email === | |||
* e-mail the output file to gb_sub@ncbi.nlm.nih.gov (deprecated) | |||
=== GenomeMacroSend === | |||
* [http://www.ncbi.nlm.nih.gov/projects/GenomeSubmit/genome_submit.cgi GenomeMacroSend] : submit *.sqn, *.tbl, *.fsa, *agp files | |||
=== Ftp === | |||
* for large WGS projects | |||
server: ftp-trace.ncbi.nlm.nih.gov | |||
login: cbcb_trc | |||
password: t@@GeaYF | |||
center: CBCB | |||
directory: test/ ; don't use uploads/ that is used by SRA | |||
=== Updates === | |||
* [http://www.ncbi.nlm.nih.gov/Genbank/genomesubmit.html#updating Updating] | |||
* in .sbt file replace "subtype new" with "subtype update" | |||
= TA | = TA = | ||
[http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=overview&m=doc&s=overview TA] | [http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=overview&m=doc&s=overview TA] | ||
Line 163: | Line 209: | ||
/nfshomes/dpuiu/Archives/JCVI/bin/phred2xmlTrace.pl | /nfshomes/dpuiu/Archives/JCVI/bin/phred2xmlTrace.pl | ||
= SRA | = Genbank & SRA = | ||
server: ftp-trace.ncbi.nlm.nih.gov | server: ftp-trace.ncbi.nlm.nih.gov | ||
Line 174: | Line 220: | ||
Directory (Short reads): short_read/ | Directory (Short reads): short_read/ | ||
Directory (Sanger reads): uploads/ | Directory (Sanger reads): uploads/ | ||
Directory (test): test/ ( | Directory (test): test/ (Assembled sequences) | ||
[http://www.ncbi.nlm.nih.gov/Traces/field_matrix_current.xls Validation table] | [http://www.ncbi.nlm.nih.gov/Traces/field_matrix_current.xls Validation table] | ||
= AA | = AA = | ||
* !!! Need a TaxId before formatting the files | * !!! Need a TaxId before formatting the files | ||
Line 205: | Line 251: | ||
$ dumpreads -e -r prefix.bnk > prefix.seq | $ dumpreads -e -r prefix.bnk > prefix.seq | ||
$ dumpreads -e -r -q prefix.bnk > prefix.qual | $ dumpreads -e -r -q prefix.bnk > prefix.qual | ||
Example: | Example: | ||
Line 266: | Line 311: | ||
$ put *.tar.gz | $ put *.tar.gz | ||
= | = dbGSS = | ||
* 4 files: email to batch-sub@ncbi.nlm.nih.gov | |||
1. Publication | |||
TYPE: Pub #required | |||
MEDUID: 92347897 | |||
TITLE: #required | |||
Genomic sequences from a subtracted retinal pigment epithelium library | |||
AUTHORS: #required | |||
Gieser,L.; Swaroop,A. | |||
JOURNAL: Genomics | |||
VOLUME: 13 | |||
ISSUE: 2 | |||
PAGES: 873-6 | |||
YEAR: 1992 #required | |||
STATUS: 4 #required :1=unpublished, 2=submitted, 3=in press, 4=published | |||
|| | |||
2. Library | |||
TYPE: Lib #required | |||
NAME: Rat Lambda Zap Express Library | |||
ORGANISM: Rattus norvegicus | |||
STRAIN: Sprague-Dawley | |||
SEX: male | |||
STAGE: embryonic day 17 post-fertilization | |||
TISSUE: aorta | |||
CELL_TYPE: vascular smooth muscle | |||
DESCR: | |||
Put description here. | |||
|| | |||
3. Contact | |||
TYPE: Cont | |||
NAME: Sikela JM | |||
FAX: 303 270 7097 | |||
TEL: 303 270 | |||
EMAIL: tjs@tally.hsc.colorado.edu | |||
LAB: Department of Pharmacology | |||
INST: University of Colorado Health Sciences Center | |||
ADDR: Box C236, 4200 E. 9th Ave., Denver, CO 80262-0236, USA | |||
|| | |||
4. GSS sequence file | |||
TYPE: GSS #required | |||
STATUS: New #required | |||
CONT_NAME: Sikela JM #required | |||
GSS#: Ayh00001 #required | |||
CLONE: HHC189 | |||
SOURCE: ATCC | |||
SOURCE_INHOST: 65128 | |||
OTHER_GSS: GSS00093, GSS000101 | |||
CITATION: #required | |||
Genomic sequences from Human brain tissue | |||
SEQ_PRIMER: M13 Forward | |||
P_END: 5' | |||
HIQUAL_START: 1 | |||
HIQUAL_STOP: 285 | |||
DNA_TYPE: Genomic | |||
CLASS: shotgun #required | |||
LIBRARY: Hippocampus, Stratagene (cat. #936205) #required | |||
PUBLIC: #required | |||
PUT_ID: Actin, gamma, skeletal | |||
COMMENT: | |||
SEQUENCE: | |||
AATCAGCCTGCAAGCAAAAGATAGGAATATTCACCTACAGTGGGCACCTCCTTAAGAAGCTG #required | |||
... | |||
|| | |||
* Matching strings: | |||
CONT_NAME of GSS file and NAME field of the Contact file | |||
LIBRARY field of GSS file and NAME field of the Library file | |||
CITATION field of GSS file and TITLE field of the Publication file |
Latest revision as of 14:25, 1 August 2011
WGS/TPA
Links
- Info
- wgs
- tbl2asn
- Sequence modifiers
- AGP format
- Bacterial Genomes
- Annotation
- Annotation Info
- Locus tag extra info
Registration
- Register WGS form
- CBCB Genome Projects:
* search Genome Project for Center for Bioinformatics and Computational Biology[Sequencing Center]
* Xanthomonas oryzae pv. oryzae PXO99A complete; /fs/szdata/ncbi/ftp.ncbi.nih.gov/genomes/Bacteria/Xanthomonas_oryzae_PXO99A
* Xanthomonas oryzae pv. oryzicola BLS256 assembly
Output:
- genome project id (5 digit); use it in e-mail correspondence
- locus_tags (3+ letter/digit)
- Locus tag database to check if the chosen locus tag is available
Requirements
- ctg's: no gaps; .sqn format
- annotation: either for ctg's or superctg's; .sqn format
- superctg's: AGP format
Output:
- 4-letter WGS project_ID : XXXX
- project accession number : XXXX00000000 (4-letter ID followed by 8 0's)
- 1st version: XXXX01000000
- 1st version ctg's: XXXX01000001
- CON record for superctg's
Formatting
Metadata
- .sbt file generated by Sequin
- Sequin
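The accession numbering listed above can be illustrated with a minimal sketch. The function name `wgs_accessions` and its arguments are hypothetical; the only rule encoded is the one stated in the list (4-letter ID, 2-digit version, 6-digit contig counter).

```python
def wgs_accessions(project_id: str, version: int, n_contigs: int) -> dict:
    """Build the accession strings for one WGS project (illustrative helper)."""
    if len(project_id) != 4 or not project_id.isalpha():
        raise ValueError("project_id must be exactly 4 letters")
    prefix = project_id.upper()
    return {
        # project accession: 4-letter ID followed by eight zeros
        "project": f"{prefix}{0:08d}",
        # versioned master record: 2-digit version, then six zeros
        "version": f"{prefix}{version:02d}{0:06d}",
        # contigs of that version are numbered from 1
        "contigs": [f"{prefix}{version:02d}{i:06d}" for i in range(1, n_contigs + 1)],
    }


acc = wgs_accessions("ABCD", version=1, n_contigs=2)
print(acc["project"], acc["version"], acc["contigs"])
```

"ABCD" is a placeholder, not a real project ID.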
- QuickGuide
- multiple sequences
/nfshomes/dpuiu/szdevel/sequin.8.10/sequin
/nfshomes/dpuiu/szdevel/bin/sequin # latest version
!!! import /nfshomes/dpuiu/bin/seqin.sqn
Form
  Submission:
    Immediately
    ...
    Tentative manuscript title:
  Contact:
    Name: Daniela Puiu
    Phone: 301.405.3403
    Fax: 301.314.1341
    Email: dpuiu@umiacs.umd.edu
  Authors:
    Daniela Puiu
    Steven L. Salzberg
    ...
  Affiliation
    Institution: University of Maryland, Center for Bioinformatics and Computational Biology, 3115 Biomolecular Sciences Building #296, College Park, MD 20742, US
Sequence format
  Batch submission
  FASTA
  Original submission
  ...
Organism and Sequences:
  Nucleotide: can import from FASTA file
  Organism: strain, moltype
  Proteins
  Annotation
Export template => /nfshomes/dpuiu/bin/sequin.sbt : contains submission info
Tags
$ addFastaTags.pl -s " [organism=...] [strain=wPip] [substrain=JBH] [tech=wgs]" prefix.fasta
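As a rough illustration of the tagging step, the sketch below appends the same bracketed modifiers to every FASTA header line. It only assumes what addFastaTags.pl appears to do from the command line above; the function name `add_fasta_tags` and the sample contig names are hypothetical.

```python
def add_fasta_tags(lines, tags):
    """Yield FASTA lines with `tags` appended to each header line.

    Sequence lines are passed through unchanged.
    """
    tags = tags.strip()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield f"{line} {tags}"
        else:
            yield line


fasta = [">ctg1", "ACGT", ">ctg2", "GGCC"]
tagged = list(add_fasta_tags(fasta, " [strain=wPip] [substrain=JBH] [tech=wgs]"))
print("\n".join(tagged))
```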
Annotation
- Locus tag examples:
ABC_I00001 for gene 1, chromosome I
ABC_II00001 for gene 1, chromosome II
ABC_r1112 for ribosomal RNA genes
ABC_t1113 for tRNA genes
- Generating the .tbl format from a TAB delimited format
$ ~dpuiu/bin/tab2annotation.pl -h
# Example:
tab2annotation.pl -ht "SeqId Location Strand Length Product" prefix.ptt > prefix.tbl
tab2annotation.pl -hl 1 -SeqId NC_012456 prefix.ptt > prefix.tbl
# INPUT
SeqId SeqIdLength OrfId Start End Length Product
1225 2425 002 422 706 285 malate synthase G
# OUTPUT
>Feature 1225
422 706 gene
      locus_tag C1A_1225_002
422 706 CDS
      product malate synthase G
      protein_id gnl|cbcb|C1A_1225_002
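The INPUT to OUTPUT conversion shown above can be sketched in a few lines. This is not tab2annotation.pl itself; `tab_to_tbl`, the `prefix` argument and the `db` default are illustrative names, and every row is assumed to be a forward-strand CDS. The output follows the tab-separated 5-column feature table layout: a `>Feature` header, feature lines with start, end and key, and qualifier lines indented by three tabs.

```python
import csv
import io


def tab_to_tbl(tab_text, prefix, db="cbcb"):
    """Convert TAB-delimited ORF rows (with a header line) to a feature table."""
    out = []
    current = None
    for row in csv.DictReader(io.StringIO(tab_text), delimiter="\t"):
        if row["SeqId"] != current:
            # one ">Feature" header per sequence
            current = row["SeqId"]
            out.append(f">Feature {current}")
        locus_tag = f"{prefix}_{row['SeqId']}_{row['OrfId']}"
        start, end = row["Start"], row["End"]
        out.append(f"{start}\t{end}\tgene")
        out.append(f"\t\t\tlocus_tag\t{locus_tag}")
        out.append(f"{start}\t{end}\tCDS")
        out.append(f"\t\t\tproduct\t{row['Product']}")
        out.append(f"\t\t\tprotein_id\tgnl|{db}|{locus_tag}")
    return "\n".join(out) + "\n"


example = (
    "SeqId\tSeqIdLength\tOrfId\tStart\tEnd\tLength\tProduct\n"
    "1225\t2425\t002\t422\t706\t285\tmalate synthase G\n"
)
print(tab_to_tbl(example, prefix="C1A"), end="")
```

Reverse-strand features would need start and end swapped, which this sketch does not handle.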
Merge
- tbl2asn: command line
- tbl2asn man
Input files:
  prefix.sbt: submission file
  prefix.fsa: sequence : at most 10,000 sequences/file
  prefix.tbl: annotation
$ tbl2asn -t prefix.sbt -V v -s -p .
$ tbl2asn -t prefix.sbt -V v -s -i prefix.fasta
$ tbl2asn -t template.sbt -i prefix.fasta -V v -s
- Input files: sequin.sbt
* template (*.sbt)
  Example: /nfshomes/dpuiu/bin/sequin.sbt
  comment : is the article name
* FASTA sequence (*.fsa)
  >SeqID [organism=...] [strain=...] [tech=wgs] [chromosome=...][gcode=11]
  Adding tags:
  ~dpuiu/bin/addFastaTags.pl -s " [organism=...] [strain=wPip] [substrain=JBH] [tech=wgs]" wPipJBH.fasta
* annotation table (*.tbl) (optional)
  5-column table
  locus_tag for genes
  protein_id for proteins
  product for proteins
- Output files:
* ASN.1 (*.sqn) for submission to GenBank.
* .val: validation file; check it for errors
AGP
- AGP_Specification.shtml
- Sequence gaps : "fragment yes"
- Scaffold gaps : "contig no"
$ scaff2agp.pl < prefix.scaff > prefix.agp
$ infoseq2agp.pl prefix.infoseq > prefix.agp
$ valiadteAgp.pl prefix.agp
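To make the gap conventions concrete, here is a minimal sketch that writes AGP lines for one scaffold. It is not scaff2agp.pl; the function name, the `parts` tuple layout and the contig names are hypothetical. It assumes the older AGP layout this page refers to (component type W for WGS contigs, N for gaps, gap lines ending in an empty ninth column); later AGP versions changed the gap vocabulary, so check the output against the linked specification.

```python
# gap kind -> (gap_type, linkage), following the two conventions listed above
GAP_KINDS = {
    "sequence": ("fragment", "yes"),
    "scaffold": ("contig", "no"),
}


def scaffold_to_agp(object_id, parts):
    """Return AGP lines for one object.

    parts is a list of ("contig", contig_id, length, orientation)
    or ("gap", length, kind) tuples, in order along the object.
    """
    lines = []
    pos = 1
    for number, part in enumerate(parts, start=1):
        if part[0] == "contig":
            _, contig_id, length, orientation = part
            end = pos + length - 1
            fields = [object_id, pos, end, number, "W", contig_id, 1, length, orientation]
        else:
            _, length, kind = part
            gap_type, linkage = GAP_KINDS[kind]
            end = pos + length - 1
            fields = [object_id, pos, end, number, "N", length, gap_type, linkage, ""]
        lines.append("\t".join(str(f) for f in fields))
        pos = end + 1
    return lines


for line in scaffold_to_agp("scf1", [("contig", "ctg1", 100, "+"),
                                     ("gap", 50, "sequence"),
                                     ("contig", "ctg2", 200, "-")]):
    print(line)
```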
Submission
BankIt
- BankIt
- one or a few sequence submissions
- e-mail the output file to gb_sub@ncbi.nlm.nih.gov (deprecated)
GenomeMacroSend
- GenomeMacroSend : submit *.sqn, *.tbl, *.fsa, *agp files
Ftp
- for large WGS projects
server: ftp-trace.ncbi.nlm.nih.gov
login: cbcb_trc
password: t@@GeaYF
center: CBCB
directory: test/ ; don't use uploads/ that is used by SRA
Updates
- Updating
- in .sbt file replace "subtype new" with "subtype update"
TA
Compressed archive containing 3 files: TRACEINFO.xml, MD5, README
traces/ directory
SCF format traces under traces/ or traces/*/
The archives are gzip files of 1-4GB; include the center's name and the date in the file names
Accepted only by uploading to NCBI FTP server.
server: ftp-trace.ncbi.nih.gov
login:
passwd:
center: UMD
Scripts:
/nfshomes/dpuiu/Archives/JCVI/bin/phred2xmlTrace.pl
Genbank & SRA
server: ftp-trace.ncbi.nlm.nih.gov
login: cbcb_trc
password: t@@GeaYF
Center_name (acronym): CBCB
Full name: Center for Bioinformatics and Computational Biology, University of Maryland
Directory (Short reads): short_read/
Directory (Sanger reads): uploads/
Directory (test): test/ (Assembled sequences)
Validation table
AA
- !!! Need a TaxId before formatting the files
- AA
- AA submission info
Compressed archive containing 2 files: ASSEMBLY.xml , MD5
Accepted only by uploading to NCBI FTP server.
server: ftp-private.ncbi.nlm.nih.gov
login: cbcb_trc
passwd: t@@GeaYF
center: UMD
description: University of Maryland
ASSEMBLY XML Schema png
ASSEMBLY XML Schema xsd
Use XContig package scripts
Files:
.contig : contigs & underlying reads (use TRACE_NAME's or SEQ_NAME's)
.seq : read sequences (use TRACE_NAME's or SEQ_NAME's)
.qual : read qualities (use TRACE_NAME's or SEQ_NAME's)
.ti2seq_name : (TI , TRACE_NAME or SEQ_NAME) : required if the contig file does not use the read ti's
$ bank2contig -e prefix.bnk > prefix.contig
$ dumpreads -e -r prefix.bnk > prefix.seq
$ dumpreads -e -r -q prefix.bnk > prefix.qual
Example:
Xoo: /fs/szasmg/Bacteria/Xanthomonas/XOO/Xoo_PXO99A/FinalAsm_June2007/AA
Steps:
1. makeConinfo ASSEMBLY.coninfo
$ more ASSEMBLY.coninfo
<coninfo>
  <meta name='center'>UMD</meta>
  <meta name='db'>Xoo</meta>
  <meta name='desc'>Xanthomonas oryzae pv. oryzae strain PXO99A</meta>
  <meta name='object'>ASSEMBLY</meta>
  <meta name='species_code'>Xanthomonas oryzae pv. oryzae strain PXO99A</meta>
  <meta name='structure'>Chromosome</meta>
  <meta name='subtype'>NEW</meta>
  <meta name='taxid'>360094</meta>
  <contig id="1106158952778_stitched" conformation="CIRCULAR" subtype="NEW"/>
  <contig id="... "/>
  <file src="Xoo.contig"/>
  <seq src="Xoo.seq"/>
  <qual src="Xoo.qual"/>
  <idmap src="Xoo.ti2seq_name" direction="FORWARD"/>
</coninfo>
2. buildAssemblyArchive ASSEMBLY.coninfo --prompt --subname umd-20070816-125223
problems:
* submitter_reference="tigr...." : replace tigr with umd
* conformation: always LINEAR : replace LINEAR with CIRCULAR ???
3. validate:
oXygen: software used by NCBI; license required
xmllint: open source
$ xmllint --schema ~/bin/TraceAssembly.xsd umd-*/ASSEMBLY.xml > /dev/null
umd-20070816-125223/ASSEMBLY.xml validates
4. edit files
$ rm *.tar.gz
$ md5sum umd-*/ASSEMBLY.xml
$ edit umd-*/MANIFEST # update ASSEMBLY.xml md5sum
$ ls -1 umd-*
umd-20070816-125223/
1106158952778_stitched_20070817-141849.con # Contig consensus
1106158952778_stitched_20070817-141849.congap # Contig gaps
ASSEMBLY.xml # Assembly XML
MANIFEST # MD5 sums
5. create tarball
$ tar czvf umd-20070816-125223.tar.gz umd-20070816-125223/
6. upload tarball
!!! contact trace@ncbi.nlm.nih.gov if login/password error
$ ftp ftp-private.ncbi.nlm.nih.gov
login: cbcb_trc
passwd: t@@GeaYF
$ cd assembly
$ put *.tar.gz
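The checksum and tarball steps can also be scripted. The sketch below uses only the Python standard library as a stand-in for `md5sum` and `tar czvf`; the helper names `md5_of` and `make_tarball` are hypothetical. It does not edit MANIFEST, whose exact layout is not described on this page.

```python
import hashlib
import tarfile
from pathlib import Path


def md5_of(path):
    """Return the hex MD5 digest of a file, reading it in 1 MB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_tarball(submission_dir):
    """Create <dir>.tar.gz next to the submission directory and return its path."""
    submission_dir = Path(submission_dir)
    tarball = submission_dir.with_name(submission_dir.name + ".tar.gz")
    with tarfile.open(tarball, "w:gz") as tar:
        # keep the directory name as the top-level entry, as `tar czvf` would
        tar.add(submission_dir, arcname=submission_dir.name)
    return tarball
```

Usage, assuming the submission directory from the steps above exists: `print(md5_of("umd-20070816-125223/ASSEMBLY.xml"))` gives the value to paste into MANIFEST, and `make_tarball("umd-20070816-125223")` writes the archive to upload.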
dbGSS
- 4 files: email to batch-sub@ncbi.nlm.nih.gov
1. Publication
TYPE: Pub #required
MEDUID: 92347897
TITLE: #required
Genomic sequences from a subtracted retinal pigment epithelium library
AUTHORS: #required
Gieser,L.; Swaroop,A.
JOURNAL: Genomics
VOLUME: 13
ISSUE: 2
PAGES: 873-6
YEAR: 1992 #required
STATUS: 4 #required :1=unpublished, 2=submitted, 3=in press, 4=published
||
2. Library
TYPE: Lib #required
NAME: Rat Lambda Zap Express Library
ORGANISM: Rattus norvegicus
STRAIN: Sprague-Dawley
SEX: male
STAGE: embryonic day 17 post-fertilization
TISSUE: aorta
CELL_TYPE: vascular smooth muscle
DESCR:
Put description here.
||
3. Contact
TYPE: Cont
NAME: Sikela JM
FAX: 303 270 7097
TEL: 303 270
EMAIL: tjs@tally.hsc.colorado.edu
LAB: Department of Pharmacology
INST: University of Colorado Health Sciences Center
ADDR: Box C236, 4200 E. 9th Ave., Denver, CO 80262-0236, USA
||
4. GSS sequence file
TYPE: GSS #required
STATUS: New #required
CONT_NAME: Sikela JM #required
GSS#: Ayh00001 #required
CLONE: HHC189
SOURCE: ATCC
SOURCE_INHOST: 65128
OTHER_GSS: GSS00093, GSS000101
CITATION: #required
Genomic sequences from Human brain tissue
SEQ_PRIMER: M13 Forward
P_END: 5'
HIQUAL_START: 1
HIQUAL_STOP: 285
DNA_TYPE: Genomic
CLASS: shotgun #required
LIBRARY: Hippocampus, Stratagene (cat. #936205) #required
PUBLIC: #required
PUT_ID: Actin, gamma, skeletal
COMMENT:
SEQUENCE:
AATCAGCCTGCAAGCAAAAGATAGGAATATTCACCTACAGTGGGCACCTCCTTAAGAAGCTG #required
...
||
- Matching strings:
CONT_NAME of GSS file and NAME field of the Contact file
LIBRARY field of GSS file and NAME field of the Library file
CITATION field of GSS file and TITLE field of the Publication file
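The three string matches can be checked before e-mailing the files. The sketch below is a hypothetical helper, not an NCBI tool: it assumes one record per file, `FIELD: value` lines with optional continuation lines, a closing `||`, and files that no longer carry the `#required` annotations used in the templates above.

```python
import re

# field names are upper-case and may contain "_" or "#" (e.g. CONT_NAME, GSS#)
FIELD = re.compile(r"^([A-Z][A-Z_#]*):\s*(.*)$")


def parse_record(text):
    """Parse one flat-file record into a dict of field name -> value."""
    fields = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "||":
            break
        match = FIELD.match(line)
        if match:
            current = match.group(1)
            fields[current] = match.group(2).strip()
        elif current and line:
            # continuation line, e.g. the title on the line after "TITLE:"
            fields[current] = (fields[current] + " " + line).strip()
    return fields


def check_links(gss, contact, library, publication):
    """Return a list of messages for cross-file strings that do not match."""
    pairs = [
        ("CONT_NAME", contact, "NAME"),
        ("LIBRARY", library, "NAME"),
        ("CITATION", publication, "TITLE"),
    ]
    problems = []
    for gss_field, other, other_field in pairs:
        if gss.get(gss_field) != other.get(other_field):
            problems.append(
                f"{gss_field} {gss.get(gss_field)!r} != {other_field} {other.get(other_field)!r}"
            )
    return problems
```

Comparison is exact, so differences in case or spacing are reported as mismatches.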