NCBI submission: Difference between revisions
Revision as of 14:51, 1 December 2008
WGS/TPA
Links
- Info
- wgs
- tbl2asn
- Sequence modifiers
- AGP format
- Bacterial Genomes
- Annotation: Annotation Info, Locus tag extra info
Registration
- Register WGS form
- CBCB Genome Projects:
* search Genome Project for Center for Bioinformatics and Computational Biology[Sequencing Center]
* Xanthomonas oryzae pv. oryzae PXO99A complete; /fs/szdata/ncbi/ftp.ncbi.nih.gov/genomes/Bacteria/Xanthomonas_oryzae_PXO99A
* Xanthomonas oryzae pv. oryzicola BLS256 assembly
Output:
- genome project id (5 digit); use it in e-mail correspondence
- locus_tags (3 letter/digit)
Requirements
- ctg's: no gaps; .sqn format
- annotation: either for ctg's or superctg's ; .sqn format
- superctg's: AGP format
Output:
- 4-letter WGS project_ID : XXXX
- project accession number : XXXX00000000 (4-letter ID followed by 8 0's)
- 1st version: XXXX01000000
- 1st version ctg's: XXXX01000001
- CON record for superctg's
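The accession pattern listed above can be reproduced with a small helper. This is only a sketch: the function name wgs_accession is made up for illustration, and XXXX stands for the 4-letter project ID assigned by NCBI.

```shell
# wgs_accession PROJECT_ID VERSION CONTIG_NUMBER   (hypothetical helper)
# Prints 4-letter ID + 2-digit version + 6-digit contig number.
# Contig number 0 gives the master record of that version.
# Pass plain integers (1, not 01): printf reads a leading zero as octal.
wgs_accession() {
    printf '%s%02d%06d\n' "$1" "$2" "$3"
}

wgs_accession XXXX 0 0   # project accession
wgs_accession XXXX 1 0   # 1st version
wgs_accession XXXX 1 1   # 1st contig of the 1st version
```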
Formatting
Metadata
- .sbt file generated by Sequin
- multiple sequences
/nfshomes/dpuiu/szdevel/sequin.8.10/sequin
!!! import /nfshomes/dpuiu/bin/seqin.sqn
Form
  Submission: Immediately ...
  Tentative manuscript title:
  Contact:
    Name: Daniela Puiu
    Phone: 301.405.3403
    Fax: 301.314.1341
    Email: dpuiu@umiacs.umd.edu
  Authors:
    Daniela Puiu
    Steven L. Salzberg
    ...
  Affiliation
    Institution: University of Maryland, Center for Bioinformatics and Computational Biology, 3115 Biomolecular Sciences Building #296, College Park, MD 20742, US
Sequence format
  Batch submission
  FASTA
  Original submission
  ...
Organism and Sequences:
  Nucleotide: can import from FASTA file
  Organism: strain, moltype
  Proteins
  Annotation
GenomeMacroSend : submit *.sqn, *.tbl, *.fsa, *.agp files
e-mail the output file to gb_sub@ncbi.nlm.nih.gov (deprecated)
Export template => template.sbt : contains submission info
Tags
$ addFastaTags.pl -s " [organism=...] [strain=wPip] [substrain=JBH] [tech=wgs]" prefix.fasta
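addFastaTags.pl is a local script and its source is not shown here. A rough stand-in, assuming it simply appends the tag string to every FASTA header line, could look like this (the function name add_fasta_tags and writing to stdout are assumptions):

```shell
# add_fasta_tags "TAGS" FILE   (hypothetical stand-in, not the real addFastaTags.pl)
# Appends TAGS to every header line (lines starting with ">") and prints the
# result to stdout; sequence lines pass through unchanged.
# Note: awk -v interprets backslash escapes, so TAGS should not contain backslashes.
add_fasta_tags() {
    awk -v tags="$1" '/^>/ { print $0 tags; next } { print }' "$2"
}

# Usage (output file name is an example):
#   add_fasta_tags " [organism=...] [strain=wPip] [substrain=JBH] [tech=wgs]" prefix.fasta > prefix.fsa
```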
tbl2asn
Input files:
  prefix.sbt: submission file
  prefix.fsa: sequence; at most 10,000 sequences/file
  prefix.tbl: annotation
$ tbl2asn -t prefix.sbt -V v -s -p .
$ tbl2asn -t prefix.sbt -V v -s -i prefix.fasta
$ tbl2asn -t template.sbt -i prefix.fasta -V v -s
- Input files:
* template (*.sbt)
  Example: /nfshomes/dpuiu/bin/seqin.sbt
* FASTA sequence (*.fsa)
  >SeqID [organism=...] [strain=...] [tech=wgs] [chromosome=...] [gcode=11]
  Adding tags:
  ~dpuiu/bin/addFastaTags.pl -s " [organism=...] [strain=wPip] [substrain=JBH] [tech=wgs]" wPipJBH.fasta
* annotation table (*.tbl) (optional)
  5-column table
  locus_tag for genes
  protein_id for proteins
  product for proteins
- Output files:
* ASN.1 (*.sqn) for submission to GenBank.
* .val: validation file; check it for errors
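Since a sequence file may hold at most 10,000 sequences, it can help to count the records before running tbl2asn. A minimal sketch; the function name check_fasta_count and the MAX_SEQS variable are illustrative, not part of any NCBI tool:

```shell
# Limit taken from the note above: at most 10,000 sequences per file.
MAX_SEQS=10000

# check_fasta_count FILE   (hypothetical helper)
# Counts header lines (">") and returns non-zero if the limit is exceeded.
check_fasta_count() {
    n=$(awk '/^>/ { n++ } END { print n + 0 }' "$1")
    if [ "$n" -gt "$MAX_SEQS" ]; then
        echo "$1: $n sequences, exceeds $MAX_SEQS; split the file" >&2
        return 1
    fi
    echo "$1: $n sequences, ok"
}

# Usage:
#   check_fasta_count prefix.fsa && tbl2asn -t template.sbt -i prefix.fsa -V v -s
```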
Annotation
- Generating the .tbl format from a tab-delimited format
$ ~dpuiu/bin/tab2annotation.pl -h
# Example:
tab2annotation.pl -ht "SeqId Location Strand Length Product" file.ptt
tab2annotation.pl -hl 1 -SeqId NC_012456 file.ptt
# INPUT
SeqId SeqIdLength OrfId Start End Length Product
1225 2425 002 422 706 285 malate synthase G
# OUTPUT
>Feature 1225
422 706 gene
      locus_tag C1A_1225_002
422 706 CDS
      product malate synthase G
      protein_id gnl|cbcb|C1A_1225_002
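The INPUT/OUTPUT pair above can be approximated with a few lines of awk. This is a sketch, not the actual tab2annotation.pl: the function name tab2tbl is made up, the input is assumed to be tab-delimited with exactly the seven columns shown and one header line, and strand is ignored (the .tbl format writes minus-strand features with start greater than end).

```shell
# tab2tbl LOCUS_PREFIX DB FILE   (hypothetical helper)
# Input columns: SeqId SeqIdLength OrfId Start End Length Product
# Output: 5-column feature table with a gene and a CDS feature per row.
tab2tbl() {
    awk -F '\t' -v prefix="$1" -v db="$2" '
        NR == 1 { next }                             # skip the header line
        NR == 2 || ($1 "") != (seqid "") {           # new sequence: open a feature block
            seqid = $1
            print ">Feature " seqid
        }
        {
            tag = prefix "_" $1 "_" $3               # e.g. C1A_1225_002
            printf "%s\t%s\tgene\n", $4, $5
            printf "\t\t\tlocus_tag\t%s\n", tag
            printf "%s\t%s\tCDS\n", $4, $5
            printf "\t\t\tproduct\t%s\n", $7
            printf "\t\t\tprotein_id\tgnl|%s|%s\n", db, tag
        }
    ' "$3"
}

# Usage (file names are examples):
#   tab2tbl C1A cbcb file.tab > prefix.tbl
```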
AGP
$ scaff2agp.pl < prefix.scaff > prefix.agp
$ infoseq2agp.pl prefix.infoseq > prefix.agp
$ valiadteAgp.pl prefix.agp
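The local AGP scripts are not shown here. As a rough idea of what a basic sanity check involves, the sketch below (function name check_agp is an assumption) verifies three things per object: part numbers count up from 1, each line starts where the previous one ended, and the object span equals the component span (or the gap length on N/U lines). It is not a replacement for NCBI's own AGP validation.

```shell
# check_agp FILE   (hypothetical helper; tab-delimited AGP expected)
# Columns: object, object_beg, object_end, part_number, component_type, then
#   component lines: component_id, component_beg, component_end, orientation
#   gap lines (N/U): gap_length, gap_type, linkage
check_agp() {
    awk -F '\t' '
        /^#/ { next }                                          # skip comment lines
        ($1 "") != (obj "") { obj = $1; part = 0; prev_end = 0 }   # new object
        {
            part++
            span = $3 - $2 + 1
            if ($4 != part) {
                printf "line %d: part_number %s, expected %d\n", NR, $4, part
                bad = 1
            }
            if ($2 != prev_end + 1) {
                printf "line %d: object_beg %s, expected %d\n", NR, $2, prev_end + 1
                bad = 1
            }
            if ($5 == "N" || $5 == "U") clen = $6              # gap line
            else                        clen = $8 - $7 + 1     # component line
            if (span != clen) {
                printf "line %d: object span %d != component/gap length %d\n", NR, span, clen
                bad = 1
            }
            prev_end = $3
        }
        END { exit bad + 0 }
    ' "$1"
}

# Usage:
#   check_agp prefix.agp && echo "prefix.agp looks consistent"
```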
Submission
BankIt
- BankIt
- one or a few sequence submissions
- e-mail the output file to gb_sub@ncbi.nlm.nih.gov (deprecated)
GenomeMacroSend
- GenomeMacroSend : submit *.sqn, *.tbl, *.fsa, *.agp files
Ftp
- for large genome projects
- use the same account as for the AA submission; deposit the files in the uploads/ directory
Updates
TA
Compressed archive containing 3 files: TRACEINFO.xml, MD5, README
traces/ directory
SCF format traces under traces/ or traces/*/
The archive(s) is/are gzip files of 1-4 GB; include the center's name and the date in the file names
Accepted only by uploading to NCBI FTP server.
server: ftp-trace.ncbi.nih.gov
login:
passwd:
center: UMD
Scripts:
/nfshomes/dpuiu/Archives/JCVI/bin/phred2xmlTrace.pl
SRA
server: ftp-trace.ncbi.nlm.nih.gov
login: cbcb_trc
password: t@@GeaYF
Center_name (acronym): CBCB
Full name: Center for Bioinformatics and Computational Biology, University of Maryland
Directory (Short reads): short_read/
Directory (Sanger reads): uploads/
Directory (test): test/ (~30 reads)
Validation table
AA
- !!! Need a TaxId before formatting the files
- AA
- AA submission info
Compressed archive containing 2 files: ASSEMBLY.xml, MD5
Accepted only by uploading to NCBI FTP server.
server: ftp-private.ncbi.nlm.nih.gov
login: cbcb_trc
passwd: t@@GeaYF
center: UMD
description: University of Maryland
ASSEMBLY XML Schema png
ASSEMBLY XML Schema xsd
Use XContig package scripts
Files:
.contig : contigs & underlying reads (use TRACE_NAME's or SEQ_NAME's)
.seq : read sequences (use TRACE_NAME's or SEQ_NAME's)
.qual : read qualities (use TRACE_NAME's or SEQ_NAME's)
.ti2seq_name : (TI , TRACE_NAME or SEQ_NAME) : required if the contig file does not use the read ti's
$ bank2contig -e prefix.bnk > prefix.contig
$ dumpreads -e -r prefix.bnk > prefix.seq
$ dumpreads -e -r -q prefix.bnk > prefix.qual
Example:
Xoo: /fs/szasmg/Bacteria/Xanthomonas/XOO/Xoo_PXO99A/FinalAsm_June2007/AA
Steps:
1. makeConinfo ASSEMBLY.coninfo
$ more ASSEMBLY.coninfo
<coninfo>
  <meta name='center'>UMD</meta>
  <meta name='db'>Xoo</meta>
  <meta name='desc'>Xanthomonas oryzae pv. oryzae strain PXO99A</meta>
  <meta name='object'>ASSEMBLY</meta>
  <meta name='species_code'>Xanthomonas oryzae pv. oryzae strain PXO99A</meta>
  <meta name='structure'>Chromosome</meta>
  <meta name='subtype'>NEW</meta>
  <meta name='taxid'>360094</meta>
  <contig id="1106158952778_stitched" conformation="CIRCULAR" subtype="NEW"/>
  <contig id="... "/>
  <file src="Xoo.contig"/>
  <seq src="Xoo.seq"/>
  <qual src="Xoo.qual"/>
  <idmap src="Xoo.ti2seq_name" direction="FORWARD"/>
</coninfo>
2. buildAssemblyArchive ASSEMBLY.coninfo --prompt --subname umd-20070816-125223
problems:
* submitter_reference="tigr...." : replace tigr with umd
* conformation: always LINEAR : replace LINEAR with CIRCULAR ???
3. validate:
oXygen: software used by NCBI; license required
xmllint: open source
$ xmllint --schema ~/bin/TraceAssembly.xsd umd-*/ASSEMBLY.xml > /dev/null
umd-20070816-125223/ASSEMBLY.xml validates
4. edit files
$ rm *.tar.gz
$ md5sum umd-*/ASSEMBLY.xml
$ edit umd-*/MANIFEST # update ASSEMBLY.xml md5sum
$ ls -1 umd-*
umd-20070816-125223/
1106158952778_stitched_20070817-141849.con       # Contig consensus
1106158952778_stitched_20070817-141849.congap    # Contig gaps
ASSEMBLY.xml                                     # Assembly XML
MANIFEST                                         # MD5 sums
5. create tarball
$ tar czvf umd-20070816-125223.tar.gz umd-20070816-125223/
6. upload tarball
!!! contact trace@ncbi.nlm.nih.gov if login/password error
$ ftp ftp-private.ncbi.nlm.nih.gov
login: cbcb_trc
passwd: t@@GeaYF
$ cd assembly
$ put *.tar.gz
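The manual checksum update in the "edit files" step could be scripted before the tarball is created. The sketch below rests on an unverified assumption: that MANIFEST holds md5sum-style lines ("checksum, two spaces, file name"). Check the real MANIFEST layout first; the function name update_manifest_md5 is made up.

```shell
# update_manifest_md5 DIR   (hypothetical helper)
# ASSUMPTION: DIR/MANIFEST contains md5sum-style lines "<md5>  <file>".
# Recomputes the checksum of DIR/ASSEMBLY.xml and rewrites its MANIFEST line;
# all other lines are left untouched.
update_manifest_md5() {
    dir=$1
    sum=$(cd "$dir" && md5sum ASSEMBLY.xml | awk '{ print $1 }')
    awk -v sum="$sum" '
        $2 == "ASSEMBLY.xml" { print sum "  " $2; next }
        { print }
    ' "$dir/MANIFEST" > "$dir/MANIFEST.new" && mv "$dir/MANIFEST.new" "$dir/MANIFEST"
}

# Usage:
#   update_manifest_md5 umd-20070816-125223
#   (cd umd-20070816-125223 && md5sum -c MANIFEST)   # verify, under the same format assumption
```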