NCBI submission: Difference between revisions

Dpuiu (talk | contribs)
(116 intermediate revisions by the same user not shown)
= WGS/TPA =


== Links ==


* [http://www.ncbi.nlm.nih.gov/BankIt/ BankIt]
* [http://www.ncbi.nlm.nih.gov/Genbank/genomesubmit.html Info]
* [http://www.ncbi.nlm.nih.gov/Genbank/wgs.html wgs]
* [http://www.ncbi.nlm.nih.gov/Genbank/tbl2asn2.html tbl2asn]
* [http://www.ncbi.nlm.nih.gov/Sequin/modifiers.html Sequence modifiers]
* [http://www.ncbi.nlm.nih.gov/genome/assembly/agp/AGP_Specification.shtml AGP format]
* [http://www.ncbi.nlm.nih.gov/Genbank/genomesubmit.html Bacterial Genomes]
* [http://www.ncbi.nlm.nih.gov/Genbank/genomesubmit_annotation.html Annotation]
* [http://www.ncbi.nlm.nih.gov/projects/collab/FT/index.html Annotation Info]
* [http://www.ncbi.nlm.nih.gov/genomes/locustag/Proposal.pdf Locus tag extra info]


* Sequin: standalone application
== Registration ==


* [http://www.ncbi.nlm.nih.gov/Genbank/wgs.html WGS]
* [http://www.ncbi.nlm.nih.gov/genomes/mpfsubmission.cgi Register WGS form]
* CBCB Genome Projects:
** search '''Genome Project''' for '''Center for Bioinformatics and Computational Biology[Sequencing Center]'''
** [http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj&cmd=Retrieve&dopt=Overview&list_uids=28127 Xanthomonas oryzae pv. oryzae PXO99A] complete; /fs/szdata/ncbi/ftp.ncbi.nih.gov/genomes/Bacteria/Xanthomonas_oryzae_PXO99A
** [http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj&cmd=Retrieve&dopt=Overview&list_uids=16740 Xanthomonas oryzae pv. oryzicola BLS256] assembly
 
* [http://www.ncbi.nlm.nih.gov/genomes/static/links.html Sequencing center list]
Output:
* genome project id (5 digit); use it in e-mail correspondence
* locus_tags (3+ letter/digit)
 
* [http://www.ncbi.nlm.nih.gov/genomes/lltp.cgi Locus tag database] to check if the chosen locus tag is available
 
== Requirements ==
 
* ctg's:      no gaps;  .sqn format
* annotation: either for ctg's or superctg's ; .sqn format
* superctg's: AGP format
 
Output:
* 4-letter WGS project_ID : XXXX
* project accession number : XXXX00000000 (4-letter ID followed by 8 0's)
* 1st version: XXXX01000000
* 1st version ctg's: XXXX01000001
* CON record for superctg's
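
The numbering pattern listed above (4-letter ID, then a 2-digit version, then a 6-digit contig number) can be sketched in Python. This only illustrates the pattern as written in these notes; the function name <code>wgs_accession</code> and the prefix <code>ABCD</code> are made up for the example.

```python
def wgs_accession(project_id: str, version: int = 0, contig: int = 0) -> str:
    """Build a WGS accession: 4-letter ID + 2-digit version + 6-digit contig number."""
    if len(project_id) != 4 or not project_id.isalpha():
        raise ValueError("project_id must be exactly 4 letters")
    if not (0 <= version <= 99 and 0 <= contig <= 999999):
        raise ValueError("version or contig number out of range")
    return f"{project_id.upper()}{version:02d}{contig:06d}"


# ABCD is a hypothetical project ID, not a real WGS project.
print(wgs_accession("ABCD"))        # project accession (8 zeros)
print(wgs_accession("ABCD", 1))     # 1st version
print(wgs_accession("ABCD", 1, 1))  # 1st contig of the 1st version
```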
 
== Formatting ==
 
=== Metadata ===
 
* .sbt file generated by Sequin
* [http://www.ncbi.nlm.nih.gov/Sequin/index.html Sequin]
* [http://www.ncbi.nlm.nih.gov/Sequin/QuickGuide/sequin.htm QuickGuide]
 
* multiple sequences
 
  /nfshomes/dpuiu/szdevel/sequin.8.10/sequin
  /nfshomes/dpuiu/szdevel/bin/sequin              # latest version
  !!! import /nfshomes/dpuiu/bin/seqin.sqn
 
  Form
  Submission:
    Immediately ...
    Tentative manuscript title:
  Contact:
    Name:  Daniela Puiu
    Phone: 301.405.3403
    Fax:  301.314.1341
    Email: dpuiu@umiacs.umd.edu
  Authors:
    Daniela Puiu
    Steven L. Salzberg
    ...
  Affiliation
    Institution: University of Maryland,  Center for Bioinformatics and Computational Biology , 3115 Biomolecular Sciences Building #296, College Park, MD 20742 , US
 
  Sequence format
    Batch submission
    FASTA
    Original submission
  ...
  Organism and Sequences:
    Nucleotide: can import from FASTA file
    Organism: strain, moltype
    Proteins
    Annotation
 
Export template => /nfshomes/dpuiu/bin/sequin.sbt : contains submission info
 
=== Tags ===
 
  $ addFastaTags.pl -s " [organism=...] [strain=wPip] [substrain=JBH] [tech=wgs]" prefix.fasta
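
The source of addFastaTags.pl is not shown on this page. A minimal Python sketch of the same idea, assuming the script simply appends the modifier string to every FASTA header line, could look like this. The function name, the contig names and the organism value are placeholders.

```python
import io


def add_fasta_tags(lines, tags):
    """Yield FASTA lines with `tags` appended to every header (">") line."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield f"{line} {tags.strip()}"
        else:
            yield line


# Tiny in-memory FASTA file; the sequences and organism are placeholders.
example = io.StringIO(">ctg1\nACGT\n>ctg2\nGGCC\n")
tags = "[organism=Example organism] [strain=wPip] [substrain=JBH] [tech=wgs]"
for out in add_fasta_tags(example, tags):
    print(out)
```

Sequence lines pass through unchanged, so the output can be fed to tbl2asn as the .fsa file.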
 
=== Annotation ===
 
* Locus tag examples:
  ABC_I00001 for gene 1, chromosome I
  ABC_II00001 for gene 1, chromosome II
  ABC_r1112 for ribosomal RNA genes
  ABC_t1113 for tRNA genes
 
* [http://www.ncbi.nlm.nih.gov/genomes/frameshifts/frameshifts.cgi frameshifts]
 
* Generating the .tbl format from a tab-delimited format
  $ ~dpuiu/bin/tab2annotation.pl -h
 
      # Example:
      tab2annotation.pl -ht "SeqId Location Strand Length Product" prefix.ptt > prefix.tbl
      tab2annotation.pl -hl 1 -SeqId NC_012456 prefix.ptt > prefix.tbl
 
      # INPUT
      SeqId  SeqIdLength    OrfId  Start  End    Length  Product
      1225    2425            002    422    706    285    malate synthase G
 
      # OUTPUT
      >Feature 1225
      422    706    gene
                              locus_tag      C1A_1225_002
      422    706    CDS
                              product malate synthase G
                              protein_id      gnl|cbcb|C1A_1225_002
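
The INPUT to OUTPUT mapping above can be reproduced with a short Python sketch. It is not the tab2annotation.pl implementation; it assumes the locus tag is built as prefix_SeqId_OrfId (as in the example output) and that qualifier lines start with three tabs. The function name <code>row_to_tbl</code> is made up.

```python
def row_to_tbl(row: dict, prefix: str, db: str = "cbcb") -> str:
    """Format one ORF row as a feature-table block (gene + CDS)."""
    locus_tag = f"{prefix}_{row['SeqId']}_{row['OrfId']}"
    start, end = row["Start"], row["End"]
    lines = [
        f">Feature {row['SeqId']}",
        f"{start}\t{end}\tgene",
        f"\t\t\tlocus_tag\t{locus_tag}",
        f"{start}\t{end}\tCDS",
        f"\t\t\tproduct\t{row['Product']}",
        f"\t\t\tprotein_id\tgnl|{db}|{locus_tag}",
    ]
    return "\n".join(lines)


# The INPUT row from the example above.
header = "SeqId SeqIdLength OrfId Start End Length Product".split()
record = "1225\t2425\t002\t422\t706\t285\tmalate synthase G".split("\t")
print(row_to_tbl(dict(zip(header, record)), prefix="C1A"))
```

The sketch ignores strand; reverse-strand features would need their start and end coordinates swapped in the table.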
 
=== Merge ===
* tbl2asn: command line
* [[Media:tbl2asn.txt|tbl2asn man]]
 
  Input files:
    prefix.sbt: submission file
    prefix.fsa: sequence : at most 10,000 sequences/file
    prefix.tbl: annotation
 
  $ tbl2asn -t prefix.sbt -V v -s -p .
  $ tbl2asn -t prefix.sbt -V v -s -i prefix.fasta
 
  $ tbl2asn -t template.sbt -i prefix.fasta -V v -s
 
* Input files: [[Media:sequin.sbt|sequin.sbt]]
  * template (*.sbt)        Example: /nfshomes/dpuiu/bin/sequin.sbt
  comment : is the article name
 
  * FASTA sequence (*.fsa)
    >SeqID [organism=...] [strain=...] [tech=wgs] [chromosome=...][gcode=11]
 
  Adding tags:
  ~dpuiu/bin/addFastaTags.pl -s " [organism=...] [strain=wPip] [substrain=JBH] [tech=wgs]" wPipJBH.fasta
 
  * annotation table (*.tbl) (optional) 
    5-column table
        locus_tag for genes
        protein_id for proteins
        product for proteins
 
* Output files:
  * ASN.1 (*.sqn) for submission to GenBank.
  * .val: validation file; check it for errors
 
=== AGP ===
 
* [http://www.ncbi.nlm.nih.gov/projects/genome/assembly/agp/AGP_Specification.shtml AGP_Specification.shtml]
* Sequence gaps : "fragment yes"
* Scaffold gaps : "contig  no"
 
  $ scaff2agp.pl < prefix.scaff > prefix.agp
  $ infoseq2agp.pl prefix.infoseq > prefix.agp
  $ validateAgp.pl prefix.agp
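
For orientation, here is a hedged Python sketch of the kind of 9-column AGP lines these scripts are expected to write. It is not the code of scaff2agp.pl; it assumes whole contigs in forward orientation (component type W), gaps of known size (type N), and the gap conventions noted above ("fragment yes", "contig no"). The last column is left empty on gap lines, which should be checked against the AGP specification version in use. All identifiers are made up.

```python
def scaffold_to_agp(scaffold_id, parts):
    """Return AGP lines for one scaffold.

    parts: list of ("contig", contig_id, length) or ("gap", gap_type, length).
    """
    rows, pos, part_no = [], 1, 0
    for kind, name, length in parts:
        part_no += 1
        end = pos + length - 1
        if kind == "contig":
            # W = WGS contig, used over its full length, forward strand
            cols = [scaffold_id, pos, end, part_no, "W", name, 1, length, "+"]
        else:
            # Sequence gaps: "fragment yes"; scaffold gaps: "contig no"
            linkage = "yes" if name == "fragment" else "no"
            cols = [scaffold_id, pos, end, part_no, "N", length, name, linkage, ""]
        rows.append("\t".join(str(c) for c in cols))
        pos = end + 1
    return rows


# Hypothetical scaffold: two contigs separated by a 100 bp sequence gap.
example = [("contig", "ctg1", 1000), ("gap", "fragment", 100), ("contig", "ctg2", 500)]
for row in scaffold_to_agp("scf1", example):
    print(row)
```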
 
== Submission  ==
 
=== BankIt ===
 
* [http://www.ncbi.nlm.nih.gov/BankIt/ BankIt]
* one or a few sequence submissions
 
=== Email ===
 
* e-mail the output file to gb_sub@ncbi.nlm.nih.gov (deprecated)
 
=== GenomeMacroSend ===
 
* [http://www.ncbi.nlm.nih.gov/projects/GenomeSubmit/genome_submit.cgi GenomeMacroSend] : submit *.sqn, *.tbl, *.fsa, *.agp files
 
=== Ftp ===
 
* for large WGS projects
server: ftp-trace.ncbi.nlm.nih.gov
login:        cbcb_trc
password:      t@@GeaYF
center:        CBCB
directory:    test/ ;    do not use uploads/, which is used by SRA
 
=== Updates ===
 
* [http://www.ncbi.nlm.nih.gov/Genbank/genomesubmit.html#updating Updating]
* in .sbt file replace "subtype new" with "subtype update"
 
= TA =
 
[http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=overview&m=doc&s=overview TA]
Compressed archive containing
  3 files: TRACEINFO.xml, MD5, README
  traces/ directory
  SCF format traces under traces/ or traces/*/
 
Archives are gzip files of 1-4 GB each; include the center's name and the date in the file names.
Archives are accepted only by upload to the NCBI FTP server.
  server: ftp-trace.ncbi.nih.gov
  login:
  passwd:
  center: UMD
 
Scripts:
  /nfshomes/dpuiu/Archives/JCVI/bin/phred2xmlTrace.pl
 
= Genbank & SRA =
 
  server: ftp-trace.ncbi.nlm.nih.gov
  login:        cbcb_trc
  password:      t@@GeaYF
 
  Center_name (acronym): CBCB
  Full name: Center for Bioinformatics and Computational Biology, University of Maryland
 
  Directory (Short reads):  short_read/
  Directory (Sanger reads): uploads/
  Directory (test):        test/ (Assembled sequences)
 
  [http://www.ncbi.nlm.nih.gov/Traces/field_matrix_current.xls Validation table]
 
= AA =
 
* !!! Need a TaxId before formatting the files
* [http://www.ncbi.nlm.nih.gov/Traces/assembly/assmbrowser.cgi?cmd=show&f=rfc&m=doc&s=rfc AA]
* [http://www.ncbi.nlm.nih.gov/Traces/assembly/assmbrowser.cgi?cmd=show&f=rfc&m=doc&s=rfc#sub AA submission info]


   Compressed archive containing 2 files: ASSEMBLY.xml , MD5  
   Accepted only by uploading to NCBI FTP server.
     server: ftp-private.ncbi.nlm.nih.gov
     login: cbcb_trc
     passwd: t@@GeaYF
     center: UMD   
     description: University of Maryland
   [[Media:ASSEMBLY.xsd|ASSEMBLY XML Schema xsd]]  


Use XContig package scripts
 
Files:
.contig      : contigs & underlying reads (use TRACE_NAME's or SEQ_NAME's)
.seq        : read sequences (use TRACE_NAME's or SEQ_NAME's)
.qual        : read qualities (use TRACE_NAME's or SEQ_NAME's)
.ti2seq_name : (TI , TRACE_NAME or SEQ_NAME) : required if the contig file does not use the read ti's


== Procedure ==
  $ bank2contig -e      prefix.bnk > prefix.contig
  $ dumpreads  -e -r    prefix.bnk > prefix.seq
  $ dumpreads  -e -r -q prefix.bnk > prefix.qual


Example:
  Xoo: /fs/szasmg/Bacteria/Xanthomonas/XOO/Xoo_PXO99A/FinalAsm_June2007/AA


Steps:
  1. makeConinfo ASSEMBLY.coninfo
   $ more ASSEMBLY.coninfo
   <coninfo>
   <meta name='center'>UMD</meta>
   <meta name='db'>Xoo</meta>
   <meta name='desc'>Xanthomonas oryzae pv. oryzae strain PXO99A</meta>
   <meta name='object'>ASSEMBLY</meta>
   <meta name='species_code'>Xanthomonas oryzae pv. oryzae strain PXO99A</meta>
   <meta name='structure'>Chromosome</meta>
   <meta name='subtype'>NEW</meta>
   <meta name='taxid'>360094</meta>
   <contig id="1106158952778_stitched" conformation="CIRCULAR" subtype="NEW"/>
   <contig id="... "/>
   <file src="Xoo.contig"/>
   <seq src="Xoo.seq"/>
   <qual src="Xoo.qual"/>
   <idmap  src="Xoo.ti2seq_name" direction="FORWARD"/>
   </coninfo>


  2. buildAssemblyArchive ASSEMBLY.coninfo --prompt --subname umd-20070816-125223
   problems:
     * submitter_reference="tigr...." : replace tigr with umd
     * conformation: always LINEAR    : replace LINEAR with CIRCULAR ???
     * taxid: not recognized          : replace <taxid>id</taxid> with <organism descriptor="TAXID">id</organism>

  3. validate:
   oXygen: software used by NCBI; license required
   xmllint: open source
   $ xmllint --schema ~/bin/TraceAssembly.xsd umd-*/ASSEMBLY.xml > /dev/null
   umd-20070816-125223/ASSEMBLY.xml validates
  4. edit files
  $ rm *.tar.gz
  $ md5sum umd-*/ASSEMBLY.xml
  $ edit umd-*/MANIFEST        # update ASSEMBLY.xml md5sum
 
  $ ls -1 umd-*
  umd-20070816-125223/
  1106158952778_stitched_20070817-141849.con      # Contig consensus
  1106158952778_stitched_20070817-141849.congap    # Contig gaps
  ASSEMBLY.xml                                    # Assembly XML
  MANIFEST                                        # MD5 sums
  5. create tarball
   $ tar czvf umd-20070816-125223.tar.gz umd-20070816-125223/
  6. upload tarball
    !!! contact trace@ncbi.nlm.nih.gov if login/password error
 
    $ ftp ftp-private.ncbi.nlm.nih.gov
    login: cbcb_trc
    passwd: t@@GeaYF
    $ cd assembly
    $ put *.tar.gz
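
The checksum and tarball steps above can also be scripted. The following Python sketch mirrors <code>md5sum</code> and <code>tar czvf</code> using only the standard library. The MANIFEST layout is not described on this page, so the sketch only prints the checksum and leaves the MANIFEST edit to be done by hand; the helper names are made up.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def md5_of(path):
    """Return the hex MD5 digest of a file, read in 1 MB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_tarball(directory):
    """Create <directory>.tar.gz next to the directory and return its path."""
    directory = Path(directory)
    tarball = directory.parent / f"{directory.name}.tar.gz"
    with tarfile.open(tarball, "w:gz") as archive:
        archive.add(directory, arcname=directory.name)
    return tarball


if __name__ == "__main__":
    # Demo on a throwaway directory; real runs would point at umd-<timestamp>/.
    with tempfile.TemporaryDirectory() as tmp:
        subdir = Path(tmp) / "umd-demo"
        subdir.mkdir()
        (subdir / "ASSEMBLY.xml").write_text("<assembly/>")
        print(md5_of(subdir / "ASSEMBLY.xml"))  # value to copy into MANIFEST
        print(make_tarball(subdir).name)
```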
= dbGSS =
* 4 files:  email to batch-sub@ncbi.nlm.nih.gov
1. Publication
  TYPE: Pub                        #required
  MEDUID: 92347897
  TITLE:                            #required
  Genomic sequences from a subtracted retinal pigment epithelium  library
  AUTHORS:                          #required
  Gieser,L.; Swaroop,A.
  JOURNAL: Genomics
  VOLUME: 13
  ISSUE: 2
  PAGES: 873-6
  YEAR:  1992                      #required
  STATUS: 4                        #required :1=unpublished, 2=submitted, 3=in press, 4=published
  ||
2. Library
  TYPE: Lib                        #required
  NAME:  Rat Lambda Zap Express Library
  ORGANISM: Rattus norvegicus
  STRAIN: Sprague-Dawley
  SEX: male
  STAGE: embryonic day 17 post-fertilization
  TISSUE: aorta
  CELL_TYPE: vascular smooth muscle
  DESCR:
  Put description here.
  ||
3. Contact
  TYPE: Cont
  NAME: Sikela JM
  FAX: 303 270 7097
  TEL: 303 270
  EMAIL: tjs@tally.hsc.colorado.edu
  LAB: Department of Pharmacology
  INST: University of Colorado Health Sciences Center
  ADDR: Box C236, 4200 E. 9th Ave., Denver, CO 80262-0236, USA
  ||
4. GSS sequence file
  TYPE: GSS                            #required
  STATUS:  New                          #required
  CONT_NAME: Sikela JM                  #required
  GSS#: Ayh00001                        #required
  CLONE: HHC189
  SOURCE: ATCC
  SOURCE_INHOST: 65128
  OTHER_GSS:  GSS00093, GSS000101
  CITATION:                            #required
  Genomic sequences from Human brain tissue
  SEQ_PRIMER: M13 Forward
  P_END: 5'
  HIQUAL_START: 1
  HIQUAL_STOP: 285
  DNA_TYPE: Genomic
  CLASS: shotgun                                          #required
  LIBRARY: Hippocampus, Stratagene (cat. #936205)          #required
  PUBLIC:                                                  #required
  PUT_ID: Actin, gamma, skeletal
  COMMENT:
  SEQUENCE:
  AATCAGCCTGCAAGCAAAAGATAGGAATATTCACCTACAGTGGGCACCTCCTTAAGAAGCTG          #required
  ...
  ||
* Matching strings:
  CONT_NAME of GSS file and NAME field of the Contact file
  LIBRARY field of GSS file and NAME field of the Library file
  CITATION field of GSS file and TITLE field of the Publication file
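
The three matching rules above can be checked before e-mailing the files. This is a rough Python sketch, assuming each file uses the "FIELD: value" layout shown above, with "||" closing a record and a value allowed to continue on the following line. The parser and function names are made up, and the example records are shortened versions of the templates above.

```python
import re

FIELD = re.compile(r"^([A-Z_#]+):\s*(.*)$")


def parse_records(text):
    """Parse 'FIELD: value' records terminated by '||' into a list of dicts."""
    records, current, last = [], {}, None
    for raw in text.splitlines():
        line = raw.split("#required")[0].strip()  # drop the '#required' notes
        if line == "||":
            records.append(current)
            current, last = {}, None
            continue
        match = FIELD.match(line)
        if match:
            last = match.group(1)
            current[last] = match.group(2).strip()
        elif line and last:
            # value continued on the next line (e.g. TITLE, CITATION)
            current[last] = (current[last] + " " + line).strip()
    return records


def check_links(gss, contact, library, publication):
    """Return a list of mismatches between a GSS record and the other files."""
    problems = []
    if gss.get("CONT_NAME") != contact.get("NAME"):
        problems.append("CONT_NAME does not match Contact NAME")
    if gss.get("LIBRARY") != library.get("NAME"):
        problems.append("LIBRARY does not match Library NAME")
    if gss.get("CITATION") != publication.get("TITLE"):
        problems.append("CITATION does not match Publication TITLE")
    return problems


pub = "TYPE: Pub\nTITLE:\nGenomic sequences from Human brain tissue\n||\n"
cont = "TYPE: Cont\nNAME: Sikela JM\n||\n"
lib = "TYPE: Lib\nNAME: Rat Lambda Zap Express Library\n||\n"
gss = ("TYPE: GSS\nCONT_NAME: Sikela JM\n"
       "LIBRARY: Hippocampus, Stratagene (cat. #936205)\n"
       "CITATION:\nGenomic sequences from Human brain tissue\n||\n")

print(check_links(parse_records(gss)[0], parse_records(cont)[0],
                  parse_records(lib)[0], parse_records(pub)[0]))
```

In this example the LIBRARY value differs from the Library NAME, so that mismatch is the one reported.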

Latest revision as of 14:25, 1 August 2011

WGS/TPA

Links

Registration

  * search Genome Project for Center for Bioinformatics and Computational Biology[Sequencing Center]
  * Xanthomonas oryzae pv. oryzae PXO99A complete; /fs/szdata/ncbi/ftp.ncbi.nih.gov/genomes/Bacteria/Xanthomonas_oryzae_PXO99A
  * Xanthomonas oryzae pv. oryzicola BLS256 assembly

Output:

  • genome project id (5 digit); use it in e-mail correspondence
  • locus_tags (3+ letter/digit)

Requirements

  • ctg's: no gaps; .sqn format
  • annotation: either for ctg's or superctg's ; .sqn format
  • superctg's: AGP format

Output:

  • 4-letter WGS project_ID : XXXX
  • project accession number : XXXX00000000 (4-letter ID followed by 8 0's)
  • 1st version: XXXX01000000
  • 1st version ctg's: XXXX01000001
  • CON record for superctg's

Formatting

Metadata

  • multiple sequences
 /nfshomes/dpuiu/szdevel/sequin.8.10/sequin
 /nfshomes/dpuiu/szdevel/bin/sequin              # latest version
 !!! import /nfshomes/dpuiu/bin/seqin.sqn
 Form
  Submission:
   Immediately ...
   Tentative manuscript title: 
  Contact:
   Name:  Daniela Puiu
   Phone: 301.405.3403
   Fax:   301.314.1341 
   Email: dpuiu@umiacs.umd.edu
  Authors:
   Daniela Puiu
   Steven L. Salzberg
   ...
  Affiliation
   Institution: University of Maryland,  Center for Bioinformatics and Computational Biology , 3115 Biomolecular Sciences Building #296, College Park, MD 20742 , US
  Sequence format
   Batch submission
   FASTA
   Original submission
  ...
 Organism and Sequences:
   Nucleotide: can import from FASTA file
   Organism: strain, moltype
   Proteins
   Annotation

Export template => /nfshomes/dpuiu/bin/sequin.sbt : contains submission info

Tags

 $ addFastaTags.pl -s " [organism=...] [strain=wPip] [substrain=JBH] [tech=wgs]" prefix.fasta

Annotation

  • Locus tag examples:
 ABC_I00001 for gene 1, chromosome I
 ABC_II00001 for gene 1, chromosome II
 ABC_r1112 for ribosomal RNA genes
 ABC_t1113 for tRNA genes
  • Generating the .tbl format from a tab-delimited format
 $ ~dpuiu/bin/tab2annotation.pl -h
 
     # Example:
     tab2annotation.pl -ht "SeqId Location Strand Length Product" prefix.ptt > prefix.tbl
     tab2annotation.pl -hl 1 -SeqId NC_012456 prefix.ptt > prefix.tbl
 
     # INPUT
     SeqId   SeqIdLength     OrfId   Start   End     Length  Product
     1225    2425            002     422     706     285     malate synthase G
 
     # OUTPUT
     >Feature 1225
     422     706     gene
                             locus_tag       C1A_1225_002
     422     706     CDS
                             product malate synthase G
                             protein_id      gnl|cbcb|C1A_1225_002

Merge

 Input files: 
   prefix.sbt: submission file
   prefix.fsa: sequence : at most 10,000 sequences/file
   prefix.tbl: annotation
 
 $ tbl2asn -t prefix.sbt -V v -s -p . 
 $ tbl2asn -t prefix.sbt -V v -s -i prefix.fasta
 $ tbl2asn -t template.sbt -i prefix.fasta -V v -s
 * template (*.sbt)        Example: /nfshomes/dpuiu/bin/sequin.sbt
  comment : is the article name
 * FASTA sequence (*.fsa) 
    >SeqID [organism=...] [strain=...] [tech=wgs] [chromosome=...][gcode=11]
 
 Adding tags:
  ~dpuiu/bin/addFastaTags.pl -s " [organism=...] [strain=wPip] [substrain=JBH] [tech=wgs]" wPipJBH.fasta
 * annotation table (*.tbl) (optional)  
    5-column table 
       locus_tag for genes 
       protein_id for proteins
       product for proteins 
  • Output files:
 * ASN.1 (*.sqn) for submission to GenBank.
 * .val: validation file; check it for errors

AGP

 $ scaff2agp.pl < prefix.scaff > prefix.agp
 $ infoseq2agp.pl prefix.infoseq > prefix.agp
 $ validateAgp.pl prefix.agp

Submission

BankIt

  • BankIt
  • one or a few sequence submissions

Email

  • e-mail the output file to gb_sub@ncbi.nlm.nih.gov (deprecated)

GenomeMacroSend

Ftp

  • for large WGS projects
server: ftp-trace.ncbi.nlm.nih.gov
login:         cbcb_trc
password:      t@@GeaYF
center:        CBCB
directory:     test/ ;     do not use uploads/, which is used by SRA

Updates

  • Updating
  • in .sbt file replace "subtype new" with "subtype update"

TA

TA

Compressed archive containing 
  3 files: TRACEINFO.xml, MD5, README
  traces/ directory
  SCF format traces under traces/ or traces/*/
 
Archives are gzip files of 1-4 GB each; include the center's name and the date in the file names.
Archives are accepted only by upload to the NCBI FTP server.
  server: ftp-trace.ncbi.nih.gov
  login: 
  passwd: 
  center: UMD

Scripts:

 /nfshomes/dpuiu/Archives/JCVI/bin/phred2xmlTrace.pl

Genbank & SRA

 server: ftp-trace.ncbi.nlm.nih.gov
 login:         cbcb_trc
 password:      t@@GeaYF
 
 Center_name (acronym): CBCB 
 Full name: Center for Bioinformatics and Computational Biology, University of Maryland
 
 Directory (Short reads):  short_read/ 
 Directory (Sanger reads): uploads/
 Directory (test):         test/ (Assembled sequences)
 
 Validation table

AA

 Compressed archive containing 2 files: ASSEMBLY.xml , MD5 
 Accepted only by uploading to NCBI FTP server.
   server: ftp-private.ncbi.nlm.nih.gov
   login: cbcb_trc
   passwd: t@@GeaYF
   center: UMD   
   description: University of Maryland
 ASSEMBLY XML Schema png 
 ASSEMBLY XML Schema xsd 

Use XContig package scripts

Files:

.contig      : contigs & underlying reads (use TRACE_NAME's or SEQ_NAME's) 
.seq         : read sequences (use TRACE_NAME's or SEQ_NAME's) 
.qual        : read qualities (use TRACE_NAME's or SEQ_NAME's) 
.ti2seq_name : (TI , TRACE_NAME or SEQ_NAME) : required if the contig file does not use the read ti's
 $ bank2contig -e       prefix.bnk > prefix.contig
 $ dumpreads   -e -r    prefix.bnk > prefix.seq
 $ dumpreads   -e -r -q prefix.bnk > prefix.qual

Example:

Xoo: /fs/szasmg/Bacteria/Xanthomonas/XOO/Xoo_PXO99A/FinalAsm_June2007/AA

Steps:

 1. makeConinfo ASSEMBLY.coninfo
 $ more ASSEMBLY.coninfo
 <coninfo>
 <meta name='center'>UMD</meta>
 <meta name='db'>Xoo</meta>
 <meta name='desc'>Xanthomonas oryzae pv. oryzae strain PXO99A</meta>
 <meta name='object'>ASSEMBLY</meta>
 <meta name='species_code'>Xanthomonas oryzae pv. oryzae strain PXO99A</meta>
 <meta name='structure'>Chromosome</meta>
 <meta name='subtype'>NEW</meta>
 <meta name='taxid'>360094</meta>
 <contig id="1106158952778_stitched" conformation="CIRCULAR" subtype="NEW"/>
 <contig id="... "/>
 <file src="Xoo.contig"/>
 <seq src="Xoo.seq"/>
 <qual src="Xoo.qual"/>
 <idmap  src="Xoo.ti2seq_name" direction="FORWARD"/>
 </coninfo>
 2. buildAssemblyArchive ASSEMBLY.coninfo --prompt --subname umd-20070816-125223
 problems:
    * submitter_reference="tigr...." : replace tigr with umd
    * conformation: always LINEAR    : replace LINEAR with CIRCULAR ???
 3. validate:
 oXygen: software used by NCBI; license required
 xmllint: open source
 $ xmllint --schema ~/bin/TraceAssembly.xsd umd-*/ASSEMBLY.xml > /dev/null
 umd-20070816-125223/ASSEMBLY.xml validates
 4. edit files
 $ rm *.tar.gz
 $ md5sum umd-*/ASSEMBLY.xml
 $ edit umd-*/MANIFEST         # update ASSEMBLY.xml md5sum 
 
 $ ls -1 umd-*
 umd-20070816-125223/
  1106158952778_stitched_20070817-141849.con       # Contig consensus
  1106158952778_stitched_20070817-141849.congap    # Contig gaps
  ASSEMBLY.xml                                     # Assembly XML
  MANIFEST                                         # MD5 sums
 5. create tarball
 $ tar czvf umd-20070816-125223.tar.gz umd-20070816-125223/
 6. upload tarball
   !!! contact trace@ncbi.nlm.nih.gov if login/password error
 
   $ ftp ftp-private.ncbi.nlm.nih.gov
   login: cbcb_trc
   passwd: t@@GeaYF
   $ cd assembly
   $ put *.tar.gz

dbGSS

  • 4 files: email to batch-sub@ncbi.nlm.nih.gov

1. Publication

 TYPE: Pub                         #required
 MEDUID: 92347897
 TITLE:                            #required
 Genomic sequences from a subtracted retinal pigment epithelium   library
 AUTHORS:                          #required
 Gieser,L.; Swaroop,A.
 JOURNAL: Genomics
 VOLUME: 13
 ISSUE: 2
 PAGES: 873-6
 YEAR:  1992                       #required
 STATUS: 4                         #required :1=unpublished, 2=submitted, 3=in press, 4=published
 ||

2. Library

 TYPE: Lib                         #required
 NAME:  Rat Lambda Zap Express Library
 ORGANISM: Rattus norvegicus
 STRAIN: Sprague-Dawley
 SEX: male
 STAGE: embryonic day 17 post-fertilization
 TISSUE: aorta
 CELL_TYPE: vascular smooth muscle
 DESCR: 
 Put description here.
 ||

3. Contact

 TYPE: Cont
 NAME: Sikela JM
 FAX: 303 270 7097
 TEL: 303 270 
 EMAIL: tjs@tally.hsc.colorado.edu
 LAB: Department of Pharmacology
 INST: University of Colorado Health Sciences Center
 ADDR: Box C236, 4200 E. 9th Ave., Denver, CO 80262-0236, USA
 ||

4. GSS sequence file

 TYPE: GSS                             #required
 STATUS:  New                          #required
 CONT_NAME: Sikela JM                  #required
 GSS#: Ayh00001                        #required
 CLONE: HHC189
 SOURCE: ATCC
 SOURCE_INHOST: 65128
 OTHER_GSS:  GSS00093, GSS000101
 CITATION:                             #required
 Genomic sequences from Human brain tissue
 SEQ_PRIMER: M13 Forward
 P_END: 5'
 HIQUAL_START: 1
 HIQUAL_STOP: 285
 DNA_TYPE: Genomic
 CLASS: shotgun                                           #required
 LIBRARY: Hippocampus, Stratagene (cat. #936205)          #required
 PUBLIC:                                                  #required
 PUT_ID: Actin, gamma, skeletal
 COMMENT:
 SEQUENCE:
 AATCAGCCTGCAAGCAAAAGATAGGAATATTCACCTACAGTGGGCACCTCCTTAAGAAGCTG           #required
 ...
 ||
  • Matching strings:
 CONT_NAME of GSS file and NAME field of the Contact file
 LIBRARY field of GSS file and NAME field of the Library file
 CITATION field of GSS file and TITLE field of the Publication file