Bos taurus: Difference between revisions

Line 160:
* Q: Which vectors are the most frequent?
* A: align UMD2 vector contaminants to UniVec_Core:
** 28051 out of 33540 align : [[Media:UniVec_Core-UMD2.vector.ref_hits|UniVec_Core-UMD2.vector.ref_hits]]
** pBR322, Ecoli_lac*, pSacBII ...
** pBACe3.6 (U80929.2:11415-11517): CONTAINED
** 5489 out of 33540 don't align  (4373 are 100+bp, 11 are 500+bp) [[Media:Bos_taurus.UMD2.vector.no_UniVec_Core.no_Ecoli.all.blastn_hits|Bos_taurus.UMD2.vector.no_UniVec_Core.no_Ecoli.all.blastn_hits]]

Revision as of 14:44, 5 March 2009

NCBI Traces

 SPECIES_CODE = "BOS TAURUS"                                                          37,788,710 traces
 SPECIES_CODE = "BOS TAURUS" and CENTER_NAME = "BCM"                                  35,596,825 traces
 
 SPECIES_CODE = "BOS TAURUS" and CENTER_NAME = "BCM" and TRACE_TYPE_CODE = "WGS"      24,863,627 traces
 SPECIES_CODE = "BOS TAURUS" and CENTER_NAME = "BCM" and TRACE_TYPE_CODE = "SHOTGUN"  10,716,306 traces
 SPECIES_CODE = "BOS TAURUS" and CENTER_NAME = "BCM" and TRACE_TYPE_CODE = "CLONEEND"     16,892 traces

BCM Assembly

         #elem   min             max             mean            median          n50             sum
 contig  131620  91              326010          20755           10365           44270           2731814362
 placed  101579  91              250125          24286           13928           47485           2466971326
 chrom   30      44060403        161106243       87813777        84419198        106383598       2634413324
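
The columns used in the summary tables on this page (#elem, min, max, mean, median, n50, sum) can be reproduced from a plain list of sequence lengths. The following is a minimal Python sketch, not the script that produced the tables. The function name `length_stats` is hypothetical, and it assumes the common N50 definition (the length L such that sequences of length L or longer hold at least half of the total bases), a truncated integer mean, and a middle-element median; the tool actually used may differ on any of these.

```python
def length_stats(lengths):
    """Return (#elem, min, max, mean, median, n50, sum) for a list of lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)  # longest first
    total = sum(ordered)

    # N50: walk from the longest sequence down until half the bases are covered.
    running = 0
    n50 = ordered[-1]
    for length in ordered:
        running += length
        if 2 * running >= total:
            n50 = length
            break

    median = ordered[len(ordered) // 2]  # assumption: middle element, no averaging
    mean = total // len(ordered)         # assumption: truncated integer mean
    return (len(ordered), ordered[-1], ordered[0], mean, median, n50, total)


if __name__ == "__main__":
    # Toy input only; real input would be the contig lengths of an assembly.
    print(length_stats([2, 3, 4, 5, 6, 10]))
```

The truncated mean is consistent with the rows above (for example, 2731814362 / 131620 truncates to 20755), but that alone does not confirm how the original numbers were computed.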

UMD2 Assembly

  • Used UMDoverlapper to trim reads
          #elem       min     max     mean    median  n50     sum
 reads    35237868    68      1418    778     840     864     27406137041
  • Library reestimates:
 paste *dst *mdi | perl -ane 'print "  $_" if($F[1]!=$F[4]);'
 35237870      150000  50000   35237870        162386  21158
 35237871      2496    1431    35237871        2409    1137
 35237873      3001    829     35237873        2973    770
 35237875      150001  50000   35237875        172955  43831
 35237876      1629    282     35237876        1595    245
 35237877      3063    1326    35237877        3193    1131
 35237878      6756    836     35237878        6701    793
 35237879      2569    293     35237879        2547    285
 35237880      150002  50000   35237880        160984  26638
 35237881      2749    446     35237881        2697    325
 35237883      3593    1213    35237883        3463    1232
 35237884      3165    700     35237884        3172    699
 35237885      3812    533     35237885        3804    537
 35237886      2754    1432    35237886        2701    1289
 35237887      4977    693     35237887        4968    694
 35237889      2710    1529    35237889        2566    1225
 35237890      150003  50000   35237890        161995  26438
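
The perl one-liner above keeps only the pasted lines whose second and fifth fields differ. A rough Python equivalent is sketched below. It assumes the six columns are library id, mean and standard deviation from the first file, followed by the same three fields from the second file, and that the first triple is the original estimate (the round values such as 150000/50000 suggest this, but the page does not say so). The function name `changed_libraries` is hypothetical.

```python
def changed_libraries(pasted_lines):
    """Yield (library_id, old_mean, old_sd, new_mean, new_sd) where the mean changed."""
    for line in pasted_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        lib_id = fields[0]
        old_mean, old_sd = int(fields[1]), int(fields[2])
        new_mean, new_sd = int(fields[4]), int(fields[5])
        if old_mean != new_mean:  # same test as $F[1] != $F[4] in the one-liner
            yield lib_id, old_mean, old_sd, new_mean, new_sd


if __name__ == "__main__":
    # Two rows copied from the listing above.
    rows = [
        "35237870      150000  50000   35237870        162386  21158",
        "35237871      2496    1431    35237871        2409    1137",
    ]
    for lib_id, old_mean, old_sd, new_mean, new_sd in changed_libraries(rows):
        change = 100.0 * (new_mean - old_mean) / old_mean
        print(f"{lib_id}\t{old_mean}\t{new_mean}\t{change:+.1f}%")
```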

Submission

  • Title: "A whole-genome assembly of the cow, Bos taurus"
  • Authors:
 Steven Salzberg
 Aleksey Zimin
 Arthur Delcher
 Liliana Florea
 David Kelley
 Finian Hanrahan
 Guillaume Marcais
 Geo Pertea
 Michael Roberts
 Michael Schatz
 Curt Van Tassell
 James Yorke
 Poorani S.
  • Assembler:
 Celera Assembler and UMD Overlapper.
  • Sequencing Center :
 Baylor College of Medicine. 
  • Source of DNA used for sequencing:

The source of the BAC library DNA was Hereford bull L1 Domino 99375, registration number 41170496. Dr. Michael MacNeil's laboratory, USDA-ARS, Miles City, MT provided the blood. The DNA for the whole genome shotgun sequences was provided by Dr. Timothy Smith's laboratory, U.S. Meat Animal Research Center, Clay Center, NE from white blood cells from L1 Dominette 01449, American Hereford Association registration number 42190680 (a daughter of L1 Domino 99375). A skin cell fibroblast cell line from the same animal is available from Dr. Carol Chitko-McKown's laboratory, although there is no sequence from that cell line.

  • Sequence modifiers:
 [organism=Bos taurus][breed=Hereford][tech=wgs][chromosome=...]
  • Submission: Use sequin
 /nfshomes/dpuiu/szdevel/sequin.8.10/sequin
  • Sequence:
 Contig length summary:
           #seqs   min     max     mean    median  n50     sum
 all       210657  71      840370  13709   1523    78511   2887902366
 placed    75775   88      840370  34512   13416   88287   2615171268
 unplaced  134882  71      166670  2022    1322    1742    272731098


Duplicates

 deg0003136509,7180003440308 : both unplaced
 deg0003084562,7180002954167 : both unplaced

Contaminants

NCBI (1st batch)

Through "Foreign Contamination Screen" : http://www.ncbi.nlm.nih.gov/projects/WGS/screens/DAAA01_120508/

  • list.exclude_contigs 4,813 ctgs (3,939 vector + 394 Ecoli + 452 other) (73 mito, 43 deg, 8 Acinetobacter baumannii)
  • list.trim_contigs 19,049 ctgs (18,336 vector + 289 Ecoli + 397 other)

Steven's search against Ecoli MG1655

  • 12/11/2008 : Found 121 ctgs that align to Ecoli
 /fs/szasmg3/bos_taurus/Bos_taurus_UMD_2.0/salzberg/Eco-vs-cow.mum 
 /fs/szasmg3/bos_taurus/Bos_taurus_UMD_2.0/salzberg/EcoK12.fna 

NCBI (2nd batch)

Overall

  • Counts
 Bos_taurus.UMD2.exclude.count
 Bos_taurus.UMD2.trim.count

Summary

                             #ctgs   min     max     mean    median  n50     total_bp
 UMD_Freeze2.0_contam        210657  71      840370  13709   1523    78508   2887902366
 UMD_Freeze2.0               187683  71      840370  15225   1609    79580   2857554998
 difference                   22974

Contaminant region summary:

            elem       min        max        mean       med        n50        sum
 exclude    4817       316        16661      1510       1485       1514       7276894
 trim       30325      48         2479       354        319        446        10745455
 all        35142      48         16661      512        362        674        18022349
 
 Ecoli      746        54         16661      1125       1111       1264       839899          
 vector     33540      49         3128       487        346        603        16340397       
 other      910        53         13090      1006       1037       1329       916117

Exclude sequences (examples):

 7180003318605   16661   Escherichia coli str. K12 substr  # DH10B
 7180003320028   13090   Acinetobacter baumannii
 7180003316967   7473    Escherichia coli str. K12 substr  # DH10B
 7180003313366   7098    Acinetobacter baumannii
 7180003195772   4993    Serratia marcescens
 7180003288790   4668    Klebsiella pneumoniae
 7180003310064   4371    Escherichia coli                  # all 3
 7180003262150   4275    Escherichia coli
 7180003288789   3565    Serratia marcescens
 join100003627   3563    Acinetobacter baumannii
 ...
 7180003289260   3128    Escherichia coli or vector  
 7180003292886   2957    vector
 7180003166540   2081    mitochondrion
 7180003310112   1977    contaminants
 7180002995790   1711    bacterial insertion sequence
 7180003221530   1647    Bacillus cereus ATCC 10987
 7180003259696   1597    Pseudomonas aeruginosa PAO1
 7180003239826   1378    Macaca mulatta

Problems:

1: Vectors


2: Ecoli

  • There are 22 Ecoli strains * 3 Ecoli K12 substrains
  • MG1655 is the 1st one completed; DH10B and W3110 have been recently (?) completed
  • Contain unique seqs as long as 28K
 cat /fs/szdata/genomes/ncbi/Bacteria/Escherichia_coli_K_12*/*fna | infoseq -description
 NC_010473.1    4686137 50.78  Escherichia coli str. K-12 substr. DH10B, complete genome
 NC_000913.2    4639675 50.79  Escherichia coli str. K-12 substr. MG1655, complete genome
 AC_000091.1    4646332 50.80  Escherichia coli str. K-12 substr. W3110, complete genome
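
The `infoseq` listing above shows, per record, the accession, length, %GC and description. When EMBOSS is not at hand, a similar table can be approximated with a short Python sketch like the one below. The function names `fasta_summary` and `_summarize` are hypothetical, and the GC percentage here counts only G and C over all residues, which may not match how `infoseq` treats ambiguity codes.

```python
def _summarize(name, description, chunks):
    """Build one (name, length, gc_percent, description) tuple."""
    seq = "".join(chunks).upper()
    gc = seq.count("G") + seq.count("C")
    pct = round(100.0 * gc / len(seq), 2) if seq else 0.0
    return name, len(seq), pct, description


def fasta_summary(lines):
    """Yield (name, length, gc_percent, description) for each FASTA record."""
    name, description, chunks = None, "", []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield _summarize(name, description, chunks)
            header = line[1:].split(None, 1)
            name = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield _summarize(name, description, chunks)


if __name__ == "__main__":
    # Toy records only; pass an open FASTA file handle for real data.
    demo = [">seq1 test record", "GGCC", "AATT", ">seq2", "ACGT"]
    for record in fasta_summary(demo):
        print(*record, sep="\t")
```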

3: Other

  • Acinetobacter baumannii : there are 6 completed strains


  • Files:
 /fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/bos_taurus.UMD2.exclude.count     (4813 exclude sequence counts)
 /fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/bos_taurus.UMD2.contaminant.list  (35142 contaminant sequences: exclude+trim) 
 /fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/bos_taurus.UMD2.contaminant.fasta
 /fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/bos_taurus.UMD2.contaminant.infoseq  35142
 /fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/bos_taurus.UMD2.Ecoli.infoseq        746
 /fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/bos_taurus.UMD2.vector.infoseq       33540
 /fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/odd-contaminants.infoseq             14


  • Dirs:
 /fs/szasmg3/dpuiu/bos_taurus/submission/decontam

Notes:

  • The 4,814 ctgs were aligned to UniVec (-c 20 ; delta-filter -q)
    • 4,247 ctgs aligned to 89 vector seqs
    • top ref hits:
 gnl|uv|J01749.1   Cloning vector pBR322
 gnl|uv|J01636.1   E.coli lactose operon with lacI, lacZ, lacY and lacA genes
 gnl|uv|AF102576.1 Cloning vector pSOS
 gnl|uv|L08959.1   pUC8 cloning vector 
 gnl|uv|L08931.1   pMAC7-8 cloning vector for site-directed mutagenesis
 gnl|uv|L09145.1   pUR222 cloning vector
 gnl|uv|U47102.2   Cloning vector pALTER<R>-Ex1
 ...
  • The 4,814 ctgs were aligned to EcoliK12
    • 4,299 aligned 200bp+ to Ecoli
    • 3,877 aligned 100% to region 365521_365744 (224bp)
 >NC_000913.2_365521_365744 Escherichia coli K12, complete genome
 CATGGTCATAGCTGTTTCCTGTGTGAAATTGTTATCCGCTCACAATTCCACACAACATAC
 GAGCCGGAAGCATAAAGTGTAAAGCCTGGGGTGCCTAATGAGTGAGCTAACTCACATTAA
 TTGCGTTGCGCTCACTGCCCGCTTTCCAGTCGGGAAACCTGTCGTGCCAGCTGCATTAAT
 GAATCGGCCAACGCGCGGGGAGAGGCGGTTTGCGTATTGGGCGC
  • The 4,814 ctgs were assembled based on 11,912 reads
 BCM          SHOTGUN    11247 (out of 10M reads)
 BCM          WGS        415   (out of 24M reads)
 NISC         SHOTGUN    213   (out of 0.7M reads)
 BARC         CLONEEND   14    ...
 BCM          CLONEEND   11
 BCCAGSC      CLONEEND   7
 TIGR         CLONEEND   4
 TIGR_JCVIJTC CLONEEND   1
  • Avg UMD clipping range of the 11,912 reads is 840bp (vs 778 avg for the 35.2M assembled reads)
  • Other: /fs/ftp-cbcb/pub/data/Bos_taurus/Bos_taurus_UMD_2.0/odd-contaminants.fa
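
The read breakdown in the notes above can be turned into a rate per million reads for the BCM trace types, using the trace counts from the "NCBI Traces" section as denominators. This is only a sketch: the Trace Archive totals are not necessarily the number of reads that entered the assembly, and the names `contaminant_reads`, `total_traces` and `per_million` are hypothetical. Centers without an exact total on this page are left out of the rate calculation.

```python
# Reads underlying the excluded contigs, by (center, trace type), from the notes.
contaminant_reads = {
    ("BCM", "SHOTGUN"): 11247,
    ("BCM", "WGS"): 415,
    ("NISC", "SHOTGUN"): 213,
    ("BARC", "CLONEEND"): 14,
    ("BCM", "CLONEEND"): 11,
    ("BCCAGSC", "CLONEEND"): 7,
    ("TIGR", "CLONEEND"): 4,
    ("TIGR_JCVIJTC", "CLONEEND"): 1,
}

# Trace Archive totals from the "NCBI Traces" section (assumed denominators).
total_traces = {
    ("BCM", "WGS"): 24863627,
    ("BCM", "SHOTGUN"): 10716306,
    ("BCM", "CLONEEND"): 16892,
}


def per_million(count, total):
    """Return how many reads per million total reads fall in the contaminant set."""
    return 1e6 * count / total


if __name__ == "__main__":
    # The per-category counts add up to the 11,912 reads quoted in the notes.
    print("total contaminant reads:", sum(contaminant_reads.values()))
    for key, total in sorted(total_traces.items()):
        rate = per_million(contaminant_reads[key], total)
        print(f"{key[0]}\t{key[1]}\t{rate:.1f} per million traces")
```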

Local files

  • Freeze dir files
 /fs/szasmg3/bos_taurus/Bos_taurus_UMD_2.0/contigs.unplaced.fa  : sequences
 /fs/szasmg3/bos_taurus/Bos_taurus_UMD_2.0/bos_taurus.agp       : all scaffolds
 /fs/ftp-cbcb/pub/data/assembly/Bos_taurus/Bos_taurus_UMD_2.0
  • Files uploaded
 Ftp server: ftp-private.ncbi.nlm.nih.gov
 Account: cbcb_trc
 Dir: uploads/
 Local files: /fs/szasmg3/dpuiu/bos_taurus/submission/ftp/   : 22 *sqn + 1 agp