Bos taurus

From Cbcb
Revision as of 14:00, 28 April 2009 by Dpuiu (talk | contribs) (→‎Papers)

Papers

NCBI Traces

 SPECIES_CODE = "BOS TAURUS"                                                          37,788,710 traces
 SPECIES_CODE = "BOS TAURUS" and CENTER_NAME = "BCM"                                  35,596,825 traces
 
 SPECIES_CODE = "BOS TAURUS" and CENTER_NAME = "BCM" and TRACE_TYPE_CODE = "WGS"      24,863,627 traces
 SPECIES_CODE = "BOS TAURUS" and CENTER_NAME = "BCM" and TRACE_TYPE_CODE = "SHOTGUN"  10,716,306 traces
 SPECIES_CODE = "BOS TAURUS" and CENTER_NAME = "BCM" and TRACE_TYPE_CODE = "CLONEEND"     16,892 traces

BCM Assembly

         #elem   min             max             mean            median          n50             sum
 contig  131620  91              326010          20755           10365           44270           2731814362
 placed  101579  91              250125          24286           13928           47485           2466971326
 chrom   30      44060403        161106243       87813777        84419198        106383598       2634413324
  • 6 finished BACs (NCBI accession numbers):
 gi|171461043,
 gi|171461042,
 gi|171461041,
 gi|171461040,
 gi|171461039,
 gi|167744683

UMD2 Assembly

  • Used UMDoverlapper to trim reads
          #elem       min     max     mean    median  n50     sum
 reads    35237868    68      1418    778     840     864     27406137041
  • Library re-estimates:
 paste *dst *mdi | perl -ane 'print "  $_" if($F[1]!=$F[4]);'
 35237870      150000  50000   35237870        162386  21158
 35237871      2496    1431    35237871        2409    1137
 35237873      3001    829     35237873        2973    770
 35237875      150001  50000   35237875        172955  43831
 35237876      1629    282     35237876        1595    245
 35237877      3063    1326    35237877        3193    1131
 35237878      6756    836     35237878        6701    793
 35237879      2569    293     35237879        2547    285
 35237880      150002  50000   35237880        160984  26638
 35237881      2749    446     35237881        2697    325
 35237883      3593    1213    35237883        3463    1232
 35237884      3165    700     35237884        3172    699
 35237885      3812    533     35237885        3804    537
 35237886      2754    1432    35237886        2701    1289
 35237887      4977    693     35237887        4968    694
 35237889      2710    1529    35237889        2566    1225
 35237890      150003  50000   35237890        161995  26438
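The filter above keeps only the libraries whose mean differs between the two pasted files (columns 2 and 5). A minimal Python sketch of the same test follows; it also reports the relative change. It assumes six whitespace-separated fields per row (id, mean, stdev from the `*dst` file, then id, mean, stdev from the `*mdi` file) with integer means. The function name `changed_libraries` is illustrative and not part of the original pipeline.

```python
from typing import Iterable, List, Tuple


def changed_libraries(rows: Iterable[str]) -> List[Tuple[str, int, int, float]]:
    """Return (library_id, dst_mean, mdi_mean, percent_change) for changed rows.

    Assumed row layout (as produced by `paste *dst *mdi`):
    id mean stdev id mean stdev
    """
    changed = []
    for row in rows:
        fields = row.split()
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        lib_id = fields[0]
        dst_mean = int(fields[1])
        mdi_mean = int(fields[4])
        if dst_mean != mdi_mean:  # same test as $F[1] != $F[4]
            pct = 100.0 * (mdi_mean - dst_mean) / dst_mean
            changed.append((lib_id, dst_mean, mdi_mean, round(pct, 1)))
    return changed


# Two rows copied from the listing above.
rows = [
    "35237870      150000  50000   35237870        162386  21158",
    "35237871      2496    1431    35237871        2409    1137",
]
for lib_id, dst_mean, mdi_mean, pct in changed_libraries(rows):
    print(lib_id, dst_mean, mdi_mean, f"{pct:+.1f}%")
```

A `dst` mean of zero would raise a division error; the sketch does not guard against that case.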
  • AGP
  Chr   #Ctgs
  1     4617
  2     3468
  3     3260
  4     3032
  5     3103
  6     3570
  7     3049
  8     2954
  9     2584
  10    2712
  11    2778
  12    2673
  13    2147
  14    2549
  15    2655
  16    2547
  17    1885
  18    2195
  19    1809
  20    2146
  21    1967
  22    1628
  23    1434
  24    1380
  25    1274
  26    1668
  27    1381
  28    1186
  29    1803
  X     4883
  U     113346
 Placed contig orientation:
 -       33990
 +       32686
 0       7661
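
Per-chromosome contig counts and orientation tallies like the ones above could be derived from an AGP file roughly as follows. This is a sketch under assumptions: it expects standard tab-separated AGP lines (object in column 1, component type in column 5, orientation in column 9) and skips gap lines of type `N` or `U`. The object and contig names in the example are made up, and the function name `tally_agp` is illustrative.

```python
from collections import Counter
from typing import Iterable, Tuple


def tally_agp(lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count component (contig) lines per object and per orientation."""
    per_object: Counter = Counter()
    per_orientation: Counter = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and AGP comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete AGP record
        if cols[4] in ("N", "U"):
            continue  # gap lines carry no contig
        per_object[cols[0]] += 1
        per_orientation[cols[8]] += 1
    return per_object, per_orientation


# Hypothetical AGP records, for illustration only.
example = [
    "Chr1\t1\t5000\t1\tW\tctgA\t1\t5000\t+",
    "Chr1\t5001\t5100\t2\tN\t100\tfragment\tyes\t",
    "Chr1\t5101\t9000\t3\tW\tctgB\t1\t3900\t-",
    "ChrX\t1\t2000\t1\tW\tctgC\t1\t2000\t0",
]
by_object, by_orientation = tally_agp(example)
print(dict(by_object))
print(dict(by_orientation))
```

To reproduce a placed-only orientation table, the unplaced object would have to be filtered out first; its name in the real AGP file is not stated here.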

Submission

DAAA00000000, DAAA01000000.


  • Title: "A whole-genome assembly of the cow, Bos taurus"
  • Authors:
 Steven Salzberg
 Aleksey Zimin
 Arthur Delcher
 Liliana Florea
 David Kelley
 Finian Hanrahan
 Guillaume Marcais
 Geo Pertea
 Michael Roberts
 Michael Schatz
 Curt Van Tassell
 James Yorke
 Poorani S.
  • Assembler:
 Celera Assembler and UMD Overlapper.
  • Sequencing Center:
 Baylor College of Medicine.
  • Source of DNA used for sequencing:

The source of the BAC library DNA was Hereford bull L1 Domino 99375, registration number 41170496. Dr. Michael MacNeil's laboratory, USDA-ARS, Miles City, MT provided the blood. The DNA for the whole genome shotgun sequences was provided by Dr. Timothy Smith's laboratory, U.S. Meat Animal Research Center, Clay Center, NE from white blood cells from L1 Dominette 01449, American Hereford Association registration number 42190680 (a daughter of L1 Domino 99375). A skin cell fibroblast cell line from the same animal is available from Dr. Carol Chitko-McKown's laboratory, although there is no sequence from that cell line.

  • Sequence modifiers:
 [organism=Bos taurus][breed=Hereford][tech=wgs][chromosome=...]
  • Submission: use Sequin
 /nfshomes/dpuiu/szdevel/sequin.8.10/sequin
  • Sequence:
 Contig length summary:
           #seqs   min     max     mean    median  n50     sum
 all       210657  71      840370  13709   1523    78511   2887902366
 placed    75775   88      840370  34512   13416   88287   2615171268
 unplaced  134882  71      166670  2022    1322    1742    272731098
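
The summary columns used in the tables on this page (#seqs, min, max, mean, median, n50, sum) can be computed from a list of sequence lengths as sketched below. The tool that produced the tables is not named here, so the N50 convention is an assumption: the length of the sequence at which the running total, taken from longest to shortest, first reaches half of the sum. The mean is rounded to the nearest integer, which matches the unplaced row above (272731098 / 134882 is about 2021.998, reported as 2022). The function name `length_summary` is illustrative.

```python
from statistics import median
from typing import Dict, Sequence


def length_summary(lengths: Sequence[int]) -> Dict[str, int]:
    """Summarize sequence lengths; raises on an empty input."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)

    # N50 (assumed convention): walk from the longest sequence down until
    # the running total covers at least half of all bases.
    running = 0
    n50 = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            n50 = length
            break

    return {
        "#seqs": len(ordered),
        "min": ordered[-1],
        "max": ordered[0],
        "mean": round(total / len(ordered)),
        "median": int(median(ordered)),
        "n50": n50,
        "sum": total,
    }


print(length_summary([2, 3, 4, 5, 6]))
```

For the toy input, the sum is 20, and the running total reaches 10 at the second-longest sequence (6 + 5 = 11), so the N50 is 5.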


Duplicates

 deg0003136509,7180003440308 : both unplaced
 deg0003084562,7180002954167 : both unplaced

Contaminants

NCBI (1st batch)

Through "Foreign Contamination Screen": http://www.ncbi.nlm.nih.gov/projects/WGS/screens/DAAA01_120508/

  • list.exclude_contigs 4,813 ctgs (3,939 vector + 394 Ecoli + 452 other) (73 mito, 43 deg, 8 Acinetobacter baumannii)
  • list.trim_contigs 19,049 ctgs (18,336 vector + 289 Ecoli + 397 other)

Steven's search against Ecoli MG1655

  • 12/11/2008 : Found 121 ctgs that align to Ecoli
 /fs/szasmg3/bos_taurus/Bos_taurus_UMD_2.0/salzberg/Eco-vs-cow.mum 
 /fs/szasmg3/bos_taurus/Bos_taurus_UMD_2.0/salzberg/EcoK12.fna 

NCBI (2nd batch)

Overall

  • Counts
 Bos_taurus.UMD2.exclude.count
 Bos_taurus.UMD2.trim.count

Summary

                             #ctgs   min     max     mean    median  n50     total_bp
 UMD_Freeze2.0_contam        210657  71      840370  13709   1523    78508   2887902366
 UMD_Freeze2.0               187683  71      840370  15225   1609    79580   2857554998
 difference                   22974

Contaminant region summary:

            elem       min        max        mean       med        n50        sum
 exclude    4817       316        16661      1510       1485       1514       7276894
 trim       30325      48         2479       354        319        446        10745455
 all        35142      48         16661      512        362        674        18022349
 
 Ecoli      746        54         16661      1125       1111       1264       839899          
 vector     33540      49         3128       487        346        603        16340397       
 other      910        53         13090      1006       1037       1329       916117

Exclude sequences (example):

 7180003318605   16661   Escherichia coli str. K12 substr  # DH10B
 7180003320028   13090   Acinetobacter baumannii
 7180003316967   7473    Escherichia coli str. K12 substr  # DH10B
 7180003313366   7098    Acinetobacter baumannii
 7180003195772   4993    Serratia marcescens
 7180003288790   4668    Klebsiella pneumoniae
 7180003310064   4371    Escherichia coli                  # all 3
 7180003262150   4275    Escherichia coli
 7180003288789   3565    Serratia marcescens
 join100003627   3563    Acinetobacter baumannii
 ...
 7180003289260   3128    Escherichia coli or vector  
 7180003292886   2957    vector
 7180003166540   2081    mitochondrion
 7180003310112   1977    contaminants
 7180002995790   1711    bacterial insertion sequence
 7180003221530   1647    Bacillus cereus ATCC 10987
 7180003259696   1597    Pseudomonas aeruginosa PAO1
 7180003239826   1378    Macaca mulatta

Problems:

1: Vectors

2: Ecoli

  • There are 22 Ecoli strains * 3 Ecoli K12 substrains
  • MG1655 was the 1st one completed; DH10B and W3110 were completed recently(?)
  • These contain unique seqs as long as 28K
 cat /fs/szdata/genomes/ncbi/Bacteria/Escherichia_coli_K_12*/*fna | infoseq -description
 NC_010473.1    4686137 50.78  Escherichia coli str. K-12 substr. DH10B, complete genome
 NC_000913.2    4639675 50.79  Escherichia coli str. K-12 substr. MG1655, complete genome
 AC_000091.1    4646332 50.80  Escherichia coli str. K-12 substr. W3110, complete genome
  • Out of 746 UMD2 Ecoli seqs, 636 (723 with maxmatch) aligned to Ecoli.all

3: Other

  • contaminants: 289 ; mostly Ecoli
  • phage: 16 ; 1 aligns to Ecoli, all <1108bp
  • IS: 398; all <1711 bp; 14 align to Ecoli.all & 1 to UniVec_Core; a few, when BLASTed at NCBI, aligned to mammals!
  • mitochondrion: 74 seqs: all align to ~/db/bos_taurus.mitochondrion
  • Others: 130; (Acinetobacter baumannii ...)
  • Files:
 /fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/bos_taurus.UMD2.exclude.count     (4813 exclude sequence counts)
 /fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/bos_taurus.UMD2.contaminant.list  (35142 contaminant sequences: exclude+trim) 
 /fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/bos_taurus.UMD2.contaminant.fasta
 /fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/bos_taurus.UMD2.contaminant.infoseq  35142
 /fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/bos_taurus.UMD2.Ecoli.infoseq        746
 /fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/bos_taurus.UMD2.vector.infoseq       33540
 /fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/odd-contaminants.infoseq             14
 /fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/nucmer/bos_taurus.UMD2.vector.no_UniVec_Core.no_Ecoli.all.blastn_hits.fasta # 100 vector seqs not in UniVec (pPAC7 ...) -> ~dpuiu/db/OtherVec
  • Dirs:
 /fs/szasmg3/dpuiu/bos_taurus/submission/decontam

Notes:

  • The 4,814 ctgs were aligned to UniVec (-c 20 ; delta-filter -q)
    • 4,247 ctgs aligned to 89 vector seqs
    • top ref hits:
 gnl|uv|J01749.1   Cloning vector pBR322
 gnl|uv|J01636.1   E.coli lactose operon with lacI, lacZ, lacY and lacA genes
 gnl|uv|AF102576.1 Cloning vector pSOS
 gnl|uv|L08959.1   pUC8 cloning vector 
 gnl|uv|L08931.1   pMAC7-8 cloning vector for site-directed mutagenesis
 gnl|uv|L09145.1   pUR222 cloning vector
 gnl|uv|U47102.2   Cloning vector pALTER<R>-Ex1
 ...
  • The 4,814 ctgs were aligned to EcoliK12
    • 4,299 aligned 200bp+ to Ecoli
    • 3,877 aligned 100% to region 365521_365744 (224bp)
 >NC_000913.2_365521_365744 Escherichia coli K12, complete genome
 CATGGTCATAGCTGTTTCCTGTGTGAAATTGTTATCCGCTCACAATTCCACACAACATAC
 GAGCCGGAAGCATAAAGTGTAAAGCCTGGGGTGCCTAATGAGTGAGCTAACTCACATTAA
 TTGCGTTGCGCTCACTGCCCGCTTTCCAGTCGGGAAACCTGTCGTGCCAGCTGCATTAAT
 GAATCGGCCAACGCGCGGGGAGAGGCGGTTTGCGTATTGGGCGC
  • The 4,814 ctgs were assembled based on 11,912 reads
 BCM          SHOTGUN    11247 (out of 10M reads)
 BCM          WGS        415   (out of 24M reads)
 NISC         SHOTGUN    213   (out of 0.7M reads)
 BARC         CLONEEND   14    ...
 BCM          CLONEEND   11
 BCCAGSC      CLONEEND   7
 TIGR         CLONEEND   4
 TIGR_JCVIJTC CLONEEND   1
  • Avg UMD clipping range of the 11,912 reads is 840bp (vs 778 avg for the 35.2M assembled reads)
  • Other: /fs/ftp-cbcb/pub/data/Bos_taurus/Bos_taurus_UMD_2.0/odd-contaminants.fa
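
The 224 bp region NC_000913.2_365521_365744 quoted in the notes above can be used as a quick probe for screening contigs. The sketch below is a simplified stand-in for the alignment used on this page: it only finds exact occurrences of the probe on either strand, so contigs with mismatches or partial hits would be missed. The contig names in the example are made up, and the function names are illustrative.

```python
from typing import Dict, List

# The 224 bp region quoted above (NC_000913.2, 365521..365744).
PROBE = (
    "CATGGTCATAGCTGTTTCCTGTGTGAAATTGTTATCCGCTCACAATTCCACACAACATAC"
    "GAGCCGGAAGCATAAAGTGTAAAGCCTGGGGTGCCTAATGAGTGAGCTAACTCACATTAA"
    "TTGCGTTGCGCTCACTGCCCGCTTTCCAGTCGGGAAACCTGTCGTGCCAGCTGCATTAAT"
    "GAATCGGCCAACGCGCGGGGAGAGGCGGTTTGCGTATTGGGCGC"
)

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement; non-ACGT characters are left as-is."""
    return seq.translate(_COMPLEMENT)[::-1]


def contigs_containing(contigs: Dict[str, str], probe: str = PROBE) -> List[str]:
    """Names of contigs that contain the probe exactly, on either strand."""
    forward = probe.upper()
    reverse = reverse_complement(forward)
    hits = []
    for name, seq in contigs.items():
        seq = seq.upper()
        if forward in seq or reverse in seq:
            hits.append(name)
    return hits


# Hypothetical contigs, for illustration only.
contigs = {
    "ctg_fwd": "TT" + PROBE + "GG",
    "ctg_rev": reverse_complement(PROBE),
    "ctg_clean": "ACGT" * 100,
}
print(contigs_containing(contigs))
```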

Local files

  • Freeze dir files
 /fs/szasmg3/bos_taurus/Bos_taurus_UMD_2.0/contigs.unplaced.fa  : sequences
 /fs/szasmg3/bos_taurus/Bos_taurus_UMD_2.0/bos_taurus.agp       : all scaffolds
 /fs/ftp-cbcb/pub/data/assembly/Bos_taurus/Bos_taurus_UMD_2.0
  • Files uploaded
 Ftp server: ftp-private.ncbi.nlm.nih.gov
 Account: cbcb_trc
 Dir: uploads/
 Local files: /fs/szasmg3/dpuiu/bos_taurus/submission/ftp/   : 22 *sqn + 1 agp

Contaminant search

Ecoli

UniVec_Core

UMD2.other

  • 83(82) ctgs align to 65 ref sequences
  • 10 ctgs are Acinetobacter baumannii
 pwd
 /fs/szasmg3/dpuiu/bos_taurus/submission/nucmer_contaminant
 
 join UMD2.contaminant.other-ctg.ref_hits ~/db/bos_taurus.UMD2.contaminant.infoseq | sort -nk3 -r | head
 7180003370686_12513_13066 553 16 554 phage
 7180003320028 13090 10 13090 Acinetobacter baumannii
 7180003341208_1_647 646 8 647 phage
 ...
 
 contigs          <2000      >2000      min        max        mean       med        n50        sum
 82               20         62         709        397429     65384      44841      138429     5361543
 
 alignments       <200       >200       min        max        mean       med        n50        sum            
 103              33         70         105        3312       467        276        688        48109
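
The `join ... | sort -nk3 -r | head` step above can be sketched in Python as follows. The input layout is an assumption inferred from the joined output shown: the hits file has an id followed by two numeric fields, and the infoseq file has an id followed by a length and a description. Unlike `join`, this version does not need sorted inputs. The function name `join_and_rank` and the example rows are illustrative.

```python
from typing import Iterable, List


def join_and_rank(hit_lines: Iterable[str], info_lines: Iterable[str],
                  top: int = 10) -> List[str]:
    """Join two whitespace-delimited tables on column 1, rank by column 3."""
    info = {}
    for line in info_lines:
        parts = line.split(None, 1)
        if parts:
            info[parts[0]] = parts[1].strip() if len(parts) > 1 else ""

    joined = []
    for line in hit_lines:
        fields = line.split()
        if len(fields) < 3 or fields[0] not in info:
            continue  # unmatched ids are dropped, as with plain `join`
        text = " ".join(fields + [info[fields[0]]]).strip()
        joined.append((int(fields[2]), text))

    # Largest value in column 3 first, like `sort -nk3 -r`.
    joined.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in joined[:top]]


# Hypothetical rows, for illustration only.
hits = ["ctgA 500 3", "ctgB 900 12", "ctgC 100 7"]
info = ["ctgA 500 phage", "ctgB 900 Acinetobacter baumannii"]
for row in join_and_rank(hits, info):
    print(row)
```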

Files:

/fs/szasmg3/dpuiu/bos_taurus/submission/nucmer_contaminant/UMD2.contaminant.Acinetobacter-ctg.qry_hits   # 10 UMD2.0 Acinetobacter ctg ids
/fs/szasmg3/dpuiu/bos_taurus/submission/nucmer_contaminant/Acinetobacter.all-ctg.filter-q.qry_hits       # 22 UMD2.0 Acinetobacter ctg ids ; 7 in common with the 10 above
# 25 Acinetobacter ctgs (union of the two lists above: 10 + 22 - 7 = 25)
       ctgid         ctglen
    1  7180003321583 96481
    2  7180003308373 9419
    3  7180003319195 8955
    4  7180003317370 8045
    5  7180003290024 5922
    6  join100003699 5649
    7  7180003288988 3618
    8  7180003308907 3157
    9  7180003234806 3100
   10  7180003319189 2966
   11  7180003202299 2653
   12  7180003217023 2213
   13  7180003219002 2161
   14  7180003219018 2010
   15  7180003292866 1767
   16  7180003215440 1617
   17  7180003235746 1573
   18  7180003235747 1524
   19  7180003234890 1422
   20  7180003219292 1329
   21  7180003221397 1308
   22  7180003221476 1243
   23  7180003235699 1139
   24  7180003214110 1100
   25  deg0003235855 1062
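
The total of 25 contigs follows from combining the two id lists above (10 ids and 22 ids, 7 shared). A minimal sketch of that bookkeeping is shown below. The ids are placeholders, since this page does not say which contig came from which list, and the function name `combine_id_lists` is illustrative.

```python
from typing import Dict, Iterable, List


def combine_id_lists(first: Iterable[str], second: Iterable[str]) -> Dict[str, List[str]]:
    """Return the ids shared by both lists and the combined, de-duplicated ids."""
    first_set, second_set = set(first), set(second)
    return {
        "shared": sorted(first_set & second_set),
        "combined": sorted(first_set | second_set),
    }


# Placeholder ids that mimic the counts above: 10 and 22 ids with 7 in common.
first = [f"ctg{i:02d}" for i in range(1, 11)]    # ctg01..ctg10
second = [f"ctg{i:02d}" for i in range(4, 26)]   # ctg04..ctg25
result = combine_id_lists(first, second)
print(len(result["shared"]), len(result["combined"]))  # prints: 7 25
```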