Bos taurus
NCBI Traces
SPECIES_CODE = "BOS TAURUS" : 37,788,710 traces
SPECIES_CODE = "BOS TAURUS" and CENTER_NAME = "BCM" : 35,596,825 traces
SPECIES_CODE = "BOS TAURUS" and CENTER_NAME = "BCM" and TRACE_TYPE_CODE = "WGS" : 24,863,627 traces
SPECIES_CODE = "BOS TAURUS" and CENTER_NAME = "BCM" and TRACE_TYPE_CODE = "SHOTGUN" : 10,716,306 traces
SPECIES_CODE = "BOS TAURUS" and CENTER_NAME = "BCM" and TRACE_TYPE_CODE = "CLONEEND" : 16,892 traces
BCM Assembly
        #elem   min       max        mean      median    n50        sum
contig  131620  91        326010     20755     10365     44270      2731814362
placed  101579  91        250125     24286     13928     47485      2466971326
chrom   30      44060403  161106243  87813777  84419198  106383598  2634413324
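The summary columns used throughout these notes (count, min, max, mean, median, n50, sum) can be recomputed from a plain list of sequence lengths. A minimal sketch follows, assuming the usual N50 definition (the length at which the running total of lengths, sorted longest first, reaches half of the sum) and integer rounding of mean and median; the notes do not say which tool produced the tables, so treat both as assumptions.

```python
from statistics import median


def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("empty length list")


def summarize(lengths):
    """Return the same columns as the summary tables (rounding is an assumption)."""
    return {
        "elem": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": sum(lengths) // len(lengths),
        "median": int(median(lengths)),
        "n50": n50(lengths),
        "sum": sum(lengths),
    }


# Toy lengths for illustration only (not taken from the assembly).
example = summarize([2, 3, 4, 5, 6, 7, 8, 9, 10])
```

On the toy list the total is 54, and the three longest sequences (10 + 9 + 8 = 27) reach half of it, so the N50 is 8.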
UMD2 Assembly
- Used UMDoverlapper to trim reads
       #elem     min  max   mean  median  n50  sum
reads  35237868  68   1418  778   840     864  27406137041
- Library reestimates:
paste *dst *mdi | perl -ane 'print " $_" if($F[1]!=$F[4]);'
35237870 150000 50000 35237870 162386 21158
35237871 2496 1431 35237871 2409 1137
35237873 3001 829 35237873 2973 770
35237875 150001 50000 35237875 172955 43831
35237876 1629 282 35237876 1595 245
35237877 3063 1326 35237877 3193 1131
35237878 6756 836 35237878 6701 793
35237879 2569 293 35237879 2547 285
35237880 150002 50000 35237880 160984 26638
35237881 2749 446 35237881 2697 325
35237883 3593 1213 35237883 3463 1232
35237884 3165 700 35237884 3172 699
35237885 3812 533 35237885 3804 537
35237886 2754 1432 35237886 2701 1289
35237887 4977 693 35237887 4968 694
35237889 2710 1529 35237889 2566 1225
35237890 150003 50000 35237890 161995 26438
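The paste/perl filter above pairs the rows of the two files line by line and keeps those whose second field differs. A rough Python equivalent is sketched below. It assumes each row is "library id, mean, standard deviation" (the notes do not label the columns) and that both files list the libraries in the same order; the function name and the unchanged third row are made up for illustration.

```python
def changed_libraries(original_rows, reestimated_rows):
    """Pair rows positionally (like `paste`) and keep pairs whose second field differs.

    Each row is a whitespace-separated string: id, mean, stddev (assumed meaning).
    Returns tuples of (library id, old mean, new mean, new minus old).
    """
    changed = []
    for old, new in zip(original_rows, reestimated_rows):
        old_id, old_mean, _old_sd = old.split()
        _new_id, new_mean, _new_sd = new.split()
        if int(old_mean) != int(new_mean):
            changed.append((old_id, int(old_mean), int(new_mean),
                            int(new_mean) - int(old_mean)))
    return changed


# First two pairs come from the listing above; the third is a made-up unchanged row.
before = ["35237870 150000 50000", "35237871 2496 1431", "35237872 3000 300"]
after = ["35237870 162386 21158", "35237871 2409 1137", "35237872 3000 280"]
result = changed_libraries(before, after)
```

Like the perl one-liner, this compares only the second field, so a library whose third field alone changed is not reported.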
Submission
- Genome Project
- WGS id (GPID): 32899
- Title: "A whole-genome assembly of the cow, Bos taurus"
- Authors:
Steven Salzberg, Aleksey Zimin, Arthur Delcher, Liliana Florea, David Kelley, Finian Hanrahan, Guillaume Marcais, Geo Pertea, Michael Roberts, Michael Schatz, Curt Van Tassell, James Yorke, Poorani S.
- Assembler:
Celera Assembler and UMD Overlapper.
- Sequencing Center :
Baylor College of Medicine.
- Source of DNA used for sequencing:
The source of the BAC library DNA was Hereford bull L1 Domino 99375, registration number 41170496. Dr. Michael MacNeil's laboratory, USDA-ARS, Miles City, MT provided the blood. The DNA for the whole genome shotgun sequences was provided by Dr. Timothy Smith's laboratory, U.S. Meat Animal Research Center, Clay Center, NE from white blood cells from L1 Dominette 01449, American Hereford Association registration number 42190680 (a daughter of L1 Domino 99375). A skin cell fibroblast cell line from the same animal is available from Dr. Carol Chitko-McKown's laboratory, although there is no sequence from that cell line.
- Sequence modifiers:
[organism=Bos taurus][breed=Hereford][tech=wgs][chromosome=...]
- Submission: Use sequin
/nfshomes/dpuiu/szdevel/sequin.8.10/sequin
- Sequence:
Contig length summary:
          #seqs   min  max     mean   median  n50    sum
all       210657  71   840370  13709  1523    78511  2887902366
placed    75775   88   840370  34512  13416   88287  2615171268
unplaced  134882  71   166670  2022   1322    1742   272731098
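The sequence modifiers listed above go on the FASTA definition line of each record. A small sketch of building such a line is given below. The contig IDs are hypothetical, and leaving out the chromosome modifier for unplaced contigs is an assumption; the notes only show the template.

```python
def defline(seq_id, chromosome=None):
    """Build a FASTA definition line carrying the source modifiers.

    The chromosome modifier is added only when a chromosome is given
    (assumed handling for unplaced contigs).
    """
    modifiers = [("organism", "Bos taurus"), ("breed", "Hereford"), ("tech", "wgs")]
    if chromosome is not None:
        modifiers.append(("chromosome", str(chromosome)))
    return ">" + seq_id + " " + "".join(f"[{key}={value}]" for key, value in modifiers)


# Hypothetical contig IDs, for illustration only.
placed_line = defline("ctg_example_1", chromosome=5)
unplaced_line = defline("ctg_example_2")
```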
Duplicates
- Duplicate contigs : 111+2 : one copy was removed
deg0003136509,7180003440308 : both unplaced
deg0003084562,7180002954167 : both unplaced
Contaminants
NCBI (1st batch)
Through "Foreign Contamination Screen" : http://www.ncbi.nlm.nih.gov/projects/WGS/screens/DAAA01_120508/
- list.exclude_contigs 4,813 ctgs (3,939 vector + 394 Ecoli + 452 other) (73 mito, 43 deg, 8 Acinetobacter baumannii)
- list.trim_contigs 19,049 ctgs (18,336 vector + 289 Ecoli + 397 other)
Steven search against Ecoli MG1655
- 12/11/2008 : Found 121 ctgs that align to Ecoli
/fs/szasmg3/bos_taurus/Bos_taurus_UMD_2.0/salzberg/Eco-vs-cow.mum
/fs/szasmg3/bos_taurus/Bos_taurus_UMD_2.0/salzberg/EcoK12.fna
NCBI (2nd batch)
- More ctgs found (bug in the 1st search)
- list.exclude_contigs 4 ctgs
- list.trim_contigs 46 ctgs
Overall
- Counts
Bos_taurus.UMD2.exclude.count
Bos_taurus.UMD2.trim.count
Summary
                      #ctgs   min  max     mean   median  n50    total_bp
UMD_Freeze2.0_contam  210657  71   840370  13709  1523    78508  2887902366
UMD_Freeze2.0         187683  71   840370  15225  1609    79580  2857554998
difference            22974
Contaminant region summary:
         elem   min  max    mean  med   n50   sum
exclude  4817   316  16661  1510  1485  1514  7276894
trim     30325  48   2479   354   319   446   10745455
all      35142  48   16661  512   362   674   18022349
Ecoli    746    54   16661  1125  1111  1264  839899
vector   33540  49   3128   487   346   603   16340397
other    910    53   13090  1006  1037  1329  916117
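The exclude and trim lists above imply two operations: drop a contig entirely, or cut flagged regions out of it. A hedged sketch of that step follows. The notes do not describe the format of list.exclude_contigs or list.trim_contigs, nor how trimmed contigs were handled, so the inputs here (a set of IDs, and 1-based inclusive spans per contig) and the choice to split a contig at an internal span are assumptions.

```python
def apply_screen(contigs, exclude_ids, trim_spans):
    """Drop excluded contigs and cut flagged spans out of the remaining ones.

    contigs:     dict of contig id -> sequence string
    exclude_ids: set of contig ids to drop entirely
    trim_spans:  dict of contig id -> list of (start, end), 1-based inclusive (assumed)
    Returns a dict of contig id -> list of sequence pieces left after trimming.
    """
    kept = {}
    for contig_id, seq in contigs.items():
        if contig_id in exclude_ids:
            continue
        pieces = []
        pos = 0  # 0-based index of the first base not yet consumed
        for start, end in sorted(trim_spans.get(contig_id, [])):
            if start - 1 > pos:
                pieces.append(seq[pos:start - 1])
            pos = max(pos, end)
        if pos < len(seq):
            pieces.append(seq[pos:])
        if pieces:
            kept[contig_id] = pieces
    return kept


# Toy sequences and IDs, for illustration only.
toy = {"a": "AAAACCCCGGGG", "b": "TTTT", "c": "ACGTACGT"}
screened = apply_screen(toy, exclude_ids={"b"}, trim_spans={"a": [(5, 8)], "c": [(1, 3)]})
```

Here contig "b" is dropped, "a" is split around its internal span, and "c" loses its first three bases.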
Exclude sequences (example):
7180003318605 16661 Escherichia coli str. K12 substr # DH10B
7180003320028 13090 Acinetobacter baumannii
7180003316967 7473 Escherichia coli str. K12 substr # DH10B
7180003313366 7098 Acinetobacter baumannii
7180003195772 4993 Serratia marcescens
7180003288790 4668 Klebsiella pneumoniae
7180003310064 4371 Escherichia coli # all 3
7180003262150 4275 Escherichia coli
7180003288789 3565 Serratia marcescens
join100003627 3563 Acinetobacter baumannii
...
7180003289260 3128 Escherichia coli or vector
7180003292886 2957 vector
7180003166540 2081 mitochondrion
7180003310112 1977 contaminants
7180002995790 1711 bacterial insertion sequence
7180003221530 1647 Bacillus cereus ATCC 10987
7180003259696 1597 Pseudomonas aeruginosa PAO1
7180003239826 1378 Macaca mulatta
Problems:
1: Vectors
- Q: Which vectors are the most frequent?
- A: align UMD2 vector contaminants to UniVec_Core:
- 28051(29113 maxmatch) out of 33540 align : UniVec_Core-UMD2.vector.ref_hits
- pBR322, Ecoli_lac*, pSacBII ...
- pBACe3.6 (U80929.2:11415-11517): CONTAINED
- 5489 (4427 maxmatch) out of 33540 don't align (4373 are 100+ bp, 11 are 500+ bp); some aligned by megablast to 100 NCBI sequences, mostly vector: Bos_taurus.UMD2.vector.no_UniVec_Core.no_Ecoli.all.blastn_hits
2: Ecoli
- There are 22 Ecoli strains * 3 Ecoli K12 substrains
- MG1655 was the first one completed; DH10B and W3110 were completed recently(?)
- Contain unique seqs as long as 28K
cat /fs/szdata/genomes/ncbi/Bacteria/Escherichia_coli_K_12*/*fna | infoseq -description
NC_010473.1 4686137 50.78 Escherichia coli str. K-12 substr. DH10B, complete genome
NC_000913.2 4639675 50.79 Escherichia coli str. K-12 substr. MG1655, complete genome
AC_000091.1 4646332 50.80 Escherichia coli str. K-12 substr. W3110, complete genome
- out of 746 UMD2 Ecoli seqs, 636 aligned to Ecoli.all
3: Other
- contaminants: 289 ; mostly Ecoli
- phage: 16 ; 1 aligns to Ecoli, all <1108bp
- IS: 398; all <1711 bp
- Acinetobacter baumannii : there are 6 completed strains
- Files:
/fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/bos_taurus.UMD2.exclude.count (4813 exclude sequence counts)
/fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/bos_taurus.UMD2.contaminant.list (35142 contaminant sequences: exclude+trim)
/fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/bos_taurus.UMD2.contaminant.fasta
/fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/bos_taurus.UMD2.contaminant.infoseq 35142
/fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/bos_taurus.UMD2.Ecoli.infoseq 746
/fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/bos_taurus.UMD2.vector.infoseq 33540
/fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/odd-contaminants.infoseq 14
/fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/nucmer/bos_taurus.UMD2.vector.no_UniVec_Core.no_Ecoli.all.blastn_hits.fasta # 100 vector seqs not in UniVec (pPAC7 ...)
- Dirs:
/fs/szasmg3/dpuiu/bos_taurus/submission/decontam
Notes:
- The 4,814 ctgs were aligned to UniVec (-c 20 ; delta-filter -q)
- 4,247 ctgs aligned to 89 vector seqs
- top ref hits:
gnl|uv|J01749.1 Cloning vector pBR322
gnl|uv|J01636.1 E.coli lactose operon with lacI, lacZ, lacY and lacA genes
gnl|uv|AF102576.1 Cloning vector pSOS
gnl|uv|L08959.1 pUC8 cloning vector
gnl|uv|L08931.1 pMAC7-8 cloning vector for site-directed mutagenesis
gnl|uv|L09145.1 pUR222 cloning vector
gnl|uv|U47102.2 Cloning vector pALTER<R>-Ex1
...
- The 4,814 ctgs were aligned to EcoliK12
- 4,299 aligned 200bp+ to Ecoli
- 3,877 aligned 100% to region 365521_365744 (224bp)
>NC_000913.2_365521_365744 Escherichia coli K12, complete genome
CATGGTCATAGCTGTTTCCTGTGTGAAATTGTTATCCGCTCACAATTCCACACAACATAC
GAGCCGGAAGCATAAAGTGTAAAGCCTGGGGTGCCTAATGAGTGAGCTAACTCACATTAA
TTGCGTTGCGCTCACTGCCCGCTTTCCAGTCGGGAAACCTGTCGTGCCAGCTGCATTAAT
GAATCGGCCAACGCGCGGGGAGAGGCGGTTTGCGTATTGGGCGC
- The 4,814 ctgs were assembled based on 11,912 reads
BCM SHOTGUN 11247 (out of 10M reads)
BCM WGS 415 (out of 24M reads)
NISC SHOTGUN 213 (out of 0.7M reads)
BARC CLONEEND 14
...
BCM CLONEEND 11
BCCAGSC CLONEEND 7
TIGR CLONEEND 4
TIGR_JCVIJTC CLONEEND 1
- Avg UMD clipping range of the 11,912 reads is 840 bp (vs 778 bp avg for the 35.2M assembled reads)
- Other: /fs/ftp-cbcb/pub/data/Bos_taurus/Bos_taurus_UMD_2.0/odd-contaminants.fa
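The note above that most excluded contigs match the 224 bp region NC_000913.2:365521-365744 at 100% can be spot-checked without an aligner. The sketch below counts contigs that contain a given region on either strand. It uses exact substring search, which is a simplification of the nucmer alignments used in these notes, and the contig names and flanking bases are made up; only the first 20 bases of the region are used to keep the example short.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse-complement a DNA string (A/C/G/T only; other symbols pass through)."""
    return seq.translate(_COMPLEMENT)[::-1]


def contains_region(contig, region):
    """True if the region occurs exactly in the contig on either strand."""
    contig = contig.upper()
    region = region.upper()
    return region in contig or reverse_complement(region) in contig


def count_hits(contigs, region):
    """Count contigs (dict of id -> sequence) that contain the region."""
    return sum(1 for seq in contigs.values() if contains_region(seq, region))


# First 20 bases of the 224 bp region quoted above; flanks and IDs are made up.
region_start = "CATGGTCATAGCTGTTTCCT"
toy_contigs = {
    "fwd": "GG" + region_start + "AA",
    "rev": "TT" + reverse_complement(region_start) + "CC",
    "none": "ACGTACGTACGT",
}
hits = count_hits(toy_contigs, region_start)
```

An exact search misses contigs with even a single mismatch, so a count from this sketch would be a lower bound on what an alignment-based search reports.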
Local files
- Freeze dir files
/fs/szasmg3/bos_taurus/Bos_taurus_UMD_2.0/contigs.unplaced.fa : sequences
/fs/szasmg3/bos_taurus/Bos_taurus_UMD_2.0/bos_taurus.agp : all scaffolds
/fs/ftp-cbcb/pub/data/assembly/Bos_taurus/Bos_taurus_UMD_2.0
- Files uploaded
Ftp server: ftp-private.ncbi.nlm.nih.gov
Account: cbcb_trc
Dir: uploads/
Local files: /fs/szasmg3/dpuiu/bos_taurus/submission/ftp/ : 22 *sqn + 1 agp