Bos taurus: Difference between revisions
Revision as of 14:03, 28 April 2009
Papers
- A whole-genome assembly of the domestic cow, Bos taurus (Genome Biology April 2009)
- Bos taurus genome assembly (BMC Genomics, April 2009): Baylor's assembly paper
- Cattle genome sequenced (Science News)
- The genome sequence of taurine cattle: a window to ruminant biology and evolution (Science 2009)
NCBI Traces
SPECIES_CODE = "BOS TAURUS"    37,788,710 traces
SPECIES_CODE = "BOS TAURUS" and CENTER_NAME = "BCM"    35,596,825 traces
SPECIES_CODE = "BOS TAURUS" and CENTER_NAME = "BCM" and TRACE_TYPE_CODE = "WGS"    24,863,627 traces
SPECIES_CODE = "BOS TAURUS" and CENTER_NAME = "BCM" and TRACE_TYPE_CODE = "SHOTGUN"    10,716,306 traces
SPECIES_CODE = "BOS TAURUS" and CENTER_NAME = "BCM" and TRACE_TYPE_CODE = "CLONEEND"    16,892 traces
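As a quick consistency check, the three BCM trace types listed above should add up to the BCM total. A minimal sketch in Python, with the counts copied from the queries above (the variable names are illustrative only and not part of any NCBI tool):

```python
# Trace counts copied from the NCBI Trace Archive queries above.
all_traces = 37_788_710
bcm_total = 35_596_825
bcm_by_type = {
    "WGS": 24_863_627,
    "SHOTGUN": 10_716_306,
    "CLONEEND": 16_892,
}

# Sum of the per-type counts for CENTER_NAME = "BCM".
type_sum = sum(bcm_by_type.values())

# Traces submitted by centers other than BCM.
non_bcm = all_traces - bcm_total

print("per-type sum matches BCM total:", type_sum == bcm_total)
print("non-BCM traces:", non_bcm)
```

If the per-type sum matches, WGS, SHOTGUN and CLONEEND account for every BCM trace in the archive.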
BCM Assembly
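- Genome Project @ Baylor: http://www.hgsc.bcm.tmc.edu/projects/bovine/
- Btau_4.0 @ NCBI: http://www.ncbi.nlm.nih.gov/mapview/stats/BuildStats.cgi?taxid=9913&build=4&ver=1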
          #elem       min        max      mean    median        n50         sum
contig   131620        91     326010     20755     10365      44270  2731814362
placed   101579        91     250125     24286     13928      47485  2466971326
chrom        30  44060403  161106243  87813777  84419198  106383598  2634413324
- 6 finished BACs NCBI accession numbers:
gi|171461043, gi|171461042, gi|171461041, gi|171461040, gi|171461039, gi|167744683
UMD2 Assembly
- Used UMDoverlapper to trim reads
          #elem  min   max  mean  median  n50          sum
reads  35237868   68  1418   778     840  864  27406137041
- Library re-estimates:
paste *dst *mdi | perl -ane 'print " $_" if($F[1]!=$F[4]);'
35237870  150000  50000    35237870  162386  21158
35237871    2496   1431    35237871    2409   1137
35237873    3001    829    35237873    2973    770
35237875  150001  50000    35237875  172955  43831
35237876    1629    282    35237876    1595    245
35237877    3063   1326    35237877    3193   1131
35237878    6756    836    35237878    6701    793
35237879    2569    293    35237879    2547    285
35237880  150002  50000    35237880  160984  26638
35237881    2749    446    35237881    2697    325
35237883    3593   1213    35237883    3463   1232
35237884    3165    700    35237884    3172    699
35237885    3812    533    35237885    3804    537
35237886    2754   1432    35237886    2701   1289
35237887    4977    693    35237887    4968    694
35237889    2710   1529    35237889    2566   1225
35237890  150003  50000    35237890  161995  26438
- AGP
Chr   #Ctgs
1      4617
2      3468
3      3260
4      3032
5      3103
6      3570
7      3049
8      2954
9      2584
10     2712
11     2778
12     2673
13     2147
14     2549
15     2655
16     2547
17     1885
18     2195
19     1809
20     2146
21     1967
22     1628
23     1434
24     1380
25     1274
26     1668
27     1381
28     1186
29     1803
X      4883
U    113346
Placed contig orientation:
-  33990
+  32686
0   7661
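The perl filter in the library re-estimates step above prints only the pasted rows whose second and fifth fields differ. A rough Python equivalent is sketched below. It assumes (this is not stated on the page) that each pasted row holds library id, mean and standard deviation from the *dst file, followed by the same three fields from the *mdi file. The function name and the third sample row are hypothetical.

```python
def changed_means(pasted_rows):
    """Return rows whose 2nd and 5th fields differ, like the perl filter
    `print if $F[1] != $F[4]` (0-based fields 1 and 4)."""
    return [row for row in pasted_rows if row[1] != row[4]]


# First two rows are copied from the listing above; the third is a
# hypothetical library whose mean did not change, so it is filtered out.
pasted_rows = [
    (35237870, 150000, 50000, 35237870, 162386, 21158),
    (35237871, 2496, 1431, 35237871, 2409, 1137),
    (99999999, 3000, 300, 99999999, 3000, 310),
]

for lib_id, old_mean, old_sd, _, new_mean, new_sd in changed_means(pasted_rows):
    print(lib_id, old_mean, old_sd, "->", new_mean, new_sd)
```

Note that, like the one-liner, this compares only the means, so a library whose standard deviation alone changed would not be reported.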
Submission
- Genome Project
- WGS id (GPID): 32899
DAAA00000000, DAAA01000000.
- Title: "A whole-genome assembly of the cow, Bos taurus"
- Authors:
Steven Salzberg, Aleksey Zimin, Arthur Delcher, Liliana Florea, David Kelley, Finian Hanrahan, Guillaume Marcais, Geo Pertea, Michael Roberts, Michael Schatz, Curt Van Tassell, James Yorke, Poorani S.
- Assembler:
Celera Assembler and UMD Overlapper.
- Sequencing Center :
Baylor College of Medicine.
- Source of DNA used for sequencing:
The source of the BAC library DNA was Hereford bull L1 Domino 99375, registration number 41170496. Dr. Michael MacNeil's laboratory, USDA-ARS, Miles City, MT provided the blood. The DNA for the whole genome shotgun sequences was provided by Dr. Timothy Smith's laboratory, U.S. Meat Animal Research Center, Clay Center, NE from white blood cells from L1 Dominette 01449, American Hereford Association registration number 42190680 (a daughter of L1 Domino 99375). A skin cell fibroblast cell line from the same animal is available from Dr. Carol Chitko-McKown's laboratory, although there is no sequence from that cell line.
- Sequence modifiers:
[organism=Bos taurus][breed=Hereford][tech=wgs][chromosome=...]
- Submission: Use sequin
/nfshomes/dpuiu/szdevel/sequin.8.10/sequin
- Sequence:
Contig length summary:
           #seqs  min     max   mean  median    n50         sum
all       210657   71  840370  13709    1523  78511  2887902366
placed     75775   88  840370  34512   13416  88287  2615171268
unplaced  134882   71  166670   2022    1322   1742   272731098
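The summaries on this page report an n50 column without saying how it was computed. The sketch below uses one common definition, assumed here: the length of the contig at which the running total of lengths, taken from longest to shortest, first reaches half of the overall sum. The function name `n50` is illustrative, and the script that produced the tables may use a slightly different convention.

```python
def n50(lengths):
    """Length L such that contigs of length >= L cover at least half of
    the total bases (lengths are scanned from longest to shortest)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 is undefined for an empty list of lengths")


# Toy example: total is 54, and 10 + 9 + 8 = 27 reaches half of it.
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_lengths))
```

Because a few long contigs dominate the running total, n50 can sit far above the median, as in the "all" row above (n50 78511 vs median 1523).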
Duplicates
- Duplicate contigs : 111+2 : one copy was removed
deg0003136509,7180003440308 : both unplaced
deg0003084562,7180002954167 : both unplaced
Contaminants
NCBI (1st batch)
Through "Foreign Contamination Screen" : http://www.ncbi.nlm.nih.gov/projects/WGS/screens/DAAA01_120508/
- list.exclude_contigs 4,813 ctgs (3,939 vector + 394 Ecoli + 452 other) (73 mito, 43 deg, 8 Acinetobacter baumannii)
- list.trim_contigs 19,049 ctgs (18,336 vector + 289 Ecoli + 397 other)
Steven's search against Ecoli MG1655
- 12/11/2008 : Found 121 ctgs that align to Ecoli
/fs/szasmg3/bos_taurus/Bos_taurus_UMD_2.0/salzberg/Eco-vs-cow.mum
/fs/szasmg3/bos_taurus/Bos_taurus_UMD_2.0/salzberg/EcoK12.fna
NCBI (2nd batch)
- More ctgs found (bug in the 1st search)
- list.exclude_contigs 4 ctgs
- list.trim_contigs 46 ctgs
Overall
- Counts
Bos_taurus.UMD2.exclude.count
Bos_taurus.UMD2.trim.count
Summary
                       #ctgs  min     max   mean  median    n50    total_bp
UMD_Freeze2.0_contam  210657   71  840370  13709    1523  78508  2887902366
UMD_Freeze2.0         187683   71  840370  15225    1609  79580  2857554998
difference             22974
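The contig counts in this summary can be cross-checked against the AGP table and the orientation counts in the UMD2 Assembly section. A small sketch with the numbers copied from this page (the variable names are illustrative only):

```python
# Freeze summary, before and after contaminant removal.
contigs_before = 210_657
contigs_after = 187_683
bp_before = 2_887_902_366
bp_after = 2_857_554_998

removed_contigs = contigs_before - contigs_after
removed_bp = bp_before - bp_after

# Placed contigs by orientation (-, +, 0) and unplaced contigs (Chr "U")
# from the UMD2 Assembly section.
placed = 33_990 + 32_686 + 7_661
unplaced = 113_346

print("contigs removed:", removed_contigs)
print("bp removed:", removed_bp)
print("placed + unplaced equals post-cleanup count:",
      placed + unplaced == contigs_after)
```

The placed and unplaced counts adding up to the UMD_Freeze2.0 row suggests that the AGP table describes the decontaminated freeze.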
Contaminant region summary:
          elem  min    max  mean   med   n50       sum
exclude   4817  316  16661  1510  1485  1514   7276894
trim     30325   48   2479   354   319   446  10745455
all      35142   48  16661   512   362   674  18022349
Ecoli      746   54  16661  1125  1111  1264    839899
vector   33540   49   3128   487   346   603  16340397
other      910   53  13090  1006  1037  1329    916117
Exclude sequences (example):
7180003318605  16661  Escherichia coli str. K12 substr # DH10B
7180003320028  13090  Acinetobacter baumannii
7180003316967   7473  Escherichia coli str. K12 substr # DH10B
7180003313366   7098  Acinetobacter baumannii
7180003195772   4993  Serratia marcescens
7180003288790   4668  Klebsiella pneumoniae
7180003310064   4371  Escherichia coli # all 3
7180003262150   4275  Escherichia coli
7180003288789   3565  Serratia marcescens
join100003627   3563  Acinetobacter baumannii
...
7180003289260   3128  Escherichia coli or vector
7180003292886   2957  vector
7180003166540   2081  mitochondrion
7180003310112   1977  contaminants
7180002995790   1711  bacterial insertion sequence
7180003221530   1647  Bacillus cereus ATCC 10987
7180003259696   1597  Pseudomonas aeruginosa PAO1
7180003239826   1378  Macaca mulatta
Problems:
1: Vectors
- Q: Which vectors are the most frequent?
- A: align UMD2 vector contaminants to UniVec_Core:
- 28051(29113 maxmatch) out of 33540 align : UniVec_Core-UMD2.vector.ref_hits
- pBR322, Ecoli_lac*, pSacBII ...
- pBACe3.6 (U80929.2:11415-11517): CONTAINED
- 5489(4427 maxmatch) out of 33540 don't align (4373 are 100+ bp, 11 are 500+ bp); some were aligned by megablast to 100 NCBI sequences, mostly vector: Bos_taurus.UMD2.vector.no_UniVec_Core.no_Ecoli.all.blastn_hits
2: Ecoli
- There are 22 Ecoli strains * 3 Ecoli K12 substrains
- MG1655 is the 1st one completed, DH10B and W3110 have been recently(?) completed
- Contain unique seqs as long as 28K
cat /fs/szdata/genomes/ncbi/Bacteria/Escherichia_coli_K_12*/*fna | infoseq -description
NC_010473.1  4686137  50.78  Escherichia coli str. K-12 substr. DH10B, complete genome
NC_000913.2  4639675  50.79  Escherichia coli str. K-12 substr. MG1655, complete genome
AC_000091.1  4646332  50.80  Escherichia coli str. K-12 substr. W3110, complete genome
- out of 746 UMD2 Ecoli seqs, 636(723 maxmatch) aligned to Ecoli.all
3: Other
- contaminants: 289 ; mostly Ecoli
- phage: 16 ; 1 aligns to Ecoli, all <1108bp
- IS: 398; all <1711 bp; 14 align to Ecoli.all & 1 to UniVec_Core; a few, when BLASTed at NCBI, aligned to mammals (!)
- mitochondrion: 74 seqs: all align to ~/db/bos_taurus.mitochondrion
- Others: 130; (Acinetobacter baumannii ...)
- Files:
/fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/bos_taurus.UMD2.exclude.count (4813 exclude sequence counts)
/fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/bos_taurus.UMD2.contaminant.list (35142 contaminant sequences: exclude+trim)
/fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/bos_taurus.UMD2.contaminant.fasta
/fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/bos_taurus.UMD2.contaminant.infoseq 35142
/fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/bos_taurus.UMD2.Ecoli.infoseq 746
/fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/bos_taurus.UMD2.vector.infoseq 33540
/fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/odd-contaminants.infoseq 14
/fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/nucmer/bos_taurus.UMD2.vector.no_UniVec_Core.no_Ecoli.all.blastn_hits.fasta # 100 vector seqs not in UniVec (pPAC7 ...) -> ~dpuiu/db/OtherVec
- Dirs:
/fs/szasmg3/dpuiu/bos_taurus/submission/decontam
Notes:
- The 4,814 ctgs were aligned to UniVec (-c 20 ; delta-filter -q)
- 4,247 ctgs aligned to 89 vector seqs
- top ref hits:
gnl|uv|J01749.1   Cloning vector pBR322
gnl|uv|J01636.1   E.coli lactose operon with lacI, lacZ, lacY and lacA genes
gnl|uv|AF102576.1 Cloning vector pSOS
gnl|uv|L08959.1   pUC8 cloning vector
gnl|uv|L08931.1   pMAC7-8 cloning vector for site-directed mutagenesis
gnl|uv|L09145.1   pUR222 cloning vector
gnl|uv|U47102.2   Cloning vector pALTER<R>-Ex1
...
- The 4,814 ctgs were aligned to EcoliK12
- 4,299 aligned 200bp+ to Ecoli
- 3,877 aligned 100% to region 365521_365744 (224bp)
>NC_000913.2_365521_365744 Escherichia coli K12, complete genome
CATGGTCATAGCTGTTTCCTGTGTGAAATTGTTATCCGCTCACAATTCCACACAACATAC
GAGCCGGAAGCATAAAGTGTAAAGCCTGGGGTGCCTAATGAGTGAGCTAACTCACATTAA
TTGCGTTGCGCTCACTGCCCGCTTTCCAGTCGGGAAACCTGTCGTGCCAGCTGCATTAAT
GAATCGGCCAACGCGCGGGGAGAGGCGGTTTGCGTATTGGGCGC
- The 4,814 ctgs were assembled based on 11,912 reads
BCM           SHOTGUN   11247 (out of 10M reads)
BCM           WGS         415 (out of 24M reads)
NISC          SHOTGUN     213 (out of 0.7M reads)
BARC          CLONEEND     14
...
BCM           CLONEEND     11
BCCAGSC       CLONEEND      7
TIGR          CLONEEND      4
TIGR_JCVIJTC  CLONEEND      1
- Avg UMD clipping range of the 11,912 reads is 840 bp (vs 778 bp avg for the 3.53M assembled reads)
- Other: /fs/ftp-cbcb/pub/data/Bos_taurus/Bos_taurus_UMD_2.0/odd-contaminants.fa
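The region name NC_000913.2_365521_365744 in the notes above is listed as 224 bp, which is consistent with 1-based, inclusive coordinates (365744 - 365521 + 1 = 224). The sketch below parses such a name, checks the length against the sequence shown above, and shows the matching Python slice. The helper names are hypothetical, and the accession_start_end naming scheme is inferred from this one example.

```python
def parse_region_name(name):
    """Split 'ACCESSION_START_END' into (accession, start, end)."""
    accession, start, end = name.rsplit("_", 2)
    return accession, int(start), int(end)


def region_length(start, end):
    """Length of a 1-based, inclusive interval."""
    return end - start + 1


def extract_region(sequence, start, end):
    """Slice a 1-based, inclusive interval out of a Python string."""
    return sequence[start - 1:end]


# The 224 bp region sequence as listed in the notes above.
REGION_SEQ = (
    "CATGGTCATAGCTGTTTCCTGTGTGAAATTGTTATCCGCTCACAATTCCACACAACATAC"
    "GAGCCGGAAGCATAAAGTGTAAAGCCTGGGGTGCCTAATGAGTGAGCTAACTCACATTAA"
    "TTGCGTTGCGCTCACTGCCCGCTTTCCAGTCGGGAAACCTGTCGTGCCAGCTGCATTAAT"
    "GAATCGGCCAACGCGCGGGGAGAGGCGGTTTGCGTATTGGGCGC"
)

accession, start, end = parse_region_name("NC_000913.2_365521_365744")
print(accession, region_length(start, end), len(REGION_SEQ))
```

Given the full reference sequence as a string, `extract_region(genome, 365521, 365744)` would return the same 224 bases under this coordinate convention.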
Local files
- Freeze dir files
/fs/szasmg3/bos_taurus/Bos_taurus_UMD_2.0/contigs.unplaced.fa : sequences
/fs/szasmg3/bos_taurus/Bos_taurus_UMD_2.0/bos_taurus.agp : all scaffolds
/fs/ftp-cbcb/pub/data/assembly/Bos_taurus/Bos_taurus_UMD_2.0
- Files uploaded
Ftp server: ftp-private.ncbi.nlm.nih.gov
Account: cbcb_trc
Dir: uploads/
Local files: /fs/szasmg3/dpuiu/bos_taurus/submission/ftp/ : 22 *sqn + 1 agp
Contaminant search
Ecoli
UniVec_Core
UMD2.other
- 83(82) ctgs align to 65 ref sequences
- 10 ctgs are Acinetobacter baumannii
pwd
/fs/szasmg3/dpuiu/bos_taurus/submission/nucmer_contaminant
join UMD2.contaminant.other-ctg.ref_hits ~/db/bos_taurus.UMD2.contaminant.infoseq | sort -nk3 -r | head
7180003370686_12513_13066  553    16  554    phage
7180003320028              13090  10  13090  Acinetobacter baumannii
7180003341208_1_647        646    8   647    phage
...
contigs     <2000  >2000  min  max     mean   med    n50     sum
82          20     62     709  397429  65384  44841  138429  5361543
alignments  <200   >200   min  max     mean   med    n50     sum
103         33     70     105  3312    467    276    688     48109
Files:
/fs/szasmg3/dpuiu/bos_taurus/submission/nucmer_contaminant/UMD2.contaminant.Acinetobacter-ctg.qry_hits # 10 UMD2.0 Acinetobacter ctg ids
/fs/szasmg3/dpuiu/bos_taurus/submission/nucmer_contaminant/Acinetobacter.all-ctg.filter-q.qry_hits # 22 UMD2.0 Acinetobacter ctg ids ; 7 in common with the 10 above
# 25 Acinetobacter ctg's
    ctgid          ctglen
1   7180003321583  96481
2   7180003308373  9419
3   7180003319195  8955
4   7180003317370  8045
5   7180003290024  5922
6   join100003699  5649
7   7180003288988  3618
8   7180003308907  3157
9   7180003234806  3100
10  7180003319189  2966
11  7180003202299  2653
12  7180003217023  2213
13  7180003219002  2161
14  7180003219018  2010
15  7180003292866  1767
16  7180003215440  1617
17  7180003235746  1573
18  7180003235747  1524
19  7180003234890  1422
20  7180003219292  1329
21  7180003221397  1308
22  7180003221476  1243
23  7180003235699  1139
24  7180003214110  1100
25  deg0003235855  1062