Bos taurus: Difference between revisions
Latest revision as of 17:51, 4 November 2010
Articles
- A whole-genome assembly of the domestic cow, Bos taurus (Genome Biology April 2009)
- Bos taurus genome assembly (BMC Genomics April 2009) Baylor's assembly paper
- Cattle genome sequenced (Science News)
- The genome sequence of taurine cattle: a window to ruminant biology and evolution (Science 2009)
- A physical map of the bovine genome (Genome Biology 2007)
- http://www.animalgenome.org/bioinfo/
NCBI Traces
SPECIES_CODE = "BOS TAURUS"   37,788,710 traces
SPECIES_CODE = "BOS TAURUS" and CENTER_NAME = "BCM"   35,596,825 traces
SPECIES_CODE = "BOS TAURUS" and CENTER_NAME = "BCM" and TRACE_TYPE_CODE = "WGS"   24,863,627 traces
SPECIES_CODE = "BOS TAURUS" and CENTER_NAME = "BCM" and TRACE_TYPE_CODE = "SHOTGUN"   10,716,306 traces
SPECIES_CODE = "BOS TAURUS" and CENTER_NAME = "BCM" and TRACE_TYPE_CODE = "CLONEEND"   16,892 traces
BCM Assembly
- Genome Project @ Baylor
- Btau_4.0 @ NCBI ; AAFC0000000.3 (not available yet)
- Btau_4.0 @ UCSC; Btau vs HomoSapiens
        #elem   min       max        mean      median    n50        sum
contig  131620  91        326010     20755     10365     44270      2731814362
placed  101579  91        250125     24286     13928     47485      2466971326
chrom   30      44060403  161106243  87813777  84419198  106383598  2634413324

Chr    Span       GC%
chr1   161106243  40.76
chr2   140800416  41.21
chr3   127923604  42.29
chr4   124454208  41.01
chr5   125847759  42.02
chr6   122561022  40.60
chr7   112078216  42.39
chr8   116942821  41.70
chr9   108145351  40.53
chr10  106383598  41.84
chr11  110171769  43.16
chr12  85358539   41.00
chr13  84419198   44.00
chr14  81345643   41.59
chr15  84633453   42.34
chr16  77906053   42.91
chr17  76506943   42.70
chr18  66141439   45.87
chr19  65312493   46.32
chr20  75796353   41.51
chr21  69173390   43.20
chr22  61848140   43.59
chr23  53376148   43.75
chr24  65020233   42.27
chr25  44060403   47.13
chr26  51750746   43.16
chr27  48749334   42.19
chr28  46084206   42.61
chr29  51998940   44.34
chrX   88516663   41.11
chrM   16338      39.42
chrUn  283544868  ?      # 11869 contigs
Files:
/fs/szasmg3/bos_taurus/BOSTAU4
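The summary rows on this page (#elem, min, max, mean, median, n50, sum) can be recomputed from a list of sequence lengths. The sketch below is one way to do that and is not the script that produced the tables; the function name `length_stats`, the floored mean, and the choice of the upper-middle element as the median are assumptions, picked because they reproduce the "chrom" row from the chromosome spans listed above.

```python
def length_stats(lengths):
    """Return (elem, min, max, mean, median, n50, sum) for sequence lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    asc = sorted(lengths)
    total = sum(asc)
    # N50: walking from the longest sequence down, the length at which the
    # running total first reaches half of all bases.
    running = 0
    n50 = asc[0]
    for length in reversed(asc):
        running += length
        if 2 * running >= total:
            n50 = length
            break
    # Assumptions: floored integer mean and upper-middle element as median.
    # Both match the "chrom" row, but the original tool is not known.
    mean = total // len(asc)
    median = asc[len(asc) // 2]
    return (len(asc), asc[0], asc[-1], mean, median, n50, total)


# Btau_4.0 chromosome spans (chr1..chr29, then chrX) copied from the table.
BTAU4_CHROM_SPANS = [
    161106243, 140800416, 127923604, 124454208, 125847759, 122561022,
    112078216, 116942821, 108145351, 106383598, 110171769, 85358539,
    84419198, 81345643, 84633453, 77906053, 76506943, 66141439,
    65312493, 75796353, 69173390, 61848140, 53376148, 65020233,
    44060403, 51750746, 48749334, 46084206, 51998940, 88516663,
]

print(length_stats(BTAU4_CHROM_SPANS))
```

With the 30 spans above (chrM and chrUn left out), the result matches the "chrom" row: 30 elements, N50 of 106383598 (chr10) and a sum of 2634413324.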
Children's Hospital Oakland Research Institute
- Bovine BAC Library (male):
- 6 finished BACs
- NCBI links
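http://www.ncbi.nlm.nih.gov/nuccore/171461043
http://www.ncbi.nlm.nih.gov/nuccore/171461042
http://www.ncbi.nlm.nih.gov/nuccore/171461041
http://www.ncbi.nlm.nih.gov/nuccore/171461040
http://www.ncbi.nlm.nih.gov/nuccore/171461039
http://www.ncbi.nlm.nih.gov/nuccore/167744683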
UMD2.0 Assembly
- qc
     #elem
scf  134612
ctg  194643
- Used UMDoverlapper to trim reads
       #elem     min  max   mean  median  n50  sum
reads  35237868  68   1418  778   840     864  27406137041
- Library re-estimates:
paste *dst *mdi | perl -ane 'print " $_" if($F[1]!=$F[4]);'
35237870  150000  50000  35237870  162386  21158
35237871  2496    1431   35237871  2409    1137
35237873  3001    829    35237873  2973    770
35237875  150001  50000  35237875  172955  43831
35237876  1629    282    35237876  1595    245
35237877  3063    1326   35237877  3193    1131
35237878  6756    836    35237878  6701    793
35237879  2569    293    35237879  2547    285
35237880  150002  50000  35237880  160984  26638
35237881  2749    446    35237881  2697    325
35237883  3593    1213   35237883  3463    1232
35237884  3165    700    35237884  3172    699
35237885  3812    533    35237885  3804    537
35237886  2754    1432   35237886  2701    1289
35237887  4977    693    35237887  4968    694
35237889  2710    1529   35237889  2566    1225
35237890  150003  50000  35237890  161995  26438
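The one-liner above pastes the two library files side by side and prints the rows whose second and fifth fields differ. A rough Python equivalent is sketched below, assuming each row is (library id, mean, stdev) with the first three columns coming from the *dst file and the last three from the *mdi file. The function name `changed_libraries`, the library-id check, and the percent-shift column are additions for illustration and are not part of the original command; library 35237872 and its values are hypothetical and stand in for an unchanged row.

```python
def changed_libraries(dst_rows, mdi_rows):
    """Pair rows by position (like `paste`) and keep those whose mean changed."""
    if len(dst_rows) != len(mdi_rows):
        raise ValueError("the two inputs must have the same number of rows")
    changed = []
    for (lib_a, mean_a, sd_a), (lib_b, mean_b, sd_b) in zip(dst_rows, mdi_rows):
        # Extra safety check that the perl one-liner does not perform.
        if lib_a != lib_b:
            raise ValueError(f"row mismatch: {lib_a} vs {lib_b}")
        if mean_a != mean_b:
            shift_pct = round(100.0 * (mean_b - mean_a) / mean_a, 1)
            changed.append((lib_a, mean_a, sd_a, mean_b, sd_b, shift_pct))
    return changed


# First two libraries from the table above, plus one hypothetical unchanged row.
dst = [(35237870, 150000, 50000), (35237871, 2496, 1431), (35237872, 3000, 300)]
mdi = [(35237870, 162386, 21158), (35237871, 2409, 1137), (35237872, 3000, 300)]

for row in changed_libraries(dst, mdi):
    print(*row)
```

Only the first two libraries are reported; the unchanged row is filtered out, just as in the shell pipeline.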
- AGP
Chr         #Ctgs
Chr1        4617    156422777
Chr2        3468    137970877
Chr3        3260    119903216
Chr4        3032    120499176
Chr5        3103    119906797
Chr6        3570    116708387
Chr7        3049    109835480
Chr8        2954    110918838
Chr9        2584    104153020
Chr10       2712    103370270
Chr11       2778    105870899
Chr12       2673    88593048
Chr13       2147    83426589
Chr14       2549    84346988
Chr15       2655    84608865
Chr16       2547    80726864
Chr17       1885    71868308
Chr18       2195    65032274
Chr19       1809    63177714
Chr20       2146    70879676
Chr21       1967    70124586
Chr22       1628    60370627
Chr23       1434    51154144
Chr24       1380    61242035
Chr25       1274    42286642
Chr26       1668    51439476
Chr27       1381    45311792
Chr28       1186    45980083
Chr29       1803    50591405
ChrX        4883    136090029
Chr1..29,X  74337   2612810882
ChrU        113346  244744116
ChrY        94      832527

Chromosome mapped ctg/deg orientation:
-  33990
+  32686
0  7661
Chr1..30:
     elem   min  max     mean   med    n50    sum
ctg  63006  88   840370  41151  20696  89067  2592807255
deg  11331  251  21929   1765   1330   1781   20003627

ChrU:
     elem   min  max     mean  med   n50   sum
ctg  94017  89   166670  2362  1398  2537  222079922
deg  19329  71   13330   1172  1031  1127  22664194

Chr1..30 & ChrU:
     elem    min  max     mean   med   n50    sum
ctg  157023  88   840370  17926  1787  81230  2814887177
deg  30660   71   21929   1391   1112  1346   42667821
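Per-chromosome contig counts and the orientation tally above are the kind of numbers that can be read straight from an AGP file. The sketch below assumes the standard tab-separated AGP layout (object, object_beg, object_end, part_number, component_type, then component_id, component_beg, component_end, orientation for sequence lines, with component types N and U marking gaps). The function name `tally_agp` and the sample lines are hypothetical; this is not the script used to build the tables.

```python
from collections import Counter


def tally_agp(lines):
    """Count components, component bases, and orientations per AGP object."""
    contigs = Counter()      # object -> number of sequence components
    bases = Counter()        # object -> summed component lengths
    orientation = Counter()  # orientation symbol -> count
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        if fields[4] in ("N", "U"):
            continue  # gap lines carry no component
        obj = fields[0]
        contigs[obj] += 1
        bases[obj] += int(fields[7]) - int(fields[6]) + 1
        orientation[fields[8]] += 1
    return contigs, bases, orientation


# Hypothetical AGP lines for illustration only.
sample = [
    "# made-up example",
    "Chr1\t1\t1000\t1\tW\tctgA\t1\t1000\t+",
    "Chr1\t1001\t1100\t2\tN\t100\tfragment\tyes\t",
    "Chr1\t1101\t1600\t3\tW\tctgB\t1\t500\t-",
    "ChrU\t1\t300\t1\tW\tdegC\t1\t300\t0",
]

contigs, bases, orientation = tally_agp(sample)
print(dict(contigs), dict(bases), dict(orientation))
```

As a consistency check on the tables themselves, the three orientation counts (33990 + 32686 + 7661) add up to the 74337 contigs listed for Chr1..29,X.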
- haplotype-variants
         elem   min  max    mean  med   n50   sum
ctg+deg  12375  73   28074  1956  1429  2005  24209396
ctg      7499   73   28074  2426  1728  2671  18193542
deg      4876   147  6807   1233  1139  1213  6015854
Submission
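- Genome Project (http://www.ncbi.nlm.nih.gov/genomes/mpfsubmission.cgi?show=346F9839-6889-4BA9-924C-B35E6BA99A37) ; WGS id (GPID): 32899
- Genome Project (http://www.ncbi.nlm.nih.gov/sites/entrez?Db=genomeprj&cmd=ShowDetailView&TermToSearch=32899) ; WGS id (GPID): 32899
- DAAA00000000,DAAA01000000 (http://www.ncbi.nlm.nih.gov/nuccore/227462934)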
- Title: "A whole-genome assembly of the cow, Bos taurus"
- Authors:
Steven Salzberg
Aleksey Zimin
Arthur Delcher
Liliana Florea
David Kelley
Finian Hanrahan
Guillaume Marcais
Geo Pertea
Michael Roberts
Michael Schatz
Curt Van Tassell
James Yorke
Poorani S.
- Assembler:
Celera Assembler and UMD Overlapper.
- Sequencing Center :
Baylor College of Medicine.
- Source of DNA used for sequencing:
The source of the BAC library DNA was Hereford bull L1 Domino 99375, registration number 41170496. Dr. Michael MacNeil's laboratory, USDA-ARS, Miles City, MT provided the blood. The DNA for the whole genome shotgun sequences was provided by Dr. Timothy Smith's laboratory, U.S. Meat Animal Research Center, Clay Center, NE from white blood cells from L1 Dominette 01449, American Hereford Association registration number 42190680 (a daughter of L1 Domino 99375). A skin cell fibroblast cell line from the same animal is available from Dr. Carol Chitko-McKown's laboratory, although there is no sequence from that cell line.
- Sequence modifiers:
[organism=Bos taurus][breed=Hereford][tech=wgs][chromosome=...]
- Submission: Use sequin
/nfshomes/dpuiu/szdevel/sequin.8.10/sequin
- Sequence:
Contig length summary:
          #seqs   min  max     mean   median  n50    sum
all       210657  71   840370  13709  1523    78511  2887902366
placed    75775   88   840370  34512  13416   88287  2615171268
unplaced  134882  71   166670  2022   1322    1742   272731098
Duplicates
- Duplicate contigs : 111+2 : one copy was removed
deg0003136509,7180003440308 : both unplaced
deg0003084562,7180002954167 : both unplaced
Contaminants
NCBI (1st batch)
Through "Foreign Contamination Screen" : http://www.ncbi.nlm.nih.gov/projects/WGS/screens/DAAA01_120508/
- list.exclude_contigs 4,813 ctgs (3,939 vector + 394 Ecoli + 452 other) (73 mito, 43 deg, 8 Acinetobacter baumannii)
- list.trim_contigs 19,049 ctgs (18,336 vector + 289 Ecoli + 397 other)
Steven search against Ecoli MG1655
- 12/11/2008 : Found 121 ctgs that align to Ecoli
/fs/szasmg3/bos_taurus/Bos_taurus_UMD_2.0/salzberg/Eco-vs-cow.mum
/fs/szasmg3/bos_taurus/Bos_taurus_UMD_2.0/salzberg/EcoK12.fna
NCBI (2nd batch)
- More ctgs found (bug in the 1st search)
- list.exclude_contigs 4 ctgs
- list.trim_contigs 46 ctgs
Overall
- Counts
Bos_taurus.UMD2.exclude.count
Bos_taurus.UMD2.trim.count
Summary
                      #ctgs   min  max     mean   median  n50    total_bp
UMD_Freeze2.0_contam  210657  71   840370  13709  1523    78508  2887902366
UMD_Freeze2.0         187683  71   840370  15225  1609    79580  2857554998
difference            22974
Contaminant region summary:
         elem   min  max    mean  med   n50   sum
exclude  4817   316  16661  1510  1485  1514  7276894
trim     30325  48   2479   354   319   446   10745455
all      35142  48   16661  512   362   674   18022349
Ecoli    746    54   16661  1125  1111  1264  839899
vector   33540  49   3128   487   346   603   16340397
other    910    53   13090  1006  1037  1329  916117
Exclude sequences (examples):
7180003318605  16661  Escherichia coli str. K12 substr  # DH10B
7180003320028  13090  Acinetobacter baumannii
7180003316967  7473   Escherichia coli str. K12 substr  # DH10B
7180003313366  7098   Acinetobacter baumannii
7180003195772  4993   Serratia marcescens
7180003288790  4668   Klebsiella pneumoniae
7180003310064  4371   Escherichia coli  # all 3
7180003262150  4275   Escherichia coli
7180003288789  3565   Serratia marcescens
join100003627  3563   Acinetobacter baumannii
...
7180003289260  3128   Escherichia coli or vector
7180003292886  2957   vector
7180003166540  2081   mitochondrion
7180003310112  1977   contaminants
7180002995790  1711   bacterial insertion sequence
7180003221530  1647   Bacillus cereus ATCC 10987
7180003259696  1597   Pseudomonas aeruginosa PAO1
7180003239826  1378   Macaca mulatta
Problems:
1: Vectors
- Q: Which vectors are the most frequent?
- A: align UMD2 vector contaminants to UniVec_Core:
- 28051(29113 maxmatch) out of 33540 align : UniVec_Core-UMD2.vector.ref_hits
- pBR322, Ecoli_lac*, pSacBII ...
- pBACe3.6 (U80929.2:11415-11517): CONTAINED
- 5489(4427 maxmatch) out of 33540 don't align (4373 are 100+bp, 11 are 500+bp) ; some were aligned by megablast to 100 NCBI sequences, mostly vectors: Bos_taurus.UMD2.vector.no_UniVec_Core.no_Ecoli.all.blastn_hits
2: Ecoli
- There are 22 Ecoli strains * 3 Ecoli K12 substrains
- MG1655 is the 1st one completed, DH10B and W3110 have been recently(?) completed
- Contain unique seqs as long as 28K
cat /fs/szdata/genomes/ncbi/Bacteria/Escherichia_coli_K_12*/*fna | infoseq -description
NC_010473.1  4686137  50.78  Escherichia coli str. K-12 substr. DH10B, complete genome
NC_000913.2  4639675  50.79  Escherichia coli str. K-12 substr. MG1655, complete genome
AC_000091.1  4646332  50.80  Escherichia coli str. K-12 substr. W3110, complete genome
- out of 746 UMD2 Ecoli seqs, 636(723 maxmatch) aligned to Ecoli.all
3: Other
- contaminants: 289 ; mostly Ecoli
- phage: 16 ; 1 aligns to Ecoli, all <1108bp
- IS: 398; all <1711 bp; 14 align to Ecoli.all & 1 to UniVec_Core; a few, when BLASTed at NCBI, aligned to mammals (!)
- mitochondrion: 74 seqs: all align to ~/db/bos_taurus.mitochondrion
- Others: 130; (Acinetobacter baumannii ...)
- Files:
/fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/bos_taurus.UMD2.exclude.count (4813 exclude sequence counts)
/fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/bos_taurus.UMD2.contaminant.list (35142 contaminant sequences: exclude+trim)
/fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/bos_taurus.UMD2.contaminant.fasta

/fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/bos_taurus.UMD2.contaminant.infoseq 35142
/fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/bos_taurus.UMD2.Ecoli.infoseq 746
/fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/bos_taurus.UMD2.vector.infoseq 33540
/fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/odd-contaminants.infoseq 14
/fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/nucmer/bos_taurus.UMD2.vector.no_UniVec_Core.no_Ecoli.all.blastn_hits.fasta # 100 vector seqs not in UniVec (pPAC7 ...) -> ~dpuiu/db/OtherVec
- Dirs:
/fs/szasmg3/dpuiu/bos_taurus/submission/decontam
Notes:
- The 4,814 ctgs were aligned to UniVec (-c 20 ; delta-filter -q)
- 4,247 ctgs aligned to 89 vector seqs
- top ref hits:
gnl|uv|J01749.1    Cloning vector pBR322
gnl|uv|J01636.1    E.coli lactose operon with lacI, lacZ, lacY and lacA genes
gnl|uv|AF102576.1  Cloning vector pSOS
gnl|uv|L08959.1    pUC8 cloning vector
gnl|uv|L08931.1    pMAC7-8 cloning vector for site-directed mutagenesis
gnl|uv|L09145.1    pUR222 cloning vector
gnl|uv|U47102.2    Cloning vector pALTER<R>-Ex1
...
- The 4,814 ctgs were aligned to EcoliK12
- 4,299 aligned 200bp+ to Ecoli
- 3,877 aligned 100% to region 365521_365744 (224bp)
>NC_000913.2_365521_365744 Escherichia coli K12, complete genome
CATGGTCATAGCTGTTTCCTGTGTGAAATTGTTATCCGCTCACAATTCCACACAACATAC
GAGCCGGAAGCATAAAGTGTAAAGCCTGGGGTGCCTAATGAGTGAGCTAACTCACATTAA
TTGCGTTGCGCTCACTGCCCGCTTTCCAGTCGGGAAACCTGTCGTGCCAGCTGCATTAAT
GAATCGGCCAACGCGCGGGGAGAGGCGGTTTGCGTATTGGGCGC
- The 4,814 ctgs were assembled based on 11,912 reads
BCM           SHOTGUN   11247  (out of 10M reads)
BCM           WGS       415    (out of 24M reads)
NISC          SHOTGUN   213    (out of 0.7M reads)
BARC          CLONEEND  14     ...
BCM           CLONEEND  11
BCCAGSC       CLONEEND  7
TIGR          CLONEEND  4
TIGR_JCVIJTC  CLONEEND  1
- Avg UMD clipping range of the 11,912 reads is 840bp (vs 778 avg for the 35.2M assembled reads)
- Other: /fs/ftp-cbcb/pub/data/Bos_taurus/Bos_taurus_UMD_2.0/odd-contaminants.fa
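The region named in the notes above, NC_000913.2_365521_365744, encodes a reference id plus 1-based inclusive start and end coordinates, which is why it is 224 bp long (365744 - 365521 + 1). The sketch below shows that coordinate arithmetic and the matching Python slice. The helper names are hypothetical, and the `ref_start_end` naming convention is inferred from this one example rather than documented anywhere on the page.

```python
def parse_region_id(name):
    """Split 'ref_start_end' into (ref, start, end); the ref may contain '_'."""
    ref, start, end = name.rsplit("_", 2)
    return ref, int(start), int(end)


def region_length(start, end):
    """Length of a 1-based, inclusive interval."""
    return end - start + 1


def extract_region(sequence, start, end):
    """Return sequence[start..end] using 1-based inclusive coordinates."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("coordinates fall outside the sequence")
    return sequence[start - 1:end]


ref, start, end = parse_region_id("NC_000913.2_365521_365744")
print(ref, start, end, region_length(start, end))

# Toy sequence for illustration; the real input would be the MG1655 genome.
print(extract_region("ACGTACGTAC", 3, 6))
```

The same slice applied to the MG1655 genome sequence would be expected to return the 224 bp fragment shown in the notes, assuming the coordinates refer to the forward strand.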
Local files
- Freeze dir files
/fs/szasmg3/bos_taurus/Bos_taurus_UMD_2.0/contigs.unplaced.fa : sequences
/fs/szasmg3/bos_taurus/Bos_taurus_UMD_2.0/bos_taurus.agp : all scaffolds
/fs/szasmg3/bos_taurus/UMD_Freeze2.0/reads.placed.gz: 31,942,023 reads (read_id, read_clr, ctg_id, scf_id, ctg_pos, scf_pos)
/fs/ftp-cbcb/pub/data/assembly/Bos_taurus/Bos_taurus_UMD_2.0
- Files uploaded
Ftp server: ftp-private.ncbi.nlm.nih.gov
Account: cbcb_trc
Dir: uploads/
Local files: /fs/szasmg3/dpuiu/bos_taurus/submission/ftp/ : 22 *sqn + 1 agp
Contaminant search
Ecoli
UniVec_Core
UMD2.other
- 83(82) ctgs align to 65 ref sequences
- 10 ctgs are Acinetobacter baumannii
pwd
/fs/szasmg3/dpuiu/bos_taurus/submission/nucmer_contaminant

join UMD2.contaminant.other-ctg.ref_hits ~/db/bos_taurus.UMD2.contaminant.infoseq | sort -nk3 -r | head
7180003370686_12513_13066  553    16  554    phage
7180003320028              13090  10  13090  Acinetobacter baumannii
7180003341208_1_647        646    8   647    phage
...

contigs  <2000  >2000  min  max     mean   med    n50     sum
82       20     62     709  397429  65384  44841  138429  5361543

alignments  <200  >200  min  max   mean  med  n50  sum
103         33    70    105  3312  467   276  688  48109
Files:
/fs/szasmg3/dpuiu/bos_taurus/submission/nucmer_contaminant/UMD2.contaminant.Acinetobacter-ctg.qry_hits # 10 UMD2.0 Acinetobacter ctg ids
/fs/szasmg3/dpuiu/bos_taurus/submission/nucmer_contaminant/Acinetobacter.all-ctg.filter-q.qry_hits # 22 UMD2.0 Acinetobacter ctg ids ; 7 in common with the 10 above

# 25 Acinetobacter ctg's
    ctgid          ctglen
1   7180003321583  96481
2   7180003308373  9419
3   7180003319195  8955
4   7180003317370  8045
5   7180003290024  5922
6   join100003699  5649
7   7180003288988  3618
8   7180003308907  3157
9   7180003234806  3100
10  7180003319189  2966
11  7180003202299  2653
12  7180003217023  2213
13  7180003219002  2161
14  7180003219018  2010
15  7180003292866  1767
16  7180003215440  1617
17  7180003235746  1573
18  7180003235747  1524
19  7180003234890  1422
20  7180003219292  1329
21  7180003221397  1308
22  7180003221476  1243
23  7180003235699  1139
24  7180003214110  1100
25  deg0003235855  1062
Other issues
Segmental duplications
- David Kelley seminar
- UMD1.6
- inclusions: 384 (1.1Mbp)
- joins: 1090 (1.1Mbp)