Bos taurus: Difference between revisions

= Articles =
* [http://genomebiology.com/2009/10/4/R42  A whole-genome assembly of the domestic cow, Bos taurus] (Genome Biology April 2009)
* [http://www.biomedcentral.com/1471-2164/10/180/abstract Bos taurus genome assembly] (BMC Genomics, April 2009), Baylor's assembly paper
* [http://www.sciencenews.org/view/generic/id/43190/description/Cattle_genome_sequenced Cattle genome sequenced] (Science News)
* [http://www.ncbi.nlm.nih.gov/pubmed/19390049 The genome sequence of taurine cattle: a window to ruminant biology and evolution] (Science 2009)
* [http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2374996  A physical map of the bovine genome] (Genome Biology 2007)
* [http://www.animalgenome.org/bioinfo/]
= NCBI Traces =

* [http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj&cmd=search&term=bos%20taurus Genome Projects]
* [ftp://ftp.ncbi.nih.gov/pub/TraceDB/bos_taurus TA ftp]
* [http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=retrieve&size=0&val=SPECIES_CODE+%3D+%22BOS+TAURUS%22 TA search]
   SPECIES_CODE = "BOS TAURUS"                                                          37,788,710 traces
= BCM Assembly =


* [http://www.hgsc.bcm.tmc.edu/projects/bovine/ Genome Project] @ Baylor
* [http://www.ncbi.nlm.nih.gov/mapview/stats/BuildStats.cgi?taxid=9913&build=4&ver=1  Btau_4.0] @ NCBI ;  AAFC0000000.3 (not available yet)
* [http://hgdownload.cse.ucsc.edu/downloads.html#cow Btau_4.0] @ UCSC; [http://hgdownload.cse.ucsc.edu/goldenPath/bosTau4/vsHg18/ Btau vs HomoSapiens]

           #elem  min            max            mean            median          n50            sum
   placed  101579  91              250125          24286          13928          47485          2466971326
   chrom  30      44060403        161106243      87813777        84419198        106383598      2634413324
  Chr            Span      GC%
  chr1          161106243 40.76
  chr2          140800416 41.21
  chr3          127923604 42.29
  chr4          124454208 41.01
  chr5          125847759 42.02
  chr6          122561022 40.60
  chr7          112078216 42.39
  chr8          116942821 41.70
  chr9          108145351 40.53
  chr10          106383598 41.84
  chr11          110171769 43.16
  chr12          85358539  41.00
  chr13          84419198  44.00
  chr14          81345643  41.59
  chr15          84633453  42.34
  chr16          77906053  42.91
  chr17          76506943  42.70
  chr18          66141439  45.87
  chr19          65312493  46.32
  chr20          75796353  41.51
  chr21          69173390  43.20
  chr22          61848140  43.59
  chr23          53376148  43.75
  chr24          65020233  42.27
  chr25          44060403  47.13
  chr26          51750746  43.16
  chr27          48749334  42.19
  chr28          46084206  42.61
  chr29          51998940  44.34
  chrX          88516663  41.11
  chrM          16338    39.42
  chrUn          283544868  ?    #  11869 contigs
Files:
  /fs/szasmg3/bos_taurus/BOSTAU4
= Children's Hospital Oakland Research Institute =
* Bovine BAC Library (male):
* 6 finished BACs
* NCBI links
  [http://www.ncbi.nlm.nih.gov/nuccore/171461043]
  [http://www.ncbi.nlm.nih.gov/nuccore/171461042]
  [http://www.ncbi.nlm.nih.gov/nuccore/171461041]
  [http://www.ncbi.nlm.nih.gov/nuccore/171461040]
  [http://www.ncbi.nlm.nih.gov/nuccore/171461039]
  [http://www.ncbi.nlm.nih.gov/nuccore/167744683]


= UMD2.0 Assembly =
 
* qc
          #elem
  scf      134612
  ctg      194643


* Used UMDoverlapper to trim reads
   reads    35237868    68      1418    778    840    864    27406137041

* Library re-estimates:

   paste *dst *mdi | perl -ane 'print "  $_" if($F[1]!=$F[4]);'
   35237889      2710    1529    35237889        2566    1225
   35237890      150003  50000  35237890        161995  26438
* AGP
  Chr  #Ctgs  #bp
  Chr1    4617    156422777
  Chr2    3468    137970877
  Chr3    3260    119903216
  Chr4    3032    120499176
  Chr5    3103    119906797
  Chr6    3570    116708387
  Chr7    3049    109835480
  Chr8    2954    110918838
  Chr9    2584    104153020
  Chr10  2712    103370270
  Chr11  2778    105870899
  Chr12  2673    88593048
  Chr13  2147    83426589
  Chr14  2549    84346988
  Chr15  2655    84608865
  Chr16  2547    80726864
  Chr17  1885    71868308
  Chr18  2195    65032274
  Chr19  1809    63177714
  Chr20  2146    70879676
  Chr21  1967    70124586
  Chr22  1628    60370627
  Chr23  1434    51154144
  Chr24  1380    61242035
  Chr25  1274    42286642
  Chr26  1668    51439476
  Chr27  1381    45311792
  Chr28  1186    45980083
  Chr29  1803    50591405
  ChrX    4883    136090029
  Chr1..29,X  74337  2612810882
  ChrU    113346  244744116
  ChrY    94      832527
  Chromosome mapped ctg/deg orientation:
  -      33990
  +      32686
  0      7661
  Chr1..30:
            elem      min        max        mean      med        n50        sum
  ctg      63006      88        840370    41151      20696      89067      2592807255 
  deg      11331      251        21929      1765      1330      1781      20003627   
  ChrU:
            elem      min        max        mean      med        n50        sum
  ctg      94017      89        166670    2362      1398      2537      222079922
  deg      19329      71        13330      1172      1031      1127      22664194
  Chr1..30 & ChrU:
            elem      min        max        mean      med        n50        sum
  ctg      157023    88        840370    17926      1787      81230      2814887177
  deg      30660      71        21929      1391      1112      1346      42667821
* haplotype-variants
            elem      min        max        mean      med        n50        sum           
  ctg+deg  12375      73        28074      1956      1429      2005      24209396     
  ctg      7499      73        28074      2426      1728      2671      18193542       
  deg      4876      147        6807      1233      1139      1213      6015854


= Submission =

* [http://www.ncbi.nlm.nih.gov/genomes/mpfsubmission.cgi?show=346F9839-6889-4BA9-924C-B35E6BA99A37  Genome Project] ; WGS id (GPID): 32899
* [http://www.ncbi.nlm.nih.gov/sites/entrez?Db=genomeprj&cmd=ShowDetailView&TermToSearch=32899  Genome Project] ; WGS id (GPID): 32899
* [http://www.ncbi.nlm.nih.gov/nuccore/227462934 DAAA00000000,DAAA01000000]

* Title: "A whole-genome assembly of the cow, Bos taurus"
   placed    75775  88      840370  34512  13416  88287  2615171268
   unplaced  134882  71      166670  2022    1322    1742    272731098


= Duplicates =  


Problems:

1: Vectors
* Q: Which vectors are the most frequent?
* A: align UMD2 vector contaminants to UniVec_Core:
** 28051 (29113 maxmatch) out of 33540 align : [[Media:UniVec_Core-UMD2.vector.ref_hits|UniVec_Core-UMD2.vector.ref_hits]]
** pBR322, Ecoli_lac*, pSacBII ...
** pBACe3.6 (U80929.2:11415-11517): CONTAINED
** 5489 (4427 maxmatch) out of 33540 don't align (4373 are 100+bp, 11 are 500+bp) ; some were aligned by megablast to 100 NCBI sequences, mostly vectors: [[Media:Bos_taurus.UMD2.vector.no_UniVec_Core.no_Ecoli.all.blastn_hits|Bos_taurus.UMD2.vector.no_UniVec_Core.no_Ecoli.all.blastn_hits]]

2: Ecoli
* There are 22 Ecoli strains * 3 Ecoli K12 substrains
* MG1655 is the 1st one completed, DH10B and W3110 have been recently(?) completed
* Contain unique seqs as long as 28K
   NC_000913.2    4639675 50.79  Escherichia coli str. K-12 substr. MG1655, complete genome
   AC_000091.1    4646332 50.80  Escherichia coli str. K-12 substr. W3110, complete genome
* out of 746 UMD2 Ecoli seqs, 636 (723 maxmatch) aligned to Ecoli.all

3: Other
* contaminants: 289 ; mostly Ecoli
* phage: 16 ; 1 aligns to Ecoli, all <1108bp
* IS: 398; all <1711 bp; 14 align to Ecoli.all & 1 to UniVec_Core; a few, when blasted at NCBI, aligned to mammals (!)
* mitochondrion: 74 seqs: all align to ~/db/bos_taurus.mitochondrion
* Others: 130;  (Acinetobacter baumannii ...)


* Files:
   /fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/odd-contaminants.infoseq            14

  /fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/nucmer/bos_taurus.UMD2.vector.no_UniVec_Core.no_Ecoli.all.blastn_hits.fasta # 100 vector seqs not in UniVec (pPAC7 ...) -> ~dpuiu/db/OtherVec

* Dirs:

   /fs/szasmg3/bos_taurus/Bos_taurus_UMD_2.0/contigs.unplaced.fa  : sequences
   /fs/szasmg3/bos_taurus/Bos_taurus_UMD_2.0/bos_taurus.agp      : all scaffolds

  /fs/szasmg3/bos_taurus/UMD_Freeze2.0/reads.placed.gz: 31,942,023 reads (read_id, read_clr, ctg_id, scf_id, ctg_pos, scf_pos)

* [ftp://ftp.cbcb.umd.edu/pub/data/assembly/Bos_taurus/Bos_taurus_UMD_2.0 CBCB ftp]
   Dir: uploads/
   Local files: /fs/szasmg3/dpuiu/bos_taurus/submission/ftp/  : 22 *sqn + 1 agp
= Contaminant search =
== Ecoli ==
== UniVec_Core ==
== UMD2.other ==
* 83(82) ctgs align to 65 ref sequences
* 10 ctgs are Acinetobacter baumannii
  pwd
  /fs/szasmg3/dpuiu/bos_taurus/submission/nucmer_contaminant
 
  join UMD2.contaminant.other-ctg.ref_hits ~/db/bos_taurus.UMD2.contaminant.infoseq | sort -nk3 -r | head
  7180003370686_12513_13066 553 16 554 phage
  7180003320028 13090 10 13090 Acinetobacter baumannii
  7180003341208_1_647 646 8 647 phage
  ...
 
  contigs          <2000      >2000      min        max        mean      med        n50        sum
  82              20        62        709        397429    65384      44841      138429    5361543
 
  alignments      <200      >200      min        max        mean      med        n50        sum           
  103              33        70        105        3312      467        276        688        48109
Files:
/fs/szasmg3/dpuiu/bos_taurus/submission/nucmer_contaminant/UMD2.contaminant.Acinetobacter-ctg.qry_hits  # 10 UMD2.0 Acinetobacter ctg ids
/fs/szasmg3/dpuiu/bos_taurus/submission/nucmer_contaminant/Acinetobacter.all-ctg.filter-q.qry_hits      # 22 UMD2.0 Acinetobacter ctg ids ; 7 in common with the 10 above
# 25 Acinetobacter ctg's
        ctgid        ctglen
    1  7180003321583 96481
    2  7180003308373 9419
    3  7180003319195 8955
    4  7180003317370 8045
    5  7180003290024 5922
    6  join100003699 5649
    7  7180003288988 3618
    8  7180003308907 3157
    9  7180003234806 3100
    10  7180003319189 2966
    11  7180003202299 2653
    12  7180003217023 2213
    13  7180003219002 2161
    14  7180003219018 2010
    15  7180003292866 1767
    16  7180003215440 1617
    17  7180003235746 1573
    18  7180003235747 1524
    19  7180003234890 1422
    20  7180003219292 1329
    21  7180003221397 1308
    22  7180003221476 1243
    23  7180003235699 1139
    24  7180003214110 1100
    25  deg0003235855 1062
= Other issues =
== Segmental duplications ==
* David Kelley seminar
* UMD1.6
** inclusions: 384  (1.1Mbp)
** joins:      1090 (1.1Mbp)

Latest revision as of 17:51, 4 November 2010

Articles

NCBI Traces

 SPECIES_CODE = "BOS TAURUS"                                                          37,788,710 traces
 SPECIES_CODE = "BOS TAURUS" and CENTER_NAME = "BCM"                                  35,596,825 traces
 
 SPECIES_CODE = "BOS TAURUS" and CENTER_NAME = "BCM" and TRACE_TYPE_CODE = "WGS"      24,863,627 traces
 SPECIES_CODE = "BOS TAURUS" and CENTER_NAME = "BCM" and TRACE_TYPE_CODE = "SHOTGUN"  10,716,306 traces
 SPECIES_CODE = "BOS TAURUS" and CENTER_NAME = "BCM" and TRACE_TYPE_CODE = "CLONEEND"     16,892 traces

BCM Assembly

         #elem   min             max             mean            median          n50             sum
 contig  131620  91              326010          20755           10365           44270           2731814362
 placed  101579  91              250125          24286           13928           47485           2466971326
 chrom   30      44060403        161106243       87813777        84419198        106383598       2634413324

 Chr            Span      GC%
 chr1           161106243 40.76
 chr2           140800416 41.21
 chr3           127923604 42.29
 chr4           124454208 41.01
 chr5           125847759 42.02
 chr6           122561022 40.60
 chr7           112078216 42.39
 chr8           116942821 41.70
 chr9           108145351 40.53
 chr10          106383598 41.84
 chr11          110171769 43.16
 chr12          85358539  41.00
 chr13          84419198  44.00
 chr14          81345643  41.59
 chr15          84633453  42.34
 chr16          77906053  42.91
 chr17          76506943  42.70
 chr18          66141439  45.87
 chr19          65312493  46.32
 chr20          75796353  41.51
 chr21          69173390  43.20
 chr22          61848140  43.59
 chr23          53376148  43.75
 chr24          65020233  42.27
 chr25          44060403  47.13
 chr26          51750746  43.16
 chr27          48749334  42.19
 chr28          46084206  42.61
 chr29          51998940  44.34
 chrX           88516663  41.11
 chrM           16338     39.42
 chrUn          283544868  ?     #  11869 contigs
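The chrom row of the summary table can be reproduced from the per-chromosome spans listed above (chr1..chr29 and chrX; chrM and chrUn are left out). What follows is a minimal Python sketch, not the script that produced the table. It assumes that the median is the upper-middle element of the sorted list, that the mean is rounded to the nearest integer, and that n50 is the length at which the cumulative sum, taken from longest to shortest, first reaches half of the total. The helper name length_stats is illustrative.

```python
# Spans of chr1..chr29 and chrX copied from the Btau_4.0 table (chrM and chrUn left out).
SPANS = [
    161106243, 140800416, 127923604, 124454208, 125847759, 122561022,
    112078216, 116942821, 108145351, 106383598, 110171769, 85358539,
    84419198, 81345643, 84633453, 77906053, 76506943, 66141439,
    65312493, 75796353, 69173390, 61848140, 53376148, 65020233,
    44060403, 51750746, 48749334, 46084206, 51998940, 88516663,
]


def length_stats(lengths):
    """Return the #elem/min/max/mean/median/n50/sum summary for a list of lengths."""
    ordered = sorted(lengths)
    total = sum(ordered)
    # N50: walk from the longest element down until half of the total is covered.
    running, n50 = 0, 0
    for length in reversed(ordered):
        running += length
        if 2 * running >= total:
            n50 = length
            break
    return {
        "elem": len(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "mean": round(total / len(ordered)),
        "median": ordered[len(ordered) // 2],  # upper-middle element (assumption)
        "n50": n50,
        "sum": total,
    }


if __name__ == "__main__":
    print(length_stats(SPANS))
```

With these conventions the sketch gives the same seven values as the chrom row, which suggests (but does not prove) that the other rows on this page follow the same definitions.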

Files:

 /fs/szasmg3/bos_taurus/BOSTAU4

Children's Hospital Oakland Research Institute

  • Bovine BAC Library (male):
  • 6 finished BACs
  • NCBI links
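 http://www.ncbi.nlm.nih.gov/nuccore/171461043
 http://www.ncbi.nlm.nih.gov/nuccore/171461042
 http://www.ncbi.nlm.nih.gov/nuccore/171461041
 http://www.ncbi.nlm.nih.gov/nuccore/171461040
 http://www.ncbi.nlm.nih.gov/nuccore/171461039
 http://www.ncbi.nlm.nih.gov/nuccore/167744683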

UMD2.0 Assembly

  • qc
          #elem
 scf      134612
 ctg      194643
  • Used UMDoverlapper to trim reads
          #elem       min     max     mean    median  n50     sum
 reads    35237868    68      1418    778     840     864     27406137041
  • Library re-estimates:
 paste *dst *mdi | perl -ane 'print "  $_" if($F[1]!=$F[4]);'
 35237870      150000  50000   35237870        162386  21158
 35237871      2496    1431    35237871        2409    1137
 35237873      3001    829     35237873        2973    770
 35237875      150001  50000   35237875        172955  43831
 35237876      1629    282     35237876        1595    245
 35237877      3063    1326    35237877        3193    1131
 35237878      6756    836     35237878        6701    793
 35237879      2569    293     35237879        2547    285
 35237880      150002  50000   35237880        160984  26638
 35237881      2749    446     35237881        2697    325
 35237883      3593    1213    35237883        3463    1232
 35237884      3165    700     35237884        3172    699
 35237885      3812    533     35237885        3804    537
 35237886      2754    1432    35237886        2701    1289
 35237887      4977    693     35237887        4968    694
 35237889      2710    1529    35237889        2566    1225
 35237890      150003  50000   35237890        161995  26438
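The paste/perl filter above keeps only the libraries whose mean differs between the two files. The same comparison can be sketched in Python with the relative change added. The column meaning is an assumption here (library id, original mean, original stdev, then library id, re-estimated mean, re-estimated stdev), and changed_libraries is an illustrative name rather than an existing tool.

```python
# A few rows from the paste output above. Column meaning is an assumption:
# lib_id, original mean, original stdev, lib_id, re-estimated mean, re-estimated stdev.
ROWS = """\
35237870 150000 50000 35237870 162386 21158
35237871 2496 1431 35237871 2409 1137
35237876 1629 282 35237876 1595 245
"""


def changed_libraries(text):
    """Return (lib_id, old_mean, new_mean, percent_change) for rows whose mean changed."""
    changed = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        lib_id, old_mean, new_mean = fields[0], int(fields[1]), int(fields[4])
        if old_mean != new_mean:  # same test as $F[1] != $F[4] in the perl filter
            percent = 100.0 * (new_mean - old_mean) / old_mean
            changed.append((lib_id, old_mean, new_mean, round(percent, 1)))
    return changed


if __name__ == "__main__":
    for row in changed_libraries(ROWS):
        print(*row, sep="\t")
```

Sorting the result by the absolute percent change would put the libraries with the largest shifts first, for example the 150 kbp libraries.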
  • AGP
 Chr   #Ctgs   #bp
 Chr1    4617    156422777
 Chr2    3468    137970877
 Chr3    3260    119903216
 Chr4    3032    120499176
 Chr5    3103    119906797
 Chr6    3570    116708387
 Chr7    3049    109835480
 Chr8    2954    110918838
 Chr9    2584    104153020
 Chr10   2712    103370270
 Chr11   2778    105870899
 Chr12   2673    88593048
 Chr13   2147    83426589
 Chr14   2549    84346988
 Chr15   2655    84608865
 Chr16   2547    80726864
 Chr17   1885    71868308
 Chr18   2195    65032274 
 Chr19   1809    63177714
 Chr20   2146    70879676
 Chr21   1967    70124586
 Chr22   1628    60370627
 Chr23   1434    51154144
 Chr24   1380    61242035
 Chr25   1274    42286642
 Chr26   1668    51439476
 Chr27   1381    45311792
 Chr28   1186    45980083
 Chr29   1803    50591405
 ChrX    4883    136090029
 Chr1..29,X  74337  2612810882
 ChrU    113346  244744116
 ChrY    94      832527
 Chromosome mapped ctg/deg orientation:
 -       33990
 +       32686
 0       7661
 Chr1..30:
           elem       min        max        mean       med        n50        sum
 ctg       63006      88         840370     41151      20696      89067      2592807255  
 deg       11331      251        21929      1765       1330       1781       20003627    
 ChrU:
           elem       min        max        mean       med        n50        sum
 ctg       94017      89         166670     2362       1398       2537       222079922
 deg       19329      71         13330      1172       1031       1127       22664194
 Chr1..30 & ChrU:
           elem       min        max        mean       med        n50        sum
 ctg       157023     88         840370     17926      1787       81230      2814887177
 deg       30660      71         21929      1391       1112       1346       42667821
  • haplotype-variants
           elem       min        max        mean       med        n50        sum            
 ctg+deg   12375      73         28074      1956       1429       2005       24209396       
 ctg       7499       73         28074      2426       1728       2671       18193542        
 deg       4876       147        6807       1233       1139       1213       6015854

Submission

  • Title: "A whole-genome assembly of the cow, Bos taurus"
  • Authors:
 Steven Salzberg
 Aleksey Zimin
 Arthur Delcher
 Liliana Florea
 David Kelley
 Finian Hanrahan
 Guillaume Marcais
 Geo Pertea
 Michael Roberts
 Michael Schatz
 Curt Van Tassell
 James Yorke
 Poorani S.
  • Assembler:
 Celera Assembler and UMD Overlapper.
  • Sequencing Center :
 Baylor College of Medicine. 
  • Source of DNA used for sequencing:

The source of the BAC library DNA was Hereford bull L1 Domino 99375, registration number 41170496. Dr. Michael MacNeil's laboratory, USDA-ARS, Miles City, MT provided the blood. The DNA for the whole genome shotgun sequences was provided by Dr. Timothy Smith's laboratory, U.S. Meat Animal Research Center, Clay Center, NE from white blood cells from L1 Dominette 01449, American Hereford Association registration number 42190680 (a daughter of L1 Domino 99375). A skin cell fibroblast cell line from the same animal is available from Dr. Carol Chitko-McKown's laboratory, although there is no sequence from that cell line.

  • Sequence modifiers:
 [organism=Bos taurus][breed=Hereford][tech=wgs][chromosome=...]
  • Submission: Use sequin
 /nfshomes/dpuiu/szdevel/sequin.8.10/sequin
  • Sequence:
 Contig length summary:
           #seqs   min     max     mean    median  n50     sum
 all       210657  71      840370  13709   1523    78511   2887902366
 placed    75775   88      840370  34512   13416   88287   2615171268
 unplaced  134882  71      166670  2022    1322    1742    272731098
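Before submission it is cheap to check that the summary rows agree with each other: placed plus unplaced should equal all, and each mean should equal sum / #seqs. A minimal Python sketch of that check follows, using the numbers from the table above. It assumes the reported means are rounded to the nearest integer, and check_summary is an illustrative helper.

```python
# Rows copied from the contig length summary above: (#seqs, mean, sum).
ROWS = {
    "all": (210657, 13709, 2887902366),
    "placed": (75775, 34512, 2615171268),
    "unplaced": (134882, 2022, 272731098),
}


def check_summary(rows):
    """Return a list of human-readable inconsistencies (empty if none are found)."""
    problems = []
    n_all, _, bp_all = rows["all"]
    n_parts = rows["placed"][0] + rows["unplaced"][0]
    bp_parts = rows["placed"][2] + rows["unplaced"][2]
    if n_parts != n_all:
        problems.append(f"#seqs: placed+unplaced={n_parts}, all={n_all}")
    if bp_parts != bp_all:
        problems.append(f"sum: placed+unplaced={bp_parts}, all={bp_all}")
    for name, (count, mean, total) in rows.items():
        # Means are assumed to be rounded to the nearest integer.
        if round(total / count) != mean:
            problems.append(f"{name}: mean {mean} != {total}/{count}")
    return problems


if __name__ == "__main__":
    print(check_summary(ROWS) or "summary rows are consistent")
```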

Duplicates

 deg0003136509,7180003440308 : both unplaced
 deg0003084562,7180002954167 : both unplaced

Contaminants

NCBI (1st batch)

Through "Foreign Contamination Screen" : http://www.ncbi.nlm.nih.gov/projects/WGS/screens/DAAA01_120508/

  • list.exclude_contigs 4,813 ctgs (3,939 vector + 394 Ecoli + 452 other) (73 mito, 43 deg, 8 Acinetobacter baumannii)
  • list.trim_contigs 19,049 ctgs (18,336 vector + 289 Ecoli + 397 other)

Steven search against Ecoli MG1655

  • 12/11/2008 : Found 121 ctgs that align to Ecoli
 /fs/szasmg3/bos_taurus/Bos_taurus_UMD_2.0/salzberg/Eco-vs-cow.mum 
 /fs/szasmg3/bos_taurus/Bos_taurus_UMD_2.0/salzberg/EcoK12.fna 

NCBI (2nd batch)

Overall

  • Counts
 Bos_taurus.UMD2.exclude.count
 Bos_taurus.UMD2.trim.count

Summary

                             #ctgs   min     max     mean    median  n50     total_bp
 UMD_Freeze2.0_contam        210657  71      840370  13709   1523    78508   2887902366
 UMD_Freeze2.0               187683  71      840370  15225   1609    79580   2857554998
 difference                   22974

Contaminant region summary:

            elem       min        max        mean       med        n50        sum
 exclude    4817       316        16661      1510       1485       1514       7276894
 trim       30325      48         2479       354        319        446        10745455
 all        35142      48         16661      512        362        674        18022349
 
 Ecoli      746        54         16661      1125       1111       1264       839899          
 vector     33540      49         3128       487        346        603        16340397       
 other      910        53         13090      1006       1037       1329       916117

Exclude sequences (examples):

 7180003318605   16661   Escherichia coli str. K12 substr  # DH10B
 7180003320028   13090   Acinetobacter baumannii
 7180003316967   7473    Escherichia coli str. K12 substr  # DH10B
 7180003313366   7098    Acinetobacter baumannii
 7180003195772   4993    Serratia marcescens
 7180003288790   4668    Klebsiella pneumoniae
 7180003310064   4371    Escherichia coli                  # all 3
 7180003262150   4275    Escherichia coli
 7180003288789   3565    Serratia marcescens
 join100003627   3563    Acinetobacter baumannii
 ...
 7180003289260   3128    Escherichia coli or vector  
 7180003292886   2957    vector
 7180003166540   2081    mitochondrion
 7180003310112   1977    contaminants
 7180002995790   1711    bacterial insertion sequence
 7180003221530   1647    Bacillus cereus ATCC 10987
 7180003259696   1597    Pseudomonas aeruginosa PAO1
 7180003239826   1378    Macaca mulatta

Problems:

1: Vectors

2: Ecoli

  • There are 22 Ecoli strains * 3 Ecoli K12 substrains
  • MG1655 is the 1st one completed, DH10B and W3110 have been recently(?) completed
  • Contain unique seqs as long as 28K
 cat /fs/szdata/genomes/ncbi/Bacteria/Escherichia_coli_K_12*/*fna | infoseq -description
 NC_010473.1    4686137 50.78  Escherichia coli str. K-12 substr. DH10B, complete genome
 NC_000913.2    4639675 50.79  Escherichia coli str. K-12 substr. MG1655, complete genome
 AC_000091.1    4646332 50.80  Escherichia coli str. K-12 substr. W3110, complete genome
  • out of 746 UMD2 Ecoli seqs, 636(723 maxmatch) aligned to Ecoli.all

3: Other

  • contaminants: 289 ; mostly Ecoli
  • phage: 16 ; 1 aligns to Ecoli, all <1108bp
  • IS: 398; all <1711 bp; 14 align to Ecoli.all & 1 to UniVec_Core; a few, when blasted at NCBI, aligned to mammals (!)
  • mitochondrion: 74 seqs: all align to ~/db/bos_taurus.mitochondrion
  • Others: 130; (Acinetobacter baumannii ...)
  • Files:
 /fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/bos_taurus.UMD2.exclude.count     (4813 exclude sequence counts)
 /fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/bos_taurus.UMD2.contaminant.list  (35142 contaminant sequences: exclude+trim) 
 /fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/bos_taurus.UMD2.contaminant.fasta
 /fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/bos_taurus.UMD2.contaminant.infoseq  35142
 /fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/bos_taurus.UMD2.Ecoli.infoseq        746
 /fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/bos_taurus.UMD2.vector.infoseq       33540
 /fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/odd-contaminants.infoseq             14
 /fs/szasmg3/dpuiu/bos_taurus/submission/contaminants/NCBI/nucmer/bos_taurus.UMD2.vector.no_UniVec_Core.no_Ecoli.all.blastn_hits.fasta # 100 vector seqs not in UniVec (pPAC7 ...) -> ~dpuiu/db/OtherVec
  • Dirs:
 /fs/szasmg3/dpuiu/bos_taurus/submission/decontam

Notes:

  • The 4,814 ctgs were aligned to UniVec (-c 20 ; delta-filter -q)
    • 4,247 ctgs aligned to 89 vector seqs
    • top ref hits:
 gnl|uv|J01749.1   Cloning vector pBR322
 gnl|uv|J01636.1   E.coli lactose operon with lacI, lacZ, lacY and lacA genes
 gnl|uv|AF102576.1 Cloning vector pSOS
 gnl|uv|L08959.1   pUC8 cloning vector 
 gnl|uv|L08931.1   pMAC7-8 cloning vector for site-directed mutagenesis
 gnl|uv|L09145.1   pUR222 cloning vector
 gnl|uv|U47102.2   Cloning vector pALTER<R>-Ex1
 ...
  • The 4,814 ctgs were aligned to EcoliK12
    • 4,299 aligned 200bp+ to Ecoli
    • 3,877 aligned 100% to region 365521_365744 (224bp)
 >NC_000913.2_365521_365744 Escherichia coli K12, complete genome
 CATGGTCATAGCTGTTTCCTGTGTGAAATTGTTATCCGCTCACAATTCCACACAACATAC
 GAGCCGGAAGCATAAAGTGTAAAGCCTGGGGTGCCTAATGAGTGAGCTAACTCACATTAA
 TTGCGTTGCGCTCACTGCCCGCTTTCCAGTCGGGAAACCTGTCGTGCCAGCTGCATTAAT
 GAATCGGCCAACGCGCGGGGAGAGGCGGTTTGCGTATTGGGCGC
  • The 4,814 ctgs were assembled based on 11,912 reads
 BCM          SHOTGUN    11247 (out of 10M reads)
 BCM          WGS        415   (out of 24M reads)
 NISC         SHOTGUN    213   (out of 0.7M reads)
 BARC         CLONEEND   14    ...
 BCM          CLONEEND   11
 BCCAGSC      CLONEEND   7
 TIGR         CLONEEND   4
 TIGR_JCVIJTC CLONEEND   1
  • Avg UMD clipping range of the 11,912 reads is 840bp (vs 778 avg for the 35.2M assembled reads)
  • Other: /fs/ftp-cbcb/pub/data/Bos_taurus/Bos_taurus_UMD_2.0/odd-contaminants.fa
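The region id NC_000913.2_365521_365744 quoted in the notes above encodes 1-based inclusive coordinates, so its length is end - start + 1 = 224, which matches the 224bp sequence shown. A small Python sketch of that check follows, together with a GC% helper of the kind used for the chromosome table. Both function names are illustrative, and the id layout (accession_start_end) is an assumption based on this one example.

```python
REGION_ID = "NC_000913.2_365521_365744"
REGION_SEQ = (
    "CATGGTCATAGCTGTTTCCTGTGTGAAATTGTTATCCGCTCACAATTCCACACAACATAC"
    "GAGCCGGAAGCATAAAGTGTAAAGCCTGGGGTGCCTAATGAGTGAGCTAACTCACATTAA"
    "TTGCGTTGCGCTCACTGCCCGCTTTCCAGTCGGGAAACCTGTCGTGCCAGCTGCATTAAT"
    "GAATCGGCCAACGCGCGGGGAGAGGCGGTTTGCGTATTGGGCGC"
)


def region_length(region_id):
    """Length implied by an '<accession>_<start>_<end>' id (1-based, inclusive)."""
    _accession, start, end = region_id.rsplit("_", 2)
    return int(end) - int(start) + 1


def gc_percent(seq):
    """Percentage of G and C bases among all bases of the sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


if __name__ == "__main__":
    print(region_length(REGION_ID), len(REGION_SEQ), round(gc_percent(REGION_SEQ), 2))
```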

Local files

  • Freeze dir files
 /fs/szasmg3/bos_taurus/Bos_taurus_UMD_2.0/contigs.unplaced.fa  : sequences
 /fs/szasmg3/bos_taurus/Bos_taurus_UMD_2.0/bos_taurus.agp       : all scaffolds
 
 /fs/szasmg3/bos_taurus/UMD_Freeze2.0/reads.placed.gz: 31,942,023 reads (read_id, read_clr, ctg_id, scf_id, ctg_pos, scf_pos)


 /fs/ftp-cbcb/pub/data/assembly/Bos_taurus/Bos_taurus_UMD_2.0
  • Files uploaded
 Ftp server: ftp-private.ncbi.nlm.nih.gov
 Account: cbcb_trc
 Dir: uploads/
 Local files: /fs/szasmg3/dpuiu/bos_taurus/submission/ftp/   : 22 *sqn + 1 agp

Contaminant search

Ecoli

UniVec_Core

UMD2.other

  • 83(82) ctgs align to 65 ref sequences
  • 10 ctgs are Acinetobacter baumannii
 pwd
 /fs/szasmg3/dpuiu/bos_taurus/submission/nucmer_contaminant
 
 join UMD2.contaminant.other-ctg.ref_hits ~/db/bos_taurus.UMD2.contaminant.infoseq | sort -nk3 -r | head
 7180003370686_12513_13066 553 16 554 phage
 7180003320028 13090 10 13090 Acinetobacter baumannii
 7180003341208_1_647 646 8 647 phage
 ...
 
 contigs          <2000      >2000      min        max        mean       med        n50        sum
 82               20         62         709        397429     65384      44841      138429     5361543
 
 alignments       <200       >200       min        max        mean       med        n50        sum            
 103              33         70         105        3312       467        276        688        48109

Files:

/fs/szasmg3/dpuiu/bos_taurus/submission/nucmer_contaminant/UMD2.contaminant.Acinetobacter-ctg.qry_hits   # 10 UMD2.0 Acinetobacter ctg ids
/fs/szasmg3/dpuiu/bos_taurus/submission/nucmer_contaminant/Acinetobacter.all-ctg.filter-q.qry_hits       # 22 UMD2.0 Acinetobacter ctg ids ; 7 in common with the 10 above
# 25 Acinetobacter ctg's
       ctgid         ctglen
    1  7180003321583 96481
    2  7180003308373 9419
    3  7180003319195 8955
    4  7180003317370 8045
    5  7180003290024 5922
    6  join100003699 5649
    7  7180003288988 3618
    8  7180003308907 3157
    9  7180003234806 3100
   10  7180003319189 2966
   11  7180003202299 2653
   12  7180003217023 2213
   13  7180003219002 2161
   14  7180003219018 2010
   15  7180003292866 1767
   16  7180003215440 1617
   17  7180003235746 1573
   18  7180003235747 1524
   19  7180003234890 1422
   20  7180003219292 1329
   21  7180003221397 1308
   22  7180003221476 1243
   23  7180003235699 1139
   24  7180003214110 1100
   25  deg0003235855 1062
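The 25 contigs above are the union of the two hit lists: 10 ids in the first file, 22 in the second, 7 shared, so 10 + 22 - 7 = 25. A minimal Python sketch of that merge is shown here. It assumes each .qry_hits file carries the contig id in its first whitespace-separated column, which is not confirmed by this page, and the function names are illustrative.

```python
def read_ids(path):
    """Read contig ids from the first column of a whitespace-separated file (assumed layout)."""
    with open(path) as handle:
        return [line.split()[0] for line in handle if line.strip()]


def merge_id_lists(first, second):
    """Return (sorted union, sorted intersection) of two id lists."""
    a, b = set(first), set(second)
    return sorted(a | b), sorted(a & b)


if __name__ == "__main__":
    # Placeholder ids for illustration only; real ids would come from read_ids().
    union, common = merge_id_lists(["ctgA", "ctgB", "ctgC"], ["ctgB", "ctgC", "ctgD"])
    print(len(union), "ids in total,", len(common), "shared")
```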

Other issues

Segmental duplications

  • David Kelley seminar
  • UMD1.6
    • inclusions: 384 (1.1Mbp)
    • joins: 1090 (1.1Mbp)