Bos taurus 3.0: Difference between revisions

Revision as of 14:30, 24 July 2009

Data download

  • NCBI : ftp://ftp.ncbi.nih.gov/pub/TraceDB/bos_taurus/
  • 37,829,394 reads organized in 91 volumes
    • 36,820,485 WGS, SHOTGUN, CLONEEND & FINISHING reads
      • 36,170,352 quality reads
      • 650,133 qualityless reads
    • 1,008,909 EST & PCR reads
  • 25,312 libraries (mostly SHOTGUN and BARC.CLONEEND)

Centers

    TRACE_COUNT     CENTER_NAME     
 1  35629020        BCM             Baylor College of Medicine
 2  737900          NISC            NIH Intramural Sequencing Center
 3  652614          BCCAGSC         British Columbia Cancer Agency Genome Sciences Center
 4  378871          MARC            USDA, ARS, US Meat Animal Research Center
 5  114753          UIUC            University of Illinois at Urbana-Champaign
 6  107367          BARC            USDA, ARS, Beltsville Agricultural Research Center
 7  65171           TIGR            The Institute for Genomic Research
 8  53556           GSC             Genoscope
 9  43033           CENARGEN        Embrapa Genetic Resources and Biotechnology
 10 18623           SC              The Sanger Center
 11 15301           UOKNOR          University of Oklahoma Norman Campus, Advanced Center for Genome Technology
 12 10651           TIGR_JCVIJTC    The Institute for Genomic Research, Traces generated at JCVIJTC
 13 2485            UIACBCB         University of Iowa Center for Bioinformatics and Computational Biology (UIACBCB)
 14 49              WUGSC           Washington University, Genome Sequencing Center
    37829394        total           total                                    

Trace counts

    TRACE_COUNT   CENTER_NAME     TRACE_TYPE_CODE        
 1  24863599      BCM*            WGS                    
 2  10748529      BCM*            SHOTGUN                
 3  737900        NISC            SHOTGUN                
 4  125597        BCCAGSC         CLONEEND               
 5  114753        UIUC            CLONEEND               
 6  65171         TIGR            CLONEEND               
 7  53556         GSC             CLONEEND               
 8  26246         CENARGEN        WGS                    
 9  25454         BARC            CLONEEND               
 10 16892         BCM*            CLONEEND               
 11 16787         CENARGEN        CLONEEND               
 12 15150         UOKNOR          SHOTGUN                
 13 10651         TIGR_JCVIJTC    CLONEEND               
 14 151           UOKNOR          FINISHING              
 15 49            WUGSC           CLONEEND               
    36820485      total

 16 527017        BCCAGSC         EST
 17 207204        MARC            EST
 18 171667        MARC            PCR
 19 81913         BARC            EST
 20 18623         SC              EST 
 21 2485          UIACBCB         EST
    1008909       total
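
The subtotals in the table above can be cross-checked against the read counts listed under Data download. A minimal sketch; the numbers are copied from the table and the variable names are only illustrative:

```python
# Counts copied from the Trace counts table (rows 1-15 and rows 16-21).
genomic = [24863599, 10748529, 737900, 125597, 114753, 65171, 53556,
           26246, 25454, 16892, 16787, 15150, 10651, 151, 49]
est_pcr = [527017, 207204, 171667, 81913, 18623, 2485]

genomic_total = sum(genomic)          # WGS, SHOTGUN, CLONEEND & FINISHING
est_pcr_total = sum(est_pcr)          # EST & PCR
grand_total = genomic_total + est_pcr_total

print(genomic_total, est_pcr_total, grand_total)
```

The three values should match the 36,820,485, 1,008,909 and 37,829,394 figures quoted in the download summary.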

Data processing

Vector trimming

Issues:

  • Many traces are missing CLIP_VECTOR_LEFT or CLIP_VECTOR_RIGHT, or have CLIP_VECTOR_RIGHT==0
  • OBT needs to get CLV as input
  • Re-trim reads (each library separately)

Identify linkers

  • Separate FWD/REV reads; for each set
  • Identify top 20 most frequent kmers (8mers,24mers)
  • Check if kmers are overrepresented
  • Check if most frequent 8mer is present in the 24mers
  • Align 24mers (extend them by a few bp) => linker
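
The k-mer steps above can be sketched roughly as follows. This is an illustration only, not the script actually used: the function names `top_kmers` and `linker_candidates` are hypothetical, k-mers are counted by exact match, and k-mers containing N are skipped (an assumption).

```python
from collections import Counter

def top_kmers(reads, k, n=20):
    """Return the n most frequent k-mers (with counts) across the reads."""
    counts = Counter()
    for seq in reads:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:       # assumption: ignore ambiguous k-mers
                counts[kmer] += 1
    return counts.most_common(n)

def linker_candidates(reads, short_k=8, long_k=24, n=20):
    """Top long k-mers that contain the single most frequent short k-mer."""
    top_short = top_kmers(reads, short_k, n)
    top_long = top_kmers(reads, long_k, n)
    if not top_short:
        return []
    best_short = top_short[0][0]
    return [kmer for kmer, _ in top_long if best_short in kmer]

# Toy example: three forward reads sharing the BCM forward linker.
linker = "TCGAGTTCGACTGCAAGTAGTTCATCA"
reads = [linker + tail for tail in ("GATTACAGGCCTTAAGGCCTA",
                                    "CCGGTTAACGTAGCTAGCTAGG",
                                    "TTGACCATGGTCAAGCTTGCA")]
cands = linker_candidates(reads)
```

The function would be run once on the FWD set and once on the REV set; the surviving 24mers are then aligned and extended by hand to obtain the full linker.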

Identify vectors

  • Align linkers to the opposite strand sequences (nucmer -l 12 -c 24 -r)
  • Extract the subsequences following the linker (50..150bp)
  • Align the subsequences; if they align we've probably identified the vector
  • Identify the vector name/id by alignment to UniVec (several alignments probably) (nucmer -l 12 -c 24)
  • Check if the forward/reverse vector(s) are the same : we should find a common vector sequence; the UniVec alignments should be adjacent
  • create the Lucy vector & splice files; the splice contains the linker+vector
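
The extraction step can be illustrated with a small sketch. It makes two assumptions that are not in the notes: the linker is located by exact string match (the actual procedure uses nucmer alignments), and "50..150bp" is read as the window starting 50 bp and ending 150 bp past the end of the linker. The function name `downstream_of_linker` is hypothetical.

```python
def downstream_of_linker(read, linker, start=50, end=150):
    """Return the bases start..end bp past the end of the linker, or None.

    Exact matching stands in for the nucmer alignment (assumption).
    """
    pos = read.find(linker)
    if pos < 0:
        return None
    anchor = pos + len(linker)            # first base after the linker
    window = read[anchor + start:anchor + end]
    return window or None                 # None if the read ends too early

# Toy example: 4 bp of insert, the linker, then 160 bp of "vector".
linker = "TCGAGTTCGACTGCAAGTAGTTCATCA"
read = "AAAA" + linker + "C" * 50 + "G" * 100 + "T" * 10
window = downstream_of_linker(read, linker)
```

If the windows extracted from many reads align to one another, they most likely come from the same vector and can then be searched against UniVec.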

Trim quality reads

  • run Lucy & trim input reads according to Lucy clr
  • align Lucy trimmed reads to linker,vector,splice & UniVec
  • align input reads to linker,vector,splice & UniVec (there should be no alignments)
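
A minimal sketch of applying a clear range and screening the result for leftover linker. The 1-based inclusive coordinate convention and the exact-match screen are assumptions made for illustration; the notes themselves use Lucy for trimming and alignments for the contamination check. Both function names are hypothetical.

```python
def trim_to_clear_range(seq, left, right):
    """Keep bases left..right (1-based, inclusive; assumed convention)."""
    if left < 1 or right > len(seq) or left > right:
        raise ValueError("clear range outside the read")
    return seq[left - 1:right]

def contaminants_found(seq, contaminants):
    """Names of contaminant sequences still present in seq (exact match)."""
    return [name for name, c in contaminants.items() if c in seq]

linker = "TCGAGTTCGACTGCAAGTAGTTCATCA"
raw = linker + "ACGTACGTAC"
trimmed = trim_to_clear_range(raw, 28, 37)
hits_raw = contaminants_found(raw, {"linker": linker})
hits_trimmed = contaminants_found(trimmed, {"linker": linker})
```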

BCM reads

  • linker:
 >J01636.linker.fwd 27bp
 TCGAGTTCGACTGCAAGTAGTTCATCA
 >J01636.linker.rev 27bp
 CTAATCAGATGGTACAGTAGTTCATCA
  • vector: J01636 E.coli lactose operon with lacI, lacZ, lacY and lacA genes (7477 bp)
  • avg Original CLV > avg Lucy CLV (20+ bp ; 1015 vs 973 in quality WGS reads , ...)

NISC reads

  • linker:
 >NGB00080.linker.fwd
 TATCATCGCCACTGTGGTGGAATT
 >NGB00080.linker.rev
 GCTGAAGCTCCATGTGGTGGAATTCC
  • vector NGB00080 (pOTW13 with linkers)
  • avg Original CLV > avg Lucy CLV (20+ bp ; 771 vs 747)

Preliminary Assembly

  • Assembly version: wgs-5.2
  • Use only quality reads
  • set read CLV to Lucy CLV
  • set non random flag = 1 on all reads except for WGS ones
  • obtMerThreshold = 200 (default 1000)
  • doOBT = 1

Input

 Reads=36,170,352   # WGS, SHOTGUN, CLONEEND & FINISHING quality reads
 Libraries=25,312   # mostly SHOTGUN and BARC.CLONEEND

Output

TotalScaffolds=66,141
MaxBasesInScaffolds=26,048,998
MeanBasesInScaffolds=40,861
 
TotalContigsInScaffolds=120,461
MaxContigLength=627,911
MeanContigLength=22,436
 
TotalDegenContigs=269,031
MaxDegenContig=33,824
 
SingletonReads=3,721,123

DeletedReads=421,379 (too short or zero CLR)

Preliminary Assembly processing

Clear ranges

  • Quality reads: extract OBT CLR from gatekeeper store
  • Align quality-less reads to contigs (no degenerates) : nucmer -l 50 -c 200 -b 10 -g 5 -d 0.05
  • Set quality-less read CLR to the maximum alignment coordinates or 50..min(len,600)
  • Shrink read CLR if there are multiple N's or low complexity regions
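
The rule for quality-less reads can be written as a short function. This sketch assumes that "maximum alignment coordinates" means the read coordinates of the longest alignment, and that reads shorter than the default start get no clear range; both are interpretations, and `qualityless_clr` is a hypothetical name.

```python
def qualityless_clr(read_len, alignments, default_start=50, default_max_end=600):
    """Pick a clear range for a read without quality values.

    alignments: list of (start, end) read coordinates of alignments to contigs.
    Returns (start, end), or None if no usable range exists.
    """
    if alignments:
        # Assumption: use the coordinates of the longest alignment.
        return max(alignments, key=lambda a: a[1] - a[0])
    end = min(read_len, default_max_end)
    if end <= default_start:
        return None                       # read too short for the default range
    return (default_start, end)

aligned = qualityless_clr(800, [(10, 300), (20, 700)])
unaligned_long = qualityless_clr(800, [])
unaligned_short = qualityless_clr(400, [])
too_short = qualityless_clr(40, [])
```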

Library estimates

  • Extract library insert estimates; merge libraries sequenced by the same center that have similar mean/std (25,312 libs => 344 libs)
  • Assign new library ids; average means & stdevs
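
One way the merging could work is a greedy grouping per center. The notes do not say how "similar" was defined, so the 10% relative tolerance below is purely an assumption, as are the function name `merge_libraries` and the dictionary layout.

```python
from statistics import mean

def merge_libraries(libs, rel_tol=0.1):
    """Greedily merge libraries from the same center with similar mean/std.

    libs: list of dicts with keys "id", "center", "mean", "std".
    rel_tol: assumed similarity threshold (not taken from the notes).
    """
    groups = []
    for lib in sorted(libs, key=lambda l: (l["center"], l["mean"])):
        for g in groups:
            if (g["center"] == lib["center"]
                    and abs(g["mean"] - lib["mean"]) <= rel_tol * g["mean"]
                    and abs(g["std"] - lib["std"]) <= rel_tol * g["std"]):
                g["members"].append(lib)
                # Merged estimates are the averages over the members.
                g["mean"] = mean(m["mean"] for m in g["members"])
                g["std"] = mean(m["std"] for m in g["members"])
                break
        else:
            groups.append({"center": lib["center"], "mean": lib["mean"],
                           "std": lib["std"], "members": [lib]})
    return groups

libs = [
    {"id": "a", "center": "BCM", "mean": 3000, "std": 300},
    {"id": "b", "center": "BCM", "mean": 3100, "std": 310},
    {"id": "c", "center": "BCM", "mean": 40000, "std": 4000},
    {"id": "d", "center": "NISC", "mean": 3000, "std": 300},
]
groups = merge_libraries(libs)
```

New library ids can then be assigned from the position of each group in the returned list.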

Final Assembly

  • Assembly version: wgs-5.2
  • Use all traces
  • set read CLR to:
    • OBT CLR (quality reads)
    • alignment coordinates (aligned quality-less reads)
    • 50..min(len,600) (unaligned quality-less reads)
  • set non random flag = 1 on all reads except for WGS reads
  • obtMerThreshold = 200 (default 1000)
  • doOBT = 0

Input

 Reads=35,973,728   # WGS, SHOTGUN, CLONEEND & FINISHING with and without qualities
 Libraries=344

Output

 TotalScaffolds=39,978
 MeanBasesInScaffolds=66,947
 MaxBasesInScaffolds=33,907,885
 
 TotalContigsInScaffolds=90,135
 MeanContigLength=29,693
 MaxContigLength=1,160,130
 
 TotalDegenContigs=251,413
 MaxDegenContig=39,964

 SingletonReads=3,634,305 (10.24%)

Final Assembly processing

  • Contaminant search

Assembly Summary

 .                                ctg+deg <2Kbp   >=2Kbp min  max      mean   med    n50    sum
 ======================================================================================================
 Chr1..29,X                       72481   20864   51617  65   1160130  36423  12940  97255  2639986644
 ChrU                             3285    2404    881    224  179692   2890   1338   5425   9496583
 Chr                              75766   23268   52498  65   1160130  34969  11207  96955  2649483227
 
 contigs.haplotype-variants       40611   36984   3627   263  97877    1476   1205   1372   59958728
 deg.unplaced.less_2K             224933  224933  0      65   1996     972    983    990    218837572
 
 ChrY-contigs                     314     266     48     224  26490    2210   973    6539   694140
 ChrY-contigs.SHOTGUN_ONLY        144     140     4      804  4224     993    882    888    143047
 
 delete.notPrimates               97      96      1      263  5310     1031   996    1004   100066
 trim                             61      21      40     213  205361   38577  11681  126330 2353214
 ======================================================================================================
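
For reference, the n50 column follows the usual definition: the length of the contig at which the running sum of lengths, taken from longest to shortest, first reaches half of the total. A minimal sketch of that standard calculation (not necessarily the script used to build the table):

```python
def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contig lengths given")

example = n50([2, 2, 2, 3, 3, 4, 8, 8])
```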