Bos taurus redo: Difference between revisions

From Cbcb

Revision as of 02:37, 13 January 2009

BCM

NCBI Data

  • Genome Projects
  • TA search
  • Avg LEN=984
  • Avg CLIP (CLB intersect CLV)=760
  • Avg CLV=997 (3.66M reads) !!! greater than Avg LEN (984)
  • Avg QUAL=38.96 (27.51 for the 2.59M reads not in the UMD assembly)
  • 0 QUAL reads 650,133
  • Avg UMDoverlapper CLIP=778 (3.53M reads)

CENTER_NAME counts

 COUNT           CENTER_NAME     
 35629020        BCM             Baylor College of Medicine
 737900          NISC            NIH Intramural Sequencing Center
 652614          BCCAGSC         British Columbia Cancer Agency Genome Sciences Centre                           # TA query_tracedb CENTER_NAME = "BCCAGSC" => 652,510 
 378871          MARC            USDA, ARS, US Meat Animal Research Center
 114753          UIUC            University of Illinois at Urbana-Champaign                                      # TA query_tracedb CENTER_NAME = "UIUC" => 106,368
 107367          BARC            USDA, ARS, Beltsville Agricultural Research Center
 65171           TIGR            The Institute for Genomic Research
 53556           GSC             Genoscope
 43033           CENARGEN        Embrapa Genetic Resources and Biotechnology
 18623           SC              The Sanger Center
 15301           UOKNOR          University of Oklahoma Norman Campus, Advanced Center for Genome Technology
 10651           TIGR_JCVIJTC    The Institute for Genomic Research, Traces generated at JCVIJTC                 # TA query_tracedb CENTER_NAME="JCVI"
 2485            UIACBCB         University of Iowa Center for Bioinformatics and Computational Biology (UIACBCB)
 49              WUGSC           Washington University, Genome Sequencing Center                                 # TA query_tracedb CENTER_NAME = "WUGSC" => 9
 37829394        total           total                                                                           # TA query_tracedb SPECIES_CODE = "BOS TAURUS" => 37,788,710 


TRACE_TYPE_CODE counts

 COUNT         CENTER_NAME     TRACE_TYPE_CODE        #LIBS(all)     #LIBS(10K+ reads)
 24863599      BCM             WGS                    89             31
 10748529      BCM             SHOTGUN                10             10
 737900        NISC            SHOTGUN                4              3
 125597        BCCAGSC         CLONEEND
 114753        UIUC            CLONEEND
 65171         TIGR            CLONEEND
 53556         GSC             CLONEEND
 26246         CENARGEN        WGS
 25454         BARC            CLONEEND
 16892         BCM             CLONEEND               1              1      VBBAA   mean=167000  std=25000
 16787         CENARGEN        CLONEEND
 15150         UOKNOR          SHOTGUN
 10651         TIGR_JCVIJTC    CLONEEND
 151           UOKNOR          FINISHING
 49            WUGSC           CLONEEND
 36809945      total

 527017        BCCAGSC         EST
 207204        MARC            EST
 171667        MARC            PCR
 81913         BARC            EST
 18623         SC              EST
 2485          UIACBCB         EST
 1019449       total

STRATEGY & TRACE_TYPE_CODE counts

 COUNT           CENTER_NAME     STRATEGY        TRACE_TYPE_CODE
 12545304        BCM             .               WGS
 11425910        BCM             WGA             WGS
 5223683         BCM             CLONE           SHOTGUN
 4479883         BCM             POOLCLONE       SHOTGUN
 1044963         BCM             .               SHOTGUN
 892385          BCM             SNP             WGS
 737900          NISC            CLONE           SHOTGUN
 125597          BCCAGSC         CLONEEND        CLONEEND
 114753          UIUC            CLONEEND        CLONEEND 
 65171           TIGR            CLONEEND        CLONEEND
 53556           GSC             CLONEEND        CLONEEND
 26246           CENARGEN        .               WGS
 25454           BARC            .               CLONEEND
 16892           BCM             CLONEEND        CLONEEND
 16787           CENARGEN        CLONEEND        CLONEEND
 12195           UOKNOR          .               SHOTGUN
 10651           TIGR_JCVIJTC    CLONEEND        CLONEEND
 2955            UOKNOR          CLONE           SHOTGUN
 151             UOKNOR          .               FINISHING
 49              WUGSC           CLONEEND        CLONEEND
 527017          BCCAGSC         EST             EST
 145820          MARC            EST             EST
 117958          MARC            COMPARATIVE     PCR
 81913           BARC            EST             EST
 61384           MARC            CLONE           EST
 53709           MARC            Re-Sequencing   PCR
 18623           SC              EST             EST
 2485            UIACBCB         .               EST

3' VECTOR TRIMMED counts

 CENTER_NAME     TRACE_TYPE_CODE TOTAL           3'CLV<LEN   QUAL==0          UMD.FRG
 BCM             WGS             24863599        10968979    551114           24050767
 BCM             SHOTGUN         10748529        5052692     23419            10068499
 NISC            SHOTGUN         737900          28972       0                735488
 BCCAGSC         CLONEEND        125597          125484      8926             113790
 UIUC            CLONEEND        114753          90243       0                106247
 TIGR            CLONEEND        65171           46389       0                64903
 GSC             CLONEEND        53556           53556       53556 (all)      0           !!! all have 0 quals and were excluded
 CENARGEN        WGS             26246           26246       0                25976
 BARC            CLONEEND        25454           25454       0                25387
 BCM             CLONEEND        16892           6751        0                16863
 CENARGEN        CLONEEND        16787           16787       0                16628
 UOKNOR          SHOTGUN         15150           2885        12195            0
 TIGR_JCVIJTC    CLONEEND        10651           339         0                10644
 UOKNOR          FINISHING       151             0           151              151
 WUGSC           CLONEEND        49              0           0                0

 BCCAGSC         EST             527017          524173      772              0
 MARC            EST             207204          207204      0                0
 MARC            PCR             171667          171667      0                0
 BARC            EST             81913           78597       0                0
 SC              EST             18623           7350        0                0
 UIACBCB         EST             2485            2485        0                0

Local Data

Files & Dirs

 /fs/szasmg3/bos_taurus/data/
 /fs/szasmg2/Drosophila/D_pseudoobscura/Vectors
 /nfshomes/dpuiu/db/UniVec

Software

Figaro

  • trims vector only at 5' end
  • calls lucy for quality trimming

Lucy

  • both vector sequence and splice sites are required

Atlas

  • web site
  • atlas-screen-trim-file : "calls cross_match and atlas-screen-window to create trimmed reads file (scan in from each end of read looking for 50-base windows of high quality and no vector); "

Contaminant search

Align the CLIPPING range of the reads to UniVec & E. coli K12 using nucmer

UniVec

Ref

                 #seqs   min     max     mean    median  n50     sum
 UniVec          2861    12      48551   231     99      781     660,151
 UniVec_Core     1348    12      48551   243     98      967     327,641

Hits: alignment length

 bp      #reads  min     max     mean    median  n50     sum
 19      4548466 19      1045    28.37   23      27      129025025
 20      3684852 20      1045    30.56   25      28      112616359
 30      1097357 30      1045    48.04   38      43      52714583
 40      484661  40      1045    66.36   47      53      32163896
 100     54334   100     1045    198     116     223     10772815        # many are ESTs

Ecoli

Ref:

 K12 4,639,675 bp

Hits: alignment length

 bp      #reads  min     max     mean    median  n50     sum
 19      275109  19      1223    30.66   19      20      8435470
 20      102550  20      1223    50.29   21      161     5156849
 30      19032   30      1223    178     37      706     3381214
 40      9234    40      1223    329     171     738     3034293
 100     6781    100     1223    424     223     749     2876432
 200     4378    200     1223    575     696     771     2516916       

BCM vectors

                 #seqs   min     max     mean    median  n50     sum
 BCM             14      2580    33180   9379    5821    32705   131312
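
The reference and hit tables above all use the same summary columns (#seqs, min, max, mean, median, n50, sum). A minimal sketch of how such a row could be computed from a list of lengths follows. The page does not define n50, so the usual definition is assumed here (the largest length L such that sequences of length >= L cover at least half of the total); `n50` and `length_stats` are hypothetical helper names, and the lengths in the example are made up.

```python
from statistics import mean, median

def n50(lengths):
    """Assumed definition: largest L such that lengths >= L sum to at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

def length_stats(lengths):
    """One table row: the same columns as the tables on this page."""
    return {
        "#seqs": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        "n50": n50(lengths),
        "sum": sum(lengths),
    }

# Made-up lengths, only to show the call
print(length_stats([2, 3, 4, 5, 6]))
```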

Vector/Splice site search

Strategy

  • 1. Select all the reads in the same volume that belong to one particular library; same CENTER_NAME, STRATEGY & TRACE_TYPE_CODE
  • 2. Get the quality clipping coordinates: CLIP_QUALITY_LEFT & CLIP_QUALITY_RIGHT
  • 3. Separate reads in 2 piles according to direction TRACE_END: FORWARD & REVERSE
  • 4. Get the most frequent kmers (24 & 8 bp)
  • 5. Check if the most frequent kmers are overrepresented
  • 6. Check if the most frequent 8mers are part of the most frequent 24mers
  • 7. Try to extend the kmers by a few bp => linkers
  • 8. Align linkers to the opposite strand sequences
  • 9. Extract the sequences adjacent to (following) the linker (50..150bp)
  • 10. Align the sequences; if they align we've probably identified the vector
  • 11. Align the vector to UniVec => several alignments
  • 12. Check if the forward/reverse vector(s) are the same : should find a common vector sequence; the UniVec alignments should be adjacent
  • 13. Create the Lucy vector & splice files
  • 14. Run lucy
  • 15. Align lucy output to vector.seq & UniVec
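
Steps 4 to 6 of the strategy can be sketched in a few lines of Python. This is only an illustration, not the scripts that produced the kmers.tab files: the function names (`top_kmers`, `enrichment`, `short_in_long`) and the toy reads are hypothetical, and the uniform-base model used for the expected count in step 5 is an assumption, since the page does not say how overrepresentation was judged.

```python
from collections import Counter

def revcomp(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]

def top_kmers(reads, k, n=10):
    """Step 4: the n most frequent k-mers over all reads, as (kmer, count) pairs."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts.most_common(n)

def enrichment(count, total_kmers, k):
    """Step 5 (assumed model): observed count over the count expected for uniform random bases."""
    return count / (total_kmers / 4 ** k)

def short_in_long(short_kmers, long_kmers):
    """Step 6: keep the short k-mers that occur inside at least one long k-mer."""
    return [s for s in short_kmers if any(s in long_kmer for long_kmer in long_kmers)]

# Toy forward reads (hypothetical): the same linker followed by different inserts
linker = "TCGAGTTCGACTGCAAGTAGTTCATCA"
reads = [linker + tail for tail in ("ACGTACGTAC", "GGGTTTAAAC", "CATCATGGTA")]

top24 = [kmer for kmer, _ in top_kmers(reads, 24, n=4)]
top8 = [kmer for kmer, _ in top_kmers(reads, 8, n=10)]
for kmer in short_in_long(top8, top24):
    print(kmer, revcomp(kmer))
```

Running the same functions separately on the FORWARD and REVERSE piles from step 3 gives one k-mer table per direction and k-mer size.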

Example

  • 1. volume 011 : 500,000 reads CENTER_NAME=BCM, TRACE_TYPE_CODE=WGS
  • 3. 249,611 TRACE_END=F & 250,389 TRACE_END=R
  • 4. kmers: 8 of the 10 most frequent 8bp kmers are shared by the FORWARD & REVERSE strands; no 24bp kmers are shared
 ==> 24.fwd/kmers.tab <==
 AGTTCGACTGCAAGTAGTTCATCA      TGATGAACTACTTGCAGTCGAACT        2463 # contains AGTAGTTC
 GAGTTCGACTGCAAGTAGTTCATC      GATGAACTACTTGCAGTCGAACTC        2189
 CGAGTTCGACTGCAAGTAGTTCAT      ATGAACTACTTGCAGTCGAACTCG        1996
 TCGAGTTCGACTGCAAGTAGTTCA      TGAACTACTTGCAGTCGAACTCGA        1593
 GTTCGACTGCAAGTAGTTCATCAA      TTGATGAACTACTTGCAGTCGAAC        1023
 GAGTTCGACTGCAGTAGTTCATCA      TGATGAACTACTGCAGTCGAACTC        812
 CGAGTTCGACTGCAGTAGTTCATC      GATGAACTACTGCAGTCGAACTCG        777
 GTTCGACTGCAAGTAGTTCATCAT      ATGATGAACTACTTGCAGTCGAAC        769
 TCGAGTTCGACTGCAGTAGTTCAT      ATGAACTACTGCAGTCGAACTCGA        637
 ATCGAGTTCGACTGCAAGTAGTTC      GAACTACTTGCAGTCGAACTCGAT        594
 
 ==> 08.fwd/kmers.tab <==
 AGTAGTTC      GAACTACT        86477
 CAGTAGTT      AACTACTG        67681
 AGTTCTCA      TGAGAACT        61556
 TAGTTCTC      GAGAACTA        60964
 GTAGTTCT      AGAACTAC        57866
 AGTTCATC      GATGAACT        49676
 TAGTTCAT      ATGAACTA        45298
 GTTCATCA      TGATGAAC        42117
 GCAGTAGT      ACTACTGC        41391
 GTAGTTCA      TGAACTAC        40694
 
 ==> 24.rev/kmers.tab <==
 TATCGATGGTACAGTAGTTCATCA      TGATGAACTACTGTACCATCGATA        999 # contains AGTAGTTC
 CTATCGATGGTACAGTAGTTCATC      GATGAACTACTGTACCATCGATAG        774
 GCTATCGATGGTACAGTAGTTCAT      ATGAACTACTGTACCATCGATAGC        600
 CGCTATCGATGGTACAGTAGTTCA      TGAACTACTGTACCATCGATAGCG        432
 ATCGATGGTACAGTAGTTCATCAT      ATGATGAACTACTGTACCATCGAT        417
 ATCGATGGTACAGTAGTTCATCAA      TTGATGAACTACTGTACCATCGAT        380
 ATCAGATGGTACAGTAGTTCATCA      TGATGAACTACTGTACCATCTGAT        373
 ATCGATGGTACAGTAGTTCATCAC      GTGATGAACTACTGTACCATCGAT        265
 CTATCGATGGTAAGTAGTTCATCA      TGATGAACTACTTACCATCGATAG        235
 TCAGATGGTACAGTAGTTCATCAA      TTGATGAACTACTGTACCATCTGA        224
 
 ==> 08.rev/kmers.tab <==
 AGTTCATC      GATGAACT        85127
 TAGTTCAT      ATGAACTA        77902
 GTTCATCA      TGATGAAC        75585
 TAGTTCTC      GAGAACTA        68057
 AGTTCTCA      TGAGAACT        67277
 GTAGTTCT      AGAACTAC        64894
 GTAGTTCA      TGAACTAC        62607
 CGTAGTTC      GAACTACG        52031
 AGTAGTTC      GAACTACT        51013
 ACGTAGTT      AACTACGT        31552
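
The number of shared 8-mers can be checked directly from the first columns of the 08.fwd and 08.rev tables above. The snippet below only re-uses those twenty k-mers; the variable names are illustrative.

```python
# First column of 08.fwd/kmers.tab and 08.rev/kmers.tab, in table order
fwd8 = ["AGTAGTTC", "CAGTAGTT", "AGTTCTCA", "TAGTTCTC", "GTAGTTCT",
        "AGTTCATC", "TAGTTCAT", "GTTCATCA", "GCAGTAGT", "GTAGTTCA"]
rev8 = ["AGTTCATC", "TAGTTCAT", "GTTCATCA", "TAGTTCTC", "AGTTCTCA",
        "GTAGTTCT", "GTAGTTCA", "CGTAGTTC", "AGTAGTTC", "ACGTAGTT"]

rev8_set = set(rev8)
shared = [kmer for kmer in fwd8 if kmer in rev8_set]    # keeps the fwd table order
fwd_only = [kmer for kmer in fwd8 if kmer not in rev8_set]

print(len(shared), shared)
print(fwd_only)
```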
  • 7. Get linker sequences
 >linker.fwd 27bp
 TCGAGTTCGACTGCAAGTAGTTCATCA
 >linker.rev 40 bp Art's          
 TATGACCATGCGCCTAATCAGATGGTACAGTAGTTCATCA
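
The page does not say how the k-mer extension in step 7 was done. One simple possibility, sketched here as an assumption, is to chain the frequent 24-mers that overlap by 23 bases and to stop when the extension is ambiguous. `extend_right` is a hypothetical helper, the seed is chosen by hand, and the k-mers are six of the entries of the 24.fwd table above.

```python
def extend_right(seed, kmers):
    """Greedily extend seed to the right with k-mers that overlap it by k-1 bases.

    Stops when no k-mer fits or when more than one fits (ambiguous branch).
    """
    k = len(seed)
    contig = seed
    remaining = set(kmers) - {seed}
    while True:
        suffix = contig[-(k - 1):]
        candidates = [m for m in remaining if m[:k - 1] == suffix]
        if len(candidates) != 1:
            return contig
        contig += candidates[0][-1]
        remaining.discard(candidates[0])

fwd_24mers = [
    "AGTTCGACTGCAAGTAGTTCATCA",
    "GAGTTCGACTGCAAGTAGTTCATC",
    "CGAGTTCGACTGCAAGTAGTTCAT",
    "TCGAGTTCGACTGCAAGTAGTTCA",
    "GTTCGACTGCAAGTAGTTCATCAA",  # the next base is either A ...
    "GTTCGACTGCAAGTAGTTCATCAT",  # ... or T, so the extension stops before it
]
linker_fwd = extend_right("TCGAGTTCGACTGCAAGTAGTTCA", fwd_24mers)
print(linker_fwd, len(linker_fwd))
```

With these inputs the chain ends at the 27bp sequence listed as linker.fwd, because the two lower-count 24-mers disagree on the following base.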
  • 8. Align using nucmer
 nucmer -c 24 -l 12 linker.fwd.seq bos_taurus.001.rev.seq => 115 hits 
 nucmer -c 24 -l 12 linker.rev.seq bos_taurus.001.fwd.seq => 174 hits
  • 9.
  • 10. Identify "vector reads"
 >399553028  # 24.fwd     
 TGATGAACTACTGTACCATCTGATTAGGCGCATGGTCATAGCTGTTTCCTGTGTGAAATT
 GCTATCCGCTCACAATTCCACACAACATACGAGCCGGAAGCATAAAGTGTAAAGCCTGGG
 GTGTCAAATGAGAGACCTAACTCACATTCAACTTTTTTTTTTTTTCTGCCCTCTATTCTA
 ...
 >400269118 #24.rev
 TGATGAACTACTTGCAGTCGAAATCGAATCATCACTGGCCGTCCTTTTACAACGTCGTGA
 CTGGGAAAACCCTGGCGTTACCCAACTTAATCCGCCTTGCAGCACATCCCCCTTTCCCCC
 AGCTGGCGTAAAAACGTAAAAAGCCCCGCACCGATCGCCCTTTCCCAACAGGTTGCCCAG
  • 11. Align to UniVec
 show-coords 24.fwd/400269118-UniVec.delta 24.rev/399553028-UniVec.delta | grep J01636.1
     31  148  | 1175 1292  | 118   118  |  95.76  |     1276     7477  |     9.25     1.58  | 399553028.rev gnl|uv|J01636.1:1-7477
     32  199  | 1302 1463  | 168   162  |  90.48  |      653     7477  |    25.73     2.17  | 400269118     gnl|uv|J01636.1:1-7477
  • 12. The 2 alignments are 10bp apart on J01636.1 (1292 vs 1302)
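
The 10bp figure can be reproduced from the two show-coords rows in step 11. The sketch below assumes that the second coordinate pair of each row lies on the UniVec entry (its length, 7477, is the second length column); `parse_coords_line` is a hypothetical helper written for this row layout only, not part of MUMmer.

```python
def parse_coords_line(line):
    """Parse one show-coords row; the tag column may itself contain '|'."""
    fields = [f.strip() for f in line.split("|", 6)]
    read_start, read_end = map(int, fields[0].split())
    vec_start, vec_end = map(int, fields[1].split())
    read_id, vec_id = fields[6].split()
    return {
        "read": read_id,
        "vector": vec_id,
        "read_range": (read_start, read_end),
        "vector_range": (vec_start, vec_end),
        "identity": float(fields[3]),
    }

rows = [
    "    31  148  | 1175 1292  | 118   118  |  95.76  |     1276     7477  |     9.25     1.58  | 399553028.rev gnl|uv|J01636.1:1-7477",
    "    32  199  | 1302 1463  | 168   162  |  90.48  |      653     7477  |    25.73     2.17  | 400269118     gnl|uv|J01636.1:1-7477",
]

# Order the hits along the vector, then measure the distance between them
hits = sorted((parse_coords_line(r) for r in rows), key=lambda h: h["vector_range"][0])
gap = hits[1]["vector_range"][0] - hits[0]["vector_range"][1]
print(gap)  # 1302 - 1292
```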
  • 13. Lucy files
 $ more vector.seq
   >J01636 E.coli lactose operon with lacI, lacZ, lacY and lacA genes
   GACACCATCGAATGGCGCAAAACCTTTCGCGGTATGGCATGATAGCGCCCGGAAGAGAGTCAATTCAGGG
   TGGTGAATGTGAAACCAGTAACGTTATACGATGTCGCAGAGTATGCCGGTGTCTCTTATCAGACCGTTTC
   CCGCGTGGTGAACCAGGCCAGCCACGTTTCTGCGAAAACGCGGGAAAAAGTGGAAGCGGCGATGGCGGAG
   CTGAATTACATTCCCAACCGCGTGGCACAACAACTGGCGGGCAAACAGTCGTTGCTGATTGGCGTTGCCA
   ...
 
 $ more splice.seq
   >J01636.for.begin # J01636.1175-1302
   TGAATGTGAGTTAGGTCTCTCATTTGACACCCCAGGCTTTACACTTTATGCTTCCGGCTC
   GTATGTTGTGTGGAATTGTGAGCGGATAGCAATTTCACACAGGAAACAGCTATGACCATG
   CGCCTAATCAGATGGTACAGTAGTTCATCA
   >J01636.for.end   # J01636.1292-1463
   TGATGAACTACTTGCAGTCGAAATCGAATCATCACTGGCCGTCCTTTTACAACGTCGTGA
   CTGGGAAAACCCTGGCGTTACCCAACTTAATCCGCCTTGCAGCACATCCCCCTTTCCCCC
   AGCTGGCGTAAAAACGTAAAAAGCCCCGCA
   >J01636.rev.end (revcomp of J01636.for.begin)
   TGATGAACTACTGTACCATCTGATTAGGCGCATGGTCATAGCTGTTTCCTGTGTGAAATT
   GCTATCCGCTCACAATTCCACACAACATACGAGCCGGAAGCATAAAGTGTAAAGCCTGGG
   GTGTCAAATGAGAGACCTAACTCACATTCA
   >J01636.rev.begin (revcomp of J01636.for.end)
   TGCGGGGCTTTTTACGTTTTTACGCCAGCTGGGGGAAAGGGGGATGTGCTGCAAGGCGGA
   TTAAGTTGGGTAACGCCAGGGTTTTCCCAGTCACGACGTTGTAAAAGGACGGCCAGTGAT
   GATTCGATTTCGACTGCAAGTAGTTCATCA
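
Each rev entry in splice.seq is the reverse complement of one of the for entries. A quick way to confirm which one pairs with which is to reverse-complement the sequences as listed above and compare them; the dictionary keys simply mirror the FASTA names, and `revcomp` is a small helper defined for this sketch.

```python
def revcomp(seq):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

splice = {
    "for.begin": "TGAATGTGAGTTAGGTCTCTCATTTGACACCCCAGGCTTTACACTTTATGCTTCCGGCTC"
                 "GTATGTTGTGTGGAATTGTGAGCGGATAGCAATTTCACACAGGAAACAGCTATGACCATG"
                 "CGCCTAATCAGATGGTACAGTAGTTCATCA",
    "for.end":   "TGATGAACTACTTGCAGTCGAAATCGAATCATCACTGGCCGTCCTTTTACAACGTCGTGA"
                 "CTGGGAAAACCCTGGCGTTACCCAACTTAATCCGCCTTGCAGCACATCCCCCTTTCCCCC"
                 "AGCTGGCGTAAAAACGTAAAAAGCCCCGCA",
    "rev.end":   "TGATGAACTACTGTACCATCTGATTAGGCGCATGGTCATAGCTGTTTCCTGTGTGAAATT"
                 "GCTATCCGCTCACAATTCCACACAACATACGAGCCGGAAGCATAAAGTGTAAAGCCTGGG"
                 "GTGTCAAATGAGAGACCTAACTCACATTCA",
    "rev.begin": "TGCGGGGCTTTTTACGTTTTTACGCCAGCTGGGGGAAAGGGGGATGTGCTGCAAGGCGGA"
                 "TTAAGTTGGGTAACGCCAGGGTTTTCCCAGTCACGACGTTGTAAAAGGACGGCCAGTGAT"
                 "GATTCGATTTCGACTGCAAGTAGTTCATCA",
}

# For every rev entry, find the for entry whose reverse complement matches it
pairs = {
    rev_name: [for_name for for_name in ("for.begin", "for.end")
               if revcomp(splice[for_name]) == splice[rev_name]]
    for rev_name in ("rev.begin", "rev.end")
}
print(pairs)
```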
  • 14. Run lucy & trim reads
 $ lucy \
     -v vector.seq splice.seq \
     -o bos_taurus.lucy.seq bos_taurus.lucy.qual \
     -debug bos_taurus.lucy.info \
     bos_taurus.seq bos_taurus.qual
 # Trim to the clear range (clr)
 $ clrFasta bos_taurus.lucy.seq > bos_taurus.lucy.cseq
  • 15. Align lucy output to vector.seq & UniVec
 $ nucmer -l 16 -c 30 vector.seq  bos_taurus.lucy.cseq -p vector-bos_taurus.lucy
 $ nucmer -l 16 -c 30 UniVec      bos_taurus.lucy.cseq -p UniVec-bos_taurus.lucy