Bos taurus redo: Difference between revisions

From Cbcb
Jump to navigation Jump to search
Line 609: Line 609:


* No alignments to BCM WGS vector
* No alignments to BCM WGS vector
= Running Lucy =
* Default parameters with vector trimming
== BCM.WGS ==
                            elem      <0        0          >0        min        max        mean      median    n50        sum
  orig.LEN                  24863599  0          0          24863599  5          3097      1002      997        1015      24915462033
  orig.CLR                  24863599  463669    7          24399923  -1143      1833      765        836        864        19036744256
  orig.CLR5                24863599  0          359245    24504354  0          2103      42        22        58        1047922451
  orig.CLR3                24863599  463404    0          24400195  -1        2169      807        872        895        20084666707
  lucy.CLR                  24863599  0          739529    24124070  0          1219      792        878        904        19695000417
  lucy.CLR5                24863599  739529    36108      24087962  -1        1753      43        29        42        1086413880
  lucy.CLR3                24863599  739529    0          24124070  -1        1894      835        915        939        20781414297
  orig.CLR5-lucy.CLR5      24863599  16299521  215345    8348733    -1186      2104      -1        -10        -1186      -38491429
  orig.CLR3-lucy.CLR3      24863599  14858542  1494794    8510263    -1273      2170      -28        -20        -1273      -696747590
  orig.CLV                  24863599  1053      1920      24860626  -2        5345      1015      1002      1017      25260581538
  orig.CLV5                8841849    0          0          8841849    1          1219      33        46        49        295011460
  orig.CLV3                24861698  1053      0          24860645  -1        5346      1027      1005      1019      25555592998
  lucy.CLV                  24863599  10694      707        24852198  -469      3096      973        968        987        24195085877
  lucy.CLV5                24863599  0          3135007    21728592  0          1359      25        27        29        623457486
  lucy.CLV3                24863599  0          0          24863599  4          3096      998        995        1014      24818543363
  lucy.CLVABS5              24863599  0          3135007    21728592  0          1359      25        27        29        623457486
  lucy.CLVABS3              24863599  0          24770953  92646      0          1343      2          0          880        72055071
  orig.CLV5-lucy.CLV5      24863599  17216820  1512453    6134326    -1312      1219      -13        -25        -1312      -328446026
  orig.CLV3-lucy.CLV3      24863599  1519132    18579609  4764858    -1832      4672      29        0          479        737049635
== BCM.SHOTGUN ==
                            elem      <0        0          >0        min        max        mean      median    n50        sum
  orig.LEN                  10748529  0          0          10748529  5          2043      975        950        964        10486690472
  orig.CLR                  10748529  17308      2          10731219  -1293      1467      809        833        847        8701344571
  orig.CLR5                10748529  0          68        10748461  0          1315      26        16        38        288662580
  orig.CLR3                10748529  16780      0          10731749  -1        1647      836        851        863        8990007151
  lucy.CLR                  10748529  0          98070      10650459  0          1337      833        854        868        8955866769
  lucy.CLR5                10748529  98070      1973      10648486  -1        1307      35        28        32        376276188
  lucy.CLR3                10748529  98070      0          10650459  -1        1553      868        882        896        9332142957
  orig.CLR5-lucy.CLR5      10748529  9498290    65171      1185068    -1099      1293      -8        -11        -1099      -87613608
  orig.CLR3-lucy.CLR3      10748529  6879532    671097    3197900    -1149      1437      -31        -26        -1149      -342135806
  orig.CLV                  10748529  16779      412        10731338  -2        3919      974        948        964        10472347908
  orig.CLV5                8594910    0          0          8594910    1          1239      3          1          49        28350257
  orig.CLV3                10748349  16779      0          10731570  -1        3919      976        950        965        10500698165
  lucy.CLV                  10748529  7026      614        10740889  -268      2042      930        924        940        9997862132
  lucy.CLV5                10748529  0          1011521    9737008    0          855        24        24        27        257993796
  lucy.CLV3                10748529  0          0          10748529  4          2042      954        945        962        10255855928
  lucy.CLVABS5              10748529  0          1011521    9737008    0          855        24        24        27        257993796
  lucy.CLVABS3              10748529  0          10453587  294942    0          1214      20        0          847        220086015
  orig.CLV5-lucy.CLV5      10748529  9538738    138680    1071111    -854      1239      -21        -23        -854      -229643539
  orig.CLV3-lucy.CLV3      10748529  357934    9324166    1066429    -1328      2846      22        0          704        244842237
== NISC.SHOTGUN ==
                            elem      <0        0          >0        min        max        mean      median    n50        sum
  orig.LEN                  737900    0          0          737900    104        2104      784        729        734        579172842
  orig.CLR                  737900    5988      2          731910    -636      1033      651        668        676        480400909
  orig.CLR5                737900    0          0          737900    1          1407      47        40        51        34857531
  orig.CLR3                737900    0          5879      732021    0          1470      698        710        715        515258440
  lucy.CLR                  737900    0          8248      729652    0          1035      658        670        676        485757685
  lucy.CLR5                737900    8248      56        729596    -1        1091      45        35        46        33811606
  lucy.CLR3                737900    8248      0          729652    -1        1391      704        710        714        519569291
  orig.CLR5-lucy.CLR5      737900    253727    89345      394828    -566      1408      1          1          485        1045925
  orig.CLR3-lucy.CLR3      737900    177007    31        560862    -867      1471      -5        1          -867      -4310851
  orig.CLV                  737900    3224      2655      732021    -636      2103      771        725        730        569178445
  orig.CLV5                734026    0          0          734026    1          987        5          1          35        4375315
  orig.CLV3                732021    0          0          732021    35        2104      783        729        734        573553760
  lucy.CLV                  737900    1335      55        736510    -200      2104      747        696        702        551392388
  lucy.CLV5                737900    104491    0          633409    -1        1199      30        31        34        22784742
  lucy.CLV3                737900    0          0          737900    15        2103      778        728        733        574177130
  lucy.CLVABS5              737900    0          104491    633409    0          1200      31        32        35        23522642
  lucy.CLVABS3              737900    0          730699    7201      0          1076      5          0          686        4257812
  orig.CLV5-lucy.CLV5      737900    561851    66390      109659    -1198      983        -24        -29        -1198      -18409427
  orig.CLV3-lucy.CLV3      737900    8386      1          729513    -950      1077      0          1          -950      -623370

Revision as of 19:12, 21 January 2009

BCM

NCBI Data

  • Avg LEN=984
  • Avg CLIP (CLB intersect CLV)=760
  • Avg CLV=997 > Avg LEN ???
  • Avg QUAL=38.96 (27.51 for the 2.59M reads not in the UMD assembly)
  • Avg UMDoverlapper CLIP=778

Problems:

  • 0 QUAL reads 650,133
  • the quality lines in several qual. files start with space; need to remove it otherwise tarchive2ca errors out saying that the len(quality)=len(seq)+1
  • several xml contained the "&" character => XML parser error
  • xml.bos_taurus.087 contained 2 trace_volumes => XML parser error
  • BCCAGSC.CLONEEND : all reads have LIBRARY_ID=CH240, SEQ_LIB_ID=. ; the INSERT_SIZE & INSERT_STDEV vary within the library: set to 150,000 & 30,000
  • UIUC.CLONEEND: INSERT_SIZE & INSERT_STDEV missing: set to 150,000 & 30,000

CENTER_NAME counts

    COUNT           CENTER_NAME     
 1  35629020        BCM             Baylor College of Medicine
 2  737900          NISC            NIH Intramural Sequencing Center
 3  652614          BCCAGSC         British Columbia Cancer Agency Genome Sciences Centre                           # TA query_tracedb CENTER_NAME = "BCCAGSC" => 652,510 
 4  378871          MARC            USDA, ARS, US Meat Animal Research Center
 5  114753          UIUC            University of Illinois at Urbana-Champaign                                      # TA query_tracedb CENTER_NAME = "UIUC" => 106,368
 6  107367          BARC            USDA, ARS, Beltsville Agricultural Research Center
 7  65171           TIGR            The Institute for Genome Research
 8  53556           GSC             Genoscope
 9  43033           CENARGEN        Embrapa Genetic Resources and Biotechnology
 10 18623           SC              The Sanger Center
 11 15301           UOKNOR          University of Oklahoma Norman Campus, Advanced Center for Genome Technology
 12 10651           TIGR_JCVIJTC    The Institute for Genomic Research, Traces generated at JCVIJTC                 # TA query_tracedb CENTER_NAME="JCVI"
 13 2485            UIACBCB         University of Iowa Center for Bioinformatics and Computation Biology (UIACBCB)
 14 49              WUGSC           Washington University, Genome Sequencing Center                                 # TA query_tracedb CENTER_NAME = "WUGSC" => 9
    37829394        total           total                                                                           # TA query_tracedb SPECIES_CODE = "BOS TAURUS" => 37,788,710

TRACE_TYPE_CODE counts

    COUNT         CENTER_NAME     TRACE_TYPE_CODE        
 1  24863599      BCM*            WGS                    SEQ_LIB_ID:89
 2  10748529      BCM*            SHOTGUN                SEQ_LIB_ID:15543
 3  737900        NISC            SHOTGUN                SEQ_LIB_ID:247
 4  125597        BCCAGSC         CLONEEND               LIBRARY_ID:1         large insert size; some qualityless; !!! almost all have CLIP3=0
 5  114753        UIUC            CLONEEND               LIBRARY_ID:2         insert size missing , no frequent kmers
 6  65171         TIGR            CLONEEND               SEQ_LIB_ID:1         2K & use TRACE_DIRECTION instead of TRACE_END
 7  53556         GSC             CLONEEND               SEQ_LIB_ID:1         large insert size; !!! all have qual=0 and were excluded 
 8  26246         CENARGEN        WGS                    .                    no LIBRARY_ID; no SEQ_LIB_ID; no INSERT_SIZE; no INSERT_STDEV; reads have no direction; ~21954 could be paired (same TEMPLATE_ID)
 9  25454         BARC            CLONEEND               SEQ_LIB_ID:14304     !!! all have CLIP3=0
 10 16892         BCM*            CLONEEND               LIBRARY_ID:1         VBBAA   mea=167000  std=25000
 11 16787         CENARGEN        CLONEEND               LIBRARY_ID:1         
 12 15150         UOKNOR          SHOTGUN                LIBRARY_ID:1         some qualityless
 13 10651         TIGR_JCVIJTC    CLONEEND               SEQ_LIB_ID:2
 14 151           UOKNOR          FINISHING              LIBRARY_ID:1         some qualityless, no direction(TRACE_END=N); no INSERT_SIZE; no INSERT_STDEV
 15 49            WUGSC           CLONEEND               SEQ_LIB_ID:1 
    36809945      total

 16 527017        BCCAGSC         EST
 17 207204        MARC            EST
 18 171667        MARC            PCR
 19 81913         BARC            EST
 20 18623         SC              EST 
 21 2485          UIACBCB         EST
    1019449       total

STRATEGY & TRACE_TYPE_CODE counts

 COUNT           CENTER_NAME     STRATEGY        TRACE_TYPE_CODE
 12545304        BCM             .               WGS
 11425910        BCM             WGA             WGS
 5223683         BCM             CLONE           SHOTGUN
 4479883         BCM             POOLCLONE       SHOTGUN
 1044963         BCM             .               SHOTGUN
 892385          BCM             SNP             WGS
 737900          NISC            CLONE           SHOTGUN
 125597          BCCAGSC         CLONEEND        CLONEEND
 114753          UIUC            CLONEEND        CLONEEND 
 65171           TIGR            CLONEEND        CLONEEND
 53556           GSC             CLONEEND        CLONEEND
 26246           CENARGEN        .               WGS
 25454           BARC            .               CLONEEND
 16892           BCM             CLONEEND        CLONEEND
 16787           CENARGEN        CLONEEND        CLONEEND
 12195           UOKNOR          .               SHOTGUN
 10651           TIGR_JCVIJTC    CLONEEND        CLONEEND
 2955            UOKNOR          CLONE           SHOTGUN
 151             UOKNOR          .               FINISHING
 49              WUGSC           CLONEEND        CLONEEND
 527017          BCCAGSC         EST             EST
 145820          MARC            EST             EST
 117958          MARC            COMPARATIVE     PCR
 81913           BARC            EST             EST
 61384           MARC            CLONE           EST
 53709           MARC            Re-Sequencing   PCR
 18623           SC              EST             EST
 2485            UIACBCB         .               EST

3' VECTOR TRIMMED counts

    CENTER_NAME     TRACE_TYPE_CODE TOTAL           3'CLV<LEN   QUAL==0          UMD.FRG
 1  BCM             WGS             24863599        10968979    551114           24050767
 2  BCM             SHOTGUN         10748529        5052692     23419            10068499
 3  NISC            SHOTGUN         737900          28972       0                735488
 4  BCCAGSC         CLONEEND        125597          125484      8926             113790
 5  UIUC            CLONEEND        114753          90243       0                106247
 6  TIGR            CLONEEND        65171           46389       0                64903
 7  GSC             CLONEEND        53556           53556       53556 (all)      0           !!! all have 0 quals and were excluded
 8  CENARGEN        WGS             26246           26246       0                25976
 9  BARC            CLONEEND        25454           25454       0                25387
 10 BCM             CLONEEND        16892           6751        0                16863
 11 CENARGEN        CLONEEND        16787           16787       0                16628
 12 UOKNOR          SHOTGUN         15150           2885        12195            0
 13 TIGR_JCVIJTC    CLONEEND        10651           339         0                10644
 14 UOKNOR          FINISHING       151             0           151              151
 15 WUGSC           CLONEEND        49              0           0                0

 16 BCCAGSC         EST             527017          524173      772              0
 17 MARC            EST             207204          207204      0                0
 18 MARC            PCR             171667          171667      0                0
 19 BARC            EST             81913           78597       0                0
 20 SC              EST             18623           7350        0                0
 21 UIACBCB         EST             2485            2485        0                0

ZERO QUALITY COUNTS

  • Counts
 CENTER_NAME     TRACE_TYPE_CODE  COUNT
 BCM             WGS              551114
 GSC             CLONEEND         53556
 BCM             SHOTGUN          23419
 UOKNOR          SHOTGUN          12195
 BCCAGSC         CLONEEND         8926
 BCCAGSC         EST              772
 UOKNOR          FINISHING        151
 TOTAL                            650134 
  • For 0 quality reads, assign quality 20 to bases 1..700, 0 to bases 701..
  • Volumes 026..039 have been fixed

Local Data

Files & Dirs

 /fs/szasmg3/bos_taurus/data/
 /fs/szasmg2/Drosophila/D_pseudoobscura/Vectors
 /nfshomes/dpuiu/db/UniVec

Software

Figaro

  • trims vector only at 5' end
  • call lucy trimming for qualities

Lucy

  • both vector sequence and splice sites are required

Atlas

  • web site
  • atlas-screen-trim-file : "calls cross_match and atlas-screen-window to create trimmed reads file (scan in from each end of read looking for 50-base windows of high quality and no vector); "

Contaminant search

nucmer reads CLIPPING range to UniVec & EcoliK12

UniVec

Ref

                 #seqs   min     max     mean    median  n50     sum
 UniVec          2861    12      48551   231     99      781     660,151
 UniVec_Core     1348    12      48551   243     98      967     327,641

Hits: alignment length

 bp      #reads  min     max     mean    median  n50     sum
 19      4548466 19      1045    28.37   23      27      129025025
 20      3684852 20      1045    30.56   25      28      112616359
 30      1097357 30      1045    48.04   38      43      52714583
 40      484661  40      1045    66.36   47      53      32163896
 100     54334   100     1045    198     116     223     10772815        # many are ESTs

Ecoli

Ref:

 K12 4,639,675 bp

Hits: alignment length

 bp      #reads  min     max     mean    median  n50     sum
 19      275109  19      1223    30.66   19      20      8435470
 20      102550  20      1223    50.29   21      161     5156849
 30      19032   30      1223    178     37      706     3381214
 40      9234    40      1223    329     171     738     3034293
 100     6781    100     1223    424     223     749     2876432
 200     4378    200     1223    575     696     771     2516916       

BCM vectors

                 #seqs   min     max     mean    median  n50     sum
 BCM             14      2580    33180   9379    5821    32705   131312

Vector/Splice site search

Strategy

  • 1. Select all the reads in the same volume that belong to one particular library; same CENTER_NAME, STRATEGY & TRACE_TYPE_CODE
  • 2. Get the quality clipping time: CLIP_QUALITY_LEFT & CLIP_QUALITY_RIGHT
  • 3. Separate reads in 2 sets according to direction TRACE_END: FORWARD & REVERSE
  • 4. Get the most frequent kmers in each set (24 & 8 bp)
  • 5. Check if the most frequent kmers are overrepresented
  • 6. Check if the most frequent 8mers are present in the most frequent 24mers
  • 7. Try to extend the 24mers by a few bp => linkers
  • 8. Align linkers to the opposite stand sequences using nucmer
  • 9. Extract the subsequences adjacent(following) to linker (50..150bp)
  • 10. Align the subsequences; if they align we've probably identified the vector
  • 11. Identify the vector name/id by alignment to UniVec => several alignments
  • 12. Check if the forward/reverse vector(s) are the same : we should find a common vector sequence; the UniVec alignments should be adjacent
  • 13. create the Lucy vector & splice files; the splice contains the linker+vector
  • 14. run lucy & trim input reads according to Lucy clr
  • 15. align lucy trimmed reads to linker,vector,splice & UniVec.dust
  • 16. align input reads to linker,vector,splice & UniVec.dust
  • 17. compare the 15. & 16. counts

Example

  • 1. volume 011 : 500,000 reads CENTER_NAME=BCM, TRACE_TYPE_CODE=WGS
  • 2.
  • 3. 249,611 TRACE_END=F & 250,389 TRACE_END=R
  • 4. kmers: 8 8bp most frequent kmers are shared by the FORWARD & REVERSE strands ; no 24bp kmers are shared
 ==> 24.fwd/kmers.tab <==
 AGTTCGACTGCAAGTAGTTCATCA      TGATGAACTACTTGCAGTCGAACT        2463 # contains AGTAGTTC
 GAGTTCGACTGCAAGTAGTTCATC      GATGAACTACTTGCAGTCGAACTC        2189
 CGAGTTCGACTGCAAGTAGTTCAT      ATGAACTACTTGCAGTCGAACTCG        1996
 TCGAGTTCGACTGCAAGTAGTTCA      TGAACTACTTGCAGTCGAACTCGA        1593
 GTTCGACTGCAAGTAGTTCATCAA      TTGATGAACTACTTGCAGTCGAAC        1023
 GAGTTCGACTGCAGTAGTTCATCA      TGATGAACTACTGCAGTCGAACTC        812
 CGAGTTCGACTGCAGTAGTTCATC      GATGAACTACTGCAGTCGAACTCG        777
 GTTCGACTGCAAGTAGTTCATCAT      ATGATGAACTACTTGCAGTCGAAC        769
 TCGAGTTCGACTGCAGTAGTTCAT      ATGAACTACTGCAGTCGAACTCGA        637
 ATCGAGTTCGACTGCAAGTAGTTC      GAACTACTTGCAGTCGAACTCGAT        594
 
 ==> 08.fwd/kmers.tab <==
 AGTAGTTC      GAACTACT        86477
 CAGTAGTT      AACTACTG        67681
 AGTTCTCA      TGAGAACT        61556
 TAGTTCTC      GAGAACTA        60964
 GTAGTTCT      AGAACTAC        57866
 AGTTCATC      GATGAACT        49676
 TAGTTCAT      ATGAACTA        45298
 GTTCATCA      TGATGAAC        42117
 GCAGTAGT      ACTACTGC        41391
 GTAGTTCA      TGAACTAC        40694
 
 ==> 24.rev/kmers.tab <==
 TATCGATGGTACAGTAGTTCATCA      TGATGAACTACTGTACCATCGATA        999 # contains AGTAGTTC
 CTATCGATGGTACAGTAGTTCATC      GATGAACTACTGTACCATCGATAG        774
 GCTATCGATGGTACAGTAGTTCAT      ATGAACTACTGTACCATCGATAGC        600
 CGCTATCGATGGTACAGTAGTTCA      TGAACTACTGTACCATCGATAGCG        432
 ATCGATGGTACAGTAGTTCATCAT      ATGATGAACTACTGTACCATCGAT        417
 ATCGATGGTACAGTAGTTCATCAA      TTGATGAACTACTGTACCATCGAT        380
 ATCAGATGGTACAGTAGTTCATCA      TGATGAACTACTGTACCATCTGAT        373
 ATCGATGGTACAGTAGTTCATCAC      GTGATGAACTACTGTACCATCGAT        265
 CTATCGATGGTAAGTAGTTCATCA      TGATGAACTACTTACCATCGATAG        235
 TCAGATGGTACAGTAGTTCATCAA      TTGATGAACTACTGTACCATCTGA        224
 
 ==> 08.rev/kmers.tab <==
 AGTTCATC      GATGAACT        85127
 TAGTTCAT      ATGAACTA        77902
 GTTCATCA      TGATGAAC        75585
 TAGTTCTC      GAGAACTA        68057
 AGTTCTCA      TGAGAACT        67277
 GTAGTTCT      AGAACTAC        64894
 GTAGTTCA      TGAACTAC        62607
 CGTAGTTC      GAACTACG        52031
 AGTAGTTC      GAACTACT        51013
 ACGTAGTT      AACTACGT        31552
  • 7. Get linker sequences
 >linker.fwd 27bp
 TCGAGTTCGACTGCAAGTAGTTCATCA
 >linker.rev 27bp
 CTAATCAGATGGTACAGTAGTTCATCA 
  
 #>linker.rev 40 bp Art's  (13 more bp at 5')        
 #TATGACCATGCGCCTAATCAGATGGTACAGTAGTTCATCA
 #GCTATCGATGGTACAGTAGTTCATCAT is the most frequent rev seq 27 kmers but not the linker (few snp differences)
  • 8 & 9 Align reads to linkers using nucmer

Fwd:

 nucmer -l 12 -c 24 -r linker.fwd.seq ../bos_taurus.$v.r.fasta 
 #  nucmer -l 12 -c 24 -r kmers.seq ../bos_taurus.$v.r.fasta  
 show-coords out.delta | awk '{print $19,$5,$13}' > ! out.clr
 extractfromfastanames.pl -clr -f out.clr < ../bos_taurus.$v.r.fasta >! out.seq
 

Rev:

 nucmer -l 12 -c 24 -r linker.rev.seq ../bos_taurus.$v.f.fasta
 #  nucmer -l 12 -c 24 -r kmers.seq ../bos_taurus.$v.f.fasta  
 show-coords out.delta | awk '{print $19,$5,$13}' > ! out.clr
 extractfromfastanames.pl -clr -f out.clr < ../bos_taurus.$v.f.fasta >! out.seq
 

Both:

 clrFasta out.seq >! out.cseq
 fasta2tab.pl out.cseq | sort -k2 > ! out.tab
 nucmer -c 40 out.cseq ~/db/UniVec -p vector
 delta-filter -q vector.delta >! vector.filter-q.delta
 show-coords vector.filter-q.delta | sort -n | head
 cat vector.filter-q.delta | grep "^>" | count.pl -c 1 -m 2
 
  • 10. Extract "vector reads"
 >399553028  # 24.fwd     
 TGATGAACTACTGTACCATCTGATTAGGCGCATGGTCATAGCTGTTTCCTGTGTGAAATT
 GCTATCCGCTCACAATTCCACACAACATACGAGCCGGAAGCATAAAGTGTAAAGCCTGGG
 GTGTCAAATGAGAGACCTAACTCACATTCAACTTTTTTTTTTTTTCTGCCCTCTATTCTA
 ...
 >400269118 #24.rev
 TGATGAACTACTTGCAGTCGAAATCGAATCATCACTGGCCGTCCTTTTACAACGTCGTGA
 CTGGGAAAACCCTGGCGTTACCCAACTTAATCCGCCTTGCAGCACATCCCCCTTTCCCCC
 AGCTGGCGTAAAAACGTAAAAAGCCCCGCACCGATCGCCCTTTCCCAACAGGTTGCCCAG
  • 11. Align "vector reads" to UniVec
 show-coords 24.fwd/400269118-UniVec.delta 24.rev/399553028-UniVec.delta | grep J01636.1
     31  148  | 1175 1292  | 118   118  |  95.76  |     1276     7477  |     9.25     1.58  | 399553028.rev gnl|uv|J01636.1:1-7477
     32  199  | 1302 1463  | 168   162  |  90.48  |      653     7477  |    25.73     2.17  | 400269118     gnl|uv|J01636.1:1-7477
  • 12. 10bp distance between the 2 alignments
  • 13. Lucy files
 $ more vector.seq
   >J01636 E.coli lactose operon with lacI, lacZ, lacY and lacA genes
   GACACCATCGAATGGCGCAAAACCTTTCGCGGTATGGCATGATAGCGCCCGGAAGAGAGTCAATTCAGGG
   TGGTGAATGTGAAACCAGTAACGTTATACGATGTCGCAGAGTATGCCGGTGTCTCTTATCAGACCGTTTC
   CCGCGTGGTGAACCAGGCCAGCCACGTTTCTGCGAAAACGCGGGAAAAAGTGGAAGCGGCGATGGCGGAG
   CTGAATTACATTCCCAACCGCGTGGCACAACAACTGGCGGGCAAACAGTCGTTGCTGATTGGCGTTGCCA
   ...
 
 $ more splice.seq
   >J01636.for.begin vector+linker.rev
   TGAATGTGAGTTAGGTCTCTCATTTGACACCCCAGGCTTTACACTTTATGCTTCCGGCTC
   GTATGTTGTGTGGAATTGTGAGCGGATAGCAATTTCACACAGGAAACAGCTATGACCATG
   CGCCTAATCAGATGGTACAGTAGTTCATCA
   >J01636.for.end  rev(linker.fwd)+vector 
   TGATGAACTACTTGCAGTCGAAATCGAATCATCACTGGCCGTCCTTTTACAACGTCGTGA
   CTGGGAAAACCCTGGCGTTACCCAACTTAATCCGCCTTGCAGCACATCCCCCTTTCCCCC
   AGCTGGCGTAAAAACGTAAAAAGCCCCGCA
   >J01636.rev.begin (revcomp of J01636.for.end)
   TGCGGGGCTTTTTACGTTTTTACGCCAGCTGGGGGAAAGGGGGATGTGCTGCAAGGCGGA
   TTAAGTTGGGTAACGCCAGGGTTTTCCCAGTCACGACGTTGTAAAAGGACGGCCAGTGAT
   GATTCGATTTCGACTGCAAGTAGTTCATCA
   >J01636.rev.end (revcomp of J01636.for.begin)
   TGATGAACTACTGTACCATCTGATTAGGCGCATGGTCATAGCTGTTTCCTGTGTGAAATT
   GCTATCCGCTCACAATTCCACACAACATACGAGCCGGAAGCATAAAGTGTAAAGCCTGGG
   GTGTCAAATGAGAGACCTAACTCACATTCA
 # splice=linker+vector  
      3      120  |     1175     1292  |      118      118  |    95.76  |      150     7477  |    78.67     1.58  | J01636.for.begin   J01636
     32      131  |     1302     1399  |      100       98  |    96.00  |      150     7477  |    66.67     1.31  | J01636.for.end     J01636
  • 14. Run lucy & trim reads
 $ /nfshomes/dpuiu/szdevel/SourceForge/lucy-1.19p/lucy \ 
     -v vector.seq splice.seq
     -o bos_taurus.lucy.seq bos_taurus.lucy.qual \
     -debug  bos_taurus.lucy.info \
     bos_taurus.seq bos_taurus.qual
 # Trim clr
 $ clrFasta bos_taurus.seq > bos_taurus.cseq
  • 15. Align lucy output to linker, vector, splice & UniVec.dust
 $ nucmer -l 12 -c 24 ~/db/vector.seq  bos_taurus.lucy.cseq -p vector-bos_taurus.lucy
 $ nucmer -l 16 -c 30 ~/db/vector.seq  bos_taurus.lucy.cseq -p vector-bos_taurus.lucy
 $ nucmer -l 16 -c 30 ~/db/splice.seq  bos_taurus.lucy.cseq -p splice-bos_taurus.lucy
 $ nucmer -l 16 -c 30 ~/db/UniVec.dust bos_taurus.lucy.cseq -p UniVec.dust-bos_taurus.lucy
  • 16. Align input to linker, vector, splice & UniVec.dust
 $ nucmer -l 12 -c 24 ~/db/linker.seq bos_taurus.seq -p linker-bos_taurus
 $ nucmer -l 16 -c 30 ~/db/vector.seq bos_taurus.seq -p vector-bos_taurus
 $ nucmer -l 16 -c 30 ~/db/splice.seq bos_taurus.seq -p splice-bos_taurus
 $ nucmer -l 16 -c 30 ~/db/UniVec.dust bos_taurus.seq -p UniVec.dust-bos_taurus

Count how many reads got trimmed

 infoseq *seq | getSummary.pl -c 1 -t original.LEN
 
 cat bos_taurus.lucy.info | awk '{print $4-$3}' | getSummary.pl -t lucy.CLR >! bos_taurus.lucy.summary  
 cat bos_taurus.lucy.info | getSummary.pl -c 14 -t lucy.CLV5 -nh >> bos_taurus.lucy.summary
 cat bos_taurus.lucy.info | getSummary.pl -c 15 -t lucy.CLV3 -nh >> bos_taurus.lucy.summary

Libraries

011.BCM.WGS FORWARD

  • vector: J01636
  • UniVec: gnl|uv|J01636.1:1-7477 E.coli lactose operon with lacI, lacZ, lacY and lacA genes
 ll ~dpuiu/db/J01636*
 -rw-rw-r--  1 dpuiu dpuiu 7651 Jan  9 15:56 /nfshomes/dpuiu/db/J01636
 -rw-rw-r--  1 dpuiu dpuiu  105 Jan 14 07:17 /nfshomes/dpuiu/db/J01636linker
 -rw-rw-r--  1 dpuiu dpuiu  840 Jan 13 13:43 /nfshomes/dpuiu/db/J01636splice
 cat  ~dpuiu/db/J01636* | infoseq
 J01636            7477   53.43
 J01636.linker.fwd 27     44.44
 J01636.linker.rev 27     37.04
 J01636.for.begin  150    44.67
 J01636.for.end    150    51.33
 J01636.rev.begin  150    51.33
 J01636.rev.end    150    44.67
  • 249,611 reads:
  • 91% got vector trimmed at the 5'
  • 0.4% (1149) got vector trimmed at the 3'
                 #elem   #0s     min     max     mean    median  n50     sum
 original.LEN    249611  0       437     2349    1082    991     1009    270035781     
 lucy.CLV5       249611  21215   0       741     25.03   25      27      6247415
 lucy.CLV3       249611  248462  0       1047    3.49    0       859     870344
  • Original reads hit counts:
10975 linker.fwd
133   linker.rev
166   splice
152   vector
228   UniVec.dust
  • Lucy trimmed read counts
2 linker.fwd
0 linker.rev
1 splice
1 vector
6 UniVec.dust (only 3 are >40bp)

011.BCM.WGS REVERSE

                 #elem   #0s     min     max     mean    median  n50     sum
 original.LEN    250389  0       502     2148    1085    993     1012    271691094
 lucy.CLR        250389  7345    0       1281    795     876     892     198982171
 lucy.CLV5       250389  20271   0       668     26.52   27      29      6641362
 lucy.CLV3       250389  249269  0       997     3.35    0       861     839029
  • Original reads hit counts:
 linker.fwd      113
 linker.rev      3812
 splice          143
 UniVec.dust     237
 vector          4318
  • Lucy trimmed reads hit counts:
 linker.fwd      1
 linker.rev      0
 splice          1
 UniVec.dust     10
 vector          1

030.BCM.SHOTGUN

  • same linker/vector/splice as BCM.WGS
  • 2.5% (4K out of 160K) reads contain linker & vector at 3'
                 #elem   #0s     min     max     mean    median  n50     sum
 original.LEN    8411    0       325     1685    1181    1240    1314    9933150
 lucy.CLR        8411    8       0       1054    841     863     874     7070994
 lucy.CLV5       8411    568     0       232     27.01   28      29      227206
 lucy.CLV3       8411    2325    0       1040    597     794     851     5023445
  • Original reads hit counts:
 linker.fwd      4314
 linker.rev      4125
 splice          7816
 UniVec.dust     4212
 vector          6750
 vector          27235
  • Lucy trimmed reads hit counts:
 linker.fwd      3
 linker.rev      1
 splice          1
 UniVec.dust     13
 vector          0

001.NISC.SHOTGUN

  • Vector: pOTW13
  • UniVec: 3 partial seqs
 gnl|uv|NGB00080.1:1-198 pOTW13 with linkers
 gnl|uv|NGB00080.1:718-888 pOTW13 with linkers
 gnl|uv|NGB00080.1:1490-1654-49 pOTW13 with linkers
 ll /nfshomes/dpuiu/db/NGB00080*
 -rw-rw-r--  1 dpuiu dpuiu 1083 Jan 14 20:43 /nfshomes/dpuiu/db/NGB00080
 -rw-r--r--  1 dpuiu dpuiu   94 Jan 14 21:01 /nfshomes/dpuiu/db/NGB00080linker
 -rw-r--r--  1 dpuiu dpuiu 2183 Jan 14 20:44 /nfshomes/dpuiu/db/NGB00080splice
 cat  /nfshomes/dpuiu/db/NGB00080* | infoseq
 NGB00080       1054   50.00
 NGB00080.linker.fwd 24     45.83
 NGB00080.linker.rev 26     53.85
 NGB00080.for.beg 518    46.14
 NGB00080.for.end 518    50.48
 NGB00080.rev.begin 518    50.48
 NGB00080.rev.beg 518    46.14
  • 944 read sample
                 #elem   #0s     min     max     mean    median  n50     sum
 original.LEN    944     0       652     1017    735     721     722     693668
 lucy.CLR        944     39      0       886     415     422     522     391333
 lucy.CLV5       944     121     0       275     34.05   33      35      32143
 lucy.CLV3       944     18      0       885     410     409     511     387007
  • Original reads hit counts:
 linker.fwd      479
 linker.rev      492
 splice          910
 UniVec.dust     0
 vector          939
  • Lucy trimmed reads hit counts:
 linker.fwd      1
 linker.rev      0
 splice          0
 UniVec.dust     9
 vector          1

060.BCCAGSC.CLONEEND

  • Linkers:
 linker.fwd CCCTGCTTTGTCTGGAAGGGGTTCCCGACCT
 linker.rev CAGGAGGGGAGAAAGGGCTCAGAGG
  • No common vector !!!
 wc -l *clb
   60746 bos_taurus.060.f.clb  #18 reads original align to UniVec (nucmer default params)
   60836 bos_taurus.060.r.clb
 
 Fwd:
    329      428  |      440      535  |      100       96  |    91.00  |      503     1585  |    19.88     6.06  | 723951410  gnl|uv|U30497.1:3230-4814 Cloning vector pAS2-1
    330      370  |       89       49  |       41       41  |   100.00  |      503      143  |     8.15    28.67  | 723951410  gnl|uv|U67875.1:6541-6683 pESP-I yeast expression vector
    330      370  |       94       54  |       41       41  |   100.00  |      503      143  |     8.15    28.67  | 723951410  gnl|uv|U67875.1:6541-6683 pESP-I yeast expression vector
 
  Rev:
      1       96  |       71      165  |       96       95  |    93.81  |      203      165  |    47.29    57.58  | 724018013  gnl|uv|AF133437.1:16659-16823 Cloning vector pCYPAC6
     50      143  |        1       94  |       94       94  |    92.71  |      203       94  |    46.31   100.00  | 724018013  gnl|uv|U80929.2:2858-2951     Cloning vector pBACe3.6

017.UIUC.CLONEEND

  • No overrepresented kmers
 wc -l *clb
  17978 bos_taurus.017.f.clb
  17911 bos_taurus.017.r.clb
 
 ==> 24.fwd/kmers.tab <==
 CCCTGCTTTGTCTGGAAGGGGTTC        GAACCCCTTCCAGACAAAGCAGGG        9
 CTGCTTTGTCTGGAAGGGGTTCCC        GGGAACCCCTTCCAGACAAAGCAG        9
 
 ==> 24.rev/kmers.tab <==
 GAATGTTGAGCTTTAGCCAACTTT        AAAGTTGGCTAAAGCTCAACATTC        4
 TCTGAATGTTGAGCTTTAGCCAAC        GTTGGCTAAAGCTCAACATTCAGA        4
 
 ==> 8.fwd/kmers.tab <==
 TTTTTTTT        AAAAAAAA        55
 AAGGGGTT        AACCCCTT        35
 
 ==> 8.rev/kmers.tab <==
 GTCTGGAA        TTCCAGAC        41
 TCTGGAAG        CTTCCAGA        39
  • No UniVec hits

010.TIGR.CLONEEND

  • No overrepresented kmers
 wc -l *clb
 5479 bos_taurus.032.f.clb
 5174 bos_taurus.032.r.clb
 
 ==> 24.fwd/kmers.tab <==
 CTTGTGTTGGCCCAGGCAAGTCCA        TGGACTTGCCTGGGCCAACACAAG        30
 TTGTGTTGGCCCAGGCAAGTCCAA        TTGGACTTGCCTGGGCCAACACAA        30
 
 ==> 24.rev/kmers.tab <==
 CTGCCTCTTGTGTTGGCCCAGGCA        TGCCTGGGCCAACACAAGAGGCAG        16
 GCTGCCTCTTGTGTTGGCCCAGGC        GCCTGGGCCAACACAAGAGGCAGC        15
 
 ==> 8.fwd/kmers.tab <==
 GAGTGGGT        ACCCACTC        176
 GGAGTGGG        CCCACTCC        171
 
 ==> 8.rev/kmers.tab <==
 TGGAGTGG        CCACTCCA        182
 GGAGTGGG        CCCACTCC        181
 
  • No UniVec hits

...

070.BCM.CLONEEND

  • No frequent kmers
 wc -l *clb
   6027 bos_taurus.070.f.clb
   6236 bos_taurus.070.r.clb
 
 ==> 24.fwd/kmers.tab <==
 GGACTCTCAGAGTCTTCTCCAACA        TGTTGGAGAAGACTCTGAGAGTCC        18
 ACTGGTTGGATCTCCTTGCAGTCC        GGACTGCAAGGAGATCCAACCAGT        18

 ==> 24.rev/kmers.tab <==
 ATAAAATCTGAGCCACCAGGGAAG        CTTCCCTGGTGGCTCAGATTTTAT        1
 CTATTGGTTCATATGGTCAACGTC        GACGTTGACCATATGAACCAATAG        1
 
 ==> 8.fwd/kmers.tab <==
 TTTTTTTT        AAAAAAAA        86
 CTTCTCCA        TGGAGAAG        75
 
 ==> 8.rev/kmers.tab <==
 TATAGTGT        ACACTATA        9
 ATATAGGG        CCCTATAT        8
  • No alignments to BCM WGS vector

Running Lucy

  • Default parameters with vector trimming

BCM.WGS

                           elem       <0         0          >0         min        max        mean       median     n50        sum
 orig.LEN                  24863599   0          0          24863599   5          3097       1002       997        1015       24915462033
 orig.CLR                  24863599   463669     7          24399923   -1143      1833       765        836        864        19036744256
 orig.CLR5                 24863599   0          359245     24504354   0          2103       42         22         58         1047922451
 orig.CLR3                 24863599   463404     0          24400195   -1         2169       807        872        895        20084666707
 lucy.CLR                  24863599   0          739529     24124070   0          1219       792        878        904        19695000417
 lucy.CLR5                 24863599   739529     36108      24087962   -1         1753       43         29         42         1086413880
 lucy.CLR3                 24863599   739529     0          24124070   -1         1894       835        915        939        20781414297
 orig.CLR5-lucy.CLR5       24863599   16299521   215345     8348733    -1186      2104       -1         -10        -1186      -38491429
 orig.CLR3-lucy.CLR3       24863599   14858542   1494794    8510263    -1273      2170       -28        -20        -1273      -696747590
 orig.CLV                  24863599   1053       1920       24860626   -2         5345       1015       1002       1017       25260581538
 orig.CLV5                 8841849    0          0          8841849    1          1219       33         46         49         295011460
 orig.CLV3                 24861698   1053       0          24860645   -1         5346       1027       1005       1019       25555592998
 lucy.CLV                  24863599   10694      707        24852198   -469       3096       973        968        987        24195085877
 lucy.CLV5                 24863599   0          3135007    21728592   0          1359       25         27         29         623457486
 lucy.CLV3                 24863599   0          0          24863599   4          3096       998        995        1014       24818543363
 lucy.CLVABS5              24863599   0          3135007    21728592   0          1359       25         27         29         623457486
 lucy.CLVABS3              24863599   0          24770953   92646      0          1343       2          0          880        72055071
 orig.CLV5-lucy.CLV5       24863599   17216820   1512453    6134326    -1312      1219       -13        -25        -1312      -328446026
 orig.CLV3-lucy.CLV3       24863599   1519132    18579609   4764858    -1832      4672       29         0          479        737049635

BCM.SHOTGUN

                           elem       <0         0          >0         min        max        mean       median     n50        sum
 orig.LEN                  10748529   0          0          10748529   5          2043       975        950        964        10486690472
 orig.CLR                  10748529   17308      2          10731219   -1293      1467       809        833        847        8701344571
 orig.CLR5                 10748529   0          68         10748461   0          1315       26         16         38         288662580
 orig.CLR3                 10748529   16780      0          10731749   -1         1647       836        851        863        8990007151
 lucy.CLR                  10748529   0          98070      10650459   0          1337       833        854        868        8955866769
 lucy.CLR5                 10748529   98070      1973       10648486   -1         1307       35         28         32         376276188
 lucy.CLR3                 10748529   98070      0          10650459   -1         1553       868        882        896        9332142957
 orig.CLR5-lucy.CLR5       10748529   9498290    65171      1185068    -1099      1293       -8         -11        -1099      -87613608
 orig.CLR3-lucy.CLR3       10748529   6879532    671097     3197900    -1149      1437       -31        -26        -1149      -342135806
 orig.CLV                  10748529   16779      412        10731338   -2         3919       974        948        964        10472347908
 orig.CLV5                 8594910    0          0          8594910    1          1239       3          1          49         28350257
 orig.CLV3                 10748349   16779      0          10731570   -1         3919       976        950        965        10500698165
 lucy.CLV                  10748529   7026       614        10740889   -268       2042       930        924        940        9997862132
 lucy.CLV5                 10748529   0          1011521    9737008    0          855        24         24         27         257993796
 lucy.CLV3                 10748529   0          0          10748529   4          2042       954        945        962        10255855928
 lucy.CLVABS5              10748529   0          1011521    9737008    0          855        24         24         27         257993796
 lucy.CLVABS3              10748529   0          10453587   294942     0          1214       20         0          847        220086015
 orig.CLV5-lucy.CLV5       10748529   9538738    138680     1071111    -854       1239       -21        -23        -854       -229643539
 orig.CLV3-lucy.CLV3       10748529   357934     9324166    1066429    -1328      2846       22         0          704        244842237

NISC.SHOTGUN

                           elem       <0         0          >0         min        max        mean       median     n50        sum
 orig.LEN                  737900     0          0          737900     104        2104       784        729        734        579172842
 orig.CLR                  737900     5988       2          731910     -636       1033       651        668        676        480400909
 orig.CLR5                 737900     0          0          737900     1          1407       47         40         51         34857531
 orig.CLR3                 737900     0          5879       732021     0          1470       698        710        715        515258440
 lucy.CLR                  737900     0          8248       729652     0          1035       658        670        676        485757685
 lucy.CLR5                 737900     8248       56         729596     -1         1091       45         35         46         33811606
 lucy.CLR3                 737900     8248       0          729652     -1         1391       704        710        714        519569291
 orig.CLR5-lucy.CLR5       737900     253727     89345      394828     -566       1408       1          1          485        1045925
 orig.CLR3-lucy.CLR3       737900     177007     31         560862     -867       1471       -5         1          -867       -4310851
 orig.CLV                  737900     3224       2655       732021     -636       2103       771        725        730        569178445
 orig.CLV5                 734026     0          0          734026     1          987        5          1          35         4375315
 orig.CLV3                 732021     0          0          732021     35         2104       783        729        734        573553760
 lucy.CLV                  737900     1335       55         736510     -200       2104       747        696        702        551392388
 lucy.CLV5                 737900     104491     0          633409     -1         1199       30         31         34         22784742
 lucy.CLV3                 737900     0          0          737900     15         2103       778        728        733        574177130
 lucy.CLVABS5              737900     0          104491     633409     0          1200       31         32         35         23522642
 lucy.CLVABS3              737900     0          730699     7201       0          1076       5          0          686        4257812
 orig.CLV5-lucy.CLV5       737900     561851     66390      109659     -1198      983        -24        -29        -1198      -18409427
 orig.CLV3-lucy.CLV3       737900     8386       1          729513     -950       1077       0          1          -950       -623370