Bos taurus redo: Difference between revisions

* Location: /fs/szasmg3/bos_taurus/data/frg
* All DST messages are unique
* bos_taurus.clv : contains the vector clipping points
** BCM.WGS, BCM.SHOTGUN & NISC.SHOTGUN: lucy.clv
** others: the TA clv
** 374,454 reads don't have valid clv's
** 36,446,031 reads have valid clv's with avg=955

== Message counts (original) ==
                                                 DST     FRG             LKG
   bos_taurus.BCM.WGS.frg                        79      24124070        11311841
   ...
   total                                         18885   35973728        13648878

== Message counts (quality) ==
                                                DST     FRG             LKG
  bos_taurus.BCM.WGS.qual.count                 79      23580109        11035582
  bos_taurus.BCM.SHOTGUN.qual.count             7339    10644092        1799069
  bos_taurus.NISC.SHOTGUN.count                 246     729652          344932
  bos_taurus.BCCAGSC.CLONEEND.qual.count        1       116484          53585
  bos_taurus.UIUC.CLONEEND.count                2       114750          46319
  bos_taurus.TIGR.CLONEEND.count                1       65171           27067
  bos_taurus.CENARGEN.WGS.count                 0       26246           0
  bos_taurus.BARC.CLONEEND.count                11150   25454           11150
  bos_taurus.BCM.CLONEEND.count                 1       16875           7103
  bos_taurus.CENARGEN.CLONEEND.count            1       16787           6269
  bos_taurus.TIGR_JCVIJTC.CLONEEND.count        2       10651           4803
  bos_taurus.UOKNOR.SHOTGUN.qual.count          1       2456            813
  bos_taurus.WUGSC.COLONEEND.count              1       49              21
  total                                         18824   35348776        13336713
 
== Message counts (0quality) ==
                                                DST     FRG             LKG
  bos_taurus.BCM.WGS.0qual.count                79      543961          234397
  bos_taurus.GSC.CLONEEND.0qual.count           1       53521           25889
  bos_taurus.UOKNOR.SHOTGUN.0qual.count         1       12195           4097
  bos_taurus.BCCAGSC.CLONEEND.0qual.count       1       8757            2114
  bos_taurus.BCM.SHOTGUN.0qual.count            7339    6367            0
  bos_taurus.UOKNOR.FINISHING.0qual.count       0       151             0
  total                                         7421    624952          266497

Revision as of 17:15, 23 January 2009

BCM

NCBI Data

  • Avg LEN=984
  • Avg CLIP (CLB intersect CLV)=760
  • Avg CLV=997 > Avg LEN ???
  • Avg QUAL=38.96 (27.51 for the 2.59M reads not in the UMD assembly)
  • Avg UMDoverlapper CLIP=778

Problems:

  • 0 QUAL reads 650,133
  • the quality lines in several qual files start with a space; the space must be removed, otherwise tarchive2ca fails, reporting len(quality)=len(seq)+1
  • several XML files contained the "&" character => XML parser error
  • xml.bos_taurus.087 contained 2 trace_volumes => XML parser error
  • BCCAGSC.CLONEEND: all reads have LIBRARY_ID=CH240, SEQ_LIB_ID=. ; INSERT_SIZE & INSERT_STDEV vary within the library: set to 150,000 & 30,000
  • UIUC.CLONEEND: INSERT_SIZE & INSERT_STDEV missing: set to 150,000 & 30,000
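
The first two problems can be repaired with simple line-level filters. The sketch below is only an illustration, not the script actually used here: the function names are hypothetical, and it assumes the qual and XML files can be processed one line at a time.

```python
import re


def strip_leading_space(qual_line: str) -> str:
    """Drop leading blanks/tabs from one line of a qual file (assumed fix)."""
    return qual_line.lstrip(" \t")


# '&' that does not already start an entity such as &amp; or &#38;
_BARE_AMP = re.compile(r"&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(xml_line: str) -> str:
    """Replace bare '&' characters with '&amp;' so an XML parser accepts the line."""
    return _BARE_AMP.sub("&amp;", xml_line)


if __name__ == "__main__":
    # Toy inputs, not taken from the real volumes.
    print(strip_leading_space(" 20 20 35 40\n"), end="")
    print(escape_bare_ampersands("<field>A & B</field>"))
```

Entities that are already escaped are left untouched, so the filter can be re-run safely on a partly fixed file.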

CENTER_NAME counts

    COUNT           CENTER_NAME     
 1  35629020        BCM             Baylor College of Medicine
 2  737900          NISC            NIH Intramural Sequencing Center
 3  652614          BCCAGSC         British Columbia Cancer Agency Genome Sciences Centre                           # TA query_tracedb CENTER_NAME = "BCCAGSC" => 652,510 
 4  378871          MARC            USDA, ARS, US Meat Animal Research Center
 5  114753          UIUC            University of Illinois at Urbana-Champaign                                      # TA query_tracedb CENTER_NAME = "UIUC" => 106,368
 6  107367          BARC            USDA, ARS, Beltsville Agricultural Research Center
 7  65171           TIGR            The Institute for Genomic Research
 8  53556           GSC             Genoscope
 9  43033           CENARGEN        Embrapa Genetic Resources and Biotechnology
 10 18623           SC              The Sanger Center
 11 15301           UOKNOR          University of Oklahoma Norman Campus, Advanced Center for Genome Technology
 12 10651           TIGR_JCVIJTC    The Institute for Genomic Research, Traces generated at JCVIJTC                 # TA query_tracedb CENTER_NAME="JCVI"
 13 2485            UIACBCB         University of Iowa Center for Bioinformatics and Computational Biology (UIACBCB)
 14 49              WUGSC           Washington University, Genome Sequencing Center                                 # TA query_tracedb CENTER_NAME = "WUGSC" => 9
    37829394        total           total                                                                           # TA query_tracedb SPECIES_CODE = "BOS TAURUS" => 37,788,710

TRACE_TYPE_CODE counts

    COUNT         CENTER_NAME     TRACE_TYPE_CODE        
 1  24863599      BCM*            WGS                    SEQ_LIB_ID:89
 2  10748529      BCM*            SHOTGUN                SEQ_LIB_ID:15543
 3  737900        NISC            SHOTGUN                SEQ_LIB_ID:247
 4  125597        BCCAGSC         CLONEEND               LIBRARY_ID:1         large insert size; some qualityless; !!! almost all have CLIP3=0
 5  114753        UIUC            CLONEEND               LIBRARY_ID:2         insert size missing , no frequent kmers
 6  65171         TIGR            CLONEEND               SEQ_LIB_ID:1         2K & use TRACE_DIRECTION instead of TRACE_END
 7  53556         GSC             CLONEEND               SEQ_LIB_ID:1         large insert size; !!! all have qual=0 and were excluded 
 8  26246         CENARGEN        WGS                    .                    no LIBRARY_ID; no SEQ_LIB_ID; no INSERT_SIZE; no INSERT_STDEV; reads have no direction; ~21954 could be paired (same TEMPLATE_ID)
 9  25454         BARC            CLONEEND               SEQ_LIB_ID:14304     !!! all have CLIP3=0
 10 16892         BCM*            CLONEEND               LIBRARY_ID:1         VBBAA   mean=167000  std=25000
 11 16787         CENARGEN        CLONEEND               LIBRARY_ID:1         
 12 15150         UOKNOR          SHOTGUN                LIBRARY_ID:1         some qualityless
 13 10651         TIGR_JCVIJTC    CLONEEND               SEQ_LIB_ID:2
 14 151           UOKNOR          FINISHING              LIBRARY_ID:1         some qualityless, no direction(TRACE_END=N); no INSERT_SIZE; no INSERT_STDEV
 15 49            WUGSC           CLONEEND               SEQ_LIB_ID:1 
    36820485      total

 16 527017        BCCAGSC         EST
 17 207204        MARC            EST
 18 171667        MARC            PCR
 19 81913         BARC            EST
 20 18623         SC              EST 
 21 2485          UIACBCB         EST
    1008909       total

STRATEGY & TRACE_TYPE_CODE counts

 COUNT           CENTER_NAME     STRATEGY        TRACE_TYPE_CODE
 12545304        BCM             .               WGS
 11425910        BCM             WGA             WGS
 5223683         BCM             CLONE           SHOTGUN
 4479883         BCM             POOLCLONE       SHOTGUN
 1044963         BCM             .               SHOTGUN
 892385          BCM             SNP             WGS
 737900          NISC            CLONE           SHOTGUN
 125597          BCCAGSC         CLONEEND        CLONEEND
 114753          UIUC            CLONEEND        CLONEEND 
 65171           TIGR            CLONEEND        CLONEEND
 53556           GSC             CLONEEND        CLONEEND
 26246           CENARGEN        .               WGS
 25454           BARC            .               CLONEEND
 16892           BCM             CLONEEND        CLONEEND
 16787           CENARGEN        CLONEEND        CLONEEND
 12195           UOKNOR          .               SHOTGUN
 10651           TIGR_JCVIJTC    CLONEEND        CLONEEND
 2955            UOKNOR          CLONE           SHOTGUN
 151             UOKNOR          .               FINISHING
 49              WUGSC           CLONEEND        CLONEEND
 527017          BCCAGSC         EST             EST
 145820          MARC            EST             EST
 117958          MARC            COMPARATIVE     PCR
 81913           BARC            EST             EST
 61384           MARC            CLONE           EST
 53709           MARC            Re-Sequencing   PCR
 18623           SC              EST             EST
 2485            UIACBCB         .               EST
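
As a consistency check, the per-STRATEGY rows should add up to the per-TRACE_TYPE_CODE counts listed earlier. The sketch below does this for the BCM rows only. The numbers are copied from the two tables above; the variable and function names are made up for the illustration.

```python
from collections import defaultdict

# (count, CENTER_NAME, STRATEGY, TRACE_TYPE_CODE) rows copied from the table above
strategy_rows = [
    (12545304, "BCM", ".", "WGS"),
    (11425910, "BCM", "WGA", "WGS"),
    (5223683, "BCM", "CLONE", "SHOTGUN"),
    (4479883, "BCM", "POOLCLONE", "SHOTGUN"),
    (1044963, "BCM", ".", "SHOTGUN"),
    (892385, "BCM", "SNP", "WGS"),
    (16892, "BCM", "CLONEEND", "CLONEEND"),
]

# Counts from the TRACE_TYPE_CODE table
trace_type_counts = {
    ("BCM", "WGS"): 24863599,
    ("BCM", "SHOTGUN"): 10748529,
    ("BCM", "CLONEEND"): 16892,
}


def subtotal(rows):
    """Sum the counts per (CENTER_NAME, TRACE_TYPE_CODE), ignoring STRATEGY."""
    totals = defaultdict(int)
    for count, center, _strategy, trace_type in rows:
        totals[(center, trace_type)] += count
    return dict(totals)


if __name__ == "__main__":
    for key, total in sorted(subtotal(strategy_rows).items()):
        status = "ok" if trace_type_counts[key] == total else "MISMATCH"
        print(key, total, status)
```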

3' VECTOR TRIMMED counts

    CENTER_NAME     TRACE_TYPE_CODE TOTAL           3'CLV<LEN   QUAL==0          UMD.FRG
 1  BCM             WGS             24863599        10968979    551114           24050767
 2  BCM             SHOTGUN         10748529        5052692     23419            10068499
 3  NISC            SHOTGUN         737900          28972       0                735488
 4  BCCAGSC         CLONEEND        125597          125484      8926             113790
 5  UIUC            CLONEEND        114753          90243       0                106247
 6  TIGR            CLONEEND        65171           46389       0                64903
 7  GSC             CLONEEND        53556           53556       53556 (all)      0           !!! all have 0 quals and were excluded
 8  CENARGEN        WGS             26246           26246       0                25976
 9  BARC            CLONEEND        25454           25454       0                25387
 10 BCM             CLONEEND        16892           6751        0                16863
 11 CENARGEN        CLONEEND        16787           16787       0                16628
 12 UOKNOR          SHOTGUN         15150           2885        12195            0
 13 TIGR_JCVIJTC    CLONEEND        10651           339         0                10644
 14 UOKNOR          FINISHING       151             0           151              151
 15 WUGSC           CLONEEND        49              0           0                0

 16 BCCAGSC         EST             527017          524173      772              0
 17 MARC            EST             207204          207204      0                0
 18 MARC            PCR             171667          171667      0                0
 19 BARC            EST             81913           78597       0                0
 20 SC              EST             18623           7350        0                0
 21 UIACBCB         EST             2485            2485        0                0

ZERO QUALITY COUNTS

  • Counts
 CENTER_NAME     TRACE_TYPE_CODE  COUNT
 BCM             WGS              551114
 GSC             CLONEEND         53556
 BCM             SHOTGUN          23419
 UOKNOR          SHOTGUN          12195
 BCCAGSC         CLONEEND         8926
 BCCAGSC         EST              772
 UOKNOR          FINISHING        151
 TOTAL                            650133
  • For 0 quality reads, assign quality 20 to bases 1..700, 0 to bases 701..
  • Volumes 026..039 have been fixed
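
The placeholder-quality rule above (quality 20 for bases 1..700, 0 from base 701 on) can be written as a small function. This is a minimal sketch with hypothetical names; the cutoff and quality value are parameters whose defaults are taken from the rule.

```python
from typing import List


def placeholder_quality(seq_len: int, good_len: int = 700, good_q: int = 20) -> List[int]:
    """Return per-base qualities: good_q for the first good_len bases, then 0."""
    n_good = min(seq_len, good_len)
    return [good_q] * n_good + [0] * (seq_len - n_good)


if __name__ == "__main__":
    quals = placeholder_quality(702)
    # Bases 700 and 701 (1-based) straddle the cutoff.
    print(quals[699], quals[700])
```

Reads shorter than 700 bases simply get quality 20 over their whole length.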

Local Data

Files & Dirs

 /fs/szasmg3/bos_taurus/data/
 /fs/szasmg2/Drosophila/D_pseudoobscura/Vectors
 /nfshomes/dpuiu/db/UniVec

Software

Figaro

  • trims vector only at 5' end
  • calls Lucy for quality trimming

Lucy

  • both vector sequence and splice sites are required

Atlas

  • web site
  • atlas-screen-trim-file : "calls cross_match and atlas-screen-window to create trimmed reads file (scan in from each end of read looking for 50-base windows of high quality and no vector); "

Contaminant search

Align the CLIPPING range of the reads to UniVec & E. coli K12 using nucmer

UniVec

Ref

                 #seqs   min     max     mean    median  n50     sum
 UniVec          2861    12      48551   231     99      781     660,151
 UniVec_Core     1348    12      48551   243     98      967     327,641

Hits: alignment length

 bp      #reads  min     max     mean    median  n50     sum
 19      4548466 19      1045    28.37   23      27      129025025
 20      3684852 20      1045    30.56   25      28      112616359
 30      1097357 30      1045    48.04   38      43      52714583
 40      484661  40      1045    66.36   47      53      32163896
 100     54334   100     1045    198     116     223     10772815        # many are ESTs

Ecoli

Ref:

 K12 4,639,675 bp

Hits: alignment length

 bp      #reads  min     max     mean    median  n50     sum
 19      275109  19      1223    30.66   19      20      8435470
 20      102550  20      1223    50.29   21      161     5156849
 30      19032   30      1223    178     37      706     3381214
 40      9234    40      1223    329     171     738     3034293
 100     6781    100     1223    424     223     749     2876432
 200     4378    200     1223    575     696     771     2516916       

BCM vectors

                 #seqs   min     max     mean    median  n50     sum
 BCM             14      2580    33180   9379    5821    32705   131312
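
The columns of these summary tables (#seqs, min, max, mean, median, n50, sum) can be reproduced with a few lines of code. The sketch below is not getSummary.pl. It assumes the usual N50 definition (the largest length L such that elements of length >= L account for at least half of the total), which may differ in detail from what the script computes.

```python
from statistics import median


def n50(lengths):
    """Largest length L such that elements >= L cover at least half of the sum."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def summarize(lengths):
    """Return the summary columns for a non-empty list of lengths."""
    return {
        "n": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": sum(lengths) / len(lengths),
        "median": median(lengths),
        "n50": n50(lengths),
        "sum": sum(lengths),
    }


if __name__ == "__main__":
    # Toy lengths, not real data.
    print(summarize([2, 3, 4, 5, 6]))
```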

Vector/Splice site search

Strategy

  • 1. Select all the reads in the same volume that belong to one particular library; same CENTER_NAME, STRATEGY & TRACE_TYPE_CODE
  • 2. Get the quality clipping points: CLIP_QUALITY_LEFT & CLIP_QUALITY_RIGHT
  • 3. Separate reads in 2 sets according to direction TRACE_END: FORWARD & REVERSE
  • 4. Get the most frequent kmers in each set (24 & 8 bp)
  • 5. Check if the most frequent kmers are overrepresented
  • 6. Check if the most frequent 8mers are present in the most frequent 24mers
  • 7. Try to extend the 24mers by a few bp => linkers
  • 8. Align linkers to the opposite stand sequences using nucmer
  • 9. Extract the subsequences adjacent(following) to linker (50..150bp)
  • 10. Align the subsequences; if they align we've probably identified the vector
  • 11. Identify the vector name/id by alignment to UniVec => several alignments
  • 12. Check if the forward/reverse vector(s) are the same : we should find a common vector sequence; the UniVec alignments should be adjacent
  • 13. create the Lucy vector & splice files; the splice contains the linker+vector
  • 14. run lucy & trim input reads according to Lucy clr
  • 15. align lucy trimmed reads to linker,vector,splice & UniVec.dust
  • 16. align input reads to linker,vector,splice & UniVec.dust
  • 17. compare the 15. & 16. counts
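
Steps 4 and 6 can be sketched as follows. This is only an illustration of the idea, not the tool used to produce the kmers.tab files: it assumes kmers are counted on the read strand as given, skips kmers containing N, and reports each kmer with its reverse complement and count, mirroring the three-column kmers.tab layout. All function names are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_kmers(reads, k):
    """Count every k-long substring of every read (kmers with N are skipped)."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


def top_kmers(reads, k, n=10):
    """Step 4: the n most frequent kmers as (kmer, revcomp, count) rows."""
    return [(kmer, revcomp(kmer), c) for kmer, c in count_kmers(reads, k).most_common(n)]


def short_in_long(short_kmers, long_kmers):
    """Step 6: the short kmers that occur inside at least one long kmer."""
    return [s for s in short_kmers if any(s in long for long in long_kmers)]


if __name__ == "__main__":
    toy_reads = ["ACGTACGT", "ACGTTT"]  # toy reads, not real data
    for row in top_kmers(toy_reads, 4, n=3):
        print(*row, sep="\t")
```

Whether a frequent kmer counts as overrepresented (step 5) still needs a threshold relative to the number of reads, which the notes do not specify.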

Example

  • 1. volume 011 : 500,000 reads CENTER_NAME=BCM, TRACE_TYPE_CODE=WGS
  • 2.
  • 3. 249,611 TRACE_END=F & 250,389 TRACE_END=R
  • 4. kmers: 8 of the 10 most frequent 8bp kmers are shared by the FORWARD & REVERSE strands; no 24bp kmers are shared
 ==> 24.fwd/kmers.tab <==
 AGTTCGACTGCAAGTAGTTCATCA      TGATGAACTACTTGCAGTCGAACT        2463 # contains AGTAGTTC
 GAGTTCGACTGCAAGTAGTTCATC      GATGAACTACTTGCAGTCGAACTC        2189
 CGAGTTCGACTGCAAGTAGTTCAT      ATGAACTACTTGCAGTCGAACTCG        1996
 TCGAGTTCGACTGCAAGTAGTTCA      TGAACTACTTGCAGTCGAACTCGA        1593
 GTTCGACTGCAAGTAGTTCATCAA      TTGATGAACTACTTGCAGTCGAAC        1023
 GAGTTCGACTGCAGTAGTTCATCA      TGATGAACTACTGCAGTCGAACTC        812
 CGAGTTCGACTGCAGTAGTTCATC      GATGAACTACTGCAGTCGAACTCG        777
 GTTCGACTGCAAGTAGTTCATCAT      ATGATGAACTACTTGCAGTCGAAC        769
 TCGAGTTCGACTGCAGTAGTTCAT      ATGAACTACTGCAGTCGAACTCGA        637
 ATCGAGTTCGACTGCAAGTAGTTC      GAACTACTTGCAGTCGAACTCGAT        594
 
 ==> 08.fwd/kmers.tab <==
 AGTAGTTC      GAACTACT        86477
 CAGTAGTT      AACTACTG        67681
 AGTTCTCA      TGAGAACT        61556
 TAGTTCTC      GAGAACTA        60964
 GTAGTTCT      AGAACTAC        57866
 AGTTCATC      GATGAACT        49676
 TAGTTCAT      ATGAACTA        45298
 GTTCATCA      TGATGAAC        42117
 GCAGTAGT      ACTACTGC        41391
 GTAGTTCA      TGAACTAC        40694
 
 ==> 24.rev/kmers.tab <==
 TATCGATGGTACAGTAGTTCATCA      TGATGAACTACTGTACCATCGATA        999 # contains AGTAGTTC
 CTATCGATGGTACAGTAGTTCATC      GATGAACTACTGTACCATCGATAG        774
 GCTATCGATGGTACAGTAGTTCAT      ATGAACTACTGTACCATCGATAGC        600
 CGCTATCGATGGTACAGTAGTTCA      TGAACTACTGTACCATCGATAGCG        432
 ATCGATGGTACAGTAGTTCATCAT      ATGATGAACTACTGTACCATCGAT        417
 ATCGATGGTACAGTAGTTCATCAA      TTGATGAACTACTGTACCATCGAT        380
 ATCAGATGGTACAGTAGTTCATCA      TGATGAACTACTGTACCATCTGAT        373
 ATCGATGGTACAGTAGTTCATCAC      GTGATGAACTACTGTACCATCGAT        265
 CTATCGATGGTAAGTAGTTCATCA      TGATGAACTACTTACCATCGATAG        235
 TCAGATGGTACAGTAGTTCATCAA      TTGATGAACTACTGTACCATCTGA        224
 
 ==> 08.rev/kmers.tab <==
 AGTTCATC      GATGAACT        85127
 TAGTTCAT      ATGAACTA        77902
 GTTCATCA      TGATGAAC        75585
 TAGTTCTC      GAGAACTA        68057
 AGTTCTCA      TGAGAACT        67277
 GTAGTTCT      AGAACTAC        64894
 GTAGTTCA      TGAACTAC        62607
 CGTAGTTC      GAACTACG        52031
 AGTAGTTC      GAACTACT        51013
 ACGTAGTT      AACTACGT        31552
  • 7. Get linker sequences
 >linker.fwd 27bp
 TCGAGTTCGACTGCAAGTAGTTCATCA
 >linker.rev 27bp
 CTAATCAGATGGTACAGTAGTTCATCA 
  
 #>linker.rev 40 bp Art's  (13 more bp at 5')        
 #TATGACCATGCGCCTAATCAGATGGTACAGTAGTTCATCA
 #GCTATCGATGGTACAGTAGTTCATCAT is the most frequent 27bp sequence in the rev set, but it is not the linker (it differs by a few SNPs)
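
One way to extend the frequent 24mers into a longer candidate (step 7) is to chain kmers that overlap by almost their full length. The sketch below is an assumption about how this could be done, not the procedure actually used. Applied to the four most frequent forward 24mers listed above, it reproduces the 27bp linker.fwd. The function names and the 20bp minimum overlap are made up for the illustration.

```python
def overlap(left, right, min_overlap):
    """Longest suffix of `left` equal to a prefix of `right` (0 if below min_overlap)."""
    for n in range(min(len(left), len(right)) - 1, min_overlap - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0


def extend_seed(seed, kmers, min_overlap=20):
    """Greedily grow `seed` with kmers that overlap either of its ends."""
    contig = seed
    pending = [k for k in kmers if k != seed]
    extended = True
    while extended:
        extended = False
        for kmer in list(pending):
            if kmer in contig:          # already covered by the contig
                pending.remove(kmer)
                continue
            right = overlap(contig, kmer, min_overlap)
            left = overlap(kmer, contig, min_overlap)
            if right:
                contig += kmer[right:]
            elif left:
                contig = kmer[:len(kmer) - left] + contig
            else:
                continue
            pending.remove(kmer)
            extended = True
    return contig


# The four most frequent 24mers from 24.fwd/kmers.tab
fwd_24mers = [
    "AGTTCGACTGCAAGTAGTTCATCA",
    "GAGTTCGACTGCAAGTAGTTCATC",
    "CGAGTTCGACTGCAAGTAGTTCAT",
    "TCGAGTTCGACTGCAAGTAGTTCA",
]

if __name__ == "__main__":
    print(extend_seed(fwd_24mers[0], fwd_24mers))
```

A merged candidate still needs checking against the reads: as noted above for the rev set, the most frequent extended sequence is not necessarily the linker.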
  • 8 & 9 Align reads to linkers using nucmer

Fwd:

 nucmer -l 12 -c 24 -r linker.fwd.seq ../bos_taurus.$v.r.fasta 
 #  nucmer -l 12 -c 24 -r kmers.seq ../bos_taurus.$v.r.fasta  
 show-coords out.delta | awk '{print $19,$5,$13}' >! out.clr
 extractfromfastanames.pl -clr -f out.clr < ../bos_taurus.$v.r.fasta >! out.seq
 

Rev:

 nucmer -l 12 -c 24 -r linker.rev.seq ../bos_taurus.$v.f.fasta
 #  nucmer -l 12 -c 24 -r kmers.seq ../bos_taurus.$v.f.fasta  
 show-coords out.delta | awk '{print $19,$5,$13}' >! out.clr
 extractfromfastanames.pl -clr -f out.clr < ../bos_taurus.$v.f.fasta >! out.seq
 

Both:

 clrFasta out.seq >! out.cseq
 fasta2tab.pl out.cseq | sort -k2 >! out.tab
 nucmer -c 40 out.cseq ~/db/UniVec -p vector
 delta-filter -q vector.delta >! vector.filter-q.delta
 show-coords vector.filter-q.delta | sort -n | head
 cat vector.filter-q.delta | grep "^>" | count.pl -c 1 -m 2
 
  • 10. Extract "vector reads"
 >399553028  # 24.fwd     
 TGATGAACTACTGTACCATCTGATTAGGCGCATGGTCATAGCTGTTTCCTGTGTGAAATT
 GCTATCCGCTCACAATTCCACACAACATACGAGCCGGAAGCATAAAGTGTAAAGCCTGGG
 GTGTCAAATGAGAGACCTAACTCACATTCAACTTTTTTTTTTTTTCTGCCCTCTATTCTA
 ...
 >400269118 #24.rev
 TGATGAACTACTTGCAGTCGAAATCGAATCATCACTGGCCGTCCTTTTACAACGTCGTGA
 CTGGGAAAACCCTGGCGTTACCCAACTTAATCCGCCTTGCAGCACATCCCCCTTTCCCCC
 AGCTGGCGTAAAAACGTAAAAAGCCCCGCACCGATCGCCCTTTCCCAACAGGTTGCCCAG
  • 11. Align "vector reads" to UniVec
 show-coords 24.fwd/400269118-UniVec.delta 24.rev/399553028-UniVec.delta | grep J01636.1
     31  148  | 1175 1292  | 118   118  |  95.76  |     1276     7477  |     9.25     1.58  | 399553028.rev gnl|uv|J01636.1:1-7477
     32  199  | 1302 1463  | 168   162  |  90.48  |      653     7477  |    25.73     2.17  | 400269118     gnl|uv|J01636.1:1-7477
  • 12. 10bp distance between the 2 alignments
  • 13. Lucy files
 $ more vector.seq
   >J01636 E.coli lactose operon with lacI, lacZ, lacY and lacA genes
   GACACCATCGAATGGCGCAAAACCTTTCGCGGTATGGCATGATAGCGCCCGGAAGAGAGTCAATTCAGGG
   TGGTGAATGTGAAACCAGTAACGTTATACGATGTCGCAGAGTATGCCGGTGTCTCTTATCAGACCGTTTC
   CCGCGTGGTGAACCAGGCCAGCCACGTTTCTGCGAAAACGCGGGAAAAAGTGGAAGCGGCGATGGCGGAG
   CTGAATTACATTCCCAACCGCGTGGCACAACAACTGGCGGGCAAACAGTCGTTGCTGATTGGCGTTGCCA
   ...
 
 $ more splice.seq
   >J01636.for.begin vector+linker.rev
   TGAATGTGAGTTAGGTCTCTCATTTGACACCCCAGGCTTTACACTTTATGCTTCCGGCTC
   GTATGTTGTGTGGAATTGTGAGCGGATAGCAATTTCACACAGGAAACAGCTATGACCATG
   CGCCTAATCAGATGGTACAGTAGTTCATCA
   >J01636.for.end  rev(linker.fwd)+vector 
   TGATGAACTACTTGCAGTCGAAATCGAATCATCACTGGCCGTCCTTTTACAACGTCGTGA
   CTGGGAAAACCCTGGCGTTACCCAACTTAATCCGCCTTGCAGCACATCCCCCTTTCCCCC
   AGCTGGCGTAAAAACGTAAAAAGCCCCGCA
   >J01636.rev.begin (revcomp of J01636.for.end)
   TGCGGGGCTTTTTACGTTTTTACGCCAGCTGGGGGAAAGGGGGATGTGCTGCAAGGCGGA
   TTAAGTTGGGTAACGCCAGGGTTTTCCCAGTCACGACGTTGTAAAAGGACGGCCAGTGAT
   GATTCGATTTCGACTGCAAGTAGTTCATCA
   >J01636.rev.end (revcomp of J01636.for.begin)
   TGATGAACTACTGTACCATCTGATTAGGCGCATGGTCATAGCTGTTTCCTGTGTGAAATT
   GCTATCCGCTCACAATTCCACACAACATACGAGCCGGAAGCATAAAGTGTAAAGCCTGGG
   GTGTCAAATGAGAGACCTAACTCACATTCA
 # splice=linker+vector  
      3      120  |     1175     1292  |      118      118  |    95.76  |      150     7477  |    78.67     1.58  | J01636.for.begin   J01636
     32      131  |     1302     1399  |      100       98  |    96.00  |      150     7477  |    66.67     1.31  | J01636.for.end     J01636
  • 14. Run lucy & trim reads
 $ /nfshomes/dpuiu/szdevel/SourceForge/lucy-1.19p/lucy \
     -v vector.seq splice.seq \
     -o bos_taurus.lucy.seq bos_taurus.lucy.qual \
     -debug bos_taurus.lucy.info \
     bos_taurus.seq bos_taurus.qual
 # Trim clr
 $ clrFasta bos_taurus.lucy.seq > bos_taurus.lucy.cseq
  • 15. Align lucy output to linker, vector, splice & UniVec.dust
 $ nucmer -l 12 -c 24 ~/db/linker.seq  bos_taurus.lucy.cseq -p linker-bos_taurus.lucy
 $ nucmer -l 16 -c 30 ~/db/vector.seq  bos_taurus.lucy.cseq -p vector-bos_taurus.lucy
 $ nucmer -l 16 -c 30 ~/db/splice.seq  bos_taurus.lucy.cseq -p splice-bos_taurus.lucy
 $ nucmer -l 16 -c 30 ~/db/UniVec.dust bos_taurus.lucy.cseq -p UniVec.dust-bos_taurus.lucy
  • 16. Align input to linker, vector, splice & UniVec.dust
 $ nucmer -l 12 -c 24 ~/db/linker.seq bos_taurus.seq -p linker-bos_taurus
 $ nucmer -l 16 -c 30 ~/db/vector.seq bos_taurus.seq -p vector-bos_taurus
 $ nucmer -l 16 -c 30 ~/db/splice.seq bos_taurus.seq -p splice-bos_taurus
 $ nucmer -l 16 -c 30 ~/db/UniVec.dust bos_taurus.seq -p UniVec.dust-bos_taurus

Count how many reads got trimmed

 infoseq *seq | getSummary.pl -c 1 -t original.LEN
 
 cat bos_taurus.lucy.info | awk '{print $4-$3}' | getSummary.pl -t lucy.CLR >! bos_taurus.lucy.summary  
 cat bos_taurus.lucy.info | getSummary.pl -c 14 -t lucy.CLV5 -nh >> bos_taurus.lucy.summary
 cat bos_taurus.lucy.info | getSummary.pl -c 15 -t lucy.CLV3 -nh >> bos_taurus.lucy.summary

Libraries

011.BCM.WGS FORWARD

  • vector: J01636
  • UniVec: gnl|uv|J01636.1:1-7477 E.coli lactose operon with lacI, lacZ, lacY and lacA genes
 ll ~dpuiu/db/J01636*
 -rw-rw-r--  1 dpuiu dpuiu 7651 Jan  9 15:56 /nfshomes/dpuiu/db/J01636
 -rw-rw-r--  1 dpuiu dpuiu  105 Jan 14 07:17 /nfshomes/dpuiu/db/J01636linker
 -rw-rw-r--  1 dpuiu dpuiu  840 Jan 13 13:43 /nfshomes/dpuiu/db/J01636splice
 cat  ~dpuiu/db/J01636* | infoseq
 J01636            7477   53.43
 J01636.linker.fwd 27     44.44
 J01636.linker.rev 27     37.04
 J01636.for.begin  150    44.67
 J01636.for.end    150    51.33
 J01636.rev.begin  150    51.33
 J01636.rev.end    150    44.67
  • 249,611 reads:
  • 91% got vector trimmed at the 5'
  • 0.46% (1149) got vector trimmed at the 3'
                 #elem   #0s     min     max     mean    median  n50     sum
 original.LEN    249611  0       437     2349    1082    991     1009    270035781     
 lucy.CLV5       249611  21215   0       741     25.03   25      27      6247415
 lucy.CLV3       249611  248462  0       1047    3.49    0       859     870344
  • Original reads hit counts:
 linker.fwd      10975
 linker.rev      133
 splice          166
 vector          152
 UniVec.dust     228
  • Lucy trimmed reads hit counts:
 linker.fwd      2
 linker.rev      0
 splice          1
 vector          1
 UniVec.dust     6 (only 3 are >40bp)
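
The trimmed percentages quoted for this library follow from the #elem and #0s columns of the lucy.CLV5 and lucy.CLV3 rows: a read counts as vector trimmed at an end when its value there is non-zero. A minimal sketch of that arithmetic (the function name is made up):

```python
def trimmed_fraction(n_elem, n_zero):
    """Fraction of reads with a non-zero vector clip at one end."""
    return (n_elem - n_zero) / n_elem


if __name__ == "__main__":
    # #elem and #0s copied from the lucy.CLV5 / lucy.CLV3 rows above
    print("5' trimmed: %.1f%%" % (100 * trimmed_fraction(249611, 21215)))
    print("3' trimmed: %.2f%%" % (100 * trimmed_fraction(249611, 248462)))
```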

011.BCM.WGS REVERSE

                 #elem   #0s     min     max     mean    median  n50     sum
 original.LEN    250389  0       502     2148    1085    993     1012    271691094
 lucy.CLR        250389  7345    0       1281    795     876     892     198982171
 lucy.CLV5       250389  20271   0       668     26.52   27      29      6641362
 lucy.CLV3       250389  249269  0       997     3.35    0       861     839029
  • Original reads hit counts:
 linker.fwd      113
 linker.rev      3812
 splice          143
 UniVec.dust     237
 vector          4318
  • Lucy trimmed reads hit counts:
 linker.fwd      1
 linker.rev      0
 splice          1
 UniVec.dust     10
 vector          1

030.BCM.SHOTGUN

  • same linker/vector/splice as BCM.WGS
  • 2.5% (4K out of 160K) reads contain linker & vector at 3'
                 #elem   #0s     min     max     mean    median  n50     sum
 original.LEN    8411    0       325     1685    1181    1240    1314    9933150
 lucy.CLR        8411    8       0       1054    841     863     874     7070994
 lucy.CLV5       8411    568     0       232     27.01   28      29      227206
 lucy.CLV3       8411    2325    0       1040    597     794     851     5023445
  • Original reads hit counts:
 linker.fwd      4314
 linker.rev      4125
 splice          7816
 UniVec.dust     4212
 vector          6750
 vector          27235
  • Lucy trimmed reads hit counts:
 linker.fwd      3
 linker.rev      1
 splice          1
 UniVec.dust     13
 vector          0

001.NISC.SHOTGUN

  • Vector: pOTW13
  • UniVec: 3 partial seqs
 gnl|uv|NGB00080.1:1-198 pOTW13 with linkers
 gnl|uv|NGB00080.1:718-888 pOTW13 with linkers
 gnl|uv|NGB00080.1:1490-1654-49 pOTW13 with linkers
 ll /nfshomes/dpuiu/db/NGB00080*
 -rw-rw-r--  1 dpuiu dpuiu 1083 Jan 14 20:43 /nfshomes/dpuiu/db/NGB00080
 -rw-r--r--  1 dpuiu dpuiu   94 Jan 14 21:01 /nfshomes/dpuiu/db/NGB00080linker
 -rw-r--r--  1 dpuiu dpuiu 2183 Jan 14 20:44 /nfshomes/dpuiu/db/NGB00080splice
 cat  /nfshomes/dpuiu/db/NGB00080* | infoseq
 NGB00080       1054   50.00
 NGB00080.linker.fwd 24     45.83
 NGB00080.linker.rev 26     53.85
 NGB00080.for.beg 518    46.14
 NGB00080.for.end 518    50.48
 NGB00080.rev.begin 518    50.48
 NGB00080.rev.beg 518    46.14
  • 944 read sample
                 #elem   #0s     min     max     mean    median  n50     sum
 original.LEN    944     0       652     1017    735     721     722     693668
 lucy.CLR        944     39      0       886     415     422     522     391333
 lucy.CLV5       944     121     0       275     34.05   33      35      32143
 lucy.CLV3       944     18      0       885     410     409     511     387007
  • Original reads hit counts:
 linker.fwd      479
 linker.rev      492
 splice          910
 UniVec.dust     0
 vector          939
  • Lucy trimmed reads hit counts:
 linker.fwd      1
 linker.rev      0
 splice          0
 UniVec.dust     9
 vector          1

060.BCCAGSC.CLONEEND

  • Linkers:
 linker.fwd CCCTGCTTTGTCTGGAAGGGGTTCCCGACCT
 linker.rev CAGGAGGGGAGAAAGGGCTCAGAGG
  • No common vector !!!
 wc -l *clb
   60746 bos_taurus.060.f.clb  #18 reads original align to UniVec (nucmer default params)
   60836 bos_taurus.060.r.clb
 
 Fwd:
    329      428  |      440      535  |      100       96  |    91.00  |      503     1585  |    19.88     6.06  | 723951410  gnl|uv|U30497.1:3230-4814 Cloning vector pAS2-1
    330      370  |       89       49  |       41       41  |   100.00  |      503      143  |     8.15    28.67  | 723951410  gnl|uv|U67875.1:6541-6683 pESP-I yeast expression vector
    330      370  |       94       54  |       41       41  |   100.00  |      503      143  |     8.15    28.67  | 723951410  gnl|uv|U67875.1:6541-6683 pESP-I yeast expression vector
 
  Rev:
      1       96  |       71      165  |       96       95  |    93.81  |      203      165  |    47.29    57.58  | 724018013  gnl|uv|AF133437.1:16659-16823 Cloning vector pCYPAC6
     50      143  |        1       94  |       94       94  |    92.71  |      203       94  |    46.31   100.00  | 724018013  gnl|uv|U80929.2:2858-2951     Cloning vector pBACe3.6

017.UIUC.CLONEEND

  • No overrepresented kmers
 wc -l *clb
  17978 bos_taurus.017.f.clb
  17911 bos_taurus.017.r.clb
 
 ==> 24.fwd/kmers.tab <==
 CCCTGCTTTGTCTGGAAGGGGTTC        GAACCCCTTCCAGACAAAGCAGGG        9
 CTGCTTTGTCTGGAAGGGGTTCCC        GGGAACCCCTTCCAGACAAAGCAG        9
 
 ==> 24.rev/kmers.tab <==
 GAATGTTGAGCTTTAGCCAACTTT        AAAGTTGGCTAAAGCTCAACATTC        4
 TCTGAATGTTGAGCTTTAGCCAAC        GTTGGCTAAAGCTCAACATTCAGA        4
 
 ==> 8.fwd/kmers.tab <==
 TTTTTTTT        AAAAAAAA        55
 AAGGGGTT        AACCCCTT        35
 
 ==> 8.rev/kmers.tab <==
 GTCTGGAA        TTCCAGAC        41
 TCTGGAAG        CTTCCAGA        39
  • No UniVec hits

010.TIGR.CLONEEND

  • No overrepresented kmers
 wc -l *clb
 5479 bos_taurus.032.f.clb
 5174 bos_taurus.032.r.clb
 
 ==> 24.fwd/kmers.tab <==
 CTTGTGTTGGCCCAGGCAAGTCCA        TGGACTTGCCTGGGCCAACACAAG        30
 TTGTGTTGGCCCAGGCAAGTCCAA        TTGGACTTGCCTGGGCCAACACAA        30
 
 ==> 24.rev/kmers.tab <==
 CTGCCTCTTGTGTTGGCCCAGGCA        TGCCTGGGCCAACACAAGAGGCAG        16
 GCTGCCTCTTGTGTTGGCCCAGGC        GCCTGGGCCAACACAAGAGGCAGC        15
 
 ==> 8.fwd/kmers.tab <==
 GAGTGGGT        ACCCACTC        176
 GGAGTGGG        CCCACTCC        171
 
 ==> 8.rev/kmers.tab <==
 TGGAGTGG        CCACTCCA        182
 GGAGTGGG        CCCACTCC        181
 
  • No UniVec hits

...

070.BCM.CLONEEND

  • No frequent kmers
 wc -l *clb
   6027 bos_taurus.070.f.clb
   6236 bos_taurus.070.r.clb
 
 ==> 24.fwd/kmers.tab <==
 GGACTCTCAGAGTCTTCTCCAACA        TGTTGGAGAAGACTCTGAGAGTCC        18
 ACTGGTTGGATCTCCTTGCAGTCC        GGACTGCAAGGAGATCCAACCAGT        18

 ==> 24.rev/kmers.tab <==
 ATAAAATCTGAGCCACCAGGGAAG        CTTCCCTGGTGGCTCAGATTTTAT        1
 CTATTGGTTCATATGGTCAACGTC        GACGTTGACCATATGAACCAATAG        1
 
 ==> 8.fwd/kmers.tab <==
 TTTTTTTT        AAAAAAAA        86
 CTTCTCCA        TGGAGAAG        75
 
 ==> 8.rev/kmers.tab <==
 TATAGTGT        ACACTATA        9
 ATATAGGG        CCCTATAT        8
  • No alignments to BCM WGS vector

Running Lucy

  • Default parameters with vector trimming
  • BCM vector/splice
 /nfshomes/dpuiu/db/vector.BCM.seq
 /nfshomes/dpuiu/db/splice.BCM.seq
  • NISC vector/splice
 /nfshomes/dpuiu/db/vector.NISC.seq
 /nfshomes/dpuiu/db/splice.NISC.seq

BCM.WGS (all reads)

  • orig.CLR < lucy.CLR ( 765 < 792 )
  • orig.CLV > lucy.CLV ( 1015 > 973 )
  • 739,529 out of 24,863,599 reads (3%) deleted by Lucy (CLR=-1,-1)
  • 21,728,592 out of 24,863,599 reads (87%) vector trimmed at the 5' end
  • 92,646 out of 24,863,599 reads (0.4%) vector trimmed at the 3' end
                           elem       <0         0          >0         min        max        mean       median     n50        sum 
 orig.LEN                  24863599   0          0          24863599   5          3097       1002       997        1015       24915462033
 
 orig.CLR                  24863599   463669     7          24399923   -1143      1833       765        836        864        19036744256
 orig.CLR5                 24863599   0          359245     24504354   0          2103       42         22         58         1047922451
 orig.CLR3                 24863599   463404     0          24400195   -1         2169       807        872        895        20084666707
 
 lucy.CLR                  24863599   0          739529     24124070   0          1219       792        878        904        19695000417
 lucy.CLR5                 24863599   739529     36108      24087962   -1         1753       43         29         42         1086413880
 lucy.CLR3                 24863599   739529     0          24124070   -1         1894       835        915        939        20781414297
 
 orig.CLR5-lucy.CLR5       24863599   16299521   215345     8348733    -1186      2104       -1         -10        -1186      -38491429
 orig.CLR3-lucy.CLR3       24863599   14858542   1494794    8510263    -1273      2170       -28        -20        -1273      -696747590
 
 orig.CLV                  24863599   1053       1920       24860626   -2         5345       1015       1002       1017       25260581538
 orig.CLV5                 8841849    0          0          8841849    1          1219       33         46         49         295011460
 orig.CLV3                 24861698   1053       0          24860645   -1         5346       1027       1005       1019       25555592998
 
 lucy.CLV                  24863599   10694      707        24852198   -469       3096       973        968        987        24195085877
 lucy.CLV5                 24863599   0          3135007    21728592   0          1359       25         27         29         623457486
 lucy.CLV3                 24863599   0          0          24863599   4          3096       998        995        1014       24818543363
 lucy.CLVABS5              24863599   0          3135007    21728592   0          1359       25         27         29         623457486
 lucy.CLVABS3              24863599   0          24770953   92646      0          1343       2          0          880        72055071
 
 orig.CLV5-lucy.CLV5       24863599   17216820   1512453    6134326    -1312      1219       -13        -25        -1312      -328446026
 orig.CLV3-lucy.CLV3       24863599   1519132    18579609   4764858    -1832      4672       29         0          479        737049635
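
The summary rows in this table and the ones that follow share ten columns (elem, <0, 0, >0, min, max, mean, median, n50, sum). The tool that produced them is not named on this page, so the Python sketch below is only one way such a row could be computed. The function name `summarize` is hypothetical, and the n50 rule (ascending running sum reaching half of the total) is an assumption. Truncating the mean toward zero is consistent with the sums and counts listed (for example 19036744256 / 24863599 is about 765.6 and the table shows 765).

```python
from statistics import median


def summarize(values):
    """Return the ten summary columns for a non-empty list of integers."""
    n = len(values)
    total = sum(values)

    # n50 (assumed definition): walk the values in ascending order and report
    # the first value at which twice the running sum reaches the total.
    ordered = sorted(values)
    n50 = ordered[-1]  # fallback if the threshold is never reached
    running = 0
    for v in ordered:
        running += v
        if 2 * running >= total:
            n50 = v
            break

    return {
        "elem": n,
        "<0": sum(1 for v in values if v < 0),
        "0": sum(1 for v in values if v == 0),
        ">0": sum(1 for v in values if v > 0),
        "min": ordered[0],
        "max": ordered[-1],
        "mean": int(total / n),          # truncated toward zero
        "median": int(median(values)),   # truncated toward zero
        "n50": n50,
        "sum": total,
    }
```

For example, `summarize([0, 5, 10, 15])` gives mean 7, median 7 and n50 10. Under this assumed rule a column with a negative sum can report its minimum as n50, which is what the difference rows such as orig.CLR5-lucy.CLR5 show.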

BCM.WGS (0 quality reads)

  • orig.CLR > lucy.CLR (mean)
  • orig.CLV > lucy.CLV (mean)
  • 7,153 out of 551,114 reads (1.3%) deleted by Lucy (CLR=-1,-1)
  • 508,166 out of 551,114 reads (92%) vector trimmed at the 5' end
  • 1,946 out of 551,114 reads (0.35%) vector trimmed at the 3' end
                           elem       <0         0          >0         min        max        mean       median     n50        sum
 orig.LEN                  551114     0          0          551114     5          1464       872        946        959        480705828
 
 orig.CLR                  551114     7754       0          543360     -770       1175       708        786        807        390325117
 orig.CLR5                 551114     0          6773       544341     0          1519       44         20         111        24582849
 orig.CLR3                 551114     7744       0          543370     -1         1638       752        818        833        414907966
 
 lucy.CLR                  551114     0          7153       543961     0          699        636        671        671        350759771
 lucy.CLR5                 551114     7153       35872      508089     -1         201        26         27         28         14442310
 lucy.CLR3                 551114     7153       0          543961     -1         699        662        699        699        365202081
 
 orig.CLR5-lucy.CLR5       551114     364282     8801       178031     -198       1500       18         -8         215        10140539
 orig.CLR3-lucy.CLR3       551114     85058      2962       463094     -700       1472       90         123        178        49705885
 
 orig.CLV                  551114     971        0          550143     -2         2037       974        978        981        537127121
 orig.CLV5                 5100       0          0          5100       1          845        35         29         31         180490
 orig.CLV3                 551114     971        0          550143     -1         2037       974        978        981        537307611
 
 lucy.CLV                  551114     58         6          551050     -84        1456       841        917        930        463903233
 lucy.CLV5                 551114     0          42948      508166     0          202        27         28         29         14964546
 lucy.CLV3                 551114     0          0          551114     4          1463       868        945        958        478867779
 lucy.CLVABS5              551114     0          42948      508166     0          202        27         28         29         14964546
 lucy.CLVABS3              551114     0          549168     1946       0          700        2          0          686        1286935
 
 orig.CLV5-lucy.CLV5       551114     506108     42215      2791       -202       845        -26        -28        -202       -14784056
 orig.CLV3-lucy.CLV3       551114     134959     23422      392733     -967       1614       106        7          459        58439832

BCM.SHOTGUN

  • orig.CLR < lucy.CLR (mean)
  • orig.CLV > lucy.CLV (mean)
  • 98,070 out of 10,748,529 reads (0.9%) deleted by Lucy (CLR=-1,-1)
  • 9,737,008 out of 10,748,529 reads (91%) vector trimmed at the 5' end
  • 294,942 out of 10,748,529 reads (2.7%) vector trimmed at the 3' end
                           elem       <0         0          >0         min        max        mean       median     n50        sum
 orig.LEN                  10748529   0          0          10748529   5          2043       975        950        964        10486690472
 
 orig.CLR                  10748529   17308      2          10731219   -1293      1467       809        833        847        8701344571
 orig.CLR5                 10748529   0          68         10748461   0          1315       26         16         38         288662580
 orig.CLR3                 10748529   16780      0          10731749   -1         1647       836        851        863        8990007151
 
 lucy.CLR                  10748529   0          98070      10650459   0          1337       833        854        868        8955866769
 lucy.CLR5                 10748529   98070      1973       10648486   -1         1307       35         28         32         376276188
 lucy.CLR3                 10748529   98070      0          10650459   -1         1553       868        882        896        9332142957
 
 orig.CLR5-lucy.CLR5       10748529   9498290    65171      1185068    -1099      1293       -8         -11        -1099      -87613608
 orig.CLR3-lucy.CLR3       10748529   6879532    671097     3197900    -1149      1437       -31        -26        -1149      -342135806
 
 orig.CLV                  10748529   16779      412        10731338   -2         3919       974        948        964        10472347908
 orig.CLV5                 8594910    0          0          8594910    1          1239       3          1          49         28350257
 orig.CLV3                 10748349   16779      0          10731570   -1         3919       976        950        965        10500698165
 
 lucy.CLV                  10748529   7026       614        10740889   -268       2042       930        924        940        9997862132
 lucy.CLV5                 10748529   0          1011521    9737008    0          855        24         24         27         257993796
 lucy.CLV3                 10748529   0          0          10748529   4          2042       954        945        962        10255855928
 lucy.CLVABS5              10748529   0          1011521    9737008    0          855        24         24         27         257993796
 lucy.CLVABS3              10748529   0          10453587   294942     0          1214       20         0          847        220086015
 
 orig.CLV5-lucy.CLV5       10748529   9538738    138680     1071111    -854       1239       -21        -23        -854       -229643539
 orig.CLV3-lucy.CLV3       10748529   357934     9324166    1066429    -1328      2846       22         0          704        244842237

NISC.SHOTGUN

  • orig.CLR < lucy.CLR (mean)
  • orig.CLV > lucy.CLV (mean)
  • 8,248 out of 737,900 reads (1.1%) deleted by Lucy (CLR=-1,-1)
  • 633,409 out of 737,900 reads (86%) vector trimmed at the 5' end
  • 7,201 out of 737,900 reads (0.98%) vector trimmed at the 3' end
                           elem       <0         0          >0         min        max        mean       median     n50        sum
 orig.LEN                  737900     0          0          737900     104        2104       784        729        734        579172842
 
 orig.CLR                  737900     5988       2          731910     -636       1033       651        668        676        480400909
 orig.CLR5                 737900     0          0          737900     1          1407       47         40         51         34857531
 orig.CLR3                 737900     0          5879       732021     0          1470       698        710        715        515258440
 
 lucy.CLR                  737900     0          8248       729652     0          1035       658        670        676        485757685
 lucy.CLR5                 737900     8248       56         729596     -1         1091       45         35         46         33811606
 lucy.CLR3                 737900     8248       0          729652     -1         1391       704        710        714        519569291
 
 orig.CLR5-lucy.CLR5       737900     253727     89345      394828     -566       1408       1          1          485        1045925
 orig.CLR3-lucy.CLR3       737900     177007     31         560862     -867       1471       -5         1          -867       -4310851
 
 orig.CLV                  737900     3224       2655       732021     -636       2103       771        725        730        569178445
 orig.CLV5                 734026     0          0          734026     1          987        5          1          35         4375315
 orig.CLV3                 732021     0          0          732021     35         2104       783        729        734        573553760
 
 lucy.CLV                  737900     1335       55         736510     -200       2104       747        696        702        551392388
 lucy.CLV5                 737900     104491     0          633409     -1         1199       30         31         34         22784742
 lucy.CLV3                 737900     0          0          737900     15         2103       778        728        733        574177130
 lucy.CLVABS5              737900     0          104491     633409     0          1200       31         32         35         23522642
 lucy.CLVABS3              737900     0          730699     7201       0          1076       5          0          686        4257812
 
 orig.CLV5-lucy.CLV5       737900     561851     66390      109659     -1198      983        -24        -29        -1198      -18409427
 orig.CLV3-lucy.CLV3       737900     8386       1          729513     -950       1077       0          1          -950       -623370
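
The bullet percentages for the four read sets follow from three table cells each: the `0` count of lucy.CLR matches the number of reads deleted by Lucy, and the `>0` counts of lucy.CLVABS5 and lucy.CLVABS3 match the reads vector trimmed at the 5' and 3' ends. A minimal Python sketch of that arithmetic, with the counts copied from the tables above; `ROWS` and `pct` are illustrative names only.

```python
# (read set, total reads, lucy.CLR == 0, lucy.CLVABS5 > 0, lucy.CLVABS3 > 0)
ROWS = [
    ("BCM.WGS (all reads)",       24863599, 739529, 21728592,  92646),
    ("BCM.WGS (0 quality reads)",   551114,   7153,   508166,   1946),
    ("BCM.SHOTGUN",               10748529,  98070,  9737008, 294942),
    ("NISC.SHOTGUN",                737900,   8248,   633409,   7201),
]


def pct(part, whole):
    """Percentage of `whole` represented by `part`."""
    return 100.0 * part / whole


for name, total, deleted, trim5, trim3 in ROWS:
    print(
        f"{name}: deleted {pct(deleted, total):.2f}%, "
        f"5' trimmed {pct(trim5, total):.2f}%, "
        f"3' trimmed {pct(trim3, total):.2f}%"
    )
```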

Fragment files

  • Location: /fs/szasmg3/bos_taurus/data/frg
  • All DST messages are unique
  • bos_taurus.clv : contains the vector clipping points
    • BCM.WGS, BCM.SHOTGUN & NISC.SHOTGUN: lucy.clv
    • others: the TA clv
    • 374,454 reads don't have valid clv's
    • 36,446,031 reads have valid clv's with avg=955

Message counts (original)

                                                DST     FRG             LKG
 bos_taurus.BCM.WGS.frg                         79      24124070        11311841
 bos_taurus.BCM.SHOTGUN.frg                     7339    10650459        1799069
 bos_taurus.NISC.SHOTGUN.frg                    246     729652          344932
 bos_taurus.BCCAGSC.CLONEEND.frg                1       125241          59505
 bos_taurus.UIUC.CLONEEND.frg                   2       114750          46319
 bos_taurus.TIGR.CLONEEND.frg                   1       65171           27067
 bos_taurus.GSC.CLONEEND.frg                    1       53521           25889
 bos_taurus.CENARGEN.WGS.frg                    0       26246           0
 bos_taurus.BARC.CLONEEND.frg                   11150   25454           11150
 bos_taurus.BCM.CLONEEND.frg                    1       16875           7103
 bos_taurus.CENARGEN.CLONEEND.frg               1       16787           6269
 bos_taurus.UOKNOR.SHOTGUN.frg                  1       14651           4910
 bos_taurus.TIGR_JCVIJTC.CLONEEND.frg           2       10651           4803
 bos_taurus.UOKNOR.FINISHING.frg                0       151             0
 bos_taurus.WUGSC.COLONEEND.frg                 1       49              21
 total                                          18825   35973728        13648878

Message counts (quality)

                                                DST     FRG             LKG
 bos_taurus.BCM.WGS.qual.count                  79      23580109        11035582
 bos_taurus.BCM.SHOTGUN.qual.count              7339    10644092        1799069
 bos_taurus.NISC.SHOTGUN.count                  246     729652          344932
 bos_taurus.BCCAGSC.CLONEEND.qual.count         1       116484          53585
 bos_taurus.UIUC.CLONEEND.count                 2       114750          46319
 bos_taurus.TIGR.CLONEEND.count                 1       65171           27067
 bos_taurus.CENARGEN.WGS.count                  0       26246           0
 bos_taurus.BARC.CLONEEND.count                 11150   25454           11150
 bos_taurus.BCM.CLONEEND.count                  1       16875           7103
 bos_taurus.CENARGEN.CLONEEND.count             1       16787           6269
 bos_taurus.TIGR_JCVIJTC.CLONEEND.count         2       10651           4803
 bos_taurus.UOKNOR.SHOTGUN.qual.count           1       2456            813
 bos_taurus.WUGSC.COLONEEND.count               1       49              21
 total                                          18824   35348776        13336713 

Message counts (0quality)

                                                DST     FRG             LKG
 bos_taurus.BCM.WGS.0qual.count                 79      543961          234397
 bos_taurus.GSC.CLONEEND.0qual.count            1       53521           25889
 bos_taurus.UOKNOR.SHOTGUN.0qual.count          1       12195           4097
 bos_taurus.BCCAGSC.CLONEEND.0qual.count        1       8757            2114
 bos_taurus.BCM.SHOTGUN.0qual.count             7339    6367            0
 bos_taurus.UOKNOR.FINISHING.0qual.count        0       151             0
 total                                          7421    624952          266497
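
The quality and 0quality tables appear to partition the fragments of the original table. The Python sketch below checks that reading on the FRG column: for every library, the quality and 0quality counts should add up to the original count. The numbers are copied from the three tables above. The shortened library keys (without the `bos_taurus.` prefix and file suffix) and the helper name `check_split` are assumptions made for illustration.

```python
# FRG counts per library, copied from the three "Message counts" tables.
ORIGINAL = {
    "BCM.WGS": 24124070, "BCM.SHOTGUN": 10650459, "NISC.SHOTGUN": 729652,
    "BCCAGSC.CLONEEND": 125241, "UIUC.CLONEEND": 114750,
    "TIGR.CLONEEND": 65171, "GSC.CLONEEND": 53521, "CENARGEN.WGS": 26246,
    "BARC.CLONEEND": 25454, "BCM.CLONEEND": 16875,
    "CENARGEN.CLONEEND": 16787, "UOKNOR.SHOTGUN": 14651,
    "TIGR_JCVIJTC.CLONEEND": 10651, "UOKNOR.FINISHING": 151,
    "WUGSC.COLONEEND": 49,
}
QUALITY = {
    "BCM.WGS": 23580109, "BCM.SHOTGUN": 10644092, "NISC.SHOTGUN": 729652,
    "BCCAGSC.CLONEEND": 116484, "UIUC.CLONEEND": 114750,
    "TIGR.CLONEEND": 65171, "CENARGEN.WGS": 26246, "BARC.CLONEEND": 25454,
    "BCM.CLONEEND": 16875, "CENARGEN.CLONEEND": 16787,
    "TIGR_JCVIJTC.CLONEEND": 10651, "UOKNOR.SHOTGUN": 2456,
    "WUGSC.COLONEEND": 49,
}
ZERO_QUALITY = {
    "BCM.WGS": 543961, "GSC.CLONEEND": 53521, "UOKNOR.SHOTGUN": 12195,
    "BCCAGSC.CLONEEND": 8757, "BCM.SHOTGUN": 6367, "UOKNOR.FINISHING": 151,
}


def check_split(original, quality, zero_quality):
    """Return the libraries whose quality + 0quality count differs from the original."""
    return [
        lib
        for lib, count in original.items()
        if quality.get(lib, 0) + zero_quality.get(lib, 0) != count
    ]


mismatches = check_split(ORIGINAL, QUALITY, ZERO_QUALITY)
print("mismatching libraries:", mismatches)
print("FRG totals:", sum(ORIGINAL.values()),
      sum(QUALITY.values()), sum(ZERO_QUALITY.values()))
```

A library missing from one of the split tables is treated as contributing 0 fragments there, which is why `dict.get(lib, 0)` is used rather than direct indexing.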