Bos taurus redo

BCM

NCBI Data

  • Avg LEN=984
  • Avg CLIP (CLB intersect CLV)=760
  • Avg CLV=997 > Avg LEN=984 (unexplained)
  • Avg QUAL=38.96 (27.51 for the 2.59M reads not in the UMD assembly)
  • Avg UMDoverlapper CLIP=778

Problems:

  • 650,133 reads have 0 quality (QUAL=0)
  • the quality lines in several .qual files start with a space; it has to be removed, otherwise tarchive2ca errors out saying that len(quality)=len(seq)+1
  • several XML files contained an unescaped "&" character => XML parser error
  • xml.bos_taurus.087 contained 2 trace_volumes => XML parser error
  • BCCAGSC.CLONEEND : all reads have LIBRARY_ID=CH240, SEQ_LIB_ID=. ; the INSERT_SIZE & INSERT_STDEV vary within the library: set to 150,000 & 30,000
  • UIUC.CLONEEND: INSERT_SIZE & INSERT_STDEV missing: set to 150,000 & 30,000
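
The first two clean-up problems above can be handled with a small filter. The following is a minimal sketch, not the script that was actually used: the function names are hypothetical, and it assumes that the only damage is leading blanks on quality lines and bare "&" characters in the XML.

```python
import re

# Matches an "&" that does not already start an XML entity
# such as "&amp;", "&#38;" or "&#x26;".
BARE_AMPERSAND = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def strip_leading_space(qual_line: str) -> str:
    """Remove leading blanks from one line of a .qual file."""
    return qual_line.lstrip(" \t")


def escape_bare_ampersands(xml_text: str) -> str:
    """Replace unescaped '&' with '&amp;'; existing entities are left alone."""
    return BARE_AMPERSAND.sub("&amp;", xml_text)


if __name__ == "__main__":
    print(strip_leading_space(" 20 20 35\n"), end="")
    print(escape_bare_ampersands("<LIB>A&B &amp; C</LIB>"))
```

The volume with 2 trace_volumes elements still needs to be split by hand; this sketch does not attempt that.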

CENTER_NAME counts

    COUNT           CENTER_NAME     
 1  35629020        BCM             Baylor College of Medicine
 2  737900          NISC            NIH Intramural Sequencing Center
 3  652614          BCCAGSC         British Columbia Cancer Agency Genome Sciences Centre                           # TA query_tracedb CENTER_NAME = "BCCAGSC" => 652,510 
 4  378871          MARC            USDA, ARS, US Meat Animal Research Center
 5  114753          UIUC            University of Illinois at Urbana-Champaign                                      # TA query_tracedb CENTER_NAME = "UIUC" => 106,368
 6  107367          BARC            USDA, ARS, Beltsville Agricultural Research Center
 7  65171           TIGR            The Institute for Genomic Research
 8  53556           GSC             Genoscope
 9  43033           CENARGEN        Embrapa Genetic Resources and Biotechnology
 10 18623           SC              The Sanger Center
 11 15301           UOKNOR          University of Oklahoma Norman Campus, Advanced Center for Genome Technology
 12 10651           TIGR_JCVIJTC    The Institute for Genomic Research, Traces generated at JCVIJTC                 # TA query_tracedb CENTER_NAME="JCVI"
 13 2485            UIACBCB         University of Iowa Center for Bioinformatics and Computational Biology (UIACBCB)
 14 49              WUGSC           Washington University, Genome Sequencing Center                                 # TA query_tracedb CENTER_NAME = "WUGSC" => 9
    37829394        total           total                                                                           # TA query_tracedb SPECIES_CODE = "BOS TAURUS" => 37,788,710

TRACE_TYPE_CODE counts

    COUNT         CENTER_NAME     TRACE_TYPE_CODE        LIBRARY              NOTES
 1  24863599      BCM*            WGS                    SEQ_LIB_ID:89
 2  10748529      BCM*            SHOTGUN                SEQ_LIB_ID:15543
 3  737900        NISC            SHOTGUN                SEQ_LIB_ID:247
 4  125597        BCCAGSC         CLONEEND               LIBRARY_ID:1         large insert size; some qualityless; !!! almost all have CLIP3=0
 5  114753        UIUC            CLONEEND               LIBRARY_ID:2         insert size missing , no frequent kmers
 6  65171         TIGR            CLONEEND               SEQ_LIB_ID:1         2K & use TRACE_DIRECTION instead of TRACE_END
 7  53556         GSC             CLONEEND               SEQ_LIB_ID:1         large insert size; !!! all have qual=0 and were excluded 
 8  26246         CENARGEN        WGS                    .                    no LIBRARY_ID; no SEQ_LIB_ID; no INSERT_SIZE; no INSERT_STDEV; reads have no direction; ~21954 could be paired (same TEMPLATE_ID)
 9  25454         BARC            CLONEEND               SEQ_LIB_ID:14304     !!! all have CLIP3=0
 10 16892         BCM*            CLONEEND               LIBRARY_ID:1         VBBAA   mean=167000  std=25000
 11 16787         CENARGEN        CLONEEND               LIBRARY_ID:1         
 12 15150         UOKNOR          SHOTGUN                LIBRARY_ID:1         some qualityless
 13 10651         TIGR_JCVIJTC    CLONEEND               SEQ_LIB_ID:2
 14 151           UOKNOR          FINISHING              LIBRARY_ID:1         some qualityless, no direction(TRACE_END=N); no INSERT_SIZE; no INSERT_STDEV
 15 49            WUGSC           CLONEEND               SEQ_LIB_ID:1 
    36820485      total

 16 527017        BCCAGSC         EST
 17 207204        MARC            EST
 18 171667        MARC            PCR
 19 81913         BARC            EST
 20 18623         SC              EST 
 21 2485          UIACBCB         EST
    1008909       total

STRATEGY & TRACE_TYPE_CODE counts

 COUNT           CENTER_NAME     STRATEGY        TRACE_TYPE_CODE
 12545304        BCM             .               WGS
 11425910        BCM             WGA             WGS
 5223683         BCM             CLONE           SHOTGUN
 4479883         BCM             POOLCLONE       SHOTGUN
 1044963         BCM             .               SHOTGUN
 892385          BCM             SNP             WGS
 737900          NISC            CLONE           SHOTGUN
 125597          BCCAGSC         CLONEEND        CLONEEND
 114753          UIUC            CLONEEND        CLONEEND 
 65171           TIGR            CLONEEND        CLONEEND
 53556           GSC             CLONEEND        CLONEEND
 26246           CENARGEN        .               WGS
 25454           BARC            .               CLONEEND
 16892           BCM             CLONEEND        CLONEEND
 16787           CENARGEN        CLONEEND        CLONEEND
 12195           UOKNOR          .               SHOTGUN
 10651           TIGR_JCVIJTC    CLONEEND        CLONEEND
 2955            UOKNOR          CLONE           SHOTGUN
 151             UOKNOR          .               FINISHING
 49              WUGSC           CLONEEND        CLONEEND
 527017          BCCAGSC         EST             EST
 145820          MARC            EST             EST
 117958          MARC            COMPARATIVE     PCR
 81913           BARC            EST             EST
 61384           MARC            CLONE           EST
 53709           MARC            Re-Sequencing   PCR
 18623           SC              EST             EST
 2485            UIACBCB         .               EST

BCM.SHOTGUN libraries

 SIZE    STDEV   COUNT
 3500    1500    4502569
 2000    1000    3244493
 3000    1000    1021577
 180000  1000    840528
 6500    1500    429026
 180000  13000   320208
 6000    2000    208192
 167000  13000   96337
 3500    15000   85599

Counts aggregated by SIZE:

 SIZE    COUNT
 3500    4588168
 2000    3244493
 180000  1160736
 3000    1021577
 6500    429026
 6000    208192
 167000  96337
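
The per-SIZE counts are the (SIZE, STDEV) rows summed over STDEV. A minimal sketch of that aggregation follows; the row values are copied from the listing above, and the function name is hypothetical.

```python
from collections import defaultdict

# (SIZE, STDEV, COUNT) rows copied from the BCM.SHOTGUN library listing.
LIBRARIES = [
    (3500, 1500, 4502569),
    (2000, 1000, 3244493),
    (3000, 1000, 1021577),
    (180000, 1000, 840528),
    (6500, 1500, 429026),
    (180000, 13000, 320208),
    (6000, 2000, 208192),
    (167000, 13000, 96337),
    (3500, 15000, 85599),
]


def count_by_size(rows):
    """Sum COUNT over all STDEV values of each SIZE, largest count first."""
    totals = defaultdict(int)
    for size, _stdev, count in rows:
        totals[size] += count
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for size, count in count_by_size(LIBRARIES):
        print(f" {size:<7} {count}")
```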

3' VECTOR TRIMMED counts

    CENTER_NAME     TRACE_TYPE_CODE TOTAL           3'CLV<LEN   QUAL==0          UMD.FRG
 1  BCM             WGS             24863599        10968979    551114           24050767
 2  BCM             SHOTGUN         10748529        5052692     23419            10068499
 3  NISC            SHOTGUN         737900          28972       0                735488
 4  BCCAGSC         CLONEEND        125597          125484      8926             113790
 5  UIUC            CLONEEND        114753          90243       0                106247
 6  TIGR            CLONEEND        65171           46389       0                64903
 7  GSC             CLONEEND        53556           53556       53556 (all)      0           !!! all have 0 quals and were excluded
 8  CENARGEN        WGS             26246           26246       0                25976
 9  BARC            CLONEEND        25454           25454       0                25387
 10 BCM             CLONEEND        16892           6751        0                16863
 11 CENARGEN        CLONEEND        16787           16787       0                16628
 12 UOKNOR          SHOTGUN         15150           2885        12195            0
 13 TIGR_JCVIJTC    CLONEEND        10651           339         0                10644
 14 UOKNOR          FINISHING       151             0           151              151
 15 WUGSC           CLONEEND        49              0           0                0

 16 BCCAGSC         EST             527017          524173      772              0
 17 MARC            EST             207204          207204      0                0
 18 MARC            PCR             171667          171667      0                0
 19 BARC            EST             81913           78597       0                0
 20 SC              EST             18623           7350        0                0
 21 UIACBCB         EST             2485            2485        0                0

ZERO QUALITY COUNTS

  • Counts
 CENTER_NAME     TRACE_TYPE_CODE  COUNT
 BCM             WGS              551114
 GSC             CLONEEND         53556
 BCM             SHOTGUN          23419
 UOKNOR          SHOTGUN          12195
 BCCAGSC         CLONEEND         8926
 BCCAGSC         EST              772
 UOKNOR          FINISHING        151
 TOTAL                            650133
  • For 0 quality reads, assign quality 20 to bases 1..700 and quality 0 to bases 701 onward
  • Volumes 026..039 have been fixed
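
The quality assignment rule for 0 quality reads can be sketched as follows. This is an illustration under the assumption that a "0 quality read" is one whose quality values are all 0; the function names are hypothetical and this is not the script that fixed volumes 026..039.

```python
def synthetic_quality(read_length, good_bases=700, good_quality=20):
    """Quality 20 for bases 1..700 (1-based) and 0 for every base after that."""
    n_good = min(read_length, good_bases)
    return [good_quality] * n_good + [0] * (read_length - n_good)


def fix_zero_quality(quals):
    """Replace an all-zero quality vector; any other read is returned unchanged."""
    if quals and all(q == 0 for q in quals):
        return synthetic_quality(len(quals))
    return quals


if __name__ == "__main__":
    fixed = fix_zero_quality([0] * 984)
    print(len(fixed), fixed[699], fixed[700])
```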

Local Data

Files & Dirs

 /fs/szasmg3/bos_taurus/data/
 /fs/szasmg2/Drosophila/D_pseudoobscura/Vectors
 /nfshomes/dpuiu/db/UniVec

Software

Figaro

  • trims vector only at the 5' end
  • calls Lucy for quality trimming

Lucy

  • both the vector sequence and the splice site sequences are required

Atlas

  • web site
  • atlas-screen-trim-file : "calls cross_match and atlas-screen-window to create trimmed reads file (scan in from each end of read looking for 50-base windows of high quality and no vector); "
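
The window scan described in the quote can be illustrated with a short sketch. It is not the atlas-screen-window implementation: the "high quality" test (mean quality of the window at least 20) and the per-base vector mask are assumptions made for the example.

```python
def window_trim(quals, is_vector, window=50, min_mean_quality=20.0):
    """Return a 0-based half-open (start, end) clear range, or None.

    Scans inward from each end for the first `window`-base stretch whose mean
    quality is at least `min_mean_quality` and that has no vector-flagged base.
    """
    n = len(quals)
    if n < window:
        return None

    def good(start):
        win_q = quals[start:start + window]
        win_v = is_vector[start:start + window]
        return sum(win_q) / window >= min_mean_quality and not any(win_v)

    starts = range(n - window + 1)
    left = next((s for s in starts if good(s)), None)
    if left is None:
        return None
    # A good window exists, so the scan from the 3' end always finds one.
    right = next(s for s in reversed(starts) if good(s))
    return left, right + window


if __name__ == "__main__":
    quals = [5] * 30 + [30] * 200 + [5] * 40
    vector_mask = [True] * 35 + [False] * 235
    print(window_trim(quals, vector_mask))
```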

Contaminant search

Align the CLIPPING range of the reads to UniVec & E. coli K12 using nucmer

UniVec

Ref

                 #seqs   min     max     mean    median  n50     sum
 UniVec          2861    12      48551   231     99      781     660,151
 UniVec_Core     1348    12      48551   243     98      967     327,641

Hits by minimum alignment length (bp)

 bp      #reads  min     max     mean    median  n50     sum
 19      4548466 19      1045    28.37   23      27      129025025
 20      3684852 20      1045    30.56   25      28      112616359
 30      1097357 30      1045    48.04   38      43      52714583
 40      484661  40      1045    66.36   47      53      32163896
 100     54334   100     1045    198     116     223     10772815        # many are ESTs

Ecoli

Ref:

 K12 4,639,675 bp

Hits by minimum alignment length (bp)

 bp      #reads  min     max     mean    median  n50     sum
 19      275109  19      1223    30.66   19      20      8435470
 20      102550  20      1223    50.29   21      161     5156849
 30      19032   30      1223    178     37      706     3381214
 40      9234    40      1223    329     171     738     3034293
 100     6781    100     1223    424     223     749     2876432
 200     4378    200     1223    575     696     771     2516916       

BCM vectors

                 #seqs   min     max     mean    median  n50     sum
 BCM             14      2580    33180   9379    5821    32705   131312

Vector/Splice site search

Strategy

  • 1. Select all the reads in the same volume that belong to one particular library: same CENTER_NAME, STRATEGY & TRACE_TYPE_CODE
  • 2. Get the quality clipping coordinates: CLIP_QUALITY_LEFT & CLIP_QUALITY_RIGHT
  • 3. Separate the reads into 2 sets according to direction (TRACE_END): FORWARD & REVERSE
  • 4. Get the most frequent kmers in each set (24 & 8 bp)
  • 5. Check if the most frequent kmers are overrepresented
  • 6. Check if the most frequent 8mers are present in the most frequent 24mers
  • 7. Try to extend the 24mers by a few bp => linkers
  • 8. Align the linkers to the opposite strand sequences using nucmer
  • 9. Extract the subsequences adjacent to (following) the linker (50..150bp)
  • 10. Align the subsequences; if they align, we have probably identified the vector
  • 11. Identify the vector name/id by alignment to UniVec => several alignments
  • 12. Check if the forward/reverse vector(s) are the same: we should find a common vector sequence, and the UniVec alignments should be adjacent
  • 13. Create the Lucy vector & splice files; the splice file contains the linker+vector
  • 14. Run Lucy & trim the input reads according to the Lucy clr
  • 15. Align the Lucy trimmed reads to linker, vector, splice & UniVec.dust
  • 16. Align the input reads to linker, vector, splice & UniVec.dust
  • 17. Compare the counts from 15. & 16.
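
Steps 4 and 6 can be sketched in a few lines. This is an illustration only, not the tool that produced the kmers.tab files: the function names are hypothetical, and the output layout (kmer, reverse complement, count) simply mirrors the kmers.tab listings in the example.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def top_kmers(reads, k, n=10):
    """Step 4: the n most frequent k-mers as (kmer, revcomp, count) tuples."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:  # skip windows that contain ambiguous bases
                counts[kmer] += 1
    return [(kmer, revcomp(kmer), c) for kmer, c in counts.most_common(n)]


def short_kmers_in_long(short_kmers, long_kmers):
    """Step 6: the frequent short kmers that occur inside a frequent long kmer."""
    return [s for s in short_kmers if any(s in long for long in long_kmers)]


if __name__ == "__main__":
    for kmer, rc, count in top_kmers(["ACGTACGT", "ACGT"], k=4, n=3):
        print(kmer, rc, count, sep="\t")
```

Step 5 (overrepresentation) is left out because the page does not state the threshold that was used.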

Example

  • 1. volume 011 : 500,000 reads CENTER_NAME=BCM, TRACE_TYPE_CODE=WGS
  • 2.
  • 3. 249,611 TRACE_END=F & 250,389 TRACE_END=R
  • 4. kmers: 8 of the 10 most frequent 8bp kmers are shared by the FORWARD & REVERSE sets; no 24bp kmers are shared
 ==> 24.fwd/kmers.tab <==
 AGTTCGACTGCAAGTAGTTCATCA      TGATGAACTACTTGCAGTCGAACT        2463 # contains AGTAGTTC
 GAGTTCGACTGCAAGTAGTTCATC      GATGAACTACTTGCAGTCGAACTC        2189
 CGAGTTCGACTGCAAGTAGTTCAT      ATGAACTACTTGCAGTCGAACTCG        1996
 TCGAGTTCGACTGCAAGTAGTTCA      TGAACTACTTGCAGTCGAACTCGA        1593
 GTTCGACTGCAAGTAGTTCATCAA      TTGATGAACTACTTGCAGTCGAAC        1023
 GAGTTCGACTGCAGTAGTTCATCA      TGATGAACTACTGCAGTCGAACTC        812
 CGAGTTCGACTGCAGTAGTTCATC      GATGAACTACTGCAGTCGAACTCG        777
 GTTCGACTGCAAGTAGTTCATCAT      ATGATGAACTACTTGCAGTCGAAC        769
 TCGAGTTCGACTGCAGTAGTTCAT      ATGAACTACTGCAGTCGAACTCGA        637
 ATCGAGTTCGACTGCAAGTAGTTC      GAACTACTTGCAGTCGAACTCGAT        594
 
 ==> 08.fwd/kmers.tab <==
 AGTAGTTC      GAACTACT        86477
 CAGTAGTT      AACTACTG        67681
 AGTTCTCA      TGAGAACT        61556
 TAGTTCTC      GAGAACTA        60964
 GTAGTTCT      AGAACTAC        57866
 AGTTCATC      GATGAACT        49676
 TAGTTCAT      ATGAACTA        45298
 GTTCATCA      TGATGAAC        42117
 GCAGTAGT      ACTACTGC        41391
 GTAGTTCA      TGAACTAC        40694
 
 ==> 24.rev/kmers.tab <==
 TATCGATGGTACAGTAGTTCATCA      TGATGAACTACTGTACCATCGATA        999 # contains AGTAGTTC
 CTATCGATGGTACAGTAGTTCATC      GATGAACTACTGTACCATCGATAG        774
 GCTATCGATGGTACAGTAGTTCAT      ATGAACTACTGTACCATCGATAGC        600
 CGCTATCGATGGTACAGTAGTTCA      TGAACTACTGTACCATCGATAGCG        432
 ATCGATGGTACAGTAGTTCATCAT      ATGATGAACTACTGTACCATCGAT        417
 ATCGATGGTACAGTAGTTCATCAA      TTGATGAACTACTGTACCATCGAT        380
 ATCAGATGGTACAGTAGTTCATCA      TGATGAACTACTGTACCATCTGAT        373
 ATCGATGGTACAGTAGTTCATCAC      GTGATGAACTACTGTACCATCGAT        265
 CTATCGATGGTAAGTAGTTCATCA      TGATGAACTACTTACCATCGATAG        235
 TCAGATGGTACAGTAGTTCATCAA      TTGATGAACTACTGTACCATCTGA        224
 
 ==> 08.rev/kmers.tab <==
 AGTTCATC      GATGAACT        85127
 TAGTTCAT      ATGAACTA        77902
 GTTCATCA      TGATGAAC        75585
 TAGTTCTC      GAGAACTA        68057
 AGTTCTCA      TGAGAACT        67277
 GTAGTTCT      AGAACTAC        64894
 GTAGTTCA      TGAACTAC        62607
 CGTAGTTC      GAACTACG        52031
 AGTAGTTC      GAACTACT        51013
 ACGTAGTT      AACTACGT        31552
  • 7. Get linker sequences
 >linker.fwd 27bp
 TCGAGTTCGACTGCAAGTAGTTCATCA
 >linker.rev 27bp
 CTAATCAGATGGTACAGTAGTTCATCA 
  
 #>linker.rev 40 bp Art's  (13 more bp at 5')        
 #TATGACCATGCGCCTAATCAGATGGTACAGTAGTTCATCA
 #GCTATCGATGGTACAGTAGTTCATCAT is the most frequent 27bp kmer in the rev set, but it is not the linker (a few SNP differences)
  • 8 & 9 Align reads to linkers using nucmer

Fwd:

 nucmer -l 12 -c 24 -r linker.fwd.seq ../bos_taurus.$v.r.fasta 
 #  nucmer -l 12 -c 24 -r kmers.seq ../bos_taurus.$v.r.fasta  
 show-coords out.delta | awk '{print $19,$5,$13}' > ! out.clr
 extractfromfastanames.pl -clr -f out.clr < ../bos_taurus.$v.r.fasta >! out.seq
 

Rev:

 nucmer -l 12 -c 24 -r linker.rev.seq ../bos_taurus.$v.f.fasta
 #  nucmer -l 12 -c 24 -r kmers.seq ../bos_taurus.$v.f.fasta  
 show-coords out.delta | awk '{print $19,$5,$13}' > ! out.clr
 extractfromfastanames.pl -clr -f out.clr < ../bos_taurus.$v.f.fasta >! out.seq
 

Both:

 clrFasta out.seq >! out.cseq
 fasta2tab.pl out.cseq | sort -k2 > ! out.tab
 nucmer -c 40 out.cseq ~/db/UniVec -p vector
 delta-filter -q vector.delta >! vector.filter-q.delta
 show-coords vector.filter-q.delta | sort -n | head
 cat vector.filter-q.delta | grep "^>" | count.pl -c 1 -m 2
 
  • 10. Extract "vector reads"
 >399553028  # 24.fwd     
 TGATGAACTACTGTACCATCTGATTAGGCGCATGGTCATAGCTGTTTCCTGTGTGAAATT
 GCTATCCGCTCACAATTCCACACAACATACGAGCCGGAAGCATAAAGTGTAAAGCCTGGG
 GTGTCAAATGAGAGACCTAACTCACATTCAACTTTTTTTTTTTTTCTGCCCTCTATTCTA
 ...
 >400269118 #24.rev
 TGATGAACTACTTGCAGTCGAAATCGAATCATCACTGGCCGTCCTTTTACAACGTCGTGA
 CTGGGAAAACCCTGGCGTTACCCAACTTAATCCGCCTTGCAGCACATCCCCCTTTCCCCC
 AGCTGGCGTAAAAACGTAAAAAGCCCCGCACCGATCGCCCTTTCCCAACAGGTTGCCCAG
  • 11. Align "vector reads" to UniVec; identify vector
 show-coords 24.fwd/400269118-UniVec.delta 24.rev/399553028-UniVec.delta | grep J01636.1
     31  148  | 1175 1292  | 118   118  |  95.76  |     1276     7477  |     9.25     1.58  | 399553028.rev gnl|uv|J01636.1:1-7477
     32  199  | 1302 1463  | 168   162  |  90.48  |      653     7477  |    25.73     2.17  | 400269118     gnl|uv|J01636.1:1-7477
  • 12. The 2 UniVec alignments (1175..1292 and 1302..1463 on J01636) are 10bp apart, i.e. adjacent
  • 13. Lucy files
 $ more vector.seq
   >J01636 E.coli lactose operon with lacI, lacZ, lacY and lacA genes
   GACACCATCGAATGGCGCAAAACCTTTCGCGGTATGGCATGATAGCGCCCGGAAGAGAGTCAATTCAGGG
   TGGTGAATGTGAAACCAGTAACGTTATACGATGTCGCAGAGTATGCCGGTGTCTCTTATCAGACCGTTTC
   CCGCGTGGTGAACCAGGCCAGCCACGTTTCTGCGAAAACGCGGGAAAAAGTGGAAGCGGCGATGGCGGAG
   CTGAATTACATTCCCAACCGCGTGGCACAACAACTGGCGGGCAAACAGTCGTTGCTGATTGGCGTTGCCA
   ...
 
 $ more splice.seq
   >J01636.for.begin vector+linker.rev
   TGAATGTGAGTTAGGTCTCTCATTTGACACCCCAGGCTTTACACTTTATGCTTCCGGCTC
   GTATGTTGTGTGGAATTGTGAGCGGATAGCAATTTCACACAGGAAACAGCTATGACCATG
   CGCCTAATCAGATGGTACAGTAGTTCATCA
   >J01636.for.end  rev(linker.fwd)+vector 
   TGATGAACTACTTGCAGTCGAAATCGAATCATCACTGGCCGTCCTTTTACAACGTCGTGA
   CTGGGAAAACCCTGGCGTTACCCAACTTAATCCGCCTTGCAGCACATCCCCCTTTCCCCC
   AGCTGGCGTAAAAACGTAAAAAGCCCCGCA
   >J01636.rev.begin (revcomp of J01636.for.end)
   TGCGGGGCTTTTTACGTTTTTACGCCAGCTGGGGGAAAGGGGGATGTGCTGCAAGGCGGA
   TTAAGTTGGGTAACGCCAGGGTTTTCCCAGTCACGACGTTGTAAAAGGACGGCCAGTGAT
   GATTCGATTTCGACTGCAAGTAGTTCATCA
   >J01636.rev.end (revcomp of J01636.for.begin)
   TGATGAACTACTGTACCATCTGATTAGGCGCATGGTCATAGCTGTTTCCTGTGTGAAATT
   GCTATCCGCTCACAATTCCACACAACATACGAGCCGGAAGCATAAAGTGTAAAGCCTGGG
   GTGTCAAATGAGAGACCTAACTCACATTCA
 # splice=linker+vector  
      3      120  |     1175     1292  |      118      118  |    95.76  |      150     7477  |    78.67     1.58  | J01636.for.begin   J01636
     32      131  |     1302     1399  |      100       98  |    96.00  |      150     7477  |    66.67     1.31  | J01636.for.end     J01636
  • 13.1 Align vector & splice to Ecoli
      1     7474  |   366812   359335  |     7474     7478  |    99.91  |     7477  4639675  |    99.96     0.16  | J01636             NC_000913.2    [CONTAINED]
     20      119  |       65      162  |      100       98  |    96.00  |      150      395  |    66.67    24.81  | J01636.rev.begin   NC_000913.2
     31      148  |      172      289  |      118      118  |    95.76  |      150      395  |    78.67    29.87  | J01636.rev.end     NC_000913.2
   1069     1463  |      395        1  |      395      395  |   100.00  |     7477      395  |     5.28   100.00  | J01636             NC_000913.2.365350-365744
  • 14. Run lucy & trim reads
 $ /nfshomes/dpuiu/szdevel/SourceForge/lucy-1.19p/lucy \
     -v vector.seq splice.seq \
     -o bos_taurus.lucy.seq bos_taurus.lucy.qual \
     -debug bos_taurus.lucy.info \
     bos_taurus.seq bos_taurus.qual
 # Trim clr
 $ clrFasta bos_taurus.lucy.seq > bos_taurus.lucy.cseq
  • 15. Align lucy output to linker, vector, splice & UniVec.dust
 $ nucmer -l 12 -c 24 ~/db/linker.seq  bos_taurus.lucy.cseq -p linker-bos_taurus.lucy
 $ nucmer -l 16 -c 30 ~/db/vector.seq  bos_taurus.lucy.cseq -p vector-bos_taurus.lucy
 $ nucmer -l 16 -c 30 ~/db/splice.seq  bos_taurus.lucy.cseq -p splice-bos_taurus.lucy
 $ nucmer -l 16 -c 30 ~/db/UniVec.dust bos_taurus.lucy.cseq -p UniVec.dust-bos_taurus.lucy
  • 16. Align input to linker, vector, splice & UniVec.dust
 $ nucmer -l 12 -c 24 ~/db/linker.seq bos_taurus.seq -p linker-bos_taurus
 $ nucmer -l 16 -c 30 ~/db/vector.seq bos_taurus.seq -p vector-bos_taurus
 $ nucmer -l 16 -c 30 ~/db/splice.seq bos_taurus.seq -p splice-bos_taurus
 $ nucmer -l 16 -c 30 ~/db/UniVec.dust bos_taurus.seq -p UniVec.dust-bos_taurus

Count how many reads got trimmed

 infoseq *seq | getSummary.pl -c 1 -t original.LEN
 
 cat bos_taurus.lucy.info | awk '{print $4-$3}' | getSummary.pl -t lucy.CLR >! bos_taurus.lucy.summary  
 cat bos_taurus.lucy.info | getSummary.pl -c 14 -t lucy.CLV5 -nh >> bos_taurus.lucy.summary
 cat bos_taurus.lucy.info | getSummary.pl -c 15 -t lucy.CLV3 -nh >> bos_taurus.lucy.summary
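
The summary columns used in the tables that follow (#elem, #0s, min, max, mean, median, n50, sum) can be reproduced with a short sketch. It is not the getSummary.pl source; in particular, the n50 definition used here (the value at which the running sum of the values, taken largest first, reaches half of the total) is an assumption.

```python
from statistics import median


def n50(values):
    """Value v such that the elements >= v hold at least half of the total sum."""
    total = sum(values)
    running = 0
    for v in sorted(values, reverse=True):
        running += v
        if 2 * running >= total:
            return v
    return 0  # only reached for an empty input


def summarize(values):
    """Summary statistics in the same column order as the tables on this page."""
    values = list(values)
    return {
        "elem": len(values),
        "zeros": sum(1 for v in values if v == 0),
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
        "median": median(values),
        "n50": n50(values),
        "sum": sum(values),
    }


if __name__ == "__main__":
    print(summarize([0, 0, 2, 3, 5, 10]))
```

With this definition a column that is mostly zeros, such as lucy.CLV3, can have a median of 0 and still a large n50, because the n50 is weighted by the values themselves.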

Libraries

011.BCM.WGS FORWARD

  • vector: J01636
  • UniVec: gnl|uv|J01636.1:1-7477 E.coli lactose operon with lacI, lacZ, lacY and lacA genes
 ll ~dpuiu/db/J01636*
 -rw-rw-r--  1 dpuiu dpuiu 7651 Jan  9 15:56 /nfshomes/dpuiu/db/J01636
 -rw-rw-r--  1 dpuiu dpuiu  105 Jan 14 07:17 /nfshomes/dpuiu/db/J01636linker
 -rw-rw-r--  1 dpuiu dpuiu  840 Jan 13 13:43 /nfshomes/dpuiu/db/J01636splice
 cat  ~dpuiu/db/J01636* | infoseq
 J01636            7477   53.43
 J01636.linker.fwd 27     44.44
 J01636.linker.rev 27     37.04
 J01636.for.begin  150    44.67
 J01636.for.end    150    51.33
 J01636.rev.begin  150    51.33
 J01636.rev.end    150    44.67
  • 249,611 reads:
  • 91% got vector trimmed at the 5' end
  • 0.46% (1,149) got vector trimmed at the 3' end
                 #elem   #0s     min     max     mean    median  n50     sum
 original.LEN    249611  0       437     2349    1082    991     1009    270035781     
 lucy.CLV5       249611  21215   0       741     25.03   25      27      6247415
 lucy.CLV3       249611  248462  0       1047    3.49    0       859     870344
  • Original reads hit counts:
 linker.fwd      10975
 linker.rev      133
 splice          166
 vector          152
 UniVec.dust     228
  • Lucy trimmed reads hit counts:
 linker.fwd      2
 linker.rev      0
 splice          1
 vector          1
 UniVec.dust     6 (only 3 are >40bp)

011.BCM.WGS REVERSE

                 #elem   #0s     min     max     mean    median  n50     sum
 original.LEN    250389  0       502     2148    1085    993     1012    271691094
 lucy.CLR        250389  7345    0       1281    795     876     892     198982171
 lucy.CLV5       250389  20271   0       668     26.52   27      29      6641362
 lucy.CLV3       250389  249269  0       997     3.35    0       861     839029
  • Original reads hit counts:
 linker.fwd      113
 linker.rev      3812
 splice          143
 UniVec.dust     237
 vector          4318
  • Lucy trimmed reads hit counts:
 linker.fwd      1
 linker.rev      0
 splice          1
 UniVec.dust     10
 vector          1

030.BCM.SHOTGUN

  • same linker/vector/splice as BCM.WGS
  • 2.5% (4K out of 160K) reads contain linker & vector at 3'
                 #elem   #0s     min     max     mean    median  n50     sum
 original.LEN    8411    0       325     1685    1181    1240    1314    9933150
 lucy.CLR        8411    8       0       1054    841     863     874     7070994
 lucy.CLV5       8411    568     0       232     27.01   28      29      227206
 lucy.CLV3       8411    2325    0       1040    597     794     851     5023445
  • Original reads hit counts:
 linker.fwd      4314
 linker.rev      4125
 splice          7816
 UniVec.dust     4212
 vector          6750
 vector          27235
  • Lucy trimmed reads hit counts:
 linker.fwd      3
 linker.rev      1
 splice          1
 UniVec.dust     13
 vector          0

001.NISC.SHOTGUN

  • Vector: pOTW13
  • UniVec: 3 partial seqs
 gnl|uv|NGB00080.1:1-198 pOTW13 with linkers
 gnl|uv|NGB00080.1:718-888 pOTW13 with linkers
 gnl|uv|NGB00080.1:1490-1654-49 pOTW13 with linkers
 ll /nfshomes/dpuiu/db/NGB00080*
 -rw-rw-r--  1 dpuiu dpuiu 1083 Jan 14 20:43 /nfshomes/dpuiu/db/NGB00080
 -rw-r--r--  1 dpuiu dpuiu   94 Jan 14 21:01 /nfshomes/dpuiu/db/NGB00080linker
 -rw-r--r--  1 dpuiu dpuiu 2183 Jan 14 20:44 /nfshomes/dpuiu/db/NGB00080splice
 cat  /nfshomes/dpuiu/db/NGB00080* | infoseq
 NGB00080            1054   50.00
 NGB00080.linker.fwd 24     45.83
 NGB00080.linker.rev 26     53.85
 NGB00080.for.beg    518    46.14
 NGB00080.for.end    518    50.48
 NGB00080.rev.begin  518    50.48
 NGB00080.rev.beg    518    46.14
  • 944 read sample
                 #elem   #0s     min     max     mean    median  n50     sum
 original.LEN    944     0       652     1017    735     721     722     693668
 lucy.CLR        944     39      0       886     415     422     522     391333
 lucy.CLV5       944     121     0       275     34.05   33      35      32143
 lucy.CLV3       944     18      0       885     410     409     511     387007
  • Original reads hit counts:
 linker.fwd      479
 linker.rev      492
 splice          910
 UniVec.dust     0
 vector          939
  • Lucy trimmed reads hit counts:
 linker.fwd      1
 linker.rev      0
 splice          0
 UniVec.dust     9
 vector          1

060.BCCAGSC.CLONEEND

  • Linkers:
 linker.fwd CCCTGCTTTGTCTGGAAGGGGTTCCCGACCT
 linker.rev CAGGAGGGGAGAAAGGGCTCAGAGG
  • No common vector !!!
 wc -l *clb
   60746 bos_taurus.060.f.clb  # 18 of the original reads align to UniVec (nucmer default params)
   60836 bos_taurus.060.r.clb
 
 Fwd:
    329      428  |      440      535  |      100       96  |    91.00  |      503     1585  |    19.88     6.06  | 723951410  gnl|uv|U30497.1:3230-4814 Cloning vector pAS2-1
    330      370  |       89       49  |       41       41  |   100.00  |      503      143  |     8.15    28.67  | 723951410  gnl|uv|U67875.1:6541-6683 pESP-I yeast expression vector
    330      370  |       94       54  |       41       41  |   100.00  |      503      143  |     8.15    28.67  | 723951410  gnl|uv|U67875.1:6541-6683 pESP-I yeast expression vector
 
 Rev:
      1       96  |       71      165  |       96       95  |    93.81  |      203      165  |    47.29    57.58  | 724018013  gnl|uv|AF133437.1:16659-16823 Cloning vector pCYPAC6
     50      143  |        1       94  |       94       94  |    92.71  |      203       94  |    46.31   100.00  | 724018013  gnl|uv|U80929.2:2858-2951     Cloning vector pBACe3.6

017.UIUC.CLONEEND

  • No overrepresented kmers
 wc -l *clb
  17978 bos_taurus.017.f.clb
  17911 bos_taurus.017.r.clb
 
 ==> 24.fwd/kmers.tab <==
 CCCTGCTTTGTCTGGAAGGGGTTC        GAACCCCTTCCAGACAAAGCAGGG        9
 CTGCTTTGTCTGGAAGGGGTTCCC        GGGAACCCCTTCCAGACAAAGCAG        9
 
 ==> 24.rev/kmers.tab <==
 GAATGTTGAGCTTTAGCCAACTTT        AAAGTTGGCTAAAGCTCAACATTC        4
 TCTGAATGTTGAGCTTTAGCCAAC        GTTGGCTAAAGCTCAACATTCAGA        4
 
 ==> 8.fwd/kmers.tab <==
 TTTTTTTT        AAAAAAAA        55
 AAGGGGTT        AACCCCTT        35
 
 ==> 8.rev/kmers.tab <==
 GTCTGGAA        TTCCAGAC        41
 TCTGGAAG        CTTCCAGA        39
  • No UniVec hits

010.TIGR.CLONEEND

  • No overrepresented kmers
 wc -l *clb
 5479 bos_taurus.032.f.clb
 5174 bos_taurus.032.r.clb
 
 ==> 24.fwd/kmers.tab <==
 CTTGTGTTGGCCCAGGCAAGTCCA        TGGACTTGCCTGGGCCAACACAAG        30
 TTGTGTTGGCCCAGGCAAGTCCAA        TTGGACTTGCCTGGGCCAACACAA        30
 
 ==> 24.rev/kmers.tab <==
 CTGCCTCTTGTGTTGGCCCAGGCA        TGCCTGGGCCAACACAAGAGGCAG        16
 GCTGCCTCTTGTGTTGGCCCAGGC        GCCTGGGCCAACACAAGAGGCAGC        15
 
 ==> 8.fwd/kmers.tab <==
 GAGTGGGT        ACCCACTC        176
 GGAGTGGG        CCCACTCC        171
 
 ==> 8.rev/kmers.tab <==
 TGGAGTGG        CCACTCCA        182
 GGAGTGGG        CCCACTCC        181
 
  • No UniVec hits

...

070.BCM.CLONEEND

  • No frequent kmers
 wc -l *clb
   6027 bos_taurus.070.f.clb
   6236 bos_taurus.070.r.clb
 
 ==> 24.fwd/kmers.tab <==
 GGACTCTCAGAGTCTTCTCCAACA        TGTTGGAGAAGACTCTGAGAGTCC        18
 ACTGGTTGGATCTCCTTGCAGTCC        GGACTGCAAGGAGATCCAACCAGT        18

 ==> 24.rev/kmers.tab <==
 ATAAAATCTGAGCCACCAGGGAAG        CTTCCCTGGTGGCTCAGATTTTAT        1
 CTATTGGTTCATATGGTCAACGTC        GACGTTGACCATATGAACCAATAG        1
 
 ==> 8.fwd/kmers.tab <==
 TTTTTTTT        AAAAAAAA        86
 CTTCTCCA        TGGAGAAG        75
 
 ==> 8.rev/kmers.tab <==
 TATAGTGT        ACACTATA        9
 ATATAGGG        CCCTATAT        8
  • No alignments to BCM WGS vector

Running Lucy

  • Default parameters with vector trimming
  • BCM vector/splice
 /nfshomes/dpuiu/db/vector.BCM.seq
 /nfshomes/dpuiu/db/splice.BCM.seq
  • NISC vector/splice
 /nfshomes/dpuiu/db/vector.NISC.seq
 /nfshomes/dpuiu/db/splice.NISC.seq

BCM.WGS (all reads)

  • orig.CLR < lucy.CLR ( 765 < 792 )
  • orig.CLV > lucy.CLV ( 1015 > 973 )
  • 739,529 out of 24,863,599 reads (3%) deleted by Lucy (CLR=-1,-1)
  • 21,728,592 out of 24,863,599 reads (87%) vector trimmed at the 5' end
  • 92,646 out of 24,863,599 reads (0.37%) vector trimmed at the 3' end
                           elem       <0         0          >0         min        max        mean       median     n50        sum 
 orig.LEN                  24863599   0          0          24863599   5          3097       1002       997        1015       24915462033
 
 orig.CLR                  24863599   463669     7          24399923   -1143      1833       765        836        864        19036744256
 orig.CLR5                 24863599   0          359245     24504354   0          2103       42         22         58         1047922451
 orig.CLR3                 24863599   463404     0          24400195   -1         2169       807        872        895        20084666707
 
 lucy.CLR                  24863599   0          739529     24124070   0          1219       792        878        904        19695000417
 lucy.CLR5                 24863599   739529     36108      24087962   -1         1753       43         29         42         1086413880
 lucy.CLR3                 24863599   739529     0          24124070   -1         1894       835        915        939        20781414297
 
 orig.CLR5-lucy.CLR5       24863599   16299521   215345     8348733    -1186      2104       -1         -10        -1186      -38491429
 orig.CLR3-lucy.CLR3       24863599   14858542   1494794    8510263    -1273      2170       -28        -20        -1273      -696747590
 
 orig.CLV                  24863599   1053       1920       24860626   -2         5345       1015       1002       1017       25260581538
 orig.CLV5                 8841849    0          0          8841849    1          1219       33         46         49         295011460
 orig.CLV3                 24861698   1053       0          24860645   -1         5346       1027       1005       1019       25555592998
 
 lucy.CLV                  24863599   10694      707        24852198   -469       3096       973        968        987        24195085877
 lucy.CLV5                 24863599   0          3135007    21728592   0          1359       25         27         29         623457486
 lucy.CLV3                 24863599   0          0          24863599   4          3096       998        995        1014       24818543363
 lucy.CLVABS5              24863599   0          3135007    21728592   0          1359       25         27         29         623457486
 lucy.CLVABS3              24863599   0          24770953   92646      0          1343       2          0          880        72055071
 
 orig.CLV5-lucy.CLV5       24863599   17216820   1512453    6134326    -1312      1219       -13        -25        -1312      -328446026
 orig.CLV3-lucy.CLV3       24863599   1519132    18579609   4764858    -1832      4672       29         0          479        737049635
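
The "<0 / 0 / >0" columns and the percentages quoted in the bullets can be recomputed with a sketch like the one below. The helper names are hypothetical, and the per-read difference is taken as orig minus lucy, as in the row labels.

```python
def clip_differences(orig, lucy):
    """Per-read difference orig - lucy for two clip columns of equal length."""
    return [o - l for o, l in zip(orig, lucy)]


def sign_counts(values):
    """Counts for the '<0', '0' and '>0' columns of the summary tables."""
    neg = sum(1 for v in values if v < 0)
    zero = sum(1 for v in values if v == 0)
    return neg, zero, len(values) - neg - zero


def percent(part, total, digits=2):
    """Percentage of `part` in `total`, rounded to `digits` decimals."""
    return round(100.0 * part / total, digits)


if __name__ == "__main__":
    total = 24863599  # BCM.WGS reads
    print(percent(739529, total), percent(21728592, total), percent(92646, total))
    print(sign_counts(clip_differences([10, 25, 30], [12, 25, 20])))
```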

BCM.WGS (0 quality reads)

  • orig.CLR > lucy.CLR (mean)
  • orig.CLV > lucy.CLV (mean)
  • 7,153 out of 551,114 reads (1.3%) deleted by Lucy (CLR=-1,-1)
  • 508,166 out of 551,114 reads (92%) vector trimmed at the 5' end
  • 1,946 out of 551,114 reads (0.35%) vector trimmed at the 3' end
                           elem       <0         0          >0         min        max        mean       median     n50        sum
 orig.LEN                  551114     0          0          551114     5          1464       872        946        959        480705828
 
 orig.CLR                  551114     7754       0          543360     -770       1175       708        786        807        390325117
 orig.CLR5                 551114     0          6773       544341     0          1519       44         20         111        24582849
 orig.CLR3                 551114     7744       0          543370     -1         1638       752        818        833        414907966
 
 lucy.CLR                  551114     0          7153       543961     0          699        636        671        671        350759771
 lucy.CLR5                 551114     7153       35872      508089     -1         201        26         27         28         14442310
 lucy.CLR3                 551114     7153       0          543961     -1         699        662        699        699        365202081
 
 orig.CLR5-lucy.CLR5       551114     364282     8801       178031     -198       1500       18         -8         215        10140539
 orig.CLR3-lucy.CLR3       551114     85058      2962       463094     -700       1472       90         123        178        49705885
 
 orig.CLV                  551114     971        0          550143     -2         2037       974        978        981        537127121
 orig.CLV5                 5100       0          0          5100       1          845        35         29         31         180490
 orig.CLV3                 551114     971        0          550143     -1         2037       974        978        981        537307611
 
 lucy.CLV                  551114     58         6          551050     -84        1456       841        917        930        463903233
 lucy.CLV5                 551114     0          42948      508166     0          202        27         28         29         14964546
 lucy.CLV3                 551114     0          0          551114     4          1463       868        945        958        478867779
 lucy.CLVABS5              551114     0          42948      508166     0          202        27         28         29         14964546
 lucy.CLVABS3              551114     0          549168     1946       0          700        2          0          686        1286935
 
 orig.CLV5-lucy.CLV5       551114     506108     42215      2791       -202       845        -26        -28        -202       -14784056
 orig.CLV3-lucy.CLV3       551114     134959     23422      392733     -967       1614       106        7          459        58439832

BCM.SHOTGUN

  • orig.CLR < lucy.CLR (mean)
  • orig.CLV > lucy.CLV (mean)
  • 98,070 out of 10,748,529 reads (0.9%) deleted by Lucy (CLR=-1,-1)
  • 9,737,008 out of 10,748,529 reads (91%) vector trimmed at the 5' end
  • 294,942 out of 10,748,529 reads (2.7%) vector trimmed at the 3' end
                           elem       <0         0          >0         min        max        mean       median     n50        sum
 orig.LEN                  10748529   0          0          10748529   5          2043       975        950        964        10486690472
 
 orig.CLR                  10748529   17308      2          10731219   -1293      1467       809        833        847        8701344571
 orig.CLR5                 10748529   0          68         10748461   0          1315       26         16         38         288662580
 orig.CLR3                 10748529   16780      0          10731749   -1         1647       836        851        863        8990007151
 
 lucy.CLR                  10748529   0          98070      10650459   0          1337       833        854        868        8955866769
 lucy.CLR5                 10748529   98070      1973       10648486   -1         1307       35         28         32         376276188
 lucy.CLR3                 10748529   98070      0          10650459   -1         1553       868        882        896        9332142957
 
 orig.CLR5-lucy.CLR5       10748529   9498290    65171      1185068    -1099      1293       -8         -11        -1099      -87613608
 orig.CLR3-lucy.CLR3       10748529   6879532    671097     3197900    -1149      1437       -31        -26        -1149      -342135806
 
 orig.CLV                  10748529   16779      412        10731338   -2         3919       974        948        964        10472347908
 orig.CLV5                 8594910    0          0          8594910    1          1239       3          1          49         28350257
 orig.CLV3                 10748349   16779      0          10731570   -1         3919       976        950        965        10500698165
 
 lucy.CLV                  10748529   7026       614        10740889   -268       2042       930        924        940        9997862132
 lucy.CLV5                 10748529   0          1011521    9737008    0          855        24         24         27         257993796
 lucy.CLV3                 10748529   0          0          10748529   4          2042       954        945        962        10255855928
 lucy.CLVABS5              10748529   0          1011521    9737008    0          855        24         24         27         257993796
 lucy.CLVABS3              10748529   0          10453587   294942     0          1214       20         0          847        220086015
 
 orig.CLV5-lucy.CLV5       10748529   9538738    138680     1071111    -854       1239       -21        -23        -854       -229643539
 orig.CLV3-lucy.CLV3       10748529   357934     9324166    1066429    -1328      2846       22         0          704        244842237

NISC.SHOTGUN

  • orig.CLR < lucy.CLR (mean)
  • orig.CLV > lucy.CLV (mean)
  • 8,248 out of 737,900 reads (1.1%) deleted by Lucy (CLR=-1,-1)
  • 633,409 out of 737,900 reads (86%) vector trimmed at the 5' end
  • 7,201 out of 737,900 reads (0.98%) vector trimmed at the 3' end


                           elem       <0         0          >0         min        max        mean       median     n50        sum
 orig.LEN                  737900     0          0          737900     104        2104       784        729        734        579172842
 
 orig.CLR                  737900     5988       2          731910     -636       1033       651        668        676        480400909
 orig.CLR5                 737900     0          0          737900     1          1407       47         40         51         34857531
 orig.CLR3                 737900     0          5879       732021     0          1470       698        710        715        515258440
 
 lucy.CLR                  737900     0          8248       729652     0          1035       658        670        676        485757685
 lucy.CLR5                 737900     8248       56         729596     -1         1091       45         35         46         33811606
 lucy.CLR3                 737900     8248       0          729652     -1         1391       704        710        714        519569291
 
 orig.CLR5-lucy.CLR5       737900     253727     89345      394828     -566       1408       1          1          485        1045925
 orig.CLR3-lucy.CLR3       737900     177007     31         560862     -867       1471       -5         1          -867       -4310851
 
 orig.CLV                  737900     3224       2655       732021     -636       2103       771        725        730        569178445
 orig.CLV5                 734026     0          0          734026     1          987        5          1          35         4375315
 orig.CLV3                 732021     0          0          732021     35         2104       783        729        734        573553760
 
 lucy.CLV                  737900     1335       55         736510     -200       2104       747        696        702        551392388
 lucy.CLV5                 737900     104491     0          633409     -1         1199       30         31         34         22784742
 lucy.CLV3                 737900     0          0          737900     15         2103       778        728        733        574177130
 lucy.CLVABS5              737900     0          104491     633409     0          1200       31         32         35         23522642
 lucy.CLVABS3              737900     0          730699     7201       0          1076       5          0          686        4257812
 
 orig.CLV5-lucy.CLV5       737900     561851     66390      109659     -1198      983        -24        -29        -1198      -18409427
 orig.CLV3-lucy.CLV3       737900     8386       1          729513     -950       1077       0          1          -950       -623370

Fragment files

  • Location: /fs/szasmg3/bos_taurus/data/frg
  • All DST messages are unique
  • bos_taurus.clv : contains the vector clipping points
    • BCM.WGS, BCM.SHOTGUN & NISC.SHOTGUN: lucy.clv
    • others: the TA clv
    • 374,454 reads don't have valid clv's
    • 36,446,031 reads have valid clv's with avg=955

Message counts (original)

                                                DST     FRG             LKG
 bos_taurus.BCM.WGS.frg                         79      24124070        11311841
 #bos_taurus.BCM.SHOTGUN.frg                    7339    10650459        1799069     # some libs & mates are missing due to a tarchive2ca crash
 #bos_taurus.BCM.SHOTGUN.new.frg                18208   10650459        4715172     # split the libraries by VOL & SEQ_LIB_ID
 bos_taurus.BCM.SHOTGUN.new.frg                 13826   10650459        5046435     # double check the FRG count !!!
 bos_taurus.NISC.SHOTGUN.frg                    246     729652          344932
 bos_taurus.BCCAGSC.CLONEEND.frg                1       125241          59505
 bos_taurus.UIUC.CLONEEND.frg                   2       114750          46319
 bos_taurus.TIGR.CLONEEND.frg                   1       65171           27067
 bos_taurus.GSC.CLONEEND.frg                    1       53521           25889
 bos_taurus.CENARGEN.WGS.frg                    0       26246           0
 bos_taurus.BARC.CLONEEND.frg                   11150   25454           11150
 bos_taurus.BCM.CLONEEND.frg                    1       16875           7103
 bos_taurus.CENARGEN.CLONEEND.frg               1       16787           6269
 bos_taurus.UOKNOR.SHOTGUN.frg                  1       14651           4910
 bos_taurus.TIGR_JCVIJTC.CLONEEND.frg           2       10651           4803
 bos_taurus.UOKNOR.FINISHING.frg                0       151             0
 bos_taurus.WUGSC.COLONEEND.frg                 1       49              21

Message counts (quality)

                                                DST     FRG             LKG
 bos_taurus.BCM.WGS.qual.count                  79      23580109        11035582
 #bos_taurus.BCM.SHOTGUN.qual.count             7339    10644092        1799069
 bos_taurus.BCM.SHOTGUN.qual.new.count          18208   10644092        4712446
 bos_taurus.NISC.SHOTGUN.count                  246     729652          344932
 bos_taurus.BCCAGSC.CLONEEND.qual.count         1       116484          53585
 bos_taurus.UIUC.CLONEEND.count                 2       114750          46319
 bos_taurus.TIGR.CLONEEND.count                 1       65171           27067
 bos_taurus.CENARGEN.WGS.count                  0       26246           0
 bos_taurus.BARC.CLONEEND.count                 11150   25454           11150
 bos_taurus.BCM.CLONEEND.count                  1       16875           7103
 bos_taurus.CENARGEN.CLONEEND.count             1       16787           6269
 bos_taurus.TIGR_JCVIJTC.CLONEEND.count         2       10651           4803
 bos_taurus.UOKNOR.SHOTGUN.qual.count           1       2456            813
 bos_taurus.WUGSC.COLONEEND.count               1       49              21

Message counts (0quality)

                                                DST     FRG             LKG
 bos_taurus.BCM.WGS.0qual.count                 79      543961          234397
 bos_taurus.GSC.CLONEEND.0qual.count            1       53521           25889
 bos_taurus.UOKNOR.SHOTGUN.0qual.count          1       12195           4097
 bos_taurus.BCCAGSC.CLONEEND.0qual.count        1       8757            2114
 bos_taurus.BCM.SHOTGUN.0qual.count             7339    6367            0
 bos_taurus.UOKNOR.FINISHING.0qual.count        0       151             0

Assembly 1 (Quality reads)

Issues

  1. Uses only quality reads
  2. BCM.SHOTGUN library: ~2.9M mates (4,715,172 - 1,799,069) were missed due to a tarchive2ca crash; some libraries got merged (were assigned the same lib_id)
  3. All reads except for BCM.WGS were set as nonrandom
  4. Update the runCA script to run overlapper concurrently; new "ovlConcurrency" parameter added to the .spec file !!!
  5. consensus after cgw crashed in MultiAlignContig() ... use "consensus -D forceunitigabut" !!!
  6. cgw crashed after updating gkpStore with new lib/mate info => edit Input_CGW.c, remove the assert in line 117

Info

 host: walnut
 
 assembly version: wgs-5.2 stable
 
 dir:  /scratch1/bos_taurus/Assembly/2009_0122_CA 
 
 command: /fs/szdevel/dpuiu/SourceForge/wgs/Linux-amd64/bin/runCA-test -d . -p bt -s bt01.specFile *.frg
 
 spec file:
 cgwDistanceSampleSize   =       1000       # ??? too big; more than 50% of the BCM.SHOTGUN reads are in libraries with less than 1000 inserts
 cnsConcurrency          =       15
 cnsMinFrags             =       200000
 doOverlapTrimming       =       1
 frgCorrBatchSize        =       100000
 frgCorrConcurrency      =       15
 merylMemory             =       24000
 merylThreads            =       15
 obtMerThreshold         =       200
 obtOverlapper           =       ovl
 ovlConcurrency          =       8
 ovlCorrBatchSize        =       100000
 ovlCorrConcurrency      =       15
 ovlHashBlockSize        =       1200000
 ovlMemory               =       8GB --hashload 0.8 --hashstrings 400000
 ovlMerThreshold         =       500
 ovlOverlapper           =       ovl
 ovlRefBlockSize         =       7200000
 ovlThreads              =       2
 unitigger               =       utg
 utgErrorRate            =       0.015
 vectorIntersect         =       bos_taurus.clv
 doExtendClearRanges     =       2          # should be set to 1 to run cgw 1+1=2 times (instead of 3 times)
 cgwOutputIntermediate   =       0          # should be set to 1 to get intermediate .cgw files

Steps

1. Run through initialStoreBuilding, then stop

 runCA-test stopAfter=initialStoreBuilding ...

2. Update gkpStore with nonrandom frg flag

 cat bos_taurus.nonrandom.clv | perl -ane 'print "frg uid $F[0] isnonrandom 1\n";'  > bos_taurus.nonrandom.edit
 gatekeeper -edit bos_taurus.nonrandom.edit bt.gkpStore

3. Restart

 runCA-test ...

Input

 gatekeeper -dumpinfo -lastfragiid bt.gkpStore
 ...
 Last frag in store is iid = 35348776

Trimming

                           elem       <0         0          >0         min        max        mean       median     n50        sum            
 
 CLV5                      35085508   0          3387027    31698481   0          970        25         27         29         891007232
 CLV3                      35164784   0          0          35164784   15         2974       984        980        1000       34612019144
 
 CLR_ORIG5                 35348776   0          43354      35305422   0          1753       42         29         38         1502168205     
 CLR_ORIG3                 35348776   0          0          35348776   70         1894       864        905        927        30547294868    
 
 CLR_OBT5                  35348776   0          26513      35322263   0          1690       49         30         73         1756346429     
 CLR_OBT3                  35348776   0          23477      35325299   0          1813       843        895        914        29824543869


                           elem       <0         0          >0         min        max        mean       median     n50        sum
 ClearORIG                 35348776   4          0          35348772   -1147      1572       821        870        893        29045126663
 
 ClearQLT                  35348776   35348776   0          0          -1         -1         -1         -1         -1         -35348776
 ClearVEC                  35348776   299034     20323      35029419   -1         2043       952        953        975        33658445088
 
 ClearOBTINI               35348776   0          31254      35317522   0          1364       831        879        902        29394688367
 ClearOBT                  35348776   0          31254      35317522   0          1318       794        854        877        28068197440
 
 ClearUTG                  35348776   0          31254      35317522   0          1318       794        854        877        28068197440
 ClearECR1                 35348776   0          31254      35317522   0          1329       794        854        877        28072014464
 ClearECR2                 35348776   0          31254      35317522   0          1329       794        854        877        28072365712


  • sum(ClearECR1)-sum(ClearUTG) = 3,817,024
  • sum(ClearECR2)-sum(ClearECR1)= 351,248

Overlapper

    • 98.33% of the reads (34,761,786 out of 35,348,776 reads) had overlaps
    • 1.66% of the reads had no overlaps
    • 6.68% of the BCCAGSC.CLONEEND reads had no overlaps
    • 4.95% of the TIGR_JCVIJTC.CLONEEND reads had no overlaps
    • 3.48% of the TIGR.CLONEEND reads had no overlaps
    • the median number of overlaps is 16
 sort -nk2 -r  bt.ovlStore.count2 | head
 16      1582324
 17      1561352
 15      1558093
 18      1504595
 14      1494160
 ...
    • the median number of overlaps for the BCM.WGS reads is 16
    • the median number of overlaps for the BCM.SHOTGUN reads is 16 !!!
    • the median number of overlaps for the NISC.SHOTGUN reads is 40 !!!
    • the median number of overlaps for the BCM.CLONEEND reads is 16 !!!

Media:Bt.ovlStore.big.png , Media:Bt.ovlStore.small.png

Unitigger

more 4-unitigger/bt.cga.0
UNITIG OVERLAP GRAPH INFORMATION
       5208738 : Total number of unitigs
       2527051 : Total number of singleton, contained unitigs
       1814842 : Total number of singleton, non-contained unitigs
        180910 : Total number of non-singleton, spanned unitigs
        685935 : Total number of non-singleton, non-spanned unitigs
      34927397 : Total number of fragments
      34927397 : Total number of fragments in all unitigs
      21521581 : Total number of essential fragments in all unitigs
      13405816 : Total number of contained fragments in all unitigs
  0.0076239952 : Randomly sampled fragment arrival rate per bp
    2510896132 : The sum of overhangs in all the unitigs
    6400342737 : Total number of bases in all unitigs
             0 : Estimated number of base pairs in the genome.
             0 : Total number of contained fragments not connected
                 by containment edges to essential fragments.
 Total rho    = 2510896132
 Total nfrags = 19143061
 Estimated genome length = 0
 Estimated global_fragment_arrival_rate=0.007624
 Computed global_fragment_arrival_rate =0.007624
 Total number of randomly sampled fragments in genome = 23326293
 Computed genome length  = 3059589120.000000
 Used global_fragment_arrival_rate=0.007624
 Used global_fragment_arrival_distance=131.164826
 Histogram of the number of base pairs in a chunk
 100292 - 159434:    22 
 90010 -  99906:     25 
 80043 -  89676:     73 
 70013 -  79966:    162 
 60010 -  69988:    389 
 50008 -  59983:    977 
 40000 -  49998:   2434 
 30000 -  39997:   6458 
 20000 -  29999:  18957 
 10000 -  19999:  57442
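
The arrival-rate and genome-length figures in the unitigger log above appear to follow from three of the logged totals: rate = nfrags / rho, distance = 1 / rate, genome length = sampled fragments / rate. A minimal sketch that re-derives them; the function name `arrival_stats` is made up for illustration, and the formulas are inferred from the logged numbers rather than taken from the unitigger source:

```shell
#!/bin/sh
# Hypothetical helper: arrival_stats RHO NFRAGS SAMPLED
#   RHO     = "Total rho" (sum of overhangs)
#   NFRAGS  = "Total nfrags"
#   SAMPLED = "Total number of randomly sampled fragments in genome"
arrival_stats() {
    awk -v rho="$1" -v n="$2" -v s="$3" 'BEGIN {
        rate = n / rho                        # fragments per bp
        printf "arrival_rate      %.6f\n", rate
        printf "arrival_distance  %.2f\n", 1 / rate
        printf "genome_length_Mb  %.0f\n", s / rate / 1e6
    }'
}

# Values copied from 4-unitigger/bt.cga.0
arrival_stats 2510896132 19143061 23326293
```

The result agrees with the logged rate (0.007624) and distance (131.16), and with the logged genome length to within rounding (about 3,060 Mb).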
 Unitigs >=10kb
             NewAsm          UMd2Asm
 
 Number       86,939          57,204
 Mean         19,464          15,140
 Sum         1,692.1Mb       866.0Mb
 max         159,434bp      78,570bp
 Contigs >=10Kb:
           NewAsm          UMd2Asm
 n         42,343           45,958      
 mean      59,856           55,473
 sum        2,534.5Mb        2,549.4Mb
 Contigs >=100Kb: 
           NewAsm          UMd2Asm
 n          7,051            6,683         
 mean     163,170          162,357    
 sum        1,150.5Mb        1,085.0Mb
 max      627,705          742,802
 Scaffolds >=10Mb:
           NewAsm          UMd2Asm
 n             30                3
 mean       14.10Mb          11.36Mb
 sum       422.95Mb         340.70Mb
 max        26.54Mb          13.36Mb

QC stats

 TotalScaffolds=66,141
 MaxBasesInScaffolds=26,048,998
 MeanBasesInScaffolds=40,861
 
 TotalContigsInScaffolds=120,461
 MaxContigLength=627,911
 MeanContigLength=22,436
 
 TotalDegenContigs=269,031
 MaxDegenContig=33,824
 
 SingletonReads=3,721,123

Analysis

Insert libraries

1. BCM.WGS : ok

  • FRG.mea: 1750-7000
  • ASM.mea: 1594-6727
  • Most libs have > 1000 reads & get reestimated
  • All libs have ASM.std< ASM.mea/3

2. BCM.SHOTGUN

  • only ~ 50% of the inserts are in libs with >1000 inserts and get reestimated by the assembly
  • if the threshold is dropped from 1000 to 100, we'd get ~ 95% of the inserts reestimated
            elem       <0         0          >0         min        max        mean       median     n50        sum
 0          7339       0          0          7339       1          11237      245        135        1137       1799069
 100        4361       0          0          4361       100        11237      395        157        1252       1725604
 1000       440        0          0          440        1008       11237      2075       1791       2323       913086
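
The ~50% and ~95% figures follow from the "sum" column: inserts in libraries at or above the threshold, divided by all 1,799,069 inserts. A minimal sketch, assuming a hypothetical input with one per-library insert count per line (the function name `frac_reestimated` is made up here):

```shell
#!/bin/sh
# Hypothetical helper: frac_reestimated THRESHOLD < counts
# Reads one per-library insert count per line and prints the percentage
# of all inserts that sit in libraries with at least THRESHOLD inserts.
frac_reestimated() {
    awk -v t="$1" '
        { total += $1; if ($1 >= t) kept += $1 }
        END { if (total > 0) printf "%.1f\n", 100 * kept / total }
    '
}

# Toy input: three libraries with 10, 100 and 1000 inserts.
printf '10\n100\n1000\n' | frac_reestimated 100

# Same ratio computed directly from the sums in the table above
# (threshold 1000, then threshold 100).
awk 'BEGIN { printf "%.1f %.1f\n", 100 * 913086 / 1799069, 100 * 1725604 / 1799069 }'
```

Since cgwDistanceSampleSize is the cutoff here, the same function can be rerun with other thresholds before picking a value for the spec file.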

3. NISC.SHOTGUN: ok

  • Most libs have > 1000 reads & get reestimated
  • All libs have ASM.std< ASM.mea/3

4. BCCAGSC.CLONEEND: ok

 LIB.id  FRG.mea FRG.std FRG.count  CENTER.TYPE        ASM.mea ASM.std
 125606  150000  30000   59505      BCCAGSC.CLONEEND   161998  20133

5. UIUC.CLONEEND: ok

 LIB.id   FRG.mea FRG.std FRG.count CENTER.TYPE        ASM.mea ASM.std
 114892   150000  30000   31063     UIUC.CLONEEND      175594  41208
 115020   150000  30000   15256     UIUC.CLONEEND      162488  26358

6. TIGR.CLONEEND: originally wrong; gets reestimated

 LIB.id   FRG.mea FRG.std FRG.count CENTER.TYPE        ASM.mea ASM.std
 65177    2000    600     27067     TIGR.CLONEEND      161761  34938

7. GSC.CLONEEND: not used (all 53556 are 0 qual)

8. CENARGEN.WGS: "not used" (all 26246 are unmated)

9. BARC.CLONEEND: each library contains 1 template id => inserts did not get reestimated (25454 reads/11151 inserts)

10. BCM.CLONEEND: ok

 LIB.id   FRG.mea FRG.std FRG.count CENTER.TYPE        ASM.mea ASM.std
 19070    167000  25000   7103      BCM.CLONEEND       171244  18555

11. CENARGEN.CLONEEND: large stdev

 LIB.id   FRG.mea FRG.std FRG.count CENTER.TYPE        ASM.mea ASM.std
 17249    202000  20200   6269      CENARGEN.CLONEEND  158938  55165

12. UOKNOR.SHOTGUN: ok ?

 LIB.id   FRG.mea FRG.std FRG.count CENTER.TYPE        ASM.mea ASM.std
 15158    3000    1000    4910      UOKNOR.SHOTGUN     3000    1000

13. TIGR_JCVI.CLONEEND: originally wrong; gets reestimated

 LIB.id   FRG.mea FRG.std FRG.count CENTER.TYPE        ASM.mea ASM.std
 10691    2500    750     2763      TIGR_JCVI.CLONEEND 160363  29580
 10738    2500    750     2040      TIGR_JCVI.CLONEEND 161915  29343

14. UOKNOR.FINISHING: only 151 reads

15. WUGSC.CLONEEND: only 49 reads

Contigs Vs UMD2 contaminants & Ecoli

   4865 contigs in list.exclude_contigs.fa
 
  34404 exclude-ctg.qry_hits
   3763 exclude-ctg.ref_hits
 
   1204 exclude-ctg.CBE.qry_hits   CONTAIN|IDENTITY|BEGIN|END
    748 exclude-ctg.CBE.ref_hits   CONTAIN|IDENTITY|BEGIN|END
    559 Ecoli.365350-365744-ctg.qry_hits : max ctg aligned is 179K bp; 10 are > 10K bp

Top 100 contigs Vs UMD2 contigs

Assembly 2 (Quality reads)

  • Try to add the missing BCM.SHOTGUN reads to the assembly
  • Assign new BCM.SHOTGUN library IDs based on volume & SEQ_LIB_ID: the same library might have a different insert size in different volumes => might lose some correct mates from different volumes
 cat bos_taurus.summary | grep BCM | grep SHOTG | cut -f6,7,8,10 | sort | more
 FAAEP   180000  13000   252
 FAAEP   2000    1000    84
 ...
 FAAHP   180000  13000   77
 FAAHP   2000    1000    230
 ...
  • => 20,538 libraries out of which 18,208 contain mated reads
  • create DST messages & add them to gkpStore
 gatekeeper -a -o bt.gkpStore -T -F  bos_taurus.BCM.SHOTGUN.new.DST
  • generate gatekeeper edit file that maps each TI to the new library id
 head bos_taurus.BCM.SHOTGUN.new.ti2libinfo.edit
   frg uid 499507131 libuid 601081 
   frg uid 499507132 libuid 601081
   ...
  • generate gatekeeper edit file that deletes all mate information
 head bos_taurus.BCM.SHOTGUN.new.mate.delete
   frg uid 500086180 mateuid 0
   frg uid 500084310 mateuid 0
   ...
  • pair forward/reverse reads that have the same new library id and the same TEMPLATE_ID
 head bos_taurus.BCM.SHOTGUN.new.mate.edit
   frg uid 583866821 mateuid 583872364
   frg uid 583866822 mateuid 583872408
   ...
  • run gatekeeper --edit for each edit/delete file
 gatekeeper --edit ...  bt.gkpStore 
  • restart assembly at cgw (doExtendClearRanges=1)
  • consensus after cgw failed on job 25 on CTG 5597062 : cannot create consensus from multialignment ...
 Fix: delete failed message
 cp bt.cgw_contigs.25 bt.cgw_contigs.25.FAILED
 delete "{ICM acc:5597062 pla:P len:20889 ..." from bt.cgw_contigs.25
  • terminator failed; message:
 ICL: reference before definition error for contig ID 5597062
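
The re-pairing step in the list above (same new library id, same TEMPLATE_ID, one forward and one reverse read) can be sketched as follows. This is not the script that produced bos_taurus.BCM.SHOTGUN.new.mate.edit; the input layout (tab-separated TI, new library id, TEMPLATE_ID, TRACE_END) and the name `pair_mates` are assumptions made for illustration:

```shell
#!/bin/sh
# Hypothetical helper: pair_mates < reads.tsv
# Assumed columns (tab-separated): TI, new library id, TEMPLATE_ID, TRACE_END (F or R).
# Emits a gatekeeper edit line only when a (library, template) pair has
# exactly one forward and exactly one reverse read.
pair_mates() {
    awk -F'\t' '
        { key = $2 SUBSEP $3 }
        $4 == "F" { nf[key]++; fwd[key] = $1 }
        $4 == "R" { nr[key]++; rev[key] = $1 }
        END {
            for (key in nf)
                if (nf[key] == 1 && nr[key] == 1)
                    printf "frg uid %s mateuid %s\n", fwd[key], rev[key]
        }
    ' | sort
}

# Toy input: T1 pairs cleanly; T2 is split across two libraries; T3 has two reverse reads.
printf '1\tL1\tT1\tF\n2\tL1\tT1\tR\n3\tL1\tT2\tF\n4\tL2\tT2\tR\n5\tL2\tT3\tF\n6\tL2\tT3\tR\n7\tL2\tT3\tR\n' | pair_mates
```

Skipping ambiguous templates is a conservative choice in this sketch; it also shows how splitting libraries by volume drops mates whose two ends landed in different volumes.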

Assembly 3 (All reads)

  • Try to identify trimming points for the reads without quality values, based on their alignments to Assembly 1
  • Split the 650K reads into 13 sets of up to 50K reads each
 bos_taurus.0qual.01.seq:50000
 ...
 bos_taurus.0qual.13.seq:24952
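
A quick check of the set sizes: the FRG counts in the 0quality message-count table add up to 624,952 reads, which gives 12 full sets of 50,000 plus a final set of 24,952. A minimal sketch; `chunk_sizes` is a made-up helper name:

```shell
#!/bin/sh
# Hypothetical helper: chunk_sizes TOTAL SIZE
# Prints "set_number reads_in_set" for consecutive sets of at most SIZE reads.
chunk_sizes() {
    awk -v total="$1" -v size="$2" 'BEGIN {
        n = 0
        while (total > 0) {
            n++
            this = (total < size) ? total : size
            printf "%02d %d\n", n, this
            total -= this
        }
    }'
}

# 624952 = 543961 + 53521 + 12195 + 8757 + 6367 + 151 (0quality FRG counts)
chunk_sizes 624952 50000 | tail -n 1
```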

Nucmer alignments

1. Launch jobs in parallel: 12766 jobs on 13 processors

 nucmer -l 50 -c 200 -b 10 -g 5 -d 0.05 bt.ctg.001.fasta  bos_taurus.0qual.01.seq -p ctg.001-seq.01
 ...
 nucmer -l 50 -c 200 -b 10 -g 5 -d 0.05 bt.ctg.982.fasta  bos_taurus.0qual.13.seq -p ctg.982-seq.13
  • CPU usage: 100% /job
  • Max mem usage: 0.1% /job
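
The 12766 jobs are the 982 contig files crossed with the 13 read sets (982 x 13). A minimal sketch that writes out the full command list, reusing the nucmer options from the commands above; the function name `nucmer_jobs` is made up, and the file naming simply follows the pattern shown:

```shell
#!/bin/sh
# Hypothetical helper: nucmer_jobs NUM_CONTIG_FILES NUM_READ_SETS
# Prints one nucmer command per (contig file, read set) pair.
nucmer_jobs() {
    awk -v nctg="$1" -v nseq="$2" 'BEGIN {
        for (c = 1; c <= nctg; c++)
            for (s = 1; s <= nseq; s++)
                printf "nucmer -l 50 -c 200 -b 10 -g 5 -d 0.05 bt.ctg.%03d.fasta bos_taurus.0qual.%02d.seq -p ctg.%03d-seq.%02d\n", c, s, c, s
    }'
}

# Count the jobs and show the last one.
nucmer_jobs 982 13 | wc -l
nucmer_jobs 982 13 | tail -n 1

# One possible way to keep 13 jobs running at a time
# (assumes an xargs that supports -P, which is not POSIX):
# nucmer_jobs 982 13 | xargs -P 13 -I{} sh -c '{}'
```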

2. Get clrs

 cat *delta | ~/bin/delta2qryClr.pl -best | sort > bos_taurus.0qual.clr

3. Reads without clrs: set their clr to 50..min(read length, 600)

 difference.pl bos_taurus.0qual.infoseq bos_taurus.0qual.clr | perl -ane '$three=600; $three=$F[1] if ($F[1]<600); print "$F[0] 50 $three\n";' > bos_taurus.0qual.clr.tmp
 cat bos_taurus.0qual.clr.tmp >> bos_taurus.0qual.clr
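
An awk equivalent of the fallback rule in the perl one-liner above, as a minimal sketch. It assumes the input has the read id in column 1 and the read length in column 2; the function name `fallback_clr` and its optional arguments are made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: fallback_clr [FIVE_PRIME] [MAX_THREE_PRIME] < id_and_length
# For each "id length" line, prints "id FIVE_PRIME min(length, MAX_THREE_PRIME)".
# Defaults match the one-liner above: 50 and 600.
fallback_clr() {
    awk -v five="${1:-50}" -v max3="${2:-600}" '
        { three = ($2 < max3) ? $2 : max3; print $1, five, three }
    '
}

# Toy input: a long read is capped at 600, a short read keeps its own length.
printf 'r1 984\nr2 420\n' | fallback_clr
```

Neither the original one-liner nor this sketch guards against reads shorter than 50 bases, which would get a 5' clip point beyond their 3' one.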