Bos taurus redo

From Cbcb
Revision as of 19:03, 30 March 2009 by Dpuiu (talk | contribs) (→‎Markers)
Jump to navigation Jump to search

BCM

NCBI Data

  • Avg LEN=984
  • Avg CLIP (CLB intersect CLV)=760
  • Avg CLV=997 > Avg LEN ???
  • Avg QUAL=38.96 (27.51 for the 2.59M reads not in the UMD assembly)
  • Avg UMDoverlapper CLIP=778

Problems:

  • 0 QUAL reads 650,133
  • the quality lines in several qual. files start with space; need to remove it otherwise tarchive2ca errors out saying that the len(quality)=len(seq)+1
  • several xml contained the "&" character => XML parser error
  • xml.bos_taurus.087 contained 2 trace_volumes => XML parser error
  • BCCAGSC.CLONEEND : all reads have LIBRARY_ID=CH240, SEQ_LIB_ID=. ; the INSERT_SIZE & INSERT_STDEV vary within the library: set to 150,000 & 30,000
  • UIUC.CLONEEND: INSERT_SIZE & INSERT_STDEV missing: set to 150,000 & 30,000

CENTER_NAME counts

    COUNT           CENTER_NAME     
 1  35629020        BCM             Baylor College of Medicine
 2  737900          NISC            NIH Intramural Sequencing Center
 3  652614          BCCAGSC         British Columbia Cancer Agency Genome Sciences Centre                           # TA query_tracedb CENTER_NAME = "BCCAGSC" => 652,510 
 4  378871          MARC            USDA, ARS, US Meat Animal Research Center
 5  114753          UIUC            University of Illinois at Urbana-Champaign                                      # TA query_tracedb CENTER_NAME = "UIUC" => 106,368
 6  107367          BARC            USDA, ARS, Beltsville Agricultural Research Center
 7  65171           TIGR            The Institute for Genome Research
 8  53556           GSC             Genoscope
 9  43033           CENARGEN        Embrapa Genetic Resources and Biotechnology
 10 18623           SC              The Sanger Center
 11 15301           UOKNOR          University of Oklahoma Norman Campus, Advanced Center for Genome Technology
 12 10651           TIGR_JCVIJTC    The Institute for Genomic Research, Traces generated at JCVIJTC                 # TA query_tracedb CENTER_NAME="JCVI"
 13 2485            UIACBCB         University of Iowa Center for Bioinformatics and Computation Biology (UIACBCB)
 14 49              WUGSC           Washington University, Genome Sequencing Center                                 # TA query_tracedb CENTER_NAME = "WUGSC" => 9
    37829394        total           total                                                                           # TA query_tracedb SPECIES_CODE = "BOS TAURUS" => 37,788,710

TRACE_TYPE_CODE counts

    COUNT         CENTER_NAME     TRACE_TYPE_CODE        
 1  24863599      BCM*            WGS                    SEQ_LIB_ID:89
 2  10748529      BCM*            SHOTGUN                SEQ_LIB_ID:15543
 3  737900        NISC            SHOTGUN                SEQ_LIB_ID:247
 4  125597        BCCAGSC         CLONEEND               LIBRARY_ID:1         large insert size; some qualityless; !!! almost all have CLIP3=0
 5  114753        UIUC            CLONEEND               LIBRARY_ID:2         insert size missing , no frequent kmers
 6  65171         TIGR            CLONEEND               SEQ_LIB_ID:1         2K & use TRACE_DIRECTION instead of TRACE_END
 7  53556         GSC             CLONEEND               SEQ_LIB_ID:1         large insert size; !!! all have qual=0 and were excluded 
 8  26246         CENARGEN        WGS                    .                    no LIBRARY_ID; no SEQ_LIB_ID; no INSERT_SIZE; no INSERT_STDEV; reads have no direction; ~21954 could be paired (same TEMPLATE_ID)
 9  25454         BARC            CLONEEND               SEQ_LIB_ID:14304     !!! all have CLIP3=0
 10 16892         BCM*            CLONEEND               LIBRARY_ID:1         VBBAA   mea=167000  std=25000
 11 16787         CENARGEN        CLONEEND               LIBRARY_ID:1         
 12 15150         UOKNOR          SHOTGUN                LIBRARY_ID:1         some qualityless
 13 10651         TIGR_JCVIJTC    CLONEEND               SEQ_LIB_ID:2
 14 151           UOKNOR          FINISHING              LIBRARY_ID:1         some qualityless, no direction(TRACE_END=N); no INSERT_SIZE; no INSERT_STDEV
 15 49            WUGSC           CLONEEND               SEQ_LIB_ID:1 
    36820485      total

 16 527017        BCCAGSC         EST
 17 207204        MARC            EST
 18 171667        MARC            PCR
 19 81913         BARC            EST
 20 18623         SC              EST 
 21 2485          UIACBCB         EST
    1008909       total

STRATEGY & TRACE_TYPE_CODE counts

 COUNT           CENTER_NAME     STRATEGY        TRACE_TYPE_CODE
 12545304        BCM             .               WGS
 11425910        BCM             WGA             WGS
 5223683         BCM             CLONE           SHOTGUN
 4479883         BCM             POOLCLONE       SHOTGUN
 1044963         BCM             .               SHOTGUN
 892385          BCM             SNP             WGS
 737900          NISC            CLONE           SHOTGUN
 125597          BCCAGSC         CLONEEND        CLONEEND
 114753          UIUC            CLONEEND        CLONEEND 
 65171           TIGR            CLONEEND        CLONEEND
 53556           GSC             CLONEEND        CLONEEND
 26246           CENARGEN        .               WGS
 25454           BARC            .               CLONEEND
 16892           BCM             CLONEEND        CLONEEND
 16787           CENARGEN        CLONEEND        CLONEEND
 12195           UOKNOR          .               SHOTGUN
 10651           TIGR_JCVIJTC    CLONEEND        CLONEEND
 2955            UOKNOR          CLONE           SHOTGUN
 151             UOKNOR          .               FINISHING
 49              WUGSC           CLONEEND        CLONEEND
 527017          BCCAGSC         EST             EST
 145820          MARC            EST             EST
 117958          MARC            COMPARATIVE     PCR
 81913           BARC            EST             EST
 61384           MARC            CLONE           EST
 53709           MARC            Re-Sequencing   PCR
 18623           SC              EST             EST
 2485            UIACBCB         .               EST

BCM.SHOTGUN libraries

  • The long inserts are probably wrong !!!
 SIZE    STDEV   COUNT
 3500    1500    4502569
 2000    1000    3244493
 3000    1000    1021577
 180000  1000    840528
 6500    1500    429026
 180000  13000   320208
 6000    2000    208192
 167000  13000   96337
 3500    15000   85599
 SIZE    COUNT
 3500    4588168
 2000    3244493
 180000  1160736
 3000    1021577
 6500    429026
 6000    208192
 167000  96337

3' VECTOR TRIMMED counts

    CENTER_NAME     TRACE_TYPE_CODE TOTAL           3'CLV<LEN   QUAL==0          UMD.FRG
 1  BCM             WGS             24863599        10968979    551114           24050767
 2  BCM             SHOTGUN         10748529        5052692     23419            10068499
 3  NISC            SHOTGUN         737900          28972       0                735488
 4  BCCAGSC         CLONEEND        125597          125484      8926             113790
 5  UIUC            CLONEEND        114753          90243       0                106247
 6  TIGR            CLONEEND        65171           46389       0                64903
 7  GSC             CLONEEND        53556           53556       53556 (all)      0           !!! all have 0 quals and were excluded
 8  CENARGEN        WGS             26246           26246       0                25976
 9  BARC            CLONEEND        25454           25454       0                25387
 10 BCM             CLONEEND        16892           6751        0                16863
 11 CENARGEN        CLONEEND        16787           16787       0                16628
 12 UOKNOR          SHOTGUN         15150           2885        12195            0
 13 TIGR_JCVIJTC    CLONEEND        10651           339         0                10644
 14 UOKNOR          FINISHING       151             0           151              151
 15 WUGSC           CLONEEND        49              0           0                0

 16 BCCAGSC         EST             527017          524173      772              0
 17 MARC            EST             207204          207204      0                0
 18 MARC            PCR             171667          171667      0                0
 19 BARC            EST             81913           78597       0                0
 20 SC              EST             18623           7350        0                0
 21 UIACBCB         EST             2485            2485        0                0

ZERO QUALITY COUNTS

  • Counts
 CENTER_NAME     TRACE_TYPE_CODE  COUNT
 BCM             WGS              551114
 GSC             CLONEEND         53556
 BCM             SHOTGUN          23419
 UOKNOR          SHOTGUN          12195
 BCCAGSC         CLONEEND         8926
 BCCAGSC         EST              772
 UOKNOR          FINISHING        151
 TOTAL                            650134 
  • For 0 quality reads, assign quality 20 to bases 1..700, 0 to bases 701..
  • Volumes 026..039 have been fixed

Local Data

Files & Dirs

 /fs/szasmg3/bos_taurus/data/
 /fs/szasmg2/Drosophila/D_pseudoobscura/Vectors
 /nfshomes/dpuiu/db/UniVec

Software

Figaro

  • trims vector only at 5' end
  • call lucy trimming for qualities

Lucy

  • both vector sequence and splice sites are required

Atlas

  • web site
  • atlas-screen-trim-file : "calls cross_match and atlas-screen-window to create trimmed reads file (scan in from each end of read looking for 50-base windows of high quality and no vector); "

Contaminant search

nucmer reads CLIPPING range to UniVec & EcoliK12

UniVec

Ref

                 #seqs   min     max     mean    median  n50     sum
 UniVec          2861    12      48551   231     99      781     660,151
 UniVec_Core     1348    12      48551   243     98      967     327,641

Hits: alignment length

 bp      #reads  min     max     mean    median  n50     sum
 19      4548466 19      1045    28.37   23      27      129025025
 20      3684852 20      1045    30.56   25      28      112616359
 30      1097357 30      1045    48.04   38      43      52714583
 40      484661  40      1045    66.36   47      53      32163896
 100     54334   100     1045    198     116     223     10772815        # many are ESTs

Ecoli

Ref:

 K12 4,639,675 bp

Hits: alignment length

 bp      #reads  min     max     mean    median  n50     sum
 19      275109  19      1223    30.66   19      20      8435470
 20      102550  20      1223    50.29   21      161     5156849
 30      19032   30      1223    178     37      706     3381214
 40      9234    40      1223    329     171     738     3034293
 100     6781    100     1223    424     223     749     2876432
 200     4378    200     1223    575     696     771     2516916       

BCM vectors

                 #seqs   min     max     mean    median  n50     sum
 BCM             14      2580    33180   9379    5821    32705   131312

Vector/Splice site search

Strategy

  • 1. Select all the reads in the same volume that belong to one particular library; same CENTER_NAME, STRATEGY & TRACE_TYPE_CODE
  • 2. Get the quality clipping trim: CLIP_QUALITY_LEFT & CLIP_QUALITY_RIGHT
  • 3. Separate reads in 2 sets according to direction TRACE_END: FORWARD & REVERSE
  • 4. Get the most frequent kmers in each set (24 & 8 bp)
  • 5. Check if the most frequent kmers are overrepresented
  • 6. Check if the most frequent 8mers are present in the most frequent 24mers
  • 7. Try to extend the 24mers by a few bp => linkers
  • 8. Align linkers to the opposite stand sequences using nucmer
  • 9. Extract the subsequences adjacent(following) to linker (50..150bp)
  • 10. Align the subsequences; if they align we've probably identified the vector
  • 11. Identify the vector name/id by alignment to UniVec => several alignments
  • 12. Check if the forward/reverse vector(s) are the same : we should find a common vector sequence; the UniVec alignments should be adjacent
  • 13. create the Lucy vector & splice files; the splice contains the linker+vector
  • 14. run lucy & trim input reads according to Lucy clr
  • 15. align lucy trimmed reads to linker,vector,splice & UniVec.dust
  • 16. align input reads to linker,vector,splice & UniVec.dust
  • 17. compare the 15. & 16. counts

Example

  • 1. volume 011 : 500,000 reads CENTER_NAME=BCM, TRACE_TYPE_CODE=WGS
  • 2.
  • 3. 249,611 TRACE_END=F & 250,389 TRACE_END=R
  • 4. kmers: 8 8bp most frequent kmers are shared by the FORWARD & REVERSE strands ; no 24bp kmers are shared
 ==> 24.fwd/kmers.tab <==
 AGTTCGACTGCAAGTAGTTCATCA      TGATGAACTACTTGCAGTCGAACT        2463 # contains AGTAGTTC
 GAGTTCGACTGCAAGTAGTTCATC      GATGAACTACTTGCAGTCGAACTC        2189
 CGAGTTCGACTGCAAGTAGTTCAT      ATGAACTACTTGCAGTCGAACTCG        1996
 TCGAGTTCGACTGCAAGTAGTTCA      TGAACTACTTGCAGTCGAACTCGA        1593
 GTTCGACTGCAAGTAGTTCATCAA      TTGATGAACTACTTGCAGTCGAAC        1023
 GAGTTCGACTGCAGTAGTTCATCA      TGATGAACTACTGCAGTCGAACTC        812
 CGAGTTCGACTGCAGTAGTTCATC      GATGAACTACTGCAGTCGAACTCG        777
 GTTCGACTGCAAGTAGTTCATCAT      ATGATGAACTACTTGCAGTCGAAC        769
 TCGAGTTCGACTGCAGTAGTTCAT      ATGAACTACTGCAGTCGAACTCGA        637
 ATCGAGTTCGACTGCAAGTAGTTC      GAACTACTTGCAGTCGAACTCGAT        594
 
 ==> 08.fwd/kmers.tab <==
 AGTAGTTC      GAACTACT        86477
 CAGTAGTT      AACTACTG        67681
 AGTTCTCA      TGAGAACT        61556
 TAGTTCTC      GAGAACTA        60964
 GTAGTTCT      AGAACTAC        57866
 AGTTCATC      GATGAACT        49676
 TAGTTCAT      ATGAACTA        45298
 GTTCATCA      TGATGAAC        42117
 GCAGTAGT      ACTACTGC        41391
 GTAGTTCA      TGAACTAC        40694
 
 ==> 24.rev/kmers.tab <==
 TATCGATGGTACAGTAGTTCATCA      TGATGAACTACTGTACCATCGATA        999 # contains AGTAGTTC
 CTATCGATGGTACAGTAGTTCATC      GATGAACTACTGTACCATCGATAG        774
 GCTATCGATGGTACAGTAGTTCAT      ATGAACTACTGTACCATCGATAGC        600
 CGCTATCGATGGTACAGTAGTTCA      TGAACTACTGTACCATCGATAGCG        432
 ATCGATGGTACAGTAGTTCATCAT      ATGATGAACTACTGTACCATCGAT        417
 ATCGATGGTACAGTAGTTCATCAA      TTGATGAACTACTGTACCATCGAT        380
 ATCAGATGGTACAGTAGTTCATCA      TGATGAACTACTGTACCATCTGAT        373
 ATCGATGGTACAGTAGTTCATCAC      GTGATGAACTACTGTACCATCGAT        265
 CTATCGATGGTAAGTAGTTCATCA      TGATGAACTACTTACCATCGATAG        235
 TCAGATGGTACAGTAGTTCATCAA      TTGATGAACTACTGTACCATCTGA        224
 
 ==> 08.rev/kmers.tab <==
 AGTTCATC      GATGAACT        85127
 TAGTTCAT      ATGAACTA        77902
 GTTCATCA      TGATGAAC        75585
 TAGTTCTC      GAGAACTA        68057
 AGTTCTCA      TGAGAACT        67277
 GTAGTTCT      AGAACTAC        64894
 GTAGTTCA      TGAACTAC        62607
 CGTAGTTC      GAACTACG        52031
 AGTAGTTC      GAACTACT        51013
 ACGTAGTT      AACTACGT        31552
  • 7. Get linker sequences
 >linker.fwd 27bp
 TCGAGTTCGACTGCAAGTAGTTCATCA
 >linker.rev 27bp
 CTAATCAGATGGTACAGTAGTTCATCA 
  
 #>linker.rev 40 bp Art's  (13 more bp at 5')        
 #TATGACCATGCGCCTAATCAGATGGTACAGTAGTTCATCA
 #GCTATCGATGGTACAGTAGTTCATCAT is the most frequent rev seq 27 kmers but not the linker (few snp differences)
  • 8 & 9 Align reads to linkers using nucmer

Fwd:

 nucmer -l 12 -c 24 -r linker.fwd.seq ../bos_taurus.$v.r.fasta 
 #  nucmer -l 12 -c 24 -r kmers.seq ../bos_taurus.$v.r.fasta  
 show-coords out.delta | awk '{print $19,$5,$13}' > ! out.clr
 extractfromfastanames.pl -clr -f out.clr < ../bos_taurus.$v.r.fasta >! out.seq
 

Rev:

 nucmer -l 12 -c 24 -r linker.rev.seq ../bos_taurus.$v.f.fasta
 #  nucmer -l 12 -c 24 -r kmers.seq ../bos_taurus.$v.f.fasta  
 show-coords out.delta | awk '{print $19,$5,$13}' > ! out.clr
 extractfromfastanames.pl -clr -f out.clr < ../bos_taurus.$v.f.fasta >! out.seq
 

Both:

 clrFasta out.seq >! out.cseq
 fasta2tab.pl out.cseq | sort -k2 > ! out.tab
 nucmer -c 40 out.cseq ~/db/UniVec -p vector
 delta-filter -q vector.delta >! vector.filter-q.delta
 show-coords vector.filter-q.delta | sort -n | head
 cat vector.filter-q.delta | grep "^>" | count.pl -c 1 -m 2
 
  • 10. Extract "vector reads"
 >399553028  # 24.fwd     
 TGATGAACTACTGTACCATCTGATTAGGCGCATGGTCATAGCTGTTTCCTGTGTGAAATT
 GCTATCCGCTCACAATTCCACACAACATACGAGCCGGAAGCATAAAGTGTAAAGCCTGGG
 GTGTCAAATGAGAGACCTAACTCACATTCAACTTTTTTTTTTTTTCTGCCCTCTATTCTA
 ...
 >400269118 #24.rev
 TGATGAACTACTTGCAGTCGAAATCGAATCATCACTGGCCGTCCTTTTACAACGTCGTGA
 CTGGGAAAACCCTGGCGTTACCCAACTTAATCCGCCTTGCAGCACATCCCCCTTTCCCCC
 AGCTGGCGTAAAAACGTAAAAAGCCCCGCACCGATCGCCCTTTCCCAACAGGTTGCCCAG
  • 11. Align "vector reads" to UniVec; identify vector
 show-coords 24.fwd/400269118-UniVec.delta 24.rev/399553028-UniVec.delta | grep J01636.1
     31  148  | 1175 1292  | 118   118  |  95.76  |     1276     7477  |     9.25     1.58  | 399553028.rev gnl|uv|J01636.1:1-7477
     32  199  | 1302 1463  | 168   162  |  90.48  |      653     7477  |    25.73     2.17  | 400269118     gnl|uv|J01636.1:1-7477
  • 12. 10bp distance between the 2 alignments
  • 13. Lucy files
 $ more vector.seq
   >J01636 E.coli lactose operon with lacI, lacZ, lacY and lacA genes
   GACACCATCGAATGGCGCAAAACCTTTCGCGGTATGGCATGATAGCGCCCGGAAGAGAGTCAATTCAGGG
   TGGTGAATGTGAAACCAGTAACGTTATACGATGTCGCAGAGTATGCCGGTGTCTCTTATCAGACCGTTTC
   CCGCGTGGTGAACCAGGCCAGCCACGTTTCTGCGAAAACGCGGGAAAAAGTGGAAGCGGCGATGGCGGAG
   CTGAATTACATTCCCAACCGCGTGGCACAACAACTGGCGGGCAAACAGTCGTTGCTGATTGGCGTTGCCA
   ...
 
 $ more splice.seq
   >J01636.for.begin vector+linker.rev
   TGAATGTGAGTTAGGTCTCTCATTTGACACCCCAGGCTTTACACTTTATGCTTCCGGCTC
   GTATGTTGTGTGGAATTGTGAGCGGATAGCAATTTCACACAGGAAACAGCTATGACCATG
   CGCCTAATCAGATGGTACAGTAGTTCATCA
   >J01636.for.end  rev(linker.fwd)+vector 
   TGATGAACTACTTGCAGTCGAAATCGAATCATCACTGGCCGTCCTTTTACAACGTCGTGA
   CTGGGAAAACCCTGGCGTTACCCAACTTAATCCGCCTTGCAGCACATCCCCCTTTCCCCC
   AGCTGGCGTAAAAACGTAAAAAGCCCCGCA
   >J01636.rev.begin (revcomp of J01636.for.end)
   TGCGGGGCTTTTTACGTTTTTACGCCAGCTGGGGGAAAGGGGGATGTGCTGCAAGGCGGA
   TTAAGTTGGGTAACGCCAGGGTTTTCCCAGTCACGACGTTGTAAAAGGACGGCCAGTGAT
   GATTCGATTTCGACTGCAAGTAGTTCATCA
   >J01636.rev.end (revcomp of J01636.for.begin)
   TGATGAACTACTGTACCATCTGATTAGGCGCATGGTCATAGCTGTTTCCTGTGTGAAATT
   GCTATCCGCTCACAATTCCACACAACATACGAGCCGGAAGCATAAAGTGTAAAGCCTGGG
   GTGTCAAATGAGAGACCTAACTCACATTCA
 # splice=linker+vector  
      3      120  |     1175     1292  |      118      118  |    95.76  |      150     7477  |    78.67     1.58  | J01636.for.begin   J01636
     32      131  |     1302     1399  |      100       98  |    96.00  |      150     7477  |    66.67     1.31  | J01636.for.end     J01636
  • 13.1 Align vector & splice to Ecoli
      1     7474  |   366812   359335  |     7474     7478  |    99.91  |     7477  4639675  |    99.96     0.16  | J01636             NC_000913.2    [CONTAINED]
     20      119  |       65      162  |      100       98  |    96.00  |      150      395  |    66.67    24.81  | J01636.rev.begin   NC_000913.2
     31      148  |      172      289  |      118      118  |    95.76  |      150      395  |    78.67    29.87  | J01636.rev.end     NC_000913.2
   1069     1463  |      395        1  |      395      395  |   100.00  |     7477      395  |     5.28   100.00  | J01636             NC_000913.2.365350-365744
  • 14. Run lucy & trim reads
 $ /nfshomes/dpuiu/szdevel/SourceForge/lucy-1.19p/lucy \ 
     -v vector.seq splice.seq
     -o bos_taurus.lucy.seq bos_taurus.lucy.qual \
     -debug  bos_taurus.lucy.info \
     bos_taurus.seq bos_taurus.qual
 # Trim clr
 $ clrFasta bos_taurus.seq > bos_taurus.cseq
  • 15. Align lucy output to linker, vector, splice & UniVec.dust
 $ nucmer -l 12 -c 24 ~/db/vector.seq  bos_taurus.lucy.cseq -p vector-bos_taurus.lucy
 $ nucmer -l 16 -c 30 ~/db/vector.seq  bos_taurus.lucy.cseq -p vector-bos_taurus.lucy
 $ nucmer -l 16 -c 30 ~/db/splice.seq  bos_taurus.lucy.cseq -p splice-bos_taurus.lucy
 $ nucmer -l 16 -c 30 ~/db/UniVec.dust bos_taurus.lucy.cseq -p UniVec.dust-bos_taurus.lucy
  • 16. Align input to linker, vector, splice & UniVec.dust
 $ nucmer -l 12 -c 24 ~/db/linker.seq bos_taurus.seq -p linker-bos_taurus
 $ nucmer -l 16 -c 30 ~/db/vector.seq bos_taurus.seq -p vector-bos_taurus
 $ nucmer -l 16 -c 30 ~/db/splice.seq bos_taurus.seq -p splice-bos_taurus
 $ nucmer -l 16 -c 30 ~/db/UniVec.dust bos_taurus.seq -p UniVec.dust-bos_taurus

Count how many reads got trimmed

 infoseq *seq | getSummary.pl -c 1 -t original.LEN
 
 cat bos_taurus.lucy.info | awk '{print $4-$3}' | getSummary.pl -t lucy.CLR >! bos_taurus.lucy.summary  
 cat bos_taurus.lucy.info | getSummary.pl -c 14 -t lucy.CLV5 -nh >> bos_taurus.lucy.summary
 cat bos_taurus.lucy.info | getSummary.pl -c 15 -t lucy.CLV3 -nh >> bos_taurus.lucy.summary

Libraries

011.BCM.WGS FORWARD

  • vector: J01636
  • UniVec: gnl|uv|J01636.1:1-7477 E.coli lactose operon with lacI, lacZ, lacY and lacA genes
 ll ~dpuiu/db/J01636*
 -rw-rw-r--  1 dpuiu dpuiu 7651 Jan  9 15:56 /nfshomes/dpuiu/db/J01636
 -rw-rw-r--  1 dpuiu dpuiu  105 Jan 14 07:17 /nfshomes/dpuiu/db/J01636linker
 -rw-rw-r--  1 dpuiu dpuiu  840 Jan 13 13:43 /nfshomes/dpuiu/db/J01636splice
 cat  ~dpuiu/db/J01636* | infoseq
 J01636            7477   53.43
 J01636.linker.fwd 27     44.44
 J01636.linker.rev 27     37.04
 J01636.for.begin  150    44.67
 J01636.for.end    150    51.33
 J01636.rev.begin  150    51.33
 J01636.rev.end    150    44.67
  • 249,611 reads:
  • 91% got vector trimmed at the 5'
  • 0.4% (1149) got vector trimmed at the 3'
                 #elem   #0s     min     max     mean    median  n50     sum
 original.LEN    249611  0       437     2349    1082    991     1009    270035781     
 lucy.CLV5       249611  21215   0       741     25.03   25      27      6247415
 lucy.CLV3       249611  248462  0       1047    3.49    0       859     870344
  • Original reads hit counts:
10975 linker.fwd
133   linker.rev
166   splice
152   vector
228   UniVec.dust
  • Lucy trimmed read counts
2 linker.fwd
0 linker.rev
1 splice
1 vector
6 UniVec.dust (only 3 are >40bp)

011.BCM.WGS REVERSE

                 #elem   #0s     min     max     mean    median  n50     sum
 original.LEN    250389  0       502     2148    1085    993     1012    271691094
 lucy.CLR        250389  7345    0       1281    795     876     892     198982171
 lucy.CLV5       250389  20271   0       668     26.52   27      29      6641362
 lucy.CLV3       250389  249269  0       997     3.35    0       861     839029
  • Original reads hit counts:
 linker.fwd      113
 linker.rev      3812
 splice          143
 UniVec.dust     237
 vector          4318
  • Lucy trimmed reads hit counts:
 linker.fwd      1
 linker.rev      0
 splice          1
 UniVec.dust     10
 vector          1

030.BCM.SHOTGUN

  • same linker/vector/splice as BCM.WGS
  • 2.5% (4K out of 160K) reads contain linker & vector at 3'
                 #elem   #0s     min     max     mean    median  n50     sum
 original.LEN    8411    0       325     1685    1181    1240    1314    9933150
 lucy.CLR        8411    8       0       1054    841     863     874     7070994
 lucy.CLV5       8411    568     0       232     27.01   28      29      227206
 lucy.CLV3       8411    2325    0       1040    597     794     851     5023445
  • Original reads hit counts:
 linker.fwd      4314
 linker.rev      4125
 splice          7816
 UniVec.dust     4212
 vector          6750
 vector          27235
  • Lucy trimmed reads hit counts:
 linker.fwd      3
 linker.rev      1
 splice          1
 UniVec.dust     13
 vector          0

001.NISC.SHOTGUN

  • Vector: pOTW13
  • UniVec: 3 partial seqs
 gnl|uv|NGB00080.1:1-198 pOTW13 with linkers
 gnl|uv|NGB00080.1:718-888 pOTW13 with linkers
 gnl|uv|NGB00080.1:1490-1654-49 pOTW13 with linkers
 ll /nfshomes/dpuiu/db/NGB00080*
 -rw-rw-r--  1 dpuiu dpuiu 1083 Jan 14 20:43 /nfshomes/dpuiu/db/NGB00080
 -rw-r--r--  1 dpuiu dpuiu   94 Jan 14 21:01 /nfshomes/dpuiu/db/NGB00080linker
 -rw-r--r--  1 dpuiu dpuiu 2183 Jan 14 20:44 /nfshomes/dpuiu/db/NGB00080splice
 cat  /nfshomes/dpuiu/db/NGB00080* | infoseq
 NGB00080       1054   50.00
 NGB00080.linker.fwd 24     45.83
 NGB00080.linker.rev 26     53.85
 NGB00080.for.beg 518    46.14
 NGB00080.for.end 518    50.48
 NGB00080.rev.begin 518    50.48
 NGB00080.rev.beg 518    46.14
  • 944 read sample
                 #elem   #0s     min     max     mean    median  n50     sum
 original.LEN    944     0       652     1017    735     721     722     693668
 lucy.CLR        944     39      0       886     415     422     522     391333
 lucy.CLV5       944     121     0       275     34.05   33      35      32143
 lucy.CLV3       944     18      0       885     410     409     511     387007
  • Original reads hit counts:
 linker.fwd      479
 linker.rev      492
 splice          910
 UniVec.dust     0
 vector          939
  • Lucy trimmed reads hit counts:
 linker.fwd      1
 linker.rev      0
 splice          0
 UniVec.dust     9
 vector          1

060.BCCAGSC.CLONEEND

  • Linkers:
 linker.fwd CCCTGCTTTGTCTGGAAGGGGTTCCCGACCT
 linker.rev CAGGAGGGGAGAAAGGGCTCAGAGG
  • No common vector !!!
 wc -l *clb
   60746 bos_taurus.060.f.clb  #18 reads original align to UniVec (nucmer default params)
   60836 bos_taurus.060.r.clb
 
 Fwd:
    329      428  |      440      535  |      100       96  |    91.00  |      503     1585  |    19.88     6.06  | 723951410  gnl|uv|U30497.1:3230-4814 Cloning vector pAS2-1
    330      370  |       89       49  |       41       41  |   100.00  |      503      143  |     8.15    28.67  | 723951410  gnl|uv|U67875.1:6541-6683 pESP-I yeast expression vector
    330      370  |       94       54  |       41       41  |   100.00  |      503      143  |     8.15    28.67  | 723951410  gnl|uv|U67875.1:6541-6683 pESP-I yeast expression vector
 
  Rev:
      1       96  |       71      165  |       96       95  |    93.81  |      203      165  |    47.29    57.58  | 724018013  gnl|uv|AF133437.1:16659-16823 Cloning vector pCYPAC6
     50      143  |        1       94  |       94       94  |    92.71  |      203       94  |    46.31   100.00  | 724018013  gnl|uv|U80929.2:2858-2951     Cloning vector pBACe3.6

017.UIUC.CLONEEND

  • No overrepresented kmers
 wc -l *clb
  17978 bos_taurus.017.f.clb
  17911 bos_taurus.017.r.clb
 
 ==> 24.fwd/kmers.tab <==
 CCCTGCTTTGTCTGGAAGGGGTTC        GAACCCCTTCCAGACAAAGCAGGG        9
 CTGCTTTGTCTGGAAGGGGTTCCC        GGGAACCCCTTCCAGACAAAGCAG        9
 
 ==> 24.rev/kmers.tab <==
 GAATGTTGAGCTTTAGCCAACTTT        AAAGTTGGCTAAAGCTCAACATTC        4
 TCTGAATGTTGAGCTTTAGCCAAC        GTTGGCTAAAGCTCAACATTCAGA        4
 
 ==> 8.fwd/kmers.tab <==
 TTTTTTTT        AAAAAAAA        55
 AAGGGGTT        AACCCCTT        35
 
 ==> 8.rev/kmers.tab <==
 GTCTGGAA        TTCCAGAC        41
 TCTGGAAG        CTTCCAGA        39
  • No UniVec hits

010.TIGR.CLONEEND

  • No overrepresented kmers
 wc -l *clb
 5479 bos_taurus.032.f.clb
 5174 bos_taurus.032.r.clb
 
 ==> 24.fwd/kmers.tab <==
 CTTGTGTTGGCCCAGGCAAGTCCA        TGGACTTGCCTGGGCCAACACAAG        30
 TTGTGTTGGCCCAGGCAAGTCCAA        TTGGACTTGCCTGGGCCAACACAA        30
 
 ==> 24.rev/kmers.tab <==
 CTGCCTCTTGTGTTGGCCCAGGCA        TGCCTGGGCCAACACAAGAGGCAG        16
 GCTGCCTCTTGTGTTGGCCCAGGC        GCCTGGGCCAACACAAGAGGCAGC        15
 
 ==> 8.fwd/kmers.tab <==
 GAGTGGGT        ACCCACTC        176
 GGAGTGGG        CCCACTCC        171
 
 ==> 8.rev/kmers.tab <==
 TGGAGTGG        CCACTCCA        182
 GGAGTGGG        CCCACTCC        181
 
  • No UniVec hits

...

070.BCM.CLONEEND

  • No frequent kmers
 wc -l *clb
   6027 bos_taurus.070.f.clb
   6236 bos_taurus.070.r.clb
 
 ==> 24.fwd/kmers.tab <==
 GGACTCTCAGAGTCTTCTCCAACA        TGTTGGAGAAGACTCTGAGAGTCC        18
 ACTGGTTGGATCTCCTTGCAGTCC        GGACTGCAAGGAGATCCAACCAGT        18

 ==> 24.rev/kmers.tab <==
 ATAAAATCTGAGCCACCAGGGAAG        CTTCCCTGGTGGCTCAGATTTTAT        1
 CTATTGGTTCATATGGTCAACGTC        GACGTTGACCATATGAACCAATAG        1
 
 ==> 8.fwd/kmers.tab <==
 TTTTTTTT        AAAAAAAA        86
 CTTCTCCA        TGGAGAAG        75
 
 ==> 8.rev/kmers.tab <==
 TATAGTGT        ACACTATA        9
 ATATAGGG        CCCTATAT        8
  • No alignments to BCM WGS vector

Running Lucy

  • Default parameters with vector trimming
  • BCM vector/splice
 /nfshomes/dpuiu/db/vector.BCM.seq
 /nfshomes/dpuiu/db/splice.BCM.seq
  • NISC vector/splice
 /nfshomes/dpuiu/db/vector.NISC.seq
 /nfshomes/dpuiu/db/splice.NISC.seq

BCM.WGS (all reads)

  • orig.CLR < lucy.CLR ( 765 < 792 )
  • orig.CLV > lucy.CLV ( 1015 > 973 )
  • 739,529 out of 24,863,599 reads (3%) deleted by Lucy (CLR=-1,-1)
  • 21,728,592 out of 24,863,599 reads (87%) vector trimmed at the 5' end
  • 92,646 out of 24,863,599 reads (0.3%) vector trimmed at the 3' end
                           elem       <0         0          >0         min        max        mean       median     n50        sum 
 orig.LEN                  24863599   0          0          24863599   5          3097       1002       997        1015       24915462033
 
 orig.CLR                  24863599   463669     7          24399923   -1143      1833       765        836        864        19036744256
 orig.CLR5                 24863599   0          359245     24504354   0          2103       42         22         58         1047922451
 orig.CLR3                 24863599   463404     0          24400195   -1         2169       807        872        895        20084666707
 
 lucy.CLR                  24863599   0          739529     24124070   0          1219       792        878        904        19695000417
 lucy.CLR5                 24863599   739529     36108      24087962   -1         1753       43         29         42         1086413880
 lucy.CLR3                 24863599   739529     0          24124070   -1         1894       835        915        939        20781414297
 
 orig.CLR5-lucy.CLR5       24863599   16299521   215345     8348733    -1186      2104       -1         -10        -1186      -38491429
 orig.CLR3-lucy.CLR3       24863599   14858542   1494794    8510263    -1273      2170       -28        -20        -1273      -696747590
 
 orig.CLV                  24863599   1053       1920       24860626   -2         5345       1015       1002       1017       25260581538
 orig.CLV5                 8841849    0          0          8841849    1          1219       33         46         49         295011460
 orig.CLV3                 24861698   1053       0          24860645   -1         5346       1027       1005       1019       25555592998
 
 lucy.CLV                  24863599   10694      707        24852198   -469       3096       973        968        987        24195085877
 lucy.CLV5                 24863599   0          3135007    21728592   0          1359       25         27         29         623457486
 lucy.CLV3                 24863599   0          0          24863599   4          3096       998        995        1014       24818543363
 lucy.CLVABS5              24863599   0          3135007    21728592   0          1359       25         27         29         623457486
 lucy.CLVABS3              24863599   0          24770953   92646      0          1343       2          0          880        72055071
 
 orig.CLV5-lucy.CLV5       24863599   17216820   1512453    6134326    -1312      1219       -13        -25        -1312      -328446026
 orig.CLV3-lucy.CLV3       24863599   1519132    18579609   4764858    -1832      4672       29         0          479        737049635

BCM.WGS (0 quality reads)

  • orig.CLR > lucy.CLR (mean)
  • orig.CLV > lucy.CLV (mean)
  • 7,153 out of 551,114 reads (1.3%) deleted by Lucy (CLR=-1,-1)
  • 508,166 out of 551,114 reads (92%) vector trimmed at the 5' end
  • 1,946 out of 551,114 reads (0.35%) vector trimmed at the 3' end
                           elem       <0         0          >0         min        max        mean       median     n50        sum
 orig.LEN                  551114     0          0          551114     5          1464       872        946        959        480705828
 
 orig.CLR                  551114     7754       0          543360     -770       1175       708        786        807        390325117
 orig.CLR5                 551114     0          6773       544341     0          1519       44         20         111        24582849
 orig.CLR3                 551114     7744       0          543370     -1         1638       752        818        833        414907966
 
 lucy.CLR                  551114     0          7153       543961     0          699        636        671        671        350759771
 lucy.CLR5                 551114     7153       35872      508089     -1         201        26         27         28         14442310
 lucy.CLR3                 551114     7153       0          543961     -1         699        662        699        699        365202081
 
 orig.CLR5-lucy.CLR5       551114     364282     8801       178031     -198       1500       18         -8         215        10140539
 orig.CLR3-lucy.CLR3       551114     85058      2962       463094     -700       1472       90         123        178        49705885
 
 orig.CLV                  551114     971        0          550143     -2         2037       974        978        981        537127121
 orig.CLV5                 5100       0          0          5100       1          845        35         29         31         180490
 orig.CLV3                 551114     971        0          550143     -1         2037       974        978        981        537307611
 
 lucy.CLV                  551114     58         6          551050     -84        1456       841        917        930        463903233
 lucy.CLV5                 551114     0          42948      508166     0          202        27         28         29         14964546
 lucy.CLV3                 551114     0          0          551114     4          1463       868        945        958        478867779
 lucy.CLVABS5              551114     0          42948      508166     0          202        27         28         29         14964546
 lucy.CLVABS3              551114     0          549168     1946       0          700        2          0          686        1286935
 
 orig.CLV5-lucy.CLV5       551114     506108     42215      2791       -202       845        -26        -28        -202       -14784056
 orig.CLV3-lucy.CLV3       551114     134959     23422      392733     -967       1614       106        7          459        58439832

BCM.SHOTGUN

  • orig.CLR < lucy.CLR (mean)
  • orig.CLV > lucy.CLV (mean)
  • 98,070 out of 10,748,529 reads (0.9%) deleted by Lucy (CLR=-1,-1)
  • 9,737,008 out of 10,748,529 reads (90%) vector trimmed at the 5' end
  • 294,942 out of 10,748,529 reads (2.7%) vector trimmed at the 3' end
                           elem       <0         0          >0         min        max        mean       median     n50        sum
 orig.LEN                  10748529   0          0          10748529   5          2043       975        950        964        10486690472
 
 orig.CLR                  10748529   17308      2          10731219   -1293      1467       809        833        847        8701344571
 orig.CLR5                 10748529   0          68         10748461   0          1315       26         16         38         288662580
 orig.CLR3                 10748529   16780      0          10731749   -1         1647       836        851        863        8990007151
 
 lucy.CLR                  10748529   0          98070      10650459   0          1337       833        854        868        8955866769
 lucy.CLR5                 10748529   98070      1973       10648486   -1         1307       35         28         32         376276188
 lucy.CLR3                 10748529   98070      0          10650459   -1         1553       868        882        896        9332142957
 
 orig.CLR5-lucy.CLR5       10748529   9498290    65171      1185068    -1099      1293       -8         -11        -1099      -87613608
 orig.CLR3-lucy.CLR3       10748529   6879532    671097     3197900    -1149      1437       -31        -26        -1149      -342135806
 
 orig.CLV                  10748529   16779      412        10731338   -2         3919       974        948        964        10472347908
 orig.CLV5                 8594910    0          0          8594910    1          1239       3          1          49         28350257
 orig.CLV3                 10748349   16779      0          10731570   -1         3919       976        950        965        10500698165
 
 lucy.CLV                  10748529   7026       614        10740889   -268       2042       930        924        940        9997862132
 lucy.CLV5                 10748529   0          1011521    9737008    0          855        24         24         27         257993796
 lucy.CLV3                 10748529   0          0          10748529   4          2042       954        945        962        10255855928
 lucy.CLVABS5              10748529   0          1011521    9737008    0          855        24         24         27         257993796
 lucy.CLVABS3              10748529   0          10453587   294942     0          1214       20         0          847        220086015
 
 orig.CLV5-lucy.CLV5       10748529   9538738    138680     1071111    -854       1239       -21        -23        -854       -229643539
 orig.CLV3-lucy.CLV3       10748529   357934     9324166    1066429    -1328      2846       22         0          704        244842237

NISC.SHOTGUN

  • orig.CLR < lucy.CLR (mean)
  • orig.CLV > lucy.CLV (mean)
  • 8,248 out of 737,900 reads (1.1%) deleted by Lucy (CLR=-1,-1)
  • 633,409 out of 737,900 reads (85%) vector trimmed at the 5' end
  • 7,201 out of 737,900 reads (0.97%) vector trimmed at the 3' end


                           elem       <0         0          >0         min        max        mean       median     n50        sum
 orig.LEN                  737900     0          0          737900     104        2104       784        729        734        579172842
 
 orig.CLR                  737900     5988       2          731910     -636       1033       651        668        676        480400909
 orig.CLR5                 737900     0          0          737900     1          1407       47         40         51         34857531
 orig.CLR3                 737900     0          5879       732021     0          1470       698        710        715        515258440
 
 lucy.CLR                  737900     0          8248       729652     0          1035       658        670        676        485757685
 lucy.CLR5                 737900     8248       56         729596     -1         1091       45         35         46         33811606
 lucy.CLR3                 737900     8248       0          729652     -1         1391       704        710        714        519569291
 
 orig.CLR5-lucy.CLR5       737900     253727     89345      394828     -566       1408       1          1          485        1045925
 orig.CLR3-lucy.CLR3       737900     177007     31         560862     -867       1471       -5         1          -867       -4310851
 
 orig.CLV                  737900     3224       2655       732021     -636       2103       771        725        730        569178445
 orig.CLV5                 734026     0          0          734026     1          987        5          1          35         4375315
 orig.CLV3                 732021     0          0          732021     35         2104       783        729        734        573553760
 
 lucy.CLV                  737900     1335       55         736510     -200       2104       747        696        702        551392388
 lucy.CLV5                 737900     104491     0          633409     -1         1199       30         31         34         22784742
 lucy.CLV3                 737900     0          0          737900     15         2103       778        728        733        574177130
 lucy.CLVABS5              737900     0          104491     633409     0          1200       31         32         35         23522642
 lucy.CLVABS3              737900     0          730699     7201       0          1076       5          0          686        4257812
 
 orig.CLV5-lucy.CLV5       737900     561851     66390      109659     -1198      983        -24        -29        -1198      -18409427
 orig.CLV3-lucy.CLV3       737900     8386       1          729513     -950       1077       0          1          -950       -623370

Fragment files

  • Locations:
 /fs/szasmg3/bos_taurus/data/frg
 /fs/szasmg3/bos_taurus/data/frg.new
  • All DST messages are unique
  • bos_taurus.clv : contains the vector clipping points
    • BCM.WGS, BCM.SHOTGUN & NISC.SHOTGUN: lucy.clv
    • others: the TA clv
    • 374,454 reads don't have valid clv's
    • 36,446,031 reads have valid clv's with avg=955

Message counts (original)

                                                DST     FRG             LKG
 bos_taurus.BCM.WGS.frg                         79      24124070        11311841
 #bos_taurus.BCM.SHOTGUN.frg                    7339    10650459        1799069     # some libs & mates are missing due to a tarchive2ca crash (used by UMD2.1)
 #bos_taurus.BCM.SHOTGUN.new.frg                18208   10650459        4715172     # split the libraries by VOL & SEQ_LIB_ID (used by UMD2.2)
 #bos_taurus.BCM.SHOTGUN.new.frg                13826   10650459        5046435     # double check the FRG count !!! (used by UMD2.3)
 bos_taurus.BCM.SHOTGUN.new.frg                 7       10650459        5046435     # UMD2.4
 bos_taurus.NISC.SHOTGUN.frg                    246     729652          344932
 bos_taurus.BCCAGSC.CLONEEND.frg                1       125241          59505
 bos_taurus.UIUC.CLONEEND.frg                   2       114750          46319
 bos_taurus.TIGR.CLONEEND.frg                   1       65171           27067
 bos_taurus.GSC.CLONEEND.frg                    1       53521           25889
 bos_taurus.CENARGEN.WGS.frg                    0       26246           0
 #bos_taurus.BARC.CLONEEND.frg                  11150   25454           11150      # (used by UMD2.3)
 bos_taurus.BARC.CLONEEND.frg                   1       25454           11150      # (used by UMD2.4)
 bos_taurus.BCM.CLONEEND.frg                    1       16875           7103
 bos_taurus.CENARGEN.CLONEEND.frg               1       16787           6269
 bos_taurus.UOKNOR.SHOTGUN.frg                  1       14651           4910
 bos_taurus.TIGR_JCVIJTC.CLONEEND.frg           2       10651           4803
 bos_taurus.UOKNOR.FINISHING.frg                0       151             0
 bos_taurus.WUGSC.COLONEEND.frg                 1       49              21
 #total                                         25312   35973728        16896244   # (UMD2.3) 
 total                                          344     35973728        16896244   # (UMD2.4) 

Message counts (quality)

                                                DST     FRG             LKG
 bos_taurus.BCM.WGS.qual.count                  79      23580109        11035582
 #bos_taurus.BCM.SHOTGUN.qual.count             7339    10644092        1799069
 bos_taurus.BCM.SHOTGUN.qual.new.count          18208   10644092        4712446
 bos_taurus.NISC.SHOTGUN.count                  246     729652          344932
 bos_taurus.BCCAGSC.CLONEEND.qual.count         1       116484          53585
 bos_taurus.UIUC.CLONEEND.count                 2       114750          46319
 bos_taurus.TIGR.CLONEEND.count                 1       65171           27067
 bos_taurus.CENARGEN.WGS.count                  0       26246           0
 bos_taurus.BARC.CLONEEND.count                 11150   25454           11150
 bos_taurus.BCM.CLONEEND.count                  1       16875           7103
 bos_taurus.CENARGEN.CLONEEND.count             1       16787           6269
 bos_taurus.TIGR_JCVIJTC.CLONEEND.count         2       10651           4803
 bos_taurus.UOKNOR.SHOTGUN.qual.count           1       2456            813
 bos_taurus.WUGSC.COLONEEND.count               1       49              21

Message counts (0quality)

                                                DST     FRG             LKG
 bos_taurus.BCM.WGS.0qual.count                 79      543961          234397
 bos_taurus.GSC.CLONEEND.0qual.count            1       53521           25889
 bos_taurus.UOKNOR.SHOTGUN.0qual.count          1       12195           4097
 bos_taurus.BCCAGSC.CLONEEND.0qual.count        1       8757            2114
 bos_taurus.BCM.SHOTGUN.0qual.count             7339    6367            0
 bos_taurus.UOKNOR.FINISHING.0qual.count        0       151             0

Assemblies

Bt.qc.combine UMD2.0 ... UMD2.5 combine stats

Assembly UMD2.1(2009_0122_CA; Quality reads)

Issues

  1. Uses only quality reads
  2. BCM.SHOTGUN library : ~ 4715172-1799069=2.9M mates were missed due to a tarchive2ca crash ; some libraries got merged (were assigned the same lib_id)
  3. All reads except for BCM.WGS were set as nonrandom
  4. Update the runCA script to run overlapper concurently; new "ovlConcurrency" parameter added to the .spec file !!!
  5. consensus after cgw crashed in MultiAlignContig() ... use "consensus -D forceunitigabut" !!!
  6. cgw crashed after updating gkpStore with new lib/mate info => edit Input_CGW.c, remove the assert in line 117

Info

 host: walnut
 
 assembly version: wgs-5.2 stable
 
 dir:  /scratch1/bos_taurus/Assembly/2009_0122_CA 
 
 command: /fs/szdevel/dpuiu/SourceForge/wgs/Linux-amd64/bin/runCA-test -d . -p bt -s bt01.specFile *.frg
 
 spec file:
 cgwDistanceSampleSize   =       1000       # ??? too big; more than 50% of the BCM.SHOTGUN reads are in libraries with less than 1000 inserts
 cnsConcurrency          =       15
 cnsMinFrags             =       200000
 doOverlapTrimming       =       1
 frgCorrBatchSize        =       100000
 frgCorrConcurrency      =       15
 merylMemory             =       24000
 merylThreads            =       15
 obtMerThreshold         =       200
 obtOverlapper           =       ovl
 ovlConcurrency          =       8
 ovlCorrBatchSize        =       100000
 ovlCorrConcurrency      =       15
 ovlHashBlockSize        =       1200000
 ovlMemory               =       8GB --hashload 0.8 --hashstrings 400000
 ovlMerThreshold         =       500
 ovlOverlapper           =       ovl
 ovlRefBlockSize         =       7200000
 ovlThreads              =       2
 unitigger               =       utg
 utgErrorRate            =       0.015
 vectorIntersect         =       bos_taurus.clv
 doExtendClearRanges     =       2

Steps

1. Run up till after initialStoreBuilding

 runCA stopAfter=initialStoreBuilding ...

2. Update gkpStore with nonrandom frg flag

 cat bos_taurus.nonrandom.clv | perl -ane 'print "frg uid $F[0] isnonrandom 1\n";'  > bos_taurus.nonrandom.edit
 gatekeeper -edit bos_taurus.nonrandom.edit bt.gkpStore

Input

 gatekeeper -dumpinfo -lastfragiid bt.gkpStore
 ...
 Last frag in store is iid = 35348776

OBT

                           elem       <0         0          >0         min        max        mean       median     n50        sum            
 
 CLV5                      35085508   0          3387027    31698481   0          970        25         27         29         891007232
 CLV3                      35164784   0          0          35164784   15         2974       984        980        1000       34612019144
 
 CLR_ORIG5                 35348776   0          43354      35305422   0          1753       42         29         38         1502168205     
 CLR_ORIG3                 35348776   0          0          35348776   70         1894       864        905        927        30547294868    
 
 CLR_OBT5                  35348776   0          26513      35322263   0          1690       49         30         73         1756346429     
 CLR_OBT3                  35348776   0          23477      35325299   0          1813       843        895        914        29824543869

  • 421,379 reads deleted by OBT: why so many???

  • Chimera:
 20297 reads too short => deleted
  • more 0-overlaptrim/bt.mergeLog.stats
 ...
 211037: short or inconsistent
 253536: deleted fragment due to zero clear 
  • Example:
 gatekeeper -dumpfragments 516316990  bt.gkpStore  
 fragmentIdent           = 516316990,14
 fragmentMate            = 0,0
 fragmentLibrary         = 27473,1563
 fragmentIsDeleted       = 1
 fragmentIsNonRandom     = 1
 fragmentStatus          = G
 fragmentOrientation     = I
 fragmentHasVectorClear  = 0
 fragmentHasQualityClear = 0
 fragmentPlate           = 0
 fragmentPlateLocation   = 0
 fragmentSeqLen          = 862
 fragmentHPSLen          = 0
 fragmentSrcLen          = 17
 fragmentClearORIG       = 38,553
 fragmentClearQLT        = 1,0
 fragmentClearVEC        = 1,0
 fragmentClearOBTINI     = 35,578
 fragmentClearOBT        = 35,578
 fragmentClearUTG        = 35,578
 fragmentClearECR1       = 35,578
 fragmentClearECR2       = 35,578
 fragmentSeqOffset       = 5376
 fragmentQltOffset       = 11038
 fragmentHpsOffset       = 53
 fragmentSrcOffset       = 287
 cat   0-overlaptrim/bt.mergeLog | grep 516316990 
 516316990,14    412     412     0       0 (deleted, too short)
 zcat *r000*gz | convertOverlap -a -obt 
 ...
    14 12128740  f  377  478   292  393   2.97
    14 15226267  f  397  446    31   80   2.04
    14 19071241  f    4  513   199  708   1.18
    14 20073917  f    7  478    36  508   4.88
    14 20042424  f    4  419   299  714   1.93
    14 20212935  f    7  478   234  706   4.88
    14 20073828  r    7  478   507   35   4.67
    14 20212846  r    7  478   557   85   4.67
    14 27089060  r  491  534   836  793   2.33
    14 29061748  f  489  540    86  137   1.96
    14 32105697  f  455  543   381  469   2.27
    14 32187461  f  430  534   105  209   1.92
    14 32027289  f    4  419   493  907   4.59
 ...
 #read aligns to contigs
 show-coords 516316990-ctg.filter-r.strict.delta
     35      531  |       97      594  |      497      498  |    99.20  |      862     2759  |    57.66    18.05  | 516316990  ctg7180001872751
     45      678  |      931     1564  |      634      634  |    97.00  |      862     1567  |    73.55    40.46  | 516316990  ctg7180001837311


  • OBT deleted reads:
 BCM           WGS         253816
 BCM           SHOTGUN     151770
 BCCAGSC       CLONEEND    7510
 NISC          SHOTGUN     4757
 TIGR          CLONEEND    1577
 CENARGEN      WGS         599
 CENARGEN      CLONEEND    431
 TIGR_JCVIJTC  CLONEEND    377
 UIUC          CLONEEND    182
 BCM           CLONEEND    150
 BARC          CLONEEND    125
 UOKNOR        SHOTGUN     85
 total         .           421379

OBT deleted reads:

          elem       >0         min        max        mean       med        n50        sum
 len      421379     421379     98         2974       862        927        968        363280405
 avgQual  421379     421379     1          57         28         24         36         11852865

Overlapper

    • 98.33% of the reads (34,761,786 out of 35,348,776 reads) had overlaps
    • 1.66% of the reads had no overlaps
    • 6.68% of the BCCAGSC.CLONEEND reads had no overlaps
    • 4.95% of the TIGR_JCVIJTC.CLONEEND reads had no overlaps
    • 3.48% of the TIGR.CLONEEND reads had no overlaps
    • the median number of overlaps is 20
 Overlaps
         reads      min        max        mean       median     n50        sum
 qual    35348776   0          5592       106        20         769        3777789082
    • the median number of overlaps for the BCM.WGS reads is 16
    • the median number of overlaps for the BCM.SHOTGUN reads is 16 !!!
    • the median number of overlaps for the NISC.SHOTGUN reads is 40 !!!
    • the median number of overlaps for the BCM.CLONEEND reads is 16 !!!

Media:Bt.ovlStore.big.png , Media:Bt.ovlStore.small.png

Unitigger

more 4-unitigger/bt.cga.0
UNITIG OVERLAP GRAPH INFORMATION
 
       5208738 : Total number of unitigs
       2527051 : Total number of singleton, contained unitigs
       1814842 : Total number of singleton, non-contained unitigs
        180910 : Total number of non-singleton, spanned unitigs
        685935 : Total number of non-singleton, non-spanned unitigs
      34927397 : Total number of fragments
      34927397 : Total number of fragments in all unitigs
      21521581 : Total number of essential fragments in all unitigs
      13405816 : Total number of contained fragments in all unitigs
  0.0076239952 : Randomly sampled fragment arrival rate per bp
    2510896132 : The sum of overhangs in all the unitigs
    6400342737 : Total number of bases in all unitigs
             0 : Estimated number of base pairs in the genome.
             0 : Total number of contained fragments not connected
                 by containment edges to essential fragments.
 Total rho    = 2510896132
 Total nfrags = 19143061
 Estimated genome length = 0
 Estimated global_fragment_arrival_rate=0.007624
 Computed global_fragment_arrival_rate =0.007624
 Total number of randomly sampled fragments in genome = 23326293
 Computed genome length  = 3059589120.000000
 Used global_fragment_arrival_rate=0.007624
 Used global_fragment_arrival_distance=131.164826
 
 Histogram of the number of base pairs in a chunk
 100292 - 159434:    22 
 90010 -  99906:     25 
 80043 -  89676:     73 
 70013 -  79966:    162 
 60010 -  69988:    389 
 50008 -  59983:    977 
 40000 -  49998:   2434 
 30000 -  39997:   6458 
 20000 -  29999:  18957 
 10000 -  19999:  57442
 Unitigs >=10kb
             NewAsm          UMd2Asm
 
 Number       86,939          57,204
 Mean         19,464          15,140
 Sum         1,692.1Mb       866.0Mb
 max         159,434bp      78,570bp
 Contigs >=10Kb:
           NewAsm          UMd2Asm
 n         42,343           45,958      
 mean      59,856           55,473
 sum        2,534.5Mb        2,549.4Mb
 Contigs >=100Kb: 
           NewAsm          UMd2Asm
 n          7,051            6,683         
 mean     163,170          162,357    
 sum        1,150.5Mb        1,085.0Mb
 max      627,705          742,802
 Scaffolds >=10Mb:
           NewAsm          UMd2Asm
 n             30                3
 mean       14.10Mb          11.36Mb
 sum       422.95Mb         340.70Mb
 max        26.54Mb          13.36Mb

CGW & ECR

  • Checkpoints:
 cat 7-0-CGW/bt.timing | grep ^Checkpoint
 Checkpoint 3 written during MergeScaffoldsAggressive at iteration 49
 Checkpoint 4 written during MergeScaffoldsAggressive at iteration 85
 Checkpoint 5 written after 1st Scaffold Merge 
 Checkpoint 6 written after 2nd Aggressive Scaffold Merge
 Checkpoint 7 written after Final Rocks
 cat 7-2-CGW/bt.timing | grep ^Checkpoint
 Checkpoint 19 written during MergeScaffoldsAggressive at iteration 12
 Checkpoint 20 written during MergeScaffoldsAggressive at iteration 31
 Checkpoint 21 written after 1st Scaffold Merge
 Checkpoint 22 written after 2nd Aggressive Scaffold Merge
 Checkpoint 23 written after Final Rocks

 cat 7-4-CGW/bt.timing | grep ^Checkpoint
 Checkpoint 34 written during MergeScaffoldsAggressive at iteration 12
 Checkpoint 35 written during MergeScaffoldsAggressive at iteration 49
 Checkpoint 36 written after 1st Scaffold Merge
 Checkpoint 37 written during Stones CleanupScaffolds after scaffold 32436
 Checkpoint 38 written during Stones CleanupScaffolds after scaffold 34939
 Checkpoint 39 written after Stone Throwing and CleanupScaffolds
 Checkpoint 40 written after 2nd Aggressive Scaffold Merge
 Checkpoint 41 written after Final Rocks

Checkpoint 42 written after Partial Stones Checkpoint 43 written after Final Contained Stones Checkpoint 44 written after resolveSurrogates

  • Get early CTG/SCF stats
 cat 7-CGW/bt.cgw_scaffolds | countMessages.pl
 ICL     451555  # ???
 ICP     116455  # CTG
 ISF     66141   # SCF
 ISL     711     # SLK
  • Clear read extension:
                           elem       <0         0          >0         min        max        mean       median     n50        sum
 ClearORIG                 35348776   4          0          35348772   -1147      1572       821        870        893        29045126663
 
 ClearQLT                  35348776   35348776   0          0          -1         -1         -1         -1         -1         -35348776
 ClearVEC                  35348776   299034     20323      35029419   -1         2043       952        953        975        33658445088
 
 ClearOBTINI               35348776   0          31254      35317522   0          1364       831        879        902        29394688367
 ClearOBT                  35348776   0          31254      35317522   0          1318       794        854        877        28068197440
 
 ClearECR1                 35348776   0          31254      35317522   0          1329       794        854        877        28072014464
 ClearECR2                 35348776   0          31254      35317522   0          1329       794        854        877        28072365712
 sum(ClearECR1)-sum(ClearUTG) = 3,817,024
 sum(ClearECR2)-sum(ClearECR1)= 351,248
  • Scaffold length stats:
 cat 7-0-CGW/stat/final0.Scaffolds.nodelength.cgm | grep -v ^Sca | getSummary.pl -t 0 # 0,2,4
 ...
 step     scaff      min        max        mean       med        n50        sum            
 0        7048       2249       19719008   385020     21967      3114907    2713622175     
 2        4960       2249       21907006   540915     21181      4490171    2682939682
 4        4006       2391       26541374   668427     29193      4590744    2677722052
  • Last cgw
 cat 7-4-CGW/stat/final0.*Scaffolds.nodelength.cgm | grep -v ^Scaff | getSummary.pl -t scf
 cat 7-4-CGW/stat/final0.PlacedContig.n | grep -v ^Scaff | getSummary.pl -t scf
            elem       min        max        mean       med        n50        sum            
 scf        66141      432        26541374   42648      1347       4349378    2820819506     
 ctg        120461     65         627705     22421      2018       84989      2700959854

QC stats

 TotalScaffolds=66,141
 MaxBasesInScaffolds=26,048,998
 MeanBasesInScaffolds=40,861
 
 TotalContigsInScaffolds=120,461
 MaxContigLength=627,911
 MeanContigLength=22,436
 
 TotalDegenContigs=269,031
 MaxDegenContig=33,824
 
 SingletonReads=3,721,123
  • Posmap info
 cat bt.posmap.mates | awk '{print $3}' |count.pl -p 100
 good            10338164
 bothChaff       1160137
 oneChaff        695982
 oneSurrogate    233151
 bothDegen       218198
 diffScaffold    150423
 badShort        138464
 oneDegen        118232
 badLong         23196
 badSame         22451
 badOuttie       8751
 bothSurrogate   589
 total           13107738
 cat bt.posmap.frags | awk '{print $4,$5}' |count.pl  -p 100
 placed good             20676328
 placed notMated         8007072
 chaff bothChaff         2320274
 chaff notMated          704849
 placed oneChaff         695982
 chaff oneChaff          695982
 placed oneSurrogate     466302
 placed bothDegen        436396
 placed diffScaffold     300846
 placed badShort         276928
 placed oneDegen         236464
 placed badLong          46392
 placed badSame          44902
 placed badOuttie        17502
 placed bothSurrogate    1178
 total                   34927397

Log files

Analysis

Insert libraries

1. BCM.WGS : ok

  • FRG.mea: 1750-7000
  • ASM.mea: 1594-6727
  • Most libs have > 1000 reads & get reestimated
  • All libs have ASM.std< ASM.mea/3

2. BCM.SHOTGUN

  • only ~ 50% of the inserts are in libs with >1000 inserts and get reestimated by the assembly
  • if the thold is dropped from 1000 to 100, we'd get ~ 95% of the inserts reestimated
            elem       <0         0          >0         min        max        mean       median     n50        sum
 0          7339       0          0          7339       1          11237      245        135        1137       1799069
 100        4361       0          0          4361       100        11237      395        157        1252       1725604
 1000       440        0          0          440        1008       11237      2075       1791       2323       913086

3. NISC.SHOTGUN: ok

  • Most libs have > 1000 reads & get reestimated
  • All libs have ASM.std< ASM.mea/3

4. BCCAGSC.CLONEEND: ok

 LIB.id  FRG.mea FRG.std FRG.count  CENTER.TYPE        ASM.mea ASM.std
 125606  150000  30000   59505      BCCAGSC.CLONEEND   161998  20133

5. UIUC.CLONEEND: ok

 LIB.id   FRG.mea FRG.std FRG.count CENTER.TYPE        ASM.mea ASM.std
 114892   150000  30000   31063     UIUC.CLONEEND      175594  41208
 115020   150000  30000   15256     UIUC.CLONEEND      162488  26358

6. TIGR.CLONEEND: originally wrong; gets reestimated

 LIB.id   FRG.mea FRG.std FRG.count CENTER.TYPE        ASM.mea ASM.std
 65177    2000    600     27067     TIGR.CLONEEND      161761  34938

7. GSC.CLONEEND: not used (all 53556 are 0 qual)

8. CENARGEN.WGS: "not used" (all 26246 are unmated)

9. BARC.CLONEEND: each library contains 1 template id => inserts did not get reestimated (25454 reads/11151 inserts)

10. BCM cloneend: ok

 LIB.id   FRG.mea FRG.std FRG.count CENTER.TYPE        ASM.mea ASM.std
 19070    167000  25000   7103      BCM.CLONEEND       171244  18555

11. CENARGEN.CLONEEND: large stdev

 LIB.id   FRG.mea FRG.std FRG.count CENTER.TYPE        ASM.mea ASM.std
 17249    202000  20200   6269      CENARGEN.CLONEEND  158938  55165

12. UOKNOR.SHOTGUN: ok ?

 LIB.id   FRG.mea FRG.std FRG.count CENTER.TYPE        ASM.mea ASM.std
 15158    3000    1000    4910      UOKNOR.SHOTGUN     3000    1000

13. TIGR_JCVI.CLONEEND: originally wrong; gets reestimated

 LIB.id   FRG.mea FRG.std FRG.count CENTER.TYPE        ASM.mea ASM.std
 10691    2500    750     2763      TIGR_JCVI.CLONEEND 160363  29580
 10738    2500    750     2040      TIGR_JCVI.CLONEEND 161915  29343

14. UOKNOR.FINISHING: only 151 reads

15. WUGSC.CLONEEND: only 49 reads

Contigs Vs UMD2 contaminants & Ecoli

 4865 contigs in list.exclude_contigs.fa
 
34404 exclude-ctg.qry_hits
 3763 exclude-ctg.ref_hits
 
 1204 exclude-ctg.CBE.qry_hits   CONTAIN|IDENTITY|BEGIN|END
  748 exclude-ctg.CBE.ref_hits   CONTAIN|IDENTITY|BEGIN|END
 559 Ecoli.365350-365744-ctg.qry_hits : max ctg aligned is 179K bp; 10 are > 10K bp

Contigs Vs UMD2 chromosomes

  • Split 120,461 contigs into 100 files; degeneartes not split
  • Align them to the 31 chromosomes 1..30,U (ref) => 101*31 jobs
 #Alignment stats
 cat chr*ctg*delta | grep "^>" | awk '{print $2}' | count.pl -f ../9-terminator/bt.ctg.infoseq | getSummary.pl -i 1 -z 1
 ctg        0          1          >1         min        max        mean       med        n50        sum
 120461     652        37540      82269      0          176        11         3          33         1422808
 #Unaligned ctg lengths
 ctg        min        max        mean       med        n50        sum
 652        65         5849       1146       1134       1194       747295
 
  • 50% of the contigs aligned uniquely
 cat chr*-ctg*.delta | ~/bin/mergeDelta.pl   >  chr-ctg.delta
                                                             # degens? 
 delta-filter -q  chr-ctg.delta              >>  chr-ctg.filter-q.delta
 cat chr1-*.delta | ~/bin/delta2cvg.pl -M 0 | getSummary.pl -i 4
 elem       0          >0         min        max        mean       med        n50        sum
 6681       1          6680       0          12892      366        142        1095       2450106
  • There are disagreements:
 /fs/sz-user-supported/Linux-x86_64/bin/show-coords -l -r -H chr1-ctg.filter-q.delta  | p 'print $F[-1],"\n";' | count.pl | head
 ctg7180001761585        24
 ...
 ctg7180001634116        7
 ...
 show-coords -d chr1-ctg.filter-q.delta | grep ctg7180001761585 | p 'print "  $_";'
 142115744 142188863  |   383463   310345  |    73120    73119  |    99.98  | 157714772   383463  |     0.05    19.07  |  1 -1  chr1   ctg7180001761585
 142188878 142286012  |   310361   213223  |    97135    97139  |    99.94  | 157714772   383463  |     0.06    25.33  |  1 -1  chr1   ctg7180001761585
 142287100 142287675  |   212133   211556  |      576      578  |    98.27  | 157714772   383463  |     0.00     0.15  |  1 -1  chr1   ctg7180001761585
 142288052 142288602  |   211182   210633  |      551      550  |    99.09  | 157714772   383463  |     0.00     0.14  |  1 -1  chr1   ctg7180001761585
 142288652 142295709  |   210586   203531  |     7058     7056  |    99.87  | 157714772   383463  |     0.00     1.84  |  1 -1  chr1   ctg7180001761585
 142295709 142342174  |   203512   157047  |    46466    46466  |   100.00  | 157714772   383463  |     0.03    12.12  |  1 -1  chr1   ctg7180001761585
 142346440 142367791  |   156958   135606  |    21352    21353  |    99.99  | 157714772   383463  |     0.01     5.57  |  1 -1  chr1   ctg7180001761585
 142367822 142370681  |   135597   132737  |     2860     2861  |    99.93  | 157714772   383463  |     0.00     0.75  |  1 -1  chr1   ctg7180001761585
 142370660 142382289  |   132746   121117  |    11630    11630  |    99.88  | 157714772   383463  |     0.01     3.03  |  1 -1  chr1   ctg7180001761585
 142382282 142411927  |   120984    91339  |    29646    29646  |    99.96  | 157714772   383463  |     0.02     7.73  |  1 -1  chr1   ctg7180001761585
 142411941 142419553  |    91339    83728  |     7613     7612  |    99.66  | 157714772   383463  |     0.00     1.99  |  1 -1  chr1   ctg7180001761585
 142419553 142434546  |    83721    68728  |    14994    14994  |    99.79  | 157714772   383463  |     0.01     3.91  |  1 -1  chr1   ctg7180001761585
 142434506 142437288  |    68778    65996  |     2783     2783  |    99.86  | 157714772   383463  |     0.00     0.73  |  1 -1  chr1   ctg7180001761585
 142437389 142439015  |    66757    65131  |     1627     1627  |    99.94  | 157714772   383463  |     0.00     0.42  |  1 -1  chr1   ctg7180001761585
 142439271 142440703  |    65629    64197  |     1433     1433  |   100.00  | 157714772   383463  |     0.00     0.37  |  1 -1  chr1   ctg7180001761585
 142441869 142442975  |    63548    62442  |     1107     1107  |   100.00  | 157714772   383463  |     0.00     0.29  |  1 -1  chr1   ctg7180001761585
 
 142446690 142449325  |    30312    32945  |     2636     2634  |    99.58  | 157714772   383463  |     0.00     0.69  |  1  1  chr1   ctg7180001761585
 142451384 142452476  |    63510    64603  |     1093     1094  |    99.91  | 157714772   383463  |     0.00     0.29  |  1  1  chr1   ctg7180001761585
 142452577 142454379  |    61000    62806  |     1803     1807  |    99.78  | 157714772   383463  |     0.00     0.47  |  1  1  chr1   ctg7180001761585
 142454487 142456821  |    59122    61456  |     2335     2335  |   100.00  | 157714772   383463  |     0.00     0.61  |  1  1  chr1   ctg7180001761585
 142458383 142459582  |    57978    59177  |     1200     1200  |   100.00  | 157714772   383463  |     0.00     0.31  |  1  1  chr1   ctg7180001761585
 142459738 142472295  |    32272    44828  |    12558    12557  |    99.92  | 157714772   383463  |     0.01     3.27  |  1  1  chr1   ctg7180001761585
 142472300 142485640  |    44828    58163  |    13341    13336  |    99.89  | 157714772   383463  |     0.01     3.48  |  1  1  chr1   ctg7180001761585
 
 142501686 142530021  |    28336        1  |    28336    28336  |    99.99  | 157714772   383463  |     0.02     7.39  |  1 -1  chr1   ctg7180001761585
 show-coords -d chr1-ctg.filter-q.delta | grep ctg7180001634116
 116312   162914  |        1    46603  |    46603    46603  |    99.99  | 157714772   122722  |     0.03    37.97  |  1  1  chr1       ctg7180001634116
 
 164916   201988  |    58062    95135  |    37073    37074  |    99.99  | 157714772   122722  |     0.02    30.21  |  1  1  chr1       ctg7180001634116
 
 203244   213377  |    48198    58331  |    10134    10134  |   100.00  | 157714772   122722  |     0.01     8.26  |  1  1  chr1       ctg7180001634116
 261393   264506  |    45949    49062  |     3114     3114  |   100.00  | 157714772   122722  |     0.00     2.54  |  1  1  chr1       ctg7180001634116
 
 264607   268579  |    94345    98317  |     3973     3973  |   100.00  | 157714772   122722  |     0.00     3.24  |  1  1  chr1       ctg7180001634116
 268586   274734  |    98323   104471  |     6149     6149  |   100.00  | 157714772   122722  |     0.00     5.01  |  1  1  chr1       ctg7180001634116
 274835   293945  |   103611   122722  |    19111    19112  |    99.99  | 157714772   122722  |     0.01    15.57  |  1  1  chr1       ctg7180001634116
 ~/bin/delta2breaks.pl -m 200 < chr1-ctg.filter-q.delta | awk '{print $8}' | count.pl
 AGREEMENT       9827
 INVERSION       283
 TRANSLOCATION+  230
 TRANSLOCATION-  154
 
 ~/bin/delta2breaks.pl -m 1000 < chr1-ctg.filter-q.delta | awk '{print $8}' | count.pl
 AGREEMENT       7564
 INVERSION       216
 TRANSLOCATION+  192
 TRANSLOCATION-  127
 
 ~/bin/delta2breaks.pl -m 10000 < chr1-ctg.filter-q.delta | awk '{print $8}' | count.pl
 AGREEMENT       3394
 INVERSION       62
 TRANSLOCATION+  50
 TRANSLOCATION-  29

Assembly UMD2.2 (Quality reads)

  • Try to add the missing BCM.SHOTGUN reads at the assembly
  • Assign new BCM.SHOTGUN library ID's base on volume & SEQ_LIB_ID : same library might have different insert size in different volume => might loose some correct mates from different volumes
 cat bos_taurus.summary | grep BCM | grep SHOTG | cut -f6,7,8,10 | sort | more
 FAAEP   180000  13000   252
 FAAEP   2000    1000    84
 ...
 FAAHP   180000  13000   77
 FAAHP   2000    1000    230
 ...
  • => 20,538 libraries out of which 18,208 contain mated reads
  • create DST messages & add them to gkpStore
 gatekeeper -a -o bt.gkpStore -T -F  bos_taurus.BCM.SHOTGUN.new.DST
  • generate gatekeeper edit file that maps each TI to the new library id
 head bos_taurus.BCM.SHOTGUN.new.ti2libinfo.edit
   frg uid 499507131 libuid 601081 
   frg uid 499507132 libuid 601081
   ...
  • generate gatekeeper edit file that deletes all mate information
 head bos_taurus.BCM.SHOTGUN.new.mate.delete
   frg uid 500086180 mateuid 0
   frg uid 500084310 mateuid 0
   ...
  • pair forward/reverse read that have the same new library id, same TEMPLATE_ID
 head bos_taurus.BCM.SHOTGUN.new.mate.edit
   frg uid 583866821 mateuid 583872364
   frg uid 583866822 mateuid 583872408
   ...
  • run gatekeeper --edit for each edit/delete file
 gatekeeper --edit ...  bt.gkpStore 
  • restart assembly at cgw (doExtendClearRanges=1)
  • consensus after cgw failed on job 25 on CTG 5597062 : cannot create consensus from multialignment ...
 Fix: delete failed message
 cp bt.cgw_contigs.25 bt.cgw_contigs.25.FAILED
 delete "{ICM acc:5597062 pla:P len:20889 ..." from bt.cgw_contigs.25
  • terminator fail; message:
 ICL: reference before definition error for contig ID 5597062

Assembly UMD2.3 (2009_0210_CA; all reads)

  • 35,973,728 reads : 35,348,776 quality & 624,952 quality-less
  • 16,896,244 mates
  • 25,312 libraries

Issues (not solved):

  • 10420 contain at least 1 "NN" in their clr (50.. min(len,600))
  • 5973 contain at least 1 "NNN" in their clr (50.. min(len,600))

Quality-less clrs

  • 624,952 quality-less reads
  • Quality-less read stats:  : alignment CLR or 50..min(len,600) trimming
            elem       min        max        mean       median     n50        sum
 len        624952     5          1495       887        947        961        554429198
 5          624952     6          1584       51         51         51         32150411
 3          624952     5          1495       695        699        699        434960697
 53         624952     -1579      1444       644        648        648        402810286

  • Align 624,952 to the 120,461 Assembly1 contigs (no degenerates) : 1 day on 13 cpus
  • 572,140(91.5%) reads aligned and 52,812(8.5%) did not align to the contigs

1. Launch jobs in parallel: 12766 jobs on 13 processors

 nucmer -l 50 -c 200 -b 10 -g 5 -d 0.05 bt.ctg.001.fasta  bos_taurus.0qual.01.seq -p ctg.001-seq.01
 ...
 nucmer -l 50 -c 200 -b 10 -g 5 -d 0.05 bt.ctg.982.fasta  bos_taurus.0qual.13.seq -p ctg.001-seq.01
  • CPU usage: 100% /job
  • Max mem usage: 0.1% /job

2. Get maximum extended clrs

 cat *delta | ~/bin/delta2qryClr.pl -best | sort > bos_taurus.0qual.best.clr
 Length stats
            elem       min        max        mean       median     n50        sum
 all        624952     5          1495       887        947        961        554429198
 aligned    572140     221        1416       912        953        964        522281354
 unaligned  52812      5          1495       608        580        754        32147844
 Best/Max/Max+extended alignment coord stats:
            elem       min        max        mean       median     n50        sum
 53.best    572140     94         1208       766        841        877        438793102
 53.max     572140     170        1208       794        863        888        454816817
 53.extend  572140     170        1208       797        865        889        456014184
 Unaligned read counts:                           
                          unaligned    total   quality   quality-less
 BCM.WGS                  42595
 UOKNOR.SHOTGUN           5787         14651   2456      12195
 GSC.CLONEEND             2294         53521   0         53521
 BCCAGSC.CLONEEND         1869         125241  116484    8757
 BCM.SHOTGUN              186
 UOKNOR.FINISHING         81
  • 52,812 quality-less unaligned reads to the contigs using less strict nucmer parameters: -l 30 -c 50 -b 50 -g 50 -d 0.12
  • 9,269 reads aligned at an average 92% identity (min 81% identity)  : not too good

3. Get reads without clrs: set their clr to maximum 50..600

 cp bos_taurus.0qual.extended.clr bos_taurus.0qual.clr
 difference.pl bos_taurus.0qual.infoseq bos_taurus.0qual.extended.clr | perl -ane '$three=600; $three=$F[1] if ($F[1]<600); print "$F[0] 50 $three\n";' >> bos_taurus.0qual.clr

Quality clrs

  • Use Assembly1 OBT clrs
  • Delete reads deleted in the OBT process

Gatekeeper

Load order:

  1. Add quality FRG : "gatekeeper -T -F ..."
  2. Add quality-less FRG "gatekeeper -F -a ..." # -T should be removed
  3. Delete quality FRG (deleted by UMD2.1 OBT)
  4. Add DST
  5. Add LKG

Edit

  1. Loads clrs
  2. Loads clvs
  3. Loads nonrandom info

Meryl

Use Assembly1 kmer counts

Overlapper

  • Use 80/90 Assembly1 overlap results
  • Rerun 10 overlap jobs
  • 96.64% of the quality-less reads have overlaps (vs 98.33% of the quality reads)
                  reads      0ovl       1+ovl      min        max        mean       median     n50        sum
 0qual(all)       624831     20941      603890     0          4350       96         19         740        60494730         # 96.64%
 0qual(unaligned) 52691      15384      37307      0          3229       50         5          349        2655545          # 70.80%

Unitigger

  • More unitigs, more bases in unitigs
  • Few of the longest unitigs got broken: Example 138,294(UMD2.3) vs 159,434(UMD2.1)
 UNITIG OVERLAP GRAPH INFORMATION
 
       5333434 : Total number of unitigs
       2595174 : Total number of singleton, contained unitigs
       1865473 : Total number of singleton, non-contained unitigs
        183693 : Total number of non-singleton, spanned unitigs
        689094 : Total number of non-singleton, non-spanned unitigs
      35551316 : Total number of fragments
      35551316 : Total number of fragments in all unitigs
      21830994 : Total number of essential fragments in all unitigs
      13720322 : Total number of contained fragments in all unitigs
  0.0077856472 : Randomly sampled fragment arrival rate per bp
    2514833413 : The sum of overhangs in all the unitigs
    6483064813 : Total number of bases in all unitigs
             0 : Estimated number of base pairs in the genome.
             0 : Total number of contained fragments not connected
                 by containment edges to essential fragments.
 Total rho    = 2514833413
 Total nfrags = 19579606
 Estimated genome length = 0
 Estimated global_fragment_arrival_rate=0.007786
 Computed global_fragment_arrival_rate =0.007786
 Total number of randomly sampled fragments in genome = 23870254
 Computed genome length  = 3065930496.000000
 Used global_fragment_arrival_rate=0.007786
 Used global_fragment_arrival_distance=128.441474
  
 Histogram of the number of base pairs in a chunk
 100406 - 138294:    21
 90330 -  99887:     23
 80042 -  89675:     79
 70014 -  79943:    169
 60002 -  69792:    374
 50000 -  59982:   1008
 40002 -  49995:   2440
 30001 -  39994:   6509
 20000 -  29999:  18989
 10000 -  19999:  57404


Consensus after unitigger

Problems:

  • job 120 executed partially (see bt_120.cgi_tmp); Solution: split into 3 parts, run separately, merge results
  • failed on 19 unitigs (587..7447 bp)
 rm 5-consensus/*failed
 touch 5-consensus/consensus.success

Cgw

  • Failure 1 : because job 120 was run partially => missing mates
  • Failure 2 : because of /5-consensus/FAILED/bt_???.cgi.failed => missing mates => delete 356 mates
 Error:
   ProcessFrags()-- WARNING!  fragiid=35973388,index=33600942 mateiid=35973363,index=0 -- MATE DOESN'T EXIST!
   cgw: Input_CGW.c:117: ProcessFrags: Assertion `err == 0' failed.
 Fix:
   cat cgw.out | grep MATE | p '/mateiid=(\d+)/; print $1,"\n";' >! cgw.out.mateiid
   gatekeeper -dumpfragments -tabular -iid cgw.out.mateiid bt.gkpStore/ | cut -f1,3 | ~/bin/mate2lkg.pl -a D >! cgw.out.delete.LKG
   gatekeeper -a -o bt.gkpStore -T -F -L cgw.out.delete.LKG
  • Failure 3: because of cgwOutputIntermediate=1
 Try to restart from ckp : die with assertion failure
 cgw -y -R 8 -N 12 -j 1 -k 5 -r 5 -s 2 -S 0 -z -m 100 -g ./bt.gkpStore -o ./7-0-CGW.8_12/bt ./5-consensus/bt_001.cgi
 cgw -y -R 8       -j 1 -k 5 -r 5 -s 2 -S 0 -z -m 100 -g ./bt.gkpStore -o ./7-0-CGW.8_12/bt ./5-consensus/bt_001.cgi
 
 Fix: Restart cgw from the beginning
  • cgw does update bt.SeqStore - OpenSequenceDB()

ECR (eventually skipped)

  • Failed after running for 1 day
 /fs/szdevel/dpuiu/SourceForge/wgs-5.2/Linux-amd64/bin/extendClearRanges  -g ./bt.gkpStore  -n 15  -c bt  -b 146216 -e 167100  -i 1  > 7-1-ECR/extendClearRanges-scaffold.146216.err 
 sh: line 1: 17016 Aborted 
  • Last ckp : bt.ckp.15
  • Try to fix:
 touch 7-1-ECR/cgw.success
 runCA "doExtendClearRanges = 1"
  • Runs too slow !!!
  • Can specify a scaffold range to process: -b ? -e ? => ckp files; could we merge them?
  • Failed after running for 1 day

Consensus after CGW

  • Failed on job 56
 tail 8-consensus/bt.cns_contigs.56.err 
 ...
 Could (really) not find overlap between 153923 (U) and 2508303 (R) estimated ahang: 0 (ejecting frag 2508303 from contig)
 consensus: math_AS.h:51: ceil_log2: Assertion `x > 0' failed.
 cat 7-CGW/bt.cgw_contigs.56  | countMessages.pl
 ICM     440
 IMP     281412
 IUP     12715
 
 cat 8-consensus/bt.cns_contigs.56_tmp | countMessages.pl 
 ICM     115
 IMP     103322
 IMV     8122
 IUP     4849
  • Fix: split ICM messages 1..115,116,116+ and run consensus on each set

QC

            elem       min        max        mean       med        n50        sum            
 scf        56891      407        33129045   50871      1378       4716077    2894145150     
 ctg        122851     64         651167     21957      3647       71561      2697514858     
 deg        268237     65         30246      1019       985        997        273575106
  • Compared with UMD2.1 : better scaffols, worse contigs & unitigs

Analysis

Issues:

  • Identify bacterial & mito contigs: mito seq
  • Align ctg&degen to UMD2 chromosomes
    • the chromosomes should have no 0cvg regions
    • possible inversions, translocations (UMD2 used markers)
    • if align breaks/indels, which assembly is correct?

Assembly UMD2.4 (2004_0217_CA; All reads)

  • 35,973,728 reads : 35,348,776 quality & 624,952 quality-less
  • 16,896,244 mates
  • 344 libraries

Fix quality-less read clrs (N's) (temporary solution)

  • 10420 contain at least 1 "NN" in their clr (50.. min(len,600))
  • 5973 contain at least 1 "NNN" in their clr (50.. min(len,600))

Fix:

 frg2seq.pl < bos_taurus.0qual.frg > bos_taurus.0qual.seq
 fasta2qual.pl bos_taurus.0qual.seq > ! bos_taurus.0qual.qual
 lucy \
    -o bos_taurus.0qual.lucy.seq  bos_taurus.0qual.lucy.qual \
    -debug  bos_taurus.0qual.lucy.info \
    bos_taurus.0qual.seq bos_taurus.0qual.qual
 cat bos_taurus.0qual.lucy.info | cut -f1,3,4 -d ' ' | sort >! bos_taurus.0qual.lucy.clr
  • 624,952 quality-less reads
  • Quality-less read stats: 50..min(len,600) & lucy trimming
            elem       0          >0         min        max        mean       med        n50        sum
 5          624952     2857       622095     0          501        52         52         52         33012433
 3          624952     2857       622095     0          600        579        600        600        361980208
 53         624952     2857       622095     0          548        526        548        548        328967775

Fix quality-less read clrs (low complexity)

  • Run dust filter on seq (before qual & lucy)
            elem       0          >0         min        max        mean       med        n50        sum            
 5          624952     3564       621388     0          501        75         52         52         47385578       
 3          624952     3564       621388     0          600        554        600        600        346470473      
 53         624952     3564       621388     0          548        478        548        548        299084895
  • Merge dust.lucy clrs with the alignment clrs
            elem       0          >0         min        max        mean       med        n50        sum
 5          624952     4488       620464     0          599        93         52         126        58378496
 3          624952     4488       620464     0          600        547        600        600        342258160
 53         624952     4488       620464     0          548        454        512        548        283879664
  • Test seq
 gatekeeper -dumpfastaseq -b 35348777 -e  35973728 bt.gkpStore  | grep NNN
 gatekeeper -dumpfastaseq  bt.gkpStore   | perl -ane 'if(/^>(\d+)/) { $id=$1} elsif(/NNN/) { print $id,"\n";} ' | uniq -c | awk '{print $2,$1}'  > bt.NNN.seqs  # 2411 seqs (all have the N's "in the middle")
 gatekeeper -dumpfastaseq -uid bt.NNN.seqs bt.gkpStore >  bt.NNN.cseqs

Consolidate libraries

Drop from 25,312 to 344 libs

BCM.SHOTGUN

UMD2.4 reestimated 10,117 out of 13,826 libs (have > 100mates)

Base on initial estimates

  • Reduce the total number from 13826 to 2 libs: 3000 & 6000
  • UMD2.3 mean estimates (Initial vs Final):
 meanI      #libs      minF       maxF       meanF      medF       n50F       sumF           uid
 180000     436        1636       5199       2475       2410       2458       1079407        #3000
 167000     86         1585       2948       2264       2258       2285       194775         #3000        
 
 6500       31         5212       6636       5837       5867       5924       180951         #6000        
 6000       11         4556       6272       5389       5421       5421       59286          #6000
 
 3500       949        1670       4769       2668       2608       2645       2532027        #3000
 3000       2511       1483       5250       2715       2662       2723       6818678        #3000
 2000       6093       1157       6443       2526       2487       2554       15391160       #3000

Base on final estimates

  • Reduce the total number from 13826 to 7 libs: 6500,5500,...1500, un-estimates (2501)
 meanF        #libs      min        max        mean       med        n50        sum         uid(new)     mean(new)  std(new)
 6K<=mea<7K   15         6010       6636       6176       6159       6159       92650       6500         6500
 5K<=mea<6K   29         5121       5985       5540       5536       5577       160673      5500         5500
 4K<=mea<5K   67         4017       4939       4284       4266       4274       287072      4500         4500
 3K<=mea<4K   1401       3000       3998       3276       3209       3226       4590323     3500         3500
 2K<=mea<3K   7998       2000       2999       2502       2498       2532       20017767    2500         2500
 1K<=mea<2K   607        1157       1999       1825       1882       1890       1107798     1500         1200
 un-estimated 3709                                                                          2501         2501

BARC.CLONEEND

Collapse all 11150 into 1:

uid:25456
mea:165000
std:43000

Overlapper

  • Quality-less reads overlaps: fewer than in the UMD2.3 assembly
                  elem       0          >0         min        max        mean       med        n50        sum           
 0qual(all)       624830     35692      589138     0          3237       60         14         439        37578899         # 94.39%

Unitigger

 UNITIG OVERLAP GRAPH INFORMATION    
       5356408 : Total number of unitigs
       2613795 : Total number of singleton, contained unitigs
       1870448 : Total number of singleton, non-contained unitigs
        182878 : Total number of non-singleton, spanned unitigs
        689287 : Total number of non-singleton, non-spanned unitigs
      35547861 : Total number of fragments
      35547861 : Total number of fragments in all unitigs
      21685943 : Total number of essential fragments in all unitigs
      13861918 : Total number of contained fragments in all unitigs
  0.0077797328 : Randomly sampled fragment arrival rate per bp
    2513424271 : The sum of overhangs in all the unitigs
    6468428782 : Total number of bases in all unitigs
             0 : Estimated number of base pairs in the genome.
             0 : Total number of contained fragments not connected
                 by containment edges to essential fragments.
 Total rho    = 2513424271               
 Total nfrags = 19553770
 Estimated genome length = 0
 Estimated global_fragment_arrival_rate=0.007780
 Computed global_fragment_arrival_rate =0.007780     
 Total number of randomly sampled fragments in genome = 23868770
 Computed genome length  = 3068070656.000000          
 Used global_fragment_arrival_rate=0.007780             
 Used global_fragment_arrival_distance=128.539119

Histogram of the number of base pairs in a chunk
100292 - 138301:     19
 90052 -  99906:     23
 80043 -  89676:     79
 70013 -  79966:    164
 60010 -  69988:    390
 50008 -  59983:    949
 40000 -  49998:   2433
 30000 -  39997:   6437
 20000 -  29999:  18808
 10000 -  19999:  57634

Bog

!!! Much bigger unitigs than default unitigger

Global Arrival Rate: 0.013829
212260 - 224992:      4 
100099 - 186873:    372 
 90015 -  99973:    353 
 80045 -  89988:    582 
 70011 -  79999:   1084 
 60000 -  69994:   1856 
 50001 -  59996:   3162 
 40002 -  49994:   5407 
 30000 -  39999:   9767 
 20000 -  29996:  18981 
 10000 -  19999:  39641

Consensus after Unitigger

  • Failed on jobs 120 & 121 ( _tmp file)
 cat 4-unitigger/*120*  | countMessages.pl
 IMP     280264
 IUM     124707

 cat 4-unitigger/*121* | countMessages.pl
 IMP     282146
 IUM     245650
 cat 5-consensus/bt_120.cgi_tmp  | countMessages.pl
 IMP     34348
 IUM     19222
 
 cat 5-consensus/bt_121.cgi_tmp | countMessages.pl 
 IMP     51833
 IUM     16805
  • Fix 120: split IUM messages
 extractfromfrgMSG.pl -b 0 -e 19222 bt_120.cgb.orig IUM >! bt_120.cgb &
 extractfromfrgMSG.pl -b 19222 bt_120.cgb.orig IUM >! bt_120.cgb & 
  • Fix 121: remove assertion in AS_CNS/MultiAlignment_CNS.c
 if(to <= from || to > ma_length-1){
   fprintf(stderr, "AbacusRefine range (to) invalid");
   //assert(0); 
 }

CGW

  • Failed after Ckp3(7-0-CGW/bt.ckp.3; MergeScaffoldsAggressive 2nd itteration)
 CI extends beyond end of scaffold!
 offsetAEnd = 254204 offsetBEnd = 252250 scaffoldLength = 253268
 cgw: CIScaffoldT_Merge_CGW.c:307: InsertScaffoldContentsIntoScaffold: Assertion `0' failed.
  • Last cgw
 Scaffold lengths:
 cat 7-4-CGW/stat/final0.*Scaffolds.nodelength.cgm | grep -v ^Scaff | getSummary.pl -t scf
 cat 7-4-CGW/stat/final0.PlacedContig.n | grep -v ^Scaff | getSummary.pl -t scf
            elem       min        max        mean       med        n50        sum            
 scf        45826      385        34263871   59591      1349       7059820    2730853790     
 ctg        96562      65         738899     27789      3657       93988      2683452359
 Library insert estimates:
 cat 7-4-CGW/stat/scaffold_final.distupdate.dst | grep ^# | awk '{print $3,int($8),int($10)}' > 7-4-CGW/bt.dst
 join2.pl bt.dst 7-4-CGW/bt.dst | p 'print join "\t",@F[0,1,2,5,6,3,4]; print "\n";' > bt.dst.combine
 CLONEEND inserts:
 UID           MEANI   STDI    MEANF   STDF    COUNT   LIB
 114892        150000  30000   175701  40732   31063   UIUC.CLONEEND
 19070         167000  25000   171349  18253   7103    BCM.CLONEEND
 118           167000  16700   167000  16700   21      WUGSC.CLONEEND
 25456         165000  43000   163044  25849   11150   BARC.CLONEEND
 115020        150000  30000   162719  25343   15256   UIUC.CLONEEND
 65177         2000    600     162540  34155   27067   TIGR.CLONEEND
 125606        150000  30000   162396  19319   59505   BCCAGSC.CLONEEND
 10738         2500    750     162386  27567   2040    TIGR_JCVIJTC.CLONEEND
 10691         2500    750     161540  28239   2763    TIGR_JCVIJTC.CLONEEND
 17249         202000  20200   157496  55375   6269    CENARGEN.CLONEEND
 54017         120000  12000   115671  27594   25889   GSC.CLONEEND
 total                                         188126  CLONEENDs

Consensus

  • Failed on job 34 with segmentation fault
  • 9kbp contig, made out of 3007 reads (24 of which are quality-less)
 cat 7-CGW/bt.cgw_contigs.34.1 | grep "^{" | uniq -c | awk '{print $2,$1}'
 {ICM 1
 {IMP 3007
 {IUP 329
  • Fix : edit AS_CNS/MultiAlignment_CNS.c; add
 if(!ungappedSequence->Elements) { ungappedSequence->numElements=0; }
 if(!ungappedQuality->Elements) { ungappedQuality->numElements=0; }


Analysis

Contigs Vs possible contaminants

  • nucmer alignment parameters: -l 40 -c 100 -b 10 -g 5 -d 0.05
  • have to redo alignments using -maxmatch !!!
  • file location:
 reference seqs:
   /nfshomes/dpuiu/db/Ecoli.365350-365744                     # Ecoli K12 region with most alignments (BCM WGS splice site)
   /nfshomes/dpuiu/db/Ecoli                                   # Ecoli K12 substrain MG1655  (NC_000913 ; 1st completed)
   /nfshomes/dpuiu/db/Ecoli.all                               # 22 Ecoli completed genomes ( + plasmids)
 
   /nfshomes/dpuiu/db/UniVec_Core                             # UniVec Core seqs 
   /nfshomes/dpuiu/db/OtherVec                                # 100 other vector sequences identified by aligning UMD2.0 contaminants to GenBank; align also to 110 UniVec core using nucmer (params above)
   /nfshomes/dpuiu/db/bos_taurus.UMD2.contaminant.fasta       # 4813 whole contigs and 30329 contig regions identified by NCBI as UMD2 contamination
   /nfshomes/dpuiu/db/bos_taurus.UMD2.contaminant.organism_count  # organism counts: vector is the most abundant
 
   /nfshomes/dpuiu/db/bos_taurus.UMD2.contaminant.infoseq      # grep -v 'coli|vector|7180003101029' => 905 other contamiants
 
 query seqs:
   /scratch1/bos_taurus/Assembly/2009_0217_CA/9-terminator/ctg.split100/*fasta  # latest assembly contigs (no degenerates)
 
 delta files: 
   /scratch1/bos_taurus/Assembly/2009_0217_CA/nucmer_ctg/no_maxmatch/*delta

Ecoli K12 substrains:

 NC_010473.1    4686137 50.78  Escherichia coli str. K-12 substr. DH10B, complete genome
 NC_000913.2    4639675 50.79  Escherichia coli str. K-12 substr. MG1655, complete genome
 AC_000091.1    4646332 50.80  Escherichia coli str. K-12 substr. W3110, complete genome

no maxmatch

  • fewer alignments in UMD2.4 than in UMD2

UMD2 (all): just a few degens

 15102 Ecoli.365350-365744-ctg.qry_hits
 15943 Ecoli-ctg.qry_hits
 17308 Ecoli.all-ctg.qry_hits
 79065 UMD2.contaminant-ctg.qry_hits     # 55877 new hits
 20105 UMD2.contaminant-ctg.CBE.qry_hits # CONTAIN|BEGIN|END|IDENTITY
 19839 UniVec_Core-ctg.qry_hits

UMD2.4

   559 Ecoli.365350-365744-ctg.qry_hits
  1215 Ecoli-ctg.qry_hits
  2767 Ecoli.all-ctg.qry_hits             # most 2 frequenct starins are UMN026 & ATCC 8739; K12 DH10B is rank 5th; K12 MG1655 is ranked 19th (out 31 seqs)
 44112 UMD2.contaminant-ctg.qry_hits
  5286 UniVec_Core-ctg.qry_hits

Length of the reference seqs used for screening:

                         #seqs      min        max        mean       med        n50        sum
 Ecoli.365350-365744     1          395        395        395        395        395        395                  # Ecoli K12 regions with most alignments (BCM WGS splice site)
 Ecoli                   1          4639675    4639675    4639675    4639675    4639675    4639675              # Ecoli K12 substrain MG1655        
 Ecoli.all               49         3306       5572075    2293320    130440     5065741    112372708            # 22 Ecoli's 
 
 UniVec_Core             1348       12         48551      243        98         967        327641          
 OtherVec                100        1702       739874     15419      5027       166744     1541984
 UMD2.contaminant        35142      48         16661      512        362        674        18022349             

Length of UMD2.4 contigs that contain contaminant (0+ bp from end):

                         #ctgs      <2000bp    >=2000bp   min        max        mean       med        n50        sum
 Ecoli.365350-365744-ctg 559        534        25         1001       179527     2467       1341       1894       1379440
 Ecoli-ctg               1215       1086       129        1001       360312     4326       1347       71372      5256540
 Ecoli.all-ctg           2767       2455       312*       1001       453627*    8031       1366       134516     22224468
 UniVec_Core-ctg         5286       4718       568*       882        651163*    9820       1337       136090     51909339
 UMD2.contaminant-ctg.CBE 4976      4410       566*       738        651163*    8497       1339       122281     42281715     #annotated alignments: CONTAIN|BEGIN|END|IDENTITY
 UMD2.contaminant-ctg    44112      12813      31299      268        739442     50591      27461      111598     2231701788

Length of UMD2.4 contigs that contain contaminant in the middle (500+ bp from end):

                         #ctgs      <2000bp    >=2000bp   min        max        mean       med        n50        sum
 Ecoli.365350-365744-ctg 144        136        8          1286       2053       1779       1811       1814       256259
 Ecoli-ctg               171        152        19         1286       4703       1835       1807       1821       313820
 Ecoli.all-ctg           197        160        37*(81)    1228       351373*    6516       1815       125069     1283728     #81  2K+ ctgs using -maxmatch
 UniVec_Core-ctg         1278       1110       168*(276)  1085       651163*    12266      1496       160336     15676765    #276 2K+ ctgs using -maxmatch
 UMD2.contaminant-ctg.CBE 52         25        27*        1249       351373*    22195      2054       125069     1154142     #annotated alignments: CONTAIN|BEGIN|END|IDENTITY
 UMD2.contaminant-ctg    31019      1437       29582      1113       739442     70665      50798      113684     2191986214

Length of the UMD2.4 contaminant seqeunece (0+ bp from end):

                            #align     <200bp     >=200bp       min       max        mean       med        n50        sum
 Ecoli.365350-365744-ctg    1066       537        529        104        225        192        162        224        205379
 Ecoli-ctg                  1793       587        1206       50         4440       496        224        994        889798
 Ecoli.all-ctg              4074       1132       2942*      40         17075*     380        254        441        1551783
 UniVec_Core-ctg            14425      9819       4606*      40         1801*      236        162        325        3409187
 UMD2.contaminant-ctg       144843     96008      48835      40         16661      199        169        209        28912002

Length of the UMD2.4 contaminant seqeunece (500+ bp from end)

                            alignm     <200bp     >=200bp    min        max        mean       med        n50        sum
 Ecoli.365350-365744-ctg    243        136        107        162        224        189        162        224        46000
 Ecoli-ctg                  273        149        124        106        1341       219        162        224        59923
 Ecoli.all-ctg              294        153        141*       106        2150*      251        162        224        73992
 UniVec_Core-ctg            2144       2035       109*       50         1340*      122        121        121        261821
 UMD2.contaminant-ctg       121331     86985      34346      40         2738       171        162        184        20753580
  • Problem: 8 long ctgs contain Ecoli in the middle (1000+ bp from end)
show-coords Ecoli.all-ctg.filter-q.delta | ~/bin/filterQryCoords.pl -i 1000 | sort -nk13 -r
   [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS]
===============================================================================================================================
4640589  4641890  |   161712   160411  |     1302     1302  |    99.46  |  4686137   351373  |     0.03     0.37  | gi|170079663|ref|NC_010473.1|      ctg7180001872124
5068908  5069620  |    87386    86679  |      713      708  |    98.88  |  5209548    91972  |     0.01     0.77  | gi|218687878|ref|NC_011745.1|      ctg7180002055226
3087480  3088683  |    50423    51620  |     1204     1198  |    99.00  |  5202090    88182  |     0.02     1.36  | gi|218703261|ref|NC_011751.1|      ctg7180002054092
4640580  4641890  |    19953    18646  |     1311     1308  |    99.08  |  4686137    31157  |     0.03     4.20  | gi|170079663|ref|NC_010473.1|      ctg7180001875158
 131462   133564  |     1247     3349  |     2103     2103  |    98.19  |   241387     5751  |     0.87    36.57  | gi|157412014|ref|NC_009838.1|      ctg7180002043242
  82801    83166  |     2986     2621  |      366      366  |    98.09  |   241387     4709  |     0.15     7.77  | gi|157412014|ref|NC_009838.1|      ctg7180001714551
  82264    82793  |     3523     2994  |      530      530  |    98.49  |   241387     4709  |     0.22    11.26  | gi|157412014|ref|NC_009838.1|      ctg7180001714551
1652253  1652545  |     1487     1195  |      293      293  |    98.63  |  4700560     2492  |     0.01    11.76  | gi|218552585|ref|NC_011741.1|      ctg7180001754941
  • Regions present in DH10B but not MG1655
 delta2cvg -M 0 < DH10B-MG1655.delta
 gi|170079663|ref|NC_010473.1|   1349629 1378243 28614   0
 gi|170079663|ref|NC_010473.1|   1391006 1396986 5980    0
 gi|170079663|ref|NC_010473.1|   3199469 3200798 1329    0
 gi|170079663|ref|NC_010473.1|   3211928 3213257 1329    0
 gi|170079663|ref|NC_010473.1|   4640588 4641918 1330    0 !!!
  • Problem: 10 long ctgs contain Vector in the middle (1000+ bp from end)
show-coords UniVec_Core-ctg.filter-q.delta | ~/bin/filterQryCoords.pl -i 1000 | sort -nk13 -r
   [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS]
===============================================================================================================================
      1      121  |   215495   215615  |      121      121  |    99.17  |      170   271477  |    71.18     0.04  | gnl|uv|U09128.1:15891-16011-49     ctg7180002047604       # pSacBII P1 cloning vector  
   2252     2435  |     1334     1151  |      184      184  |   100.00  |     2485   160336  |     7.40     0.11  | gnl|uv|U75992.1:16925-19409        ctg7180001808271
    180      312  |     1153     1020  |      133      134  |    99.25  |      312   160336  |    42.63     0.08  | gnl|uv|NGB00145.1:2378-2689        ctg7180001808271
      1      121  |     1367     1487  |      121      121  |   100.00  |      170   160336  |    71.18     0.08  | gnl|uv|U09128.1:15891-16011-49     ctg7180001808271
      1      103  |     1286     1388  |      103      103  |   100.00  |      103   160336  |   100.00     0.06  | gnl|uv|U80929.2:11415-11517        ctg7180001808271        [CONTAINED]
      4      121  |    68269    68386  |      118      118  |   100.00  |      170   111913  |    69.41     0.11  | gnl|uv|U09128.1:15891-16011-49     ctg7180002052060
     40      152  |    30255    30142  |      113      114  |    99.12  |     1663    42854  |     6.79     0.27  | gnl|uv|U09128.1:1-1663             ctg7180002053344
      1      121  |    34358    34238  |      121      121  |   100.00  |      170    35471  |    71.18     0.34  | gnl|uv|U09128.1:15891-16011-49     ctg7180002046164
      1      103  |    34439    34337  |      103      103  |   100.00  |      103    35471  |   100.00     0.29  | gnl|uv|U80929.2:11415-11517        ctg7180002046164        [CONTAINED]
     46     1385  |     8928    10267  |     1340     1340  |   100.00  |     1413    17587  |    94.83     7.62  | gnl|uv|X65279.1:5941-7353          ctg7180002043597        [CONTAINED]


  • ctg7180001872124 : 351373 bp; region 160411..161712 contaminated by Ecoli
 cat 9-terminator/bt.posmap.utgctg | grep 7180001872124 | wc -l  # 329
 cat 9-terminator/bt.posmap.utgctg | grep 7180001872124 | perl -ane '@F[2,3]=@F[3,2] if($F[2]>$F[3]); print $_ if($F[2]<160411 and 160411<$F[3] or $F[2]<161712 and 161712<$F[3]);'
 7180000441625   7180001872124   159483  161201  r
 7180000441788   7180001872124   160330  161329  f  #Ecoli
 7180000442730   7180001872124   160368  161010  r  #Ecoli
 7180000441635   7180001872124   160740  162700  f  #Ecoli
 cat 9-terminator/bt.utg.info
 utg7180000441625 length=1715 num_frags=12 Astat=7.00
 utg7180000441788 length=999 num_frags=1 Astat=0.00
 utg7180000442730 length=640 num_frags=1 Astat=0.00
 utg7180000441635 length=1957 num_frags=9 Astat=7.00
 cat 9-terminator/bt.posmap.frgctg | perl -ane '@F[2,3]=@F[3,2] if($F[2]>$F[3]); print $_ if($F[2]<160411 and 160411<$F[3] or $F[2]<161712 and 161712<$F[3]);'
 1237446426      7180001872124   160117  161201  f
 1238816835      7180001872124   160133  160993  f
 1238817728      7180001872124   160322  161123  r
 1244436200      7180001872124   159976  160984  f
 1238817676      7180001872124   160105  160890  r
 1237443253      7180001872124   160106  160900  f
 1237471027      7180001872124   159930  160928  f
 1238822613      7180001872124   159774  160782  f
 1238816875      7180001872124   159878  160728  f
 1244436248      7180001872124   159483  160553  f
 1238818306      7180001872124   159718  160489  f
 1238818332      7180001872124   159722  160483  f
 1237476824      7180001872124   160330  161329  f
 1238817689      7180001872124   160368  161010  r
 1237447135      7180001872124   160740  161768  r
 1237483546      7180001872124   160814  161790  r
 1237483530      7180001872124   160818  161856  r
 1237471108      7180001872124   161003  162009  f
 1238817744      7180001872124   161151  161978  f
 1237446441      7180001872124   161050  162107  f
 1244436201      7180001872124   161117  162164  f
 1237446407      7180001872124   161586  162699  r
 1237471055      7180001872124   161652  162700  r
 # 23 BCM SHOTGUN RP42 VVHNP reads (1369 read lib; 1341 of the reads in this ctg)
  • ctg7180002047604 : Vctor in the middle
   [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS]
===============================================================================================================================
      1      121  |   215495   215615  |      121      121  |    99.17  |      170   271477  |    71.18     0.04  | gnl|uv|U09128.1:15891-16011-49     ctg7180002047604       # pSacBII P1 cloning vector 
 cat 9-terminator/bt.posmap.utgctg | grep 7180002047604  perl -ane '@F[2,3]=@F[3,2] if($F[2]>$F[3]); print $_ if($F[2]<215495 and 215495<$F[3] or $F[2]<215615 and 215615<$F[3]);'
 7180000441711   7180001872124   214458  219678  r
 cat 9-terminator/bt.posmap.frgctg | grep 7180002047604  | perl -ane '@F[2,3]=@F[3,2] if($F[2]>$F[3]); print $_ if($F[2]<215495 and 215495<$F[3] or $F[2]<215615 and 215615<$F[3]);'
 498776751       7180001872124   215425  216426  r
 1236502885      7180001872124   215514  216377  r
 379408823       7180001872124   215572  216340  f
 1244436224      7180001872124   215388  216405  f
 1237471071      7180001872124   215229  216234  r
 1233297450      7180001872124   215234  216046  f
 1233363357      7180001872124   215267  215687  f
 937200686       7180001872124   215300  216129  r
 937254901       7180001872124   215321  216160  f
 1233294025      7180001872124   215383  216204  r
 1237446444      7180001872124   215146  216187  f
 1232033776      7180001872124   215193  215996  r
 671976381       7180001872124   215035  216021  r
 514932286       7180001872124   215043  216008  f
 500723879       7180001872124   215043  215802  f
 671927656       7180001872124   215116  215733  r
 381173692       7180001872124   214947  215877  r
 1233303570      7180001872124   214963  215803  f
 1232037705      7180001872124   214990  215803  f
 490852264       7180001872124   214923  215843  f
 1237447184      7180001872124   214684  215646  f
 668822243       7180001872124   214586  215572  f
 #22 reads ; ~half come from BCM SHOTGUN RP42 VVFOP


maxmatch ctg

Parameters:

 nucmer -maxmatch -l 40 -c 100 -b 10 -g 5 -d 0.05 ...
 AllVec: UniVec_Core + 100 more vector seqs

Length of UMD2.4 contigs that contain contaminant (0+ bp from end):

                           elem       <2000      >2000      min        max        mean       med        n50        sum
 Ecoli.all                 2951*      2602       349        1001       453627     8252       1367       132226     24352779
 UniVec_Core               5387*      4802       585        882        651163     9979       1334       136556     53760575
 OtherVec                  5657       5062       595        882        651163     9726       1320       136556     55021803
 UMD2.cont.other           3976       3430       546        804        651163     11217      1346       130385     44601117         #18 aligned to Acinetobacter; longest is 56467bp

Length of UMD2.4 contigs that contain contaminant (500+ bp from end):

                           elem       <2000      >2000      min        max        mean       med        n50        sum
 Ecoli.all                 182        156        26*        1286       351373     6525       1811       125069     1187706          # 7*   are >5K;  321* come from multi-ctg scaffolds
 UniVec_Core               2532       2220       312*       1065       651163     10593      1481       128344     26821960         # 267* are >5K ; 655* come from multi-ctg scaffolds
 OtherVec                  376        323        53         1184       361749     13278      1508       139997     4992774
 UMD2.cont.other           ...

Length of UMD2.4 contigs that contain contaminant (1000+ bp from end):

                           elem       <2000      >2000      min        max        mean       med        n50        sum
 Ecoli.all                 8          0          8*         4709       351373     73065      31157      351373     584520
 UniVec_Core               11         0          11*        2600       334933     93674      37847      271477     1030414 
 OtherVec                  5          0          5*         3717       271477     131604     111913     228060     658021
 UMD2.cont.other           54         0          54*        2398       522682     110947     88182      189352     5991164
 total                     67* # 18 of them are CONTAINED by UMD2.0 chromosomes

Length of the UMD2.4 contaminant sequence (0+ bp from end):

                           elem       <200       >200       min        max        mean       med        n50        sum
 Ecoli.all                 4775       1610       3165       39         17072      381        236        502        1823278
 UniVec_Core               16985      12380      4605       39         1800       207        132        300        3519080
 OtherVec                  7563       1372       6191       39         1800       509        548        643        3849567
 UMD2.cont.other           6626       343        6283       39         8228       543        573        615        3602329

maxmatch deg

All degenerates aligned are <2K

Length of UMD2.4 deg that contain contaminant (0+ bp from end):

                           elem       <2000      >2000      min        max        mean       med        n50        sum
 Ecoli.all                 1266       1266       0          104        1611       783        833        869        991447
 UniVec_Core               1908       1908       0          147        1510       872        896        910        1664746
 OtherVec                  1963       1963       0          147        1510       872        898        911        1712703
 UMD2.cont.other           1609       1609       0          132        1611       852        892        914        1372106

maxmatch utg

Unitig stats:

 elem       <2000      >2000      min        max        mean       med        n50        sum            
 1707816    1434164    273652     21         138676     2228       937        8002       3805166508   

Parameters:

 nucmer -maxmatch -l 40 -c 100 -b 10 -g 5 -d 0.05 ...

Files:

 /scratch1/bos_taurus/Assembly/2009_0217_CA/nucmer_utg/

Length of UMD2.4 unitigs that align to contaminants

                           elem       <2000      >2000      min        max        mean       med        n50        sum
 Ecoli.all                 4275       4110       165        104        71709      1442       1212       1398       6166566  
 UniVec_Core               7563       7409       154        139        71709      1397       1182       1331       10570512
 OtherVec                  8208       8054       154        139        71709      1370       1159       1308       11248775
 UMD2.cont.other           6094       5849       245        132        53113      1546       1163       1401       9422951    #80 aligned to Acinetobacter; longest is 9114bp
 Contaminants(all above)   10264      9895       369        104        71709      1471       1148       1359       15107544

 Acinetobacter             2306**      0          2306       154        71709      1451       1316       1412      3347230  #2182 already in the Cont set 

Length of UMD2.4 unitigs that have contaminants 500+bp from ends

                           elem       <2000      >2000      min        max        mean       med        n50        sum
 Ecoli.all                 172        156        16         1286       4852       1820       1805       1815       313185
 UniVec_Core               2491       2422       69         1065       71709      1722       1457       1523       4291584
 OtherVec                  364        358        6          1167       71709      1795       1478       1538       653595
 UMD2.cont.other           156        108        48         1213       50248      5344       1838       17518      833673

Length of the UMD2.4 alignments of unitigs to contaminants (unique unitig regions)

                           elem       <200       >200       min        max        mean       med        n50        sum       reads(all unitig reads for unitgs with alignments>1K)
 Ecoli.all                 5975       1686       4289       40         8184       397        268        542        2374366   12112(12142)
 UniVec_Core               8754       1674       7080       40         1801       474        490        645        4153030   26590(26849)
 OtherVec                  8919       1250       7669       40         1801       511        536        629        4562326   30268(30268)
 UMD2.cont.other           6752       896        5856       40         6012       529        555        651        3573528   25006(25328)
 Contaminants(all above)   10992      1396       9596       40         8184       571        573        684        6280759   40351(40699)
 Acinetobacter                                                                                                                     (8286)

40699 reads aligned back to contaminants: nucmer -maxmatch

  • 35919 align
  • 34400 align 100+bp
  • 27742 align 200+bp
  • 14211 align 500+bp

utg 5'& 3'

Unitig stats:

             elem         <200       >200       min        max        mean       med        n50        sum            
utg          1,707,816    81200      1626616    21         138676     2228       937        8002       3805166508   
utg5'&3'     3,334,432    0          3334432    21         199        100        100        100        335263271

Align utg5'&3' to Ecoli.all using:

  • nucmer -l 40 -c 100 -b 10 -g 5 -d 0.05 : 4,275 hits
  • nucmer -l 20 -c 40 : 6,617 hits
  • nucmer -l 20 -c 20 : 23,350
  • blastall  : 2,895,506 out of 3,334,432 (86%) aligned

Acinetobacter contamination

Database:

 ~dpuiu/db/Acinetobacter.all : 7 complete genomes, 19 seqs

Seq len summary:

  elem       min        max        mean       med        n50        sum
  19         2726       4050513    1418094    28279      3904116    26943793

Align all unitigs to Acinetobacter.all; Longest alignments is 8517bp

show-coords Acinetobacter.all-utg.filter-q.delta | sort -nk8 -r | head
   [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [GenBank]                          [UMD2.4 utg]
===============================================================================================================================
  20644    29164  |       62     8578  |     8521     8517  |    98.80  |    94413     8578  |     9.03    99.29  | gi|169786889|ref|NC_010404.1|      utg7180000281954*       [CONTAINS]
3395586  3401299  |     5712        1  |     5714     5712  |    99.79  |  3976747     8015  |     0.14    71.27  | gi|126640115|ref|NC_009085.1|      utg7180000212251
3400344  3404485  |        1     4142  |     4142     4142  |    99.66  |  3976747     9114  |     0.10    45.45  | gi|126640115|ref|NC_009085.1|      utg7180000277331
...

utg7180000281954* -> ctg7180002053982 (28140bp; 78 unitigs)

grep 7180002053982 ../9-terminator/bt.posmap.utgctg | nl

    1  7180000185222   7180002053982   0       3019    f
    2  7180000314302   7180002053982   2151    5706    r
    3  7180001463328   7180002053982   2256    2869    f
    ...
   75  7180000281954*  7180002053982   17862   26442   r
   76  7180001471348   7180002053982   17886   18723   r
   77  7180001468075   7180002053982   17919   18732   f
   78  7180000280508   7180002053982   25672   28140   r


 show-coords UMD2.contaminant.other-ctg.filter-q.delta  | grep 7180002053982
   [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [UMD2.0 contam]   [UMD2.4 ctg]
===============================================================================================================================
   1394     2508  |    28140    27024  |     1115     1117  |    98.75  |     7098    28140  |    15.71     3.97  | 7180003313366      ctg7180002053982
   2561     2871  |    26971    26661  |      311      311  |    97.11  |     7098    28140  |     4.38     1.11  | 7180003313366      ctg7180002053982
   2934     5670  |    26599    23862  |     2737     2738  |    97.99  |     7098    28140  |    38.56     9.73  | 7180003313366      ctg7180002053982
 ...gap...  
   5930     7098  |    17270    16101  |     1169     1170  |    98.46  |     7098    28140  |    16.47     4.16  | 7180003313366      ctg7180002053982
 ...gap...  
    281     1981  |    10335     8635  |     1701     1701  |    99.41  |    13090    28140  |    12.99     6.04  | 7180003320028      ctg7180002053982
   1992     2672  |     8635     7954  |      681      682  |    99.71  |    13090    28140  |     5.20     2.42  | 7180003320028      ctg7180002053982
   3302     5376  |     6719     4643  |     2075     2077  |    99.33  |    13090    28140  |    15.85     7.38  | 7180003320028      ctg7180002053982
   8469     9021  |     4642     4090  |      553      553  |    99.10  |    13090    28140  |     4.22     1.97  | 7180003320028      ctg7180002053982
   9038     9313  |     4073     3798  |      276      276  |    98.55  |    13090    28140  |     2.11     0.98  | 7180003320028      ctg7180002053982
   9780    13090  |     3331       19  |     3311     3313  |    99.19  |    13090    28140  |    25.29    11.77  | 7180003320028      ctg7180002053982
 grep 7180002053982 ../9-terminator/bt.posmap.utgctg | awk '{print $1,$4-$3+1}' | sed 's/^/utg/' >! ctg7180002053982.utgs

 intersect.pl UMD2.contaminant.other-utg.qry_hits ctg7180002053982.utgs | wc -l
 37 # only 37 out  of 78 unitigs were detected

 ctg7180002053982 is Acinetobacter

Assembly UMD2.5 (2004_0312_CA; delete 40699 contam reads & 22607 mates )

40699 reads:

  • 25803 mated + 14896 unmated
  • 6392 mated reads had the mate also contaminated

Location:

 /scratch1/bos_taurus/Assembly/2009_0312_CA

UNITIGGER

UNITIG OVERLAP GRAPH INFORMATION
       5322910 : Total number of unitigs
       2595715 : Total number of singleton, contained unitigs
       1869655 : Total number of singleton, non-contained unitigs
        182193 : Total number of non-singleton, spanned unitigs
        675347 : Total number of non-singleton, non-spanned unitigs
      35507162 : Total number of fragments
      35507162 : Total number of fragments in all unitigs
      21641007 : Total number of essential fragments in all unitigs
      13866155 : Total number of contained fragments in all unitigs
  0.0077909501 : Randomly sampled fragment arrival rate per bp
    2511009753 : The sum of overhangs in all the unitigs
    6442095933 : Total number of bases in all unitigs
             0 : Estimated number of base pairs in the genome.
             0 : Total number of contained fragments not connected
                 by containment edges to essential fragments.
Total rho    = 2511009753
Total nfrags = 19563152
Estimated genome length = 0
Estimated global_fragment_arrival_rate=0.007791
Computed global_fragment_arrival_rate =0.007791
Total number of randomly sampled fragments in genome = 23866135
Computed genome length  = 3063315200.000000
Used global_fragment_arrival_rate=0.007791
Used global_fragment_arrival_distance=128.354050

Histogram of the number of base pairs in a chunk
100292 - 138301:     22    # 19 in UMD2.4 
 90020 -  99906:     28    # 23
 80043 -  89676:     90    # 79
 70013 -  79966:    190    # 164
 60010 -  69988:    423
 50008 -  59983:   1016
 40000 -  49998:   2558
 30000 -  39997:   6660
 20000 -  29999:  18927
 10000 -  19999:  57057

CONSENSUS after CGW

  • failed on job 80 : ctg 5706539, len=180,024, 159 unitigs, 1,851 reads

 head 80/bt.cns_contigs.80.failed
 {ICM
 acc:5706539
 pla:P
 len:180024
 cns:
 .
 qlt:
 .
 for:0
 npc:1851
 more  ../9-terminator/bt.asm
 ...
 {CCO
 acc:(7180002022380*,5706539)
 pla:P
 len:180024
 cns:
 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
 ...
 cat 80/bt.cns_contigs.80.failed | countMessages.pl  
 ICM     1
 IUP     159   #unitigs
 IMP     1851  #reads : 1264 are BCM.WGS, 389 are BCM.SHOTGUN ...
 # 254 contig scaffold
 cat ../9-terminator/bt.posmap.ctgscf | grep 7180002041301 | nl
 1    7180002022042   7180002041301   0       14121   f
 ...
 183  7180002022380*  7180002041301   9164814 9344838 f
 ..
 254  7180002022240   7180002041301   12874874        12893460        f
 # the 5 UMD2.4 contigs below have the same number of reads with the ones that matched => CONTAINED
 cat UMD2.4-7180002022380.posma   p.frgctg | grep -v ^$ | awk '{print $2}' | uniq -c
   #reads ctgid
     6 7180001712307
   209 7180002028662
     6 7180002028663
  1552 7180002028664
    21 7180002032323
  1794 total => 1851-1794=57 additional reads
 cat ../../2009_0217_CA/9-terminator/bt.posmap.ctgscf | nl | egrep '7180001712307|7180002028662|7180002028663|7180002028664|7180002032323'
 #nl    ctgid           scfid           start   end     dir        
 34791  7180002028662   7180002069912   1425623 1446022 f
 34793  7180002032323   7180002069912   1448133 1450475 f
 34794  7180002028663   7180002069912   1450495 1451918 f
 34795  7180002028664   7180002069912   1452234 1602973 f
 64110  7180001712307   7180002071598   0       1441    f
  • Solution 1:
 * consensus -Dforceunitigabut => new assembly, new UID's
 ctg7180002022636 179505 38.53     # => ctg7180002022380
 scf7180002041557 12892941 40.58   # => scf7180002041301
  • Solution 2:
 * Reassemble 1851 reads ; clr=ECR2; doOBT=no
 * Asm dir: 
    /scratch1/bos_taurus/Assembly/2009_0312_CA/8-consensus/80.ECR2.asm
 * It contains one 179,530 bp scaffold that has two contigs. 
 * One contig is 156,349 bp and the other one is 23,181 bp. 
 * The estimated gap between them is 231 bp.
 show-coords ctg7180002022636-80.ECR2.filter-r.delta 
      1   156326  |   1   156331  | 156326 156331  | 99.99  | 179505  156349 |    87.09    99.99  | ctg7180002022636   ctg7180000000103        [CONTAINS]
 156345   179505  |  21    23181  |  23161  23161  | 99.99  | 179505  23181  |    12.90    99.91  | ctg7180002022636   ctg7180000000104        [CONTAINS]
 >ctg7180002022636_156327_156344
 TTGTAAAAACCATCCCCT
 # ~ 20 bp unaligned on ctg7180002022636 & Chr1
 show-coords  ctg7180002022636-Chr1.filter-r.delta  | more
 ...
 151839   156326  | 61124826  61120339* |     4488     4488  |    99.91  |   179505 157590899  |     2.50     0.00  | ctg7180002022636  Chr1
 156347   157501  | 61111219* 61110064  |     1155     1156  |    99.91  |   179505 157590899  |     0.64     0.00  | ctg7180002022636  Chr1
 ...
 # 2 UMD2.0 ctg & 2 UMD2.0 deg in this region
 more Chr1.agp
 ...
 Chr1    61110064        61111482        3579    W       deg0003139347   1       1419    +
 Chr1    61111483        61114130        3580    N       2648    fragment        yes
 Chr1    61114131        61115490        3581    W       deg0002967451   1       1360    +
 Chr1    61115491        61118114        3582    N       2624    fragment        yes
 Chr1    61118115        61120117        3583    W       7180002846553   1       2003    +
 Chr1    61120118        61120217        3584    U       100     fragment        yes
 Chr1    61120218        61145567        3585    W       7180003318962   1       25350   +
 ...

QC

Lengths:

             elem       <2000      >2000      min        max        mean       med        n50        sum
 scf         39978*     31311      8667       316        34167202   68129      1360       8217662    2723691675
 ctg         90135*     36140      53995      65         1160130    29693      5124       95988      2676390147
 deg         251413     249285     2128       65         39964      1003       984        994        252279234
 utg         1689033    1419729    269304     21         138676     2242       936        8213       3788090224

Scaffold zero read/mate cvg regions:

                       elem       <2000      >2000      min        max        mean       med        n50        sum
 read                  57011      55048      1963       1          177144     913        57         29302      52084484
 mate                  10507      8945       1562       1          30014      996        493        2367       10466518

Scaffold 10K+ zero read/mate cvg regions (2K+ inside) (some might be a result of surrogates?):

                       elem       <2000      >2000      min        max        mean       med        n50        sum
 read                  51747      49878      1869       1          177144     958        38         32625      49613432
 mate                  1290       201        1089       15         30014      3560       3047       4017       4593586

Contaminant search

ctg

               elem       <0         >0         min        max        mean       med        n50        sum
 ctg           90135      0          90135      65         1160130    29693      5124       95988      2676390147
 nucmer -maxmatch -l 40 -c 100 -b 10 -g 5 -d 0.05 ...
                  elem       <2000      >2000      min        max        mean       med        n50        sum
 Ecoli.all        71         66         5          1006       129770     4180       1127       45234      296830
 UniVec_Core      120        111        9          1000       426540     7366       1130       100944     884023
 OtherVec         121        112        9          1000       426540     7314       1127       100944     885022
 UMD2.cont.other  98(-3)     83         15(-3)     1000       426540     13523      1190       199700     1325332  # 3 are 1000bp+ from ctg ends; these are actually "fake" contaminants
                                                                                                                   # 7 are Acinetobacter baumannii min=1059 , max=9765 
 total            152*(-3)   133        19*(-3)    1000       426540     10649      1150       199700     1618735
 
 Acinetobacter    65         53         12         1013       44359      2586       1212       3847       168130   # 46 out of 65 are in the 152* set; 19 are new; 13 have lots of alignments to other contigs (probably fake contaminants)
 total(new)       171*(-3)   144        27*(-3)    1000       426540     10013      1189       129770     1712376  # 65 are Acinetobacter and should be removed
 cat UMD2.contaminant.other-ctg.filter-q.coords | grep Acinetobacter
                                                                                                                    UMD2.0                     UMD2.5
      1      285  |      285        1  |      285      285  |    99.65  |      287     8096  |    99.30     3.52  | 7180003292866_1_288        ctg7180002015457        [CONTAINED]     Acinetobacter baumannii
   1422     2500  |     1078        1  |     1079     1078  |    99.63  |     7098     1078  |    15.20   100.00  | 7180003313366              ctg7180001706852        [CONTAINS]      Acinetobacter baumannii
   2934     3940  |     1008        1  |     1007     1008  |    98.61  |     7098     1059  |    14.19    95.18  | 7180003313366              ctg7180001709709                        Acinetobacter baumannii
   6281     7098  |        1      818  |      818      818  |    99.76  |     7098     1553  |    11.52    52.67  | 7180003313366              ctg7180001716052        [END]           Acinetobacter baumannii
      1      790  |      790        1  |      790      790  |   100.00  |     1822     9765  |    43.36     8.09  | 7180003319195_8956_10778   ctg7180002015485        [BEGIN]         Acinetobacter calcoaceticus
    285     1981  |        1     1697  |     1697     1697  |    99.59  |    13090     1856  |    12.96    91.43  | 7180003320028              ctg7180001706656*                       Acinetobacter baumannii
   1992     2148  |     1697     1856  |      157      160  |    98.12  |    13090     1856  |     1.20     8.62  | 7180003320028              ctg7180001706656*                       Acinetobacter baumannii
  12210    13090  |       63      943  |      881      881  |    99.89  |    13090     2556  |     6.73    34.47  | 7180003320028              ctg7180002007423                        Acinetobacter baumannii
 # 7 Acinetobacter baumannii ctgs
 # no Serratia "best hits"
 # 3 mitochondrion ctgs, all < 2Kbp

Delete summary: 65 Acinetobacter ctgs + 91 contaminant ctgs <2000bp => 156 ctgs => 4105 reads

 ctgs       <2000      >2000      min        max        mean       med        n50        sum       reads
 156        144        12         1000       44359      1782       1150       1483       278009    4105

Trim summary: 12 contigs >=2000bp & 44 reads that overlap at least 10bp

 ctgs       <2000      >2000      min        max        mean       med        n50        sum        reads
 12         12         0          172        935        532        618        750        6393       44

ctg 5'&3'

               elem       <0         >0         min        max        mean       med        n50        sum
 ctg53         180044     0          180044     65         598        300        300        300        54033229
 nucmer -maxmatch -l 17 -c 35 ...
                  #ctgEnds   #ctgs      min        max        mean       med        n50        sum         
 Ecoli.all        180        149        300        300        300        300        300        54000 
 UniVec_Core      312        277        300        300        300        300        300        93600
 OtherVec         1211       1167       300        553        300        300        300        363989
 UMD2.cont.other  15689      14693      257        598        300        300        300        4712162

deg

 nucmer -maxmatch -l 40 -c 100 -b 10 -g 5 -d 0.05 ...
                  elem       <2000      >2000      min        max        mean       med        n50        sum
 Ecoli.all        387        387        0          131        1099       756        806        835        292892
 UniVec_Core      569        569        0          101        1115       763        822        843        434400
 OtherVec         579        579        0          101        1115       752        819        840        435549
 UMD2.cont.other  539        539        0          131        1483       792        838        873        427408
                 
 total            810*       810        0*         101        1483       784        838        869        63547

Scaffolds vs UMD2.0 chromosome alignments

Directory:

 /scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_scf

Depening on the ref/qry seq and nucmer parameters, the number of unaligned gaps in UMD2.0 can vary between:

 101M: REF=Chr,       QRY=scf,     nucmer -l 100 -c 500 
 6M:   REF=ChrPlaced, QRY=scf-deg, nucmer -maxmatch -l 50 -c 250 

nucmer -l 100 -c 500

Chr-scf.summary

                                          elem       <2000      >2000      min        max        mean       med        n50        sum
 Chr-scf.qry_hits                         32901      24546      8355       723        34167202   82494      1405       8217662    2714164100
 Chr-scf.qry_nohits                       7077       6765       312        316        12006      1346       1239       1291       9527056
 Chr-scf.10K.qry_hits2+                   574        0          574        10308      34167202   4006753    1887309    9586144    2299876795
 Chr-scf.0cvg                             144712     125933     18779      1          102265     900        178        2968       130248709
 Chr-scf.0cvg.clean                       148556     143283     5273       1          39625      683        280        1363       101526883(101M)

Chr-scf-deg.summary

                                          elem       <2000      >2000      min        max        mean       med        n50        sum
 Chr-scf-deg.qry_hits                     210225     199781     10444      501        34167202   13785      1007       7328685    2898141592
 Chr-scf-deg.qry_nohits                   81166      80815      351        65         12006      958        972        989        77828798
 Chr-scf-deg.10K.qry_hits2+               574        0          574        10308      34167202   4006753    1887309    9586144    2299876795
 Chr-scf-deg.0cvg                         175952     168553     7399       1          22067      445        120        1329       78433265
 Chr-scf-deg.0cvg.clean                   133809     132381     1428       1          20512      371        124        1101       49711440(49M)

ChrPlaced-scf.summary

                                          elem       <2000      >2000      min        max        mean       med        n50        sum
 ChrPlaced-scf.qry_hits                   19488      13112      6376       723        34167202   137773     1527       8428844    2684927057
 ChrPlaced-scf.qry_nohits                 20490      18199      2291       316        192648     1891       1276       1569       38764099
 ChrPlaced-scf.10K.qry_hits2+             139        0          139        10486      31959312   6786671    4979278    12956086   943347316
 ChrPlaced-scf.0cvg                       76271      71816      4455       1          102265     568        179        1710       43339413
 ChrPlaced-scf.0cvg.clean                 72865      70541      2324       1          39625      356        94         1413       25951987(25M)

ChrPlaced-scf-deg.summary

                                          elem       <2000      >2000      min        max        mean       med        n50        sum
 ChrPlaced-scf-deg.qry_hits               130125     122000     8125       501        34167202   21530      1009       7515049    2801670853
 ChrPlaced-scf-deg.qry_nohits             161266     158596     2670       65         192648     1080       987        1012       174299537
 ChrPlaced-scf-deg.10K.qry_hits2+         139        0          139        10486      31959312   6786671    4979278    12956086   943347316
 ChrPlaced-scf-deg.0cvg                   79041      76374      2667       1          22067      395        157        948        31251753
 ChrPlaced-scf-deg.0cvg.clean             69012      68271      741        1          20512      200        81         592        13864328(13M)

nucmer -maxmatch -l 100 -c 500

Dir:

 /scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_scf.2

ChrPlaced-scf-deg.summary

                                          elem       <2000      >2000      min        max        mean       med        n50        sum
 ChrPlaced-scf-deg.qry_hits               130510     122377     8133       501        34167202   21470      1009       7515049    2802100771
 ChrPlaced-scf-deg.qry_nohits             160881     158219     2662       65         192648     1080       986        1012       173869619
 ChrPlaced-scf-deg.10K.qry_hits2+         120        0          120        20022      31959312   7587296    5639522    13010806   910475551
 ChrPlaced-scf-deg.0cvg                   82159      80425      1734       1          7002       321        145        647        26444796
 ChrPlaced-scf-deg.0cvg.clean             111645     111546     99         1          6248       81         13         272        9057424(9M)

nucmer -maxmatch -l 50 -c 250

Dir:

 /scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_scf.3

ChrPlaced-scf-deg.summary

                                          elem       <2000      >2000      min        max        mean       med        n50        sum
 ChrPlaced-scf-deg.qry_hits               204673     195653     9020       251        34167202   14088      1005       7329288    2883625712
 ChrPlaced-scf-deg.qry_nohits             86718      84943      1775       65         192648     1064       970        1007       92344678
 ChrPlaced-scf-deg.10K.qry_hits2+         148        0          148        10486      31959312   6814292    5135095    12792673   1008515300
 ChrPlaced-scf-deg.0cvg                   86085      84614      1471       1          4565       279        123        557        24101912
 ChrPlaced-scf-deg.0cvg.clean             113796     113791     5          1          2822       59         7          176        6714556(6M)

nucmer -maxmatch -l 50 -c 250 ; delta-fileter -q

Dir:

 /scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_scf.3
 ChrPlaced-scf-deg.filter-q.summary
                                          elem       <2000      >2000      min        max        mean       med        n50        sum
 ChrPlaced-scf-deg.qry_hits               204673     195653     9020       251        34167202   14088      1005       7329288    2883625712
 ChrPlaced-scf-deg.qry_nohits             86718      84943      1775       65         192648     1064       970        1007       92344678
 ChrPlaced-scf-deg.10K.qry_hits2+         118        0          118        20022      31959312   7633834    5639522    13010806   900792422
 ChrPlaced-scf-deg.0cvg                   77864      73686      4178       1          28711      529        181        1523       41240419
 ChrPlaced-scf-deg.0cvg.clean             74172      72150      2022       1          28331*     321        89         1415       23852952(23M)

Max gap is 28331; Duplicate region in UMD2.0?

 ChrPlaced-scf-deg.coords
   70739616 70767946  |  2993389  2965048  |    28331    28342  |    99.17  | 85187327 19514159  |     0.03     0.15  | Chr15      scf7180002041107
   70768054 70808299  |  3005305  2965048  |    40246    40258  |    99.56  | 85187327 19514159  |     0.05     0.21  | Chr15      scf7180002041107
 =>
 ChrPlaced-scf-deg.filter-q.coords
   70768054 70808299  |  3005305  2965048  |    40246    40258  |    99.56  | 85187327 19514159  |     0.05     0.21  | Chr15      scf7180002041107

Markers

 head /fs/szasmg3/bos_taurus/UMD_Freeze2.5/markers/markers_contigs_Art.txt
 Marker        Chr_BTA Pos(K)  Pos_from Pos_to UMD_Ctg_Pos     Match_Len       %IDY    %Match  UMD_Ctg_name
 BZ945871      1       47501   1       95001   7622            515             100.00  99.61   ctg7180002007845
 BZ953651      1       80001   47501   112501  10786           700             99.57   100.00  ctg7180002026484
 CC504788      1       118751  80001   157501  54583           862             100.00  100.00  ctg7180002026483
 CC484491      1       123751  90001   157501  50169           77              98.72   100.00  ctg7180002026482
 CZ415082      1       125001  92501   157501  75850           507             99.21   99.80   ctg7180002026483
 CC475154      1       130001  97501   162501  40013           666             99.25   100.00  ctg7180002026482
 CC561114      1       182501  145001  220001  1130            709             99.02   100.00  ctg7180002026482
 CC578374      1       190001  155001  225001  170145          647             100.00  100.00  ctg7180002026481
 BZ911787      1       278751  232501  325001  na              na              na      na      na
 ...
  • 126,014 markers & 90,135 ctgs total
  • 107,271 markers align to 31,407 ctgs:
    • 85% of the markers align to 85% of the ctg sequence
    • avg distance between markers is 25Kbp
  • 93,508 unique markers (out of 107,271)

Ctg vs markers summary:

                                        #ctg       <10000     >10000     min        max        mean       med        n50        sum         file
 ctg (all)                              90135*     51024      39111      65         1160130    29693      5124       95988      2676390147
 no markers                             58728      48324      10404      65         322949*    6573       1597       21989      386064754
 
 markers from 1+ Chr                    31407      2700       28707      442        1160130    72924      52693      111252     2290325393  markers_ctg.Chr.count
 markers from 2+ Chr                    2987       25         2962       1002       1160130    132480     104807     179692     395718221   markers_ctg.Chr.count2+
 2+ markers from 2+ Chr                 26         0          26*        15228      604155     221354     192182     298848     5755227     markers_ctg.Chr.count2.2+
 2+ adjacent markers from 2+ Chr        15         0          15**       15228      368879     202728     194749     294623     3040932     markers_ctg.Chr.count2+a

Scf vs markers summary:

                                        #scf       <10000     >10000     min        max        mean       med        n50        sum
 scf(all)                               39978*     37135      2843       316        34167202   68129      1360       8217662    2723691675
 no markers                             37338      36038      1300       316        754615*    2601       1336       3957       97140879
 
 markers from 1+ Chr                    2640       1097       1543       1000       34167202   994905     16220      8661690    2626550796  markers_scf.Chr.count 
 markers from 2+ Chr                    552        10         542        1002       34167202   4526814    2714036    9167014    2498801557  markers_scf.Chr.count2+
 2+ markers from 2+ Chr                 212        0          212*       15228      34167202   8579232    7358307    10521496   1818797327  markers_scf.Chr.count2.2+
 2+ adjacent markers from 2+ Chr        38         0          38**       15228      25078118   8419544    7176534    13458592   319942681   markers_scf.Chr.count2+a


212* scaffolds

 scf_id        scf_len   #Chr/2+markers  #/Chr/2+adjmarkers  reads
 7180002041381 31959312  15              0                   469503
 7180002041358 25078118  13              2                   291163
 7180002041386 21280754  12              0
 ...

1. 7180002041381 : no low cvg regions in the middle

  • 1281 markers: 1231 on Chr4, 12 on Ch11
 #1- mate cvg regions: at the ends !!!
 #scfid          begin           end             scf_len         cvg_len cvg
 7180002041381   1               1173            31959312        1173    0
 7180002041381   1174            1454            31959312        281     1
 7180002041381   31959139        31959312        31959312        174     1
 

2. 7180002041358 : one low cvg region & real break

  • 1111 markers: 869 on Chr14, 193 on Chr26, ...
 #1- mate cvg regions: middle
 #scfid          begin           end             scf_len         cvg_len cvg
 7180002041358   20970531        20970827        25078118        297     1
 7180002041358   20970828        20970949        25078118        122     0
 7180002041358   20970950        20971112        25078118        163     1
 # markers in the regions
 #scfid          begin           end             makerid         Chr
 7180002041358   20964100        20964829        BZ839784        14
 7180002041358   21002368        21003219        CC527932        26

3. 7180002041386: one low cvg region but no markers in that region

  • 939 markers: 902 on Chr24, 3 on Chr6 ...
 #1- mate cvg regions: at the ends
 #scfid          begin           end             scf_len         cvg_len cvg
 7180002041386   26382   27557   21280754        1176    1
 7180002041386   27558   27607   21280754        50      0
 7180002041386   27608   28031   21280754        424     1

...