Bos taurus redo: Difference between revisions

From Cbcb
Jump to navigation Jump to search
Dpuiu (talk | contribs)
Dpuiu (talk | contribs)
 
(891 intermediate revisions by the same user not shown)
Line 7: Line 7:
* [http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj&cmd=search&term=bos%20taurus Genome Projects]
* [http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj&cmd=search&term=bos%20taurus Genome Projects]
* [http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=retrieve&size=0&val=SPECIES_CODE+%3D+%22BOS+TAURUS%22 TA search]  
* [http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=retrieve&size=0&val=SPECIES_CODE+%3D+%22BOS+TAURUS%22 TA search]  
* ftp download counts
* [ftp://ftp.ncbi.nih.gov/pub/TraceDB/bos_taurus/ TA ftp]
* 91 volumes: 87 with qual & 4 with no quality (85 volumes contain BCM reads)
* 14 centers
* 21 center/trace_type_codes


* Avg LEN=984
* Avg LEN=984
* Avg CLIP (CLB intersect CLV)=760
* Avg CLIP (CLB intersect CLV)=760
* Avg CLV=997  (3.66M reads)
* Avg CLV=997 > Avg LEN ???
* 0 QUAL reads 650,133
* Avg QUAL=38.96   (27.51 for the 2.59M reads not in the UMD assembly)
* Avg UMDoverlapper CLIP=778


* Avg UMDoverlapper CLIP=778 (3.53M reads)
Problems:
* 0 QUAL reads 650,133 ( 551,114 are BCM.WGS )
* the quality lines in several qual. files start with space; need to remove it otherwise tarchive2ca errors out saying that the len(quality)=len(seq)+1
* several xml contained the "&" character => XML parser error
* xml.bos_taurus.087 contained 2 trace_volumes => XML parser error
* BCCAGSC.CLONEEND : all reads have LIBRARY_ID=CH240, SEQ_LIB_ID=. ; the INSERT_SIZE & INSERT_STDEV vary within the library: set to 150,000 & 30,000
* UIUC.CLONEEND: INSERT_SIZE & INSERT_STDEV missing: set to 150,000 & 30,000


== CENTER_NAME counts ==
== CENTER_NAME counts ==
  COUNT          CENTER_NAME     
    COUNT          CENTER_NAME     
   35629020        BCM            Baylor College of Medicine
   35629020        BCM            Baylor College of Medicine
   737900          NISC            NIH Intramural Sequencing Center
   737900          NISC            NIH Intramural Sequencing Center
   652614          BCCAGSC        British Columbia Cancer Agency Genome Sciences Centre                          # TA query_tracedb CENTER_NAME = "BCCAGSC" => 652,510  
   652614          BCCAGSC        British Columbia Cancer Agency Genome Sciences Centre                          # TA query_tracedb CENTER_NAME = "BCCAGSC" => 652,510  
   378871          MARC            USDA, ARS, US Meat Animal Research Center
   378871          MARC            USDA, ARS, US Meat Animal Research Center
   114753          UIUC            University of Illinois at Urbana-Champaign                                      # TA query_tracedb CENTER_NAME = "UIUC" => 106,368
   114753          UIUC            University of Illinois at Urbana-Champaign                                      # TA query_tracedb CENTER_NAME = "UIUC" => 106,368
   107367          BARC            USDA, ARS, Beltsville Agricultural Research Center
   107367          BARC            USDA, ARS, Beltsville Agricultural Research Center
   65171          TIGR            The Institute for Genome Research
   65171          TIGR            The Institute for Genome Research
   53556          GSC            Genoscope
   53556          GSC            Genoscope
   43033          CENARGEN        Embrapa Genetic Resources and Biotechnology
   43033          CENARGEN        Embrapa Genetic Resources and Biotechnology
   18623          SC              The Sanger Center
   10 18623          SC              The Sanger Center
   15301          UOKNOR          University of Oklahoma Norman Campus, Advanced Center for Genome Technology
   11 15301          UOKNOR          University of Oklahoma Norman Campus, Advanced Center for Genome Technology
   10651          TIGR_JCVIJTC    The Institute for Genomic Research, Traces generated at JCVIJTC                # TA query_tracedb CENTER_NAME="JCVI"
   12 10651          TIGR_JCVIJTC    The Institute for Genomic Research, Traces generated at JCVIJTC                # TA query_tracedb CENTER_NAME="JCVI"
   2485            UIACBCB        University of Iowa Center for Bioinformatics and Computation Biology (UIACBCB)
   13 2485            UIACBCB        University of Iowa Center for Bioinformatics and Computation Biology (UIACBCB)
   49              WUGSC          Washington University, Genome Sequencing Center                                # TA query_tracedb CENTER_NAME = "WUGSC" => 9
   14 49              WUGSC          Washington University, Genome Sequencing Center                                # TA query_tracedb CENTER_NAME = "WUGSC" => 9
  '''37829394'''        total          total                                                                          # TA query_tracedb SPECIES_CODE = "BOS TAURUS" => 37,788,710  
    '''37829394'''        total          total                                                                          # TA query_tracedb SPECIES_CODE = "BOS TAURUS" => 37,788,710
 


== TRACE_TYPE_CODE counts ==
== TRACE_TYPE_CODE counts ==
  COUNT        CENTER_NAME    TRACE_TYPE_CODE        #LIBS(all)    #LIBS(10K+ reads)
    COUNT        CENTER_NAME    TRACE_TYPE_CODE         
   24863599      BCM             WGS                    89             31
   24863599      BCM*            WGS                    SEQ_LIB_ID:89
   10748529      BCM             SHOTGUN                10            10
   10748529      BCM*            SHOTGUN                SEQ_LIB_ID:15543
   737900        NISC            SHOTGUN                4              3
   737900        NISC            SHOTGUN                SEQ_LIB_ID:247
   125597        BCCAGSC        CLONEEND
   125597        BCCAGSC        CLONEEND               LIBRARY_ID:1        large insert size; some qualityless; !!! almost all have CLIP3=0
   114753        UIUC            CLONEEND
   114753        UIUC            CLONEEND               LIBRARY_ID:2        insert size missing , no frequent kmers
   65171        TIGR            CLONEEND
   65171        TIGR            CLONEEND               SEQ_LIB_ID:1        2K & use TRACE_DIRECTION instead of TRACE_END
   53556        GSC            CLONEEND
   53556        GSC            CLONEEND               SEQ_LIB_ID:1        large insert size; !!! all have qual=0 and were excluded
   26246        CENARGEN        WGS
   26246        CENARGEN        WGS                   .                    no LIBRARY_ID; no SEQ_LIB_ID; no INSERT_SIZE; no INSERT_STDEV; reads have no direction; ~21954 could be paired (same TEMPLATE_ID)
   25454        BARC            CLONEEND
   25454        BARC            CLONEEND               SEQ_LIB_ID:14304    !!! all have CLIP3=0
   16892        BCM             CLONEEND              1             1      VBBAA  mea=167000  std=25000
   10 16892        BCM*            CLONEEND              LIBRARY_ID:1         VBBAA  mea=167000  std=25000
   16787        CENARGEN        CLONEEND
   11 16787        CENARGEN        CLONEEND               LIBRARY_ID:1       
   15150        UOKNOR          SHOTGUN
   12 15150        UOKNOR          SHOTGUN               LIBRARY_ID:1        some qualityless
   10651        TIGR_JCVIJTC    CLONEEND
   13 10651        TIGR_JCVIJTC    CLONEEND               SEQ_LIB_ID:2
   151          UOKNOR          FINISHING
   14 151          UOKNOR          FINISHING             LIBRARY_ID:1        some qualityless, no direction(TRACE_END=N); no INSERT_SIZE; no INSERT_STDEV
   49            WUGSC          CLONEEND
   15 49            WUGSC          CLONEEND               SEQ_LIB_ID:1
  '''36809945'''      total
    '''36820485'''      total
   
   
   527017        BCCAGSC        EST
   16 527017        BCCAGSC        EST
   207204        MARC            EST
   17 207204        MARC            EST
   171667        MARC            PCR
   18 171667        MARC            PCR
   81913        BARC            EST
   19 81913        BARC            EST
   81913         BARC            EST
   20 18623         SC              EST  
   2485          UIACBCB        EST
   21 2485          UIACBCB        EST
  '''1019449'''      total
    '''1008909'''      total


== STRATEGY & TRACE_TYPE_CODE counts ==
== STRATEGY & TRACE_TYPE_CODE counts ==
Line 93: Line 102:
   18623          SC              EST            EST
   18623          SC              EST            EST
   2485            UIACBCB        .              EST
   2485            UIACBCB        .              EST
== BCM.SHOTGUN libraries ==
* The long inserts are probably wrong !!!
  SIZE    STDEV  COUNT
  3500    1500    4502569
  2000    1000    3244493
  3000    1000    1021577
  180000  1000    840528
  6500    1500    429026
  180000  13000  320208
  6000    2000    208192
  167000  13000  96337
  3500    15000  85599
  SIZE    COUNT
  3500    4588168
  2000    3244493
  180000  1160736
  3000    1021577
  6500    429026
  6000    208192
  167000  96337


== 3' VECTOR TRIMMED counts ==
== 3' VECTOR TRIMMED counts ==


  CENTER_NAME    TRACE_TYPE_CODE TOTAL          3'CLV<LEN  0QUAL
    CENTER_NAME    TRACE_TYPE_CODE TOTAL          3'CLV<LEN  QUAL==0          UMD.FRG
   BCM            WGS            24863599        10968979    551114
   BCM            WGS            24863599        10968979    551114           24050767
   BCM            SHOTGUN        10748529        5052692    23419
   BCM            SHOTGUN        10748529        5052692    23419           10068499
   NISC            SHOTGUN        737900          28972      0
   NISC            SHOTGUN        737900          28972      0               735488
   BCCAGSC        CLONEEND        125597          125484      8926
   BCCAGSC        CLONEEND        125597          125484      8926             113790
   UIUC            CLONEEND        114753          90243      0
   UIUC            CLONEEND        114753          90243      0               106247
   TIGR            CLONEEND        65171          46389      0
   TIGR            CLONEEND        65171          46389      0               64903
   GSC            CLONEEND        53556          53556      53556 (all)
   GSC            CLONEEND        53556          53556      53556 (all)     0          !!! all have 0 quals and were excluded
   CENARGEN        WGS            26246          26246      0
   CENARGEN        WGS            26246          26246      0               25976
   BARC            CLONEEND        25454          25454      0
   BARC            CLONEEND        25454          25454      0               25387
   BCM            CLONEEND        16892          6751        0
   10 BCM            CLONEEND        16892          6751        0               16863
   CENARGEN        CLONEEND        16787          16787      0
   11 CENARGEN        CLONEEND        16787          16787      0               16628
   UOKNOR          SHOTGUN        15150          2885        12195
   12 UOKNOR          SHOTGUN        15150          2885        12195           0
   TIGR_JCVIJTC    CLONEEND        10651          339        0
   13 TIGR_JCVIJTC    CLONEEND        10651          339        0               10644
   UOKNOR          FINISHING      151            0          151
   14 UOKNOR          FINISHING      151            0          151              151
   WUGSC          CLONEEND        49              0          0
   15 WUGSC          CLONEEND        49              0          0               0
  16 BCCAGSC        EST            527017          524173      772              0
  17 MARC            EST            207204          207204      0                0
  18 MARC            PCR            171667          171667      0                0
  19 BARC            EST            81913          78597      0                0
  20 SC              EST            18623          7350        0                0
  21 UIACBCB        EST            2485            2485        0                0
 
== ZERO QUALITY COUNTS ==


   BCCAGSC        EST             527017          524173      772
* Counts
   MARC            EST             207204          207204      0
   CENTER_NAME    TRACE_TYPE_CODE  COUNT
   MARC            PCR             171667         171667      0
  BCM             WGS              551114
   BARC            EST            81913          78597      0
   GSC             CLONEEND        53556
   SC              EST            18623          7350        0
   BCM             SHOTGUN         23419
   UIACBCB         EST             2485            2485       0
   UOKNOR          SHOTGUN          12195
   BCCAGSC        CLONEEND        8926
   BCCAGSC         EST             772
  UOKNOR          FINISHING       151
  TOTAL                            650134
* For 0 quality reads, assign quality 20 to bases 1..700, 0 to bases 701..
* Volumes 026..039 have been fixed


= Local Data =
= Local Data =
Line 131: Line 177:
Figaro
Figaro
* trims vector only at 5' end
* trims vector only at 5' end
* call lucy trimming for qaulities
* call lucy trimming for qualities


Lucy
Lucy
Line 141: Line 187:


= Contaminant search =
= Contaminant search =
nucmer reads CLIPPING range to UniVec & EcoliK12


== UniVec ==
== UniVec ==
Ref
                   #seqs  min    max    mean    median  n50    sum
                   #seqs  min    max    mean    median  n50    sum
   UniVec          2861    12      48551  231    99      781    660,151
   UniVec          2861    12      48551  231    99      781    660,151
   UniVec_Core    1348    12      48551  243    98      967    327,641
   UniVec_Core    1348    12      48551  243    98      967    327,641
Hits: alignment length
  bp      #reads  min    max    mean    median  n50    sum
  19      4548466 19      1045    28.37  23      27      129025025
  20      3684852 20      1045    30.56  25      28      112616359
  30      1097357 30      1045    48.04  38      43      52714583
  40      484661  40      1045    66.36  47      53      32163896
  100    54334  100    1045    198    116    223    10772815        # many are ESTs


== Ecoli ==
== Ecoli ==


K12 4,639,675 bp
Ref:
  K12 4,639,675 bp
 
Hits: alignment length
  bp      #reads  min    max    mean    median  n50    sum
  19      275109  19      1223    30.66  19      20      8435470
  20      102550  20      1223    50.29  21      161    5156849
  30      19032  30      1223    178    37      706    3381214
  40      9234    40      1223    329    171    738    3034293
  100    6781    100    1223    424    223    749    2876432
  200    4378    200    1223    575    696    771    2516916     


== BCM vectors ==
== BCM vectors ==
                   #seqs  min    max    mean    median  n50    sum
                   #seqs  min    max    mean    median  n50    sum
   BCM            14      2580    33180  9379    5821    32705  131312
   BCM            14      2580    33180  9379    5821    32705  131312
= Vector/Splice site search =
== Strategy ==
* 1. Select all the reads in the same volume that belong to one particular library; same CENTER_NAME, STRATEGY & TRACE_TYPE_CODE
* 2. Get the quality clipping trim: CLIP_QUALITY_LEFT & CLIP_QUALITY_RIGHT
* 3. Separate reads in 2 sets according to direction TRACE_END: FORWARD & REVERSE
* 4. Get the most frequent kmers in each set (24 & 8 bp)
* 5. Check if the most frequent kmers are overrepresented
* 6. Check if the most frequent 8mers are present in the most frequent 24mers
* 7. Try to extend the 24mers by a few bp => linkers
* 8. Align linkers to the opposite stand sequences using nucmer
* 9. Extract the subsequences adjacent(following) to linker (50..150bp)
* 10. Align the subsequences; if they align we've probably identified the vector
* 11. Identify the vector name/id by alignment to UniVec => several alignments
* 12. Check if the forward/reverse vector(s) are the same : we should find a common vector sequence; the UniVec alignments should be adjacent
* 13. create the Lucy vector & splice files; the splice contains the linker+vector
* 14. run lucy & trim input reads according to Lucy clr
* 15. align lucy trimmed reads to linker,vector,splice & UniVec.dust
* 16. align input reads to linker,vector,splice & UniVec.dust
* 17. compare the 15. & 16. counts
== Example ==
* 1. volume 011 :  500,000 reads  CENTER_NAME=BCM, TRACE_TYPE_CODE=WGS
* 2.
* 3. 249,611 TRACE_END=F & 250,389 TRACE_END=R
* 4. kmers: 8 8bp most frequent kmers are shared by the FORWARD & REVERSE strands ; no 24bp kmers are shared
  ==> 24.fwd/kmers.tab <==
  AGTTCGACTGCAAGTAGTTCATCA      TGATGAACTACTTGCAGTCGAACT        2463 # contains AGTAGTTC
  GAGTTCGACTGCAAGTAGTTCATC      GATGAACTACTTGCAGTCGAACTC        2189
  CGAGTTCGACTGCAAGTAGTTCAT      ATGAACTACTTGCAGTCGAACTCG        1996
  TCGAGTTCGACTGCAAGTAGTTCA      TGAACTACTTGCAGTCGAACTCGA        1593
  GTTCGACTGCAAGTAGTTCATCAA      TTGATGAACTACTTGCAGTCGAAC        1023
  GAGTTCGACTGCAGTAGTTCATCA      TGATGAACTACTGCAGTCGAACTC        812
  CGAGTTCGACTGCAGTAGTTCATC      GATGAACTACTGCAGTCGAACTCG        777
  GTTCGACTGCAAGTAGTTCATCAT      ATGATGAACTACTTGCAGTCGAAC        769
  TCGAGTTCGACTGCAGTAGTTCAT      ATGAACTACTGCAGTCGAACTCGA        637
  ATCGAGTTCGACTGCAAGTAGTTC      GAACTACTTGCAGTCGAACTCGAT        594
 
  ==> 08.fwd/kmers.tab <==
  AGTAGTTC      GAACTACT        86477
  CAGTAGTT      AACTACTG        67681
  AGTTCTCA      TGAGAACT        61556
  TAGTTCTC      GAGAACTA        60964
  GTAGTTCT      AGAACTAC        57866
  AGTTCATC      GATGAACT        49676
  TAGTTCAT      ATGAACTA        45298
  GTTCATCA      TGATGAAC        42117
  GCAGTAGT      ACTACTGC        41391
  GTAGTTCA      TGAACTAC        40694
 
  ==> 24.rev/kmers.tab <==
  TATCGATGGTACAGTAGTTCATCA      TGATGAACTACTGTACCATCGATA        999 # contains AGTAGTTC
  CTATCGATGGTACAGTAGTTCATC      GATGAACTACTGTACCATCGATAG        774
  GCTATCGATGGTACAGTAGTTCAT      ATGAACTACTGTACCATCGATAGC        600
  CGCTATCGATGGTACAGTAGTTCA      TGAACTACTGTACCATCGATAGCG        432
  ATCGATGGTACAGTAGTTCATCAT      ATGATGAACTACTGTACCATCGAT        417
  ATCGATGGTACAGTAGTTCATCAA      TTGATGAACTACTGTACCATCGAT        380
  ATCAGATGGTACAGTAGTTCATCA      TGATGAACTACTGTACCATCTGAT        373
  ATCGATGGTACAGTAGTTCATCAC      GTGATGAACTACTGTACCATCGAT        265
  CTATCGATGGTAAGTAGTTCATCA      TGATGAACTACTTACCATCGATAG        235
  TCAGATGGTACAGTAGTTCATCAA      TTGATGAACTACTGTACCATCTGA        224
 
  ==> 08.rev/kmers.tab <==
  AGTTCATC      GATGAACT        85127
  TAGTTCAT      ATGAACTA        77902
  GTTCATCA      TGATGAAC        75585
  TAGTTCTC      GAGAACTA        68057
  AGTTCTCA      TGAGAACT        67277
  GTAGTTCT      AGAACTAC        64894
  GTAGTTCA      TGAACTAC        62607
  CGTAGTTC      GAACTACG        52031
  AGTAGTTC      GAACTACT        51013
  ACGTAGTT      AACTACGT        31552
* 7. Get linker sequences
  >linker.fwd 27bp
  TCGAGTTCGACTGCAAGTAGTTCATCA
  >linker.rev 27bp
  CTAATCAGATGGTACAGTAGTTCATCA
 
  #>linker.rev 40 bp Art's  (13 more bp at 5')       
  #TATGACCATGCGCCTAATCAGATGGTACAGTAGTTCATCA
  #GCTATCGATGGTACAGTAGTTCATCAT is the most frequent rev seq 27 kmers but not the linker (few snp differences)
* 8 & 9 Align reads to linkers using nucmer
Fwd:
  nucmer -l 12 -c 24 -r linker.fwd.seq ../bos_taurus.$v.r.fasta
  #  nucmer -l 12 -c 24 -r kmers.seq ../bos_taurus.$v.r.fasta 
  show-coords out.delta | awk '{print $19,$5,$13}' > ! out.clr
  extractfromfastanames.pl -clr -f out.clr < ../bos_taurus.$v.r.fasta >! out.seq
 
Rev:
  nucmer -l 12 -c 24 -r linker.rev.seq ../bos_taurus.$v.f.fasta
  #  nucmer -l 12 -c 24 -r kmers.seq ../bos_taurus.$v.f.fasta 
  show-coords out.delta | awk '{print $19,$5,$13}' > ! out.clr
  extractfromfastanames.pl -clr -f out.clr < ../bos_taurus.$v.f.fasta >! out.seq
 
Both:
  clrFasta out.seq >! out.cseq
  fasta2tab.pl out.cseq | sort -k2 > ! out.tab
  nucmer -c 40 out.cseq ~/db/UniVec -p vector
  delta-filter -q vector.delta >! vector.filter-q.delta
  show-coords vector.filter-q.delta | sort -n | head
  cat vector.filter-q.delta | grep "^>" | count.pl -c 1 -m 2
 
* 10. Extract "vector reads"
  >399553028  # 24.fwd   
  TGATGAACTACTGTACCATCTGATTAGGCGCATGGTCATAGCTGTTTCCTGTGTGAAATT
  GCTATCCGCTCACAATTCCACACAACATACGAGCCGGAAGCATAAAGTGTAAAGCCTGGG
  GTGTCAAATGAGAGACCTAACTCACATTCAACTTTTTTTTTTTTTCTGCCCTCTATTCTA
  ...
  >400269118 #24.rev
  TGATGAACTACTTGCAGTCGAAATCGAATCATCACTGGCCGTCCTTTTACAACGTCGTGA
  CTGGGAAAACCCTGGCGTTACCCAACTTAATCCGCCTTGCAGCACATCCCCCTTTCCCCC
  AGCTGGCGTAAAAACGTAAAAAGCCCCGCACCGATCGCCCTTTCCCAACAGGTTGCCCAG
* 11. Align  "vector reads" to UniVec; identify vector
  show-coords 24.fwd/400269118-UniVec.delta 24.rev/399553028-UniVec.delta | grep J01636.1
      31  148  | 1175 1292  | 118  118  |  95.76  |    1276    7477  |    9.25    1.58  | 399553028.rev gnl|uv|J01636.1:1-7477
      32  199  | 1302 1463  | 168  162  |  90.48  |      653    7477  |    25.73    2.17  | 400269118    gnl|uv|J01636.1:1-7477
* 12. 10bp distance between the 2 alignments
* 13. Lucy files
  $ more vector.seq
    >J01636 E.coli lactose operon with lacI, lacZ, lacY and lacA genes
    GACACCATCGAATGGCGCAAAACCTTTCGCGGTATGGCATGATAGCGCCCGGAAGAGAGTCAATTCAGGG
    TGGTGAATGTGAAACCAGTAACGTTATACGATGTCGCAGAGTATGCCGGTGTCTCTTATCAGACCGTTTC
    CCGCGTGGTGAACCAGGCCAGCCACGTTTCTGCGAAAACGCGGGAAAAAGTGGAAGCGGCGATGGCGGAG
    CTGAATTACATTCCCAACCGCGTGGCACAACAACTGGCGGGCAAACAGTCGTTGCTGATTGGCGTTGCCA
    ...
 
  $ more splice.seq
    >J01636.for.begin vector+linker.rev
    TGAATGTGAGTTAGGTCTCTCATTTGACACCCCAGGCTTTACACTTTATGCTTCCGGCTC
    GTATGTTGTGTGGAATTGTGAGCGGATAGCAATTTCACACAGGAAACAGCTATGACCATG
    CGCCTAATCAGATGGTACAGTAGTTCATCA
    >J01636.for.end  rev(linker.fwd)+vector
    TGATGAACTACTTGCAGTCGAAATCGAATCATCACTGGCCGTCCTTTTACAACGTCGTGA
    CTGGGAAAACCCTGGCGTTACCCAACTTAATCCGCCTTGCAGCACATCCCCCTTTCCCCC
    AGCTGGCGTAAAAACGTAAAAAGCCCCGCA
    >J01636.rev.begin (revcomp of J01636.for.end)
    TGCGGGGCTTTTTACGTTTTTACGCCAGCTGGGGGAAAGGGGGATGTGCTGCAAGGCGGA
    TTAAGTTGGGTAACGCCAGGGTTTTCCCAGTCACGACGTTGTAAAAGGACGGCCAGTGAT
    GATTCGATTTCGACTGCAAGTAGTTCATCA
    >J01636.rev.end (revcomp of J01636.for.begin)
    TGATGAACTACTGTACCATCTGATTAGGCGCATGGTCATAGCTGTTTCCTGTGTGAAATT
    GCTATCCGCTCACAATTCCACACAACATACGAGCCGGAAGCATAAAGTGTAAAGCCTGGG
    GTGTCAAATGAGAGACCTAACTCACATTCA
  # splice=linker+vector 
      3      120  |    1175    1292  |      118      118  |    95.76  |      150    7477  |    78.67    1.58  | J01636.for.begin  J01636
      32      131  |    1302    1399  |      100      98  |    96.00  |      150    7477  |    66.67    1.31  | J01636.for.end    J01636
* 13.1 Align vector & splice to Ecoli
      1    7474  |  366812  359335  |    7474    7478  |    99.91  |    7477  4639675  |    99.96    0.16  | J01636            NC_000913.2    [CONTAINED]
      20      119  |      65      162  |      100      98  |    96.00  |      150      395  |    66.67    24.81  | J01636.rev.begin  NC_000913.2
      31      148  |      172      289  |      118      118  |    95.76  |      150      395  |    78.67    29.87  | J01636.rev.end    NC_000913.2
    1069    1463  |      395        1  |      395      395  |  100.00  |    7477      395  |    5.28  100.00  | J01636            NC_000913.2.365350-365744
* 14. Run lucy & trim reads
  $ /nfshomes/dpuiu/szdevel/SourceForge/lucy-1.19p/lucy \
      -v vector.seq splice.seq
      -o bos_taurus.lucy.seq bos_taurus.lucy.qual \
      -debug  bos_taurus.lucy.info \
      bos_taurus.seq bos_taurus.qual
  # Trim clr
  $ clrFasta bos_taurus.seq > bos_taurus.cseq
* 15. Align lucy output to linker, vector, splice & UniVec.dust
  $ nucmer -l 12 -c 24 ~/db/vector.seq  bos_taurus.lucy.cseq -p vector-bos_taurus.lucy
  $ nucmer -l 16 -c 30 ~/db/vector.seq  bos_taurus.lucy.cseq -p vector-bos_taurus.lucy
  $ nucmer -l 16 -c 30 ~/db/splice.seq  bos_taurus.lucy.cseq -p splice-bos_taurus.lucy
  $ nucmer -l 16 -c 30 ~/db/UniVec.dust bos_taurus.lucy.cseq -p UniVec.dust-bos_taurus.lucy
* 16. Align input to linker, vector, splice & UniVec.dust
  $ nucmer -l 12 -c 24 ~/db/linker.seq bos_taurus.seq -p linker-bos_taurus
  $ nucmer -l 16 -c 30 ~/db/vector.seq bos_taurus.seq -p vector-bos_taurus
  $ nucmer -l 16 -c 30 ~/db/splice.seq bos_taurus.seq -p splice-bos_taurus
  $ nucmer -l 16 -c 30 ~/db/UniVec.dust bos_taurus.seq -p UniVec.dust-bos_taurus
Count how many reads got trimmed
  infoseq *seq | getSummary.pl -c 1 -t original.LEN
 
  cat bos_taurus.lucy.info | awk '{print $4-$3}' | getSummary.pl -t lucy.CLR >! bos_taurus.lucy.summary 
  cat bos_taurus.lucy.info | getSummary.pl -c 14 -t lucy.CLV5 -nh >> bos_taurus.lucy.summary
  cat bos_taurus.lucy.info | getSummary.pl -c 15 -t lucy.CLV3 -nh >> bos_taurus.lucy.summary
== Libraries ==
=== 011.BCM.WGS FORWARD ===
* vector: J01636
* UniVec: gnl|uv|J01636.1:1-7477 E.coli lactose operon with lacI, lacZ, lacY and lacA genes
  ll ~dpuiu/db/J01636*
  -rw-rw-r--  1 dpuiu dpuiu 7651 Jan  9 15:56 /nfshomes/dpuiu/db/J01636
  -rw-rw-r--  1 dpuiu dpuiu  105 Jan 14 07:17 /nfshomes/dpuiu/db/J01636linker
  -rw-rw-r--  1 dpuiu dpuiu  840 Jan 13 13:43 /nfshomes/dpuiu/db/J01636splice
  cat  ~dpuiu/db/J01636* | infoseq
  J01636            7477  53.43
  J01636.linker.fwd 27    44.44
  J01636.linker.rev 27    37.04
  J01636.for.begin  150    44.67
  J01636.for.end    150    51.33
  J01636.rev.begin  150    51.33
  J01636.rev.end    150    44.67
* 249,611 reads:
* 91% got vector trimmed at the 5' 
* 0.4% (1149) got  vector trimmed at the 3'
                  #elem  #0s    min    max    mean    median  n50    sum
  original.LEN    249611  0      437    2349    1082    991    1009    270035781   
  lucy.CLV5      249611  21215  0      741    25.03  25      27      6247415
  lucy.CLV3      249611  248462  0      1047    3.49    0      859    870344
* Original reads hit counts:
10975 linker.fwd
133  linker.rev
166  splice
152  vector
228  UniVec.dust
* Lucy trimmed read counts
2 linker.fwd
0 linker.rev
1 splice
1 vector
6 UniVec.dust (only 3 are >40bp)
=== 011.BCM.WGS REVERSE ===
                  #elem  #0s    min    max    mean    median  n50    sum
  original.LEN    250389  0      502    2148    1085    993    1012    271691094
  lucy.CLR        250389  7345    0      1281    795    876    892    198982171
  lucy.CLV5      250389  20271  0      668    26.52  27      29      6641362
  lucy.CLV3      250389  249269  0      997    3.35    0      861    839029
* Original reads hit counts:
  linker.fwd      113
  linker.rev      3812
  splice          143
  UniVec.dust    237
  vector          4318
* Lucy trimmed reads hit counts:
  linker.fwd      1
  linker.rev      0
  splice          1
  UniVec.dust    10
  vector          1
=== 030.BCM.SHOTGUN ===
* same linker/vector/splice as BCM.WGS
* 2.5% (4K out of 160K) reads contain linker & vector at 3'
                  #elem  #0s    min    max    mean    median  n50    sum
  original.LEN    8411    0      325    1685    1181    1240    1314    9933150
  lucy.CLR        8411    8      0      1054    841    863    874    7070994
  lucy.CLV5      8411    568    0      232    27.01  28      29      227206
  lucy.CLV3      8411    2325    0      1040    597    794    851    5023445
* Original reads hit counts:
  linker.fwd      4314
  linker.rev      4125
  splice          7816
  UniVec.dust    4212
  vector          6750
  vector          27235
* Lucy trimmed reads hit counts:
  linker.fwd      3
  linker.rev      1
  splice          1
  UniVec.dust    13
  vector          0
=== 001.NISC.SHOTGUN ===
* Vector: pOTW13
* UniVec: 3 partial seqs
  gnl|uv|NGB00080.1:1-198 pOTW13 with linkers
  gnl|uv|NGB00080.1:718-888 pOTW13 with linkers
  gnl|uv|NGB00080.1:1490-1654-49 pOTW13 with linkers
  ll /nfshomes/dpuiu/db/NGB00080*
  -rw-rw-r--  1 dpuiu dpuiu 1083 Jan 14 20:43 /nfshomes/dpuiu/db/NGB00080
  -rw-r--r--  1 dpuiu dpuiu  94 Jan 14 21:01 /nfshomes/dpuiu/db/NGB00080linker
  -rw-r--r--  1 dpuiu dpuiu 2183 Jan 14 20:44 /nfshomes/dpuiu/db/NGB00080splice
  cat  /nfshomes/dpuiu/db/NGB00080* | infoseq
  NGB00080      1054  50.00
  NGB00080.linker.fwd 24    45.83
  NGB00080.linker.rev 26    53.85
  NGB00080.for.beg 518    46.14
  NGB00080.for.end 518    50.48
  NGB00080.rev.begin 518    50.48
  NGB00080.rev.beg 518    46.14
* 944 read sample
                  #elem  #0s    min    max    mean    median  n50    sum
  original.LEN    944    0      652    1017    735    721    722    693668
  lucy.CLR        944    39      0      886    415    422    522    391333
  lucy.CLV5      944    121    0      275    34.05  33      35      32143
  lucy.CLV3      944    18      0      885    410    409    511    387007
* Original reads hit counts:
  linker.fwd      479
  linker.rev      492
  splice          910
  UniVec.dust    0
  vector          939
* Lucy trimmed reads hit counts:
  linker.fwd      1
  linker.rev      0
  splice          0
  UniVec.dust    9
  vector          1
=== 060.BCCAGSC.CLONEEND ===
* Linkers:
  linker.fwd CCCTGCTTTGTCTGGAAGGGGTTCCCGACCT
  linker.rev CAGGAGGGGAGAAAGGGCTCAGAGG
* No common vector !!!
  wc -l *clb
    60746 bos_taurus.060.f.clb  #18 reads original align to UniVec (nucmer default params)
    60836 bos_taurus.060.r.clb
 
  Fwd:
    329      428  |      440      535  |      100      96  |    91.00  |      503    1585  |    19.88    6.06  | 723951410  gnl|uv|U30497.1:3230-4814 Cloning vector pAS2-1
    330      370  |      89      49  |      41      41  |  100.00  |      503      143  |    8.15    28.67  | 723951410  gnl|uv|U67875.1:6541-6683 pESP-I yeast expression vector
    330      370  |      94      54  |      41      41  |  100.00  |      503      143  |    8.15    28.67  | 723951410  gnl|uv|U67875.1:6541-6683 pESP-I yeast expression vector
 
  Rev:
      1      96  |      71      165  |      96      95  |    93.81  |      203      165  |    47.29    57.58  | 724018013  gnl|uv|AF133437.1:16659-16823 Cloning vector pCYPAC6
      50      143  |        1      94  |      94      94  |    92.71  |      203      94  |    46.31  100.00  | 724018013  gnl|uv|U80929.2:2858-2951    Cloning vector pBACe3.6
=== 017.UIUC.CLONEEND ===
* No overrepresented kmers
  wc -l *clb
  17978 bos_taurus.017.f.clb
  17911 bos_taurus.017.r.clb
 
  ==> 24.fwd/kmers.tab <==
  CCCTGCTTTGTCTGGAAGGGGTTC        GAACCCCTTCCAGACAAAGCAGGG        9
  CTGCTTTGTCTGGAAGGGGTTCCC        GGGAACCCCTTCCAGACAAAGCAG        9
 
  ==> 24.rev/kmers.tab <==
  GAATGTTGAGCTTTAGCCAACTTT        AAAGTTGGCTAAAGCTCAACATTC        4
  TCTGAATGTTGAGCTTTAGCCAAC        GTTGGCTAAAGCTCAACATTCAGA        4
 
  ==> 8.fwd/kmers.tab <==
  TTTTTTTT        AAAAAAAA        55
  AAGGGGTT        AACCCCTT        35
 
  ==> 8.rev/kmers.tab <==
  GTCTGGAA        TTCCAGAC        41
  TCTGGAAG        CTTCCAGA        39
* No UniVec hits
=== 010.TIGR.CLONEEND ===
* No overrepresented kmers
  wc -l *clb
  5479 bos_taurus.032.f.clb
  5174 bos_taurus.032.r.clb
 
  ==> 24.fwd/kmers.tab <==
  CTTGTGTTGGCCCAGGCAAGTCCA        TGGACTTGCCTGGGCCAACACAAG        30
  TTGTGTTGGCCCAGGCAAGTCCAA        TTGGACTTGCCTGGGCCAACACAA        30
 
  ==> 24.rev/kmers.tab <==
  CTGCCTCTTGTGTTGGCCCAGGCA        TGCCTGGGCCAACACAAGAGGCAG        16
  GCTGCCTCTTGTGTTGGCCCAGGC        GCCTGGGCCAACACAAGAGGCAGC        15
 
  ==> 8.fwd/kmers.tab <==
  GAGTGGGT        ACCCACTC        176
  GGAGTGGG        CCCACTCC        171
 
  ==> 8.rev/kmers.tab <==
  TGGAGTGG        CCACTCCA        182
  GGAGTGGG        CCCACTCC        181
 
* No UniVec hits
=== ... ===
=== 070.BCM.CLONEEND ===
* No frequent kmers
  wc -l *clb
    6027 bos_taurus.070.f.clb
    6236 bos_taurus.070.r.clb
 
  ==> 24.fwd/kmers.tab <==
  GGACTCTCAGAGTCTTCTCCAACA        TGTTGGAGAAGACTCTGAGAGTCC        18
  ACTGGTTGGATCTCCTTGCAGTCC        GGACTGCAAGGAGATCCAACCAGT        18
  ==> 24.rev/kmers.tab <==
  ATAAAATCTGAGCCACCAGGGAAG        CTTCCCTGGTGGCTCAGATTTTAT        1
  CTATTGGTTCATATGGTCAACGTC        GACGTTGACCATATGAACCAATAG        1
 
  ==> 8.fwd/kmers.tab <==
  TTTTTTTT        AAAAAAAA        86
  CTTCTCCA        TGGAGAAG        75
 
  ==> 8.rev/kmers.tab <==
  TATAGTGT        ACACTATA        9
  ATATAGGG        CCCTATAT        8
* No alignments to BCM WGS vector
= Running Lucy =
* Default parameters with vector trimming
* BCM vector/splice
  /nfshomes/dpuiu/db/vector.BCM.seq
  /nfshomes/dpuiu/db/splice.BCM.seq
* NISC vector/splice
  /nfshomes/dpuiu/db/vector.NISC.seq
  /nfshomes/dpuiu/db/splice.NISC.seq
== BCM.WGS (all reads)==
* orig.CLR < lucy.CLR ( 765 < 792 )
* orig.CLV > lucy.CLV ( 1015 > 973 )
* 739,529    out of 24,863,599 reads (3%)  deleted by Lucy (CLR=-1,-1)
* 21,728,592 out of 24,863,599 reads (87%)  vector trimmed at the 5' end
* 92,646    out of 24,863,599 reads (0.3%) vector trimmed at the 3' end
                            '''elem      <0        0          >0        min        max        mean      median    n50        sum'''
  orig.LEN                  24863599  0          0          24863599  5          3097      1002      997        1015      24915462033
 
  orig.CLR                  24863599  463669    7          24399923  -1143      1833      '''765'''        836        864        19036744256
  orig.CLR5                24863599  0          359245    24504354  0          2103      42        22        58        1047922451
  orig.CLR3                24863599  463404    0          24400195  -1        2169      807        872        895        20084666707
 
  lucy.CLR                  24863599  0          739529    24124070  0          1219      '''792'''        878        904        19695000417
  lucy.CLR5                24863599  '''739529'''    36108      24087962  -1        1753      43        29        42        1086413880
  lucy.CLR3                24863599  '''739529'''    0          24124070  -1        1894      835        915        939        20781414297
 
  orig.CLR5-lucy.CLR5      24863599  16299521  215345    8348733    -1186      2104      -1        -10        -1186      -38491429
  orig.CLR3-lucy.CLR3      24863599  14858542  1494794    8510263    -1273      2170      -28        -20        -1273      -696747590
 
  orig.CLV                  24863599  1053      1920      24860626  -2        5345      '''1015'''      1002      1017      25260581538
  orig.CLV5                8841849    0          0          8841849    1          1219      33        46        49        295011460
  orig.CLV3                24861698  1053      0          24860645  -1        5346      1027      1005      1019      25555592998
 
  lucy.CLV                  24863599  10694      707        24852198  -469      3096      '''973'''        968        987        24195085877
  lucy.CLV5                24863599  0          3135007    21728592  0          1359      25        27        29        623457486
  lucy.CLV3                24863599  0          0          24863599  4          3096      998        995        1014      24818543363
  lucy.CLVABS5              24863599  0          3135007    '''21728592'''  0          1359      25        27        29        623457486
  lucy.CLVABS3              24863599  0          24770953  '''92646'''      0          1343      2          0          880        72055071
 
  orig.CLV5-lucy.CLV5      24863599  17216820  1512453    6134326    -1312      1219      -13        -25        -1312      -328446026
  orig.CLV3-lucy.CLV3      24863599  1519132    18579609  4764858    -1832      4672      29        0          479        737049635
== BCM.WGS (0 quality reads)==
* orig.CLR > lucy.CLR (mean)
* orig.CLV > lucy.CLV (mean)
* 7,153    out of 551,114 reads (1.3%)  deleted by Lucy (CLR=-1,-1)
* 508,166  out of 551,114 reads (92%)  vector trimmed at the 5' end
* 1,946    out of 551,114 reads (0.35%) vector trimmed at the 3' end
                            '''elem      <0        0          >0        min        max        mean      median    n50        sum'''
  orig.LEN                  551114    0          0          551114    5          1464      872        946        959        480705828
 
  orig.CLR                  551114    7754      0          543360    -770      1175      708        786        807        390325117
  orig.CLR5                551114    0          6773      544341    0          1519      44        20        111        24582849
  orig.CLR3                551114    7744      0          543370    -1        1638      752        818        833        414907966
 
  lucy.CLR                  551114    0          7153      543961    0          699        636        671        671        350759771
  lucy.CLR5                551114    7153      35872      508089    -1        201        26        27        28        14442310
  lucy.CLR3                551114    7153      0          543961    -1        699        662        699        699        365202081
 
  orig.CLR5-lucy.CLR5      551114    364282    8801      178031    -198      1500      18        -8        215        10140539
  orig.CLR3-lucy.CLR3      551114    85058      2962      463094    -700      1472      90        123        178        49705885
 
  orig.CLV                  551114    971        0          550143    -2        2037      974        978        981        537127121
  orig.CLV5                5100      0          0          5100      1          845        35        29        31        180490
  orig.CLV3                551114    971        0          550143    -1        2037      974        978        981        537307611
 
  lucy.CLV                  551114    58        6          551050    -84        1456      841        917        930        463903233
  lucy.CLV5                551114    0          42948      508166    0          202        27        28        29        14964546
  lucy.CLV3                551114    0          0          551114    4          1463      868        945        958        478867779
  lucy.CLVABS5              551114    0          42948      508166    0          202        27        28        29        14964546
  lucy.CLVABS3              551114    0          549168    1946      0          700        2          0          686        1286935
 
  orig.CLV5-lucy.CLV5      551114    506108    42215      2791      -202      845        -26        -28        -202      -14784056
  orig.CLV3-lucy.CLV3      551114    134959    23422      392733    -967      1614      106        7          459        58439832
== BCM.SHOTGUN ==
* orig.CLR < lucy.CLR (mean)
* orig.CLV > lucy.CLV (mean)
* 98,070    out of 10,748,529 reads (0.9%)  deleted by Lucy (CLR=-1,-1)
* 9,737,008  out of 10,748,529 reads (90%)  vector trimmed at the 5' end
* 294,942    out of 10,748,529 reads (2.7%)  vector trimmed at the 3' end
                            '''elem      <0        0          >0        min        max        mean      median    n50        sum'''
  orig.LEN                  10748529  0          0          10748529  5          2043      975        950        964        10486690472
 
  orig.CLR                  10748529  17308      2          10731219  -1293      1467      809        833        847        8701344571
  orig.CLR5                10748529  0          68        10748461  0          1315      26        16        38        288662580
  orig.CLR3                10748529  16780      0          10731749  -1        1647      836        851        863        8990007151
 
  lucy.CLR                  10748529  0          98070      10650459  0          1337      833        854        868        8955866769
  lucy.CLR5                10748529  98070      1973      10648486  -1        1307      35        28        32        376276188
  lucy.CLR3                10748529  98070      0          10650459  -1        1553      868        882        896        9332142957
 
  orig.CLR5-lucy.CLR5      10748529  9498290    65171      1185068    -1099      1293      -8        -11        -1099      -87613608
  orig.CLR3-lucy.CLR3      10748529  6879532    671097    3197900    -1149      1437      -31        -26        -1149      -342135806
 
  orig.CLV                  10748529  16779      412        10731338  -2        3919      974        948        964        10472347908
  orig.CLV5                8594910    0          0          8594910    1          1239      3          1          49        28350257
  orig.CLV3                10748349  16779      0          10731570  -1        3919      976        950        965        10500698165
 
  lucy.CLV                  10748529  7026      614        10740889  -268      2042      930        924        940        9997862132
  lucy.CLV5                10748529  0          1011521    9737008    0          855        24        24        27        257993796
  lucy.CLV3                10748529  0          0          10748529  4          2042      954        945        962        10255855928
  lucy.CLVABS5              10748529  0          1011521    9737008    0          855        24        24        27        257993796
  lucy.CLVABS3              10748529  0          10453587  294942    0          1214      20        0          847        220086015
 
  orig.CLV5-lucy.CLV5      10748529  9538738    138680    1071111    -854      1239      -21        -23        -854      -229643539
  orig.CLV3-lucy.CLV3      10748529  357934    9324166    1066429    -1328      2846      22        0          704        244842237
== NISC.SHOTGUN ==
* orig.CLR < lucy.CLR (mean)
* orig.CLV > lucy.CLV (mean)
* 8,248    out of 737,900 reads (1.1%)  deleted by Lucy (CLR=-1,-1)
* 633,409  out of 737,900 reads (85%)  vector trimmed at the 5' end
* 7,201    out of 737,900 reads (0.97%) vector trimmed at the 3' end
                            '''elem      <0        0          >0        min        max        mean      median    n50        sum'''
  orig.LEN                  737900    0          0          737900    104        2104      784        729        734        579172842
 
  orig.CLR                  737900    5988      2          731910    -636      1033      651        668        676        480400909
  orig.CLR5                737900    0          0          737900    1          1407      47        40        51        34857531
  orig.CLR3                737900    0          5879      732021    0          1470      698        710        715        515258440
 
  lucy.CLR                  737900    0          8248      729652    0          1035      658        670        676        485757685
  lucy.CLR5                737900    8248      56        729596    -1        1091      45        35        46        33811606
  lucy.CLR3                737900    8248      0          729652    -1        1391      704        710        714        519569291
 
  orig.CLR5-lucy.CLR5      737900    253727    89345      394828    -566      1408      1          1          485        1045925
  orig.CLR3-lucy.CLR3      737900    177007    31        560862    -867      1471      -5        1          -867      -4310851
 
  orig.CLV                  737900    3224      2655      732021    -636      2103      771        725        730        569178445
  orig.CLV5                734026    0          0          734026    1          987        5          1          35        4375315
  orig.CLV3                732021    0          0          732021    35        2104      783        729        734        573553760
 
  lucy.CLV                  737900    1335      55        736510    -200      2104      747        696        702        551392388
  lucy.CLV5                737900    104491    0          633409    -1        1199      30        31        34        22784742
  lucy.CLV3                737900    0          0          737900    15        2103      778        728        733        574177130
  lucy.CLVABS5              737900    0          104491    633409    0          1200      31        32        35        23522642
  lucy.CLVABS3              737900    0          730699    7201      0          1076      5          0          686        4257812
 
  orig.CLV5-lucy.CLV5      737900    561851    66390      109659    -1198      983        -24        -29        -1198      -18409427
  orig.CLV3-lucy.CLV3      737900    8386      1          729513    -950      1077      0          1          -950      -623370
= Fragment files =
* Locations:
  /fs/szasmg3/bos_taurus/data/frg
  /fs/szasmg3/bos_taurus/data/frg.new
* All DST messages are unique
* bos_taurus.clv : contains the vector clipping points
** BCM.WGS, BCM.SHOTGUN & NISC.SHOTGUN: lucy.clv
** others: the TA clv
** 374,454 reads don't have valid clv's
** 36,446,031 reads have valid clv's with avg=955
== Message counts (original) ==
                                                DST    FRG            LKG
  bos_taurus.BCM.WGS.frg                        79      24124070        11311841
  #bos_taurus.BCM.SHOTGUN.frg                    7339    10650459        1799069    # some libs & mates are missing due to a tarchive2ca crash (used by UMD2.1)
  #bos_taurus.BCM.SHOTGUN.new.frg                18208  10650459        4715172    # split the libraries by VOL & SEQ_LIB_ID (used by UMD2.2)
  #bos_taurus.BCM.SHOTGUN.new.frg                13826  10650459        5046435    # double check the FRG count !!! (used by UMD2.3)
  bos_taurus.BCM.SHOTGUN.new.frg                7      10650459        5046435    # UMD2.4
  bos_taurus.NISC.SHOTGUN.frg                    246    729652          344932
  bos_taurus.BCCAGSC.CLONEEND.frg                1      125241          59505
  bos_taurus.UIUC.CLONEEND.frg                  2      114750          46319
  bos_taurus.TIGR.CLONEEND.frg                  1      65171          27067
  bos_taurus.GSC.CLONEEND.frg                    1      53521          25889
  bos_taurus.CENARGEN.WGS.frg                    0      26246          0
  #bos_taurus.BARC.CLONEEND.frg                  11150  25454          11150      # (used by UMD2.3)
  bos_taurus.BARC.CLONEEND.frg                  1      25454          11150      # (used by UMD2.4)
  bos_taurus.BCM.CLONEEND.frg                    1      16875          7103
  bos_taurus.CENARGEN.CLONEEND.frg              1      16787          6269
  bos_taurus.UOKNOR.SHOTGUN.frg                  1      14651          4910
  bos_taurus.TIGR_JCVIJTC.CLONEEND.frg          2      10651          4803
  bos_taurus.UOKNOR.FINISHING.frg                0      151            0
  bos_taurus.WUGSC.COLONEEND.frg                1      49              21
  #total                                        25312  35973728        16896244  # (UMD2.3) <font color=red>
  total                                          344    35973728        16896244  # (UMD2.4) </font>
== Message counts (quality) ==
                                                DST    FRG            LKG
  bos_taurus.BCM.WGS.qual.count                  79      23580109        11035582
  #bos_taurus.BCM.SHOTGUN.qual.count            7339    10644092        1799069
  bos_taurus.BCM.SHOTGUN.qual.new.count          18208  10644092        4712446
  bos_taurus.NISC.SHOTGUN.count                  246    729652          344932
  bos_taurus.BCCAGSC.CLONEEND.qual.count        1      116484          53585
  bos_taurus.UIUC.CLONEEND.count                2      114750          46319
  bos_taurus.TIGR.CLONEEND.count                1      65171          27067
  bos_taurus.CENARGEN.WGS.count                  0      26246          0
  bos_taurus.BARC.CLONEEND.count                11150  25454          11150
  bos_taurus.BCM.CLONEEND.count                  1      16875          7103
  bos_taurus.CENARGEN.CLONEEND.count            1      16787          6269
  bos_taurus.TIGR_JCVIJTC.CLONEEND.count        2      10651          4803
  bos_taurus.UOKNOR.SHOTGUN.qual.count          1      2456            813
  bos_taurus.WUGSC.COLONEEND.count              1      49              21
== Message counts (0quality) ==
                                                DST    FRG            LKG
  bos_taurus.BCM.WGS.0qual.count                79      543961          234397
  bos_taurus.GSC.CLONEEND.0qual.count            1      53521          25889
  bos_taurus.UOKNOR.SHOTGUN.0qual.count          1      12195          4097
  bos_taurus.BCCAGSC.CLONEEND.0qual.count        1      8757            2114
  bos_taurus.BCM.SHOTGUN.0qual.count            7339    6367            0
  bos_taurus.UOKNOR.FINISHING.0qual.count        0      151            0
= Assemblies =
[[Media:Bt.qc.combine|Bt.qc.combine]] UMD2.0 ... UMD2.5 combine stats
= UMD2.1(2009_0122_CA; Quality reads) =
== Issues ==
# Uses only quality reads
# BCM.SHOTGUN library : ~ 4715172-1799069=2.9M mates were missed due to a tarchive2ca crash ; some libraries got merged (were assigned the same lib_id)
# All reads except for BCM.WGS were set as nonrandom
# Update the runCA script to run overlapper concurently; new "ovlConcurrency" parameter added to the .spec file !!!
# consensus after cgw crashed in MultiAlignContig() ... use "consensus -D forceunitigabut" !!!
# cgw crashed after updating gkpStore with new lib/mate info => edit Input_CGW.c, remove the assert in line 117
== Info ==
 
  host: walnut
 
  assembly version: wgs-5.2 stable
 
  dir:  /scratch1/bos_taurus/Assembly/2009_0122_CA
 
  command: /fs/szdevel/dpuiu/SourceForge/wgs/Linux-amd64/bin/runCA-test -d . -p bt -s bt01.specFile *.frg
 
  spec file:
  cgwDistanceSampleSize  =      1000      # ??? too big; more than 50% of the BCM.SHOTGUN reads are in libraries with less than 1000 inserts
  cnsConcurrency          =      15
  cnsMinFrags            =      200000
  doOverlapTrimming      =      1
  frgCorrBatchSize        =      100000
  frgCorrConcurrency      =      15
  merylMemory            =      24000
  merylThreads            =      15
  obtMerThreshold        =      200
  obtOverlapper          =      ovl
  ovlConcurrency          =      8
  ovlCorrBatchSize        =      100000
  ovlCorrConcurrency      =      15
  ovlHashBlockSize        =      1200000
  ovlMemory              =      8GB --hashload 0.8 --hashstrings 400000
  ovlMerThreshold        =      500
  ovlOverlapper          =      ovl
  ovlRefBlockSize        =      7200000
  ovlThreads              =      2
  unitigger              =      utg
  utgErrorRate            =      0.015
  vectorIntersect        =      bos_taurus.clv
  doExtendClearRanges    =      2
== Steps ==
1. Run up till after initialStoreBuilding
  runCA stopAfter=initialStoreBuilding ...
2. Update gkpStore with nonrandom frg flag
  cat bos_taurus.nonrandom.clv | perl -ane 'print "frg uid $F[0] isnonrandom 1\n";'  > bos_taurus.nonrandom.edit
  gatekeeper -edit bos_taurus.nonrandom.edit bt.gkpStore
== Input ==
  gatekeeper -dumpinfo -lastfragiid bt.gkpStore
  ...
  Last frag in store is iid = 35348776
== OBT ==
                            elem      <0        0          >0        min        max        mean      median    n50        sum           
 
  CLV5                      35085508  0          3387027    31698481  0          970        25        27        29        891007232
  CLV3                      35164784  0          0          35164784  15        2974      984        980        1000      34612019144
 
  CLR_ORIG5                35348776  0          43354      35305422  0          1753      42        29        38        1502168205   
  CLR_ORIG3                35348776  0          0          35348776  70        1894      864        905        927        30547294868   
 
  CLR_OBT5                  35348776  0          26513      35322263  0          1690      49        30        73        1756346429   
  CLR_OBT3                  35348776  0          23477      35325299  0          1813      843        895        914        29824543869
<font color=red>
* 421,379 reads deleted by OBT: why so many???
</font>
* Chimera:
  20297 reads too short => deleted
* more  0-overlaptrim/bt.mergeLog.stats
  ...
  211037: short or inconsistent
  253536: deleted fragment due to zero clear
* Example:
  gatekeeper -dumpfragments 516316990  bt.gkpStore 
  fragmentIdent          = 516316990,14
  fragmentMate            = 0,0
  fragmentLibrary        = 27473,1563
  fragmentIsDeleted      = 1
  fragmentIsNonRandom    = 1
  fragmentStatus          = G
  fragmentOrientation    = I
  fragmentHasVectorClear  = 0
  fragmentHasQualityClear = 0
  fragmentPlate          = 0
  fragmentPlateLocation  = 0
  fragmentSeqLen          = 862
  fragmentHPSLen          = 0
  fragmentSrcLen          = 17
  fragmentClearORIG      = 38,553
  fragmentClearQLT        = 1,0
  fragmentClearVEC        = 1,0
  fragmentClearOBTINI    = 35,578
  fragmentClearOBT        = 35,578
  fragmentClearUTG        = 35,578
  fragmentClearECR1      = 35,578
  fragmentClearECR2      = 35,578
  fragmentSeqOffset      = 5376
  fragmentQltOffset      = 11038
  fragmentHpsOffset      = 53
  fragmentSrcOffset      = 287
  cat  0-overlaptrim/bt.mergeLog | grep 516316990
  516316990,14    412    412    0      0 (deleted, too short)
  zcat *r000*gz | convertOverlap -a -obt
  ...
    14 12128740  f  377  478  292  393  2.97
    14 15226267  f  397  446    31  80  2.04
    14 19071241  f    4  513  199  708  1.18
    14 20073917  f    7  478    36  508  4.88
    14 20042424  f    4  419  299  714  1.93
    14 20212935  f    7  478  234  706  4.88
    14 20073828  r    7  478  507  35  4.67
    14 20212846  r    7  478  557  85  4.67
    14 27089060  r  491  534  836  793  2.33
    14 29061748  f  489  540    86  137  1.96
    14 32105697  f  455  543  381  469  2.27
    14 32187461  f  430  534  105  209  1.92
    14 32027289  f    4  419  493  907  4.59
  ...
  #read aligns to contigs
  show-coords 516316990-ctg.filter-r.strict.delta
      35      531  |      97      594  |      497      498  |    99.20  |      862    2759  |    57.66    18.05  | 516316990  ctg7180001872751
      45      678  |      931    1564  |      634      634  |    97.00  |      862    1567  |    73.55    40.46  | 516316990  ctg7180001837311
* OBT  deleted reads:
  BCM          WGS        253816
  BCM          SHOTGUN    151770
  BCCAGSC      CLONEEND    7510
  NISC          SHOTGUN    4757
  TIGR          CLONEEND    1577
  CENARGEN      WGS        599
  CENARGEN      CLONEEND    431
  TIGR_JCVIJTC  CLONEEND    377
  UIUC          CLONEEND    182
  BCM          CLONEEND    150
  BARC          CLONEEND    125
  UOKNOR        SHOTGUN    85
  total        .          421379
OBT  deleted reads:
          elem      >0        min        max        mean      med        n50        sum
  len      421379    421379    98        2974      862        927        968        363280405
  avgQual  421379    421379    1          57        28        24        36        11852865
== Overlapper ==
 
** 98.33% of the reads (34,761,786 out of 35,348,776 reads) had overlaps
** 1.66% of the reads had no overlaps
** 6.68% of the BCCAGSC.CLONEEND reads had no overlaps
** 4.95% of the TIGR_JCVIJTC.CLONEEND reads had no overlaps
** 3.48% of the TIGR.CLONEEND reads had no overlaps
** the median number of overlaps is 20
  Overlaps
          reads      min        max        mean      median    n50        sum
  qual    35348776  0          5592      106        20        769        3777789082
** the median number of overlaps for the BCM.WGS reads is 16
** the median number of overlaps for the BCM.SHOTGUN reads is 16 !!!
** the median number of overlaps for the NISC.SHOTGUN reads is 40 !!!
** the median number of overlaps for the BCM.CLONEEND reads is 16 !!!
[[Media:Bt.ovlStore.big.png]] , [[Media:Bt.ovlStore.small.png]]
== Unitigger ==
 
more 4-unitigger/bt.cga.0
UNITIG OVERLAP GRAPH INFORMATION
 
        5208738 : Total number of unitigs
        2527051 : Total number of singleton, contained unitigs
        1814842 : Total number of singleton, non-contained unitigs
        180910 : Total number of non-singleton, spanned unitigs
        685935 : Total number of non-singleton, non-spanned unitigs
      34927397 : Total number of fragments
      34927397 : Total number of fragments in all unitigs
      21521581 : Total number of essential fragments in all unitigs
      13405816 : Total number of contained fragments in all unitigs
  0.0076239952 : Randomly sampled fragment arrival rate per bp
    2510896132 : The sum of overhangs in all the unitigs
    6400342737 : Total number of bases in all unitigs
              0 : Estimated number of base pairs in the genome.
              0 : Total number of contained fragments not connected
                  by containment edges to essential fragments.
  Total rho    = 2510896132
  Total nfrags = 19143061
  Estimated genome length = 0
  Estimated global_fragment_arrival_rate=0.007624
  Computed global_fragment_arrival_rate =0.007624
  Total number of randomly sampled fragments in genome = 23326293
  Computed genome length  = 3059589120.000000
  Used global_fragment_arrival_rate=0.007624
  Used global_fragment_arrival_distance=131.164826
 
  Histogram of the number of base pairs in a chunk
  100292 - 159434:    22
  90010 -  99906:    25
  80043 -  89676:    73
  70013 -  79966:    162
  60010 -  69988:    389
  50008 -  59983:    977
  40000 -  49998:  2434
  30000 -  39997:  6458
  20000 -  29999:  18957
  10000 -  19999:  57442
  Unitigs >=10kb
              NewAsm          UMd2Asm
 
  Number      86,939          57,204
  Mean        19,464          15,140
  Sum        1,692.1Mb      866.0Mb
  max        159,434bp      78,570bp
  Contigs >=10Kb:
            NewAsm          UMd2Asm
  n        42,343          45,958     
  mean      59,856          55,473
  sum        2,534.5Mb        2,549.4Mb
  Contigs >=100Kb:
            NewAsm          UMd2Asm
  n          7,051            6,683       
  mean    163,170          162,357   
  sum        1,150.5Mb        1,085.0Mb
  max      627,705          742,802
  Scaffolds >=10Mb:
            NewAsm          UMd2Asm
  n            30                3
  mean      14.10Mb          11.36Mb
  sum      422.95Mb        340.70Mb
  max        26.54Mb          13.36Mb
== CGW & ECR ==
* Checkpoints:
  cat 7-0-CGW/bt.timing | grep ^Checkpoint
  Checkpoint 3 written during MergeScaffoldsAggressive at iteration 49
  Checkpoint 4 written during MergeScaffoldsAggressive at iteration 85
  Checkpoint 5 written after 1st Scaffold Merge
  Checkpoint 6 written after 2nd Aggressive Scaffold Merge
  Checkpoint 7 written after Final Rocks
  cat 7-2-CGW/bt.timing | grep ^Checkpoint
  Checkpoint 19 written during MergeScaffoldsAggressive at iteration 12
  Checkpoint 20 written during MergeScaffoldsAggressive at iteration 31
  Checkpoint 21 written after 1st Scaffold Merge
  Checkpoint 22 written after 2nd Aggressive Scaffold Merge
  Checkpoint 23 written after Final Rocks
  cat 7-4-CGW/bt.timing | grep ^Checkpoint
  Checkpoint 34 written during MergeScaffoldsAggressive at iteration 12
  Checkpoint 35 written during MergeScaffoldsAggressive at iteration 49
  Checkpoint 36 written after 1st Scaffold Merge
  Checkpoint 37 written during Stones CleanupScaffolds after scaffold 32436
  Checkpoint 38 written during Stones CleanupScaffolds after scaffold 34939
  Checkpoint 39 written after Stone Throwing and CleanupScaffolds
  Checkpoint 40 written after 2nd Aggressive Scaffold Merge
  Checkpoint 41 written after Final Rocks
Checkpoint 42 written after Partial Stones
Checkpoint 43 written after Final Contained Stones
Checkpoint 44 written after resolveSurrogates
* Get early CTG/SCF stats
  cat 7-CGW/bt.cgw_scaffolds | countMessages.pl
  ICL    451555  # ???
  ICP    116455  # CTG
  ISF    66141  # SCF
  ISL    711    # SLK
* Clear read extension:
                            elem      <0        0          >0        min        max        mean      median    n50        sum
  ClearORIG                35348776  4          0          35348772  -1147      1572      821        870        893        29045126663
 
  ClearQLT                  35348776  35348776  0          0          -1        -1        -1        -1        -1        -35348776
  ClearVEC                  35348776  299034    20323      35029419  -1        2043      952        953        975        33658445088
 
  ClearOBTINI              35348776  0          31254      35317522  0          1364      831        879        902        29394688367
  ClearOBT                  35348776  0          31254      35317522  0          1318      794        854        877        28068197440
 
  ClearECR1                35348776  0          31254      35317522  0          1329      794        854        877        28072014464
  ClearECR2                35348776  0          31254      35317522  0          1329      794        854        877        28072365712
  sum(ClearECR1)-sum(ClearUTG) = 3,817,024
  sum(ClearECR2)-sum(ClearECR1)= 351,248
* Scaffold length stats:
  cat 7-0-CGW/stat/final0.Scaffolds.nodelength.cgm | grep -v ^Sca | getSummary.pl -t 0 # 0,2,4
  ...
  step    scaff      min        max        mean      med        n50        sum           
  0        7048      2249      19719008  385020    21967      3114907    2713622175   
  2        4960      2249      21907006  540915    21181      4490171    2682939682
  4        4006      2391      26541374  668427    29193      4590744    2677722052
* Last cgw
  cat 7-4-CGW/stat/final0.*Scaffolds.nodelength.cgm | grep -v ^Scaff | getSummary.pl -t scf
  cat 7-4-CGW/stat/final0.PlacedContig.n | grep -v ^Scaff | getSummary.pl -t scf
            elem      min        max        mean      med        n50        sum           
  scf        66141      432        26541374  42648      1347      4349378    2820819506   
  ctg        120461    65        627705    22421      2018      84989      2700959854
== QC stats ==
* [[Media:Bos_taurus.qc|Bos_taurus.qc]] this assembly stats
* [[Media:Bos_taurus.qc.combine|Bos_taurus.qc.combine]] UMD2 vs this assembly stats
  TotalScaffolds=66,141
  MaxBasesInScaffolds=26,048,998
  MeanBasesInScaffolds=40,861
 
  TotalContigsInScaffolds=120,461
  MaxContigLength=627,911
  MeanContigLength=22,436
 
  TotalDegenContigs=269,031
  MaxDegenContig=33,824
 
  SingletonReads=3,721,123
* Posmap info
  cat bt.posmap.mates | awk '{print $3}' |count.pl -p 100
  good            10338164
  bothChaff      1160137
  oneChaff        695982
  oneSurrogate    233151
  bothDegen      218198
  diffScaffold    150423
  badShort        138464
  oneDegen        118232
  badLong        23196
  badSame        22451
  badOuttie      8751
  bothSurrogate  589
  total          13107738
  cat bt.posmap.frags | awk '{print $4,$5}' |count.pl  -p 100
  placed good            20676328
  placed notMated        8007072
  chaff bothChaff        2320274
  chaff notMated          704849
  placed oneChaff        695982
  chaff oneChaff          695982
  placed oneSurrogate    466302
  placed bothDegen        436396
  placed diffScaffold    300846
  placed badShort        276928
  placed oneDegen        236464
  placed badLong          46392
  placed badSame          44902
  placed badOuttie        17502
  placed bothSurrogate    1178
  total                  34927397
== Log files ==
* [[Media:Bt.runCA.log|Bt.runCA.log]]
* [[Media:Bt.runCA.hourly.runtimes|Bt.runCA.hourly.runtimes]] approximate running times (in hours)
== Analysis ==
=== Insert libraries ===
1. BCM.WGS : ok
* FRG.mea: 1750-7000
* ASM.mea: 1594-6727
* Most libs have > 1000 reads & get reestimated
* All libs have ASM.std< ASM.mea/3
2. BCM.SHOTGUN
* only ~ 50% of the inserts are in libs with >1000 inserts and get reestimated by the assembly
* if the thold is dropped from 1000 to 100, we'd get ~ 95% of the inserts reestimated
            elem      <0        0          >0        min        max        mean      median    n50        sum
  0          7339      0          0          7339      1          11237      245        135        1137      1799069
  100        4361      0          0          4361      100        11237      395        157        1252      1725604
  1000      440        0          0          440        1008      11237      2075      1791      2323      913086
3. NISC.SHOTGUN: ok
* Most libs have > 1000 reads & get reestimated
* All libs have ASM.std< ASM.mea/3
4. BCCAGSC.CLONEEND: ok
  LIB.id  FRG.mea FRG.std FRG.count  CENTER.TYPE        ASM.mea ASM.std
  125606  150000  30000  59505      BCCAGSC.CLONEEND  161998  20133
5. UIUC.CLONEEND: ok
  LIB.id  FRG.mea FRG.std FRG.count CENTER.TYPE        ASM.mea ASM.std
  114892  150000  30000  31063    UIUC.CLONEEND      175594  41208
  115020  150000  30000  15256    UIUC.CLONEEND      162488  26358
6. TIGR.CLONEEND: originally wrong; gets reestimated
  LIB.id  FRG.mea FRG.std FRG.count CENTER.TYPE        ASM.mea ASM.std
  65177    2000    600    27067    TIGR.CLONEEND      161761  34938
7. GSC.CLONEEND: not used (all 53556 are 0 qual)
8. CENARGEN.WGS: "not used" (all 26246 are unmated)
9. BARC.CLONEEND: each library contains 1 template id => inserts did not get reestimated  (25454 reads/11151 inserts)
10. BCM cloneend: ok
  LIB.id  FRG.mea FRG.std FRG.count CENTER.TYPE        ASM.mea ASM.std
  19070    167000  25000  7103      BCM.CLONEEND      171244  18555
11. CENARGEN.CLONEEND: large stdev
  LIB.id  FRG.mea FRG.std FRG.count CENTER.TYPE        ASM.mea ASM.std
  17249    202000  20200  6269      CENARGEN.CLONEEND  158938  55165
12. UOKNOR.SHOTGUN: ok ?
  LIB.id  FRG.mea FRG.std FRG.count CENTER.TYPE        ASM.mea ASM.std
  15158    3000    1000    4910      UOKNOR.SHOTGUN    3000    1000
13. TIGR_JCVI.CLONEEND: originally wrong; gets reestimated
  LIB.id  FRG.mea FRG.std FRG.count CENTER.TYPE        ASM.mea ASM.std
  10691    2500    750    2763      TIGR_JCVI.CLONEEND 160363  29580
  10738    2500    750    2040      TIGR_JCVI.CLONEEND 161915  29343
14. UOKNOR.FINISHING: only 151 reads
15. WUGSC.CLONEEND: only 49 reads
=== Contigs Vs UMD2 contaminants & Ecoli ===
  4865 contigs in list.exclude_contigs.fa
 
34404 exclude-ctg.qry_hits
  3763 exclude-ctg.ref_hits
 
  1204 exclude-ctg.CBE.qry_hits  CONTAIN|IDENTITY|BEGIN|END
  748 exclude-ctg.CBE.ref_hits  CONTAIN|IDENTITY|BEGIN|END
  559 Ecoli.365350-365744-ctg.qry_hits : max ctg aligned is 179K bp; 10 are > 10K bp
=== Contigs Vs UMD2 chromosomes ===
* Split 120,461 contigs into 100 files; degeneartes not split
* Align them to the 31 chromosomes 1..30,U (ref) => 101*31 jobs
  #Alignment stats
  cat chr*ctg*delta | grep "^>" | awk '{print $2}' | count.pl -f ../9-terminator/bt.ctg.infoseq | getSummary.pl -i 1 -z 1
  ctg        0          1          >1        min        max        mean      med        n50        sum
  120461    652        37540      82269      0          176        11        3          33        1422808
  #Unaligned ctg lengths
  ctg        min        max        mean      med        n50        sum
  652        65        5849      1146      1134      1194      747295
 
* 50% of the contigs aligned uniquely
  cat chr*-ctg*.delta | ~/bin/mergeDelta.pl  >  chr-ctg.delta
                                                              # degens?
  delta-filter -q  chr-ctg.delta              >>  chr-ctg.filter-q.delta
  cat chr1-*.delta | ~/bin/delta2cvg.pl -M 0 | getSummary.pl -i 4
  elem      0          >0        min        max        mean      med        n50        sum
  6681      1          6680      0          12892      366        142        1095      2450106
* There are disagreements:
  /fs/sz-user-supported/Linux-x86_64/bin/show-coords -l -r -H chr1-ctg.filter-q.delta  | p 'print $F[-1],"\n";' | count.pl | head
  ctg7180001761585        24
  ...
  ctg7180001634116        7
  ...
  show-coords -d chr1-ctg.filter-q.delta | grep ctg7180001761585 | p 'print "  $_";'
  142115744 142188863  |  383463  310345  |    73120    73119  |    99.98  | 157714772  383463  |    0.05    19.07  |  1 -1  chr1  ctg7180001761585
  142188878 142286012  |  310361  213223  |    97135    97139  |    99.94  | 157714772  383463  |    0.06    25.33  |  1 -1  chr1  ctg7180001761585
  142287100 142287675  |  212133  211556  |      576      578  |    98.27  | 157714772  383463  |    0.00    0.15  |  1 -1  chr1  ctg7180001761585
  142288052 142288602  |  211182  210633  |      551      550  |    99.09  | 157714772  383463  |    0.00    0.14  |  1 -1  chr1  ctg7180001761585
  142288652 142295709  |  210586  203531  |    7058    7056  |    99.87  | 157714772  383463  |    0.00    1.84  |  1 -1  chr1  ctg7180001761585
  142295709 142342174  |  203512  157047  |    46466    46466  |  100.00  | 157714772  383463  |    0.03    12.12  |  1 -1  chr1  ctg7180001761585
  142346440 142367791  |  156958  135606  |    21352    21353  |    99.99  | 157714772  383463  |    0.01    5.57  |  1 -1  chr1  ctg7180001761585
  142367822 142370681  |  135597  132737  |    2860    2861  |    99.93  | 157714772  383463  |    0.00    0.75  |  1 -1  chr1  ctg7180001761585
  142370660 142382289  |  132746  121117  |    11630    11630  |    99.88  | 157714772  383463  |    0.01    3.03  |  1 -1  chr1  ctg7180001761585
  142382282 142411927  |  120984    91339  |    29646    29646  |    99.96  | 157714772  383463  |    0.02    7.73  |  1 -1  chr1  ctg7180001761585
  142411941 142419553  |    91339    83728  |    7613    7612  |    99.66  | 157714772  383463  |    0.00    1.99  |  1 -1  chr1  ctg7180001761585
  142419553 142434546  |    83721    68728  |    14994    14994  |    99.79  | 157714772  383463  |    0.01    3.91  |  1 -1  chr1  ctg7180001761585
  142434506 142437288  |    68778    65996  |    2783    2783  |    99.86  | 157714772  383463  |    0.00    0.73  |  1 -1  chr1  ctg7180001761585
  142437389 142439015  |    66757    65131  |    1627    1627  |    99.94  | 157714772  383463  |    0.00    0.42  |  1 -1  chr1  ctg7180001761585
  142439271 142440703  |    65629    64197  |    1433    1433  |  100.00  | 157714772  383463  |    0.00    0.37  |  1 -1  chr1  ctg7180001761585
  142441869 142442975  |    63548    62442  |    1107    1107  |  100.00  | 157714772  383463  |    0.00    0.29  |  1 -1  chr1  ctg7180001761585
 
  142446690 142449325  |    30312    32945  |    2636    2634  |    99.58  | 157714772  383463  |    0.00    0.69  |  1  1  chr1  ctg7180001761585
  142451384 142452476  |    63510    64603  |    1093    1094  |    99.91  | 157714772  383463  |    0.00    0.29  |  1  1  chr1  ctg7180001761585
  142452577 142454379  |    61000    62806  |    1803    1807  |    99.78  | 157714772  383463  |    0.00    0.47  |  1  1  chr1  ctg7180001761585
  142454487 142456821  |    59122    61456  |    2335    2335  |  100.00  | 157714772  383463  |    0.00    0.61  |  1  1  chr1  ctg7180001761585
  142458383 142459582  |    57978    59177  |    1200    1200  |  100.00  | 157714772  383463  |    0.00    0.31  |  1  1  chr1  ctg7180001761585
  142459738 142472295  |    32272    44828  |    12558    12557  |    99.92  | 157714772  383463  |    0.01    3.27  |  1  1  chr1  ctg7180001761585
  142472300 142485640  |    44828    58163  |    13341    13336  |    99.89  | 157714772  383463  |    0.01    3.48  |  1  1  chr1  ctg7180001761585
 
  142501686 142530021  |    28336        1  |    28336    28336  |    99.99  | 157714772  383463  |    0.02    7.39  |  1 -1  chr1  ctg7180001761585
  show-coords -d chr1-ctg.filter-q.delta | grep ctg7180001634116
  116312  162914  |        1    46603  |    46603    46603  |    99.99  | 157714772  122722  |    0.03    37.97  |  1  1  chr1      ctg7180001634116
 
  164916  201988  |    58062    95135  |    37073    37074  |    99.99  | 157714772  122722  |    0.02    30.21  |  1  1  chr1      ctg7180001634116
 
  203244  213377  |    48198    58331  |    10134    10134  |  100.00  | 157714772  122722  |    0.01    8.26  |  1  1  chr1      ctg7180001634116
  261393  264506  |    45949    49062  |    3114    3114  |  100.00  | 157714772  122722  |    0.00    2.54  |  1  1  chr1      ctg7180001634116
 
  264607  268579  |    94345    98317  |    3973    3973  |  100.00  | 157714772  122722  |    0.00    3.24  |  1  1  chr1      ctg7180001634116
  268586  274734  |    98323  104471  |    6149    6149  |  100.00  | 157714772  122722  |    0.00    5.01  |  1  1  chr1      ctg7180001634116
  274835  293945  |  103611  122722  |    19111    19112  |    99.99  | 157714772  122722  |    0.01    15.57  |  1  1  chr1      ctg7180001634116
  ~/bin/delta2breaks.pl -m 200 < chr1-ctg.filter-q.delta | awk '{print $8}' | count.pl
  AGREEMENT      9827
  INVERSION      283
  TRANSLOCATION+  230
  TRANSLOCATION-  154
 
  ~/bin/delta2breaks.pl -m 1000 < chr1-ctg.filter-q.delta | awk '{print $8}' | count.pl
  AGREEMENT      7564
  INVERSION      216
  TRANSLOCATION+  192
  TRANSLOCATION-  127
 
  ~/bin/delta2breaks.pl -m 10000 < chr1-ctg.filter-q.delta | awk '{print $8}' | count.pl
  AGREEMENT      3394
  INVERSION      62
  TRANSLOCATION+  50
  TRANSLOCATION-  29
= Assembly UMD2.2 (Quality reads) =
* Try to add the missing BCM.SHOTGUN reads at the assembly
* Assign new BCM.SHOTGUN library ID's base on volume & SEQ_LIB_ID : same library might have different insert size in different volume => might loose some correct mates from different volumes
  cat bos_taurus.summary | grep BCM | grep SHOTG | cut -f6,7,8,10 | sort | more
  FAAEP  180000  13000  252
  FAAEP  2000    1000    84
  ...
  FAAHP  180000  13000  77
  FAAHP  2000    1000    230
  ...
* => 20,538 libraries out of which 18,208  contain mated reads
* create DST messages & add them to gkpStore
  gatekeeper -a -o bt.gkpStore -T -F  bos_taurus.BCM.SHOTGUN.new.DST
* generate gatekeeper edit file that maps each TI to the new library id
  head bos_taurus.BCM.SHOTGUN.new.ti2libinfo.edit
    frg uid 499507131 libuid 601081
    frg uid 499507132 libuid 601081
    ...
* generate gatekeeper edit file that deletes all mate information
  head bos_taurus.BCM.SHOTGUN.new.mate.delete
    frg uid 500086180 mateuid 0
    frg uid 500084310 mateuid 0
    ...
* pair forward/reverse read that have the same new library id, same TEMPLATE_ID
  head bos_taurus.BCM.SHOTGUN.new.mate.edit
    frg uid 583866821 mateuid 583872364
    frg uid 583866822 mateuid 583872408
    ...
* run gatekeeper --edit for each edit/delete file
  gatekeeper --edit ...  bt.gkpStore
* restart assembly at cgw (doExtendClearRanges=1)
* consensus after cgw failed on job 25 on CTG 5597062 : cannot create consensus from multialignment  ...
  Fix: delete failed message
  cp bt.cgw_contigs.25 bt.cgw_contigs.25.FAILED
  delete "{ICM acc:5597062 pla:P len:20889 ..." from bt.cgw_contigs.25
* terminator fail; message:
  ICL: reference before definition error for contig ID 5597062
= Assembly UMD2.3 (2009_0210_CA; all reads) =
<font color=red>
* 35,973,728 reads : 35,348,776 quality &  624,952 quality-less
* 16,896,244 mates
* 25,312 libraries
</font>
Issues (not solved):
* 10420 contain at least 1 "NN"  in their clr (50.. min(len,600))
* 5973 contain at least 1 "NNN"  in their clr (50.. min(len,600))
== Quality-less clrs ==
* 624,952 quality-less reads
* Quality-less read stats:  : alignment CLR or 50..min(len,600) trimming
            elem      min        max        mean      median    n50        sum
  len        624952    5          1495      887        947        961        554429198
  5          624952    6          1584      51        51        51        32150411
  3          624952    5          1495      695        699        699        434960697
  53        624952    -1579      1444      644        648        648        402810286
<font color=red>
* Align 624,952 to the 120,461 Assembly1 contigs (no degenerates) : 1 day on 13 cpus
* 572,140(91.5%) reads aligned and 52,812(8.5%) did not align to the contigs
</font>
1. Launch jobs in parallel: 12766 jobs on 13 processors
  nucmer -l 50 -c 200 -b 10 -g 5 -d 0.05 bt.ctg.001.fasta  bos_taurus.0qual.01.seq -p ctg.001-seq.01
  ...
  nucmer -l 50 -c 200 -b 10 -g 5 -d 0.05 bt.ctg.982.fasta  bos_taurus.0qual.13.seq -p ctg.001-seq.01
* CPU usage: 100% /job
* Max mem usage: 0.1% /job
2. Get maximum extended clrs
  cat *delta | ~/bin/delta2qryClr.pl -best | sort > bos_taurus.0qual.best.clr
  Length stats
            elem      min        max        mean      median    n50        sum
  all        624952    5          1495      887        947        961        554429198
  aligned    572140    221        1416      912        953        964        522281354
  unaligned  52812      5          1495      608        580        754        32147844
  Best/Max/Max+extended alignment coord stats:
            elem      min        max        mean      median    n50        sum
  53.best    572140    94        1208      766        841        877        438793102
  53.max    572140    170        1208      794        863        888        454816817
  53.extend  572140    170        1208      797        865        889        456014184
  Unaligned read counts:                         
                          unaligned    total  quality  quality-less
  BCM.WGS                  42595
  UOKNOR.SHOTGUN          5787        14651  2456      12195
  GSC.CLONEEND            2294        53521  0        53521
  BCCAGSC.CLONEEND        1869        125241  116484    8757
  BCM.SHOTGUN              186
  UOKNOR.FINISHING        81
* 52,812 quality-less unaligned reads to the contigs using less strict nucmer parameters:  -l 30 -c 50 -b 50 -g 50 -d 0.12
* 9,269  reads aligned at an average 92% identity (min 81% identity)  : not too good
3. Get reads without clrs: set their clr to maximum 50..600
  cp bos_taurus.0qual.extended.clr bos_taurus.0qual.clr
  difference.pl bos_taurus.0qual.infoseq bos_taurus.0qual.extended.clr | perl -ane '$three=600; $three=$F[1] if ($F[1]<600); print "$F[0] 50 $three\n";' >> bos_taurus.0qual.clr
== Quality clrs ==
* Use Assembly1 OBT clrs
* Delete reads deleted in the OBT process
== Gatekeeper ==
Load order:
# Add quality FRG :      "gatekeeper -T  -F ..."
# Add quality-less FRG  "gatekeeper    -F  -a ..."          # -T should be removed
# Delete quality FRG (deleted by UMD2.1 OBT)
# Add DST
# Add LKG
Edit
# Loads clrs
# Loads clvs
# Loads nonrandom info
== Meryl ==
Use Assembly1 kmer counts
== Overlapper ==
* Use 80/90 Assembly1 overlap results
* Rerun 10 overlap jobs
* 96.64% of the quality-less reads have overlaps (vs 98.33% of the quality reads)
                  reads      0ovl      1+ovl      min        max        mean      median    n50        sum
  0qual(all)      624831    20941      603890    0          4350      96        19        740        60494730        # 96.64%
  0qual(unaligned) 52691      15384      37307      0          3229      50        5          349        2655545          # 70.80%
== Unitigger ==
* More unitigs, more bases in unitigs
* Few of the longest unitigs got broken: Example 138,294(UMD2.3) vs  159,434(UMD2.1)
  UNITIG OVERLAP GRAPH INFORMATION
 
        5333434 : Total number of unitigs
        2595174 : Total number of singleton, contained unitigs
        1865473 : Total number of singleton, non-contained unitigs
        183693 : Total number of non-singleton, spanned unitigs
        689094 : Total number of non-singleton, non-spanned unitigs
      35551316 : Total number of fragments
      35551316 : Total number of fragments in all unitigs
      21830994 : Total number of essential fragments in all unitigs
      13720322 : Total number of contained fragments in all unitigs
  0.0077856472 : Randomly sampled fragment arrival rate per bp
    2514833413 : The sum of overhangs in all the unitigs
    6483064813 : Total number of bases in all unitigs
              0 : Estimated number of base pairs in the genome.
              0 : Total number of contained fragments not connected
                  by containment edges to essential fragments.
  Total rho    = 2514833413
  Total nfrags = 19579606
  Estimated genome length = 0
  Estimated global_fragment_arrival_rate=0.007786
  Computed global_fragment_arrival_rate =0.007786
  Total number of randomly sampled fragments in genome = 23870254
  Computed genome length  = 3065930496.000000
  Used global_fragment_arrival_rate=0.007786
  Used global_fragment_arrival_distance=128.441474
 
  Histogram of the number of base pairs in a chunk
  100406 - 138294:    21
  90330 -  99887:    23
  80042 -  89675:    79
  70014 -  79943:    169
  60002 -  69792:    374
  50000 -  59982:  1008
  40002 -  49995:  2440
  30001 -  39994:  6509
  20000 -  29999:  18989
  10000 -  19999:  57404
== Consensus after unitigger ==
Problems:
* job 120 executed partially (see bt_120.cgi_tmp); Solution: split into 3 parts, run separately, merge results
* failed on 19 unitigs (587..7447 bp)
  rm 5-consensus/*failed
  touch 5-consensus/consensus.success
== Cgw ==
* Failure 1 : because job 120 was run partially => missing mates
* Failure 2 : because of /5-consensus/FAILED/bt_???.cgi.failed => missing mates => delete 356 mates
  Error:
    ProcessFrags()-- WARNING!  fragiid=35973388,index=33600942 mateiid=35973363,index=0 -- MATE DOESN'T EXIST!
    cgw: Input_CGW.c:117: ProcessFrags: Assertion `err == 0' failed.
  Fix:
    cat cgw.out | grep MATE | p '/mateiid=(\d+)/; print $1,"\n";' >! cgw.out.mateiid
    gatekeeper -dumpfragments -tabular -iid cgw.out.mateiid bt.gkpStore/ | cut -f1,3 | ~/bin/mate2lkg.pl -a D >! cgw.out.delete.LKG
    gatekeeper -a -o bt.gkpStore -T -F -L cgw.out.delete.LKG
* Failure 3: because of cgwOutputIntermediate=1 
  Try to restart from ckp : die with assertion failure
  cgw -y -R 8 -N 12 -j 1 -k 5 -r 5 -s 2 -S 0 -z -m 100 -g ./bt.gkpStore -o ./7-0-CGW.8_12/bt ./5-consensus/bt_001.cgi
  cgw -y -R 8      -j 1 -k 5 -r 5 -s 2 -S 0 -z -m 100 -g ./bt.gkpStore -o ./7-0-CGW.8_12/bt ./5-consensus/bt_001.cgi
 
  Fix: Restart cgw from the beginning
* cgw does update bt.SeqStore - OpenSequenceDB()
== ECR (eventually skipped) ==
* Failed after running for 1 day
  /fs/szdevel/dpuiu/SourceForge/wgs-5.2/Linux-amd64/bin/extendClearRanges  -g ./bt.gkpStore  -n 15  -c bt  -b 146216 -e 167100  -i 1  > 7-1-ECR/extendClearRanges-scaffold.146216.err
  sh: line 1: 17016 Aborted
* Last ckp :  bt.ckp.15
* Try to fix:
  touch 7-1-ECR/cgw.success
  runCA "doExtendClearRanges = 1"
* Runs too slow !!!
* Can specify a scaffold range to process: -b ? -e ? => ckp files; could we merge them?
* Failed after running for 1 day
== Consensus after CGW ==
* Failed on job 56
  tail 8-consensus/bt.cns_contigs.56.err
  ...
  Could (really) not find overlap between 153923 (U) and 2508303 (R) estimated ahang: 0 (ejecting frag 2508303 from contig)
  consensus: math_AS.h:51: ceil_log2: Assertion `x > 0' failed.
  cat 7-CGW/bt.cgw_contigs.56  | countMessages.pl
  ICM    440
  IMP    281412
  IUP    12715
 
  cat 8-consensus/bt.cns_contigs.56_tmp | countMessages.pl
  ICM    115
  IMP    103322
  IMV    8122
  IUP    4849
* Fix: split ICM messages 1..115,116,116+ and run consensus on each set
== QC ==
            elem      min        max        mean      med        n50        sum           
  scf        56891      407        33129045  50871      1378      4716077    2894145150   
  ctg        122851    64        651167    21957      3647      71561      2697514858   
  deg        268237    65        30246      1019      985        997        273575106
* Compared with UMD2.1 : better scaffols, worse contigs & unitigs
== Analysis ==
Issues:
* Identify bacterial & mito contigs: [http://www.bch.umontreal.ca/ogmp/projects/other/genbank/AY676873.txt mito seq]
* Align ctg&degen to UMD2 chromosomes
** the chromosomes should have no 0cvg regions
** possible inversions, translocations (UMD2 used markers)
** if align breaks/indels, which assembly is correct?
= Assembly UMD2.4 (2004_0217_CA; All reads) =
<font color=red>
* 35,973,728 reads : 35,348,776 quality &  624,952 quality-less
* 16,896,244 mates
* 344 libraries
</font>
== Fix quality-less read clrs (N's) (temporary solution) ==
* 10420 contain at least 1 "NN"  in their clr (50.. min(len,600))
* 5973 contain at least 1 "NNN"  in their clr (50.. min(len,600))
Fix:
  frg2seq.pl < bos_taurus.0qual.frg > bos_taurus.0qual.seq
  fasta2qual.pl bos_taurus.0qual.seq > ! bos_taurus.0qual.qual
  lucy \
    -o bos_taurus.0qual.lucy.seq  bos_taurus.0qual.lucy.qual \
    -debug  bos_taurus.0qual.lucy.info \
    bos_taurus.0qual.seq bos_taurus.0qual.qual
  cat bos_taurus.0qual.lucy.info | cut -f1,3,4 -d ' ' | sort >! bos_taurus.0qual.lucy.clr
* 624,952 quality-less reads
* Quality-less read stats:  50..min(len,600) & lucy trimming
            elem      0          >0        min        max        mean      med        n50        sum
  5          624952    2857      622095    0          501        52        52        52        33012433
  3          624952    2857      622095    0          600        579        600        600        361980208
  53        624952    2857      622095    0          548        526        548        548        328967775
== Fix quality-less read clrs (low complexity) ==
* Run dust filter on seq (before qual & lucy)
            elem      0          >0        min        max        mean      med        n50        sum           
  5          624952    3564      621388    0          501        75        52        52        47385578     
  3          624952    3564      621388    0          600        554        600        600        346470473     
  53        624952    3564      621388    0          548        478        548        548        299084895
* Merge dust.lucy clrs with the alignment clrs
            elem      0          >0        min        max        mean      med        n50        sum
  5          624952    4488      620464    0          599        93        52        126        58378496
  3          624952    4488      620464    0          600        547        600        600        342258160
  53        624952    4488      620464    0          548        454        512        548        283879664
* Test seq
  gatekeeper -dumpfastaseq -b 35348777 -e  35973728 bt.gkpStore  | grep NNN
  gatekeeper -dumpfastaseq  bt.gkpStore  | perl -ane 'if(/^>(\d+)/) { $id=$1} elsif(/NNN/) { print $id,"\n";} ' | uniq -c | awk '{print $2,$1}'  > bt.NNN.seqs  # 2411 seqs (all have the N's "in the middle")
  gatekeeper -dumpfastaseq -uid bt.NNN.seqs bt.gkpStore >  bt.NNN.cseqs
== Consolidate libraries ==
<font color=red> Drop from 25,312 to 344 libs </font>
===  BCM.SHOTGUN ===
UMD2.4 reestimated 10,117 out of 13,826 libs (have > 100mates)
==== Base on initial estimates ====
* Reduce the total number from 13826 to 2 libs: 3000 & 6000
* UMD2.3 mean estimates (Initial vs Final):
  meanI      #libs      minF      maxF      meanF      medF      n50F      sumF          uid
  180000    436        1636      5199      2475      2410      2458      1079407        #3000
  167000    86        1585      2948      2264      2258      2285      194775        #3000       
 
  6500      31        5212      6636      5837      5867      5924      180951        #6000       
  6000      11        4556      6272      5389      5421      5421      59286          #6000
 
  3500      949        1670      4769      2668      2608      2645      2532027        #3000
  3000      2511      1483      5250      2715      2662      2723      6818678        #3000
  2000      6093      1157      6443      2526      2487      2554      15391160      #3000
==== Base on final estimates ====
* Reduce the total number from 13826 to 7 libs:  6500,5500,...1500, un-estimates (2501)
  meanF        #libs      min        max        mean      med        n50        sum        uid(new)    mean(new)  std(new)
  6K<=mea<7K  15        6010      6636      6176      6159      6159      92650      6500        6500
  5K<=mea<6K  29        5121      5985      5540      5536      5577      160673      5500        5500
  4K<=mea<5K  67        4017      4939      4284      4266      4274      287072      4500        4500
  3K<=mea<4K  1401      3000      3998      3276      3209      3226      4590323    3500        3500
  2K<=mea<3K  7998      2000      2999      2502      2498      2532      20017767    2500        2500
  1K<=mea<2K  607        1157      1999      1825      1882      1890      1107798    1500        1200
  un-estimated 3709                                                                          2501        2501
===  BARC.CLONEEND ===
Collapse all 11150 into 1:
uid:25456
mea:165000
std:43000
== Overlapper ==
* Quality-less reads overlaps: fewer than in the UMD2.3 assembly
                  elem      0          >0        min        max        mean      med        n50        sum         
  0qual(all)      624830    35692      589138    0          3237      60        14        439        37578899        # 94.39%
== Unitigger ==
  UNITIG OVERLAP GRAPH INFORMATION   
        5356408 : Total number of unitigs
        2613795 : Total number of singleton, contained unitigs
        1870448 : Total number of singleton, non-contained unitigs
        182878 : Total number of non-singleton, spanned unitigs
        689287 : Total number of non-singleton, non-spanned unitigs
      35547861 : Total number of fragments
      35547861 : Total number of fragments in all unitigs
      21685943 : Total number of essential fragments in all unitigs
      13861918 : Total number of contained fragments in all unitigs
  0.0077797328 : Randomly sampled fragment arrival rate per bp
    2513424271 : The sum of overhangs in all the unitigs
    6468428782 : Total number of bases in all unitigs
              0 : Estimated number of base pairs in the genome.
              0 : Total number of contained fragments not connected
                  by containment edges to essential fragments.
  Total rho    = 2513424271             
  Total nfrags = 19553770
  Estimated genome length = 0
  Estimated global_fragment_arrival_rate=0.007780
  Computed global_fragment_arrival_rate =0.007780   
  Total number of randomly sampled fragments in genome = 23868770
  Computed genome length  = 3068070656.000000         
  Used global_fragment_arrival_rate=0.007780           
  Used global_fragment_arrival_distance=128.539119
Histogram of the number of base pairs in a chunk
100292 - 138301:    19
  90052 -  99906:    23
  80043 -  89676:    79
  70013 -  79966:    164
  60010 -  69988:    390
  50008 -  59983:    949
  40000 -  49998:  2433
  30000 -  39997:  6437
  20000 -  29999:  18808
  10000 -  19999:  57634
== Bog ==
!!! Much bigger unitigs than default unitigger
Global Arrival Rate: 0.013829
212260 - 224992:      4
100099 - 186873:    372
  90015 -  99973:    353
  80045 -  89988:    582
  70011 -  79999:  1084
  60000 -  69994:  1856
  50001 -  59996:  3162
  40002 -  49994:  5407
  30000 -  39999:  9767
  20000 -  29996:  18981
  10000 -  19999:  39641
== Consensus after Unitigger ==
* Failed on jobs 120 & 121 ( _tmp file)
  cat 4-unitigger/*120*  | countMessages.pl
  IMP    280264
  IUM    124707
  cat 4-unitigger/*121* | countMessages.pl
  IMP    282146
  IUM    245650
  cat 5-consensus/bt_120.cgi_tmp  | countMessages.pl
  IMP    34348
  IUM    19222
 
  cat 5-consensus/bt_121.cgi_tmp | countMessages.pl
  IMP    51833
  IUM    16805
* Fix 120: split IUM messages
  extractfromfrgMSG.pl -b 0 -e 19222 bt_120.cgb.orig IUM >! bt_120.cgb &
  extractfromfrgMSG.pl -b 19222 bt_120.cgb.orig IUM >! bt_120.cgb &
* Fix 121: remove assertion in AS_CNS/MultiAlignment_CNS.c
  if(to <= from || to > ma_length-1){
    fprintf(stderr, "AbacusRefine range (to) invalid");
    //assert(0);
  }
== CGW ==
* Failed after Ckp3(7-0-CGW/bt.ckp.3; MergeScaffoldsAggressive 2nd itteration)
  CI extends beyond end of scaffold!
  offsetAEnd = 254204 offsetBEnd = 252250 scaffoldLength = 253268
  cgw: CIScaffoldT_Merge_CGW.c:307: InsertScaffoldContentsIntoScaffold: Assertion `0' failed.
* Last cgw
  Scaffold lengths:
  cat 7-4-CGW/stat/final0.*Scaffolds.nodelength.cgm | grep -v ^Scaff | getSummary.pl -t scf
  cat 7-4-CGW/stat/final0.PlacedContig.n | grep -v ^Scaff | getSummary.pl -t scf
            elem      min        max        mean      med        n50        sum           
  scf        45826      385        34263871  59591      1349      7059820    2730853790   
  ctg        96562      65        738899    27789      3657      93988      2683452359
  Library insert estimates:
  cat 7-4-CGW/stat/scaffold_final.distupdate.dst | grep ^# | awk '{print $3,int($8),int($10)}' > 7-4-CGW/bt.dst
  join2.pl bt.dst 7-4-CGW/bt.dst | p 'print join "\t",@F[0,1,2,5,6,3,4]; print "\n";' > bt.dst.combine
  CLONEEND inserts:
  UID          MEANI  STDI    MEANF  STDF    COUNT  LIB
  114892        150000  30000  175701  40732  31063  UIUC.CLONEEND
  19070        167000  25000  171349  18253  7103    BCM.CLONEEND
  118          167000  16700  167000  16700  21      WUGSC.CLONEEND
  25456        165000  43000  163044  25849  11150  BARC.CLONEEND
  115020        150000  30000  162719  25343  15256  UIUC.CLONEEND
  65177        2000    600    162540  34155  27067  TIGR.CLONEEND
  125606        150000  30000  162396  19319  59505  BCCAGSC.CLONEEND
  10738        2500    750    162386  27567  2040    TIGR_JCVIJTC.CLONEEND
  10691        2500    750    161540  28239  2763    TIGR_JCVIJTC.CLONEEND
  17249        202000  20200  157496  55375  6269    CENARGEN.CLONEEND
  54017        120000  12000  115671  27594  25889  GSC.CLONEEND
  total                                        188126  CLONEENDs
== Consensus ==
* Failed on job 34 with segmentation fault
* 9kbp contig, made out of 3007 reads (24 of which are quality-less)
  cat 7-CGW/bt.cgw_contigs.34.1 | grep "^{" | uniq -c | awk '{print $2,$1}'
  {ICM 1
  {IMP 3007
  {IUP 329
* Fix : edit AS_CNS/MultiAlignment_CNS.c; add
  if(!ungappedSequence->Elements) { ungappedSequence->numElements=0; }
  if(!ungappedQuality->Elements) { ungappedQuality->numElements=0; }
== Analysis ==
=== Contigs Vs possible contaminants  ===
* nucmer alignment parameters:  -l 40 -c 100 -b 10 -g 5 -d 0.05
* have to redo alignments using -maxmatch !!!
* file location:
  reference seqs:
    /nfshomes/dpuiu/db/Ecoli.365350-365744                    # Ecoli K12 region with most alignments (BCM WGS splice site)
    /nfshomes/dpuiu/db/Ecoli                                  # Ecoli K12 substrain MG1655  (NC_000913 ; 1st completed)
    /nfshomes/dpuiu/db/Ecoli.all                              # 22 Ecoli completed genomes ( + plasmids)
 
    /nfshomes/dpuiu/db/UniVec_Core                            # UniVec Core seqs
    /nfshomes/dpuiu/db/OtherVec                                # 100 other vector sequences identified by aligning UMD2.0 contaminants to GenBank; align also to 110 UniVec core using nucmer (params above)
    /nfshomes/dpuiu/db/bos_taurus.UMD2.contaminant.fasta      # 4813 whole contigs and 30329 contig regions identified by NCBI as UMD2 contamination
    /nfshomes/dpuiu/db/bos_taurus.UMD2.contaminant.organism_count  # organism counts: vector is the most abundant
 
    /nfshomes/dpuiu/db/bos_taurus.UMD2.contaminant.infoseq      # grep -v 'coli|vector|7180003101029' => 905 other contamiants
 
  query seqs:
    /scratch1/bos_taurus/Assembly/2009_0217_CA/9-terminator/ctg.split100/*fasta  # latest assembly contigs (no degenerates)
 
  delta files:
    /scratch1/bos_taurus/Assembly/2009_0217_CA/nucmer_ctg/no_maxmatch/*delta
Ecoli K12 substrains:
  NC_010473.1    4686137 50.78  Escherichia coli str. K-12 substr. DH10B, complete genome
  NC_000913.2    4639675 50.79  Escherichia coli str. K-12 substr. MG1655, complete genome
  AC_000091.1    4646332 50.80  Escherichia coli str. K-12 substr. W3110, complete genome
==== no maxmatch ====
* fewer alignments in UMD2.4 than in UMD2
UMD2 (all): just a few degens
  15102 Ecoli.365350-365744-ctg.qry_hits
  15943 Ecoli-ctg.qry_hits
  17308 Ecoli.all-ctg.qry_hits
  79065 UMD2.contaminant-ctg.qry_hits    # 55877 new hits
  20105 UMD2.contaminant-ctg.CBE.qry_hits # CONTAIN|BEGIN|END|IDENTITY
  19839 UniVec_Core-ctg.qry_hits
UMD2.4
    559 Ecoli.365350-365744-ctg.qry_hits
  1215 Ecoli-ctg.qry_hits
  2767 Ecoli.all-ctg.qry_hits            # most 2 frequenct starins are UMN026 & ATCC 8739; K12 DH10B is rank 5th; K12 MG1655 is ranked 19th (out 31 seqs)
  44112 UMD2.contaminant-ctg.qry_hits
  5286 UniVec_Core-ctg.qry_hits
Length of the reference seqs used for screening:
                          #seqs      min        max        mean      med        n50        sum
  Ecoli.365350-365744    1          395        395        395        395        395        395                  # Ecoli K12 regions with most alignments (BCM WGS splice site)
  Ecoli                  1          4639675    4639675    4639675    4639675    4639675    4639675              # Ecoli K12 substrain MG1655       
  Ecoli.all              49        3306      5572075    2293320    130440    5065741    112372708            # 22 Ecoli's
 
  UniVec_Core            1348      12        48551      243        98        967        327641         
  OtherVec                100        1702      739874    15419      5027      166744    1541984
  UMD2.contaminant        35142      48        16661      512        362        674        18022349           
Length of UMD2.4 contigs that contain contaminant (0+ bp from end):
                          #ctgs      <2000bp    >=2000bp  min        max        mean      med        n50        sum
  Ecoli.365350-365744-ctg 559        534        25        1001      179527    2467      1341      1894      1379440
  Ecoli-ctg              1215      1086      129        1001      360312    4326      1347      71372      5256540
  Ecoli.all-ctg          2767      2455      312*      1001      453627*    8031      1366      134516    22224468
  UniVec_Core-ctg        5286      4718      568*      882        651163*    9820      1337      136090    51909339
  UMD2.contaminant-ctg.CBE 4976      4410      566*      738        651163*    8497      1339      122281    42281715    #annotated alignments: CONTAIN|BEGIN|END|IDENTITY
  UMD2.contaminant-ctg    44112      12813      31299      268        739442    50591      27461      111598    2231701788
Length of UMD2.4 contigs that contain contaminant in the middle (500+ bp from end):
                          #ctgs      <2000bp    >=2000bp  min        max        mean      med        n50        sum
  Ecoli.365350-365744-ctg 144        136        8          1286      2053      1779      1811      1814      256259
  Ecoli-ctg              171        152        19        1286      4703      1835      1807      1821      313820
  Ecoli.all-ctg          197        160        37*(81)    1228      351373*    6516      1815      125069    1283728    #81  2K+ ctgs using -maxmatch
  UniVec_Core-ctg        1278      1110      168*(276)  1085      651163*    12266      1496      160336    15676765    #276 2K+ ctgs using -maxmatch
  UMD2.contaminant-ctg.CBE 52        25        27*        1249      351373*    22195      2054      125069    1154142    #annotated alignments: CONTAIN|BEGIN|END|IDENTITY
  UMD2.contaminant-ctg    31019      1437      29582      1113      739442    70665      50798      113684    2191986214
Length of the UMD2.4 contaminant seqeunece (0+ bp from end):
                            #align    <200bp    >=200bp      min      max        mean      med        n50        sum
  Ecoli.365350-365744-ctg    1066      537        529        104        225        192        162        224        205379
  Ecoli-ctg                  1793      587        1206      50        4440      496        224        994        889798
  Ecoli.all-ctg              4074      1132      2942*      40        17075*    380        254        441        1551783
  UniVec_Core-ctg            14425      9819      4606*      40        1801*      236        162        325        3409187
  UMD2.contaminant-ctg      144843    96008      48835      40        16661      199        169        209        28912002
Length of the UMD2.4 contaminant seqeunece (500+ bp from end)
                            alignm    <200bp    >=200bp    min        max        mean      med        n50        sum
  Ecoli.365350-365744-ctg    243        136        107        162        224        189        162        224        46000
  Ecoli-ctg                  273        149        124        106        1341      219        162        224        59923
  Ecoli.all-ctg              294        153        141*      106        2150*      251        162        224        73992
  UniVec_Core-ctg            2144      2035      109*      50        1340*      122        121        121        261821
  UMD2.contaminant-ctg      121331    86985      34346      40        2738      171        162        184        20753580
* Problem: 8 long ctgs contain Ecoli in the middle (1000+ bp from end)
show-coords Ecoli.all-ctg.filter-q.delta | ~/bin/filterQryCoords.pl -i 1000 | sort -nk13 -r
    [S1]    [E1]  |    [S2]    [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS]
===============================================================================================================================
4640589  4641890  |  161712  160411  |    1302    1302  |    99.46  |  4686137  351373  |    0.03    0.37  | gi|170079663|ref|NC_010473.1|      ctg7180001872124
5068908  5069620  |    87386    86679  |      713      708  |    98.88  |  5209548    91972  |    0.01    0.77  | gi|218687878|ref|NC_011745.1|      ctg7180002055226
3087480  3088683  |    50423    51620  |    1204    1198  |    99.00  |  5202090    88182  |    0.02    1.36  | gi|218703261|ref|NC_011751.1|      ctg7180002054092
4640580  4641890  |    19953    18646  |    1311    1308  |    99.08  |  4686137    31157  |    0.03    4.20  | gi|170079663|ref|NC_010473.1|      ctg7180001875158
  131462  133564  |    1247    3349  |    2103    2103  |    98.19  |  241387    5751  |    0.87    36.57  | gi|157412014|ref|NC_009838.1|      ctg7180002043242
  82801    83166  |    2986    2621  |      366      366  |    98.09  |  241387    4709  |    0.15    7.77  | gi|157412014|ref|NC_009838.1|      ctg7180001714551
  82264    82793  |    3523    2994  |      530      530  |    98.49  |  241387    4709  |    0.22    11.26  | gi|157412014|ref|NC_009838.1|      ctg7180001714551
1652253  1652545  |    1487    1195  |      293      293  |    98.63  |  4700560    2492  |    0.01    11.76  | gi|218552585|ref|NC_011741.1|      ctg7180001754941
* Regions present in DH10B but not MG1655
  delta2cvg -M 0 < DH10B-MG1655.delta
  gi|170079663|ref|NC_010473.1|  1349629 1378243 28614  0
  gi|170079663|ref|NC_010473.1|  1391006 1396986 5980    0
  gi|170079663|ref|NC_010473.1|  3199469 3200798 1329    0
  gi|170079663|ref|NC_010473.1|  3211928 3213257 1329    0
  gi|170079663|ref|NC_010473.1|  4640588 4641918 1330    0 !!!
* Problem: 10 long ctgs contain Vector in the middle (1000+ bp from end)
show-coords UniVec_Core-ctg.filter-q.delta | ~/bin/filterQryCoords.pl -i 1000 | sort -nk13 -r
    [S1]    [E1]  |    [S2]    [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS]
===============================================================================================================================
      1      121  |  215495  215615  |      121      121  |    99.17  |      170  271477  |    71.18    0.04  | gnl|uv|U09128.1:15891-16011-49    ctg7180002047604      # pSacBII P1 cloning vector 
    2252    2435  |    1334    1151  |      184      184  |  100.00  |    2485  160336  |    7.40    0.11  | gnl|uv|U75992.1:16925-19409        ctg7180001808271
    180      312  |    1153    1020  |      133      134  |    99.25  |      312  160336  |    42.63    0.08  | gnl|uv|NGB00145.1:2378-2689        ctg7180001808271
      1      121  |    1367    1487  |      121      121  |  100.00  |      170  160336  |    71.18    0.08  | gnl|uv|U09128.1:15891-16011-49    ctg7180001808271
      1      103  |    1286    1388  |      103      103  |  100.00  |      103  160336  |  100.00    0.06  | gnl|uv|U80929.2:11415-11517        ctg7180001808271        [CONTAINED]
      4      121  |    68269    68386  |      118      118  |  100.00  |      170  111913  |    69.41    0.11  | gnl|uv|U09128.1:15891-16011-49    ctg7180002052060
      40      152  |    30255    30142  |      113      114  |    99.12  |    1663    42854  |    6.79    0.27  | gnl|uv|U09128.1:1-1663            ctg7180002053344
      1      121  |    34358    34238  |      121      121  |  100.00  |      170    35471  |    71.18    0.34  | gnl|uv|U09128.1:15891-16011-49    ctg7180002046164
      1      103  |    34439    34337  |      103      103  |  100.00  |      103    35471  |  100.00    0.29  | gnl|uv|U80929.2:11415-11517        ctg7180002046164        [CONTAINED]
      46    1385  |    8928    10267  |    1340    1340  |  100.00  |    1413    17587  |    94.83    7.62  | gnl|uv|X65279.1:5941-7353          ctg7180002043597        [CONTAINED]
* ctg7180001872124 : 351373 bp; region 160411..161712 contaminated by Ecoli
  cat 9-terminator/bt.posmap.utgctg | grep 7180001872124 | wc -l  # 329
  cat 9-terminator/bt.posmap.utgctg | grep 7180001872124 | perl -ane '@F[2,3]=@F[3,2] if($F[2]>$F[3]); print $_ if($F[2]<160411 and 160411<$F[3] or $F[2]<161712 and 161712<$F[3]);'
  7180000441625  7180001872124  159483  161201  r
  7180000441788  7180001872124  160330  161329  f  #Ecoli
  7180000442730  7180001872124  160368  161010  r  #Ecoli
  7180000441635  7180001872124  160740  162700  f  #Ecoli
  cat 9-terminator/bt.utg.info
  utg7180000441625 length=1715 num_frags=12 Astat=7.00
  utg7180000441788 length=999 num_frags=1 Astat=0.00
  utg7180000442730 length=640 num_frags=1 Astat=0.00
  utg7180000441635 length=1957 num_frags=9 Astat=7.00
  cat 9-terminator/bt.posmap.frgctg | perl -ane '@F[2,3]=@F[3,2] if($F[2]>$F[3]); print $_ if($F[2]<160411 and 160411<$F[3] or $F[2]<161712 and 161712<$F[3]);'
  1237446426      7180001872124  160117  161201  f
  1238816835      7180001872124  160133  160993  f
  1238817728      7180001872124  160322  161123  r
  1244436200      7180001872124  159976  160984  f
  1238817676      7180001872124  160105  160890  r
  1237443253      7180001872124  160106  160900  f
  1237471027      7180001872124  159930  160928  f
  1238822613      7180001872124  159774  160782  f
  1238816875      7180001872124  159878  160728  f
  1244436248      7180001872124  159483  160553  f
  1238818306      7180001872124  159718  160489  f
  1238818332      7180001872124  159722  160483  f
  1237476824      7180001872124  160330  161329  f
  1238817689      7180001872124  160368  161010  r
  1237447135      7180001872124  160740  161768  r
  1237483546      7180001872124  160814  161790  r
  1237483530      7180001872124  160818  161856  r
  1237471108      7180001872124  161003  162009  f
  1238817744      7180001872124  161151  161978  f
  1237446441      7180001872124  161050  162107  f
  1244436201      7180001872124  161117  162164  f
  1237446407      7180001872124  161586  162699  r
  1237471055      7180001872124  161652  162700  r
  # 23 BCM SHOTGUN RP42 VVHNP reads (1369 read lib; 1341 of the reads in this ctg)
* ctg7180002047604 : Vctor in the middle
    [S1]    [E1]  |    [S2]    [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS]
===============================================================================================================================
      1      121  |  215495  215615  |      121      121  |    99.17  |      170  271477  |    71.18    0.04  | gnl|uv|U09128.1:15891-16011-49    ctg7180002047604      # pSacBII P1 cloning vector
  cat 9-terminator/bt.posmap.utgctg | grep 7180002047604  perl -ane '@F[2,3]=@F[3,2] if($F[2]>$F[3]); print $_ if($F[2]<215495 and 215495<$F[3] or $F[2]<215615 and 215615<$F[3]);'
  7180000441711  7180001872124  214458  219678  r
  cat 9-terminator/bt.posmap.frgctg | grep 7180002047604  | perl -ane '@F[2,3]=@F[3,2] if($F[2]>$F[3]); print $_ if($F[2]<215495 and 215495<$F[3] or $F[2]<215615 and 215615<$F[3]);'
  498776751      7180001872124  215425  216426  r
  1236502885      7180001872124  215514  216377  r
  379408823      7180001872124  215572  216340  f
  1244436224      7180001872124  215388  216405  f
  1237471071      7180001872124  215229  216234  r
  1233297450      7180001872124  215234  216046  f
  1233363357      7180001872124  215267  215687  f
  937200686      7180001872124  215300  216129  r
  937254901      7180001872124  215321  216160  f
  1233294025      7180001872124  215383  216204  r
  1237446444      7180001872124  215146  216187  f
  1232033776      7180001872124  215193  215996  r
  671976381      7180001872124  215035  216021  r
  514932286      7180001872124  215043  216008  f
  500723879      7180001872124  215043  215802  f
  671927656      7180001872124  215116  215733  r
  381173692      7180001872124  214947  215877  r
  1233303570      7180001872124  214963  215803  f
  1232037705      7180001872124  214990  215803  f
  490852264      7180001872124  214923  215843  f
  1237447184      7180001872124  214684  215646  f
  668822243      7180001872124  214586  215572  f
  #22 reads ; ~half come from BCM SHOTGUN RP42 VVFOP
==== maxmatch ctg ====
Parameters:
  nucmer -maxmatch -l 40 -c 100 -b 10 -g 5 -d 0.05 ...
  AllVec: UniVec_Core + 100 more vector seqs
Length of UMD2.4 contigs that contain contaminant (0+ bp from end):
                            elem      <2000      >2000      min        max        mean      med        n50        sum
  Ecoli.all                2951*      2602      349        1001      453627    8252      1367      132226    24352779
  UniVec_Core              5387*      4802      585        882        651163    9979      1334      136556    53760575
  OtherVec                  5657      5062      595        882        651163    9726      1320      136556    55021803
  UMD2.cont.other          3976      3430      546        804        651163    11217      1346      130385    44601117        #18 aligned to Acinetobacter; longest is 56467bp
Length of UMD2.4 contigs that contain contaminant (500+ bp from end):
                            elem      <2000      >2000      min        max        mean      med        n50        sum
  Ecoli.all                182        156        26*        1286      351373    6525      1811      125069    1187706          # 7*  are >5K;  321* come from multi-ctg scaffolds
  UniVec_Core              2532      2220      312*      1065      651163    10593      1481      128344    26821960        # 267* are >5K ; 655* come from multi-ctg scaffolds
  OtherVec                  376        323        53        1184      361749    13278      1508      139997    4992774
  UMD2.cont.other          ...
Length of UMD2.4 contigs that contain contaminant (1000+ bp from end):
                            elem      <2000      >2000      min        max        mean      med        n50        sum
  Ecoli.all                8          0          8*        4709      351373    73065      31157      351373    584520
  UniVec_Core              11        0          11*        2600      334933    93674      37847      271477    1030414
  OtherVec                  5          0          5*        3717      271477    131604    111913    228060    658021
  UMD2.cont.other          54        0          54*        2398      522682    110947    88182      189352    5991164
  total                    67* # 18 of them are CONTAINED by UMD2.0 chromosomes
Length of the UMD2.4 contaminant sequence (0+ bp from end):
                            elem      <200      >200      min        max        mean      med        n50        sum
  Ecoli.all                4775      1610      3165      39        17072      381        236        502        1823278
  UniVec_Core              16985      12380      4605      39        1800      207        132        300        3519080
  OtherVec                  7563      1372      6191      39        1800      509        548        643        3849567
  UMD2.cont.other          6626      343        6283      39        8228      543        573        615        3602329
==== maxmatch deg ====
All degenerates aligned are <2K
Length of UMD2.4 deg that contain contaminant (0+ bp from end):
                            elem      <2000      >2000      min        max        mean      med        n50        sum
  Ecoli.all                1266      1266      0          104        1611      783        833        869        991447
  UniVec_Core              1908      1908      0          147        1510      872        896        910        1664746
  OtherVec                  1963      1963      0          147        1510      872        898        911        1712703
  UMD2.cont.other          1609      1609      0          132        1611      852        892        914        1372106
==== maxmatch utg ====
Unitig stats:
  elem      <2000      >2000      min        max        mean      med        n50        sum           
  1707816    1434164    273652    21        138676    2228      937        8002      3805166508 
Parameters:
  nucmer -maxmatch -l 40 -c 100 -b 10 -g 5 -d 0.05 ...
Files:
  /scratch1/bos_taurus/Assembly/2009_0217_CA/nucmer_utg/
<font color=red>
Length of UMD2.4 unitigs that align to contaminants
                            elem      <2000      >2000      min        max        mean      med        n50        sum
  Ecoli.all                4275      4110      165        104        71709      1442      1212      1398      6166566 
  UniVec_Core              7563      7409      154        139        71709      1397      1182      1331      10570512
  OtherVec                  8208      8054      154        139        71709      1370      1159      1308      11248775
  UMD2.cont.other          6094      5849      245        132        53113      1546      1163      1401      9422951    #80 aligned to Acinetobacter; longest is 9114bp
  Contaminants(all above)  10264      9895      369        104        71709      1471      1148      1359      15107544
  Acinetobacter            2306**      0          2306      154        71709      1451      1316      1412      3347230  #2182 already in the Cont set
Length of UMD2.4 unitigs that have contaminants 500+bp from ends
                            elem      <2000      >2000      min        max        mean      med        n50        sum
  Ecoli.all                172        156        16        1286      4852      1820      1805      1815      313185
  UniVec_Core              2491      2422      69        1065      71709      1722      1457      1523      4291584
  OtherVec                  364        358        6          1167      71709      1795      1478      1538      653595
  UMD2.cont.other          156        108        48        1213      50248      5344      1838      17518      833673
Length of the UMD2.4 alignments of unitigs to contaminants (unique unitig regions)
                            elem      <200      >200      min        max        mean      med        n50        sum      reads(all unitig reads for unitgs with alignments>1K)
  Ecoli.all                5975      1686      4289      40        8184      397        268        542        2374366  12112(12142)
  UniVec_Core              8754      1674      7080      40        1801      474        490        645        4153030  26590(26849)
  OtherVec                  8919      1250      7669      40        1801      511        536        629        4562326  30268(30268)
  UMD2.cont.other          6752      896        5856      40        6012      529        555        651        3573528  25006(25328)
  Contaminants(all above)  10992      1396      9596      40        8184      571        573        684        6280759  40351(40699)
  Acinetobacter                                                                                                                    (8286)</font>
40699 reads aligned back to contaminants: nucmer -maxmatch
* 35919 align
* 34400 align 100+bp
* 27742 align 200+bp
* 14211 align 500+bp
==== utg 5'& 3' ====
Unitig stats:
              elem        <200      >200      min        max        mean      med        n50        sum           
utg          1,707,816    81200      1626616    21        138676    2228      937        8002      3805166508 
utg5'&3'    3,334,432    0          3334432    21        199        100        100        100        335263271
Align utg5'&3' to Ecoli.all using:
* nucmer -l 40 -c 100 -b 10 -g 5 -d 0.05 : 4,275 hits
* nucmer -l 20 -c 40 : 6,617 hits
* nucmer -l 20 -c 20 : 23,350
* blastall          : 2,895,506  out of 3,334,432 (86%) aligned
==== Acinetobacter contamination ====
Database:
  ~dpuiu/db/Acinetobacter.all : 7 complete genomes, 19 seqs
Seq len summary:
  elem      min        max        mean      med        n50        sum
  19        2726      4050513    1418094    28279      3904116    26943793
Align all unitigs to Acinetobacter.all; Longest alignments is 8517bp
show-coords Acinetobacter.all-utg.filter-q.delta | sort -nk8 -r | head
    [S1]    [E1]  |    [S2]    [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [GenBank]                          [UMD2.4 utg]
===============================================================================================================================
  20644    29164  |      62    8578  |    8521    8517  |    98.80  |    94413    8578  |    9.03    99.29  | gi|169786889|ref|NC_010404.1|      utg7180000281954*      [CONTAINS]
3395586  3401299  |    5712        1  |    5714    5712  |    99.79  |  3976747    8015  |    0.14    71.27  | gi|126640115|ref|NC_009085.1|      utg7180000212251
3400344  3404485  |        1    4142  |    4142    4142  |    99.66  |  3976747    9114  |    0.10    45.45  | gi|126640115|ref|NC_009085.1|      utg7180000277331
...
utg7180000281954*  -> ctg7180002053982  (28140bp; 78 unitigs)
grep 7180002053982  ../9-terminator/bt.posmap.utgctg | nl
    1  7180000185222  7180002053982  0      3019    f
    2  7180000314302  7180002053982  2151    5706    r
    3  7180001463328  7180002053982  2256    2869    f
    ...
    75  7180000281954*  7180002053982  17862  26442  r
    76  7180001471348  7180002053982  17886  18723  r
    77  7180001468075  7180002053982  17919  18732  f
    78  7180000280508  7180002053982  25672  28140  r
  show-coords UMD2.contaminant.other-ctg.filter-q.delta  | grep 7180002053982
    [S1]    [E1]  |    [S2]    [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [UMD2.0 contam]  [UMD2.4 ctg]
===============================================================================================================================
    1394    2508  |    28140    27024  |    1115    1117  |    98.75  |    7098    28140  |    15.71    3.97  | 7180003313366      ctg7180002053982
    2561    2871  |    26971    26661  |      311      311  |    97.11  |    7098    28140  |    4.38    1.11  | 7180003313366      ctg7180002053982
    2934    5670  |    26599    23862  |    2737    2738  |    97.99  |    7098    28140  |    38.56    9.73  | 7180003313366      ctg7180002053982
  ...gap... 
    5930    7098  |    17270    16101  |    1169    1170  |    98.46  |    7098    28140  |    16.47    4.16  | 7180003313366      ctg7180002053982
  ...gap... 
    281    1981  |    10335    8635  |    1701    1701  |    99.41  |    13090    28140  |    12.99    6.04  | 7180003320028      ctg7180002053982
    1992    2672  |    8635    7954  |      681      682  |    99.71  |    13090    28140  |    5.20    2.42  | 7180003320028      ctg7180002053982
    3302    5376  |    6719    4643  |    2075    2077  |    99.33  |    13090    28140  |    15.85    7.38  | 7180003320028      ctg7180002053982
    8469    9021  |    4642    4090  |      553      553  |    99.10  |    13090    28140  |    4.22    1.97  | 7180003320028      ctg7180002053982
    9038    9313  |    4073    3798  |      276      276  |    98.55  |    13090    28140  |    2.11    0.98  | 7180003320028      ctg7180002053982
    9780    13090  |    3331      19  |    3311    3313  |    99.19  |    13090    28140  |    25.29    11.77  | 7180003320028      ctg7180002053982
  grep 7180002053982 ../9-terminator/bt.posmap.utgctg | awk '{print $1,$4-$3+1}' | sed 's/^/utg/' >! ctg7180002053982.utgs
  intersect.pl UMD2.contaminant.other-utg.qry_hits ctg7180002053982.utgs | wc -l
  37 # only 37 out  of 78 unitigs were detected
  [http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Get&VIEW_RESULTS=FromRes&RID=VNC3Z5NU01R&UNIQ_OBJ_NAME=A_SearchResults_1LiCpb_3X0f_1Z58HQ4MV_23torn_22Zi4F&QUERY_INDEX=0 ctg7180002053982 is Acinetobacter]
= UMD2.5 (2004_0312_CA; delete  40699 contam reads & 22607 mates ) =
40699 reads:
* 25803 mated + 14896 unmated
* 6392 mated reads had the mate also contaminated
Location:
  /scratch1/bos_taurus/Assembly/2009_0312_CA
== UNITIGGER ==
UNITIG OVERLAP GRAPH INFORMATION
        5322910 : Total number of unitigs
        2595715 : Total number of singleton, contained unitigs
        1869655 : Total number of singleton, non-contained unitigs
        182193 : Total number of non-singleton, spanned unitigs
        675347 : Total number of non-singleton, non-spanned unitigs
      35507162 : Total number of fragments
      35507162 : Total number of fragments in all unitigs
      21641007 : Total number of essential fragments in all unitigs
      13866155 : Total number of contained fragments in all unitigs
  0.0077909501 : Randomly sampled fragment arrival rate per bp
    2511009753 : The sum of overhangs in all the unitigs
    6442095933 : Total number of bases in all unitigs
              0 : Estimated number of base pairs in the genome.
              0 : Total number of contained fragments not connected
                  by containment edges to essential fragments.
Total rho    = 2511009753
Total nfrags = 19563152
Estimated genome length = 0
Estimated global_fragment_arrival_rate=0.007791
Computed global_fragment_arrival_rate =0.007791
Total number of randomly sampled fragments in genome = 23866135
Computed genome length  = 3063315200.000000
Used global_fragment_arrival_rate=0.007791
Used global_fragment_arrival_distance=128.354050
Histogram of the number of base pairs in a chunk
100292 - 138301:    22    # 19 in UMD2.4
  90020 -  99906:    28    # 23
  80043 -  89676:    90    # 79
  70013 -  79966:    190    # 164
  60010 -  69988:    423
  50008 -  59983:  1016
  40000 -  49998:  2558
  30000 -  39997:  6660
  20000 -  29999:  18927
  10000 -  19999:  57057
== CONSENSUS after CGW ==
<font color=red>
* failed on job 80 : ctg 5706539, len=180,024, 159 unitigs, 1,851 reads
</font>
  head 80/bt.cns_contigs.80.failed
  {ICM
  acc:5706539
  pla:P
  len:180024
  cns:
  .
  qlt:
  .
  for:0
  npc:1851
  more  ../9-terminator/bt.asm
  ...
  {CCO
  acc:(7180002022380*,5706539)
  pla:P
  len:180024
  cns:
  NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
  NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
  ...
  cat 80/bt.cns_contigs.80.failed | countMessages.pl 
  ICM    1
  IUP    159  #unitigs
  IMP    1851  #reads : 1264 are BCM.WGS, 389 are BCM.SHOTGUN ...
  # 254 contig scaffold
  cat ../9-terminator/bt.posmap.ctgscf | grep 7180002041301 | nl
  1    7180002022042  7180002041301  0      14121  f
  ...
  183  7180002022380*  7180002041301  9164814 9344838 f
  ..
  254  7180002022240  7180002041301  12874874        12893460        f
  # the 5 UMD2.4 contigs below have the same number of reads with the ones that matched => CONTAINED
  cat UMD2.4-7180002022380.posma  p.frgctg | grep -v ^$ | awk '{print $2}' | uniq -c
    #reads ctgid
      6 7180001712307
    209 7180002028662
      6 7180002028663
  1552 7180002028664
    21 7180002032323
  1794 total => 1851-1794=57 additional reads
  cat ../../2009_0217_CA/9-terminator/bt.posmap.ctgscf | nl | egrep '7180001712307|7180002028662|7180002028663|7180002028664|7180002032323'
  #nl    ctgid          scfid          start  end    dir       
  34791  7180002028662  7180002069912  1425623 1446022 f
  34793  7180002032323  7180002069912  1448133 1450475 f
  34794  7180002028663  7180002069912  1450495 1451918 f
  34795  7180002028664  7180002069912  1452234 1602973 f
  64110  7180001712307  7180002071598  0      1441    f
* Solution 1:
  * consensus -Dforceunitigabut => new assembly, new UID's
  ctg7180002022636 179505 38.53    # => ctg7180002022380
  scf7180002041557 12892941 40.58  # => scf7180002041301
* Solution 2:
  * Reassemble 1851 reads ; clr=ECR2; doOBT=no
  * Asm dir:
    /scratch1/bos_taurus/Assembly/2009_0312_CA/8-consensus/80.ECR2.asm
  * It contains one 179,530 bp scaffold that has two contigs.
  * One contig is 156,349 bp and the other one is 23,181 bp.
  * The estimated gap between them is 231 bp.
  show-coords ctg7180002022636-80.ECR2.filter-r.delta
      1  156326  |  1  156331  | 156326 156331  | 99.99  | 179505  156349 |    87.09    99.99  | ctg7180002022636  ctg7180000000103        [CONTAINS]
  156345  179505  |  21    23181  |  23161  23161  | 99.99  | 179505  23181  |    12.90    99.91  | ctg7180002022636  ctg7180000000104        [CONTAINS]
  >ctg7180002022636_156327_156344
  TTGTAAAAACCATCCCCT
  # ~ 20 bp unaligned on ctg7180002022636 & Chr1
  show-coords  ctg7180002022636-Chr1.filter-r.delta  | more
  ...
  151839  156326  | 61124826  61120339* |    4488    4488  |    99.91  |  179505 157590899  |    2.50    0.00  | ctg7180002022636  Chr1
  156347  157501  | 61111219* 61110064  |    1155    1156  |    99.91  |  179505 157590899  |    0.64    0.00  | ctg7180002022636  Chr1
  ...
  # 2 UMD2.0 ctg & 2 UMD2.0 deg in this region
  more Chr1.agp
  ...
  Chr1    61110064        61111482        3579    W      deg0003139347  1      1419    +
  Chr1    61111483        61114130        3580    N      2648    fragment        yes
  Chr1    61114131        61115490        3581    W      deg0002967451  1      1360    +
  Chr1    61115491        61118114        3582    N      2624    fragment        yes
  Chr1    61118115        61120117        3583    W      7180002846553  1      2003    +
  Chr1    61120118        61120217        3584    U      100    fragment        yes
  Chr1    61120218        61145567        3585    W      7180003318962  1      25350  +
  ...
== QC ==
<font color=red>
Lengths:
              elem      <2000      >=2000    min        max        mean      med        n50        sum
  scf        39978*    31311      8667      316        34167202  68129      1360      8217662    2723691675
  ctg        90135*    36140      53995      65        1160130    29693      5124      95988      2676390147
  deg        251413    249285    2128      65        39964      1003      984        994        252279234
  utg        1689033    1419729    269304    21        138676    2242      936        8213      3788090224
              elem      <0        0          >0        min        max        mean      med        n50        sum
  gaps(ca2scf) 50157      10759      3296      36102      -20        177144    929        20        34357      46620040
  gaps(posmap) 50157      0          0          50157      20        177144    943        20        34065      47301528</font>
Fragment happiness:
  placed good      27263138
  chaff  bothChaff  2467462
  placed notMated  2050084
  placed oneChaff  732517
  chaff  oneChaff  732517
  placed oneSurrogate    555510
  placed bothDegen        465114
  chaff  notMated        434052
  placed diffScaffold    369,294 *
  placed oneDegen        213,768 *
  placed badSame          96862
  placed badLong          76640
  placed badOuttie        41142
  placed badShort        5784
  placed bothSurrogate    3278
Mate happiness:
  good            13631569
  bothChaff      1233731
  oneChaff        732517
  oneSurrogate    277755
  bothDegen      232557
  diffScaffold    184647
  oneDegen        106884
  badSame        48431
  badLong        38320
  badOuttie      20571
  badShort        2892
  bothSurrogate  1639
Scaffold zero read/mate cvg regions:
                        elem      <2000      >2000      min        max        mean      med        n50        sum
  read                  57011      55048      1963      1          177144    913        57        29302      52084484
  mate                  10507      8945      1562      1          30014      996        493        2367      10466518
Scaffold 10K+ zero read/mate cvg regions (2K+ inside) (some might be a result of surrogates?):
                        elem      <2000      >2000      min        max        mean      med        n50        sum
  read                  51747      49878      1869      1          177144    958        38        32625      49613432
  mate                  1290      201        1089      15        30014      3560      3047      4017      4593586
  mate(ignore seq len)  2599      1051      1548      1          72706      3321      2541      5513      8633691
== Contaminant search  ==
=== ctg ===
                elem      <0        >0        min        max        mean      med        n50        sum
  ctg          90135      0          90135      65        1160130    29693      5124      95988      2676390147
  nucmer -maxmatch -l 40 -c 100 -b 10 -g 5 -d 0.05 ...
  #OLD table
                  elem      <2000      >2000      min        max        mean      med        n50        sum
  Ecoli.all        71        66        5          1006      129770    4180      1127      45234      296830
  UniVec_Core      120        111        9          1000      426540    7366      1130      100944    884023
  OtherVec        121        112        9          1000      426540    7314      1127      100944    885022
  UMD2.cont.other  98(-3)    83        15(-3)    1000      426540    13523      1190      199700    1325332  # 3 are 1000bp+ from ctg ends; these are actually "fake" contaminants
                                                                                                                    # 7 are Acinetobacter baumannii min=1059 , max=9765
  total            152*(-3)  133        19*(-3)    1000      426540    10649      1150      199700    1618735
 
  Acinetobacter    65        53        12        1013      44359      2586      1212      3847      168130  # 46 out of 65 are in the 152* set; 19 are new; 13 have lots of alignments to other contigs (probably fake contaminants)
  total(new)      171*(-3)  144        27*(-3)    1000      426540    10013      1189      129770    1712376  # 65 are Acinetobacter and should be removed
  cat UMD2.contaminant.other-ctg.filter-q.coords | grep Acinetobacter
                                                                                                                    UMD2.0                    UMD2.5
      1      285  |      285        1  |      285      285  |    99.65  |      287    8096  |    99.30    3.52  | 7180003292866_1_288        ctg7180002015457        [CONTAINED]    Acinetobacter baumannii
    1422    2500  |    1078        1  |    1079    1078  |    99.63  |    7098    1078  |    15.20  100.00  | 7180003313366              ctg7180001706852        [CONTAINS]      Acinetobacter baumannii
    2934    3940  |    1008        1  |    1007    1008  |    98.61  |    7098    1059  |    14.19    95.18  | 7180003313366              ctg7180001709709                        Acinetobacter baumannii
    6281    7098  |        1      818  |      818      818  |    99.76  |    7098    1553  |    11.52    52.67  | 7180003313366              ctg7180001716052        [END]          Acinetobacter baumannii
      1      790  |      790        1  |      790      790  |  100.00  |    1822    9765  |    43.36    8.09  | 7180003319195_8956_10778  ctg7180002015485        [BEGIN]        Acinetobacter calcoaceticus
    285    1981  |        1    1697  |    1697    1697  |    99.59  |    13090    1856  |    12.96    91.43  | 7180003320028              ctg7180001706656*                      Acinetobacter baumannii
    1992    2148  |    1697    1856  |      157      160  |    98.12  |    13090    1856  |    1.20    8.62  | 7180003320028              ctg7180001706656*                      Acinetobacter baumannii
  12210    13090  |      63      943  |      881      881  |    99.89  |    13090    2556  |    6.73    34.47  | 7180003320028              ctg7180002007423                        Acinetobacter baumannii
  # 7 Acinetobacter baumannii ctgs
  # no Serratia "best hits"
  # 3 mitochondrion ctgs, all < 2Kbp
<font color=red>
Delete summary: 65 Acinetobacter ctgs + 91 contaminant ctgs <2000bp => 156 ctgs , 152 scf => 4105 reads
  ctgs      <2000      >2000      min        max        mean      med        n50        sum      reads
  156        144        12        1000      44359      1782      1150      1483      278009    4105        # all ctg
Trim summary: 12 contigs >=2000bp & 44 reads that overlap at least 10bp
  ctgs      <2000      >2000      min        max        mean      med        n50        sum        reads
  12        0          12        7072      426540    78470      45234      129770    941646    ?          # all ctg
  12        12        0          172        935        532        618        750        6393      44          # ctg regions
Files
/scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_ctg/TO_DELETE/ctg.delete.uid
/scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_ctg/TO_DELETE/scf.delete.uid
/scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_ctg/TO_TRIM/ctg.trim.uid
</font>
=== ctg 5'&3' ===
                elem      <0        >0        min        max        mean      med        n50        sum
  ctg53        180044    0          180044    65        598        300        300        300        54033229
  nucmer -maxmatch -l 17 -c 35 ...
                  #ctgEnds  #ctgs      min        max        mean      med        n50        sum       
  Ecoli.all        180        149        300        300        300        300        300        54000
  UniVec_Core      312        277        300        300        300        300        300        93600
  OtherVec        1211      1167      300        553        300        300        300        363989
  UMD2.cont.other  15689      14693      257        598        300        300        300        4712162
=== deg ===
  nucmer -maxmatch -l 40 -c 100 -b 10 -g 5 -d 0.05 ...
                  elem      <2000      >2000      min        max        mean      med        n50        sum
  Ecoli.all        387        387        0          131        1099      756        806        835        292892
  UniVec_Core      569        569        0          101        1115      763        822        843        434400
  OtherVec        579        579        0          101        1115      752        819        840        435549
  UMD2.cont.other  539        539        0          131        1483      792        838        873        427408
                 
  total            810*      810        0*        101        1483      784        838        869        63547
== Scaffolds vs UMD2.0 chromosome alignments ==
Directory:
  /scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_scf
Depening on the ref/qry seq and nucmer parameters, the number of unaligned gaps in  UMD2.0 can vary between:
  101M: REF=Chr,      QRY=scf,    nucmer -l 100 -c 500
  6M:  REF=ChrPlaced, QRY=scf-deg, nucmer -maxmatch -l 50 -c 250
=== nucmer -l 100 -c 500 ===
Chr-scf.summary
                                          elem      <2000      >2000      min        max        mean      med        n50        sum
  Chr-scf.qry_hits                        32901      24546      8355      723        34167202  82494      1405      8217662    2714164100
  Chr-scf.qry_nohits                      7077      6765      312        316        12006      1346      1239      1291      9527056
  Chr-scf.10K.qry_hits2+                  574        0          574        10308      34167202  4006753    1887309    9586144    2299876795
  Chr-scf.0cvg                            144712    125933    18779      1          102265    900        178        2968      130248709
  Chr-scf.0cvg.clean                      148556    143283    5273      1          39625      683        280        1363      101526883(101M)
Chr-scf-deg.summary
                                          elem      <2000      >2000      min        max        mean      med        n50        sum
  Chr-scf-deg.qry_hits                    210225    199781    10444      501        34167202  13785      1007      7328685    2898141592
  Chr-scf-deg.qry_nohits                  81166      80815      351        65        12006      958        972        989        77828798
  Chr-scf-deg.10K.qry_hits2+              574        0          574        10308      34167202  4006753    1887309    9586144    2299876795
  Chr-scf-deg.0cvg                        175952    168553    7399      1          22067      445        120        1329      78433265
  Chr-scf-deg.0cvg.clean                  133809    132381    1428      1          20512      371        124        1101      49711440(49M)
ChrPlaced-scf.summary
                                          elem      <2000      >2000      min        max        mean      med        n50        sum
  ChrPlaced-scf.qry_hits                  19488      13112      6376      723        34167202  137773    1527      8428844    2684927057
  ChrPlaced-scf.qry_nohits                20490      18199      2291      316        192648    1891      1276      1569      38764099
  ChrPlaced-scf.10K.qry_hits2+            139        0          139        10486      31959312  6786671    4979278    12956086  943347316
  ChrPlaced-scf.0cvg                      76271      71816      4455      1          102265    568        179        1710      43339413
  ChrPlaced-scf.0cvg.clean                72865      70541      2324      1          39625      356        94        1413      25951987(25M)
ChrPlaced-scf-deg.summary
                                          elem      <2000      >2000      min        max        mean      med        n50        sum
  ChrPlaced-scf-deg.qry_hits              130125    122000    8125      501        34167202  21530      1009      7515049    2801670853
  ChrPlaced-scf-deg.qry_nohits            161266    158596    2670      65        192648    1080      987        1012      174299537
  ChrPlaced-scf-deg.10K.qry_hits2+        139        0          139        10486      31959312  6786671    4979278    12956086  943347316
  ChrPlaced-scf-deg.0cvg                  79041      76374      2667      1          22067      395        157        948        31251753
  ChrPlaced-scf-deg.0cvg.clean            69012      68271      741        1          20512      200        81        592        13864328(13M)
=== nucmer -maxmatch -l 100 -c 500 ===
Dir:
  /scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_scf.2
ChrPlaced-scf-deg.summary
                                          elem      <2000      >2000      min        max        mean      med        n50        sum
  ChrPlaced-scf-deg.qry_hits              130510    122377    8133      501        34167202  21470      1009      7515049    2802100771
  ChrPlaced-scf-deg.qry_nohits            160881    158219    2662      65        192648    1080      986        1012      173869619
  ChrPlaced-scf-deg.10K.qry_hits2+        120        0          120        20022      31959312  7587296    5639522    13010806  910475551
  ChrPlaced-scf-deg.0cvg                  82159      80425      1734      1          7002      321        145        647        26444796
  ChrPlaced-scf-deg.0cvg.clean            111645    111546    99        1          6248      81        13        272        9057424(9M)
=== nucmer -maxmatch -l 50 -c 250 ===
Dir:
  /scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_scf.3
ChrPlaced-scf-deg.summary
                                          elem      <2000      >2000      min        max        mean      med        n50        sum
  ChrPlaced-scf-deg.qry_hits              204673    195653    9020      251        34167202  14088      1005      7329288    2883625712
  ChrPlaced-scf-deg.qry_nohits            86718      84943      1775      65        192648    1064      970        1007      92344678
  ChrPlaced-scf-deg.10K.qry_hits2+        148        0          148        10486      31959312  6814292    5135095    12792673  1008515300
  ChrPlaced-scf-deg.0cvg                  86085      84614      1471      1          4565      279        123        557        24101912
  ChrPlaced-scf-deg.0cvg.clean            113796    113791    5          1          2822      59        7          176        6714556(6M)
=== nucmer -maxmatch -l 50 -c 250 ; delta-fileter -q ===
Dir:
  /scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_scf.3
  ChrPlaced-scf-deg.filter-q.summary
                                          elem      <2000      >2000      min        max        mean      med        n50        sum
  ChrPlaced-scf-deg.qry_hits              204673    195653    9020      251        34167202  14088      1005      7329288    2883625712
  ChrPlaced-scf-deg.qry_nohits            86718      84943      1775      65        192648    1064      970        1007      92344678
  ChrPlaced-scf-deg.10K.qry_hits2+        118        0          118        20022      31959312  7633834    5639522    13010806  900792422
  ChrPlaced-scf-deg.0cvg                  77864      73686      4178      1          28711      529        181        1523      41240419
  ChrPlaced-scf-deg.0cvg.clean            74172      72150      2022      1          28331*    321        89        1415      23852952(23M)
Max gap is 28331; Duplicate region in UMD2.0?
  ChrPlaced-scf-deg.coords
    70739616 70767946  |  2993389  2965048  |    28331    28342  |    99.17  | 85187327 19514159  |    0.03    0.15  | Chr15      scf7180002041107
    70768054 70808299  |  3005305  2965048  |    40246    40258  |    99.56  | 85187327 19514159  |    0.05    0.21  | Chr15      scf7180002041107
  =>
  ChrPlaced-scf-deg.filter-q.coords
    70768054 70808299  |  3005305  2965048  |    40246    40258  |    99.56  | 85187327 19514159  |    0.05    0.21  | Chr15      scf7180002041107
== Markers  ==
ALL:
  head /fs/szasmg3/bos_taurus/UMD_Freeze2.5/markers/markers_contigs_Art.txt
  '''Marker        Chr_BTA Pos(K)  Pos_from Pos_to UMD_Ctg_Pos    Match_Len      %IDY    %Match  UMD_Ctg_name'''
  BZ945871      1      47501  1      95001  7622            515            100.00  99.61  ctg7180002007845
  BZ953651      1      80001  47501  112501  10786          700            99.57  100.00  ctg7180002026484
  CC504788      1      118751  80001  157501  54583          862            100.00  100.00  ctg7180002026483
  CC484491      1      123751  90001  157501  50169          77              98.72  100.00  ctg7180002026482
  CZ415082      1      125001  92501  157501  75850          507            99.21  99.80  ctg7180002026483
  CC475154      1      130001  97501  162501  40013          666            99.25  100.00  ctg7180002026482
  CC561114      1      182501  145001  220001  1130            709            99.02  100.00  ctg7180002026482
  CC578374      1      190001  155001  225001  170145          647            100.00  100.00  ctg7180002026481
  BZ911787      1      278751  232501  325001  na              na              na      na      na
  ...
* 126,014 markers & 90,135 ctgs total
* 107,271 markers align to 31,407 ctg & 2640 scf:
** 85% of the markers align to 85% of the ctg sequence
** avg distance between markers is 25Kbp
* 188 questionable ctgs & 219 questionable scf (2 out of 4 disagreeing markers)
UNIQ:
  head /fs/szasmg3/bos_taurus/UMD_Freeze2.5/markers/markers_contigs_Art.unique_only.txt | p 'print "  ",$_;'
  '''Marker Chr_BTA Pos(Kbp) CI_Pos_from CI_Pos_to UMD_Scaff_Pos Match_Len %IDY %Matched UMD_Scaff_name'''
  BZ945871 1 52251 1 104501 na na na na na
  BZ953651 1 88001 52251 123751 na na na na na
  CC504788 1 130626 88001 173251 54583 862 100.00 100.00 ctg7180002026483
  CC484491 1 136126 99001 173251 50169 77 98.72 100.00 ctg7180002026482
  CZ415082 1 137501 101751 173251 na na na na na
  CC475154 1 143001 107251 178751 40013 666 99.25 100.00 ctg7180002026482
  CC561114 1 200751 159501 242001 1130 709 99.02 100.00 ctg7180002026482
  CC578374 1 209001 170501 247501 170145 647 100.00 100.00 ctg7180002026481
  BZ911787 1 306626 255751 357501 na na na na na
  ...
* 93,508 markers align to 28,752 ctgs & 1,476 scf:
* 109 questionable ctgs & 153 questionable scf (2 out of 4 disagreeing markers)
----
Scripts:
  ~/bin/marker2pos.pl markers_contigs_Art.unique_only.txt | sed 's/ctg//' |  sort -nk1 -nk2  > markers_ctg.pos
  ~/bin/translatePosMap.pl markers_ctg.pos bt.posmap.ctgscf | ~/bin/tab2tab.pl > markers_scf.pos
Ctg vs markers summary:
                                        #ctg      <10000    >10000    min        max        mean      med        n50        sum        file
  ctg (all)                              90135*    51024      39111      65        1160130    29693      5124      95988      2676390147
  no markers                            58728      48324      10404      65        322949*    6573      1597      21989      386064754
 
  markers from 1+ Chr                    31407      2700      28707      442        1160130    72924      52693      111252    2290325393  markers_ctg.Chr.count
  markers from 2+ Chr                    2987      25        2962      1002      1160130    132480    104807    179692    395718221  markers_ctg.Chr.count2+
  2+ markers from 2+ Chr                26        0          26*        15228      604155    221354    192182    298848    5755227    markers_ctg.Chr.count2.2+
  2+ adjacent markers from 2+ Chr        15        0          15**      15228      368879    202728    194749    294623    3040932    markers_ctg.Chr.count2+a
Scf vs markers summary:
                                        #scf      <10000    >10000    min        max        mean      med        n50        sum
  scf(all)                              39978*    37135      2843      316        34167202  68129      1360      8217662    2723691675
  no markers                            37338      36038      1300      316        754615*    2601      1336      3957      97140879
 
  markers from 1+ Chr                    2640      1097      1543      1000      34167202  994905    16220      8661690    2626550796  markers_scf.Chr.count
  markers from 2+ Chr                    552        10        542        1002      34167202  4526814    2714036    9167014    2498801557  markers_scf.Chr.count2+
  2+ markers from 2+ Chr                212        0          212*      15228      34167202  8579232    7358307    10521496  1818797327  markers_scf.Chr.count2.2+
  2+ adjacent markers from 2+ Chr        38        0          38**      15228      25078118  8419544    7176534    13458592  319942681  markers_scf.Chr.count2+a
=== 212* scaffolds ===
  scf_id        scf_len  #Chr/2+markers  #/Chr/2+adjmarkers  reads
  7180002041381 31959312  15              0                  469503
  7180002041358 25078118  13              2                  291163
  7180002041386 21280754  12              0
  ...
==== scf7180002041381.1 ====
* no low cvg regions in the middle
* 1281 markers: 1231 on Chr4, 12 on Ch11
  #1- mate cvg regions: at the ends !!!
  #scfid          begin          end            scf_len        cvg_len cvg
  7180002041381  1              1173            31959312        1173    0
  7180002041381  1174            1454            31959312        281    1
  7180002041381  31959139        31959312        31959312        174    1
 
==== scf7180002041358.2 ====
* one low cvg region & real break
* 1111 markers: 869 on Chr14, 193 on Chr26, ...
  #1- mate cvg regions: middle
  #scfid          begin          end            scf_len        cvg_len cvg
  7180002041358  20970531        20970827        25078118        297    1
  7180002041358  20970828        20970949        25078118        122    0
  7180002041358  20970950        20971112        25078118        163    1
  # markers in the regions
  #scfid          begin          end            makerid        Chr
  7180002041358  20964100        20964829        BZ839784        14
  7180002041358  21002368        21003219        CC527932        26
==== scf7180002041386.3 ====
* one low cvg region but no markers in that region
* 939 markers: 902 on Chr24, 3 on Chr6 ...
  #1- mate cvg regions: at the ends
  #scfid          begin  end    scf_len        cvg_len cvg
  7180002041386  26382  27557  21280754        1176    1
  7180002041386  27558  27607  21280754        50      0
  7180002041386  27608  28031  21280754        424    1
...
==== scf7180002041368.128 ====
* one low cvg region
  #scfid          begin    end      scf_len        cvg_len  cvg
  7180002041368  1356953  1357221  1404851        269      1
* markers:Chr7 & Chr5 at 3'
  #scfid          begin    end      markerid        Chr
  7180002041368  10436    11033    BZ840669        7
  ...
  7180002041368  1278950  1279444  CZ404867        7
  7180002041368  1319489  1320264  CC593534        7
  7180002041368  1344001  1344347  BZ885865        7 *
  7180002041368  1371526  1372308  BZ867572        5 *
  7180002041368  1386933  1387750  CC534893        5
* [[Media:7180002041368.0cvg.png|7180002041368.0cvg.png]]
* subassembly: extract all reads & extra mates(265) and reassemble
** Msg Counts
  DST    40
  FRG    13762
  LKG    6482
** qc
  [Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID]
  0=37,1211999,1358964,32757,4082,7180000000571
  1=2,53844,53824,26922,-20,7180000000566
  2=4,50178,50238,12544,20,7180000000570
  3=1,28553,28553,28553,0,7180000000568
  4=1,17195,17195,17195,0,7180000000569
  total=45,1361769,1508774,30262,3675
show-coords scf7180002041368-7180002041368.update.scf.filter-q.delta
...
1269662  1338725  |    69065        1  |    69064    69065  |    99.97  |  1404851  1359628  |    4.92    5.08  | scf7180002041368  scf7180000000571
1340619  1341912  |        1    1295  |    1294    1295  |    98.77  |  1404851    1295  |    0.09  100.00  | scf7180002041368  scf7180000000574        [CONTAINS]
1345570  1347105  |    1537        1  |    1536    1537  |    98.18  |  1404851    1537  |    0.11  100.00  | scf7180002041368  scf7180000000572        [CONTAINS]
1347683  1356952  |    12617    3352  |    9270    9266  |    99.85  |  1404851    12617  |    0.66    73.44  | scf7180002041368  scf7180000000567
1356901  1361861  |    1767    6728  |    4961    4962  |    99.96  |  1404851    50318  |    0.35    9.86  | scf7180002041368  scf7180000000570
...
==== scf7180002041061.187 ====
* infoseq
  scf7180002041061 509987 39.73
* 22 markers
  #scfid        begin  end    makerid        Chr
  7180002041061 2061    2904    BZ847430        15
  7180002041061 8979    9756    CC481592        15
  7180002041061 10856  11485  BZ839377        15
  7180002041061 47578  48186  BZ836581        15
  7180002041061 80485  81237  CC553472        15
  7180002041061 117811  118488  CC918533        15
  7180002041061 151253  152009  CC477055        15
  7180002041061 213959  214640  CC572066        15 *
  7180002041061 236304  236479  CC550436        29 *
  7180002041061 242590  243462  BZ848041        29
  7180002041061 267880  268614  BZ877402        29
  7180002041061 268891  269493  CC580499        29
  7180002041061 282248  282861  BZ885584        29
  7180002041061 295221  295987  CC572941        29
  7180002041061 337785  338517  CC923898        29
  7180002041061 338464  339092  BZ921430        29
  7180002041061 341219  341971  CZ415932        29
  7180002041061 382509  383281  CC581748        29
  7180002041061 387548  388176  CC572064        29
  7180002041061 415263  415788  BZ878185        29
  7180002041061 470138  470773  CC558303        29
  7180002041061 493006  493794  CC565740        29
* mate happiness:
  cat 7180002041061.posmap.mates | cut -f3 | count.pl
  good            2501
  diffScaffold    86
  oneSurrogate    69
  oneChaff        40
  oneDegen        11
  badSame        9
  badOuttie      4
  badLong        4
* low cvg region:
  #scfid          begin  end    scf_len  cvg_len  cvg
  7180002041061  232849  232871  509987    23      1
  #it is in the middle of a unitig
  intersectPos.pl -i 1 7180002041061.posmap.utgscf 7180002041061.posmap.frgscf.10K.2K.mate_cvg.1-
  7180000849321  7180002041061  229782  238910  r
[[Media:7180002041061.0cvg.png|7180002041061.0cvg.png]]
* subassembly: extract all reads & mates(139) and reassemble
** Msg Counts
  DST    38
  FRG    5769
  LKG    2724
Dirs:
  /scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scf7180002041061.187.mates/asm.ECR2
  /scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scf7180002041061.187.mates/asm.ECR2.mates
Qc stats:
              asm.ECR2                                asm.ECR2.bog
  ...
  [Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID]
  0          3,277300,277925,92433,312,7180000000278  5,279938,399837,55988,29975,7180000000155
  1          6,234445,280546,39074,9220,7180000000277 6,234463,280485,39077,9204,7180000000156
  2          1,1630,1630,1630,0,7180000000276        1,1799,1799,1799,0,7180000000154
  3          NA                                      1,1674,1674,1674,0,7180000000153
  4          NA                                      1,1183,1183,1183,0,7180000000158
  total      10,513375,560101,51338,6675              14,519057,684978,37076,18436
* Some of the 139 mates assemble into scaffolds
* There is slightly more sequence in the bog assembly
* Mean/Max bog utg size is twice larger than default utg (scf, ctg sizes are about the same)
* align asm.ECR2 scaffolds to scf7180002041061
** 2919 bp at the 3' end of the new scf7180000000277 don't align
** 1990 bp at the 3' end of the new scf7180000000278 don't align
** new scf7180000000277 & new scf7180000000278 align for ~ 1058bp
** most of the mated read added assembled at new scf7180000000277 & new scf7180000000278 3'
   
  nucmer -l 100 -c 500 ../scf7180002041061.fasta 9-terminator/7180002041061.update.scf.fasta -p scf7180002041061-7180002041061.update.scf
 
  delta-filter -q scf7180002041061-7180002041061.update.scf.delta > scf7180002041061-7180002041061.update.scf.filter-q.delta
 
  show-coords scf7180002041061-7180002041061.update.scf.filter-q.delta
      1    46099  |        1    46099  |    46099    46099  |    99.99  |  509987  280626  |    9.04    16.43  | scf7180002041061  scf7180000000277
  46273    48914  |    46120    48761  |    2642    2642  |  100.00  |  509987  280626  |    0.52    0.94  | scf7180002041061  scf7180000000277
  50810    56904  |    95425  101521  |    6095    6097  |    99.43  |  509987  280626  |    1.20    2.17  | scf7180002041061  scf7180000000277
  57301  233303  |  101699  277707  |  176003  176009  |    99.97  |  509987  280626  |    34.51    62.72  | scf7180002041061  scf7180000000277 *
 
  232245  444418  |  275975    63802  |  212174  212174  |    99.99  |  509987  277965  |    41.60    76.33  | scf7180002041061  scf7180000000278 *
  444964  469975  |    63156    38151  |    25012    25006  |    99.97  |  509987  277965  |    4.90    9.00  | scf7180002041061  scf7180000000278
  449967  479221  |    58152    28900  |    29255    29253  |    99.96  |  509987  277965  |    5.74    10.52  | scf7180002041061  scf7180000000278
  480298  509987  |    29688        1  |    29690    29688  |    99.99  |  509987  277965  |    5.82    10.68  | scf7180002041061  scf7180000000278
==== scf7180002041163.214 ====
* infoseq
  scf7180002041234 10336067 41.93
* Markers
  #scfid          begin  end    makerid        Chr
  ...
  7180002041163  770783  771508  CC562874        30 *
  7180002041163  791050  791484  BZ887794        11 *
  ...
* [[Media:7180002041163.0cvg.png|7180002041163.0cvg.png]]
* Markers on different contigs !!!
  cat 7180002041163.posmap.ctgscf
  ..
  7180001926720  7180002041163  750392  787668  f
  7180001926721  7180002041163  787911  815738  f
  ...
* subassembly: extract all reads & extra mates(293) and reassemble
** Msg Counts
  DST    39
  FRG    17287
  LKG    8113
** qc
  [Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID]
  0=54,1742235,1760846,32264,351,7180000000507
  1=1,6224,6224,6224,0,7180000000506
  2=1,1472,1472,1472,0,7180000000508
  total=56,1749931,1768542,31249,351
* alignments :
  show-coords scf7180002041163-7180002041163.update.scf.filter-q.delta
    1748    7972  |    6224        1  |    6225    6224  |    99.84  |  1757217    6224  |    0.35  100.00  | scf7180002041163  scf7180000000506        [CONTAINS]
    4012    5490  |        1    1470  |    1479    1470  |    98.85  |  1757217    1472  |    0.08    99.86  | scf7180002041163  scf7180000000508        [CONTAINS]
    4567    18273  |        1    13697  |    13707    13697  |    99.92  |  1757217  1761853  |    0.78    0.78  | scf7180002041163  scf7180000000507
  ...
  750393  787668  |  757092  794367  |    37276    37276  |  100.00  |  1757217  1761853  |    2.12    2.12  | scf7180002041163  scf7180000000507
  787912  815738  |  794644  822455  |    27827    27812  |    99.81  |  1757217  1761853  |    1.58    1.58  | scf7180002041163  scf7180000000507
  ...
1755928  1757217  |  1760564  1761853  |    1290    1290  |  100.00  |  1757217  1761853  |    0.07    0.07  | scf7180002041163  scf7180000000507
=== Manually curated ===
==== Markers within 50K of a low mate cvg region ====
* 13 scaffolds (22 before)
* 14 breaks : 9 on the same contig , 2 on adjacent contigs , 3 on non adjacent contigs
* File
  /scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scfproblems.mates//scfproblems
* scfproblems.markers.txt
  #scfid        begin  end    markerid        Chr(break)      comment
  7180002040911 2210713 2211037 CZ405316        29              half Chr30, half Chr29 ; SAME ctg
  7180002041061 213959  214640  CC572066        15              half Chr29, half Chr15; SAME ctg
  7180002041103 5296272 5297045 CC587675        15              Chr18 & Chr15 at 3'; diff ctg (2 ctgs in between)
  7180002041107 7302203 7302874 CC906829        15              half Chr9, half Chr15; SAME ctg
  7180002041200 4920833 4921695 BZ877250        20              half Chr20, half Chr2; SAME ctg
  7180002041216 307121  307653  CL609365        5              Chr5 at 5' & Chr15; SAME ctg
  7180002041259 4315523 4316178 BZ922220        3              Chr5, Chr3, Chr24 !!! ; diff ctgs (4 ctgs in between)
  7180002041281 2884855 2885652 CC538348        19              half Chr19, half Chr28 ; SAME ctg
  7180002041315 2196754 2197298 BZ838379        1              Chr13 & Chr1 at 3'; diff ctg (2 ctgs in between)
  7180002041325 2597590 2598283 CC472212        10              half Chr10, half Chr5; SAME ctg
  7180002041348 3028112 3028921 CC531996        8              half Chr8, half Chr7; SAME ctg
  7180002041356 6828214 6828925 CC531427        16              half Chr16, half Chr25; SAME ctg
  7180002041358 21002368 21003219 CC527932      26              Chr14 & Chr26 at 3' (long chunck); diff ctg
 
Details:
  #scfid          begin  end    markerid        Chr    #ctgid          begin  end    markerid        Chr
  7180002040911  2189844 2190490 BZ923031        30      7180001730191  42851  43497  BZ923031        30      *
  7180002040911  2210713 2211037 CZ405316        29      7180001730191  63720  64044  CZ405316        29      *
  --
  7180002041061  213959  214640  CC572066        15      7180001852904  29927  30608  CC572066        15      *
  7180002041061  236304  236479  CC550436        29      7180001852904  52272  52447  CC550436        29      *
  --
  7180002041103  5220856 5221407 CG984741        18      7180001854649  21126  21677  CG984741        18      *
  7180002041103  5296272 5297045 CC587675        15      7180001854651  837    1610    CC587675        15      *
  --
  7180002041107  7302203 7302874 CC906829        15      7180001855003  77842  78513  CC906829        15      **
  7180002041107  7311399 7312254 CC479102        9      7180001855003  87038  87893  CC479102        9      **
  --
  7180002041200  4940131 4940949 CC500137        20      7180002002029  31935  32753  CC500137        20      **
  7180002041200  4956412 4957105 CZ428497        2      7180002002029  48216  48909  CZ428497        2      **
  --
  7180002041216  307121  307653  CL609365        5      7180002003578  6871    7403    CL609365        5      *
  7180002041216  310832  311546  CL865591        15      7180002003578  10582  11296  CL865591        15      *
  --
  7180002041259  4224828 4225527 CC920177        5      7180002012718  112862  113561  CC920177        5      *
  7180002041259  4315523 4316178 BZ922220        3      7180002012722  19376  20031  BZ922220        3      *
  --
  7180002041259  6220638 6221268 BZ869532        3      7180002012752  199799  200429  BZ869532        3      *
  7180002041259  6239728 6240375 CZ413142        24      7180002012753  2600    3247    CZ413142        24      *
  --
  7180002041281  2927406 2928016 CC573399        19      7180002018361  74539  75149  CC573399        19      **
  7180002041281  2938896 2939696 CC513914        28      7180002018361  86029  86829  CC513914        28      **
  --
  7180002041315  2152291 2153097 BZ836343        13      7180001862977  3397    4203    BZ836343        13      *
  7180002041315  2196754 2197298 BZ838379        1      7180002024308  7255    7799    BZ838379        1      *
  --
  7180002041325  2608389 2609213 CC506736        10      7180002025880  237956  238780  CC506736        10      **
  7180002041325  2638591 2639207 CC770009        5      7180002025880  268158  268774  CC770009        5      **
  --
  7180002041348  3044263 3044902 BZ872906        8      7180002030033  152672  153311  BZ872906        8      **
  7180002041348  3092547 3093121 BZ924509        7      7180002030033  200956  201530  BZ924509        7      **
  --
  7180002041356  6828214 6828925 CC531427        16      7180001722964  12571  13282  CC531427        16      *
  7180002041356  6832720 6833356 CC494876        25      7180001722964  17077  17713  CC494876        25      *
  --
  7180002041358  20964100 20964829  BZ839784      14      7180001723456  140919  141648  BZ839784        14      *
  7180002041358  21002368 21003219  CC527932      26      7180001723457  15983  16834  CC527932        26      *
* scfproblems.low_mate_cvg.txt
  #scfid                begin          end            ctglen          len            mate_cvg
  7180002040911        2205494        2205582        3008363        89              1
  7180002041061        232849          232871          509987          23              1
  7180002041103        5285314        5285479        5341215        166            1
  7180002041107        7306552        7306653        19514159        102            1
  7180002041200        4954073        4954180        16995932        108            1
  7180002041216        307614          307741          1444165        128            0
  7180002041259        4313297        4313311        15600612        15              0
  7180002041281        2933371        2933618        12907599        248            1
  7180002041315        2193142        2193167        2232736        26              1
  7180002041325        2623956        2624190        11153784        235            1
  7180002041348        3064633        3064750        19180127        118            1
  7180002041356        6831454        6831811        14697197        358            1
  7180002041358        20970950        20971112        25078118        163            1
==== No low cvg regions ====
* 9 scaffolds
* 14 breaks : 5 on the same contig , 6 on adjacent contigs , 3 on non adjacent contigs
  7180002040844 : 3 consecutive Chr30 markers (in the middle); mate_cvg > 18 ; markers CC576837,CC516738,CC543771; SAME ctg
  7180002041163 : 5' Chr30, 3' Chr11 ; mate_cvg~=10; marker BZ887794 ; diff ctg (1 ctg in between)
  7180002041234 : 5' Chr10, 3' Chr5; mate_cvg=10; marker BZ889975; scf=1757217bp; diff ctg
  7180002041235 : 2 consecutive Chr14 markers; mate_cvg=15..20; markers CC561837 &  CC585677; NOT uniq; SAME ctg
  7180002041279 : 5 consecutive Chr23 markers; markers CC472696,CC522963; cvg=9..18; diff ctg
  7180002041306 : 5' Chr14, 3' Chr6; marker CC549871; cvg=2; diff ctg (22 ctgs in between)
  7180002041308 : 2 consecutive Chr20 markers; markers CC571631 & BZ832318; cvg=20; SAME/diff ctg
  7180002041321 : 5' Chr2, 3' Chr15 ; marker CC513377 cvg=10; diff ctg (1 ctgs in between)
  7180002041350 : 2 consecutive Chr3 markers ; markers BZ837387 & CC571149; cvg 17..26; NOT uniq; SAME/diff ctg
Details:
  #scfid          begin  end    markerid        Chr    #ctgid          begin  end    markerid        Chr
  7180002040844  8369147 8369989 CC521620        17      7180001727899  132573  133415  CC521620        17      **
  7180002040844  8437707 8438324 CC576837        30      7180001727899  201133  201750  CC576837        30      **
  ...
  7180002040844  8467493 8468096 CC543771        30      7180001727899  230919  231522  CC543771        30      *
  7180002040844  8522732 8523598 CC513544        17      7180001727901  34970  35836  CC513544        17      *
  ---
  7180002041163  770783  771508  CC562874        30      7180001926720  20391  21116  CC562874        30      *
  7180002041163  791050  791484  BZ887794        11      7180001926721  3139    3573    BZ887794        11      *
  --
  7180002041234  3856848 3857673 BZ889975        10      7180002010181  56608  57433  BZ889975        10      *
  7180002041234  3949893 3950652 CC509477        5      7180002010182  28610  29369  CC509477        5      *
  --
  7180002041235  1117908 1118686 CC579933        21      7180002010285  4086    4864    CC579933        21      **
  7180002041235  1118939 1119675 CC561837        14      7180002010285  5117    5853    CC561837        14      **
  7180002041235  1158300 1159022 CC585677        14      7180002010285  44478  45200  CC585677        14      **
  7180002041235  1164296 1164891 BZ924510        21      7180002010285  50474  51069  BZ924510        21      **
  --
  7180002041279  6755839 6756519 BZ849919        1      7180002018195  417990  418670  BZ849919        1      *
  7180002041279  6786086 6786957 CC472696        23      7180002018196  12703  13574  CC472696        23      *
  ...
  7180002041279  6866378 6866890 CC522963        23      7180002018196  92995  93507  CC522963        23      *
  7180002041279  6911185 6911897 CC574255        1      7180002018197  15411  16123  CC574255        1      *
  --
  7180002041306  248713  249457  CC503129        14      7180002022875  59534  60278  CC503129        14      *
  7180002041306  910997  911554  BZ839769        6      7180002022892  21939  22496  BZ839769        6      *
  --
  7180002041308  1919399 1920196 BZ883381        18      7180002023013  57093  57890  BZ883381        18      *
  7180002041308  1958658 1959168 CC571631        20      7180002023014  29833  30343  CC571631        20      *
  7180002041308  1963333 1964083 BZ832318        20      7180002023014  34508  35258  BZ832318        20      **
  7180002041308  2009574 2010367 CC499423        18      7180002023014  80749  81542  CC499423        30      **
  --
  7180002041321  4820048 4820542 BZ846646        2      7180002025240  77429  77923  BZ846646        2      *
  7180002041321  4908631 4909160 CC513377        15      7180002025242  13302  13831  CC513377        15      *
  --
  7180002041350  2423855 2424205 CG983886        13      7180002030523  183644  183994  CG983886        13      *
  7180002041350  2446763 2447125 BZ837387        3      7180002030524  9191    9553    BZ837387        3      *
  7180002041350  2457417 2457704 CC571149        3      7180002030524  19845  20132  CC571149        3      **
  7180002041350  2486474 2487050 CC490214        13      7180002030524  48902  49478  CC490214        13      **
==== Scaffold splitting ====
Before:
* 22 scaffolds
* 28 breaks : 14 on the same contig , 8 on adjacent contigs , 6 on non adjacent contigs
Now:
* 14 scaffolds
* 15 breaks : 8 on the same contig , 3 on adjacent contigs , 4 on non adjacent contigs
===== Scaffold to break =====
    nl  scfid          breaks
    #1  7180002040844  2
    2  7180002040911  1
    3  7180002041061  1
    #4  7180002041103  1
    5  7180002041107  1
    #6  7180002041163  1
    7  7180002041200  1
    #8  7180002041216  1
    9  7180002041234  1
  #10  7180002041235  2
    11  7180002041259  2
  #12  7180002041279  2
    13  7180002041281  1
    14  7180002041306  1
  #15  7180002041308  2
    16  7180002041315  1
    17  7180002041321  1
    18  7180002041325  1
    19  7180002041348  1
  #20  7180002041350  2
    21  7180002041356  1
    22  7180002041358  1
===== Contigs to break =====
    nl  ctgid         
    1  7180001722964
    2  7180001723456
    3  7180001730191
    4  7180001852904
    5  7180001855003
    6  7180002002029
    7  7180002010182
    8  7180002012722
    9  7180002012752
    10  7180002018361
    11  7180002024308
    12  7180002025240
    13  7180002025880
    14  7180002030033
===== Marker pairs =====
    nl  scfid          begin1  end2    markerid1      Chr1    markerid2      Chr2    ctg1            ctg2            dist(end2-begin1)      dist(ctg2-ctg1)
    #1  7180002040844  8369147 8438324 CC521620        17      CC576837        30      7180001727899  7180001727899  69177                  0
    #2  7180002040844  8467493 8523598 CC543771        30      CC513544        17      7180001727899  7180001727901  56105                  2
 
    3  7180002040911  2189844 2211037 BZ923031        30      CZ405316        29      7180001730191  7180001730191  21193                  0
 
    4  7180002041061  213959  236479  CC572066        15      CC550436        29      7180001852904  7180001852904  22520                  0
 
    #5  7180002041103  5220856 5297045 CG984741        18      CC587675        15      7180001854649  7180001854651  76189                  2
 
    6  7180002041107  7302203 7312254 CC906829        15      CC479102        9      7180001855003  7180001855003  10051                  0
 
    #7  7180002041163  770783  791484  CC562874        30      BZ887794        11      7180001926720  7180001926721  20701                  1
 
    8  7180002041200  4940131 4957105 CC500137        20      CZ428497        2      7180002002029  7180002002029  16974                  0
 
    #9  7180002041216  307121  311546  CL609365        5      CL865591        15      7180002003578  7180002003578  4425                    0
 
    10  7180002041234  3856848 3950652 BZ889975        10      CC509477        5      7180002010181  7180002010182  93804                  1
 
  #11  7180002041235  1117908 1119675 CC579933        21      CC561837        14      7180002010285  7180002010285  1767                    0
  #12  7180002041235  1158300 1164891 CC585677        14      BZ924510        21      7180002010285  7180002010285  6591                    0
 
    13  7180002041259  4224828 4316178 CC920177        5      BZ922220        3      7180002012718  7180002012722  91350                  4
    14  7180002041259  6220638 6240375 BZ869532        3      CZ413142        24      7180002012752  7180002012753  19737                  1
 
  #15  7180002041279  6755839 6786957 BZ849919        1      CC472696        23      7180002018195  7180002018196  31118                  1
  #16  7180002041279  6866378 6911897 CC522963        23      CC574255        1      7180002018196  7180002018197  45519                  1
 
    17  7180002041281  2927406 2939696 CC573399        19      CC513914        28      7180002018361  7180002018361  12290                  0
 
    18  7180002041306  248713  911554  CC503129        14      BZ839769        6      7180002022875  7180002022892  662841                  22
 
  #19  7180002041308  1919399 1959168 BZ883381        18      CC571631        20      7180002023013  7180002023014  39769                  1
  #20  7180002041308  1963333 2010367 BZ832318        20      CC499423        18      7180002023014  7180002023014  47034                  0
 
    21  7180002041315  2152291 2197298 BZ836343        13      BZ838379        1      7180001862977  7180002024308  45007                  2*
 
    22  7180002041321  4820048 4909160 BZ846646        2      CC513377        15      7180002025240  7180002025242  89112                  2
 
    23  7180002041325  2608389 2639207 CC506736        10      CC770009        5      7180002025880  7180002025880  30818                  0
 
    24  7180002041348  3044263 3093121 BZ872906        8      BZ924509        7      7180002030033  7180002030033  48858                  0
 
  #25  7180002041350  2423855 2447125 CG983886        13      BZ837387        3      7180002030523  7180002030524  23270                  1
  #26  7180002041350  2457417 2487050 CC571149        3      CC490214        13      7180002030524  7180002030524  29633                  0
 
    27  7180002041356  6828214 6833356 CC531427        16      CC494876        25      7180001722964  7180001722964  5142                    0
 
    28  7180002041358 20964100 21003219 BZ839784        14      CC527932        26      7180001723456  7180001723457  39119                  1
===== Break intervals =====
        #scfid          begin  end    scflen          len  allread goodmate badmate Chr1 Chr2 mark1  mark2  ctgid
    #1  7180002040844  8412735 8412739 9565303        5      1      12      15      17 30  357    3     
    #2  7180002040844  8485384 8485869 9565303        486    0      12      4      30 17  3      53
 
    3  7180002040911  2205044 2205247 3009964        204    3      0      8      30 29  44      34    7180001730191
    4  7180002041061  232819  232843  510915          25      5      0      23      15 29  8      14    7180001852904
    #5  7180002041103  5295117 5295435 5342873        319    0      11      2      18 15  221    4
    6  7180002041107  7306552 7306620 19515586        69      2      0      11      15 9    320    549    7180001855003
    #7  7180002041163  787669  787911  1752998        243    0      10      0      30 11  16      8
    8  7180002041200  4954103 4954180 16991478        78      1      0      27      20 2    219    532    7180002002029
    #9  7180002041216  309331  309337  1435859        7      1      4      3      5 15    7      50
    10  7180002041234  3939493 3939548 10292558        56      1      0      20      10 5    160    227    7180002010182
  #11  7180002041235  1117752 1117987 1742396        236    5      20      2      21 14  43      2
  #12  7180002041235  1161294 1161363 1742396        70      3      16      0      14 21  2      32
    13  7180002041259  4313381 4313547 15594966        167    1      0      12      5 3    193    75    7180002012722
    14  7180002041259  6231717 6231762 15594966        46      8      4      18      3 24    75      406    7180002012752
  #15  7180002041279  6773364 6773383 9784709        20      0      4      14      1 23    292    5
  #16  7180002041279  6894603 6894618 9784709        16      0      4      16      23 1    5      141
    17  7180002041281  2931877 2931885 12908817        9      3      5      10      19 28  83      441    7180002018361
    18  7180002041306  267445  267446  4484402        2      0      2      14      14 6    49      141    7180002022875,7180002005935
 
  #19  7180002041308  1928806 1928825 23070077        20      0      10      0      18 20  63      2
  #20  7180002041308  1977759 1977828 23070077        70      1      21      1      20 18  2      836
    21  7180002041315  2192754 2192893 2233575        140    7      0      22      13 1    59      2      7180002024308
    22  7180002041321  4840934 4841070 6492983        137    1      0      16      2 15    227    72    7180002025240
    23  7180002041325  2624153 2624190 11155368        38      3      0      22      10 5    86      357    7180002025880
    24  7180002041348  3064691 3064730 19179560        40      1      0      16      8 7    116    634    7180002030033
  #25  7180002041350  2437553 2437572 23353087        20      0      19      1      13 3    107    2
  #26  7180002041350  2479000 2479029 23353087        30      2      23      1      3 13    2      880
 
    27  7180002041356  6831454 6831783 14699461        330    1      0      9      16 25  299    319    7180001722964
    28  7180002041358  20970057 20970139 25079321      83      7      0      25      14 26  868    189    7180001723456
Where:
  Chr1:  The most frequent chromosome markers with alignment at coordinates <= begin
  Chr2:  The most frequent chromosome markers with alignment at coordinates >= end
  mark1: Number of Chr1 markers with alignment at coordinates <= begin
  mark2: Number of Chr2 markers with alignment at coordinates >=end
  Lines starting with # should be ignored.
  ctgid: ctg to break
===== New scaffolds =====
  #scfid(new)          scfid          begin  end      scflen(new)
===== Files =====
  # ctg:scf new name mapping
  /scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scfproblems.both.filtered/scfproblems.posmap.ctbscb
  # scf:scf new:original name mapping
  /scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scfproblems.both.filtered/scfproblems.posmap.scbscf
 
  # ctg:ctg new:original name mapping
  /scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scfproblems.both.filtered/ctgproblems.posmap.ctbctg
 
  # ctg FASTA sequences
  /scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scfproblems.both.filtered/ctbproblems.fasta
  # scf FASTA sequences
  /scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scfproblems.both.filtered/scbproblems.fasta
= UMD2.6 (UMD2.5 without contam ctg/scaff; split ctg/scaff) =
== Contaminants & MarkerBreaks  ==
Delete summary:
                          ctg    scf    scf->ctg
  contaminants(delete)    156    152    666
  contaminants(trim)      12      12      1328
  markerBreaks            14+1    14+1    2875+1  # 1 more break in UMD2.6.1     
  total                    182    178    4869
Add summary:
                          ctg    scf 
  contaminants(delete)    0      2
  contaminants(trim)      12      12
  markerBreaks            28+2    29+2            # 1 more break in UMD2.6.1
  total                    40      43
Summary:                 
                          ctg    scf        markers
  original                90135  39978
  del/add                -142    -135
  final                    89993  39843      -17
Files:
  /scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.ctg.fasta    : contig  FASTA sequence
  /scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.scf.fasta    : scaffold FASTA sequence
 
  /scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.posmap.ctglen : contig  lengths
  /scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.posmap.scflen : scaffold lengths
 
  /scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.posmap.ctgscf : mapping of contigs to scaffolds  (posmap format)
  /scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.posmap.scaff  : mapping of contigs to scaffolds  (scaff  format)
 
  /scratch1/bos_taurus/Assembly/2009_0312_CA/new/ctg.delete.uid  : UID of the contigs  which were deleted 
  /scratch1/bos_taurus/Assembly/2009_0312_CA/new/scf.delete.uid  : UID of the scaffolds which were deleted 
  /scratch1/bos_taurus/Assembly/2009_0312_CA/new/ctg.add.uid      : UID of the contigs which were added :  UID =~/brk\d+[abc]/ OR UID =~/cnt\d+/
  /scratch1/bos_taurus/Assembly/2009_0312_CA/new/scf.add.uid      : UID of the contigs which were added :  UID =~/brk\d+[abc]/ OR UID =~/cnt\d+/
  /scratch1/bos_taurus/Assembly/2009_0312_CA/new/markers.delete.uid : markers which got deleted
  /scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.ctg.break  : 15 contig break regions
  /scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.scf.break  : 16 scaffold break regions
Scripts:
  ~/bin/breakPosmapKeep.amos : pipeline for breaking scf/ctg
* Final:
                    elem      min        max              mean      med        n50        sum
  ctg              89993      65        1160130(1.1M)    29736      5180      95952      2676109306
  scf              39843      316        34167202(34M)    68353      1361      7451988    2723419943
  scf<50K          38915      316        49898            2765      1349      5107      107632139
  scf<5K            35229      316        4999              1518      1306      1435      53495879
== Ctg Markers ==
Filtered:
  %IDY>90
  %Matched>85
* ~30% of alignments agree to this condition
Markers:           
  total:    126111    # from 31372 ctgs
Contigs
                    elem      min        max        mean      med        n50        sum
  placed            31372*    442        1160130    72908      52734      111147    2287291732*
  unplaced          58621      65        425922    6632      1600      22204      388817574
  total            89993      65        1160130    29736      5180      95952      2676109306
== Scf Markers ==
     
Markers:
  total:    126111  # from 2641 scaffolds; 1595 scaffolds have 2+ markers
  na:        18744  # not placed
  del:      17      # the scaffolds they belong to were deleted
  other:    3722    # not on the "main" chromosome; "main" chromosome determined by a majority rule; if it's a tie check markers for uniqueness
  outliers:  411    # interquartile range method (IQR)  q1-1.5*(q3-q1) ..  q1+1.5*(q3-q1) 
 
  filtered:  103217  # "non conflicting" markers
  Scaffolds:
                    elem      min        max        mean      med        n50        sum              #ctgs
  placed            2641      1000      34167202  994523    16673      8170786    2626536153*      50528
  unplaced          37202      316        754615    2604      1337      3964      96883790         39465
  total            39843      316        34167202  68353      1361      7451988    2723419943      89993
                    scf  <2  >=2  >=10 min  max  mean  med n50 sum
  markers/scf      2641 1595 1046 562  1    1418  39    1  354 103479
  ctg/scf          2641 1546 1095 559  1    545  19    1  130 50528
  cat markers_scf.mainChr.*summary  | count.pl -c 7 | sort -n
  1      1564  # 1 marker/scf
  2      276    # 2 markers/scf
  3      116    # ...
  4      52
  ...
  1470    1      # scaffold 7180002041371 has 1470 markers
* Scaffold position:   
  filter outliers (interquartile range method); use median value
  problem: only 2 markers far apart: choose randomly or check for uniqueness
  Summary (approximate)
  Chr  #Ctg    #Scaff  ScaffSpanSum    MaxMarkerPos
  1    2989    112    157088082      167097751
  2    2488    73      138112445      141135901
  3    2257    109    120984003      128677351
  4    2079    111    124695956      123662451
  5    2271    98      120470218      130242001
  6    2244    127    117350442      127208151
  7    2169    93      109780318      114917551
  8    2069    95      111646872      114607251
  9    1937    72      103790361      106365151
  10    1829    96      102878815      108508301
  11    2030    62      106593132      107458151
  12    1789    114    89109155        97406401
  13    1498    73      83821399        88539451
  14    1482    142    84084175        89211101
  15    1734    105    84680500        91332551
  16    1710    111    80727432        86838601
  17    1384    51      72913556        78195801
  18    1446    86      65689468        70299751
  19    1338    56      63372609        69847351
  20    1454    54      71941707        75982901
  21    1405    63      70035525        72193201
  22    1077    36      60892135      > 60178851
  23    1021    44      51791473        54886001
  24    1059    26      61662407      > 61466101
  25    783    41      42670836      > 45254751
  26    991    40      50640267        52316851
  27    920    70      45768018        48911451
  28    810    52      45884054        50753001
  29    1143    89      51657687        55219751
  30    3122    340    135803106      152429101
  U    39465  37202  94983049
* Scaffold orientation
  filter outliers (interquartile range method)
  use LeastSequareFit method to estimate the orientation : if slope is positive => forward; if slope is negative => reverse;
  problem: slope ~=0 => which direction ?
  cat Chr.summary | getSummary.pl -i 5
  cat Chr.agp | grep W | awk '{print $9}' | count.pl
        elem      <0        0          >0        min        max        mean      med
  scf    2641      516        1610      515        -31        61        0          0 
  ctg    50528      24885      2236      23407
 
  Use slope thold to determine direction?
  cat markers_scf.mainChr.noOutliers.summary | p 'print $_ if(abs($F[5])>0.5);' | wc -l #  634
  Ambiguity examples:
  BZ908653 6 114061501 114016501 114106501 172149 597 98.66 100.00 7180002040834
  BZ891600 6 114085251 114051501 114119001 4834 504 99.80 99.02 7180002040834
  CZ411135 6 114094001 114059001 114129001 242710 609 98.20 100.00 7180002040834
  BZ854276 6 114132751 114101501 114164001 100980 580 99.31 99.83 7180002040834
 
  CC524983 30 115669901 115622401 115717401 96634 715 98.74 100.00 7180002041003
  BZ869249 30 115791151 115717401 115864901 86931 448 99.78 100.00 7180002041003
  BZ867530 30 115798651 115732401 115864901 69671 572 99.13 100.00 7180002041003
  CC585731 30 115808651 115752401 115864901 54950 737 99.86 100.00 7180002041003
  CC469285 30 115818651 115772401 115864901 125555 550 94.57 100.00 7180002041003
* Scaffold overlaps: some small scaffolds might be contained by bigger ones
  cat markers_scf.mainChr.noOutliers.posmap.scfchrabs | ~/bin/posmap2ovl.pl | sort -nk6 -r | ~/bin/tab2tab.pl -f -15 | head
  Chr            ref            qry            begin          end            end-begin
  30            7180002041328  7180002041078  72536051      76993951      4457900
  4              7180002041381  7180002041269  41976451      46175201      4198750
  30            7180002040852  7180002041077  142001501      145251501      3250000
  30            7180002038569  7180002041121  33012001      35518351      2506350
  30            7180002040971  7180002034501  59411001      61836101      2425100
  ...
== Marker Issues ==
=== Placement ===
* 37202 scaffolds are unplaced (3.5% of total scaffold span); max=0.7M
* 87 unplaced scaffolds (1.5Mbp total) could be placed using SLK messages
  perl ~/bin/difference21.pl  bt.slk markers_scf.scflen | head
  7180002030632  7180002040171  I      -147744.734    21214.779      2      UP
  7180002030792  7180002031221  I      -15042.301      558.223        2      UP
  7180002030849  7180002041244  N      -1609089.875    18145.568      4      UP
  ...
                    elem      min        max        mean      med        n50        sum
  Unplaced          87        740        102909    17259      8544      34550      1501601
  Placed            78        16177      27139572  5757234    4037663    10989230  449064304
* ambiguous assignment to chromosmes:
  cat markers_scf.count | p 'print $_ if($F[2]*2==$F[3]);' | nl
        #k1            k2      count12 count1
    1  7180002030741  15      2      4
    2  7180002031341  11      1      2
    ...
    36  7180002068140  25      1      2
  # scaffold assigned differently
  ~/bin/difference12.pl markers.all/markers_scf.mainChr.count markers.all_plus_uniq/markers_scf.mainChr.count | nl
    1  7180002032536  19      1      2      23480
    2  7180002037223  8      1      2      31516
    3  7180002040013  25      2      4      15228
    4  7180002040262  30      1      2      3087
    5  7180002040378  10      1      2      50907
    6  7180002040523  29      1      2      9105
    7  7180002040769  4      1      3      827526
    8  7180002041203  14      2      4      1141560
    9  7180002044555  18      1      2      1566
* Chr30 has many scaff aligned to it
* how reliable are the markers: 1151 out of 2641 placed scaffolds have no unique markers ?
* what measure is best for placing the scaffolds?
* can some scaffolds go in the gaps ?
* AGP:
** gaps<=0 or 20 set to 100
** unoriented ctgs set to +
* Marker positions are not uniformly distributed; they tend to "custer"
                                  elem      0          >0        min        max        mean      med        n50        sum
  markers_chr                      107337    8405      98932      0          446250    25800      8750      75000      2769379200
  markers_scf                      94171      123        94048      0          2565850    24143      15856      42236      2273610521
  markers_chr.mainChr.noOutliers  101084    7609      93475      0          1253750    27383      10000      78750      2768040450  # filtered IRQx,xy method
  markers_scf.mainChr.noOutliers  88450      52        88398      0          2669361    25327      16587      44374      2240202221
=== Orientation ===
* at least 1564 out of 2641 placed scaffolds cannot be oriented; max=1.39M
=== Possible misassemblies ===
  nl  #scfid          Ch      medianPos      rangePos        scfLen              slope  #ChMark #Mark  #ctg
  1  7180002041225  1      120473301      9866350        9279975(9M)        0.0103  381    390    184        : break interval: 1674263..1674319 (1X frg_cvg region)
  2  7180002041078  30      77506551        6310600        3094999(3M)        0.0148  65      69      46        : no clear position
==== scf7180002041225 : 9.27 Mbp on Chr1 ====
scflen=9279975
  [[Media:Markers_7180002041225.txt|7180002041225.markers]]
  [[Media:7180002041225.png|7180002041225.markers]]
  [[Media:7180002041225.cvg.png|7180002041225.cvg]]
  ChrPos    ScfPos
  115272051 56300
  ...
  116848301 1618430
  116953301 9190560
  ...
  125138401 1691145
  break interval: 1674263..1674319 = 57bp frg_cvg=1 mate_cvg=0 bad_mate_cvg(nearby)=7
====  scf7180002041078 : 3.09 Mbp (not broken) ====
  [[Media:Markers_7180002041078.txt|7180002041078.markers]]
  [[Media:7180002041078.png|7180002041078.markers]]
  [[Media:7180002041078.1.cvg.png|7180002041078.1.cvg]] xrange [315268:466827]
  [[Media:7180002041078.2.cvg.png|7180002041078.2.cvg]] xrange [1191000:1229713]
  [[Media:7180002041078.3.cvg.png|7180002041078.3.cvg]] xrange [2733146:2750017]
  ChrPos  ScfPos
  72536051 2893791
  ...
  72682301 2750017
  72728651 1191000
  ...
  73273651 466827
  77112801 2733146
  ...
  78464051 1229713
  ..
  78595401 315268
====  scf7180002040971 : 1.93 Mbp (not broken) ====
==== Files ====
  /scratch1/bos_taurus/Assembly/2009_0312_CA/new/markers.all.redo/
* Nucmer alignments to UMD2.0 chromosomes: all seem to agree pretty well
  /scratch1/bos_taurus/Assembly/2009_0312_CA/new/markers.all.redo2/nucmer_UMD2.0/UMD2.0.Chr26-Chr26.png
  /scratch1/bos_taurus/Assembly/2009_0312_CA/new/markers.all.redo2/nucmer_UMD2.0/UMD2.0.Chr27-Chr27.png
  /scratch1/bos_taurus/Assembly/2009_0312_CA/new/markers.all.redo2/nucmer_UMD2.0/UMD2.0.Chr29-Chr29.png
  /scratch1/bos_taurus/Assembly/2009_0312_CA/new/markers.all.redo2/nucmer_UMD2.0/UMD2.0.ChrX-Chr30.png
== Homo sapiens alignments ==
Citations:
* mice closer to human than cow
* human and cow have approximately 201 homologous blocks of DNA
* Independently generated mapping data provide another measure of the quality of the assembly. Snelling et al. [4] created a B. taurus map from three radiation hybrid panels, two genetic maps, and bacterial artificial chromosome (BAC) end sequences. We aligned all of the 17,254 markers (of which 17,193 are unique) in their composite map (Cmap) to both assemblies. A marker was considered as matching a chromosome if 90% of the marker sequence aligned with at least 95% identity. Of the Cmap markers, 14,620 align to the UMD2 assembly's chromosomes, versus 13,699 markers (6.3% fewer) for the BCM4 assembly. A small number of Cmap markers (119 and 82 for UMD2 and BCM4, respectively) mapped to a different chromosome from the one indicated in the Cmap data.
* homologous synteny block (HSB) : 
** human-cow alignment extended for at least 250 Kbp
** it was not interrupted by an inversion or by an HSB on another chromosome.
** If two HSBs were interrupted by a gap of <3 Mbp and nothing else fell in that gap, the two blocks were merged. (Note that if a large region of synteny is interrupted by a distinct HSB, the interruption creates three HSBs.)
** [http://genomebiology.com/2009/10/4/R42/table/T3 Number of homologous synteny blocks on each chromosome of the cow and human genomes] 201 -> 268 blocks
* Orienting contigs using cow-human alignments:
** Scaffolds (sets of linked contigs) that were mapped onto chromosomes using only a single marker could not be oriented from the marker information alone. We oriented many of these scaffolds by taking advantage of the overall conserved synteny between cow and human. First, all cow scaffolds were aligned to the human genome using nucmer [14] with its maximal unique match (mum) option in order to avoid alignments of repetitive sequence. For each alignment of a previously unoriented scaffold to human, all alignments within 100 Kbp on each side were pulled out for analysis. A score S was computed for each unoriented scaffold, taking into account whether the scaffolds surrounding S on both sides (in cow) were mapped to a consistent set of locations in human. If the scaffolds surrounding S were oriented, and if a large majority of these scaffolds on both the left and right agreed on the orientation, then S was assigned that orientation. Using this procedure, 1,840 scaffolds containing 4,011 contigs were oriented.
** We developed a similar procedure to assign unplaced contigs to chromosomes, again relying on conserved synteny between cow and human. First, all unplaced contigs were aligned as above. Mummer's 'delta-filter' program was then used to compute a one-to-one mapping of the unplaced contigs to human so that only the best aligning contig was considered at each region in human. For each unplaced contig's best alignment to human, the matching region in cow was identified via our human-cow syntenic map, and all contigs from this region were extracted for examination. We only considered placing a contig on a B. taurus chromosome if the order and direction of the surrounding contigs in cow matched the corresponding region in human. As above, we examined the alignments of nearby cow contigs that aligned within 100 kb of the unplaced contig's alignment in human. If the region of cow-human synteny contained no rearrangements, then the unplaced contig was placed at the location indicated by these alignments. Using this procedure, 1,046 contigs were placed on chromosomes. One consequence of this procedure was that a number of incompletely mapped genes (based on mRNA alignments) were completed.
Issues:
* which alignment program to use?
** nucmer
** blastz: difficult to parse
** blat
* nucmer: what parameters?
** default
** loose : -mum -l 12 -c 30 -g 1000
** ref:  24 homo sapiens chromosomes files
** query: 26 bos taurus scaffold files
Total scaffolds:
                        elem      min        max        mean      med        n50        sum
  '''scf(len)              39844      316        34167202  68352      1361      7451988    2723419938'''
Aligned to HS:                               
                        elem      min        max        mean      med        n50        sum           
  '''scf(len)              8860      385        34167202  301202    2257      7740810    2668658140'''      # each scf aligns in avg to 4 Chr 
  scf(aligLen)          789272    250        19097      514        399        550        405946972     
  scf(alig%)            789272    58.04      100.00    79        78.54      78.72      .
 
  scf(len,maxX)          727        1002      7069350    194658    6530      1895564    141516604      # 220 in common with the 339 ones that have mark
 
  scf(len,noMark)        7025      385        7605708    7783      1654      27630      54680173 
  '''scf(len,mark)          1835      1001      34167202  1424511    40503      7970944    2613977967'''
Not aligned to HS:                               
                        elem      min        max        mean      med        n50        sum           
  scf(len,notAligned)    30984      316        163953    1767      1309      1555      54762139 
Aligned to markers:
                        elem      min        max        mean      med        n50        sum
  '''scf(len)              2641      1000      34167202  994523    16673      8170786    2626536153'''
  1 marker              1595      1000      1396951    17482      4595      48276      27884078
  2+ markers            1046      1055      34167202  2484371    596240    8217662    2598652075
 
  abs(slope)<0.25        1946      1000      3094999    29136      7585      114274    56700529
  abs(slope)>=0.25      696        4674      34167202  3692292    1845828    8401441    2569835619
Scf summary:
                        elem      min        max        mean      med        n50        sum
  all                    39844      316        34167202  68352      1361      7451988    2723419938
  1+mark                2641      1000      34167202  994523    16673      8170786    2626536153 # best marker alignment
  1+align                8860      385        34167202  301202    2257      7740810    2668658140 # alignments > 250bp
  1+align(new)          9880      385        34167202  270458    1996      7740810    2672129298 # alignments > 200bp (3.74M more than 250bp align)
  1+mark or  1+align    9669      385        34167202  276443    2051      7740810    2672936792
  0 mark and 1+align    7027      385        754,615    6603      1654      20219      46,400,303 !!!
  1+mark and 1+align    1791      1001      34167202  1305861    37914      7473583    2338798064
  0 mark and 0 align    30179      316        88,326    1705      1305      1517      51,483,587 !!!
 
  1+mark or  1+align    9669      385        34167202  276443    2051      7740810    2,672,936,792
Degenerate:
                            elem      min        max        mean      med        n50        sum
  all                        251413    65        39964      1003      984        994        252,279,234
  all(2000+bp)              2128      2000      39964      3753      2827      3946      7,986,872
  1+mark                    562        200        30168      2731      1274      5029      1,535,011
  1+mark(2000+bp)            180        2007      30168      6079      4404      7510      1,094,252
  1+align                    6429      251        39964      1487      1013      1287      9,566,273
  1+align(2000+bp)          756        2004      39964      4820      3624      5646      3,644,556
Issues:
* 24 scaffolds that have 200+ alignments to at least 2 HS chromosomes
* 37 scaffolds that have 100+ alignments to at least 2 HS chromosomes
* 83 scaffolds that have 50+ alignments to at least 2 HS chromosomes
File location:
  /scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_human
=== Synteny method ===
File location:
  /scratch1/bos_taurus/Assembly/2009_0312_CA/synteny
* Position & orient scaffolds that aligned to HS
* find placed neighboring that don't "disagree"
                              elem      min        max        mean      med        n50        sum
  total                      7027      385        754615    6603      1654      20219      46,400,303 !!! 
* Scaffolds possibly assigned to the wrong chromosome
  cat summary.nucmer-markers.txt | perl ~/bin/synteny/getSynteny2.nucmer-marker.pl -minMarkers 2 | grep update:chr | getSummary.pl -i 6 -t "<2 markers"
  cat summary.nucmer-markers.txt | perl ~/bin/synteny/getSynteny2.nucmer-marker.pl -minMarkers 4 | grep update:chr | getSummary.pl -i 6 -t "<4 markers"
                              elem      min        max        mean      med        n50        sum
  <2 markers                  307        1001      937279    27898      9212      98674      8,564,850
  <4 markers                  383        1001      1141560    33822      13426      160445    12,953,875
==== UMD2.0 ====
                              elem      min        max      mean      med        N50        sum
  ctg                        74337      88        840370    35148      14148      79144      2612810882*
==== UMD2.6.1 noVariants,noCont ====
                              elem      min        max      mean      med        N50        sum
  scf                        2646      385      34167202  994701    20787      7139718    2631980624
  ctg                        50755      65        1160130    50966      29450      88583      2586785910*
==== UMD2.6.1 noVariants ====
                              elem      min        max        mean      med        n50        sum
  scf                          39844      316        34167202  68352      1361      7451988    2,723,419,938
  scf.placed                  4707      385        34167202  564225    10413      7800796    2655811430
  scf.variants                29436      723        51828      1714      1298      1514      50,461,989
  scf.unplaced                30575      316        451968    1845      1309      1601      56440663
 
                              elem      min        max        mean      med        n50        sum
  ctg                          89994      65        1160130    29736      5180      95952      2676109378
  ctg.placed                  53646      65        1160130    48650      26840      98428      2609925446
  ctg.unplaced                31510      101        207476    1761      1314      1556      55511410
                              elem      min        max        mean      med        n50        sum
  deg(all)                    251413    65        39964      1003      984        994        252,279,234
  deg(>2Kbp)                  2128      2000      39964      3753      2827      3946      7,986,872
  deg.placed                  883        200        30246      2711      1303      4845      2,393,905
  deg.variants                4654      331        8039      1117      989        1024      5,200,553
  deg.unplaced(?)              245343    65        39964      994        984        993        244047258
  scf.unplaced                978        316        451968    7932      3736      15828      7757882
  scf.uplaced.0cvg            863        316        451968    7521      2286      15744      6,491,342
  deg.unplaced                747        2002      39964      4365      3509      4819      3261181
  deg.unplaced.0cvg            734        2002      39964      4358      3507      4811      3,199,221
 
  ct_deg.placed                53646      65        1160130    48650      26840      98428      2,609,925,446*
  UMD2.0-UMD2.6.1 gaps>1K summary
  id    count  min    max    median  sum
  1    775    999    64953  1683    2884541
  ...
  30    2174    999    167808  1528    7739170
  total 14936  999    167808  1638    53,526,388
* align UMD2.6 Chr1..30 to UMD2.0 Chr1..30 0cvg regions
              elem      min        max        mean      med        n50        sum
  all          14936      1000      167809    3584      1639      6465      53541324
  aligned      11091      1000      167809    4010      1692      7977      44475715
  not_aligned  3845      1000      50546      2357      1529      2735      9065609 
Files: UMD2.0 regions not covered by UMD2.6.1 (chr aligned to itself)
  /scratch1/bos_taurus/Assembly/2009_0312_CA/synteny/agp_markers.noVariants_nucmer.noVariants/nucmer_UMD2.0/Chr.0cvg.fa
  /scratch1/bos_taurus/Assembly/2009_0312_CA/synteny/agp_markers.noVariants_nucmer.noVariants/nucmer_UMD2.0/Chr.0cvg.posmap
  /scratch1/bos_taurus/Assembly/2009_0312_CA/synteny/agp_markers.noVariants_nucmer.noVariants/nucmer_UMD2.0/Chr.0cvg.summary
* UMD2.6.1 vs UMD2.0 Chr1..30
* nucmer -mum -l 50 -c 250
* max gap: Chr8:60134434..60228381=93948
* gaps>10K: 191 , 4.2M total, 4.19M aligned without "-mum"
                      elem      min        max        mean      med        n50        sum
  gaps>10K            191        10077      93948      22106      15599      26551      4222342
* realign all scaff without using nucmer "-mum" option: all gaps >10K align to large scaffolds !!!
==== UMD2.6.1 noVariants, add deg & UMD2.0 alignments ====
--[[User:Dpuiu|Dpuiu]] 09:44, 22 June 2009 (EDT)
                                elem      min        max        mean      med        n50        sum
  scf                            39844      316        34167202  68352      1361      7451988    2723419938
  scf.variants                  29436      723        51828      1714      1298      1514      50461989
  scf.noVariants                10580      316        34167202  252690    1957      7740810    2673467579
  scf.noVariants.ignore          4079      316        4985      1852      1541      1842      7558196    # less than 5K and placed through alignments inside a larger scaffold
  scf.noVariants.noIgnore        6506      385        34167202  409763    6673      7740810    2665923763
  ctg.variants                  29967      723        51828      1641      1298      1475      49178546
  deg.variants                  4654      331        8039      1117      989        1024      5200553
  scf_deg.placed                4933      385        34167202  540139    11548      7740810    2664507818
  scf_deg.markers                2003      1020      34167202  1302992    30456      8170786    2609894380
  scf_deg.alignHS                1818      385        3278163    27070      12322      47406      49214215
  scf_deg.alignUMD2.0            1112      1002      73626      4855      2948      6898      5399223
 
  scf.placed                    4044      385        34167202  657786    15617      7740810    2660090134
  scf.markers                    1825      1020      34167202  1429490    37672      8170786    2608820353
  scf.alignHS                    1587      385        3278163    30006      13998      50604      47620297
  scf.alignUMD2.0                632        1002      73626      5774      5137      9046      3649484
 
  deg.placed                    889        2002      39964      4969      3771      6165      4417684
  deg.markers                    178        2007      30168      6033      4336      7350      1074027
  deg.alignHS                    231        2049      39964      6900      6289      7337      1593918
  deg.alignUMD2.0                480        2002      15992      3645      2767      3842      1749739
 
  ctg_deg.placed                54129      65        1160130    48371      26493      98075      2,618,296,162*
  ctg.placed                    53240      65        1160130    49096      27309      98285      2,613,878,478
  deg.placed                    889        2002      39964      4969      3771      6165      4,417,684
==== HS-BT Synteny map ====
* Trust scaffolds with 4+ markers
* Scaffolds with 3- markers must have a close neighbor that agrees with them
  same HS & BT Chr
  maxCount=2                      # at most 2 scaffolds away
  minMarkers=4
  minRatio=0.66 & maxRatio=1.5    # distance ratio; maxDistance not used !!!
  => 226 synteny regions !!!
  join2.pl nucmer_*lsf markers_scf.*lsf | \
        ~/bin/filterMarkers.pl -minMarkers 4 | \
        ~/bin/getSyntenyBlock.pl | \
        grep -v ^# | grep -v ^$ | ~/bin/flipSummary.pl | sort -nk2 -nk5 | ~/bin/tab2tab.pl | \
        perl -ane  'print $P[13]," ",$F[13],"\n" if($F[13]-$P[13]==1); print $F[13]," ",$P[13],"\n" if($P[13]-$F[13]==1); @P=@F;' | sort -u -n | \
        ~/bin/mergeMap.pl >! hs-bt.map.tmp
  join2.pl nucmer_*lsf markers_scf.*lsf | \
        ~/bin/filterMarkers.pl -minMarkers 4 | \
        ~/bin/getSyntenyBlock.pl -map  hs-bt.map.tmp | \
        ~/bin/tab2tab.pl | grep  # | sed 's/#//' >  hs-bt.map
 
  ~/bin/map-draw.pl -refLen hs.infoseq -qryLen bt.infoseq hs-bt.map > ! hs-bt.png
Problems:
  7180002041220: BT.Chr2 ok
  7180002041025: BT.Chr4
  7180002041222: BT.Chr7
  7180002041228: BT.Chr8
  7180002040195: BT.Chr8; HS.Ch23 del !!!
  7180002041001: BT.Chr8; HS.Ch23 del !!!
  7180002040851: BT.Chr30; HS.Ch3 -> HS.Ch23
  7180002041008: BT.Chr30; HS.Ch7 -> HS.Ch23
* [[Media:hs-bt.png|hs-bt.png]] Map picture
* Map:
  #HS-ref      begin    end      len      HS-clust  BT-ref    begin    end      BT-clust  #scf
  01            870247    12637274  11767027  16        46893147  58621226  11728079  -1        5
  01            16674545  30031540  13356995  2        126668764 139027145 12358381  -2        4
  01            32016003  68894564  36878561  3        83066354  120874874 37808520  -3        10
  01            68907947  122485896 53577949  3        23819464  83221225  59401761  -4        6
  01            143725713 143737025 11312    3        25097798  25109110  11312    5        1
  ...
  23            148835146 151656716 2821570  30        32889252  35700335  2811083  -276      3
  23            152157769 154641974 2484205  30        39322867  42104054  2781187  -277      2
  23            154387850 154403684 15834    30        39353299  39369133  15834    278      1
Files:
  /scratch1/bos_taurus/Assembly/2009_0312_CA/synteny/map.4/hs-bt.map
  /scratch1/bos_taurus/Assembly/2009_0312_CA/synteny/map.4/hs-bt.png
==== Overlaps ====
Cases:
  1. CONTAINED scaffolds (clear variants)
  2. single BEGIN/END between 2 scaffolds: 2 scaffolds could be merged
  3. scaffold closing a sequence: 3 scaffolds could be merged
  4. multiple BEGIN/END/CONTAIN* between 2 scaffold contigs
1: CONTAINED ~ 4731 cases
  cat nucmer_scf.ovl/all-all.contained.ids  marker_scf.ovl/all-all.contained.ids  | sort -u > all-all.contained.ids
* Summary
                        elem      <2Kbp    2..10Kbp    >10Kbp  min    max        mean      med        n50        sum
  variants(all)        4731      3503    1085        143      723    75436      2513      1429      3469      11891882
  variants(1+markers)  541        311      170        60      1001  75421      4466      1690      9947      2416312
* Example: longest contained scaffold 75Kbp
  #scaffold alignments
  117893  126367  |      8520    1      |      8475    8520    |      94.06  |      562884  75576  |      1.51    11.27  |      7180002040891  7180002040646
  286811  289412  |      72974  75576  |      2602    2603    |      93.29  |      562884  75576  |      0.46    3.44    |      7180002040891  7180002040646
  421477  421977  |      9727    10223  |      501    497    |      95.41  |      562884  75576  |      0.09    0.66    |      7180002040891  7180002040646
  171970  180765  |      73605  64798  |      8796    8808    |      96.74  |      274750  75576  |      3.20    11.65  |      7180002040912  7180002040646
  180765  195989  |      64652  49398  |      15225  15255  |      96.05  |      274750  75576  |      5.54    20.18  |      7180002040912  7180002040646
  196223  203368  |      49244  42105  |      7146    7140    |      96.81  |      274750  75576  |      2.60    9.45    |      7180002040912  7180002040646
  205935  208456  |      38652  36130  |      2522    2523    |      96.12  |      274750  75576  |      0.92    3.34    |      7180002040912  7180002040646
  207096  210496  |      42082  38673  |      3401    3410    |      96.07  |      274750  75576  |      1.24    4.51    |      7180002040912  7180002040646
  209529  217353  |      36109  28272  |      7825    7838    |      95.50  |      274750  75576  |      2.85    10.37  |      7180002040912  7180002040646
  213859  219155  |      25520  20220  |      5297    5301    |      95.80  |      274750  75576  |      1.93    7.01    |      7180002040912  7180002040646
  221183  227630  |      20236  13870  |      6448    6367    |      93.63  |      274750  75576  |      2.35    8.42    |      7180002040912  7180002040646
  227923  228686  |      13884  13126  |      764    759    |      95.16  |      274750  75576  |      0.28    1.00    |      7180002040912  7180002040646
  228084  231413  |      13104  9760    |      3330    3345    |      94.10  |      274750  75576  |      1.21    4.43    |      7180002040912  7180002040646
  #contig alignments
  1929    10403  |      8520    1      |      8475    8520    |      94.06  |      26592  8520    |      31.87  100.00  |      7180002040891.4.10      7180002040646.1.8      [CONTAINS]
  9084    11685  |      30871  33473  |      2602    2603    |      93.29  |      276174  33473  |      0.94    7.78    |      7180002040891.8.10      7180002040646.8.8
  143750  144250  |      1187    1683    |      501    497    |      95.41  |      276174  4564    |      0.18    10.89  |      7180002040891.8.10      7180002040646.2.8
  163797  172592  |      31502  22695  |      8796    8808    |      96.74  |      233551  33473  |      3.77    26.31  |      7180002040912.3.5      7180002040646.8.8
  172592  187816  |      22549  7295    |      15225  15255  |      96.05  |      233551  33473  |      6.52    45.57  |      7180002040912.3.5      7180002040646.8.8
  188050  195195  |      7141    2      |      7146    7140    |      96.81  |      233551  33473  |      3.06    21.33  |      7180002040912.3.5      7180002040646.8.8
  197762  200283  |      2523    1      |      2522    2523    |      96.12  |      233551  2523    |      1.08    100.00  |      7180002040912.3.5      7180002040646.6.8      [CONTAINS]
  198923  202323  |      3410    1      |      3401    3410    |      96.07  |      233551  3411    |      1.46    99.97  |      7180002040912.3.5      7180002040646.7.8      [CONTAINS]
  201356  209180  |      7838    1      |      7825    7838    |      95.50  |      233551  7838    |      3.35    100.00  |      7180002040912.3.5      7180002040646.5.8      [CONTAINS]
  205686  210982  |      12396  7096    |      5297    5301    |      95.80  |      233551  12396  |      2.27    42.76  |      7180002040912.3.5      7180002040646.3.8
  213010  219457  |      7112    746    |      6448    6367    |      93.63  |      233551  12396  |      2.76    51.36  |      7180002040912.3.5      7180002040646.3.8
  219750  220513  |      760    2      |      764    759    |      95.16  |      233551  12396  |      0.33    6.12    |      7180002040912.3.5      7180002040646.3.8
  219911  223240  |      4564    1220    |      3330    3345    |      94.10  |      233551  4564    |      1.43    73.29  |      7180002040912.3.5      7180002040646.2.8      [CONTAINS]
  #marker & alignment summary
  #id            BT-ref    #markers  slope    begin    end      len      HS-ref    #align    slope    begin    end      len
  7180002040891  4        1        0        13956053  14518937  562884    22        14        -1.5687  20838730  21401614  562884
  7180002040163  8        1        -1        38079677  38178351  98674    22        3        1.0264    21133894  21232568  98674    update:dir:7180002040646
  7180002040646  8        .        1        38088797  38164373  75576    22        4        -0.9833  21143014  21218590  75576    assign:Chr:7180002040163,7180002040163
2. ~ 349 cases (-3 cases 3.)
  cat all-all.begin.ids all-all.end.ids | sort -u | wc -l
3.  ~ 3 cases
  intersect.pl all-all.begin.ids all-all.end.ids | perl -ane 'print $_ if($F[1]=~/1.1$/);'
  7180002032818.1.1      23259
  7180002036943.1.1      28355
  7180002040409.1.1      25219 # merges 2 scaff
 
  1      2811    |      2807    1      |      2811    2807    |      99.11  |      23259  5134    |      12.09  54.67  |      7180002032818.1.1      7180002032811.1.2[BEGIN]  [BEGIN]
  22101  23259  |      7238    6079    |      1159    1160    |      99.66  |      23259  7238    |      4.98    16.03  |      7180002032818.1.1      7180002032811.2.2[END]    [END]
  1      2811    |      2807    1      |      2811    2807    |      99.11  |      23259  5134    |      12.09  54.67  |      7180002032818.1.1      7180002032811.1.2[BEGIN]  [BEGIN]
  22101  23259  |      7238    6079    |      1159    1160    |      99.66  |      23259  7238    |      4.98    16.03  |      7180002032818.1.1      7180002032811.2.2[END]    [END]
 
  1      2025    |      2066    4090    |      2025    2025    |      99.80  |      25219  4090    |      8.03    49.51  |      7180002040409.1.1      7180002033541.1.1[BEGIN]  [END]
  23394  25219  |      5761    3930    |      1826    1832    |      98.42  |      25219  5761    |      7.24    31.80  |      7180002040409.1.1      7180002033538.1.1[END]    [END]
  cd  nucmer_scf.ovl
  intersect.pl  all-all.begin.ids all-all.end.ids
  7180002032818  23259
  7180002036943  28355
  7180002040409  25219
4. ~10 cases
  cd nucmer_ctg.ovl/translated/
  cat all-all.annotated.coords | egrep 'BEGIN|END|CONTAIN' | p 'next if($F[6]<5000); next if($F[7]<5000); print $_;' | p '$F[17]=~/^([^.]+)(.+)/;  $F[17]=$1 ; $F[18]=~/^([^.]+)/; $F[18]=$1; print $F[17],"\t",$F[18],"\n";' | count.pl -m 2
  7180002041235  brk002041306a  3
  7180002038888  7180002034914  2
  7180002041059  7180002040095  2
  7180002040879  7180002040934  2
  cnt0002041350  7180002040907  2
  7180002041341  7180002040894  2
  7180002041015  cnt0002040938  2
  7180002039401  7180002040397  2
  7180002037358  7180002031425  2
  7180002040915  7180002039470  2
  cat all-all.annotated.coords | egrep 'BEGIN|END|CONTAIN' | grep ...
  12363  21143  |      1      8837    |      8781    8837    |      93.43  |      21143  50443  |      41.53  17.52  |      7180002041235.23.33    brk002041306a.5.7[END]    [BEGIN]
  1      9272    |      5591    14944  |      9272    9354    |      94.47  |      9272    50443  |      100.00  18.54  |      7180002041235.25.33    brk002041306a.5.7[CONTAINED]
  71049  97689  |      31941  5220    |      26641  26722  |      96.71  |      97689  31941  |      27.27  83.66  |      7180002041235.29.33    brk002041306a.6.7[END]    [END]
  57      3896    |      14743  10903  |      3840    3841    |      95.94  |      3896    22667  |      98.56  16.95  |      7180002041235.32.33    brk002041306a.3.7[CONTAINED]
  1      8942    |      1      8947    |      8942    8947    |      99.49  |      20792  8947    |      43.01  100.00  |      7180002038888.1.1      7180002034914.1.2[CONTAINS]
  9670    20792  |      1      11091  |      11123  11091  |      99.27  |      20792  11091  |      53.50  100.00  |      7180002038888.1.1      7180002034914.2.2[CONTAINS]
  1      8705    |      3932    12610  |      8705    8679    |      96.17  |      8705    30654  |      100.00  28.31  |      7180002041059.1.26      7180002040095.1.1[CONTAINED]
  1      4777    |      12989  17697  |      4777    4709    |      95.32  |      4777    30654  |      100.00  15.36  |      7180002041059.2.26      7180002040095.1.1[CONTAINED]
  1      6588    |      17999  24607  |      6588    6609    |      96.15  |      6591    30654  |      99.95  21.56  |      7180002041059.3.26      7180002040095.1.1[CONTAINED]
  630    16709  |      16024  1      |      16080  16024  |      97.09  |      16901  16024  |      95.14  100.00  |      7180002040879.2.14      7180002040934.2.87[CONTAINS]
  1      5876    |      5882    1      |      5876    5882    |      98.51  |      39657  13643  |      14.82  43.11  |      7180002040879.3.14      7180002040934.1.87[BEGIN] [BEGIN]
== Scaffold links ==
Try to identify scaffold that fit the following criteria:
* have no markers and no alignments to HS
* linked by 2+ links to a single scaffold that has markers/alignments to HS
* is not a variant (2Kbp+ of unique sequnece)
                                                  elem      min        max        mean      med        n50        sum
  linked                                          6169      316        88326      2408      1480      2709      14,860,479
  linked(2+mates)                                2109      316        44120      2595      1558      3092      5,474,510
  linked(2+mates to a single scf)                2057      316        44120      2516      1547      2850      5,177,182
  linked(2+mates to a single scf, no variant)    112        316        43782      7278      1987      23338      815,223        # 46>2Kbp; 25>10Kbp
=> 112 scaffold & 0.81Mbp could be added to the chromosomes !!!
== BT alignments ==
* UMD2.6.1 vs UMD2.6.1
# Marker scaffolds against themselves (2642 total scaffolds) : nucmer -maxmatch -l 40 -c 2500 -g 250
# Mapped scaffolds without markers against marker scaffolds:  nucmer -maxmatch -l 40 -c 250
# Mapped scaffolds without markers against themselves:        nucmer -maxmatch -l 40 -c 250
File location:
  /scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_Chr/
== Other issues ==
=== Chr10 gene duplication ===
* LOC100298457
* See /nfshomes/dpuiu/Readmes/bos_taurus.runCA.13.txt
=== Scf breaks (Mike Robers) ===
  ::::::::::::::
  scf.break.ids
  ::::::::::::::
  #ctg            scf            beg    end    dir
  7180002015438  7180002038435  0      32822  f      # ok: 4 markers from Chr18
  7180002021916  7180002040442  0      24559  f      # ok: 1 marker  from Chr10
  7180001725791  7180002040808  466123  562461  f      # ok: 25 marker  from Chr1, 1 marker from Chr2
  7180001727899  7180002040844  8236574 8479429 f      # also found by us; 3 Chr30 markers in the middle
  7180001854650  7180002041103  5230302 5294734 f      # also found by us; 4 Chr15 markers at 3'
  7180002003578  7180002041216  300250  392069  f      # also found by us; 7 Chr5 markers at 5'
  7180002020010  7180002041293  4482477 4600045 f      # ok
  7180001722390  7180002041353  0      51674  f      # ok
  ::::::::::::::
  scf.excised.ids
  ::::::::::::::
  #ctg            scf            beg    end    dir
  7180002008629  7180002034664  82740  105723  f      # no markers
  7180001786240  7180002040913  2076746 2112375 f      # ok
  7180001787022  7180002040927  1075352 1109350 f      # ok
  7180001787352  7180002040931  4956099 4967823 f      # ok
  7180001789387  7180002040981  6641924 6671967 f      # ok
  7180001789575  7180002040984  3095587 3154302 f      # ok
  7180002003269  7180002041209  6350890 6470971 f      # ok
  7180002025281  7180002041322  989366  1227636 f      # ok
  7180002026741  7180002041328  343726  414329  f      # ok
  7180002029433  7180002041343  834493  888144  f      # ok
  7180001726451  7180002041383  7893629 7901645 f      # ok
=== LOC100298457 (duplicate gene on Chr10) ===
Problem:
  Is LOC100298457 gene (cow Chr10) a variation of  MFSD3 gene (cow Chr14)?
UMD2.0
Chr10  15186759        15189292        833    W      7180003260677  1      2534    +    # gene LOC100298457
Chr14  1839355        1958202        533    W      7180003326040  1      118848  +    # gene MFSD3
* LOC100298457 aligns to 2 UMD2.6 scaffolds:
  scf7180002041112 (1.56Mbp; 34 contigs)
  scf7180002033841 (2.44Kbp; 1 contig)
* scf7180002041112 contains 39 bos taurus Chr14 markers and has 206 alignments to human chromosome 8
* scf7180002033841 contains no bos taurus markers and has 1 alignments to human chromosome 8.
* scf7180002033841 5' aligns to scf7180002041112 (1.56Mbp; 34 contigs, cow Chr14) and to a human Chr8 region that maps to cow Chr14
* scf7180002033841 3' aligns to scf7180002041157 (6.15Mbp; 99 contigs, cow Chr10)
* scf7180002033841 & scf7180002041157 are linked by 2 mate pairs (inserts from a 3kbp BCM shotgun library)
* LOC100298457 vs UMD2.6 scf:
  [S1]    [E1]  |    [S2]    [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS]
  ===============================================================================================================================
      1      88  |  1138313  1138226  |      88      88  |  100.00  |    1344  1561688  |    6.55    0.01  | LOC100298457      scf7180002041112
    80      192  |  1138150  1138038  |      113      113  |  100.00  |    1344  1561688  |    8.41    0.01  | LOC100298457      scf7180002041112
    190      316  |  1137974  1137848  |      127      127  |  100.00  |    1344  1561688  |    9.45    0.01  | LOC100298457      scf7180002041112
    315      464  |  1137770  1137621  |      150      150  |  100.00  |    1344  1561688  |    11.16    0.01  | LOC100298457      scf7180002041112
    455    1344  |  1137322  1136432  |      890      891  |    99.78  |    1344  1561688  |    66.22    0.06  | LOC100298457      scf7180002041112
      1    1344  |    1349        6  |    1344    1344  |  100.00  |    1344    2445  |  100.00    54.97  | LOC100298457      scf7180002033841
* UMD2.6 scf* vs scf7180002033841:
    [S1]    [E1]  |    [S2]    [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS]
  ===============================================================================================================================
  1136427  1137322  |        1      895  |      896      895  |    99.78  |  1561688    2445  |    0.06    36.61  | scf7180002041112  scf7180002033841
  4060495  4061426  |    2445    1521  |      932      925  |    97.76  |  6157473    2445  |    0.02    37.83  | scf7180002041157  scf7180002033841
Bos taurus marker summary:
  #id                BT-ref    #markers  slope    begin    end      len
  scf7180002041112  Chr14    39        -0.8518  3377707  4939395  1561688
  scf7180002041157  Chr10    255      1.0035    12831593  18989066  6157473
  scf7180002033841  ?        0
Homo sapiens alignment summary:
  #id                HS-ref    #align    slope    begin    end      len
  scf7180002041112  Chr8      206      0.6701    144381103 145942791 1561688
  scf7180002041157  Chr15    1667      0.987    62245836  68403309  6157473
  scf7180002033841  Chr8      1        1        145705320 145707765 2445
        [S1]    [E1]  |    [S2]    [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS]
  ===============================================================================================================================
  143996007 143996305  |    74278    74572  |      299      295  |    82.45  | 146274826  1561688  |    0.00    0.02  |  1  1  NC_000008        7180002041112
  ...
  146249212 146249657  |  1268887  1268446  |      446      442  |    80.89  | 146274826  1561688  |    0.00    0.03  |  1 -1  NC_000008        7180002041112
    63013167 63013429  |  887697  887966  |      263      270  |    78.15  | 100338915  6157473  |    0.00    0.00  |  1  1  NC_000015        7180002041157
    ...
    68369047 68369417  |  6154757  6155112  |      371      356  |    83.06  | 100338915  6157473  |    0.00    0.01  |  1  1  NC_000015        7180002041157
  145705522 145706206  |      202      886  |      685      685  |    81.28  | 146274826    2445  |    0.00    28.02  |  1  1  NC_000008        7180002033841
scf7180002033841: 6 reads
  read            scf            begin  end    dir
  1120017508      7180002033841  0      1018    f
  1120020725      7180002033841  227    1109    f
  1120020722      7180002033841  724    1720    r
  1120020726      7180002033841  1006    2073    f      diffScaffold
  1120017511      7180002033841  1303    2175    r
  1120020728      7180002033841  1370    2445    f      diffScaffold
scf7180002033841: 4 mates
  read1          read2          scf1            scf2
  1120017508      1120020722      7180002033841  7180002033841
  1120017511      1120020725      7180002033841  7180002033841
  1120020726      1120017512      7180002033841  7180002041157  diffScaffold
  1120020728      1120017514      7180002033841  7180002041157  diffScaffold
scf7180002041157: 69095 reads; frg 1120017514 & 1120017512 positions close to the region aligned to scf7180002033841 (4060495-4061426)
  read            scf            begin  end    dir
  607312594      7180002041157  0      866    f
  ...
  1120017512      7180002041157  4059332 4060396 f      diffScaffold
  1120017514      7180002041157  4058851 4059931 f      diffScaffold
  ....
  583956204      7180002041157  6157144 6157228 r
scf7180002041157: 99 contigs
  count ctg                scf                begin  end    dir
  1    ctg7180001926175  scf7180002041157  0      81467  f
  ..
  64    ctg7180001926231  scf7180002041157  4039922 4046680 f # 6758bp ctg;    800bp gap following the contig
  65    ctg7180001926232  scf7180002041157  4047480 4193693 f # 146213bp ctg ; 20bp  gap following the contig
  66    ctg7180001926233  scf7180002041157  4193713 4208047 f # 14334bp ctg
  ..
  99    ctg7180001926260  scf7180002041157  6152691 6157473 f
frg 1120017514 & 1120017512 positions on ctg7180001926232
  1120017514  ctg7180001926232  11371  12451  f  # reads are 11kbp(>3Kb lib mean) inside the contig
  1120017512  ctg7180001926232  11852  12916  f  # reads are 11kbp(>3Kb lib mean) inside the contig
=== Missing genes ===
* 5 genes not found by Liliana using ESTaligner but found using gmap (%id<95) : they were on the haplotype variants
===  Chr27 -> Chr21 ===
* The centromeric end of BTA27 is actually placed on BTA21  in UMD2.6 (which version?)
== Chromosomes ==
=== synteny.redo2 ===
* summary.txt : 8122 scf + 6322 deg => 14444 seqs  (placed using markers or synteny to HS)
* from summary.txt removed:
  58 scf.questionable.ids (44 placed, 14 linked)
  100 ctg.questionable.ids
  931 deg.questionable.ids
=== UMD_2.6.d_g ===
  .                    elem      <2000      >2000      min    max        mean      med        n50        sum
  haplotypes_contigs  3828      2701      1127      471    123243    2503      1580      2590      9,584,985
=== UMD_2.6.a_g_070109 ===
Gaps: 65900
* all 65900 gaps are "fragment yes"
=== UMD_2.6.a_g_070509 ===
* Combine UMD_2.6.a_g_070109 (Guillaume's assembly) and UMD2.6.1 (Daniela's)
=== UMD_2.6.a_g_070709 ===
* Remove from UMD_2.6.a_g_070509
** ~ 21 ChrY ctg & ~ 39 ChrY deg
** ~ 4206 haplotype variants (6.63Mbp) within 1K from one another
  .                                      elem      <2000        >2000      min    max        mean      med        n50        sum
  ctg_deg.variants.placed                6654      5504        1150      263    42158      1761      1189      1772      11719653
  ctg_deg.variants.placed.sameChr        5374      4455        919        263    42158      1748      1152      1775      9393815
  ctg_deg.variants.placed.within_100K    4790      4100        690        263    42158      1651      1128      1605      7911160
  ctg_deg.variants.placed.within_1K*    4206      3679        527        263    42158      1577      1107      1508      6633118
Summary:
                              ctg+deg    <2000      >=2000    min    max        mean      med        n50        sum
  Chr1..29,X                  72197      20763      51434      65    1160130    36536      13055      97328      2,637,809,286
  ChrU                        3752      2587      1165      362    179692    3284      1447      6427      12,324,356
  ChrY-contigs                315        266        49        224    26490      2249      974        6679      708,535
  contigs.haplotype-variants  40198      36720      3478      263    51828      1460      1203      1361      58,698,457
  deg.unplaced.less_2K        224945    224945    0          65    1996      972        983        990        218,847,978
=== Issues ===
* 490 scf don't have all ctgs placed (865 ctgs)
* 699 reliable contigs (3.25Mbp) unplaced
  difference.pl ctg.reliable.ids UMD_2.6.a_g_070709/Chr.posmap | getSummary.pl -i 2 -z 2000
  elem      <2000      >2000      min    max        mean      med        n50        sum
  699        405        294        362    123243    4660      1759      14809      3257635
* Rearrangements UMD2.0 vs UMD_2.6.a_g_070709 : ~ 25 ctgs>50K ; ~ 12 scf
* AGP file format: "fragment yes" should be preserved even if gap type=U
  grep -A 2 7180001925241 UMD_2.6.a_g_070109/Chr.agp
  Chr30  131686408      131762880      14157  W      7180001925241  1      76473  +
  Chr30  131762881      131762980      14158  U      100    fragment        yes
  Chr30  131762981      131796167      14159  W      7180001925242  1      33187  +
  grep -A 2 7180001925241 ../UMD_2.6.a_g_070709/Chr.agp
  ChrX    133855417      133931889      15581  W      7180001925241  1      76473  +
  ChrX    133931890      133931989      15582  U      100    contig  no
  ChrX    133931990      133965176      15583  W      7180001925242  1      33187  +
* Liliana found  18 ctg + deg that have genes
  Fixed case B (7180001932648,7180001925237)
* Bob (Missouri)
# scf7180002041216 (43 ctgs, 1.44Bmp) should be split : Chr15 (36 ctgs) , Chr5 (first 7 ctgs)  ctg7180002003576,ctg7180002003577 moved from Chr9=>Chr5
  => Chr5
    1  7180002003574  7180002041216  0      19886  f # already on Chr5
    2  7180001694221  7180002041216  19906  20911  r
    3  7180001978231  7180002041216  20931  52725  r
    4  7180001946738  7180002041216  53559  55402  r
    5  7180002003575  7180002041216  55422  92145  f
    6  7180002003576  7180002041216  92165  200007  f
    7  7180002003577  7180002041216  200456  300048  f
# scf7180002041153 (48 ctgs, 2.36Mbp): assignment to Chr11 seems correct, Chr6(0)
===  UMD_2.6.a_g_071709 -> UMD_Freeze2.99 ===
Changes vs UMD_2.6.a_g_070709:
* 17 contigs/degenerates recruited by Liliana based on mRNA alignments got placed on chromosomes
* Chr9,15 => Chr5 correction : a scaffold got broken between 2 chromosomes & and several contigs got moved to Chr5
* 697 unplaced contigs from placed scaffolds got placed as well
Contig placement summary:
                                #ctg+deg <2Kbp  >=2Kbp min  max        mean  med    n50    sum
  ==================================================================================================
  Chr1..29,X                    72911  21180  51731  65  1160130    36223  12706  97232  2641097363
  ChrU                          3365    2448    917    224  179692    2898  1348  5399  9754701
  contigs.haplotype-variants    40198  36720  3478  263  51828      1460  1203  1361  58698457
  deg.unplaced.less_2K          224933  224933  0      65  1996      972    983    990    218837572
  Chr1..29,X(new)              714    417    297    177  123243    4605  1741  14809  3288596
  ChrY-contigs                  315    266    49    224  26490      2249  974    6679  708535
  ChrY-contigs.SHOTGUN_ONLY    144    140    4      804  4224      993    882    888    143047
  ===================================================================================================
Comments:
* "Chr1..29,X", ChrU, contigs.haplotype-variants, deg.unplaced.less_2K are mutually exclusive sets
* "Chr1..29,X(new)" are contigs which were not placed in UMD_2.6.a_g_071709.
** 17 contig them were added by Liliana (1 failed)
** the rest are reliable contigs left unplaced by Aleksey/Guillaume program
* ChrY-contigs.SHOTGUN_ONLY are a subset of ChrY-contigs which don't contain only SHOTGUN reads
* ChrY-contigs are part of ChrU
Files (walnut):
  /scratch1/bos_taurus/Assembly/2009_0312_CA/scf_placements/UMD_2.6.a_g_071709/                # FASTA & AGP format
  /scratch1/bos_taurus/Assembly/2009_0312_CA/scf_placements/UMD_2.6.a_g_071709/nucmer_UMD2.0/  # nucmer alignments to UMD2.0
Files (freeze):
  /fs/szasmg3/bos_taurus/UMD_Freeze2.99/              # FASTA & AGP format
  /fs/szasmg3/bos_taurus/UMD_Freeze2.99/ncbi_files    # SEQUIN format
Ftp:
  ftp://ftp.cbcb.umd.edu/pub/salzberg/Bos_taurus_2.99/ -> /fs/ftp-cbcb/pub/salzberg/Bos_taurus_2.99/
=== UMD_2.6.a_g_072109 -> UMD_Freeze3.0 ===
Changes vs UMD_2.6.a_g_070709:
* Delete 97 contaminated sequences found by NCBI (all except the primates) : http://www.ncbi.nlm.nih.gov/projects/WGS/screens/DAAA02_071709/
* Delete 441 haplotype variants found by Guillaume
* Trim 54 partial contaminants (contaminants were on the ends)
* Trim 7 terminal N's
Gaps: 75739
* all 27103 N gaps are "fragment yes"
* all 48636 U gaps are "contig no"
Files (walnut):
  /scratch1/bos_taurus/Assembly/2009_0312_CA/scf_placements/UMD_2.6.a_g_072109/                # FASTA & AGP format
 
Files (freeze):
  /fs/szasmg3/bos_taurus/UMD_Freeze3.0/              # FASTA & AGP format
  /fs/szasmg3/bos_taurus/UMD_Freeze3.0/ncbi_files    # SEQUIN format
Ftp:
  ftp://ftp.cbcb.umd.edu/pub/data/assembly/Bos_taurus/Bos_taurus_UMD_3.0/ -> pub/data/assembly/Bos_taurus/Bos_taurus_UMD_3.0/
Issues:
* 7180001836672 941bp deg on Chr4 : aligns on all its length to the cow mitochondrion; placed  based on human synteny
* mitochondrion screening was done only on contigs, not on degenerates
* Align all Chr*.fasta files to cow mitochondrion; show-coords -I 90 -L 600
  [S1]    [E1]  |    [S2]    [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS]
  ===============================================================================================================================
  3603    4930  |        6    1333  |    1328    1328  |    90.89  |    16339    1333  |    8.13    99.62  | AY676873  ctg7180001759596 [CONTAINS]
  13284    14521  |    1214        1  |    1238    1214  |    97.90  |    16339    1216  |    7.58    99.84  | AY676873  deg7180001872458 [CONTAINS]
  15731    16339  |      1      608  |      609      608  |    99.18  |    16339      941  |    3.73    64.61  | AY676873  deg7180001836672 [END]
* Delete 2 degenerates
* Summary 3.a --[[User:Dpuiu|Dpuiu]] 11:04, 5 August 2009 (EDT)
  .                                ctg+deg <2Kbp  >=2Kbp min  max      mean  med    n50    sum
  ======================================================================================================
  Chr1..29,X                        72479  20862  51617  65  1160130  36424  12941  103785  2639984487
  ChrU                              3285    2404    881    224  179692  2890  1338  5425    9496583
  Chr                              75764  23266  52498  65  1160130  34970  11207  96955  2649481070
 
  contigs.haplotype-variants        40611  36984  3627  263  97877    1476  1205  1372    59958728
  deg.unplaced.less_2K              224933  224933  0      65  1996    972    983    990    218837572
 
  ChrY-contigs                      314    266    48    224  26490    2210  973    6539    694140
  ChrY-contigs.SHOTGUN_ONLY        144    140    4      804  4224    993    882    888    143047
  ======================================================================================================
=== UMD_Freeze3.1 ===
--[[User:Dpuiu|Dpuiu]] 11:42, 19 November 2009 (EST)
* Only changed some of the gap specifications
* Original(CA):
  ctg: 90135
  deg: 251413
  scf: 39978
* UMD3.1 AGP:
  ctg:            60499
  deg:            15229
  scf(CA):        11458
  ctg(unoriented)    2118
  scf(placed chr)    3193
  scf(unplaced chr)  3285
Files:
/fs/szasmg3/bos_taurus/UMD_Freeze3.1
== ToDo ==
* Align all UMD2.0 ChrU ctg/deg 10Kbp+ to our assembly; make sure everything aligns (Steven's suggestion)
  /scratch1/bos_taurus/Assembly/2009_0312_CA/scf_placements/UMD_2.6.a_g_071709/nucmer_UMD2.0
** all ctgs: 113K, 244Mbp
** 10Kbp+ ctgs: 2561, 54Mbp
** 53,951,168  out of 54,971,011 bp covered (98% of the sequence)
** Most ctgs were added to ChrU
  join2.pl -i 4 UMD2.0.ChrU.10K.maxCvg.pair ../other/Chr.posmap | awk '{print $10,$2}' | ~/bin/sum2.pl | sort -nk3 -r
  Chr30  252    7,511,975
  Chr1    171    3263575
  Chr6    138    3043388
  Chr12  152    2676615
  ChrU    100    2600562
  ...
  Chr28  29      473523
  all    2651    54,960,233
1. Remove remain_haps
  .                                elem      <2000      >2000      min    max        mean      med        n50        sum
  remain_haps                      436        282        154        471    37860      2600      1693      2819      1,133,874
  remain_haps(Chr1..30,U)          408        264        144        471    37860      2636      1692      2883      1,075,616 # 77 from ChrU
  remain_haps(Chr1..30)            335        226        109        471    37860      2480      1661      2728      831,068
2. Remove/trim contaminants
3. Remove/trim N's

Latest revision as of 18:11, 8 April 2011

BCM

NCBI Data

  • Avg LEN=984
  • Avg CLIP (CLB intersect CLV)=760
  • Avg CLV=997 > Avg LEN ???
  • Avg QUAL=38.96 (27.51 for the 2.59M reads not in the UMD assembly)
  • Avg UMDoverlapper CLIP=778

Problems:

  • 0 QUAL reads 650,133 ( 551,114 are BCM.WGS )
  • the quality lines in several qual. files start with space; need to remove it otherwise tarchive2ca errors out saying that the len(quality)=len(seq)+1
  • several xml contained the "&" character => XML parser error
  • xml.bos_taurus.087 contained 2 trace_volumes => XML parser error
  • BCCAGSC.CLONEEND : all reads have LIBRARY_ID=CH240, SEQ_LIB_ID=. ; the INSERT_SIZE & INSERT_STDEV vary within the library: set to 150,000 & 30,000
  • UIUC.CLONEEND: INSERT_SIZE & INSERT_STDEV missing: set to 150,000 & 30,000

CENTER_NAME counts

    COUNT           CENTER_NAME     
 1  35629020        BCM             Baylor College of Medicine
 2  737900          NISC            NIH Intramural Sequencing Center
 3  652614          BCCAGSC         British Columbia Cancer Agency Genome Sciences Centre                           # TA query_tracedb CENTER_NAME = "BCCAGSC" => 652,510 
 4  378871          MARC            USDA, ARS, US Meat Animal Research Center
 5  114753          UIUC            University of Illinois at Urbana-Champaign                                      # TA query_tracedb CENTER_NAME = "UIUC" => 106,368
 6  107367          BARC            USDA, ARS, Beltsville Agricultural Research Center
 7  65171           TIGR            The Institute for Genome Research
 8  53556           GSC             Genoscope
 9  43033           CENARGEN        Embrapa Genetic Resources and Biotechnology
 10 18623           SC              The Sanger Center
 11 15301           UOKNOR          University of Oklahoma Norman Campus, Advanced Center for Genome Technology
 12 10651           TIGR_JCVIJTC    The Institute for Genomic Research, Traces generated at JCVIJTC                 # TA query_tracedb CENTER_NAME="JCVI"
 13 2485            UIACBCB         University of Iowa Center for Bioinformatics and Computation Biology (UIACBCB)
 14 49              WUGSC           Washington University, Genome Sequencing Center                                 # TA query_tracedb CENTER_NAME = "WUGSC" => 9
    37829394        total           total                                                                           # TA query_tracedb SPECIES_CODE = "BOS TAURUS" => 37,788,710

TRACE_TYPE_CODE counts

    COUNT         CENTER_NAME     TRACE_TYPE_CODE        
 1  24863599      BCM*            WGS                    SEQ_LIB_ID:89
 2  10748529      BCM*            SHOTGUN                SEQ_LIB_ID:15543
 3  737900        NISC            SHOTGUN                SEQ_LIB_ID:247
 4  125597        BCCAGSC         CLONEEND               LIBRARY_ID:1         large insert size; some qualityless; !!! almost all have CLIP3=0
 5  114753        UIUC            CLONEEND               LIBRARY_ID:2         insert size missing , no frequent kmers
 6  65171         TIGR            CLONEEND               SEQ_LIB_ID:1         2K & use TRACE_DIRECTION instead of TRACE_END
 7  53556         GSC             CLONEEND               SEQ_LIB_ID:1         large insert size; !!! all have qual=0 and were excluded 
 8  26246         CENARGEN        WGS                    .                    no LIBRARY_ID; no SEQ_LIB_ID; no INSERT_SIZE; no INSERT_STDEV; reads have no direction; ~21954 could be paired (same TEMPLATE_ID)
 9  25454         BARC            CLONEEND               SEQ_LIB_ID:14304     !!! all have CLIP3=0
 10 16892         BCM*            CLONEEND               LIBRARY_ID:1         VBBAA   mea=167000  std=25000
 11 16787         CENARGEN        CLONEEND               LIBRARY_ID:1         
 12 15150         UOKNOR          SHOTGUN                LIBRARY_ID:1         some qualityless
 13 10651         TIGR_JCVIJTC    CLONEEND               SEQ_LIB_ID:2
 14 151           UOKNOR          FINISHING              LIBRARY_ID:1         some qualityless, no direction(TRACE_END=N); no INSERT_SIZE; no INSERT_STDEV
 15 49            WUGSC           CLONEEND               SEQ_LIB_ID:1 
    36820485      total

 16 527017        BCCAGSC         EST
 17 207204        MARC            EST
 18 171667        MARC            PCR
 19 81913         BARC            EST
 20 18623         SC              EST 
 21 2485          UIACBCB         EST
    1008909       total

STRATEGY & TRACE_TYPE_CODE counts

 COUNT           CENTER_NAME     STRATEGY        TRACE_TYPE_CODE
 12545304        BCM             .               WGS
 11425910        BCM             WGA             WGS
 5223683         BCM             CLONE           SHOTGUN
 4479883         BCM             POOLCLONE       SHOTGUN
 1044963         BCM             .               SHOTGUN
 892385          BCM             SNP             WGS
 737900          NISC            CLONE           SHOTGUN
 125597          BCCAGSC         CLONEEND        CLONEEND
 114753          UIUC            CLONEEND        CLONEEND 
 65171           TIGR            CLONEEND        CLONEEND
 53556           GSC             CLONEEND        CLONEEND
 26246           CENARGEN        .               WGS
 25454           BARC            .               CLONEEND
 16892           BCM             CLONEEND        CLONEEND
 16787           CENARGEN        CLONEEND        CLONEEND
 12195           UOKNOR          .               SHOTGUN
 10651           TIGR_JCVIJTC    CLONEEND        CLONEEND
 2955            UOKNOR          CLONE           SHOTGUN
 151             UOKNOR          .               FINISHING
 49              WUGSC           CLONEEND        CLONEEND
 527017          BCCAGSC         EST             EST
 145820          MARC            EST             EST
 117958          MARC            COMPARATIVE     PCR
 81913           BARC            EST             EST
 61384           MARC            CLONE           EST
 53709           MARC            Re-Sequencing   PCR
 18623           SC              EST             EST
 2485            UIACBCB         .               EST

BCM.SHOTGUN libraries

  • The long inserts are probably wrong !!!
 SIZE    STDEV   COUNT
 3500    1500    4502569
 2000    1000    3244493
 3000    1000    1021577
 180000  1000    840528
 6500    1500    429026
 180000  13000   320208
 6000    2000    208192
 167000  13000   96337
 3500    15000   85599
 SIZE    COUNT
 3500    4588168
 2000    3244493
 180000  1160736
 3000    1021577
 6500    429026
 6000    208192
 167000  96337

3' VECTOR TRIMMED counts

    CENTER_NAME     TRACE_TYPE_CODE TOTAL           3'CLV<LEN   QUAL==0          UMD.FRG
 1  BCM             WGS             24863599        10968979    551114           24050767
 2  BCM             SHOTGUN         10748529        5052692     23419            10068499
 3  NISC            SHOTGUN         737900          28972       0                735488
 4  BCCAGSC         CLONEEND        125597          125484      8926             113790
 5  UIUC            CLONEEND        114753          90243       0                106247
 6  TIGR            CLONEEND        65171           46389       0                64903
 7  GSC             CLONEEND        53556           53556       53556 (all)      0           !!! all have 0 quals and were excluded
 8  CENARGEN        WGS             26246           26246       0                25976
 9  BARC            CLONEEND        25454           25454       0                25387
 10 BCM             CLONEEND        16892           6751        0                16863
 11 CENARGEN        CLONEEND        16787           16787       0                16628
 12 UOKNOR          SHOTGUN         15150           2885        12195            0
 13 TIGR_JCVIJTC    CLONEEND        10651           339         0                10644
 14 UOKNOR          FINISHING       151             0           151              151
 15 WUGSC           CLONEEND        49              0           0                0

 16 BCCAGSC         EST             527017          524173      772              0
 17 MARC            EST             207204          207204      0                0
 18 MARC            PCR             171667          171667      0                0
 19 BARC            EST             81913           78597       0                0
 20 SC              EST             18623           7350        0                0
 21 UIACBCB         EST             2485            2485        0                0

ZERO QUALITY COUNTS

  • Counts
 CENTER_NAME     TRACE_TYPE_CODE  COUNT
 BCM             WGS              551114
 GSC             CLONEEND         53556
 BCM             SHOTGUN          23419
 UOKNOR          SHOTGUN          12195
 BCCAGSC         CLONEEND         8926
 BCCAGSC         EST              772
 UOKNOR          FINISHING        151
 TOTAL                            650134 
  • For 0 quality reads, assign quality 20 to bases 1..700, 0 to bases 701..
  • Volumes 026..039 have been fixed

Local Data

Files & Dirs

 /fs/szasmg3/bos_taurus/data/
 /fs/szasmg2/Drosophila/D_pseudoobscura/Vectors
 /nfshomes/dpuiu/db/UniVec

Software

Figaro

  • trims vector only at 5' end
  • call lucy trimming for qualities

Lucy

  • both vector sequence and splice sites are required

Atlas

  • web site
  • atlas-screen-trim-file : "calls cross_match and atlas-screen-window to create trimmed reads file (scan in from each end of read looking for 50-base windows of high quality and no vector); "

Contaminant search

nucmer reads CLIPPING range to UniVec & EcoliK12

UniVec

Ref

                 #seqs   min     max     mean    median  n50     sum
 UniVec          2861    12      48551   231     99      781     660,151
 UniVec_Core     1348    12      48551   243     98      967     327,641

Hits: alignment length

 bp      #reads  min     max     mean    median  n50     sum
 19      4548466 19      1045    28.37   23      27      129025025
 20      3684852 20      1045    30.56   25      28      112616359
 30      1097357 30      1045    48.04   38      43      52714583
 40      484661  40      1045    66.36   47      53      32163896
 100     54334   100     1045    198     116     223     10772815        # many are ESTs

Ecoli

Ref:

 K12 4,639,675 bp

Hits: alignment length

 bp      #reads  min     max     mean    median  n50     sum
 19      275109  19      1223    30.66   19      20      8435470
 20      102550  20      1223    50.29   21      161     5156849
 30      19032   30      1223    178     37      706     3381214
 40      9234    40      1223    329     171     738     3034293
 100     6781    100     1223    424     223     749     2876432
 200     4378    200     1223    575     696     771     2516916       

BCM vectors

                 #seqs   min     max     mean    median  n50     sum
 BCM             14      2580    33180   9379    5821    32705   131312

Vector/Splice site search

Strategy

  • 1. Select all the reads in the same volume that belong to one particular library; same CENTER_NAME, STRATEGY & TRACE_TYPE_CODE
  • 2. Get the quality clipping trim: CLIP_QUALITY_LEFT & CLIP_QUALITY_RIGHT
  • 3. Separate reads in 2 sets according to direction TRACE_END: FORWARD & REVERSE
  • 4. Get the most frequent kmers in each set (24 & 8 bp)
  • 5. Check if the most frequent kmers are overrepresented
  • 6. Check if the most frequent 8mers are present in the most frequent 24mers
  • 7. Try to extend the 24mers by a few bp => linkers
  • 8. Align linkers to the opposite stand sequences using nucmer
  • 9. Extract the subsequences adjacent(following) to linker (50..150bp)
  • 10. Align the subsequences; if they align we've probably identified the vector
  • 11. Identify the vector name/id by alignment to UniVec => several alignments
  • 12. Check if the forward/reverse vector(s) are the same : we should find a common vector sequence; the UniVec alignments should be adjacent
  • 13. create the Lucy vector & splice files; the splice contains the linker+vector
  • 14. run lucy & trim input reads according to Lucy clr
  • 15. align lucy trimmed reads to linker,vector,splice & UniVec.dust
  • 16. align input reads to linker,vector,splice & UniVec.dust
  • 17. compare the 15. & 16. counts

Example

  • 1. volume 011 : 500,000 reads CENTER_NAME=BCM, TRACE_TYPE_CODE=WGS
  • 2.
  • 3. 249,611 TRACE_END=F & 250,389 TRACE_END=R
  • 4. kmers: 8 8bp most frequent kmers are shared by the FORWARD & REVERSE strands ; no 24bp kmers are shared
 ==> 24.fwd/kmers.tab <==
 AGTTCGACTGCAAGTAGTTCATCA      TGATGAACTACTTGCAGTCGAACT        2463 # contains AGTAGTTC
 GAGTTCGACTGCAAGTAGTTCATC      GATGAACTACTTGCAGTCGAACTC        2189
 CGAGTTCGACTGCAAGTAGTTCAT      ATGAACTACTTGCAGTCGAACTCG        1996
 TCGAGTTCGACTGCAAGTAGTTCA      TGAACTACTTGCAGTCGAACTCGA        1593
 GTTCGACTGCAAGTAGTTCATCAA      TTGATGAACTACTTGCAGTCGAAC        1023
 GAGTTCGACTGCAGTAGTTCATCA      TGATGAACTACTGCAGTCGAACTC        812
 CGAGTTCGACTGCAGTAGTTCATC      GATGAACTACTGCAGTCGAACTCG        777
 GTTCGACTGCAAGTAGTTCATCAT      ATGATGAACTACTTGCAGTCGAAC        769
 TCGAGTTCGACTGCAGTAGTTCAT      ATGAACTACTGCAGTCGAACTCGA        637
 ATCGAGTTCGACTGCAAGTAGTTC      GAACTACTTGCAGTCGAACTCGAT        594
 
 ==> 08.fwd/kmers.tab <==
 AGTAGTTC      GAACTACT        86477
 CAGTAGTT      AACTACTG        67681
 AGTTCTCA      TGAGAACT        61556
 TAGTTCTC      GAGAACTA        60964
 GTAGTTCT      AGAACTAC        57866
 AGTTCATC      GATGAACT        49676
 TAGTTCAT      ATGAACTA        45298
 GTTCATCA      TGATGAAC        42117
 GCAGTAGT      ACTACTGC        41391
 GTAGTTCA      TGAACTAC        40694
 
 ==> 24.rev/kmers.tab <==
 TATCGATGGTACAGTAGTTCATCA      TGATGAACTACTGTACCATCGATA        999 # contains AGTAGTTC
 CTATCGATGGTACAGTAGTTCATC      GATGAACTACTGTACCATCGATAG        774
 GCTATCGATGGTACAGTAGTTCAT      ATGAACTACTGTACCATCGATAGC        600
 CGCTATCGATGGTACAGTAGTTCA      TGAACTACTGTACCATCGATAGCG        432
 ATCGATGGTACAGTAGTTCATCAT      ATGATGAACTACTGTACCATCGAT        417
 ATCGATGGTACAGTAGTTCATCAA      TTGATGAACTACTGTACCATCGAT        380
 ATCAGATGGTACAGTAGTTCATCA      TGATGAACTACTGTACCATCTGAT        373
 ATCGATGGTACAGTAGTTCATCAC      GTGATGAACTACTGTACCATCGAT        265
 CTATCGATGGTAAGTAGTTCATCA      TGATGAACTACTTACCATCGATAG        235
 TCAGATGGTACAGTAGTTCATCAA      TTGATGAACTACTGTACCATCTGA        224
 
 ==> 08.rev/kmers.tab <==
 AGTTCATC      GATGAACT        85127
 TAGTTCAT      ATGAACTA        77902
 GTTCATCA      TGATGAAC        75585
 TAGTTCTC      GAGAACTA        68057
 AGTTCTCA      TGAGAACT        67277
 GTAGTTCT      AGAACTAC        64894
 GTAGTTCA      TGAACTAC        62607
 CGTAGTTC      GAACTACG        52031
 AGTAGTTC      GAACTACT        51013
 ACGTAGTT      AACTACGT        31552
  • 7. Get linker sequences
 >linker.fwd 27bp
 TCGAGTTCGACTGCAAGTAGTTCATCA
 >linker.rev 27bp
 CTAATCAGATGGTACAGTAGTTCATCA 
  
 #>linker.rev 40 bp Art's  (13 more bp at 5')        
 #TATGACCATGCGCCTAATCAGATGGTACAGTAGTTCATCA
 #GCTATCGATGGTACAGTAGTTCATCAT is the most frequent rev seq 27 kmers but not the linker (few snp differences)
  • 8 & 9 Align reads to linkers using nucmer

Fwd:

 nucmer -l 12 -c 24 -r linker.fwd.seq ../bos_taurus.$v.r.fasta 
 #  nucmer -l 12 -c 24 -r kmers.seq ../bos_taurus.$v.r.fasta  
 show-coords out.delta | awk '{print $19,$5,$13}' > ! out.clr
 extractfromfastanames.pl -clr -f out.clr < ../bos_taurus.$v.r.fasta >! out.seq
 

Rev:

 nucmer -l 12 -c 24 -r linker.rev.seq ../bos_taurus.$v.f.fasta
 #  nucmer -l 12 -c 24 -r kmers.seq ../bos_taurus.$v.f.fasta  
 show-coords out.delta | awk '{print $19,$5,$13}' > ! out.clr
 extractfromfastanames.pl -clr -f out.clr < ../bos_taurus.$v.f.fasta >! out.seq
 

Both:

 clrFasta out.seq >! out.cseq
 fasta2tab.pl out.cseq | sort -k2 > ! out.tab
 nucmer -c 40 out.cseq ~/db/UniVec -p vector
 delta-filter -q vector.delta >! vector.filter-q.delta
 show-coords vector.filter-q.delta | sort -n | head
 cat vector.filter-q.delta | grep "^>" | count.pl -c 1 -m 2
 
  • 10. Extract "vector reads"
 >399553028  # 24.fwd     
 TGATGAACTACTGTACCATCTGATTAGGCGCATGGTCATAGCTGTTTCCTGTGTGAAATT
 GCTATCCGCTCACAATTCCACACAACATACGAGCCGGAAGCATAAAGTGTAAAGCCTGGG
 GTGTCAAATGAGAGACCTAACTCACATTCAACTTTTTTTTTTTTTCTGCCCTCTATTCTA
 ...
 >400269118 #24.rev
 TGATGAACTACTTGCAGTCGAAATCGAATCATCACTGGCCGTCCTTTTACAACGTCGTGA
 CTGGGAAAACCCTGGCGTTACCCAACTTAATCCGCCTTGCAGCACATCCCCCTTTCCCCC
 AGCTGGCGTAAAAACGTAAAAAGCCCCGCACCGATCGCCCTTTCCCAACAGGTTGCCCAG
  • 11. Align "vector reads" to UniVec; identify vector
 show-coords 24.fwd/400269118-UniVec.delta 24.rev/399553028-UniVec.delta | grep J01636.1
     31  148  | 1175 1292  | 118   118  |  95.76  |     1276     7477  |     9.25     1.58  | 399553028.rev gnl|uv|J01636.1:1-7477
     32  199  | 1302 1463  | 168   162  |  90.48  |      653     7477  |    25.73     2.17  | 400269118     gnl|uv|J01636.1:1-7477
  • 12. 10bp distance between the 2 alignments
  • 13. Lucy files
 $ more vector.seq
   >J01636 E.coli lactose operon with lacI, lacZ, lacY and lacA genes
   GACACCATCGAATGGCGCAAAACCTTTCGCGGTATGGCATGATAGCGCCCGGAAGAGAGTCAATTCAGGG
   TGGTGAATGTGAAACCAGTAACGTTATACGATGTCGCAGAGTATGCCGGTGTCTCTTATCAGACCGTTTC
   CCGCGTGGTGAACCAGGCCAGCCACGTTTCTGCGAAAACGCGGGAAAAAGTGGAAGCGGCGATGGCGGAG
   CTGAATTACATTCCCAACCGCGTGGCACAACAACTGGCGGGCAAACAGTCGTTGCTGATTGGCGTTGCCA
   ...
 
 $ more splice.seq
   >J01636.for.begin vector+linker.rev
   TGAATGTGAGTTAGGTCTCTCATTTGACACCCCAGGCTTTACACTTTATGCTTCCGGCTC
   GTATGTTGTGTGGAATTGTGAGCGGATAGCAATTTCACACAGGAAACAGCTATGACCATG
   CGCCTAATCAGATGGTACAGTAGTTCATCA
   >J01636.for.end  rev(linker.fwd)+vector 
   TGATGAACTACTTGCAGTCGAAATCGAATCATCACTGGCCGTCCTTTTACAACGTCGTGA
   CTGGGAAAACCCTGGCGTTACCCAACTTAATCCGCCTTGCAGCACATCCCCCTTTCCCCC
   AGCTGGCGTAAAAACGTAAAAAGCCCCGCA
   >J01636.rev.begin (revcomp of J01636.for.end)
   TGCGGGGCTTTTTACGTTTTTACGCCAGCTGGGGGAAAGGGGGATGTGCTGCAAGGCGGA
   TTAAGTTGGGTAACGCCAGGGTTTTCCCAGTCACGACGTTGTAAAAGGACGGCCAGTGAT
   GATTCGATTTCGACTGCAAGTAGTTCATCA
   >J01636.rev.end (revcomp of J01636.for.begin)
   TGATGAACTACTGTACCATCTGATTAGGCGCATGGTCATAGCTGTTTCCTGTGTGAAATT
   GCTATCCGCTCACAATTCCACACAACATACGAGCCGGAAGCATAAAGTGTAAAGCCTGGG
   GTGTCAAATGAGAGACCTAACTCACATTCA
 # splice=linker+vector  
      3      120  |     1175     1292  |      118      118  |    95.76  |      150     7477  |    78.67     1.58  | J01636.for.begin   J01636
     32      131  |     1302     1399  |      100       98  |    96.00  |      150     7477  |    66.67     1.31  | J01636.for.end     J01636
  • 13.1 Align vector & splice to Ecoli
      1     7474  |   366812   359335  |     7474     7478  |    99.91  |     7477  4639675  |    99.96     0.16  | J01636             NC_000913.2    [CONTAINED]
     20      119  |       65      162  |      100       98  |    96.00  |      150      395  |    66.67    24.81  | J01636.rev.begin   NC_000913.2
     31      148  |      172      289  |      118      118  |    95.76  |      150      395  |    78.67    29.87  | J01636.rev.end     NC_000913.2
   1069     1463  |      395        1  |      395      395  |   100.00  |     7477      395  |     5.28   100.00  | J01636             NC_000913.2.365350-365744
  • 14. Run lucy & trim reads
 $ /nfshomes/dpuiu/szdevel/SourceForge/lucy-1.19p/lucy \ 
     -v vector.seq splice.seq
     -o bos_taurus.lucy.seq bos_taurus.lucy.qual \
     -debug  bos_taurus.lucy.info \
     bos_taurus.seq bos_taurus.qual
 # Trim clr
 $ clrFasta bos_taurus.seq > bos_taurus.cseq
  • 15. Align lucy output to linker, vector, splice & UniVec.dust
 $ nucmer -l 12 -c 24 ~/db/vector.seq  bos_taurus.lucy.cseq -p vector-bos_taurus.lucy
 $ nucmer -l 16 -c 30 ~/db/vector.seq  bos_taurus.lucy.cseq -p vector-bos_taurus.lucy
 $ nucmer -l 16 -c 30 ~/db/splice.seq  bos_taurus.lucy.cseq -p splice-bos_taurus.lucy
 $ nucmer -l 16 -c 30 ~/db/UniVec.dust bos_taurus.lucy.cseq -p UniVec.dust-bos_taurus.lucy
  • 16. Align input to linker, vector, splice & UniVec.dust
 $ nucmer -l 12 -c 24 ~/db/linker.seq bos_taurus.seq -p linker-bos_taurus
 $ nucmer -l 16 -c 30 ~/db/vector.seq bos_taurus.seq -p vector-bos_taurus
 $ nucmer -l 16 -c 30 ~/db/splice.seq bos_taurus.seq -p splice-bos_taurus
 $ nucmer -l 16 -c 30 ~/db/UniVec.dust bos_taurus.seq -p UniVec.dust-bos_taurus

Count how many reads got trimmed

 infoseq *seq | getSummary.pl -c 1 -t original.LEN
 
 cat bos_taurus.lucy.info | awk '{print $4-$3}' | getSummary.pl -t lucy.CLR >! bos_taurus.lucy.summary  
 cat bos_taurus.lucy.info | getSummary.pl -c 14 -t lucy.CLV5 -nh >> bos_taurus.lucy.summary
 cat bos_taurus.lucy.info | getSummary.pl -c 15 -t lucy.CLV3 -nh >> bos_taurus.lucy.summary

Libraries

011.BCM.WGS FORWARD

  • vector: J01636
  • UniVec: gnl|uv|J01636.1:1-7477 E.coli lactose operon with lacI, lacZ, lacY and lacA genes
 ll ~dpuiu/db/J01636*
 -rw-rw-r--  1 dpuiu dpuiu 7651 Jan  9 15:56 /nfshomes/dpuiu/db/J01636
 -rw-rw-r--  1 dpuiu dpuiu  105 Jan 14 07:17 /nfshomes/dpuiu/db/J01636linker
 -rw-rw-r--  1 dpuiu dpuiu  840 Jan 13 13:43 /nfshomes/dpuiu/db/J01636splice
 cat  ~dpuiu/db/J01636* | infoseq
 J01636            7477   53.43
 J01636.linker.fwd 27     44.44
 J01636.linker.rev 27     37.04
 J01636.for.begin  150    44.67
 J01636.for.end    150    51.33
 J01636.rev.begin  150    51.33
 J01636.rev.end    150    44.67
  • 249,611 reads:
  • 91% got vector trimmed at the 5'
  • 0.4% (1149) got vector trimmed at the 3'
                 #elem   #0s     min     max     mean    median  n50     sum
 original.LEN    249611  0       437     2349    1082    991     1009    270035781     
 lucy.CLV5       249611  21215   0       741     25.03   25      27      6247415
 lucy.CLV3       249611  248462  0       1047    3.49    0       859     870344
  • Original reads hit counts:
10975 linker.fwd
133   linker.rev
166   splice
152   vector
228   UniVec.dust
  • Lucy trimmed read counts
2 linker.fwd
0 linker.rev
1 splice
1 vector
6 UniVec.dust (only 3 are >40bp)

011.BCM.WGS REVERSE

                 #elem   #0s     min     max     mean    median  n50     sum
 original.LEN    250389  0       502     2148    1085    993     1012    271691094
 lucy.CLR        250389  7345    0       1281    795     876     892     198982171
 lucy.CLV5       250389  20271   0       668     26.52   27      29      6641362
 lucy.CLV3       250389  249269  0       997     3.35    0       861     839029
  • Original reads hit counts:
 linker.fwd      113
 linker.rev      3812
 splice          143
 UniVec.dust     237
 vector          4318
  • Lucy trimmed reads hit counts:
 linker.fwd      1
 linker.rev      0
 splice          1
 UniVec.dust     10
 vector          1

030.BCM.SHOTGUN

  • same linker/vector/splice as BCM.WGS
  • 2.5% (4K out of 160K) reads contain linker & vector at 3'
                 #elem   #0s     min     max     mean    median  n50     sum
 original.LEN    8411    0       325     1685    1181    1240    1314    9933150
 lucy.CLR        8411    8       0       1054    841     863     874     7070994
 lucy.CLV5       8411    568     0       232     27.01   28      29      227206
 lucy.CLV3       8411    2325    0       1040    597     794     851     5023445
  • Original reads hit counts:
 linker.fwd      4314
 linker.rev      4125
 splice          7816
 UniVec.dust     4212
 vector          6750
 vector          27235
  • Lucy trimmed reads hit counts:
 linker.fwd      3
 linker.rev      1
 splice          1
 UniVec.dust     13
 vector          0

001.NISC.SHOTGUN

  • Vector: pOTW13
  • UniVec: 3 partial seqs
 gnl|uv|NGB00080.1:1-198 pOTW13 with linkers
 gnl|uv|NGB00080.1:718-888 pOTW13 with linkers
 gnl|uv|NGB00080.1:1490-1654-49 pOTW13 with linkers
 ll /nfshomes/dpuiu/db/NGB00080*
 -rw-rw-r--  1 dpuiu dpuiu 1083 Jan 14 20:43 /nfshomes/dpuiu/db/NGB00080
 -rw-r--r--  1 dpuiu dpuiu   94 Jan 14 21:01 /nfshomes/dpuiu/db/NGB00080linker
 -rw-r--r--  1 dpuiu dpuiu 2183 Jan 14 20:44 /nfshomes/dpuiu/db/NGB00080splice
 cat  /nfshomes/dpuiu/db/NGB00080* | infoseq
 NGB00080       1054   50.00
 NGB00080.linker.fwd 24     45.83
 NGB00080.linker.rev 26     53.85
 NGB00080.for.beg 518    46.14
 NGB00080.for.end 518    50.48
 NGB00080.rev.begin 518    50.48
 NGB00080.rev.beg 518    46.14
  • 944 read sample
                 #elem   #0s     min     max     mean    median  n50     sum
 original.LEN    944     0       652     1017    735     721     722     693668
 lucy.CLR        944     39      0       886     415     422     522     391333
 lucy.CLV5       944     121     0       275     34.05   33      35      32143
 lucy.CLV3       944     18      0       885     410     409     511     387007
  • Original reads hit counts:
 linker.fwd      479
 linker.rev      492
 splice          910
 UniVec.dust     0
 vector          939
  • Lucy trimmed reads hit counts:
 linker.fwd      1
 linker.rev      0
 splice          0
 UniVec.dust     9
 vector          1

060.BCCAGSC.CLONEEND

  • Linkers:
 linker.fwd CCCTGCTTTGTCTGGAAGGGGTTCCCGACCT
 linker.rev CAGGAGGGGAGAAAGGGCTCAGAGG
  • No common vector !!!
 wc -l *clb
   60746 bos_taurus.060.f.clb  #18 reads original align to UniVec (nucmer default params)
   60836 bos_taurus.060.r.clb
 
 Fwd:
    329      428  |      440      535  |      100       96  |    91.00  |      503     1585  |    19.88     6.06  | 723951410  gnl|uv|U30497.1:3230-4814 Cloning vector pAS2-1
    330      370  |       89       49  |       41       41  |   100.00  |      503      143  |     8.15    28.67  | 723951410  gnl|uv|U67875.1:6541-6683 pESP-I yeast expression vector
    330      370  |       94       54  |       41       41  |   100.00  |      503      143  |     8.15    28.67  | 723951410  gnl|uv|U67875.1:6541-6683 pESP-I yeast expression vector
 
  Rev:
      1       96  |       71      165  |       96       95  |    93.81  |      203      165  |    47.29    57.58  | 724018013  gnl|uv|AF133437.1:16659-16823 Cloning vector pCYPAC6
     50      143  |        1       94  |       94       94  |    92.71  |      203       94  |    46.31   100.00  | 724018013  gnl|uv|U80929.2:2858-2951     Cloning vector pBACe3.6

017.UIUC.CLONEEND

  • No overrepresented kmers
 wc -l *clb
  17978 bos_taurus.017.f.clb
  17911 bos_taurus.017.r.clb
 
 ==> 24.fwd/kmers.tab <==
 CCCTGCTTTGTCTGGAAGGGGTTC        GAACCCCTTCCAGACAAAGCAGGG        9
 CTGCTTTGTCTGGAAGGGGTTCCC        GGGAACCCCTTCCAGACAAAGCAG        9
 
 ==> 24.rev/kmers.tab <==
 GAATGTTGAGCTTTAGCCAACTTT        AAAGTTGGCTAAAGCTCAACATTC        4
 TCTGAATGTTGAGCTTTAGCCAAC        GTTGGCTAAAGCTCAACATTCAGA        4
 
 ==> 8.fwd/kmers.tab <==
 TTTTTTTT        AAAAAAAA        55
 AAGGGGTT        AACCCCTT        35
 
 ==> 8.rev/kmers.tab <==
 GTCTGGAA        TTCCAGAC        41
 TCTGGAAG        CTTCCAGA        39
  • No UniVec hits

010.TIGR.CLONEEND

  • No overrepresented kmers
 wc -l *clb
 5479 bos_taurus.032.f.clb
 5174 bos_taurus.032.r.clb
 
 ==> 24.fwd/kmers.tab <==
 CTTGTGTTGGCCCAGGCAAGTCCA        TGGACTTGCCTGGGCCAACACAAG        30
 TTGTGTTGGCCCAGGCAAGTCCAA        TTGGACTTGCCTGGGCCAACACAA        30
 
 ==> 24.rev/kmers.tab <==
 CTGCCTCTTGTGTTGGCCCAGGCA        TGCCTGGGCCAACACAAGAGGCAG        16
 GCTGCCTCTTGTGTTGGCCCAGGC        GCCTGGGCCAACACAAGAGGCAGC        15
 
 ==> 8.fwd/kmers.tab <==
 GAGTGGGT        ACCCACTC        176
 GGAGTGGG        CCCACTCC        171
 
 ==> 8.rev/kmers.tab <==
 TGGAGTGG        CCACTCCA        182
 GGAGTGGG        CCCACTCC        181
 
  • No UniVec hits

...

070.BCM.CLONEEND

  • No frequent kmers
 wc -l *clb
   6027 bos_taurus.070.f.clb
   6236 bos_taurus.070.r.clb
 
 ==> 24.fwd/kmers.tab <==
 GGACTCTCAGAGTCTTCTCCAACA        TGTTGGAGAAGACTCTGAGAGTCC        18
 ACTGGTTGGATCTCCTTGCAGTCC        GGACTGCAAGGAGATCCAACCAGT        18

 ==> 24.rev/kmers.tab <==
 ATAAAATCTGAGCCACCAGGGAAG        CTTCCCTGGTGGCTCAGATTTTAT        1
 CTATTGGTTCATATGGTCAACGTC        GACGTTGACCATATGAACCAATAG        1
 
 ==> 8.fwd/kmers.tab <==
 TTTTTTTT        AAAAAAAA        86
 CTTCTCCA        TGGAGAAG        75
 
 ==> 8.rev/kmers.tab <==
 TATAGTGT        ACACTATA        9
 ATATAGGG        CCCTATAT        8
  • No alignments to BCM WGS vector

Running Lucy

  • Default parameters with vector trimming
  • BCM vector/splice
 /nfshomes/dpuiu/db/vector.BCM.seq
 /nfshomes/dpuiu/db/splice.BCM.seq
  • NISC vector/splice
 /nfshomes/dpuiu/db/vector.NISC.seq
 /nfshomes/dpuiu/db/splice.NISC.seq

BCM.WGS (all reads)

  • orig.CLR < lucy.CLR ( 765 < 792 )
  • orig.CLV > lucy.CLV ( 1015 > 973 )
  • 739,529 out of 24,863,599 reads (3%) deleted by Lucy (CLR=-1,-1)
  • 21,728,592 out of 24,863,599 reads (87%) vector trimmed at the 5' end
  • 92,646 out of 24,863,599 reads (0.3%) vector trimmed at the 3' end
                           elem       <0         0          >0         min        max        mean       median     n50        sum 
 orig.LEN                  24863599   0          0          24863599   5          3097       1002       997        1015       24915462033
 
 orig.CLR                  24863599   463669     7          24399923   -1143      1833       765        836        864        19036744256
 orig.CLR5                 24863599   0          359245     24504354   0          2103       42         22         58         1047922451
 orig.CLR3                 24863599   463404     0          24400195   -1         2169       807        872        895        20084666707
 
 lucy.CLR                  24863599   0          739529     24124070   0          1219       792        878        904        19695000417
 lucy.CLR5                 24863599   739529     36108      24087962   -1         1753       43         29         42         1086413880
 lucy.CLR3                 24863599   739529     0          24124070   -1         1894       835        915        939        20781414297
 
 orig.CLR5-lucy.CLR5       24863599   16299521   215345     8348733    -1186      2104       -1         -10        -1186      -38491429
 orig.CLR3-lucy.CLR3       24863599   14858542   1494794    8510263    -1273      2170       -28        -20        -1273      -696747590
 
 orig.CLV                  24863599   1053       1920       24860626   -2         5345       1015       1002       1017       25260581538
 orig.CLV5                 8841849    0          0          8841849    1          1219       33         46         49         295011460
 orig.CLV3                 24861698   1053       0          24860645   -1         5346       1027       1005       1019       25555592998
 
 lucy.CLV                  24863599   10694      707        24852198   -469       3096       973        968        987        24195085877
 lucy.CLV5                 24863599   0          3135007    21728592   0          1359       25         27         29         623457486
 lucy.CLV3                 24863599   0          0          24863599   4          3096       998        995        1014       24818543363
 lucy.CLVABS5              24863599   0          3135007    21728592   0          1359       25         27         29         623457486
 lucy.CLVABS3              24863599   0          24770953   92646      0          1343       2          0          880        72055071
 
 orig.CLV5-lucy.CLV5       24863599   17216820   1512453    6134326    -1312      1219       -13        -25        -1312      -328446026
 orig.CLV3-lucy.CLV3       24863599   1519132    18579609   4764858    -1832      4672       29         0          479        737049635

BCM.WGS (0 quality reads)

  • orig.CLR > lucy.CLR (mean)
  • orig.CLV > lucy.CLV (mean)
  • 7,153 out of 551,114 reads (1.3%) deleted by Lucy (CLR=-1,-1)
  • 508,166 out of 551,114 reads (92%) vector trimmed at the 5' end
  • 1,946 out of 551,114 reads (0.35%) vector trimmed at the 3' end
                           elem       <0         0          >0         min        max        mean       median     n50        sum
 orig.LEN                  551114     0          0          551114     5          1464       872        946        959        480705828
 
 orig.CLR                  551114     7754       0          543360     -770       1175       708        786        807        390325117
 orig.CLR5                 551114     0          6773       544341     0          1519       44         20         111        24582849
 orig.CLR3                 551114     7744       0          543370     -1         1638       752        818        833        414907966
 
 lucy.CLR                  551114     0          7153       543961     0          699        636        671        671        350759771
 lucy.CLR5                 551114     7153       35872      508089     -1         201        26         27         28         14442310
 lucy.CLR3                 551114     7153       0          543961     -1         699        662        699        699        365202081
 
 orig.CLR5-lucy.CLR5       551114     364282     8801       178031     -198       1500       18         -8         215        10140539
 orig.CLR3-lucy.CLR3       551114     85058      2962       463094     -700       1472       90         123        178        49705885
 
 orig.CLV                  551114     971        0          550143     -2         2037       974        978        981        537127121
 orig.CLV5                 5100       0          0          5100       1          845        35         29         31         180490
 orig.CLV3                 551114     971        0          550143     -1         2037       974        978        981        537307611
 
 lucy.CLV                  551114     58         6          551050     -84        1456       841        917        930        463903233
 lucy.CLV5                 551114     0          42948      508166     0          202        27         28         29         14964546
 lucy.CLV3                 551114     0          0          551114     4          1463       868        945        958        478867779
 lucy.CLVABS5              551114     0          42948      508166     0          202        27         28         29         14964546
 lucy.CLVABS3              551114     0          549168     1946       0          700        2          0          686        1286935
 
 orig.CLV5-lucy.CLV5       551114     506108     42215      2791       -202       845        -26        -28        -202       -14784056
 orig.CLV3-lucy.CLV3       551114     134959     23422      392733     -967       1614       106        7          459        58439832

BCM.SHOTGUN

  • orig.CLR < lucy.CLR (mean)
  • orig.CLV > lucy.CLV (mean)
  • 98,070 out of 10,748,529 reads (0.9%) deleted by Lucy (CLR=-1,-1)
  • 9,737,008 out of 10,748,529 reads (90%) vector trimmed at the 5' end
  • 294,942 out of 10,748,529 reads (2.7%) vector trimmed at the 3' end
                           elem       <0         0          >0         min        max        mean       median     n50        sum
 orig.LEN                  10748529   0          0          10748529   5          2043       975        950        964        10486690472
 
 orig.CLR                  10748529   17308      2          10731219   -1293      1467       809        833        847        8701344571
 orig.CLR5                 10748529   0          68         10748461   0          1315       26         16         38         288662580
 orig.CLR3                 10748529   16780      0          10731749   -1         1647       836        851        863        8990007151
 
 lucy.CLR                  10748529   0          98070      10650459   0          1337       833        854        868        8955866769
 lucy.CLR5                 10748529   98070      1973       10648486   -1         1307       35         28         32         376276188
 lucy.CLR3                 10748529   98070      0          10650459   -1         1553       868        882        896        9332142957
 
 orig.CLR5-lucy.CLR5       10748529   9498290    65171      1185068    -1099      1293       -8         -11        -1099      -87613608
 orig.CLR3-lucy.CLR3       10748529   6879532    671097     3197900    -1149      1437       -31        -26        -1149      -342135806
 
 orig.CLV                  10748529   16779      412        10731338   -2         3919       974        948        964        10472347908
 orig.CLV5                 8594910    0          0          8594910    1          1239       3          1          49         28350257
 orig.CLV3                 10748349   16779      0          10731570   -1         3919       976        950        965        10500698165
 
 lucy.CLV                  10748529   7026       614        10740889   -268       2042       930        924        940        9997862132
 lucy.CLV5                 10748529   0          1011521    9737008    0          855        24         24         27         257993796
 lucy.CLV3                 10748529   0          0          10748529   4          2042       954        945        962        10255855928
 lucy.CLVABS5              10748529   0          1011521    9737008    0          855        24         24         27         257993796
 lucy.CLVABS3              10748529   0          10453587   294942     0          1214       20         0          847        220086015
 
 orig.CLV5-lucy.CLV5       10748529   9538738    138680     1071111    -854       1239       -21        -23        -854       -229643539
 orig.CLV3-lucy.CLV3       10748529   357934     9324166    1066429    -1328      2846       22         0          704        244842237

NISC.SHOTGUN

  • orig.CLR < lucy.CLR (mean)
  • orig.CLV > lucy.CLV (mean)
  • 8,248 out of 737,900 reads (1.1%) deleted by Lucy (CLR=-1,-1)
  • 633,409 out of 737,900 reads (85%) vector trimmed at the 5' end
  • 7,201 out of 737,900 reads (0.97%) vector trimmed at the 3' end


                           elem       <0         0          >0         min        max        mean       median     n50        sum
 orig.LEN                  737900     0          0          737900     104        2104       784        729        734        579172842
 
 orig.CLR                  737900     5988       2          731910     -636       1033       651        668        676        480400909
 orig.CLR5                 737900     0          0          737900     1          1407       47         40         51         34857531
 orig.CLR3                 737900     0          5879       732021     0          1470       698        710        715        515258440
 
 lucy.CLR                  737900     0          8248       729652     0          1035       658        670        676        485757685
 lucy.CLR5                 737900     8248       56         729596     -1         1091       45         35         46         33811606
 lucy.CLR3                 737900     8248       0          729652     -1         1391       704        710        714        519569291
 
 orig.CLR5-lucy.CLR5       737900     253727     89345      394828     -566       1408       1          1          485        1045925
 orig.CLR3-lucy.CLR3       737900     177007     31         560862     -867       1471       -5         1          -867       -4310851
 
 orig.CLV                  737900     3224       2655       732021     -636       2103       771        725        730        569178445
 orig.CLV5                 734026     0          0          734026     1          987        5          1          35         4375315
 orig.CLV3                 732021     0          0          732021     35         2104       783        729        734        573553760
 
 lucy.CLV                  737900     1335       55         736510     -200       2104       747        696        702        551392388
 lucy.CLV5                 737900     104491     0          633409     -1         1199       30         31         34         22784742
 lucy.CLV3                 737900     0          0          737900     15         2103       778        728        733        574177130
 lucy.CLVABS5              737900     0          104491     633409     0          1200       31         32         35         23522642
 lucy.CLVABS3              737900     0          730699     7201       0          1076       5          0          686        4257812
 
 orig.CLV5-lucy.CLV5       737900     561851     66390      109659     -1198      983        -24        -29        -1198      -18409427
 orig.CLV3-lucy.CLV3       737900     8386       1          729513     -950       1077       0          1          -950       -623370

Fragment files

  • Locations:
 /fs/szasmg3/bos_taurus/data/frg
 /fs/szasmg3/bos_taurus/data/frg.new
  • All DST messages are unique
  • bos_taurus.clv : contains the vector clipping points
    • BCM.WGS, BCM.SHOTGUN & NISC.SHOTGUN: lucy.clv
    • others: the TA clv
    • 374,454 reads don't have valid clv's
    • 36,446,031 reads have valid clv's with avg=955

Message counts (original)

                                                DST     FRG             LKG
 bos_taurus.BCM.WGS.frg                         79      24124070        11311841
 #bos_taurus.BCM.SHOTGUN.frg                    7339    10650459        1799069     # some libs & mates are missing due to a tarchive2ca crash (used by UMD2.1)
 #bos_taurus.BCM.SHOTGUN.new.frg                18208   10650459        4715172     # split the libraries by VOL & SEQ_LIB_ID (used by UMD2.2)
 #bos_taurus.BCM.SHOTGUN.new.frg                13826   10650459        5046435     # double check the FRG count !!! (used by UMD2.3)
 bos_taurus.BCM.SHOTGUN.new.frg                 7       10650459        5046435     # UMD2.4
 bos_taurus.NISC.SHOTGUN.frg                    246     729652          344932
 bos_taurus.BCCAGSC.CLONEEND.frg                1       125241          59505
 bos_taurus.UIUC.CLONEEND.frg                   2       114750          46319
 bos_taurus.TIGR.CLONEEND.frg                   1       65171           27067
 bos_taurus.GSC.CLONEEND.frg                    1       53521           25889
 bos_taurus.CENARGEN.WGS.frg                    0       26246           0
 #bos_taurus.BARC.CLONEEND.frg                  11150   25454           11150      # (used by UMD2.3)
 bos_taurus.BARC.CLONEEND.frg                   1       25454           11150      # (used by UMD2.4)
 bos_taurus.BCM.CLONEEND.frg                    1       16875           7103
 bos_taurus.CENARGEN.CLONEEND.frg               1       16787           6269
 bos_taurus.UOKNOR.SHOTGUN.frg                  1       14651           4910
 bos_taurus.TIGR_JCVIJTC.CLONEEND.frg           2       10651           4803
 bos_taurus.UOKNOR.FINISHING.frg                0       151             0
 bos_taurus.WUGSC.COLONEEND.frg                 1       49              21
 #total                                         25312   35973728        16896244   # (UMD2.3) 
 total                                          344     35973728        16896244   # (UMD2.4) 

Message counts (quality)

                                                DST     FRG             LKG
 bos_taurus.BCM.WGS.qual.count                  79      23580109        11035582
 #bos_taurus.BCM.SHOTGUN.qual.count             7339    10644092        1799069
 bos_taurus.BCM.SHOTGUN.qual.new.count          18208   10644092        4712446
 bos_taurus.NISC.SHOTGUN.count                  246     729652          344932
 bos_taurus.BCCAGSC.CLONEEND.qual.count         1       116484          53585
 bos_taurus.UIUC.CLONEEND.count                 2       114750          46319
 bos_taurus.TIGR.CLONEEND.count                 1       65171           27067
 bos_taurus.CENARGEN.WGS.count                  0       26246           0
 bos_taurus.BARC.CLONEEND.count                 11150   25454           11150
 bos_taurus.BCM.CLONEEND.count                  1       16875           7103
 bos_taurus.CENARGEN.CLONEEND.count             1       16787           6269
 bos_taurus.TIGR_JCVIJTC.CLONEEND.count         2       10651           4803
 bos_taurus.UOKNOR.SHOTGUN.qual.count           1       2456            813
 bos_taurus.WUGSC.COLONEEND.count               1       49              21

Message counts (0quality)

                                                DST     FRG             LKG
 bos_taurus.BCM.WGS.0qual.count                 79      543961          234397
 bos_taurus.GSC.CLONEEND.0qual.count            1       53521           25889
 bos_taurus.UOKNOR.SHOTGUN.0qual.count          1       12195           4097
 bos_taurus.BCCAGSC.CLONEEND.0qual.count        1       8757            2114
 bos_taurus.BCM.SHOTGUN.0qual.count             7339    6367            0
 bos_taurus.UOKNOR.FINISHING.0qual.count        0       151             0

Assemblies

Bt.qc.combine UMD2.0 ... UMD2.5 combine stats

UMD2.1(2009_0122_CA; Quality reads)

Issues

  1. Uses only quality reads
  2. BCM.SHOTGUN library : ~ 4715172-1799069=2.9M mates were missed due to a tarchive2ca crash ; some libraries got merged (were assigned the same lib_id)
  3. All reads except for BCM.WGS were set as nonrandom
  4. Update the runCA script to run overlapper concurently; new "ovlConcurrency" parameter added to the .spec file !!!
  5. consensus after cgw crashed in MultiAlignContig() ... use "consensus -D forceunitigabut" !!!
  6. cgw crashed after updating gkpStore with new lib/mate info => edit Input_CGW.c, remove the assert in line 117

Info

 host: walnut
 
 assembly version: wgs-5.2 stable
 
 dir:  /scratch1/bos_taurus/Assembly/2009_0122_CA 
 
 command: /fs/szdevel/dpuiu/SourceForge/wgs/Linux-amd64/bin/runCA-test -d . -p bt -s bt01.specFile *.frg
 
 spec file:
 cgwDistanceSampleSize   =       1000       # ??? too big; more than 50% of the BCM.SHOTGUN reads are in libraries with less than 1000 inserts
 cnsConcurrency          =       15
 cnsMinFrags             =       200000
 doOverlapTrimming       =       1
 frgCorrBatchSize        =       100000
 frgCorrConcurrency      =       15
 merylMemory             =       24000
 merylThreads            =       15
 obtMerThreshold         =       200
 obtOverlapper           =       ovl
 ovlConcurrency          =       8
 ovlCorrBatchSize        =       100000
 ovlCorrConcurrency      =       15
 ovlHashBlockSize        =       1200000
 ovlMemory               =       8GB --hashload 0.8 --hashstrings 400000
 ovlMerThreshold         =       500
 ovlOverlapper           =       ovl
 ovlRefBlockSize         =       7200000
 ovlThreads              =       2
 unitigger               =       utg
 utgErrorRate            =       0.015
 vectorIntersect         =       bos_taurus.clv
 doExtendClearRanges     =       2

Steps

1. Run up till after initialStoreBuilding

 runCA stopAfter=initialStoreBuilding ...

2. Update gkpStore with nonrandom frg flag

 cat bos_taurus.nonrandom.clv | perl -ane 'print "frg uid $F[0] isnonrandom 1\n";'  > bos_taurus.nonrandom.edit
 gatekeeper -edit bos_taurus.nonrandom.edit bt.gkpStore

Input

 gatekeeper -dumpinfo -lastfragiid bt.gkpStore
 ...
 Last frag in store is iid = 35348776

OBT

                           elem       <0         0          >0         min        max        mean       median     n50        sum            
 
 CLV5                      35085508   0          3387027    31698481   0          970        25         27         29         891007232
 CLV3                      35164784   0          0          35164784   15         2974       984        980        1000       34612019144
 
 CLR_ORIG5                 35348776   0          43354      35305422   0          1753       42         29         38         1502168205     
 CLR_ORIG3                 35348776   0          0          35348776   70         1894       864        905        927        30547294868    
 
 CLR_OBT5                  35348776   0          26513      35322263   0          1690       49         30         73         1756346429     
 CLR_OBT3                  35348776   0          23477      35325299   0          1813       843        895        914        29824543869

  • 421,379 reads deleted by OBT: why so many???

  • Chimera:
 20297 reads too short => deleted
  • more 0-overlaptrim/bt.mergeLog.stats
 ...
 211037: short or inconsistent
 253536: deleted fragment due to zero clear 
  • Example:
 gatekeeper -dumpfragments 516316990  bt.gkpStore  
 fragmentIdent           = 516316990,14
 fragmentMate            = 0,0
 fragmentLibrary         = 27473,1563
 fragmentIsDeleted       = 1
 fragmentIsNonRandom     = 1
 fragmentStatus          = G
 fragmentOrientation     = I
 fragmentHasVectorClear  = 0
 fragmentHasQualityClear = 0
 fragmentPlate           = 0
 fragmentPlateLocation   = 0
 fragmentSeqLen          = 862
 fragmentHPSLen          = 0
 fragmentSrcLen          = 17
 fragmentClearORIG       = 38,553
 fragmentClearQLT        = 1,0
 fragmentClearVEC        = 1,0
 fragmentClearOBTINI     = 35,578
 fragmentClearOBT        = 35,578
 fragmentClearUTG        = 35,578
 fragmentClearECR1       = 35,578
 fragmentClearECR2       = 35,578
 fragmentSeqOffset       = 5376
 fragmentQltOffset       = 11038
 fragmentHpsOffset       = 53
 fragmentSrcOffset       = 287
 cat   0-overlaptrim/bt.mergeLog | grep 516316990 
 516316990,14    412     412     0       0 (deleted, too short)
 zcat *r000*gz | convertOverlap -a -obt 
 ...
    14 12128740  f  377  478   292  393   2.97
    14 15226267  f  397  446    31   80   2.04
    14 19071241  f    4  513   199  708   1.18
    14 20073917  f    7  478    36  508   4.88
    14 20042424  f    4  419   299  714   1.93
    14 20212935  f    7  478   234  706   4.88
    14 20073828  r    7  478   507   35   4.67
    14 20212846  r    7  478   557   85   4.67
    14 27089060  r  491  534   836  793   2.33
    14 29061748  f  489  540    86  137   1.96
    14 32105697  f  455  543   381  469   2.27
    14 32187461  f  430  534   105  209   1.92
    14 32027289  f    4  419   493  907   4.59
 ...
 #read aligns to contigs
 show-coords 516316990-ctg.filter-r.strict.delta
     35      531  |       97      594  |      497      498  |    99.20  |      862     2759  |    57.66    18.05  | 516316990  ctg7180001872751
     45      678  |      931     1564  |      634      634  |    97.00  |      862     1567  |    73.55    40.46  | 516316990  ctg7180001837311


  • OBT deleted reads:
 BCM           WGS         253816
 BCM           SHOTGUN     151770
 BCCAGSC       CLONEEND    7510
 NISC          SHOTGUN     4757
 TIGR          CLONEEND    1577
 CENARGEN      WGS         599
 CENARGEN      CLONEEND    431
 TIGR_JCVIJTC  CLONEEND    377
 UIUC          CLONEEND    182
 BCM           CLONEEND    150
 BARC          CLONEEND    125
 UOKNOR        SHOTGUN     85
 total         .           421379

OBT deleted reads:

          elem       >0         min        max        mean       med        n50        sum
 len      421379     421379     98         2974       862        927        968        363280405
 avgQual  421379     421379     1          57         28         24         36         11852865

Overlapper

    • 98.33% of the reads (34,761,786 out of 35,348,776 reads) had overlaps
    • 1.66% of the reads had no overlaps
    • 6.68% of the BCCAGSC.CLONEEND reads had no overlaps
    • 4.95% of the TIGR_JCVIJTC.CLONEEND reads had no overlaps
    • 3.48% of the TIGR.CLONEEND reads had no overlaps
    • the median number of overlaps is 20
 Overlaps
         reads      min        max        mean       median     n50        sum
 qual    35348776   0          5592       106        20         769        3777789082
    • the median number of overlaps for the BCM.WGS reads is 16
    • the median number of overlaps for the BCM.SHOTGUN reads is 16 !!!
    • the median number of overlaps for the NISC.SHOTGUN reads is 40 !!!
    • the median number of overlaps for the BCM.CLONEEND reads is 16 !!!

Media:Bt.ovlStore.big.png , Media:Bt.ovlStore.small.png

Unitigger

more 4-unitigger/bt.cga.0
UNITIG OVERLAP GRAPH INFORMATION
 
       5208738 : Total number of unitigs
       2527051 : Total number of singleton, contained unitigs
       1814842 : Total number of singleton, non-contained unitigs
        180910 : Total number of non-singleton, spanned unitigs
        685935 : Total number of non-singleton, non-spanned unitigs
      34927397 : Total number of fragments
      34927397 : Total number of fragments in all unitigs
      21521581 : Total number of essential fragments in all unitigs
      13405816 : Total number of contained fragments in all unitigs
  0.0076239952 : Randomly sampled fragment arrival rate per bp
    2510896132 : The sum of overhangs in all the unitigs
    6400342737 : Total number of bases in all unitigs
             0 : Estimated number of base pairs in the genome.
             0 : Total number of contained fragments not connected
                 by containment edges to essential fragments.
 Total rho    = 2510896132
 Total nfrags = 19143061
 Estimated genome length = 0
 Estimated global_fragment_arrival_rate=0.007624
 Computed global_fragment_arrival_rate =0.007624
 Total number of randomly sampled fragments in genome = 23326293
 Computed genome length  = 3059589120.000000
 Used global_fragment_arrival_rate=0.007624
 Used global_fragment_arrival_distance=131.164826
 
 Histogram of the number of base pairs in a chunk
 100292 - 159434:    22 
 90010 -  99906:     25 
 80043 -  89676:     73 
 70013 -  79966:    162 
 60010 -  69988:    389 
 50008 -  59983:    977 
 40000 -  49998:   2434 
 30000 -  39997:   6458 
 20000 -  29999:  18957 
 10000 -  19999:  57442
 Unitigs >=10kb
             NewAsm          UMd2Asm
 
 Number       86,939          57,204
 Mean         19,464          15,140
 Sum         1,692.1Mb       866.0Mb
 max         159,434bp      78,570bp
 Contigs >=10Kb:
           NewAsm          UMd2Asm
 n         42,343           45,958      
 mean      59,856           55,473
 sum        2,534.5Mb        2,549.4Mb
 Contigs >=100Kb: 
           NewAsm          UMd2Asm
 n          7,051            6,683         
 mean     163,170          162,357    
 sum        1,150.5Mb        1,085.0Mb
 max      627,705          742,802
 Scaffolds >=10Mb:
           NewAsm          UMd2Asm
 n             30                3
 mean       14.10Mb          11.36Mb
 sum       422.95Mb         340.70Mb
 max        26.54Mb          13.36Mb

CGW & ECR

  • Checkpoints:
 cat 7-0-CGW/bt.timing | grep ^Checkpoint
 Checkpoint 3 written during MergeScaffoldsAggressive at iteration 49
 Checkpoint 4 written during MergeScaffoldsAggressive at iteration 85
 Checkpoint 5 written after 1st Scaffold Merge 
 Checkpoint 6 written after 2nd Aggressive Scaffold Merge
 Checkpoint 7 written after Final Rocks
 cat 7-2-CGW/bt.timing | grep ^Checkpoint
 Checkpoint 19 written during MergeScaffoldsAggressive at iteration 12
 Checkpoint 20 written during MergeScaffoldsAggressive at iteration 31
 Checkpoint 21 written after 1st Scaffold Merge
 Checkpoint 22 written after 2nd Aggressive Scaffold Merge
 Checkpoint 23 written after Final Rocks

 cat 7-4-CGW/bt.timing | grep ^Checkpoint
 Checkpoint 34 written during MergeScaffoldsAggressive at iteration 12
 Checkpoint 35 written during MergeScaffoldsAggressive at iteration 49
 Checkpoint 36 written after 1st Scaffold Merge
 Checkpoint 37 written during Stones CleanupScaffolds after scaffold 32436
 Checkpoint 38 written during Stones CleanupScaffolds after scaffold 34939
 Checkpoint 39 written after Stone Throwing and CleanupScaffolds
 Checkpoint 40 written after 2nd Aggressive Scaffold Merge
 Checkpoint 41 written after Final Rocks

Checkpoint 42 written after Partial Stones Checkpoint 43 written after Final Contained Stones Checkpoint 44 written after resolveSurrogates

  • Get early CTG/SCF stats
 cat 7-CGW/bt.cgw_scaffolds | countMessages.pl
 ICL     451555  # ???
 ICP     116455  # CTG
 ISF     66141   # SCF
 ISL     711     # SLK
  • Clear read extension:
                           elem       <0         0          >0         min        max        mean       median     n50        sum
 ClearORIG                 35348776   4          0          35348772   -1147      1572       821        870        893        29045126663
 
 ClearQLT                  35348776   35348776   0          0          -1         -1         -1         -1         -1         -35348776
 ClearVEC                  35348776   299034     20323      35029419   -1         2043       952        953        975        33658445088
 
 ClearOBTINI               35348776   0          31254      35317522   0          1364       831        879        902        29394688367
 ClearOBT                  35348776   0          31254      35317522   0          1318       794        854        877        28068197440
 
 ClearECR1                 35348776   0          31254      35317522   0          1329       794        854        877        28072014464
 ClearECR2                 35348776   0          31254      35317522   0          1329       794        854        877        28072365712
 sum(ClearECR1)-sum(ClearUTG) = 3,817,024
 sum(ClearECR2)-sum(ClearECR1)= 351,248
  • Scaffold length stats:
 cat 7-0-CGW/stat/final0.Scaffolds.nodelength.cgm | grep -v ^Sca | getSummary.pl -t 0 # 0,2,4
 ...
 step     scaff      min        max        mean       med        n50        sum            
 0        7048       2249       19719008   385020     21967      3114907    2713622175     
 2        4960       2249       21907006   540915     21181      4490171    2682939682
 4        4006       2391       26541374   668427     29193      4590744    2677722052
  • Last cgw
 cat 7-4-CGW/stat/final0.*Scaffolds.nodelength.cgm | grep -v ^Scaff | getSummary.pl -t scf
 cat 7-4-CGW/stat/final0.PlacedContig.n | grep -v ^Scaff | getSummary.pl -t scf
            elem       min        max        mean       med        n50        sum            
 scf        66141      432        26541374   42648      1347       4349378    2820819506     
 ctg        120461     65         627705     22421      2018       84989      2700959854

QC stats

 TotalScaffolds=66,141
 MaxBasesInScaffolds=26,048,998
 MeanBasesInScaffolds=40,861
 
 TotalContigsInScaffolds=120,461
 MaxContigLength=627,911
 MeanContigLength=22,436
 
 TotalDegenContigs=269,031
 MaxDegenContig=33,824
 
 SingletonReads=3,721,123
  • Posmap info
 cat bt.posmap.mates | awk '{print $3}' |count.pl -p 100
 good            10338164
 bothChaff       1160137
 oneChaff        695982
 oneSurrogate    233151
 bothDegen       218198
 diffScaffold    150423
 badShort        138464
 oneDegen        118232
 badLong         23196
 badSame         22451
 badOuttie       8751
 bothSurrogate   589
 total           13107738
 cat bt.posmap.frags | awk '{print $4,$5}' |count.pl  -p 100
 placed good             20676328
 placed notMated         8007072
 chaff bothChaff         2320274
 chaff notMated          704849
 placed oneChaff         695982
 chaff oneChaff          695982
 placed oneSurrogate     466302
 placed bothDegen        436396
 placed diffScaffold     300846
 placed badShort         276928
 placed oneDegen         236464
 placed badLong          46392
 placed badSame          44902
 placed badOuttie        17502
 placed bothSurrogate    1178
 total                   34927397

Log files

Analysis

Insert libraries

1. BCM.WGS : ok

  • FRG.mea: 1750-7000
  • ASM.mea: 1594-6727
  • Most libs have > 1000 reads & get reestimated
  • All libs have ASM.std< ASM.mea/3

2. BCM.SHOTGUN

  • only ~ 50% of the inserts are in libs with >1000 inserts and get reestimated by the assembly
  • if the thold is dropped from 1000 to 100, we'd get ~ 95% of the inserts reestimated
            elem       <0         0          >0         min        max        mean       median     n50        sum
 0          7339       0          0          7339       1          11237      245        135        1137       1799069
 100        4361       0          0          4361       100        11237      395        157        1252       1725604
 1000       440        0          0          440        1008       11237      2075       1791       2323       913086

3. NISC.SHOTGUN: ok

  • Most libs have > 1000 reads & get reestimated
  • All libs have ASM.std< ASM.mea/3

4. BCCAGSC.CLONEEND: ok

 LIB.id  FRG.mea FRG.std FRG.count  CENTER.TYPE        ASM.mea ASM.std
 125606  150000  30000   59505      BCCAGSC.CLONEEND   161998  20133

5. UIUC.CLONEEND: ok

 LIB.id   FRG.mea FRG.std FRG.count CENTER.TYPE        ASM.mea ASM.std
 114892   150000  30000   31063     UIUC.CLONEEND      175594  41208
 115020   150000  30000   15256     UIUC.CLONEEND      162488  26358

6. TIGR.CLONEEND: originally wrong; gets reestimated

 LIB.id   FRG.mea FRG.std FRG.count CENTER.TYPE        ASM.mea ASM.std
 65177    2000    600     27067     TIGR.CLONEEND      161761  34938

7. GSC.CLONEEND: not used (all 53556 are 0 qual)

8. CENARGEN.WGS: "not used" (all 26246 are unmated)

9. BARC.CLONEEND: each library contains 1 template id => inserts did not get reestimated (25454 reads/11151 inserts)

10. BCM cloneend: ok

 LIB.id   FRG.mea FRG.std FRG.count CENTER.TYPE        ASM.mea ASM.std
 19070    167000  25000   7103      BCM.CLONEEND       171244  18555

11. CENARGEN.CLONEEND: large stdev

 LIB.id   FRG.mea FRG.std FRG.count CENTER.TYPE        ASM.mea ASM.std
 17249    202000  20200   6269      CENARGEN.CLONEEND  158938  55165

12. UOKNOR.SHOTGUN: ok ?

 LIB.id   FRG.mea FRG.std FRG.count CENTER.TYPE        ASM.mea ASM.std
 15158    3000    1000    4910      UOKNOR.SHOTGUN     3000    1000

13. TIGR_JCVI.CLONEEND: originally wrong; gets reestimated

 LIB.id   FRG.mea FRG.std FRG.count CENTER.TYPE        ASM.mea ASM.std
 10691    2500    750     2763      TIGR_JCVI.CLONEEND 160363  29580
 10738    2500    750     2040      TIGR_JCVI.CLONEEND 161915  29343

14. UOKNOR.FINISHING: only 151 reads

15. WUGSC.CLONEEND: only 49 reads

Contigs Vs UMD2 contaminants & Ecoli

 4865 contigs in list.exclude_contigs.fa
 
34404 exclude-ctg.qry_hits
 3763 exclude-ctg.ref_hits
 
 1204 exclude-ctg.CBE.qry_hits   CONTAIN|IDENTITY|BEGIN|END
  748 exclude-ctg.CBE.ref_hits   CONTAIN|IDENTITY|BEGIN|END
 559 Ecoli.365350-365744-ctg.qry_hits : max ctg aligned is 179K bp; 10 are > 10K bp

Contigs Vs UMD2 chromosomes

  • Split 120,461 contigs into 100 files; degeneartes not split
  • Align them to the 31 chromosomes 1..30,U (ref) => 101*31 jobs
 #Alignment stats
 cat chr*ctg*delta | grep "^>" | awk '{print $2}' | count.pl -f ../9-terminator/bt.ctg.infoseq | getSummary.pl -i 1 -z 1
 ctg        0          1          >1         min        max        mean       med        n50        sum
 120461     652        37540      82269      0          176        11         3          33         1422808
 #Unaligned ctg lengths
 ctg        min        max        mean       med        n50        sum
 652        65         5849       1146       1134       1194       747295
 
  • 50% of the contigs aligned uniquely
 cat chr*-ctg*.delta | ~/bin/mergeDelta.pl   >  chr-ctg.delta
                                                             # degens? 
 delta-filter -q  chr-ctg.delta              >>  chr-ctg.filter-q.delta
 cat chr1-*.delta | ~/bin/delta2cvg.pl -M 0 | getSummary.pl -i 4
 elem       0          >0         min        max        mean       med        n50        sum
 6681       1          6680       0          12892      366        142        1095       2450106
  • There are disagreements:
 /fs/sz-user-supported/Linux-x86_64/bin/show-coords -l -r -H chr1-ctg.filter-q.delta  | p 'print $F[-1],"\n";' | count.pl | head
 ctg7180001761585        24
 ...
 ctg7180001634116        7
 ...
 show-coords -d chr1-ctg.filter-q.delta | grep ctg7180001761585 | p 'print "  $_";'
 142115744 142188863  |   383463   310345  |    73120    73119  |    99.98  | 157714772   383463  |     0.05    19.07  |  1 -1  chr1   ctg7180001761585
 142188878 142286012  |   310361   213223  |    97135    97139  |    99.94  | 157714772   383463  |     0.06    25.33  |  1 -1  chr1   ctg7180001761585
 142287100 142287675  |   212133   211556  |      576      578  |    98.27  | 157714772   383463  |     0.00     0.15  |  1 -1  chr1   ctg7180001761585
 142288052 142288602  |   211182   210633  |      551      550  |    99.09  | 157714772   383463  |     0.00     0.14  |  1 -1  chr1   ctg7180001761585
 142288652 142295709  |   210586   203531  |     7058     7056  |    99.87  | 157714772   383463  |     0.00     1.84  |  1 -1  chr1   ctg7180001761585
 142295709 142342174  |   203512   157047  |    46466    46466  |   100.00  | 157714772   383463  |     0.03    12.12  |  1 -1  chr1   ctg7180001761585
 142346440 142367791  |   156958   135606  |    21352    21353  |    99.99  | 157714772   383463  |     0.01     5.57  |  1 -1  chr1   ctg7180001761585
 142367822 142370681  |   135597   132737  |     2860     2861  |    99.93  | 157714772   383463  |     0.00     0.75  |  1 -1  chr1   ctg7180001761585
 142370660 142382289  |   132746   121117  |    11630    11630  |    99.88  | 157714772   383463  |     0.01     3.03  |  1 -1  chr1   ctg7180001761585
 142382282 142411927  |   120984    91339  |    29646    29646  |    99.96  | 157714772   383463  |     0.02     7.73  |  1 -1  chr1   ctg7180001761585
 142411941 142419553  |    91339    83728  |     7613     7612  |    99.66  | 157714772   383463  |     0.00     1.99  |  1 -1  chr1   ctg7180001761585
 142419553 142434546  |    83721    68728  |    14994    14994  |    99.79  | 157714772   383463  |     0.01     3.91  |  1 -1  chr1   ctg7180001761585
 142434506 142437288  |    68778    65996  |     2783     2783  |    99.86  | 157714772   383463  |     0.00     0.73  |  1 -1  chr1   ctg7180001761585
 142437389 142439015  |    66757    65131  |     1627     1627  |    99.94  | 157714772   383463  |     0.00     0.42  |  1 -1  chr1   ctg7180001761585
 142439271 142440703  |    65629    64197  |     1433     1433  |   100.00  | 157714772   383463  |     0.00     0.37  |  1 -1  chr1   ctg7180001761585
 142441869 142442975  |    63548    62442  |     1107     1107  |   100.00  | 157714772   383463  |     0.00     0.29  |  1 -1  chr1   ctg7180001761585
 
 142446690 142449325  |    30312    32945  |     2636     2634  |    99.58  | 157714772   383463  |     0.00     0.69  |  1  1  chr1   ctg7180001761585
 142451384 142452476  |    63510    64603  |     1093     1094  |    99.91  | 157714772   383463  |     0.00     0.29  |  1  1  chr1   ctg7180001761585
 142452577 142454379  |    61000    62806  |     1803     1807  |    99.78  | 157714772   383463  |     0.00     0.47  |  1  1  chr1   ctg7180001761585
 142454487 142456821  |    59122    61456  |     2335     2335  |   100.00  | 157714772   383463  |     0.00     0.61  |  1  1  chr1   ctg7180001761585
 142458383 142459582  |    57978    59177  |     1200     1200  |   100.00  | 157714772   383463  |     0.00     0.31  |  1  1  chr1   ctg7180001761585
 142459738 142472295  |    32272    44828  |    12558    12557  |    99.92  | 157714772   383463  |     0.01     3.27  |  1  1  chr1   ctg7180001761585
 142472300 142485640  |    44828    58163  |    13341    13336  |    99.89  | 157714772   383463  |     0.01     3.48  |  1  1  chr1   ctg7180001761585
 
 142501686 142530021  |    28336        1  |    28336    28336  |    99.99  | 157714772   383463  |     0.02     7.39  |  1 -1  chr1   ctg7180001761585
 show-coords -d chr1-ctg.filter-q.delta | grep ctg7180001634116
 116312   162914  |        1    46603  |    46603    46603  |    99.99  | 157714772   122722  |     0.03    37.97  |  1  1  chr1       ctg7180001634116
 
 164916   201988  |    58062    95135  |    37073    37074  |    99.99  | 157714772   122722  |     0.02    30.21  |  1  1  chr1       ctg7180001634116
 
 203244   213377  |    48198    58331  |    10134    10134  |   100.00  | 157714772   122722  |     0.01     8.26  |  1  1  chr1       ctg7180001634116
 261393   264506  |    45949    49062  |     3114     3114  |   100.00  | 157714772   122722  |     0.00     2.54  |  1  1  chr1       ctg7180001634116
 
 264607   268579  |    94345    98317  |     3973     3973  |   100.00  | 157714772   122722  |     0.00     3.24  |  1  1  chr1       ctg7180001634116
 268586   274734  |    98323   104471  |     6149     6149  |   100.00  | 157714772   122722  |     0.00     5.01  |  1  1  chr1       ctg7180001634116
 274835   293945  |   103611   122722  |    19111    19112  |    99.99  | 157714772   122722  |     0.01    15.57  |  1  1  chr1       ctg7180001634116
 ~/bin/delta2breaks.pl -m 200 < chr1-ctg.filter-q.delta | awk '{print $8}' | count.pl
 AGREEMENT       9827
 INVERSION       283
 TRANSLOCATION+  230
 TRANSLOCATION-  154
 
 ~/bin/delta2breaks.pl -m 1000 < chr1-ctg.filter-q.delta | awk '{print $8}' | count.pl
 AGREEMENT       7564
 INVERSION       216
 TRANSLOCATION+  192
 TRANSLOCATION-  127
 
 ~/bin/delta2breaks.pl -m 10000 < chr1-ctg.filter-q.delta | awk '{print $8}' | count.pl
 AGREEMENT       3394
 INVERSION       62
 TRANSLOCATION+  50
 TRANSLOCATION-  29

Assembly UMD2.2 (Quality reads)

  • Try to add the missing BCM.SHOTGUN reads at the assembly
  • Assign new BCM.SHOTGUN library ID's base on volume & SEQ_LIB_ID : same library might have different insert size in different volume => might loose some correct mates from different volumes
 cat bos_taurus.summary | grep BCM | grep SHOTG | cut -f6,7,8,10 | sort | more
 FAAEP   180000  13000   252
 FAAEP   2000    1000    84
 ...
 FAAHP   180000  13000   77
 FAAHP   2000    1000    230
 ...
  • => 20,538 libraries out of which 18,208 contain mated reads
  • create DST messages & add them to gkpStore
 gatekeeper -a -o bt.gkpStore -T -F  bos_taurus.BCM.SHOTGUN.new.DST
  • generate gatekeeper edit file that maps each TI to the new library id
 head bos_taurus.BCM.SHOTGUN.new.ti2libinfo.edit
   frg uid 499507131 libuid 601081 
   frg uid 499507132 libuid 601081
   ...
  • generate gatekeeper edit file that deletes all mate information
 head bos_taurus.BCM.SHOTGUN.new.mate.delete
   frg uid 500086180 mateuid 0
   frg uid 500084310 mateuid 0
   ...
  • pair forward/reverse read that have the same new library id, same TEMPLATE_ID
 head bos_taurus.BCM.SHOTGUN.new.mate.edit
   frg uid 583866821 mateuid 583872364
   frg uid 583866822 mateuid 583872408
   ...
  • run gatekeeper --edit for each edit/delete file
 gatekeeper --edit ...  bt.gkpStore 
  • restart assembly at cgw (doExtendClearRanges=1)
  • consensus after cgw failed on job 25 on CTG 5597062 : cannot create consensus from multialignment ...
 Fix: delete failed message
 cp bt.cgw_contigs.25 bt.cgw_contigs.25.FAILED
 delete "{ICM acc:5597062 pla:P len:20889 ..." from bt.cgw_contigs.25
  • terminator fail; message:
 ICL: reference before definition error for contig ID 5597062

Assembly UMD2.3 (2009_0210_CA; all reads)

  • 35,973,728 reads : 35,348,776 quality & 624,952 quality-less
  • 16,896,244 mates
  • 25,312 libraries

Issues (not solved):

  • 10420 contain at least 1 "NN" in their clr (50.. min(len,600))
  • 5973 contain at least 1 "NNN" in their clr (50.. min(len,600))

Quality-less clrs

  • 624,952 quality-less reads
  • Quality-less read stats:  : alignment CLR or 50..min(len,600) trimming
            elem       min        max        mean       median     n50        sum
 len        624952     5          1495       887        947        961        554429198
 5          624952     6          1584       51         51         51         32150411
 3          624952     5          1495       695        699        699        434960697
 53         624952     -1579      1444       644        648        648        402810286

  • Align 624,952 to the 120,461 Assembly1 contigs (no degenerates) : 1 day on 13 cpus
  • 572,140(91.5%) reads aligned and 52,812(8.5%) did not align to the contigs

1. Launch jobs in parallel: 12766 jobs on 13 processors

 nucmer -l 50 -c 200 -b 10 -g 5 -d 0.05 bt.ctg.001.fasta  bos_taurus.0qual.01.seq -p ctg.001-seq.01
 ...
 nucmer -l 50 -c 200 -b 10 -g 5 -d 0.05 bt.ctg.982.fasta  bos_taurus.0qual.13.seq -p ctg.001-seq.01
  • CPU usage: 100% /job
  • Max mem usage: 0.1% /job

2. Get maximum extended clrs

 cat *delta | ~/bin/delta2qryClr.pl -best | sort > bos_taurus.0qual.best.clr
 Length stats
            elem       min        max        mean       median     n50        sum
 all        624952     5          1495       887        947        961        554429198
 aligned    572140     221        1416       912        953        964        522281354
 unaligned  52812      5          1495       608        580        754        32147844
 Best/Max/Max+extended alignment coord stats:
            elem       min        max        mean       median     n50        sum
 53.best    572140     94         1208       766        841        877        438793102
 53.max     572140     170        1208       794        863        888        454816817
 53.extend  572140     170        1208       797        865        889        456014184
 Unaligned read counts:                           
                          unaligned    total   quality   quality-less
 BCM.WGS                  42595
 UOKNOR.SHOTGUN           5787         14651   2456      12195
 GSC.CLONEEND             2294         53521   0         53521
 BCCAGSC.CLONEEND         1869         125241  116484    8757
 BCM.SHOTGUN              186
 UOKNOR.FINISHING         81
  • 52,812 quality-less unaligned reads to the contigs using less strict nucmer parameters: -l 30 -c 50 -b 50 -g 50 -d 0.12
  • 9,269 reads aligned at an average 92% identity (min 81% identity)  : not too good

3. Get reads without clrs: set their clr to maximum 50..600

 cp bos_taurus.0qual.extended.clr bos_taurus.0qual.clr
 difference.pl bos_taurus.0qual.infoseq bos_taurus.0qual.extended.clr | perl -ane '$three=600; $three=$F[1] if ($F[1]<600); print "$F[0] 50 $three\n";' >> bos_taurus.0qual.clr

Quality clrs

  • Use Assembly1 OBT clrs
  • Delete reads deleted in the OBT process

Gatekeeper

Load order:

  1. Add quality FRG : "gatekeeper -T -F ..."
  2. Add quality-less FRG "gatekeeper -F -a ..." # -T should be removed
  3. Delete quality FRG (deleted by UMD2.1 OBT)
  4. Add DST
  5. Add LKG

Edit

  1. Loads clrs
  2. Loads clvs
  3. Loads nonrandom info

Meryl

Use Assembly1 kmer counts

Overlapper

  • Use 80/90 Assembly1 overlap results
  • Rerun 10 overlap jobs
  • 96.64% of the quality-less reads have overlaps (vs 98.33% of the quality reads)
                  reads      0ovl       1+ovl      min        max        mean       median     n50        sum
 0qual(all)       624831     20941      603890     0          4350       96         19         740        60494730         # 96.64%
 0qual(unaligned) 52691      15384      37307      0          3229       50         5          349        2655545          # 70.80%

Unitigger

  • More unitigs, more bases in unitigs
  • Few of the longest unitigs got broken: Example 138,294(UMD2.3) vs 159,434(UMD2.1)
 UNITIG OVERLAP GRAPH INFORMATION
 
       5333434 : Total number of unitigs
       2595174 : Total number of singleton, contained unitigs
       1865473 : Total number of singleton, non-contained unitigs
        183693 : Total number of non-singleton, spanned unitigs
        689094 : Total number of non-singleton, non-spanned unitigs
      35551316 : Total number of fragments
      35551316 : Total number of fragments in all unitigs
      21830994 : Total number of essential fragments in all unitigs
      13720322 : Total number of contained fragments in all unitigs
  0.0077856472 : Randomly sampled fragment arrival rate per bp
    2514833413 : The sum of overhangs in all the unitigs
    6483064813 : Total number of bases in all unitigs
             0 : Estimated number of base pairs in the genome.
             0 : Total number of contained fragments not connected
                 by containment edges to essential fragments.
 Total rho    = 2514833413
 Total nfrags = 19579606
 Estimated genome length = 0
 Estimated global_fragment_arrival_rate=0.007786
 Computed global_fragment_arrival_rate =0.007786
 Total number of randomly sampled fragments in genome = 23870254
 Computed genome length  = 3065930496.000000
 Used global_fragment_arrival_rate=0.007786
 Used global_fragment_arrival_distance=128.441474
  
 Histogram of the number of base pairs in a chunk
 100406 - 138294:    21
 90330 -  99887:     23
 80042 -  89675:     79
 70014 -  79943:    169
 60002 -  69792:    374
 50000 -  59982:   1008
 40002 -  49995:   2440
 30001 -  39994:   6509
 20000 -  29999:  18989
 10000 -  19999:  57404


Consensus after unitigger

Problems:

  • job 120 executed partially (see bt_120.cgi_tmp); Solution: split into 3 parts, run separately, merge results
  • failed on 19 unitigs (587..7447 bp)
 rm 5-consensus/*failed
 touch 5-consensus/consensus.success

Cgw

  • Failure 1 : because job 120 was run partially => missing mates
  • Failure 2 : because of /5-consensus/FAILED/bt_???.cgi.failed => missing mates => delete 356 mates
 Error:
   ProcessFrags()-- WARNING!  fragiid=35973388,index=33600942 mateiid=35973363,index=0 -- MATE DOESN'T EXIST!
   cgw: Input_CGW.c:117: ProcessFrags: Assertion `err == 0' failed.
 Fix:
   cat cgw.out | grep MATE | p '/mateiid=(\d+)/; print $1,"\n";' >! cgw.out.mateiid
   gatekeeper -dumpfragments -tabular -iid cgw.out.mateiid bt.gkpStore/ | cut -f1,3 | ~/bin/mate2lkg.pl -a D >! cgw.out.delete.LKG
   gatekeeper -a -o bt.gkpStore -T -F -L cgw.out.delete.LKG
  • Failure 3: because of cgwOutputIntermediate=1
 Try to restart from ckp : die with assertion failure
 cgw -y -R 8 -N 12 -j 1 -k 5 -r 5 -s 2 -S 0 -z -m 100 -g ./bt.gkpStore -o ./7-0-CGW.8_12/bt ./5-consensus/bt_001.cgi
 cgw -y -R 8       -j 1 -k 5 -r 5 -s 2 -S 0 -z -m 100 -g ./bt.gkpStore -o ./7-0-CGW.8_12/bt ./5-consensus/bt_001.cgi
 
 Fix: Restart cgw from the beginning
  • cgw does update bt.SeqStore - OpenSequenceDB()

ECR (eventually skipped)

  • Failed after running for 1 day
 /fs/szdevel/dpuiu/SourceForge/wgs-5.2/Linux-amd64/bin/extendClearRanges  -g ./bt.gkpStore  -n 15  -c bt  -b 146216 -e 167100  -i 1  > 7-1-ECR/extendClearRanges-scaffold.146216.err 
 sh: line 1: 17016 Aborted 
  • Last ckp : bt.ckp.15
  • Try to fix:
 touch 7-1-ECR/cgw.success
 runCA "doExtendClearRanges = 1"
  • Runs too slow !!!
  • Can specify a scaffold range to process: -b ? -e ? => ckp files; could we merge them?
  • Failed after running for 1 day

Consensus after CGW

  • Failed on job 56
 tail 8-consensus/bt.cns_contigs.56.err 
 ...
 Could (really) not find overlap between 153923 (U) and 2508303 (R) estimated ahang: 0 (ejecting frag 2508303 from contig)
 consensus: math_AS.h:51: ceil_log2: Assertion `x > 0' failed.
 cat 7-CGW/bt.cgw_contigs.56  | countMessages.pl
 ICM     440
 IMP     281412
 IUP     12715
 
 cat 8-consensus/bt.cns_contigs.56_tmp | countMessages.pl 
 ICM     115
 IMP     103322
 IMV     8122
 IUP     4849
  • Fix: split ICM messages 1..115,116,116+ and run consensus on each set

QC

            elem       min        max        mean       med        n50        sum            
 scf        56891      407        33129045   50871      1378       4716077    2894145150     
 ctg        122851     64         651167     21957      3647       71561      2697514858     
 deg        268237     65         30246      1019       985        997        273575106
  • Compared with UMD2.1 : better scaffols, worse contigs & unitigs

Analysis

Issues:

  • Identify bacterial & mito contigs: mito seq
  • Align ctg&degen to UMD2 chromosomes
    • the chromosomes should have no 0cvg regions
    • possible inversions, translocations (UMD2 used markers)
    • if align breaks/indels, which assembly is correct?

Assembly UMD2.4 (2004_0217_CA; All reads)

  • 35,973,728 reads : 35,348,776 quality & 624,952 quality-less
  • 16,896,244 mates
  • 344 libraries

Fix quality-less read clrs (N's) (temporary solution)

  • 10420 contain at least 1 "NN" in their clr (50.. min(len,600))
  • 5973 contain at least 1 "NNN" in their clr (50.. min(len,600))

Fix:

 frg2seq.pl < bos_taurus.0qual.frg > bos_taurus.0qual.seq
 fasta2qual.pl bos_taurus.0qual.seq > ! bos_taurus.0qual.qual
 lucy \
    -o bos_taurus.0qual.lucy.seq  bos_taurus.0qual.lucy.qual \
    -debug  bos_taurus.0qual.lucy.info \
    bos_taurus.0qual.seq bos_taurus.0qual.qual
 cat bos_taurus.0qual.lucy.info | cut -f1,3,4 -d ' ' | sort >! bos_taurus.0qual.lucy.clr
  • 624,952 quality-less reads
  • Quality-less read stats: 50..min(len,600) & lucy trimming
            elem       0          >0         min        max        mean       med        n50        sum
 5          624952     2857       622095     0          501        52         52         52         33012433
 3          624952     2857       622095     0          600        579        600        600        361980208
 53         624952     2857       622095     0          548        526        548        548        328967775

Fix quality-less read clrs (low complexity)

  • Run dust filter on seq (before qual & lucy)
            elem       0          >0         min        max        mean       med        n50        sum            
 5          624952     3564       621388     0          501        75         52         52         47385578       
 3          624952     3564       621388     0          600        554        600        600        346470473      
 53         624952     3564       621388     0          548        478        548        548        299084895
  • Merge dust.lucy clrs with the alignment clrs
            elem       0          >0         min        max        mean       med        n50        sum
 5          624952     4488       620464     0          599        93         52         126        58378496
 3          624952     4488       620464     0          600        547        600        600        342258160
 53         624952     4488       620464     0          548        454        512        548        283879664
  • Test seq
 gatekeeper -dumpfastaseq -b 35348777 -e  35973728 bt.gkpStore  | grep NNN
 gatekeeper -dumpfastaseq  bt.gkpStore   | perl -ane 'if(/^>(\d+)/) { $id=$1} elsif(/NNN/) { print $id,"\n";} ' | uniq -c | awk '{print $2,$1}'  > bt.NNN.seqs  # 2411 seqs (all have the N's "in the middle")
 gatekeeper -dumpfastaseq -uid bt.NNN.seqs bt.gkpStore >  bt.NNN.cseqs

Consolidate libraries

Drop from 25,312 to 344 libs

BCM.SHOTGUN

UMD2.4 reestimated 10,117 out of 13,826 libs (have > 100mates)

Base on initial estimates

  • Reduce the total number from 13826 to 2 libs: 3000 & 6000
  • UMD2.3 mean estimates (Initial vs Final):
 meanI      #libs      minF       maxF       meanF      medF       n50F       sumF           uid
 180000     436        1636       5199       2475       2410       2458       1079407        #3000
 167000     86         1585       2948       2264       2258       2285       194775         #3000        
 
 6500       31         5212       6636       5837       5867       5924       180951         #6000        
 6000       11         4556       6272       5389       5421       5421       59286          #6000
 
 3500       949        1670       4769       2668       2608       2645       2532027        #3000
 3000       2511       1483       5250       2715       2662       2723       6818678        #3000
 2000       6093       1157       6443       2526       2487       2554       15391160       #3000

Base on final estimates

  • Reduce the total number from 13826 to 7 libs: 6500,5500,...1500, un-estimates (2501)
 meanF        #libs      min        max        mean       med        n50        sum         uid(new)     mean(new)  std(new)
 6K<=mea<7K   15         6010       6636       6176       6159       6159       92650       6500         6500
 5K<=mea<6K   29         5121       5985       5540       5536       5577       160673      5500         5500
 4K<=mea<5K   67         4017       4939       4284       4266       4274       287072      4500         4500
 3K<=mea<4K   1401       3000       3998       3276       3209       3226       4590323     3500         3500
 2K<=mea<3K   7998       2000       2999       2502       2498       2532       20017767    2500         2500
 1K<=mea<2K   607        1157       1999       1825       1882       1890       1107798     1500         1200
 un-estimated 3709                                                                          2501         2501

BARC.CLONEEND

Collapse all 11150 into 1:

uid:25456
mea:165000
std:43000

Overlapper

  • Quality-less reads overlaps: fewer than in the UMD2.3 assembly
                  elem       0          >0         min        max        mean       med        n50        sum           
 0qual(all)       624830     35692      589138     0          3237       60         14         439        37578899         # 94.39%

Unitigger

 UNITIG OVERLAP GRAPH INFORMATION    
       5356408 : Total number of unitigs
       2613795 : Total number of singleton, contained unitigs
       1870448 : Total number of singleton, non-contained unitigs
        182878 : Total number of non-singleton, spanned unitigs
        689287 : Total number of non-singleton, non-spanned unitigs
      35547861 : Total number of fragments
      35547861 : Total number of fragments in all unitigs
      21685943 : Total number of essential fragments in all unitigs
      13861918 : Total number of contained fragments in all unitigs
  0.0077797328 : Randomly sampled fragment arrival rate per bp
    2513424271 : The sum of overhangs in all the unitigs
    6468428782 : Total number of bases in all unitigs
             0 : Estimated number of base pairs in the genome.
             0 : Total number of contained fragments not connected
                 by containment edges to essential fragments.
 Total rho    = 2513424271               
 Total nfrags = 19553770
 Estimated genome length = 0
 Estimated global_fragment_arrival_rate=0.007780
 Computed global_fragment_arrival_rate =0.007780     
 Total number of randomly sampled fragments in genome = 23868770
 Computed genome length  = 3068070656.000000          
 Used global_fragment_arrival_rate=0.007780             
 Used global_fragment_arrival_distance=128.539119

Histogram of the number of base pairs in a chunk
100292 - 138301:     19
 90052 -  99906:     23
 80043 -  89676:     79
 70013 -  79966:    164
 60010 -  69988:    390
 50008 -  59983:    949
 40000 -  49998:   2433
 30000 -  39997:   6437
 20000 -  29999:  18808
 10000 -  19999:  57634

Bog

!!! Much bigger unitigs than default unitigger

Global Arrival Rate: 0.013829
212260 - 224992:      4 
100099 - 186873:    372 
 90015 -  99973:    353 
 80045 -  89988:    582 
 70011 -  79999:   1084 
 60000 -  69994:   1856 
 50001 -  59996:   3162 
 40002 -  49994:   5407 
 30000 -  39999:   9767 
 20000 -  29996:  18981 
 10000 -  19999:  39641

Consensus after Unitigger

  • Failed on jobs 120 & 121 ( _tmp file)
 cat 4-unitigger/*120*  | countMessages.pl
 IMP     280264
 IUM     124707

 cat 4-unitigger/*121* | countMessages.pl
 IMP     282146
 IUM     245650
 cat 5-consensus/bt_120.cgi_tmp  | countMessages.pl
 IMP     34348
 IUM     19222
 
 cat 5-consensus/bt_121.cgi_tmp | countMessages.pl 
 IMP     51833
 IUM     16805
  • Fix 120: split IUM messages
 extractfromfrgMSG.pl -b 0 -e 19222 bt_120.cgb.orig IUM >! bt_120.cgb &
 extractfromfrgMSG.pl -b 19222 bt_120.cgb.orig IUM >! bt_120.cgb & 
  • Fix 121: remove assertion in AS_CNS/MultiAlignment_CNS.c
 if(to <= from || to > ma_length-1){
   fprintf(stderr, "AbacusRefine range (to) invalid");
   //assert(0); 
 }

CGW

  • Failed after Ckp3(7-0-CGW/bt.ckp.3; MergeScaffoldsAggressive 2nd itteration)
 CI extends beyond end of scaffold!
 offsetAEnd = 254204 offsetBEnd = 252250 scaffoldLength = 253268
 cgw: CIScaffoldT_Merge_CGW.c:307: InsertScaffoldContentsIntoScaffold: Assertion `0' failed.
  • Last cgw
 Scaffold lengths:
 cat 7-4-CGW/stat/final0.*Scaffolds.nodelength.cgm | grep -v ^Scaff | getSummary.pl -t scf
 cat 7-4-CGW/stat/final0.PlacedContig.n | grep -v ^Scaff | getSummary.pl -t scf
            elem       min        max        mean       med        n50        sum            
 scf        45826      385        34263871   59591      1349       7059820    2730853790     
 ctg        96562      65         738899     27789      3657       93988      2683452359
 Library insert estimates:
 cat 7-4-CGW/stat/scaffold_final.distupdate.dst | grep ^# | awk '{print $3,int($8),int($10)}' > 7-4-CGW/bt.dst
 join2.pl bt.dst 7-4-CGW/bt.dst | p 'print join "\t",@F[0,1,2,5,6,3,4]; print "\n";' > bt.dst.combine
 CLONEEND inserts:
 UID           MEANI   STDI    MEANF   STDF    COUNT   LIB
 114892        150000  30000   175701  40732   31063   UIUC.CLONEEND
 19070         167000  25000   171349  18253   7103    BCM.CLONEEND
 118           167000  16700   167000  16700   21      WUGSC.CLONEEND
 25456         165000  43000   163044  25849   11150   BARC.CLONEEND
 115020        150000  30000   162719  25343   15256   UIUC.CLONEEND
 65177         2000    600     162540  34155   27067   TIGR.CLONEEND
 125606        150000  30000   162396  19319   59505   BCCAGSC.CLONEEND
 10738         2500    750     162386  27567   2040    TIGR_JCVIJTC.CLONEEND
 10691         2500    750     161540  28239   2763    TIGR_JCVIJTC.CLONEEND
 17249         202000  20200   157496  55375   6269    CENARGEN.CLONEEND
 54017         120000  12000   115671  27594   25889   GSC.CLONEEND
 total                                         188126  CLONEENDs

Consensus

  • Failed on job 34 with segmentation fault
  • 9kbp contig, made out of 3007 reads (24 of which are quality-less)
 cat 7-CGW/bt.cgw_contigs.34.1 | grep "^{" | uniq -c | awk '{print $2,$1}'
 {ICM 1
 {IMP 3007
 {IUP 329
  • Fix : edit AS_CNS/MultiAlignment_CNS.c; add
 if(!ungappedSequence->Elements) { ungappedSequence->numElements=0; }
 if(!ungappedQuality->Elements) { ungappedQuality->numElements=0; }


Analysis

Contigs Vs possible contaminants

  • nucmer alignment parameters: -l 40 -c 100 -b 10 -g 5 -d 0.05
  • have to redo alignments using -maxmatch !!!
  • file location:
 reference seqs:
   /nfshomes/dpuiu/db/Ecoli.365350-365744                     # Ecoli K12 region with most alignments (BCM WGS splice site)
   /nfshomes/dpuiu/db/Ecoli                                   # Ecoli K12 substrain MG1655  (NC_000913 ; 1st completed)
   /nfshomes/dpuiu/db/Ecoli.all                               # 22 Ecoli completed genomes ( + plasmids)
 
   /nfshomes/dpuiu/db/UniVec_Core                             # UniVec Core seqs 
   /nfshomes/dpuiu/db/OtherVec                                # 100 other vector sequences identified by aligning UMD2.0 contaminants to GenBank; align also to 110 UniVec core using nucmer (params above)
   /nfshomes/dpuiu/db/bos_taurus.UMD2.contaminant.fasta       # 4813 whole contigs and 30329 contig regions identified by NCBI as UMD2 contamination
   /nfshomes/dpuiu/db/bos_taurus.UMD2.contaminant.organism_count  # organism counts: vector is the most abundant
 
   /nfshomes/dpuiu/db/bos_taurus.UMD2.contaminant.infoseq      # grep -v 'coli|vector|7180003101029' => 905 other contamiants
 
 query seqs:
   /scratch1/bos_taurus/Assembly/2009_0217_CA/9-terminator/ctg.split100/*fasta  # latest assembly contigs (no degenerates)
 
 delta files: 
   /scratch1/bos_taurus/Assembly/2009_0217_CA/nucmer_ctg/no_maxmatch/*delta

Ecoli K12 substrains:

 NC_010473.1    4686137 50.78  Escherichia coli str. K-12 substr. DH10B, complete genome
 NC_000913.2    4639675 50.79  Escherichia coli str. K-12 substr. MG1655, complete genome
 AC_000091.1    4646332 50.80  Escherichia coli str. K-12 substr. W3110, complete genome

no maxmatch

  • fewer alignments in UMD2.4 than in UMD2

UMD2 (all): just a few degens

 15102 Ecoli.365350-365744-ctg.qry_hits
 15943 Ecoli-ctg.qry_hits
 17308 Ecoli.all-ctg.qry_hits
 79065 UMD2.contaminant-ctg.qry_hits     # 55877 new hits
 20105 UMD2.contaminant-ctg.CBE.qry_hits # CONTAIN|BEGIN|END|IDENTITY
 19839 UniVec_Core-ctg.qry_hits

UMD2.4

   559 Ecoli.365350-365744-ctg.qry_hits
  1215 Ecoli-ctg.qry_hits
  2767 Ecoli.all-ctg.qry_hits             # most 2 frequenct starins are UMN026 & ATCC 8739; K12 DH10B is rank 5th; K12 MG1655 is ranked 19th (out 31 seqs)
 44112 UMD2.contaminant-ctg.qry_hits
  5286 UniVec_Core-ctg.qry_hits

Length of the reference seqs used for screening:

                         #seqs      min        max        mean       med        n50        sum
 Ecoli.365350-365744     1          395        395        395        395        395        395                  # Ecoli K12 regions with most alignments (BCM WGS splice site)
 Ecoli                   1          4639675    4639675    4639675    4639675    4639675    4639675              # Ecoli K12 substrain MG1655        
 Ecoli.all               49         3306       5572075    2293320    130440     5065741    112372708            # 22 Ecoli's 
 
 UniVec_Core             1348       12         48551      243        98         967        327641          
 OtherVec                100        1702       739874     15419      5027       166744     1541984
 UMD2.contaminant        35142      48         16661      512        362        674        18022349             

Length of UMD2.4 contigs that contain contaminant (0+ bp from end):

                         #ctgs      <2000bp    >=2000bp   min        max        mean       med        n50        sum
 Ecoli.365350-365744-ctg 559        534        25         1001       179527     2467       1341       1894       1379440
 Ecoli-ctg               1215       1086       129        1001       360312     4326       1347       71372      5256540
 Ecoli.all-ctg           2767       2455       312*       1001       453627*    8031       1366       134516     22224468
 UniVec_Core-ctg         5286       4718       568*       882        651163*    9820       1337       136090     51909339
 UMD2.contaminant-ctg.CBE 4976      4410       566*       738        651163*    8497       1339       122281     42281715     #annotated alignments: CONTAIN|BEGIN|END|IDENTITY
 UMD2.contaminant-ctg    44112      12813      31299      268        739442     50591      27461      111598     2231701788

Length of UMD2.4 contigs that contain contaminant in the middle (500+ bp from end):

                         #ctgs      <2000bp    >=2000bp   min        max        mean       med        n50        sum
 Ecoli.365350-365744-ctg 144        136        8          1286       2053       1779       1811       1814       256259
 Ecoli-ctg               171        152        19         1286       4703       1835       1807       1821       313820
 Ecoli.all-ctg           197        160        37*(81)    1228       351373*    6516       1815       125069     1283728     #81  2K+ ctgs using -maxmatch
 UniVec_Core-ctg         1278       1110       168*(276)  1085       651163*    12266      1496       160336     15676765    #276 2K+ ctgs using -maxmatch
 UMD2.contaminant-ctg.CBE 52         25        27*        1249       351373*    22195      2054       125069     1154142     #annotated alignments: CONTAIN|BEGIN|END|IDENTITY
 UMD2.contaminant-ctg    31019      1437       29582      1113       739442     70665      50798      113684     2191986214

Length of the UMD2.4 contaminant seqeunece (0+ bp from end):

                            #align     <200bp     >=200bp       min       max        mean       med        n50        sum
 Ecoli.365350-365744-ctg    1066       537        529        104        225        192        162        224        205379
 Ecoli-ctg                  1793       587        1206       50         4440       496        224        994        889798
 Ecoli.all-ctg              4074       1132       2942*      40         17075*     380        254        441        1551783
 UniVec_Core-ctg            14425      9819       4606*      40         1801*      236        162        325        3409187
 UMD2.contaminant-ctg       144843     96008      48835      40         16661      199        169        209        28912002

Length of the UMD2.4 contaminant seqeunece (500+ bp from end)

                            alignm     <200bp     >=200bp    min        max        mean       med        n50        sum
 Ecoli.365350-365744-ctg    243        136        107        162        224        189        162        224        46000
 Ecoli-ctg                  273        149        124        106        1341       219        162        224        59923
 Ecoli.all-ctg              294        153        141*       106        2150*      251        162        224        73992
 UniVec_Core-ctg            2144       2035       109*       50         1340*      122        121        121        261821
 UMD2.contaminant-ctg       121331     86985      34346      40         2738       171        162        184        20753580
  • Problem: 8 long ctgs contain Ecoli in the middle (1000+ bp from end)
show-coords Ecoli.all-ctg.filter-q.delta | ~/bin/filterQryCoords.pl -i 1000 | sort -nk13 -r
   [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS]
===============================================================================================================================
4640589  4641890  |   161712   160411  |     1302     1302  |    99.46  |  4686137   351373  |     0.03     0.37  | gi|170079663|ref|NC_010473.1|      ctg7180001872124
5068908  5069620  |    87386    86679  |      713      708  |    98.88  |  5209548    91972  |     0.01     0.77  | gi|218687878|ref|NC_011745.1|      ctg7180002055226
3087480  3088683  |    50423    51620  |     1204     1198  |    99.00  |  5202090    88182  |     0.02     1.36  | gi|218703261|ref|NC_011751.1|      ctg7180002054092
4640580  4641890  |    19953    18646  |     1311     1308  |    99.08  |  4686137    31157  |     0.03     4.20  | gi|170079663|ref|NC_010473.1|      ctg7180001875158
 131462   133564  |     1247     3349  |     2103     2103  |    98.19  |   241387     5751  |     0.87    36.57  | gi|157412014|ref|NC_009838.1|      ctg7180002043242
  82801    83166  |     2986     2621  |      366      366  |    98.09  |   241387     4709  |     0.15     7.77  | gi|157412014|ref|NC_009838.1|      ctg7180001714551
  82264    82793  |     3523     2994  |      530      530  |    98.49  |   241387     4709  |     0.22    11.26  | gi|157412014|ref|NC_009838.1|      ctg7180001714551
1652253  1652545  |     1487     1195  |      293      293  |    98.63  |  4700560     2492  |     0.01    11.76  | gi|218552585|ref|NC_011741.1|      ctg7180001754941
  • Regions present in DH10B but not MG1655
 delta2cvg -M 0 < DH10B-MG1655.delta
 gi|170079663|ref|NC_010473.1|   1349629 1378243 28614   0
 gi|170079663|ref|NC_010473.1|   1391006 1396986 5980    0
 gi|170079663|ref|NC_010473.1|   3199469 3200798 1329    0
 gi|170079663|ref|NC_010473.1|   3211928 3213257 1329    0
 gi|170079663|ref|NC_010473.1|   4640588 4641918 1330    0 !!!
  • Problem: 10 long ctgs contain Vector in the middle (1000+ bp from end)
show-coords UniVec_Core-ctg.filter-q.delta | ~/bin/filterQryCoords.pl -i 1000 | sort -nk13 -r
   [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS]
===============================================================================================================================
      1      121  |   215495   215615  |      121      121  |    99.17  |      170   271477  |    71.18     0.04  | gnl|uv|U09128.1:15891-16011-49     ctg7180002047604       # pSacBII P1 cloning vector  
   2252     2435  |     1334     1151  |      184      184  |   100.00  |     2485   160336  |     7.40     0.11  | gnl|uv|U75992.1:16925-19409        ctg7180001808271
    180      312  |     1153     1020  |      133      134  |    99.25  |      312   160336  |    42.63     0.08  | gnl|uv|NGB00145.1:2378-2689        ctg7180001808271
      1      121  |     1367     1487  |      121      121  |   100.00  |      170   160336  |    71.18     0.08  | gnl|uv|U09128.1:15891-16011-49     ctg7180001808271
      1      103  |     1286     1388  |      103      103  |   100.00  |      103   160336  |   100.00     0.06  | gnl|uv|U80929.2:11415-11517        ctg7180001808271        [CONTAINED]
      4      121  |    68269    68386  |      118      118  |   100.00  |      170   111913  |    69.41     0.11  | gnl|uv|U09128.1:15891-16011-49     ctg7180002052060
     40      152  |    30255    30142  |      113      114  |    99.12  |     1663    42854  |     6.79     0.27  | gnl|uv|U09128.1:1-1663             ctg7180002053344
      1      121  |    34358    34238  |      121      121  |   100.00  |      170    35471  |    71.18     0.34  | gnl|uv|U09128.1:15891-16011-49     ctg7180002046164
      1      103  |    34439    34337  |      103      103  |   100.00  |      103    35471  |   100.00     0.29  | gnl|uv|U80929.2:11415-11517        ctg7180002046164        [CONTAINED]
     46     1385  |     8928    10267  |     1340     1340  |   100.00  |     1413    17587  |    94.83     7.62  | gnl|uv|X65279.1:5941-7353          ctg7180002043597        [CONTAINED]


  • ctg7180001872124 : 351373 bp; region 160411..161712 contaminated by Ecoli
 cat 9-terminator/bt.posmap.utgctg | grep 7180001872124 | wc -l  # 329
 cat 9-terminator/bt.posmap.utgctg | grep 7180001872124 | perl -ane '@F[2,3]=@F[3,2] if($F[2]>$F[3]); print $_ if($F[2]<160411 and 160411<$F[3] or $F[2]<161712 and 161712<$F[3]);'
 7180000441625   7180001872124   159483  161201  r
 7180000441788   7180001872124   160330  161329  f  #Ecoli
 7180000442730   7180001872124   160368  161010  r  #Ecoli
 7180000441635   7180001872124   160740  162700  f  #Ecoli
 cat 9-terminator/bt.utg.info
 utg7180000441625 length=1715 num_frags=12 Astat=7.00
 utg7180000441788 length=999 num_frags=1 Astat=0.00
 utg7180000442730 length=640 num_frags=1 Astat=0.00
 utg7180000441635 length=1957 num_frags=9 Astat=7.00
 cat 9-terminator/bt.posmap.frgctg | perl -ane '@F[2,3]=@F[3,2] if($F[2]>$F[3]); print $_ if($F[2]<160411 and 160411<$F[3] or $F[2]<161712 and 161712<$F[3]);'
 1237446426      7180001872124   160117  161201  f
 1238816835      7180001872124   160133  160993  f
 1238817728      7180001872124   160322  161123  r
 1244436200      7180001872124   159976  160984  f
 1238817676      7180001872124   160105  160890  r
 1237443253      7180001872124   160106  160900  f
 1237471027      7180001872124   159930  160928  f
 1238822613      7180001872124   159774  160782  f
 1238816875      7180001872124   159878  160728  f
 1244436248      7180001872124   159483  160553  f
 1238818306      7180001872124   159718  160489  f
 1238818332      7180001872124   159722  160483  f
 1237476824      7180001872124   160330  161329  f
 1238817689      7180001872124   160368  161010  r
 1237447135      7180001872124   160740  161768  r
 1237483546      7180001872124   160814  161790  r
 1237483530      7180001872124   160818  161856  r
 1237471108      7180001872124   161003  162009  f
 1238817744      7180001872124   161151  161978  f
 1237446441      7180001872124   161050  162107  f
 1244436201      7180001872124   161117  162164  f
 1237446407      7180001872124   161586  162699  r
 1237471055      7180001872124   161652  162700  r
 # 23 BCM SHOTGUN RP42 VVHNP reads (1369 read lib; 1341 of the reads in this ctg)
  • ctg7180002047604 : Vctor in the middle
   [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS]
===============================================================================================================================
      1      121  |   215495   215615  |      121      121  |    99.17  |      170   271477  |    71.18     0.04  | gnl|uv|U09128.1:15891-16011-49     ctg7180002047604       # pSacBII P1 cloning vector 
 cat 9-terminator/bt.posmap.utgctg | grep 7180002047604  perl -ane '@F[2,3]=@F[3,2] if($F[2]>$F[3]); print $_ if($F[2]<215495 and 215495<$F[3] or $F[2]<215615 and 215615<$F[3]);'
 7180000441711   7180001872124   214458  219678  r
 cat 9-terminator/bt.posmap.frgctg | grep 7180002047604  | perl -ane '@F[2,3]=@F[3,2] if($F[2]>$F[3]); print $_ if($F[2]<215495 and 215495<$F[3] or $F[2]<215615 and 215615<$F[3]);'
 498776751       7180001872124   215425  216426  r
 1236502885      7180001872124   215514  216377  r
 379408823       7180001872124   215572  216340  f
 1244436224      7180001872124   215388  216405  f
 1237471071      7180001872124   215229  216234  r
 1233297450      7180001872124   215234  216046  f
 1233363357      7180001872124   215267  215687  f
 937200686       7180001872124   215300  216129  r
 937254901       7180001872124   215321  216160  f
 1233294025      7180001872124   215383  216204  r
 1237446444      7180001872124   215146  216187  f
 1232033776      7180001872124   215193  215996  r
 671976381       7180001872124   215035  216021  r
 514932286       7180001872124   215043  216008  f
 500723879       7180001872124   215043  215802  f
 671927656       7180001872124   215116  215733  r
 381173692       7180001872124   214947  215877  r
 1233303570      7180001872124   214963  215803  f
 1232037705      7180001872124   214990  215803  f
 490852264       7180001872124   214923  215843  f
 1237447184      7180001872124   214684  215646  f
 668822243       7180001872124   214586  215572  f
 #22 reads ; ~half come from BCM SHOTGUN RP42 VVFOP


maxmatch ctg

Parameters:

 nucmer -maxmatch -l 40 -c 100 -b 10 -g 5 -d 0.05 ...
 AllVec: UniVec_Core + 100 more vector seqs

Length of UMD2.4 contigs that contain contaminant (0+ bp from end):

                           elem       <2000      >2000      min        max        mean       med        n50        sum
 Ecoli.all                 2951*      2602       349        1001       453627     8252       1367       132226     24352779
 UniVec_Core               5387*      4802       585        882        651163     9979       1334       136556     53760575
 OtherVec                  5657       5062       595        882        651163     9726       1320       136556     55021803
 UMD2.cont.other           3976       3430       546        804        651163     11217      1346       130385     44601117         #18 aligned to Acinetobacter; longest is 56467bp

Length of UMD2.4 contigs that contain contaminant (500+ bp from end):

                           elem       <2000      >2000      min        max        mean       med        n50        sum
 Ecoli.all                 182        156        26*        1286       351373     6525       1811       125069     1187706          # 7*   are >5K;  321* come from multi-ctg scaffolds
 UniVec_Core               2532       2220       312*       1065       651163     10593      1481       128344     26821960         # 267* are >5K ; 655* come from multi-ctg scaffolds
 OtherVec                  376        323        53         1184       361749     13278      1508       139997     4992774
 UMD2.cont.other           ...

Length of UMD2.4 contigs that contain contaminant (1000+ bp from end):

                           elem       <2000      >2000      min        max        mean       med        n50        sum
 Ecoli.all                 8          0          8*         4709       351373     73065      31157      351373     584520
 UniVec_Core               11         0          11*        2600       334933     93674      37847      271477     1030414 
 OtherVec                  5          0          5*         3717       271477     131604     111913     228060     658021
 UMD2.cont.other           54         0          54*        2398       522682     110947     88182      189352     5991164
 total                     67* # 18 of them are CONTAINED by UMD2.0 chromosomes

Length of the UMD2.4 contaminant sequence (0+ bp from end):

                           elem       <200       >200       min        max        mean       med        n50        sum
 Ecoli.all                 4775       1610       3165       39         17072      381        236        502        1823278
 UniVec_Core               16985      12380      4605       39         1800       207        132        300        3519080
 OtherVec                  7563       1372       6191       39         1800       509        548        643        3849567
 UMD2.cont.other           6626       343        6283       39         8228       543        573        615        3602329

maxmatch deg

All degenerates aligned are <2K

Length of UMD2.4 deg that contain contaminant (0+ bp from end):

                           elem       <2000      >2000      min        max        mean       med        n50        sum
 Ecoli.all                 1266       1266       0          104        1611       783        833        869        991447
 UniVec_Core               1908       1908       0          147        1510       872        896        910        1664746
 OtherVec                  1963       1963       0          147        1510       872        898        911        1712703
 UMD2.cont.other           1609       1609       0          132        1611       852        892        914        1372106

maxmatch utg

Unitig stats:

 elem       <2000      >2000      min        max        mean       med        n50        sum            
 1707816    1434164    273652     21         138676     2228       937        8002       3805166508   

Parameters:

 nucmer -maxmatch -l 40 -c 100 -b 10 -g 5 -d 0.05 ...

Files:

 /scratch1/bos_taurus/Assembly/2009_0217_CA/nucmer_utg/

Length of UMD2.4 unitigs that align to contaminants

                           elem       <2000      >2000      min        max        mean       med        n50        sum
 Ecoli.all                 4275       4110       165        104        71709      1442       1212       1398       6166566  
 UniVec_Core               7563       7409       154        139        71709      1397       1182       1331       10570512
 OtherVec                  8208       8054       154        139        71709      1370       1159       1308       11248775
 UMD2.cont.other           6094       5849       245        132        53113      1546       1163       1401       9422951    #80 aligned to Acinetobacter; longest is 9114bp
 Contaminants(all above)   10264      9895       369        104        71709      1471       1148       1359       15107544

 Acinetobacter             2306**      0          2306       154        71709      1451       1316       1412      3347230  #2182 already in the Cont set 

Length of UMD2.4 unitigs that have contaminants 500+bp from ends

                           elem       <2000      >2000      min        max        mean       med        n50        sum
 Ecoli.all                 172        156        16         1286       4852       1820       1805       1815       313185
 UniVec_Core               2491       2422       69         1065       71709      1722       1457       1523       4291584
 OtherVec                  364        358        6          1167       71709      1795       1478       1538       653595
 UMD2.cont.other           156        108        48         1213       50248      5344       1838       17518      833673

Length of the UMD2.4 alignments of unitigs to contaminants (unique unitig regions)

                           elem       <200       >200       min        max        mean       med        n50        sum       reads(all unitig reads for unitgs with alignments>1K)
 Ecoli.all                 5975       1686       4289       40         8184       397        268        542        2374366   12112(12142)
 UniVec_Core               8754       1674       7080       40         1801       474        490        645        4153030   26590(26849)
 OtherVec                  8919       1250       7669       40         1801       511        536        629        4562326   30268(30268)
 UMD2.cont.other           6752       896        5856       40         6012       529        555        651        3573528   25006(25328)
 Contaminants(all above)   10992      1396       9596       40         8184       571        573        684        6280759   40351(40699)
 Acinetobacter                                                                                                                     (8286)

40699 reads aligned back to contaminants: nucmer -maxmatch

  • 35919 align
  • 34400 align 100+bp
  • 27742 align 200+bp
  • 14211 align 500+bp

utg 5'& 3'

Unitig stats:

             elem         <200       >200       min        max        mean       med        n50        sum            
utg          1,707,816    81200      1626616    21         138676     2228       937        8002       3805166508   
utg5'&3'     3,334,432    0          3334432    21         199        100        100        100        335263271

Align utg5'&3' to Ecoli.all using:

  • nucmer -l 40 -c 100 -b 10 -g 5 -d 0.05 : 4,275 hits
  • nucmer -l 20 -c 40 : 6,617 hits
  • nucmer -l 20 -c 20 : 23,350
  • blastall  : 2,895,506 out of 3,334,432 (86%) aligned

Acinetobacter contamination

Database:

 ~dpuiu/db/Acinetobacter.all : 7 complete genomes, 19 seqs

Seq len summary:

  elem       min        max        mean       med        n50        sum
  19         2726       4050513    1418094    28279      3904116    26943793

Align all unitigs to Acinetobacter.all; Longest alignments is 8517bp

show-coords Acinetobacter.all-utg.filter-q.delta | sort -nk8 -r | head
   [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [GenBank]                          [UMD2.4 utg]
===============================================================================================================================
  20644    29164  |       62     8578  |     8521     8517  |    98.80  |    94413     8578  |     9.03    99.29  | gi|169786889|ref|NC_010404.1|      utg7180000281954*       [CONTAINS]
3395586  3401299  |     5712        1  |     5714     5712  |    99.79  |  3976747     8015  |     0.14    71.27  | gi|126640115|ref|NC_009085.1|      utg7180000212251
3400344  3404485  |        1     4142  |     4142     4142  |    99.66  |  3976747     9114  |     0.10    45.45  | gi|126640115|ref|NC_009085.1|      utg7180000277331
...

utg7180000281954* -> ctg7180002053982 (28140bp; 78 unitigs)

grep 7180002053982 ../9-terminator/bt.posmap.utgctg | nl

    1  7180000185222   7180002053982   0       3019    f
    2  7180000314302   7180002053982   2151    5706    r
    3  7180001463328   7180002053982   2256    2869    f
    ...
   75  7180000281954*  7180002053982   17862   26442   r
   76  7180001471348   7180002053982   17886   18723   r
   77  7180001468075   7180002053982   17919   18732   f
   78  7180000280508   7180002053982   25672   28140   r


 show-coords UMD2.contaminant.other-ctg.filter-q.delta  | grep 7180002053982
   [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [UMD2.0 contam]   [UMD2.4 ctg]
===============================================================================================================================
   1394     2508  |    28140    27024  |     1115     1117  |    98.75  |     7098    28140  |    15.71     3.97  | 7180003313366      ctg7180002053982
   2561     2871  |    26971    26661  |      311      311  |    97.11  |     7098    28140  |     4.38     1.11  | 7180003313366      ctg7180002053982
   2934     5670  |    26599    23862  |     2737     2738  |    97.99  |     7098    28140  |    38.56     9.73  | 7180003313366      ctg7180002053982
 ...gap...  
   5930     7098  |    17270    16101  |     1169     1170  |    98.46  |     7098    28140  |    16.47     4.16  | 7180003313366      ctg7180002053982
 ...gap...  
    281     1981  |    10335     8635  |     1701     1701  |    99.41  |    13090    28140  |    12.99     6.04  | 7180003320028      ctg7180002053982
   1992     2672  |     8635     7954  |      681      682  |    99.71  |    13090    28140  |     5.20     2.42  | 7180003320028      ctg7180002053982
   3302     5376  |     6719     4643  |     2075     2077  |    99.33  |    13090    28140  |    15.85     7.38  | 7180003320028      ctg7180002053982
   8469     9021  |     4642     4090  |      553      553  |    99.10  |    13090    28140  |     4.22     1.97  | 7180003320028      ctg7180002053982
   9038     9313  |     4073     3798  |      276      276  |    98.55  |    13090    28140  |     2.11     0.98  | 7180003320028      ctg7180002053982
   9780    13090  |     3331       19  |     3311     3313  |    99.19  |    13090    28140  |    25.29    11.77  | 7180003320028      ctg7180002053982
 grep 7180002053982 ../9-terminator/bt.posmap.utgctg | awk '{print $1,$4-$3+1}' | sed 's/^/utg/' >! ctg7180002053982.utgs

 intersect.pl UMD2.contaminant.other-utg.qry_hits ctg7180002053982.utgs | wc -l
 37 # only 37 out  of 78 unitigs were detected

 ctg7180002053982 is Acinetobacter

UMD2.5 (2004_0312_CA; delete 40699 contam reads & 22607 mates )

40699 reads:

  • 25803 mated + 14896 unmated
  • 6392 mated reads had the mate also contaminated

Location:

 /scratch1/bos_taurus/Assembly/2009_0312_CA

UNITIGGER

UNITIG OVERLAP GRAPH INFORMATION
       5322910 : Total number of unitigs
       2595715 : Total number of singleton, contained unitigs
       1869655 : Total number of singleton, non-contained unitigs
        182193 : Total number of non-singleton, spanned unitigs
        675347 : Total number of non-singleton, non-spanned unitigs
      35507162 : Total number of fragments
      35507162 : Total number of fragments in all unitigs
      21641007 : Total number of essential fragments in all unitigs
      13866155 : Total number of contained fragments in all unitigs
  0.0077909501 : Randomly sampled fragment arrival rate per bp
    2511009753 : The sum of overhangs in all the unitigs
    6442095933 : Total number of bases in all unitigs
             0 : Estimated number of base pairs in the genome.
             0 : Total number of contained fragments not connected
                 by containment edges to essential fragments.
Total rho    = 2511009753
Total nfrags = 19563152
Estimated genome length = 0
Estimated global_fragment_arrival_rate=0.007791
Computed global_fragment_arrival_rate =0.007791
Total number of randomly sampled fragments in genome = 23866135
Computed genome length  = 3063315200.000000
Used global_fragment_arrival_rate=0.007791
Used global_fragment_arrival_distance=128.354050

Histogram of the number of base pairs in a chunk
100292 - 138301:     22    # 19 in UMD2.4 
 90020 -  99906:     28    # 23
 80043 -  89676:     90    # 79
 70013 -  79966:    190    # 164
 60010 -  69988:    423
 50008 -  59983:   1016
 40000 -  49998:   2558
 30000 -  39997:   6660
 20000 -  29999:  18927
 10000 -  19999:  57057

CONSENSUS after CGW

  • failed on job 80 : ctg 5706539, len=180,024, 159 unitigs, 1,851 reads

 head 80/bt.cns_contigs.80.failed
 {ICM
 acc:5706539
 pla:P
 len:180024
 cns:
 .
 qlt:
 .
 for:0
 npc:1851
 more  ../9-terminator/bt.asm
 ...
 {CCO
 acc:(7180002022380*,5706539)
 pla:P
 len:180024
 cns:
 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
 ...
 cat 80/bt.cns_contigs.80.failed | countMessages.pl  
 ICM     1
 IUP     159   #unitigs
 IMP     1851  #reads : 1264 are BCM.WGS, 389 are BCM.SHOTGUN ...
 # 254 contig scaffold
 cat ../9-terminator/bt.posmap.ctgscf | grep 7180002041301 | nl
 1    7180002022042   7180002041301   0       14121   f
 ...
 183  7180002022380*  7180002041301   9164814 9344838 f
 ..
 254  7180002022240   7180002041301   12874874        12893460        f
 # the 5 UMD2.4 contigs below have the same number of reads with the ones that matched => CONTAINED
 cat UMD2.4-7180002022380.posma   p.frgctg | grep -v ^$ | awk '{print $2}' | uniq -c
   #reads ctgid
     6 7180001712307
   209 7180002028662
     6 7180002028663
  1552 7180002028664
    21 7180002032323
  1794 total => 1851-1794=57 additional reads
 cat ../../2009_0217_CA/9-terminator/bt.posmap.ctgscf | nl | egrep '7180001712307|7180002028662|7180002028663|7180002028664|7180002032323'
 #nl    ctgid           scfid           start   end     dir        
 34791  7180002028662   7180002069912   1425623 1446022 f
 34793  7180002032323   7180002069912   1448133 1450475 f
 34794  7180002028663   7180002069912   1450495 1451918 f
 34795  7180002028664   7180002069912   1452234 1602973 f
 64110  7180001712307   7180002071598   0       1441    f
  • Solution 1:
 * consensus -Dforceunitigabut => new assembly, new UID's
 ctg7180002022636 179505 38.53     # => ctg7180002022380
 scf7180002041557 12892941 40.58   # => scf7180002041301
  • Solution 2:
 * Reassemble 1851 reads ; clr=ECR2; doOBT=no
 * Asm dir: 
    /scratch1/bos_taurus/Assembly/2009_0312_CA/8-consensus/80.ECR2.asm
 * It contains one 179,530 bp scaffold that has two contigs. 
 * One contig is 156,349 bp and the other one is 23,181 bp. 
 * The estimated gap between them is 231 bp.
 show-coords ctg7180002022636-80.ECR2.filter-r.delta 
      1   156326  |   1   156331  | 156326 156331  | 99.99  | 179505  156349 |    87.09    99.99  | ctg7180002022636   ctg7180000000103        [CONTAINS]
 156345   179505  |  21    23181  |  23161  23161  | 99.99  | 179505  23181  |    12.90    99.91  | ctg7180002022636   ctg7180000000104        [CONTAINS]
 >ctg7180002022636_156327_156344
 TTGTAAAAACCATCCCCT
 # ~ 20 bp unaligned on ctg7180002022636 & Chr1
 show-coords  ctg7180002022636-Chr1.filter-r.delta  | more
 ...
 151839   156326  | 61124826  61120339* |     4488     4488  |    99.91  |   179505 157590899  |     2.50     0.00  | ctg7180002022636  Chr1
 156347   157501  | 61111219* 61110064  |     1155     1156  |    99.91  |   179505 157590899  |     0.64     0.00  | ctg7180002022636  Chr1
 ...
 # 2 UMD2.0 ctg & 2 UMD2.0 deg in this region
 more Chr1.agp
 ...
 Chr1    61110064        61111482        3579    W       deg0003139347   1       1419    +
 Chr1    61111483        61114130        3580    N       2648    fragment        yes
 Chr1    61114131        61115490        3581    W       deg0002967451   1       1360    +
 Chr1    61115491        61118114        3582    N       2624    fragment        yes
 Chr1    61118115        61120117        3583    W       7180002846553   1       2003    +
 Chr1    61120118        61120217        3584    U       100     fragment        yes
 Chr1    61120218        61145567        3585    W       7180003318962   1       25350   +
 ...

QC

Lengths:

             elem       <2000      >=2000     min        max        mean       med        n50        sum
 scf         39978*     31311      8667       316        34167202   68129      1360       8217662    2723691675
 ctg         90135*     36140      53995      65         1160130    29693      5124       95988      2676390147
 deg         251413     249285     2128       65         39964      1003       984        994        252279234
 utg         1689033    1419729    269304     21         138676     2242       936        8213       3788090224


              elem       <0         0          >0         min        max        mean       med        n50        sum
 gaps(ca2scf) 50157      10759      3296       36102      -20        177144     929        20         34357      46620040
 gaps(posmap) 50157      0          0          50157      20         177144     943        20         34065      47301528


Fragment happiness:

 placed good       27263138
 chaff  bothChaff  2467462
 placed notMated   2050084
 placed oneChaff   732517
 chaff  oneChaff   732517
 placed oneSurrogate     555510
 placed bothDegen        465114
 chaff  notMated         434052
 placed diffScaffold     369,294 *
 placed oneDegen         213,768 *
 placed badSame          96862
 placed badLong          76640
 placed badOuttie        41142
 placed badShort         5784
 placed bothSurrogate    3278

Mate happiness:

 good            13631569
 bothChaff       1233731
 oneChaff        732517
 oneSurrogate    277755
 bothDegen       232557
 diffScaffold    184647
 oneDegen        106884
 badSame         48431
 badLong         38320
 badOuttie       20571
 badShort        2892
 bothSurrogate   1639

Scaffold zero read/mate cvg regions:

                       elem       <2000      >2000      min        max        mean       med        n50        sum
 read                  57011      55048      1963       1          177144     913        57         29302      52084484
 mate                  10507      8945       1562       1          30014      996        493        2367       10466518

Scaffold 10K+ zero read/mate cvg regions (2K+ inside) (some might be a result of surrogates?):

                       elem       <2000      >2000      min        max        mean       med        n50        sum
 read                  51747      49878      1869       1          177144     958        38         32625      49613432
 mate                  1290       201        1089       15         30014      3560       3047       4017       4593586
 mate(ignore seq len)  2599       1051       1548       1          72706      3321       2541       5513       8633691

Contaminant search

ctg

               elem       <0         >0         min        max        mean       med        n50        sum
 ctg           90135      0          90135      65         1160130    29693      5124       95988      2676390147
 nucmer -maxmatch -l 40 -c 100 -b 10 -g 5 -d 0.05 ...
 #OLD table
                  elem       <2000      >2000      min        max        mean       med        n50        sum
 Ecoli.all        71         66         5          1006       129770     4180       1127       45234      296830
 UniVec_Core      120        111        9          1000       426540     7366       1130       100944     884023
 OtherVec         121        112        9          1000       426540     7314       1127       100944     885022
 UMD2.cont.other  98(-3)     83         15(-3)     1000       426540     13523      1190       199700     1325332  # 3 are 1000bp+ from ctg ends; these are actually "fake" contaminants
                                                                                                                   # 7 are Acinetobacter baumannii min=1059 , max=9765 
 total            152*(-3)   133        19*(-3)    1000       426540     10649      1150       199700     1618735
 
 Acinetobacter    65         53         12         1013       44359      2586       1212       3847       168130   # 46 out of 65 are in the 152* set; 19 are new; 13 have lots of alignments to other contigs (probably fake contaminants)
 total(new)       171*(-3)   144        27*(-3)    1000       426540     10013      1189       129770     1712376  # 65 are Acinetobacter and should be removed
 cat UMD2.contaminant.other-ctg.filter-q.coords | grep Acinetobacter
                                                                                                                    UMD2.0                     UMD2.5
      1      285  |      285        1  |      285      285  |    99.65  |      287     8096  |    99.30     3.52  | 7180003292866_1_288        ctg7180002015457        [CONTAINED]     Acinetobacter baumannii
   1422     2500  |     1078        1  |     1079     1078  |    99.63  |     7098     1078  |    15.20   100.00  | 7180003313366              ctg7180001706852        [CONTAINS]      Acinetobacter baumannii
   2934     3940  |     1008        1  |     1007     1008  |    98.61  |     7098     1059  |    14.19    95.18  | 7180003313366              ctg7180001709709                        Acinetobacter baumannii
   6281     7098  |        1      818  |      818      818  |    99.76  |     7098     1553  |    11.52    52.67  | 7180003313366              ctg7180001716052        [END]           Acinetobacter baumannii
      1      790  |      790        1  |      790      790  |   100.00  |     1822     9765  |    43.36     8.09  | 7180003319195_8956_10778   ctg7180002015485        [BEGIN]         Acinetobacter calcoaceticus
    285     1981  |        1     1697  |     1697     1697  |    99.59  |    13090     1856  |    12.96    91.43  | 7180003320028              ctg7180001706656*                       Acinetobacter baumannii
   1992     2148  |     1697     1856  |      157      160  |    98.12  |    13090     1856  |     1.20     8.62  | 7180003320028              ctg7180001706656*                       Acinetobacter baumannii
  12210    13090  |       63      943  |      881      881  |    99.89  |    13090     2556  |     6.73    34.47  | 7180003320028              ctg7180002007423                        Acinetobacter baumannii
 # 7 Acinetobacter baumannii ctgs
 # no Serratia "best hits"
 # 3 mitochondrion ctgs, all < 2Kbp

Delete summary: 65 Acinetobacter ctgs + 91 contaminant ctgs <2000bp => 156 ctgs , 152 scf => 4105 reads

 ctgs       <2000      >2000      min        max        mean       med        n50        sum       reads
 156        144        12         1000       44359      1782       1150       1483       278009    4105        # all ctg

Trim summary: 12 contigs >=2000bp & 44 reads that overlap at least 10bp

 ctgs       <2000      >2000      min        max        mean       med        n50        sum        reads
 12         0          12         7072       426540     78470      45234      129770     941646     ?           # all ctg
 12         12         0          172        935        532        618        750        6393       44          # ctg regions

Files

/scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_ctg/TO_DELETE/ctg.delete.uid
/scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_ctg/TO_DELETE/scf.delete.uid
/scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_ctg/TO_TRIM/ctg.trim.uid

ctg 5'&3'

               elem       <0         >0         min        max        mean       med        n50        sum
 ctg53         180044     0          180044     65         598        300        300        300        54033229
 nucmer -maxmatch -l 17 -c 35 ...
                  #ctgEnds   #ctgs      min        max        mean       med        n50        sum         
 Ecoli.all        180        149        300        300        300        300        300        54000 
 UniVec_Core      312        277        300        300        300        300        300        93600
 OtherVec         1211       1167       300        553        300        300        300        363989
 UMD2.cont.other  15689      14693      257        598        300        300        300        4712162

deg

 nucmer -maxmatch -l 40 -c 100 -b 10 -g 5 -d 0.05 ...
                  elem       <2000      >2000      min        max        mean       med        n50        sum
 Ecoli.all        387        387        0          131        1099       756        806        835        292892
 UniVec_Core      569        569        0          101        1115       763        822        843        434400
 OtherVec         579        579        0          101        1115       752        819        840        435549
 UMD2.cont.other  539        539        0          131        1483       792        838        873        427408
                 
 total            810*       810        0*         101        1483       784        838        869        63547

Scaffolds vs UMD2.0 chromosome alignments

Directory:

 /scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_scf

Depening on the ref/qry seq and nucmer parameters, the number of unaligned gaps in UMD2.0 can vary between:

 101M: REF=Chr,       QRY=scf,     nucmer -l 100 -c 500 
 6M:   REF=ChrPlaced, QRY=scf-deg, nucmer -maxmatch -l 50 -c 250 

nucmer -l 100 -c 500

Chr-scf.summary

                                          elem       <2000      >2000      min        max        mean       med        n50        sum
 Chr-scf.qry_hits                         32901      24546      8355       723        34167202   82494      1405       8217662    2714164100
 Chr-scf.qry_nohits                       7077       6765       312        316        12006      1346       1239       1291       9527056
 Chr-scf.10K.qry_hits2+                   574        0          574        10308      34167202   4006753    1887309    9586144    2299876795
 Chr-scf.0cvg                             144712     125933     18779      1          102265     900        178        2968       130248709
 Chr-scf.0cvg.clean                       148556     143283     5273       1          39625      683        280        1363       101526883(101M)

Chr-scf-deg.summary

                                          elem       <2000      >2000      min        max        mean       med        n50        sum
 Chr-scf-deg.qry_hits                     210225     199781     10444      501        34167202   13785      1007       7328685    2898141592
 Chr-scf-deg.qry_nohits                   81166      80815      351        65         12006      958        972        989        77828798
 Chr-scf-deg.10K.qry_hits2+               574        0          574        10308      34167202   4006753    1887309    9586144    2299876795
 Chr-scf-deg.0cvg                         175952     168553     7399       1          22067      445        120        1329       78433265
 Chr-scf-deg.0cvg.clean                   133809     132381     1428       1          20512      371        124        1101       49711440(49M)

ChrPlaced-scf.summary

                                          elem       <2000      >2000      min        max        mean       med        n50        sum
 ChrPlaced-scf.qry_hits                   19488      13112      6376       723        34167202   137773     1527       8428844    2684927057
 ChrPlaced-scf.qry_nohits                 20490      18199      2291       316        192648     1891       1276       1569       38764099
 ChrPlaced-scf.10K.qry_hits2+             139        0          139        10486      31959312   6786671    4979278    12956086   943347316
 ChrPlaced-scf.0cvg                       76271      71816      4455       1          102265     568        179        1710       43339413
 ChrPlaced-scf.0cvg.clean                 72865      70541      2324       1          39625      356        94         1413       25951987(25M)

ChrPlaced-scf-deg.summary

                                          elem       <2000      >2000      min        max        mean       med        n50        sum
 ChrPlaced-scf-deg.qry_hits               130125     122000     8125       501        34167202   21530      1009       7515049    2801670853
 ChrPlaced-scf-deg.qry_nohits             161266     158596     2670       65         192648     1080       987        1012       174299537
 ChrPlaced-scf-deg.10K.qry_hits2+         139        0          139        10486      31959312   6786671    4979278    12956086   943347316
 ChrPlaced-scf-deg.0cvg                   79041      76374      2667       1          22067      395        157        948        31251753
 ChrPlaced-scf-deg.0cvg.clean             69012      68271      741        1          20512      200        81         592        13864328(13M)

nucmer -maxmatch -l 100 -c 500

Dir:

 /scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_scf.2

ChrPlaced-scf-deg.summary

                                          elem       <2000      >2000      min        max        mean       med        n50        sum
 ChrPlaced-scf-deg.qry_hits               130510     122377     8133       501        34167202   21470      1009       7515049    2802100771
 ChrPlaced-scf-deg.qry_nohits             160881     158219     2662       65         192648     1080       986        1012       173869619
 ChrPlaced-scf-deg.10K.qry_hits2+         120        0          120        20022      31959312   7587296    5639522    13010806   910475551
 ChrPlaced-scf-deg.0cvg                   82159      80425      1734       1          7002       321        145        647        26444796
 ChrPlaced-scf-deg.0cvg.clean             111645     111546     99         1          6248       81         13         272        9057424(9M)

nucmer -maxmatch -l 50 -c 250

Dir:

 /scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_scf.3

ChrPlaced-scf-deg.summary

                                          elem       <2000      >2000      min        max        mean       med        n50        sum
 ChrPlaced-scf-deg.qry_hits               204673     195653     9020       251        34167202   14088      1005       7329288    2883625712
 ChrPlaced-scf-deg.qry_nohits             86718      84943      1775       65         192648     1064       970        1007       92344678
 ChrPlaced-scf-deg.10K.qry_hits2+         148        0          148        10486      31959312   6814292    5135095    12792673   1008515300
 ChrPlaced-scf-deg.0cvg                   86085      84614      1471       1          4565       279        123        557        24101912
 ChrPlaced-scf-deg.0cvg.clean             113796     113791     5          1          2822       59         7          176        6714556(6M)

nucmer -maxmatch -l 50 -c 250 ; delta-fileter -q

Dir:

 /scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_scf.3
 ChrPlaced-scf-deg.filter-q.summary
                                          elem       <2000      >2000      min        max        mean       med        n50        sum
 ChrPlaced-scf-deg.qry_hits               204673     195653     9020       251        34167202   14088      1005       7329288    2883625712
 ChrPlaced-scf-deg.qry_nohits             86718      84943      1775       65         192648     1064       970        1007       92344678
 ChrPlaced-scf-deg.10K.qry_hits2+         118        0          118        20022      31959312   7633834    5639522    13010806   900792422
 ChrPlaced-scf-deg.0cvg                   77864      73686      4178       1          28711      529        181        1523       41240419
 ChrPlaced-scf-deg.0cvg.clean             74172      72150      2022       1          28331*     321        89         1415       23852952(23M)

Max gap is 28331; Duplicate region in UMD2.0?

 ChrPlaced-scf-deg.coords
   70739616 70767946  |  2993389  2965048  |    28331    28342  |    99.17  | 85187327 19514159  |     0.03     0.15  | Chr15      scf7180002041107
   70768054 70808299  |  3005305  2965048  |    40246    40258  |    99.56  | 85187327 19514159  |     0.05     0.21  | Chr15      scf7180002041107
 =>
 ChrPlaced-scf-deg.filter-q.coords
   70768054 70808299  |  3005305  2965048  |    40246    40258  |    99.56  | 85187327 19514159  |     0.05     0.21  | Chr15      scf7180002041107

Markers

ALL:

 head /fs/szasmg3/bos_taurus/UMD_Freeze2.5/markers/markers_contigs_Art.txt
 Marker        Chr_BTA Pos(K)  Pos_from Pos_to UMD_Ctg_Pos     Match_Len       %IDY    %Match  UMD_Ctg_name
 BZ945871      1       47501   1       95001   7622            515             100.00  99.61   ctg7180002007845
 BZ953651      1       80001   47501   112501  10786           700             99.57   100.00  ctg7180002026484
 CC504788      1       118751  80001   157501  54583           862             100.00  100.00  ctg7180002026483
 CC484491      1       123751  90001   157501  50169           77              98.72   100.00  ctg7180002026482
 CZ415082      1       125001  92501   157501  75850           507             99.21   99.80   ctg7180002026483
 CC475154      1       130001  97501   162501  40013           666             99.25   100.00  ctg7180002026482
 CC561114      1       182501  145001  220001  1130            709             99.02   100.00  ctg7180002026482
 CC578374      1       190001  155001  225001  170145          647             100.00  100.00  ctg7180002026481
 BZ911787      1       278751  232501  325001  na              na              na      na      na
 ...
  • 126,014 markers & 90,135 ctgs total
  • 107,271 markers align to 31,407 ctg & 2640 scf:
    • 85% of the markers align to 85% of the ctg sequence
    • avg distance between markers is 25Kbp
  • 188 questionable ctgs & 219 questionable scf (2 out of 4 disagreeing markers)

UNIQ:

 head /fs/szasmg3/bos_taurus/UMD_Freeze2.5/markers/markers_contigs_Art.unique_only.txt | p 'print "  ",$_;'
 Marker Chr_BTA Pos(Kbp) CI_Pos_from CI_Pos_to UMD_Scaff_Pos Match_Len %IDY %Matched UMD_Scaff_name
 BZ945871 1 52251 1 104501 na na na na na
 BZ953651 1 88001 52251 123751 na na na na na
 CC504788 1 130626 88001 173251 54583 862 100.00 100.00 ctg7180002026483
 CC484491 1 136126 99001 173251 50169 77 98.72 100.00 ctg7180002026482
 CZ415082 1 137501 101751 173251 na na na na na
 CC475154 1 143001 107251 178751 40013 666 99.25 100.00 ctg7180002026482
 CC561114 1 200751 159501 242001 1130 709 99.02 100.00 ctg7180002026482
 CC578374 1 209001 170501 247501 170145 647 100.00 100.00 ctg7180002026481
 BZ911787 1 306626 255751 357501 na na na na na
 ...
  • 93,508 markers align to 28,752 ctgs & 1,476 scf:
  • 109 questionable ctgs & 153 questionable scf (2 out of 4 disagreeing markers)


Scripts:

 ~/bin/marker2pos.pl markers_contigs_Art.unique_only.txt | sed 's/ctg//' |  sort -nk1 -nk2  > markers_ctg.pos
 ~/bin/translatePosMap.pl markers_ctg.pos bt.posmap.ctgscf | ~/bin/tab2tab.pl > markers_scf.pos

Ctg vs markers summary:

                                        #ctg       <10000     >10000     min        max        mean       med        n50        sum         file
 ctg (all)                              90135*     51024      39111      65         1160130    29693      5124       95988      2676390147
 no markers                             58728      48324      10404      65         322949*    6573       1597       21989      386064754
 
 markers from 1+ Chr                    31407      2700       28707      442        1160130    72924      52693      111252     2290325393  markers_ctg.Chr.count
 markers from 2+ Chr                    2987       25         2962       1002       1160130    132480     104807     179692     395718221   markers_ctg.Chr.count2+
 2+ markers from 2+ Chr                 26         0          26*        15228      604155     221354     192182     298848     5755227     markers_ctg.Chr.count2.2+
 2+ adjacent markers from 2+ Chr        15         0          15**       15228      368879     202728     194749     294623     3040932     markers_ctg.Chr.count2+a

Scf vs markers summary:

                                        #scf       <10000     >10000     min        max        mean       med        n50        sum
 scf(all)                               39978*     37135      2843       316        34167202   68129      1360       8217662    2723691675
 no markers                             37338      36038      1300       316        754615*    2601       1336       3957       97140879
 
 markers from 1+ Chr                    2640       1097       1543       1000       34167202   994905     16220      8661690    2626550796  markers_scf.Chr.count 
 markers from 2+ Chr                    552        10         542        1002       34167202   4526814    2714036    9167014    2498801557  markers_scf.Chr.count2+
 2+ markers from 2+ Chr                 212        0          212*       15228      34167202   8579232    7358307    10521496   1818797327  markers_scf.Chr.count2.2+
 2+ adjacent markers from 2+ Chr        38         0          38**       15228      25078118   8419544    7176534    13458592   319942681   markers_scf.Chr.count2+a

212* scaffolds

 scf_id        scf_len   #Chr/2+markers  #/Chr/2+adjmarkers  reads
 7180002041381 31959312  15              0                   469503
 7180002041358 25078118  13              2                   291163
 7180002041386 21280754  12              0
 ...

scf7180002041381.1

  • no low cvg regions in the middle
  • 1281 markers: 1231 on Chr4, 12 on Ch11
 #1- mate cvg regions: at the ends !!!
 #scfid          begin           end             scf_len         cvg_len cvg
 7180002041381   1               1173            31959312        1173    0
 7180002041381   1174            1454            31959312        281     1
 7180002041381   31959139        31959312        31959312        174     1
 

scf7180002041358.2

  • one low cvg region & real break
  • 1111 markers: 869 on Chr14, 193 on Chr26, ...
 #1- mate cvg regions: middle
 #scfid          begin           end             scf_len         cvg_len cvg
 7180002041358   20970531        20970827        25078118        297     1
 7180002041358   20970828        20970949        25078118        122     0
 7180002041358   20970950        20971112        25078118        163     1
 # markers in the regions
 #scfid          begin           end             makerid         Chr
 7180002041358   20964100        20964829        BZ839784        14
 7180002041358   21002368        21003219        CC527932        26

scf7180002041386.3

  • one low cvg region but no markers in that region
  • 939 markers: 902 on Chr24, 3 on Chr6 ...
 #1- mate cvg regions: at the ends
 #scfid          begin   end     scf_len         cvg_len cvg
 7180002041386   26382   27557   21280754        1176    1
 7180002041386   27558   27607   21280754        50      0
 7180002041386   27608   28031   21280754        424     1

...

scf7180002041368.128

  • one low cvg region
 #scfid          begin    end      scf_len         cvg_len  cvg
 7180002041368   1356953  1357221  1404851         269      1
  • markers:Chr7 & Chr5 at 3'
 #scfid          begin    end      markerid        Chr
 7180002041368   10436    11033    BZ840669        7
 ...
 7180002041368   1278950  1279444  CZ404867        7
 7180002041368   1319489  1320264  CC593534        7
 7180002041368   1344001  1344347  BZ885865        7 *
 7180002041368   1371526  1372308  BZ867572        5 *
 7180002041368   1386933  1387750  CC534893        5
  • subassembly: extract all reads & extra mates(265) and reassemble
    • Msg Counts
 DST     40
 FRG     13762 
 LKG     6482
    • qc
 [Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID]
 0=37,1211999,1358964,32757,4082,7180000000571
 1=2,53844,53824,26922,-20,7180000000566
 2=4,50178,50238,12544,20,7180000000570
 3=1,28553,28553,28553,0,7180000000568
 4=1,17195,17195,17195,0,7180000000569
 total=45,1361769,1508774,30262,3675

show-coords scf7180002041368-7180002041368.update.scf.filter-q.delta

...
1269662  1338725  |    69065        1  |    69064    69065  |    99.97  |  1404851  1359628  |     4.92     5.08  | scf7180002041368   scf7180000000571
1340619  1341912  |        1     1295  |     1294     1295  |    98.77  |  1404851     1295  |     0.09   100.00  | scf7180002041368   scf7180000000574        [CONTAINS]
1345570  1347105  |     1537        1  |     1536     1537  |    98.18  |  1404851     1537  |     0.11   100.00  | scf7180002041368   scf7180000000572        [CONTAINS]
1347683  1356952  |    12617     3352  |     9270     9266  |    99.85  |  1404851    12617  |     0.66    73.44  | scf7180002041368   scf7180000000567
1356901  1361861  |     1767     6728  |     4961     4962  |    99.96  |  1404851    50318  |     0.35     9.86  | scf7180002041368   scf7180000000570
...

scf7180002041061.187

  • infoseq
 scf7180002041061 509987 39.73
  • 22 markers
 #scfid        begin   end     makerid         Chr
 7180002041061 2061    2904    BZ847430        15
 7180002041061 8979    9756    CC481592        15
 7180002041061 10856   11485   BZ839377        15
 7180002041061 47578   48186   BZ836581        15
 7180002041061 80485   81237   CC553472        15
 7180002041061 117811  118488  CC918533        15
 7180002041061 151253  152009  CC477055        15
 7180002041061 213959  214640  CC572066        15 *
 7180002041061 236304  236479  CC550436        29 *
 7180002041061 242590  243462  BZ848041        29
 7180002041061 267880  268614  BZ877402        29
 7180002041061 268891  269493  CC580499        29
 7180002041061 282248  282861  BZ885584        29
 7180002041061 295221  295987  CC572941        29
 7180002041061 337785  338517  CC923898        29
 7180002041061 338464  339092  BZ921430        29
 7180002041061 341219  341971  CZ415932        29
 7180002041061 382509  383281  CC581748        29
 7180002041061 387548  388176  CC572064        29
 7180002041061 415263  415788  BZ878185        29
 7180002041061 470138  470773  CC558303        29
 7180002041061 493006  493794  CC565740        29
  • mate happiness:
 cat 7180002041061.posmap.mates | cut -f3 | count.pl
 good            2501
 diffScaffold    86
 oneSurrogate    69
 oneChaff        40
 oneDegen        11
 badSame         9
 badOuttie       4
 badLong         4
  • low cvg region:
 #scfid          begin   end     scf_len   cvg_len  cvg
 7180002041061   232849  232871  509987    23       1

 #it is in the middle of a unitig
 intersectPos.pl -i 1 7180002041061.posmap.utgscf 7180002041061.posmap.frgscf.10K.2K.mate_cvg.1-
 7180000849321   7180002041061   229782  238910  r

7180002041061.0cvg.png


  • subassembly: extract all reads & mates(139) and reassemble
    • Msg Counts
 DST     38
 FRG     5769
 LKG     2724

Dirs:

 /scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scf7180002041061.187.mates/asm.ECR2
 /scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scf7180002041061.187.mates/asm.ECR2.mates

Qc stats:

             asm.ECR2                                asm.ECR2.bog
  ...
 [Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID]
 0          3,277300,277925,92433,312,7180000000278  5,279938,399837,55988,29975,7180000000155
 1          6,234445,280546,39074,9220,7180000000277 6,234463,280485,39077,9204,7180000000156
 2          1,1630,1630,1630,0,7180000000276         1,1799,1799,1799,0,7180000000154
 3          NA                                       1,1674,1674,1674,0,7180000000153
 4          NA                                       1,1183,1183,1183,0,7180000000158
 total      10,513375,560101,51338,6675              14,519057,684978,37076,18436
  • Some of the 139 mates assemble into scaffolds
  • There is slightly more sequence in the bog assembly
  • Mean/Max bog utg size is twice larger than default utg (scf, ctg sizes are about the same)
  • align asm.ECR2 scaffolds to scf7180002041061
    • 2919 bp at the 3' end of the new scf7180000000277 don't align
    • 1990 bp at the 3' end of the new scf7180000000278 don't align
    • new scf7180000000277 & new scf7180000000278 align for ~ 1058bp
    • most of the mated read added assembled at new scf7180000000277 & new scf7180000000278 3'
 nucmer -l 100 -c 500 ../scf7180002041061.fasta 9-terminator/7180002041061.update.scf.fasta -p scf7180002041061-7180002041061.update.scf
 
 delta-filter -q scf7180002041061-7180002041061.update.scf.delta > scf7180002041061-7180002041061.update.scf.filter-q.delta
 
 show-coords scf7180002041061-7180002041061.update.scf.filter-q.delta
      1    46099  |        1    46099  |    46099    46099  |    99.99  |   509987   280626  |     9.04    16.43  | scf7180002041061   scf7180000000277
  46273    48914  |    46120    48761  |     2642     2642  |   100.00  |   509987   280626  |     0.52     0.94  | scf7180002041061   scf7180000000277
  50810    56904  |    95425   101521  |     6095     6097  |    99.43  |   509987   280626  |     1.20     2.17  | scf7180002041061   scf7180000000277
  57301   233303  |   101699   277707  |   176003   176009  |    99.97  |   509987   280626  |    34.51    62.72  | scf7180002041061   scf7180000000277 *
 
 232245   444418  |   275975    63802  |   212174   212174  |    99.99  |   509987   277965  |    41.60    76.33  | scf7180002041061   scf7180000000278 * 
 444964   469975  |    63156    38151  |    25012    25006  |    99.97  |   509987   277965  |     4.90     9.00  | scf7180002041061   scf7180000000278
 449967   479221  |    58152    28900  |    29255    29253  |    99.96  |   509987   277965  |     5.74    10.52  | scf7180002041061   scf7180000000278
 480298   509987  |    29688        1  |    29690    29688  |    99.99  |   509987   277965  |     5.82    10.68  | scf7180002041061   scf7180000000278

scf7180002041163.214

  • infoseq
 scf7180002041234 10336067 41.93
  • Markers
 #scfid          begin   end     makerid         Chr
 ...
 7180002041163   770783  771508  CC562874        30 *
 7180002041163   791050  791484  BZ887794        11 *
 ...
  • Markers on different contigs !!!
 cat 7180002041163.posmap.ctgscf
 ..
 7180001926720   7180002041163   750392  787668  f
 7180001926721   7180002041163   787911  815738  f
 ...
  • subassembly: extract all reads & extra mates(293) and reassemble
    • Msg Counts
 DST     39
 FRG     17287
 LKG     8113
    • qc
 [Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID]
 0=54,1742235,1760846,32264,351,7180000000507
 1=1,6224,6224,6224,0,7180000000506
 2=1,1472,1472,1472,0,7180000000508
 total=56,1749931,1768542,31249,351
  • alignments :
 show-coords scf7180002041163-7180002041163.update.scf.filter-q.delta
   1748     7972  |     6224        1  |     6225     6224  |    99.84  |  1757217     6224  |     0.35   100.00  | scf7180002041163   scf7180000000506        [CONTAINS]
   4012     5490  |        1     1470  |     1479     1470  |    98.85  |  1757217     1472  |     0.08    99.86  | scf7180002041163   scf7180000000508        [CONTAINS]
   4567    18273  |        1    13697  |    13707    13697  |    99.92  |  1757217  1761853  |     0.78     0.78  | scf7180002041163   scf7180000000507 
 ...
 750393   787668  |   757092   794367  |    37276    37276  |   100.00  |  1757217  1761853  |     2.12     2.12  | scf7180002041163   scf7180000000507
 787912   815738  |   794644   822455  |    27827    27812  |    99.81  |  1757217  1761853  |     1.58     1.58  | scf7180002041163   scf7180000000507
 ...
1755928  1757217  |  1760564  1761853  |     1290     1290  |   100.00  |  1757217  1761853  |     0.07     0.07  | scf7180002041163   scf7180000000507

Manually curated

Markers within 50K of a low mate cvg region

  • 13 scaffolds (22 before)
  • 14 breaks : 9 on the same contig , 2 on adjacent contigs , 3 on non adjacent contigs
  • File
 /scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scfproblems.mates//scfproblems
  • scfproblems.markers.txt
 #scfid        begin   end     markerid        Chr(break)      comment
 7180002040911 2210713 2211037 CZ405316        29              half Chr30, half Chr29 ; SAME ctg 
 7180002041061 213959  214640  CC572066        15              half Chr29, half Chr15; SAME ctg
 7180002041103 5296272 5297045 CC587675        15              Chr18 & Chr15 at 3'; diff ctg (2 ctgs in between)
 7180002041107 7302203 7302874 CC906829        15              half Chr9, half Chr15; SAME ctg
 7180002041200 4920833 4921695 BZ877250        20              half Chr20, half Chr2; SAME ctg
 7180002041216 307121  307653  CL609365        5               Chr5 at 5' & Chr15; SAME ctg
 7180002041259 4315523 4316178 BZ922220        3               Chr5, Chr3, Chr24 !!! ; diff ctgs (4 ctgs in between)
 7180002041281 2884855 2885652 CC538348        19              half Chr19, half Chr28 ; SAME ctg
 7180002041315 2196754 2197298 BZ838379        1               Chr13 & Chr1 at 3'; diff ctg (2 ctgs in between)
 7180002041325 2597590 2598283 CC472212        10              half Chr10, half Chr5; SAME ctg
 7180002041348 3028112 3028921 CC531996        8               half Chr8, half Chr7; SAME ctg
 7180002041356 6828214 6828925 CC531427        16              half Chr16, half Chr25; SAME ctg
 7180002041358 21002368 21003219 CC527932      26              Chr14 & Chr26 at 3' (long chunck); diff ctg
 

Details:

 #scfid          begin   end     markerid        Chr     #ctgid          begin   end     markerid        Chr
 7180002040911   2189844 2190490 BZ923031        30      7180001730191   42851   43497   BZ923031        30      *
 7180002040911   2210713 2211037 CZ405316        29      7180001730191   63720   64044   CZ405316        29      *
 --
 7180002041061   213959  214640  CC572066        15      7180001852904   29927   30608   CC572066        15      *
 7180002041061   236304  236479  CC550436        29      7180001852904   52272   52447   CC550436        29      *
 --
 7180002041103   5220856 5221407 CG984741        18      7180001854649   21126   21677   CG984741        18      *
 7180002041103   5296272 5297045 CC587675        15      7180001854651   837     1610    CC587675        15      *
 --
 7180002041107   7302203 7302874 CC906829        15      7180001855003   77842   78513   CC906829        15      **
 7180002041107   7311399 7312254 CC479102        9       7180001855003   87038   87893   CC479102        9       **
 --
 7180002041200   4940131 4940949 CC500137        20      7180002002029   31935   32753   CC500137        20      **
 7180002041200   4956412 4957105 CZ428497        2       7180002002029   48216   48909   CZ428497        2       **
 -- 
 7180002041216   307121  307653  CL609365        5       7180002003578   6871    7403    CL609365        5       *
 7180002041216   310832  311546  CL865591        15      7180002003578   10582   11296   CL865591        15      *
 --
 7180002041259   4224828 4225527 CC920177        5       7180002012718   112862  113561  CC920177        5       *
 7180002041259   4315523 4316178 BZ922220        3       7180002012722   19376   20031   BZ922220        3       *
 --
 7180002041259   6220638 6221268 BZ869532        3       7180002012752   199799  200429  BZ869532        3       *
 7180002041259   6239728 6240375 CZ413142        24      7180002012753   2600    3247    CZ413142        24      *
 --
 7180002041281   2927406 2928016 CC573399        19      7180002018361   74539   75149   CC573399        19      **
 7180002041281   2938896 2939696 CC513914        28      7180002018361   86029   86829   CC513914        28      **
 --
 7180002041315   2152291 2153097 BZ836343        13      7180001862977   3397    4203    BZ836343        13      *
 7180002041315   2196754 2197298 BZ838379        1       7180002024308   7255    7799    BZ838379        1       *
 --
 7180002041325   2608389 2609213 CC506736        10      7180002025880   237956  238780  CC506736        10      **
 7180002041325   2638591 2639207 CC770009        5       7180002025880   268158  268774  CC770009        5       **
 --
 7180002041348   3044263 3044902 BZ872906        8       7180002030033   152672  153311  BZ872906        8       **
 7180002041348   3092547 3093121 BZ924509        7       7180002030033   200956  201530  BZ924509        7       **
 --
 7180002041356   6828214 6828925 CC531427        16      7180001722964   12571   13282   CC531427        16      *
 7180002041356   6832720 6833356 CC494876        25      7180001722964   17077   17713   CC494876        25      *
 --
 7180002041358  20964100 20964829  BZ839784      14      7180001723456   140919  141648  BZ839784        14      *
 7180002041358  21002368 21003219  CC527932      26      7180001723457   15983   16834   CC527932        26      *
  • scfproblems.low_mate_cvg.txt
 #scfid                begin           end             ctglen          len             mate_cvg
 7180002040911         2205494         2205582         3008363         89              1
 7180002041061         232849          232871          509987          23              1
 7180002041103         5285314         5285479         5341215         166             1
 7180002041107         7306552         7306653         19514159        102             1
 7180002041200         4954073         4954180         16995932        108             1
 7180002041216         307614          307741          1444165         128             0
 7180002041259         4313297         4313311         15600612        15              0
 7180002041281         2933371         2933618         12907599        248             1
 7180002041315         2193142         2193167         2232736         26              1
 7180002041325         2623956         2624190         11153784        235             1
 7180002041348         3064633         3064750         19180127        118             1
 7180002041356         6831454         6831811         14697197        358             1
 7180002041358         20970950        20971112        25078118        163             1

No low cvg regions

  • 9 scaffolds
  • 14 breaks : 5 on the same contig , 6 on adjacent contigs , 3 on non adjacent contigs
 7180002040844 : 3 consecutive Chr30 markers (in the middle); mate_cvg > 18 ; markers CC576837,CC516738,CC543771; SAME ctg
 7180002041163 : 5' Chr30, 3' Chr11 ; mate_cvg~=10; marker BZ887794 ; diff ctg (1 ctg in between)
 7180002041234 : 5' Chr10, 3' Chr5; mate_cvg=10; marker BZ889975; scf=1757217bp; diff ctg
 7180002041235 : 2 consecutive Chr14 markers; mate_cvg=15..20; markers CC561837 &  CC585677; NOT uniq; SAME ctg
 7180002041279 : 5 consecutive Chr23 markers; markers CC472696,CC522963; cvg=9..18; diff ctg
 7180002041306 : 5' Chr14, 3' Chr6; marker CC549871; cvg=2; diff ctg (22 ctgs in between)
 7180002041308 : 2 consecutive Chr20 markers; markers CC571631 & BZ832318; cvg=20; SAME/diff ctg 
 7180002041321 : 5' Chr2, 3' Chr15 ; marker CC513377 cvg=10; diff ctg (1 ctgs in between)
 7180002041350 : 2 consecutive Chr3 markers ; markers BZ837387 & CC571149; cvg 17..26; NOT uniq; SAME/diff ctg

Details:

 #scfid          begin   end     markerid        Chr     #ctgid          begin   end     markerid        Chr
 7180002040844   8369147 8369989 CC521620        17      7180001727899   132573  133415  CC521620        17      **
 7180002040844   8437707 8438324 CC576837        30      7180001727899   201133  201750  CC576837        30      **
 ...
 7180002040844   8467493 8468096 CC543771        30      7180001727899   230919  231522  CC543771        30      *
 7180002040844   8522732 8523598 CC513544        17      7180001727901   34970   35836   CC513544        17      *
 ---
 7180002041163   770783  771508  CC562874        30      7180001926720   20391   21116   CC562874        30      *
 7180002041163   791050  791484  BZ887794        11      7180001926721   3139    3573    BZ887794        11      *
 --
 7180002041234   3856848 3857673 BZ889975        10      7180002010181   56608   57433   BZ889975        10      *
 7180002041234   3949893 3950652 CC509477        5       7180002010182   28610   29369   CC509477        5       *
 --
 7180002041235   1117908 1118686 CC579933        21      7180002010285   4086    4864    CC579933        21      **
 7180002041235   1118939 1119675 CC561837        14      7180002010285   5117    5853    CC561837        14      **
 7180002041235   1158300 1159022 CC585677        14      7180002010285   44478   45200   CC585677        14      **
 7180002041235   1164296 1164891 BZ924510        21      7180002010285   50474   51069   BZ924510        21      **
 --
 7180002041279   6755839 6756519 BZ849919        1       7180002018195   417990  418670  BZ849919        1       *
 7180002041279   6786086 6786957 CC472696        23      7180002018196   12703   13574   CC472696        23      *
 ...
 7180002041279   6866378 6866890 CC522963        23      7180002018196   92995   93507   CC522963        23      *
 7180002041279   6911185 6911897 CC574255        1       7180002018197   15411   16123   CC574255        1       *
 --
 7180002041306   248713  249457  CC503129        14      7180002022875   59534   60278   CC503129        14      *
 7180002041306   910997  911554  BZ839769        6       7180002022892   21939   22496   BZ839769        6       *
 --
 7180002041308   1919399 1920196 BZ883381        18      7180002023013   57093   57890   BZ883381        18      *
 7180002041308   1958658 1959168 CC571631        20      7180002023014   29833   30343   CC571631        20      *
 7180002041308   1963333 1964083 BZ832318        20      7180002023014   34508   35258   BZ832318        20      **
 7180002041308   2009574 2010367 CC499423        18      7180002023014   80749   81542   CC499423        30      **
 --
 7180002041321   4820048 4820542 BZ846646        2       7180002025240   77429   77923   BZ846646        2       *
 7180002041321   4908631 4909160 CC513377        15      7180002025242   13302   13831   CC513377        15      *
 --
 7180002041350   2423855 2424205 CG983886        13      7180002030523   183644  183994  CG983886        13      *
 7180002041350   2446763 2447125 BZ837387        3       7180002030524   9191    9553    BZ837387        3       *
 7180002041350   2457417 2457704 CC571149        3       7180002030524   19845   20132   CC571149        3       **
 7180002041350   2486474 2487050 CC490214        13      7180002030524   48902   49478   CC490214        13      **

Scaffold splitting

Before:

  • 22 scaffolds
  • 28 breaks : 14 on the same contig , 8 on adjacent contigs , 6 on non adjacent contigs

Now:

  • 14 scaffolds
  • 15 breaks : 8 on the same contig , 3 on adjacent contigs , 4 on non adjacent contigs
Scaffold to break
   nl  scfid           breaks
   #1  7180002040844   2
    2  7180002040911   1
    3  7180002041061   1
   #4  7180002041103   1
    5  7180002041107   1
   #6  7180002041163   1
    7  7180002041200   1
   #8  7180002041216   1
    9  7180002041234   1
  #10  7180002041235   2
   11  7180002041259   2
  #12  7180002041279   2
   13  7180002041281   1
   14  7180002041306   1
  #15  7180002041308   2
   16  7180002041315   1
   17  7180002041321   1
   18  7180002041325   1
   19  7180002041348   1
  #20  7180002041350   2
   21  7180002041356   1
   22  7180002041358   1
Contigs to break
   nl  ctgid           
    1  7180001722964
    2  7180001723456
    3  7180001730191
    4  7180001852904
    5  7180001855003
    6  7180002002029
    7  7180002010182
    8  7180002012722
    9  7180002012752
   10  7180002018361
   11  7180002024308
   12  7180002025240
   13  7180002025880
   14  7180002030033
Marker pairs
   nl  scfid           begin1  end2    markerid1       Chr1    markerid2       Chr2    ctg1            ctg2            dist(end2-begin1)      dist(ctg2-ctg1)
   #1  7180002040844   8369147 8438324 CC521620        17      CC576837        30      7180001727899   7180001727899   69177                   0
   #2  7180002040844   8467493 8523598 CC543771        30      CC513544        17      7180001727899   7180001727901   56105                   2
 
    3  7180002040911   2189844 2211037 BZ923031        30      CZ405316        29      7180001730191   7180001730191   21193                   0
 
    4  7180002041061   213959  236479  CC572066        15      CC550436        29      7180001852904   7180001852904   22520                   0
 
   #5  7180002041103   5220856 5297045 CG984741        18      CC587675        15      7180001854649   7180001854651   76189                   2
 
    6  7180002041107   7302203 7312254 CC906829        15      CC479102        9       7180001855003   7180001855003   10051                   0
 
   #7  7180002041163   770783  791484  CC562874        30      BZ887794        11      7180001926720   7180001926721   20701                   1
 
    8  7180002041200   4940131 4957105 CC500137        20      CZ428497        2       7180002002029   7180002002029   16974                   0
 
   #9  7180002041216   307121  311546  CL609365        5       CL865591        15      7180002003578   7180002003578   4425                    0
 
   10  7180002041234   3856848 3950652 BZ889975        10      CC509477        5       7180002010181   7180002010182   93804                   1
 
  #11  7180002041235   1117908 1119675 CC579933        21      CC561837        14      7180002010285   7180002010285   1767                    0
  #12  7180002041235   1158300 1164891 CC585677        14      BZ924510        21      7180002010285   7180002010285   6591                    0
 
   13  7180002041259   4224828 4316178 CC920177        5       BZ922220        3       7180002012718   7180002012722   91350                   4
   14  7180002041259   6220638 6240375 BZ869532        3       CZ413142        24      7180002012752   7180002012753   19737                   1
 
  #15  7180002041279   6755839 6786957 BZ849919        1       CC472696        23      7180002018195   7180002018196   31118                   1
  #16  7180002041279   6866378 6911897 CC522963        23      CC574255        1       7180002018196   7180002018197   45519                   1
 
   17  7180002041281   2927406 2939696 CC573399        19      CC513914        28      7180002018361   7180002018361   12290                   0
 
   18  7180002041306   248713  911554  CC503129        14      BZ839769        6       7180002022875   7180002022892   662841                  22
 
  #19  7180002041308   1919399 1959168 BZ883381        18      CC571631        20      7180002023013   7180002023014   39769                   1
  #20  7180002041308   1963333 2010367 BZ832318        20      CC499423        18      7180002023014   7180002023014   47034                   0
 
   21  7180002041315   2152291 2197298 BZ836343        13      BZ838379        1       7180001862977   7180002024308   45007                   2*
 
   22  7180002041321   4820048 4909160 BZ846646        2       CC513377        15      7180002025240   7180002025242   89112                   2
 
   23  7180002041325   2608389 2639207 CC506736        10      CC770009        5       7180002025880   7180002025880   30818                   0
 
   24  7180002041348   3044263 3093121 BZ872906        8       BZ924509        7       7180002030033   7180002030033   48858                   0
 
  #25  7180002041350   2423855 2447125 CG983886        13      BZ837387        3       7180002030523   7180002030524   23270                   1
  #26  7180002041350   2457417 2487050 CC571149        3       CC490214        13      7180002030524   7180002030524   29633                   0
 
   27  7180002041356   6828214 6833356 CC531427        16      CC494876        25      7180001722964   7180001722964   5142                    0
 
   28  7180002041358 20964100 21003219 BZ839784        14      CC527932        26      7180001723456   7180001723457   39119                   1
Break intervals
       #scfid          begin   end     scflen          len  allread goodmate badmate Chr1 Chr2 mark1   mark2  ctgid
   #1  7180002040844   8412735 8412739 9565303         5       1       12      15      17 30   357     3      
   #2  7180002040844   8485384 8485869 9565303         486     0       12      4       30 17   3       53
 
    3  7180002040911   2205044 2205247 3009964         204     3       0       8       30 29   44      34     7180001730191

    4  7180002041061   232819  232843  510915          25      5       0       23      15 29   8       14     7180001852904

   #5  7180002041103   5295117 5295435 5342873         319     0       11      2       18 15   221     4

    6  7180002041107   7306552 7306620 19515586        69      2       0       11      15 9    320     549    7180001855003

   #7  7180002041163   787669  787911  1752998         243     0       10      0       30 11   16      8

    8  7180002041200   4954103 4954180 16991478        78      1       0       27      20 2    219     532    7180002002029

   #9  7180002041216   309331  309337  1435859         7       1       4       3       5 15    7       50

   10  7180002041234   3939493 3939548 10292558        56      1       0       20      10 5    160     227    7180002010182

  #11  7180002041235   1117752 1117987 1742396         236     5       20      2       21 14   43      2
  #12  7180002041235   1161294 1161363 1742396         70      3       16      0       14 21   2       32

   13  7180002041259   4313381 4313547 15594966        167     1       0       12      5 3     193     75     7180002012722
   14  7180002041259   6231717 6231762 15594966        46      8       4       18      3 24    75      406    7180002012752

  #15  7180002041279   6773364 6773383 9784709         20      0       4       14      1 23    292     5
  #16  7180002041279   6894603 6894618 9784709         16      0       4       16      23 1    5       141

   17  7180002041281   2931877 2931885 12908817        9       3       5       10      19 28   83      441    7180002018361

   18  7180002041306   267445  267446  4484402         2       0       2       14      14 6    49      141    7180002022875,7180002005935
 
  #19  7180002041308   1928806 1928825 23070077        20      0       10      0       18 20   63      2
  #20  7180002041308   1977759 1977828 23070077        70      1       21      1       20 18   2       836

   21  7180002041315   2192754 2192893 2233575         140     7       0       22      13 1    59      2      7180002024308

   22  7180002041321   4840934 4841070 6492983         137     1       0       16      2 15    227     72     7180002025240

   23  7180002041325   2624153 2624190 11155368        38      3       0       22      10 5    86      357    7180002025880

   24  7180002041348   3064691 3064730 19179560        40      1       0       16      8 7     116     634    7180002030033

  #25  7180002041350   2437553 2437572 23353087        20      0       19      1       13 3    107     2
  #26  7180002041350   2479000 2479029 23353087        30      2       23      1       3 13    2       880
 
   27  7180002041356   6831454 6831783 14699461        330     1       0       9       16 25   299     319    7180001722964

   28  7180002041358  20970057 20970139 25079321       83      7       0       25      14 26   868     189    7180001723456

Where:

 Chr1:  The most frequent chromosome markers with alignment at coordinates <= begin 
 Chr2:  The most frequent chromosome markers with alignment at coordinates >= end
 mark1: Number of Chr1 markers with alignment at coordinates <= begin 
 mark2: Number of Chr2 markers with alignment at coordinates >=end 
 Lines starting with # should be ignored.
 ctgid: ctg to break
New scaffolds
 #scfid(new)           scfid           begin   end      scflen(new)
Files
 # ctg:scf new name mapping
 /scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scfproblems.both.filtered/scfproblems.posmap.ctbscb

 # scf:scf new:original name mapping
 /scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scfproblems.both.filtered/scfproblems.posmap.scbscf
  
 # ctg:ctg new:original name mapping
 /scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scfproblems.both.filtered/ctgproblems.posmap.ctbctg
 
 # ctg FASTA sequences
 /scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scfproblems.both.filtered/ctbproblems.fasta

 # scf FASTA sequences
 /scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scfproblems.both.filtered/scbproblems.fasta


UMD2.6 (UMD2.5 without contam ctg/scaff; split ctg/scaff)

Contaminants & MarkerBreaks

Delete summary:

                          ctg     scf     scf->ctg
 contaminants(delete)     156     152     666
 contaminants(trim)       12      12      1328
 markerBreaks             14+1    14+1    2875+1  # 1 more break in UMD2.6.1      
 total                    182     178     4869 

Add summary:

                          ctg     scf  
 contaminants(delete)     0       2
 contaminants(trim)       12      12
 markerBreaks             28+2    29+2            # 1 more break in UMD2.6.1 
 total                    40      43

Summary:

                          ctg     scf        markers
 original                 90135   39978
 del/add                 -142    -135
 final                    89993   39843      -17

Files:

 /scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.ctg.fasta     : contig  FASTA sequence
 /scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.scf.fasta     : scaffold FASTA sequence
 
 /scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.posmap.ctglen : contig  lengths 
 /scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.posmap.scflen : scaffold lengths
 
 /scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.posmap.ctgscf : mapping of contigs to scaffolds  (posmap format)
 /scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.posmap.scaff  : mapping of contigs to scaffolds  (scaff  format)
 
 /scratch1/bos_taurus/Assembly/2009_0312_CA/new/ctg.delete.uid   : UID of the contigs   which were deleted  
 /scratch1/bos_taurus/Assembly/2009_0312_CA/new/scf.delete.uid   : UID of the scaffolds which were deleted  

 /scratch1/bos_taurus/Assembly/2009_0312_CA/new/ctg.add.uid      : UID of the contigs which were added :  UID =~/brk\d+[abc]/ OR UID =~/cnt\d+/
 /scratch1/bos_taurus/Assembly/2009_0312_CA/new/scf.add.uid      : UID of the contigs which were added :  UID =~/brk\d+[abc]/ OR UID =~/cnt\d+/

 /scratch1/bos_taurus/Assembly/2009_0312_CA/new/markers.delete.uid : markers which got deleted
 /scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.ctg.break  : 15 contig break regions 
 /scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.scf.break  : 16 scaffold break regions 

Scripts:

 ~/bin/breakPosmapKeep.amos : pipeline for breaking scf/ctg


  • Final:
                   elem       min        max               mean       med        n50        sum	 
 ctg               89993      65         1160130(1.1M)     29736      5180       95952      2676109306
 scf               39843      316        34167202(34M)     68353      1361       7451988    2723419943
 scf<50K           38915      316        49898             2765       1349       5107       107632139
 scf<5K            35229      316        4999              1518       1306       1435       53495879

Ctg Markers

Filtered:

 %IDY>90
 %Matched>85
  • ~30% of alignments agree to this condition

Markers:

 total:     126111    # from 31372 ctgs

Contigs

                   elem       min        max        mean       med        n50        sum
 placed            31372*     442        1160130    72908      52734      111147     2287291732*
 unplaced          58621      65         425922     6632       1600       22204      388817574
 total             89993      65         1160130    29736      5180       95952      2676109306

Scf Markers

Markers:

 total:     126111  # from 2641 scaffolds; 1595 scaffolds have 2+ markers
 na:        18744   # not placed	 
 del:       17      # the scaffolds they belong to were deleted	 
 other:     3722    # not on the "main" chromosome; "main" chromosome determined by a majority rule; if it's a tie check markers for uniqueness 
 outliers:  411     # interquartile range method (IQR)   q1-1.5*(q3-q1) ..  q1+1.5*(q3-q1)  		 
 
 filtered:  103217  # "non conflicting" markers	 	 
 Scaffolds:	 
                   elem       min        max        mean       med        n50        sum              #ctgs
 placed            2641       1000       34167202   994523     16673      8170786    2626536153*      50528	 
 unplaced          37202      316        754615     2604       1337       3964       96883790	        39465
 total             39843      316        34167202   68353      1361       7451988    2723419943       89993	 
                   scf  <2   >=2  >=10 min  max   mean  med n50 sum	 
 markers/scf       2641 1595 1046 562  1    1418  39    1   354 103479	 
 ctg/scf           2641 1546 1095 559  1    545   19    1   130 50528
 cat markers_scf.mainChr.*summary  | count.pl -c 7 | sort -n
 1       1564   # 1 marker/scf
 2       276    # 2 markers/scf
 3       116    # ...
 4       52
 ...
 1470    1      # scaffold 7180002041371 has 1470 markers
  • Scaffold position:
 filter outliers (interquartile range method); use median value
 problem: only 2 markers far apart: choose randomly or check for uniqueness
 Summary (approximate)
 Chr   #Ctg    #Scaff  ScaffSpanSum    MaxMarkerPos
 1     2989    112     157088082       167097751
 2     2488    73      138112445       141135901
 3     2257    109     120984003       128677351
 4     2079    111     124695956       123662451
 5     2271    98      120470218       130242001
 6     2244    127     117350442       127208151
 7     2169    93      109780318       114917551
 8     2069    95      111646872       114607251
 9     1937    72      103790361       106365151
 10    1829    96      102878815       108508301
 11    2030    62      106593132       107458151
 12    1789    114     89109155        97406401
 13    1498    73      83821399        88539451
 14    1482    142     84084175        89211101
 15    1734    105     84680500        91332551
 16    1710    111     80727432        86838601
 17    1384    51      72913556        78195801
 18    1446    86      65689468        70299751
 19    1338    56      63372609        69847351
 20    1454    54      71941707        75982901
 21    1405    63      70035525        72193201
 22    1077    36      60892135      > 60178851
 23    1021    44      51791473        54886001
 24    1059    26      61662407      > 61466101
 25    783     41      42670836      > 45254751
 26    991     40      50640267        52316851
 27    920     70      45768018        48911451
 28    810     52      45884054        50753001
 29    1143    89      51657687        55219751
 30    3122    340     135803106       152429101
 U     39465   37202   94983049
  • Scaffold orientation
 filter outliers (interquartile range method)
 use LeastSequareFit method to estimate the orientation : if slope is positive => forward; if slope is negative => reverse; 
 problem: slope ~=0 => which direction ? 
 cat Chr.summary | getSummary.pl -i 5
 cat Chr.agp | grep W | awk '{print $9}' | count.pl
        elem       <0         0          >0         min        max        mean       med
 scf    2641       516        1610       515        -31        61         0          0   
 ctg    50528      24885      2236       23407
 
 Use slope thold to determine direction?
 cat markers_scf.mainChr.noOutliers.summary | p 'print $_ if(abs($F[5])>0.5);' | wc -l #  634


 Ambiguity examples:
 BZ908653 6 114061501 114016501 114106501 172149 597 98.66 100.00 7180002040834
 BZ891600 6 114085251 114051501 114119001 4834 504 99.80 99.02 7180002040834
 CZ411135 6 114094001 114059001 114129001 242710 609 98.20 100.00 7180002040834
 BZ854276 6 114132751 114101501 114164001 100980 580 99.31 99.83 7180002040834
 
 CC524983 30 115669901 115622401 115717401 96634 715 98.74 100.00 7180002041003
 BZ869249 30 115791151 115717401 115864901 86931 448 99.78 100.00 7180002041003
 BZ867530 30 115798651 115732401 115864901 69671 572 99.13 100.00 7180002041003
 CC585731 30 115808651 115752401 115864901 54950 737 99.86 100.00 7180002041003
 CC469285 30 115818651 115772401 115864901 125555 550 94.57 100.00 7180002041003
  • Scaffold overlaps: some small scaffolds might be contained by bigger ones
 cat markers_scf.mainChr.noOutliers.posmap.scfchrabs | ~/bin/posmap2ovl.pl | sort -nk6 -r | ~/bin/tab2tab.pl -f -15 | head
 Chr            ref            qry            begin          end            end-begin
 30             7180002041328  7180002041078  72536051       76993951       4457900
 4              7180002041381  7180002041269  41976451       46175201       4198750
 30             7180002040852  7180002041077  142001501      145251501      3250000
 30             7180002038569  7180002041121  33012001       35518351       2506350
 30             7180002040971  7180002034501  59411001       61836101       2425100
 ...

Marker Issues

Placement

  • 37202 scaffolds are unplaced (3.5% of total scaffold span); max=0.7M
  • 87 unplaced scaffolds (1.5Mbp total) could be placed using SLK messages
 perl ~/bin/difference21.pl  bt.slk markers_scf.scflen | head
 7180002030632   7180002040171   I       -147744.734     21214.779       2       UP
 7180002030792   7180002031221   I       -15042.301      558.223         2       UP
 7180002030849   7180002041244   N       -1609089.875    18145.568       4       UP
 ...
                   elem       min        max        mean       med        n50        sum
 Unplaced          87         740        102909     17259      8544       34550      1501601
 Placed            78         16177      27139572   5757234    4037663    10989230   449064304
  • ambiguous assignment to chromosmes:
 cat markers_scf.count | p 'print $_ if($F[2]*2==$F[3]);' | nl
       #k1             k2      count12 count1
    1  7180002030741   15      2       4
    2  7180002031341   11      1       2
   ...
   36  7180002068140   25      1       2
 # scaffold assigned differently
 ~/bin/difference12.pl markers.all/markers_scf.mainChr.count markers.all_plus_uniq/markers_scf.mainChr.count | nl
    1  7180002032536   19      1       2       23480
    2  7180002037223   8       1       2       31516
    3  7180002040013   25      2       4       15228
    4  7180002040262   30      1       2       3087
    5  7180002040378   10      1       2       50907
    6  7180002040523   29      1       2       9105
    7  7180002040769   4       1       3       827526
    8  7180002041203   14      2       4       1141560
    9  7180002044555   18      1       2       1566
  • Chr30 has many scaff aligned to it
  • how reliable are the markers: 1151 out of 2641 placed scaffolds have no unique markers ?
  • what measure is best for placing the scaffolds?
  • can some scaffolds go in the gaps ?
  • AGP:
    • gaps<=0 or 20 set to 100
    • unoriented ctgs set to +
  • Marker positions are not uniformly distributed; they tend to "custer"
                                  elem       0          >0         min        max        mean       med        n50        sum
 markers_chr                      107337     8405       98932      0          446250     25800      8750       75000      2769379200
 markers_scf                      94171      123        94048      0          2565850    24143      15856      42236      2273610521
 markers_chr.mainChr.noOutliers   101084     7609       93475      0          1253750    27383      10000      78750      2768040450  # filtered IRQx,xy method
 markers_scf.mainChr.noOutliers   88450      52         88398      0          2669361    25327      16587      44374      2240202221

Orientation

  • at least 1564 out of 2641 placed scaffolds cannot be oriented; max=1.39M

Possible misassemblies

 nl  #scfid          Ch      medianPos       rangePos        scfLen              slope   #ChMark #Mark   #ctg
 1   7180002041225   1       120473301       9866350         9279975(9M)         0.0103  381     390     184        : break interval: 1674263..1674319 (1X frg_cvg region) 
 2   7180002041078   30      77506551        6310600         3094999(3M)         0.0148  65      69      46         : no clear position

scf7180002041225 : 9.27 Mbp on Chr1

scflen=9279975

 7180002041225.markers
 7180002041225.markers 
 7180002041225.cvg 
 ChrPos    ScfPos
 115272051 56300
 ...
 116848301 1618430 
 116953301 9190560
 ...
 125138401 1691145
 break interval: 1674263..1674319 = 57bp frg_cvg=1 mate_cvg=0 bad_mate_cvg(nearby)=7

scf7180002041078 : 3.09 Mbp (not broken)

 7180002041078.markers
 7180002041078.markers 
 7180002041078.1.cvg xrange [315268:466827]
 7180002041078.2.cvg xrange [1191000:1229713]
 7180002041078.3.cvg xrange [2733146:2750017]
 ChrPos   ScfPos
 72536051 2893791
 ...
 72682301 2750017 
 72728651 1191000
 ...
 73273651 466827
 77112801 2733146
 ...
 78464051 1229713
 ..
 78595401 315268

scf7180002040971 : 1.93 Mbp (not broken)

Files

 /scratch1/bos_taurus/Assembly/2009_0312_CA/new/markers.all.redo/
  • Nucmer alignments to UMD2.0 chromosomes: all seem to agree pretty well
 /scratch1/bos_taurus/Assembly/2009_0312_CA/new/markers.all.redo2/nucmer_UMD2.0/UMD2.0.Chr26-Chr26.png
 /scratch1/bos_taurus/Assembly/2009_0312_CA/new/markers.all.redo2/nucmer_UMD2.0/UMD2.0.Chr27-Chr27.png
 /scratch1/bos_taurus/Assembly/2009_0312_CA/new/markers.all.redo2/nucmer_UMD2.0/UMD2.0.Chr29-Chr29.png
 /scratch1/bos_taurus/Assembly/2009_0312_CA/new/markers.all.redo2/nucmer_UMD2.0/UMD2.0.ChrX-Chr30.png

Homo sapiens alignments

Citations:

  • mice closer to human than cow
  • human and cow have approximately 201 homologous blocks of DNA
  • Independently generated mapping data provide another measure of the quality of the assembly. Snelling et al. [4] created a B. taurus map from three radiation hybrid panels, two genetic maps, and bacterial artificial chromosome (BAC) end sequences. We aligned all of the 17,254 markers (of which 17,193 are unique) in their composite map (Cmap) to both assemblies. A marker was considered as matching a chromosome if 90% of the marker sequence aligned with at least 95% identity. Of the Cmap markers, 14,620 align to the UMD2 assembly's chromosomes, versus 13,699 markers (6.3% fewer) for the BCM4 assembly. A small number of Cmap markers (119 and 82 for UMD2 and BCM4, respectively) mapped to a different chromosome from the one indicated in the Cmap data.
  • homologous synteny block (HSB) :
    • human-cow alignment extended for at least 250 Kbp
    • it was not interrupted by an inversion or by an HSB on another chromosome.
    • If two HSBs were interrupted by a gap of <3 Mbp and nothing else fell in that gap, the two blocks were merged. (Note that if a large region of synteny is interrupted by a distinct HSB, the interruption creates three HSBs.)
    • Number of homologous synteny blocks on each chromosome of the cow and human genomes 201 -> 268 blocks
  • Orienting contigs using cow-human alignments:
    • Scaffolds (sets of linked contigs) that were mapped onto chromosomes using only a single marker could not be oriented from the marker information alone. We oriented many of these scaffolds by taking advantage of the overall conserved synteny between cow and human. First, all cow scaffolds were aligned to the human genome using nucmer [14] with its maximal unique match (mum) option in order to avoid alignments of repetitive sequence. For each alignment of a previously unoriented scaffold to human, all alignments within 100 Kbp on each side were pulled out for analysis. A score S was computed for each unoriented scaffold, taking into account whether the scaffolds surrounding S on both sides (in cow) were mapped to a consistent set of locations in human. If the scaffolds surrounding S were oriented, and if a large majority of these scaffolds on both the left and right agreed on the orientation, then S was assigned that orientation. Using this procedure, 1,840 scaffolds containing 4,011 contigs were oriented.
    • We developed a similar procedure to assign unplaced contigs to chromosomes, again relying on conserved synteny between cow and human. First, all unplaced contigs were aligned as above. Mummer's 'delta-filter' program was then used to compute a one-to-one mapping of the unplaced contigs to human so that only the best aligning contig was considered at each region in human. For each unplaced contig's best alignment to human, the matching region in cow was identified via our human-cow syntenic map, and all contigs from this region were extracted for examination. We only considered placing a contig on a B. taurus chromosome if the order and direction of the surrounding contigs in cow matched the corresponding region in human. As above, we examined the alignments of nearby cow contigs that aligned within 100 kb of the unplaced contig's alignment in human. If the region of cow-human synteny contained no rearrangements, then the unplaced contig was placed at the location indicated by these alignments. Using this procedure, 1,046 contigs were placed on chromosomes. One consequence of this procedure was that a number of incompletely mapped genes (based on mRNA alignments) were completed.

Issues:

  • which alignment program to use?
    • nucmer
    • blastz: difficult to parse
    • blat
  • nucmer: what parameters?
    • default
    • loose : -mum -l 12 -c 30 -g 1000
    • ref: 24 homo sapiens chromosomes files
    • query: 26 bos taurus scaffold files

Total scaffolds:

                        elem       min        max        mean       med        n50        sum
 scf(len)               39844      316        34167202   68352      1361       7451988    2723419938

Aligned to HS:

                        elem       min        max        mean       med        n50        sum             
 scf(len)               8860       385        34167202   301202     2257       7740810    2668658140      # each scf aligns in avg to 4 Chr   
 scf(aligLen)           789272     250        19097      514        399        550        405946972      
 scf(alig%)             789272     58.04      100.00     79         78.54      78.72      .
  
 scf(len,maxX)          727        1002       7069350    194658     6530       1895564    141516604       # 220 in common with the 339 ones that have mark
 
 scf(len,noMark)        7025       385        7605708    7783       1654       27630      54680173   
 scf(len,mark)          1835       1001       34167202   1424511    40503      7970944    2613977967 

Not aligned to HS:

                        elem       min        max        mean       med        n50        sum             
 scf(len,notAligned)    30984      316        163953     1767       1309       1555       54762139   

Aligned to markers:

                        elem       min        max        mean       med        n50        sum
 scf(len)               2641       1000       34167202   994523     16673      8170786    2626536153

 1 marker               1595       1000       1396951    17482      4595       48276      27884078
 2+ markers             1046       1055       34167202   2484371    596240     8217662    2598652075
 
 abs(slope)<0.25        1946       1000       3094999    29136      7585       114274     56700529
 abs(slope)>=0.25       696        4674       34167202   3692292    1845828    8401441    2569835619

Scf summary:

                        elem       min        max        mean       med        n50        sum
 all                    39844      316        34167202   68352      1361       7451988    2723419938
 1+mark                 2641       1000       34167202   994523     16673      8170786    2626536153 # best marker alignment
 1+align                8860       385        34167202   301202     2257       7740810    2668658140 # alignments > 250bp
 1+align(new)           9880       385        34167202   270458     1996       7740810    2672129298 # alignments > 200bp (3.74M more than 250bp align)
 1+mark or  1+align     9669       385        34167202   276443     2051       7740810    2672936792
 0 mark and 1+align     7027       385        754,615    6603       1654       20219      46,400,303 !!!
 1+mark and 1+align     1791       1001       34167202   1305861    37914      7473583    2338798064
 0 mark and 0 align     30179      316        88,326     1705       1305       1517       51,483,587 !!!
 
 1+mark or  1+align     9669       385        34167202   276443     2051       7740810    2,672,936,792

Degenerate:

                            elem       min        max        mean       med        n50        sum
 all                        251413     65         39964      1003       984        994        252,279,234
 all(2000+bp)               2128       2000       39964      3753       2827       3946       7,986,872

 1+mark                     562        200        30168      2731       1274       5029       1,535,011
 1+mark(2000+bp)            180        2007       30168      6079       4404       7510       1,094,252
 1+align                    6429       251        39964      1487       1013       1287       9,566,273
 1+align(2000+bp)           756        2004       39964      4820       3624       5646       3,644,556

Issues:

  • 24 scaffolds that have 200+ alignments to at least 2 HS chromosomes
  • 37 scaffolds that have 100+ alignments to at least 2 HS chromosomes
  • 83 scaffolds that have 50+ alignments to at least 2 HS chromosomes

File location:

 /scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_human


Synteny method

File location:

 /scratch1/bos_taurus/Assembly/2009_0312_CA/synteny
  • Position & orient scaffolds that aligned to HS
  • find placed neighboring that don't "disagree"
                             elem       min        max        mean       med        n50        sum
 total                       7027       385        754615     6603       1654       20219      46,400,303 !!!  
  • Scaffolds possibly assigned to the wrong chromosome
 cat summary.nucmer-markers.txt | perl ~/bin/synteny/getSynteny2.nucmer-marker.pl -minMarkers 2 | grep update:chr | getSummary.pl -i 6 -t "<2 markers"
 cat summary.nucmer-markers.txt | perl ~/bin/synteny/getSynteny2.nucmer-marker.pl -minMarkers 4 | grep update:chr | getSummary.pl -i 6 -t "<4 markers"

                             elem       min        max        mean       med        n50        sum
 <2 markers                  307        1001       937279     27898      9212       98674      8,564,850
 <4 markers                  383        1001       1141560    33822      13426      160445     12,953,875


UMD2.0

                             elem       min        max       mean       med        N50        sum
 ctg                         74337      88        840370     35148      14148      79144      2612810882*

UMD2.6.1 noVariants,noCont

                             elem       min        max       mean       med        N50        sum
 scf                         2646       385       34167202   994701     20787      7139718    2631980624
 ctg                         50755      65        1160130    50966      29450      88583      2586785910*

UMD2.6.1 noVariants

                              elem       min        max        mean       med        n50        sum
 scf                          39844      316        34167202   68352      1361       7451988    2,723,419,938
 scf.placed                   4707       385        34167202   564225     10413      7800796    2655811430
 scf.variants                 29436      723        51828      1714       1298       1514       50,461,989
 scf.unplaced                 30575      316        451968     1845       1309       1601       56440663
 
                              elem       min        max        mean       med        n50        sum
 ctg                          89994      65         1160130    29736      5180       95952      2676109378
 ctg.placed                   53646      65         1160130    48650      26840      98428      2609925446
 ctg.unplaced                 31510      101        207476     1761       1314       1556       55511410

                              elem       min        max        mean       med        n50        sum
 deg(all)                     251413     65         39964      1003       984        994        252,279,234
 deg(>2Kbp)                   2128       2000       39964      3753       2827       3946       7,986,872
 deg.placed                   883        200        30246      2711       1303       4845       2,393,905
 deg.variants                 4654       331        8039       1117       989        1024       5,200,553
 deg.unplaced(?)              245343     65         39964      994        984        993        244047258
 scf.unplaced                 978        316        451968     7932       3736       15828      7757882
 scf.uplaced.0cvg             863        316        451968     7521       2286       15744      6,491,342
 deg.unplaced                 747        2002       39964      4365       3509       4819       3261181
 deg.unplaced.0cvg            734        2002       39964      4358       3507       4811       3,199,221
 
 ct_deg.placed                53646      65         1160130    48650      26840      98428      2,609,925,446*


 UMD2.0-UMD2.6.1 gaps>1K summary
 id    count   min     max     median  sum
 1     775     999     64953   1683    2884541
 ...
 30    2174    999     167808  1528    7739170
 total 14936   999     167808  1638    53,526,388
  • align UMD2.6 Chr1..30 to UMD2.0 Chr1..30 0cvg regions
              elem       min        max        mean       med        n50        sum
 all          14936      1000       167809     3584       1639       6465       53541324
 aligned      11091      1000       167809     4010       1692       7977       44475715
 not_aligned  3845       1000       50546      2357       1529       2735       9065609  

Files: UMD2.0 regions not covered by UMD2.6.1 (chr aligned to itself)

 /scratch1/bos_taurus/Assembly/2009_0312_CA/synteny/agp_markers.noVariants_nucmer.noVariants/nucmer_UMD2.0/Chr.0cvg.fa
 /scratch1/bos_taurus/Assembly/2009_0312_CA/synteny/agp_markers.noVariants_nucmer.noVariants/nucmer_UMD2.0/Chr.0cvg.posmap
 /scratch1/bos_taurus/Assembly/2009_0312_CA/synteny/agp_markers.noVariants_nucmer.noVariants/nucmer_UMD2.0/Chr.0cvg.summary


  • UMD2.6.1 vs UMD2.0 Chr1..30
  • nucmer -mum -l 50 -c 250
  • max gap: Chr8:60134434..60228381=93948
  • gaps>10K: 191 , 4.2M total, 4.19M aligned without "-mum"
                     elem       min        max        mean       med        n50        sum
 gaps>10K            191        10077      93948      22106      15599      26551      4222342
  • realign all scaff without using nucmer "-mum" option: all gaps >10K align to large scaffolds !!!

UMD2.6.1 noVariants, add deg & UMD2.0 alignments

--Dpuiu 09:44, 22 June 2009 (EDT)

                                elem       min        max        mean       med        n50        sum
 scf                            39844      316        34167202   68352      1361       7451988    2723419938
 scf.variants                   29436      723        51828      1714       1298       1514       50461989
 scf.noVariants                 10580      316        34167202   252690     1957       7740810    2673467579
 scf.noVariants.ignore          4079       316        4985       1852       1541       1842       7558196    # less than 5K and placed through alignments inside a larger scaffold
 scf.noVariants.noIgnore        6506       385        34167202   409763     6673       7740810    2665923763

 ctg.variants                   29967      723        51828      1641       1298       1475       49178546
 deg.variants                   4654       331        8039       1117       989        1024       5200553
 scf_deg.placed                 4933       385        34167202   540139     11548      7740810    2664507818
 scf_deg.markers                2003       1020       34167202   1302992    30456      8170786    2609894380
 scf_deg.alignHS                1818       385        3278163    27070      12322      47406      49214215
 scf_deg.alignUMD2.0            1112       1002       73626      4855       2948       6898       5399223
 
 scf.placed                     4044       385        34167202   657786     15617      7740810    2660090134
 scf.markers                    1825       1020       34167202   1429490    37672      8170786    2608820353
 scf.alignHS                    1587       385        3278163    30006      13998      50604      47620297
 scf.alignUMD2.0                632        1002       73626      5774       5137       9046       3649484
  
 deg.placed                     889        2002       39964      4969       3771       6165       4417684
 deg.markers                    178        2007       30168      6033       4336       7350       1074027
 deg.alignHS                    231        2049       39964      6900       6289       7337       1593918
 deg.alignUMD2.0                480        2002       15992      3645       2767       3842       1749739
 
 ctg_deg.placed                 54129      65         1160130    48371      26493      98075      2,618,296,162*
 ctg.placed                     53240      65         1160130    49096      27309      98285      2,613,878,478
 deg.placed                     889        2002       39964      4969       3771       6165       4,417,684

HS-BT Synteny map

  • Trust scaffolds with 4+ markers
  • Scaffolds with 3- markers must have a close neighbor that agrees with them
 same HS & BT Chr
 maxCount=2                      # at most 2 scaffolds away
 minMarkers=4
 minRatio=0.66 & maxRatio=1.5    # distance ratio; maxDistance not used !!!
 => 226 synteny regions !!!

 join2.pl nucmer_*lsf markers_scf.*lsf | \
        ~/bin/filterMarkers.pl -minMarkers 4 | \
        ~/bin/getSyntenyBlock.pl | \
        grep -v ^# | grep -v ^$ | ~/bin/flipSummary.pl | sort -nk2 -nk5 | ~/bin/tab2tab.pl | \
        perl -ane  'print $P[13]," ",$F[13],"\n" if($F[13]-$P[13]==1); print $F[13]," ",$P[13],"\n" if($P[13]-$F[13]==1); @P=@F;' | sort -u -n | \
        ~/bin/mergeMap.pl >! hs-bt.map.tmp

 join2.pl nucmer_*lsf markers_scf.*lsf | \
        ~/bin/filterMarkers.pl -minMarkers 4 | \
        ~/bin/getSyntenyBlock.pl -map  hs-bt.map.tmp | \
        ~/bin/tab2tab.pl | grep  # | sed 's/#//' >  hs-bt.map
 
 ~/bin/map-draw.pl -refLen hs.infoseq -qryLen bt.infoseq hs-bt.map > ! hs-bt.png

Problems:

 7180002041220: BT.Chr2 ok
 7180002041025: BT.Chr4
 7180002041222: BT.Chr7
 7180002041228: BT.Chr8
 7180002040195: BT.Chr8; HS.Ch23 del !!!
 7180002041001: BT.Chr8; HS.Ch23 del !!!
 7180002040851: BT.Chr30; HS.Ch3 -> HS.Ch23
 7180002041008: BT.Chr30; HS.Ch7 -> HS.Ch23
 #HS-ref       begin     end       len       HS-clust  BT-ref    begin     end       BT-clust  #scf
 01            870247    12637274  11767027  16        46893147  58621226  11728079  -1        5
 01            16674545  30031540  13356995  2         126668764 139027145 12358381  -2        4
 01            32016003  68894564  36878561  3         83066354  120874874 37808520  -3        10
 01            68907947  122485896 53577949  3         23819464  83221225  59401761  -4        6
 01            143725713 143737025 11312     3         25097798  25109110  11312     5         1
 ...
 23            148835146 151656716 2821570   30        32889252  35700335  2811083   -276      3
 23            152157769 154641974 2484205   30        39322867  42104054  2781187   -277      2
 23            154387850 154403684 15834     30        39353299  39369133  15834     278       1

Files:

 /scratch1/bos_taurus/Assembly/2009_0312_CA/synteny/map.4/hs-bt.map
 /scratch1/bos_taurus/Assembly/2009_0312_CA/synteny/map.4/hs-bt.png

Overlaps

Cases:

 1. CONTAINED scaffolds (clear variants)
 2. single BEGIN/END between 2 scaffolds: 2 scaffolds could be merged
 3. scaffold closing a sequence: 3 scaffolds could be merged
 4. multiple BEGIN/END/CONTAIN* between 2 scaffold contigs


1: CONTAINED ~ 4731 cases

 cat nucmer_scf.ovl/all-all.contained.ids  marker_scf.ovl/all-all.contained.ids  | sort -u > all-all.contained.ids 
  • Summary
                       elem       <2Kbp    2..10Kbp    >10Kbp   min    max        mean       med        n50        sum
 variants(all)         4731       3503     1085        143      723    75436      2513       1429       3469       11891882
 variants(1+markers)   541        311      170         60       1001   75421      4466       1690       9947       2416312
  • Example: longest contained scaffold 75Kbp
 #scaffold alignments
 117893  126367  |       8520    1       |       8475    8520    |       94.06   |       562884  75576   |       1.51    11.27   |       7180002040891   7180002040646
 286811  289412  |       72974   75576   |       2602    2603    |       93.29   |       562884  75576   |       0.46    3.44    |       7180002040891   7180002040646
 421477  421977  |       9727    10223   |       501     497     |       95.41   |       562884  75576   |       0.09    0.66    |       7180002040891   7180002040646

 171970  180765  |       73605   64798   |       8796    8808    |       96.74   |       274750  75576   |       3.20    11.65   |       7180002040912   7180002040646
 180765  195989  |       64652   49398   |       15225   15255   |       96.05   |       274750  75576   |       5.54    20.18   |       7180002040912   7180002040646
 196223  203368  |       49244   42105   |       7146    7140    |       96.81   |       274750  75576   |       2.60    9.45    |       7180002040912   7180002040646
 205935  208456  |       38652   36130   |       2522    2523    |       96.12   |       274750  75576   |       0.92    3.34    |       7180002040912   7180002040646
 207096  210496  |       42082   38673   |       3401    3410    |       96.07   |       274750  75576   |       1.24    4.51    |       7180002040912   7180002040646
 209529  217353  |       36109   28272   |       7825    7838    |       95.50   |       274750  75576   |       2.85    10.37   |       7180002040912   7180002040646
 213859  219155  |       25520   20220   |       5297    5301    |       95.80   |       274750  75576   |       1.93    7.01    |       7180002040912   7180002040646
 221183  227630  |       20236   13870   |       6448    6367    |       93.63   |       274750  75576   |       2.35    8.42    |       7180002040912   7180002040646
 227923  228686  |       13884   13126   |       764     759     |       95.16   |       274750  75576   |       0.28    1.00    |       7180002040912   7180002040646
 228084  231413  |       13104   9760    |       3330    3345    |       94.10   |       274750  75576   |       1.21    4.43    |       7180002040912   7180002040646

 #contig alignments
 1929    10403   |       8520    1       |       8475    8520    |       94.06   |       26592   8520    |       31.87   100.00  |       7180002040891.4.10      7180002040646.1.8       [CONTAINS]
 9084    11685   |       30871   33473   |       2602    2603    |       93.29   |       276174  33473   |       0.94    7.78    |       7180002040891.8.10      7180002040646.8.8
 143750  144250  |       1187    1683    |       501     497     |       95.41   |       276174  4564    |       0.18    10.89   |       7180002040891.8.10      7180002040646.2.8

 163797  172592  |       31502   22695   |       8796    8808    |       96.74   |       233551  33473   |       3.77    26.31   |       7180002040912.3.5       7180002040646.8.8
 172592  187816  |       22549   7295    |       15225   15255   |       96.05   |       233551  33473   |       6.52    45.57   |       7180002040912.3.5       7180002040646.8.8
 188050  195195  |       7141    2       |       7146    7140    |       96.81   |       233551  33473   |       3.06    21.33   |       7180002040912.3.5       7180002040646.8.8
 197762  200283  |       2523    1       |       2522    2523    |       96.12   |       233551  2523    |       1.08    100.00  |       7180002040912.3.5       7180002040646.6.8       [CONTAINS]
 198923  202323  |       3410    1       |       3401    3410    |       96.07   |       233551  3411    |       1.46    99.97   |       7180002040912.3.5       7180002040646.7.8       [CONTAINS]
 201356  209180  |       7838    1       |       7825    7838    |       95.50   |       233551  7838    |       3.35    100.00  |       7180002040912.3.5       7180002040646.5.8       [CONTAINS]
 205686  210982  |       12396   7096    |       5297    5301    |       95.80   |       233551  12396   |       2.27    42.76   |       7180002040912.3.5       7180002040646.3.8
 213010  219457  |       7112    746     |       6448    6367    |       93.63   |       233551  12396   |       2.76    51.36   |       7180002040912.3.5       7180002040646.3.8
 219750  220513  |       760     2       |       764     759     |       95.16   |       233551  12396   |       0.33    6.12    |       7180002040912.3.5       7180002040646.3.8
 219911  223240  |       4564    1220    |       3330    3345    |       94.10   |       233551  4564    |       1.43    73.29   |       7180002040912.3.5       7180002040646.2.8       [CONTAINS]

 #marker & alignment summary
 #id            BT-ref    #markers  slope     begin     end       len       HS-ref    #align    slope     begin     end       len
 7180002040891  4         1         0         13956053  14518937  562884    22        14        -1.5687   20838730  21401614  562884
 7180002040163  8         1         -1        38079677  38178351  98674     22        3         1.0264    21133894  21232568  98674     update:dir:7180002040646
 7180002040646  8         .         1         38088797  38164373  75576     22        4         -0.9833   21143014  21218590  75576     assign:Chr:7180002040163,7180002040163

2. ~ 349 cases (-3 cases 3.)

 cat all-all.begin.ids all-all.end.ids | sort -u | wc -l

3. ~ 3 cases

 intersect.pl all-all.begin.ids all-all.end.ids | perl -ane 'print $_ if($F[1]=~/1.1$/);'
 7180002032818.1.1       23259
 7180002036943.1.1       28355
 7180002040409.1.1       25219 # merges 2 scaff
 
 1       2811    |       2807    1       |       2811    2807    |       99.11   |       23259   5134    |       12.09   54.67   |       7180002032818.1.1       7180002032811.1.2[BEGIN]  [BEGIN]
 22101   23259   |       7238    6079    |       1159    1160    |       99.66   |       23259   7238    |       4.98    16.03   |       7180002032818.1.1       7180002032811.2.2[END]    [END]

 1       2811    |       2807    1       |       2811    2807    |       99.11   |       23259   5134    |       12.09   54.67   |       7180002032818.1.1       7180002032811.1.2[BEGIN]  [BEGIN]
 22101   23259   |       7238    6079    |       1159    1160    |       99.66   |       23259   7238    |       4.98    16.03   |       7180002032818.1.1       7180002032811.2.2[END]    [END]
 
 1       2025    |       2066    4090    |       2025    2025    |       99.80   |       25219   4090    |       8.03    49.51   |       7180002040409.1.1       7180002033541.1.1[BEGIN]  [END]
 23394   25219   |       5761    3930    |       1826    1832    |       98.42   |       25219   5761    |       7.24    31.80   |       7180002040409.1.1       7180002033538.1.1[END]    [END]
 cd  nucmer_scf.ovl
 intersect.pl  all-all.begin.ids all-all.end.ids
 7180002032818   23259
 7180002036943   28355
 7180002040409   25219

4. ~10 cases

 cd nucmer_ctg.ovl/translated/
 cat all-all.annotated.coords | egrep 'BEGIN|END|CONTAIN' | p 'next if($F[6]<5000); next if($F[7]<5000); print $_;' | p '$F[17]=~/^([^.]+)(.+)/;  $F[17]=$1 ; $F[18]=~/^([^.]+)/; $F[18]=$1; print $F[17],"\t",$F[18],"\n";' | count.pl -m 2
 7180002041235   brk002041306a   3
 7180002038888   7180002034914   2
 7180002041059   7180002040095   2
 7180002040879   7180002040934   2
 cnt0002041350   7180002040907   2
 7180002041341   7180002040894   2
 7180002041015   cnt0002040938   2
 7180002039401   7180002040397   2
 7180002037358   7180002031425   2
 7180002040915   7180002039470   2

 cat all-all.annotated.coords | egrep 'BEGIN|END|CONTAIN' | grep ...
 12363   21143   |       1       8837    |       8781    8837    |       93.43   |       21143   50443   |       41.53   17.52   |       7180002041235.23.33     brk002041306a.5.7[END]    [BEGIN]
 1       9272    |       5591    14944   |       9272    9354    |       94.47   |       9272    50443   |       100.00  18.54   |       7180002041235.25.33     brk002041306a.5.7[CONTAINED]
 71049   97689   |       31941   5220    |       26641   26722   |       96.71   |       97689   31941   |       27.27   83.66   |       7180002041235.29.33     brk002041306a.6.7[END]    [END]
 57      3896    |       14743   10903   |       3840    3841    |       95.94   |       3896    22667   |       98.56   16.95   |       7180002041235.32.33     brk002041306a.3.7[CONTAINED]

 1       8942    |       1       8947    |       8942    8947    |       99.49   |       20792   8947    |       43.01   100.00  |       7180002038888.1.1       7180002034914.1.2[CONTAINS]
 9670    20792   |       1       11091   |       11123   11091   |       99.27   |       20792   11091   |       53.50   100.00  |       7180002038888.1.1       7180002034914.2.2[CONTAINS]

 1       8705    |       3932    12610   |       8705    8679    |       96.17   |       8705    30654   |       100.00  28.31   |       7180002041059.1.26      7180002040095.1.1[CONTAINED]
 1       4777    |       12989   17697   |       4777    4709    |       95.32   |       4777    30654   |       100.00  15.36   |       7180002041059.2.26      7180002040095.1.1[CONTAINED]
 1       6588    |       17999   24607   |       6588    6609    |       96.15   |       6591    30654   |       99.95   21.56   |       7180002041059.3.26      7180002040095.1.1[CONTAINED]

 630     16709   |       16024   1       |       16080   16024   |       97.09   |       16901   16024   |       95.14   100.00  |       7180002040879.2.14      7180002040934.2.87[CONTAINS]
 1       5876    |       5882    1       |       5876    5882    |       98.51   |       39657   13643   |       14.82   43.11   |       7180002040879.3.14      7180002040934.1.87[BEGIN] [BEGIN]

Scaffold links

Try to identify scaffold that fit the following criteria:

  • have no markers and no alignments to HS
  • linked by 2+ links to a single scaffold that has markers/alignments to HS
  • is not a variant (2Kbp+ of unique sequnece)
                                                 elem       min        max        mean       med        n50        sum
 linked                                          6169       316        88326      2408       1480       2709       14,860,479
 linked(2+mates)                                 2109       316        44120      2595       1558       3092       5,474,510
 linked(2+mates to a single scf)                 2057       316        44120      2516       1547       2850       5,177,182
 linked(2+mates to a single scf, no variant)     112        316        43782      7278       1987       23338      815,223        # 46>2Kbp; 25>10Kbp

=> 112 scaffold & 0.81Mbp could be added to the chromosomes !!!

BT alignments

  • UMD2.6.1 vs UMD2.6.1
  1. Marker scaffolds against themselves (2642 total scaffolds) : nucmer -maxmatch -l 40 -c 2500 -g 250
  2. Mapped scaffolds without markers against marker scaffolds: nucmer -maxmatch -l 40 -c 250
  3. Mapped scaffolds without markers against themselves: nucmer -maxmatch -l 40 -c 250

File location:

 /scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_Chr/

Other issues

Chr10 gene duplication

  • LOC100298457
  • See /nfshomes/dpuiu/Readmes/bos_taurus.runCA.13.txt

Scf breaks (Mike Robers)

 ::::::::::::::
 scf.break.ids
 ::::::::::::::
 #ctg            scf             beg     end     dir
 7180002015438   7180002038435   0       32822   f       # ok: 4 markers from Chr18
 7180002021916   7180002040442   0       24559   f       # ok: 1 marker  from Chr10
 7180001725791   7180002040808   466123  562461  f       # ok: 25 marker  from Chr1, 1 marker from Chr2 
 7180001727899   7180002040844   8236574 8479429 f       # also found by us; 3 Chr30 markers in the middle
 7180001854650   7180002041103   5230302 5294734 f       # also found by us; 4 Chr15 markers at 3'
 7180002003578   7180002041216   300250  392069  f       # also found by us; 7 Chr5 markers at 5'
 7180002020010   7180002041293   4482477 4600045 f       # ok
 7180001722390   7180002041353   0       51674   f       # ok
 ::::::::::::::
 scf.excised.ids
 ::::::::::::::
 #ctg            scf             beg     end     dir
 7180002008629   7180002034664   82740   105723  f       # no markers
 7180001786240   7180002040913   2076746 2112375 f       # ok
 7180001787022   7180002040927   1075352 1109350 f       # ok
 7180001787352   7180002040931   4956099 4967823 f       # ok
 7180001789387   7180002040981   6641924 6671967 f       # ok
 7180001789575   7180002040984   3095587 3154302 f       # ok
 7180002003269   7180002041209   6350890 6470971 f       # ok
 7180002025281   7180002041322   989366  1227636 f       # ok
 7180002026741   7180002041328   343726  414329  f       # ok
 7180002029433   7180002041343   834493  888144  f       # ok
 7180001726451   7180002041383   7893629 7901645 f       # ok

LOC100298457 (duplicate gene on Chr10)

Problem:

 Is LOC100298457 gene (cow Chr10) a variation of  MFSD3 gene (cow Chr14)?

UMD2.0

Chr10   15186759        15189292        833     W       7180003260677   1       2534    +     # gene LOC100298457
Chr14   1839355         1958202         533     W       7180003326040   1       118848  +     # gene MFSD3
  • LOC100298457 aligns to 2 UMD2.6 scaffolds:
 scf7180002041112 (1.56Mbp; 34 contigs) 
 scf7180002033841 (2.44Kbp; 1 contig)
  • scf7180002041112 contains 39 bos taurus Chr14 markers and has 206 alignments to human chromosome 8
  • scf7180002033841 contains no bos taurus markers and has 1 alignments to human chromosome 8.
  • scf7180002033841 5' aligns to scf7180002041112 (1.56Mbp; 34 contigs, cow Chr14) and to a human Chr8 region that maps to cow Chr14
  • scf7180002033841 3' aligns to scf7180002041157 (6.15Mbp; 99 contigs, cow Chr10)
  • scf7180002033841 & scf7180002041157 are linked by 2 mate pairs (inserts from a 3kbp BCM shotgun library)
  • LOC100298457 vs UMD2.6 scf:
  [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS]
 ===============================================================================================================================
     1       88  |  1138313  1138226  |       88       88  |   100.00  |     1344  1561688  |     6.55     0.01  | LOC100298457       scf7180002041112
    80      192  |  1138150  1138038  |      113      113  |   100.00  |     1344  1561688  |     8.41     0.01  | LOC100298457       scf7180002041112
   190      316  |  1137974  1137848  |      127      127  |   100.00  |     1344  1561688  |     9.45     0.01  | LOC100298457       scf7180002041112
   315      464  |  1137770  1137621  |      150      150  |   100.00  |     1344  1561688  |    11.16     0.01  | LOC100298457       scf7180002041112
   455     1344  |  1137322  1136432  |      890      891  |    99.78  |     1344  1561688  |    66.22     0.06  | LOC100298457       scf7180002041112

     1     1344  |     1349        6  |     1344     1344  |   100.00  |     1344     2445  |   100.00    54.97  | LOC100298457       scf7180002033841
  • UMD2.6 scf* vs scf7180002033841:
    [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS]
 ===============================================================================================================================
 1136427  1137322  |        1      895  |      896      895  |    99.78  |  1561688     2445  |     0.06    36.61  | scf7180002041112   scf7180002033841
 4060495  4061426  |     2445     1521  |      932      925  |    97.76  |  6157473     2445  |     0.02    37.83  | scf7180002041157   scf7180002033841

Bos taurus marker summary:

 #id                BT-ref    #markers  slope     begin     end       len
 scf7180002041112   Chr14     39        -0.8518   3377707   4939395   1561688
 scf7180002041157   Chr10     255       1.0035    12831593  18989066  6157473
 scf7180002033841   ?         0

Homo sapiens alignment summary:

 #id                HS-ref    #align    slope     begin     end       len
 scf7180002041112   Chr8      206       0.6701    144381103 145942791 1561688
 scf7180002041157   Chr15     1667      0.987     62245836  68403309  6157473
 scf7180002033841   Chr8      1         1         145705320 145707765 2445
       [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS]
 ===============================================================================================================================
 143996007 143996305  |    74278    74572  |      299      295  |    82.45  | 146274826  1561688  |     0.00     0.02  |  1  1  NC_000008        7180002041112
 ...
 146249212 146249657  |  1268887  1268446  |      446      442  |    80.89  | 146274826  1561688  |     0.00     0.03  |  1 -1  NC_000008        7180002041112

   63013167 63013429  |   887697   887966  |      263      270  |    78.15  | 100338915  6157473  |     0.00     0.00  |  1  1  NC_000015        7180002041157
   ...
   68369047 68369417  |  6154757  6155112  |      371      356  |    83.06  | 100338915  6157473  |     0.00     0.01  |  1  1  NC_000015        7180002041157

 145705522 145706206  |      202      886  |      685      685  |    81.28  | 146274826     2445  |     0.00    28.02  |  1  1  NC_000008        7180002033841

scf7180002033841: 6 reads

 read            scf             begin   end     dir
 1120017508      7180002033841   0       1018    f
 1120020725      7180002033841   227     1109    f
 1120020722      7180002033841   724     1720    r
 1120020726      7180002033841   1006    2073    f       diffScaffold
 1120017511      7180002033841   1303    2175    r
 1120020728      7180002033841   1370    2445    f       diffScaffold

scf7180002033841: 4 mates

 read1           read2           scf1            scf2
 1120017508      1120020722      7180002033841   7180002033841
 1120017511      1120020725      7180002033841   7180002033841
 1120020726      1120017512      7180002033841   7180002041157  diffScaffold
 1120020728      1120017514      7180002033841   7180002041157  diffScaffold

scf7180002041157: 69095 reads; frg 1120017514 & 1120017512 positions close to the region aligned to scf7180002033841 (4060495-4061426)

 read            scf             begin   end     dir
 607312594       7180002041157   0       866     f
 ...
 1120017512      7180002041157   4059332 4060396 f       diffScaffold
 1120017514      7180002041157   4058851 4059931 f       diffScaffold
 ....
 583956204       7180002041157   6157144 6157228 r

scf7180002041157: 99 contigs

 count ctg                scf                begin   end     dir
 1     ctg7180001926175   scf7180002041157   0       81467   f
 ..
 64    ctg7180001926231   scf7180002041157   4039922 4046680 f # 6758bp ctg;    800bp gap following the contig
 65    ctg7180001926232   scf7180002041157   4047480 4193693 f # 146213bp ctg ; 20bp  gap following the contig

 66    ctg7180001926233   scf7180002041157   4193713 4208047 f # 14334bp ctg
 ..
 99    ctg7180001926260   scf7180002041157   6152691 6157473 f

frg 1120017514 & 1120017512 positions on ctg7180001926232

 1120017514  ctg7180001926232   11371   12451   f  # reads are 11kbp(>3Kb lib mean) inside the contig
 1120017512  ctg7180001926232   11852   12916   f  # reads are 11kbp(>3Kb lib mean) inside the contig

Missing genes

  • 5 genes not found by Liliana using ESTaligner but found using gmap (%id<95) : they were on the haplotype variants

Chr27 -> Chr21

  • The centromeric end of BTA27 is actually placed on BTA21 in UMD2.6 (which version?)


Chromosomes

synteny.redo2

  • summary.txt : 8122 scf + 6322 deg => 14444 seqs (placed using markers or synteny to HS)
  • from summary.txt removed:
  58 scf.questionable.ids (44 placed, 14 linked)
 100 ctg.questionable.ids
 931 deg.questionable.ids

UMD_2.6.d_g

 .                    elem       <2000      >2000      min    max        mean       med        n50        sum
 haplotypes_contigs   3828       2701       1127       471    123243     2503       1580       2590       9,584,985

UMD_2.6.a_g_070109

Gaps: 65900

  • all 65900 gaps are "fragment yes"

UMD_2.6.a_g_070509

  • Combine UMD_2.6.a_g_070109 (Guillaume's assembly) and UMD2.6.1 (Daniela's)

UMD_2.6.a_g_070709

  • Remove from UMD_2.6.a_g_070509
    • ~ 21 ChrY ctg & ~ 39 ChrY deg
    • ~ 4206 haplotype variants (6.63Mbp) within 1K from one another
 .                                      elem       <2000         >2000      min    max        mean       med        n50        sum
  ctg_deg.variants.placed                6654       5504         1150       263    42158      1761       1189       1772       11719653
  ctg_deg.variants.placed.sameChr        5374       4455         919        263    42158      1748       1152       1775       9393815
  ctg_deg.variants.placed.within_100K    4790       4100         690        263    42158      1651       1128       1605       7911160
  ctg_deg.variants.placed.within_1K*     4206       3679         527        263    42158      1577       1107       1508       6633118 

Summary:

                              ctg+deg    <2000      >=2000     min    max        mean       med        n50        sum
 Chr1..29,X                   72197      20763      51434      65     1160130    36536      13055      97328      2,637,809,286
 ChrU                         3752       2587       1165       362    179692     3284       1447       6427       12,324,356
 ChrY-contigs                 315        266        49         224    26490      2249       974        6679       708,535
 contigs.haplotype-variants   40198      36720      3478       263    51828      1460       1203       1361       58,698,457
 deg.unplaced.less_2K         224945     224945     0          65     1996       972        983        990        218,847,978

Issues

  • 490 scf don't have all ctgs placed (865 ctgs)
  • 699 reliable contigs (3.25Mbp) unplaced
 difference.pl ctg.reliable.ids UMD_2.6.a_g_070709/Chr.posmap | getSummary.pl -i 2 -z 2000
 elem       <2000      >2000      min    max        mean       med        n50        sum
 699        405        294        362    123243     4660       1759       14809      3257635
  • Rearrangements UMD2.0 vs UMD_2.6.a_g_070709 : ~ 25 ctgs>50K ; ~ 12 scf
  • AGP file format: "fragment yes" should be preserved even if gap type=U
 grep -A 2 7180001925241 UMD_2.6.a_g_070109/Chr.agp
 Chr30   131686408       131762880       14157   W       7180001925241   1       76473   +
 Chr30   131762881       131762980       14158   U       100     fragment        yes
 Chr30   131762981       131796167       14159   W       7180001925242   1       33187   +
 grep -A 2 7180001925241 ../UMD_2.6.a_g_070709/Chr.agp
 ChrX    133855417       133931889       15581   W       7180001925241   1       76473   +
 ChrX    133931890       133931989       15582   U       100     contig  no
 ChrX    133931990       133965176       15583   W       7180001925242   1       33187   +
  • Liliana found 18 ctg + deg that have genes
 Fixed case B (7180001932648,7180001925237)


  • Bob (Missouri)
  1. scf7180002041216 (43 ctgs, 1.44Bmp) should be split : Chr15 (36 ctgs) , Chr5 (first 7 ctgs) ctg7180002003576,ctg7180002003577 moved from Chr9=>Chr5
 => Chr5
    1  7180002003574   7180002041216   0       19886   f # already on Chr5
    2  7180001694221   7180002041216   19906   20911   r
    3  7180001978231   7180002041216   20931   52725   r
    4  7180001946738   7180002041216   53559   55402   r
    5  7180002003575   7180002041216   55422   92145   f
    6  7180002003576   7180002041216   92165   200007  f
    7  7180002003577   7180002041216   200456  300048  f
  1. scf7180002041153 (48 ctgs, 2.36Mbp): assignment to Chr11 seems correct, Chr6(0)

UMD_2.6.a_g_071709 -> UMD_Freeze2.99

Changes vs UMD_2.6.a_g_070709:

  • 17 contigs/degenerates recruited by Liliana based on mRNA alignments got placed on chromosomes
  • Chr9,15 => Chr5 correction : a scaffold got broken between 2 chromosomes & and several contigs got moved to Chr5
  • 697 unplaced contigs from placed scaffolds got placed as well

Contig placement summary:

                                #ctg+deg <2Kbp   >=2Kbp min  max        mean   med    n50    sum
  ==================================================================================================
  Chr1..29,X                    72911   21180   51731  65   1160130    36223  12706  97232  2641097363
  ChrU                          3365    2448    917    224  179692     2898   1348   5399   9754701
  contigs.haplotype-variants    40198   36720   3478   263  51828      1460   1203   1361   58698457
  deg.unplaced.less_2K          224933  224933  0      65   1996       972    983    990    218837572
  Chr1..29,X(new)               714     417     297    177  123243     4605   1741   14809  3288596
  ChrY-contigs                  315     266     49     224  26490      2249   974    6679   708535
  ChrY-contigs.SHOTGUN_ONLY     144     140     4      804  4224       993    882    888    143047
  ===================================================================================================

Comments:

  • "Chr1..29,X", ChrU, contigs.haplotype-variants, deg.unplaced.less_2K are mutually exclusive sets
  • "Chr1..29,X(new)" are contigs which were not placed in UMD_2.6.a_g_071709.
    • 17 contig them were added by Liliana (1 failed)
    • the rest are reliable contigs left unplaced by Aleksey/Guillaume program
  • ChrY-contigs.SHOTGUN_ONLY are a subset of ChrY-contigs which don't contain only SHOTGUN reads
  • ChrY-contigs are part of ChrU

Files (walnut):

  /scratch1/bos_taurus/Assembly/2009_0312_CA/scf_placements/UMD_2.6.a_g_071709/                # FASTA & AGP format
  /scratch1/bos_taurus/Assembly/2009_0312_CA/scf_placements/UMD_2.6.a_g_071709/nucmer_UMD2.0/  # nucmer alignments to UMD2.0

Files (freeze):

 /fs/szasmg3/bos_taurus/UMD_Freeze2.99/               # FASTA & AGP format
 /fs/szasmg3/bos_taurus/UMD_Freeze2.99/ncbi_files     # SEQUIN format

Ftp:

  ftp://ftp.cbcb.umd.edu/pub/salzberg/Bos_taurus_2.99/ -> /fs/ftp-cbcb/pub/salzberg/Bos_taurus_2.99/

UMD_2.6.a_g_072109 -> UMD_Freeze3.0

Changes vs UMD_2.6.a_g_070709:

Gaps: 75739

  • all 27103 N gaps are "fragment yes"
  • all 48636 U gaps are "contig no"

Files (walnut):

  /scratch1/bos_taurus/Assembly/2009_0312_CA/scf_placements/UMD_2.6.a_g_072109/                # FASTA & AGP format
  

Files (freeze):

 /fs/szasmg3/bos_taurus/UMD_Freeze3.0/               # FASTA & AGP format
 /fs/szasmg3/bos_taurus/UMD_Freeze3.0/ncbi_files     # SEQUIN format

Ftp:

 ftp://ftp.cbcb.umd.edu/pub/data/assembly/Bos_taurus/Bos_taurus_UMD_3.0/ -> pub/data/assembly/Bos_taurus/Bos_taurus_UMD_3.0/

Issues:

  • 7180001836672 941bp deg on Chr4 : aligns on all its length to the cow mitochondrion; placed based on human synteny
  • mitochondrion screening was done only on contigs, not on degenerates
  • Align all Chr*.fasta files to cow mitochondrion; show-coords -I 90 -L 600
 [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS]
 ===============================================================================================================================
 3603     4930  |        6     1333  |     1328     1328  |    90.89  |    16339     1333  |     8.13    99.62  | AY676873  ctg7180001759596 [CONTAINS]
 13284    14521  |    1214        1  |     1238     1214  |    97.90  |    16339     1216  |     7.58    99.84  | AY676873  deg7180001872458 [CONTAINS]
 15731    16339  |       1      608  |      609      608  |    99.18  |    16339      941  |     3.73    64.61  | AY676873  deg7180001836672 [END]
  • Delete 2 degenerates
  • Summary 3.a --Dpuiu 11:04, 5 August 2009 (EDT)
 .                                 ctg+deg <2Kbp   >=2Kbp min  max      mean   med    n50     sum
 ======================================================================================================
 Chr1..29,X                        72479   20862   51617  65   1160130  36424  12941  103785  2639984487
 ChrU                              3285    2404    881    224  179692   2890   1338   5425    9496583
 Chr                               75764   23266   52498  65   1160130  34970  11207  96955   2649481070
 
 contigs.haplotype-variants        40611   36984   3627   263  97877    1476   1205   1372    59958728
 deg.unplaced.less_2K              224933  224933  0      65   1996     972    983    990     218837572
 
 ChrY-contigs                      314     266     48     224  26490    2210   973    6539    694140
 ChrY-contigs.SHOTGUN_ONLY         144     140     4      804  4224     993    882    888     143047
 ======================================================================================================

UMD_Freeze3.1

--Dpuiu 11:42, 19 November 2009 (EST)

  • Only changed some of the gap specifications
  • Original(CA):
 ctg: 90135 
 deg: 251413 
 scf: 39978 
  • UMD3.1 AGP:
 ctg:             60499
 deg:             15229
 scf(CA):         11458
 ctg(unoriented)    2118
 scf(placed chr)    3193 
 scf(unplaced chr)  3285

Files:

/fs/szasmg3/bos_taurus/UMD_Freeze3.1

ToDo

  • Align all UMD2.0 ChrU ctg/deg 10Kbp+ to our assembly; make sure everything aligns (Steven's suggestion)
 /scratch1/bos_taurus/Assembly/2009_0312_CA/scf_placements/UMD_2.6.a_g_071709/nucmer_UMD2.0
    • all ctgs: 113K, 244Mbp
    • 10Kbp+ ctgs: 2561, 54Mbp
    • 53,951,168 out of 54,971,011 bp covered (98% of the sequence)
    • Most ctgs were added to ChrU
 join2.pl -i 4 UMD2.0.ChrU.10K.maxCvg.pair ../other/Chr.posmap | awk '{print $10,$2}' | ~/bin/sum2.pl | sort -nk3 -r
 Chr30   252     7,511,975
 Chr1    171     3263575
 Chr6    138     3043388
 Chr12   152     2676615
 ChrU    100     2600562
 ...
 Chr28   29      473523
 all     2651    54,960,233

1. Remove remain_haps

 .                                elem       <2000      >2000      min    max        mean       med        n50        sum
 remain_haps                      436        282        154        471    37860      2600       1693       2819       1,133,874
 remain_haps(Chr1..30,U)          408        264        144        471    37860      2636       1692       2883       1,075,616 # 77 from ChrU
 remain_haps(Chr1..30)            335        226        109        471    37860      2480       1661       2728       831,068

2. Remove/trim contaminants

3. Remove/trim N's