Bos taurus redo
| = BCM = | |||
| * [http://vector.bcm.tmc.edu/Technical_Information/Cloning_Vectors/Cloning_Vectors.htm Cloning_Vectors] | |||
| = NCBI Data = | |||
| * [http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj&cmd=search&term=bos%20taurus Genome Projects] | |||
| * [http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=retrieve&size=0&val=SPECIES_CODE+%3D+%22BOS+TAURUS%22 TA search] | |||
| * [ftp://ftp.ncbi.nih.gov/pub/TraceDB/bos_taurus/ TA ftp] | |||
| * 91 volumes: 87 with qual & 4 with no quality (85 volumes contain BCM reads) | |||
| * 14 centers | |||
| * 21 center/trace_type_codes | |||
| * Avg LEN=984 | |||
| * Avg CLIP (CLB intersect CLV)=760 | |||
| * Avg CLV=997  > Avg LEN ??? | |||
| * Avg QUAL=38.96   (27.51 for the 2.59M reads not in the UMD assembly) | |||
| * Avg UMDoverlapper CLIP=778  | |||
| Problems: | |||
| * 0 QUAL reads 650,133 ( 551,114 are BCM.WGS ) | |||
| * the quality lines in several .qual files start with a space; it must be removed, otherwise tarchive2ca fails, reporting len(quality)=len(seq)+1 | |||
| * several XML files contained the "&" character => XML parser error | |||
| * xml.bos_taurus.087 contained 2 trace_volumes => XML parser error | |||
| * BCCAGSC.CLONEEND : all reads have LIBRARY_ID=CH240, SEQ_LIB_ID=. ; the INSERT_SIZE & INSERT_STDEV vary within the library: set to 150,000 & 30,000 | |||
| * UIUC.CLONEEND: INSERT_SIZE & INSERT_STDEV missing: set to 150,000 & 30,000 | |||
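The first two problems above can be cleaned up before parsing. The following is a minimal Python sketch, not the procedure actually used here: the function names are hypothetical, and it assumes FASTA-style .qual files (header lines starting with ">") and that the offending "&" characters are not already part of an XML entity.

```python
import re

def strip_leading_space(qual_lines):
    """Remove leading whitespace from quality lines; '>' header lines are kept as-is."""
    return [line if line.startswith(">") else line.lstrip() for line in qual_lines]

# Matches "&" that does not already start an entity such as &amp; or &#38;
_BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")

def escape_bare_ampersands(xml_text):
    """Replace bare '&' with '&amp;' so that an XML parser accepts the text."""
    return _BARE_AMP.sub("&amp;", xml_text)

if __name__ == "__main__":
    # Toy inputs, not taken from the real volumes
    print(strip_leading_space([">read1", " 20 20 31", "35 40"]))
    print(escape_bare_ampersands("<FIELD>A&B</FIELD> &amp; done"))
```

Already escaped entities are left untouched, so the clean-up can be rerun on a file that was partly fixed.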
| == CENTER_NAME counts == | |||
|      COUNT           CENTER_NAME      | |||
|   1  35629020        BCM             Baylor College of Medicine | |||
|   2  737900          NISC            NIH Intramural Sequencing Center | |||
|   3  652614          BCCAGSC         British Columbia Cancer Agency Genome Sciences Centre                           # TA query_tracedb CENTER_NAME = "BCCAGSC" => 652,510  | |||
|   4  378871          MARC            USDA, ARS, US Meat Animal Research Center | |||
|   5  114753          UIUC            University of Illinois at Urbana-Champaign                                      # TA query_tracedb CENTER_NAME = "UIUC" => 106,368 | |||
|   6  107367          BARC            USDA, ARS, Beltsville Agricultural Research Center | |||
|   7  65171           TIGR            The Institute for Genomic Research | |||
|   8  53556           GSC             Genoscope | |||
|   9  43033           CENARGEN        Embrapa Genetic Resources and Biotechnology | |||
|   10 18623           SC              The Sanger Center | |||
|   11 15301           UOKNOR          University of Oklahoma Norman Campus, Advanced Center for Genome Technology | |||
|   12 10651           TIGR_JCVIJTC    The Institute for Genomic Research, Traces generated at JCVIJTC                 # TA query_tracedb CENTER_NAME="JCVI" | |||
|   13 2485            UIACBCB         University of Iowa Center for Bioinformatics and Computational Biology (UIACBCB) | |||
|   14 49              WUGSC           Washington University, Genome Sequencing Center                                 # TA query_tracedb CENTER_NAME = "WUGSC" => 9 | |||
|      '''37829394'''        total           total                                                                           # TA query_tracedb SPECIES_CODE = "BOS TAURUS" => 37,788,710 | |||
| == TRACE_TYPE_CODE counts == | |||
|      COUNT         CENTER_NAME     TRACE_TYPE_CODE         | |||
|   1  24863599      BCM*            WGS                    SEQ_LIB_ID:89 | |||
|   2  10748529      BCM*            SHOTGUN                SEQ_LIB_ID:15543 | |||
|   3  737900        NISC            SHOTGUN                SEQ_LIB_ID:247 | |||
|   4  125597        BCCAGSC         CLONEEND               LIBRARY_ID:1         large insert size; some qualityless; !!! almost all have CLIP3=0 | |||
|   5  114753        UIUC            CLONEEND               LIBRARY_ID:2         insert size missing , no frequent kmers | |||
|   6  65171         TIGR            CLONEEND               SEQ_LIB_ID:1         2K & use TRACE_DIRECTION instead of TRACE_END | |||
|   7  53556         GSC             CLONEEND               SEQ_LIB_ID:1         large insert size; !!! all have qual=0 and were excluded  | |||
|   8  26246         CENARGEN        WGS                    .                    no LIBRARY_ID; no SEQ_LIB_ID; no INSERT_SIZE; no INSERT_STDEV; reads have no direction; ~21954 could be paired (same TEMPLATE_ID) | |||
|   9  25454         BARC            CLONEEND               SEQ_LIB_ID:14304     !!! all have CLIP3=0 | |||
|   10 16892         BCM*            CLONEEND               LIBRARY_ID:1         VBBAA   mean=167000  std=25000 | |||
|   11 16787         CENARGEN        CLONEEND               LIBRARY_ID:1          | |||
|   12 15150         UOKNOR          SHOTGUN                LIBRARY_ID:1         some qualityless | |||
|   13 10651         TIGR_JCVIJTC    CLONEEND               SEQ_LIB_ID:2 | |||
|   14 151           UOKNOR          FINISHING              LIBRARY_ID:1         some qualityless, no direction(TRACE_END=N); no INSERT_SIZE; no INSERT_STDEV | |||
|   15 49            WUGSC           CLONEEND               SEQ_LIB_ID:1  | |||
|      '''36820485'''      total | |||
|   16 527017        BCCAGSC         EST | |||
|   17 207204        MARC            EST | |||
|   18 171667        MARC            PCR | |||
|   19 81913         BARC            EST | |||
|   20 18623         SC              EST  | |||
|   21 2485          UIACBCB         EST | |||
|      '''1008909'''       total | |||
| == STRATEGY & TRACE_TYPE_CODE counts == | |||
|   COUNT           CENTER_NAME     STRATEGY        TRACE_TYPE_CODE | |||
|   12545304        BCM             .               WGS | |||
|   11425910        BCM             WGA             WGS | |||
|   5223683         BCM             CLONE           SHOTGUN | |||
|   4479883         BCM             POOLCLONE       SHOTGUN | |||
|   1044963         BCM             .               SHOTGUN | |||
|   892385          BCM             SNP             WGS | |||
|   737900          NISC            CLONE           SHOTGUN | |||
|   125597          BCCAGSC         CLONEEND        CLONEEND | |||
|   114753          UIUC            CLONEEND        CLONEEND  | |||
|   65171           TIGR            CLONEEND        CLONEEND | |||
|   53556           GSC             CLONEEND        CLONEEND | |||
|   26246           CENARGEN        .               WGS | |||
|   25454           BARC            .               CLONEEND | |||
|   16892           BCM             CLONEEND        CLONEEND | |||
|   16787           CENARGEN        CLONEEND        CLONEEND | |||
|   12195           UOKNOR          .               SHOTGUN | |||
|   10651           TIGR_JCVIJTC    CLONEEND        CLONEEND | |||
|   2955            UOKNOR          CLONE           SHOTGUN | |||
|   151             UOKNOR          .               FINISHING | |||
|   49              WUGSC           CLONEEND        CLONEEND | |||
|   527017          BCCAGSC         EST             EST | |||
|   145820          MARC            EST             EST | |||
|   117958          MARC            COMPARATIVE     PCR | |||
|   81913           BARC            EST             EST | |||
|   61384           MARC            CLONE           EST | |||
|   53709           MARC            Re-Sequencing   PCR | |||
|   18623           SC              EST             EST | |||
|   2485            UIACBCB         .               EST | |||
| == BCM.SHOTGUN libraries == | |||
| * The long inserts are probably wrong !!! | |||
|   SIZE    STDEV   COUNT | |||
|   3500    1500    4502569 | |||
|   2000    1000    3244493 | |||
|   3000    1000    1021577 | |||
|   180000  1000    840528 | |||
|   6500    1500    429026 | |||
|   180000  13000   320208 | |||
|   6000    2000    208192 | |||
|   167000  13000   96337 | |||
|   3500    15000   85599 | |||
|   SIZE    COUNT | |||
|   3500    4588168 | |||
|   2000    3244493 | |||
|   180000  1160736 | |||
|   3000    1021577 | |||
|   6500    429026 | |||
|   6000    208192 | |||
|   167000  96337 | |||
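The second table is the first one collapsed by SIZE. A short Python check of that aggregation, using only the rows listed above (variable names are illustrative):

```python
from collections import Counter

# (SIZE, STDEV, COUNT) rows copied from the BCM.SHOTGUN table above
rows = [
    (3500, 1500, 4502569),
    (2000, 1000, 3244493),
    (3000, 1000, 1021577),
    (180000, 1000, 840528),
    (6500, 1500, 429026),
    (180000, 13000, 320208),
    (6000, 2000, 208192),
    (167000, 13000, 96337),
    (3500, 15000, 85599),
]

# Sum the read counts of all (SIZE, STDEV) pairs that share a SIZE
by_size = Counter()
for size, _stdev, count in rows:
    by_size[size] += count

for size, count in by_size.most_common():
    print(size, count)
```

The per-SIZE sums add up to 10,748,529, the BCM SHOTGUN read total reported in the TRACE_TYPE_CODE counts.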
| == 3' VECTOR TRIMMED counts == | |||
|      CENTER_NAME     TRACE_TYPE_CODE TOTAL           3'CLV<LEN   QUAL==0          UMD.FRG | |||
|   1  BCM             WGS             24863599        10968979    551114           24050767 | |||
|   2  BCM             SHOTGUN         10748529        5052692     23419            10068499 | |||
|   3  NISC            SHOTGUN         737900          28972       0                735488 | |||
|   4  BCCAGSC         CLONEEND        125597          125484      8926             113790 | |||
|   5  UIUC            CLONEEND        114753          90243       0                106247 | |||
|   6  TIGR            CLONEEND        65171           46389       0                64903 | |||
|   7  GSC             CLONEEND        53556           53556       53556 (all)      0           !!! all have 0 quals and were excluded | |||
|   8  CENARGEN        WGS             26246           26246       0                25976 | |||
|   9  BARC            CLONEEND        25454           25454       0                25387 | |||
|   10 BCM             CLONEEND        16892           6751        0                16863 | |||
|   11 CENARGEN        CLONEEND        16787           16787       0                16628 | |||
|   12 UOKNOR          SHOTGUN         15150           2885        12195            0 | |||
|   13 TIGR_JCVIJTC    CLONEEND        10651           339         0                10644 | |||
|   14 UOKNOR          FINISHING       151             0           151              151 | |||
|   15 WUGSC           CLONEEND        49              0           0                0 | |||
|   16 BCCAGSC         EST             527017          524173      772              0 | |||
|   17 MARC            EST             207204          207204      0                0 | |||
|   18 MARC            PCR             171667          171667      0                0 | |||
|   19 BARC            EST             81913           78597       0                0 | |||
|   20 SC              EST             18623           7350        0                0 | |||
|   21 UIACBCB         EST             2485            2485        0                0 | |||
| == ZERO QUALITY COUNTS == | |||
| * Counts | |||
|   CENTER_NAME     TRACE_TYPE_CODE  COUNT | |||
|   BCM             WGS              551114 | |||
|   GSC             CLONEEND         53556 | |||
|   BCM             SHOTGUN          23419 | |||
|   UOKNOR          SHOTGUN          12195 | |||
|   BCCAGSC         CLONEEND         8926 | |||
|   BCCAGSC         EST              772 | |||
|   UOKNOR          FINISHING        151 | |||
|   TOTAL                            650133 | |||
| * For 0 quality reads, assign quality 20 to bases 1..700, 0 to bases 701.. | |||
| * Volumes 026..039 have been fixed | |||
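The replacement rule for 0 quality reads can be sketched as follows. This is a hedged Python illustration, not the script that fixed volumes 026..039; the function name and its defaults are assumptions that simply mirror the rule stated above.

```python
def patch_zero_quality(quals, good_len=700, good_q=20):
    """Return a patched copy of a read's quality values.

    Reads whose qualities are all 0 get good_q for bases 1..good_len
    and 0 for every base after that; other reads are returned unchanged.
    """
    if any(q != 0 for q in quals):
        return list(quals)
    n = len(quals)
    return [good_q] * min(n, good_len) + [0] * max(0, n - good_len)

if __name__ == "__main__":
    patched = patch_zero_quality([0] * 984)
    print(patched[:3], patched[699], patched[700], len(patched))
```

Reads shorter than 700 bp simply get quality 20 over their whole length.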
| = Local Data = | |||
| == Files & Dirs == | |||
|   /fs/szasmg3/bos_taurus/data/ | |||
|   /fs/szasmg2/Drosophila/D_pseudoobscura/Vectors | |||
|   /nfshomes/dpuiu/db/UniVec | |||
| = Software = | |||
| Figaro | |||
| * trims vector only at 5' end | |||
| * calls Lucy for quality trimming | |||
| Lucy | |||
| * both vector sequence and splice sites are required | |||
| Atlas  | |||
| * [http://www.hgsc.bcm.tmc.edu/downloads/software/atlas/readme.html web site] | |||
| * atlas-screen-trim-file : "calls cross_match and atlas-screen-window to create trimmed reads file (scan in from each end of read looking for 50-base windows of high quality and no vector); " | |||
| = Contaminant search = | |||
| Align the CLIPPING range of each read to UniVec & E. coli K12 using nucmer | |||
| == UniVec == | |||
| Ref | |||
|                   #seqs   min     max     mean    median  n50     sum | |||
|   UniVec          2861    12      48551   231     99      781     660,151 | |||
|   UniVec_Core     1348    12      48551   243     98      967     327,641 | |||
| Hits: alignment length | |||
|   bp      #reads  min     max     mean    median  n50     sum | |||
|   19      4548466 19      1045    28.37   23      27      129025025 | |||
|   20      3684852 20      1045    30.56   25      28      112616359 | |||
|   30      1097357 30      1045    48.04   38      43      52714583 | |||
|   40      484661  40      1045    66.36   47      53      32163896 | |||
|   100     54334   100     1045    198     116     223     10772815        # many are ESTs | |||
| == Ecoli == | |||
| Ref: | |||
|   K12 4,639,675 bp | |||
| Hits: alignment length | |||
|   bp      #reads  min     max     mean    median  n50     sum | |||
|   19      275109  19      1223    30.66   19      20      8435470 | |||
|   20      102550  20      1223    50.29   21      161     5156849 | |||
|   30      19032   30      1223    178     37      706     3381214 | |||
|   40      9234    40      1223    329     171     738     3034293 | |||
|   100     6781    100     1223    424     223     749     2876432 | |||
|   200     4378    200     1223    575     696     771     2516916        | |||
| == BCM vectors == | |||
|                   #seqs   min     max     mean    median  n50     sum | |||
|   BCM             14      2580    33180   9379    5821    32705   131312 | |||
| = Vector/Splice site search = | |||
| == Strategy == | |||
| * 1. Select all the reads in the same volume that belong to one particular library; same CENTER_NAME, STRATEGY & TRACE_TYPE_CODE | |||
| * 2. Get the quality clipping trim: CLIP_QUALITY_LEFT & CLIP_QUALITY_RIGHT | |||
| * 3. Separate reads in 2 sets according to direction TRACE_END: FORWARD & REVERSE  | |||
| * 4. Get the most frequent kmers in each set (24 & 8 bp)  | |||
| * 5. Check if the most frequent kmers are overrepresented | |||
| * 6. Check if the most frequent 8mers are present in the most frequent 24mers | |||
| * 7. Try to extend the 24mers by a few bp => linkers | |||
| * 8. Align linkers to the opposite strand sequences using nucmer | |||
| * 9. Extract the subsequences adjacent(following) to linker (50..150bp) | |||
| * 10. Align the subsequences; if they align we've probably identified the vector  | |||
| * 11. Identify the vector name/id by alignment to UniVec => several alignments | |||
| * 12. Check if the forward/reverse vector(s) are the same : we should find a common vector sequence; the UniVec alignments should be adjacent | |||
| * 13. create the Lucy vector & splice files; the splice contains the linker+vector | |||
| * 14. run lucy & trim input reads according to Lucy clr | |||
| * 15. align lucy trimmed reads to linker,vector,splice & UniVec.dust | |||
| * 16. align input reads to linker,vector,splice & UniVec.dust | |||
| * 17. compare the 15. & 16. counts | |||
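Step 4 (most frequent kmers per read set) can be sketched in Python as below. This is an illustration only: the tool that produced the kmers.tab files is not named in these notes, so `top_kmers` is a hypothetical stand-in that emits the same three columns (kmer, reverse complement, count).

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def top_kmers(reads, k, n=10):
    """Count every k-mer in the reads; return the n most frequent as (kmer, revcomp, count)."""
    counts = Counter()
    for seq in reads:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # skip windows containing ambiguous bases
                counts[kmer] += 1
    return [(kmer, revcomp(kmer), c) for kmer, c in counts.most_common(n)]

if __name__ == "__main__":
    # Toy reads; in practice this would run once on the FORWARD set and once on the REVERSE set
    for row in top_kmers(["AAGTAGTTCA", "CAGTAGTTCT"], k=8, n=3):
        print(*row, sep="\t")
```

Running it with k=24 and k=8 on each direction separately gives the four tables that steps 5 and 6 compare.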
| == Example == | |||
| * 1. volume 011 :  500,000 reads  CENTER_NAME=BCM, TRACE_TYPE_CODE=WGS | |||
| * 2. | |||
| * 3. 249,611 TRACE_END=F & 250,389 TRACE_END=R | |||
| * 4. kmers: 8 of the 10 most frequent 8bp kmers are shared by the FORWARD & REVERSE sets; none of the most frequent 24bp kmers are shared | |||
|   ==> 24.fwd/kmers.tab <== | |||
|   AGTTCGACTGCAAGTAGTTCATCA      TGATGAACTACTTGCAGTCGAACT        2463 # contains AGTAGTTC | |||
|   GAGTTCGACTGCAAGTAGTTCATC      GATGAACTACTTGCAGTCGAACTC        2189 | |||
|   CGAGTTCGACTGCAAGTAGTTCAT      ATGAACTACTTGCAGTCGAACTCG        1996 | |||
|   TCGAGTTCGACTGCAAGTAGTTCA      TGAACTACTTGCAGTCGAACTCGA        1593 | |||
|   GTTCGACTGCAAGTAGTTCATCAA      TTGATGAACTACTTGCAGTCGAAC        1023 | |||
|   GAGTTCGACTGCAGTAGTTCATCA      TGATGAACTACTGCAGTCGAACTC        812 | |||
|   CGAGTTCGACTGCAGTAGTTCATC      GATGAACTACTGCAGTCGAACTCG        777 | |||
|   GTTCGACTGCAAGTAGTTCATCAT      ATGATGAACTACTTGCAGTCGAAC        769 | |||
|   TCGAGTTCGACTGCAGTAGTTCAT      ATGAACTACTGCAGTCGAACTCGA        637 | |||
|   ATCGAGTTCGACTGCAAGTAGTTC      GAACTACTTGCAGTCGAACTCGAT        594 | |||
|   ==> 08.fwd/kmers.tab <== | |||
|   AGTAGTTC      GAACTACT        86477 | |||
|   CAGTAGTT      AACTACTG        67681 | |||
|   AGTTCTCA      TGAGAACT        61556 | |||
|   TAGTTCTC      GAGAACTA        60964 | |||
|   GTAGTTCT      AGAACTAC        57866 | |||
|   AGTTCATC      GATGAACT        49676 | |||
|   TAGTTCAT      ATGAACTA        45298 | |||
|   GTTCATCA      TGATGAAC        42117 | |||
|   GCAGTAGT      ACTACTGC        41391 | |||
|   GTAGTTCA      TGAACTAC        40694 | |||
|   ==> 24.rev/kmers.tab <== | |||
|   TATCGATGGTACAGTAGTTCATCA      TGATGAACTACTGTACCATCGATA        999 # contains AGTAGTTC | |||
|   CTATCGATGGTACAGTAGTTCATC      GATGAACTACTGTACCATCGATAG        774 | |||
|   GCTATCGATGGTACAGTAGTTCAT      ATGAACTACTGTACCATCGATAGC        600 | |||
|   CGCTATCGATGGTACAGTAGTTCA      TGAACTACTGTACCATCGATAGCG        432 | |||
|   ATCGATGGTACAGTAGTTCATCAT      ATGATGAACTACTGTACCATCGAT        417 | |||
|   ATCGATGGTACAGTAGTTCATCAA      TTGATGAACTACTGTACCATCGAT        380 | |||
|   ATCAGATGGTACAGTAGTTCATCA      TGATGAACTACTGTACCATCTGAT        373 | |||
|   ATCGATGGTACAGTAGTTCATCAC      GTGATGAACTACTGTACCATCGAT        265 | |||
|   CTATCGATGGTAAGTAGTTCATCA      TGATGAACTACTTACCATCGATAG        235 | |||
|   TCAGATGGTACAGTAGTTCATCAA      TTGATGAACTACTGTACCATCTGA        224 | |||
|   ==> 08.rev/kmers.tab <== | |||
|   AGTTCATC      GATGAACT        85127 | |||
|   TAGTTCAT      ATGAACTA        77902 | |||
|   GTTCATCA      TGATGAAC        75585 | |||
|   TAGTTCTC      GAGAACTA        68057 | |||
|   AGTTCTCA      TGAGAACT        67277 | |||
|   GTAGTTCT      AGAACTAC        64894 | |||
|   GTAGTTCA      TGAACTAC        62607 | |||
|   CGTAGTTC      GAACTACG        52031 | |||
|   AGTAGTTC      GAACTACT        51013 | |||
|   ACGTAGTT      AACTACGT        31552 | |||
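Steps 5 and 6 for this volume can be checked directly from the tables above. A small Python sketch (the lists are copied from the 08.fwd and 08.rev tables, the two 24mers are the top rows of 24.fwd and 24.rev; variable names are illustrative):

```python
# Ten most frequent 8-mers per direction, copied from the tables above
fwd8 = ["AGTAGTTC", "CAGTAGTT", "AGTTCTCA", "TAGTTCTC", "GTAGTTCT",
        "AGTTCATC", "TAGTTCAT", "GTTCATCA", "GCAGTAGT", "GTAGTTCA"]
rev8 = ["AGTTCATC", "TAGTTCAT", "GTTCATCA", "TAGTTCTC", "AGTTCTCA",
        "GTAGTTCT", "GTAGTTCA", "CGTAGTTC", "AGTAGTTC", "ACGTAGTT"]

# Most frequent 24-mer per direction
top24_fwd = "AGTTCGACTGCAAGTAGTTCATCA"
top24_rev = "TATCGATGGTACAGTAGTTCATCA"

# 8-mers that appear in both top-10 lists
shared = sorted(set(fwd8) & set(rev8))

# Frequent 8-mers that occur inside the most frequent 24-mer of the same direction
in_fwd24 = [k for k in fwd8 if k in top24_fwd]
in_rev24 = [k for k in rev8 if k in top24_rev]

print(len(shared), shared)
print(in_fwd24)
print(in_rev24)
```

The 8mers found inside both top 24mers all come from the shared AGTAGTTCATCA tail, which is the part common to the two linkers.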
| * 7. Get linker sequences | |||
|   >linker.fwd 27bp | |||
|   TCGAGTTCGACTGCAAGTAGTTCATCA | |||
|   >linker.rev 27bp | |||
|   CTAATCAGATGGTACAGTAGTTCATCA  | |||
|   #>linker.rev 40 bp Art's  (13 more bp at 5')         | |||
|   #TATGACCATGCGCCTAATCAGATGGTACAGTAGTTCATCA | |||
|   #GCTATCGATGGTACAGTAGTTCATCAT is the most frequent rev 27-mer but not the linker (it differs by a few SNPs) | |||
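Step 7 (extending the 24mers by a few bp) can be illustrated with a greedy overlap extension. The Python sketch below is an assumption about how such an extension could be done, not the method used here; `extend_seed` and the `min_count=1500` cutoff are hypothetical, the cutoff being chosen only so that the minor variants are ignored.

```python
# 24.fwd kmers and counts copied from the table above
kmer_counts = {
    "AGTTCGACTGCAAGTAGTTCATCA": 2463,
    "GAGTTCGACTGCAAGTAGTTCATC": 2189,
    "CGAGTTCGACTGCAAGTAGTTCAT": 1996,
    "TCGAGTTCGACTGCAAGTAGTTCA": 1593,
    "GTTCGACTGCAAGTAGTTCATCAA": 1023,
    "GAGTTCGACTGCAGTAGTTCATCA": 812,
    "CGAGTTCGACTGCAGTAGTTCATC": 777,
    "GTTCGACTGCAAGTAGTTCATCAT": 769,
    "TCGAGTTCGACTGCAGTAGTTCAT": 637,
    "ATCGAGTTCGACTGCAAGTAGTTC": 594,
}

def extend_seed(counts, min_count):
    """Greedily extend the most frequent k-mer with k-mers that overlap it by k-1 bases."""
    kept = {kmer for kmer, c in counts.items() if c >= min_count}
    if not kept:
        return ""
    seed = max(kept, key=counts.get)
    k = len(seed)
    contig, used = seed, {seed}
    while True:  # extend to the left while exactly one unused k-mer fits
        cands = [m for m in kept - used if m[1:] == contig[:k - 1]]
        if len(cands) != 1:
            break
        contig = cands[0][0] + contig
        used.add(cands[0])
    while True:  # extend to the right in the same way
        cands = [m for m in kept - used if m[:-1] == contig[-(k - 1):]]
        if len(cands) != 1:
            break
        contig += cands[0][-1]
        used.add(cands[0])
    return contig

if __name__ == "__main__":
    linker = extend_seed(kmer_counts, min_count=1500)
    print(len(linker), linker)
```

With this cutoff the four most frequent 24mers chain into the 27 bp linker.fwd sequence listed above; a lower cutoff lets ambiguous extensions in, which is where the extension stops.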
| * 8 & 9 Align reads to linkers using nucmer  | |||
| Fwd: | |||
|   nucmer -l 12 -c 24 -r linker.fwd.seq ../bos_taurus.$v.r.fasta  | |||
|   #  nucmer -l 12 -c 24 -r kmers.seq ../bos_taurus.$v.r.fasta   | |||
|   show-coords out.delta | awk '{print $19,$5,$13}' > ! out.clr | |||
|   extractfromfastanames.pl -clr -f out.clr < ../bos_taurus.$v.r.fasta >! out.seq | |||
| Rev: | |||
|   nucmer -l 12 -c 24 -r linker.rev.seq ../bos_taurus.$v.f.fasta | |||
|   #  nucmer -l 12 -c 24 -r kmers.seq ../bos_taurus.$v.f.fasta   | |||
|   show-coords out.delta | awk '{print $19,$5,$13}' > ! out.clr | |||
|   extractfromfastanames.pl -clr -f out.clr < ../bos_taurus.$v.f.fasta >! out.seq | |||
| Both: | |||
|   clrFasta out.seq >! out.cseq | |||
|   fasta2tab.pl out.cseq | sort -k2 > ! out.tab | |||
|   nucmer -c 40 out.cseq ~/db/UniVec -p vector | |||
|   delta-filter -q vector.delta >! vector.filter-q.delta | |||
|   show-coords vector.filter-q.delta | sort -n | head | |||
|   cat vector.filter-q.delta | grep "^>" | count.pl -c 1 -m 2 | |||
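The idea behind steps 8 and 9 (locate the linker in a read, then keep the 50..150 bp that follow it) can be sketched without nucmer. The Python below swaps the nucmer alignment for an exact string search, so it only finds error-free linker copies; `flank_after` is a hypothetical helper, and the test read is the first 120 bp of read 399553028 from the example.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def flank_after(read, linker, min_len=50, max_len=150):
    """Return up to max_len bases that follow the first exact linker match.

    Returns None if the linker is absent or fewer than min_len bases follow it.
    """
    pos = read.find(linker)
    if pos < 0:
        return None
    start = pos + len(linker)
    flank = read[start:start + max_len]
    return flank if len(flank) >= min_len else None

if __name__ == "__main__":
    linker_rev = "CTAATCAGATGGTACAGTAGTTCATCA"
    read = ("TGATGAACTACTGTACCATCTGATTAGGCGCATGGTCATAGCTGTTTCCTGTGTGAAATT"
            "GCTATCCGCTCACAATTCCACACAACATACGAGCCGGAAGCATAAAGTGTAAAGCCTGGG")
    # The read starts with the reverse complement of linker.rev
    flank = flank_after(read, revcomp(linker_rev))
    print(len(flank), flank[:20])
```

The extracted flanks are what gets aligned to each other and to UniVec in the following steps.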
| * 10. Extract "vector reads" | |||
|   >399553028  # 24.fwd      | |||
|   TGATGAACTACTGTACCATCTGATTAGGCGCATGGTCATAGCTGTTTCCTGTGTGAAATT | |||
|   GCTATCCGCTCACAATTCCACACAACATACGAGCCGGAAGCATAAAGTGTAAAGCCTGGG | |||
|   GTGTCAAATGAGAGACCTAACTCACATTCAACTTTTTTTTTTTTTCTGCCCTCTATTCTA | |||
|   ... | |||
|   >400269118 #24.rev | |||
|   TGATGAACTACTTGCAGTCGAAATCGAATCATCACTGGCCGTCCTTTTACAACGTCGTGA | |||
|   CTGGGAAAACCCTGGCGTTACCCAACTTAATCCGCCTTGCAGCACATCCCCCTTTCCCCC | |||
|   AGCTGGCGTAAAAACGTAAAAAGCCCCGCACCGATCGCCCTTTCCCAACAGGTTGCCCAG | |||
| * 11. Align  "vector reads" to UniVec; identify vector | |||
|   show-coords 24.fwd/400269118-UniVec.delta 24.rev/399553028-UniVec.delta | grep J01636.1 | |||
|       31  148  | 1175 1292  | 118   118  |  95.76  |     1276     7477  |     9.25     1.58  | 399553028.rev gnl|uv|J01636.1:1-7477 | |||
|       32  199  | 1302 1463  | 168   162  |  90.48  |      653     7477  |    25.73     2.17  | 400269118     gnl|uv|J01636.1:1-7477 | |||
| * 12. 10bp distance between the 2 alignments  | |||
| * 13. Lucy files | |||
|   $ more vector.seq | |||
|     >J01636 E.coli lactose operon with lacI, lacZ, lacY and lacA genes | |||
|     GACACCATCGAATGGCGCAAAACCTTTCGCGGTATGGCATGATAGCGCCCGGAAGAGAGTCAATTCAGGG | |||
|     TGGTGAATGTGAAACCAGTAACGTTATACGATGTCGCAGAGTATGCCGGTGTCTCTTATCAGACCGTTTC | |||
|     CCGCGTGGTGAACCAGGCCAGCCACGTTTCTGCGAAAACGCGGGAAAAAGTGGAAGCGGCGATGGCGGAG | |||
|     CTGAATTACATTCCCAACCGCGTGGCACAACAACTGGCGGGCAAACAGTCGTTGCTGATTGGCGTTGCCA | |||
|     ... | |||
|   $ more splice.seq | |||
|     >J01636.for.begin vector+linker.rev | |||
|     TGAATGTGAGTTAGGTCTCTCATTTGACACCCCAGGCTTTACACTTTATGCTTCCGGCTC | |||
|     GTATGTTGTGTGGAATTGTGAGCGGATAGCAATTTCACACAGGAAACAGCTATGACCATG | |||
|     CGCCTAATCAGATGGTACAGTAGTTCATCA | |||
|     >J01636.for.end  rev(linker.fwd)+vector  | |||
|     TGATGAACTACTTGCAGTCGAAATCGAATCATCACTGGCCGTCCTTTTACAACGTCGTGA | |||
|     CTGGGAAAACCCTGGCGTTACCCAACTTAATCCGCCTTGCAGCACATCCCCCTTTCCCCC | |||
|     AGCTGGCGTAAAAACGTAAAAAGCCCCGCA | |||
|     >J01636.rev.begin (revcomp of J01636.for.end) | |||
|     TGCGGGGCTTTTTACGTTTTTACGCCAGCTGGGGGAAAGGGGGATGTGCTGCAAGGCGGA | |||
|     TTAAGTTGGGTAACGCCAGGGTTTTCCCAGTCACGACGTTGTAAAAGGACGGCCAGTGAT | |||
|     GATTCGATTTCGACTGCAAGTAGTTCATCA | |||
|     >J01636.rev.end (revcomp of J01636.for.begin) | |||
|     TGATGAACTACTGTACCATCTGATTAGGCGCATGGTCATAGCTGTTTCCTGTGTGAAATT | |||
|     GCTATCCGCTCACAATTCCACACAACATACGAGCCGGAAGCATAAAGTGTAAAGCCTGGG | |||
|     GTGTCAAATGAGAGACCTAACTCACATTCA | |||
|   # splice=linker+vector   | |||
|        3      120  |     1175     1292  |      118      118  |    95.76  |      150     7477  |    78.67     1.58  | J01636.for.begin   J01636 | |||
|       32      131  |     1302     1399  |      100       98  |    96.00  |      150     7477  |    66.67     1.31  | J01636.for.end     J01636 | |||
| * 13.1 Align vector & splice to Ecoli | |||
|        1     7474  |   366812   359335  |     7474     7478  |    99.91  |     7477  4639675  |    99.96     0.16  | J01636             NC_000913.2    [CONTAINED] | |||
|       20      119  |       65      162  |      100       98  |    96.00  |      150      395  |    66.67    24.81  | J01636.rev.begin   NC_000913.2 | |||
|       31      148  |      172      289  |      118      118  |    95.76  |      150      395  |    78.67    29.87  | J01636.rev.end     NC_000913.2 | |||
|     1069     1463  |      395        1  |      395      395  |   100.00  |     7477      395  |     5.28   100.00  | J01636             NC_000913.2.365350-365744 | |||
| * 14. Run lucy & trim reads | |||
|   $ /nfshomes/dpuiu/szdevel/SourceForge/lucy-1.19p/lucy \ | |||
|       -v vector.seq splice.seq \ | |||
|       -o bos_taurus.lucy.seq bos_taurus.lucy.qual \ | |||
|       -debug bos_taurus.lucy.info \ | |||
|       bos_taurus.seq bos_taurus.qual | |||
|   # Trim clr | |||
|   $ clrFasta bos_taurus.lucy.seq > bos_taurus.lucy.cseq | |||
| * 15. Align lucy output to linker, vector, splice & UniVec.dust  | |||
|   $ nucmer -l 12 -c 24 ~/db/linker.seq  bos_taurus.lucy.cseq -p linker-bos_taurus.lucy | |||
|   $ nucmer -l 16 -c 30 ~/db/vector.seq  bos_taurus.lucy.cseq -p vector-bos_taurus.lucy | |||
|   $ nucmer -l 16 -c 30 ~/db/splice.seq  bos_taurus.lucy.cseq -p splice-bos_taurus.lucy | |||
|   $ nucmer -l 16 -c 30 ~/db/UniVec.dust bos_taurus.lucy.cseq -p UniVec.dust-bos_taurus.lucy | |||
| * 16. Align input to linker, vector, splice & UniVec.dust  | |||
|   $ nucmer -l 12 -c 24 ~/db/linker.seq bos_taurus.seq -p linker-bos_taurus | |||
|   $ nucmer -l 16 -c 30 ~/db/vector.seq bos_taurus.seq -p vector-bos_taurus | |||
|   $ nucmer -l 16 -c 30 ~/db/splice.seq bos_taurus.seq -p splice-bos_taurus | |||
|   $ nucmer -l 16 -c 30 ~/db/UniVec.dust bos_taurus.seq -p UniVec.dust-bos_taurus | |||
| Count how many reads got trimmed | |||
|   infoseq *seq | getSummary.pl -c 1 -t original.LEN | |||
|   cat bos_taurus.lucy.info | awk '{print $4-$3}' | getSummary.pl -t lucy.CLR >! bos_taurus.lucy.summary   | |||
|   cat bos_taurus.lucy.info | getSummary.pl -c 14 -t lucy.CLV5 -nh >> bos_taurus.lucy.summary | |||
|   cat bos_taurus.lucy.info | getSummary.pl -c 15 -t lucy.CLV3 -nh >> bos_taurus.lucy.summary | |||
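The summary columns used in the tables below (#elem, #0s, min, max, mean, median, n50, sum) can be reproduced with a few lines of Python. This is a sketch, not getSummary.pl itself: the n50 definition used here (the largest value v such that the values >= v add up to at least half of the sum) is an assumption about what that script reports.

```python
from statistics import median

def summarize(values):
    """Summary statistics for a list of non-negative integers (e.g. clear-range lengths)."""
    vals = sorted(values)
    if not vals:
        return None
    total = sum(vals)
    running, n50 = 0, vals[-1]
    for v in reversed(vals):  # walk from the largest value down
        running += v
        if 2 * running >= total:
            n50 = v
            break
    return {
        "elem": len(vals),
        "zeros": vals.count(0),
        "min": vals[0],
        "max": vals[-1],
        "mean": total / len(vals),
        "median": median(vals),
        "n50": n50,
        "sum": total,
    }

if __name__ == "__main__":
    # Toy values; real input would be e.g. the $4-$3 column of bos_taurus.lucy.info
    print(summarize([2, 2, 2, 3, 3, 4, 8, 8]))
```

Under this definition a column that is mostly 0 (such as lucy.CLV3) can have a median of 0 and still a large n50, as seen in the tables below.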
| == Libraries ==  | |||
| === 011.BCM.WGS FORWARD === | |||
| * vector: J01636 | |||
| * UniVec: gnl|uv|J01636.1:1-7477 E.coli lactose operon with lacI, lacZ, lacY and lacA genes | |||
|   ll ~dpuiu/db/J01636* | |||
|   -rw-rw-r--  1 dpuiu dpuiu 7651 Jan  9 15:56 /nfshomes/dpuiu/db/J01636 | |||
|   -rw-rw-r--  1 dpuiu dpuiu  105 Jan 14 07:17 /nfshomes/dpuiu/db/J01636linker | |||
|   -rw-rw-r--  1 dpuiu dpuiu  840 Jan 13 13:43 /nfshomes/dpuiu/db/J01636splice | |||
|   cat  ~dpuiu/db/J01636* | infoseq | |||
|   J01636            7477   53.43 | |||
|   J01636.linker.fwd 27     44.44 | |||
|   J01636.linker.rev 27     37.04 | |||
|   J01636.for.begin  150    44.67 | |||
|   J01636.for.end    150    51.33 | |||
|   J01636.rev.begin  150    51.33 | |||
|   J01636.rev.end    150    44.67 | |||
| * 249,611 reads: | |||
| * 91% got vector trimmed at the 5'   | |||
| * 0.4% (1149) got  vector trimmed at the 3' | |||
|                   #elem   #0s     min     max     mean    median  n50     sum | |||
|   original.LEN    249611  0       437     2349    1082    991     1009    270035781      | |||
|   lucy.CLV5       249611  21215   0       741     25.03   25      27      6247415 | |||
|   lucy.CLV3       249611  248462  0       1047    3.49    0       859     870344 | |||
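The two percentages quoted above follow from the #0s column, assuming that a read counts as vector trimmed at an end when its lucy.CLV5 or lucy.CLV3 value is nonzero. A short Python check of that arithmetic:

```python
total_reads = 249611   # #elem
clv5_zeros = 21215     # reads with lucy.CLV5 == 0 (nothing trimmed at 5')
clv3_zeros = 248462    # reads with lucy.CLV3 == 0 (nothing trimmed at 3')

trimmed_5p = total_reads - clv5_zeros
trimmed_3p = total_reads - clv3_zeros

pct_5p = 100 * trimmed_5p / total_reads
pct_3p = 100 * trimmed_3p / total_reads

print(f"5' trimmed: {trimmed_5p} reads ({pct_5p:.2f}%)")
print(f"3' trimmed: {trimmed_3p} reads ({pct_3p:.2f}%)")
```

This gives 228,396 reads (91.50%) at the 5' end and 1,149 reads (0.46%) at the 3' end, matching the rounded-down figures in the bullets.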
| * Original reads hit counts: | |||
|  10975 linker.fwd | |||
|  133   linker.rev | |||
|  166   splice | |||
|  152   vector | |||
|  228   UniVec.dust | |||
| * Lucy trimmed read counts | |||
|  2 linker.fwd | |||
|  0 linker.rev | |||
|  1 splice | |||
|  1 vector | |||
|  6 UniVec.dust (only 3 are >40bp) | |||
| === 011.BCM.WGS REVERSE === | |||
|                   #elem   #0s     min     max     mean    median  n50     sum | |||
|   original.LEN    250389  0       502     2148    1085    993     1012    271691094 | |||
|   lucy.CLR        250389  7345    0       1281    795     876     892     198982171 | |||
|   lucy.CLV5       250389  20271   0       668     26.52   27      29      6641362 | |||
|   lucy.CLV3       250389  249269  0       997     3.35    0       861     839029 | |||
| * Original reads hit counts: | |||
|   linker.fwd      113 | |||
|   linker.rev      3812 | |||
|   splice          143 | |||
|   UniVec.dust     237 | |||
|   vector          4318 | |||
| * Lucy trimmed reads hit counts: | |||
|   linker.fwd      1 | |||
|   linker.rev      0 | |||
|   splice          1 | |||
|   UniVec.dust     10 | |||
|   vector          1 | |||
| === 030.BCM.SHOTGUN === | |||
| * same linker/vector/splice as BCM.WGS | |||
| * 2.5% (4K out of 160K) reads contain linker & vector at 3' | |||
|                   #elem   #0s     min     max     mean    median  n50     sum | |||
|   original.LEN    8411    0       325     1685    1181    1240    1314    9933150 | |||
|   lucy.CLR        8411    8       0       1054    841     863     874     7070994 | |||
|   lucy.CLV5       8411    568     0       232     27.01   28      29      227206 | |||
|   lucy.CLV3       8411    2325    0       1040    597     794     851     5023445 | |||
| * Original reads hit counts: | |||
|   linker.fwd      4314 | |||
|   linker.rev      4125 | |||
|   splice          7816 | |||
|   UniVec.dust     4212 | |||
|   vector          6750 | |||
|   vector          27235 | |||
| * Lucy trimmed reads hit counts: | |||
|   linker.fwd      3 | |||
|   linker.rev      1 | |||
|   splice          1 | |||
|   UniVec.dust     13 | |||
|   vector          0 | |||
| === 001.NISC.SHOTGUN === | |||
| * Vector: pOTW13 | |||
| * UniVec: 3 partial seqs | |||
|   gnl|uv|NGB00080.1:1-198 pOTW13 with linkers | |||
|   gnl|uv|NGB00080.1:718-888 pOTW13 with linkers | |||
|   gnl|uv|NGB00080.1:1490-1654-49 pOTW13 with linkers | |||
|   ll /nfshomes/dpuiu/db/NGB00080* | |||
|   -rw-rw-r--  1 dpuiu dpuiu 1083 Jan 14 20:43 /nfshomes/dpuiu/db/NGB00080 | |||
|   -rw-r--r--  1 dpuiu dpuiu   94 Jan 14 21:01 /nfshomes/dpuiu/db/NGB00080linker | |||
|   -rw-r--r--  1 dpuiu dpuiu 2183 Jan 14 20:44 /nfshomes/dpuiu/db/NGB00080splice | |||
|   cat  /nfshomes/dpuiu/db/NGB00080* | infoseq | |||
|   NGB00080       1054   50.00 | |||
|   NGB00080.linker.fwd 24     45.83 | |||
|   NGB00080.linker.rev 26     53.85 | |||
|   NGB00080.for.beg 518    46.14 | |||
|   NGB00080.for.end 518    50.48 | |||
|   NGB00080.rev.begin 518    50.48 | |||
|   NGB00080.rev.beg 518    46.14 | |||
| * 944 read sample | |||
|                   #elem   #0s     min     max     mean    median  n50     sum | |||
|   original.LEN    944     0       652     1017    735     721     722     693668 | |||
|   lucy.CLR        944     39      0       886     415     422     522     391333 | |||
|   lucy.CLV5       944     121     0       275     34.05   33      35      32143 | |||
|   lucy.CLV3       944     18      0       885     410     409     511     387007 | |||
| * Original reads hit counts: | |||
|   linker.fwd      479 | |||
|   linker.rev      492 | |||
|   splice          910 | |||
|   UniVec.dust     0 | |||
|   vector          939 | |||
| * Lucy trimmed reads hit counts: | |||
|   linker.fwd      1 | |||
|   linker.rev      0 | |||
|   splice          0 | |||
|   UniVec.dust     9 | |||
|   vector          1 | |||
| === 060.BCCAGSC.CLONEEND === | |||
| * Linkers: | |||
|   linker.fwd CCCTGCTTTGTCTGGAAGGGGTTCCCGACCT | |||
|   linker.rev CAGGAGGGGAGAAAGGGCTCAGAGG | |||
| * No common vector !!!  | |||
|   wc -l *clb | |||
|     60746 bos_taurus.060.f.clb  #18 reads original align to UniVec (nucmer default params) | |||
|     60836 bos_taurus.060.r.clb | |||
|   Fwd: | |||
|      329      428  |      440      535  |      100       96  |    91.00  |      503     1585  |    19.88     6.06  | 723951410  gnl|uv|U30497.1:3230-4814 Cloning vector pAS2-1 | |||
|      330      370  |       89       49  |       41       41  |   100.00  |      503      143  |     8.15    28.67  | 723951410  gnl|uv|U67875.1:6541-6683 pESP-I yeast expression vector | |||
|      330      370  |       94       54  |       41       41  |   100.00  |      503      143  |     8.15    28.67  | 723951410  gnl|uv|U67875.1:6541-6683 pESP-I yeast expression vector | |||
|    Rev: | |||
|        1       96  |       71      165  |       96       95  |    93.81  |      203      165  |    47.29    57.58  | 724018013  gnl|uv|AF133437.1:16659-16823 Cloning vector pCYPAC6 | |||
|       50      143  |        1       94  |       94       94  |    92.71  |      203       94  |    46.31   100.00  | 724018013  gnl|uv|U80929.2:2858-2951     Cloning vector pBACe3.6 | |||
| === 017.UIUC.CLONEEND === | |||
| * No overrepresented kmers | |||
|   wc -l *clb | |||
|    17978 bos_taurus.017.f.clb | |||
|    17911 bos_taurus.017.r.clb | |||
|   ==> 24.fwd/kmers.tab <== | |||
|   CCCTGCTTTGTCTGGAAGGGGTTC        GAACCCCTTCCAGACAAAGCAGGG        9 | |||
|   CTGCTTTGTCTGGAAGGGGTTCCC        GGGAACCCCTTCCAGACAAAGCAG        9 | |||
|   ==> 24.rev/kmers.tab <== | |||
|   GAATGTTGAGCTTTAGCCAACTTT        AAAGTTGGCTAAAGCTCAACATTC        4 | |||
|   TCTGAATGTTGAGCTTTAGCCAAC        GTTGGCTAAAGCTCAACATTCAGA        4 | |||
|   ==> 8.fwd/kmers.tab <== | |||
|   TTTTTTTT        AAAAAAAA        55 | |||
|   AAGGGGTT        AACCCCTT        35 | |||
|   ==> 8.rev/kmers.tab <== | |||
|   GTCTGGAA        TTCCAGAC        41 | |||
|   TCTGGAAG        CTTCCAGA        39 | |||
| * No UniVec hits | |||
| === 010.TIGR.CLONEEND === | |||
| * No overrepresented kmers | |||
|   wc -l *clb | |||
|   5479 bos_taurus.032.f.clb | |||
|   5174 bos_taurus.032.r.clb | |||
|   ==> 24.fwd/kmers.tab <== | |||
|   CTTGTGTTGGCCCAGGCAAGTCCA        TGGACTTGCCTGGGCCAACACAAG        30 | |||
|   TTGTGTTGGCCCAGGCAAGTCCAA        TTGGACTTGCCTGGGCCAACACAA        30 | |||
|   ==> 24.rev/kmers.tab <== | |||
|   CTGCCTCTTGTGTTGGCCCAGGCA        TGCCTGGGCCAACACAAGAGGCAG        16 | |||
|   GCTGCCTCTTGTGTTGGCCCAGGC        GCCTGGGCCAACACAAGAGGCAGC        15 | |||
|   ==> 8.fwd/kmers.tab <== | |||
|   GAGTGGGT        ACCCACTC        176 | |||
|   GGAGTGGG        CCCACTCC        171 | |||
|   ==> 8.rev/kmers.tab <== | |||
|   TGGAGTGG        CCACTCCA        182 | |||
|   GGAGTGGG        CCCACTCC        181 | |||
| * No UniVec hits | |||
| === ... === | |||
| === 070.BCM.CLONEEND === | |||
| * No frequent kmers | |||
|   wc -l *clb | |||
|     6027 bos_taurus.070.f.clb | |||
|     6236 bos_taurus.070.r.clb | |||
|   ==> 24.fwd/kmers.tab <== | |||
|   GGACTCTCAGAGTCTTCTCCAACA        TGTTGGAGAAGACTCTGAGAGTCC        18 | |||
|   ACTGGTTGGATCTCCTTGCAGTCC        GGACTGCAAGGAGATCCAACCAGT        18 | |||
|   ==> 24.rev/kmers.tab <== | |||
|   ATAAAATCTGAGCCACCAGGGAAG        CTTCCCTGGTGGCTCAGATTTTAT        1 | |||
|   CTATTGGTTCATATGGTCAACGTC        GACGTTGACCATATGAACCAATAG        1 | |||
|   ==> 8.fwd/kmers.tab <== | |||
|   TTTTTTTT        AAAAAAAA        86 | |||
|   CTTCTCCA        TGGAGAAG        75 | |||
|   ==> 8.rev/kmers.tab <== | |||
|   TATAGTGT        ACACTATA        9 | |||
|   ATATAGGG        CCCTATAT        8 | |||
| * No alignments to BCM WGS vector | |||
| = Running Lucy =  | |||
| * Default parameters with vector trimming | |||
| * BCM vector/splice | |||
|   /nfshomes/dpuiu/db/vector.BCM.seq | |||
|   /nfshomes/dpuiu/db/splice.BCM.seq | |||
| * NISC vector/splice | |||
|   /nfshomes/dpuiu/db/vector.NISC.seq | |||
|   /nfshomes/dpuiu/db/splice.NISC.seq | |||
| == BCM.WGS (all reads) == | |||
| * orig.CLR < lucy.CLR ( 765 < 792 )  | |||
| * orig.CLV > lucy.CLV ( 1015 > 973 ) | |||
| * 739,529    out of 24,863,599 reads (3%)   deleted by Lucy (CLR=-1,-1) | |||
| * 21,728,592 out of 24,863,599 reads (87%)  vector trimmed at the 5' end  | |||
| * 92,646     out of 24,863,599 reads (0.37%) vector trimmed at the 3' end  | |||
|                             '''elem       <0         0          >0         min        max        mean       median     n50        sum'''  | |||
|   orig.LEN                  24863599   0          0          24863599   5          3097       1002       997        1015       24915462033 | |||
|   orig.CLR                  24863599   463669     7          24399923   -1143      1833       '''765'''        836        864        19036744256 | |||
|   orig.CLR5                 24863599   0          359245     24504354   0          2103       42         22         58         1047922451 | |||
|   orig.CLR3                 24863599   463404     0          24400195   -1         2169       807        872        895        20084666707 | |||
|   lucy.CLR                  24863599   0          739529     24124070   0          1219       '''792'''        878        904        19695000417 | |||
|   lucy.CLR5                 24863599   '''739529'''     36108      24087962   -1         1753       43         29         42         1086413880 | |||
|   lucy.CLR3                 24863599   '''739529'''     0          24124070   -1         1894       835        915        939        20781414297 | |||
|   orig.CLR5-lucy.CLR5       24863599   16299521   215345     8348733    -1186      2104       -1         -10        -1186      -38491429 | |||
|   orig.CLR3-lucy.CLR3       24863599   14858542   1494794    8510263    -1273      2170       -28        -20        -1273      -696747590 | |||
|   orig.CLV                  24863599   1053       1920       24860626   -2         5345       '''1015'''       1002       1017       25260581538 | |||
|   orig.CLV5                 8841849    0          0          8841849    1          1219       33         46         49         295011460 | |||
|   orig.CLV3                 24861698   1053       0          24860645   -1         5346       1027       1005       1019       25555592998 | |||
|   lucy.CLV                  24863599   10694      707        24852198   -469       3096       '''973'''        968        987        24195085877 | |||
|   lucy.CLV5                 24863599   0          3135007    21728592   0          1359       25         27         29         623457486 | |||
|   lucy.CLV3                 24863599   0          0          24863599   4          3096       998        995        1014       24818543363 | |||
|   lucy.CLVABS5              24863599   0          3135007    '''21728592'''   0          1359       25         27         29         623457486 | |||
|   lucy.CLVABS3              24863599   0          24770953   '''92646'''      0          1343       2          0          880        72055071 | |||
|   orig.CLV5-lucy.CLV5       24863599   17216820   1512453    6134326    -1312      1219       -13        -25        -1312      -328446026 | |||
|   orig.CLV3-lucy.CLV3       24863599   1519132    18579609   4764858    -1832      4672       29         0          479        737049635 | |||
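The bullet percentages for BCM.WGS can be re-derived from the table: the deleted count is the `0` column of lucy.CLR, and the 5'/3' vector-trimmed counts are the `>0` columns of lucy.CLVABS5 and lucy.CLVABS3. A minimal Python sketch, with the counts copied from the table above (`pct` is a hypothetical helper name):

```python
def pct(part, total):
    """Percentage of `part` in `total`."""
    return 100.0 * part / total


TOTAL_READS = 24_863_599  # "elem" column, BCM.WGS (all reads)

# Counts copied from the lucy.CLR / lucy.CLVABS5 / lucy.CLVABS3 rows.
lucy_counts = {
    "deleted by Lucy (CLR=-1,-1)": 739_529,
    "vector trimmed at the 5' end": 21_728_592,
    "vector trimmed at the 3' end": 92_646,
}

if __name__ == "__main__":
    for label, n in lucy_counts.items():
        print(f"{n:>12,} of {TOTAL_READS:,} reads "
              f"({pct(n, TOTAL_READS):.2f}%) {label}")
```

The same helper applies to the 0-quality, BCM.SHOTGUN and NISC.SHOTGUN tables by swapping in their counts.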
| == BCM.WGS (0 quality reads) == | |||
| * orig.CLR > lucy.CLR (mean) | |||
| * orig.CLV > lucy.CLV (mean) | |||
| * 7,153    out of 551,114 reads (1.3%)   deleted by Lucy (CLR=-1,-1) | |||
| * 508,166  out of 551,114 reads (92%)  vector trimmed at the 5' end  | |||
| * 1,946    out of 551,114 reads (0.35%) vector trimmed at the 3' end  | |||
|                             '''elem       <0         0          >0         min        max        mean       median     n50        sum''' | |||
|   orig.LEN                  551114     0          0          551114     5          1464       872        946        959        480705828 | |||
|   orig.CLR                  551114     7754       0          543360     -770       1175       708        786        807        390325117 | |||
|   orig.CLR5                 551114     0          6773       544341     0          1519       44         20         111        24582849 | |||
|   orig.CLR3                 551114     7744       0          543370     -1         1638       752        818        833        414907966 | |||
|   lucy.CLR                  551114     0          7153       543961     0          699        636        671        671        350759771 | |||
|   lucy.CLR5                 551114     7153       35872      508089     -1         201        26         27         28         14442310 | |||
|   lucy.CLR3                 551114     7153       0          543961     -1         699        662        699        699        365202081 | |||
|   orig.CLR5-lucy.CLR5       551114     364282     8801       178031     -198       1500       18         -8         215        10140539 | |||
|   orig.CLR3-lucy.CLR3       551114     85058      2962       463094     -700       1472       90         123        178        49705885 | |||
|   orig.CLV                  551114     971        0          550143     -2         2037       974        978        981        537127121 | |||
|   orig.CLV5                 5100       0          0          5100       1          845        35         29         31         180490 | |||
|   orig.CLV3                 551114     971        0          550143     -1         2037       974        978        981        537307611 | |||
|   lucy.CLV                  551114     58         6          551050     -84        1456       841        917        930        463903233 | |||
|   lucy.CLV5                 551114     0          42948      508166     0          202        27         28         29         14964546 | |||
|   lucy.CLV3                 551114     0          0          551114     4          1463       868        945        958        478867779 | |||
|   lucy.CLVABS5              551114     0          42948      508166     0          202        27         28         29         14964546 | |||
|   lucy.CLVABS3              551114     0          549168     1946       0          700        2          0          686        1286935 | |||
|   orig.CLV5-lucy.CLV5       551114     506108     42215      2791       -202       845        -26        -28        -202       -14784056 | |||
|   orig.CLV3-lucy.CLV3       551114     134959     23422      392733     -967       1614       106        7          459        58439832 | |||
| == BCM.SHOTGUN == | |||
| * orig.CLR < lucy.CLR (mean) | |||
| * orig.CLV > lucy.CLV (mean) | |||
| * 98,070     out of 10,748,529 reads (0.9%)  deleted by Lucy (CLR=-1,-1) | |||
| * 9,737,008  out of 10,748,529 reads (90%)   vector trimmed at the 5' end  | |||
| * 294,942    out of 10,748,529 reads (2.7%)  vector trimmed at the 3' end  | |||
|                             '''elem       <0         0          >0         min        max        mean       median     n50        sum''' | |||
|   orig.LEN                  10748529   0          0          10748529   5          2043       975        950        964        10486690472 | |||
|   orig.CLR                  10748529   17308      2          10731219   -1293      1467       809        833        847        8701344571 | |||
|   orig.CLR5                 10748529   0          68         10748461   0          1315       26         16         38         288662580 | |||
|   orig.CLR3                 10748529   16780      0          10731749   -1         1647       836        851        863        8990007151 | |||
|   lucy.CLR                  10748529   0          98070      10650459   0          1337       833        854        868        8955866769 | |||
|   lucy.CLR5                 10748529   98070      1973       10648486   -1         1307       35         28         32         376276188 | |||
|   lucy.CLR3                 10748529   98070      0          10650459   -1         1553       868        882        896        9332142957 | |||
|   orig.CLR5-lucy.CLR5       10748529   9498290    65171      1185068    -1099      1293       -8         -11        -1099      -87613608 | |||
|   orig.CLR3-lucy.CLR3       10748529   6879532    671097     3197900    -1149      1437       -31        -26        -1149      -342135806 | |||
|   orig.CLV                  10748529   16779      412        10731338   -2         3919       974        948        964        10472347908 | |||
|   orig.CLV5                 8594910    0          0          8594910    1          1239       3          1          49         28350257 | |||
|   orig.CLV3                 10748349   16779      0          10731570   -1         3919       976        950        965        10500698165 | |||
|   lucy.CLV                  10748529   7026       614        10740889   -268       2042       930        924        940        9997862132 | |||
|   lucy.CLV5                 10748529   0          1011521    9737008    0          855        24         24         27         257993796 | |||
|   lucy.CLV3                 10748529   0          0          10748529   4          2042       954        945        962        10255855928 | |||
|   lucy.CLVABS5              10748529   0          1011521    9737008    0          855        24         24         27         257993796 | |||
|   lucy.CLVABS3              10748529   0          10453587   294942     0          1214       20         0          847        220086015 | |||
|   orig.CLV5-lucy.CLV5       10748529   9538738    138680     1071111    -854       1239       -21        -23        -854       -229643539 | |||
|   orig.CLV3-lucy.CLV3       10748529   357934     9324166    1066429    -1328      2846       22         0          704        244842237 | |||
| == NISC.SHOTGUN == | |||
| * orig.CLR < lucy.CLR (mean) | |||
| * orig.CLV > lucy.CLV (mean) | |||
| * 8,248     out of 737,900 reads (1.1%)  deleted by Lucy (CLR=-1,-1) | |||
| * 633,409   out of 737,900 reads (85%)   vector trimmed at the 5' end  | |||
| * 7,201     out of 737,900 reads (0.97%) vector trimmed at the 3' end  | |||
|                             '''elem       <0         0          >0         min        max        mean       median     n50        sum''' | |||
|   orig.LEN                  737900     0          0          737900     104        2104       784        729        734        579172842 | |||
|   orig.CLR                  737900     5988       2          731910     -636       1033       651        668        676        480400909 | |||
|   orig.CLR5                 737900     0          0          737900     1          1407       47         40         51         34857531 | |||
|   orig.CLR3                 737900     0          5879       732021     0          1470       698        710        715        515258440 | |||
|   lucy.CLR                  737900     0          8248       729652     0          1035       658        670        676        485757685 | |||
|   lucy.CLR5                 737900     8248       56         729596     -1         1091       45         35         46         33811606 | |||
|   lucy.CLR3                 737900     8248       0          729652     -1         1391       704        710        714        519569291 | |||
|   orig.CLR5-lucy.CLR5       737900     253727     89345      394828     -566       1408       1          1          485        1045925 | |||
|   orig.CLR3-lucy.CLR3       737900     177007     31         560862     -867       1471       -5         1          -867       -4310851 | |||
|   orig.CLV                  737900     3224       2655       732021     -636       2103       771        725        730        569178445 | |||
|   orig.CLV5                 734026     0          0          734026     1          987        5          1          35         4375315 | |||
|   orig.CLV3                 732021     0          0          732021     35         2104       783        729        734        573553760 | |||
|   lucy.CLV                  737900     1335       55         736510     -200       2104       747        696        702        551392388 | |||
|   lucy.CLV5                 737900     104491     0          633409     -1         1199       30         31         34         22784742 | |||
|   lucy.CLV3                 737900     0          0          737900     15         2103       778        728        733        574177130 | |||
|   lucy.CLVABS5              737900     0          104491     633409     0          1200       31         32         35         23522642 | |||
|   lucy.CLVABS3              737900     0          730699     7201       0          1076       5          0          686        4257812 | |||
|   orig.CLV5-lucy.CLV5       737900     561851     66390      109659     -1198      983        -24        -29        -1198      -18409427 | |||
|   orig.CLV3-lucy.CLV3       737900     8386       1          729513     -950       1077       0          1          -950       -623370 | |||
| = Fragment files = | |||
| * Locations:  | |||
|   /fs/szasmg3/bos_taurus/data/frg | |||
|   /fs/szasmg3/bos_taurus/data/frg.new | |||
| * All DST messages are unique | |||
| * bos_taurus.clv : contains the vector clipping points | |||
| ** BCM.WGS, BCM.SHOTGUN & NISC.SHOTGUN: lucy.clv | |||
| ** others: the TA clv | |||
| ** 374,454 reads don't have valid clv's | |||
| ** 36,446,031 reads have valid clv's with avg=955 | |||
| == Message counts (original) == | |||
|                                                  DST     FRG             LKG | |||
|   bos_taurus.BCM.WGS.frg                         79      24124070        11311841 | |||
|   #bos_taurus.BCM.SHOTGUN.frg                    7339    10650459        1799069     # some libs & mates are missing due to a tarchive2ca crash (used by UMD2.1) | |||
|   #bos_taurus.BCM.SHOTGUN.new.frg                18208   10650459        4715172     # split the libraries by VOL & SEQ_LIB_ID (used by UMD2.2) | |||
|   #bos_taurus.BCM.SHOTGUN.new.frg                13826   10650459        5046435     # double check the FRG count !!! (used by UMD2.3) | |||
|   bos_taurus.BCM.SHOTGUN.new.frg                 7       10650459        5046435     # UMD2.4 | |||
|   bos_taurus.NISC.SHOTGUN.frg                    246     729652          344932 | |||
|   bos_taurus.BCCAGSC.CLONEEND.frg                1       125241          59505 | |||
|   bos_taurus.UIUC.CLONEEND.frg                   2       114750          46319 | |||
|   bos_taurus.TIGR.CLONEEND.frg                   1       65171           27067 | |||
|   bos_taurus.GSC.CLONEEND.frg                    1       53521           25889 | |||
|   bos_taurus.CENARGEN.WGS.frg                    0       26246           0 | |||
|   #bos_taurus.BARC.CLONEEND.frg                  11150   25454           11150      # (used by UMD2.3) | |||
|   bos_taurus.BARC.CLONEEND.frg                   1       25454           11150      # (used by UMD2.4) | |||
|   bos_taurus.BCM.CLONEEND.frg                    1       16875           7103 | |||
|   bos_taurus.CENARGEN.CLONEEND.frg               1       16787           6269 | |||
|   bos_taurus.UOKNOR.SHOTGUN.frg                  1       14651           4910 | |||
|   bos_taurus.TIGR_JCVIJTC.CLONEEND.frg           2       10651           4803 | |||
|   bos_taurus.UOKNOR.FINISHING.frg                0       151             0 | |||
|   bos_taurus.WUGSC.COLONEEND.frg                 1       49              21 | |||
|   #total                                         25312   35973728        16896244   # (UMD2.3) <font color=red> | |||
|   total                                          344     35973728        16896244   # (UMD2.4) </font> | |||
| == Message counts (quality) == | |||
|                                                  DST     FRG             LKG | |||
|   bos_taurus.BCM.WGS.qual.count                  79      23580109        11035582 | |||
|   #bos_taurus.BCM.SHOTGUN.qual.count             7339    10644092        1799069 | |||
|   bos_taurus.BCM.SHOTGUN.qual.new.count          18208   10644092        4712446 | |||
|   bos_taurus.NISC.SHOTGUN.count                  246     729652          344932 | |||
|   bos_taurus.BCCAGSC.CLONEEND.qual.count         1       116484          53585 | |||
|   bos_taurus.UIUC.CLONEEND.count                 2       114750          46319 | |||
|   bos_taurus.TIGR.CLONEEND.count                 1       65171           27067 | |||
|   bos_taurus.CENARGEN.WGS.count                  0       26246           0 | |||
|   bos_taurus.BARC.CLONEEND.count                 11150   25454           11150 | |||
|   bos_taurus.BCM.CLONEEND.count                  1       16875           7103 | |||
|   bos_taurus.CENARGEN.CLONEEND.count             1       16787           6269 | |||
|   bos_taurus.TIGR_JCVIJTC.CLONEEND.count         2       10651           4803 | |||
|   bos_taurus.UOKNOR.SHOTGUN.qual.count           1       2456            813 | |||
|   bos_taurus.WUGSC.COLONEEND.count               1       49              21 | |||
| == Message counts (0quality) == | |||
|                                                  DST     FRG             LKG | |||
|   bos_taurus.BCM.WGS.0qual.count                 79      543961          234397 | |||
|   bos_taurus.GSC.CLONEEND.0qual.count            1       53521           25889 | |||
|   bos_taurus.UOKNOR.SHOTGUN.0qual.count          1       12195           4097 | |||
|   bos_taurus.BCCAGSC.CLONEEND.0qual.count        1       8757            2114 | |||
|   bos_taurus.BCM.SHOTGUN.0qual.count             7339    6367            0 | |||
|   bos_taurus.UOKNOR.FINISHING.0qual.count        0       151             0 | |||
| = Assemblies = | |||
| [[Media:Bt.qc.combine|Bt.qc.combine]] UMD2.0 ... UMD2.5 combine stats | |||
| = UMD2.1(2009_0122_CA; Quality reads) = | |||
| == Issues == | |||
| # Uses only quality reads | |||
| # BCM.SHOTGUN library: ~2.9M mates (4,715,172 - 1,799,069) were missed due to a tarchive2ca crash; some libraries were merged (assigned the same lib_id) | |||
| # All reads except for BCM.WGS were set as nonrandom | |||
| # Updated the runCA script to run the overlapper concurrently; new "ovlConcurrency" parameter added to the .spec file !!! | |||
| # consensus (after cgw) crashed in MultiAlignContig(); workaround: use "consensus -D forceunitigabut" !!! | |||
| # cgw crashed after updating gkpStore with new lib/mate info; workaround: edit Input_CGW.c and remove the assert at line 117 | |||
| == Info == | |||
|   host: walnut | |||
|   assembly version: wgs-5.2 stable | |||
|   dir:  /scratch1/bos_taurus/Assembly/2009_0122_CA  | |||
|   command: /fs/szdevel/dpuiu/SourceForge/wgs/Linux-amd64/bin/runCA-test -d . -p bt -s bt01.specFile *.frg | |||
|   spec file: | |||
|   cgwDistanceSampleSize   =       1000       # ??? too big; more than 50% of the BCM.SHOTGUN reads are in libraries with less than 1000 inserts | |||
|   cnsConcurrency          =       15 | |||
|   cnsMinFrags             =       200000 | |||
|   doOverlapTrimming       =       1 | |||
|   frgCorrBatchSize        =       100000 | |||
|   frgCorrConcurrency      =       15 | |||
|   merylMemory             =       24000 | |||
|   merylThreads            =       15 | |||
|   obtMerThreshold         =       200 | |||
|   obtOverlapper           =       ovl | |||
|   ovlConcurrency          =       8 | |||
|   ovlCorrBatchSize        =       100000 | |||
|   ovlCorrConcurrency      =       15 | |||
|   ovlHashBlockSize        =       1200000 | |||
|   ovlMemory               =       8GB --hashload 0.8 --hashstrings 400000 | |||
|   ovlMerThreshold         =       500 | |||
|   ovlOverlapper           =       ovl | |||
|   ovlRefBlockSize         =       7200000 | |||
|   ovlThreads              =       2 | |||
|   unitigger               =       utg | |||
|   utgErrorRate            =       0.015 | |||
|   vectorIntersect         =       bos_taurus.clv | |||
|   doExtendClearRanges     =       2 | |||
| == Steps == | |||
| 1. Run until initialStoreBuilding has completed | |||
|   runCA stopAfter=initialStoreBuilding ... | |||
| 2. Update gkpStore with nonrandom frg flag | |||
|   cat bos_taurus.nonrandom.clv | perl -ane 'print "frg uid $F[0] isnonrandom 1\n";'  > bos_taurus.nonrandom.edit | |||
|   gatekeeper -edit bos_taurus.nonrandom.edit bt.gkpStore | |||
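Step 2 builds the gatekeeper edit file with a perl one-liner that takes the first whitespace-separated column of each line as the read UID. A rough Python equivalent is sketched below; `nonrandom_edits` is a hypothetical name, and unlike the one-liner it skips blank lines instead of emitting an edit with an empty UID.

```python
import sys


def nonrandom_edits(clv_lines):
    """Yield one gatekeeper edit line per read UID found in column 1."""
    for line in clv_lines:
        fields = line.split()
        if fields:  # skip blank lines
            yield f"frg uid {fields[0]} isnonrandom 1"


if __name__ == "__main__":
    # Reads the .clv file on stdin and writes the .edit file to stdout.
    for edit in nonrandom_edits(sys.stdin):
        print(edit)
```

The output would then be passed to `gatekeeper -edit` exactly as in step 2.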
| == Input == | |||
|   gatekeeper -dumpinfo -lastfragiid bt.gkpStore | |||
|   ... | |||
|   Last frag in store is iid = 35348776 | |||
| == OBT == | |||
|                             elem       <0         0          >0         min        max        mean       median     n50        sum             | |||
|   CLV5                      35085508   0          3387027    31698481   0          970        25         27         29         891007232 | |||
|   CLV3                      35164784   0          0          35164784   15         2974       984        980        1000       34612019144 | |||
|   CLR_ORIG5                 35348776   0          43354      35305422   0          1753       42         29         38         1502168205      | |||
|   CLR_ORIG3                 35348776   0          0          35348776   70         1894       864        905        927        30547294868     | |||
|   CLR_OBT5                  35348776   0          26513      35322263   0          1690       49         30         73         1756346429      | |||
|   CLR_OBT3                  35348776   0          23477      35325299   0          1813       843        895        914        29824543869 | |||
| <font color=red> | |||
| * 421,379 reads deleted by OBT: why so many??? | |||
| </font> | |||
| * Chimera:  | |||
|   20297 reads too short => deleted | |||
| * more  0-overlaptrim/bt.mergeLog.stats | |||
|   ... | |||
|   211037: short or inconsistent | |||
|   253536: deleted fragment due to zero clear  | |||
| * Example: | |||
|   gatekeeper -dumpfragments 516316990  bt.gkpStore   | |||
|   fragmentIdent           = 516316990,14 | |||
|   fragmentMate            = 0,0 | |||
|   fragmentLibrary         = 27473,1563 | |||
|   fragmentIsDeleted       = 1 | |||
|   fragmentIsNonRandom     = 1 | |||
|   fragmentStatus          = G | |||
|   fragmentOrientation     = I | |||
|   fragmentHasVectorClear  = 0 | |||
|   fragmentHasQualityClear = 0 | |||
|   fragmentPlate           = 0 | |||
|   fragmentPlateLocation   = 0 | |||
|   fragmentSeqLen          = 862 | |||
|   fragmentHPSLen          = 0 | |||
|   fragmentSrcLen          = 17 | |||
|   fragmentClearORIG       = 38,553 | |||
|   fragmentClearQLT        = 1,0 | |||
|   fragmentClearVEC        = 1,0 | |||
|   fragmentClearOBTINI     = 35,578 | |||
|   fragmentClearOBT        = 35,578 | |||
|   fragmentClearUTG        = 35,578 | |||
|   fragmentClearECR1       = 35,578 | |||
|   fragmentClearECR2       = 35,578 | |||
|   fragmentSeqOffset       = 5376 | |||
|   fragmentQltOffset       = 11038 | |||
|   fragmentHpsOffset       = 53 | |||
|   fragmentSrcOffset       = 287 | |||
|   cat   0-overlaptrim/bt.mergeLog | grep 516316990  | |||
|   516316990,14    412     412     0       0 (deleted, too short) | |||
|   zcat *r000*gz | convertOverlap -a -obt  | |||
|   ... | |||
|      14 12128740  f  377  478   292  393   2.97 | |||
|      14 15226267  f  397  446    31   80   2.04 | |||
|      14 19071241  f    4  513   199  708   1.18 | |||
|      14 20073917  f    7  478    36  508   4.88 | |||
|      14 20042424  f    4  419   299  714   1.93 | |||
|      14 20212935  f    7  478   234  706   4.88 | |||
|      14 20073828  r    7  478   507   35   4.67 | |||
|      14 20212846  r    7  478   557   85   4.67 | |||
|      14 27089060  r  491  534   836  793   2.33 | |||
|      14 29061748  f  489  540    86  137   1.96 | |||
|      14 32105697  f  455  543   381  469   2.27 | |||
|      14 32187461  f  430  534   105  209   1.92 | |||
|      14 32027289  f    4  419   493  907   4.59 | |||
|   ... | |||
|   #read aligns to contigs | |||
|   show-coords 516316990-ctg.filter-r.strict.delta | |||
|       35      531  |       97      594  |      497      498  |    99.20  |      862     2759  |    57.66    18.05  | 516316990  ctg7180001872751 | |||
|       45      678  |      931     1564  |      634      634  |    97.00  |      862     1567  |    73.55    40.46  | 516316990  ctg7180001837311 | |||
| * OBT  deleted reads: | |||
|   BCM           WGS         253816 | |||
|   BCM           SHOTGUN     151770 | |||
|   BCCAGSC       CLONEEND    7510 | |||
|   NISC          SHOTGUN     4757 | |||
|   TIGR          CLONEEND    1577 | |||
|   CENARGEN      WGS         599 | |||
|   CENARGEN      CLONEEND    431 | |||
|   TIGR_JCVIJTC  CLONEEND    377 | |||
|   UIUC          CLONEEND    182 | |||
|   BCM           CLONEEND    150 | |||
|   BARC          CLONEEND    125 | |||
|   UOKNOR        SHOTGUN     85 | |||
|   total         .           421379 | |||
| * OBT deleted reads (length and average quality): | |||
|            elem       >0         min        max        mean       med        n50        sum | |||
|   len      421379     421379     98         2974       862        927        968        363280405 | |||
|   avgQual  421379     421379     1          57         28         24         36         11852865 | |||
| == Overlapper == | |||
| ** 98.33% of the reads (34,761,786 out of 35,348,776 reads) had overlaps | |||
| ** 1.66% of the reads had no overlaps | |||
| ** 6.68% of the BCCAGSC.CLONEEND reads had no overlaps | |||
| ** 4.95% of the TIGR_JCVIJTC.CLONEEND reads had no overlaps | |||
| ** 3.48% of the TIGR.CLONEEND reads had no overlaps | |||
| ** the median number of overlaps is 20 | |||
|   Overlaps | |||
|           reads      min        max        mean       median     n50        sum | |||
|   qual    35348776   0          5592       106        20         769        3777789082 | |||
| ** the median number of overlaps for the BCM.WGS reads is 16 | |||
| ** the median number of overlaps for the BCM.SHOTGUN reads is 16 !!! | |||
| ** the median number of overlaps for the NISC.SHOTGUN reads is 40 !!! | |||
| ** the median number of overlaps for the BCM.CLONEEND reads is 16 !!! | |||
| [[Media:Bt.ovlStore.big.png]] , [[Media:Bt.ovlStore.small.png]] | |||
| == Unitigger == | |||
|  more 4-unitigger/bt.cga.0 | |||
|  UNITIG OVERLAP GRAPH INFORMATION | |||
|         5208738 : Total number of unitigs | |||
|         2527051 : Total number of singleton, contained unitigs | |||
|         1814842 : Total number of singleton, non-contained unitigs | |||
|          180910 : Total number of non-singleton, spanned unitigs | |||
|          685935 : Total number of non-singleton, non-spanned unitigs | |||
|        34927397 : Total number of fragments | |||
|        34927397 : Total number of fragments in all unitigs | |||
|        21521581 : Total number of essential fragments in all unitigs | |||
|        13405816 : Total number of contained fragments in all unitigs | |||
|    0.0076239952 : Randomly sampled fragment arrival rate per bp | |||
|      2510896132 : The sum of overhangs in all the unitigs | |||
|      6400342737 : Total number of bases in all unitigs | |||
|               0 : Estimated number of base pairs in the genome. | |||
|               0 : Total number of contained fragments not connected | |||
|                   by containment edges to essential fragments. | |||
|   Total rho    = 2510896132 | |||
|   Total nfrags = 19143061 | |||
|   Estimated genome length = 0 | |||
|   Estimated global_fragment_arrival_rate=0.007624 | |||
|   Computed global_fragment_arrival_rate =0.007624 | |||
|   Total number of randomly sampled fragments in genome = 23326293 | |||
|   Computed genome length  = 3059589120.000000 | |||
|   Used global_fragment_arrival_rate=0.007624 | |||
|   Used global_fragment_arrival_distance=131.164826 | |||
|   Histogram of the number of base pairs in a chunk | |||
|   100292 - 159434:    22  | |||
|   90010 -  99906:     25  | |||
|   80043 -  89676:     73  | |||
|   70013 -  79966:    162  | |||
|   60010 -  69988:    389  | |||
|   50008 -  59983:    977  | |||
|   40000 -  49998:   2434  | |||
|   30000 -  39997:   6458  | |||
|   20000 -  29999:  18957  | |||
|   10000 -  19999:  57442 | |||
|   Unitigs >=10kb | |||
|               NewAsm          UMd2Asm | |||
|   Number       86,939          57,204 | |||
|   Mean         19,464          15,140 | |||
|   Sum         1,692.1Mb       866.0Mb | |||
|   max         159,434bp      78,570bp | |||
|   Contigs >=10Kb: | |||
|             NewAsm          UMd2Asm | |||
|   n         42,343           45,958       | |||
|   mean      59,856           55,473 | |||
|   sum        2,534.5Mb        2,549.4Mb | |||
|   Contigs >=100Kb:  | |||
|             NewAsm          UMd2Asm | |||
|   n          7,051            6,683          | |||
|   mean     163,170          162,357     | |||
|   sum        1,150.5Mb        1,085.0Mb | |||
|   max      627,705          742,802 | |||
|   Scaffolds >=10Mb: | |||
|             NewAsm          UMd2Asm | |||
|   n             30                3 | |||
|   mean       14.10Mb          11.36Mb | |||
|   sum       422.95Mb         340.70Mb | |||
|   max        26.54Mb          13.36Mb | |||
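The genome-size figures in the bt.cga.0 log above are numerically consistent with each other. The Python sketch below re-derives them from the logged values; the formulas are inferred from the numbers on this page, not taken from the unitigger source, so treat them as an assumption.

```python
# Values copied from 4-unitigger/bt.cga.0 above.
total_rho = 2_510_896_132      # "Total rho"
total_nfrags = 19_143_061      # "Total nfrags"
random_frags = 23_326_293      # randomly sampled fragments in genome
logged_rate = 0.0076239952     # fragment arrival rate per bp

# Assumed relations (inferred from the logged numbers):
rate = total_nfrags / total_rho             # fragments per bp
arrival_distance = 1.0 / logged_rate        # bp between fragment starts
genome_length = random_frags / logged_rate  # bp

if __name__ == "__main__":
    print(f"arrival rate     ~ {rate:.7f} per bp")
    print(f"arrival distance ~ {arrival_distance:.3f} bp")
    print(f"genome length    ~ {genome_length / 1e9:.3f} Gbp")
```

Under these assumptions the computed genome length comes out at about 3.06 Gbp, in line with the "Computed genome length" entry in the log.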
| == CGW & ECR == | |||
| * Checkpoints: | |||
|   cat 7-0-CGW/bt.timing | grep ^Checkpoint | |||
|   Checkpoint 3 written during MergeScaffoldsAggressive at iteration 49 | |||
|   Checkpoint 4 written during MergeScaffoldsAggressive at iteration 85 | |||
|   Checkpoint 5 written after 1st Scaffold Merge  | |||
|   Checkpoint 6 written after 2nd Aggressive Scaffold Merge | |||
|   Checkpoint 7 written after Final Rocks | |||
|   cat 7-2-CGW/bt.timing | grep ^Checkpoint | |||
|   Checkpoint 19 written during MergeScaffoldsAggressive at iteration 12 | |||
|   Checkpoint 20 written during MergeScaffoldsAggressive at iteration 31 | |||
|   Checkpoint 21 written after 1st Scaffold Merge | |||
|   Checkpoint 22 written after 2nd Aggressive Scaffold Merge | |||
|   Checkpoint 23 written after Final Rocks | |||
|   cat 7-4-CGW/bt.timing | grep ^Checkpoint | |||
|   Checkpoint 34 written during MergeScaffoldsAggressive at iteration 12 | |||
|   Checkpoint 35 written during MergeScaffoldsAggressive at iteration 49 | |||
|   Checkpoint 36 written after 1st Scaffold Merge | |||
|   Checkpoint 37 written during Stones CleanupScaffolds after scaffold 32436 | |||
|   Checkpoint 38 written during Stones CleanupScaffolds after scaffold 34939 | |||
|   Checkpoint 39 written after Stone Throwing and CleanupScaffolds | |||
|   Checkpoint 40 written after 2nd Aggressive Scaffold Merge | |||
|   Checkpoint 41 written after Final Rocks | |||
|   Checkpoint 42 written after Partial Stones | |||
|   Checkpoint 43 written after Final Contained Stones | |||
|   Checkpoint 44 written after resolveSurrogates | |||
| * Get early CTG/SCF stats | |||
|   cat 7-CGW/bt.cgw_scaffolds | countMessages.pl | |||
|   ICL     451555  # ??? | |||
|   ICP     116455  # CTG | |||
|   ISF     66141   # SCF | |||
|   ISL     711     # SLK | |||
| * Clear read extension: | |||
|                             elem       <0         0          >0         min        max        mean       median     n50        sum | |||
|   ClearORIG                 35348776   4          0          35348772   -1147      1572       821        870        893        29045126663 | |||
|   ClearQLT                  35348776   35348776   0          0          -1         -1         -1         -1         -1         -35348776 | |||
|   ClearVEC                  35348776   299034     20323      35029419   -1         2043       952        953        975        33658445088 | |||
|   ClearOBTINI               35348776   0          31254      35317522   0          1364       831        879        902        29394688367 | |||
|   ClearOBT                  35348776   0          31254      35317522   0          1318       794        854        877        28068197440 | |||
|   ClearECR1                 35348776   0          31254      35317522   0          1329       794        854        877        28072014464 | |||
|   ClearECR2                 35348776   0          31254      35317522   0          1329       794        854        877        28072365712 | |||
|   sum(ClearECR1)-sum(ClearOBT) = 3,817,024 | |||
|   sum(ClearECR2)-sum(ClearECR1)= 351,248 | |||
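The two differences quoted above can be re-derived from the "sum" column of the clear-range table. This is only an arithmetic check; it assumes the ClearOBT row is the pre-ECR baseline, since the table has no separate unitig row.

```python
# Sums of clear-range lengths (bp), copied from the "sum" column of the table.
clear_sum = {
    "ClearOBT": 28068197440,   # assumed baseline before extendClearRanges
    "ClearECR1": 28072014464,
    "ClearECR2": 28072365712,
}
n_reads = 35348776  # "elem" column, same for every row

# Bases gained by each ECR round.
gain_ecr1 = clear_sum["ClearECR1"] - clear_sum["ClearOBT"]
gain_ecr2 = clear_sum["ClearECR2"] - clear_sum["ClearECR1"]

print("ECR1 gain:", gain_ecr1)
print("ECR2 gain:", gain_ecr2)
print("ECR1 gain per read (bp):", gain_ecr1 / n_reads)
```

The per-read gain is well under one base on average, which puts the size of the ECR effect in context.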
| * Scaffold length stats: | |||
|   cat 7-0-CGW/stat/final0.Scaffolds.nodelength.cgm | grep -v ^Sca | getSummary.pl -t 0 # 0,2,4 | |||
|   ... | |||
|   step     scaff      min        max        mean       med        n50        sum             | |||
|   0        7048       2249       19719008   385020     21967      3114907    2713622175      | |||
|   2        4960       2249       21907006   540915     21181      4490171    2682939682 | |||
|   4        4006       2391       26541374   668427     29193      4590744    2677722052 | |||
| * Last cgw | |||
|   cat 7-4-CGW/stat/final0.*Scaffolds.nodelength.cgm | grep -v ^Scaff | getSummary.pl -t scf | |||
|   cat 7-4-CGW/stat/final0.PlacedContig.n | grep -v ^Scaff | getSummary.pl -t scf | |||
|              elem       min        max        mean       med        n50        sum             | |||
|   scf        66141      432        26541374   42648      1347       4349378    2820819506      | |||
|   ctg        120461     65         627705     22421      2018       84989      2700959854 | |||
| == QC stats == | |||
| * [[Media:Bos_taurus.qc|Bos_taurus.qc]] this assembly stats | |||
| * [[Media:Bos_taurus.qc.combine|Bos_taurus.qc.combine]] UMD2 vs this assembly stats | |||
|   TotalScaffolds=66,141 | |||
|   MaxBasesInScaffolds=26,048,998 | |||
|   MeanBasesInScaffolds=40,861 | |||
|   TotalContigsInScaffolds=120,461 | |||
|   MaxContigLength=627,911 | |||
|   MeanContigLength=22,436 | |||
|   TotalDegenContigs=269,031 | |||
|   MaxDegenContig=33,824 | |||
|   SingletonReads=3,721,123 | |||
| * Posmap info | |||
|   cat bt.posmap.mates | awk '{print $3}' |count.pl -p 100 | |||
|   good            10338164 | |||
|   bothChaff       1160137 | |||
|   oneChaff        695982 | |||
|   oneSurrogate    233151 | |||
|   bothDegen       218198 | |||
|   diffScaffold    150423 | |||
|   badShort        138464 | |||
|   oneDegen        118232 | |||
|   badLong         23196 | |||
|   badSame         22451 | |||
|   badOuttie       8751 | |||
|   bothSurrogate   589 | |||
|   total           13107738 | |||
|   cat bt.posmap.frags | awk '{print $4,$5}' |count.pl  -p 100 | |||
|   placed good             20676328 | |||
|   placed notMated         8007072 | |||
|   chaff bothChaff         2320274 | |||
|   chaff notMated          704849 | |||
|   placed oneChaff         695982 | |||
|   chaff oneChaff          695982 | |||
|   placed oneSurrogate     466302 | |||
|   placed bothDegen        436396 | |||
|   placed diffScaffold     300846 | |||
|   placed badShort         276928 | |||
|   placed oneDegen         236464 | |||
|   placed badLong          46392 | |||
|   placed badSame          44902 | |||
|   placed badOuttie        17502 | |||
|   placed bothSurrogate    1178 | |||
|   total                   34927397 | |||
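The mate and fragment tables above should be consistent with each other: every mate pair contributes two fragments, and the rest are unmated. A minimal sketch of that cross-check, using only the counts listed above (variable names are illustrative):

```python
# Mate-pair counts per category, copied from the bt.posmap.mates summary.
mates = {
    "good": 10338164, "bothChaff": 1160137, "oneChaff": 695982,
    "oneSurrogate": 233151, "bothDegen": 218198, "diffScaffold": 150423,
    "badShort": 138464, "oneDegen": 118232, "badLong": 23196,
    "badSame": 22451, "badOuttie": 8751, "bothSurrogate": 589,
}
# Unmated fragment counts, copied from the bt.posmap.frags summary.
unmated_frags = 8007072 + 704849   # "placed notMated" + "chaff notMated"

total_mates = sum(mates.values())
total_frags = 2 * total_mates + unmated_frags   # two reads per mate pair
good_pct = 100.0 * mates["good"] / total_mates

print("total mates:", total_mates)
print("total frags:", total_frags)
print("good mates (%):", good_pct)
```

Both totals reproduce the "total" rows of the two tables.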
| == Log files == | |||
| * [[Media:Bt.runCA.log|Bt.runCA.log]] | |||
| * [[Media:Bt.runCA.hourly.runtimes|Bt.runCA.hourly.runtimes]] approximate running times (in hours) | |||
| == Analysis == | |||
| === Insert libraries === | |||
| 1. BCM.WGS : ok | |||
| * FRG.mea: 1750-7000 | |||
| * ASM.mea: 1594-6727  | |||
| * Most libs have > 1000 reads & get reestimated | |||
| * All libs have ASM.std< ASM.mea/3 | |||
| 2. BCM.SHOTGUN  | |||
| * only ~ 50% of the inserts are in libs with >1000 inserts and get reestimated by the assembly | |||
| * if the threshold is dropped from 1000 to 100, ~95% of the inserts would get reestimated | |||
|              elem       <0         0          >0         min        max        mean       median     n50        sum | |||
|   0          7339       0          0          7339       1          11237      245        135        1137       1799069 | |||
|   100        4361       0          0          4361       100        11237      395        157        1252       1725604 | |||
|   1000       440        0          0          440        1008       11237      2075       1791       2323       913086 | |||
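The ~50% and ~95% figures follow from the "sum" column of the table above. The sketch assumes that column is the total number of inserts in libraries at or above each threshold (row label = threshold).

```python
# Inserts in BCM.SHOTGUN libraries with at least <threshold> inserts,
# copied from the "sum" column (assumed meaning: total inserts per row).
inserts_at_threshold = {0: 1799069, 100: 1725604, 1000: 913086}

all_inserts = inserts_at_threshold[0]
frac_1000 = inserts_at_threshold[1000] / all_inserts
frac_100 = inserts_at_threshold[100] / all_inserts

print("reestimated at threshold 1000:", frac_1000)
print("reestimated at threshold 100: ", frac_100)
```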
| 3. NISC.SHOTGUN: ok | |||
| * Most libs have > 1000 reads & get reestimated | |||
| * All libs have ASM.std< ASM.mea/3 | |||
| 4. BCCAGSC.CLONEEND: ok | |||
|   LIB.id  FRG.mea FRG.std FRG.count  CENTER.TYPE        ASM.mea ASM.std | |||
|   125606  150000  30000   59505      BCCAGSC.CLONEEND   161998  20133 | |||
| 5. UIUC.CLONEEND: ok | |||
|   LIB.id   FRG.mea FRG.std FRG.count CENTER.TYPE        ASM.mea ASM.std | |||
|   114892   150000  30000   31063     UIUC.CLONEEND      175594  41208 | |||
|   115020   150000  30000   15256     UIUC.CLONEEND      162488  26358 | |||
| 6. TIGR.CLONEEND: originally wrong; gets reestimated | |||
|   LIB.id   FRG.mea FRG.std FRG.count CENTER.TYPE        ASM.mea ASM.std | |||
|   65177    2000    600     27067     TIGR.CLONEEND      161761  34938 | |||
| 7. GSC.CLONEEND: not used (all 53556 are 0 qual) | |||
| 8. CENARGEN.WGS: "not used" (all 26246 are unmated) | |||
| 9. BARC.CLONEEND: each library contains 1 template id => inserts did not get reestimated  (25454 reads/11151 inserts) | |||
| 10. BCM cloneend: ok  | |||
|   LIB.id   FRG.mea FRG.std FRG.count CENTER.TYPE        ASM.mea ASM.std | |||
|   19070    167000  25000   7103      BCM.CLONEEND       171244  18555 | |||
| 11. CENARGEN.CLONEEND: large stdev | |||
|   LIB.id   FRG.mea FRG.std FRG.count CENTER.TYPE        ASM.mea ASM.std | |||
|   17249    202000  20200   6269      CENARGEN.CLONEEND  158938  55165 | |||
| 12. UOKNOR.SHOTGUN: ok ? | |||
|   LIB.id   FRG.mea FRG.std FRG.count CENTER.TYPE        ASM.mea ASM.std | |||
|   15158    3000    1000    4910      UOKNOR.SHOTGUN     3000    1000 | |||
| 13. TIGR_JCVI.CLONEEND: originally wrong; gets reestimated | |||
|   LIB.id   FRG.mea FRG.std FRG.count CENTER.TYPE        ASM.mea ASM.std | |||
|   10691    2500    750     2763      TIGR_JCVI.CLONEEND 160363  29580 | |||
|   10738    2500    750     2040      TIGR_JCVI.CLONEEND 161915  29343 | |||
| 14. UOKNOR.FINISHING: only 151 reads | |||
| 15. WUGSC.CLONEEND: only 49 reads | |||
| === Contigs Vs UMD2 contaminants & Ecoli === | |||
|   4865 contigs in list.exclude_contigs.fa | |||
|  34404 exclude-ctg.qry_hits | |||
|    3763 exclude-ctg.ref_hits | |||
|    1204 exclude-ctg.CBE.qry_hits   CONTAIN|IDENTITY|BEGIN|END | |||
|    748 exclude-ctg.CBE.ref_hits   CONTAIN|IDENTITY|BEGIN|END | |||
|    559 Ecoli.365350-365744-ctg.qry_hits : max ctg aligned is 179K bp; 10 are > 10K bp | |||
| === Contigs Vs UMD2 chromosomes === | |||
| * Split 120,461 contigs into 100 files; degenerates not split | |||
| * Align them to the 31 chromosomes 1..30,U (ref) => 101*31 jobs | |||
|    #Alignment stats | |||
|    cat chr*ctg*delta | grep "^>" | awk '{print $2}' | count.pl -f ../9-terminator/bt.ctg.infoseq | getSummary.pl -i 1 -z 1 | |||
|    ctg        0          1          >1         min        max        mean       med        n50        sum | |||
|    120461     652        37540      82269      0          176        11         3          33         1422808 | |||
|   #Unaligned ctg lengths | |||
|   ctg        min        max        mean       med        n50        sum | |||
|   652        65         5849       1146       1134       1194       747295 | |||
| * 50% of the contigs aligned uniquely | |||
|   cat chr*-ctg*.delta | ~/bin/mergeDelta.pl   >  chr-ctg.delta | |||
|                                                               # degens?  | |||
|   delta-filter -q  chr-ctg.delta              >>  chr-ctg.filter-q.delta | |||
|   cat chr1-*.delta | ~/bin/delta2cvg.pl -M 0 | getSummary.pl -i 4 | |||
|   elem       0          >0         min        max        mean       med        n50        sum | |||
|   6681       1          6680       0          12892      366        142        1095       2450106 | |||
| * There are disagreements: | |||
|   /fs/sz-user-supported/Linux-x86_64/bin/show-coords -l -r -H chr1-ctg.filter-q.delta  | p 'print $F[-1],"\n";' | count.pl | head | |||
|   ctg7180001761585        24 | |||
|   ... | |||
|   ctg7180001634116        7 | |||
|   ... | |||
|   show-coords -d chr1-ctg.filter-q.delta | grep ctg7180001761585 | p 'print "  $_";' | |||
|   142115744 142188863  |   383463   310345  |    73120    73119  |    99.98  | 157714772   383463  |     0.05    19.07  |  1 -1  chr1   ctg7180001761585 | |||
|   142188878 142286012  |   310361   213223  |    97135    97139  |    99.94  | 157714772   383463  |     0.06    25.33  |  1 -1  chr1   ctg7180001761585 | |||
|   142287100 142287675  |   212133   211556  |      576      578  |    98.27  | 157714772   383463  |     0.00     0.15  |  1 -1  chr1   ctg7180001761585 | |||
|   142288052 142288602  |   211182   210633  |      551      550  |    99.09  | 157714772   383463  |     0.00     0.14  |  1 -1  chr1   ctg7180001761585 | |||
|   142288652 142295709  |   210586   203531  |     7058     7056  |    99.87  | 157714772   383463  |     0.00     1.84  |  1 -1  chr1   ctg7180001761585 | |||
|   142295709 142342174  |   203512   157047  |    46466    46466  |   100.00  | 157714772   383463  |     0.03    12.12  |  1 -1  chr1   ctg7180001761585 | |||
|   142346440 142367791  |   156958   135606  |    21352    21353  |    99.99  | 157714772   383463  |     0.01     5.57  |  1 -1  chr1   ctg7180001761585 | |||
|   142367822 142370681  |   135597   132737  |     2860     2861  |    99.93  | 157714772   383463  |     0.00     0.75  |  1 -1  chr1   ctg7180001761585 | |||
|   142370660 142382289  |   132746   121117  |    11630    11630  |    99.88  | 157714772   383463  |     0.01     3.03  |  1 -1  chr1   ctg7180001761585 | |||
|   142382282 142411927  |   120984    91339  |    29646    29646  |    99.96  | 157714772   383463  |     0.02     7.73  |  1 -1  chr1   ctg7180001761585 | |||
|   142411941 142419553  |    91339    83728  |     7613     7612  |    99.66  | 157714772   383463  |     0.00     1.99  |  1 -1  chr1   ctg7180001761585 | |||
|   142419553 142434546  |    83721    68728  |    14994    14994  |    99.79  | 157714772   383463  |     0.01     3.91  |  1 -1  chr1   ctg7180001761585 | |||
|   142434506 142437288  |    68778    65996  |     2783     2783  |    99.86  | 157714772   383463  |     0.00     0.73  |  1 -1  chr1   ctg7180001761585 | |||
|   142437389 142439015  |    66757    65131  |     1627     1627  |    99.94  | 157714772   383463  |     0.00     0.42  |  1 -1  chr1   ctg7180001761585 | |||
|   142439271 142440703  |    65629    64197  |     1433     1433  |   100.00  | 157714772   383463  |     0.00     0.37  |  1 -1  chr1   ctg7180001761585 | |||
|   142441869 142442975  |    63548    62442  |     1107     1107  |   100.00  | 157714772   383463  |     0.00     0.29  |  1 -1  chr1   ctg7180001761585 | |||
|   142446690 142449325  |    30312    32945  |     2636     2634  |    99.58  | 157714772   383463  |     0.00     0.69  |  1  1  chr1   ctg7180001761585 | |||
|   142451384 142452476  |    63510    64603  |     1093     1094  |    99.91  | 157714772   383463  |     0.00     0.29  |  1  1  chr1   ctg7180001761585 | |||
|   142452577 142454379  |    61000    62806  |     1803     1807  |    99.78  | 157714772   383463  |     0.00     0.47  |  1  1  chr1   ctg7180001761585 | |||
|   142454487 142456821  |    59122    61456  |     2335     2335  |   100.00  | 157714772   383463  |     0.00     0.61  |  1  1  chr1   ctg7180001761585 | |||
|   142458383 142459582  |    57978    59177  |     1200     1200  |   100.00  | 157714772   383463  |     0.00     0.31  |  1  1  chr1   ctg7180001761585 | |||
|   142459738 142472295  |    32272    44828  |    12558    12557  |    99.92  | 157714772   383463  |     0.01     3.27  |  1  1  chr1   ctg7180001761585 | |||
|   142472300 142485640  |    44828    58163  |    13341    13336  |    99.89  | 157714772   383463  |     0.01     3.48  |  1  1  chr1   ctg7180001761585 | |||
|   142501686 142530021  |    28336        1  |    28336    28336  |    99.99  | 157714772   383463  |     0.02     7.39  |  1 -1  chr1   ctg7180001761585 | |||
|   show-coords -d chr1-ctg.filter-q.delta | grep ctg7180001634116 | |||
|   116312   162914  |        1    46603  |    46603    46603  |    99.99  | 157714772   122722  |     0.03    37.97  |  1  1  chr1       ctg7180001634116 | |||
|   164916   201988  |    58062    95135  |    37073    37074  |    99.99  | 157714772   122722  |     0.02    30.21  |  1  1  chr1       ctg7180001634116 | |||
|   203244   213377  |    48198    58331  |    10134    10134  |   100.00  | 157714772   122722  |     0.01     8.26  |  1  1  chr1       ctg7180001634116 | |||
|   261393   264506  |    45949    49062  |     3114     3114  |   100.00  | 157714772   122722  |     0.00     2.54  |  1  1  chr1       ctg7180001634116 | |||
|   264607   268579  |    94345    98317  |     3973     3973  |   100.00  | 157714772   122722  |     0.00     3.24  |  1  1  chr1       ctg7180001634116 | |||
|   268586   274734  |    98323   104471  |     6149     6149  |   100.00  | 157714772   122722  |     0.00     5.01  |  1  1  chr1       ctg7180001634116 | |||
|   274835   293945  |   103611   122722  |    19111    19112  |    99.99  | 157714772   122722  |     0.01    15.57  |  1  1  chr1       ctg7180001634116 | |||
|   ~/bin/delta2breaks.pl -m 200 < chr1-ctg.filter-q.delta | awk '{print $8}' | count.pl | |||
|   AGREEMENT       9827 | |||
|   INVERSION       283 | |||
|   TRANSLOCATION+  230 | |||
|   TRANSLOCATION-  154 | |||
|   ~/bin/delta2breaks.pl -m 1000 < chr1-ctg.filter-q.delta | awk '{print $8}' | count.pl | |||
|   AGREEMENT       7564 | |||
|   INVERSION       216 | |||
|   TRANSLOCATION+  192 | |||
|   TRANSLOCATION-  127 | |||
|   ~/bin/delta2breaks.pl -m 10000 < chr1-ctg.filter-q.delta | awk '{print $8}' | count.pl | |||
|   AGREEMENT       3394 | |||
|   INVERSION       62 | |||
|   TRANSLOCATION+  50 | |||
|   TRANSLOCATION-  29 | |||
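To compare the three `-m` settings above, it can help to express the disagreements as a fraction of all reported segments. A small sketch using only the chr1 counts listed above; treating every non-AGREEMENT category as a disagreement is an assumption of this sketch.

```python
# delta2breaks.pl category counts for chr1, keyed by the -m value used.
breaks = {
    200:   {"AGREEMENT": 9827, "INVERSION": 283, "TRANSLOCATION+": 230, "TRANSLOCATION-": 154},
    1000:  {"AGREEMENT": 7564, "INVERSION": 216, "TRANSLOCATION+": 192, "TRANSLOCATION-": 127},
    10000: {"AGREEMENT": 3394, "INVERSION": 62,  "TRANSLOCATION+": 50,  "TRANSLOCATION-": 29},
}

def disagreement_fraction(counts):
    """Fraction of segments that are not labelled AGREEMENT."""
    total = sum(counts.values())
    return (total - counts["AGREEMENT"]) / total

fractions = {m: disagreement_fraction(c) for m, c in breaks.items()}
for m in sorted(fractions):
    print(m, fractions[m])
```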
| = Assembly UMD2.2 (Quality reads) = | |||
| * Try to add the missing BCM.SHOTGUN reads to the assembly | |||
| * Assign new BCM.SHOTGUN library IDs based on volume & SEQ_LIB_ID : the same library might have a different insert size in a different volume => might lose some correct mates from different volumes | |||
|   cat bos_taurus.summary | grep BCM | grep SHOTG | cut -f6,7,8,10 | sort | more | |||
|   FAAEP   180000  13000   252 | |||
|   FAAEP   2000    1000    84 | |||
|   ... | |||
|   FAAHP   180000  13000   77 | |||
|   FAAHP   2000    1000    230 | |||
|   ... | |||
| * => 20,538 libraries out of which 18,208  contain mated reads | |||
| * create DST messages & add them to gkpStore | |||
|   gatekeeper -a -o bt.gkpStore -T -F  bos_taurus.BCM.SHOTGUN.new.DST | |||
| * generate gatekeeper edit file that maps each TI to the new library id | |||
|   head bos_taurus.BCM.SHOTGUN.new.ti2libinfo.edit | |||
|     frg uid 499507131 libuid 601081  | |||
|     frg uid 499507132 libuid 601081 | |||
|     ... | |||
| * generate gatekeeper edit file that deletes all mate information | |||
|   head bos_taurus.BCM.SHOTGUN.new.mate.delete | |||
|     frg uid 500086180 mateuid 0 | |||
|     frg uid 500084310 mateuid 0 | |||
|     ... | |||
| * pair forward/reverse reads that have the same new library id and the same TEMPLATE_ID | |||
|   head bos_taurus.BCM.SHOTGUN.new.mate.edit | |||
|     frg uid 583866821 mateuid 583872364 | |||
|     frg uid 583866822 mateuid 583872408 | |||
|     ... | |||
| * run gatekeeper --edit for each edit/delete file | |||
|   gatekeeper --edit ...  bt.gkpStore  | |||
| * restart assembly at cgw (doExtendClearRanges=1) | |||
| * consensus after cgw failed on job 25 on CTG 5597062 : cannot create consensus from multialignment  ... | |||
|   Fix: delete failed message | |||
|   cp bt.cgw_contigs.25 bt.cgw_contigs.25.FAILED | |||
|   delete "{ICM acc:5597062 pla:P len:20889 ..." from bt.cgw_contigs.25 | |||
| * terminator failed; message: | |||
|   ICL: reference before definition error for contig ID 5597062 | |||
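The mate-pairing step described above (forward and reverse reads sharing the new library id and TEMPLATE_ID) can be sketched as follows. This is not the script that was actually used: the input tuple layout, the `F`/`R` direction codes, and the rule of skipping groups that do not hold exactly one forward and one reverse read are all assumptions, and the uids are toy values.

```python
from collections import defaultdict

def pair_reads(reads):
    """Pair reads that share (library id, template id).

    reads: iterable of (uid, lib_id, template_id, direction), direction 'F' or 'R'.
    Returns (forward_uid, reverse_uid) tuples for groups with exactly one F and one R.
    """
    groups = defaultdict(lambda: {"F": [], "R": []})
    for uid, lib_id, template_id, direction in reads:
        groups[(lib_id, template_id)][direction].append(uid)
    pairs = []
    for key in sorted(groups):
        fwd, rev = groups[key]["F"], groups[key]["R"]
        if len(fwd) == 1 and len(rev) == 1:   # ambiguous or unpaired groups are skipped
            pairs.append((fwd[0], rev[0]))
    return pairs

def edit_lines(pairs):
    """Format pairs in the 'frg uid ... mateuid ...' style shown above."""
    return [f"frg uid {fwd} mateuid {rev}" for fwd, rev in pairs]

# Toy input: only template T1 in library 601081 has both directions.
toy_reads = [
    (1001, 601081, "T1", "F"),
    (1002, 601081, "T1", "R"),
    (1003, 601081, "T2", "F"),
    (1004, 601082, "T2", "R"),   # same template name, different library: not paired
]
toy_pairs = pair_reads(toy_reads)
print("\n".join(edit_lines(toy_pairs)))
```

Keying on the library id as well as the template id is what makes reads from different volumes stay unpaired, which matches the caveat noted above about losing some correct mates.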
| = Assembly UMD2.3 (2009_0210_CA; all reads) = | |||
| <font color=red> | |||
| * 35,973,728 reads : 35,348,776 quality &  624,952 quality-less | |||
| * 16,896,244 mates | |||
| * 25,312 libraries | |||
| </font> | |||
| Issues (not solved): | |||
| * 10420 quality-less reads contain at least 1 "NN" in their clr (50..min(len,600)) | |||
| * 5973 quality-less reads contain at least 1 "NNN" in their clr (50..min(len,600)) | |||
| == Quality-less clrs == | |||
| * 624,952 quality-less reads | |||
| * Quality-less read stats: alignment CLR or 50..min(len,600) trimming | |||
|              elem       min        max        mean       median     n50        sum | |||
|   len        624952     5          1495       887        947        961        554429198 | |||
|   5          624952     6          1584       51         51         51         32150411 | |||
|   3          624952     5          1495       695        699        699        434960697 | |||
|   53         624952     -1579      1444       644        648        648        402810286 | |||
| <font color=red> | |||
| * Align 624,952 to the 120,461 Assembly1 contigs (no degenerates) : 1 day on 13 cpus | |||
| * 572,140(91.5%) reads aligned and 52,812(8.5%) did not align to the contigs | |||
| </font> | |||
| 1. Launch jobs in parallel: 12766 jobs on 13 processors | |||
|   nucmer -l 50 -c 200 -b 10 -g 5 -d 0.05 bt.ctg.001.fasta  bos_taurus.0qual.01.seq -p ctg.001-seq.01 | |||
|   ... | |||
|   nucmer -l 50 -c 200 -b 10 -g 5 -d 0.05 bt.ctg.982.fasta  bos_taurus.0qual.13.seq -p ctg.982-seq.13 | |||
| * CPU usage: 100% /job | |||
| * Max mem usage: 0.1% /job | |||
| 2. Get maximum extended clrs | |||
|   cat *delta | ~/bin/delta2qryClr.pl -best | sort > bos_taurus.0qual.best.clr | |||
|   Length stats | |||
|              elem       min        max        mean       median     n50        sum | |||
|   all        624952     5          1495       887        947        961        554429198 | |||
|   aligned    572140     221        1416       912        953        964        522281354 | |||
|   unaligned  52812      5          1495       608        580        754        32147844 | |||
|   Best/Max/Max+extended alignment coord stats: | |||
|              elem       min        max        mean       median     n50        sum | |||
|   53.best    572140     94         1208       766        841        877        438793102 | |||
|   53.max     572140     170        1208       794        863        888        454816817 | |||
|   53.extend  572140     170        1208       797        865        889        456014184 | |||
|   Unaligned read counts:                            | |||
|                            unaligned    total   quality   quality-less | |||
|   BCM.WGS                  42595 | |||
|   UOKNOR.SHOTGUN           5787         14651   2456      12195 | |||
|   GSC.CLONEEND             2294         53521   0         53521 | |||
|   BCCAGSC.CLONEEND         1869         125241  116484    8757 | |||
|   BCM.SHOTGUN              186 | |||
|   UOKNOR.FINISHING         81 | |||
| * Realigned the 52,812 unaligned quality-less reads to the contigs using less strict nucmer parameters:  -l 30 -c 50 -b 50 -g 50 -d 0.12 | |||
| * 9,269  reads aligned at an average 92% identity (min 81% identity)  : not too good | |||
| 3. Get reads without clrs: set their clr to maximum 50..600 | |||
|   cp bos_taurus.0qual.extended.clr bos_taurus.0qual.clr | |||
|   difference.pl bos_taurus.0qual.infoseq bos_taurus.0qual.extended.clr | perl -ane '$three=600; $three=$F[1] if ($F[1]<600); print "$F[0] 50 $three\n";' >> bos_taurus.0qual.clr | |||
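The fallback rule in the perl one-liner above (reads with no alignment-derived clr get 50..min(len,600)) is easy to restate as a small function. A minimal sketch; the function name and the extra check for very short reads are additions for illustration only.

```python
def default_clr(length, start=50, max_end=600):
    """Fallback clear range for a read without an alignment-based clr.

    Mirrors the rule above: begin at 50, end at min(length, 600).
    """
    return start, min(length, max_end)

def is_usable(clr):
    """True when the clear range is non-empty (end after start)."""
    begin, end = clr
    return end > begin

print(default_clr(947))            # a read longer than 600 bp
print(default_clr(300))            # a read shorter than 600 bp
print(is_usable(default_clr(5)))   # the shortest read in the stats table has len 5
```

Reads shorter than 50 bp end up with an inverted range under this rule, so they may need to be filtered or deleted separately.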
| == Quality clrs == | |||
| * Use Assembly1 OBT clrs | |||
| * Delete reads deleted in the OBT process | |||
| == Gatekeeper == | |||
| Load order: | |||
| # Add quality FRG :      "gatekeeper -T  -F ..." | |||
| # Add quality-less FRG   "gatekeeper     -F  -a ..."           # -T should be removed | |||
| # Delete quality FRG (deleted by UMD2.1 OBT) | |||
| # Add DST | |||
| # Add LKG | |||
| Edit | |||
| # Loads clrs | |||
| # Loads clvs | |||
| # Loads nonrandom info | |||
| == Meryl == | |||
| Use Assembly1 kmer counts | |||
| == Overlapper == | |||
| * Use 80/90 Assembly1 overlap results | |||
| * Rerun 10 overlap jobs | |||
| * 96.64% of the quality-less reads have overlaps (vs 98.33% of the quality reads) | |||
|                    reads      0ovl       1+ovl      min        max        mean       median     n50        sum | |||
|   0qual(all)       624831     20941      603890     0          4350       96         19         740        60494730         # 96.64% | |||
|   0qual(unaligned) 52691      15384      37307      0          3229       50         5          349        2655545          # 70.80% | |||
| == Unitigger == | |||
| * More unitigs, more bases in unitigs | |||
| * A few of the longest unitigs got broken, e.g. the longest is 138,294 (UMD2.3) vs 159,434 (UMD2.1) | |||
|   UNITIG OVERLAP GRAPH INFORMATION | |||
|         5333434 : Total number of unitigs | |||
|         2595174 : Total number of singleton, contained unitigs | |||
|         1865473 : Total number of singleton, non-contained unitigs | |||
|          183693 : Total number of non-singleton, spanned unitigs | |||
|          689094 : Total number of non-singleton, non-spanned unitigs | |||
|        35551316 : Total number of fragments | |||
|        35551316 : Total number of fragments in all unitigs | |||
|        21830994 : Total number of essential fragments in all unitigs | |||
|        13720322 : Total number of contained fragments in all unitigs | |||
|    0.0077856472 : Randomly sampled fragment arrival rate per bp | |||
|      2514833413 : The sum of overhangs in all the unitigs | |||
|      6483064813 : Total number of bases in all unitigs | |||
|               0 : Estimated number of base pairs in the genome. | |||
|               0 : Total number of contained fragments not connected | |||
|                   by containment edges to essential fragments. | |||
|   Total rho    = 2514833413 | |||
|   Total nfrags = 19579606 | |||
|   Estimated genome length = 0 | |||
|   Estimated global_fragment_arrival_rate=0.007786 | |||
|   Computed global_fragment_arrival_rate =0.007786 | |||
|   Total number of randomly sampled fragments in genome = 23870254 | |||
|   Computed genome length  = 3065930496.000000 | |||
|   Used global_fragment_arrival_rate=0.007786 | |||
|   Used global_fragment_arrival_distance=128.441474 | |||
|   Histogram of the number of base pairs in a chunk | |||
|   100406 - 138294:    21 | |||
|   90330 -  99887:     23 | |||
|   80042 -  89675:     79 | |||
|   70014 -  79943:    169 | |||
|   60002 -  69792:    374 | |||
|   50000 -  59982:   1008 | |||
|   40002 -  49995:   2440 | |||
|   30001 -  39994:   6509 | |||
|   20000 -  29999:  18989 | |||
|   10000 -  19999:  57404 | |||
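The arrival-rate figures in the unitigger log above are mutually consistent, which can be checked with a few divisions. The relations used here (rate = nfrags / rho, distance = 1 / rate, genome length = sampled fragments / rate) are inferred from the logged numbers, not taken from the unitigger source.

```python
# Values copied from the unitigger log.
total_rho = 2514833413        # "Total rho"
total_nfrags = 19579606       # "Total nfrags"
n_random_frags = 23870254     # randomly sampled fragments in genome

arrival_rate = total_nfrags / total_rho          # fragments per bp
arrival_distance = 1.0 / arrival_rate            # bp between fragment starts
genome_length = n_random_frags / arrival_rate    # estimated genome size in bp

print("arrival rate:    ", arrival_rate)
print("arrival distance:", arrival_distance)
print("genome length:   ", genome_length)
```

The results agree with the logged 0.007786, 128.441474 and about 3.066 Gb, up to the rounding in the log.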
| == Consensus after unitigger == | |||
| Problems: | |||
| * job 120 executed partially (see bt_120.cgi_tmp); Solution: split into 3 parts, run separately, merge results | |||
| * failed on 19 unitigs (587..7447 bp)  | |||
|   rm 5-consensus/*failed | |||
|   touch 5-consensus/consensus.success | |||
| == Cgw == | |||
| * Failure 1 : because job 120 was run partially => missing mates | |||
| * Failure 2 : because of /5-consensus/FAILED/bt_???.cgi.failed => missing mates => delete 356 mates | |||
|   Error: | |||
|     ProcessFrags()-- WARNING!  fragiid=35973388,index=33600942 mateiid=35973363,index=0 -- MATE DOESN'T EXIST! | |||
|     cgw: Input_CGW.c:117: ProcessFrags: Assertion `err == 0' failed. | |||
|   Fix: | |||
|     cat cgw.out | grep MATE | p '/mateiid=(\d+)/; print $1,"\n";' >! cgw.out.mateiid | |||
|     gatekeeper -dumpfragments -tabular -iid cgw.out.mateiid bt.gkpStore/ | cut -f1,3 | ~/bin/mate2lkg.pl -a D >! cgw.out.delete.LKG | |||
|     gatekeeper -a -o bt.gkpStore -T -F -L cgw.out.delete.LKG | |||
| * Failure 3: because of cgwOutputIntermediate=1   | |||
|   Try to restart from ckp : die with assertion failure | |||
|   cgw -y -R 8 -N 12 -j 1 -k 5 -r 5 -s 2 -S 0 -z -m 100 -g ./bt.gkpStore -o ./7-0-CGW.8_12/bt ./5-consensus/bt_001.cgi | |||
|   cgw -y -R 8       -j 1 -k 5 -r 5 -s 2 -S 0 -z -m 100 -g ./bt.gkpStore -o ./7-0-CGW.8_12/bt ./5-consensus/bt_001.cgi | |||
|   Fix: Restart cgw from the beginning | |||
| * cgw does update bt.SeqStore - OpenSequenceDB() | |||
| == ECR (eventually skipped) == | |||
| * Failed after running for 1 day | |||
|   /fs/szdevel/dpuiu/SourceForge/wgs-5.2/Linux-amd64/bin/extendClearRanges  -g ./bt.gkpStore  -n 15  -c bt  -b 146216 -e 167100  -i 1  > 7-1-ECR/extendClearRanges-scaffold.146216.err  | |||
|   sh: line 1: 17016 Aborted  | |||
| * Last ckp :  bt.ckp.15 | |||
| * Try to fix:  | |||
|   touch 7-1-ECR/cgw.success | |||
|   runCA "doExtendClearRanges = 1" | |||
| * Runs too slow !!! | |||
| * Can specify a scaffold range to process: -b ? -e ? => ckp files; could we merge them? | |||
| * Failed after running for 1 day | |||
| == Consensus after CGW == | |||
| * Failed on job 56 | |||
|   tail 8-consensus/bt.cns_contigs.56.err  | |||
|   ... | |||
|   Could (really) not find overlap between 153923 (U) and 2508303 (R) estimated ahang: 0 (ejecting frag 2508303 from contig) | |||
|   consensus: math_AS.h:51: ceil_log2: Assertion `x > 0' failed. | |||
|   cat 7-CGW/bt.cgw_contigs.56  | countMessages.pl | |||
|   ICM     440 | |||
|   IMP     281412 | |||
|   IUP     12715 | |||
|   cat 8-consensus/bt.cns_contigs.56_tmp | countMessages.pl  | |||
|   ICM     115 | |||
|   IMP     103322 | |||
|   IMV     8122 | |||
|   IUP     4849 | |||
| * Fix: split ICM messages 1..115,116,116+ and run consensus on each set | |||
| == QC == | |||
|              elem       min        max        mean       med        n50        sum             | |||
|   scf        56891      407        33129045   50871      1378       4716077    2894145150      | |||
|   ctg        122851     64         651167     21957      3647       71561      2697514858      | |||
|   deg        268237     65         30246      1019       985        997        273575106 | |||
| * Compared with UMD2.1 : better scaffolds, worse contigs & unitigs | |||
| == Analysis == | |||
| Issues: | |||
| * Identify bacterial & mito contigs: [http://www.bch.umontreal.ca/ogmp/projects/other/genbank/AY676873.txt mito seq] | |||
| * Align ctg & degen to UMD2 chromosomes | |||
| ** the chromosomes should have no 0cvg regions  | |||
| ** possible inversions, translocations (UMD2 used markers) | |||
| ** if align breaks/indels, which assembly is correct? | |||
| = Assembly UMD2.4 (2009_0217_CA; All reads) = | |||
| <font color=red> | |||
| * 35,973,728 reads : 35,348,776 quality &  624,952 quality-less | |||
| * 16,896,244 mates | |||
| * 344 libraries | |||
| </font> | |||
| == Fix quality-less read clrs (N's) (temporary solution) == | |||
| * 10420 quality-less reads contain at least 1 "NN" in their clr (50..min(len,600)) | |||
| * 5973 quality-less reads contain at least 1 "NNN" in their clr (50..min(len,600)) | |||
| Fix:  | |||
|   frg2seq.pl < bos_taurus.0qual.frg > bos_taurus.0qual.seq | |||
|   fasta2qual.pl bos_taurus.0qual.seq >! bos_taurus.0qual.qual | |||
|   lucy \ | |||
|      -o bos_taurus.0qual.lucy.seq  bos_taurus.0qual.lucy.qual \ | |||
|      -debug  bos_taurus.0qual.lucy.info \ | |||
|      bos_taurus.0qual.seq bos_taurus.0qual.qual | |||
|   cat bos_taurus.0qual.lucy.info | cut -f1,3,4 -d ' ' | sort >! bos_taurus.0qual.lucy.clr | |||
| * 624,952 quality-less reads | |||
| * Quality-less read stats:   50..min(len,600) & lucy trimming | |||
|              elem       0          >0         min        max        mean       med        n50        sum | |||
|   5          624952     2857       622095     0          501        52         52         52         33012433 | |||
|   3          624952     2857       622095     0          600        579        600        600        361980208 | |||
|   53         624952     2857       622095     0          548        526        548        548        328967775 | |||
| == Fix quality-less read clrs (low complexity) == | |||
| * Run dust filter on seq (before qual & lucy) | |||
|              elem       0          >0         min        max        mean       med        n50        sum             | |||
|   5          624952     3564       621388     0          501        75         52         52         47385578        | |||
|   3          624952     3564       621388     0          600        554        600        600        346470473       | |||
|   53         624952     3564       621388     0          548        478        548        548        299084895 | |||
| * Merge dust.lucy clrs with the alignment clrs | |||
|              elem       0          >0         min        max        mean       med        n50        sum | |||
|   5          624952     4488       620464     0          599        93         52         126        58378496 | |||
|   3          624952     4488       620464     0          600        547        600        600        342258160 | |||
|   53         624952     4488       620464     0          548        454        512        548        283879664 | |||
| * Test seq | |||
|   gatekeeper -dumpfastaseq -b 35348777 -e  35973728 bt.gkpStore  | grep NNN | |||
|   gatekeeper -dumpfastaseq  bt.gkpStore   | perl -ane 'if(/^>(\d+)/) { $id=$1} elsif(/NNN/) { print $id,"\n";} ' | uniq -c | awk '{print $2,$1}'  > bt.NNN.seqs  # 2411 seqs (all have the N's "in the middle") | |||
|   gatekeeper -dumpfastaseq -uid bt.NNN.seqs bt.gkpStore >  bt.NNN.cseqs | |||
| == Consolidate libraries == | |||
| <font color=red> Drop from 25,312 to 344 libs </font> | |||
| ===  BCM.SHOTGUN ===  | |||
| UMD2.4 reestimated 10,117 out of 13,826 libs (those with > 100 mates) | |||
| ==== Based on initial estimates ==== | |||
| * Reduce the total number from 13826 to 2 libs: 3000 & 6000 | |||
| * UMD2.3 mean estimates (Initial vs Final):  | |||
|   meanI      #libs      minF       maxF       meanF      medF       n50F       sumF           uid | |||
|   180000     436        1636       5199       2475       2410       2458       1079407        #3000 | |||
|   167000     86         1585       2948       2264       2258       2285       194775         #3000         | |||
|   6500       31         5212       6636       5837       5867       5924       180951         #6000         | |||
|   6000       11         4556       6272       5389       5421       5421       59286          #6000 | |||
|   3500       949        1670       4769       2668       2608       2645       2532027        #3000 | |||
|   3000       2511       1483       5250       2715       2662       2723       6818678        #3000 | |||
|   2000       6093       1157       6443       2526       2487       2554       15391160       #3000 | |||
| ==== Based on final estimates ==== | |||
| * Reduce the total number from 13826 to 7 libs:  6500,5500,...1500, un-estimated (2501)  | |||
|   meanF        #libs      min        max        mean       med        n50        sum         uid(new)     mean(new)  std(new) | |||
|   6K<=mea<7K   15         6010       6636       6176       6159       6159       92650       6500         6500 | |||
|   5K<=mea<6K   29         5121       5985       5540       5536       5577       160673      5500         5500 | |||
|   4K<=mea<5K   67         4017       4939       4284       4266       4274       287072      4500         4500 | |||
|   3K<=mea<4K   1401       3000       3998       3276       3209       3226       4590323     3500         3500 | |||
|   2K<=mea<3K   7998       2000       2999       2502       2498       2532       20017767    2500         2500 | |||
|   1K<=mea<2K   607        1157       1999       1825       1882       1890       1107798     1500         1200 | |||
|   un-estimated 3709                                                                          2501         2501 | |||
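The binning in the table above can be sketched in Python as follows. This is only an illustration of the rule implied by the uid(new) column (1 kbp bins labelled by their midpoint, with 2501 as the catch-all for libraries that were not re-estimated); the function name `consolidated_lib` is hypothetical and not part of the assembly pipeline. Note that the mean(new) column in the table differs from the bin label for the lowest bin (1200 rather than 1500), which this sketch does not model.

```python
def consolidated_lib(mean_final):
    """Return the consolidated library label for a re-estimated mean insert size.

    Libraries are grouped into 1 kbp bins labelled by the bin midpoint
    (e.g. 2000 <= mean < 3000 -> 2500). Libraries without a final estimate
    (None) go to the catch-all label 2501, as in the table.
    """
    if mean_final is None:
        return 2501
    if not 1000 <= mean_final < 7000:
        # The table only covers 1K..7K; anything else is outside this sketch.
        raise ValueError(f"mean {mean_final} is outside the 1K-7K range in the table")
    return int(mean_final // 1000) * 1000 + 500


if __name__ == "__main__":
    for mean in (1157, 2502, 3998, 6636, None):
        print(mean, "->", consolidated_lib(mean))
```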
| ===  BARC.CLONEEND ===  | |||
| Collapse all 11150 into 1:  | |||
|  uid:25456 | |||
|  mea:165000 | |||
|  std:43000 | |||
| == Overlapper == | |||
| * Quality-less reads overlaps: fewer than in the UMD2.3 assembly | |||
|                    elem       0          >0         min        max        mean       med        n50        sum            | |||
|   0qual(all)       624830     35692      589138     0          3237       60         14         439        37578899         # 94.39% | |||
| == Unitigger == | |||
|   UNITIG OVERLAP GRAPH INFORMATION     | |||
|         5356408 : Total number of unitigs | |||
|         2613795 : Total number of singleton, contained unitigs | |||
|         1870448 : Total number of singleton, non-contained unitigs | |||
|          182878 : Total number of non-singleton, spanned unitigs | |||
|          689287 : Total number of non-singleton, non-spanned unitigs | |||
|        35547861 : Total number of fragments | |||
|        35547861 : Total number of fragments in all unitigs | |||
|        21685943 : Total number of essential fragments in all unitigs | |||
|        13861918 : Total number of contained fragments in all unitigs | |||
|    0.0077797328 : Randomly sampled fragment arrival rate per bp | |||
|      2513424271 : The sum of overhangs in all the unitigs | |||
|      6468428782 : Total number of bases in all unitigs | |||
|               0 : Estimated number of base pairs in the genome. | |||
|               0 : Total number of contained fragments not connected | |||
|                   by containment edges to essential fragments. | |||
|   Total rho    = 2513424271                | |||
|   Total nfrags = 19553770 | |||
|   Estimated genome length = 0 | |||
|   Estimated global_fragment_arrival_rate=0.007780 | |||
|   Computed global_fragment_arrival_rate =0.007780      | |||
|   Total number of randomly sampled fragments in genome = 23868770 | |||
|   Computed genome length  = 3068070656.000000           | |||
|   Used global_fragment_arrival_rate=0.007780              | |||
|   Used global_fragment_arrival_distance=128.539119 | |||
|  Histogram of the number of base pairs in a chunk | |||
|  100292 - 138301:     19 | |||
|   90052 -  99906:     23 | |||
|   80043 -  89676:     79 | |||
|   70013 -  79966:    164 | |||
|   60010 -  69988:    390 | |||
|   50008 -  59983:    949 | |||
|   40000 -  49998:   2433 | |||
|   30000 -  39997:   6437 | |||
|   20000 -  29999:  18808 | |||
|   10000 -  19999:  57634 | |||
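The figures in the unitigger log above are consistent with the arrival rate being the ratio of "Total nfrags" to "Total rho", and the computed genome length being the number of randomly sampled fragments divided by that rate. The short Python check below reproduces the logged values under that assumption; it is a sanity check on the numbers, not a description of the unitigger source code.

```python
# Values copied from the unitigger log above.
total_rho = 2513424271        # "Total rho"
total_nfrags = 19553770       # "Total nfrags"
sampled_frags = 23868770      # "Total number of randomly sampled fragments in genome"

# Assumed relations (they match the logged numbers):
rate = total_nfrags / total_rho          # fragments per bp
distance = 1.0 / rate                    # bp between fragment starts
genome_length = sampled_frags / rate     # bp

print(f"global_fragment_arrival_rate     = {rate:.6f}")
print(f"global_fragment_arrival_distance = {distance:.3f}")
print(f"computed genome length           = {genome_length:.0f}")
```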
| == Bog == | |||
| !!! Much bigger unitigs than default unitigger | |||
|  Global Arrival Rate: 0.013829 | |||
|  212260 - 224992:      4  | |||
|  100099 - 186873:    372  | |||
|   90015 -  99973:    353  | |||
|   80045 -  89988:    582  | |||
|   70011 -  79999:   1084  | |||
|   60000 -  69994:   1856  | |||
|   50001 -  59996:   3162  | |||
|   40002 -  49994:   5407  | |||
|   30000 -  39999:   9767  | |||
|   20000 -  29996:  18981  | |||
|   10000 -  19999:  39641 | |||
| == Consensus after Unitigger == | |||
| * Failed on jobs 120 & 121 ( _tmp file) | |||
|   cat 4-unitigger/*120*  | countMessages.pl | |||
|   IMP     280264 | |||
|   IUM     124707 | |||
|   cat 4-unitigger/*121* | countMessages.pl | |||
|   IMP     282146 | |||
|   IUM     245650 | |||
|   cat 5-consensus/bt_120.cgi_tmp  | countMessages.pl | |||
|   IMP     34348 | |||
|   IUM     19222 | |||
|   cat 5-consensus/bt_121.cgi_tmp | countMessages.pl  | |||
|   IMP     51833 | |||
|   IUM     16805 | |||
| * Fix 120: split IUM messages  | |||
|   extractfromfrgMSG.pl -b 0 -e 19222 bt_120.cgb.orig IUM >! bt_120.cgb & | |||
|   extractfromfrgMSG.pl -b 19222 bt_120.cgb.orig IUM >! bt_120.cgb &  | |||
| * Fix 121: remove assertion in AS_CNS/MultiAlignment_CNS.c  | |||
|   if(to <= from || to > ma_length-1){ | |||
|     fprintf(stderr, "AbacusRefine range (to) invalid"); | |||
|     //assert(0);  | |||
|   } | |||
| == CGW == | |||
| * Failed after Ckp3(7-0-CGW/bt.ckp.3; MergeScaffoldsAggressive 2nd iteration) | |||
|   CI extends beyond end of scaffold! | |||
|   offsetAEnd = 254204 offsetBEnd = 252250 scaffoldLength = 253268 | |||
|   cgw: CIScaffoldT_Merge_CGW.c:307: InsertScaffoldContentsIntoScaffold: Assertion `0' failed. | |||
| * Last cgw | |||
|   Scaffold lengths: | |||
|   cat 7-4-CGW/stat/final0.*Scaffolds.nodelength.cgm | grep -v ^Scaff | getSummary.pl -t scf | |||
|   cat 7-4-CGW/stat/final0.PlacedContig.n | grep -v ^Scaff | getSummary.pl -t scf | |||
|              elem       min        max        mean       med        n50        sum             | |||
|   scf        45826      385        34263871   59591      1349       7059820    2730853790      | |||
|   ctg        96562      65         738899     27789      3657       93988      2683452359 | |||
|   Library insert estimates: | |||
|   cat 7-4-CGW/stat/scaffold_final.distupdate.dst | grep ^# | awk '{print $3,int($8),int($10)}' > 7-4-CGW/bt.dst | |||
|   join2.pl bt.dst 7-4-CGW/bt.dst | p 'print join "\t",@F[0,1,2,5,6,3,4]; print "\n";' > bt.dst.combine | |||
|   CLONEEND inserts: | |||
|   UID           MEANI   STDI    MEANF   STDF    COUNT   LIB | |||
|   114892        150000  30000   175701  40732   31063   UIUC.CLONEEND | |||
|   19070         167000  25000   171349  18253   7103    BCM.CLONEEND | |||
|   118           167000  16700   167000  16700   21      WUGSC.CLONEEND | |||
|   25456         165000  43000   163044  25849   11150   BARC.CLONEEND | |||
|   115020        150000  30000   162719  25343   15256   UIUC.CLONEEND | |||
|   65177         2000    600     162540  34155   27067   TIGR.CLONEEND | |||
|   125606        150000  30000   162396  19319   59505   BCCAGSC.CLONEEND | |||
|   10738         2500    750     162386  27567   2040    TIGR_JCVIJTC.CLONEEND | |||
|   10691         2500    750     161540  28239   2763    TIGR_JCVIJTC.CLONEEND | |||
|   17249         202000  20200   157496  55375   6269    CENARGEN.CLONEEND | |||
|   54017         120000  12000   115671  27594   25889   GSC.CLONEEND | |||
|   total                                         188126  CLONEENDs | |||
| == Consensus == | |||
| * Failed on job 34 with segmentation fault | |||
| * 9kbp contig, made out of 3007 reads (24 of which are quality-less) | |||
|   cat 7-CGW/bt.cgw_contigs.34.1 | grep "^{" | uniq -c | awk '{print $2,$1}' | |||
|   {ICM 1 | |||
|   {IMP 3007 | |||
|   {IUP 329 | |||
| * Fix : edit AS_CNS/MultiAlignment_CNS.c; add | |||
|   if(!ungappedSequence->Elements) { ungappedSequence->numElements=0; } | |||
|   if(!ungappedQuality->Elements) { ungappedQuality->numElements=0; } | |||
| == Analysis == | |||
| === Contigs Vs possible contaminants  === | |||
| * nucmer alignment parameters:  -l 40 -c 100 -b 10 -g 5 -d 0.05 | |||
| * have to redo alignments using -maxmatch !!! | |||
| * file location:  | |||
|   reference seqs: | |||
|     /nfshomes/dpuiu/db/Ecoli.365350-365744                     # Ecoli K12 region with most alignments (BCM WGS splice site) | |||
|     /nfshomes/dpuiu/db/Ecoli                                   # Ecoli K12 substrain MG1655  (NC_000913 ; 1st completed) | |||
|     /nfshomes/dpuiu/db/Ecoli.all                               # 22 Ecoli completed genomes ( + plasmids) | |||
|     /nfshomes/dpuiu/db/UniVec_Core                             # UniVec Core seqs  | |||
|     /nfshomes/dpuiu/db/OtherVec                                # 100 other vector sequences identified by aligning UMD2.0 contaminants to GenBank; align also to 110 UniVec core using nucmer (params above) | |||
|     /nfshomes/dpuiu/db/bos_taurus.UMD2.contaminant.fasta       # 4813 whole contigs and 30329 contig regions identified by NCBI as UMD2 contamination | |||
|     /nfshomes/dpuiu/db/bos_taurus.UMD2.contaminant.organism_count  # organism counts: vector is the most abundant | |||
|     /nfshomes/dpuiu/db/bos_taurus.UMD2.contaminant.infoseq      # grep -v 'coli|vector|7180003101029' => 905 other contaminants | |||
|   query seqs: | |||
|     /scratch1/bos_taurus/Assembly/2009_0217_CA/9-terminator/ctg.split100/*fasta  # latest assembly contigs (no degenerates) | |||
|   delta files:  | |||
|     /scratch1/bos_taurus/Assembly/2009_0217_CA/nucmer_ctg/no_maxmatch/*delta | |||
| Ecoli K12 substrains: | |||
|   NC_010473.1    4686137 50.78  Escherichia coli str. K-12 substr. DH10B, complete genome | |||
|   NC_000913.2    4639675 50.79  Escherichia coli str. K-12 substr. MG1655, complete genome | |||
|   AC_000091.1    4646332 50.80  Escherichia coli str. K-12 substr. W3110, complete genome | |||
| ==== no maxmatch ==== | |||
| * fewer alignments in UMD2.4 than in UMD2 | |||
| UMD2 (all): just a few degens | |||
|   15102 Ecoli.365350-365744-ctg.qry_hits | |||
|   15943 Ecoli-ctg.qry_hits | |||
|   17308 Ecoli.all-ctg.qry_hits | |||
|   79065 UMD2.contaminant-ctg.qry_hits     # 55877 new hits | |||
|   20105 UMD2.contaminant-ctg.CBE.qry_hits # CONTAIN|BEGIN|END|IDENTITY | |||
|   19839 UniVec_Core-ctg.qry_hits | |||
| UMD2.4 | |||
|     559 Ecoli.365350-365744-ctg.qry_hits | |||
|    1215 Ecoli-ctg.qry_hits | |||
|    2767 Ecoli.all-ctg.qry_hits             # the 2 most frequent strains are UMN026 & ATCC 8739; K12 DH10B ranks 5th; K12 MG1655 ranks 19th (out of 31 seqs) | |||
|   44112 UMD2.contaminant-ctg.qry_hits | |||
|    5286 UniVec_Core-ctg.qry_hits | |||
| Length of the reference seqs used for screening: | |||
|                           #seqs      min        max        mean       med        n50        sum | |||
|   Ecoli.365350-365744     1          395        395        395        395        395        395                  # Ecoli K12 regions with most alignments (BCM WGS splice site) | |||
|   Ecoli                   1          4639675    4639675    4639675    4639675    4639675    4639675              # Ecoli K12 substrain MG1655         | |||
|   Ecoli.all               49         3306       5572075    2293320    130440     5065741    112372708            # 22 Ecoli's  | |||
|   UniVec_Core             1348       12         48551      243        98         967        327641           | |||
|   OtherVec                100        1702       739874     15419      5027       166744     1541984 | |||
|   UMD2.contaminant        35142      48         16661      512        362        674        18022349              | |||
| Length of UMD2.4 contigs that contain contaminant (0+ bp from end): | |||
|                           #ctgs      <2000bp    >=2000bp   min        max        mean       med        n50        sum | |||
|   Ecoli.365350-365744-ctg 559        534        25         1001       179527     2467       1341       1894       1379440 | |||
|   Ecoli-ctg               1215       1086       129        1001       360312     4326       1347       71372      5256540 | |||
|   Ecoli.all-ctg           2767       2455       312*       1001       453627*    8031       1366       134516     22224468 | |||
|   UniVec_Core-ctg         5286       4718       568*       882        651163*    9820       1337       136090     51909339 | |||
|   UMD2.contaminant-ctg.CBE 4976      4410       566*       738        651163*    8497       1339       122281     42281715     #annotated alignments: CONTAIN|BEGIN|END|IDENTITY | |||
|   UMD2.contaminant-ctg    44112      12813      31299      268        739442     50591      27461      111598     2231701788 | |||
| Length of UMD2.4 contigs that contain contaminant in the middle (500+ bp from end): | |||
|                           #ctgs      <2000bp    >=2000bp   min        max        mean       med        n50        sum | |||
|   Ecoli.365350-365744-ctg 144        136        8          1286       2053       1779       1811       1814       256259 | |||
|   Ecoli-ctg               171        152        19         1286       4703       1835       1807       1821       313820 | |||
|   Ecoli.all-ctg           197        160        37*(81)    1228       351373*    6516       1815       125069     1283728     #81  2K+ ctgs using -maxmatch | |||
|   UniVec_Core-ctg         1278       1110       168*(276)  1085       651163*    12266      1496       160336     15676765    #276 2K+ ctgs using -maxmatch | |||
|   UMD2.contaminant-ctg.CBE 52         25        27*        1249       351373*    22195      2054       125069     1154142     #annotated alignments: CONTAIN|BEGIN|END|IDENTITY | |||
|   UMD2.contaminant-ctg    31019      1437       29582      1113       739442     70665      50798      113684     2191986214 | |||
| Length of the UMD2.4 contaminant sequence (0+ bp from end): | |||
|                              #align     <200bp     >=200bp       min       max        mean       med        n50        sum | |||
|   Ecoli.365350-365744-ctg    1066       537        529        104        225        192        162        224        205379 | |||
|   Ecoli-ctg                  1793       587        1206       50         4440       496        224        994        889798 | |||
|   Ecoli.all-ctg              4074       1132       2942*      40         17075*     380        254        441        1551783 | |||
|   UniVec_Core-ctg            14425      9819       4606*      40         1801*      236        162        325        3409187 | |||
|   UMD2.contaminant-ctg       144843     96008      48835      40         16661      199        169        209        28912002 | |||
| Length of the UMD2.4 contaminant sequence (500+ bp from end): | |||
|                              #align     <200bp     >=200bp    min        max        mean       med        n50        sum | |||
|   Ecoli.365350-365744-ctg    243        136        107        162        224        189        162        224        46000 | |||
|   Ecoli-ctg                  273        149        124        106        1341       219        162        224        59923 | |||
|   Ecoli.all-ctg              294        153        141*       106        2150*      251        162        224        73992 | |||
|   UniVec_Core-ctg            2144       2035       109*       50         1340*      122        121        121        261821 | |||
|   UMD2.contaminant-ctg       121331     86985      34346      40         2738       171        162        184        20753580 | |||
| * Problem: 8 long ctgs contain Ecoli in the middle (1000+ bp from end) | |||
|  show-coords Ecoli.all-ctg.filter-q.delta | ~/bin/filterQryCoords.pl -i 1000 | sort -nk13 -r | |||
|     [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS] | |||
|  =============================================================================================================================== | |||
|  4640589  4641890  |   161712   160411  |     1302     1302  |    99.46  |  4686137   351373  |     0.03     0.37  | gi|170079663|ref|NC_010473.1|      ctg7180001872124 | |||
|  5068908  5069620  |    87386    86679  |      713      708  |    98.88  |  5209548    91972  |     0.01     0.77  | gi|218687878|ref|NC_011745.1|      ctg7180002055226 | |||
|  3087480  3088683  |    50423    51620  |     1204     1198  |    99.00  |  5202090    88182  |     0.02     1.36  | gi|218703261|ref|NC_011751.1|      ctg7180002054092 | |||
|  4640580  4641890  |    19953    18646  |     1311     1308  |    99.08  |  4686137    31157  |     0.03     4.20  | gi|170079663|ref|NC_010473.1|      ctg7180001875158 | |||
|   131462   133564  |     1247     3349  |     2103     2103  |    98.19  |   241387     5751  |     0.87    36.57  | gi|157412014|ref|NC_009838.1|      ctg7180002043242 | |||
|    82801    83166  |     2986     2621  |      366      366  |    98.09  |   241387     4709  |     0.15     7.77  | gi|157412014|ref|NC_009838.1|      ctg7180001714551 | |||
|    82264    82793  |     3523     2994  |      530      530  |    98.49  |   241387     4709  |     0.22    11.26  | gi|157412014|ref|NC_009838.1|      ctg7180001714551 | |||
|  1652253  1652545  |     1487     1195  |      293      293  |    98.63  |  4700560     2492  |     0.01    11.76  | gi|218552585|ref|NC_011741.1|      ctg7180001754941 | |||
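The "1000+ bp from end" selection used for the table above can be sketched in Python as follows. The actual logic of `filterQryCoords.pl -i 1000` is not shown in these notes, so this is only an assumed equivalent: an alignment is kept when both of its query coordinates lie at least `min_dist` bp away from either end of the contig. The function name `is_interior` and the exact boundary handling are assumptions.

```python
def is_interior(s2, e2, len_q, min_dist=1000):
    """Return True if a query alignment lies at least min_dist bp from both contig ends.

    s2, e2 : query start/end from show-coords ([S2], [E2]); may be reversed
             for alignments on the minus strand.
    len_q  : query (contig) length ([LEN Q]).
    """
    lo, hi = sorted((s2, e2))
    bases_before = lo - 1        # contig bases to the left of the alignment
    bases_after = len_q - hi     # contig bases to the right of the alignment
    return bases_before >= min_dist and bases_after >= min_dist


if __name__ == "__main__":
    # Two rows taken from the table above, plus one made-up row near a contig end.
    print(is_interior(161712, 160411, 351373))  # ctg7180001872124
    print(is_interior(1487, 1195, 2492))        # ctg7180001754941
    print(is_interior(100, 500, 5000))          # hypothetical: too close to the 5' end
```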
| * Regions present in DH10B but not MG1655 | |||
|   delta2cvg -M 0 < DH10B-MG1655.delta | |||
|   gi|170079663|ref|NC_010473.1|   1349629 1378243 28614   0 | |||
|   gi|170079663|ref|NC_010473.1|   1391006 1396986 5980    0 | |||
|   gi|170079663|ref|NC_010473.1|   3199469 3200798 1329    0 | |||
|   gi|170079663|ref|NC_010473.1|   3211928 3213257 1329    0 | |||
|   gi|170079663|ref|NC_010473.1|   4640588 4641918 1330    0 !!! | |||
| * Problem: 10 long ctgs contain Vector in the middle (1000+ bp from end) | |||
|  show-coords UniVec_Core-ctg.filter-q.delta | ~/bin/filterQryCoords.pl -i 1000 | sort -nk13 -r | |||
|     [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS] | |||
|  =============================================================================================================================== | |||
|        1      121  |   215495   215615  |      121      121  |    99.17  |      170   271477  |    71.18     0.04  | gnl|uv|U09128.1:15891-16011-49     ctg7180002047604       # pSacBII P1 cloning vector   | |||
|     2252     2435  |     1334     1151  |      184      184  |   100.00  |     2485   160336  |     7.40     0.11  | gnl|uv|U75992.1:16925-19409        ctg7180001808271 | |||
|      180      312  |     1153     1020  |      133      134  |    99.25  |      312   160336  |    42.63     0.08  | gnl|uv|NGB00145.1:2378-2689        ctg7180001808271 | |||
|        1      121  |     1367     1487  |      121      121  |   100.00  |      170   160336  |    71.18     0.08  | gnl|uv|U09128.1:15891-16011-49     ctg7180001808271 | |||
|        1      103  |     1286     1388  |      103      103  |   100.00  |      103   160336  |   100.00     0.06  | gnl|uv|U80929.2:11415-11517        ctg7180001808271        [CONTAINED] | |||
|        4      121  |    68269    68386  |      118      118  |   100.00  |      170   111913  |    69.41     0.11  | gnl|uv|U09128.1:15891-16011-49     ctg7180002052060 | |||
|       40      152  |    30255    30142  |      113      114  |    99.12  |     1663    42854  |     6.79     0.27  | gnl|uv|U09128.1:1-1663             ctg7180002053344 | |||
|        1      121  |    34358    34238  |      121      121  |   100.00  |      170    35471  |    71.18     0.34  | gnl|uv|U09128.1:15891-16011-49     ctg7180002046164 | |||
|        1      103  |    34439    34337  |      103      103  |   100.00  |      103    35471  |   100.00     0.29  | gnl|uv|U80929.2:11415-11517        ctg7180002046164        [CONTAINED] | |||
|       46     1385  |     8928    10267  |     1340     1340  |   100.00  |     1413    17587  |    94.83     7.62  | gnl|uv|X65279.1:5941-7353          ctg7180002043597        [CONTAINED] | |||
| * ctg7180001872124 : 351373 bp; region 160411..161712 contaminated by Ecoli | |||
|   cat 9-terminator/bt.posmap.utgctg | grep 7180001872124 | wc -l  # 329 | |||
|   cat 9-terminator/bt.posmap.utgctg | grep 7180001872124 | perl -ane '@F[2,3]=@F[3,2] if($F[2]>$F[3]); print $_ if($F[2]<160411 and 160411<$F[3] or $F[2]<161712 and 161712<$F[3]);' | |||
|   7180000441625   7180001872124   159483  161201  r | |||
|   7180000441788   7180001872124   160330  161329  f  #Ecoli | |||
|   7180000442730   7180001872124   160368  161010  r  #Ecoli | |||
|   7180000441635   7180001872124   160740  162700  f  #Ecoli | |||
|   cat 9-terminator/bt.utg.info | |||
|   utg7180000441625 length=1715 num_frags=12 Astat=7.00 | |||
|   utg7180000441788 length=999 num_frags=1 Astat=0.00 | |||
|   utg7180000442730 length=640 num_frags=1 Astat=0.00 | |||
|   utg7180000441635 length=1957 num_frags=9 Astat=7.00 | |||
|   cat 9-terminator/bt.posmap.frgctg | perl -ane '@F[2,3]=@F[3,2] if($F[2]>$F[3]); print $_ if($F[2]<160411 and 160411<$F[3] or $F[2]<161712 and 161712<$F[3]);' | |||
|   1237446426      7180001872124   160117  161201  f | |||
|   1238816835      7180001872124   160133  160993  f | |||
|   1238817728      7180001872124   160322  161123  r | |||
|   1244436200      7180001872124   159976  160984  f | |||
|   1238817676      7180001872124   160105  160890  r | |||
|   1237443253      7180001872124   160106  160900  f | |||
|   1237471027      7180001872124   159930  160928  f | |||
|   1238822613      7180001872124   159774  160782  f | |||
|   1238816875      7180001872124   159878  160728  f | |||
|   1244436248      7180001872124   159483  160553  f | |||
|   1238818306      7180001872124   159718  160489  f | |||
|   1238818332      7180001872124   159722  160483  f | |||
|   1237476824      7180001872124   160330  161329  f | |||
|   1238817689      7180001872124   160368  161010  r | |||
|   1237447135      7180001872124   160740  161768  r | |||
|   1237483546      7180001872124   160814  161790  r | |||
|   1237483530      7180001872124   160818  161856  r | |||
|   1237471108      7180001872124   161003  162009  f | |||
|   1238817744      7180001872124   161151  161978  f | |||
|   1237446441      7180001872124   161050  162107  f | |||
|   1244436201      7180001872124   161117  162164  f | |||
|   1237446407      7180001872124   161586  162699  r | |||
|   1237471055      7180001872124   161652  162700  r | |||
|   # 23 BCM SHOTGUN RP42 VVHNP reads (1369 read lib; 1341 of the reads in this ctg) | |||
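The Perl one-liners above keep posmap rows whose interval strictly spans either boundary of the contaminated region. A Python sketch of the same test is given below; the helper names `spans_boundary` and `filter_posmap` are hypothetical, and the column layout (id, contig, start, end, orientation) is taken from the rows shown above. One consequence of this test is that a unitig or read lying entirely inside the region, without crossing either boundary, is not reported.

```python
def spans_boundary(start, end, region_start, region_end):
    """True if the interval strictly contains region_start or region_end."""
    lo, hi = sorted((start, end))   # same effect as the coordinate swap in the one-liner
    return lo < region_start < hi or lo < region_end < hi


def filter_posmap(lines, ctg, region_start, region_end):
    """Return ids of posmap rows on contig `ctg` that span a region boundary."""
    hits = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4 or fields[1] != ctg:
            continue
        if spans_boundary(int(fields[2]), int(fields[3]), region_start, region_end):
            hits.append(fields[0])
    return hits


if __name__ == "__main__":
    rows = [
        "7180000441625   7180001872124   159483  161201  r",
        "7180000441635   7180001872124   160740  162700  f",
    ]
    print(filter_posmap(rows, "7180001872124", 160411, 161712))
```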
| * ctg7180002047604 : Vector in the middle | |||
|     [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS] | |||
|  =============================================================================================================================== | |||
|        1      121  |   215495   215615  |      121      121  |    99.17  |      170   271477  |    71.18     0.04  | gnl|uv|U09128.1:15891-16011-49     ctg7180002047604       # pSacBII P1 cloning vector  | |||
|   cat 9-terminator/bt.posmap.utgctg | grep 7180002047604 | perl -ane '@F[2,3]=@F[3,2] if($F[2]>$F[3]); print $_ if($F[2]<215495 and 215495<$F[3] or $F[2]<215615 and 215615<$F[3]);' | |||
|   7180000441711   7180001872124   214458  219678  r | |||
|   cat 9-terminator/bt.posmap.frgctg | grep 7180002047604  | perl -ane '@F[2,3]=@F[3,2] if($F[2]>$F[3]); print $_ if($F[2]<215495 and 215495<$F[3] or $F[2]<215615 and 215615<$F[3]);' | |||
|   498776751       7180001872124   215425  216426  r | |||
|   1236502885      7180001872124   215514  216377  r | |||
|   379408823       7180001872124   215572  216340  f | |||
|   1244436224      7180001872124   215388  216405  f | |||
|   1237471071      7180001872124   215229  216234  r | |||
|   1233297450      7180001872124   215234  216046  f | |||
|   1233363357      7180001872124   215267  215687  f | |||
|   937200686       7180001872124   215300  216129  r | |||
|   937254901       7180001872124   215321  216160  f | |||
|   1233294025      7180001872124   215383  216204  r | |||
|   1237446444      7180001872124   215146  216187  f | |||
|   1232033776      7180001872124   215193  215996  r | |||
|   671976381       7180001872124   215035  216021  r | |||
|   514932286       7180001872124   215043  216008  f | |||
|   500723879       7180001872124   215043  215802  f | |||
|   671927656       7180001872124   215116  215733  r | |||
|   381173692       7180001872124   214947  215877  r | |||
|   1233303570      7180001872124   214963  215803  f | |||
|   1232037705      7180001872124   214990  215803  f | |||
|   490852264       7180001872124   214923  215843  f | |||
|   1237447184      7180001872124   214684  215646  f | |||
|   668822243       7180001872124   214586  215572  f | |||
|   #22 reads ; ~half come from BCM SHOTGUN RP42 VVFOP | |||
| ==== maxmatch ctg ==== | |||
| Parameters: | |||
|   nucmer -maxmatch -l 40 -c 100 -b 10 -g 5 -d 0.05 ... | |||
|   AllVec: UniVec_Core + 100 more vector seqs | |||
| Length of UMD2.4 contigs that contain contaminant (0+ bp from end): | |||
|                             elem       <2000      >2000      min        max        mean       med        n50        sum | |||
|   Ecoli.all                 2951*      2602       349        1001       453627     8252       1367       132226     24352779 | |||
|   UniVec_Core               5387*      4802       585        882        651163     9979       1334       136556     53760575 | |||
|   OtherVec                  5657       5062       595        882        651163     9726       1320       136556     55021803 | |||
|   UMD2.cont.other           3976       3430       546        804        651163     11217      1346       130385     44601117         #18 aligned to Acinetobacter; longest is 56467bp | |||
| Length of UMD2.4 contigs that contain contaminant (500+ bp from end): | |||
|                             elem       <2000      >2000      min        max        mean       med        n50        sum | |||
|   Ecoli.all                 182        156        26*        1286       351373     6525       1811       125069     1187706          # 7*   are >5K;  321* come from multi-ctg scaffolds | |||
|   UniVec_Core               2532       2220       312*       1065       651163     10593      1481       128344     26821960         # 267* are >5K ; 655* come from multi-ctg scaffolds | |||
|   OtherVec                  376        323        53         1184       361749     13278      1508       139997     4992774 | |||
|   UMD2.cont.other           ... | |||
| Length of UMD2.4 contigs that contain contaminant (1000+ bp from end): | |||
|                             elem       <2000      >2000      min        max        mean       med        n50        sum | |||
|   Ecoli.all                 8          0          8*         4709       351373     73065      31157      351373     584520 | |||
|   UniVec_Core               11         0          11*        2600       334933     93674      37847      271477     1030414  | |||
|   OtherVec                  5          0          5*         3717       271477     131604     111913     228060     658021 | |||
|   UMD2.cont.other           54         0          54*        2398       522682     110947     88182      189352     5991164 | |||
|   total                     67* # 18 of them are CONTAINED by UMD2.0 chromosomes | |||
| Length of the UMD2.4 contaminant sequence (0+ bp from end): | |||
|                             elem       <200       >200       min        max        mean       med        n50        sum | |||
|   Ecoli.all                 4775       1610       3165       39         17072      381        236        502        1823278 | |||
|   UniVec_Core               16985      12380      4605       39         1800       207        132        300        3519080 | |||
|   OtherVec                  7563       1372       6191       39         1800       509        548        643        3849567 | |||
|   UMD2.cont.other           6626       343        6283       39         8228       543        573        615        3602329 | |||
| ==== maxmatch deg ==== | |||
| All degenerates aligned are <2K | |||
| Length of UMD2.4 deg that contain contaminant (0+ bp from end): | |||
|                             elem       <2000      >2000      min        max        mean       med        n50        sum | |||
|   Ecoli.all                 1266       1266       0          104        1611       783        833        869        991447 | |||
|   UniVec_Core               1908       1908       0          147        1510       872        896        910        1664746 | |||
|   OtherVec                  1963       1963       0          147        1510       872        898        911        1712703 | |||
|   UMD2.cont.other           1609       1609       0          132        1611       852        892        914        1372106 | |||
| ==== maxmatch utg ==== | |||
| Unitig stats: | |||
|   elem       <2000      >2000      min        max        mean       med        n50        sum             | |||
|   1707816    1434164    273652     21         138676     2228       937        8002       3805166508    | |||
| Parameters: | |||
|   nucmer -maxmatch -l 40 -c 100 -b 10 -g 5 -d 0.05 ... | |||
| Files: | |||
|   /scratch1/bos_taurus/Assembly/2009_0217_CA/nucmer_utg/ | |||
| <font color=red> | |||
| Length of UMD2.4 unitigs that align to contaminants | |||
|                             elem       <2000      >2000      min        max        mean       med        n50        sum | |||
|   Ecoli.all                 4275       4110       165        104        71709      1442       1212       1398       6166566   | |||
|   UniVec_Core               7563       7409       154        139        71709      1397       1182       1331       10570512 | |||
|   OtherVec                  8208       8054       154        139        71709      1370       1159       1308       11248775 | |||
|   UMD2.cont.other           6094       5849       245        132        53113      1546       1163       1401       9422951    #80 aligned to Acinetobacter; longest is 9114bp | |||
|   Contaminants(all above)   10264      9895       369        104        71709      1471       1148       1359       15107544 | |||
|   Acinetobacter             2306**      0          2306       154        71709      1451       1316       1412      3347230  #2182 already in the Cont set  | |||
| Length of UMD2.4 unitigs that have contaminants 500+bp from ends | |||
|                             elem       <2000      >2000      min        max        mean       med        n50        sum | |||
|   Ecoli.all                 172        156        16         1286       4852       1820       1805       1815       313185 | |||
|   UniVec_Core               2491       2422       69         1065       71709      1722       1457       1523       4291584 | |||
|   OtherVec                  364        358        6          1167       71709      1795       1478       1538       653595 | |||
|   UMD2.cont.other           156        108        48         1213       50248      5344       1838       17518      833673 | |||
| Length of the UMD2.4 alignments of unitigs to contaminants (unique unitig regions) | |||
|                             elem       <200       >200       min        max        mean       med        n50        sum       reads(all unitig reads for unitigs with alignments>1K) | |||
|   Ecoli.all                 5975       1686       4289       40         8184       397        268        542        2374366   12112(12142) | |||
|   UniVec_Core               8754       1674       7080       40         1801       474        490        645        4153030   26590(26849) | |||
|   OtherVec                  8919       1250       7669       40         1801       511        536        629        4562326   30268(30268) | |||
|   UMD2.cont.other           6752       896        5856       40         6012       529        555        651        3573528   25006(25328) | |||
|   Contaminants(all above)   10992      1396       9596       40         8184       571        573        684        6280759   40351(40699) | |||
|   Acinetobacter                                                                                                                     (8286)</font> | |||
| 40699 reads aligned back to contaminants: nucmer -maxmatch | |||
| * 35919 align | |||
| * 34400 align 100+bp | |||
| * 27742 align 200+bp | |||
| * 14211 align 500+bp | |||
| ==== utg 5'& 3' ==== | |||
| Unitig stats: | |||
|               elem         <200       >200       min        max        mean       med        n50        sum             | |||
|  utg          1,707,816    81200      1626616    21         138676     2228       937        8002       3805166508    | |||
|  utg5'&3'     3,334,432    0          3334432    21         199        100        100        100        335263271 | |||
| Align utg5'&3' to Ecoli.all using: | |||
| * nucmer -l 40 -c 100 -b 10 -g 5 -d 0.05 : 4,275 hits | |||
| * nucmer -l 20 -c 40 : 6,617 hits | |||
| * nucmer -l 20 -c 20 : 23,350  | |||
| * blastall           : 2,895,506  out of 3,334,432 (86%) aligned | |||
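The element count of the utg5'&3' row looks consistent with cutting 100 bp from each end of every unitig of 200+ bp and keeping shorter unitigs whole: 2 x 1,626,616 + 81,200 = 3,334,432. A minimal sketch of that extraction, assuming a 100 bp end length; the function names and the `_5`/`_3` suffixes are hypothetical and not taken from the scripts actually used:

```python
def extract_ends(seq_id, seq, end_len=100):
    """Return (name, subsequence) pairs for the 5' and 3' ends of one sequence.

    Assumption: sequences shorter than 2 * end_len are kept whole as one piece,
    which would explain the 199 bp maximum in the utg5'&3' row.
    """
    if len(seq) < 2 * end_len:
        return [(seq_id, seq)]
    return [(seq_id + "_5", seq[:end_len]), (seq_id + "_3", seq[-end_len:])]


def expected_pieces(n_short, n_long):
    """Number of end pieces for n_short whole sequences and n_long split ones."""
    return n_short + 2 * n_long


# Counts from the unitig stats table: 81,200 unitigs <200 bp, 1,626,616 longer.
print(expected_pieces(81200, 1626616))  # 3334432
```

The same arithmetic with 300 bp ends would give the ctg53 count further down (180,044 pieces from 90,135 contigs) if 226 contigs are shorter than 600 bp; that number is inferred, not reported.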
| ==== Acinetobacter contamination ==== | |||
| Database:  | |||
|   ~dpuiu/db/Acinetobacter.all : 7 complete genomes, 19 seqs | |||
| Seq len summary: | |||
|    elem       min        max        mean       med        n50        sum | |||
|    19         2726       4050513    1418094    28279      3904116    26943793 | |||
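The length tables on this page all use the same columns (elem, min, max, mean, med, n50, sum). A minimal sketch of how such a row can be computed, assuming the usual N50 definition (the length L at which sequences of length >= L hold at least half of the total) and an integer mean; the summary tool actually used may differ in details such as the median of an even-sized set:

```python
from statistics import median


def length_summary(lengths):
    """Summarise a list of sequence lengths in the style of the tables here."""
    if not lengths:
        raise ValueError("no lengths given")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    n50 = ordered[-1]
    for length in ordered:
        running += length
        # First length at which the cumulative sum reaches half of the total.
        if 2 * running >= total:
            n50 = length
            break
    return {
        "elem": len(ordered),
        "min": ordered[-1],
        "max": ordered[0],
        "mean": total // len(ordered),  # integer mean, as the tables appear to use
        "med": median(ordered),
        "n50": n50,
        "sum": total,
    }


print(length_summary([2, 3, 4, 5, 6]))
```

For the Acinetobacter.all row the integer mean agrees with the table: 26943793 // 19 = 1418094.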
| Align all unitigs to Acinetobacter.all; the longest alignment is 8517bp | |||
|  show-coords Acinetobacter.all-utg.filter-q.delta | sort -nk8 -r | head | |||
|     [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [GenBank]                          [UMD2.4 utg] | |||
|  =============================================================================================================================== | |||
|    20644    29164  |       62     8578  |     8521     8517  |    98.80  |    94413     8578  |     9.03    99.29  | gi|169786889|ref|NC_010404.1|      utg7180000281954*       [CONTAINS] | |||
|  3395586  3401299  |     5712        1  |     5714     5712  |    99.79  |  3976747     8015  |     0.14    71.27  | gi|126640115|ref|NC_009085.1|      utg7180000212251 | |||
|  3400344  3404485  |        1     4142  |     4142     4142  |    99.66  |  3976747     9114  |     0.10    45.45  | gi|126640115|ref|NC_009085.1|      utg7180000277331 | |||
|  ... | |||
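The `sort -nk8 -r` above orders the show-coords rows by [LEN 2], the alignment length on the unitig. A minimal sketch of the same step on parsed rows, assuming the pipe-separated show-coords layout shown above; the function and field names are hypothetical. The GenBank IDs contain `|` themselves, so only the first six separators are split:

```python
def parse_coords_line(line):
    """Parse one show-coords row (layout as in the table above) into a dict."""
    parts = line.split("|", 6)  # the last part keeps the gi|...| identifier intact
    s1, e1 = map(int, parts[0].split())
    s2, e2 = map(int, parts[1].split())
    len1, len2 = map(int, parts[2].split())
    len_r, len_q = map(int, parts[4].split())
    cov_r, cov_q = map(float, parts[5].split())
    tags = parts[6].split()
    return {
        "s1": s1, "e1": e1, "s2": s2, "e2": e2,
        "len1": len1, "len2": len2, "idy": float(parts[3]),
        "len_r": len_r, "len_q": len_q, "cov_r": cov_r, "cov_q": cov_q,
        "ref": tags[0], "qry": tags[1], "flags": tags[2:],
    }


# Rows copied from the table above.
ROWS = [
    " 3400344  3404485  |        1     4142  |     4142     4142  |    99.66  |  3976747     9114  |     0.10    45.45  | gi|126640115|ref|NC_009085.1|      utg7180000277331",
    "   20644    29164  |       62     8578  |     8521     8517  |    98.80  |    94413     8578  |     9.03    99.29  | gi|169786889|ref|NC_010404.1|      utg7180000281954*       [CONTAINS]",
    " 3395586  3401299  |     5712        1  |     5714     5712  |    99.79  |  3976747     8015  |     0.14    71.27  | gi|126640115|ref|NC_009085.1|      utg7180000212251",
]

hits = sorted((parse_coords_line(r) for r in ROWS), key=lambda h: h["len2"], reverse=True)
for h in hits:
    # [COV Q] can be recomputed as 100 * LEN 2 / LEN Q.
    print(h["qry"], h["len2"], round(100 * h["len2"] / h["len_q"], 2))
```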
| utg7180000281954*  -> ctg7180002053982  (28140bp; 78 unitigs) | |||
| grep 7180002053982  ../9-terminator/bt.posmap.utgctg | nl | |||
|      1  7180000185222   7180002053982   0       3019    f | |||
|      2  7180000314302   7180002053982   2151    5706    r | |||
|      3  7180001463328   7180002053982   2256    2869    f | |||
|      ... | |||
|     75  7180000281954*  7180002053982   17862   26442   r | |||
|     76  7180001471348   7180002053982   17886   18723   r | |||
|     77  7180001468075   7180002053982   17919   18732   f | |||
|     78  7180000280508   7180002053982   25672   28140   r | |||
|   show-coords UMD2.contaminant.other-ctg.filter-q.delta  | grep 7180002053982 | |||
|     [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [UMD2.0 contam]   [UMD2.4 ctg] | |||
|  =============================================================================================================================== | |||
|     1394     2508  |    28140    27024  |     1115     1117  |    98.75  |     7098    28140  |    15.71     3.97  | 7180003313366      ctg7180002053982 | |||
|     2561     2871  |    26971    26661  |      311      311  |    97.11  |     7098    28140  |     4.38     1.11  | 7180003313366      ctg7180002053982 | |||
|     2934     5670  |    26599    23862  |     2737     2738  |    97.99  |     7098    28140  |    38.56     9.73  | 7180003313366      ctg7180002053982 | |||
|   ...gap...   | |||
|     5930     7098  |    17270    16101  |     1169     1170  |    98.46  |     7098    28140  |    16.47     4.16  | 7180003313366      ctg7180002053982 | |||
|   ...gap...   | |||
|      281     1981  |    10335     8635  |     1701     1701  |    99.41  |    13090    28140  |    12.99     6.04  | 7180003320028      ctg7180002053982 | |||
|     1992     2672  |     8635     7954  |      681      682  |    99.71  |    13090    28140  |     5.20     2.42  | 7180003320028      ctg7180002053982 | |||
|     3302     5376  |     6719     4643  |     2075     2077  |    99.33  |    13090    28140  |    15.85     7.38  | 7180003320028      ctg7180002053982 | |||
|     8469     9021  |     4642     4090  |      553      553  |    99.10  |    13090    28140  |     4.22     1.97  | 7180003320028      ctg7180002053982 | |||
|     9038     9313  |     4073     3798  |      276      276  |    98.55  |    13090    28140  |     2.11     0.98  | 7180003320028      ctg7180002053982 | |||
|     9780    13090  |     3331       19  |     3311     3313  |    99.19  |    13090    28140  |    25.29    11.77  | 7180003320028      ctg7180002053982 | |||
|   grep 7180002053982 ../9-terminator/bt.posmap.utgctg | awk '{print $1,$4-$3+1}' | sed 's/^/utg/' >! ctg7180002053982.utgs | |||
|   intersect.pl UMD2.contaminant.other-utg.qry_hits ctg7180002053982.utgs | wc -l | |||
|   37 # only 37 out of 78 unitigs were detected | |||
|   [http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Get&VIEW_RESULTS=FromRes&RID=VNC3Z5NU01R&UNIQ_OBJ_NAME=A_SearchResults_1LiCpb_3X0f_1Z58HQ4MV_23torn_22Zi4F&QUERY_INDEX=0 ctg7180002053982 is Acinetobacter] | |||
| = UMD2.5 (2009_0312_CA; delete 40699 contam reads & 22607 mates) = | |||
| 40699 reads: | |||
| * 25803 mated + 14896 unmated | |||
| * 6392 mated reads had the mate also contaminated | |||
| Location: | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA | |||
| == UNITIGGER == | |||
|  UNITIG OVERLAP GRAPH INFORMATION | |||
|         5322910 : Total number of unitigs | |||
|         2595715 : Total number of singleton, contained unitigs | |||
|         1869655 : Total number of singleton, non-contained unitigs | |||
|          182193 : Total number of non-singleton, spanned unitigs | |||
|          675347 : Total number of non-singleton, non-spanned unitigs | |||
|        35507162 : Total number of fragments | |||
|        35507162 : Total number of fragments in all unitigs | |||
|        21641007 : Total number of essential fragments in all unitigs | |||
|        13866155 : Total number of contained fragments in all unitigs | |||
|    0.0077909501 : Randomly sampled fragment arrival rate per bp | |||
|      2511009753 : The sum of overhangs in all the unitigs | |||
|      6442095933 : Total number of bases in all unitigs | |||
|               0 : Estimated number of base pairs in the genome. | |||
|               0 : Total number of contained fragments not connected | |||
|                   by containment edges to essential fragments. | |||
|  Total rho    = 2511009753 | |||
|  Total nfrags = 19563152 | |||
|  Estimated genome length = 0 | |||
|  Estimated global_fragment_arrival_rate=0.007791 | |||
|  Computed global_fragment_arrival_rate =0.007791 | |||
|  Total number of randomly sampled fragments in genome = 23866135 | |||
|  Computed genome length  = 3063315200.000000 | |||
|  Used global_fragment_arrival_rate=0.007791 | |||
|  Used global_fragment_arrival_distance=128.354050 | |||
|  Histogram of the number of base pairs in a chunk | |||
|  100292 - 138301:     22    # 19 in UMD2.4  | |||
|   90020 -  99906:     28    # 23 | |||
|   80043 -  89676:     90    # 79 | |||
|   70013 -  79966:    190    # 164 | |||
|   60010 -  69988:    423 | |||
|   50008 -  59983:   1016 | |||
|   40000 -  49998:   2558 | |||
|   30000 -  39997:   6660 | |||
|   20000 -  29999:  18927 | |||
|   10000 -  19999:  57057 | |||
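The numbers in the unitigger log above appear to be related in a simple way: the arrival rate is Total nfrags / Total rho, the arrival distance is its reciprocal, and the computed genome length is the number of randomly sampled fragments divided by the rate. This is a reading of the reported figures, not of the Celera Assembler source; a minimal sketch with hypothetical function names:

```python
def arrival_rate(nfrags, rho):
    """Fragment arrival rate per bp, taken here as nfrags / rho."""
    return nfrags / rho


def genome_length(sampled_fragments, rate):
    """Genome length estimate: sampled fragments divided by the arrival rate."""
    return sampled_fragments / rate


# Values copied from the log above.
rate = arrival_rate(19563152, 2511009753)
distance = 1.0 / rate
length = genome_length(23866135, rate)

print(f"rate={rate:.6f} distance={distance:.3f} genome={length:.4e}")
```

The results agree with the logged 0.007791, 128.354050 and 3063315200 to within rounding; the logged genome length looks like a single-precision value, so the last few digits differ.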
| == CONSENSUS after CGW == | |||
| <font color=red> | |||
| * failed on job 80 : ctg 5706539, len=180,024, 159 unitigs, 1,851 reads | |||
| </font> | |||
|   head 80/bt.cns_contigs.80.failed | |||
|   {ICM | |||
|   acc:5706539 | |||
|   pla:P | |||
|   len:180024 | |||
|   cns: | |||
|   . | |||
|   qlt: | |||
|   . | |||
|   for:0 | |||
|   npc:1851 | |||
|   more  ../9-terminator/bt.asm | |||
|   ... | |||
|   {CCO | |||
|   acc:(7180002022380*,5706539) | |||
|   pla:P | |||
|   len:180024 | |||
|   cns: | |||
|   NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN | |||
|   NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN | |||
|   ... | |||
|   cat 80/bt.cns_contigs.80.failed | countMessages.pl   | |||
|   ICM     1 | |||
|   IUP     159   #unitigs | |||
|   IMP     1851  #reads : 1264 are BCM.WGS, 389 are BCM.SHOTGUN ... | |||
|   # 254 contig scaffold | |||
|   cat ../9-terminator/bt.posmap.ctgscf | grep 7180002041301 | nl | |||
|   1    7180002022042   7180002041301   0       14121   f | |||
|   ... | |||
|   183  7180002022380*  7180002041301   9164814 9344838 f | |||
|   .. | |||
|   254  7180002022240   7180002041301   12874874        12893460        f | |||
|   # the 5 UMD2.4 contigs below have the same number of reads as the ones that matched => CONTAINED | |||
|   cat UMD2.4-7180002022380.posmap.frgctg | grep -v ^$ | awk '{print $2}' | uniq -c | |||
|     #reads ctgid | |||
|       6 7180001712307 | |||
|     209 7180002028662 | |||
|       6 7180002028663 | |||
|    1552 7180002028664 | |||
|      21 7180002032323 | |||
|    1794 total => 1851-1794=57 additional reads | |||
|   cat ../../2009_0217_CA/9-terminator/bt.posmap.ctgscf | nl | egrep '7180001712307|7180002028662|7180002028663|7180002028664|7180002032323' | |||
|   #nl    ctgid           scfid           start   end     dir         | |||
|   34791  7180002028662   7180002069912   1425623 1446022 f | |||
|   34793  7180002032323   7180002069912   1448133 1450475 f | |||
|   34794  7180002028663   7180002069912   1450495 1451918 f | |||
|   34795  7180002028664   7180002069912   1452234 1602973 f | |||
|   64110  7180001712307   7180002071598   0       1441    f | |||
| * Solution 1: | |||
|   * consensus -Dforceunitigabut => new assembly, new UID's | |||
|   ctg7180002022636 179505 38.53     # => ctg7180002022380 | |||
|   scf7180002041557 12892941 40.58   # => scf7180002041301 | |||
| * Solution 2: | |||
|   * Reassemble 1851 reads ; clr=ECR2; doOBT=no | |||
|   * Asm dir:  | |||
|      /scratch1/bos_taurus/Assembly/2009_0312_CA/8-consensus/80.ECR2.asm | |||
|   * It contains one 179,530 bp scaffold that has two contigs.  | |||
|   * One contig is 156,349 bp and the other one is 23,181 bp.  | |||
|   * The estimated gap between them is 231 bp. | |||
|   show-coords ctg7180002022636-80.ECR2.filter-r.delta  | |||
|        1   156326  |   1   156331  | 156326 156331  | 99.99  | 179505  156349 |    87.09    99.99  | ctg7180002022636   ctg7180000000103        [CONTAINS] | |||
|   156345   179505  |  21    23181  |  23161  23161  | 99.99  | 179505  23181  |    12.90    99.91  | ctg7180002022636   ctg7180000000104        [CONTAINS] | |||
|   >ctg7180002022636_156327_156344 | |||
|   TTGTAAAAACCATCCCCT | |||
|   # ~ 20 bp unaligned on ctg7180002022636 & Chr1 | |||
|   show-coords  ctg7180002022636-Chr1.filter-r.delta  | more | |||
|   ... | |||
|   151839   156326  | 61124826  61120339* |     4488     4488  |    99.91  |   179505 157590899  |     2.50     0.00  | ctg7180002022636  Chr1 | |||
|   156347   157501  | 61111219* 61110064  |     1155     1156  |    99.91  |   179505 157590899  |     0.64     0.00  | ctg7180002022636  Chr1 | |||
|   ... | |||
|   # 2 UMD2.0 ctg & 2 UMD2.0 deg in this region | |||
|   more Chr1.agp | |||
|   ... | |||
|   Chr1    61110064        61111482        3579    W       deg0003139347   1       1419    + | |||
|   Chr1    61111483        61114130        3580    N       2648    fragment        yes | |||
|   Chr1    61114131        61115490        3581    W       deg0002967451   1       1360    + | |||
|   Chr1    61115491        61118114        3582    N       2624    fragment        yes | |||
|   Chr1    61118115        61120117        3583    W       7180002846553   1       2003    + | |||
|   Chr1    61120118        61120217        3584    U       100     fragment        yes | |||
|   Chr1    61120218        61145567        3585    W       7180003318962   1       25350   + | |||
|   ... | |||
| == QC == | |||
| <font color=red> | |||
| Lengths: | |||
|               elem       <2000      >=2000     min        max        mean       med        n50        sum | |||
|   scf         39978*     31311      8667       316        34167202   68129      1360       8217662    2723691675 | |||
|   ctg         90135*     36140      53995      65         1160130    29693      5124       95988      2676390147 | |||
|   deg         251413     249285     2128       65         39964      1003       984        994        252279234 | |||
|   utg         1689033    1419729    269304     21         138676     2242       936        8213       3788090224 | |||
|                elem       <0         0          >0         min        max        mean       med        n50        sum | |||
|   gaps(ca2scf) 50157      10759      3296       36102      -20        177144     929        20         34357      46620040 | |||
|   gaps(posmap) 50157      0          0          50157      20         177144     943        20         34065      47301528</font> | |||
| Fragment happiness: | |||
|   placed good       27263138 | |||
|   chaff  bothChaff  2467462 | |||
|   placed notMated   2050084 | |||
|   placed oneChaff   732517 | |||
|   chaff  oneChaff   732517 | |||
|   placed oneSurrogate     555510 | |||
|   placed bothDegen        465114 | |||
|   chaff  notMated         434052 | |||
|   placed diffScaffold     369,294 * | |||
|   placed oneDegen         213,768 * | |||
|   placed badSame          96862 | |||
|   placed badLong          76640 | |||
|   placed badOuttie        41142 | |||
|   placed badShort         5784 | |||
|   placed bothSurrogate    3278 | |||
| Mate happiness: | |||
|   good            13631569 | |||
|   bothChaff       1233731 | |||
|   oneChaff        732517 | |||
|   oneSurrogate    277755 | |||
|   bothDegen       232557 | |||
|   diffScaffold    184647 | |||
|   oneDegen        106884 | |||
|   badSame         48431 | |||
|   badLong         38320 | |||
|   badOuttie       20571 | |||
|   badShort        2892 | |||
|   bothSurrogate   1639 | |||
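The two happiness tables can be cross-checked: each mate pair contributes two fragments to its category (for oneChaff, one placed and one chaff), and all fragment rows together should add up to the 35,507,162 fragments reported by the unitigger. A minimal sketch of that check using the counts above; the function names are hypothetical:

```python
# (status, category) -> fragment count, copied from "Fragment happiness".
FRAGMENTS = {
    ("placed", "good"): 27263138,
    ("chaff", "bothChaff"): 2467462,
    ("placed", "notMated"): 2050084,
    ("placed", "oneChaff"): 732517,
    ("chaff", "oneChaff"): 732517,
    ("placed", "oneSurrogate"): 555510,
    ("placed", "bothDegen"): 465114,
    ("chaff", "notMated"): 434052,
    ("placed", "diffScaffold"): 369294,
    ("placed", "oneDegen"): 213768,
    ("placed", "badSame"): 96862,
    ("placed", "badLong"): 76640,
    ("placed", "badOuttie"): 41142,
    ("placed", "badShort"): 5784,
    ("placed", "bothSurrogate"): 3278,
}

# category -> mate pair count, copied from "Mate happiness".
MATES = {
    "good": 13631569, "bothChaff": 1233731, "oneChaff": 732517,
    "oneSurrogate": 277755, "bothDegen": 232557, "diffScaffold": 184647,
    "oneDegen": 106884, "badSame": 48431, "badLong": 38320,
    "badOuttie": 20571, "badShort": 2892, "bothSurrogate": 1639,
}


def fragments_per_category(fragments):
    """Sum fragment counts over placed/chaff status for each category."""
    totals = {}
    for (_status, category), count in fragments.items():
        totals[category] = totals.get(category, 0) + count
    return totals


def inconsistent_categories(fragments, mates):
    """Categories whose fragment total is not twice the mate pair count."""
    totals = fragments_per_category(fragments)
    return sorted(c for c, pairs in mates.items() if totals.get(c) != 2 * pairs)


print(inconsistent_categories(FRAGMENTS, MATES))  # []
print(sum(FRAGMENTS.values()))                    # 35507162
```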
| Scaffold zero read/mate cvg regions: | |||
|                         elem       <2000      >2000      min        max        mean       med        n50        sum | |||
|   read                  57011      55048      1963       1          177144     913        57         29302      52084484 | |||
|   mate                  10507      8945       1562       1          30014      996        493        2367       10466518 | |||
| Scaffold 10K+ zero read/mate cvg regions (2K+ inside) (some might be a result of surrogates?):  | |||
|                         elem       <2000      >2000      min        max        mean       med        n50        sum | |||
|   read                  51747      49878      1869       1          177144     958        38         32625      49613432 | |||
|   mate                  1290       201        1089       15         30014      3560       3047       4017       4593586 | |||
|   mate(ignore seq len)  2599       1051       1548       1          72706      3321       2541       5513       8633691 | |||
| == Contaminant search  == | |||
| === ctg === | |||
|                 elem       <0         >0         min        max        mean       med        n50        sum | |||
|   ctg           90135      0          90135      65         1160130    29693      5124       95988      2676390147 | |||
|   nucmer -maxmatch -l 40 -c 100 -b 10 -g 5 -d 0.05 ... | |||
|   #OLD table | |||
|                    elem       <2000      >2000      min        max        mean       med        n50        sum | |||
|   Ecoli.all        71         66         5          1006       129770     4180       1127       45234      296830 | |||
|   UniVec_Core      120        111        9          1000       426540     7366       1130       100944     884023 | |||
|   OtherVec         121        112        9          1000       426540     7314       1127       100944     885022 | |||
|   UMD2.cont.other  98(-3)     83         15(-3)     1000       426540     13523      1190       199700     1325332  # 3 are 1000bp+ from ctg ends; these are actually "fake" contaminants | |||
|                                                                                                                     # 7 are Acinetobacter baumannii min=1059 , max=9765  | |||
|   total            152*(-3)   133        19*(-3)    1000       426540     10649      1150       199700     1618735 | |||
|   Acinetobacter    65         53         12         1013       44359      2586       1212       3847       168130   # 46 out of 65 are in the 152* set; 19 are new; 13 have lots of alignments to other contigs (probably fake contaminants) | |||
|   total(new)       171*(-3)   144        27*(-3)    1000       426540     10013      1189       129770     1712376  # 65 are Acinetobacter and should be removed | |||
|   cat UMD2.contaminant.other-ctg.filter-q.coords | grep Acinetobacter | |||
|                                                                                                                      UMD2.0                     UMD2.5 | |||
|        1      285  |      285        1  |      285      285  |    99.65  |      287     8096  |    99.30     3.52  | 7180003292866_1_288        ctg7180002015457        [CONTAINED]     Acinetobacter baumannii | |||
|     1422     2500  |     1078        1  |     1079     1078  |    99.63  |     7098     1078  |    15.20   100.00  | 7180003313366              ctg7180001706852        [CONTAINS]      Acinetobacter baumannii | |||
|     2934     3940  |     1008        1  |     1007     1008  |    98.61  |     7098     1059  |    14.19    95.18  | 7180003313366              ctg7180001709709                        Acinetobacter baumannii | |||
|     6281     7098  |        1      818  |      818      818  |    99.76  |     7098     1553  |    11.52    52.67  | 7180003313366              ctg7180001716052        [END]           Acinetobacter baumannii | |||
|        1      790  |      790        1  |      790      790  |   100.00  |     1822     9765  |    43.36     8.09  | 7180003319195_8956_10778   ctg7180002015485        [BEGIN]         Acinetobacter calcoaceticus | |||
|      285     1981  |        1     1697  |     1697     1697  |    99.59  |    13090     1856  |    12.96    91.43  | 7180003320028              ctg7180001706656*                       Acinetobacter baumannii | |||
|     1992     2148  |     1697     1856  |      157      160  |    98.12  |    13090     1856  |     1.20     8.62  | 7180003320028              ctg7180001706656*                       Acinetobacter baumannii | |||
|    12210    13090  |       63      943  |      881      881  |    99.89  |    13090     2556  |     6.73    34.47  | 7180003320028              ctg7180002007423                        Acinetobacter baumannii | |||
|   # 7 Acinetobacter baumannii ctgs | |||
|   # no Serratia "best hits" | |||
|   # 3 mitochondrion ctgs, all < 2Kbp | |||
| <font color=red> | |||
| Delete summary: 65 Acinetobacter ctgs + 91 contaminant ctgs <2000bp => 156 ctgs , 152 scf => 4105 reads | |||
|   ctgs       <2000      >2000      min        max        mean       med        n50        sum       reads | |||
|   156        144        12         1000       44359      1782       1150       1483       278009    4105        # all ctg | |||
| Trim summary: 12 contigs >=2000bp & 44 reads that overlap at least 10bp | |||
|   ctgs       <2000      >2000      min        max        mean       med        n50        sum        reads | |||
|   12         0          12         7072       426540     78470      45234      129770     941646     ?           # all ctg | |||
|   12         12         0          172        935        532        618        750        6393       44          # ctg regions | |||
| Files | |||
|  /scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_ctg/TO_DELETE/ctg.delete.uid | |||
|  /scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_ctg/TO_DELETE/scf.delete.uid | |||
|  /scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_ctg/TO_TRIM/ctg.trim.uid | |||
| </font> | |||
| === ctg 5'&3' === | |||
|                 elem       <0         >0         min        max        mean       med        n50        sum | |||
|   ctg53         180044     0          180044     65         598        300        300        300        54033229 | |||
|   nucmer -maxmatch -l 17 -c 35 ... | |||
|                    #ctgEnds   #ctgs      min        max        mean       med        n50        sum          | |||
|   Ecoli.all        180        149        300        300        300        300        300        54000  | |||
|   UniVec_Core      312        277        300        300        300        300        300        93600 | |||
|   OtherVec         1211       1167       300        553        300        300        300        363989 | |||
|   UMD2.cont.other  15689      14693      257        598        300        300        300        4712162 | |||
| === deg === | |||
|   nucmer -maxmatch -l 40 -c 100 -b 10 -g 5 -d 0.05 ... | |||
|                    elem       <2000      >2000      min        max        mean       med        n50        sum | |||
|   Ecoli.all        387        387        0          131        1099       756        806        835        292892 | |||
|   UniVec_Core      569        569        0          101        1115       763        822        843        434400 | |||
|   OtherVec         579        579        0          101        1115       752        819        840        435549 | |||
|   UMD2.cont.other  539        539        0          131        1483       792        838        873        427408 | |||
|   total            810*       810        0*         101        1483       784        838        869        63547 | |||
| == Scaffolds vs UMD2.0 chromosome alignments == | |||
| Directory: | |||
|    /scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_scf | |||
| Depending on the ref/qry sequences and nucmer parameters, the total length of the unaligned gaps in UMD2.0 can vary between: | |||
|   101M: REF=Chr,       QRY=scf,     nucmer -l 100 -c 500  | |||
|   6M:   REF=ChrPlaced, QRY=scf-deg, nucmer -maxmatch -l 50 -c 250  | |||
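The 0cvg rows in the summaries below are the reference regions not covered by any alignment. A minimal sketch of how such regions can be derived from alignment coordinates, assuming 1-based inclusive positions as in show-coords output; the function name is hypothetical and the extra ".clean" filtering step is not reproduced here:

```python
def zero_coverage(seq_len, alignments):
    """Return the 1-based inclusive intervals of a sequence that no alignment covers.

    alignments is a list of (start, end) pairs; reversed pairs (start > end)
    are accepted and normalised.
    """
    gaps = []
    next_uncovered = 1
    for start, end in sorted((min(s, e), max(s, e)) for s, e in alignments):
        if start > next_uncovered:
            gaps.append((next_uncovered, start - 1))
        next_uncovered = max(next_uncovered, end + 1)
    if next_uncovered <= seq_len:
        gaps.append((next_uncovered, seq_len))
    return gaps


# Toy example: a 100 bp reference with three alignments, one given in reverse.
print(zero_coverage(100, [(10, 20), (15, 30), (50, 40)]))
# [(1, 9), (31, 39), (51, 100)]
```

Summing `end - start + 1` over the returned intervals gives the kind of total reported in the sum column.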
| === nucmer -l 100 -c 500 === | |||
| Chr-scf.summary | |||
|                                            elem       <2000      >2000      min        max        mean       med        n50        sum | |||
|   Chr-scf.qry_hits                         32901      24546      8355       723        34167202   82494      1405       8217662    2714164100 | |||
|   Chr-scf.qry_nohits                       7077       6765       312        316        12006      1346       1239       1291       9527056 | |||
|   Chr-scf.10K.qry_hits2+                   574        0          574        10308      34167202   4006753    1887309    9586144    2299876795 | |||
|   Chr-scf.0cvg                             144712     125933     18779      1          102265     900        178        2968       130248709 | |||
|   Chr-scf.0cvg.clean                       148556     143283     5273       1          39625      683        280        1363       101526883(101M) | |||
| Chr-scf-deg.summary | |||
|                                            elem       <2000      >2000      min        max        mean       med        n50        sum | |||
|   Chr-scf-deg.qry_hits                     210225     199781     10444      501        34167202   13785      1007       7328685    2898141592 | |||
|   Chr-scf-deg.qry_nohits                   81166      80815      351        65         12006      958        972        989        77828798 | |||
|   Chr-scf-deg.10K.qry_hits2+               574        0          574        10308      34167202   4006753    1887309    9586144    2299876795 | |||
|   Chr-scf-deg.0cvg                         175952     168553     7399       1          22067      445        120        1329       78433265 | |||
|   Chr-scf-deg.0cvg.clean                   133809     132381     1428       1          20512      371        124        1101       49711440(49M) | |||
| ChrPlaced-scf.summary | |||
|                                            elem       <2000      >2000      min        max        mean       med        n50        sum | |||
|   ChrPlaced-scf.qry_hits                   19488      13112      6376       723        34167202   137773     1527       8428844    2684927057 | |||
|   ChrPlaced-scf.qry_nohits                 20490      18199      2291       316        192648     1891       1276       1569       38764099 | |||
|   ChrPlaced-scf.10K.qry_hits2+             139        0          139        10486      31959312   6786671    4979278    12956086   943347316 | |||
|   ChrPlaced-scf.0cvg                       76271      71816      4455       1          102265     568        179        1710       43339413 | |||
|   ChrPlaced-scf.0cvg.clean                 72865      70541      2324       1          39625      356        94         1413       25951987(25M) | |||
| ChrPlaced-scf-deg.summary | |||
|                                            elem       <2000      >2000      min        max        mean       med        n50        sum | |||
|   ChrPlaced-scf-deg.qry_hits               130125     122000     8125       501        34167202   21530      1009       7515049    2801670853 | |||
|   ChrPlaced-scf-deg.qry_nohits             161266     158596     2670       65         192648     1080       987        1012       174299537 | |||
|   ChrPlaced-scf-deg.10K.qry_hits2+         139        0          139        10486      31959312   6786671    4979278    12956086   943347316 | |||
|   ChrPlaced-scf-deg.0cvg                   79041      76374      2667       1          22067      395        157        948        31251753 | |||
|   ChrPlaced-scf-deg.0cvg.clean             69012      68271      741        1          20512      200        81         592        13864328(13M) | |||
| === nucmer -maxmatch -l 100 -c 500 === | |||
| Dir: | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_scf.2 | |||
| ChrPlaced-scf-deg.summary | |||
|                                            elem       <2000      >2000      min        max        mean       med        n50        sum | |||
|   ChrPlaced-scf-deg.qry_hits               130510     122377     8133       501        34167202   21470      1009       7515049    2802100771 | |||
|   ChrPlaced-scf-deg.qry_nohits             160881     158219     2662       65         192648     1080       986        1012       173869619 | |||
|   ChrPlaced-scf-deg.10K.qry_hits2+         120        0          120        20022      31959312   7587296    5639522    13010806   910475551 | |||
|   ChrPlaced-scf-deg.0cvg                   82159      80425      1734       1          7002       321        145        647        26444796 | |||
|   ChrPlaced-scf-deg.0cvg.clean             111645     111546     99         1          6248       81         13         272        9057424(9M) | |||
| === nucmer -maxmatch -l 50 -c 250 === | |||
| Dir:  | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_scf.3 | |||
| ChrPlaced-scf-deg.summary | |||
|                                            elem       <2000      >2000      min        max        mean       med        n50        sum | |||
|   ChrPlaced-scf-deg.qry_hits               204673     195653     9020       251        34167202   14088      1005       7329288    2883625712 | |||
|   ChrPlaced-scf-deg.qry_nohits             86718      84943      1775       65         192648     1064       970        1007       92344678 | |||
|   ChrPlaced-scf-deg.10K.qry_hits2+         148        0          148        10486      31959312   6814292    5135095    12792673   1008515300 | |||
|   ChrPlaced-scf-deg.0cvg                   86085      84614      1471       1          4565       279        123        557        24101912 | |||
|   ChrPlaced-scf-deg.0cvg.clean             113796     113791     5          1          2822       59         7          176        6714556(6M) | |||
| === nucmer -maxmatch -l 50 -c 250 ; delta-filter -q === | |||
| Dir:  | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_scf.3 | |||
|   ChrPlaced-scf-deg.filter-q.summary | |||
|                                            elem       <2000      >2000      min        max        mean       med        n50        sum | |||
|   ChrPlaced-scf-deg.qry_hits               204673     195653     9020       251        34167202   14088      1005       7329288    2883625712 | |||
|   ChrPlaced-scf-deg.qry_nohits             86718      84943      1775       65         192648     1064       970        1007       92344678 | |||
|   ChrPlaced-scf-deg.10K.qry_hits2+         118        0          118        20022      31959312   7633834    5639522    13010806   900792422 | |||
|   ChrPlaced-scf-deg.0cvg                   77864      73686      4178       1          28711      529        181        1523       41240419 | |||
|   ChrPlaced-scf-deg.0cvg.clean             74172      72150      2022       1          28331*     321        89         1415       23852952(23M) | |||
| Max gap is 28331; Duplicate region in UMD2.0? | |||
|   ChrPlaced-scf-deg.coords | |||
|     70739616 70767946  |  2993389  2965048  |    28331    28342  |    99.17  | 85187327 19514159  |     0.03     0.15  | Chr15      scf7180002041107 | |||
|     70768054 70808299  |  3005305  2965048  |    40246    40258  |    99.56  | 85187327 19514159  |     0.05     0.21  | Chr15      scf7180002041107 | |||
|   => | |||
|   ChrPlaced-scf-deg.filter-q.coords | |||
|     70768054 70808299  |  3005305  2965048  |    40246    40258  |    99.56  | 85187327 19514159  |     0.05     0.21  | Chr15      scf7180002041107 | |||
| == Markers  == | |||
| ALL: | |||
|   head /fs/szasmg3/bos_taurus/UMD_Freeze2.5/markers/markers_contigs_Art.txt | |||
|   '''Marker        Chr_BTA Pos(K)  Pos_from Pos_to UMD_Ctg_Pos     Match_Len       %IDY    %Match  UMD_Ctg_name''' | |||
|   BZ945871      1       47501   1       95001   7622            515             100.00  99.61   ctg7180002007845 | |||
|   BZ953651      1       80001   47501   112501  10786           700             99.57   100.00  ctg7180002026484 | |||
|   CC504788      1       118751  80001   157501  54583           862             100.00  100.00  ctg7180002026483 | |||
|   CC484491      1       123751  90001   157501  50169           77              98.72   100.00  ctg7180002026482 | |||
|   CZ415082      1       125001  92501   157501  75850           507             99.21   99.80   ctg7180002026483 | |||
|   CC475154      1       130001  97501   162501  40013           666             99.25   100.00  ctg7180002026482 | |||
|   CC561114      1       182501  145001  220001  1130            709             99.02   100.00  ctg7180002026482 | |||
|   CC578374      1       190001  155001  225001  170145          647             100.00  100.00  ctg7180002026481 | |||
|   BZ911787      1       278751  232501  325001  na              na              na      na      na | |||
|   ... | |||
| * 126,014 markers & 90,135 ctgs total | |||
| * 107,271 markers align to 31,407 ctg & 2640 scf: | |||
| ** 85% of the markers align to 85% of the ctg sequence | |||
| ** avg distance between markers is 25Kbp | |||
| * 188 questionable ctgs & 219 questionable scf (2 out of 4 disagreeing markers) | |||
| UNIQ: | |||
|   head /fs/szasmg3/bos_taurus/UMD_Freeze2.5/markers/markers_contigs_Art.unique_only.txt | p 'print "  ",$_;' | |||
|   '''Marker Chr_BTA Pos(Kbp) CI_Pos_from CI_Pos_to UMD_Scaff_Pos Match_Len %IDY %Matched UMD_Scaff_name''' | |||
|   BZ945871 1 52251 1 104501 na na na na na | |||
|   BZ953651 1 88001 52251 123751 na na na na na | |||
|   CC504788 1 130626 88001 173251 54583 862 100.00 100.00 ctg7180002026483 | |||
|   CC484491 1 136126 99001 173251 50169 77 98.72 100.00 ctg7180002026482 | |||
|   CZ415082 1 137501 101751 173251 na na na na na | |||
|   CC475154 1 143001 107251 178751 40013 666 99.25 100.00 ctg7180002026482 | |||
|   CC561114 1 200751 159501 242001 1130 709 99.02 100.00 ctg7180002026482 | |||
|   CC578374 1 209001 170501 247501 170145 647 100.00 100.00 ctg7180002026481 | |||
|   BZ911787 1 306626 255751 357501 na na na na na | |||
|   ... | |||
| * 93,508 markers align to 28,752 ctgs & 1,476 scf | |||
| * 109 questionable ctgs & 153 questionable scf (2 out of 4 disagreeing markers) | |||
| ---- | |||
| Scripts: | |||
|   ~/bin/marker2pos.pl markers_contigs_Art.unique_only.txt | sed 's/ctg//' |  sort -nk1 -nk2  > markers_ctg.pos | |||
|   ~/bin/translatePosMap.pl markers_ctg.pos bt.posmap.ctgscf | ~/bin/tab2tab.pl > markers_scf.pos | |||
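The contig-to-scaffold translation step can be sketched as follows. This is a minimal illustration, not the code of `translatePosMap.pl` (which is not shown here): it assumes the posmap columns are `ctgid scfid begin end orientation`, as in the `posmap.ctgscf` excerpts on this page, and that a marker on a forward contig is shifted by the contig's scaffold begin. The forward case reproduces the coordinates listed for scf7180002041163; the reverse-orientation branch is an untested assumption, and the function names are hypothetical.

```python
from typing import Dict, Iterable, Tuple

# contig id -> (scaffold id, begin on scaffold, end on scaffold, orientation)
PosMap = Dict[str, Tuple[str, int, int, str]]


def read_posmap(lines: Iterable[str]) -> PosMap:
    """Parse whitespace-separated lines: ctgid scfid begin end f|r."""
    posmap: PosMap = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        ctg, scf, begin, end, ori = fields
        posmap[ctg] = (scf, int(begin), int(end), ori)
    return posmap


def ctg_to_scf(posmap: PosMap, ctg: str, begin: int, end: int) -> Tuple[str, int, int]:
    """Translate a marker interval from contig to scaffold coordinates."""
    scf, scf_begin, scf_end, ori = posmap[ctg]
    if ori == "f":
        return scf, scf_begin + begin, scf_begin + end
    # Assumption: a reversed contig is mirrored around its scaffold end.
    return scf, scf_end - end, scf_end - begin


POSMAP = read_posmap([
    "7180001926720   7180002041163   750392  787668  f",
    "7180001926721   7180002041163   787911  815738  f",
])

# Marker CC562874 sits at 20391..21116 on ctg7180001926720.
print(ctg_to_scf(POSMAP, "7180001926720", 20391, 21116))
# -> ('7180002041163', 770783, 771508)
```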
| Ctg vs markers summary: | |||
|                                          #ctg       <10000     >10000     min        max        mean       med        n50        sum         file | |||
|   ctg (all)                              90135*     51024      39111      65         1160130    29693      5124       95988      2676390147 | |||
|   no markers                             58728      48324      10404      65         322949*    6573       1597       21989      386064754 | |||
|   markers from 1+ Chr                    31407      2700       28707      442        1160130    72924      52693      111252     2290325393  markers_ctg.Chr.count | |||
|   markers from 2+ Chr                    2987       25         2962       1002       1160130    132480     104807     179692     395718221   markers_ctg.Chr.count2+ | |||
|   2+ markers from 2+ Chr                 26         0          26*        15228      604155     221354     192182     298848     5755227     markers_ctg.Chr.count2.2+ | |||
|   2+ adjacent markers from 2+ Chr        15         0          15**       15228      368879     202728     194749     294623     3040932     markers_ctg.Chr.count2+a | |||
| Scf vs markers summary: | |||
|                                          #scf       <10000     >10000     min        max        mean       med        n50        sum | |||
|   scf(all)                               39978*     37135      2843       316        34167202   68129      1360       8217662    2723691675 | |||
|   no markers                             37338      36038      1300       316        754615*    2601       1336       3957       97140879 | |||
|   markers from 1+ Chr                    2640       1097       1543       1000       34167202   994905     16220      8661690    2626550796  markers_scf.Chr.count  | |||
|   markers from 2+ Chr                    552        10         542        1002       34167202   4526814    2714036    9167014    2498801557  markers_scf.Chr.count2+ | |||
|   2+ markers from 2+ Chr                 212        0          212*       15228      34167202   8579232    7358307    10521496   1818797327  markers_scf.Chr.count2.2+ | |||
|   2+ adjacent markers from 2+ Chr        38         0          38**       15228      25078118   8419544    7176534    13458592   319942681   markers_scf.Chr.count2+a | |||
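The row labels in the two summary tables can be read as a classification of each contig or scaffold by the chromosomes of its markers. The sketch below shows one possible reading, and the definitions are assumptions: "2+ markers from 2+ Chr" is taken to mean at least two chromosomes with two or more markers each, and "adjacent" is taken to mean a run of two or more consecutive markers from the same chromosome. The function names are hypothetical and the original counting scripts may differ.

```python
from collections import Counter
from itertools import groupby


def chr_summary(markers):
    """markers: list of (begin, chr) tuples for ONE scaffold or contig."""
    ordered = sorted(markers)
    counts = Counter(chrom for _, chrom in ordered)
    # Chromosomes represented by at least two markers anywhere on the sequence.
    multi = {c for c, n in counts.items() if n >= 2}
    # Chromosomes with at least one run of two or more consecutive markers.
    adjacent = set()
    for chrom, run in groupby(ordered, key=lambda m: m[1]):
        if len(list(run)) >= 2:
            adjacent.add(chrom)
    return {
        "n_chr": len(counts),
        "n_chr_2plus": len(multi),
        "n_chr_2plus_adjacent": len(adjacent),
    }


def category(summary):
    """Most specific table row that the sequence falls into."""
    if summary["n_chr_2plus_adjacent"] >= 2:
        return "2+ adjacent markers from 2+ Chr"
    if summary["n_chr_2plus"] >= 2:
        return "2+ markers from 2+ Chr"
    if summary["n_chr"] >= 2:
        return "markers from 2+ Chr"
    if summary["n_chr"] == 1:
        return "markers from 1+ Chr"
    return "no markers"


# Four of the markers listed for scf7180002041061 (two on Chr15, two on Chr29).
example = [(2061, "15"), (8979, "15"), (236304, "29"), (242590, "29")]
print(category(chr_summary(example)))
```

The categories are nested, so a sequence reported under the most specific row also counts toward the rows above it.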
| === 212* scaffolds === | |||
|   scf_id        scf_len   #Chr/2+markers  #/Chr/2+adjmarkers  reads | |||
|   7180002041381 31959312  15              0                   469503 | |||
|   7180002041358 25078118  13              2                   291163 | |||
|   7180002041386 21280754  12              0 | |||
|   ... | |||
| ==== scf7180002041381.1 ==== | |||
| * no low cvg regions in the middle | |||
| * 1281 markers: 1231 on Chr4, 12 on Chr11 | |||
|   #1- mate cvg regions: at the ends !!! | |||
|   #scfid          begin           end             scf_len         cvg_len cvg | |||
|   7180002041381   1               1173            31959312        1173    0 | |||
|   7180002041381   1174            1454            31959312        281     1 | |||
|   7180002041381   31959139        31959312        31959312        174     1 | |||
| ==== scf7180002041358.2 ==== | |||
| * one low cvg region & real break | |||
| * 1111 markers: 869 on Chr14, 193 on Chr26, ... | |||
|   #1- mate cvg regions: middle | |||
|   #scfid          begin           end             scf_len         cvg_len cvg | |||
|   7180002041358   20970531        20970827        25078118        297     1 | |||
|   7180002041358   20970828        20970949        25078118        122     0 | |||
|   7180002041358   20970950        20971112        25078118        163     1 | |||
|   # markers in the regions | |||
|   #scfid          begin           end             markerid        Chr | |||
|   7180002041358   20964100        20964829        BZ839784        14 | |||
|   7180002041358   21002368        21003219        CC527932        26 | |||
| ==== scf7180002041386.3 ==== | |||
| * one low cvg region but no markers in that region | |||
| * 939 markers: 902 on Chr24, 3 on Chr6 ... | |||
|   #1- mate cvg regions: at the ends | |||
|   #scfid          begin   end     scf_len         cvg_len cvg | |||
|   7180002041386   26382   27557   21280754        1176    1 | |||
|   7180002041386   27558   27607   21280754        50      0 | |||
|   7180002041386   27608   28031   21280754        424     1 | |||
| ... | |||
| ==== scf7180002041368.128 ==== | |||
| * one low cvg region | |||
|   #scfid          begin    end      scf_len         cvg_len  cvg | |||
|   7180002041368   1356953  1357221  1404851         269      1 | |||
| * markers: Chr7, then Chr5 at the 3' end | |||
|   #scfid          begin    end      markerid        Chr | |||
|   7180002041368   10436    11033    BZ840669        7 | |||
|   ... | |||
|   7180002041368   1278950  1279444  CZ404867        7 | |||
|   7180002041368   1319489  1320264  CC593534        7 | |||
|   7180002041368   1344001  1344347  BZ885865        7 * | |||
|   7180002041368   1371526  1372308  BZ867572        5 * | |||
|   7180002041368   1386933  1387750  CC534893        5 | |||
| * [[Media:7180002041368.0cvg.png|7180002041368.0cvg.png]] | |||
| * subassembly: extract all reads & extra mates(265) and reassemble | |||
| ** Msg Counts  | |||
|   DST     40 | |||
|   FRG     13762  | |||
|   LKG     6482 | |||
| ** qc | |||
|   [Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID] | |||
|   0=37,1211999,1358964,32757,4082,7180000000571 | |||
|   1=2,53844,53824,26922,-20,7180000000566 | |||
|   2=4,50178,50238,12544,20,7180000000570 | |||
|   3=1,28553,28553,28553,0,7180000000568 | |||
|   4=1,17195,17195,17195,0,7180000000569 | |||
|   total=45,1361769,1508774,30262,3675 | |||
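The `avgContig` and `avgGap` columns appear to follow from the other three. The sketch below uses formulas inferred from the numbers on this page, not taken from the qc tool itself: avgContig = size / contigs, and avgGap = (span - size) / (contigs - scaffolds), both rounded. The function name is hypothetical.

```python
def scaffold_averages(contigs, size, span, scaffolds=1):
    """Return (avgContig, avgGap) for one scaffold, or for a total over several.

    Inferred relations (an assumption, checked only against the qc lines above):
      avgContig = size / contigs
      avgGap    = (span - size) / number of gaps, with gaps = contigs - scaffolds
    """
    avg_contig = round(size / contigs)
    gaps = contigs - scaffolds
    avg_gap = round((span - size) / gaps) if gaps else 0
    return avg_contig, avg_gap


# Scaffold 0 of the subassembly: 37 contigs, size 1211999, span 1358964.
print(scaffold_averages(37, 1211999, 1358964))               # -> (32757, 4082)
# The "total" line covers 5 scaffolds and 45 contigs.
print(scaffold_averages(45, 1361769, 1508774, scaffolds=5))  # -> (30262, 3675)
```

A negative avgGap, as for scaffold 1, means the contigs are placed with a small overlap, so the span is shorter than the summed contig size.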
| show-coords scf7180002041368-7180002041368.update.scf.filter-q.delta | |||
|  ... | |||
|  1269662  1338725  |    69065        1  |    69064    69065  |    99.97  |  1404851  1359628  |     4.92     5.08  | scf7180002041368   scf7180000000571 | |||
|  1340619  1341912  |        1     1295  |     1294     1295  |    98.77  |  1404851     1295  |     0.09   100.00  | scf7180002041368   scf7180000000574        [CONTAINS] | |||
|  1345570  1347105  |     1537        1  |     1536     1537  |    98.18  |  1404851     1537  |     0.11   100.00  | scf7180002041368   scf7180000000572        [CONTAINS] | |||
|  1347683  1356952  |    12617     3352  |     9270     9266  |    99.85  |  1404851    12617  |     0.66    73.44  | scf7180002041368   scf7180000000567 | |||
|  1356901  1361861  |     1767     6728  |     4961     4962  |    99.96  |  1404851    50318  |     0.35     9.86  | scf7180002041368   scf7180000000570 | |||
|  ... | |||
| ==== scf7180002041061.187 ==== | |||
| * infoseq | |||
|   scf7180002041061 509987 39.73 | |||
| * 22 markers  | |||
|   #scfid        begin   end     markerid        Chr | |||
|   7180002041061 2061    2904    BZ847430        15 | |||
|   7180002041061 8979    9756    CC481592        15 | |||
|   7180002041061 10856   11485   BZ839377        15 | |||
|   7180002041061 47578   48186   BZ836581        15 | |||
|   7180002041061 80485   81237   CC553472        15 | |||
|   7180002041061 117811  118488  CC918533        15 | |||
|   7180002041061 151253  152009  CC477055        15 | |||
|   7180002041061 213959  214640  CC572066        15 * | |||
|   7180002041061 236304  236479  CC550436        29 * | |||
|   7180002041061 242590  243462  BZ848041        29 | |||
|   7180002041061 267880  268614  BZ877402        29 | |||
|   7180002041061 268891  269493  CC580499        29 | |||
|   7180002041061 282248  282861  BZ885584        29 | |||
|   7180002041061 295221  295987  CC572941        29 | |||
|   7180002041061 337785  338517  CC923898        29 | |||
|   7180002041061 338464  339092  BZ921430        29 | |||
|   7180002041061 341219  341971  CZ415932        29 | |||
|   7180002041061 382509  383281  CC581748        29 | |||
|   7180002041061 387548  388176  CC572064        29 | |||
|   7180002041061 415263  415788  BZ878185        29 | |||
|   7180002041061 470138  470773  CC558303        29 | |||
|   7180002041061 493006  493794  CC565740        29 | |||
| * mate happiness: | |||
|   cat 7180002041061.posmap.mates | cut -f3 | count.pl | |||
|   good            2501 | |||
|   diffScaffold    86 | |||
|   oneSurrogate    69 | |||
|   oneChaff        40 | |||
|   oneDegen        11 | |||
|   badSame         9 | |||
|   badOuttie       4 | |||
|   badLong         4 | |||
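For readers without the local `count.pl` helper, the same tally can be sketched in Python. This assumes only what the `cut -f3` call implies, namely that the mate status is the third tab-separated column of the posmap file; `count_column` is a hypothetical name. The second half recomputes the share of good mates from the counts reported above.

```python
from collections import Counter


def count_column(lines, column=2):
    """Tally the values of one tab-separated column (0-based index)."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > column:
            counts[fields[column]] += 1
    return counts


# Counts as reported for 7180002041061.posmap.mates.
REPORTED = Counter({
    "good": 2501, "diffScaffold": 86, "oneSurrogate": 69, "oneChaff": 40,
    "oneDegen": 11, "badSame": 9, "badOuttie": 4, "badLong": 4,
})

total = sum(REPORTED.values())
good_fraction = REPORTED["good"] / total
print(total, f"{good_fraction:.1%}")
```

Usage on the real file would look like `count_column(open("7180002041061.posmap.mates"))`.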
| * low cvg region: | |||
|   #scfid          begin   end     scf_len   cvg_len  cvg | |||
|   7180002041061   232849  232871  509987    23       1 | |||
|   #it is in the middle of a unitig | |||
|   intersectPos.pl -i 1 7180002041061.posmap.utgscf 7180002041061.posmap.frgscf.10K.2K.mate_cvg.1- | |||
|   7180000849321   7180002041061   229782  238910  r | |||
| [[Media:7180002041061.0cvg.png|7180002041061.0cvg.png]] | |||
| * subassembly: extract all reads & mates(139) and reassemble | |||
| ** Msg Counts | |||
|   DST     38 | |||
|   FRG     5769 | |||
|   LKG     2724 | |||
| Dirs: | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scf7180002041061.187.mates/asm.ECR2 | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scf7180002041061.187.mates/asm.ECR2.mates | |||
| Qc stats: | |||
|               asm.ECR2                                asm.ECR2.bog | |||
|    ... | |||
|   [Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID] | |||
|   0          3,277300,277925,92433,312,7180000000278  5,279938,399837,55988,29975,7180000000155 | |||
|   1          6,234445,280546,39074,9220,7180000000277 6,234463,280485,39077,9204,7180000000156 | |||
|   2          1,1630,1630,1630,0,7180000000276         1,1799,1799,1799,0,7180000000154 | |||
|   3          NA                                       1,1674,1674,1674,0,7180000000153 | |||
|   4          NA                                       1,1183,1183,1183,0,7180000000158 | |||
|   total      10,513375,560101,51338,6675              14,519057,684978,37076,18436 | |||
| * Some of the 139 mates assemble into scaffolds | |||
| * There is slightly more sequence in the bog assembly | |||
| * Mean and max bog utg sizes are about twice those of the default utg (scf and ctg sizes are about the same) | |||
| * align asm.ECR2 scaffolds to scf7180002041061  | |||
| ** 2919 bp at the 3' end of the new scf7180000000277 don't align | |||
| ** 1990 bp at the 3' end of the new scf7180000000278 don't align | |||
| ** new scf7180000000277 & new scf7180000000278 align for ~ 1058bp | |||
| ** most of the added mated reads assembled at the 3' ends of new scf7180000000277 & new scf7180000000278 | |||
|   nucmer -l 100 -c 500 ../scf7180002041061.fasta 9-terminator/7180002041061.update.scf.fasta -p scf7180002041061-7180002041061.update.scf | |||
|   delta-filter -q scf7180002041061-7180002041061.update.scf.delta > scf7180002041061-7180002041061.update.scf.filter-q.delta | |||
|   show-coords scf7180002041061-7180002041061.update.scf.filter-q.delta | |||
|        1    46099  |        1    46099  |    46099    46099  |    99.99  |   509987   280626  |     9.04    16.43  | scf7180002041061   scf7180000000277 | |||
|    46273    48914  |    46120    48761  |     2642     2642  |   100.00  |   509987   280626  |     0.52     0.94  | scf7180002041061   scf7180000000277 | |||
|    50810    56904  |    95425   101521  |     6095     6097  |    99.43  |   509987   280626  |     1.20     2.17  | scf7180002041061   scf7180000000277 | |||
|    57301   233303  |   101699   277707  |   176003   176009  |    99.97  |   509987   280626  |    34.51    62.72  | scf7180002041061   scf7180000000277 * | |||
|   232245   444418  |   275975    63802  |   212174   212174  |    99.99  |   509987   277965  |    41.60    76.33  | scf7180002041061   scf7180000000278 *  | |||
|   444964   469975  |    63156    38151  |    25012    25006  |    99.97  |   509987   277965  |     4.90     9.00  | scf7180002041061   scf7180000000278 | |||
|   449967   479221  |    58152    28900  |    29255    29253  |    99.96  |   509987   277965  |     5.74    10.52  | scf7180002041061   scf7180000000278 | |||
|   480298   509987  |    29688        1  |    29690    29688  |    99.99  |   509987   277965  |     5.82    10.68  | scf7180002041061   scf7180000000278 | |||
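To see how much of the original scaffold the new scaffolds cover, the reference intervals of the `show-coords` rows can be merged and the uncovered stretches listed. The sketch below is a minimal helper, not part of MUMmer: it assumes the default pipe-separated `show-coords` layout shown above, with the reference start and end in the first column group, and the function names are hypothetical.

```python
def ref_intervals(coords_lines):
    """Extract (start, end) on the reference from show-coords rows."""
    intervals = []
    for line in coords_lines:
        first = line.split("|")[0].split()
        if len(first) == 2 and all(tok.isdigit() for tok in first):
            start, end = sorted(int(tok) for tok in first)
            intervals.append((start, end))
    return intervals


def merge(intervals):
    """Merge overlapping or directly adjacent closed intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def uncovered(merged, ref_len):
    """List the reference stretches (1-based, inclusive) with no alignment."""
    gaps, prev_end = [], 0
    for start, end in merged:
        if start > prev_end + 1:
            gaps.append((prev_end + 1, start - 1))
        prev_end = end
    if prev_end < ref_len:
        gaps.append((prev_end + 1, ref_len))
    return gaps


COORDS = """
       1    46099  |        1    46099  |    46099    46099  |    99.99  |   509987   280626  |     9.04    16.43  | scf7180002041061   scf7180000000277
   46273    48914  |    46120    48761  |     2642     2642  |   100.00  |   509987   280626  |     0.52     0.94  | scf7180002041061   scf7180000000277
   50810    56904  |    95425   101521  |     6095     6097  |    99.43  |   509987   280626  |     1.20     2.17  | scf7180002041061   scf7180000000277
   57301   233303  |   101699   277707  |   176003   176009  |    99.97  |   509987   280626  |    34.51    62.72  | scf7180002041061   scf7180000000277 *
  232245   444418  |   275975    63802  |   212174   212174  |    99.99  |   509987   277965  |    41.60    76.33  | scf7180002041061   scf7180000000278 *
  444964   469975  |    63156    38151  |    25012    25006  |    99.97  |   509987   277965  |     4.90     9.00  | scf7180002041061   scf7180000000278
  449967   479221  |    58152    28900  |    29255    29253  |    99.96  |   509987   277965  |     5.74    10.52  | scf7180002041061   scf7180000000278
  480298   509987  |    29688        1  |    29690    29688  |    99.99  |   509987   277965  |     5.82    10.68  | scf7180002041061   scf7180000000278
""".strip().splitlines()

gaps = uncovered(merge(ref_intervals(COORDS)), ref_len=509987)
print(gaps)
print(sum(end - start + 1 for start, end in gaps), "bp of scf7180002041061 not covered")
```

For the rows above this leaves five short uncovered stretches totalling 4085 bp. The two starred rows overlap on the reference around position 232245..233303, which is where the low mate coverage region (232849..232871) falls.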
| ==== scf7180002041163.214 ==== | |||
| * infoseq | |||
|   scf7180002041234 10336067 41.93 | |||
| * Markers | |||
|   #scfid          begin   end     markerid        Chr | |||
|   ... | |||
|   7180002041163   770783  771508  CC562874        30 * | |||
|   7180002041163   791050  791484  BZ887794        11 * | |||
|   ... | |||
| * [[Media:7180002041163.0cvg.png|7180002041163.0cvg.png]] | |||
| * Markers on different contigs !!! | |||
|   cat 7180002041163.posmap.ctgscf | |||
|   .. | |||
|   7180001926720   7180002041163   750392  787668  f | |||
|   7180001926721   7180002041163   787911  815738  f | |||
|   ... | |||
| * subassembly: extract all reads & extra mates(293) and reassemble | |||
| ** Msg Counts  | |||
|   DST     39 | |||
|   FRG     17287 | |||
|   LKG     8113 | |||
| ** qc | |||
|   [Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID] | |||
|   0=54,1742235,1760846,32264,351,7180000000507 | |||
|   1=1,6224,6224,6224,0,7180000000506 | |||
|   2=1,1472,1472,1472,0,7180000000508 | |||
|   total=56,1749931,1768542,31249,351 | |||
| * alignments : | |||
|   show-coords scf7180002041163-7180002041163.update.scf.filter-q.delta | |||
|     1748     7972  |     6224        1  |     6225     6224  |    99.84  |  1757217     6224  |     0.35   100.00  | scf7180002041163   scf7180000000506        [CONTAINS] | |||
|     4012     5490  |        1     1470  |     1479     1470  |    98.85  |  1757217     1472  |     0.08    99.86  | scf7180002041163   scf7180000000508        [CONTAINS] | |||
|     4567    18273  |        1    13697  |    13707    13697  |    99.92  |  1757217  1761853  |     0.78     0.78  | scf7180002041163   scf7180000000507  | |||
|   ... | |||
|   750393   787668  |   757092   794367  |    37276    37276  |   100.00  |  1757217  1761853  |     2.12     2.12  | scf7180002041163   scf7180000000507 | |||
|   787912   815738  |   794644   822455  |    27827    27812  |    99.81  |  1757217  1761853  |     1.58     1.58  | scf7180002041163   scf7180000000507 | |||
|   ... | |||
|  1755928  1757217  |  1760564  1761853  |     1290     1290  |   100.00  |  1757217  1761853  |     0.07     0.07  | scf7180002041163   scf7180000000507 | |||
| === Manually curated === | |||
| ==== Markers within 50K of a low mate cvg region ==== | |||
| * 13 scaffolds (22 before) | |||
| * 14 breaks: 9 on the same contig, 2 on adjacent contigs, 3 on non-adjacent contigs | |||
| * File | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scfproblems.mates//scfproblems | |||
| * scfproblems.markers.txt | |||
|   #scfid        begin   end     markerid        Chr(break)      comment | |||
|   7180002040911 2210713 2211037 CZ405316        29              half Chr30, half Chr29 ; SAME ctg  | |||
|   7180002041061 213959  214640  CC572066        15              half Chr29, half Chr15; SAME ctg | |||
|   7180002041103 5296272 5297045 CC587675        15              Chr18 & Chr15 at 3'; diff ctg (2 ctgs in between) | |||
|   7180002041107 7302203 7302874 CC906829        15              half Chr9, half Chr15; SAME ctg | |||
|   7180002041200 4920833 4921695 BZ877250        20              half Chr20, half Chr2; SAME ctg | |||
|   7180002041216 307121  307653  CL609365        5               Chr5 at 5' & Chr15; SAME ctg | |||
|   7180002041259 4315523 4316178 BZ922220        3               Chr5, Chr3, Chr24 !!! ; diff ctgs (4 ctgs in between) | |||
|   7180002041281 2884855 2885652 CC538348        19              half Chr19, half Chr28 ; SAME ctg | |||
|   7180002041315 2196754 2197298 BZ838379        1               Chr13 & Chr1 at 3'; diff ctg (2 ctgs in between) | |||
|   7180002041325 2597590 2598283 CC472212        10              half Chr10, half Chr5; SAME ctg | |||
|   7180002041348 3028112 3028921 CC531996        8               half Chr8, half Chr7; SAME ctg | |||
|   7180002041356 6828214 6828925 CC531427        16              half Chr16, half Chr25; SAME ctg | |||
|   7180002041358 21002368 21003219 CC527932      26              Chr14 & Chr26 at 3' (long chunk); diff ctg | |||
| Details: | |||
|   #scfid          begin   end     markerid        Chr     #ctgid          begin   end     markerid        Chr | |||
|   7180002040911   2189844 2190490 BZ923031        30      7180001730191   42851   43497   BZ923031        30      * | |||
|   7180002040911   2210713 2211037 CZ405316        29      7180001730191   63720   64044   CZ405316        29      * | |||
|   -- | |||
|   7180002041061   213959  214640  CC572066        15      7180001852904   29927   30608   CC572066        15      * | |||
|   7180002041061   236304  236479  CC550436        29      7180001852904   52272   52447   CC550436        29      * | |||
|   -- | |||
|   7180002041103   5220856 5221407 CG984741        18      7180001854649   21126   21677   CG984741        18      * | |||
|   7180002041103   5296272 5297045 CC587675        15      7180001854651   837     1610    CC587675        15      * | |||
|   -- | |||
|   7180002041107   7302203 7302874 CC906829        15      7180001855003   77842   78513   CC906829        15      ** | |||
|   7180002041107   7311399 7312254 CC479102        9       7180001855003   87038   87893   CC479102        9       ** | |||
|   -- | |||
|   7180002041200   4940131 4940949 CC500137        20      7180002002029   31935   32753   CC500137        20      ** | |||
|   7180002041200   4956412 4957105 CZ428497        2       7180002002029   48216   48909   CZ428497        2       ** | |||
|   --  | |||
|   7180002041216   307121  307653  CL609365        5       7180002003578   6871    7403    CL609365        5       * | |||
|   7180002041216   310832  311546  CL865591        15      7180002003578   10582   11296   CL865591        15      * | |||
|   -- | |||
|   7180002041259   4224828 4225527 CC920177        5       7180002012718   112862  113561  CC920177        5       * | |||
|   7180002041259   4315523 4316178 BZ922220        3       7180002012722   19376   20031   BZ922220        3       * | |||
|   -- | |||
|   7180002041259   6220638 6221268 BZ869532        3       7180002012752   199799  200429  BZ869532        3       * | |||
|   7180002041259   6239728 6240375 CZ413142        24      7180002012753   2600    3247    CZ413142        24      * | |||
|   -- | |||
|   7180002041281   2927406 2928016 CC573399        19      7180002018361   74539   75149   CC573399        19      ** | |||
|   7180002041281   2938896 2939696 CC513914        28      7180002018361   86029   86829   CC513914        28      ** | |||
|   -- | |||
|   7180002041315   2152291 2153097 BZ836343        13      7180001862977   3397    4203    BZ836343        13      * | |||
|   7180002041315   2196754 2197298 BZ838379        1       7180002024308   7255    7799    BZ838379        1       * | |||
|   -- | |||
|   7180002041325   2608389 2609213 CC506736        10      7180002025880   237956  238780  CC506736        10      ** | |||
|   7180002041325   2638591 2639207 CC770009        5       7180002025880   268158  268774  CC770009        5       ** | |||
|   -- | |||
|   7180002041348   3044263 3044902 BZ872906        8       7180002030033   152672  153311  BZ872906        8       ** | |||
|   7180002041348   3092547 3093121 BZ924509        7       7180002030033   200956  201530  BZ924509        7       ** | |||
|   -- | |||
|   7180002041356   6828214 6828925 CC531427        16      7180001722964   12571   13282   CC531427        16      * | |||
|   7180002041356   6832720 6833356 CC494876        25      7180001722964   17077   17713   CC494876        25      * | |||
|   -- | |||
|   7180002041358  20964100 20964829  BZ839784      14      7180001723456   140919  141648  BZ839784        14      * | |||
|   7180002041358  21002368 21003219  CC527932      26      7180001723457   15983   16834   CC527932        26      * | |||
| * scfproblems.low_mate_cvg.txt | |||
|   #scfid                begin           end             scf_len         len             mate_cvg | |||
|   7180002040911         2205494         2205582         3008363         89              1 | |||
|   7180002041061         232849          232871          509987          23              1 | |||
|   7180002041103         5285314         5285479         5341215         166             1 | |||
|   7180002041107         7306552         7306653         19514159        102             1 | |||
|   7180002041200         4954073         4954180         16995932        108             1 | |||
|   7180002041216         307614          307741          1444165         128             0 | |||
|   7180002041259         4313297         4313311         15600612        15              0 | |||
|   7180002041281         2933371         2933618         12907599        248             1 | |||
|   7180002041315         2193142         2193167         2232736         26              1 | |||
|   7180002041325         2623956         2624190         11153784        235             1 | |||
|   7180002041348         3064633         3064750         19180127        118             1 | |||
|   7180002041356         6831454         6831811         14697197        358             1 | |||
|   7180002041358         20970950        20971112        25078118        163             1 | |||
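The pairing of the two files above (a low mate coverage region flanked by markers from different chromosomes) can be sketched as below. The rule is an interpretation, not the original script: the region's nearest marker on each side must lie within 50 Kbp of the region and the two markers must map to different chromosomes. `flanking_switch` is a hypothetical name.

```python
from bisect import bisect_right


def flanking_switch(markers, region, max_dist=50_000):
    """Return the (left, right) markers around a low-coverage region if they
    disagree on the chromosome and both lie within max_dist of the region.

    markers: list of (begin, end, marker_id, chr) on one scaffold, sorted by begin
    region:  (begin, end) of the low mate coverage region on the same scaffold
    """
    begins = [m[0] for m in markers]
    i = bisect_right(begins, region[0])  # first marker starting after the region begin
    if i == 0 or i == len(markers):
        return None  # region is not between two markers
    left, right = markers[i - 1], markers[i]
    left_gap = max(0, region[0] - left[1])
    right_gap = max(0, right[0] - region[1])
    if left[3] != right[3] and left_gap <= max_dist and right_gap <= max_dist:
        return left, right
    return None


# Markers around the low coverage region of scf7180002041061 (232849..232871).
MARKERS = [
    (151253, 152009, "CC477055", "15"),
    (213959, 214640, "CC572066", "15"),
    (236304, 236479, "CC550436", "29"),
    (242590, 243462, "BZ848041", "29"),
]
print(flanking_switch(MARKERS, (232849, 232871)))
```

Using the marker begin for the lookup also handles the case where a marker overlaps the region, as CL609365 does on scf7180002041216.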
| ==== No low cvg regions ==== | |||
| * 9 scaffolds | |||
| * 14 breaks: 5 on the same contig, 6 on adjacent contigs, 3 on non-adjacent contigs | |||
|   7180002040844 : 3 consecutive Chr30 markers (in the middle); mate_cvg > 18 ; markers CC576837,CC516738,CC543771; SAME ctg | |||
|   7180002041163 : 5' Chr30, 3' Chr11 ; mate_cvg~=10; marker BZ887794 ; diff ctg (1 ctg in between) | |||
|   7180002041234 : 5' Chr10, 3' Chr5; mate_cvg=10; marker BZ889975; scf=1757217bp; diff ctg | |||
|   7180002041235 : 2 consecutive Chr14 markers; mate_cvg=15..20; markers CC561837 &  CC585677; NOT uniq; SAME ctg | |||
|   7180002041279 : 5 consecutive Chr23 markers; markers CC472696,CC522963; cvg=9..18; diff ctg | |||
|   7180002041306 : 5' Chr14, 3' Chr6; marker CC549871; cvg=2; diff ctg (22 ctgs in between) | |||
|   7180002041308 : 2 consecutive Chr20 markers; markers CC571631 & BZ832318; cvg=20; SAME/diff ctg  | |||
|   7180002041321 : 5' Chr2, 3' Chr15 ; marker CC513377 cvg=10; diff ctg (1 ctgs in between) | |||
|   7180002041350 : 2 consecutive Chr3 markers ; markers BZ837387 & CC571149; cvg 17..26; NOT uniq; SAME/diff ctg | |||
| Details: | |||
|   #scfid          begin   end     markerid        Chr     #ctgid          begin   end     markerid        Chr | |||
|   7180002040844   8369147 8369989 CC521620        17      7180001727899   132573  133415  CC521620        17      ** | |||
|   7180002040844   8437707 8438324 CC576837        30      7180001727899   201133  201750  CC576837        30      ** | |||
|   ... | |||
|   7180002040844   8467493 8468096 CC543771        30      7180001727899   230919  231522  CC543771        30      * | |||
|   7180002040844   8522732 8523598 CC513544        17      7180001727901   34970   35836   CC513544        17      * | |||
|   --- | |||
|   7180002041163   770783  771508  CC562874        30      7180001926720   20391   21116   CC562874        30      * | |||
|   7180002041163   791050  791484  BZ887794        11      7180001926721   3139    3573    BZ887794        11      * | |||
|   -- | |||
|   7180002041234   3856848 3857673 BZ889975        10      7180002010181   56608   57433   BZ889975        10      * | |||
|   7180002041234   3949893 3950652 CC509477        5       7180002010182   28610   29369   CC509477        5       * | |||
|   -- | |||
|   7180002041235   1117908 1118686 CC579933        21      7180002010285   4086    4864    CC579933        21      ** | |||
|   7180002041235   1118939 1119675 CC561837        14      7180002010285   5117    5853    CC561837        14      ** | |||
|   7180002041235   1158300 1159022 CC585677        14      7180002010285   44478   45200   CC585677        14      ** | |||
|   7180002041235   1164296 1164891 BZ924510        21      7180002010285   50474   51069   BZ924510        21      ** | |||
|   -- | |||
|   7180002041279   6755839 6756519 BZ849919        1       7180002018195   417990  418670  BZ849919        1       * | |||
|   7180002041279   6786086 6786957 CC472696        23      7180002018196   12703   13574   CC472696        23      * | |||
|   ... | |||
|   7180002041279   6866378 6866890 CC522963        23      7180002018196   92995   93507   CC522963        23      * | |||
|   7180002041279   6911185 6911897 CC574255        1       7180002018197   15411   16123   CC574255        1       * | |||
|   -- | |||
|   7180002041306   248713  249457  CC503129        14      7180002022875   59534   60278   CC503129        14      * | |||
|   7180002041306   910997  911554  BZ839769        6       7180002022892   21939   22496   BZ839769        6       * | |||
|   -- | |||
|   7180002041308   1919399 1920196 BZ883381        18      7180002023013   57093   57890   BZ883381        18      * | |||
|   7180002041308   1958658 1959168 CC571631        20      7180002023014   29833   30343   CC571631        20      * | |||
|   7180002041308   1963333 1964083 BZ832318        20      7180002023014   34508   35258   BZ832318        20      ** | |||
|   7180002041308   2009574 2010367 CC499423        18      7180002023014   80749   81542   CC499423        30      ** | |||
|   -- | |||
|   7180002041321   4820048 4820542 BZ846646        2       7180002025240   77429   77923   BZ846646        2       * | |||
|   7180002041321   4908631 4909160 CC513377        15      7180002025242   13302   13831   CC513377        15      * | |||
|   -- | |||
|   7180002041350   2423855 2424205 CG983886        13      7180002030523   183644  183994  CG983886        13      * | |||
|   7180002041350   2446763 2447125 BZ837387        3       7180002030524   9191    9553    BZ837387        3       * | |||
|   7180002041350   2457417 2457704 CC571149        3       7180002030524   19845   20132   CC571149        3       ** | |||
|   7180002041350   2486474 2487050 CC490214        13      7180002030524   48902   49478   CC490214        13      ** | |||
| ==== Scaffold splitting ==== | |||
| Before: | |||
| * 22 scaffolds | |||
| * 28 breaks: 14 on the same contig, 8 on adjacent contigs, 6 on non-adjacent contigs | |||
| Now: | |||
| * 14 scaffolds | |||
| * 15 breaks: 8 on the same contig, 3 on adjacent contigs, 4 on non-adjacent contigs | |||
| ===== Scaffold to break ===== | |||
|     nl  scfid           breaks | |||
|     #1  7180002040844   2 | |||
|      2  7180002040911   1 | |||
|      3  7180002041061   1 | |||
|     #4  7180002041103   1 | |||
|      5  7180002041107   1 | |||
|     #6  7180002041163   1 | |||
|      7  7180002041200   1 | |||
|     #8  7180002041216   1 | |||
|      9  7180002041234   1 | |||
|    #10  7180002041235   2 | |||
|     11  7180002041259   2 | |||
|    #12  7180002041279   2 | |||
|     13  7180002041281   1 | |||
|     14  7180002041306   1 | |||
|    #15  7180002041308   2 | |||
|     16  7180002041315   1 | |||
|     17  7180002041321   1 | |||
|     18  7180002041325   1 | |||
|     19  7180002041348   1 | |||
|    #20  7180002041350   2 | |||
|     21  7180002041356   1 | |||
|     22  7180002041358   1 | |||
| ===== Contigs to break ===== | |||
|     nl  ctgid            | |||
|      1  7180001722964 | |||
|      2  7180001723456 | |||
|      3  7180001730191 | |||
|      4  7180001852904 | |||
|      5  7180001855003 | |||
|      6  7180002002029 | |||
|      7  7180002010182 | |||
|      8  7180002012722 | |||
|      9  7180002012752 | |||
|     10  7180002018361 | |||
|     11  7180002024308 | |||
|     12  7180002025240 | |||
|     13  7180002025880 | |||
|     14  7180002030033 | |||
| ===== Marker pairs ===== | |||
|     nl  scfid           begin1  end2    markerid1       Chr1    markerid2       Chr2    ctg1            ctg2            dist(end2-begin1)      dist(ctg2-ctg1) | |||
|     #1  7180002040844   8369147 8438324 CC521620        17      CC576837        30      7180001727899   7180001727899   69177                   0 | |||
|     #2  7180002040844   8467493 8523598 CC543771        30      CC513544        17      7180001727899   7180001727901   56105                   2 | |||
|      3  7180002040911   2189844 2211037 BZ923031        30      CZ405316        29      7180001730191   7180001730191   21193                   0 | |||
|      4  7180002041061   213959  236479  CC572066        15      CC550436        29      7180001852904   7180001852904   22520                   0 | |||
|     #5  7180002041103   5220856 5297045 CG984741        18      CC587675        15      7180001854649   7180001854651   76189                   2 | |||
|      6  7180002041107   7302203 7312254 CC906829        15      CC479102        9       7180001855003   7180001855003   10051                   0 | |||
|     #7  7180002041163   770783  791484  CC562874        30      BZ887794        11      7180001926720   7180001926721   20701                   1 | |||
|      8  7180002041200   4940131 4957105 CC500137        20      CZ428497        2       7180002002029   7180002002029   16974                   0 | |||
|     #9  7180002041216   307121  311546  CL609365        5       CL865591        15      7180002003578   7180002003578   4425                    0 | |||
|     10  7180002041234   3856848 3950652 BZ889975        10      CC509477        5       7180002010181   7180002010182   93804                   1 | |||
|    #11  7180002041235   1117908 1119675 CC579933        21      CC561837        14      7180002010285   7180002010285   1767                    0 | |||
|    #12  7180002041235   1158300 1164891 CC585677        14      BZ924510        21      7180002010285   7180002010285   6591                    0 | |||
|     13  7180002041259   4224828 4316178 CC920177        5       BZ922220        3       7180002012718   7180002012722   91350                   4 | |||
|     14  7180002041259   6220638 6240375 BZ869532        3       CZ413142        24      7180002012752   7180002012753   19737                   1 | |||
|    #15  7180002041279   6755839 6786957 BZ849919        1       CC472696        23      7180002018195   7180002018196   31118                   1 | |||
|    #16  7180002041279   6866378 6911897 CC522963        23      CC574255        1       7180002018196   7180002018197   45519                   1 | |||
|     17  7180002041281   2927406 2939696 CC573399        19      CC513914        28      7180002018361   7180002018361   12290                   0 | |||
|     18  7180002041306   248713  911554  CC503129        14      BZ839769        6       7180002022875   7180002022892   662841                  22 | |||
|    #19  7180002041308   1919399 1959168 BZ883381        18      CC571631        20      7180002023013   7180002023014   39769                   1 | |||
|    #20  7180002041308   1963333 2010367 BZ832318        20      CC499423        18      7180002023014   7180002023014   47034                   0 | |||
|     21  7180002041315   2152291 2197298 BZ836343        13      BZ838379        1       7180001862977   7180002024308   45007                   2* | |||
|     22  7180002041321   4820048 4909160 BZ846646        2       CC513377        15      7180002025240   7180002025242   89112                   2 | |||
|     23  7180002041325   2608389 2639207 CC506736        10      CC770009        5       7180002025880   7180002025880   30818                   0 | |||
|     24  7180002041348   3044263 3093121 BZ872906        8       BZ924509        7       7180002030033   7180002030033   48858                   0 | |||
|    #25  7180002041350   2423855 2447125 CG983886        13      BZ837387        3       7180002030523   7180002030524   23270                   1 | |||
|    #26  7180002041350   2457417 2487050 CC571149        3       CC490214        13      7180002030524   7180002030524   29633                   0 | |||
|     27  7180002041356   6828214 6833356 CC531427        16      CC494876        25      7180001722964   7180001722964   5142                    0 | |||
|     28  7180002041358 20964100 21003219 BZ839784        14      CC527932        26      7180001723456   7180001723457   39119                   1 | |||
| ===== Break intervals ===== | |||
|         #scfid          begin   end     scflen          len  allread goodmate badmate Chr1 Chr2 mark1   mark2  ctgid | |||
|     #1  7180002040844   8412735 8412739 9565303         5       1       12      15      17 30   357     3       | |||
|     #2  7180002040844   8485384 8485869 9565303         486     0       12      4       30 17   3       53 | |||
|      3  7180002040911   2205044 2205247 3009964         204     3       0       8       30 29   44      34     7180001730191 | |||
|      4  7180002041061   232819  232843  510915          25      5       0       23      15 29   8       14     7180001852904 | |||
|     #5  7180002041103   5295117 5295435 5342873         319     0       11      2       18 15   221     4 | |||
|      6  7180002041107   7306552 7306620 19515586        69      2       0       11      15 9    320     549    7180001855003 | |||
|     #7  7180002041163   787669  787911  1752998         243     0       10      0       30 11   16      8 | |||
|      8  7180002041200   4954103 4954180 16991478        78      1       0       27      20 2    219     532    7180002002029 | |||
|     #9  7180002041216   309331  309337  1435859         7       1       4       3       5 15    7       50 | |||
|     10  7180002041234   3939493 3939548 10292558        56      1       0       20      10 5    160     227    7180002010182 | |||
|    #11  7180002041235   1117752 1117987 1742396         236     5       20      2       21 14   43      2 | |||
|    #12  7180002041235   1161294 1161363 1742396         70      3       16      0       14 21   2       32 | |||
|     13  7180002041259   4313381 4313547 15594966        167     1       0       12      5 3     193     75     7180002012722 | |||
|     14  7180002041259   6231717 6231762 15594966        46      8       4       18      3 24    75      406    7180002012752 | |||
|    #15  7180002041279   6773364 6773383 9784709         20      0       4       14      1 23    292     5 | |||
|    #16  7180002041279   6894603 6894618 9784709         16      0       4       16      23 1    5       141 | |||
|     17  7180002041281   2931877 2931885 12908817        9       3       5       10      19 28   83      441    7180002018361 | |||
|     18  7180002041306   267445  267446  4484402         2       0       2       14      14 6    49      141    7180002022875,7180002005935 | |||
|    #19  7180002041308   1928806 1928825 23070077        20      0       10      0       18 20   63      2 | |||
|    #20  7180002041308   1977759 1977828 23070077        70      1       21      1       20 18   2       836 | |||
|     21  7180002041315   2192754 2192893 2233575         140     7       0       22      13 1    59      2      7180002024308 | |||
|     22  7180002041321   4840934 4841070 6492983         137     1       0       16      2 15    227     72     7180002025240 | |||
|     23  7180002041325   2624153 2624190 11155368        38      3       0       22      10 5    86      357    7180002025880 | |||
|     24  7180002041348   3064691 3064730 19179560        40      1       0       16      8 7     116     634    7180002030033 | |||
|    #25  7180002041350   2437553 2437572 23353087        20      0       19      1       13 3    107     2 | |||
|    #26  7180002041350   2479000 2479029 23353087        30      2       23      1       3 13    2       880 | |||
|     27  7180002041356   6831454 6831783 14699461        330     1       0       9       16 25   299     319    7180001722964 | |||
|     28  7180002041358  20970057 20970139 25079321       83      7       0       25      14 26   868     189    7180001723456 | |||
| Where: | |||
|   Chr1:  Most frequent chromosome among markers aligned at coordinates <= begin | |||
|   Chr2:  Most frequent chromosome among markers aligned at coordinates >= end | |||
|   mark1: Number of Chr1 markers aligned at coordinates <= begin | |||
|   mark2: Number of Chr2 markers aligned at coordinates >= end | |||
|   Lines starting with # should be ignored. | |||
|   ctgid: ctg to break | |||
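A minimal sketch of reading the break-interval table with the conventions listed under "Where:". It assumes whitespace-separated rows in the column order of the header line; the function name `parse_break_intervals` and the dictionary keys are illustrative and are not part of the pipeline.

```python
def parse_break_intervals(lines):
    """Return one dict per active row; rows whose first field starts with '#' are skipped."""
    keys = ("scfid", "begin", "end", "scflen", "len", "allread",
            "goodmate", "badmate", "chr1", "chr2", "mark1", "mark2")
    rows = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # blank line, header line, or an interval marked to be ignored
        row = dict(zip(keys, fields[1:13]))  # fields[0] is the row number
        for key in keys[1:]:
            row[key] = int(row[key])         # scfid stays a string
        # ctgid is optional and may hold several comma-separated contig ids
        row["ctgid"] = fields[13].split(",") if len(fields) > 13 else []
        rows.append(row)
    return rows
```

In the rows shown above, `len` equals `end - begin + 1`, which can serve as a sanity check after parsing.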
| ===== New scaffolds ===== | |||
|   #scfid(new)           scfid           begin   end      scflen(new) | |||
| ===== Files ===== | |||
|   # ctg:scf new name mapping | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scfproblems.both.filtered/scfproblems.posmap.ctbscb | |||
|   # scf:scf new:original name mapping | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scfproblems.both.filtered/scfproblems.posmap.scbscf | |||
|   # ctg:ctg new:original name mapping | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scfproblems.both.filtered/ctgproblems.posmap.ctbctg | |||
|   # ctg FASTA sequences | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scfproblems.both.filtered/ctbproblems.fasta | |||
|   # scf FASTA sequences | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scfproblems.both.filtered/scbproblems.fasta | |||
| = UMD2.6 (UMD2.5 without contam ctg/scaff; split ctg/scaff) = | |||
| == Contaminants & MarkerBreaks  == | |||
| Delete summary: | |||
|                            ctg     scf     scf->ctg | |||
|   contaminants(delete)     156     152     666 | |||
|   contaminants(trim)       12      12      1328 | |||
|   markerBreaks             14+1    14+1    2875+1  # 1 more break in UMD2.6.1       | |||
|   total                    182     178     4869  | |||
| Add summary: | |||
|                            ctg     scf   | |||
|   contaminants(delete)     0       2 | |||
|   contaminants(trim)       12      12 | |||
|   markerBreaks             28+2    29+2            # 1 more break in UMD2.6.1  | |||
|   total                    40      43 | |||
| Summary:                   | |||
|                            ctg     scf        markers | |||
|   original                 90135   39978 | |||
|   del/add                 -142    -135 | |||
|   final                    89993   39843      -17 | |||
| Files: | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.ctg.fasta     : contig  FASTA sequence | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.scf.fasta     : scaffold FASTA sequence | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.posmap.ctglen : contig  lengths  | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.posmap.scflen : scaffold lengths | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.posmap.ctgscf : mapping of contigs to scaffolds  (posmap format) | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.posmap.scaff  : mapping of contigs to scaffolds  (scaff  format) | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA/new/ctg.delete.uid   : UID of the contigs   which were deleted   | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA/new/scf.delete.uid   : UID of the scaffolds which were deleted   | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA/new/ctg.add.uid      : UID of the contigs which were added :  UID =~/brk\d+[abc]/ OR UID =~/cnt\d+/ | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA/new/scf.add.uid      : UID of the scaffolds which were added :  UID =~/brk\d+[abc]/ OR UID =~/cnt\d+/ | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA/new/markers.delete.uid : markers which got deleted | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.ctg.break  : 15 contig break regions  | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.scf.break  : 16 scaffold break regions  | |||
| Scripts: | |||
|   ~/bin/breakPosmapKeep.amos : pipeline for breaking scf/ctg | |||
| * Final:  | |||
|                     elem       min        max               mean       med        n50        sum	  | |||
|   ctg               89993      65         1160130(1.1M)     29736      5180       95952      2676109306 | |||
|   scf               39843      316        34167202(34M)     68353      1361       7451988    2723419943 | |||
|   scf<50K           38915      316        49898             2765       1349       5107       107632139 | |||
|   scf<5K            35229      316        4999              1518       1306       1435       53495879 | |||
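A minimal sketch of how the columns of the length tables (elem, min, max, mean, med, n50, sum) can be computed. It assumes the common N50 definition (the largest length L such that sequences of length >= L cover at least half of the total) and a truncated mean, which is what the reported means appear to be. `length_summary` is an illustrative name, and the definitions used by `getSummary.pl` are not confirmed here.

```python
from statistics import median

def length_summary(lengths):
    """Summary statistics for a non-empty list of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    n50 = 0
    for length in ordered:
        running += length
        if 2 * running >= total:   # reached half of the total span
            n50 = length
            break
    return {
        "elem": len(ordered),
        "min": ordered[-1],
        "max": ordered[0],
        "mean": total // len(ordered),  # truncated, as the tables appear to be
        "med": median(ordered),
        "n50": n50,
        "sum": total,
    }
```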
| == Ctg Markers == | |||
| Filtered: | |||
|   %IDY>90 | |||
|   %Matched>85 | |||
| * ~30% of alignments satisfy both conditions | |||
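A minimal sketch of the marker-alignment filter above. It assumes each alignment is a dictionary with `idy` (percent identity) and `matched` (percent of the marker matched) keys; these key names and the function names are hypothetical.

```python
def passes_filter(idy, matched, min_idy=90.0, min_matched=85.0):
    """True when the alignment satisfies %IDY>90 and %Matched>85 (both strict)."""
    return idy > min_idy and matched > min_matched

def filter_alignments(alignments):
    """Keep only the alignments that pass both thresholds."""
    return [a for a in alignments if passes_filter(a["idy"], a["matched"])]
```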
| Markers:             | |||
|   total:     126111    # from 31372 ctgs | |||
| Contigs | |||
|                     elem       min        max        mean       med        n50        sum | |||
|   placed            31372*     442        1160130    72908      52734      111147     2287291732* | |||
|   unplaced          58621      65         425922     6632       1600       22204      388817574 | |||
|   total             89993      65         1160130    29736      5180       95952      2676109306 | |||
| == Scf Markers == | |||
| Markers: | |||
|   total:     126111  # from 2641 scaffolds; 1046 scaffolds have 2+ markers | |||
|   na:        18744   # not placed	  | |||
|   del:       17      # the scaffolds they belong to were deleted	  | |||
|   other:     3722    # not on the "main" chromosome; "main" chromosome determined by a majority rule; if it's a tie check markers for uniqueness  | |||
|   outliers:  411     # interquartile range (IQR) method: keep q1-1.5*(q3-q1) .. q3+1.5*(q3-q1) | |||
|   filtered:  103217  # "non conflicting" markers	 	  | |||
|   Scaffolds:	  | |||
|                     elem       min        max        mean       med        n50        sum              #ctgs | |||
|   placed            2641       1000       34167202   994523     16673      8170786    2626536153*      50528	  | |||
|   unplaced          37202      316        754615     2604       1337       3964       96883790	        39465 | |||
|   total             39843      316        34167202   68353      1361       7451988    2723419943       89993	  | |||
|                     scf  <2   >=2  >=10 min  max   mean  med n50 sum	  | |||
|   markers/scf       2641 1595 1046 562  1    1418  39    1   354 103479	  | |||
|   ctg/scf           2641 1546 1095 559  1    545   19    1   130 50528 | |||
|   cat markers_scf.mainChr.*summary  | count.pl -c 7 | sort -n | |||
|   1       1564   # 1 marker/scf | |||
|   2       276    # 2 markers/scf | |||
|   3       116    # ... | |||
|   4       52 | |||
|   ... | |||
|   1470    1      # scaffold 7180002041371 has 1470 markers | |||
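A minimal sketch of the majority rule used to pick the "main" chromosome of a scaffold, with ties flagged for a follow-up uniqueness check. The uniqueness check itself is not implemented here, and `main_chromosome` is an illustrative name.

```python
from collections import Counter

def main_chromosome(marker_chrs):
    """Return (chromosome, is_tie) for the chromosome labels of one scaffold's markers.

    is_tie is True when at least two chromosomes share the top count; such
    scaffolds need another criterion (e.g. marker uniqueness) to be resolved.
    """
    counts = Counter(marker_chrs).most_common()
    if not counts:
        return None, False
    best, best_n = counts[0]
    tie = len(counts) > 1 and counts[1][1] == best_n
    return best, tie
```

Markers that fall on a chromosome other than the returned one correspond to the "other" category in the table above.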
| * Scaffold position:     | |||
|   filter outliers (interquartile range method); use median value | |||
|   problem: when a scaffold has only 2 markers and they are far apart, choose one randomly or check the markers for uniqueness | |||
|   Summary (approximate) | |||
|   Chr   #Ctg    #Scaff  ScaffSpanSum    MaxMarkerPos | |||
|   1     2989    112     157088082       167097751 | |||
|   2     2488    73      138112445       141135901 | |||
|   3     2257    109     120984003       128677351 | |||
|   4     2079    111     124695956       123662451 | |||
|   5     2271    98      120470218       130242001 | |||
|   6     2244    127     117350442       127208151 | |||
|   7     2169    93      109780318       114917551 | |||
|   8     2069    95      111646872       114607251 | |||
|   9     1937    72      103790361       106365151 | |||
|   10    1829    96      102878815       108508301 | |||
|   11    2030    62      106593132       107458151 | |||
|   12    1789    114     89109155        97406401 | |||
|   13    1498    73      83821399        88539451 | |||
|   14    1482    142     84084175        89211101 | |||
|   15    1734    105     84680500        91332551 | |||
|   16    1710    111     80727432        86838601 | |||
|   17    1384    51      72913556        78195801 | |||
|   18    1446    86      65689468        70299751 | |||
|   19    1338    56      63372609        69847351 | |||
|   20    1454    54      71941707        75982901 | |||
|   21    1405    63      70035525        72193201 | |||
|   22    1077    36      60892135      > 60178851 | |||
|   23    1021    44      51791473        54886001 | |||
|   24    1059    26      61662407      > 61466101 | |||
|   25    783     41      42670836      > 45254751 | |||
|   26    991     40      50640267        52316851 | |||
|   27    920     70      45768018        48911451 | |||
|   28    810     52      45884054        50753001 | |||
|   29    1143    89      51657687        55219751 | |||
|   30    3122    340     135803106       152429101 | |||
|   U     39465   37202   94983049 | |||
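A minimal sketch of the position rule described under "Scaffold position": drop outliers with the interquartile range method, then take the median of the remaining marker positions. It assumes the standard fences q1-1.5*IQR and q3+1.5*IQR and the default quartile method of Python's `statistics.quantiles`; the quartile definition used by the original scripts is not known.

```python
from statistics import median, quantiles

def iqr_filter(values, k=1.5):
    """Drop values outside [q1 - k*IQR, q3 + k*IQR]; lists with < 4 values are left untouched."""
    if len(values) < 4:
        return list(values)
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

def scaffold_position(marker_positions):
    """Median chromosome position of a scaffold's markers after outlier removal."""
    return median(iqr_filter(marker_positions))
```

With only two markers nothing is filtered and the median is their midpoint, which is exactly the ambiguous case noted above; this sketch does not resolve it.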
| * Scaffold orientation | |||
|   filter outliers (interquartile range method) | |||
|   use a least-squares fit to estimate the orientation: positive slope => forward; negative slope => reverse | |||
|   problem: slope ~=0 => which direction ?  | |||
|   cat Chr.summary | getSummary.pl -i 5 | |||
|   cat Chr.agp | grep W | awk '{print $9}' | count.pl | |||
|          elem       <0         0          >0         min        max        mean       med | |||
|   scf    2641       516        1610       515        -31        61         0          0    | |||
|   ctg    50528      24885      2236       23407 | |||
|   Use a slope threshold to determine direction? | |||
|   cat markers_scf.mainChr.noOutliers.summary | p 'print $_ if(abs($F[5])>0.5);' | wc -l #  634 | |||
|   Ambiguity examples: | |||
|   BZ908653 6 114061501 114016501 114106501 172149 597 98.66 100.00 7180002040834 | |||
|   BZ891600 6 114085251 114051501 114119001 4834 504 99.80 99.02 7180002040834 | |||
|   CZ411135 6 114094001 114059001 114129001 242710 609 98.20 100.00 7180002040834 | |||
|   BZ854276 6 114132751 114101501 114164001 100980 580 99.31 99.83 7180002040834 | |||
|   CC524983 30 115669901 115622401 115717401 96634 715 98.74 100.00 7180002041003 | |||
|   BZ869249 30 115791151 115717401 115864901 86931 448 99.78 100.00 7180002041003 | |||
|   BZ867530 30 115798651 115732401 115864901 69671 572 99.13 100.00 7180002041003 | |||
|   CC585731 30 115808651 115752401 115864901 54950 737 99.86 100.00 7180002041003 | |||
|   CC469285 30 115818651 115772401 115864901 125555 550 94.57 100.00 7180002041003 | |||
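A minimal sketch of the orientation rule above: fit a least-squares line to marker positions and read the direction from the sign of the slope, returning "?" when the slope is too close to zero. It assumes x is the marker position on the scaffold and y the position on the chromosome. The default threshold of 0.5 mirrors the `abs($F[5])>0.5` filter in the command above, but whether that is the right cutoff remains the open question.

```python
def ls_slope(xs, ys):
    """Ordinary least-squares slope of y on x; 0.0 when x has no spread."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return 0.0  # single marker, or all markers at the same scaffold position
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def orientation(scf_pos, chr_pos, threshold=0.5):
    """'+' for forward, '-' for reverse, '?' when |slope| <= threshold."""
    slope = ls_slope(scf_pos, chr_pos)
    if slope > threshold:
        return "+"
    if slope < -threshold:
        return "-"
    return "?"
```

A scaffold with a single marker always comes out as "?", consistent with the large number of scaffolds whose slope is 0.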
| * Scaffold overlaps: some small scaffolds might be contained by bigger ones | |||
|   cat markers_scf.mainChr.noOutliers.posmap.scfchrabs | ~/bin/posmap2ovl.pl | sort -nk6 -r | ~/bin/tab2tab.pl -f -15 | head | |||
|   Chr            ref            qry            begin          end            end-begin | |||
|   30             7180002041328  7180002041078  72536051       76993951       4457900 | |||
|   4              7180002041381  7180002041269  41976451       46175201       4198750 | |||
|   30             7180002040852  7180002041077  142001501      145251501      3250000 | |||
|   30             7180002038569  7180002041121  33012001       35518351       2506350 | |||
|   30             7180002040971  7180002034501  59411001       61836101       2425100 | |||
|   ... | |||
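A minimal sketch of detecting scaffolds whose chromosome span is contained in the span of a larger scaffold on the same chromosome. The input tuple layout and the function name are assumptions; this is not a reimplementation of `posmap2ovl.pl`, whose exact behavior is not documented here. The pairwise comparison is quadratic per chromosome, which should be acceptable for a few hundred placed scaffolds per chromosome.

```python
def contained_scaffolds(spans):
    """spans: iterable of (chr, scfid, begin, end) in chromosome coordinates.

    Returns a list of (chr, ref, qry, begin, end, end - begin) where the span of
    qry lies inside the strictly longer span of ref, largest contained span first.
    """
    by_chr = {}
    for chrom, scfid, begin, end in spans:
        by_chr.setdefault(chrom, []).append((begin, end, scfid))
    hits = []
    for chrom, items in by_chr.items():
        for ref_begin, ref_end, ref in items:
            for qry_begin, qry_end, qry in items:
                if qry == ref:
                    continue
                inside = ref_begin <= qry_begin and qry_end <= ref_end
                shorter = (qry_end - qry_begin) < (ref_end - ref_begin)
                if inside and shorter:
                    hits.append((chrom, ref, qry, qry_begin, qry_end, qry_end - qry_begin))
    return sorted(hits, key=lambda h: h[5], reverse=True)
```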
| == Marker Issues == | |||
| === Placement === | |||
| * 37202 scaffolds are unplaced (3.5% of total scaffold span); max=0.7M | |||
| * 87 unplaced scaffolds (1.5Mbp total) could be placed using SLK messages | |||
|   perl ~/bin/difference21.pl  bt.slk markers_scf.scflen | head | |||
|   7180002030632   7180002040171   I       -147744.734     21214.779       2       UP | |||
|   7180002030792   7180002031221   I       -15042.301      558.223         2       UP | |||
|   7180002030849   7180002041244   N       -1609089.875    18145.568       4       UP | |||
|   ... | |||
|                     elem       min        max        mean       med        n50        sum | |||
|   Unplaced          87         740        102909     17259      8544       34550      1501601 | |||
|   Placed            78         16177      27139572   5757234    4037663    10989230   449064304 | |||
| * ambiguous assignment to chromosomes: | |||
|   cat markers_scf.count | p 'print $_ if($F[2]*2==$F[3]);' | nl | |||
|         #k1             k2      count12 count1 | |||
|      1  7180002030741   15      2       4 | |||
|      2  7180002031341   11      1       2 | |||
|     ... | |||
|     36  7180002068140   25      1       2 | |||
|   # scaffold assigned differently | |||
|   ~/bin/difference12.pl markers.all/markers_scf.mainChr.count markers.all_plus_uniq/markers_scf.mainChr.count | nl | |||
|      1  7180002032536   19      1       2       23480 | |||
|      2  7180002037223   8       1       2       31516 | |||
|      3  7180002040013   25      2       4       15228 | |||
|      4  7180002040262   30      1       2       3087 | |||
|      5  7180002040378   10      1       2       50907 | |||
|      6  7180002040523   29      1       2       9105 | |||
|      7  7180002040769   4       1       3       827526 | |||
|      8  7180002041203   14      2       4       1141560 | |||
|      9  7180002044555   18      1       2       1566 | |||
| * Chr30 has many scaffolds aligned to it | |||
| * how reliable are the markers: 1151 out of 2641 placed scaffolds have no unique markers ? | |||
| * what measure is best for placing the scaffolds? | |||
| * can some scaffolds go in the gaps ? | |||
| * AGP: | |||
| ** gaps<=0 or 20 set to 100 | |||
| ** unoriented ctgs set to + | |||
| * Marker positions are not uniformly distributed; they tend to "cluster" | |||
|                                    elem       0          >0         min        max        mean       med        n50        sum | |||
|   markers_chr                      107337     8405       98932      0          446250     25800      8750       75000      2769379200 | |||
|   markers_scf                      94171      123        94048      0          2565850    24143      15856      42236      2273610521 | |||
|   markers_chr.mainChr.noOutliers   101084     7609       93475      0          1253750    27383      10000      78750      2768040450  # filtered IQRx,xy method | |||
|   markers_scf.mainChr.noOutliers   88450      52         88398      0          2669361    25327      16587      44374      2240202221 | |||
| === Orientation === | |||
| * at least 1564 out of 2641 placed scaffolds cannot be oriented; max=1.39M | |||
| === Possible misassemblies === | |||
|   nl  #scfid          Ch      medianPos       rangePos        scfLen              slope   #ChMark #Mark   #ctg | |||
|   1   7180002041225   1       120473301       9866350         9279975(9M)         0.0103  381     390     184        : break interval: 1674263..1674319 (1X frg_cvg region)  | |||
|   2   7180002041078   30      77506551        6310600         3094999(3M)         0.0148  65      69      46         : no clear position | |||
| ==== scf7180002041225 : 9.27 Mbp on Chr1 ==== | |||
| scflen=9279975 | |||
|   [[Media:Markers_7180002041225.txt|7180002041225.markers]] | |||
|   [[Media:7180002041225.png|7180002041225.markers]]  | |||
|   [[Media:7180002041225.cvg.png|7180002041225.cvg]]  | |||
|   ChrPos    ScfPos | |||
|   115272051 56300 | |||
|   ... | |||
|   116848301 1618430  | |||
|   116953301 9190560 | |||
|   ... | |||
|   125138401 1691145 | |||
|   break interval: 1674263..1674319 = 57bp frg_cvg=1 mate_cvg=0 bad_mate_cvg(nearby)=7 | |||
| ====  scf7180002041078 : 3.09 Mbp (not broken) ==== | |||
|   [[Media:Markers_7180002041078.txt|7180002041078.markers]] | |||
|   [[Media:7180002041078.png|7180002041078.markers]]  | |||
|   [[Media:7180002041078.1.cvg.png|7180002041078.1.cvg]] xrange [315268:466827] | |||
|   [[Media:7180002041078.2.cvg.png|7180002041078.2.cvg]] xrange [1191000:1229713] | |||
|   [[Media:7180002041078.3.cvg.png|7180002041078.3.cvg]] xrange [2733146:2750017] | |||
|   ChrPos   ScfPos | |||
|   72536051 2893791 | |||
|   ... | |||
|   72682301 2750017  | |||
|   72728651 1191000 | |||
|   ... | |||
|   73273651 466827 | |||
|   77112801 2733146 | |||
|   ... | |||
|   78464051 1229713 | |||
|   .. | |||
|   78595401 315268 | |||
| ====  scf7180002040971 : 1.93 Mbp (not broken) ==== | |||
| ==== Files ==== | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA/new/markers.all.redo/ | |||
| * Nucmer alignments to UMD2.0 chromosomes: all seem to agree pretty well | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA/new/markers.all.redo2/nucmer_UMD2.0/UMD2.0.Chr26-Chr26.png | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA/new/markers.all.redo2/nucmer_UMD2.0/UMD2.0.Chr27-Chr27.png | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA/new/markers.all.redo2/nucmer_UMD2.0/UMD2.0.Chr29-Chr29.png | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA/new/markers.all.redo2/nucmer_UMD2.0/UMD2.0.ChrX-Chr30.png | |||
| == Homo sapiens alignments == | |||
| Citations: | |||
| * mice closer to human than cow | |||
| * human and cow have approximately 201 homologous blocks of DNA  | |||
| * Independently generated mapping data provide another measure of the quality of the assembly. Snelling et al. [4] created a B. taurus map from three radiation hybrid panels, two genetic maps, and bacterial artificial chromosome (BAC) end sequences. We aligned all of the 17,254 markers (of which 17,193 are unique) in their composite map (Cmap) to both assemblies. A marker was considered as matching a chromosome if 90% of the marker sequence aligned with at least 95% identity. Of the Cmap markers, 14,620 align to the UMD2 assembly's chromosomes, versus 13,699 markers (6.3% fewer) for the BCM4 assembly. A small number of Cmap markers (119 and 82 for UMD2 and BCM4, respectively) mapped to a different chromosome from the one indicated in the Cmap data. | |||
| * homologous synteny block (HSB) :   | |||
| ** human-cow alignment extended for at least 250 Kbp  | |||
| ** it was not interrupted by an inversion or by an HSB on another chromosome.  | |||
| ** If two HSBs were interrupted by a gap of <3 Mbp and nothing else fell in that gap, the two blocks were merged. (Note that if a large region of synteny is interrupted by a distinct HSB, the interruption creates three HSBs.) | |||
| ** [http://genomebiology.com/2009/10/4/R42/table/T3 Number of homologous synteny blocks on each chromosome of the cow and human genomes] 201 -> 268 blocks | |||
| * Orienting contigs using cow-human alignments: | |||
| ** Scaffolds (sets of linked contigs) that were mapped onto chromosomes using only a single marker could not be oriented from the marker information alone. We oriented many of these scaffolds by taking advantage of the overall conserved synteny between cow and human. First, all cow scaffolds were aligned to the human genome using nucmer [14] with its maximal unique match (mum) option in order to avoid alignments of repetitive sequence. For each alignment of a previously unoriented scaffold to human, all alignments within 100 Kbp on each side were pulled out for analysis. A score S was computed for each unoriented scaffold, taking into account whether the scaffolds surrounding S on both sides (in cow) were mapped to a consistent set of locations in human. If the scaffolds surrounding S were oriented, and if a large majority of these scaffolds on both the left and right agreed on the orientation, then S was assigned that orientation. Using this procedure, 1,840 scaffolds containing 4,011 contigs were oriented. | |||
| ** We developed a similar procedure to assign unplaced contigs to chromosomes, again relying on conserved synteny between cow and human. First, all unplaced contigs were aligned as above. Mummer's 'delta-filter' program was then used to compute a one-to-one mapping of the unplaced contigs to human so that only the best aligning contig was considered at each region in human. For each unplaced contig's best alignment to human, the matching region in cow was identified via our human-cow syntenic map, and all contigs from this region were extracted for examination. We only considered placing a contig on a B. taurus chromosome if the order and direction of the surrounding contigs in cow matched the corresponding region in human. As above, we examined the alignments of nearby cow contigs that aligned within 100 kb of the unplaced contig's alignment in human. If the region of cow-human synteny contained no rearrangements, then the unplaced contig was placed at the location indicated by these alignments. Using this procedure, 1,046 contigs were placed on chromosomes. One consequence of this procedure was that a number of incompletely mapped genes (based on mRNA alignments) were completed. | |||
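A minimal sketch of the homologous synteny block (HSB) merging rule quoted above, for blocks on a single cow chromosome. It is a simplification under stated assumptions: only cow-side coordinates are compared, an inversion is represented as a change of strand, the merge is applied before the 250 Kbp length filter, and "nothing else fell in that gap" is taken to mean the two blocks are adjacent in sorted order. The tuple layout and names are illustrative.

```python
MIN_HSB = 250_000     # minimum block length (250 Kbp)
MAX_GAP = 3_000_000   # blocks separated by less than 3 Mbp may be merged

def merge_hsbs(blocks):
    """blocks: (start, end, human_chr, strand) tuples on one cow chromosome."""
    merged = []
    for start, end, human_chr, strand in sorted(blocks):
        if merged:
            prev_start, prev_end, prev_chr, prev_strand = merged[-1]
            same_block = prev_chr == human_chr and prev_strand == strand
            if same_block and start - prev_end < MAX_GAP:
                # extend the previous block across the gap
                merged[-1] = (prev_start, max(prev_end, end), human_chr, strand)
                continue
        merged.append((start, end, human_chr, strand))
    return [b for b in merged if b[1] - b[0] >= MIN_HSB]
```

A block from another human chromosome that sits between two otherwise mergeable blocks prevents the merge, so the region yields three HSBs, as in the parenthetical note above.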
| Issues: | |||
| * which alignment program to use?  | |||
| ** nucmer | |||
| ** blastz: difficult to parse  | |||
| ** blat | |||
| * nucmer: what parameters? | |||
| ** default | |||
| ** loose : -mum -l 12 -c 30 -g 1000  | |||
| ** ref:   24 homo sapiens chromosomes files | |||
| ** query: 26 bos taurus scaffold files | |||
| Total scaffolds: | |||
|                          elem       min        max        mean       med        n50        sum | |||
|   '''scf(len)               39844      316        34167202   68352      1361       7451988    2723419938''' | |||
| Aligned to HS:                                 | |||
|                          elem       min        max        mean       med        n50        sum              | |||
|   '''scf(len)               8860       385        34167202   301202     2257       7740810    2668658140'''      # on average, each scf aligns to 4 Chr | |||
|   scf(aligLen)           789272     250        19097      514        399        550        405946972       | |||
|   scf(alig%)             789272     58.04      100.00     79         78.54      78.72      . | |||
|   scf(len,maxX)          727        1002       7069350    194658     6530       1895564    141516604       # 220 in common with the 339 ones that have mark | |||
|   scf(len,noMark)        7025       385        7605708    7783       1654       27630      54680173    | |||
|   '''scf(len,mark)          1835       1001       34167202   1424511    40503      7970944    2613977967'''  | |||
| Not aligned to HS:                                 | |||
|                          elem       min        max        mean       med        n50        sum              | |||
|   scf(len,notAligned)    30984      316        163953     1767       1309       1555       54762139    | |||
| Aligned to markers:  | |||
|                          elem       min        max        mean       med        n50        sum | |||
|   '''scf(len)               2641       1000       34167202   994523     16673      8170786    2626536153''' | |||
|   1 marker               1595       1000       1396951    17482      4595       48276      27884078 | |||
|   2+ markers             1046       1055       34167202   2484371    596240     8217662    2598652075 | |||
|   abs(slope)<0.25        1946       1000       3094999    29136      7585       114274     56700529 | |||
|   abs(slope)>=0.25       696        4674       34167202   3692292    1845828    8401441    2569835619 | |||
| Scf summary: | |||
|                          elem       min        max        mean       med        n50        sum | |||
|   all                    39844      316        34167202   68352      1361       7451988    2723419938 | |||
|   1+mark                 2641       1000       34167202   994523     16673      8170786    2626536153 # best marker alignment | |||
|   1+align                8860       385        34167202   301202     2257       7740810    2668658140 # alignments > 250bp | |||
|   1+align(new)           9880       385        34167202   270458     1996       7740810    2672129298 # alignments > 200bp (3.74M more than 250bp align) | |||
|   1+mark or  1+align     9669       385        34167202   276443     2051       7740810    2672936792 | |||
|   0 mark and 1+align     7027       385        754,615    6603       1654       20219      46,400,303 !!! | |||
|   1+mark and 1+align     1791       1001       34167202   1305861    37914      7473583    2338798064 | |||
|   0 mark and 0 align     30179      316        88,326     1705       1305       1517       51,483,587 !!! | |||
|   1+mark or  1+align     9669       385        34167202   276443     2051       7740810    2,672,936,792 | |||
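A minimal sketch of how the rows of the "Scf summary" table can be derived as set operations on scaffold ids, given the set of scaffolds with at least one marker and the set with at least one human alignment. The function name and dictionary keys are illustrative; the input sets would come from the marker and nucmer summaries.

```python
def evidence_classes(all_scf, with_marker, with_align):
    """Partition scaffold ids by the evidence (markers, human alignments) they carry."""
    all_scf = set(all_scf)
    with_marker = set(with_marker)
    with_align = set(with_align)
    return {
        "1+mark or 1+align": with_marker | with_align,
        "0 mark and 1+align": with_align - with_marker,
        "1+mark and 1+align": with_marker & with_align,
        "0 mark and 0 align": all_scf - with_marker - with_align,
    }
```

Each resulting id set can then be turned into a row of length statistics (elem, min, max, ...) from the scaffold lengths.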
| Degenerate: | |||
|                              elem       min        max        mean       med        n50        sum | |||
|   all                        251413     65         39964      1003       984        994        252,279,234 | |||
|   all(2000+bp)               2128       2000       39964      3753       2827       3946       7,986,872 | |||
|   1+mark                     562        200        30168      2731       1274       5029       1,535,011 | |||
|   1+mark(2000+bp)            180        2007       30168      6079       4404       7510       1,094,252 | |||
|   1+align                    6429       251        39964      1487       1013       1287       9,566,273 | |||
|   1+align(2000+bp)           756        2004       39964      4820       3624       5646       3,644,556 | |||
| Issues: | |||
| * 24 scaffolds that have 200+ alignments to at least 2 HS chromosomes | |||
| * 37 scaffolds that have 100+ alignments to at least 2 HS chromosomes | |||
| * 83 scaffolds that have 50+ alignments to at least 2 HS chromosomes | |||
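A minimal sketch of the counts above, under one possible reading: a scaffold is reported when at least two human chromosomes each receive at least `min_alignments` of its alignments. The input format, a sequence of (scaffold id, HS chromosome) pairs with one pair per alignment, and the function name are assumptions.

```python
from collections import Counter, defaultdict

def multi_chr_scaffolds(alignments, min_alignments):
    """Scaffold ids with >= min_alignments alignments on each of at least 2 HS chromosomes."""
    per_scf = defaultdict(Counter)
    for scfid, hs_chr in alignments:
        per_scf[scfid][hs_chr] += 1
    return sorted(
        scfid
        for scfid, counts in per_scf.items()
        if sum(1 for n in counts.values() if n >= min_alignments) >= 2
    )
```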
| File location: | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_human | |||
| === Synteny method === | |||
| File location: | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA/synteny | |||
| * Position & orient scaffolds that aligned to HS  | |||
| * find placed neighboring scaffolds that don't "disagree" | |||
|                               elem       min        max        mean       med        n50        sum | |||
|   total                       7027       385        754615     6603       1654       20219      46,400,303 !!!   | |||
| * Scaffolds possibly assigned to the wrong chromosome | |||
|   cat summary.nucmer-markers.txt | perl ~/bin/synteny/getSynteny2.nucmer-marker.pl -minMarkers 2 | grep update:chr | getSummary.pl -i 6 -t "<2 markers" | |||
|   cat summary.nucmer-markers.txt | perl ~/bin/synteny/getSynteny2.nucmer-marker.pl -minMarkers 4 | grep update:chr | getSummary.pl -i 6 -t "<4 markers" | |||
|                               elem       min        max        mean       med        n50        sum | |||
|   <2 markers                  307        1001       937279     27898      9212       98674      8,564,850 | |||
|   <4 markers                  383        1001       1141560    33822      13426      160445     12,953,875 | |||
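For reference, the columns in these summary tables (elem, min, max, mean, med, n50, sum) can be computed as sketched below. This is not getSummary.pl itself; it assumes the common N50 definition (the length at which the running sum of lengths, sorted longest first, reaches half the total) and truncates mean and median to integers, which matches the rounding seen in the tables (e.g. 8,564,850 / 307 is printed as 27898).

```python
from statistics import median

def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("no lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

def summarize(lengths):
    """Return the summary columns as a dict; mean and med are truncated to int."""
    total = sum(lengths)
    return {
        "elem": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": total // len(lengths),
        "med": int(median(lengths)),
        "n50": n50(lengths),
        "sum": total,
    }

# Toy lengths, not taken from the assembly
print(summarize([2, 3, 4, 5, 6]))
```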
| ==== UMD2.0 ==== | |||
|                               elem       min        max       mean       med        N50        sum | |||
|   ctg                         74337      88        840370     35148      14148      79144      2612810882* | |||
| ==== UMD2.6.1 noVariants,noCont ==== | |||
|                               elem       min        max       mean       med        N50        sum | |||
|   scf                         2646       385       34167202   994701     20787      7139718    2631980624 | |||
|   ctg                         50755      65        1160130    50966      29450      88583      2586785910* | |||
| ==== UMD2.6.1 noVariants ==== | |||
|                                elem       min        max        mean       med        n50        sum | |||
|   scf                          39844      316        34167202   68352      1361       7451988    2,723,419,938 | |||
|   scf.placed                   4707       385        34167202   564225     10413      7800796    2655811430 | |||
|   scf.variants                 29436      723        51828      1714       1298       1514       50,461,989 | |||
|   scf.unplaced                 30575      316        451968     1845       1309       1601       56440663 | |||
|                                elem       min        max        mean       med        n50        sum | |||
|   ctg                          89994      65         1160130    29736      5180       95952      2676109378 | |||
|   ctg.placed                   53646      65         1160130    48650      26840      98428      2609925446 | |||
|   ctg.unplaced                 31510      101        207476     1761       1314       1556       55511410 | |||
|                                elem       min        max        mean       med        n50        sum | |||
|   deg(all)                     251413     65         39964      1003       984        994        252,279,234 | |||
|   deg(>2Kbp)                   2128       2000       39964      3753       2827       3946       7,986,872 | |||
|   deg.placed                   883        200        30246      2711       1303       4845       2,393,905 | |||
|   deg.variants                 4654       331        8039       1117       989        1024       5,200,553 | |||
|   deg.unplaced(?)              245343     65         39964      994        984        993        244047258 | |||
|   scf.unplaced                 978        316        451968     7932       3736       15828      7757882 | |||
|   scf.unplaced.0cvg            863        316        451968     7521       2286       15744      6,491,342 | |||
|   deg.unplaced                 747        2002       39964      4365       3509       4819       3261181 | |||
|   deg.unplaced.0cvg            734        2002       39964      4358       3507       4811       3,199,221 | |||
|   ct_deg.placed                53646      65         1160130    48650      26840      98428      2,609,925,446* | |||
|   UMD2.0-UMD2.6.1 gaps>1K summary | |||
|   id    count   min     max     median  sum | |||
|   1     775     999     64953   1683    2884541 | |||
|   ... | |||
|   30    2174    999     167808  1528    7739170 | |||
|   total 14936   999     167808  1638    53,526,388 | |||
| * align UMD2.6 Chr1..30 to UMD2.0 Chr1..30 0cvg regions | |||
|                elem       min        max        mean       med        n50        sum | |||
|   all          14936      1000       167809     3584       1639       6465       53541324 | |||
|   aligned      11091      1000       167809     4010       1692       7977       44475715 | |||
|   not_aligned  3845       1000       50546      2357       1529       2735       9065609   | |||
| Files: UMD2.0 regions not covered by UMD2.6.1 (chr aligned to itself) | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA/synteny/agp_markers.noVariants_nucmer.noVariants/nucmer_UMD2.0/Chr.0cvg.fa | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA/synteny/agp_markers.noVariants_nucmer.noVariants/nucmer_UMD2.0/Chr.0cvg.posmap | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA/synteny/agp_markers.noVariants_nucmer.noVariants/nucmer_UMD2.0/Chr.0cvg.summary | |||
| * UMD2.6.1 vs UMD2.0 Chr1..30 | |||
| * nucmer -mum -l 50 -c 250 | |||
| * max gap: Chr8:60134434..60228381=93948 | |||
| * gaps>10K: 191 , 4.2M total, 4.19M aligned without "-mum" | |||
|                       elem       min        max        mean       med        n50        sum | |||
|   gaps>10K            191        10077      93948      22106      15599      26551      4222342 | |||
| * realign all scaff without using nucmer "-mum" option: all gaps >10K align to large scaffolds !!! | |||
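The gap lengths quoted above use inclusive coordinates (end - begin + 1). A minimal sketch that checks the max gap, assuming regions are written as Chr:begin..end as in the bullet above (the helper name is hypothetical):

```python
import re

def region_length(region):
    """Parse 'Chr8:60134434..60228381' and return (chromosome, inclusive length)."""
    m = re.fullmatch(r"(\w+):(\d+)\.\.(\d+)", region)
    if m is None:
        raise ValueError(f"unrecognized region: {region!r}")
    begin, end = int(m.group(2)), int(m.group(3))
    return m.group(1), end - begin + 1

print(region_length("Chr8:60134434..60228381"))  # ('Chr8', 93948)
```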
| ==== UMD2.6.1 noVariants, add deg & UMD2.0 alignments ==== | |||
| --[[User:Dpuiu|Dpuiu]] 09:44, 22 June 2009 (EDT) | |||
|                                  elem       min        max        mean       med        n50        sum | |||
|   scf                            39844      316        34167202   68352      1361       7451988    2723419938 | |||
|   scf.variants                   29436      723        51828      1714       1298       1514       50461989 | |||
|   scf.noVariants                 10580      316        34167202   252690     1957       7740810    2673467579 | |||
|   scf.noVariants.ignore          4079       316        4985       1852       1541       1842       7558196    # less than 5K and placed through alignments inside a larger scaffold | |||
|   scf.noVariants.noIgnore        6506       385        34167202   409763     6673       7740810    2665923763 | |||
|   ctg.variants                   29967      723        51828      1641       1298       1475       49178546 | |||
|   deg.variants                   4654       331        8039       1117       989        1024       5200553 | |||
|   scf_deg.placed                 4933       385        34167202   540139     11548      7740810    2664507818 | |||
|   scf_deg.markers                2003       1020       34167202   1302992    30456      8170786    2609894380 | |||
|   scf_deg.alignHS                1818       385        3278163    27070      12322      47406      49214215 | |||
|   scf_deg.alignUMD2.0            1112       1002       73626      4855       2948       6898       5399223 | |||
|   scf.placed                     4044       385        34167202   657786     15617      7740810    2660090134 | |||
|   scf.markers                    1825       1020       34167202   1429490    37672      8170786    2608820353 | |||
|   scf.alignHS                    1587       385        3278163    30006      13998      50604      47620297 | |||
|   scf.alignUMD2.0                632        1002       73626      5774       5137       9046       3649484 | |||
|   deg.placed                     889        2002       39964      4969       3771       6165       4417684 | |||
|   deg.markers                    178        2007       30168      6033       4336       7350       1074027 | |||
|   deg.alignHS                    231        2049       39964      6900       6289       7337       1593918 | |||
|   deg.alignUMD2.0                480        2002       15992      3645       2767       3842       1749739 | |||
|   ctg_deg.placed                 54129      65         1160130    48371      26493      98075      2,618,296,162* | |||
|   ctg.placed                     53240      65         1160130    49096      27309      98285      2,613,878,478 | |||
|   deg.placed                     889        2002       39964      4969       3771       6165       4,417,684 | |||
| ==== HS-BT Synteny map ==== | |||
| * Trust scaffolds with 4+ markers | |||
| * Scaffolds with 3 or fewer markers must have a close neighbor that agrees with them | |||
|   same HS & BT Chr | |||
|   maxCount=2                      # at most 2 scaffolds away | |||
|   minMarkers=4 | |||
|   minRatio=0.66 & maxRatio=1.5    # distance ratio; maxDistance not used !!! | |||
|   => 226 synteny regions !!! | |||
|   join2.pl nucmer_*lsf markers_scf.*lsf | \ | |||
|          ~/bin/filterMarkers.pl -minMarkers 4 | \ | |||
|          ~/bin/getSyntenyBlock.pl | \ | |||
|          grep -v ^# | grep -v ^$ | ~/bin/flipSummary.pl | sort -nk2 -nk5 | ~/bin/tab2tab.pl | \ | |||
|          perl -ane  'print $P[13]," ",$F[13],"\n" if($F[13]-$P[13]==1); print $F[13]," ",$P[13],"\n" if($P[13]-$F[13]==1); @P=@F;' | sort -u -n | \ | |||
|          ~/bin/mergeMap.pl >! hs-bt.map.tmp | |||
|   join2.pl nucmer_*lsf markers_scf.*lsf | \ | |||
|          ~/bin/filterMarkers.pl -minMarkers 4 | \ | |||
|          ~/bin/getSyntenyBlock.pl -map  hs-bt.map.tmp | \ | |||
|          ~/bin/tab2tab.pl | grep '#' | sed 's/#//' >  hs-bt.map | |||
|   ~/bin/map-draw.pl -refLen hs.infoseq -qryLen bt.infoseq hs-bt.map > ! hs-bt.png | |||
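The perl one-liner in the first pipeline keeps pairs of consecutive rows whose column 13 values differ by exactly 1, printing the smaller value first; `sort -u -n` then removes repeats. A Python sketch of the same step follows. Column 13 is presumably a block or cluster index, but that is an assumption; the other scripts in the pipeline are not reproduced here. Unlike perl, this version skips the first row (perl compares it against an empty previous row) and rows that are too short.

```python
def adjacent_cluster_pairs(rows, col=13):
    """Return sorted unique (low, high) pairs from consecutive rows whose
    values in column `col` (0-based, as in perl's $F[13]) differ by 1."""
    pairs = set()
    prev = None
    for fields in rows:
        if len(fields) <= col:
            continue  # perl would treat the missing field as 0; skipped here
        cur = int(fields[col])
        if prev is not None and abs(cur - prev) == 1:
            pairs.add((min(prev, cur), max(prev, cur)))
        prev = cur
    return sorted(pairs)

# Toy rows: only column 13 matters, the rest is padding
toy_rows = [["."] * 13 + [str(v)] for v in (1, 2, 2, 5, 4)]
print(adjacent_cluster_pairs(toy_rows))  # [(1, 2), (4, 5)]
```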
| Problems:  | |||
|   7180002041220: BT.Chr2 ok | |||
|   7180002041025: BT.Chr4 | |||
|   7180002041222: BT.Chr7 | |||
|   7180002041228: BT.Chr8 | |||
|   7180002040195: BT.Chr8; HS.Chr23 del !!! | |||
|   7180002041001: BT.Chr8; HS.Chr23 del !!! | |||
|   7180002040851: BT.Chr30; HS.Chr3 -> HS.Chr23 | |||
|   7180002041008: BT.Chr30; HS.Chr7 -> HS.Chr23 | |||
| * [[Media:hs-bt.png|hs-bt.png]] Map picture | |||
| * Map: | |||
|   #HS-ref       begin     end       len       HS-clust  BT-ref    begin     end       BT-clust  #scf | |||
|   01            870247    12637274  11767027  16        46893147  58621226  11728079  -1        5 | |||
|   01            16674545  30031540  13356995  2         126668764 139027145 12358381  -2        4 | |||
|   01            32016003  68894564  36878561  3         83066354  120874874 37808520  -3        10 | |||
|   01            68907947  122485896 53577949  3         23819464  83221225  59401761  -4        6 | |||
|   01            143725713 143737025 11312     3         25097798  25109110  11312     5         1 | |||
|   ... | |||
|   23            148835146 151656716 2821570   30        32889252  35700335  2811083   -276      3 | |||
|   23            152157769 154641974 2484205   30        39322867  42104054  2781187   -277      2 | |||
|   23            154387850 154403684 15834     30        39353299  39369133  15834     278       1 | |||
| Files: | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA/synteny/map.4/hs-bt.map | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA/synteny/map.4/hs-bt.png | |||
| ==== Overlaps ==== | |||
| Cases: | |||
|   1. CONTAINED scaffolds (clear variants) | |||
|   2. single BEGIN/END between 2 scaffolds: 2 scaffolds could be merged | |||
|   3. scaffold closing a sequence: 3 scaffolds could be merged | |||
|   4. multiple BEGIN/END/CONTAIN* between 2 scaffold contigs | |||
| 1. CONTAINED ~ 4731 cases | |||
|   cat nucmer_scf.ovl/all-all.contained.ids  marker_scf.ovl/all-all.contained.ids  | sort -u > all-all.contained.ids  | |||
| * Summary | |||
|                         elem       <2Kbp    2..10Kbp    >10Kbp   min    max        mean       med        n50        sum | |||
|   variants(all)         4731       3503     1085        143      723    75436      2513       1429       3469       11891882 | |||
|   variants(1+markers)   541        311      170         60       1001   75421      4466       1690       9947       2416312 | |||
| * Example: longest contained scaffold 75Kbp | |||
|   #scaffold alignments | |||
|   117893  126367  |       8520    1       |       8475    8520    |       94.06   |       562884  75576   |       1.51    11.27   |       7180002040891   7180002040646 | |||
|   286811  289412  |       72974   75576   |       2602    2603    |       93.29   |       562884  75576   |       0.46    3.44    |       7180002040891   7180002040646 | |||
|   421477  421977  |       9727    10223   |       501     497     |       95.41   |       562884  75576   |       0.09    0.66    |       7180002040891   7180002040646 | |||
|   171970  180765  |       73605   64798   |       8796    8808    |       96.74   |       274750  75576   |       3.20    11.65   |       7180002040912   7180002040646 | |||
|   180765  195989  |       64652   49398   |       15225   15255   |       96.05   |       274750  75576   |       5.54    20.18   |       7180002040912   7180002040646 | |||
|   196223  203368  |       49244   42105   |       7146    7140    |       96.81   |       274750  75576   |       2.60    9.45    |       7180002040912   7180002040646 | |||
|   205935  208456  |       38652   36130   |       2522    2523    |       96.12   |       274750  75576   |       0.92    3.34    |       7180002040912   7180002040646 | |||
|   207096  210496  |       42082   38673   |       3401    3410    |       96.07   |       274750  75576   |       1.24    4.51    |       7180002040912   7180002040646 | |||
|   209529  217353  |       36109   28272   |       7825    7838    |       95.50   |       274750  75576   |       2.85    10.37   |       7180002040912   7180002040646 | |||
|   213859  219155  |       25520   20220   |       5297    5301    |       95.80   |       274750  75576   |       1.93    7.01    |       7180002040912   7180002040646 | |||
|   221183  227630  |       20236   13870   |       6448    6367    |       93.63   |       274750  75576   |       2.35    8.42    |       7180002040912   7180002040646 | |||
|   227923  228686  |       13884   13126   |       764     759     |       95.16   |       274750  75576   |       0.28    1.00    |       7180002040912   7180002040646 | |||
|   228084  231413  |       13104   9760    |       3330    3345    |       94.10   |       274750  75576   |       1.21    4.43    |       7180002040912   7180002040646 | |||
|   #contig alignments | |||
|   1929    10403   |       8520    1       |       8475    8520    |       94.06   |       26592   8520    |       31.87   100.00  |       7180002040891.4.10      7180002040646.1.8       [CONTAINS] | |||
|   9084    11685   |       30871   33473   |       2602    2603    |       93.29   |       276174  33473   |       0.94    7.78    |       7180002040891.8.10      7180002040646.8.8 | |||
|   143750  144250  |       1187    1683    |       501     497     |       95.41   |       276174  4564    |       0.18    10.89   |       7180002040891.8.10      7180002040646.2.8 | |||
|   163797  172592  |       31502   22695   |       8796    8808    |       96.74   |       233551  33473   |       3.77    26.31   |       7180002040912.3.5       7180002040646.8.8 | |||
|   172592  187816  |       22549   7295    |       15225   15255   |       96.05   |       233551  33473   |       6.52    45.57   |       7180002040912.3.5       7180002040646.8.8 | |||
|   188050  195195  |       7141    2       |       7146    7140    |       96.81   |       233551  33473   |       3.06    21.33   |       7180002040912.3.5       7180002040646.8.8 | |||
|   197762  200283  |       2523    1       |       2522    2523    |       96.12   |       233551  2523    |       1.08    100.00  |       7180002040912.3.5       7180002040646.6.8       [CONTAINS] | |||
|   198923  202323  |       3410    1       |       3401    3410    |       96.07   |       233551  3411    |       1.46    99.97   |       7180002040912.3.5       7180002040646.7.8       [CONTAINS] | |||
|   201356  209180  |       7838    1       |       7825    7838    |       95.50   |       233551  7838    |       3.35    100.00  |       7180002040912.3.5       7180002040646.5.8       [CONTAINS] | |||
|   205686  210982  |       12396   7096    |       5297    5301    |       95.80   |       233551  12396   |       2.27    42.76   |       7180002040912.3.5       7180002040646.3.8 | |||
|   213010  219457  |       7112    746     |       6448    6367    |       93.63   |       233551  12396   |       2.76    51.36   |       7180002040912.3.5       7180002040646.3.8 | |||
|   219750  220513  |       760     2       |       764     759     |       95.16   |       233551  12396   |       0.33    6.12    |       7180002040912.3.5       7180002040646.3.8 | |||
|   219911  223240  |       4564    1220    |       3330    3345    |       94.10   |       233551  4564    |       1.43    73.29   |       7180002040912.3.5       7180002040646.2.8       [CONTAINS] | |||
|   #marker & alignment summary | |||
|   #id            BT-ref    #markers  slope     begin     end       len       HS-ref    #align    slope     begin     end       len | |||
|   7180002040891  4         1         0         13956053  14518937  562884    22        14        -1.5687   20838730  21401614  562884 | |||
|   7180002040163  8         1         -1        38079677  38178351  98674     22        3         1.0264    21133894  21232568  98674     update:dir:7180002040646 | |||
|   7180002040646  8         .         1         38088797  38164373  75576     22        4         -0.9833   21143014  21218590  75576     assign:Chr:7180002040163,7180002040163 | |||
| 2. ~ 349 cases (minus the 3 that belong to case 3) | |||
|   cat all-all.begin.ids all-all.end.ids | sort -u | wc -l | |||
| 3.  ~ 3 cases | |||
|   intersect.pl all-all.begin.ids all-all.end.ids | perl -ane 'print $_ if($F[1]=~/1.1$/);' | |||
|   7180002032818.1.1       23259 | |||
|   7180002036943.1.1       28355 | |||
|   7180002040409.1.1       25219 # merges 2 scaff | |||
|   1       2811    |       2807    1       |       2811    2807    |       99.11   |       23259   5134    |       12.09   54.67   |       7180002032818.1.1       7180002032811.1.2[BEGIN]  [BEGIN] | |||
|   22101   23259   |       7238    6079    |       1159    1160    |       99.66   |       23259   7238    |       4.98    16.03   |       7180002032818.1.1       7180002032811.2.2[END]    [END] | |||
|   1       2025    |       2066    4090    |       2025    2025    |       99.80   |       25219   4090    |       8.03    49.51   |       7180002040409.1.1       7180002033541.1.1[BEGIN]  [END] | |||
|   23394   25219   |       5761    3930    |       1826    1832    |       98.42   |       25219   5761    |       7.24    31.80   |       7180002040409.1.1       7180002033538.1.1[END]    [END] | |||
|   cd  nucmer_scf.ovl | |||
|   intersect.pl  all-all.begin.ids all-all.end.ids | |||
|   7180002032818   23259 | |||
|   7180002036943   28355 | |||
|   7180002040409   25219 | |||
| 4. ~10 cases | |||
|   cd nucmer_ctg.ovl/translated/ | |||
|   cat all-all.annotated.coords | egrep 'BEGIN|END|CONTAIN' | p 'next if($F[6]<5000); next if($F[7]<5000); print $_;' | p '$F[17]=~/^([^.]+)(.+)/;  $F[17]=$1 ; $F[18]=~/^([^.]+)/; $F[18]=$1; print $F[17],"\t",$F[18],"\n";' | count.pl -m 2 | |||
|   7180002041235   brk002041306a   3 | |||
|   7180002038888   7180002034914   2 | |||
|   7180002041059   7180002040095   2 | |||
|   7180002040879   7180002040934   2 | |||
|   cnt0002041350   7180002040907   2 | |||
|   7180002041341   7180002040894   2 | |||
|   7180002041015   cnt0002040938   2 | |||
|   7180002039401   7180002040397   2 | |||
|   7180002037358   7180002031425   2 | |||
|   7180002040915   7180002039470   2 | |||
|   cat all-all.annotated.coords | egrep 'BEGIN|END|CONTAIN' | grep ... | |||
|   12363   21143   |       1       8837    |       8781    8837    |       93.43   |       21143   50443   |       41.53   17.52   |       7180002041235.23.33     brk002041306a.5.7[END]    [BEGIN] | |||
|   1       9272    |       5591    14944   |       9272    9354    |       94.47   |       9272    50443   |       100.00  18.54   |       7180002041235.25.33     brk002041306a.5.7[CONTAINED] | |||
|   71049   97689   |       31941   5220    |       26641   26722   |       96.71   |       97689   31941   |       27.27   83.66   |       7180002041235.29.33     brk002041306a.6.7[END]    [END] | |||
|   57      3896    |       14743   10903   |       3840    3841    |       95.94   |       3896    22667   |       98.56   16.95   |       7180002041235.32.33     brk002041306a.3.7[CONTAINED] | |||
|   1       8942    |       1       8947    |       8942    8947    |       99.49   |       20792   8947    |       43.01   100.00  |       7180002038888.1.1       7180002034914.1.2[CONTAINS] | |||
|   9670    20792   |       1       11091   |       11123   11091   |       99.27   |       20792   11091   |       53.50   100.00  |       7180002038888.1.1       7180002034914.2.2[CONTAINS] | |||
|   1       8705    |       3932    12610   |       8705    8679    |       96.17   |       8705    30654   |       100.00  28.31   |       7180002041059.1.26      7180002040095.1.1[CONTAINED] | |||
|   1       4777    |       12989   17697   |       4777    4709    |       95.32   |       4777    30654   |       100.00  15.36   |       7180002041059.2.26      7180002040095.1.1[CONTAINED] | |||
|   1       6588    |       17999   24607   |       6588    6609    |       96.15   |       6591    30654   |       99.95   21.56   |       7180002041059.3.26      7180002040095.1.1[CONTAINED] | |||
|   630     16709   |       16024   1       |       16080   16024   |       97.09   |       16901   16024   |       95.14   100.00  |       7180002040879.2.14      7180002040934.2.87[CONTAINS] | |||
|   1       5876    |       5882    1       |       5876    5882    |       98.51   |       39657   13643   |       14.82   43.11   |       7180002040879.3.14      7180002040934.1.87[BEGIN] [BEGIN] | |||
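The [BEGIN]/[END]/[CONTAINS]/[CONTAINED] tags in the annotated coords can be approximated with a simple end-touching rule. The sketch below is inferred from the rows above and is not the annotation script that produced them; the 100 bp tolerance is an assumption. It agrees with the tags on the annotated contig rows shown in this section except the [CONTAINS] row with 73.29% query coverage, so the original script evidently uses a more permissive containment test.

```python
def classify_overlap(s1, e1, s2, e2, len_r, len_q, tol=100):
    """Classify one nucmer alignment row (1-based coordinates).

    Returns 'CONTAINS' (query covered end to end), 'CONTAINED' (reference
    covered end to end), a (ref_tag, qry_tag) tuple of 'BEGIN'/'END' when the
    alignment touches one end of each sequence, or None otherwise.
    tol: unaligned bases allowed at a sequence end (assumed value).
    """
    r_lo, r_hi = min(s1, e1), max(s1, e1)
    q_lo, q_hi = min(s2, e2), max(s2, e2)
    r_begin, r_end = r_lo - 1 <= tol, len_r - r_hi <= tol
    q_begin, q_end = q_lo - 1 <= tol, len_q - q_hi <= tol
    if q_begin and q_end:
        return "CONTAINS"
    if r_begin and r_end:
        return "CONTAINED"
    r_tag = "BEGIN" if r_begin else "END" if r_end else None
    q_tag = "BEGIN" if q_begin else "END" if q_end else None
    if r_tag and q_tag:
        return (r_tag, q_tag)
    return None

# Rows copied from the coords above: S1, E1, S2, E2, LEN R, LEN Q
print(classify_overlap(1929, 10403, 8520, 1, 26592, 8520))      # CONTAINS
print(classify_overlap(1, 9272, 5591, 14944, 9272, 50443))      # CONTAINED
print(classify_overlap(12363, 21143, 1, 8837, 21143, 50443))    # ('END', 'BEGIN')
print(classify_overlap(219750, 220513, 760, 2, 233551, 12396))  # None
```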
| == Scaffold links == | |||
| Try to identify scaffolds that fit the following criteria: | |||
| * have no markers and no alignments to HS | |||
| * are linked by 2+ links to a single scaffold that has markers/alignments to HS | |||
| * are not variants (2Kbp+ of unique sequence) | |||
|                                                   elem       min        max        mean       med        n50        sum | |||
|   linked                                          6169       316        88326      2408       1480       2709       14,860,479 | |||
|   linked(2+mates)                                 2109       316        44120      2595       1558       3092       5,474,510 | |||
|   linked(2+mates to a single scf)                 2057       316        44120      2516       1547       2850       5,177,182 | |||
|   linked(2+mates to a single scf, no variant)     112        316        43782      7278       1987       23338      815,223        # 46>2Kbp; 25>10Kbp | |||
| => 112 scaffolds & 0.81Mbp could be added to the chromosomes !!! | |||
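The three criteria can be expressed as a filter over mate-pair links. The sketch below is illustrative only: the input structures and names are hypothetical, and it reads "a single scaffold" strictly, i.e. the candidate has links to exactly one other scaffold.

```python
from collections import Counter, defaultdict

def linkable_scaffolds(links, anchored, variants, min_links=2):
    """Map each candidate scaffold to the anchored scaffold it could join.

    links:    iterable of (scf_a, scf_b), one pair per mate link between scaffolds
    anchored: set of scaffolds that have markers or alignments to HS
    variants: set of scaffolds already classified as haplotype variants
    Assumption: a candidate must be linked to exactly one partner scaffold.
    """
    partners = defaultdict(Counter)
    for a, b in links:
        partners[a][b] += 1
        partners[b][a] += 1
    result = {}
    for scf, counts in partners.items():
        if scf in anchored or scf in variants:
            continue
        if len(counts) != 1:
            continue  # linked to more than one scaffold
        (partner, n), = counts.items()
        if partner in anchored and n >= min_links:
            result[scf] = partner
    return result

# Toy links, not taken from the assembly
toy_links = [("u1", "A"), ("u1", "A"), ("u2", "A"), ("u3", "A"), ("u3", "A"),
             ("u3", "B"), ("v1", "A"), ("v1", "A")]
print(linkable_scaffolds(toy_links, anchored={"A", "B"}, variants={"v1"}))
```

In the toy data only u1 qualifies: u2 has a single link, u3 is linked to two scaffolds, and v1 is a variant.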
| == BT alignments == | |||
| * UMD2.6.1 vs UMD2.6.1 | |||
| # Marker scaffolds against themselves (2642 total scaffolds) : nucmer -maxmatch -l 40 -c 2500 -g 250 | |||
| # Mapped scaffolds without markers against marker scaffolds:   nucmer -maxmatch -l 40 -c 250 | |||
| # Mapped scaffolds without markers against themselves:         nucmer -maxmatch -l 40 -c 250 | |||
| File location: | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_Chr/ | |||
| == Other issues == | |||
| === Chr10 gene duplication === | |||
| * LOC100298457 | |||
| * See /nfshomes/dpuiu/Readmes/bos_taurus.runCA.13.txt | |||
| === Scf breaks (Mike Roberts) === | |||
|   :::::::::::::: | |||
|   scf.break.ids | |||
|   :::::::::::::: | |||
|   #ctg            scf             beg     end     dir | |||
|   7180002015438   7180002038435   0       32822   f       # ok: 4 markers from Chr18 | |||
|   7180002021916   7180002040442   0       24559   f       # ok: 1 marker  from Chr10 | |||
|   7180001725791   7180002040808   466123  562461  f       # ok: 25 markers from Chr1, 1 marker from Chr2 | |||
|   7180001727899   7180002040844   8236574 8479429 f       # also found by us; 3 Chr30 markers in the middle | |||
|   7180001854650   7180002041103   5230302 5294734 f       # also found by us; 4 Chr15 markers at 3' | |||
|   7180002003578   7180002041216   300250  392069  f       # also found by us; 7 Chr5 markers at 5' | |||
|   7180002020010   7180002041293   4482477 4600045 f       # ok | |||
|   7180001722390   7180002041353   0       51674   f       # ok | |||
|   :::::::::::::: | |||
|   scf.excised.ids | |||
|   :::::::::::::: | |||
|   #ctg            scf             beg     end     dir | |||
|   7180002008629   7180002034664   82740   105723  f       # no markers | |||
|   7180001786240   7180002040913   2076746 2112375 f       # ok | |||
|   7180001787022   7180002040927   1075352 1109350 f       # ok | |||
|   7180001787352   7180002040931   4956099 4967823 f       # ok | |||
|   7180001789387   7180002040981   6641924 6671967 f       # ok | |||
|   7180001789575   7180002040984   3095587 3154302 f       # ok | |||
|   7180002003269   7180002041209   6350890 6470971 f       # ok | |||
|   7180002025281   7180002041322   989366  1227636 f       # ok | |||
|   7180002026741   7180002041328   343726  414329  f       # ok | |||
|   7180002029433   7180002041343   834493  888144  f       # ok | |||
|   7180001726451   7180002041383   7893629 7901645 f       # ok | |||
| === LOC100298457 (duplicate gene on Chr10) === | |||
| Problem:  | |||
|   Is the LOC100298457 gene (cow Chr10) a variant of the MFSD3 gene (cow Chr14)? | |||
| UMD2.0 | |||
|  Chr10   15186759        15189292        833     W       7180003260677   1       2534    +     # gene LOC100298457 | |||
|  Chr14   1839355         1958202         533     W       7180003326040   1       118848  +     # gene MFSD3 | |||
| * LOC100298457 aligns to 2 UMD2.6 scaffolds:  | |||
|   scf7180002041112 (1.56Mbp; 34 contigs)  | |||
|   scf7180002033841 (2.44Kbp; 1 contig) | |||
| * scf7180002041112 contains 39 Bos taurus Chr14 markers and has 206 alignments to human chromosome 8 | |||
| * scf7180002033841 contains no Bos taurus markers and has 1 alignment to human chromosome 8. | |||
| * scf7180002033841 5' aligns to scf7180002041112 (1.56Mbp; 34 contigs, cow Chr14) and to a human Chr8 region that maps to cow Chr14 | |||
| * scf7180002033841 3' aligns to scf7180002041157 (6.15Mbp; 99 contigs, cow Chr10) | |||
| * scf7180002033841 & scf7180002041157 are linked by 2 mate pairs (inserts from a 3kbp BCM shotgun library) | |||
| * LOC100298457 vs UMD2.6 scf: | |||
|    [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS] | |||
|   =============================================================================================================================== | |||
|       1       88  |  1138313  1138226  |       88       88  |   100.00  |     1344  1561688  |     6.55     0.01  | LOC100298457       scf7180002041112 | |||
|      80      192  |  1138150  1138038  |      113      113  |   100.00  |     1344  1561688  |     8.41     0.01  | LOC100298457       scf7180002041112 | |||
|     190      316  |  1137974  1137848  |      127      127  |   100.00  |     1344  1561688  |     9.45     0.01  | LOC100298457       scf7180002041112 | |||
|     315      464  |  1137770  1137621  |      150      150  |   100.00  |     1344  1561688  |    11.16     0.01  | LOC100298457       scf7180002041112 | |||
|     455     1344  |  1137322  1136432  |      890      891  |    99.78  |     1344  1561688  |    66.22     0.06  | LOC100298457       scf7180002041112 | |||
|       1     1344  |     1349        6  |     1344     1344  |   100.00  |     1344     2445  |   100.00    54.97  | LOC100298457       scf7180002033841 | |||
| * UMD2.6 scf* vs scf7180002033841: | |||
|      [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS] | |||
|   =============================================================================================================================== | |||
|   1136427  1137322  |        1      895  |      896      895  |    99.78  |  1561688     2445  |     0.06    36.61  | scf7180002041112   scf7180002033841 | |||
|   4060495  4061426  |     2445     1521  |      932      925  |    97.76  |  6157473     2445  |     0.02    37.83  | scf7180002041157   scf7180002033841 | |||
| Bos taurus marker summary: | |||
|   #id                BT-ref    #markers  slope     begin     end       len | |||
|   scf7180002041112   Chr14     39        -0.8518   3377707   4939395   1561688 | |||
|   scf7180002041157   Chr10     255       1.0035    12831593  18989066  6157473 | |||
|   scf7180002033841   ?         0 | |||
| Homo sapiens alignment summary: | |||
|   #id                HS-ref    #align    slope     begin     end       len | |||
|   scf7180002041112   Chr8      206       0.6701    144381103 145942791 1561688 | |||
|   scf7180002041157   Chr15     1667      0.987     62245836  68403309  6157473 | |||
|   scf7180002033841   Chr8      1         1         145705320 145707765 2445 | |||
|         [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS] | |||
|   =============================================================================================================================== | |||
|   143996007 143996305  |    74278    74572  |      299      295  |    82.45  | 146274826  1561688  |     0.00     0.02  |  1  1  NC_000008        7180002041112 | |||
|   ... | |||
|   146249212 146249657  |  1268887  1268446  |      446      442  |    80.89  | 146274826  1561688  |     0.00     0.03  |  1 -1  NC_000008        7180002041112 | |||
|     63013167 63013429  |   887697   887966  |      263      270  |    78.15  | 100338915  6157473  |     0.00     0.00  |  1  1  NC_000015        7180002041157 | |||
|     ... | |||
|     68369047 68369417  |  6154757  6155112  |      371      356  |    83.06  | 100338915  6157473  |     0.00     0.01  |  1  1  NC_000015        7180002041157 | |||
|   145705522 145706206  |      202      886  |      685      685  |    81.28  | 146274826     2445  |     0.00    28.02  |  1  1  NC_000008        7180002033841 | |||
| scf7180002033841: 6 reads  | |||
|   read            scf             begin   end     dir | |||
|   1120017508      7180002033841   0       1018    f | |||
|   1120020725      7180002033841   227     1109    f | |||
|   1120020722      7180002033841   724     1720    r | |||
|   1120020726      7180002033841   1006    2073    f       diffScaffold | |||
|   1120017511      7180002033841   1303    2175    r | |||
|   1120020728      7180002033841   1370    2445    f       diffScaffold | |||
| scf7180002033841: 4 mates | |||
|   read1           read2           scf1            scf2 | |||
|   1120017508      1120020722      7180002033841   7180002033841 | |||
|   1120017511      1120020725      7180002033841   7180002033841 | |||
|   1120020726      1120017512      7180002033841   7180002041157  diffScaffold | |||
|   1120020728      1120017514      7180002033841   7180002041157  diffScaffold | |||
| scf7180002041157: 69095 reads; frg 1120017514 & 1120017512 positions close to the region aligned to scf7180002033841 (4060495-4061426) | |||
|   read            scf             begin   end     dir | |||
|   607312594       7180002041157   0       866     f | |||
|   ... | |||
|   1120017512      7180002041157   4059332 4060396 f       diffScaffold | |||
|   1120017514      7180002041157   4058851 4059931 f       diffScaffold | |||
|   .... | |||
|   583956204       7180002041157   6157144 6157228 r | |||
| scf7180002041157: 99 contigs | |||
|   count ctg                scf                begin   end     dir | |||
|   1     ctg7180001926175   scf7180002041157   0       81467   f | |||
|   .. | |||
|   64    ctg7180001926231   scf7180002041157   4039922 4046680 f # 6758bp ctg;    800bp gap following the contig | |||
|   65    ctg7180001926232   scf7180002041157   4047480 4193693 f # 146213bp ctg ; 20bp  gap following the contig | |||
|   66    ctg7180001926233   scf7180002041157   4193713 4208047 f # 14334bp ctg | |||
|   .. | |||
|   99    ctg7180001926260   scf7180002041157   6152691 6157473 f | |||
| frg 1120017514 & 1120017512 positions on ctg7180001926232 | |||
|   1120017514  ctg7180001926232   11371   12451   f  # reads are 11kbp(>3Kb lib mean) inside the contig | |||
|   1120017512  ctg7180001926232   11852   12916   f  # reads are 11kbp(>3Kb lib mean) inside the contig | |||
| === Missing genes === | |||
| * 5 genes not found by Liliana using ESTaligner but found using gmap (%id<95) : they were on the haplotype variants | |||
| ===  Chr27 -> Chr21 === | |||
| * The centromeric end of BTA27 is actually placed on BTA21  in UMD2.6 (which version?) | |||
| == Chromosomes == | |||
| === synteny.redo2 === | |||
| * summary.txt : 8122 scf + 6322 deg => 14444 seqs  (placed using markers or synteny to HS) | |||
| * from summary.txt removed: | |||
|    58 scf.questionable.ids (44 placed, 14 linked) | |||
|   100 ctg.questionable.ids | |||
|   931 deg.questionable.ids | |||
| === UMD_2.6.d_g === | |||
|   .                    elem       <2000      >2000      min    max        mean       med        n50        sum | |||
|   haplotypes_contigs   3828       2701       1127       471    123243     2503       1580       2590       9,584,985 | |||
| === UMD_2.6.a_g_070109 === | |||
| Gaps: 65900 | |||
| * all 65900 gaps are "fragment yes" | |||
| === UMD_2.6.a_g_070509 === | |||
| * Combine UMD_2.6.a_g_070109 (Guillaume's assembly) and UMD2.6.1 (Daniela's) | |||
| === UMD_2.6.a_g_070709 === | |||
| * Remove from UMD_2.6.a_g_070509  | |||
| ** ~ 21 ChrY ctg & ~ 39 ChrY deg | |||
| ** ~ 4206 haplotype variants (6.63Mbp) within 1K from one another  | |||
|   .                                      elem       <2000         >2000      min    max        mean       med        n50        sum | |||
|    ctg_deg.variants.placed                6654       5504         1150       263    42158      1761       1189       1772       11719653 | |||
|    ctg_deg.variants.placed.sameChr        5374       4455         919        263    42158      1748       1152       1775       9393815 | |||
|    ctg_deg.variants.placed.within_100K    4790       4100         690        263    42158      1651       1128       1605       7911160 | |||
|    ctg_deg.variants.placed.within_1K*     4206       3679         527        263    42158      1577       1107       1508       6633118  | |||
| Summary: | |||
|                                ctg+deg    <2000      >=2000     min    max        mean       med        n50        sum | |||
|   Chr1..29,X                   72197      20763      51434      65     1160130    36536      13055      97328      2,637,809,286 | |||
|   ChrU                         3752       2587       1165       362    179692     3284       1447       6427       12,324,356 | |||
|   ChrY-contigs                 315        266        49         224    26490      2249       974        6679       708,535 | |||
|   contigs.haplotype-variants   40198      36720      3478       263    51828      1460       1203       1361       58,698,457 | |||
|   deg.unplaced.less_2K         224945     224945     0          65     1996       972        983        990        218,847,978 | |||
| === Issues === | |||
| * 490 scf don't have all ctgs placed (865 ctgs) | |||
| * 699 reliable contigs (3.25Mbp) unplaced | |||
|   difference.pl ctg.reliable.ids UMD_2.6.a_g_070709/Chr.posmap | getSummary.pl -i 2 -z 2000 | |||
|   elem       <2000      >2000      min    max        mean       med        n50        sum | |||
|   699        405        294        362    123243     4660       1759       14809      3257635 | |||
| * Rearrangements UMD2.0 vs UMD_2.6.a_g_070709 : ~ 25 ctgs>50K ; ~ 12 scf | |||
| * AGP file format: "fragment yes" should be preserved even if gap type=U | |||
|   grep -A 2 7180001925241 UMD_2.6.a_g_070109/Chr.agp | |||
|   Chr30   131686408       131762880       14157   W       7180001925241   1       76473   + | |||
|   Chr30   131762881       131762980       14158   U       100     fragment        yes | |||
|   Chr30   131762981       131796167       14159   W       7180001925242   1       33187   + | |||
|   grep -A 2 7180001925241 ../UMD_2.6.a_g_070709/Chr.agp | |||
|   ChrX    133855417       133931889       15581   W       7180001925241   1       76473   + | |||
|   ChrX    133931890       133931989       15582   U       100     contig  no | |||
|   ChrX    133931990       133965176       15583   W       7180001925242   1       33187   + | |||
| * Liliana found  18 ctg + deg that have genes | |||
|   Fixed case B (7180001932648,7180001925237) | |||
| * Bob (Missouri)  | |||
| # scf7180002041216 (43 ctgs, 1.44Mbp) should be split : Chr15 (36 ctgs) , Chr5 (first 7 ctgs)  ctg7180002003576,ctg7180002003577 moved from Chr9=>Chr5 | |||
|   => Chr5 | |||
|      1  7180002003574   7180002041216   0       19886   f # already on Chr5 | |||
|      2  7180001694221   7180002041216   19906   20911   r | |||
|      3  7180001978231   7180002041216   20931   52725   r | |||
|      4  7180001946738   7180002041216   53559   55402   r | |||
|      5  7180002003575   7180002041216   55422   92145   f | |||
|      6  7180002003576   7180002041216   92165   200007  f | |||
|      7  7180002003577   7180002041216   200456  300048  f | |||
| # scf7180002041153 (48 ctgs, 2.36Mbp): assignment to Chr11 seems correct, Chr6(0) | |||
| ===  UMD_2.6.a_g_071709 -> UMD_Freeze2.99 === | |||
| Changes vs UMD_2.6.a_g_070709: | |||
| * 17 contigs/degenerates recruited by Liliana based on mRNA alignments got placed on chromosomes | |||
| * Chr9,15 => Chr5 correction : a scaffold got broken between 2 chromosomes and several contigs got moved to Chr5 | |||
| * 697 unplaced contigs from placed scaffolds got placed as well | |||
| Contig placement summary: | |||
|                                  #ctg+deg <2Kbp   >=2Kbp min  max        mean   med    n50    sum | |||
|    ================================================================================================== | |||
|    Chr1..29,X                    72911   21180   51731  65   1160130    36223  12706  97232  2641097363 | |||
|    ChrU                          3365    2448    917    224  179692     2898   1348   5399   9754701 | |||
|    contigs.haplotype-variants    40198   36720   3478   263  51828      1460   1203   1361   58698457 | |||
|    deg.unplaced.less_2K          224933  224933  0      65   1996       972    983    990    218837572 | |||
|    Chr1..29,X(new)               714     417     297    177  123243     4605   1741   14809  3288596 | |||
|    ChrY-contigs                  315     266     49     224  26490      2249   974    6679   708535 | |||
|    ChrY-contigs.SHOTGUN_ONLY     144     140     4      804  4224       993    882    888    143047 | |||
|    =================================================================================================== | |||
| Comments: | |||
| * "Chr1..29,X", ChrU, contigs.haplotype-variants, deg.unplaced.less_2K are mutually exclusive sets | |||
| * "Chr1..29,X(new)" are contigs which were not placed in UMD_2.6.a_g_071709. | |||
| ** 17 of them were added by Liliana (1 failed) | |||
| ** the rest are reliable contigs left unplaced by Aleksey/Guillaume program | |||
| * ChrY-contigs.SHOTGUN_ONLY are a subset of ChrY-contigs which contain only SHOTGUN reads | |||
| * ChrY-contigs are part of ChrU | |||
| Files (walnut): | |||
|    /scratch1/bos_taurus/Assembly/2009_0312_CA/scf_placements/UMD_2.6.a_g_071709/                # FASTA & AGP format | |||
|    /scratch1/bos_taurus/Assembly/2009_0312_CA/scf_placements/UMD_2.6.a_g_071709/nucmer_UMD2.0/  # nucmer alignments to UMD2.0 | |||
| Files (freeze): | |||
|   /fs/szasmg3/bos_taurus/UMD_Freeze2.99/               # FASTA & AGP format | |||
|   /fs/szasmg3/bos_taurus/UMD_Freeze2.99/ncbi_files     # SEQUIN format | |||
| Ftp: | |||
|    ftp://ftp.cbcb.umd.edu/pub/salzberg/Bos_taurus_2.99/ -> /fs/ftp-cbcb/pub/salzberg/Bos_taurus_2.99/ | |||
| === UMD_2.6.a_g_072109 -> UMD_Freeze3.0 === | |||
| Changes vs UMD_2.6.a_g_070709: | |||
| * Delete 97 contaminated sequences found by NCBI (all except the primates) : http://www.ncbi.nlm.nih.gov/projects/WGS/screens/DAAA02_071709/ | |||
| * Delete 441 haplotype variants found by Guillaume  | |||
| * Trim 54 partial contaminants (contaminants were on the ends) | |||
| * Trim 7 terminal N's | |||
| Gaps: 75739 | |||
| * all 27103 N gaps are "fragment yes" | |||
| * all 48636 U gaps are "contig no" | |||
| Files (walnut): | |||
|    /scratch1/bos_taurus/Assembly/2009_0312_CA/scf_placements/UMD_2.6.a_g_072109/                # FASTA & AGP format | |||
| Files (freeze): | |||
|   /fs/szasmg3/bos_taurus/UMD_Freeze3.0/               # FASTA & AGP format | |||
|   /fs/szasmg3/bos_taurus/UMD_Freeze3.0/ncbi_files     # SEQUIN format | |||
| Ftp: | |||
|   ftp://ftp.cbcb.umd.edu/pub/data/assembly/Bos_taurus/Bos_taurus_UMD_3.0/ -> pub/data/assembly/Bos_taurus/Bos_taurus_UMD_3.0/ | |||
| Issues: | |||
| * 7180001836672 941bp deg on Chr4 : aligns on all its length to the cow mitochondrion; placed  based on human synteny | |||
| * mitochondrion screening was done only on contigs, not on degenerates | |||
| * Align all Chr*.fasta files to cow mitochondrion; show-coords -I 90 -L 600  | |||
|   [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS] | |||
|   =============================================================================================================================== | |||
|   3603     4930  |        6     1333  |     1328     1328  |    90.89  |    16339     1333  |     8.13    99.62  | AY676873  ctg7180001759596 [CONTAINS] | |||
|   13284    14521  |    1214        1  |     1238     1214  |    97.90  |    16339     1216  |     7.58    99.84  | AY676873  deg7180001872458 [CONTAINS] | |||
|   15731    16339  |       1      608  |      609      608  |    99.18  |    16339      941  |     3.73    64.61  | AY676873  deg7180001836672 [END] | |||
| * Delete 2 degenerates | |||
| * Summary 3.a --[[User:Dpuiu|Dpuiu]] 11:04, 5 August 2009 (EDT) | |||
|   .                                 ctg+deg <2Kbp   >=2Kbp min  max      mean   med    n50     sum | |||
|   ====================================================================================================== | |||
|   Chr1..29,X                        72479   20862   51617  65   1160130  36424  12941  103785  2639984487 | |||
|   ChrU                              3285    2404    881    224  179692   2890   1338   5425    9496583 | |||
|   Chr                               75764   23266   52498  65   1160130  34970  11207  96955   2649481070 | |||
|   contigs.haplotype-variants        40611   36984   3627   263  97877    1476   1205   1372    59958728 | |||
|   deg.unplaced.less_2K              224933  224933  0      65   1996     972    983    990     218837572 | |||
|   ChrY-contigs                      314     266     48     224  26490    2210   973    6539    694140 | |||
|   ChrY-contigs.SHOTGUN_ONLY         144     140     4      804  4224     993    882    888     143047 | |||
|   ====================================================================================================== | |||
| === UMD_Freeze3.1 ===  | |||
| --[[User:Dpuiu|Dpuiu]] 11:42, 19 November 2009 (EST) | |||
| * Only changed some of the gap specifications | |||
| * Original(CA): | |||
|   ctg: 90135  | |||
|   deg: 251413  | |||
|   scf: 39978  | |||
| * UMD3.1 AGP: | |||
|   ctg:             60499 | |||
|   deg:             15229 | |||
|   scf(CA):         11458 | |||
|   ctg(unoriented)    2118 | |||
|   scf(placed chr)    3193  | |||
|   scf(unplaced chr)  3285 | |||
| Files: | |||
|  /fs/szasmg3/bos_taurus/UMD_Freeze3.1 | |||
| == ToDo == | |||
| * Align all UMD2.0 ChrU ctg/deg 10Kbp+ to our assembly; make sure everything aligns (Steven's suggestion) | |||
|   /scratch1/bos_taurus/Assembly/2009_0312_CA/scf_placements/UMD_2.6.a_g_071709/nucmer_UMD2.0 | |||
| ** all ctgs: 113K, 244Mbp | |||
| ** 10Kbp+ ctgs: 2561, 54Mbp | |||
| ** 53,951,168  out of 54,971,011 bp covered (98% of the sequence) | |||
| ** Most ctgs were added to ChrU | |||
|   join2.pl -i 4 UMD2.0.ChrU.10K.maxCvg.pair ../other/Chr.posmap | awk '{print $10,$2}' | ~/bin/sum2.pl | sort -nk3 -r | |||
|   Chr30   252     7,511,975 | |||
|   Chr1    171     3263575 | |||
|   Chr6    138     3043388 | |||
|   Chr12   152     2676615 | |||
|   ChrU    100     2600562 | |||
|   ... | |||
|   Chr28   29      473523 | |||
|   all     2651    54,960,233 | |||
| 1. Remove remain_haps | |||
|   .                                elem       <2000      >2000      min    max        mean       med        n50        sum | |||
|   remain_haps                      436        282        154        471    37860      2600       1693       2819       1,133,874 | |||
|   remain_haps(Chr1..30,U)          408        264        144        471    37860      2636       1692       2883       1,075,616 # 77 from ChrU | |||
|   remain_haps(Chr1..30)            335        226        109        471    37860      2480       1661       2728       831,068 | |||
| 2. Remove/trim contaminants  | |||
| 3. Remove/trim N's | |||
Latest revision as of 18:11, 8 April 2011
BCM
NCBI Data
- Genome Projects
- TA search
- TA ftp
- 91 volumes: 87 with qual & 4 with no quality (85 volumes contain BCM reads)
- 14 centers
- 21 center/trace_type_codes
- Avg LEN=984
- Avg CLIP (CLB intersect CLV)=760
- Avg CLV=997 > Avg LEN ???
- Avg QUAL=38.96 (27.51 for the 2.59M reads not in the UMD assembly)
- Avg UMDoverlapper CLIP=778
Problems:
- 0 QUAL reads 650,133 ( 551,114 are BCM.WGS )
- the quality lines in several qual. files start with space; need to remove it otherwise tarchive2ca errors out saying that the len(quality)=len(seq)+1
- several xml contained the "&" character => XML parser error
- xml.bos_taurus.087 contained 2 trace_volumes => XML parser error
- BCCAGSC.CLONEEND : all reads have LIBRARY_ID=CH240, SEQ_LIB_ID=. ; the INSERT_SIZE & INSERT_STDEV vary within the library: set to 150,000 & 30,000
- UIUC.CLONEEND: INSERT_SIZE & INSERT_STDEV missing: set to 150,000 & 30,000
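The two parsing problems listed above (a bare "&" in the XML and a leading space on quality lines) can be cleaned up before conversion. The following is a minimal sketch, not the procedure actually used here; the function names are hypothetical and it assumes each file is handled as plain text.

```python
import re

# Matches "&" that does not already start an XML entity such as "&amp;" or "&#38;".
_BARE_AMP = re.compile(r"&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(xml_text):
    """Replace bare '&' characters with '&amp;' so an XML parser accepts the text."""
    return _BARE_AMP.sub("&amp;", xml_text)


def strip_leading_space(qual_line):
    """Drop leading spaces from a quality line so no empty extra value is counted."""
    return qual_line.lstrip(" ")


if __name__ == "__main__":
    print(escape_bare_ampersands("<SOURCE>Smith & Jones &amp; Co</SOURCE>"))
    print(strip_leading_space(" 20 20 31 40"))
```

Existing entities are left alone, so running the filter twice does not double-escape anything.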
CENTER_NAME counts
    COUNT           CENTER_NAME     
 1  35629020        BCM             Baylor College of Medicine
 2  737900          NISC            NIH Intramural Sequencing Center
 3  652614          BCCAGSC         British Columbia Cancer Agency Genome Sciences Centre                           # TA query_tracedb CENTER_NAME = "BCCAGSC" => 652,510 
 4  378871          MARC            USDA, ARS, US Meat Animal Research Center
 5  114753          UIUC            University of Illinois at Urbana-Champaign                                      # TA query_tracedb CENTER_NAME = "UIUC" => 106,368
 6  107367          BARC            USDA, ARS, Beltsville Agricultural Research Center
 7  65171           TIGR            The Institute for Genomic Research
 8  53556           GSC             Genoscope
 9  43033           CENARGEN        Embrapa Genetic Resources and Biotechnology
 10 18623           SC              The Sanger Center
 11 15301           UOKNOR          University of Oklahoma Norman Campus, Advanced Center for Genome Technology
 12 10651           TIGR_JCVIJTC    The Institute for Genomic Research, Traces generated at JCVIJTC                 # TA query_tracedb CENTER_NAME="JCVI"
 13 2485            UIACBCB         University of Iowa Center for Bioinformatics and Computational Biology (UIACBCB)
 14 49              WUGSC           Washington University, Genome Sequencing Center                                 # TA query_tracedb CENTER_NAME = "WUGSC" => 9
    37829394        total           total                                                                           # TA query_tracedb SPECIES_CODE = "BOS TAURUS" => 37,788,710
TRACE_TYPE_CODE counts
    COUNT         CENTER_NAME     TRACE_TYPE_CODE        
 1  24863599      BCM*            WGS                    SEQ_LIB_ID:89
 2  10748529      BCM*            SHOTGUN                SEQ_LIB_ID:15543
 3  737900        NISC            SHOTGUN                SEQ_LIB_ID:247
 4  125597        BCCAGSC         CLONEEND               LIBRARY_ID:1         large insert size; some qualityless; !!! almost all have CLIP3=0
 5  114753        UIUC            CLONEEND               LIBRARY_ID:2         insert size missing , no frequent kmers
 6  65171         TIGR            CLONEEND               SEQ_LIB_ID:1         2K & use TRACE_DIRECTION instead of TRACE_END
 7  53556         GSC             CLONEEND               SEQ_LIB_ID:1         large insert size; !!! all have qual=0 and were excluded 
 8  26246         CENARGEN        WGS                    .                    no LIBRARY_ID; no SEQ_LIB_ID; no INSERT_SIZE; no INSERT_STDEV; reads have no direction; ~21954 could be paired (same TEMPLATE_ID)
 9  25454         BARC            CLONEEND               SEQ_LIB_ID:14304     !!! all have CLIP3=0
 10 16892         BCM*            CLONEEND               LIBRARY_ID:1         VBBAA   mea=167000  std=25000
 11 16787         CENARGEN        CLONEEND               LIBRARY_ID:1         
 12 15150         UOKNOR          SHOTGUN                LIBRARY_ID:1         some qualityless
 13 10651         TIGR_JCVIJTC    CLONEEND               SEQ_LIB_ID:2
 14 151           UOKNOR          FINISHING              LIBRARY_ID:1         some qualityless, no direction(TRACE_END=N); no INSERT_SIZE; no INSERT_STDEV
 15 49            WUGSC           CLONEEND               SEQ_LIB_ID:1 
    36820485      total
 16 527017        BCCAGSC         EST
 17 207204        MARC            EST
 18 171667        MARC            PCR
 19 81913         BARC            EST
 20 18623         SC              EST 
 21 2485          UIACBCB         EST
    1008909       total
STRATEGY & TRACE_TYPE_CODE counts
 COUNT     CENTER_NAME   STRATEGY       TRACE_TYPE_CODE
 12545304  BCM           .              WGS
 11425910  BCM           WGA            WGS
 5223683   BCM           CLONE          SHOTGUN
 4479883   BCM           POOLCLONE      SHOTGUN
 1044963   BCM           .              SHOTGUN
 892385    BCM           SNP            WGS
 737900    NISC          CLONE          SHOTGUN
 125597    BCCAGSC       CLONEEND       CLONEEND
 114753    UIUC          CLONEEND       CLONEEND
 65171     TIGR          CLONEEND       CLONEEND
 53556     GSC           CLONEEND       CLONEEND
 26246     CENARGEN      .              WGS
 25454     BARC          .              CLONEEND
 16892     BCM           CLONEEND       CLONEEND
 16787     CENARGEN      CLONEEND       CLONEEND
 12195     UOKNOR        .              SHOTGUN
 10651     TIGR_JCVIJTC  CLONEEND       CLONEEND
 2955      UOKNOR        CLONE          SHOTGUN
 151       UOKNOR        .              FINISHING
 49        WUGSC         CLONEEND       CLONEEND
 
 527017    BCCAGSC       EST            EST
 145820    MARC          EST            EST
 117958    MARC          COMPARATIVE    PCR
 81913     BARC          EST            EST
 61384     MARC          CLONE          EST
 53709     MARC          Re-Sequencing  PCR
 18623     SC            EST            EST
 2485      UIACBCB       .              EST
BCM.SHOTGUN libraries
- The long inserts are probably wrong !!!
 SIZE    STDEV  COUNT
 3500    1500   4502569
 2000    1000   3244493
 3000    1000   1021577
 180000  1000   840528
 6500    1500   429026
 180000  13000  320208
 6000    2000   208192
 167000  13000  96337
 3500    15000  85599
 
 SIZE    COUNT
 3500    4588168
 2000    3244493
 180000  1160736
 3000    1021577
 6500    429026
 6000    208192
 167000  96337
3' VECTOR TRIMMED counts
     CENTER_NAME   TRACE_TYPE_CODE  TOTAL     3'CLV<LEN  QUAL==0      UMD.FRG
 1   BCM           WGS              24863599  10968979   551114       24050767
 2   BCM           SHOTGUN          10748529  5052692    23419        10068499
 3   NISC          SHOTGUN          737900    28972      0            735488
 4   BCCAGSC       CLONEEND         125597    125484     8926         113790
 5   UIUC          CLONEEND         114753    90243      0            106247
 6   TIGR          CLONEEND         65171     46389      0            64903
 7   GSC           CLONEEND         53556     53556      53556 (all)  0         !!! all have 0 quals and were excluded
 8   CENARGEN      WGS              26246     26246      0            25976
 9   BARC          CLONEEND         25454     25454      0            25387
 10  BCM           CLONEEND         16892     6751       0            16863
 11  CENARGEN      CLONEEND         16787     16787      0            16628
 12  UOKNOR        SHOTGUN          15150     2885       12195        0
 13  TIGR_JCVIJTC  CLONEEND         10651     339        0            10644
 14  UOKNOR        FINISHING        151       0          151          151
 15  WUGSC         CLONEEND         49        0          0            0
 16  BCCAGSC       EST              527017    524173     772          0
 17  MARC          EST              207204    207204     0            0
 18  MARC          PCR              171667    171667     0            0
 19  BARC          EST              81913     78597      0            0
 20  SC            EST              18623     7350       0            0
 21  UIACBCB       EST              2485      2485       0            0
ZERO QUALITY COUNTS
- Counts
 CENTER_NAME  TRACE_TYPE_CODE  COUNT
 BCM          WGS              551114
 GSC          CLONEEND         53556
 BCM          SHOTGUN          23419
 UOKNOR       SHOTGUN          12195
 BCCAGSC      CLONEEND         8926
 BCCAGSC      EST              772
 UOKNOR       FINISHING        151
 TOTAL                         650133
- For 0 quality reads, assign quality 20 to bases 1..700, 0 to bases 701..
- Volumes 026..039 have been fixed
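The placeholder-quality rule above (quality 20 for bases 1..700, 0 from base 701 onward) can be sketched as follows. This is only an illustration of the rule as stated; the function name and defaults are hypothetical and it is not the script that was run on the volumes.

```python
def placeholder_quals(seq_len, good_bases=700, good_qual=20):
    """Return per-base qualities for a read that arrived with all-zero quality.

    Bases 1..good_bases get good_qual; any remaining bases get 0.
    """
    n_good = min(seq_len, good_bases)
    return [good_qual] * n_good + [0] * (seq_len - n_good)


if __name__ == "__main__":
    quals = placeholder_quals(984)  # 984 is the average read LEN reported above
    print(len(quals), quals[699], quals[700])
```

Reads shorter than 700 bases simply get quality 20 throughout.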
Local Data
Files & Dirs
 /fs/szasmg3/bos_taurus/data/
 /fs/szasmg2/Drosophila/D_pseudoobscura/Vectors
 /nfshomes/dpuiu/db/UniVec
Software
Figaro
- trims vector only at 5' end
- call lucy trimming for qualities
Lucy
- both vector sequence and splice sites are required
Atlas
- web site
- atlas-screen-trim-file : "calls cross_match and atlas-screen-window to create trimmed reads file (scan in from each end of read looking for 50-base windows of high quality and no vector); "
Contaminant search
nucmer reads CLIPPING range to UniVec & EcoliK12
UniVec
Ref
 .            #seqs  min  max    mean  median  n50  sum
 UniVec       2861   12   48551  231   99      781  660,151
 UniVec_Core  1348   12   48551  243   98      967  327,641
Hits: alignment length
 bp   #reads   min  max   mean   median  n50  sum
 19   4548466  19   1045  28.37  23      27   129025025
 20   3684852  20   1045  30.56  25      28   112616359
 30   1097357  30   1045  48.04  38      43   52714583
 40   484661   40   1045  66.36  47      53   32163896
 100  54334    100  1045  198    116     223  10772815   # many are ESTs
Ecoli
Ref:
K12 4,639,675 bp
Hits: alignment length
 bp   #reads  min  max   mean   median  n50  sum
 19   275109  19   1223  30.66  19      20   8435470
 20   102550  20   1223  50.29  21      161  5156849
 30   19032   30   1223  178    37      706  3381214
 40   9234    40   1223  329    171     738  3034293
 100  6781    100  1223  424    223     749  2876432
 200  4378    200  1223  575    696     771  2516916
BCM vectors
 .    #seqs  min   max    mean  median  n50    sum
 BCM  14     2580  33180  9379  5821    32705  131312
Vector/Splice site search
Strategy
- 1. Select all the reads in the same volume that belong to one particular library; same CENTER_NAME, STRATEGY & TRACE_TYPE_CODE
- 2. Get the quality clipping trim: CLIP_QUALITY_LEFT & CLIP_QUALITY_RIGHT
- 3. Separate reads in 2 sets according to direction TRACE_END: FORWARD & REVERSE
- 4. Get the most frequent kmers in each set (24 & 8 bp)
- 5. Check if the most frequent kmers are overrepresented
- 6. Check if the most frequent 8mers are present in the most frequent 24mers
- 7. Try to extend the 24mers by a few bp => linkers
- 8. Align linkers to the opposite strand sequences using nucmer
- 9. Extract the subsequences adjacent(following) to linker (50..150bp)
- 10. Align the subsequences; if they align we've probably identified the vector
- 11. Identify the vector name/id by alignment to UniVec => several alignments
- 12. Check if the forward/reverse vector(s) are the same : we should find a common vector sequence; the UniVec alignments should be adjacent
- 13. create the Lucy vector & splice files; the splice contains the linker+vector
- 14. run lucy & trim input reads according to Lucy clr
- 15. align lucy trimmed reads to linker,vector,splice & UniVec.dust
- 16. align input reads to linker,vector,splice & UniVec.dust
- 17. compare the 15. & 16. counts
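Step 4 of the strategy (most frequent kmers per read set) can be sketched as below. This is a minimal illustration, not the tool that produced the kmers.tab files: the function names are hypothetical, kmers are counted on the given strand only, and the output mimics a three-column layout of kmer, reverse complement and count.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def top_kmers(reads, k, n=10):
    """Count every k-mer over all reads and return the n most frequent.

    Each result row is (kmer, reverse complement, count). K-mers containing
    N are skipped. Counting is on the forward strand of each read only.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return [(kmer, revcomp(kmer), c) for kmer, c in counts.most_common(n)]


if __name__ == "__main__":
    demo_reads = ["AGTAGTTCA", "TTAGTAGTTC"]  # toy reads, not real traces
    for row in top_kmers(demo_reads, k=8, n=3):
        print(*row, sep="\t")
```

Running it separately on the FORWARD and REVERSE read sets with k=8 and k=24 gives the tables that steps 5 and 6 compare.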
Example
- 1. volume 011 : 500,000 reads CENTER_NAME=BCM, TRACE_TYPE_CODE=WGS
- 2.
- 3. 249,611 TRACE_END=F & 250,389 TRACE_END=R
- 4. kmers: 8 of the 10 most frequent 8bp kmers are shared by the FORWARD & REVERSE strands; no 24bp kmers are shared
 ==> 24.fwd/kmers.tab <==
 AGTTCGACTGCAAGTAGTTCATCA  TGATGAACTACTTGCAGTCGAACT  2463   # contains AGTAGTTC
 GAGTTCGACTGCAAGTAGTTCATC  GATGAACTACTTGCAGTCGAACTC  2189
 CGAGTTCGACTGCAAGTAGTTCAT  ATGAACTACTTGCAGTCGAACTCG  1996
 TCGAGTTCGACTGCAAGTAGTTCA  TGAACTACTTGCAGTCGAACTCGA  1593
 GTTCGACTGCAAGTAGTTCATCAA  TTGATGAACTACTTGCAGTCGAAC  1023
 GAGTTCGACTGCAGTAGTTCATCA  TGATGAACTACTGCAGTCGAACTC  812
 CGAGTTCGACTGCAGTAGTTCATC  GATGAACTACTGCAGTCGAACTCG  777
 GTTCGACTGCAAGTAGTTCATCAT  ATGATGAACTACTTGCAGTCGAAC  769
 TCGAGTTCGACTGCAGTAGTTCAT  ATGAACTACTGCAGTCGAACTCGA  637
 ATCGAGTTCGACTGCAAGTAGTTC  GAACTACTTGCAGTCGAACTCGAT  594
 
 ==> 08.fwd/kmers.tab <==
 AGTAGTTC  GAACTACT  86477
 CAGTAGTT  AACTACTG  67681
 AGTTCTCA  TGAGAACT  61556
 TAGTTCTC  GAGAACTA  60964
 GTAGTTCT  AGAACTAC  57866
 AGTTCATC  GATGAACT  49676
 TAGTTCAT  ATGAACTA  45298
 GTTCATCA  TGATGAAC  42117
 GCAGTAGT  ACTACTGC  41391
 GTAGTTCA  TGAACTAC  40694
 
 ==> 24.rev/kmers.tab <==
 TATCGATGGTACAGTAGTTCATCA  TGATGAACTACTGTACCATCGATA  999    # contains AGTAGTTC
 CTATCGATGGTACAGTAGTTCATC  GATGAACTACTGTACCATCGATAG  774
 GCTATCGATGGTACAGTAGTTCAT  ATGAACTACTGTACCATCGATAGC  600
 CGCTATCGATGGTACAGTAGTTCA  TGAACTACTGTACCATCGATAGCG  432
 ATCGATGGTACAGTAGTTCATCAT  ATGATGAACTACTGTACCATCGAT  417
 ATCGATGGTACAGTAGTTCATCAA  TTGATGAACTACTGTACCATCGAT  380
 ATCAGATGGTACAGTAGTTCATCA  TGATGAACTACTGTACCATCTGAT  373
 ATCGATGGTACAGTAGTTCATCAC  GTGATGAACTACTGTACCATCGAT  265
 CTATCGATGGTAAGTAGTTCATCA  TGATGAACTACTTACCATCGATAG  235
 TCAGATGGTACAGTAGTTCATCAA  TTGATGAACTACTGTACCATCTGA  224
 
 ==> 08.rev/kmers.tab <==
 AGTTCATC  GATGAACT  85127
 TAGTTCAT  ATGAACTA  77902
 GTTCATCA  TGATGAAC  75585
 TAGTTCTC  GAGAACTA  68057
 AGTTCTCA  TGAGAACT  67277
 GTAGTTCT  AGAACTAC  64894
 GTAGTTCA  TGAACTAC  62607
 CGTAGTTC  GAACTACG  52031
 AGTAGTTC  GAACTACT  51013
 ACGTAGTT  AACTACGT  31552
- 7. Get linker sequences
 >linker.fwd 27bp
 TCGAGTTCGACTGCAAGTAGTTCATCA
 >linker.rev 27bp
 CTAATCAGATGGTACAGTAGTTCATCA
 #>linker.rev 40 bp Art's (13 more bp at 5')
 #TATGACCATGCGCCTAATCAGATGGTACAGTAGTTCATCA
 #GCTATCGATGGTACAGTAGTTCATCAT is the most frequent rev 27bp kmer but not the linker (a few SNP differences)
- 8 & 9 Align reads to linkers using nucmer
Fwd:
 nucmer -l 12 -c 24 -r linker.fwd.seq ../bos_taurus.$v.r.fasta 
 #  nucmer -l 12 -c 24 -r kmers.seq ../bos_taurus.$v.r.fasta  
 show-coords out.delta | awk '{print $19,$5,$13}' > ! out.clr
 extractfromfastanames.pl -clr -f out.clr < ../bos_taurus.$v.r.fasta >! out.seq
 
Rev:
 nucmer -l 12 -c 24 -r linker.rev.seq ../bos_taurus.$v.f.fasta
 #  nucmer -l 12 -c 24 -r kmers.seq ../bos_taurus.$v.f.fasta  
 show-coords out.delta | awk '{print $19,$5,$13}' > ! out.clr
 extractfromfastanames.pl -clr -f out.clr < ../bos_taurus.$v.f.fasta >! out.seq
 
Both:
 clrFasta out.seq >! out.cseq
 fasta2tab.pl out.cseq | sort -k2 > ! out.tab
 nucmer -c 40 out.cseq ~/db/UniVec -p vector
 delta-filter -q vector.delta >! vector.filter-q.delta
 show-coords vector.filter-q.delta | sort -n | head
 cat vector.filter-q.delta | grep "^>" | count.pl -c 1 -m 2
- 10. Extract "vector reads"
 >399553028 # 24.fwd
 TGATGAACTACTGTACCATCTGATTAGGCGCATGGTCATAGCTGTTTCCTGTGTGAAATT
 GCTATCCGCTCACAATTCCACACAACATACGAGCCGGAAGCATAAAGTGTAAAGCCTGGG
 GTGTCAAATGAGAGACCTAACTCACATTCAACTTTTTTTTTTTTTCTGCCCTCTATTCTA
 ...
 >400269118 #24.rev
 TGATGAACTACTTGCAGTCGAAATCGAATCATCACTGGCCGTCCTTTTACAACGTCGTGA
 CTGGGAAAACCCTGGCGTTACCCAACTTAATCCGCCTTGCAGCACATCCCCCTTTCCCCC
 AGCTGGCGTAAAAACGTAAAAAGCCCCGCACCGATCGCCCTTTCCCAACAGGTTGCCCAG
- 11. Align "vector reads" to UniVec; identify vector
 show-coords 24.fwd/400269118-UniVec.delta 24.rev/399553028-UniVec.delta | grep J01636.1
     31  148  | 1175 1292  | 118   118  |  95.76  |     1276     7477  |     9.25     1.58  | 399553028.rev gnl|uv|J01636.1:1-7477
     32  199  | 1302 1463  | 168   162  |  90.48  |      653     7477  |    25.73     2.17  | 400269118     gnl|uv|J01636.1:1-7477
- 12. 10bp distance between the 2 alignments
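The adjacency check in step 12 comes down to comparing the reference (vector) coordinates of the two show-coords hits. A minimal sketch follows, assuming each hit is given as a (start, end) pair on the vector; the helper name is hypothetical.

```python
def ref_gap(aln_a, aln_b):
    """Distance on the vector between two alignments given as (start, end).

    Computed as the start of the downstream hit minus the end of the
    upstream hit; a negative value means the two hits overlap.
    """
    first, second = sorted([aln_a, aln_b])
    return second[0] - first[1]


if __name__ == "__main__":
    # Vector coordinates of the two J01636 hits listed in step 11.
    print(ref_gap((1175, 1292), (1302, 1463)))
```

A small positive distance like this one is what is expected when both read ends flank the same cloning site.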
- 13. Lucy files
 $ more vector.seq
 >J01636 E.coli lactose operon with lacI, lacZ, lacY and lacA genes
 GACACCATCGAATGGCGCAAAACCTTTCGCGGTATGGCATGATAGCGCCCGGAAGAGAGTCAATTCAGGG
 TGGTGAATGTGAAACCAGTAACGTTATACGATGTCGCAGAGTATGCCGGTGTCTCTTATCAGACCGTTTC
 CCGCGTGGTGAACCAGGCCAGCCACGTTTCTGCGAAAACGCGGGAAAAAGTGGAAGCGGCGATGGCGGAG
 CTGAATTACATTCCCAACCGCGTGGCACAACAACTGGCGGGCAAACAGTCGTTGCTGATTGGCGTTGCCA
 ...
 
 $ more splice.seq
 >J01636.for.begin vector+linker.rev
 TGAATGTGAGTTAGGTCTCTCATTTGACACCCCAGGCTTTACACTTTATGCTTCCGGCTC
 GTATGTTGTGTGGAATTGTGAGCGGATAGCAATTTCACACAGGAAACAGCTATGACCATG
 CGCCTAATCAGATGGTACAGTAGTTCATCA
 >J01636.for.end rev(linker.fwd)+vector
 TGATGAACTACTTGCAGTCGAAATCGAATCATCACTGGCCGTCCTTTTACAACGTCGTGA
 CTGGGAAAACCCTGGCGTTACCCAACTTAATCCGCCTTGCAGCACATCCCCCTTTCCCCC
 AGCTGGCGTAAAAACGTAAAAAGCCCCGCA
 >J01636.rev.begin (revcomp of J01636.for.end)
 TGCGGGGCTTTTTACGTTTTTACGCCAGCTGGGGGAAAGGGGGATGTGCTGCAAGGCGGA
 TTAAGTTGGGTAACGCCAGGGTTTTCCCAGTCACGACGTTGTAAAAGGACGGCCAGTGAT
 GATTCGATTTCGACTGCAAGTAGTTCATCA
 >J01636.rev.end (revcomp of J01636.for.begin)
 TGATGAACTACTGTACCATCTGATTAGGCGCATGGTCATAGCTGTTTCCTGTGTGAAATT
 GCTATCCGCTCACAATTCCACACAACATACGAGCCGGAAGCATAAAGTGTAAAGCCTGGG
 GTGTCAAATGAGAGACCTAACTCACATTCA
 # splice=linker+vector  
      3      120  |     1175     1292  |      118      118  |    95.76  |      150     7477  |    78.67     1.58  | J01636.for.begin   J01636
     32      131  |     1302     1399  |      100       98  |    96.00  |      150     7477  |    66.67     1.31  | J01636.for.end     J01636
- 13.1 Align vector & splice to Ecoli
       1     7474  |   366812   359335  |     7474     7478  |    99.91  |     7477  4639675  |    99.96     0.16  | J01636             NC_000913.2   [CONTAINED]
      20      119  |       65      162  |      100       98  |    96.00  |      150      395  |    66.67    24.81  | J01636.rev.begin   NC_000913.2
      31      148  |      172      289  |      118      118  |    95.76  |      150      395  |    78.67    29.87  | J01636.rev.end     NC_000913.2
    1069     1463  |      395        1  |      395      395  |   100.00  |     7477      395  |     5.28   100.00  | J01636             NC_000913.2.365350-365744
- 14. Run lucy & trim reads
 $ /nfshomes/dpuiu/szdevel/SourceForge/lucy-1.19p/lucy \
     -v vector.seq splice.seq \
     -o bos_taurus.lucy.seq bos_taurus.lucy.qual \
     -debug  bos_taurus.lucy.info \
     bos_taurus.seq bos_taurus.qual
 # Trim clr
 $ clrFasta bos_taurus.seq > bos_taurus.cseq
- 15. Align lucy output to linker, vector, splice & UniVec.dust
 $ nucmer -l 12 -c 24 ~/db/linker.seq  bos_taurus.lucy.cseq -p linker-bos_taurus.lucy
 $ nucmer -l 16 -c 30 ~/db/vector.seq  bos_taurus.lucy.cseq -p vector-bos_taurus.lucy
 $ nucmer -l 16 -c 30 ~/db/splice.seq  bos_taurus.lucy.cseq -p splice-bos_taurus.lucy
 $ nucmer -l 16 -c 30 ~/db/UniVec.dust bos_taurus.lucy.cseq -p UniVec.dust-bos_taurus.lucy
- 16. Align input to linker, vector, splice & UniVec.dust
 $ nucmer -l 12 -c 24 ~/db/linker.seq  bos_taurus.seq -p linker-bos_taurus
 $ nucmer -l 16 -c 30 ~/db/vector.seq  bos_taurus.seq -p vector-bos_taurus
 $ nucmer -l 16 -c 30 ~/db/splice.seq  bos_taurus.seq -p splice-bos_taurus
 $ nucmer -l 16 -c 30 ~/db/UniVec.dust bos_taurus.seq -p UniVec.dust-bos_taurus
Count how many reads got trimmed
 infoseq *seq | getSummary.pl -c 1 -t original.LEN
 
 cat bos_taurus.lucy.info | awk '{print $4-$3}' | getSummary.pl -t lucy.CLR >! bos_taurus.lucy.summary  
 cat bos_taurus.lucy.info | getSummary.pl -c 14 -t lucy.CLV5 -nh >> bos_taurus.lucy.summary
 cat bos_taurus.lucy.info | getSummary.pl -c 15 -t lucy.CLV3 -nh >> bos_taurus.lucy.summary
Libraries
011.BCM.WGS FORWARD
- vector: J01636
- UniVec: gnl|uv|J01636.1:1-7477 E.coli lactose operon with lacI, lacZ, lacY and lacA genes
 ll ~dpuiu/db/J01636*
 -rw-rw-r-- 1 dpuiu dpuiu 7651 Jan  9 15:56 /nfshomes/dpuiu/db/J01636
 -rw-rw-r-- 1 dpuiu dpuiu  105 Jan 14 07:17 /nfshomes/dpuiu/db/J01636linker
 -rw-rw-r-- 1 dpuiu dpuiu  840 Jan 13 13:43 /nfshomes/dpuiu/db/J01636splice
 
 cat ~dpuiu/db/J01636* | infoseq
 J01636             7477  53.43
 J01636.linker.fwd  27    44.44
 J01636.linker.rev  27    37.04
 J01636.for.begin   150   44.67
 J01636.for.end     150   51.33
 J01636.rev.begin   150   51.33
 J01636.rev.end     150   44.67
- 249,611 reads:
- 91% got vector trimmed at the 5'
- 0.4% (1149) got vector trimmed at the 3'
 .             #elem   #0s     min  max   mean   median  n50   sum
 original.LEN  249611  0       437  2349  1082   991     1009  270035781
 lucy.CLV5     249611  21215   0    741   25.03  25      27    6247415
 lucy.CLV3     249611  248462  0    1047  3.49   0       859   870344
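The trimmed-read percentages quoted for this library follow from the #elem and #0s columns: a read counts as vector trimmed when its lucy.CLV5 (or lucy.CLV3) value is non-zero. A short worked check, using the numbers from the table (the helper name is hypothetical):

```python
def fraction_trimmed(n_reads, n_zero):
    """Fraction of reads whose vector clip value is non-zero."""
    return (n_reads - n_zero) / n_reads


# Values from the 011.BCM.WGS FORWARD summary: #elem and #0s.
clv5 = fraction_trimmed(249611, 21215)    # 5' vector clip
clv3 = fraction_trimmed(249611, 248462)   # 3' vector clip

if __name__ == "__main__":
    print(f"5' trimmed: {clv5:.1%}")
    print(f"3' trimmed: {clv3:.2%} ({249611 - 248462} reads)")
```

The same calculation applies to the REVERSE, SHOTGUN and NISC summaries in the following subsections.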
- Original reads hit counts:
 10975  linker.fwd
 133    linker.rev
 166    splice
 152    vector
 228    UniVec.dust
- Lucy trimmed read counts
 2      linker.fwd
 0      linker.rev
 1      splice
 1      vector
 6      UniVec.dust   (only 3 are >40bp)
011.BCM.WGS REVERSE
 .             #elem   #0s     min  max   mean   median  n50   sum
 original.LEN  250389  0       502  2148  1085   993     1012  271691094
 lucy.CLR      250389  7345    0    1281  795    876     892   198982171
 lucy.CLV5     250389  20271   0    668   26.52  27      29    6641362
 lucy.CLV3     250389  249269  0    997   3.35   0       861   839029
- Original reads hit counts:
 linker.fwd   113
 linker.rev   3812
 splice       143
 UniVec.dust  237
 vector       4318
- Lucy trimmed reads hit counts:
 linker.fwd   1
 linker.rev   0
 splice       1
 UniVec.dust  10
 vector       1
030.BCM.SHOTGUN
- same linker/vector/splice as BCM.WGS
- 2.5% (4K out of 160K) reads contain linker & vector at 3'
               #elem  #0s   min  max   mean   median  n50   sum
 original.LEN  8411   0     325  1685  1181   1240    1314  9933150
 lucy.CLR      8411   8     0    1054  841    863     874   7070994
 lucy.CLV5     8411   568   0    232   27.01  28      29    227206
 lucy.CLV3     8411   2325  0    1040  597    794     851   5023445
- Original reads hit counts:
linker.fwd 4314 linker.rev 4125 splice 7816 UniVec.dust 4212 vector 6750 vector 27235
- Lucy trimmed reads hit counts:
linker.fwd 3 linker.rev 1 splice 1 UniVec.dust 13 vector 0
001.NISC.SHOTGUN
- Vector: pOTW13
- UniVec: 3 partial seqs
 gnl|uv|NGB00080.1:1-198         pOTW13 with linkers
 gnl|uv|NGB00080.1:718-888       pOTW13 with linkers
 gnl|uv|NGB00080.1:1490-1654-49  pOTW13 with linkers
 ll /nfshomes/dpuiu/db/NGB00080*
 -rw-rw-r-- 1 dpuiu dpuiu 1083 Jan 14 20:43 /nfshomes/dpuiu/db/NGB00080
 -rw-r--r-- 1 dpuiu dpuiu   94 Jan 14 21:01 /nfshomes/dpuiu/db/NGB00080linker
 -rw-r--r-- 1 dpuiu dpuiu 2183 Jan 14 20:44 /nfshomes/dpuiu/db/NGB00080splice
 cat /nfshomes/dpuiu/db/NGB00080* | infoseq
 NGB00080             1054  50.00
 NGB00080.linker.fwd  24    45.83
 NGB00080.linker.rev  26    53.85
 NGB00080.for.beg     518   46.14
 NGB00080.for.end     518   50.48
 NGB00080.rev.begin   518   50.48
 NGB00080.rev.beg     518   46.14
- 944 read sample
               #elem  #0s  min  max   mean   median  n50  sum
 original.LEN  944    0    652  1017  735    721     722  693668
 lucy.CLR      944    39   0    886   415    422     522  391333
 lucy.CLV5     944    121  0    275   34.05  33      35   32143
 lucy.CLV3     944    18   0    885   410    409     511  387007
- Original reads hit counts:
linker.fwd 479 linker.rev 492 splice 910 UniVec.dust 0 vector 939
- Lucy trimmed reads hit counts:
linker.fwd 1 linker.rev 0 splice 0 UniVec.dust 9 vector 1
060.BCCAGSC.CLONEEND
- Linkers:
 linker.fwd CCCTGCTTTGTCTGGAAGGGGTTCCCGACCT
 linker.rev CAGGAGGGGAGAAAGGGCTCAGAGG
- No common vector !!!
 wc -l *clb
   60746 bos_taurus.060.f.clb  #18 reads original align to UniVec (nucmer default params)
   60836 bos_taurus.060.r.clb
 
 Fwd:
    329      428  |      440      535  |      100       96  |    91.00  |      503     1585  |    19.88     6.06  | 723951410  gnl|uv|U30497.1:3230-4814 Cloning vector pAS2-1
    330      370  |       89       49  |       41       41  |   100.00  |      503      143  |     8.15    28.67  | 723951410  gnl|uv|U67875.1:6541-6683 pESP-I yeast expression vector
    330      370  |       94       54  |       41       41  |   100.00  |      503      143  |     8.15    28.67  | 723951410  gnl|uv|U67875.1:6541-6683 pESP-I yeast expression vector
 
  Rev:
      1       96  |       71      165  |       96       95  |    93.81  |      203      165  |    47.29    57.58  | 724018013  gnl|uv|AF133437.1:16659-16823 Cloning vector pCYPAC6
     50      143  |        1       94  |       94       94  |    92.71  |      203       94  |    46.31   100.00  | 724018013  gnl|uv|U80929.2:2858-2951     Cloning vector pBACe3.6
017.UIUC.CLONEEND
- No overrepresented kmers
 wc -l *clb
 17978 bos_taurus.017.f.clb
 17911 bos_taurus.017.r.clb
 ==> 24.fwd/kmers.tab <==
 CCCTGCTTTGTCTGGAAGGGGTTC GAACCCCTTCCAGACAAAGCAGGG 9
 CTGCTTTGTCTGGAAGGGGTTCCC GGGAACCCCTTCCAGACAAAGCAG 9
 ==> 24.rev/kmers.tab <==
 GAATGTTGAGCTTTAGCCAACTTT AAAGTTGGCTAAAGCTCAACATTC 4
 TCTGAATGTTGAGCTTTAGCCAAC GTTGGCTAAAGCTCAACATTCAGA 4
 ==> 8.fwd/kmers.tab <==
 TTTTTTTT AAAAAAAA 55
 AAGGGGTT AACCCCTT 35
 ==> 8.rev/kmers.tab <==
 GTCTGGAA TTCCAGAC 41
 TCTGGAAG CTTCCAGA 39
- No UniVec hits
010.TIGR.CLONEEND
- No overrepresented kmers
 wc -l *clb
 5479 bos_taurus.032.f.clb
 5174 bos_taurus.032.r.clb
 ==> 24.fwd/kmers.tab <==
 CTTGTGTTGGCCCAGGCAAGTCCA TGGACTTGCCTGGGCCAACACAAG 30
 TTGTGTTGGCCCAGGCAAGTCCAA TTGGACTTGCCTGGGCCAACACAA 30
 ==> 24.rev/kmers.tab <==
 CTGCCTCTTGTGTTGGCCCAGGCA TGCCTGGGCCAACACAAGAGGCAG 16
 GCTGCCTCTTGTGTTGGCCCAGGC GCCTGGGCCAACACAAGAGGCAGC 15
 ==> 8.fwd/kmers.tab <==
 GAGTGGGT ACCCACTC 176
 GGAGTGGG CCCACTCC 171
 ==> 8.rev/kmers.tab <==
 TGGAGTGG CCACTCCA 182
 GGAGTGGG CCCACTCC 181
- No UniVec hits
...
070.BCM.CLONEEND
- No frequent kmers
 wc -l *clb
 6027 bos_taurus.070.f.clb
 6236 bos_taurus.070.r.clb
 ==> 24.fwd/kmers.tab <==
 GGACTCTCAGAGTCTTCTCCAACA TGTTGGAGAAGACTCTGAGAGTCC 18
 ACTGGTTGGATCTCCTTGCAGTCC GGACTGCAAGGAGATCCAACCAGT 18
 ==> 24.rev/kmers.tab <==
 ATAAAATCTGAGCCACCAGGGAAG CTTCCCTGGTGGCTCAGATTTTAT 1
 CTATTGGTTCATATGGTCAACGTC GACGTTGACCATATGAACCAATAG 1
 ==> 8.fwd/kmers.tab <==
 TTTTTTTT AAAAAAAA 86
 CTTCTCCA TGGAGAAG 75
 ==> 8.rev/kmers.tab <==
 TATAGTGT ACACTATA 9
 ATATAGGG CCCTATAT 8
- No alignments to BCM WGS vector
Running Lucy
- Default parameters with vector trimming
- BCM vector/splice
 /nfshomes/dpuiu/db/vector.BCM.seq
 /nfshomes/dpuiu/db/splice.BCM.seq
- NISC vector/splice
 /nfshomes/dpuiu/db/vector.NISC.seq
 /nfshomes/dpuiu/db/splice.NISC.seq
BCM.WGS (all reads)
- orig.CLR < lucy.CLR ( 765 < 792 )
- orig.CLV > lucy.CLV ( 1015 > 973 )
- 739,529 out of 24,863,599 reads (3%) deleted by Lucy (CLR=-1,-1)
- 21,728,592 out of 24,863,599 reads (87%) vector trimmed at the 5' end
- 92,646 out of 24,863,599 reads (0.4%) vector trimmed at the 3' end
                      elem      <0        0         >0        min    max   mean  median  n50    sum
 orig.LEN             24863599  0         0         24863599  5      3097  1002  997     1015   24915462033
 orig.CLR             24863599  463669    7         24399923  -1143  1833  765   836     864    19036744256
 orig.CLR5            24863599  0         359245    24504354  0      2103  42    22      58     1047922451
 orig.CLR3            24863599  463404    0         24400195  -1     2169  807   872     895    20084666707
 lucy.CLR             24863599  0         739529    24124070  0      1219  792   878     904    19695000417
 lucy.CLR5            24863599  739529    36108     24087962  -1     1753  43    29      42     1086413880
 lucy.CLR3            24863599  739529    0         24124070  -1     1894  835   915     939    20781414297
 orig.CLR5-lucy.CLR5  24863599  16299521  215345    8348733   -1186  2104  -1    -10     -1186  -38491429
 orig.CLR3-lucy.CLR3  24863599  14858542  1494794   8510263   -1273  2170  -28   -20     -1273  -696747590
 orig.CLV             24863599  1053      1920      24860626  -2     5345  1015  1002    1017   25260581538
 orig.CLV5            8841849   0         0         8841849   1      1219  33    46      49     295011460
 orig.CLV3            24861698  1053      0         24860645  -1     5346  1027  1005    1019   25555592998
 lucy.CLV             24863599  10694     707       24852198  -469   3096  973   968     987    24195085877
 lucy.CLV5            24863599  0         3135007   21728592  0      1359  25    27      29     623457486
 lucy.CLV3            24863599  0         0         24863599  4      3096  998   995     1014   24818543363
 lucy.CLVABS5         24863599  0         3135007   21728592  0      1359  25    27      29     623457486
 lucy.CLVABS3         24863599  0         24770953  92646     0      1343  2     0       880    72055071
 orig.CLV5-lucy.CLV5  24863599  17216820  1512453   6134326   -1312  1219  -13   -25     -1312  -328446026
 orig.CLV3-lucy.CLV3  24863599  1519132   18579609  4764858   -1832  4672  29    0       479    737049635
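The deleted and trimmed percentages in the bullets above can be recomputed from the raw counts (commas removed). A minimal sketch; the helper name `pct` is hypothetical:

```shell
# Percentage of reads in a category, rounded to two decimals.
pct() {
  awk -v n="$1" -v total="$2" 'BEGIN { printf "%.2f\n", 100 * n / total }'
}

pct 739529 24863599     # reads deleted by Lucy (CLR=-1,-1)
pct 21728592 24863599   # reads vector trimmed at the 5' end
pct 92646 24863599      # reads vector trimmed at the 3' end
```

The same helper applies to the 0 quality, BCM.SHOTGUN and NISC.SHOTGUN read sets by swapping in their counts.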
BCM.WGS (0 quality reads)
- orig.CLR > lucy.CLR (mean)
- orig.CLV > lucy.CLV (mean)
- 7,153 out of 551,114 reads (1.3%) deleted by Lucy (CLR=-1,-1)
- 508,166 out of 551,114 reads (92%) vector trimmed at the 5' end
- 1,946 out of 551,114 reads (0.35%) vector trimmed at the 3' end
                      elem    <0      0       >0      min   max   mean  median  n50   sum
 orig.LEN             551114  0       0       551114  5     1464  872   946     959   480705828
 orig.CLR             551114  7754    0       543360  -770  1175  708   786     807   390325117
 orig.CLR5            551114  0       6773    544341  0     1519  44    20      111   24582849
 orig.CLR3            551114  7744    0       543370  -1    1638  752   818     833   414907966
 lucy.CLR             551114  0       7153    543961  0     699   636   671     671   350759771
 lucy.CLR5            551114  7153    35872   508089  -1    201   26    27      28    14442310
 lucy.CLR3            551114  7153    0       543961  -1    699   662   699     699   365202081
 orig.CLR5-lucy.CLR5  551114  364282  8801    178031  -198  1500  18    -8      215   10140539
 orig.CLR3-lucy.CLR3  551114  85058   2962    463094  -700  1472  90    123     178   49705885
 orig.CLV             551114  971     0       550143  -2    2037  974   978     981   537127121
 orig.CLV5            5100    0       0       5100    1     845   35    29      31    180490
 orig.CLV3            551114  971     0       550143  -1    2037  974   978     981   537307611
 lucy.CLV             551114  58      6       551050  -84   1456  841   917     930   463903233
 lucy.CLV5            551114  0       42948   508166  0     202   27    28      29    14964546
 lucy.CLV3            551114  0       0       551114  4     1463  868   945     958   478867779
 lucy.CLVABS5         551114  0       42948   508166  0     202   27    28      29    14964546
 lucy.CLVABS3         551114  0       549168  1946    0     700   2     0       686   1286935
 orig.CLV5-lucy.CLV5  551114  506108  42215   2791    -202  845   -26   -28     -202  -14784056
 orig.CLV3-lucy.CLV3  551114  134959  23422   392733  -967  1614  106   7       459   58439832
BCM.SHOTGUN
- orig.CLR < lucy.CLR (mean)
- orig.CLV > lucy.CLV (mean)
- 98,070 out of 10,748,529 reads (0.9%) deleted by Lucy (CLR=-1,-1)
- 9,737,008 out of 10,748,529 reads (90%) vector trimmed at the 5' end
- 294,942 out of 10,748,529 reads (2.7%) vector trimmed at the 3' end
                      elem      <0       0         >0        min    max   mean  median  n50    sum
 orig.LEN             10748529  0        0         10748529  5      2043  975   950     964    10486690472
 orig.CLR             10748529  17308    2         10731219  -1293  1467  809   833     847    8701344571
 orig.CLR5            10748529  0        68        10748461  0      1315  26    16      38     288662580
 orig.CLR3            10748529  16780    0         10731749  -1     1647  836   851     863    8990007151
 lucy.CLR             10748529  0        98070     10650459  0      1337  833   854     868    8955866769
 lucy.CLR5            10748529  98070    1973      10648486  -1     1307  35    28      32     376276188
 lucy.CLR3            10748529  98070    0         10650459  -1     1553  868   882     896    9332142957
 orig.CLR5-lucy.CLR5  10748529  9498290  65171     1185068   -1099  1293  -8    -11     -1099  -87613608
 orig.CLR3-lucy.CLR3  10748529  6879532  671097    3197900   -1149  1437  -31   -26     -1149  -342135806
 orig.CLV             10748529  16779    412       10731338  -2     3919  974   948     964    10472347908
 orig.CLV5            8594910   0        0         8594910   1      1239  3     1       49     28350257
 orig.CLV3            10748349  16779    0         10731570  -1     3919  976   950     965    10500698165
 lucy.CLV             10748529  7026     614       10740889  -268   2042  930   924     940    9997862132
 lucy.CLV5            10748529  0        1011521   9737008   0      855   24    24      27     257993796
 lucy.CLV3            10748529  0        0         10748529  4      2042  954   945     962    10255855928
 lucy.CLVABS5         10748529  0        1011521   9737008   0      855   24    24      27     257993796
 lucy.CLVABS3         10748529  0        10453587  294942    0      1214  20    0       847    220086015
 orig.CLV5-lucy.CLV5  10748529  9538738  138680    1071111   -854   1239  -21   -23     -854   -229643539
 orig.CLV3-lucy.CLV3  10748529  357934   9324166   1066429   -1328  2846  22    0       704    244842237
NISC.SHOTGUN
- orig.CLR < lucy.CLR (mean)
- orig.CLV > lucy.CLV (mean)
- 8,248 out of 737,900 reads (1.1%) deleted by Lucy (CLR=-1,-1)
- 633,409 out of 737,900 reads (85%) vector trimmed at the 5' end
- 7,201 out of 737,900 reads (0.97%) vector trimmed at the 3' end
                      elem    <0      0       >0      min    max   mean  median  n50    sum
 orig.LEN             737900  0       0       737900  104    2104  784   729     734    579172842
 orig.CLR             737900  5988    2       731910  -636   1033  651   668     676    480400909
 orig.CLR5            737900  0       0       737900  1      1407  47    40      51     34857531
 orig.CLR3            737900  0       5879    732021  0      1470  698   710     715    515258440
 lucy.CLR             737900  0       8248    729652  0      1035  658   670     676    485757685
 lucy.CLR5            737900  8248    56      729596  -1     1091  45    35      46     33811606
 lucy.CLR3            737900  8248    0       729652  -1     1391  704   710     714    519569291
 orig.CLR5-lucy.CLR5  737900  253727  89345   394828  -566   1408  1     1       485    1045925
 orig.CLR3-lucy.CLR3  737900  177007  31      560862  -867   1471  -5    1       -867   -4310851
 orig.CLV             737900  3224    2655    732021  -636   2103  771   725     730    569178445
 orig.CLV5            734026  0       0       734026  1      987   5     1       35     4375315
 orig.CLV3            732021  0       0       732021  35     2104  783   729     734    573553760
 lucy.CLV             737900  1335    55      736510  -200   2104  747   696     702    551392388
 lucy.CLV5            737900  104491  0       633409  -1     1199  30    31      34     22784742
 lucy.CLV3            737900  0       0       737900  15     2103  778   728     733    574177130
 lucy.CLVABS5         737900  0       104491  633409  0      1200  31    32      35     23522642
 lucy.CLVABS3         737900  0       730699  7201    0      1076  5     0       686    4257812
 orig.CLV5-lucy.CLV5  737900  561851  66390   109659  -1198  983   -24   -29     -1198  -18409427
 orig.CLV3-lucy.CLV3  737900  8386    1       729513  -950   1077  0     1       -950   -623370
Fragment files
- Locations:
 /fs/szasmg3/bos_taurus/data/frg
 /fs/szasmg3/bos_taurus/data/frg.new
- All DST messages are unique
- bos_taurus.clv : contains the vector clipping points
- BCM.WGS, BCM.SHOTGUN & NISC.SHOTGUN: lucy.clv
- others: the TA clv
- 374,454 reads don't have valid clv's
- 36,446,031 reads have valid clv's with avg=955
 
Message counts (original)
                                       DST    FRG       LKG
 bos_taurus.BCM.WGS.frg                79     24124070  11311841
 #bos_taurus.BCM.SHOTGUN.frg           7339   10650459  1799069   # some libs & mates are missing due to a tarchive2ca crash (used by UMD2.1)
 #bos_taurus.BCM.SHOTGUN.new.frg       18208  10650459  4715172   # split the libraries by VOL & SEQ_LIB_ID (used by UMD2.2)
 #bos_taurus.BCM.SHOTGUN.new.frg       13826  10650459  5046435   # double check the FRG count !!! (used by UMD2.3)
 bos_taurus.BCM.SHOTGUN.new.frg        7      10650459  5046435   # UMD2.4
 bos_taurus.NISC.SHOTGUN.frg           246    729652    344932
 bos_taurus.BCCAGSC.CLONEEND.frg       1      125241    59505
 bos_taurus.UIUC.CLONEEND.frg          2      114750    46319
 bos_taurus.TIGR.CLONEEND.frg          1      65171     27067
 bos_taurus.GSC.CLONEEND.frg           1      53521     25889
 bos_taurus.CENARGEN.WGS.frg           0      26246     0
 #bos_taurus.BARC.CLONEEND.frg         11150  25454     11150     # (used by UMD2.3)
 bos_taurus.BARC.CLONEEND.frg          1      25454     11150     # (used by UMD2.4)
 bos_taurus.BCM.CLONEEND.frg           1      16875     7103
 bos_taurus.CENARGEN.CLONEEND.frg      1      16787     6269
 bos_taurus.UOKNOR.SHOTGUN.frg         1      14651     4910
 bos_taurus.TIGR_JCVIJTC.CLONEEND.frg  2      10651     4803
 bos_taurus.UOKNOR.FINISHING.frg       0      151       0
 bos_taurus.WUGSC.COLONEEND.frg        1      49        21
 #total                                25312  35973728  16896244  # (UMD2.3)
 total                                 344    35973728  16896244  # (UMD2.4)
Message counts (quality)
                                         DST    FRG       LKG
 bos_taurus.BCM.WGS.qual.count           79     23580109  11035582
 #bos_taurus.BCM.SHOTGUN.qual.count      7339   10644092  1799069
 bos_taurus.BCM.SHOTGUN.qual.new.count   18208  10644092  4712446
 bos_taurus.NISC.SHOTGUN.count           246    729652    344932
 bos_taurus.BCCAGSC.CLONEEND.qual.count  1      116484    53585
 bos_taurus.UIUC.CLONEEND.count          2      114750    46319
 bos_taurus.TIGR.CLONEEND.count          1      65171     27067
 bos_taurus.CENARGEN.WGS.count           0      26246     0
 bos_taurus.BARC.CLONEEND.count          11150  25454     11150
 bos_taurus.BCM.CLONEEND.count           1      16875     7103
 bos_taurus.CENARGEN.CLONEEND.count      1      16787     6269
 bos_taurus.TIGR_JCVIJTC.CLONEEND.count  2      10651     4803
 bos_taurus.UOKNOR.SHOTGUN.qual.count    1      2456      813
 bos_taurus.WUGSC.COLONEEND.count        1      49        21
Message counts (0quality)
                                          DST   FRG     LKG
 bos_taurus.BCM.WGS.0qual.count           79    543961  234397
 bos_taurus.GSC.CLONEEND.0qual.count      1     53521   25889
 bos_taurus.UOKNOR.SHOTGUN.0qual.count    1     12195   4097
 bos_taurus.BCCAGSC.CLONEEND.0qual.count  1     8757    2114
 bos_taurus.BCM.SHOTGUN.0qual.count       7339  6367    0
 bos_taurus.UOKNOR.FINISHING.0qual.count  0     151     0
Assemblies
Bt.qc.combine UMD2.0 ... UMD2.5 combine stats
UMD2.1(2009_0122_CA; Quality reads)
Issues
- Uses only quality reads
- BCM.SHOTGUN library : ~ 4715172-1799069=2.9M mates were missed due to a tarchive2ca crash ; some libraries got merged (were assigned the same lib_id)
- All reads except for BCM.WGS were set as nonrandom
- Updated the runCA script to run the overlapper concurrently; new "ovlConcurrency" parameter added to the .spec file !!!
- consensus after cgw crashed in MultiAlignContig() ... use "consensus -D forceunitigabut" !!!
- cgw crashed after updating gkpStore with new lib/mate info => edit Input_CGW.c, remove the assert in line 117
Info
 host:             walnut
 assembly version: wgs-5.2 stable
 dir:              /scratch1/bos_taurus/Assembly/2009_0122_CA
 command:          /fs/szdevel/dpuiu/SourceForge/wgs/Linux-amd64/bin/runCA-test -d . -p bt -s bt01.specFile *.frg
 spec file:
   cgwDistanceSampleSize = 1000    # ??? too big; more than 50% of the BCM.SHOTGUN reads are in libraries with less than 1000 inserts
   cnsConcurrency        = 15
   cnsMinFrags           = 200000
   doOverlapTrimming     = 1
   frgCorrBatchSize      = 100000
   frgCorrConcurrency    = 15
   merylMemory           = 24000
   merylThreads          = 15
   obtMerThreshold       = 200
   obtOverlapper         = ovl
   ovlConcurrency        = 8
   ovlCorrBatchSize      = 100000
   ovlCorrConcurrency    = 15
   ovlHashBlockSize      = 1200000
   ovlMemory             = 8GB --hashload 0.8 --hashstrings 400000
   ovlMerThreshold       = 500
   ovlOverlapper         = ovl
   ovlRefBlockSize       = 7200000
   ovlThreads            = 2
   unitigger             = utg
   utgErrorRate          = 0.015
   vectorIntersect       = bos_taurus.clv
   doExtendClearRanges   = 2
Steps
1. Run up till after initialStoreBuilding
runCA stopAfter=initialStoreBuilding ...
2. Update gkpStore with nonrandom frg flag
 cat bos_taurus.nonrandom.clv | perl -ane 'print "frg uid $F[0] isnonrandom 1\n";' > bos_taurus.nonrandom.edit
 gatekeeper -edit bos_taurus.nonrandom.edit bt.gkpStore
Input
 gatekeeper -dumpinfo -lastfragiid bt.gkpStore
 ...
 Last frag in store is iid = 35348776
OBT
            elem      <0  0        >0        min  max   mean  median  n50   sum
 CLV5       35085508  0   3387027  31698481  0    970   25    27      29    891007232
 CLV3       35164784  0   0        35164784  15   2974  984   980     1000  34612019144
 CLR_ORIG5  35348776  0   43354    35305422  0    1753  42    29      38    1502168205
 CLR_ORIG3  35348776  0   0        35348776  70   1894  864   905     927   30547294868
 CLR_OBT5   35348776  0   26513    35322263  0    1690  49    30      73    1756346429
 CLR_OBT3   35348776  0   23477    35325299  0    1813  843   895     914   29824543869
- 421,379 reads deleted by OBT: why so many???
- Chimera:
20297 reads too short => deleted
- more 0-overlaptrim/bt.mergeLog.stats
 ...
 211037: short or inconsistent
 253536: deleted fragment due to zero clear
- Example:
 gatekeeper -dumpfragments 516316990 bt.gkpStore
 fragmentIdent           = 516316990,14
 fragmentMate            = 0,0
 fragmentLibrary         = 27473,1563
 fragmentIsDeleted       = 1
 fragmentIsNonRandom     = 1
 fragmentStatus          = G
 fragmentOrientation     = I
 fragmentHasVectorClear  = 0
 fragmentHasQualityClear = 0
 fragmentPlate           = 0
 fragmentPlateLocation   = 0
 fragmentSeqLen          = 862
 fragmentHPSLen          = 0
 fragmentSrcLen          = 17
 fragmentClearORIG       = 38,553
 fragmentClearQLT        = 1,0
 fragmentClearVEC        = 1,0
 fragmentClearOBTINI     = 35,578
 fragmentClearOBT        = 35,578
 fragmentClearUTG        = 35,578
 fragmentClearECR1       = 35,578
 fragmentClearECR2       = 35,578
 fragmentSeqOffset       = 5376
 fragmentQltOffset       = 11038
 fragmentHpsOffset       = 53
 fragmentSrcOffset       = 287
 cat 0-overlaptrim/bt.mergeLog | grep 516316990
 516316990,14 412 412 0 0 (deleted, too short)
 zcat *r000*gz | convertOverlap -a -obt 
 ...
    14 12128740  f  377  478   292  393   2.97
    14 15226267  f  397  446    31   80   2.04
    14 19071241  f    4  513   199  708   1.18
    14 20073917  f    7  478    36  508   4.88
    14 20042424  f    4  419   299  714   1.93
    14 20212935  f    7  478   234  706   4.88
    14 20073828  r    7  478   507   35   4.67
    14 20212846  r    7  478   557   85   4.67
    14 27089060  r  491  534   836  793   2.33
    14 29061748  f  489  540    86  137   1.96
    14 32105697  f  455  543   381  469   2.27
    14 32187461  f  430  534   105  209   1.92
    14 32027289  f    4  419   493  907   4.59
 ...
 #read aligns to contigs
 show-coords 516316990-ctg.filter-r.strict.delta
     35      531  |       97      594  |      497      498  |    99.20  |      862     2759  |    57.66    18.05  | 516316990  ctg7180001872751
     45      678  |      931     1564  |      634      634  |    97.00  |      862     1567  |    73.55    40.46  | 516316990  ctg7180001837311
- OBT deleted reads:
 BCM           WGS       253816
 BCM           SHOTGUN   151770
 BCCAGSC       CLONEEND  7510
 NISC          SHOTGUN   4757
 TIGR          CLONEEND  1577
 CENARGEN      WGS       599
 CENARGEN      CLONEEND  431
 TIGR_JCVIJTC  CLONEEND  377
 UIUC          CLONEEND  182
 BCM           CLONEEND  150
 BARC          CLONEEND  125
 UOKNOR        SHOTGUN   85
 total         .         421379
OBT deleted reads:
          elem    >0      min  max   mean  med  n50  sum
 len      421379  421379  98   2974  862   927  968  363280405
 avgQual  421379  421379  1    57    28    24   36   11852865
Overlapper
- 98.33% of the reads (34,761,786 out of 35,348,776 reads) had overlaps
- 1.66% of the reads had no overlaps
- 6.68% of the BCCAGSC.CLONEEND reads had no overlaps
- 4.95% of the TIGR_JCVIJTC.CLONEEND reads had no overlaps
- 3.48% of the TIGR.CLONEEND reads had no overlaps
 
- the median number of overlaps is 20
 
 Overlaps
         reads      min        max        mean       median     n50        sum
 qual    35348776   0          5592       106        20         769        3777789082
- the median number of overlaps for the BCM.WGS reads is 16
- the median number of overlaps for the BCM.SHOTGUN reads is 16 !!!
- the median number of overlaps for the NISC.SHOTGUN reads is 40 !!!
- the median number of overlaps for the BCM.CLONEEND reads is 16 !!!
 
Media:Bt.ovlStore.big.png , Media:Bt.ovlStore.small.png
Unitigger
more 4-unitigger/bt.cga.0
UNITIG OVERLAP GRAPH INFORMATION
 
       5208738 : Total number of unitigs
       2527051 : Total number of singleton, contained unitigs
       1814842 : Total number of singleton, non-contained unitigs
        180910 : Total number of non-singleton, spanned unitigs
        685935 : Total number of non-singleton, non-spanned unitigs
      34927397 : Total number of fragments
      34927397 : Total number of fragments in all unitigs
      21521581 : Total number of essential fragments in all unitigs
      13405816 : Total number of contained fragments in all unitigs
  0.0076239952 : Randomly sampled fragment arrival rate per bp
    2510896132 : The sum of overhangs in all the unitigs
    6400342737 : Total number of bases in all unitigs
             0 : Estimated number of base pairs in the genome.
             0 : Total number of contained fragments not connected
                 by containment edges to essential fragments.
 Total rho    = 2510896132
 Total nfrags = 19143061
 Estimated genome length = 0
 Estimated global_fragment_arrival_rate=0.007624
 Computed global_fragment_arrival_rate =0.007624
 Total number of randomly sampled fragments in genome = 23326293
 Computed genome length  = 3059589120.000000
 Used global_fragment_arrival_rate=0.007624
 Used global_fragment_arrival_distance=131.164826
 
 Histogram of the number of base pairs in a chunk
 100292 - 159434:    22 
 90010 -  99906:     25 
 80043 -  89676:     73 
 70013 -  79966:    162 
 60010 -  69988:    389 
 50008 -  59983:    977 
 40000 -  49998:   2434 
 30000 -  39997:   6458 
 20000 -  29999:  18957 
 10000 -  19999:  57442
 Unitigs >=10kb
             NewAsm          UMd2Asm
 
 Number       86,939          57,204
 Mean         19,464          15,140
 Sum         1,692.1Mb       866.0Mb
 max         159,434bp      78,570bp
 Contigs >=10Kb:
           NewAsm          UMd2Asm
 n         42,343           45,958      
 mean      59,856           55,473
 sum        2,534.5Mb        2,549.4Mb
 Contigs >=100Kb: 
           NewAsm          UMd2Asm
 n          7,051            6,683         
 mean     163,170          162,357    
 sum        1,150.5Mb        1,085.0Mb
 max      627,705          742,802
 Scaffolds >=10Mb:
           NewAsm          UMd2Asm
 n             30                3
 mean       14.10Mb          11.36Mb
 sum       422.95Mb         340.70Mb
 max        26.54Mb          13.36Mb
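The unitigger log above is numerically consistent with two simple relations: computed genome length = randomly sampled fragments / global fragment arrival rate, and arrival distance = 1 / arrival rate. A quick check as a sketch; the function names are hypothetical, and this only reproduces the arithmetic, not the unitigger's internal estimation:

```shell
# Genome length implied by a fragment count and an arrival rate (fragments per bp).
genome_length() {
  awk -v n="$1" -v rate="$2" 'BEGIN { printf "%.0f\n", n / rate }'
}

# Mean distance in bp between fragment starts for a given arrival rate.
arrival_distance() {
  awk -v rate="$1" 'BEGIN { printf "%.2f\n", 1 / rate }'
}

genome_length 23326293 0.0076239952   # compare with "Computed genome length"
arrival_distance 0.0076239952         # compare with "global_fragment_arrival_distance"
```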
CGW & ECR
- Checkpoints:
 cat 7-0-CGW/bt.timing | grep ^Checkpoint
 Checkpoint 3 written during MergeScaffoldsAggressive at iteration 49
 Checkpoint 4 written during MergeScaffoldsAggressive at iteration 85
 Checkpoint 5 written after 1st Scaffold Merge
 Checkpoint 6 written after 2nd Aggressive Scaffold Merge
 Checkpoint 7 written after Final Rocks
 cat 7-2-CGW/bt.timing | grep ^Checkpoint
 Checkpoint 19 written during MergeScaffoldsAggressive at iteration 12
 Checkpoint 20 written during MergeScaffoldsAggressive at iteration 31
 Checkpoint 21 written after 1st Scaffold Merge
 Checkpoint 22 written after 2nd Aggressive Scaffold Merge
 Checkpoint 23 written after Final Rocks
 cat 7-4-CGW/bt.timing | grep ^Checkpoint
 Checkpoint 34 written during MergeScaffoldsAggressive at iteration 12
 Checkpoint 35 written during MergeScaffoldsAggressive at iteration 49
 Checkpoint 36 written after 1st Scaffold Merge
 Checkpoint 37 written during Stones CleanupScaffolds after scaffold 32436
 Checkpoint 38 written during Stones CleanupScaffolds after scaffold 34939
 Checkpoint 39 written after Stone Throwing and CleanupScaffolds
 Checkpoint 40 written after 2nd Aggressive Scaffold Merge
 Checkpoint 41 written after Final Rocks
 Checkpoint 42 written after Partial Stones
 Checkpoint 43 written after Final Contained Stones
 Checkpoint 44 written after resolveSurrogates
- Get early CTG/SCF stats
 cat 7-CGW/bt.cgw_scaffolds | countMessages.pl
 ICL 451555  # ???
 ICP 116455  # CTG
 ISF 66141   # SCF
 ISL 711     # SLK
- Clear read extension:
              elem      <0        0      >0        min    max   mean  median  n50  sum
 ClearORIG    35348776  4         0      35348772  -1147  1572  821   870     893  29045126663
 ClearQLT     35348776  35348776  0      0         -1     -1    -1    -1      -1   -35348776
 ClearVEC     35348776  299034    20323  35029419  -1     2043  952   953     975  33658445088
 ClearOBTINI  35348776  0         31254  35317522  0      1364  831   879     902  29394688367
 ClearOBT     35348776  0         31254  35317522  0      1318  794   854     877  28068197440
 ClearECR1    35348776  0         31254  35317522  0      1329  794   854     877  28072014464
 ClearECR2    35348776  0         31254  35317522  0      1329  794   854     877  28072365712
 sum(ClearECR1)-sum(ClearUTG)  = 3,817,024
 sum(ClearECR2)-sum(ClearECR1) = 351,248
- Scaffold length stats:
cat 7-0-CGW/stat/final0.Scaffolds.nodelength.cgm | grep -v ^Sca | getSummary.pl -t 0 # 0,2,4 ...
 step  scaff  min   max       mean    med    n50      sum
 0     7048   2249  19719008  385020  21967  3114907  2713622175
 2     4960   2249  21907006  540915  21181  4490171  2682939682
 4     4006   2391  26541374  668427  29193  4590744  2677722052
- Last cgw
 cat 7-4-CGW/stat/final0.*Scaffolds.nodelength.cgm | grep -v ^Scaff | getSummary.pl -t scf
 cat 7-4-CGW/stat/final0.PlacedContig.n | grep -v ^Scaff | getSummary.pl -t scf
            elem       min        max        mean       med        n50        sum            
 scf        66141      432        26541374   42648      1347       4349378    2820819506     
 ctg        120461     65         627705     22421      2018       84989      2700959854
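The n50 column in these summaries comes from getSummary.pl, whose exact definition is not shown in the notes. A minimal sketch of the usual N50 (the length at which the cumulative sum of lengths, taken from longest to shortest, first reaches half of the total), assuming one length per input line; the function name `n50` is hypothetical:

```shell
# N50 of a list of lengths read from stdin, one length per line.
n50() {
  sort -nr | awk '
    { len[NR] = $1; total += $1 }
    END {
      half = total / 2
      for (i = 1; i <= NR; i++) {
        cum += len[i]
        if (cum >= half) { print len[i]; exit }
      }
    }'
}

# Example usage with a hypothetical one-column file of contig lengths:
# n50 < ctg.lengths
```

If getSummary.pl uses a different convention (for example a strict "greater than half" cutoff), its n50 can differ from this sketch for small inputs.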
QC stats
- Bos_taurus.qc this assembly stats
- Bos_taurus.qc.combine UMD2 vs this assembly stats
 TotalScaffolds=66,141
 MaxBasesInScaffolds=26,048,998
 MeanBasesInScaffolds=40,861
 TotalContigsInScaffolds=120,461
 MaxContigLength=627,911
 MeanContigLength=22,436
 TotalDegenContigs=269,031
 MaxDegenContig=33,824
 SingletonReads=3,721,123
- Posmap info
 cat bt.posmap.mates | awk '{print $3}' |count.pl -p 100
 good            10338164
 bothChaff       1160137
 oneChaff        695982
 oneSurrogate    233151
 bothDegen       218198
 diffScaffold    150423
 badShort        138464
 oneDegen        118232
 badLong         23196
 badSame         22451
 badOuttie       8751
 bothSurrogate   589
 total           13107738
 cat bt.posmap.frags | awk '{print $4,$5}' |count.pl  -p 100
 placed good             20676328
 placed notMated         8007072
 chaff bothChaff         2320274
 chaff notMated          704849
 placed oneChaff         695982
 chaff oneChaff          695982
 placed oneSurrogate     466302
 placed bothDegen        436396
 placed diffScaffold     300846
 placed badShort         276928
 placed oneDegen         236464
 placed badLong          46392
 placed badSame          44902
 placed badOuttie        17502
 placed bothSurrogate    1178
 total                   34927397
Log files
- Bt.runCA.log
- Bt.runCA.hourly.runtimes approximate running times (in hours)
Analysis
Insert libraries
1. BCM.WGS : ok
- FRG.mea: 1750-7000
- ASM.mea: 1594-6727
- Most libs have > 1000 reads & get reestimated
- All libs have ASM.std< ASM.mea/3
2. BCM.SHOTGUN
- only ~ 50% of the inserts are in libs with >1000 inserts and get reestimated by the assembly
- if the thold is dropped from 1000 to 100, we'd get ~ 95% of the inserts reestimated
 thold  elem  <0  0  >0    min   max    mean  median  n50   sum
 0      7339  0   0  7339  1     11237  245   135     1137  1799069
 100    4361  0   0  4361  100   11237  395   157     1252  1725604
 1000   440   0   0  440   1008  11237  2075  1791    2323  913086
3. NISC.SHOTGUN: ok
- Most libs have > 1000 reads & get reestimated
- All libs have ASM.std< ASM.mea/3
4. BCCAGSC.CLONEEND: ok
 LIB.id  FRG.mea  FRG.std  FRG.count  CENTER.TYPE       ASM.mea  ASM.std
 125606  150000   30000    59505      BCCAGSC.CLONEEND  161998   20133
5. UIUC.CLONEEND: ok
 LIB.id  FRG.mea  FRG.std  FRG.count  CENTER.TYPE    ASM.mea  ASM.std
 114892  150000   30000    31063      UIUC.CLONEEND  175594   41208
 115020  150000   30000    15256      UIUC.CLONEEND  162488   26358
6. TIGR.CLONEEND: originally wrong; gets reestimated
 LIB.id  FRG.mea  FRG.std  FRG.count  CENTER.TYPE    ASM.mea  ASM.std
 65177   2000     600      27067      TIGR.CLONEEND  161761   34938
7. GSC.CLONEEND: not used (all 53556 are 0 qual)
8. CENARGEN.WGS: "not used" (all 26246 are unmated)
9. BARC.CLONEEND: each library contains 1 template id => inserts did not get reestimated (25454 reads/11151 inserts)
10. BCM cloneend: ok
 LIB.id  FRG.mea  FRG.std  FRG.count  CENTER.TYPE   ASM.mea  ASM.std
 19070   167000   25000    7103       BCM.CLONEEND  171244   18555
11. CENARGEN.CLONEEND: large stdev
 LIB.id  FRG.mea  FRG.std  FRG.count  CENTER.TYPE        ASM.mea  ASM.std
 17249   202000   20200    6269       CENARGEN.CLONEEND  158938   55165
12. UOKNOR.SHOTGUN: ok ?
 LIB.id  FRG.mea  FRG.std  FRG.count  CENTER.TYPE     ASM.mea  ASM.std
 15158   3000     1000     4910       UOKNOR.SHOTGUN  3000     1000
13. TIGR_JCVI.CLONEEND: originally wrong; gets reestimated
 LIB.id  FRG.mea  FRG.std  FRG.count  CENTER.TYPE         ASM.mea  ASM.std
 10691   2500     750      2763       TIGR_JCVI.CLONEEND  160363   29580
 10738   2500     750      2040       TIGR_JCVI.CLONEEND  161915   29343
14. UOKNOR.FINISHING: only 151 reads
15. WUGSC.CLONEEND: only 49 reads
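Two of the checks used in this list can be sketched in awk: the share of inserts that sit in libraries large enough to be reestimated, and the "ASM.std < ASM.mea/3" test. Both function names are hypothetical, and the input layouts are assumptions: one insert count per library per line for the first, and "LIB.id ASM.mea ASM.std" columns for the second. The threshold is treated as "at least N inserts", which is also an assumption.

```shell
# Percentage of inserts that belong to libraries with at least `threshold`
# inserts. Input: one insert count per library, one per line.
frac_in_large_libs() {
  awk -v t="$1" '
    { total += $1; if ($1 >= t) kept += $1 }
    END { if (total > 0) printf "%.1f\n", 100 * kept / total }'
}

# Print the ids of libraries whose assembly-estimated stdev is NOT below a
# third of the estimated mean. Input columns: LIB.id ASM.mea ASM.std.
flag_wide_libs() {
  awk '$3 * 3 >= $2 { print $1 }'
}

# Example usage with rows taken from the tables above:
printf '125606 161998 20133\n17249 158938 55165\n' | flag_wide_libs
```

With the BCM.SHOTGUN sums listed under item 2, 913086 of 1799069 inserts (about 51%) are in libraries above the 1000 threshold, and 1725604 of 1799069 (about 96%) are above 100.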
Contigs Vs UMD2 contaminants & Ecoli
 4865   contigs in list.exclude_contigs.fa
 34404  exclude-ctg.qry_hits
 3763   exclude-ctg.ref_hits
 1204   exclude-ctg.CBE.qry_hits  CONTAIN|IDENTITY|BEGIN|END
 748    exclude-ctg.CBE.ref_hits  CONTAIN|IDENTITY|BEGIN|END
559 Ecoli.365350-365744-ctg.qry_hits : max ctg aligned is 179K bp; 10 are > 10K bp
Contigs Vs UMD2 chromosomes
- Split 120,461 contigs into 100 files; degenerates not split
- Align them to the 31 chromosomes 1..30,U (ref) => 101*31 jobs
 #Alignment stats
 cat chr*ctg*delta | grep "^>" | awk '{print $2}' | count.pl -f ../9-terminator/bt.ctg.infoseq | getSummary.pl -i 1 -z 1
 ctg        0          1          >1         min        max        mean       med        n50        sum
 120461     652        37540      82269      0          176        11         3          33         1422808
 #Unaligned ctg lengths
 ctg  min  max   mean  med   n50   sum
 652  65   5849  1146  1134  1194  747295
- 50% of the contigs aligned uniquely
 cat chr*-ctg*.delta | ~/bin/mergeDelta.pl   >  chr-ctg.delta
                                                             # degens? 
 delta-filter -q  chr-ctg.delta              >>  chr-ctg.filter-q.delta
 cat chr1-*.delta | ~/bin/delta2cvg.pl -M 0 | getSummary.pl -i 4
 elem  0  >0    min  max    mean  med  n50   sum
 6681  1  6680  0    12892  366   142  1095  2450106
- There are disagreements:
 /fs/sz-user-supported/Linux-x86_64/bin/show-coords -l -r -H chr1-ctg.filter-q.delta | p 'print $F[-1],"\n";' | count.pl | head
 ctg7180001761585 24
 ...
 ctg7180001634116 7
 ...
 show-coords -d chr1-ctg.filter-q.delta | grep ctg7180001761585 | p 'print " $_";'
 142115744 142188863 | 383463 310345 | 73120 73119 |  99.98 | 157714772 383463 | 0.05 19.07 | 1 -1 chr1 ctg7180001761585
 142188878 142286012 | 310361 213223 | 97135 97139 |  99.94 | 157714772 383463 | 0.06 25.33 | 1 -1 chr1 ctg7180001761585
 142287100 142287675 | 212133 211556 |   576   578 |  98.27 | 157714772 383463 | 0.00  0.15 | 1 -1 chr1 ctg7180001761585
 142288052 142288602 | 211182 210633 |   551   550 |  99.09 | 157714772 383463 | 0.00  0.14 | 1 -1 chr1 ctg7180001761585
 142288652 142295709 | 210586 203531 |  7058  7056 |  99.87 | 157714772 383463 | 0.00  1.84 | 1 -1 chr1 ctg7180001761585
 142295709 142342174 | 203512 157047 | 46466 46466 | 100.00 | 157714772 383463 | 0.03 12.12 | 1 -1 chr1 ctg7180001761585
 142346440 142367791 | 156958 135606 | 21352 21353 |  99.99 | 157714772 383463 | 0.01  5.57 | 1 -1 chr1 ctg7180001761585
 142367822 142370681 | 135597 132737 |  2860  2861 |  99.93 | 157714772 383463 | 0.00  0.75 | 1 -1 chr1 ctg7180001761585
 142370660 142382289 | 132746 121117 | 11630 11630 |  99.88 | 157714772 383463 | 0.01  3.03 | 1 -1 chr1 ctg7180001761585
 142382282 142411927 | 120984  91339 | 29646 29646 |  99.96 | 157714772 383463 | 0.02  7.73 | 1 -1 chr1 ctg7180001761585
 142411941 142419553 |  91339  83728 |  7613  7612 |  99.66 | 157714772 383463 | 0.00  1.99 | 1 -1 chr1 ctg7180001761585
 142419553 142434546 |  83721  68728 | 14994 14994 |  99.79 | 157714772 383463 | 0.01  3.91 | 1 -1 chr1 ctg7180001761585
 142434506 142437288 |  68778  65996 |  2783  2783 |  99.86 | 157714772 383463 | 0.00  0.73 | 1 -1 chr1 ctg7180001761585
 142437389 142439015 |  66757  65131 |  1627  1627 |  99.94 | 157714772 383463 | 0.00  0.42 | 1 -1 chr1 ctg7180001761585
 142439271 142440703 |  65629  64197 |  1433  1433 | 100.00 | 157714772 383463 | 0.00  0.37 | 1 -1 chr1 ctg7180001761585
 142441869 142442975 |  63548  62442 |  1107  1107 | 100.00 | 157714772 383463 | 0.00  0.29 | 1 -1 chr1 ctg7180001761585
 142446690 142449325 |  30312  32945 |  2636  2634 |  99.58 | 157714772 383463 | 0.00  0.69 | 1  1 chr1 ctg7180001761585
 142451384 142452476 |  63510  64603 |  1093  1094 |  99.91 | 157714772 383463 | 0.00  0.29 | 1  1 chr1 ctg7180001761585
 142452577 142454379 |  61000  62806 |  1803  1807 |  99.78 | 157714772 383463 | 0.00  0.47 | 1  1 chr1 ctg7180001761585
 142454487 142456821 |  59122  61456 |  2335  2335 | 100.00 | 157714772 383463 | 0.00  0.61 | 1  1 chr1 ctg7180001761585
 142458383 142459582 |  57978  59177 |  1200  1200 | 100.00 | 157714772 383463 | 0.00  0.31 | 1  1 chr1 ctg7180001761585
 142459738 142472295 |  32272  44828 | 12558 12557 |  99.92 | 157714772 383463 | 0.01  3.27 | 1  1 chr1 ctg7180001761585
 142472300 142485640 |  44828  58163 | 13341 13336 |  99.89 | 157714772 383463 | 0.01  3.48 | 1  1 chr1 ctg7180001761585
 142501686 142530021 |  28336      1 | 28336 28336 |  99.99 | 157714772 383463 | 0.02  7.39 | 1 -1 chr1 ctg7180001761585
 show-coords -d chr1-ctg.filter-q.delta | grep ctg7180001634116
 116312 162914 | 1 46603 | 46603 46603 | 99.99 | 157714772 122722 | 0.03 37.97 | 1 1 chr1 ctg7180001634116
 164916 201988 | 58062 95135 | 37073 37074 | 99.99 | 157714772 122722 | 0.02 30.21 | 1 1 chr1 ctg7180001634116
 203244 213377 | 48198 58331 | 10134 10134 | 100.00 | 157714772 122722 | 0.01 8.26 | 1 1 chr1 ctg7180001634116
 261393 264506 | 45949 49062 | 3114 3114 | 100.00 | 157714772 122722 | 0.00 2.54 | 1 1 chr1 ctg7180001634116
 264607 268579 | 94345 98317 | 3973 3973 | 100.00 | 157714772 122722 | 0.00 3.24 | 1 1 chr1 ctg7180001634116
 268586 274734 | 98323 104471 | 6149 6149 | 100.00 | 157714772 122722 | 0.00 5.01 | 1 1 chr1 ctg7180001634116
 274835 293945 | 103611 122722 | 19111 19112 | 99.99 | 157714772 122722 | 0.01 15.57 | 1 1 chr1 ctg7180001634116
 ~/bin/delta2breaks.pl -m 200 < chr1-ctg.filter-q.delta | awk '{print $8}' | count.pl
 AGREEMENT       9827
 INVERSION       283
 TRANSLOCATION+  230
 TRANSLOCATION-  154
 
 ~/bin/delta2breaks.pl -m 1000 < chr1-ctg.filter-q.delta | awk '{print $8}' | count.pl
 AGREEMENT       7564
 INVERSION       216
 TRANSLOCATION+  192
 TRANSLOCATION-  127
 
 ~/bin/delta2breaks.pl -m 10000 < chr1-ctg.filter-q.delta | awk '{print $8}' | count.pl
 AGREEMENT       3394
 INVERSION       62
 TRANSLOCATION+  50
 TRANSLOCATION-  29
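The three runs above differ only in the `-m` value, so the share of non-AGREEMENT calls can be compared directly. A minimal sketch using only the counts listed above; `delta2breaks.pl` is a local script, so reading `-m` as a minimum size cutoff is an assumption, and `disagreement_fraction` is a hypothetical helper name.

```python
# Counts copied from the three delta2breaks.pl runs above, keyed by the -m value.
counts = {
    200:   {"AGREEMENT": 9827, "INVERSION": 283, "TRANSLOCATION+": 230, "TRANSLOCATION-": 154},
    1000:  {"AGREEMENT": 7564, "INVERSION": 216, "TRANSLOCATION+": 192, "TRANSLOCATION-": 127},
    10000: {"AGREEMENT": 3394, "INVERSION": 62,  "TRANSLOCATION+": 50,  "TRANSLOCATION-": 29},
}


def disagreement_fraction(run):
    """Fraction of calls in one run that are not labelled AGREEMENT."""
    total = sum(run.values())
    return (total - run["AGREEMENT"]) / total


for m in sorted(counts):
    run = counts[m]
    pct = 100 * disagreement_fraction(run)
    print(f"-m {m}: {sum(run.values())} calls, {pct:.2f}% not AGREEMENT")
```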
Assembly UMD2.2 (Quality reads)
- Try to add the missing BCM.SHOTGUN reads to the assembly
- Assign new BCM.SHOTGUN library IDs based on volume & SEQ_LIB_ID: the same library might have a different insert size in different volumes => might lose some correct mates from different volumes
 cat bos_taurus.summary | grep BCM | grep SHOTG | cut -f6,7,8,10 | sort | more
 FAAEP 180000 13000 252
 FAAEP 2000 1000 84
 ...
 FAAHP 180000 13000 77
 FAAHP 2000 1000 230
 ...
- => 20,538 libraries out of which 18,208 contain mated reads
- create DST messages & add them to gkpStore
gatekeeper -a -o bt.gkpStore -T -F bos_taurus.BCM.SHOTGUN.new.DST
- generate gatekeeper edit file that maps each TI to the new library id
 head bos_taurus.BCM.SHOTGUN.new.ti2libinfo.edit
 frg uid 499507131 libuid 601081
 frg uid 499507132 libuid 601081
 ...
- generate gatekeeper edit file that deletes all mate information
 head bos_taurus.BCM.SHOTGUN.new.mate.delete
 frg uid 500086180 mateuid 0
 frg uid 500084310 mateuid 0
 ...
- pair forward/reverse reads that have the same new library id and the same TEMPLATE_ID
 head bos_taurus.BCM.SHOTGUN.new.mate.edit
 frg uid 583866821 mateuid 583872364
 frg uid 583866822 mateuid 583872408
 ...
- run gatekeeper --edit for each edit/delete file
gatekeeper --edit ... bt.gkpStore
- restart assembly at cgw (doExtendClearRanges=1)
- consensus after cgw failed on job 25 on CTG 5597062 : cannot create consensus from multialignment ...
 Fix: delete failed message
 cp bt.cgw_contigs.25 bt.cgw_contigs.25.FAILED
 delete "{ICM acc:5597062 pla:P len:20889 ..." from bt.cgw_contigs.25
- terminator fail; message:
ICL: reference before definition error for contig ID 5597062
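The mate-pairing step above (forward/reverse reads with the same new library id and the same TEMPLATE_ID) can be sketched as follows. This is an illustration, not the script that was actually used: the input tuple layout and the function name `pair_mates` are hypothetical, and skipping templates with more than one forward or reverse read is an assumption. Only the `frg uid ... mateuid ...` output format is taken from the edit file shown above.

```python
from collections import defaultdict


def pair_mates(reads):
    """Build gatekeeper-style mate edit lines.

    reads: iterable of (uid, library_id, template_id, direction) tuples,
           with direction "F" or "R" (hypothetical input layout).
    """
    groups = defaultdict(lambda: defaultdict(list))
    for uid, library_id, template_id, direction in reads:
        groups[(library_id, template_id)][direction].append(uid)

    lines = []
    for key in sorted(groups):
        forward = groups[key]["F"]
        reverse = groups[key]["R"]
        # Assumption: only unambiguous templates (one forward, one reverse) are mated.
        if len(forward) == 1 and len(reverse) == 1:
            lines.append(f"frg uid {forward[0]} mateuid {reverse[0]}")
    return lines


example = [
    (583866821, 601081, "T1", "F"),
    (583872364, 601081, "T1", "R"),
    (1, 601081, "T2", "F"),  # same template name, different library: not paired
    (2, 601082, "T2", "R"),
]
for line in pair_mates(example):
    print(line)
```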
Assembly UMD2.3 (2009_0210_CA; all reads)
- 35,973,728 reads : 35,348,776 quality & 624,952 quality-less
- 16,896,244 mates
- 25,312 libraries
Issues (not solved):
- 10420 quality-less reads contain at least 1 "NN" in their clr (50.. min(len,600))
- 5973 quality-less reads contain at least 1 "NNN" in their clr (50.. min(len,600))
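A minimal sketch of how reads with runs of N inside the default clear range 50..min(len,600) could be counted. The function name `has_n_run` is hypothetical, and treating the clear range as a 0-based, half-open slice is an assumption; the coordinate convention used by the real pipeline is not stated here.

```python
import re


def has_n_run(seq, run_len, clr_begin=50, clr_end_cap=600):
    """True if the default clear range of seq contains run_len or more consecutive N's.

    Assumption: the clear range is the 0-based half-open slice
    [clr_begin, min(len(seq), clr_end_cap)).
    """
    clr = seq[clr_begin:min(len(seq), clr_end_cap)]
    return re.search("N{%d,}" % run_len, clr, re.IGNORECASE) is not None


reads = {
    "inside_NN": "A" * 60 + "NN" + "A" * 100,
    "inside_NNN": "A" * 60 + "NNN" + "A" * 100,
    "before_clr": "NNN" + "A" * 100,
    "after_clr": "A" * 700 + "NNN",
}
with_nn = sorted(name for name, seq in reads.items() if has_n_run(seq, 2))
with_nnn = sorted(name for name, seq in reads.items() if has_n_run(seq, 3))
print(len(with_nn), "reads with NN;", len(with_nnn), "reads with NNN")
```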
Quality-less clrs
- 624,952 quality-less reads
- Quality-less read stats: alignment CLR or 50..min(len,600) trimming
       elem    min    max   mean  median  n50  sum
 len   624952  5      1495  887   947     961  554429198
 5     624952  6      1584  51    51      51   32150411
 3     624952  5      1495  695   699     699  434960697
 53    624952  -1579  1444  644   648     648  402810286
- Align 624,952 to the 120,461 Assembly1 contigs (no degenerates) : 1 day on 13 cpus
- 572,140(91.5%) reads aligned and 52,812(8.5%) did not align to the contigs
1. Launch jobs in parallel: 12766 jobs on 13 processors
 nucmer -l 50 -c 200 -b 10 -g 5 -d 0.05 bt.ctg.001.fasta bos_taurus.0qual.01.seq -p ctg.001-seq.01
 ...
 nucmer -l 50 -c 200 -b 10 -g 5 -d 0.05 bt.ctg.982.fasta bos_taurus.0qual.13.seq -p ctg.001-seq.01
- CPU usage: 100% /job
- Max mem usage: 0.1% /job
2. Get maximum extended clrs
cat *delta | ~/bin/delta2qryClr.pl -best | sort > bos_taurus.0qual.best.clr
 Length stats
            elem       min        max        mean       median     n50        sum
 all        624952     5          1495       887        947        961        554429198
 aligned    572140     221        1416       912        953        964        522281354
 unaligned  52812      5          1495       608        580        754        32147844
 Best/Max/Max+extended alignment coord stats:
            elem       min        max        mean       median     n50        sum
 53.best    572140     94         1208       766        841        877        438793102
 53.max     572140     170        1208       794        863        888        454816817
 53.extend  572140     170        1208       797        865        889        456014184
 Unaligned read counts:                           
                          unaligned    total   quality   quality-less
 BCM.WGS                  42595
 UOKNOR.SHOTGUN           5787         14651   2456      12195
 GSC.CLONEEND             2294         53521   0         53521
 BCCAGSC.CLONEEND         1869         125241  116484    8757
 BCM.SHOTGUN              186
 UOKNOR.FINISHING         81
- Realign the 52,812 unaligned quality-less reads to the contigs using less strict nucmer parameters: -l 30 -c 50 -b 50 -g 50 -d 0.12
- Only 9,269 reads aligned, at an average 92% identity (min 81% identity): not too good
3. Get reads without clrs: set their clr to maximum 50..600
 cp bos_taurus.0qual.extended.clr bos_taurus.0qual.clr
 difference.pl bos_taurus.0qual.infoseq bos_taurus.0qual.extended.clr | perl -ane '$three=600; $three=$F[1] if ($F[1]<600); print "$F[0] 50 $three\n";' >> bos_taurus.0qual.clr
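Step 3 above can be restated as: keep the alignment-derived clear range where one exists, otherwise fall back to 50..min(len,600). A minimal sketch of that rule, assuming the second field printed by `difference.pl` is the read length; `fill_missing_clr` and the dictionary inputs are hypothetical names used only for illustration.

```python
def fill_missing_clr(lengths, aligned_clr, begin=50, end_cap=600):
    """Return a clear range for every read.

    lengths:     {read_id: read_length}   (assumed to come from the infoseq file)
    aligned_clr: {read_id: (begin, end)}  (alignment-based clear ranges)
    Reads without an alignment-based range get (begin, min(length, end_cap)).
    Like the Perl one-liner, this does not special-case reads shorter than begin.
    """
    clr = dict(aligned_clr)
    for read_id, length in lengths.items():
        if read_id not in clr:
            clr[read_id] = (begin, min(length, end_cap))
    return clr


lengths = {"a": 900, "b": 420, "c": 800}
aligned = {"a": (94, 860)}
for read_id, (b, e) in sorted(fill_missing_clr(lengths, aligned).items()):
    print(read_id, b, e)
```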
Quality clrs
- Use Assembly1 OBT clrs
- Delete reads deleted in the OBT process
Gatekeeper
Load order:
- Add quality FRG : "gatekeeper -T -F ..."
- Add quality-less FRG "gatekeeper -F -a ..." # -T should be removed
- Delete quality FRG (deleted by UMD2.1 OBT)
- Add DST
- Add LKG
Edit
- Loads clrs
- Loads clvs
- Loads nonrandom info
Meryl
Use Assembly1 kmer counts
Overlapper
- Use 80/90 Assembly1 overlap results
- Rerun 10 overlap jobs
- 96.64% of the quality-less reads have overlaps (vs 98.33% of the quality reads)
                    reads   0ovl   1+ovl   min  max   mean  median  n50  sum
 0qual(all)         624831  20941  603890  0    4350  96    19      740  60494730  # 96.64%
 0qual(unaligned)   52691   15384  37307   0    3229  50    5       349  2655545   # 70.80%
Unitigger
- More unitigs, more bases in unitigs
- A few of the longest unitigs got broken, e.g. the longest is 138,294 (UMD2.3) vs 159,434 (UMD2.1)
 UNITIG OVERLAP GRAPH INFORMATION
 
       5333434 : Total number of unitigs
       2595174 : Total number of singleton, contained unitigs
       1865473 : Total number of singleton, non-contained unitigs
        183693 : Total number of non-singleton, spanned unitigs
        689094 : Total number of non-singleton, non-spanned unitigs
      35551316 : Total number of fragments
      35551316 : Total number of fragments in all unitigs
      21830994 : Total number of essential fragments in all unitigs
      13720322 : Total number of contained fragments in all unitigs
  0.0077856472 : Randomly sampled fragment arrival rate per bp
    2514833413 : The sum of overhangs in all the unitigs
    6483064813 : Total number of bases in all unitigs
             0 : Estimated number of base pairs in the genome.
             0 : Total number of contained fragments not connected
                 by containment edges to essential fragments.
 Total rho    = 2514833413
 Total nfrags = 19579606
 Estimated genome length = 0
 Estimated global_fragment_arrival_rate=0.007786
 Computed global_fragment_arrival_rate =0.007786
 Total number of randomly sampled fragments in genome = 23870254
 Computed genome length  = 3065930496.000000
 Used global_fragment_arrival_rate=0.007786
 Used global_fragment_arrival_distance=128.441474
  
 Histogram of the number of base pairs in a chunk
 100406 - 138294:    21
 90330 -  99887:     23
 80042 -  89675:     79
 70014 -  79943:    169
 60002 -  69792:    374
 50000 -  59982:   1008
 40002 -  49995:   2440
 30001 -  39994:   6509
 20000 -  29999:  18989
 10000 -  19999:  57404
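The rate, genome length and arrival distance printed in the unitigger summary above appear to be related by simple ratios. The sketch below recomputes them from the printed totals; the formulas are inferred from the numbers in the log, not taken from the unitigger source, so treat them as an assumption.

```python
# Values copied from the UMD2.3 unitigger summary above.
total_rho = 2514833413       # "Total rho"
total_nfrags = 19579606      # "Total nfrags"
random_frags = 23870254      # "Total number of randomly sampled fragments in genome"

# Inferred relations (assumption): rate = nfrags / rho, length = fragments / rate.
rate = total_nfrags / total_rho
genome_length = random_frags / rate
arrival_distance = 1 / rate

print(f"global_fragment_arrival_rate     = {rate:.6f}")
print(f"computed genome length           = {genome_length:.0f}")
print(f"global_fragment_arrival_distance = {arrival_distance:.6f}")
```

The recomputed genome length agrees with the logged 3065930496 only to within a few hundred bases, which would be consistent with single-precision rounding in the log output.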
Consensus after unitigger
Problems:
- job 120 executed partially (see bt_120.cgi_tmp); Solution: split into 3 parts, run separately, merge results
- failed on 19 unitigs (587..7447 bp)
 rm 5-consensus/*failed
 touch 5-consensus/consensus.success
Cgw
- Failure 1 : because job 120 was run partially => missing mates
- Failure 2 : because of /5-consensus/FAILED/bt_???.cgi.failed => missing mates => delete 356 mates
Error:
 ProcessFrags()-- WARNING! fragiid=35973388,index=33600942 mateiid=35973363,index=0 -- MATE DOESN'T EXIST!
 cgw: Input_CGW.c:117: ProcessFrags: Assertion `err == 0' failed.
Fix:
 cat cgw.out | grep MATE | p '/mateiid=(\d+)/; print $1,"\n";' >! cgw.out.mateiid
 gatekeeper -dumpfragments -tabular -iid cgw.out.mateiid bt.gkpStore/ | cut -f1,3 | ~/bin/mate2lkg.pl -a D >! cgw.out.delete.LKG
 gatekeeper -a -o bt.gkpStore -T -F -L cgw.out.delete.LKG
- Failure 3: because of cgwOutputIntermediate=1
Try to restart from ckp: dies with assertion failure
 cgw -y -R 8 -N 12 -j 1 -k 5 -r 5 -s 2 -S 0 -z -m 100 -g ./bt.gkpStore -o ./7-0-CGW.8_12/bt ./5-consensus/bt_001.cgi
 cgw -y -R 8 -j 1 -k 5 -r 5 -s 2 -S 0 -z -m 100 -g ./bt.gkpStore -o ./7-0-CGW.8_12/bt ./5-consensus/bt_001.cgi
Fix: Restart cgw from the beginning
- cgw does update bt.SeqStore - OpenSequenceDB()
ECR (eventually skipped)
- Failed after running for 1 day
 /fs/szdevel/dpuiu/SourceForge/wgs-5.2/Linux-amd64/bin/extendClearRanges -g ./bt.gkpStore -n 15 -c bt -b 146216 -e 167100 -i 1 > 7-1-ECR/extendClearRanges-scaffold.146216.err
 sh: line 1: 17016 Aborted
- Last ckp : bt.ckp.15
- Try to fix:
 touch 7-1-ECR/cgw.success
 runCA "doExtendClearRanges = 1"
- Runs too slow !!!
- Can specify a scaffold range to process: -b ? -e ? => ckp files; could we merge them?
- Failed after running for 1 day
Consensus after CGW
- Failed on job 56
 tail 8-consensus/bt.cns_contigs.56.err
 ...
 Could (really) not find overlap between 153923 (U) and 2508303 (R) estimated ahang: 0 (ejecting frag 2508303 from contig)
 consensus: math_AS.h:51: ceil_log2: Assertion `x > 0' failed.
 
 cat 7-CGW/bt.cgw_contigs.56 | countMessages.pl
 ICM 440
 IMP 281412
 IUP 12715
 
 cat 8-consensus/bt.cns_contigs.56_tmp | countMessages.pl
 ICM 115
 IMP 103322
 IMV 8122
 IUP 4849
- Fix: split ICM messages 1..115,116,116+ and run consensus on each set
QC
       elem    min  max       mean   med   n50      sum
 scf   56891   407  33129045  50871  1378  4716077  2894145150
 ctg   122851  64   651167    21957  3647  71561    2697514858
 deg   268237  65   30246     1019   985   997      273575106
- Compared with UMD2.1: better scaffolds, worse contigs & unitigs
Analysis
Issues:
- Identify bacterial & mito contigs: mito seq
- Align ctg & degen to UMD2 chromosomes
- the chromosomes should have no 0cvg regions
- possible inversions, translocations (UMD2 used markers)
- if align breaks/indels, which assembly is correct?
 
Assembly UMD2.4 (2009_0217_CA; All reads)
- 35,973,728 reads : 35,348,776 quality & 624,952 quality-less
- 16,896,244 mates
- 344 libraries
Fix quality-less read clrs (N's) (temporary solution)
- 10420 contain at least 1 "NN" in their clr (50.. min(len,600))
- 5973 contain at least 1 "NNN" in their clr (50.. min(len,600))
Fix:
 frg2seq.pl < bos_taurus.0qual.frg > bos_taurus.0qual.seq
 fasta2qual.pl bos_taurus.0qual.seq > ! bos_taurus.0qual.qual
 lucy \
    -o bos_taurus.0qual.lucy.seq  bos_taurus.0qual.lucy.qual \
    -debug  bos_taurus.0qual.lucy.info \
    bos_taurus.0qual.seq bos_taurus.0qual.qual
cat bos_taurus.0qual.lucy.info | cut -f1,3,4 -d ' ' | sort >! bos_taurus.0qual.lucy.clr
- 624,952 quality-less reads
- Quality-less read stats: 50..min(len,600) & lucy trimming
      elem    0     >0      min  max  mean  med  n50  sum
 5    624952  2857  622095  0    501  52    52   52   33012433
 3    624952  2857  622095  0    600  579   600  600  361980208
 53   624952  2857  622095  0    548  526   548  548  328967775
Fix quality-less read clrs (low complexity)
- Run dust filter on seq (before qual & lucy)
      elem    0     >0      min  max  mean  med  n50  sum
 5    624952  3564  621388  0    501  75    52   52   47385578
 3    624952  3564  621388  0    600  554   600  600  346470473
 53   624952  3564  621388  0    548  478   548  548  299084895
- Merge dust.lucy clrs with the alignment clrs
      elem    0     >0      min  max  mean  med  n50  sum
 5    624952  4488  620464  0    599  93    52   126  58378496
 3    624952  4488  620464  0    600  547   600  600  342258160
 53   624952  4488  620464  0    548  454   512  548  283879664
- Test seq
 gatekeeper -dumpfastaseq -b 35348777 -e  35973728 bt.gkpStore  | grep NNN
 gatekeeper -dumpfastaseq  bt.gkpStore   | perl -ane 'if(/^>(\d+)/) { $id=$1} elsif(/NNN/) { print $id,"\n";} ' | uniq -c | awk '{print $2,$1}'  > bt.NNN.seqs  # 2411 seqs (all have the N's "in the middle")
 gatekeeper -dumpfastaseq -uid bt.NNN.seqs bt.gkpStore >  bt.NNN.cseqs
Consolidate libraries
Drop from 25,312 to 344 libs
BCM.SHOTGUN
UMD2.4 re-estimated 10,117 out of 13,826 libs (those with > 100 mates)
Based on initial estimates
- Reduce the total number from 13826 to 2 libs: 3000 & 6000
- UMD2.3 mean estimates (Initial vs Final):
 meanI   #libs  minF  maxF  meanF  medF  n50F  sumF      uid
 180000  436    1636  5199  2475   2410  2458  1079407   #3000
 167000  86     1585  2948  2264   2258  2285  194775    #3000
 6500    31     5212  6636  5837   5867  5924  180951    #6000
 6000    11     4556  6272  5389   5421  5421  59286     #6000
 3500    949    1670  4769  2668   2608  2645  2532027   #3000
 3000    2511   1483  5250  2715   2662  2723  6818678   #3000
 2000    6093   1157  6443  2526   2487  2554  15391160  #3000
Based on final estimates
- Reduce the total number from 13826 to 7 libs: 6500,5500,...1500, un-estimated (2501)
 meanF #libs min max mean med n50 sum uid(new) mean(new) std(new)
 6K<=mea<7K 15 6010 6636 6176 6159 6159 92650 6500 6500
 5K<=mea<6K 29 5121 5985 5540 5536 5577 160673 5500 5500
 4K<=mea<5K 67 4017 4939 4284 4266 4274 287072 4500 4500
 3K<=mea<4K 1401 3000 3998 3276 3209 3226 4590323 3500 3500
 2K<=mea<3K 7998 2000 2999 2502 2498 2532 20017767 2500 2500
 1K<=mea<2K 607 1157 1999 1825 1882 1890 1107798 1500 1200
 un-estimated 3709 2501 2501
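The binning by final estimate described above maps each re-estimated library mean into one of six 1 kb-wide bins labelled by the bin midpoint, with un-estimated libraries sent to 2501. A minimal sketch of that mapping; the function name `consolidated_library` is hypothetical, and raising an error for means outside 1K..7K is an assumption, since no such libraries appear in the table.

```python
def consolidated_library(mean_final):
    """Map a re-estimated library mean to its consolidated library label.

    None means the library was not re-estimated (label 2501 in the notes above).
    """
    if mean_final is None:
        return 2501
    if not 1000 <= mean_final < 7000:
        # Assumption: the notes list no bins outside 1K..7K.
        raise ValueError(f"mean {mean_final} is outside the 1K..7K bins")
    return int(mean_final // 1000) * 1000 + 500


for mean in (1157, 2502, 3000, 6636, None):
    print(mean, "->", consolidated_library(mean))
```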
BARC.CLONEEND
Collapse all 11150 into 1:
uid:25456 mea:165000 std:43000
Overlapper
- Quality-less reads overlaps: fewer than in the UMD2.3 assembly
              elem    0      >0      min  max   mean  med  n50  sum
 0qual(all)   624830  35692  589138  0    3237  60    14   439  37578899  # 94.39%
Unitigger
 UNITIG OVERLAP GRAPH INFORMATION    
       5356408 : Total number of unitigs
       2613795 : Total number of singleton, contained unitigs
       1870448 : Total number of singleton, non-contained unitigs
        182878 : Total number of non-singleton, spanned unitigs
        689287 : Total number of non-singleton, non-spanned unitigs
      35547861 : Total number of fragments
      35547861 : Total number of fragments in all unitigs
      21685943 : Total number of essential fragments in all unitigs
      13861918 : Total number of contained fragments in all unitigs
  0.0077797328 : Randomly sampled fragment arrival rate per bp
    2513424271 : The sum of overhangs in all the unitigs
    6468428782 : Total number of bases in all unitigs
             0 : Estimated number of base pairs in the genome.
             0 : Total number of contained fragments not connected
                 by containment edges to essential fragments.
 Total rho    = 2513424271               
 Total nfrags = 19553770
 Estimated genome length = 0
 Estimated global_fragment_arrival_rate=0.007780
 Computed global_fragment_arrival_rate =0.007780     
 Total number of randomly sampled fragments in genome = 23868770
 Computed genome length  = 3068070656.000000          
 Used global_fragment_arrival_rate=0.007780             
 Used global_fragment_arrival_distance=128.539119
Histogram of the number of base pairs in a chunk
100292 - 138301:     19
 90052 -  99906:     23
 80043 -  89676:     79
 70013 -  79966:    164
 60010 -  69988:    390
 50008 -  59983:    949
 40000 -  49998:   2433
 30000 -  39997:   6437
 20000 -  29999:  18808
 10000 -  19999:  57634
Bog
!!! Much bigger unitigs than default unitigger
 Global Arrival Rate: 0.013829
 212260 - 224992:      4
 100099 - 186873:    372
  90015 -  99973:    353
  80045 -  89988:    582
  70011 -  79999:   1084
  60000 -  69994:   1856
  50001 -  59996:   3162
  40002 -  49994:   5407
  30000 -  39999:   9767
  20000 -  29996:  18981
  10000 -  19999:  39641
Consensus after Unitigger
- Failed on jobs 120 & 121 ( _tmp file)
 cat 4-unitigger/*120* | countMessages.pl
 IMP 280264
 IUM 124707
 cat 4-unitigger/*121* | countMessages.pl
 IMP 282146
 IUM 245650
 
 cat 5-consensus/bt_120.cgi_tmp | countMessages.pl
 IMP 34348
 IUM 19222
 cat 5-consensus/bt_121.cgi_tmp | countMessages.pl
 IMP 51833
 IUM 16805
- Fix 120: split IUM messages
 extractfromfrgMSG.pl -b 0 -e 19222 bt_120.cgb.orig IUM >! bt_120.cgb &
 extractfromfrgMSG.pl -b 19222 bt_120.cgb.orig IUM >! bt_120.cgb &
- Fix 121: remove assertion in AS_CNS/MultiAlignment_CNS.c
 if(to <= from || to > ma_length-1){
   fprintf(stderr, "AbacusRefine range (to) invalid");
   //assert(0); 
 }
CGW
- Failed after Ckp3 (7-0-CGW/bt.ckp.3; MergeScaffoldsAggressive 2nd iteration)
 CI extends beyond end of scaffold!
 offsetAEnd = 254204 offsetBEnd = 252250 scaffoldLength = 253268
 cgw: CIScaffoldT_Merge_CGW.c:307: InsertScaffoldContentsIntoScaffold: Assertion `0' failed.
- Last cgw
 Scaffold lengths:
 cat 7-4-CGW/stat/final0.*Scaffolds.nodelength.cgm | grep -v ^Scaff | getSummary.pl -t scf
 cat 7-4-CGW/stat/final0.PlacedContig.n | grep -v ^Scaff | getSummary.pl -t scf
            elem       min        max        mean       med        n50        sum            
 scf        45826      385        34263871   59591      1349       7059820    2730853790     
 ctg        96562      65         738899     27789      3657       93988      2683452359
 Library insert estimates:
 cat 7-4-CGW/stat/scaffold_final.distupdate.dst | grep ^# | awk '{print $3,int($8),int($10)}' > 7-4-CGW/bt.dst
 join2.pl bt.dst 7-4-CGW/bt.dst | p 'print join "\t",@F[0,1,2,5,6,3,4]; print "\n";' > bt.dst.combine
CLONEEND inserts:
 UID     MEANI   STDI   MEANF   STDF   COUNT  LIB
 114892  150000  30000  175701  40732  31063  UIUC.CLONEEND
 19070   167000  25000  171349  18253  7103   BCM.CLONEEND
 118     167000  16700  167000  16700  21     WUGSC.CLONEEND
 25456   165000  43000  163044  25849  11150  BARC.CLONEEND
 115020  150000  30000  162719  25343  15256  UIUC.CLONEEND
 65177   2000    600    162540  34155  27067  TIGR.CLONEEND
 125606  150000  30000  162396  19319  59505  BCCAGSC.CLONEEND
 10738   2500    750    162386  27567  2040   TIGR_JCVIJTC.CLONEEND
 10691   2500    750    161540  28239  2763   TIGR_JCVIJTC.CLONEEND
 17249   202000  20200  157496  55375  6269   CENARGEN.CLONEEND
 54017   120000  12000  115671  27594  25889  GSC.CLONEEND
 total 188126 CLONEENDs
Consensus
- Failed on job 34 with segmentation fault
- 9kbp contig, made out of 3007 reads (24 of which are quality-less)
 cat 7-CGW/bt.cgw_contigs.34.1 | grep "^{" | uniq -c | awk '{print $2,$1}'
 {ICM 1
 {IMP 3007
 {IUP 329
- Fix : edit AS_CNS/MultiAlignment_CNS.c; add
 if(!ungappedSequence->Elements) { ungappedSequence->numElements=0; }
 if(!ungappedQuality->Elements) { ungappedQuality->numElements=0; }
Analysis
Contigs Vs possible contaminants
- nucmer alignment parameters: -l 40 -c 100 -b 10 -g 5 -d 0.05
- have to redo alignments using -maxmatch !!!
- file location:
reference seqs:
 /nfshomes/dpuiu/db/Ecoli.365350-365744   # Ecoli K12 region with most alignments (BCM WGS splice site)
 /nfshomes/dpuiu/db/Ecoli                 # Ecoli K12 substrain MG1655 (NC_000913 ; 1st completed)
 /nfshomes/dpuiu/db/Ecoli.all             # 22 Ecoli completed genomes ( + plasmids)
 /nfshomes/dpuiu/db/UniVec_Core           # UniVec Core seqs
 /nfshomes/dpuiu/db/OtherVec              # 100 other vector sequences identified by aligning UMD2.0 contaminants to GenBank; align also to 110 UniVec core using nucmer (params above)
 /nfshomes/dpuiu/db/bos_taurus.UMD2.contaminant.fasta            # 4813 whole contigs and 30329 contig regions identified by NCBI as UMD2 contamination
 /nfshomes/dpuiu/db/bos_taurus.UMD2.contaminant.organism_count   # organism counts: vector is the most abundant
 /nfshomes/dpuiu/db/bos_taurus.UMD2.contaminant.infoseq          # grep -v 'coli|vector|7180003101029' => 905 other contaminants
query seqs:
 /scratch1/bos_taurus/Assembly/2009_0217_CA/9-terminator/ctg.split100/*fasta   # latest assembly contigs (no degenerates)
delta files:
 /scratch1/bos_taurus/Assembly/2009_0217_CA/nucmer_ctg/no_maxmatch/*delta
Ecoli K12 substrains:
 NC_010473.1  4686137  50.78  Escherichia coli str. K-12 substr. DH10B, complete genome
 NC_000913.2  4639675  50.79  Escherichia coli str. K-12 substr. MG1655, complete genome
 AC_000091.1  4646332  50.80  Escherichia coli str. K-12 substr. W3110, complete genome
no maxmatch
- fewer alignments in UMD2.4 than in UMD2
UMD2 (all): just a few degens
 15102  Ecoli.365350-365744-ctg.qry_hits
 15943  Ecoli-ctg.qry_hits
 17308  Ecoli.all-ctg.qry_hits
 79065  UMD2.contaminant-ctg.qry_hits       # 55877 new hits
 20105  UMD2.contaminant-ctg.CBE.qry_hits   # CONTAIN|BEGIN|END|IDENTITY
 19839  UniVec_Core-ctg.qry_hits
UMD2.4
 559    Ecoli.365350-365744-ctg.qry_hits
 1215   Ecoli-ctg.qry_hits
 2767   Ecoli.all-ctg.qry_hits              # the 2 most frequent strains are UMN026 & ATCC 8739; K12 DH10B is ranked 5th; K12 MG1655 is ranked 19th (out of 31 seqs)
 44112  UMD2.contaminant-ctg.qry_hits
 5286   UniVec_Core-ctg.qry_hits
Length of the reference seqs used for screening:
                       #seqs  min      max      mean     med      n50      sum
 Ecoli.365350-365744   1      395      395      395      395      395      395        # Ecoli K12 regions with most alignments (BCM WGS splice site)
 Ecoli                 1      4639675  4639675  4639675  4639675  4639675  4639675    # Ecoli K12 substrain MG1655
 Ecoli.all             49     3306     5572075  2293320  130440   5065741  112372708  # 22 Ecoli's
 UniVec_Core           1348   12       48551    243      98       967      327641
 OtherVec              100    1702     739874   15419    5027     166744   1541984
 UMD2.contaminant      35142  48       16661    512      362      674      18022349
Length of UMD2.4 contigs that contain contaminant (0+ bp from end):
                            #ctgs  <2000bp  >=2000bp  min   max      mean   med    n50     sum
 Ecoli.365350-365744-ctg    559    534      25        1001  179527   2467   1341   1894    1379440
 Ecoli-ctg                  1215   1086     129       1001  360312   4326   1347   71372   5256540
 Ecoli.all-ctg              2767   2455     312*      1001  453627*  8031   1366   134516  22224468
 UniVec_Core-ctg            5286   4718     568*      882   651163*  9820   1337   136090  51909339
 UMD2.contaminant-ctg.CBE   4976   4410     566*      738   651163*  8497   1339   122281  42281715    #annotated alignments: CONTAIN|BEGIN|END|IDENTITY
 UMD2.contaminant-ctg       44112  12813    31299     268   739442   50591  27461  111598  2231701788
Length of UMD2.4 contigs that contain contaminant in the middle (500+ bp from end):
                            #ctgs  <2000bp  >=2000bp   min   max      mean   med    n50     sum
 Ecoli.365350-365744-ctg    144    136      8          1286  2053     1779   1811   1814    256259
 Ecoli-ctg                  171    152      19         1286  4703     1835   1807   1821    313820
 Ecoli.all-ctg              197    160      37*(81)    1228  351373*  6516   1815   125069  1283728     #81 2K+ ctgs using -maxmatch
 UniVec_Core-ctg            1278   1110     168*(276)  1085  651163*  12266  1496   160336  15676765    #276 2K+ ctgs using -maxmatch
 UMD2.contaminant-ctg.CBE   52     25       27*        1249  351373*  22195  2054   125069  1154142     #annotated alignments: CONTAIN|BEGIN|END|IDENTITY
 UMD2.contaminant-ctg       31019  1437     29582      1113  739442   70665  50798  113684  2191986214
Length of the UMD2.4 contaminant sequence (0+ bp from end):
                           #align  <200bp  >=200bp  min  max     mean  med  n50  sum
 Ecoli.365350-365744-ctg   1066    537     529      104  225     192   162  224  205379
 Ecoli-ctg                 1793    587     1206     50   4440    496   224  994  889798
 Ecoli.all-ctg             4074    1132    2942*    40   17075*  380   254  441  1551783
 UniVec_Core-ctg           14425   9819    4606*    40   1801*   236   162  325  3409187
 UMD2.contaminant-ctg      144843  96008   48835    40   16661   199   169  209  28912002
Length of the UMD2.4 contaminant sequence (500+ bp from end):
                           #align  <200bp  >=200bp  min  max     mean  med  n50  sum
 Ecoli.365350-365744-ctg   243     136     107      162  224     189   162  224  46000
 Ecoli-ctg                 273     149     124      106  1341    219   162  224  59923
 Ecoli.all-ctg             294     153     141*     106  2150*   251   162  224  73992
 UniVec_Core-ctg           2144    2035    109*     50   1340*   122   121  121  261821
 UMD2.contaminant-ctg      121331  86985   34346    40   2738    171   162  184  20753580
- Problem: 8 long ctgs contain Ecoli in the middle (1000+ bp from end)
show-coords Ecoli.all-ctg.filter-q.delta | ~/bin/filterQryCoords.pl -i 1000 | sort -nk13 -r
 [S1] [E1] | [S2] [E2] | [LEN 1] [LEN 2] | [% IDY] | [LEN R] [LEN Q] | [COV R] [COV Q] | [TAGS]
 ===============================================================================================================================
 4640589 4641890 | 161712 160411 | 1302 1302 | 99.46 | 4686137 351373 | 0.03 0.37 | gi|170079663|ref|NC_010473.1| ctg7180001872124
 5068908 5069620 | 87386 86679 | 713 708 | 98.88 | 5209548 91972 | 0.01 0.77 | gi|218687878|ref|NC_011745.1| ctg7180002055226
 3087480 3088683 | 50423 51620 | 1204 1198 | 99.00 | 5202090 88182 | 0.02 1.36 | gi|218703261|ref|NC_011751.1| ctg7180002054092
 4640580 4641890 | 19953 18646 | 1311 1308 | 99.08 | 4686137 31157 | 0.03 4.20 | gi|170079663|ref|NC_010473.1| ctg7180001875158
 131462 133564 | 1247 3349 | 2103 2103 | 98.19 | 241387 5751 | 0.87 36.57 | gi|157412014|ref|NC_009838.1| ctg7180002043242
 82801 83166 | 2986 2621 | 366 366 | 98.09 | 241387 4709 | 0.15 7.77 | gi|157412014|ref|NC_009838.1| ctg7180001714551
 82264 82793 | 3523 2994 | 530 530 | 98.49 | 241387 4709 | 0.22 11.26 | gi|157412014|ref|NC_009838.1| ctg7180001714551
 1652253 1652545 | 1487 1195 | 293 293 | 98.63 | 4700560 2492 | 0.01 11.76 | gi|218552585|ref|NC_011741.1| ctg7180001754941
- Regions present in DH10B but not MG1655
 delta2cvg -M 0 < DH10B-MG1655.delta
 gi|170079663|ref|NC_010473.1| 1349629 1378243 28614 0
 gi|170079663|ref|NC_010473.1| 1391006 1396986 5980 0
 gi|170079663|ref|NC_010473.1| 3199469 3200798 1329 0
 gi|170079663|ref|NC_010473.1| 3211928 3213257 1329 0
 gi|170079663|ref|NC_010473.1| 4640588 4641918 1330 0 !!!
- Problem: 10 long ctgs contain Vector in the middle (1000+ bp from end)
show-coords UniVec_Core-ctg.filter-q.delta | ~/bin/filterQryCoords.pl -i 1000 | sort -nk13 -r
   [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS]
===============================================================================================================================
      1      121  |   215495   215615  |      121      121  |    99.17  |      170   271477  |    71.18     0.04  | gnl|uv|U09128.1:15891-16011-49     ctg7180002047604       # pSacBII P1 cloning vector  
   2252     2435  |     1334     1151  |      184      184  |   100.00  |     2485   160336  |     7.40     0.11  | gnl|uv|U75992.1:16925-19409        ctg7180001808271
    180      312  |     1153     1020  |      133      134  |    99.25  |      312   160336  |    42.63     0.08  | gnl|uv|NGB00145.1:2378-2689        ctg7180001808271
      1      121  |     1367     1487  |      121      121  |   100.00  |      170   160336  |    71.18     0.08  | gnl|uv|U09128.1:15891-16011-49     ctg7180001808271
      1      103  |     1286     1388  |      103      103  |   100.00  |      103   160336  |   100.00     0.06  | gnl|uv|U80929.2:11415-11517        ctg7180001808271        [CONTAINED]
      4      121  |    68269    68386  |      118      118  |   100.00  |      170   111913  |    69.41     0.11  | gnl|uv|U09128.1:15891-16011-49     ctg7180002052060
     40      152  |    30255    30142  |      113      114  |    99.12  |     1663    42854  |     6.79     0.27  | gnl|uv|U09128.1:1-1663             ctg7180002053344
      1      121  |    34358    34238  |      121      121  |   100.00  |      170    35471  |    71.18     0.34  | gnl|uv|U09128.1:15891-16011-49     ctg7180002046164
      1      103  |    34439    34337  |      103      103  |   100.00  |      103    35471  |   100.00     0.29  | gnl|uv|U80929.2:11415-11517        ctg7180002046164        [CONTAINED]
     46     1385  |     8928    10267  |     1340     1340  |   100.00  |     1413    17587  |    94.83     7.62  | gnl|uv|X65279.1:5941-7353          ctg7180002043597        [CONTAINED]
- ctg7180001872124 : 351373 bp; region 160411..161712 contaminated by Ecoli
cat 9-terminator/bt.posmap.utgctg | grep 7180001872124 | wc -l # 329
 cat 9-terminator/bt.posmap.utgctg | grep 7180001872124 | perl -ane '@F[2,3]=@F[3,2] if($F[2]>$F[3]); print $_ if($F[2]<160411 and 160411<$F[3] or $F[2]<161712 and 161712<$F[3]);'
 7180000441625 7180001872124 159483 161201 r
 7180000441788 7180001872124 160330 161329 f #Ecoli
 7180000442730 7180001872124 160368 161010 r #Ecoli
 7180000441635 7180001872124 160740 162700 f #Ecoli
 
 cat 9-terminator/bt.utg.info
 utg7180000441625 length=1715 num_frags=12 Astat=7.00
 utg7180000441788 length=999 num_frags=1 Astat=0.00
 utg7180000442730 length=640 num_frags=1 Astat=0.00
 utg7180000441635 length=1957 num_frags=9 Astat=7.00
 
 cat 9-terminator/bt.posmap.frgctg | perl -ane '@F[2,3]=@F[3,2] if($F[2]>$F[3]); print $_ if($F[2]<160411 and 160411<$F[3] or $F[2]<161712 and 161712<$F[3]);'
 1237446426 7180001872124 160117 161201 f
 1238816835 7180001872124 160133 160993 f
 1238817728 7180001872124 160322 161123 r
 1244436200 7180001872124 159976 160984 f
 1238817676 7180001872124 160105 160890 r
 1237443253 7180001872124 160106 160900 f
 1237471027 7180001872124 159930 160928 f
 1238822613 7180001872124 159774 160782 f
 1238816875 7180001872124 159878 160728 f
 1244436248 7180001872124 159483 160553 f
 1238818306 7180001872124 159718 160489 f
 1238818332 7180001872124 159722 160483 f
 1237476824 7180001872124 160330 161329 f
 1238817689 7180001872124 160368 161010 r
 1237447135 7180001872124 160740 161768 r
 1237483546 7180001872124 160814 161790 r
 1237483530 7180001872124 160818 161856 r
 1237471108 7180001872124 161003 162009 f
 1238817744 7180001872124 161151 161978 f
 1237446441 7180001872124 161050 162107 f
 1244436201 7180001872124 161117 162164 f
 1237446407 7180001872124 161586 162699 r
 1237471055 7180001872124 161652 162700 r
 # 23 BCM SHOTGUN RP42 VVHNP reads (1369 read lib; 1341 of the reads in this ctg)
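The Perl filter used on the posmap files above keeps a placement when it strictly spans either edge of the contaminated region (160411 or 161712 for ctg7180001872124), after swapping reversed coordinates. A small Python restatement of that test, with hypothetical function names; placements lying entirely inside the region span neither edge, so they are not reported by this filter.

```python
def spans(begin, end, position):
    """True if position lies strictly between begin and end (in either order)."""
    lo, hi = min(begin, end), max(begin, end)
    return lo < position < hi


def spans_region_edge(begin, end, region_begin, region_end):
    """Same test as the Perl one-liner: the placement spans either region edge."""
    return spans(begin, end, region_begin) or spans(begin, end, region_end)


# Unitig placements copied from the posmap.utgctg output above, plus one
# made-up placement (last row) that sits entirely inside the region.
placements = [
    ("7180000441625", 159483, 161201),
    ("7180000441635", 160740, 162700),
    ("inside_only", 160500, 161000),
]
for name, b, e in placements:
    print(name, spans_region_edge(b, e, 160411, 161712))
```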
- ctg7180002047604 : Vector in the middle
   [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS]
===============================================================================================================================
      1      121  |   215495   215615  |      121      121  |    99.17  |      170   271477  |    71.18     0.04  | gnl|uv|U09128.1:15891-16011-49     ctg7180002047604       # pSacBII P1 cloning vector 
 cat 9-terminator/bt.posmap.utgctg | grep 7180002047604 | perl -ane '@F[2,3]=@F[3,2] if($F[2]>$F[3]); print $_ if($F[2]<215495 and 215495<$F[3] or $F[2]<215615 and 215615<$F[3]);'
 7180000441711 7180001872124 214458 219678 r
 
 cat 9-terminator/bt.posmap.frgctg | grep 7180002047604 | perl -ane '@F[2,3]=@F[3,2] if($F[2]>$F[3]); print $_ if($F[2]<215495 and 215495<$F[3] or $F[2]<215615 and 215615<$F[3]);'
 498776751 7180001872124 215425 216426 r
 1236502885 7180001872124 215514 216377 r
 379408823 7180001872124 215572 216340 f
 1244436224 7180001872124 215388 216405 f
 1237471071 7180001872124 215229 216234 r
 1233297450 7180001872124 215234 216046 f
 1233363357 7180001872124 215267 215687 f
 937200686 7180001872124 215300 216129 r
 937254901 7180001872124 215321 216160 f
 1233294025 7180001872124 215383 216204 r
 1237446444 7180001872124 215146 216187 f
 1232033776 7180001872124 215193 215996 r
 671976381 7180001872124 215035 216021 r
 514932286 7180001872124 215043 216008 f
 500723879 7180001872124 215043 215802 f
 671927656 7180001872124 215116 215733 r
 381173692 7180001872124 214947 215877 r
 1233303570 7180001872124 214963 215803 f
 1232037705 7180001872124 214990 215803 f
 490852264 7180001872124 214923 215843 f
 1237447184 7180001872124 214684 215646 f
 668822243 7180001872124 214586 215572 f
 # 22 reads; ~half come from BCM SHOTGUN RP42 VVFOP
maxmatch ctg
Parameters:
 nucmer -maxmatch -l 40 -c 100 -b 10 -g 5 -d 0.05 ...
AllVec: UniVec_Core + 100 more vector seqs
Length of UMD2.4 contigs that contain contaminant (0+ bp from end):
                   elem   <2000  >2000  min   max     mean   med   n50     sum
 Ecoli.all         2951*  2602   349    1001  453627  8252   1367  132226  24352779
 UniVec_Core       5387*  4802   585    882   651163  9979   1334  136556  53760575
 OtherVec          5657   5062   595    882   651163  9726   1320  136556  55021803
 UMD2.cont.other   3976   3430   546    804   651163  11217  1346  130385  44601117  #18 aligned to Acinetobacter; longest is 56467bp
Length of UMD2.4 contigs that contain contaminant (500+ bp from end):
                   elem  <2000  >2000  min   max     mean   med   n50     sum
 Ecoli.all         182   156    26*    1286  351373  6525   1811  125069  1187706   # 7* are >5K; 321* come from multi-ctg scaffolds
 UniVec_Core       2532  2220   312*   1065  651163  10593  1481  128344  26821960  # 267* are >5K ; 655* come from multi-ctg scaffolds
 OtherVec          376   323    53     1184  361749  13278  1508  139997  4992774
 UMD2.cont.other   ...
Length of UMD2.4 contigs that contain contaminant (1000+ bp from end):
                  elem       <2000      >2000      min        max        mean       med        n50        sum
 Ecoli.all        8          0          8*         4709       351373     73065      31157      351373     584520
 UniVec_Core      11         0          11*        2600       334933     93674      37847      271477     1030414
 OtherVec         5          0          5*         3717       271477     131604     111913     228060     658021
 UMD2.cont.other  54         0          54*        2398       522682     110947     88182      189352     5991164
 total            67*                                                                                                 # 18 of them are CONTAINED by UMD2.0 chromosomes
Length of the UMD2.4 contaminant sequence (0+ bp from end):
                  elem       <200       >200       min        max        mean       med        n50        sum
 Ecoli.all        4775       1610       3165       39         17072      381        236        502        1823278
 UniVec_Core      16985      12380      4605       39         1800       207        132        300        3519080
 OtherVec         7563       1372       6191       39         1800       509        548        643        3849567
 UMD2.cont.other  6626       343        6283       39         8228       543        573        615        3602329
maxmatch deg
All aligned degenerates are <2K
Length of UMD2.4 deg that contain contaminant (0+ bp from end):
                  elem       <2000      >2000      min        max        mean       med        n50        sum
 Ecoli.all        1266       1266       0          104        1611       783        833        869        991447
 UniVec_Core      1908       1908       0          147        1510       872        896        910        1664746
 OtherVec         1963       1963       0          147        1510       872        898        911        1712703
 UMD2.cont.other  1609       1609       0          132        1611       852        892        914        1372106
maxmatch utg
Unitig stats:
 elem       <2000      >2000      min        max        mean       med        n50        sum
 1707816    1434164    273652     21         138676     2228       937        8002       3805166508
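The summary tables on this page share the columns elem, <2000, >2000, min, max, mean, med, n50 and sum. A minimal Python sketch of how one such row could be computed from a list of sequence lengths is given here. The function name and the N50 convention (the length at which the longest sequences first hold at least half of the total bases) are assumptions, as is counting lengths equal to the cutoff in the upper bin; the script that produced the tables is not shown.

```python
from statistics import median


def length_summary(lengths, cutoff=2000):
    """Summarize a non-empty list of sequence lengths as one table row."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    n50 = 0
    for length in ordered:
        running += length
        if 2 * running >= total:  # longest sequences now hold half of the bases
            n50 = length
            break
    return {
        "elem": len(ordered),
        "below": sum(1 for x in ordered if x < cutoff),
        "above": sum(1 for x in ordered if x >= cutoff),
        "min": ordered[-1],
        "max": ordered[0],
        "mean": total // len(ordered),  # the tables report integer means
        "med": median(ordered),
        "n50": n50,
        "sum": total,
    }


row = length_summary([2, 2, 2, 3, 3, 4, 8, 8])
```

For the unitig row above, the reported mean agrees with the integer division 3805166508 // 1707816 = 2228.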
Parameters:
nucmer -maxmatch -l 40 -c 100 -b 10 -g 5 -d 0.05 ...
Files:
/scratch1/bos_taurus/Assembly/2009_0217_CA/nucmer_utg/
Length of UMD2.4 unitigs that align to contaminants
                          elem       <2000      >2000      min        max        mean       med        n50        sum
 Ecoli.all                4275       4110       165        104        71709      1442       1212       1398       6166566
 UniVec_Core              7563       7409       154        139        71709      1397       1182       1331       10570512
 OtherVec                 8208       8054       154        139        71709      1370       1159       1308       11248775
 UMD2.cont.other          6094       5849       245        132        53113      1546       1163       1401       9422951    # 80 aligned to Acinetobacter; longest is 9114bp
 Contaminants(all above)  10264      9895       369        104        71709      1471       1148       1359       15107544
 Acinetobacter            2306**     0          2306       154        71709      1451       1316       1412       3347230    # 2182 already in the Cont set
Length of UMD2.4 unitigs that have contaminants 500+bp from ends
                  elem       <2000      >2000      min        max        mean       med        n50        sum
 Ecoli.all        172        156        16         1286       4852       1820       1805       1815       313185
 UniVec_Core      2491       2422       69         1065       71709      1722       1457       1523       4291584
 OtherVec         364        358        6          1167       71709      1795       1478       1538       653595
 UMD2.cont.other  156        108        48         1213       50248      5344       1838       17518      833673
Length of the UMD2.4 alignments of unitigs to contaminants (unique unitig regions)
                          elem       <200       >200       min        max        mean       med        n50        sum        reads(all unitig reads for unitigs with alignments>1K)
 Ecoli.all                5975       1686       4289       40         8184       397        268        542        2374366    12112(12142)
 UniVec_Core              8754       1674       7080       40         1801       474        490        645        4153030    26590(26849)
 OtherVec                 8919       1250       7669       40         1801       511        536        629        4562326    30268(30268)
 UMD2.cont.other          6752       896        5856       40         6012       529        555        651        3573528    25006(25328)
 Contaminants(all above)  10992      1396       9596       40         8184       571        573        684        6280759    40351(40699)
 Acinetobacter                                                                                                               (8286)
40699 reads aligned back to contaminants: nucmer -maxmatch
- 35919 align
- 34400 align 100+bp
- 27742 align 200+bp
- 14211 align 500+bp
utg 5'& 3'
Unitig stats:
            elem       <200       >200       min        max        mean       med        n50        sum
 utg        1,707,816  81200      1626616    21         138676     2228       937        8002       3805166508
 utg5'&3'   3,334,432  0          3334432    21         199        100        100        100        335263271
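The counts in the table are consistent with taking 100 bp from each end of every unitig of 200 bp or more and keeping shorter unitigs whole: 81200 + 2 x 1626616 = 3,334,432 pieces, with a maximum piece length of 199. The extraction script itself is not shown, so the Python sketch below is only one plausible reading; end_pieces and end_len are hypothetical names.

```python
def end_pieces(seq, end_len=100):
    """Return the 5' and 3' end pieces of a sequence, or the whole sequence if it is too short."""
    if len(seq) < 2 * end_len:
        return [seq]  # short unitig: a single piece of at most 2*end_len - 1 bases
    return [seq[:end_len], seq[-end_len:]]


short_pieces = end_pieces("ACGT" * 10)   # 40 bp, kept whole
long_pieces = end_pieces("ACGT" * 100)   # 400 bp, two 100 bp ends
expected_pieces = 81200 + 2 * 1626616    # counts taken from the table above
```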
Align utg5'&3' to Ecoli.all using:
- nucmer -l 40 -c 100 -b 10 -g 5 -d 0.05 : 4,275 hits
- nucmer -l 20 -c 40 : 6,617 hits
- nucmer -l 20 -c 20 : 23,350
- blastall : 2,895,506 out of 3,334,432 (86%) aligned
Acinetobacter contamination
Database:
~dpuiu/db/Acinetobacter.all : 7 complete genomes, 19 seqs
Seq len summary:
 elem       min        max        mean       med        n50        sum
 19         2726       4050513    1418094    28279      3904116    26943793
Align all unitigs to Acinetobacter.all; the longest alignment is 8517bp
 show-coords Acinetobacter.all-utg.filter-q.delta | sort -nk8 -r | head
    [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [GenBank]                      [UMD2.4 utg]
 ===============================================================================================================================
   20644    29164  |       62     8578  |     8521     8517  |    98.80  |    94413     8578  |     9.03    99.29  | gi|169786889|ref|NC_010404.1|  utg7180000281954*  [CONTAINS]
 3395586  3401299  |     5712        1  |     5714     5712  |    99.79  |  3976747     8015  |     0.14    71.27  | gi|126640115|ref|NC_009085.1|  utg7180000212251
 3400344  3404485  |        1     4142  |     4142     4142  |    99.66  |  3976747     9114  |     0.10    45.45  | gi|126640115|ref|NC_009085.1|  utg7180000277331
 ...
utg7180000281954* -> ctg7180002053982 (28140bp; 78 unitigs)
grep 7180002053982 ../9-terminator/bt.posmap.utgctg | nl
    1  7180000185222   7180002053982   0       3019    f
    2  7180000314302   7180002053982   2151    5706    r
    3  7180001463328   7180002053982   2256    2869    f
    ...
   75  7180000281954*  7180002053982   17862   26442   r
   76  7180001471348   7180002053982   17886   18723   r
   77  7180001468075   7180002053982   17919   18732   f
   78  7180000280508   7180002053982   25672   28140   r
 show-coords UMD2.contaminant.other-ctg.filter-q.delta  | grep 7180002053982
   [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [UMD2.0 contam]   [UMD2.4 ctg]
===============================================================================================================================
   1394     2508  |    28140    27024  |     1115     1117  |    98.75  |     7098    28140  |    15.71     3.97  | 7180003313366      ctg7180002053982
   2561     2871  |    26971    26661  |      311      311  |    97.11  |     7098    28140  |     4.38     1.11  | 7180003313366      ctg7180002053982
   2934     5670  |    26599    23862  |     2737     2738  |    97.99  |     7098    28140  |    38.56     9.73  | 7180003313366      ctg7180002053982
 ...gap...  
   5930     7098  |    17270    16101  |     1169     1170  |    98.46  |     7098    28140  |    16.47     4.16  | 7180003313366      ctg7180002053982
 ...gap...  
    281     1981  |    10335     8635  |     1701     1701  |    99.41  |    13090    28140  |    12.99     6.04  | 7180003320028      ctg7180002053982
   1992     2672  |     8635     7954  |      681      682  |    99.71  |    13090    28140  |     5.20     2.42  | 7180003320028      ctg7180002053982
   3302     5376  |     6719     4643  |     2075     2077  |    99.33  |    13090    28140  |    15.85     7.38  | 7180003320028      ctg7180002053982
   8469     9021  |     4642     4090  |      553      553  |    99.10  |    13090    28140  |     4.22     1.97  | 7180003320028      ctg7180002053982
   9038     9313  |     4073     3798  |      276      276  |    98.55  |    13090    28140  |     2.11     0.98  | 7180003320028      ctg7180002053982
   9780    13090  |     3331       19  |     3311     3313  |    99.19  |    13090    28140  |    25.29    11.77  | 7180003320028      ctg7180002053982
 grep 7180002053982 ../9-terminator/bt.posmap.utgctg | awk '{print $1,$4-$3+1}' | sed 's/^/utg/' >! ctg7180002053982.utgs
 intersect.pl UMD2.contaminant.other-utg.qry_hits ctg7180002053982.utgs | wc -l
 37 # only 37 out  of 78 unitigs were detected
 ctg7180002053982 is Acinetobacter
UMD2.5 (2009_0312_CA; delete 40699 contam reads & 22607 mates)
40699 reads:
- 25803 mated + 14896 unmated
- 6392 mated reads had the mate also contaminated
Location:
/scratch1/bos_taurus/Assembly/2009_0312_CA
UNITIGGER
UNITIG OVERLAP GRAPH INFORMATION
       5322910 : Total number of unitigs
       2595715 : Total number of singleton, contained unitigs
       1869655 : Total number of singleton, non-contained unitigs
        182193 : Total number of non-singleton, spanned unitigs
        675347 : Total number of non-singleton, non-spanned unitigs
      35507162 : Total number of fragments
      35507162 : Total number of fragments in all unitigs
      21641007 : Total number of essential fragments in all unitigs
      13866155 : Total number of contained fragments in all unitigs
  0.0077909501 : Randomly sampled fragment arrival rate per bp
    2511009753 : The sum of overhangs in all the unitigs
    6442095933 : Total number of bases in all unitigs
             0 : Estimated number of base pairs in the genome.
             0 : Total number of contained fragments not connected
                 by containment edges to essential fragments.
Total rho    = 2511009753
Total nfrags = 19563152
Estimated genome length = 0
Estimated global_fragment_arrival_rate=0.007791
Computed global_fragment_arrival_rate =0.007791
Total number of randomly sampled fragments in genome = 23866135
Computed genome length  = 3063315200.000000
Used global_fragment_arrival_rate=0.007791
Used global_fragment_arrival_distance=128.354050
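The figures in the unitigger log above are mutually consistent under three simple relations: arrival rate = nfrags / rho, arrival distance = 1 / rate, and genome length = randomly sampled fragments / rate. The Python sketch below only re-derives the logged numbers from those relations; it is a check of the arithmetic, not a description of how the unitigger computes them internally.

```python
# Values copied from the log above.
total_rho = 2511009753
total_nfrags = 19563152
sampled_fragments = 23866135

arrival_rate = total_nfrags / total_rho           # fragments per bp
arrival_distance = 1 / arrival_rate               # bp between fragment starts
genome_length = sampled_fragments / arrival_rate  # bp

print(f"rate={arrival_rate:.6f} distance={arrival_distance:.3f} genome={genome_length:.0f}")
```

The logged genome length (3063315200) differs from this value only in its last few digits, which looks like floating-point rounding in the log.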
Histogram of the number of base pairs in a chunk
100292 - 138301:     22    # 19 in UMD2.4 
 90020 -  99906:     28    # 23
 80043 -  89676:     90    # 79
 70013 -  79966:    190    # 164
 60010 -  69988:    423
 50008 -  59983:   1016
 40000 -  49998:   2558
 30000 -  39997:   6660
 20000 -  29999:  18927
 10000 -  19999:  57057
CONSENSUS after CGW
- failed on job 80 : ctg 5706539, len=180,024, 159 unitigs, 1,851 reads
 head 80/bt.cns_contigs.80.failed
 {ICM
 acc:5706539
 pla:P
 len:180024
 cns:
 .
 qlt:
 .
 for:0
 npc:1851
 more  ../9-terminator/bt.asm
 ...
 {CCO
 acc:(7180002022380*,5706539)
 pla:P
 len:180024
 cns:
 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
 ...
 cat 80/bt.cns_contigs.80.failed | countMessages.pl
 ICM 1
 IUP 159     # unitigs
 IMP 1851    # reads: 1264 are BCM.WGS, 389 are BCM.SHOTGUN ...
 # 254 contig scaffold
 cat ../9-terminator/bt.posmap.ctgscf | grep 7180002041301 | nl
   1  7180002022042   7180002041301  0         14121     f
 ...
 183  7180002022380*  7180002041301  9164814   9344838   f
 ..
 254  7180002022240   7180002041301  12874874  12893460  f
 # each of the 5 UMD2.4 contigs below has as many reads as matched, i.e. all of its reads are in the failed contig => CONTAINED
 cat UMD2.4-7180002022380.posmap.frgctg | grep -v ^$ | awk '{print $2}' | uniq -c
   #reads ctgid
     6 7180001712307
   209 7180002028662
     6 7180002028663
  1552 7180002028664
    21 7180002032323
  1794 total => 1851-1794=57 additional reads
 cat ../../2009_0217_CA/9-terminator/bt.posmap.ctgscf | nl | egrep '7180001712307|7180002028662|7180002028663|7180002028664|7180002032323'
 #nl    ctgid          scfid          start    end      dir
 34791  7180002028662  7180002069912  1425623  1446022  f
 34793  7180002032323  7180002069912  1448133  1450475  f
 34794  7180002028663  7180002069912  1450495  1451918  f
 34795  7180002028664  7180002069912  1452234  1602973  f
 64110  7180001712307  7180002071598  0        1441     f
- Solution 1:
 * consensus -Dforceunitigabut => new assembly, new UIDs
   ctg7180002022636  179505    38.53  # => ctg7180002022380
   scf7180002041557  12892941  40.58  # => scf7180002041301
- Solution 2:
 * Reassemble 1851 reads ; clr=ECR2; doOBT=no
 * Asm dir: 
    /scratch1/bos_taurus/Assembly/2009_0312_CA/8-consensus/80.ECR2.asm
 * It contains one 179,530 bp scaffold that has two contigs. 
 * One contig is 156,349 bp and the other one is 23,181 bp. 
 * The estimated gap between them is 231 bp.
 show-coords ctg7180002022636-80.ECR2.filter-r.delta 
      1   156326  |   1   156331  | 156326 156331  | 99.99  | 179505  156349 |    87.09    99.99  | ctg7180002022636   ctg7180000000103        [CONTAINS]
 156345   179505  |  21    23181  |  23161  23161  | 99.99  | 179505  23181  |    12.90    99.91  | ctg7180002022636   ctg7180000000104        [CONTAINS]
>ctg7180002022636_156327_156344 TTGTAAAAACCATCCCCT
 # ~20 bp unaligned on ctg7180002022636 & Chr1
 show-coords ctg7180002022636-Chr1.filter-r.delta | more
 ...
 151839   156326  |  61124826  61120339* |   4488   4488  |  99.91  |  179505  157590899  |  2.50  0.00  | ctg7180002022636  Chr1
 156347   157501  |  61111219* 61110064  |   1155   1156  |  99.91  |  179505  157590899  |  0.64  0.00  | ctg7180002022636  Chr1
 ...
 # 2 UMD2.0 ctg & 2 UMD2.0 deg in this region
 more Chr1.agp
 ...
 Chr1  61110064  61111482  3579  W  deg0003139347  1  1419   +
 Chr1  61111483  61114130  3580  N  2648  fragment  yes
 Chr1  61114131  61115490  3581  W  deg0002967451  1  1360   +
 Chr1  61115491  61118114  3582  N  2624  fragment  yes
 Chr1  61118115  61120117  3583  W  7180002846553  1  2003   +
 Chr1  61120118  61120217  3584  U  100   fragment  yes
 Chr1  61120218  61145567  3585  W  7180003318962  1  25350  +
 ...
QC
Lengths:
               elem       <2000      >=2000     min        max        mean       med        n50        sum
 scf           39978*     31311      8667       316        34167202   68129      1360       8217662    2723691675
 ctg           90135*     36140      53995      65         1160130    29693      5124       95988      2676390147
 deg           251413     249285     2128       65         39964      1003       984        994        252279234
 utg           1689033    1419729    269304     21         138676     2242       936        8213       3788090224
 
               elem       <0         0          >0         min        max        mean       med        n50        sum
 gaps(ca2scf)  50157      10759      3296       36102      -20        177144     929        20         34357      46620040
 gaps(posmap)  50157      0          0          50157      20         177144     943        20         34065      47301528
Fragment happiness:
 placed  good           27263138
 chaff   bothChaff      2467462
 placed  notMated       2050084
 placed  oneChaff       732517
 chaff   oneChaff       732517
 placed  oneSurrogate   555510
 placed  bothDegen      465114
 chaff   notMated       434052
 placed  diffScaffold   369,294 *
 placed  oneDegen       213,768 *
 placed  badSame        96862
 placed  badLong        76640
 placed  badOuttie      41142
 placed  badShort       5784
 placed  bothSurrogate  3278
Mate happiness:
 good           13631569
 bothChaff      1233731
 oneChaff       732517
 oneSurrogate   277755
 bothDegen      232557
 diffScaffold   184647
 oneDegen       106884
 badSame        48431
 badLong        38320
 badOuttie      20571
 badShort       2892
 bothSurrogate  1639
Scaffold zero read/mate cvg regions:
        elem       <2000      >2000      min        max        mean       med        n50        sum
 read   57011      55048      1963       1          177144     913        57         29302      52084484
 mate   10507      8945       1562       1          30014      996        493        2367       10466518
Scaffold 10K+ zero read/mate cvg regions (2K+ inside) (some might be a result of surrogates?):
                        elem       <2000      >2000      min        max        mean       med        n50        sum
 read                   51747      49878      1869       1          177144     958        38         32625      49613432
 mate                   1290       201        1089       15         30014      3560       3047       4017       4593586
 mate(ignore seq len)   2599       1051       1548       1          72706      3321       2541       5513       8633691
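The zero-coverage regions summarized above are the stretches of a scaffold not covered by any read (or by any mate-pair span). A minimal Python sketch of that interval computation is given here, assuming 0-based half-open coordinates and placements given as (begin, end) pairs in either orientation; the function name is hypothetical and the script actually used is not shown.

```python
def zero_coverage_regions(intervals, scaffold_len):
    """Return the (begin, end) stretches of a scaffold covered by no interval."""
    gaps = []
    covered_to = 0
    # Normalize orientation, then sweep the intervals from left to right.
    for begin, end in sorted((min(b, e), max(b, e)) for b, e in intervals):
        if begin > covered_to:
            gaps.append((covered_to, begin))
        covered_to = max(covered_to, end)
    if covered_to < scaffold_len:
        gaps.append((covered_to, scaffold_len))
    return gaps


# Made-up placements on a 400 bp scaffold; the third one is reverse oriented.
gaps = zero_coverage_regions([(0, 100), (50, 150), (300, 200)], 400)
```

For read coverage the intervals would be the read placements from a frgscf posmap; for mate coverage they would be the spans between the outer ends of each mate pair.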
Contaminant search
ctg
       elem       <0         >0         min        max        mean       med        n50        sum
 ctg   90135      0          90135      65         1160130    29693      5124       95988      2676390147
nucmer -maxmatch -l 40 -c 100 -b 10 -g 5 -d 0.05 ...
 #OLD table
                  elem       <2000      >2000      min        max        mean       med        n50        sum
 Ecoli.all        71         66         5          1006       129770     4180       1127       45234      296830
 UniVec_Core      120        111        9          1000       426540     7366       1130       100944     884023
 OtherVec         121        112        9          1000       426540     7314       1127       100944     885022
 UMD2.cont.other  98(-3)     83         15(-3)     1000       426540     13523      1190       199700     1325332  # 3 are 1000bp+ from ctg ends; these are actually "fake" contaminants
                                                                                                                   # 7 are Acinetobacter baumannii min=1059 , max=9765 
 total            152*(-3)   133        19*(-3)    1000       426540     10649      1150       199700     1618735
 
 Acinetobacter    65         53         12         1013       44359      2586       1212       3847       168130   # 46 out of 65 are in the 152* set; 19 are new; 13 have lots of alignments to other contigs (probably fake contaminants)
 total(new)       171*(-3)   144        27*(-3)    1000       426540     10013      1189       129770     1712376  # 65 are Acinetobacter and should be removed
 cat UMD2.contaminant.other-ctg.filter-q.coords | grep Acinetobacter
                                                                                                                    UMD2.0                     UMD2.5
      1      285  |      285        1  |      285      285  |    99.65  |      287     8096  |    99.30     3.52  | 7180003292866_1_288        ctg7180002015457        [CONTAINED]     Acinetobacter baumannii
   1422     2500  |     1078        1  |     1079     1078  |    99.63  |     7098     1078  |    15.20   100.00  | 7180003313366              ctg7180001706852        [CONTAINS]      Acinetobacter baumannii
   2934     3940  |     1008        1  |     1007     1008  |    98.61  |     7098     1059  |    14.19    95.18  | 7180003313366              ctg7180001709709                        Acinetobacter baumannii
   6281     7098  |        1      818  |      818      818  |    99.76  |     7098     1553  |    11.52    52.67  | 7180003313366              ctg7180001716052        [END]           Acinetobacter baumannii
      1      790  |      790        1  |      790      790  |   100.00  |     1822     9765  |    43.36     8.09  | 7180003319195_8956_10778   ctg7180002015485        [BEGIN]         Acinetobacter calcoaceticus
    285     1981  |        1     1697  |     1697     1697  |    99.59  |    13090     1856  |    12.96    91.43  | 7180003320028              ctg7180001706656*                       Acinetobacter baumannii
   1992     2148  |     1697     1856  |      157      160  |    98.12  |    13090     1856  |     1.20     8.62  | 7180003320028              ctg7180001706656*                       Acinetobacter baumannii
  12210    13090  |       63      943  |      881      881  |    99.89  |    13090     2556  |     6.73    34.47  | 7180003320028              ctg7180002007423                        Acinetobacter baumannii
 # 7 Acinetobacter baumannii ctgs
 # no Serratia "best hits"
 # 3 mitochondrion ctgs, all < 2Kbp
Delete summary: 65 Acinetobacter ctgs + 91 contaminant ctgs <2000bp => 156 ctgs , 152 scf => 4105 reads
 ctgs       <2000      >2000      min        max        mean       med        n50        sum        reads
 156        144        12         1000       44359      1782       1150       1483       278009     4105       # all ctg
Trim summary: 12 contigs >=2000bp & 44 reads that overlap at least 10bp
 ctgs       <2000      >2000      min        max        mean       med        n50        sum        reads
 12         0          12         7072       426540     78470      45234      129770     941646     ?          # all ctg
 12         12         0          172        935        532        618        750        6393       44         # ctg regions
Files
 /scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_ctg/TO_DELETE/ctg.delete.uid
 /scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_ctg/TO_DELETE/scf.delete.uid
 /scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_ctg/TO_TRIM/ctg.trim.uid
ctg 5'&3'
         elem       <0         >0         min        max        mean       med        n50        sum
 ctg53   180044     0          180044     65         598        300        300        300        54033229
nucmer -maxmatch -l 17 -c 35 ...
                  #ctgEnds   #ctgs      min        max        mean       med        n50        sum
 Ecoli.all        180        149        300        300        300        300        300        54000
 UniVec_Core      312        277        300        300        300        300        300        93600
 OtherVec         1211       1167       300        553        300        300        300        363989
 UMD2.cont.other  15689      14693      257        598        300        300        300        4712162
deg
nucmer -maxmatch -l 40 -c 100 -b 10 -g 5 -d 0.05 ...
                  elem       <2000      >2000      min        max        mean       med        n50        sum
 Ecoli.all        387        387        0          131        1099       756        806        835        292892
 UniVec_Core      569        569        0          101        1115       763        822        843        434400
 OtherVec         579        579        0          101        1115       752        819        840        435549
 UMD2.cont.other  539        539        0          131        1483       792        838        873        427408
                 
 total            810*       810        0*         101        1483       784        838        869        63547
Scaffolds vs UMD2.0 chromosome alignments
Directory:
/scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_scf
Depending on the ref/qry sequences and the nucmer parameters, the amount of unaligned sequence (gaps) in UMD2.0 can vary between:
 101M: REF=Chr,       QRY=scf,     nucmer -l 100 -c 500
 6M:   REF=ChrPlaced, QRY=scf-deg, nucmer -maxmatch -l 50 -c 250
nucmer -l 100 -c 500
Chr-scf.summary
                          elem       <2000      >2000      min        max        mean       med        n50        sum
 Chr-scf.qry_hits         32901      24546      8355       723        34167202   82494      1405       8217662    2714164100
 Chr-scf.qry_nohits       7077       6765       312        316        12006      1346       1239       1291       9527056
 Chr-scf.10K.qry_hits2+   574        0          574        10308      34167202   4006753    1887309    9586144    2299876795
 Chr-scf.0cvg             144712     125933     18779      1          102265     900        178        2968       130248709
 Chr-scf.0cvg.clean       148556     143283     5273       1          39625      683        280        1363       101526883(101M)
Chr-scf-deg.summary
                              elem       <2000      >2000      min        max        mean       med        n50        sum
 Chr-scf-deg.qry_hits         210225     199781     10444      501        34167202   13785      1007       7328685    2898141592
 Chr-scf-deg.qry_nohits       81166      80815      351        65         12006      958        972        989        77828798
 Chr-scf-deg.10K.qry_hits2+   574        0          574        10308      34167202   4006753    1887309    9586144    2299876795
 Chr-scf-deg.0cvg             175952     168553     7399       1          22067      445        120        1329       78433265
 Chr-scf-deg.0cvg.clean       133809     132381     1428       1          20512      371        124        1101       49711440(49M)
ChrPlaced-scf.summary
                                elem       <2000      >2000      min        max        mean       med        n50        sum
 ChrPlaced-scf.qry_hits         19488      13112      6376       723        34167202   137773     1527       8428844    2684927057
 ChrPlaced-scf.qry_nohits       20490      18199      2291       316        192648     1891       1276       1569       38764099
 ChrPlaced-scf.10K.qry_hits2+   139        0          139        10486      31959312   6786671    4979278    12956086   943347316
 ChrPlaced-scf.0cvg             76271      71816      4455       1          102265     568        179        1710       43339413
 ChrPlaced-scf.0cvg.clean       72865      70541      2324       1          39625      356        94         1413       25951987(25M)
ChrPlaced-scf-deg.summary
                                          elem       <2000      >2000      min        max        mean       med        n50        sum
 ChrPlaced-scf-deg.qry_hits               130125     122000     8125       501        34167202   21530      1009       7515049    2801670853
 ChrPlaced-scf-deg.qry_nohits             161266     158596     2670       65         192648     1080       987        1012       174299537
 ChrPlaced-scf-deg.10K.qry_hits2+         139        0          139        10486      31959312   6786671    4979278    12956086   943347316
 ChrPlaced-scf-deg.0cvg                   79041      76374      2667       1          22067      395        157        948        31251753
 ChrPlaced-scf-deg.0cvg.clean             69012      68271      741        1          20512      200        81         592        13864328(13M)
nucmer -maxmatch -l 100 -c 500
Dir:
/scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_scf.2
ChrPlaced-scf-deg.summary
                                          elem       <2000      >2000      min        max        mean       med        n50        sum
 ChrPlaced-scf-deg.qry_hits               130510     122377     8133       501        34167202   21470      1009       7515049    2802100771
 ChrPlaced-scf-deg.qry_nohits             160881     158219     2662       65         192648     1080       986        1012       173869619
 ChrPlaced-scf-deg.10K.qry_hits2+         120        0          120        20022      31959312   7587296    5639522    13010806   910475551
 ChrPlaced-scf-deg.0cvg                   82159      80425      1734       1          7002       321        145        647        26444796
 ChrPlaced-scf-deg.0cvg.clean             111645     111546     99         1          6248       81         13         272        9057424(9M)
nucmer -maxmatch -l 50 -c 250
Dir:
/scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_scf.3
ChrPlaced-scf-deg.summary
                                          elem       <2000      >2000      min        max        mean       med        n50        sum
 ChrPlaced-scf-deg.qry_hits               204673     195653     9020       251        34167202   14088      1005       7329288    2883625712
 ChrPlaced-scf-deg.qry_nohits             86718      84943      1775       65         192648     1064       970        1007       92344678
 ChrPlaced-scf-deg.10K.qry_hits2+         148        0          148        10486      31959312   6814292    5135095    12792673   1008515300
 ChrPlaced-scf-deg.0cvg                   86085      84614      1471       1          4565       279        123        557        24101912
 ChrPlaced-scf-deg.0cvg.clean             113796     113791     5          1          2822       59         7          176        6714556(6M)
nucmer -maxmatch -l 50 -c 250 ; delta-filter -q
Dir:
/scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_scf.3
 ChrPlaced-scf-deg.filter-q.summary
                                          elem       <2000      >2000      min        max        mean       med        n50        sum
 ChrPlaced-scf-deg.qry_hits               204673     195653     9020       251        34167202   14088      1005       7329288    2883625712
 ChrPlaced-scf-deg.qry_nohits             86718      84943      1775       65         192648     1064       970        1007       92344678
 ChrPlaced-scf-deg.10K.qry_hits2+         118        0          118        20022      31959312   7633834    5639522    13010806   900792422
 ChrPlaced-scf-deg.0cvg                   77864      73686      4178       1          28711      529        181        1523       41240419
 ChrPlaced-scf-deg.0cvg.clean             74172      72150      2022       1          28331*     321        89         1415       23852952(23M)
Max gap is 28331; Duplicate region in UMD2.0?
 ChrPlaced-scf-deg.coords
 70739616 70767946  |  2993389 2965048  |  28331 28342  |  99.17  |  85187327 19514159  |  0.03 0.15  | Chr15 scf7180002041107
 70768054 70808299  |  3005305 2965048  |  40246 40258  |  99.56  |  85187327 19514159  |  0.05 0.21  | Chr15 scf7180002041107
 => ChrPlaced-scf-deg.filter-q.coords
 70768054 70808299  |  3005305 2965048  |  40246 40258  |  99.56  |  85187327 19514159  |  0.05 0.21  | Chr15 scf7180002041107
Markers
ALL:
 head /fs/szasmg3/bos_taurus/UMD_Freeze2.5/markers/markers_contigs_Art.txt
 Marker    Chr_BTA  Pos(K)  Pos_from  Pos_to  UMD_Ctg_Pos  Match_Len  %IDY    %Match  UMD_Ctg_name
 BZ945871  1        47501   1         95001   7622         515        100.00  99.61   ctg7180002007845
 BZ953651  1        80001   47501     112501  10786        700        99.57   100.00  ctg7180002026484
 CC504788  1        118751  80001     157501  54583        862        100.00  100.00  ctg7180002026483
 CC484491  1        123751  90001     157501  50169        77         98.72   100.00  ctg7180002026482
 CZ415082  1        125001  92501     157501  75850        507        99.21   99.80   ctg7180002026483
 CC475154  1        130001  97501     162501  40013        666        99.25   100.00  ctg7180002026482
 CC561114  1        182501  145001    220001  1130         709        99.02   100.00  ctg7180002026482
 CC578374  1        190001  155001    225001  170145       647        100.00  100.00  ctg7180002026481
 BZ911787  1        278751  232501    325001  na           na         na      na      na
 ...
- 126,014 markers & 90,135 ctgs total
- 107,271 markers align to 31,407 ctg & 2640 scf:
- 85% of the markers align to 85% of the ctg sequence
- avg distance between markers is 25Kbp
 
- 188 questionable ctgs & 219 questionable scf (2 out of 4 disagreeing markers)
UNIQ:
 head /fs/szasmg3/bos_taurus/UMD_Freeze2.5/markers/markers_contigs_Art.unique_only.txt | p 'print " ",$_;'
 Marker    Chr_BTA  Pos(Kbp)  CI_Pos_from  CI_Pos_to  UMD_Scaff_Pos  Match_Len  %IDY    %Matched  UMD_Scaff_name
 BZ945871  1        52251     1            104501     na             na         na      na        na
 BZ953651  1        88001     52251        123751     na             na         na      na        na
 CC504788  1        130626    88001        173251     54583          862        100.00  100.00    ctg7180002026483
 CC484491  1        136126    99001        173251     50169          77         98.72   100.00    ctg7180002026482
 CZ415082  1        137501    101751       173251     na             na         na      na        na
 CC475154  1        143001    107251       178751     40013          666        99.25   100.00    ctg7180002026482
 CC561114  1        200751    159501       242001     1130           709        99.02   100.00    ctg7180002026482
 CC578374  1        209001    170501       247501     170145         647        100.00  100.00    ctg7180002026481
 BZ911787  1        306626    255751       357501     na             na         na      na        na
 ...
- 93,508 markers align to 28,752 ctgs & 1,476 scf:
- 109 questionable ctgs & 153 questionable scf (2 out of 4 disagreeing markers)
Scripts:
 ~/bin/marker2pos.pl markers_contigs_Art.unique_only.txt | sed 's/ctg//' | sort -nk1 -nk2 > markers_ctg.pos
 ~/bin/translatePosMap.pl markers_ctg.pos bt.posmap.ctgscf | ~/bin/tab2tab.pl > markers_scf.pos
Ctg vs markers summary:
                                   #ctg       <10000     >10000     min        max        mean       med        n50        sum         file
 ctg (all)                         90135*     51024      39111      65         1160130    29693      5124       95988      2676390147
 no markers                        58728      48324      10404      65         322949*    6573       1597       21989      386064754
 markers from 1+ Chr               31407      2700       28707      442        1160130    72924      52693      111252     2290325393  markers_ctg.Chr.count
 markers from 2+ Chr               2987       25         2962       1002       1160130    132480     104807     179692     395718221   markers_ctg.Chr.count2+
 2+ markers from 2+ Chr            26         0          26*        15228      604155     221354     192182     298848     5755227     markers_ctg.Chr.count2.2+
 2+ adjacent markers from 2+ Chr   15         0          15**       15228      368879     202728     194749     294623     3040932     markers_ctg.Chr.count2+a
Scf vs markers summary:
                                   #scf       <10000     >10000     min        max        mean       med        n50        sum         file
 scf(all)                          39978*     37135      2843       316        34167202   68129      1360       8217662    2723691675
 no markers                        37338      36038      1300       316        754615*    2601       1336       3957       97140879
 markers from 1+ Chr               2640       1097       1543       1000       34167202   994905     16220      8661690    2626550796  markers_scf.Chr.count
 markers from 2+ Chr               552        10         542        1002       34167202   4526814    2714036    9167014    2498801557  markers_scf.Chr.count2+
 2+ markers from 2+ Chr            212        0          212*       15228      34167202   8579232    7358307    10521496   1818797327  markers_scf.Chr.count2.2+
 2+ adjacent markers from 2+ Chr   38         0          38**       15228      25078118   8419544    7176534    13458592   319942681   markers_scf.Chr.count2+a
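One possible reading of the "2+ markers from 2+ Chr" rows is: count the markers per chromosome for each contig or scaffold, and flag the sequence when at least two chromosomes are each supported by at least two markers. The Python sketch below implements that reading only; the function names and thresholds are hypothetical, and the criteria actually used to build the markers_*.Chr.count* files are not shown.

```python
from collections import Counter, defaultdict


def chr_counts(marker_rows):
    """Count markers per chromosome for each sequence id from (seq_id, chrom) pairs."""
    counts = defaultdict(Counter)
    for seq_id, chrom in marker_rows:
        counts[seq_id][chrom] += 1
    return counts


def questionable(counts, min_markers=2, min_chr=2):
    """Return ids supported by min_markers+ markers on each of min_chr+ chromosomes."""
    flagged = []
    for seq_id, per_chr in counts.items():
        supported = [c for c, n in per_chr.items() if n >= min_markers]
        if len(supported) >= min_chr:
            flagged.append(seq_id)
    return sorted(flagged)


# Toy input: one scaffold with 8 markers on Chr15 and 14 on Chr29, plus a
# made-up scaffold "A" with a single stray marker on a second chromosome.
rows = [("7180002041061", "15")] * 8 + [("7180002041061", "29")] * 14
rows += [("A", "1")] * 5 + [("A", "2")]
flagged = questionable(chr_counts(rows))
```

The adjacency criterion of the last table row (2+ adjacent markers) would additionally need the marker positions and is not covered here.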
212* scaffolds
 scf_id         scf_len    #Chr/2+markers  #/Chr/2+adjmarkers  reads
 7180002041381  31959312   15              0                   469503
 7180002041358  25078118   13              2                   291163
 7180002041386  21280754   12              0
 ...
scf7180002041381.1
- no low cvg regions in the middle
- 1281 markers: 1231 on Chr4, 12 on Chr11
 #1- mate cvg regions: at the ends !!!
 #scfid         begin     end       scf_len   cvg_len  cvg
 7180002041381  1         1173      31959312  1173     0
 7180002041381  1174      1454      31959312  281      1
 7180002041381  31959139  31959312  31959312  174      1
scf7180002041358.2
- one low cvg region & real break
- 1111 markers: 869 on Chr14, 193 on Chr26, ...
 #1- mate cvg regions: middle
 #scfid         begin     end       scf_len   cvg_len  cvg
 7180002041358  20970531  20970827  25078118  297      1
 7180002041358  20970828  20970949  25078118  122      0
 7180002041358  20970950  20971112  25078118  163      1
 # markers in the regions
 #scfid         begin     end       markerid  Chr
 7180002041358  20964100  20964829  BZ839784  14
 7180002041358  21002368  21003219  CC527932  26
scf7180002041386.3
- one low cvg region but no markers in that region
- 939 markers: 902 on Chr24, 3 on Chr6 ...
 #1- mate cvg regions: at the ends
 #scfid         begin  end    scf_len   cvg_len  cvg
 7180002041386  26382  27557  21280754  1176     1
 7180002041386  27558  27607  21280754  50       0
 7180002041386  27608  28031  21280754  424      1
...
scf7180002041368.128
- one low cvg region
 #scfid         begin    end      scf_len  cvg_len  cvg
 7180002041368  1356953  1357221  1404851  269      1
- markers:Chr7 & Chr5 at 3'
 #scfid         begin    end      markerid  Chr
 7180002041368  10436    11033    BZ840669  7
 ...
 7180002041368  1278950  1279444  CZ404867  7
 7180002041368  1319489  1320264  CC593534  7
 7180002041368  1344001  1344347  BZ885865  7  *
 7180002041368  1371526  1372308  BZ867572  5  *
 7180002041368  1386933  1387750  CC534893  5
- subassembly: extract all reads & extra mates(265) and reassemble
- Msg Counts
 
 DST 40
 FRG 13762
 LKG 6482
- qc
 
 [Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID]
 0=37,1211999,1358964,32757,4082,7180000000571
 1=2,53844,53824,26922,-20,7180000000566
 2=4,50178,50238,12544,20,7180000000570
 3=1,28553,28553,28553,0,7180000000568
 4=1,17195,17195,17195,0,7180000000569
 total=45,1361769,1508774,30262,3675
show-coords scf7180002041368-7180002041368.update.scf.filter-q.delta
 ...
 1269662  1338725  |  69065      1  |  69064  69065  |  99.97  |  1404851  1359628  |  4.92    5.08  | scf7180002041368  scf7180000000571
 1340619  1341912  |      1   1295  |   1294   1295  |  98.77  |  1404851     1295  |  0.09  100.00  | scf7180002041368  scf7180000000574  [CONTAINS]
 1345570  1347105  |   1537      1  |   1536   1537  |  98.18  |  1404851     1537  |  0.11  100.00  | scf7180002041368  scf7180000000572  [CONTAINS]
 1347683  1356952  |  12617   3352  |   9270   9266  |  99.85  |  1404851    12617  |  0.66   73.44  | scf7180002041368  scf7180000000567
 1356901  1361861  |   1767   6728  |   4961   4962  |  99.96  |  1404851    50318  |  0.35    9.86  | scf7180002041368  scf7180000000570
 ...
scf7180002041061.187
- infoseq
scf7180002041061 509987 39.73
- 22 markers
 #scfid         begin   end     markerid  Chr
 7180002041061  2061    2904    BZ847430  15
 7180002041061  8979    9756    CC481592  15
 7180002041061  10856   11485   BZ839377  15
 7180002041061  47578   48186   BZ836581  15
 7180002041061  80485   81237   CC553472  15
 7180002041061  117811  118488  CC918533  15
 7180002041061  151253  152009  CC477055  15
 7180002041061  213959  214640  CC572066  15  *
 7180002041061  236304  236479  CC550436  29  *
 7180002041061  242590  243462  BZ848041  29
 7180002041061  267880  268614  BZ877402  29
 7180002041061  268891  269493  CC580499  29
 7180002041061  282248  282861  BZ885584  29
 7180002041061  295221  295987  CC572941  29
 7180002041061  337785  338517  CC923898  29
 7180002041061  338464  339092  BZ921430  29
 7180002041061  341219  341971  CZ415932  29
 7180002041061  382509  383281  CC581748  29
 7180002041061  387548  388176  CC572064  29
 7180002041061  415263  415788  BZ878185  29
 7180002041061  470138  470773  CC558303  29
 7180002041061  493006  493794  CC565740  29
- mate happiness:
 cat 7180002041061.posmap.mates | cut -f3 | count.pl
 good          2501
 diffScaffold  86
 oneSurrogate  69
 oneChaff      40
 oneDegen      11
 badSame       9
 badOuttie     4
 badLong       4
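The mate-happiness tally above comes from counting the values in the third tab-separated column of the posmap.mates file. count.pl is a local script whose source is not shown, so the Python sketch below is only an approximate equivalent of the cut -f3 | count.pl step; the function name count_column is hypothetical.

```python
from collections import Counter


def count_column(lines, column=2, sep="\t"):
    """Count the values in one column (0-based) and return them most frequent first."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) > column:
            counts[fields[column]] += 1
    return counts.most_common()


# Made-up records: only the third column matters for the tally.
sample = ["r1\tr2\tgood\n", "r3\tr4\tgood\n", "r5\tr6\tbadLong\n"]
tally = count_column(sample)
```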
- low cvg region:
 #scfid         begin   end     scf_len  cvg_len  cvg
 7180002041061  232849  232871  509987   23       1
 # it is in the middle of a unitig
 intersectPos.pl -i 1 7180002041061.posmap.utgscf 7180002041061.posmap.frgscf.10K.2K.mate_cvg.1-
 7180000849321  7180002041061  229782  238910  r
- subassembly: extract all reads & mates(139) and reassemble
- Msg Counts
 
 DST 38
 FRG 5769
 LKG 2724
Dirs:
 /scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scf7180002041061.187.mates/asm.ECR2
 /scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scf7180002041061.187.mates/asm.ECR2.mates
Qc stats:
        asm.ECR2                                   asm.ECR2.bog
 ...
 [Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID]
 0      3,277300,277925,92433,312,7180000000278    5,279938,399837,55988,29975,7180000000155
 1      6,234445,280546,39074,9220,7180000000277   6,234463,280485,39077,9204,7180000000156
 2      1,1630,1630,1630,0,7180000000276           1,1799,1799,1799,0,7180000000154
 3      NA                                         1,1674,1674,1674,0,7180000000153
 4      NA                                         1,1183,1183,1183,0,7180000000158
 total  10,513375,560101,51338,6675                14,519057,684978,37076,18436
- Some of the 139 mates assemble into scaffolds
- There is slightly more sequence in the bog assembly
- Mean/Max bog utg size is about twice the default utg size (scf and ctg sizes are about the same)
- align asm.ECR2 scaffolds to scf7180002041061
- 2919 bp at the 3' end of the new scf7180000000277 don't align
- 1990 bp at the 3' end of the new scf7180000000278 don't align
- the alignments of new scf7180000000277 & new scf7180000000278 overlap by ~1058 bp on scf7180002041061
- most of the added mated reads assembled at the 3' ends of new scf7180000000277 & new scf7180000000278
 
 nucmer -l 100 -c 500 ../scf7180002041061.fasta 9-terminator/7180002041061.update.scf.fasta -p scf7180002041061-7180002041061.update.scf
 
 delta-filter -q scf7180002041061-7180002041061.update.scf.delta > scf7180002041061-7180002041061.update.scf.filter-q.delta
 
 show-coords scf7180002041061-7180002041061.update.scf.filter-q.delta
      1    46099  |        1    46099  |    46099    46099  |    99.99  |   509987   280626  |     9.04    16.43  | scf7180002041061   scf7180000000277
  46273    48914  |    46120    48761  |     2642     2642  |   100.00  |   509987   280626  |     0.52     0.94  | scf7180002041061   scf7180000000277
  50810    56904  |    95425   101521  |     6095     6097  |    99.43  |   509987   280626  |     1.20     2.17  | scf7180002041061   scf7180000000277
  57301   233303  |   101699   277707  |   176003   176009  |    99.97  |   509987   280626  |    34.51    62.72  | scf7180002041061   scf7180000000277 *
 
 232245   444418  |   275975    63802  |   212174   212174  |    99.99  |   509987   277965  |    41.60    76.33  | scf7180002041061   scf7180000000278 * 
 444964   469975  |    63156    38151  |    25012    25006  |    99.97  |   509987   277965  |     4.90     9.00  | scf7180002041061   scf7180000000278
 449967   479221  |    58152    28900  |    29255    29253  |    99.96  |   509987   277965  |     5.74    10.52  | scf7180002041061   scf7180000000278
 480298   509987  |    29688        1  |    29690    29688  |    99.99  |   509987   277965  |     5.82    10.68  | scf7180002041061   scf7180000000278
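The figures quoted in the bullets above can be re-derived from the show-coords rows. The following is a small sketch, not part of the original pipeline: the coordinates are copied by hand from the two rows marked with `*`, and the helper name `overlap_len` is made up for illustration.

```python
# Reference (scf7180002041061) intervals of the two rows marked '*' above.
aln_277 = (57301, 233303)    # aligned to new scf7180000000277
aln_278 = (232245, 444418)   # aligned to new scf7180000000278


def overlap_len(a, b):
    """Length in bp of the overlap of two closed, 1-based intervals (0 if disjoint)."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(0, end - start + 1)


# Overlap of the two alignments on the original scaffold.
overlap = overlap_len(aln_277, aln_278)   # 1059

# Unaligned 3' tails: query length minus the last aligned query coordinate.
unaligned_277 = 280626 - 277707           # 2919
unaligned_278 = 277965 - 275975           # 1990

print(overlap, unaligned_277, unaligned_278)
```

The computed overlap (1059 bp) is within a base or two of the ~1058 bp noted above; the small difference depends on how interval ends are counted.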
scf7180002041163.214
- infoseq
scf7180002041234 10336067 41.93
- Markers
 #scfid         begin   end     markerid  Chr
 ...
 7180002041163  770783  771508  CC562874  30 *
 7180002041163  791050  791484  BZ887794  11 *
 ...
- Markers on different contigs !!!
 cat 7180002041163.posmap.ctgscf
 ..
 7180001926720  7180002041163  750392  787668  f
 7180001926721  7180002041163  787911  815738  f
 ...
- subassembly: extract all reads & extra mates(293) and reassemble
- Msg Counts
 
 DST  39
 FRG  17287
 LKG  8113
- qc
 
 [Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID]
 0=54,1742235,1760846,32264,351,7180000000507
 1=1,6224,6224,6224,0,7180000000506
 2=1,1472,1472,1472,0,7180000000508
 total=56,1749931,1768542,31249,351
- alignments :
 show-coords scf7180002041163-7180002041163.update.scf.filter-q.delta
    1748     7972  |     6224        1  |     6225     6224  |    99.84  |  1757217     6224  |     0.35   100.00  | scf7180002041163   scf7180000000506   [CONTAINS]
    4012     5490  |        1     1470  |     1479     1470  |    98.85  |  1757217     1472  |     0.08    99.86  | scf7180002041163   scf7180000000508   [CONTAINS]
    4567    18273  |        1    13697  |    13707    13697  |    99.92  |  1757217  1761853  |     0.78     0.78  | scf7180002041163   scf7180000000507
 ...
  750393   787668  |   757092   794367  |    37276    37276  |   100.00  |  1757217  1761853  |     2.12     2.12  | scf7180002041163   scf7180000000507
  787912   815738  |   794644   822455  |    27827    27812  |    99.81  |  1757217  1761853  |     1.58     1.58  | scf7180002041163   scf7180000000507
 ...
 1755928  1757217  |  1760564  1761853  |     1290     1290  |   100.00  |  1757217  1761853  |     0.07     0.07  | scf7180002041163   scf7180000000507
Manually curated
Markers within 50K of a low mate cvg region
- 13 scaffolds (22 before)
- 14 breaks : 9 on the same contig , 2 on adjacent contigs , 3 on non adjacent contigs
- File
/scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scfproblems.mates//scfproblems
- scfproblems.markers.txt
 #scfid         begin     end       markerid  Chr(break)  comment
 7180002040911  2210713   2211037   CZ405316  29          half Chr30, half Chr29 ; SAME ctg
 7180002041061  213959    214640    CC572066  15          half Chr29, half Chr15; SAME ctg
 7180002041103  5296272   5297045   CC587675  15          Chr18 & Chr15 at 3'; diff ctg (2 ctgs in between)
 7180002041107  7302203   7302874   CC906829  15          half Chr9, half Chr15; SAME ctg
 7180002041200  4920833   4921695   BZ877250  20          half Chr20, half Chr2; SAME ctg
 7180002041216  307121    307653    CL609365  5           Chr5 at 5' & Chr15; SAME ctg
 7180002041259  4315523   4316178   BZ922220  3           Chr5, Chr3, Chr24 !!! ; diff ctgs (4 ctgs in between)
 7180002041281  2884855   2885652   CC538348  19          half Chr19, half Chr28 ; SAME ctg
 7180002041315  2196754   2197298   BZ838379  1           Chr13 & Chr1 at 3'; diff ctg (2 ctgs in between)
 7180002041325  2597590   2598283   CC472212  10          half Chr10, half Chr5; SAME ctg
 7180002041348  3028112   3028921   CC531996  8           half Chr8, half Chr7; SAME ctg
 7180002041356  6828214   6828925   CC531427  16          half Chr16, half Chr25; SAME ctg
 7180002041358  21002368  21003219  CC527932  26          Chr14 & Chr26 at 3' (long chunck); diff ctg
Details:
 #scfid         begin     end       markerid  Chr  #ctgid         begin   end     markerid  Chr
 7180002040911  2189844   2190490   BZ923031  30   7180001730191  42851   43497   BZ923031  30  *
 7180002040911  2210713   2211037   CZ405316  29   7180001730191  63720   64044   CZ405316  29  *
 --
 7180002041061  213959    214640    CC572066  15   7180001852904  29927   30608   CC572066  15  *
 7180002041061  236304    236479    CC550436  29   7180001852904  52272   52447   CC550436  29  *
 --
 7180002041103  5220856   5221407   CG984741  18   7180001854649  21126   21677   CG984741  18  *
 7180002041103  5296272   5297045   CC587675  15   7180001854651  837     1610    CC587675  15  *
 --
 7180002041107  7302203   7302874   CC906829  15   7180001855003  77842   78513   CC906829  15  **
 7180002041107  7311399   7312254   CC479102  9    7180001855003  87038   87893   CC479102  9   **
 --
 7180002041200  4940131   4940949   CC500137  20   7180002002029  31935   32753   CC500137  20  **
 7180002041200  4956412   4957105   CZ428497  2    7180002002029  48216   48909   CZ428497  2   **
 --
 7180002041216  307121    307653    CL609365  5    7180002003578  6871    7403    CL609365  5   *
 7180002041216  310832    311546    CL865591  15   7180002003578  10582   11296   CL865591  15  *
 --
 7180002041259  4224828   4225527   CC920177  5    7180002012718  112862  113561  CC920177  5   *
 7180002041259  4315523   4316178   BZ922220  3    7180002012722  19376   20031   BZ922220  3   *
 --
 7180002041259  6220638   6221268   BZ869532  3    7180002012752  199799  200429  BZ869532  3   *
 7180002041259  6239728   6240375   CZ413142  24   7180002012753  2600    3247    CZ413142  24  *
 --
 7180002041281  2927406   2928016   CC573399  19   7180002018361  74539   75149   CC573399  19  **
 7180002041281  2938896   2939696   CC513914  28   7180002018361  86029   86829   CC513914  28  **
 --
 7180002041315  2152291   2153097   BZ836343  13   7180001862977  3397    4203    BZ836343  13  *
 7180002041315  2196754   2197298   BZ838379  1    7180002024308  7255    7799    BZ838379  1   *
 --
 7180002041325  2608389   2609213   CC506736  10   7180002025880  237956  238780  CC506736  10  **
 7180002041325  2638591   2639207   CC770009  5    7180002025880  268158  268774  CC770009  5   **
 --
 7180002041348  3044263   3044902   BZ872906  8    7180002030033  152672  153311  BZ872906  8   **
 7180002041348  3092547   3093121   BZ924509  7    7180002030033  200956  201530  BZ924509  7   **
 --
 7180002041356  6828214   6828925   CC531427  16   7180001722964  12571   13282   CC531427  16  *
 7180002041356  6832720   6833356   CC494876  25   7180001722964  17077   17713   CC494876  25  *
 --
 7180002041358  20964100  20964829  BZ839784  14   7180001723456  140919  141648  BZ839784  14  *
 7180002041358  21002368  21003219  CC527932  26   7180001723457  15983   16834   CC527932  26  *
- scfproblems.low_mate_cvg.txt
 #scfid         begin     end       ctglen    len  mate_cvg
 7180002040911  2205494   2205582   3008363   89   1
 7180002041061  232849    232871    509987    23   1
 7180002041103  5285314   5285479   5341215   166  1
 7180002041107  7306552   7306653   19514159  102  1
 7180002041200  4954073   4954180   16995932  108  1
 7180002041216  307614    307741    1444165   128  0
 7180002041259  4313297   4313311   15600612  15   0
 7180002041281  2933371   2933618   12907599  248  1
 7180002041315  2193142   2193167   2232736   26   1
 7180002041325  2623956   2624190   11153784  235  1
 7180002041348  3064633   3064750   19180127  118  1
 7180002041356  6831454   6831811   14697197  358  1
 7180002041358  20970950  20971112  25078118  163  1
No low cvg regions
- 9 scaffolds
- 14 breaks : 5 on the same contig , 6 on adjacent contigs , 3 on non adjacent contigs
 7180002040844 : 3 consecutive Chr30 markers (in the middle); mate_cvg > 18 ; markers CC576837,CC516738,CC543771; SAME ctg
 7180002041163 : 5' Chr30, 3' Chr11 ; mate_cvg~=10; marker BZ887794 ; diff ctg (1 ctg in between)
 7180002041234 : 5' Chr10, 3' Chr5; mate_cvg=10; marker BZ889975; scf=1757217bp; diff ctg
 7180002041235 : 2 consecutive Chr14 markers; mate_cvg=15..20; markers CC561837 & CC585677; NOT uniq; SAME ctg
 7180002041279 : 5 consecutive Chr23 markers; markers CC472696,CC522963; cvg=9..18; diff ctg
 7180002041306 : 5' Chr14, 3' Chr6; marker CC549871; cvg=2; diff ctg (22 ctgs in between)
 7180002041308 : 2 consecutive Chr20 markers; markers CC571631 & BZ832318; cvg=20; SAME/diff ctg
 7180002041321 : 5' Chr2, 3' Chr15 ; marker CC513377 cvg=10; diff ctg (1 ctgs in between)
 7180002041350 : 2 consecutive Chr3 markers ; markers BZ837387 & CC571149; cvg 17..26; NOT uniq; SAME/diff ctg
Details:
 #scfid         begin    end      markerid  Chr  #ctgid         begin   end     markerid  Chr
 7180002040844  8369147  8369989  CC521620  17   7180001727899  132573  133415  CC521620  17  **
 7180002040844  8437707  8438324  CC576837  30   7180001727899  201133  201750  CC576837  30  **
 ...
 7180002040844  8467493  8468096  CC543771  30   7180001727899  230919  231522  CC543771  30  *
 7180002040844  8522732  8523598  CC513544  17   7180001727901  34970   35836   CC513544  17  *
 ---
 7180002041163  770783   771508   CC562874  30   7180001926720  20391   21116   CC562874  30  *
 7180002041163  791050   791484   BZ887794  11   7180001926721  3139    3573    BZ887794  11  *
 --
 7180002041234  3856848  3857673  BZ889975  10   7180002010181  56608   57433   BZ889975  10  *
 7180002041234  3949893  3950652  CC509477  5    7180002010182  28610   29369   CC509477  5   *
 --
 7180002041235  1117908  1118686  CC579933  21   7180002010285  4086    4864    CC579933  21  **
 7180002041235  1118939  1119675  CC561837  14   7180002010285  5117    5853    CC561837  14  **
 7180002041235  1158300  1159022  CC585677  14   7180002010285  44478   45200   CC585677  14  **
 7180002041235  1164296  1164891  BZ924510  21   7180002010285  50474   51069   BZ924510  21  **
 --
 7180002041279  6755839  6756519  BZ849919  1    7180002018195  417990  418670  BZ849919  1   *
 7180002041279  6786086  6786957  CC472696  23   7180002018196  12703   13574   CC472696  23  *
 ...
 7180002041279  6866378  6866890  CC522963  23   7180002018196  92995   93507   CC522963  23  *
 7180002041279  6911185  6911897  CC574255  1    7180002018197  15411   16123   CC574255  1   *
 --
 7180002041306  248713   249457   CC503129  14   7180002022875  59534   60278   CC503129  14  *
 7180002041306  910997   911554   BZ839769  6    7180002022892  21939   22496   BZ839769  6   *
 --
 7180002041308  1919399  1920196  BZ883381  18   7180002023013  57093   57890   BZ883381  18  *
 7180002041308  1958658  1959168  CC571631  20   7180002023014  29833   30343   CC571631  20  *
 7180002041308  1963333  1964083  BZ832318  20   7180002023014  34508   35258   BZ832318  20  **
 7180002041308  2009574  2010367  CC499423  18   7180002023014  80749   81542   CC499423  30  **
 --
 7180002041321  4820048  4820542  BZ846646  2    7180002025240  77429   77923   BZ846646  2   *
 7180002041321  4908631  4909160  CC513377  15   7180002025242  13302   13831   CC513377  15  *
 --
 7180002041350  2423855  2424205  CG983886  13   7180002030523  183644  183994  CG983886  13  *
 7180002041350  2446763  2447125  BZ837387  3    7180002030524  9191    9553    BZ837387  3   *
 7180002041350  2457417  2457704  CC571149  3    7180002030524  19845   20132   CC571149  3   **
 7180002041350  2486474  2487050  CC490214  13   7180002030524  48902   49478   CC490214  13  **
Scaffold splitting
Before:
- 22 scaffolds
- 28 breaks : 14 on the same contig , 8 on adjacent contigs , 6 on non adjacent contigs
Now:
- 14 scaffolds
- 15 breaks : 8 on the same contig , 3 on adjacent contigs , 4 on non adjacent contigs
Scaffold to break
   nl  scfid           breaks
   #1  7180002040844   2
    2  7180002040911   1
    3  7180002041061   1
   #4  7180002041103   1
    5  7180002041107   1
   #6  7180002041163   1
    7  7180002041200   1
   #8  7180002041216   1
    9  7180002041234   1
  #10  7180002041235   2
   11  7180002041259   2
  #12  7180002041279   2
   13  7180002041281   1
   14  7180002041306   1
  #15  7180002041308   2
   16  7180002041315   1
   17  7180002041321   1
   18  7180002041325   1
   19  7180002041348   1
  #20  7180002041350   2
   21  7180002041356   1
   22  7180002041358   1
Contigs to break
   nl  ctgid           
    1  7180001722964
    2  7180001723456
    3  7180001730191
    4  7180001852904
    5  7180001855003
    6  7180002002029
    7  7180002010182
    8  7180002012722
    9  7180002012752
   10  7180002018361
   11  7180002024308
   12  7180002025240
   13  7180002025880
   14  7180002030033
Marker pairs
   nl  scfid           begin1  end2    markerid1       Chr1    markerid2       Chr2    ctg1            ctg2            dist(end2-begin1)      dist(ctg2-ctg1)
   #1  7180002040844   8369147 8438324 CC521620        17      CC576837        30      7180001727899   7180001727899   69177                   0
   #2  7180002040844   8467493 8523598 CC543771        30      CC513544        17      7180001727899   7180001727901   56105                   2
 
    3  7180002040911   2189844 2211037 BZ923031        30      CZ405316        29      7180001730191   7180001730191   21193                   0
 
    4  7180002041061   213959  236479  CC572066        15      CC550436        29      7180001852904   7180001852904   22520                   0
 
   #5  7180002041103   5220856 5297045 CG984741        18      CC587675        15      7180001854649   7180001854651   76189                   2
 
    6  7180002041107   7302203 7312254 CC906829        15      CC479102        9       7180001855003   7180001855003   10051                   0
 
   #7  7180002041163   770783  791484  CC562874        30      BZ887794        11      7180001926720   7180001926721   20701                   1
 
    8  7180002041200   4940131 4957105 CC500137        20      CZ428497        2       7180002002029   7180002002029   16974                   0
 
   #9  7180002041216   307121  311546  CL609365        5       CL865591        15      7180002003578   7180002003578   4425                    0
 
   10  7180002041234   3856848 3950652 BZ889975        10      CC509477        5       7180002010181   7180002010182   93804                   1
 
  #11  7180002041235   1117908 1119675 CC579933        21      CC561837        14      7180002010285   7180002010285   1767                    0
  #12  7180002041235   1158300 1164891 CC585677        14      BZ924510        21      7180002010285   7180002010285   6591                    0
 
   13  7180002041259   4224828 4316178 CC920177        5       BZ922220        3       7180002012718   7180002012722   91350                   4
   14  7180002041259   6220638 6240375 BZ869532        3       CZ413142        24      7180002012752   7180002012753   19737                   1
 
  #15  7180002041279   6755839 6786957 BZ849919        1       CC472696        23      7180002018195   7180002018196   31118                   1
  #16  7180002041279   6866378 6911897 CC522963        23      CC574255        1       7180002018196   7180002018197   45519                   1
 
   17  7180002041281   2927406 2939696 CC573399        19      CC513914        28      7180002018361   7180002018361   12290                   0
 
   18  7180002041306   248713  911554  CC503129        14      BZ839769        6       7180002022875   7180002022892   662841                  22
 
  #19  7180002041308   1919399 1959168 BZ883381        18      CC571631        20      7180002023013   7180002023014   39769                   1
  #20  7180002041308   1963333 2010367 BZ832318        20      CC499423        18      7180002023014   7180002023014   47034                   0
 
   21  7180002041315   2152291 2197298 BZ836343        13      BZ838379        1       7180001862977   7180002024308   45007                   2*
 
   22  7180002041321   4820048 4909160 BZ846646        2       CC513377        15      7180002025240   7180002025242   89112                   2
 
   23  7180002041325   2608389 2639207 CC506736        10      CC770009        5       7180002025880   7180002025880   30818                   0
 
   24  7180002041348   3044263 3093121 BZ872906        8       BZ924509        7       7180002030033   7180002030033   48858                   0
 
  #25  7180002041350   2423855 2447125 CG983886        13      BZ837387        3       7180002030523   7180002030524   23270                   1
  #26  7180002041350   2457417 2487050 CC571149        3       CC490214        13      7180002030524   7180002030524   29633                   0
 
   27  7180002041356   6828214 6833356 CC531427        16      CC494876        25      7180001722964   7180001722964   5142                    0
 
   28  7180002041358 20964100 21003219 BZ839784        14      CC527932        26      7180001723456   7180001723457   39119                   1
Break intervals
       #scfid          begin   end     scflen          len  allread goodmate badmate Chr1 Chr2 mark1   mark2  ctgid
   #1  7180002040844   8412735 8412739 9565303         5       1       12      15      17 30   357     3      
   #2  7180002040844   8485384 8485869 9565303         486     0       12      4       30 17   3       53
 
    3  7180002040911   2205044 2205247 3009964         204     3       0       8       30 29   44      34     7180001730191
    4  7180002041061   232819  232843  510915          25      5       0       23      15 29   8       14     7180001852904
   #5  7180002041103   5295117 5295435 5342873         319     0       11      2       18 15   221     4
    6  7180002041107   7306552 7306620 19515586        69      2       0       11      15 9    320     549    7180001855003
   #7  7180002041163   787669  787911  1752998         243     0       10      0       30 11   16      8
    8  7180002041200   4954103 4954180 16991478        78      1       0       27      20 2    219     532    7180002002029
   #9  7180002041216   309331  309337  1435859         7       1       4       3       5 15    7       50
   10  7180002041234   3939493 3939548 10292558        56      1       0       20      10 5    160     227    7180002010182
  #11  7180002041235   1117752 1117987 1742396         236     5       20      2       21 14   43      2
  #12  7180002041235   1161294 1161363 1742396         70      3       16      0       14 21   2       32
   13  7180002041259   4313381 4313547 15594966        167     1       0       12      5 3     193     75     7180002012722
   14  7180002041259   6231717 6231762 15594966        46      8       4       18      3 24    75      406    7180002012752
  #15  7180002041279   6773364 6773383 9784709         20      0       4       14      1 23    292     5
  #16  7180002041279   6894603 6894618 9784709         16      0       4       16      23 1    5       141
   17  7180002041281   2931877 2931885 12908817        9       3       5       10      19 28   83      441    7180002018361
   18  7180002041306   267445  267446  4484402         2       0       2       14      14 6    49      141    7180002022875,7180002005935
 
  #19  7180002041308   1928806 1928825 23070077        20      0       10      0       18 20   63      2
  #20  7180002041308   1977759 1977828 23070077        70      1       21      1       20 18   2       836
   21  7180002041315   2192754 2192893 2233575         140     7       0       22      13 1    59      2      7180002024308
   22  7180002041321   4840934 4841070 6492983         137     1       0       16      2 15    227     72     7180002025240
   23  7180002041325   2624153 2624190 11155368        38      3       0       22      10 5    86      357    7180002025880
   24  7180002041348   3064691 3064730 19179560        40      1       0       16      8 7     116     634    7180002030033
  #25  7180002041350   2437553 2437572 23353087        20      0       19      1       13 3    107     2
  #26  7180002041350   2479000 2479029 23353087        30      2       23      1       3 13    2       880
 
   27  7180002041356   6831454 6831783 14699461        330     1       0       9       16 25   299     319    7180001722964
   28  7180002041358  20970057 20970139 25079321       83      7       0       25      14 26   868     189    7180001723456
Where:
 Chr1:  the most frequent chromosome among markers aligned at coordinates <= begin
 Chr2:  the most frequent chromosome among markers aligned at coordinates >= end
 mark1: number of Chr1 markers aligned at coordinates <= begin
 mark2: number of Chr2 markers aligned at coordinates >= end
 ctgid: ctg to break
 Lines starting with # should be ignored.
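The column definitions above can be illustrated with a short sketch. This is not the script used to build the table: the function name `flank_summary` is hypothetical, and it assumes a marker counts as "left" when its end is <= begin and as "right" when its begin is >= end. The marker list is the scf7180002041061 marker table from earlier on this page, and the interval is break #4.

```python
from collections import Counter

# (begin, end, Chr) for the 22 markers on scf7180002041061.
MARKERS = [
    (2061, 2904, 15), (8979, 9756, 15), (10856, 11485, 15), (47578, 48186, 15),
    (80485, 81237, 15), (117811, 118488, 15), (151253, 152009, 15),
    (213959, 214640, 15), (236304, 236479, 29), (242590, 243462, 29),
    (267880, 268614, 29), (268891, 269493, 29), (282248, 282861, 29),
    (295221, 295987, 29), (337785, 338517, 29), (338464, 339092, 29),
    (341219, 341971, 29), (382509, 383281, 29), (387548, 388176, 29),
    (415263, 415788, 29), (470138, 470773, 29), (493006, 493794, 29),
]


def flank_summary(markers, begin, end):
    """Return (Chr1, mark1, Chr2, mark2) for the break interval begin..end."""
    left = Counter(chrom for b, e, chrom in markers if e <= begin)
    right = Counter(chrom for b, e, chrom in markers if b >= end)
    chr1, mark1 = left.most_common(1)[0] if left else (None, 0)
    chr2, mark2 = right.most_common(1)[0] if right else (None, 0)
    return chr1, mark1, chr2, mark2


result = flank_summary(MARKERS, 232819, 232843)
print(result)  # (15, 8, 29, 14)
```

The result agrees with the Chr1, Chr2, mark1 and mark2 columns of row 4 in the Break intervals table.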
New scaffolds
#scfid(new) scfid begin end scflen(new)
Files
 # ctg:scf new name mapping
 /scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scfproblems.both.filtered/scfproblems.posmap.ctbscb
 # scf:scf new:original name mapping
 /scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scfproblems.both.filtered/scfproblems.posmap.scbscf
 # ctg:ctg new:original name mapping
 /scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scfproblems.both.filtered/ctgproblems.posmap.ctbctg
 # ctg FASTA sequences
 /scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scfproblems.both.filtered/ctbproblems.fasta
 # scf FASTA sequences
 /scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scfproblems.both.filtered/scbproblems.fasta
UMD2.6 (UMD2.5 without contam ctg/scaff; split ctg/scaff)
Contaminants & MarkerBreaks
Delete summary:
                        ctg   scf   scf->ctg
 contaminants(delete)   156   152   666
 contaminants(trim)     12    12    1328
 markerBreaks           14+1  14+1  2875+1    # 1 more break in UMD2.6.1
 total                  182   178   4869
Add summary:
                        ctg   scf
 contaminants(delete)   0     2
 contaminants(trim)     12    12
 markerBreaks           28+2  29+2            # 1 more break in UMD2.6.1
 total                  40    43
Summary:
            ctg    scf    markers
 original   90135  39978
 del/add    -142   -135
 final      89993  39843  -17
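As a consistency check, the final counts follow from the delete and add totals above. The short sketch below only re-does that arithmetic; it assumes the totals (182/178 deleted, 40/43 added) exclude the extra UMD2.6.1 break, which is what the row sums suggest.

```python
# Counts copied from the Delete, Add and Summary tables above.
original = {"ctg": 90135, "scf": 39978}
deleted = {"ctg": 182, "scf": 178}
added = {"ctg": 40, "scf": 43}

# Net change and final number of elements per type.
net = {k: added[k] - deleted[k] for k in original}    # ctg: -142, scf: -135
final = {k: original[k] + net[k] for k in original}   # ctg: 89993, scf: 39843

print(net, final)
```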
Files:
 /scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.ctg.fasta       : contig FASTA sequence
 /scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.scf.fasta       : scaffold FASTA sequence
 /scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.posmap.ctglen   : contig lengths
 /scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.posmap.scflen   : scaffold lengths
 /scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.posmap.ctgscf   : mapping of contigs to scaffolds (posmap format)
 /scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.posmap.scaff    : mapping of contigs to scaffolds (scaff format)
 /scratch1/bos_taurus/Assembly/2009_0312_CA/new/ctg.delete.uid     : UID of the contigs which were deleted
 /scratch1/bos_taurus/Assembly/2009_0312_CA/new/scf.delete.uid     : UID of the scaffolds which were deleted
 /scratch1/bos_taurus/Assembly/2009_0312_CA/new/ctg.add.uid        : UID of the contigs which were added : UID =~/brk\d+[abc]/ OR UID =~/cnt\d+/
 /scratch1/bos_taurus/Assembly/2009_0312_CA/new/scf.add.uid        : UID of the scaffolds which were added : UID =~/brk\d+[abc]/ OR UID =~/cnt\d+/
 /scratch1/bos_taurus/Assembly/2009_0312_CA/new/markers.delete.uid : markers which got deleted
 /scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.ctg.break       : 15 contig break regions
 /scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.scf.break       : 16 scaffold break regions
Scripts:
~/bin/breakPosmapKeep.amos : pipeline for breaking scf/ctg
- Final:
           elem   min  max            mean   med   n50      sum
 ctg       89993  65   1160130(1.1M)  29736  5180  95952    2676109306
 scf       39843  316  34167202(34M)  68353  1361  7451988  2723419943
 scf<50K   38915  316  49898          2765   1349  5107     107632139
 scf<5K    35229  316  4999           1518   1306  1435     53495879
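For reference, the n50 column can be computed as in the sketch below. The notes do not say which N50 definition the summary script uses; this assumes the common one (the length L such that elements of length >= L cover at least half of the total), and the function name `n50` is just illustrative.

```python
def n50(lengths):
    """Largest length L such that elements of length >= L sum to >= 50% of the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


# Toy example: total = 20, and 6 + 5 = 11 is the first prefix reaching half of it.
print(n50([2, 3, 4, 5, 6]))  # 5
```

In practice the lengths would be read from the second column of a file such as bt.posmap.ctglen, assuming it holds one ID and length per line.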
Ctg Markers
Filtered:
%IDY>90 %Matched>85
- ~30% of alignments satisfy this condition
Markers:
total: 126111 # from 31372 ctgs
Contigs
            elem    min  max      mean   med    n50     sum
 placed     31372*  442  1160130  72908  52734  111147  2287291732*
 unplaced   58621   65   425922   6632   1600   22204   388817574
 total      89993   65   1160130  29736  5180   95952   2676109306
Scf Markers
Markers:
 total:    126111  # from 2641 scaffolds; 1595 scaffolds have 2+ markers
 na:       18744   # not placed
 del:      17      # the scaffolds they belong to were deleted
 other:    3722    # not on the "main" chromosome; "main" chromosome determined by a majority rule; if it's a tie check markers for uniqueness
 outliers: 411     # interquartile range method (IQR) q1-1.5*(q3-q1) .. q3+1.5*(q3-q1)
 filtered: 103217  # "non conflicting" markers
 Scaffolds:	 
                   elem       min        max        mean       med        n50        sum              #ctgs
 placed            2641       1000       34167202   994523     16673      8170786    2626536153*      50528	 
 unplaced          37202      316        754615     2604       1337       3964       96883790	        39465
 total             39843      316        34167202   68353      1361       7451988    2723419943       89993	 
                scf   <2    >=2   >=10  min  max   mean  med  n50  sum
 markers/scf    2641  1595  1046  562   1    1418  39    1    354  103479
 ctg/scf        2641  1546  1095  559   1    545   19    1    130  50528
 cat markers_scf.mainChr.*summary | count.pl -c 7 | sort -n
 1     1564  # 1 marker/scf
 2     276   # 2 markers/scf
 3     116   # ...
 4     52
 ...
 1470  1     # scaffold 7180002041371 has 1470 markers
- Scaffold position:
 filter outliers (interquartile range method); use median value
 problem: only 2 markers far apart: choose randomly or check for uniqueness
 Summary (approximate)
 Chr  #Ctg   #Scaff  ScaffSpanSum    MaxMarkerPos
 1    2989   112     157088082       167097751
 2    2488   73      138112445       141135901
 3    2257   109     120984003       128677351
 4    2079   111     124695956       123662451
 5    2271   98      120470218       130242001
 6    2244   127     117350442       127208151
 7    2169   93      109780318       114917551
 8    2069   95      111646872       114607251
 9    1937   72      103790361       106365151
 10   1829   96      102878815       108508301
 11   2030   62      106593132       107458151
 12   1789   114     89109155        97406401
 13   1498   73      83821399        88539451
 14   1482   142     84084175        89211101
 15   1734   105     84680500        91332551
 16   1710   111     80727432        86838601
 17   1384   51      72913556        78195801
 18   1446   86      65689468        70299751
 19   1338   56      63372609        69847351
 20   1454   54      71941707        75982901
 21   1405   63      70035525        72193201
 22   1077   36      60892135     >  60178851
 23   1021   44      51791473        54886001
 24   1059   26      61662407     >  61466101
 25   783    41      42670836     >  45254751
 26   991    40      50640267        52316851
 27   920    70      45768018        48911451
 28   810    52      45884054        50753001
 29   1143   89      51657687        55219751
 30   3122   340     135803106       152429101
 U    39465  37202   94983049
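The position rule described above (drop outliers with the interquartile range method, then take the median) might look like the following sketch. It assumes the standard Tukey fences q1-1.5*IQR .. q3+1.5*IQR and Python's default quartile interpolation, which may differ slightly from what the original scripts did; `iqr_filter` and `scaffold_position` are hypothetical names.

```python
import statistics


def iqr_filter(values, k=1.5):
    """Keep values inside [q1 - k*IQR, q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def scaffold_position(marker_positions):
    """Median chromosome position of a scaffold's markers after outlier removal."""
    positions = list(marker_positions)
    # With very few markers the quartiles are not meaningful, so skip filtering.
    if len(positions) >= 4:
        positions = iqr_filter(positions)
    return statistics.median(positions)


# Toy example: seven clustered marker positions and one far-away outlier.
print(scaffold_position([10, 11, 12, 13, 14, 15, 16, 1000]))  # 13
```

The two-marker problem noted above remains: with only two distant markers the median is just their midpoint, so a separate tie-break (for example marker uniqueness) is still needed.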
- Scaffold orientation
 filter outliers (interquartile range method)
 use a least-squares fit to estimate the orientation: if the slope is positive => forward; if the slope is negative => reverse
 problem: slope ~=0 => which direction ?
 cat Chr.summary | getSummary.pl -i 5
 cat Chr.agp | grep W | awk '{print $9}' | count.pl
       elem   <0     0     >0     min  max  mean  med
 scf   2641   516    1610  515    -31  61   0     0
 ctg   50528  24885  2236  23407
 
 Use slope thold to determine direction?
 cat markers_scf.mainChr.noOutliers.summary | p 'print $_ if(abs($F[5])>0.5);' | wc -l # 634
 Ambiguity examples:
 BZ908653  6   114061501  114016501  114106501  172149  597  98.66  100.00  7180002040834
 BZ891600  6   114085251  114051501  114119001  4834    504  99.80  99.02   7180002040834
 CZ411135  6   114094001  114059001  114129001  242710  609  98.20  100.00  7180002040834
 BZ854276  6   114132751  114101501  114164001  100980  580  99.31  99.83   7180002040834
 
 CC524983  30  115669901  115622401  115717401  96634   715  98.74  100.00  7180002041003
 BZ869249  30  115791151  115717401  115864901  86931   448  99.78  100.00  7180002041003
 BZ867530  30  115798651  115732401  115864901  69671   572  99.13  100.00  7180002041003
 CC585731  30  115808651  115752401  115864901  54950   737  99.86  100.00  7180002041003
 CC469285  30  115818651  115772401  115864901  125555  550  94.57  100.00  7180002041003
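A minimal sketch of the slope-based orientation call follows. It assumes scaffold position is the x variable and chromosome position the y variable (the notes do not say which way the fit was done), and it uses the 0.5 threshold from the one-liner above as a tentative cutoff; `ls_slope` and `orientation` are hypothetical names.

```python
def ls_slope(xs, ys):
    """Ordinary least-squares slope of y on x (0.0 if all x are identical)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def orientation(scf_pos, chr_pos, min_abs_slope=0.5):
    """Return '+', '-' or '?' (undetermined) for one scaffold."""
    slope = ls_slope(scf_pos, chr_pos)
    if slope > min_abs_slope:
        return "+"
    if slope < -min_abs_slope:
        return "-"
    return "?"


print(orientation([0, 1000, 2000], [5000, 6000, 7000]))  # +
print(orientation([0, 1000, 2000], [7000, 6000, 5000]))  # -
print(orientation([0, 1000, 2000], [5000, 5000, 5000]))  # ?
```

Scaffolds with a single marker, or with markers whose chromosome positions fall in the same map bin as in the ambiguity examples, end up in the "?" class.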
- Scaffold overlaps: some small scaffolds might be contained by bigger ones
 cat markers_scf.mainChr.noOutliers.posmap.scfchrabs | ~/bin/posmap2ovl.pl | sort -nk6 -r | ~/bin/tab2tab.pl -f -15 | head
 Chr  ref            qry            begin      end        end-begin
 30   7180002041328  7180002041078  72536051   76993951   4457900
 4    7180002041381  7180002041269  41976451   46175201   4198750
 30   7180002040852  7180002041077  142001501  145251501  3250000
 30   7180002038569  7180002041121  33012001   35518351   2506350
 30   7180002040971  7180002034501  59411001   61836101   2425100
 ...
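The containment case mentioned above can be detected with a simple interval scan. This is only a sketch of the idea and does not reproduce posmap2ovl.pl, whose behaviour is not documented here; the input tuples and scaffold names in the example are made up.

```python
def contained_pairs(placements):
    """placements: iterable of (chrom, scfid, begin, end).

    Returns a list of (chrom, container_scfid, contained_scfid).
    """
    by_chrom = {}
    for chrom, scfid, begin, end in placements:
        by_chrom.setdefault(chrom, []).append((begin, end, scfid))

    pairs = []
    for chrom, intervals in by_chrom.items():
        # Sort by begin, longest first on ties, so containers precede their contents.
        intervals.sort(key=lambda t: (t[0], -t[1]))
        for i, (b1, e1, s1) in enumerate(intervals):
            for b2, e2, s2 in intervals[i + 1:]:
                if b2 > e1:
                    break  # later intervals start even further right
                if e2 <= e1:
                    pairs.append((chrom, s1, s2))
    return pairs


# Hypothetical placements: scfB lies inside scfA, scfC only overlaps scfA.
example = [
    ("30", "scfA", 1000, 9000),
    ("30", "scfB", 2000, 3000),
    ("30", "scfC", 8000, 12000),
    ("4", "scfD", 2000, 3000),
]
print(contained_pairs(example))  # [('30', 'scfA', 'scfB')]
```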
Marker Issues
Placement
- 37202 scaffolds are unplaced (3.5% of total scaffold span); max=0.7M
- 87 unplaced scaffolds (1.5Mbp total) could be placed using SLK messages
 perl ~/bin/difference21.pl bt.slk markers_scf.scflen | head
 7180002030632  7180002040171  I  -147744.734   21214.779  2  UP
 7180002030792  7180002031221  I  -15042.301    558.223    2  UP
 7180002030849  7180002041244  N  -1609089.875  18145.568  4  UP
 ...
 
            elem  min    max       mean     med      n50       sum
 Unplaced   87    740    102909    17259    8544     34550     1501601
 Placed     78    16177  27139572  5757234  4037663  10989230  449064304
- ambiguous assignment to chromosomes:
 cat markers_scf.count | p 'print $_ if($F[2]*2==$F[3]);' | nl
       #k1             k2      count12 count1
    1  7180002030741   15      2       4
    2  7180002031341   11      1       2
   ...
   36  7180002068140   25      1       2
 # scaffold assigned differently
 ~/bin/difference12.pl markers.all/markers_scf.mainChr.count markers.all_plus_uniq/markers_scf.mainChr.count | nl
    1  7180002032536   19      1       2       23480
    2  7180002037223   8       1       2       31516
    3  7180002040013   25      2       4       15228
    4  7180002040262   30      1       2       3087
    5  7180002040378   10      1       2       50907
    6  7180002040523   29      1       2       9105
    7  7180002040769   4       1       3       827526
    8  7180002041203   14      2       4       1141560
    9  7180002044555   18      1       2       1566
- Chr30 has many scaff aligned to it
- how reliable are the markers: 1151 out of 2641 placed scaffolds have no unique markers ?
- what measure is best for placing the scaffolds?
- can some scaffolds go in the gaps ?
- AGP:
- gaps<=0 or 20 set to 100
- unoriented ctgs set to +
 
- Marker positions are not uniformly distributed; they tend to "cluster"
                                  elem    0     >0     min  max      mean   med    n50    sum
 markers_chr                      107337  8405  98932  0    446250   25800  8750   75000  2769379200
 markers_scf                      94171   123   94048  0    2565850  24143  15856  42236  2273610521
 markers_chr.mainChr.noOutliers   101084  7609  93475  0    1253750  27383  10000  78750  2768040450  # filtered IRQx,xy method
 markers_scf.mainChr.noOutliers   88450   52    88398  0    2669361  25327  16587  44374  2240202221
Orientation
- at least 1564 out of 2641 placed scaffolds cannot be oriented; max=1.39M
Possible misassemblies
 nl  #scfid         Ch  medianPos  rangePos  scfLen       slope   #ChMark  #Mark  #ctg
 1   7180002041225  1   120473301  9866350   9279975(9M)  0.0103  381      390    184   : break interval: 1674263..1674319 (1X frg_cvg region)
 2   7180002041078  30  77506551   6310600   3094999(3M)  0.0148  65       69     46    : no clear position
scf7180002041225 : 9.27 Mbp on Chr1
scflen=9279975
7180002041225.markers 7180002041225.markers 7180002041225.cvg
 ChrPos     ScfPos
 115272051  56300
 ...
 116848301  1618430
 116953301  9190560
 ...
 125138401  1691145
break interval: 1674263..1674319 = 57bp frg_cvg=1 mate_cvg=0 bad_mate_cvg(nearby)=7
scf7180002041078 : 3.09 Mbp (not broken)
7180002041078.markers 7180002041078.markers 7180002041078.1.cvg xrange [315268:466827] 7180002041078.2.cvg xrange [1191000:1229713] 7180002041078.3.cvg xrange [2733146:2750017]
 ChrPos    ScfPos
 72536051  2893791
 ...
 72682301  2750017
 72728651  1191000
 ...
 73273651  466827
 77112801  2733146
 ...
 78464051  1229713
 ..
 78595401  315268
scf7180002040971 : 1.93 Mbp (not broken)
Files
/scratch1/bos_taurus/Assembly/2009_0312_CA/new/markers.all.redo/
- Nucmer alignments to UMD2.0 chromosomes: all seem to agree pretty well
 /scratch1/bos_taurus/Assembly/2009_0312_CA/new/markers.all.redo2/nucmer_UMD2.0/UMD2.0.Chr26-Chr26.png
 /scratch1/bos_taurus/Assembly/2009_0312_CA/new/markers.all.redo2/nucmer_UMD2.0/UMD2.0.Chr27-Chr27.png
 /scratch1/bos_taurus/Assembly/2009_0312_CA/new/markers.all.redo2/nucmer_UMD2.0/UMD2.0.Chr29-Chr29.png
 /scratch1/bos_taurus/Assembly/2009_0312_CA/new/markers.all.redo2/nucmer_UMD2.0/UMD2.0.ChrX-Chr30.png
Homo sapiens alignments
Citations:
- mice closer to human than cow
- human and cow have approximately 201 homologous blocks of DNA
- Independently generated mapping data provide another measure of the quality of the assembly. Snelling et al. [4] created a B. taurus map from three radiation hybrid panels, two genetic maps, and bacterial artificial chromosome (BAC) end sequences. We aligned all of the 17,254 markers (of which 17,193 are unique) in their composite map (Cmap) to both assemblies. A marker was considered as matching a chromosome if 90% of the marker sequence aligned with at least 95% identity. Of the Cmap markers, 14,620 align to the UMD2 assembly's chromosomes, versus 13,699 markers (6.3% fewer) for the BCM4 assembly. A small number of Cmap markers (119 and 82 for UMD2 and BCM4, respectively) mapped to a different chromosome from the one indicated in the Cmap data.
- homologous synteny block (HSB) :
- human-cow alignment extended for at least 250 Kbp
- it was not interrupted by an inversion or by an HSB on another chromosome.
- If two HSBs were interrupted by a gap of <3 Mbp and nothing else fell in that gap, the two blocks were merged. (Note that if a large region of synteny is interrupted by a distinct HSB, the interruption creates three HSBs.)
- Number of homologous synteny blocks on each chromosome of the cow and human genomes 201 -> 268 blocks
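A minimal Python sketch of the HSB merging rule quoted above. The tuple layout, the function name, and the choice to apply the 250 Kbp length filter after merging are assumptions made for illustration; only human coordinates are compared, and this is not the code used in the cited paper.

```python
MIN_HSB_LEN = 250_000        # an HSB must extend for at least 250 Kbp
MAX_MERGE_GAP = 3_000_000    # two blocks separated by < 3 Mbp may be merged


def merge_hsbs(blocks):
    """Merge neighbouring blocks that share chromosomes and orientation.

    blocks: list of (hs_chr, bt_chr, strand, hs_start, hs_end) tuples,
    sorted by hs_chr and hs_start (hypothetical layout).
    Only consecutive blocks are compared, so any block that falls between
    two candidates keeps them apart.
    """
    merged = []
    for blk in blocks:
        if merged:
            prev = merged[-1]
            same_group = prev[:3] == blk[:3]
            gap = blk[3] - prev[4]
            if same_group and gap < MAX_MERGE_GAP:
                merged[-1] = prev[:3] + (prev[3], max(prev[4], blk[4]))
                continue
        merged.append(blk)
    # Assumption: the minimum-length filter is applied after merging.
    return [b for b in merged if b[4] - b[3] >= MIN_HSB_LEN]


example = [
    ("HS1", "BT16", "+", 0, 400_000),
    ("HS1", "BT16", "+", 1_000_000, 1_500_000),  # 0.6 Mbp gap: merged with the first
    ("HS1", "BT3", "+", 1_700_000, 2_000_000),   # distinct HSB interrupts the run
    ("HS1", "BT16", "+", 2_200_000, 2_600_000),  # close, but not merged across it
]
hsbs = merge_hsbs(example)
```

With this input the BT16 synteny is split by the BT3 block, so three HSBs come out, which is the "interruption creates three HSBs" case from the note.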
 
- Orienting contigs using cow-human alignments:
- Scaffolds (sets of linked contigs) that were mapped onto chromosomes using only a single marker could not be oriented from the marker information alone. We oriented many of these scaffolds by taking advantage of the overall conserved synteny between cow and human. First, all cow scaffolds were aligned to the human genome using nucmer [14] with its maximal unique match (mum) option in order to avoid alignments of repetitive sequence. For each alignment of a previously unoriented scaffold to human, all alignments within 100 Kbp on each side were pulled out for analysis. A score S was computed for each unoriented scaffold, taking into account whether the scaffolds surrounding S on both sides (in cow) were mapped to a consistent set of locations in human. If the scaffolds surrounding S were oriented, and if a large majority of these scaffolds on both the left and right agreed on the orientation, then S was assigned that orientation. Using this procedure, 1,840 scaffolds containing 4,011 contigs were oriented.
- We developed a similar procedure to assign unplaced contigs to chromosomes, again relying on conserved synteny between cow and human. First, all unplaced contigs were aligned as above. Mummer's 'delta-filter' program was then used to compute a one-to-one mapping of the unplaced contigs to human so that only the best aligning contig was considered at each region in human. For each unplaced contig's best alignment to human, the matching region in cow was identified via our human-cow syntenic map, and all contigs from this region were extracted for examination. We only considered placing a contig on a B. taurus chromosome if the order and direction of the surrounding contigs in cow matched the corresponding region in human. As above, we examined the alignments of nearby cow contigs that aligned within 100 kb of the unplaced contig's alignment in human. If the region of cow-human synteny contained no rearrangements, then the unplaced contig was placed at the location indicated by these alignments. Using this procedure, 1,046 contigs were placed on chromosomes. One consequence of this procedure was that a number of incompletely mapped genes (based on mRNA alignments) were completed.
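A small Python sketch of the neighbour-agreement idea in the orientation procedure quoted above. The 0.8 threshold is an assumed stand-in for "a large majority", and the function names are hypothetical; the score actually used in the paper is not given in these notes.

```python
def side_majority(orientations, min_fraction):
    """Return '+' or '-' if that strand reaches min_fraction on one side, else None."""
    if not orientations:
        return None
    for strand in ("+", "-"):
        if orientations.count(strand) / len(orientations) >= min_fraction:
            return strand
    return None


def vote_orientation(left, right, min_fraction=0.8):
    """Orient a scaffold only when the oriented neighbours on both sides agree.

    left, right: lists of '+'/'-' for already-oriented neighbouring scaffolds
    (e.g. those aligning within 100 Kbp on each side in human).
    """
    left_call = side_majority(left, min_fraction)
    right_call = side_majority(right, min_fraction)
    if left_call is not None and left_call == right_call:
        return left_call
    return None


call = vote_orientation(["+", "+", "+"], ["+", "+", "+", "+", "+", "-"])
```

A scaffold whose two sides disagree, or that has no oriented neighbours on one side, is left unoriented (`None`).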
 
Issues:
- which alignment program to use?
- nucmer
- blastz: difficult to parse
- blat
 
- nucmer: what parameters?
- default
- loose : -mum -l 12 -c 30 -g 1000
- ref: 24 homo sapiens chromosomes files
- query: 26 bos taurus scaffold files
 
Total scaffolds:
             elem       min        max        mean       med        n50        sum
 scf(len)    39844      316        34167202   68352      1361       7451988    2723419938
Aligned to HS:
                  elem       min        max        mean       med        n50        sum
 scf(len)         8860       385        34167202   301202     2257       7740810    2668658140  # each scf aligns in avg to 4 Chr
 scf(aligLen)     789272     250        19097      514        399        550        405946972
 scf(alig%)       789272     58.04      100.00     79         78.54      78.72      .
 scf(len,maxX)    727        1002       7069350    194658     6530       1895564    141516604   # 220 in common with the 339 ones that have mark
 scf(len,noMark)  7025       385        7605708    7783       1654       27630      54680173
 scf(len,mark)    1835       1001       34167202   1424511    40503      7970944    2613977967
Not aligned to HS:
                       elem       min        max        mean       med        n50        sum
 scf(len,notAligned)   30984      316        163953     1767       1309       1555       54762139
Aligned to markers:
                   elem       min        max        mean       med        n50        sum
 scf(len)          2641       1000       34167202   994523     16673      8170786    2626536153
 1 marker          1595       1000       1396951    17482      4595       48276      27884078
 2+ markers        1046       1055       34167202   2484371    596240     8217662    2598652075
 abs(slope)<0.25   1946       1000       3094999    29136      7585       114274     56700529
 abs(slope)>=0.25  696        4674       34167202   3692292    1845828    8401441    2569835619
Scf summary:
                     elem       min        max        mean       med        n50        sum
 all                 39844      316        34167202   68352      1361       7451988    2723419938
 1+mark              2641       1000       34167202   994523     16673      8170786    2626536153   # best marker alignment
 1+align             8860       385        34167202   301202     2257       7740810    2668658140   # alignments > 250bp
 1+align(new)        9880       385        34167202   270458     1996       7740810    2672129298   # alignments > 200bp (3.74M more than 250bp align)
 1+mark or 1+align   9669       385        34167202   276443     2051       7740810    2672936792
 0 mark and 1+align  7027       385        754,615    6603       1654       20219      46,400,303 !!!
 1+mark and 1+align  1791       1001       34167202   1305861    37914      7473583    2338798064
 0 mark and 0 align  30179      316        88,326     1705       1305       1517       51,483,587 !!!
 1+mark or 1+align   9669       385        34167202   276443     2051       7740810    2,672,936,792
Degenerate:
                   elem       min        max        mean       med        n50        sum
 all               251413     65         39964      1003       984        994        252,279,234
 all(2000+bp)      2128       2000       39964      3753       2827       3946       7,986,872
 1+mark            562        200        30168      2731       1274       5029       1,535,011
 1+mark(2000+bp)   180        2007       30168      6079       4404       7510       1,094,252
 1+align           6429       251        39964      1487       1013       1287       9,566,273
 1+align(2000+bp)  756        2004       39964      4820       3624       5646       3,644,556
Issues:
- 24 scaffolds that have 200+ alignments to at least 2 HS chromosomes
- 37 scaffolds that have 100+ alignments to at least 2 HS chromosomes
- 83 scaffolds that have 50+ alignments to at least 2 HS chromosomes
File location:
/scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_human
Synteny method
File location:
/scratch1/bos_taurus/Assembly/2009_0312_CA/synteny
- Position & orient scaffolds that aligned to HS
- find placed neighboring scaffolds that don't "disagree"
          elem       min        max        mean       med        n50        sum
 total    7027       385        754615     6603       1654       20219      46,400,303 !!!
- Scaffolds possibly assigned to the wrong chromosome
 cat summary.nucmer-markers.txt | perl ~/bin/synteny/getSynteny2.nucmer-marker.pl -minMarkers 2 | grep update:chr | getSummary.pl -i 6 -t "<2 markers"
 cat summary.nucmer-markers.txt | perl ~/bin/synteny/getSynteny2.nucmer-marker.pl -minMarkers 4 | grep update:chr | getSummary.pl -i 6 -t "<4 markers"
                             elem       min        max        mean       med        n50        sum
 <2 markers                  307        1001       937279     27898      9212       98674      8,564,850
 <4 markers                  383        1001       1141560    33822      13426      160445     12,953,875
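getSummary.pl is a local script whose source is not shown in these notes. As a rough sketch of how columns of this kind (elem, min, max, mean, med, n50, sum) are conventionally computed, assuming n50 is the length at which the cumulative sum of lengths, taken from longest to shortest, first reaches half of the total:

```python
from statistics import median


def summarize(lengths):
    """Return summary statistics for a list of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    n50 = 0
    for length in ordered:
        running += length
        if 2 * running >= total:   # reached half of the total length
            n50 = length
            break
    return {
        "elem": len(ordered),
        "min": ordered[-1],
        "max": ordered[0],
        "mean": total // len(ordered),  # integer mean (assumption, see note below)
        "med": median(ordered),
        "n50": n50,
        "sum": total,
    }


stats = summarize([2, 3, 4, 5, 6, 8, 10])
```

The integer division is a guess based on the tables on this page, where the means look truncated rather than rounded (e.g. 2668658140 / 8860 = 301202.95 is listed as 301202).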
UMD2.0
        elem       min        max        mean       med        N50        sum
 ctg    74337      88         840370     35148      14148      79144      2612810882*
UMD2.6.1 noVariants,noCont
        elem       min        max        mean       med        N50        sum
 scf    2646       385        34167202   994701     20787      7139718    2631980624
 ctg    50755      65         1160130    50966      29450      88583      2586785910*
UMD2.6.1 noVariants
                              elem       min        max        mean       med        n50        sum
 scf                          39844      316        34167202   68352      1361       7451988    2,723,419,938
 scf.placed                   4707       385        34167202   564225     10413      7800796    2655811430
 scf.variants                 29436      723        51828      1714       1298       1514       50,461,989
 scf.unplaced                 30575      316        451968     1845       1309       1601       56440663
 
                              elem       min        max        mean       med        n50        sum
 ctg                          89994      65         1160130    29736      5180       95952      2676109378
 ctg.placed                   53646      65         1160130    48650      26840      98428      2609925446
 ctg.unplaced                 31510      101        207476     1761       1314       1556       55511410
                              elem       min        max        mean       med        n50        sum
 deg(all)                     251413     65         39964      1003       984        994        252,279,234
 deg(>2Kbp)                   2128       2000       39964      3753       2827       3946       7,986,872
 deg.placed                   883        200        30246      2711       1303       4845       2,393,905
 deg.variants                 4654       331        8039       1117       989        1024       5,200,553
 deg.unplaced(?)              245343     65         39964      994        984        993        244047258
 scf.unplaced                 978        316        451968     7932       3736       15828      7757882
 scf.unplaced.0cvg            863        316        451968     7521       2286       15744      6,491,342
 deg.unplaced                 747        2002       39964      4365       3509       4819       3261181
 deg.unplaced.0cvg            734        2002       39964      4358       3507       4811       3,199,221
 ct_deg.placed                53646      65         1160130    48650      26840      98428      2,609,925,446*
UMD2.0-UMD2.6.1 gaps>1K summary
 id       count      min        max        median     sum
 1        775        999        64953      1683       2884541
 ...
 30       2174       999        167808     1528       7739170
 total    14936      999        167808     1638       53,526,388
- align UMD2.6 Chr1..30 to UMD2.0 Chr1..30 0cvg regions
                elem       min        max        mean       med        n50        sum
 all            14936      1000       167809     3584       1639       6465       53541324
 aligned        11091      1000       167809     4010       1692       7977       44475715
 not_aligned    3845       1000       50546      2357       1529       2735       9065609
Files: UMD2.0 regions not covered by UMD2.6.1 (chr aligned to itself)
 /scratch1/bos_taurus/Assembly/2009_0312_CA/synteny/agp_markers.noVariants_nucmer.noVariants/nucmer_UMD2.0/Chr.0cvg.fa
 /scratch1/bos_taurus/Assembly/2009_0312_CA/synteny/agp_markers.noVariants_nucmer.noVariants/nucmer_UMD2.0/Chr.0cvg.posmap
 /scratch1/bos_taurus/Assembly/2009_0312_CA/synteny/agp_markers.noVariants_nucmer.noVariants/nucmer_UMD2.0/Chr.0cvg.summary
- UMD2.6.1 vs UMD2.0 Chr1..30
- nucmer -mum -l 50 -c 250
- max gap: Chr8:60134434..60228381=93948
- gaps>10K: 191 , 4.2M total, 4.19M aligned without "-mum"
             elem       min        max        mean       med        n50        sum
 gaps>10K    191        10077      93948      22106      15599      26551      4222342
- realign all scaff without using nucmer "-mum" option: all gaps >10K align to large scaffolds !!!
UMD2.6.1 noVariants, add deg & UMD2.0 alignments
--Dpuiu 09:44, 22 June 2009 (EDT)
                          elem       min        max        mean       med        n50        sum
 scf                      39844      316        34167202   68352      1361       7451988    2723419938
 scf.variants             29436      723        51828      1714       1298       1514       50461989
 scf.noVariants           10580      316        34167202   252690     1957       7740810    2673467579
 scf.noVariants.ignore    4079       316        4985       1852       1541       1842       7558196      # less than 5K and placed through alignments inside a larger scaffold
 scf.noVariants.noIgnore  6506       385        34167202   409763     6673       7740810    2665923763
 ctg.variants             29967      723        51828      1641       1298       1475       49178546
 deg.variants             4654       331        8039       1117       989        1024       5200553

 scf_deg.placed           4933       385        34167202   540139     11548      7740810    2664507818
 scf_deg.markers          2003       1020       34167202   1302992    30456      8170786    2609894380
 scf_deg.alignHS          1818       385        3278163    27070      12322      47406      49214215
 scf_deg.alignUMD2.0      1112       1002       73626      4855       2948       6898       5399223
 scf.placed               4044       385        34167202   657786     15617      7740810    2660090134
 scf.markers              1825       1020       34167202   1429490    37672      8170786    2608820353
 scf.alignHS              1587       385        3278163    30006      13998      50604      47620297
 scf.alignUMD2.0          632        1002       73626      5774       5137       9046       3649484
 deg.placed               889        2002       39964      4969       3771       6165       4417684
 deg.markers              178        2007       30168      6033       4336       7350       1074027
 deg.alignHS              231        2049       39964      6900       6289       7337       1593918
 deg.alignUMD2.0          480        2002       15992      3645       2767       3842       1749739
 ctg_deg.placed           54129      65         1160130    48371      26493      98075      2,618,296,162*
 ctg.placed               53240      65         1160130    49096      27309      98285      2,613,878,478
 deg.placed               889        2002       39964      4969       3771       6165       4,417,684
HS-BT Synteny map
- Trust scaffolds with 4+ markers
- Scaffolds with 3- markers must have a close neighbor that agrees with them
 same HS & BT Chr
 maxCount=2                     # at most 2 scaffolds away
 minMarkers=4
 minRatio=0.66 & maxRatio=1.5   # distance ratio; maxDistance not used !!!
 => 226 synteny regions !!!
 join2.pl nucmer_*lsf markers_scf.*lsf | \
        ~/bin/filterMarkers.pl -minMarkers 4 | \
        ~/bin/getSyntenyBlock.pl | \
        grep -v ^# | grep -v ^$ | ~/bin/flipSummary.pl | sort -nk2 -nk5 | ~/bin/tab2tab.pl | \
        perl -ane  'print $P[13]," ",$F[13],"\n" if($F[13]-$P[13]==1); print $F[13]," ",$P[13],"\n" if($P[13]-$F[13]==1); @P=@F;' | sort -u -n | \
        ~/bin/mergeMap.pl >! hs-bt.map.tmp
 join2.pl nucmer_*lsf markers_scf.*lsf | \
        ~/bin/filterMarkers.pl -minMarkers 4 | \
        ~/bin/getSyntenyBlock.pl -map  hs-bt.map.tmp | \
         ~/bin/tab2tab.pl | grep '#' | sed 's/#//' >  hs-bt.map
 
 ~/bin/map-draw.pl -refLen hs.infoseq -qryLen bt.infoseq hs-bt.map >! hs-bt.png
Problems:
 7180002041220: BT.Chr2 ok
 7180002041025: BT.Chr4
 7180002041222: BT.Chr7
 7180002041228: BT.Chr8
 7180002040195: BT.Chr8; HS.Ch23 del !!!
 7180002041001: BT.Chr8; HS.Ch23 del !!!
 7180002040851: BT.Chr30; HS.Ch3 -> HS.Ch23
 7180002041008: BT.Chr30; HS.Ch7 -> HS.Ch23
- hs-bt.png Map picture
- Map:
 #HS-ref begin end len HS-clust BT-ref begin end BT-clust #scf
 01  870247     12637274   11767027  16  46893147   58621226   11728079  -1    5
 01  16674545   30031540   13356995  2   126668764  139027145  12358381  -2    4
 01  32016003   68894564   36878561  3   83066354   120874874  37808520  -3    10
 01  68907947   122485896  53577949  3   23819464   83221225   59401761  -4    6
 01  143725713  143737025  11312     3   25097798   25109110   11312     5     1
 ...
 23  148835146  151656716  2821570   30  32889252   35700335   2811083   -276  3
 23  152157769  154641974  2484205   30  39322867   42104054   2781187   -277  2
 23  154387850  154403684  15834     30  39353299   39369133   15834     278   1
Files:
 /scratch1/bos_taurus/Assembly/2009_0312_CA/synteny/map.4/hs-bt.map
 /scratch1/bos_taurus/Assembly/2009_0312_CA/synteny/map.4/hs-bt.png
Overlaps
Cases:
 1. CONTAINED scaffolds (clear variants)
 2. single BEGIN/END between 2 scaffolds: 2 scaffolds could be merged
 3. scaffold closing a sequence: 3 scaffolds could be merged
 4. multiple BEGIN/END/CONTAIN* between 2 scaffold contigs
1: CONTAINED ~ 4731 cases
cat nucmer_scf.ovl/all-all.contained.ids marker_scf.ovl/all-all.contained.ids | sort -u > all-all.contained.ids
- Summary
                      elem       <2Kbp      2..10Kbp   >10Kbp     min        max        mean       med        n50        sum
 variants(all)        4731       3503       1085       143        723        75436      2513       1429       3469       11891882
 variants(1+markers)  541        311        170        60         1001       75421      4466       1690       9947       2416312
- Example: longest contained scaffold 75Kbp
 #scaffold alignments
 117893 126367 | 8520 1 | 8475 8520 | 94.06 | 562884 75576 | 1.51 11.27 | 7180002040891 7180002040646
 286811 289412 | 72974 75576 | 2602 2603 | 93.29 | 562884 75576 | 0.46 3.44 | 7180002040891 7180002040646
 421477 421977 | 9727 10223 | 501 497 | 95.41 | 562884 75576 | 0.09 0.66 | 7180002040891 7180002040646
 171970 180765 | 73605 64798 | 8796 8808 | 96.74 | 274750 75576 | 3.20 11.65 | 7180002040912 7180002040646
 180765 195989 | 64652 49398 | 15225 15255 | 96.05 | 274750 75576 | 5.54 20.18 | 7180002040912 7180002040646
 196223 203368 | 49244 42105 | 7146 7140 | 96.81 | 274750 75576 | 2.60 9.45 | 7180002040912 7180002040646
 205935 208456 | 38652 36130 | 2522 2523 | 96.12 | 274750 75576 | 0.92 3.34 | 7180002040912 7180002040646
 207096 210496 | 42082 38673 | 3401 3410 | 96.07 | 274750 75576 | 1.24 4.51 | 7180002040912 7180002040646
 209529 217353 | 36109 28272 | 7825 7838 | 95.50 | 274750 75576 | 2.85 10.37 | 7180002040912 7180002040646
 213859 219155 | 25520 20220 | 5297 5301 | 95.80 | 274750 75576 | 1.93 7.01 | 7180002040912 7180002040646
 221183 227630 | 20236 13870 | 6448 6367 | 93.63 | 274750 75576 | 2.35 8.42 | 7180002040912 7180002040646
 227923 228686 | 13884 13126 | 764 759 | 95.16 | 274750 75576 | 0.28 1.00 | 7180002040912 7180002040646
 228084 231413 | 13104 9760 | 3330 3345 | 94.10 | 274750 75576 | 1.21 4.43 | 7180002040912 7180002040646
 #contig alignments
 1929 10403 | 8520 1 | 8475 8520 | 94.06 | 26592 8520 | 31.87 100.00 | 7180002040891.4.10 7180002040646.1.8 [CONTAINS]
 9084 11685 | 30871 33473 | 2602 2603 | 93.29 | 276174 33473 | 0.94 7.78 | 7180002040891.8.10 7180002040646.8.8
 143750 144250 | 1187 1683 | 501 497 | 95.41 | 276174 4564 | 0.18 10.89 | 7180002040891.8.10 7180002040646.2.8
 163797 172592 | 31502 22695 | 8796 8808 | 96.74 | 233551 33473 | 3.77 26.31 | 7180002040912.3.5 7180002040646.8.8
 172592 187816 | 22549 7295 | 15225 15255 | 96.05 | 233551 33473 | 6.52 45.57 | 7180002040912.3.5 7180002040646.8.8
 188050 195195 | 7141 2 | 7146 7140 | 96.81 | 233551 33473 | 3.06 21.33 | 7180002040912.3.5 7180002040646.8.8
 197762 200283 | 2523 1 | 2522 2523 | 96.12 | 233551 2523 | 1.08 100.00 | 7180002040912.3.5 7180002040646.6.8 [CONTAINS]
 198923 202323 | 3410 1 | 3401 3410 | 96.07 | 233551 3411 | 1.46 99.97 | 7180002040912.3.5 7180002040646.7.8 [CONTAINS]
 201356 209180 | 7838 1 | 7825 7838 | 95.50 | 233551 7838 | 3.35 100.00 | 7180002040912.3.5 7180002040646.5.8 [CONTAINS]
 205686 210982 | 12396 7096 | 5297 5301 | 95.80 | 233551 12396 | 2.27 42.76 | 7180002040912.3.5 7180002040646.3.8
 213010 219457 | 7112 746 | 6448 6367 | 93.63 | 233551 12396 | 2.76 51.36 | 7180002040912.3.5 7180002040646.3.8
 219750 220513 | 760 2 | 764 759 | 95.16 | 233551 12396 | 0.33 6.12 | 7180002040912.3.5 7180002040646.3.8
 219911 223240 | 4564 1220 | 3330 3345 | 94.10 | 233551 4564 | 1.43 73.29 | 7180002040912.3.5 7180002040646.2.8 [CONTAINS]
 #marker & alignment summary
 #id BT-ref #markers slope begin end len HS-ref #align slope begin end len
 7180002040891 4 1 0 13956053 14518937 562884 22 14 -1.5687 20838730 21401614 562884
 7180002040163 8 1 -1 38079677 38178351 98674 22 3 1.0264 21133894 21232568 98674 update:dir:7180002040646
 7180002040646 8 . 1 38088797 38164373 75576 22 4 -0.9833 21143014 21218590 75576 assign:Chr:7180002040163,7180002040163
2. ~ 349 cases (-3 cases 3.)
cat all-all.begin.ids all-all.end.ids | sort -u | wc -l
3. ~ 3 cases
 intersect.pl all-all.begin.ids all-all.end.ids | perl -ane 'print $_ if($F[1]=~/1.1$/);'
 7180002032818.1.1 23259
 7180002036943.1.1 28355
 7180002040409.1.1 25219 # merges 2 scaff
 1 2811 | 2807 1 | 2811 2807 | 99.11 | 23259 5134 | 12.09 54.67 | 7180002032818.1.1 7180002032811.1.2[BEGIN] [BEGIN]
 22101 23259 | 7238 6079 | 1159 1160 | 99.66 | 23259 7238 | 4.98 16.03 | 7180002032818.1.1 7180002032811.2.2[END] [END]
 1 2811 | 2807 1 | 2811 2807 | 99.11 | 23259 5134 | 12.09 54.67 | 7180002032818.1.1 7180002032811.1.2[BEGIN] [BEGIN]
 22101 23259 | 7238 6079 | 1159 1160 | 99.66 | 23259 7238 | 4.98 16.03 | 7180002032818.1.1 7180002032811.2.2[END] [END]
 1 2025 | 2066 4090 | 2025 2025 | 99.80 | 25219 4090 | 8.03 49.51 | 7180002040409.1.1 7180002033541.1.1[BEGIN] [END]
 23394 25219 | 5761 3930 | 1826 1832 | 98.42 | 25219 5761 | 7.24 31.80 | 7180002040409.1.1 7180002033538.1.1[END] [END]

 cd nucmer_scf.ovl
 intersect.pl all-all.begin.ids all-all.end.ids
 7180002032818 23259
 7180002036943 28355
 7180002040409 25219
4. ~10 cases
 cd nucmer_ctg.ovl/translated/
 cat all-all.annotated.coords | egrep 'BEGIN|END|CONTAIN' | p 'next if($F[6]<5000); next if($F[7]<5000); print $_;' | p '$F[17]=~/^([^.]+)(.+)/; $F[17]=$1 ; $F[18]=~/^([^.]+)/; $F[18]=$1; print $F[17],"\t",$F[18],"\n";' | count.pl -m 2
 7180002041235 brk002041306a 3
 7180002038888 7180002034914 2
 7180002041059 7180002040095 2
 7180002040879 7180002040934 2
 cnt0002041350 7180002040907 2
 7180002041341 7180002040894 2
 7180002041015 cnt0002040938 2
 7180002039401 7180002040397 2
 7180002037358 7180002031425 2
 7180002040915 7180002039470 2

 cat all-all.annotated.coords | egrep 'BEGIN|END|CONTAIN' | grep ...
 12363 21143 | 1 8837 | 8781 8837 | 93.43 | 21143 50443 | 41.53 17.52 | 7180002041235.23.33 brk002041306a.5.7[END] [BEGIN]
 1 9272 | 5591 14944 | 9272 9354 | 94.47 | 9272 50443 | 100.00 18.54 | 7180002041235.25.33 brk002041306a.5.7[CONTAINED]
 71049 97689 | 31941 5220 | 26641 26722 | 96.71 | 97689 31941 | 27.27 83.66 | 7180002041235.29.33 brk002041306a.6.7[END] [END]
 57 3896 | 14743 10903 | 3840 3841 | 95.94 | 3896 22667 | 98.56 16.95 | 7180002041235.32.33 brk002041306a.3.7[CONTAINED]
 1 8942 | 1 8947 | 8942 8947 | 99.49 | 20792 8947 | 43.01 100.00 | 7180002038888.1.1 7180002034914.1.2[CONTAINS]
 9670 20792 | 1 11091 | 11123 11091 | 99.27 | 20792 11091 | 53.50 100.00 | 7180002038888.1.1 7180002034914.2.2[CONTAINS]
 1 8705 | 3932 12610 | 8705 8679 | 96.17 | 8705 30654 | 100.00 28.31 | 7180002041059.1.26 7180002040095.1.1[CONTAINED]
 1 4777 | 12989 17697 | 4777 4709 | 95.32 | 4777 30654 | 100.00 15.36 | 7180002041059.2.26 7180002040095.1.1[CONTAINED]
 1 6588 | 17999 24607 | 6588 6609 | 96.15 | 6591 30654 | 99.95 21.56 | 7180002041059.3.26 7180002040095.1.1[CONTAINED]
 630 16709 | 16024 1 | 16080 16024 | 97.09 | 16901 16024 | 95.14 100.00 | 7180002040879.2.14 7180002040934.2.87[CONTAINS]
 1 5876 | 5882 1 | 5876 5882 | 98.51 | 39657 13643 | 14.82 43.11 | 7180002040879.3.14 7180002040934.1.87[BEGIN] [BEGIN]
Scaffold links
Try to identify scaffolds that fit the following criteria:
- have no markers and no alignments to HS
- linked by 2+ links to a single scaffold that has markers/alignments to HS
- is not a variant (2Kbp+ of unique sequence)
                                              elem       min        max        mean       med        n50        sum
 linked                                       6169       316        88326      2408       1480       2709       14,860,479
 linked(2+mates)                              2109       316        44120      2595       1558       3092       5,474,510
 linked(2+mates to a single scf)              2057       316        44120      2516       1547       2850       5,177,182
 linked(2+mates to a single scf, no variant)  112        316        43782      7278       1987       23338      815,223   # 46>2Kbp; 25>10Kbp
=> 112 scaffolds (0.81Mbp) could be added to the chromosomes !!!
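A Python sketch of the selection criteria above, under one reading of "linked by 2+ links to a single scaffold": exactly one anchored scaffold (one with markers or HS alignments) has at least two mate links to the candidate. The input layout and names are hypothetical, and the variant test is assumed to have been done elsewhere.

```python
from collections import Counter


def recruitable(links, anchored, variants, min_links=2):
    """Map each unanchored, non-variant scaffold to its single anchored partner.

    links:    iterable of (scf_a, scf_b) mate links between different scaffolds
    anchored: set of scaffold ids that have markers or alignments to HS
    variants: set of scaffold ids already flagged as haplotype variants
    """
    counts = Counter()
    for a, b in links:
        # Count the link from the unanchored side, whichever order it was given in.
        for cand, partner in ((a, b), (b, a)):
            if cand not in anchored and partner in anchored:
                counts[(cand, partner)] += 1

    partners = {}
    for (cand, partner), n in counts.items():
        if n >= min_links:
            partners.setdefault(cand, set()).add(partner)

    return {
        cand: next(iter(found))
        for cand, found in partners.items()
        if len(found) == 1 and cand not in variants
    }


links = [
    ("s1", "A"), ("s1", "A"),                            # 2 links to A: kept
    ("s2", "A"),                                         # only 1 link: dropped
    ("s3", "A"), ("s3", "A"), ("s3", "B"), ("s3", "B"),  # 2 partners: dropped
    ("s4", "A"), ("A", "s4"),                            # variant: dropped
]
picked = recruitable(links, anchored={"A", "B"}, variants={"s4"})
```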
BT alignments
- UMD2.6.1 vs UMD2.6.1
- Marker scaffolds against themselves (2642 total scaffolds) : nucmer -maxmatch -l 40 -c 2500 -g 250
- Mapped scaffolds without markers against marker scaffolds: nucmer -maxmatch -l 40 -c 250
- Mapped scaffolds without markers against themselves: nucmer -maxmatch -l 40 -c 250
File location:
/scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_Chr/
Other issues
Chr10 gene duplication
- LOC100298457
- See /nfshomes/dpuiu/Readmes/bos_taurus.runCA.13.txt
Scf breaks (Mike Roberts)
 ::::::::::::::
 scf.break.ids
 ::::::::::::::
 #ctg           scf            beg      end      dir
 7180002015438  7180002038435  0        32822    f  # ok: 4 markers from Chr18
 7180002021916  7180002040442  0        24559    f  # ok: 1 marker from Chr10
 7180001725791  7180002040808  466123   562461   f  # ok: 25 marker from Chr1, 1 marker from Chr2
 7180001727899  7180002040844  8236574  8479429  f  # also found by us; 3 Chr30 markers in the middle
 7180001854650  7180002041103  5230302  5294734  f  # also found by us; 4 Chr15 markers at 3'
 7180002003578  7180002041216  300250   392069   f  # also found by us; 7 Chr5 markers at 5'
 7180002020010  7180002041293  4482477  4600045  f  # ok
 7180001722390  7180002041353  0        51674    f  # ok
 ::::::::::::::
 scf.excised.ids
 ::::::::::::::
 #ctg           scf            beg      end      dir
 7180002008629  7180002034664  82740    105723   f  # no markers
 7180001786240  7180002040913  2076746  2112375  f  # ok
 7180001787022  7180002040927  1075352  1109350  f  # ok
 7180001787352  7180002040931  4956099  4967823  f  # ok
 7180001789387  7180002040981  6641924  6671967  f  # ok
 7180001789575  7180002040984  3095587  3154302  f  # ok
 7180002003269  7180002041209  6350890  6470971  f  # ok
 7180002025281  7180002041322  989366   1227636  f  # ok
 7180002026741  7180002041328  343726   414329   f  # ok
 7180002029433  7180002041343  834493   888144   f  # ok
 7180001726451  7180002041383  7893629  7901645  f  # ok
LOC100298457 (duplicate gene on Chr10)
Problem:
Is the LOC100298457 gene (cow Chr10) a variant of the MFSD3 gene (cow Chr14)?
UMD2.0
 Chr10  15186759  15189292  833  W  7180003260677  1  2534    +  # gene LOC100298457
 Chr14  1839355   1958202   533  W  7180003326040  1  118848  +  # gene MFSD3
- LOC100298457 aligns to 2 UMD2.6 scaffolds:
 scf7180002041112 (1.56Mbp; 34 contigs)
 scf7180002033841 (2.44Kbp; 1 contig)
- scf7180002041112 contains 39 bos taurus Chr14 markers and has 206 alignments to human chromosome 8
- scf7180002033841 contains no bos taurus markers and has 1 alignment to human chromosome 8.
- scf7180002033841 5' aligns to scf7180002041112 (1.56Mbp; 34 contigs, cow Chr14) and to a human Chr8 region that maps to cow Chr14
- scf7180002033841 3' aligns to scf7180002041157 (6.15Mbp; 99 contigs, cow Chr10)
- scf7180002033841 & scf7180002041157 are linked by 2 mate pairs (inserts from a 3kbp BCM shotgun library)
- LOC100298457 vs UMD2.6 scf:
  [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS]
 ===============================================================================================================================
     1       88  |  1138313  1138226  |       88       88  |   100.00  |     1344  1561688  |     6.55     0.01  | LOC100298457       scf7180002041112
    80      192  |  1138150  1138038  |      113      113  |   100.00  |     1344  1561688  |     8.41     0.01  | LOC100298457       scf7180002041112
   190      316  |  1137974  1137848  |      127      127  |   100.00  |     1344  1561688  |     9.45     0.01  | LOC100298457       scf7180002041112
   315      464  |  1137770  1137621  |      150      150  |   100.00  |     1344  1561688  |    11.16     0.01  | LOC100298457       scf7180002041112
   455     1344  |  1137322  1136432  |      890      891  |    99.78  |     1344  1561688  |    66.22     0.06  | LOC100298457       scf7180002041112
     1     1344  |     1349        6  |     1344     1344  |   100.00  |     1344     2445  |   100.00    54.97  | LOC100298457       scf7180002033841
- UMD2.6 scf* vs scf7180002033841:
  [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS]
 ===============================================================================================================================
 1136427  1137322  |        1      895  |      896      895  |    99.78  |  1561688     2445  |     0.06    36.61  | scf7180002041112   scf7180002033841
 4060495  4061426  |     2445     1521  |      932      925  |    97.76  |  6157473     2445  |     0.02    37.83  | scf7180002041157   scf7180002033841
Bos taurus marker summary:
 #id               BT-ref  #markers  slope    begin     end       len
 scf7180002041112  Chr14   39        -0.8518  3377707   4939395   1561688
 scf7180002041157  Chr10   255       1.0035   12831593  18989066  6157473
 scf7180002033841  ?       0
Homo sapiens alignment summary:
 #id               HS-ref  #align  slope   begin      end        len
 scf7180002041112  Chr8    206     0.6701  144381103  145942791  1561688
 scf7180002041157  Chr15   1667    0.987   62245836   68403309   6157473
 scf7180002033841  Chr8    1       1       145705320  145707765  2445

 [S1] [E1] | [S2] [E2] | [LEN 1] [LEN 2] | [% IDY] | [LEN R] [LEN Q] | [COV R] [COV Q] | [TAGS]
 ===============================================================================================================================
 143996007 143996305 | 74278 74572 | 299 295 | 82.45 | 146274826 1561688 | 0.00 0.02 | 1 1 NC_000008 7180002041112
 ...
 146249212 146249657 | 1268887 1268446 | 446 442 | 80.89 | 146274826 1561688 | 0.00 0.03 | 1 -1 NC_000008 7180002041112
 63013167 63013429 | 887697 887966 | 263 270 | 78.15 | 100338915 6157473 | 0.00 0.00 | 1 1 NC_000015 7180002041157
 ...
 68369047 68369417 | 6154757 6155112 | 371 356 | 83.06 | 100338915 6157473 | 0.00 0.01 | 1 1 NC_000015 7180002041157
 145705522 145706206 | 202 886 | 685 685 | 81.28 | 146274826 2445 | 0.00 28.02 | 1 1 NC_000008 7180002033841
scf7180002033841: 6 reads
 read        scf            begin  end   dir
 1120017508  7180002033841  0      1018  f
 1120020725  7180002033841  227    1109  f
 1120020722  7180002033841  724    1720  r
 1120020726  7180002033841  1006   2073  f  diffScaffold
 1120017511  7180002033841  1303   2175  r
 1120020728  7180002033841  1370   2445  f  diffScaffold
scf7180002033841: 4 mates
 read1       read2       scf1           scf2
 1120017508  1120020722  7180002033841  7180002033841
 1120017511  1120020725  7180002033841  7180002033841
 1120020726  1120017512  7180002033841  7180002041157  diffScaffold
 1120020728  1120017514  7180002033841  7180002041157  diffScaffold
scf7180002041157: 69095 reads; frg 1120017514 & 1120017512 positions close to the region aligned to scf7180002033841 (4060495-4061426)
 read        scf            begin    end      dir
 607312594   7180002041157  0        866      f
 ...
 1120017512  7180002041157  4059332  4060396  f  diffScaffold
 1120017514  7180002041157  4058851  4059931  f  diffScaffold
 ....
 583956204   7180002041157  6157144  6157228  r
scf7180002041157: 99 contigs
 count  ctg               scf               begin    end      dir
 1      ctg7180001926175  scf7180002041157  0        81467    f
 ..
 64     ctg7180001926231  scf7180002041157  4039922  4046680  f  # 6758bp ctg; 800bp gap following the contig
 65     ctg7180001926232  scf7180002041157  4047480  4193693  f  # 146213bp ctg ; 20bp gap following the contig
 66     ctg7180001926233  scf7180002041157  4193713  4208047  f  # 14334bp ctg
 ..
 99     ctg7180001926260  scf7180002041157  6152691  6157473  f
frg 1120017514 & 1120017512 positions on ctg7180001926232
 1120017514  ctg7180001926232  11371  12451  f  # reads are 11kbp(>3Kb lib mean) inside the contig
 1120017512  ctg7180001926232  11852  12916  f  # reads are 11kbp(>3Kb lib mean) inside the contig
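The check behind the comment above can be sketched in Python: a mate link that points out of a contig is only plausible if the read sits within roughly one insert length of a contig end. The tolerance factor of 2 is an assumption (the library standard deviation is not recorded in these notes), read direction is ignored for simplicity, and the function name is hypothetical.

```python
def can_link_out(read_begin, read_end, ctg_len, insert_mean, tolerance=2.0):
    """True if the read is close enough to a contig end for its mate to fall outside."""
    dist_to_nearest_end = min(read_begin, ctg_len - read_end)
    return dist_to_nearest_end <= tolerance * insert_mean


CTG_LEN = 4193693 - 4047480   # ctg7180001926232: 146213 bp (scaffold coordinates above)
INSERT_MEAN = 3000            # 3 Kbp BCM shotgun library

reads = {
    "1120017514": (11371, 12451),
    "1120017512": (11852, 12916),
}
plausible = {
    name: can_link_out(begin, end, CTG_LEN, INSERT_MEAN)
    for name, (begin, end) in reads.items()
}
```

Both reads lie more than 11 Kbp from the nearest contig end, so under these assumptions neither link to scf7180002033841 is consistent with a 3 Kbp insert.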
Missing genes
- 5 genes not found by Liliana using ESTaligner but found using gmap (%id<95) : they were on the haplotype variants
Chr27 -> Chr21
- The centromeric end of BTA27 is actually placed on BTA21 in UMD2.6 (which version?)
Chromosomes
synteny.redo2
- summary.txt : 8122 scf + 6322 deg => 14444 seqs (placed using markers or synteny to HS)
- from summary.txt removed:
 58  scf.questionable.ids (44 placed, 14 linked)
 100 ctg.questionable.ids
 931 deg.questionable.ids
UMD_2.6.d_g
 .                    elem       <2000      >2000      min        max        mean       med        n50        sum
 haplotypes_contigs   3828       2701       1127       471        123243     2503       1580       2590       9,584,985
UMD_2.6.a_g_070109
Gaps: 65900
- all 65900 gaps are "fragment yes"
UMD_2.6.a_g_070509
- Combine UMD_2.6.a_g_070109 (Guillaume's assembly) and UMD2.6.1 (Daniela's)
UMD_2.6.a_g_070709
- Remove from UMD_2.6.a_g_070509
- ~ 21 ChrY ctg & ~ 39 ChrY deg
- ~ 4206 haplotype variants (6.63Mbp) within 1K from one another
 
 .                                    elem       <2000      >2000      min        max        mean       med        n50        sum
 ctg_deg.variants.placed              6654       5504       1150       263        42158      1761       1189       1772       11719653
 ctg_deg.variants.placed.sameChr      5374       4455       919        263        42158      1748       1152       1775       9393815
 ctg_deg.variants.placed.within_100K  4790       4100       690        263        42158      1651       1128       1605       7911160
 ctg_deg.variants.placed.within_1K*   4206       3679       527        263        42158      1577       1107       1508       6633118
Summary:
                             ctg+deg    <2000      >=2000     min        max        mean       med        n50        sum
 Chr1..29,X                  72197      20763      51434      65         1160130    36536      13055      97328      2,637,809,286
 ChrU                        3752       2587       1165       362        179692     3284       1447       6427       12,324,356
 ChrY-contigs                315        266        49         224        26490      2249       974        6679       708,535
 contigs.haplotype-variants  40198      36720      3478       263        51828      1460       1203       1361       58,698,457
 deg.unplaced.less_2K        224945     224945     0          65         1996       972        983        990        218,847,978
Issues
- 490 scf don't have all ctgs placed (865 ctgs)
- 699 reliable contigs (3.25Mbp) unplaced
 difference.pl ctg.reliable.ids UMD_2.6.a_g_070709/Chr.posmap | getSummary.pl -i 2 -z 2000
 elem       <2000      >2000      min        max        mean       med        n50        sum
 699        405        294        362        123243     4660       1759       14809      3257635
- Rearrangements UMD2.0 vs UMD_2.6.a_g_070709 : ~ 25 ctgs>50K ; ~ 12 scf
- AGP file format: "fragment yes" should be preserved even if gap type=U
 grep -A 2 7180001925241 UMD_2.6.a_g_070109/Chr.agp
 Chr30 131686408 131762880 14157 W 7180001925241 1 76473 +
 Chr30 131762881 131762980 14158 U 100 fragment yes
 Chr30 131762981 131796167 14159 W 7180001925242 1 33187 +

 grep -A 2 7180001925241 ../UMD_2.6.a_g_070709/Chr.agp
 ChrX 133855417 133931889 15581 W 7180001925241 1 76473 +
 ChrX 133931890 133931989 15582 U 100 contig no
 ChrX 133931990 133965176 15583 W 7180001925242 1 33187 +
- Liliana found 18 ctg + deg that have genes
Fixed case B (7180001932648,7180001925237)
- Bob (Missouri)
- scf7180002041216 (43 ctgs, 1.44Mbp) should be split : Chr15 (36 ctgs) , Chr5 (first 7 ctgs)
 ctg7180002003576,ctg7180002003577 moved from Chr9=>Chr5
 => Chr5
    1  7180002003574   7180002041216   0       19886   f # already on Chr5
    2  7180001694221   7180002041216   19906   20911   r
    3  7180001978231   7180002041216   20931   52725   r
    4  7180001946738   7180002041216   53559   55402   r
    5  7180002003575   7180002041216   55422   92145   f
    6  7180002003576   7180002041216   92165   200007  f
    7  7180002003577   7180002041216   200456  300048  f
- scf7180002041153 (48 ctgs, 2.36Mbp): assignment to Chr11 seems correct, Chr6(0)
UMD_2.6.a_g_071709 -> UMD_Freeze2.99
Changes vs UMD_2.6.a_g_070709:
- 17 contigs/degenerates recruited by Liliana based on mRNA alignments got placed on chromosomes
- Chr9,15 => Chr5 correction : a scaffold got broken between 2 chromosomes and several contigs got moved to Chr5
- 697 unplaced contigs from placed scaffolds got placed as well
Contig placement summary:
 #ctg+deg                    <2Kbp      >=2Kbp     min        max        mean       med        n50        sum
 ==================================================================================================
 Chr1..29,X                  72911      21180      51731      65         1160130    36223      12706      97232      2641097363
 ChrU                        3365       2448       917        224        179692     2898       1348       5399       9754701
 contigs.haplotype-variants  40198      36720      3478       263        51828      1460       1203       1361       58698457
 deg.unplaced.less_2K        224933     224933     0          65         1996       972        983        990        218837572

 Chr1..29,X(new)             714        417        297        177        123243     4605       1741       14809      3288596

 ChrY-contigs                315        266        49         224        26490      2249       974        6679       708535
 ChrY-contigs.SHOTGUN_ONLY   144        140        4          804        4224       993        882        888        143047
 ===================================================================================================
Comments:
- "Chr1..29,X", ChrU, contigs.haplotype-variants, deg.unplaced.less_2K are mutually exclusive sets
- "Chr1..29,X(new)" are contigs which were not placed in UMD_2.6.a_g_071709.
- 17 of them were added by Liliana (1 failed)
- the rest are reliable contigs left unplaced by Aleksey/Guillaume program
 
- ChrY-contigs.SHOTGUN_ONLY are a subset of ChrY-contigs which contain only SHOTGUN reads
- ChrY-contigs are part of ChrU
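For reference, the columns of the placement summary can be recomputed from a list of contig lengths roughly as below. This is a sketch, not the script that produced the table: the function name `length_summary` is made up, n50 uses the usual definition (length of the contig at which the running sum of lengths, largest first, reaches half the total), and the mean is truncated to an integer because that is what the table values appear to be.

```python
from statistics import median


def length_summary(lengths, cutoff=2000):
    """Summarize a non-empty list of sequence lengths (bp)."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)

    # n50: walk from the longest sequence down until half the total is covered.
    running = 0
    n50 = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            n50 = length
            break

    return {
        "count": len(ordered),
        "lt_cutoff": sum(1 for x in ordered if x < cutoff),
        "ge_cutoff": sum(1 for x in ordered if x >= cutoff),
        "min": ordered[-1],
        "max": ordered[0],
        "mean": total // len(ordered),   # truncated, assumption based on the table
        "med": median(ordered),
        "n50": n50,
        "sum": total,
    }


toy = [500, 1500, 2000, 3000, 8000]   # made-up lengths, not assembly data
print(length_summary(toy))
```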
Files (walnut):
 /scratch1/bos_taurus/Assembly/2009_0312_CA/scf_placements/UMD_2.6.a_g_071709/                 # FASTA & AGP format
 /scratch1/bos_taurus/Assembly/2009_0312_CA/scf_placements/UMD_2.6.a_g_071709/nucmer_UMD2.0/   # nucmer alignments to UMD2.0
Files (freeze):
 /fs/szasmg3/bos_taurus/UMD_Freeze2.99/              # FASTA & AGP format
 /fs/szasmg3/bos_taurus/UMD_Freeze2.99/ncbi_files    # SEQUIN format
Ftp:
ftp://ftp.cbcb.umd.edu/pub/salzberg/Bos_taurus_2.99/ -> /fs/ftp-cbcb/pub/salzberg/Bos_taurus_2.99/
UMD_2.6.a_g_072109 -> UMD_Freeze3.0
Changes vs UMD_2.6.a_g_070709:
- Delete 97 contaminated sequences found by NCBI (all except the primates) : http://www.ncbi.nlm.nih.gov/projects/WGS/screens/DAAA02_071709/
- Delete 441 haplotype variants found by Guillaume
- Trim 54 partial contaminants (contaminants were on the ends)
- Trim 7 terminal N's
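The terminal-N trimming step can be sketched as below. The helper name `trim_terminal_n` is hypothetical and this is not the script actually used; it returns the number of leading N's removed so that any coordinates tied to the sequence (e.g. AGP component begin/end) can be shifted accordingly. Internal N's are left alone.

```python
def trim_terminal_n(seq):
    """Strip N/n from both ends; return (leading_removed, trimmed_sequence)."""
    left = len(seq) - len(seq.lstrip("Nn"))   # how many N's were cut from the 5' end
    trimmed = seq.strip("Nn")
    return left, trimmed


print(trim_terminal_n("NNACGTNACGN"))
```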
Gaps: 75739
- all 27103 N gaps are "fragment yes"
- all 48636 U gaps are "contig no"
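The gap counts above (27103 + 48636 = 75739) can be re-derived from an AGP file by tallying gap rows on component type, gap type and linkage (columns 5, 7, 8). A minimal sketch, assuming whitespace-separated AGP rows; `gap_tally` is a made-up name and the N row in the sample is invented for illustration.

```python
from collections import Counter


def gap_tally(lines):
    """Count AGP gap rows by (component_type, gap_type, linkage)."""
    tally = Counter()
    for ln in lines:
        f = ln.split()
        if len(f) >= 8 and f[4] in ("N", "U"):
            tally[(f[4], f[6], f[7])] += 1
    return tally


sample_rows = [
    "ChrX 133855417 133931889 15581 W 7180001925241 1 76473 +",
    "ChrX 133931890 133931989 15582 U 100 contig no",
    "ChrT 1001 1020 2 N 20 fragment yes",   # invented row, illustration only
]
for key, n in sorted(gap_tally(sample_rows).items()):
    print(*key, n)
```

If every N gap is "fragment yes" and every U gap is "contig no", the tally has exactly two keys.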
Files (walnut):
/scratch1/bos_taurus/Assembly/2009_0312_CA/scf_placements/UMD_2.6.a_g_072109/ # FASTA & AGP format
Files (freeze):
 /fs/szasmg3/bos_taurus/UMD_Freeze3.0/              # FASTA & AGP format
 /fs/szasmg3/bos_taurus/UMD_Freeze3.0/ncbi_files    # SEQUIN format
Ftp:
ftp://ftp.cbcb.umd.edu/pub/data/assembly/Bos_taurus/Bos_taurus_UMD_3.0/ -> pub/data/assembly/Bos_taurus/Bos_taurus_UMD_3.0/
Issues:
- 7180001836672 941bp deg on Chr4 : aligns on all its length to the cow mitochondrion; placed based on human synteny
- mitochondrion screening was done only on contigs, not on degenerates
- Align all Chr*.fasta files to cow mitochondrion; show-coords -I 90 -L 600
     [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS]
 ===============================================================================================================================
     3603     4930  |        6     1333  |     1328     1328  |    90.89  |    16339     1333  |     8.13    99.62  | AY676873 ctg7180001759596 [CONTAINS]
    13284    14521  |     1214        1  |     1238     1214  |    97.90  |    16339     1216  |     7.58    99.84  | AY676873 deg7180001872458 [CONTAINS]
    15731    16339  |        1      608  |      609      608  |    99.18  |    16339      941  |     3.73    64.61  | AY676873 deg7180001836672 [END]
- Delete 2 degenerates
- Summary 3.a --Dpuiu 11:04, 5 August 2009 (EDT)
 .                           ctg+deg   <2Kbp  >=2Kbp   min      max   mean    med     n50         sum
 ====================================================================================================
 Chr1..29,X                    72479   20862   51617    65  1160130  36424  12941  103785  2639984487
 ChrU                           3285    2404     881   224   179692   2890   1338    5425     9496583
 Chr                           75764   23266   52498    65  1160130  34970  11207   96955  2649481070
 contigs.haplotype-variants    40611   36984    3627   263    97877   1476   1205    1372    59958728
 deg.unplaced.less_2K         224933  224933       0    65     1996    972    983     990   218837572
 ChrY-contigs                    314     266      48   224    26490   2210    973    6539      694140
 ChrY-contigs.SHOTGUN_ONLY       144     140       4   804     4224    993    882     888      143047
 ====================================================================================================
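The show-coords rows listed under Issues (mitochondrion screen, -I 90 -L 600) can be parsed and re-checked as sketched below. Function names are made up for illustration; the length cutoff is applied to [LEN 1] here, which is an assumption about how -L behaves. [COV Q] is recomputed as LEN 2 / LEN Q, which reproduces the values printed by show-coords.

```python
def parse_coords_row(line):
    """Parse one pipe-delimited show-coords row (-l -c style columns)."""
    cols = [c.split() for c in line.split("|")]
    len1, len2 = int(cols[2][0]), int(cols[2][1])
    len_r, len_q = int(cols[4][0]), int(cols[4][1])
    return {
        "idy": float(cols[3][0]),
        "len1": len1,
        "ref": cols[6][0],
        "qry": cols[6][1],
        "cov_r": round(100.0 * len1 / len_r, 2),   # % of the reference covered
        "cov_q": round(100.0 * len2 / len_q, 2),   # % of the query covered
    }


def passes(hit, min_idy=90.0, min_len=600):
    """Same thresholds as -I 90 -L 600 (length test on LEN 1 is an assumption)."""
    return hit["idy"] >= min_idy and hit["len1"] >= min_len


rows = [
    "3603 4930 | 6 1333 | 1328 1328 | 90.89 | 16339 1333 | 8.13 99.62 | AY676873 ctg7180001759596 [CONTAINS]",
    "13284 14521 | 1214 1 | 1238 1214 | 97.90 | 16339 1216 | 7.58 99.84 | AY676873 deg7180001872458 [CONTAINS]",
    "15731 16339 | 1 608 | 609 608 | 99.18 | 16339 941 | 3.73 64.61 | AY676873 deg7180001836672 [END]",
]
hits = [parse_coords_row(r) for r in rows]
for h in hits:
    print(h["qry"], h["idy"], h["cov_q"], passes(h))
```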
UMD_Freeze3.1
--Dpuiu 11:42, 19 November 2009 (EST)
- Only changed some of the gap specifications
- Original(CA):
 ctg: 90135
 deg: 251413
 scf: 39978
- UMD3.1 AGP:
 ctg: 60499
 deg: 15229
 scf(CA): 11458
 ctg(unoriented): 2118
 scf(placed chr): 3193
 scf(unplaced chr): 3285
Files:
/fs/szasmg3/bos_taurus/UMD_Freeze3.1
ToDo
- Align all UMD2.0 ChrU ctg/deg 10Kbp+ to our assembly; make sure everything aligns (Steven's suggestion)
/scratch1/bos_taurus/Assembly/2009_0312_CA/scf_placements/UMD_2.6.a_g_071709/nucmer_UMD2.0
- all ctgs: 113K, 244Mbp
- 10Kbp+ ctgs: 2561, 54Mbp
 
- 53,951,168 out of 54,971,011 bp covered (98% of the sequence)
- Most ctgs were added to ChrU
 
 join2.pl -i 4 UMD2.0.ChrU.10K.maxCvg.pair ../other/Chr.posmap | awk '{print $10,$2}' | ~/bin/sum2.pl | sort -nk3 -r
 Chr30   252     7,511,975
 Chr1    171     3263575
 Chr6    138     3043388
 Chr12   152     2676615
 ChrU    100     2600562
 ...
 Chr28   29      473523
 all     2651    54,960,233
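The per-chromosome listing above has the shape "key, count, sum", sorted by sum in descending order. A rough Python equivalent of that aggregation is sketched below; it only mirrors the output columns and is not a port of sum2.pl, whose actual behavior is inferred here. The function name and input pairs are made up.

```python
from collections import defaultdict


def count_and_sum(pairs):
    """Group (key, value) pairs; return (key, count, sum) rows, largest sum first."""
    acc = defaultdict(lambda: [0, 0])
    for key, value in pairs:
        acc[key][0] += 1
        acc[key][1] += value
    rows = [(k, n, s) for k, (n, s) in acc.items()]
    rows.sort(key=lambda r: r[2], reverse=True)
    return rows


# Invented (chromosome, contig length) pairs, illustration only.
pairs = [("Chr1", 12000), ("ChrU", 15000), ("Chr1", 30000), ("Chr6", 11000)]
for chrom, n, total in count_and_sum(pairs):
    print(chrom, n, total, sep="\t")
```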
1. Remove remain_haps
 .                          elem  <2000  >2000   min    max  mean   med   n50        sum
 remain_haps                 436    282    154   471  37860  2600  1693  2819  1,133,874
 remain_haps(Chr1..30,U)     408    264    144   471  37860  2636  1692  2883  1,075,616   # 77 from ChrU
 remain_haps(Chr1..30)       335    226    109   471  37860  2480  1661  2728    831,068
2. Remove/trim contaminants
3. Remove/trim N's