Bos taurus redo: Difference between revisions
| Line 3,623: | Line 3,623: | ||
|                           elem       min        max        mean       med        n50        sum |                           elem       min        max        mean       med        n50        sum | ||
|    all                    39844      316        34167202   68352      1361       7451988    2723419938 |    all                    39844      316        34167202   68352      1361       7451988    2723419938 | ||
|    1+mark                 2641       1000       34167202   994523     16673      8170786    2626536153 |    1+mark                 2641       1000       34167202   994523     16673      8170786    2626536153 # best marker alignment | ||
|    1+align                8860       385        34167202   301202     2257       7740810    2668658140 |    1+align                8860       385        34167202   301202     2257       7740810    2668658140 # alignments > 250bp | ||
|   1+align(new)           9880       385        34167202   270458     1996       7740810    2672129298 # alignments > 200bp (3.74M more than 250bp align) | |||
|    1+mark or  1+align     9669       385        34167202   276443     2051       7740810    2672936792 |    1+mark or  1+align     9669       385        34167202   276443     2051       7740810    2672936792 | ||
|    0 mark and 1+align     7027       385        754,615    6603       1654       20219      46,400,303 !!! |    0 mark and 1+align     7027       385        754,615    6603       1654       20219      46,400,303 !!! | ||
Revision as of 03:43, 17 June 2009
BCM
NCBI Data
- Genome Projects
- TA search
- TA ftp
- 91 volumes: 87 with qual & 4 with no quality (85 volumes contain BCM reads)
- 14 centers
- 21 center/trace_type_codes
- Avg LEN=984
- Avg CLIP (CLB intersect CLV)=760
- Avg CLV=997 > Avg LEN ???
- Avg QUAL=38.96 (27.51 for the 2.59M reads not in the UMD assembly)
- Avg UMDoverlapper CLIP=778
Problems:
- 0 QUAL reads 650,133
- the quality lines in several qual. files start with space; need to remove it otherwise tarchive2ca errors out saying that the len(quality)=len(seq)+1
- several xml contained the "&" character => XML parser error
- xml.bos_taurus.087 contained 2 trace_volumes => XML parser error
- BCCAGSC.CLONEEND : all reads have LIBRARY_ID=CH240, SEQ_LIB_ID=. ; the INSERT_SIZE & INSERT_STDEV vary within the library: set to 150,000 & 30,000
- UIUC.CLONEEND: INSERT_SIZE & INSERT_STDEV missing: set to 150,000 & 30,000
CENTER_NAME counts
    COUNT           CENTER_NAME     
 1  35629020        BCM             Baylor College of Medicine
 2  737900          NISC            NIH Intramural Sequencing Center
 3  652614          BCCAGSC         British Columbia Cancer Agency Genome Sciences Centre                           # TA query_tracedb CENTER_NAME = "BCCAGSC" => 652,510 
 4  378871          MARC            USDA, ARS, US Meat Animal Research Center
 5  114753          UIUC            University of Illinois at Urbana-Champaign                                      # TA query_tracedb CENTER_NAME = "UIUC" => 106,368
 6  107367          BARC            USDA, ARS, Beltsville Agricultural Research Center
 7  65171           TIGR            The Institute for Genome Research
 8  53556           GSC             Genoscope
 9  43033           CENARGEN        Embrapa Genetic Resources and Biotechnology
 10 18623           SC              The Sanger Center
 11 15301           UOKNOR          University of Oklahoma Norman Campus, Advanced Center for Genome Technology
 12 10651           TIGR_JCVIJTC    The Institute for Genomic Research, Traces generated at JCVIJTC                 # TA query_tracedb CENTER_NAME="JCVI"
 13 2485            UIACBCB         University of Iowa Center for Bioinformatics and Computation Biology (UIACBCB)
 14 49              WUGSC           Washington University, Genome Sequencing Center                                 # TA query_tracedb CENTER_NAME = "WUGSC" => 9
    37829394        total           total                                                                           # TA query_tracedb SPECIES_CODE = "BOS TAURUS" => 37,788,710
TRACE_TYPE_CODE counts
    COUNT         CENTER_NAME     TRACE_TYPE_CODE        
 1  24863599      BCM*            WGS                    SEQ_LIB_ID:89
 2  10748529      BCM*            SHOTGUN                SEQ_LIB_ID:15543
 3  737900        NISC            SHOTGUN                SEQ_LIB_ID:247
 4  125597        BCCAGSC         CLONEEND               LIBRARY_ID:1         large insert size; some qualityless; !!! almost all have CLIP3=0
 5  114753        UIUC            CLONEEND               LIBRARY_ID:2         insert size missing , no frequent kmers
 6  65171         TIGR            CLONEEND               SEQ_LIB_ID:1         2K & use TRACE_DIRECTION instead of TRACE_END
 7  53556         GSC             CLONEEND               SEQ_LIB_ID:1         large insert size; !!! all have qual=0 and were excluded 
 8  26246         CENARGEN        WGS                    .                    no LIBRARY_ID; no SEQ_LIB_ID; no INSERT_SIZE; no INSERT_STDEV; reads have no direction; ~21954 could be paired (same TEMPLATE_ID)
 9  25454         BARC            CLONEEND               SEQ_LIB_ID:14304     !!! all have CLIP3=0
 10 16892         BCM*            CLONEEND               LIBRARY_ID:1         VBBAA   mea=167000  std=25000
 11 16787         CENARGEN        CLONEEND               LIBRARY_ID:1         
 12 15150         UOKNOR          SHOTGUN                LIBRARY_ID:1         some qualityless
 13 10651         TIGR_JCVIJTC    CLONEEND               SEQ_LIB_ID:2
 14 151           UOKNOR          FINISHING              LIBRARY_ID:1         some qualityless, no direction(TRACE_END=N); no INSERT_SIZE; no INSERT_STDEV
 15 49            WUGSC           CLONEEND               SEQ_LIB_ID:1 
    36820485      total
 16 527017        BCCAGSC         EST
 17 207204        MARC            EST
 18 171667        MARC            PCR
 19 81913         BARC            EST
 20 18623         SC              EST 
 21 2485          UIACBCB         EST
    1008909       total
STRATEGY & TRACE_TYPE_CODE counts
COUNT CENTER_NAME STRATEGY TRACE_TYPE_CODE 12545304 BCM . WGS 11425910 BCM WGA WGS 5223683 BCM CLONE SHOTGUN 4479883 BCM POOLCLONE SHOTGUN 1044963 BCM . SHOTGUN 892385 BCM SNP WGS 737900 NISC CLONE SHOTGUN 125597 BCCAGSC CLONEEND CLONEEND 114753 UIUC CLONEEND CLONEEND 65171 TIGR CLONEEND CLONEEND 53556 GSC CLONEEND CLONEEND 26246 CENARGEN . WGS 25454 BARC . CLONEEND 16892 BCM CLONEEND CLONEEND 16787 CENARGEN CLONEEND CLONEEND 12195 UOKNOR . SHOTGUN 10651 TIGR_JCVIJTC CLONEEND CLONEEND 2955 UOKNOR CLONE SHOTGUN 151 UOKNOR . FINISHING 49 WUGSC CLONEEND CLONEEND
527017 BCCAGSC EST EST 145820 MARC EST EST 117958 MARC COMPARATIVE PCR 81913 BARC EST EST 61384 MARC CLONE EST 53709 MARC Re-Sequencing PCR 18623 SC EST EST 2485 UIACBCB . EST
BCM.SHOTGUN libraries
- The long inserts are probably wrong !!!
SIZE STDEV COUNT 3500 1500 4502569 2000 1000 3244493 3000 1000 1021577 180000 1000 840528 6500 1500 429026 180000 13000 320208 6000 2000 208192 167000 13000 96337 3500 15000 85599
SIZE COUNT 3500 4588168 2000 3244493 180000 1160736 3000 1021577 6500 429026 6000 208192 167000 96337
3' VECTOR TRIMMED counts
CENTER_NAME TRACE_TYPE_CODE TOTAL 3'CLV<LEN QUAL==0 UMD.FRG 1 BCM WGS 24863599 10968979 551114 24050767 2 BCM SHOTGUN 10748529 5052692 23419 10068499 3 NISC SHOTGUN 737900 28972 0 735488 4 BCCAGSC CLONEEND 125597 125484 8926 113790 5 UIUC CLONEEND 114753 90243 0 106247 6 TIGR CLONEEND 65171 46389 0 64903 7 GSC CLONEEND 53556 53556 53556 (all) 0 !!! all have 0 quals and were excluded 8 CENARGEN WGS 26246 26246 0 25976 9 BARC CLONEEND 25454 25454 0 25387 10 BCM CLONEEND 16892 6751 0 16863 11 CENARGEN CLONEEND 16787 16787 0 16628 12 UOKNOR SHOTGUN 15150 2885 12195 0 13 TIGR_JCVIJTC CLONEEND 10651 339 0 10644 14 UOKNOR FINISHING 151 0 151 151 15 WUGSC CLONEEND 49 0 0 0 16 BCCAGSC EST 527017 524173 772 0 17 MARC EST 207204 207204 0 0 18 MARC PCR 171667 171667 0 0 19 BARC EST 81913 78597 0 0 20 SC EST 18623 7350 0 0 21 UIACBCB EST 2485 2485 0 0
ZERO QUALITY COUNTS
- Counts
CENTER_NAME TRACE_TYPE_CODE COUNT BCM WGS 551114 GSC CLONEEND 53556 BCM SHOTGUN 23419 UOKNOR SHOTGUN 12195 BCCAGSC CLONEEND 8926 BCCAGSC EST 772 UOKNOR FINISHING 151 TOTAL 650134
- For 0 quality reads, assign quality 20 to bases 1..700, 0 to bases 701..
- Volumes 026..039 have been fixed
Local Data
Files & Dirs
/fs/szasmg3/bos_taurus/data/ /fs/szasmg2/Drosophila/D_pseudoobscura/Vectors /nfshomes/dpuiu/db/UniVec
Software
Figaro
- trims vector only at 5' end
- call lucy trimming for qualities
Lucy
- both vector sequence and splice sites are required
Atlas
- web site
- atlas-screen-trim-file : "calls cross_match and atlas-screen-window to create trimmed reads file (scan in from each end of read looking for 50-base windows of high quality and no vector); "
Contaminant search
nucmer reads CLIPPING range to UniVec & EcoliK12
UniVec
Ref
#seqs min max mean median n50 sum UniVec 2861 12 48551 231 99 781 660,151 UniVec_Core 1348 12 48551 243 98 967 327,641
Hits: alignment length
bp #reads min max mean median n50 sum 19 4548466 19 1045 28.37 23 27 129025025 20 3684852 20 1045 30.56 25 28 112616359 30 1097357 30 1045 48.04 38 43 52714583 40 484661 40 1045 66.36 47 53 32163896 100 54334 100 1045 198 116 223 10772815 # many are ESTs
Ecoli
Ref:
K12 4,639,675 bp
Hits: alignment length
bp #reads min max mean median n50 sum 19 275109 19 1223 30.66 19 20 8435470 20 102550 20 1223 50.29 21 161 5156849 30 19032 30 1223 178 37 706 3381214 40 9234 40 1223 329 171 738 3034293 100 6781 100 1223 424 223 749 2876432 200 4378 200 1223 575 696 771 2516916
BCM vectors
#seqs min max mean median n50 sum BCM 14 2580 33180 9379 5821 32705 131312
Vector/Splice site search
Strategy
- 1. Select all the reads in the same volume that belong to one particular library; same CENTER_NAME, STRATEGY & TRACE_TYPE_CODE
- 2. Get the quality clipping trim: CLIP_QUALITY_LEFT & CLIP_QUALITY_RIGHT
- 3. Separate reads in 2 sets according to direction TRACE_END: FORWARD & REVERSE
- 4. Get the most frequent kmers in each set (24 & 8 bp)
- 5. Check if the most frequent kmers are overrepresented
- 6. Check if the most frequent 8mers are present in the most frequent 24mers
- 7. Try to extend the 24mers by a few bp => linkers
- 8. Align linkers to the opposite stand sequences using nucmer
- 9. Extract the subsequences adjacent(following) to linker (50..150bp)
- 10. Align the subsequences; if they align we've probably identified the vector
- 11. Identify the vector name/id by alignment to UniVec => several alignments
- 12. Check if the forward/reverse vector(s) are the same : we should find a common vector sequence; the UniVec alignments should be adjacent
- 13. create the Lucy vector & splice files; the splice contains the linker+vector
- 14. run lucy & trim input reads according to Lucy clr
- 15. align lucy trimmed reads to linker,vector,splice & UniVec.dust
- 16. align input reads to linker,vector,splice & UniVec.dust
- 17. compare the 15. & 16. counts
Example
- 1. volume 011 : 500,000 reads CENTER_NAME=BCM, TRACE_TYPE_CODE=WGS
- 2.
- 3. 249,611 TRACE_END=F & 250,389 TRACE_END=R
- 4. kmers: 8 8bp most frequent kmers are shared by the FORWARD & REVERSE strands ; no 24bp kmers are shared
==> 24.fwd/kmers.tab <== AGTTCGACTGCAAGTAGTTCATCA TGATGAACTACTTGCAGTCGAACT 2463 # contains AGTAGTTC GAGTTCGACTGCAAGTAGTTCATC GATGAACTACTTGCAGTCGAACTC 2189 CGAGTTCGACTGCAAGTAGTTCAT ATGAACTACTTGCAGTCGAACTCG 1996 TCGAGTTCGACTGCAAGTAGTTCA TGAACTACTTGCAGTCGAACTCGA 1593 GTTCGACTGCAAGTAGTTCATCAA TTGATGAACTACTTGCAGTCGAAC 1023 GAGTTCGACTGCAGTAGTTCATCA TGATGAACTACTGCAGTCGAACTC 812 CGAGTTCGACTGCAGTAGTTCATC GATGAACTACTGCAGTCGAACTCG 777 GTTCGACTGCAAGTAGTTCATCAT ATGATGAACTACTTGCAGTCGAAC 769 TCGAGTTCGACTGCAGTAGTTCAT ATGAACTACTGCAGTCGAACTCGA 637 ATCGAGTTCGACTGCAAGTAGTTC GAACTACTTGCAGTCGAACTCGAT 594 ==> 08.fwd/kmers.tab <== AGTAGTTC GAACTACT 86477 CAGTAGTT AACTACTG 67681 AGTTCTCA TGAGAACT 61556 TAGTTCTC GAGAACTA 60964 GTAGTTCT AGAACTAC 57866 AGTTCATC GATGAACT 49676 TAGTTCAT ATGAACTA 45298 GTTCATCA TGATGAAC 42117 GCAGTAGT ACTACTGC 41391 GTAGTTCA TGAACTAC 40694 ==> 24.rev/kmers.tab <== TATCGATGGTACAGTAGTTCATCA TGATGAACTACTGTACCATCGATA 999 # contains AGTAGTTC CTATCGATGGTACAGTAGTTCATC GATGAACTACTGTACCATCGATAG 774 GCTATCGATGGTACAGTAGTTCAT ATGAACTACTGTACCATCGATAGC 600 CGCTATCGATGGTACAGTAGTTCA TGAACTACTGTACCATCGATAGCG 432 ATCGATGGTACAGTAGTTCATCAT ATGATGAACTACTGTACCATCGAT 417 ATCGATGGTACAGTAGTTCATCAA TTGATGAACTACTGTACCATCGAT 380 ATCAGATGGTACAGTAGTTCATCA TGATGAACTACTGTACCATCTGAT 373 ATCGATGGTACAGTAGTTCATCAC GTGATGAACTACTGTACCATCGAT 265 CTATCGATGGTAAGTAGTTCATCA TGATGAACTACTTACCATCGATAG 235 TCAGATGGTACAGTAGTTCATCAA TTGATGAACTACTGTACCATCTGA 224 ==> 08.rev/kmers.tab <== AGTTCATC GATGAACT 85127 TAGTTCAT ATGAACTA 77902 GTTCATCA TGATGAAC 75585 TAGTTCTC GAGAACTA 68057 AGTTCTCA TGAGAACT 67277 GTAGTTCT AGAACTAC 64894 GTAGTTCA TGAACTAC 62607 CGTAGTTC GAACTACG 52031 AGTAGTTC GAACTACT 51013 ACGTAGTT AACTACGT 31552
- 7. Get linker sequences
>linker.fwd 27bp TCGAGTTCGACTGCAAGTAGTTCATCA >linker.rev 27bp CTAATCAGATGGTACAGTAGTTCATCA #>linker.rev 40 bp Art's (13 more bp at 5') #TATGACCATGCGCCTAATCAGATGGTACAGTAGTTCATCA
#GCTATCGATGGTACAGTAGTTCATCAT is the most frequent rev seq 27 kmers but not the linker (few snp differences)
- 8 & 9 Align reads to linkers using nucmer
Fwd:
 nucmer -l 12 -c 24 -r linker.fwd.seq ../bos_taurus.$v.r.fasta 
 #  nucmer -l 12 -c 24 -r kmers.seq ../bos_taurus.$v.r.fasta  
 show-coords out.delta | awk '{print $19,$5,$13}' > ! out.clr
 extractfromfastanames.pl -clr -f out.clr < ../bos_taurus.$v.r.fasta >! out.seq
 
Rev:
 nucmer -l 12 -c 24 -r linker.rev.seq ../bos_taurus.$v.f.fasta
 #  nucmer -l 12 -c 24 -r kmers.seq ../bos_taurus.$v.f.fasta  
 show-coords out.delta | awk '{print $19,$5,$13}' > ! out.clr
 extractfromfastanames.pl -clr -f out.clr < ../bos_taurus.$v.f.fasta >! out.seq
 
Both:
clrFasta out.seq >! out.cseq fasta2tab.pl out.cseq | sort -k2 > ! out.tab nucmer -c 40 out.cseq ~/db/UniVec -p vector delta-filter -q vector.delta >! vector.filter-q.delta show-coords vector.filter-q.delta | sort -n | head
cat vector.filter-q.delta | grep "^>" | count.pl -c 1 -m 2
- 10. Extract "vector reads"
>399553028 # 24.fwd TGATGAACTACTGTACCATCTGATTAGGCGCATGGTCATAGCTGTTTCCTGTGTGAAATT GCTATCCGCTCACAATTCCACACAACATACGAGCCGGAAGCATAAAGTGTAAAGCCTGGG GTGTCAAATGAGAGACCTAACTCACATTCAACTTTTTTTTTTTTTCTGCCCTCTATTCTA ... >400269118 #24.rev TGATGAACTACTTGCAGTCGAAATCGAATCATCACTGGCCGTCCTTTTACAACGTCGTGA CTGGGAAAACCCTGGCGTTACCCAACTTAATCCGCCTTGCAGCACATCCCCCTTTCCCCC AGCTGGCGTAAAAACGTAAAAAGCCCCGCACCGATCGCCCTTTCCCAACAGGTTGCCCAG
- 11. Align "vector reads" to UniVec; identify vector
 show-coords 24.fwd/400269118-UniVec.delta 24.rev/399553028-UniVec.delta | grep J01636.1
     31  148  | 1175 1292  | 118   118  |  95.76  |     1276     7477  |     9.25     1.58  | 399553028.rev gnl|uv|J01636.1:1-7477
     32  199  | 1302 1463  | 168   162  |  90.48  |      653     7477  |    25.73     2.17  | 400269118     gnl|uv|J01636.1:1-7477
- 12. 10bp distance between the 2 alignments
- 13. Lucy files
$ more vector.seq >J01636 E.coli lactose operon with lacI, lacZ, lacY and lacA genes GACACCATCGAATGGCGCAAAACCTTTCGCGGTATGGCATGATAGCGCCCGGAAGAGAGTCAATTCAGGG TGGTGAATGTGAAACCAGTAACGTTATACGATGTCGCAGAGTATGCCGGTGTCTCTTATCAGACCGTTTC CCGCGTGGTGAACCAGGCCAGCCACGTTTCTGCGAAAACGCGGGAAAAAGTGGAAGCGGCGATGGCGGAG CTGAATTACATTCCCAACCGCGTGGCACAACAACTGGCGGGCAAACAGTCGTTGCTGATTGGCGTTGCCA ... $ more splice.seq >J01636.for.begin vector+linker.rev TGAATGTGAGTTAGGTCTCTCATTTGACACCCCAGGCTTTACACTTTATGCTTCCGGCTC GTATGTTGTGTGGAATTGTGAGCGGATAGCAATTTCACACAGGAAACAGCTATGACCATG CGCCTAATCAGATGGTACAGTAGTTCATCA >J01636.for.end rev(linker.fwd)+vector TGATGAACTACTTGCAGTCGAAATCGAATCATCACTGGCCGTCCTTTTACAACGTCGTGA CTGGGAAAACCCTGGCGTTACCCAACTTAATCCGCCTTGCAGCACATCCCCCTTTCCCCC AGCTGGCGTAAAAACGTAAAAAGCCCCGCA >J01636.rev.begin (revcomp of J01636.for.end) TGCGGGGCTTTTTACGTTTTTACGCCAGCTGGGGGAAAGGGGGATGTGCTGCAAGGCGGA TTAAGTTGGGTAACGCCAGGGTTTTCCCAGTCACGACGTTGTAAAAGGACGGCCAGTGAT GATTCGATTTCGACTGCAAGTAGTTCATCA >J01636.rev.end (revcomp of J01636.for.begin) TGATGAACTACTGTACCATCTGATTAGGCGCATGGTCATAGCTGTTTCCTGTGTGAAATT GCTATCCGCTCACAATTCCACACAACATACGAGCCGGAAGCATAAAGTGTAAAGCCTGGG GTGTCAAATGAGAGACCTAACTCACATTCA
 # splice=linker+vector  
      3      120  |     1175     1292  |      118      118  |    95.76  |      150     7477  |    78.67     1.58  | J01636.for.begin   J01636
     32      131  |     1302     1399  |      100       98  |    96.00  |      150     7477  |    66.67     1.31  | J01636.for.end     J01636
- 13.1 Align vector & splice to Ecoli
1 7474 | 366812 359335 | 7474 7478 | 99.91 | 7477 4639675 | 99.96 0.16 | J01636 NC_000913.2 [CONTAINED]
     20      119  |       65      162  |      100       98  |    96.00  |      150      395  |    66.67    24.81  | J01636.rev.begin   NC_000913.2
     31      148  |      172      289  |      118      118  |    95.76  |      150      395  |    78.67    29.87  | J01636.rev.end     NC_000913.2
1069 1463 | 395 1 | 395 395 | 100.00 | 7477 395 | 5.28 100.00 | J01636 NC_000913.2.365350-365744
- 14. Run lucy & trim reads
 $ /nfshomes/dpuiu/szdevel/SourceForge/lucy-1.19p/lucy \ 
     -v vector.seq splice.seq
     -o bos_taurus.lucy.seq bos_taurus.lucy.qual \
     -debug  bos_taurus.lucy.info \
     bos_taurus.seq bos_taurus.qual
# Trim clr $ clrFasta bos_taurus.seq > bos_taurus.cseq
- 15. Align lucy output to linker, vector, splice & UniVec.dust
$ nucmer -l 12 -c 24 ~/db/vector.seq bos_taurus.lucy.cseq -p vector-bos_taurus.lucy $ nucmer -l 16 -c 30 ~/db/vector.seq bos_taurus.lucy.cseq -p vector-bos_taurus.lucy $ nucmer -l 16 -c 30 ~/db/splice.seq bos_taurus.lucy.cseq -p splice-bos_taurus.lucy $ nucmer -l 16 -c 30 ~/db/UniVec.dust bos_taurus.lucy.cseq -p UniVec.dust-bos_taurus.lucy
- 16. Align input to linker, vector, splice & UniVec.dust
$ nucmer -l 12 -c 24 ~/db/linker.seq bos_taurus.seq -p linker-bos_taurus $ nucmer -l 16 -c 30 ~/db/vector.seq bos_taurus.seq -p vector-bos_taurus $ nucmer -l 16 -c 30 ~/db/splice.seq bos_taurus.seq -p splice-bos_taurus $ nucmer -l 16 -c 30 ~/db/UniVec.dust bos_taurus.seq -p UniVec.dust-bos_taurus
Count how many reads got trimmed
 infoseq *seq | getSummary.pl -c 1 -t original.LEN
 
 cat bos_taurus.lucy.info | awk '{print $4-$3}' | getSummary.pl -t lucy.CLR >! bos_taurus.lucy.summary  
 cat bos_taurus.lucy.info | getSummary.pl -c 14 -t lucy.CLV5 -nh >> bos_taurus.lucy.summary
 cat bos_taurus.lucy.info | getSummary.pl -c 15 -t lucy.CLV3 -nh >> bos_taurus.lucy.summary
Libraries
011.BCM.WGS FORWARD
- vector: J01636
- UniVec: gnl|uv|J01636.1:1-7477 E.coli lactose operon with lacI, lacZ, lacY and lacA genes
ll ~dpuiu/db/J01636* -rw-rw-r-- 1 dpuiu dpuiu 7651 Jan 9 15:56 /nfshomes/dpuiu/db/J01636 -rw-rw-r-- 1 dpuiu dpuiu 105 Jan 14 07:17 /nfshomes/dpuiu/db/J01636linker -rw-rw-r-- 1 dpuiu dpuiu 840 Jan 13 13:43 /nfshomes/dpuiu/db/J01636splice
cat ~dpuiu/db/J01636* | infoseq J01636 7477 53.43 J01636.linker.fwd 27 44.44 J01636.linker.rev 27 37.04 J01636.for.begin 150 44.67 J01636.for.end 150 51.33 J01636.rev.begin 150 51.33 J01636.rev.end 150 44.67
- 249,611 reads:
- 91% got vector trimmed at the 5'
- 0.4% (1149) got vector trimmed at the 3'
#elem #0s min max mean median n50 sum original.LEN 249611 0 437 2349 1082 991 1009 270035781 lucy.CLV5 249611 21215 0 741 25.03 25 27 6247415 lucy.CLV3 249611 248462 0 1047 3.49 0 859 870344
- Original reads hit counts:
10975 linker.fwd 133 linker.rev 166 splice 152 vector 228 UniVec.dust
- Lucy trimmed read counts
2 linker.fwd 0 linker.rev 1 splice 1 vector 6 UniVec.dust (only 3 are >40bp)
011.BCM.WGS REVERSE
#elem #0s min max mean median n50 sum original.LEN 250389 0 502 2148 1085 993 1012 271691094 lucy.CLR 250389 7345 0 1281 795 876 892 198982171 lucy.CLV5 250389 20271 0 668 26.52 27 29 6641362 lucy.CLV3 250389 249269 0 997 3.35 0 861 839029
- Original reads hit counts:
linker.fwd 113 linker.rev 3812 splice 143 UniVec.dust 237 vector 4318
- Lucy trimmed reads hit counts:
linker.fwd 1 linker.rev 0 splice 1 UniVec.dust 10 vector 1
030.BCM.SHOTGUN
- same linker/vector/splice as BCM.WGS
- 2.5% (4K out of 160K) reads contain linker & vector at 3'
#elem #0s min max mean median n50 sum original.LEN 8411 0 325 1685 1181 1240 1314 9933150 lucy.CLR 8411 8 0 1054 841 863 874 7070994 lucy.CLV5 8411 568 0 232 27.01 28 29 227206 lucy.CLV3 8411 2325 0 1040 597 794 851 5023445
- Original reads hit counts:
linker.fwd 4314 linker.rev 4125 splice 7816 UniVec.dust 4212 vector 6750 vector 27235
- Lucy trimmed reads hit counts:
linker.fwd 3 linker.rev 1 splice 1 UniVec.dust 13 vector 0
001.NISC.SHOTGUN
- Vector: pOTW13
- UniVec: 3 partial seqs
gnl|uv|NGB00080.1:1-198 pOTW13 with linkers gnl|uv|NGB00080.1:718-888 pOTW13 with linkers gnl|uv|NGB00080.1:1490-1654-49 pOTW13 with linkers
ll /nfshomes/dpuiu/db/NGB00080* -rw-rw-r-- 1 dpuiu dpuiu 1083 Jan 14 20:43 /nfshomes/dpuiu/db/NGB00080 -rw-r--r-- 1 dpuiu dpuiu 94 Jan 14 21:01 /nfshomes/dpuiu/db/NGB00080linker -rw-r--r-- 1 dpuiu dpuiu 2183 Jan 14 20:44 /nfshomes/dpuiu/db/NGB00080splice
cat /nfshomes/dpuiu/db/NGB00080* | infoseq NGB00080 1054 50.00 NGB00080.linker.fwd 24 45.83 NGB00080.linker.rev 26 53.85 NGB00080.for.beg 518 46.14 NGB00080.for.end 518 50.48 NGB00080.rev.begin 518 50.48 NGB00080.rev.beg 518 46.14
- 944 read sample
#elem #0s min max mean median n50 sum original.LEN 944 0 652 1017 735 721 722 693668 lucy.CLR 944 39 0 886 415 422 522 391333 lucy.CLV5 944 121 0 275 34.05 33 35 32143 lucy.CLV3 944 18 0 885 410 409 511 387007
- Original reads hit counts:
linker.fwd 479 linker.rev 492 splice 910 UniVec.dust 0 vector 939
- Lucy trimmed reads hit counts:
linker.fwd 1 linker.rev 0 splice 0 UniVec.dust 9 vector 1
060.BCCAGSC.CLONEEND
- Linkers:
linker.fwd CCCTGCTTTGTCTGGAAGGGGTTCCCGACCT linker.rev CAGGAGGGGAGAAAGGGCTCAGAGG
- No common vector !!!
 wc -l *clb
   60746 bos_taurus.060.f.clb  #18 reads original align to UniVec (nucmer default params)
   60836 bos_taurus.060.r.clb
 
 Fwd:
    329      428  |      440      535  |      100       96  |    91.00  |      503     1585  |    19.88     6.06  | 723951410  gnl|uv|U30497.1:3230-4814 Cloning vector pAS2-1
    330      370  |       89       49  |       41       41  |   100.00  |      503      143  |     8.15    28.67  | 723951410  gnl|uv|U67875.1:6541-6683 pESP-I yeast expression vector
    330      370  |       94       54  |       41       41  |   100.00  |      503      143  |     8.15    28.67  | 723951410  gnl|uv|U67875.1:6541-6683 pESP-I yeast expression vector
 
  Rev:
      1       96  |       71      165  |       96       95  |    93.81  |      203      165  |    47.29    57.58  | 724018013  gnl|uv|AF133437.1:16659-16823 Cloning vector pCYPAC6
     50      143  |        1       94  |       94       94  |    92.71  |      203       94  |    46.31   100.00  | 724018013  gnl|uv|U80929.2:2858-2951     Cloning vector pBACe3.6
017.UIUC.CLONEEND
- No overrepresented kmers
wc -l *clb 17978 bos_taurus.017.f.clb 17911 bos_taurus.017.r.clb ==> 24.fwd/kmers.tab <== CCCTGCTTTGTCTGGAAGGGGTTC GAACCCCTTCCAGACAAAGCAGGG 9 CTGCTTTGTCTGGAAGGGGTTCCC GGGAACCCCTTCCAGACAAAGCAG 9 ==> 24.rev/kmers.tab <== GAATGTTGAGCTTTAGCCAACTTT AAAGTTGGCTAAAGCTCAACATTC 4 TCTGAATGTTGAGCTTTAGCCAAC GTTGGCTAAAGCTCAACATTCAGA 4 ==> 8.fwd/kmers.tab <== TTTTTTTT AAAAAAAA 55 AAGGGGTT AACCCCTT 35 ==> 8.rev/kmers.tab <== GTCTGGAA TTCCAGAC 41 TCTGGAAG CTTCCAGA 39
- No UniVec hits
010.TIGR.CLONEEND
- No overrepresented kmers
wc -l *clb 5479 bos_taurus.032.f.clb 5174 bos_taurus.032.r.clb ==> 24.fwd/kmers.tab <== CTTGTGTTGGCCCAGGCAAGTCCA TGGACTTGCCTGGGCCAACACAAG 30 TTGTGTTGGCCCAGGCAAGTCCAA TTGGACTTGCCTGGGCCAACACAA 30 ==> 24.rev/kmers.tab <== CTGCCTCTTGTGTTGGCCCAGGCA TGCCTGGGCCAACACAAGAGGCAG 16 GCTGCCTCTTGTGTTGGCCCAGGC GCCTGGGCCAACACAAGAGGCAGC 15 ==> 8.fwd/kmers.tab <== GAGTGGGT ACCCACTC 176 GGAGTGGG CCCACTCC 171 ==> 8.rev/kmers.tab <== TGGAGTGG CCACTCCA 182 GGAGTGGG CCCACTCC 181
- No UniVec hits
...
070.BCM.CLONEEND
- No frequent kmers
wc -l *clb 6027 bos_taurus.070.f.clb 6236 bos_taurus.070.r.clb ==> 24.fwd/kmers.tab <== GGACTCTCAGAGTCTTCTCCAACA TGTTGGAGAAGACTCTGAGAGTCC 18 ACTGGTTGGATCTCCTTGCAGTCC GGACTGCAAGGAGATCCAACCAGT 18 ==> 24.rev/kmers.tab <== ATAAAATCTGAGCCACCAGGGAAG CTTCCCTGGTGGCTCAGATTTTAT 1 CTATTGGTTCATATGGTCAACGTC GACGTTGACCATATGAACCAATAG 1 ==> 8.fwd/kmers.tab <== TTTTTTTT AAAAAAAA 86 CTTCTCCA TGGAGAAG 75 ==> 8.rev/kmers.tab <== TATAGTGT ACACTATA 9 ATATAGGG CCCTATAT 8
- No alignments to BCM WGS vector
Running Lucy
- Default parameters with vector trimming
- BCM vector/splice
/nfshomes/dpuiu/db/vector.BCM.seq /nfshomes/dpuiu/db/splice.BCM.seq
- NISC vector/splice
/nfshomes/dpuiu/db/vector.NISC.seq /nfshomes/dpuiu/db/splice.NISC.seq
BCM.WGS (all reads)
- orig.CLR < lucy.CLR ( 765 < 792 )
- orig.CLV > lucy.CLV ( 1015 > 973 )
- 739,529 out of 24,863,599 reads (3%) deleted by Lucy (CLR=-1,-1)
- 21,728,592 out of 24,863,599 reads (87%) vector trimmed at the 5' end
- 92,646 out of 24,863,599 reads (0.3%) vector trimmed at the 3' end
elem <0 0 >0 min max mean median n50 sum orig.LEN 24863599 0 0 24863599 5 3097 1002 997 1015 24915462033 orig.CLR 24863599 463669 7 24399923 -1143 1833 765 836 864 19036744256 orig.CLR5 24863599 0 359245 24504354 0 2103 42 22 58 1047922451 orig.CLR3 24863599 463404 0 24400195 -1 2169 807 872 895 20084666707 lucy.CLR 24863599 0 739529 24124070 0 1219 792 878 904 19695000417 lucy.CLR5 24863599 739529 36108 24087962 -1 1753 43 29 42 1086413880 lucy.CLR3 24863599 739529 0 24124070 -1 1894 835 915 939 20781414297 orig.CLR5-lucy.CLR5 24863599 16299521 215345 8348733 -1186 2104 -1 -10 -1186 -38491429 orig.CLR3-lucy.CLR3 24863599 14858542 1494794 8510263 -1273 2170 -28 -20 -1273 -696747590 orig.CLV 24863599 1053 1920 24860626 -2 5345 1015 1002 1017 25260581538 orig.CLV5 8841849 0 0 8841849 1 1219 33 46 49 295011460 orig.CLV3 24861698 1053 0 24860645 -1 5346 1027 1005 1019 25555592998 lucy.CLV 24863599 10694 707 24852198 -469 3096 973 968 987 24195085877 lucy.CLV5 24863599 0 3135007 21728592 0 1359 25 27 29 623457486 lucy.CLV3 24863599 0 0 24863599 4 3096 998 995 1014 24818543363 lucy.CLVABS5 24863599 0 3135007 21728592 0 1359 25 27 29 623457486 lucy.CLVABS3 24863599 0 24770953 92646 0 1343 2 0 880 72055071 orig.CLV5-lucy.CLV5 24863599 17216820 1512453 6134326 -1312 1219 -13 -25 -1312 -328446026 orig.CLV3-lucy.CLV3 24863599 1519132 18579609 4764858 -1832 4672 29 0 479 737049635
BCM.WGS (0 quality reads)
- orig.CLR > lucy.CLR (mean)
- orig.CLV > lucy.CLV (mean)
- 7,153 out of 551,114 reads (1.3%) deleted by Lucy (CLR=-1,-1)
- 508,166 out of 551,114 reads (92%) vector trimmed at the 5' end
- 1,946 out of 551,114 reads (0.35%) vector trimmed at the 3' end
elem <0 0 >0 min max mean median n50 sum orig.LEN 551114 0 0 551114 5 1464 872 946 959 480705828 orig.CLR 551114 7754 0 543360 -770 1175 708 786 807 390325117 orig.CLR5 551114 0 6773 544341 0 1519 44 20 111 24582849 orig.CLR3 551114 7744 0 543370 -1 1638 752 818 833 414907966 lucy.CLR 551114 0 7153 543961 0 699 636 671 671 350759771 lucy.CLR5 551114 7153 35872 508089 -1 201 26 27 28 14442310 lucy.CLR3 551114 7153 0 543961 -1 699 662 699 699 365202081 orig.CLR5-lucy.CLR5 551114 364282 8801 178031 -198 1500 18 -8 215 10140539 orig.CLR3-lucy.CLR3 551114 85058 2962 463094 -700 1472 90 123 178 49705885 orig.CLV 551114 971 0 550143 -2 2037 974 978 981 537127121 orig.CLV5 5100 0 0 5100 1 845 35 29 31 180490 orig.CLV3 551114 971 0 550143 -1 2037 974 978 981 537307611 lucy.CLV 551114 58 6 551050 -84 1456 841 917 930 463903233 lucy.CLV5 551114 0 42948 508166 0 202 27 28 29 14964546 lucy.CLV3 551114 0 0 551114 4 1463 868 945 958 478867779 lucy.CLVABS5 551114 0 42948 508166 0 202 27 28 29 14964546 lucy.CLVABS3 551114 0 549168 1946 0 700 2 0 686 1286935 orig.CLV5-lucy.CLV5 551114 506108 42215 2791 -202 845 -26 -28 -202 -14784056 orig.CLV3-lucy.CLV3 551114 134959 23422 392733 -967 1614 106 7 459 58439832
BCM.SHOTGUN
- orig.CLR < lucy.CLR (mean)
- orig.CLV > lucy.CLV (mean)
- 98,070 out of 10,748,529 reads (0.9%) deleted by Lucy (CLR=-1,-1)
- 9,737,008 out of 10,748,529 reads (90%) vector trimmed at the 5' end
- 294,942 out of 10,748,529 reads (2.7%) vector trimmed at the 3' end
elem <0 0 >0 min max mean median n50 sum orig.LEN 10748529 0 0 10748529 5 2043 975 950 964 10486690472 orig.CLR 10748529 17308 2 10731219 -1293 1467 809 833 847 8701344571 orig.CLR5 10748529 0 68 10748461 0 1315 26 16 38 288662580 orig.CLR3 10748529 16780 0 10731749 -1 1647 836 851 863 8990007151 lucy.CLR 10748529 0 98070 10650459 0 1337 833 854 868 8955866769 lucy.CLR5 10748529 98070 1973 10648486 -1 1307 35 28 32 376276188 lucy.CLR3 10748529 98070 0 10650459 -1 1553 868 882 896 9332142957 orig.CLR5-lucy.CLR5 10748529 9498290 65171 1185068 -1099 1293 -8 -11 -1099 -87613608 orig.CLR3-lucy.CLR3 10748529 6879532 671097 3197900 -1149 1437 -31 -26 -1149 -342135806 orig.CLV 10748529 16779 412 10731338 -2 3919 974 948 964 10472347908 orig.CLV5 8594910 0 0 8594910 1 1239 3 1 49 28350257 orig.CLV3 10748349 16779 0 10731570 -1 3919 976 950 965 10500698165 lucy.CLV 10748529 7026 614 10740889 -268 2042 930 924 940 9997862132 lucy.CLV5 10748529 0 1011521 9737008 0 855 24 24 27 257993796 lucy.CLV3 10748529 0 0 10748529 4 2042 954 945 962 10255855928 lucy.CLVABS5 10748529 0 1011521 9737008 0 855 24 24 27 257993796 lucy.CLVABS3 10748529 0 10453587 294942 0 1214 20 0 847 220086015 orig.CLV5-lucy.CLV5 10748529 9538738 138680 1071111 -854 1239 -21 -23 -854 -229643539 orig.CLV3-lucy.CLV3 10748529 357934 9324166 1066429 -1328 2846 22 0 704 244842237
NISC.SHOTGUN
- orig.CLR < lucy.CLR (mean)
- orig.CLV > lucy.CLV (mean)
- 8,248 out of 737,900 reads (1.1%) deleted by Lucy (CLR=-1,-1)
- 633,409 out of 737,900 reads (85%) vector trimmed at the 5' end
- 7,201 out of 737,900 reads (0.97%) vector trimmed at the 3' end
elem <0 0 >0 min max mean median n50 sum orig.LEN 737900 0 0 737900 104 2104 784 729 734 579172842 orig.CLR 737900 5988 2 731910 -636 1033 651 668 676 480400909 orig.CLR5 737900 0 0 737900 1 1407 47 40 51 34857531 orig.CLR3 737900 0 5879 732021 0 1470 698 710 715 515258440 lucy.CLR 737900 0 8248 729652 0 1035 658 670 676 485757685 lucy.CLR5 737900 8248 56 729596 -1 1091 45 35 46 33811606 lucy.CLR3 737900 8248 0 729652 -1 1391 704 710 714 519569291 orig.CLR5-lucy.CLR5 737900 253727 89345 394828 -566 1408 1 1 485 1045925 orig.CLR3-lucy.CLR3 737900 177007 31 560862 -867 1471 -5 1 -867 -4310851 orig.CLV 737900 3224 2655 732021 -636 2103 771 725 730 569178445 orig.CLV5 734026 0 0 734026 1 987 5 1 35 4375315 orig.CLV3 732021 0 0 732021 35 2104 783 729 734 573553760 lucy.CLV 737900 1335 55 736510 -200 2104 747 696 702 551392388 lucy.CLV5 737900 104491 0 633409 -1 1199 30 31 34 22784742 lucy.CLV3 737900 0 0 737900 15 2103 778 728 733 574177130 lucy.CLVABS5 737900 0 104491 633409 0 1200 31 32 35 23522642 lucy.CLVABS3 737900 0 730699 7201 0 1076 5 0 686 4257812 orig.CLV5-lucy.CLV5 737900 561851 66390 109659 -1198 983 -24 -29 -1198 -18409427 orig.CLV3-lucy.CLV3 737900 8386 1 729513 -950 1077 0 1 -950 -623370
Fragment files
- Locations:
/fs/szasmg3/bos_taurus/data/frg /fs/szasmg3/bos_taurus/data/frg.new
- All DST messages are unique
- bos_taurus.clv : contains the vector clipping points
- BCM.WGS, BCM.SHOTGUN & NISC.SHOTGUN: lucy.clv
- others: the TA clv
- 374,454 reads don't have valid clv's
- 36,446,031 reads have valid clv's with avg=955
 
Message counts (original)
DST FRG LKG bos_taurus.BCM.WGS.frg 79 24124070 11311841 #bos_taurus.BCM.SHOTGUN.frg 7339 10650459 1799069 # some libs & mates are missing due to a tarchive2ca crash (used by UMD2.1) #bos_taurus.BCM.SHOTGUN.new.frg 18208 10650459 4715172 # split the libraries by VOL & SEQ_LIB_ID (used by UMD2.2) #bos_taurus.BCM.SHOTGUN.new.frg 13826 10650459 5046435 # double check the FRG count !!! (used by UMD2.3) bos_taurus.BCM.SHOTGUN.new.frg 7 10650459 5046435 # UMD2.4 bos_taurus.NISC.SHOTGUN.frg 246 729652 344932 bos_taurus.BCCAGSC.CLONEEND.frg 1 125241 59505 bos_taurus.UIUC.CLONEEND.frg 2 114750 46319 bos_taurus.TIGR.CLONEEND.frg 1 65171 27067 bos_taurus.GSC.CLONEEND.frg 1 53521 25889 bos_taurus.CENARGEN.WGS.frg 0 26246 0 #bos_taurus.BARC.CLONEEND.frg 11150 25454 11150 # (used by UMD2.3) bos_taurus.BARC.CLONEEND.frg 1 25454 11150 # (used by UMD2.4) bos_taurus.BCM.CLONEEND.frg 1 16875 7103 bos_taurus.CENARGEN.CLONEEND.frg 1 16787 6269 bos_taurus.UOKNOR.SHOTGUN.frg 1 14651 4910 bos_taurus.TIGR_JCVIJTC.CLONEEND.frg 2 10651 4803 bos_taurus.UOKNOR.FINISHING.frg 0 151 0 bos_taurus.WUGSC.COLONEEND.frg 1 49 21 #total 25312 35973728 16896244 # (UMD2.3) total 344 35973728 16896244 # (UMD2.4)
Message counts (quality)
DST FRG LKG bos_taurus.BCM.WGS.qual.count 79 23580109 11035582 #bos_taurus.BCM.SHOTGUN.qual.count 7339 10644092 1799069 bos_taurus.BCM.SHOTGUN.qual.new.count 18208 10644092 4712446 bos_taurus.NISC.SHOTGUN.count 246 729652 344932 bos_taurus.BCCAGSC.CLONEEND.qual.count 1 116484 53585 bos_taurus.UIUC.CLONEEND.count 2 114750 46319 bos_taurus.TIGR.CLONEEND.count 1 65171 27067 bos_taurus.CENARGEN.WGS.count 0 26246 0 bos_taurus.BARC.CLONEEND.count 11150 25454 11150 bos_taurus.BCM.CLONEEND.count 1 16875 7103 bos_taurus.CENARGEN.CLONEEND.count 1 16787 6269 bos_taurus.TIGR_JCVIJTC.CLONEEND.count 2 10651 4803 bos_taurus.UOKNOR.SHOTGUN.qual.count 1 2456 813 bos_taurus.WUGSC.COLONEEND.count 1 49 21
Message counts (0quality)
DST FRG LKG bos_taurus.BCM.WGS.0qual.count 79 543961 234397 bos_taurus.GSC.CLONEEND.0qual.count 1 53521 25889 bos_taurus.UOKNOR.SHOTGUN.0qual.count 1 12195 4097 bos_taurus.BCCAGSC.CLONEEND.0qual.count 1 8757 2114 bos_taurus.BCM.SHOTGUN.0qual.count 7339 6367 0 bos_taurus.UOKNOR.FINISHING.0qual.count 0 151 0
Assemblies
Bt.qc.combine UMD2.0 ... UMD2.5 combine stats
UMD2.1(2009_0122_CA; Quality reads)
Issues
- Uses only quality reads
- BCM.SHOTGUN library : ~ 4715172-1799069=2.9M mates were missed due to a tarchive2ca crash ; some libraries got merged (were assigned the same lib_id)
- All reads except for BCM.WGS were set as nonrandom
- Update the runCA script to run overlapper concurently; new "ovlConcurrency" parameter added to the .spec file !!!
- consensus after cgw crashed in MultiAlignContig() ... use "consensus -D forceunitigabut" !!!
- cgw crashed after updating gkpStore with new lib/mate info => edit Input_CGW.c, remove the assert in line 117
Info
host: walnut assembly version: wgs-5.2 stable dir: /scratch1/bos_taurus/Assembly/2009_0122_CA command: /fs/szdevel/dpuiu/SourceForge/wgs/Linux-amd64/bin/runCA-test -d . -p bt -s bt01.specFile *.frg spec file: cgwDistanceSampleSize = 1000 # ??? too big; more than 50% of the BCM.SHOTGUN reads are in libraries with less than 1000 inserts cnsConcurrency = 15 cnsMinFrags = 200000 doOverlapTrimming = 1 frgCorrBatchSize = 100000 frgCorrConcurrency = 15 merylMemory = 24000 merylThreads = 15 obtMerThreshold = 200 obtOverlapper = ovl ovlConcurrency = 8 ovlCorrBatchSize = 100000 ovlCorrConcurrency = 15 ovlHashBlockSize = 1200000 ovlMemory = 8GB --hashload 0.8 --hashstrings 400000 ovlMerThreshold = 500 ovlOverlapper = ovl ovlRefBlockSize = 7200000 ovlThreads = 2 unitigger = utg utgErrorRate = 0.015 vectorIntersect = bos_taurus.clv doExtendClearRanges = 2
Steps
1. Run up till after initialStoreBuilding
runCA stopAfter=initialStoreBuilding ...
2. Update gkpStore with nonrandom frg flag
cat bos_taurus.nonrandom.clv | perl -ane 'print "frg uid $F[0] isnonrandom 1\n";' > bos_taurus.nonrandom.edit gatekeeper -edit bos_taurus.nonrandom.edit bt.gkpStore
Input
gatekeeper -dumpinfo -lastfragiid bt.gkpStore ... Last frag in store is iid = 35348776
OBT
elem <0 0 >0 min max mean median n50 sum CLV5 35085508 0 3387027 31698481 0 970 25 27 29 891007232 CLV3 35164784 0 0 35164784 15 2974 984 980 1000 34612019144 CLR_ORIG5 35348776 0 43354 35305422 0 1753 42 29 38 1502168205 CLR_ORIG3 35348776 0 0 35348776 70 1894 864 905 927 30547294868 CLR_OBT5 35348776 0 26513 35322263 0 1690 49 30 73 1756346429 CLR_OBT3 35348776 0 23477 35325299 0 1813 843 895 914 29824543869
- 421,379 reads deleted by OBT: why so many???
- Chimera:
20297 reads too short => deleted
- more 0-overlaptrim/bt.mergeLog.stats
... 211037: short or inconsistent 253536: deleted fragment due to zero clear
- Example:
gatekeeper -dumpfragments 516316990 bt.gkpStore fragmentIdent = 516316990,14 fragmentMate = 0,0 fragmentLibrary = 27473,1563 fragmentIsDeleted = 1 fragmentIsNonRandom = 1 fragmentStatus = G fragmentOrientation = I fragmentHasVectorClear = 0 fragmentHasQualityClear = 0 fragmentPlate = 0 fragmentPlateLocation = 0 fragmentSeqLen = 862 fragmentHPSLen = 0 fragmentSrcLen = 17 fragmentClearORIG = 38,553 fragmentClearQLT = 1,0 fragmentClearVEC = 1,0 fragmentClearOBTINI = 35,578 fragmentClearOBT = 35,578 fragmentClearUTG = 35,578 fragmentClearECR1 = 35,578 fragmentClearECR2 = 35,578 fragmentSeqOffset = 5376 fragmentQltOffset = 11038 fragmentHpsOffset = 53 fragmentSrcOffset = 287
cat 0-overlaptrim/bt.mergeLog | grep 516316990 516316990,14 412 412 0 0 (deleted, too short)
 zcat *r000*gz | convertOverlap -a -obt 
 ...
    14 12128740  f  377  478   292  393   2.97
    14 15226267  f  397  446    31   80   2.04
    14 19071241  f    4  513   199  708   1.18
    14 20073917  f    7  478    36  508   4.88
    14 20042424  f    4  419   299  714   1.93
    14 20212935  f    7  478   234  706   4.88
    14 20073828  r    7  478   507   35   4.67
    14 20212846  r    7  478   557   85   4.67
    14 27089060  r  491  534   836  793   2.33
    14 29061748  f  489  540    86  137   1.96
    14 32105697  f  455  543   381  469   2.27
    14 32187461  f  430  534   105  209   1.92
    14 32027289  f    4  419   493  907   4.59
 ...
 #read aligns to contigs
 show-coords 516316990-ctg.filter-r.strict.delta
     35      531  |       97      594  |      497      498  |    99.20  |      862     2759  |    57.66    18.05  | 516316990  ctg7180001872751
     45      678  |      931     1564  |      634      634  |    97.00  |      862     1567  |    73.55    40.46  | 516316990  ctg7180001837311
- OBT deleted reads:
BCM WGS 253816 BCM SHOTGUN 151770 BCCAGSC CLONEEND 7510 NISC SHOTGUN 4757 TIGR CLONEEND 1577 CENARGEN WGS 599 CENARGEN CLONEEND 431 TIGR_JCVIJTC CLONEEND 377 UIUC CLONEEND 182 BCM CLONEEND 150 BARC CLONEEND 125 UOKNOR SHOTGUN 85 total . 421379
OBT deleted reads:
elem >0 min max mean med n50 sum len 421379 421379 98 2974 862 927 968 363280405 avgQual 421379 421379 1 57 28 24 36 11852865
Overlapper
- 98.33% of the reads (34,761,786 out of 35,348,776 reads) had overlaps
- 1.66% of the reads had no overlaps
- 6.68% of the BCCAGSC.CLONEEND reads had no overlaps
- 4.95% of the TIGR_JCVIJTC.CLONEEND reads had no overlaps
- 3.48% of the TIGR.CLONEEND reads had no overlaps
 
- the median number of overlaps is 20
 
 Overlaps
         reads      min        max        mean       median     n50        sum
 qual    35348776   0          5592       106        20         769        3777789082
- the median number of overlaps for the BCM.WGS reads is 16
- the median number of overlaps for the BCM.SHOTGUN reads is 16 !!!
- the median number of overlaps for the NISC.SHOTGUN reads is 40 !!!
- the median number of overlaps for the BCM.CLONEEND reads is 16 !!!
 
Media:Bt.ovlStore.big.png , Media:Bt.ovlStore.small.png
Unitigger
more 4-unitigger/bt.cga.0
UNITIG OVERLAP GRAPH INFORMATION
 
       5208738 : Total number of unitigs
       2527051 : Total number of singleton, contained unitigs
       1814842 : Total number of singleton, non-contained unitigs
        180910 : Total number of non-singleton, spanned unitigs
        685935 : Total number of non-singleton, non-spanned unitigs
      34927397 : Total number of fragments
      34927397 : Total number of fragments in all unitigs
      21521581 : Total number of essential fragments in all unitigs
      13405816 : Total number of contained fragments in all unitigs
  0.0076239952 : Randomly sampled fragment arrival rate per bp
    2510896132 : The sum of overhangs in all the unitigs
    6400342737 : Total number of bases in all unitigs
             0 : Estimated number of base pairs in the genome.
             0 : Total number of contained fragments not connected
                 by containment edges to essential fragments.
 Total rho    = 2510896132
 Total nfrags = 19143061
 Estimated genome length = 0
 Estimated global_fragment_arrival_rate=0.007624
 Computed global_fragment_arrival_rate =0.007624
 Total number of randomly sampled fragments in genome = 23326293
 Computed genome length  = 3059589120.000000
 Used global_fragment_arrival_rate=0.007624
 Used global_fragment_arrival_distance=131.164826
 
 Histogram of the number of base pairs in a chunk
 100292 - 159434:    22 
 90010 -  99906:     25 
 80043 -  89676:     73 
 70013 -  79966:    162 
 60010 -  69988:    389 
 50008 -  59983:    977 
 40000 -  49998:   2434 
 30000 -  39997:   6458 
 20000 -  29999:  18957 
 10000 -  19999:  57442
 Unitigs >=10kb
             NewAsm          UMd2Asm
 
 Number       86,939          57,204
 Mean         19,464          15,140
 Sum         1,692.1Mb       866.0Mb
 max         159,434bp      78,570bp
 Contigs >=10Kb:
           NewAsm          UMd2Asm
 n         42,343           45,958      
 mean      59,856           55,473
 sum        2,534.5Mb        2,549.4Mb
 Contigs >=100Kb: 
           NewAsm          UMd2Asm
 n          7,051            6,683         
 mean     163,170          162,357    
 sum        1,150.5Mb        1,085.0Mb
 max      627,705          742,802
 Scaffolds >=10Mb:
           NewAsm          UMd2Asm
 n             30                3
 mean       14.10Mb          11.36Mb
 sum       422.95Mb         340.70Mb
 max        26.54Mb          13.36Mb
CGW & ECR
- Checkpoints:
cat 7-0-CGW/bt.timing | grep ^Checkpoint Checkpoint 3 written during MergeScaffoldsAggressive at iteration 49 Checkpoint 4 written during MergeScaffoldsAggressive at iteration 85 Checkpoint 5 written after 1st Scaffold Merge Checkpoint 6 written after 2nd Aggressive Scaffold Merge Checkpoint 7 written after Final Rocks
cat 7-2-CGW/bt.timing | grep ^Checkpoint Checkpoint 19 written during MergeScaffoldsAggressive at iteration 12 Checkpoint 20 written during MergeScaffoldsAggressive at iteration 31 Checkpoint 21 written after 1st Scaffold Merge Checkpoint 22 written after 2nd Aggressive Scaffold Merge Checkpoint 23 written after Final Rocks cat 7-4-CGW/bt.timing | grep ^Checkpoint Checkpoint 34 written during MergeScaffoldsAggressive at iteration 12 Checkpoint 35 written during MergeScaffoldsAggressive at iteration 49 Checkpoint 36 written after 1st Scaffold Merge Checkpoint 37 written during Stones CleanupScaffolds after scaffold 32436 Checkpoint 38 written during Stones CleanupScaffolds after scaffold 34939 Checkpoint 39 written after Stone Throwing and CleanupScaffolds Checkpoint 40 written after 2nd Aggressive Scaffold Merge Checkpoint 41 written after Final Rocks
Checkpoint 42 written after Partial Stones Checkpoint 43 written after Final Contained Stones Checkpoint 44 written after resolveSurrogates
- Get early CTG/SCF stats
cat 7-CGW/bt.cgw_scaffolds | countMessages.pl ICL 451555 # ??? ICP 116455 # CTG ISF 66141 # SCF ISL 711 # SLK
- Clear read extension:
elem <0 0 >0 min max mean median n50 sum ClearORIG 35348776 4 0 35348772 -1147 1572 821 870 893 29045126663 ClearQLT 35348776 35348776 0 0 -1 -1 -1 -1 -1 -35348776 ClearVEC 35348776 299034 20323 35029419 -1 2043 952 953 975 33658445088 ClearOBTINI 35348776 0 31254 35317522 0 1364 831 879 902 29394688367 ClearOBT 35348776 0 31254 35317522 0 1318 794 854 877 28068197440
ClearECR1 35348776 0 31254 35317522 0 1329 794 854 877 28072014464 ClearECR2 35348776 0 31254 35317522 0 1329 794 854 877 28072365712
sum(ClearECR1)-sum(ClearUTG) = 3,817,024 sum(ClearECR2)-sum(ClearECR1)= 351,248
- Scaffold length stats:
cat 7-0-CGW/stat/final0.Scaffolds.nodelength.cgm | grep -v ^Sca | getSummary.pl -t 0 # 0,2,4 ...
step scaff min max mean med n50 sum 0 7048 2249 19719008 385020 21967 3114907 2713622175 2 4960 2249 21907006 540915 21181 4490171 2682939682 4 4006 2391 26541374 668427 29193 4590744 2677722052
- Last cgw
 cat 7-4-CGW/stat/final0.*Scaffolds.nodelength.cgm | grep -v ^Scaff | getSummary.pl -t scf
 cat 7-4-CGW/stat/final0.PlacedContig.n | grep -v ^Scaff | getSummary.pl -t scf
            elem       min        max        mean       med        n50        sum            
 scf        66141      432        26541374   42648      1347       4349378    2820819506     
 ctg        120461     65         627705     22421      2018       84989      2700959854
QC stats
- Bos_taurus.qc this assembly stats
- Bos_taurus.qc.combine UMD2 vs this assembly stats
TotalScaffolds=66,141 MaxBasesInScaffolds=26,048,998 MeanBasesInScaffolds=40,861 TotalContigsInScaffolds=120,461 MaxContigLength=627,911 MeanContigLength=22,436 TotalDegenContigs=269,031 MaxDegenContig=33,824 SingletonReads=3,721,123
- Posmap info
 cat bt.posmap.mates | awk '{print $3}' |count.pl -p 100
 good            10338164
 bothChaff       1160137
 oneChaff        695982
 oneSurrogate    233151
 bothDegen       218198
 diffScaffold    150423
 badShort        138464
 oneDegen        118232
 badLong         23196
 badSame         22451
 badOuttie       8751
 bothSurrogate   589
 total           13107738
 cat bt.posmap.frags | awk '{print $4,$5}' |count.pl  -p 100
 placed good             20676328
 placed notMated         8007072
 chaff bothChaff         2320274
 chaff notMated          704849
 placed oneChaff         695982
 chaff oneChaff          695982
 placed oneSurrogate     466302
 placed bothDegen        436396
 placed diffScaffold     300846
 placed badShort         276928
 placed oneDegen         236464
 placed badLong          46392
 placed badSame          44902
 placed badOuttie        17502
 placed bothSurrogate    1178
 total                   34927397
Log files
- Bt.runCA.log
- Bt.runCA.hourly.runtimes approximate running times (in hours)
Analysis
Insert libraries
1. BCM.WGS : ok
- FRG.mea: 1750-7000
- ASM.mea: 1594-6727
- Most libs have > 1000 reads & get reestimated
- All libs have ASM.std< ASM.mea/3
2. BCM.SHOTGUN
- only ~ 50% of the inserts are in libs with >1000 inserts and get reestimated by the assembly
- if the thold is dropped from 1000 to 100, we'd get ~ 95% of the inserts reestimated
elem <0 0 >0 min max mean median n50 sum 0 7339 0 0 7339 1 11237 245 135 1137 1799069 100 4361 0 0 4361 100 11237 395 157 1252 1725604 1000 440 0 0 440 1008 11237 2075 1791 2323 913086
3. NISC.SHOTGUN: ok
- Most libs have > 1000 reads & get reestimated
- All libs have ASM.std< ASM.mea/3
4. BCCAGSC.CLONEEND: ok
LIB.id FRG.mea FRG.std FRG.count CENTER.TYPE ASM.mea ASM.std 125606 150000 30000 59505 BCCAGSC.CLONEEND 161998 20133
5. UIUC.CLONEEND: ok
LIB.id FRG.mea FRG.std FRG.count CENTER.TYPE ASM.mea ASM.std 114892 150000 30000 31063 UIUC.CLONEEND 175594 41208 115020 150000 30000 15256 UIUC.CLONEEND 162488 26358
6. TIGR.CLONEEND: originally wrong; gets reestimated
LIB.id FRG.mea FRG.std FRG.count CENTER.TYPE ASM.mea ASM.std 65177 2000 600 27067 TIGR.CLONEEND 161761 34938
7. GSC.CLONEEND: not used (all 53556 are 0 qual)
8. CENARGEN.WGS: "not used" (all 26246 are unmated)
9. BARC.CLONEEND: each library contains 1 template id => inserts did not get reestimated (25454 reads/11151 inserts)
10. BCM cloneend: ok
LIB.id FRG.mea FRG.std FRG.count CENTER.TYPE ASM.mea ASM.std 19070 167000 25000 7103 BCM.CLONEEND 171244 18555
11. CENARGEN.CLONEEND: large stdev
LIB.id FRG.mea FRG.std FRG.count CENTER.TYPE ASM.mea ASM.std 17249 202000 20200 6269 CENARGEN.CLONEEND 158938 55165
12. UOKNOR.SHOTGUN: ok ?
LIB.id FRG.mea FRG.std FRG.count CENTER.TYPE ASM.mea ASM.std 15158 3000 1000 4910 UOKNOR.SHOTGUN 3000 1000
13. TIGR_JCVI.CLONEEND: originally wrong; gets reestimated
LIB.id FRG.mea FRG.std FRG.count CENTER.TYPE ASM.mea ASM.std 10691 2500 750 2763 TIGR_JCVI.CLONEEND 160363 29580 10738 2500 750 2040 TIGR_JCVI.CLONEEND 161915 29343
14. UOKNOR.FINISHING: only 151 reads
15. WUGSC.CLONEEND: only 49 reads
Contigs Vs UMD2 contaminants & Ecoli
4865 contigs in list.exclude_contigs.fa 34404 exclude-ctg.qry_hits 3763 exclude-ctg.ref_hits 1204 exclude-ctg.CBE.qry_hits CONTAIN|IDENTITY|BEGIN|END 748 exclude-ctg.CBE.ref_hits CONTAIN|IDENTITY|BEGIN|END
559 Ecoli.365350-365744-ctg.qry_hits : max ctg aligned is 179K bp; 10 are > 10K bp
Contigs Vs UMD2 chromosomes
- Split 120,461 contigs into 100 files; degeneartes not split
- Align them to the 31 chromosomes 1..30,U (ref) => 101*31 jobs
 #Alignment stats
 cat chr*ctg*delta | grep "^>" | awk '{print $2}' | count.pl -f ../9-terminator/bt.ctg.infoseq | getSummary.pl -i 1 -z 1
 ctg        0          1          >1         min        max        mean       med        n50        sum
 120461     652        37540      82269      0          176        11         3          33         1422808
#Unaligned ctg lengths ctg min max mean med n50 sum 652 65 5849 1146 1134 1194 747295
- 50% of the contigs aligned uniquely
 cat chr*-ctg*.delta | ~/bin/mergeDelta.pl   >  chr-ctg.delta
                                                             # degens? 
 delta-filter -q  chr-ctg.delta              >>  chr-ctg.filter-q.delta
cat chr1-*.delta | ~/bin/delta2cvg.pl -M 0 | getSummary.pl -i 4 elem 0 >0 min max mean med n50 sum 6681 1 6680 0 12892 366 142 1095 2450106
- There are disagreements:
/fs/sz-user-supported/Linux-x86_64/bin/show-coords -l -r -H chr1-ctg.filter-q.delta | p 'print $F[-1],"\n";' | count.pl | head ctg7180001761585 24 ... ctg7180001634116 7 ...
show-coords -d chr1-ctg.filter-q.delta | grep ctg7180001761585 | p 'print " $_";' 142115744 142188863 | 383463 310345 | 73120 73119 | 99.98 | 157714772 383463 | 0.05 19.07 | 1 -1 chr1 ctg7180001761585 142188878 142286012 | 310361 213223 | 97135 97139 | 99.94 | 157714772 383463 | 0.06 25.33 | 1 -1 chr1 ctg7180001761585 142287100 142287675 | 212133 211556 | 576 578 | 98.27 | 157714772 383463 | 0.00 0.15 | 1 -1 chr1 ctg7180001761585 142288052 142288602 | 211182 210633 | 551 550 | 99.09 | 157714772 383463 | 0.00 0.14 | 1 -1 chr1 ctg7180001761585 142288652 142295709 | 210586 203531 | 7058 7056 | 99.87 | 157714772 383463 | 0.00 1.84 | 1 -1 chr1 ctg7180001761585 142295709 142342174 | 203512 157047 | 46466 46466 | 100.00 | 157714772 383463 | 0.03 12.12 | 1 -1 chr1 ctg7180001761585 142346440 142367791 | 156958 135606 | 21352 21353 | 99.99 | 157714772 383463 | 0.01 5.57 | 1 -1 chr1 ctg7180001761585 142367822 142370681 | 135597 132737 | 2860 2861 | 99.93 | 157714772 383463 | 0.00 0.75 | 1 -1 chr1 ctg7180001761585 142370660 142382289 | 132746 121117 | 11630 11630 | 99.88 | 157714772 383463 | 0.01 3.03 | 1 -1 chr1 ctg7180001761585 142382282 142411927 | 120984 91339 | 29646 29646 | 99.96 | 157714772 383463 | 0.02 7.73 | 1 -1 chr1 ctg7180001761585 142411941 142419553 | 91339 83728 | 7613 7612 | 99.66 | 157714772 383463 | 0.00 1.99 | 1 -1 chr1 ctg7180001761585 142419553 142434546 | 83721 68728 | 14994 14994 | 99.79 | 157714772 383463 | 0.01 3.91 | 1 -1 chr1 ctg7180001761585 142434506 142437288 | 68778 65996 | 2783 2783 | 99.86 | 157714772 383463 | 0.00 0.73 | 1 -1 chr1 ctg7180001761585 142437389 142439015 | 66757 65131 | 1627 1627 | 99.94 | 157714772 383463 | 0.00 0.42 | 1 -1 chr1 ctg7180001761585 142439271 142440703 | 65629 64197 | 1433 1433 | 100.00 | 157714772 383463 | 0.00 0.37 | 1 -1 chr1 ctg7180001761585 142441869 142442975 | 63548 62442 | 1107 1107 | 100.00 | 157714772 383463 | 0.00 0.29 | 1 -1 chr1 ctg7180001761585 142446690 142449325 | 30312 32945 | 2636 2634 | 99.58 | 157714772 383463 | 0.00 0.69 | 1 1 chr1 ctg7180001761585 142451384 142452476 | 63510 64603 | 1093 1094 | 99.91 | 157714772 383463 | 0.00 0.29 | 1 1 chr1 ctg7180001761585 142452577 142454379 | 61000 62806 | 1803 1807 | 99.78 | 157714772 383463 | 0.00 0.47 | 1 1 chr1 ctg7180001761585 142454487 142456821 | 59122 61456 | 2335 2335 | 100.00 | 157714772 383463 | 0.00 0.61 | 1 1 chr1 ctg7180001761585 142458383 142459582 | 57978 59177 | 1200 1200 | 100.00 | 157714772 383463 | 0.00 0.31 | 1 1 chr1 ctg7180001761585 142459738 142472295 | 32272 44828 | 12558 12557 | 99.92 | 157714772 383463 | 0.01 3.27 | 1 1 chr1 ctg7180001761585 142472300 142485640 | 44828 58163 | 13341 13336 | 99.89 | 157714772 383463 | 0.01 3.48 | 1 1 chr1 ctg7180001761585 142501686 142530021 | 28336 1 | 28336 28336 | 99.99 | 157714772 383463 | 0.02 7.39 | 1 -1 chr1 ctg7180001761585
show-coords -d chr1-ctg.filter-q.delta | grep ctg7180001634116 116312 162914 | 1 46603 | 46603 46603 | 99.99 | 157714772 122722 | 0.03 37.97 | 1 1 chr1 ctg7180001634116 164916 201988 | 58062 95135 | 37073 37074 | 99.99 | 157714772 122722 | 0.02 30.21 | 1 1 chr1 ctg7180001634116 203244 213377 | 48198 58331 | 10134 10134 | 100.00 | 157714772 122722 | 0.01 8.26 | 1 1 chr1 ctg7180001634116 261393 264506 | 45949 49062 | 3114 3114 | 100.00 | 157714772 122722 | 0.00 2.54 | 1 1 chr1 ctg7180001634116 264607 268579 | 94345 98317 | 3973 3973 | 100.00 | 157714772 122722 | 0.00 3.24 | 1 1 chr1 ctg7180001634116 268586 274734 | 98323 104471 | 6149 6149 | 100.00 | 157714772 122722 | 0.00 5.01 | 1 1 chr1 ctg7180001634116 274835 293945 | 103611 122722 | 19111 19112 | 99.99 | 157714772 122722 | 0.01 15.57 | 1 1 chr1 ctg7180001634116
 ~/bin/delta2breaks.pl -m 200 < chr1-ctg.filter-q.delta | awk '{print $8}' | count.pl
 AGREEMENT       9827
 INVERSION       283
 TRANSLOCATION+  230
 TRANSLOCATION-  154
 
 ~/bin/delta2breaks.pl -m 1000 < chr1-ctg.filter-q.delta | awk '{print $8}' | count.pl
 AGREEMENT       7564
 INVERSION       216
 TRANSLOCATION+  192
 TRANSLOCATION-  127
 
 ~/bin/delta2breaks.pl -m 10000 < chr1-ctg.filter-q.delta | awk '{print $8}' | count.pl
 AGREEMENT       3394
 INVERSION       62
 TRANSLOCATION+  50
 TRANSLOCATION-  29
Assembly UMD2.2 (Quality reads)
- Try to add the missing BCM.SHOTGUN reads at the assembly
- Assign new BCM.SHOTGUN library ID's base on volume & SEQ_LIB_ID : same library might have different insert size in different volume => might loose some correct mates from different volumes
cat bos_taurus.summary | grep BCM | grep SHOTG | cut -f6,7,8,10 | sort | more FAAEP 180000 13000 252 FAAEP 2000 1000 84 ... FAAHP 180000 13000 77 FAAHP 2000 1000 230 ...
- => 20,538 libraries out of which 18,208 contain mated reads
- create DST messages & add them to gkpStore
gatekeeper -a -o bt.gkpStore -T -F bos_taurus.BCM.SHOTGUN.new.DST
- generate gatekeeper edit file that maps each TI to the new library id
head bos_taurus.BCM.SHOTGUN.new.ti2libinfo.edit frg uid 499507131 libuid 601081 frg uid 499507132 libuid 601081 ...
- generate gatekeeper edit file that deletes all mate information
head bos_taurus.BCM.SHOTGUN.new.mate.delete frg uid 500086180 mateuid 0 frg uid 500084310 mateuid 0 ...
- pair forward/reverse read that have the same new library id, same TEMPLATE_ID
head bos_taurus.BCM.SHOTGUN.new.mate.edit frg uid 583866821 mateuid 583872364 frg uid 583866822 mateuid 583872408 ...
- run gatekeeper --edit for each edit/delete file
gatekeeper --edit ... bt.gkpStore
- restart assembly at cgw (doExtendClearRanges=1)
- consensus after cgw failed on job 25 on CTG 5597062 : cannot create consensus from multialignment ...
 Fix: delete failed message
 cp bt.cgw_contigs.25 bt.cgw_contigs.25.FAILED
 delete "{ICM acc:5597062 pla:P len:20889 ..." from bt.cgw_contigs.25
- terminator fail; message:
ICL: reference before definition error for contig ID 5597062
Assembly UMD2.3 (2009_0210_CA; all reads)
- 35,973,728 reads : 35,348,776 quality & 624,952 quality-less
- 16,896,244 mates
- 25,312 libraries
Issues (not solved):
- 10420 contain at least 1 "NN" in their clr (50.. min(len,600))
- 5973 contain at least 1 "NNN" in their clr (50.. min(len,600))
Quality-less clrs
- 624,952 quality-less reads
- Quality-less read stats: : alignment CLR or 50..min(len,600) trimming
elem min max mean median n50 sum len 624952 5 1495 887 947 961 554429198 5 624952 6 1584 51 51 51 32150411 3 624952 5 1495 695 699 699 434960697 53 624952 -1579 1444 644 648 648 402810286
- Align 624,952 to the 120,461 Assembly1 contigs (no degenerates) : 1 day on 13 cpus
- 572,140(91.5%) reads aligned and 52,812(8.5%) did not align to the contigs
1. Launch jobs in parallel: 12766 jobs on 13 processors
nucmer -l 50 -c 200 -b 10 -g 5 -d 0.05 bt.ctg.001.fasta bos_taurus.0qual.01.seq -p ctg.001-seq.01 ... nucmer -l 50 -c 200 -b 10 -g 5 -d 0.05 bt.ctg.982.fasta bos_taurus.0qual.13.seq -p ctg.001-seq.01
- CPU usage: 100% /job
- Max mem usage: 0.1% /job
2. Get maximum extended clrs
cat *delta | ~/bin/delta2qryClr.pl -best | sort > bos_taurus.0qual.best.clr
 Length stats
            elem       min        max        mean       median     n50        sum
 all        624952     5          1495       887        947        961        554429198
 aligned    572140     221        1416       912        953        964        522281354
 unaligned  52812      5          1495       608        580        754        32147844
 Best/Max/Max+extended alignment coord stats:
            elem       min        max        mean       median     n50        sum
 53.best    572140     94         1208       766        841        877        438793102
 53.max     572140     170        1208       794        863        888        454816817
 53.extend  572140     170        1208       797        865        889        456014184
 Unaligned read counts:                           
                          unaligned    total   quality   quality-less
 BCM.WGS                  42595
 UOKNOR.SHOTGUN           5787         14651   2456      12195
 GSC.CLONEEND             2294         53521   0         53521
 BCCAGSC.CLONEEND         1869         125241  116484    8757
 BCM.SHOTGUN              186
 UOKNOR.FINISHING         81
- 52,812 quality-less unaligned reads to the contigs using less strict nucmer parameters: -l 30 -c 50 -b 50 -g 50 -d 0.12
- 9,269 reads aligned at an average 92% identity (min 81% identity) : not too good
3. Get reads without clrs: set their clr to maximum 50..600
cp bos_taurus.0qual.extended.clr bos_taurus.0qual.clr difference.pl bos_taurus.0qual.infoseq bos_taurus.0qual.extended.clr | perl -ane '$three=600; $three=$F[1] if ($F[1]<600); print "$F[0] 50 $three\n";' >> bos_taurus.0qual.clr
Quality clrs
- Use Assembly1 OBT clrs
- Delete reads deleted in the OBT process
Gatekeeper
Load order:
- Add quality FRG : "gatekeeper -T -F ..."
- Add quality-less FRG "gatekeeper -F -a ..." # -T should be removed
- Delete quality FRG (deleted by UMD2.1 OBT)
- Add DST
- Add LKG
Edit
- Loads clrs
- Loads clvs
- Loads nonrandom info
Meryl
Use Assembly1 kmer counts
Overlapper
- Use 80/90 Assembly1 overlap results
- Rerun 10 overlap jobs
- 96.64% of the quality-less reads have overlaps (vs 98.33% of the quality reads)
reads 0ovl 1+ovl min max mean median n50 sum 0qual(all) 624831 20941 603890 0 4350 96 19 740 60494730 # 96.64% 0qual(unaligned) 52691 15384 37307 0 3229 50 5 349 2655545 # 70.80%
Unitigger
- More unitigs, more bases in unitigs
- Few of the longest unitigs got broken: Example 138,294(UMD2.3) vs 159,434(UMD2.1)
 UNITIG OVERLAP GRAPH INFORMATION
 
       5333434 : Total number of unitigs
       2595174 : Total number of singleton, contained unitigs
       1865473 : Total number of singleton, non-contained unitigs
        183693 : Total number of non-singleton, spanned unitigs
        689094 : Total number of non-singleton, non-spanned unitigs
      35551316 : Total number of fragments
      35551316 : Total number of fragments in all unitigs
      21830994 : Total number of essential fragments in all unitigs
      13720322 : Total number of contained fragments in all unitigs
  0.0077856472 : Randomly sampled fragment arrival rate per bp
    2514833413 : The sum of overhangs in all the unitigs
    6483064813 : Total number of bases in all unitigs
             0 : Estimated number of base pairs in the genome.
             0 : Total number of contained fragments not connected
                 by containment edges to essential fragments.
 Total rho    = 2514833413
 Total nfrags = 19579606
 Estimated genome length = 0
 Estimated global_fragment_arrival_rate=0.007786
 Computed global_fragment_arrival_rate =0.007786
 Total number of randomly sampled fragments in genome = 23870254
 Computed genome length  = 3065930496.000000
 Used global_fragment_arrival_rate=0.007786
 Used global_fragment_arrival_distance=128.441474
  
 Histogram of the number of base pairs in a chunk
 100406 - 138294:    21
 90330 -  99887:     23
 80042 -  89675:     79
 70014 -  79943:    169
 60002 -  69792:    374
 50000 -  59982:   1008
 40002 -  49995:   2440
 30001 -  39994:   6509
 20000 -  29999:  18989
 10000 -  19999:  57404
Consensus after unitigger
Problems:
- job 120 executed partially (see bt_120.cgi_tmp); Solution: split into 3 parts, run separately, merge results
- failed on 19 unitigs (587..7447 bp)
rm 5-consensus/*failed touch 5-consensus/consensus.success
Cgw
- Failure 1 : because job 120 was run partially => missing mates
- Failure 2 : because of /5-consensus/FAILED/bt_???.cgi.failed => missing mates => delete 356 mates
Error: ProcessFrags()-- WARNING! fragiid=35973388,index=33600942 mateiid=35973363,index=0 -- MATE DOESN'T EXIST! cgw: Input_CGW.c:117: ProcessFrags: Assertion `err == 0' failed. Fix: cat cgw.out | grep MATE | p '/mateiid=(\d+)/; print $1,"\n";' >! cgw.out.mateiid gatekeeper -dumpfragments -tabular -iid cgw.out.mateiid bt.gkpStore/ | cut -f1,3 | ~/bin/mate2lkg.pl -a D >! cgw.out.delete.LKG gatekeeper -a -o bt.gkpStore -T -F -L cgw.out.delete.LKG
- Failure 3: because of cgwOutputIntermediate=1
Try to restart from ckp : die with assertion failure cgw -y -R 8 -N 12 -j 1 -k 5 -r 5 -s 2 -S 0 -z -m 100 -g ./bt.gkpStore -o ./7-0-CGW.8_12/bt ./5-consensus/bt_001.cgi cgw -y -R 8 -j 1 -k 5 -r 5 -s 2 -S 0 -z -m 100 -g ./bt.gkpStore -o ./7-0-CGW.8_12/bt ./5-consensus/bt_001.cgi Fix: Restart cgw from the beginning
- cgw does update bt.SeqStore - OpenSequenceDB()
ECR (eventually skipped)
- Failed after running for 1 day
/fs/szdevel/dpuiu/SourceForge/wgs-5.2/Linux-amd64/bin/extendClearRanges -g ./bt.gkpStore -n 15 -c bt -b 146216 -e 167100 -i 1 > 7-1-ECR/extendClearRanges-scaffold.146216.err sh: line 1: 17016 Aborted
- Last ckp : bt.ckp.15
- Try to fix:
touch 7-1-ECR/cgw.success runCA "doExtendClearRanges = 1"
- Runs too slow !!!
- Can specify a scaffold range to process: -b ? -e ? => ckp files; could we merge them?
- Failed after running for 1 day
Consensus after CGW
- Failed on job 56
tail 8-consensus/bt.cns_contigs.56.err ... Could (really) not find overlap between 153923 (U) and 2508303 (R) estimated ahang: 0 (ejecting frag 2508303 from contig) consensus: math_AS.h:51: ceil_log2: Assertion `x > 0' failed.
cat 7-CGW/bt.cgw_contigs.56 | countMessages.pl ICM 440 IMP 281412 IUP 12715 cat 8-consensus/bt.cns_contigs.56_tmp | countMessages.pl ICM 115 IMP 103322 IMV 8122 IUP 4849
- Fix: split ICM messages 1..115,116,116+ and run consensus on each set
QC
elem min max mean med n50 sum scf 56891 407 33129045 50871 1378 4716077 2894145150 ctg 122851 64 651167 21957 3647 71561 2697514858 deg 268237 65 30246 1019 985 997 273575106
- Compared with UMD2.1 : better scaffols, worse contigs & unitigs
Analysis
Issues:
- Identify bacterial & mito contigs: mito seq
- Align ctg°en to UMD2 chromosomes
- the chromosomes should have no 0cvg regions
- possible inversions, translocations (UMD2 used markers)
- if align breaks/indels, which assembly is correct?
 
Assembly UMD2.4 (2004_0217_CA; All reads)
- 35,973,728 reads : 35,348,776 quality & 624,952 quality-less
- 16,896,244 mates
- 344 libraries
Fix quality-less read clrs (N's) (temporary solution)
- 10420 contain at least 1 "NN" in their clr (50.. min(len,600))
- 5973 contain at least 1 "NNN" in their clr (50.. min(len,600))
Fix:
frg2seq.pl < bos_taurus.0qual.frg > bos_taurus.0qual.seq fasta2qual.pl bos_taurus.0qual.seq > ! bos_taurus.0qual.qual
 lucy \
    -o bos_taurus.0qual.lucy.seq  bos_taurus.0qual.lucy.qual \
    -debug  bos_taurus.0qual.lucy.info \
    bos_taurus.0qual.seq bos_taurus.0qual.qual
cat bos_taurus.0qual.lucy.info | cut -f1,3,4 -d ' ' | sort >! bos_taurus.0qual.lucy.clr
- 624,952 quality-less reads
- Quality-less read stats: 50..min(len,600) & lucy trimming
elem 0 >0 min max mean med n50 sum 5 624952 2857 622095 0 501 52 52 52 33012433 3 624952 2857 622095 0 600 579 600 600 361980208 53 624952 2857 622095 0 548 526 548 548 328967775
Fix quality-less read clrs (low complexity)
- Run dust filter on seq (before qual & lucy)
elem 0 >0 min max mean med n50 sum 5 624952 3564 621388 0 501 75 52 52 47385578 3 624952 3564 621388 0 600 554 600 600 346470473 53 624952 3564 621388 0 548 478 548 548 299084895
- Merge dust.lucy clrs with the alignment clrs
elem 0 >0 min max mean med n50 sum 5 624952 4488 620464 0 599 93 52 126 58378496 3 624952 4488 620464 0 600 547 600 600 342258160 53 624952 4488 620464 0 548 454 512 548 283879664
- Test seq
 gatekeeper -dumpfastaseq -b 35348777 -e  35973728 bt.gkpStore  | grep NNN
 gatekeeper -dumpfastaseq  bt.gkpStore   | perl -ane 'if(/^>(\d+)/) { $id=$1} elsif(/NNN/) { print $id,"\n";} ' | uniq -c | awk '{print $2,$1}'  > bt.NNN.seqs  # 2411 seqs (all have the N's "in the middle")
 gatekeeper -dumpfastaseq -uid bt.NNN.seqs bt.gkpStore >  bt.NNN.cseqs
Consolidate libraries
Drop from 25,312 to 344 libs
BCM.SHOTGUN
UMD2.4 reestimated 10,117 out of 13,826 libs (have > 100mates)
Base on initial estimates
- Reduce the total number from 13826 to 2 libs: 3000 & 6000
- UMD2.3 mean estimates (Initial vs Final):
meanI #libs minF maxF meanF medF n50F sumF uid 180000 436 1636 5199 2475 2410 2458 1079407 #3000 167000 86 1585 2948 2264 2258 2285 194775 #3000 6500 31 5212 6636 5837 5867 5924 180951 #6000 6000 11 4556 6272 5389 5421 5421 59286 #6000 3500 949 1670 4769 2668 2608 2645 2532027 #3000 3000 2511 1483 5250 2715 2662 2723 6818678 #3000 2000 6093 1157 6443 2526 2487 2554 15391160 #3000
Base on final estimates
- Reduce the total number from 13826 to 7 libs: 6500,5500,...1500, un-estimates (2501)
meanF #libs min max mean med n50 sum uid(new) mean(new) std(new) 6K<=mea<7K 15 6010 6636 6176 6159 6159 92650 6500 6500 5K<=mea<6K 29 5121 5985 5540 5536 5577 160673 5500 5500 4K<=mea<5K 67 4017 4939 4284 4266 4274 287072 4500 4500 3K<=mea<4K 1401 3000 3998 3276 3209 3226 4590323 3500 3500 2K<=mea<3K 7998 2000 2999 2502 2498 2532 20017767 2500 2500 1K<=mea<2K 607 1157 1999 1825 1882 1890 1107798 1500 1200 un-estimated 3709 2501 2501
BARC.CLONEEND
Collapse all 11150 into 1:
uid:25456 mea:165000 std:43000
Overlapper
- Quality-less reads overlaps: fewer than in the UMD2.3 assembly
elem 0 >0 min max mean med n50 sum 0qual(all) 624830 35692 589138 0 3237 60 14 439 37578899 # 94.39%
Unitigger
 UNITIG OVERLAP GRAPH INFORMATION    
       5356408 : Total number of unitigs
       2613795 : Total number of singleton, contained unitigs
       1870448 : Total number of singleton, non-contained unitigs
        182878 : Total number of non-singleton, spanned unitigs
        689287 : Total number of non-singleton, non-spanned unitigs
      35547861 : Total number of fragments
      35547861 : Total number of fragments in all unitigs
      21685943 : Total number of essential fragments in all unitigs
      13861918 : Total number of contained fragments in all unitigs
  0.0077797328 : Randomly sampled fragment arrival rate per bp
    2513424271 : The sum of overhangs in all the unitigs
    6468428782 : Total number of bases in all unitigs
             0 : Estimated number of base pairs in the genome.
             0 : Total number of contained fragments not connected
                 by containment edges to essential fragments.
 Total rho    = 2513424271               
 Total nfrags = 19553770
 Estimated genome length = 0
 Estimated global_fragment_arrival_rate=0.007780
 Computed global_fragment_arrival_rate =0.007780     
 Total number of randomly sampled fragments in genome = 23868770
 Computed genome length  = 3068070656.000000          
 Used global_fragment_arrival_rate=0.007780             
 Used global_fragment_arrival_distance=128.539119
Histogram of the number of base pairs in a chunk
100292 - 138301:     19
 90052 -  99906:     23
 80043 -  89676:     79
 70013 -  79966:    164
 60010 -  69988:    390
 50008 -  59983:    949
 40000 -  49998:   2433
 30000 -  39997:   6437
 20000 -  29999:  18808
 10000 -  19999:  57634
Bog
!!! Much bigger unitigs than default unitigger
Global Arrival Rate: 0.013829 212260 - 224992: 4 100099 - 186873: 372 90015 - 99973: 353 80045 - 89988: 582 70011 - 79999: 1084 60000 - 69994: 1856 50001 - 59996: 3162 40002 - 49994: 5407 30000 - 39999: 9767 20000 - 29996: 18981 10000 - 19999: 39641
Consensus after Unitigger
- Failed on jobs 120 & 121 ( _tmp file)
cat 4-unitigger/*120* | countMessages.pl IMP 280264 IUM 124707 cat 4-unitigger/*121* | countMessages.pl IMP 282146 IUM 245650
cat 5-consensus/bt_120.cgi_tmp | countMessages.pl IMP 34348 IUM 19222 cat 5-consensus/bt_121.cgi_tmp | countMessages.pl IMP 51833 IUM 16805
- Fix 120: split IUM messages
extractfromfrgMSG.pl -b 0 -e 19222 bt_120.cgb.orig IUM >! bt_120.cgb & extractfromfrgMSG.pl -b 19222 bt_120.cgb.orig IUM >! bt_120.cgb &
- Fix 121: remove assertion in AS_CNS/MultiAlignment_CNS.c
 if(to <= from || to > ma_length-1){
   fprintf(stderr, "AbacusRefine range (to) invalid");
   //assert(0); 
 }
CGW
- Failed after Ckp3(7-0-CGW/bt.ckp.3; MergeScaffoldsAggressive 2nd itteration)
CI extends beyond end of scaffold! offsetAEnd = 254204 offsetBEnd = 252250 scaffoldLength = 253268 cgw: CIScaffoldT_Merge_CGW.c:307: InsertScaffoldContentsIntoScaffold: Assertion `0' failed.
- Last cgw
 Scaffold lengths:
 cat 7-4-CGW/stat/final0.*Scaffolds.nodelength.cgm | grep -v ^Scaff | getSummary.pl -t scf
 cat 7-4-CGW/stat/final0.PlacedContig.n | grep -v ^Scaff | getSummary.pl -t scf
            elem       min        max        mean       med        n50        sum            
 scf        45826      385        34263871   59591      1349       7059820    2730853790     
 ctg        96562      65         738899     27789      3657       93988      2683452359
 Library insert estimates:
 cat 7-4-CGW/stat/scaffold_final.distupdate.dst | grep ^# | awk '{print $3,int($8),int($10)}' > 7-4-CGW/bt.dst
 join2.pl bt.dst 7-4-CGW/bt.dst | p 'print join "\t",@F[0,1,2,5,6,3,4]; print "\n";' > bt.dst.combine
CLONEEND inserts: UID MEANI STDI MEANF STDF COUNT LIB 114892 150000 30000 175701 40732 31063 UIUC.CLONEEND 19070 167000 25000 171349 18253 7103 BCM.CLONEEND 118 167000 16700 167000 16700 21 WUGSC.CLONEEND 25456 165000 43000 163044 25849 11150 BARC.CLONEEND 115020 150000 30000 162719 25343 15256 UIUC.CLONEEND 65177 2000 600 162540 34155 27067 TIGR.CLONEEND 125606 150000 30000 162396 19319 59505 BCCAGSC.CLONEEND 10738 2500 750 162386 27567 2040 TIGR_JCVIJTC.CLONEEND 10691 2500 750 161540 28239 2763 TIGR_JCVIJTC.CLONEEND 17249 202000 20200 157496 55375 6269 CENARGEN.CLONEEND 54017 120000 12000 115671 27594 25889 GSC.CLONEEND total 188126 CLONEENDs
Consensus
- Failed on job 34 with segmentation fault
- 9kbp contig, made out of 3007 reads (24 of which are quality-less)
 cat 7-CGW/bt.cgw_contigs.34.1 | grep "^{" | uniq -c | awk '{print $2,$1}'
 {ICM 1
 {IMP 3007
 {IUP 329
- Fix : edit AS_CNS/MultiAlignment_CNS.c; add
 if(!ungappedSequence->Elements) { ungappedSequence->numElements=0; }
 if(!ungappedQuality->Elements) { ungappedQuality->numElements=0; }
Analysis
Contigs Vs possible contaminants
- nucmer alignment parameters: -l 40 -c 100 -b 10 -g 5 -d 0.05
- have to redo alignments using -maxmatch !!!
- file location:
reference seqs: /nfshomes/dpuiu/db/Ecoli.365350-365744 # Ecoli K12 region with most alignments (BCM WGS splice site) /nfshomes/dpuiu/db/Ecoli # Ecoli K12 substrain MG1655 (NC_000913 ; 1st completed) /nfshomes/dpuiu/db/Ecoli.all # 22 Ecoli completed genomes ( + plasmids) /nfshomes/dpuiu/db/UniVec_Core # UniVec Core seqs /nfshomes/dpuiu/db/OtherVec # 100 other vector sequences identified by aligning UMD2.0 contaminants to GenBank; align also to 110 UniVec core using nucmer (params above)
/nfshomes/dpuiu/db/bos_taurus.UMD2.contaminant.fasta # 4813 whole contigs and 30329 contig regions identified by NCBI as UMD2 contamination /nfshomes/dpuiu/db/bos_taurus.UMD2.contaminant.organism_count # organism counts: vector is the most abundant /nfshomes/dpuiu/db/bos_taurus.UMD2.contaminant.infoseq # grep -v 'coli|vector|7180003101029' => 905 other contamiants query seqs: /scratch1/bos_taurus/Assembly/2009_0217_CA/9-terminator/ctg.split100/*fasta # latest assembly contigs (no degenerates) delta files: /scratch1/bos_taurus/Assembly/2009_0217_CA/nucmer_ctg/no_maxmatch/*delta
Ecoli K12 substrains:
NC_010473.1 4686137 50.78 Escherichia coli str. K-12 substr. DH10B, complete genome NC_000913.2 4639675 50.79 Escherichia coli str. K-12 substr. MG1655, complete genome AC_000091.1 4646332 50.80 Escherichia coli str. K-12 substr. W3110, complete genome
no maxmatch
- fewer alignments in UMD2.4 than in UMD2
UMD2 (all): just a few degens
15102 Ecoli.365350-365744-ctg.qry_hits 15943 Ecoli-ctg.qry_hits 17308 Ecoli.all-ctg.qry_hits 79065 UMD2.contaminant-ctg.qry_hits # 55877 new hits 20105 UMD2.contaminant-ctg.CBE.qry_hits # CONTAIN|BEGIN|END|IDENTITY 19839 UniVec_Core-ctg.qry_hits
UMD2.4
559 Ecoli.365350-365744-ctg.qry_hits 1215 Ecoli-ctg.qry_hits 2767 Ecoli.all-ctg.qry_hits # most 2 frequenct starins are UMN026 & ATCC 8739; K12 DH10B is rank 5th; K12 MG1655 is ranked 19th (out 31 seqs) 44112 UMD2.contaminant-ctg.qry_hits 5286 UniVec_Core-ctg.qry_hits
Length of the reference seqs used for screening:
#seqs min max mean med n50 sum Ecoli.365350-365744 1 395 395 395 395 395 395 # Ecoli K12 regions with most alignments (BCM WGS splice site) Ecoli 1 4639675 4639675 4639675 4639675 4639675 4639675 # Ecoli K12 substrain MG1655 Ecoli.all 49 3306 5572075 2293320 130440 5065741 112372708 # 22 Ecoli's UniVec_Core 1348 12 48551 243 98 967 327641 OtherVec 100 1702 739874 15419 5027 166744 1541984
UMD2.contaminant 35142 48 16661 512 362 674 18022349
Length of UMD2.4 contigs that contain contaminant (0+ bp from end):
#ctgs <2000bp >=2000bp min max mean med n50 sum Ecoli.365350-365744-ctg 559 534 25 1001 179527 2467 1341 1894 1379440 Ecoli-ctg 1215 1086 129 1001 360312 4326 1347 71372 5256540 Ecoli.all-ctg 2767 2455 312* 1001 453627* 8031 1366 134516 22224468 UniVec_Core-ctg 5286 4718 568* 882 651163* 9820 1337 136090 51909339 UMD2.contaminant-ctg.CBE 4976 4410 566* 738 651163* 8497 1339 122281 42281715 #annotated alignments: CONTAIN|BEGIN|END|IDENTITY UMD2.contaminant-ctg 44112 12813 31299 268 739442 50591 27461 111598 2231701788
Length of UMD2.4 contigs that contain contaminant in the middle (500+ bp from end):
#ctgs <2000bp >=2000bp min max mean med n50 sum Ecoli.365350-365744-ctg 144 136 8 1286 2053 1779 1811 1814 256259 Ecoli-ctg 171 152 19 1286 4703 1835 1807 1821 313820 Ecoli.all-ctg 197 160 37*(81) 1228 351373* 6516 1815 125069 1283728 #81 2K+ ctgs using -maxmatch UniVec_Core-ctg 1278 1110 168*(276) 1085 651163* 12266 1496 160336 15676765 #276 2K+ ctgs using -maxmatch UMD2.contaminant-ctg.CBE 52 25 27* 1249 351373* 22195 2054 125069 1154142 #annotated alignments: CONTAIN|BEGIN|END|IDENTITY UMD2.contaminant-ctg 31019 1437 29582 1113 739442 70665 50798 113684 2191986214
Length of the UMD2.4 contaminant seqeunece (0+ bp from end):
#align <200bp >=200bp min max mean med n50 sum Ecoli.365350-365744-ctg 1066 537 529 104 225 192 162 224 205379 Ecoli-ctg 1793 587 1206 50 4440 496 224 994 889798 Ecoli.all-ctg 4074 1132 2942* 40 17075* 380 254 441 1551783 UniVec_Core-ctg 14425 9819 4606* 40 1801* 236 162 325 3409187 UMD2.contaminant-ctg 144843 96008 48835 40 16661 199 169 209 28912002
Length of the UMD2.4 contaminant seqeunece (500+ bp from end)
alignm <200bp >=200bp min max mean med n50 sum Ecoli.365350-365744-ctg 243 136 107 162 224 189 162 224 46000 Ecoli-ctg 273 149 124 106 1341 219 162 224 59923 Ecoli.all-ctg 294 153 141* 106 2150* 251 162 224 73992 UniVec_Core-ctg 2144 2035 109* 50 1340* 122 121 121 261821 UMD2.contaminant-ctg 121331 86985 34346 40 2738 171 162 184 20753580
- Problem: 8 long ctgs contain Ecoli in the middle (1000+ bp from end)
show-coords Ecoli.all-ctg.filter-q.delta | ~/bin/filterQryCoords.pl -i 1000 | sort -nk13 -r
[S1] [E1] | [S2] [E2] | [LEN 1] [LEN 2] | [% IDY] | [LEN R] [LEN Q] | [COV R] [COV Q] | [TAGS] =============================================================================================================================== 4640589 4641890 | 161712 160411 | 1302 1302 | 99.46 | 4686137 351373 | 0.03 0.37 | gi|170079663|ref|NC_010473.1| ctg7180001872124 5068908 5069620 | 87386 86679 | 713 708 | 98.88 | 5209548 91972 | 0.01 0.77 | gi|218687878|ref|NC_011745.1| ctg7180002055226 3087480 3088683 | 50423 51620 | 1204 1198 | 99.00 | 5202090 88182 | 0.02 1.36 | gi|218703261|ref|NC_011751.1| ctg7180002054092 4640580 4641890 | 19953 18646 | 1311 1308 | 99.08 | 4686137 31157 | 0.03 4.20 | gi|170079663|ref|NC_010473.1| ctg7180001875158 131462 133564 | 1247 3349 | 2103 2103 | 98.19 | 241387 5751 | 0.87 36.57 | gi|157412014|ref|NC_009838.1| ctg7180002043242 82801 83166 | 2986 2621 | 366 366 | 98.09 | 241387 4709 | 0.15 7.77 | gi|157412014|ref|NC_009838.1| ctg7180001714551 82264 82793 | 3523 2994 | 530 530 | 98.49 | 241387 4709 | 0.22 11.26 | gi|157412014|ref|NC_009838.1| ctg7180001714551 1652253 1652545 | 1487 1195 | 293 293 | 98.63 | 4700560 2492 | 0.01 11.76 | gi|218552585|ref|NC_011741.1| ctg7180001754941
- Regions present in DH10B but not MG1655
delta2cvg -M 0 < DH10B-MG1655.delta gi|170079663|ref|NC_010473.1| 1349629 1378243 28614 0 gi|170079663|ref|NC_010473.1| 1391006 1396986 5980 0 gi|170079663|ref|NC_010473.1| 3199469 3200798 1329 0 gi|170079663|ref|NC_010473.1| 3211928 3213257 1329 0 gi|170079663|ref|NC_010473.1| 4640588 4641918 1330 0 !!!
- Problem: 10 long ctgs contain Vector in the middle (1000+ bp from end)
show-coords UniVec_Core-ctg.filter-q.delta | ~/bin/filterQryCoords.pl -i 1000 | sort -nk13 -r
   [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS]
===============================================================================================================================
      1      121  |   215495   215615  |      121      121  |    99.17  |      170   271477  |    71.18     0.04  | gnl|uv|U09128.1:15891-16011-49     ctg7180002047604       # pSacBII P1 cloning vector  
   2252     2435  |     1334     1151  |      184      184  |   100.00  |     2485   160336  |     7.40     0.11  | gnl|uv|U75992.1:16925-19409        ctg7180001808271
    180      312  |     1153     1020  |      133      134  |    99.25  |      312   160336  |    42.63     0.08  | gnl|uv|NGB00145.1:2378-2689        ctg7180001808271
      1      121  |     1367     1487  |      121      121  |   100.00  |      170   160336  |    71.18     0.08  | gnl|uv|U09128.1:15891-16011-49     ctg7180001808271
      1      103  |     1286     1388  |      103      103  |   100.00  |      103   160336  |   100.00     0.06  | gnl|uv|U80929.2:11415-11517        ctg7180001808271        [CONTAINED]
      4      121  |    68269    68386  |      118      118  |   100.00  |      170   111913  |    69.41     0.11  | gnl|uv|U09128.1:15891-16011-49     ctg7180002052060
     40      152  |    30255    30142  |      113      114  |    99.12  |     1663    42854  |     6.79     0.27  | gnl|uv|U09128.1:1-1663             ctg7180002053344
      1      121  |    34358    34238  |      121      121  |   100.00  |      170    35471  |    71.18     0.34  | gnl|uv|U09128.1:15891-16011-49     ctg7180002046164
      1      103  |    34439    34337  |      103      103  |   100.00  |      103    35471  |   100.00     0.29  | gnl|uv|U80929.2:11415-11517        ctg7180002046164        [CONTAINED]
     46     1385  |     8928    10267  |     1340     1340  |   100.00  |     1413    17587  |    94.83     7.62  | gnl|uv|X65279.1:5941-7353          ctg7180002043597        [CONTAINED]
- ctg7180001872124 : 351373 bp; region 160411..161712 contaminated by Ecoli
cat 9-terminator/bt.posmap.utgctg | grep 7180001872124 | wc -l # 329
cat 9-terminator/bt.posmap.utgctg | grep 7180001872124 | perl -ane '@F[2,3]=@F[3,2] if($F[2]>$F[3]); print $_ if($F[2]<160411 and 160411<$F[3] or $F[2]<161712 and 161712<$F[3]);' 7180000441625 7180001872124 159483 161201 r 7180000441788 7180001872124 160330 161329 f #Ecoli 7180000442730 7180001872124 160368 161010 r #Ecoli 7180000441635 7180001872124 160740 162700 f #Ecoli
cat 9-terminator/bt.utg.info utg7180000441625 length=1715 num_frags=12 Astat=7.00 utg7180000441788 length=999 num_frags=1 Astat=0.00 utg7180000442730 length=640 num_frags=1 Astat=0.00 utg7180000441635 length=1957 num_frags=9 Astat=7.00
cat 9-terminator/bt.posmap.frgctg | perl -ane '@F[2,3]=@F[3,2] if($F[2]>$F[3]); print $_ if($F[2]<160411 and 160411<$F[3] or $F[2]<161712 and 161712<$F[3]);' 1237446426 7180001872124 160117 161201 f 1238816835 7180001872124 160133 160993 f 1238817728 7180001872124 160322 161123 r 1244436200 7180001872124 159976 160984 f 1238817676 7180001872124 160105 160890 r 1237443253 7180001872124 160106 160900 f 1237471027 7180001872124 159930 160928 f 1238822613 7180001872124 159774 160782 f 1238816875 7180001872124 159878 160728 f 1244436248 7180001872124 159483 160553 f 1238818306 7180001872124 159718 160489 f 1238818332 7180001872124 159722 160483 f 1237476824 7180001872124 160330 161329 f 1238817689 7180001872124 160368 161010 r 1237447135 7180001872124 160740 161768 r 1237483546 7180001872124 160814 161790 r 1237483530 7180001872124 160818 161856 r 1237471108 7180001872124 161003 162009 f 1238817744 7180001872124 161151 161978 f 1237446441 7180001872124 161050 162107 f 1244436201 7180001872124 161117 162164 f 1237446407 7180001872124 161586 162699 r 1237471055 7180001872124 161652 162700 r # 23 BCM SHOTGUN RP42 VVHNP reads (1369 read lib; 1341 of the reads in this ctg)
- ctg7180002047604 : Vctor in the middle
   [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS]
===============================================================================================================================
      1      121  |   215495   215615  |      121      121  |    99.17  |      170   271477  |    71.18     0.04  | gnl|uv|U09128.1:15891-16011-49     ctg7180002047604       # pSacBII P1 cloning vector 
cat 9-terminator/bt.posmap.utgctg | grep 7180002047604 perl -ane '@F[2,3]=@F[3,2] if($F[2]>$F[3]); print $_ if($F[2]<215495 and 215495<$F[3] or $F[2]<215615 and 215615<$F[3]);' 7180000441711 7180001872124 214458 219678 r
cat 9-terminator/bt.posmap.frgctg | grep 7180002047604 | perl -ane '@F[2,3]=@F[3,2] if($F[2]>$F[3]); print $_ if($F[2]<215495 and 215495<$F[3] or $F[2]<215615 and 215615<$F[3]);' 498776751 7180001872124 215425 216426 r 1236502885 7180001872124 215514 216377 r 379408823 7180001872124 215572 216340 f 1244436224 7180001872124 215388 216405 f 1237471071 7180001872124 215229 216234 r 1233297450 7180001872124 215234 216046 f 1233363357 7180001872124 215267 215687 f 937200686 7180001872124 215300 216129 r 937254901 7180001872124 215321 216160 f 1233294025 7180001872124 215383 216204 r 1237446444 7180001872124 215146 216187 f 1232033776 7180001872124 215193 215996 r 671976381 7180001872124 215035 216021 r 514932286 7180001872124 215043 216008 f 500723879 7180001872124 215043 215802 f 671927656 7180001872124 215116 215733 r 381173692 7180001872124 214947 215877 r 1233303570 7180001872124 214963 215803 f 1232037705 7180001872124 214990 215803 f 490852264 7180001872124 214923 215843 f 1237447184 7180001872124 214684 215646 f 668822243 7180001872124 214586 215572 f #22 reads ; ~half come from BCM SHOTGUN RP42 VVFOP
maxmatch ctg
Parameters:
nucmer -maxmatch -l 40 -c 100 -b 10 -g 5 -d 0.05 ... AllVec: UniVec_Core + 100 more vector seqs
Length of UMD2.4 contigs that contain contaminant (0+ bp from end):
elem <2000 >2000 min max mean med n50 sum Ecoli.all 2951* 2602 349 1001 453627 8252 1367 132226 24352779 UniVec_Core 5387* 4802 585 882 651163 9979 1334 136556 53760575 OtherVec 5657 5062 595 882 651163 9726 1320 136556 55021803 UMD2.cont.other 3976 3430 546 804 651163 11217 1346 130385 44601117 #18 aligned to Acinetobacter; longest is 56467bp
Length of UMD2.4 contigs that contain contaminant (500+ bp from end):
elem <2000 >2000 min max mean med n50 sum Ecoli.all 182 156 26* 1286 351373 6525 1811 125069 1187706 # 7* are >5K; 321* come from multi-ctg scaffolds UniVec_Core 2532 2220 312* 1065 651163 10593 1481 128344 26821960 # 267* are >5K ; 655* come from multi-ctg scaffolds OtherVec 376 323 53 1184 361749 13278 1508 139997 4992774 UMD2.cont.other ...
Length of UMD2.4 contigs that contain contaminant (1000+ bp from end):
elem <2000 >2000 min max mean med n50 sum Ecoli.all 8 0 8* 4709 351373 73065 31157 351373 584520 UniVec_Core 11 0 11* 2600 334933 93674 37847 271477 1030414 OtherVec 5 0 5* 3717 271477 131604 111913 228060 658021 UMD2.cont.other 54 0 54* 2398 522682 110947 88182 189352 5991164 total 67* # 18 of them are CONTAINED by UMD2.0 chromosomes
Length of the UMD2.4 contaminant sequence (0+ bp from end):
elem <200 >200 min max mean med n50 sum Ecoli.all 4775 1610 3165 39 17072 381 236 502 1823278 UniVec_Core 16985 12380 4605 39 1800 207 132 300 3519080 OtherVec 7563 1372 6191 39 1800 509 548 643 3849567 UMD2.cont.other 6626 343 6283 39 8228 543 573 615 3602329
maxmatch deg
All degenerates aligned are <2K
Length of UMD2.4 deg that contain contaminant (0+ bp from end):
elem <2000 >2000 min max mean med n50 sum Ecoli.all 1266 1266 0 104 1611 783 833 869 991447 UniVec_Core 1908 1908 0 147 1510 872 896 910 1664746 OtherVec 1963 1963 0 147 1510 872 898 911 1712703 UMD2.cont.other 1609 1609 0 132 1611 852 892 914 1372106
maxmatch utg
Unitig stats:
elem <2000 >2000 min max mean med n50 sum 1707816 1434164 273652 21 138676 2228 937 8002 3805166508
Parameters:
nucmer -maxmatch -l 40 -c 100 -b 10 -g 5 -d 0.05 ...
Files:
/scratch1/bos_taurus/Assembly/2009_0217_CA/nucmer_utg/
Length of UMD2.4 unitigs that align to contaminants
elem <2000 >2000 min max mean med n50 sum Ecoli.all 4275 4110 165 104 71709 1442 1212 1398 6166566 UniVec_Core 7563 7409 154 139 71709 1397 1182 1331 10570512 OtherVec 8208 8054 154 139 71709 1370 1159 1308 11248775 UMD2.cont.other 6094 5849 245 132 53113 1546 1163 1401 9422951 #80 aligned to Acinetobacter; longest is 9114bp Contaminants(all above) 10264 9895 369 104 71709 1471 1148 1359 15107544 Acinetobacter 2306** 0 2306 154 71709 1451 1316 1412 3347230 #2182 already in the Cont set
Length of UMD2.4 unitigs that have contaminants 500+bp from ends
elem <2000 >2000 min max mean med n50 sum Ecoli.all 172 156 16 1286 4852 1820 1805 1815 313185 UniVec_Core 2491 2422 69 1065 71709 1722 1457 1523 4291584 OtherVec 364 358 6 1167 71709 1795 1478 1538 653595 UMD2.cont.other 156 108 48 1213 50248 5344 1838 17518 833673
Length of the UMD2.4 alignments of unitigs to contaminants (unique unitig regions)
elem <200 >200 min max mean med n50 sum reads(all unitig reads for unitgs with alignments>1K) Ecoli.all 5975 1686 4289 40 8184 397 268 542 2374366 12112(12142) UniVec_Core 8754 1674 7080 40 1801 474 490 645 4153030 26590(26849) OtherVec 8919 1250 7669 40 1801 511 536 629 4562326 30268(30268) UMD2.cont.other 6752 896 5856 40 6012 529 555 651 3573528 25006(25328) Contaminants(all above) 10992 1396 9596 40 8184 571 573 684 6280759 40351(40699) Acinetobacter (8286)
40699 reads aligned back to contaminants: nucmer -maxmatch
- 35919 align
- 34400 align 100+bp
- 27742 align 200+bp
- 14211 align 500+bp
utg 5'& 3'
Unitig stats:
elem <200 >200 min max mean med n50 sum utg 1,707,816 81200 1626616 21 138676 2228 937 8002 3805166508 utg5'&3' 3,334,432 0 3334432 21 199 100 100 100 335263271
Align utg5'&3' to Ecoli.all using:
- nucmer -l 40 -c 100 -b 10 -g 5 -d 0.05 : 4,275 hits
- nucmer -l 20 -c 40 : 6,617 hits
- nucmer -l 20 -c 20 : 23,350
- blastall : 2,895,506 out of 3,334,432 (86%) aligned
Acinetobacter contamination
Database:
~dpuiu/db/Acinetobacter.all : 7 complete genomes, 19 seqs
Seq len summary:
elem min max mean med n50 sum 19 2726 4050513 1418094 28279 3904116 26943793
Align all unitigs to Acinetobacter.all; Longest alignments is 8517bp
show-coords Acinetobacter.all-utg.filter-q.delta | sort -nk8 -r | head [S1] [E1] | [S2] [E2] | [LEN 1] [LEN 2] | [% IDY] | [LEN R] [LEN Q] | [COV R] [COV Q] | [GenBank] [UMD2.4 utg] =============================================================================================================================== 20644 29164 | 62 8578 | 8521 8517 | 98.80 | 94413 8578 | 9.03 99.29 | gi|169786889|ref|NC_010404.1| utg7180000281954* [CONTAINS] 3395586 3401299 | 5712 1 | 5714 5712 | 99.79 | 3976747 8015 | 0.14 71.27 | gi|126640115|ref|NC_009085.1| utg7180000212251 3400344 3404485 | 1 4142 | 4142 4142 | 99.66 | 3976747 9114 | 0.10 45.45 | gi|126640115|ref|NC_009085.1| utg7180000277331 ...
utg7180000281954* -> ctg7180002053982 (28140bp; 78 unitigs)
grep 7180002053982 ../9-terminator/bt.posmap.utgctg | nl
    1  7180000185222   7180002053982   0       3019    f
    2  7180000314302   7180002053982   2151    5706    r
    3  7180001463328   7180002053982   2256    2869    f
    ...
   75  7180000281954*  7180002053982   17862   26442   r
   76  7180001471348   7180002053982   17886   18723   r
   77  7180001468075   7180002053982   17919   18732   f
   78  7180000280508   7180002053982   25672   28140   r
 show-coords UMD2.contaminant.other-ctg.filter-q.delta  | grep 7180002053982
   [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [UMD2.0 contam]   [UMD2.4 ctg]
===============================================================================================================================
   1394     2508  |    28140    27024  |     1115     1117  |    98.75  |     7098    28140  |    15.71     3.97  | 7180003313366      ctg7180002053982
   2561     2871  |    26971    26661  |      311      311  |    97.11  |     7098    28140  |     4.38     1.11  | 7180003313366      ctg7180002053982
   2934     5670  |    26599    23862  |     2737     2738  |    97.99  |     7098    28140  |    38.56     9.73  | 7180003313366      ctg7180002053982
 ...gap...  
   5930     7098  |    17270    16101  |     1169     1170  |    98.46  |     7098    28140  |    16.47     4.16  | 7180003313366      ctg7180002053982
 ...gap...  
    281     1981  |    10335     8635  |     1701     1701  |    99.41  |    13090    28140  |    12.99     6.04  | 7180003320028      ctg7180002053982
   1992     2672  |     8635     7954  |      681      682  |    99.71  |    13090    28140  |     5.20     2.42  | 7180003320028      ctg7180002053982
   3302     5376  |     6719     4643  |     2075     2077  |    99.33  |    13090    28140  |    15.85     7.38  | 7180003320028      ctg7180002053982
   8469     9021  |     4642     4090  |      553      553  |    99.10  |    13090    28140  |     4.22     1.97  | 7180003320028      ctg7180002053982
   9038     9313  |     4073     3798  |      276      276  |    98.55  |    13090    28140  |     2.11     0.98  | 7180003320028      ctg7180002053982
   9780    13090  |     3331       19  |     3311     3313  |    99.19  |    13090    28140  |    25.29    11.77  | 7180003320028      ctg7180002053982
 grep 7180002053982 ../9-terminator/bt.posmap.utgctg | awk '{print $1,$4-$3+1}' | sed 's/^/utg/' >! ctg7180002053982.utgs
 intersect.pl UMD2.contaminant.other-utg.qry_hits ctg7180002053982.utgs | wc -l
 37 # only 37 out  of 78 unitigs were detected
 ctg7180002053982 is Acinetobacter
UMD2.5 (2004_0312_CA; delete 40699 contam reads & 22607 mates )
40699 reads:
- 25803 mated + 14896 unmated
- 6392 mated reads had the mate also contaminated
Location:
/scratch1/bos_taurus/Assembly/2009_0312_CA
UNITIGGER
UNITIG OVERLAP GRAPH INFORMATION
       5322910 : Total number of unitigs
       2595715 : Total number of singleton, contained unitigs
       1869655 : Total number of singleton, non-contained unitigs
        182193 : Total number of non-singleton, spanned unitigs
        675347 : Total number of non-singleton, non-spanned unitigs
      35507162 : Total number of fragments
      35507162 : Total number of fragments in all unitigs
      21641007 : Total number of essential fragments in all unitigs
      13866155 : Total number of contained fragments in all unitigs
  0.0077909501 : Randomly sampled fragment arrival rate per bp
    2511009753 : The sum of overhangs in all the unitigs
    6442095933 : Total number of bases in all unitigs
             0 : Estimated number of base pairs in the genome.
             0 : Total number of contained fragments not connected
                 by containment edges to essential fragments.
Total rho    = 2511009753
Total nfrags = 19563152
Estimated genome length = 0
Estimated global_fragment_arrival_rate=0.007791
Computed global_fragment_arrival_rate =0.007791
Total number of randomly sampled fragments in genome = 23866135
Computed genome length  = 3063315200.000000
Used global_fragment_arrival_rate=0.007791
Used global_fragment_arrival_distance=128.354050
Histogram of the number of base pairs in a chunk
100292 - 138301:     22    # 19 in UMD2.4 
 90020 -  99906:     28    # 23
 80043 -  89676:     90    # 79
 70013 -  79966:    190    # 164
 60010 -  69988:    423
 50008 -  59983:   1016
 40000 -  49998:   2558
 30000 -  39997:   6660
 20000 -  29999:  18927
 10000 -  19999:  57057
CONSENSUS after CGW
- failed on job 80 : ctg 5706539, len=180,024, 159 unitigs, 1,851 reads
 head 80/bt.cns_contigs.80.failed
 {ICM
 acc:5706539
 pla:P
 len:180024
 cns:
 .
 qlt:
 .
 for:0
 npc:1851
 more  ../9-terminator/bt.asm
 ...
 {CCO
 acc:(7180002022380*,5706539)
 pla:P
 len:180024
 cns:
 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
 ...
cat 80/bt.cns_contigs.80.failed | countMessages.pl ICM 1 IUP 159 #unitigs IMP 1851 #reads : 1264 are BCM.WGS, 389 are BCM.SHOTGUN ...
# 254 contig scaffold cat ../9-terminator/bt.posmap.ctgscf | grep 7180002041301 | nl 1 7180002022042 7180002041301 0 14121 f ... 183 7180002022380* 7180002041301 9164814 9344838 f .. 254 7180002022240 7180002041301 12874874 12893460 f
 # the 5 UMD2.4 contigs below have the same number of reads with the ones that matched => CONTAINED
 cat UMD2.4-7180002022380.posma   p.frgctg | grep -v ^$ | awk '{print $2}' | uniq -c
   #reads ctgid
     6 7180001712307
   209 7180002028662
     6 7180002028663
  1552 7180002028664
    21 7180002032323
  1794 total => 1851-1794=57 additional reads
cat ../../2009_0217_CA/9-terminator/bt.posmap.ctgscf | nl | egrep '7180001712307|7180002028662|7180002028663|7180002028664|7180002032323' #nl ctgid scfid start end dir 34791 7180002028662 7180002069912 1425623 1446022 f 34793 7180002032323 7180002069912 1448133 1450475 f 34794 7180002028663 7180002069912 1450495 1451918 f 34795 7180002028664 7180002069912 1452234 1602973 f 64110 7180001712307 7180002071598 0 1441 f
- Solution 1:
* consensus -Dforceunitigabut => new assembly, new UID's ctg7180002022636 179505 38.53 # => ctg7180002022380 scf7180002041557 12892941 40.58 # => scf7180002041301
- Solution 2:
 * Reassemble 1851 reads ; clr=ECR2; doOBT=no
 * Asm dir: 
    /scratch1/bos_taurus/Assembly/2009_0312_CA/8-consensus/80.ECR2.asm
 * It contains one 179,530 bp scaffold that has two contigs. 
 * One contig is 156,349 bp and the other one is 23,181 bp. 
 * The estimated gap between them is 231 bp.
 show-coords ctg7180002022636-80.ECR2.filter-r.delta 
      1   156326  |   1   156331  | 156326 156331  | 99.99  | 179505  156349 |    87.09    99.99  | ctg7180002022636   ctg7180000000103        [CONTAINS]
 156345   179505  |  21    23181  |  23161  23161  | 99.99  | 179505  23181  |    12.90    99.91  | ctg7180002022636   ctg7180000000104        [CONTAINS]
>ctg7180002022636_156327_156344 TTGTAAAAACCATCCCCT
# ~ 20 bp unaligned on ctg7180002022636 & Chr1 show-coords ctg7180002022636-Chr1.filter-r.delta | more ... 151839 156326 | 61124826 61120339* | 4488 4488 | 99.91 | 179505 157590899 | 2.50 0.00 | ctg7180002022636 Chr1 156347 157501 | 61111219* 61110064 | 1155 1156 | 99.91 | 179505 157590899 | 0.64 0.00 | ctg7180002022636 Chr1 ...
# 2 UMD2.0 ctg & 2 UMD2.0 deg in this region more Chr1.agp ... Chr1 61110064 61111482 3579 W deg0003139347 1 1419 + Chr1 61111483 61114130 3580 N 2648 fragment yes Chr1 61114131 61115490 3581 W deg0002967451 1 1360 + Chr1 61115491 61118114 3582 N 2624 fragment yes Chr1 61118115 61120117 3583 W 7180002846553 1 2003 + Chr1 61120118 61120217 3584 U 100 fragment yes Chr1 61120218 61145567 3585 W 7180003318962 1 25350 + ...
QC
Lengths:
elem <2000 >=2000 min max mean med n50 sum scf 39978* 31311 8667 316 34167202 68129 1360 8217662 2723691675 ctg 90135* 36140 53995 65 1160130 29693 5124 95988 2676390147 deg 251413 249285 2128 65 39964 1003 984 994 252279234 utg 1689033 1419729 269304 21 138676 2242 936 8213 3788090224
elem <0 0 >0 min max mean med n50 sum gaps(ca2scf) 50157 10759 3296 36102 -20 177144 929 20 34357 46620040 gaps(posmap) 50157 0 0 50157 20 177144 943 20 34065 47301528
Fragment happiness:
placed good 27263138 chaff bothChaff 2467462 placed notMated 2050084 placed oneChaff 732517 chaff oneChaff 732517 placed oneSurrogate 555510 placed bothDegen 465114 chaff notMated 434052 placed diffScaffold 369,294 * placed oneDegen 213,768 * placed badSame 96862 placed badLong 76640 placed badOuttie 41142 placed badShort 5784 placed bothSurrogate 3278
Mate happiness:
good 13631569 bothChaff 1233731 oneChaff 732517 oneSurrogate 277755 bothDegen 232557 diffScaffold 184647 oneDegen 106884 badSame 48431 badLong 38320 badOuttie 20571 badShort 2892 bothSurrogate 1639
Scaffold zero read/mate cvg regions:
elem <2000 >2000 min max mean med n50 sum read 57011 55048 1963 1 177144 913 57 29302 52084484 mate 10507 8945 1562 1 30014 996 493 2367 10466518
Scaffold 10K+ zero read/mate cvg regions (2K+ inside) (some might be a result of surrogates?):
elem <2000 >2000 min max mean med n50 sum read 51747 49878 1869 1 177144 958 38 32625 49613432 mate 1290 201 1089 15 30014 3560 3047 4017 4593586 mate(ignore seq len) 2599 1051 1548 1 72706 3321 2541 5513 8633691
Contaminant search
ctg
elem <0 >0 min max mean med n50 sum ctg 90135 0 90135 65 1160130 29693 5124 95988 2676390147
nucmer -maxmatch -l 40 -c 100 -b 10 -g 5 -d 0.05 ...
 #OLD table
                  elem       <2000      >2000      min        max        mean       med        n50        sum
 Ecoli.all        71         66         5          1006       129770     4180       1127       45234      296830
 UniVec_Core      120        111        9          1000       426540     7366       1130       100944     884023
 OtherVec         121        112        9          1000       426540     7314       1127       100944     885022
 UMD2.cont.other  98(-3)     83         15(-3)     1000       426540     13523      1190       199700     1325332  # 3 are 1000bp+ from ctg ends; these are actually "fake" contaminants
                                                                                                                   # 7 are Acinetobacter baumannii min=1059 , max=9765 
 total            152*(-3)   133        19*(-3)    1000       426540     10649      1150       199700     1618735
 
 Acinetobacter    65         53         12         1013       44359      2586       1212       3847       168130   # 46 out of 65 are in the 152* set; 19 are new; 13 have lots of alignments to other contigs (probably fake contaminants)
 total(new)       171*(-3)   144        27*(-3)    1000       426540     10013      1189       129770     1712376  # 65 are Acinetobacter and should be removed
 cat UMD2.contaminant.other-ctg.filter-q.coords | grep Acinetobacter
                                                                                                                    UMD2.0                     UMD2.5
      1      285  |      285        1  |      285      285  |    99.65  |      287     8096  |    99.30     3.52  | 7180003292866_1_288        ctg7180002015457        [CONTAINED]     Acinetobacter baumannii
   1422     2500  |     1078        1  |     1079     1078  |    99.63  |     7098     1078  |    15.20   100.00  | 7180003313366              ctg7180001706852        [CONTAINS]      Acinetobacter baumannii
   2934     3940  |     1008        1  |     1007     1008  |    98.61  |     7098     1059  |    14.19    95.18  | 7180003313366              ctg7180001709709                        Acinetobacter baumannii
   6281     7098  |        1      818  |      818      818  |    99.76  |     7098     1553  |    11.52    52.67  | 7180003313366              ctg7180001716052        [END]           Acinetobacter baumannii
      1      790  |      790        1  |      790      790  |   100.00  |     1822     9765  |    43.36     8.09  | 7180003319195_8956_10778   ctg7180002015485        [BEGIN]         Acinetobacter calcoaceticus
    285     1981  |        1     1697  |     1697     1697  |    99.59  |    13090     1856  |    12.96    91.43  | 7180003320028              ctg7180001706656*                       Acinetobacter baumannii
   1992     2148  |     1697     1856  |      157      160  |    98.12  |    13090     1856  |     1.20     8.62  | 7180003320028              ctg7180001706656*                       Acinetobacter baumannii
  12210    13090  |       63      943  |      881      881  |    99.89  |    13090     2556  |     6.73    34.47  | 7180003320028              ctg7180002007423                        Acinetobacter baumannii
# 7 Acinetobacter baumannii ctgs # no Serratia "best hits" # 3 mitochondrion ctgs, all < 2Kbp
Delete summary: 65 Acinetobacter ctgs + 91 contaminant ctgs <2000bp => 156 ctgs , 152 scf => 4105 reads
ctgs <2000 >2000 min max mean med n50 sum reads 156 144 12 1000 44359 1782 1150 1483 278009 4105 # all ctg
Trim summary: 12 contigs >=2000bp & 44 reads that overlap at least 10bp
ctgs <2000 >2000 min max mean med n50 sum reads 12 0 12 7072 426540 78470 45234 129770 941646 ? # all ctg 12 12 0 172 935 532 618 750 6393 44 # ctg regions
Files
/scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_ctg/TO_DELETE/ctg.delete.uid /scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_ctg/TO_DELETE/scf.delete.uid /scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_ctg/TO_TRIM/ctg.trim.uid
ctg 5'&3'
elem <0 >0 min max mean med n50 sum ctg53 180044 0 180044 65 598 300 300 300 54033229
nucmer -maxmatch -l 17 -c 35 ...
#ctgEnds #ctgs min max mean med n50 sum Ecoli.all 180 149 300 300 300 300 300 54000 UniVec_Core 312 277 300 300 300 300 300 93600 OtherVec 1211 1167 300 553 300 300 300 363989 UMD2.cont.other 15689 14693 257 598 300 300 300 4712162
deg
nucmer -maxmatch -l 40 -c 100 -b 10 -g 5 -d 0.05 ...
                  elem       <2000      >2000      min        max        mean       med        n50        sum
 Ecoli.all        387        387        0          131        1099       756        806        835        292892
 UniVec_Core      569        569        0          101        1115       763        822        843        434400
 OtherVec         579        579        0          101        1115       752        819        840        435549
 UMD2.cont.other  539        539        0          131        1483       792        838        873        427408
                 
 total            810*       810        0*         101        1483       784        838        869        63547
Scaffolds vs UMD2.0 chromosome alignments
Directory:
/scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_scf
Depening on the ref/qry seq and nucmer parameters, the number of unaligned gaps in UMD2.0 can vary between:
101M: REF=Chr, QRY=scf, nucmer -l 100 -c 500 6M: REF=ChrPlaced, QRY=scf-deg, nucmer -maxmatch -l 50 -c 250
nucmer -l 100 -c 500
Chr-scf.summary
elem <2000 >2000 min max mean med n50 sum Chr-scf.qry_hits 32901 24546 8355 723 34167202 82494 1405 8217662 2714164100 Chr-scf.qry_nohits 7077 6765 312 316 12006 1346 1239 1291 9527056 Chr-scf.10K.qry_hits2+ 574 0 574 10308 34167202 4006753 1887309 9586144 2299876795 Chr-scf.0cvg 144712 125933 18779 1 102265 900 178 2968 130248709 Chr-scf.0cvg.clean 148556 143283 5273 1 39625 683 280 1363 101526883(101M)
Chr-scf-deg.summary
elem <2000 >2000 min max mean med n50 sum Chr-scf-deg.qry_hits 210225 199781 10444 501 34167202 13785 1007 7328685 2898141592 Chr-scf-deg.qry_nohits 81166 80815 351 65 12006 958 972 989 77828798 Chr-scf-deg.10K.qry_hits2+ 574 0 574 10308 34167202 4006753 1887309 9586144 2299876795 Chr-scf-deg.0cvg 175952 168553 7399 1 22067 445 120 1329 78433265 Chr-scf-deg.0cvg.clean 133809 132381 1428 1 20512 371 124 1101 49711440(49M)
ChrPlaced-scf.summary
elem <2000 >2000 min max mean med n50 sum ChrPlaced-scf.qry_hits 19488 13112 6376 723 34167202 137773 1527 8428844 2684927057 ChrPlaced-scf.qry_nohits 20490 18199 2291 316 192648 1891 1276 1569 38764099 ChrPlaced-scf.10K.qry_hits2+ 139 0 139 10486 31959312 6786671 4979278 12956086 943347316 ChrPlaced-scf.0cvg 76271 71816 4455 1 102265 568 179 1710 43339413 ChrPlaced-scf.0cvg.clean 72865 70541 2324 1 39625 356 94 1413 25951987(25M)
ChrPlaced-scf-deg.summary
elem <2000 >2000 min max mean med n50 sum ChrPlaced-scf-deg.qry_hits 130125 122000 8125 501 34167202 21530 1009 7515049 2801670853 ChrPlaced-scf-deg.qry_nohits 161266 158596 2670 65 192648 1080 987 1012 174299537 ChrPlaced-scf-deg.10K.qry_hits2+ 139 0 139 10486 31959312 6786671 4979278 12956086 943347316 ChrPlaced-scf-deg.0cvg 79041 76374 2667 1 22067 395 157 948 31251753 ChrPlaced-scf-deg.0cvg.clean 69012 68271 741 1 20512 200 81 592 13864328(13M)
nucmer -maxmatch -l 100 -c 500
Dir:
/scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_scf.2
ChrPlaced-scf-deg.summary
elem <2000 >2000 min max mean med n50 sum ChrPlaced-scf-deg.qry_hits 130510 122377 8133 501 34167202 21470 1009 7515049 2802100771 ChrPlaced-scf-deg.qry_nohits 160881 158219 2662 65 192648 1080 986 1012 173869619 ChrPlaced-scf-deg.10K.qry_hits2+ 120 0 120 20022 31959312 7587296 5639522 13010806 910475551 ChrPlaced-scf-deg.0cvg 82159 80425 1734 1 7002 321 145 647 26444796 ChrPlaced-scf-deg.0cvg.clean 111645 111546 99 1 6248 81 13 272 9057424(9M)
nucmer -maxmatch -l 50 -c 250
Dir:
/scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_scf.3
ChrPlaced-scf-deg.summary
elem <2000 >2000 min max mean med n50 sum ChrPlaced-scf-deg.qry_hits 204673 195653 9020 251 34167202 14088 1005 7329288 2883625712 ChrPlaced-scf-deg.qry_nohits 86718 84943 1775 65 192648 1064 970 1007 92344678 ChrPlaced-scf-deg.10K.qry_hits2+ 148 0 148 10486 31959312 6814292 5135095 12792673 1008515300 ChrPlaced-scf-deg.0cvg 86085 84614 1471 1 4565 279 123 557 24101912 ChrPlaced-scf-deg.0cvg.clean 113796 113791 5 1 2822 59 7 176 6714556(6M)
nucmer -maxmatch -l 50 -c 250 ; delta-fileter -q
Dir:
/scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_scf.3
 ChrPlaced-scf-deg.filter-q.summary
                                          elem       <2000      >2000      min        max        mean       med        n50        sum
 ChrPlaced-scf-deg.qry_hits               204673     195653     9020       251        34167202   14088      1005       7329288    2883625712
 ChrPlaced-scf-deg.qry_nohits             86718      84943      1775       65         192648     1064       970        1007       92344678
 ChrPlaced-scf-deg.10K.qry_hits2+         118        0          118        20022      31959312   7633834    5639522    13010806   900792422
 ChrPlaced-scf-deg.0cvg                   77864      73686      4178       1          28711      529        181        1523       41240419
 ChrPlaced-scf-deg.0cvg.clean             74172      72150      2022       1          28331*     321        89         1415       23852952(23M)
Max gap is 28331; Duplicate region in UMD2.0?
ChrPlaced-scf-deg.coords 70739616 70767946 | 2993389 2965048 | 28331 28342 | 99.17 | 85187327 19514159 | 0.03 0.15 | Chr15 scf7180002041107 70768054 70808299 | 3005305 2965048 | 40246 40258 | 99.56 | 85187327 19514159 | 0.05 0.21 | Chr15 scf7180002041107 => ChrPlaced-scf-deg.filter-q.coords 70768054 70808299 | 3005305 2965048 | 40246 40258 | 99.56 | 85187327 19514159 | 0.05 0.21 | Chr15 scf7180002041107
Markers
ALL:
head /fs/szasmg3/bos_taurus/UMD_Freeze2.5/markers/markers_contigs_Art.txt Marker Chr_BTA Pos(K) Pos_from Pos_to UMD_Ctg_Pos Match_Len %IDY %Match UMD_Ctg_name BZ945871 1 47501 1 95001 7622 515 100.00 99.61 ctg7180002007845 BZ953651 1 80001 47501 112501 10786 700 99.57 100.00 ctg7180002026484 CC504788 1 118751 80001 157501 54583 862 100.00 100.00 ctg7180002026483 CC484491 1 123751 90001 157501 50169 77 98.72 100.00 ctg7180002026482 CZ415082 1 125001 92501 157501 75850 507 99.21 99.80 ctg7180002026483 CC475154 1 130001 97501 162501 40013 666 99.25 100.00 ctg7180002026482 CC561114 1 182501 145001 220001 1130 709 99.02 100.00 ctg7180002026482 CC578374 1 190001 155001 225001 170145 647 100.00 100.00 ctg7180002026481 BZ911787 1 278751 232501 325001 na na na na na ...
- 126,014 markers & 90,135 ctgs total
- 107,271 markers align to 31,407 ctg & 2640 scf:
- 85% of the markers align to 85% of the ctg sequence
- avg distance between markers is 25Kbp
 
- 188 questionable ctgs & 219 questionable scf (2 out of 4 disagreeing markers)
UNIQ:
head /fs/szasmg3/bos_taurus/UMD_Freeze2.5/markers/markers_contigs_Art.unique_only.txt | p 'print " ",$_;' Marker Chr_BTA Pos(Kbp) CI_Pos_from CI_Pos_to UMD_Scaff_Pos Match_Len %IDY %Matched UMD_Scaff_name BZ945871 1 52251 1 104501 na na na na na BZ953651 1 88001 52251 123751 na na na na na CC504788 1 130626 88001 173251 54583 862 100.00 100.00 ctg7180002026483 CC484491 1 136126 99001 173251 50169 77 98.72 100.00 ctg7180002026482 CZ415082 1 137501 101751 173251 na na na na na CC475154 1 143001 107251 178751 40013 666 99.25 100.00 ctg7180002026482 CC561114 1 200751 159501 242001 1130 709 99.02 100.00 ctg7180002026482 CC578374 1 209001 170501 247501 170145 647 100.00 100.00 ctg7180002026481 BZ911787 1 306626 255751 357501 na na na na na ...
- 93,508 markers align to 28,752 ctgs & 1,476 scf:
- 109 questionable ctgs & 153 questionable scf (2 out of 4 disagreeing markers)
Scripts:
~/bin/marker2pos.pl markers_contigs_Art.unique_only.txt | sed 's/ctg//' | sort -nk1 -nk2 > markers_ctg.pos ~/bin/translatePosMap.pl markers_ctg.pos bt.posmap.ctgscf | ~/bin/tab2tab.pl > markers_scf.pos
Ctg vs markers summary:
#ctg <10000 >10000 min max mean med n50 sum file ctg (all) 90135* 51024 39111 65 1160130 29693 5124 95988 2676390147 no markers 58728 48324 10404 65 322949* 6573 1597 21989 386064754 markers from 1+ Chr 31407 2700 28707 442 1160130 72924 52693 111252 2290325393 markers_ctg.Chr.count markers from 2+ Chr 2987 25 2962 1002 1160130 132480 104807 179692 395718221 markers_ctg.Chr.count2+ 2+ markers from 2+ Chr 26 0 26* 15228 604155 221354 192182 298848 5755227 markers_ctg.Chr.count2.2+ 2+ adjacent markers from 2+ Chr 15 0 15** 15228 368879 202728 194749 294623 3040932 markers_ctg.Chr.count2+a
Scf vs markers summary:
#scf <10000 >10000 min max mean med n50 sum scf(all) 39978* 37135 2843 316 34167202 68129 1360 8217662 2723691675 no markers 37338 36038 1300 316 754615* 2601 1336 3957 97140879 markers from 1+ Chr 2640 1097 1543 1000 34167202 994905 16220 8661690 2626550796 markers_scf.Chr.count markers from 2+ Chr 552 10 542 1002 34167202 4526814 2714036 9167014 2498801557 markers_scf.Chr.count2+ 2+ markers from 2+ Chr 212 0 212* 15228 34167202 8579232 7358307 10521496 1818797327 markers_scf.Chr.count2.2+ 2+ adjacent markers from 2+ Chr 38 0 38** 15228 25078118 8419544 7176534 13458592 319942681 markers_scf.Chr.count2+a
212* scaffolds
scf_id scf_len #Chr/2+markers #/Chr/2+adjmarkers reads 7180002041381 31959312 15 0 469503 7180002041358 25078118 13 2 291163 7180002041386 21280754 12 0 ...
scf7180002041381.1
- no low cvg regions in the middle
- 1281 markers: 1231 on Chr4, 12 on Ch11
#1- mate cvg regions: at the ends !!! #scfid begin end scf_len cvg_len cvg 7180002041381 1 1173 31959312 1173 0 7180002041381 1174 1454 31959312 281 1 7180002041381 31959139 31959312 31959312 174 1
scf7180002041358.2
- one low cvg region & real break
- 1111 markers: 869 on Chr14, 193 on Chr26, ...
#1- mate cvg regions: middle #scfid begin end scf_len cvg_len cvg 7180002041358 20970531 20970827 25078118 297 1 7180002041358 20970828 20970949 25078118 122 0 7180002041358 20970950 20971112 25078118 163 1
# markers in the regions #scfid begin end makerid Chr 7180002041358 20964100 20964829 BZ839784 14 7180002041358 21002368 21003219 CC527932 26
scf7180002041386.3
- one low cvg region but no markers in that region
- 939 markers: 902 on Chr24, 3 on Chr6 ...
#1- mate cvg regions: at the ends #scfid begin end scf_len cvg_len cvg 7180002041386 26382 27557 21280754 1176 1 7180002041386 27558 27607 21280754 50 0 7180002041386 27608 28031 21280754 424 1
...
scf7180002041368.128
- one low cvg region
#scfid begin end scf_len cvg_len cvg 7180002041368 1356953 1357221 1404851 269 1
- markers:Chr7 & Chr5 at 3'
#scfid begin end markerid Chr 7180002041368 10436 11033 BZ840669 7 ... 7180002041368 1278950 1279444 CZ404867 7 7180002041368 1319489 1320264 CC593534 7 7180002041368 1344001 1344347 BZ885865 7 * 7180002041368 1371526 1372308 BZ867572 5 * 7180002041368 1386933 1387750 CC534893 5
- subassembly: extract all reads & extra mates(265) and reassemble
- Msg Counts
 
DST 40 FRG 13762 LKG 6482
- qc
 
[Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID] 0=37,1211999,1358964,32757,4082,7180000000571 1=2,53844,53824,26922,-20,7180000000566 2=4,50178,50238,12544,20,7180000000570 3=1,28553,28553,28553,0,7180000000568 4=1,17195,17195,17195,0,7180000000569 total=45,1361769,1508774,30262,3675
show-coords scf7180002041368-7180002041368.update.scf.filter-q.delta
... 1269662 1338725 | 69065 1 | 69064 69065 | 99.97 | 1404851 1359628 | 4.92 5.08 | scf7180002041368 scf7180000000571 1340619 1341912 | 1 1295 | 1294 1295 | 98.77 | 1404851 1295 | 0.09 100.00 | scf7180002041368 scf7180000000574 [CONTAINS] 1345570 1347105 | 1537 1 | 1536 1537 | 98.18 | 1404851 1537 | 0.11 100.00 | scf7180002041368 scf7180000000572 [CONTAINS] 1347683 1356952 | 12617 3352 | 9270 9266 | 99.85 | 1404851 12617 | 0.66 73.44 | scf7180002041368 scf7180000000567 1356901 1361861 | 1767 6728 | 4961 4962 | 99.96 | 1404851 50318 | 0.35 9.86 | scf7180002041368 scf7180000000570 ...
scf7180002041061.187
- infoseq
scf7180002041061 509987 39.73
- 22 markers
#scfid begin end makerid Chr 7180002041061 2061 2904 BZ847430 15 7180002041061 8979 9756 CC481592 15 7180002041061 10856 11485 BZ839377 15 7180002041061 47578 48186 BZ836581 15 7180002041061 80485 81237 CC553472 15 7180002041061 117811 118488 CC918533 15 7180002041061 151253 152009 CC477055 15 7180002041061 213959 214640 CC572066 15 * 7180002041061 236304 236479 CC550436 29 * 7180002041061 242590 243462 BZ848041 29 7180002041061 267880 268614 BZ877402 29 7180002041061 268891 269493 CC580499 29 7180002041061 282248 282861 BZ885584 29 7180002041061 295221 295987 CC572941 29 7180002041061 337785 338517 CC923898 29 7180002041061 338464 339092 BZ921430 29 7180002041061 341219 341971 CZ415932 29 7180002041061 382509 383281 CC581748 29 7180002041061 387548 388176 CC572064 29 7180002041061 415263 415788 BZ878185 29 7180002041061 470138 470773 CC558303 29 7180002041061 493006 493794 CC565740 29
- mate happiness:
cat 7180002041061.posmap.mates | cut -f3 | count.pl good 2501 diffScaffold 86 oneSurrogate 69 oneChaff 40 oneDegen 11 badSame 9 badOuttie 4 badLong 4
- low cvg region:
#scfid begin end scf_len cvg_len cvg 7180002041061 232849 232871 509987 23 1 #it is in the middle of a unitig intersectPos.pl -i 1 7180002041061.posmap.utgscf 7180002041061.posmap.frgscf.10K.2K.mate_cvg.1- 7180000849321 7180002041061 229782 238910 r
- subassembly: extract all reads & mates(139) and reassemble
- Msg Counts
 
DST 38 FRG 5769 LKG 2724
Dirs:
/scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scf7180002041061.187.mates/asm.ECR2 /scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scf7180002041061.187.mates/asm.ECR2.mates
Qc stats:
asm.ECR2 asm.ECR2.bog ... [Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID] 0 3,277300,277925,92433,312,7180000000278 5,279938,399837,55988,29975,7180000000155 1 6,234445,280546,39074,9220,7180000000277 6,234463,280485,39077,9204,7180000000156 2 1,1630,1630,1630,0,7180000000276 1,1799,1799,1799,0,7180000000154 3 NA 1,1674,1674,1674,0,7180000000153 4 NA 1,1183,1183,1183,0,7180000000158 total 10,513375,560101,51338,6675 14,519057,684978,37076,18436
- Some of the 139 mates assemble into scaffolds
- There is slightly more sequence in the bog assembly
- Mean/Max bog utg size is twice larger than default utg (scf, ctg sizes are about the same)
- align asm.ECR2 scaffolds to scf7180002041061
- 2919 bp at the 3' end of the new scf7180000000277 don't align
- 1990 bp at the 3' end of the new scf7180000000278 don't align
- new scf7180000000277 & new scf7180000000278 align for ~ 1058bp
- most of the mated read added assembled at new scf7180000000277 & new scf7180000000278 3'
 
 nucmer -l 100 -c 500 ../scf7180002041061.fasta 9-terminator/7180002041061.update.scf.fasta -p scf7180002041061-7180002041061.update.scf
 
 delta-filter -q scf7180002041061-7180002041061.update.scf.delta > scf7180002041061-7180002041061.update.scf.filter-q.delta
 
 show-coords scf7180002041061-7180002041061.update.scf.filter-q.delta
      1    46099  |        1    46099  |    46099    46099  |    99.99  |   509987   280626  |     9.04    16.43  | scf7180002041061   scf7180000000277
  46273    48914  |    46120    48761  |     2642     2642  |   100.00  |   509987   280626  |     0.52     0.94  | scf7180002041061   scf7180000000277
  50810    56904  |    95425   101521  |     6095     6097  |    99.43  |   509987   280626  |     1.20     2.17  | scf7180002041061   scf7180000000277
  57301   233303  |   101699   277707  |   176003   176009  |    99.97  |   509987   280626  |    34.51    62.72  | scf7180002041061   scf7180000000277 *
 
 232245   444418  |   275975    63802  |   212174   212174  |    99.99  |   509987   277965  |    41.60    76.33  | scf7180002041061   scf7180000000278 * 
 444964   469975  |    63156    38151  |    25012    25006  |    99.97  |   509987   277965  |     4.90     9.00  | scf7180002041061   scf7180000000278
 449967   479221  |    58152    28900  |    29255    29253  |    99.96  |   509987   277965  |     5.74    10.52  | scf7180002041061   scf7180000000278
 480298   509987  |    29688        1  |    29690    29688  |    99.99  |   509987   277965  |     5.82    10.68  | scf7180002041061   scf7180000000278
scf7180002041163.214
- infoseq
scf7180002041234 10336067 41.93
- Markers
#scfid begin end makerid Chr ... 7180002041163 770783 771508 CC562874 30 * 7180002041163 791050 791484 BZ887794 11 * ...
- Markers on different contigs !!!
cat 7180002041163.posmap.ctgscf .. 7180001926720 7180002041163 750392 787668 f 7180001926721 7180002041163 787911 815738 f ...
- subassembly: extract all reads & extra mates(293) and reassemble
- Msg Counts
 
DST 39 FRG 17287 LKG 8113
- qc
 
[Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID] 0=54,1742235,1760846,32264,351,7180000000507 1=1,6224,6224,6224,0,7180000000506 2=1,1472,1472,1472,0,7180000000508 total=56,1749931,1768542,31249,351
- alignments :
show-coords scf7180002041163-7180002041163.update.scf.filter-q.delta 1748 7972 | 6224 1 | 6225 6224 | 99.84 | 1757217 6224 | 0.35 100.00 | scf7180002041163 scf7180000000506 [CONTAINS] 4012 5490 | 1 1470 | 1479 1470 | 98.85 | 1757217 1472 | 0.08 99.86 | scf7180002041163 scf7180000000508 [CONTAINS] 4567 18273 | 1 13697 | 13707 13697 | 99.92 | 1757217 1761853 | 0.78 0.78 | scf7180002041163 scf7180000000507 ... 750393 787668 | 757092 794367 | 37276 37276 | 100.00 | 1757217 1761853 | 2.12 2.12 | scf7180002041163 scf7180000000507 787912 815738 | 794644 822455 | 27827 27812 | 99.81 | 1757217 1761853 | 1.58 1.58 | scf7180002041163 scf7180000000507 ... 1755928 1757217 | 1760564 1761853 | 1290 1290 | 100.00 | 1757217 1761853 | 0.07 0.07 | scf7180002041163 scf7180000000507
Manually curated
Markers within 50K of a low mate cvg region
- 13 scaffolds (22 before)
- 14 breaks : 9 on the same contig , 2 on adjacent contigs , 3 on non adjacent contigs
- File
/scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scfproblems.mates//scfproblems
- scfproblems.markers.txt
#scfid begin end markerid Chr(break) comment 7180002040911 2210713 2211037 CZ405316 29 half Chr30, half Chr29 ; SAME ctg 7180002041061 213959 214640 CC572066 15 half Chr29, half Chr15; SAME ctg 7180002041103 5296272 5297045 CC587675 15 Chr18 & Chr15 at 3'; diff ctg (2 ctgs in between) 7180002041107 7302203 7302874 CC906829 15 half Chr9, half Chr15; SAME ctg 7180002041200 4920833 4921695 BZ877250 20 half Chr20, half Chr2; SAME ctg 7180002041216 307121 307653 CL609365 5 Chr5 at 5' & Chr15; SAME ctg 7180002041259 4315523 4316178 BZ922220 3 Chr5, Chr3, Chr24 !!! ; diff ctgs (4 ctgs in between) 7180002041281 2884855 2885652 CC538348 19 half Chr19, half Chr28 ; SAME ctg 7180002041315 2196754 2197298 BZ838379 1 Chr13 & Chr1 at 3'; diff ctg (2 ctgs in between) 7180002041325 2597590 2598283 CC472212 10 half Chr10, half Chr5; SAME ctg 7180002041348 3028112 3028921 CC531996 8 half Chr8, half Chr7; SAME ctg 7180002041356 6828214 6828925 CC531427 16 half Chr16, half Chr25; SAME ctg 7180002041358 21002368 21003219 CC527932 26 Chr14 & Chr26 at 3' (long chunck); diff ctg
Details:
#scfid begin end markerid Chr #ctgid begin end markerid Chr 7180002040911 2189844 2190490 BZ923031 30 7180001730191 42851 43497 BZ923031 30 * 7180002040911 2210713 2211037 CZ405316 29 7180001730191 63720 64044 CZ405316 29 * -- 7180002041061 213959 214640 CC572066 15 7180001852904 29927 30608 CC572066 15 * 7180002041061 236304 236479 CC550436 29 7180001852904 52272 52447 CC550436 29 * -- 7180002041103 5220856 5221407 CG984741 18 7180001854649 21126 21677 CG984741 18 * 7180002041103 5296272 5297045 CC587675 15 7180001854651 837 1610 CC587675 15 * -- 7180002041107 7302203 7302874 CC906829 15 7180001855003 77842 78513 CC906829 15 ** 7180002041107 7311399 7312254 CC479102 9 7180001855003 87038 87893 CC479102 9 ** -- 7180002041200 4940131 4940949 CC500137 20 7180002002029 31935 32753 CC500137 20 ** 7180002041200 4956412 4957105 CZ428497 2 7180002002029 48216 48909 CZ428497 2 ** -- 7180002041216 307121 307653 CL609365 5 7180002003578 6871 7403 CL609365 5 * 7180002041216 310832 311546 CL865591 15 7180002003578 10582 11296 CL865591 15 * -- 7180002041259 4224828 4225527 CC920177 5 7180002012718 112862 113561 CC920177 5 * 7180002041259 4315523 4316178 BZ922220 3 7180002012722 19376 20031 BZ922220 3 * -- 7180002041259 6220638 6221268 BZ869532 3 7180002012752 199799 200429 BZ869532 3 * 7180002041259 6239728 6240375 CZ413142 24 7180002012753 2600 3247 CZ413142 24 * -- 7180002041281 2927406 2928016 CC573399 19 7180002018361 74539 75149 CC573399 19 ** 7180002041281 2938896 2939696 CC513914 28 7180002018361 86029 86829 CC513914 28 ** -- 7180002041315 2152291 2153097 BZ836343 13 7180001862977 3397 4203 BZ836343 13 * 7180002041315 2196754 2197298 BZ838379 1 7180002024308 7255 7799 BZ838379 1 * -- 7180002041325 2608389 2609213 CC506736 10 7180002025880 237956 238780 CC506736 10 ** 7180002041325 2638591 2639207 CC770009 5 7180002025880 268158 268774 CC770009 5 ** -- 7180002041348 3044263 3044902 BZ872906 8 7180002030033 152672 153311 BZ872906 8 ** 7180002041348 3092547 3093121 BZ924509 7 7180002030033 200956 201530 BZ924509 7 ** -- 7180002041356 6828214 6828925 CC531427 16 7180001722964 12571 13282 CC531427 16 * 7180002041356 6832720 6833356 CC494876 25 7180001722964 17077 17713 CC494876 25 * -- 7180002041358 20964100 20964829 BZ839784 14 7180001723456 140919 141648 BZ839784 14 * 7180002041358 21002368 21003219 CC527932 26 7180001723457 15983 16834 CC527932 26 *
- scfproblems.low_mate_cvg.txt
#scfid begin end ctglen len mate_cvg 7180002040911 2205494 2205582 3008363 89 1 7180002041061 232849 232871 509987 23 1 7180002041103 5285314 5285479 5341215 166 1 7180002041107 7306552 7306653 19514159 102 1 7180002041200 4954073 4954180 16995932 108 1 7180002041216 307614 307741 1444165 128 0 7180002041259 4313297 4313311 15600612 15 0 7180002041281 2933371 2933618 12907599 248 1 7180002041315 2193142 2193167 2232736 26 1 7180002041325 2623956 2624190 11153784 235 1 7180002041348 3064633 3064750 19180127 118 1 7180002041356 6831454 6831811 14697197 358 1 7180002041358 20970950 20971112 25078118 163 1
No low cvg regions
- 9 scaffolds
- 14 breaks : 5 on the same contig , 6 on adjacent contigs , 3 on non adjacent contigs
7180002040844 : 3 consecutive Chr30 markers (in the middle); mate_cvg > 18 ; markers CC576837,CC516738,CC543771; SAME ctg 7180002041163 : 5' Chr30, 3' Chr11 ; mate_cvg~=10; marker BZ887794 ; diff ctg (1 ctg in between) 7180002041234 : 5' Chr10, 3' Chr5; mate_cvg=10; marker BZ889975; scf=1757217bp; diff ctg 7180002041235 : 2 consecutive Chr14 markers; mate_cvg=15..20; markers CC561837 & CC585677; NOT uniq; SAME ctg 7180002041279 : 5 consecutive Chr23 markers; markers CC472696,CC522963; cvg=9..18; diff ctg 7180002041306 : 5' Chr14, 3' Chr6; marker CC549871; cvg=2; diff ctg (22 ctgs in between) 7180002041308 : 2 consecutive Chr20 markers; markers CC571631 & BZ832318; cvg=20; SAME/diff ctg 7180002041321 : 5' Chr2, 3' Chr15 ; marker CC513377 cvg=10; diff ctg (1 ctgs in between) 7180002041350 : 2 consecutive Chr3 markers ; markers BZ837387 & CC571149; cvg 17..26; NOT uniq; SAME/diff ctg
Details:
#scfid begin end markerid Chr #ctgid begin end markerid Chr 7180002040844 8369147 8369989 CC521620 17 7180001727899 132573 133415 CC521620 17 ** 7180002040844 8437707 8438324 CC576837 30 7180001727899 201133 201750 CC576837 30 ** ... 7180002040844 8467493 8468096 CC543771 30 7180001727899 230919 231522 CC543771 30 * 7180002040844 8522732 8523598 CC513544 17 7180001727901 34970 35836 CC513544 17 * --- 7180002041163 770783 771508 CC562874 30 7180001926720 20391 21116 CC562874 30 * 7180002041163 791050 791484 BZ887794 11 7180001926721 3139 3573 BZ887794 11 * -- 7180002041234 3856848 3857673 BZ889975 10 7180002010181 56608 57433 BZ889975 10 * 7180002041234 3949893 3950652 CC509477 5 7180002010182 28610 29369 CC509477 5 * -- 7180002041235 1117908 1118686 CC579933 21 7180002010285 4086 4864 CC579933 21 ** 7180002041235 1118939 1119675 CC561837 14 7180002010285 5117 5853 CC561837 14 ** 7180002041235 1158300 1159022 CC585677 14 7180002010285 44478 45200 CC585677 14 ** 7180002041235 1164296 1164891 BZ924510 21 7180002010285 50474 51069 BZ924510 21 ** -- 7180002041279 6755839 6756519 BZ849919 1 7180002018195 417990 418670 BZ849919 1 * 7180002041279 6786086 6786957 CC472696 23 7180002018196 12703 13574 CC472696 23 * ... 7180002041279 6866378 6866890 CC522963 23 7180002018196 92995 93507 CC522963 23 * 7180002041279 6911185 6911897 CC574255 1 7180002018197 15411 16123 CC574255 1 * -- 7180002041306 248713 249457 CC503129 14 7180002022875 59534 60278 CC503129 14 * 7180002041306 910997 911554 BZ839769 6 7180002022892 21939 22496 BZ839769 6 * -- 7180002041308 1919399 1920196 BZ883381 18 7180002023013 57093 57890 BZ883381 18 * 7180002041308 1958658 1959168 CC571631 20 7180002023014 29833 30343 CC571631 20 * 7180002041308 1963333 1964083 BZ832318 20 7180002023014 34508 35258 BZ832318 20 ** 7180002041308 2009574 2010367 CC499423 18 7180002023014 80749 81542 CC499423 30 ** -- 7180002041321 4820048 4820542 BZ846646 2 7180002025240 77429 77923 BZ846646 2 * 7180002041321 4908631 4909160 CC513377 15 7180002025242 13302 13831 CC513377 15 * -- 7180002041350 2423855 2424205 CG983886 13 7180002030523 183644 183994 CG983886 13 * 7180002041350 2446763 2447125 BZ837387 3 7180002030524 9191 9553 BZ837387 3 * 7180002041350 2457417 2457704 CC571149 3 7180002030524 19845 20132 CC571149 3 ** 7180002041350 2486474 2487050 CC490214 13 7180002030524 48902 49478 CC490214 13 **
Scaffold splitting
Before:
- 22 scaffolds
- 28 breaks : 14 on the same contig , 8 on adjacent contigs , 6 on non adjacent contigs
Now:
- 14 scaffolds
- 15 breaks : 8 on the same contig , 3 on adjacent contigs , 4 on non adjacent contigs
Scaffold to break
   nl  scfid           breaks
   #1  7180002040844   2
    2  7180002040911   1
    3  7180002041061   1
   #4  7180002041103   1
    5  7180002041107   1
   #6  7180002041163   1
    7  7180002041200   1
   #8  7180002041216   1
    9  7180002041234   1
  #10  7180002041235   2
   11  7180002041259   2
  #12  7180002041279   2
   13  7180002041281   1
   14  7180002041306   1
  #15  7180002041308   2
   16  7180002041315   1
   17  7180002041321   1
   18  7180002041325   1
   19  7180002041348   1
  #20  7180002041350   2
   21  7180002041356   1
   22  7180002041358   1
Contigs to break
   nl  ctgid           
    1  7180001722964
    2  7180001723456
    3  7180001730191
    4  7180001852904
    5  7180001855003
    6  7180002002029
    7  7180002010182
    8  7180002012722
    9  7180002012752
   10  7180002018361
   11  7180002024308
   12  7180002025240
   13  7180002025880
   14  7180002030033
Marker pairs
   nl  scfid           begin1  end2    markerid1       Chr1    markerid2       Chr2    ctg1            ctg2            dist(end2-begin1)      dist(ctg2-ctg1)
   #1  7180002040844   8369147 8438324 CC521620        17      CC576837        30      7180001727899   7180001727899   69177                   0
   #2  7180002040844   8467493 8523598 CC543771        30      CC513544        17      7180001727899   7180001727901   56105                   2
 
    3  7180002040911   2189844 2211037 BZ923031        30      CZ405316        29      7180001730191   7180001730191   21193                   0
 
    4  7180002041061   213959  236479  CC572066        15      CC550436        29      7180001852904   7180001852904   22520                   0
 
   #5  7180002041103   5220856 5297045 CG984741        18      CC587675        15      7180001854649   7180001854651   76189                   2
 
    6  7180002041107   7302203 7312254 CC906829        15      CC479102        9       7180001855003   7180001855003   10051                   0
 
   #7  7180002041163   770783  791484  CC562874        30      BZ887794        11      7180001926720   7180001926721   20701                   1
 
    8  7180002041200   4940131 4957105 CC500137        20      CZ428497        2       7180002002029   7180002002029   16974                   0
 
   #9  7180002041216   307121  311546  CL609365        5       CL865591        15      7180002003578   7180002003578   4425                    0
 
   10  7180002041234   3856848 3950652 BZ889975        10      CC509477        5       7180002010181   7180002010182   93804                   1
 
  #11  7180002041235   1117908 1119675 CC579933        21      CC561837        14      7180002010285   7180002010285   1767                    0
  #12  7180002041235   1158300 1164891 CC585677        14      BZ924510        21      7180002010285   7180002010285   6591                    0
 
   13  7180002041259   4224828 4316178 CC920177        5       BZ922220        3       7180002012718   7180002012722   91350                   4
   14  7180002041259   6220638 6240375 BZ869532        3       CZ413142        24      7180002012752   7180002012753   19737                   1
 
  #15  7180002041279   6755839 6786957 BZ849919        1       CC472696        23      7180002018195   7180002018196   31118                   1
  #16  7180002041279   6866378 6911897 CC522963        23      CC574255        1       7180002018196   7180002018197   45519                   1
 
   17  7180002041281   2927406 2939696 CC573399        19      CC513914        28      7180002018361   7180002018361   12290                   0
 
   18  7180002041306   248713  911554  CC503129        14      BZ839769        6       7180002022875   7180002022892   662841                  22
 
  #19  7180002041308   1919399 1959168 BZ883381        18      CC571631        20      7180002023013   7180002023014   39769                   1
  #20  7180002041308   1963333 2010367 BZ832318        20      CC499423        18      7180002023014   7180002023014   47034                   0
 
   21  7180002041315   2152291 2197298 BZ836343        13      BZ838379        1       7180001862977   7180002024308   45007                   2*
 
   22  7180002041321   4820048 4909160 BZ846646        2       CC513377        15      7180002025240   7180002025242   89112                   2
 
   23  7180002041325   2608389 2639207 CC506736        10      CC770009        5       7180002025880   7180002025880   30818                   0
 
   24  7180002041348   3044263 3093121 BZ872906        8       BZ924509        7       7180002030033   7180002030033   48858                   0
 
  #25  7180002041350   2423855 2447125 CG983886        13      BZ837387        3       7180002030523   7180002030524   23270                   1
  #26  7180002041350   2457417 2487050 CC571149        3       CC490214        13      7180002030524   7180002030524   29633                   0
 
   27  7180002041356   6828214 6833356 CC531427        16      CC494876        25      7180001722964   7180001722964   5142                    0
 
   28  7180002041358 20964100 21003219 BZ839784        14      CC527932        26      7180001723456   7180001723457   39119                   1
Break intervals
       #scfid          begin   end     scflen          len  allread goodmate badmate Chr1 Chr2 mark1   mark2  ctgid
   #1  7180002040844   8412735 8412739 9565303         5       1       12      15      17 30   357     3      
   #2  7180002040844   8485384 8485869 9565303         486     0       12      4       30 17   3       53
 
    3  7180002040911   2205044 2205247 3009964         204     3       0       8       30 29   44      34     7180001730191
    4  7180002041061   232819  232843  510915          25      5       0       23      15 29   8       14     7180001852904
   #5  7180002041103   5295117 5295435 5342873         319     0       11      2       18 15   221     4
    6  7180002041107   7306552 7306620 19515586        69      2       0       11      15 9    320     549    7180001855003
   #7  7180002041163   787669  787911  1752998         243     0       10      0       30 11   16      8
    8  7180002041200   4954103 4954180 16991478        78      1       0       27      20 2    219     532    7180002002029
   #9  7180002041216   309331  309337  1435859         7       1       4       3       5 15    7       50
   10  7180002041234   3939493 3939548 10292558        56      1       0       20      10 5    160     227    7180002010182
  #11  7180002041235   1117752 1117987 1742396         236     5       20      2       21 14   43      2
  #12  7180002041235   1161294 1161363 1742396         70      3       16      0       14 21   2       32
   13  7180002041259   4313381 4313547 15594966        167     1       0       12      5 3     193     75     7180002012722
   14  7180002041259   6231717 6231762 15594966        46      8       4       18      3 24    75      406    7180002012752
  #15  7180002041279   6773364 6773383 9784709         20      0       4       14      1 23    292     5
  #16  7180002041279   6894603 6894618 9784709         16      0       4       16      23 1    5       141
   17  7180002041281   2931877 2931885 12908817        9       3       5       10      19 28   83      441    7180002018361
   18  7180002041306   267445  267446  4484402         2       0       2       14      14 6    49      141    7180002022875,7180002005935
 
  #19  7180002041308   1928806 1928825 23070077        20      0       10      0       18 20   63      2
  #20  7180002041308   1977759 1977828 23070077        70      1       21      1       20 18   2       836
   21  7180002041315   2192754 2192893 2233575         140     7       0       22      13 1    59      2      7180002024308
   22  7180002041321   4840934 4841070 6492983         137     1       0       16      2 15    227     72     7180002025240
   23  7180002041325   2624153 2624190 11155368        38      3       0       22      10 5    86      357    7180002025880
   24  7180002041348   3064691 3064730 19179560        40      1       0       16      8 7     116     634    7180002030033
  #25  7180002041350   2437553 2437572 23353087        20      0       19      1       13 3    107     2
  #26  7180002041350   2479000 2479029 23353087        30      2       23      1       3 13    2       880
 
   27  7180002041356   6831454 6831783 14699461        330     1       0       9       16 25   299     319    7180001722964
   28  7180002041358  20970057 20970139 25079321       83      7       0       25      14 26   868     189    7180001723456
Where:
Chr1: The most frequent chromosome markers with alignment at coordinates <= begin Chr2: The most frequent chromosome markers with alignment at coordinates >= end mark1: Number of Chr1 markers with alignment at coordinates <= begin mark2: Number of Chr2 markers with alignment at coordinates >=end Lines starting with # should be ignored. ctgid: ctg to break
New scaffolds
#scfid(new) scfid begin end scflen(new)
Files
# ctg:scf new name mapping /scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scfproblems.both.filtered/scfproblems.posmap.ctbscb # scf:scf new:original name mapping /scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scfproblems.both.filtered/scfproblems.posmap.scbscf # ctg:ctg new:original name mapping /scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scfproblems.both.filtered/ctgproblems.posmap.ctbctg # ctg FASTA sequences /scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scfproblems.both.filtered/ctbproblems.fasta # scf FASTA sequences /scratch1/bos_taurus/Assembly/2009_0312_CA/markers/scfproblems.both.filtered/scbproblems.fasta
UMD2.6 (UMD2.5 without contam ctg/scaff; split ctg/scaff)
Contaminants & MarkerBreaks
Delete summary:
ctg scf scf->ctg contaminants(delete) 156 152 666 contaminants(trim) 12 12 1328 markerBreaks 14+1 14+1 2875+1 # 1 more break in UMD2.6.1 total 182 178 4869
Add summary:
ctg scf contaminants(delete) 0 2 contaminants(trim) 12 12 markerBreaks 28+2 29+2 # 1 more break in UMD2.6.1 total 40 43
Summary:
ctg scf markers original 90135 39978 del/add -142 -135 final 89993 39843 -17
Files:
/scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.ctg.fasta : contig FASTA sequence /scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.scf.fasta : scaffold FASTA sequence /scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.posmap.ctglen : contig lengths /scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.posmap.scflen : scaffold lengths /scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.posmap.ctgscf : mapping of contigs to scaffolds (posmap format) /scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.posmap.scaff : mapping of contigs to scaffolds (scaff format) /scratch1/bos_taurus/Assembly/2009_0312_CA/new/ctg.delete.uid : UID of the contigs which were deleted /scratch1/bos_taurus/Assembly/2009_0312_CA/new/scf.delete.uid : UID of the scaffolds which were deleted /scratch1/bos_taurus/Assembly/2009_0312_CA/new/ctg.add.uid : UID of the contigs which were added : UID =~/brk\d+[abc]/ OR UID =~/cnt\d+/ /scratch1/bos_taurus/Assembly/2009_0312_CA/new/scf.add.uid : UID of the contigs which were added : UID =~/brk\d+[abc]/ OR UID =~/cnt\d+/ /scratch1/bos_taurus/Assembly/2009_0312_CA/new/markers.delete.uid : markers which got deleted
/scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.ctg.break : 15 contig break regions /scratch1/bos_taurus/Assembly/2009_0312_CA/new/bt.scf.break : 16 scaffold break regions
Scripts:
~/bin/breakPosmapKeep.amos : pipeline for breaking scf/ctg
- Final:
elem min max mean med n50 sum ctg 89993 65 1160130(1.1M) 29736 5180 95952 2676109306
scf 39843 316 34167202(34M) 68353 1361 7451988 2723419943 scf<50K 38915 316 49898 2765 1349 5107 107632139 scf<5K 35229 316 4999 1518 1306 1435 53495879
Ctg Markers
Filtered:
%IDY>90 %Matched>85
- ~30% of alignments agree to this condition
Markers:
total: 126111 # from 31372 ctgs
Contigs
elem min max mean med n50 sum placed 31372* 442 1160130 72908 52734 111147 2287291732* unplaced 58621 65 425922 6632 1600 22204 388817574 total 89993 65 1160130 29736 5180 95952 2676109306
Scf Markers
Markers:
total: 126111 # from 2641 scaffolds; 1595 scaffolds have 2+ markers na: 18744 # not placed del: 17 # the scaffolds they belong to were deleted other: 3722 # not on the "main" chromosome; "main" chromosome determined by a majority rule; if it's a tie check markers for uniqueness outliers: 411 # interquartile range method (IQR) q1-1.5*(q3-q1) .. q1+1.5*(q3-q1) filtered: 103217 # "non conflicting" markers
 Scaffolds:	 
                   elem       min        max        mean       med        n50        sum              #ctgs
 placed            2641       1000       34167202   994523     16673      8170786    2626536153*      50528	 
 unplaced          37202      316        754615     2604       1337       3964       96883790	        39465
 total             39843      316        34167202   68353      1361       7451988    2723419943       89993	 
scf <2 >=2 >=10 min max mean med n50 sum markers/scf 2641 1595 1046 562 1 1418 39 1 354 103479 ctg/scf 2641 1546 1095 559 1 545 19 1 130 50528
cat markers_scf.mainChr.*summary | count.pl -c 7 | sort -n 1 1564 # 1 marker/scf 2 276 # 2 markers/scf 3 116 # ... 4 52 ... 1470 1 # scaffold 7180002041371 has 1470 markers
- Scaffold position:
filter outliers (interquartile range method); use median value problem: only 2 markers far apart: choose randomly or check for uniqueness
Summary (approximate) Chr #Ctg #Scaff ScaffSpanSum MaxMarkerPos 1 2989 112 157088082 167097751 2 2488 73 138112445 141135901 3 2257 109 120984003 128677351 4 2079 111 124695956 123662451 5 2271 98 120470218 130242001 6 2244 127 117350442 127208151 7 2169 93 109780318 114917551 8 2069 95 111646872 114607251 9 1937 72 103790361 106365151 10 1829 96 102878815 108508301 11 2030 62 106593132 107458151 12 1789 114 89109155 97406401 13 1498 73 83821399 88539451 14 1482 142 84084175 89211101 15 1734 105 84680500 91332551 16 1710 111 80727432 86838601 17 1384 51 72913556 78195801 18 1446 86 65689468 70299751 19 1338 56 63372609 69847351 20 1454 54 71941707 75982901 21 1405 63 70035525 72193201 22 1077 36 60892135 > 60178851 23 1021 44 51791473 54886001 24 1059 26 61662407 > 61466101 25 783 41 42670836 > 45254751 26 991 40 50640267 52316851 27 920 70 45768018 48911451 28 810 52 45884054 50753001 29 1143 89 51657687 55219751 30 3122 340 135803106 152429101 U 39465 37202 94983049
- Scaffold orientation
filter outliers (interquartile range method) use LeastSequareFit method to estimate the orientation : if slope is positive => forward; if slope is negative => reverse; problem: slope ~=0 => which direction ?
 cat Chr.summary | getSummary.pl -i 5
 cat Chr.agp | grep W | awk '{print $9}' | count.pl
elem <0 0 >0 min max mean med scf 2641 516 1610 515 -31 61 0 0 ctg 50528 24885 2236 23407 Use slope thold to determine direction? cat markers_scf.mainChr.noOutliers.summary | p 'print $_ if(abs($F[5])>0.5);' | wc -l # 634
Ambiguity examples: BZ908653 6 114061501 114016501 114106501 172149 597 98.66 100.00 7180002040834 BZ891600 6 114085251 114051501 114119001 4834 504 99.80 99.02 7180002040834 CZ411135 6 114094001 114059001 114129001 242710 609 98.20 100.00 7180002040834 BZ854276 6 114132751 114101501 114164001 100980 580 99.31 99.83 7180002040834 CC524983 30 115669901 115622401 115717401 96634 715 98.74 100.00 7180002041003 BZ869249 30 115791151 115717401 115864901 86931 448 99.78 100.00 7180002041003 BZ867530 30 115798651 115732401 115864901 69671 572 99.13 100.00 7180002041003 CC585731 30 115808651 115752401 115864901 54950 737 99.86 100.00 7180002041003 CC469285 30 115818651 115772401 115864901 125555 550 94.57 100.00 7180002041003
- Scaffold overlaps: some small scaffolds might be contained by bigger ones
cat markers_scf.mainChr.noOutliers.posmap.scfchrabs | ~/bin/posmap2ovl.pl | sort -nk6 -r | ~/bin/tab2tab.pl -f -15 | head Chr ref qry begin end end-begin 30 7180002041328 7180002041078 72536051 76993951 4457900 4 7180002041381 7180002041269 41976451 46175201 4198750 30 7180002040852 7180002041077 142001501 145251501 3250000 30 7180002038569 7180002041121 33012001 35518351 2506350 30 7180002040971 7180002034501 59411001 61836101 2425100 ...
Marker Issues
Placement
- 37202 scaffolds are unplaced (3.5% of total scaffold span); max=0.7M
- 87 unplaced scaffolds (1.5Mbp total) could be placed using SLK messages
perl ~/bin/difference21.pl bt.slk markers_scf.scflen | head 7180002030632 7180002040171 I -147744.734 21214.779 2 UP 7180002030792 7180002031221 I -15042.301 558.223 2 UP 7180002030849 7180002041244 N -1609089.875 18145.568 4 UP ...
elem min max mean med n50 sum Unplaced 87 740 102909 17259 8544 34550 1501601 Placed 78 16177 27139572 5757234 4037663 10989230 449064304
- ambiguous assignment to chromosmes:
 cat markers_scf.count | p 'print $_ if($F[2]*2==$F[3]);' | nl
       #k1             k2      count12 count1
    1  7180002030741   15      2       4
    2  7180002031341   11      1       2
   ...
   36  7180002068140   25      1       2
 # scaffold assigned differently
 ~/bin/difference12.pl markers.all/markers_scf.mainChr.count markers.all_plus_uniq/markers_scf.mainChr.count | nl
    1  7180002032536   19      1       2       23480
    2  7180002037223   8       1       2       31516
    3  7180002040013   25      2       4       15228
    4  7180002040262   30      1       2       3087
    5  7180002040378   10      1       2       50907
    6  7180002040523   29      1       2       9105
    7  7180002040769   4       1       3       827526
    8  7180002041203   14      2       4       1141560
    9  7180002044555   18      1       2       1566
- Chr30 has many scaff aligned to it
- how reliable are the markers: 1151 out of 2641 placed scaffolds have no unique markers ?
- what measure is best for placing the scaffolds?
- can some scaffolds go in the gaps ?
- AGP:
- gaps<=0 or 20 set to 100
- unoriented ctgs set to +
 
- Marker positions are not uniformly distributed; they tend to "custer"
elem 0 >0 min max mean med n50 sum markers_chr 107337 8405 98932 0 446250 25800 8750 75000 2769379200 markers_scf 94171 123 94048 0 2565850 24143 15856 42236 2273610521 markers_chr.mainChr.noOutliers 101084 7609 93475 0 1253750 27383 10000 78750 2768040450 # filtered IRQx,xy method markers_scf.mainChr.noOutliers 88450 52 88398 0 2669361 25327 16587 44374 2240202221
Orientation
- at least 1564 out of 2641 placed scaffolds cannot be oriented; max=1.39M
Possible misassemblies
nl #scfid Ch medianPos rangePos scfLen slope #ChMark #Mark #ctg 1 7180002041225 1 120473301 9866350 9279975(9M) 0.0103 381 390 184 : break interval: 1674263..1674319 (1X frg_cvg region) 2 7180002041078 30 77506551 6310600 3094999(3M) 0.0148 65 69 46 : no clear position
scf7180002041225 : 9.27 Mbp on Chr1
scflen=9279975
7180002041225.markers 7180002041225.markers 7180002041225.cvg
ChrPos ScfPos 115272051 56300 ... 116848301 1618430 116953301 9190560 ... 125138401 1691145
break interval: 1674263..1674319 = 57bp frg_cvg=1 mate_cvg=0 bad_mate_cvg(nearby)=7
scf7180002041078 : 3.09 Mbp (not broken)
7180002041078.markers 7180002041078.markers 7180002041078.1.cvg xrange [315268:466827] 7180002041078.2.cvg xrange [1191000:1229713] 7180002041078.3.cvg xrange [2733146:2750017]
ChrPos ScfPos 72536051 2893791 ... 72682301 2750017 72728651 1191000 ... 73273651 466827 77112801 2733146 ... 78464051 1229713 .. 78595401 315268
scf7180002040971 : 1.93 Mbp (not broken)
Files
/scratch1/bos_taurus/Assembly/2009_0312_CA/new/markers.all.redo/
- Nucmer alignments to UMD2.0 chromosomes: all seem to agree pretty well
/scratch1/bos_taurus/Assembly/2009_0312_CA/new/markers.all.redo2/nucmer_UMD2.0/UMD2.0.Chr26-Chr26.png /scratch1/bos_taurus/Assembly/2009_0312_CA/new/markers.all.redo2/nucmer_UMD2.0/UMD2.0.Chr27-Chr27.png /scratch1/bos_taurus/Assembly/2009_0312_CA/new/markers.all.redo2/nucmer_UMD2.0/UMD2.0.Chr29-Chr29.png /scratch1/bos_taurus/Assembly/2009_0312_CA/new/markers.all.redo2/nucmer_UMD2.0/UMD2.0.ChrX-Chr30.png
Homo sapiens alignments
Citations:
- mice closer to human than cow
- human and cow have approximately 201 homologous blocks of DNA
- Independently generated mapping data provide another measure of the quality of the assembly. Snelling et al. [4] created a B. taurus map from three radiation hybrid panels, two genetic maps, and bacterial artificial chromosome (BAC) end sequences. We aligned all of the 17,254 markers (of which 17,193 are unique) in their composite map (Cmap) to both assemblies. A marker was considered as matching a chromosome if 90% of the marker sequence aligned with at least 95% identity. Of the Cmap markers, 14,620 align to the UMD2 assembly's chromosomes, versus 13,699 markers (6.3% fewer) for the BCM4 assembly. A small number of Cmap markers (119 and 82 for UMD2 and BCM4, respectively) mapped to a different chromosome from the one indicated in the Cmap data.
- homologous synteny block (HSB) :
- human-cow alignment extended for at least 250 Kbp
- it was not interrupted by an inversion or by an HSB on another chromosome.
- If two HSBs were interrupted by a gap of <3 Mbp and nothing else fell in that gap, the two blocks were merged. (Note that if a large region of synteny is interrupted by a distinct HSB, the interruption creates three HSBs.)
- Number of homologous synteny blocks on each chromosome of the cow and human genomes 201 -> 268 blocks
 
- Orienting contigs using cow-human alignments:
- Scaffolds (sets of linked contigs) that were mapped onto chromosomes using only a single marker could not be oriented from the marker information alone. We oriented many of these scaffolds by taking advantage of the overall conserved synteny between cow and human. First, all cow scaffolds were aligned to the human genome using nucmer [14] with its maximal unique match (mum) option in order to avoid alignments of repetitive sequence. For each alignment of a previously unoriented scaffold to human, all alignments within 100 Kbp on each side were pulled out for analysis. A score S was computed for each unoriented scaffold, taking into account whether the scaffolds surrounding S on both sides (in cow) were mapped to a consistent set of locations in human. If the scaffolds surrounding S were oriented, and if a large majority of these scaffolds on both the left and right agreed on the orientation, then S was assigned that orientation. Using this procedure, 1,840 scaffolds containing 4,011 contigs were oriented.
- We developed a similar procedure to assign unplaced contigs to chromosomes, again relying on conserved synteny between cow and human. First, all unplaced contigs were aligned as above. Mummer's 'delta-filter' program was then used to compute a one-to-one mapping of the unplaced contigs to human so that only the best aligning contig was considered at each region in human. For each unplaced contig's best alignment to human, the matching region in cow was identified via our human-cow syntenic map, and all contigs from this region were extracted for examination. We only considered placing a contig on a B. taurus chromosome if the order and direction of the surrounding contigs in cow matched the corresponding region in human. As above, we examined the alignments of nearby cow contigs that aligned within 100 kb of the unplaced contig's alignment in human. If the region of cow-human synteny contained no rearrangements, then the unplaced contig was placed at the location indicated by these alignments. Using this procedure, 1,046 contigs were placed on chromosomes. One consequence of this procedure was that a number of incompletely mapped genes (based on mRNA alignments) were completed.
 
Issues:
- which alignment program to use?
- nucmer
- blastz: difficult to parse
- blat
 
- nucmer: what parameters?
- default
- loose : -mum -l 12 -c 30 -g 1000
- ref: 24 homo sapiens chromosomes files
- query: 26 bos taurus scaffold files
 
Total scaffolds:
elem min max mean med n50 sum scf(len) 39844 316 34167202 68352 1361 7451988 2723419938
Aligned to HS:
elem min max mean med n50 sum scf(len) 8860 385 34167202 301202 2257 7740810 2668658140 # each scf aligns in avg to 4 Chr scf(aligLen) 789272 250 19097 514 399 550 405946972 scf(alig%) 789272 58.04 100.00 79 78.54 78.72 . scf(len,maxX) 727 1002 7069350 194658 6530 1895564 141516604 # 220 in common with the 339 ones that have mark scf(len,noMark) 7025 385 7605708 7783 1654 27630 54680173 scf(len,mark) 1835 1001 34167202 1424511 40503 7970944 2613977967
Not aligned to HS:
elem min max mean med n50 sum scf(len,notAligned) 30984 316 163953 1767 1309 1555 54762139
Aligned to markers:
elem min max mean med n50 sum scf(len) 2641 1000 34167202 994523 16673 8170786 2626536153 1 marker 1595 1000 1396951 17482 4595 48276 27884078 2+ markers 1046 1055 34167202 2484371 596240 8217662 2598652075 abs(slope)<0.25 1946 1000 3094999 29136 7585 114274 56700529 abs(slope)>=0.25 696 4674 34167202 3692292 1845828 8401441 2569835619
Scf summary:
elem min max mean med n50 sum all 39844 316 34167202 68352 1361 7451988 2723419938 1+mark 2641 1000 34167202 994523 16673 8170786 2626536153 # best marker alignment 1+align 8860 385 34167202 301202 2257 7740810 2668658140 # alignments > 250bp 1+align(new) 9880 385 34167202 270458 1996 7740810 2672129298 # alignments > 200bp (3.74M more than 250bp align) 1+mark or 1+align 9669 385 34167202 276443 2051 7740810 2672936792 0 mark and 1+align 7027 385 754,615 6603 1654 20219 46,400,303 !!! 1+mark and 1+align 1791 1001 34167202 1305861 37914 7473583 2338798064 0 mark and 0 align 30179 316 88,326 1705 1305 1517 51,483,587 !!! 1+mark or 1+align 9669 385 34167202 276443 2051 7740810 2,672,936,792
Degenerate:
elem min max mean med n50 sum all 251413 65 39964 1003 984 994 252,279,234 all(2000+bp) 2128 2000 39964 3753 2827 3946 7,986,872 1+mark 562 200 30168 2731 1274 5029 1,535,011 1+mark(2000+bp) 180 2007 30168 6079 4404 7510 1,094,252
1+align 6429 251 39964 1487 1013 1287 9,566,273 1+align(2000+bp) 756 2004 39964 4820 3624 5646 3,644,556
Issues:
- 24 scaffolds that have 200+ alignments to at least 2 HS chromosomes
- 37 scaffolds that have 100+ alignments to at least 2 HS chromosomes
- 83 scaffolds that have 50+ alignments to at least 2 HS chromosomes
File location:
/scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_human
Synteny method
File location:
/scratch1/bos_taurus/Assembly/2009_0312_CA/synteny
- Position & orient scaffolds that aligned to HS
- find placed neighboring that don't "disagree"
elem min max mean med n50 sum total 7027 385 754615 6603 1654 20219 46,400,303 !!!
- Scaffolds possibly assigned to the wrong chromosome
 cat summary.nucmer-markers.txt | perl ~/bin/synteny/getSynteny2.nucmer-marker.pl -minMarkers 2 | grep update:chr | getSummary.pl -i 6 -t "<2 markers"
 cat summary.nucmer-markers.txt | perl ~/bin/synteny/getSynteny2.nucmer-marker.pl -minMarkers 4 | grep update:chr | getSummary.pl -i 6 -t "<4 markers"
                             elem       min        max        mean       med        n50        sum
 <2 markers                  307        1001       937279     27898      9212       98674      8,564,850
 <4 markers                  383        1001       1141560    33822      13426      160445     12,953,875
- UMD2.0
elem min max mean med N50 sum ctg 74337 88 840370 35148 14148 79144 2612810882
- UMD2.6.1 noVariants,noCont
elem min max mean med N50 sum scf 2646 385 34167202 994701 20787 7139718 2631980624 ctg 50755 65 1160130 50966 29450 88583 2586785910
- UMD2.6.1 noVariants
elem min max mean med n50 sum scf 39844 316 34167202 68352 1361 7451988 2,723,419,938 scf.placed 4707 385 34167202 564225 10413 7800796 2655811430 scf.variants 29436 723 51828 1714 1298 1514 50,461,989
 scf.unplaced                 30575      316        451968     1845       1309       1601       56440663
 
                              elem       min        max        mean       med        n50        sum
 ctg                          89994      65         1160130    29736      5180       95952      2676109378
 ctg.placed                   53646      65         1160130    48650      26840      98428      2609925446
 ctg.unplaced                 31510      101        207476     1761       1314       1556       55511410
                              elem       min        max        mean       med        n50        sum
 deg(all)                     251413     65         39964      1003       984        994        252,279,234
 deg(>2Kbp)                   2128       2000       39964      3753       2827       3946       7,986,872
deg.placed 883 200 30246 2711 1303 4845 2,393,905* deg.variants 4654 331 8039 1117 989 1024 5,200,553 deg.unplaced(?) 245343 65 39964 994 984 993 244047258
scf.unplaced 978 316 451968 7932 3736 15828 7757882 scf.uplaced.0cvg 863 316 451968 7521 2286 15744 6,491,342*
deg.unplaced 747 2002 39964 4365 3509 4819 3261181 deg.unplaced.0cvg 734 2002 39964 4358 3507 4811 3,199,221*
UMD2.0-UMD2.6.1 gaps>1K summary id count min max median sum 1 775 999 64953 1683 2884541 2 530 999 57624 1617 1741662 3 541 1000 69453 1684 1806436 4 549 1000 41029 1704 1905867 5 541 999 71460 1671 2324906 6 793 1000 58433 1589 2456268 7 531 1004 42991 1703 1928743 8 537 999 93947 1594 1923064 9 411 1000 46519 1722 1379152 10 481 999 86066 1811 1877559 11 448 999 93292 1697 1643904 12 574 999 50668 1839 2234042 13 367 1000 29022 1633 1087403 14 731 999 31863 1581 2200546 15 529 999 84891 1711 2148628 16 531 1003 52705 1774 1934384 17 313 999 30429 1566 888071 18 461 1000 49668 1595 1492322 19 291 1003 32720 1676 989993 20 332 999 40339 1659 1021011 21 352 999 96288 1601 1369633 22 241 1006 39580 1549 643143 23 249 1000 68244 1714 1110793 24 181 1006 43610 1671 666753 25 217 1001 44087 1569 660253 26 356 1002 42165 1705 1428967 27 343 1000 134551 1656 1993198 28 213 1000 57368 1662 880367 29 344 1004 33306 1705 1165609 30 2174 999 167808 1528 7739170 total 14936 999 167808 1638 53,526,388
- align UMD2.6 Chr1..30 to UMD2.0 Chr1..30 0cvg regions
elem min max mean med n50 sum all 14936 1000 167809 3584 1639 6465 53541324 aligned 11091 1000 167809 4010 1692 7977 44475715 not_aligned 3845 1000 50546 2357 1529 2735 9065609
Files: UMD2.0 regions not covered by UMD2.6.1 (chr aligned to itself)
/scratch1/bos_taurus/Assembly/2009_0312_CA/synteny/agp_markers.noVariants_nucmer.noVariants/nucmer_UMD2.0/Chr.0cvg.fa /scratch1/bos_taurus/Assembly/2009_0312_CA/synteny/agp_markers.noVariants_nucmer.noVariants/nucmer_UMD2.0/Chr.0cvg.posmap /scratch1/bos_taurus/Assembly/2009_0312_CA/synteny/agp_markers.noVariants_nucmer.noVariants/nucmer_UMD2.0/Chr.0cvg.summary
HS-BT Synteny map
- Trust scaffolds with 4+ markers
- Scaffolds with 3- markers must have a close neighbor that agrees with them
same HS & BT Chr maxCount=2 # at most 2 scaffolds away minMarkers=4 minRatio=0.66 & maxRatio=1.5 # distance ratio; maxDistance not used !!!
 => 226 synteny regions !!!
 join2.pl nucmer_*lsf markers_scf.*lsf | \
        ~/bin/filterMarkers.pl -minMarkers 4 | \
        ~/bin/getSyntenyBlock.pl | \
        grep -v ^# | grep -v ^$ | ~/bin/flipSummary.pl | sort -nk2 -nk5 | ~/bin/tab2tab.pl | \
        perl -ane  'print $P[13]," ",$F[13],"\n" if($F[13]-$P[13]==1); print $F[13]," ",$P[13],"\n" if($P[13]-$F[13]==1); @P=@F;' | sort -u -n | \
        ~/bin/mergeMap.pl >! hs-bt.map.tmp
 join2.pl nucmer_*lsf markers_scf.*lsf | \
        ~/bin/filterMarkers.pl -minMarkers 4 | \
        ~/bin/getSyntenyBlock.pl -map  hs-bt.map.tmp | \
        ~/bin/tab2tab.pl | grep  # | sed 's/#//' >  hs-bt.map
 
 ~/bin/map-draw.pl -refLen hs.infoseq -qryLen bt.infoseq hs-bt.map > ! hs-bt.png
Problems:
7180002041220: BT.Chr2 ok 7180002041025: BT.Chr4 7180002041222: BT.Chr7 7180002041228: BT.Chr8 7180002040195: BT.Chr8; HS.Ch23 del !!! 7180002041001: BT.Chr8; HS.Ch23 del !!! 7180002040851: BT.Chr30; HS.Ch3 -> HS.Ch23 7180002041008: BT.Chr30; HS.Ch7 -> HS.Ch23
- hs-bt.png Map picture
- Map:
#HS-ref begin end len HS-clust BT-ref begin end BT-clust #scf 01 870247 12637274 11767027 16 46893147 58621226 11728079 -1 5 01 16674545 30031540 13356995 2 126668764 139027145 12358381 -2 4 01 32016003 68894564 36878561 3 83066354 120874874 37808520 -3 10 01 68907947 122485896 53577949 3 23819464 83221225 59401761 -4 6 01 143725713 143737025 11312 3 25097798 25109110 11312 5 1 ... 23 148835146 151656716 2821570 30 32889252 35700335 2811083 -276 3 23 152157769 154641974 2484205 30 39322867 42104054 2781187 -277 2 23 154387850 154403684 15834 30 39353299 39369133 15834 278 1
Files:
/scratch1/bos_taurus/Assembly/2009_0312_CA/synteny/map.4/hs-bt.map /scratch1/bos_taurus/Assembly/2009_0312_CA/synteny/map.4/hs-bt.png
Overlaps
Cases:
1. CONTAINED scaffolds (clear variants) 2. single BEGIN/END between 2 scaffolds: 2 scaffolds could be merged 3. scaffold closing a sequence: 3 scaffolds could be merged 4. multiple BEGIN/END/CONTAIN* between 2 scaffold contigs
1: CONTAINED ~ 4731 cases
cat nucmer_scf.ovl/all-all.contained.ids marker_scf.ovl/all-all.contained.ids | sort -u > all-all.contained.ids
- Summary
elem <2Kbp 2..10Kbp >10Kbp min max mean med n50 sum variants(all) 4731 3503 1085 143 723 75436 2513 1429 3469 11891882 variants(1+markers) 541 311 170 60 1001 75421 4466 1690 9947 2416312
- Example: longest contained scaffold 75Kbp
#scaffold alignments 117893 126367 | 8520 1 | 8475 8520 | 94.06 | 562884 75576 | 1.51 11.27 | 7180002040891 7180002040646 286811 289412 | 72974 75576 | 2602 2603 | 93.29 | 562884 75576 | 0.46 3.44 | 7180002040891 7180002040646 421477 421977 | 9727 10223 | 501 497 | 95.41 | 562884 75576 | 0.09 0.66 | 7180002040891 7180002040646 171970 180765 | 73605 64798 | 8796 8808 | 96.74 | 274750 75576 | 3.20 11.65 | 7180002040912 7180002040646 180765 195989 | 64652 49398 | 15225 15255 | 96.05 | 274750 75576 | 5.54 20.18 | 7180002040912 7180002040646 196223 203368 | 49244 42105 | 7146 7140 | 96.81 | 274750 75576 | 2.60 9.45 | 7180002040912 7180002040646 205935 208456 | 38652 36130 | 2522 2523 | 96.12 | 274750 75576 | 0.92 3.34 | 7180002040912 7180002040646 207096 210496 | 42082 38673 | 3401 3410 | 96.07 | 274750 75576 | 1.24 4.51 | 7180002040912 7180002040646 209529 217353 | 36109 28272 | 7825 7838 | 95.50 | 274750 75576 | 2.85 10.37 | 7180002040912 7180002040646 213859 219155 | 25520 20220 | 5297 5301 | 95.80 | 274750 75576 | 1.93 7.01 | 7180002040912 7180002040646 221183 227630 | 20236 13870 | 6448 6367 | 93.63 | 274750 75576 | 2.35 8.42 | 7180002040912 7180002040646 227923 228686 | 13884 13126 | 764 759 | 95.16 | 274750 75576 | 0.28 1.00 | 7180002040912 7180002040646 228084 231413 | 13104 9760 | 3330 3345 | 94.10 | 274750 75576 | 1.21 4.43 | 7180002040912 7180002040646 #contig alignments 1929 10403 | 8520 1 | 8475 8520 | 94.06 | 26592 8520 | 31.87 100.00 | 7180002040891.4.10 7180002040646.1.8 [CONTAINS] 9084 11685 | 30871 33473 | 2602 2603 | 93.29 | 276174 33473 | 0.94 7.78 | 7180002040891.8.10 7180002040646.8.8 143750 144250 | 1187 1683 | 501 497 | 95.41 | 276174 4564 | 0.18 10.89 | 7180002040891.8.10 7180002040646.2.8 163797 172592 | 31502 22695 | 8796 8808 | 96.74 | 233551 33473 | 3.77 26.31 | 7180002040912.3.5 7180002040646.8.8 172592 187816 | 22549 7295 | 15225 15255 | 96.05 | 233551 33473 | 6.52 45.57 | 7180002040912.3.5 7180002040646.8.8 188050 195195 | 7141 2 | 7146 7140 | 96.81 | 233551 33473 | 3.06 21.33 | 7180002040912.3.5 7180002040646.8.8 197762 200283 | 2523 1 | 2522 2523 | 96.12 | 233551 2523 | 1.08 100.00 | 7180002040912.3.5 7180002040646.6.8 [CONTAINS] 198923 202323 | 3410 1 | 3401 3410 | 96.07 | 233551 3411 | 1.46 99.97 | 7180002040912.3.5 7180002040646.7.8 [CONTAINS] 201356 209180 | 7838 1 | 7825 7838 | 95.50 | 233551 7838 | 3.35 100.00 | 7180002040912.3.5 7180002040646.5.8 [CONTAINS] 205686 210982 | 12396 7096 | 5297 5301 | 95.80 | 233551 12396 | 2.27 42.76 | 7180002040912.3.5 7180002040646.3.8 213010 219457 | 7112 746 | 6448 6367 | 93.63 | 233551 12396 | 2.76 51.36 | 7180002040912.3.5 7180002040646.3.8 219750 220513 | 760 2 | 764 759 | 95.16 | 233551 12396 | 0.33 6.12 | 7180002040912.3.5 7180002040646.3.8 219911 223240 | 4564 1220 | 3330 3345 | 94.10 | 233551 4564 | 1.43 73.29 | 7180002040912.3.5 7180002040646.2.8 [CONTAINS] #marker & alignment summary #id BT-ref #markers slope begin end len HS-ref #align slope begin end len 7180002040891 4 1 0 13956053 14518937 562884 22 14 -1.5687 20838730 21401614 562884 7180002040163 8 1 -1 38079677 38178351 98674 22 3 1.0264 21133894 21232568 98674 update:dir:7180002040646 7180002040646 8 . 1 38088797 38164373 75576 22 4 -0.9833 21143014 21218590 75576 assign:Chr:7180002040163,7180002040163
2. ~ 349 cases (-3 cases 3.)
cat all-all.begin.ids all-all.end.ids | sort -u | wc -l
3. ~ 3 cases
intersect.pl all-all.begin.ids all-all.end.ids | perl -ane 'print $_ if($F[1]=~/1.1$/);' 7180002032818.1.1 23259 7180002036943.1.1 28355 7180002040409.1.1 25219 # merges 2 scaff 1 2811 | 2807 1 | 2811 2807 | 99.11 | 23259 5134 | 12.09 54.67 | 7180002032818.1.1 7180002032811.1.2[BEGIN] [BEGIN] 22101 23259 | 7238 6079 | 1159 1160 | 99.66 | 23259 7238 | 4.98 16.03 | 7180002032818.1.1 7180002032811.2.2[END] [END] 1 2811 | 2807 1 | 2811 2807 | 99.11 | 23259 5134 | 12.09 54.67 | 7180002032818.1.1 7180002032811.1.2[BEGIN] [BEGIN] 22101 23259 | 7238 6079 | 1159 1160 | 99.66 | 23259 7238 | 4.98 16.03 | 7180002032818.1.1 7180002032811.2.2[END] [END] 1 2025 | 2066 4090 | 2025 2025 | 99.80 | 25219 4090 | 8.03 49.51 | 7180002040409.1.1 7180002033541.1.1[BEGIN] [END] 23394 25219 | 5761 3930 | 1826 1832 | 98.42 | 25219 5761 | 7.24 31.80 | 7180002040409.1.1 7180002033538.1.1[END] [END]
cd nucmer_scf.ovl intersect.pl all-all.begin.ids all-all.end.ids 7180002032818 23259 7180002036943 28355 7180002040409 25219
4. ~10 cases
cd nucmer_ctg.ovl/translated/ cat all-all.annotated.coords | egrep 'BEGIN|END|CONTAIN' | p 'next if($F[6]<5000); next if($F[7]<5000); print $_;' | p '$F[17]=~/^([^.]+)(.+)/; $F[17]=$1 ; $F[18]=~/^([^.]+)/; $F[18]=$1; print $F[17],"\t",$F[18],"\n";' | count.pl -m 2 7180002041235 brk002041306a 3 7180002038888 7180002034914 2 7180002041059 7180002040095 2 7180002040879 7180002040934 2 cnt0002041350 7180002040907 2 7180002041341 7180002040894 2 7180002041015 cnt0002040938 2 7180002039401 7180002040397 2 7180002037358 7180002031425 2 7180002040915 7180002039470 2 cat all-all.annotated.coords | egrep 'BEGIN|END|CONTAIN' | grep ... 12363 21143 | 1 8837 | 8781 8837 | 93.43 | 21143 50443 | 41.53 17.52 | 7180002041235.23.33 brk002041306a.5.7[END] [BEGIN] 1 9272 | 5591 14944 | 9272 9354 | 94.47 | 9272 50443 | 100.00 18.54 | 7180002041235.25.33 brk002041306a.5.7[CONTAINED] 71049 97689 | 31941 5220 | 26641 26722 | 96.71 | 97689 31941 | 27.27 83.66 | 7180002041235.29.33 brk002041306a.6.7[END] [END] 57 3896 | 14743 10903 | 3840 3841 | 95.94 | 3896 22667 | 98.56 16.95 | 7180002041235.32.33 brk002041306a.3.7[CONTAINED] 1 8942 | 1 8947 | 8942 8947 | 99.49 | 20792 8947 | 43.01 100.00 | 7180002038888.1.1 7180002034914.1.2[CONTAINS] 9670 20792 | 1 11091 | 11123 11091 | 99.27 | 20792 11091 | 53.50 100.00 | 7180002038888.1.1 7180002034914.2.2[CONTAINS] 1 8705 | 3932 12610 | 8705 8679 | 96.17 | 8705 30654 | 100.00 28.31 | 7180002041059.1.26 7180002040095.1.1[CONTAINED] 1 4777 | 12989 17697 | 4777 4709 | 95.32 | 4777 30654 | 100.00 15.36 | 7180002041059.2.26 7180002040095.1.1[CONTAINED] 1 6588 | 17999 24607 | 6588 6609 | 96.15 | 6591 30654 | 99.95 21.56 | 7180002041059.3.26 7180002040095.1.1[CONTAINED] 630 16709 | 16024 1 | 16080 16024 | 97.09 | 16901 16024 | 95.14 100.00 | 7180002040879.2.14 7180002040934.2.87[CONTAINS] 1 5876 | 5882 1 | 5876 5882 | 98.51 | 39657 13643 | 14.82 43.11 | 7180002040879.3.14 7180002040934.1.87[BEGIN] [BEGIN]
Scaffold links
Try to identify scaffold that fit the following criteria:
- have no markers and no alignments to HS
- linked by 2+ links to a single scaffold that has markers/alignments to HS
- is not a variant (2Kbp+ of unique sequnece)
elem min max mean med n50 sum linked 6169 316 88326 2408 1480 2709 14,860,479 linked(2+mates) 2109 316 44120 2595 1558 3092 5,474,510 linked(2+mates to a single scf) 2057 316 44120 2516 1547 2850 5,177,182 linked(2+mates to a single scf, no variant) 112 316 43782 7278 1987 23338 815,223 # 46>2Kbp; 25>10Kbp
=> 112 scaffold & 0.81Mbp could be added to the chromosomes !!!
BT alignments
- UMD2.6.1 vs UMD2.6.1
- Marker scaffolds against themselves (2642 total scaffolds) : nucmer -maxmatch -l 40 -c 2500 -g 250
- Mapped scaffolds without markers against marker scaffolds: nucmer -maxmatch -l 40 -c 250
- Mapped scaffolds without markers against themselves: nucmer -maxmatch -l 40 -c 250
File location:
/scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_Chr/
Other issues
Chr10 gene duplication
- LOC100298457
- See /nfshomes/dpuiu/Readmes/bos_taurus.runCA.13.txt
Scf breaks (Mike Robers)
:::::::::::::: scf.break.ids :::::::::::::: #ctg scf beg end dir 7180002015438 7180002038435 0 32822 f # ok: 4 markers from Chr18 7180002021916 7180002040442 0 24559 f # ok: 1 marker from Chr10 7180001725791 7180002040808 466123 562461 f # ok: 25 marker from Chr1, 1 marker from Chr2 7180001727899 7180002040844 8236574 8479429 f # also found by us; 3 Chr30 markers in the middle 7180001854650 7180002041103 5230302 5294734 f # also found by us; 4 Chr15 markers at 3' 7180002003578 7180002041216 300250 392069 f # also found by us; 7 Chr5 markers at 5' 7180002020010 7180002041293 4482477 4600045 f # ok 7180001722390 7180002041353 0 51674 f # ok :::::::::::::: scf.excised.ids :::::::::::::: #ctg scf beg end dir 7180002008629 7180002034664 82740 105723 f # no markers 7180001786240 7180002040913 2076746 2112375 f # ok 7180001787022 7180002040927 1075352 1109350 f # ok 7180001787352 7180002040931 4956099 4967823 f # ok 7180001789387 7180002040981 6641924 6671967 f # ok 7180001789575 7180002040984 3095587 3154302 f # ok 7180002003269 7180002041209 6350890 6470971 f # ok 7180002025281 7180002041322 989366 1227636 f # ok 7180002026741 7180002041328 343726 414329 f # ok 7180002029433 7180002041343 834493 888144 f # ok 7180001726451 7180002041383 7893629 7901645 f # ok
LOC100298457 (duplicate gene on Chr10)
Problem:
Is LOC100298457 gene (cow Chr10) a variation of MFSD3 gene (cow Chr14)?
UMD2.0
Chr10 15186759 15189292 833 W 7180003260677 1 2534 + # gene LOC100298457 Chr14 1839355 1958202 533 W 7180003326040 1 118848 + # gene MFSD3
- LOC100298457 aligns to 2 UMD2.6 scaffolds:
scf7180002041112 (1.56Mbp; 34 contigs) scf7180002033841 (2.44Kbp; 1 contig)
- scf7180002041112 contains 39 bos taurus Chr14 markers and has 206 alignments to human chromosome 8
- scf7180002033841 contains no bos taurus markers and has 1 alignments to human chromosome 8.
- scf7180002033841 5' aligns to scf7180002041112 (1.56Mbp; 34 contigs, cow Chr14) and to a human Chr8 region that maps to cow Chr14
- scf7180002033841 3' aligns to scf7180002041157 (6.15Mbp; 99 contigs, cow Chr10)
- scf7180002033841 & scf7180002041157 are linked by 2 mate pairs (inserts from a 3kbp BCM shotgun library)
- LOC100298457 vs UMD2.6 scf:
  [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS]
 ===============================================================================================================================
     1       88  |  1138313  1138226  |       88       88  |   100.00  |     1344  1561688  |     6.55     0.01  | LOC100298457       scf7180002041112
    80      192  |  1138150  1138038  |      113      113  |   100.00  |     1344  1561688  |     8.41     0.01  | LOC100298457       scf7180002041112
   190      316  |  1137974  1137848  |      127      127  |   100.00  |     1344  1561688  |     9.45     0.01  | LOC100298457       scf7180002041112
   315      464  |  1137770  1137621  |      150      150  |   100.00  |     1344  1561688  |    11.16     0.01  | LOC100298457       scf7180002041112
   455     1344  |  1137322  1136432  |      890      891  |    99.78  |     1344  1561688  |    66.22     0.06  | LOC100298457       scf7180002041112
     1     1344  |     1349        6  |     1344     1344  |   100.00  |     1344     2445  |   100.00    54.97  | LOC100298457       scf7180002033841
- UMD2.6 scf* vs scf7180002033841:
[S1] [E1] | [S2] [E2] | [LEN 1] [LEN 2] | [% IDY] | [LEN R] [LEN Q] | [COV R] [COV Q] | [TAGS] =============================================================================================================================== 1136427 1137322 | 1 895 | 896 895 | 99.78 | 1561688 2445 | 0.06 36.61 | scf7180002041112 scf7180002033841 4060495 4061426 | 2445 1521 | 932 925 | 97.76 | 6157473 2445 | 0.02 37.83 | scf7180002041157 scf7180002033841
Bos taurus marker summary:
#id BT-ref #markers slope begin end len scf7180002041112 Chr14 39 -0.8518 3377707 4939395 1561688 scf7180002041157 Chr10 255 1.0035 12831593 18989066 6157473 scf7180002033841 ? 0
Homo sapiens alignment summary:
#id HS-ref #align slope begin end len scf7180002041112 Chr8 206 0.6701 144381103 145942791 1561688 scf7180002041157 Chr15 1667 0.987 62245836 68403309 6157473 scf7180002033841 Chr8 1 1 145705320 145707765 2445
[S1] [E1] | [S2] [E2] | [LEN 1] [LEN 2] | [% IDY] | [LEN R] [LEN Q] | [COV R] [COV Q] | [TAGS] =============================================================================================================================== 143996007 143996305 | 74278 74572 | 299 295 | 82.45 | 146274826 1561688 | 0.00 0.02 | 1 1 NC_000008 7180002041112 ... 146249212 146249657 | 1268887 1268446 | 446 442 | 80.89 | 146274826 1561688 | 0.00 0.03 | 1 -1 NC_000008 7180002041112 63013167 63013429 | 887697 887966 | 263 270 | 78.15 | 100338915 6157473 | 0.00 0.00 | 1 1 NC_000015 7180002041157 ... 68369047 68369417 | 6154757 6155112 | 371 356 | 83.06 | 100338915 6157473 | 0.00 0.01 | 1 1 NC_000015 7180002041157 145705522 145706206 | 202 886 | 685 685 | 81.28 | 146274826 2445 | 0.00 28.02 | 1 1 NC_000008 7180002033841
scf7180002033841: 6 reads
read scf begin end dir 1120017508 7180002033841 0 1018 f 1120020725 7180002033841 227 1109 f 1120020722 7180002033841 724 1720 r 1120020726 7180002033841 1006 2073 f diffScaffold 1120017511 7180002033841 1303 2175 r 1120020728 7180002033841 1370 2445 f diffScaffold
scf7180002033841: 4 mates
read1 read2 scf1 scf2 1120017508 1120020722 7180002033841 7180002033841 1120017511 1120020725 7180002033841 7180002033841 1120020726 1120017512 7180002033841 7180002041157 diffScaffold 1120020728 1120017514 7180002033841 7180002041157 diffScaffold
scf7180002041157: 69095 reads; frg 1120017514 & 1120017512 positions close to the region aligned to scf7180002033841 (4060495-4061426)
read scf begin end dir 607312594 7180002041157 0 866 f ... 1120017512 7180002041157 4059332 4060396 f diffScaffold 1120017514 7180002041157 4058851 4059931 f diffScaffold .... 583956204 7180002041157 6157144 6157228 r
scf7180002041157: 99 contigs
count ctg scf begin end dir 1 ctg7180001926175 scf7180002041157 0 81467 f .. 64 ctg7180001926231 scf7180002041157 4039922 4046680 f # 6758bp ctg; 800bp gap following the contig 65 ctg7180001926232 scf7180002041157 4047480 4193693 f # 146213bp ctg ; 20bp gap following the contig 66 ctg7180001926233 scf7180002041157 4193713 4208047 f # 14334bp ctg .. 99 ctg7180001926260 scf7180002041157 6152691 6157473 f
frg 1120017514 & 1120017512 positions on ctg7180001926232
1120017514 ctg7180001926232 11371 12451 f # reads are 11kbp(>3Kb lib mean) inside the contig 1120017512 ctg7180001926232 11852 12916 f # reads are 11kbp(>3Kb lib mean) inside the contig