Bos taurus redo
BCM
NCBI Data
- Genome Projects
- TA search
- TA ftp
- 91 volumes: 87 with qual & 4 with no quality (85 volumes contain BCM reads)
- 14 centers
- 21 center/trace_type_codes
- Avg LEN=984
- Avg CLIP (CLB intersect CLV)=760
- Avg CLV=997 > Avg LEN ???
- Avg QUAL=38.96 (27.51 for the 2.59M reads not in the UMD assembly)
- Avg UMDoverlapper CLIP=778
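The CLIP value above is described as the intersection of CLB and CLV. A minimal sketch of that intersection, assuming each clear range is a (left, right) pair with left < right and that both ranges use the same coordinate convention (the function name is hypothetical, not part of any tool used here):

```python
def intersect_clear_ranges(clb, clv):
    """Intersect two (left, right) clear ranges.

    Returns the overlapping (left, right) range, or None when the
    two ranges do not overlap at all.
    """
    left = max(clb[0], clv[0])
    right = min(clb[1], clv[1])
    return (left, right) if left < right else None


# Example: CLB = 20..800, CLV = 35..997
print(intersect_clear_ranges((20, 800), (35, 997)))
```

A read whose two ranges do not overlap gets no usable CLIP range under this convention.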
Problems:
- 0 QUAL reads: 650,133 (see ZERO QUALITY COUNTS)
- The quality lines in several .qual files start with a space; the space must be removed, otherwise tarchive2ca errors out saying that len(quality)=len(seq)+1
- Several XML files contained an unescaped "&" character => XML parser error
- xml.bos_taurus.087 contained 2 trace_volumes => XML parser error
- BCCAGSC.CLONEEND : all reads have LIBRARY_ID=CH240, SEQ_LIB_ID=. ; the INSERT_SIZE & INSERT_STDEV vary within the library: set to 150,000 & 30,000
- UIUC.CLONEEND: INSERT_SIZE & INSERT_STDEV missing: set to 150,000 & 30,000
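The first two problems above (leading spaces in quality lines, bare "&" in the XML) are simple text fixes. A minimal sketch of both, assuming FASTA-style ">" headers in the .qual files; the function names are hypothetical and this is not the script that was actually used:

```python
import re

# "&" that does not start an entity such as &amp; &#38; or &#x26;
_BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def strip_leading_space(qual_lines):
    """Drop leading spaces from quality lines; header lines are left untouched."""
    return [line if line.startswith(">") else line.lstrip(" ")
            for line in qual_lines]


def escape_bare_ampersands(xml_text):
    """Replace unescaped '&' with '&amp;' so an XML parser accepts the text."""
    return _BARE_AMP.sub("&amp;", xml_text)


print(strip_leading_space([">12345", " 20 20 30 31"]))
print(escape_bare_ampersands("<LIB>A & B &amp; C</LIB>"))
```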
CENTER_NAME counts
       COUNT  CENTER_NAME
 1  35629020  BCM           Baylor College of Medicine
 2    737900  NISC          NIH Intramural Sequencing Center
 3    652614  BCCAGSC       British Columbia Cancer Agency Genome Sciences Centre  # TA query_tracedb CENTER_NAME = "BCCAGSC" => 652,510
 4    378871  MARC          USDA, ARS, US Meat Animal Research Center
 5    114753  UIUC          University of Illinois at Urbana-Champaign  # TA query_tracedb CENTER_NAME = "UIUC" => 106,368
 6    107367  BARC          USDA, ARS, Beltsville Agricultural Research Center
 7     65171  TIGR          The Institute for Genomic Research
 8     53556  GSC           Genoscope
 9     43033  CENARGEN      Embrapa Genetic Resources and Biotechnology
10     18623  SC            The Sanger Center
11     15301  UOKNOR        University of Oklahoma Norman Campus, Advanced Center for Genome Technology
12     10651  TIGR_JCVIJTC  The Institute for Genomic Research, Traces generated at JCVIJTC  # TA query_tracedb CENTER_NAME="JCVI"
13      2485  UIACBCB       University of Iowa Center for Bioinformatics and Computational Biology (UIACBCB)
14        49  WUGSC         Washington University, Genome Sequencing Center  # TA query_tracedb CENTER_NAME = "WUGSC" => 9
    37829394  total         # TA query_tracedb SPECIES_CODE = "BOS TAURUS" => 37,788,710
TRACE_TYPE_CODE counts
       COUNT  CENTER_NAME   TRACE_TYPE_CODE
 1  24863599  BCM*          WGS        SEQ_LIB_ID:89
 2  10748529  BCM*          SHOTGUN    SEQ_LIB_ID:15543
 3    737900  NISC          SHOTGUN    SEQ_LIB_ID:247
 4    125597  BCCAGSC       CLONEEND   LIBRARY_ID:1      large insert size; some qualityless; !!! almost all have CLIP3=0
 5    114753  UIUC          CLONEEND   LIBRARY_ID:2      insert size missing, no frequent kmers
 6     65171  TIGR          CLONEEND   SEQ_LIB_ID:1      2K & use TRACE_DIRECTION instead of TRACE_END
 7     53556  GSC           CLONEEND   SEQ_LIB_ID:1      large insert size; !!! all have qual=0 and were excluded
 8     26246  CENARGEN      WGS        .                 no LIBRARY_ID; no SEQ_LIB_ID; no INSERT_SIZE; no INSERT_STDEV; reads have no direction; ~21954 could be paired (same TEMPLATE_ID)
 9     25454  BARC          CLONEEND   SEQ_LIB_ID:14304  !!! all have CLIP3=0
10     16892  BCM*          CLONEEND   LIBRARY_ID:1      VBBAA mean=167000 std=25000
11     16787  CENARGEN      CLONEEND   LIBRARY_ID:1
12     15150  UOKNOR        SHOTGUN    LIBRARY_ID:1      some qualityless
13     10651  TIGR_JCVIJTC  CLONEEND   SEQ_LIB_ID:2
14       151  UOKNOR        FINISHING  LIBRARY_ID:1      some qualityless, no direction (TRACE_END=N); no INSERT_SIZE; no INSERT_STDEV
15        49  WUGSC         CLONEEND   SEQ_LIB_ID:1
    36820485  total

16    527017  BCCAGSC       EST
17    207204  MARC          EST
18    171667  MARC          PCR
19     81913  BARC          EST
20     18623  SC            EST
21      2485  UIACBCB       EST
     1008909  total
STRATEGY & TRACE_TYPE_CODE counts
   COUNT  CENTER_NAME   STRATEGY       TRACE_TYPE_CODE
12545304  BCM           .              WGS
11425910  BCM           WGA            WGS
 5223683  BCM           CLONE          SHOTGUN
 4479883  BCM           POOLCLONE      SHOTGUN
 1044963  BCM           .              SHOTGUN
  892385  BCM           SNP            WGS
  737900  NISC          CLONE          SHOTGUN
  125597  BCCAGSC       CLONEEND       CLONEEND
  114753  UIUC          CLONEEND       CLONEEND
   65171  TIGR          CLONEEND       CLONEEND
   53556  GSC           CLONEEND       CLONEEND
   26246  CENARGEN      .              WGS
   25454  BARC          .              CLONEEND
   16892  BCM           CLONEEND       CLONEEND
   16787  CENARGEN      CLONEEND       CLONEEND
   12195  UOKNOR        .              SHOTGUN
   10651  TIGR_JCVIJTC  CLONEEND       CLONEEND
    2955  UOKNOR        CLONE          SHOTGUN
     151  UOKNOR        .              FINISHING
      49  WUGSC         CLONEEND       CLONEEND

  527017  BCCAGSC       EST            EST
  145820  MARC          EST            EST
  117958  MARC          COMPARATIVE    PCR
   81913  BARC          EST            EST
   61384  MARC          CLONE          EST
   53709  MARC          Re-Sequencing  PCR
   18623  SC            EST            EST
    2485  UIACBCB       .              EST
BCM.SHOTGUN libraries
- The long inserts are probably wrong !!!
SIZE    STDEV  COUNT
3500    1500   4502569
2000    1000   3244493
3000    1000   1021577
180000  1000   840528
6500    1500   429026
180000  13000  320208
6000    2000   208192
167000  13000  96337
3500    15000  85599

SIZE    COUNT
3500    4588168
2000    3244493
180000  1160736
3000    1021577
6500    429026
6000    208192
167000  96337
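The second table is the first one summed over STDEV. A small sketch that reproduces it from the (SIZE, STDEV, COUNT) rows listed above; the variable and function names are illustrative only:

```python
from collections import defaultdict

# (SIZE, STDEV, COUNT) rows copied from the BCM.SHOTGUN library table
ROWS = [
    (3500, 1500, 4502569), (2000, 1000, 3244493), (3000, 1000, 1021577),
    (180000, 1000, 840528), (6500, 1500, 429026), (180000, 13000, 320208),
    (6000, 2000, 208192), (167000, 13000, 96337), (3500, 15000, 85599),
]


def count_by_size(rows):
    """Sum read counts per insert SIZE, largest count first."""
    totals = defaultdict(int)
    for size, _stdev, count in rows:
        totals[size] += count
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


for size, count in count_by_size(ROWS):
    print(size, count)
```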
3' VECTOR TRIMMED counts
    CENTER_NAME   TRACE_TYPE_CODE  TOTAL     3'CLV < LEN  QUAL==0      UMD.FRG
 1  BCM           WGS              24863599  10968979     551114       24050767
 2  BCM           SHOTGUN          10748529  5052692      23419        10068499
 3  NISC          SHOTGUN          737900    28972        0            735488
 4  BCCAGSC       CLONEEND         125597    125484       8926         113790
 5  UIUC          CLONEEND         114753    90243        0            106247
 6  TIGR          CLONEEND         65171     46389        0            64903
 7  GSC           CLONEEND         53556     53556        53556 (all)  0         !!! all have 0 quals and were excluded
 8  CENARGEN      WGS              26246     26246        0            25976
 9  BARC          CLONEEND         25454     25454        0            25387
10  BCM           CLONEEND         16892     6751         0            16863
11  CENARGEN      CLONEEND         16787     16787        0            16628
12  UOKNOR        SHOTGUN          15150     2885         12195        0
13  TIGR_JCVIJTC  CLONEEND         10651     339          0            10644
14  UOKNOR        FINISHING        151       0            151          151
15  WUGSC         CLONEEND         49        0            0            0
16  BCCAGSC       EST              527017    524173       772          0
17  MARC          EST              207204    207204       0            0
18  MARC          PCR              171667    171667       0            0
19  BARC          EST              81913     78597        0            0
20  SC            EST              18623     7350         0            0
21  UIACBCB       EST              2485      2485         0            0
ZERO QUALITY COUNTS
- Counts
CENTER_NAME  TRACE_TYPE_CODE  COUNT
BCM          WGS              551114
GSC          CLONEEND         53556
BCM          SHOTGUN          23419
UOKNOR       SHOTGUN          12195
BCCAGSC      CLONEEND         8926
BCCAGSC      EST              772
UOKNOR       FINISHING        151
TOTAL                         650133
- For 0 quality reads, assign quality 20 to bases 1..700, 0 to bases 701..
- Volumes 026..039 have been fixed
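The quality assignment rule for 0 quality reads can be sketched as below. This is only an illustration of the rule as stated (quality 20 for bases 1..700, 0 from base 701 on); the function name and defaults are hypothetical:

```python
def synthetic_quality(seq_len, good_len=700, good_qual=20):
    """Return per-base qualities: good_qual for the first good_len bases, 0 after."""
    n_good = min(seq_len, good_len)
    return [good_qual] * n_good + [0] * (seq_len - n_good)


quals = synthetic_quality(862)
print(quals[:3], quals[699], quals[700], len(quals))
```

Reads shorter than 700 bases simply get quality 20 throughout.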
Local Data
Files & Dirs
/fs/szasmg3/bos_taurus/data/
/fs/szasmg2/Drosophila/D_pseudoobscura/Vectors
/nfshomes/dpuiu/db/UniVec
Software
Figaro
- trims vector only at the 5' end
- calls Lucy for quality trimming
Lucy
- both vector sequence and splice sites are required
Atlas
- web site
- atlas-screen-trim-file : "calls cross_match and atlas-screen-window to create trimmed reads file (scan in from each end of read looking for 50-base windows of high quality and no vector); "
Contaminant search
Align the CLIPPING range of the reads to UniVec & EcoliK12 using nucmer
UniVec
Ref
             #seqs  min  max    mean  median  n50  sum
UniVec       2861   12   48551  231   99      781  660,151
UniVec_Core  1348   12   48551  243   98      967  327,641
Hits: alignment length
bp   #reads   min  max   mean   median  n50  sum
19   4548466  19   1045  28.37  23      27   129025025
20   3684852  20   1045  30.56  25      28   112616359
30   1097357  30   1045  48.04  38      43   52714583
40   484661   40   1045  66.36  47      53   32163896
100  54334    100  1045  198    116     223  10772815   # many are ESTs
Ecoli
Ref:
K12 4,639,675 bp
Hits: alignment length
bp   #reads  min  max   mean   median  n50  sum
19   275109  19   1223  30.66  19      20   8435470
20   102550  20   1223  50.29  21      161  5156849
30   19032   30   1223  178    37      706  3381214
40   9234    40   1223  329    171     738  3034293
100  6781    100  1223  424    223     749  2876432
200  4378    200  1223  575    696     771  2516916
BCM vectors
     #seqs  min   max    mean  median  n50    sum
BCM  14     2580  33180  9379  5821    32705  131312
Vector/Splice site search
Strategy
- 1. Select all the reads in the same volume that belong to one particular library; same CENTER_NAME, STRATEGY & TRACE_TYPE_CODE
- 2. Get the quality clipping trim: CLIP_QUALITY_LEFT & CLIP_QUALITY_RIGHT
- 3. Separate reads in 2 sets according to direction TRACE_END: FORWARD & REVERSE
- 4. Get the most frequent kmers in each set (24 & 8 bp)
- 5. Check if the most frequent kmers are overrepresented
- 6. Check if the most frequent 8mers are present in the most frequent 24mers
- 7. Try to extend the 24mers by a few bp => linkers
- 8. Align linkers to the opposite strand sequences using nucmer
- 9. Extract the subsequences adjacent to (following) the linker (50..150bp)
- 10. Align the subsequences; if they align we've probably identified the vector
- 11. Identify the vector name/id by alignment to UniVec => several alignments
- 12. Check if the forward/reverse vector(s) are the same : we should find a common vector sequence; the UniVec alignments should be adjacent
- 13. Create the Lucy vector & splice files; the splice file contains the linker+vector
- 14. Run Lucy & trim input reads according to the Lucy clr
- 15. Align Lucy trimmed reads to linker, vector, splice & UniVec.dust
- 16. Align input reads to linker, vector, splice & UniVec.dust
- 17. Compare the hit counts from 15. & 16.
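Steps 4 and 6 of the strategy can be sketched as follows. This is a hypothetical illustration, not the script behind the kmers.tab files: it counts k-mers on the read strand only, reports each k-mer with its reverse complement, and checks which frequent 8-mers occur inside the frequent 24-mers.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def top_kmers(reads, k, n=10):
    """Return the n most frequent k-mers as (kmer, count) pairs."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts.most_common(n)


def short_in_long(short_kmers, long_kmers):
    """Short k-mers that occur as a substring of at least one long k-mer."""
    return [s for s in short_kmers if any(s in long_kmer for long_kmer in long_kmers)]


# Toy reads that share one 8-mer
reads = ["AAAGTAGTTCAAA", "CCAGTAGTTCCC", "GGAGTAGTTCGG"]
for kmer, count in top_kmers(reads, 8, n=1):
    print(kmer, revcomp(kmer), count)
```

Whether a frequent k-mer is actually overrepresented (step 5) still needs a comparison against the expected count, which this sketch does not attempt.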
Example
- 1. volume 011 : 500,000 reads CENTER_NAME=BCM, TRACE_TYPE_CODE=WGS
- 2.
- 3. 249,611 TRACE_END=F & 250,389 TRACE_END=R
- 4. kmers: 8 of the 10 most frequent 8bp kmers are shared by the FORWARD & REVERSE sets; no 24bp kmers are shared
==> 24.fwd/kmers.tab <==
AGTTCGACTGCAAGTAGTTCATCA TGATGAACTACTTGCAGTCGAACT 2463 # contains AGTAGTTC
GAGTTCGACTGCAAGTAGTTCATC GATGAACTACTTGCAGTCGAACTC 2189
CGAGTTCGACTGCAAGTAGTTCAT ATGAACTACTTGCAGTCGAACTCG 1996
TCGAGTTCGACTGCAAGTAGTTCA TGAACTACTTGCAGTCGAACTCGA 1593
GTTCGACTGCAAGTAGTTCATCAA TTGATGAACTACTTGCAGTCGAAC 1023
GAGTTCGACTGCAGTAGTTCATCA TGATGAACTACTGCAGTCGAACTC 812
CGAGTTCGACTGCAGTAGTTCATC GATGAACTACTGCAGTCGAACTCG 777
GTTCGACTGCAAGTAGTTCATCAT ATGATGAACTACTTGCAGTCGAAC 769
TCGAGTTCGACTGCAGTAGTTCAT ATGAACTACTGCAGTCGAACTCGA 637
ATCGAGTTCGACTGCAAGTAGTTC GAACTACTTGCAGTCGAACTCGAT 594
==> 08.fwd/kmers.tab <==
AGTAGTTC GAACTACT 86477
CAGTAGTT AACTACTG 67681
AGTTCTCA TGAGAACT 61556
TAGTTCTC GAGAACTA 60964
GTAGTTCT AGAACTAC 57866
AGTTCATC GATGAACT 49676
TAGTTCAT ATGAACTA 45298
GTTCATCA TGATGAAC 42117
GCAGTAGT ACTACTGC 41391
GTAGTTCA TGAACTAC 40694
==> 24.rev/kmers.tab <==
TATCGATGGTACAGTAGTTCATCA TGATGAACTACTGTACCATCGATA 999 # contains AGTAGTTC
CTATCGATGGTACAGTAGTTCATC GATGAACTACTGTACCATCGATAG 774
GCTATCGATGGTACAGTAGTTCAT ATGAACTACTGTACCATCGATAGC 600
CGCTATCGATGGTACAGTAGTTCA TGAACTACTGTACCATCGATAGCG 432
ATCGATGGTACAGTAGTTCATCAT ATGATGAACTACTGTACCATCGAT 417
ATCGATGGTACAGTAGTTCATCAA TTGATGAACTACTGTACCATCGAT 380
ATCAGATGGTACAGTAGTTCATCA TGATGAACTACTGTACCATCTGAT 373
ATCGATGGTACAGTAGTTCATCAC GTGATGAACTACTGTACCATCGAT 265
CTATCGATGGTAAGTAGTTCATCA TGATGAACTACTTACCATCGATAG 235
TCAGATGGTACAGTAGTTCATCAA TTGATGAACTACTGTACCATCTGA 224
==> 08.rev/kmers.tab <==
AGTTCATC GATGAACT 85127
TAGTTCAT ATGAACTA 77902
GTTCATCA TGATGAAC 75585
TAGTTCTC GAGAACTA 68057
AGTTCTCA TGAGAACT 67277
GTAGTTCT AGAACTAC 64894
GTAGTTCA TGAACTAC 62607
CGTAGTTC GAACTACG 52031
AGTAGTTC GAACTACT 51013
ACGTAGTT AACTACGT 31552
- 7. Get linker sequences
>linker.fwd 27bp
TCGAGTTCGACTGCAAGTAGTTCATCA
>linker.rev 27bp
CTAATCAGATGGTACAGTAGTTCATCA
#>linker.rev 40 bp Art's (13 more bp at 5')
#TATGACCATGCGCCTAATCAGATGGTACAGTAGTTCATCA
# GCTATCGATGGTACAGTAGTTCATCAT is the most frequent 27bp kmer in the rev set, but it is not the linker (it differs by a few SNPs)
- 8 & 9 Align reads to linkers using nucmer
Fwd:
nucmer -l 12 -c 24 -r linker.fwd.seq ../bos_taurus.$v.r.fasta
# nucmer -l 12 -c 24 -r kmers.seq ../bos_taurus.$v.r.fasta
show-coords out.delta | awk '{print $19,$5,$13}' > ! out.clr
extractfromfastanames.pl -clr -f out.clr < ../bos_taurus.$v.r.fasta >! out.seq
Rev:
nucmer -l 12 -c 24 -r linker.rev.seq ../bos_taurus.$v.f.fasta
# nucmer -l 12 -c 24 -r kmers.seq ../bos_taurus.$v.f.fasta
show-coords out.delta | awk '{print $19,$5,$13}' > ! out.clr
extractfromfastanames.pl -clr -f out.clr < ../bos_taurus.$v.f.fasta >! out.seq
Both:
clrFasta out.seq >! out.cseq
fasta2tab.pl out.cseq | sort -k2 > ! out.tab
nucmer -c 40 out.cseq ~/db/UniVec -p vector
delta-filter -q vector.delta >! vector.filter-q.delta
show-coords vector.filter-q.delta | sort -n | head
cat vector.filter-q.delta | grep "^>" | count.pl -c 1 -m 2
- 10. Extract "vector reads"
>399553028 # 24.fwd
TGATGAACTACTGTACCATCTGATTAGGCGCATGGTCATAGCTGTTTCCTGTGTGAAATT
GCTATCCGCTCACAATTCCACACAACATACGAGCCGGAAGCATAAAGTGTAAAGCCTGGG
GTGTCAAATGAGAGACCTAACTCACATTCAACTTTTTTTTTTTTTCTGCCCTCTATTCTA
...
>400269118 #24.rev
TGATGAACTACTTGCAGTCGAAATCGAATCATCACTGGCCGTCCTTTTACAACGTCGTGA
CTGGGAAAACCCTGGCGTTACCCAACTTAATCCGCCTTGCAGCACATCCCCCTTTCCCCC
AGCTGGCGTAAAAACGTAAAAAGCCCCGCACCGATCGCCCTTTCCCAACAGGTTGCCCAG
- 11. Align "vector reads" to UniVec; identify vector
show-coords 24.fwd/400269118-UniVec.delta 24.rev/399553028-UniVec.delta | grep J01636.1
31 148 | 1175 1292 | 118 118 | 95.76 | 1276 7477 | 9.25 1.58 | 399553028.rev gnl|uv|J01636.1:1-7477
32 199 | 1302 1463 | 168 162 | 90.48 | 653 7477 | 25.73 2.17 | 400269118 gnl|uv|J01636.1:1-7477
- 12. 10bp distance between the 2 alignments
- 13. Lucy files
$ more vector.seq
>J01636 E.coli lactose operon with lacI, lacZ, lacY and lacA genes
GACACCATCGAATGGCGCAAAACCTTTCGCGGTATGGCATGATAGCGCCCGGAAGAGAGTCAATTCAGGG
TGGTGAATGTGAAACCAGTAACGTTATACGATGTCGCAGAGTATGCCGGTGTCTCTTATCAGACCGTTTC
CCGCGTGGTGAACCAGGCCAGCCACGTTTCTGCGAAAACGCGGGAAAAAGTGGAAGCGGCGATGGCGGAG
CTGAATTACATTCCCAACCGCGTGGCACAACAACTGGCGGGCAAACAGTCGTTGCTGATTGGCGTTGCCA
...
$ more splice.seq
>J01636.for.begin vector+linker.rev
TGAATGTGAGTTAGGTCTCTCATTTGACACCCCAGGCTTTACACTTTATGCTTCCGGCTC
GTATGTTGTGTGGAATTGTGAGCGGATAGCAATTTCACACAGGAAACAGCTATGACCATG
CGCCTAATCAGATGGTACAGTAGTTCATCA
>J01636.for.end rev(linker.fwd)+vector
TGATGAACTACTTGCAGTCGAAATCGAATCATCACTGGCCGTCCTTTTACAACGTCGTGA
CTGGGAAAACCCTGGCGTTACCCAACTTAATCCGCCTTGCAGCACATCCCCCTTTCCCCC
AGCTGGCGTAAAAACGTAAAAAGCCCCGCA
>J01636.rev.begin (revcomp of J01636.for.end)
TGCGGGGCTTTTTACGTTTTTACGCCAGCTGGGGGAAAGGGGGATGTGCTGCAAGGCGGA
TTAAGTTGGGTAACGCCAGGGTTTTCCCAGTCACGACGTTGTAAAAGGACGGCCAGTGAT
GATTCGATTTCGACTGCAAGTAGTTCATCA
>J01636.rev.end (revcomp of J01636.for.begin)
TGATGAACTACTGTACCATCTGATTAGGCGCATGGTCATAGCTGTTTCCTGTGTGAAATT
GCTATCCGCTCACAATTCCACACAACATACGAGCCGGAAGCATAAAGTGTAAAGCCTGGG
GTGTCAAATGAGAGACCTAACTCACATTCA
# splice=linker+vector
3 120 | 1175 1292 | 118 118 | 95.76 | 150 7477 | 78.67 1.58 | J01636.for.begin J01636
32 131 | 1302 1399 | 100 98 | 96.00 | 150 7477 | 66.67 1.31 | J01636.for.end J01636
- 13.1 Align vector & splice to Ecoli
1 7474 | 366812 359335 | 7474 7478 | 99.91 | 7477 4639675 | 99.96 0.16 | J01636 NC_000913.2 [CONTAINED]
20 119 | 65 162 | 100 98 | 96.00 | 150 395 | 66.67 24.81 | J01636.rev.begin NC_000913.2
31 148 | 172 289 | 118 118 | 95.76 | 150 395 | 78.67 29.87 | J01636.rev.end NC_000913.2
1069 1463 | 395 1 | 395 395 | 100.00 | 7477 395 | 5.28 100.00 | J01636 NC_000913.2.365350-365744
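The splice entries in step 13 come in reverse-complement pairs: J01636.for.begin is vector+linker.rev and J01636.rev.end is its reverse complement. A minimal sketch of that construction, using only the last 10 vector bases from J01636.for.begin as a shortened flank (the real entries are 150 bp; the helper names are hypothetical):

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def splice_pair(vector_flank, linker):
    """Return (begin, end): the vector flank followed by the linker, and its revcomp."""
    begin = vector_flank + linker
    return begin, revcomp(begin)


linker_rev = "CTAATCAGATGGTACAGTAGTTCATCA"  # linker.rev from step 7
flank = "GACCATGCGC"                         # last 10 vector bases before the linker
begin, end = splice_pair(flank, linker_rev)
print(begin)
print(end)
```

The second string printed matches the first 37 bases of J01636.rev.end as listed in splice.seq.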
- 14. Run lucy & trim reads
$ /nfshomes/dpuiu/szdevel/SourceForge/lucy-1.19p/lucy \
    -v vector.seq splice.seq -o bos_taurus.lucy.seq bos_taurus.lucy.qual \
    -debug bos_taurus.lucy.info \
    bos_taurus.seq bos_taurus.qual
# Trim clr
$ clrFasta bos_taurus.seq > bos_taurus.cseq
- 15. Align lucy output to linker, vector, splice & UniVec.dust
$ nucmer -l 12 -c 24 ~/db/linker.seq bos_taurus.lucy.cseq -p linker-bos_taurus.lucy
$ nucmer -l 16 -c 30 ~/db/vector.seq bos_taurus.lucy.cseq -p vector-bos_taurus.lucy
$ nucmer -l 16 -c 30 ~/db/splice.seq bos_taurus.lucy.cseq -p splice-bos_taurus.lucy
$ nucmer -l 16 -c 30 ~/db/UniVec.dust bos_taurus.lucy.cseq -p UniVec.dust-bos_taurus.lucy
- 16. Align input to linker, vector, splice & UniVec.dust
$ nucmer -l 12 -c 24 ~/db/linker.seq bos_taurus.seq -p linker-bos_taurus
$ nucmer -l 16 -c 30 ~/db/vector.seq bos_taurus.seq -p vector-bos_taurus
$ nucmer -l 16 -c 30 ~/db/splice.seq bos_taurus.seq -p splice-bos_taurus
$ nucmer -l 16 -c 30 ~/db/UniVec.dust bos_taurus.seq -p UniVec.dust-bos_taurus
Count how many reads got trimmed
infoseq *seq | getSummary.pl -c 1 -t original.LEN
cat bos_taurus.lucy.info | awk '{print $4-$3}' | getSummary.pl -t lucy.CLR >! bos_taurus.lucy.summary
cat bos_taurus.lucy.info | getSummary.pl -c 14 -t lucy.CLV5 -nh >> bos_taurus.lucy.summary
cat bos_taurus.lucy.info | getSummary.pl -c 15 -t lucy.CLV3 -nh >> bos_taurus.lucy.summary
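The summary tables in the following sections report #elem, #0s, min, max, mean, median, n50 and sum per column. A rough sketch of those statistics, assuming the usual N50 definition (the largest value v such that values >= v make up at least half of the total); how getSummary.pl itself computes n50, in particular for columns with negative values, is not documented here:

```python
def summarize(values):
    """Summary statistics for a non-empty list of numbers."""
    vals = sorted(values)
    n = len(vals)
    total = sum(vals)
    if n % 2:
        median = vals[n // 2]
    else:
        median = (vals[n // 2 - 1] + vals[n // 2]) / 2
    # Walk from the largest value down until half of the total is covered
    running = 0
    n50 = vals[-1]
    for v in reversed(vals):
        running += v
        if 2 * running >= total:
            n50 = v
            break
    return {
        "elem": n, "zeros": vals.count(0), "min": vals[0], "max": vals[-1],
        "mean": total / n, "median": median, "n50": n50, "sum": total,
    }


print(summarize([0, 2, 3, 5, 10]))
```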
Libraries
011.BCM.WGS FORWARD
- vector: J01636
- UniVec: gnl|uv|J01636.1:1-7477 E.coli lactose operon with lacI, lacZ, lacY and lacA genes
ll ~dpuiu/db/J01636*
-rw-rw-r-- 1 dpuiu dpuiu 7651 Jan 9 15:56 /nfshomes/dpuiu/db/J01636
-rw-rw-r-- 1 dpuiu dpuiu 105 Jan 14 07:17 /nfshomes/dpuiu/db/J01636linker
-rw-rw-r-- 1 dpuiu dpuiu 840 Jan 13 13:43 /nfshomes/dpuiu/db/J01636splice

cat ~dpuiu/db/J01636* | infoseq
J01636             7477  53.43
J01636.linker.fwd  27    44.44
J01636.linker.rev  27    37.04
J01636.for.begin   150   44.67
J01636.for.end     150   51.33
J01636.rev.begin   150   51.33
J01636.rev.end     150   44.67
- 249,611 reads:
- 91% got vector trimmed at the 5'
- 0.4% (1149) got vector trimmed at the 3'
              #elem   #0s     min  max   mean   median  n50   sum
original.LEN  249611  0       437  2349  1082   991     1009  270035781
lucy.CLV5     249611  21215   0    741   25.03  25      27    6247415
lucy.CLV3     249611  248462  0    1047  3.49   0       859   870344
- Original reads hit counts:
10975  linker.fwd
133    linker.rev
166    splice
152    vector
228    UniVec.dust
- Lucy trimmed reads hit counts:
2  linker.fwd
0  linker.rev
1  splice
1  vector
6  UniVec.dust (only 3 are >40bp)
011.BCM.WGS REVERSE
              #elem   #0s     min  max   mean   median  n50   sum
original.LEN  250389  0       502  2148  1085   993     1012  271691094
lucy.CLR      250389  7345    0    1281  795    876     892   198982171
lucy.CLV5     250389  20271   0    668   26.52  27      29    6641362
lucy.CLV3     250389  249269  0    997   3.35   0       861   839029
- Original reads hit counts:
linker.fwd   113
linker.rev   3812
splice       143
UniVec.dust  237
vector       4318
- Lucy trimmed reads hit counts:
linker.fwd   1
linker.rev   0
splice       1
UniVec.dust  10
vector       1
030.BCM.SHOTGUN
- same linker/vector/splice as BCM.WGS
- 2.5% (4K out of 160K) reads contain linker & vector at 3'
              #elem  #0s   min  max   mean   median  n50   sum
original.LEN  8411   0     325  1685  1181   1240    1314  9933150
lucy.CLR      8411   8     0    1054  841    863     874   7070994
lucy.CLV5     8411   568   0    232   27.01  28      29    227206
lucy.CLV3     8411   2325  0    1040  597    794     851   5023445
- Original reads hit counts:
linker.fwd   4314
linker.rev   4125
splice       7816
UniVec.dust  4212
vector       6750
vector       27235
- Lucy trimmed reads hit counts:
linker.fwd   3
linker.rev   1
splice       1
UniVec.dust  13
vector       0
001.NISC.SHOTGUN
- Vector: pOTW13
- UniVec: 3 partial seqs
gnl|uv|NGB00080.1:1-198 pOTW13 with linkers
gnl|uv|NGB00080.1:718-888 pOTW13 with linkers
gnl|uv|NGB00080.1:1490-1654-49 pOTW13 with linkers

ll /nfshomes/dpuiu/db/NGB00080*
-rw-rw-r-- 1 dpuiu dpuiu 1083 Jan 14 20:43 /nfshomes/dpuiu/db/NGB00080
-rw-r--r-- 1 dpuiu dpuiu 94 Jan 14 21:01 /nfshomes/dpuiu/db/NGB00080linker
-rw-r--r-- 1 dpuiu dpuiu 2183 Jan 14 20:44 /nfshomes/dpuiu/db/NGB00080splice

cat /nfshomes/dpuiu/db/NGB00080* | infoseq
NGB00080             1054  50.00
NGB00080.linker.fwd  24    45.83
NGB00080.linker.rev  26    53.85
NGB00080.for.beg     518   46.14
NGB00080.for.end     518   50.48
NGB00080.rev.begin   518   50.48
NGB00080.rev.beg     518   46.14
- 944 read sample
              #elem  #0s  min  max   mean   median  n50  sum
original.LEN  944    0    652  1017  735    721     722  693668
lucy.CLR      944    39   0    886   415    422     522  391333
lucy.CLV5     944    121  0    275   34.05  33      35   32143
lucy.CLV3     944    18   0    885   410    409     511  387007
- Original reads hit counts:
linker.fwd   479
linker.rev   492
splice       910
UniVec.dust  0
vector       939
- Lucy trimmed reads hit counts:
linker.fwd   1
linker.rev   0
splice       0
UniVec.dust  9
vector       1
060.BCCAGSC.CLONEEND
- Linkers:
linker.fwd CCCTGCTTTGTCTGGAAGGGGTTCCCGACCT
linker.rev CAGGAGGGGAGAAAGGGCTCAGAGG
- No common vector !!!
wc -l *clb
60746 bos_taurus.060.f.clb # 18 original reads align to UniVec (nucmer default params)
60836 bos_taurus.060.r.clb
Fwd:
329 428 | 440 535 | 100 96 | 91.00 | 503 1585 | 19.88 6.06 | 723951410 gnl|uv|U30497.1:3230-4814 Cloning vector pAS2-1
330 370 | 89 49 | 41 41 | 100.00 | 503 143 | 8.15 28.67 | 723951410 gnl|uv|U67875.1:6541-6683 pESP-I yeast expression vector
330 370 | 94 54 | 41 41 | 100.00 | 503 143 | 8.15 28.67 | 723951410 gnl|uv|U67875.1:6541-6683 pESP-I yeast expression vector
Rev:
1 96 | 71 165 | 96 95 | 93.81 | 203 165 | 47.29 57.58 | 724018013 gnl|uv|AF133437.1:16659-16823 Cloning vector pCYPAC6
50 143 | 1 94 | 94 94 | 92.71 | 203 94 | 46.31 100.00 | 724018013 gnl|uv|U80929.2:2858-2951 Cloning vector pBACe3.6
017.UIUC.CLONEEND
- No overrepresented kmers
wc -l *clb
17978 bos_taurus.017.f.clb
17911 bos_taurus.017.r.clb
==> 24.fwd/kmers.tab <==
CCCTGCTTTGTCTGGAAGGGGTTC GAACCCCTTCCAGACAAAGCAGGG 9
CTGCTTTGTCTGGAAGGGGTTCCC GGGAACCCCTTCCAGACAAAGCAG 9
==> 24.rev/kmers.tab <==
GAATGTTGAGCTTTAGCCAACTTT AAAGTTGGCTAAAGCTCAACATTC 4
TCTGAATGTTGAGCTTTAGCCAAC GTTGGCTAAAGCTCAACATTCAGA 4
==> 8.fwd/kmers.tab <==
TTTTTTTT AAAAAAAA 55
AAGGGGTT AACCCCTT 35
==> 8.rev/kmers.tab <==
GTCTGGAA TTCCAGAC 41
TCTGGAAG CTTCCAGA 39
- No UniVec hits
010.TIGR.CLONEEND
- No overrepresented kmers
wc -l *clb
5479 bos_taurus.032.f.clb
5174 bos_taurus.032.r.clb
==> 24.fwd/kmers.tab <==
CTTGTGTTGGCCCAGGCAAGTCCA TGGACTTGCCTGGGCCAACACAAG 30
TTGTGTTGGCCCAGGCAAGTCCAA TTGGACTTGCCTGGGCCAACACAA 30
==> 24.rev/kmers.tab <==
CTGCCTCTTGTGTTGGCCCAGGCA TGCCTGGGCCAACACAAGAGGCAG 16
GCTGCCTCTTGTGTTGGCCCAGGC GCCTGGGCCAACACAAGAGGCAGC 15
==> 8.fwd/kmers.tab <==
GAGTGGGT ACCCACTC 176
GGAGTGGG CCCACTCC 171
==> 8.rev/kmers.tab <==
TGGAGTGG CCACTCCA 182
GGAGTGGG CCCACTCC 181
- No UniVec hits
...
070.BCM.CLONEEND
- No frequent kmers
wc -l *clb
6027 bos_taurus.070.f.clb
6236 bos_taurus.070.r.clb
==> 24.fwd/kmers.tab <==
GGACTCTCAGAGTCTTCTCCAACA TGTTGGAGAAGACTCTGAGAGTCC 18
ACTGGTTGGATCTCCTTGCAGTCC GGACTGCAAGGAGATCCAACCAGT 18
==> 24.rev/kmers.tab <==
ATAAAATCTGAGCCACCAGGGAAG CTTCCCTGGTGGCTCAGATTTTAT 1
CTATTGGTTCATATGGTCAACGTC GACGTTGACCATATGAACCAATAG 1
==> 8.fwd/kmers.tab <==
TTTTTTTT AAAAAAAA 86
CTTCTCCA TGGAGAAG 75
==> 8.rev/kmers.tab <==
TATAGTGT ACACTATA 9
ATATAGGG CCCTATAT 8
- No alignments to BCM WGS vector
Running Lucy
- Default parameters with vector trimming
- BCM vector/splice
/nfshomes/dpuiu/db/vector.BCM.seq /nfshomes/dpuiu/db/splice.BCM.seq
- NISC vector/splice
/nfshomes/dpuiu/db/vector.NISC.seq /nfshomes/dpuiu/db/splice.NISC.seq
BCM.WGS (all reads)
- orig.CLR < lucy.CLR ( 765 < 792 )
- orig.CLV > lucy.CLV ( 1015 > 973 )
- 739,529 out of 24,863,599 reads (3%) deleted by Lucy (CLR=-1,-1)
- 21,728,592 out of 24,863,599 reads (87%) vector trimmed at the 5' end
- 92,646 out of 24,863,599 reads (0.3%) vector trimmed at the 3' end
                     elem      <0        0         >0        min    max   mean  median  n50    sum
orig.LEN             24863599  0         0         24863599  5      3097  1002  997     1015   24915462033
orig.CLR             24863599  463669    7         24399923  -1143  1833  765   836     864    19036744256
orig.CLR5            24863599  0         359245    24504354  0      2103  42    22      58     1047922451
orig.CLR3            24863599  463404    0         24400195  -1     2169  807   872     895    20084666707
lucy.CLR             24863599  0         739529    24124070  0      1219  792   878     904    19695000417
lucy.CLR5            24863599  739529    36108     24087962  -1     1753  43    29      42     1086413880
lucy.CLR3            24863599  739529    0         24124070  -1     1894  835   915     939    20781414297
orig.CLR5-lucy.CLR5  24863599  16299521  215345    8348733   -1186  2104  -1    -10     -1186  -38491429
orig.CLR3-lucy.CLR3  24863599  14858542  1494794   8510263   -1273  2170  -28   -20     -1273  -696747590
orig.CLV             24863599  1053      1920      24860626  -2     5345  1015  1002    1017   25260581538
orig.CLV5            8841849   0         0         8841849   1      1219  33    46      49     295011460
orig.CLV3            24861698  1053      0         24860645  -1     5346  1027  1005    1019   25555592998
lucy.CLV             24863599  10694     707       24852198  -469   3096  973   968     987    24195085877
lucy.CLV5            24863599  0         3135007   21728592  0      1359  25    27      29     623457486
lucy.CLV3            24863599  0         0         24863599  4      3096  998   995     1014   24818543363
lucy.CLVABS5         24863599  0         3135007   21728592  0      1359  25    27      29     623457486
lucy.CLVABS3         24863599  0         24770953  92646     0      1343  2     0       880    72055071
orig.CLV5-lucy.CLV5  24863599  17216820  1512453   6134326   -1312  1219  -13   -25     -1312  -328446026
orig.CLV3-lucy.CLV3  24863599  1519132   18579609  4764858   -1832  4672  29    0       479    737049635
BCM.WGS (0 quality reads)
- orig.CLR > lucy.CLR (mean)
- orig.CLV > lucy.CLV (mean)
- 7,153 out of 551,114 reads (1.3%) deleted by Lucy (CLR=-1,-1)
- 508,166 out of 551,114 reads (92%) vector trimmed at the 5' end
- 1,946 out of 551,114 reads (0.35%) vector trimmed at the 3' end
                     elem    <0      0       >0      min   max   mean  median  n50   sum
orig.LEN             551114  0       0       551114  5     1464  872   946     959   480705828
orig.CLR             551114  7754    0       543360  -770  1175  708   786     807   390325117
orig.CLR5            551114  0       6773    544341  0     1519  44    20      111   24582849
orig.CLR3            551114  7744    0       543370  -1    1638  752   818     833   414907966
lucy.CLR             551114  0       7153    543961  0     699   636   671     671   350759771
lucy.CLR5            551114  7153    35872   508089  -1    201   26    27      28    14442310
lucy.CLR3            551114  7153    0       543961  -1    699   662   699     699   365202081
orig.CLR5-lucy.CLR5  551114  364282  8801    178031  -198  1500  18    -8      215   10140539
orig.CLR3-lucy.CLR3  551114  85058   2962    463094  -700  1472  90    123     178   49705885
orig.CLV             551114  971     0       550143  -2    2037  974   978     981   537127121
orig.CLV5            5100    0       0       5100    1     845   35    29      31    180490
orig.CLV3            551114  971     0       550143  -1    2037  974   978     981   537307611
lucy.CLV             551114  58      6       551050  -84   1456  841   917     930   463903233
lucy.CLV5            551114  0       42948   508166  0     202   27    28      29    14964546
lucy.CLV3            551114  0       0       551114  4     1463  868   945     958   478867779
lucy.CLVABS5         551114  0       42948   508166  0     202   27    28      29    14964546
lucy.CLVABS3         551114  0       549168  1946    0     700   2     0       686   1286935
orig.CLV5-lucy.CLV5  551114  506108  42215   2791    -202  845   -26   -28     -202  -14784056
orig.CLV3-lucy.CLV3  551114  134959  23422   392733  -967  1614  106   7       459   58439832
BCM.SHOTGUN
- orig.CLR < lucy.CLR (mean)
- orig.CLV > lucy.CLV (mean)
- 98,070 out of 10,748,529 reads (0.9%) deleted by Lucy (CLR=-1,-1)
- 9,737,008 out of 10,748,529 reads (90%) vector trimmed at the 5' end
- 294,942 out of 10,748,529 reads (2.7%) vector trimmed at the 3' end
                     elem      <0       0         >0        min    max   mean  median  n50    sum
orig.LEN             10748529  0        0         10748529  5      2043  975   950     964    10486690472
orig.CLR             10748529  17308    2         10731219  -1293  1467  809   833     847    8701344571
orig.CLR5            10748529  0        68        10748461  0      1315  26    16      38     288662580
orig.CLR3            10748529  16780    0         10731749  -1     1647  836   851     863    8990007151
lucy.CLR             10748529  0        98070     10650459  0      1337  833   854     868    8955866769
lucy.CLR5            10748529  98070    1973      10648486  -1     1307  35    28      32     376276188
lucy.CLR3            10748529  98070    0         10650459  -1     1553  868   882     896    9332142957
orig.CLR5-lucy.CLR5  10748529  9498290  65171     1185068   -1099  1293  -8    -11     -1099  -87613608
orig.CLR3-lucy.CLR3  10748529  6879532  671097    3197900   -1149  1437  -31   -26     -1149  -342135806
orig.CLV             10748529  16779    412       10731338  -2     3919  974   948     964    10472347908
orig.CLV5            8594910   0        0         8594910   1      1239  3     1       49     28350257
orig.CLV3            10748349  16779    0         10731570  -1     3919  976   950     965    10500698165
lucy.CLV             10748529  7026     614       10740889  -268   2042  930   924     940    9997862132
lucy.CLV5            10748529  0        1011521   9737008   0      855   24    24      27     257993796
lucy.CLV3            10748529  0        0         10748529  4      2042  954   945     962    10255855928
lucy.CLVABS5         10748529  0        1011521   9737008   0      855   24    24      27     257993796
lucy.CLVABS3         10748529  0        10453587  294942    0      1214  20    0       847    220086015
orig.CLV5-lucy.CLV5  10748529  9538738  138680    1071111   -854   1239  -21   -23     -854   -229643539
orig.CLV3-lucy.CLV3  10748529  357934   9324166   1066429   -1328  2846  22    0       704    244842237
NISC.SHOTGUN
- orig.CLR < lucy.CLR (mean)
- orig.CLV > lucy.CLV (mean)
- 8,248 out of 737,900 reads (1.1%) deleted by Lucy (CLR=-1,-1)
- 633,409 out of 737,900 reads (85%) vector trimmed at the 5' end
- 7,201 out of 737,900 reads (0.97%) vector trimmed at the 3' end
                     elem    <0      0       >0      min    max   mean  median  n50    sum
orig.LEN             737900  0       0       737900  104    2104  784   729     734    579172842
orig.CLR             737900  5988    2       731910  -636   1033  651   668     676    480400909
orig.CLR5            737900  0       0       737900  1      1407  47    40      51     34857531
orig.CLR3            737900  0       5879    732021  0      1470  698   710     715    515258440
lucy.CLR             737900  0       8248    729652  0      1035  658   670     676    485757685
lucy.CLR5            737900  8248    56      729596  -1     1091  45    35      46     33811606
lucy.CLR3            737900  8248    0       729652  -1     1391  704   710     714    519569291
orig.CLR5-lucy.CLR5  737900  253727  89345   394828  -566   1408  1     1       485    1045925
orig.CLR3-lucy.CLR3  737900  177007  31      560862  -867   1471  -5    1       -867   -4310851
orig.CLV             737900  3224    2655    732021  -636   2103  771   725     730    569178445
orig.CLV5            734026  0       0       734026  1      987   5     1       35     4375315
orig.CLV3            732021  0       0       732021  35     2104  783   729     734    573553760
lucy.CLV             737900  1335    55      736510  -200   2104  747   696     702    551392388
lucy.CLV5            737900  104491  0       633409  -1     1199  30    31      34     22784742
lucy.CLV3            737900  0       0       737900  15     2103  778   728     733    574177130
lucy.CLVABS5         737900  0       104491  633409  0      1200  31    32      35     23522642
lucy.CLVABS3         737900  0       730699  7201    0      1076  5     0       686    4257812
orig.CLV5-lucy.CLV5  737900  561851  66390   109659  -1198  983   -24   -29     -1198  -18409427
orig.CLV3-lucy.CLV3  737900  8386    1       729513  -950   1077  0     1       -950   -623370
Fragment files
- Locations:
/fs/szasmg3/bos_taurus/data/frg
/fs/szasmg3/bos_taurus/data/frg.new
- All DST messages are unique
- bos_taurus.clv : contains the vector clipping points
- BCM.WGS, BCM.SHOTGUN & NISC.SHOTGUN: lucy.clv
- others: the TA clv
- 374,454 reads don't have valid clv's
- 36,446,031 reads have valid clv's with avg=955
Message counts (original)
                                      DST    FRG       LKG
bos_taurus.BCM.WGS.frg                79     24124070  11311841
#bos_taurus.BCM.SHOTGUN.frg           7339   10650459  1799069   # some libs & mates are missing due to a tarchive2ca crash (used by UMD2.1)
#bos_taurus.BCM.SHOTGUN.new.frg       18208  10650459  4715172   # split the libraries by VOL & SEQ_LIB_ID (used by UMD2.2)
#bos_taurus.BCM.SHOTGUN.new.frg       13826  10650459  5046435   # double check the FRG count !!! (used by UMD2.3)
bos_taurus.BCM.SHOTGUN.new.frg        7      10650459  5046435   # UMD2.4
bos_taurus.NISC.SHOTGUN.frg           246    729652    344932
bos_taurus.BCCAGSC.CLONEEND.frg       1      125241    59505
bos_taurus.UIUC.CLONEEND.frg          2      114750    46319
bos_taurus.TIGR.CLONEEND.frg          1      65171     27067
bos_taurus.GSC.CLONEEND.frg           1      53521     25889
bos_taurus.CENARGEN.WGS.frg           0      26246     0
#bos_taurus.BARC.CLONEEND.frg         11150  25454     11150     # (used by UMD2.3)
bos_taurus.BARC.CLONEEND.frg          1      25454     11150     # (used by UMD2.4)
bos_taurus.BCM.CLONEEND.frg           1      16875     7103
bos_taurus.CENARGEN.CLONEEND.frg      1      16787     6269
bos_taurus.UOKNOR.SHOTGUN.frg         1      14651     4910
bos_taurus.TIGR_JCVIJTC.CLONEEND.frg  2      10651     4803
bos_taurus.UOKNOR.FINISHING.frg       0      151       0
bos_taurus.WUGSC.COLONEEND.frg        1      49        21
#total                                25312  35973728  16896244  # (UMD2.3)
total                                 344    35973728  16896244  # (UMD2.4)
Message counts (quality)
                                        DST    FRG       LKG
bos_taurus.BCM.WGS.qual.count           79     23580109  11035582
#bos_taurus.BCM.SHOTGUN.qual.count      7339   10644092  1799069
bos_taurus.BCM.SHOTGUN.qual.new.count   18208  10644092  4712446
bos_taurus.NISC.SHOTGUN.count           246    729652    344932
bos_taurus.BCCAGSC.CLONEEND.qual.count  1      116484    53585
bos_taurus.UIUC.CLONEEND.count          2      114750    46319
bos_taurus.TIGR.CLONEEND.count          1      65171     27067
bos_taurus.CENARGEN.WGS.count           0      26246     0
bos_taurus.BARC.CLONEEND.count          11150  25454     11150
bos_taurus.BCM.CLONEEND.count           1      16875     7103
bos_taurus.CENARGEN.CLONEEND.count      1      16787     6269
bos_taurus.TIGR_JCVIJTC.CLONEEND.count  2      10651     4803
bos_taurus.UOKNOR.SHOTGUN.qual.count    1      2456      813
bos_taurus.WUGSC.COLONEEND.count        1      49        21
Message counts (0quality)
                                         DST   FRG     LKG
bos_taurus.BCM.WGS.0qual.count           79    543961  234397
bos_taurus.GSC.CLONEEND.0qual.count      1     53521   25889
bos_taurus.UOKNOR.SHOTGUN.0qual.count    1     12195   4097
bos_taurus.BCCAGSC.CLONEEND.0qual.count  1     8757    2114
bos_taurus.BCM.SHOTGUN.0qual.count       7339  6367    0
bos_taurus.UOKNOR.FINISHING.0qual.count  0     151     0
Assemblies
Bt.qc.combine UMD2.0 ... UMD2.5 combine stats
Assembly UMD2.1(2009_0122_CA; Quality reads)
Issues
- Uses only quality reads
- BCM.SHOTGUN library : ~ 4715172-1799069=2.9M mates were missed due to a tarchive2ca crash ; some libraries got merged (were assigned the same lib_id)
- All reads except for BCM.WGS were set as nonrandom
- Updated the runCA script to run the overlapper concurrently; new "ovlConcurrency" parameter added to the .spec file !!!
- consensus after cgw crashed in MultiAlignContig() ... use "consensus -D forceunitigabut" !!!
- cgw crashed after updating gkpStore with new lib/mate info => edit Input_CGW.c, remove the assert in line 117
Info
host: walnut
assembly version: wgs-5.2 stable
dir: /scratch1/bos_taurus/Assembly/2009_0122_CA
command: /fs/szdevel/dpuiu/SourceForge/wgs/Linux-amd64/bin/runCA-test -d . -p bt -s bt01.specFile *.frg
spec file:
cgwDistanceSampleSize = 1000 # ??? too big; more than 50% of the BCM.SHOTGUN reads are in libraries with less than 1000 inserts
cnsConcurrency = 15
cnsMinFrags = 200000
doOverlapTrimming = 1
frgCorrBatchSize = 100000
frgCorrConcurrency = 15
merylMemory = 24000
merylThreads = 15
obtMerThreshold = 200
obtOverlapper = ovl
ovlConcurrency = 8
ovlCorrBatchSize = 100000
ovlCorrConcurrency = 15
ovlHashBlockSize = 1200000
ovlMemory = 8GB --hashload 0.8 --hashstrings 400000
ovlMerThreshold = 500
ovlOverlapper = ovl
ovlRefBlockSize = 7200000
ovlThreads = 2
unitigger = utg
utgErrorRate = 0.015
vectorIntersect = bos_taurus.clv
doExtendClearRanges = 2
Steps
1. Run up till after initialStoreBuilding
runCA stopAfter=initialStoreBuilding ...
2. Update gkpStore with nonrandom frg flag
cat bos_taurus.nonrandom.clv | perl -ane 'print "frg uid $F[0] isnonrandom 1\n";' > bos_taurus.nonrandom.edit
gatekeeper -edit bos_taurus.nonrandom.edit bt.gkpStore
Input
gatekeeper -dumpinfo -lastfragiid bt.gkpStore
...
Last frag in store is iid = 35348776
OBT
           elem      <0  0        >0        min  max   mean  median  n50   sum
CLV5       35085508  0   3387027  31698481  0    970   25    27      29    891007232
CLV3       35164784  0   0        35164784  15   2974  984   980     1000  34612019144
CLR_ORIG5  35348776  0   43354    35305422  0    1753  42    29      38    1502168205
CLR_ORIG3  35348776  0   0        35348776  70   1894  864   905     927   30547294868
CLR_OBT5   35348776  0   26513    35322263  0    1690  49    30      73    1756346429
CLR_OBT3   35348776  0   23477    35325299  0    1813  843   895     914   29824543869
- 421,379 reads deleted by OBT: why so many???
- Chimera:
20297 reads too short => deleted
- more 0-overlaptrim/bt.mergeLog.stats
...
211037: short or inconsistent
253536: deleted fragment due to zero clear
- Example:
gatekeeper -dumpfragments 516316990 bt.gkpStore
fragmentIdent = 516316990,14
fragmentMate = 0,0
fragmentLibrary = 27473,1563
fragmentIsDeleted = 1
fragmentIsNonRandom = 1
fragmentStatus = G
fragmentOrientation = I
fragmentHasVectorClear = 0
fragmentHasQualityClear = 0
fragmentPlate = 0
fragmentPlateLocation = 0
fragmentSeqLen = 862
fragmentHPSLen = 0
fragmentSrcLen = 17
fragmentClearORIG = 38,553
fragmentClearQLT = 1,0
fragmentClearVEC = 1,0
fragmentClearOBTINI = 35,578
fragmentClearOBT = 35,578
fragmentClearUTG = 35,578
fragmentClearECR1 = 35,578
fragmentClearECR2 = 35,578
fragmentSeqOffset = 5376
fragmentQltOffset = 11038
fragmentHpsOffset = 53
fragmentSrcOffset = 287

cat 0-overlaptrim/bt.mergeLog | grep 516316990
516316990,14 412 412 0 0 (deleted, too short)

zcat *r000*gz | convertOverlap -a -obt
...
14 12128740 f 377 478 292 393 2.97
14 15226267 f 397 446 31 80 2.04
14 19071241 f 4 513 199 708 1.18
14 20073917 f 7 478 36 508 4.88
14 20042424 f 4 419 299 714 1.93
14 20212935 f 7 478 234 706 4.88
14 20073828 r 7 478 507 35 4.67
14 20212846 r 7 478 557 85 4.67
14 27089060 r 491 534 836 793 2.33
14 29061748 f 489 540 86 137 1.96
14 32105697 f 455 543 381 469 2.27
14 32187461 f 430 534 105 209 1.92
14 32027289 f 4 419 493 907 4.59
...

#read aligns to contigs
show-coords 516316990-ctg.filter-r.strict.delta
35 531 | 97 594 | 497 498 | 99.20 | 862 2759 | 57.66 18.05 | 516316990 ctg7180001872751
45 678 | 931 1564 | 634 634 | 97.00 | 862 1567 | 73.55 40.46 | 516316990 ctg7180001837311
- OBT deleted reads:
BCM           WGS       253816
BCM           SHOTGUN   151770
BCCAGSC       CLONEEND  7510
NISC          SHOTGUN   4757
TIGR          CLONEEND  1577
CENARGEN      WGS       599
CENARGEN      CLONEEND  431
TIGR_JCVIJTC  CLONEEND  377
UIUC          CLONEEND  182
BCM           CLONEEND  150
BARC          CLONEEND  125
UOKNOR        SHOTGUN   85
total         .         421379
OBT deleted reads:
         elem    >0      min  max   mean  med  n50  sum
len      421379  421379  98   2974  862   927  968  363280405
avgQual  421379  421379  1    57    28    24   36   11852865
Overlapper
- 98.33% of the reads (34,761,786 out of 35,348,776 reads) had overlaps
- 1.66% of the reads had no overlaps
- 6.68% of the BCCAGSC.CLONEEND reads had no overlaps
- 4.95% of the TIGR_JCVIJTC.CLONEEND reads had no overlaps
- 3.48% of the TIGR.CLONEEND reads had no overlaps
- the median number of overlaps is 20
Overlaps  reads     min  max   mean  median  n50  sum
qual      35348776  0    5592  106   20      769  3777789082
- the median number of overlaps for the BCM.WGS reads is 16
- the median number of overlaps for the BCM.SHOTGUN reads is 16 !!!
- the median number of overlaps for the NISC.SHOTGUN reads is 40 !!!
- the median number of overlaps for the BCM.CLONEEND reads is 16 !!!
Media:Bt.ovlStore.big.png , Media:Bt.ovlStore.small.png
Unitigger
more 4-unitigger/bt.cga.0
UNITIG OVERLAP GRAPH INFORMATION
5208738 : Total number of unitigs
2527051 : Total number of singleton, contained unitigs
1814842 : Total number of singleton, non-contained unitigs
180910 : Total number of non-singleton, spanned unitigs
685935 : Total number of non-singleton, non-spanned unitigs
34927397 : Total number of fragments
34927397 : Total number of fragments in all unitigs
21521581 : Total number of essential fragments in all unitigs
13405816 : Total number of contained fragments in all unitigs
0.0076239952 : Randomly sampled fragment arrival rate per bp
2510896132 : The sum of overhangs in all the unitigs
6400342737 : Total number of bases in all unitigs
0 : Estimated number of base pairs in the genome.
0 : Total number of contained fragments not connected by containment edges to essential fragments.
Total rho = 2510896132
Total nfrags = 19143061
Estimated genome length = 0
Estimated global_fragment_arrival_rate=0.007624
Computed global_fragment_arrival_rate =0.007624
Total number of randomly sampled fragments in genome = 23326293
Computed genome length = 3059589120.000000
Used global_fragment_arrival_rate=0.007624
Used global_fragment_arrival_distance=131.164826
Histogram of the number of base pairs in a chunk
100292 - 159434: 22
90010 - 99906: 25
80043 - 89676: 73
70013 - 79966: 162
60010 - 69988: 389
50008 - 59983: 977
40000 - 49998: 2434
30000 - 39997: 6458
20000 - 29999: 18957
10000 - 19999: 57442
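The numbers in the unitigger log above are mutually consistent with a simple relation: arrival rate = nfrags / rho, genome length = randomly sampled fragments / arrival rate, arrival distance = 1 / arrival rate. The sketch below only re-derives the logged values under that assumption; it is inferred from the log, not taken from the unitigger source, and the function names are made up.

```python
def arrival_rate(total_nfrags, total_rho):
    """Fragments per bp, assuming rate = nfrags / rho."""
    return total_nfrags / total_rho

def genome_length(n_random_frags, rate):
    """Genome length estimate, assuming length = random fragments / rate."""
    return n_random_frags / rate

# Values copied from 4-unitigger/bt.cga.0 above
rate = arrival_rate(19143061, 2510896132)
length = genome_length(23326293, rate)
distance = 1.0 / rate

print(f"rate={rate:.6f} length={length:.0f} distance={distance:.4f}")
```

The results agree with the logged 0.007624, 3059589120 and 131.164826 to within rounding, which suggests the log values were computed in single precision.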
Unitigs >=10kb   NewAsm      UMd2Asm
Number           86,939      57,204
Mean             19,464      15,140
Sum              1,692.1Mb   866.0Mb
max              159,434bp   78,570bp

Contigs >=10Kb:  NewAsm      UMd2Asm
n                42,343      45,958
mean             59,856      55,473
sum              2,534.5Mb   2,549.4Mb

Contigs >=100Kb: NewAsm      UMd2Asm
n                7,051       6,683
mean             163,170     162,357
sum              1,150.5Mb   1,085.0Mb
max              627,705     742,802

Scaffolds >=10Mb: NewAsm     UMd2Asm
n                30          3
mean             14.10Mb     11.36Mb
sum              422.95Mb    340.70Mb
max              26.54Mb     13.36Mb
CGW & ECR
- Checkpoints:
cat 7-0-CGW/bt.timing | grep ^Checkpoint
Checkpoint 3 written during MergeScaffoldsAggressive at iteration 49
Checkpoint 4 written during MergeScaffoldsAggressive at iteration 85
Checkpoint 5 written after 1st Scaffold Merge
Checkpoint 6 written after 2nd Aggressive Scaffold Merge
Checkpoint 7 written after Final Rocks

cat 7-2-CGW/bt.timing | grep ^Checkpoint
Checkpoint 19 written during MergeScaffoldsAggressive at iteration 12
Checkpoint 20 written during MergeScaffoldsAggressive at iteration 31
Checkpoint 21 written after 1st Scaffold Merge
Checkpoint 22 written after 2nd Aggressive Scaffold Merge
Checkpoint 23 written after Final Rocks

cat 7-4-CGW/bt.timing | grep ^Checkpoint
Checkpoint 34 written during MergeScaffoldsAggressive at iteration 12
Checkpoint 35 written during MergeScaffoldsAggressive at iteration 49
Checkpoint 36 written after 1st Scaffold Merge
Checkpoint 37 written during Stones CleanupScaffolds after scaffold 32436
Checkpoint 38 written during Stones CleanupScaffolds after scaffold 34939
Checkpoint 39 written after Stone Throwing and CleanupScaffolds
Checkpoint 40 written after 2nd Aggressive Scaffold Merge
Checkpoint 41 written after Final Rocks
Checkpoint 42 written after Partial Stones
Checkpoint 43 written after Final Contained Stones
Checkpoint 44 written after resolveSurrogates
- Get early CTG/SCF stats
cat 7-CGW/bt.cgw_scaffolds | countMessages.pl
ICL 451555  # ???
ICP 116455  # CTG
ISF 66141   # SCF
ISL 711     # SLK
- Clear read extension:
             elem      <0        0      >0        min    max   mean  median  n50  sum
ClearORIG    35348776  4         0      35348772  -1147  1572  821   870     893  29045126663
ClearQLT     35348776  35348776  0      0         -1     -1    -1    -1      -1   -35348776
ClearVEC     35348776  299034    20323  35029419  -1     2043  952   953     975  33658445088
ClearOBTINI  35348776  0         31254  35317522  0      1364  831   879     902  29394688367
ClearOBT     35348776  0         31254  35317522  0      1318  794   854     877  28068197440
ClearECR1    35348776  0         31254  35317522  0      1329  794   854     877  28072014464
ClearECR2    35348776  0         31254  35317522  0      1329  794   854     877  28072365712

sum(ClearECR1)-sum(ClearUTG) = 3,817,024
sum(ClearECR2)-sum(ClearECR1)= 351,248
- Scaffold length stats:
cat 7-0-CGW/stat/final0.Scaffolds.nodelength.cgm | grep -v ^Sca | getSummary.pl -t 0 # 0,2,4 ...
step  scaff  min   max       mean    med    n50      sum
0     7048   2249  19719008  385020  21967  3114907  2713622175
2     4960   2249  21907006  540915  21181  4490171  2682939682
4     4006   2391  26541374  668427  29193  4590744  2677722052
- Last cgw
cat 7-4-CGW/stat/final0.*Scaffolds.nodelength.cgm | grep -v ^Scaff | getSummary.pl -t scf
cat 7-4-CGW/stat/final0.PlacedContig.n | grep -v ^Scaff | getSummary.pl -t scf
     elem    min  max       mean   med   n50      sum
scf  66141   432  26541374  42648  1347  4349378  2820819506
ctg  120461  65   627705    22421  2018  84989    2700959854
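The n50 column in the summaries above is produced by getSummary.pl, whose exact definition is not shown here. As a point of reference, a minimal sketch of the usual N50 definition (the length at which the running sum of lengths, sorted from longest to shortest, first reaches half the total) is given below; it is an assumption that getSummary.pl uses this convention.

```python
def n50(lengths):
    """Return the N50 of a list of lengths, or 0 for an empty list."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0

# Toy scaffold lengths: total 54, half 27, reached at 10 + 9 + 8
example_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
print(example_n50)
```

Because N50 is weighted by length, it can differ greatly from the median, as in the scf rows above where most scaffolds are short but most bases sit in long scaffolds.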
QC stats
- Bos_taurus.qc this assembly stats
- Bos_taurus.qc.combine UMD2 vs this assembly stats
TotalScaffolds=66,141
MaxBasesInScaffolds=26,048,998
MeanBasesInScaffolds=40,861
TotalContigsInScaffolds=120,461
MaxContigLength=627,911
MeanContigLength=22,436
TotalDegenContigs=269,031
MaxDegenContig=33,824
SingletonReads=3,721,123
- Posmap info
cat bt.posmap.mates | awk '{print $3}' |count.pl -p 100
good           10338164
bothChaff      1160137
oneChaff       695982
oneSurrogate   233151
bothDegen      218198
diffScaffold   150423
badShort       138464
oneDegen       118232
badLong        23196
badSame        22451
badOuttie      8751
bothSurrogate  589
total          13107738

cat bt.posmap.frags | awk '{print $4,$5}' |count.pl -p 100
placed  good           20676328
placed  notMated       8007072
chaff   bothChaff      2320274
chaff   notMated       704849
placed  oneChaff       695982
chaff   oneChaff       695982
placed  oneSurrogate   466302
placed  bothDegen      436396
placed  diffScaffold   300846
placed  badShort       276928
placed  oneDegen       236464
placed  badLong        46392
placed  badSame        44902
placed  badOuttie      17502
placed  bothSurrogate  1178
total                  34927397
Log files
- Bt.runCA.log
- Bt.runCA.hourly.runtimes approximate running times (in hours)
Analysis
Insert libraries
1. BCM.WGS : ok
- FRG.mea: 1750-7000
- ASM.mea: 1594-6727
- Most libs have > 1000 reads & get reestimated
- All libs have ASM.std< ASM.mea/3
2. BCM.SHOTGUN
- only ~ 50% of the inserts are in libs with >1000 inserts and get reestimated by the assembly
- if the thold is dropped from 1000 to 100, we'd get ~ 95% of the inserts reestimated
thold  elem  <0  0  >0    min   max    mean  median  n50   sum
0      7339  0   0  7339  1     11237  245   135     1137  1799069
100    4361  0   0  4361  100   11237  395   157     1252  1725604
1000   440   0   0  440   1008  11237  2075  1791    2323  913086
3. NISC.SHOTGUN: ok
- Most libs have > 1000 reads & get reestimated
- All libs have ASM.std< ASM.mea/3
4. BCCAGSC.CLONEEND: ok
LIB.id  FRG.mea  FRG.std  FRG.count  CENTER.TYPE       ASM.mea  ASM.std
125606  150000   30000    59505      BCCAGSC.CLONEEND  161998   20133
5. UIUC.CLONEEND: ok
LIB.id  FRG.mea  FRG.std  FRG.count  CENTER.TYPE    ASM.mea  ASM.std
114892  150000   30000    31063      UIUC.CLONEEND  175594   41208
115020  150000   30000    15256      UIUC.CLONEEND  162488   26358
6. TIGR.CLONEEND: originally wrong; gets reestimated
LIB.id  FRG.mea  FRG.std  FRG.count  CENTER.TYPE    ASM.mea  ASM.std
65177   2000     600      27067      TIGR.CLONEEND  161761   34938
7. GSC.CLONEEND: not used (all 53556 are 0 qual)
8. CENARGEN.WGS: "not used" (all 26246 are unmated)
9. BARC.CLONEEND: each library contains 1 template id => inserts did not get reestimated (25454 reads/11151 inserts)
10. BCM cloneend: ok
LIB.id  FRG.mea  FRG.std  FRG.count  CENTER.TYPE   ASM.mea  ASM.std
19070   167000   25000    7103       BCM.CLONEEND  171244   18555
11. CENARGEN.CLONEEND: large stdev
LIB.id  FRG.mea  FRG.std  FRG.count  CENTER.TYPE        ASM.mea  ASM.std
17249   202000   20200    6269       CENARGEN.CLONEEND  158938   55165
12. UOKNOR.SHOTGUN: ok ?
LIB.id  FRG.mea  FRG.std  FRG.count  CENTER.TYPE     ASM.mea  ASM.std
15158   3000     1000     4910       UOKNOR.SHOTGUN  3000     1000
13. TIGR_JCVI.CLONEEND: originally wrong; gets reestimated
LIB.id  FRG.mea  FRG.std  FRG.count  CENTER.TYPE         ASM.mea  ASM.std
10691   2500     750      2763       TIGR_JCVI.CLONEEND  160363   29580
10738   2500     750      2040       TIGR_JCVI.CLONEEND  161915   29343
14. UOKNOR.FINISHING: only 151 reads
15. WUGSC.CLONEEND: only 49 reads
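The BCM.SHOTGUN percentages quoted in item 2 (~50% of inserts reestimated at a threshold of 1000, ~95% at 100) can be checked against the "sum" column of the threshold table. The sketch below assumes that column counts the inserts in libraries at or above each threshold; the helper name is made up.

```python
def reestimated_fraction(inserts_in_large_libs, total_inserts):
    """Fraction of inserts that sit in libraries large enough to be reestimated."""
    if total_inserts == 0:
        return 0.0
    return inserts_in_large_libs / total_inserts

TOTAL = 1799069            # all BCM.SHOTGUN inserts (thold 0 row)
frac_1000 = reestimated_fraction(913086, TOTAL)    # thold 1000 row
frac_100 = reestimated_fraction(1725604, TOTAL)    # thold 100 row

print(f"thold 1000: {frac_1000:.1%}, thold 100: {frac_100:.1%}")
```

Under that reading, lowering cgwDistanceSampleSize from 1000 to 100 would roughly double the share of BCM.SHOTGUN inserts whose library gets reestimated.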
Contigs Vs UMD2 contaminants & Ecoli
4865 contigs in list.exclude_contigs.fa
34404 exclude-ctg.qry_hits
3763 exclude-ctg.ref_hits
1204 exclude-ctg.CBE.qry_hits CONTAIN|IDENTITY|BEGIN|END
748 exclude-ctg.CBE.ref_hits CONTAIN|IDENTITY|BEGIN|END
559 Ecoli.365350-365744-ctg.qry_hits : max ctg aligned is 179K bp; 10 are > 10K bp
Contigs Vs UMD2 chromosomes
- Split 120,461 contigs into 100 files; degenerates not split
- Align them to the 31 chromosomes 1..30,U (ref) => 101*31 jobs
#Alignment stats
cat chr*ctg*delta | grep "^>" | awk '{print $2}' | count.pl -f ../9-terminator/bt.ctg.infoseq | getSummary.pl -i 1 -z 1
ctg     0    1      >1     min  max  mean  med  n50  sum
120461  652  37540  82269  0    176  11    3    33   1422808

#Unaligned ctg lengths
ctg  min  max   mean  med   n50   sum
652  65   5849  1146  1134  1194  747295
- 50% of the contigs aligned uniquely
cat chr*-ctg*.delta | ~/bin/mergeDelta.pl > chr-ctg.delta # degens?
delta-filter -q chr-ctg.delta >> chr-ctg.filter-q.delta

cat chr1-*.delta | ~/bin/delta2cvg.pl -M 0 | getSummary.pl -i 4
elem  0  >0    min  max    mean  med  n50   sum
6681  1  6680  0    12892  366   142  1095  2450106
- There are disagreements:
/fs/sz-user-supported/Linux-x86_64/bin/show-coords -l -r -H chr1-ctg.filter-q.delta | p 'print $F[-1],"\n";' | count.pl | head
ctg7180001761585 24
...
ctg7180001634116 7
...
show-coords -d chr1-ctg.filter-q.delta | grep ctg7180001761585 | p 'print " $_";'
142115744 142188863 | 383463 310345 | 73120 73119 | 99.98 | 157714772 383463 | 0.05 19.07 | 1 -1 chr1 ctg7180001761585
142188878 142286012 | 310361 213223 | 97135 97139 | 99.94 | 157714772 383463 | 0.06 25.33 | 1 -1 chr1 ctg7180001761585
142287100 142287675 | 212133 211556 | 576 578 | 98.27 | 157714772 383463 | 0.00 0.15 | 1 -1 chr1 ctg7180001761585
142288052 142288602 | 211182 210633 | 551 550 | 99.09 | 157714772 383463 | 0.00 0.14 | 1 -1 chr1 ctg7180001761585
142288652 142295709 | 210586 203531 | 7058 7056 | 99.87 | 157714772 383463 | 0.00 1.84 | 1 -1 chr1 ctg7180001761585
142295709 142342174 | 203512 157047 | 46466 46466 | 100.00 | 157714772 383463 | 0.03 12.12 | 1 -1 chr1 ctg7180001761585
142346440 142367791 | 156958 135606 | 21352 21353 | 99.99 | 157714772 383463 | 0.01 5.57 | 1 -1 chr1 ctg7180001761585
142367822 142370681 | 135597 132737 | 2860 2861 | 99.93 | 157714772 383463 | 0.00 0.75 | 1 -1 chr1 ctg7180001761585
142370660 142382289 | 132746 121117 | 11630 11630 | 99.88 | 157714772 383463 | 0.01 3.03 | 1 -1 chr1 ctg7180001761585
142382282 142411927 | 120984 91339 | 29646 29646 | 99.96 | 157714772 383463 | 0.02 7.73 | 1 -1 chr1 ctg7180001761585
142411941 142419553 | 91339 83728 | 7613 7612 | 99.66 | 157714772 383463 | 0.00 1.99 | 1 -1 chr1 ctg7180001761585
142419553 142434546 | 83721 68728 | 14994 14994 | 99.79 | 157714772 383463 | 0.01 3.91 | 1 -1 chr1 ctg7180001761585
142434506 142437288 | 68778 65996 | 2783 2783 | 99.86 | 157714772 383463 | 0.00 0.73 | 1 -1 chr1 ctg7180001761585
142437389 142439015 | 66757 65131 | 1627 1627 | 99.94 | 157714772 383463 | 0.00 0.42 | 1 -1 chr1 ctg7180001761585
142439271 142440703 | 65629 64197 | 1433 1433 | 100.00 | 157714772 383463 | 0.00 0.37 | 1 -1 chr1 ctg7180001761585
142441869 142442975 | 63548 62442 | 1107 1107 | 100.00 | 157714772 383463 | 0.00 0.29 | 1 -1 chr1 ctg7180001761585
142446690 142449325 | 30312 32945 | 2636 2634 | 99.58 | 157714772 383463 | 0.00 0.69 | 1 1 chr1 ctg7180001761585
142451384 142452476 | 63510 64603 | 1093 1094 | 99.91 | 157714772 383463 | 0.00 0.29 | 1 1 chr1 ctg7180001761585
142452577 142454379 | 61000 62806 | 1803 1807 | 99.78 | 157714772 383463 | 0.00 0.47 | 1 1 chr1 ctg7180001761585
142454487 142456821 | 59122 61456 | 2335 2335 | 100.00 | 157714772 383463 | 0.00 0.61 | 1 1 chr1 ctg7180001761585
142458383 142459582 | 57978 59177 | 1200 1200 | 100.00 | 157714772 383463 | 0.00 0.31 | 1 1 chr1 ctg7180001761585
142459738 142472295 | 32272 44828 | 12558 12557 | 99.92 | 157714772 383463 | 0.01 3.27 | 1 1 chr1 ctg7180001761585
142472300 142485640 | 44828 58163 | 13341 13336 | 99.89 | 157714772 383463 | 0.01 3.48 | 1 1 chr1 ctg7180001761585
142501686 142530021 | 28336 1 | 28336 28336 | 99.99 | 157714772 383463 | 0.02 7.39 | 1 -1 chr1 ctg7180001761585
show-coords -d chr1-ctg.filter-q.delta | grep ctg7180001634116
116312 162914 | 1 46603 | 46603 46603 | 99.99 | 157714772 122722 | 0.03 37.97 | 1 1 chr1 ctg7180001634116
164916 201988 | 58062 95135 | 37073 37074 | 99.99 | 157714772 122722 | 0.02 30.21 | 1 1 chr1 ctg7180001634116
203244 213377 | 48198 58331 | 10134 10134 | 100.00 | 157714772 122722 | 0.01 8.26 | 1 1 chr1 ctg7180001634116
261393 264506 | 45949 49062 | 3114 3114 | 100.00 | 157714772 122722 | 0.00 2.54 | 1 1 chr1 ctg7180001634116
264607 268579 | 94345 98317 | 3973 3973 | 100.00 | 157714772 122722 | 0.00 3.24 | 1 1 chr1 ctg7180001634116
268586 274734 | 98323 104471 | 6149 6149 | 100.00 | 157714772 122722 | 0.00 5.01 | 1 1 chr1 ctg7180001634116
274835 293945 | 103611 122722 | 19111 19112 | 99.99 | 157714772 122722 | 0.01 15.57 | 1 1 chr1 ctg7180001634116
~/bin/delta2breaks.pl -m 200 < chr1-ctg.filter-q.delta | awk '{print $8}' | count.pl
AGREEMENT       9827
INVERSION       283
TRANSLOCATION+  230
TRANSLOCATION-  154

~/bin/delta2breaks.pl -m 1000 < chr1-ctg.filter-q.delta | awk '{print $8}' | count.pl
AGREEMENT       7564
INVERSION       216
TRANSLOCATION+  192
TRANSLOCATION-  127

~/bin/delta2breaks.pl -m 10000 < chr1-ctg.filter-q.delta | awk '{print $8}' | count.pl
AGREEMENT       3394
INVERSION       62
TRANSLOCATION+  50
TRANSLOCATION-  29
Assembly UMD2.2 (Quality reads)
- Try to add the missing BCM.SHOTGUN reads to the assembly
- Assign new BCM.SHOTGUN library IDs based on volume & SEQ_LIB_ID : the same library might have a different insert size in different volumes => might lose some correct mates that span volumes
cat bos_taurus.summary | grep BCM | grep SHOTG | cut -f6,7,8,10 | sort | more
FAAEP 180000 13000 252
FAAEP 2000 1000 84
...
FAAHP 180000 13000 77
FAAHP 2000 1000 230
...
- => 20,538 libraries out of which 18,208 contain mated reads
- create DST messages & add them to gkpStore
gatekeeper -a -o bt.gkpStore -T -F bos_taurus.BCM.SHOTGUN.new.DST
- generate gatekeeper edit file that maps each TI to the new library id
head bos_taurus.BCM.SHOTGUN.new.ti2libinfo.edit
frg uid 499507131 libuid 601081
frg uid 499507132 libuid 601081
...
- generate gatekeeper edit file that deletes all mate information
head bos_taurus.BCM.SHOTGUN.new.mate.delete
frg uid 500086180 mateuid 0
frg uid 500084310 mateuid 0
...
- pair forward/reverse reads that have the same new library id and the same TEMPLATE_ID
head bos_taurus.BCM.SHOTGUN.new.mate.edit
frg uid 583866821 mateuid 583872364
frg uid 583866822 mateuid 583872408
...
- run gatekeeper --edit for each edit/delete file
gatekeeper --edit ... bt.gkpStore
- restart assembly at cgw (doExtendClearRanges=1)
- consensus after cgw failed on job 25 on CTG 5597062 : cannot create consensus from multialignment ...
Fix: delete failed message
cp bt.cgw_contigs.25 bt.cgw_contigs.25.FAILED
delete "{ICM acc:5597062 pla:P len:20889 ..." from bt.cgw_contigs.25
- terminator fail; message:
ICL: reference before definition error for contig ID 5597062
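The mate re-pairing step above (forward/reverse reads that share the new library id and TEMPLATE_ID) can be sketched as follows. This is an illustration only: the input tuple layout, the 'F'/'R' direction flag and the rule of skipping templates that do not have exactly one forward and one reverse read are assumptions, not the script actually used.

```python
from collections import defaultdict

def pair_mates(reads):
    """reads: iterable of (uid, lib_id, template_id, direction), direction 'F' or 'R'.

    Returns gatekeeper edit lines for unambiguous forward/reverse pairs.
    """
    groups = defaultdict(lambda: {"F": [], "R": []})
    for uid, lib_id, template_id, direction in reads:
        if direction in ("F", "R"):
            groups[(lib_id, template_id)][direction].append(uid)
    edits = []
    for key in sorted(groups):
        fwd, rev = groups[key]["F"], groups[key]["R"]
        # Assumption: only templates with exactly one read per direction are paired.
        if len(fwd) == 1 and len(rev) == 1:
            edits.append(f"frg uid {fwd[0]} mateuid {rev[0]}")
    return edits

# Made-up reads: template T1 pairs, T2 has no reverse read, T3 is ambiguous
sample_reads = [
    (101, 601081, "T1", "F"),
    (102, 601081, "T1", "R"),
    (103, 601081, "T2", "F"),
    (104, 601082, "T3", "F"),
    (105, 601082, "T3", "F"),
    (106, 601082, "T3", "R"),
]
mate_edits = pair_mates(sample_reads)
print("\n".join(mate_edits))
```

Keying on (library id, template id) is what makes mates from the same template in different volumes fall into different groups, which is the loss of correct mates noted above.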
Assembly UMD2.3 (2009_0210_CA; all reads)
- 35,973,728 reads : 35,348,776 quality & 624,952 quality-less
- 16,896,244 mates
- 25,312 libraries
Issues (not solved):
- 10420 quality-less reads contain at least 1 "NN" in their clr (50..min(len,600))
- 5973 quality-less reads contain at least 1 "NNN" in their clr (50..min(len,600))
Quality-less clrs
- 624,952 quality-less reads
- Quality-less read stats: : alignment CLR or 50..min(len,600) trimming
     elem    min    max   mean  median  n50  sum
len  624952  5      1495  887   947     961  554429198
5    624952  6      1584  51    51      51   32150411
3    624952  5      1495  695   699     699  434960697
53   624952  -1579  1444  644   648     648  402810286
- Align 624,952 to the 120,461 Assembly1 contigs (no degenerates) : 1 day on 13 cpus
- 572,140(91.5%) reads aligned and 52,812(8.5%) did not align to the contigs
1. Launch jobs in parallel: 12766 jobs (982 contig files x 13 read files) on 13 processors
nucmer -l 50 -c 200 -b 10 -g 5 -d 0.05 bt.ctg.001.fasta bos_taurus.0qual.01.seq -p ctg.001-seq.01
...
nucmer -l 50 -c 200 -b 10 -g 5 -d 0.05 bt.ctg.982.fasta bos_taurus.0qual.13.seq -p ctg.982-seq.13
- CPU usage: 100% /job
- Max mem usage: 0.1% /job
2. Get maximum extended clrs
cat *delta | ~/bin/delta2qryClr.pl -best | sort > bos_taurus.0qual.best.clr
Length stats
           elem    min  max   mean  median  n50  sum
all        624952  5    1495  887   947     961  554429198
aligned    572140  221  1416  912   953     964  522281354
unaligned  52812   5    1495  608   580     754  32147844

Best/Max/Max+extended alignment coord stats:
           elem    min  max   mean  median  n50  sum
53.best    572140  94   1208  766   841     877  438793102
53.max     572140  170  1208  794   863     888  454816817
53.extend  572140  170  1208  797   865     889  456014184

Unaligned read counts:
                  unaligned  total   quality  quality-less
BCM.WGS           42595
UOKNOR.SHOTGUN    5787       14651   2456     12195
GSC.CLONEEND      2294       53521   0        53521
BCCAGSC.CLONEEND  1869       125241  116484   8757
BCM.SHOTGUN       186
UOKNOR.FINISHING  81

- Realign the 52,812 unaligned quality-less reads to the contigs using less strict nucmer parameters: -l 30 -c 50 -b 50 -g 50 -d 0.12
- 9,269 reads aligned at an average 92% identity (min 81% identity) : not too good
3. Get reads without clrs: set their clr to maximum 50..600
cp bos_taurus.0qual.extended.clr bos_taurus.0qual.clr
difference.pl bos_taurus.0qual.infoseq bos_taurus.0qual.extended.clr | perl -ane '$three=600; $three=$F[1] if ($F[1]<600); print "$F[0] 50 $three\n";' >> bos_taurus.0qual.clr
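The default clear range assigned in step 3 to reads without an alignment-based clr is 50..min(len,600). A minimal Python sketch of the same rule as the perl one-liner is shown below; the function name is hypothetical.

```python
def default_clr(read_len, start=50, max_end=600):
    """Return (clr_begin, clr_end) = (50, min(read_len, 600)) as in the one-liner above."""
    return start, min(read_len, max_end)

long_read = default_clr(947)    # longer than 600 bp: end capped at 600
short_read = default_clr(420)   # shorter than 600 bp: end is the read length
tiny_read = default_clr(5)      # shorter than 50 bp: begin exceeds end

print(long_read, short_read, tiny_read)
```

Note that for reads shorter than 50 bp (the minimum quality-less read length reported above is 5) this rule yields a range whose begin exceeds its end, so such reads would need separate handling.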
Quality clrs
- Use Assembly1 OBT clrs
- Delete reads deleted in the OBT process
Gatekeeper
Load order:
- Add quality FRG : "gatekeeper -T -F ..."
- Add quality-less FRG "gatekeeper -F -a ..." # -T should be removed
- Delete quality FRG (deleted by UMD2.1 OBT)
- Add DST
- Add LKG
Edit
- Loads clrs
- Loads clvs
- Loads nonrandom info
Meryl
Use Assembly1 kmer counts
Overlapper
- Use 80/90 Assembly1 overlap results
- Rerun 10 overlap jobs
- 96.64% of the quality-less reads have overlaps (vs 98.33% of the quality reads)
                  reads   0ovl   1+ovl   min  max   mean  median  n50  sum
0qual(all)        624831  20941  603890  0    4350  96    19      740  60494730  # 96.64%
0qual(unaligned)  52691   15384  37307   0    3229  50    5       349  2655545   # 70.80%
Unitigger
- More unitigs, more bases in unitigs
- A few of the longest unitigs got broken, e.g. the longest is 138,294 (UMD2.3) vs 159,434 (UMD2.1)
UNITIG OVERLAP GRAPH INFORMATION
5333434 : Total number of unitigs
2595174 : Total number of singleton, contained unitigs
1865473 : Total number of singleton, non-contained unitigs
183693 : Total number of non-singleton, spanned unitigs
689094 : Total number of non-singleton, non-spanned unitigs
35551316 : Total number of fragments
35551316 : Total number of fragments in all unitigs
21830994 : Total number of essential fragments in all unitigs
13720322 : Total number of contained fragments in all unitigs
0.0077856472 : Randomly sampled fragment arrival rate per bp
2514833413 : The sum of overhangs in all the unitigs
6483064813 : Total number of bases in all unitigs
0 : Estimated number of base pairs in the genome.
0 : Total number of contained fragments not connected by containment edges to essential fragments.
Total rho = 2514833413
Total nfrags = 19579606
Estimated genome length = 0
Estimated global_fragment_arrival_rate=0.007786
Computed global_fragment_arrival_rate =0.007786
Total number of randomly sampled fragments in genome = 23870254
Computed genome length = 3065930496.000000
Used global_fragment_arrival_rate=0.007786
Used global_fragment_arrival_distance=128.441474
Histogram of the number of base pairs in a chunk
100406 - 138294: 21
90330 - 99887: 23
80042 - 89675: 79
70014 - 79943: 169
60002 - 69792: 374
50000 - 59982: 1008
40002 - 49995: 2440
30001 - 39994: 6509
20000 - 29999: 18989
10000 - 19999: 57404
Consensus after unitigger
Problems:
- job 120 executed partially (see bt_120.cgi_tmp); Solution: split into 3 parts, run separately, merge results
- failed on 19 unitigs (587..7447 bp)
rm 5-consensus/*failed
touch 5-consensus/consensus.success
Cgw
- Failure 1 : because job 120 was run partially => missing mates
- Failure 2 : because of /5-consensus/FAILED/bt_???.cgi.failed => missing mates => delete 356 mates
Error:
ProcessFrags()-- WARNING! fragiid=35973388,index=33600942 mateiid=35973363,index=0 -- MATE DOESN'T EXIST!
cgw: Input_CGW.c:117: ProcessFrags: Assertion `err == 0' failed.
Fix:
cat cgw.out | grep MATE | p '/mateiid=(\d+)/; print $1,"\n";' >! cgw.out.mateiid
gatekeeper -dumpfragments -tabular -iid cgw.out.mateiid bt.gkpStore/ | cut -f1,3 | ~/bin/mate2lkg.pl -a D >! cgw.out.delete.LKG
gatekeeper -a -o bt.gkpStore -T -F -L cgw.out.delete.LKG
- Failure 3: because of cgwOutputIntermediate=1
Try to restart from ckp : die with assertion failure
cgw -y -R 8 -N 12 -j 1 -k 5 -r 5 -s 2 -S 0 -z -m 100 -g ./bt.gkpStore -o ./7-0-CGW.8_12/bt ./5-consensus/bt_001.cgi
cgw -y -R 8 -j 1 -k 5 -r 5 -s 2 -S 0 -z -m 100 -g ./bt.gkpStore -o ./7-0-CGW.8_12/bt ./5-consensus/bt_001.cgi
Fix: Restart cgw from the beginning
- cgw does update bt.SeqStore - OpenSequenceDB()
ECR (eventually skipped)
- Failed after running for 1 day
/fs/szdevel/dpuiu/SourceForge/wgs-5.2/Linux-amd64/bin/extendClearRanges -g ./bt.gkpStore -n 15 -c bt -b 146216 -e 167100 -i 1 > 7-1-ECR/extendClearRanges-scaffold.146216.err
sh: line 1: 17016 Aborted
- Last ckp : bt.ckp.15
- Try to fix:
touch 7-1-ECR/cgw.success
runCA "doExtendClearRanges = 1"
- Runs too slow !!!
- Can specify a scaffold range to process: -b ? -e ? => ckp files; could we merge them?
- Failed after running for 1 day
Consensus after CGW
- Failed on job 56
tail 8-consensus/bt.cns_contigs.56.err
...
Could (really) not find overlap between 153923 (U) and 2508303 (R) estimated ahang: 0 (ejecting frag 2508303 from contig)
consensus: math_AS.h:51: ceil_log2: Assertion `x > 0' failed.

cat 7-CGW/bt.cgw_contigs.56 | countMessages.pl
ICM 440
IMP 281412
IUP 12715

cat 8-consensus/bt.cns_contigs.56_tmp | countMessages.pl
ICM 115
IMP 103322
IMV 8122
IUP 4849
- Fix: split ICM messages 1..115,116,116+ and run consensus on each set
QC
     elem    min  max       mean   med   n50      sum
scf  56891   407  33129045  50871  1378  4716077  2894145150
ctg  122851  64   651167    21957  3647  71561    2697514858
deg  268237  65   30246     1019   985   997      273575106
- Compared with UMD2.1 : better scaffolds, worse contigs & unitigs
Analysis
Issues:
- Identify bacterial & mito contigs: mito seq
- Align ctg & degen to UMD2 chromosomes
- the chromosomes should have no 0cvg regions
- possible inversions, translocations (UMD2 used markers)
- if align breaks/indels, which assembly is correct?
Assembly UMD2.4 (2009_0217_CA; All reads)
- 35,973,728 reads : 35,348,776 quality & 624,952 quality-less
- 16,896,244 mates
- 344 libraries
Fix quality-less read clrs (N's) (temporary solution)
- 10420 contain at least 1 "NN" in their clr (50.. min(len,600))
- 5973 contain at least 1 "NNN" in their clr (50.. min(len,600))
Fix:
frg2seq.pl < bos_taurus.0qual.frg > bos_taurus.0qual.seq
fasta2qual.pl bos_taurus.0qual.seq >! bos_taurus.0qual.qual
lucy \
  -o bos_taurus.0qual.lucy.seq bos_taurus.0qual.lucy.qual \
  -debug bos_taurus.0qual.lucy.info \
  bos_taurus.0qual.seq bos_taurus.0qual.qual
cat bos_taurus.0qual.lucy.info | cut -f1,3,4 -d ' ' | sort >! bos_taurus.0qual.lucy.clr
- 624,952 quality-less reads
- Quality-less read stats: 50..min(len,600) & lucy trimming
    elem    0     >0      min  max  mean  med  n50  sum
5   624952  2857  622095  0    501  52    52   52   33012433
3   624952  2857  622095  0    600  579   600  600  361980208
53  624952  2857  622095  0    548  526   548  548  328967775
Fix quality-less read clrs (low complexity)
- Run dust filter on seq (before qual & lucy)
    elem    0     >0      min  max  mean  med  n50  sum
5   624952  3564  621388  0    501  75    52   52   47385578
3   624952  3564  621388  0    600  554   600  600  346470473
53  624952  3564  621388  0    548  478   548  548  299084895
- Merge dust.lucy clrs with the alignment clrs
    elem    0     >0      min  max  mean  med  n50  sum
5   624952  4488  620464  0    599  93    52   126  58378496
3   624952  4488  620464  0    600  547   600  600  342258160
53  624952  4488  620464  0    548  454   512  548  283879664
- Test seq
gatekeeper -dumpfastaseq -b 35348777 -e 35973728 bt.gkpStore | grep NNN
gatekeeper -dumpfastaseq bt.gkpStore | perl -ane 'if(/^>(\d+)/) { $id=$1} elsif(/NNN/) { print $id,"\n";} ' | uniq -c | awk '{print $2,$1}' > bt.NNN.seqs # 2411 seqs (all have the N's "in the middle")
gatekeeper -dumpfastaseq -uid bt.NNN.seqs bt.gkpStore > bt.NNN.cseqs
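The NNN scan above can also be sketched in Python. Unlike the perl one-liner, which tests each FASTA line separately, this hypothetical helper joins the lines of a record first, so a run of N's that straddles a line break is also detected; that difference is a design choice of the sketch, not something the original script does.

```python
def ids_with_n_run(fasta_lines, run="NNN"):
    """Return the IDs of FASTA records whose sequence contains the given run."""
    hits, current, chunks = [], None, []
    # A trailing sentinel header flushes the last record.
    for line in list(fasta_lines) + [">"]:
        line = line.strip()
        if line.startswith(">"):
            if current is not None and run in "".join(chunks):
                hits.append(current)
            parts = line[1:].split()
            current = parts[0] if parts else None
            chunks = []
        elif current is not None:
            chunks.append(line.upper())
    return hits

# Made-up records: read 1 has NNN split across two lines, read 2 has a single N
fasta = [">1 desc", "ACGTNN", "NACGT", ">2", "ACGTNACGT", ">3", "NNNN"]
nnn_ids = ids_with_n_run(fasta)
print(nnn_ids)
```

The returned IDs play the role of bt.NNN.seqs in the commands above.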
Consolidate libraries
Drop from 25,312 to 344 libs
BCM.SHOTGUN
UMD2.4 reestimated 10,117 out of 13,826 libs (those with > 100 mates)
Based on initial estimates
- Reduce the total number from 13826 to 2 libs: 3000 & 6000
- UMD2.3 mean estimates (Initial vs Final):
meanI   #libs  minF  maxF  meanF  medF  n50F  sumF      uid
180000  436    1636  5199  2475   2410  2458  1079407   #3000
167000  86     1585  2948  2264   2258  2285  194775    #3000
6500    31     5212  6636  5837   5867  5924  180951    #6000
6000    11     4556  6272  5389   5421  5421  59286     #6000
3500    949    1670  4769  2668   2608  2645  2532027   #3000
3000    2511   1483  5250  2715   2662  2723  6818678   #3000
2000    6093   1157  6443  2526   2487  2554  15391160  #3000
Based on final estimates
- Reduce the total number from 13826 to 7 libs: 6500,5500,...1500, un-estimated (2501)
meanF         #libs  min   max   mean  med   n50   sum       uid(new)  mean(new)  std(new)
6K<=mea<7K    15     6010  6636  6176  6159  6159  92650     6500      6500
5K<=mea<6K    29     5121  5985  5540  5536  5577  160673    5500      5500
4K<=mea<5K    67     4017  4939  4284  4266  4274  287072    4500      4500
3K<=mea<4K    1401   3000  3998  3276  3209  3226  4590323   3500      3500
2K<=mea<3K    7998   2000  2999  2502  2498  2532  20017767  2500      2500
1K<=mea<2K    607    1157  1999  1825  1882  1890  1107798   1500      1200
un-estimated  3709                                           2501      2501
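The consolidation by final estimate above maps each reestimated BCM.SHOTGUN library to a 1 kb bin whose new uid is the bin midpoint, with un-estimated libraries going to uid 2501. A minimal sketch of that mapping follows; the function name is made up, and the behaviour for means outside the 1K..7K range seen in the table is an assumption.

```python
def consolidated_lib_uid(final_mean, unestimated_uid=2501):
    """Map a library's final mean insert estimate to the consolidated library uid.

    None stands for a library whose insert size was not reestimated.
    """
    if final_mean is None:
        return unestimated_uid
    # e.g. 2000 <= mean < 3000 maps to 2500
    return int(final_mean // 1000) * 1000 + 500

uids = [consolidated_lib_uid(m) for m in (6010, 2999, 1157, None)]
print(uids)
```

Only the uid is derived here; the new mean and std values would still be taken from the table, since the 1K..2K row uses a mean that differs from its uid.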
BARC.CLONEEND
Collapse all 11150 into 1:
uid:25456 mea:165000 std:43000
Overlapper
- Quality-less reads overlaps: fewer than in the UMD2.3 assembly
            elem    0      >0      min  max   mean  med  n50  sum
0qual(all)  624830  35692  589138  0    3237  60    14   439  37578899  # 94.39%
Unitigger
UNITIG OVERLAP GRAPH INFORMATION
5356408 : Total number of unitigs
2613795 : Total number of singleton, contained unitigs
1870448 : Total number of singleton, non-contained unitigs
182878 : Total number of non-singleton, spanned unitigs
689287 : Total number of non-singleton, non-spanned unitigs
35547861 : Total number of fragments
35547861 : Total number of fragments in all unitigs
21685943 : Total number of essential fragments in all unitigs
13861918 : Total number of contained fragments in all unitigs
0.0077797328 : Randomly sampled fragment arrival rate per bp
2513424271 : The sum of overhangs in all the unitigs
6468428782 : Total number of bases in all unitigs
0 : Estimated number of base pairs in the genome.
0 : Total number of contained fragments not connected by containment edges to essential fragments.
Total rho = 2513424271
Total nfrags = 19553770
Estimated genome length = 0
Estimated global_fragment_arrival_rate=0.007780
Computed global_fragment_arrival_rate =0.007780
Total number of randomly sampled fragments in genome = 23868770
Computed genome length = 3068070656.000000
Used global_fragment_arrival_rate=0.007780
Used global_fragment_arrival_distance=128.539119
Histogram of the number of base pairs in a chunk
100292 - 138301: 19
90052 - 99906: 23
80043 - 89676: 79
70013 - 79966: 164
60010 - 69988: 390
50008 - 59983: 949
40000 - 49998: 2433
30000 - 39997: 6437
20000 - 29999: 18808
10000 - 19999: 57634
Bog
!!! Much bigger unitigs than default unitigger
Global Arrival Rate: 0.013829
212260 - 224992: 4
100099 - 186873: 372
90015 - 99973: 353
80045 - 89988: 582
70011 - 79999: 1084
60000 - 69994: 1856
50001 - 59996: 3162
40002 - 49994: 5407
30000 - 39999: 9767
20000 - 29996: 18981
10000 - 19999: 39641
Consensus after Unitigger
- Failed on jobs 120 & 121 ( _tmp file)
cat 4-unitigger/*120* | countMessages.pl
IMP 280264
IUM 124707
cat 4-unitigger/*121* | countMessages.pl
IMP 282146
IUM 245650

cat 5-consensus/bt_120.cgi_tmp | countMessages.pl
IMP 34348
IUM 19222
cat 5-consensus/bt_121.cgi_tmp | countMessages.pl
IMP 51833
IUM 16805
- Fix 120: split IUM messages
extractfromfrgMSG.pl -b 0 -e 19222 bt_120.cgb.orig IUM >! bt_120.cgb &
extractfromfrgMSG.pl -b 19222 bt_120.cgb.orig IUM >! bt_120.cgb &
- Fix 121: remove assertion in AS_CNS/MultiAlignment_CNS.c
if(to <= from || to > ma_length-1){
  fprintf(stderr, "AbacusRefine range (to) invalid");
  //assert(0);
}
CGW
- Failed after Ckp3 (7-0-CGW/bt.ckp.3; MergeScaffoldsAggressive 2nd iteration)
CI extends beyond end of scaffold!
offsetAEnd = 254204 offsetBEnd = 252250 scaffoldLength = 253268
cgw: CIScaffoldT_Merge_CGW.c:307: InsertScaffoldContentsIntoScaffold: Assertion `0' failed.
- Last cgw
Scaffold lengths:
cat 7-4-CGW/stat/final0.*Scaffolds.nodelength.cgm | grep -v ^Scaff | getSummary.pl -t scf
cat 7-4-CGW/stat/final0.PlacedContig.n | grep -v ^Scaff | getSummary.pl -t scf
     elem   min  max       mean   med   n50      sum
scf  45826  385  34263871  59591  1349  7059820  2730853790
ctg  96562  65   738899    27789  3657  93988    2683452359

Library insert estimates:
cat 7-4-CGW/stat/scaffold_final.distupdate.dst | grep ^# | awk '{print $3,int($8),int($10)}' > 7-4-CGW/bt.dst
join2.pl bt.dst 7-4-CGW/bt.dst | p 'print join "\t",@F[0,1,2,5,6,3,4]; print "\n";' > bt.dst.combine

CLONEEND inserts:
UID     MEANI   STDI   MEANF   STDF   COUNT   LIB
114892  150000  30000  175701  40732  31063   UIUC.CLONEEND
19070   167000  25000  171349  18253  7103    BCM.CLONEEND
118     167000  16700  167000  16700  21      WUGSC.CLONEEND
25456   165000  43000  163044  25849  11150   BARC.CLONEEND
115020  150000  30000  162719  25343  15256   UIUC.CLONEEND
65177   2000    600    162540  34155  27067   TIGR.CLONEEND
125606  150000  30000  162396  19319  59505   BCCAGSC.CLONEEND
10738   2500    750    162386  27567  2040    TIGR_JCVIJTC.CLONEEND
10691   2500    750    161540  28239  2763    TIGR_JCVIJTC.CLONEEND
17249   202000  20200  157496  55375  6269    CENARGEN.CLONEEND
54017   120000  12000  115671  27594  25889   GSC.CLONEEND
total                                 188126  CLONEENDs
Consensus
- Failed on job 34 with segmentation fault
- 9kbp contig, made out of 3007 reads (24 of which are quality-less)
cat 7-CGW/bt.cgw_contigs.34.1 | grep "^{" | uniq -c | awk '{print $2,$1}'
{ICM 1
{IMP 3007
{IUP 329
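The message counts used throughout these notes come from tallying lines that open a message (`{ICM`, `{IMP`, `{IUP`, ...). A minimal Python sketch of that tally follows; it is not the countMessages.pl source, and the function name `count_messages` is hypothetical. Unlike `uniq -c`, which counts consecutive runs, it accumulates one total per message type.

```python
from collections import Counter


def count_messages(lines):
    """Count message-opening lines such as '{IMP' by their 3-letter type."""
    counts = Counter()
    for line in lines:
        if line.startswith("{"):
            counts[line[1:4]] += 1
    return counts


if __name__ == "__main__":
    sample = ["{ICM", "acc:1", "{IMP", "typ:R", "}", "{IMP", "}", "{IUP", "}", "}"]
    for msg_type, n in sorted(count_messages(sample).items()):
        print(msg_type, n)
```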
- Fix : edit AS_CNS/MultiAlignment_CNS.c; add
if(!ungappedSequence->Elements) {
  ungappedSequence->numElements=0;
}
if(!ungappedQuality->Elements) {
  ungappedQuality->numElements=0;
}
Analysis
Contigs Vs possible contaminants
- nucmer alignment parameters: -l 40 -c 100 -b 10 -g 5 -d 0.05
- have to redo alignments using -maxmatch !!!
- file location:
reference seqs:
/nfshomes/dpuiu/db/Ecoli.365350-365744 # Ecoli K12 region with most alignments (BCM WGS splice site)
/nfshomes/dpuiu/db/Ecoli # Ecoli K12 substrain MG1655 (NC_000913 ; 1st completed)
/nfshomes/dpuiu/db/Ecoli.all # 22 Ecoli completed genomes ( + plasmids)
/nfshomes/dpuiu/db/UniVec_Core # UniVec Core seqs
/nfshomes/dpuiu/db/OtherVec # 100 other vector sequences identified by aligning UMD2.0 contaminants to GenBank; align also to 110 UniVec core using nucmer (params above)
/nfshomes/dpuiu/db/bos_taurus.UMD2.contaminant.fasta # 4813 whole contigs and 30329 contig regions identified by NCBI as UMD2 contamination
/nfshomes/dpuiu/db/bos_taurus.UMD2.contaminant.organism_count # organism counts: vector is the most abundant
/nfshomes/dpuiu/db/bos_taurus.UMD2.contaminant.infoseq # grep -v 'coli|vector|7180003101029' => 905 other contaminants
query seqs:
/scratch1/bos_taurus/Assembly/2009_0217_CA/9-terminator/ctg.split100/*fasta # latest assembly contigs (no degenerates)
delta files:
/scratch1/bos_taurus/Assembly/2009_0217_CA/nucmer_ctg/no_maxmatch/*delta
Ecoli K12 substrains:
NC_010473.1  4686137  50.78  Escherichia coli str. K-12 substr. DH10B, complete genome
NC_000913.2  4639675  50.79  Escherichia coli str. K-12 substr. MG1655, complete genome
AC_000091.1  4646332  50.80  Escherichia coli str. K-12 substr. W3110, complete genome
no maxmatch
- fewer alignments in UMD2.4 than in UMD2
UMD2 (all): just a few degens
15102  Ecoli.365350-365744-ctg.qry_hits
15943  Ecoli-ctg.qry_hits
17308  Ecoli.all-ctg.qry_hits
79065  UMD2.contaminant-ctg.qry_hits # 55877 new hits
20105  UMD2.contaminant-ctg.CBE.qry_hits # CONTAIN|BEGIN|END|IDENTITY
19839  UniVec_Core-ctg.qry_hits
UMD2.4
559    Ecoli.365350-365744-ctg.qry_hits
1215   Ecoli-ctg.qry_hits
2767   Ecoli.all-ctg.qry_hits # the 2 most frequent strains are UMN026 & ATCC 8739; K12 DH10B is ranked 5th; K12 MG1655 is ranked 19th (out of 31 seqs)
44112  UMD2.contaminant-ctg.qry_hits
5286   UniVec_Core-ctg.qry_hits
Length of the reference seqs used for screening:
                     #seqs  min      max      mean     med      n50      sum
Ecoli.365350-365744  1      395      395      395      395      395      395        # Ecoli K12 regions with most alignments (BCM WGS splice site)
Ecoli                1      4639675  4639675  4639675  4639675  4639675  4639675    # Ecoli K12 substrain MG1655
Ecoli.all            49     3306     5572075  2293320  130440   5065741  112372708  # 22 Ecoli's
UniVec_Core          1348   12       48551    243      98       967      327641
OtherVec             100    1702     739874   15419    5027     166744   1541984
UMD2.contaminant     35142  48       16661    512      362      674      18022349
Length of UMD2.4 contigs that contain contaminant (0+ bp from end):
                          #ctgs  <2000bp  >=2000bp  min   max      mean   med    n50     sum
Ecoli.365350-365744-ctg   559    534      25        1001  179527   2467   1341   1894    1379440
Ecoli-ctg                 1215   1086     129       1001  360312   4326   1347   71372   5256540
Ecoli.all-ctg             2767   2455     312*      1001  453627*  8031   1366   134516  22224468
UniVec_Core-ctg           5286   4718     568*      882   651163*  9820   1337   136090  51909339
UMD2.contaminant-ctg.CBE  4976   4410     566*      738   651163*  8497   1339   122281  42281715    #annotated alignments: CONTAIN|BEGIN|END|IDENTITY
UMD2.contaminant-ctg      44112  12813    31299     268   739442   50591  27461  111598  2231701788
Length of UMD2.4 contigs that contain contaminant in the middle (500+ bp from end):
                          #ctgs  <2000bp  >=2000bp   min   max      mean   med    n50     sum
Ecoli.365350-365744-ctg   144    136      8          1286  2053     1779   1811   1814    256259
Ecoli-ctg                 171    152      19         1286  4703     1835   1807   1821    313820
Ecoli.all-ctg             197    160      37*(81)    1228  351373*  6516   1815   125069  1283728     #81 2K+ ctgs using -maxmatch
UniVec_Core-ctg           1278   1110     168*(276)  1085  651163*  12266  1496   160336  15676765    #276 2K+ ctgs using -maxmatch
UMD2.contaminant-ctg.CBE  52     25       27*        1249  351373*  22195  2054   125069  1154142     #annotated alignments: CONTAIN|BEGIN|END|IDENTITY
UMD2.contaminant-ctg      31019  1437     29582      1113  739442   70665  50798  113684  2191986214
Length of the UMD2.4 contaminant sequence (0+ bp from end):
                         #align  <200bp  >=200bp  min  max     mean  med  n50  sum
Ecoli.365350-365744-ctg  1066    537     529      104  225     192   162  224  205379
Ecoli-ctg                1793    587     1206     50   4440    496   224  994  889798
Ecoli.all-ctg            4074    1132    2942*    40   17075*  380   254  441  1551783
UniVec_Core-ctg          14425   9819    4606*    40   1801*   236   162  325  3409187
UMD2.contaminant-ctg     144843  96008   48835    40   16661   199   169  209  28912002
Length of the UMD2.4 contaminant sequence (500+ bp from end):
                         #align  <200bp  >=200bp  min  max    mean  med  n50  sum
Ecoli.365350-365744-ctg  243     136     107      162  224    189   162  224  46000
Ecoli-ctg                273     149     124      106  1341   219   162  224  59923
Ecoli.all-ctg            294     153     141*     106  2150*  251   162  224  73992
UniVec_Core-ctg          2144    2035    109*     50   1340*  122   121  121  261821
UMD2.contaminant-ctg     121331  86985   34346    40   2738   171   162  184  20753580
- Problem: 8 long ctgs contain Ecoli in the middle (1000+ bp from end)
show-coords Ecoli.all-ctg.filter-q.delta | ~/bin/filterQryCoords.pl -i 1000 | sort -nk13 -r
[S1] [E1] | [S2] [E2] | [LEN 1] [LEN 2] | [% IDY] | [LEN R] [LEN Q] | [COV R] [COV Q] | [TAGS]
===============================================================================================================================
4640589 4641890 | 161712 160411 | 1302 1302 | 99.46 | 4686137 351373 | 0.03 0.37 | gi|170079663|ref|NC_010473.1| ctg7180001872124
5068908 5069620 | 87386 86679 | 713 708 | 98.88 | 5209548 91972 | 0.01 0.77 | gi|218687878|ref|NC_011745.1| ctg7180002055226
3087480 3088683 | 50423 51620 | 1204 1198 | 99.00 | 5202090 88182 | 0.02 1.36 | gi|218703261|ref|NC_011751.1| ctg7180002054092
4640580 4641890 | 19953 18646 | 1311 1308 | 99.08 | 4686137 31157 | 0.03 4.20 | gi|170079663|ref|NC_010473.1| ctg7180001875158
131462 133564 | 1247 3349 | 2103 2103 | 98.19 | 241387 5751 | 0.87 36.57 | gi|157412014|ref|NC_009838.1| ctg7180002043242
82801 83166 | 2986 2621 | 366 366 | 98.09 | 241387 4709 | 0.15 7.77 | gi|157412014|ref|NC_009838.1| ctg7180001714551
82264 82793 | 3523 2994 | 530 530 | 98.49 | 241387 4709 | 0.22 11.26 | gi|157412014|ref|NC_009838.1| ctg7180001714551
1652253 1652545 | 1487 1195 | 293 293 | 98.63 | 4700560 2492 | 0.01 11.76 | gi|218552585|ref|NC_011741.1| ctg7180001754941
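The "1000+ bp from end" selection can be expressed as a test on the query coordinates of each show-coords row. The Python sketch below is only a guess at what `filterQryCoords.pl -i 1000` does, since the script itself is not shown in these notes: the function names are hypothetical and the exact boundary handling (strict or inclusive) is an assumption.

```python
def parse_coords_row(row):
    """Extract (S2, E2, LEN Q) from one pipe-delimited show-coords row."""
    # The tag column itself contains '|' (gi|...|ref|...), so split at most 6 times.
    parts = row.split("|", 6)
    s2, e2 = (int(x) for x in parts[1].split())
    len_q = int(parts[4].split()[1])
    return s2, e2, len_q


def is_interior(s2, e2, len_q, margin):
    """True if the aligned query region lies more than `margin` bp from both ends."""
    lo, hi = min(s2, e2), max(s2, e2)
    return lo > margin and hi <= len_q - margin


if __name__ == "__main__":
    row = ("1652253 1652545 | 1487 1195 | 293 293 | 98.63 | 4700560 2492 | "
           "0.01 11.76 | gi|218552585|ref|NC_011741.1| ctg7180001754941")
    print(is_interior(*parse_coords_row(row), margin=1000))
```

Under this reading, the shortest contig in the table (2492 bp, hit at 1195..1487) still qualifies, because both ends of the hit are more than 1000 bp from the contig ends.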
- Regions present in DH10B but not MG1655
delta2cvg -M 0 < DH10B-MG1655.delta
gi|170079663|ref|NC_010473.1| 1349629 1378243 28614 0
gi|170079663|ref|NC_010473.1| 1391006 1396986 5980 0
gi|170079663|ref|NC_010473.1| 3199469 3200798 1329 0
gi|170079663|ref|NC_010473.1| 3211928 3213257 1329 0
gi|170079663|ref|NC_010473.1| 4640588 4641918 1330 0 !!!
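The zero-coverage regions listed here are the parts of the reference not covered by any alignment. The Python sketch below shows the general idea on plain intervals; it is not the delta2cvg source, the function name `zero_coverage` is hypothetical, and the half-open coordinate convention is an assumption chosen because the reported lengths equal end minus start (for example 1378243 - 1349629 = 28614).

```python
def zero_coverage(intervals, seq_len):
    """Return (start, end, length) for every uncovered stretch of [0, seq_len).

    `intervals` holds half-open (start, end) alignment spans on the reference.
    """
    gaps = []
    pos = 0  # first position not yet known to be covered
    for start, end in sorted(intervals):
        if start > pos:
            gaps.append((pos, start, start - pos))
        pos = max(pos, end)
    if pos < seq_len:
        gaps.append((pos, seq_len, seq_len - pos))
    return gaps


if __name__ == "__main__":
    for gap in zero_coverage([(0, 10), (5, 20), (30, 40)], 50):
        print(*gap)
```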
- Problem: 10 long ctgs contain Vector in the middle (1000+ bp from end)
show-coords UniVec_Core-ctg.filter-q.delta | ~/bin/filterQryCoords.pl -i 1000 | sort -nk13 -r
[S1] [E1] | [S2] [E2] | [LEN 1] [LEN 2] | [% IDY] | [LEN R] [LEN Q] | [COV R] [COV Q] | [TAGS]
===============================================================================================================================
1 121 | 215495 215615 | 121 121 | 99.17 | 170 271477 | 71.18 0.04 | gnl|uv|U09128.1:15891-16011-49 ctg7180002047604 # pSacBII P1 cloning vector
2252 2435 | 1334 1151 | 184 184 | 100.00 | 2485 160336 | 7.40 0.11 | gnl|uv|U75992.1:16925-19409 ctg7180001808271
180 312 | 1153 1020 | 133 134 | 99.25 | 312 160336 | 42.63 0.08 | gnl|uv|NGB00145.1:2378-2689 ctg7180001808271
1 121 | 1367 1487 | 121 121 | 100.00 | 170 160336 | 71.18 0.08 | gnl|uv|U09128.1:15891-16011-49 ctg7180001808271
1 103 | 1286 1388 | 103 103 | 100.00 | 103 160336 | 100.00 0.06 | gnl|uv|U80929.2:11415-11517 ctg7180001808271 [CONTAINED]
4 121 | 68269 68386 | 118 118 | 100.00 | 170 111913 | 69.41 0.11 | gnl|uv|U09128.1:15891-16011-49 ctg7180002052060
40 152 | 30255 30142 | 113 114 | 99.12 | 1663 42854 | 6.79 0.27 | gnl|uv|U09128.1:1-1663 ctg7180002053344
1 121 | 34358 34238 | 121 121 | 100.00 | 170 35471 | 71.18 0.34 | gnl|uv|U09128.1:15891-16011-49 ctg7180002046164
1 103 | 34439 34337 | 103 103 | 100.00 | 103 35471 | 100.00 0.29 | gnl|uv|U80929.2:11415-11517 ctg7180002046164 [CONTAINED]
46 1385 | 8928 10267 | 1340 1340 | 100.00 | 1413 17587 | 94.83 7.62 | gnl|uv|X65279.1:5941-7353 ctg7180002043597 [CONTAINED]
- ctg7180001872124 : 351373 bp; region 160411..161712 contaminated by Ecoli
cat 9-terminator/bt.posmap.utgctg | grep 7180001872124 | wc -l # 329
cat 9-terminator/bt.posmap.utgctg | grep 7180001872124 | perl -ane '@F[2,3]=@F[3,2] if($F[2]>$F[3]); print $_ if($F[2]<160411 and 160411<$F[3] or $F[2]<161712 and 161712<$F[3]);'
7180000441625 7180001872124 159483 161201 r
7180000441788 7180001872124 160330 161329 f #Ecoli
7180000442730 7180001872124 160368 161010 r #Ecoli
7180000441635 7180001872124 160740 162700 f #Ecoli
cat 9-terminator/bt.utg.info
utg7180000441625 length=1715 num_frags=12 Astat=7.00
utg7180000441788 length=999 num_frags=1 Astat=0.00
utg7180000442730 length=640 num_frags=1 Astat=0.00
utg7180000441635 length=1957 num_frags=9 Astat=7.00
cat 9-terminator/bt.posmap.frgctg | perl -ane '@F[2,3]=@F[3,2] if($F[2]>$F[3]); print $_ if($F[2]<160411 and 160411<$F[3] or $F[2]<161712 and 161712<$F[3]);'
1237446426 7180001872124 160117 161201 f
1238816835 7180001872124 160133 160993 f
1238817728 7180001872124 160322 161123 r
1244436200 7180001872124 159976 160984 f
1238817676 7180001872124 160105 160890 r
1237443253 7180001872124 160106 160900 f
1237471027 7180001872124 159930 160928 f
1238822613 7180001872124 159774 160782 f
1238816875 7180001872124 159878 160728 f
1244436248 7180001872124 159483 160553 f
1238818306 7180001872124 159718 160489 f
1238818332 7180001872124 159722 160483 f
1237476824 7180001872124 160330 161329 f
1238817689 7180001872124 160368 161010 r
1237447135 7180001872124 160740 161768 r
1237483546 7180001872124 160814 161790 r
1237483530 7180001872124 160818 161856 r
1237471108 7180001872124 161003 162009 f
1238817744 7180001872124 161151 161978 f
1237446441 7180001872124 161050 162107 f
1244436201 7180001872124 161117 162164 f
1237446407 7180001872124 161586 162699 r
1237471055 7180001872124 161652 162700 r
# 23 BCM SHOTGUN RP42 VVHNP reads (1369 read lib; 1341 of the reads in this ctg)
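The Perl filter used on the posmap files keeps a row only when its interval strictly contains one of the two region boundaries (160411 or 161712), so an element lying entirely inside the region would not be printed. The Python sketch below reproduces that test and, as an alternative that is not in the original notes, a general overlap test; both function names are hypothetical.

```python
def spans_boundary(start, end, lo, hi):
    """Same test as the Perl one-liner: the interval strictly contains lo or hi."""
    if start > end:
        start, end = end, start
    return start < lo < end or start < hi < end


def overlaps(start, end, lo, hi):
    """Alternative (assumption, not in the notes): any overlap with (lo, hi)."""
    if start > end:
        start, end = end, start
    return start < hi and lo < end


if __name__ == "__main__":
    lo, hi = 160411, 161712
    print(spans_boundary(159483, 161201, lo, hi))  # first unitig row in the notes
    print(spans_boundary(160500, 161500, lo, hi))  # hypothetical fully contained read
    print(overlaps(160500, 161500, lo, hi))
```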
- ctg7180002047604 : Vector in the middle
[S1] [E1] | [S2] [E2] | [LEN 1] [LEN 2] | [% IDY] | [LEN R] [LEN Q] | [COV R] [COV Q] | [TAGS]
===============================================================================================================================
1 121 | 215495 215615 | 121 121 | 99.17 | 170 271477 | 71.18 0.04 | gnl|uv|U09128.1:15891-16011-49 ctg7180002047604 # pSacBII P1 cloning vector
cat 9-terminator/bt.posmap.utgctg | grep 7180002047604 | perl -ane '@F[2,3]=@F[3,2] if($F[2]>$F[3]); print $_ if($F[2]<215495 and 215495<$F[3] or $F[2]<215615 and 215615<$F[3]);'
7180000441711 7180001872124 214458 219678 r
cat 9-terminator/bt.posmap.frgctg | grep 7180002047604 | perl -ane '@F[2,3]=@F[3,2] if($F[2]>$F[3]); print $_ if($F[2]<215495 and 215495<$F[3] or $F[2]<215615 and 215615<$F[3]);'
498776751 7180001872124 215425 216426 r
1236502885 7180001872124 215514 216377 r
379408823 7180001872124 215572 216340 f
1244436224 7180001872124 215388 216405 f
1237471071 7180001872124 215229 216234 r
1233297450 7180001872124 215234 216046 f
1233363357 7180001872124 215267 215687 f
937200686 7180001872124 215300 216129 r
937254901 7180001872124 215321 216160 f
1233294025 7180001872124 215383 216204 r
1237446444 7180001872124 215146 216187 f
1232033776 7180001872124 215193 215996 r
671976381 7180001872124 215035 216021 r
514932286 7180001872124 215043 216008 f
500723879 7180001872124 215043 215802 f
671927656 7180001872124 215116 215733 r
381173692 7180001872124 214947 215877 r
1233303570 7180001872124 214963 215803 f
1232037705 7180001872124 214990 215803 f
490852264 7180001872124 214923 215843 f
1237447184 7180001872124 214684 215646 f
668822243 7180001872124 214586 215572 f
# 22 reads; ~half come from BCM SHOTGUN RP42 VVFOP
maxmatch ctg
Parameters:
nucmer -maxmatch -l 40 -c 100 -b 10 -g 5 -d 0.05 ...
AllVec: UniVec_Core + 100 more vector seqs
Length of UMD2.4 contigs that contain contaminant (0+ bp from end):
                 elem   <2000  >2000  min   max     mean   med   n50     sum
Ecoli.all        2951*  2602   349    1001  453627  8252   1367  132226  24352779
UniVec_Core      5387*  4802   585    882   651163  9979   1334  136556  53760575
OtherVec         5657   5062   595    882   651163  9726   1320  136556  55021803
UMD2.cont.other  3976   3430   546    804   651163  11217  1346  130385  44601117  #18 aligned to Acinetobacter; longest is 56467bp
Length of UMD2.4 contigs that contain contaminant (500+ bp from end):
                 elem  <2000  >2000  min   max     mean   med   n50     sum
Ecoli.all        182   156    26*    1286  351373  6525   1811  125069  1187706   # 7* are >5K; 321* come from multi-ctg scaffolds
UniVec_Core      2532  2220   312*   1065  651163  10593  1481  128344  26821960  # 267* are >5K ; 655* come from multi-ctg scaffolds
OtherVec         376   323    53     1184  361749  13278  1508  139997  4992774
UMD2.cont.other  ...
Length of UMD2.4 contigs that contain contaminant (1000+ bp from end):
                 elem  <2000  >2000  min   max     mean    med     n50     sum
Ecoli.all        8     0      8*     4709  351373  73065   31157   351373  584520
UniVec_Core      11    0      11*    2600  334933  93674   37847   271477  1030414
OtherVec         5     0      5*     3717  271477  131604  111913  228060  658021
UMD2.cont.other  54    0      54*    2398  522682  110947  88182   189352  5991164
total            67*   # 18 of them are CONTAINED by UMD2.0 chromosomes
Length of the UMD2.4 contaminant sequence (0+ bp from end):
                 elem   <200   >200  min  max    mean  med  n50  sum
Ecoli.all        4775   1610   3165  39   17072  381   236  502  1823278
UniVec_Core      16985  12380  4605  39   1800   207   132  300  3519080
OtherVec         7563   1372   6191  39   1800   509   548  643  3849567
UMD2.cont.other  6626   343    6283  39   8228   543   573  615  3602329
maxmatch deg
All degenerates aligned are <2K
Length of UMD2.4 deg that contain contaminant (0+ bp from end):
                 elem  <2000  >2000  min  max   mean  med  n50  sum
Ecoli.all        1266  1266   0      104  1611  783   833  869  991447
UniVec_Core      1908  1908   0      147  1510  872   896  910  1664746
OtherVec         1963  1963   0      147  1510  872   898  911  1712703
UMD2.cont.other  1609  1609   0      132  1611  852   892  914  1372106
maxmatch utg
Unitig stats:
elem     <2000    >2000   min  max     mean  med  n50   sum
1707816  1434164  273652  21   138676  2228  937  8002  3805166508
Parameters:
nucmer -maxmatch -l 40 -c 100 -b 10 -g 5 -d 0.05 ...
Files:
/scratch1/bos_taurus/Assembly/2009_0217_CA/nucmer_utg/
Length of UMD2.4 unitigs that align to contaminants
                         elem    <2000  >2000  min  max    mean  med   n50   sum
Ecoli.all                4275    4110   165    104  71709  1442  1212  1398  6166566
UniVec_Core              7563    7409   154    139  71709  1397  1182  1331  10570512
OtherVec                 8208    8054   154    139  71709  1370  1159  1308  11248775
UMD2.cont.other          6094    5849   245    132  53113  1546  1163  1401  9422951   #80 aligned to Acinetobacter; longest is 9114bp
Contaminants(all above)  10264   9895   369    104  71709  1471  1148  1359  15107544
Acinetobacter            2306**  0      2306   154  71709  1451  1316  1412  3347230   #2182 already in the Cont set
Length of UMD2.4 unitigs that have contaminants 500+bp from ends
                 elem  <2000  >2000  min   max    mean  med   n50    sum
Ecoli.all        172   156    16     1286  4852   1820  1805  1815   313185
UniVec_Core      2491  2422   69     1065  71709  1722  1457  1523   4291584
OtherVec         364   358    6      1167  71709  1795  1478  1538   653595
UMD2.cont.other  156   108    48     1213  50248  5344  1838  17518  833673
Length of the UMD2.4 alignments of unitigs to contaminants (unique unitig regions)
                         elem   <200  >200  min  max   mean  med  n50  sum      reads(all unitig reads for unitigs with alignments>1K)
Ecoli.all                5975   1686  4289  40   8184  397   268  542  2374366  12112(12142)
UniVec_Core              8754   1674  7080  40   1801  474   490  645  4153030  26590(26849)
OtherVec                 8919   1250  7669  40   1801  511   536  629  4562326  30268(30268)
UMD2.cont.other          6752   896   5856  40   6012  529   555  651  3573528  25006(25328)
Contaminants(all above)  10992  1396  9596  40   8184  571   573  684  6280759  40351(40699)
Acinetobacter            (8286)
40699 reads aligned back to contaminants: nucmer -maxmatch
- 35919 align
- 34400 align 100+bp
- 27742 align 200+bp
- 14211 align 500+bp
utg 5'& 3'
Unitig stats:
          elem       <200   >200     min  max     mean  med  n50   sum
utg       1,707,816  81200  1626616  21   138676  2228  937  8002  3805166508
utg5'&3'  3,334,432  0      3334432  21   199     100   100  100   335263271
Align utg5'&3' to Ecoli.all using:
- nucmer -l 40 -c 100 -b 10 -g 5 -d 0.05 : 4,275 hits
- nucmer -l 20 -c 40 : 6,617 hits
- nucmer -l 20 -c 20 : 23,350
- blastall : 2,895,506 out of 3,334,432 (86%) aligned
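The utg5'&3' counts are consistent with taking a 100 bp piece from each end of every unitig of 200 bp or more and keeping shorter unitigs whole: 2 x 1,626,616 + 81,200 = 3,334,432 pieces, none longer than 199 bp. The notes do not show how the ends were actually extracted, so the Python sketch below is an assumption; the function name `unitig_ends` and the `end_len` parameter are hypothetical.

```python
def unitig_ends(seq, end_len=100):
    """Return the 5' and 3' end pieces of a unitig sequence.

    Unitigs shorter than 2 * end_len are returned whole as a single piece
    (assumed behaviour, inferred from the counts in the notes).
    """
    if len(seq) < 2 * end_len:
        return [seq]
    return [seq[:end_len], seq[-end_len:]]


if __name__ == "__main__":
    print([len(p) for p in unitig_ends("A" * 250)])
    print([len(p) for p in unitig_ends("A" * 150)])
    print(2 * 1626616 + 81200)
```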
Acinetobacter contamination
Database:
~dpuiu/db/Acinetobacter.all : 7 complete genomes, 19 seqs
Seq len summary:
elem min max mean med n50 sum 19 2726 4050513 1418094 28279 3904116 26943793
Align all unitigs to Acinetobacter.all; the longest alignment is 8517bp
show-coords Acinetobacter.all-utg.filter-q.delta | sort -nk8 -r | head
[S1] [E1] | [S2] [E2] | [LEN 1] [LEN 2] | [% IDY] | [LEN R] [LEN Q] | [COV R] [COV Q] | [GenBank] [UMD2.4 utg]
===============================================================================================================================
20644 29164 | 62 8578 | 8521 8517 | 98.80 | 94413 8578 | 9.03 99.29 | gi|169786889|ref|NC_010404.1| utg7180000281954* [CONTAINS]
3395586 3401299 | 5712 1 | 5714 5712 | 99.79 | 3976747 8015 | 0.14 71.27 | gi|126640115|ref|NC_009085.1| utg7180000212251
3400344 3404485 | 1 4142 | 4142 4142 | 99.66 | 3976747 9114 | 0.10 45.45 | gi|126640115|ref|NC_009085.1| utg7180000277331
...
utg7180000281954* -> ctg7180002053982 (28140bp; 78 unitigs)
grep 7180002053982 ../9-terminator/bt.posmap.utgctg | nl
1 7180000185222 7180002053982 0 3019 f
2 7180000314302 7180002053982 2151 5706 r
3 7180001463328 7180002053982 2256 2869 f
...
75 7180000281954* 7180002053982 17862 26442 r
76 7180001471348 7180002053982 17886 18723 r
77 7180001468075 7180002053982 17919 18732 f
78 7180000280508 7180002053982 25672 28140 r
show-coords UMD2.contaminant.other-ctg.filter-q.delta | grep 7180002053982
[S1] [E1] | [S2] [E2] | [LEN 1] [LEN 2] | [% IDY] | [LEN R] [LEN Q] | [COV R] [COV Q] | [UMD2.0 contam] [UMD2.4 ctg]
===============================================================================================================================
1394 2508 | 28140 27024 | 1115 1117 | 98.75 | 7098 28140 | 15.71 3.97 | 7180003313366 ctg7180002053982
2561 2871 | 26971 26661 | 311 311 | 97.11 | 7098 28140 | 4.38 1.11 | 7180003313366 ctg7180002053982
2934 5670 | 26599 23862 | 2737 2738 | 97.99 | 7098 28140 | 38.56 9.73 | 7180003313366 ctg7180002053982
...gap...
5930 7098 | 17270 16101 | 1169 1170 | 98.46 | 7098 28140 | 16.47 4.16 | 7180003313366 ctg7180002053982
...gap...
281 1981 | 10335 8635 | 1701 1701 | 99.41 | 13090 28140 | 12.99 6.04 | 7180003320028 ctg7180002053982
1992 2672 | 8635 7954 | 681 682 | 99.71 | 13090 28140 | 5.20 2.42 | 7180003320028 ctg7180002053982
3302 5376 | 6719 4643 | 2075 2077 | 99.33 | 13090 28140 | 15.85 7.38 | 7180003320028 ctg7180002053982
8469 9021 | 4642 4090 | 553 553 | 99.10 | 13090 28140 | 4.22 1.97 | 7180003320028 ctg7180002053982
9038 9313 | 4073 3798 | 276 276 | 98.55 | 13090 28140 | 2.11 0.98 | 7180003320028 ctg7180002053982
9780 13090 | 3331 19 | 3311 3313 | 99.19 | 13090 28140 | 25.29 11.77 | 7180003320028 ctg7180002053982
grep 7180002053982 ../9-terminator/bt.posmap.utgctg | awk '{print $1,$4-$3+1}' | sed 's/^/utg/' >! ctg7180002053982.utgs
intersect.pl UMD2.contaminant.other-utg.qry_hits ctg7180002053982.utgs | wc -l
37 # only 37 out of 78 unitigs were detected
ctg7180002053982 is Acinetobacter
Assembly UMD2.5 (2009_0312_CA; delete 40699 contam reads & 22607 mates)
40699 reads:
- 25803 mated + 14896 unmated
- 6392 mated reads had the mate also contaminated
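The read counts above can be cross-checked with a little arithmetic. One reading under which they agree with the 22607 in the section title is that 22607 is the number of distinct mate pairs touched by the 40699 contaminated reads; this interpretation is an assumption, not something the notes state. A minimal Python check:

```python
# Counts copied from the notes above.
total_reads = 40699
mated = 25803
unmated = 14896
mate_also_contaminated = 6392  # mated reads whose mate is also in the set

assert mated + unmated == total_reads

# Assumption: the 6392 reads pair up with each other (3196 pairs), while each
# remaining mated read belongs to a pair with only one contaminated member.
pairs_both_contaminated = mate_also_contaminated // 2
pairs_one_contaminated = mated - mate_also_contaminated
distinct_pairs = pairs_one_contaminated + pairs_both_contaminated

print(distinct_pairs)  # 22607
```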
Location:
/scratch1/bos_taurus/Assembly/2009_0312_CA
UNITIGGER
UNITIG OVERLAP GRAPH INFORMATION
5322910 : Total number of unitigs
2595715 : Total number of singleton, contained unitigs
1869655 : Total number of singleton, non-contained unitigs
182193 : Total number of non-singleton, spanned unitigs
675347 : Total number of non-singleton, non-spanned unitigs
35507162 : Total number of fragments
35507162 : Total number of fragments in all unitigs
21641007 : Total number of essential fragments in all unitigs
13866155 : Total number of contained fragments in all unitigs
0.0077909501 : Randomly sampled fragment arrival rate per bp
2511009753 : The sum of overhangs in all the unitigs
6442095933 : Total number of bases in all unitigs
0 : Estimated number of base pairs in the genome.
0 : Total number of contained fragments not connected by containment edges to essential fragments.
Total rho = 2511009753
Total nfrags = 19563152
Estimated genome length = 0
Estimated global_fragment_arrival_rate=0.007791
Computed global_fragment_arrival_rate =0.007791
Total number of randomly sampled fragments in genome = 23866135
Computed genome length = 3063315200.000000
Used global_fragment_arrival_rate=0.007791
Used global_fragment_arrival_distance=128.354050
Histogram of the number of base pairs in a chunk
100292 - 138301: 22 # 19 in UMD2.4
90020 - 99906: 28 # 23
80043 - 89676: 90 # 79
70013 - 79966: 190 # 164
60010 - 69988: 423
50008 - 59983: 1016
40000 - 49998: 2558
30000 - 39997: 6660
20000 - 29999: 18927
10000 - 19999: 57057
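The values in the unitigger log above are mutually consistent with three simple relations: arrival rate = nfrags / rho, arrival distance = 1 / rate, and genome length = randomly sampled fragments / rate. The Python sketch below only checks that arithmetic against the logged numbers; it is not taken from the unitigger source, and the variable names are hypothetical.

```python
# Values copied from the unitigger log above.
total_rho = 2511009753
total_nfrags = 19563152
sampled_fragments = 23866135

# Fragments arriving per base pair, and its reciprocal in bp per fragment.
arrival_rate = total_nfrags / total_rho
arrival_distance = 1.0 / arrival_rate

# Genome length implied by the number of randomly sampled fragments.
genome_length = sampled_fragments / arrival_rate

print(f"rate     = {arrival_rate:.10f}")
print(f"distance = {arrival_distance:.3f}")
print(f"genome   = {genome_length:.0f}")
```

The small difference from the logged genome length (3063315200) is presumably due to single-precision rounding in the log; that explanation is a guess.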
CONSENSUS after CGW
- failed on job 80 : ctg 5706539, len=180,024, 159 unitigs, 1,851 reads
head 80/bt.cns_contigs.80.failed
{ICM
acc:5706539
pla:P
len:180024
cns:
.
qlt:
.
for:0
npc:1851
more ../9-terminator/bt.asm
...
{CCO
acc:(7180002022380*,5706539)
pla:P
len:180024
cns:
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
...
cat 80/bt.cns_contigs.80.failed | countMessages.pl
ICM 1
IUP 159 #unitigs
IMP 1851 #reads : 1264 are BCM.WGS, 389 are BCM.SHOTGUN ...
# 254 contig scaffold
cat ../9-terminator/bt.posmap.ctgscf | grep 7180002041301 | nl
1 7180002022042 7180002041301 0 14121 f
...
183 7180002022380* 7180002041301 9164814 9344838 f
..
254 7180002022240 7180002041301 12874874 12893460 f
# the 5 UMD2.4 contigs below have the same number of reads with the ones that matched => CONTAINED
cat UMD2.4-7180002022380.posmap.frgctg | grep -v ^$ | awk '{print $2}' | uniq -c
#reads  ctgid
6       7180001712307
209     7180002028662
6       7180002028663
1552    7180002028664
21      7180002032323
1794    total => 1851-1794=57 additional reads
cat ../../2009_0217_CA/9-terminator/bt.posmap.ctgscf | nl | egrep '7180001712307|7180002028662|7180002028663|7180002028664|7180002032323'
#nl    ctgid          scfid          start    end      dir
34791  7180002028662  7180002069912  1425623  1446022  f
34793  7180002032323  7180002069912  1448133  1450475  f
34794  7180002028663  7180002069912  1450495  1451918  f
34795  7180002028664  7180002069912  1452234  1602973  f
64110  7180001712307  7180002071598  0        1441     f
- Solution 1:
* consensus -Dforceunitigabut => new assembly, new UID's
ctg7180002022636 179505 38.53 # => ctg7180002022380
scf7180002041557 12892941 40.58 # => scf7180002041301
- Solution 2:
* Reassemble 1851 reads ; clr=ECR2; doOBT=no
* Asm dir: /scratch1/bos_taurus/Assembly/2009_0312_CA/8-consensus/80.ECR2.asm
* It contains one 179,530 bp scaffold that has two contigs.
* One contig is 156,349 bp and the other one is 23,181 bp.
* The estimated gap between them is 231 bp.
show-coords ctg7180002022636-80.ECR2.filter-r.delta
1 156326 | 1 156331 | 156326 156331 | 99.99 | 179505 156349 | 87.09 99.99 | ctg7180002022636 ctg7180000000103 [CONTAINS]
156345 179505 | 21 23181 | 23161 23161 | 99.99 | 179505 23181 | 12.90 99.91 | ctg7180002022636 ctg7180000000104 [CONTAINS]
>ctg7180002022636_156327_156344
TTGTAAAAACCATCCCCT
# ~ 20 bp unaligned on ctg7180002022636 & Chr1
show-coords ctg7180002022636-Chr1.filter-r.delta | more
...
151839 156326 | 61124826 61120339* | 4488 4488 | 99.91 | 179505 157590899 | 2.50 0.00 | ctg7180002022636 Chr1
156347 157501 | 61111219* 61110064 | 1155 1156 | 99.91 | 179505 157590899 | 0.64 0.00 | ctg7180002022636 Chr1
...
# 2 UMD2.0 ctg & 2 UMD2.0 deg in this region
more Chr1.agp
...
Chr1  61110064  61111482  3579  W  deg0003139347  1  1419  +
Chr1  61111483  61114130  3580  N  2648  fragment  yes
Chr1  61114131  61115490  3581  W  deg0002967451  1  1360  +
Chr1  61115491  61118114  3582  N  2624  fragment  yes
Chr1  61118115  61120117  3583  W  7180002846553  1  2003  +
Chr1  61120118  61120217  3584  U  100  fragment  yes
Chr1  61120218  61145567  3585  W  7180003318962  1  25350  +
...
QC
Lengths:
     elem     <2000    >2000   min  max       mean   med   n50      sum
scf  39978*   31311    8667    316  34167202  68129  1360  8217662  2723691675
ctg  90135*   36140    53995   65   1160130   29693  5124  95988    2676390147
deg  251413   249285   2128    65   39964     1003   984   994      252279234
utg  1689033  1419729  269304  21   138676    2242   936   8213     3788090224
Scaffold zero read/mate cvg regions:
      elem   <2000  >2000  min  max     mean  med  n50    sum
read  57011  55048  1963   1    177144  913   57   29302  52084484
mate  10507  8945   1562   1    30014   996   493  2367   10466518
Scaffold 10K+ zero read/mate cvg regions (2K+ inside) (some might be a result of surrogates?):
      elem   <2000  >2000  min  max     mean  med   n50    sum
read  51747  49878  1869   1    177144  958   38    32625  49613432
mate  1290   201    1089   15   30014   3560  3047  4017   4593586
Contaminant search
ctg
     elem   <0  >0     min  max      mean   med   n50    sum
ctg  90135  0   90135  65   1160130  29693  5124  95988  2676390147
nucmer -maxmatch -l 40 -c 100 -b 10 -g 5 -d 0.05 ...
                 elem       <2000  >2000    min   max     mean   med   n50     sum
Ecoli.all        71         66     5        1006  129770  4180   1127  45234   296830
UniVec_Core      120        111    9        1000  426540  7366   1130  100944  884023
OtherVec         121        112    9        1000  426540  7314   1127  100944  885022
UMD2.cont.other  98(-3)     83     15(-3)   1000  426540  13523  1190  199700  1325332  # 3 are 1000bp+ from ctg ends; these are actually "fake" contaminants # 7 are Acinetobacter baumannii min=1059 , max=9765
total            152*(-3)   133    19*(-3)  1000  426540  10649  1150  199700  1618735
Acinetobacter    65         53     12       1013  44359   2586   1212  3847    168130   # 46 out of 65 are in the 152* set; 19 are new; 13 have lots of alignments to other contigs (probably fake contaminants)
total(new)       171*(-3)   144    27*(-3)  1000  426540  10013  1189  129770  1712376  # 65 are Acinetobacter and should be removed
cat UMD2.contaminant.other-ctg.filter-q.coords | grep Acinetobacter
UMD2.0 UMD2.5
1 285 | 285 1 | 285 285 | 99.65 | 287 8096 | 99.30 3.52 | 7180003292866_1_288 ctg7180002015457 [CONTAINED] Acinetobacter baumannii
1422 2500 | 1078 1 | 1079 1078 | 99.63 | 7098 1078 | 15.20 100.00 | 7180003313366 ctg7180001706852 [CONTAINS] Acinetobacter baumannii
2934 3940 | 1008 1 | 1007 1008 | 98.61 | 7098 1059 | 14.19 95.18 | 7180003313366 ctg7180001709709 Acinetobacter baumannii
6281 7098 | 1 818 | 818 818 | 99.76 | 7098 1553 | 11.52 52.67 | 7180003313366 ctg7180001716052 [END] Acinetobacter baumannii
1 790 | 790 1 | 790 790 | 100.00 | 1822 9765 | 43.36 8.09 | 7180003319195_8956_10778 ctg7180002015485 [BEGIN] Acinetobacter calcoaceticus
285 1981 | 1 1697 | 1697 1697 | 99.59 | 13090 1856 | 12.96 91.43 | 7180003320028 ctg7180001706656* Acinetobacter baumannii
1992 2148 | 1697 1856 | 157 160 | 98.12 | 13090 1856 | 1.20 8.62 | 7180003320028 ctg7180001706656* Acinetobacter baumannii
12210 13090 | 63 943 | 881 881 | 99.89 | 13090 2556 | 6.73 34.47 | 7180003320028 ctg7180002007423 Acinetobacter baumannii
# 7 Acinetobacter baumannii ctgs
# no Serratia "best hits"
# 3 mitochondrion ctgs, all < 2Kbp
Delete summary: 65 Acinetobacter ctgs + 91 contaminant ctgs <2000bp => 156 ctgs => 4105 reads
ctgs  <2000  >2000  min   max    mean  med   n50   sum     reads
156   144    12     1000  44359  1782  1150  1483  278009  4105
Trim summary: 12 contigs >=2000bp & 44 reads that overlap at least 10bp
ctgs  <2000  >2000  min  max  mean  med  n50  sum   reads
12    12     0      172  935  532   618  750  6393  44
ctg 5'&3'
       elem    <0  >0      min  max  mean  med  n50  sum
ctg53  180044  0   180044  65   598  300   300  300  54033229
nucmer -maxmatch -l 17 -c 35 ...
                 #ctgEnds  #ctgs  min  max  mean  med  n50  sum
Ecoli.all        180       149    300  300  300   300  300  54000
UniVec_Core      312       277    300  300  300   300  300  93600
OtherVec         1211      1167   300  553  300   300  300  363989
UMD2.cont.other  15689     14693  257  598  300   300  300  4712162
deg
nucmer -maxmatch -l 40 -c 100 -b 10 -g 5 -d 0.05 ...
                 elem  <2000  >2000  min  max   mean  med  n50  sum
Ecoli.all        387   387    0      131  1099  756   806  835  292892
UniVec_Core      569   569    0      101  1115  763   822  843  434400
OtherVec         579   579    0      101  1115  752   819  840  435549
UMD2.cont.other  539   539    0      131  1483  792   838  873  427408
total            810*  810    0*     101  1483  784   838  869  63547
Scaffolds vs UMD2.0 chromosome alignments
Directory:
/scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_scf
Depending on the ref/qry seqs and nucmer parameters, the total length of unaligned gaps in UMD2.0 varies between:
101M: REF=Chr, QRY=scf, nucmer -l 100 -c 500
6M: REF=ChrPlaced, QRY=scf-deg, nucmer -maxmatch -l 50 -c 250
nucmer -l 100 -c 500
Chr-scf.summary
                        elem    <2000   >2000  min    max       mean     med      n50      sum
Chr-scf.qry_hits        32901   24546   8355   723    34167202  82494    1405     8217662  2714164100
Chr-scf.qry_nohits      7077    6765    312    316    12006     1346     1239     1291     9527056
Chr-scf.10K.qry_hits2+  574     0       574    10308  34167202  4006753  1887309  9586144  2299876795
Chr-scf.0cvg            144712  125933  18779  1      102265    900      178      2968     130248709
Chr-scf.0cvg.clean      148556  143283  5273   1      39625     683      280      1363     101526883(101M)
Chr-scf-deg.summary
                            elem    <2000   >2000  min    max       mean     med      n50      sum
Chr-scf-deg.qry_hits        210225  199781  10444  501    34167202  13785    1007     7328685  2898141592
Chr-scf-deg.qry_nohits      81166   80815   351    65     12006     958      972      989      77828798
Chr-scf-deg.10K.qry_hits2+  574     0       574    10308  34167202  4006753  1887309  9586144  2299876795
Chr-scf-deg.0cvg            175952  168553  7399   1      22067     445      120      1329     78433265
Chr-scf-deg.0cvg.clean      133809  132381  1428   1      20512     371      124      1101     49711440(49M)
ChrPlaced-scf.summary
                              elem   <2000  >2000  min    max       mean     med      n50       sum
ChrPlaced-scf.qry_hits        19488  13112  6376   723    34167202  137773   1527     8428844   2684927057
ChrPlaced-scf.qry_nohits      20490  18199  2291   316    192648    1891     1276     1569      38764099
ChrPlaced-scf.10K.qry_hits2+  139    0      139    10486  31959312  6786671  4979278  12956086  943347316
ChrPlaced-scf.0cvg            76271  71816  4455   1      102265    568      179      1710      43339413
ChrPlaced-scf.0cvg.clean      72865  70541  2324   1      39625     356      94       1413      25951987(25M)
ChrPlaced-scf-deg.summary
                                  elem    <2000   >2000  min    max       mean     med      n50       sum
ChrPlaced-scf-deg.qry_hits        130125  122000  8125   501    34167202  21530    1009     7515049   2801670853
ChrPlaced-scf-deg.qry_nohits      161266  158596  2670   65     192648    1080     987      1012      174299537
ChrPlaced-scf-deg.10K.qry_hits2+  139     0       139    10486  31959312  6786671  4979278  12956086  943347316
ChrPlaced-scf-deg.0cvg            79041   76374   2667   1      22067     395      157      948       31251753
ChrPlaced-scf-deg.0cvg.clean      69012   68271   741    1      20512     200      81       592       13864328(13M)
nucmer -maxmatch -l 100 -c 500
Dir:
/scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_scf.2
ChrPlaced-scf-deg.summary
                                  elem    <2000   >2000  min    max       mean     med      n50       sum
ChrPlaced-scf-deg.qry_hits        130510  122377  8133   501    34167202  21470    1009     7515049   2802100771
ChrPlaced-scf-deg.qry_nohits      160881  158219  2662   65     192648    1080     986      1012      173869619
ChrPlaced-scf-deg.10K.qry_hits2+  120     0       120    20022  31959312  7587296  5639522  13010806  910475551
ChrPlaced-scf-deg.0cvg            82159   80425   1734   1      7002      321      145      647       26444796
ChrPlaced-scf-deg.0cvg.clean      111645  111546  99     1      6248      81       13       272       9057424(9M)
nucmer -maxmatch -l 50 -c 250
Dir:
/scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_scf.3
ChrPlaced-scf-deg.summary
                                  elem    <2000   >2000  min    max       mean     med      n50       sum
ChrPlaced-scf-deg.qry_hits        204673  195653  9020   251    34167202  14088    1005     7329288   2883625712
ChrPlaced-scf-deg.qry_nohits      86718   84943   1775   65     192648    1064     970      1007      92344678
ChrPlaced-scf-deg.10K.qry_hits2+  148     0       148    10486  31959312  6814292  5135095  12792673  1008515300
ChrPlaced-scf-deg.0cvg            86085   84614   1471   1      4565      279      123      557       24101912
ChrPlaced-scf-deg.0cvg.clean      113796  113791  5      1      2822      59       7        176       6714556(6M)
nucmer -maxmatch -l 50 -c 250 ; delta-filter -q
Dir:
/scratch1/bos_taurus/Assembly/2009_0312_CA/nucmer_scf.3
ChrPlaced-scf-deg.filter-q.summary
                                  elem    <2000   >2000  min    max       mean     med      n50       sum
ChrPlaced-scf-deg.qry_hits        204673  195653  9020   251    34167202  14088    1005     7329288   2883625712
ChrPlaced-scf-deg.qry_nohits      86718   84943   1775   65     192648    1064     970      1007      92344678
ChrPlaced-scf-deg.10K.qry_hits2+  118     0       118    20022  31959312  7633834  5639522  13010806  900792422
ChrPlaced-scf-deg.0cvg            77864   73686   4178   1      28711     529      181      1523      41240419
ChrPlaced-scf-deg.0cvg.clean      74172   72150   2022   1      28331*    321      89       1415      23852952(23M)
Max gap is 28331; Duplicate region in UMD2.0?
ChrPlaced-scf-deg.coords
70739616 70767946 | 2993389 2965048 | 28331 28342 | 99.17 | 85187327 19514159 | 0.03 0.15 | Chr15 scf7180002041107
70768054 70808299 | 3005305 2965048 | 40246 40258 | 99.56 | 85187327 19514159 | 0.05 0.21 | Chr15 scf7180002041107
=> ChrPlaced-scf-deg.filter-q.coords
70768054 70808299 | 3005305 2965048 | 40246 40258 | 99.56 | 85187327 19514159 | 0.05 0.21 | Chr15 scf7180002041107
Markers
head /fs/szasmg3/bos_taurus/UMD_Freeze2.5/markers/markers_contigs_Art.txt
Marker    Chr_BTA  Pos(K)  Pos_from  Pos_to  UMD_Ctg_Pos  Match_Len    %IDY  %Match  UMD_Ctg_name
BZ945871        1   47501         1   95001         7622        515  100.00   99.61  ctg7180002007845
BZ953651        1   80001     47501  112501        10786        700   99.57  100.00  ctg7180002026484
CC504788        1  118751     80001  157501        54583        862  100.00  100.00  ctg7180002026483
CC484491        1  123751     90001  157501        50169         77   98.72  100.00  ctg7180002026482
CZ415082        1  125001     92501  157501        75850        507   99.21   99.80  ctg7180002026483
CC475154        1  130001     97501  162501        40013        666   99.25  100.00  ctg7180002026482
CC561114        1  182501    145001  220001         1130        709   99.02  100.00  ctg7180002026482
CC578374        1  190001    155001  225001       170145        647  100.00  100.00  ctg7180002026481
BZ911787        1  278751    232501  325001           na         na      na      na  na
...
- 126,014 markers & 90,135 ctgs total
- 107,271 markers align to 31,407 ctgs:
- 85% of the markers align to 85% of the ctg sequence
- avg distance between markers is 25Kbp
- 93,508 unique markers (out of 107,271)
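Counts like the ones in this list can be tallied directly from the marker table. The Python sketch below is only an illustration: it assumes the whitespace-separated layout of the `head` output above, treats `na` in the last column as "marker did not align", and uses the hypothetical name `tally_markers`. It does not attempt the "unique markers" count, since the notes do not say how uniqueness was defined.

```python
def tally_markers(lines):
    """Return (total markers, aligned markers, distinct contigs hit)."""
    total = 0
    aligned = 0
    contigs = set()
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == "Marker":  # skip blanks and the header
            continue
        total += 1
        ctg_name = fields[-1]
        if ctg_name != "na":  # assumption: "na" marks an unaligned marker
            aligned += 1
            contigs.add(ctg_name)
    return total, aligned, len(contigs)


# The rows shown by the head command above.
SAMPLE = """\
Marker Chr_BTA Pos(K) Pos_from Pos_to UMD_Ctg_Pos Match_Len %IDY %Match UMD_Ctg_name
BZ945871 1 47501 1 95001 7622 515 100.00 99.61 ctg7180002007845
BZ953651 1 80001 47501 112501 10786 700 99.57 100.00 ctg7180002026484
CC504788 1 118751 80001 157501 54583 862 100.00 100.00 ctg7180002026483
CC484491 1 123751 90001 157501 50169 77 98.72 100.00 ctg7180002026482
CZ415082 1 125001 92501 157501 75850 507 99.21 99.80 ctg7180002026483
CC475154 1 130001 97501 162501 40013 666 99.25 100.00 ctg7180002026482
CC561114 1 182501 145001 220001 1130 709 99.02 100.00 ctg7180002026482
CC578374 1 190001 155001 225001 170145 647 100.00 100.00 ctg7180002026481
BZ911787 1 278751 232501 325001 na na na na na
"""

total, aligned, n_contigs = tally_markers(SAMPLE.splitlines())
print(total, aligned, n_contigs)

# Fraction of markers aligned, from the totals quoted in the list above.
pct_aligned = 100 * 107271 / 126014
print(round(pct_aligned, 1))
```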
Ctg vs markers summary:
                                   #ctg  <10000  >10000    min      max    mean     med     n50         sum  file
ctg (all)                        90135*   51024   39111     65  1160130   29693    5124   95988  2676390147
no markers                        58728   48324   10404     65  322949*    6573    1597   21989   386064754
markers from 1+ Chr               31407    2700   28707    442  1160130   72924   52693  111252  2290325393  markers_ctg.Chr.count
markers from 2+ Chr                2987      25    2962   1002  1160130  132480  104807  179692   395718221  markers_ctg.Chr.count2+
2+ markers from 2+ Chr               26       0     26*  15228   604155  221354  192182  298848     5755227  markers_ctg.Chr.count2.2+
2+ adjacent markers from 2+ Chr      15       0    15**  15228   368879  202728  194749  294623     3040932  markers_ctg.Chr.count2+a
Scf vs markers summary:
                                   #scf  <10000  >10000    min       max     mean      med       n50         sum  file
scf(all)                         39978*   37135    2843    316  34167202    68129     1360   8217662  2723691675
no markers                        37338   36038    1300    316   754615*     2601     1336      3957    97140879
markers from 1+ Chr                2640    1097    1543   1000  34167202   994905    16220   8661690  2626550796  markers_scf.Chr.count
markers from 2+ Chr                 552      10     542   1002  34167202  4526814  2714036   9167014  2498801557  markers_scf.Chr.count2+
2+ markers from 2+ Chr              212       0    212*  15228  34167202  8579232  7358307  10521496  1818797327  markers_scf.Chr.count2.2+
2+ adjacent markers from 2+ Chr      38       0    38**  15228  25078118  8419544  7176534  13458592   319942681  markers_scf.Chr.count2+a
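The "markers from N+ Chr" rows group contigs and scaffolds by how many chromosomes their aligned markers come from. The Python sketch below shows one way to get such counts from (sequence id, chromosome) pairs. It reads "2+ markers from 2+ Chr" as "at least two chromosomes contribute at least two markers each", which is an interpretation of the row labels and not something the notes state. The "adjacent markers" rows are not covered, and `chromosomes_per_sequence` is a hypothetical name.

```python
from collections import Counter, defaultdict


def chromosomes_per_sequence(placements, min_markers=1):
    """Map each sequence id to the number of chromosomes that contribute
    at least `min_markers` markers to it."""
    per_seq = defaultdict(Counter)
    for seq_id, chrom in placements:
        per_seq[seq_id][chrom] += 1
    return {
        seq_id: sum(1 for n in counts.values() if n >= min_markers)
        for seq_id, counts in per_seq.items()
    }


# Toy placements, not taken from the marker files.
placements = [
    ("scfA", "4"), ("scfA", "4"), ("scfA", "11"),
    ("scfB", "14"), ("scfB", "14"), ("scfB", "26"), ("scfB", "26"),
    ("scfC", "24"),
]

any_marker = chromosomes_per_sequence(placements, min_markers=1)
two_markers = chromosomes_per_sequence(placements, min_markers=2)

# Sequences with markers from 2+ chromosomes, and the stricter subset where
# 2+ chromosomes each contribute 2+ markers.
multi_chr = sorted(s for s, n in any_marker.items() if n >= 2)
multi_chr_strict = sorted(s for s, n in two_markers.items() if n >= 2)
print(multi_chr, multi_chr_strict)
```

A single stray marker from another chromosome (scfA here) is weak evidence of a misjoin, which is presumably why the stricter rows exist.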
212* scaffolds
scf_id         scf_len   #Chr/2+markers  #Chr/2+adjmarkers  reads
7180002041381  31959312  15              0                  469503
7180002041358  25078118  13              2                  291163
7180002041386  21280754  12              0                  ...
1. 7180002041381 : no low cvg regions in the middle
- 1281 markers: 1231 on Chr4, 12 on Chr11
#1- mate cvg regions: at the ends !!!
#scfid         begin     end       scf_len   cvg_len  cvg
7180002041381  1         1173      31959312  1173     0
7180002041381  1174      1454      31959312  281      1
7180002041381  31959139  31959312  31959312  174      1
2. 7180002041358 : one low cvg region & real break
- 1111 markers: 869 on Chr14, 193 on Chr26, ...
#1- mate cvg regions: middle
#scfid         begin     end       scf_len   cvg_len  cvg
7180002041358  20970531  20970827  25078118  297      1
7180002041358  20970828  20970949  25078118  122      0
7180002041358  20970950  20971112  25078118  163      1
# markers in the regions
#scfid         begin     end       markerid  Chr
7180002041358  20964100  20964829  BZ839784  14
7180002041358  21002368  21003219  CC527932  26
3. 7180002041386: one low cvg region but no markers in that region
- 939 markers: 902 on Chr24, 3 on Chr6 ...
#1- mate cvg regions: at the ends
#scfid         begin     end       scf_len   cvg_len  cvg
7180002041386  26382     27557     21280754  1176     1
7180002041386  27558     27607     21280754  50       0
7180002041386  27608     28031     21280754  424      1
...
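The check applied to each scaffold above is: find the regions with 0 or 1 mate coverage, then see whether the nearest markers on either side come from different chromosomes (a likely real break, as for 7180002041358) or not. A minimal Python sketch of the flanking-marker lookup follows, using the region and markers listed for scaffold 7180002041358. The names `flanking_markers` and `looks_like_break` are hypothetical, and markers overlapping the region itself are simply ignored here.

```python
def flanking_markers(markers, begin, end):
    """Return the nearest marker ending before `begin` and the nearest marker
    starting after `end`. Markers are (begin, end, marker_id, chrom) tuples;
    either side is None when no marker exists there."""
    left = [m for m in markers if m[1] < begin]
    right = [m for m in markers if m[0] > end]
    left_marker = max(left, key=lambda m: m[1], default=None)
    right_marker = min(right, key=lambda m: m[0], default=None)
    return left_marker, right_marker


def looks_like_break(left_marker, right_marker):
    """True when both flanks have a marker and their chromosomes differ."""
    if left_marker is None or right_marker is None:
        return False
    return left_marker[3] != right_marker[3]


# Markers around the low-coverage region of scaffold 7180002041358.
markers = [
    (20964100, 20964829, "BZ839784", "14"),
    (21002368, 21003219, "CC527932", "26"),
]

# The zero-coverage interval from the table for that scaffold.
left_marker, right_marker = flanking_markers(markers, 20970828, 20970949)
is_break = looks_like_break(left_marker, right_marker)
print(left_marker[2], right_marker[2], is_break)
```

For a scaffold such as 7180002041386, where no markers fall near the low-coverage region, the lookup would return distant markers or `None`, so a distance cutoff would be a sensible addition before calling a break.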