Bos taurus redo
Revision as of 18:44, 12 February 2009
BCM
NCBI Data
- Genome Projects
- TA search
- TA ftp
- 91 volumes: 87 with qual & 4 with no quality (85 volumes contain BCM reads)
- 14 centers
- 21 center/trace_type_codes
- Avg LEN=984
- Avg CLIP (CLB intersect CLV)=760
- Avg CLV=997 > Avg LEN ???
- Avg QUAL=38.96 (27.51 for the 2.59M reads not in the UMD assembly)
- Avg UMDoverlapper CLIP=778
Problems:
- 650,133 reads have 0 quality values
- The quality lines in several qual files start with a space; it must be removed, otherwise tarchive2ca errors out, reporting len(quality)=len(seq)+1
- Several XML files contained a bare "&" character => XML parser error
- xml.bos_taurus.087 contained 2 trace_volumes => XML parser error
- BCCAGSC.CLONEEND : all reads have LIBRARY_ID=CH240, SEQ_LIB_ID=. ; the INSERT_SIZE & INSERT_STDEV vary within the library: set to 150,000 & 30,000
- UIUC.CLONEEND: INSERT_SIZE & INSERT_STDEV missing: set to 150,000 & 30,000
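The first two fixes above (leading space in quality lines, bare "&" in the XML) could be scripted roughly as follows. This is only a sketch: the function names are made up for illustration, and it assumes FASTA-style qual files in which header lines start with ">".

```python
import re

# "&" that does not already start an XML entity such as &amp; or &#38;
BARE_AMP = re.compile(r"&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);)")


def strip_leading_space(qual_lines):
    """Drop leading blanks from quality lines; header lines are left as-is."""
    return [ln if ln.startswith(">") else ln.lstrip(" \t") for ln in qual_lines]


def escape_bare_ampersands(xml_text):
    """Replace bare '&' with '&amp;' so that an XML parser accepts the text."""
    return BARE_AMP.sub("&amp;", xml_text)
```

Ampersands that already belong to an entity are left untouched, so running the fix twice does not double-escape.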
CENTER_NAME counts
    COUNT     CENTER_NAME
 1  35629020  BCM           Baylor College of Medicine
 2  737900    NISC          NIH Intramural Sequencing Center
 3  652614    BCCAGSC       British Columbia Cancer Agency Genome Sciences Centre   # TA query_tracedb CENTER_NAME = "BCCAGSC" => 652,510
 4  378871    MARC          USDA, ARS, US Meat Animal Research Center
 5  114753    UIUC          University of Illinois at Urbana-Champaign              # TA query_tracedb CENTER_NAME = "UIUC" => 106,368
 6  107367    BARC          USDA, ARS, Beltsville Agricultural Research Center
 7  65171     TIGR          The Institute for Genomic Research
 8  53556     GSC           Genoscope
 9  43033     CENARGEN      Embrapa Genetic Resources and Biotechnology
10  18623     SC            The Sanger Center
11  15301     UOKNOR        University of Oklahoma Norman Campus, Advanced Center for Genome Technology
12  10651     TIGR_JCVIJTC  The Institute for Genomic Research, Traces generated at JCVIJTC   # TA query_tracedb CENTER_NAME="JCVI"
13  2485      UIACBCB       University of Iowa Center for Bioinformatics and Computational Biology (UIACBCB)
14  49        WUGSC         Washington University, Genome Sequencing Center         # TA query_tracedb CENTER_NAME = "WUGSC" => 9
    37829394  total                                                                 # TA query_tracedb SPECIES_CODE = "BOS TAURUS" => 37,788,710
TRACE_TYPE_CODE counts
    COUNT     CENTER_NAME   TRACE_TYPE_CODE
 1  24863599  BCM*          WGS        SEQ_LIB_ID:89
 2  10748529  BCM*          SHOTGUN    SEQ_LIB_ID:15543
 3  737900    NISC          SHOTGUN    SEQ_LIB_ID:247
 4  125597    BCCAGSC       CLONEEND   LIBRARY_ID:1      large insert size; some qualityless; !!! almost all have CLIP3=0
 5  114753    UIUC          CLONEEND   LIBRARY_ID:2      insert size missing, no frequent kmers
 6  65171     TIGR          CLONEEND   SEQ_LIB_ID:1      2K & use TRACE_DIRECTION instead of TRACE_END
 7  53556     GSC           CLONEEND   SEQ_LIB_ID:1      large insert size; !!! all have qual=0 and were excluded
 8  26246     CENARGEN      WGS        .                 no LIBRARY_ID; no SEQ_LIB_ID; no INSERT_SIZE; no INSERT_STDEV; reads have no direction; ~21954 could be paired (same TEMPLATE_ID)
 9  25454     BARC          CLONEEND   SEQ_LIB_ID:14304  !!! all have CLIP3=0
10  16892     BCM*          CLONEEND   LIBRARY_ID:1      VBBAA mea=167000 std=25000
11  16787     CENARGEN      CLONEEND   LIBRARY_ID:1
12  15150     UOKNOR        SHOTGUN    LIBRARY_ID:1      some qualityless
13  10651     TIGR_JCVIJTC  CLONEEND   SEQ_LIB_ID:2
14  151       UOKNOR        FINISHING  LIBRARY_ID:1      some qualityless, no direction(TRACE_END=N); no INSERT_SIZE; no INSERT_STDEV
15  49        WUGSC         CLONEEND   SEQ_LIB_ID:1
    36820485  total

16  527017    BCCAGSC       EST
17  207204    MARC          EST
18  171667    MARC          PCR
19  81913     BARC          EST
20  18623     SC            EST
21  2485      UIACBCB       EST
    1008909   total
STRATEGY & TRACE_TYPE_CODE counts
COUNT     CENTER_NAME   STRATEGY       TRACE_TYPE_CODE
12545304  BCM           .              WGS
11425910  BCM           WGA            WGS
5223683   BCM           CLONE          SHOTGUN
4479883   BCM           POOLCLONE      SHOTGUN
1044963   BCM           .              SHOTGUN
892385    BCM           SNP            WGS
737900    NISC          CLONE          SHOTGUN
125597    BCCAGSC       CLONEEND       CLONEEND
114753    UIUC          CLONEEND       CLONEEND
65171     TIGR          CLONEEND       CLONEEND
53556     GSC           CLONEEND       CLONEEND
26246     CENARGEN      .              WGS
25454     BARC          .              CLONEEND
16892     BCM           CLONEEND       CLONEEND
16787     CENARGEN      CLONEEND       CLONEEND
12195     UOKNOR        .              SHOTGUN
10651     TIGR_JCVIJTC  CLONEEND       CLONEEND
2955      UOKNOR        CLONE          SHOTGUN
151       UOKNOR        .              FINISHING
49        WUGSC         CLONEEND       CLONEEND

527017    BCCAGSC       EST            EST
145820    MARC          EST            EST
117958    MARC          COMPARATIVE    PCR
81913     BARC          EST            EST
61384     MARC          CLONE          EST
53709     MARC          Re-Sequencing  PCR
18623     SC            EST            EST
2485      UIACBCB       .              EST
BCM.SHOTGUN libraries
SIZE    STDEV  COUNT
3500    1500   4502569
2000    1000   3244493
3000    1000   1021577
180000  1000   840528
6500    1500   429026
180000  13000  320208
6000    2000   208192
167000  13000  96337
3500    15000  85599

SIZE    COUNT
3500    4588168
2000    3244493
180000  1160736
3000    1021577
6500    429026
6000    208192
167000  96337
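The second table is the first one collapsed over STDEV. A minimal sketch of that aggregation, using the (SIZE, STDEV, COUNT) rows listed above; the function name is only illustrative:

```python
from collections import defaultdict

# (SIZE, STDEV, COUNT) rows of the BCM.SHOTGUN library table
LIBS = [
    (3500, 1500, 4502569), (2000, 1000, 3244493), (3000, 1000, 1021577),
    (180000, 1000, 840528), (6500, 1500, 429026), (180000, 13000, 320208),
    (6000, 2000, 208192), (167000, 13000, 96337), (3500, 15000, 85599),
]


def count_by_size(libs):
    """Sum read counts per insert SIZE, ignoring STDEV; largest count first."""
    totals = defaultdict(int)
    for size, _stdev, count in libs:
        totals[size] += count
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

For example, the two 3500 bp entries (STDEV 1500 and 15000) add up to 4,588,168 reads.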
3' VECTOR TRIMMED counts
    CENTER_NAME   TRACE_TYPE_CODE  TOTAL     3'CLV<LEN  QUAL==0      UMD.FRG
 1  BCM           WGS              24863599  10968979   551114       24050767
 2  BCM           SHOTGUN          10748529  5052692    23419        10068499
 3  NISC          SHOTGUN          737900    28972      0            735488
 4  BCCAGSC       CLONEEND         125597    125484     8926         113790
 5  UIUC          CLONEEND         114753    90243      0            106247
 6  TIGR          CLONEEND         65171     46389      0            64903
 7  GSC           CLONEEND         53556     53556      53556 (all)  0         !!! all have 0 quals and were excluded
 8  CENARGEN      WGS              26246     26246      0            25976
 9  BARC          CLONEEND         25454     25454      0            25387
10  BCM           CLONEEND         16892     6751       0            16863
11  CENARGEN      CLONEEND         16787     16787      0            16628
12  UOKNOR        SHOTGUN          15150     2885       12195        0
13  TIGR_JCVIJTC  CLONEEND         10651     339        0            10644
14  UOKNOR        FINISHING        151       0          151          151
15  WUGSC         CLONEEND         49        0          0            0
16  BCCAGSC       EST              527017    524173     772          0
17  MARC          EST              207204    207204     0            0
18  MARC          PCR              171667    171667     0            0
19  BARC          EST              81913     78597      0            0
20  SC            EST              18623     7350       0            0
21  UIACBCB       EST              2485      2485       0            0
ZERO QUALITY COUNTS
- Counts
CENTER_NAME  TRACE_TYPE_CODE  COUNT
BCM          WGS              551114
GSC          CLONEEND         53556
BCM          SHOTGUN          23419
UOKNOR       SHOTGUN          12195
BCCAGSC      CLONEEND         8926
BCCAGSC      EST              772
UOKNOR       FINISHING        151
TOTAL                         650133
- For 0 quality reads, assign quality 20 to bases 1..700 and quality 0 to bases 701 onward
- Volumes 026..039 have been fixed
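The placeholder-quality rule can be written down as a short sketch. The function and parameter names are hypothetical; the values 700 and 20 are the ones stated in the list above.

```python
def placeholder_quals(seq_len, good_len=700, good_q=20):
    """Qualities for a read without quality data: good_q for the first
    good_len bases (1-based positions 1..good_len), 0 for the rest."""
    n_good = min(seq_len, good_len)
    return [good_q] * n_good + [0] * (seq_len - n_good)
```

Reads shorter than 700 bp simply get quality 20 along their whole length.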
Local Data
Files & Dirs
/fs/szasmg3/bos_taurus/data/
/fs/szasmg2/Drosophila/D_pseudoobscura/Vectors
/nfshomes/dpuiu/db/UniVec
Software
Figaro
- trims vector only at 5' end
- calls Lucy for quality trimming
Lucy
- both vector sequence and splice sites are required
Atlas
- web site
- atlas-screen-trim-file : "calls cross_match and atlas-screen-window to create trimmed reads file (scan in from each end of read looking for 50-base windows of high quality and no vector); "
Contaminant search
Align the CLIPPING range of the reads to UniVec & E. coli K12 with nucmer
UniVec
Ref
             #seqs  min  max    mean  median  n50  sum
UniVec       2861   12   48551  231   99      781  660,151
UniVec_Core  1348   12   48551  243   98      967  327,641
Hits: alignment length
bp   #reads   min  max   mean   median  n50  sum
19   4548466  19   1045  28.37  23      27   129025025
20   3684852  20   1045  30.56  25      28   112616359
30   1097357  30   1045  48.04  38      43   52714583
40   484661   40   1045  66.36  47      53   32163896
100  54334    100  1045  198    116     223  10772815   # many are ESTs
Ecoli
Ref:
K12 4,639,675 bp
Hits: alignment length
bp   #reads  min  max   mean   median  n50  sum
19   275109  19   1223  30.66  19      20   8435470
20   102550  20   1223  50.29  21      161  5156849
30   19032   30   1223  178    37      706  3381214
40   9234    40   1223  329    171     738  3034293
100  6781    100  1223  424    223     749  2876432
200  4378    200  1223  575    696     771  2516916
BCM vectors
     #seqs  min   max    mean  median  n50    sum
BCM  14     2580  33180  9379  5821    32705  131312
Vector/Splice site search
Strategy
- 1. Select all the reads in the same volume that belong to one particular library; same CENTER_NAME, STRATEGY & TRACE_TYPE_CODE
- 2. Get the quality clipping points: CLIP_QUALITY_LEFT & CLIP_QUALITY_RIGHT
- 3. Separate reads in 2 sets according to direction TRACE_END: FORWARD & REVERSE
- 4. Get the most frequent kmers in each set (24 & 8 bp)
- 5. Check if the most frequent kmers are overrepresented
- 6. Check if the most frequent 8mers are present in the most frequent 24mers
- 7. Try to extend the 24mers by a few bp => linkers
- 8. Align linkers to the opposite strand sequences using nucmer
- 9. Extract the subsequences adjacent(following) to linker (50..150bp)
- 10. Align the subsequences; if they align we've probably identified the vector
- 11. Identify the vector name/id by alignment to UniVec => several alignments
- 12. Check if the forward/reverse vector(s) are the same : we should find a common vector sequence; the UniVec alignments should be adjacent
- 13. create the Lucy vector & splice files; the splice contains the linker+vector
- 14. run lucy & trim input reads according to Lucy clr
- 15. align lucy trimmed reads to linker,vector,splice & UniVec.dust
- 16. align input reads to linker,vector,splice & UniVec.dust
- 17. compare the 15. & 16. counts
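Steps 4 and 6 of the strategy can be sketched in a few lines. This is an illustration only, not the script used here: the function names are hypothetical, and whether the original kmers.tab merges a kmer with its reverse complement when counting is not stated, so this version counts kmers as read and only reports the reverse complement next to each one.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string (upper-case A/C/G/T)."""
    return seq.translate(COMPLEMENT)[::-1]


def top_kmers(reads, k, n=10):
    """Step 4: the n most frequent k-mers as (kmer, revcomp, count) rows."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return [(kmer, revcomp(kmer), c) for kmer, c in counts.most_common(n)]


def short_in_long(short_kmers, long_kmers):
    """Step 6: frequent short k-mers that occur inside a frequent long
    k-mer, on either strand."""
    return [s for s in short_kmers
            if any(s in l or s in revcomp(l) for l in long_kmers)]
```

The overrepresentation check of step 5 is left out; it needs a choice of background model that the page does not specify.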
Example
- 1. volume 011 : 500,000 reads CENTER_NAME=BCM, TRACE_TYPE_CODE=WGS
- 2.
- 3. 249,611 TRACE_END=F & 250,389 TRACE_END=R
- 4. kmers: 8 of the 10 most frequent 8bp kmers are shared by the FORWARD & REVERSE sets; no 24bp kmers are shared
==> 24.fwd/kmers.tab <==
AGTTCGACTGCAAGTAGTTCATCA  TGATGAACTACTTGCAGTCGAACT  2463   # contains AGTAGTTC
GAGTTCGACTGCAAGTAGTTCATC  GATGAACTACTTGCAGTCGAACTC  2189
CGAGTTCGACTGCAAGTAGTTCAT  ATGAACTACTTGCAGTCGAACTCG  1996
TCGAGTTCGACTGCAAGTAGTTCA  TGAACTACTTGCAGTCGAACTCGA  1593
GTTCGACTGCAAGTAGTTCATCAA  TTGATGAACTACTTGCAGTCGAAC  1023
GAGTTCGACTGCAGTAGTTCATCA  TGATGAACTACTGCAGTCGAACTC  812
CGAGTTCGACTGCAGTAGTTCATC  GATGAACTACTGCAGTCGAACTCG  777
GTTCGACTGCAAGTAGTTCATCAT  ATGATGAACTACTTGCAGTCGAAC  769
TCGAGTTCGACTGCAGTAGTTCAT  ATGAACTACTGCAGTCGAACTCGA  637
ATCGAGTTCGACTGCAAGTAGTTC  GAACTACTTGCAGTCGAACTCGAT  594

==> 08.fwd/kmers.tab <==
AGTAGTTC  GAACTACT  86477
CAGTAGTT  AACTACTG  67681
AGTTCTCA  TGAGAACT  61556
TAGTTCTC  GAGAACTA  60964
GTAGTTCT  AGAACTAC  57866
AGTTCATC  GATGAACT  49676
TAGTTCAT  ATGAACTA  45298
GTTCATCA  TGATGAAC  42117
GCAGTAGT  ACTACTGC  41391
GTAGTTCA  TGAACTAC  40694

==> 24.rev/kmers.tab <==
TATCGATGGTACAGTAGTTCATCA  TGATGAACTACTGTACCATCGATA  999    # contains AGTAGTTC
CTATCGATGGTACAGTAGTTCATC  GATGAACTACTGTACCATCGATAG  774
GCTATCGATGGTACAGTAGTTCAT  ATGAACTACTGTACCATCGATAGC  600
CGCTATCGATGGTACAGTAGTTCA  TGAACTACTGTACCATCGATAGCG  432
ATCGATGGTACAGTAGTTCATCAT  ATGATGAACTACTGTACCATCGAT  417
ATCGATGGTACAGTAGTTCATCAA  TTGATGAACTACTGTACCATCGAT  380
ATCAGATGGTACAGTAGTTCATCA  TGATGAACTACTGTACCATCTGAT  373
ATCGATGGTACAGTAGTTCATCAC  GTGATGAACTACTGTACCATCGAT  265
CTATCGATGGTAAGTAGTTCATCA  TGATGAACTACTTACCATCGATAG  235
TCAGATGGTACAGTAGTTCATCAA  TTGATGAACTACTGTACCATCTGA  224

==> 08.rev/kmers.tab <==
AGTTCATC  GATGAACT  85127
TAGTTCAT  ATGAACTA  77902
GTTCATCA  TGATGAAC  75585
TAGTTCTC  GAGAACTA  68057
AGTTCTCA  TGAGAACT  67277
GTAGTTCT  AGAACTAC  64894
GTAGTTCA  TGAACTAC  62607
CGTAGTTC  GAACTACG  52031
AGTAGTTC  GAACTACT  51013
ACGTAGTT  AACTACGT  31552
- 7. Get linker sequences
>linker.fwd 27bp
TCGAGTTCGACTGCAAGTAGTTCATCA
>linker.rev 27bp
CTAATCAGATGGTACAGTAGTTCATCA
#>linker.rev 40 bp Art's (13 more bp at 5')
#TATGACCATGCGCCTAATCAGATGGTACAGTAGTTCATCA
#GCTATCGATGGTACAGTAGTTCATCAT is the most frequent rev seq 27 kmers but not the linker (few snp differences)
- 8 & 9 Align reads to linkers using nucmer
Fwd:
nucmer -l 12 -c 24 -r linker.fwd.seq ../bos_taurus.$v.r.fasta
# nucmer -l 12 -c 24 -r kmers.seq ../bos_taurus.$v.r.fasta
show-coords out.delta | awk '{print $19,$5,$13}' > ! out.clr
extractfromfastanames.pl -clr -f out.clr < ../bos_taurus.$v.r.fasta >! out.seq
Rev:
nucmer -l 12 -c 24 -r linker.rev.seq ../bos_taurus.$v.f.fasta
# nucmer -l 12 -c 24 -r kmers.seq ../bos_taurus.$v.f.fasta
show-coords out.delta | awk '{print $19,$5,$13}' > ! out.clr
extractfromfastanames.pl -clr -f out.clr < ../bos_taurus.$v.f.fasta >! out.seq
Both:
clrFasta out.seq >! out.cseq
fasta2tab.pl out.cseq | sort -k2 > ! out.tab
nucmer -c 40 out.cseq ~/db/UniVec -p vector
delta-filter -q vector.delta >! vector.filter-q.delta
show-coords vector.filter-q.delta | sort -n | head
cat vector.filter-q.delta | grep "^>" | count.pl -c 1 -m 2
- 10. Extract "vector reads"
>399553028 # 24.fwd
TGATGAACTACTGTACCATCTGATTAGGCGCATGGTCATAGCTGTTTCCTGTGTGAAATT
GCTATCCGCTCACAATTCCACACAACATACGAGCCGGAAGCATAAAGTGTAAAGCCTGGG
GTGTCAAATGAGAGACCTAACTCACATTCAACTTTTTTTTTTTTTCTGCCCTCTATTCTA
...
>400269118 #24.rev
TGATGAACTACTTGCAGTCGAAATCGAATCATCACTGGCCGTCCTTTTACAACGTCGTGA
CTGGGAAAACCCTGGCGTTACCCAACTTAATCCGCCTTGCAGCACATCCCCCTTTCCCCC
AGCTGGCGTAAAAACGTAAAAAGCCCCGCACCGATCGCCCTTTCCCAACAGGTTGCCCAG
- 11. Align "vector reads" to UniVec; identify vector
show-coords 24.fwd/400269118-UniVec.delta 24.rev/399553028-UniVec.delta | grep J01636.1
31 148 | 1175 1292 | 118 118 | 95.76 | 1276 7477 | 9.25 1.58 | 399553028.rev gnl|uv|J01636.1:1-7477
32 199 | 1302 1463 | 168 162 | 90.48 | 653 7477 | 25.73 2.17 | 400269118 gnl|uv|J01636.1:1-7477
- 12. 10bp distance between the 2 alignments
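The 10bp figure follows from the vector coordinates of the two UniVec alignments above (1175..1292 and 1302..1463). A small sketch of the adjacency check; the function name is hypothetical, and "distance" is taken here as start of the second alignment minus end of the first, which is the convention that reproduces the 10bp:

```python
def alignment_distance(a, b):
    """Distance between two alignments given as (start, end) coordinates on
    the same reference; coordinates may be in either order. A negative
    result means the alignments overlap."""
    (s1, e1), (s2, e2) = sorted([tuple(sorted(a)), tuple(sorted(b))])
    return s2 - e1
```

A small positive distance suggests that the forward and reverse reads hit the same vector on either side of the cloning site.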
- 13. Lucy files
$ more vector.seq
>J01636 E.coli lactose operon with lacI, lacZ, lacY and lacA genes
GACACCATCGAATGGCGCAAAACCTTTCGCGGTATGGCATGATAGCGCCCGGAAGAGAGTCAATTCAGGG
TGGTGAATGTGAAACCAGTAACGTTATACGATGTCGCAGAGTATGCCGGTGTCTCTTATCAGACCGTTTC
CCGCGTGGTGAACCAGGCCAGCCACGTTTCTGCGAAAACGCGGGAAAAAGTGGAAGCGGCGATGGCGGAG
CTGAATTACATTCCCAACCGCGTGGCACAACAACTGGCGGGCAAACAGTCGTTGCTGATTGGCGTTGCCA
...

$ more splice.seq
>J01636.for.begin vector+linker.rev
TGAATGTGAGTTAGGTCTCTCATTTGACACCCCAGGCTTTACACTTTATGCTTCCGGCTC
GTATGTTGTGTGGAATTGTGAGCGGATAGCAATTTCACACAGGAAACAGCTATGACCATG
CGCCTAATCAGATGGTACAGTAGTTCATCA
>J01636.for.end rev(linker.fwd)+vector
TGATGAACTACTTGCAGTCGAAATCGAATCATCACTGGCCGTCCTTTTACAACGTCGTGA
CTGGGAAAACCCTGGCGTTACCCAACTTAATCCGCCTTGCAGCACATCCCCCTTTCCCCC
AGCTGGCGTAAAAACGTAAAAAGCCCCGCA
>J01636.rev.begin (revcomp of J01636.for.end)
TGCGGGGCTTTTTACGTTTTTACGCCAGCTGGGGGAAAGGGGGATGTGCTGCAAGGCGGA
TTAAGTTGGGTAACGCCAGGGTTTTCCCAGTCACGACGTTGTAAAAGGACGGCCAGTGAT
GATTCGATTTCGACTGCAAGTAGTTCATCA
>J01636.rev.end (revcomp of J01636.for.begin)
TGATGAACTACTGTACCATCTGATTAGGCGCATGGTCATAGCTGTTTCCTGTGTGAAATT
GCTATCCGCTCACAATTCCACACAACATACGAGCCGGAAGCATAAAGTGTAAAGCCTGGG
GTGTCAAATGAGAGACCTAACTCACATTCA

# splice=linker+vector
3 120 | 1175 1292 | 118 118 | 95.76 | 150 7477 | 78.67 1.58 | J01636.for.begin J01636
32 131 | 1302 1399 | 100 98 | 96.00 | 150 7477 | 66.67 1.31 | J01636.for.end J01636
- 13.1 Align vector & splice to Ecoli
1 7474 | 366812 359335 | 7474 7478 | 99.91 | 7477 4639675 | 99.96 0.16 | J01636 NC_000913.2 [CONTAINED]
20 119 | 65 162 | 100 98 | 96.00 | 150 395 | 66.67 24.81 | J01636.rev.begin NC_000913.2
31 148 | 172 289 | 118 118 | 95.76 | 150 395 | 78.67 29.87 | J01636.rev.end NC_000913.2
1069 1463 | 395 1 | 395 395 | 100.00 | 7477 395 | 5.28 100.00 | J01636 NC_000913.2.365350-365744
- 14. Run lucy & trim reads
$ /nfshomes/dpuiu/szdevel/SourceForge/lucy-1.19p/lucy \
    -v vector.seq splice.seq -o bos_taurus.lucy.seq bos_taurus.lucy.qual \
    -debug bos_taurus.lucy.info \
    bos_taurus.seq bos_taurus.qual

# Trim clr
$ clrFasta bos_taurus.seq > bos_taurus.cseq
- 15. Align lucy output to linker, vector, splice & UniVec.dust
$ nucmer -l 12 -c 24 ~/db/linker.seq bos_taurus.lucy.cseq -p linker-bos_taurus.lucy
$ nucmer -l 16 -c 30 ~/db/vector.seq bos_taurus.lucy.cseq -p vector-bos_taurus.lucy
$ nucmer -l 16 -c 30 ~/db/splice.seq bos_taurus.lucy.cseq -p splice-bos_taurus.lucy
$ nucmer -l 16 -c 30 ~/db/UniVec.dust bos_taurus.lucy.cseq -p UniVec.dust-bos_taurus.lucy
- 16. Align input to linker, vector, splice & UniVec.dust
$ nucmer -l 12 -c 24 ~/db/linker.seq bos_taurus.seq -p linker-bos_taurus
$ nucmer -l 16 -c 30 ~/db/vector.seq bos_taurus.seq -p vector-bos_taurus
$ nucmer -l 16 -c 30 ~/db/splice.seq bos_taurus.seq -p splice-bos_taurus
$ nucmer -l 16 -c 30 ~/db/UniVec.dust bos_taurus.seq -p UniVec.dust-bos_taurus
Count how many reads got trimmed
infoseq *seq | getSummary.pl -c 1 -t original.LEN
cat bos_taurus.lucy.info | awk '{print $4-$3}' | getSummary.pl -t lucy.CLR >! bos_taurus.lucy.summary
cat bos_taurus.lucy.info | getSummary.pl -c 14 -t lucy.CLV5 -nh >> bos_taurus.lucy.summary
cat bos_taurus.lucy.info | getSummary.pl -c 15 -t lucy.CLV3 -nh >> bos_taurus.lucy.summary
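The summary columns used in the tables on this page (min, max, mean, median, n50, sum) can be reproduced with a sketch like the one below. getSummary.pl itself is not shown here, so the N50 definition is an assumption: the value at which the running sum, taken from the largest value down, first reaches half of the total.

```python
from statistics import median


def summarize(values):
    """Summary statistics for a non-empty list of numbers."""
    vals = sorted(values)
    total = sum(vals)
    running, n50 = 0, None
    for v in reversed(vals):          # largest value first
        running += v
        if 2 * running >= total:      # reached half of the total
            n50 = v
            break
    return {"elem": len(vals), "min": vals[0], "max": vals[-1],
            "mean": total / len(vals), "median": median(vals),
            "n50": n50, "sum": total}


def clr_lengths(info_lines):
    """Clear-range lengths from lucy .info lines, mirroring awk '{print $4-$3}'."""
    return [int(f[3]) - int(f[2]) for f in (ln.split() for ln in info_lines) if len(f) >= 4]
```

Under this definition the N50 is pulled toward the long reads, which is why it sits above the median in most of the tables.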
Libraries
011.BCM.WGS FORWARD
- vector: J01636
- UniVec: gnl|uv|J01636.1:1-7477 E.coli lactose operon with lacI, lacZ, lacY and lacA genes
ll ~dpuiu/db/J01636*
-rw-rw-r-- 1 dpuiu dpuiu 7651 Jan  9 15:56 /nfshomes/dpuiu/db/J01636
-rw-rw-r-- 1 dpuiu dpuiu  105 Jan 14 07:17 /nfshomes/dpuiu/db/J01636linker
-rw-rw-r-- 1 dpuiu dpuiu  840 Jan 13 13:43 /nfshomes/dpuiu/db/J01636splice

cat ~dpuiu/db/J01636* | infoseq
J01636             7477  53.43
J01636.linker.fwd  27    44.44
J01636.linker.rev  27    37.04
J01636.for.begin   150   44.67
J01636.for.end     150   51.33
J01636.rev.begin   150   51.33
J01636.rev.end     150   44.67
- 249,611 reads:
- 91% got vector trimmed at the 5'
- 0.4% (1149) got vector trimmed at the 3'
              #elem   #0s     min  max   mean   median  n50   sum
original.LEN  249611  0       437  2349  1082   991     1009  270035781
lucy.CLV5     249611  21215   0    741   25.03  25      27    6247415
lucy.CLV3     249611  248462  0    1047  3.49   0       859   870344
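The trimmed percentages quoted above can be derived from the #elem and #0s columns: a read counts as vector trimmed when its lucy.CLV5 (or lucy.CLV3) value is non-zero. A sketch of that arithmetic, with a hypothetical function name:

```python
def trimmed_fraction(n_reads, n_zero):
    """Number and percentage of reads with a non-zero vector clip."""
    trimmed = n_reads - n_zero
    return trimmed, 100.0 * trimmed / n_reads
```

With the numbers from the table, `trimmed_fraction(249611, 21215)` gives 228,396 reads for the 5' end and `trimmed_fraction(249611, 248462)` gives the 1,149 reads quoted for the 3' end.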
- Original reads hit counts:
10975  linker.fwd
133    linker.rev
166    splice
152    vector
228    UniVec.dust
- Lucy trimmed read counts
2  linker.fwd
0  linker.rev
1  splice
1  vector
6  UniVec.dust (only 3 are >40bp)
011.BCM.WGS REVERSE
              #elem   #0s     min  max   mean   median  n50   sum
original.LEN  250389  0       502  2148  1085   993     1012  271691094
lucy.CLR      250389  7345    0    1281  795    876     892   198982171
lucy.CLV5     250389  20271   0    668   26.52  27      29    6641362
lucy.CLV3     250389  249269  0    997   3.35   0       861   839029
- Original reads hit counts:
linker.fwd   113
linker.rev   3812
splice       143
UniVec.dust  237
vector       4318
- Lucy trimmed reads hit counts:
linker.fwd   1
linker.rev   0
splice       1
UniVec.dust  10
vector       1
030.BCM.SHOTGUN
- same linker/vector/splice as BCM.WGS
- 2.5% (4K out of 160K) reads contain linker & vector at 3'
              #elem  #0s   min  max   mean   median  n50   sum
original.LEN  8411   0     325  1685  1181   1240    1314  9933150
lucy.CLR      8411   8     0    1054  841    863     874   7070994
lucy.CLV5     8411   568   0    232   27.01  28      29    227206
lucy.CLV3     8411   2325  0    1040  597    794     851   5023445
- Original reads hit counts:
linker.fwd   4314
linker.rev   4125
splice       7816
UniVec.dust  4212
vector       6750
vector       27235
- Lucy trimmed reads hit counts:
linker.fwd   3
linker.rev   1
splice       1
UniVec.dust  13
vector       0
001.NISC.SHOTGUN
- Vector: pOTW13
- UniVec: 3 partial seqs
gnl|uv|NGB00080.1:1-198 pOTW13 with linkers
gnl|uv|NGB00080.1:718-888 pOTW13 with linkers
gnl|uv|NGB00080.1:1490-1654-49 pOTW13 with linkers

ll /nfshomes/dpuiu/db/NGB00080*
-rw-rw-r-- 1 dpuiu dpuiu 1083 Jan 14 20:43 /nfshomes/dpuiu/db/NGB00080
-rw-r--r-- 1 dpuiu dpuiu   94 Jan 14 21:01 /nfshomes/dpuiu/db/NGB00080linker
-rw-r--r-- 1 dpuiu dpuiu 2183 Jan 14 20:44 /nfshomes/dpuiu/db/NGB00080splice

cat /nfshomes/dpuiu/db/NGB00080* | infoseq
NGB00080             1054  50.00
NGB00080.linker.fwd  24    45.83
NGB00080.linker.rev  26    53.85
NGB00080.for.beg     518   46.14
NGB00080.for.end     518   50.48
NGB00080.rev.begin   518   50.48
NGB00080.rev.beg     518   46.14
- 944 read sample
              #elem  #0s  min  max   mean   median  n50  sum
original.LEN  944    0    652  1017  735    721     722  693668
lucy.CLR      944    39   0    886   415    422     522  391333
lucy.CLV5     944    121  0    275   34.05  33      35   32143
lucy.CLV3     944    18   0    885   410    409     511  387007
- Original reads hit counts:
linker.fwd   479
linker.rev   492
splice       910
UniVec.dust  0
vector       939
- Lucy trimmed reads hit counts:
linker.fwd   1
linker.rev   0
splice       0
UniVec.dust  9
vector       1
060.BCCAGSC.CLONEEND
- Linkers:
linker.fwd  CCCTGCTTTGTCTGGAAGGGGTTCCCGACCT
linker.rev  CAGGAGGGGAGAAAGGGCTCAGAGG
- No common vector !!!
wc -l *clb
60746 bos_taurus.060.f.clb   #18 reads original align to UniVec (nucmer default params)
60836 bos_taurus.060.r.clb

Fwd:
329 428 | 440 535 | 100 96 | 91.00 | 503 1585 | 19.88 6.06 | 723951410 gnl|uv|U30497.1:3230-4814 Cloning vector pAS2-1
330 370 | 89 49 | 41 41 | 100.00 | 503 143 | 8.15 28.67 | 723951410 gnl|uv|U67875.1:6541-6683 pESP-I yeast expression vector
330 370 | 94 54 | 41 41 | 100.00 | 503 143 | 8.15 28.67 | 723951410 gnl|uv|U67875.1:6541-6683 pESP-I yeast expression vector

Rev:
1 96 | 71 165 | 96 95 | 93.81 | 203 165 | 47.29 57.58 | 724018013 gnl|uv|AF133437.1:16659-16823 Cloning vector pCYPAC6
50 143 | 1 94 | 94 94 | 92.71 | 203 94 | 46.31 100.00 | 724018013 gnl|uv|U80929.2:2858-2951 Cloning vector pBACe3.6
017.UIUC.CLONEEND
- No overrepresented kmers
wc -l *clb
17978 bos_taurus.017.f.clb
17911 bos_taurus.017.r.clb

==> 24.fwd/kmers.tab <==
CCCTGCTTTGTCTGGAAGGGGTTC  GAACCCCTTCCAGACAAAGCAGGG  9
CTGCTTTGTCTGGAAGGGGTTCCC  GGGAACCCCTTCCAGACAAAGCAG  9
==> 24.rev/kmers.tab <==
GAATGTTGAGCTTTAGCCAACTTT  AAAGTTGGCTAAAGCTCAACATTC  4
TCTGAATGTTGAGCTTTAGCCAAC  GTTGGCTAAAGCTCAACATTCAGA  4
==> 8.fwd/kmers.tab <==
TTTTTTTT  AAAAAAAA  55
AAGGGGTT  AACCCCTT  35
==> 8.rev/kmers.tab <==
GTCTGGAA  TTCCAGAC  41
TCTGGAAG  CTTCCAGA  39
- No UniVec hits
010.TIGR.CLONEEND
- No overrepresented kmers
wc -l *clb
5479 bos_taurus.032.f.clb
5174 bos_taurus.032.r.clb

==> 24.fwd/kmers.tab <==
CTTGTGTTGGCCCAGGCAAGTCCA  TGGACTTGCCTGGGCCAACACAAG  30
TTGTGTTGGCCCAGGCAAGTCCAA  TTGGACTTGCCTGGGCCAACACAA  30
==> 24.rev/kmers.tab <==
CTGCCTCTTGTGTTGGCCCAGGCA  TGCCTGGGCCAACACAAGAGGCAG  16
GCTGCCTCTTGTGTTGGCCCAGGC  GCCTGGGCCAACACAAGAGGCAGC  15
==> 8.fwd/kmers.tab <==
GAGTGGGT  ACCCACTC  176
GGAGTGGG  CCCACTCC  171
==> 8.rev/kmers.tab <==
TGGAGTGG  CCACTCCA  182
GGAGTGGG  CCCACTCC  181
- No UniVec hits
...
070.BCM.CLONEEND
- No frequent kmers
wc -l *clb
6027 bos_taurus.070.f.clb
6236 bos_taurus.070.r.clb

==> 24.fwd/kmers.tab <==
GGACTCTCAGAGTCTTCTCCAACA  TGTTGGAGAAGACTCTGAGAGTCC  18
ACTGGTTGGATCTCCTTGCAGTCC  GGACTGCAAGGAGATCCAACCAGT  18
==> 24.rev/kmers.tab <==
ATAAAATCTGAGCCACCAGGGAAG  CTTCCCTGGTGGCTCAGATTTTAT  1
CTATTGGTTCATATGGTCAACGTC  GACGTTGACCATATGAACCAATAG  1
==> 8.fwd/kmers.tab <==
TTTTTTTT  AAAAAAAA  86
CTTCTCCA  TGGAGAAG  75
==> 8.rev/kmers.tab <==
TATAGTGT  ACACTATA  9
ATATAGGG  CCCTATAT  8
- No alignments to BCM WGS vector
Running Lucy
- Default parameters with vector trimming
- BCM vector/splice
/nfshomes/dpuiu/db/vector.BCM.seq
/nfshomes/dpuiu/db/splice.BCM.seq
- NISC vector/splice
/nfshomes/dpuiu/db/vector.NISC.seq
/nfshomes/dpuiu/db/splice.NISC.seq
BCM.WGS (all reads)
- orig.CLR < lucy.CLR ( 765 < 792 )
- orig.CLV > lucy.CLV ( 1015 > 973 )
- 739,529 out of 24,863,599 reads (3%) deleted by Lucy (CLR=-1,-1)
- 21,728,592 out of 24,863,599 reads (87%) vector trimmed at the 5' end
- 92,646 out of 24,863,599 reads (0.37%) vector trimmed at the 3' end
                     elem      <0        0         >0        min    max   mean  median  n50    sum
orig.LEN             24863599  0         0         24863599  5      3097  1002  997     1015   24915462033
orig.CLR             24863599  463669    7         24399923  -1143  1833  765   836     864    19036744256
orig.CLR5            24863599  0         359245    24504354  0      2103  42    22      58     1047922451
orig.CLR3            24863599  463404    0         24400195  -1     2169  807   872     895    20084666707
lucy.CLR             24863599  0         739529    24124070  0      1219  792   878     904    19695000417
lucy.CLR5            24863599  739529    36108     24087962  -1     1753  43    29      42     1086413880
lucy.CLR3            24863599  739529    0         24124070  -1     1894  835   915     939    20781414297
orig.CLR5-lucy.CLR5  24863599  16299521  215345    8348733   -1186  2104  -1    -10     -1186  -38491429
orig.CLR3-lucy.CLR3  24863599  14858542  1494794   8510263   -1273  2170  -28   -20     -1273  -696747590
orig.CLV             24863599  1053      1920      24860626  -2     5345  1015  1002    1017   25260581538
orig.CLV5            8841849   0         0         8841849   1      1219  33    46      49     295011460
orig.CLV3            24861698  1053      0         24860645  -1     5346  1027  1005    1019   25555592998
lucy.CLV             24863599  10694     707       24852198  -469   3096  973   968     987    24195085877
lucy.CLV5            24863599  0         3135007   21728592  0      1359  25    27      29     623457486
lucy.CLV3            24863599  0         0         24863599  4      3096  998   995     1014   24818543363
lucy.CLVABS5         24863599  0         3135007   21728592  0      1359  25    27      29     623457486
lucy.CLVABS3         24863599  0         24770953  92646     0      1343  2     0       880    72055071
orig.CLV5-lucy.CLV5  24863599  17216820  1512453   6134326   -1312  1219  -13   -25     -1312  -328446026
orig.CLV3-lucy.CLV3  24863599  1519132   18579609  4764858   -1832  4672  29    0       479    737049635
BCM.WGS (0 quality reads)
- orig.CLR > lucy.CLR (mean)
- orig.CLV > lucy.CLV (mean)
- 7,153 out of 551,114 reads (1.3%) deleted by Lucy (CLR=-1,-1)
- 508,166 out of 551,114 reads (92%) vector trimmed at the 5' end
- 1,946 out of 551,114 reads (0.35%) vector trimmed at the 3' end
                     elem    <0      0       >0      min   max   mean  median  n50   sum
orig.LEN             551114  0       0       551114  5     1464  872   946     959   480705828
orig.CLR             551114  7754    0       543360  -770  1175  708   786     807   390325117
orig.CLR5            551114  0       6773    544341  0     1519  44    20      111   24582849
orig.CLR3            551114  7744    0       543370  -1    1638  752   818     833   414907966
lucy.CLR             551114  0       7153    543961  0     699   636   671     671   350759771
lucy.CLR5            551114  7153    35872   508089  -1    201   26    27      28    14442310
lucy.CLR3            551114  7153    0       543961  -1    699   662   699     699   365202081
orig.CLR5-lucy.CLR5  551114  364282  8801    178031  -198  1500  18    -8      215   10140539
orig.CLR3-lucy.CLR3  551114  85058   2962    463094  -700  1472  90    123     178   49705885
orig.CLV             551114  971     0       550143  -2    2037  974   978     981   537127121
orig.CLV5            5100    0       0       5100    1     845   35    29      31    180490
orig.CLV3            551114  971     0       550143  -1    2037  974   978     981   537307611
lucy.CLV             551114  58      6       551050  -84   1456  841   917     930   463903233
lucy.CLV5            551114  0       42948   508166  0     202   27    28      29    14964546
lucy.CLV3            551114  0       0       551114  4     1463  868   945     958   478867779
lucy.CLVABS5         551114  0       42948   508166  0     202   27    28      29    14964546
lucy.CLVABS3         551114  0       549168  1946    0     700   2     0       686   1286935
orig.CLV5-lucy.CLV5  551114  506108  42215   2791    -202  845   -26   -28     -202  -14784056
orig.CLV3-lucy.CLV3  551114  134959  23422   392733  -967  1614  106   7       459   58439832
BCM.SHOTGUN
- orig.CLR < lucy.CLR (mean)
- orig.CLV > lucy.CLV (mean)
- 98,070 out of 10,748,529 reads (0.9%) deleted by Lucy (CLR=-1,-1)
- 9,737,008 out of 10,748,529 reads (90%) vector trimmed at the 5' end
- 294,942 out of 10,748,529 reads (2.7%) vector trimmed at the 3' end
                     elem      <0       0         >0        min    max   mean  median  n50    sum
orig.LEN             10748529  0        0         10748529  5      2043  975   950     964    10486690472
orig.CLR             10748529  17308    2         10731219  -1293  1467  809   833     847    8701344571
orig.CLR5            10748529  0        68        10748461  0      1315  26    16      38     288662580
orig.CLR3            10748529  16780    0         10731749  -1     1647  836   851     863    8990007151
lucy.CLR             10748529  0        98070     10650459  0      1337  833   854     868    8955866769
lucy.CLR5            10748529  98070    1973      10648486  -1     1307  35    28      32     376276188
lucy.CLR3            10748529  98070    0         10650459  -1     1553  868   882     896    9332142957
orig.CLR5-lucy.CLR5  10748529  9498290  65171     1185068   -1099  1293  -8    -11     -1099  -87613608
orig.CLR3-lucy.CLR3  10748529  6879532  671097    3197900   -1149  1437  -31   -26     -1149  -342135806
orig.CLV             10748529  16779    412       10731338  -2     3919  974   948     964    10472347908
orig.CLV5            8594910   0        0         8594910   1      1239  3     1       49     28350257
orig.CLV3            10748349  16779    0         10731570  -1     3919  976   950     965    10500698165
lucy.CLV             10748529  7026     614       10740889  -268   2042  930   924     940    9997862132
lucy.CLV5            10748529  0        1011521   9737008   0      855   24    24      27     257993796
lucy.CLV3            10748529  0        0         10748529  4      2042  954   945     962    10255855928
lucy.CLVABS5         10748529  0        1011521   9737008   0      855   24    24      27     257993796
lucy.CLVABS3         10748529  0        10453587  294942    0      1214  20    0       847    220086015
orig.CLV5-lucy.CLV5  10748529  9538738  138680    1071111   -854   1239  -21   -23     -854   -229643539
orig.CLV3-lucy.CLV3  10748529  357934   9324166   1066429   -1328  2846  22    0       704    244842237
NISC.SHOTGUN
- orig.CLR < lucy.CLR (mean)
- orig.CLV > lucy.CLV (mean)
- 8,248 out of 737,900 reads (1.1%) deleted by Lucy (CLR=-1,-1)
- 633,409 out of 737,900 reads (85%) vector trimmed at the 5' end
- 7,201 out of 737,900 reads (0.97%) vector trimmed at the 3' end
                     elem    <0      0       >0      min    max   mean  median  n50    sum
orig.LEN             737900  0       0       737900  104    2104  784   729     734    579172842
orig.CLR             737900  5988    2       731910  -636   1033  651   668     676    480400909
orig.CLR5            737900  0       0       737900  1      1407  47    40      51     34857531
orig.CLR3            737900  0       5879    732021  0      1470  698   710     715    515258440
lucy.CLR             737900  0       8248    729652  0      1035  658   670     676    485757685
lucy.CLR5            737900  8248    56      729596  -1     1091  45    35      46     33811606
lucy.CLR3            737900  8248    0       729652  -1     1391  704   710     714    519569291
orig.CLR5-lucy.CLR5  737900  253727  89345   394828  -566   1408  1     1       485    1045925
orig.CLR3-lucy.CLR3  737900  177007  31      560862  -867   1471  -5    1       -867   -4310851
orig.CLV             737900  3224    2655    732021  -636   2103  771   725     730    569178445
orig.CLV5            734026  0       0       734026  1      987   5     1       35     4375315
orig.CLV3            732021  0       0       732021  35     2104  783   729     734    573553760
lucy.CLV             737900  1335    55      736510  -200   2104  747   696     702    551392388
lucy.CLV5            737900  104491  0       633409  -1     1199  30    31      34     22784742
lucy.CLV3            737900  0       0       737900  15     2103  778   728     733    574177130
lucy.CLVABS5         737900  0       104491  633409  0      1200  31    32      35     23522642
lucy.CLVABS3         737900  0       730699  7201    0      1076  5     0       686    4257812
orig.CLV5-lucy.CLV5  737900  561851  66390   109659  -1198  983   -24   -29     -1198  -18409427
orig.CLV3-lucy.CLV3  737900  8386    1       729513  -950   1077  0     1       -950   -623370
Fragment files
- Location: /fs/szasmg3/bos_taurus/data/frg
- All DST messages are unique
- bos_taurus.clv : contains the vector clipping points
- BCM.WGS, BCM.SHOTGUN & NISC.SHOTGUN: lucy.clv
- others: the TA clv
- 374,454 reads don't have valid clv's
- 36,446,031 reads have valid clv's with avg=955
Message counts (original)
                                      DST    FRG       LKG
bos_taurus.BCM.WGS.frg                79     24124070  11311841
#bos_taurus.BCM.SHOTGUN.frg           7339   10650459  1799069    # some libs & mates are missing due to a tarchive2ca crash
#bos_taurus.BCM.SHOTGUN.new.frg       18208  10650459  4715172    # split the libraries by VOL & SEQ_LIB_ID
bos_taurus.BCM.SHOTGUN.new.frg        13826  10650459  5046435    # double check the FRG count !!!
bos_taurus.NISC.SHOTGUN.frg           246    729652    344932
bos_taurus.BCCAGSC.CLONEEND.frg       1      125241    59505
bos_taurus.UIUC.CLONEEND.frg          2      114750    46319
bos_taurus.TIGR.CLONEEND.frg          1      65171     27067
bos_taurus.GSC.CLONEEND.frg           1      53521     25889
bos_taurus.CENARGEN.WGS.frg           0      26246     0
bos_taurus.BARC.CLONEEND.frg          11150  25454     11150
bos_taurus.BCM.CLONEEND.frg           1      16875     7103
bos_taurus.CENARGEN.CLONEEND.frg      1      16787     6269
bos_taurus.UOKNOR.SHOTGUN.frg         1      14651     4910
bos_taurus.TIGR_JCVIJTC.CLONEEND.frg  2      10651     4803
bos_taurus.UOKNOR.FINISHING.frg       0      151       0
bos_taurus.WUGSC.COLONEEND.frg        1      49        21
total                                 25312  35973728  16896244
Message counts (quality)
                                        DST    FRG       LKG
bos_taurus.BCM.WGS.qual.count           79     23580109  11035582
#bos_taurus.BCM.SHOTGUN.qual.count      7339   10644092  1799069
bos_taurus.BCM.SHOTGUN.qual.new.count   18208  10644092  4712446
bos_taurus.NISC.SHOTGUN.count           246    729652    344932
bos_taurus.BCCAGSC.CLONEEND.qual.count  1      116484    53585
bos_taurus.UIUC.CLONEEND.count          2      114750    46319
bos_taurus.TIGR.CLONEEND.count          1      65171     27067
bos_taurus.CENARGEN.WGS.count           0      26246     0
bos_taurus.BARC.CLONEEND.count          11150  25454     11150
bos_taurus.BCM.CLONEEND.count           1      16875     7103
bos_taurus.CENARGEN.CLONEEND.count      1      16787     6269
bos_taurus.TIGR_JCVIJTC.CLONEEND.count  2      10651     4803
bos_taurus.UOKNOR.SHOTGUN.qual.count    1      2456      813
bos_taurus.WUGSC.COLONEEND.count        1      49        21
Message counts (0quality)
                                         DST   FRG     LKG
bos_taurus.BCM.WGS.0qual.count           79    543961  234397
bos_taurus.GSC.CLONEEND.0qual.count      1     53521   25889
bos_taurus.UOKNOR.SHOTGUN.0qual.count    1     12195   4097
bos_taurus.BCCAGSC.CLONEEND.0qual.count  1     8757    2114
bos_taurus.BCM.SHOTGUN.0qual.count       7339  6367    0
bos_taurus.UOKNOR.FINISHING.0qual.count  0     151     0
Assembly 1 (Quality reads)
Issues
- Uses only quality reads
- BCM.SHOTGUN library : ~ 4715172-1799069=2.9M mates were missed due to a tarchive2ca crash ; some libraries got merged (were assigned the same lib_id)
- All reads except for BCM.WGS were set as nonrandom
- Updated the runCA script to run the overlapper concurrently; new "ovlConcurrency" parameter added to the .spec file !!!
- consensus after cgw crashed in MultiAlignContig() ... use "consensus -D forceunitigabut" !!!
- cgw crashed after updating gkpStore with new lib/mate info => edit Input_CGW.c, remove the assert in line 117
Info
host: walnut
assembly version: wgs-5.2 stable
dir: /scratch1/bos_taurus/Assembly/2009_0122_CA
command: /fs/szdevel/dpuiu/SourceForge/wgs/Linux-amd64/bin/runCA-test -d . -p bt -s bt01.specFile *.frg
spec file:
cgwDistanceSampleSize = 1000   # ??? too big; more than 50% of the BCM.SHOTGUN reads are in libraries with less than 1000 inserts
cnsConcurrency = 15
cnsMinFrags = 200000
doOverlapTrimming = 1
frgCorrBatchSize = 100000
frgCorrConcurrency = 15
merylMemory = 24000
merylThreads = 15
obtMerThreshold = 200
obtOverlapper = ovl
ovlConcurrency = 8
ovlCorrBatchSize = 100000
ovlCorrConcurrency = 15
ovlHashBlockSize = 1200000
ovlMemory = 8GB --hashload 0.8 --hashstrings 400000
ovlMerThreshold = 500
ovlOverlapper = ovl
ovlRefBlockSize = 7200000
ovlThreads = 2
unitigger = utg
utgErrorRate = 0.015
vectorIntersect = bos_taurus.clv

doExtendClearRanges = 2     # should be set to 1 to run cgw 1+1=2 times (instead of 3 times)
cgwOutputIntermediate = 0   # should be set to 1 to get intermediate .cgw files
Steps
1. Run up till after initialStoreBuilding
runCA-test stopAfter=initialStoreBuilding ...
2. Update gkpStore with nonrandom frg flag
cat bos_taurus.nonrandom.clv | perl -ane 'print "frg uid $F[0] isnonrandom 1\n";' > bos_taurus.nonrandom.edit
gatekeeper -edit bos_taurus.nonrandom.edit bt.gkpStore
3. Restart
runCA-test ...
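The edit file built in step 2 can also be sketched in Python. This is only an illustration: it assumes, like the perl one-liner, that the first whitespace-separated field of each line in bos_taurus.nonrandom.clv is the read UID. The function name and the sample records are made up.

```python
def nonrandom_edit_lines(clv_lines):
    """Yield one gatekeeper edit line per non-empty input line.

    Assumption: the first whitespace-separated field is the read UID,
    mirroring $F[0] in the perl one-liner. Unlike the one-liner,
    blank lines are skipped instead of producing an empty UID.
    """
    for line in clv_lines:
        fields = line.split()
        if not fields:
            continue
        yield f"frg uid {fields[0]} isnonrandom 1"


if __name__ == "__main__":
    # Made-up UIDs and clear ranges, for illustration only.
    sample = ["1001 30 800", "", "1002 25 910"]
    for edit in nonrandom_edit_lines(sample):
        print(edit)
```

The resulting lines have the same shape as those written to bos_taurus.nonrandom.edit.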
Input
gatekeeper -dumpinfo -lastfragiid bt.gkpStore
...
Last frag in store is iid = 35348776
Trimming
           elem      <0        0         >0        min    max   mean  median  n50   sum
CLV5       35085508  0         3387027   31698481  0      970   25    27      29    891007232
CLV3       35164784  0         0         35164784  15     2974  984   980     1000  34612019144
CLR_ORIG5  35348776  0         43354     35305422  0      1753  42    29      38    1502168205
CLR_ORIG3  35348776  0         0         35348776  70     1894  864   905     927   30547294868
CLR_OBT5   35348776  0         26513     35322263  0      1690  49    30      73    1756346429
CLR_OBT3   35348776  0         23477     35325299  0      1813  843   895     914   29824543869

             elem      <0        0      >0        min    max   mean  median  n50  sum
ClearORIG    35348776  4         0      35348772  -1147  1572  821   870     893  29045126663
ClearQLT     35348776  35348776  0      0         -1     -1    -1    -1      -1   -35348776
ClearVEC     35348776  299034    20323  35029419  -1     2043  952   953     975  33658445088
ClearOBTINI  35348776  0         31254  35317522  0      1364  831   879     902  29394688367
ClearOBT     35348776  0         31254  35317522  0      1318  794   854     877  28068197440
ClearUTG     35348776  0         31254  35317522  0      1318  794   854     877  28068197440
ClearECR1    35348776  0         31254  35317522  0      1329  794   854     877  28072014464
ClearECR2    35348776  0         31254  35317522  0      1329  794   854     877  28072365712
- sum(ClearECR1)-sum(ClearUTG) = 3,817,024
- sum(ClearECR2)-sum(ClearECR1)= 351,248
- 421,379 reads deleted by OBT
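The two differences listed above can be recomputed from the "sum" column of the clear-range table. A small sketch; the dictionary and function names are made up for illustration, the numbers are copied from the table.

```python
# "sum" column of the clear-range table (total clear-range bases per stage).
CLEAR_SUM = {
    "ClearUTG": 28068197440,
    "ClearECR1": 28072014464,
    "ClearECR2": 28072365712,
}


def gained_bases(before, after, sums=CLEAR_SUM):
    """Bases added to the clear ranges between two stages."""
    return sums[after] - sums[before]


if __name__ == "__main__":
    print("ECR1 - UTG :", gained_bases("ClearUTG", "ClearECR1"))
    print("ECR2 - ECR1:", gained_bases("ClearECR1", "ClearECR2"))
```

Divided by the 35,348,776 reads, both gains are well below one base per read on average.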
Overlapper
- 98.34% of the reads (34,761,786 out of 35,348,776 reads) had overlaps
- 1.66% of the reads had no overlaps
- 6.68% of the BCCAGSC.CLONEEND reads had no overlaps
- 4.95% of the TIGR_JCVIJTC.CLONEEND reads had no overlaps
- 3.48% of the TIGR.CLONEEND reads had no overlaps
- the median number of overlaps is 16
sort -nk2 -r bt.ovlStore.count2 | head
16 1582324
17 1561352
15 1558093
18 1504595
14 1494160
...
- the median number of overlaps for the BCM.WGS reads is 16
- the median number of overlaps for the BCM.SHOTGUN reads is 16 !!!
- the median number of overlaps for the NISC.SHOTGUN reads is 40 !!!
- the median number of overlaps for the BCM.CLONEEND reads is 16 !!!
Media:Bt.ovlStore.big.png , Media:Bt.ovlStore.small.png
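The percentages and medians above can be derived from a histogram of overlap counts. The sketch below assumes that each line of bt.ovlStore.count2 holds "number of overlaps, number of reads", which is how the sort output reads. The function names and the toy histogram are made up. Note that sorting by the second column shows the most frequent values (the mode); the median needs the cumulative counts.

```python
def percent(part, total):
    """Share of `part` in `total`, in percent."""
    return 100.0 * part / total


def histogram_median(hist):
    """Lower median of a histogram mapping value -> number of reads."""
    total = sum(hist.values())
    running = 0
    for value in sorted(hist):
        running += hist[value]
        if 2 * running >= total:
            return value
    raise ValueError("empty histogram")


if __name__ == "__main__":
    # Read counts from the Overlapper section.
    with_ovl, all_reads = 34761786, 35348776
    print(f"with overlaps   : {percent(with_ovl, all_reads):.2f}%")
    print(f"without overlaps: {percent(all_reads - with_ovl, all_reads):.2f}%")

    # Toy histogram (made-up counts): overlaps -> reads.
    toy = {14: 10, 15: 12, 16: 15, 17: 11, 40: 2}
    print("median overlaps :", histogram_median(toy))
```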
Unitigger
more 4-unitigger/bt.cga.0
UNITIG OVERLAP GRAPH INFORMATION

5208738 : Total number of unitigs
2527051 : Total number of singleton, contained unitigs
1814842 : Total number of singleton, non-contained unitigs
180910 : Total number of non-singleton, spanned unitigs
685935 : Total number of non-singleton, non-spanned unitigs
34927397 : Total number of fragments
34927397 : Total number of fragments in all unitigs
21521581 : Total number of essential fragments in all unitigs
13405816 : Total number of contained fragments in all unitigs
0.0076239952 : Randomly sampled fragment arrival rate per bp
2510896132 : The sum of overhangs in all the unitigs
6400342737 : Total number of bases in all unitigs
0 : Estimated number of base pairs in the genome.
0 : Total number of contained fragments not connected by containment edges to essential fragments.
Total rho = 2510896132
Total nfrags = 19143061
Estimated genome length = 0
Estimated global_fragment_arrival_rate=0.007624
Computed global_fragment_arrival_rate =0.007624
Total number of randomly sampled fragments in genome = 23326293
Computed genome length = 3059589120.000000
Used global_fragment_arrival_rate=0.007624
Used global_fragment_arrival_distance=131.164826
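The numbers in this report are consistent with a simple relationship: arrival rate = nfrags / rho, arrival distance = 1 / rate, and genome length = randomly sampled fragments / rate. The sketch below only checks that arithmetic. It is an interpretation of the printed values, not a description of the unitigger source code, and the function names are made up.

```python
# Values copied from the unitigger report.
TOTAL_RHO = 2510896132      # sum of overhangs in all the unitigs
TOTAL_NFRAGS = 19143061     # "Total nfrags"
RANDOM_FRAGS = 23326293     # randomly sampled fragments in genome


def arrival_rate(nfrags, rho):
    """Fragments per base pair (assumed to be nfrags / rho)."""
    return nfrags / rho


def genome_length(random_frags, rate):
    """Genome length implied by the fragment count and arrival rate."""
    return random_frags / rate


if __name__ == "__main__":
    rate = arrival_rate(TOTAL_NFRAGS, TOTAL_RHO)
    print(f"arrival rate    : {rate:.6f} per bp")
    print(f"arrival distance: {1 / rate:.2f} bp")
    print(f"genome length   : {genome_length(RANDOM_FRAGS, rate) / 1e9:.2f} Gbp")
```

Small differences in the last digits against the report are expected, since the report appears to print rounded or single-precision values.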
Histogram of the number of base pairs in a chunk
100292 - 159434: 22
 90010 -  99906: 25
 80043 -  89676: 73
 70013 -  79966: 162
 60010 -  69988: 389
 50008 -  59983: 977
 40000 -  49998: 2434
 30000 -  39997: 6458
 20000 -  29999: 18957
 10000 -  19999: 57442
Unitigs >=10kb:
        NewAsm     UMd2Asm
Number  86,939     57,204
Mean    19,464     15,140
Sum     1,692.1Mb  866.0Mb
max     159,434bp  78,570bp

Contigs >=10Kb:
        NewAsm     UMd2Asm
n       42,343     45,958
mean    59,856     55,473
sum     2,534.5Mb  2,549.4Mb

Contigs >=100Kb:
        NewAsm     UMd2Asm
n       7,051      6,683
mean    163,170    162,357
sum     1,150.5Mb  1,085.0Mb
max     627,705    742,802
Scaffolds >=10Mb:
        NewAsm     UMd2Asm
n       30         30
mean    14.10Mb    11.36Mb
sum     422.95Mb   340.70Mb
max     26.54Mb    13.36Mb
QC stats
TotalScaffolds=66,141
MaxBasesInScaffolds=26,048,998
MeanBasesInScaffolds=40,861
TotalContigsInScaffolds=120,461
MaxContigLength=627,911
MeanContigLength=22,436
TotalDegenContigs=269,031
MaxDegenContig=33,824
SingletonReads=3,721,123
Analysis
Insert libraries
1. BCM.WGS : ok
- FRG.mea: 1750-7000
- ASM.mea: 1594-6727
- Most libs have > 1000 reads & get reestimated
- All libs have ASM.std< ASM.mea/3
2. BCM.SHOTGUN
- only ~ 50% of the inserts are in libs with >1000 inserts and get reestimated by the assembly
- if the threshold is dropped from 1000 to 100, we'd get ~ 95% of the inserts reestimated
      elem  <0  0  >0    min   max    mean  median  n50   sum
0     7339  0   0  7339  1     11237  245   135     1137  1799069
100   4361  0   0  4361  100   11237  395   157     1252  1725604
1000  440   0   0  440   1008  11237  2075  1791    2323  913086
3. NISC.SHOTGUN: ok
- Most libs have > 1000 reads & get reestimated
- All libs have ASM.std< ASM.mea/3
4. BCCAGSC.CLONEEND: ok
LIB.id  FRG.mea  FRG.std  FRG.count  CENTER.TYPE       ASM.mea  ASM.std
125606  150000   30000    59505      BCCAGSC.CLONEEND  161998   20133
5. UIUC.CLONEEND: ok
LIB.id  FRG.mea  FRG.std  FRG.count  CENTER.TYPE    ASM.mea  ASM.std
114892  150000   30000    31063      UIUC.CLONEEND  175594   41208
115020  150000   30000    15256      UIUC.CLONEEND  162488   26358
6. TIGR.CLONEEND: originally wrong; gets reestimated
LIB.id  FRG.mea  FRG.std  FRG.count  CENTER.TYPE    ASM.mea  ASM.std
65177   2000     600      27067      TIGR.CLONEEND  161761   34938
7. GSC.CLONEEND: not used (all 53556 are 0 qual)
8. CENARGEN.WGS: "not used" (all 26246 are unmated)
9. BARC.CLONEEND: each library contains 1 template id => inserts did not get reestimated (25454 reads/11151 inserts)
10. BCM.CLONEEND: ok
LIB.id  FRG.mea  FRG.std  FRG.count  CENTER.TYPE   ASM.mea  ASM.std
19070   167000   25000    7103       BCM.CLONEEND  171244   18555
11. CENARGEN.CLONEEND: large stdev
LIB.id  FRG.mea  FRG.std  FRG.count  CENTER.TYPE        ASM.mea  ASM.std
17249   202000   20200    6269       CENARGEN.CLONEEND  158938   55165
12. UOKNOR.SHOTGUN: ok ?
LIB.id  FRG.mea  FRG.std  FRG.count  CENTER.TYPE     ASM.mea  ASM.std
15158   3000     1000     4910       UOKNOR.SHOTGUN  3000     1000
13. TIGR_JCVI.CLONEEND: originally wrong; gets reestimated
LIB.id  FRG.mea  FRG.std  FRG.count  CENTER.TYPE         ASM.mea  ASM.std
10691   2500     750      2763       TIGR_JCVI.CLONEEND  160363   29580
10738   2500     750      2040       TIGR_JCVI.CLONEEND  161915   29343
14. UOKNOR.FINISHING: only 151 reads
15. WUGSC.CLONEEND: only 49 reads
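For the BCM.SHOTGUN libraries (item 2), the share of inserts that get reestimated at a given library-size threshold can be sketched as below. The first function works on a list of per-library insert counts. The toy list is made up, and "at least `threshold` inserts" is an assumption about how the cutoff is applied. The second function just divides the "sum" column of the BCM.SHOTGUN table.

```python
def reestimated_fraction(lib_sizes, threshold):
    """Fraction of inserts that sit in libraries with >= threshold inserts."""
    total = sum(lib_sizes)
    if total == 0:
        return 0.0
    kept = sum(size for size in lib_sizes if size >= threshold)
    return kept / total


# "sum" column of the BCM.SHOTGUN table, keyed by threshold.
TABLE_SUMS = {0: 1799069, 100: 1725604, 1000: 913086}


def table_fraction(threshold, sums=TABLE_SUMS):
    """Same fraction, taken directly from the table sums."""
    return sums[threshold] / sums[0]


if __name__ == "__main__":
    toy_libs = [1, 50, 150, 1200, 3000]  # made-up library sizes
    print(f"toy, threshold 1000  : {reestimated_fraction(toy_libs, 1000):.3f}")
    for t in (100, 1000):
        print(f"table, threshold {t:<4}: {100 * table_fraction(t):.1f}%")
```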
Contigs Vs UMD2 contaminants & Ecoli
 4865 contigs in list.exclude_contigs.fa
34404 exclude-ctg.qry_hits
 3763 exclude-ctg.ref_hits
 1204 exclude-ctg.CBE.qry_hits  CONTAIN|IDENTITY|BEGIN|END
  748 exclude-ctg.CBE.ref_hits  CONTAIN|IDENTITY|BEGIN|END
559 Ecoli.365350-365744-ctg.qry_hits : max ctg aligned is 179K bp; 10 are > 10K bp
Top 100 contigs Vs UMD2 contigs
Assembly 2 (Quality reads)
- Try to add the missing BCM.SHOTGUN reads to the assembly
- Assign new BCM.SHOTGUN library IDs based on volume & SEQ_LIB_ID: the same library might have different insert sizes in different volumes => might lose some correct mates from different volumes
cat bos_taurus.summary | grep BCM | grep SHOTG | cut -f6,7,8,10 | sort | more
FAAEP 180000 13000 252
FAAEP 2000 1000 84
...
FAAHP 180000 13000 77
FAAHP 2000 1000 230
...
- => 20,538 libraries out of which 18,208 contain mated reads
- create DST messages & add them to gkpStore
gatekeeper -a -o bt.gkpStore -T -F bos_taurus.BCM.SHOTGUN.new.DST
- generate gatekeeper edit file that maps each TI to the new library id
head bos_taurus.BCM.SHOTGUN.new.ti2libinfo.edit
frg uid 499507131 libuid 601081
frg uid 499507132 libuid 601081
...
- generate gatekeeper edit file that deletes all mate information
head bos_taurus.BCM.SHOTGUN.new.mate.delete
frg uid 500086180 mateuid 0
frg uid 500084310 mateuid 0
...
- pair forward/reverse reads that have the same new library id and the same TEMPLATE_ID
head bos_taurus.BCM.SHOTGUN.new.mate.edit
frg uid 583866821 mateuid 583872364
frg uid 583866822 mateuid 583872408
...
- run gatekeeper --edit for each edit/delete file
gatekeeper --edit ... bt.gkpStore
- restart assembly at cgw (doExtendClearRanges=1)
- consensus after cgw failed on job 25 on CTG 5597062 : cannot create consensus from multialignment ...
Fix: delete failed message
cp bt.cgw_contigs.25 bt.cgw_contigs.25.FAILED
delete "{ICM acc:5597062 pla:P len:20889 ..." from bt.cgw_contigs.25
- terminator fail; message:
ICL: reference before definition error for contig ID 5597062
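The mate-pairing step of Assembly 2 (pair forward/reverse reads that share the new library id and TEMPLATE_ID) could look roughly like the sketch below. The input layout (uid, library id, template id, direction) is hypothetical, as are the function name and the sample records. Skipping templates that do not have exactly one forward and one reverse read is an assumption, not necessarily what was done for the real edit file.

```python
from collections import defaultdict


def mate_edit_lines(reads):
    """Build gatekeeper mate edit lines.

    reads: iterable of (uid, lib_id, template_id, direction) tuples with
    direction "F" or "R" (hypothetical layout; other values raise KeyError).
    """
    groups = defaultdict(lambda: {"F": [], "R": []})
    for uid, lib_id, template_id, direction in reads:
        groups[(lib_id, template_id)][direction].append(uid)

    lines = []
    for key in sorted(groups):
        fwd, rev = groups[key]["F"], groups[key]["R"]
        # Pair only unambiguous templates: one forward and one reverse read.
        if len(fwd) == 1 and len(rev) == 1:
            lines.append(f"frg uid {fwd[0]} mateuid {rev[0]}")
    return lines


if __name__ == "__main__":
    # Made-up records, for illustration only.
    sample = [
        (101, 9001, "T1", "F"),
        (102, 9001, "T1", "R"),
        (103, 9001, "T2", "F"),   # no reverse read: left unmated
        (104, 9002, "T1", "F"),   # same template id, different library
        (105, 9002, "T1", "R"),
        (106, 9002, "T1", "R"),   # two reverse reads: ambiguous, skipped
    ]
    print("\n".join(mate_edit_lines(sample)))
```

Grouping on the (library id, template id) pair rather than on the template id alone is what keeps reads from different volumes apart once the library IDs include the volume.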
Assembly 3 (All reads)
- 35,973,728 reads : 35,348,776 quality & 624,952 quality-less
- 16,896,244 mates
- 25,312 libraries
Nucmer alignments
- 624,952 quality-less reads
- Quality-less read stats
     elem    min    max   mean  median  n50  sum
len  624952  5      1495  887   947     961  554429198
5    624952  6      1584  51    51      51   32150411
3    624952  5      1495  695   699     699  434960697
53   624952  -1579  1444  644   648     648  402810286
- Align 624,952 to the 120,461 Assembly1 contigs (no degenerates) : 1 day on 13 cpus
- 572,140(91.5%) reads aligned and 52,812(8.5%) did not align to the contigs
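The elem/min/max/mean/median/n50/sum columns used in the stats tables on this page can be reproduced with a few lines of Python. This sketch assumes non-negative values and the common N50 definition (the value at which the running sum over the sorted-descending values first reaches half of the total). The script that produced the tables may round or define n50 slightly differently.

```python
from statistics import mean, median


def n50(values):
    """Value at which the descending running sum reaches half the total."""
    ordered = sorted(values, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for v in ordered:
        running += v
        if running >= half:
            return v
    raise ValueError("empty input")


def summarize(values):
    """Return the columns of the stats tables for a non-empty list of lengths."""
    return {
        "elem": len(values),
        "min": min(values),
        "max": max(values),
        "mean": round(mean(values)),
        "median": round(median(values)),
        "n50": n50(values),
        "sum": sum(values),
    }


if __name__ == "__main__":
    toy_lengths = [2, 3, 4, 5, 6]  # made-up lengths
    print(summarize(toy_lengths))
```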
1. Launch jobs in parallel: 12766 jobs on 13 processors
nucmer -l 50 -c 200 -b 10 -g 5 -d 0.05 bt.ctg.001.fasta bos_taurus.0qual.01.seq -p ctg.001-seq.01
...
nucmer -l 50 -c 200 -b 10 -g 5 -d 0.05 bt.ctg.982.fasta bos_taurus.0qual.13.seq -p ctg.982-seq.13
- CPU usage: 100% /job
- Max mem usage: 0.1% /job
2. Get clrs
cat *delta | ~/bin/delta2qryClr.pl -best | sort > bos_taurus.0qual.best.clr
Length stats
           elem    min  max   mean  median  n50  sum
all        624952  5    1495  887   947     961  554429198
aligned    572140  221  1416  912   953     964  522281354
unaligned  52812   5    1495  608   580     754  32147844

Best alignment coord stats:
    elem    min  max   mean  median  n50  sum
5   572140  1    847   77    33      129  44112635
3   572140  187  1311  844   911     930  482905737
53  572140  94   1208  766   841     877  438793102

Best/Max/Max+extended alignment coord stats:
           elem    min  max   mean  median  n50  sum
53.best    572140  94   1208  766   841     877  438793102
53.max     572140  170  1208  794   863     888  454816817
53.extend  572140  170  1208  797   865     889  456014184

Unaligned read counts:
                  unaligned  total   quality  quality-less
BCM.WGS           42595
UOKNOR.SHOTGUN    5787       14651   2456     12195
GSC.CLONEEND      2294       53521   0        53521
BCCAGSC.CLONEEND  1869       125241  116484   8757
BCM.SHOTGUN       186
UOKNOR.FINISHING  81
3. Get reads without clrs: set their clr to 50..600, with the right end capped at the value in column 2 of bos_taurus.0qual.infoseq when that value is below 600
difference.pl bos_taurus.0qual.infoseq bos_taurus.0qual.clr | perl -ane '$three=600; $three=$F[1] if ($F[1]<600); print "$F[0] 50 $three\n";' > bos_taurus.0qual.clr.tmp
cat bos_taurus.0qual.clr.tmp >> bos_taurus.0qual.clr
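The fallback clear range from step 3 can be written out as a small function. This is a sketch of the logic in the perl one-liner, assuming that the second field passed to it is the read length. The function names and sample records are made up.

```python
def fallback_clr(read_len, left=50, right=600):
    """Return (left, right), capping the right end at the read length."""
    return left, min(right, read_len)


def fallback_clr_lines(records):
    """records: iterable of (read_name, read_len); yield 'name left right'."""
    for name, read_len in records:
        left, right = fallback_clr(read_len)
        yield f"{name} {left} {right}"


if __name__ == "__main__":
    # Made-up read names and lengths, for illustration only.
    sample = [("readA", 947), ("readB", 300), ("readC", 5)]
    for line in fallback_clr_lines(sample):
        print(line)
```

As in the one-liner, a read shorter than 50 bases ends up with a right coordinate below the left one. The unaligned reads include lengths as low as 5, so such reads may need to be filtered out separately.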