Bos taurus redo
Revision as of 01:44, 13 January 2009
BCM
NCBI Data
- Genome Projects
- TA search
- Avg LEN=984
- Avg CLIP (CLB intersect CLV)=760
- Avg CLV=997 (3.66M reads); note that this exceeds Avg LEN (984)
- Avg QUAL=38.96 (27.51 for the 2.59M reads not in the UMD assembly)
- Reads with QUAL=0: 650,133
- Avg UMDoverlapper CLIP=778 (3.53M reads)
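The "CLB intersect CLV" clip listed above can be read as the overlap of two clip ranges on each read. A minimal sketch, assuming CLB and CLV are each a 1-based, inclusive (left, right) pair; the function name `clip_intersect` and the four-column input layout are hypothetical and not part of any tool used on this page:

```shell
# clip_intersect : reads lines "L1 R1 L2 R2" on stdin (two clip ranges,
# assumed 1-based and inclusive) and prints "LEFT RIGHT LEN" of their overlap.
# An empty overlap is reported with LEN 0.
clip_intersect() {
  awk '{
    left  = ($1 > $3) ? $1 : $3      # larger of the two left clips
    right = ($2 < $4) ? $2 : $4      # smaller of the two right clips
    len   = (right >= left) ? right - left + 1 : 0
    print left, right, len
  }'
}
```

For example, `echo "20 800 35 760" | clip_intersect` prints `35 760 726`.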
CENTER_NAME counts
COUNT     CENTER_NAME
35629020  BCM           Baylor College of Medicine
737900    NISC          NIH Intramural Sequencing Center
652614    BCCAGSC       British Columbia Cancer Agency Genome Sciences Centre   # TA query_tracedb CENTER_NAME = "BCCAGSC" => 652,510
378871    MARC          USDA, ARS, US Meat Animal Research Center
114753    UIUC          University of Illinois at Urbana-Champaign   # TA query_tracedb CENTER_NAME = "UIUC" => 106,368
107367    BARC          USDA, ARS, Beltsville Agricultural Research Center
65171     TIGR          The Institute for Genomic Research
53556     GSC           Genoscope
43033     CENARGEN      Embrapa Genetic Resources and Biotechnology
18623     SC            The Sanger Center
15301     UOKNOR        University of Oklahoma Norman Campus, Advanced Center for Genome Technology
10651     TIGR_JCVIJTC  The Institute for Genomic Research, Traces generated at JCVIJTC   # TA query_tracedb CENTER_NAME="JCVI"
2485      UIACBCB       University of Iowa Center for Bioinformatics and Computational Biology (UIACBCB)
49        WUGSC         Washington University, Genome Sequencing Center   # TA query_tracedb CENTER_NAME = "WUGSC" => 9
37829394  total         # TA query_tracedb SPECIES_CODE = "BOS TAURUS" => 37,788,710
TRACE_TYPE_CODE counts
COUNT     CENTER_NAME   TRACE_TYPE_CODE  #LIBS(all)  #LIBS(10K+ reads)
24863599  BCM           WGS              89          31
10748529  BCM           SHOTGUN          10          10
737900    NISC          SHOTGUN          4           3
125597    BCCAGSC       CLONEEND
114753    UIUC          CLONEEND
65171     TIGR          CLONEEND
53556     GSC           CLONEEND
26246     CENARGEN      WGS
25454     BARC          CLONEEND
16892     BCM           CLONEEND         1           1                  VBBAA mean=167000 std=25000
16787     CENARGEN      CLONEEND
15150     UOKNOR        SHOTGUN
10651     TIGR_JCVIJTC  CLONEEND
151       UOKNOR        FINISHING
49        WUGSC         CLONEEND
36809945  total

527017    BCCAGSC       EST
207204    MARC          EST
171667    MARC          PCR
81913     BARC          EST
81913     BARC          EST
2485      UIACBCB       EST
1019449   total
STRATEGY & TRACE_TYPE_CODE counts
COUNT     CENTER_NAME   STRATEGY       TRACE_TYPE_CODE
12545304  BCM           .              WGS
11425910  BCM           WGA            WGS
5223683   BCM           CLONE          SHOTGUN
4479883   BCM           POOLCLONE      SHOTGUN
1044963   BCM           .              SHOTGUN
892385    BCM           SNP            WGS
737900    NISC          CLONE          SHOTGUN
125597    BCCAGSC       CLONEEND       CLONEEND
114753    UIUC          CLONEEND       CLONEEND
65171     TIGR          CLONEEND       CLONEEND
53556     GSC           CLONEEND       CLONEEND
26246     CENARGEN      .              WGS
25454     BARC          .              CLONEEND
16892     BCM           CLONEEND       CLONEEND
16787     CENARGEN      CLONEEND       CLONEEND
12195     UOKNOR        .              SHOTGUN
10651     TIGR_JCVIJTC  CLONEEND       CLONEEND
2955      UOKNOR        CLONE          SHOTGUN
151       UOKNOR        .              FINISHING
49        WUGSC         CLONEEND       CLONEEND

527017    BCCAGSC       EST            EST
145820    MARC          EST            EST
117958    MARC          COMPARATIVE    PCR
81913     BARC          EST            EST
61384     MARC          CLONE          EST
53709     MARC          Re-Sequencing  PCR
18623     SC            EST            EST
2485      UIACBCB       .              EST
3' VECTOR TRIMMED counts
CENTER_NAME   TRACE_TYPE_CODE  TOTAL     3'CLV<LEN  QUAL==0      UMD.FRG
BCM           WGS              24863599  10968979   551114       24050767
BCM           SHOTGUN          10748529  5052692    23419        10068499
NISC          SHOTGUN          737900    28972      0            735488
BCCAGSC       CLONEEND         125597    125484     8926         113790
UIUC          CLONEEND         114753    90243      0            106247
TIGR          CLONEEND         65171     46389      0            64903
GSC           CLONEEND         53556     53556      53556 (all)  0          !!! all have 0 quals and were excluded
CENARGEN      WGS              26246     26246      0            25976
BARC          CLONEEND         25454     25454      0            25387
BCM           CLONEEND         16892     6751       0            16863
CENARGEN      CLONEEND         16787     16787      0            16628
UOKNOR        SHOTGUN          15150     2885       12195        0
TIGR_JCVIJTC  CLONEEND         10651     339        0            10644
UOKNOR        FINISHING        151       0          151          151
WUGSC         CLONEEND         49        0          0            0
BCCAGSC       EST              527017    524173     772          0
MARC          EST              207204    207204     0            0
MARC          PCR              171667    171667     0            0
BARC          EST              81913     78597      0            0
SC            EST              18623     7350       0            0
UIACBCB       EST              2485      2485       0            0
Local Data
Files & Dirs
/fs/szasmg3/bos_taurus/data/
/fs/szasmg2/Drosophila/D_pseudoobscura/Vectors
/nfshomes/dpuiu/db/UniVec
Software
Figaro
- trims vector only at the 5' end
- calls Lucy for quality trimming
Lucy
- both vector sequence and splice sites are required
Atlas
- web site
- atlas-screen-trim-file : "calls cross_match and atlas-screen-window to create trimmed reads file (scan in from each end of read looking for 50-base windows of high quality and no vector); "
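The window scan described in the quote can be illustrated with a small sketch. This is not the Atlas implementation: it looks at quality values only (no vector check), and the "high quality" criterion used here, mean window quality at or above a threshold, is an assumption. The function name `window_trim` and both parameters are hypothetical:

```shell
# window_trim W MINQ : reads one read per line on stdin, as space-separated
# quality values, and prints "LEFT RIGHT" (1-based, inclusive):
#   LEFT  = start of the first window of W bases with mean quality >= MINQ
#   RIGHT = end of the last such window
# Prints "0 0" when no window qualifies (or the read is shorter than W).
window_trim() {
  awk -v w="$1" -v minq="$2" '{
    left = 0; right = 0
    for (i = 1; i + w - 1 <= NF; i++) {
      sum = 0
      for (j = i; j < i + w; j++) sum += $j
      if (sum / w >= minq) {
        if (!left) left = i          # first good window, scanning from the left end
        right = i + w - 1            # keeps moving right; ends at the last good window
      }
    }
    print left, right
  }'
}
```

For example, `echo "5 5 30 30 30 30 5 5" | window_trim 2 20` prints `3 6`; the setting described in the quote would correspond to `window_trim 50 MINQ`, with MINQ chosen by the user.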
Contaminant search
Align the CLIPPING range of the reads to UniVec & E. coli K12 with nucmer
UniVec
Ref
             #seqs  min  max    mean  median  n50  sum
UniVec       2861   12   48551  231   99      781  660,151
UniVec_Core  1348   12   48551  243   98      967  327,641
Hits: alignment length
bp   #reads   min  max   mean   median  n50  sum
19   4548466  19   1045  28.37  23      27   129025025
20   3684852  20   1045  30.56  25      28   112616359
30   1097357  30   1045  48.04  38      43   52714583
40   484661   40   1045  66.36  47      53   32163896
100  54334    100  1045  198    116     223  10772815   # many are ESTs
Ecoli
Ref:
K12 4,639,675 bp
Hits: alignment length
bp   #reads  min  max   mean   median  n50  sum
19   275109  19   1223  30.66  19      20   8435470
20   102550  20   1223  50.29  21      161  5156849
30   19032   30   1223  178    37      706  3381214
40   9234    40   1223  329    171     738  3034293
100  6781    100  1223  424    223     749  2876432
200  4378    200  1223  575    696     771  2516916
BCM vectors
     #seqs  min   max    mean  median  n50    sum
BCM  14     2580  33180  9379  5821    32705  131312
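The summary columns used in the tables above (#seqs, min, max, mean, median, n50, sum) can be computed from a plain list of lengths. A minimal sketch, not the script that produced these tables: the function name `len_stats` is hypothetical, n50 is taken as the length at which the running sum over the longest sequences first reaches half of the total, and the median of an even-sized list is taken as the mean of the two middle values (both are assumptions about the original tool):

```shell
# len_stats : reads one length per line on stdin and prints
# "#seqs min max mean median n50 sum" on a single line.
len_stats() {
  sort -n | awk '
    { v[NR] = $1; sum += $1 }
    END {
      if (NR == 0) exit
      # median: middle value, or mean of the two middle values
      median = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
      # n50: walk from the longest sequence down until half of the total is covered
      acc = 0
      for (i = NR; i >= 1; i--) {
        acc += v[i]
        if (acc * 2 >= sum) { n50 = v[i]; break }
      }
      printf "%d %d %d %.2f %s %d %d\n", NR, v[1], v[NR], sum / NR, median, n50, sum
    }'
}
```

For example, `printf '%s\n' 6 2 5 3 4 | len_stats` prints `5 2 6 4.00 4 5 20`.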
Vector/Splice site search
- Select all the reads in the same volume that belong to one particular library: same CENTER_NAME, STRATEGY & TRACE_TYPE_CODE
- Get the quality clipping coordinates: CLIP_QUALITY_LEFT & CLIP_QUALITY_RIGHT
- Separate the reads into 2 piles according to direction (TRACE_END): FORWARD & REVERSE
- Get the most frequent kmers (24 & 8 bp)
- Check if the most frequent kmers are overrepresented
- Check if the most frequent 8mers are part of the most frequent 24mers
- Try to extend the kmers by a few bp => linkers
- Align linkers to the opposite strand sequences
- Extract the sequences adjacent to (following) the linker (50..150 bp)
- Align the sequences; if they align we've probably identified the vector
- Align the vector to UniVec => several alignments
- Check if the forward/reverse vector(s) are the same: should find a common vector sequence; the UniVec alignments should be adjacent
- Create the Lucy vector & splice files
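The "most frequent kmers" step above can be sketched with standard tools. This is an illustration under assumptions, not the script used for this page: the function name `top_kmers` is hypothetical, the input is one read sequence per line (already restricted to whichever read region is of interest), and every k-mer occurrence is counted as seen, without merging a k-mer with its reverse complement. The output layout (k-mer, reverse complement, count) follows the kmers.tab example on this page:

```shell
# top_kmers K N : reads one sequence per line on stdin and prints the N most
# frequent K-mers as tab-separated "kmer  revcomp  count", highest count first.
top_kmers() {
  awk -v k="$1" '
    BEGIN { comp["A"] = "T"; comp["C"] = "G"; comp["G"] = "C"; comp["T"] = "A" }
    # reverse complement; any non-ACGT base becomes N
    function revcomp(s,    i, c, out) {
      out = ""
      for (i = length(s); i >= 1; i--) {
        c = substr(s, i, 1)
        if (c in comp) out = out comp[c]; else out = out "N"
      }
      return out
    }
    {
      s = toupper($0)
      for (i = 1; i + k - 1 <= length(s); i++) count[substr(s, i, k)]++
    }
    END { for (m in count) printf "%s\t%s\t%d\n", m, revcomp(m), count[m] }
  ' | sort -t "$(printf '\t')" -k3,3nr -k1,1 | head -n "$2"
}
```

For example, `printf 'ACGTACGT\nACGTTT\n' | top_kmers 4 1` prints `ACGT`, `ACGT` and `3` separated by tabs. Running it once per pile (forward and reverse) with K=24 and K=8 would give tables of the same shape as the example.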
Example:
- volume 011 : 500,000 reads CENTER_NAME=BCM, TRACE_TYPE_CODE=WGS
- 249,611 TRACE_END=F & 250,389 TRACE_END=R
==> 24.fwd/kmers.tab <==
AGTTCGACTGCAAGTAGTTCATCA TGATGAACTACTTGCAGTCGAACT 2463
GAGTTCGACTGCAAGTAGTTCATC GATGAACTACTTGCAGTCGAACTC 2189
CGAGTTCGACTGCAAGTAGTTCAT ATGAACTACTTGCAGTCGAACTCG 1996
TCGAGTTCGACTGCAAGTAGTTCA TGAACTACTTGCAGTCGAACTCGA 1593
GTTCGACTGCAAGTAGTTCATCAA TTGATGAACTACTTGCAGTCGAAC 1023
GAGTTCGACTGCAGTAGTTCATCA TGATGAACTACTGCAGTCGAACTC 812
CGAGTTCGACTGCAGTAGTTCATC GATGAACTACTGCAGTCGAACTCG 777
GTTCGACTGCAAGTAGTTCATCAT ATGATGAACTACTTGCAGTCGAAC 769
TCGAGTTCGACTGCAGTAGTTCAT ATGAACTACTGCAGTCGAACTCGA 637
ATCGAGTTCGACTGCAAGTAGTTC GAACTACTTGCAGTCGAACTCGAT 594

==> 08.fwd/kmers.tab <==
AGTAGTTC GAACTACT 86477
CAGTAGTT AACTACTG 67681
AGTTCTCA TGAGAACT 61556
TAGTTCTC GAGAACTA 60964
GTAGTTCT AGAACTAC 57866
AGTTCATC GATGAACT 49676
TAGTTCAT ATGAACTA 45298
GTTCATCA TGATGAAC 42117
GCAGTAGT ACTACTGC 41391
GTAGTTCA TGAACTAC 40694

==> 24.rev/kmers.tab <==
TATCGATGGTACAGTAGTTCATCA TGATGAACTACTGTACCATCGATA 999
CTATCGATGGTACAGTAGTTCATC GATGAACTACTGTACCATCGATAG 774
GCTATCGATGGTACAGTAGTTCAT ATGAACTACTGTACCATCGATAGC 600
CGCTATCGATGGTACAGTAGTTCA TGAACTACTGTACCATCGATAGCG 432
ATCGATGGTACAGTAGTTCATCAT ATGATGAACTACTGTACCATCGAT 417
ATCGATGGTACAGTAGTTCATCAA TTGATGAACTACTGTACCATCGAT 380
ATCAGATGGTACAGTAGTTCATCA TGATGAACTACTGTACCATCTGAT 373
ATCGATGGTACAGTAGTTCATCAC GTGATGAACTACTGTACCATCGAT 265
CTATCGATGGTAAGTAGTTCATCA TGATGAACTACTTACCATCGATAG 235
TCAGATGGTACAGTAGTTCATCAA TTGATGAACTACTGTACCATCTGA 224

==> 08.rev/kmers.tab <==
AGTTCATC GATGAACT 85127
TAGTTCAT ATGAACTA 77902
GTTCATCA TGATGAAC 75585
TAGTTCTC GAGAACTA 68057
AGTTCTCA TGAGAACT 67277
GTAGTTCT AGAACTAC 64894
GTAGTTCA TGAACTAC 62607
CGTAGTTC GAACTACG 52031
AGTAGTTC GAACTACT 51013
ACGTAGTT AACTACGT 31552
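The check that the frequent 8mers are part of the frequent 24mers can be sketched as a substring test between two kmers.tab files. A minimal sketch, assuming the three-column layout shown in the example (k-mer, reverse complement, count, whitespace-separated); the function name `contained_kmers` is hypothetical:

```shell
# contained_kmers SHORT.tab LONG.tab : for each k-mer in SHORT.tab, prints
# "kmer HITS", where HITS is the number of k-mers in column 1 of LONG.tab
# that contain the short k-mer (column 1) or its reverse complement (column 2).
contained_kmers() {
  awk '
    NR == FNR { big[$1] = 1; next }   # first file read: the long k-mers
    {
      hits = 0
      for (l in big)
        if (index(l, $1) > 0 || index(l, $2) > 0) hits++
      print $1, hits
    }
  ' "$2" "$1"
}
```

Usage would look like `contained_kmers 08.fwd/kmers.tab 24.fwd/kmers.tab`. With the first two forward 24mers listed in the example, AGTAGTTC is found in both, while AGTTCTCA is found in neither, which suggests that the latter 8mer is frequent for some other reason.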
$ more vector.seq
>J01636 E.coli lactose operon with lacI, lacZ, lacY and lacA genes
GACACCATCGAATGGCGCAAAACCTTTCGCGGTATGGCATGATAGCGCCCGGAAGAGAGTCAATTCAGGG
TGGTGAATGTGAAACCAGTAACGTTATACGATGTCGCAGAGTATGCCGGTGTCTCTTATCAGACCGTTTC
CCGCGTGGTGAACCAGGCCAGCCACGTTTCTGCGAAAACGCGGGAAAAAGTGGAAGCGGCGATGGCGGAG
CTGAATTACATTCCCAACCGCGTGGCACAACAACTGGCGGGCAAACAGTCGTTGCTGATTGGCGTTGCCA

$ cat ~/db/J01636splice
>J01636.for.begin
TGAATGTGAGTTAGGTCTCTCATTTGACACCCCAGGCTTTACACTTTATGCTTCCGGCTC
GTATGTTGTGTGGAATTGTGAGCGGATAGCAATTTCACACAGGAAACAGCTATGACCATG
CGCCTAATCAGATGGTACAGTAGTTCATCA
>J01636.for.end
TGATGAACTACTTGCAGTCGAAATCGAATCATCACTGGCCGTCCTTTTACAACGTCGTGA
CTGGGAAAACCCTGGCGTTACCCAACTTAATCCGCCTTGCAGCACATCCCCCTTTCCCCC
AGCTGGCGTAAAAACGTAAAAAGCCCCGCA
>J01636.rev.end
TGATGAACTACTGTACCATCTGATTAGGCGCATGGTCATAGCTGTTTCCTGTGTGAAATT
GCTATCCGCTCACAATTCCACACAACATACGAGCCGGAAGCATAAAGTGTAAAGCCTGGG
GTGTCAAATGAGAGACCTAACTCACATTCA
>J01636.rev.begin
TGCGGGGCTTTTTACGTTTTTACGCCAGCTGGGGGAAAGGGGGATGTGCTGCAAGGCGGA
TTAAGTTGGGTAACGCCAGGGTTTTCCCAGTCACGACGTTGTAAAAGGACGGCCAGTGAT
GATTCGATTTCGACTGCAAGTAGTTCATCA
Run lucy
# Run lucy
$ lucy \
    -v vector.seq splice.seq -o bos_taurus.lucy.seq bos_taurus.lucy.qual \
    -debug bos_taurus.lucy.info \
    bos_taurus.seq bos_taurus.qual
# Trim the lucy output to the clear range (clr)
$ clrFasta bos_taurus.lucy.seq > bos_taurus.lucy.cseq
# Align lucy output to vector.seq & UniVec
$ nucmer -l 16 -c 30 vector.seq bos_taurus.lucy.cseq -p vector-bos_taurus.lucy
$ nucmer -l 16 -c 30 ~/db/UniVec bos_taurus.lucy.cseq -p UniVec-bos_taurus.lucy