Bos taurus redo: Difference between revisions
		
		
		
		
Revision as of 01:19, 13 January 2009
BCM
NCBI Data
- Genome Projects
- TA search
- Avg LEN=984
- Avg CLIP (CLB intersect CLV)=760
- Avg CLV=997 (3.66M reads) !!! exceeds Avg LEN
- Avg QUAL=38.96 (27.51 for the 2.59M reads not in the UMD assembly)
- 0 QUAL reads 650,133
- Avg UMDoverlapper CLIP=778 (3.53M reads)
CENTER_NAME counts
COUNT      CENTER_NAME
35629020   BCM            Baylor College of Medicine
737900     NISC           NIH Intramural Sequencing Center
652614     BCCAGSC        British Columbia Cancer Agency Genome Sciences Centre    # TA query_tracedb CENTER_NAME = "BCCAGSC" => 652,510
378871     MARC           USDA, ARS, US Meat Animal Research Center
114753     UIUC           University of Illinois at Urbana-Champaign               # TA query_tracedb CENTER_NAME = "UIUC" => 106,368
107367     BARC           USDA, ARS, Beltsville Agricultural Research Center
65171      TIGR           The Institute for Genomic Research
53556      GSC            Genoscope
43033      CENARGEN       Embrapa Genetic Resources and Biotechnology
18623      SC             The Sanger Center
15301      UOKNOR         University of Oklahoma Norman Campus, Advanced Center for Genome Technology
10651      TIGR_JCVIJTC   The Institute for Genomic Research, Traces generated at JCVIJTC    # TA query_tracedb CENTER_NAME="JCVI"
2485       UIACBCB        University of Iowa Center for Bioinformatics and Computational Biology (UIACBCB)
49         WUGSC          Washington University, Genome Sequencing Center          # TA query_tracedb CENTER_NAME = "WUGSC" => 9
37829394   total          total                                                    # TA query_tracedb SPECIES_CODE = "BOS TAURUS" => 37,788,710
TRACE_TYPE_CODE counts
COUNT      CENTER_NAME    TRACE_TYPE_CODE   #LIBS(all)   #LIBS(10K+ reads)
24863599   BCM            WGS               89           31
10748529   BCM            SHOTGUN           10           10
737900     NISC           SHOTGUN           4            3
125597     BCCAGSC        CLONEEND
114753     UIUC           CLONEEND
65171      TIGR           CLONEEND
53556      GSC            CLONEEND
26246      CENARGEN       WGS
25454      BARC           CLONEEND
16892      BCM            CLONEEND          1            1                   VBBAA mean=167000 std=25000
16787      CENARGEN       CLONEEND
15150      UOKNOR         SHOTGUN
10651      TIGR_JCVIJTC   CLONEEND
151        UOKNOR         FINISHING
49         WUGSC          CLONEEND
36809945   total

527017     BCCAGSC        EST
207204     MARC           EST
171667     MARC           PCR
81913      BARC           EST
81913      BARC           EST
2485       UIACBCB        EST
1019449    total
STRATEGY & TRACE_TYPE_CODE counts
COUNT      CENTER_NAME    STRATEGY        TRACE_TYPE_CODE
12545304   BCM            .               WGS
11425910   BCM            WGA             WGS
5223683    BCM            CLONE           SHOTGUN
4479883    BCM            POOLCLONE       SHOTGUN
1044963    BCM            .               SHOTGUN
892385     BCM            SNP             WGS
737900     NISC           CLONE           SHOTGUN
125597     BCCAGSC        CLONEEND        CLONEEND
114753     UIUC           CLONEEND        CLONEEND
65171      TIGR           CLONEEND        CLONEEND
53556      GSC            CLONEEND        CLONEEND
26246      CENARGEN       .               WGS
25454      BARC           .               CLONEEND
16892      BCM            CLONEEND        CLONEEND
16787      CENARGEN       CLONEEND        CLONEEND
12195      UOKNOR         .               SHOTGUN
10651      TIGR_JCVIJTC   CLONEEND        CLONEEND
2955       UOKNOR         CLONE           SHOTGUN
151        UOKNOR         .               FINISHING
49         WUGSC          CLONEEND        CLONEEND

527017     BCCAGSC        EST             EST
145820     MARC           EST             EST
117958     MARC           COMPARATIVE     PCR
81913      BARC           EST             EST
61384      MARC           CLONE           EST
53709      MARC           Re-Sequencing   PCR
18623      SC             EST             EST
2485       UIACBCB        .               EST
3' VECTOR TRIMMED counts
CENTER_NAME    TRACE_TYPE_CODE   TOTAL      3'CLV<LEN   QUAL==0        UMD.FRG
BCM            WGS               24863599   10968979    551114         24050767
BCM            SHOTGUN           10748529   5052692     23419          10068499
NISC           SHOTGUN           737900     28972       0              735488
BCCAGSC        CLONEEND          125597     125484      8926           113790
UIUC           CLONEEND          114753     90243       0              106247
TIGR           CLONEEND          65171      46389       0              64903
GSC            CLONEEND          53556      53556       53556 (all)    0          !!! all have 0 quals and were excluded
CENARGEN       WGS               26246      26246       0              25976
BARC           CLONEEND          25454      25454       0              25387
BCM            CLONEEND          16892      6751        0              16863
CENARGEN       CLONEEND          16787      16787       0              16628
UOKNOR         SHOTGUN           15150      2885        12195          0
TIGR_JCVIJTC   CLONEEND          10651      339         0              10644
UOKNOR         FINISHING         151        0           151            151
WUGSC          CLONEEND          49         0           0              0
BCCAGSC        EST               527017     524173      772            0
MARC           EST               207204     207204      0              0
MARC           PCR               171667     171667      0              0
BARC           EST               81913      78597       0              0
SC             EST               18623      7350        0              0
UIACBCB        EST               2485       2485        0              0
Local Data
Files & Dirs
/fs/szasmg3/bos_taurus/data/
/fs/szasmg2/Drosophila/D_pseudoobscura/Vectors
/nfshomes/dpuiu/db/UniVec
Software
Figaro
- trims vector only at 5' end
- calls Lucy for quality trimming
Lucy
- both vector sequence and splice sites are required
Atlas
- web site
- atlas-screen-trim-file : "calls cross_match and atlas-screen-window to create trimmed reads file (scan in from each end of read looking for 50-base windows of high quality and no vector); "
Contaminant search
Align the CLIPPING range of the reads to UniVec & E. coli K12 with nucmer
UniVec
Ref
              #seqs   min   max     mean   median   n50   sum
UniVec        2861    12    48551   231    99       781   660,151
UniVec_Core   1348    12    48551   243    98       967   327,641
Hits: alignment length
bp    #reads    min   max    mean    median   n50   sum
19    4548466   19    1045   28.37   23       27    129025025
20    3684852   20    1045   30.56   25       28    112616359
30    1097357   30    1045   48.04   38       43    52714583
40    484661    40    1045   66.36   47       53    32163896
100   54334     100   1045   198     116      223   10772815    # many are ESTs
Ecoli
Ref:
K12 4,639,675 bp
Hits: alignment length
bp    #reads   min   max    mean    median   n50   sum
19    275109   19    1223   30.66   19       20    8435470
20    102550   20    1223   50.29   21       161   5156849
30    19032    30    1223   178     37       706   3381214
40    9234     40    1223   329     171      738   3034293
100   6781     100   1223   424     223      749   2876432
200   4378     200   1223   575     696      771   2516916
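The hit tables above report, for each minimum alignment length (the bp column), the number of hits and their length statistics. A minimal sketch of how such a row could be computed is given below. It assumes the alignment lengths are already available as a list of integers and that n50 follows the usual definition (the length L such that alignments of length >= L account for at least half of the summed length); the function names `n50` and `summarize` are hypothetical and are not taken from the tools used on this page.

```python
from statistics import mean, median


def n50(lengths):
    """Return L such that items of length >= L hold at least half of the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def summarize(lengths, min_len):
    """Summarize the alignments whose length is at least min_len (one table row)."""
    kept = [n for n in lengths if n >= min_len]
    if not kept:
        return None
    return {
        "bp": min_len,
        "#reads": len(kept),
        "min": min(kept),
        "max": max(kept),
        "mean": round(mean(kept), 2),
        "median": median(kept),
        "n50": n50(kept),
        "sum": sum(kept),
    }


if __name__ == "__main__":
    # Toy alignment lengths, not real data from this project.
    toy = [19, 19, 20, 25, 40, 100]
    for cutoff in (19, 20, 30):
        print(summarize(toy, cutoff))
```

Raising the cutoff drops the many short, probably spurious hits, which is why the mean and n50 climb quickly from one row to the next.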
BCM vectors
      #seqs   min    max     mean   median   n50     sum
BCM   14      2580   33180   9379   5821     32705   131312
Vector/Splice site search
Strategy
- Select all the reads in the same volume that belong to one particular library, i.e. that share the same CENTER_NAME, STRATEGY and TRACE_TYPE_CODE
- Get the quality clipping points: CLIP_QUALITY_LEFT & CLIP_QUALITY_RIGHT
- Separate reads in 2 piles according to direction TRACE_END: FORWARD & REVERSE
- Get the most frequent kmers (24 & 8 bp)
- Check if the most frequent kmers are overrepresented
- Check if the most frequent 8mers are part of the most frequent 24mers
- Try to extend the kmers by a few bp => linkers
- Align linkers to the opposite strand sequences
- Extract the sequences adjacent to (following) the linker (50..150bp)
- Align the sequences; if they align we've probably identified the vector
- Align the vector to UniVec => several alignments
- Check if the forward/reverse vector(s) are the same: there should be a common vector sequence, and the UniVec alignments should be adjacent
- Create the Lucy vector & splice files
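The k-mer steps of the strategy above (most frequent 24-mers and 8-mers, overrepresentation check, 8-mers contained in 24-mers) can be sketched as follows. This is only an illustration under stated assumptions: reads of one library and one direction are given as plain strings, only the first `window` bases of each read are scanned, and a k-mer counts as overrepresented when it occurs in at least `min_fraction` of the reads. The function names, the window size and the threshold are all hypothetical, not the values used for the Bos taurus data.

```python
from collections import Counter


def top_kmers(reads, k, window=50, top=5):
    """Most frequent k-mers in the first `window` bases, counted once per read."""
    counts = Counter()
    for seq in reads:
        prefix = seq[:window].upper()
        # A set makes each k-mer count at most once per read.
        seen = {prefix[i:i + k] for i in range(len(prefix) - k + 1)}
        counts.update(seen)
    return counts.most_common(top)


def overrepresented(kmer_counts, n_reads, min_fraction=0.5):
    """Keep k-mers seen in at least min_fraction of the reads (assumed cutoff)."""
    return [kmer for kmer, c in kmer_counts if c / n_reads >= min_fraction]


def short_in_long(short_kmers, long_kmers):
    """For each short k-mer, report whether it is a substring of any long k-mer."""
    return {s: any(s in l for l in long_kmers) for s in short_kmers}


if __name__ == "__main__":
    # Toy reads sharing a made-up 24 bp prefix; not real linker or vector sequence.
    shared = "ACGTACGTTGCAAGCTTGGATCCA"
    reads = [shared + "TTTTTTTTTT", shared + "GGGGGGGGGG",
             shared + "CCCCCCCCCC", "A" * 34]
    long_hits = overrepresented(top_kmers(reads, 24), len(reads))
    short_hits = overrepresented(top_kmers(reads, 8, top=100), len(reads))
    print(long_hits)
    print(short_in_long(short_hits, long_hits))
```

Counting each k-mer once per read keeps low-complexity reads from dominating the ranking. The candidate linkers found this way would still need the extension and alignment steps listed above before any Lucy vector or splice file is written.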