Bos taurus 3.0
Revision as of 19:28, 24 July 2009
Data download
- NCBI : ftp://ftp.ncbi.nih.gov/pub/TraceDB/bos_taurus/
- 37,829,394 reads organized in 91 volumes
- 36,820,485 WGS, SHOTGUN, CLONEEND & FINISHING reads
- 36,170,352 quality reads
- 650,133 qualityless reads
- 1,008,909 EST & PCR reads
- The 36,820,485 WGS, SHOTGUN, CLONEEND & FINISHING reads come from 25,312 libraries (mostly SHOTGUN and BARC.CLONEEND)
Centers
    TRACE_COUNT  CENTER_NAME
 1     35629020  BCM           Baylor College of Medicine
 2       737900  NISC          NIH Intramural Sequencing Center
 3       652614  BCCAGSC       British Columbia Cancer Agency Genome Sciences Center
 4       378871  MARC          USDA, ARS, US Meat Animal Research Center
 5       114753  UIUC          University of Illinois at Urbana-Champaign
 6       107367  BARC          USDA, ARS, Beltsville Agricultural Research Center
 7        65171  TIGR          The Institute for Genomic Research
 8        53556  GSC           Genoscope
 9        43033  CENARGEN      Embrapa Genetic Resources and Biotechnology
10        18623  SC            The Sanger Center
11        15301  UOKNOR        University of Oklahoma Norman Campus, Advanced Center for Genome Technology
12        10651  TIGR_JCVIJTC  The Institute for Genomic Research, Traces generated at JCVIJTC
13         2485  UIACBCB       University of Iowa Center for Bioinformatics and Computation Biology (UIACBCB)
14           49  WUGSC         Washington University, Genome Sequencing Center
       37829394  total
Trace counts
    TRACE_COUNT  CENTER_NAME   TRACE_TYPE_CODE
 1     24863599  BCM*          WGS
 2     10748529  BCM*          SHOTGUN
 3       737900  NISC          SHOTGUN
 4       125597  BCCAGSC       CLONEEND
 5       114753  UIUC          CLONEEND
 6        65171  TIGR          CLONEEND
 7        53556  GSC           CLONEEND
 8        26246  CENARGEN      WGS
 9        25454  BARC          CLONEEND
10        16892  BCM*          CLONEEND
11        16787  CENARGEN      CLONEEND
12        15150  UOKNOR        SHOTGUN
13        10651  TIGR_JCVIJTC  CLONEEND
14          151  UOKNOR        FINISHING
15           49  WUGSC         CLONEEND
       36820485  total

16       527017  BCCAGSC       EST
17       207204  MARC          EST
18       171667  MARC          PCR
19        81913  BARC          EST
20        18623  SC            EST
21         2485  UIACBCB       EST
        1008909  total
Data processing
Vector trimming
Issues:
- Many traces are missing CLIP_VECTOR_LEFT or CLIP_VECTOR_RIGHT, or have CLIP_VECTOR_RIGHT==0
- OBT (overlap-based trimming) needs the vector clear range (CLV) as input
- Re-trim reads (each library separately)
Identify linkers
- Separate FWD/REV reads; for each set
- Identify top 20 most frequent kmers (8mers,24mers)
- Check if the kmers are overrepresented
- Check if most frequent 8mer is present in the 24mers
- Align 24mers (extend them by a few bp) => linker
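The kmer counting step above can be sketched as follows. This is a minimal illustration, not the script used for the assembly: it assumes the reads of one set (FWD or REV) are already loaded as plain strings, and the names `top_kmers` and `kmers_containing` as well as the toy 10 bp "linker" are hypothetical.

```python
from collections import Counter

def top_kmers(reads, k, n=20):
    """Return the n most frequent k-mers (with counts) across all reads."""
    counts = Counter()
    for seq in reads:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts.most_common(n)

def kmers_containing(seed, candidates):
    """Keep the candidate k-mers that contain the shorter seed k-mer."""
    return [kmer for kmer in candidates if seed in kmer]

# Toy data (made up): the same 10 bp prefix followed by different inserts.
linker = "ACGTTGCAAC"
reads = [linker + "GGGATTTC", linker + "CCTAGGAA", linker + "TTTGACCA"]

top8 = top_kmers(reads, k=8)                          # short kmers
top10 = [kmer for kmer, _ in top_kmers(reads, k=10)]  # long kmers
# Long kmers that contain the most frequent short kmer are linker candidates.
candidates = kmers_containing(top8[0][0], top10)
```

On real data the same two calls would use k=8 and k=24, and the surviving 24mers would then be aligned and extended by a few bp to recover the full linker.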
Identify vectors
- Align linkers to the opposite strand sequences (nucmer -l 12 -c 24 -r)
- Extract the subsequences following the linker (50..150bp)
- Align the subsequences; if they align we've probably identified the vector
- Identify the vector name/id by alignment to UniVec (several alignments probably) (nucmer -l 12 -c 24)
- Check if the forward/reverse vector(s) are the same : we should find a common vector sequence; the UniVec alignments should be adjacent
- create the Lucy vector & splice files; the splice contains the linker+vector
Trim quality reads
- run Lucy & trim input reads according to Lucy clr
- align Lucy trimmed reads to linker,vector,splice & UniVec
- align input reads to linker,vector,splice & UniVec (there should be no alignments)
BCM reads
- linker:
>J01636.linker.fwd 27bp
TCGAGTTCGACTGCAAGTAGTTCATCA
>J01636.linker.rev 27bp
CTAATCAGATGGTACAGTAGTTCATCA
- vector: J01636 E.coli lactose operon with lacI, lacZ, lacY and lacA genes (7477 bp)
- avg Original CLV > avg Lucy CLV (20+ bp ; 1015 vs 973 in quality WGS reads , ...)
NISC reads
- linker:
>NGB00080.linker.fwd
TATCATCGCCACTGTGGTGGAATT
>NGB00080.linker.rev
GCTGAAGCTCCATGTGGTGGAATTCC
- vector NGB00080 (pOTW13 with linkers)
- avg Original CLV > avg Lucy CLV (20+ bp ; 771 vs 747)
Preliminary Assembly
- Assembly version: wgs-5.2
- Use only quality reads
- set read CLV to Lucy CLV
- set non random flag = 1 on all reads except for WGS ones
- obtMerThreshold = 200 (default 1000)
- doOBT = 1
Input
Reads=36,170,352      # WGS, SHOTGUN, CLONEEND & FINISHING quality reads
Libraries=25,312      # mostly SHOTGUN and BARC.CLONEEND
Output
TotalScaffolds=66,141
MaxBasesInScaffolds=26,048,998
MeanBasesInScaffolds=40,861
TotalContigsInScaffolds=120,461
MaxContigLength=627,911
MeanContigLength=22,436
TotalDegenContigs=269,031
MaxDegenContig=33,824
SingletonReads=3,721,123
DeletedReads=421,379 (too short or zero CLR)
Preliminary Assembly processing
Clear ranges
- Quality reads: extract OBT CLR from gatekeeper store
- Quality-less reads:
- Align them to contigs (no degenerates) : nucmer -l 50 -c 200 -b 10 -g 5 -d 0.05
- Set CLR to the maximum alignment coordinates or, if the read does not align, to 50..min(len,600)
- Shrink read CLR if there are multiple N's or low complexity regions
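The clear-range rule for reads without quality values can be written as a small function. This is a sketch under stated assumptions: the function name `qualityless_clr` is hypothetical, coordinates are taken as given (the notes do not say whether they are 0- or 1-based), and the shrinking step for N's and low complexity regions is not modeled.

```python
def qualityless_clr(read_len, alignment=None, default_start=50, max_end=600):
    """Return a (start, end) clear range for a read without quality values.

    alignment: optional (start, end) of the read's alignment to a contig.
    Without an alignment the range falls back to 50..min(len, 600).
    """
    if alignment is not None:
        # Reverse-strand hits may be reported end-first, so order them.
        start, end = sorted(alignment)
        return start, end
    end = min(read_len, max_end)
    # Assumption: a read shorter than the default start gets an empty range.
    return default_start, max(default_start, end)

aligned = qualityless_clr(800, alignment=(700, 120))
long_unaligned = qualityless_clr(800)
short_unaligned = qualityless_clr(420)
```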
Contaminant search
Databases:
- Ecoli : 22 completed genomes + plasmids
- UniVec_Core 1348 sequences : mostly cloning vectors & primers, avg 250bp long
- OtherVec: 100 other vector sequences (mostly complete), identified by aligning UMD2.0 contaminants to GenBank
- bos_taurus UMD2.0 contaminant : 4,813 whole contigs and 30,329 contig regions identified by NCBI as contamination in UMD2.0; many contained cow sequences as well
Files:
/nfshomes/dpuiu/db/Ecoli.all
/nfshomes/dpuiu/db/UniVec_Core
/nfshomes/dpuiu/db/OtherVec
/nfshomes/dpuiu/db/bos_taurus.UMD2.contaminant.fasta
Alignment parameters:
nucmer -maxmatch -l 40 -c 100 -b 10 -g 5 -d 0.05
Contig/degenerate counts:
- 2,951/1,266 aligned to Ecoli
- 5,387/1,908 aligned to UniVec_Core
- 5,657/1,963 aligned to OtherVec
Read/mate counts (to be deleted):
- 40,699 reads in contaminated regions
- 22,607 mates in contaminated regions
Library estimates
- Extract library insert estimates; merge libraries sequenced by the same center that have similar mean/std: 25,312 libs => 344 libs
- Assign new library ids; average means & stdevs
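One way to merge libraries is a greedy grouping per center, sketched below. The notes do not state what "similar" means, so the 10% relative tolerance (`tol`) is an assumption, as are the function name `merge_libraries` and the toy input; the actual procedure may have used different criteria.

```python
from collections import defaultdict
from statistics import mean

def merge_libraries(libs, tol=0.10):
    """Group libraries of one center whose mean and stdev agree within tol.

    libs: iterable of (lib_id, center, insert_mean, insert_stdev).
    Returns one dict per merged library with averaged mean and stdev.
    """
    def close(a, b):
        return abs(a - b) <= tol * max(a, b)

    by_center = defaultdict(list)
    for lib in libs:
        by_center[lib[1]].append(lib)

    merged = []
    for center in sorted(by_center):
        groups = []
        for lib in sorted(by_center[center], key=lambda l: (l[2], l[3])):
            for group in groups:
                g_mean = mean(m[2] for m in group)
                g_std = mean(m[3] for m in group)
                if close(lib[2], g_mean) and close(lib[3], g_std):
                    group.append(lib)
                    break
            else:  # no existing group is similar enough
                groups.append([lib])
        for group in groups:
            merged.append({
                "id": len(merged) + 1,  # new library id
                "center": center,
                "mean": mean(m[2] for m in group),
                "stdev": mean(m[3] for m in group),
                "members": [m[0] for m in group],
            })
    return merged

# Toy input (made up).
libs = [
    ("libA", "BCM", 3000, 300),
    ("libB", "BCM", 3100, 310),
    ("libC", "BCM", 40000, 4000),
    ("libD", "NISC", 3000, 300),
]
merged = merge_libraries(libs)
```

Here libA and libB collapse into one library, while libC (different insert size) and libD (different center) stay separate.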
Final Assembly
- Assembly version: wgs-5.2
- Use all traces
- set read CLR to:
- OBT CLR (quality reads)
- alignment coordinates (aligned quality-less reads)
- 50..min(len,600) (unaligned quality-less reads)
- set non random flag = 1 on all reads except for WGS reads
- obtMerThreshold = 200 (default 1000)
- doOBT = 0
Input
Reads=35,973,728      # WGS, SHOTGUN, CLONEEND & FINISHING with and without qualities
Libraries=344
Output
TotalScaffolds=39,978
MaxBasesInScaffolds=33,907,885
MeanBasesInScaffolds=66,947
TotalContigsInScaffolds=90,135
MaxContigLength=1,160,130
MeanContigLength=29,693
TotalDegenContigs=251,413
MaxDegenContig=39,964
SingletonReads=3,634,305 (10.24%)
Final assembly processing
Contaminant search
- Same databases and alignment parameters as before
Delete summary:
- 65 Acinetobacter ctgs
- 91 other contaminant ctgs <2000bp
- Total: 156 ctgs , 152 scf , 4105 reads
Trim summary:
- 12 contigs >=2000bp , 44 reads
Marker mapping
- 126,013 total markers
- Avg distance between markers is 25Kbp; marker position error is 50Kbp
- Markers are aligned to all contigs/degenerates; keep only alignments with %IDY>90 & %Matched>85; find the best alignment for each marker
- 107,271 markers align to 31,407 ctg & 2,640 scf
- 552 scf have markers from multiple chromosomes
- 212 scf have multiple markers from multiple chromosomes
- 38 scf have multiple adjacent markers from multiple chromosomes: SUSPICIOUS
- 628 markers align to 562 degenerates
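The alignment filter and best-hit selection can be sketched as below. The thresholds come from the text; the ranking rule used to pick the "best" alignment (highest %Matched, then highest %IDY) is an assumption, and the function name and record layout are hypothetical.

```python
def best_marker_hits(alignments, min_idy=90.0, min_matched=85.0):
    """Keep alignments above both thresholds and pick one best hit per marker.

    alignments: iterable of dicts with keys marker, target, idy, matched
    (percent identity and percent of the marker covered by the alignment).
    """
    best = {}
    for aln in alignments:
        if aln["idy"] <= min_idy or aln["matched"] <= min_matched:
            continue  # fails %IDY>90 or %Matched>85
        current = best.get(aln["marker"])
        # Assumed ranking: coverage first, identity as tie-breaker.
        if current is None or (aln["matched"], aln["idy"]) > (
            current["matched"], current["idy"]
        ):
            best[aln["marker"]] = aln
    return best

# Toy alignments (made up).
alns = [
    {"marker": "m1", "target": "ctg1", "idy": 99.0, "matched": 100.0},
    {"marker": "m1", "target": "ctg2", "idy": 95.0, "matched": 90.0},
    {"marker": "m2", "target": "ctg3", "idy": 88.0, "matched": 100.0},
]
best = best_marker_hits(alns)
```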
Scaffold/contig breaking
- Analyze 38 scf that have multiple adjacent markers from multiple chromosomes
- Compute coverage in the suspicious region (between different chromosome markers):
- read cvg
- mate cvg: good, bad
- Break ctg/scf unless the region has "high read cvg" , "high good mate cvg" , "low bad mate cvg"
- Break summary:
- 14 scaffolds
- 15 breaks : 8 on the same contig , 3 on adjacent contigs , 4 on non adjacent contigs
Assignment to chromosomes
Markers
- 2640 scaffolds and 562 degenerates have markers
- Assignment to chromosomes: use best alignment & majority rule
- Position:
- Filter out outliers according to position on chromosome & scaffold (interquartile range method)
- Compute the average position on chromosome of the markers
- Orientation:
- use a least-squares fit of chromosome position against scaffold position: if the slope is positive => forward; if the slope is negative => reverse
- if there is only 1 marker per scaffold => direction=unknown (0)
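The three marker-based steps (majority rule, interquartile range filter, least-squares orientation) can be combined in one sketch. It assumes the conventional 1.5×IQR fences, which the notes do not specify, and it uses the sign of the covariance, which equals the sign of the least-squares slope. The names `iqr_keep` and `place_scaffold` are hypothetical.

```python
from collections import Counter
from statistics import mean, quantiles

def iqr_keep(values, k=1.5):
    """Mask that is True where a value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    if len(values) < 4:  # too few points to estimate quartiles
        return [True] * len(values)
    q1, _, q3 = quantiles(values, n=4)
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [lo <= v <= hi for v in values]

def place_scaffold(markers):
    """markers: list of (chromosome, chr_pos, scf_pos).

    Returns (chromosome, mean chromosome position, direction), where
    direction is +1 (forward), -1 (reverse) or 0 (unknown).
    """
    chrom = Counter(m[0] for m in markers).most_common(1)[0][0]  # majority rule
    pts = [(c, s) for ch, c, s in markers if ch == chrom]
    keep_c = iqr_keep([c for c, _ in pts])
    keep_s = iqr_keep([s for _, s in pts])
    pts = [p for p, kc, ks in zip(pts, keep_c, keep_s) if kc and ks]
    if not pts:
        return chrom, None, 0
    position = mean(c for c, _ in pts)
    if len(pts) < 2:
        return chrom, position, 0
    mean_s = mean(s for _, s in pts)
    mean_c = mean(c for c, _ in pts)
    # The slope has the sign of the covariance, since the variance is >= 0.
    cov = sum((s - mean_s) * (c - mean_c) for c, s in pts)
    return chrom, position, (cov > 0) - (cov < 0)

# Toy scaffold (made up): five consistent chr5 markers, one far-away chr5
# marker and one chr7 marker.
markers = [
    ("5", 1_000_000, 1_000), ("5", 1_025_000, 26_000),
    ("5", 1_050_000, 51_000), ("5", 1_075_000, 76_000),
    ("5", 1_100_000, 101_000), ("5", 90_000_000, 60_000),
    ("7", 500_000, 30_000),
]
placement = place_scaffold(markers)
```

In the toy case the chr7 marker loses the vote, the marker at 90 Mbp is dropped as an outlier, and the remaining five markers give a forward placement.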
Human synteny
- Align all scaffolds/degenerates to the 24 Human chromosomes; keep only alignments longer than 200bp
nucmer -mum -l 12 -c 30 -g 1000
delta-filter -q -l 200
- 9,914 scaffolds and 16,527 degenerates align to Human chromosomes; most alignments are short, just over 200bp
Combine Human synteny & Marker data
- 1,908 scaffolds and 118 degenerates both align to human and contain markers
- 10,790 scaffolds and 16,590 degenerates align to human or contain markers
- Try to infer the position/orientation on the chromosomes for the scaffolds/degenerates that align to human but contain no markers
- Find 2 adjacent scaffolds (preferably on left & right side) which both align to human, contain markers and placements agree (chromosome, position, direction)
- Otherwise, find 1 adjacent scaffold which both aligns to human and contains markers
- Extrapolate the position/orientation of the "unplaced" sequence based on its neighbor(s)
- Sort the scaffolds/degenerates based on chromosome positions, identify incorrect markers & alignments, remove them from the input data and repeat the process
By linking information
Comparison to UMD2.0
Alignment parameters:
nucmer -mum -l 200 -c 1000
Haplotype search
Chromosome mapping
Assembly Summary
.                           ctg+deg   <2Kbp  >=2Kbp   min      max   mean    med     n50         sum
====================================================================================================
Chr1..29,X                    72481   20864   51617    65  1160130  36423  12940   97255  2639986644
ChrU                           3285    2404     881   224   179692   2890   1338    5425     9496583
Chr                           75766   23268   52498    65  1160130  34969  11207   96955  2649483227
contigs.haplotype-variants    40611   36984    3627   263    97877   1476   1205    1372    59958728
deg.unplaced.less_2K         224933  224933       0    65     1996    972    983     990   218837572
ChrY-contigs                    314     266      48   224    26490   2210    973    6539      694140
ChrY-contigs.SHOTGUN_ONLY       144     140       4   804     4224    993    882     888      143047
delete.notPrimates               97      96       1   263     5310   1031    996    1004      100066
trim                             61      21      40   213   205361  38577  11681  126330     2353214
====================================================================================================
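The n50 column is not defined in these notes. Assuming it follows the usual definition (the length L such that sequences of length >= L contain at least half of all bases), it can be computed as in this sketch; the function name and the toy lengths are hypothetical.

```python
def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input

# Toy contig lengths (made up); total is 32, half is 16.
example = n50([2, 2, 2, 3, 3, 4, 8, 8])
```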