Brugia malayi: Difference between revisions
Revision as of 20:30, 17 February 2010
Articles
Genome Info
- 6 chromosomes: 1-5, XY; diploid genome ~110 Mbp
- 30% GC,
- 32% coding, 15% repeats
Other sequences
- mitochondrion finished: 13,657 bp; 24% GC
- Wolbachia endosymbiont of Brugia malayi strain TRS (wBm), complete: 1,080,084 bp; 34% GC (New England Biolabs)
- Wolbachia endosymbiont strain wMel progress (TIGR)
- Rodent: some trace contamination; Example: Mus musculus is ~44%GC
contaminants  min  q1   q2   q3   max   mean    n50  sum
2929          200  527  675  820  8994  740.04  762  2167588
Genome Project
Brugia malayi has a diploid genome of approximately 110 Mb, organized in 6 pairs of chromosomes (five pairs of autosomes and one pair of sex chromosomes). In addition to the nuclear genome, B. malayi has a mitochondrial genome of about 14 kb and harbors the bacterial endosymbiont Wolbachia sp., whose genome is 1-2 Mb.
The B. malayi genome project has been completed by The Institute for Genomic Research. Whole Genome Shotgun sequencing was used to obtain more than eight-fold coverage of the genome. The complete genome was assembled into approximately 8200 scaffolds and deposited in GenBank. The accession for the WGS project is AAQA00000000 and consists of sequences AAQA01000001-AAQA01029808. File location:
/fs/szasmg3/dpuiu/Brugia_malayi/Data/Bm.fasta
ctgs   min  q1   q2    q3    max     mean     n50    sum
26879  200  836  1005  1495  611244  3241.17  18986  87119350
- TIGR Genome project (TRS strain)
Data
Original Traces
- 1.26M Sanger reads & 15 Libraries:
- NCBI TA
- NCBI TA FTP
SEQ_LIB_ID     INSERT_SIZE  INSERT_STDEV  TRACE_TYPE_CODE  READS
1047113828118  1000         300           WGS              13500
1047113856575  1000         300           PRIMERWALK       325
1047111632737  1258         377           PRIMERWALK       3
1047111632737  1258         377           WGS              305,906
1047111540304  1415         424           WGS              51772
1047112577106  1415         424           WGS              337,789
1047111718946  3123         936           WGS              47597
1047113358719  3123         936           PRIMERWALK       173
1047113358719  3123         936           WGS              246,185
1047174912885  3123         936           TRANSPOSON       1437
1047113570927  6000         1800          WGS              3193
1047111814561  7158         2147          WGS              219,306
1047111480027  17168        5150          WGS              4087
1047111488095  17168        5150          WGS              3434
1047111495007  17168        5150          WGS              3716
1047111501919  17168        5150          WGS              3697
1047111480605  22419        6725          WGS              4638
1047111516154  22419        6725          WGS              4004
1047111523212  22419        6725          WGS              3766
1047111530126  22419        6725          WGS              5686
1047113855421  23000        6900          WGS              1
total                                                      1,260,215
FRG file:
- FRG.src : TI's
- FRG.acc: 2 ..
- DST.acc: 1260217, ... , 1260234
- Location
/fs/szasmg3/dpuiu/Brugia_malayi/Data/nucmer_seq/Bm-all.frg
DST  15
FRG  1178192
LKG  530930
seqs       min  q1   q2   q3   max   mean  n50  sum
1,178,192  65   645  771  850  1214  724   800  853,847,771  => 8X
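The "=> 8X" figure is presumably total clear-range bases divided by genome size. A minimal sketch of that arithmetic, assuming the ~110 Mb genome size from Genome Info (the function name and the choice of genome size are illustrative, not taken from the original scripts):

```python
def fold_coverage(total_bases: int, genome_size: int) -> float:
    """Return fold coverage as total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


# Assumption: ~110 Mb genome, as listed under Genome Info.
GENOME_SIZE = 110_000_000

# Sum of clear-range bases in Bm-all.frg (from the table above).
SANGER_BASES = 853_847_771

if __name__ == "__main__":
    cov = fold_coverage(SANGER_BASES, GENOME_SIZE)
    print(f"Sanger coverage: {cov:.1f}X")
```

With these inputs the result is about 7.8X, which rounds to the 8X quoted. A smaller assumed genome size gives a proportionally higher figure, so the coverage values on this page are only comparable if they use the same denominator.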
Problems:
- All library insert sizes are underestimated ???
- The contaminant reads align at ~91-93% identity to the contaminant ctgs, while the Mt/We reads align at 99% identity to the finished Mt/We sequences. What %identity threshold should be used for contaminants?
BACS
8,000 BAC clones @Children's Hospital Oakland Research Institute. (!!! no NCBI TA submission)
PITT FTP data
- 3.21M 454 reads
- 3K insert flx libraries (estimated to 2K based on alignment to the existing assembly)
- 20K insert tit libraries (estimated to 28K ...)
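The insert-size estimates above come from aligning mates to the existing assembly. A minimal sketch of how such an estimate could be computed, assuming each mate pair is already reduced to two (contig, start, end) alignments; this input format, the function name, and the `max_span` cutoff are all hypothetical and not the procedure actually used here:

```python
from statistics import mean, stdev


def estimate_insert_size(mate_alignments, max_span=100_000):
    """Estimate insert size mean/stdev from mate pairs on the same contig.

    mate_alignments: iterable of ((ctg, start, end), (ctg, start, end)).
    Pairs whose mates hit different contigs are skipped, as are spans
    larger than max_span (likely misassemblies or chimeras).
    """
    spans = []
    for (ctg_a, start_a, end_a), (ctg_b, start_b, end_b) in mate_alignments:
        if ctg_a != ctg_b:
            continue  # cannot measure a distance across contigs
        span = max(end_a, end_b) - min(start_a, start_b)
        if 0 < span <= max_span:
            spans.append(span)
    if len(spans) < 2:
        raise ValueError("need at least two same-contig mate pairs")
    return mean(spans), stdev(spans)
```

Because only pairs landing on the same contig (or scaffold) contribute, a fragmented reference will tend to under-sample long inserts, so estimates for the 20K libraries deserve more caution than those for the 3K libraries.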
CBCB Location:
/fs/szattic-asmg4/brugia_malayi/Data/
/fs/szattic-asmg4/brugia_malayi/Data/Sff/   # Sff files
/fs/szattic-asmg4/brugia_malayi/Data/Frg/   # Frg files
/fs/szattic-asmg4/brugia_malayi/Data/Seq/   # Seq files
FTP access:
lftp -u bma 136.142.191.201
pass: 6279
user: bma
# empty as of --Dpuiu 12:04, 8 January 2010 (EST)
Elodie's table:
/scratch1/brugia_malayi/brugia-sequencing-summary.txt.csv
#  elodie's date  protocol  platform  type  description  run_name  Reads  Mates
1  01/17/2008  WGS  Standard  Full run (2/2)  Mix of worms (calibration of the machine)  R_2008_01_31_18_01_35_FLX10070260_adminrig_ghedintestsample  ?  0
2  07/01/2008  3Kb  Standard  Full  single worm (pUC contamination)  R_2008_08_06_13_52_29_FLX10070260_adminrig_080608_Ghedin-BrugiaLTPE1  492575  84341
3  09/11/2008  3kb  Standard  4/8 wells  single worm (pUC contamination)  R_2008_09_19_14_17_55_FLX10070260_adminrig_091908_HATFULL-MIDrepeat_GHEDIN-LTPE1  263421  49258
4  10/01/2008  3Kb  Standard  Full  Mix of worms (still pUC contamination)  R_2008_10_14_15_06_50_FLX10070260_adminrig_101408_GHEDIN-Brugia-pool_LTPEtest  59711  5096
5  02/01/2009  WGS  Standard  1/4 wells  Mix of worms; regions 2 & 3 were myxoma  R_2009_02_27_16_11_34_FLX10070260_adminrig_022709_GHEDIN  ?  0
6  04/06/2009  WGS  Standard  1/4 wells  Mix of worms; with comp. bio run  R_2009_04_15_14_46_56_FLX10070260_adminrig_041509_GHEDIN_r1-WGS1_r2-LMW4_r3-pool2compbio_r4-pool3compbio  ?  0
7  05/01/2009  20Kb  Titanium  7/8 wells  Mix of worms  R_2009_06_05_16_18_41_FLX10070260_adminrig_060509_GHEDIN_Brugia-gDNA-TI20kb1  631287  213524
8  10/28/2009  20Kb  Titanium  Full  Mix of worms  R_2009_10_22_15_30_12_FLX10070260_adminrig_102209_GHEDIN_Brugia20kb2  1095713  377547
9  Pending  3 Kb  Titanium  Full  Mix of worms  ?  ?  ?
.  Total  2542707  729766
- 22 Sff files:
run  path  sffReads  linker
1  R_2008_01_31_18_01_35_FLX10070260_adminrig_ghedintestsample/D_2008_01_31_18_01_35_FLX10070260_adminrig_FullAnalysis/sff/E4RA0X101.sff  272923  .
1  R_2008_01_31_18_01_35_FLX10070260_adminrig_ghedintestsample/D_2008_01_31_18_01_35_FLX10070260_adminrig_FullAnalysis/sff/E4RA0X102.sff  261899  .
2  R_2008_08_06_13_52_29_FLX10070260_adminrig_080608_Ghedin-BrugiaLTPE1/D_2009_02_12_22_12_04_j_SignalProcessing/sff/FEZH5RS01.sff  228204  flx
2  R_2008_08_06_13_52_29_FLX10070260_adminrig_080608_Ghedin-BrugiaLTPE1/D_2009_02_12_22_12_04_j_SignalProcessing/sff/FEZH5RS02.sff  264371  flx
3  R_2008_09_19_14_17_55_FLX10070260_adminrig_091908_HATFULL-MIDrepeat_GHEDIN-LTPE1/FHAVB5T02.sff  86862  flx
3  R_2008_09_19_14_17_55_FLX10070260_adminrig_091908_HATFULL-MIDrepeat_GHEDIN-LTPE1/FHAVB5T03.sff  87488  flx
3  R_2008_09_19_14_17_55_FLX10070260_adminrig_091908_HATFULL-MIDrepeat_GHEDIN-LTPE1/FHAVB5T04.sff  89071  flx
4  R_2008_10_14_15_06_50_FLX10070260_adminrig_101408_GHEDIN-Brugia-pool_LTPEtest/FIOXLOM01.sff  13695  flx
4  R_2008_10_14_15_06_50_FLX10070260_adminrig_101408_GHEDIN-Brugia-pool_LTPEtest/FIOXLOM02.sff  14197  flx
4  R_2008_10_14_15_06_50_FLX10070260_adminrig_101408_GHEDIN-Brugia-pool_LTPEtest/FIOXLOM03.sff  15515  flx
4  R_2008_10_14_15_06_50_FLX10070260_adminrig_101408_GHEDIN-Brugia-pool_LTPEtest/FIOXLOM04.sff  16304  flx
5  R_2009_02_27_16_11_34_FLX10070260_adminrig_022709_GHEDIN/FRLDXKV01.sff  18025  .
6  R_2009_04_15_14_46_56_FLX10070260_adminrig_041509_GHEDIN_r1-WGS1_r2-LMW4_r3-pool2compbio_r4-pool3compbio/D_2009_04_16_14_19_21_morty_fullProcessing/FT9KOI001.sff  118490  .
7  R_2009_06_05_16_18_41_FLX10070260_adminrig_060509_GHEDIN_Brugia-gDNA-TI20kb1/D_2009_06_08_15_32_36_compute-0-2_fullProcessing/sff/FW1OXFY01.sff  73807  tit
7  R_2009_06_05_16_18_41_FLX10070260_adminrig_060509_GHEDIN_Brugia-gDNA-TI20kb1/D_2009_06_08_15_32_36_compute-0-2_fullProcessing/sff/FW1OXFY02.sff  91698  tit
7  R_2009_06_05_16_18_41_FLX10070260_adminrig_060509_GHEDIN_Brugia-gDNA-TI20kb1/D_2009_06_08_15_32_36_compute-0-2_fullProcessing/sff/FW1OXFY03.sff  93878  tit
7  R_2009_06_05_16_18_41_FLX10070260_adminrig_060509_GHEDIN_Brugia-gDNA-TI20kb1/D_2009_06_08_15_32_36_compute-0-2_fullProcessing/sff/FW1OXFY04.sff  90232  tit
7  R_2009_06_05_16_18_41_FLX10070260_adminrig_060509_GHEDIN_Brugia-gDNA-TI20kb1/D_2009_06_08_15_32_36_compute-0-2_fullProcessing/sff/FW1OXFY05.sff  97065  tit
7  R_2009_06_05_16_18_41_FLX10070260_adminrig_060509_GHEDIN_Brugia-gDNA-TI20kb1/D_2009_06_08_15_32_36_compute-0-2_fullProcessing/sff/FW1OXFY06.sff  94326  tit
7  R_2009_06_05_16_18_41_FLX10070260_adminrig_060509_GHEDIN_Brugia-gDNA-TI20kb1/D_2009_06_08_15_32_36_compute-0-2_fullProcessing/sff/FW1OXFY07.sff  90281  tit
8  R_2009_10_22_15_30_12_FLX10070260_adminrig_102209_GHEDIN_Brugia20kb2/F4H5CMB01.sff  551263  tit
8  R_2009_10_22_15_30_12_FLX10070260_adminrig_102209_GHEDIN_Brugia20kb2/F4H5CMB02.sff  544450  tit
total  3214044  .
- 22 Frg Libraries
run  lib        mean(orig)  mean(estimates)  reads    mates   linker
1    E4RA0X101  0           0                271066   0       .
1    E4RA0X102  0           0                260166   0       .
2    FEZH5RS01  3000        2000             181035   18676   flx
2    FEZH5RS02  3000        2000             211064   22270   flx
3    FHAVB5T02  3000        2000             68708    7850    flx
3    FHAVB5T03  3000        2000             69306    8227    flx
3    FHAVB5T04  3000        2000             70028    8353    flx
4    FIOXLOM01  3000        0                10921    0       flx  # no mates !!!
4    FIOXLOM02  3000        0                11157    0       flx  # no mates !!!
4    FIOXLOM03  3000        0                12197    0       flx  # no mates !!!
4    FIOXLOM04  3000        0                12727    0       flx  # no mates !!!
5    FRLDXKV01  0           0                17349    0       .
6    FT9KOI001  0           0                108826   0       .
7    FW1OXFY01  20000       28000            86127    15825   tit
7    FW1OXFY02  20000       28000            106911   19668   tit
7    FW1OXFY03  20000       28000            109874   20396   tit
7    FW1OXFY04  20000       28000            104797   18933   tit
7    FW1OXFY05  20000       28000            113716   20649   tit
7    FW1OXFY06  20000       28000            110693   20326   tit
7    FW1OXFY07  20000       28000            105931   19176   tit
8    F4H5CMB01  20000       28000            626046   109918  tit
8    F4H5CMB02  20000       28000            628432   118903  tit
total           .           .                3297077  429170
- Clr of the Sff seqs (good qual)
.    seqs       min  q1   q2   q3   max   mean  n50  sum
all  3,214,044  0    240  274  383  2042  294   326  947,254,956  => 9.4X
- Clr of the Frg seqs (good qual , no linker)
.        seqs       min  q1   q2   q3   max   mean  n50  sum
all      3,297,077  3    156  248  301  2043  244   275  806,091,347  => 8X
mated    858,340    64   107  156  223  612   171   201  147,070,298
unmated  2,438,737  2    207  261  335  2042  268   286  655,723,972
- Locations:
/fs/szattic-asmg4/brugia_malayi/Data/Sff/
/fs/szattic-asmg4/brugia_malayi/Data/Frg/
Contaminant search
nucmer -maxmatch -c 65 -l 40
.       Sanger  454
jird    31,501  197,420
Mt      1,507   2,634
We      49,014  23,249
UniVec  ?       549,378   # most hits to "Cloning vector pBR322"
Assemblies
TIGR/NCBI
- 9X coverage, 856K Sanger traces => 8,200 scaff & 29,808 ctg (avg. scaff=~10K & avg ctg=~3K)
- "scaffolds totaling ~71 Mb of data with a further ~17.5 Mb of contigs not integrated into any scaffold (orphan contigs)" (Science 2007)
- NCBI AAQA00000000 AAQA01000001-AAQA01029808
* 26,879 good ctgs
* 2,929 jird contaminants (Example: AAQA01001321 : mouse 99%id hits)
- Stats
.                elem   min   q1    q2    q3     max      mean      n50    sum
ctg.good         26879  200   836   1005  1495   611244*  3241.17   18986  87,119,350
ctg.good.2K+     4887   2000  2914  4380  10049  611244   13363.00  37130  65,304,988
ctg.contaminant  2929   200   527   675   820    8994     740.04    762    2,167,588
- Location
/fs/szasmg3/dpuiu/Brugia_malayi/Assembly/TIGR/ <-> NCBI
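The stats tables on this page all use the same columns (elem, min, q1, q2, q3, max, mean, n50, sum). A minimal sketch of how those columns could be recomputed from a list of sequence lengths; the function name is illustrative, and the quartile convention of Python's `statistics.quantiles` is an assumption that may differ slightly from whatever tool produced the original tables:

```python
from statistics import mean, quantiles


def length_stats(lengths):
    """Summarize sequence lengths: count, min, quartiles, max, mean, N50, sum."""
    if len(lengths) < 2:
        raise ValueError("need at least two lengths")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)

    # N50: length of the sequence at which the running sum, taken from
    # the longest sequence downward, first reaches half of the total.
    running = 0
    n50 = ordered[-1]
    for length in ordered:
        running += length
        if 2 * running >= total:
            n50 = length
            break

    q1, q2, q3 = quantiles(ordered, n=4)
    return {
        "elem": len(ordered),
        "min": ordered[-1],
        "q1": q1,
        "q2": q2,
        "q3": q3,
        "max": ordered[0],
        "mean": mean(ordered),
        "n50": n50,
        "sum": total,
    }
```

The same function applies to contigs, scaffolds, or reads; restricting the input to lengths of at least 2000 reproduces the idea behind the "2K+" rows.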
PITT
- Best so far
- Date: 11/05/08
- Stats:
.        elem  min   q1    q2    q3     max       mean   n50     sum
scf.2K+  3170  2000  2917  4483  14471  6534162*  22916  112914  72,643,770
- Location:
/fs/szasmg3/dpuiu/Brugia_malayi/Assembly/PITT/
CBCB Sanger
- test wgs 5.1 on filtered Sanger reads
- better assembly than the published one
- Stats:
.    elem   min  q1    q2    q3    max      mean     n50    sum
ctg  12753  273  1245  1632  3873  376744   6113.39  24748  77,964,006
deg  9661   65   858   949   1023  72494    1240.97  1008   11,988,997
scf  10317  935  1215  1538  3462  3890532  8018.85  41716  82,730,474

.        elem  min   q1    q2    q3     max      mean      n50    sum
ctg.2K+  5210  2000  3049  4813  13528  376744   13013.29  30835  67,799,238
deg.2K+  391   2000  3009  4693  10099  72494    8104.38   12352  3,168,812
scf.2K+  3656  2001  3181  5733  18904  3890532  20189.57  50293  73,813,083

reads       1178192 (100%)
singletons  134119 (11.43%)
- Location:
/fs/szasmg3/dpuiu/Brugia_malayi/Assembly/CBCB/2008_0826_CA/
CBCB 454 CA
$ gatekeeper -dumpinfo -lastfragiid asm.gkpStore
Last frag in store is iid = 3297077
- Problems
- olap-from-seeds very memory/cpu intensive!!!
# overmerry.sh jobs -> ./1-overlapper/seeds/
my $ovmBatchSize = getGlobal("merOverlapperSeedBatchSize");    # default 100,000
my $ovmJobs = int(($numFrags - 1) / $ovmBatchSize) + 1;        # int(3297076/100000)+1=33

# olap-from-seeds.sh jobs -> ./1-overlapper/olaps/
my $olpBatchSize = getGlobal("merOverlapperExtendBatchSize");  # default 75,000 ; reduce to 20,000
my $olpJobs = int(($numFrags - 1) / $olpBatchSize) + 1;        # int(3297076/20000)+1=165
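The job counts quoted in the comments above can be checked with a small sketch of the same batching arithmetic. The function name is illustrative; only the formula and the batch sizes come from the snippet above:

```python
def batch_jobs(num_frags: int, batch_size: int) -> int:
    """Number of batches needed to cover num_frags fragments.

    Mirrors int(($numFrags - 1) / $batchSize) + 1; for positive values,
    Perl's int() truncation equals Python's floor division.
    """
    if num_frags < 1 or batch_size < 1:
        raise ValueError("num_frags and batch_size must be positive")
    return (num_frags - 1) // batch_size + 1


NUM_FRAGS = 3_297_077  # last fragment iid reported by gatekeeper

if __name__ == "__main__":
    for label, size in [("seed, default", 100_000),
                        ("extend, default", 75_000),
                        ("extend, reduced", 20_000)]:
        print(f"{label:16s} batch={size:7d} jobs={batch_jobs(NUM_FRAGS, size)}")
```

Reducing the extend batch size from 75,000 to 20,000 trades fewer fragments per job (and so less memory per job) for roughly four times as many jobs.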
- Example: 6 concurrent jobs, each using 2 threads and ~20G of memory
merOverlapperSeedConcurrency=6 => 6 jobs
merOverlapperExtendBatchSize=20000
$ ps -C olap-from-seeds
PID    %MEM  RSZ(KB)   %CPU  STIME  TIME      CMD
13158  0.0   1132      0.0   10:21  00:00:00  /bin/sh 1-overlapper/olap-from-seeds.sh 90
13159  0.0   1136      0.0   10:21  00:00:00  /bin/sh 1-overlapper/olap-from-seeds.sh 91
13160  0.0   1136      0.0   10:21  00:00:00  /bin/sh 1-overlapper/olap-from-seeds.sh 92
13161  0.0   1136      0.0   10:21  00:00:00  /bin/sh 1-overlapper/olap-from-seeds.sh 93
13162  0.0   952       0.0   10:21  00:00:00  /bin/sh 1-overlapper/olap-from-seeds.sh 94
13163  0.0   1136      0.0   10:21  00:00:00  /bin/sh 1-overlapper/olap-from-seeds.sh 95
13199  15.6  20675720  133   10:21  02:46:39  olap-from-seeds -a -b -t 2 -S 1-overlapper/asm.merStore -c 3-overlapcorrection/0092.frgcorr.WORKING -o 1-overlapper/olaps/0092.ovb.WORKING.gz asm.gkpStore 1820001 1840000
13200  13.9  18383164  126   10:21  02:37:22  olap-from-seeds -a -b -t 2 -S 1-overlapper/asm.merStore -c 3-overlapcorrection/0094.frgcorr.WORKING -o 1-overlapper/olaps/0094.ovb.WORKING.gz asm.gkpStore 1860001 1880000
13201  16.5  21870940  135   10:21  02:49:08  olap-from-seeds -a -b -t 2 -S 1-overlapper/asm.merStore -c 3-overlapcorrection/0093.frgcorr.WORKING -o 1-overlapper/olaps/0093.ovb.WORKING.gz asm.gkpStore 1840001 1860000
13203  17.8  23603964  130   10:21  02:41:51  olap-from-seeds -a -b -t 2 -S 1-overlapper/asm.merStore -c 3-overlapcorrection/0095.frgcorr.WORKING -o 1-overlapper/olaps/0095.ovb.WORKING.gz asm.gkpStore 1880001 1900000
13204  12.6  16763480  130   10:21  02:41:47  olap-from-seeds -a -b -t 2 -S 1-overlapper/asm.merStore -c 3-overlapcorrection/0090.frgcorr.WORKING -o 1-overlapper/olaps/0090.ovb.WORKING.gz asm.gkpStore 1780001 1800000
13205  15.2  20139808  138   10:21  02:52:35  olap-from-seeds -a -b -t 2 -S 1-overlapper/asm.merStore -c 3-overlapcorrection/0091.frgcorr.WORKING -o 1-overlapper/olaps/0091.ovb.WORKING.gz asm.gkpStore 1800001 1820000

$ vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b  swpd     free    buff   cache    si  so  bi  bo  in  cs  us  sy  id  wa  st
 9  0  9534576  121356  10124  2743568  5   23  7   23  36  69  16  0   83  1   0

$ free
                    total      used       free      shared  buffers  cached
Mem:                132168632  132043920  124712    0       10064    2934076
-/+ buffers/cache:             129099780  3068852
Swap:               67108856   8842720    58266136
- Location:
ginkgo:/scratch1/brugia_malayi/Assembly/454/CA
/scratch1/ -> umiacsfs01:/xraid03
ginkgo: 32 processor machine
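The ps and free numbers above suggest that six concurrent olap-from-seeds jobs nearly fill physical memory. A minimal sketch of that budget check, using the RSZ values and the Mem total from the output above (the function name and the "jobs at peak footprint" heuristic are illustrative assumptions, not part of runCA):

```python
# Resident set sizes (KB) of the six olap-from-seeds processes, from ps.
RSZ_KB = [20675720, 18383164, 21870940, 23603964, 16763480, 20139808]

# Total physical memory (KB), from the Mem: line of free.
TOTAL_MEM_KB = 132168632


def memory_summary(rsz_kb, total_kb):
    """Summarize how much of physical memory a set of jobs occupies."""
    used = sum(rsz_kb)
    peak = max(rsz_kb)
    return {
        "used_kb": used,
        "fraction": used / total_kb,
        # How many jobs would fit if every job reached the largest
        # footprint observed so far (a conservative guess).
        "jobs_at_peak": total_kb // peak,
    }


if __name__ == "__main__":
    summary = memory_summary(RSZ_KB, TOTAL_MEM_KB)
    print(f"resident: {summary['used_kb']} KB "
          f"({summary['fraction']:.1%} of RAM), "
          f"jobs fitting at peak size: {summary['jobs_at_peak']}")
```

By this rough estimate the six jobs already hold about 92% of RAM, and only five would fit if each grew to the largest observed footprint, which is consistent with the swap usage shown by vmstat and free. Lowering the concurrency or the extend batch size would be the obvious knobs to try.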
CBCB 454 newbler deNovo
CBCB 454 newbler refMapper
- Ref: NCBI assembly
- Ctg stats
.    ctgs    min  q1   q2   q3   max   mean    n50  sum
All  101286  100  236  323  530  7013  433.36  535  43893507
- Read stats
.                         seqs     min  q1   q2   q3   max   mean    n50  sum
All                       3297077  2    155  247  300  2042  243.49  274  802794270
Full|Partial              562554   27   210  268  330  715   269.87  290  151814761
Chimeric|Repeat|Unmapped  1071953  21   182  253  299  2042  249.97  272  267957796
Missing                   1662570  2    135  222  288  759   230.38  269  383021713
- Lib stats
Library    Chimeric%  Full%  Missing%  Partial%  Repeat%  Unmapped%  All%  All      Linker
All        0          21     35        2         9        33         100   4763659
E4RA0X101  0          22     0         2         10       64         100   271066   .
E4RA0X102  0          21     0         2         11       64         100   260166   .
FEZH5RS01  0          43     75        2         8        9          100   181035   flx
FEZH5RS02  0          43     75        2         8        9          100   211064   flx
FHAVB5T02  0          43     79        2         8        9          100   68708    flx
FHAVB5T03  0          44     79        2         8        9          100   69306    flx
FHAVB5T04  0          44     79        2         8        9          100   70028    flx
FIOXLOM01  2          41     0         1         63       7          100   10921    flx no mates
FIOXLOM02  2          40     0         1         64       7          100   11157    flx no mates
FIOXLOM03  0          15     39        0         7        51         100   12197    flx no mates ???
FIOXLOM04  0          15     39        0         7        51         100   12727    flx no mates ???
FRLDXKV01  0          25     0         3         12       61         100   17349    .
FT9KOI001  0          26     0         3         13       64         100   108826   .
FW1OXFY01  0          29     60        2         13       51         100   86127    tit
FW1OXFY02  0          29     60        2         13       51         100   106911   tit
FW1OXFY03  0          29     60        2         13       51         100   109874   tit
FW1OXFY04  0          29     59        2         13       51         100   104797   tit
FW1OXFY05  0          29     59        2         13       51         100   113716   tit
FW1OXFY06  0          29     59        2         13       51         100   110693   tit
FW1OXFY07  0          29     59        2         13       51         100   105931   tit
F4H5CMB01  0          28     57        2         13       55         100   626046   tit
F4H5CMB02  0          29     60        2         14       55         100   628432   tit
Files
/fs/szattic/asmg1/adelcher/Genomes/Brugia            : Art's files
/fs/sztmpscratch/cole/tarchive_download/brugia_malay : Cole's files
/fs/szasmg3/dpuiu/Brugia_malayi/                     : Daniela's files
/scratch1/brugia_malayi/Data/                        : ftp PITT data
/fs/szattic-asmg4/brugia_malayi                      : ftp PITT data (as well)