Brugia malayi: Difference between revisions
assembled     14626 0.76 23.74 27.47 30.26 60.37 27.24 28 .
not_assembled 12252 0.00 26.87 30.27 34.39 72.30 30.80 31 .
=== CBCB CA Sanger (Art's) ===
''My redo of the assembly using just original Sanger reads (after removing jird contaminant and doing some extra vector trimming) got the following:''
 TotalBasesInScaffolds   81,379,515
 N50ScaffoldBases        80,913      <<** wrt TBS=70676234
 MaxBasesInScaffolds     6,446,756
 IntraScaffoldGaps       2,758
 TotalContigsInScaffolds 12,564
 MaxContigSize           565,900
 N50ContigBases          36,160      <<** wrt TBS=70676234
''The read coverage of unitigs was strongly biased by GC content. For example, unitigs with 23% GC averaged one read every 134 bp, while unitigs with 40% GC averaged one read every 23 bp. So I used these values to recompute the unitig astats (the astats indicate whether a unitig is likely to be a repeat or not). This is a more principled way of doing the "boosting" that we did on the original assembly. The assembly changed to:''
 TotalBasesInScaffolds   79,851,223
 N50ScaffoldBases        100,938     <<** wrt TBS=70676234
 MaxBasesInScaffolds     6,435,383
 IntraScaffoldGaps       2,616
 TotalContigsInScaffolds 12,232
 MaxContigSize           1,356,278
 N50ContigBases          39,235      <<** wrt TBS=70676234
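The GC-adjusted astat idea described above can be sketched as follows. This is only an illustration, not Art's actual code: it assumes the usual Poisson log-odds form of the A-statistic (expected reads at single copy minus reads observed times ln 2), and it assumes a simple linear interpolation of the read arrival rate between the two calibration points quoted in the text (one read per 134 bp at 23% GC, one read per 23 bp at 40% GC). The function names and the clamping outside that GC range are hypothetical choices.

```python
import math

# Calibration points taken from the text: (GC%, reads per bp).
CALIBRATION = [(23.0, 1 / 134), (40.0, 1 / 23)]


def arrival_rate(gc_percent):
    """Reads per bp expected for a single-copy unitig of the given GC%.

    Assumption: linear interpolation between the two calibration points,
    clamped to the nearest point outside the 23-40% range.
    """
    (gc_lo, rate_lo), (gc_hi, rate_hi) = CALIBRATION
    gc = min(max(gc_percent, gc_lo), gc_hi)
    frac = (gc - gc_lo) / (gc_hi - gc_lo)
    return rate_lo + frac * (rate_hi - rate_lo)


def astat(span_bp, n_reads, gc_percent):
    """Log-odds that a unitig is single copy rather than two-copy.

    Positive values favour a unique unitig, negative values a repeat.
    The assembler's own implementation may differ in detail.
    """
    expected = span_bp * arrival_rate(gc_percent)
    return expected - n_reads * math.log(2)


# A 1,340 bp unitig at 23% GC is expected to hold about 10 reads.
print(round(astat(1340, 10, 23.0), 2))   # looks unique
print(round(astat(1340, 20, 23.0), 2))   # twice the expected depth: looks repeated
```

With a single genome-wide arrival rate, low-GC unitigs would look too shallow and high-GC unitigs too deep; making the rate depend on GC is one way to remove that bias before calling repeats.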
''Note that all assemblies above used only the original Sanger reads. I next added the 3Kb and 20Kb paired 454 reads (with some extra linker trimming and removal of duplicate mate-pairs). This reduced the size of unitigs (N50 fell from 9142 to 5297), indicating there are still some trimming issues. The coverage bias is also smaller with the 454 reads, and the astat adjustments are less effective. The best assembly of these data that I have (based on N50 sizes) calculated astats assuming a genome size of 70Mb:''
 TotalBasesInScaffolds   80,688,005
 N50ScaffoldBases        358,475     <<** wrt TBS=70676234
 MaxBasesInScaffolds     3,020,329
 IntraScaffoldGaps       4,039
 TotalContigsInScaffolds 13,894
 MaxContigSize           602,785
 N50ContigBases          43,796      <<** wrt TBS=70676234
== Files ==
Latest revision as of 14:03, 30 March 2010
Articles
the copy number of the repeat in B. malayi was found to be about 30,000. The 320-base-pair Hha I repeated sequences are arranged in direct tandem arrays and comprise about 12% of the genome.
Brugia malayi Hha I-repeat family element
>gi|156092|gb|M12691.1|BRPRSHA Brugia malayi Hha I-repeat family element
GCGCATAAATTCATCAGCAAAATTAATAAAACTTTCAATTAATCATGATTTTAATTGAATGTAAGAATTT
AAATTAAATTTAAATTCAAATTTAAATTTTTAATTTTTTAAAAATTTTAAAATTTGTTATAGTTTTCCTT
CATTAGACAAGGATATTGGTTCTAATTTATCAATTTTAATTCTAATTAAGTGCCAAAACTACTAAAAAAA
GCTTATTTTGAAATTAATTGACTACGTTAGCTGCATTGTACCAGTGCTGGTCGTGTATTGTGTTGTCATT
TTATAGTTTAAATATTAAAATACGCTTTTGTAATTAAGTTTT
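A quick arithmetic check of the Hha I figures quoted above (about 30,000 copies of a 320 bp unit, about 12% of the genome). This is only a sanity check under the assumption that every copy is full length; the helper name is hypothetical.

```python
def repeat_fraction(unit_bp, copies, genome_bp):
    """Fraction of the genome occupied by a repeat, assuming full-length copies."""
    return unit_bp * copies / genome_bp


repeat_bp = 320 * 30_000                 # total bp in Hha I repeats
implied_genome_bp = repeat_bp / 0.12     # genome size at which this is 12%

print(repeat_bp)                         # 9600000
print(round(implied_genome_bp / 1e6))    # genome size in Mb implied by the 12% figure
```

The two figures are consistent with each other only for a genome of about 80 Mb, which is in the range of the assembly totals listed elsewhere on this page.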
Genome Info
- 6 chromosomes: 1-5, XY; diploid genome ~110 Mb
- 30% GC
- 32% coding, 15% repeats
Genome Project
Brugia malayi has a diploid genome of approximately 110 Mb, organized in 6 pairs of chromosomes (five pairs of autosomes and one pair of sex chromosomes). In addition to the nuclear genome, B. malayi has a mitochondrial genome of about 14 kb and harbors the bacterial endosymbiont Wolbachia sp., whose genome is 1-2 Mb.
The B. malayi genome project has been completed by The Institute for Genomic Research. Whole Genome Shotgun sequencing was used to obtain more than eight-fold coverage of the genome. The complete genome was assembled into approximately 8200 scaffolds and deposited in GenBank. The accession for the WGS project is AAQA00000000 and consists of sequences AAQA01000001-AAQA01029808. File location:
/fs/szasmg3/dpuiu/Brugia_malayi/Data/Bm.fasta
ctgs min q1 q2 q3 max mean n50 sum
26,879 200 836 1005 1495 611,244 3241.17 18986 87,119,350
- TIGR Genome project (TRS strain)
Contamination
- mitochondrion finished: 13,657 bp; 24% GC
- Wolbachia endosymbiont strain TRS from Brugia malayi strain wMel 1.08 Mbp 34% GC
- complete: New England Biolabs
- progress: TIGR
- Rodent: some trace contamination; 44%GC
/fs/szasmg3/dpuiu/Brugia_malayi/Data/contam.fasta
contaminants min q1  q2  q3  max  mean   n50 sum
2,929        200 527 675 820 8994 740.04 762 2,167,588
- pUC19c vector: 2686bp, 50.63% GC
Data
1.26M Sanger reads (original TA)        : medLen=773bp; medGC=32.57%
1.26M Sanger reads (contamination free) : medLen=771bp; medGC=32.36%
3.21M 454 reads (original sff)          : medLen=274bp
3.29M 454 reads (linker free)           : medLen=247bp; medGC=36.39%
Original Traces
- 1.26M Sanger reads & 15 Libraries:
- NCBI TA
- NCBI TA FTP
SEQ_LIB_ID INSERT_SIZE INSERT_STDEV TRACE_TYPE_CODE READS
1047113828118 1000 300 WGS 13500
1047113856575 1000 300 PRIMERWALK 325
1047111632737 1258 377 PRIMERWALK 3
1047111632737 1258 377 WGS 305,906
1047111540304 1415 424 WGS 51772
1047112577106 1415 424 WGS 337,789
1047111718946 3123 936 WGS 47597
1047113358719 3123 936 PRIMERWALK 173
1047113358719 3123 936 WGS 246,185
1047174912885 3123 936 TRANSPOSON 1437
1047113570927 6000 1800 WGS 3193
1047111814561 7158 2147 WGS 219,306
1047111480027 17168 5150 WGS 4087
1047111488095 17168 5150 WGS 3434
1047111495007 17168 5150 WGS 3716
1047111501919 17168 5150 WGS 3697
1047111480605 22419 6725 WGS 4638
1047111516154 22419 6725 WGS 4004
1047111523212 22419 6725 WGS 3766
1047111530126 22419 6725 WGS 5686
1047113855421 23000 6900 WGS 1
total 1,260,215
FRG file: (contaminant free)
- FRG.src : TI's
- FRG.acc: 2 ..
- DST.acc: 1260217, ... , 1260234
- Location
/fs/szasmg3/dpuiu/Brugia_malayi/Data/nucmer_seq/Bm-all.frg
DST 15
FRG 1178192
LKG 530930
.   seqs      min  q1  q2    q3  max  mean  n50 sum
len 1,178,192 65   645 771   850 1214 724   800 853,847,771 => 8X
gc% 1,178,192 0.00 29  32.36 35  100  32.41 33  .
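The "=> 8X" annotation is fold coverage: total read bases divided by genome size. A minimal sketch, showing how much the figure depends on which genome size is assumed. The 110 Mb value is the one given under Genome Info; the 100 Mb value is only an illustrative assumption, since the page does not say which size was used for the annotation.

```python
def fold_coverage(total_read_bp, genome_bp):
    """Average number of times each genome base is covered by reads."""
    return total_read_bp / genome_bp


SANGER_CLEAN_BP = 853_847_771  # "sum" of the contaminant-free Sanger reads

for genome_mb in (100, 110):   # 100 Mb is an assumed value, for comparison only
    cov = fold_coverage(SANGER_CLEAN_BP, genome_mb * 1_000_000)
    print(f"{genome_mb} Mb genome: {cov:.1f}X")
```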
Problems:
- All library insert sizes are underestimated ???
- The contaminant reads align at ~91-93% identity to the contaminant ctgs, while the Mt/We reads align at 99% identity to the finished Mt/We sequences. What %id threshold should be used for contaminants?
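One way to handle the threshold question is to use a separate identity cutoff per target class instead of a single global one. The sketch below only illustrates that idea: the Hit record, the target labels and every cutoff value are hypothetical, and the hits are assumed to have been parsed already from whatever alignment output is in use.

```python
from typing import Dict, Iterable, NamedTuple, Set


class Hit(NamedTuple):
    read_id: str
    target: str          # hypothetical labels: "rodent", "Mt", "We"
    pct_identity: float  # percent identity of the alignment
    pct_read_len: float  # percent of the read covered by the alignment


# Illustrative cutoffs only; the right values are exactly the open question.
MIN_IDENTITY = {"rodent": 90.0, "Mt": 98.0, "We": 98.0}
MIN_READ_LEN_PCT = 80.0


def flag_reads(hits: Iterable[Hit]) -> Dict[str, Set[str]]:
    """Map each read that passes a class-specific cutoff to the classes it hit."""
    flagged: Dict[str, Set[str]] = {}
    for hit in hits:
        cutoff = MIN_IDENTITY.get(hit.target)
        if cutoff is None:
            continue  # unknown target class: ignore
        if hit.pct_identity >= cutoff and hit.pct_read_len >= MIN_READ_LEN_PCT:
            flagged.setdefault(hit.read_id, set()).add(hit.target)
    return flagged


example = [
    Hit("r1", "rodent", 92.0, 95.0),  # divergent rodent hit: flagged
    Hit("r2", "Mt", 92.0, 95.0),      # too divergent for Mt: kept
    Hit("r3", "We", 99.0, 40.0),      # short partial hit: kept
]
print(flag_reads(example))            # {'r1': {'rodent'}}
```

A lower cutoff for the rodent class reflects that the contaminant reference is presumably a related species rather than the exact source, while Mt and We are compared against finished sequences of the same organism.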
BACS
8,000 BAC clones @Children's Hospital Oakland Research Institute. (!!! no NCBI TA submission)
PITT FTP data
- 3.21M 454 reads ; about 13% are mated
- 3K insert flx libraries (estimated to 2K based on alignment to the existing assembly)
- 20K insert tit libraries (estimated to 28K ...)
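A minimal sketch of estimating a library's insert size from mates aligned to an existing assembly, as mentioned in the list above. The input tuple layout is a hypothetical choice, not the format actually used here, and the median is just one reasonable estimator.

```python
from statistics import median


def estimate_insert_size(pairs):
    """Median outer distance between mates placed on the same contig.

    pairs: iterable of (contig_a, start_a, end_a, contig_b, start_b, end_b)
    with coordinates on the contig. Pairs whose mates land on different
    contigs are skipped. Returns None if no pair is usable.
    """
    spans = []
    for contig_a, start_a, end_a, contig_b, start_b, end_b in pairs:
        if contig_a != contig_b:
            continue
        coords = (start_a, end_a, start_b, end_b)
        spans.append(max(coords) - min(coords))
    return median(spans) if spans else None


mates = [
    ("ctg1", 100, 350, "ctg1", 2100, 1900),   # span 2000
    ("ctg1", 0, 200, "ctg2", 5, 300),         # different contigs: skipped
    ("ctg1", 1000, 1200, "ctg1", 4000, 3800), # span 3000
]
print(estimate_insert_size(mates))            # 2500.0
```

Because only pairs with both mates on one contig contribute, contigs shorter than the true insert cannot hold a full-length pair, which can bias such an estimate.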
CBCB Location:
/fs/szattic-asmg4/brugia_malayi/Data/
/fs/szattic-asmg4/brugia_malayi/Data/Sff/   # Sff files
/fs/szattic-asmg4/brugia_malayi/Data/Frg/   # Frg files
/fs/szattic-asmg4/brugia_malayi/Data/Seq/   # Seq files
FTP access:
lftp -u bma 136.142.191.201 pass: 6279 user: bma # empty as of --Dpuiu 12:04, 8 January 2010 (EST)
Elodie's table:
/scratch1/brugia_malayi/brugia-sequencing-summary.txt.csv
#  elodie's date  protocol  platform  type  description  run_name  Reads  Mates
1  01/17/2008  WGS    Standard  Full run (2/2)  Mix of worms (calibration of the machine)  R_2008_01_31_18_01_35_FLX10070260_adminrig_ghedintestsample  534822  0
2  07/01/2008  3Kb    Standard  Full  single worm (pUC contamination)  R_2008_08_06_13_52_29_FLX10070260_adminrig_080608_Ghedin-BrugiaLTPE1  492575  84341
3  09/11/2008  3kb    Standard  4/8 wells  single worm (pUC contamination)  R_2008_09_19_14_17_55_FLX10070260_adminrig_091908_HATFULL-MIDrepeat_GHEDIN-LTPE1  263421  49258
4  10/01/2008  3Kb    Standard  Full  Mix of worms (still pUC contamination)  R_2008_10_14_15_06_50_FLX10070260_adminrig_101408_GHEDIN-Brugia-pool_LTPEtest  59711  5096
5  02/01/2009  WGS    Standard  1/4 wells  Mix of worms; regions 2 & 3 were myxoma  R_2009_02_27_16_11_34_FLX10070260_adminrig_022709_GHEDIN  18025  0
6  04/06/2009  WGS    Standard  1/4 wells  Mix of worms; with comp. bio run  R_2009_04_15_14_46_56_FLX10070260_adminrig_041509_GHEDIN_r1-WGS1_r2-LMW4_r3-pool2compbio_r4-pool3compbio  118490  0
7  05/01/2009  20Kb   Titanium  7/8 wells  Mix of worms  R_2009_06_05_16_18_41_FLX10070260_adminrig_060509_GHEDIN_Brugia-gDNA-TI20kb1  631287  213524
8  10/28/2009  20Kb   Titanium  Full  Mix of worms  R_2009_10_22_15_30_12_FLX10070260_adminrig_102209_GHEDIN_Brugia20kb2  1095713  377547
9  11/17/2009  3 Kb   Titanium  Full  Mix of worms  111209_Brugia_3kb.zip  868928  ?
.  Total  3411635  ?
- 22 Sff files:
run file sffReads linker
1 R_2008_01_31_18_01_35_FLX10070260_adminrig_ghedintestsample/D_2008_01_31_18_01_35_FLX10070260_adminrig_FullAnalysis/sff/E4RA0X101.sff 272923 .
1 R_2008_01_31_18_01_35_FLX10070260_adminrig_ghedintestsample/D_2008_01_31_18_01_35_FLX10070260_adminrig_FullAnalysis/sff/E4RA0X102.sff 261899 .
2 R_2008_08_06_13_52_29_FLX10070260_adminrig_080608_Ghedin-BrugiaLTPE1/D_2009_02_12_22_12_04_j_SignalProcessing/sff/FEZH5RS01.sff 228204 flx
2 R_2008_08_06_13_52_29_FLX10070260_adminrig_080608_Ghedin-BrugiaLTPE1/D_2009_02_12_22_12_04_j_SignalProcessing/sff/FEZH5RS02.sff 264371 flx
3 R_2008_09_19_14_17_55_FLX10070260_adminrig_091908_HATFULL-MIDrepeat_GHEDIN-LTPE1/FHAVB5T02.sff 86862 flx
3 R_2008_09_19_14_17_55_FLX10070260_adminrig_091908_HATFULL-MIDrepeat_GHEDIN-LTPE1/FHAVB5T03.sff 87488 flx
3 R_2008_09_19_14_17_55_FLX10070260_adminrig_091908_HATFULL-MIDrepeat_GHEDIN-LTPE1/FHAVB5T04.sff 89071 flx
4 R_2008_10_14_15_06_50_FLX10070260_adminrig_101408_GHEDIN-Brugia-pool_LTPEtest/FIOXLOM01.sff 13695 flx
4 R_2008_10_14_15_06_50_FLX10070260_adminrig_101408_GHEDIN-Brugia-pool_LTPEtest/FIOXLOM02.sff 14197 flx
4 R_2008_10_14_15_06_50_FLX10070260_adminrig_101408_GHEDIN-Brugia-pool_LTPEtest/FIOXLOM03.sff 15515 flx
4 R_2008_10_14_15_06_50_FLX10070260_adminrig_101408_GHEDIN-Brugia-pool_LTPEtest/FIOXLOM04.sff 16304 flx
5 R_2009_02_27_16_11_34_FLX10070260_adminrig_022709_GHEDIN/FRLDXKV01.sff 18025 .
6 R_2009_04_15_14_46_56_FLX10070260_adminrig_041509_GHEDIN_r1-WGS1_r2-LMW4_r3-pool2compbio_r4-pool3compbio/D_2009_04_16_14_19_21_morty_fullProcessing/FT9KOI001.sff 118490 .
7 R_2009_06_05_16_18_41_FLX10070260_adminrig_060509_GHEDIN_Brugia-gDNA-TI20kb1/D_2009_06_08_15_32_36_compute-0-2_fullProcessing/sff/FW1OXFY01.sff 73807 tit
7 R_2009_06_05_16_18_41_FLX10070260_adminrig_060509_GHEDIN_Brugia-gDNA-TI20kb1/D_2009_06_08_15_32_36_compute-0-2_fullProcessing/sff/FW1OXFY02.sff 91698 tit
7 R_2009_06_05_16_18_41_FLX10070260_adminrig_060509_GHEDIN_Brugia-gDNA-TI20kb1/D_2009_06_08_15_32_36_compute-0-2_fullProcessing/sff/FW1OXFY03.sff 93878 tit
7 R_2009_06_05_16_18_41_FLX10070260_adminrig_060509_GHEDIN_Brugia-gDNA-TI20kb1/D_2009_06_08_15_32_36_compute-0-2_fullProcessing/sff/FW1OXFY04.sff 90232 tit
7 R_2009_06_05_16_18_41_FLX10070260_adminrig_060509_GHEDIN_Brugia-gDNA-TI20kb1/D_2009_06_08_15_32_36_compute-0-2_fullProcessing/sff/FW1OXFY05.sff 97065 tit
7 R_2009_06_05_16_18_41_FLX10070260_adminrig_060509_GHEDIN_Brugia-gDNA-TI20kb1/D_2009_06_08_15_32_36_compute-0-2_fullProcessing/sff/FW1OXFY06.sff 94326 tit
7 R_2009_06_05_16_18_41_FLX10070260_adminrig_060509_GHEDIN_Brugia-gDNA-TI20kb1/D_2009_06_08_15_32_36_compute-0-2_fullProcessing/sff/FW1OXFY07.sff 90281 tit
8 R_2009_10_22_15_30_12_FLX10070260_adminrig_102209_GHEDIN_Brugia20kb2/F4H5CMB01.sff 551263 tit
8 R_2009_10_22_15_30_12_FLX10070260_adminrig_102209_GHEDIN_Brugia20kb2/F4H5CMB02.sff 544450 tit
9 111209_Brugia_3kb_1.sff 365339 tit
9 111209_Brugia_3kb_2.sff 503589 tit
total 3,214,044 #without run 9
total 4,082,972 #all runs
- 22 Frg Libraries
.  lib                  meanIns(orig)  meanIns(est)  #reads     #mates   linker  medLen  medGC
1  E4RA0X101            0              0             271066     0        .       250     37.04
1  E4RA0X102            0              0             260166     0        .       249     37.06
2  FEZH5RS01            3000           2000          181035     18676    flx     228     37.89
2  FEZH5RS02            3000           2000          211064     22270    flx     227     37.79
3  FHAVB5T02            3000           2000          68708      7850     flx     244     38.35
3  FHAVB5T03            3000           2000          69306      8227     flx     243     37.98
3  FHAVB5T04            3000           2000          70028      8353     flx     243     38.10
4  FIOXLOM01            3000           0             10921      0        flx     102     43.93  # no mates , shorter read length, highest GC !!!
4  FIOXLOM02            3000           0             11157      0        flx     103     43.75  # no mates , shorter read length, highest GC !!!
4  FIOXLOM03            3000           0             12197      0        flx     103     43.51  # no mates , shorter read length, highest GC !!!
4  FIOXLOM04            3000           0             12727      0        flx     103     43.56  # no mates , shorter read length, highest GC !!!
5  FRLDXKV01            0              0             17349      0        .       255     36.02
6  FT9KOI001            0              0             108826     0        .       256     36.10
7  FW1OXFY01            20000          28000         86127      15825    tit     276     35.40
7  FW1OXFY02            20000          28000         106911     19668    tit     275     35.37
7  FW1OXFY03            20000          28000         109874     20396    tit     271     35.23
7  FW1OXFY04            20000          28000         104797     18933    tit     269     35.23
7  FW1OXFY05            20000          28000         113716     20649    tit     265     35.09
7  FW1OXFY06            20000          28000         110693     20326    tit     270     35.05
7  FW1OXFY07            20000          28000         105931     19176    tit     271     35.18
8  F4H5CMB01            20000          28000         626046     109918   tit     241     36.08
8  F4H5CMB02            20000          28000         628432     118903   tit     256     35.87
9  111209_Brugia_3kb_1  3000           .             453449     100097   tit     199     31.19
9  111209_Brugia_3kb_2  3000           .             655457     165137   tit     214     30.93
.  total                .              .             3,297,077  429,170  #without run 9
.  total                .              .             4,405,983  694,404  #all runs
- Sff seqs clr (good qual)
.   seqs      min q1  q2  q3  max  mean n50 sum
all 3,214,044 0   240 274 383 2042 294  326 947,254,956 => 9.4X
- Frg seqs clr (good qual , no linker)
.       seqs      min q1  q2  q3  max  mean n50 sum
all     3,297,077 3   156 248 301 2043 244  275 806,091,347 => 8X
mated   858,340   64  107 156 223 612  171  201 147,070,298
unmated 2,438,737 2   207 261 335 2042 268  286 655,723,972
- Frg seqs GC%
.       seqs      min  q1    q2    q3    max   mean  n50 sum
all     3,297,077 0.00 29.25 36.39 44.54 86.76 36.86 39  .
mated   858,340   0.00 28.48 34.35 40.65 78.75 34.74 36  .
unmated 2,438,737 0.00 29.61 37.36 45.86 86.76 37.60 41  .
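A minimal sketch of the per-read GC% behind tables like the one above. How ambiguous bases were treated in the original numbers is not stated; this version assumes they are excluded from the denominator, and reports 0.0 for a read with no A/C/G/T at all.

```python
def gc_percent(seq):
    """GC content of a sequence as a percentage of its A/C/G/T bases.

    Assumption: N and other ambiguity codes are ignored. A sequence with
    no unambiguous bases returns 0.0.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        return 0.0
    return 100.0 * gc / (gc + at)


print(gc_percent("ATGC"))      # 50.0
print(gc_percent("atgcNN"))    # 50.0
print(gc_percent("AATTAATT"))  # 0.0
```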
- Locations:
/fs/szattic-asmg4/brugia_malayi/Data/Sff/ /fs/szattic-asmg4/brugia_malayi/Data/Frg/
Contaminant & high copy repeats
- nucmer -maxmatch "-l 20 -c 65" or "-l 12 -c 24"
.                  Sanger    454
jird(26,879 ctgs)  31,501    197,420   # we'd probably find more contaminated reads if we align all the reads to the whole mouse genome ??
Mt                 1,507     2,634     # 98% avg identity, 92% of read length
We                 49,014    23,249    # 98% avg identity, 92% of read length
UniVec             ?         661,586
pUC19              134       562,107   # 99% avg identity, 99% of read length
HhaI(~320bp)       16,336    69,400    # 90% avg identity, 63% of read length ;
                                       # 29,021 out of 69,400 454 reads align 2+ times => tandem repeat
                                       # 11,882 out of 16,272 454 mated reads that align have both mates aligned => 30K+ repeats
mRNA(264bp)        20,504    59,706    # 80% avg identity, 65% of read length
total              ~119,000  ~910,000
- List of contaminated reads:
/nfshomes/dpuiu/Brugia_malayi/Data/nucmer_sanger/problems.qry_hits   # 118,996 Sanger reads
/nfshomes/dpuiu/Brugia_malayi/Data/nucmer_454/problems.qry_hits      # 914,525 454 reads
/nfshomes/dpuiu/Brugia_malayi/Data/problems.qry_hits                 # 1,033,521 Sanger+454 reads
- Other possible contaminants: Schistosoma
In my latest Brugia assembly, I looked for contigs/degenerates that were exclusively Sanger reads, thinking they might be jird contaminants. I came across a degenerate, deg1596341, with 417 reads, all Sanger, and only 1235bp long. When I BLAST it against NCBI, the best hit (entire length, 99% identity) is to Schistosoma--then poorer hits to 28s rRNAs. It has lots of mate pairs to another degenerate, which matches Schistosoma just as well and in the right position, but that degenerate has some 454 reads. (Art)
>deg1596337
ATTAGACAGTCGGATTCCCCGAGTCCGTGCCAGTTCTAAGTTGACTGTTTAACGCCGGCCGAAATATCAA
ATAAAACATTTACTTTTTTAAAAAAAAAAATAAAAAAATAAATGTTGATATGCAGCTATAACGGTCCATA
AGACAGTTCGAACACTAGCCGAGTTTCATCAAAATGAATACATTTTTTTTTTTTAATGTTTTCATTTTAA
TGTTACACTGCATGGATCAAACCGTACTCACTTCACATTACAGCCCGACCGGCCCAGTCCTTAGAGCCAA
TCCTTATCCCGAAGTTACGGATCTAATTTGCCGACTTCCCTTACCTACATTATTCTATCGACTAGAGGCT
GTTCACCTTGGAGACCTGCTGCGGATATGGGTACGATCTGGCACGAAATTCAAATAGCTTCCCTCGGATT
TTCATGGATCGAACAAAGCGCACGAGACACCACAGGAACCGTGGCGCTTTACGGAAACAACATCCCTATC
TCCGGCTGAACCGATTCCAGGGAGTCCGTTCCTTAACCAGAAAAGAGAACTCTGGCTCGGGCTTTCCTCA
ATGTTTCCGAGTTCATTTGCGTTACCGCGCTAAATTCTCACGATGAGCATTTATCTCCGTGTCCAGGTAC
GGGAATATTAACCCGTTTCCCTTTCGATTTATCAGATGGATTACACCTCCATTCCTCTATTTTATTTTAA
AAAACGGCACTAGCCAATATCTTAGGATCGACTGACCCACATTCAACTGCTGTTCACGTGGAACCCTTCT
CCACTTCAGTCTTCAAGGATCTCACTTGAATATTTGCTACTACCACCAAGATCTGCACCAATGGAAGCTT
CAACCGGGCCTACGCCCAAAGTCTTCAACGCTAACCATTGCGACCCTCTTACTCGTTGCGGCCAGATTTC
CCAAAAAAAAAAAAACACAAGCCATGCAACGGTTGAGTATAAGTCTCCCGCTCAAGCGCCATCCATTTTC
AGGGCTAGTTGATTTGGCAGGTGAGTTGTTACACACTCCTTAGCGGTTTCCAACTTCCATGGCCACCGTC
CTGCTGTCTATATCAACCAACGCCTTTCATGGGGTCTCATGAGCGGAAAGTTTGGCACTTTAACTCAACG
TTTGGTTCATCCCACAGCGCCAGTTCTGCTTACCAAAAATGGCCCACTTGGAGCACACATTCAATGTCTA
TGCTTCATAAAAAATTTAAGCAAGCAAGACGTCATACTCATTGAAAGTTTGAGAATAGGTTGAAGAC
>deg1596341
CCAATTATACCAAAGATAATCTTTACTTTCATTATGCTTTTTATCTTTTAAATTAGGTTTACTACCCAAT
AACTTGCGTATATGCTAGACTCCTTGGTCCGTGTTTCAAGACGGGTCAGATAGGTGATTAACGTTCACAT
CGAGATGTAACTTTATTGCATACAATATTATAATATTACCAATTATTTTTACCGATAAAGTCGCATGCGA
CCACATGTAAAATAATAATAAGCAAAATTATAATCGATACATGTCACTATTATTTCAAGTGAAAGTTACA
TATATGGGAAAAAAAAAAAAAACTTCATCTAAGACATATTTCAACATAATTTAGGATTCCAATTATCAAT
TGAAATAATTGGTCCACTAAATTAACTTGTATTAATATGCTAAAATGAAGTTCTCGATGCATACCATCGG
TAAATACACCAATCTATGCATATACTGCTAATTTAGCATTAATATCATTTTATTCATTAATAAAAAAAAA
AAAATTATTAATGAATAATGAAATGAATTATGATTGCTAAATTGATTGGTTGAATACCGATAAGTTTTGT
TAACTCTATCCGTTTCCATCTCAGCGGTTTCACGCCCTCTTGAACTCTCTCTTCAAAGTTCTTTGCAACT
TTCCCTCACGGTACTTGTTTGCTATCGGTCTCATGGTCGTATTTAGCCTTAGATGAGGTTTACCACCCTC
TTTGGGCTGCAATCTCAAACAACCCGACTCCAAGGAATAACCTACCGTAACTTTTTTCACCCGTACAGGT
CTAGCACCTTCTATGGACTGTAGCCCCGCTCAAGGGGACTTTGGGTGTAAAAATATGTTACGGATAGTTA
TACCTATACGCTACATTTCCATATAGCCATATAATGTCTATTGGATTCAGCGTTGGGCTTTTTCCTTTTC
ACTCGCCGTTACTAGGGAAATCCTCGTTAGTTTCTTTTCCTCCGCTTAGTTATATGCTTAAATTCAGCGG
GTAATCACGACTGAGTTGAGGTCAAAAAAAAAAAAAATGATATAAAACATATTGAAATTATCATTCATAT
ATATATGCTAATTTTTTACCTTATTTATTTGTTTATTTTAATGTTTCAAATAACTTGCATTTTAATTTGA
AACATTTAACAACAAAACAAACAAACAATAAAGTAAATCAATGCATAATAAATAAATAATTGTAATCTTT
CTTTATTATTTATTCATGAAAGATTACTTTTTAATATATATATAT
... possible contamination at the library construction level. Schisto was being sequenced at the same time as Brugia at TIGR. Does this mean we should first filter all the Sanger reads against Schisto now that the Schisto genome is available? (Elodie)
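The screening step Art describes (looking for contigs/degenerates made up exclusively of Sanger reads) can be sketched as below. The input structure is hypothetical: a mapping from contig id to the platform label of each read placed in it, which would have to be derived from the assembly output by whatever means is available. The ids in the example are made up.

```python
def sanger_only(contig_reads, min_reads=1):
    """Ids of contigs whose reads are all Sanger, sorted for stable output.

    contig_reads: dict of contig id -> list of platform labels, one per read
    ("sanger" or "454"). Contigs with fewer than min_reads reads are skipped.
    """
    return sorted(
        contig_id
        for contig_id, platforms in contig_reads.items()
        if len(platforms) >= min_reads
        and all(platform == "sanger" for platform in platforms)
    )


example = {
    "deg_a": ["sanger"] * 417,           # all Sanger: candidate contaminant
    "deg_b": ["sanger"] * 300 + ["454"], # mixed: not reported
    "deg_c": ["454", "454"],
}
print(sanger_only(example, min_reads=10))  # ['deg_a']
```

As the deg1596337 case shows, a contaminant can still pick up a few 454 reads, so a strict all-Sanger test will miss some candidates; a fraction-based cutoff would be a natural variation.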
Assemblies
TIGR/NCBI
- 9X coverage, 856K Sanger traces => 8,200 scaff & 29,808 ctg (avg. scaff=~10K & avg ctg=~3K)
- "scaffolds totaling ~71 Mb of data with a further ~17.5 Mb of contigs not integrated into any scaffold (orphan contigs)" (Science 2007)
- NCBI AAQA00000000 AAQA01000001-AAQA01029808
* 26,879 good ctgs
* 2,929 jird contaminants (Example: AAQA01001321 : mouse 99%id hits)
- Stats
.                     elem  min q1  q2   q3   max     mean    n50   sum
ctg.len(good)         26879 200 836 1005 1495 611244* 3241.17 18986 87,119,350
ctg.len(contaminants) 2929  200 527 675  820  8994    740.04  762   2,167,588
. elem min q1 q2 q3 max mean n50 sum
ctg.gc%(good) 26878 0.00 24.77 28.56 32.27 72.30 28.86 29 .
ctg.gc%(contaminants) 2929 18.09 39.16 43.59 48.35 75.96 44.10 44 .
- Location
/fs/szasmg3/dpuiu/Brugia_malayi/Assembly/TIGR/ <-> NCBI
PITT
- Date: 11/05/08
- Stats:
.       elem min   q1    q2    q3    max      mean  n50     sum
scf.len 3170 2000  2917  4483  14471 6534162* 22916 112914* 72,643,770 (66,051,795bp without gaps)
scf.gc% 3170 15.70 25.53 28.17 30.95 66.60    28.46 28      .
- Location:
/fs/szasmg3/dpuiu/Brugia_malayi/Assembly/PITT/
CBCB CA 5.1 Sanger
- Assembler: wgs 5.1
- Date: 2008/08/26
- Input: filtered Sanger reads
- better assembly than the published one
- repeat Hha appears in a few dozen contigs but not in tandem
- Stats:
.      elem  min  q1   q2   q3    max     mean     n50   sum
scf    10317 935  1215 1538 3462  3890532 8018.85  41716 82,730,474
scf2K+ 3656  2001 3181 5733 18904 3890532 20189.57 50293 73,813,083
ctg    12753 273  1245 1632 3873  376744  6113.39  24748 77,964,006
deg    9661  65   858  949  1023  72494   1240.97  1008  11,988,997
singl  134119 (11.43%)
reads  1178192 (100%)
- Location:
/fs/szasmg3/dpuiu/Brugia_malayi/Assembly/CBCB/2008_0826_CA/
CBCB CA 6.0 454 (failed)
- Assembler: wgs 6.0-beta
- Input: 3,297,077 454 sffToCA processed reads
- Locations:
ginkgo:/scratch1/brugia_malayi/Assembly/454/CA.failed/
/scratch1/ -> umiacsfs01:/xraid03
ginkgo: 32 proc, 128G mem
genome6.umd.edu:/genome6/raid/dpuiu/Brugia_malayi/Assembly/CA.bog/
genome6: 32 proc, 256G mem
- Problem: high frequency contamination & repeats
obtMerThreshold, ovlMerThreshold set on auto (default) !!!
runCA estimated them to: (see runCA.log)
Reset OBT mer threshold from auto to 37235.
Reset OVL mer threshold from auto to 43186.

=> olap-from-seeds very memory/cpu intensive!!!
Example: 6 jobs: each is 2 thread, ~ 20G mem
merOverlapperSeedConcurrency=6 => 6 jobs
merOverlapperExtendBatchSize=20000

$ ps -C olap-from-seeds
PID   %MEM RSZ(KB)  %CPU STIME TIME     CMD
13158 0.0  1132     0.0  10:21 00:00:00 /bin/sh 1-overlapper/olap-from-seeds.sh 90
...
13163 0.0  1136     0.0  10:21 00:00:00 /bin/sh 1-overlapper/olap-from-seeds.sh 95
13199 15.6 20675720 133  10:21 02:46:39 olap-from-seeds -a -b -t 2 -S 1-overlapper/asm.merStore -c 3-overlapcorrection/0092.frgcorr.WORKING -o 1-overlapper/olaps/0092.ovb.WORKING.gz asm.gkpStore 1820001 1840000
...
13205 15.2 20139808 138  10:21 02:52:35 olap-from-seeds -a -b -t 2 -S 1-overlapper/asm.merStore -c 3-overlapcorrection/0091.frgcorr.WORKING -o 1-overlapper/olaps/0091.ovb.WORKING.gz asm.gkpStore 1800001 1820000

$ vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd    free   buff  cache   si so bi bo in cs us sy id wa st
9 0 9534576 121356 10124 2743568 5  23 7  23 36 69 16 0  83 1  0

$ free
                   total     used      free     shared buffers cached
Mem:               132168632 132043920 124712   0      10064   2934076
-/+ buffers/cache:           129099780 3068852
Swap:              67108856  8842720   58266136
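A back-of-the-envelope check of the numbers above: how many overlap jobs of the observed size fit in memory at once. The function and its reserve_fraction parameter are hypothetical; the inputs are the Mem total from free and the larger RSZ from ps, both in KB.

```python
def max_concurrent_jobs(total_kb, per_job_kb, reserve_fraction=0.0):
    """Number of jobs of size per_job_kb that fit in total_kb of RAM.

    reserve_fraction is the share of RAM kept free for the OS, caches and
    other processes (an illustrative knob, not an assembler option).
    """
    usable_kb = total_kb * (1.0 - reserve_fraction)
    return int(usable_kb // per_job_kb)


TOTAL_KB = 132_168_632   # "Mem: total" from free
JOB_KB = 20_675_720      # largest RSZ(KB) from ps

print(max_concurrent_jobs(TOTAL_KB, JOB_KB))                        # no headroom
print(max_concurrent_jobs(TOTAL_KB, JOB_KB, reserve_fraction=0.1))  # 10% kept free
```

Six jobs fit only if essentially all RAM goes to the overlapper, which agrees with the swap usage shown by vmstat and free; under this rough model, one job fewer would have left some headroom.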
- 1-overlapper
#overlaps/read
0+   3297077
0    1282138 !!! more than 1/3 of reads have no overlaps
1+   2014939
100+ 561195

.      #seqs  #maxOvl
100+   561196 99376
UniVec 359340 99376
pUC19  355392 99376
contam 116857 97757
Hha1   50474  29186
mRNA   36296  12935
other  46214  4720
- 8-consensus failed on utg
- 9-terminator => asm.asm.FAILED
CBCB CA 6.0 454
- Input: 3,297,077 454 sffToCA processed reads
- obtMerThreshold=200, ovlMerThreshold=60
.         elem    min q1   q2   q3   max   mean    n50  sum
scf.len   1406    585 1077 1178 1416 23313 1583.79 1354 2226813
ctg.len   1513    280 1074 1177 1404 4952  1309.21 1257 1980831
deg.len   192840  63  173  262  368  5688  290.91  339  56098824
utg.len   194394  63  174  263  373  5691  298.80  348  58084744
singl.len 1039561 64  162  249  277  1519  244.45  265  254119265
seq.len   2618840 64  140  239  287  1519  234.61  268  614416134

.         elem    min   q1    q2    q3    max   mean  n50 sum
scf.gc%   1406    17.45 29.91 32.75 35.47 69.26 33.21 33  .
ctg.gc%   1513    17.45 30.01 32.54 34.93 69.26 32.94 33  .
deg.gc%   192840  4.08  27.62 33.33 40.25 80.99 34.17 35  .
utg.gc%   194394  4.08  27.65 33.33 40.20 80.99 34.16 35  .
singl.gc% 1039561 0.00  34.62 43.11 49.35 86.76 41.58 45  .
seq.gc%   2618840 0.00  29.44 37.01 45.48 86.76 37.36 40  .
CBCB CA 6.0 Sanger
- Input: 1,178,192 Sanger "clean" reads
- obtMerThreshold=200, ovlMerThreshold=60
.          elem    min  q1   q2   q3    max     mean     n50   sum
scf.len    10148   904  1235 1578 3564  2324103 8619.28  45146 87468465
scf.len2K+ 3701    2001 3123 6500 20644 2324103 21272.70 52830 78730278
ctg.len    12659   274  1265 1658 3811  565900  6478.56  27474 82012118
deg.len    8497    65   820  914  987   82776   1211.94  980   10297846
utg.len    40545   64   943  1247 1751  306650  2459.44  4231  99718105
singl.len  89048   64   545  702  812   1181    661.17   753   58875872
seq.len    1173341 64   716  827  915   1222    790.88   850   927974911

.          elem    min   q1    q2    q3    max   mean  n50 sum
scf.gc%    10148   13.21 25.02 28.34 32.07 65.75 29.11 29  .
ctg.gc%    12659   13.21 24.86 28.22 31.70 65.75 28.78 28  .
deg.gc%    8497    7.87  23.87 28.90 33.86 77.13 29.60 30  .
utg.gc%    40545   1.39  25.26 29.02 32.72 77.13 29.29 30  .
singl.gc%  89048   0.00  31.87 38.97 45.31 99.15 38.52 41  .
seq.gc%    1173341 0.00  29.41 32.40 35.18 99.15 32.46 33  .
- Rerun using
- all Sanger reads : did not improve the stats
- isNotRandom=1 for all libs : did not improve the stats
CBCB CA 6.0 Sanger+454 (Best so far)
- Input: 1,178,192 Sanger "clean" reads ; 3,297,077 454 sffToCA processed reads
- obtMerThreshold=400, ovlMerThreshold=120
.          elem    min  q1   q2   q3    max     mean    n50     sum
scf.len    10254   283  1149 1472 2552  1789338 9181.77 108218  94149853
scf.len2K+ 3193    2000 2728 4513 14385 1789338 26585*  150310* 84,886,293* (75,044,213bp without gaps)
ctg.len    13607   66   1181 1578 3187  522243  6196.05 30726   84309668
deg.len    114780  63   195  279  478   25405   367.61  478     42193843
utg.len    157710  63   231  383  885   131968  901.05  1602    142104793
singl.len  904272  64   199  255  306   1690    282.94  278     255858016 # 831,547 454 + 72,725 Sanger
seq.len    3849841 64   202  281  662   1690    410.15  699     1579010968

.          elem    min   q1    q2    q3    max   mean  n50 sum
scf.gc%    10254   13.80 24.59 28.34 32.95 69.26 29.44 29  .
ctg.gc%    13607   13.80 24.54 28.17 32.02 69.26 28.95 28  .
deg.gc%    114780  3.75  29.39 38.22 45.45 80.99 37.40 41  .
utg.gc%    157710  0.00  27.63 34.59 43.08 80.99 35.28 38  .
singl.gc%  904272  0.00  36.60 43.75 49.49 99.33 42.44 46  .
seq.gc%    3849841 0.00  29.34 34.26 41.98 99.33 35.73 36  .
- Location:
ginkgo:/scratch1/brugia_malayi/Assembly/hybrid/CA/
- Rerun using bogBadMateDepth = 4 (default is 7) at Aleksey's advice; utgs are smaller; failed in cgw "scaffolder failed" message
- scaffolds:
10254 : total
904   : begin in surrogates
1029  : end in surrogates
CBCB newbler deNovo 454 (failed)
- Still running after 7 days (killed)
Detangling alignments... -> Level 2, Phase 8, Round 1...
PID  %MEM RSZ     %CPU STIME TIME       CMD
4576 2.1  1427100 94.4 Feb16 7-03:26:12 /fs/szdevel/dpuiu/454/bin/runProject .
CBCB newbler deNovo 454
- Filtered contaminants, high copy repeats
deleted 1524401
kept    1772677
- Input: 454 CA gkp dump
.   elem    min  q1    q2    q3    max   mean   n50 sum
len 1772677 64   216   270   357   2044  294.13 312 521393273
gc% 1772677 0.00 28.60 34.15 40.70 83.90 34.83  36  .
- Output
# ctg stats
.   elem  min  q1    q2    q3    max   mean   n50 sum
Len 38315 100  217   319   437   5560  346.60 402 13,280,032
GC% 38315 0.00 27.69 32.27 37.33 73.73 32.83  33  .

# scf stats
.   elem min  q1   q2   q3   max  mean    n50  sum
Len 69   2006 2157 2422 2795 9476 2932.87 2629 202368

# read counts
.                  count   %
All                1772677 100
Singleton          884948  50.01
Assembled          780126  44.08
PartiallyAssembled 52199   2.95
Outlier            33009   1.87
TooShort           11372   0.64
Repeat             8028    0.45

# read GC%
.                  elem   min  q1    q2    q3    max   mean  n50 sum
Assembled          780126 0.00 26.47 31.74 37.20 79.35 32.12 33
Singleton          884948 0.00 30.17 36.76 43.87 86.76 36.96 39
Singleton.Mapped   434267 0.00 27.22 32.69 38.91 81.43 33.31 34
Singleton.Unmapped 450681 3.37 34.65 40.75 46.74 86.76 40.47 42

# mate pair counts
.              count  %
All            178718 100
Link           72421  40.52
OneUnmapped    62534  34.99
BothUnmapped   42982  24.05
FalsePair      344    0.19
SameContig     344    0.19
MultiplyMapped 92     0.05
- Most assembled contigs and unmapped singletons appear to be contaminants (BLAST hits to human/mouse/rat) => more contamination remains after filtering
- Location
ginkgo:/scratch1/brugia_malayi/Assembly/454/newbler.deNovo/
CBCB newbler refMapper 454
- Assembler: newbler 2.3
- Host: CBCB walnut server
- Input
# NCBI ref assembly
.    ctgs    min  q1   q2    q3    max     mean  n50    sum
Len  26,879  200  836  1005  1495  611244  3241  18986  87,119,350

# Sff reads
.    seqs       min  q1   q2   q3   max   mean    n50  sum
Len  3,214,044  0    240  274  383  2042  294.72  326  947,254,956
- Output
# Ctg stats
.    ctgs     min  q1   q2   q3   max   mean    n50  sum
Len  101,286  100  236  323  530  7013  433.36  535  43,893,507

# Trimmed read stats
.                                  seqs       min  q1   q2   q3   max   mean    n50  sum
All                                3,898,373  1    45   163  264  1995  167.84  265  654319111
Full|Partial                       1,085,167  20   119  216  285  706   214.13  271  232364804
Chimeric|Repeat|Unmapped|TooShort  2,015,920  20   111  221  276  1995  208.85  263  421032444
Deleted                            797,286    1    1    1    1    19    1.16    1    921863

# Trimmed read counts
.         count    %
All       3898373  100
Chimeric  25460    0.65
Deleted   797286   20.45  !!!
Full      1001745  25.7
Partial   83422    2.14
Repeat    406119   10.42  !!!
TooShort  14031    0.36
Unmapped  1570310  40.28  !!!

# Mate pair counts
.               count   %
BothUnmapped    301390  42.86
OneUnmapped     110922  15.77
MultiplyMapped  108641  15.45
FalsePair       106249  15.11
TruePair        75992   10.81
- Ref ctgs partially assembled
# len
.              ctgs   min  q1   q2    q3    max     mean     n50    sum
all            26879  200  836  1005  1495  611244  3241     18986  87119350
assembled      14627  206  881  1265  2778  611244  5081.46  27414  74326497
not_assembled  12252  200  812  920   1075  32555   1044.14  988    12792853

# gc%
.              elem   min   q1     q2     q3     max    mean   n50  sum
all            26878  0.00  24.77  28.56  32.27  72.30  28.86  29   .
assembled      14626  0.76  23.74  27.47  30.26  60.37  27.24  28   .
not_assembled  12252  0.00  26.87  30.27  34.39  72.30  30.80  31   .
CBCB CA Sanger (Art's)
My redo of the assembly using just original Sanger reads (after removing jird contaminant and doing some extra vector trimming) got the following:
TotalBasesInScaffolds    81,379,515
N50ScaffoldBases         80,913      <<** wrt TBS=70676234
MaxBasesInScaffolds      6,446,756
IntraScaffoldGaps        2,758
TotalContigsInScaffolds  12,564
MaxContigSize            565,900
N50ContigBases           36,160      <<** wrt TBS=70676234
The read coverage of unitigs was strongly biased by GC content. E.g., unitigs with 23% GC averaged one read every 134bp, while unitigs with 40% GC averaged one read every 23bp. So I used these values to recompute the unitig astats (the astat indicates whether a unitig is likely to be a repeat or not). This is a more principled way of doing the "boosting" that we did on the original assembly. The assembly changed to:
TotalBasesInScaffolds    79,851,223
N50ScaffoldBases         100,938     <<** wrt TBS=70676234
MaxBasesInScaffolds      6,435,383
IntraScaffoldGaps        2,616
TotalContigsInScaffolds  12,232
MaxContigSize            1,356,278
N50ContigBases           39,235      <<** wrt TBS=70676234
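The GC-aware astat recomputation described above can be illustrated roughly as follows. This is only a sketch, not the code that was actually run: it assumes the textbook natural-log form of the A-statistic (expected read count minus ln 2 per observed read), and the linear interpolation between the two quoted coverage points is purely an assumption made for illustration. The names `arrival_rate` and `astat` are hypothetical.

```python
import math

# Read arrival rates quoted in the notes: one read per 134 bp at 23% GC,
# one read per 23 bp at 40% GC.
KNOWN_RATES = [(23.0, 1 / 134), (40.0, 1 / 23)]


def arrival_rate(gc_percent):
    """Reads per base expected for a unitig of the given GC%.

    ASSUMPTION: linear interpolation between the two quoted points,
    clamped outside that range. The real adjustment may have used a
    different curve or per-GC-bin averages.
    """
    (g0, r0), (g1, r1) = KNOWN_RATES
    if gc_percent <= g0:
        return r0
    if gc_percent >= g1:
        return r1
    t = (gc_percent - g0) / (g1 - g0)
    return r0 + t * (r1 - r0)


def astat(rho, n_reads, rate):
    """Natural-log odds that a unitig is unique rather than a 2-copy collapse.

    rho     : unitig length in bases
    n_reads : number of reads placed in the unitig
    rate    : expected reads per base for unique sequence
    ASSUMPTION: A = rho * rate - n_reads * ln(2); the assembler's own
    scaling and read-count convention may differ.
    """
    return rho * rate - n_reads * math.log(2)
```

For example, a 2,300 bp unitig at 40% GC holding 100 reads scores positive (unique-like) when `rate=arrival_rate(40.0)`, but negative (repeat-like) if it is scored with the low-GC rate of 1/134, which is the kind of misclassification a GC-specific rate is meant to avoid.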
Note that all assemblies above used only the original Sanger reads. I next added the 3Kb and 20Kb paired 454 reads (with some extra linker trimming and removal of duplicate mate-pairs). This reduced the size of unitigs (N50 fell from 9142 to 5297), indicating there are still some trimming issues. The coverage bias is also smaller with the 454 reads, and the astat adjustments are less effective. The best assembly of these data that I have (based on N50 sizes) calculated astats assuming a genome size of 70Mb:
TotalBasesInScaffolds    80,688,005
N50ScaffoldBases         358,475     <<** wrt TBS=70676234
MaxBasesInScaffolds      3,020,329
IntraScaffoldGaps        4,039
TotalContigsInScaffolds  13,894
MaxContigSize            602,785
N50ContigBases           43,796      <<** wrt TBS=70676234
Files
/fs/szattic/asmg1/adelcher/Genomes/Brugia               : Art's files
/fs/sztmpscratch/cole/tarchive_download/brugia_malay    : Cole's files
/fs/szasmg3/dpuiu/Brugia_malayi/                        : Daniela's files
/scratch1/brugia_malayi/Data/                           : ftp PITT data
/fs/szattic-asmg4/brugia_malayi                         : ftp PITT data (as well)