Cbcb:Pop-Lab:Ted-Report
January 15, 2010
Minimus Documentation
Presently, the only relevant Google hit for "minimus" on the first page of results is the SourceForge wiki. The only example on that page is incomplete and appears to be an early draft made during development.
Ideally, it should be easy to find a complete guide with the general format:
- Simple use case:
  `toAmos -s path/to/fastaFile.seq -o path/to/fastaFile.afg`
  `minimus path/to/fastaFile` (the shared file prefix, without extension)
- Necessary tools for set up (toAmos)
- Other options
- etc
The description found on the Minimus/README page (linked from the middle of the starting page) is more appropriate, but it features use cases that may no longer be common and references another required tool (toAmos) without linking to it or describing how to access it. A description of that tool can be found on the Amos File Conversion Utilities page (again, linked from the starting page), but it is less organized than what I've come to expect from a project page, and it is easy to get lost or distracted by the rest of the Amos documentation while trying to piece together the necessary steps for a basic assembly.
Comparative Network Analysis pt. 2
- Meeting with Volker this Friday to discuss how best to apply network alignment to what he's doing
- I'm simultaneously trying to find a way to apply my network alignment technique to predicting genes in metagenomic samples
- I've been trying to find a way to get beyond the restriction that my current program requires genes to be annotated with an EC number. A potentially interesting next step may be to use BioPython to BLAST the sequence of each enzyme annotated in every micro-organism in KEGG against a metagenomic library.
- The results would be stretches of linked reactions that have been annotated in KEGG pathways.
- This method could be applied to contigs just as easily as to finished sequences. In a scenario with low coverage, it could be used to identify genes that are probably there but just weren't sampled, by showing the presence of the rest of the pathway. In short, this could finally accomplish what Mihai asked me to work on when I showed up.
- The major theoretical shortcoming of this approach is that it could only identify relatively well characterized pathways.
- The practical shortcomings of this approach start with obtaining a fairly complete copy of KEGG (which, as we've learned, is a mess to parse locally and unusably slow to call through the API), and continue to the computational challenge of such a large-scale BLAST operation.
- Ask Bo about this when he gets back. He may have already done this.
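The pathway-completion idea above can be sketched in plain Python. This assumes the BLAST results have already been reduced to a set of enzyme EC numbers detected in the metagenomic sample; the pathway and hit set below are toy placeholders, not real KEGG data:

```python
# Hypothetical sketch: given enzymes detected in a metagenomic sample,
# report stretches of linked reactions in a KEGG-style pathway, and flag
# short gaps flanked by hits as genes probably present but unsampled.

def covered_stretches(pathway, hit_enzymes):
    """Return maximal runs of consecutive reactions whose enzymes were hit."""
    runs, current = [], []
    for ec in pathway:
        if ec in hit_enzymes:
            current.append(ec)
        elif current:
            runs.append(current)
            current = []
    if current:
        runs.append(current)
    return runs

def likely_missing(pathway, hit_enzymes, max_gap=1):
    """Flag unhit enzymes with hit neighbors on both sides."""
    missing = []
    for i, ec in enumerate(pathway):
        if ec in hit_enzymes:
            continue
        left = any(e in hit_enzymes for e in pathway[max(0, i - max_gap):i])
        right = any(e in hit_enzymes for e in pathway[i + 1:i + 1 + max_gap])
        if left and right:
            missing.append(ec)
    return missing

# Toy chain of EC numbers standing in for a linear KEGG pathway
pathway = ["2.7.1.1", "5.3.1.9", "2.7.1.11", "4.1.2.13", "5.3.1.1"]
hits = {"2.7.1.1", "5.3.1.9", "4.1.2.13", "5.3.1.1"}
print(covered_stretches(pathway, hits))  # [['2.7.1.1', '5.3.1.9'], ['4.1.2.13', '5.3.1.1']]
print(likely_missing(pathway, hits))     # ['2.7.1.11']
```

A real version would need to handle branched pathways (a graph, not a list) and decide on a BLAST e-value cutoff before an enzyme counts as "hit".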
January 22, 2010
- Met with Dan and Sergey to talk about the Minimus-Bambus pipeline
- Minimus is running fine. I've begun characterizing its run-time behavior (see next week's entry)
- After some tweaking by Sergey, Bambus was able to finish but did not generate a scaffold. We're going to talk about this after the meeting on Monday.
- Sergey had an interesting idea for making a better read simulator:
- Error-free reads are cheap and easy to generate. The problem is with the error model.
- The "best" tool (that we are aware of) which includes error models is MetaSim, but its error models are years out of date and the authors have been historically unreachable. While Mihai has now shown me how to edit the models in a reasonable way from flat files that characterize base substitutions, I'm not convinced that writing a program to modify these files would be faster or easier than writing an entirely new simulator; and given the amount of time I've spent trying to use MetaSim, I'm more than ready to walk away from it. Oh yeah, and MetaSim doesn't work from the command line, so no scripting.
- Sergey has pointed out that most companies will assemble E. coli when they release a new sequencer. Conveniently, there are many high quality assemblies of E. coli available for reference. It might therefore be possible to generate new error models for these sequencers in an automated fashion by mapping the E. coli reads to the available reference genomes, collecting the error frequencies, and then using them to mask synthesized reads.
- I also talked with Mohammad and Mihai about this; they also thought it was a pretty good idea. Mihai has proposed having Sergey or Mohammad add the described error-model generator to his read sampler (written in C) when they have time, though not in time for the oral microbiome data.
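A minimal sketch of what that error-model generator might look like, assuming the read-to-reference mappings have already been reduced to paired (read, reference segment) strings; the pairs below are toys, and a real version would parse actual alignments and model indels and quality values as well:

```python
import random
from collections import Counter

def learn_error_model(aligned_pairs, read_len):
    """Per-position probability that the sequenced base differs from the reference."""
    mismatches, totals = Counter(), Counter()
    for read, ref in aligned_pairs:
        for i, (r, t) in enumerate(zip(read, ref)):
            totals[i] += 1
            if r != t:
                mismatches[i] += 1
    return [mismatches[i] / totals[i] if totals[i] else 0.0 for i in range(read_len)]

def mask_read(read, error_rates, rng):
    """Introduce substitutions into an error-free read according to the model."""
    out = []
    for i, b in enumerate(read):
        if rng.random() < error_rates[i]:
            out.append(rng.choice([x for x in "ACGT" if x != b]))
        else:
            out.append(b)
    return "".join(out)

# Toy E. coli read/reference pairs standing in for real alignments
pairs = [("ACGT", "ACGT"), ("ACGA", "ACGT"), ("ACGT", "ACGT"), ("TCGA", "ACGT")]
model = learn_error_model(pairs, 4)
print(model)  # [0.25, 0.0, 0.0, 0.5]

rng = random.Random(42)
print(mask_read("AAAA", model, rng))  # error-free read with model-driven substitutions
```

The same tallying idea extends naturally to a full substitution matrix (which base was miscalled as which) rather than a single mismatch rate per position.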
- Met with James to discuss my work with Volker
- Told him about my meeting with Volker and the paper he wanted me to prepare, more or less by myself. The concepts of the paper are these:
- Most available genomic sequences of mycobacteria are of a very small subset of highly pathogenic organisms.
- Subtractive comparative genomics can be used to identify genes that are potentially responsible for differing phenotypes (such as extreme pathogenicity), but there must be available genomic sequences for closely related organisms with differing phenotypes.
- Volker has sequenced 2 more non-pathogenic strains of mycobacteria (gastri and kansasiiW58) with the intention of increasing the effectiveness of these subtractive comparative genomic studies.
- The meat of the paper would be comparing the results of subtractive comparative genomic analysis using all currently available strains in RefSeq, with the results from also using the two novel sequences.
- The other, smaller publishable portion of this project would be a comparison of gastri and kansasiiW58 to each other, because they are thought to be extremely closely related and yet have distinct phenotypes (which I've now forgotten).
- James seemed to think this could make an okay paper, and he confirmed that he had not previously understood that Volker was looking for someone to do all of the analysis, both computational and biological, with Volker only contributing analysis of the analysis after it was all over.
- Ended up also discussing his work on differential abundance in populations of microorganisms.
- I'm going to start working on taking over and expanding Metastats this semester.
- I'm also going to start talking to Bo when he gets back about exactly what he's doing, and how I might be able to include pathway prediction in my expansion of Metastats without stepping on his toes.
- Mihai has given me his approval to focus on this.
- Met with Mihai to discuss working with Volker
- Explained that rather than looking for someone to do only the complex portions of the computational analysis, Volker was/is looking for someone to do the complete analysis.
- In exchange, Volker is offering first authorship and, if need be, to split the student's funding with their primary PI.
- I think I'm capable of doing this within 3 or 4 months but it would consume my time pretty thoroughly.
- Mihai agreed that this is a reasonable deal, but noted that I have no personal interest in studying mycobacteria, and that it's therefore unwise of me to invest a lot of time becoming an expert on an organism I have no interest in continuing to study or work with. I've therefore offered to work closely with one of Volker's graduate students, who could meet with me every week or two. I would be willing to do all of the computational analysis and explain it to them, but they would have to actually look up the potentially interesting genes and relationships I discover and help me keep the analysis biologically interesting and relevant.
- Met with Mihai and Mohammad to discuss our impending huge-ass(embly) problem
- Talked about strategies for iterative assembly as an approach to assembling intractably large data sets. Most have glaring shortcomings and complications.
- Discovered that Mike Schatz has a MapReduce implementation of an assembler that uses de Bruijn graphs and is better suited to assemblies with high coverage but short read lengths.
January 29, 2010
Minimus Performance Analysis
I'm testing Minimus and Bambus in preparation for the oral microbiome data, and after spamming several lab members with email, it occurred to me that it would be considerably more considerate to put the information here instead.
Number of 75bp Reads (in millions): | 1 | 2 | 4 | 8 | 16 | 20 | Model |
---|---|---|---|---|---|---|---|
RAM used by the Overlapper (in GB): | 1.2 | 2.4 | 4.5 | 8.7 | (17) | 21.5 | (#Reads in Millions) * 1.1 GB = (Memory Used) |
RAM used by the Contigger (in GB): | 3.0 | 6.0 | 12.1 | 25.2 | (48) | (60) | (#Reads in Millions) * 3.1 GB = (Memory Used) |
Run Time of the Overlapper (in min): | 3 | 9 | 34 | 130 | (576) | 783 | 2.96 * (#Reads in Millions)^1.87 = (Run Time in Min) |
Run Time of the Contigger (in min): | 9 | 66 | 473 | (3,456) | (25,088) | (47,493) | 9.03 * (#Reads in Millions)^2.86 = (Run Time in Min) |
- Privet has 2.4GHz Opteron 850 processors and 32GB of RAM. Minimus is not parallelized and therefore only uses a single core.
- Numbers listed in parentheses are predictions made using the listed models.
- Notes on building the models:
- The memory usage models were straightforward and obvious.
- The run time models were generated by plotting the data points in OpenOffice and fitting a power trendline. The R² value for each was 1.
- The model for the run time of the Contigger is highly suspect, but the power-law model fit the data considerably better than a polynomial one. I intend to refine this once I have more data.
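For reference, a power trendline of the form A * x^B can be reproduced outside OpenOffice by a least-squares fit in log-log space. A sketch using only the standard library, applied to the measured (non-predicted) Overlapper times from the Privet table above:

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = A * x**B, done linearly in log-log space."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    b = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
         / sum((u - mx) ** 2 for u in lx))
    a = math.exp(my - b * mx)
    return a, b

# Measured Overlapper run times on Privet (minutes); predictions excluded
reads = [1, 2, 4, 8, 20]
minutes = [3, 9, 34, 130, 783]
a, b = fit_power_law(reads, minutes)
print(a, b)  # ~2.69 and ~1.87; exponent matches the table, prefactor
             # varies with exactly which points are included in the fit
```

Note that fitting in log space weights the small data points relatively more than a direct nonlinear fit would, which is one reason the prefactor can differ slightly from a spreadsheet's trendline.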
Number of 75bp Reads (in millions): | 1 | 2 | 4 | 8 | 16 | 20 | Model |
---|---|---|---|---|---|---|---|
Run Time of the Overlapper (in min): | 2.7 | 8 | 27.5 | - | - | - | A * (#Reads in Millions)^B = (Run Time in Min) |
Run Time of the Contigger (in min): | 14 | 81 | - | - | - | - | C * (#Reads in Millions)^D = (Run Time in Min) |
- Walnut has 2.8GHz processors and 64GB of RAM. Minimus is not parallelized and therefore only uses a single core.
- Memory usage is independent of machine architecture. See above table for Privet.
Other Observations About the Assemblies
- Because of the short read length, every million reads is only 75Mbp of sequence, which is roughly 10-20x coverage of an average single bacterium. Since these test sets have reads sampled from roughly 100 bacterial genomic sequences, I would expect the per-genome coverage to be on the order of 0.1x on average.
- Unsurprisingly, a cursory glance through the contig files shows that each contig is composed of only about 2 or 3 reads.
- Therefore, if the complexity of the oral microbiome data is high and/or the contamination with human DNA is extreme (80-95%), the coverage may be extremely low. This may make the use of Mike's assembler impractical, or at least that's how I'm going to keep justifying this testing to myself until someone corrects me.
- I intend to run the n50 program included with AMOS on the assemblies at some point, to get a more valuable analysis than I can offer by just eye-balling them.
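The coverage estimate above works out as follows (assuming a 5 Mbp average bacterial genome, an assumed value that puts the 10-20x figure near its midpoint):

```python
# Back-of-the-envelope coverage check for the numbers above.
# genome_bp = 5 Mbp is an assumed average bacterial genome size.
read_len_bp = 75
reads = 1_000_000
genome_bp = 5_000_000
n_genomes = 100

total_bp = read_len_bp * reads                  # 75 Mbp of sequence per million reads
single_genome_cov = total_bp / genome_bp        # coverage if it were a single genome
per_genome_cov = single_genome_cov / n_genomes  # spread evenly over 100 genomes
print(total_bp, single_genome_cov, per_genome_cov)  # 75000000 15.0 0.15
```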
UMIACS Resources
I just discovered that the information listed on the CBCB intranet Resources page is inaccurate and very out of date, so I'm making my own table.
Machine | Processor | Speed | Cores | RAM |
---|---|---|---|---|
Walnut | Dual Core AMD Opteron 8220 | 2.8GHz | 16 | 64GB |
Privet | AMD Opteron 850 | 2.4GHz | 4 | 32GB |
Larch | AMD Opteron 850 | 2.4GHz | 4 | 32GB |
Sycamore | Dual Core AMD Opteron 875 | 1GHz | 8 | 32GB |
Shagbark | Intel Core 2 Quad | 2.83GHz | 4 | 4GB |