User:Tgibbons:Project-Ideas: Difference between revisions
Edit summary: (→Other Potential Research Projects) Added a bunch of stuff about the prebinning project with Arthur
Revision as of 17:32, 9 August 2010
My weekly progress report just didn't seem appropriate for my brainstorming after a bit, so I've transferred everything here.
Potential Research Projects Inspired by Microbial Inhabitants of Humans
- Searching for quorum sensing genes, both known and novel, and any pathways including them in human microbiome data
- Search for and consider making quorum sensing gene DB
- KEGG has pathways containing both acyl-homoserine lactone and its synthase
- After indexing known quorum sensing genes, search for homologues
- WGS data - Obviously search for homologues directly
- 16S data - Identify organisms and search for homologues in public DBs
- Search for "core metabolome" in pioneer organisms from infant studies
- On the off chance I get to publish on this subject, it might be interesting to draw an analogy between genetic evolution involving chance mutations that are occasionally beneficial and are therefore propagated, and microbial colonization which involves chance introduction of microorganisms that are occasionally beneficial and therefore become stable members of the microbiome.
- Attempt to search for cases of symbiosis where possible
- Search terms: commensalism, synergism (protocooperation), mutualism, competition, amensalism (antagonism), predation, parasitism, neutralism, syntrophism (nutritional synergism, cross-feeding, examples on pp.16-17)
Other Potential Research Projects
Metagenomic assembly
- I've been kicking this idea around the floor for months, but none of the people I perceive as being better suited to tackle the problem have appeared all that interested. This could be because they already have large projects to which they're committed, or because they sense trouble. I'll ask Mihai once I've spent a couple of days looking through the literature.
- Essentially I think the ideal metagenomic assembler would allow for, and gracefully represent, diversity within a single organism, without collapsing the genetic material of an entire community into a single messy contig.
- The major theoretical challenge would be the development of an algorithm that could differentiate between variation and "speciation" in a biologically meaningful way. This is far from being a new problem.
- In addition to the well-known theoretical challenges, coding this in such a way as to make it practical for current and future metagenomic problems would probably require more programming skill than I have the time to master. As such, I'm hoping I can work out the theory on a smaller scale using Python, and then either convince someone to collaborate with me on the implementation, or just publish my findings and leave it to the scientific community to incorporate my methods into their-favorite-assembler.
(pre)Binning to improve metagenomic assembly
- Mihai's concerns that binning could (and probably would) break assemblies by separating overlapping reads into different bins are valid, but they assume the simplest of binning schemes: every read is placed in exactly one bin, and the assembler is never allowed to combine reads from multiple bins.
  - This scheme is obviously overly simplistic and offers little value to existing assembly techniques.
- Motivation
  - Convert a computationally challenging problem into an embarrassingly parallel problem
    - Current metagenomic sequencing projects are generating hundreds of millions of reads (30-500 bp), which traditional assemblers would attempt to load into RAM all at once. Even the average US university likely does not have the computational resources to successfully attempt such an assembly, and it is unlikely that the average group generating the sequences is prepared for this challenge.
    - Recent attempts to address this problem have focused on massively parallelized assemblers designed to run on large computer clusters (SOAPdenovo) or cloud clusters such as those offered by Amazon and Google (CloudBurst).
    - An alternative approach is to first attempt to bin reads we expect to be assembled together, and then use traditional assemblers to assemble the reads placed in each bin.
      - The hope here is that the amount of sequence in each bin would be more similar to traditional clonal sequencing projects and would therefore be more amenable to traditional assembly techniques.
      - Addressing Mihai's concerns would require additional computation that would further inflate the overhead of this approach. Possible measures include:
        - Keeping a special bin for reads that could not be placed with high confidence, and allowing each assembly to pull reads from it.
          - This would require some sort of record keeping to ensure promiscuous reads are not added to too many assemblies.
        - Iterating the assemblies so that contigs from different bins have an opportunity to be combined, including singlets that could not be assembled in the first round.
          - This approach would greatly increase the overall runtime, but may allow a relatively small group to assemble a large metagenomic sample with modest computational resources.
  - Improve assemblies by using alternative algorithms to place promiscuous reads
    - For the most part, assemblers use rather naive criteria to place reads into contigs.
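
To make the "special bin" and record-keeping ideas above a bit more concrete, here is a rough Python sketch. Nothing in these notes says how reads would actually be binned, so GC content is used purely as a stand-in criterion, and every name here (`gc_fraction`, `prebin`, `UnplacedLedger`, `n_bins`, `margin`, `max_claims`) is hypothetical. Reads that fall close to a bin boundary are treated as low confidence and held back in a shared bin, and a small ledger caps how many per-bin assemblies may pull in each of those reads.

```python
from collections import defaultdict


def gc_fraction(read):
    """Fraction of G/C bases in a read (0.0 for an empty read)."""
    if not read:
        return 0.0
    return sum(base in "GCgc" for base in read) / len(read)


def prebin(reads, n_bins=4, margin=0.05):
    """Split reads into GC-content bins plus a shared 'unplaced' list.

    ASSUMPTION: GC content is only an illustrative binning criterion.
    A read whose GC fraction lies within `margin` of an internal bin
    boundary is considered low confidence and is not placed in any bin.
    """
    bins = defaultdict(list)
    unplaced = []
    width = 1.0 / n_bins
    for read_id, seq in reads.items():
        gc = gc_fraction(seq)
        index = min(int(gc / width), n_bins - 1)
        lower, upper = index * width, (index + 1) * width
        near_lower = index > 0 and gc - lower < margin
        near_upper = index < n_bins - 1 and upper - gc < margin
        if near_lower or near_upper:
            unplaced.append(read_id)
        else:
            bins[index].append(read_id)
    return dict(bins), unplaced


class UnplacedLedger:
    """Record which per-bin assemblies have claimed each unplaced read."""

    def __init__(self, read_ids, max_claims=1):
        self.max_claims = max_claims
        self.claims = {read_id: set() for read_id in read_ids}

    def claim(self, read_id, bin_index):
        """Return True if the assembly for bin_index may use read_id."""
        owners = self.claims[read_id]
        if bin_index in owners:
            return True  # already claimed by this assembly
        if len(owners) >= self.max_claims:
            return False  # read is already used by enough assemblies
        owners.add(bin_index)
        return True
```

Each entry in the returned `bins` could then be handed to a traditional assembler as an independent job, with each job calling `claim` before borrowing a read from the unplaced list. Raising `max_claims` would let a read take part in several assemblies, at the cost of more of the promiscuity the ledger is meant to limit.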