User:Tgibbons:Project-Ideas: Difference between revisions

From Cbcb
Revision as of 02:36, 10 August 2010

My weekly progress report just didn't seem appropriate for my brainstorming after a bit, so I've transferred everything here.

Potential Research Projects Inspired by Microbial Inhabitants of Humans

Searching for quorum sensing genes, both known and novel, and any pathways including them in human microbiome data

  • Search for and consider making quorum sensing gene DB
    • KEGG has pathways containing both acyl-homoserine lactone and its synthase
  • After indexing known quorum sensing genes, search for homologues
    • WGS data - Obviously search for homologues directly
    • 16S data - Identify organisms and search for homologues in public DBs

Search for "core metabolome" in pioneer organisms from infant studies

  • On the off chance I get to publish on this subject, it might be interesting to draw an analogy between genetic evolution involving chance mutations that are occasionally beneficial and are therefore propagated, and microbial colonization which involves chance introduction of microorganisms that are occasionally beneficial and therefore become stable members of the microbiome.

Attempt to search for cases of symbiosis where possible

  • Search terms: commensalism, synergism (protocooperation), mutualism, competition, amensalism (antagonism), predation, parasitism, neutralism, syntrophism (nutritional synergism, cross-feeding, examples on pp.16-17)

Other Potential Research Projects

Metagenomic assembly

  • I've been kicking this idea around the floor for months, but none of the people I perceive as being better suited to tackle the problem have appeared all that interested. This could be due either to them already having large projects to which they're committed, or it could be that they sense trouble. I'll ask Mihai once I've spent a couple of days looking through literature.
  • Essentially I think the ideal metagenomic assembler would allow for, and gracefully represent, diversity within a single organism, without collapsing the genetic material of an entire community into a single messy contig.
  • The major theoretical challenge would be the development of an algorithm that could differentiate between variation and "speciation" in a biologically meaningful way. This is far from being a new problem.
    • The fundamental limitation of differentiating between species based on sequence divergence is that a relatively small number of mutations, a single gene insertion, or a single functional plasmid can often impart dramatic new phenotypic properties to a micro-organism without significantly altering the overall sequence. This means that sequence divergence is not necessarily proportional to phenotypic divergence and thus cannot be used to differentiate between what I would consider to be biologically meaningful species (micro-organisms with differing phenotypes).
    • As a simple starting point, I would assemble all unitigs. From these seeds, I would extend out in both directions, allowing forks without breaking the contigs. I would tentatively consider a fork to be variation within a single species when it is joined on both sides by unitigs that are substantially longer than the forked region.
    • To accomplish this, I would need to set thresholds for spawning and merging forks. SNPs and other very small variances could be handled by allowing a string of 3 or so mismatches in a unitig before terminating the unitig and considering a fork.
  • In addition to the well-known theoretical challenges, coding this in such a way as to make it practical for current and future metagenomic problems would probably require more programming skills than I have the time to master. As such, I'm hoping that maybe I can work out the theory on a smaller scale using Python, and then maybe convince someone else to collaborate with me for the implementation, or just publish my findings and leave it to the scientific community to incorporate my methods into their-favorite-assembler.
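The fork thresholds described above could be prototyped in Python along these lines. This is only a minimal sketch under stated assumptions: it compares two already-aligned, ungapped sequences, and the function names, the `max_mismatch_run` default of 3 (taken from the "3 or so mismatches" above), and the `min_ratio` cutoff of 5 are all hypothetical placeholders rather than settled choices.

```python
def first_fork(seq_a, seq_b, max_mismatch_run=3):
    """Return the index where a fork would be spawned, or None.

    The two sequences are assumed to be aligned without gaps. Runs of
    up to max_mismatch_run consecutive mismatches are tolerated as
    SNP-scale variation; a longer run terminates the unitig and the
    start of that run is reported as the fork position.
    """
    run_start = None
    for i, (a, b) in enumerate(zip(seq_a, seq_b)):
        if a == b:
            run_start = None          # mismatch run ended, reset
            continue
        if run_start is None:
            run_start = i             # a new mismatch run begins here
        if i - run_start + 1 > max_mismatch_run:
            return run_start
    return None


def classify_fork(left_flank_len, right_flank_len, branch_lens, min_ratio=5.0):
    """Tentatively label a fork that is closed on both sides by unitigs.

    branch_lens must be a non-empty list of branch lengths. The fork is
    called "variation" when the shorter flanking unitig is at least
    min_ratio times longer than the longest branch (min_ratio is an
    assumed stand-in for "substantially longer"); otherwise it is left
    "unresolved".
    """
    longest_branch = max(branch_lens)
    shortest_flank = min(left_flank_len, right_flank_len)
    if shortest_flank >= min_ratio * longest_branch:
        return "variation"
    return "unresolved"
```

For example, `first_fork("ACGTACGAACGT", "ACGTTTTTACGT")` sees four consecutive mismatches starting at index 4 and so reports a fork there, while a run of three or fewer is absorbed. Both thresholds would need tuning against real data.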

The more I kick around this idea, the more I think this might work better as a scaffolder. I should talk to Sergey about the possibility of just incorporating this into Bambus as an alternative output or something. Of course, Bambus fundamentally requires mate pairs, which is not necessarily a constraint of this assembler/scaffolder. Also any smaller variance collapsed by the assembler into a consensus sequence would be lost.

(pre)Binning to improve metagenomic assembly

  • Mihai's concerns that binning could (and probably would) break assemblies by separating overlapping reads into different bins are valid, but assume the simplest of binning schemes: Every read is placed in exactly one bin and the assembler is never allowed to combine reads from multiple bins.
    • This scheme is obviously overly simplistic and offers little value to existing assembly techniques.
  • Motivation
  1. Convert computationally challenging problem into an embarrassingly parallel problem
    • Current metagenomic sequencing projects are generating hundreds of millions of (30-500bp) reads, which traditional assemblers would attempt to load into RAM all at once. Even the average US university likely does not have the computational resources to successfully attempt such an assembly, and it is unlikely that the average group generating the sequences is prepared for this challenge.
    • Recent attempts to address this problem have focused on massively parallelized tools designed to run on large computer clusters (SOAPdenovo) or cloud clusters such as those offered by Amazon and Google (CloudBurst, although that is a read mapper rather than an assembler).
    • An alternative approach is to first attempt to bin reads we expect to be assembled together, and then use traditional assemblers to assemble the reads placed in each bin.
      • The hope here is that the amount of sequence in each bin would be more similar to traditional clonal sequencing projects and would therefore be more amenable to traditional assembly techniques.
      • Avoiding Mihai's concerns would require additional computation that would further inflate the overhead of this approach. These may include:
      1. Keeping a special bin for reads that could not be placed with high confidence, and allowing each assembly to pull reads from it.
        • This would require some sort of record keeping to ensure promiscuous reads are not added to many assemblies.
      2. Iterating the assemblies so that contigs from different bins have an opportunity to be combined, including singlets that were not able to be assembled in the first round.
        • This approach would greatly increase the overall runtime, but may essentially allow a relatively small group to assemble a large metagenomic sample with modest computational resources.
  2. Improve assemblies by using alternative algorithms to place promiscuous reads
    • For the most part, assemblers use rather naive criteria to place reads into contigs. This is in part because of the nature of the problem when these assemblers were first being developed. In clonal sequencing, one can safely assume that the vast majority of sequence variation actually comes from sequencing error, and not from variation within the population of that organism within a particular environment. To this end, traditional assemblers have sought to identify and eliminate minor sequence variation within the read set, assemble overlapping reads with relatively clean sequences, and simply break up assemblies in places where there was too much variation to justify any other action.
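The special bin for low-confidence reads, and the record keeping it needs, could be sketched in Python roughly as follows. This assumes some upstream binning method has already scored each read against each bin; the score format, the `min_confidence` cutoff of 0.9, the `max_uses` limit, and all names here are hypothetical and only meant to illustrate the bookkeeping.

```python
from collections import defaultdict


def bin_reads(scores, min_confidence=0.9):
    """Split reads into confident bins plus one shared low-confidence pool.

    scores maps read_id -> {bin_id: confidence in [0, 1]} and every read
    is assumed to have at least one score. A read goes to its best bin
    only if that confidence reaches min_confidence (an assumed cutoff);
    otherwise it is held in the shared pool.
    """
    bins = defaultdict(list)
    shared_pool = []
    for read_id, by_bin in scores.items():
        best_bin, best = max(by_bin.items(), key=lambda kv: kv[1])
        if best >= min_confidence:
            bins[best_bin].append(read_id)
        else:
            shared_pool.append(read_id)
    return dict(bins), shared_pool


class SharedPool:
    """Tracks which per-bin assemblies have pulled each pooled read."""

    def __init__(self, read_ids, max_uses=1):
        self.uses = {read_id: [] for read_id in read_ids}
        self.max_uses = max_uses      # assumed cap on assemblies per read

    def claim(self, read_id, bin_id):
        """Return True if bin_id may use read_id, recording the claim."""
        used_by = self.uses[read_id]
        if bin_id in used_by:
            return True               # this assembly already holds the read
        if len(used_by) >= self.max_uses:
            return False              # read is already placed elsewhere
        used_by.append(bin_id)
        return True
```

Each confident bin could then be handed to a traditional assembler as an independent job, with every job consulting the same `SharedPool` before pulling a promiscuous read. Coordinating that pool across jobs is exactly the overhead mentioned above.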

hmm... Well that's not where I'd intended to go with that. The more I think about this problem, the more I think my other metagenomic assembly idea is more promising in terms of improving assembly quality. I also have serious doubts about the need in the scientific community for the ability to assemble large metagenomic sequencing projects using modest local computational resources. It seems much more likely to me that these groups will increasingly request money for time on an Amazon cluster when writing their grants, and that an iterative local approach would quickly become obsolete, if it isn't already.

Volker's Mycobacterial genomes