
My weekly progress report just didn't seem appropriate for my brainstorming after a bit, so I've transferred everything here.

Metagenomic assembly

Potential Title: Justifying chimeric contigs in metagenomic assembly

  • There are two important aspects to metagenomic assembly:
  1. High-throughput short-read assembly, which is already being addressed by Eulerian and de Bruijn assemblers designed to run in the cloud.
  2. Heterogeneous assembly, which I haven't seen addressed by any well-known assemblers.
  • I've been kicking this idea around for months, but none of the people I perceive as being better suited to tackle the problem have appeared all that interested. This could be because they already have large projects to which they're committed, or because they sense trouble. I'll ask Mihai once I've spent a couple of days looking through the literature.
  • Essentially I think the ideal metagenomic assembler would allow for, and gracefully represent, diversity within a single organism, without collapsing the genetic material of an entire community into a single messy contig.
  • The major theoretical challenge would be the development of an algorithm that could differentiate between variation and "speciation" in a biologically meaningful way. This is far from being a new problem.
    • The fundamental limitation of differentiating between species based on sequence divergence is that in many cases, a relatively small number of mutations, a single gene insertion, or a single functional plasmid can impart dramatic new phenotypic properties to a micro-organism without significantly altering the overall sequence. This means that sequence divergence is not necessarily proportional to phenotypic divergence and thus cannot be used to differentiate between what I would consider to be biologically meaningful species (micro-organisms with differing phenotypes).
  • As a simple starting point:
  1. I would begin by assembling all unitigs.
  2. From these seeds, I would extend out in both directions, allowing forks without automatically breaking the contigs.
  3. Forks that can be joined on either side by unitigs that are substantially longer than the forked regions, I would tentatively consider to be variation within a single species. In order to accomplish this, I would need to set thresholds for spawning and merging forks. I expect these to be of the following three types:
    1. SNPs and other very small indels and mutations could be handled by allowing a string of some small number (e.g. 3 or so) of mismatched bases within a unitig before terminating the unitig and considering a fork. This would probably require modifying or creating an assembler, as opposed to just using a scaffolder. The main concern here is differentiating between variation and sequencing error.
      • Unfortunately, I don't think simply using exceptionally stringent quality score thresholds is a good approach as long as the sequencing data is vastly smaller than the amount of sequence in the original sample. I therefore think the current approach of throwing out N's and then using standard trimming algorithms should still be used, and then sequencing error should further be inferred algorithmically.
      • A single instance of a variant with relatively low quality scores is (somewhat obviously) more likely to be sequencing error than actual variation within the population.
      • Other such small indels and mutations could tentatively be considered actual variation within the population.
      • I believe much of the statistics for handling such cases has already been worked out for Eulerian path and de Bruijn graph assemblers.
    2. Forked regions that are long enough to contain disparate unitigs, closed on both sides by other unitigs, could be assembled from the output of an existing assembler. Of course, most current assemblers tend to generate a very large number of very small contigs, leaving us with most of the same challenges we would face with a full-blown assembler (also Chris is itching to make an assembler anyway).
      • It is important to consider that such cases are very likely the result of repeats at the boundaries of unitigs that can be placed in such an arrangement.
      • It is possible for inserted genes to create such scenarios in a biologically valid/meaningful way, but even these will likely be surrounded by repetitive sequences.
      • I will need to review the various methods other assemblers use to handle repeats before deciding on even a simple starting scheme.
    3. Everything in between - Forked regions with lengths falling between the minimum length of a unitig and the maximum length set for small mutations.
      • As with the other two scenarios, this one could also arise from incorrect assembly. These cases may be harder to sort out, however, because this category is (currently) less well defined.
      • One of the big issues will be to develop a heuristic to differentiate between complex variation, and reads that are from organisms so divergent that they should be classified as different OTUs.
      • It might be best to start by only considering the first two categories.
  • In addition to the well-known theoretical challenges, coding this in such a way as to make it practical for current and future metagenomic problems would probably require more programming skills than I have the time to master. As such, I'm hoping that maybe I can work out the theory on a smaller scale using Python, and then maybe convince someone else to collaborate with me for the implementation, or just publish my findings and leave it to the scientific community to incorporate my methods into their-favorite-assembler.
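
As a small-scale starting point for working out the theory in Python, the three fork categories above could be sketched roughly as follows. This is only a sketch under assumptions: the `Fork` record, the function names, and all threshold values (`MAX_SMALL_VARIANT`, `MIN_UNITIG_LENGTH`, `flank_ratio`) are hypothetical placeholders I made up for illustration, not values from any existing assembler.

```python
from dataclasses import dataclass

# Hypothetical thresholds; none of these values are settled.
MAX_SMALL_VARIANT = 3    # longest run of mismatched bases tolerated inside a unitig
MIN_UNITIG_LENGTH = 50   # shortest sequence treated as a unitig in its own right


@dataclass
class Fork:
    """A region where the assembly splits into alternative branches."""
    branch_lengths: list  # length in bases of each alternative branch
    left_flank: int       # length of the unitig closing the fork on the left (0 if open)
    right_flank: int      # length of the unitig closing the fork on the right (0 if open)


def classify_fork(fork, max_small=MAX_SMALL_VARIANT, min_unitig=MIN_UNITIG_LENGTH):
    """Assign a fork to one of the three categories by its longest branch."""
    if fork.left_flank <= 0 or fork.right_flank <= 0:
        return "open"            # not closed by unitigs on both sides
    longest = max(fork.branch_lengths)
    if longest <= max_small:
        return "small_variant"   # category 1: SNPs and very small indels
    if longest >= min_unitig:
        return "unitig_fork"     # category 2: branches long enough to hold unitigs
    return "intermediate"        # category 3: everything in between


def is_tentative_variation(fork, flank_ratio=2.0):
    """True if both flanking unitigs are substantially longer than the fork.

    'Substantially longer' is modelled here as at least flank_ratio times
    the longest branch, which is an arbitrary assumption.
    """
    shortest_flank = min(fork.left_flank, fork.right_flank)
    if shortest_flank <= 0:
        return False
    return shortest_flank >= flank_ratio * max(fork.branch_lengths)
```

For example, `classify_fork(Fork([20, 18], 500, 400))` falls into the "intermediate" category, and `is_tentative_variation` would tentatively call it within-species variation because both flanks are far longer than the fork. Starting with only the first two categories would just mean ignoring anything returned as "intermediate".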

The more I kick around this idea, the more I think this might work better as a scaffolder, of sorts. I should talk to Sergey about the possibility of just incorporating this into Bambus as an alternative output or something. Of course, Bambus fundamentally requires mate pairs, which would not necessarily be a constraint for this assembler/scaffolder. Also any smaller variance collapsed by an assembler into a consensus sequence would be lost. ...actually, nvm.

Preliminary Experiment(s)

  1. Create synthetic heterogeneous metagenomic read set using very closely related strains of Mycobacteria.
    • Why Mycobacteria? There are many sequenced strains of M. tuberculosis & M. bovis, plus several more strains that are very closely related such as M. kansasii, M. gastri, and M. marinum. This should provide the ability to combine reads from many strains that are >95% identical, but have significantly different phenotypes. I'm also hoping I might get lucky and stumble across something that would help me publish something with Volker. If I run into a problem with Mycobacteria, Lactobacillus would probably also be a good choice because there are many available sequences and there's a (slim) chance to discover meaningful insights into the vaginal microbiome.
    • I should consider my options for generating reads. The options I'm already familiar with are Metasim, and Arthur's naive in-house read generator. I think it would be worth my time to do a literature search for alternative options though.
    • I would need to generate read sets with increasing variance and identify the types of changes that break contigs and lead to more fragmented (poorer) assemblies.
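
Before settling on a real read simulator, the "increasing variance" experiment could be prototyped with something as naive as the sketch below. It is not MetaSim and not Arthur's generator: the function names are hypothetical, variation is modelled as random point substitutions only (no indels, no sequencing error, no quality scores), and reads are drawn uniformly from the strains, all of which are simplifying assumptions.

```python
import random


def mutate(genome, rate, rng):
    """Return a copy of genome with point substitutions at a per-base rate."""
    out = []
    for base in genome:
        if rng.random() < rate:
            # Substitute with a different base; substitutions only, no indels.
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def sample_reads(strains, n_reads, read_len, rng):
    """Draw error-free reads uniformly from a dict of strain name -> sequence.

    Assumes every sequence is at least read_len bases long.
    Returns a list of (strain name, start position, read sequence) tuples.
    """
    names = sorted(strains)
    reads = []
    for _ in range(n_reads):
        name = rng.choice(names)
        seq = strains[name]
        start = rng.randrange(len(seq) - read_len + 1)
        reads.append((name, start, seq[start:start + read_len]))
    return reads


def synthetic_read_sets(reference, divergences, n_reads, read_len, seed=0):
    """Build one read set per divergence level from a reference and one variant."""
    rng = random.Random(seed)
    read_sets = {}
    for divergence in divergences:
        strains = {"ref": reference, "variant": mutate(reference, divergence, rng)}
        read_sets[divergence] = sample_reads(strains, n_reads, read_len, rng)
    return read_sets
```

Each read set could then be handed to an assembler, and contig counts and lengths compared across divergence levels to see where the assemblies start to fragment. Real Mycobacterial strain sequences would replace the artificially mutated variant.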

(pre)Binning to improve metagenomic assembly

  • Mihai's concerns that binning could (and probably would) break assemblies by separating overlapping reads into different bins are valid, but assume the simplest of binning schemes: every read is placed in exactly one bin and the assembler is never allowed to combine reads from multiple bins.
    • This scheme is obviously overly simplistic and offers little value to existing assembly techniques.
  • Motivation
  1. Convert computationally challenging problem into an embarrassingly parallel problem
    • Current metagenomic sequencing projects are generating hundreds of millions of short (30-500 bp) reads, which traditional assemblers would attempt to load into RAM all at once. Even the average US university likely does not have the computational resources to successfully attempt such an assembly, and it is unlikely that the average group generating the sequences is prepared for this challenge.
    • Recent attempts to address this problem have focused on massively parallelized assemblers designed to run on large computer clusters (SOAPdenovo) or cloud clusters such as those offered by Amazon and Google (CloudBurst).
    • An alternative approach is to first attempt to bin reads we expect to be assembled together, and then use traditional assemblers to assemble the reads placed in each bin.
      • The hope here is that the amount of sequence in each bin would be more similar to traditional clonal sequencing projects and would therefore be more amenable to traditional assembly techniques.
      • Avoiding Mihai's concerns would require additional computation that would further inflate the overhead of this approach. These may include:
      1. Keeping a special bin for reads that could not be placed with high confidence, and allowing each assembly to pull reads from it.
        • This would require some sort of record keeping to ensure promiscuous reads are not added to many assemblies.
      2. Iterating the assemblies so that contigs from different bins have an opportunity to be combined, including singlets that were not able to be assembled in the first round.
        • This approach would greatly increase the overall runtime, but may essentially allow a relatively small group to assemble a large metagenomic sample with modest computational resources.
  2. Improve assemblies by using alternative algorithms to place promiscuous reads
    • For the most part, assemblers use rather naive criteria to place reads into contigs. This is in part because of the nature of the problem when these assemblers were first being developed. In clonal sequencing, one can safely assume that the vast majority of sequence variation actually comes from sequencing error, and not from variation within the population of that organism within a particular environment. To this end, traditional assemblers have sought to identify and eliminate minor sequence variation within the read set, assemble overlapping reads with relatively clean sequences, and simply break up assemblies in places where there was too much variation to justify any other action.
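
The less simplistic binning scheme from the first motivation (a special bin for low-confidence reads, plus record keeping so promiscuous reads are not added to many assemblies) might look something like this in Python. This is a rough sketch under assumptions: the per-bin confidence scores are taken as given (how to compute them is the actual hard part), and `bin_reads`, `SharedPool`, `min_confidence`, and `max_uses` are hypothetical names and parameters.

```python
from collections import defaultdict


def bin_reads(scores, min_confidence=0.9):
    """Split reads into confident bins and a shared low-confidence bin.

    scores maps read id -> {bin name: confidence in [0, 1]}.
    Returns (bins, shared), where bins maps bin name -> list of read ids
    and shared lists the reads that could not be placed confidently.
    """
    bins = defaultdict(list)
    shared = []
    for read_id, by_bin in scores.items():
        if not by_bin:
            shared.append(read_id)  # no candidate bin at all
            continue
        best_bin, best_score = max(by_bin.items(), key=lambda kv: kv[1])
        if best_score >= min_confidence:
            bins[best_bin].append(read_id)
        else:
            shared.append(read_id)
    return dict(bins), shared


class SharedPool:
    """Record keeping for the shared bin: limits how many assemblies use a read."""

    def __init__(self, read_ids, max_uses=1):
        self.remaining = {read_id: max_uses for read_id in read_ids}

    def claim(self, read_id):
        """Return True and record the use if the read may still be pulled."""
        if self.remaining.get(read_id, 0) > 0:
            self.remaining[read_id] -= 1
            return True
        return False
```

Each per-bin assembly would call `claim` before pulling a read from the shared bin, so a read that one assembly has already used is refused to the next (with `max_uses=1`). The iteration step, where contigs and singlets from different bins get a chance to merge, is not covered here.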

hmm... Well that's not where I'd intended to go with that. The more I think about this problem, the more I think my other metagenomic assembly idea is more promising in terms of improving assembly quality. I also have serious doubts about the need in the scientific community for the ability to assemble large metagenomic sequencing projects using modest local computational resources. It seems much more likely to me that these groups will increasingly request money for time on an Amazon cluster when writing their grants, and that an iterative local approach would quickly become obsolete, if it isn't already.

Microbial Ecology of the Human Microbiome

  1. Searching for quorum sensing genes, both known and novel, and any pathways including them in human microbiome data
    • Search for and consider making quorum sensing gene DB
      • KEGG has pathways containing both acyl-homoserine lactone and its synthase
    • After indexing known quorum sensing genes, search for homologues
      • WGS data - Obviously search for homologues directly
      • 16S data - Identify organisms and search for homologues in public DBs
  2. Search for "core metabolome" in pioneer organisms from infant studies
    • On the off chance I get to publish on this subject, it might be interesting to draw an analogy between genetic evolution involving chance mutations that are occasionally beneficial and are therefore propagated, and microbial colonization which involves chance introduction of microorganisms that are occasionally beneficial and therefore become stable members of the microbiome.
  3. Attempt to search for cases of symbiosis where possible
    • Search terms: commensalism, synergism (protocooperation), mutualism, competition, amensalism (antagonism), predation, parasitism, neutralism, syntrophism (nutritional synergism, cross-feeding, examples on pp.16-17)

Volker's Mycobacterial genomes

Thoughts on my Career Trajectory

  • For the past 2+ years, when asked, I've been saying that my research interests are "using metagenomics to study the human microbiome." This is all well and good for a green grad student, but "metagenomics" and "the human microbiome" are simultaneously too broad and too limiting to define an individual researcher's specific area of specialty.
  • After some consideration of how I've spent my time, and which projects have most interested me, a refined statement of my research interests would seem to be, "using high-throughput biological techniques to study microbial communities, especially where there is a direct impact on human health."