Cbcb:Pop-Lab:Challenges
- comparative assembly of metagenomic data with thousands of references
The basic idea here is that we have pretty good software for doing comparative assembly once you have settled on a genome to use as a reference. What if you have a metagenomic dataset and thousands of reference genomes? Can you do better than simply running the dataset against each of the genomes and combining the results afterwards? There are two issues here: how you pick the correct genomes to use as references, and how you store the genomes and/or sequences so that the comparative assembly can be done efficiently.
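One possible starting point for the reference-selection question is a cheap k-mer screen that shortlists genomes before any comparative assembly is run. The sketch below is only an illustration under stated assumptions: the function names (`kmers`, `shortlist_references`), the k-mer length, and the `min_fraction` cutoff are all hypothetical choices, and a real dataset would need reverse complements, error handling, and a compact index rather than Python sets.

```python
def kmers(seq, k):
    """Return the set of k-mers in seq (forward strand only, for brevity)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shortlist_references(reads, references, k=5, min_fraction=0.5):
    """Keep references whose k-mers are mostly present in the reads.

    reads: iterable of read sequences (strings).
    references: dict mapping a reference name to its sequence.
    min_fraction: assumed cutoff on the fraction of reference k-mers
        that must be seen in the reads (an arbitrary illustrative value).
    Returns (name, fraction) pairs sorted by decreasing fraction.
    """
    read_kmers = set()
    for read in reads:
        read_kmers |= kmers(read, k)

    selected = []
    for name, seq in references.items():
        ref_kmers = kmers(seq, k)
        if not ref_kmers:
            continue  # reference shorter than k
        covered = len(ref_kmers & read_kmers) / len(ref_kmers)
        if covered >= min_fraction:
            selected.append((name, covered))
    return sorted(selected, key=lambda item: -item[1])
```

Only the shortlisted genomes would then be passed to the comparative assembler, which addresses the selection question but not the storage question.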
- visualization tools for large assembly graphs
How do you display large assembly graphs so that biologists can look for interesting patterns in the population structure of closely related organisms?
- "interesting" patterns in assembly graphs
There are quite a few examples of genomic structures that bacteria use to rapidly generate antigenic variation, e.g., by expressing different types of proteins on the surface. These structures usually involve repeats that allow the genome to rearrange. What do these regions look like in genome assembly graphs? Can you find putative hypervariable loci by looking at the assembly graphs?
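As a first experiment, one could flag places where the graph branches and the branches rejoin within a few edges, since alternative versions of a region tend to show up as such "bubbles". Whether hypervariable loci actually look like this is exactly the open question, so the sketch below is a toy heuristic, not an answer. It assumes the assembly graph is given as a plain dictionary from node to successor list; the names `reachable` and `candidate_variable_loci` and the `max_steps` limit are hypothetical.

```python
def reachable(graph, start, max_steps):
    """Nodes reachable from start within max_steps edges (start included)."""
    seen = {start}
    frontier = [start]
    for _ in range(max_steps):
        next_frontier = []
        for node in frontier:
            for succ in graph.get(node, ()):
                if succ not in seen:
                    seen.add(succ)
                    next_frontier.append(succ)
        frontier = next_frontier
    return seen


def candidate_variable_loci(graph, max_steps=3):
    """Return (branch node, rejoin nodes) for branch points whose branches rejoin.

    graph: dict mapping each node to a list of successor nodes.
    max_steps: assumed search radius, in edges, from each successor.
    """
    loci = []
    for node, succs in graph.items():
        if len(succs) < 2:
            continue  # not a branch point
        reach_sets = [reachable(graph, s, max_steps) for s in succs]
        shared = set.intersection(*reach_sets)
        shared.discard(node)  # ignore cycles leading back to the branch point
        if shared:
            loci.append((node, sorted(shared)))
    return loci
```

Ranking the flagged branch points by the number of alternative paths, or by read coverage on each path, would be a natural next step.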
- pooling of samples for assembly
Most metagenomic projects will focus on multiple samples/individuals, yet, due to cost constraints, each sample will only be thinly covered by sequencing data, so that only the most abundant organisms can be assembled. A simple solution is to mix together multiple samples prior to assembly. How would you do this, however, if you have too much data (either too many samples, or too many reads in each sample)? Also, how would you deal with polymorphisms introduced by this pooling approach (e.g., different samples contain slightly different variants of the same organism)?
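One simple way to cope with too many reads is to subsample each sample to a fixed budget before pooling. The sketch below uses reservoir sampling, which needs only one pass over each sample and memory proportional to the budget. It is an illustration under stated assumptions: the names `reservoir_sample` and `pool_samples` and the `reads_per_sample` budget are hypothetical, and uniform subsampling is only one option (it will lose low-abundance organisms). Each pooled read keeps its sample label so that polymorphisms seen after assembly can be traced back to the samples they came from.

```python
import random


def reservoir_sample(reads, n, rng):
    """Uniformly sample up to n items from an iterable in a single pass."""
    sample = []
    for i, read in enumerate(reads):
        if i < n:
            sample.append(read)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n:
                sample[j] = read
    return sample


def pool_samples(samples, reads_per_sample, seed=0):
    """Pool a fixed-size random subset of reads from every sample.

    samples: dict mapping a sample name to an iterable of reads.
    Returns a list of (sample name, read) pairs.
    """
    rng = random.Random(seed)
    pooled = []
    for name, reads in samples.items():
        for read in reservoir_sample(reads, reads_per_sample, rng):
            pooled.append((name, read))
    return pooled
```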