<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.umiacs.umd.edu/cbcb/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Tgibbons</id>
	<title>Cbcb - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.umiacs.umd.edu/cbcb/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Tgibbons"/>
	<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php/Special:Contributions/Tgibbons"/>
	<updated>2026-04-12T19:07:46Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.43.7</generator>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Ted-Report&amp;diff=7376</id>
		<title>Cbcb:Pop-Lab:Ted-Report</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Ted-Report&amp;diff=7376"/>
		<updated>2010-09-05T05:54:44Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* September 10, 2010 */ Created entry&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Older Entries ==&lt;br /&gt;
[[Cbcb:Pop-Lab:Ted-Report-2009 | 2009]]&lt;br /&gt;
&lt;br /&gt;
== January 15, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Minimus Documentation ===&lt;br /&gt;
&lt;br /&gt;
Presently, the only relevant Google hit for &amp;quot;minimus&amp;quot; on the first page of results is the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus#Basic_usage_example SourceForge wiki]. The only example on that page is incomplete and appears to be an early draft made during development.&lt;br /&gt;
&lt;br /&gt;
Ideally, it should be easy to find a complete guide with the general format:&lt;br /&gt;
* Simple use case:&lt;br /&gt;
 `toAmos -s path/to/fastaFile.seq -o path/to/fastaFile.afg`&lt;br /&gt;
 `minimus path/to/fastaFile(prefix)`&lt;br /&gt;
* Necessary tools for set up (toAmos)&lt;br /&gt;
* Other options&lt;br /&gt;
* etc&lt;br /&gt;
&lt;br /&gt;
The description found on the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus/README Minimus/README] page (linked to from the middle of the starting page) is more appropriate, but it features use cases that may no longer be common and references another required tool (toAmos) without linking to it or describing how to access it. A description of this tool can be found on the AMOS [http://sourceforge.net/apps/mediawiki/amos/index.php?title=File_conversion_utilities File Conversion Utilities] page (again, linked to from the starting page), but it is less organized than what I&#039;ve come to expect from a project page, and it is easy to get lost or distracted by the rest of the AMOS documentation while trying to piece together the necessary steps for a basic assembly.&lt;br /&gt;
&lt;br /&gt;
=== Comparative Network Analysis pt. 2 ===&lt;br /&gt;
* Meeting with Volker this Friday to discuss how best to apply network alignment to what he&#039;s doing&lt;br /&gt;
* I&#039;m simultaneously trying to find a way to apply my network alignment technique to predicting genes in metagenomic samples&lt;br /&gt;
** I&#039;ve been trying to find a way to get beyond the restriction that my current program requires genes to be annotated with an EC number. A potentially interesting next step may be to use BioPython to BLAST the sequence of each enzyme annotated in every micro-organism in KEGG against a metagenomic library.&lt;br /&gt;
*** The results would be stretches of linked reactions that have been annotated in KEGG pathways.&lt;br /&gt;
*** This method could be applied to contigs just as easily as finished sequences. In a low-coverage scenario, it could be used to identify genes that are probably present but simply weren&#039;t sampled, by showing the presence of the rest of the pathway. In short, this could finally accomplish what Mihai asked me to work on when I showed up.&lt;br /&gt;
*** The major theoretical shortcoming of this approach is that it could only identify relatively well characterized pathways.&lt;br /&gt;
*** The practical shortcomings of this approach begin with obtaining a fairly complete copy of KEGG (which, as we&#039;ve learned, is a mess to parse locally and unusably slow to call through the API), and continue to the computational challenge of such a large-scale BLAST operation.&lt;br /&gt;
** Ask Bo about this when he gets back. He may have already done this.&lt;br /&gt;
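The pathway-detection idea above can be sketched in a few lines. This is my own illustration, not an existing lab script: it assumes BLAST has already been run with tabular output (outfmt 6, enzyme queries against the metagenomic library) and that a hypothetical enzyme-to-pathway mapping has been parsed from KEGG.&lt;br /&gt;

```python
# Sketch only: group significant BLAST hits of KEGG enzyme queries against a
# metagenomic read library by the pathways each enzyme is annotated in.
from collections import defaultdict

def pathways_hit(blast_tab_lines, enzyme_to_pathways,
                 min_identity=50.0, max_evalue=1e-5):
    """Return {pathway: set of enzyme IDs with at least one good hit}."""
    hits = defaultdict(set)
    for line in blast_tab_lines:
        fields = line.rstrip("\n").split("\t")
        enzyme, identity, evalue = fields[0], float(fields[2]), float(fields[10])
        if identity >= min_identity and evalue <= max_evalue:
            for pathway in enzyme_to_pathways.get(enzyme, ()):
                hits[pathway].add(enzyme)
    return dict(hits)
```

Runs of enzymes from the same pathway in the result would be the stretches of linked reactions described above.&lt;br /&gt;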
&lt;br /&gt;
== January 22, 2010 ==&lt;br /&gt;
* Met with Dan and Sergey to talk about the Minimus-Bambus pipeline&lt;br /&gt;
** Minimus is running fine. I&#039;ve begun characterizing its run-time behavior (see next week&#039;s entry)&lt;br /&gt;
** After some tweaking by Sergey, Bambus was able to finish but did not generate a scaffold. We&#039;re going to talk about this after the meeting on Monday.&lt;br /&gt;
** Sergey had an interesting idea for making a better read simulator:&lt;br /&gt;
*** Error-free reads are cheap and easy to generate. The problem is with the error model.&lt;br /&gt;
*** The &amp;quot;best&amp;quot; tool (that we are aware of) which includes error models is MetaSim, but its error models are years out of date and the authors have been historically unreachable. Mihai has now shown me how to edit the models in a reasonable way from flat files, allowing one to characterize base substitutions, but I&#039;m not convinced that writing a program to modify these files would be faster or easier than writing an entirely new program; and given the amount of time I&#039;ve spent trying to use MetaSim, I&#039;m more than ready to walk away from it. Oh yeah, and MetaSim doesn&#039;t work from the command line, so no scripting.&lt;br /&gt;
*** Sergey has pointed out that most companies will assemble &#039;&#039;E. coli&#039;&#039; when they release a new sequencer. Conveniently, there are many high quality assemblies of &#039;&#039;E. coli&#039;&#039; available for reference. It might therefore be possible to generate new error models for these sequencers in an automated fashion by mapping the &#039;&#039;E. coli&#039;&#039; reads to the available reference genomes, collecting the error frequencies, and then using them to mask synthesized reads.&lt;br /&gt;
*** I also talked with Mohammad and Mihai about this, and they both seemed to think it was a pretty good idea. Mihai has proposed having Sergey or Mohammad add the described error-model generator to his read sampler (written in C) when they have time, but not in preparation for the oral microbiome data.&lt;br /&gt;
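A minimal sketch of that error-model idea, under my own simplifying assumptions (substitution errors only, reads already mapped to the reference with no indels): tally per-position error rates from the mapped reads, then mask error-free synthetic reads at those rates.&lt;br /&gt;

```python
import random
from collections import Counter

def substitution_profile(aligned_pairs):
    """Per-position substitution frequency from (read, reference) string pairs
    of equal length (indels are ignored in this sketch)."""
    errors, totals = Counter(), Counter()
    for read, ref in aligned_pairs:
        for pos, (read_base, ref_base) in enumerate(zip(read, ref)):
            totals[pos] += 1
            if read_base != ref_base:
                errors[pos] += 1
    return [errors[pos] / totals[pos] for pos in sorted(totals)]

def add_errors(read, profile, rng):
    """Mask an error-free read with substitutions drawn at the observed rates."""
    out = []
    for base, rate in zip(read, profile):
        if rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)
```

A real version would need to handle indels too, since 454 errors are dominated by homopolymer indels rather than substitutions.&lt;br /&gt;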
&lt;br /&gt;
* Met with James to discuss my work with Volker&lt;br /&gt;
** Told him about my meeting with Volker and the paper he wanted me to prepare, more or less by myself. The key points of the paper are these:&lt;br /&gt;
*** Most available genomic sequences of mycobacteria are of a very small subset of highly pathogenic organisms.&lt;br /&gt;
*** Subtractive comparative genomics can be used to identify genes that are potentially responsible for differing phenotypes (such as extreme pathogenicity), but there must be available genomic sequences for closely related organisms with differing phenotypes.&lt;br /&gt;
*** Volker has sequenced 2 more non-pathogenic strains of mycobacteria (&#039;&#039;gastri&#039;&#039;, and &#039;&#039;kansasiiW58&#039;&#039;) with the intention of increasing the effectiveness of these subtractive comparative genomic studies.&lt;br /&gt;
*** The meat of the paper would be comparing the results of subtractive comparative genomic analysis using all currently available strains in RefSeq, with the results from also using the two novel sequences.&lt;br /&gt;
*** The other, smaller publishable portion of this project would be a comparison of &#039;&#039;gastri&#039;&#039; and &#039;&#039;kansasiiW58&#039;&#039; to each other because they are allegedly thought to be extremely closely related, and yet they have distinct phenotypes (which I&#039;ve now forgotten).&lt;br /&gt;
*** James seemed to think this could make an okay paper, and he confirmed that he had not realized Volker was looking for someone to do all of the analysis, both computational and biological, with Volker only contributing analysis of the analysis after it was all over.&lt;br /&gt;
** Ended up also discussing his work on differential abundance in populations of microorganisms.&lt;br /&gt;
*** I&#039;m going to start working on taking over and expanding Metastats this semester.&lt;br /&gt;
*** I&#039;m also going to start talking to Bo when he gets back about exactly what he&#039;s doing, and how I might be able to include pathway prediction in my expansion of Metastats without stepping on his toes.&lt;br /&gt;
*** Mihai has given me his approval to focus on this.&lt;br /&gt;
&lt;br /&gt;
* Met with Mihai to discuss working with Volker&lt;br /&gt;
** Explained that rather than looking for someone to do only the complex portions of the computational analysis, Volker was/is looking for someone to do the complete analysis.&lt;br /&gt;
** In exchange, Volker is offering first authorship and, if need be, to split the student&#039;s funding with their primary PI.&lt;br /&gt;
** I think I&#039;m capable of doing this within 3 or 4 months but it would consume my time pretty thoroughly.&lt;br /&gt;
** Mihai agreed that this is a reasonable deal, but noted that I have no personal interest in studying mycobacteria, and that it&#039;s therefore unwise of me to invest a bunch of time becoming an expert on an organism I have no interest in continuing to study or work with. I&#039;ve therefore offered to work closely with one of Volker&#039;s graduate students, who could meet with me every week or two. I would be willing to do all of the computational analysis and explain it to them, but they would have to actually look up potentially interesting genes and relationships I discover and help me keep the analysis biologically interesting and relevant.&lt;br /&gt;
&lt;br /&gt;
* Met with Mihai and Mohammad to discuss our impending huge-ass(embly) problem&lt;br /&gt;
** Talked about strategies for iterative assembly as an approach to assembling intractably large data sets. Most have glaring shortcomings and complications.&lt;br /&gt;
** Discovered Mike Schatz has a map-reduce implementation of an assembler that uses De Bruijn graphs and is better suited to assemblies with high coverage but short read lengths.&lt;br /&gt;
&lt;br /&gt;
== January 29, 2010 ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Minimus Performance Analysis ===&lt;br /&gt;
I&#039;m testing minimus and bambus in preparation for the oral microbiome data, and after spamming several lab members with email, it occurred to me that it would be considerably more considerate to put the information here instead.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Memory Usage Analysis&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|RAM used by the Overlapper (in GB):&lt;br /&gt;
| 1.2 || 2.4 || 4.5 || 8.7 || 17 || 21.5 || ~1.1 GB * (#Reads in Millions) = (Memory Used)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|RAM used by the Tigger (in GB):&lt;br /&gt;
| 3 || 6 || 12 || 25 || 48.4 || (60) || ~3 GB * (#Reads in Millions) = (Memory Used)&lt;br /&gt;
|}&lt;br /&gt;
* The 16 million read assembly data is from Walnut; all other numbers are rough averages from both Privet and Walnut.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Run Time Analysis on Privet&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Overlapper (in min):&lt;br /&gt;
| 3 || 9 || 34 || 130 || (576) || 783 || 2.96 * (#Reads in Millions)&amp;lt;sup&amp;gt;1.87&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Tigger (in min):&lt;br /&gt;
| 9 || 66 || 473 || (3,456) || (25,088) || (47,493) || 9.03 * (#Reads in Millions)&amp;lt;sup&amp;gt;2.86&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|}&lt;br /&gt;
* Privet has 2.4GHz Opteron 850 processors and 32GB of RAM. Minimus is not parallelized and therefore only uses a single core.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
* The models were generated by plotting the data points in OpenOffice and fitting a power trendline. The R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; value for each was 1.&lt;br /&gt;
* &#039;&#039;&#039;For reference: There are 1,440 minutes in one day, and 10,080 minutes in one week&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Run Time Analysis on Walnut&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Overlapper (in min):&lt;br /&gt;
| 2.7 || 8 || 27.5 || 102 || (325) || (481.5) || 2.54 * (#Reads in Millions)&amp;lt;sup&amp;gt;1.75&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Tigger (in min):&lt;br /&gt;
| 14 || 81 || 471.5 || (2,752) || (16,006) || (28,212) || 13.99 * (#Reads in Millions)&amp;lt;sup&amp;gt;2.54&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|}&lt;br /&gt;
* Walnut has 2.8GHz Opteron 875 processors and 64GB of RAM. Minimus is not parallelized and therefore only uses a single core.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
* The models were generated by plotting the data points in OpenOffice and fitting a power trendline. The R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; value for each was 1.&lt;br /&gt;
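The run-time models in the tables above can be reproduced, at least approximately, with an ordinary least-squares fit on log-log axes (a spreadsheet trendline may use a slightly different fit, so the coefficients come out close to but not identical to the listed ones). A minimal sketch:&lt;br /&gt;

```python
import math

def fit_power_law(n_reads_millions, minutes):
    """Least-squares fit of minutes = a * n^b, done linearly on log-log axes."""
    xs = [math.log(n) for n in n_reads_millions]
    ys = [math.log(t) for t in minutes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = math.exp(mean_y - b * mean_x)
    return a, b

# Measured Privet overlapper points from the table above (16M was predicted,
# so it is left out; 20M was measured).
a, b = fit_power_law([1, 2, 4, 8, 20], [3, 9, 34, 130, 783])
```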
&lt;br /&gt;
&lt;br /&gt;
==== Other Observations About the Assemblies ====&lt;br /&gt;
* Because of the short read length, every million reads is only 75MB of sequence, which is roughly 10-20x coverage of an average single bacterium. Because these test sets have reads sampled from roughly 100 bacterial genomic sequences, I would expect the coverage of each to be on the order of 0.1x on average.&lt;br /&gt;
* Unsurprisingly, a cursory glance through the contig files shows that each contig comprises only about 2 or 3 reads.&lt;br /&gt;
* The N50 analysis for the smaller assemblies shows that only 2-3 reads are being added to each contig on average, leaving both N50s and average lengths just below 150bp.&lt;br /&gt;
* Therefore if the complexity of the oral microbiome data is high and/or the contamination of human DNA is extreme (80-95%), the coverage may be extremely low. This may make the use of Mike&#039;s assembler impractical, or at least that&#039;s how I&#039;m going to keep justifying this testing to myself until someone corrects me.&lt;br /&gt;
** &#039;&#039;&#039;Update:&#039;&#039;&#039; Apparently Mike and Dan have talked about this, and somewhere around 75-80bp, the performance of minimus catches up with Mike&#039;s de Bruijn graph assembler anyway. I also did not know that Dan&#039;s map-reduce minimus was running and would be used to assemble the data alongside Mike&#039;s.&lt;br /&gt;
* I learned on Feb. 1, 2010 that the 454 error model allows wild variation with respect to read length, so these assemblies might not actually be representative of the performance with the Illumina data we&#039;re expecting on Feb. 20.&lt;br /&gt;
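The coverage estimates above follow from back-of-the-envelope arithmetic; the ~4.5 Mbp average bacterial genome size below is my own round assumption.&lt;br /&gt;

```python
# Back-of-the-envelope check of the coverage estimates above.
# The ~4.5 Mbp average bacterial genome size is an assumed round figure.
READ_LENGTH_BP = 75
reads = 1_000_000
avg_genome_bp = 4_500_000
n_genomes = 100

total_bp = reads * READ_LENGTH_BP               # 75 Mbp per million reads
single_genome_cov = total_bp / avg_genome_bp    # ~17x if it were one genome
per_genome_cov = single_genome_cov / n_genomes  # ~0.17x over ~100 genomes
```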
&lt;br /&gt;
=== UMIACS Resources ===&lt;br /&gt;
I just discovered the information listed on the CBCB intranet Resources page is inaccurate and very out of date, so I&#039;m making my own table.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:500px; height:200px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Umiacs Resources&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Machine !! Processor !! Speed !! Cores !! RAM&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Walnut&lt;br /&gt;
| Dual Core AMD Opteron 8220 || 2.8GHz || 16 || 64GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Privet&lt;br /&gt;
| AMD Opteron 850 || 2.4GHz || 4 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Larch&lt;br /&gt;
| AMD Opteron 850 || 2.4GHz || 4 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Sycamore&lt;br /&gt;
| Dual Core AMD Opteron 875 || 1GHz || 8 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Shagbark&lt;br /&gt;
| Intel Core 2 Quad || 2.83GHz || 4 || 4GB&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== February 5, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Meeting with Volker and Sarada on Feb 3 ===&lt;br /&gt;
* Need to teach Sarada how to perform local BLAST searches on some sequences they have that aren&#039;t yet in GenBank&lt;br /&gt;
* Trying to set up a meeting with Volker to find out for sure if he wants me to work on this project&lt;br /&gt;
&lt;br /&gt;
=== Biomarker Assembly ===&lt;br /&gt;
Bo, Mohammad, and I spent a couple of hours discussing biomarker assembly today. I&#039;m going to try to efficiently summarize our conclusions, but it might be difficult without an easy way to make images. We eventually decided it would be best to attempt several methods in tandem, due to the severe time constraints. The general approach of each method is to fish out and bin reads through one method or another, and then assemble the reads in each bin using minimus. All sequence identity values will be determined using BLASTx.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Preliminary Steps&#039;&#039;&#039;&lt;br /&gt;
* Gather biomarker consensus amino acid sequences&lt;br /&gt;
* Gather amino acid sequences for associated genes from each bacterial genome in refseq&lt;br /&gt;
* Cluster amino acid sequences within each biomarker set&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Sequence Identity Threshold Determination&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
There are 31 biomarkers and about 1,000 bacterial genomes in which they occur. This means that there are 31 sets of 1,000 sequences that are all relatively similar to one another. Because of the sequence similarity and the short read length, it&#039;s possible that a significant number of reads will map equally well to multiple sequences within each biomarker set. For this reason, it is better to allow a single read to be placed in any bin containing a sequence to which the read mapped above some minimum threshold. This will protect against artificially lowering the coverage of extremely well-conserved regions, and with any luck, incorrectly binned reads will simply not be included in the assembly. There are several ways to approach the determination of this threshold.&lt;br /&gt;
* Determine the lowest level of sequence identity between the consensus sequence for each biomarker and any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.&lt;br /&gt;
** The obvious shortcoming of this approach is that the sequence identity between two homologous gene-length sequences can be much lower than between two homologous read-length sequences.&lt;br /&gt;
* Align 75mers to determine the lowest score between any two 75mers in the consensus sequence for each biomarker and the corresponding 75mer in any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.&lt;br /&gt;
** While this solves the problem with the above approach, it is significantly more complicated and the data is going to be here soon.&lt;br /&gt;
* Choose a sequence identity level, or try a few different levels and see which produces the most complete biomarker proteins without creating overly complex graphs.&lt;br /&gt;
** While there&#039;s no good theoretical justification for this approach, it&#039;s probably what we&#039;ll do and it will probably work well enough.&lt;br /&gt;
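The 75mer-based option above amounts to a sliding-window minimum identity. A sketch, under my own simplifying assumption that the consensus and member sequences are already aligned to equal length (which glosses over the hard part, the gapped alignment):&lt;br /&gt;

```python
def min_window_identity(consensus, member, k=75):
    """Lowest fraction of identical positions over any length-k window.
    Assumes the two sequences are pre-aligned to the same length."""
    assert len(consensus) == len(member) and len(consensus) >= k
    lowest = 1.0
    for start in range(len(consensus) - k + 1):
        window_a = consensus[start:start + k]
        window_b = member[start:start + k]
        matches = sum(a == b for a, b in zip(window_a, window_b))
        lowest = min(lowest, matches / k)
    return lowest
```

The minimum of this value over all sequences in a biomarker set would be that set&#039;s threshold.&lt;br /&gt;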
&lt;br /&gt;
&#039;&#039;&#039;Schemes&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
After making absurdly complicated descriptions of the various approaches, which I felt weren&#039;t very clear, I used Keynote to recreate the diagrams we&#039;d drawn on the white board and then printed them to a PDF. Unfortunately I&#039;m not sure exactly how to embed that in the wiki, so email me at trgibbons@gmail.com if you&#039;re reading this and I&#039;ll send it to you.&lt;br /&gt;
# Marker-wise assembly&lt;br /&gt;
#* Bin reads that align to any sequence in a given marker set, and/or the consensus sequence for that marker&lt;br /&gt;
# Cluster-wise assembly&lt;br /&gt;
#* Cluster protein sequences&lt;br /&gt;
#* Bin reads that align to any protein sequence in a given cluster&lt;br /&gt;
# Gene-wise assembly&lt;br /&gt;
#* Bin reads that align to a particular protein sequence&lt;br /&gt;
* Marker-wise and cluster-wise binning should be better for assembling novel sequences&lt;br /&gt;
* Gene-wise binning should produce higher quality assemblies of markers for known organisms or those that are closely related&lt;br /&gt;
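Whichever scheme is used, the multi-bin placement rule (one read may land in every bin it hits above threshold) is the same. A sketch, where the mapping from protein to bin would come from the marker sets or clusters, and all names are hypothetical:&lt;br /&gt;

```python
# Sketch of threshold-based multi-bin read placement.
from collections import defaultdict

def bin_reads(alignments, protein_to_bin, min_identity=60.0):
    """alignments: (read_id, protein_id, percent_identity) tuples.
    A read is placed in every bin containing a protein it hit above threshold."""
    bins = defaultdict(set)
    for read_id, protein_id, identity in alignments:
        if identity >= min_identity:
            bins[protein_to_bin[protein_id]].add(read_id)
    return dict(bins)
```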
&lt;br /&gt;
== February 12, 2010 ==&lt;br /&gt;
&#039;&#039;&#039;SNOW!!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== February 19, 2010 ==&lt;br /&gt;
Met with James to discuss Metastats. I&#039;m going to attempt the following two updates by the end of the semester (I probably incorrectly described them, but I&#039;ll work it out later):&lt;br /&gt;
# Find a better way to compute the false discovery rate (FDR)&lt;br /&gt;
#* Currently computed by using the lowest 100 p-values from each sample (look at source code)&lt;br /&gt;
#* Need to find a more algebraically rigorous way to compute it&lt;br /&gt;
#* False positive rate for 1000 samples is the p-value (p=0.05 =&amp;gt; 50 H_a&#039;s will be incorrectly predicted; so if the null hypothesis is thrown out for 100 samples, 50 will be expected to be incorrect)&lt;br /&gt;
#* James just thinks this sucks and needs to be fixed&lt;br /&gt;
# Compute F-tests for features across all samples&lt;br /&gt;
#* Most requested feature&lt;br /&gt;
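For reference, one standard and more rigorous alternative to the current heuristic is the Benjamini-Hochberg step-up procedure, which controls the FDR directly. This is the textbook method, not necessarily what will end up in Metastats:&lt;br /&gt;

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Indices of hypotheses rejected while controlling FDR at level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    n_reject = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * alpha / m:
            n_reject = rank  # step-up: keep the largest qualifying rank
    return sorted(order[:n_reject])
```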
&lt;br /&gt;
I spent too much time talking with people about science and not enough time doing it this week...&lt;br /&gt;
&lt;br /&gt;
== March 26, 2010 ==&lt;br /&gt;
I didn&#039;t realize it had been a whole month since I updated. Let&#039;s see, I nearly dropped Dr. Song&#039;s statistical genomics course, but then I didn&#039;t. I &amp;lt;i&amp;gt;did&amp;lt;/i&amp;gt; however learn that we don&#039;t have a class project. So the Metastats upgrades are going on the back burner for now because ZOMGZ I HAVE TO PRESENT MY PROPOSAL BY THE END OF THIS YEAR!!!&lt;br /&gt;
&lt;br /&gt;
My Thesis Project:&lt;br /&gt;
* I&#039;m generally interested in pathways shared between micro-organisms in a community, and also between micro-organisms and their multicellular hosts.&lt;br /&gt;
** I&#039;m particularly interested in studying the metabolic pathways shared between micro-organisms in the human gut, both with each other and their human hosts.&lt;br /&gt;
* James has created time-series models, and is interested in tackling spatial models with me.&lt;br /&gt;
* I would really like to correlate certain metabolic pathways with his modeled relationships.&lt;br /&gt;
&lt;br /&gt;
Volker&#039;s Project:&lt;br /&gt;
* has taken a big hit this week.&lt;br /&gt;
* I&#039;m going to go forward, with Bo&#039;s help, using the plan outlined in my project proposal for Mihai&#039;s biosequence analysis class:&lt;br /&gt;
** Use reciprocal best blast hits to map H37Rv genes to annotated genes in all available virulent and non-virulent strains of mycobacteria&lt;br /&gt;
** Use results from gene mapping to identify a core set of tuberculosis genes, as well as a set of predicted virulence genes&lt;br /&gt;
** Use a variety of comparison schemes to study the effect on the set of predicted virulence genes of the consideration of different subsets of non-virulent strains&lt;br /&gt;
** Use stable virulence prediction to rank genes as virulence targets&lt;br /&gt;
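The reciprocal best hit step above is mechanical enough to sketch. Assuming tabular BLAST output (outfmt 6) has been generated in both directions between H37Rv and another strain, a pair of genes is kept only when each is the other&#039;s top-scoring hit:&lt;br /&gt;

```python
# Sketch of reciprocal best BLAST hits from tabular (outfmt 6) output.
def best_hits(blast_tab_lines):
    """query -> subject with the highest bit score."""
    best = {}
    for line in blast_tab_lines:
        fields = line.rstrip("\n").split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {q: s for q, (s, _) in best.items()}

def reciprocal_best_hits(ab_lines, ba_lines):
    """Gene pairs that are each other's best hit in both BLAST directions."""
    a_to_b = best_hits(ab_lines)
    b_to_a = best_hits(ba_lines)
    return {(a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a}
```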
&lt;br /&gt;
Metastats:&lt;br /&gt;
* As I mentioned, this is put on hold&lt;br /&gt;
* I intend to pick this back up after I&#039;m done with Volker&#039;s project, as it could be instrumental to my thesis work&lt;br /&gt;
&lt;br /&gt;
== April 2, 2010 ==&lt;br /&gt;
More on my Thesis Project:&lt;br /&gt;
* I read the most recent (Science, Nov. 2009) paper by Gordon and Knight on their ongoing gut microbiota experiments&lt;br /&gt;
* Pretty much every section addressed the potential thesis topics I&#039;d imagined while reading the preceding section. Frustrating, but reaffirming (trying to learn from Mihai on not getting bummed about being scooped).&lt;br /&gt;
* Something that seems interesting and useful to me is to do more rigorous statistical analysis to attempt to correlate particular genes and pathways with the time-series and spatial data. I will have to work closely with Bo, at least at first.&lt;br /&gt;
* As a starting point, James has recommended building spatial models similar to his time-series models&lt;br /&gt;
* James is essentially mentoring me on my project at this point. It&#039;s pretty excellent.&lt;br /&gt;
&lt;br /&gt;
== July 23, 2010 ==&lt;br /&gt;
It&#039;s been several months, I don&#039;t feel any closer to finding a thesis project, and it&#039;s really starting to stress me out. I&#039;ve finally stopped making excuses for not reading and have been steadily reading about 10 abstracts and 2 papers per week for the last month or two, but it doesn&#039;t appear to be nearly enough. I met with Mihai today to talk about it and then foolishly went for a run in the heat of the afternoon, where I decided on a new direction in a state of euphoric delirium.&lt;br /&gt;
# Read the book Mihai loaned me within the next week (or so): &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; by Michael Wilson&lt;br /&gt;
#* Mihai says the book is a summary of what was known about the human microbiome 5 years ago. The table of contents for the first chapter is essentially identical to the list of Wikipedia pages I&#039;ve read in the past week, so I&#039;m pretty excited to now have a more thorough, authoritative source.&lt;br /&gt;
# Go back to looking for papers describing quorum sensing, especially in organisms known to be present in the human microbiome, either stably or transiently.&lt;br /&gt;
#* Try not to get too side-tracked reading about biofilms.&lt;br /&gt;
#* Search for an existing database of quorum sensing genes to use as references to potentially identify novel quorum sensing genes in microbiome WGS data. Consider making one if it&#039;s not already available.&lt;br /&gt;
# Look for a core metabolome (at this point I think a core microbiome seems unlikely) using metapath (or something similar) in the new HMP data for the gut, oral, and vaginal samples from the 100 reference individuals, as well as other sources like MetaHIT.&lt;br /&gt;
#* Start with a fast and dirty approach pulling all KO&#039;s associated with organisms identified using 16S rDNA sequencing, and then possibly attempt more accurate gene assemblies and annotation from WGS sequencing projects.&lt;br /&gt;
# Try to stay focused on a research topic for more than 2 weeks so I don&#039;t keep wasting time and effort.&lt;br /&gt;
# Don&#039;t make a habit of using the wiki like a personal journal...&lt;br /&gt;
&lt;br /&gt;
=== Possible Research Projects Inspired by &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ===&lt;br /&gt;
* I transferred this to a new page I created that&#039;s dedicated to my potential [[User:Tgibbons:Project-Ideas | project ideas]]&lt;br /&gt;
&lt;br /&gt;
== July 30, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Metagenomics Papers and Subjects Related to the Content of &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ===&lt;br /&gt;
# MetaHIT&lt;br /&gt;
# Vaginal Microbiome&lt;br /&gt;
# Acquisition of Microbiome&lt;br /&gt;
#* [http://www.pnas.org/content/early/2010/06/08/1002601107.full.pdf Vaginal birth vs cesarean section]&lt;br /&gt;
&lt;br /&gt;
== August 13, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Other Potential Research Projects ===&lt;br /&gt;
* I transferred this to a new page I created that&#039;s dedicated to my potential [[User:Tgibbons:Project-Ideas | project ideas]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== September 10, 2010 ==&lt;br /&gt;
I concatenated together the sequences from a subset of mated pairs from Mihai&#039;s oral microbiome data, and was able to align one of them to human DNA with &amp;gt;97% identity. This is surprising because neither of these reads could be aligned to a human reference using Bowtie. I plan to concatenate the remainder of the filtered reads and attempt to align them to human DNA using BLAST.&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7308</id>
		<title>User:Tgibbons:Project-Ideas</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7308"/>
		<updated>2010-08-26T17:41:28Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: Rearranged the layout&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;After a while, my weekly progress report just didn&#039;t seem like the appropriate place for brainstorming, so I&#039;ve transferred everything here.&lt;br /&gt;
&lt;br /&gt;
== Metagenomic assembly ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Potential Title:&#039;&#039;&#039; Justifying chimeric contigs in metagenomic assembly&lt;br /&gt;
&lt;br /&gt;
* There are two important aspects to metagenomic assembly:&lt;br /&gt;
# High-throughput short-read assembly, which is already being addressed by Eulerian and de Bruijn assemblers designed to run in the cloud.&lt;br /&gt;
# Heterogeneous assembly, which I haven&#039;t seen addressed by any well-known assemblers.&lt;br /&gt;
* I&#039;ve been kicking this idea around the floor for months, but none of the people I perceive as being better suited to tackle the problem have appeared all that interested. This could be due either to them already having large projects to which they&#039;re committed, or it could be that they sense trouble. I&#039;ll ask Mihai once I&#039;ve spent a couple of days looking through literature.&lt;br /&gt;
* Essentially I think the ideal metagenomic assembler would allow for, and gracefully represent, diversity within a single organism, without collapsing the genetic material of an entire community into a single messy contig.&lt;br /&gt;
* The major theoretical challenge would be the development of an algorithm that could differentiate between variation and &amp;quot;speciation&amp;quot; in a biologically meaningful way. This is far from being a new problem.&lt;br /&gt;
** The fundamental limitation of differentiating between species based on sequence divergence is that in many cases, a relatively small number of mutations, a single gene insertion, or a single functional plasmid can impart dramatic new phenotypic properties to a micro-organism without significantly altering the overall sequence. This means that sequence divergence is not necessarily proportional to phenotypic divergence, and thus cannot be used to differentiate between what I would consider to be biologically meaningful species (micro-organisms with differing phenotypes).&lt;br /&gt;
* As a simple starting point:&lt;br /&gt;
# I would begin by assembling all unitigs.&lt;br /&gt;
# From these seeds, I would extend out in both directions, allowing forks without automatically breaking the contigs.&lt;br /&gt;
# I would tentatively consider forks that can be joined on either side by unitigs substantially longer than the forked regions to be variation within a single species. In order to accomplish this, I would need to set thresholds for spawning and merging forks. I expect these to be of the following three types:&lt;br /&gt;
## &#039;&#039;&#039;SNPs and other very small indels and mutations&#039;&#039;&#039; could be handled by allowing a short run of mismatches (e.g., 3 or so) within a unitig before terminating the unitig and considering a fork. This would probably require modifying or creating an assembler, as opposed to just using a scaffolder. The main concern here is differentiating between variation and sequencing error.&lt;br /&gt;
##* Unfortunately, I don&#039;t think simply using exceptionally stringent quality score thresholds is a good approach as long as the sequencing data is vastly smaller than the amount of sequence in the original sample. I therefore think the current approach of throwing out N&#039;s and then using standard trimming algorithms should still be used, and then sequencing error should further be inferred algorithmically.&lt;br /&gt;
##* A single instance of a variant with relatively low quality scores is (somewhat obviously) more likely to be sequencing error than actual variation within the population.&lt;br /&gt;
##* Other such small indels and mutations could tentatively be considered actual variation within the population.&lt;br /&gt;
##* I believe much of the statistics for handling such cases has already been worked out for Eulerian path and de Bruijn graph assemblers.&lt;br /&gt;
## &#039;&#039;&#039;Forked regions that are long enough to contain disparate unitigs, closed on both sides by other unitigs&#039;&#039;&#039;, could be assembled from the output of an existing assembler. Of course, most current assemblers tend to generate a very large number of very small contigs, leaving us with most of the same challenges we would face with a full-blown assembler (also Chris is itching to make an assembler anyway).&lt;br /&gt;
##* It is important to consider that such cases are very likely the result of repeats at the boundaries of unitigs that can be placed in such an arrangement.&lt;br /&gt;
##* It is possible for inserted genes to create such scenarios in a biologically valid/meaningful way, but even these will likely be surrounded by repetitive sequences.&lt;br /&gt;
##* I will need to review the various methods other assemblers use to handle repeats before deciding on even a simple starting scheme.&lt;br /&gt;
## &#039;&#039;&#039;Everything in between&#039;&#039;&#039; - Forked regions with lengths falling between the minimum length of a unitig and the maximum length set for small mutations.&lt;br /&gt;
##* As with the other two scenarios, this one could also arise by incorrect assembly. These may be harder to sort out however, because this category is (currently) less well defined.&lt;br /&gt;
##* One of the big issues will be to develop a heuristic to differentiate between complex variation and reads from organisms so divergent that they should be classified as different OTUs.&lt;br /&gt;
##* It might be best to start by only considering the first two categories.&lt;br /&gt;
* In addition to the well-known theoretical challenges, coding this in such a way as to make it practical for current and future metagenomic problems would probably require more programming skills than I have the time to master. As such, I&#039;m hoping I can work out the theory on a smaller scale using Python, and then either convince someone else to collaborate with me on the implementation, or just publish my findings and leave it to the scientific community to incorporate my methods into their-favorite-assembler.&lt;br /&gt;
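&lt;br /&gt;
As a sanity check on the thresholds above, the fork classification could be sketched in Python (every name and cutoff value below is a placeholder I&#039;m inventing for illustration, not an existing tool):&lt;br /&gt;

```python
# Toy sketch of the fork-classification idea: given a "bubble"
# (two alternative branches between the same flanking unitigs), decide
# whether it looks like small variation, disparate unitigs, or the
# ambiguous middle category. All names and thresholds are invented.

SMALL_MUTATION_MAX = 3   # hypothetical max branch length for SNP-sized forks
MIN_UNITIG_LEN = 100     # hypothetical minimum length of a disparate unitig

def classify_fork(branch_a: str, branch_b: str) -> str:
    """Classify a fork (bubble) by the length of its longer branch."""
    longest = max(len(branch_a), len(branch_b))
    if longest >= MIN_UNITIG_LEN:
        return "disparate-unitigs"   # category 2: check flanks for repeats
    if longest > SMALL_MUTATION_MAX:
        return "ambiguous"           # category 3: everything in between
    return "small-variation"         # category 1: collapse into one contig

print(classify_fork("A", "G"))              # prints: small-variation
print(classify_fork("A" * 500, "G" * 450))  # prints: disparate-unitigs
print(classify_fork("A" * 20, "G" * 25))    # prints: ambiguous
```
&lt;br /&gt;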
&lt;br /&gt;
&amp;lt;s&amp;gt;The more I kick around this idea, the more I think this might work better as a scaffolder, of sorts. I should talk to Sergey about the possibility of just incorporating this into Bambus as an alternative output or something. Of course, Bambus fundamentally requires mate pairs, which would not necessarily be a constraint for this assembler/scaffolder. Also any smaller variance collapsed by an assembler into a consensus sequence would be lost.&amp;lt;/s&amp;gt; ...actually, nvm.&lt;br /&gt;
&lt;br /&gt;
=== Preliminary Experiment(s) ===&lt;br /&gt;
# Create synthetic heterogeneous metagenomic read set using very closely related strains of &#039;&#039;Mycobacteria&#039;&#039;.&lt;br /&gt;
#* Why &#039;&#039;Mycobacteria&#039;&#039;? There are many sequenced strains of &#039;&#039;M. tuberculosis&#039;&#039; &amp;amp; &#039;&#039;M. bovis&#039;&#039;, plus several more strains that are very closely related, such as &#039;&#039;M. kansasii&#039;&#039;, &#039;&#039;M. gastri&#039;&#039;, and &#039;&#039;M. marinum&#039;&#039;. This should provide the ability to combine reads from many strains that are &amp;gt;95% identical, but have significantly different phenotypes. I&#039;m also hoping I might get lucky and stumble across something that would help me publish with Volker. If I run into a problem with &#039;&#039;Mycobacteria&#039;&#039;, &#039;&#039;Lactobacillus&#039;&#039; would probably also be a good choice because there are many available sequences and there&#039;s a (slim) chance to discover meaningful insights into the vaginal microbiome.&lt;br /&gt;
#* I should consider my options for generating reads. The options I&#039;m already familiar with are Metasim, and Arthur&#039;s naive in-house read generator. I think it would be worth my time to do a literature search for alternative options though.&lt;br /&gt;
#* I would need to generate read sets with increasing variance and identify the types of changes that break contigs and lead to more fragmented (poorer) assemblies.&lt;br /&gt;
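&lt;br /&gt;
A toy version of that read generation could look like the following (all parameters are arbitrary; a real run would use MetaSim or another simulator with a proper error model):&lt;br /&gt;

```python
import random

# Toy read simulator for the synthetic-strain experiment: derive a few
# "strains" from one reference by random point mutations at increasing
# rates, then sample fixed-length reads from each strain.

def mutate(seq, rate, rng):
    """Return a copy of seq with each base substituted at the given rate."""
    out = []
    for c in seq:
        if rate > rng.random():
            out.append(rng.choice([b for b in "ACGT" if b != c]))
        else:
            out.append(c)
    return "".join(out)

def sample_reads(seq, n, length, rng):
    """Sample n uniform random substrings of the given length."""
    starts = [rng.randrange(len(seq) - length + 1) for _ in range(n)]
    return [seq[s:s + length] for s in starts]

rng = random.Random(42)
ref = "".join(rng.choice("ACGT") for _ in range(1000))
strains = [mutate(ref, rate, rng) for rate in (0.0, 0.01, 0.05)]
reads = [r for s in strains for r in sample_reads(s, 100, 75, rng)]
print(len(reads), "reads of length", len(reads[0]))  # prints: 300 reads of length 75
```
&lt;br /&gt;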
&lt;br /&gt;
== (pre)Binning to improve metagenomic assembly ==&lt;br /&gt;
* Mihai&#039;s concerns that binning could (and probably would) break assemblies by separating overlapping reads into different bins are valid, but assume the simplest of binning schemes: Every read is placed in exactly one bin and the assembler is never allowed to combine reads from multiple bins.&lt;br /&gt;
** This scheme is obviously overly simplistic and offers little value to existing assembly techniques.&lt;br /&gt;
* Motivation&lt;br /&gt;
# Convert computationally challenging problem into an embarrassingly parallel problem&lt;br /&gt;
#* Current metagenomic sequencing projects are generating hundreds of millions of (30-500 bp) reads, which traditional assemblers would attempt to load into RAM all at once. Even the average US university likely does not have the computational resources to successfully attempt such an assembly, and it is unlikely that the average group generating the sequences is prepared for this challenge.&lt;br /&gt;
#* Recent attempts to address this problem have focused on massively parallelized assemblers designed to run on large computer clusters (SOAPdenovo) or cloud clusters such as those offered by Amazon and Google (CloudBurst).&lt;br /&gt;
#* An alternative approach is to first attempt to bin reads we expect to be assembled together, and then use traditional assemblers to assemble the reads placed in each bin.&lt;br /&gt;
#** The hope here is that the amount of sequence in each bin would be more similar to traditional clonal sequencing projects and would therefore be more amenable to traditional assembly techniques.&lt;br /&gt;
#** Avoiding Mihai&#039;s concerns would require additional computation that would further inflate the overhead of this approach. These may include:&lt;br /&gt;
#*# Keeping a special bin for reads that could not be placed with high confidence, and allowing each assembly to pull reads from it.&lt;br /&gt;
#*#* This would require some sort of record keeping to ensure promiscuous reads are not added to many assemblies.&lt;br /&gt;
#*# Iterating the assemblies so that contigs from different bins have an opportunity to be combined, including singlets that were not able to be assembled in the first round.&lt;br /&gt;
#*#* This approach would greatly increase the overall runtime, but may essentially allow a relatively small group to assemble a large metagenomic sample with modest computational resources.&lt;br /&gt;
# Improve assemblies by using alternative algorithms to place promiscuous reads&lt;br /&gt;
#* For the most part, assemblers use rather naive criteria to place reads into contigs. This is in part because of the nature of the problem when these assemblers were first being developed. In clonal sequencing, one can safely assume that the vast majority of sequence variation actually comes from sequencing error, and not from variation within the population of that organism within a particular environment. To this end, traditional assemblers have sought to identify and eliminate minor sequence variation within the read set, assemble overlapping reads with relatively clean sequences, and simply break up assemblies in places where there was too much variation to justify any other action.&lt;br /&gt;
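&lt;br /&gt;
The bin-plus-unplaced-reads scheme could be sketched like so (binning by GC content only as a crude stand-in for real composition- or reference-based binning; every name and threshold here is hypothetical):&lt;br /&gt;

```python
# Crude illustration of a binning scheme with an "unplaced" bin: reads
# are binned here by GC content alone, and reads falling near a bin
# boundary are held out so every per-bin assembly can pull from them.

def gc(seq):
    """Fraction of G/C bases in a read."""
    return sum(c in "GC" for c in seq) / len(seq)

def bin_reads(reads, boundaries=(0.4, 0.6), margin=0.02):
    bins = {i: [] for i in range(len(boundaries) + 1)}
    unplaced = []  # low-confidence reads, shared across all assemblies
    for r in reads:
        g = gc(r)
        # hold out reads whose GC falls within `margin` of any boundary
        if any(margin > abs(g - b) for b in boundaries):
            unplaced.append(r)
            continue
        bins[sum(g > b for b in boundaries)].append(r)
    return bins, unplaced

bins, unplaced = bin_reads(["ATAT", "GCGC", "ATGC", "ATGCGC", "ATGCA"])
print(sorted(bins[2]), unplaced)  # prints: ['ATGCGC', 'GCGC'] ['ATGCA']
```

Each bin would then be assembled independently, with the unplaced list available to every assembly (subject to the record keeping mentioned above).&lt;br /&gt;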
&lt;br /&gt;
hmm... Well that&#039;s not where I&#039;d intended to go with that. The more I think about this problem, the more I think my other metagenomic assembly idea is more promising in terms of improving assembly quality. I also have serious doubts about the need in the scientific community for the ability to assemble large metagenomic sequencing projects using modest local computational resources. It seems much more likely to me that these groups will increasingly request money for time on an Amazon cluster when writing their grants, and that an iterative local approach would quickly become obsolete, if it isn&#039;t already.&lt;br /&gt;
&lt;br /&gt;
== Microbial Ecology of the Human Microbiome ==&lt;br /&gt;
# Searching for quorum sensing genes, both known and novel, and any pathways including them in human microbiome data&lt;br /&gt;
#* Search for and consider making quorum sensing gene DB&lt;br /&gt;
#** KEGG has pathways containing both acyl-homoserine lactone and its synthase&lt;br /&gt;
#* After indexing known quorum sensing genes, search for homologues&lt;br /&gt;
#** WGS data - Obviously search for homologues directly&lt;br /&gt;
#** 16S data - Identify organisms and search for homologues in public DBs&lt;br /&gt;
# Search for &amp;quot;core metabolome&amp;quot; in pioneer organisms from infant studies&lt;br /&gt;
#* On the off chance I get to publish on this subject, it might be interesting to draw an analogy between genetic evolution involving chance mutations that are occasionally beneficial and are therefore propagated, and microbial colonization which involves chance introduction of microorganisms that are occasionally beneficial and therefore become stable members of the microbiome.&lt;br /&gt;
# Attempt to search for cases of symbiosis where possible&lt;br /&gt;
#* Search terms: commensalism, synergism (protocooperation), mutualism, competition, amensalism (antagonism), predation, parasitism, neutralism, syntrophism (nutritional synergism, cross-feeding, examples on pp.16-17)&lt;br /&gt;
&lt;br /&gt;
== Volker&#039;s Mycobacterial genomes ==&lt;br /&gt;
&lt;br /&gt;
== Thoughts on my Career Trajectory ==&lt;br /&gt;
* For the past 2+ years, when asked, I&#039;ve been saying that my research interests are &amp;quot;using metagenomics to study the human microbiome.&amp;quot; This is all well and good for a green grad student, but &amp;quot;metagenomics&amp;quot; and &amp;quot;the human microbiome&amp;quot; are simultaneously too broad and too limiting to define an individual researcher&#039;s specific area of specialty.&lt;br /&gt;
* After some consideration of how I&#039;ve spent my time, and which projects have most interested me, a refined statement of my research interests would seem to be, &amp;quot;using high-throughput biological techniques to study microbial communities, especially where there is a direct impact on human health.&amp;quot;&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7304</id>
		<title>User:Tgibbons:Project-Ideas</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7304"/>
		<updated>2010-08-18T02:58:42Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* Potential Research Projects Inspired by &amp;#039;&amp;#039;Microbial Inhabitants of Humans&amp;#039;&amp;#039; */ Added thematic header&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;My weekly progress report just didn&#039;t seem appropriate for my brainstorming after a bit, so I&#039;ve transferred everything here.&lt;br /&gt;
&lt;br /&gt;
== Potential Research Projects Inspired by &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ==&lt;br /&gt;
All of these projects fall under the broader category of &#039;&#039;microbial ecology&#039;&#039;. I should consider finding a good reference, such as a microbial ecology textbook. I should also keep in mind the overarching theme of this branch of my research interests and not be discouraged when sub-projects do not pan out because there is no shortage of ways in which microbes interact with each other.&lt;br /&gt;
=== Searching for quorum sensing genes, both known and novel, and any pathways including them in human microbiome data ===&lt;br /&gt;
* Search for and consider making quorum sensing gene DB&lt;br /&gt;
** KEGG has pathways containing both acyl-homoserine lactone and its synthase&lt;br /&gt;
* After indexing known quorum sensing genes, search for homologues&lt;br /&gt;
** WGS data - Obviously search for homologues directly&lt;br /&gt;
** 16S data - Identify organisms and search for homologues in public DBs&lt;br /&gt;
=== Search for &amp;quot;core metabolome&amp;quot; in pioneer organisms from infant studies ===&lt;br /&gt;
* On the off chance I get to publish on this subject, it might be interesting to draw an analogy between genetic evolution involving chance mutations that are occasionally beneficial and are therefore propagated, and microbial colonization which involves chance introduction of microorganisms that are occasionally beneficial and therefore become stable members of the microbiome.&lt;br /&gt;
=== Attempt to search for cases of symbiosis where possible ===&lt;br /&gt;
* Search terms: commensalism, synergism (protocooperation), mutualism, competition, amensalism (antagonism), predation, parasitism, neutralism, syntrophism (nutritional synergism, cross-feeding, examples on pp.16-17)&lt;br /&gt;
&lt;br /&gt;
== Other Potential Research Projects ==&lt;br /&gt;
&lt;br /&gt;
=== Metagenomic assembly ===&lt;br /&gt;
* There are two important aspects to metagenomic assembly:&lt;br /&gt;
# High-throughput short-read assembly, which is already being addressed by Eulerian and de Bruijn assemblers designed to run in the cloud.&lt;br /&gt;
# Heterogeneous assembly, which I haven&#039;t seen addressed by any well-known assemblers.&lt;br /&gt;
* I&#039;ve been kicking this idea around the floor for months, but none of the people I perceive as being better suited to tackle the problem have appeared all that interested. This could be due either to them already having large projects to which they&#039;re committed, or it could be that they sense trouble. I&#039;ll ask Mihai once I&#039;ve spent a couple of days looking through literature.&lt;br /&gt;
* Essentially I think the ideal metagenomic assembler would allow for, and gracefully represent, diversity within a single organism, without collapsing the genetic material of an entire community into a single messy contig.&lt;br /&gt;
* The major theoretical challenge would be the development of an algorithm that could differentiate between variation and &amp;quot;speciation&amp;quot; in a biologically meaningful way. This is far from being a new problem.&lt;br /&gt;
** The fundamental limitation of differentiating between species based on sequence divergence is that in many cases, a relatively small number of mutations, a single gene insertion, or a single functional plasmid can impart dramatic new phenotypic properties to a micro-organism without significantly altering the overall sequence. This means that sequence divergence is not necessarily proportional to phenotypic divergence and thus can not be used to differentiate between what I would consider to be biologically meaningful species (micro-organisms with differing phenotypes).&lt;br /&gt;
* As a simple starting point:&lt;br /&gt;
# I would begin by assembling all unitigs.&lt;br /&gt;
# From these seeds, I would extend out in both directions, allowing forks without automatically breaking the contigs.&lt;br /&gt;
# I would tentatively consider forks that can be joined on either side by unitigs substantially longer than the forked regions to be variation within a single species. In order to accomplish this, I would need to set thresholds for spawning and merging forks. I expect these to be of the following three types:&lt;br /&gt;
## &#039;&#039;&#039;SNPs and other very small indels and mutations&#039;&#039;&#039; could be handled by allowing a short run of mismatches (e.g., 3 or so) within a unitig before terminating the unitig and considering a fork. This would probably require modifying or creating an assembler, as opposed to just using a scaffolder. The main concern here is differentiating between variation and sequencing error.&lt;br /&gt;
##* Unfortunately, I don&#039;t think simply using exceptionally stringent quality score thresholds is a good approach as long as the sequencing data is vastly smaller than the amount of sequence in the original sample. I therefore think the current approach of throwing out N&#039;s and then using standard trimming algorithms should still be used, and then sequencing error should further be inferred algorithmically.&lt;br /&gt;
##* A single instance of a variant with relatively low quality scores is (somewhat obviously) more likely to be sequencing error than actual variation within the population.&lt;br /&gt;
##* Other such small indels and mutations could tentatively be considered actual variation within the population.&lt;br /&gt;
##* I believe much of the statistics for handling such cases has already been worked out for Eulerian path and de Bruijn graph assemblers.&lt;br /&gt;
## &#039;&#039;&#039;Forked regions that are long enough to contain disparate unitigs, closed on both sides by other unitigs&#039;&#039;&#039;, could be assembled from the output of an existing assembler. Of course, most current assemblers tend to generate a very large number of very small contigs, leaving us with most of the same challenges we would face with a full-blown assembler (also Chris is itching to make an assembler anyway).&lt;br /&gt;
##* It is important to consider that such cases are very likely the result of repeats at the boundaries of unitigs that can be placed in such an arrangement.&lt;br /&gt;
##* It is possible for inserted genes to create such scenarios in a biologically valid/meaningful way, but even these will likely be surrounded by repetitive sequences.&lt;br /&gt;
##* I will need to review the various methods other assemblers use to handle repeats before deciding on even a simple starting scheme.&lt;br /&gt;
## &#039;&#039;&#039;Everything in between&#039;&#039;&#039; - Forked regions with lengths falling between the minimum length of a unitig and the maximum length set for small mutations.&lt;br /&gt;
##* As with the other two scenarios, this one could also arise by incorrect assembly. These may be harder to sort out however, because this category is (currently) less well defined.&lt;br /&gt;
##* One of the big issues will be to develop a heuristic to differentiate between complex variation and reads from organisms so divergent that they should be classified as different OTUs.&lt;br /&gt;
##* It might be best to start by only considering the first two categories.&lt;br /&gt;
* In addition to the well-known theoretical challenges, coding this in such a way as to make it practical for current and future metagenomic problems would probably require more programming skills than I have the time to master. As such, I&#039;m hoping I can work out the theory on a smaller scale using Python, and then either convince someone else to collaborate with me on the implementation, or just publish my findings and leave it to the scientific community to incorporate my methods into their-favorite-assembler.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;s&amp;gt;The more I kick around this idea, the more I think this might work better as a scaffolder, of sorts. I should talk to Sergey about the possibility of just incorporating this into Bambus as an alternative output or something. Of course, Bambus fundamentally requires mate pairs, which would not necessarily be a constraint for this assembler/scaffolder. Also any smaller variance collapsed by an assembler into a consensus sequence would be lost.&amp;lt;/s&amp;gt; ...actually, nvm.&lt;br /&gt;
&lt;br /&gt;
==== Preliminary Experiment(s) ====&lt;br /&gt;
# Create synthetic heterogeneous metagenomic read set using very closely related strains of &#039;&#039;Mycobacteria&#039;&#039;.&lt;br /&gt;
#* Why &#039;&#039;Mycobacteria&#039;&#039;? There are many sequenced strains of &#039;&#039;M. tuberculosis&#039;&#039; &amp;amp; &#039;&#039;M. bovis&#039;&#039;, plus several more strains that are very closely related, such as &#039;&#039;M. kansasii&#039;&#039;, &#039;&#039;M. gastri&#039;&#039;, and &#039;&#039;M. marinum&#039;&#039;. This should provide the ability to combine reads from many strains that are &amp;gt;95% identical, but have significantly different phenotypes. I&#039;m also hoping I might get lucky and stumble across something that would help me publish with Volker. If I run into a problem with &#039;&#039;Mycobacteria&#039;&#039;, &#039;&#039;Lactobacillus&#039;&#039; would probably also be a good choice because there are many available sequences and there&#039;s a (slim) chance to discover meaningful insights into the vaginal microbiome.&lt;br /&gt;
#* I should consider my options for generating reads. The options I&#039;m already familiar with are Metasim, and Arthur&#039;s naive in-house read generator. I think it would be worth my time to do a literature search for alternative options though.&lt;br /&gt;
#* I would need to generate read sets with increasing variance and identify the types of changes that break contigs and lead to more fragmented (poorer) assemblies.&lt;br /&gt;
&lt;br /&gt;
=== (pre)Binning to improve metagenomic assembly ===&lt;br /&gt;
* Mihai&#039;s concerns that binning could (and probably would) break assemblies by separating overlapping reads into different bins are valid, but assume the simplest of binning schemes: Every read is placed in exactly one bin and the assembler is never allowed to combine reads from multiple bins.&lt;br /&gt;
** This scheme is obviously overly simplistic and offers little value to existing assembly techniques.&lt;br /&gt;
* Motivation&lt;br /&gt;
# Convert computationally challenging problem into an embarrassingly parallel problem&lt;br /&gt;
#* Current metagenomic sequencing projects are generating hundreds of millions of (30-500 bp) reads, which traditional assemblers would attempt to load into RAM all at once. Even the average US university likely does not have the computational resources to successfully attempt such an assembly, and it is unlikely that the average group generating the sequences is prepared for this challenge.&lt;br /&gt;
#* Recent attempts to address this problem have focused on massively parallelized assemblers designed to run on large computer clusters (SOAPdenovo) or cloud clusters such as those offered by Amazon and Google (CloudBurst).&lt;br /&gt;
#* An alternative approach is to first attempt to bin reads we expect to be assembled together, and then use traditional assemblers to assemble the reads placed in each bin.&lt;br /&gt;
#** The hope here is that the amount of sequence in each bin would be more similar to traditional clonal sequencing projects and would therefore be more amenable to traditional assembly techniques.&lt;br /&gt;
#** Avoiding Mihai&#039;s concerns would require additional computation that would further inflate the overhead of this approach. These may include:&lt;br /&gt;
#*# Keeping a special bin for reads that could not be placed with high confidence, and allowing each assembly to pull reads from it.&lt;br /&gt;
#*#* This would require some sort of record keeping to ensure promiscuous reads are not added to many assemblies.&lt;br /&gt;
#*# Iterating the assemblies so that contigs from different bins have an opportunity to be combined, including singlets that were not able to be assembled in the first round.&lt;br /&gt;
#*#* This approach would greatly increase the overall runtime, but may essentially allow a relatively small group to assemble a large metagenomic sample with modest computational resources.&lt;br /&gt;
# Improve assemblies by using alternative algorithms to place promiscuous reads&lt;br /&gt;
#* For the most part, assemblers use rather naive criteria to place reads into contigs. This is in part because of the nature of the problem when these assemblers were first being developed. In clonal sequencing, one can safely assume that the vast majority of sequence variation actually comes from sequencing error, and not from variation within the population of that organism within a particular environment. To this end, traditional assemblers have sought to identify and eliminate minor sequence variation within the read set, assemble overlapping reads with relatively clean sequences, and simply break up assemblies in places where there was too much variation to justify any other action.&lt;br /&gt;
&lt;br /&gt;
hmm... Well that&#039;s not where I&#039;d intended to go with that. The more I think about this problem, the more I think my other metagenomic assembly idea is more promising in terms of improving assembly quality. I also have serious doubts about the need in the scientific community for the ability to assemble large metagenomic sequencing projects using modest local computational resources. It seems much more likely to me that these groups will increasingly request money for time on an Amazon cluster when writing their grants, and that an iterative local approach would quickly become obsolete, if it isn&#039;t already.&lt;br /&gt;
&lt;br /&gt;
=== Volker&#039;s Mycobacterial genomes ===&lt;br /&gt;
&lt;br /&gt;
== Thoughts on my Career Trajectory ==&lt;br /&gt;
* For the past 2+ years, when asked, I&#039;ve been saying that my research interests are &amp;quot;using metagenomics to study the human microbiome.&amp;quot; This is all well and good for a green grad student, but &amp;quot;metagenomics&amp;quot; and &amp;quot;the human microbiome&amp;quot; are simultaneously too broad and too limiting to define an individual researcher&#039;s specific area of specialty.&lt;br /&gt;
* After some consideration of how I&#039;ve spent my time, and which projects have most interested me, a refined statement of my research interests would seem to be, &amp;quot;using high-throughput biological techniques to study microbial communities, especially where there is a direct impact on human health.&amp;quot;&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7303</id>
		<title>User:Tgibbons:Project-Ideas</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7303"/>
		<updated>2010-08-17T22:10:43Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* Metagenomic assembly */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;My weekly progress report just didn&#039;t seem appropriate for my brainstorming after a bit, so I&#039;ve transferred everything here.&lt;br /&gt;
&lt;br /&gt;
== Potential Research Projects Inspired by &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ==&lt;br /&gt;
=== Searching for quorum sensing genes, both known and novel, and any pathways including them in human microbiome data ===&lt;br /&gt;
* Search for and consider making quorum sensing gene DB&lt;br /&gt;
** KEGG has pathways containing both acyl-homoserine lactone and its synthase&lt;br /&gt;
* After indexing known quorum sensing genes, search for homologues&lt;br /&gt;
** WGS data - Obviously search for homologues directly&lt;br /&gt;
** 16S data - Identify organisms and search for homologues in public DBs&lt;br /&gt;
=== Search for &amp;quot;core metabolome&amp;quot; in pioneer organisms from infant studies ===&lt;br /&gt;
* On the off chance I get to publish on this subject, it might be interesting to draw an analogy between genetic evolution involving chance mutations that are occasionally beneficial and are therefore propagated, and microbial colonization which involves chance introduction of microorganisms that are occasionally beneficial and therefore become stable members of the microbiome.&lt;br /&gt;
=== Attempt to search for cases of symbiosis where possible ===&lt;br /&gt;
* Search terms: commensalism, synergism (protocooperation), mutualism, competition, amensalism (antagonism), predation, parasitism, neutralism, syntrophism (nutritional synergism, cross-feeding, examples on pp.16-17)&lt;br /&gt;
&lt;br /&gt;
== Other Potential Research Projects ==&lt;br /&gt;
&lt;br /&gt;
=== Metagenomic assembly ===&lt;br /&gt;
* There are two important aspects to metagenomic assembly:&lt;br /&gt;
# High-throughput short-read assembly, which is already being addressed by Eulerian and de Bruijn assemblers designed to run in the cloud.&lt;br /&gt;
# Heterogeneous assembly, which I haven&#039;t seen addressed by any well-known assemblers.&lt;br /&gt;
* I&#039;ve been kicking this idea around the floor for months, but none of the people I perceive as being better suited to tackle the problem have appeared all that interested. This could be due either to them already having large projects to which they&#039;re committed, or it could be that they sense trouble. I&#039;ll ask Mihai once I&#039;ve spent a couple of days looking through literature.&lt;br /&gt;
* Essentially I think the ideal metagenomic assembler would allow for, and gracefully represent, diversity within a single organism, without collapsing the genetic material of an entire community into a single messy contig.&lt;br /&gt;
* The major theoretical challenge would be the development of an algorithm that could differentiate between variation and &amp;quot;speciation&amp;quot; in a biologically meaningful way. This is far from being a new problem.&lt;br /&gt;
** The fundamental limitation of differentiating between species based on sequence divergence is that in many cases, a relatively small number of mutations, a single gene insertion, or a single functional plasmid can impart dramatic new phenotypic properties to a micro-organism without significantly altering the overall sequence. This means that sequence divergence is not necessarily proportional to phenotypic divergence and thus can not be used to differentiate between what I would consider to be biologically meaningful species (micro-organisms with differing phenotypes).&lt;br /&gt;
* As a simple starting point:&lt;br /&gt;
# I would begin by assembling all unitigs.&lt;br /&gt;
# From these seeds, I would extend out in both directions, allowing forks without automatically breaking the contigs.&lt;br /&gt;
# Forks that can be joined on either side by unitigs that are substantially longer than the forked regions, I would tentatively consider to be variation within a single species. In order to accomplish this, I would need to set thresholds for spawning and merging forks. I expect these to be of the three following types:&lt;br /&gt;
## &#039;&#039;&#039;SNPs and other very small indels and mutations&#039;&#039;&#039; could be handled by allowing a run of some small number of mismatches (e.g., 3 or so) within a unitig before terminating the unitig and considering a fork. This would probably require modifying or creating an assembler, as opposed to just using a scaffolder. The main concern here is differentiating between variation and sequencing error.&lt;br /&gt;
##* Unfortunately, I don&#039;t think simply using exceptionally stringent quality score thresholds is a good approach as long as the sequencing data is vastly smaller than the amount of sequence in the original sample. I therefore think the current approach of throwing out N&#039;s and then using standard trimming algorithms should still be used, and then sequencing error should further be inferred algorithmically.&lt;br /&gt;
##* A single instance of a variant with relatively low quality scores is (somewhat obviously) more likely to be sequencing error than actual variation within the population.&lt;br /&gt;
##* Other such small indels and mutations could tentatively be considered actual variation within the population.&lt;br /&gt;
##* I believe much of the statistics for handling such cases have already been worked out for eulerian path and de bruijn graph assemblers.&lt;br /&gt;
## &#039;&#039;&#039;Forked regions that are long enough to contain disparate unitigs, closed on both sides by other unitigs&#039;&#039;&#039;, could be assembled from the output of an existing assembler. Of course, most current assemblers tend to generate a very large number of very small contigs, leaving us with most of the same challenges we would face with a full-blown assembler (also Chris is itching to make an assembler anyway).&lt;br /&gt;
##* It is important to consider that such cases are very likely the result of repeats at the boundaries of unitigs that can be placed in such an arrangement.&lt;br /&gt;
##* It is possible for inserted genes to create such scenarios in a biologically valid/meaningful way, but even these will likely be surrounded by repetitive sequences.&lt;br /&gt;
##* I will need to review the various methods other assemblers use to handle repeats before deciding on even a simple starting scheme.&lt;br /&gt;
## &#039;&#039;&#039;Everything in between&#039;&#039;&#039; - Forked regions with lengths falling between the minimum length of a unitig and the maximum length set for small mutations.&lt;br /&gt;
##* As with the other two scenarios, this one could also arise by incorrect assembly. These may be harder to sort out however, because this category is (currently) less well defined.&lt;br /&gt;
##* One of the big issues will be to develop a heuristic to differentiate between complex variation, and reads that are from organisms so divergent that they should be classified as different OTUs.&lt;br /&gt;
##* It might be best to start by only considering the first two categories.&lt;br /&gt;
* In addition to the well-known theoretical challenges, coding this in such a way as to make it practical for current and future metagenomic problems would probably require more programming skills than I have the time to master. As such, I&#039;m hoping that maybe I can work out the theory on a smaller scale using Python, and then maybe convince someone else to collaborate with me for the implementation, or just publish my findings and leave it to the scientific community to incorporate my methods into their-favorite-assembler.&lt;br /&gt;
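The three fork categories above could be sketched as a simple length-based classifier. This is only a toy illustration: the thresholds MAX_SNP_RUN and MIN_UNITIG_LEN are placeholder values I made up, and would need tuning against real unitig length distributions.&lt;br /&gt;

```python
# Toy classifier for the three fork categories described above.
# Both thresholds are invented placeholders, not tuned values.
MAX_SNP_RUN = 3        # assumed cap on runs of small mutations
MIN_UNITIG_LEN = 64    # assumed minimum length of a unitig

def classify_fork(fork_len):
    """Bucket a forked region by its length alone."""
    if fork_len >= MIN_UNITIG_LEN:
        return "disparate-unitigs"   # long enough to hold its own unitigs
    if fork_len > MAX_SNP_RUN:
        return "in-between"          # the less well defined middle category
    return "small-mutation"          # SNPs and very small indels
```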
&lt;br /&gt;
&amp;lt;s&amp;gt;The more I kick around this idea, the more I think this might work better as a scaffolder, of sorts. I should talk to Sergey about the possibility of just incorporating this into Bambus as an alternative output or something. Of course, Bambus fundamentally requires mate pairs, which would not necessarily be a constraint for this assembler/scaffolder. Also any smaller variance collapsed by an assembler into a consensus sequence would be lost.&amp;lt;/s&amp;gt; ...actually, nvm.&lt;br /&gt;
&lt;br /&gt;
==== Preliminary Experiment(s) ====&lt;br /&gt;
# Create synthetic heterogeneous metagenomic read set using very closely related strains of &#039;&#039;Mycobacteria&#039;&#039;.&lt;br /&gt;
#* Why &#039;&#039;Mycobacteria&#039;&#039;? There are many sequenced strains of &#039;&#039;M. tuberculosis&#039;&#039; &amp;amp; &#039;&#039;M. bovis&#039;&#039;, plus several more strains that are very closely related such as &#039;&#039;M. kansasii&#039;&#039;, &#039;&#039;M. gastri&#039;&#039;, and &#039;&#039;M. marinum&#039;&#039;. This should provide the ability to combine reads from many strains that are &amp;gt;95% identical, but have significantly different phenotypes. I&#039;m also hoping I might get lucky and stumble across something that would help me publish with Volker. If I run into a problem with &#039;&#039;Mycobacteria&#039;&#039;, &#039;&#039;Lactobacillus&#039;&#039; would probably also be a good choice because there are many available sequences and there&#039;s a (slim) chance to discover meaningful insights into the vaginal microbiome.&lt;br /&gt;
#* I should consider my options for generating reads. The options I&#039;m already familiar with are MetaSim and Arthur&#039;s naive in-house read generator. I think it would be worth my time to do a literature search for alternative options, though.&lt;br /&gt;
#* I would need to generate read sets with increasing variance and identify the types of changes that break contigs and lead to more fragmented (poorer) assemblies.&lt;br /&gt;
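If neither MetaSim nor Arthur&#039;s generator pans out, generating read sets with increasing variance could look something like this naive mutate-and-sample routine. The mutation rates, read length, and read counts here are arbitrary numbers for illustration, and real reads would of course also need sequencing error.&lt;br /&gt;

```python
import random

def mutate(seq, rate, rng):
    """Copy seq, substituting each base with probability rate (no indels)."""
    out = []
    for b in seq:
        if rng.random() >= rate:
            out.append(b)
        else:
            out.append(rng.choice([x for x in "ACGT" if x != b]))
    return "".join(out)

def sample_reads(seq, read_len, n_reads, rng):
    """Uniformly sample fixed-length, error-free reads from seq."""
    last = len(seq) - read_len
    return [seq[i:i + read_len]
            for i in (rng.randint(0, last) for _ in range(n_reads))]

rng = random.Random(42)
ref = "".join(rng.choice("ACGT") for _ in range(1000))
read_sets = {rate: sample_reads(mutate(ref, rate, rng), 100, 50, rng)
             for rate in (0.0, 0.01, 0.05)}   # increasing strain divergence
```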
&lt;br /&gt;
=== (pre)Binning to improve metagenomic assembly ===&lt;br /&gt;
* Mihai&#039;s concerns that binning could (and probably would) break assemblies by separating overlapping reads into different bins are valid, but assume the simplest of binning schemes: Every read is placed in exactly one bin and the assembler is never allowed to combine reads from multiple bins.&lt;br /&gt;
** This scheme is obviously overly simplistic and offers little value to existing assembly techniques.&lt;br /&gt;
* Motivation&lt;br /&gt;
# Convert computationally challenging problem into an embarrassingly parallel problem&lt;br /&gt;
#* Current metagenomic sequencing projects are generating hundreds of millions of (30-500bp) reads, which traditional assemblers would attempt to load into RAM all at once. Even the average US university likely does not have the computational resources to successfully attempt such an assembly, and it is unlikely that the average group generating the sequences is prepared for this challenge.&lt;br /&gt;
#* Recent attempts to address this problem have focused on massively parallelized assemblers designed to run on large computer clusters (SOAPdenovo) or cloud clusters such as those offered by Amazon and Google (CloudBurst).&lt;br /&gt;
#* An alternative approach is to first attempt to bin reads we expect to be assembled together, and then use traditional assemblers to assemble the reads placed in each bin.&lt;br /&gt;
#** The hope here is that the amount of sequence in each bin would be more similar to traditional clonal sequencing projects and would therefore be more amenable to traditional assembly techniques.&lt;br /&gt;
#** Avoiding Mihai&#039;s concerns would require additional computation that would further inflate the overhead of this approach. These may include:&lt;br /&gt;
#*# Keeping a special bin for reads that could not be placed with high confidence, and allowing each assembly to pull reads from it.&lt;br /&gt;
#*** This would require some sort of record keeping to ensure promiscuous reads are not added to many assemblies.&lt;br /&gt;
#*# Iterating the assemblies so that contigs from different bins have an opportunity to be combined, including singlets that were not able to be assembled in the first round.&lt;br /&gt;
#*** This approach would greatly increase the overall runtime, but may essentially allow a relatively small group to assemble a large metagenomic sample with modest computational resources.&lt;br /&gt;
# Improve assemblies by using alternative algorithms to place promiscuous reads&lt;br /&gt;
#* For the most part, assemblers use rather naive criteria to place reads into contigs. This is in part because of the nature of the problem when these assemblers were first being developed. In clonal sequencing, one can safely assume that the vast majority of sequence variation actually comes from sequencing error, and not from variation within the population of that organism within a particular environment. To this end, traditional assemblers have sought to identify and eliminate minor sequence variation within the read set, assemble overlapping reads with relatively clean sequences, and simply break up assemblies in places where there was too much variation to justify any other action.&lt;br /&gt;
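A toy version of the binning scheme with a shared low-confidence bin, just to make the idea concrete. I&#039;m using GC content as the signature purely for illustration; a real implementation would use k-mer composition or something smarter, and the band edges and margin below are made-up numbers.&lt;br /&gt;

```python
from collections import defaultdict

def gc_fraction(read):
    """Fraction of G/C bases; a crude stand-in for a real binning signature."""
    return (read.count("G") + read.count("C")) / len(read)

def bin_reads(reads, edges=(0.35, 0.50, 0.65), margin=0.02):
    """Pre-bin reads by GC band. Reads whose signature falls within
    margin of a band edge go to a shared 'unplaced' bin that every
    per-bin assembly would later be allowed to pull from."""
    bins = defaultdict(list)
    for r in reads:
        gc = gc_fraction(r)
        if any(margin > abs(gc - e) for e in edges):
            bins["unplaced"].append(r)              # low-confidence placement
        else:
            band = sum(1 for e in edges if gc > e)  # index of the GC band
            bins[band].append(r)
    return bins
```

Each non-"unplaced" bin would then go to a traditional assembler, with the iterative merging step reconciling contigs across bins afterward.&lt;br /&gt;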
&lt;br /&gt;
hmm... Well that&#039;s not where I&#039;d intended to go with that. The more I think about this problem, the more I think my other metagenomic assembly idea is more promising in terms of improving assembly quality. I also have serious doubts about the need in the scientific community for the ability to assemble large metagenomic sequencing projects using modest local computational resources. It seems much more likely to me that these groups will increasingly request money for time on an Amazon cluster when writing their grants, and that an iterative local approach would quickly become obsolete, if it isn&#039;t already.&lt;br /&gt;
&lt;br /&gt;
=== Volker&#039;s Mycobacterial genomes ===&lt;br /&gt;
&lt;br /&gt;
== Thoughts on my Career Trajectory ==&lt;br /&gt;
* For the past 2+ years, when asked, I&#039;ve been saying that my research interests are &amp;quot;using metagenomics to study the human microbiome.&amp;quot; This is all well and good for a green grad student, but &amp;quot;metagenomics&amp;quot; and &amp;quot;the human microbiome&amp;quot; are simultaneously too broad and too limiting to define an individual researcher&#039;s specific area of specialty.&lt;br /&gt;
* After some consideration of how I&#039;ve spent my time, and which projects have most interested me, a refined statement of my research interests would seem to be, &amp;quot;using high-throughput biological techniques to study microbial communities, especially where there is a direct impact on human health.&amp;quot;&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7298</id>
		<title>User:Tgibbons:Project-Ideas</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7298"/>
		<updated>2010-08-13T22:30:34Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* Metagenomic assembly */ Added ideas for a preliminary experiment&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;My weekly progress report just didn&#039;t seem appropriate for my brainstorming after a bit, so I&#039;ve transferred everything here.&lt;br /&gt;
&lt;br /&gt;
== Potential Research Projects Inspired by &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ==&lt;br /&gt;
=== Searching for quorum sensing genes, both known and novel, and any pathways including them in human microbiome data ===&lt;br /&gt;
* Search for and consider making quorum sensing gene DB&lt;br /&gt;
** KEGG has pathways containing both acyl-homoserine lactone and its synthase&lt;br /&gt;
* After indexing known quorum sensing genes, search for homologues&lt;br /&gt;
** WGS data - Obviously search for homologues directly&lt;br /&gt;
** 16S data - Identify organisms and search for homologues in public DBs&lt;br /&gt;
=== Search for &amp;quot;core metabolome&amp;quot; in pioneer organisms from infant studies ===&lt;br /&gt;
* On the off chance I get to publish on this subject, it might be interesting to draw an analogy between genetic evolution involving chance mutations that are occasionally beneficial and are therefore propagated, and microbial colonization which involves chance introduction of microorganisms that are occasionally beneficial and therefore become stable members of the microbiome.&lt;br /&gt;
=== Attempt to search for cases of symbiosis where possible ===&lt;br /&gt;
* Search terms: commensalism, synergism (protocooperation), mutualism, competition, amensalism (antagonism), predation, parasitism, neutralism, syntrophism (nutritional synergism, cross-feeding, examples on pp.16-17)&lt;br /&gt;
&lt;br /&gt;
== Other Potential Research Projects ==&lt;br /&gt;
&lt;br /&gt;
=== Metagenomic assembly ===&lt;br /&gt;
* I&#039;ve been kicking this idea around the floor for months, but none of the people I perceive as being better suited to tackle the problem have appeared all that interested. This could be due either to them already having large projects to which they&#039;re committed, or it could be that they sense trouble. I&#039;ll ask Mihai once I&#039;ve spent a couple of days looking through literature.&lt;br /&gt;
* Essentially I think the ideal metagenomic assembler would allow for, and gracefully represent, diversity within a single organism, without collapsing the genetic material of an entire community into a single messy contig.&lt;br /&gt;
* The major theoretical challenge would be the development of an algorithm that could differentiate between variation and &amp;quot;speciation&amp;quot; in a biologically meaningful way. This is far from being a new problem.&lt;br /&gt;
** The fundamental limitation of differentiating between species based on sequence divergence is that in many cases, a relatively small number of mutations, a single gene insertion, or a single functional plasmid can impart dramatic new phenotypic properties to a micro-organism without significantly altering the overall sequence. This means that sequence divergence is not necessarily proportional to phenotypic divergence and thus cannot be used to differentiate between what I would consider to be biologically meaningful species (micro-organisms with differing phenotypes).&lt;br /&gt;
* As a simple starting point:&lt;br /&gt;
# I would begin by assembling all unitigs.&lt;br /&gt;
# From these seeds, I would extend out in both directions, allowing forks without automatically breaking the contigs.&lt;br /&gt;
# Forks that can be joined on either side by unitigs that are substantially longer than the forked regions, I would tentatively consider to be variation within a single species. In order to accomplish this, I would need to set thresholds for spawning and merging forks. I expect these to be of the three following types:&lt;br /&gt;
## &#039;&#039;&#039;SNPs and other very small indels and mutations&#039;&#039;&#039; could be handled by allowing a run of some small number of mismatches (e.g., 3 or so) within a unitig before terminating the unitig and considering a fork. This would probably require modifying or creating an assembler, as opposed to just using a scaffolder. The main concern here is differentiating between variation and sequencing error.&lt;br /&gt;
##* Unfortunately, I don&#039;t think simply using exceptionally stringent quality score thresholds is a good approach as long as the sequencing data is vastly smaller than the amount of sequence in the original sample. I therefore think the current approach of throwing out N&#039;s and then using standard trimming algorithms should still be used, and then sequencing error should further be inferred algorithmically.&lt;br /&gt;
##* A single instance of a variant with relatively low quality scores is (somewhat obviously) more likely to be sequencing error than actual variation within the population.&lt;br /&gt;
##* Other such small indels and mutations could tentatively be considered actual variation within the population.&lt;br /&gt;
##* I believe much of the statistics for handling such cases have already been worked out for eulerian path and de bruijn graph assemblers.&lt;br /&gt;
## &#039;&#039;&#039;Forked regions that are long enough to contain disparate unitigs, closed on both sides by other unitigs&#039;&#039;&#039;, could be assembled from the output of an existing assembler. Of course, most current assemblers tend to generate a very large number of very small contigs, leaving us with most of the same challenges we would face with a full-blown assembler (also Chris is itching to make an assembler anyway).&lt;br /&gt;
##* It is important to consider that such cases are very likely the result of repeats at the boundaries of unitigs that can be placed in such an arrangement.&lt;br /&gt;
##* It is possible for inserted genes to create such scenarios in a biologically valid/meaningful way, but even these will likely be surrounded by repetitive sequences.&lt;br /&gt;
##* I will need to review the various methods other assemblers use to handle repeats before deciding on even a simple starting scheme.&lt;br /&gt;
## &#039;&#039;&#039;Everything in between&#039;&#039;&#039; - Forked regions with lengths falling between the minimum length of a unitig and the maximum length set for small mutations.&lt;br /&gt;
##* As with the other two scenarios, this one could also arise by incorrect assembly. These may be harder to sort out however, because this category is (currently) less well defined.&lt;br /&gt;
##* One of the big issues will be to develop a heuristic to differentiate between complex variation, and reads that are from organisms so divergent that they should be classified as different OTUs.&lt;br /&gt;
##* It might be best to start by only considering the first two categories.&lt;br /&gt;
* In addition to the well-known theoretical challenges, coding this in such a way as to make it practical for current and future metagenomic problems would probably require more programming skills than I have the time to master. As such, I&#039;m hoping that maybe I can work out the theory on a smaller scale using Python, and then maybe convince someone else to collaborate with me for the implementation, or just publish my findings and leave it to the scientific community to incorporate my methods into their-favorite-assembler.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;s&amp;gt;The more I kick around this idea, the more I think this might work better as a scaffolder, of sorts. I should talk to Sergey about the possibility of just incorporating this into Bambus as an alternative output or something. Of course, Bambus fundamentally requires mate pairs, which would not necessarily be a constraint for this assembler/scaffolder. Also any smaller variance collapsed by an assembler into a consensus sequence would be lost.&amp;lt;/s&amp;gt; ...actually, nvm.&lt;br /&gt;
&lt;br /&gt;
==== Preliminary Experiment(s) ====&lt;br /&gt;
# Create synthetic heterogeneous metagenomic read set using very closely related strains of &#039;&#039;Mycobacteria&#039;&#039;.&lt;br /&gt;
#* Why &#039;&#039;Mycobacteria&#039;&#039;? There are many sequenced strains of &#039;&#039;M. tuberculosis&#039;&#039; &amp;amp; &#039;&#039;M. bovis&#039;&#039;, plus several more strains that are very closely related such as &#039;&#039;M. kansasii&#039;&#039;, &#039;&#039;M. gastri&#039;&#039;, and &#039;&#039;M. marinum&#039;&#039;. This should provide the ability to combine reads from many strains that are &amp;gt;95% identical, but have significantly different phenotypes. I&#039;m also hoping I might get lucky and stumble across something that would help me publish with Volker. If I run into a problem with &#039;&#039;Mycobacteria&#039;&#039;, &#039;&#039;Lactobacillus&#039;&#039; would probably also be a good choice because there are many available sequences and there&#039;s a (slim) chance to discover meaningful insights into the vaginal microbiome.&lt;br /&gt;
#* I should consider my options for generating reads. The options I&#039;m already familiar with are MetaSim and Arthur&#039;s naive in-house read generator. I think it would be worth my time to do a literature search for alternative options, though.&lt;br /&gt;
#* I would need to generate read sets with increasing variance and identify the types of changes that break contigs and lead to more fragmented (poorer) assemblies.&lt;br /&gt;
&lt;br /&gt;
=== (pre)Binning to improve metagenomic assembly ===&lt;br /&gt;
* Mihai&#039;s concerns that binning could (and probably would) break assemblies by separating overlapping reads into different bins are valid, but assume the simplest of binning schemes: Every read is placed in exactly one bin and the assembler is never allowed to combine reads from multiple bins.&lt;br /&gt;
** This scheme is obviously overly simplistic and offers little value to existing assembly techniques.&lt;br /&gt;
* Motivation&lt;br /&gt;
# Convert computationally challenging problem into an embarrassingly parallel problem&lt;br /&gt;
#* Current metagenomic sequencing projects are generating hundreds of millions of (30-500bp) reads, which traditional assemblers would attempt to load into RAM all at once. Even the average US university likely does not have the computational resources to successfully attempt such an assembly, and it is unlikely that the average group generating the sequences is prepared for this challenge.&lt;br /&gt;
#* Recent attempts to address this problem have focused on massively parallelized assemblers designed to run on large computer clusters (SOAPdenovo) or cloud clusters such as those offered by Amazon and Google (CloudBurst).&lt;br /&gt;
#* An alternative approach is to first attempt to bin reads we expect to be assembled together, and then use traditional assemblers to assemble the reads placed in each bin.&lt;br /&gt;
#** The hope here is that the amount of sequence in each bin would be more similar to traditional clonal sequencing projects and would therefore be more amenable to traditional assembly techniques.&lt;br /&gt;
#** Avoiding Mihai&#039;s concerns would require additional computation that would further inflate the overhead of this approach. These may include:&lt;br /&gt;
#*# Keeping a special bin for reads that could not be placed with high confidence, and allowing each assembly to pull reads from it.&lt;br /&gt;
#*** This would require some sort of record keeping to ensure promiscuous reads are not added to many assemblies.&lt;br /&gt;
#*# Iterating the assemblies so that contigs from different bins have an opportunity to be combined, including singlets that were not able to be assembled in the first round.&lt;br /&gt;
#*** This approach would greatly increase the overall runtime, but may essentially allow a relatively small group to assemble a large metagenomic sample with modest computational resources.&lt;br /&gt;
# Improve assemblies by using alternative algorithms to place promiscuous reads&lt;br /&gt;
#* For the most part, assemblers use rather naive criteria to place reads into contigs. This is in part because of the nature of the problem when these assemblers were first being developed. In clonal sequencing, one can safely assume that the vast majority of sequence variation actually comes from sequencing error, and not from variation within the population of that organism within a particular environment. To this end, traditional assemblers have sought to identify and eliminate minor sequence variation within the read set, assemble overlapping reads with relatively clean sequences, and simply break up assemblies in places where there was too much variation to justify any other action.&lt;br /&gt;
&lt;br /&gt;
hmm... Well that&#039;s not where I&#039;d intended to go with that. The more I think about this problem, the more I think my other metagenomic assembly idea is more promising in terms of improving assembly quality. I also have serious doubts about the need in the scientific community for the ability to assemble large metagenomic sequencing projects using modest local computational resources. It seems much more likely to me that these groups will increasingly request money for time on an Amazon cluster when writing their grants, and that an iterative local approach would quickly become obsolete, if it isn&#039;t already.&lt;br /&gt;
&lt;br /&gt;
=== Volker&#039;s Mycobacterial genomes ===&lt;br /&gt;
&lt;br /&gt;
== Thoughts on my Career Trajectory ==&lt;br /&gt;
* For the past 2+ years, when asked, I&#039;ve been saying that my research interests are &amp;quot;using metagenomics to study the human microbiome.&amp;quot; This is all well and good for a green grad student, but &amp;quot;metagenomics&amp;quot; and &amp;quot;the human microbiome&amp;quot; are simultaneously too broad and too limiting to define an individual researcher&#039;s specific area of specialty.&lt;br /&gt;
* After some consideration of how I&#039;ve spent my time, and which projects have most interested me, a refined statement of my research interests would seem to be, &amp;quot;using high-throughput biological techniques to study microbial communities, especially where there is a direct impact on human health.&amp;quot;&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7297</id>
		<title>User:Tgibbons:Project-Ideas</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7297"/>
		<updated>2010-08-13T22:14:20Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: Added section for thoughts on my career in general&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;My weekly progress report just didn&#039;t seem appropriate for my brainstorming after a bit, so I&#039;ve transferred everything here.&lt;br /&gt;
&lt;br /&gt;
== Potential Research Projects Inspired by &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ==&lt;br /&gt;
=== Searching for quorum sensing genes, both known and novel, and any pathways including them in human microbiome data ===&lt;br /&gt;
* Search for and consider making quorum sensing gene DB&lt;br /&gt;
** KEGG has pathways containing both acyl-homoserine lactone and its synthase&lt;br /&gt;
* After indexing known quorum sensing genes, search for homologues&lt;br /&gt;
** WGS data - Obviously search for homologues directly&lt;br /&gt;
** 16S data - Identify organisms and search for homologues in public DBs&lt;br /&gt;
=== Search for &amp;quot;core metabolome&amp;quot; in pioneer organisms from infant studies ===&lt;br /&gt;
* On the off chance I get to publish on this subject, it might be interesting to draw an analogy between genetic evolution involving chance mutations that are occasionally beneficial and are therefore propagated, and microbial colonization which involves chance introduction of microorganisms that are occasionally beneficial and therefore become stable members of the microbiome.&lt;br /&gt;
=== Attempt to search for cases of symbiosis where possible ===&lt;br /&gt;
* Search terms: commensalism, synergism (protocooperation), mutualism, competition, amensalism (antagonism), predation, parasitism, neutralism, syntrophism (nutritional synergism, cross-feeding, examples on pp.16-17)&lt;br /&gt;
&lt;br /&gt;
== Other Potential Research Projects ==&lt;br /&gt;
&lt;br /&gt;
=== Metagenomic assembly ===&lt;br /&gt;
* I&#039;ve been kicking this idea around the floor for months, but none of the people I perceive as being better suited to tackle the problem have appeared all that interested. This could be due either to them already having large projects to which they&#039;re committed, or it could be that they sense trouble. I&#039;ll ask Mihai once I&#039;ve spent a couple of days looking through literature.&lt;br /&gt;
* Essentially I think the ideal metagenomic assembler would allow for, and gracefully represent, diversity within a single organism, without collapsing the genetic material of an entire community into a single messy contig.&lt;br /&gt;
* The major theoretical challenge would be the development of an algorithm that could differentiate between variation and &amp;quot;speciation&amp;quot; in a biologically meaningful way. This is far from being a new problem.&lt;br /&gt;
** The fundamental limitation of differentiating between species based on sequence divergence is that in many cases, a relatively small number of mutations, a single gene insertion, or a single functional plasmid can impart dramatic new phenotypic properties to a micro-organism without significantly altering the overall sequence. This means that sequence divergence is not necessarily proportional to phenotypic divergence and thus cannot be used to differentiate between what I would consider to be biologically meaningful species (micro-organisms with differing phenotypes).&lt;br /&gt;
* As a simple starting point:&lt;br /&gt;
# I would begin by assembling all unitigs.&lt;br /&gt;
# From these seeds, I would extend out in both directions, allowing forks without automatically breaking the contigs.&lt;br /&gt;
# Forks that can be joined on either side by unitigs that are substantially longer than the forked regions, I would tentatively consider to be variation within a single species. In order to accomplish this, I would need to set thresholds for spawning and merging forks. I expect these to be of the three following types:&lt;br /&gt;
## &#039;&#039;&#039;SNPs and other very small indels and mutations&#039;&#039;&#039; could be handled by allowing a run of some small number of mismatches (e.g., 3 or so) within a unitig before terminating the unitig and considering a fork. This would probably require modifying or creating an assembler, as opposed to just using a scaffolder. The main concern here is differentiating between variation and sequencing error.&lt;br /&gt;
##* Unfortunately, I don&#039;t think simply using exceptionally stringent quality score thresholds is a good approach as long as the sequencing data is vastly smaller than the amount of sequence in the original sample. I therefore think the current approach of throwing out N&#039;s and then using standard trimming algorithms should still be used, and then sequencing error should further be inferred algorithmically.&lt;br /&gt;
##* A single instance of a variant with relatively low quality scores is (somewhat obviously) more likely to be sequencing error than actual variation within the population.&lt;br /&gt;
##* Other such small indels and mutations could tentatively be considered actual variation within the population.&lt;br /&gt;
##* I believe much of the statistics for handling such cases have already been worked out for eulerian path and de bruijn graph assemblers.&lt;br /&gt;
## &#039;&#039;&#039;Forked regions that are long enough to contain disparate unitigs, closed on both sides by other unitigs&#039;&#039;&#039;, could be assembled from the output of an existing assembler. Of course, most current assemblers tend to generate a very large number of very small contigs, leaving us with most of the same challenges we would face with a full-blown assembler (also Chris is itching to make an assembler anyway).&lt;br /&gt;
##* It is important to consider that such cases are very likely the result of repeats at the boundaries of unitigs that can be placed in such an arrangement.&lt;br /&gt;
##* It is possible for inserted genes to create such scenarios in a biologically valid/meaningful way, but even these will likely be surrounded by repetitive sequences.&lt;br /&gt;
##* I will need to review the various methods other assemblers use to handle repeats before deciding on even a simple starting scheme.&lt;br /&gt;
## &#039;&#039;&#039;Everything in between&#039;&#039;&#039; - Forked regions with lengths falling between the minimum length of a unitig and the maximum length set for small mutations.&lt;br /&gt;
##* As with the other two scenarios, this one could also arise from incorrect assembly. These cases may be harder to sort out, however, because this category is (currently) less well defined.&lt;br /&gt;
##* One of the big issues will be developing a heuristic to differentiate between complex variation and reads from organisms so divergent that they should be classified as different OTUs.&lt;br /&gt;
##* It might be best to start by only considering the first two categories.&lt;br /&gt;
* In addition to the well-known theoretical challenges, coding this in such a way as to make it practical for current and future metagenomic problems would probably require more programming skills than I have the time to master. As such, I&#039;m hoping that maybe I can work out the theory on a smaller scale using Python, and then maybe convince someone else to collaborate with me for the implementation, or just publish my findings and leave it to the scientific community to incorporate my methods into their-favorite-assembler.&lt;br /&gt;
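As a sanity check on the thresholds above, here is a toy Python sketch of the three fork categories. All names and cutoffs (SNP_MAX_LEN, MIN_UNITIG_LEN, the 2x flanking requirement) are made-up assumptions for illustration, not part of any existing assembler.

```python
# Hypothetical sketch of the fork-classification thresholds described above.
# The constants and the 2x flanking-support rule are illustrative guesses.

SNP_MAX_LEN = 3        # max length (bp) treated as a SNP or very small indel
MIN_UNITIG_LEN = 100   # minimum length for a forked region to count as unitigs

def classify_fork(fork_len, flank_lens):
    """Classify a forked region by its length and flanking unitig lengths.

    fork_len   -- length (bp) of the divergent region
    flank_lens -- lengths of the unitigs closing the fork on both sides
    """
    # A fork is only attributed to within-population variation when both
    # flanking unitigs are substantially longer than the forked region.
    if not all(f > 2 * fork_len for f in flank_lens):
        return "break"          # insufficient flanking support: break contig
    if fork_len >= MIN_UNITIG_LEN:
        return "long-fork"      # disparate unitigs closed by unitigs (cat. 2)
    if fork_len > SNP_MAX_LEN:
        return "intermediate"   # everything in between (category 3)
    return "small-variant"      # SNP or very small indel (category 1)

print(classify_fork(2, [500, 450]))    # prints "small-variant"
print(classify_fork(150, [800, 900]))  # prints "long-fork"
print(classify_fork(20, [600, 700]))   # prints "intermediate"
```

The point of the sketch is only that the three categories partition fork lengths once the flanking-support test has passed; real thresholds would have to be fit to data.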
&lt;br /&gt;
&amp;lt;s&amp;gt;The more I kick around this idea, the more I think this might work better as a scaffolder, of sorts. I should talk to Sergey about the possibility of just incorporating this into Bambus as an alternative output or something. Of course, Bambus fundamentally requires mate pairs, which would not necessarily be a constraint for this assembler/scaffolder. Also any smaller variance collapsed by an assembler into a consensus sequence would be lost.&amp;lt;/s&amp;gt; ...actually, nvm.&lt;br /&gt;
&lt;br /&gt;
=== (pre)Binning to improve metagenomic assembly ===&lt;br /&gt;
* Mihai&#039;s concerns that binning could (and probably would) break assemblies by separating overlapping reads into different bins are valid, but assume the simplest of binning schemes: every read is placed in exactly one bin, and the assembler is never allowed to combine reads from multiple bins.&lt;br /&gt;
** This scheme is obviously overly simplistic and offers little value to existing assembly techniques.&lt;br /&gt;
* Motivation&lt;br /&gt;
# Convert computationally challenging problem into an embarrassingly parallel problem&lt;br /&gt;
#* Current metagenomic sequencing projects are generating hundreds of millions of (30-500 bp) reads, which traditional assemblers would attempt to load into RAM all at once. Even the average US university likely does not have the computational resources to successfully attempt such an assembly, and it is unlikely that the average group generating the sequences is prepared for this challenge.&lt;br /&gt;
#* Recent attempts to address this problem have focused on massively parallelized assemblers designed to run on large computer clusters (SOAPdenovo) or cloud clusters such as those offered by Amazon and Google (CloudBurst).&lt;br /&gt;
#* An alternative approach is to first attempt to bin reads we expect to be assembled together, and then use traditional assemblers to assemble the reads placed in each bin.&lt;br /&gt;
#** The hope here is that the amount of sequence in each bin would be more similar to traditional clonal sequencing projects and would therefore be more amenable to traditional assembly techniques.&lt;br /&gt;
#** Avoiding Mihai&#039;s concerns would require additional computation that would further inflate the overhead of this approach. These may include:&lt;br /&gt;
#*# Keeping a special bin for reads that could not be placed with high confidence, and allowing each assembly to pull reads from it.&lt;br /&gt;
#*#* This would require some sort of record keeping to ensure promiscuous reads are not added to many assemblies.&lt;br /&gt;
#*# Iterating the assemblies so that contigs from different bins have an opportunity to be combined, including singlets that were not able to be assembled in the first round.&lt;br /&gt;
#*#* This approach would greatly increase the overall runtime, but may essentially allow a relatively small group to assemble a large metagenomic sample with modest computational resources.&lt;br /&gt;
# Improve assemblies by using alternative algorithms to place promiscuous reads&lt;br /&gt;
#* For the most part, assemblers use rather naive criteria to place reads into contigs. This is in part because of the nature of the problem when these assemblers were first being developed. In clonal sequencing, one can safely assume that the vast majority of sequence variation actually comes from sequencing error, and not from variation within the population of that organism within a particular environment. To this end, traditional assemblers have sought to identify and eliminate minor sequence variation within the read set, assemble overlapping reads with relatively clean sequences, and simply break up assemblies in places where there was too much variation to justify any other action.&lt;br /&gt;
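The shared-pool variant of the binning scheme above can be sketched in a few lines of Python. Everything here is a toy: the confidence cutoff and the assignment function are placeholders for whatever real binning method would be used, not existing tools.

```python
# Toy sketch of the binning scheme above: reads are binned by a
# (hypothetical) confidence score, and low-confidence reads go to a
# shared pool that every per-bin assembly may later draw from.

CONFIDENCE_CUTOFF = 0.9  # illustrative threshold, not a recommendation

def bin_reads(reads, assign):
    """Partition reads into bins plus a shared low-confidence pool.

    reads  -- iterable of read identifiers
    assign -- function mapping a read to (bin_label, confidence)
    """
    bins = {}
    shared_pool = []
    for read in reads:
        label, conf = assign(read)
        if conf >= CONFIDENCE_CUTOFF:
            bins.setdefault(label, []).append(read)
        else:
            shared_pool.append(read)  # available to every bin's assembly
    return bins, shared_pool

# Fake assignment for illustration only: the "read" encodes its own
# bin label and confidence, e.g. "A:0.99".
def fake_assign(read):
    label, conf = read.split(":")
    return label, float(conf)

reads = ["A:0.99", "A:0.95", "B:0.97", "B:0.5", "A:0.2"]
bins, pool = bin_reads(reads, fake_assign)
```

After this step, each bin would be handed to a traditional assembler, with the shared pool (and the record keeping noted above) handling the reads that could not be placed confidently.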
&lt;br /&gt;
hmm... Well that&#039;s not where I&#039;d intended to go with that. The more I think about this problem, the more I think my other metagenomic assembly idea is more promising in terms of improving assembly quality. I also have serious doubts about the need in the scientific community for the ability to assemble large metagenomic sequencing projects using modest local computational resources. It seems much more likely to me that these groups will increasingly request money for time on an Amazon cluster when writing their grants, and that an iterative local approach would quickly become obsolete, if it isn&#039;t already.&lt;br /&gt;
&lt;br /&gt;
=== Volker&#039;s Mycobacterial genomes ===&lt;br /&gt;
&lt;br /&gt;
== Thoughts on my Career Trajectory ==&lt;br /&gt;
* For the past 2+ years, when asked, I&#039;ve been saying that my research interests are &amp;quot;using metagenomics to study the human microbiome.&amp;quot; This is all well and good for a green grad student, but &amp;quot;metagenomics&amp;quot; and &amp;quot;the human microbiome&amp;quot; are simultaneously too broad and too limiting to define an individual researcher&#039;s specific area of specialty.&lt;br /&gt;
* After some consideration of how I&#039;ve spent my time, and which projects have most interested me, a refined statement of my research interests would seem to be, &amp;quot;using high-throughput biological techniques to study microbial communities, especially where there is a direct impact on human health.&amp;quot;&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7296</id>
		<title>User:Tgibbons:Project-Ideas</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7296"/>
		<updated>2010-08-11T03:20:45Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* Metagenomic assembly */ I cleaned up a lot of the notes I jotted down yesterday&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;After a while, my weekly progress report just didn&#039;t seem like an appropriate place for my brainstorming, so I&#039;ve transferred everything here.&lt;br /&gt;
&lt;br /&gt;
== Potential Research Projects Inspired by &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ==&lt;br /&gt;
=== Searching for quorum sensing genes, both known and novel, and any pathways including them in human microbiome data ===&lt;br /&gt;
* Search for and consider making quorum sensing gene DB&lt;br /&gt;
** KEGG has pathways containing both acyl-homoserine lactone and its synthase&lt;br /&gt;
* After indexing known quorum sensing genes, search for homologues&lt;br /&gt;
** WGS data - Obviously search for homologues directly&lt;br /&gt;
** 16S data - Identify organisms and search for homologues in public DBs&lt;br /&gt;
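The homologue search above would in practice use BLAST or a similar tool against the quorum-sensing gene DB; as a minimal illustration of the indexing step, here is a toy seed-k-mer scan in Python. The gene name, sequence, and seed length are all made up.

```python
# Toy stand-in for the homologue search described above: index k-mers
# from known quorum-sensing genes, then flag WGS reads sharing a seed
# k-mer. A real pipeline would use BLAST or similar; this only
# illustrates the seed-indexing idea.

K = 8  # seed length (illustrative)

def index_kmers(genes, k=K):
    """Map every k-mer in the reference gene set to the genes containing it."""
    index = {}
    for name, seq in genes.items():
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], set()).add(name)
    return index

def seed_hits(read, index, k=K):
    """Return the reference genes sharing at least one seed k-mer with read."""
    hits = set()
    for i in range(len(read) - k + 1):
        hits |= index.get(read[i:i + k], set())
    return hits

genes = {"luxI_like": "ATGGCTAGCTTGACCTGA"}  # made-up sequence
idx = index_kmers(genes)
print(seed_hits("CCTTGACCTGAAA", idx))  # shares the 'TTGACCTG' seed
```

For 16S data the analogous step would be taxonomic assignment followed by a lookup of known quorum-sensing genes for each identified organism in the public DBs.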
=== Search for &amp;quot;core metabolome&amp;quot; in pioneer organisms from infant studies ===&lt;br /&gt;
* On the off chance I get to publish on this subject, it might be interesting to draw an analogy between genetic evolution involving chance mutations that are occasionally beneficial and are therefore propagated, and microbial colonization which involves chance introduction of microorganisms that are occasionally beneficial and therefore become stable members of the microbiome.&lt;br /&gt;
=== Attempt to search for cases of symbiosis where possible ===&lt;br /&gt;
* Search terms: commensalism, synergism (protocooperation), mutualism, competition, amensalism (antagonism), predation, parasitism, neutralism, syntrophism (nutritional synergism, cross-feeding, examples on pp.16-17)&lt;br /&gt;
&lt;br /&gt;
== Other Potential Research Projects ==&lt;br /&gt;
&lt;br /&gt;
=== Metagenomic assembly ===&lt;br /&gt;
* I&#039;ve been kicking this idea around the floor for months, but none of the people I perceive as being better suited to tackle the problem have appeared all that interested. This could be due either to them already having large projects to which they&#039;re committed, or it could be that they sense trouble. I&#039;ll ask Mihai once I&#039;ve spent a couple of days looking through literature.&lt;br /&gt;
* Essentially I think the ideal metagenomic assembler would allow for, and gracefully represent, diversity within a single organism, without collapsing the genetic material of an entire community into a single messy contig.&lt;br /&gt;
* The major theoretical challenge would be the development of an algorithm that could differentiate between variation and &amp;quot;speciation&amp;quot; in a biologically meaningful way. This is far from being a new problem.&lt;br /&gt;
** The fundamental limitation of differentiating between species based on sequence divergence is that in many cases, a relatively small number of mutations, a single gene insertion, or a single functional plasmid can impart dramatic new phenotypic properties to a micro-organism without significantly altering the overall sequence. This means that sequence divergence is not necessarily proportional to phenotypic divergence and thus cannot be used to differentiate between what I would consider to be biologically meaningful species (micro-organisms with differing phenotypes).&lt;br /&gt;
* As a simple starting point:&lt;br /&gt;
# I would begin by assembling all unitigs.&lt;br /&gt;
# From these seeds, I would extend out in both directions, allowing forks without automatically breaking the contigs.&lt;br /&gt;
# I would tentatively consider forks that can be closed on both sides by unitigs substantially longer than the forked region to be variation within a single species. To accomplish this, I would need to set thresholds for spawning and merging forks. I expect these to be of the following three types:&lt;br /&gt;
## &#039;&#039;&#039;SNPs and other very small indels and mutations&#039;&#039;&#039; could be handled by allowing a run of mismatches up to some small length (e.g., 3 bp) within a unitig before terminating the unitig and considering a fork. This would probably require modifying or creating an assembler, as opposed to just using a scaffolder. The main concern here is differentiating between variation and sequencing error.&lt;br /&gt;
##* Unfortunately, I don&#039;t think simply using exceptionally stringent quality-score thresholds is a good approach as long as the amount of sequencing data is vastly smaller than the amount of sequence in the original sample. I therefore think the current approach of throwing out N&#039;s and then using standard trimming algorithms should still be used, with sequencing error then inferred algorithmically.&lt;br /&gt;
##* A single instance of a variant with relatively low quality scores is (somewhat obviously) more likely to be sequencing error than actual variation within the population.&lt;br /&gt;
##* Other such small indels and mutations could tentatively be considered actual variation within the population.&lt;br /&gt;
##* I believe much of the statistical machinery for handling such cases has already been worked out for Eulerian path and de Bruijn graph assemblers.&lt;br /&gt;
## &#039;&#039;&#039;Forked regions that are long enough to contain disparate unitigs, closed on both sides by other unitigs&#039;&#039;&#039;, could be assembled from the output of an existing assembler. Of course, most current assemblers tend to generate a very large number of very small contigs, leaving us with most of the same challenges we would face with a full-blown assembler (also Chris is itching to make an assembler anyway).&lt;br /&gt;
##* It is important to consider that such cases are very likely the result of repeats at the boundaries of unitigs that can be placed in such an arrangement.&lt;br /&gt;
##* It is possible for inserted genes to create such scenarios in a biologically valid/meaningful way, but even these will likely be surrounded by repetitive sequences.&lt;br /&gt;
##* I will need to review the various methods other assemblers use to handle repeats before deciding on even a simple starting scheme.&lt;br /&gt;
## &#039;&#039;&#039;Everything in between&#039;&#039;&#039; - Forked regions with lengths falling between the minimum length of a unitig and the maximum length set for small mutations.&lt;br /&gt;
##* As with the other two scenarios, this one could also arise from incorrect assembly. These cases may be harder to sort out, however, because this category is (currently) less well defined.&lt;br /&gt;
##* One of the big issues will be developing a heuristic to differentiate between complex variation and reads from organisms so divergent that they should be classified as different OTUs.&lt;br /&gt;
##* It might be best to start by only considering the first two categories.&lt;br /&gt;
* In addition to the well-known theoretical challenges, coding this in such a way as to make it practical for current and future metagenomic problems would probably require more programming skills than I have the time to master. As such, I&#039;m hoping that maybe I can work out the theory on a smaller scale using Python, and then maybe convince someone else to collaborate with me for the implementation, or just publish my findings and leave it to the scientific community to incorporate my methods into their-favorite-assembler.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;s&amp;gt;The more I kick around this idea, the more I think this might work better as a scaffolder, of sorts. I should talk to Sergey about the possibility of just incorporating this into Bambus as an alternative output or something. Of course, Bambus fundamentally requires mate pairs, which would not necessarily be a constraint for this assembler/scaffolder. Also any smaller variance collapsed by an assembler into a consensus sequence would be lost.&amp;lt;/s&amp;gt; ...actually, nvm.&lt;br /&gt;
&lt;br /&gt;
=== (pre)Binning to improve metagenomic assembly ===&lt;br /&gt;
* Mihai&#039;s concerns that binning could (and probably would) break assemblies by separating overlapping reads into different bins are valid, but assume the simplest of binning schemes: every read is placed in exactly one bin, and the assembler is never allowed to combine reads from multiple bins.&lt;br /&gt;
** This scheme is obviously overly simplistic and offers little value to existing assembly techniques.&lt;br /&gt;
* Motivation&lt;br /&gt;
# Convert computationally challenging problem into an embarrassingly parallel problem&lt;br /&gt;
#* Current metagenomic sequencing projects are generating hundreds of millions of (30-500 bp) reads, which traditional assemblers would attempt to load into RAM all at once. Even the average US university likely does not have the computational resources to successfully attempt such an assembly, and it is unlikely that the average group generating the sequences is prepared for this challenge.&lt;br /&gt;
#* Recent attempts to address this problem have focused on massively parallelized assemblers designed to run on large computer clusters (SOAPdenovo) or cloud clusters such as those offered by Amazon and Google (CloudBurst).&lt;br /&gt;
#* An alternative approach is to first attempt to bin reads we expect to be assembled together, and then use traditional assemblers to assemble the reads placed in each bin.&lt;br /&gt;
#** The hope here is that the amount of sequence in each bin would be more similar to traditional clonal sequencing projects and would therefore be more amenable to traditional assembly techniques.&lt;br /&gt;
#** Avoiding Mihai&#039;s concerns would require additional computation that would further inflate the overhead of this approach. These may include:&lt;br /&gt;
#*# Keeping a special bin for reads that could not be placed with high confidence, and allowing each assembly to pull reads from it.&lt;br /&gt;
#*#* This would require some sort of record keeping to ensure promiscuous reads are not added to many assemblies.&lt;br /&gt;
#*# Iterating the assemblies so that contigs from different bins have an opportunity to be combined, including singlets that were not able to be assembled in the first round.&lt;br /&gt;
#*#* This approach would greatly increase the overall runtime, but may essentially allow a relatively small group to assemble a large metagenomic sample with modest computational resources.&lt;br /&gt;
# Improve assemblies by using alternative algorithms to place promiscuous reads&lt;br /&gt;
#* For the most part, assemblers use rather naive criteria to place reads into contigs. This is in part because of the nature of the problem when these assemblers were first being developed. In clonal sequencing, one can safely assume that the vast majority of sequence variation actually comes from sequencing error, and not from variation within the population of that organism within a particular environment. To this end, traditional assemblers have sought to identify and eliminate minor sequence variation within the read set, assemble overlapping reads with relatively clean sequences, and simply break up assemblies in places where there was too much variation to justify any other action.&lt;br /&gt;
&lt;br /&gt;
hmm... Well that&#039;s not where I&#039;d intended to go with that. The more I think about this problem, the more I think my other metagenomic assembly idea is more promising in terms of improving assembly quality. I also have serious doubts about the need in the scientific community for the ability to assemble large metagenomic sequencing projects using modest local computational resources. It seems much more likely to me that these groups will increasingly request money for time on an Amazon cluster when writing their grants, and that an iterative local approach would quickly become obsolete, if it isn&#039;t already.&lt;br /&gt;
&lt;br /&gt;
=== Volker&#039;s Mycobacterial genomes ===&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7295</id>
		<title>User:Tgibbons:Project-Ideas</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7295"/>
		<updated>2010-08-11T02:31:32Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* Metagenomic assembly */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;After a while, my weekly progress report just didn&#039;t seem like an appropriate place for my brainstorming, so I&#039;ve transferred everything here.&lt;br /&gt;
&lt;br /&gt;
== Potential Research Projects Inspired by &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ==&lt;br /&gt;
=== Searching for quorum sensing genes, both known and novel, and any pathways including them in human microbiome data ===&lt;br /&gt;
* Search for and consider making quorum sensing gene DB&lt;br /&gt;
** KEGG has pathways containing both acyl-homoserine lactone and its synthase&lt;br /&gt;
* After indexing known quorum sensing genes, search for homologues&lt;br /&gt;
** WGS data - Obviously search for homologues directly&lt;br /&gt;
** 16S data - Identify organisms and search for homologues in public DBs&lt;br /&gt;
=== Search for &amp;quot;core metabolome&amp;quot; in pioneer organisms from infant studies ===&lt;br /&gt;
* On the off chance I get to publish on this subject, it might be interesting to draw an analogy between genetic evolution involving chance mutations that are occasionally beneficial and are therefore propagated, and microbial colonization which involves chance introduction of microorganisms that are occasionally beneficial and therefore become stable members of the microbiome.&lt;br /&gt;
=== Attempt to search for cases of symbiosis where possible ===&lt;br /&gt;
* Search terms: commensalism, synergism (protocooperation), mutualism, competition, amensalism (antagonism), predation, parasitism, neutralism, syntrophism (nutritional synergism, cross-feeding, examples on pp.16-17)&lt;br /&gt;
&lt;br /&gt;
== Other Potential Research Projects ==&lt;br /&gt;
&lt;br /&gt;
=== Metagenomic assembly ===&lt;br /&gt;
* I&#039;ve been kicking this idea around the floor for months, but none of the people I perceive as being better suited to tackle the problem have appeared all that interested. This could be due either to them already having large projects to which they&#039;re committed, or it could be that they sense trouble. I&#039;ll ask Mihai once I&#039;ve spent a couple of days looking through literature.&lt;br /&gt;
* Essentially I think the ideal metagenomic assembler would allow for, and gracefully represent, diversity within a single organism, without collapsing the genetic material of an entire community into a single messy contig.&lt;br /&gt;
* The major theoretical challenge would be the development of an algorithm that could differentiate between variation and &amp;quot;speciation&amp;quot; in a biologically meaningful way. This is far from being a new problem.&lt;br /&gt;
** The fundamental limitation of differentiating between species based on sequence divergence is that in many cases, a relatively small number of mutations, a single gene insertion, or a single functional plasmid can impart dramatic new phenotypic properties to a micro-organism without significantly altering the overall sequence. This means that sequence divergence is not necessarily proportional to phenotypic divergence and thus cannot be used to differentiate between what I would consider to be biologically meaningful species (micro-organisms with differing phenotypes).&lt;br /&gt;
* As a simple starting point:&lt;br /&gt;
# I would begin by assembling all unitigs.&lt;br /&gt;
# From these seeds, I would extend out in both directions, allowing forks without automatically breaking the contigs.&lt;br /&gt;
# I would tentatively consider forks that can be closed on both sides by unitigs substantially longer than the forked region to be variation within a single species. To accomplish this, I would need to set thresholds for spawning and merging forks. I expect these to be of the following three types:&lt;br /&gt;
## &#039;&#039;&#039;SNPs and other very small indels and mutations&#039;&#039;&#039; could be handled by allowing a run of mismatches up to some small length (e.g., 3 bp) within a unitig before terminating the unitig and considering a fork. This would probably require modifying or creating an assembler, as opposed to just using a scaffolder. The main concern here is differentiating between variation and sequencing error.&lt;br /&gt;
##* Unfortunately, I don&#039;t think simply using exceptionally stringent quality-score thresholds is a good approach as long as the amount of sequencing data is vastly smaller than the amount of sequence in the original sample. I therefore think the current approach of throwing out N&#039;s and then using standard trimming algorithms should still be used, with sequencing error then inferred algorithmically.&lt;br /&gt;
##* A single instance of a variant with relatively low quality scores is (somewhat obviously) more likely to be sequencing error than actual variation within the population.&lt;br /&gt;
##* Other such small indels and mutations could tentatively be considered actual variation within the population.&lt;br /&gt;
##* I believe much of the statistical machinery for handling such cases has already been worked out for Eulerian path and de Bruijn graph assemblers.&lt;br /&gt;
## &#039;&#039;&#039;Forked regions that are long enough to contain disparate unitigs, closed on both sides by other unitigs&#039;&#039;&#039;, could be assembled from the output of an existing assembler. Of course, most current assemblers tend to generate a very large number of very small contigs, leaving us with most of the same challenges we would face with a full-blown assembler (also Chris is itching to make an assembler anyway).&lt;br /&gt;
##* It is important to consider that such cases are very likely the result of repeats at the boundaries of unitigs that can be placed in such an arrangement.&lt;br /&gt;
##* It is possible for inserted genes to create such scenarios in a biologically valid/meaningful way, but even these will likely be surrounded by repetitive sequences.&lt;br /&gt;
##* I will need to review the various methods other assemblers use to handle repeats before deciding on even a simple starting scheme.&lt;br /&gt;
## &#039;&#039;&#039;Everything in between&#039;&#039;&#039; - Forked regions with lengths falling between the minimum length of a unitig and the maximum length set for small mutations.&lt;br /&gt;
##* T&lt;br /&gt;
* In addition to the well-known theoretical challenges, coding this in such a way as to make it practical for current and future metagenomic problems would probably require more programming skills than I have the time to master. As such, I&#039;m hoping that maybe I can work out the theory on a smaller scale using Python, and then maybe convince someone else to collaborate with me for the implementation, or just publish my findings and leave it to the scientific community to incorporate my methods into their-favorite-assembler.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;s&amp;gt;The more I kick around this idea, the more I think this might work better as a scaffolder, of sorts. I should talk to Sergey about the possibility of just incorporating this into Bambus as an alternative output or something. Of course, Bambus fundamentally requires mate pairs, which would not necessarily be a constraint for this assembler/scaffolder. Also any smaller variance collapsed by an assembler into a consensus sequence would be lost.&amp;lt;/s&amp;gt; ...actually, nvm.&lt;br /&gt;
&lt;br /&gt;
=== (pre)Binning to improve metagenomic assembly ===&lt;br /&gt;
* Mihai&#039;s concerns that binning could (and probably would) break assemblies by separating overlapping reads into different bins are valid, but assume the simplest of binning schemes: every read is placed in exactly one bin, and the assembler is never allowed to combine reads from multiple bins.&lt;br /&gt;
** This scheme is obviously overly simplistic and offers little value to existing assembly techniques.&lt;br /&gt;
* Motivation&lt;br /&gt;
# Convert computationally challenging problem into an embarrassingly parallel problem&lt;br /&gt;
#* Current metagenomic sequencing projects are generating hundreds of millions of (30-500 bp) reads, which traditional assemblers would attempt to load into RAM all at once. Even the average US university likely does not have the computational resources to successfully attempt such an assembly, and it is unlikely that the average group generating the sequences is prepared for this challenge.&lt;br /&gt;
#* Recent attempts to address this problem have focused on massively parallelized assemblers designed to run on large computer clusters (SOAPdenovo) or cloud clusters such as those offered by Amazon and Google (CloudBurst).&lt;br /&gt;
#* An alternative approach is to first attempt to bin reads we expect to be assembled together, and then use traditional assemblers to assemble the reads placed in each bin.&lt;br /&gt;
#** The hope here is that the amount of sequence in each bin would be more similar to traditional clonal sequencing projects and would therefore be more amenable to traditional assembly techniques.&lt;br /&gt;
#** Avoiding Mihai&#039;s concerns would require additional computation that would further inflate the overhead of this approach. These may include:&lt;br /&gt;
#*# Keeping a special bin for reads that could not be placed with high confidence, and allowing each assembly to pull reads from it.&lt;br /&gt;
#*#* This would require some sort of record keeping to ensure promiscuous reads are not added to many assemblies.&lt;br /&gt;
#*# Iterating the assemblies so that contigs from different bins have an opportunity to be combined, including singlets that were not able to be assembled in the first round.&lt;br /&gt;
#*#* This approach would greatly increase the overall runtime, but may essentially allow a relatively small group to assemble a large metagenomic sample with modest computational resources.&lt;br /&gt;
# Improve assemblies by using alternative algorithms to place promiscuous reads&lt;br /&gt;
#* For the most part, assemblers use rather naive criteria to place reads into contigs. This is in part because of the nature of the problem when these assemblers were first being developed. In clonal sequencing, one can safely assume that the vast majority of sequence variation actually comes from sequencing error, and not from variation within the population of that organism within a particular environment. To this end, traditional assemblers have sought to identify and eliminate minor sequence variation within the read set, assemble overlapping reads with relatively clean sequences, and simply break up assemblies in places where there was too much variation to justify any other action.&lt;br /&gt;
&lt;br /&gt;
hmm... Well that&#039;s not where I&#039;d intended to go with that. The more I think about this problem, the more I think my other metagenomic assembly idea is more promising in terms of improving assembly quality. I also have serious doubts about the need in the scientific community for the ability to assemble large metagenomic sequencing projects using modest local computational resources. It seems much more likely to me that these groups will increasingly request money for time on an Amazon cluster when writing their grants, and that an iterative local approach would quickly become obsolete, if it isn&#039;t already.&lt;br /&gt;
&lt;br /&gt;
=== Volker&#039;s Mycobacterial genomes ===&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7294</id>
		<title>User:Tgibbons:Project-Ideas</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7294"/>
		<updated>2010-08-11T01:56:02Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* Metagenomic assembly */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;After a while, my weekly progress report just didn&#039;t seem like an appropriate place for my brainstorming, so I&#039;ve transferred everything here.&lt;br /&gt;
&lt;br /&gt;
== Potential Research Projects Inspired by &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ==&lt;br /&gt;
=== Searching for quorum sensing genes, both known and novel, and any pathways including them in human microbiome data ===&lt;br /&gt;
* Search for and consider making quorum sensing gene DB&lt;br /&gt;
** KEGG has pathways containing both acyl-homoserine lactone and its synthase&lt;br /&gt;
* After indexing known quorum sensing genes, search for homologues&lt;br /&gt;
** WGS data - Obviously search for homologues directly&lt;br /&gt;
** 16S data - Identify organisms and search for homologues in public DBs&lt;br /&gt;
=== Search for &amp;quot;core metabolome&amp;quot; in pioneer organisms from infant studies ===&lt;br /&gt;
* On the off chance I get to publish on this subject, it might be interesting to draw an analogy between genetic evolution, in which chance mutations are occasionally beneficial and therefore propagated, and microbial colonization, in which chance introductions of microorganisms are occasionally beneficial and therefore become stable members of the microbiome.&lt;br /&gt;
=== Attempt to search for cases of symbiosis where possible ===&lt;br /&gt;
* Search terms: commensalism, synergism (protocooperation), mutualism, competition, amensalism (antagonism), predation, parasitism, neutralism, syntrophism (nutritional synergism, cross-feeding, examples on pp.16-17)&lt;br /&gt;
&lt;br /&gt;
== Other Potential Research Projects ==&lt;br /&gt;
&lt;br /&gt;
=== Metagenomic assembly ===&lt;br /&gt;
* I&#039;ve been kicking this idea around the floor for months, but none of the people I perceive as being better suited to tackle the problem have appeared all that interested. This could be due either to them already having large projects to which they&#039;re committed, or it could be that they sense trouble. I&#039;ll ask Mihai once I&#039;ve spent a couple of days looking through literature.&lt;br /&gt;
* Essentially I think the ideal metagenomic assembler would allow for, and gracefully represent, diversity within a single organism, without collapsing the genetic material of an entire community into a single messy contig.&lt;br /&gt;
* The major theoretical challenge would be the development of an algorithm that could differentiate between variation and &amp;quot;speciation&amp;quot; in a biologically meaningful way. This is far from being a new problem.&lt;br /&gt;
** The fundamental limitation of differentiating between species based on sequence divergence is that in many cases, a relatively small number of mutations, a single gene insertion, or a single functional plasmid can impart dramatic new phenotypic properties to a micro-organism without significantly altering the overall sequence. This means that sequence divergence is not necessarily proportional to phenotypic divergence and thus cannot be used to differentiate between what I would consider to be biologically meaningful species (micro-organisms with differing phenotypes).&lt;br /&gt;
* As a simple starting point:&lt;br /&gt;
# I would begin by assembling all unitigs.&lt;br /&gt;
# From these seeds, I would extend out in both directions, allowing forks without automatically breaking the contigs.&lt;br /&gt;
# Forks that can be joined on either side by unitigs that are substantially longer than the forked regions, I would tentatively consider to be variation within a single species. In order to accomplish this, I would need to set thresholds for spawning and merging forks. I expect these to be of the three following types:&lt;br /&gt;
## &#039;&#039;&#039;SNPs and other very small indels and mutations&#039;&#039;&#039; could be handled by allowing a string of some small number of mismatches (e.g., 3 or so) within a unitig before terminating the unitig and considering a fork. This would probably require modifying or creating an assembler, as opposed to just using a scaffolder. The main concern here is differentiating between variation and sequencing error.&lt;br /&gt;
##* Unfortunately, I don&#039;t think simply using exceptionally stringent quality score thresholds is a good approach as long as the sequencing data is vastly smaller than the amount of sequence in the original sample. I therefore think the current approach of throwing out N&#039;s and then using standard trimming algorithms should still be used, and then sequencing error should further be inferred algorithmically.&lt;br /&gt;
##* A single instance of a variant with relatively low quality scores is (somewhat obviously) more likely to be sequencing error than actual variation within the population.&lt;br /&gt;
##* Other such small indels and mutations could tentatively be considered actual variation within the population.&lt;br /&gt;
##* I believe much of the statistical machinery for handling such cases has already been worked out for Eulerian path and de Bruijn graph assemblers.&lt;br /&gt;
## &#039;&#039;&#039;Forked regions that are long enough to contain disparate unitigs, closed on both sides by other unitigs&#039;&#039;&#039;, could be assembled from the output of an existing assembler. Of course, most current assemblers tend to generate a very large number of very small contigs, leaving us with most of the same challenges we would face with a full-blown assembler (also Chris is itching to make an assembler anyway).&lt;br /&gt;
##* It is important to consider that such cases are very likely the result of repeats at the boundaries of unitigs that can be placed in such an arrangement.&lt;br /&gt;
##* It is possible for inserted genes to create such scenarios in a biologically valid/meaningful way, but even these will likely be surrounded by repetitive sequences.&lt;br /&gt;
##* I will need to review the various methods other assemblers use to handle repeats before deciding on even a simple starting scheme.&lt;br /&gt;
## The other &amp;quot;catch all&amp;quot; scenario will be when any boundary length falls between the minimum length of a unitig and the maximum length set for small mutations. [needs more detail, but it&#039;s bed time]&lt;br /&gt;
* In addition to the well-known theoretical challenges, coding this in such a way as to make it practical for current and future metagenomic problems would probably require more programming skills than I have the time to master. As such, I&#039;m hoping that maybe I can work out the theory on a smaller scale using Python, and then maybe convince someone else to collaborate with me for the implementation, or just publish my findings and leave it to the scientific community to incorporate my methods into their-favorite-assembler.&lt;br /&gt;
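The three fork-threshold types above can be sketched on a small scale in Python, as suggested. This is a toy illustration only: the function name, the mismatch-run limit, and the minimum unitig length are all hypothetical placeholders, not parameters of any existing assembler.

```python
# Toy sketch of the three fork categories described above.
# MAX_MISMATCH_RUN and MIN_UNITIG_LEN are assumed example values.

MAX_MISMATCH_RUN = 3   # the "3 or so" small-mutation threshold above
MIN_UNITIG_LEN = 40    # assumed minimum unitig length

def classify_fork(fork_len, flank_lens):
    """Tentatively label a forked region, given its length and the
    lengths of the unitigs closing it on either side."""
    if fork_len <= MAX_MISMATCH_RUN:
        # SNPs and very small indels: candidate within-species variation
        return "small-variant"
    if fork_len >= MIN_UNITIG_LEN and all(f > fork_len for f in flank_lens):
        # long fork closed on both sides by substantially longer unitigs
        return "disparate-unitigs"
    # boundary lengths between the two thresholds: the catch-all case
    return "catch-all"
```

A real implementation would of course weigh quality scores and repeat structure before merging a fork, as the bullets above note; this only encodes the length thresholds.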
&lt;br /&gt;
&amp;lt;s&amp;gt;The more I kick around this idea, the more I think this might work better as a scaffolder, of sorts. I should talk to Sergey about the possibility of just incorporating this into Bambus as an alternative output or something. Of course, Bambus fundamentally requires mate pairs, which would not necessarily be a constraint for this assembler/scaffolder. Also any smaller variance collapsed by an assembler into a consensus sequence would be lost.&amp;lt;/s&amp;gt; ...actually, nvm.&lt;br /&gt;
&lt;br /&gt;
=== (pre)Binning to improve metagenomic assembly ===&lt;br /&gt;
* Mihai&#039;s concerns that binning could (and probably would) break assemblies by separating overlapping reads into different bins are valid, but assume the simplest of binning schemes: Every read is placed in exactly one bin and the assembler is never allowed to combine reads from multiple bins.&lt;br /&gt;
** This scheme is obviously overly simplistic and offers little value to existing assembly techniques.&lt;br /&gt;
* Motivation&lt;br /&gt;
# Convert computationally challenging problem into an embarrassingly parallel problem&lt;br /&gt;
#* Current metagenomic sequencing projects are generating hundreds of millions of (30-500bp) reads, which traditional assemblers would attempt to load into RAM all at once. Even the average US university likely does not have the computational resources to successfully attempt such an assembly, and it is unlikely that the average group generating the sequences is prepared for this challenge.&lt;br /&gt;
#* Recent attempts to address this problem have focused on massively parallelized assemblers designed to run on large computer clusters (SOAPdenovo) or cloud clusters such as those offered by Amazon and Google (CloudBurst).&lt;br /&gt;
#* An alternative approach is to first attempt to bin reads we expect to be assembled together, and then use traditional assemblers to assemble the reads placed in each bin.&lt;br /&gt;
#** The hope here is that the amount of sequence in each bin would be more similar to traditional clonal sequencing projects and would therefore be more amenable to traditional assembly techniques.&lt;br /&gt;
#** Avoiding Mihai&#039;s concerns would require additional computation that would further inflate the overhead of this approach. These may include:&lt;br /&gt;
#*# Keeping a special bin for reads that could not be placed with high confidence, and allowing each assembly to pull reads from it.&lt;br /&gt;
#*** This would require some sort of record keeping to ensure promiscuous reads are not added to many assemblies.&lt;br /&gt;
#*# Iterating the assemblies so that contigs from different bins have an opportunity to be combined, including singlets that were not able to be assembled in the first round.&lt;br /&gt;
#*** This approach would greatly increase the overall runtime, but may essentially allow a relatively small group to assemble a large metagenomic sample with modest computational resources.&lt;br /&gt;
# Improve assemblies by using alternative algorithms to place promiscuous reads&lt;br /&gt;
#* For the most part, assemblers use rather naive criteria to place reads into contigs. This is in part because of the nature of the problem when these assemblers were first being developed. In clonal sequencing, one can safely assume that the vast majority of sequence variation actually comes from sequencing error, and not from variation within the population of that organism within a particular environment. To this end, traditional assemblers have sought to identify and eliminate minor sequence variation within the read set, assemble overlapping reads with relatively clean sequences, and simply break up assemblies in places where there was too much variation to justify any other action.&lt;br /&gt;
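The binning scheme above (one bin per confident read, plus a shared bin of low-confidence reads with bookkeeping for promiscuous reads) can be sketched as follows. The `bin_read` classifier and its confidence cutoff are hypothetical stand-ins; no particular binning method is assumed.

```python
# Minimal sketch of the pre-binning idea above: confidently placed reads
# get one bin each, low-confidence reads go to a shared pool that any
# bin's assembly may draw from. All names here are illustrative.

from collections import defaultdict

CONF_CUTOFF = 0.9  # assumed placement-confidence threshold

def partition_reads(reads, bin_read):
    """bin_read(read) -> (bin_id, confidence).
    Returns per-bin read lists and the shared low-confidence pool."""
    bins = defaultdict(list)
    shared = []
    for read in reads:
        bin_id, conf = bin_read(read)
        if conf >= CONF_CUTOFF:
            bins[bin_id].append(read)
        else:
            shared.append(read)
    return bins, shared

def promiscuous_reads(assemblies_using):
    """Record keeping: flag shared-pool reads pulled into more than
    one assembly, so they are not silently added everywhere."""
    return {r: n for r, n in assemblies_using.items() if n > 1}
```

Each bin could then be handed to a traditional assembler independently (the embarrassingly parallel step), with the shared pool and the promiscuity record consulted between iterations.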
&lt;br /&gt;
hmm... Well that&#039;s not where I&#039;d intended to go with that. The more I think about this problem, the more I think my other metagenomic assembly idea is more promising in terms of improving assembly quality. I also have serious doubts about the need in the scientific community for the ability to assemble large metagenomic sequencing projects using modest local computational resources. It seems much more likely to me that these groups will increasingly request money for time on an Amazon cluster when writing their grants, and that an iterative local approach would quickly become obsolete, if it isn&#039;t already.&lt;br /&gt;
&lt;br /&gt;
=== Volker&#039;s Mycobacterial genomes ===&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7293</id>
		<title>User:Tgibbons:Project-Ideas</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7293"/>
		<updated>2010-08-11T01:39:47Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* Metagenomic assembly */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;After a while, my weekly progress report just didn&#039;t seem like the right place for brainstorming, so I&#039;ve transferred everything here.&lt;br /&gt;
&lt;br /&gt;
== Potential Research Projects Inspired by &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ==&lt;br /&gt;
=== Searching for quorum sensing genes, both known and novel, and any pathways including them in human microbiome data ===&lt;br /&gt;
* Search for and consider making quorum sensing gene DB&lt;br /&gt;
** KEGG has pathways containing both acyl-homoserine lactone and its synthase&lt;br /&gt;
* After indexing known quorum sensing genes, search for homologues&lt;br /&gt;
** WGS data - Obviously search for homologues directly&lt;br /&gt;
** 16S data - Identify organisms and search for homologues in public DBs&lt;br /&gt;
=== Search for &amp;quot;core metabolome&amp;quot; in pioneer organisms from infant studies ===&lt;br /&gt;
* On the off chance I get to publish on this subject, it might be interesting to draw an analogy between genetic evolution, in which chance mutations are occasionally beneficial and therefore propagated, and microbial colonization, in which chance introductions of microorganisms are occasionally beneficial and therefore become stable members of the microbiome.&lt;br /&gt;
=== Attempt to search for cases of symbiosis where possible ===&lt;br /&gt;
* Search terms: commensalism, synergism (protocooperation), mutualism, competition, amensalism (antagonism), predation, parasitism, neutralism, syntrophism (nutritional synergism, cross-feeding, examples on pp.16-17)&lt;br /&gt;
&lt;br /&gt;
== Other Potential Research Projects ==&lt;br /&gt;
&lt;br /&gt;
=== Metagenomic assembly ===&lt;br /&gt;
* I&#039;ve been kicking this idea around the floor for months, but none of the people I perceive as being better suited to tackle the problem have appeared all that interested. This could be due either to them already having large projects to which they&#039;re committed, or it could be that they sense trouble. I&#039;ll ask Mihai once I&#039;ve spent a couple of days looking through literature.&lt;br /&gt;
* Essentially I think the ideal metagenomic assembler would allow for, and gracefully represent, diversity within a single organism, without collapsing the genetic material of an entire community into a single messy contig.&lt;br /&gt;
* The major theoretical challenge would be the development of an algorithm that could differentiate between variation and &amp;quot;speciation&amp;quot; in a biologically meaningful way. This is far from being a new problem.&lt;br /&gt;
** The fundamental limitation of differentiating between species based on sequence divergence is that in many cases, a relatively small number of mutations, a single gene insertion, or a single functional plasmid can impart dramatic new phenotypic properties to a micro-organism without significantly altering the overall sequence. This means that sequence divergence is not necessarily proportional to phenotypic divergence and thus cannot be used to differentiate between what I would consider to be biologically meaningful species (micro-organisms with differing phenotypes).&lt;br /&gt;
* As a simple starting point:&lt;br /&gt;
# I would begin by assembling all unitigs.&lt;br /&gt;
# From these seeds, I would extend out in both directions, allowing forks without breaking the contigs.&lt;br /&gt;
# Forks that can be joined on either side by unitigs that are substantially longer than the forked regions, I would tentatively consider to be variation within a single species. In order to accomplish this, I would need to set thresholds for spawning and merging forks. I expect these to be of the three following types:&lt;br /&gt;
## &#039;&#039;&#039;SNPs and other very small indels and mutations&#039;&#039;&#039; could be handled by allowing a string of some small number of mismatches (e.g., 3 or so) within a unitig before terminating the unitig and considering a fork. This would probably require modifying or creating an assembler, as opposed to just using a scaffolder. The main concern here is differentiating between variation and sequencing error.&lt;br /&gt;
##* Unfortunately, I don&#039;t think simply using exceptionally stringent quality score thresholds is a good approach as long as the sequencing data is vastly smaller than the amount of sequence in the original sample. I therefore think the current approach of throwing out N&#039;s and then using standard trimming algorithms should still be used, and then sequencing error should further be inferred algorithmically.&lt;br /&gt;
##* A single instance of a variant with relatively low quality scores is (somewhat obviously) more likely to be sequencing error than actual variation within the population.&lt;br /&gt;
## &#039;&#039;&#039;Forked regions that are long enough to contain disparate unitigs, closed on both sides by other unitigs&#039;&#039;&#039;, could be assembled from the output of an existing assembler. Of course, most current assemblers tend to generate a very large number of very small contigs, leaving us with most of the same challenges we would face with a full-blown assembler (also Chris is itching to make an assembler anyway). It is important to consider that such cases are very likely the result of repeats at the boundaries of unitigs that can be placed in such an arrangement. It is possible, though, for inserted genes to create such scenarios in a biologically valid/meaningful way.&lt;br /&gt;
## The other &amp;quot;catch all&amp;quot; scenario will be when any boundary length falls between the minimum length of a unitig and the maximum length set for small mutations. [needs more detail, but it&#039;s bed time]&lt;br /&gt;
* In addition to the well-known theoretical challenges, coding this in such a way as to make it practical for current and future metagenomic problems would probably require more programming skills than I have the time to master. As such, I&#039;m hoping that maybe I can work out the theory on a smaller scale using Python, and then maybe convince someone else to collaborate with me for the implementation, or just publish my findings and leave it to the scientific community to incorporate my methods into their-favorite-assembler.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;s&amp;gt;The more I kick around this idea, the more I think this might work better as a scaffolder, of sorts. I should talk to Sergey about the possibility of just incorporating this into Bambus as an alternative output or something. Of course, Bambus fundamentally requires mate pairs, which would not necessarily be a constraint for this assembler/scaffolder. Also any smaller variance collapsed by an assembler into a consensus sequence would be lost.&amp;lt;/s&amp;gt; ...actually, nvm.&lt;br /&gt;
&lt;br /&gt;
=== (pre)Binning to improve metagenomic assembly ===&lt;br /&gt;
* Mihai&#039;s concerns that binning could (and probably would) break assemblies by separating overlapping reads into different bins are valid, but assume the simplest of binning schemes: Every read is placed in exactly one bin and the assembler is never allowed to combine reads from multiple bins.&lt;br /&gt;
** This scheme is obviously overly simplistic and offers little value to existing assembly techniques.&lt;br /&gt;
* Motivation&lt;br /&gt;
# Convert computationally challenging problem into an embarrassingly parallel problem&lt;br /&gt;
#* Current metagenomic sequencing projects are generating hundreds of millions of (30-500bp) reads, which traditional assemblers would attempt to load into RAM all at once. Even the average US university likely does not have the computational resources to successfully attempt such an assembly, and it is unlikely that the average group generating the sequences is prepared for this challenge.&lt;br /&gt;
#* Recent attempts to address this problem have focused on massively parallelized assemblers designed to run on large computer clusters (SOAPdenovo) or cloud clusters such as those offered by Amazon and Google (CloudBurst).&lt;br /&gt;
#* An alternative approach is to first attempt to bin reads we expect to be assembled together, and then use traditional assemblers to assemble the reads placed in each bin.&lt;br /&gt;
#** The hope here is that the amount of sequence in each bin would be more similar to traditional clonal sequencing projects and would therefore be more amenable to traditional assembly techniques.&lt;br /&gt;
#** Avoiding Mihai&#039;s concerns would require additional computation that would further inflate the overhead of this approach. These may include:&lt;br /&gt;
#*# Keeping a special bin for reads that could not be placed with high confidence, and allowing each assembly to pull reads from it.&lt;br /&gt;
#*** This would require some sort of record keeping to ensure promiscuous reads are not added to many assemblies.&lt;br /&gt;
#*# Iterating the assemblies so that contigs from different bins have an opportunity to be combined, including singlets that were not able to be assembled in the first round.&lt;br /&gt;
#*** This approach would greatly increase the overall runtime, but may essentially allow a relatively small group to assemble a large metagenomic sample with modest computational resources.&lt;br /&gt;
# Improve assemblies by using alternative algorithms to place promiscuous reads&lt;br /&gt;
#* For the most part, assemblers use rather naive criteria to place reads into contigs. This is in part because of the nature of the problem when these assemblers were first being developed. In clonal sequencing, one can safely assume that the vast majority of sequence variation actually comes from sequencing error, and not from variation within the population of that organism within a particular environment. To this end, traditional assemblers have sought to identify and eliminate minor sequence variation within the read set, assemble overlapping reads with relatively clean sequences, and simply break up assemblies in places where there was too much variation to justify any other action.&lt;br /&gt;
&lt;br /&gt;
hmm... Well that&#039;s not where I&#039;d intended to go with that. The more I think about this problem, the more I think my other metagenomic assembly idea is more promising in terms of improving assembly quality. I also have serious doubts about the need in the scientific community for the ability to assemble large metagenomic sequencing projects using modest local computational resources. It seems much more likely to me that these groups will increasingly request money for time on an Amazon cluster when writing their grants, and that an iterative local approach would quickly become obsolete, if it isn&#039;t already.&lt;br /&gt;
&lt;br /&gt;
=== Volker&#039;s Mycobacterial genomes ===&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7292</id>
		<title>User:Tgibbons:Project-Ideas</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7292"/>
		<updated>2010-08-10T03:35:53Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* Metagenomic assembly */ Added a bunch of details. It&amp;#039;s not complete, but it&amp;#039;s a serious start at considering this project.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;After a while, my weekly progress report just didn&#039;t seem like the right place for brainstorming, so I&#039;ve transferred everything here.&lt;br /&gt;
&lt;br /&gt;
== Potential Research Projects Inspired by &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ==&lt;br /&gt;
=== Searching for quorum sensing genes, both known and novel, and any pathways including them in human microbiome data ===&lt;br /&gt;
* Search for and consider making quorum sensing gene DB&lt;br /&gt;
** KEGG has pathways containing both acyl-homoserine lactone and its synthase&lt;br /&gt;
* After indexing known quorum sensing genes, search for homologues&lt;br /&gt;
** WGS data - Obviously search for homologues directly&lt;br /&gt;
** 16S data - Identify organisms and search for homologues in public DBs&lt;br /&gt;
=== Search for &amp;quot;core metabolome&amp;quot; in pioneer organisms from infant studies ===&lt;br /&gt;
* On the off chance I get to publish on this subject, it might be interesting to draw an analogy between genetic evolution, in which chance mutations are occasionally beneficial and therefore propagated, and microbial colonization, in which chance introductions of microorganisms are occasionally beneficial and therefore become stable members of the microbiome.&lt;br /&gt;
=== Attempt to search for cases of symbiosis where possible ===&lt;br /&gt;
* Search terms: commensalism, synergism (protocooperation), mutualism, competition, amensalism (antagonism), predation, parasitism, neutralism, syntrophism (nutritional synergism, cross-feeding, examples on pp.16-17)&lt;br /&gt;
&lt;br /&gt;
== Other Potential Research Projects ==&lt;br /&gt;
&lt;br /&gt;
=== Metagenomic assembly ===&lt;br /&gt;
* I&#039;ve been kicking this idea around the floor for months, but none of the people I perceive as being better suited to tackle the problem have appeared all that interested. This could be due either to them already having large projects to which they&#039;re committed, or it could be that they sense trouble. I&#039;ll ask Mihai once I&#039;ve spent a couple of days looking through literature.&lt;br /&gt;
* Essentially I think the ideal metagenomic assembler would allow for, and gracefully represent, diversity within a single organism, without collapsing the genetic material of an entire community into a single messy contig.&lt;br /&gt;
* The major theoretical challenge would be the development of an algorithm that could differentiate between variation and &amp;quot;speciation&amp;quot; in a biologically meaningful way. This is far from being a new problem.&lt;br /&gt;
** The fundamental limitation of differentiating between species based on sequence divergence is that, in many cases, a relatively small number of mutations, a single gene insertion, or a single functional plasmid can impart dramatic new phenotypic properties to a micro-organism without significantly altering the overall sequence. This means that sequence divergence is not necessarily proportional to phenotypic divergence and thus cannot be used to differentiate between what I would consider to be biologically meaningful species (micro-organisms with differing phenotypes).&lt;br /&gt;
** As a simple starting point, I would assemble all unitigs. From these seeds, I would extend out in both directions, allowing forks without breaking the contigs. Forks that can be joined on either side by unitigs that are substantially longer than the forked regions, I would tentatively consider to be variation within a single species. In order to accomplish this, I would need to set thresholds for spawning and merging forks. I expect these to be of the three following types:&lt;br /&gt;
*# SNPs and other very small variants could be handled by allowing a string of 3 or so mismatches within a unitig before terminating the unitig and considering a fork. This would probably require modifying or creating an assembler, as opposed to a scaffolder.&lt;br /&gt;
*# Forked regions that are long enough to contain disparate unitigs, closed on both sides by other unitigs, could be assembled from the output of an existing assembler. Of course, most current assemblers tend to generate a very large number of very small contigs, leaving us with most of the same challenges we would face with a full-blown assembler (also Chris is itching to make an assembler anyway). It is important to consider that such cases are very likely the result of repeats at the boundaries of unitigs that can be placed in such an arrangement. It is possible, though, for inserted genes to create such scenarios in a biologically valid/meaningful way.&lt;br /&gt;
*# The other &amp;quot;catch all&amp;quot; scenario will be when any boundary length falls between the minimum length of a unitig and the maximum length set for small mutations. [needs more detail, but it&#039;s bed time]&lt;br /&gt;
* In addition to the well-known theoretical challenges, coding this in such a way as to make it practical for current and future metagenomic problems would probably require more programming skills than I have the time to master. As such, I&#039;m hoping that maybe I can work out the theory on a smaller scale using Python, and then maybe convince someone else to collaborate with me for the implementation, or just publish my findings and leave it to the scientific community to incorporate my methods into their-favorite-assembler.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;s&amp;gt;The more I kick around this idea, the more I think this might work better as a scaffolder, of sorts. I should talk to Sergey about the possibility of just incorporating this into Bambus as an alternative output or something. Of course, Bambus fundamentally requires mate pairs, which would not necessarily be a constraint for this assembler/scaffolder. Also any smaller variance collapsed by an assembler into a consensus sequence would be lost.&amp;lt;/s&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== (pre)Binning to improve metagenomic assembly ===&lt;br /&gt;
* Mihai&#039;s concerns that binning could (and probably would) break assemblies by separating overlapping reads into different bins are valid, but assume the simplest of binning schemes: Every read is placed in exactly one bin and the assembler is never allowed to combine reads from multiple bins.&lt;br /&gt;
** This scheme is obviously overly simplistic and offers little value to existing assembly techniques.&lt;br /&gt;
* Motivation&lt;br /&gt;
# Convert computationally challenging problem into an embarrassingly parallel problem&lt;br /&gt;
#* Current metagenomic sequencing projects are generating hundreds of millions of (30-500bp) reads, which traditional assemblers would attempt to load into RAM all at once. Even the average US university likely does not have the computational resources to successfully attempt such an assembly, and it is unlikely that the average group generating the sequences is prepared for this challenge.&lt;br /&gt;
#* Recent attempts to address this problem have focused on massively parallelized assemblers designed to run on large computer clusters (SOAPdenovo) or cloud clusters such as those offered by Amazon and Google (CloudBurst).&lt;br /&gt;
#* An alternative approach is to first attempt to bin reads we expect to be assembled together, and then use traditional assemblers to assemble the reads placed in each bin.&lt;br /&gt;
#** The hope here is that the amount of sequence in each bin would be more similar to traditional clonal sequencing projects and would therefore be more amenable to traditional assembly techniques.&lt;br /&gt;
#** Avoiding Mihai&#039;s concerns would require additional computation that would further inflate the overhead of this approach. These may include:&lt;br /&gt;
#*# Keeping a special bin for reads that could not be placed with high confidence, and allowing each assembly to pull reads from it.&lt;br /&gt;
#*** This would require some sort of record keeping to ensure promiscuous reads are not added to many assemblies.&lt;br /&gt;
#*# Iterating the assemblies so that contigs from different bins have an opportunity to be combined, including singlets that were not able to be assembled in the first round.&lt;br /&gt;
#*** This approach would greatly increase the overall runtime, but may essentially allow a relatively small group to assemble a large metagenomic sample with modest computational resources.&lt;br /&gt;
# Improve assemblies by using alternative algorithms to place promiscuous reads&lt;br /&gt;
#* For the most part, assemblers use rather naive criteria to place reads into contigs. This is in part because of the nature of the problem when these assemblers were first being developed. In clonal sequencing, one can safely assume that the vast majority of sequence variation actually comes from sequencing error, and not from variation within the population of that organism within a particular environment. To this end, traditional assemblers have sought to identify and eliminate minor sequence variation within the read set, assemble overlapping reads with relatively clean sequences, and simply break up assemblies in places where there was too much variation to justify any other action.&lt;br /&gt;
&lt;br /&gt;
hmm... Well that&#039;s not where I&#039;d intended to go with that. The more I think about this problem, the more I think my other metagenomic assembly idea is more promising in terms of improving assembly quality. I also have serious doubts about the need in the scientific community for the ability to assemble large metagenomic sequencing projects using modest local computational resources. It seems much more likely to me that these groups will increasingly request money for time on an Amazon cluster when writing their grants, and that an iterative local approach would quickly become obsolete, if it isn&#039;t already.&lt;br /&gt;
&lt;br /&gt;
=== Volker&#039;s Mycobacterial genomes ===&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7291</id>
		<title>User:Tgibbons:Project-Ideas</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7291"/>
		<updated>2010-08-10T03:09:55Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* Metagenomic assembly */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;After a while, my weekly progress report no longer seemed like an appropriate place for brainstorming, so I&#039;ve transferred everything here.&lt;br /&gt;
&lt;br /&gt;
== Potential Research Projects Inspired by &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ==&lt;br /&gt;
=== Searching for quorum sensing genes, both known and novel, and any pathways including them in human microbiome data ===&lt;br /&gt;
* Search for and consider making quorum sensing gene DB&lt;br /&gt;
** KEGG has pathways containing both acyl-homoserine lactone and its synthase&lt;br /&gt;
* After indexing known quorum sensing genes, search for homologues&lt;br /&gt;
** WGS data - Obviously search for homologues directly&lt;br /&gt;
** 16S data - Identify organisms and search for homologues in public DBs&lt;br /&gt;
=== Search for &amp;quot;core metabolome&amp;quot; in pioneer organisms from infant studies ===&lt;br /&gt;
* On the off chance I get to publish on this subject, it might be interesting to draw an analogy between genetic evolution involving chance mutations that are occasionally beneficial and are therefore propagated, and microbial colonization which involves chance introduction of microorganisms that are occasionally beneficial and therefore become stable members of the microbiome.&lt;br /&gt;
=== Attempt to search for cases of symbiosis where possible ===&lt;br /&gt;
* Search terms: commensalism, synergism (protocooperation), mutualism, competition, amensalism (antagonism), predation, parasitism, neutralism, syntrophism (nutritional synergism, cross-feeding, examples on pp.16-17)&lt;br /&gt;
&lt;br /&gt;
== Other Potential Research Projects ==&lt;br /&gt;
&lt;br /&gt;
=== Metagenomic assembly ===&lt;br /&gt;
* I&#039;ve been kicking this idea around the floor for months, but none of the people I perceive as being better suited to tackle the problem have appeared all that interested. This could be due either to them already having large projects to which they&#039;re committed, or it could be that they sense trouble. I&#039;ll ask Mihai once I&#039;ve spent a couple of days looking through literature.&lt;br /&gt;
* Essentially I think the ideal metagenomic assembler would allow for, and gracefully represent, diversity within a single organism, without collapsing the genetic material of an entire community into a single messy contig.&lt;br /&gt;
* The major theoretical challenge would be the development of an algorithm that could differentiate between variation and &amp;quot;speciation&amp;quot; in a biologically meaningful way. This is far from being a new problem.&lt;br /&gt;
** The fundamental limitation of differentiating between species based on sequence divergence is that a relatively small number of mutations, a single gene insertion, or a single functional plasmid can often impart dramatic new phenotypic properties to a micro-organism without significantly altering the overall sequence. This means that sequence divergence is not necessarily proportional to phenotypic divergence and thus cannot be used to differentiate between what I would consider to be biologically meaningful species (micro-organisms with differing phenotypes).&lt;br /&gt;
** As a simple starting point, I would assemble all unitigs. From these seeds, I would extend out in both directions, allowing forks without breaking the contigs. I would tentatively consider forks that can be closed on either side by unitigs substantially longer than the forked regions to be variation within a single species. In order to accomplish this, I would need to set thresholds for spawning and merging forks:&lt;br /&gt;
*** SNPs and other very small variances could be handled by allowing a string of 3 or so mismatches within a unitig before terminating the unitig and considering a fork. This would probably require modifying or creating an assembler, as opposed to a scaffolder.&lt;br /&gt;
*** Forked regions that are long enough to contain disparate unitigs, closed on both sides by other unitigs, could be assembled from the output of an existing assembler.&lt;br /&gt;
* In addition to the well-known theoretical challenges, coding this in such a way as to make it practical for current and future metagenomic problems would probably require more programming skills than I have the time to master. As such, I&#039;m hoping that maybe I can work out the theory on a smaller scale using Python, and then maybe convince someone else to collaborate with me for the implementation, or just publish my findings and leave it to the scientific community to incorporate my methods into their-favorite-assembler.&lt;br /&gt;
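The fork-handling rules above can be sketched in Python (the language I plan to prototype in); the function names and both threshold values here are hypothetical placeholders I would need to tune, not part of any existing assembler:&lt;br /&gt;

```python
# Hypothetical thresholds for spawning and merging forks, as described above.
MAX_MISMATCH_RUN = 3   # mismatch run tolerated inside a unitig before forking
FLANK_RATIO = 3.0      # flanking unitigs must be this many times the fork length

def should_fork(mismatch_run: int) -> bool:
    """Terminate the unitig and consider spawning a fork once a run of
    mismatches exceeds the tolerated length (handles SNP-scale variance)."""
    return mismatch_run > MAX_MISMATCH_RUN

def is_within_species_variation(fork_len: int, left_flank: int, right_flank: int) -> bool:
    """Tentatively classify a fork as variation within a single species when
    both flanking unitigs are substantially longer than the forked region."""
    return min(left_flank, right_flank) >= FLANK_RATIO * fork_len

# A short, SNP-scale fork flanked by long unitigs -> within-species variation
print(is_within_species_variation(fork_len=5, left_flank=400, right_flank=250))    # True
# A long forked region relative to its flanks -> treat as a genuine split
print(is_within_species_variation(fork_len=200, left_flank=400, right_flank=250))  # False
```

A real criterion would presumably also weigh read depth across the two arms of the fork, but this captures the two knobs the merging logic needs.&lt;br /&gt;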
&lt;br /&gt;
The more I kick around this idea, the more I think this might work better as a scaffolder, of sorts. I should talk to Sergey about the possibility of just incorporating this into Bambus as an alternative output or something. Of course, Bambus fundamentally requires mate pairs, which would not necessarily be a constraint for this assembler/scaffolder. Also any smaller variance collapsed by an assembler into a consensus sequence would be lost.&lt;br /&gt;
&lt;br /&gt;
=== (pre)Binning to improve metagenomic assembly ===&lt;br /&gt;
* Mihai&#039;s concerns that binning could (and probably would) break assemblies by separating overlapping reads into different bins are valid, but assume the simplest of binning schemes: every read is placed in exactly one bin and the assembler is never allowed to combine reads from multiple bins.&lt;br /&gt;
** This scheme is obviously overly simplistic and offers little value to existing assembly techniques.&lt;br /&gt;
* Motivation&lt;br /&gt;
# Convert computationally challenging problem into an embarrassingly parallel problem&lt;br /&gt;
#* Current metagenomic sequencing projects are generating hundreds of millions of (30-500bp) reads, which traditional assemblers would attempt to load into RAM all at once. Even the average US university likely does not have the computational resources to successfully attempt such an assembly, and it is unlikely that the average group generating the sequences is prepared for this challenge.&lt;br /&gt;
#* Recent attempts to address this problem have focused on massively parallelized assemblers designed to run on large computer clusters (SOAPdenovo) or cloud clusters such as those offered by Amazon and Google (CloudBurst).&lt;br /&gt;
#* An alternative approach is to first attempt to bin reads we expect to be assembled together, and then use traditional assemblers to assemble the reads placed in each bin.&lt;br /&gt;
#** The hope here is that the amount of sequence in each bin would be more similar to traditional clonal sequencing projects and would therefore be more amenable to traditional assembly techniques.&lt;br /&gt;
#** Avoiding Mihai&#039;s concerns would require additional computation that would further inflate the overhead of this approach. These may include:&lt;br /&gt;
#*# Keeping a special bin for reads that could not be placed with high confidence, and allowing each assembly to pull reads from it.&lt;br /&gt;
#*#* This would require some sort of record keeping to ensure promiscuous reads are not added to many assemblies.&lt;br /&gt;
#*# Iterating the assemblies so that contigs from different bins have an opportunity to be combined, including singlets that were not able to be assembled in the first round.&lt;br /&gt;
#*#* This approach would greatly increase the overall runtime, but may essentially allow a relatively small group to assemble a large metagenomic sample with modest computational resources.&lt;br /&gt;
# Improve assemblies by using alternative algorithms to place promiscuous reads&lt;br /&gt;
#* For the most part, assemblers use rather naive criteria to place reads into contigs. This is in part because of the nature of the problem when these assemblers were first being developed. In clonal sequencing, one can safely assume that the vast majority of sequence variation actually comes from sequencing error, and not from variation within the population of that organism within a particular environment. To this end, traditional assemblers have sought to identify and eliminate minor sequence variation within the read set, assemble overlapping reads with relatively clean sequences, and simply break up assemblies in places where there was too much variation to justify any other action.&lt;br /&gt;
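As a toy sketch of the bookkeeping described in the numbered points above (a shared bin of unplaced reads, plus record keeping so promiscuous reads are not added to many assemblies), in Python; `binned_assembly` and the `assemble` callback are hypothetical names, and the iterative cross-bin merging round is omitted:&lt;br /&gt;

```python
def binned_assembly(bins, unplaced, assemble):
    """Assemble each bin independently (embarrassingly parallel), letting
    each assembly borrow reads from a shared bin of low-confidence placements.

    `claimed` is the record keeping: each unplaced read may be claimed by at
    most one bin, so promiscuous reads cannot join many assemblies. A real
    implementation would claim a read only when it actually overlaps the
    bin's contigs, and would add a later round that merges contigs (and
    leftover singlets) across bins.
    """
    claimed = {}   # unplaced read -> the single bin allowed to use it
    contigs = {}
    for name, reads in bins.items():
        usable = list(reads)
        for read in unplaced:
            owner = claimed.setdefault(read, name)  # first bin to ask claims it
            if owner == name:
                usable.append(read)
        contigs[name] = assemble(usable)   # stand-in for a traditional assembler
    return contigs, claimed

# Toy "assembler": joins whatever reads it is given into one pseudo-contig.
toy_assemble = lambda reads: ["|".join(sorted(reads))]
contigs, claimed = binned_assembly({"binA": ["r1", "r2"], "binB": ["r3"]},
                                   unplaced=["u1"], assemble=toy_assemble)
```

Here the unplaced read `u1` ends up in exactly one assembly, which is the invariant the record keeping is meant to enforce.&lt;br /&gt;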
&lt;br /&gt;
hmm... Well that&#039;s not where I&#039;d intended to go with that. The more I think about this problem, the more I think my other metagenomic assembly idea is more promising in terms of improving assembly quality. I also have serious doubts about the need in the scientific community for the ability to assemble large metagenomic sequencing projects using modest local computational resources. It seems much more likely to me that these groups will increasingly request money for time on an Amazon cluster when writing their grants, and that an iterative local approach would quickly become obsolete, if it isn&#039;t already.&lt;br /&gt;
&lt;br /&gt;
=== Volker&#039;s Mycobacterial genomes ===&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7290</id>
		<title>User:Tgibbons:Project-Ideas</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7290"/>
		<updated>2010-08-10T03:00:32Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* Metagenomic assembly */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;After a while, my weekly progress report no longer seemed like an appropriate place for brainstorming, so I&#039;ve transferred everything here.&lt;br /&gt;
&lt;br /&gt;
== Potential Research Projects Inspired by &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ==&lt;br /&gt;
=== Searching for quorum sensing genes, both known and novel, and any pathways including them in human microbiome data ===&lt;br /&gt;
* Search for and consider making quorum sensing gene DB&lt;br /&gt;
** KEGG has pathways containing both acyl-homoserine lactone and its synthase&lt;br /&gt;
* After indexing known quorum sensing genes, search for homologues&lt;br /&gt;
** WGS data - Obviously search for homologues directly&lt;br /&gt;
** 16S data - Identify organisms and search for homologues in public DBs&lt;br /&gt;
=== Search for &amp;quot;core metabolome&amp;quot; in pioneer organisms from infant studies ===&lt;br /&gt;
* On the off chance I get to publish on this subject, it might be interesting to draw an analogy between genetic evolution involving chance mutations that are occasionally beneficial and are therefore propagated, and microbial colonization which involves chance introduction of microorganisms that are occasionally beneficial and therefore become stable members of the microbiome.&lt;br /&gt;
=== Attempt to search for cases of symbiosis where possible ===&lt;br /&gt;
* Search terms: commensalism, synergism (protocooperation), mutualism, competition, amensalism (antagonism), predation, parasitism, neutralism, syntrophism (nutritional synergism, cross-feeding, examples on pp.16-17)&lt;br /&gt;
&lt;br /&gt;
== Other Potential Research Projects ==&lt;br /&gt;
&lt;br /&gt;
=== Metagenomic assembly ===&lt;br /&gt;
* I&#039;ve been kicking this idea around the floor for months, but none of the people I perceive as being better suited to tackle the problem have appeared all that interested. This could be due either to them already having large projects to which they&#039;re committed, or it could be that they sense trouble. I&#039;ll ask Mihai once I&#039;ve spent a couple of days looking through literature.&lt;br /&gt;
* Essentially I think the ideal metagenomic assembler would allow for, and gracefully represent, diversity within a single organism, without collapsing the genetic material of an entire community into a single messy contig.&lt;br /&gt;
* The major theoretical challenge would be the development of an algorithm that could differentiate between variation and &amp;quot;speciation&amp;quot; in a biologically meaningful way. This is far from being a new problem.&lt;br /&gt;
** The fundamental limitation of differentiating between species based on sequence divergence is that a relatively small number of mutations, a single gene insertion, or a single functional plasmid can often impart dramatic new phenotypic properties to a micro-organism without significantly altering the overall sequence. This means that sequence divergence is not necessarily proportional to phenotypic divergence and thus cannot be used to differentiate between what I would consider to be biologically meaningful species (micro-organisms with differing phenotypes).&lt;br /&gt;
** As a simple starting point, I would assemble all unitigs. From these seeds, I would extend out in both directions, allowing forks without breaking the contigs. I would tentatively consider forks that can be closed on either side by unitigs substantially longer than the forked regions to be variation within a single species.&lt;br /&gt;
*** To accomplish this, I would need to set thresholds for spawning and merging forks. SNPs and other very small variances could be handled by allowing a string of 3 or so mismatches within a unitig before terminating the unitig and considering a fork.&lt;br /&gt;
* In addition to the well-known theoretical challenges, coding this in such a way as to make it practical for current and future metagenomic problems would probably require more programming skills than I have the time to master. As such, I&#039;m hoping that maybe I can work out the theory on a smaller scale using Python, and then maybe convince someone else to collaborate with me for the implementation, or just publish my findings and leave it to the scientific community to incorporate my methods into their-favorite-assembler.&lt;br /&gt;
&lt;br /&gt;
The more I kick around this idea, the more I think this might work better as a scaffolder. I should talk to Sergey about the possibility of just incorporating this into Bambus as an alternative output or something. Of course, Bambus fundamentally requires mate pairs, which is not necessarily a constraint of this assembler/scaffolder. Also any smaller variance collapsed by the assembler into a consensus sequence would be lost.&lt;br /&gt;
&lt;br /&gt;
=== (pre)Binning to improve metagenomic assembly ===&lt;br /&gt;
* Mihai&#039;s concerns that binning could (and probably would) break assemblies by separating overlapping reads into different bins are valid, but assume the simplest of binning schemes: every read is placed in exactly one bin and the assembler is never allowed to combine reads from multiple bins.&lt;br /&gt;
** This scheme is obviously overly simplistic and offers little value to existing assembly techniques.&lt;br /&gt;
* Motivation&lt;br /&gt;
# Convert computationally challenging problem into an embarrassingly parallel problem&lt;br /&gt;
#* Current metagenomic sequencing projects are generating hundreds of millions of (30-500bp) reads, which traditional assemblers would attempt to load into RAM all at once. Even the average US university likely does not have the computational resources to successfully attempt such an assembly, and it is unlikely that the average group generating the sequences is prepared for this challenge.&lt;br /&gt;
#* Recent attempts to address this problem have focused on massively parallelized assemblers designed to run on large computer clusters (SOAPdenovo) or cloud clusters such as those offered by Amazon and Google (CloudBurst).&lt;br /&gt;
#* An alternative approach is to first attempt to bin reads we expect to be assembled together, and then use traditional assemblers to assemble the reads placed in each bin.&lt;br /&gt;
#** The hope here is that the amount of sequence in each bin would be more similar to traditional clonal sequencing projects and would therefore be more amenable to traditional assembly techniques.&lt;br /&gt;
#** Avoiding Mihai&#039;s concerns would require additional computation that would further inflate the overhead of this approach. These may include:&lt;br /&gt;
#*# Keeping a special bin for reads that could not be placed with high confidence, and allowing each assembly to pull reads from it.&lt;br /&gt;
#*#* This would require some sort of record keeping to ensure promiscuous reads are not added to many assemblies.&lt;br /&gt;
#*# Iterating the assemblies so that contigs from different bins have an opportunity to be combined, including singlets that were not able to be assembled in the first round.&lt;br /&gt;
#*#* This approach would greatly increase the overall runtime, but may essentially allow a relatively small group to assemble a large metagenomic sample with modest computational resources.&lt;br /&gt;
# Improve assemblies by using alternative algorithms to place promiscuous reads&lt;br /&gt;
#* For the most part, assemblers use rather naive criteria to place reads into contigs. This is in part because of the nature of the problem when these assemblers were first being developed. In clonal sequencing, one can safely assume that the vast majority of sequence variation actually comes from sequencing error, and not from variation within the population of that organism within a particular environment. To this end, traditional assemblers have sought to identify and eliminate minor sequence variation within the read set, assemble overlapping reads with relatively clean sequences, and simply break up assemblies in places where there was too much variation to justify any other action.&lt;br /&gt;
&lt;br /&gt;
hmm... Well that&#039;s not where I&#039;d intended to go with that. The more I think about this problem, the more I think my other metagenomic assembly idea is more promising in terms of improving assembly quality. I also have serious doubts about the need in the scientific community for the ability to assemble large metagenomic sequencing projects using modest local computational resources. It seems much more likely to me that these groups will increasingly request money for time on an Amazon cluster when writing their grants, and that an iterative local approach would quickly become obsolete, if it isn&#039;t already.&lt;br /&gt;
&lt;br /&gt;
=== Volker&#039;s Mycobacterial genomes ===&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7289</id>
		<title>User:Tgibbons:Project-Ideas</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7289"/>
		<updated>2010-08-10T02:36:21Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* Metagenomic assembly */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;After a while, my weekly progress report no longer seemed like an appropriate place for brainstorming, so I&#039;ve transferred everything here.&lt;br /&gt;
&lt;br /&gt;
== Potential Research Projects Inspired by &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ==&lt;br /&gt;
=== Searching for quorum sensing genes, both known and novel, and any pathways including them in human microbiome data ===&lt;br /&gt;
* Search for and consider making quorum sensing gene DB&lt;br /&gt;
** KEGG has pathways containing both acyl-homoserine lactone and its synthase&lt;br /&gt;
* After indexing known quorum sensing genes, search for homologues&lt;br /&gt;
** WGS data - Obviously search for homologues directly&lt;br /&gt;
** 16S data - Identify organisms and search for homologues in public DBs&lt;br /&gt;
=== Search for &amp;quot;core metabolome&amp;quot; in pioneer organisms from infant studies ===&lt;br /&gt;
* On the off chance I get to publish on this subject, it might be interesting to draw an analogy between genetic evolution involving chance mutations that are occasionally beneficial and are therefore propagated, and microbial colonization which involves chance introduction of microorganisms that are occasionally beneficial and therefore become stable members of the microbiome.&lt;br /&gt;
=== Attempt to search for cases of symbiosis where possible ===&lt;br /&gt;
* Search terms: commensalism, synergism (protocooperation), mutualism, competition, amensalism (antagonism), predation, parasitism, neutralism, syntrophism (nutritional synergism, cross-feeding, examples on pp.16-17)&lt;br /&gt;
&lt;br /&gt;
== Other Potential Research Projects ==&lt;br /&gt;
&lt;br /&gt;
=== Metagenomic assembly ===&lt;br /&gt;
* I&#039;ve been kicking this idea around the floor for months, but none of the people I perceive as being better suited to tackle the problem have appeared all that interested. This could be due either to them already having large projects to which they&#039;re committed, or it could be that they sense trouble. I&#039;ll ask Mihai once I&#039;ve spent a couple of days looking through literature.&lt;br /&gt;
* Essentially I think the ideal metagenomic assembler would allow for, and gracefully represent, diversity within a single organism, without collapsing the genetic material of an entire community into a single messy contig.&lt;br /&gt;
* The major theoretical challenge would be the development of an algorithm that could differentiate between variation and &amp;quot;speciation&amp;quot; in a biologically meaningful way. This is far from being a new problem.&lt;br /&gt;
** The fundamental limitation of differentiating between species based on sequence divergence is that a relatively small number of mutations, a single gene insertion, or a single functional plasmid can often impart dramatic new phenotypic properties to a micro-organism without significantly altering the overall sequence. This means that sequence divergence is not necessarily proportional to phenotypic divergence and thus cannot be used to differentiate between what I would consider to be biologically meaningful species (micro-organisms with differing phenotypes).&lt;br /&gt;
** As a simple starting point, I would assemble all unitigs. From these seeds, I would extend out in both directions, allowing forks without breaking the contigs. I would tentatively consider forks that can be closed on either side by unitigs substantially longer than the forked regions to be variation within a single species.&lt;br /&gt;
** To accomplish this, I would need to set thresholds for spawning and merging forks. SNPs and other very small variances could be handled by allowing a string of 3 or so mismatches in a unitig before terminating the unitig and considering a fork.&lt;br /&gt;
* In addition to the well-known theoretical challenges, coding this in such a way as to make it practical for current and future metagenomic problems would probably require more programming skills than I have the time to master. As such, I&#039;m hoping that maybe I can work out the theory on a smaller scale using Python, and then maybe convince someone else to collaborate with me for the implementation, or just publish my findings and leave it to the scientific community to incorporate my methods into their-favorite-assembler.&lt;br /&gt;
&lt;br /&gt;
The more I kick around this idea, the more I think this might work better as a scaffolder. I should talk to Sergey about the possibility of just incorporating this into Bambus as an alternative output or something. Of course, Bambus fundamentally requires mate pairs, which is not necessarily a constraint of this assembler/scaffolder. Also any smaller variance collapsed by the assembler into a consensus sequence would be lost.&lt;br /&gt;
&lt;br /&gt;
=== (pre)Binning to improve metagenomic assembly ===&lt;br /&gt;
* Mihai&#039;s concerns that binning could (and probably would) break assemblies by separating overlapping reads into different bins are valid, but assume the simplest of binning schemes: every read is placed in exactly one bin and the assembler is never allowed to combine reads from multiple bins.&lt;br /&gt;
** This scheme is obviously overly simplistic and offers little value to existing assembly techniques.&lt;br /&gt;
* Motivation&lt;br /&gt;
# Convert computationally challenging problem into an embarrassingly parallel problem&lt;br /&gt;
#* Current metagenomic sequencing projects are generating hundreds of millions of (30-500bp) reads, which traditional assemblers would attempt to load into RAM all at once. Even the average US university likely does not have the computational resources to successfully attempt such an assembly, and it is unlikely that the average group generating the sequences is prepared for this challenge.&lt;br /&gt;
#* Recent attempts to address this problem have focused on massively parallelized assemblers designed to run on large computer clusters (SOAPdenovo) or cloud clusters such as those offered by Amazon and Google (CloudBurst).&lt;br /&gt;
#* An alternative approach is to first attempt to bin reads we expect to be assembled together, and then use traditional assemblers to assemble the reads placed in each bin.&lt;br /&gt;
#** The hope here is that the amount of sequence in each bin would be more similar to traditional clonal sequencing projects and would therefore be more amenable to traditional assembly techniques.&lt;br /&gt;
#** Avoiding Mihai&#039;s concerns would require additional computation that would further inflate the overhead of this approach. These may include:&lt;br /&gt;
#*# Keeping a special bin for reads that could not be placed with high confidence, and allowing each assembly to pull reads from it.&lt;br /&gt;
#*#* This would require some sort of record keeping to ensure promiscuous reads are not added to many assemblies.&lt;br /&gt;
#*# Iterating the assemblies so that contigs from different bins have an opportunity to be combined, including singlets that were not able to be assembled in the first round.&lt;br /&gt;
#*#* This approach would greatly increase the overall runtime, but may essentially allow a relatively small group to assemble a large metagenomic sample with modest computational resources.&lt;br /&gt;
# Improve assemblies by using alternative algorithms to place promiscuous reads&lt;br /&gt;
#* For the most part, assemblers use rather naive criteria to place reads into contigs. This is in part because of the nature of the problem when these assemblers were first being developed. In clonal sequencing, one can safely assume that the vast majority of sequence variation actually comes from sequencing error, and not from variation within the population of that organism within a particular environment. To this end, traditional assemblers have sought to identify and eliminate minor sequence variation within the read set, assemble overlapping reads with relatively clean sequences, and simply break up assemblies in places where there was too much variation to justify any other action.&lt;br /&gt;
&lt;br /&gt;
hmm... Well that&#039;s not where I&#039;d intended to go with that. The more I think about this problem, the more I think my other metagenomic assembly idea is more promising in terms of improving assembly quality. I also have serious doubts about the need in the scientific community for the ability to assemble large metagenomic sequencing projects using modest local computational resources. It seems much more likely to me that these groups will increasingly request money for time on an Amazon cluster when writing their grants, and that an iterative local approach would quickly become obsolete, if it isn&#039;t already.&lt;br /&gt;
&lt;br /&gt;
=== Volker&#039;s Mycobacterial genomes ===&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7288</id>
		<title>User:Tgibbons:Project-Ideas</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7288"/>
		<updated>2010-08-10T02:32:34Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* Metagenomic assembly */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;After a while, my weekly progress report no longer seemed like an appropriate place for brainstorming, so I&#039;ve transferred everything here.&lt;br /&gt;
&lt;br /&gt;
== Potential Research Projects Inspired by &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ==&lt;br /&gt;
=== Searching for quorum sensing genes, both known and novel, and any pathways including them in human microbiome data ===&lt;br /&gt;
* Search for and consider making quorum sensing gene DB&lt;br /&gt;
** KEGG has pathways containing both acyl-homoserine lactone and its synthase&lt;br /&gt;
* After indexing known quorum sensing genes, search for homologues&lt;br /&gt;
** WGS data - Obviously search for homologues directly&lt;br /&gt;
** 16S data - Identify organisms and search for homologues in public DBs&lt;br /&gt;
=== Search for &amp;quot;core metabolome&amp;quot; in pioneer organisms from infant studies ===&lt;br /&gt;
* On the off chance I get to publish on this subject, it might be interesting to draw an analogy between genetic evolution involving chance mutations that are occasionally beneficial and are therefore propagated, and microbial colonization which involves chance introduction of microorganisms that are occasionally beneficial and therefore become stable members of the microbiome.&lt;br /&gt;
=== Attempt to search for cases of symbiosis where possible ===&lt;br /&gt;
* Search terms: commensalism, synergism (protocooperation), mutualism, competition, amensalism (antagonism), predation, parasitism, neutralism, syntrophism (nutritional synergism, cross-feeding, examples on pp.16-17)&lt;br /&gt;
&lt;br /&gt;
== Other Potential Research Projects ==&lt;br /&gt;
&lt;br /&gt;
=== Metagenomic assembly ===&lt;br /&gt;
* I&#039;ve been kicking this idea around the floor for months, but none of the people I perceive as being better suited to tackle the problem have appeared all that interested. This could be due either to them already having large projects to which they&#039;re committed, or it could be that they sense trouble. I&#039;ll ask Mihai once I&#039;ve spent a couple of days looking through literature.&lt;br /&gt;
* Essentially I think the ideal metagenomic assembler would allow for, and gracefully represent, diversity within a single organism, without collapsing the genetic material of an entire community into a single messy contig.&lt;br /&gt;
* The major theoretical challenge would be the development of an algorithm that could differentiate between variation and &amp;quot;speciation&amp;quot; in a biologically meaningful way. This is far from being a new problem.&lt;br /&gt;
** The fundamental limitation of differentiating between species based on sequence divergence is that a relatively small number of mutations, a single gene insertion, or a single functional plasmid can often impart dramatic new phenotypic properties to a micro-organism without significantly altering the overall sequence. This means that sequence divergence is not necessarily proportional to phenotypic divergence and thus cannot be used to differentiate between what I would consider to be biologically meaningful species (micro-organisms with differing phenotypes).&lt;br /&gt;
** As a simple starting point, I would assemble all unitigs. From these seeds, I would extend out in both directions, allowing forks without breaking the contigs. Forks that can be joined on either side by unitigs that are substantially longer than the forked regions, I would tentatively consider to be variation within a single species.&lt;br /&gt;
** To accomplish this, I would need to set thresholds for spawning and merging forks. SNPs and other very small variances could be handled by allowing a string of 3 or so mismatches in a unitig before terminating the unitig and considering a fork.&lt;br /&gt;
* In addition to the well-known theoretical challenges, coding this in such a way as to make it practical for current and future metagenomic problems would probably require more programming skills than I have the time to master. As such, I&#039;m hoping that maybe I can work out the theory on a smaller scale using Python, and then maybe convince someone else to collaborate with me for the implementation, or just publish my findings and leave it to the scientific community to incorporate my methods into their-favorite-assembler.&lt;br /&gt;
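The seed-and-extend scheme sketched in the bullets above could look something like the following toy Python, assuming the thresholds from these notes (a run of ~3 mismatches before spawning a fork); `mismatch_run` and `extend_right` are hypothetical names for illustration, not AMOS code:

```python
MAX_MISMATCH_RUN = 3  # notes above suggest ~3 consecutive mismatches before forking

def mismatch_run(a: str, b: str) -> int:
    """Length of the longest run of consecutive mismatches between aligned strings."""
    run = best = 0
    for x, y in zip(a, b):
        run = run + 1 if x != y else 0
        best = max(best, run)
    return best

def extend_right(seed: str, reads: list, overlap: int = 4):
    """Greedily extend a unitig seed rightward through reads whose prefix
    overlaps its current suffix; reads within the mismatch-run threshold
    are recorded as forks (candidate within-species variation) rather than
    breaking the contig; anything noisier is ignored."""
    contig, forks = seed, []
    for read in reads:
        run = mismatch_run(contig[-overlap:], read[:overlap])
        if run == 0:
            contig += read[overlap:]   # clean overlap: extend the contig
        elif run <= MAX_MISMATCH_RUN:
            forks.append(read)         # small variation: spawn a fork
        # else: overlap too noisy, treat as unrelated
    return contig, forks
```

A real implementation would of course extend in both directions and later try to merge forks joined by long flanking unitigs, as described above; this only illustrates the spawn threshold.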
&lt;br /&gt;
The more I kick around this idea, the more I think this might work better as a scaffolder. I should talk to Sergey about the possibility of just incorporating this into Bambus as an alternative output or something.&lt;br /&gt;
&lt;br /&gt;
=== (pre)Binning to improve metagenomic assembly ===&lt;br /&gt;
* Mihai&#039;s concerns that binning could (and probably would) break assemblies by separating overlapping reads into different bins are valid, but assume the simplest of binning schemes: every read is placed in exactly one bin and the assembler is never allowed to combine reads from multiple bins.&lt;br /&gt;
** This scheme is obviously overly simplistic and offers little value to existing assembly techniques.&lt;br /&gt;
* Motivation&lt;br /&gt;
# Convert a computationally challenging problem into an embarrassingly parallel one&lt;br /&gt;
#* Current metagenomic sequencing projects are generating hundreds of millions of (30-500bp) reads, which traditional assemblers would attempt to load into RAM all at once. Even the average US university likely does not have the computational resources to successfully attempt such an assembly, and it is unlikely that the average group generating the sequences is prepared for this challenge.&lt;br /&gt;
#* Recent attempts to address this problem have focused on massively parallelized assemblers designed to run on large computer clusters (SOAPdenovo) or cloud clusters such as those offered by Amazon and Google (CloudBurst).&lt;br /&gt;
#* An alternative approach is to first attempt to bin reads we expect to be assembled together, and then use traditional assemblers to assemble the reads placed in each bin.&lt;br /&gt;
#** The hope here is that the amount of sequence in each bin would be more similar to traditional clonal sequencing projects and would therefore be more amenable to traditional assembly techniques.&lt;br /&gt;
#** Avoiding Mihai&#039;s concerns would require additional computation that would further inflate the overhead of this approach. These may include:&lt;br /&gt;
#*# Keeping a special bin for reads that could not be placed with high confidence, and allowing each assembly to pull reads from it.&lt;br /&gt;
#*#* This would require some sort of record keeping to ensure promiscuous reads are not added to many assemblies.&lt;br /&gt;
#*# Iterating the assemblies so that contigs from different bins have an opportunity to be combined, including singlets that were not able to be assembled in the first round.&lt;br /&gt;
#*#* This approach would greatly increase the overall runtime, but may essentially allow a relatively small group to assemble a large metagenomic sample with modest computational resources.&lt;br /&gt;
# Improve assemblies by using alternative algorithms to place promiscuous reads&lt;br /&gt;
#* For the most part, assemblers use rather naive criteria to place reads into contigs. This is in part because of the nature of the problem when these assemblers were first being developed. In clonal sequencing, one can safely assume that the vast majority of sequence variation actually comes from sequencing error, and not from variation within the population of that organism within a particular environment. To this end, traditional assemblers have sought to identify and eliminate minor sequence variation within the read set, assemble overlapping reads with relatively clean sequences, and simply break up assemblies in places where there was too much variation to justify any other action.&lt;br /&gt;
&lt;br /&gt;
hmm... Well that&#039;s not where I&#039;d intended to go with that. The more I think about this problem, the more I think my other metagenomic assembly idea is more promising in terms of improving assembly quality. I also have serious doubts about the need in the scientific community for the ability to assemble large metagenomic sequencing projects using modest local computational resources. It seems much more likely to me that these groups will increasingly request money for time on an Amazon cluster when writing their grants, and that an iterative local approach would quickly become obsolete, if it isn&#039;t already.&lt;br /&gt;
&lt;br /&gt;
=== Volker&#039;s Mycobacterial genomes ===&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7287</id>
		<title>User:Tgibbons:Project-Ideas</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7287"/>
		<updated>2010-08-10T02:21:19Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* Metagenomic assembly */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;My weekly progress report just didn&#039;t seem appropriate for my brainstorming after a bit, so I&#039;ve transferred everything here.&lt;br /&gt;
&lt;br /&gt;
== Potential Research Projects Inspired by &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ==&lt;br /&gt;
=== Searching for quorum sensing genes, both known and novel, and any pathways including them in human microbiome data ===&lt;br /&gt;
* Search for and consider making quorum sensing gene DB&lt;br /&gt;
** KEGG has pathways containing both acyl-homoserine lactone and its synthase&lt;br /&gt;
* After indexing known quorum sensing genes, search for homologues&lt;br /&gt;
** WGS data - Obviously search for homologues directly&lt;br /&gt;
** 16S data - Identify organisms and search for homologues in public DBs&lt;br /&gt;
=== Search for &amp;quot;core metabolome&amp;quot; in pioneer organisms from infant studies ===&lt;br /&gt;
* On the off chance I get to publish on this subject, it might be interesting to draw an analogy between genetic evolution involving chance mutations that are occasionally beneficial and are therefore propagated, and microbial colonization which involves chance introduction of microorganisms that are occasionally beneficial and therefore become stable members of the microbiome.&lt;br /&gt;
=== Attempt to search for cases of symbiosis where possible ===&lt;br /&gt;
* Search terms: commensalism, synergism (protocooperation), mutualism, competition, amensalism (antagonism), predation, parasitism, neutralism, syntrophism (nutritional synergism, cross-feeding, examples on pp.16-17)&lt;br /&gt;
&lt;br /&gt;
== Other Potential Research Projects ==&lt;br /&gt;
&lt;br /&gt;
=== Metagenomic assembly ===&lt;br /&gt;
* I&#039;ve been kicking this idea around for months, but none of the people I perceive as better suited to tackle the problem have seemed all that interested. This could be because they&#039;re already committed to large projects, or because they sense trouble. I&#039;ll ask Mihai once I&#039;ve spent a couple of days looking through the literature.&lt;br /&gt;
* Essentially I think the ideal metagenomic assembler would allow for, and gracefully represent, diversity within a single organism, without collapsing the genetic material of an entire community into a single messy contig.&lt;br /&gt;
* The major theoretical challenge would be the development of an algorithm that could differentiate between variation and &amp;quot;speciation&amp;quot; in a biologically meaningful way. This is far from being a new problem.&lt;br /&gt;
** The fundamental limitation of differentiating between species based on sequence divergence is that many times a relatively small number of mutations, a single gene insertion, or a single functional plasmid can impart dramatic new phenotypic properties to a micro-organism without significantly altering the overall sequence. This means that sequence divergence is not necessarily proportional to phenotypic divergence and thus can not be used to differentiate between what I would consider to be biologically meaningful species (micro-organisms with differing phenotypes).&lt;br /&gt;
** As a simple starting point, I would assemble all unitigs. From these seeds, I would extend out in both directions, allowing forks without breaking the contigs. Forks that can be joined on either side by unitigs that are substantially longer than the forked regions, I would tentatively consider to be variation within a single species.&lt;br /&gt;
** To accomplish this, I would need to set thresholds for the number of&lt;br /&gt;
* In addition to the well-known theoretical challenges, coding this in such a way as to make it practical for current and future metagenomic problems would probably require more programming skills than I have the time to master. As such, I&#039;m hoping that maybe I can work out the theory on a smaller scale using Python, and then maybe convince someone else to collaborate with me for the implementation, or just publish my findings and leave it to the scientific community to incorporate my methods into their-favorite-assembler.&lt;br /&gt;
&lt;br /&gt;
=== (pre)Binning to improve metagenomic assembly ===&lt;br /&gt;
* Mihai&#039;s concerns that binning could (and probably would) break assemblies by separating overlapping reads into different bins are valid, but assume the simplest of binning schemes: every read is placed in exactly one bin and the assembler is never allowed to combine reads from multiple bins.&lt;br /&gt;
** This scheme is obviously overly simplistic and offers little value to existing assembly techniques.&lt;br /&gt;
* Motivation&lt;br /&gt;
# Convert a computationally challenging problem into an embarrassingly parallel one&lt;br /&gt;
#* Current metagenomic sequencing projects are generating hundreds of millions of (30-500bp) reads, which traditional assemblers would attempt to load into RAM all at once. Even the average US university likely does not have the computational resources to successfully attempt such an assembly, and it is unlikely that the average group generating the sequences is prepared for this challenge.&lt;br /&gt;
#* Recent attempts to address this problem have focused on massively parallelized assemblers designed to run on large computer clusters (SOAPdenovo) or cloud clusters such as those offered by Amazon and Google (CloudBurst).&lt;br /&gt;
#* An alternative approach is to first attempt to bin reads we expect to be assembled together, and then use traditional assemblers to assemble the reads placed in each bin.&lt;br /&gt;
#** The hope here is that the amount of sequence in each bin would be more similar to traditional clonal sequencing projects and would therefore be more amenable to traditional assembly techniques.&lt;br /&gt;
#** Avoiding Mihai&#039;s concerns would require additional computation that would further inflate the overhead of this approach. These may include:&lt;br /&gt;
#*# Keeping a special bin for reads that could not be placed with high confidence, and allowing each assembly to pull reads from it.&lt;br /&gt;
#*#* This would require some sort of record keeping to ensure promiscuous reads are not added to many assemblies.&lt;br /&gt;
#*# Iterating the assemblies so that contigs from different bins have an opportunity to be combined, including singlets that were not able to be assembled in the first round.&lt;br /&gt;
#*#* This approach would greatly increase the overall runtime, but may essentially allow a relatively small group to assemble a large metagenomic sample with modest computational resources.&lt;br /&gt;
# Improve assemblies by using alternative algorithms to place promiscuous reads&lt;br /&gt;
#* For the most part, assemblers use rather naive criteria to place reads into contigs. This is in part because of the nature of the problem when these assemblers were first being developed. In clonal sequencing, one can safely assume that the vast majority of sequence variation actually comes from sequencing error, and not from variation within the population of that organism within a particular environment. To this end, traditional assemblers have sought to identify and eliminate minor sequence variation within the read set, assemble overlapping reads with relatively clean sequences, and simply break up assemblies in places where there was too much variation to justify any other action.&lt;br /&gt;
&lt;br /&gt;
hmm... Well that&#039;s not where I&#039;d intended to go with that. The more I think about this problem, the more I think my other metagenomic assembly idea is more promising in terms of improving assembly quality. I also have serious doubts about the need in the scientific community for the ability to assemble large metagenomic sequencing projects using modest local computational resources. It seems much more likely to me that these groups will increasingly request money for time on an Amazon cluster when writing their grants, and that an iterative local approach would quickly become obsolete, if it isn&#039;t already.&lt;br /&gt;
&lt;br /&gt;
=== Volker&#039;s Mycobacterial genomes ===&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7286</id>
		<title>User:Tgibbons:Project-Ideas</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7286"/>
		<updated>2010-08-10T02:16:28Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* Metagenomic assembly */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;My weekly progress report just didn&#039;t seem appropriate for my brainstorming after a bit, so I&#039;ve transferred everything here.&lt;br /&gt;
&lt;br /&gt;
== Potential Research Projects Inspired by &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ==&lt;br /&gt;
=== Searching for quorum sensing genes, both known and novel, and any pathways including them in human microbiome data ===&lt;br /&gt;
* Search for and consider making quorum sensing gene DB&lt;br /&gt;
** KEGG has pathways containing both acyl-homoserine lactone and its synthase&lt;br /&gt;
* After indexing known quorum sensing genes, search for homologues&lt;br /&gt;
** WGS data - Obviously search for homologues directly&lt;br /&gt;
** 16S data - Identify organisms and search for homologues in public DBs&lt;br /&gt;
=== Search for &amp;quot;core metabolome&amp;quot; in pioneer organisms from infant studies ===&lt;br /&gt;
* On the off chance I get to publish on this subject, it might be interesting to draw an analogy between genetic evolution involving chance mutations that are occasionally beneficial and are therefore propagated, and microbial colonization which involves chance introduction of microorganisms that are occasionally beneficial and therefore become stable members of the microbiome.&lt;br /&gt;
=== Attempt to search for cases of symbiosis where possible ===&lt;br /&gt;
* Search terms: commensalism, synergism (protocooperation), mutualism, competition, amensalism (antagonism), predation, parasitism, neutralism, syntrophism (nutritional synergism, cross-feeding, examples on pp.16-17)&lt;br /&gt;
&lt;br /&gt;
== Other Potential Research Projects ==&lt;br /&gt;
&lt;br /&gt;
=== Metagenomic assembly ===&lt;br /&gt;
* I&#039;ve been kicking this idea around for months, but none of the people I perceive as better suited to tackle the problem have seemed all that interested. This could be because they&#039;re already committed to large projects, or because they sense trouble. I&#039;ll ask Mihai once I&#039;ve spent a couple of days looking through the literature.&lt;br /&gt;
* Essentially I think the ideal metagenomic assembler would allow for, and gracefully represent, diversity within a single organism, without collapsing the genetic material of an entire community into a single messy contig.&lt;br /&gt;
* The major theoretical challenge would be the development of an algorithm that could differentiate between variation and &amp;quot;speciation&amp;quot; in a biologically meaningful way. This is far from being a new problem.&lt;br /&gt;
** The fundamental limitation of differentiating between species based on sequence divergence is that many times a relatively small number of mutations, a single gene insertion, or a single functional plasmid can impart dramatic new phenotypic properties to a micro-organism without significantly altering the overall sequence. This means that sequence divergence is not necessarily proportional to phenotypic divergence and thus can not be used to differentiate between what I would consider to be biologically meaningful species (micro-organisms with differing phenotypes).&lt;br /&gt;
** As a simple starting point, I would assemble all unitigs. From these seeds, I would extend out in both directions, allowing forks without breaking the contigs. Forks that can be joined on either side by unitigs that are substantially longer than the forked regions, I would tentatively consider to be variation within a single species.&lt;br /&gt;
*** To handle this, I would need to &lt;br /&gt;
* In addition to the well-known theoretical challenges, coding this in such a way as to make it practical for current and future metagenomic problems would probably require more programming skills than I have the time to master. As such, I&#039;m hoping that maybe I can work out the theory on a smaller scale using Python, and then maybe convince someone else to collaborate with me for the implementation, or just publish my findings and leave it to the scientific community to incorporate my methods into their-favorite-assembler.&lt;br /&gt;
&lt;br /&gt;
=== (pre)Binning to improve metagenomic assembly ===&lt;br /&gt;
* Mihai&#039;s concerns that binning could (and probably would) break assemblies by separating overlapping reads into different bins are valid, but assume the simplest of binning schemes: every read is placed in exactly one bin and the assembler is never allowed to combine reads from multiple bins.&lt;br /&gt;
** This scheme is obviously overly simplistic and offers little value to existing assembly techniques.&lt;br /&gt;
* Motivation&lt;br /&gt;
# Convert a computationally challenging problem into an embarrassingly parallel one&lt;br /&gt;
#* Current metagenomic sequencing projects are generating hundreds of millions of (30-500bp) reads, which traditional assemblers would attempt to load into RAM all at once. Even the average US university likely does not have the computational resources to successfully attempt such an assembly, and it is unlikely that the average group generating the sequences is prepared for this challenge.&lt;br /&gt;
#* Recent attempts to address this problem have focused on massively parallelized assemblers designed to run on large computer clusters (SOAPdenovo) or cloud clusters such as those offered by Amazon and Google (CloudBurst).&lt;br /&gt;
#* An alternative approach is to first attempt to bin reads we expect to be assembled together, and then use traditional assemblers to assemble the reads placed in each bin.&lt;br /&gt;
#** The hope here is that the amount of sequence in each bin would be more similar to traditional clonal sequencing projects and would therefore be more amenable to traditional assembly techniques.&lt;br /&gt;
#** Avoiding Mihai&#039;s concerns would require additional computation that would further inflate the overhead of this approach. These may include:&lt;br /&gt;
#*# Keeping a special bin for reads that could not be placed with high confidence, and allowing each assembly to pull reads from it.&lt;br /&gt;
#*#* This would require some sort of record keeping to ensure promiscuous reads are not added to many assemblies.&lt;br /&gt;
#*# Iterating the assemblies so that contigs from different bins have an opportunity to be combined, including singlets that were not able to be assembled in the first round.&lt;br /&gt;
#*#* This approach would greatly increase the overall runtime, but may essentially allow a relatively small group to assemble a large metagenomic sample with modest computational resources.&lt;br /&gt;
# Improve assemblies by using alternative algorithms to place promiscuous reads&lt;br /&gt;
#* For the most part, assemblers use rather naive criteria to place reads into contigs. This is in part because of the nature of the problem when these assemblers were first being developed. In clonal sequencing, one can safely assume that the vast majority of sequence variation actually comes from sequencing error, and not from variation within the population of that organism within a particular environment. To this end, traditional assemblers have sought to identify and eliminate minor sequence variation within the read set, assemble overlapping reads with relatively clean sequences, and simply break up assemblies in places where there was too much variation to justify any other action.&lt;br /&gt;
&lt;br /&gt;
hmm... Well that&#039;s not where I&#039;d intended to go with that. The more I think about this problem, the more I think my other metagenomic assembly idea is more promising in terms of improving assembly quality. I also have serious doubts about the need in the scientific community for the ability to assemble large metagenomic sequencing projects using modest local computational resources. It seems much more likely to me that these groups will increasingly request money for time on an Amazon cluster when writing their grants, and that an iterative local approach would quickly become obsolete, if it isn&#039;t already.&lt;br /&gt;
&lt;br /&gt;
=== Volker&#039;s Mycobacterial genomes ===&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7285</id>
		<title>User:Tgibbons:Project-Ideas</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7285"/>
		<updated>2010-08-10T01:52:10Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* Metagenomic assembly */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;My weekly progress report just didn&#039;t seem appropriate for my brainstorming after a bit, so I&#039;ve transferred everything here.&lt;br /&gt;
&lt;br /&gt;
== Potential Research Projects Inspired by &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ==&lt;br /&gt;
=== Searching for quorum sensing genes, both known and novel, and any pathways including them in human microbiome data ===&lt;br /&gt;
* Search for and consider making quorum sensing gene DB&lt;br /&gt;
** KEGG has pathways containing both acyl-homoserine lactone and its synthase&lt;br /&gt;
* After indexing known quorum sensing genes, search for homologues&lt;br /&gt;
** WGS data - Obviously search for homologues directly&lt;br /&gt;
** 16S data - Identify organisms and search for homologues in public DBs&lt;br /&gt;
=== Search for &amp;quot;core metabolome&amp;quot; in pioneer organisms from infant studies ===&lt;br /&gt;
* On the off chance I get to publish on this subject, it might be interesting to draw an analogy between genetic evolution involving chance mutations that are occasionally beneficial and are therefore propagated, and microbial colonization which involves chance introduction of microorganisms that are occasionally beneficial and therefore become stable members of the microbiome.&lt;br /&gt;
=== Attempt to search for cases of symbiosis where possible ===&lt;br /&gt;
* Search terms: commensalism, synergism (protocooperation), mutualism, competition, amensalism (antagonism), predation, parasitism, neutralism, syntrophism (nutritional synergism, cross-feeding, examples on pp.16-17)&lt;br /&gt;
&lt;br /&gt;
== Other Potential Research Projects ==&lt;br /&gt;
&lt;br /&gt;
=== Metagenomic assembly ===&lt;br /&gt;
* I&#039;ve been kicking this idea around for months, but none of the people I perceive as better suited to tackle the problem have seemed all that interested. This could be because they&#039;re already committed to large projects, or because they sense trouble. I&#039;ll ask Mihai once I&#039;ve spent a couple of days looking through the literature.&lt;br /&gt;
* Essentially I think the ideal metagenomic assembler would allow for, and gracefully represent, diversity within a single organism, without collapsing the genetic material of an entire community into a single messy contig.&lt;br /&gt;
* The major theoretical challenge would be the development of an algorithm that could differentiate between variation and &amp;quot;speciation&amp;quot; in a biologically meaningful way. This is far from being a new problem.&lt;br /&gt;
** The fundamental limitation of differentiating between species based on sequence divergence is that many times a relatively small number of mutations, a single gene insertion, or a single functional plasmid can impart dramatic new phenotypic properties to a micro-organism without significantly altering the overall sequence. This means that sequence divergence is not necessarily proportional to phenotypic divergence and thus can not be used to differentiate between what I would consider to be biologically meaningful species (micro-organisms with differing phenotypes).&lt;br /&gt;
** As a simple starting point, I would assemble all unitigs. From these seeds, I would extend &lt;br /&gt;
* In addition to the well-known theoretical challenges, coding this in such a way as to make it practical for current and future metagenomic problems would probably require more programming skills than I have the time to master. As such, I&#039;m hoping that maybe I can work out the theory on a smaller scale using Python, and then maybe convince someone else to collaborate with me for the implementation, or just publish my findings and leave it to the scientific community to incorporate my methods into their-favorite-assembler.&lt;br /&gt;
&lt;br /&gt;
=== (pre)Binning to improve metagenomic assembly ===&lt;br /&gt;
* Mihai&#039;s concerns that binning could (and probably would) break assemblies by separating overlapping reads into different bins are valid, but assume the simplest of binning schemes: every read is placed in exactly one bin and the assembler is never allowed to combine reads from multiple bins.&lt;br /&gt;
** This scheme is obviously overly simplistic and offers little value to existing assembly techniques.&lt;br /&gt;
* Motivation&lt;br /&gt;
# Convert a computationally challenging problem into an embarrassingly parallel one&lt;br /&gt;
#* Current metagenomic sequencing projects are generating hundreds of millions of (30-500bp) reads, which traditional assemblers would attempt to load into RAM all at once. Even the average US university likely does not have the computational resources to successfully attempt such an assembly, and it is unlikely that the average group generating the sequences is prepared for this challenge.&lt;br /&gt;
#* Recent attempts to address this problem have focused on massively parallelized assemblers designed to run on large computer clusters (SOAPdenovo) or cloud clusters such as those offered by Amazon and Google (CloudBurst).&lt;br /&gt;
#* An alternative approach is to first attempt to bin reads we expect to be assembled together, and then use traditional assemblers to assemble the reads placed in each bin.&lt;br /&gt;
#** The hope here is that the amount of sequence in each bin would be more similar to traditional clonal sequencing projects and would therefore be more amenable to traditional assembly techniques.&lt;br /&gt;
#** Avoiding Mihai&#039;s concerns would require additional computation that would further inflate the overhead of this approach. These may include:&lt;br /&gt;
#*# Keeping a special bin for reads that could not be placed with high confidence, and allowing each assembly to pull reads from it.&lt;br /&gt;
#*#* This would require some sort of record keeping to ensure promiscuous reads are not added to many assemblies.&lt;br /&gt;
#*# Iterating the assemblies so that contigs from different bins have an opportunity to be combined, including singlets that were not able to be assembled in the first round.&lt;br /&gt;
#*#* This approach would greatly increase the overall runtime, but may essentially allow a relatively small group to assemble a large metagenomic sample with modest computational resources.&lt;br /&gt;
# Improve assemblies by using alternative algorithms to place promiscuous reads&lt;br /&gt;
#* For the most part, assemblers use rather naive criteria to place reads into contigs. This is in part because of the nature of the problem when these assemblers were first being developed. In clonal sequencing, one can safely assume that the vast majority of sequence variation actually comes from sequencing error, and not from variation within the population of that organism within a particular environment. To this end, traditional assemblers have sought to identify and eliminate minor sequence variation within the read set, assemble overlapping reads with relatively clean sequences, and simply break up assemblies in places where there was too much variation to justify any other action.&lt;br /&gt;
&lt;br /&gt;
hmm... Well that&#039;s not where I&#039;d intended to go with that. The more I think about this problem, the more I think my other metagenomic assembly idea is more promising in terms of improving assembly quality. I also have serious doubts about the need in the scientific community for the ability to assemble large metagenomic sequencing projects using modest local computational resources. It seems much more likely to me that these groups will increasingly request money for time on an Amazon cluster when writing their grants, and that an iterative local approach would quickly become obsolete, if it isn&#039;t already.&lt;br /&gt;
&lt;br /&gt;
=== Volker&#039;s Mycobacterial genomes ===&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7284</id>
		<title>User:Tgibbons:Project-Ideas</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7284"/>
		<updated>2010-08-10T01:40:28Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* Metagenomic assembly */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;My weekly progress report just didn&#039;t seem appropriate for my brainstorming after a bit, so I&#039;ve transferred everything here.&lt;br /&gt;
&lt;br /&gt;
== Potential Research Projects Inspired by &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ==&lt;br /&gt;
=== Searching for quorum sensing genes, both known and novel, and any pathways including them in human microbiome data ===&lt;br /&gt;
* Search for and consider making quorum sensing gene DB&lt;br /&gt;
** KEGG has pathways containing both acyl-homoserine lactone and its synthase&lt;br /&gt;
* After indexing known quorum sensing genes, search for homologues&lt;br /&gt;
** WGS data - Obviously search for homologues directly&lt;br /&gt;
** 16S data - Identify organisms and search for homologues in public DBs&lt;br /&gt;
=== Search for &amp;quot;core metabolome&amp;quot; in pioneer organisms from infant studies ===&lt;br /&gt;
* On the off chance I get to publish on this subject, it might be interesting to draw an analogy between genetic evolution involving chance mutations that are occasionally beneficial and are therefore propagated, and microbial colonization which involves chance introduction of microorganisms that are occasionally beneficial and therefore become stable members of the microbiome.&lt;br /&gt;
=== Attempt to search for cases of symbiosis where possible ===&lt;br /&gt;
* Search terms: commensalism, synergism (protocooperation), mutualism, competition, amensalism (antagonism), predation, parasitism, neutralism, syntrophism (nutritional synergism, cross-feeding, examples on pp.16-17)&lt;br /&gt;
&lt;br /&gt;
== Other Potential Research Projects ==&lt;br /&gt;
&lt;br /&gt;
=== Metagenomic assembly ===&lt;br /&gt;
* I&#039;ve been kicking this idea around for months, but none of the people I perceive as better suited to tackle the problem have seemed all that interested. This could be due either to them already having large projects to which they&#039;re committed, or it could be that they sense trouble. I&#039;ll ask Mihai once I&#039;ve spent a couple of days looking through literature.&lt;br /&gt;
* Essentially I think the ideal metagenomic assembler would allow for, and gracefully represent, diversity within a single organism, without collapsing the genetic material of an entire community into a single messy contig.&lt;br /&gt;
* The major theoretical challenge would be the development of an algorithm that could differentiate between variation and &amp;quot;speciation&amp;quot; in a biologically meaningful way. This is far from being a new problem.&lt;br /&gt;
** The fundamental limitation of differentiating between species based on sequence divergence is that many times a relatively small number of mutations, a single gene insertion, or a single functional plasmid can impart dramatic new phenotypic properties to a micro-organism without significantly altering the overall sequence. This means that sequence divergence is not necessarily proportional to phenotypic divergence, and thus cannot be used to differentiate between what I would consider to be biologically meaningful species (micro-organisms with differing phenotypes).&lt;br /&gt;
* In addition to the well-known theoretical challenges, coding this in such a way as to make it practical for current and future metagenomic problems would probably require more programming skills than I have the time to master. As such, I&#039;m hoping that maybe I can work out the theory on a smaller scale using Python, and then maybe convince someone else to collaborate with me for the implementation, or just publish my findings and leave it to the scientific community to incorporate my methods into their-favorite-assembler.&lt;br /&gt;
&lt;br /&gt;
=== (pre)Binning to improve metagenomic assembly ===&lt;br /&gt;
* Mihai&#039;s concerns that binning could (and probably would) break assemblies by separating overlapping reads into different bins are valid, but assume the simplest of binning schemes: every read is placed in exactly one bin, and the assembler is never allowed to combine reads from multiple bins.&lt;br /&gt;
** This scheme is obviously overly simplistic and offers little value to existing assembly techniques.&lt;br /&gt;
* Motivation&lt;br /&gt;
# Convert computationally challenging problem into an embarrassingly parallel problem&lt;br /&gt;
#* Current metagenomic sequencing projects are generating hundreds of millions of (30-500bp) reads, which traditional assemblers would attempt to load into RAM all at once. Even the average US university likely does not have the computational resources to successfully attempt such an assembly, and it is unlikely that the average group generating the sequences is prepared for this challenge.&lt;br /&gt;
#* Recent attempts to address this problem have focused on massively parallelized assemblers designed to run on large computer clusters (SOAPdenovo) or cloud clusters such as those offered by Amazon and Google (CloudBurst).&lt;br /&gt;
#* An alternative approach is to first attempt to bin reads we expect to be assembled together, and then use traditional assemblers to assemble the reads placed in each bin.&lt;br /&gt;
#** The hope here is that the amount of sequence in each bin would be more similar to traditional clonal sequencing projects and would therefore be more amenable to traditional assembly techniques.&lt;br /&gt;
#** Avoiding Mihai&#039;s concerns would require additional computation, further inflating the overhead of this approach. Such measures may include:&lt;br /&gt;
#*# Keeping a special bin for reads that could not be placed with high confidence, and allowing each assembly to pull reads from it.&lt;br /&gt;
#*#* This would require some sort of record keeping to ensure promiscuous reads are not added to many assemblies.&lt;br /&gt;
#*# Iterating the assemblies so that contigs from different bins have an opportunity to be combined, including singlets that could not be assembled in the first round.&lt;br /&gt;
#*#* This approach would greatly increase the overall runtime, but may essentially allow a relatively small group to assemble a large metagenomic sample with modest computational resources.&lt;br /&gt;
# Improve assemblies by using alternative algorithms to place promiscuous reads&lt;br /&gt;
#* For the most part, assemblers use rather naive criteria to place reads into contigs. This is in part a legacy of the setting in which these assemblers were first developed. In clonal sequencing, one can safely assume that the vast majority of sequence variation comes from sequencing error, not from variation within the population of that organism in a particular environment. To this end, traditional assemblers have sought to identify and eliminate minor sequence variation within the read set, assemble overlapping reads with relatively clean sequences, and simply break assemblies apart wherever there was too much variation to justify any other action.&lt;br /&gt;
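The pre-binning workflow above could look roughly like the following in Python. Everything here is a placeholder assumption for illustration: the GC-bucket signature stands in for a real composition- or k-mer-based binning criterion, the ambiguous-base test for a real confidence measure, and sorted() for an actual assembler run on each bin; none of it is any existing assembler&#039;s API.&lt;br /&gt;

```python
from collections import defaultdict

MAX_PLACEMENTS = 2  # bookkeeping cap: how many assemblies may claim an overflow read


def gc_signature(read):
    """Toy binning criterion: a coarse GC-content bucket, standing in for a
    real composition- or k-mer-based binner."""
    gc = sum(1 for base in read if base in "GC") / len(read)
    return round(gc * 4)  # four coarse buckets


def bin_reads(reads):
    """Place confidently binnable reads into bins; everything else goes to a
    shared overflow bin that each assembly may later pull from."""
    bins, overflow = defaultdict(list), []
    for read in reads:
        if "N" in read:  # stand-in confidence test: ambiguous bases = low confidence
            overflow.append(read)
        else:
            bins[gc_signature(read)].append(read)
    return bins, overflow


def assemble_bins(bins, overflow):
    """Independently 'assemble' each bin (embarrassingly parallel in principle),
    letting each bin pull matching overflow reads up to the placement cap."""
    placements = defaultdict(int)  # record keeping for promiscuous reads
    contigs = {}
    for label, reads in bins.items():
        pulled = [r for r in overflow
                  if gc_signature(r) == label and placements[r] != MAX_PLACEMENTS]
        for r in pulled:
            placements[r] += 1
        contigs[label] = sorted(reads + pulled)  # stand-in for a traditional assembler
    return contigs


reads = ["ACGTACGT", "GGGCCCGG", "ACNTACGT", "GCGCGCGC"]
bins, overflow = bin_reads(reads)
contigs = assemble_bins(bins, overflow)
```

The MAX_PLACEMENTS bookkeeping is what keeps promiscuous overflow reads from being added to arbitrarily many assemblies, and the per-bin loop is what would be farmed out in the embarrassingly parallel version.&lt;br /&gt;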
&lt;br /&gt;
hmm... Well that&#039;s not where I&#039;d intended to go with that. The more I think about this problem, the more I think my other metagenomic assembly idea is more promising in terms of improving assembly quality. I also have serious doubts about the need in the scientific community for the ability to assemble large metagenomic sequencing projects using modest local computational resources. It seems much more likely to me that these groups will increasingly request money for time on an Amazon cluster when writing their grants, and that an iterative local approach would quickly become obsolete, if it isn&#039;t already.&lt;br /&gt;
&lt;br /&gt;
=== Volker&#039;s Mycobacterial genomes ===&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7283</id>
		<title>User:Tgibbons:Project-Ideas</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7283"/>
		<updated>2010-08-10T01:31:34Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* Metagenomic assembly */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;After a while, my weekly progress reports no longer seemed like the right place for brainstorming, so I&#039;ve transferred everything here.&lt;br /&gt;
&lt;br /&gt;
== Potential Research Projects Inspired by &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ==&lt;br /&gt;
=== Searching for quorum sensing genes, both known and novel, and any pathways including them in human microbiome data ===&lt;br /&gt;
* Search for and consider making quorum sensing gene DB&lt;br /&gt;
** KEGG has pathways containing both acyl-homoserine lactone and its synthase&lt;br /&gt;
* After indexing known quorum sensing genes, search for homologues&lt;br /&gt;
** WGS data - Obviously search for homologues directly&lt;br /&gt;
** 16S data - Identify organisms and search for homologues in public DBs&lt;br /&gt;
=== Search for &amp;quot;core metabolome&amp;quot; in pioneer organisms from infant studies ===&lt;br /&gt;
* On the off chance I get to publish on this subject, it might be interesting to draw an analogy between genetic evolution involving chance mutations that are occasionally beneficial and are therefore propagated, and microbial colonization which involves chance introduction of microorganisms that are occasionally beneficial and therefore become stable members of the microbiome.&lt;br /&gt;
=== Attempt to search for cases of symbiosis where possible ===&lt;br /&gt;
* Search terms: commensalism, synergism (protocooperation), mutualism, competition, amensalism (antagonism), predation, parasitism, neutralism, syntrophism (nutritional synergism, cross-feeding, examples on pp.16-17)&lt;br /&gt;
&lt;br /&gt;
== Other Potential Research Projects ==&lt;br /&gt;
&lt;br /&gt;
=== Metagenomic assembly ===&lt;br /&gt;
* I&#039;ve been kicking this idea around for months, but none of the people I perceive as better suited to tackle the problem have seemed all that interested. This could be due either to them already having large projects to which they&#039;re committed, or it could be that they sense trouble. I&#039;ll ask Mihai once I&#039;ve spent a couple of days looking through literature.&lt;br /&gt;
* Essentially I think the ideal metagenomic assembler would allow for, and gracefully represent, diversity within a single organism, without collapsing the genetic material of an entire community into a single messy contig.&lt;br /&gt;
* The major theoretical challenge would be the development of an algorithm that could differentiate between variation and &amp;quot;speciation&amp;quot; in a biologically meaningful way. This is far from being a new problem.&lt;br /&gt;
** The fundamental limitation of differentiating between species based on sequence divergence is that many times a relatively small number of mutations, a single gene insertion, or a single functional plasmid can impart dramatic new phenotypic properties to an organism without significantly changing the overall sequence.&lt;br /&gt;
* In addition to the well-known theoretical challenges, coding this in such a way as to make it practical for current and future metagenomic problems would probably require more programming skills than I have the time to master. As such, I&#039;m hoping that maybe I can work out the theory on a smaller scale using Python, and then maybe convince someone else to collaborate with me for the implementation, or just publish my findings and leave it to the scientific community to incorporate my methods into their-favorite-assembler.&lt;br /&gt;
&lt;br /&gt;
=== (pre)Binning to improve metagenomic assembly ===&lt;br /&gt;
* Mihai&#039;s concerns that binning could (and probably would) break assemblies by separating overlapping reads into different bins are valid, but assume the simplest of binning schemes: every read is placed in exactly one bin, and the assembler is never allowed to combine reads from multiple bins.&lt;br /&gt;
** This scheme is obviously overly simplistic and offers little value to existing assembly techniques.&lt;br /&gt;
* Motivation&lt;br /&gt;
# Convert computationally challenging problem into an embarrassingly parallel problem&lt;br /&gt;
#* Current metagenomic sequencing projects are generating hundreds of millions of (30-500bp) reads, which traditional assemblers would attempt to load into RAM all at once. Even the average US university likely does not have the computational resources to successfully attempt such an assembly, and it is unlikely that the average group generating the sequences is prepared for this challenge.&lt;br /&gt;
#* Recent attempts to address this problem have focused on massively parallelized assemblers designed to run on large computer clusters (SOAPdenovo) or cloud clusters such as those offered by Amazon and Google (CloudBurst).&lt;br /&gt;
#* An alternative approach is to first attempt to bin reads we expect to be assembled together, and then use traditional assemblers to assemble the reads placed in each bin.&lt;br /&gt;
#** The hope here is that the amount of sequence in each bin would be more similar to traditional clonal sequencing projects and would therefore be more amenable to traditional assembly techniques.&lt;br /&gt;
#** Avoiding Mihai&#039;s concerns would require additional computation, further inflating the overhead of this approach. Such measures may include:&lt;br /&gt;
#*# Keeping a special bin for reads that could not be placed with high confidence, and allowing each assembly to pull reads from it.&lt;br /&gt;
#*#* This would require some sort of record keeping to ensure promiscuous reads are not added to many assemblies.&lt;br /&gt;
#*# Iterating the assemblies so that contigs from different bins have an opportunity to be combined, including singlets that could not be assembled in the first round.&lt;br /&gt;
#*#* This approach would greatly increase the overall runtime, but may essentially allow a relatively small group to assemble a large metagenomic sample with modest computational resources.&lt;br /&gt;
# Improve assemblies by using alternative algorithms to place promiscuous reads&lt;br /&gt;
#* For the most part, assemblers use rather naive criteria to place reads into contigs. This is in part a legacy of the setting in which these assemblers were first developed. In clonal sequencing, one can safely assume that the vast majority of sequence variation comes from sequencing error, not from variation within the population of that organism in a particular environment. To this end, traditional assemblers have sought to identify and eliminate minor sequence variation within the read set, assemble overlapping reads with relatively clean sequences, and simply break assemblies apart wherever there was too much variation to justify any other action.&lt;br /&gt;
&lt;br /&gt;
hmm... Well that&#039;s not where I&#039;d intended to go with that. The more I think about this problem, the more I think my other metagenomic assembly idea is more promising in terms of improving assembly quality. I also have serious doubts about the need in the scientific community for the ability to assemble large metagenomic sequencing projects using modest local computational resources. It seems much more likely to me that these groups will increasingly request money for time on an Amazon cluster when writing their grants, and that an iterative local approach would quickly become obsolete, if it isn&#039;t already.&lt;br /&gt;
&lt;br /&gt;
=== Volker&#039;s Mycobacterial genomes ===&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7282</id>
		<title>User:Tgibbons:Project-Ideas</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7282"/>
		<updated>2010-08-10T01:17:13Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* Potential Research Projects Inspired by &amp;#039;&amp;#039;Microbial Inhabitants of Humans&amp;#039;&amp;#039; */ Reformatted the headers for each project&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;After a while, my weekly progress reports no longer seemed like the right place for brainstorming, so I&#039;ve transferred everything here.&lt;br /&gt;
&lt;br /&gt;
== Potential Research Projects Inspired by &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ==&lt;br /&gt;
=== Searching for quorum sensing genes, both known and novel, and any pathways including them in human microbiome data ===&lt;br /&gt;
* Search for and consider making quorum sensing gene DB&lt;br /&gt;
** KEGG has pathways containing both acyl-homoserine lactone and its synthase&lt;br /&gt;
* After indexing known quorum sensing genes, search for homologues&lt;br /&gt;
** WGS data - Obviously search for homologues directly&lt;br /&gt;
** 16S data - Identify organisms and search for homologues in public DBs&lt;br /&gt;
=== Search for &amp;quot;core metabolome&amp;quot; in pioneer organisms from infant studies ===&lt;br /&gt;
* On the off chance I get to publish on this subject, it might be interesting to draw an analogy between genetic evolution involving chance mutations that are occasionally beneficial and are therefore propagated, and microbial colonization which involves chance introduction of microorganisms that are occasionally beneficial and therefore become stable members of the microbiome.&lt;br /&gt;
=== Attempt to search for cases of symbiosis where possible ===&lt;br /&gt;
* Search terms: commensalism, synergism (protocooperation), mutualism, competition, amensalism (antagonism), predation, parasitism, neutralism, syntrophism (nutritional synergism, cross-feeding, examples on pp.16-17)&lt;br /&gt;
&lt;br /&gt;
== Other Potential Research Projects ==&lt;br /&gt;
&lt;br /&gt;
=== Metagenomic assembly ===&lt;br /&gt;
* I&#039;ve been kicking this idea around for months, but none of the people I perceive as better suited to tackle the problem have seemed all that interested. This could be due either to them already having large projects to which they&#039;re committed, or it could be that they sense trouble. I&#039;ll ask Mihai once I&#039;ve spent a couple of days looking through literature.&lt;br /&gt;
* Essentially I think the ideal metagenomic assembler would allow for, and gracefully represent, diversity within a single organism, without collapsing the genetic material of an entire community into a single messy contig.&lt;br /&gt;
* The major theoretical challenge would be the development of an algorithm that could differentiate between variation and &amp;quot;speciation&amp;quot; in a biologically meaningful way. This is far from being a new problem.&lt;br /&gt;
* In addition to the well-known theoretical challenges, coding this in such a way as to make it practical for current and future metagenomic problems would probably require more programming skills than I have the time to master. As such, I&#039;m hoping that maybe I can work out the theory on a smaller scale using Python, and then maybe convince someone else to collaborate with me for the implementation, or just publish my findings and leave it to the scientific community to incorporate my methods into their-favorite-assembler.&lt;br /&gt;
&lt;br /&gt;
=== (pre)Binning to improve metagenomic assembly ===&lt;br /&gt;
* Mihai&#039;s concerns that binning could (and probably would) break assemblies by separating overlapping reads into different bins are valid, but assume the simplest of binning schemes: every read is placed in exactly one bin, and the assembler is never allowed to combine reads from multiple bins.&lt;br /&gt;
** This scheme is obviously overly simplistic and offers little value to existing assembly techniques.&lt;br /&gt;
* Motivation&lt;br /&gt;
# Convert computationally challenging problem into an embarrassingly parallel problem&lt;br /&gt;
#* Current metagenomic sequencing projects are generating hundreds of millions of (30-500bp) reads, which traditional assemblers would attempt to load into RAM all at once. Even the average US university likely does not have the computational resources to successfully attempt such an assembly, and it is unlikely that the average group generating the sequences is prepared for this challenge.&lt;br /&gt;
#* Recent attempts to address this problem have focused on massively parallelized assemblers designed to run on large computer clusters (SOAPdenovo) or cloud clusters such as those offered by Amazon and Google (CloudBurst).&lt;br /&gt;
#* An alternative approach is to first attempt to bin reads we expect to be assembled together, and then use traditional assemblers to assemble the reads placed in each bin.&lt;br /&gt;
#** The hope here is that the amount of sequence in each bin would be more similar to traditional clonal sequencing projects and would therefore be more amenable to traditional assembly techniques.&lt;br /&gt;
#** Avoiding Mihai&#039;s concerns would require additional computation, further inflating the overhead of this approach. Such measures may include:&lt;br /&gt;
#*# Keeping a special bin for reads that could not be placed with high confidence, and allowing each assembly to pull reads from it.&lt;br /&gt;
#*#* This would require some sort of record keeping to ensure promiscuous reads are not added to many assemblies.&lt;br /&gt;
#*# Iterating the assemblies so that contigs from different bins have an opportunity to be combined, including singlets that could not be assembled in the first round.&lt;br /&gt;
#*#* This approach would greatly increase the overall runtime, but may essentially allow a relatively small group to assemble a large metagenomic sample with modest computational resources.&lt;br /&gt;
# Improve assemblies by using alternative algorithms to place promiscuous reads&lt;br /&gt;
#* For the most part, assemblers use rather naive criteria to place reads into contigs. This is in part a legacy of the setting in which these assemblers were first developed. In clonal sequencing, one can safely assume that the vast majority of sequence variation comes from sequencing error, not from variation within the population of that organism in a particular environment. To this end, traditional assemblers have sought to identify and eliminate minor sequence variation within the read set, assemble overlapping reads with relatively clean sequences, and simply break assemblies apart wherever there was too much variation to justify any other action.&lt;br /&gt;
&lt;br /&gt;
hmm... Well that&#039;s not where I&#039;d intended to go with that. The more I think about this problem, the more I think my other metagenomic assembly idea is more promising in terms of improving assembly quality. I also have serious doubts about the need in the scientific community for the ability to assemble large metagenomic sequencing projects using modest local computational resources. It seems much more likely to me that these groups will increasingly request money for time on an Amazon cluster when writing their grants, and that an iterative local approach would quickly become obsolete, if it isn&#039;t already.&lt;br /&gt;
&lt;br /&gt;
=== Volker&#039;s Mycobacterial genomes ===&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons&amp;diff=7281</id>
		<title>User:Tgibbons</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons&amp;diff=7281"/>
		<updated>2010-08-10T01:15:03Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: Added link to my project ideas page&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;*[[Cbcb:Pop-Lab:Ted-Report | Ted&#039;s Progress Reports]]&lt;br /&gt;
*[[User:Tgibbons:Project-Ideas | Ted&#039;s Project Ideas]]&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7280</id>
		<title>User:Tgibbons:Project-Ideas</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7280"/>
		<updated>2010-08-09T17:47:43Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* (pre)Binning to improve metagenomic assembly */ Added more to the prebinning entry&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;After a while, my weekly progress reports no longer seemed like the right place for brainstorming, so I&#039;ve transferred everything here.&lt;br /&gt;
&lt;br /&gt;
== Potential Research Projects Inspired by &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ==&lt;br /&gt;
# Searching for quorum sensing genes, both known and novel, and any pathways including them in human microbiome data&lt;br /&gt;
#* Search for and consider making quorum sensing gene DB&lt;br /&gt;
#** KEGG has pathways containing both acyl-homoserine lactone and its synthase&lt;br /&gt;
#* After indexing known quorum sensing genes, search for homologues&lt;br /&gt;
#** WGS data - Obviously search for homologues directly&lt;br /&gt;
#** 16S data - Identify organisms and search for homologues in public DBs&lt;br /&gt;
# Search for &amp;quot;core metabolome&amp;quot; in pioneer organisms from infant studies&lt;br /&gt;
#* On the off chance I get to publish on this subject, it might be interesting to draw an analogy between genetic evolution involving chance mutations that are occasionally beneficial and are therefore propagated, and microbial colonization which involves chance introduction of microorganisms that are occasionally beneficial and therefore become stable members of the microbiome.&lt;br /&gt;
# Attempt to search for cases of symbiosis where possible&lt;br /&gt;
#* Search terms: commensalism, synergism (protocooperation), mutualism, competition, amensalism (antagonism), predation, parasitism, neutralism, syntrophism (nutritional synergism, cross-feeding, examples on pp.16-17)&lt;br /&gt;
&lt;br /&gt;
== Other Potential Research Projects ==&lt;br /&gt;
&lt;br /&gt;
=== Metagenomic assembly ===&lt;br /&gt;
* I&#039;ve been kicking this idea around for months, but none of the people I perceive as better suited to tackle the problem have seemed all that interested. This could be due either to them already having large projects to which they&#039;re committed, or it could be that they sense trouble. I&#039;ll ask Mihai once I&#039;ve spent a couple of days looking through literature.&lt;br /&gt;
* Essentially I think the ideal metagenomic assembler would allow for, and gracefully represent, diversity within a single organism, without collapsing the genetic material of an entire community into a single messy contig.&lt;br /&gt;
* The major theoretical challenge would be the development of an algorithm that could differentiate between variation and &amp;quot;speciation&amp;quot; in a biologically meaningful way. This is far from being a new problem.&lt;br /&gt;
* In addition to the well-known theoretical challenges, coding this in such a way as to make it practical for current and future metagenomic problems would probably require more programming skills than I have the time to master. As such, I&#039;m hoping that maybe I can work out the theory on a smaller scale using Python, and then maybe convince someone else to collaborate with me for the implementation, or just publish my findings and leave it to the scientific community to incorporate my methods into their-favorite-assembler.&lt;br /&gt;
&lt;br /&gt;
=== (pre)Binning to improve metagenomic assembly ===&lt;br /&gt;
* Mihai&#039;s concerns that binning could (and probably would) break assemblies by separating overlapping reads into different bins are valid, but assume the simplest of binning schemes: every read is placed in exactly one bin, and the assembler is never allowed to combine reads from multiple bins.&lt;br /&gt;
** This scheme is obviously overly simplistic and offers little value to existing assembly techniques.&lt;br /&gt;
* Motivation&lt;br /&gt;
# Convert computationally challenging problem into an embarrassingly parallel problem&lt;br /&gt;
#* Current metagenomic sequencing projects are generating hundreds of millions of (30-500bp) reads, which traditional assemblers would attempt to load into RAM all at once. Even the average US university likely does not have the computational resources to successfully attempt such an assembly, and it is unlikely that the average group generating the sequences is prepared for this challenge.&lt;br /&gt;
#* Recent attempts to address this problem have focused on massively parallelized assemblers designed to run on large computer clusters (SOAPdenovo) or cloud clusters such as those offered by Amazon and Google (CloudBurst).&lt;br /&gt;
#* An alternative approach is to first attempt to bin reads we expect to be assembled together, and then use traditional assemblers to assemble the reads placed in each bin.&lt;br /&gt;
#** The hope here is that the amount of sequence in each bin would be more similar to traditional clonal sequencing projects and would therefore be more amenable to traditional assembly techniques.&lt;br /&gt;
#** Avoiding Mihai&#039;s concerns would require additional computation, further inflating the overhead of this approach. Such measures may include:&lt;br /&gt;
#*# Keeping a special bin for reads that could not be placed with high confidence, and allowing each assembly to pull reads from it.&lt;br /&gt;
#*#* This would require some sort of record keeping to ensure promiscuous reads are not added to many assemblies.&lt;br /&gt;
#*# Iterating the assemblies so that contigs from different bins have an opportunity to be combined, including singlets that could not be assembled in the first round.&lt;br /&gt;
#*#* This approach would greatly increase the overall runtime, but may essentially allow a relatively small group to assemble a large metagenomic sample with modest computational resources.&lt;br /&gt;
# Improve assemblies by using alternative algorithms to place promiscuous reads&lt;br /&gt;
#* For the most part, assemblers use rather naive criteria to place reads into contigs. This is in part a legacy of the setting in which these assemblers were first developed. In clonal sequencing, one can safely assume that the vast majority of sequence variation comes from sequencing error, not from variation within the population of that organism in a particular environment. To this end, traditional assemblers have sought to identify and eliminate minor sequence variation within the read set, assemble overlapping reads with relatively clean sequences, and simply break assemblies apart wherever there was too much variation to justify any other action.&lt;br /&gt;
&lt;br /&gt;
hmm... Well that&#039;s not where I&#039;d intended to go with that. The more I think about this problem, the more I think my other metagenomic assembly idea is more promising in terms of improving assembly quality. I also have serious doubts about the need in the scientific community for the ability to assemble large metagenomic sequencing projects using modest local computational resources. It seems much more likely to me that these groups will increasingly request money for time on an Amazon cluster when writing their grants, and that an iterative local approach would quickly become obsolete, if it isn&#039;t already.&lt;br /&gt;
&lt;br /&gt;
=== Volker&#039;s Mycobacterial genomes ===&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7279</id>
		<title>User:Tgibbons:Project-Ideas</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7279"/>
		<updated>2010-08-09T17:32:41Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* Other Potential Research Projects */ Added a bunch of stuff about the prebinning project with Arthur&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;After a while, my weekly progress report just didn&#039;t seem like the appropriate place for my brainstorming, so I&#039;ve transferred everything here.&lt;br /&gt;
&lt;br /&gt;
== Potential Research Projects Inspired by &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ==&lt;br /&gt;
# Searching for quorum sensing genes, both known and novel, and any pathways including them in human microbiome data&lt;br /&gt;
#* Search for and consider making quorum sensing gene DB&lt;br /&gt;
#** KEGG has pathways containing both acyl-homoserine lactone and its synthase&lt;br /&gt;
#* After indexing known quorum sensing genes, search for homologues&lt;br /&gt;
#** WGS data - Obviously search for homologues directly&lt;br /&gt;
#** 16S data - Identify organisms and search for homologues in public DBs&lt;br /&gt;
# Search for &amp;quot;core metabolome&amp;quot; in pioneer organisms from infant studies&lt;br /&gt;
#* On the off chance I get to publish on this subject, it might be interesting to draw an analogy between genetic evolution involving chance mutations that are occasionally beneficial and are therefore propagated, and microbial colonization which involves chance introduction of microorganisms that are occasionally beneficial and therefore become stable members of the microbiome.&lt;br /&gt;
# Attempt to search for cases of symbiosis where possible&lt;br /&gt;
#* Search terms: commensalism, synergism (protocooperation), mutualism, competition, amensalism (antagonism), predation, parasitism, neutralism, syntrophism (nutritional synergism, cross-feeding, examples on pp.16-17)&lt;br /&gt;
&lt;br /&gt;
== Other Potential Research Projects ==&lt;br /&gt;
&lt;br /&gt;
=== Metagenomic assembly ===&lt;br /&gt;
* I&#039;ve been kicking this idea around the floor for months, but none of the people I perceive as being better suited to tackle the problem have appeared all that interested. This could be due either to them already having large projects to which they&#039;re committed, or it could be that they sense trouble. I&#039;ll ask Mihai once I&#039;ve spent a couple of days looking through literature.&lt;br /&gt;
* Essentially I think the ideal metagenomic assembler would allow for, and gracefully represent, diversity within a single organism, without collapsing the genetic material of an entire community into a single messy contig.&lt;br /&gt;
* The major theoretical challenge would be the development of an algorithm that could differentiate between variation and &amp;quot;speciation&amp;quot; in a biologically meaningful way. This is far from being a new problem.&lt;br /&gt;
* In addition to the well-known theoretical challenges, coding this in such a way as to make it practical for current and future metagenomic problems would probably require more programming skills than I have the time to master. As such, I&#039;m hoping that maybe I can work out the theory on a smaller scale using Python, and then maybe convince someone else to collaborate with me for the implementation, or just publish my findings and leave it to the scientific community to incorporate my methods into their-favorite-assembler.&lt;br /&gt;
&lt;br /&gt;
=== (pre)Binning to improve metagenomic assembly ===&lt;br /&gt;
* Mihai&#039;s concerns that binning could (and probably would) break assemblies by separating overlapping reads into different bins are valid, but assume the simplest of binning schemes: every read is placed in exactly one bin, and the assembler is never allowed to combine reads from multiple bins.&lt;br /&gt;
** This scheme is obviously overly simplistic and offers little value to existing assembly techniques.&lt;br /&gt;
* Motivation&lt;br /&gt;
# Convert computationally challenging problem into an embarrassingly parallel problem&lt;br /&gt;
#* Current metagenomic sequencing projects are generating hundreds of millions of (30-500bp) reads, which traditional assemblers would attempt to load into RAM all at once. Even the average US university likely does not have the computational resources to attempt such an assembly successfully, and it is unlikely that the average group generating the sequences is prepared for this challenge.&lt;br /&gt;
#* Recent attempts to address this problem have focused on massively parallelized assemblers designed to run on large computer clusters (SOAPdenovo) or cloud clusters such as those offered by Amazon and Google (CloudBurst).&lt;br /&gt;
#* An alternative approach is to first attempt to bin reads we expect to be assembled together, and then use traditional assemblers to assemble the reads placed in each bin.&lt;br /&gt;
#** The hope here is that the amount of sequence in each bin would be more similar to traditional clonal sequencing projects and would therefore be more amenable to traditional assembly techniques.&lt;br /&gt;
#** Avoiding Mihai&#039;s concerns would require additional computation that would further inflate the overhead of this approach. These may include:&lt;br /&gt;
#*# Keeping a special bin for reads that could not be placed with high confidence, and allowing each assembly to pull reads from it.&lt;br /&gt;
#*** This would require some sort of record keeping to ensure promiscuous reads are not added to many assemblies.&lt;br /&gt;
#*# Iterating the assemblies so that contigs from different bins have an opportunity to be combined, including singlets that were not able to be assembled in the first round.&lt;br /&gt;
#*** This approach would greatly increase the overall runtime, but may essentially allow a relatively small group to assemble a large metagenomic sample with modest computational resources.&lt;br /&gt;
# Improve assemblies by using alternative algorithms to place promiscuous reads&lt;br /&gt;
#* For the most part, assemblers use rather naive criteria to place reads into contigs.&lt;br /&gt;
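The bin-plus-shared-pool bookkeeping described in the motivation above could be sketched roughly as follows. This is an illustrative prototype only; all names (`assign_reads`, `MAX_PLACEMENTS`, and so on) are my own, not from any real assembler.

```python
# Hypothetical sketch of the binning bookkeeping: confidently placed reads
# go to a single bin, low-confidence reads go to a shared pool, and a
# per-read counter keeps promiscuous reads out of too many assemblies.
from collections import defaultdict

MAX_PLACEMENTS = 2  # cap on how many bins may borrow one promiscuous read

def assign_reads(read_scores, confidence_cutoff):
    """read_scores maps read_id -> {bin_id: score}; returns (bins, shared_pool)."""
    bins = defaultdict(set)
    shared_pool = set()
    for read_id, scores in read_scores.items():
        best_bin, best = max(scores.items(), key=lambda kv: kv[1])
        if best >= confidence_cutoff:
            bins[best_bin].add(read_id)
        else:
            shared_pool.add(read_id)  # defer low-confidence reads
    return bins, shared_pool

def borrow_from_pool(bin_reads, shared_pool, placements):
    """Let one bin's assembly pull pool reads, respecting the placement cap."""
    borrowed = set()
    for read_id in shared_pool:
        if placements[read_id] < MAX_PLACEMENTS:
            placements[read_id] += 1
            borrowed.add(read_id)
    return bin_reads | borrowed
```

Each bin's merged read set would then be handed to a traditional assembler, and the `placements` record is exactly the "record keeping" mentioned above.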
&lt;br /&gt;
=== Volker&#039;s Mycobacterial genomes ===&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7278</id>
		<title>User:Tgibbons:Project-Ideas</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7278"/>
		<updated>2010-08-09T16:08:11Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* Other Potential Research Projects */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;After a while, my weekly progress report just didn&#039;t seem like the appropriate place for my brainstorming, so I&#039;ve transferred everything here.&lt;br /&gt;
&lt;br /&gt;
== Potential Research Projects Inspired by &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ==&lt;br /&gt;
# Searching for quorum sensing genes, both known and novel, and any pathways including them in human microbiome data&lt;br /&gt;
#* Search for and consider making quorum sensing gene DB&lt;br /&gt;
#** KEGG has pathways containing both acyl-homoserine lactone and its synthase&lt;br /&gt;
#* After indexing known quorum sensing genes, search for homologues&lt;br /&gt;
#** WGS data - Obviously search for homologues directly&lt;br /&gt;
#** 16S data - Identify organisms and search for homologues in public DBs&lt;br /&gt;
# Search for &amp;quot;core metabolome&amp;quot; in pioneer organisms from infant studies&lt;br /&gt;
#* On the off chance I get to publish on this subject, it might be interesting to draw an analogy between genetic evolution involving chance mutations that are occasionally beneficial and are therefore propagated, and microbial colonization which involves chance introduction of microorganisms that are occasionally beneficial and therefore become stable members of the microbiome.&lt;br /&gt;
# Attempt to search for cases of symbiosis where possible&lt;br /&gt;
#* Search terms: commensalism, synergism (protocooperation), mutualism, competition, amensalism (antagonism), predation, parasitism, neutralism, syntrophism (nutritional synergism, cross-feeding, examples on pp.16-17)&lt;br /&gt;
&lt;br /&gt;
== Other Potential Research Projects ==&lt;br /&gt;
# Metagenomic assembly&lt;br /&gt;
#* I&#039;ve been kicking this idea around the floor for months, but none of the people I perceive as being better suited to tackle the problem have appeared all that interested. This could be due either to them already having large projects to which they&#039;re committed, or it could be that they sense trouble. I&#039;ll ask Mihai once I&#039;ve spent a couple of days looking through literature.&lt;br /&gt;
#* Essentially I think the ideal metagenomic assembler would allow for, and gracefully represent, diversity within a single organism, without collapsing the genetic material of an entire community into a single messy contig.&lt;br /&gt;
#* The major theoretical challenge would be the development of an algorithm that could differentiate between variation and &amp;quot;speciation&amp;quot; in a biologically meaningful way. This is far from being a new problem.&lt;br /&gt;
#* In addition to the well-known theoretical challenges, coding this in such a way as to make it practical for current and future metagenomic problems would probably require more programming skills than I have the time to master. As such, I&#039;m hoping that maybe I can work out the theory on a smaller scale using Python, and then maybe convince someone else to collaborate with me for the implementation, or just publish my findings and leave it to the scientific community to incorporate my methods into their-favorite-assembler.&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Ted-Report&amp;diff=7277</id>
		<title>Cbcb:Pop-Lab:Ted-Report</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Ted-Report&amp;diff=7277"/>
		<updated>2010-08-09T15:55:10Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* Other Potential Research Projects */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Older Entries ==&lt;br /&gt;
[[Cbcb:Pop-Lab:Ted-Report-2009 | 2009]]&lt;br /&gt;
&lt;br /&gt;
== January 15, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Minimus Documentation ===&lt;br /&gt;
&lt;br /&gt;
Presently, the only relevant Google hit for &amp;quot;minimus&amp;quot; on the first page of results is the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus#Basic_usage_example sourceforge wiki.] The only example on this page is incomplete and appears to be an early draft made during development.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Ideally, it should be easy to find a complete guide with the general format:&lt;br /&gt;
* Simple use case:&lt;br /&gt;
 `toAmos -s path/to/fastaFile.seq -o path/to/fastaFile.afg`&lt;br /&gt;
 `minimus path/to/fastaFile(prefix)`&lt;br /&gt;
* Necessary tools for set up (toAmos)&lt;br /&gt;
* Other options&lt;br /&gt;
* etc&lt;br /&gt;
&lt;br /&gt;
The description found on the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus/README Minimus/README] page (linked to from the middle of the starting page) is more appropriate, but features use cases that may no longer be common and references another required tool (toAmos) without linking to it or describing how to access it. A description of this tool can be found on the Amos [http://sourceforge.net/apps/mediawiki/amos/index.php?title=File_conversion_utilities File Conversion Utilities] page (again, linked to from the starting page), but it is less organized than what I&#039;ve come to expect from a project page and it is easy to get lost or distracted by the rest of the Amos documentation while trying to piece together the necessary steps for a basic assembly.&lt;br /&gt;
&lt;br /&gt;
=== Comparative Network Analysis pt. 2 ===&lt;br /&gt;
* Meeting with Volker this Friday to discuss how best to apply network alignment to what he&#039;s doing&lt;br /&gt;
* I&#039;m simultaneously trying to find a way to apply my network alignment technique to predicting genes in metagenomic samples&lt;br /&gt;
** I&#039;ve been trying to find a way to get beyond the restriction that my current program requires genes to be annotated with an EC number. A potentially interesting next step may be to use BioPython to BLAST the sequence of each enzyme annotated in every micro-organism in KEGG against a metagenomic library.&lt;br /&gt;
*** The results would be stretches of linked reactions that have been annotated in KEGG pathways.&lt;br /&gt;
*** This method could be applied to contigs just as easily as to finished sequences. In a low-coverage scenario, it could be used to identify genes that are probably present but simply weren&#039;t sampled, by showing the presence of the rest of the pathway. In short, this could finally accomplish what Mihai asked me to work on when I showed up.&lt;br /&gt;
*** The major theoretical shortcoming of this approach is that it could only identify relatively well characterized pathways.&lt;br /&gt;
*** The practical challenges of this approach start with obtaining a fairly complete copy of KEGG (which, as we&#039;ve learned, is a mess to parse locally and unusably slow to call through the API), and continue to the computational cost of such a large-scale BLAST operation.&lt;br /&gt;
** Ask Bo about this when he gets back. He may have already done this.&lt;br /&gt;
&lt;br /&gt;
== January 22, 2010 ==&lt;br /&gt;
* Met with Dan and Sergey to talk about the Minimus-Bambus pipeline&lt;br /&gt;
** Minimus is running fine. I&#039;ve begun characterizing its run-time behavior (see next week&#039;s entry)&lt;br /&gt;
** After some tweaking by Sergey, Bambus was able to finish but did not generate a scaffold. We&#039;re going to talk about this after the meeting on Monday.&lt;br /&gt;
** Sergey had an interesting idea for making a better read simulator:&lt;br /&gt;
*** Error-free reads are cheap and easy to generate. The problem is with the error model.&lt;br /&gt;
*** The &amp;quot;best&amp;quot; tool (that we are aware of) which includes error models is MetaSim, but its error models are years out of date and the authors have historically been unreachable. While Mihai has now shown me how to edit the models in a reasonable way from flat files, allowing base substitutions to be characterized, I&#039;m not convinced it would be faster or easier to write a program to modify these files than it would be to just write an entirely new program; and given the amount of time I&#039;ve spent trying to use MetaSim, I&#039;m more than ready to walk away from it. Oh yeah, and MetaSim doesn&#039;t work from the command line, so no scripting.&lt;br /&gt;
*** Sergey has pointed out that most companies will assemble &#039;&#039;E. coli&#039;&#039; when they release a new sequencer. Conveniently, there are many high quality assemblies of &#039;&#039;E. coli&#039;&#039; available for reference. It might therefore be possible to generate new error models for these sequencers in an automated fashion by mapping the &#039;&#039;E. coli&#039;&#039; reads to the available reference genomes, collecting the error frequencies, and then using them to mask synthesized reads.&lt;br /&gt;
*** I also talked with Mohammad and Mihai about this, and they also seemed to think it was a pretty good idea. Mihai has proposed having Sergey or Mohammad add the described error-model generator to his read sampler (written in C) when they have time, but not in preparation for the oral microbiome data.&lt;br /&gt;
&lt;br /&gt;
* Met with James to discuss my work with Volker&lt;br /&gt;
** Told him about my meeting with Volker and the paper he wanted me to prepare, more or less by myself. The main points of the paper are these:&lt;br /&gt;
*** Most available genomic sequences of mycobacteria are of a very small subset of highly pathogenic organisms.&lt;br /&gt;
*** Subtractive comparative genomics can be used to identify genes that are potentially responsible for differing phenotypes (such as extreme pathogenicity), but there must be available genomic sequences for closely related organisms with differing phenotypes.&lt;br /&gt;
*** Volker has sequenced 2 more non-pathogenic strains of mycobacteria (&#039;&#039;gastri&#039;&#039;, and &#039;&#039;kansasiiW58&#039;&#039;) with the intention of increasing the effectiveness of these subtractive comparative genomic studies.&lt;br /&gt;
*** The meat of the paper would be comparing the results of subtractive comparative genomic analysis using all currently available strains in RefSeq, with the results from also using the two novel sequences.&lt;br /&gt;
*** The other, smaller publishable portion of this project would be a comparison of &#039;&#039;gastri&#039;&#039; and &#039;&#039;kansasiiW58&#039;&#039; to each other, because they are thought to be extremely closely related and yet have distinct phenotypes (which I&#039;ve now forgotten).&lt;br /&gt;
*** James seemed to think this could make an okay paper, and he confirmed that he had not realized that Volker was looking for someone to do all of the analysis, both computational and biological, with Volker only contributing analysis of the analysis after it was all over.&lt;br /&gt;
** Ended up also discussing his work on differential abundance in populations of microorganisms.&lt;br /&gt;
*** I&#039;m going to start working on taking over and expanding Metastats this semester.&lt;br /&gt;
*** I&#039;m also going to start talking to Bo when he gets back about exactly what he&#039;s doing, and how I might be able to include pathway prediction in my expansion of Metastats without stepping on his toes.&lt;br /&gt;
*** Mihai has given me his approval to focus on this.&lt;br /&gt;
&lt;br /&gt;
* Met with Mihai to discuss working with Volker&lt;br /&gt;
** Explained that rather than looking for someone to do only the complex portions of the computational analysis, Volker was/is looking for someone to do the complete analysis.&lt;br /&gt;
** In exchange, Volker is offering first authorship and, if need be, to split the student&#039;s funding with their primary PI.&lt;br /&gt;
** I think I&#039;m capable of doing this within 3 or 4 months but it would consume my time pretty thoroughly.&lt;br /&gt;
** Mihai agreed that this is a reasonable deal, but noted that I have no personal interest in studying mycobacteria, and that it&#039;s therefore unwise of me to invest a lot of time becoming an expert on an organism I have no interest in continuing to study or work with. I&#039;ve therefore offered to work closely with one of Volker&#039;s graduate students, who could meet with me every week or two. I would be willing to do all of the computational analysis and explain it to them, but they would have to actually look up the potentially interesting genes and relationships I discover and help me keep the analysis biologically interesting and relevant.&lt;br /&gt;
&lt;br /&gt;
* Met with Mihai and Mohammad to discuss our impending huge-ass(embly) problem&lt;br /&gt;
** Talked about strategies for iterative assembly as an approach to assembling intractably large data sets. Most have glaring shortcomings and complications.&lt;br /&gt;
** Discovered Mike Schatz has a map-reduce implementation of an assembler that uses De Bruijn graphs and is better suited to assemblies with high coverage but short read lengths.&lt;br /&gt;
&lt;br /&gt;
== January 29, 2010 ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Minimus Performance Analysis ===&lt;br /&gt;
I&#039;m testing minimus and bambus in preparation for the oral microbiome data, and after spamming several lab members with email, it occurred to me that it would be considerably more considerate to put the information here instead.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Memory Usage Analysis&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|RAM used by the Overlapper (in GB):&lt;br /&gt;
| 1.2 || 2.4 || 4.5 || 8.7 || 17 || 21.5 || ~1.1 GB * (#Reads in Millions) = (Memory Used)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|RAM used by the Tigger (in GB):&lt;br /&gt;
| 3 || 6 || 12 || 25 || 48.4 || (60) || ~3 GB * (#Reads in Millions) = (Memory Used)&lt;br /&gt;
|}&lt;br /&gt;
* The 16 million read assembly data is from Walnut; all other numbers are rough averages from both Privet and Walnut.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Run Time Analysis on Privet&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Overlapper (in min):&lt;br /&gt;
| 3 || 9 || 34 || 130 || (576) || 783 || 2.96 * (#Reads in Millions)&amp;lt;sup&amp;gt;1.87&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Tigger (in min):&lt;br /&gt;
| 9 || 66 || 473 || (3,456) || (25,088) || (47,493) || 9.03 * (#Reads in Millions)&amp;lt;sup&amp;gt;2.86&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|}&lt;br /&gt;
* Privet has 2.4GHz Opteron 850 processors and 32GB of RAM. Minimus is not parallelized and therefore only uses a single core.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
* The models were generated by plotting the data points in OpenOffice and fitting a polynomial trendline. The R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; value for each was 1.&lt;br /&gt;
* &#039;&#039;&#039;For reference: There are 1,440 minutes in one day, and 10,080 minutes in one week&#039;&#039;&#039;&lt;br /&gt;
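The fitted Privet models above translate directly into a back-of-the-envelope calculator. This is just a sketch: the coefficients are copied from the tables and are rounded, so predictions only approximate the parenthesized values.

```python
# Back-of-the-envelope calculator for the Privet run-time and memory
# models. Coefficients are the (rounded) values from the tables above.

def overlapper_minutes(million_reads):
    """Privet overlapper run time: 2.96 * n^1.87 minutes for n million reads."""
    return 2.96 * million_reads ** 1.87

def tigger_minutes(million_reads):
    """Privet tigger run time: 9.03 * n^2.86 minutes for n million reads."""
    return 9.03 * million_reads ** 2.86

def tigger_ram_gb(million_reads):
    """Tigger memory model: roughly 3 GB per million reads."""
    return 3.0 * million_reads

MINUTES_PER_DAY = 1440
```

For example, `tigger_minutes(8)` comes out near the table's predicted 3,456 minutes, i.e. roughly 2.4 days on a single core.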
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Run Time Analysis on Walnut&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Overlapper (in min):&lt;br /&gt;
| 2.7 || 8 || 27.5 || 102 || (325) || (481.5) || 2.54 * (#Reads in Millions)&amp;lt;sup&amp;gt;1.75&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Tigger (in min):&lt;br /&gt;
| 14 || 81 || 471.5 || (2,752) || (16,006) || (28,212) || 13.99 * (#Reads in Millions)&amp;lt;sup&amp;gt;2.54&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|}&lt;br /&gt;
* Walnut has 2.8GHz Opteron 875 processors and 64GB of RAM. Minimus is not parallelized and therefore only uses a single core.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
* The models were generated by plotting the data points in OpenOffice and fitting a polynomial trendline. The R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; value for each was 1.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Other Observations About the Assemblies ====&lt;br /&gt;
* Because of the short read length, every million reads is only 75MB of sequence. This is roughly 10-20x coverage of an average single bacterium. Since these test sets have reads sampled from roughly 100 bacterial genomic sequences, I would expect the coverage to be on the order of 0.1x on average.&lt;br /&gt;
* Unsurprisingly, a cursory glance through the contig files shows that each contig is only composed of about 2 or 3 reads.&lt;br /&gt;
* The N50 analysis for the smaller assemblies shows that only 2-3 reads are being added to each contig on average, leaving both N50s and average lengths just below 150bp.&lt;br /&gt;
* Therefore if the complexity of the oral microbiome data is high and/or the contamination of human DNA is extreme (80-95%), the coverage may be extremely low. This may make the use of Mike&#039;s assembler impractical, or at least that&#039;s how I&#039;m going to keep justifying this testing to myself until someone corrects me.&lt;br /&gt;
** &#039;&#039;&#039;Update:&#039;&#039;&#039; Apparently Mike and Dan have talked about this, and somewhere around 75-80bp, the performance of minimus catches up with Mike&#039;s de Bruijn graph assembler anyway. I also did not know that Dan&#039;s map-reduce minimus was running and would be used to assemble the data alongside Mike&#039;s.&lt;br /&gt;
* I learned on Feb. 1, 2010 that the 454 error model allows wild variation with respect to read length, so these assemblies might not actually be representative of the performance with the Illumina data we&#039;re expecting on Feb. 20.&lt;br /&gt;
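The coverage arithmetic in the first bullet above, made explicit. The 4.5 Mbp figure for an average bacterial genome is my assumption for illustration, not a number from the test sets.

```python
# Worked version of the coverage estimate: reads * read length / genome size.
def coverage(n_reads, read_len_bp, genome_size_bp):
    """Expected fold-coverage of a genome by a set of reads."""
    return n_reads * read_len_bp / genome_size_bp

AVG_GENOME_BP = 4.5e6  # assumed average bacterial genome size

per_genome = coverage(1_000_000, 75, AVG_GENOME_BP)  # if all reads came from one genome
community = per_genome / 100  # spread evenly across ~100 source genomes
```

This lands in the quoted 10-20x range for a single genome, and around 0.1-0.2x once spread over ~100 genomes, consistent with the tiny 2-3 read contigs observed.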
&lt;br /&gt;
=== UMIACS Resources ===&lt;br /&gt;
I just discovered the information listed on the CBCB intranet Resources page is inaccurate and very out of date, so I&#039;m making my own table.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:500px; height:200px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Umiacs Resources&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Machine !! Processor !! Speed !! Cores !! RAM&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Walnut&lt;br /&gt;
| Dual Core AMD Opteron 8220 || 2.8GHz || 16 || 64GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Privet&lt;br /&gt;
| AMD Opteron 850 || 2.4GHz || 4 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Larch&lt;br /&gt;
| AMD Opteron 850 || 2.4GHz || 4 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Sycamore&lt;br /&gt;
| Dual Core AMD Opteron 875 || 1GHz || 8 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Shagbark&lt;br /&gt;
| Intel Core 2 Quad || 2.83GHz || 4 || 4GB&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== February 5, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Meeting with Volker and Sarada on Feb 3 ===&lt;br /&gt;
* Need to teach Sarada how to perform local blast on some sequences they have that aren&#039;t yet in genbank&lt;br /&gt;
* Trying to set up a meeting with Volker to find out for sure if he wants me to work on this project&lt;br /&gt;
&lt;br /&gt;
=== Biomarker Assembly ===&lt;br /&gt;
Bo, Mohammad, and I spent a couple hours discussing biomarker assembly today. I&#039;m going to try to efficiently summarize our conclusions, but it might be difficult without an easy way to make images. We eventually decided it would be best to attempt several methods in tandem, due to the severe time constraints. The general approach of each method is to fish out and bin reads through one method or another, and then assemble the reads in each bin using minimus. All sequence identity values will be determined by using BLASTx.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Preliminary Steps&#039;&#039;&#039;&lt;br /&gt;
* Gather biomarker consensus amino acid sequences&lt;br /&gt;
* Gather amino acid sequences for associated genes from each bacterial genome in refseq&lt;br /&gt;
* Cluster amino acid sequences within each biomarker set&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Sequence Identity Threshold Determination&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
There are 31 biomarkers and about 1,000 bacterial genomes in which they occur. This means that there are 31 sets of 1,000 sequences that are all relatively similar to one another. Because of the sequence similarity and the short read length, it&#039;s possible that a significant number of reads will map equally well to multiple sequences within each biomarker set. For this reason, it is better to allow a single read to be placed in any bin containing a sequence to which the read mapped above some minimum threshold. This will protect against artificially lowering the coverage of extremely well conserved regions, and with any luck, incorrectly binned reads will simply not be included in the assembly. There are several ways to approach the determination of this threshold.&lt;br /&gt;
* Determine the lowest level of sequence identity between the consensus sequence for each biomarker and any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.&lt;br /&gt;
** The obvious shortcoming of this approach is that the sequence identity between two homologous gene-length sequences can be much lower than between two homologous read-length sequences.&lt;br /&gt;
* Align 75mers to determine the lowest score between any 75mer in the consensus sequence for each biomarker and the corresponding 75mer in any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.&lt;br /&gt;
** While this solves the problem with the above approach, it is significantly more complicated and the data is going to be here soon.&lt;br /&gt;
* Choose a sequence identity level, or try a few different levels and see which produces the most complete biomarker proteins without creating overly complex graphs.&lt;br /&gt;
** While there&#039;s no good theoretical justification for this approach, it&#039;s probably what we&#039;ll do and it will probably work well enough.&lt;br /&gt;
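The first two threshold strategies could be prototyped roughly as below. This assumes the member sequences have already been aligned to the consensus so positions correspond (real data would need an alignment step first), and it works in residues, where a 75 bp read is about 25 amino acids.

```python
# Sketch of the gene-length and windowed (75mer-style) threshold
# strategies, assuming pre-aligned, equal-length amino acid sequences.

def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)

def gene_length_threshold(consensus, members):
    """Lowest full-length identity between the consensus and any member."""
    return min(identity(consensus, m) for m in members)

def windowed_threshold(consensus, members, k=25):
    """Lowest identity over any length-k window (the 75mer idea, in residues)."""
    worst = 1.0
    for m in members:
        for i in range(len(consensus) - k + 1):
            worst = min(worst, identity(consensus[i:i + k], m[i:i + k]))
    return worst
```

As the second bullet above predicts, the windowed threshold is never higher than the gene-length one, since mismatches concentrate in some windows.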
&lt;br /&gt;
&#039;&#039;&#039;Schemes&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
After making absurdly complicated descriptions of the various approaches, which I felt weren&#039;t very clear, I used Keynote to recreate the diagrams we&#039;d drawn on the white board and then printed them to a PDF. Unfortunately I&#039;m not sure exactly how to embed that in the wiki, so email me at trgibbons@gmail.com if you&#039;re reading this and I&#039;ll send it to you.&lt;br /&gt;
# Marker-wise assembly&lt;br /&gt;
#* Bin reads that align to any sequence in a given marker set, and/or the consensus sequence for that marker&lt;br /&gt;
# Cluster-wise assembly&lt;br /&gt;
#* Cluster protein sequences&lt;br /&gt;
#* Bin reads that align to any protein sequence in a given cluster&lt;br /&gt;
# Gene-wise assembly&lt;br /&gt;
#* Bin reads that align to a particular protein sequence&lt;br /&gt;
* Marker-wise and cluster-wise binning should be better for assembling novel sequences&lt;br /&gt;
* Gene-wise binning should produce higher quality assemblies of markers for known organisms or those that are closely related&lt;br /&gt;
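The three binning granularities differ only in the key used to group a read's BLASTx hits, so one sketch can cover all of them. The data structures here are illustrative, not a real pipeline.

```python
# Sketch of marker-wise, cluster-wise, and gene-wise binning. A read may
# land in several bins, consistent with the multi-bin placement discussed
# in the threshold section above.
from collections import defaultdict

def bin_reads(hits, protein_info, level):
    """hits: read_id -> set of protein ids it aligned to above threshold.
    protein_info: protein_id -> {"marker": ..., "cluster": ...}.
    level: "marker", "cluster", or "gene"."""
    bins = defaultdict(set)
    for read_id, proteins in hits.items():
        for pid in proteins:
            key = pid if level == "gene" else protein_info[pid][level]
            bins[key].add(read_id)
    return dict(bins)
```

Marker-wise keys pool the most reads per bin (favoring novel sequence), while gene-wise keys keep bins closest to single known organisms, matching the trade-off in the two bullets above.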
&lt;br /&gt;
== February 12, 2010 ==&lt;br /&gt;
&#039;&#039;&#039;SNOW!!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== February 19, 2010 ==&lt;br /&gt;
Met with James to discuss Metastats. I&#039;m going to attempt the following two updates by the end of the semester (I probably incorrectly described them, but I&#039;ll work it out later):&lt;br /&gt;
# Find a better way to compute the false discovery rate (FDR)&lt;br /&gt;
#* Currently computed by using the lowest 100 p-values from each sample (look at source code)&lt;br /&gt;
#* Need to find a more algebraically rigorous way to compute it&lt;br /&gt;
#* With 1,000 tests, the expected number of false positives is the p-value threshold times the number of tests (at p=0.05, roughly 50 null hypotheses will be incorrectly rejected; so if the null hypothesis is thrown out for 100 features, 50 of those would be expected to be incorrect)&lt;br /&gt;
#* James just thinks this sucks and needs to be fixed&lt;br /&gt;
# Compute F-tests for features across all samples&lt;br /&gt;
#* Most requested feature&lt;br /&gt;
&lt;br /&gt;
I spent too much time talking with people about science and not enough time doing it this week...&lt;br /&gt;
&lt;br /&gt;
== March 26, 2010 ==&lt;br /&gt;
I didn&#039;t realize it had been a whole month since I updated. Let&#039;s see, I nearly dropped Dr. Song&#039;s statistical genomics course, but then I didn&#039;t. I &amp;lt;i&amp;gt;did&amp;lt;/i&amp;gt; however learn that we don&#039;t have a class project. So the Metastats upgrades are going on the back burner for now because ZOMGZ I HAVE TO PRESENT MY PROPOSAL BY THE END OF THIS YEAR!!!&lt;br /&gt;
&lt;br /&gt;
My Thesis Project:&lt;br /&gt;
* I&#039;m generally interested in pathways shared between micro-organisms in a community, and also between micro-organisms and their multicellular hosts.&lt;br /&gt;
** I&#039;m particularly interested in studying the metabolic pathways shared between micro-organisms in the human gut, both with each other and their human hosts.&lt;br /&gt;
* James has created time-series models, and is interested in tackling spatial models with me.&lt;br /&gt;
* I would really like to correlate certain metabolic pathways with his modeled relationships.&lt;br /&gt;
&lt;br /&gt;
Volker&#039;s Project:&lt;br /&gt;
* has taken a big hit this week.&lt;br /&gt;
* I&#039;m going to go forward, with Bo&#039;s help, using the plan outlined in my project proposal for Mihai&#039;s biosequence analysis class:&lt;br /&gt;
** Use reciprocal best blast hits to map H37Rv genes to annotated genes in all available virulent and non-virulent strains of mycobacteria&lt;br /&gt;
** Use results from gene mapping to identify a core set of tuberculosis genes, as well as a set of predicted virulence genes&lt;br /&gt;
** Use a variety of comparison schemes to study the effect on the set of predicted virulence genes of the consideration of different subsets of non-virulent strains&lt;br /&gt;
** Use stable virulence prediction to rank genes as virulence targets&lt;br /&gt;
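The reciprocal best BLAST hit step above can be sketched like this, assuming the best-hit tables have already been parsed from BLAST output; all gene IDs below are made up for illustration.&lt;br /&gt;

```python
# Sketch of the reciprocal best BLAST hit (RBH) step: two genes are paired
# only if each is the other's best hit. The best-hit dictionaries are
# hypothetical stand-ins for tables parsed from real BLAST output.

def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """best_a_to_b: best hit in genome B for each gene in genome A (e.g. H37Rv);
    best_b_to_a: the reverse direction. Returns the set of RBH pairs."""
    return {(a, b) for a, b in best_a_to_b.items()
            if best_b_to_a.get(b) == a}

best_a_to_b = {"Rv0001": "Mg0001", "Rv0002": "Mg0007", "Rv0003": "Mg0003"}
best_b_to_a = {"Mg0001": "Rv0001", "Mg0007": "Rv0099", "Mg0003": "Rv0003"}
pairs = reciprocal_best_hits(best_a_to_b, best_b_to_a)
# Rv0002 is dropped: its best hit Mg0007 points back to a different gene
```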
&lt;br /&gt;
Metastats:&lt;br /&gt;
* As I mentioned, this is put on hold&lt;br /&gt;
* I intend to pick this back up after I&#039;m done with Volker&#039;s project, as it could be instrumental to my thesis work&lt;br /&gt;
&lt;br /&gt;
== April 2, 2010 ==&lt;br /&gt;
More on my Thesis Project:&lt;br /&gt;
* I read the most recent (Science, Nov. 2009) paper by Gordon and Knight on their ongoing gut microbiota experiments&lt;br /&gt;
* Pretty much every section addressed the potential thesis topics I&#039;d imagined while reading the preceding section. Frustrating, but reaffirming (trying to learn from Mihai on not getting bummed about being scooped).&lt;br /&gt;
* Something that seems interesting and useful to me is to do more rigorous statistical analysis to attempt to correlate particular genes and pathways with the time-series and spatial data. I will have to work closely with Bo, at least at first.&lt;br /&gt;
* As a starting point, James has recommended building spatial models similar to his time-series models&lt;br /&gt;
* James is essentially mentoring me on my project at this point. It&#039;s pretty excellent.&lt;br /&gt;
&lt;br /&gt;
== July 23, 2010 ==&lt;br /&gt;
It&#039;s been several months, I don&#039;t feel any closer to finding a thesis project, and it&#039;s really starting to stress me out. I&#039;ve finally stopped making excuses for not reading and have been steadily reading about 10 abstracts and 2 papers per week for the last month or two, but it doesn&#039;t appear to be nearly enough. I met with Mihai today to talk about it and then foolishly went for a run in the heat of the afternoon, where I decided on a new direction in a state of euphoric delirium.&lt;br /&gt;
# Read the book Mihai loaned me within the next week (or so): &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; by Michael Wilson&lt;br /&gt;
#* Mihai says the book is a summary of what was known about the human microbiome 5 years ago. The table of contents for the first chapter is essentially identical to the list of Wikipedia pages I&#039;ve read in the past week, so I&#039;m pretty excited to now have a more thorough, authoritative source.&lt;br /&gt;
# Go back to looking for papers describing quorum sensing, especially in organisms known to be present in the human microbiome, either stably or transiently.&lt;br /&gt;
#* Try not to get too side-tracked reading about biofilms.&lt;br /&gt;
#* Search for an existing database of quorum sensing genes to use as references to potentially identify novel quorum sensing genes in microbiome WGS data. Consider making one if it&#039;s not already available.&lt;br /&gt;
# Look for a core metabolome (at this point I think a core microbiome seems unlikely) using metapath (or something similar) in the new HMP data for the gut, oral, and vaginal samples from the 100 reference individuals, as well as other sources like MetaHIT.&lt;br /&gt;
#* Start with a fast and dirty approach pulling all KO&#039;s associated with organisms identified using 16S rDNA sequencing, and then possibly attempt more accurate gene assemblies and annotation from WGS sequencing projects.&lt;br /&gt;
# Try to stay focused on a research topic for more than 2 weeks so I don&#039;t keep wasting time and effort.&lt;br /&gt;
# Don&#039;t make a habit of using the wiki like a personal journal...&lt;br /&gt;
&lt;br /&gt;
=== Possible Research Projects Inspired by &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ===&lt;br /&gt;
* I transferred this to a new page I created that&#039;s dedicated to my potential [[User:Tgibbons:Project-Ideas | project ideas]]&lt;br /&gt;
&lt;br /&gt;
== July 30, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Metagenomics Papers and Subjects Related to the Content of &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ===&lt;br /&gt;
# MetaHIT&lt;br /&gt;
# Vaginal Microbiome&lt;br /&gt;
# Acquisition of Microbiome&lt;br /&gt;
#* [http://www.pnas.org/content/early/2010/06/08/1002601107.full.pdf Vaginal birth vs cesarean section]&lt;br /&gt;
&lt;br /&gt;
== August 13, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Other Potential Research Projects ===&lt;br /&gt;
* I transferred this to a new page I created that&#039;s dedicated to my potential [[User:Tgibbons:Project-Ideas | project ideas]]&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Ted-Report&amp;diff=7276</id>
		<title>Cbcb:Pop-Lab:Ted-Report</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Ted-Report&amp;diff=7276"/>
		<updated>2010-08-09T15:54:56Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* Possible Research Projects Inspired by &amp;#039;&amp;#039;Microbial Inhabitants of Humans&amp;#039;&amp;#039; */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Older Entries ==&lt;br /&gt;
[[Cbcb:Pop-Lab:Ted-Report-2009 | 2009]]&lt;br /&gt;
&lt;br /&gt;
== January 15, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Minimus Documentation ===&lt;br /&gt;
&lt;br /&gt;
Presently, the only relevant Google hit for &amp;quot;minimus&amp;quot; on the first page of results is the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus#Basic_usage_example sourceforge wiki.] The only example on this page is incomplete and appears to be an early draft made during development.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Ideally, it should be easy to find a complete guide with the general format:&lt;br /&gt;
* Simple use case:&lt;br /&gt;
 `toAmos -s path/to/fastaFile.seq -o path/to/fastaFile.afg`&lt;br /&gt;
 `minimus path/to/fastaFile(prefix)`&lt;br /&gt;
* Necessary tools for set up (toAmos)&lt;br /&gt;
* Other options&lt;br /&gt;
* etc&lt;br /&gt;
&lt;br /&gt;
The description found on the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus/README Minimus/README] page (linked to from the middle of the starting page) is more appropriate, but features use cases that may no longer be common and references another required tool (toAmos) without linking to it or describing how to access it. A description of this tool can be found on the Amos [http://sourceforge.net/apps/mediawiki/amos/index.php?title=File_conversion_utilities File Conversion Utilities] page (again, linked to from the starting page), but it is less organized than what I&#039;ve come to expect from a project page, and it is easy to get lost or distracted by the rest of the Amos documentation while trying to piece together the necessary steps for a basic assembly.&lt;br /&gt;
&lt;br /&gt;
=== Comparative Network Analysis pt. 2 ===&lt;br /&gt;
* Meeting with Volker this Friday to discuss how best to apply network alignment to what he&#039;s doing&lt;br /&gt;
* I&#039;m simultaneously trying to find a way to apply my network alignment technique to predicting genes in metagenomic samples&lt;br /&gt;
** I&#039;ve been trying to find a way to get beyond the restriction that my current program requires genes to be annotated with an EC number. A potentially interesting next step may be to use BioPython to BLAST the sequence of each enzyme annotated in every micro-organism in KEGG against a metagenomic library.&lt;br /&gt;
*** The results would be stretches of linked reactions that have been annotated in KEGG pathways.&lt;br /&gt;
*** This method could be applied to contigs just as easily as finished sequences. In a scenario with low coverage, it could be used to identify genes that are probably present but simply weren&#039;t sampled, by showing the presence of the rest of the pathway. In short, this could finally accomplish what Mihai asked me to work on when I showed up.&lt;br /&gt;
*** The major theoretical shortcoming of this approach is that it could only identify relatively well characterized pathways.&lt;br /&gt;
*** The practical shortcomings of this approach start with obtaining a fairly complete copy of KEGG (which, as we&#039;ve learned, is a mess to parse locally and unusably slow to query through the API), and continue to the computational challenge of such a large-scale BLAST operation.&lt;br /&gt;
** Ask Bo about this when he gets back. He may have already done this.&lt;br /&gt;
&lt;br /&gt;
== January 22, 2010 ==&lt;br /&gt;
* Met with Dan and Sergey to talk about the Minimus-Bambus pipeline&lt;br /&gt;
** Minimus is running fine. I&#039;ve begun characterizing its run-time behavior (see next week&#039;s entry)&lt;br /&gt;
** After some tweaking by Sergey, Bambus was able to finish but did not generate a scaffold. We&#039;re going to talk about this after the meeting on Monday.&lt;br /&gt;
** Sergey had an interesting idea for making a better read simulator:&lt;br /&gt;
*** Error-free reads are cheap and easy to generate. The problem is with the error model.&lt;br /&gt;
*** The &amp;quot;best&amp;quot; tool (that we are aware of) which includes error models is MetaSim, but its error models are years out of date and the authors have been historically unreachable. While Mihai has now shown me how to edit the models in a reasonable way from flat files that allow me to characterize base substitutions, I&#039;m not convinced it would be faster or easier to write a program to modify these files than it would be to just write an entirely new program; and given the amount of time I&#039;ve spent trying to use MetaSim, I&#039;m more than ready to walk away from it. Oh yeah, and MetaSim doesn&#039;t work from the command line, so no scripting.&lt;br /&gt;
*** Sergey has pointed out that most companies will assemble &#039;&#039;E. coli&#039;&#039; when they release a new sequencer. Conveniently, there are many high quality assemblies of &#039;&#039;E. coli&#039;&#039; available for reference. It might therefore be possible to generate new error models for these sequencers in an automated fashion by mapping the &#039;&#039;E. coli&#039;&#039; reads to the available reference genomes, collecting the error frequencies, and then using them to mask synthesized reads.&lt;br /&gt;
*** I also talked with Mohammad and Mihai about this, and they also seemed to think it was a pretty good idea. Mihai has proposed having Sergey or Mohammad add the described error-model generator to his read sampler (written in C) when they have time, but not in preparation for the oral microbiome data.&lt;br /&gt;
&lt;br /&gt;
* Met with James to discuss my work with Volker&lt;br /&gt;
** Told him about my meeting with Volker and the paper he wanted me to prepare, more or less by myself. The concepts of the paper are these:&lt;br /&gt;
*** Most available genomic sequences of mycobacteria are of a very small subset of highly pathogenic organisms.&lt;br /&gt;
*** Subtractive comparative genomics can be used to identify genes that are potentially responsible for differing phenotypes (such as extreme pathogenicity), but there must be available genomic sequences for closely related organisms with differing phenotypes.&lt;br /&gt;
*** Volker has sequenced 2 more non-pathogenic strains of mycobacteria (&#039;&#039;gastri&#039;&#039;, and &#039;&#039;kansasiiW58&#039;&#039;) with the intention of increasing the effectiveness of these subtractive comparative genomic studies.&lt;br /&gt;
*** The meat of the paper would be comparing the results of subtractive comparative genomic analysis using all currently available strains in RefSeq, with the results from also using the two novel sequences.&lt;br /&gt;
*** The other, smaller publishable portion of this project would be a comparison of &#039;&#039;gastri&#039;&#039; and &#039;&#039;kansasiiW58&#039;&#039; to each other, because they are thought to be extremely closely related and yet have distinct phenotypes (which I&#039;ve now forgotten).&lt;br /&gt;
*** James seemed to think this could make an okay paper, and he confirmed that he had not realized Volker was looking for someone to do all of the analysis, both computational and biological, with Volker only contributing analysis of the analysis after it was all over.&lt;br /&gt;
** Ended up also discussing his work on differential abundance in populations of microorganisms.&lt;br /&gt;
*** I&#039;m going to start working on taking over and expanding Metastats this semester.&lt;br /&gt;
*** I&#039;m also going to start talking to Bo when he gets back about exactly what he&#039;s doing, and how I might be able to include pathway prediction in my expansion of Metastats without stepping on his toes.&lt;br /&gt;
*** Mihai has given me his approval to focus on this.&lt;br /&gt;
&lt;br /&gt;
* Met with Mihai to discuss working with Volker&lt;br /&gt;
** Explained that rather than looking for someone to do only the complex portions of the computational analysis, Volker was/is looking for someone to do the complete analysis.&lt;br /&gt;
** In exchange, Volker is offering first authorship and, if need be, to split the student&#039;s funding with their primary PI.&lt;br /&gt;
** I think I&#039;m capable of doing this within 3 or 4 months but it would consume my time pretty thoroughly.&lt;br /&gt;
** Mihai agreed that this is a reasonable deal, but noted that I have no personal interest in studying mycobacteria, and that it&#039;s therefore unwise of me to invest a bunch of time becoming an expert on an organism I have no interest in continuing to study or work with. I&#039;ve therefore offered to work closely with one of Volker&#039;s graduate students, who could meet with me every week or two. I would be willing to do all of the computational analysis and explain it to them, but they would have to actually look up potentially interesting genes and relationships I discover and help me keep the analysis biologically interesting and relevant.&lt;br /&gt;
&lt;br /&gt;
* Met with Mihai and Mohammad to discuss our impending huge-ass(embly) problem&lt;br /&gt;
** Talked about strategies for iterative assembly as an approach to assembling intractably large data sets. Most have glaring shortcomings and complications.&lt;br /&gt;
** Discovered Mike Schatz has a map-reduce implementation of an assembler that uses De Bruijn graphs and is better suited to assemblies with high coverage but short read lengths.&lt;br /&gt;
&lt;br /&gt;
== January 29, 2010 ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Minimus Performance Analysis ===&lt;br /&gt;
I&#039;m testing Minimus and Bambus in preparation for the oral microbiome data, and after spamming several lab members with email, it occurred to me that it would be considerably more considerate to put the information here instead.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Memory Usage Analysis&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|RAM used by the Overlapper (in GB):&lt;br /&gt;
| 1.2 || 2.4 || 4.5 || 8.7 || 17 || 21.5 || ~1.1 GB * (#Reads in Millions) = (Memory Used)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|RAM used by the Tigger (in GB):&lt;br /&gt;
| 3 || 6 || 12 || 25 || 48.4 || (60) || ~3 GB * (#Reads in Millions) = (Memory Used)&lt;br /&gt;
|}&lt;br /&gt;
* The 16-million-read assembly data is from Walnut; all other numbers are rough averages from both Privet and Walnut.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Run Time Analysis on Privet&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Overlapper (in min):&lt;br /&gt;
| 3 || 9 || 34 || 130 || (576) || 783 || 2.96 * (#Reads in Millions)&amp;lt;sup&amp;gt;1.87&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Tigger (in min):&lt;br /&gt;
| 9 || 66 || 473 || (3,456) || (25,088) || (47,493) || 9.03 * (#Reads in Millions)&amp;lt;sup&amp;gt;2.86&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|}&lt;br /&gt;
* Privet has 2.4GHz Opteron 850 processors and 32GB of RAM. Minimus is not parallelized and therefore only uses a single core.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
* The models were generated by plotting the data points in OpenOffice and fitting a power trendline. The R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; value for each was 1.&lt;br /&gt;
* &#039;&#039;&#039;For reference: There are 1,440 minutes in one day, and 10,080 minutes in one week&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Run Time Analysis on Walnut&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Overlapper (in min):&lt;br /&gt;
| 2.7 || 8 || 27.5 || 102 || (325) || (481.5) || 2.54 * (#Reads in Millions)&amp;lt;sup&amp;gt;1.75&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Tigger (in min):&lt;br /&gt;
| 14 || 81 || 471.5 || (2,752) || (16,006) || (28,212) || 13.99 * (#Reads in Millions)&amp;lt;sup&amp;gt;2.54&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|}&lt;br /&gt;
* Walnut has 2.8GHz Opteron 875 processors and 64GB of RAM. Minimus is not parallelized and therefore only uses a single core.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
* The models were generated by plotting the data points in OpenOffice and fitting a power trendline. The R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; value for each was 1.&lt;br /&gt;
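The fitted run-time models above are power laws of the form t = a * n^b, with n in millions of reads. As a sanity check, this small sketch (a hypothetical helper, not part of the Minimus/AMOS toolchain) evaluates the Walnut models to reproduce the predicted values in the table.&lt;br /&gt;

```python
# Evaluate the fitted power-law run-time models from the tables above.
# The coefficients are copied from the tables; the helper itself is a
# hypothetical convenience, not part of the Minimus/AMOS toolchain.

def predicted_minutes(n_million_reads, a, b):
    """Run time in minutes for n million reads under the model t = a * n**b."""
    return a * n_million_reads ** b

# Walnut Tigger model: a = 13.99, b = 2.54
t20 = predicted_minutes(20, 13.99, 2.54)   # table lists ~28,212 minutes

# Walnut Overlapper model: a = 2.54, b = 1.75
o16 = predicted_minutes(16, 2.54, 1.75)    # table lists ~325 minutes
```

For scale, 28,212 minutes is nearly three weeks, which is why the larger Tigger runs were predicted from the model rather than measured.&lt;br /&gt;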
&lt;br /&gt;
&lt;br /&gt;
==== Other Observations About the Assemblies ====&lt;br /&gt;
* Because of the short read length, every million reads is only 75MB of sequence, which is roughly 10-20x coverage of an average single bacterium. Since these test sets have reads sampled from roughly 100 bacterial genomic sequences, I would expect the coverage to be on the order of 0.1% on average.&lt;br /&gt;
* Unsurprisingly, a cursory glance through the contig files shows that each contig is composed of only about 2 or 3 reads.&lt;br /&gt;
* The N50 analysis for the smaller assemblies shows that only 2-3 reads are being added to each contig on average, leaving both N50&#039;s and average lengths just below 150bp.&lt;br /&gt;
* Therefore, if the complexity of the oral microbiome data is high and/or the contamination with human DNA is extreme (80-95%), the coverage may be extremely low. This may make the use of Mike&#039;s assembler impractical, or at least that&#039;s how I&#039;m going to keep justifying this testing to myself until someone corrects me.&lt;br /&gt;
** &#039;&#039;&#039;Update:&#039;&#039;&#039; Apparently Mike and Dan have talked about this, and somewhere around 75-80bp, the performance of minimus catches up with Mike&#039;s de Bruijn graph assembler anyway. I also did not know that Dan&#039;s map-reduce minimus was running and would be used to assemble the data alongside Mike&#039;s.&lt;br /&gt;
* I learned on Feb. 1, 2010 that the 454 error model allows wild variation with respect to read length, so these assemblies might not actually be representative of the performance with the Illumina data we&#039;re expecting on Feb. 20&lt;br /&gt;
&lt;br /&gt;
=== UMIACS Resources ===&lt;br /&gt;
I just discovered the information listed on the CBCB intranet Resources page is inaccurate and very out of date, so I&#039;m making my own table.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:500px; height:200px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Umiacs Resources&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Machine !! Processor !! Speed !! Cores !! RAM&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Walnut&lt;br /&gt;
| Dual Core AMD Opteron 8220 || 2.8GHz || 16 || 64GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Privet&lt;br /&gt;
| AMD Opteron 850 || 2.4GHz || 4 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Larch&lt;br /&gt;
| AMD Opteron 850 || 2.4GHz || 4 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Sycamore&lt;br /&gt;
| Dual Core AMD Opteron 875 || 1GHz || 8 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Shagbark&lt;br /&gt;
| Intel Core 2 Quad || 2.83GHz || 4 || 4GB&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== February 5, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Meeting with Volker and Sarada on Feb 3 ===&lt;br /&gt;
* Need to teach Sarada how to perform local blast on some sequences they have that aren&#039;t yet in genbank&lt;br /&gt;
* Trying to set up a meeting with Volker to find out for sure if he wants me to work on this project&lt;br /&gt;
&lt;br /&gt;
=== Biomarker Assembly ===&lt;br /&gt;
Bo, Mohammad, and I spent a couple hours discussing biomarker assembly today. I&#039;m going to try to efficiently summarize our conclusions, but it might be difficult without an easy way to make images. We eventually decided it would be best to attempt several methods in tandem, due to the severe time constraints. The general approach of each method is to fish out and bin reads through one method or another, and then assemble the reads in each bin using Minimus. All sequence identity values will be determined using BLASTx.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Preliminary Steps&#039;&#039;&#039;&lt;br /&gt;
* Gather biomarker consensus amino acid sequences&lt;br /&gt;
* Gather amino acid sequences for associated genes from each bacterial genome in refseq&lt;br /&gt;
* Cluster amino acid sequences within each biomarker set&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Sequence Identity Threshold Determination&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
There are 31 biomarkers and about 1,000 bacterial genomes in which they occur. This means that there are 31 sets of 1,000 sequences that are all relatively similar to one another. Because of the sequence similarity and the short read length, it&#039;s possible that a significant number of reads will map equally well to multiple sequences within each biomarker set. For this reason, it is better to allow a single read to be placed in any bin containing a sequence to which the read mapped above some minimum threshold. This will protect against artificially lowering the coverage of extremely well conserved regions, and with any luck, incorrectly binned reads will simply not be included in the assembly. There are several ways to approach the determination of this threshold.&lt;br /&gt;
* Determine the lowest level of sequence identity between the consensus sequence for each biomarker and any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.&lt;br /&gt;
** The obvious shortcoming of this approach is that the sequence identity between two homologous gene-length sequences can be much lower than between two homologous read-length sequences.&lt;br /&gt;
* Align 75mers to determine the lowest score between any two 75mers in the consensus sequence for each biomarker and the corresponding 75mer in any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.&lt;br /&gt;
** While this solves the problem with the above approach, it is significantly more complicated and the data is going to be here soon.&lt;br /&gt;
* Choose a sequence identity level, or try a few different levels and see which produces the most complete biomarker proteins without creating overly complex graphs.&lt;br /&gt;
** While there&#039;s no good theoretical justification for this approach, it&#039;s probably what we&#039;ll do and it will probably work well enough.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Schemes&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
After making absurdly complicated descriptions of the various approaches, which I felt weren&#039;t very clear, I used Keynote to recreate the diagrams we&#039;d drawn on the whiteboard and then printed them to a PDF. Unfortunately, I&#039;m not sure exactly how to embed that in the wiki, so email me at trgibbons@gmail.com if you&#039;re reading this and I&#039;ll send it to you.&lt;br /&gt;
# Marker-wise assembly&lt;br /&gt;
#* Bin reads that align to any sequence in a given marker set, and/or the consensus sequence for that marker&lt;br /&gt;
# Cluster-wise assembly&lt;br /&gt;
#* Cluster protein sequences&lt;br /&gt;
#* Bin reads that align to any protein sequence in a given cluster&lt;br /&gt;
# Gene-wise assembly&lt;br /&gt;
#* Bin reads that align to a particular protein sequence&lt;br /&gt;
* Marker-wise and cluster-wise binning should be better for assembling novel sequences&lt;br /&gt;
* Gene-wise binning should produce higher quality assemblies of markers for known organisms or those that are closely related&lt;br /&gt;
&lt;br /&gt;
== February 12, 2010 ==&lt;br /&gt;
&#039;&#039;&#039;SNOW!!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== February 19, 2010 ==&lt;br /&gt;
Met with James to discuss Metastats. I&#039;m going to attempt the following two updates by the end of the semester (I probably incorrectly described them, but I&#039;ll work it out later):&lt;br /&gt;
# Find a better way to compute the false discovery rate (FDR)&lt;br /&gt;
#* Currently computed by using the lowest 100 p-values from each sample (look at source code)&lt;br /&gt;
#* Need to find a more algebraically rigorous way to compute it&lt;br /&gt;
#* The false positive rate across 1,000 tests is the p-value threshold (p=0.05 =&amp;gt; ~50 null hypotheses will be incorrectly rejected; so if the null hypothesis is rejected for only 100 features, as many as 50 of those rejections could be false positives)&lt;br /&gt;
#* James just thinks this sucks and needs to be fixed&lt;br /&gt;
# Compute F-tests for features across all samples&lt;br /&gt;
#* Most requested feature&lt;br /&gt;
&lt;br /&gt;
I spent too much time talking with people about science and not enough time doing it this week...&lt;br /&gt;
&lt;br /&gt;
== March 26, 2010 ==&lt;br /&gt;
I didn&#039;t realize it had been a whole month since I updated. Let&#039;s see, I nearly dropped Dr. Song&#039;s statistical genomics course, but then I didn&#039;t. I &amp;lt;i&amp;gt;did&amp;lt;/i&amp;gt; however learn that we don&#039;t have a class project. So the Metastats upgrades are going on the back burner for now because ZOMGZ I HAVE TO PRESENT MY PROPOSAL BY THE END OF THIS YEAR!!!&lt;br /&gt;
&lt;br /&gt;
My Thesis Project:&lt;br /&gt;
* I&#039;m generally interested in pathways shared between micro-organisms in a community, and also between micro-organisms and their multicellular hosts.&lt;br /&gt;
** I&#039;m particularly interested in studying the metabolic pathways shared between micro-organisms in the human gut, both with each other and their human hosts.&lt;br /&gt;
* James has created time-series models, and is interested in tackling spatial models with me.&lt;br /&gt;
* I would really like to correlate certain metabolic pathways with his modeled relationships.&lt;br /&gt;
&lt;br /&gt;
Volker&#039;s Project:&lt;br /&gt;
* has taken a big hit this week.&lt;br /&gt;
* I&#039;m going to go forward, with Bo&#039;s help, using the plan outlined in my project proposal for Mihai&#039;s biosequence analysis class:&lt;br /&gt;
** Use reciprocal best blast hits to map H37Rv genes to annotated genes in all available virulent and non-virulent strains of mycobacteria&lt;br /&gt;
** Use results from gene mapping to identify a core set of tuberculosis genes, as well as a set of predicted virulence genes&lt;br /&gt;
** Use a variety of comparison schemes to study the effect on the set of predicted virulence genes of the consideration of different subsets of non-virulent strains&lt;br /&gt;
** Use stable virulence prediction to rank genes as virulence targets&lt;br /&gt;
&lt;br /&gt;
Metastats:&lt;br /&gt;
* As I mentioned, this is put on hold&lt;br /&gt;
* I intend to pick this back up after I&#039;m done with Volker&#039;s project, as it could be instrumental to my thesis work&lt;br /&gt;
&lt;br /&gt;
== April 2, 2010 ==&lt;br /&gt;
More on my Thesis Project:&lt;br /&gt;
* I read the most recent (Science, Nov. 2009) paper by Gordon and Knight on their ongoing gut microbiota experiments&lt;br /&gt;
* Pretty much every section addressed the potential thesis topics I&#039;d imagined while reading the preceding section. Frustrating, but reaffirming (trying to learn from Mihai on not getting bummed about being scooped).&lt;br /&gt;
* Something that seems interesting and useful to me is to do more rigorous statistical analysis to attempt to correlate particular genes and pathways with the time-series and spatial data. I will have to work closely with Bo, at least at first.&lt;br /&gt;
* As a starting point, James has recommended building spatial models similar to his time-series models&lt;br /&gt;
* James is essentially mentoring me on my project at this point. It&#039;s pretty excellent.&lt;br /&gt;
&lt;br /&gt;
== July 23, 2010 ==&lt;br /&gt;
It&#039;s been several months, I don&#039;t feel any closer to finding a thesis project, and it&#039;s really starting to stress me out. I&#039;ve finally stopped making excuses for not reading and have been steadily reading about 10 abstracts and 2 papers per week for the last month or two, but it doesn&#039;t appear to be nearly enough. I met with Mihai today to talk about it and then foolishly went for a run in the heat of the afternoon, where I decided on a new direction in a state of euphoric delirium.&lt;br /&gt;
# Read the book Mihai loaned me within the next week (or so): &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; by Michael Wilson&lt;br /&gt;
#* Mihai says the book is a summary of what was known about the human microbiome 5 years ago. The table of contents for the first chapter is essentially identical to the list of Wikipedia pages I&#039;ve read in the past week, so I&#039;m pretty excited to now have a more thorough, authoritative source.&lt;br /&gt;
# Go back to looking for papers describing quorum sensing, especially in organisms known to be present in the human microbiome, either stably or transiently.&lt;br /&gt;
#* Try not to get too side-tracked reading about biofilms.&lt;br /&gt;
#* Search for an existing database of quorum sensing genes to use as references to potentially identify novel quorum sensing genes in microbiome WGS data. Consider making one if it&#039;s not already available.&lt;br /&gt;
# Look for a core metabolome (at this point I think a core microbiome seems unlikely) using metapath (or something similar) in the new HMP data for the gut, oral, and vaginal samples from the 100 reference individuals, as well as other sources like MetaHIT.&lt;br /&gt;
#* Start with a fast and dirty approach pulling all KO&#039;s associated with organisms identified using 16S rDNA sequencing, and then possibly attempt more accurate gene assemblies and annotation from WGS sequencing projects.&lt;br /&gt;
# Try to stay focused on a research topic for more than 2 weeks so I don&#039;t keep wasting time and effort.&lt;br /&gt;
# Don&#039;t make a habit of using the wiki like a personal journal...&lt;br /&gt;
&lt;br /&gt;
=== Possible Research Projects Inspired by &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ===&lt;br /&gt;
* I transferred this to a new page I created that&#039;s dedicated to my potential [[User:Tgibbons:Project-Ideas | project ideas]]&lt;br /&gt;
&lt;br /&gt;
== July 30, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Metagenomics Papers and Subjects Related to the Content of &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ===&lt;br /&gt;
# MetaHIT&lt;br /&gt;
# Vaginal Microbiome&lt;br /&gt;
# Acquisition of Microbiome&lt;br /&gt;
#* [http://www.pnas.org/content/early/2010/06/08/1002601107.full.pdf Vaginal birth vs cesarean section]&lt;br /&gt;
&lt;br /&gt;
== August 13, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Other Potential Research Projects ===&lt;br /&gt;
I transferred this to a new page I created that&#039;s dedicated to my potential [[User:Tgibbons:Project-Ideas | project ideas]]&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Ted-Report&amp;diff=7275</id>
		<title>Cbcb:Pop-Lab:Ted-Report</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Ted-Report&amp;diff=7275"/>
		<updated>2010-08-09T15:54:35Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* Possible Research Projects Inspired by &amp;#039;&amp;#039;Microbial Inhabitants of Humans&amp;#039;&amp;#039; */ Replaced with a link to a dedicated page I created for my project ideas&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Older Entries ==&lt;br /&gt;
[[Cbcb:Pop-Lab:Ted-Report-2009 | 2009]]&lt;br /&gt;
&lt;br /&gt;
== January 15, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Minimus Documentation ===&lt;br /&gt;
&lt;br /&gt;
Presently, the only relevant Google hit for &amp;quot;minimus&amp;quot; on the first page of results is the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus#Basic_usage_example sourceforge wiki.] The only example on this page is incomplete and appears to be an early draft made during development.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Ideally, it should be easy to find a complete guide with the general format:&lt;br /&gt;
* Simple use case:&lt;br /&gt;
 `toAmos -s path/to/fastaFile.seq -o path/to/fastaFile.afg`&lt;br /&gt;
 `minimus path/to/fastaFile(prefix)`&lt;br /&gt;
* Necessary tools for set up (toAmos)&lt;br /&gt;
* Other options&lt;br /&gt;
* etc&lt;br /&gt;
&lt;br /&gt;
The description found on the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus/README Minimus/README] page (linked to from the middle of the starting page) is more appropriate, but features use cases that may no longer be common and references another required tool (toAmos) without linking to it or describing how to access it. A description of this tool can be found on the Amos [http://sourceforge.net/apps/mediawiki/amos/index.php?title=File_conversion_utilities File Conversion Utilities] page (again, linked to from the starting page), but it is less organized than what I&#039;ve come to expect from a project page, and it is easy to get lost or distracted by the rest of the Amos documentation while trying to piece together the necessary steps for a basic assembly.&lt;br /&gt;
&lt;br /&gt;
=== Comparative Network Analysis pt. 2 ===&lt;br /&gt;
* Meeting with Volker this Friday to discuss how best to apply network alignment to what he&#039;s doing&lt;br /&gt;
* I&#039;m simultaneously trying to find a way to apply my network alignment technique to predicting genes in metagenomic samples&lt;br /&gt;
** I&#039;ve been trying to find a way to get beyond the restriction that my current program requires genes to be annotated with an EC number. A potentially interesting next step may be to use BioPython to BLAST the sequence of each enzyme annotated in every micro-organism in KEGG against a metagenomic library.&lt;br /&gt;
*** The results would be stretches of linked reactions that have been annotated in KEGG pathways.&lt;br /&gt;
*** This method could be applied to contigs just as easily as finished sequences. In a scenario with low coverage, it could be used to identify genes that are probably present but just weren&#039;t sampled, by showing the presence of the rest of the pathway. In short, this could finally accomplish what Mihai asked me to work on when I showed up.&lt;br /&gt;
*** The major theoretical shortcoming of this approach is that it could only identify relatively well characterized pathways.&lt;br /&gt;
*** The practical challenges of this approach begin with obtaining a fairly complete copy of KEGG (which, as we&#039;ve learned, is a mess to parse locally and unusably slow to query through the API), and extend to the computational cost of such a large-scale BLAST operation.&lt;br /&gt;
** Ask Bo about this when he gets back. He may have already done this.&lt;br /&gt;
&lt;br /&gt;
== January 22, 2010 ==&lt;br /&gt;
* Met with Dan and Sergey to talk about the Minimus-Bambus pipeline&lt;br /&gt;
** Minimus is running fine. I&#039;ve begun characterizing its run-time behavior (see next week&#039;s entry)&lt;br /&gt;
** After some tweaking by Sergey, Bambus was able to finish but did not generate a scaffold. We&#039;re going to talk about this after the meeting on Monday.&lt;br /&gt;
** Sergey had an interesting idea for making a better read simulator:&lt;br /&gt;
*** Error-free reads are cheap and easy to generate. The problem is with the error model.&lt;br /&gt;
*** The &amp;quot;best&amp;quot; tool (that we are aware of) which includes error models is MetaSim, but its error models are years out of date and the authors have been historically unreachable. While Mihai has now shown me how to edit the models in a reasonable way from flat files, allowing one to characterize base substitutions, I&#039;m not convinced it would be faster or easier to write a program to modify these files than to just write an entirely new program; and given the amount of time I&#039;ve spent trying to use MetaSim, I&#039;m more than ready to walk away from it. Oh yeah, and MetaSim doesn&#039;t work from the command line, so no scripting.&lt;br /&gt;
*** Sergey has pointed out that most companies will assemble &#039;&#039;E. coli&#039;&#039; when they release a new sequencer. Conveniently, there are many high quality assemblies of &#039;&#039;E. coli&#039;&#039; available for reference. It might therefore be possible to generate new error models for these sequencers in an automated fashion by mapping the &#039;&#039;E. coli&#039;&#039; reads to the available reference genomes, collecting the error frequencies, and then using them to mask synthesized reads.&lt;br /&gt;
*** I also talked with Mohammad and Mihai about this, both of whom seemed to think it was a pretty good idea. Mihai has proposed having Sergey or Mohammad add the described error-model generator to his read sampler (written in C) when they have time, but not in preparation for the oral microbiome data.&lt;br /&gt;
&lt;br /&gt;
* Met with James to discuss my work with Volker&lt;br /&gt;
** Told him about my meeting with Volker and the paper he wanted me to prepare, more or less by myself. The concepts of the paper are these:&lt;br /&gt;
*** Most available genomic sequences of mycobacteria are of a very small subset of highly pathogenic organisms.&lt;br /&gt;
*** Subtractive comparative genomics can be used to identify genes that are potentially responsible for differing phenotypes (such as extreme pathogenicity), but there must be available genomic sequences for closely related organisms with differing phenotypes.&lt;br /&gt;
*** Volker has sequenced 2 more non-pathogenic strains of mycobacteria (&#039;&#039;gastri&#039;&#039;, and &#039;&#039;kansasiiW58&#039;&#039;) with the intention of increasing the effectiveness of these subtractive comparative genomic studies.&lt;br /&gt;
*** The meat of the paper would be comparing the results of subtractive comparative genomic analysis using all currently available strains in RefSeq, with the results from also using the two novel sequences.&lt;br /&gt;
*** The other, smaller publishable portion of this project would be a comparison of &#039;&#039;gastri&#039;&#039; and &#039;&#039;kansasiiW58&#039;&#039; to each other, because they are thought to be extremely closely related and yet have distinct phenotypes (which I&#039;ve now forgotten).&lt;br /&gt;
*** James seemed to think this could make an okay paper, and he confirmed that he had not realized Volker was looking for someone to do all of the analysis, both computational and biological, with Volker only contributing analysis of the analysis after it was all over.&lt;br /&gt;
** Ended up also discussing his work on differential abundance in populations of microorganisms.&lt;br /&gt;
*** I&#039;m going to start working on taking over and expanding Metastats this semester.&lt;br /&gt;
*** I&#039;m also going to start talking to Bo when he gets back about exactly what he&#039;s doing, and how I might be able to include pathway prediction in my expansion of Metastats without stepping on his toes.&lt;br /&gt;
*** Mihai has given me his approval to focus on this.&lt;br /&gt;
&lt;br /&gt;
* Met with Mihai to discuss working with Volker&lt;br /&gt;
** Explained that rather than looking for someone to do only the complex portions of the computational analysis, Volker was/is looking for someone to do the complete analysis.&lt;br /&gt;
** In exchange, Volker is offering first authorship and, if need be, to split the student&#039;s funding with their primary PI.&lt;br /&gt;
** I think I&#039;m capable of doing this within 3 or 4 months but it would consume my time pretty thoroughly.&lt;br /&gt;
** Mihai agreed that this is a reasonable deal, but noted that I have no personal interest in studying mycobacteria, and it&#039;s therefore unwise of me to invest a bunch of time becoming an expert on an organism I have no interest in continuing to study or work with. I&#039;ve therefore offered to work closely with one of Volker&#039;s graduate students, who could meet with me every week or two. I would be willing to do all of the computational analysis and explain it to them, but they would have to actually look up potentially interesting genes and relationships I discover and help me keep the analysis biologically interesting and relevant.&lt;br /&gt;
&lt;br /&gt;
* Met with Mihai and Mohammad to discuss our impending huge-ass(embly) problem&lt;br /&gt;
** Talked about strategies for iterative assembly as an approach to assembling intractably large data sets. Most have glaring shortcomings and complications.&lt;br /&gt;
** Discovered Mike Schatz has a map-reduce implementation of an assembler that uses de Bruijn graphs and is better suited to assemblies with high coverage but short read lengths.&lt;br /&gt;
&lt;br /&gt;
== January 29, 2010 ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Minimus Performance Analysis ===&lt;br /&gt;
I&#039;m testing minimus and bambus in preparation for the oral microbiome data, and after spamming several lab members with email, it occurred to me that it would be considerably more considerate to put the information here instead.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Memory Usage Analysis&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|RAM used by the Overlapper (in GB):&lt;br /&gt;
| 1.2 || 2.4 || 4.5 || 8.7 || 17 || 21.5 || ~1.1 GB * (#Reads in Millions) = (Memory Used)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|RAM used by the Tigger (in GB):&lt;br /&gt;
| 3 || 6 || 12 || 25 || 48.4 || (60) || ~3 GB * (#Reads in Millions) = (Memory Used)&lt;br /&gt;
|}&lt;br /&gt;
* The 16 million read assembly data is from Walnut; all other numbers are rough averages from both Privet and Walnut.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Run Time Analysis on Privet&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Overlapper (in min):&lt;br /&gt;
| 3 || 9 || 34 || 130 || (576) || 783 || 2.96 * (#Reads in Millions)&amp;lt;sup&amp;gt;1.87&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Tigger (in min):&lt;br /&gt;
| 9 || 66 || 473 || (3,456) || (25,088) || (47,493) || 9.03 * (#Reads in Millions)&amp;lt;sup&amp;gt;2.86&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|}&lt;br /&gt;
* Privet has 2.4GHz Opteron 850 processors and 32GB of RAM. Minimus is not parallelized and therefore only uses a single core.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
* The models were generated by plotting the data points in OpenOffice and fitting a polynomial trendline. The R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; value for each was 1.&lt;br /&gt;
* &#039;&#039;&#039;For reference: There are 1,440 minutes in one day, and 10,080 minutes in one week&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Run Time Analysis on Walnut&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Overlapper (in min):&lt;br /&gt;
| 2.7 || 8 || 27.5 || 102 || (325) || (481.5) || 2.54 * (#Reads in Millions)&amp;lt;sup&amp;gt;1.75&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Tigger (in min):&lt;br /&gt;
| 14 || 81 || 471.5 || (2,752) || (16,006) || (28,212) || 13.99 * (#Reads in Millions)&amp;lt;sup&amp;gt;2.54&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|}&lt;br /&gt;
* Walnut has 2.8GHz Opteron 875 processors and 64GB of RAM. Minimus is not parallelized and therefore only uses a single core.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
* The models were generated by plotting the data points in OpenOffice and fitting a polynomial trendline. The R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; value for each was 1.&lt;br /&gt;
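The fitted models above can be turned into a quick predictor. This is just a sketch using the Walnut coefficients from the table; the function names are mine, and the coefficients are only as precise as the figures published above.

```python
# Power-law run-time models for minimus on Walnut, as fitted above.
# Input is the number of 75bp reads in millions; output is minutes.
# Coefficients are copied from the table and are rounded, so predictions
# match the parenthesized table values only to within rounding error.

def overlapper_minutes(millions_of_reads):
    """Predicted Overlapper run time on Walnut, in minutes."""
    return 2.54 * millions_of_reads ** 1.75

def tigger_minutes(millions_of_reads):
    """Predicted Tigger run time on Walnut, in minutes."""
    return 13.99 * millions_of_reads ** 2.54
```

For example, `round(overlapper_minutes(16))` gives 325, the parenthesized prediction in the table.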
&lt;br /&gt;
&lt;br /&gt;
==== Other Observations About the Assemblies ====&lt;br /&gt;
* Because of the short read length, every million reads is only 75MB of sequence. This is roughly 10-20x coverage of an average single bacterium. Since these test sets have reads sampled from roughly 100 bacterial genomic sequences, I would expect the coverage to be on the order of 0.1-0.2x on average.&lt;br /&gt;
* Unsurprisingly, a cursory glance through the contig files shows that each contig comprises only about 2 or 3 reads.&lt;br /&gt;
* The N50 analysis for the smaller assemblies shows that only 2-3 reads are being added to each contig on average, leaving both N50s and average lengths just below 150bp.&lt;br /&gt;
* Therefore if the complexity of the oral microbiome data is high and/or the contamination of human DNA is extreme (80-95%), the coverage may be extremely low. This may make the use of Mike&#039;s assembler impractical, or at least that&#039;s how I&#039;m going to keep justifying this testing to myself until someone corrects me.&lt;br /&gt;
** &#039;&#039;&#039;Update:&#039;&#039;&#039; Apparently Mike and Dan have talked about this, and somewhere around 75-80bp, the performance of minimus catches up with Mike&#039;s de Bruijn graph assembler anyway. I also did not know that Dan&#039;s map-reduce minimus was running and would be used to assemble the data alongside Mike&#039;s.&lt;br /&gt;
* I learned on Feb. 1, 2010 that the 454 error model allows wild variation with respect to read length, so these assemblies might not actually be representative of the performance with the Illumina data we&#039;re expecting on Feb. 20.&lt;br /&gt;
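The coverage estimate in the first bullet can be checked with back-of-the-envelope arithmetic. The 5 Mbp average genome size below is my round-number assumption, not a figure from the data:

```python
# Rough coverage arithmetic for the test assemblies described above.
# ASSUMPTION: an average bacterial genome is about 5 Mbp.
read_length_bp = 75
reads = 1_000_000           # one million 75bp reads
genome_size_bp = 5_000_000  # assumed average bacterial genome
genomes_in_sample = 100     # approximate number of source genomes

total_bp = read_length_bp * reads                   # 75 Mbp of sequence
single_genome_coverage = total_bp / genome_size_bp  # ~15x for one genome
per_genome_coverage = single_genome_coverage / genomes_in_sample  # ~0.15x
```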
&lt;br /&gt;
=== UMIACS Resources ===&lt;br /&gt;
I just discovered the information listed on the CBCB intranet Resources page is inaccurate and very out of date, so I&#039;m making my own table.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:500px; height:200px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Umiacs Resources&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Machine !! Processor !! Speed !! Cores !! RAM&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Walnut&lt;br /&gt;
| Dual Core AMD Opteron 8220 || 2.8GHz || 16 || 64GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Privet&lt;br /&gt;
| AMD Opteron 850 || 2.4GHz || 4 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Larch&lt;br /&gt;
| AMD Opteron 850 || 2.4GHz || 4 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Sycamore&lt;br /&gt;
| Dual Core AMD Opteron 875 || 1GHz || 8 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Shagbark&lt;br /&gt;
| Intel Core 2 Quad || 2.83GHz || 4 || 4GB&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== February 5, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Meeting with Volker and Sarada on Feb 3 ===&lt;br /&gt;
* Need to teach Sarada how to perform a local BLAST on some sequences they have that aren&#039;t yet in GenBank&lt;br /&gt;
* Trying to set up a meeting with Volker to find out for sure if he wants me to work on this project&lt;br /&gt;
&lt;br /&gt;
=== Biomarker Assembly ===&lt;br /&gt;
Bo, Mohammad, and I spent a couple hours discussing biomarker assembly today. I&#039;m going to try to efficiently summarize our conclusions, but it might be difficult without an easy way to make images. We eventually decided it would be best to attempt several methods in tandem, due to the severe time constraints. The general approach of each method is to fish out and bin reads through one method or another, and then assemble the reads in each bin using minimus. All sequence identity values will be determined using BLASTx.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Preliminary Steps&#039;&#039;&#039;&lt;br /&gt;
* Gather biomarker consensus amino acid sequences&lt;br /&gt;
* Gather amino acid sequences for associated genes from each bacterial genome in refseq&lt;br /&gt;
* Cluster amino acid sequences within each biomarker set&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Sequence Identity Threshold Determination&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
There are 31 biomarkers and about 1,000 bacterial genomes in which they occur. This means that there are 31 sets of 1,000 sequences that are all relatively similar to one another. Because of the sequence similarity and the short read length, it&#039;s possible that a significant number of reads will map equally well to multiple sequences within each biomarker set. For this reason, it is better to allow a single read to be placed in any bin containing a sequence to which the read mapped above some minimum threshold. This will protect against artificially lowering the coverage of extremely well conserved regions, and with any luck, incorrectly binned reads will simply not be included in the assembly. There are several ways to approach the determination of this threshold.&lt;br /&gt;
* Determine the lowest level of sequence identity between the consensus sequence for each biomarker and any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.&lt;br /&gt;
** The obvious shortcoming of this approach is that the sequence identity between two homologous gene-length sequences can be much lower than between two homologous read-length sequences.&lt;br /&gt;
* Align 75mers to determine the lowest score between any two 75mers in the consensus sequence for each biomarker and the corresponding 75mer in any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.&lt;br /&gt;
** While this solves the problem with the above approach, it is significantly more complicated and the data is going to be here soon.&lt;br /&gt;
* Choose a sequence identity level, or try a few different levels and see which produces the most complete biomarker proteins without creating overly complex graphs.&lt;br /&gt;
** While there&#039;s no good theoretical justification for this approach, it&#039;s probably what we&#039;ll do and it will probably work well enough.&lt;br /&gt;
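The first thresholding approach above reduces to taking minima over pairwise identities. A minimal sketch, where `percent_identity` is a stand-in for whatever identity measure the BLASTx runs report, and all names are hypothetical:

```python
# Sketch of the first threshold scheme: the per-marker cutoff is the lowest
# identity between the consensus and any member sequence, and the global
# cutoff is the lowest per-marker cutoff across all 31 biomarker sets.
# percent_identity is a stand-in for an identity measure (e.g. from BLASTx).

def per_marker_threshold(consensus, members, percent_identity):
    """Lowest identity between the consensus and any member sequence."""
    return min(percent_identity(consensus, m) for m in members)

def global_threshold(marker_sets, percent_identity):
    """Lowest per-marker threshold across all (consensus, members) sets."""
    return min(per_marker_threshold(c, ms, percent_identity)
               for c, ms in marker_sets)
```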
&lt;br /&gt;
&#039;&#039;&#039;Schemes&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
After making absurdly complicated descriptions of the various approaches, which I felt weren&#039;t very clear, I used Keynote to recreate the diagrams we&#039;d drawn on the white board and then printed them to a PDF. Unfortunately, I&#039;m not sure exactly how to embed that in the wiki, so email me at trgibbons@gmail.com if you&#039;re reading this and I&#039;ll send it to you.&lt;br /&gt;
# Marker-wise assembly&lt;br /&gt;
#* Bin reads that align to any sequence in a given marker set, and/or the consensus sequence for that marker&lt;br /&gt;
# Cluster-wise assembly&lt;br /&gt;
#* Cluster protein sequences&lt;br /&gt;
#* Bin reads that align to any protein sequence in a given cluster&lt;br /&gt;
# Gene-wise assembly&lt;br /&gt;
#* Bin reads that align to a particular protein sequence&lt;br /&gt;
* Marker-wise and cluster-wise binning should be better for assembling novel sequences&lt;br /&gt;
* Gene-wise binning should produce higher quality assemblies of markers for known organisms or those that are closely related&lt;br /&gt;
&lt;br /&gt;
== February 12, 2010 ==&lt;br /&gt;
&#039;&#039;&#039;SNOW!!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== February 19, 2010 ==&lt;br /&gt;
Met with James to discuss Metastats. I&#039;m going to attempt the following two updates by the end of the semester (I probably incorrectly described them, but I&#039;ll work it out later):&lt;br /&gt;
# Find a better way to compute the false discovery rate (FDR)&lt;br /&gt;
#* Currently computed by using the lowest 100 p-values from each sample (look at source code)&lt;br /&gt;
#* Need to find a more algebraically rigorous way to compute it&lt;br /&gt;
#* The expected number of false positives scales with the p-value (p=0.05 =&amp;gt; 50 of 1,000 null hypotheses will be incorrectly rejected by chance; so if the null hypothesis is thrown out for 100 samples, 50 of those are expected to be incorrect)&lt;br /&gt;
#* James just thinks this sucks and needs to be fixed&lt;br /&gt;
# Compute F-tests for features across all samples&lt;br /&gt;
#* Most requested feature&lt;br /&gt;
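The reasoning in the parenthetical under item 1 can be made explicit. This is only the naive expected-false-positive estimate being criticized, not whatever Metastats actually implements:

```python
# Naive FDR arithmetic from the parenthetical above: at significance level p,
# about p * n_tests true null hypotheses are expected to be rejected by
# chance, so the fraction of rejections expected to be false discoveries is
# roughly that count divided by the number of rejections actually made.
# (This illustrates the criticism; it is not the Metastats source code.)

def expected_false_positives(p, n_tests):
    return p * n_tests

def naive_fdr(p, n_tests, n_rejected):
    return expected_false_positives(p, n_tests) / n_rejected
```

The worked example in the text (p=0.05, 1,000 tests, 100 rejections) gives 50 expected false positives and a naive FDR of 0.5.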
&lt;br /&gt;
I spent too much time talking with people about science and not enough time doing it this week...&lt;br /&gt;
&lt;br /&gt;
== March 26, 2010 ==&lt;br /&gt;
I didn&#039;t realize it had been a whole month since I updated. Let&#039;s see: I nearly dropped Dr. Song&#039;s statistical genomics course, but then I didn&#039;t. I &amp;lt;i&amp;gt;did&amp;lt;/i&amp;gt; however learn that we don&#039;t have a class project. So the Metastats upgrades are going on the back burner for now because ZOMGZ I HAVE TO PRESENT MY PROPOSAL BY THE END OF THIS YEAR!!!&lt;br /&gt;
&lt;br /&gt;
My Thesis Project:&lt;br /&gt;
* I&#039;m generally interested in pathways shared between micro-organisms in a community, and also between micro-organisms and their multicellular hosts.&lt;br /&gt;
** I&#039;m particularly interested in studying the metabolic pathways shared between micro-organisms in the human gut, both with each other and their human hosts.&lt;br /&gt;
* James has created time-series models, and is interested in tackling spatial models with me.&lt;br /&gt;
* I would really like to correlate certain metabolic pathways with his modeled relationships.&lt;br /&gt;
&lt;br /&gt;
Volker&#039;s Project:&lt;br /&gt;
* has taken a big hit this week.&lt;br /&gt;
* I&#039;m going to go forward, with Bo&#039;s help, using the plan outlined in my project proposal for Mihai&#039;s biosequence analysis class:&lt;br /&gt;
** Use reciprocal best BLAST hits to map H37Rv genes to annotated genes in all available virulent and non-virulent strains of mycobacteria&lt;br /&gt;
** Use results from gene mapping to identify a core set of tuberculosis genes, as well as a set of predicted virulence genes&lt;br /&gt;
** Use a variety of comparison schemes to study how the choice of different subsets of non-virulent strains affects the set of predicted virulence genes&lt;br /&gt;
** Use stable virulence prediction to rank genes as virulence targets&lt;br /&gt;
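The first step of the plan above, reciprocal best BLAST hits, can be sketched from two hit tables. The (subject, bit_score) layout here is a hypothetical simplification, not actual BLAST output parsing:

```python
# Sketch of reciprocal-best-hit (RBH) gene mapping between two genomes.
# Each hit table maps a query gene to a list of (subject_gene, bit_score)
# pairs; the layout is a hypothetical stand-in for parsed BLAST output.

def best_hits(hits):
    """For each query, keep the subject with the highest bit score."""
    return {query: max(subjects, key=lambda s: s[1])[0]
            for query, subjects in hits.items()}

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return [(a, b) for a, b in best_ab.items() if best_ba.get(b) == a]
```

Genes without a mutual best hit simply drop out of the mapping, which is what makes the core-gene and virulence-gene subtraction above possible.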
&lt;br /&gt;
Metastats:&lt;br /&gt;
* As I mentioned, this has been put on hold&lt;br /&gt;
* I intend to pick this back up after I&#039;m done with Volker&#039;s project, as it could be instrumental to my thesis work&lt;br /&gt;
&lt;br /&gt;
== April 2, 2010 ==&lt;br /&gt;
More on my Thesis Project:&lt;br /&gt;
* I read the most recent (Science, Nov. 2009) paper by Gordon and Knight on their ongoing gut microbiota experiments&lt;br /&gt;
* Pretty much every section addressed the potential thesis topics I&#039;d imagined while reading the preceding section. Frustrating, but reaffirming (trying to learn from Mihai on not getting bummed about being scooped).&lt;br /&gt;
* Something that seems interesting and useful to me is to do more rigorous statistical analysis to correlate particular genes and pathways with the time-series and spatial data. I will have to work closely with Bo, at least at first.&lt;br /&gt;
* As a starting point, James has recommended building spatial models similar to his time-series models&lt;br /&gt;
* James is essentially mentoring me on my project at this point. It&#039;s pretty excellent.&lt;br /&gt;
&lt;br /&gt;
== July 23, 2010 ==&lt;br /&gt;
It&#039;s been several months, I don&#039;t feel any closer to finding a thesis project, and it&#039;s really starting to stress me out. I&#039;ve finally stopped making excuses for not reading and have been steadily reading about 10 abstracts and 2 papers per week for the last month or two, but it doesn&#039;t appear to be nearly enough. I met with Mihai today to talk about it and then foolishly went for a run in the heat of the afternoon, where I decided on a new direction in a state of euphoric delirium.&lt;br /&gt;
# Read the book Mihai loaned me within the next week (or so): &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; by Michael Wilson&lt;br /&gt;
#* Mihai says the book is a summary of what was known about the human microbiome 5 years ago. The table of contents for the first chapter is essentially identical to the list of Wikipedia pages I&#039;ve read in the past week, so I&#039;m pretty excited to now have a more thorough, authoritative source.&lt;br /&gt;
# Go back to looking for papers describing quorum sensing, especially in organisms known to be present in the human microbiome, either stably or transiently.&lt;br /&gt;
#* Try not to get too side-tracked reading about biofilms.&lt;br /&gt;
#* Search for an existing database of quorum sensing genes to use as references to potentially identify novel quorum sensing genes in microbiome WGS data. Consider making one if it&#039;s not already available.&lt;br /&gt;
# Look for a core metabolome (at this point I think a core microbiome seems unlikely) using metapath (or something similar) in the new HMP data for the gut, oral, and vaginal samples from the 100 reference individuals, as well as other sources like MetaHIT.&lt;br /&gt;
#* Start with a fast and dirty approach pulling all KO&#039;s associated with organisms identified using 16S rDNA sequencing, and then possibly attempt more accurate gene assemblies and annotation from WGS sequencing projects.&lt;br /&gt;
# Try to stay focused on a research topic for more than 2 weeks so I don&#039;t keep wasting time and effort.&lt;br /&gt;
# Don&#039;t make a habit of using the wiki like a personal journal...&lt;br /&gt;
&lt;br /&gt;
=== Possible Research Projects Inspired by &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ===&lt;br /&gt;
I transferred this to a new page I created that&#039;s dedicated to my potential [[User:Tgibbons:Project-Ideas | project ideas]]&lt;br /&gt;
&lt;br /&gt;
== July 30, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Metagenomics Papers and Subjects Related to the Content of &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ===&lt;br /&gt;
# MetaHIT&lt;br /&gt;
# Vaginal Microbiome&lt;br /&gt;
# Acquisition of Microbiome&lt;br /&gt;
#* [http://www.pnas.org/content/early/2010/06/08/1002601107.full.pdf Vaginal birth vs cesarean section]&lt;br /&gt;
&lt;br /&gt;
== August 13, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Other Potential Research Projects ===&lt;br /&gt;
I transferred this to a new page I created that&#039;s dedicated to my potential [[User:Tgibbons:Project-Ideas | project ideas]]&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Ted-Report&amp;diff=7274</id>
		<title>Cbcb:Pop-Lab:Ted-Report</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Ted-Report&amp;diff=7274"/>
		<updated>2010-08-09T15:53:34Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* Other Potential Research Projects */ Replaced with link to new dedicated page for my project ideas&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Older Entries ==&lt;br /&gt;
[[Cbcb:Pop-Lab:Ted-Report-2009 | 2009]]&lt;br /&gt;
&lt;br /&gt;
== January 15, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Minimus Documentation ===&lt;br /&gt;
&lt;br /&gt;
Presently, the only relevant Google hit for &amp;quot;minimus&amp;quot; on the first page of results is the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus#Basic_usage_example SourceForge wiki]. The only example on this page is incomplete and appears to be an early draft made during development.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Ideally, it should be easy to find a complete guide with the general format:&lt;br /&gt;
* Simple use case:&lt;br /&gt;
 `toAmos -s path/to/fastaFile.seq -o path/to/fastaFile.afg`&lt;br /&gt;
 `minimus path/to/fastaFile(prefix)`&lt;br /&gt;
* Necessary tools for set up (toAmos)&lt;br /&gt;
* Other options&lt;br /&gt;
* etc&lt;br /&gt;
&lt;br /&gt;
The description found on the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus/README Minimus/README] page (linked to from the middle of the starting page) is more appropriate, but features use cases that may no longer be common and references another required tool (toAmos) without linking to it or describing how to access it. A description of this tool can be found on the Amos [http://sourceforge.net/apps/mediawiki/amos/index.php?title=File_conversion_utilities File Conversion Utilities] page (again, linked to from the starting page), but it is less organized than what I&#039;ve come to expect from a project page, and it is easy to get lost or distracted by the rest of the Amos documentation while trying to piece together the necessary steps for a basic assembly.&lt;br /&gt;
&lt;br /&gt;
=== Comparative Network Analysis pt. 2 ===&lt;br /&gt;
* Meeting with Volker this Friday to discuss how best to apply network alignment to what he&#039;s doing&lt;br /&gt;
* I&#039;m simultaneously trying to find a way to apply my network alignment technique to predicting genes in metagenomic samples&lt;br /&gt;
** I&#039;ve been trying to find a way to get beyond the restriction that my current program requires genes to be annotated with an EC number. A potentially interesting next step may be to use BioPython to BLAST the sequence of each enzyme annotated in every micro-organism in KEGG against a metagenomic library.&lt;br /&gt;
*** The results would be stretches of linked reactions that have been annotated in KEGG pathways.&lt;br /&gt;
*** This method could be applied to contigs just as easily as finished sequences. In a scenario with low coverage, it could be used to identify genes that are probably present but just weren&#039;t sampled, by showing the presence of the rest of the pathway. In short, this could finally accomplish what Mihai asked me to work on when I showed up.&lt;br /&gt;
*** The major theoretical shortcoming of this approach is that it could only identify relatively well characterized pathways.&lt;br /&gt;
*** The practical shortcomings of this approach start with obtaining a fairly complete copy of KEGG (which, as we&#039;ve learned, is a mess to parse locally and unusably slow to call through the API), and continue to the computational challenge of such a large-scale BLAST operation.&lt;br /&gt;
** Ask Bo about this when he gets back. He may have already done this.&lt;br /&gt;
&lt;br /&gt;
== January 22, 2010 ==&lt;br /&gt;
* Met with Dan and Sergey to talk about the Minimus-Bambus pipeline&lt;br /&gt;
** Minimus is running fine. I&#039;ve begun characterizing its run-time behavior (see next week&#039;s entry)&lt;br /&gt;
** After some tweaking by Sergey, Bambus was able to finish but did not generate a scaffold. We&#039;re going to talk about this after the meeting on Monday.&lt;br /&gt;
** Sergey had an interesting idea for making a better read simulator:&lt;br /&gt;
*** Error-free reads are cheap and easy to generate. The problem is with the error model.&lt;br /&gt;
*** The &amp;quot;best&amp;quot; tool (that we are aware of) which includes error models is MetaSim, but its error models are years out of date and the authors have been historically unreachable. While Mihai has now shown me how to edit the models in a reasonable way from flat files, allowing me to characterize base substitutions, I&#039;m not convinced it would be faster or easier to write a program that would modify these files than it would be to just write an entirely new program; and given the amount of time I&#039;ve spent trying to use MetaSim, I&#039;m more than ready to walk away from it. Oh yeah, and MetaSim doesn&#039;t work from the command line, so no scripting.&lt;br /&gt;
*** Sergey has pointed out that most companies will assemble &#039;&#039;E. coli&#039;&#039; when they release a new sequencer. Conveniently, there are many high quality assemblies of &#039;&#039;E. coli&#039;&#039; available for reference. It might therefore be possible to generate new error models for these sequencers in an automated fashion by mapping the &#039;&#039;E. coli&#039;&#039; reads to the available reference genomes, collecting the error frequencies, and then using them to mask synthesized reads.&lt;br /&gt;
*** I also talked with Mohammad and Mihai about this, and they also seemed to think it was a pretty good idea. Mihai has proposed having Sergey or Mohammad add the described error-model generator to his read sampler (written in C) when they have time, but not in preparation for the oral microbiome data.&lt;br /&gt;
&lt;br /&gt;
* Met with James to discuss my work with Volker&lt;br /&gt;
** Told him about my meeting with Volker and the paper he wanted me to prepare, more or less by myself. The concepts of the paper are these:&lt;br /&gt;
*** Most available genomic sequences of mycobacteria are of a very small subset of highly pathogenic organisms.&lt;br /&gt;
*** Subtractive comparative genomics can be used to identify genes that are potentially responsible for differing phenotypes (such as extreme pathogenicity), but there must be available genomic sequences for closely related organisms with differing phenotypes.&lt;br /&gt;
*** Volker has sequenced 2 more non-pathogenic strains of mycobacteria (&#039;&#039;gastri&#039;&#039;, and &#039;&#039;kansasiiW58&#039;&#039;) with the intention of increasing the effectiveness of these subtractive comparative genomic studies.&lt;br /&gt;
*** The meat of the paper would be comparing the results of subtractive comparative genomic analysis using all currently available strains in RefSeq, with the results from also using the two novel sequences.&lt;br /&gt;
*** The other, smaller publishable portion of this project would be a comparison of &#039;&#039;gastri&#039;&#039; and &#039;&#039;kansasiiW58&#039;&#039; to each other, because they are thought to be extremely closely related and yet have distinct phenotypes (which I&#039;ve now forgotten).&lt;br /&gt;
*** James seemed to think this could make an okay paper, and he confirmed that he had not realized Volker was looking for someone to do all of the analysis, both computational and biological, with Volker only contributing analysis of the analysis after it was all over.&lt;br /&gt;
** Ended up also discussing his work on differential abundance in populations of microorganisms.&lt;br /&gt;
*** I&#039;m going to start working on taking over and expanding Metastats this semester.&lt;br /&gt;
*** I&#039;m also going to start talking to Bo when he gets back about exactly what he&#039;s doing, and how I might be able to include pathway prediction in my expansion of Metastats without stepping on his toes.&lt;br /&gt;
*** Mihai has given me his approval to focus on this.&lt;br /&gt;
&lt;br /&gt;
* Met with Mihai to discuss working with Volker&lt;br /&gt;
** Explained that rather than looking for someone to do only the complex portions of the computational analysis, Volker was/is looking for someone to do the complete analysis.&lt;br /&gt;
** In exchange, Volker is offering first authorship and, if need be, to split the student&#039;s funding with their primary PI.&lt;br /&gt;
** I think I&#039;m capable of doing this within 3 or 4 months but it would consume my time pretty thoroughly.&lt;br /&gt;
** Mihai agreed that this is a reasonable deal, but noted that I have no personal interest in studying mycobacteria, and it&#039;s therefore unwise of me to invest a bunch of time becoming an expert on an organism I have no interest in continuing to study or work with. I&#039;ve therefore offered to work closely with one of Volker&#039;s graduate students, who could meet with me every week or two. I would be willing to do all of the computational analysis and explain it to them, but they would have to actually look up potentially interesting genes and relationships I discover and help me keep the analysis biologically interesting and relevant.&lt;br /&gt;
&lt;br /&gt;
* Met with Mihai and Mohammad to discuss our impending huge-ass(embly) problem&lt;br /&gt;
** Talked about strategies for iterative assembly as an approach to assembling intractably large data sets. Most have glaring shortcomings and complications.&lt;br /&gt;
** Discovered Mike Schatz has a map-reduce implementation of an assembler that uses de Bruijn graphs and is better suited to assemblies with high coverage but short read lengths.&lt;br /&gt;
&lt;br /&gt;
== January 29, 2010 ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Minimus Performance Analysis ===&lt;br /&gt;
I&#039;m testing minimus and bambus in preparation for the oral microbiome data, and after spamming several lab members with email, it occurred to me that it would be considerably more considerate to put the information here instead.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Memory Usage Analysis&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|RAM used by the Overlapper (in GB):&lt;br /&gt;
| 1.2 || 2.4 || 4.5 || 8.7 || 17 || 21.5 || ~1.1 GB * (#Reads in Millions) = (Memory Used)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|RAM used by the Tigger (in GB):&lt;br /&gt;
| 3 || 6 || 12 || 25 || 48.4 || (60) || ~3 GB * (#Reads in Millions) = (Memory Used)&lt;br /&gt;
|}&lt;br /&gt;
* The 16 million read assembly data is from Walnut, all other numbers are rough averages from both Privet and Walnut.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Run Time Analysis on Privet&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Overlapper (in min):&lt;br /&gt;
| 3 || 9 || 34 || 130 || (576) || 783 || 2.96 * (#Reads in Millions)&amp;lt;sup&amp;gt;1.87&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Tigger (in min):&lt;br /&gt;
| 9 || 66 || 473 || (3,456) || (25,088) || (47,493) || 9.03 * (#Reads in Millions)&amp;lt;sup&amp;gt;2.86&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|}&lt;br /&gt;
* Privet has 2.4GHz Opteron 850 processors and 32GB of RAM. Minimus is not parallelized and therefore only uses a single core.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
* The models were generated by plotting the data points in OpenOffice and fitting a power trendline. The R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; value for each was 1.&lt;br /&gt;
* &#039;&#039;&#039;For reference: There are 1,440 minutes in one day, and 10,080 minutes in one week&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Run Time Analysis on Walnut&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Overlapper (in min):&lt;br /&gt;
| 2.7 || 8 || 27.5 || 102 || (325) || (481.5) || 2.54 * (#Reads in Millions)&amp;lt;sup&amp;gt;1.75&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Tigger (in min):&lt;br /&gt;
| 14 || 81 || 471.5 || (2,752) || (16,006) || (28,212) || 13.99 * (#Reads in Millions)&amp;lt;sup&amp;gt;2.54&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|}&lt;br /&gt;
* Walnut has 2.8GHz Opteron 875 processors and 64GB of RAM. Minimus is not parallelized and therefore only uses a single core.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
* The models were generated by plotting the data points in OpenOffice and fitting a power trendline. The R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; value for each was 1.&lt;br /&gt;
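For what it&#039;s worth, the same kind of fit can be reproduced outside of a spreadsheet with an ordinary least-squares regression in log-log space. A minimal Python sketch, using the Walnut overlapper timings from the table above:&lt;br /&gt;

```python
import math

# Walnut overlapper timings: (reads in millions, run time in minutes)
data = [(1, 2.7), (2, 8.0), (4, 27.5), (8, 102.0)]

# Fit T = a * n**b by ordinary least squares on (log n, log T)
xs = [math.log(n) for n, _ in data]
ys = [math.log(t) for _, t in data]
mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = math.exp(my - b * mx)

print(round(a, 2), round(b, 2))  # recovers roughly 2.54 and 1.75
print(round(a * 20 ** b))        # predicted minutes for 20 million reads
```

Evaluating the fitted model at 20 million reads lands within a minute of the 481.5 listed in the table, so the spreadsheet fit and this one agree.&lt;br /&gt;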
&lt;br /&gt;
&lt;br /&gt;
==== Other Observations About the Assemblies ====&lt;br /&gt;
* Because of the short read length, every million reads is only 75Mbp of sequence. This is roughly 10-20x coverage of an average single bacterial genome. Since these test sets have reads sampled from roughly 100 bacterial genomic sequences, I would expect the per-genome coverage to be on the order of 0.1-0.2x on average.&lt;br /&gt;
* Unsurprisingly, a cursory glance through the contig files shows that each contig is composed of only about 2 or 3 reads.&lt;br /&gt;
* The N50 analysis for the smaller assemblies shows that only 2-3 reads are being added to each contig on average, leaving both N50s and average contig lengths just below 150bp.&lt;br /&gt;
* Therefore if the complexity of the oral microbiome data is high and/or the contamination of human DNA is extreme (80-95%), the coverage may be extremely low. This may make the use of Mike&#039;s assembler impractical, or at least that&#039;s how I&#039;m going to keep justifying this testing to myself until someone corrects me.&lt;br /&gt;
** &#039;&#039;&#039;Update:&#039;&#039;&#039; Apparently Mike and Dan have talked about this, and somewhere around 75-80bp, the performance of minimus catches up with Mike&#039;s de Bruijn graph assembler anyway. I also did not know that Dan&#039;s map-reduce minimus was running and would be used to assemble the data alongside Mike&#039;s.&lt;br /&gt;
* I learned on Feb. 1, 2010 that the 454 error model allows wild variation with respect to read length. So these assemblies might not actually be representative of the performance with the Illumina data we&#039;re expecting on Feb. 20.&lt;br /&gt;
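A back-of-the-envelope check of the coverage arithmetic above, assuming a typical bacterial genome of about 5 Mbp (the genome size is my assumption):&lt;br /&gt;

```python
# Rough expected coverage for the simulated read sets
reads = 1_000_000        # one million reads
read_len = 75            # bp per read
genomes = 100            # genomes the test reads were sampled from
genome_size = 5_000_000  # bp; a typical bacterial genome (assumption)

total_bases = reads * read_len                 # 75 Mbp per million reads
single_genome_cov = total_bases / genome_size  # if all reads came from one genome
per_genome_cov = single_genome_cov / genomes   # spread across 100 genomes
print(single_genome_cov, per_genome_cov)       # 15.0 0.15
```

So a million reads gives on the order of 15x coverage of a single genome, but only about 0.15x per genome once it is spread across a community of 100.&lt;br /&gt;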
&lt;br /&gt;
=== UMIACS Resources ===&lt;br /&gt;
I just discovered the information listed on the CBCB intranet Resources page is inaccurate and very out of date, so I&#039;m making my own table.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:500px; height:200px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Umiacs Resources&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Machine !! Processor !! Speed !! Cores !! RAM&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Walnut&lt;br /&gt;
| Dual Core AMD Opteron 8220 || 2.8GHz || 16 || 64GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Privet&lt;br /&gt;
| AMD Opteron 850 || 2.4GHz || 4 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Larch&lt;br /&gt;
| AMD Opteron 850 || 2.4GHz || 4 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Sycamore&lt;br /&gt;
| Dual Core AMD Opteron 875 || 1GHz || 8 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Shagbark&lt;br /&gt;
| Intel Core 2 Quad || 2.83GHz || 4 || 4GB&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== February 5, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Meeting with Volker and Sarada on Feb 3 ===&lt;br /&gt;
* Need to teach Sarada how to run local BLAST on some sequences they have that aren&#039;t yet in GenBank&lt;br /&gt;
* Trying to set up a meeting with Volker to find out for sure if he wants me to work on this project&lt;br /&gt;
&lt;br /&gt;
=== Biomarker Assembly ===&lt;br /&gt;
Bo, Mohammad, and I spent a couple of hours discussing biomarker assembly today. I&#039;m going to try to efficiently summarize our conclusions, but it might be difficult without an easy way to make images. We eventually decided it would be best to attempt several methods in tandem, due to the severe time constraints. The general approach of each method is to fish out and bin reads through one method or another, and then assemble the reads in each bin using minimus. All sequence identity values will be determined by using BLASTx.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Preliminary Steps&#039;&#039;&#039;&lt;br /&gt;
* Gather biomarker consensus amino acid sequences&lt;br /&gt;
* Gather amino acid sequences for associated genes from each bacterial genome in refseq&lt;br /&gt;
* Cluster amino acid sequences within each biomarker set&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Sequence Identity Threshold Determination&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
There are 31 biomarkers and about 1,000 bacterial genomes in which they occur. This means that there are 31 sets of 1,000 sequences that are all relatively similar to one another. Because of the sequence similarity and the short read length, it&#039;s possible that a significant number of reads will map equally well to multiple sequences within each biomarker set. For this reason, it is better to allow a single read to be placed in any bin containing a sequence to which the read mapped above some minimum threshold. This will protect against synthetically lowering the coverage of extremely well conserved regions, and with any luck, incorrectly binned reads will simply not be included in the assembly. There are several ways to approach the determination of this threshold.&lt;br /&gt;
* Determine the lowest level of sequence identity between the consensus sequence for each biomarker and any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.&lt;br /&gt;
** The obvious shortcoming of this approach is that the sequence identity between two homologous gene-length sequences can be much lower than between two homologous read-length sequences.&lt;br /&gt;
* Align 75mers to determine the lowest score between any two 75mers in the consensus sequence for each biomarker and the corresponding 75mer in any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.&lt;br /&gt;
** While this solves the problem with the above approach, it is significantly more complicated and the data is going to be here soon.&lt;br /&gt;
* Choose a sequence identity level, or try a few different levels and see which produces the most complete biomarker proteins without creating overly complex graphs.&lt;br /&gt;
** While there&#039;s no good theoretical justification for this approach, it&#039;s probably what we&#039;ll do and it will probably work well enough.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Schemes&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
After making absurdly complicated descriptions of the various approaches which I felt weren&#039;t very clear, I used Keynote to recreate the diagrams we&#039;d drawn on the whiteboard and then printed them to a PDF. Unfortunately I&#039;m not sure exactly how to embed that in the wiki, so email me at trgibbons@gmail.com if you&#039;re reading this and I&#039;ll send it to you.&lt;br /&gt;
# Marker-wise assembly&lt;br /&gt;
#* Bin reads that align to any sequence in a given marker set, and/or the consensus sequence for that marker&lt;br /&gt;
# Cluster-wise assembly&lt;br /&gt;
#* Cluster protein sequences&lt;br /&gt;
#* Bin reads that align to any protein sequence in a given cluster&lt;br /&gt;
# Gene-wise assembly&lt;br /&gt;
#* Bin reads that align to a particular protein sequence&lt;br /&gt;
* Marker-wise and cluster-wise binning should be better for assembling novel sequences&lt;br /&gt;
* Gene-wise binning should produce higher quality assemblies of markers for known organisms or those that are closely related&lt;br /&gt;
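Whichever threshold is chosen, the binning step shared by all three schemes can be sketched in a few lines of Python (every read ID, reference ID, and identity value below is hypothetical; operator.ge is just an explicit at-or-above comparison):&lt;br /&gt;

```python
from collections import defaultdict
from operator import ge

def bin_reads(hits, threshold):
    """Place each read in EVERY bin whose alignment identity meets the
    threshold, not just the bin of its single best hit."""
    bins = defaultdict(list)
    for read, alignments in hits.items():
        for seq_id, identity in alignments.items():
            if ge(identity, threshold):  # identity at or above the cutoff
                bins[seq_id].append(read)
    return dict(bins)

# Hypothetical BLASTx results: read id mapped to {reference id: percent identity}
hits = {
    "read_001": {"rpoB_ecoli": 92.0, "rpoB_bsub": 71.0},
    "read_002": {"gyrA_ecoli": 88.5},
}
bins = bin_reads(hits, 80.0)
print(bins)  # read_001 lands only in the rpoB_ecoli bin
```

Lowering the threshold lets read_001 fall into both rpoB bins, which is exactly the multi-bin behavior described in the threshold section above.&lt;br /&gt;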
&lt;br /&gt;
== February 12, 2010 ==&lt;br /&gt;
&#039;&#039;&#039;SNOW!!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== February 19, 2010 ==&lt;br /&gt;
Met with James to discuss Metastats. I&#039;m going to attempt the following two updates by the end of the semester (I probably incorrectly described them, but I&#039;ll work it out later):&lt;br /&gt;
# Find a better way to compute the false discovery rate (FDR)&lt;br /&gt;
#* Currently computed by using the lowest 100 p-values from each sample (look at source code)&lt;br /&gt;
#* Need to find a more algebraically rigorous way to compute it&lt;br /&gt;
#* For 1,000 tests, the expected number of false positives is the p-value threshold times the number of tests (p=0.05 =&amp;gt; 50 null hypotheses will be incorrectly rejected; so if the null hypothesis is rejected for 100 samples, 50 of them would be expected to be incorrect)&lt;br /&gt;
#* James just thinks this sucks and needs to be fixed&lt;br /&gt;
# Compute F-tests for features across all samples&lt;br /&gt;
#* Most requested feature&lt;br /&gt;
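For reference, one standard candidate for a more rigorous correction (not necessarily what James has in mind) is the Benjamini-Hochberg procedure, which converts each p-value into the smallest FDR at which that test would still be called significant. A self-contained sketch:&lt;br /&gt;

```python
def bh_qvalues(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    qvals = [0.0] * m
    running = 1.0
    for rank in reversed(range(m)):  # walk from the largest p to the smallest
        i = order[rank]
        # raw BH value p * m / rank, forced monotone with a running minimum
        running = min(running, pvals[i] * m / (rank + 1))
        qvals[i] = running
    return qvals

print(bh_qvalues([0.01, 0.04, 0.03, 0.20]))
```

On this toy input, the first test gets q = 0.04 and the two middle tests share a q-value of about 0.0533, because the running minimum enforces monotonicity.&lt;br /&gt;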
&lt;br /&gt;
I spent too much time talking with people about science and not enough time doing it this week...&lt;br /&gt;
&lt;br /&gt;
== March 26, 2010 ==&lt;br /&gt;
I didn&#039;t realize it had been a whole month since I updated. Let&#039;s see: I nearly dropped Dr. Song&#039;s statistical genomics course, but then I didn&#039;t. I &amp;lt;i&amp;gt;did&amp;lt;/i&amp;gt; however learn that we don&#039;t have a class project. So the Metastats upgrades are going on the back burner for now because ZOMGZ I HAVE TO PRESENT MY PROPOSAL BY THE END OF THIS YEAR!!!&lt;br /&gt;
&lt;br /&gt;
My Thesis Project:&lt;br /&gt;
* I&#039;m generally interested in pathways shared between micro-organisms in a community, and also between micro-organisms and their multicellular hosts.&lt;br /&gt;
** I&#039;m particularly interested in studying the metabolic pathways shared between micro-organisms in the human gut, both with each other and their human hosts.&lt;br /&gt;
* James has created time-series models, and is interested in tackling spatial models with me.&lt;br /&gt;
* I would really like to correlate certain metabolic pathways with his modeled relationships.&lt;br /&gt;
&lt;br /&gt;
Volker&#039;s Project:&lt;br /&gt;
* has taken a big hit this week.&lt;br /&gt;
* I&#039;m going to go forward, with Bo&#039;s help, using the plan outlined in my project proposal for Mihai&#039;s biosequence analysis class:&lt;br /&gt;
** Use reciprocal best BLAST hits to map H37Rv genes to annotated genes in all available virulent and non-virulent strains of mycobacteria&lt;br /&gt;
** Use results from gene mapping to identify a core set of tuberculosis genes, as well as a set of predicted virulence genes&lt;br /&gt;
** Use a variety of comparison schemes to study how considering different subsets of non-virulent strains affects the set of predicted virulence genes&lt;br /&gt;
** Use stable virulence prediction to rank genes as virulence targets&lt;br /&gt;
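The reciprocal-best-hit step in the first bullet can be sketched as follows (the gene IDs and best-hit tables are made up for illustration; in practice they would come from parsed BLAST reports):&lt;br /&gt;

```python
def reciprocal_best_hits(hits_ab, hits_ba):
    """Given best-hit tables in both directions (query id mapped to its
    single best BLAST hit in the other genome), keep the mutual pairs."""
    pairs = []
    for gene_a, gene_b in hits_ab.items():
        if hits_ba.get(gene_b) == gene_a:  # best hits agree in both directions
            pairs.append((gene_a, gene_b))
    return pairs

# Hypothetical best-hit tables between H37Rv and another strain
h37rv_to_strain = {"Rv0001": "MS0001", "Rv0002": "MS0044"}
strain_to_h37rv = {"MS0001": "Rv0001", "MS0044": "Rv0099"}
print(reciprocal_best_hits(h37rv_to_strain, strain_to_h37rv))
# only the Rv0001/MS0001 pair is reciprocal
```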
&lt;br /&gt;
Metastats:&lt;br /&gt;
* As I mentioned, this is put on hold&lt;br /&gt;
* I intend to pick this back up after I&#039;m done with Volker&#039;s project, as it could be instrumental to my thesis work&lt;br /&gt;
&lt;br /&gt;
== April 2, 2010 ==&lt;br /&gt;
More on my Thesis Project:&lt;br /&gt;
* I read the most recent (Science, Nov. 2009) paper by Gordon and Knight on their ongoing gut microbiota experiments&lt;br /&gt;
* Pretty much every section addressed the potential thesis topics I&#039;d imagined while reading the preceding section. Frustrating, but reaffirming (trying to learn from Mihai on not getting bummed about being scooped).&lt;br /&gt;
* Something that seems interesting and useful to me is to do more rigorous statistical analysis to attempt to correlate particular genes and pathways with the time-series and spatial data. I will have to work closely with Bo, at least at first.&lt;br /&gt;
* As a starting point, James has recommended building spatial models similar to his time-series models&lt;br /&gt;
* James is essentially mentoring me on my project at this point. It&#039;s pretty excellent.&lt;br /&gt;
&lt;br /&gt;
== July 23, 2010 ==&lt;br /&gt;
It&#039;s been several months, I don&#039;t feel any closer to finding a thesis project, and it&#039;s really starting to stress me out. I&#039;ve finally stopped making excuses for not reading and have been steadily reading about 10 abstracts and 2 papers per week for the last month or two, but it doesn&#039;t appear to be nearly enough. I met with Mihai today to talk about it and then foolishly went for a run in the heat of the afternoon, where I decided on a new direction in a state of euphoric delirium.&lt;br /&gt;
# Read the book Mihai loaned me within the next week (or so): &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; by Michael Wilson&lt;br /&gt;
#* Mihai says the book is a summary of what was known about the human microbiome 5 years ago. The table of contents for the first chapter is essentially identical to the list of Wikipedia pages I&#039;ve read in the past week, so I&#039;m pretty excited to now have a more thorough, authoritative source.&lt;br /&gt;
# Go back to looking for papers describing quorum sensing, especially in organisms known to be present in the human microbiome, either stably or transiently.&lt;br /&gt;
#* Try not to get too side-tracked reading about biofilms.&lt;br /&gt;
#* Search for an existing database of quorum sensing genes to use as references to potentially identify novel quorum sensing genes in microbiome WGS data. Consider making one if it&#039;s not already available.&lt;br /&gt;
# Look for a core metabolome (at this point I think a core microbiome seems unlikely) using metapath (or something similar) in the new HMP data for the gut, oral, and vaginal samples from the 100 reference individuals, as well as other sources like MetaHIT.&lt;br /&gt;
#* Start with a quick-and-dirty approach, pulling all KOs associated with organisms identified using 16S rDNA sequencing, and then possibly attempt more accurate gene assemblies and annotation from WGS sequencing projects.&lt;br /&gt;
# Try to stay focused on a research topic for more than 2 weeks so I don&#039;t keep wasting time and effort.&lt;br /&gt;
# Don&#039;t make a habit of using the wiki like a personal journal...&lt;br /&gt;
&lt;br /&gt;
=== Possible Research Projects Inspired by &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ===&lt;br /&gt;
# Searching for quorum sensing genes, both known and novel, and any pathways including them in human microbiome data&lt;br /&gt;
#* Search for and consider making quorum sensing gene DB&lt;br /&gt;
#** KEGG has pathways containing both acyl-homoserine lactone and its synthase&lt;br /&gt;
#* After indexing known quorum sensing genes, search for homologues&lt;br /&gt;
#** WGS data - Obviously search for homologues directly&lt;br /&gt;
#** 16S data - Identify organisms and search for homologues in public DBs&lt;br /&gt;
# Search for &amp;quot;core metabolome&amp;quot; in pioneer organisms from infant studies&lt;br /&gt;
#* On the off chance I get to publish on this subject, it might be interesting to draw an analogy between genetic evolution involving chance mutations that are occasionally beneficial and are therefore propagated, and microbial colonization which involves chance introduction of microorganisms that are occasionally beneficial and therefore become stable members of the microbiome.&lt;br /&gt;
# Attempt to search for cases of symbiosis where possible&lt;br /&gt;
#* Search terms: commensalism, synergism (protocooperation), mutualism, competition, amensalism (antagonism), predation, parasitism, neutralism, syntrophism (nutritional synergism, cross-feeding, examples on pp.16-17)&lt;br /&gt;
&lt;br /&gt;
== July 30, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Metagenomics Papers and Subjects Related to the Content of &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ===&lt;br /&gt;
# MetaHIT&lt;br /&gt;
# Vaginal Microbiome&lt;br /&gt;
# Acquisition of Microbiome&lt;br /&gt;
#* [http://www.pnas.org/content/early/2010/06/08/1002601107.full.pdf Vaginal birth vs cesarean section]&lt;br /&gt;
&lt;br /&gt;
== August 13, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Other Potential Research Projects ===&lt;br /&gt;
I transferred this to a new page I created that&#039;s dedicated to my potential [[User:Tgibbons:Project-Ideas | project ideas]]&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7273</id>
		<title>User:Tgibbons:Project-Ideas</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=User:Tgibbons:Project-Ideas&amp;diff=7273"/>
		<updated>2010-08-09T15:51:04Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: Moved previous ideas from my progress report&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;After a while, my weekly progress report just didn&#039;t seem like the appropriate place for my brainstorming, so I&#039;ve transferred everything here.&lt;br /&gt;
&lt;br /&gt;
== Potential Research Projects Inspired by &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ==&lt;br /&gt;
# Searching for quorum sensing genes, both known and novel, and any pathways including them in human microbiome data&lt;br /&gt;
#* Search for and consider making quorum sensing gene DB&lt;br /&gt;
#** KEGG has pathways containing both acyl-homoserine lactone and its synthase&lt;br /&gt;
#* After indexing known quorum sensing genes, search for homologues&lt;br /&gt;
#** WGS data - Obviously search for homologues directly&lt;br /&gt;
#** 16S data - Identify organisms and search for homologues in public DBs&lt;br /&gt;
# Search for &amp;quot;core metabolome&amp;quot; in pioneer organisms from infant studies&lt;br /&gt;
#* On the off chance I get to publish on this subject, it might be interesting to draw an analogy between genetic evolution involving chance mutations that are occasionally beneficial and are therefore propagated, and microbial colonization which involves chance introduction of microorganisms that are occasionally beneficial and therefore become stable members of the microbiome.&lt;br /&gt;
# Attempt to search for cases of symbiosis where possible&lt;br /&gt;
#* Search terms: commensalism, synergism (protocooperation), mutualism, competition, amensalism (antagonism), predation, parasitism, neutralism, syntrophism (nutritional synergism, cross-feeding, examples on pp.16-17)&lt;br /&gt;
&lt;br /&gt;
== Other Potential Research Projects ==&lt;br /&gt;
# Metagenomic assembly&lt;br /&gt;
#* I&#039;ve been kicking this idea around the floor for months, but none of the people I perceive as being better suited to tackle the problem have appeared all that interested.&lt;br /&gt;
#* Essentially I think the ideal metagenomic assembler would allow for, and gracefully represent, diversity within a single organism, without collapsing the genetic material of an entire community into a single messy contig.&lt;br /&gt;
#* The major theoretical challenge would be the development of an algorithm that could differentiate between variation and &amp;quot;speciation&amp;quot; in a biologically meaningful way. This is far from being a new problem.&lt;br /&gt;
#* In addition to the well-known theoretical challenges, coding this in such a way as to make it practical for current and future metagenomic problems would probably require more programming skills than I have the time to master. As such, I&#039;m hoping that maybe I can work out the theory on a smaller scale using Python, and then maybe convince someone else to collaborate with me for the implementation, or just publish my findings and leave it to the scientific community to incorporate my methods into their-favorite-assembler.&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Ted-Report&amp;diff=7272</id>
		<title>Cbcb:Pop-Lab:Ted-Report</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Ted-Report&amp;diff=7272"/>
		<updated>2010-08-09T15:29:22Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* August 13, 2010 */ Created a new entry with a subsection for my project ideas&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Older Entries ==&lt;br /&gt;
[[Cbcb:Pop-Lab:Ted-Report-2009 | 2009]]&lt;br /&gt;
&lt;br /&gt;
== January 15, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Minimus Documentation ===&lt;br /&gt;
&lt;br /&gt;
Presently, the only relevant Google hit for &amp;quot;minimus&amp;quot; on the first page of results is the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus#Basic_usage_example sourceforge wiki.] The only example on this page is incomplete and appears to be an early draft made during development.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Ideally, it should be easy to find a complete guide with the general format:&lt;br /&gt;
* Simple use case:&lt;br /&gt;
 `toAmos -s path/to/fastaFile.seq -o path/to/fastaFile.afg`&lt;br /&gt;
 `minimus path/to/fastaFile(prefix)`&lt;br /&gt;
* Necessary tools for set up (toAmos)&lt;br /&gt;
* Other options&lt;br /&gt;
* etc&lt;br /&gt;
&lt;br /&gt;
The description found on the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus/README Minimus/README] page (linked to from the middle of the starting page) is more appropriate, but features use cases that may no longer be common and references another required tool (toAmos) without linking to it or describing how to access it. A description of this tool can be found on the Amos [http://sourceforge.net/apps/mediawiki/amos/index.php?title=File_conversion_utilities File Conversion Utilities] page (again, linked to from the starting page), but it is less organized than what I&#039;ve come to expect from a project page, and it is easy to get lost or distracted by the rest of the Amos documentation while trying to piece together the necessary steps for a basic assembly.&lt;br /&gt;
&lt;br /&gt;
=== Comparative Network Analysis pt. 2 ===&lt;br /&gt;
* Meeting with Volker this Friday to discuss how best to apply network alignment to what he&#039;s doing&lt;br /&gt;
* I&#039;m simultaneously trying to find a way to apply my network alignment technique to predicting genes in metagenomic samples&lt;br /&gt;
** I&#039;ve been trying to find a way to get beyond the restriction that my current program requires genes to be annotated with an EC number. A potentially interesting next step may be to use BioPython to BLAST the sequence of each enzyme annotated in every micro-organism in KEGG against a metagenomic library.&lt;br /&gt;
*** The results would be stretches of linked reactions that have been annotated in KEGG pathways.&lt;br /&gt;
*** This method could be applied to contigs just as easily as finished sequences. In a scenario with low coverage, it could be used to identify genes that are probably present but simply weren&#039;t sampled, by showing the presence of the rest of the pathway. In short, this could finally accomplish what Mihai asked me to work on when I showed up.&lt;br /&gt;
*** The major theoretical shortcoming of this approach is that it could only identify relatively well characterized pathways.&lt;br /&gt;
*** The practical shortcomings of this approach start with obtaining a fairly complete copy of KEGG (which, as we&#039;ve learned, is a mess to parse locally and unusably slow to call through the API), and continue to the computational challenge of such a large-scale BLAST operation.&lt;br /&gt;
** Ask Bo about this when he gets back. He may have already done this.&lt;br /&gt;
&lt;br /&gt;
== January 22, 2010 ==&lt;br /&gt;
* Met with Dan and Sergey to talk about the Minimus-Bambus pipeline&lt;br /&gt;
** Minimus is running fine. I&#039;ve begun characterizing its run-time behavior (see next week&#039;s entry)&lt;br /&gt;
** After some tweaking by Sergey, Bambus was able to finish but did not generate a scaffold. We&#039;re going to talk about this after the meeting on Monday.&lt;br /&gt;
** Sergey had an interesting idea for making a better read simulator:&lt;br /&gt;
*** Error-free reads are cheap and easy to generate. The problem is with the error model.&lt;br /&gt;
*** The &amp;quot;best&amp;quot; tool (that we are aware of) which includes error models is MetaSim, but its error models are years out of date and the author has been historically unreachable. While Mihai has now shown me how to edit the models in a reasonable way from flat files that characterize base substitutions, I&#039;m not convinced it would be faster or easier to write a program to modify these files than to just write an entirely new program; and given the amount of time I&#039;ve spent trying to use MetaSim, I&#039;m more than ready to walk away from it. Oh yeah, and MetaSim doesn&#039;t work from the command line, so no scripting.&lt;br /&gt;
*** Sergey has pointed out that most companies will assemble &#039;&#039;E. coli&#039;&#039; when they release a new sequencer. Conveniently, there are many high quality assemblies of &#039;&#039;E. coli&#039;&#039; available for reference. It might therefore be possible to generate new error models for these sequencers in an automated fashion by mapping the &#039;&#039;E. coli&#039;&#039; reads to the available reference genomes, collecting the error frequencies, and then using them to mask synthesized reads.&lt;br /&gt;
*** I also talked with Mohammad and Mihai about this, and they also thought it was a pretty good idea. Mihai has proposed having Sergey or Mohammad add the described error-model generator to his read sampler (written in C) when they have time, but not in preparation for the oral microbiome data.&lt;br /&gt;
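Sergey&#039;s error-model idea reduces to two small steps: tally substitution frequencies from reads mapped to a trusted reference assembly, then use those frequencies to mask error-free simulated reads. The following is a minimal sketch in Python, assuming alignments have already been reduced to (read base, reference base) pairs; the function names are hypothetical, not part of any existing tool.&lt;br /&gt;

```python
import random
from collections import Counter

def substitution_frequencies(aligned_pairs):
    """Estimate P(read base | reference base) from (read, ref) base pairs
    collected by mapping sequencer reads to a finished reference genome."""
    counts = Counter(aligned_pairs)
    totals = Counter(ref for _read, ref in aligned_pairs)
    return {(read, ref): n / totals[ref] for (read, ref), n in counts.items()}

def mask_read(read, freqs, rng):
    """Introduce substitutions into an error-free simulated read according
    to the empirical frequencies (no indels, for simplicity)."""
    out = []
    for base in read:
        # Probability mass of every substitution away from this base
        subs = {r: p for (r, ref), p in freqs.items() if ref == base and r != base}
        if subs and rng.random() < sum(subs.values()):
            bases, weights = zip(*subs.items())
            out.append(rng.choices(bases, weights=weights)[0])
        else:
            out.append(base)
    return "".join(out)
```

A real implementation would work from SAM/BAM alignments and would also need to model indels, which matter a great deal for 454 data.&lt;br /&gt;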
&lt;br /&gt;
* Met with James to discuss my work with Volker&lt;br /&gt;
** Told him about my meeting with Volker and the paper he wanted me to prepare, more or less by myself. The main points of the paper are these:&lt;br /&gt;
*** Most available genomic sequences of mycobacteria are of a very small subset of highly pathogenic organisms.&lt;br /&gt;
*** Subtractive comparative genomics can be used to identify genes that are potentially responsible for differing phenotypes (such as extreme pathogenicity), but there must be available genomic sequences for closely related organisms with differing phenotypes.&lt;br /&gt;
*** Volker has sequenced 2 more non-pathogenic strains of mycobacteria (&#039;&#039;gastri&#039;&#039;, and &#039;&#039;kansasiiW58&#039;&#039;) with the intention of increasing the effectiveness of these subtractive comparative genomic studies.&lt;br /&gt;
*** The meat of the paper would be comparing the results of subtractive comparative genomic analysis using all currently available strains in RefSeq, with the results from also using the two novel sequences.&lt;br /&gt;
*** The other, smaller publishable portion of this project would be a comparison of &#039;&#039;gastri&#039;&#039; and &#039;&#039;kansasiiW58&#039;&#039; to each other, because they are thought to be extremely closely related and yet have distinct phenotypes (which I&#039;ve now forgotten).&lt;br /&gt;
*** James seemed to think this could make an okay paper, and he confirmed that he had not understood that Volker was looking for someone to do all of the analysis, both computational and biological, with Volker only contributing analysis of the analysis after it was all over.&lt;br /&gt;
** Ended up also discussing his work on differential abundance in populations of microorganisms.&lt;br /&gt;
*** I&#039;m going to start working on taking over and expanding Metastats this semester.&lt;br /&gt;
*** I&#039;m also going to start talking to Bo when he gets back about exactly what he&#039;s doing, and how I might be able to include pathway prediction in my expansion of Metastats without stepping on his toes.&lt;br /&gt;
*** Mihai has given me his approval to focus on this.&lt;br /&gt;
&lt;br /&gt;
* Met with Mihai to discuss working with Volker&lt;br /&gt;
** Explained that rather than looking for someone to do only the complex portions of the computational analysis, Volker was/is looking for someone to do the complete analysis.&lt;br /&gt;
** In exchange, Volker is offering first authorship and, if need be, to split the student&#039;s funding with their primary PI.&lt;br /&gt;
** I think I&#039;m capable of doing this within 3 or 4 months but it would consume my time pretty thoroughly.&lt;br /&gt;
** Mihai agreed that this is a reasonable deal, but noted that I have no personal interest in studying mycobacteria, and it&#039;s therefore unwise of me to invest a lot of time becoming an expert on an organism I have no interest in continuing to study or work with. I&#039;ve therefore offered to work closely with one of Volker&#039;s graduate students, who could meet with me every week or two. I would be willing to do all of the computational analysis and explain it to them, but they would have to actually look up potentially interesting genes and relationships I discover and help me keep the analysis biologically interesting and relevant.&lt;br /&gt;
&lt;br /&gt;
* Met with Mihai and Mohammad to discuss our impending huge-ass(embly) problem&lt;br /&gt;
** Talked about strategies for iterative assembly as an approach to assembling intractably large data sets. Most have glaring shortcomings and complications.&lt;br /&gt;
** Discovered Mike Schatz has a map-reduce implementation of an assembler that uses de Bruijn graphs and is better suited to assemblies with high coverage but short read lengths.&lt;br /&gt;
&lt;br /&gt;
== January 29, 2010 ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Minimus Performance Analysis ===&lt;br /&gt;
I&#039;m testing minimus and bambus in preparation for the oral microbiome data, and after spamming several lab members with email, it occurred to me that it would be more considerate to put the information here instead.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Memory Usage Analysis&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|RAM used by the Overlapper (in GB):&lt;br /&gt;
| 1.2 || 2.4 || 4.5 || 8.7 || 17 || 21.5 || ~1.1 GB * (#Reads in Millions) = (Memory Used)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|RAM used by the Tigger (in GB):&lt;br /&gt;
| 3 || 6 || 12 || 25 || 48.4 || (60) || ~3 GB * (#Reads in Millions) = (Memory Used)&lt;br /&gt;
|}&lt;br /&gt;
* The 16-million-read assembly data is from Walnut; all other numbers are rough averages from both Privet and Walnut.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Run Time Analysis on Privet&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Overlapper (in min):&lt;br /&gt;
| 3 || 9 || 34 || 130 || (576) || 783 || 2.96 * (#Reads in Millions)&amp;lt;sup&amp;gt;1.87&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Tigger (in min):&lt;br /&gt;
| 9 || 66 || 473 || (3,456) || (25,088) || (47,493) || 9.03 * (#Reads in Millions)&amp;lt;sup&amp;gt;2.86&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|}&lt;br /&gt;
* Privet has 2.4GHz Opteron 850 processors and 32GB of RAM. Minimus is not parallelized and therefore only uses a single core.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
* The models were generated by plotting the data points in OpenOffice and fitting a power trendline. The R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; value for each was 1.&lt;br /&gt;
* &#039;&#039;&#039;For reference: There are 1,440 minutes in one day, and 10,080 minutes in one week&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Run Time Analysis on Walnut&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Overlapper (in min):&lt;br /&gt;
| 2.7 || 8 || 27.5 || 102 || (325) || (481.5) || 2.54 * (#Reads in Millions)&amp;lt;sup&amp;gt;1.75&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Tigger (in min):&lt;br /&gt;
| 14 || 81 || 471.5 || (2,752) || (16,006) || (28,212) || 13.99 * (#Reads in Millions)&amp;lt;sup&amp;gt;2.54&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|}&lt;br /&gt;
* Walnut has 2.8GHz Opteron 875 processors and 64GB of RAM. Minimus is not parallelized and therefore only uses a single core.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
* The models were generated by plotting the data points in OpenOffice and fitting a power trendline. The R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; value for each was 1.&lt;br /&gt;
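For convenience, the fitted models from the tables above can be evaluated programmatically. A small sketch with the coefficients copied from the tables (helper names are my own; predictions can drift slightly from the parenthesized table values, which were presumably computed with more decimal places):&lt;br /&gt;

```python
# Fitted power-law models from the tables above: minutes = a * n**b,
# where n is the number of 75bp reads in millions.
RUN_TIME_MODELS = {
    ("privet", "overlapper"): (2.96, 1.87),
    ("privet", "tigger"): (9.03, 2.86),
    ("walnut", "overlapper"): (2.54, 1.75),
    ("walnut", "tigger"): (13.99, 2.54),
}

def predicted_minutes(machine, stage, reads_millions):
    """Predict single-core run time in minutes for a given machine/stage."""
    a, b = RUN_TIME_MODELS[(machine, stage)]
    return a * reads_millions ** b

def predicted_ram_gb(stage, reads_millions):
    """Predict peak memory from the linear models in the first table."""
    per_million = {"overlapper": 1.1, "tigger": 3.0}
    return per_million[stage] * reads_millions
```

This makes it easy to check, for example, whether a proposed assembly would blow past a machine&#039;s RAM or past the one-week mark before launching it.&lt;br /&gt;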
&lt;br /&gt;
&lt;br /&gt;
==== Other Observations About the Assemblies ====&lt;br /&gt;
* Because of the short read length, every million reads is only 75MB of sequence, which is roughly 10-20x coverage of an average single bacterium. Since these test sets have reads sampled from roughly 100 bacterial genomic sequences, I would expect the coverage to be on the order of 0.1% on average.&lt;br /&gt;
* Unsurprisingly, a cursory glance through the contig files shows that each contig comprises only about 2 or 3 reads.&lt;br /&gt;
* The N50 analysis for the smaller assemblies shows that only 2-3 reads are being added to each contig on average, leaving both N50s and average lengths just below 150bp.&lt;br /&gt;
* Therefore if the complexity of the oral microbiome data is high and/or the contamination of human DNA is extreme (80-95%), the coverage may be extremely low. This may make the use of Mike&#039;s assembler impractical, or at least that&#039;s how I&#039;m going to keep justifying this testing to myself until someone corrects me.&lt;br /&gt;
** &#039;&#039;&#039;Update:&#039;&#039;&#039; Apparently Mike and Dan have talked about this, and somewhere around 75-80bp, the performance of minimus catches up with Mike&#039;s de Bruijn graph assembler anyway. I also did not know that Dan&#039;s map-reduce minimus was running and would be used to assemble the data alongside Mike&#039;s.&lt;br /&gt;
* I learned on Feb. 1, 2010 that the 454 error model allows wild variation with respect to read length, so these assemblies might not actually be representative of the performance with the Illumina data we&#039;re expecting on Feb. 20.&lt;br /&gt;
&lt;br /&gt;
=== UMIACS Resources ===&lt;br /&gt;
I just discovered that the information listed on the CBCB intranet Resources page is inaccurate and very out of date, so I&#039;m making my own table.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:500px; height:200px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Umiacs Resources&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Machine !! Processor !! Speed !! Cores !! RAM&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Walnut&lt;br /&gt;
| Dual Core AMD Opteron 8220 || 2.8GHz || 16 || 64GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Privet&lt;br /&gt;
| AMD Opteron 850 || 2.4GHz || 4 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Larch&lt;br /&gt;
| AMD Opteron 850 || 2.4GHz || 4 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Sycamore&lt;br /&gt;
| Dual Core AMD Opteron 875 || 1GHz || 8 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Shagbark&lt;br /&gt;
| Intel Core 2 Quad || 2.83GHz || 4 || 4GB&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== February 5, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Meeting with Volker and Sarada on Feb 3 ===&lt;br /&gt;
* Need to teach Sarada how to run local BLAST searches on some sequences they have that aren&#039;t yet in GenBank&lt;br /&gt;
* Trying to set up a meeting with Volker to find out for sure if he wants me to work on this project&lt;br /&gt;
&lt;br /&gt;
=== Biomarker Assembly ===&lt;br /&gt;
Bo, Mohammad, and I spent a couple hours discussing biomarker assembly today. I&#039;m going to try to efficiently summarize our conclusions, but it might be difficult without an easy way to make images. We eventually decided it would be best to attempt several methods in tandem, due to the severe time constraints. The general approach of each method is to fish out and bin reads by one method or another, and then assemble the reads in each bin using minimus. All sequence identity values will be determined using BLASTx.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Preliminary Steps&#039;&#039;&#039;&lt;br /&gt;
* Gather biomarker consensus amino acid sequences&lt;br /&gt;
* Gather amino acid sequences for associated genes from each bacterial genome in refseq&lt;br /&gt;
* Cluster amino acid sequences within each biomarker set&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Sequence Identity Threshold Determination&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
There are 31 biomarkers and about 1,000 bacterial genomes in which they occur. This means that there are 31 sets of 1,000 sequences that are all relatively similar to one another. Because of the sequence similarity and the short read length, it&#039;s possible that a significant number of reads will map equally well to multiple sequences within each biomarker set. For this reason, it is better to allow a single read to be placed in any bin containing a sequence to which the read mapped above some minimum threshold. This will protect against artificially lowering the coverage of extremely well conserved regions, and with any luck, incorrectly binned reads will simply not be included in the assembly. There are several ways to approach the determination of this threshold.&lt;br /&gt;
* Determine the lowest level of sequence identity between the consensus sequence for each biomarker and any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.&lt;br /&gt;
** The obvious shortcoming of this approach is that the sequence identity between two homologous gene-length sequences can be much lower than between two homologous read-length sequences.&lt;br /&gt;
* Align 75mers to determine the lowest score between any two 75mers in the consensus sequence for each biomarker and the corresponding 75mer in any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.&lt;br /&gt;
** While this solves the problem with the above approach, it is significantly more complicated and the data is going to be here soon.&lt;br /&gt;
* Choose a sequence identity level, or try a few different levels and see which produces the most complete biomarker proteins without creating overly complex graphs.&lt;br /&gt;
** While there&#039;s no good theoretical justification for this approach, it&#039;s probably what we&#039;ll do and it will probably work well enough.&lt;br /&gt;
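The 75-mer option above amounts to a sliding-window minimum: for each pair of sequences, find the worst-scoring length-75 window. A sketch of the idea (the helper name is hypothetical, and it assumes gap-free pairwise alignments and simple fractional identity rather than a substitution-matrix score):&lt;br /&gt;

```python
def min_window_identity(seq_a, seq_b, k=75):
    """Return the minimum fraction of identical positions over every
    length-k window of two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b) or len(seq_a) < k:
        raise ValueError("need two aligned sequences at least k long")
    lowest = 1.0
    for i in range(len(seq_a) - k + 1):
        window_a, window_b = seq_a[i:i + k], seq_b[i:i + k]
        identity = sum(x == y for x, y in zip(window_a, window_b)) / k
        lowest = min(lowest, identity)
    return lowest
```

Running this between each biomarker&#039;s consensus and every member sequence, then taking the minimum per set, would give the per-set thresholds described above.&lt;br /&gt;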
&lt;br /&gt;
&#039;&#039;&#039;Schemes&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
After making absurdly complicated descriptions of the various approaches which I felt weren&#039;t very clear, I used Keynote to recreate the diagrams we&#039;d drawn on the white board and then printed them to a PDF. Unfortunately I&#039;m not sure exactly how to embed that in the wiki, so email me at trgibbons@gmail.com if you&#039;re reading this and I&#039;ll send it to you.&lt;br /&gt;
# Marker-wise assembly&lt;br /&gt;
#* Bin reads that align to any sequence in a given marker set, and/or the consensus sequence for that marker&lt;br /&gt;
# Cluster-wise assembly&lt;br /&gt;
#* Cluster protein sequences&lt;br /&gt;
#* Bin reads that align to any protein sequence in a given cluster&lt;br /&gt;
# Gene-wise assembly&lt;br /&gt;
#* Bin reads that align to a particular protein sequence&lt;br /&gt;
* Marker-wise and cluster-wise binning should be better for assembling novel sequences&lt;br /&gt;
* Gene-wise binning should produce higher quality assemblies of markers for known organisms or those that are closely related&lt;br /&gt;
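All three schemes share the same binning core: a read goes into every bin whose sequences it hit at or above the threshold, and only the definition of a bin (marker set, cluster, or single gene) changes. A sketch, assuming BLASTx hits have already been reduced to (read, bin, percent identity) triples (function and ID names are hypothetical):&lt;br /&gt;

```python
from collections import defaultdict

def bin_reads(hits, min_identity):
    """Place each read in every bin it hit at or above min_identity.
    hits: iterable of (read_id, bin_id, percent_identity) triples, e.g.
    parsed from BLASTx tabular output with subjects mapped to bins."""
    bins = defaultdict(set)
    for read_id, bin_id, percent_identity in hits:
        if percent_identity >= min_identity:
            bins[bin_id].add(read_id)
    return dict(bins)
```

Note that a read with two strong hits lands in both bins, which is exactly the multi-bin placement argued for above; each bin is then assembled independently with minimus.&lt;br /&gt;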
&lt;br /&gt;
== February 12, 2010 ==&lt;br /&gt;
&#039;&#039;&#039;SNOW!!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== February 19, 2010 ==&lt;br /&gt;
Met with James to discuss Metastats. I&#039;m going to attempt the following two updates by the end of the semester (I probably incorrectly described them, but I&#039;ll work it out later):&lt;br /&gt;
# Find a better way to compute the false discovery rate (FDR)&lt;br /&gt;
#* Currently computed by using the lowest 100 p-values from each sample (look at source code)&lt;br /&gt;
#* Need to find a more statistically rigorous way to compute it&lt;br /&gt;
#* The expected number of false positives scales with the p-value threshold: at p=0.05, 1,000 tests will yield about 50 incorrect rejections of the null hypothesis by chance alone; so if the null hypothesis is thrown out for 100 features, roughly 50 of those rejections can be expected to be incorrect&lt;br /&gt;
#* James just thinks this sucks and needs to be fixed&lt;br /&gt;
# Compute F-tests for features across all samples&lt;br /&gt;
#* Most requested feature&lt;br /&gt;
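One standard, more rigorous alternative to the bottom-100-p-values heuristic is the Benjamini-Hochberg step-up procedure, which controls the false discovery rate directly. A sketch of the textbook procedure (to be clear, this is my candidate replacement, not what Metastats currently does):&lt;br /&gt;

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return the (0-based) indices of hypotheses rejected while
    controlling the false discovery rate at level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0  # largest rank whose p-value clears its step-up threshold
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= (rank / m) * alpha:
            cutoff = rank
    return sorted(order[:cutoff])
```

The rank-dependent threshold (rank/m)*alpha is what distinguishes this from a flat p-value cutoff: it adapts to how many features look significant.&lt;br /&gt;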
&lt;br /&gt;
I spent too much time talking with people about science and not enough time doing it this week...&lt;br /&gt;
&lt;br /&gt;
== March 26, 2010 ==&lt;br /&gt;
I didn&#039;t realize it had been a whole month since I updated. Let&#039;s see: I nearly dropped Dr. Song&#039;s statistical genomics course, but then I didn&#039;t. I &amp;lt;i&amp;gt;did&amp;lt;/i&amp;gt;, however, learn that we don&#039;t have a class project. So the Metastats upgrades are going on the back burner for now because ZOMGZ I HAVE TO PRESENT MY PROPOSAL BY THE END OF THIS YEAR!!!&lt;br /&gt;
&lt;br /&gt;
My Thesis Project:&lt;br /&gt;
* I&#039;m generally interested in pathways shared between micro-organisms in a community, and also between micro-organisms and their multicellular hosts.&lt;br /&gt;
** I&#039;m particularly interested in studying the metabolic pathways shared between micro-organisms in the human gut, both with each other and their human hosts.&lt;br /&gt;
* James has created time-series models, and is interested in tackling spatial models with me.&lt;br /&gt;
* I would really like to correlate certain metabolic pathways with his modeled relationships.&lt;br /&gt;
&lt;br /&gt;
Volker&#039;s Project:&lt;br /&gt;
* has taken a big hit this week.&lt;br /&gt;
* I&#039;m going to go forward, with Bo&#039;s help, using the plan outlined in my project proposal for Mihai&#039;s biosequence analysis class:&lt;br /&gt;
** Use reciprocal best blast hits to map H37Rv genes to annotated genes in all available virulent and non-virulent strains of mycobacteria&lt;br /&gt;
** Use results from gene mapping to identify a core set of tuberculosis genes, as well as a set of predicted virulence genes&lt;br /&gt;
** Use a variety of comparison schemes to study the effect on the set of predicted virulence genes of the consideration of different subsets of non-virulent strains&lt;br /&gt;
** Use stable virulence prediction to rank genes as virulence targets&lt;br /&gt;
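The first step in the plan above, reciprocal best BLAST hits, is a small filter over the two directional hit tables. A sketch with made-up hit data (a real run would parse BLAST tabular output and should also break bitscore ties consistently):&lt;br /&gt;

```python
def reciprocal_best_hits(ab_hits, ba_hits):
    """Return gene pairs (a, b) where a's best hit in genome B is b and
    b's best hit in genome A is a. Hits are (query, subject, bitscore)."""
    def best_hit(hits):
        top = {}
        for query, subject, score in hits:
            if query not in top or score > top[query][1]:
                top[query] = (subject, score)
        return {query: subject for query, (subject, _score) in top.items()}

    a_best, b_best = best_hit(ab_hits), best_hit(ba_hits)
    return {(a, b) for a, b in a_best.items() if b_best.get(b) == a}
```

Genes with strong one-way hits but no reciprocal partner are exactly the candidates for the subtractive comparison between virulent and non-virulent strains.&lt;br /&gt;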
&lt;br /&gt;
Metastats:&lt;br /&gt;
* As I mentioned, this is put on hold&lt;br /&gt;
* I intend to pick this back up after I&#039;m done with Volker&#039;s project, as it could be instrumental to my thesis work&lt;br /&gt;
&lt;br /&gt;
== April 2, 2010 ==&lt;br /&gt;
More on my Thesis Project:&lt;br /&gt;
* I read the most recent (Science, Nov. 2009) paper by Gordon and Knight on their ongoing gut microbiota experiments&lt;br /&gt;
* Pretty much every section addressed the potential thesis topics I&#039;d imagined while reading the preceding section. Frustrating, but reaffirming (trying to learn from Mihai not to get bummed about being scooped).&lt;br /&gt;
* Something that seems interesting and useful to me is to do more rigorous statistical analysis attempting to correlate particular genes and pathways with the time-series and spatial data. I will have to work closely with Bo, at least at first.&lt;br /&gt;
* As a starting point, James has recommended building spatial models similar to his time-series models&lt;br /&gt;
* James is essentially mentoring me on my project at this point. It&#039;s pretty excellent.&lt;br /&gt;
&lt;br /&gt;
== July 23, 2010 ==&lt;br /&gt;
It&#039;s been several months, I don&#039;t feel any closer to finding a thesis project, and it&#039;s really starting to stress me out. I&#039;ve finally stopped making excuses for not reading and have been steadily reading about 10 abstracts and 2 papers per week for the last month or two, but it doesn&#039;t appear to be nearly enough. I met with Mihai today to talk about it and then foolishly went for a run in the heat of the afternoon, where I decided on a new direction in a state of euphoric delirium.&lt;br /&gt;
# Read the book Mihai loaned me within the next week (or so): &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; by Michael Wilson&lt;br /&gt;
#* Mihai says the book is a summary of what was known about the human microbiome 5 years ago. The table of contents for the first chapter is essentially identical to the list of Wikipedia pages I&#039;ve read in the past week, so I&#039;m pretty excited to now have a more thorough, authoritative source.&lt;br /&gt;
# Go back to looking for papers describing quorum sensing, especially in organisms known to be present in the human microbiome, either stably or transiently.&lt;br /&gt;
#* Try not to get too side-tracked reading about biofilms.&lt;br /&gt;
#* Search for an existing database of quorum sensing genes to use as references to potentially identify novel quorum sensing genes in microbiome WGS data. Consider making one if it&#039;s not already available.&lt;br /&gt;
# Look for a core metabolome (at this point I think a core microbiome seems unlikely) using metapath (or something similar) in the new HMP data for the gut, oral, and vaginal samples from the 100 reference individuals, as well as other sources like MetaHIT.&lt;br /&gt;
#* Start with a fast and dirty approach pulling all KO&#039;s associated with organisms identified using 16S rDNA sequencing, and then possibly attempt more accurate gene assemblies and annotation from WGS sequencing projects.&lt;br /&gt;
# Try to stay focused on a research topic for more than 2 weeks so I don&#039;t keep wasting time and effort.&lt;br /&gt;
# Don&#039;t make a habit of using the wiki like a personal journal...&lt;br /&gt;
&lt;br /&gt;
=== Possible Research Projects Inspired by &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ===&lt;br /&gt;
# Searching for quorum sensing genes, both known and novel, and any pathways including them in human microbiome data&lt;br /&gt;
#* Search for and consider making quorum sensing gene DB&lt;br /&gt;
#** KEGG has pathways containing both acyl-homoserine lactone and its synthase&lt;br /&gt;
#* After indexing known quorum sensing genes, search for homologues&lt;br /&gt;
#** WGS data - Obviously search for homologues directly&lt;br /&gt;
#** 16S data - Identify organisms and search for homologues in public DBs&lt;br /&gt;
# Search for &amp;quot;core metabolome&amp;quot; in pioneer organisms from infant studies&lt;br /&gt;
#* On the off chance I get to publish on this subject, it might be interesting to draw an analogy between genetic evolution, in which chance mutations are occasionally beneficial and are therefore propagated, and microbial colonization, in which chance introductions of microorganisms are occasionally beneficial and therefore become stable members of the microbiome.&lt;br /&gt;
# Attempt to search for cases of symbiosis where possible&lt;br /&gt;
#* Search terms: commensalism, synergism (protocooperation), mutualism, competition, amensalism (antagonism), predation, parasitism, neutralism, syntrophism (nutritional synergism, cross-feeding, examples on pp.16-17)&lt;br /&gt;
&lt;br /&gt;
== July 30, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Metagenomics Papers and Subjects Related to the Content of &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ===&lt;br /&gt;
# MetaHIT&lt;br /&gt;
# Vaginal Microbiome&lt;br /&gt;
# Acquisition of Microbiome&lt;br /&gt;
#* [http://www.pnas.org/content/early/2010/06/08/1002601107.full.pdf Vaginal birth vs cesarean section]&lt;br /&gt;
&lt;br /&gt;
== August 13, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Other Potential Research Projects ===&lt;br /&gt;
# Metagenomic assembly&lt;br /&gt;
#* I&#039;ve been kicking this idea around the floor for months, but none of the people I see as better suited to tackle the problem have seemed all that interested.&lt;br /&gt;
#* Essentially I think the ideal metagenomic assembler would allow for, and gracefully represent, diversity within a single organism, without collapsing the genetic material of an entire community into a single messy contig.&lt;br /&gt;
#* The major theoretical challenge would be the development of an algorithm that could differentiate between variation and &amp;quot;speciation&amp;quot; in a biologically meaningful way. This is far from being a new problem.&lt;br /&gt;
#* In addition to the well-known theoretical challenges, coding this in such a way as to make it practical for current and future metagenomic problems would probably require more programming skills than I have the time to master. As such, I&#039;m hoping that maybe I can work out the theory on a smaller scale using Python, and then maybe convince someone else to collaborate with me for the implementation, or just publish my findings and leave it to the scientific community to incorporate my methods into their-favorite-assembler.&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Ted-Report&amp;diff=7271</id>
		<title>Cbcb:Pop-Lab:Ted-Report</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Ted-Report&amp;diff=7271"/>
		<updated>2010-07-27T19:34:02Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* July 23, 2010 */ Added entry for week ending July 30, 2010, and a section for metagenomics papers I should read when I&amp;#039;m done with this book&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Older Entries ==&lt;br /&gt;
[[Cbcb:Pop-Lab:Ted-Report-2009 | 2009]]&lt;br /&gt;
&lt;br /&gt;
== January 15, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Minimus Documentation ===&lt;br /&gt;
&lt;br /&gt;
Presently, the only relevant Google hit for &amp;quot;minimus&amp;quot; on the first page of results is the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus#Basic_usage_example SourceForge wiki]. The only example on this page is incomplete and appears to be an early draft made during development.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Ideally, it should be easy to find a complete guide with the general format:&lt;br /&gt;
* Simple use case:&lt;br /&gt;
 `toAmos -s path/to/fastaFile.seq -o path/to/fastaFile.afg`&lt;br /&gt;
 `minimus path/to/fastaFile(prefix)`&lt;br /&gt;
* Necessary tools for set up (toAmos)&lt;br /&gt;
* Other options&lt;br /&gt;
* etc&lt;br /&gt;
&lt;br /&gt;
The description found on the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus/README Minimus/README] page (linked to from the middle of the starting page) is more appropriate, but features use cases that may no longer be common and references another required tool (toAmos) without linking to it or describing how to access it. A description of this tool can be found on the Amos [http://sourceforge.net/apps/mediawiki/amos/index.php?title=File_conversion_utilities File Conversion Utilities] page (again, linked to from the starting page), but it is less organized than what I&#039;ve come to expect from a project page, and it is easy to get lost or distracted by the rest of the Amos documentation while trying to piece together the necessary steps for a basic assembly.&lt;br /&gt;
&lt;br /&gt;
=== Comparative Network Analysis pt. 2 ===&lt;br /&gt;
* Meeting with Volker this Friday to discuss how best to apply network alignment to what he&#039;s doing&lt;br /&gt;
* I&#039;m simultaneously trying to find a way to apply my network alignment technique to predicting genes in metagenomic samples&lt;br /&gt;
** I&#039;ve been trying to find a way to get beyond the restriction that my current program requires genes to be annotated with an EC number. A potentially interesting next step may be to use BioPython to BLAST the sequence of each enzyme annotated in every micro-organism in KEGG against a metagenomic library.&lt;br /&gt;
*** The results would be stretches of linked reactions that have been annotated in KEGG pathways.&lt;br /&gt;
*** This method could be applied to contigs just as easily as finished sequences. In a scenario where perhaps there was low coverage, it could be used to identify genes which are probably there but just weren&#039;t sampled by showing the presence of the rest of the pathway. In short, this could finally accomplish what Mihai asked me to work on when I showed up.&lt;br /&gt;
*** The major theoretical shortcoming of this approach is that it could only identify relatively well characterized pathways.&lt;br /&gt;
*** The practical shortcomings of this approach start with obtaining a fairly complete copy of KEGG (which, as we&#039;ve learned, is a mess to parse locally and unusably slow to call through the API), and continue to the computational challenge of such a large-scale BLAST operation.&lt;br /&gt;
** Ask Bo about this when he gets back. He may have already done this.&lt;br /&gt;
&lt;br /&gt;
== January 22, 2010 ==&lt;br /&gt;
* Met with Dan and Sergey to talk about the Minimus-Bambus pipeline&lt;br /&gt;
** Minimus is running fine. I&#039;ve begun characterizing its run-time behavior (see next week&#039;s entry)&lt;br /&gt;
** After some tweaking by Sergey, Bambus was able to finish but did not generate a scaffold. We&#039;re going to talk about this after the meeting on Monday.&lt;br /&gt;
** Sergey had an interesting idea for making a better read simulator:&lt;br /&gt;
*** Error-free reads are cheap and easy to generate. The problem is with the error model.&lt;br /&gt;
*** The &amp;quot;best&amp;quot; tool (that we are aware of) which includes error models is MetaSim, but the error models are years out of date and the authors have been historically unreachable. While Mihai has now shown me how to edit the models in a reasonable way from flat files, allowing one to characterize base substitutions, I&#039;m not convinced it would be faster or easier to write a program that would modify these files than it would be to just write an entirely new program; and given the amount of time I&#039;ve spent trying to use MetaSim, I&#039;m more than ready to walk away from it. Oh yeah, and MetaSim doesn&#039;t work from the command line, so no scripting.&lt;br /&gt;
*** Sergey has pointed out that most companies will assemble &#039;&#039;E. coli&#039;&#039; when they release a new sequencer. Conveniently, there are many high quality assemblies of &#039;&#039;E. coli&#039;&#039; available for reference. It might therefore be possible to generate new error models for these sequencers in an automated fashion by mapping the &#039;&#039;E. coli&#039;&#039; reads to the available reference genomes, collecting the error frequencies, and then using them to mask synthesized reads.&lt;br /&gt;
*** I also talked with Mohammad and Mihai about this; they also seemed to think it was a pretty good idea. Mihai has proposed having Sergey or Mohammad add the described error-model generator to his read sampler (written in C) when they have time, but not in preparation for the oral microbiome data.&lt;br /&gt;
&lt;br /&gt;
* Met with James to discuss my work with Volker&lt;br /&gt;
** Told him about my meeting with Volker and the paper he wanted me to prepare, more or less by myself. The main concepts of the paper are these:&lt;br /&gt;
*** Most available genomic sequences of mycobacteria are of a very small subset of highly pathogenic organisms.&lt;br /&gt;
*** Subtractive comparative genomics can be used to identify genes that are potentially responsible for differing phenotypes (such as extreme pathogenicity), but there must be available genomic sequences for closely related organisms with differing phenotypes.&lt;br /&gt;
*** Volker has sequenced 2 more non-pathogenic strains of mycobacteria (&#039;&#039;gastri&#039;&#039;, and &#039;&#039;kansasiiW58&#039;&#039;) with the intention of increasing the effectiveness of these subtractive comparative genomic studies.&lt;br /&gt;
*** The meat of the paper would be comparing the results of subtractive comparative genomic analysis using all currently available strains in RefSeq, with the results from also using the two novel sequences.&lt;br /&gt;
*** The other, smaller publishable portion of this project would be a comparison of &#039;&#039;gastri&#039;&#039; and &#039;&#039;kansasiiW58&#039;&#039; to each other, because they are thought to be extremely closely related and yet have distinct phenotypes (which I&#039;ve now forgotten).&lt;br /&gt;
*** James seemed to think this could make an okay paper, and he confirmed that he did not understand that Volker was looking for someone to do all of the analysis, both computational and biological, with Volker only contributing analysis of the analysis after it was all over.&lt;br /&gt;
** Ended up also discussing his work on differential abundance in populations of microorganisms.&lt;br /&gt;
*** I&#039;m going to start working on taking over and expanding Metastats this semester.&lt;br /&gt;
*** I&#039;m also going to start talking to Bo when he gets back about exactly what he&#039;s doing, and how I might be able to include pathway prediction in my expansion of Metastats without stepping on his toes.&lt;br /&gt;
*** Mihai has given me his approval to focus on this.&lt;br /&gt;
&lt;br /&gt;
* Met with Mihai to discuss working with Volker&lt;br /&gt;
** Explained that rather than looking for someone to do only the complex portions of the computational analysis, Volker was/is looking for someone to do the complete analysis.&lt;br /&gt;
** In exchange, Volker is offering first authorship and, if need be, to split the student&#039;s funding with their primary PI.&lt;br /&gt;
** I think I&#039;m capable of doing this within 3 or 4 months but it would consume my time pretty thoroughly.&lt;br /&gt;
** Mihai agreed that this is a reasonable deal, but that I have no personal interest in studying mycobacteria, and it&#039;s therefore unwise of me to invest a bunch of time becoming an expert on an organism I have no interest in continuing to study or work with. I&#039;ve therefore offered Volker to work closely with one of his graduate students who could meet with me every week or two. I would be willing to do all of the computational analysis and explain it to them, but they would have to actually look up potentially interesting genes and relationships I discover and help me keep the analysis biologically interesting and relevant.&lt;br /&gt;
&lt;br /&gt;
* Met with Mihai and Mohammad to discuss our impending huge-ass(embly) problem&lt;br /&gt;
** Talked about strategies for iterative assembly as an approach to assembling intractably large data sets. Most have glaring shortcomings and complications.&lt;br /&gt;
** Discovered Mike Schatz has a map-reduce implementation of an assembler that uses De Bruijn graphs and is better suited to assemblies with high coverage but short read lengths.&lt;br /&gt;
&lt;br /&gt;
== January 29, 2010 ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Minimus Performance Analysis ===&lt;br /&gt;
I&#039;m testing Minimus and Bambus in preparation for the oral microbiome data, and after spamming several lab members with email, it occurred to me that it would be considerably more considerate to put the information here instead.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Memory Usage Analysis&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|RAM used by the Overlapper (in GB):&lt;br /&gt;
| 1.2 || 2.4 || 4.5 || 8.7 || 17 || 21.5 || ~1.1 GB * (#Reads in Millions) = (Memory Used)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|RAM used by the Tigger (in GB):&lt;br /&gt;
| 3 || 6 || 12 || 25 || 48.4 || (60) || ~3 GB * (#Reads in Millions) = (Memory Used)&lt;br /&gt;
|}&lt;br /&gt;
* The 16 million read assembly data is from Walnut, all other numbers are rough averages from both Privet and Walnut.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
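&lt;br /&gt;
As a sanity check, the two linear models above can be wrapped in a quick calculator. This is only a sketch restating the fitted coefficients from the table; nothing here is newly measured.&lt;br /&gt;

```python
# Rough Minimus memory predictions from the fitted linear models above:
# roughly 1.1 GB (Overlapper) and 3 GB (Tigger) per million 75bp reads.

def overlapper_ram_gb(reads_in_millions):
    return 1.1 * reads_in_millions

def tigger_ram_gb(reads_in_millions):
    return 3.0 * reads_in_millions

for n in (1, 8, 20):
    print(f'{n}M reads: Overlapper ~{overlapper_ram_gb(n):.1f} GB, '
          f'Tigger ~{tigger_ram_gb(n):.1f} GB')
```

Predictions outside the tested 1-20 million read range are extrapolations and should be treated accordingly.&lt;br /&gt;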
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Run Time Analysis on Privet&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Overlapper (in min):&lt;br /&gt;
| 3 || 9 || 34 || 130 || (576) || 783 || 2.96 * (#Reads in Millions)&amp;lt;sup&amp;gt;1.87&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Tigger (in min):&lt;br /&gt;
| 9 || 66 || 473 || (3,456) || (25,088) || (47,493) || 9.03 * (#Reads in Millions)&amp;lt;sup&amp;gt;2.86&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|}&lt;br /&gt;
* Privet has 2.4GHz Opteron 850 processors and 32GB of RAM. Minimus is not parallelized and therefore only uses a single core.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
* The models were generated by plotting the data points in OpenOffice and fitting a polynomial trendline. The R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; value for each was 1.&lt;br /&gt;
* &#039;&#039;&#039;For reference: There are 1,440 minutes in one day, and 10,080 minutes in one week&#039;&#039;&#039;&lt;br /&gt;
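&lt;br /&gt;
The fitted power laws in the rightmost column can be used directly for extrapolation. A sketch using the Privet coefficients copied from the table (note the rounded coefficients reproduce the tabulated predictions only approximately):&lt;br /&gt;

```python
# Extrapolate Minimus run times on Privet from the fitted power laws:
# Overlapper: 2.96 * n**1.87 minutes; Tigger: 9.03 * n**2.86 minutes,
# where n is the number of 75bp reads in millions.

def privet_overlapper_min(n):
    return 2.96 * n ** 1.87

def privet_tigger_min(n):
    return 9.03 * n ** 2.86

# Predicted Tigger time for 16 million reads, in days (1,440 min per day)
print(f'Tigger, 16M reads: ~{privet_tigger_min(16) / 1440:.1f} days')
```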
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Run Time Analysis on Walnut&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Overlapper (in min):&lt;br /&gt;
| 2.7 || 8 || 27.5 || 102 || (325) || (481.5) || 2.54 * (#Reads in Millions)&amp;lt;sup&amp;gt;1.75&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Tigger (in min):&lt;br /&gt;
| 14 || 81 || 471.5 || (2,752) || (16,006) || (28,212) || 13.99 * (#Reads in Millions)&amp;lt;sup&amp;gt;2.54&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|}&lt;br /&gt;
* Walnut has 2.8GHz Opteron 875 processors and 64GB of RAM. Minimus is not parallelized and therefore only uses a single core.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
* The models were generated by plotting the data points in OpenOffice and fitting a polynomial trendline. The R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; value for each was 1.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Other Observations About the Assemblies ====&lt;br /&gt;
* Because of the short read length, every million reads is only 75MB of sequence. This is roughly 10-20x coverage of an average single bacterium. Since these test sets have reads sampled from roughly 100 bacterial genomic sequences, I would expect the coverage to be on the order of 0.1% on average.&lt;br /&gt;
* Unsurprisingly, a cursory glance through the contig files shows that each contig is composed of only about 2 or 3 reads.&lt;br /&gt;
* The n50 analysis for the smaller assemblies shows that only 2-3 reads are being added to each contig on average, leaving both n50&#039;s and average lengths just below 150bp.&lt;br /&gt;
* Therefore if the complexity of the oral microbiome data is high and/or the contamination of human DNA is extreme (80-95%), the coverage may be extremely low. This may make the use of Mike&#039;s assembler impractical, or at least that&#039;s how I&#039;m going to keep justifying this testing to myself until someone corrects me.&lt;br /&gt;
** &#039;&#039;&#039;Update:&#039;&#039;&#039; Apparently Mike and Dan have talked about this, and somewhere around 75-80bp, the performance of minimus catches up with Mike&#039;s de Bruijn graph assembler anyway. I also did not know that Dan&#039;s map-reduce minimus was running and would be used to assemble the data alongside Mike&#039;s.&lt;br /&gt;
* I learned on Feb. 1, 2010 that the 454 error model allows wild variation with respect to read length, so these assemblies might not actually be representative of the performance with the Illumina data we&#039;re expecting on Feb. 20.&lt;br /&gt;
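&lt;br /&gt;
For reference, the N50 statistic mentioned above can be computed directly from a list of contig lengths. A minimal sketch (not the actual script used for these assemblies):&lt;br /&gt;

```python
import bisect
import itertools

def n50(contig_lengths):
    """N50: the length L such that contigs of length at least L
    cover at least half of the total assembled bases."""
    # Sort longest-first, then find where the cumulative sum crosses
    # half of the total and report that contig's length.
    lengths = sorted(contig_lengths, reverse=True)
    cumulative = list(itertools.accumulate(lengths))
    half = cumulative[-1] / 2
    idx = bisect.bisect_left(cumulative, half)
    return lengths[idx]

# Toy lengths: contigs of ~2-3 75bp reads give N50s just under 150bp
print(n50([150, 145, 140, 75, 75, 75]))
```

For the toy lengths above this prints 140; real input would be the lengths parsed from the contig file.&lt;br /&gt;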
&lt;br /&gt;
=== UMIACS Resources ===&lt;br /&gt;
I just discovered the information listed on the CBCB intranet Resources page is inaccurate and very out of date, so I&#039;m making my own table.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:500px; height:200px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Umiacs Resources&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Machine !! Processor !! Speed !! Cores !! RAM&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Walnut&lt;br /&gt;
| Dual Core AMD Opteron 8220 || 2.8GHz || 16 || 64GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Privet&lt;br /&gt;
| AMD Opteron 850 || 2.4GHz || 4 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Larch&lt;br /&gt;
| AMD Opteron 850 || 2.4GHz || 4 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Sycamore&lt;br /&gt;
| Dual Core AMD Opteron 875 || 1GHz || 8 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Shagbark&lt;br /&gt;
| Intel Core 2 Quad || 2.83GHz || 4 || 4GB&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== February 5, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Meeting with Volker and Sarada on Feb 3 ===&lt;br /&gt;
* Need to teach Sarada how to perform local blast on some sequences they have that aren&#039;t yet in genbank&lt;br /&gt;
* Trying to set up a meeting with Volker to find out for sure if he wants me to work on this project&lt;br /&gt;
&lt;br /&gt;
=== Biomarker Assembly ===&lt;br /&gt;
Bo, Mohammad, and I spent a couple of hours discussing biomarker assembly today. I&#039;m going to try to efficiently summarize our conclusions, but it might be difficult without an easy way to make images. We eventually decided it would be best to attempt several methods in tandem, due to the severe time constraints. The general approach of each method is to fish out and bin reads through one method or another, and then assemble the reads in each bin using minimus. All sequence identity values will be determined by using BLASTx.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Preliminary Steps&#039;&#039;&#039;&lt;br /&gt;
* Gather biomarker consensus amino acid sequences&lt;br /&gt;
* Gather amino acid sequences for associated genes from each bacterial genome in refseq&lt;br /&gt;
* Cluster amino acid sequences within each biomarker set&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Sequence Identity Threshold Determination&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
There are 31 biomarkers and about 1,000 bacterial genomes in which they occur. This means that there are 31 sets of 1,000 sequences that are all relatively similar to one another. Because of the sequence similarity and the short read length, it&#039;s possible that a significant number of reads will map equally well to multiple sequences within each biomarker set. For this reason, it is better to allow a single read to be placed in any bin containing a sequence to which the read mapped above some minimum threshold. This will protect against artificially lowering the coverage of extremely well conserved regions, and with any luck, incorrectly binned reads will simply not be included in the assembly. There are several ways to approach the determination of this threshold.&lt;br /&gt;
* Determine the lowest level of sequence identity between the consensus sequence for each biomarker and any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.&lt;br /&gt;
** The obvious shortcoming of this approach is that the sequence identity between two homologous gene-length sequences can be much lower than between two homologous read-length sequences.&lt;br /&gt;
* Align 75mers to determine the lowest score between any two 75mers in the consensus sequence for each biomarker and the corresponding 75mer in any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.&lt;br /&gt;
** While this solves the problem with the above approach, it is significantly more complicated and the data is going to be here soon.&lt;br /&gt;
* Choose a sequence identity level, or try a few different levels and see which produces the most complete biomarker proteins without creating overly complex graphs.&lt;br /&gt;
** While there&#039;s no good theoretical justification for this approach, it&#039;s probably what we&#039;ll do and it will probably work well enough.&lt;br /&gt;
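&lt;br /&gt;
The first of the three options above can be sketched with a toy identity function. To be clear, real identity values would come from BLASTx alignments; the naive positional comparison below assumes pre-aligned, equal-length sequences and is only for illustration.&lt;br /&gt;

```python
def percent_identity(seq_a, seq_b):
    # Naive positional identity for two pre-aligned, equal-length
    # sequences; BLASTx alignment scores would replace this in practice.
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return matches / len(seq_a)

def min_identity_threshold(consensus, member_seqs):
    # Lowest identity between the biomarker consensus and any member
    # sequence in its set, used as that set's binning threshold
    return min(percent_identity(consensus, s) for s in member_seqs)

# Invented toy peptide sequences, purely for illustration
members = ['MKTAYIAK', 'MKTAYLAK', 'MSTAYIAR']
print(min_identity_threshold('MKTAYIAK', members))
```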
&lt;br /&gt;
&#039;&#039;&#039;Schemes&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
After making absurdly complicated descriptions of the various approaches which I felt weren&#039;t very clear, I used Keynote to recreate the diagrams we&#039;d drawn on the white board and then printed them to a PDF. Unfortunately I&#039;m not sure exactly how to embed that in the wiki, so email me at trgibbons@gmail.com if you&#039;re reading this and I&#039;ll send it to you.&lt;br /&gt;
# Marker-wise assembly&lt;br /&gt;
#* Bin reads that align to any sequence in a given marker set, and/or the consensus sequence for that marker&lt;br /&gt;
# Cluster-wise assembly&lt;br /&gt;
#* Cluster protein sequences&lt;br /&gt;
#* Bin reads that align to any protein sequence in a given cluster&lt;br /&gt;
# Gene-wise assembly&lt;br /&gt;
#* Bin reads that align to a particular protein sequence&lt;br /&gt;
* Marker-wise and cluster-wise binning should be better for assembling novel sequences&lt;br /&gt;
* Gene-wise binning should produce higher quality assemblies of markers for known organisms or those that are closely related&lt;br /&gt;
&lt;br /&gt;
== February 12, 2010 ==&lt;br /&gt;
&#039;&#039;&#039;SNOW!!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== February 19, 2010 ==&lt;br /&gt;
Met with James to discuss Metastats. I&#039;m going to attempt the following two updates by the end of the semester (I probably incorrectly described them, but I&#039;ll work it out later):&lt;br /&gt;
# Find a better way to compute the false discovery rate (FDR)&lt;br /&gt;
#* Currently computed by using the lowest 100 p-values from each sample (look at source code)&lt;br /&gt;
#* Need to find a more algebraically rigorous way to compute it&lt;br /&gt;
#* False positive rate for 1000 samples is the p-value (p=0.05 =&amp;gt; 50 H_a&#039;s will be incorrectly predicted; so if the null hypothesis is thrown out for 100 samples, 50 will be expected to be incorrect)&lt;br /&gt;
#* James just thinks this sucks and needs to be fixed&lt;br /&gt;
# Compute F-tests for features across all samples&lt;br /&gt;
#* Most requested feature&lt;br /&gt;
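&lt;br /&gt;
One standard, more rigorous option for the FDR computation is the Benjamini-Hochberg procedure. The sketch below computes BH-adjusted p-values; this is a textbook method I might try, not what Metastats currently implements.&lt;br /&gt;

```python
import itertools

def bh_adjusted(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values)."""
    m = len(pvalues)
    order = sorted(range(m), key=pvalues.__getitem__)
    # Raw BH values p_(i) * m / i over the sorted p-values (1-based rank)
    raw = [pvalues[j] * m / (rank + 1) for rank, j in enumerate(order)]
    # Enforce monotonicity with a running minimum taken from the
    # largest rank down to the smallest
    monotone = list(itertools.accumulate(reversed(raw), min))[::-1]
    adjusted = [0.0] * m
    for rank, j in enumerate(order):
        adjusted[j] = min(1.0, monotone[rank])
    return adjusted

print(bh_adjusted([0.01, 0.04, 0.03, 0.20]))
```

A feature would then be called significant at FDR level q whenever its adjusted p-value is at most q.&lt;br /&gt;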
&lt;br /&gt;
I spent too much time talking with people about science and not enough time doing it this week...&lt;br /&gt;
&lt;br /&gt;
== March 26, 2010 ==&lt;br /&gt;
I didn&#039;t realize it had been a whole month since I updated. Let&#039;s see, I nearly dropped Dr. Song&#039;s statistical genomics course, but then I didn&#039;t. I &amp;lt;i&amp;gt;did&amp;lt;/i&amp;gt; however learn that we don&#039;t have a class project. So the Metastats upgrades are going on a backburner for now because ZOMGZ I HAVE TO PRESENT MY PROPOSAL BY THE END OF THIS YEAR!!!&lt;br /&gt;
&lt;br /&gt;
My Thesis Project:&lt;br /&gt;
* I&#039;m generally interested in pathways shared between micro-organisms in a community, and also between micro-organisms and their multicellular hosts.&lt;br /&gt;
** I&#039;m particularly interested in studying the metabolic pathways shared between micro-organisms in the human gut, both with each other and their human hosts.&lt;br /&gt;
* James has created time-series models, and is interested in tackling spatial models with me.&lt;br /&gt;
* I would really like to correlate certain metabolic pathways with his modeled relationships.&lt;br /&gt;
&lt;br /&gt;
Volker&#039;s Project:&lt;br /&gt;
* has taken a big hit this week.&lt;br /&gt;
* I&#039;m going to go forward, with Bo&#039;s help, using the plan outlined in my project proposal for Mihai&#039;s biosequence analysis class:&lt;br /&gt;
** Use reciprocal best blast hits to map H37Rv genes to annotated genes in all available virulent and non-virulent strains of mycobacteria&lt;br /&gt;
** Use results from gene mapping to identify a core set of tuberculosis genes, as well as a set of predicted virulence genes&lt;br /&gt;
** Use a variety of comparison schemes to study the effect on the set of predicted virulence genes of the consideration of different subsets of non-virulent strains&lt;br /&gt;
** Use stable virulence prediction to rank genes as virulence targets&lt;br /&gt;
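&lt;br /&gt;
The reciprocal-best-hit step above can be sketched as follows, assuming the BLAST results have already been collapsed into a best-hit dictionary in each direction (the gene identifiers and both dictionaries are invented for illustration):&lt;br /&gt;

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    # Pair genes as putative orthologs when each is the other's best
    # BLAST hit; both dicts would come from parsing real BLAST output.
    pairs = []
    for gene_a, gene_b in best_a_to_b.items():
        if best_b_to_a.get(gene_b) == gene_a:
            pairs.append((gene_a, gene_b))
    return pairs

# Hypothetical best-hit tables for H37Rv vs. one other strain;
# all identifiers below are invented for illustration.
a2b = {'Rv0001': 'Mk0001', 'Rv0002': 'Mk0007'}
b2a = {'Mk0001': 'Rv0001', 'Mk0007': 'Rv0099'}
print(reciprocal_best_hits(a2b, b2a))
```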
&lt;br /&gt;
Metastats:&lt;br /&gt;
* As I mentioned, this is put on hold&lt;br /&gt;
* I intend to pick this back up after I&#039;m done with Volker&#039;s project, as it could be instrumental to my thesis work&lt;br /&gt;
&lt;br /&gt;
== April 2, 2010 ==&lt;br /&gt;
More on my Thesis Project:&lt;br /&gt;
* I read the most recent (Science, Nov. 2009) paper by Gordon and Knight on their ongoing gut microbiota experiments&lt;br /&gt;
* Pretty much every section addressed the potential thesis topics I&#039;d imagined while reading the preceding section. Frustrating, but reaffirming (trying to learn from Mihai on not getting bummed about being scooped).&lt;br /&gt;
* Something that seems interesting and useful to me is to do more rigorous statistical analysis to attempt to correlate particular genes and pathways with the time series and spatial data. I will have to work closely with Bo, at least at first.&lt;br /&gt;
* As a starting point, James has recommended building spatial models similar to his time series models&lt;br /&gt;
* James is essentially mentoring me on my project at this point. It&#039;s pretty excellent.&lt;br /&gt;
&lt;br /&gt;
== July 23, 2010 ==&lt;br /&gt;
It&#039;s been several months, I don&#039;t feel any closer to finding a thesis project, and it&#039;s really starting to stress me out. I&#039;ve finally stopped making excuses for not reading and have been steadily reading about 10 abstracts and 2 papers per week for the last month or two, but it doesn&#039;t appear to be nearly enough. I met with Mihai today to talk about it and then foolishly went for a run in the heat of the afternoon, where I decided on a new direction in a state of euphoric delirium.&lt;br /&gt;
# Read the book Mihai loaned me within the next week (or so): &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; by Michael Wilson&lt;br /&gt;
#* Mihai says the book is a summary of what was known about the human microbiome 5 years ago. The table of contents for the first chapter is essentially identical to the list of Wikipedia pages I&#039;ve read in the past week, so I&#039;m pretty excited to now have a more thorough, authoritative source.&lt;br /&gt;
# Go back to looking for papers describing quorum sensing, especially in organisms known to be present in the human microbiome, either stably or transiently.&lt;br /&gt;
#* Try not to get too side-tracked reading about biofilms.&lt;br /&gt;
#* Search for an existing database of quorum sensing genes to use as references to potentially identify novel quorum sensing genes in microbiome WGS data. Consider making one if it&#039;s not already available.&lt;br /&gt;
# Look for a core metabolome (at this point I think a core microbiome seems unlikely) using metapath (or something similar) in the new HMP data for the gut, oral, and vaginal samples from the 100 reference individuals, as well as other sources like MetaHIT.&lt;br /&gt;
#* Start with a fast and dirty approach pulling all KO&#039;s associated with organisms identified using 16S rDNA sequencing, and then possibly attempt more accurate gene assemblies and annotation from WGS sequencing projects.&lt;br /&gt;
# Try to stay focused on a research topic for more than 2 weeks so I don&#039;t keep wasting time and effort.&lt;br /&gt;
# Don&#039;t make a habit of using the wiki like a personal journal...&lt;br /&gt;
&lt;br /&gt;
=== Possible Research Projects Inspired by &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ===&lt;br /&gt;
# Searching for quorum sensing genes, both known and novel, and any pathways including them in human microbiome data&lt;br /&gt;
#* Search for and consider making quorum sensing gene DB&lt;br /&gt;
#** KEGG has pathways containing both acyl-homoserine lactone and its synthase&lt;br /&gt;
#* After indexing known quorum sensing genes, search for homologues&lt;br /&gt;
#** WGS data - Obviously search for homologues directly&lt;br /&gt;
#** 16S data - Identify organisms and search for homologues in public DBs&lt;br /&gt;
# Search for &amp;quot;core metabolome&amp;quot; in pioneer organisms from infant studies&lt;br /&gt;
#* On the off chance I get to publish on this subject, it might be interesting to draw an analogy between genetic evolution involving chance mutations that are occasionally beneficial and are therefore propagated, and microbial colonization which involves chance introduction of microorganisms that are occasionally beneficial and therefore become stable members of the microbiome.&lt;br /&gt;
# Attempt to search for cases of symbiosis where possible&lt;br /&gt;
#* Search terms: commensalism, synergism (protocooperation), mutualism, competition, amensalism (antagonism), predation, parasitism, neutralism, syntrophism (nutritional synergism, cross-feeding, examples on pp.16-17)&lt;br /&gt;
&lt;br /&gt;
== July 30, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Metagenomics Papers and Subjects Related to the Content of &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ===&lt;br /&gt;
# MetaHIT&lt;br /&gt;
# Vaginal Microbiome&lt;br /&gt;
# Acquisition of Microbiome&lt;br /&gt;
#* [http://www.pnas.org/content/early/2010/06/08/1002601107.full.pdf Vaginal birth vs cesarean section]&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Ted-Report&amp;diff=7270</id>
		<title>Cbcb:Pop-Lab:Ted-Report</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Ted-Report&amp;diff=7270"/>
		<updated>2010-07-22T18:23:19Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* Potential Research Projects Inspired by &amp;#039;&amp;#039;Microbial Inhabitants of Humans&amp;#039;&amp;#039; */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Older Entries ==&lt;br /&gt;
[[Cbcb:Pop-Lab:Ted-Report-2009 | 2009]]&lt;br /&gt;
&lt;br /&gt;
== January 15, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Minimus Documentation ===&lt;br /&gt;
&lt;br /&gt;
Presently, the only relevant Google hit for &amp;quot;minimus&amp;quot; on the first page of results is the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus#Basic_usage_example SourceForge wiki]. The only example on this page is incomplete and appears to be an early draft made during development.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Ideally, it should be easy to find a complete guide with the general format:&lt;br /&gt;
* Simple use case:&lt;br /&gt;
 `toAmos -s path/to/fastaFile.seq -o path/to/fastaFile.afg`&lt;br /&gt;
 `minimus path/to/fastaFile(prefix)`&lt;br /&gt;
* Necessary tools for set up (toAmos)&lt;br /&gt;
* Other options&lt;br /&gt;
* etc&lt;br /&gt;
&lt;br /&gt;
The description found on the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus/README Minimus/README] page (linked to from the middle of the starting page) is more appropriate, but features use cases that may no longer be common and references another required tool (toAmos) without linking to it or describing how to access it. A description of this tool can be found on the Amos [http://sourceforge.net/apps/mediawiki/amos/index.php?title=File_conversion_utilities File Conversion Utilities] page (again, linked to from the starting page), but it is less organized than what I&#039;ve come to expect from a project page, and it is easy to get lost or distracted by the rest of the Amos documentation while trying to piece together the necessary steps for a basic assembly.&lt;br /&gt;
&lt;br /&gt;
=== Comparative Network Analysis pt. 2 ===&lt;br /&gt;
* Meeting with Volker this Friday to discuss how best to apply network alignment to what he&#039;s doing&lt;br /&gt;
* I&#039;m simultaneously trying to find a way to apply my network alignment technique to predicting genes in metagenomic samples&lt;br /&gt;
** I&#039;ve been trying to get past my current program&#039;s requirement that genes be annotated with an EC number. A potentially interesting next step may be to use BioPython to BLAST the sequence of each enzyme annotated in every micro-organism in KEGG against a metagenomic library.&lt;br /&gt;
*** The results would be stretches of linked reactions that have been annotated in KEGG pathways.&lt;br /&gt;
*** This method could be applied to contigs just as easily as finished sequences. In a scenario with low coverage, it could be used to identify genes that are probably present but simply weren&#039;t sampled, by showing the presence of the rest of the pathway. In short, this could finally accomplish what Mihai asked me to work on when I showed up.&lt;br /&gt;
*** The major theoretical shortcoming of this approach is that it could only identify relatively well characterized pathways.&lt;br /&gt;
*** The practical challenges of this approach begin with obtaining a fairly complete copy of KEGG (which, as we&#039;ve learned, is a mess to parse locally and unusably slow to call through the API), and continue with the computational challenge of such a large-scale BLAST operation.&lt;br /&gt;
** Ask Bo about this when he gets back. He may have already done this.&lt;br /&gt;
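As a rough sketch of the binning step that would follow such a BLAST run (the file names, database name, and identity cutoff below are hypothetical, not from an actual run), the tabular output could be parsed to assign each read to the enzyme it hit best:&lt;br /&gt;

```python
# Hypothetical post-processing for the KEGG-enzyme BLAST idea above.
# Assumes a BLAST+ run along the lines of:
#   blastx -query reads.fna -db kegg_enzymes -outfmt 6 -out hits.tsv
# (illustrative names only).
from collections import defaultdict

def best_hits_by_enzyme(outfmt6_lines, min_identity=50.0):
    """Map each enzyme ID to the reads whose best hit it is.

    outfmt6_lines: tab-separated outfmt-6 rows with the standard columns
    (qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore).
    """
    best = {}  # read -> (bitscore, enzyme)
    for line in outfmt6_lines:
        fields = line.rstrip("\n").split("\t")
        read, enzyme = fields[0], fields[1]
        pident, bitscore = float(fields[2]), float(fields[11])
        # keep only hits above the (hypothetical) identity cutoff
        if pident >= min_identity:
            if read not in best or bitscore > best[read][0]:
                best[read] = (bitscore, enzyme)
    bins = defaultdict(set)
    for read, (_, enzyme) in best.items():
        bins[enzyme].add(read)
    return dict(bins)
```

The enzyme bins could then be intersected with KEGG pathway membership to find the stretches of linked reactions described above.&lt;br /&gt;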
&lt;br /&gt;
== January 22, 2010 ==&lt;br /&gt;
* Met with Dan and Sergey to talk about the Minimus-Bambus pipeline&lt;br /&gt;
** Minimus is running fine. I&#039;ve begun characterizing its run-time behavior (see next week&#039;s entry)&lt;br /&gt;
** After some tweaking by Sergey, Bambus was able to finish but did not generate a scaffold. We&#039;re going to talk about this after the meeting on Monday.&lt;br /&gt;
** Sergey had an interesting idea for making a better read simulator:&lt;br /&gt;
*** Error-free reads are cheap and easy to generate. The problem is with the error model.&lt;br /&gt;
*** The &amp;quot;best&amp;quot; tool (that we are aware of) which includes error models is MetaSim, but its error models are years out of date and the authors have historically been unreachable. While Mihai has now shown me how to edit the models in a reasonable way through flat files that characterize base substitutions, I&#039;m not convinced it would be faster or easier to write a program that modifies these files than it would be to just write an entirely new program; and given the amount of time I&#039;ve spent trying to use MetaSim, I&#039;m more than ready to walk away from it. Oh yeah, and MetaSim doesn&#039;t work from the command line, so no scripting.&lt;br /&gt;
*** Sergey has pointed out that most companies will assemble &#039;&#039;E. coli&#039;&#039; when they release a new sequencer. Conveniently, there are many high quality assemblies of &#039;&#039;E. coli&#039;&#039; available for reference. It might therefore be possible to generate new error models for these sequencers in an automated fashion by mapping the &#039;&#039;E. coli&#039;&#039; reads to the available reference genomes, collecting the error frequencies, and then using them to mask synthesized reads.&lt;br /&gt;
*** I also talked with Mohammad and Mihai about this, and they also seemed to think it was a pretty good idea. Mihai has proposed having Sergey or Mohammad add the described error-model generator to his read sampler (written in C) when they have time, but not in preparation for the oral microbiome data.&lt;br /&gt;
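A minimal sketch of the proposed error-model generator, assuming the read mapping is already done elsewhere, so each read arrives paired with the reference bases it aligned to:&lt;br /&gt;

```python
# Sketch of the error-model idea above: tally substitution frequencies by
# comparing mapped reads against the reference bases they aligned to.
# The mapping step itself is assumed done; each input item is a
# (read_sequence, reference_sequence) pair of equal length, indels ignored.
from collections import Counter

def substitution_profile(aligned_pairs):
    """Return per-position error rates and a (ref_base, read_base) count table."""
    errors, totals, subs = Counter(), Counter(), Counter()
    for read, ref in aligned_pairs:
        for pos, (read_base, ref_base) in enumerate(zip(read, ref)):
            totals[pos] += 1
            if read_base != ref_base:
                errors[pos] += 1
                subs[(ref_base, read_base)] += 1
    rates = {pos: errors[pos] / totals[pos] for pos in totals}
    return rates, subs
```

The resulting frequencies could then be used to mask substitutions onto the cheap, error-free simulated reads.&lt;br /&gt;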
&lt;br /&gt;
* Met with James to discuss my work with Volker&lt;br /&gt;
** Told him about my meeting with Volker and the paper he wanted me to prepare, more or less by myself. The concepts of the paper are these:&lt;br /&gt;
*** Most available genomic sequences of mycobacteria are of a very small subset of highly pathogenic organisms.&lt;br /&gt;
*** Subtractive comparative genomics can be used to identify genes that are potentially responsible for differing phenotypes (such as extreme pathogenicity), but there must be available genomic sequences for closely related organisms with differing phenotypes.&lt;br /&gt;
*** Volker has sequenced 2 more non-pathogenic strains of mycobacteria (&#039;&#039;gastri&#039;&#039;, and &#039;&#039;kansasiiW58&#039;&#039;) with the intention of increasing the effectiveness of these subtractive comparative genomic studies.&lt;br /&gt;
*** The meat of the paper would be comparing the results of subtractive comparative genomic analysis using all currently available strains in RefSeq, with the results from also using the two novel sequences.&lt;br /&gt;
*** The other, smaller publishable portion of this project would be a comparison of &#039;&#039;gastri&#039;&#039; and &#039;&#039;kansasiiW58&#039;&#039; to each other, because they are thought to be extremely closely related and yet have distinct phenotypes (which I&#039;ve now forgotten).&lt;br /&gt;
*** James seemed to think this could make an okay paper, and he confirmed that he had not realized Volker was looking for someone to do all of the analysis, both computational and biological, with Volker only contributing analysis of the results after it was all over.&lt;br /&gt;
** Ended up also discussing his work on differential abundance in populations of microorganisms.&lt;br /&gt;
*** I&#039;m going to start working on taking over and expanding Metastats this semester.&lt;br /&gt;
*** I&#039;m also going to start talking to Bo when he gets back about exactly what he&#039;s doing, and how I might be able to include pathway prediction in my expansion of Metastats without stepping on his toes.&lt;br /&gt;
*** Mihai has given me his approval to focus on this.&lt;br /&gt;
&lt;br /&gt;
* Met with Mihai to discuss working with Volker&lt;br /&gt;
** Explained that rather than looking for someone to do only the complex portions of the computational analysis, Volker was/is looking for someone to do the complete analysis.&lt;br /&gt;
** In exchange, Volker is offering first authorship and, if need be, to split the student&#039;s funding with their primary PI.&lt;br /&gt;
** I think I&#039;m capable of doing this within 3 or 4 months but it would consume my time pretty thoroughly.&lt;br /&gt;
** Mihai agreed that this is a reasonable deal, but noted that I have no personal interest in studying mycobacteria, and that it&#039;s therefore unwise of me to invest a bunch of time becoming an expert on an organism I have no interest in continuing to study or work with. I&#039;ve therefore offered to work closely with one of Volker&#039;s graduate students, meeting every week or two. I would be willing to do all of the computational analysis and explain it to them, but they would have to actually look up the potentially interesting genes and relationships I discover and help me keep the analysis biologically interesting and relevant.&lt;br /&gt;
&lt;br /&gt;
* Met with Mihai and Mohammad to discuss our impending huge-ass(embly) problem&lt;br /&gt;
** Talked about strategies for iterative assembly as an approach to assembling intractably large data sets. Most have glaring shortcomings and complications.&lt;br /&gt;
** Discovered Mike Schatz has a map-reduce implementation of an assembler that uses de Bruijn graphs and is better suited to assemblies with high coverage but short read lengths.&lt;br /&gt;
&lt;br /&gt;
== January 29, 2010 ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Minimus Performance Analysis ===&lt;br /&gt;
I&#039;m testing Minimus and Bambus in preparation for the oral microbiome data, and after spamming several lab members with email, it occurred to me that it would be considerably more considerate to put the information here instead.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Memory Usage Analysis&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|RAM used by the Overlapper (in GB):&lt;br /&gt;
| 1.2 || 2.4 || 4.5 || 8.7 || 17 || 21.5 || ~1.1 GB * (#Reads in Millions) = (Memory Used)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|RAM used by the Tigger (in GB):&lt;br /&gt;
| 3 || 6 || 12 || 25 || 48.4 || (60) || ~3 GB * (#Reads in Millions) = (Memory Used)&lt;br /&gt;
|}&lt;br /&gt;
* The 16 million read assembly data is from Walnut, all other numbers are rough averages from both Privet and Walnut.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Run Time Analysis on Privet&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Overlapper (in min):&lt;br /&gt;
| 3 || 9 || 34 || 130 || (576) || 783 || 2.96 * (#Reads in Millions)&amp;lt;sup&amp;gt;1.87&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Tigger (in min):&lt;br /&gt;
| 9 || 66 || 473 || (3,456) || (25,088) || (47,493) || 9.03 * (#Reads in Millions)&amp;lt;sup&amp;gt;2.86&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|}&lt;br /&gt;
* Privet has 2.4GHz Opteron 850 processors and 32GB of RAM. Minimus is not parallelized and therefore only uses a single core.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
* The models were generated by plotting the data points in OpenOffice and fitting a power trendline. The R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; value for each was 1.&lt;br /&gt;
* &#039;&#039;&#039;For reference: There are 1,440 minutes in one day, and 10,080 minutes in one week&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Run Time Analysis on Walnut&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Overlapper (in min):&lt;br /&gt;
| 2.7 || 8 || 27.5 || 102 || (325) || (481.5) || 2.54 * (#Reads in Millions)&amp;lt;sup&amp;gt;1.75&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Tigger (in min):&lt;br /&gt;
| 14 || 81 || 471.5 || (2,752) || (16,006) || (28,212) || 13.99 * (#Reads in Millions)&amp;lt;sup&amp;gt;2.54&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|}&lt;br /&gt;
* Walnut has 2.8GHz Opteron 875 processors and 64GB of RAM. Minimus is not parallelized and therefore only uses a single core.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
* The models were generated by plotting the data points in OpenOffice and fitting a power trendline. The R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; value for each was 1.&lt;br /&gt;
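For convenience, the fitted models above reduce to a one-line calculator (coefficients and exponents copied from the trendlines; predictions outside the measured range are rough extrapolations):&lt;br /&gt;

```python
# The power-law trendlines from the tables above, as a quick calculator.
# Coefficients/exponents are taken directly from the fitted models.
def predicted_minutes(reads_in_millions, coeff, exponent):
    return coeff * reads_in_millions ** exponent

# Privet overlapper: predicted_minutes(n, 2.96, 1.87)
# Privet tigger:     predicted_minutes(n, 9.03, 2.86)
# Walnut overlapper: predicted_minutes(n, 2.54, 1.75)
# Walnut tigger:     predicted_minutes(n, 13.99, 2.54)
```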
&lt;br /&gt;
&lt;br /&gt;
==== Other Observations About the Assemblies ====&lt;br /&gt;
* Because of the short read length, every million reads is only 75MB of sequence. This is roughly 10-20x coverage of an average single bacterium, but these test sets have reads sampled from roughly 100 bacterial genomic sequences, so I would expect the coverage to be on the order of 0.1-0.2x on average.&lt;br /&gt;
* Unsurprisingly, a cursory glance through the contig files shows that each contig is composed of only about 2 or 3 reads.&lt;br /&gt;
* The N50 analysis for the smaller assemblies shows that only 2-3 reads are being added to each contig on average, leaving both N50s and average lengths just below 150bp.&lt;br /&gt;
* Therefore, if the complexity of the oral microbiome data is high and/or the contamination with human DNA is extreme (80-95%), the coverage may be extremely low. This may make the use of Mike&#039;s assembler impractical, or at least that&#039;s how I&#039;m going to keep justifying this testing to myself until someone corrects me.&lt;br /&gt;
** &#039;&#039;&#039;Update:&#039;&#039;&#039; Apparently Mike and Dan have talked about this, and somewhere around 75-80bp, the performance of minimus catches up with Mike&#039;s de Bruijn graph assembler anyway. I also did not know that Dan&#039;s map-reduce minimus was running and would be used to assemble the data alongside Mike&#039;s.&lt;br /&gt;
* I learned on Feb. 1, 2010 that the 454 error model allows wild variation with respect to read length, so these assemblies might not actually be representative of the performance with the Illumina data we&#039;re expecting on Feb. 20.&lt;br /&gt;
&lt;br /&gt;
=== UMIACS Resources ===&lt;br /&gt;
I just discovered that the information listed on the CBCB intranet Resources page is inaccurate and very out of date, so I&#039;m making my own table.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:500px; height:200px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Umiacs Resources&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Machine !! Processor !! Speed !! Cores !! RAM&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Walnut&lt;br /&gt;
| Dual Core AMD Opteron 8220 || 2.8GHz || 16 || 64GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Privet&lt;br /&gt;
| AMD Opteron 850 || 2.4GHz || 4 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Larch&lt;br /&gt;
| AMD Opteron 850 || 2.4GHz || 4 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Sycamore&lt;br /&gt;
| Dual Core AMD Opteron 875 || 1GHz || 8 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Shagbark&lt;br /&gt;
| Intel Core 2 Quad || 2.83GHz || 4 || 4GB&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== February 5, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Meeting with Volker and Sarada on Feb 3 ===&lt;br /&gt;
* Need to teach Sarada how to perform a local BLAST on some sequences they have that aren&#039;t yet in GenBank&lt;br /&gt;
* Trying to set up a meeting with Volker to find out for sure if he wants me to work on this project&lt;br /&gt;
&lt;br /&gt;
=== Biomarker Assembly ===&lt;br /&gt;
Bo, Mohammad, and I spent a couple of hours discussing biomarker assembly today. I&#039;m going to try to efficiently summarize our conclusions, but it might be difficult without an easy way to make images. We eventually decided it would be best to attempt several methods in tandem, due to the severe time constraints. The general approach of each method is to fish out and bin reads through one method or another, and then assemble the reads in each bin using minimus. All sequence identity values will be determined by using BLASTx.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Preliminary Steps&#039;&#039;&#039;&lt;br /&gt;
* Gather biomarker consensus amino acid sequences&lt;br /&gt;
* Gather amino acid sequences for associated genes from each bacterial genome in refseq&lt;br /&gt;
* Cluster amino acid sequences within each biomarker set&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Sequence Identity Threshold Determination&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
There are 31 biomarkers and about 1,000 bacterial genomes in which they occur. This means that there are 31 sets of 1,000 sequences that are all relatively similar to one another. Because of the sequence similarity and the short read length, it&#039;s possible that a significant number of reads will map equally well to multiple sequences within each biomarker set. For this reason, it is better to allow a single read to be placed in any bin containing a sequence to which the read mapped above some minimum threshold. This will protect against artificially lowering the coverage of extremely well-conserved regions, and with any luck, incorrectly binned reads will simply not be included in the assembly. There are several ways to approach the determination of this threshold.&lt;br /&gt;
* Determine the lowest level of sequence identity between the consensus sequence for each biomarker and any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.&lt;br /&gt;
** The obvious shortcoming of this approach is that the sequence identity between two homologous gene-length sequences can be much lower than between two homologous read-length sequences.&lt;br /&gt;
* Align 75mers to determine the lowest score between any two 75mers in the consensus sequence for each biomarker and the corresponding 75mer in any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.&lt;br /&gt;
** While this solves the problem with the above approach, it is significantly more complicated and the data is going to be here soon.&lt;br /&gt;
* Choose a sequence identity level, or try a few different levels and see which produces the most complete biomarker proteins without creating overly complex graphs.&lt;br /&gt;
** While there&#039;s no good theoretical justification for this approach, it&#039;s probably what we&#039;ll do and it will probably work well enough.&lt;br /&gt;
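Whichever way the threshold is chosen, the binning rule itself is simple; a minimal sketch (the identity values and bin names here are illustrative):&lt;br /&gt;

```python
# Minimal sketch of the multi-bin rule described above: a read is placed in
# every bin containing a sequence it matched at or above the chosen
# identity threshold, not just the bin of its single best hit.
from collections import defaultdict

def bin_reads(hits, threshold):
    """hits: iterable of (read_id, bin_id, percent_identity) tuples."""
    bins = defaultdict(set)
    for read_id, bin_id, pident in hits:
        if pident >= threshold:
            bins[bin_id].add(read_id)
    return dict(bins)
```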
&lt;br /&gt;
&#039;&#039;&#039;Schemes&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
After writing absurdly complicated descriptions of the various approaches, which I felt weren&#039;t very clear, I used Keynote to recreate the diagrams we&#039;d drawn on the whiteboard and then printed them to a PDF. Unfortunately, I&#039;m not sure exactly how to embed that in the wiki, so email me at trgibbons@gmail.com if you&#039;re reading this and I&#039;ll send it to you.&lt;br /&gt;
# Marker-wise assembly&lt;br /&gt;
#* Bin reads that align to any sequence in a given marker set, and/or the consensus sequence for that marker&lt;br /&gt;
# Cluster-wise assembly&lt;br /&gt;
#* Cluster protein sequences&lt;br /&gt;
#* Bin reads that align to any protein sequence in a given cluster&lt;br /&gt;
# Gene-wise assembly&lt;br /&gt;
#* Bin reads that align to a particular protein sequence&lt;br /&gt;
* Marker-wise and cluster-wise binning should be better for assembling novel sequences&lt;br /&gt;
* Gene-wise binning should produce higher quality assemblies of markers for known organisms or those that are closely related&lt;br /&gt;
&lt;br /&gt;
== February 12, 2010 ==&lt;br /&gt;
&#039;&#039;&#039;SNOW!!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== February 19, 2010 ==&lt;br /&gt;
Met with James to discuss Metastats. I&#039;m going to attempt the following two updates by the end of the semester (I probably incorrectly described them, but I&#039;ll work it out later):&lt;br /&gt;
# Find a better way to compute the false discovery rate (FDR)&lt;br /&gt;
#* Currently computed by using the lowest 100 p-values from each sample (look at source code)&lt;br /&gt;
#* Need to find a more statistically rigorous way to compute it&lt;br /&gt;
#* The expected number of false positives is the p-value cutoff times the number of tests (for 1000 tests at p=0.05, 50 null hypotheses will be incorrectly rejected; so if the null hypothesis is rejected for 100 samples, 50 of them would be expected to be false positives)&lt;br /&gt;
#* James just thinks this sucks and needs to be fixed&lt;br /&gt;
# Compute F-tests for features across all samples&lt;br /&gt;
#* Most requested feature&lt;br /&gt;
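One standard, more rigorous option worth evaluating for the first item (my suggestion here, not something we settled on) is the Benjamini-Hochberg step-up procedure, which controls the FDR at level q by rejecting the k smallest p-values, where k is the largest rank whose p-value is at most (k/m)q:&lt;br /&gt;

```python
# Benjamini-Hochberg step-up procedure (one standard FDR-controlling method;
# offered as a candidate replacement, not the current Metastats code).
def benjamini_hochberg(pvalues, q=0.05):
    """Return indices of hypotheses rejected while controlling FDR at level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        # reject up through the largest rank whose p-value is at most (rank/m)*q
        if rank / m * q >= pvalues[i]:
            k_max = rank
    return sorted(order[:k_max])
```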
&lt;br /&gt;
I spent too much time talking with people about science and not enough time doing it this week...&lt;br /&gt;
&lt;br /&gt;
== March 26, 2010 ==&lt;br /&gt;
I didn&#039;t realize it had been a whole month since I updated. Let&#039;s see, I nearly dropped Dr. Song&#039;s statistical genomics course, but then I didn&#039;t. I &amp;lt;i&amp;gt;did&amp;lt;/i&amp;gt; however learn that we don&#039;t have a class project. So the Metastats upgrades are going on a backburner for now because ZOMGZ I HAVE TO PRESENT MY PROPOSAL BY THE END OF THIS YEAR!!!&lt;br /&gt;
&lt;br /&gt;
My Thesis Project:&lt;br /&gt;
* I&#039;m generally interested in pathways shared between micro-organisms in a community, and also between micro-organisms and their multicellular hosts.&lt;br /&gt;
** I&#039;m particularly interested in studying the metabolic pathways shared between micro-organisms in the human gut, both with each other and their human hosts.&lt;br /&gt;
* James has created time-series models, and is interested in tackling spatial models with me.&lt;br /&gt;
* I would really like to correlate certain metabolic pathways with his modeled relationships.&lt;br /&gt;
&lt;br /&gt;
Volker&#039;s Project:&lt;br /&gt;
* has taken a big hit this week.&lt;br /&gt;
* I&#039;m going to go forward, with Bo&#039;s help, using the plan outlined in my project proposal for Mihai&#039;s biosequence analysis class:&lt;br /&gt;
** Use reciprocal best blast hits to map H37Rv genes to annotated genes in all available virulent and non-virulent strains of mycobacteria&lt;br /&gt;
** Use results from gene mapping to identify a core set of tuberculosis genes, as well as a set of predicted virulence genes&lt;br /&gt;
** Use a variety of comparison schemes to study the effect on the set of predicted virulence genes of the consideration of different subsets of non-virulent strains&lt;br /&gt;
** Use stable virulence prediction to rank genes as virulence targets&lt;br /&gt;
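The reciprocal-best-hit step in the first bullet reduces to a simple set operation once each direction&#039;s best hits are tabulated; a minimal sketch (the gene IDs in the usage example are placeholders):&lt;br /&gt;

```python
# Sketch of the reciprocal best BLAST hit (RBH) step outlined above.
# best_ab maps each H37Rv gene to its best hit in another strain, and
# best_ba maps that strain's genes back; a pair is kept only when each
# gene is the other's best hit. (Running BLAST itself is assumed done.)
def reciprocal_best_hits(best_ab, best_ba):
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)
```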
&lt;br /&gt;
Metastats:&lt;br /&gt;
* As I mentioned, this is put on hold&lt;br /&gt;
* I intend to pick this back up after I&#039;m done with Volker&#039;s project, as it could be instrumental to my thesis work&lt;br /&gt;
&lt;br /&gt;
== April 2, 2010 ==&lt;br /&gt;
More on my Thesis Project:&lt;br /&gt;
* I read the most recent (Science, Nov. 2009) paper by Gordon and Knight on their ongoing gut microbiota experiments&lt;br /&gt;
* Pretty much every section addressed the potential thesis topics I&#039;d imagined while reading the preceding section. Frustrating, but reaffirming (trying to learn from Mihai on not getting bummed about being scooped).&lt;br /&gt;
* Something that seems interesting and useful to me is to do more rigorous statistical analysis to attempt to correlate particular genes and pathways with the time-series and spatial data. I will have to work closely with Bo, at least at first.&lt;br /&gt;
* As a starting point, James has recommended building spatial models similar to his time-series models&lt;br /&gt;
* James is essentially mentoring me on my project at this point. It&#039;s pretty excellent.&lt;br /&gt;
&lt;br /&gt;
== July 23, 2010 ==&lt;br /&gt;
It&#039;s been several months, I don&#039;t feel any closer to finding a thesis project, and it&#039;s really starting to stress me out. I&#039;ve finally stopped making excuses for not reading and have been steadily reading about 10 abstracts and 2 papers per week for the last month or two, but it doesn&#039;t appear to be nearly enough. I met with Mihai today to talk about it and then foolishly went for a run in the heat of the afternoon, where I decided on a new direction in a state of euphoric delirium.&lt;br /&gt;
# Read the book Mihai loaned me within the next week (or so): &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; by Michael Wilson&lt;br /&gt;
#* Mihai says the book is a summary of what was known about the human microbiome 5 years ago. The table of contents for the first chapter is essentially identical to the list of Wikipedia pages I&#039;ve read in the past week, so I&#039;m pretty excited to now have a more thorough, authoritative source.&lt;br /&gt;
# Go back to looking for papers describing quorum sensing, especially in organisms known to be present in the human microbiome, either stably or transiently.&lt;br /&gt;
#* Try not to get too side-tracked reading about biofilms.&lt;br /&gt;
#* Search for an existing database of quorum sensing genes to use as references to potentially identify novel quorum sensing genes in microbiome WGS data. Consider making one if it&#039;s not already available.&lt;br /&gt;
# Look for a core metabolome (at this point I think a core microbiome seems unlikely) using metapath (or something similar) in the new HMP data for the gut, oral, and vaginal samples from the 100 reference individuals, as well as other sources like MetaHIT.&lt;br /&gt;
#* Start with a fast and dirty approach pulling all KO&#039;s associated with organisms identified using 16S rDNA sequencing, and then possibly attempt more accurate gene assemblies and annotation from WGS sequencing projects.&lt;br /&gt;
# Try to stay focused on a research topic for more than 2 weeks so I don&#039;t keep wasting time and effort.&lt;br /&gt;
# Don&#039;t make a habit of using the wiki like a personal journal...&lt;br /&gt;
&lt;br /&gt;
=== Possible Research Projects Inspired by &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ===&lt;br /&gt;
# Searching for quorum sensing genes, both known and novel, and any pathways including them in human microbiome data&lt;br /&gt;
#* Search for and consider making quorum sensing gene DB&lt;br /&gt;
#** KEGG has pathways containing both acyl-homoserine lactone and its synthase&lt;br /&gt;
#* After indexing known quorum sensing genes, search for homologues&lt;br /&gt;
#** WGS data - Obviously search for homologues directly&lt;br /&gt;
#** 16S data - Identify organisms and search for homologues in public DBs&lt;br /&gt;
# Search for &amp;quot;core metabolome&amp;quot; in pioneer organisms from infant studies&lt;br /&gt;
#* On the off chance I get to publish on this subject, it might be interesting to draw an analogy between genetic evolution, involving chance mutations that are occasionally beneficial and therefore propagated, and microbial colonization, which involves the chance introduction of microorganisms that are occasionally beneficial and therefore become stable members of the microbiome.&lt;br /&gt;
# Attempt to search for cases of symbiosis where possible&lt;br /&gt;
#* Search terms: commensalism, synergism (protocooperation), mutualism, competition, amensalism (antagonism), predation, parasitism, neutralism, syntrophism (nutritional synergism, cross-feeding, examples on pp.16-17)&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Ted-Report&amp;diff=7269</id>
		<title>Cbcb:Pop-Lab:Ted-Report</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Ted-Report&amp;diff=7269"/>
		<updated>2010-07-22T17:18:02Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* July 23, 2010 */ Started section to brainstorm research projects I&amp;#039;d like to look into&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Older Entries ==&lt;br /&gt;
[[Cbcb:Pop-Lab:Ted-Report-2009 | 2009]]&lt;br /&gt;
&lt;br /&gt;
== January 15, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Minimus Documentation ===&lt;br /&gt;
&lt;br /&gt;
Presently, the only relevant Google hit for &amp;quot;minimus&amp;quot; on the first page of results is the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus#Basic_usage_example SourceForge wiki]. The only example on that page is incomplete and appears to be an early draft made during development.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Ideally, it should be easy to find a complete guide with the general format:&lt;br /&gt;
* Simple use case:&lt;br /&gt;
 `toAmos -s path/to/fastaFile.seq -o path/to/fastaFile.afg`&lt;br /&gt;
 `minimus path/to/fastaFile(prefix)`&lt;br /&gt;
* Necessary tools for set up (toAmos)&lt;br /&gt;
* Other options&lt;br /&gt;
* etc&lt;br /&gt;
&lt;br /&gt;
The description found on the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus/README Minimus/README] page (linked to from the middle of the starting page) is more appropriate, but it features use cases that may no longer be common and references another required tool (toAmos) without linking to it or describing how to access it. A description of this tool can be found on the Amos [http://sourceforge.net/apps/mediawiki/amos/index.php?title=File_conversion_utilities File Conversion Utilities] page (again, linked to from the starting page), but it is less organized than what I&#039;ve come to expect from a project page, and it is easy to get lost or distracted by the rest of the Amos documentation while trying to piece together the necessary steps for a basic assembly.&lt;br /&gt;
&lt;br /&gt;
=== Comparative Network Analysis pt. 2 ===&lt;br /&gt;
* Meeting with Volker this Friday to discuss how best to apply network alignment to what he&#039;s doing&lt;br /&gt;
* I&#039;m simultaneously trying to find a way to apply my network alignment technique to predicting genes in metagenomic samples&lt;br /&gt;
** I&#039;ve been trying to get past my current program&#039;s requirement that genes be annotated with an EC number. A potentially interesting next step may be to use BioPython to BLAST the sequence of each enzyme annotated in every micro-organism in KEGG against a metagenomic library.&lt;br /&gt;
*** The results would be stretches of linked reactions that have been annotated in KEGG pathways.&lt;br /&gt;
*** This method could be applied to contigs just as easily as finished sequences. In a scenario with low coverage, it could be used to identify genes that are probably present but simply weren&#039;t sampled, by showing the presence of the rest of the pathway. In short, this could finally accomplish what Mihai asked me to work on when I showed up.&lt;br /&gt;
*** The major theoretical shortcoming of this approach is that it could only identify relatively well characterized pathways.&lt;br /&gt;
*** The practical challenges of this approach begin with obtaining a fairly complete copy of KEGG (which, as we&#039;ve learned, is a mess to parse locally and unusably slow to call through the API), and continue with the computational challenge of such a large-scale BLAST operation.&lt;br /&gt;
** Ask Bo about this when he gets back. He may have already done this.&lt;br /&gt;
&lt;br /&gt;
== January 22, 2010 ==&lt;br /&gt;
* Met with Dan and Sergey to talk about the Minimus-Bambus pipeline&lt;br /&gt;
** Minimus is running fine. I&#039;ve begun characterizing its run-time behavior (see next week&#039;s entry)&lt;br /&gt;
** After some tweaking by Sergey, Bambus was able to finish but did not generate a scaffold. We&#039;re going to talk about this after the meeting on Monday.&lt;br /&gt;
** Sergey had an interesting idea for making a better read simulator:&lt;br /&gt;
*** Error-free reads are cheap and easy to generate. The problem is with the error model.&lt;br /&gt;
*** The &amp;quot;best&amp;quot; tool (that we are aware of) which includes error models is MetaSim, but its error models are years out of date and the authors have historically been unreachable. While Mihai has now shown me how to edit the models in a reasonable way through flat files that characterize base substitutions, I&#039;m not convinced it would be faster or easier to write a program that modifies these files than it would be to just write an entirely new program; and given the amount of time I&#039;ve spent trying to use MetaSim, I&#039;m more than ready to walk away from it. Oh yeah, and MetaSim doesn&#039;t work from the command line, so no scripting.&lt;br /&gt;
*** Sergey has pointed out that most companies will assemble &#039;&#039;E. coli&#039;&#039; when they release a new sequencer. Conveniently, there are many high quality assemblies of &#039;&#039;E. coli&#039;&#039; available for reference. It might therefore be possible to generate new error models for these sequencers in an automated fashion by mapping the &#039;&#039;E. coli&#039;&#039; reads to the available reference genomes, collecting the error frequencies, and then using them to mask synthesized reads.&lt;br /&gt;
*** I also talked with Mohammad and Mihai about this; both seemed to think it was a pretty good idea. Mihai has proposed having Sergey or Mohammad add the described error-model generator to his read sampler (written in C) when they have time, but not in preparation for the oral microbiome data.&lt;br /&gt;
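&lt;br /&gt;
Sergey&#039;s scheme could be sketched roughly like this (a toy illustration only, assuming read/reference pairs have already been aligned and trimmed to equal length; this is not the C read sampler Mihai mentioned):&lt;br /&gt;

```python
import random
from collections import Counter

def substitution_model(aligned_pairs):
    """Tally per-base substitution counts from (read, reference) pairs
    that have already been aligned and trimmed to equal length."""
    counts = {base: Counter() for base in 'ACGT'}
    for read, ref in aligned_pairs:
        for observed, true_base in zip(read, ref):
            if true_base in counts:
                counts[true_base][observed] += 1
    return counts

def apply_model(clean_read, counts, rng=random):
    """Mask an error-free read using the observed substitution frequencies."""
    out = []
    for base in clean_read:
        tally = counts.get(base)
        if tally:
            bases = list(tally)
            weights = [tally[b] for b in bases]
            out.append(rng.choices(bases, weights=weights)[0])
        else:
            out.append(base)
    return ''.join(out)
```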
&lt;br /&gt;
* Met with James to discuss my work with Volker&lt;br /&gt;
** Told him about my meeting with Volker and the paper he wanted me to prepare, more or less by myself. The main points of the paper are these:&lt;br /&gt;
*** Most available genomic sequences of mycobacteria are of a very small subset of highly pathogenic organisms.&lt;br /&gt;
*** Subtractive comparative genomics can be used to identify genes that are potentially responsible for differing phenotypes (such as extreme pathogenicity), but it requires available genomic sequences for closely related organisms with differing phenotypes.&lt;br /&gt;
*** Volker has sequenced 2 more non-pathogenic strains of mycobacteria (&#039;&#039;gastri&#039;&#039;, and &#039;&#039;kansasiiW58&#039;&#039;) with the intention of increasing the effectiveness of these subtractive comparative genomic studies.&lt;br /&gt;
*** The meat of the paper would be comparing the results of subtractive comparative genomic analysis using all currently available strains in RefSeq, with the results from also using the two novel sequences.&lt;br /&gt;
*** The other, smaller publishable portion of this project would be a comparison of &#039;&#039;gastri&#039;&#039; and &#039;&#039;kansasiiW58&#039;&#039; to each other because they are allegedly thought to be extremely closely related, and yet they have distinct phenotypes (which I&#039;ve now forgotten).&lt;br /&gt;
*** James seemed to think this could make an okay paper, and he confirmed that he had not realized Volker was looking for someone to do all of the analysis, both computational and biological, with Volker only contributing analysis of the analysis after it was all over.&lt;br /&gt;
** Ended up also discussing his work on differential abundance in populations of microorganisms.&lt;br /&gt;
*** I&#039;m going to start working on taking over and expanding Metastats this semester.&lt;br /&gt;
*** I&#039;m also going to start talking to Bo when he gets back about exactly what he&#039;s doing, and how I might be able to include pathway prediction in my expansion of Metastats without stepping on his toes.&lt;br /&gt;
*** Mihai has given me his approval to focus on this.&lt;br /&gt;
&lt;br /&gt;
* Met with Mihai to discuss working with Volker&lt;br /&gt;
** Explained that rather than looking for someone to do only the complex portions of the computational analysis, Volker was/is looking for someone to do the complete analysis.&lt;br /&gt;
** In exchange, Volker is offering first authorship and, if need be, to split the student&#039;s funding with their primary PI.&lt;br /&gt;
** I think I&#039;m capable of doing this within 3 or 4 months but it would consume my time pretty thoroughly.&lt;br /&gt;
** Mihai agreed that this is a reasonable deal, but pointed out that I have no personal interest in studying mycobacteria, and that it&#039;s therefore unwise of me to invest a bunch of time becoming an expert on an organism I have no interest in continuing to study or work with. I&#039;ve therefore offered to work closely with one of Volker&#039;s graduate students, who could meet with me every week or two. I would be willing to do all of the computational analysis and explain it to them, but they would have to actually look up potentially interesting genes and relationships I discover and help me keep the analysis biologically interesting and relevant.&lt;br /&gt;
&lt;br /&gt;
* Met with Mihai and Mohammad to discuss our impending huge-ass(embly) problem&lt;br /&gt;
** Talked about strategies for iterative assembly as an approach to assembling intractably large data sets. Most have glaring shortcomings and complications.&lt;br /&gt;
** Discovered that Mike Schatz has a map-reduce implementation of an assembler that uses de Bruijn graphs and is better suited to assemblies with high coverage but short read lengths.&lt;br /&gt;
&lt;br /&gt;
== January 29, 2010 ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Minimus Performance Analysis ===&lt;br /&gt;
I&#039;m testing minimus and bambus in preparation for the oral microbiome data, and after spamming several lab members with email, it occurred to me that it would be considerably more considerate to put the information here instead.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Memory Usage Analysis&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|RAM used by the Overlapper (in GB):&lt;br /&gt;
| 1.2 || 2.4 || 4.5 || 8.7 || 17 || 21.5 || ~1.1 GB * (#Reads in Millions) = (Memory Used)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|RAM used by the Tigger (in GB):&lt;br /&gt;
| 3 || 6 || 12 || 25 || 48.4 || (60) || ~3 GB * (#Reads in Millions) = (Memory Used)&lt;br /&gt;
|}&lt;br /&gt;
* The 16-million-read assembly data is from Walnut; all other numbers are rough averages from both Privet and Walnut.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Run Time Analysis on Privet&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Overlapper (in min):&lt;br /&gt;
| 3 || 9 || 34 || 130 || (576) || 783 || 2.96 * (#Reads in Millions)&amp;lt;sup&amp;gt;1.87&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Tigger (in min):&lt;br /&gt;
| 9 || 66 || 473 || (3,456) || (25,088) || (47,493) || 9.03 * (#Reads in Millions)&amp;lt;sup&amp;gt;2.86&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|}&lt;br /&gt;
* Privet has 2.4GHz Opteron 850 processors and 32GB of RAM. Minimus is not parallelized and therefore only uses a single core.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
* The models were generated by plotting the data points in OpenOffice and fitting a power trendline. The R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; value for each was 1.&lt;br /&gt;
* &#039;&#039;&#039;For reference: There are 1,440 minutes in one day, and 10,080 minutes in one week&#039;&#039;&#039;&lt;br /&gt;
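&lt;br /&gt;
The power-law models above can be reproduced with an ordinary least-squares fit in log-log space; a minimal sketch using the measured Privet Overlapper points from the table (the 16M point is excluded because it is itself a model prediction):&lt;br /&gt;

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**b by linear least squares in log-log space."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    sx, sy = sum(lx), sum(ly)
    sxx = sum(v * v for v in lx)
    sxy = sum(u * v for u, v in zip(lx, ly))
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = math.exp((sy - b * sx) / n)
    return a, b

# Measured Overlapper run times on Privet: minutes vs. millions of 75bp reads
reads = [1, 2, 4, 8, 20]
minutes = [3, 9, 34, 130, 783]
a, b = fit_power_law(reads, minutes)  # the exponent b comes out near 1.87
```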
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Run Time Analysis on Walnut&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Overlapper (in min):&lt;br /&gt;
| 2.7 || 8 || 27.5 || 102 || (325) || (481.5) || 2.54 * (#Reads in Millions)&amp;lt;sup&amp;gt;1.75&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Tigger (in min):&lt;br /&gt;
| 14 || 81 || 471.5 || (2,752) || (16,006) || (28,212) || 13.99 * (#Reads in Millions)&amp;lt;sup&amp;gt;2.54&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|}&lt;br /&gt;
* Walnut has 2.8GHz Opteron 875 processors and 64GB of RAM. Minimus is not parallelized and therefore only uses a single core.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
* The models were generated by plotting the data points in OpenOffice and fitting a power trendline. The R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; value for each was 1.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Other Observations About the Assemblies ====&lt;br /&gt;
* Because of the short read length, every million reads is only 75Mbp of sequence. This is roughly 10-20x coverage of an average single bacterium. Since these test sets have reads sampled from roughly 100 bacterial genomic sequences, I would expect the per-genome coverage to be on the order of 0.1-0.2x on average.&lt;br /&gt;
* Unsurprisingly, a cursory glance through the contig files shows that each contig comprises only about 2 or 3 reads.&lt;br /&gt;
* The N50 analysis for the smaller assemblies shows that only 2-3 reads are being added to each contig on average, leaving both N50s and average lengths just below 150bp.&lt;br /&gt;
* Therefore if the complexity of the oral microbiome data is high and/or the contamination of human DNA is extreme (80-95%), the coverage may be extremely low. This may make the use of Mike&#039;s assembler impractical, or at least that&#039;s how I&#039;m going to keep justifying this testing to myself until someone corrects me.&lt;br /&gt;
** &#039;&#039;&#039;Update:&#039;&#039;&#039; Apparently Mike and Dan have talked about this, and somewhere around 75-80bp, the performance of minimus catches up with Mike&#039;s de Bruijn graph assembler anyway. I also did not know that Dan&#039;s map-reduce minimus was running and would be used to assemble the data alongside Mike&#039;s.&lt;br /&gt;
* I learned on Feb. 1, 2010 that the 454 error model allows wild variation with respect to read length, so these assemblies might not actually be representative of the performance with the Illumina data we&#039;re expecting on Feb. 20.&lt;br /&gt;
&lt;br /&gt;
=== UMIACS Resources ===&lt;br /&gt;
I just discovered that the information listed on the CBCB intranet Resources page is inaccurate and very out of date, so I&#039;m making my own table.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:500px; height:200px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Umiacs Resources&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Machine !! Processor !! Speed !! Cores !! RAM&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Walnut&lt;br /&gt;
| Dual Core AMD Opteron 8220 || 2.8GHz || 16 || 64GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Privet&lt;br /&gt;
| AMD Opteron 850 || 2.4GHz || 4 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Larch&lt;br /&gt;
| AMD Opteron 850 || 2.4GHz || 4 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Sycamore&lt;br /&gt;
| Dual Core AMD Opteron 875 || 1GHz || 8 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Shagbark&lt;br /&gt;
| Intel Core 2 Quad || 2.83GHz || 4 || 4GB&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== February 5, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Meeting with Volker and Sarada on Feb 3 ===&lt;br /&gt;
* Need to teach Sarada how to perform local BLAST on some sequences they have that aren&#039;t yet in GenBank&lt;br /&gt;
* Trying to set up a meeting with Volker to find out for sure if he wants me to work on this project&lt;br /&gt;
&lt;br /&gt;
=== Biomarker Assembly ===&lt;br /&gt;
Bo, Mohammad, and I spent a couple hours discussing biomarker assembly today. I&#039;m going to try to summarize our conclusions efficiently, but it might be difficult without an easy way to make images. We eventually decided it would be best to attempt several methods in tandem, due to the severe time constraints. The general approach of each method is to fish out and bin reads through one method or another, and then assemble the reads in each bin using minimus. All sequence identity values will be determined by using BLASTx.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Preliminary Steps&#039;&#039;&#039;&lt;br /&gt;
* Gather biomarker consensus amino acid sequences&lt;br /&gt;
* Gather amino acid sequences for associated genes from each bacterial genome in refseq&lt;br /&gt;
* Cluster amino acid sequences within each biomarker set&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Sequence Identity Threshold Determination&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
There are 31 biomarkers and about 1,000 bacterial genomes in which they occur. This means that there are 31 sets of 1,000 sequences that are all relatively similar to one another. Because of the sequence similarity and the short read length, it&#039;s possible that a significant number of reads will map equally well to multiple sequences within each biomarker set. For this reason, it is better to allow a single read to be placed in any bin containing a sequence to which the read mapped above some minimum threshold. This will protect against artificially lowering the coverage of extremely well conserved regions, and with any luck, incorrectly binned reads will simply not be included in the assembly. There are several ways to approach the determination of this threshold.&lt;br /&gt;
* Determine the lowest level of sequence identity between the consensus sequence for each biomarker and any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.&lt;br /&gt;
** The obvious shortcoming of this approach is that the sequence identity between two homologous gene-length sequences can be much lower than between two homologous read-length sequences.&lt;br /&gt;
* Align 75mers to determine the lowest score between any two 75mers in the consensus sequence for each biomarker and the corresponding 75mer in any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.&lt;br /&gt;
** While this solves the problem with the above approach, it is significantly more complicated and the data is going to be here soon.&lt;br /&gt;
* Choose a sequence identity level, or try a few different levels and see which produces the most complete biomarker proteins without creating overly complex graphs.&lt;br /&gt;
** While there&#039;s no good theoretical justification for this approach, it&#039;s probably what we&#039;ll do and it will probably work well enough.&lt;br /&gt;
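&lt;br /&gt;
The first thresholding scheme could be sketched like this (a toy illustration assuming ungapped, equal-length alignments rather than real BLASTx output):&lt;br /&gt;

```python
def percent_identity(a, b):
    """Fraction of matching positions between two aligned, equal-length sequences."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)

def marker_threshold(consensus, members):
    """Lowest identity between a biomarker consensus and any member sequence."""
    return min(percent_identity(consensus, m) for m in members)

def global_threshold(marker_sets):
    """Lowest threshold from any biomarker set, for a single shared cutoff."""
    return min(marker_threshold(c, members) for c, members in marker_sets)
```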
&lt;br /&gt;
&#039;&#039;&#039;Schemes&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
After making absurdly complicated descriptions of the various approaches, which I felt weren&#039;t very clear, I used Keynote to recreate the diagrams we&#039;d drawn on the white board and then printed them to a PDF. Unfortunately I&#039;m not sure exactly how to embed that in the wiki, so email me at trgibbons@gmail.com if you&#039;re reading this and I&#039;ll send it to you.&lt;br /&gt;
# Marker-wise assembly&lt;br /&gt;
#* Bin reads that align to any sequence in a given marker set, and/or the consensus sequence for that marker&lt;br /&gt;
# Cluster-wise assembly&lt;br /&gt;
#* Cluster protein sequences&lt;br /&gt;
#* Bin reads that align to any protein sequence in a given cluster&lt;br /&gt;
# Gene-wise assembly&lt;br /&gt;
#* Bin reads that align to a particular protein sequence&lt;br /&gt;
* Marker-wise and cluster-wise binning should be better for assembling novel sequences&lt;br /&gt;
* Gene-wise binning should produce higher quality assemblies of markers for known organisms or those that are closely related&lt;br /&gt;
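&lt;br /&gt;
Given a table of read-to-protein hits, the three schemes might be sketched as follows (hypothetical data structures: hits as (read, protein, identity) tuples plus maps from protein to cluster and to marker; a read may land in several bins, as described above):&lt;br /&gt;

```python
from collections import defaultdict

def bin_reads(hits, protein_to_cluster, protein_to_marker, threshold):
    """Place each read in every bin whose sequence it hits above the cutoff."""
    by_marker = defaultdict(set)
    by_cluster = defaultdict(set)
    by_gene = defaultdict(set)
    for read, protein, identity in hits:
        if identity >= threshold:
            by_gene[protein].add(read)                      # gene-wise
            by_cluster[protein_to_cluster[protein]].add(read)  # cluster-wise
            by_marker[protein_to_marker[protein]].add(read)    # marker-wise
    return by_marker, by_cluster, by_gene
```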
&lt;br /&gt;
== February 12, 2010 ==&lt;br /&gt;
&#039;&#039;&#039;SNOW!!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== February 19, 2010 ==&lt;br /&gt;
Met with James to discuss Metastats. I&#039;m going to attempt the following two updates by the end of the semester (I probably incorrectly described them, but I&#039;ll work it out later):&lt;br /&gt;
# Find a better way to compute the false discovery rate (FDR)&lt;br /&gt;
#* Currently computed by using the lowest 100 p-values from each sample (look at source code)&lt;br /&gt;
#* Need to find a more statistically rigorous way to compute it&lt;br /&gt;
#* The expected number of false positives scales with the number of tests: at p=0.05, 1,000 tests are expected to yield 50 incorrect rejections of the null hypothesis (so if the null is thrown out for 100 features, 50 of those rejections would be expected to be incorrect)&lt;br /&gt;
#* James just thinks this sucks and needs to be fixed&lt;br /&gt;
# Compute F-tests for features across all samples&lt;br /&gt;
#* Most requested feature&lt;br /&gt;
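&lt;br /&gt;
For the FDR item, one standard candidate for a more rigorous computation is the Benjamini-Hochberg step-up procedure; a textbook sketch (not the current Metastats code):&lt;br /&gt;

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values (q-values) in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running = 1.0
    # Walk from the largest p-value down, keeping a running minimum
    # of p * m / rank so the adjusted values are monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running = min(running, pvalues[i] * m / rank)
        adjusted[i] = running
    return adjusted
```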
&lt;br /&gt;
I spent too much time talking with people about science and not enough time doing it this week...&lt;br /&gt;
&lt;br /&gt;
== March 26, 2010 ==&lt;br /&gt;
I didn&#039;t realize it had been a whole month since I updated. Let&#039;s see, I nearly dropped Dr. Song&#039;s statistical genomics course, but then I didn&#039;t. I &amp;lt;i&amp;gt;did&amp;lt;/i&amp;gt; however learn that we don&#039;t have a class project. So the Metastats upgrades are going on a backburner for now because ZOMGZ I HAVE TO PRESENT MY PROPOSAL BY THE END OF THIS YEAR!!!&lt;br /&gt;
&lt;br /&gt;
My Thesis Project:&lt;br /&gt;
* I&#039;m generally interested in pathways shared between micro-organisms in a community, and also between micro-organisms and their multicellular hosts.&lt;br /&gt;
** I&#039;m particularly interested in studying the metabolic pathways shared between micro-organisms in the human gut, both with each other and their human hosts.&lt;br /&gt;
* James has created time-series models, and is interested in tackling spatial models with me.&lt;br /&gt;
* I would really like to correlate certain metabolic pathways with his modeled relationships.&lt;br /&gt;
&lt;br /&gt;
Volker&#039;s Project:&lt;br /&gt;
* has taken a big hit this week.&lt;br /&gt;
* I&#039;m going to go forward, with Bo&#039;s help, using the plan outlined in my project proposal for Mihai&#039;s biosequence analysis class:&lt;br /&gt;
** Use reciprocal best blast hits to map H37Rv genes to annotated genes in all available virulent and non-virulent strains of mycobacteria&lt;br /&gt;
** Use results from gene mapping to identify a core set of tuberculosis genes, as well as a set of predicted virulence genes&lt;br /&gt;
** Use a variety of comparison schemes to study the effect on the set of predicted virulence genes of the consideration of different subsets of non-virulent strains&lt;br /&gt;
** Use stable virulence prediction to rank genes as virulence targets&lt;br /&gt;
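&lt;br /&gt;
The reciprocal-best-hit step above reduces to a simple intersection once each direction&#039;s best BLAST hits have been tabulated; a sketch with hypothetical gene identifiers:&lt;br /&gt;

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    pairs = []
    for a, b in best_a_to_b.items():
        if best_b_to_a.get(b) == a:
            pairs.append((a, b))
    return pairs
```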
&lt;br /&gt;
Metastats:&lt;br /&gt;
* As I mentioned, this is put on hold&lt;br /&gt;
* I intend to pick this back up after I&#039;m done with Volker&#039;s project, as it could be instrumental to my thesis work&lt;br /&gt;
&lt;br /&gt;
== April 2, 2010 ==&lt;br /&gt;
More on my Thesis Project:&lt;br /&gt;
* I read the most recent (Science, Nov. 2009) paper by Gordon and Knight on their ongoing gut microbiota experiments&lt;br /&gt;
* Pretty much every section addressed the potential thesis topics I&#039;d imagined while reading the preceding section. Frustrating, but reaffirming (trying to learn from Mihai on not getting bummed about being scooped).&lt;br /&gt;
* Something that seems interesting and useful to me is to do more rigorous statistical analysis to attempt to correlate particular genes and pathways with the time-series and spatial data. I will have to work closely with Bo, at least at first.&lt;br /&gt;
* As a starting point, James has recommended building spacial models similar to his time series models&lt;br /&gt;
* James is essentially mentoring me on my project at this point. It&#039;s pretty excellent.&lt;br /&gt;
&lt;br /&gt;
== July 23, 2010 ==&lt;br /&gt;
It&#039;s been several months, I don&#039;t feel any closer to finding a thesis project, and it&#039;s really starting to stress me out. I&#039;ve finally stopped making excuses for not reading and have been steadily reading about 10 abstracts and 2 papers per week for the last month or two, but it doesn&#039;t appear to be nearly enough. I met with Mihai today to talk about it and then foolishly went for a run in the heat of the afternoon, where I decided on a new direction in a state of euphoric delirium.&lt;br /&gt;
# Read the book Mihai loaned me within the next week (or so): &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; by Michael Wilson&lt;br /&gt;
#* Mihai says the book is a summary of what was known about the human microbiome 5 years ago. The table of contents for the first chapter is essentially identical to the list of wikipedia pages I&#039;ve read in the past week, so I&#039;m pretty excited to now have a more thorough, authoritative source.&lt;br /&gt;
# Go back to looking for papers describing quorum sensing, especially in organisms known to be present in the human microbiome, either stably or transiently.&lt;br /&gt;
#* Try not to get too side-tracked reading about biofilms.&lt;br /&gt;
#* Search for an existing database of quorum sensing genes to use as references to potentially identify novel quorum sensing genes in microbiome WGS data. Consider making one if it&#039;s not already available.&lt;br /&gt;
# Look for a core metabolome (at this point I think a core microbiome seems unlikely) using metapath (or something similar) in the new HMP data for the gut, oral, and vaginal samples from the 100 reference individuals, as well as other sources like MetaHIT.&lt;br /&gt;
#* Start with a quick-and-dirty approach pulling all KOs associated with organisms identified using 16S rDNA sequencing, and then possibly attempt more accurate gene assemblies and annotation from WGS sequencing projects.&lt;br /&gt;
# Try to stay focused on a research topic for more than 2 weeks so I don&#039;t keep wasting time and effort.&lt;br /&gt;
# Don&#039;t make a habit of using the wiki like a personal journal...&lt;br /&gt;
&lt;br /&gt;
=== Possible Research Projects Inspired by &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; ===&lt;br /&gt;
# Searching for quorum sensing genes, both known and novel, and any pathways including them in human microbiome data&lt;br /&gt;
#* Search for and consider making quorum sensing gene DB&lt;br /&gt;
#* After indexing known quorum sensing genes, search for homologues&lt;br /&gt;
#** WGS data - Obviously search for homologues directly&lt;br /&gt;
#** 16S data - Identify organisms and search for homologues in public DBs&lt;br /&gt;
# Search for &amp;quot;core metabolome&amp;quot; in pioneer organisms from infant studies&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Ted-Report&amp;diff=7268</id>
		<title>Cbcb:Pop-Lab:Ted-Report</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Ted-Report&amp;diff=7268"/>
		<updated>2010-07-22T04:16:13Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* July 23, 2010 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Older Entries ==&lt;br /&gt;
[[Cbcb:Pop-Lab:Ted-Report-2009 | 2009]]&lt;br /&gt;
&lt;br /&gt;
== January 15, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Minimus Documentation ===&lt;br /&gt;
&lt;br /&gt;
Presently, the only relevant Google hit for &amp;quot;minimus&amp;quot; on the first page of results is the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus#Basic_usage_example sourceforge wiki.] The only example on this page is incomplete and appears to be an early draft made during development.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Ideally, it should be easy to find a complete guide with the general format:&lt;br /&gt;
* Simple use case:&lt;br /&gt;
 `toAmos -s path/to/fastaFile.seq -o path/to/fastaFile.afg`&lt;br /&gt;
 `minimus path/to/fastaFile(prefix)`&lt;br /&gt;
* Necessary tools for set up (toAmos)&lt;br /&gt;
* Other options&lt;br /&gt;
* etc&lt;br /&gt;
&lt;br /&gt;
The description found on the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus/README Minimus/README] page (linked to from the middle of the starting page) is more appropriate, but features use cases that may no longer be common and references another required tool (toAmos) without linking to it or describing how to access it. A description of this tool can be found on the Amos [http://sourceforge.net/apps/mediawiki/amos/index.php?title=File_conversion_utilities File Conversion Utilities] page (again, linked to from the starting page), but it is less organized than what I&#039;ve come to expect from a project page and it is easy to get lost or distracted by the rest of the Amos documentation while trying to piece together the necessary steps for a basic assembly.&lt;br /&gt;
&lt;br /&gt;
=== Comparative Network Analysis pt. 2 ===&lt;br /&gt;
* Meeting with Volker this Friday to discuss how best to apply network alignment to what he&#039;s doing&lt;br /&gt;
* I&#039;m simultaneously trying to find a way to apply my network alignment technique to predicting genes in metagenomic samples&lt;br /&gt;
** I&#039;ve been trying to find a way to get beyond the restriction that my current program requires genes to be annotated with an EC number. A potentially interesting next step may be to use BioPython to BLAST the sequence of each enzyme annotated in every micro-organism in KEGG against a metagenomic library.&lt;br /&gt;
*** The results would be stretches of linked reactions that have been annotated in KEGG pathways.&lt;br /&gt;
*** This method could be applied to contigs just as easily as finished sequences. In a scenario with low coverage, it could be used to identify genes which are probably there but just weren&#039;t sampled, by showing the presence of the rest of the pathway. In short, this could finally accomplish what Mihai asked me to work on when I showed up.&lt;br /&gt;
*** The major theoretical shortcoming of this approach is that it could only identify relatively well characterized pathways.&lt;br /&gt;
*** The practical challenges of this approach start with obtaining a fairly complete copy of KEGG (which, as we&#039;ve learned, is a mess to parse locally and unusably slow to query through the API), and continue with the computational cost of such a large-scale BLAST operation.&lt;br /&gt;
** Ask Bo about this when he gets back. He may have already done this.&lt;br /&gt;
&lt;br /&gt;
== January 22, 2010 ==&lt;br /&gt;
* Met with Dan and Sergey to talk about the Minimus-Bambus pipeline&lt;br /&gt;
** Minimus is running fine. I&#039;ve begun characterizing its run-time behavior (see next week&#039;s entry)&lt;br /&gt;
** After some tweaking by Sergey, Bambus was able to finish but did not generate a scaffold. We&#039;re going to talk about this after the meeting on Monday.&lt;br /&gt;
** Sergey had an interesting idea for making a better read simulator:&lt;br /&gt;
*** Error-free reads are cheap and easy to generate. The problem is with the error model.&lt;br /&gt;
*** The &amp;quot;best&amp;quot; tool (that we are aware of) which includes error models is MetaSim, but the error models are years out of date and the authors have been historically unreachable. While Mihai has now shown me how to edit the models in a reasonable way from flat files, allowing me to characterize base substitutions, I&#039;m not convinced it would be faster or easier to write a program that would modify these files than it would be to just write an entirely new program; and given the amount of time I&#039;ve spent trying to use MetaSim, I&#039;m more than ready to walk away from it. Oh yeah, and MetaSim doesn&#039;t work from the command line, so no scripting.&lt;br /&gt;
*** Sergey has pointed out that most companies will assemble &#039;&#039;E. coli&#039;&#039; when they release a new sequencer. Conveniently, there are many high quality assemblies of &#039;&#039;E. coli&#039;&#039; available for reference. It might therefore be possible to generate new error models for these sequencers in an automated fashion by mapping the &#039;&#039;E. coli&#039;&#039; reads to the available reference genomes, collecting the error frequencies, and then using them to mask synthesized reads.&lt;br /&gt;
*** I also talked with Mohammad and Mihai about this; both seemed to think it was a pretty good idea. Mihai has proposed having Sergey or Mohammad add the described error-model generator to his read sampler (written in C) when they have time, but not in preparation for the oral microbiome data.&lt;br /&gt;
&lt;br /&gt;
* Met with James to discuss my work with Volker&lt;br /&gt;
** Told him about my meeting with Volker and the paper he wanted me to prepare, more or less by myself. The main points of the paper are these:&lt;br /&gt;
*** Most available genomic sequences of mycobacteria are of a very small subset of highly pathogenic organisms.&lt;br /&gt;
*** Subtractive comparative genomics can be used to identify genes that are potentially responsible for differing phenotypes (such as extreme pathogenicity), but it requires available genomic sequences for closely related organisms with differing phenotypes.&lt;br /&gt;
*** Volker has sequenced 2 more non-pathogenic strains of mycobacteria (&#039;&#039;gastri&#039;&#039;, and &#039;&#039;kansasiiW58&#039;&#039;) with the intention of increasing the effectiveness of these subtractive comparative genomic studies.&lt;br /&gt;
*** The meat of the paper would be comparing the results of subtractive comparative genomic analysis using all currently available strains in RefSeq, with the results from also using the two novel sequences.&lt;br /&gt;
*** The other, smaller publishable portion of this project would be a comparison of &#039;&#039;gastri&#039;&#039; and &#039;&#039;kansasiiW58&#039;&#039; to each other because they are allegedly thought to be extremely closely related, and yet they have distinct phenotypes (which I&#039;ve now forgotten).&lt;br /&gt;
*** James seemed to think this could make an okay paper, and he confirmed that he had not realized Volker was looking for someone to do all of the analysis, both computational and biological, with Volker only contributing analysis of the analysis after it was all over.&lt;br /&gt;
** Ended up also discussing his work on differential abundance in populations of microorganisms.&lt;br /&gt;
*** I&#039;m going to start working on taking over and expanding Metastats this semester.&lt;br /&gt;
*** I&#039;m also going to start talking to Bo when he gets back about exactly what he&#039;s doing, and how I might be able to include pathway prediction in my expansion of Metastats without stepping on his toes.&lt;br /&gt;
*** Mihai has given me his approval to focus on this.&lt;br /&gt;
&lt;br /&gt;
* Met with Mihai to discuss working with Volker&lt;br /&gt;
** Explained that rather than looking for someone to do only the complex portions of the computational analysis, Volker was/is looking for someone to do the complete analysis.&lt;br /&gt;
** In exchange, Volker is offering first authorship and, if need be, to split the student&#039;s funding with their primary PI.&lt;br /&gt;
** I think I&#039;m capable of doing this within 3 or 4 months but it would consume my time pretty thoroughly.&lt;br /&gt;
** Mihai agreed that this is a reasonable deal, but pointed out that I have no personal interest in studying mycobacteria, and that it&#039;s therefore unwise for me to invest a lot of time becoming an expert on an organism I have no interest in continuing to study or work with. I&#039;ve therefore offered to work closely with one of Volker&#039;s graduate students, who could meet with me every week or two. I would be willing to do all of the computational analysis and explain it to them, but they would have to actually look up potentially interesting genes and relationships I discover and help me keep the analysis biologically interesting and relevant.&lt;br /&gt;
&lt;br /&gt;
* Met with Mihai and Mohammad to discuss our impending huge-ass(embly) problem&lt;br /&gt;
** Talked about strategies for iterative assembly as an approach to assembling intractably large data sets. Most have glaring shortcomings and complications.&lt;br /&gt;
** Discovered Mike Schatz has a map-reduce implementation of an assembler that uses de Bruijn graphs and is better suited to assemblies with high coverage but short read lengths.&lt;br /&gt;
&lt;br /&gt;
== January 29, 2010 ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Minimus Performance Analysis ===&lt;br /&gt;
I&#039;m testing minimus and bambus in preparation for the oral microbiome data, and after spamming several lab members with email, it occurred to me that it would be considerably more considerate to put the information here instead.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Memory Usage Analysis&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|RAM used by the Overlapper (in GB):&lt;br /&gt;
| 1.2 || 2.4 || 4.5 || 8.7 || 17 || 21.5 || ~1.1 GB * (#Reads in Millions) = (Memory Used)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|RAM used by the Tigger (in GB):&lt;br /&gt;
| 3 || 6 || 12 || 25 || 48.4 || (60) || ~3 GB * (#Reads in Millions) = (Memory Used)&lt;br /&gt;
|}&lt;br /&gt;
* The 16 million read assembly data is from Walnut; all other numbers are rough averages from both Privet and Walnut.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Run Time Analysis on Privet&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Overlapper (in min):&lt;br /&gt;
| 3 || 9 || 34 || 130 || (576) || 783 || 2.96 * (#Reads in Millions)&amp;lt;sup&amp;gt;1.87&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Tigger (in min):&lt;br /&gt;
| 9 || 66 || 473 || (3,456) || (25,088) || (47,493) || 9.03 * (#Reads in Millions)&amp;lt;sup&amp;gt;2.86&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|}&lt;br /&gt;
* Privet has 2.4GHz Opteron 850 processors and 32GB of RAM. Minimus is not parallelized and therefore only uses a single core.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
* The models were generated by plotting the data points in OpenOffice and fitting a polynomial trendline. The R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; value for each was 1.&lt;br /&gt;
* &#039;&#039;&#039;For reference: There are 1,440 minutes in one day, and 10,080 minutes in one week&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Run Time Analysis on Walnut&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Overlapper (in min):&lt;br /&gt;
| 2.7 || 8 || 27.5 || 102 || (325) || (481.5) || 2.54 * (#Reads in Millions)&amp;lt;sup&amp;gt;1.75&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Tigger (in min):&lt;br /&gt;
| 14 || 81 || 471.5 || (2,752) || (16,006) || (28,212) || 13.99 * (#Reads in Millions)&amp;lt;sup&amp;gt;2.54&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|}&lt;br /&gt;
* Walnut has 2.8GHz Opteron 875 processors and 64GB of RAM. Minimus is not parallelized and therefore only uses a single core.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
* The models were generated by plotting the data points in OpenOffice and fitting a polynomial trendline. The R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; value for each was 1.&lt;br /&gt;
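The fitted models in the tables above are easy to sanity-check. This is just a sketch that plugs the table coefficients into the power-law form; because the listed coefficients are rounded, the results only approximate the parenthesized predictions.

```python
# Sketch: evaluate the fitted power-law run-time models from the tables above.
# Coefficients and exponents come straight from the table rows; since they are
# rounded, predictions land near (not exactly on) the parenthesized values.

def predict_minutes(reads_in_millions, coeff, exponent):
    """Predicted run time in minutes for a given number of reads (in millions)."""
    return coeff * reads_in_millions ** exponent

# Privet models
overlapper_privet = predict_minutes(16, 2.96, 1.87)   # roughly 530 min
tigger_privet = predict_minutes(8, 9.03, 2.86)        # roughly 3,450 min

# Walnut models
overlapper_walnut = predict_minutes(20, 2.54, 1.75)
tigger_walnut = predict_minutes(20, 13.99, 2.54)

print(round(overlapper_privet), round(tigger_privet))
print(round(overlapper_walnut), round(tigger_walnut))
```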
&lt;br /&gt;
&lt;br /&gt;
==== Other Observations About the Assemblies ====&lt;br /&gt;
* Because of the short read length, every million reads is only 75MB of sequence. This is roughly 10-20x coverage of an average single bacterium. Because these test sets have reads sampled from roughly 100 bacterial genomic sequences, I would expect the coverage to be on the order of 0.1-0.2x on average.&lt;br /&gt;
* Unsurprisingly, a cursory glance through the contig files shows that each contig comprises only about 2 or 3 reads.&lt;br /&gt;
* The N50 analysis for the smaller assemblies shows that only 2-3 reads are being added to each contig on average, leaving both N50s and average lengths just below 150bp.&lt;br /&gt;
* Therefore, if the complexity of the oral microbiome data is high and/or the contamination of human DNA is extreme (80-95%), the coverage may be extremely low. This may make the use of Mike&#039;s assembler impractical, or at least that&#039;s how I&#039;m going to keep justifying this testing to myself until someone corrects me.&lt;br /&gt;
** &#039;&#039;&#039;Update:&#039;&#039;&#039; Apparently Mike and Dan have talked about this, and somewhere around 75-80bp, the performance of minimus catches up with Mike&#039;s de Bruijn graph assembler anyway. I also did not know that Dan&#039;s map-reduce minimus was running and would be used to assemble the data alongside Mike&#039;s.&lt;br /&gt;
* I learned on Feb. 1, 2010 that the 454 error model allows wild variation with respect to read length, so these assemblies might not actually be representative of the performance with the Illumina data we&#039;re expecting on Feb. 20.&lt;br /&gt;
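The coverage arithmetic in the first bullet, and the N50 numbers, can be checked with a few lines. The ~5 Mbp average bacterial genome size is my assumption here, not a number from the tables.

```python
# Back-of-envelope check of the coverage estimates above, plus a minimal N50
# calculation. The ~5 Mbp average genome size is an assumption for illustration.

def coverage(n_reads, read_len, genome_size, n_genomes=1):
    """Average per-genome coverage for reads spread evenly over n_genomes."""
    return n_reads * read_len / (genome_size * n_genomes)

def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half the bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

print(coverage(1_000_000, 75, 5_000_000))        # 15.0x for one average genome
print(coverage(1_000_000, 75, 5_000_000, 100))   # 0.15x spread over ~100 genomes
print(n50([150, 140, 150, 75, 75]))              # contigs of 2-3 reads -> N50 of 150
```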
&lt;br /&gt;
=== UMIACS Resources ===&lt;br /&gt;
I just discovered that the information listed on the CBCB intranet Resources page is inaccurate and badly out of date, so I&#039;m making my own table.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:500px; height:200px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Umiacs Resources&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Machine !! Processor !! Speed !! Cores !! RAM&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Walnut&lt;br /&gt;
| Dual Core AMD Opteron 8220 || 2.8GHz || 16 || 64GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Privet&lt;br /&gt;
| AMD Opteron 850 || 2.4GHz || 4 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Larch&lt;br /&gt;
| AMD Opteron 850 || 2.4GHz || 4 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Sycamore&lt;br /&gt;
| Dual Core AMD Opteron 875 || 1GHz || 8 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Shagbark&lt;br /&gt;
| Intel Core 2 Quad || 2.83GHz || 4 || 4GB&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== February 5, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Meeting with Volker and Sarada on Feb 3 ===&lt;br /&gt;
* Need to teach Sarada how to perform local BLAST searches on some sequences they have that aren&#039;t yet in GenBank&lt;br /&gt;
* Trying to set up a meeting with Volker to find out for sure whether he wants me to work on this project&lt;br /&gt;
&lt;br /&gt;
=== Biomarker Assembly ===&lt;br /&gt;
Bo, Mohammad, and I spent a couple hours discussing biomarker assembly today. I&#039;m going to try to efficiently summarize our conclusions, but it might be difficult without an easy way to make images. We eventually decided it would be best to attempt several methods in tandem, due to the severe time constraints. The general approach of each method is to fish out and bin reads through one method or another, and then assemble the reads in each bin using minimus. All sequence identity values will be determined by using BLASTx.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Preliminary Steps&#039;&#039;&#039;&lt;br /&gt;
* Gather biomarker consensus amino acid sequences&lt;br /&gt;
* Gather amino acid sequences for associated genes from each bacterial genome in refseq&lt;br /&gt;
* Cluster amino acid sequences within each biomarker set&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Sequence Identity Threshold Determination&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
There are 31 biomarkers and about 1,000 bacterial genomes in which they occur. This means that there are 31 sets of 1,000 sequences that are all relatively similar to one another. Because of the sequence similarity and the short read length, it&#039;s possible that a significant number of reads will map equally well to multiple sequences within each biomarker set. For this reason, it is better to allow a single read to be placed in any bin containing a sequence to which the read mapped above some minimum threshold. This will protect against artificially lowering the coverage of extremely well conserved regions, and with any luck, incorrectly binned reads will simply not be included in the assembly. There are several ways to approach the determination of this threshold.&lt;br /&gt;
* Determine the lowest level of sequence identity between the consensus sequence for each biomarker and any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.&lt;br /&gt;
** The obvious shortcoming of this approach is that the sequence identity between two homologous gene-length sequences can be much lower than between two homologous read-length sequences.&lt;br /&gt;
* Align 75mers to determine the lowest score between any two 75mers in the consensus sequence for each biomarker and the corresponding 75mer in any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.&lt;br /&gt;
** While this solves the problem with the above approach, it is significantly more complicated and the data is going to be here soon.&lt;br /&gt;
* Choose a sequence identity level, or try a few different levels and see which produces the most complete biomarker proteins without creating overly complex graphs.&lt;br /&gt;
** While there&#039;s no good theoretical justification for this approach, it&#039;s probably what we&#039;ll do and it will probably work well enough.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Schemes&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
After making absurdly complicated descriptions of the various approaches, which I felt weren&#039;t very clear, I used Keynote to recreate the diagrams we&#039;d drawn on the white board and then printed them to a PDF. Unfortunately I&#039;m not sure exactly how to embed that in the wiki, so email me at trgibbons@gmail.com if you&#039;re reading this and I&#039;ll send it to you.&lt;br /&gt;
# Marker-wise assembly&lt;br /&gt;
#* Bin reads that align to any sequence in a given marker set, and/or the consensus sequence for that marker&lt;br /&gt;
# Cluster-wise assembly&lt;br /&gt;
#* Cluster protein sequences&lt;br /&gt;
#* Bin reads that align to any protein sequence in a given cluster&lt;br /&gt;
# Gene-wise assembly&lt;br /&gt;
#* Bin reads that align to a particular protein sequence&lt;br /&gt;
* Marker-wise and cluster-wise binning should be better for assembling novel sequences&lt;br /&gt;
* Gene-wise binning should produce higher quality assemblies of markers for known organisms or those that are closely related&lt;br /&gt;
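As a sketch of the marker-wise binning step, here is roughly what reading a tabular BLASTx report and assigning reads to marker bins could look like. The file name, the 60% identity threshold, and the subject-to-marker mapping are all placeholders, and a read can land in multiple bins by design.

```python
# Sketch of marker-wise binning from tabular BLASTx output.
# Columns used: query id, subject id, percent identity (the first three
# columns of the standard tabular format). Threshold and mapping are
# placeholder values, not settled choices.

from collections import defaultdict

MIN_IDENTITY = 60.0  # percent identity threshold (placeholder)

def bin_reads(blast_tab_path, subject_to_marker):
    """Map each read to every marker set it hits above the identity threshold."""
    bins = defaultdict(set)
    with open(blast_tab_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            read_id, subject_id, pct_identity = fields[0], fields[1], float(fields[2])
            if pct_identity >= MIN_IDENTITY and subject_id in subject_to_marker:
                bins[subject_to_marker[subject_id]].add(read_id)
    return bins
```

Each bin would then be written out as a FASTA file and handed to toAmos/minimus separately.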
&lt;br /&gt;
== February 12, 2010 ==&lt;br /&gt;
&#039;&#039;&#039;SNOW!!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== February 19, 2010 ==&lt;br /&gt;
Met with James to discuss Metastats. I&#039;m going to attempt the following two updates by the end of the semester (I&#039;ve probably described them incorrectly, but I&#039;ll work it out later):&lt;br /&gt;
# Find a better way to compute the false discovery rate (FDR)&lt;br /&gt;
#* Currently computed by using the lowest 100 p-values from each sample (look at source code)&lt;br /&gt;
#* Need to find a more statistically rigorous way to compute it&lt;br /&gt;
#* The expected number of false positives is the p-value threshold times the number of tests (with 1,000 tests at p=0.05 =&amp;gt; 50 H_a&#039;s will be incorrectly predicted; so if the null hypothesis is thrown out for 100 of them, 50 would be expected to be incorrect)&lt;br /&gt;
#* James just thinks this sucks and needs to be fixed&lt;br /&gt;
# Compute F-tests for features across all samples&lt;br /&gt;
#* Most requested feature&lt;br /&gt;
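For reference, one standard, more rigorous way to control the false discovery rate is the Benjamini-Hochberg procedure. This sketch is the textbook method, not a description of what Metastats currently does or what the final fix will be.

```python
# Textbook Benjamini-Hochberg step-up procedure: find the largest rank k with
# p_(k) <= (k/m) * alpha, and reject the k smallest p-values.

def benjamini_hochberg(p_values, alpha=0.05):
    """Return the set of indices of hypotheses rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = -1
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff = rank
    return set(order[:cutoff]) if cutoff > 0 else set()
```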
&lt;br /&gt;
I spent too much time talking with people about science and not enough time doing it this week...&lt;br /&gt;
&lt;br /&gt;
== March 26, 2010 ==&lt;br /&gt;
I didn&#039;t realize it had been a whole month since I updated. Let&#039;s see, I nearly dropped Dr. Song&#039;s statistical genomics course, but then I didn&#039;t. I &amp;lt;i&amp;gt;did&amp;lt;/i&amp;gt; however learn that we don&#039;t have a class project. So the Metastats upgrades are going on a backburner for now because ZOMGZ I HAVE TO PRESENT MY PROPOSAL BY THE END OF THIS YEAR!!!&lt;br /&gt;
&lt;br /&gt;
My Thesis Project:&lt;br /&gt;
* I&#039;m generally interested in pathways shared between micro-organisms in a community, and also between micro-organisms and their multicellular hosts.&lt;br /&gt;
** I&#039;m particularly interested in studying the metabolic pathways shared between micro-organisms in the human gut, both with each other and their human hosts.&lt;br /&gt;
* James has created time-series models, and is interested in tackling spatial models with me.&lt;br /&gt;
* I would really like to correlate certain metabolic pathways with his modeled relationships.&lt;br /&gt;
&lt;br /&gt;
Volker&#039;s Project:&lt;br /&gt;
* has taken a big hit this week.&lt;br /&gt;
* I&#039;m going to go forward, with Bo&#039;s help, using the plan outlined in my project proposal for Mihai&#039;s biosequence analysis class:&lt;br /&gt;
** Use reciprocal best blast hits to map H37Rv genes to annotated genes in all available virulent and non-virulent strains of mycobacteria&lt;br /&gt;
** Use results from gene mapping to identify a core set of tuberculosis genes, as well as a set of predicted virulence genes&lt;br /&gt;
** Use a variety of comparison schemes to study how considering different subsets of non-virulent strains affects the set of predicted virulence genes&lt;br /&gt;
** Use stable virulence prediction to rank genes as virulence targets&lt;br /&gt;
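The reciprocal-best-hit step in the plan above could look something like this sketch, which takes two tabular BLAST reports (H37Rv vs. another strain, and the reverse) and keeps pairs that are each other's best hit by bit score. Column positions assume the standard 12-column tabular format; tie handling and e-value filtering are glossed over.

```python
# Sketch of reciprocal-best-hit (RBH) detection from two tabular BLAST runs.
# Best hit per query is taken by bit score (column 12 of the tabular format).

def best_hits(blast_tab_path):
    """Best subject for each query, chosen by highest bit score."""
    best = {}
    with open(blast_tab_path) as handle:
        for line in handle:
            f = line.rstrip("\n").split("\t")
            query, subject, bit_score = f[0], f[1], float(f[11])
            if query not in best or bit_score > best[query][1]:
                best[query] = (subject, bit_score)
    return {q: s for q, (s, _) in best.items()}

def reciprocal_best_hits(a_vs_b_path, b_vs_a_path):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    ab = best_hits(a_vs_b_path)
    ba = best_hits(b_vs_a_path)
    return [(a, b) for a, b in ab.items() if ba.get(b) == a]
```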
&lt;br /&gt;
Metastats:&lt;br /&gt;
* As I mentioned, this is put on hold&lt;br /&gt;
* I intend to pick this back up after I&#039;m done with Volker&#039;s project, as it could be instrumental to my thesis work&lt;br /&gt;
&lt;br /&gt;
== April 2, 2010 ==&lt;br /&gt;
More on my Thesis Project:&lt;br /&gt;
* I read the most recent (Science, Nov. 2009) paper by Gordon and Knight on their ongoing gut microbiota experiments&lt;br /&gt;
* Pretty much every section addressed the potential thesis topics I&#039;d imagined while reading the preceding section. Frustrating, but reaffirming (trying to learn from Mihai on not getting bummed about being scooped).&lt;br /&gt;
* Something that seems interesting and useful to me is to do more rigorous statistical analysis to attempt to correlate particular genes and pathways with the time-series and spatial data. I will have to work closely with Bo, at least at first.&lt;br /&gt;
* As a starting point, James has recommended building spatial models similar to his time-series models&lt;br /&gt;
* James is essentially mentoring me on my project at this point. It&#039;s pretty excellent.&lt;br /&gt;
&lt;br /&gt;
== July 23, 2010 ==&lt;br /&gt;
It&#039;s been several months, I don&#039;t feel any closer to finding a thesis project, and it&#039;s really starting to stress me out. I&#039;ve finally stopped making excuses for not reading and have been steadily reading about 10 abstracts and 2 papers per week for the last month or two, but it doesn&#039;t appear to be nearly enough. I met with Mihai today to talk about it and then foolishly went for a run in the heat of the afternoon, where I decided on a new direction in a state of euphoric delirium.&lt;br /&gt;
# Read the book Mihai loaned me within the next week (or so): &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; by Michael Wilson&lt;br /&gt;
#* Mihai says the book is a summary of what was known about the human microbiome 5 years ago. The table of contents for the first chapter is essentially identical to the list of Wikipedia pages I&#039;ve read in the past week, so I&#039;m pretty excited to now have a more thorough, authoritative source.&lt;br /&gt;
# Go back to looking for papers describing quorum sensing, especially in organisms known to be present in the human microbiome, either stably or transiently.&lt;br /&gt;
#* Try not to get too side-tracked reading about biofilms.&lt;br /&gt;
#* Search for an existing database of quorum sensing genes to use as references to potentially identify novel quorum sensing genes in microbiome WGS data. Consider making one if it&#039;s not already available.&lt;br /&gt;
# Look for a core metabolome (at this point I think a core microbiome seems unlikely) using metapath (or something similar) in the new HMP data for the gut, oral, and vaginal samples from the 100 reference individuals, as well as other sources like MetaHIT.&lt;br /&gt;
#* Start with a fast and dirty approach pulling all KO&#039;s associated with organisms identified using 16S rDNA sequencing, and then possibly attempt more accurate gene assemblies and annotation from WGS sequencing projects.&lt;br /&gt;
# Try to stay focused on a research topic for more than 2 weeks so I don&#039;t keep wasting time and effort.&lt;br /&gt;
# Don&#039;t make a habit of using the wiki like a personal journal...&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Ted-Report&amp;diff=7267</id>
		<title>Cbcb:Pop-Lab:Ted-Report</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Ted-Report&amp;diff=7267"/>
		<updated>2010-07-22T03:19:20Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* July 23, 2010 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Older Entries ==&lt;br /&gt;
[[Cbcb:Pop-Lab:Ted-Report-2009 | 2009]]&lt;br /&gt;
&lt;br /&gt;
== January 15, 2010 ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Comparative Network Analysis pt. 2 ===&lt;br /&gt;
* Meeting with Volker this Friday to discuss how best to apply network alignment to what he&#039;s doing&lt;br /&gt;
* I&#039;m simultaneously trying to find a way to apply my network alignment technique to predicting genes in metagenomic samples&lt;br /&gt;
** I&#039;ve been trying to find a way to get beyond the restriction that my current program requires genes to be annotated with an EC number. A potentially interesting next step may be to use BioPython to BLAST the sequence of each enzyme annotated in every micro-organism in KEGG against a metagenomic library.&lt;br /&gt;
*** The results would be stretches of linked reactions that have been annotated in KEGG pathways.&lt;br /&gt;
*** This method could be applied to contigs just as easily as finished sequences. In a scenario with low coverage, it could be used to identify genes which are probably there but just weren&#039;t sampled, by showing the presence of the rest of the pathway. In short, this could finally accomplish what Mihai asked me to work on when I showed up.&lt;br /&gt;
*** The major theoretical shortcoming of this approach is that it could only identify relatively well characterized pathways.&lt;br /&gt;
*** The practical shortcomings of this approach start with obtaining a fairly complete copy of KEGG (which, as we&#039;ve learned, is a mess to parse locally and unusably slow to call through the API), and continue to the computational challenge of such a large-scale BLAST operation.&lt;br /&gt;
** Ask Bo about this when he gets back. He may have already done this.&lt;br /&gt;
&lt;br /&gt;
== January 22, 2010 ==&lt;br /&gt;
* Met with Dan and Sergey to talk about the Minimus-Bambus pipeline&lt;br /&gt;
** Minimus is running fine. I&#039;ve begun characterizing its run-time behavior (see next week&#039;s entry)&lt;br /&gt;
** After some tweaking by Sergey, Bambus was able to finish but did not generate a scaffold. We&#039;re going to talk about this after the meeting on Monday.&lt;br /&gt;
** Sergey had an interesting idea for making a better read simulator:&lt;br /&gt;
*** Error-free reads are cheap and easy to generate. The problem is with the error model.&lt;br /&gt;
*** The &amp;quot;best&amp;quot; tool (that we are aware of) which includes error models is MetaSim, but its error models are years out of date and the authors have been historically unreachable. Mihai has now shown me how to edit the models in a reasonable way from flat files, allowing me to characterize base substitutions, but I&#039;m not convinced it would be faster or easier to write a program to modify those files than it would be to just write an entirely new program; and given the amount of time I&#039;ve spent trying to use MetaSim, I&#039;m more than ready to walk away from it. Oh yeah, and MetaSim doesn&#039;t work from the command line, so no scripting.&lt;br /&gt;
*** Sergey has pointed out that most companies will assemble &#039;&#039;E. coli&#039;&#039; when they release a new sequencer. Conveniently, there are many high quality assemblies of &#039;&#039;E. coli&#039;&#039; available for reference. It might therefore be possible to generate new error models for these sequencers in an automated fashion by mapping the &#039;&#039;E. coli&#039;&#039; reads to the available reference genomes, collecting the error frequencies, and then using them to mask synthesized reads.&lt;br /&gt;
*** I also talked with Mohammad and Mihai about this, who seemed to also think it was a pretty good idea. Mihai has proposed having Sergey or Mohammad add the described error model-generator to his read sampler (written in C) when they have time, but not in preparation of the oral microbiome data.&lt;br /&gt;
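A toy version of the masking idea, just to pin down its shape: tally substitution frequencies from reads aligned to a trusted reference, then resample the bases of error-free simulated reads from those distributions. Real alignments, indels, and quality values are all glossed over here; the function names are mine.

```python
# Toy sketch of the error-model-generator idea: learn per-base substitution
# frequencies from (reference base, read base) pairs observed in ungapped
# alignments, then use them to mask error-free simulated reads.

import random
from collections import Counter, defaultdict

def substitution_model(aligned_pairs):
    """Build {ref_base: {read_base: probability}} from observed aligned pairs."""
    counts = defaultdict(Counter)
    for ref_base, read_base in aligned_pairs:
        counts[ref_base][read_base] += 1
    model = {}
    for ref_base, c in counts.items():
        total = sum(c.values())
        model[ref_base] = {b: n / total for b, n in c.items()}
    return model

def mask_read(read, model, rng=random):
    """Resample each base of an error-free read from the observed distribution."""
    out = []
    for base in read:
        dist = model.get(base, {base: 1.0})  # unseen bases pass through unchanged
        bases, weights = zip(*dist.items())
        out.append(rng.choices(bases, weights=weights)[0])
    return "".join(out)
```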
&lt;br /&gt;
* Met with James to discuss my work with Volker&lt;br /&gt;
** Told him about my meeting with Volker and the paper he wanted me to prepare, more or less by myself. The main ideas of the paper are these:&lt;br /&gt;
*** Most available genomic sequences of mycobacteria are of a very small subset of highly pathogenic organisms.&lt;br /&gt;
*** Subtractive comparative genomics can be used to identify genes that are potentially responsible for differing phenotypes (such as extreme pathogenicity), but there must be available genomic sequences for closely related organisms with differing phenotypes.&lt;br /&gt;
*** Volker has sequenced 2 more non-pathogenic strains of mycobacteria (&#039;&#039;gastri&#039;&#039;, and &#039;&#039;kansasiiW58&#039;&#039;) with the intention of increasing the effectiveness of these subtractive comparative genomic studies.&lt;br /&gt;
*** The meat of the paper would be comparing the results of subtractive comparative genomic analysis using all currently available strains in RefSeq, with the results from also using the two novel sequences.&lt;br /&gt;
*** The other, smaller publishable portion of this project would be a comparison of &#039;&#039;gastri&#039;&#039; and &#039;&#039;kansasiiW58&#039;&#039; to each other because they are allegedly thought to be extremely closely related, and yet they have distinct phenotypes (which I&#039;ve now forgotten).&lt;br /&gt;
*** James seemed to think this could make an okay paper, and he confirmed that he did not understand that Volker was looking for someone to do all of the analysis, both computational and biological, with Volker only contributing analysis of the analysis after it was all over.&lt;br /&gt;
** Ended up also discussing his work on differential abundance in populations of microorganisms.&lt;br /&gt;
*** I&#039;m going to start working on taking over and expanding Metastats this semester.&lt;br /&gt;
*** I&#039;m also going to start talking to Bo when he gets back about exactly what he&#039;s doing, and how I might be able to include pathway prediction in my expansion of Metastats without stepping on his toes.&lt;br /&gt;
*** Mihai has given me his approval to focus on this.&lt;br /&gt;
&lt;br /&gt;
* Met with Mihai to discuss working with Volker&lt;br /&gt;
** Explained that rather than looking for someone to do only the complex portions of the computational analysis, Volker was/is looking for someone to do the complete analysis.&lt;br /&gt;
** In exchange, Volker is offering first authorship and, if need be, to split the student&#039;s funding with their primary PI.&lt;br /&gt;
** I think I&#039;m capable of doing this within 3 or 4 months but it would consume my time pretty thoroughly.&lt;br /&gt;
** Mihai agreed that this is a reasonable deal, but that I have no personal interest in studying mycobacteria, and it&#039;s therefore unwise of me to invest a bunch of time becoming an expert on an organism I have no interest in continuing to study or work with. I&#039;ve therefore offered Volker to work closely with one of his graduate students who could meet with me every week or two. I would be willing to do all of the computational analysis and explain it to them, but they would have to actually look up potentially interesting genes and relationships I discover and help me keep the analysis biologically interesting and relevant.&lt;br /&gt;
&lt;br /&gt;
* Met with Mihai and Mohammad to discuss our impending huge-ass(embly) problem&lt;br /&gt;
** Talked about strategies for iterative assembly as an approach to assembling intractably large data sets. Most have glaring short-comings and complications.&lt;br /&gt;
** Discovered Mike Schatz has a map-reduce implementation of an assembler that uses De Bruijn graphs and is better suited to assemblies with high coverage but short read lengths.&lt;br /&gt;
&lt;br /&gt;
== January 29, 2010 ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Minimus Performance Analysis ===&lt;br /&gt;
I&#039;m testing minimus and bambus in preparation of the oral microbiome data, and after spamming several lab members with email, it occurred to me that it would be considerably more considerate to put the information here instead.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Memory Usage Analysis&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|RAM used by the Overlapper (in GB):&lt;br /&gt;
| 1.2 || 2.4 || 4.5 || 8.7 || 17 || 21.5 || ~1.1 GB * (#Reads in Millions) = (Memory Used)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|RAM used by the Tigger (in GB):&lt;br /&gt;
| 3 || 6 || 12 || 25 || 48.4 || (60) || ~3 GB * (#Reads in Millions) = (Memory Used)&lt;br /&gt;
|}&lt;br /&gt;
* The 16 million read assembly data is from Walnut, all other numbers are rough averages from both Privet and Walnut.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Run Time Analysis on Privet&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Overlapper (in min):&lt;br /&gt;
| 3 || 9 || 34 || 130 || (576) || 783 || 2.96 * (#Reads in Millions)&amp;lt;sup&amp;gt;1.87&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Tigger (in min):&lt;br /&gt;
| 9 || 66 || 473 || (3,456) || (25,088) || (47,493) || 9.03 * (#Reads in Millions)&amp;lt;sup&amp;gt;2.86&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|}&lt;br /&gt;
* Privet has 2.4GHz Opteron 850 processors and 32GB of RAM. Minimus is not parallelized and therefore only uses a single core.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
* The models were generated by plotting the data points in open office and fitting a polynomial trendline. The R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; value for each was 1.&lt;br /&gt;
* &#039;&#039;&#039;For reference: There are 1,440 minutes in one day, and 10,080 minutes in one week&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Run Time Analysis on Walnut&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Overlapper (in min):&lt;br /&gt;
| 2.7 || 8 || 27.5 || 102 || (325) || (481.5) || 2.54 * (#Reads in Millions)&amp;lt;sup&amp;gt;1.75&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Tigger (in min):&lt;br /&gt;
| 14 || 81 || 471.5 || (2,752) || (16,006) || (28,212) || 13.99 * (#Reads in Millions)&amp;lt;sup&amp;gt;2.54&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|}&lt;br /&gt;
* Walnut has 2.8GHz Opteron 875 processors and 64GB of RAM. Minimus is not parallelized and therefore only uses a single core.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
* The models were generated by plotting the data points in OpenOffice and fitting a power trendline. The R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; value for each was 1.&lt;br /&gt;
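The same power-law models can be reproduced without a spreadsheet by linear least squares on log-transformed points. A sketch using the measured (non-parenthesized) Privet overlapper values:&lt;br /&gt;

```python
import math

def fit_power_law(points):
    """Fit T = a * N^b by linear least squares on (log N, log T)."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(t) for _, t in points]
    k = len(points)
    mx, my = sum(xs) / k, sum(ys) / k
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    a = math.exp(my - b * mx)
    return a, b

# Measured Privet overlapper points: (millions of reads, minutes).
a, b = fit_power_law([(1, 3), (2, 9), (4, 34), (8, 130), (20, 783)])
# The exponent b comes out within a few percent of the 1.87 quoted above.
```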
&lt;br /&gt;
&lt;br /&gt;
==== Other Observations About the Assemblies ====&lt;br /&gt;
* Because of the short read length, every million reads is only 75MB of sequence, or roughly 10-20x coverage of an average single bacterium. Since these test sets have reads sampled from roughly 100 bacterial genomic sequences, I would expect the coverage to be on the order of 0.1% on average.&lt;br /&gt;
* Unsurprisingly, a cursory glance through the contig files shows that each contig comprises only about 2 or 3 reads.&lt;br /&gt;
* The N50 analysis for the smaller assemblies shows that only 2-3 reads are being added to each contig on average, leaving both N50s and average lengths just below 150bp.&lt;br /&gt;
* Therefore if the complexity of the oral microbiome data is high and/or the contamination of human DNA is extreme (80-95%), the coverage may be extremely low. This may make the use of Mike&#039;s assembler impractical, or at least that&#039;s how I&#039;m going to keep justifying this testing to myself until someone corrects me.&lt;br /&gt;
** &#039;&#039;&#039;Update:&#039;&#039;&#039; Apparently Mike and Dan have talked about this, and somewhere around 75-80bp, the performance of minimus catches up with Mike&#039;s de Bruijn graph assembler anyway. I also did not know that Dan&#039;s map-reduce minimus was running and would be used to assemble the data alongside Mike&#039;s.&lt;br /&gt;
* I learned on Feb. 1, 2010 that the 454 error model allows wild variation with respect to read length, so these assemblies might not actually be representative of the performance with the Illumina data we&#039;re expecting on Feb. 20.&lt;br /&gt;
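For reference, the N50 figures mentioned above can be computed from a list of contig lengths in a few lines (a sketch; N50 here is the smallest length such that contigs at least that long contain half of the total assembled bases):&lt;br /&gt;

```python
def n50(contig_lengths):
    """Smallest length L such that contigs of length at least L
    contain at least half of the total assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Contigs built from 2-3 slightly overlapping 75bp reads give N50 and
# mean lengths just under 150bp, matching the observation above.
```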
&lt;br /&gt;
=== UMIACS Resources ===&lt;br /&gt;
I just discovered that the information listed on the CBCB intranet Resources page is inaccurate and badly out of date, so I&#039;m making my own table.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:500px; height:200px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Umiacs Resources&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Machine !! Processor !! Speed !! Cores !! RAM&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Walnut&lt;br /&gt;
| Dual Core AMD Opteron 8220 || 2.8GHz || 16 || 64GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Privet&lt;br /&gt;
| AMD Opteron 850 || 2.4GHz || 4 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Larch&lt;br /&gt;
| AMD Opteron 850 || 2.4GHz || 4 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Sycamore&lt;br /&gt;
| Dual Core AMD Opteron 875 || 1GHz || 8 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Shagbark&lt;br /&gt;
| Intel Core 2 Quad || 2.83GHz || 4 || 4GB&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== February 5, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Meeting with Volker and Sarada on Feb 3 ===&lt;br /&gt;
* Need to teach Sarada how to perform a local BLAST search on some sequences they have that aren&#039;t yet in GenBank&lt;br /&gt;
* Trying to set up a meeting with Volker to find out for sure if he wants me to work on this project&lt;br /&gt;
&lt;br /&gt;
=== Biomarker Assembly ===&lt;br /&gt;
Bo, Mohammad, and I spent a couple hours discussing biomarker assembly today. I&#039;m going to try to summarize our conclusions efficiently, but it might be difficult without an easy way to make images. We eventually decided it would be best to attempt several methods in tandem, due to the severe time constraints. The general approach of each method is to fish out and bin reads through one method or another, and then assemble the reads in each bin using minimus. All sequence identity values will be determined using BLASTx.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Preliminary Steps&#039;&#039;&#039;&lt;br /&gt;
* Gather biomarker consensus amino acid sequences&lt;br /&gt;
* Gather amino acid sequences for associated genes from each bacterial genome in RefSeq&lt;br /&gt;
* Cluster amino acid sequences within each biomarker set&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Sequence Identity Threshold Determination&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
There are 31 biomarkers and about 1,000 bacterial genomes in which they occur. This means that there are 31 sets of about 1,000 sequences that are all relatively similar to one another. Because of the sequence similarity and the short read length, it&#039;s possible that a significant number of reads will map equally well to multiple sequences within each biomarker set. For this reason, it is better to allow a single read to be placed in any bin containing a sequence to which the read mapped above some minimum threshold. This will protect against artificially lowering the coverage of extremely well-conserved regions, and with any luck, incorrectly binned reads will simply not be included in the assembly. There are several ways to approach the determination of this threshold.&lt;br /&gt;
* Determine the lowest level of sequence identity between the consensus sequence for each biomarker and any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.&lt;br /&gt;
** The obvious shortcoming of this approach is that the sequence identity between two homologous gene-length sequences can be much lower than between two homologous read-length sequences.&lt;br /&gt;
* Align 75mers to determine the lowest score between any two 75mers in the consensus sequence for each biomarker and the corresponding 75mer in any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.&lt;br /&gt;
** While this solves the problem with the above approach, it is significantly more complicated and the data is going to be here soon.&lt;br /&gt;
* Choose a sequence identity level, or try a few different levels and see which produces the most complete biomarker proteins without creating overly complex graphs.&lt;br /&gt;
** While there&#039;s no good theoretical justification for this approach, it&#039;s probably what we&#039;ll do and it will probably work well enough.&lt;br /&gt;
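Whichever threshold we settle on, the permissive binning rule described above (a read goes into every bin whose sequences it hits at or above the threshold) is simple to express in code. A sketch assuming BLASTx output has already been parsed into (read, subject, percent-identity) hits; the gene and bin names are made up for illustration:&lt;br /&gt;

```python
from collections import defaultdict

def bin_reads(hits, subject_to_bins, min_identity):
    """Place each read in every bin containing a subject it hit
    at or above min_identity. hits: (read_id, subject_id, pct_id)."""
    bins = defaultdict(set)
    for read_id, subject_id, pct_id in hits:
        if pct_id >= min_identity:
            for bin_name in subject_to_bins.get(subject_id, ()):
                bins[bin_name].add(read_id)
    return bins

# A read hitting well-conserved regions lands in several bins at once,
# which avoids artificially lowering coverage there.
hits = [("r1", "recA_ecoli", 92.0), ("r1", "recA_bsub", 88.5),
        ("r2", "rpoB_ecoli", 61.0)]
subject_to_bins = {"recA_ecoli": ["recA"], "recA_bsub": ["recA"],
                   "rpoB_ecoli": ["rpoB"]}
```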
&lt;br /&gt;
&#039;&#039;&#039;Schemes&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
After making absurdly complicated descriptions of the various approaches, which I felt weren&#039;t very clear, I used Keynote to recreate the diagrams we&#039;d drawn on the white board and then printed them to a PDF. Unfortunately, I&#039;m not sure exactly how to embed that in the wiki, so email me at trgibbons@gmail.com if you&#039;re reading this and I&#039;ll send it to you.&lt;br /&gt;
# Marker-wise assembly&lt;br /&gt;
#* Bin reads that align to any sequence in a given marker set, and/or the consensus sequence for that marker&lt;br /&gt;
# Cluster-wise assembly&lt;br /&gt;
#* Cluster protein sequences&lt;br /&gt;
#* Bin reads that align to any protein sequence in a given cluster&lt;br /&gt;
# Gene-wise assembly&lt;br /&gt;
#* Bin reads that align to a particular protein sequence&lt;br /&gt;
* Marker-wise and cluster-wise binning should be better for assembling novel sequences&lt;br /&gt;
* Gene-wise binning should produce higher quality assemblies of markers for known organisms or those that are closely related&lt;br /&gt;
&lt;br /&gt;
== February 12, 2010 ==&lt;br /&gt;
&#039;&#039;&#039;SNOW!!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== February 19, 2010 ==&lt;br /&gt;
Met with James to discuss Metastats. I&#039;m going to attempt the following two updates by the end of the semester (I probably incorrectly described them, but I&#039;ll work it out later):&lt;br /&gt;
# Find a better way to compute the false discovery rate (FDR)&lt;br /&gt;
#* Currently computed by using the lowest 100 p-values from each sample (look at source code)&lt;br /&gt;
#* Need to find a more algebraically rigorous way to compute it&lt;br /&gt;
#* With 1,000 tests, the expected number of false positives is the p-value times the number of tests (p=0.05 =&amp;gt; 50 null hypotheses will be incorrectly rejected; so if the null hypothesis is thrown out for 100 samples, 50 of those are expected to be incorrect)&lt;br /&gt;
#* James just thinks this sucks and needs to be fixed&lt;br /&gt;
# Compute F-tests for features across all samples&lt;br /&gt;
#* Most requested feature&lt;br /&gt;
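For the FDR item, one standard candidate for a more rigorous computation is the Benjamini-Hochberg procedure; whether that is the approach James has in mind remains to be worked out, but a sketch looks like:&lt;br /&gt;

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values (q-values), preserving input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    prev = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        prev = min(prev, p_values[i] * m / rank)
        q[i] = prev
    return q
```

Features whose adjusted value falls below the chosen FDR level are then reported as significant.&lt;br /&gt;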
&lt;br /&gt;
I spent too much time talking with people about science and not enough time doing it this week...&lt;br /&gt;
&lt;br /&gt;
== March 26, 2010 ==&lt;br /&gt;
I didn&#039;t realize it had been a whole month since I updated. Let&#039;s see, I nearly dropped Dr. Song&#039;s statistical genomics course, but then I didn&#039;t. I &amp;lt;i&amp;gt;did&amp;lt;/i&amp;gt; however learn that we don&#039;t have a class project. So the Metastats upgrades are going on a backburner for now because ZOMGZ I HAVE TO PRESENT MY PROPOSAL BY THE END OF THIS YEAR!!!&lt;br /&gt;
&lt;br /&gt;
My Thesis Project:&lt;br /&gt;
* I&#039;m generally interested in pathways shared between micro-organisms in a community, and also between micro-organisms and their multicellular hosts.&lt;br /&gt;
** I&#039;m particularly interested in studying the metabolic pathways shared between micro-organisms in the human gut, both with each other and their human hosts.&lt;br /&gt;
* James has created time-series models, and is interested in tackling spatial models with me.&lt;br /&gt;
* I would really like to correlate certain metabolic pathways with his modeled relationships.&lt;br /&gt;
&lt;br /&gt;
Volker&#039;s Project:&lt;br /&gt;
* has taken a big hit this week.&lt;br /&gt;
* I&#039;m going to go forward, with Bo&#039;s help, using the plan outlined in my project proposal for Mihai&#039;s biosequence analysis class:&lt;br /&gt;
** Use reciprocal best blast hits to map H37Rv genes to annotated genes in all available virulent and non-virulent strains of mycobacteria&lt;br /&gt;
** Use results from gene mapping to identify a core set of tuberculosis genes, as well as a set of predicted virulence genes&lt;br /&gt;
** Use a variety of comparison schemes to study how considering different subsets of non-virulent strains affects the set of predicted virulence genes&lt;br /&gt;
** Use the stability of the virulence predictions across those schemes to rank genes as virulence targets&lt;br /&gt;
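The reciprocal-best-hit step above reduces to a pair of dictionary lookups once each BLAST direction is parsed into per-query best hits. A sketch with made-up gene identifiers:&lt;br /&gt;

```python
def best_hits(hits):
    """hits: iterable of (query, subject, bit_score); returns the
    best-scoring subject for each query."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {q: s for q, (s, _) in best.items()}

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    ab, ba = best_hits(a_vs_b), best_hits(b_vs_a)
    return {(a, b) for a, b in ab.items() if ba.get(b) == a}

# Hypothetical H37Rv-vs-strain example (identifiers are invented):
pairs = reciprocal_best_hits(
    [("Rv0001", "g17", 500.0), ("Rv0002", "g03", 90.0)],
    [("g17", "Rv0001", 480.0), ("g03", "Rv0099", 95.0)])
```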
&lt;br /&gt;
Metastats:&lt;br /&gt;
* As I mentioned, this is put on hold&lt;br /&gt;
* I intend to pick this back up after I&#039;m done with Volker&#039;s project, as it could be instrumental to my thesis work&lt;br /&gt;
&lt;br /&gt;
== April 2, 2010 ==&lt;br /&gt;
More on my Thesis Project:&lt;br /&gt;
* I read the most recent (Science, Nov. 2009) paper by Gordon and Knight on their ongoing gut microbiota experiments&lt;br /&gt;
* Pretty much every section addressed the potential thesis topics I&#039;d imagined while reading the preceding section. Frustrating, but reaffirming (trying to learn from Mihai on not getting bummed about being scooped).&lt;br /&gt;
* Something that seems interesting and useful to me is to do more rigorous statistical analysis to attempt to correlate particular genes and pathways with the time-series and spatial data. I will have to work closely with Bo, at least at first.&lt;br /&gt;
* As a starting point, James has recommended building spatial models similar to his time-series models&lt;br /&gt;
* James is essentially mentoring me on my project at this point. It&#039;s pretty excellent.&lt;br /&gt;
&lt;br /&gt;
== July 23, 2010 ==&lt;br /&gt;
It&#039;s been several months, I don&#039;t feel any closer to finding a thesis project, and it&#039;s really starting to stress me out. I&#039;ve finally stopped making excuses for not reading and have been steadily reading about 10 abstracts and 2 papers per week for the last month or two, but it doesn&#039;t appear to be nearly enough. I met with Mihai today to talk about it and then foolishly went for a run in the heat of the afternoon, where I decided on a new direction in a state of euphoric delirium.&lt;br /&gt;
# Read the book Mihai loaned me within the next week (or so): &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; by Michael Wilson&lt;br /&gt;
#* Mihai says the book is a summary of what was known about the human microbiome 5 years ago. The table of contents for the first chapter is essentially identical to the list of Wikipedia pages I&#039;ve read in the past week, so I&#039;m pretty excited to now have a more thorough, authoritative source.&lt;br /&gt;
# Go back to looking for papers describing quorum sensing, especially in organisms known to be present in the human microbiome, either stably or transiently.&lt;br /&gt;
#* Try not to get too side-tracked reading about biofilms.&lt;br /&gt;
# Look for a core metabolome (at this point I think a core microbiome seems unlikely) using metapath (or something similar) in the new HMP data for the gut, oral, and vaginal samples from the 100 reference individuals, as well as other sources like MetaHIT.&lt;br /&gt;
#* Start with a fast and dirty approach pulling all KOs associated with organisms identified using 16S rDNA sequencing, and then possibly attempt more accurate gene assemblies and annotation from WGS sequencing projects.&lt;br /&gt;
# Try to stay focused on a research topic for more than 2 weeks so I don&#039;t keep wasting time and effort.&lt;br /&gt;
# Don&#039;t make a habit of using the wiki like a personal journal...&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Ted-Report&amp;diff=7266</id>
		<title>Cbcb:Pop-Lab:Ted-Report</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Ted-Report&amp;diff=7266"/>
		<updated>2010-07-22T03:08:12Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* July 23, 2010 */ Created entry&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Older Entries ==&lt;br /&gt;
[[Cbcb:Pop-Lab:Ted-Report-2009 | 2009]]&lt;br /&gt;
&lt;br /&gt;
== January 15, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Minimus Documentation ===&lt;br /&gt;
&lt;br /&gt;
Presently, the only relevant Google hit for &amp;quot;minimus&amp;quot; on the first page of results is the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus#Basic_usage_example sourceforge wiki.] The only example on this page is incomplete and appears to be an early draft made during development.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Ideally, it should be easy to find a complete guide with the general format:&lt;br /&gt;
* Simple use case:&lt;br /&gt;
 `toAmos -s path/to/fastaFile.seq -o path/to/fastaFile.afg`&lt;br /&gt;
 `minimus path/to/fastaFile(prefix)`&lt;br /&gt;
* Necessary tools for set up (toAmos)&lt;br /&gt;
* Other options&lt;br /&gt;
* etc&lt;br /&gt;
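For scripting many assemblies, the two-command use case above is easy to wrap. A sketch that only builds the argument lists, assuming toAmos and minimus are on the PATH:&lt;br /&gt;

```python
import os

def minimus_commands(fasta_path):
    """Build the toAmos and minimus argument lists for one reads file.

    toAmos converts the reads to an AMOS .afg message file, and
    minimus is then run on the shared path prefix (no extension).
    """
    prefix, _ = os.path.splitext(fasta_path)
    return [
        ["toAmos", "-s", fasta_path, "-o", prefix + ".afg"],
        ["minimus", prefix],
    ]

# e.g. subprocess.run(cmd, check=True) for cmd in minimus_commands("reads.seq")
```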
&lt;br /&gt;
The description found on the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus/README Minimus/README] page (linked to from the middle of the starting page) is more appropriate, but features use cases that may no longer be common and references another required tool (toAmos) without linking to it or describing how to access it. A description of this tool can be found on the Amos [http://sourceforge.net/apps/mediawiki/amos/index.php?title=File_conversion_utilities File Conversion Utilities] page (again, linked to from the starting page), but it is less organized than what I&#039;ve come to expect from a project page and it is easy to get lost or distracted by the rest of the Amos documentation while trying to piece together the necessary steps for a basic assembly.&lt;br /&gt;
&lt;br /&gt;
=== Comparative Network Analysis pt. 2 ===&lt;br /&gt;
* Meeting with Volker this Friday to discuss how best to apply network alignment to what he&#039;s doing&lt;br /&gt;
* I&#039;m simultaneously trying to find a way to apply my network alignment technique to predicting genes in metagenomic samples&lt;br /&gt;
** I&#039;ve been trying to find a way to get beyond the restriction that my current program requires genes to be annotated with an EC number. A potentially interesting next step may be to use BioPython to BLAST the sequence of each enzyme annotated in every micro-organism in KEGG against a metagenomic library.&lt;br /&gt;
*** The results would be stretches of linked reactions that have been annotated in KEGG pathways.&lt;br /&gt;
*** This method could be applied to contigs just as easily as finished sequences. In a scenario with low coverage, it could be used to identify genes that are probably there but just weren&#039;t sampled, by showing the presence of the rest of the pathway. In short, this could finally accomplish what Mihai asked me to work on when I showed up.&lt;br /&gt;
*** The major theoretical shortcoming of this approach is that it could only identify relatively well characterized pathways.&lt;br /&gt;
*** The practical shortcomings of this approach begin with obtaining a fairly complete copy of KEGG (which, as we&#039;ve learned, is a mess to parse locally and unusably slow to query through the API), and continue with the computational challenge of such a large-scale BLAST operation.&lt;br /&gt;
** Ask Bo about this when he gets back. He may have already done this.&lt;br /&gt;
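The pathway-completion idea above can be sketched independently of KEGG's actual formats: given which enzymes of a pathway were detected by BLAST, flag the undetected enzymes of mostly-complete pathways as genes that were probably present but not sampled. The pathway name and EC numbers here are illustrative, not real KEGG data:&lt;br /&gt;

```python
def candidate_missing_enzymes(pathways, detected, min_fraction=0.75):
    """For each pathway with at least min_fraction of its enzymes
    detected, return the enzymes that were not detected."""
    candidates = {}
    for name, enzymes in pathways.items():
        found = detected.intersection(enzymes)
        if len(found) >= min_fraction * len(enzymes):
            missing = set(enzymes) - found
            if missing:
                candidates[name] = missing
    return candidates

# Illustrative EC numbers, not real KEGG annotations.
pathways = {"glycolysis-ish": ["1.1.1.1", "2.7.1.2", "4.1.2.3", "5.3.1.4"]}
detected = {"1.1.1.1", "2.7.1.2", "4.1.2.3"}
```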
&lt;br /&gt;
== January 22, 2010 ==&lt;br /&gt;
* Met with Dan and Sergey to talk about the Minimus-Bambus pipeline&lt;br /&gt;
** Minimus is running fine. I&#039;ve begun characterizing its run-time behavior (see next week&#039;s entry)&lt;br /&gt;
** After some tweaking by Sergey, Bambus was able to finish but did not generate a scaffold. We&#039;re going to talk about this after the meeting on Monday.&lt;br /&gt;
** Sergey had an interesting idea for making a better read simulator:&lt;br /&gt;
*** Error-free reads are cheap and easy to generate. The problem is with the error model.&lt;br /&gt;
*** The &amp;quot;best&amp;quot; tool (that we are aware of) that includes error models is MetaSim, but its error models are years out of date and the author has been historically unreachable. Mihai has now shown me how to edit the models in a reasonable way from flat files that characterize base substitutions, but I&#039;m not convinced that writing a program to modify these files would be faster or easier than just writing an entirely new program; and given the amount of time I&#039;ve spent trying to use MetaSim, I&#039;m more than ready to walk away from it. Oh yeah, and MetaSim doesn&#039;t work from the command line, so no scripting.&lt;br /&gt;
*** Sergey has pointed out that most companies will assemble &#039;&#039;E. coli&#039;&#039; when they release a new sequencer. Conveniently, there are many high quality assemblies of &#039;&#039;E. coli&#039;&#039; available for reference. It might therefore be possible to generate new error models for these sequencers in an automated fashion by mapping the &#039;&#039;E. coli&#039;&#039; reads to the available reference genomes, collecting the error frequencies, and then using them to mask synthesized reads.&lt;br /&gt;
*** I also talked with Mohammad and Mihai about this, and they also seemed to think it was a pretty good idea. Mihai has proposed having Sergey or Mohammad add the described error-model generator to his read sampler (written in C) when they have time, but not in preparation for the oral microbiome data.&lt;br /&gt;
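Sergey&#039;s idea reduces to two steps: tally per-cycle error frequencies from reads mapped to a finished &#039;&#039;E. coli&#039;&#039; reference, then use those frequencies to corrupt error-free simulated reads. A sketch handling substitutions only, ignoring indels and quality values:&lt;br /&gt;

```python
import random

def substitution_rates(aligned_pairs):
    """Per-cycle substitution rate from (read, aligned_ref_fragment) pairs."""
    length = len(aligned_pairs[0][0])
    errors, total = [0] * length, [0] * length
    for read, ref in aligned_pairs:
        for i, (r, t) in enumerate(zip(read, ref)):
            total[i] += 1
            if r != t:
                errors[i] += 1
    return [e / n for e, n in zip(errors, total)]

def corrupt(read, rates, rng=random):
    """Apply substitutions to an error-free read at the learned rates."""
    out = []
    for i, base in enumerate(read):
        if rng.random() < rates[i]:
            base = rng.choice([b for b in "ACGT" if b != base])
        out.append(base)
    return "".join(out)
```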
&lt;br /&gt;
* Met with James to discuss my work with Volker&lt;br /&gt;
** Told him about my meeting with Volker and the paper he wanted me to prepare, more or less by myself. The concepts of the paper are these:&lt;br /&gt;
*** Most available genomic sequences of mycobacteria are of a very small subset of highly pathogenic organisms.&lt;br /&gt;
*** Subtractive comparative genomics can be used to identify genes that are potentially responsible for differing phenotypes (such as extreme pathogenicity), but there must be available genomic sequences for closely related organisms with differing phenotypes.&lt;br /&gt;
*** Volker has sequenced 2 more non-pathogenic strains of mycobacteria (&#039;&#039;gastri&#039;&#039;, and &#039;&#039;kansasiiW58&#039;&#039;) with the intention of increasing the effectiveness of these subtractive comparative genomic studies.&lt;br /&gt;
*** The meat of the paper would be comparing the results of subtractive comparative genomic analysis using all currently available strains in RefSeq, with the results from also using the two novel sequences.&lt;br /&gt;
*** The other, smaller publishable portion of this project would be a comparison of &#039;&#039;gastri&#039;&#039; and &#039;&#039;kansasiiW58&#039;&#039; to each other because they are allegedly thought to be extremely closely related, and yet they have distinct phenotypes (which I&#039;ve now forgotten).&lt;br /&gt;
*** James seemed to think this could make an okay paper, and he confirmed that he did not understand that Volker was looking for someone to do all of the analysis, both computational and biological, with Volker only contributing analysis of the analysis after it was all over.&lt;br /&gt;
** Ended up also discussing his work on differential abundance in populations of microorganisms.&lt;br /&gt;
*** I&#039;m going to start working on taking over and expanding Metastats this semester.&lt;br /&gt;
*** I&#039;m also going to start talking to Bo when he gets back about exactly what he&#039;s doing, and how I might be able to include pathway prediction in my expansion of Metastats without stepping on his toes.&lt;br /&gt;
*** Mihai has given me his approval to focus on this.&lt;br /&gt;
&lt;br /&gt;
* Met with Mihai to discuss working with Volker&lt;br /&gt;
** Explained that rather than looking for someone to do only the complex portions of the computational analysis, Volker was/is looking for someone to do the complete analysis.&lt;br /&gt;
** In exchange, Volker is offering first authorship and, if need be, to split the student&#039;s funding with their primary PI.&lt;br /&gt;
** I think I&#039;m capable of doing this within 3 or 4 months but it would consume my time pretty thoroughly.&lt;br /&gt;
** Mihai agreed that this is a reasonable deal, but noted that I have no personal interest in studying mycobacteria, and that it&#039;s therefore unwise of me to invest a bunch of time becoming an expert on an organism I have no interest in continuing to study or work with. I&#039;ve therefore offered to work closely with one of Volker&#039;s graduate students, who could meet with me every week or two. I would be willing to do all of the computational analysis and explain it to them, but they would have to actually look up the potentially interesting genes and relationships I discover and help me keep the analysis biologically interesting and relevant.&lt;br /&gt;
&lt;br /&gt;
* Met with Mihai and Mohammad to discuss our impending huge-ass(embly) problem&lt;br /&gt;
** Talked about strategies for iterative assembly as an approach to assembling intractably large data sets. Most have glaring shortcomings and complications.&lt;br /&gt;
** Discovered Mike Schatz has a map-reduce implementation of an assembler that uses de Bruijn graphs and is better suited to assemblies with high coverage but short read lengths.&lt;br /&gt;
&lt;br /&gt;
== January 29, 2010 ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Minimus Performance Analysis ===&lt;br /&gt;
I&#039;m testing minimus and bambus in preparation for the oral microbiome data, and after spamming several lab members with email, it occurred to me that it would be considerably more considerate to put the information here instead.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Memory Usage Analysis&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|RAM used by the Overlapper (in GB):&lt;br /&gt;
| 1.2 || 2.4 || 4.5 || 8.7 || 17 || 21.5 || ~1.1 GB * (#Reads in Millions) = (Memory Used)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|RAM used by the Tigger (in GB):&lt;br /&gt;
| 3 || 6 || 12 || 25 || 48.4 || (60) || ~3 GB * (#Reads in Millions) = (Memory Used)&lt;br /&gt;
|}&lt;br /&gt;
* The 16 million read assembly data is from Walnut, all other numbers are rough averages from both Privet and Walnut.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Run Time Analysis on Privet&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Overlapper (in min):&lt;br /&gt;
| 3 || 9 || 34 || 130 || (576) || 783 || 2.96 * (#Reads in Millions)&amp;lt;sup&amp;gt;1.87&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Tigger (in min):&lt;br /&gt;
| 9 || 66 || 473 || (3,456) || (25,088) || (47,493) || 9.03 * (#Reads in Millions)&amp;lt;sup&amp;gt;2.86&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|}&lt;br /&gt;
* Privet has 2.4GHz Opteron 850 processors and 32GB of RAM. Minimus is not parallelized and therefore only uses a single core.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
* The models were generated by plotting the data points in OpenOffice and fitting a power trendline. The R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; value for each was 1.&lt;br /&gt;
* &#039;&#039;&#039;For reference: There are 1,440 minutes in one day, and 10,080 minutes in one week&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Run Time Analysis on Walnut&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Overlapper (in min):&lt;br /&gt;
| 2.7 || 8 || 27.5 || 102 || (325) || (481.5) || 2.54 * (#Reads in Millions)&amp;lt;sup&amp;gt;1.75&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Tigger (in min):&lt;br /&gt;
| 14 || 81 || 471.5 || (2,752) || (16,006) || (28,212) || 13.99 * (#Reads in Millions)&amp;lt;sup&amp;gt;2.54&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|}&lt;br /&gt;
* Walnut has 2.8GHz Opteron 875 processors and 64GB of RAM. Minimus is not parallelized and therefore only uses a single core.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
* The models were generated by plotting the data points in OpenOffice and fitting a power trendline. The R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; value for each was 1.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Other Observations About the Assemblies ====&lt;br /&gt;
* Because of the short read length, every million reads is only 75MB of sequence, or roughly 10-20x coverage of an average single bacterium. Since these test sets have reads sampled from roughly 100 bacterial genomic sequences, I would expect the coverage to be on the order of 0.1% on average.&lt;br /&gt;
* Unsurprisingly, a cursory glance through the contig files shows that each contig comprises only about 2 or 3 reads.&lt;br /&gt;
* The N50 analysis for the smaller assemblies shows that only 2-3 reads are being added to each contig on average, leaving both N50s and average lengths just below 150bp.&lt;br /&gt;
* Therefore if the complexity of the oral microbiome data is high and/or the contamination of human DNA is extreme (80-95%), the coverage may be extremely low. This may make the use of Mike&#039;s assembler impractical, or at least that&#039;s how I&#039;m going to keep justifying this testing to myself until someone corrects me.&lt;br /&gt;
** &#039;&#039;&#039;Update:&#039;&#039;&#039; Apparently Mike and Dan have talked about this, and somewhere around 75-80bp, the performance of minimus catches up with Mike&#039;s de Bruijn graph assembler anyway. I also did not know that Dan&#039;s map-reduce minimus was running and would be used to assemble the data alongside Mike&#039;s.&lt;br /&gt;
* I learned on Feb. 1, 2010 that the 454 error model allows wild variation with respect to read length, so these assemblies might not actually be representative of the performance with the Illumina data we&#039;re expecting on Feb. 20.&lt;br /&gt;
&lt;br /&gt;
=== UMIACS Resources ===&lt;br /&gt;
I just discovered that the information listed on the CBCB intranet Resources page is inaccurate and badly out of date, so I&#039;m making my own table.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:500px; height:200px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Umiacs Resources&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Machine !! Processor !! Speed !! Cores !! RAM&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Walnut&lt;br /&gt;
| Dual Core AMD Opteron 8220 || 2.8GHz || 16 || 64GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Privet&lt;br /&gt;
| AMD Opteron 850 || 2.4GHz || 4 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Larch&lt;br /&gt;
| AMD Opteron 850 || 2.4GHz || 4 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Sycamore&lt;br /&gt;
| Dual Core AMD Opteron 875 || 1GHz || 8 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Shagbark&lt;br /&gt;
| Intel Core 2 Quad || 2.83GHz || 4 || 4GB&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== February 5, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Meeting with Volker and Sarada on Feb 3 ===&lt;br /&gt;
* Need to teach Sarada how to perform a local BLAST search on some sequences they have that aren&#039;t yet in GenBank&lt;br /&gt;
* Trying to set up a meeting with Volker to find out for sure if he wants me to work on this project&lt;br /&gt;
&lt;br /&gt;
=== Biomarker Assembly ===&lt;br /&gt;
Bo, Mohammad, and I spent a couple hours discussing biomarker assembly today. I&#039;m going to try to summarize our conclusions efficiently, but it might be difficult without an easy way to make images. We eventually decided it would be best to attempt several methods in tandem, due to the severe time constraints. The general approach of each method is to fish out and bin reads through one method or another, and then assemble the reads in each bin using minimus. All sequence identity values will be determined using BLASTx.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Preliminary Steps&#039;&#039;&#039;&lt;br /&gt;
* Gather biomarker consensus amino acid sequences&lt;br /&gt;
* Gather amino acid sequences for associated genes from each bacterial genome in refseq&lt;br /&gt;
* Cluster amino acid sequences within each biomarker set&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Sequence Identity Threshold Determination&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
There are 31 biomarkers and about 1,000 bacterial genomes in which they occur. This means that there are 31 sets of 1,000 sequences that are all relatively similar to one another. Because of the sequence similarity and the short read length, it&#039;s possible that a significant number of reads will map equally well to multiple sequences within each biomarker set. For this reason, it is better to allow a single read to be placed in any bin containing a sequence to which the read mapped above some minimum threshold. This will protect against artificially lowering the coverage of extremely well conserved regions, and with any luck, incorrectly binned reads will simply not be included in the assembly. There are several ways to approach the determination of this threshold.&lt;br /&gt;
* Determine the lowest level of sequence identity between the consensus sequence for each biomarker and any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.&lt;br /&gt;
** The obvious shortcoming of this approach is that the sequence identity between two homologous gene-length sequences can be much lower than between two homologous read-length sequences.&lt;br /&gt;
* Align 75mers to determine the lowest score between any two 75mers in the consensus sequence for each biomarker and the corresponding 75mer in any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.&lt;br /&gt;
** While this solves the problem with the above approach, it is significantly more complicated, and the data will arrive soon.&lt;br /&gt;
* Choose a sequence identity level, or try a few different levels and see which produces the most complete biomarker proteins without creating overly complex graphs.&lt;br /&gt;
** While there&#039;s no good theoretical justification for this approach, it&#039;s probably what we&#039;ll do and it will probably work well enough.&lt;br /&gt;
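As a sketch of the binning rule above (one read may enter several bins), the following awk one-liner filters hypothetical BLASTx tabular output at a minimum identity threshold. The column layout (query, subject, percent identity in the first three columns) and the threshold of 80 are assumptions for illustration only.

```shell
# Sketch: read hypothetical BLAST tabular lines (query, subject, pct-identity)
# on stdin and emit one "bin read" pair per hit at or above the threshold,
# so a read can be assigned to every bin it matches well enough.
awk -v min=80 '$3+0 >= min { print $2, $1 }'
```

Piping sorted output through uniq -c would give per-bin read counts before running minimus on each bin.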
&lt;br /&gt;
&#039;&#039;&#039;Schemes&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
After writing absurdly complicated descriptions of the various approaches, which I felt weren&#039;t very clear, I used Keynote to recreate the diagrams we&#039;d drawn on the whiteboard and then printed them to a PDF. Unfortunately, I&#039;m not sure exactly how to embed that in the wiki, so email me at trgibbons@gmail.com if you&#039;re reading this and I&#039;ll send it to you.&lt;br /&gt;
# Marker-wise assembly&lt;br /&gt;
#* Bin reads that align to any sequence in a given marker set, and/or the consensus sequence for that marker&lt;br /&gt;
# Cluster-wise assembly&lt;br /&gt;
#* Cluster protein sequences&lt;br /&gt;
#* Bin reads that align to any protein sequence in a given cluster&lt;br /&gt;
# Gene-wise assembly&lt;br /&gt;
#* Bin reads that align to a particular protein sequence&lt;br /&gt;
* Marker-wise and cluster-wise binning should be better for assembling novel sequences&lt;br /&gt;
* Gene-wise binning should produce higher quality assemblies of markers for known organisms or those that are closely related&lt;br /&gt;
&lt;br /&gt;
== February 12, 2010 ==&lt;br /&gt;
&#039;&#039;&#039;SNOW!!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== February 19, 2010 ==&lt;br /&gt;
Met with James to discuss Metastats. I&#039;m going to attempt the following two updates by the end of the semester (I probably incorrectly described them, but I&#039;ll work it out later):&lt;br /&gt;
# Find a better way to compute the false discovery rate (FDR)&lt;br /&gt;
#* Currently computed by using the lowest 100 p-values from each sample (look at source code)&lt;br /&gt;
#* Need to find a more algebraically rigorous way to compute it&lt;br /&gt;
#* With 1,000 tests, the expected number of false positives is the p-value times 1,000 (p=0.05 =&amp;gt; about 50 H_a&#039;s will be incorrectly predicted; so if the null hypothesis is rejected for 100 samples, about 50 of those rejections can be expected to be incorrect)&lt;br /&gt;
#* James just thinks this sucks and needs to be fixed&lt;br /&gt;
# Compute F-tests for features across all samples&lt;br /&gt;
#* Most requested feature&lt;br /&gt;
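One standard, more rigorous way to control the FDR (an assumption on my part, not necessarily the fix James has in mind) is the Benjamini-Hochberg step-up procedure, sketched here over one p-value per line on stdin:

```shell
# Benjamini-Hochberg sketch: sort the p-values ascending, then find the
# largest rank i whose p-value is at or below q*i/m (m tests, target FDR q);
# print the number of rejections.
sort -g | awk -v q=0.05 '
  { p[++m] = $1 }
  END {
    for (i = m; i >= 1; i--)
      if (q * i / m >= p[i]) { print i; exit }
    print 0
  }'
```

Rejecting the hypotheses with the smallest i p-values then controls the expected fraction of false discoveries at q.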
&lt;br /&gt;
I spent too much time talking with people about science and not enough time doing it this week...&lt;br /&gt;
&lt;br /&gt;
== March 26, 2010 ==&lt;br /&gt;
I didn&#039;t realize it had been a whole month since I updated. Let&#039;s see, I nearly dropped Dr. Song&#039;s statistical genomics course, but then I didn&#039;t. I &amp;lt;i&amp;gt;did&amp;lt;/i&amp;gt; however learn that we don&#039;t have a class project. So the Metastats upgrades are going on a backburner for now because ZOMGZ I HAVE TO PRESENT MY PROPOSAL BY THE END OF THIS YEAR!!!&lt;br /&gt;
&lt;br /&gt;
My Thesis Project:&lt;br /&gt;
* I&#039;m generally interested in pathways shared between micro-organisms in a community, and also between micro-organisms and their multicellular hosts.&lt;br /&gt;
** I&#039;m particularly interested in studying the metabolic pathways shared between micro-organisms in the human gut, both with each other and their human hosts.&lt;br /&gt;
* James has created time-series models, and is interested in tackling spatial models with me.&lt;br /&gt;
* I would really like to correlate certain metabolic pathways with his modeled relationships.&lt;br /&gt;
&lt;br /&gt;
Volker&#039;s Project:&lt;br /&gt;
* has taken a big hit this week.&lt;br /&gt;
* I&#039;m going to go forward, with Bo&#039;s help, using the plan outlined in my project proposal for Mihai&#039;s biosequence analysis class:&lt;br /&gt;
** Use reciprocal best blast hits to map H37Rv genes to annotated genes in all available virulent and non-virulent strains of mycobacteria&lt;br /&gt;
** Use results from gene mapping to identify a core set of tuberculosis genes, as well as a set of predicted virulence genes&lt;br /&gt;
** Use a variety of comparison schemes to study how considering different subsets of non-virulent strains affects the set of predicted virulence genes&lt;br /&gt;
** Use stable virulence prediction to rank genes as virulence targets&lt;br /&gt;
&lt;br /&gt;
Metastats:&lt;br /&gt;
* As I mentioned, this is put on hold&lt;br /&gt;
* I intend to pick this back up after I&#039;m done with Volker&#039;s project, as it could be instrumental to my thesis work&lt;br /&gt;
&lt;br /&gt;
== April 2, 2010 ==&lt;br /&gt;
More on my Thesis Project:&lt;br /&gt;
* I read the most recent (Science, Nov. 2009) paper by Gordon and Knight on their ongoing gut microbiota experiments&lt;br /&gt;
* Pretty much every section addressed the potential thesis topics I&#039;d imagined while reading the preceding section. Frustrating, but reaffirming (trying to learn from Mihai on not getting bummed about being scooped).&lt;br /&gt;
* Something that seems interesting and useful to me is to do more rigorous statistical analysis to correlate particular genes and pathways with the time-series and spatial data. I will have to work closely with Bo, at least at first.&lt;br /&gt;
* As a starting point, James has recommended building spatial models similar to his time-series models&lt;br /&gt;
* James is essentially mentoring me on my project at this point. It&#039;s pretty excellent.&lt;br /&gt;
&lt;br /&gt;
== July 23, 2010 ==&lt;br /&gt;
It&#039;s been several months, I don&#039;t feel any closer to finding a thesis project, and it&#039;s really starting to stress me out. I&#039;ve finally stopped making excuses for not reading and have been steadily reading about 10 abstracts and 2 papers per week for the last month or two, but it doesn&#039;t appear to be nearly enough. I met with Mihai today to talk about it and then foolishly went for a run in the heat of the afternoon, where I decided on a new direction in a state of euphoric delirium.&lt;br /&gt;
# Read the book Mihai loaned me within the next week (or so): &#039;&#039;Microbial Inhabitants of Humans&#039;&#039; by Michael Wilson&lt;br /&gt;
#* Mihai says the book is a summary of what was known about the human microbiome 5 years ago. The table of contents for the first chapter is essentially identical to the list of Wikipedia pages I&#039;ve read in the past week, so I&#039;m pretty excited to now have a more thorough, authoritative source.&lt;br /&gt;
# Go back to looking for papers describing quorum sensing, especially in organisms known to be present in the human microbiome, either stably or transiently.&lt;br /&gt;
# Look for a core metabolome (at this point I think a core microbiome seems unlikely) using metapath (or something similar) in the new HMP data for the gut, oral, and vaginal samples from the 100 reference individuals, as well as other sources like MetaHIT.&lt;br /&gt;
#* Start with a quick-and-dirty approach, pulling all KOs associated with organisms identified using 16S rDNA sequencing, and then possibly attempt more accurate gene assemblies and annotation from WGS sequencing projects.&lt;br /&gt;
# Try to stay focused on a research topic for more than 2 weeks so I don&#039;t keep wasting time and effort.&lt;br /&gt;
# Don&#039;t make a habit of using the wiki like a personal journal...&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Communal_Software&amp;diff=7246</id>
		<title>Communal Software</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Communal_Software&amp;diff=7246"/>
		<updated>2010-06-04T18:58:12Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* Introduction */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
CBCB users do not have root access on their machines or the communal servers. Communal CBCB software is therefore typically installed in&lt;br /&gt;
 /fs/sz-user-supported/&lt;br /&gt;
There are two primary subdirectories which differ by architecture. For the most part, 32bit software is installed in&lt;br /&gt;
 /fs/sz-user-supported/Linux-i686/&lt;br /&gt;
while 64bit software is installed in&lt;br /&gt;
 /fs/sz-user-supported/Linux-x86_64/&lt;br /&gt;
The appropriate subdirectory is typically dynamically chosen by embedding&lt;br /&gt;
 /fs/sz-user-supported/`uname`-`uname -m`&lt;br /&gt;
in a user&#039;s path variables in their bashrc file.&lt;br /&gt;
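A minimal .bashrc fragment following this convention might look like the sketch below; the directory names come from the description above, while the bin subdirectory is an assumption for illustration.

```shell
# Sketch of the architecture-specific path convention described above.
ARCH_DIR="/fs/sz-user-supported/$(uname)-$(uname -m)"   # e.g. .../Linux-x86_64
export PATH="$ARCH_DIR/bin:$PATH"
echo "$ARCH_DIR"
```

Because the backticked uname calls expand at login time, the same .bashrc selects the 32-bit or 64-bit tree automatically on each machine.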
&lt;br /&gt;
Installing software to these directories and updating a user&#039;s path can be tricky. For this reason, many users choose to maintain specialty software in personal directories.&lt;br /&gt;
&lt;br /&gt;
== Carl&#039;s World ==&lt;br /&gt;
In an attempt to solve many software installation related headaches for his lab members, Carl Kingsford has created an alternate environment called &amp;quot;Carl&#039;s World&amp;quot; which contains many programs and modules frequently used by his lab members. Little else is known about Carl&#039;s World, but rumor has it that if you ask nicely, he will let you in. It is important to note that Carl&#039;s World is designed and maintained specifically for use by Carl&#039;s lab members, and there are therefore many common and useful software packages not included in Carl&#039;s World. It may therefore be necessary to modify YOUR OWN bashrc file once you have joined, but be warned that attempting to modify Carl&#039;s World itself is a sure way to lose your membership.&lt;br /&gt;
&lt;br /&gt;
== Planned Maintenance (Summer 2010) ==&lt;br /&gt;
The Pop lab is currently planning to update much of the communal software installed in sz-user-supported over the Summer of 2010. So far this list includes:&lt;br /&gt;
* [http://sourceforge.net/apps/mediawiki/amos/index.php?title=AMOS AMOS] -- Already installed&lt;br /&gt;
* [http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastall/index.html blastall 2.2.23] -- May not be worth upgrading from 2.2.18. Will look at release notes.&lt;br /&gt;
* [http://hmmer.janelia.org/ HMMer3] -- Installed.&lt;br /&gt;
** [http://pfam.sanger.ac.uk/ Pfam 24.0]&lt;br /&gt;
* PERL -- May attempt to standardize default version across servers.&lt;br /&gt;
** [http://www.bioperl.org/wiki/Main_Page BioPERL]&lt;br /&gt;
* Python -- May attempt to standardize default version across servers.&lt;br /&gt;
** [http://biopython.org/wiki/Main_Page BioPython]&lt;br /&gt;
*** [http://numpy.scipy.org/ NumPy]&lt;br /&gt;
*** [http://www.reportlab.com/software/opensource/ ReportLab]&lt;br /&gt;
Please feel free to contact Ted Gibbons at tgibbons@umd.edu to make suggestions or express concerns about tampering with the communal software installations and the corresponding paths in the communal CBCB bashrc file.&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Communal_Software&amp;diff=7245</id>
		<title>Communal Software</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Communal_Software&amp;diff=7245"/>
		<updated>2010-06-04T18:56:43Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: Created page&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
Communal CBCB software is installed in&lt;br /&gt;
 /fs/sz-user-supported/&lt;br /&gt;
There are two primary subdirectories which differ by architecture. For the most part, 32bit software is installed in&lt;br /&gt;
 /fs/sz-user-supported/Linux-i686/&lt;br /&gt;
while 64bit software is installed in&lt;br /&gt;
 /fs/sz-user-supported/Linux-x86_64/&lt;br /&gt;
The appropriate subdirectory is typically dynamically chosen by embedding&lt;br /&gt;
 /fs/sz-user-supported/`uname`-`uname -m`&lt;br /&gt;
in a user&#039;s path variables in their bashrc file.&lt;br /&gt;
&lt;br /&gt;
Installing software to these directories and updating a user&#039;s path can be tricky. For this reason, many users choose to maintain specialty software in personal directories.&lt;br /&gt;
&lt;br /&gt;
== Carl&#039;s World ==&lt;br /&gt;
In an attempt to solve many software installation related headaches for his lab members, Carl Kingsford has created an alternate environment called &amp;quot;Carl&#039;s World&amp;quot; which contains many programs and modules frequently used by his lab members. Little else is known about Carl&#039;s World, but rumor has it that if you ask nicely, he will let you in. It is important to note that Carl&#039;s World is designed and maintained specifically for use by Carl&#039;s lab members, and there are therefore many common and useful software packages not included in Carl&#039;s World. It may therefore be necessary to modify YOUR OWN bashrc file once you have joined, but be warned that attempting to modify Carl&#039;s World itself is a sure way to lose your membership.&lt;br /&gt;
&lt;br /&gt;
== Planned Maintenance (Summer 2010) ==&lt;br /&gt;
The Pop lab is currently planning to update much of the communal software installed in sz-user-supported over the Summer of 2010. So far this list includes:&lt;br /&gt;
* [http://sourceforge.net/apps/mediawiki/amos/index.php?title=AMOS AMOS] -- Already installed&lt;br /&gt;
* [http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastall/index.html blastall 2.2.23] -- May not be worth upgrading from 2.2.18. Will look at release notes.&lt;br /&gt;
* [http://hmmer.janelia.org/ HMMer3] -- Installed.&lt;br /&gt;
** [http://pfam.sanger.ac.uk/ Pfam 24.0]&lt;br /&gt;
* PERL -- May attempt to standardize default version across servers.&lt;br /&gt;
** [http://www.bioperl.org/wiki/Main_Page BioPERL]&lt;br /&gt;
* Python -- May attempt to standardize default version across servers.&lt;br /&gt;
** [http://biopython.org/wiki/Main_Page BioPython]&lt;br /&gt;
*** [http://numpy.scipy.org/ NumPy]&lt;br /&gt;
*** [http://www.reportlab.com/software/opensource/ ReportLab]&lt;br /&gt;
Please feel free to contact Ted Gibbons at tgibbons@umd.edu to make suggestions or express concerns about tampering with the communal software installations and the corresponding paths in the communal CBCB bashrc file.&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Main_Page&amp;diff=7244</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Main_Page&amp;diff=7244"/>
		<updated>2010-06-04T18:10:40Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* Getting started */ Created and linked to communal software page&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Seminars ==&lt;br /&gt;
* [http://www.cbcb.umd.edu/seminars Regular CBCB seminars (during academic year)] &amp;lt;br&amp;gt;&lt;br /&gt;
* [[Cbcb:Works-In-Progress]] - Works in progress seminar schedule (Summer 2008) &amp;lt;br&amp;gt;&lt;br /&gt;
* [[short_read_sequencing|Short read sequencing Meeting]] (Fridays at 3pm)&lt;br /&gt;
&lt;br /&gt;
== Projects ==&lt;br /&gt;
&lt;br /&gt;
* [[Project:Pop-Lab|Pop-Lab]]&lt;br /&gt;
* [[Project:Kingsford-Group|Kingsford Group]]&lt;br /&gt;
* [[Project:Cloud-Computing|Cloud Computing]]&lt;br /&gt;
* [[Project:SummerInternships|Summer Internship Projects]]&lt;br /&gt;
&lt;br /&gt;
== People ==&lt;br /&gt;
 &lt;br /&gt;
* [[User:ayres|Daniel Ayres]]&lt;br /&gt;
* [[User:pknut777|Adam Bazinet]] &lt;br /&gt;
* [[User:amp|Adam M Phillippy]] &lt;br /&gt;
* [[User:adelcher|Arthur L. Delcher]] &lt;br /&gt;
* [[User:carlk|Carl Kingsford]] &lt;br /&gt;
* [[User:dpuiu|Daniela Puiu]] &lt;br /&gt;
* [[User:dsommer|Dan Sommer]] &lt;br /&gt;
* [[User:gpertea|Geo Pertea]] &lt;br /&gt;
* [[User:jeallen|Jonathan Edward Allen]] &lt;br /&gt;
* [[User:ayanbule|Kunmi Ayanbule]]&lt;br /&gt;
* [[User:mschatz|Michael Schatz]] &lt;br /&gt;
* [[User:mpertea|Mihaela Pertea]] &lt;br /&gt;
* [[User:mpop|Mihai Pop]] &lt;br /&gt;
* [[User:nelsayed|Najib El-Sayed]] &lt;br /&gt;
* [[User:nedwards|Nathan Edwards]]&lt;br /&gt;
* [[User:niranjan|Niranjan Nagarajan]] &lt;br /&gt;
* [[User:saket|Saket Navlakha]]&lt;br /&gt;
* [[User:angiuoli|Samuel V Angiuoli]] &lt;br /&gt;
* [[User:salzberg|Steven Salzberg]]&lt;br /&gt;
* [[User:tgibbons | Ted Gibbons]]&lt;br /&gt;
* [[User:whitej|James Robert White]]&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
If you have just received a new umiacs account through CBCB, follow the instructions on this page to get the basic information you&#039;ll need to start working:&amp;lt;br&amp;gt;&lt;br /&gt;
*[[Getting Started in CBCB]]&lt;br /&gt;
*[https://wiki.umiacs.umd.edu/cbcb-private/index.php/Storage CBCB Storage]&lt;br /&gt;
*[https://wiki.umiacs.umd.edu/cbcb-private/index.php/Compute CBCB Computers]&lt;br /&gt;
*[[Communal Software]]&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=7239</id>
		<title>Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=7239"/>
		<updated>2010-05-26T21:52:05Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* Directory structure */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== 16S analysis pipeline ==&lt;br /&gt;
&lt;br /&gt;
Assumptions:&amp;lt;br&amp;gt;&lt;br /&gt;
* 16S rRNA sequences are generated with barcoded 454 and are received as either: (i) .sff file; (ii) fasta and quality file; (iii) just fasta file&lt;br /&gt;
* each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about the sample, including the &amp;quot;Sample ID&amp;quot;, well on the plate, and additional information regarding the sample quality and DNA concentration&lt;br /&gt;
* we have a file that specifies the barcode and sequencing adapter used for each well (these entities do not change).&lt;br /&gt;
&lt;br /&gt;
=== Directory structure ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Gates_SOM/&lt;br /&gt;
  Main/&lt;br /&gt;
    samples.csv      - information about all the samples available to us&lt;br /&gt;
    454.csv          - information about all 454 runs (essentially concatenation of .csvs from 454 dir)&lt;br /&gt;
    phylochip.csv    - information about all Phylochip runs&lt;br /&gt;
    IGS_Barcodes.csv - information about barcodes used to multiplex 454 runs&lt;br /&gt;
    scripts/         - scripts used to process the data&lt;br /&gt;
  454/               - here&#039;s where all 454 sequences live&lt;br /&gt;
    [batch1]/        - ... each batch in a separate directory&lt;br /&gt;
    [batch1].csv     - meta-information about the batch &lt;br /&gt;
    [fasta1]         - fasta files containing the batch&lt;br /&gt;
     ... &lt;br /&gt;
    [fastan]&lt;br /&gt;
    [batch1].part    - partition file describing how the sequences get split by barcode/sample&lt;br /&gt;
     part/           - directory where all the partitioned files live &lt;br /&gt;
     ... &lt;br /&gt;
    [batchn]&lt;br /&gt;
  Phylochip/         - all the CEL files and auxiliary information on the Phylochip runs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff ;do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix), making sure the quotes added by Excel/OpenOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID. At this stage, also check that the header row information is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that have failed quality checks, either because they are too short or contain Ns (BAD), or because the barcode is not recognized (NONE). The files contain additional information that can be used to troubleshoot the pipeline: sequences that are too short (fewer than 75 454 cycles) are followed by the number of cycles, while sequences that either contain Ns or have an unknown barcode are followed by the first 8 characters in the sequence.&lt;br /&gt;
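A quick way to tally the too-short rejects is a small awk filter over the BAD list on stdin; the exact two-column layout (read name, then cycle count for short reads) is an assumption, and lines whose second field is not numeric (the N-containing reads) are ignored.

```shell
# Count BAD-list entries rejected for length: keep lines whose second field
# is a plain number (the cycle count) and tally those under 75 cycles.
awk '$2 ~ /^[0-9]+$/ { if (75 > $2 + 0) n++ } END { print n + 0 }'
```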
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e. individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file but with the addition of the .nbc suffix. You should remove the .nbc files generated from the BAD/NONE partitions in order to prevent their addition to the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Add file names to sample tables ===&lt;br /&gt;
&lt;br /&gt;
The following needs to be run from the root of the 454 directory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; ${SCRIPTS}/add_file.pl [batch]/[batch]_barcode.csv [batch]/part &amp;gt; [batch]/[batch]_names.csv &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As a result, the file [batch]/[batch]_names.csv will associate each Sample ID to a file name and also record the number of sequences in that file.  Note that only files ending in .nbc.fa are processed.&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Merge all the batch meta-info files into a single file at the top ===&lt;br /&gt;
&lt;br /&gt;
Note: the addition of file names must be done on a batch-by-batch basis, as multiple files might refer to the same Sample ID - within each batch it can be assumed that the Sample ID -&amp;gt; Filename mapping is unique. In the 454.csv file in the top directory, the unique key is the file name.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv ../454.csv ../454.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl ../454.csv.bak [batch]/[batch]_names.csv &amp;quot;Sample ID&amp;quot; non-unique &amp;gt; ../454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This should probably be run with the filename as the key, but then the tables would need to be sorted by filename. Ultimately, this will all work better once the data are in a relational database.&lt;br /&gt;
&lt;br /&gt;
=== Step 6: Update sample file ===&lt;br /&gt;
From the top directory:&lt;br /&gt;
&lt;br /&gt;
* First add all new samples to the samples.csv file&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; unique merge &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: merge means that if record keys conflict, the empty fields will be updated with the new data.&lt;br /&gt;
&lt;br /&gt;
* Update the tag indicating 454 sequences available for this sample&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/update_field_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; 454 Y &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 7: Assign numbers to all filenames ===&lt;br /&gt;
&lt;br /&gt;
Each file in the 454.csv file will be assigned an integer (if one is not already available).  This number will be used to prefix the sequences in the combined file for the project.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv 454.csv 454.csv.bak&lt;br /&gt;
${SCRIPTS}/add_filenum.pl 454.csv.bak &amp;gt; 454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 8: Combine all fasta files into a single one ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note: this assumes we&#039;re running the pipeline for the first time - a different protocol is necessary for adding new sequences to an already existing analysis&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the new file all sequences will be named &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt; where &amp;lt;n&amp;gt; is the value in the &amp;quot;File #&amp;quot;&lt;br /&gt;
field in 454.csv and &amp;lt;nn&amp;gt; is the index of the sequence in the file.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir Analysis/Run[date]&lt;br /&gt;
${SCRIPTS}/combinefa.pl -c Analysis/Run[date]/Run[date] -i 454.csv 454&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Analysis/Run[date]/Run[date].fna&lt;br /&gt;
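The renaming scheme itself can be illustrated with a short awk sketch (an illustration only, not the actual combinefa.pl); here n stands in for the File # value and the input arrives on stdin.

```shell
# Illustrative renaming: rewrite each FASTA header on stdin to >n_nn, where
# n is the hypothetical "File #" and nn is the record index within the file.
awk -v n=7 '/^>/ { print ">" n "_" (++i); next } { print }'
```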
&lt;br /&gt;
=== Step 9: Run clustering tool ===&lt;br /&gt;
&lt;br /&gt;
* First generate clusters&lt;br /&gt;
&#039;&#039;&#039;Note: This part assumes we&#039;re running the whole set of sequences as one batch.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd Analysis/Run[date]&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -i Run[date].fna &amp;gt; Run[date].fna.cluster&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -m -i Run[date].fna &amp;gt; Run[date].fna.align&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output will be in Run[date].fna.cluster, one cluster per line, cluster center listed as the first identifier.&lt;br /&gt;
&lt;br /&gt;
The .align file contains aligned FASTA records for all the sequences in each cluster.  Clusters are separated by #&amp;lt;number&amp;gt; where &amp;lt;number&amp;gt; is the number of sequences in the cluster.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Then extract the cluster centers&lt;br /&gt;
&#039;&#039;&#039;From here on the code runs the same in both full-run and batch modes&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/fastaselect Run[date].fna &amp;lt; Run[date].fna.cluster &amp;gt; Run[date].centers.fna&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 10: Assign putative taxonomic labels to clusters ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/rdp/findtax/findtaxid.sh Run[date].centers.fna &amp;gt; Run[date].centers.taxid&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output is tab-delimited: sequence name &amp;lt;TAB&amp;gt; taxid&lt;br /&gt;
Note: using &amp;quot;findtax&amp;quot; instead of &amp;quot;findtaxid&amp;quot; will retrieve actual taxonomy names.&lt;br /&gt;
&lt;br /&gt;
=== Step 11: Build summary tables ===&lt;br /&gt;
&lt;br /&gt;
Using the output from steps 9 and 10, we construct a collection of tables linking OTUs, taxIDs, and tax names at various taxonomic levels to individual samples. The columns are the samples and the rows are the respective units. The cells are the numbers of sequences assigned to the specific group. If looking at taxonomic levels, the sequences without an assignment at that level are assigned to a generic &amp;quot;No Assignment&amp;quot; bin.&lt;br /&gt;
&lt;br /&gt;
* First create a partition file that contains all the clusters and associated taxids (if they exist)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/cluster2part.pl [batch].fna.cluster [batch] [batch].centers.taxid &amp;gt; [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Using this partition, construct summary tables at various taxonomic levels&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/taxpart2summary.pl [batch].part ${MAIN}/454.csv [batch]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The following outputs will be created:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[batch].stats.txt - overall statistics for the data-set&lt;br /&gt;
[batch].otus.count.csv - table containing OTUs as rows, samples as columns, and entries represent&lt;br /&gt;
       # of sequences in each OTU/Sample pair&lt;br /&gt;
[batch].otus.percent.csv - same as &amp;quot;count&amp;quot; except that entries are percentages wrt total sequences&lt;br /&gt;
       in each sample&lt;br /&gt;
[batch].[tax].[otu|count|percent] - same as the &amp;quot;otus&amp;quot; file except at varying taxonomic levels.&lt;br /&gt;
       [tax] is one of &amp;quot;strain&amp;quot;, &amp;quot;species&amp;quot;, &amp;quot;genus&amp;quot;, &amp;quot;family&amp;quot;, &amp;quot;order&amp;quot;, &amp;quot;class&amp;quot;, &amp;quot;phylum&amp;quot;&lt;br /&gt;
       the &amp;quot;count&amp;quot; and &amp;quot;percent&amp;quot; entries are the same as for the &amp;quot;otus&amp;quot; files&lt;br /&gt;
       the &amp;quot;otu&amp;quot; entries contain the number of OTUs assigned to each taxonomic group/Sample pair.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=7238</id>
		<title>Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=7238"/>
		<updated>2010-05-26T21:30:23Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* Directory structure */ Fixed some indeces, but the file structure appears to already be slightly out of date.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== 16S analysis pipeline ==&lt;br /&gt;
&lt;br /&gt;
Assumptions:&amp;lt;br&amp;gt;&lt;br /&gt;
* 16S rRNA sequences are generated with barcoded 454 and are received as either: (i) .sff file; (ii) fasta and quality file; (iii) just fasta file&lt;br /&gt;
* each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about each sample, including the &amp;quot;Sample ID&amp;quot;, the well on the plate, and additional information regarding sample quality and DNA concentration&lt;br /&gt;
* we have a file that specifies the barcode and sequencing adapter used for each well (these entities do not change).&lt;br /&gt;
&lt;br /&gt;
=== Directory structure ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Gates_SOM/&lt;br /&gt;
  Main/&lt;br /&gt;
    samples.csv  - information about all the samples available to us&lt;br /&gt;
    454.csv      - information about all 454 runs (essentially concatenation of .csvs from 454 dir)&lt;br /&gt;
    phylochip.csv - information about all Phylochip runs&lt;br /&gt;
    IGS_Barcodes.csv - information about barcodes used to multiplex 454 runs&lt;br /&gt;
    scripts/     - scripts used to process the data&lt;br /&gt;
  454/         - here&#039;s where all 454 sequences live&lt;br /&gt;
    [batch1]/  - ... each batch in a separate directory&lt;br /&gt;
    [batch1].csv - meta-information about the batch &lt;br /&gt;
    [fasta1]- fasta files containing the batch&lt;br /&gt;
     ... &lt;br /&gt;
    [fastan]&lt;br /&gt;
    [batch1].part - partition file describing how the sequences get split by barcode/sample&lt;br /&gt;
     part/   - directory where all the partitioned files live &lt;br /&gt;
     ... &lt;br /&gt;
    [batchn]&lt;br /&gt;
  Phylochip/   - all the CEL files and auxiliary information on the Phylochip runs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff ;do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`  # strip the .sff extension&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix), making sure the quotes added by Excel/OOffice are removed, adding the date (if it is not already in the file), and sorting the file by Sample ID.  At this stage, also check that the header row information is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
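The manual cleanup described in the first bullet can be sketched as a short shell pipeline. This is only an illustration with placeholder file names (batch_raw.csv and friends), assuming the Sample ID sits in the first column; adjust the sort key to wherever it actually lives in your sheet:&lt;br /&gt;

```shell
# Hypothetical batch sheet as exported from Excel: quoted fields, CRLF line
# endings, data rows out of order by Sample ID (all names are placeholders).
printf '"Sample ID"\t"Well"\r\n"S002"\t"A2"\r\n"S001"\t"A1"\r\n' > batch_raw.csv

# Strip DOS line endings and stray Excel quotes (stands in for dos2unix
# plus the manual quote removal).
sed -e 's/\r$//' -e 's/"//g' batch_raw.csv > batch_clean.csv

# Sort the data rows by Sample ID (column 1), keeping the header line first.
head -n 1 batch_clean.csv > batch_sorted.csv
tail -n +2 batch_clean.csv | sort -t "$(printf '\t')" -k1,1 >> batch_sorted.csv
```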
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that failed quality checks, either because they are too short or contain Ns (BAD), or because the barcode is not recognized (NONE). The files contain additional information that can be used to troubleshoot the pipeline: sequences that are too short (&amp;lt; 75 454 cycles) are followed by the number of cycles, while sequences that contain Ns or have an unknown barcode are followed by the first 8 characters of the sequence.&lt;br /&gt;
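Given that description, the too-short entries can be pulled out of the BAD list with a quick awk filter. The file below is toy data, and the two-column layout is an assumption based on the description above:&lt;br /&gt;

```shell
# Hypothetical [batch].BAD.list: sequence name followed by either a cycle
# count (too short) or the first 8 bases (contains Ns). Layout is assumed.
printf 'seq1\t42\nseq2\tAACCGGNN\nseq3\t60\n' > batch.BAD.list

# Keep only rows whose second column is a number below the 75-cycle cutoff.
awk '$2 ~ /^[0-9]+$/ { if (75 > $2 + 0) print $1, $2 }' batch.BAD.list
```

Here only seq1 and seq3 match, since their second column is a number below 75.&lt;br /&gt;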
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e. individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file but with the addition of the .nbc suffix.  You should remove the .nbc files generated from the BAD/NONE files to prevent them from entering the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Add file names to sample tables ===&lt;br /&gt;
&lt;br /&gt;
The following needs to be run from the root of the 454 directory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; ${SCRIPTS}/add_file.pl [batch]/[batch]_barcode.csv [batch]/part &amp;gt; [batch]/[batch]_names.csv &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As a result, the file [batch]/[batch]_names.csv will associate each Sample ID with a file name and also record the number of sequences in that file.  Note that only files ending in .nbc.fa are processed.&lt;br /&gt;
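The per-file sequence counts described here amount to counting FASTA headers. A minimal sketch (not add_file.pl itself; the file names are placeholders):&lt;br /&gt;

```shell
# Toy partitioned files standing in for the real part/ directory contents.
printf '>a\nAC\n>b\nGT\n' > sampleA.nbc.fa
printf '>c\nAC\n' > sampleB.nbc.fa

# One line per file: file name and number of sequences (FASTA headers).
for f in *.nbc.fa; do
  printf '%s\t%s\n' "$f" "$(grep -c '^>' "$f")"
done
```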
&lt;br /&gt;
=== Step 5: Merge all the batch meta-info files into a single file at the top ===&lt;br /&gt;
&lt;br /&gt;
Note: the addition of file names must be done on a batch-by-batch basis, as multiple files might refer to the same Sample ID; within each batch, the Sample ID -&amp;gt; Filename mapping can be assumed to be unique.  In the 454.csv file in the top directory, the unique key is the file name.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv ../454.csv ../454.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl ../454.csv.bak [batch]/[batch]_names.csv &amp;quot;Sample ID&amp;quot; non-unique &amp;gt; ../454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This should probably be run with the filename as the key, but then the tables would need to be sorted by filename.  Ultimately, this will all work better once the data are in a relational database.&lt;br /&gt;
&lt;br /&gt;
=== Step 6: Update sample file ===&lt;br /&gt;
From the top directory:&lt;br /&gt;
&lt;br /&gt;
* First add all new samples to the samples.csv file&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; unique merge &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: merge means that if record keys conflict, the empty fields will be updated with the new data.&lt;br /&gt;
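A toy illustration of that merge rule (not merge_csv.pl itself): records are keyed on column 1 of tab-delimited files, and when a key exists in both files, empty fields in the old record are filled from the new one:&lt;br /&gt;

```shell
# Toy records: the old file has an empty field 2, the new file fills it.
printf 'S001\t\tdone\n' > old.tsv
printf 'S001\tY\tpending\n' > new.tsv

# First pass stores the new record's fields; second pass patches empty cells
# in the old record. Non-empty old fields are left alone.
awk 'BEGIN { FS = OFS = "\t" }
     NR == FNR { for (i = 2; NF + 1 > i; i++) val[$1, i] = $i; next }
     { for (i = 2; NF + 1 > i; i++) if ($i == "") $i = val[$1, i]; print }' new.tsv old.tsv
```

Here the old record keeps its non-empty third field ("done") and only the empty field is filled in with "Y".&lt;br /&gt;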
&lt;br /&gt;
* Update the tag indicating 454 sequences available for this sample&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/update_field_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; 454 Y &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 7: Assign numbers to all filenames ===&lt;br /&gt;
&lt;br /&gt;
Each file listed in 454.csv will be assigned an integer (if one is not already present).  This number will be used to prefix the sequences in the combined file for the project.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv 454.csv 454.csv.bak&lt;br /&gt;
${SCRIPTS}/add_filenum.pl 454.csv.bak &amp;gt; 454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
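I don't have add_filenum.pl in front of me, but the behavior described above (assign an integer only where one is missing) can be sketched with a two-pass awk. The toy input and its column layout are assumptions:&lt;br /&gt;

```shell
# Sketch of the numbering rule (not the real add_filenum.pl): tab-delimited
# rows with a file name in column 1 and an optional number in column 2.
# a.fa already has a number; b.fa and c.fa do not.
printf 'a.fa\t1\nb.fa\t\nc.fa\t\n' > files.csv

# Pass 1 finds the largest number already in use; pass 2 fills in the gaps.
awk 'BEGIN { FS = OFS = "\t" }
     NR == FNR { if ($2 + 0 > max) max = $2 + 0; next }
     $2 == "" { max = max + 1; $2 = max }
     { print }' files.csv files.csv > files.numbered.csv
```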
&lt;br /&gt;
=== Step 8: Combine all fasta files into a single one ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note: this assumes we&#039;re running the pipeline for the first time - a different protocol is necessary for adding new sequences to an already existing analysis&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the new file all sequences will be named &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt; where &amp;lt;n&amp;gt; is the value in the &amp;quot;File #&amp;quot;&lt;br /&gt;
field in 454.csv and &amp;lt;nn&amp;gt; is the index of the sequence in the file.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir Analysis/Run[date]&lt;br /&gt;
${SCRIPTS}/combinefa.pl -c Analysis/Run[date]/Run[date] -i 454.csv 454&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Analysis/Run[date]/Run[date].fna&lt;br /&gt;
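The &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt; naming convention can be illustrated on a single toy file. This sketches only the renaming rule, not combinefa.pl itself; the file number 3 is arbitrary:&lt;br /&gt;

```shell
# Toy FASTA with placeholder names and sequences.
printf '>origA\nACGT\n>origB\nTTGA\n' > toy.fna

# Rename the i-th record to FILENUM_i, as described above (file number n=3).
awk -v n=3 '/^>/ { i = i + 1; print ">" n "_" i; next } { print }' toy.fna > toy.renamed.fna
```

The headers in toy.renamed.fna become 3_1 and 3_2, while the sequence lines pass through unchanged.&lt;br /&gt;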
&lt;br /&gt;
=== Step 9: Run clustering tool ===&lt;br /&gt;
&lt;br /&gt;
* First generate clusters&lt;br /&gt;
&#039;&#039;&#039;Note: This part assumes we&#039;re running the whole set of sequences as one batch.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd Analysis/Run[date]&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -i Run[date].fna &amp;gt; Run[date].fna.cluster&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -m -i Run[date].fna &amp;gt; Run[date].fna.align&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output will be in Run[date].fna.cluster, one cluster per line, with the cluster center listed as the first identifier.&lt;br /&gt;
&lt;br /&gt;
The .align file contains aligned FASTA records for all the sequences in each cluster.  Clusters are separated by #&amp;lt;number&amp;gt; where &amp;lt;number&amp;gt; is the number of sequences in the cluster.&lt;br /&gt;
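Since the .cluster file has one cluster per line with one identifier per member, the cluster-size distribution falls out of a one-liner (toy file below; the real input would be Run[date].fna.cluster):&lt;br /&gt;

```shell
# Toy cluster file: each line is one cluster, center listed first.
printf 'c1 m1 m2\nc2\nc3 m3\n' > toy.fna.cluster

# Number of clusters of each size (NF = identifiers per line = cluster size).
awk '{ print NF }' toy.fna.cluster | sort -n | uniq -c
```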
&lt;br /&gt;
&lt;br /&gt;
* Then extract the cluster centers&lt;br /&gt;
&#039;&#039;&#039;From here on the code runs the same in both full-run and batch modes&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/fastaselect Run[date].fna &amp;lt; Run[date].fna.cluster &amp;gt; Run[date].centers.fna&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 10: Assign putative taxonomic labels to clusters ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/rdp/findtax/findtaxid.sh Run[date].centers.fna &amp;gt; Run[date].centers.taxid&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output is tab-delimited: sequence name &amp;lt;TAB&amp;gt; taxid&lt;br /&gt;
Note: using &amp;quot;findtax&amp;quot; instead of &amp;quot;findtaxid&amp;quot; will retrieve actual taxonomy names.&lt;br /&gt;
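Because the output is plain two-column text, per-taxid counts come straight from cut/sort/uniq (toy data below; the real input would be Run[date].centers.taxid):&lt;br /&gt;

```shell
# Toy taxid table in the stated format: sequence name TAB taxid.
printf 'seq1\t1280\nseq2\t1280\nseq3\t562\n' > toy.centers.taxid

# How many cluster centers were assigned to each taxid, most common first.
cut -f2 toy.centers.taxid | sort | uniq -c | sort -rn
```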
&lt;br /&gt;
=== Step 11: Build summary tables ===&lt;br /&gt;
&lt;br /&gt;
Using the output from steps 9 and 10 we construct a collection of tables linking OTUs, taxIDs, and taxonomy names at various taxonomic levels to individual samples.  The columns are the samples and the rows are the respective units.  The cells contain the number of sequences assigned to the specific group. When looking at taxonomic levels, sequences without an assignment at that level are placed in a generic &amp;quot;No Assignment&amp;quot; bin.&lt;br /&gt;
&lt;br /&gt;
* First create a partition file that contains all the clusters and associated taxids (if they exist)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/cluster2part.pl [batch].fna.cluster [batch] [batch].centers.taxid &amp;gt; [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Using this partition, construct summary tables at various taxonomic levels&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/taxpart2summary.pl [batch].part ${MAIN}/454.csv [batch]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The following outputs will be created:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[batch].stats.txt - overall statistics for the data-set&lt;br /&gt;
[batch].otus.count.csv - table containing OTUs as rows, samples as columns, and entries represent&lt;br /&gt;
       # of sequences in each OTU/Sample pair&lt;br /&gt;
[batch].otus.percent.csv - same as &amp;quot;count&amp;quot; except that entries are percentages with respect&lt;br /&gt;
       to the total sequences in each sample&lt;br /&gt;
[batch].[tax].[otu|count|percent] - same as the &amp;quot;otus&amp;quot; file except at varying taxonomic levels.&lt;br /&gt;
       [tax] is one of &amp;quot;strain&amp;quot;, &amp;quot;species&amp;quot;, &amp;quot;genus&amp;quot;, &amp;quot;family&amp;quot;, &amp;quot;order&amp;quot;, &amp;quot;class&amp;quot;, &amp;quot;phylum&amp;quot;&lt;br /&gt;
       the &amp;quot;count&amp;quot; and &amp;quot;percent&amp;quot; entries are the same as for the &amp;quot;otus&amp;quot; files&lt;br /&gt;
       the &amp;quot;otu&amp;quot; entries contain the number of OTUs assigned to each taxonomic group/Sample pair.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Main_Page&amp;diff=6975</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Main_Page&amp;diff=6975"/>
		<updated>2010-04-05T15:32:14Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* Getting started */ Added links to the recently created resources pages&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Seminars ==&lt;br /&gt;
* [http://www.cbcb.umd.edu/seminars Regular CBCB seminars (during academic year)] &amp;lt;br&amp;gt;&lt;br /&gt;
* [[Cbcb:Works-In-Progress]] - Works in progress seminar schedule (Summer 2008) &amp;lt;br&amp;gt;&lt;br /&gt;
* [[short_read_sequencing|Short read sequencing Meeting]] (Fridays at 3pm)&lt;br /&gt;
&lt;br /&gt;
== Projects ==&lt;br /&gt;
&lt;br /&gt;
* [[Project:Pop-Lab|Pop-Lab]]&lt;br /&gt;
* [[Project:Kingsford-Group|Kingsford Group]]&lt;br /&gt;
* [[Project:Cloud-Computing|Cloud Computing]]&lt;br /&gt;
* [[Project:SummerInternships|Summer Internship Projects]]&lt;br /&gt;
&lt;br /&gt;
== People ==&lt;br /&gt;
 &lt;br /&gt;
* [[User:ayres|Daniel Ayres]]&lt;br /&gt;
* [[User:pknut777|Adam Bazinet]] &lt;br /&gt;
* [[User:amp|Adam M Phillippy]] &lt;br /&gt;
* [[User:adelcher|Arthur L. Delcher]] &lt;br /&gt;
* [[User:carlk|Carl Kingsford]]  &lt;br /&gt;
* [[User:dpuiu|Daniela Puiu]] &lt;br /&gt;
* [[User:dsommer|Dan Sommer]] &lt;br /&gt;
* [[User:gpertea|Geo Pertea]] &lt;br /&gt;
* [[User:jeallen|Jonathan Edward All]] &lt;br /&gt;
* [[User:ayanbule|Kunmi Ayanbule]]&lt;br /&gt;
* [[User:mschatz|Michael Schatz]] &lt;br /&gt;
* [[User:mpertea|Mihaela Pertea]] &lt;br /&gt;
* [[User:mpop|Mihai Pop]] &lt;br /&gt;
* [[User:nelsayed|Najib El-Sayed]] &lt;br /&gt;
* [[User:nedwards|Nathan Edwards]]&lt;br /&gt;
* [[User:niranjan|Niranjan Nagarajan]] &lt;br /&gt;
* [[User:saket|Saket Navlakha]]&lt;br /&gt;
* [[User:angiuoli|Samuel V Angiuoli]] &lt;br /&gt;
* [[User:salzberg|Steven Salzberg]]&lt;br /&gt;
* [[User:tgibbons | Ted Gibbons]]&lt;br /&gt;
* [[User:whitej|James Robert White]]&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
If you have just received a new umiacs account through CBCB, follow the instructions on this page to get the basic information you&#039;ll need to start working:&amp;lt;br&amp;gt;&lt;br /&gt;
*[[Getting Started in CBCB]]&lt;br /&gt;
*[https://wiki.umiacs.umd.edu/cbcb-private/index.php/Storage CBCB Storage]&lt;br /&gt;
*[https://wiki.umiacs.umd.edu/cbcb-private/index.php/Compute CBCB Computers]&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Ted-Report&amp;diff=6950</id>
		<title>Cbcb:Pop-Lab:Ted-Report</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Ted-Report&amp;diff=6950"/>
		<updated>2010-03-29T02:37:01Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* April 2, 2010 */ Added entry for week ending April 2&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Older Entries ==&lt;br /&gt;
[[Cbcb:Pop-Lab:Ted-Report-2009 | 2009]]&lt;br /&gt;
&lt;br /&gt;
== January 15, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Minimus Documentation ===&lt;br /&gt;
&lt;br /&gt;
Presently, the only relevant Google hit for &amp;quot;minimus&amp;quot; on the first page of results is the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus#Basic_usage_example sourceforge wiki.] The only example on this page is incomplete and appears to be an early draft made during development.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Ideally, it should be easy to find a complete guide with the general format:&lt;br /&gt;
* Simple use case:&lt;br /&gt;
 `toAmos -s path/to/fastaFile.seq -o path/to/fastaFile.afg`&lt;br /&gt;
 `minimus path/to/fastaFile(prefix)`&lt;br /&gt;
* Necessary tools for set up (toAmos)&lt;br /&gt;
* Other options&lt;br /&gt;
* etc&lt;br /&gt;
&lt;br /&gt;
The description found on the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus/README Minimus/README] page (linked to from the middle of the starting page) is more appropriate, but features use cases that may no longer be common and references another required tool (toAmos) without linking to it or describing how to access it. A description of this tool can be found on the Amos [http://sourceforge.net/apps/mediawiki/amos/index.php?title=File_conversion_utilities File Conversion Utilities] page (again, linked to from the starting page), but it is less organized than what I&#039;ve come to expect from a project page, and it is easy to get lost or distracted by the rest of the Amos documentation while trying to piece together the necessary steps for a basic assembly.&lt;br /&gt;
&lt;br /&gt;
=== Comparative Network Analysis pt. 2 ===&lt;br /&gt;
* Meeting with Volker this Friday to discuss how best to apply network alignment to what he&#039;s doing&lt;br /&gt;
* I&#039;m simultaneously trying to find a way to apply my network alignment technique to predicting genes in metagenomic samples&lt;br /&gt;
** I&#039;ve been trying to find a way to get beyond the restriction that my current program requires genes to be annotated with an EC number. A potentially interesting next step may be to use BioPython to BLAST the sequence of each enzyme annotated in every micro-organism in KEGG against a metagenomic library.&lt;br /&gt;
*** The results would be stretches of linked reactions that have been annotated in KEGG pathways.&lt;br /&gt;
*** This method could be applied to contigs just as easily as finished sequences. In a scenario where perhaps there was low coverage, it could be used to identify genes which are probably there but just weren&#039;t sampled by showing the presence of the rest pathway. In short, this could finally accomplish what Mihai asked me to work on when I showed up.&lt;br /&gt;
*** The major theoretical shortcoming of this approach is that it could only identify relatively well characterized pathways.&lt;br /&gt;
*** The practical shortcoming of this approach will start by obtaining a fairly complete copy of KEGG (which as we&#039;ve learned is a mess to parse locally and unusably slow to call through the API), and will continue to the computational challenge of such a large scale BLAST operation.&lt;br /&gt;
** Ask Bo about this when he gets back. He may have already done this.&lt;br /&gt;
&lt;br /&gt;
== January 22, 2010 ==&lt;br /&gt;
* Met with Dan and Sergey to talk about the Minimus-Bambus pipeline&lt;br /&gt;
** Minimus is running fine. I&#039;ve begun characterizing its run-time behavior (see next week&#039;s entry)&lt;br /&gt;
** After some tweaking by Sergey, Bambus was able to finish but did not generate a scaffold. We&#039;re going to talk about this after the meeting on Monday.&lt;br /&gt;
** Sergey had an interesting idea for making a better read simulator:&lt;br /&gt;
*** Error-free reads are cheap and easy to generate. The problem is with the error model.&lt;br /&gt;
*** The &amp;quot;best&amp;quot; tool (that we are aware of) which includes error models is MetaSim, but the error models are years out of date and the authors have been historically unreachable. While Mihai has now shown me how to edit the models in a reasonable way from flat files that characterize base substitutions, I&#039;m not convinced it would be faster or easier to write a program that modifies these files than it would be to just write an entirely new program; and given the amount of time I&#039;ve spent trying to use MetaSim, I&#039;m more than ready to walk away from it. Oh yeah, and MetaSim doesn&#039;t work from the command line, so no scripting.&lt;br /&gt;
*** Sergey has pointed out that most companies will assemble &#039;&#039;E. coli&#039;&#039; when they release a new sequencer. Conveniently, there are many high quality assemblies of &#039;&#039;E. coli&#039;&#039; available for reference. It might therefore be possible to generate new error models for these sequencers in an automated fashion by mapping the &#039;&#039;E. coli&#039;&#039; reads to the available reference genomes, collecting the error frequencies, and then using them to mask synthesized reads.&lt;br /&gt;
*** I also talked with Mohammad and Mihai about this, and they also seemed to think it was a pretty good idea. Mihai has proposed having Sergey or Mohammad add the described error-model generator to his read sampler (written in C) when they have time, but not in preparation for the oral microbiome data.&lt;br /&gt;
&lt;br /&gt;
* Met with James to discuss my work with Volker&lt;br /&gt;
** Told him about my meeting with Volker and the paper he wanted me to prepare, more or less by myself. The main points of the paper are these:&lt;br /&gt;
*** Most available genomic sequences of mycobacteria are of a very small subset of highly pathogenic organisms.&lt;br /&gt;
*** Subtractive comparative genomics can be used to identify genes that are potentially responsible for differing phenotypes (such as extreme pathogenicity), but genomic sequences must be available for closely related organisms with differing phenotypes.&lt;br /&gt;
*** Volker has sequenced 2 more non-pathogenic strains of mycobacteria (&#039;&#039;gastri&#039;&#039;, and &#039;&#039;kansasiiW58&#039;&#039;) with the intention of increasing the effectiveness of these subtractive comparative genomic studies.&lt;br /&gt;
*** The meat of the paper would be comparing the results of subtractive comparative genomic analysis using all currently available strains in RefSeq, with the results from also using the two novel sequences.&lt;br /&gt;
*** The other, smaller publishable portion of this project would be a comparison of &#039;&#039;gastri&#039;&#039; and &#039;&#039;kansasiiW58&#039;&#039; to each other, because they are thought to be extremely closely related and yet have distinct phenotypes (which I&#039;ve now forgotten).&lt;br /&gt;
*** James seemed to think this could make an okay paper, and he confirmed that he had not realized Volker was looking for someone to do all of the analysis, both computational and biological, with Volker only contributing analysis of the analysis after it was all over.&lt;br /&gt;
** Ended up also discussing his work on differential abundance in populations of microorganisms.&lt;br /&gt;
*** I&#039;m going to start working on taking over and expanding Metastats this semester.&lt;br /&gt;
*** I&#039;m also going to start talking to Bo when he gets back about exactly what he&#039;s doing, and how I might be able to include pathway prediction in my expansion of Metastats without stepping on his toes.&lt;br /&gt;
*** Mihai has given me his approval to focus on this.&lt;br /&gt;
&lt;br /&gt;
* Met with Mihai to discuss working with Volker&lt;br /&gt;
** Explained that rather than looking for someone to do only the complex portions of the computational analysis, Volker was/is looking for someone to do the complete analysis.&lt;br /&gt;
** In exchange, Volker is offering first authorship and, if need be, to split the student&#039;s funding with their primary PI.&lt;br /&gt;
** I think I&#039;m capable of doing this within 3 or 4 months but it would consume my time pretty thoroughly.&lt;br /&gt;
** Mihai agreed that this is a reasonable deal, but noted that I have no personal interest in studying mycobacteria, and it&#039;s therefore unwise for me to invest a lot of time becoming an expert on an organism I have no interest in continuing to study or work with. I&#039;ve therefore offered to work closely with one of Volker&#039;s graduate students, who could meet with me every week or two. I would be willing to do all of the computational analysis and explain it to them, but they would have to actually look up potentially interesting genes and relationships I discover and help me keep the analysis biologically interesting and relevant.&lt;br /&gt;
&lt;br /&gt;
* Met with Mihai and Mohammad to discuss our impending huge-ass(embly) problem&lt;br /&gt;
** Talked about strategies for iterative assembly as an approach to assembling intractably large data sets. Most have glaring shortcomings and complications.&lt;br /&gt;
** Discovered Mike Schatz has a map-reduce implementation of an assembler that uses de Bruijn graphs and is better suited to assemblies with high coverage but short read lengths.&lt;br /&gt;
&lt;br /&gt;
== January 29, 2010 ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Minimus Performance Analysis ===&lt;br /&gt;
I&#039;m testing minimus and bambus in preparation for the oral microbiome data, and after spamming several lab members with email, it occurred to me that it would be considerably more considerate to put the information here instead.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Memory Usage Analysis&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|RAM used by the Overlapper (in GB):&lt;br /&gt;
| 1.2 || 2.4 || 4.5 || 8.7 || 17 || 21.5 || ~1.1 GB * (#Reads in Millions) = (Memory Used)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|RAM used by the Tigger (in GB):&lt;br /&gt;
| 3 || 6 || 12 || 25 || 48.4 || (60) || ~3 GB * (#Reads in Millions) = (Memory Used)&lt;br /&gt;
|}&lt;br /&gt;
* The 16 million read assembly data is from Walnut, all other numbers are rough averages from both Privet and Walnut.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Run Time Analysis on Privet&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Overlapper (in min):&lt;br /&gt;
| 3 || 9 || 34 || 130 || (576) || 783 || 2.96 * (#Reads in Millions)&amp;lt;sup&amp;gt;1.87&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Tigger (in min):&lt;br /&gt;
| 9 || 66 || 473 || (3,456) || (25,088) || (47,493) || 9.03 * (#Reads in Millions)&amp;lt;sup&amp;gt;2.86&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|}&lt;br /&gt;
* Privet has 2.4GHz Opteron 850 processors and 32GB of RAM. Minimus is not parallelized and therefore only uses a single core.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
* The models were generated by plotting the data points in open office and fitting a polynomial trendline. The R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; value for each was 1.&lt;br /&gt;
* &#039;&#039;&#039;For reference: There are 1,440 minutes in one day, and 10,080 minutes in one week&#039;&#039;&#039;&lt;br /&gt;
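The fitted models can be evaluated directly to sanity-check the parenthesized predictions. For example, plugging 8 million reads into the Privet Tigger model 9.03 * N^2.86 (a sketch using awk for the arithmetic; the coefficients come from the table above):&lt;br /&gt;

```shell
# Evaluate the fitted Privet Tigger model from the table above:
# run time in minutes = 9.03 * (reads in millions) ^ 2.86
predict_tigger_min() {
  awk -v n="$1" 'BEGIN { printf "%.0f\n", 9.03 * n ^ 2.86 }'
}

predict_tigger_min 8    # about 3,456 minutes, the table's predicted value
```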
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Run Time Analysis on Walnut&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Overlapper (in min):&lt;br /&gt;
| 2.7 || 8 || 27.5 || 102 || (325) || (481.5) || 2.54 * (#Reads in Millions)&amp;lt;sup&amp;gt;1.75&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Tigger (in min):&lt;br /&gt;
| 14 || 81 || 471.5 || (2,752) || (16,006) || (28,212) || 13.99 * (#Reads in Millions)&amp;lt;sup&amp;gt;2.54&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|}&lt;br /&gt;
* Walnut has 2.8GHz Opteron 875 processors and 64GB of RAM. Minimus is not parallelized and therefore only uses a single core.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
* The models were generated by plotting the data points in open office and fitting a polynomial trendline. The R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; value for each was 1.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Other Observations About the Assemblies ====&lt;br /&gt;
* Because of the short read length, every million reads is only 75MB of sequence, which is roughly 10-20x coverage of an average single bacterium. Since these test sets have reads sampled from roughly 100 bacterial genomic sequences, I would expect the per-genome coverage to be on the order of 0.1-0.2x on average.&lt;br /&gt;
* Unsurprisingly, a cursory glance through the contig files shows that each contig comprises only about 2 or 3 reads.&lt;br /&gt;
* The N50 analysis for the smaller assemblies shows that only 2-3 reads are being added to each contig on average, leaving both N50s and average lengths just below 150bp.&lt;br /&gt;
* Therefore, if the complexity of the oral microbiome data is high and/or the contamination with human DNA is extreme (80-95%), the coverage may be extremely low. This may make the use of Mike&#039;s assembler impractical, or at least that&#039;s how I&#039;m going to keep justifying this testing to myself until someone corrects me.&lt;br /&gt;
** &#039;&#039;&#039;Update:&#039;&#039;&#039; Apparently Mike and Dan have talked about this, and somewhere around 75-80bp, the performance of minimus catches up with Mike&#039;s de Bruijn graph assembler anyway. I also did not know that Dan&#039;s map-reduce minimus was running and would be used to assemble the data alongside Mike&#039;s.&lt;br /&gt;
* I learned on Feb. 1, 2010 that the 454 error model allows wild variation with respect to read length, so these assemblies might not actually be representative of the performance with the Illumina data we&#039;re expecting on Feb. 20.&lt;br /&gt;
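The coverage arithmetic above can be made explicit (a back-of-the-envelope sketch; the 5 Mbp average genome size is an assumption):&lt;br /&gt;

```python
reads = 1_000_000
read_len = 75                # bp per read
genome_size = 5_000_000      # bp; assumed average bacterial genome size
n_genomes = 100              # genomes sampled in the test sets

total_bp = reads * read_len                  # 75 Mbp per million reads
single_cov = total_bp / genome_size          # ~15x against a single genome
per_genome_cov = single_cov / n_genomes      # ~0.15x spread over 100 genomes
print(single_cov, per_genome_cov)
```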
&lt;br /&gt;
=== UMIACS Resources ===&lt;br /&gt;
I just discovered that the information listed on the CBCB intranet Resources page is inaccurate and very out of date, so I&#039;m making my own table.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:500px; height:200px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Umiacs Resources&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Machine !! Processor !! Speed !! Cores !! RAM&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Walnut&lt;br /&gt;
| Dual Core AMD Opteron 8220 || 2.8GHz || 16 || 64GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Privet&lt;br /&gt;
| AMD Opteron 850 || 2.4GHz || 4 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Larch&lt;br /&gt;
| AMD Opteron 850 || 2.4GHz || 4 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Sycamore&lt;br /&gt;
| Dual Core AMD Opteron 875 || 1GHz || 8 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Shagbark&lt;br /&gt;
| Intel Core 2 Quad || 2.83GHz || 4 || 4GB&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== February 5, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Meeting with Volker and Sarada on Feb 3 ===&lt;br /&gt;
* Need to teach Sarada how to perform local blast on some sequences they have that aren&#039;t yet in genbank&lt;br /&gt;
* Trying to set up a meeting with Volker to find out for sure if he wants me to work on this project&lt;br /&gt;
&lt;br /&gt;
=== Biomarker Assembly ===&lt;br /&gt;
Bo, Mohammad, and I spent a couple of hours discussing biomarker assembly today. I&#039;m going to try to efficiently summarize our conclusions, but it might be difficult without an easy way to make images. We eventually decided it would be best to attempt several methods in tandem, due to the severe time constraints. The general approach of each method is to fish out and bin reads through one method or another, and then assemble the reads in each bin using minimus. All sequence identity values will be determined using BLASTx.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Preliminary Steps&#039;&#039;&#039;&lt;br /&gt;
* Gather biomarker consensus amino acid sequences&lt;br /&gt;
* Gather amino acid sequences for associated genes from each bacterial genome in refseq&lt;br /&gt;
* Cluster amino acid sequences within each biomarker set&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Sequence Identity Threshold Determination&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
There are 31 biomarkers and about 1,000 bacterial genomes in which they occur. This means that there are 31 sets of 1,000 sequences that are all relatively similar to one another. Because of the sequence similarity and the short read length, it&#039;s possible that a significant number of reads will map equally well to multiple sequences within each biomarker set. For this reason, it is better to allow a single read to be placed in any bin containing a sequence to which the read mapped above some minimum threshold. This will protect against synthetically lowering the coverage of extremely well conserved regions, and with any luck, incorrectly binned reads will simply not be included in the assembly. There are several ways to approach the determination of this threshold.&lt;br /&gt;
* Determine the lowest level of sequence identity between the consensus sequence for each biomarker and any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.&lt;br /&gt;
** The obvious shortcoming of this approach is that the sequence identity between two homologous gene-length sequences can be much lower than between two homologous read-length sequences.&lt;br /&gt;
* Align 75mers to determine the lowest score between any two 75mers in the consensus sequence for each biomarker and the corresponding 75mer in any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.&lt;br /&gt;
** While this solves the problem with the above approach, it is significantly more complicated and the data is going to be here soon.&lt;br /&gt;
* Choose a sequence identity level, or try a few different levels and see which produces the most complete biomarker proteins without creating overly complex graphs.&lt;br /&gt;
** While there&#039;s no good theoretical justification for this approach, it&#039;s probably what we&#039;ll do and it will probably work well enough.&lt;br /&gt;
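Whatever threshold we settle on, the multi-binning rule itself is simple. A minimal sketch (the read and marker names are hypothetical; real identity values would come from BLASTx):&lt;br /&gt;

```python
def bin_reads(hits, threshold):
    """Assign each read to every biomarker bin it hits at or above the
    identity threshold (multi-binning, as described above).

    hits: dict mapping read id -> list of (biomarker_set, percent_identity)
    Returns dict mapping biomarker_set -> set of read ids.
    """
    bins = {}
    for read, alignments in hits.items():
        for marker, identity in alignments:
            if identity >= threshold:
                bins.setdefault(marker, set()).add(read)
    return bins

# Hypothetical example: read r1 maps well to two marker sets, so it lands in both
hits = {"r1": [("rpoB", 92.0), ("gyrB", 90.5)], "r2": [("rpoB", 60.0)]}
print(bin_reads(hits, threshold=85.0))
```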
&lt;br /&gt;
&#039;&#039;&#039;Schemes&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
After making absurdly complicated descriptions of the various approaches, which I felt weren&#039;t very clear, I used Keynote to recreate the diagrams we&#039;d drawn on the white board and then printed them to a PDF. Unfortunately I&#039;m not sure exactly how to embed that in the wiki, so email me at trgibbons@gmail.com if you&#039;re reading this and I&#039;ll send it to you.&lt;br /&gt;
# Marker-wise assembly&lt;br /&gt;
#* Bin reads that align to any sequence in a given marker set, and/or the consensus sequence for that marker&lt;br /&gt;
# Cluster-wise assembly&lt;br /&gt;
#* Cluster protein sequences&lt;br /&gt;
#* Bin reads that align to any protein sequence in a given cluster&lt;br /&gt;
# Gene-wise assembly&lt;br /&gt;
#* Bin reads that align to a particular protein sequence&lt;br /&gt;
* Marker-wise and cluster-wise binning should be better for assembling novel sequences&lt;br /&gt;
* Gene-wise binning should produce higher quality assemblies of markers for known organisms or those that are closely related&lt;br /&gt;
&lt;br /&gt;
== February 12, 2010 ==&lt;br /&gt;
&#039;&#039;&#039;SNOW!!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== February 19, 2010 ==&lt;br /&gt;
Met with James to discuss Metastats. I&#039;m going to attempt the following two updates by the end of the semester (I probably incorrectly described them, but I&#039;ll work it out later):&lt;br /&gt;
# Find a better way to compute the false discovery rate (FDR)&lt;br /&gt;
#* Currently computed by using the lowest 100 p-values from each sample (look at source code)&lt;br /&gt;
#* Need to find a more algebraically rigorous way to compute it&lt;br /&gt;
#* The expected number of false positives scales with the p-value threshold (with 1,000 tests at p=0.05, 50 H_a&#039;s will be incorrectly predicted; so if the null hypothesis is thrown out for 100 features, 50 of them are expected to be incorrect)&lt;br /&gt;
#* James just thinks this sucks and needs to be fixed&lt;br /&gt;
# Compute F-tests for features across all samples&lt;br /&gt;
#* Most requested feature&lt;br /&gt;
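For reference, one standard candidate for a more rigorous FDR procedure is Benjamini-Hochberg (this is an assumption on my part; we haven&#039;t settled on a method):&lt;br /&gt;

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: reject the k smallest p-values,
    where k is the largest rank with p_(k) <= k * alpha / m.
    Returns the indices (into pvals) of rejected hypotheses."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k = rank
    return set(order[:k])

rejected = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6], alpha=0.05)
print(rejected)  # only the two smallest p-values survive at FDR 0.05
```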
&lt;br /&gt;
I spent too much time talking with people about science and not enough time doing it this week...&lt;br /&gt;
&lt;br /&gt;
== March 26, 2010 ==&lt;br /&gt;
I didn&#039;t realize it had been a whole month since I updated. Let&#039;s see, I nearly dropped Dr. Song&#039;s statistical genomics course, but then I didn&#039;t. I &amp;lt;i&amp;gt;did&amp;lt;/i&amp;gt; however learn that we don&#039;t have a class project. So the Metastats upgrades are going on a backburner for now because ZOMGZ I HAVE TO PRESENT MY PROPOSAL BY THE END OF THIS YEAR!!!&lt;br /&gt;
&lt;br /&gt;
My Thesis Project:&lt;br /&gt;
* I&#039;m generally interested in pathways shared between micro-organisms in a community, and also between micro-organisms and their multicellular hosts.&lt;br /&gt;
** I&#039;m particularly interested in studying the metabolic pathways shared between micro-organisms in the human gut, both with each other and their human hosts.&lt;br /&gt;
* James has created time-series models, and is interested in tackling spatial models with me.&lt;br /&gt;
* I would really like to correlate certain metabolic pathways with his modeled relationships.&lt;br /&gt;
&lt;br /&gt;
Volker&#039;s Project:&lt;br /&gt;
* has taken a big hit this week.&lt;br /&gt;
* I&#039;m going to go forward, with Bo&#039;s help, using the plan outlined in my project proposal for Mihai&#039;s biosequence analysis class:&lt;br /&gt;
** Use reciprocal best blast hits to map H37Rv genes to annotated genes in all available virulent and non-virulent strains of mycobacteria&lt;br /&gt;
** Use results from gene mapping to identify a core set of tuberculosis genes, as well as a set of predicted virulence genes&lt;br /&gt;
** Use a variety of comparison schemes to study the effect on the set of predicted virulence genes of the consideration of different subsets of non-virulent strains&lt;br /&gt;
** Use stable virulence prediction to rank genes as virulence targets&lt;br /&gt;
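The reciprocal best hit step is mechanical once the two BLAST runs are done. A minimal sketch (the gene names are hypothetical; inputs would be parsed from tabular BLAST output, keeping each query&#039;s top hit):&lt;br /&gt;

```python
def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Reciprocal best BLAST hits: gene x in genome A maps to gene y in genome B
    iff y is the best hit of x AND x is the best hit of y.

    a_vs_b / b_vs_a: dicts mapping query gene -> (best subject gene, bit score).
    """
    return {(a, b) for a, (b, _) in a_vs_b.items()
            if b in b_vs_a and b_vs_a[b][0] == a}

# Hypothetical toy example with two H37Rv queries; only the reciprocal pair survives
a_vs_b = {"Rv0001": ("MkanA1", 900.0), "Rv0002": ("MkanB7", 450.0)}
b_vs_a = {"MkanA1": ("Rv0001", 880.0), "MkanB7": ("Rv0099", 300.0)}
print(reciprocal_best_hits(a_vs_b, b_vs_a))
```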
&lt;br /&gt;
Metastats:&lt;br /&gt;
* As I mentioned, this is put on hold&lt;br /&gt;
* I intend to pick this back up after I&#039;m done with Volker&#039;s project, as it could be instrumental to my thesis work&lt;br /&gt;
&lt;br /&gt;
== April 2, 2010 ==&lt;br /&gt;
More on my Thesis Project:&lt;br /&gt;
* I read the most recent (Science, Nov. 2009) paper by Gordon and Knight on their ongoing gut microbiota experiments&lt;br /&gt;
* Pretty much every section addressed the potential thesis topics I&#039;d imagined while reading the preceding section. Frustrating, but reaffirming (trying to learn from Mihai on not getting bummed about being scooped).&lt;br /&gt;
* Something that seems interesting and useful to me is to do more rigorous statistical analysis to attempt to correlate particular genes and pathways with the time series and spatial data. I will have to work closely with Bo, at least at first.&lt;br /&gt;
* As a starting point, James has recommended building spatial models similar to his time series models&lt;br /&gt;
* James is essentially mentoring me on my project at this point. It&#039;s pretty excellent.&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Ted-Report&amp;diff=6949</id>
		<title>Cbcb:Pop-Lab:Ted-Report</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Ted-Report&amp;diff=6949"/>
		<updated>2010-03-29T02:30:23Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* March 26, 2010 */ Finished my thoughts for the week&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Older Entries ==&lt;br /&gt;
[[Cbcb:Pop-Lab:Ted-Report-2009 | 2009]]&lt;br /&gt;
&lt;br /&gt;
== January 15, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Minimus Documentation ===&lt;br /&gt;
&lt;br /&gt;
Presently, the only relevant Google hit for &amp;quot;minimus&amp;quot; on the first page of results is the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus#Basic_usage_example sourceforge wiki.] The only example on this page is incomplete and appears to be an early draft made during development.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Ideally, it should be easy to find a complete guide with the general format:&lt;br /&gt;
* Simple use case:&lt;br /&gt;
 `toAmos -s path/to/fastaFile.seq -o path/to/fastaFile.afg`&lt;br /&gt;
 `minimus path/to/fastaFile(prefix)`&lt;br /&gt;
* Necessary tools for set up (toAmos)&lt;br /&gt;
* Other options&lt;br /&gt;
* etc&lt;br /&gt;
&lt;br /&gt;
The description found on the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus/README Minimus/README] page (linked to from the middle of the starting page) is more appropriate, but features use cases that may no longer be common and references another required tool (toAmos) without linking to it or describing how to access it. A description of this tool can be found on the Amos [http://sourceforge.net/apps/mediawiki/amos/index.php?title=File_conversion_utilities File Conversion Utilities] page (again, linked to from the starting page), but it is less organized than what I&#039;ve come to expect from a project page and it is easy to get lost or distracted by the rest of the Amos documentation while trying to piece together the necessary steps for a basic assembly.&lt;br /&gt;
&lt;br /&gt;
=== Comparative Network Analysis pt. 2 ===&lt;br /&gt;
* Meeting with Volker this Friday to discuss how best to apply network alignment to what he&#039;s doing&lt;br /&gt;
* I&#039;m simultaneously trying to find a way to apply my network alignment technique to predicting genes in metagenomic samples&lt;br /&gt;
** I&#039;ve been trying to find a way to get beyond the restriction that my current program requires genes to be annotated with an EC number. A potentially interesting next step may be to use BioPython to BLAST the sequence of each enzyme annotated in every micro-organism in KEGG against a metagenomic library.&lt;br /&gt;
*** The results would be stretches of linked reactions that have been annotated in KEGG pathways.&lt;br /&gt;
*** This method could be applied to contigs just as easily as finished sequences. In a low-coverage scenario, it could be used to identify genes that are probably present but simply weren&#039;t sampled, by showing the presence of the rest of the pathway. In short, this could finally accomplish what Mihai asked me to work on when I showed up.&lt;br /&gt;
*** The major theoretical shortcoming of this approach is that it could only identify relatively well characterized pathways.&lt;br /&gt;
*** The practical challenges of this approach start with obtaining a fairly complete copy of KEGG (which, as we&#039;ve learned, is a mess to parse locally and unusably slow to query through the API), and continue with the computational challenge of such a large-scale BLAST operation.&lt;br /&gt;
** Ask Bo about this when he gets back. He may have already done this.&lt;br /&gt;
&lt;br /&gt;
== January 22, 2010 ==&lt;br /&gt;
* Met with Dan and Sergey to talk about the Minimus-Bambus pipeline&lt;br /&gt;
** Minimus is running fine. I&#039;ve begun characterizing its run-time behavior (see next week&#039;s entry)&lt;br /&gt;
** After some tweaking by Sergey, Bambus was able to finish but did not generate a scaffold. We&#039;re going to talk about this after the meeting on Monday.&lt;br /&gt;
** Sergey had an interesting idea for making a better read simulator:&lt;br /&gt;
*** Error-free reads are cheap and easy to generate. The problem is with the error model.&lt;br /&gt;
*** The &amp;quot;best&amp;quot; tool (that we are aware of) which includes error models is MetaSim, but its error models are years out of date and the authors have been historically unreachable. While Mihai has now shown me how to edit the models in a reasonable way from flat files, allowing one to characterize base substitutions, I&#039;m not convinced it would be faster or easier to write a program that would modify these files than it would be to just write an entirely new program; and given the amount of time I&#039;ve spent trying to use MetaSim, I&#039;m more than ready to walk away from it. Oh yeah, and MetaSim doesn&#039;t work from the command line, so no scripting.&lt;br /&gt;
*** Sergey has pointed out that most companies will assemble &#039;&#039;E. coli&#039;&#039; when they release a new sequencer. Conveniently, there are many high quality assemblies of &#039;&#039;E. coli&#039;&#039; available for reference. It might therefore be possible to generate new error models for these sequencers in an automated fashion by mapping the &#039;&#039;E. coli&#039;&#039; reads to the available reference genomes, collecting the error frequencies, and then using them to mask synthesized reads.&lt;br /&gt;
*** I also talked with Mohammad and Mihai about this, and they also seemed to think it was a pretty good idea. Mihai has proposed having Sergey or Mohammad add the described error-model generator to his read sampler (written in C) when they have time, but not in time for the oral microbiome data.&lt;br /&gt;
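The frequency-collection step Sergey described could look something like this (a sketch under the simplifying assumption of gapless, equal-length read-to-reference alignments):&lt;br /&gt;

```python
from collections import Counter

def substitution_frequencies(alignments):
    """Tally per-reference-base substitution frequencies from (read, reference)
    pairs of equal-length aligned sequences."""
    mismatches = Counter()
    ref_totals = Counter()
    for read, ref in alignments:
        for r, t in zip(read, ref):
            ref_totals[t] += 1
            if r != t:
                mismatches[(t, r)] += 1
    return {pair: n / ref_totals[pair[0]] for pair, n in mismatches.items()}

# Toy example: one A-to-G miscall among six reference A bases
freqs = substitution_frequencies([("ACGG", "ACGA"), ("AAAA", "AAAA")])
print(freqs)
```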
&lt;br /&gt;
* Met with James to discuss my work with Volker&lt;br /&gt;
** Told him about my meeting with Volker and the paper he wanted me to prepare, more or less by myself. The concepts of the paper are these:&lt;br /&gt;
*** Most available genomic sequences of mycobacteria are of a very small subset of highly pathogenic organisms.&lt;br /&gt;
*** Subtractive comparative genomics can be used to identify genes that are potentially responsible for differing phenotypes (such as extreme pathogenicity), but there must be available genomic sequences for closely related organisms with differing phenotypes.&lt;br /&gt;
*** Volker has sequenced 2 more non-pathogenic strains of mycobacteria (&#039;&#039;gastri&#039;&#039;, and &#039;&#039;kansasiiW58&#039;&#039;) with the intention of increasing the effectiveness of these subtractive comparative genomic studies.&lt;br /&gt;
*** The meat of the paper would be comparing the results of subtractive comparative genomic analysis using all currently available strains in RefSeq, with the results from also using the two novel sequences.&lt;br /&gt;
*** The other, smaller publishable portion of this project would be a comparison of &#039;&#039;gastri&#039;&#039; and &#039;&#039;kansasiiW58&#039;&#039; to each other, because they are thought to be extremely closely related and yet have distinct phenotypes (which I&#039;ve now forgotten).&lt;br /&gt;
*** James seemed to think this could make an okay paper, and he confirmed that he had not realized that Volker was looking for someone to do all of the analysis, both computational and biological, with Volker only contributing analysis of the analysis after it was all over.&lt;br /&gt;
** Ended up also discussing his work on differential abundance in populations of microorganisms.&lt;br /&gt;
*** I&#039;m going to start working on taking over and expanding Metastats this semester.&lt;br /&gt;
*** I&#039;m also going to start talking to Bo when he gets back about exactly what he&#039;s doing, and how I might be able to include pathway prediction in my expansion of Metastats without stepping on his toes.&lt;br /&gt;
*** Mihai has given me his approval to focus on this.&lt;br /&gt;
&lt;br /&gt;
* Met with Mihai to discuss working with Volker&lt;br /&gt;
** Explained that rather than looking for someone to do only the complex portions of the computational analysis, Volker was/is looking for someone to do the complete analysis.&lt;br /&gt;
** In exchange, Volker is offering first authorship and, if need be, to split the student&#039;s funding with their primary PI.&lt;br /&gt;
** I think I&#039;m capable of doing this within 3 or 4 months but it would consume my time pretty thoroughly.&lt;br /&gt;
** Mihai agreed that this is a reasonable deal, but noted that I have no personal interest in studying mycobacteria, and it&#039;s therefore unwise of me to invest a bunch of time becoming an expert on an organism I have no interest in continuing to study or work with. I&#039;ve therefore offered to work closely with one of Volker&#039;s graduate students, who could meet with me every week or two. I would be willing to do all of the computational analysis and explain it to them, but they would have to actually look up the potentially interesting genes and relationships I discover and help me keep the analysis biologically interesting and relevant.&lt;br /&gt;
&lt;br /&gt;
* Met with Mihai and Mohammad to discuss our impending huge-ass(embly) problem&lt;br /&gt;
** Talked about strategies for iterative assembly as an approach to assembling intractably large data sets. Most have glaring shortcomings and complications.&lt;br /&gt;
** Discovered Mike Schatz has a map-reduce implementation of an assembler that uses De Bruijn graphs and is better suited to assemblies with high coverage but short read lengths.&lt;br /&gt;
&lt;br /&gt;
== January 29, 2010 ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Minimus Performance Analysis ===&lt;br /&gt;
I&#039;m testing minimus and bambus in preparation for the oral microbiome data, and after spamming several lab members with email, it occurred to me that it would be considerably more considerate to put the information here instead.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Memory Usage Analysis&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|RAM used by the Overlapper (in GB):&lt;br /&gt;
| 1.2 || 2.4 || 4.5 || 8.7 || 17 || 21.5 || ~1.1 GB * (#Reads in Millions) = (Memory Used)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|RAM used by the Tigger (in GB):&lt;br /&gt;
| 3 || 6 || 12 || 25 || 48.4 || (60) || ~3 GB * (#Reads in Millions) = (Memory Used)&lt;br /&gt;
|}&lt;br /&gt;
* The 16 million read assembly data is from Walnut, all other numbers are rough averages from both Privet and Walnut.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Run Time Analysis on Privet&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Overlapper (in min):&lt;br /&gt;
| 3 || 9 || 34 || 130 || (576) || 783 || 2.96 * (#Reads in Millions)&amp;lt;sup&amp;gt;1.87&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Tigger (in min):&lt;br /&gt;
| 9 || 66 || 473 || (3,456) || (25,088) || (47,493) || 9.03 * (#Reads in Millions)&amp;lt;sup&amp;gt;2.86&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|}&lt;br /&gt;
* Privet has 2.4GHz Opteron 850 processors and 32GB of RAM. Minimus is not parallelized and therefore only uses a single core.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
* The models were generated by plotting the data points in OpenOffice and fitting a power-law trendline. The R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; value for each was 1.&lt;br /&gt;
* &#039;&#039;&#039;For reference: There are 1,440 minutes in one day, and 10,080 minutes in one week&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Run Time Analysis on Walnut&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Overlapper (in min):&lt;br /&gt;
| 2.7 || 8 || 27.5 || 102 || (325) || (481.5) || 2.54 * (#Reads in Millions)&amp;lt;sup&amp;gt;1.75&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Tigger (in min):&lt;br /&gt;
| 14 || 81 || 471.5 || (2,752) || (16,006) || (28,212) || 13.99 * (#Reads in Millions)&amp;lt;sup&amp;gt;2.54&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|}&lt;br /&gt;
* Walnut has 2.8GHz Opteron 8220 processors and 64GB of RAM. Minimus is not parallelized and therefore only uses a single core.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
* The models were generated by plotting the data points in OpenOffice and fitting a power-law trendline. The R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; value for each was 1.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Other Observations About the Assemblies ====&lt;br /&gt;
* Because of the short read length, every million reads is only 75Mbp of sequence. This is roughly 10-20x coverage of an average single bacterium. Since these test sets have reads sampled from roughly 100 bacterial genomic sequences, I would expect the coverage to be on the order of 0.1-0.2x per genome on average.&lt;br /&gt;
* Unsurprisingly, a cursory glance through the contig files shows that each contig is composed of only about 2 or 3 reads.&lt;br /&gt;
* The N50 analysis for the smaller assemblies shows that only 2-3 reads are being added to each contig on average, leaving both N50s and average lengths just below 150bp.&lt;br /&gt;
* Therefore if the complexity of the oral microbiome data is high and/or the contamination of human DNA is extreme (80-95%), the coverage may be extremely low. This may make the use of Mike&#039;s assembler impractical, or at least that&#039;s how I&#039;m going to keep justifying this testing to myself until someone corrects me.&lt;br /&gt;
** &#039;&#039;&#039;Update:&#039;&#039;&#039; Apparently Mike and Dan have talked about this, and somewhere around 75-80bp, the performance of minimus catches up with Mike&#039;s de Bruijn graph assembler anyway. I also did not know that Dan&#039;s map-reduce minimus was running and would be used to assemble the data alongside Mike&#039;s.&lt;br /&gt;
* I learned on Feb. 1, 2010 that the 454 error model allows wild variation with respect to read length, so these assemblies might not actually be representative of the performance with the Illumina data we&#039;re expecting on Feb. 20.&lt;br /&gt;
&lt;br /&gt;
=== UMIACS Resources ===&lt;br /&gt;
I just discovered that the information listed on the CBCB intranet Resources page is inaccurate and very out of date, so I&#039;m making my own table.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:500px; height:200px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Umiacs Resources&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Machine !! Processor !! Speed !! Cores !! RAM&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Walnut&lt;br /&gt;
| Dual Core AMD Opteron 8220 || 2.8GHz || 16 || 64GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Privet&lt;br /&gt;
| AMD Opteron 850 || 2.4GHz || 4 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Larch&lt;br /&gt;
| AMD Opteron 850 || 2.4GHz || 4 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Sycamore&lt;br /&gt;
| Dual Core AMD Opteron 875 || 1GHz || 8 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Shagbark&lt;br /&gt;
| Intel Core 2 Quad || 2.83GHz || 4 || 4GB&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== February 5, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Meeting with Volker and Sarada on Feb 3 ===&lt;br /&gt;
* Need to teach Sarada how to perform local blast on some sequences they have that aren&#039;t yet in genbank&lt;br /&gt;
* Trying to set up a meeting with Volker to find out for sure if he wants me to work on this project&lt;br /&gt;
&lt;br /&gt;
=== Biomarker Assembly ===&lt;br /&gt;
Bo, Mohammad, and I spent a couple of hours discussing biomarker assembly today. I&#039;m going to try to efficiently summarize our conclusions, but it might be difficult without an easy way to make images. We eventually decided it would be best to attempt several methods in tandem, due to the severe time constraints. The general approach of each method is to fish out and bin reads through one method or another, and then assemble the reads in each bin using minimus. All sequence identity values will be determined using BLASTx.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Preliminary Steps&#039;&#039;&#039;&lt;br /&gt;
* Gather biomarker consensus amino acid sequences&lt;br /&gt;
* Gather amino acid sequences for associated genes from each bacterial genome in refseq&lt;br /&gt;
* Cluster amino acid sequences within each biomarker set&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Sequence Identity Threshold Determination&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
There are 31 biomarkers and about 1,000 bacterial genomes in which they occur. This means that there are 31 sets of 1,000 sequences that are all relatively similar to one another. Because of the sequence similarity and the short read length, it&#039;s possible that a significant number of reads will map equally well to multiple sequences within each biomarker set. For this reason, it is better to allow a single read to be placed in any bin containing a sequence to which the read mapped above some minimum threshold. This will protect against synthetically lowering the coverage of extremely well conserved regions, and with any luck, incorrectly binned reads will simply not be included in the assembly. There are several ways to approach the determination of this threshold.&lt;br /&gt;
* Determine the lowest level of sequence identity between the consensus sequence for each biomarker and any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.&lt;br /&gt;
** The obvious shortcoming of this approach is that the sequence identity between two homologous gene-length sequences can be much lower than between two homologous read-length sequences.&lt;br /&gt;
* Align 75mers to determine the lowest score between any two 75mers in the consensus sequence for each biomarker and the corresponding 75mer in any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.&lt;br /&gt;
** While this solves the problem with the above approach, it is significantly more complicated and the data is going to be here soon.&lt;br /&gt;
* Choose a sequence identity level, or try a few different levels and see which produces the most complete biomarker proteins without creating overly complex graphs.&lt;br /&gt;
** While there&#039;s no good theoretical justification for this approach, it&#039;s probably what we&#039;ll do and it will probably work well enough.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Schemes&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
After making absurdly complicated descriptions of the various approaches, which I felt weren&#039;t very clear, I used Keynote to recreate the diagrams we&#039;d drawn on the white board and then printed them to a PDF. Unfortunately I&#039;m not sure exactly how to embed that in the wiki, so email me at trgibbons@gmail.com if you&#039;re reading this and I&#039;ll send it to you.&lt;br /&gt;
# Marker-wise assembly&lt;br /&gt;
#* Bin reads that align to any sequence in a given marker set, and/or the consensus sequence for that marker&lt;br /&gt;
# Cluster-wise assembly&lt;br /&gt;
#* Cluster protein sequences&lt;br /&gt;
#* Bin reads that align to any protein sequence in a given cluster&lt;br /&gt;
# Gene-wise assembly&lt;br /&gt;
#* Bin reads that align to a particular protein sequence&lt;br /&gt;
* Marker-wise and cluster-wise binning should be better for assembling novel sequences&lt;br /&gt;
* Gene-wise binning should produce higher quality assemblies of markers for known organisms or those that are closely related&lt;br /&gt;
&lt;br /&gt;
== February 12, 2010 ==&lt;br /&gt;
&#039;&#039;&#039;SNOW!!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== February 19, 2010 ==&lt;br /&gt;
Met with James to discuss Metastats. I&#039;m going to attempt the following two updates by the end of the semester (I probably incorrectly described them, but I&#039;ll work it out later):&lt;br /&gt;
# Find a better way to compute the false discovery rate (FDR)&lt;br /&gt;
#* Currently computed by using the lowest 100 p-values from each sample (look at source code)&lt;br /&gt;
#* Need to find a more algebraically rigorous way to compute it&lt;br /&gt;
#* The expected number of false positives scales with the number of tests: with 1,000 tests, p=0.05 =&amp;gt; roughly 50 null hypotheses will be incorrectly rejected; so if the null hypothesis is rejected for only 100 features, about 50 of those rejections are expected to be false&lt;br /&gt;
#* James just thinks this sucks and needs to be fixed&lt;br /&gt;
# Compute F-tests for features across all samples&lt;br /&gt;
#* Most requested feature&lt;br /&gt;
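For reference, one standard and more defensible procedure is Benjamini-Hochberg, which converts raw p-values into q-values. This is only a sketch of a textbook method with invented p-values, not the current Metastats code.&lt;br /&gt;

```python
# Benjamini-Hochberg step-up FDR control: convert raw p-values into
# monotone q-values. Sketch only; the p-values are invented.

def bh_qvalues(pvals):
    m = len(pvals)
    # walk from the largest p-value down so a running min enforces
    # the monotonicity of the adjusted values
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    q = [0.0] * m
    running = 1.0
    for offset, i in enumerate(order):
        rank = m - offset            # rank of pvals[i] in ascending order
        running = min(running, pvals[i] * m / rank)
        q[i] = running
    return q

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.60, 0.74, 0.91]
qvals = bh_qvalues(pvals)
print([round(q, 3) for q in qvals])
```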
&lt;br /&gt;
I spent too much time talking with people about science and not enough time doing it this week...&lt;br /&gt;
&lt;br /&gt;
== March 26, 2010 ==&lt;br /&gt;
I didn&#039;t realize it had been a whole month since I updated. Let&#039;s see: I nearly dropped Dr. Song&#039;s statistical genomics course, but then I didn&#039;t. I &amp;lt;i&amp;gt;did&amp;lt;/i&amp;gt;, however, learn that we don&#039;t have a class project. So the Metastats upgrades are going on the back burner for now because ZOMGZ I HAVE TO PRESENT MY PROPOSAL BY THE END OF THIS YEAR!!!&lt;br /&gt;
&lt;br /&gt;
My Thesis Project:&lt;br /&gt;
* I&#039;m generally interested in pathways shared between micro-organisms in a community, and also between micro-organisms and their multicellular hosts.&lt;br /&gt;
** I&#039;m particularly interested in studying the metabolic pathways shared between micro-organisms in the human gut, both with each other and their human hosts.&lt;br /&gt;
* James has created time-series models, and is interested in tackling spatial models with me.&lt;br /&gt;
* I would really like to correlate certain metabolic pathways with his modeled relationships.&lt;br /&gt;
&lt;br /&gt;
Volker&#039;s Project:&lt;br /&gt;
* has taken a big hit this week.&lt;br /&gt;
* I&#039;m going to go forward, with Bo&#039;s help, using the plan outlined in my project proposal for Mihai&#039;s biosequence analysis class:&lt;br /&gt;
** Use reciprocal best blast hits to map H37Rv genes to annotated genes in all available virulent and non-virulent strains of mycobacteria&lt;br /&gt;
** Use results from gene mapping to identify a core set of tuberculosis genes, as well as a set of predicted virulence genes&lt;br /&gt;
** Use a variety of comparison schemes to study how considering different subsets of non-virulent strains affects the set of predicted virulence genes&lt;br /&gt;
** Use stable virulence prediction to rank genes as virulence targets&lt;br /&gt;
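The reciprocal best hit step could be sketched like this (the hit tables, gene names, and bit scores are invented; real scores would come from parsed BLAST output):&lt;br /&gt;

```python
# Reciprocal best BLAST hits: gene a maps to gene b iff b is the best
# hit of a in genome B, and a is the best hit of b back in genome A.
# Gene names and bit scores below are invented.

hits_ab = {  # H37Rv query gene: {subject gene in the other strain: bit score}
    "Rv0001": {"Mk0001": 950.0, "Mk0042": 120.0},
    "Rv0002": {"Mk0002": 700.0},
}
hits_ba = {  # reverse direction
    "Mk0001": {"Rv0001": 940.0},
    "Mk0002": {"Rv0002": 690.0, "Rv0099": 710.0},
}

def best_hit(table, query):
    subjects = table.get(query, {})
    return max(subjects, key=subjects.get) if subjects else None

rbh = []
for a in hits_ab:
    b = best_hit(hits_ab, a)
    if b is not None and best_hit(hits_ba, b) == a:
        rbh.append((a, b))
print(rbh)  # only the reciprocal pair survives
```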
&lt;br /&gt;
Metastats:&lt;br /&gt;
* As I mentioned, this is put on hold&lt;br /&gt;
* I intend to pick this back up after I&#039;m done with Volker&#039;s project, as it could be instrumental to my thesis work&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Ted-Report&amp;diff=6945</id>
		<title>Cbcb:Pop-Lab:Ted-Report</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Ted-Report&amp;diff=6945"/>
		<updated>2010-03-26T20:25:07Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* February 19, 2010 */ created an entry for March 26&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Older Entries ==&lt;br /&gt;
[[Cbcb:Pop-Lab:Ted-Report-2009 | 2009]]&lt;br /&gt;
&lt;br /&gt;
== January 15, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Minimus Documentation ===&lt;br /&gt;
&lt;br /&gt;
Presently, the only relevant Google hit for &amp;quot;minimus&amp;quot; on the first page of results is the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus#Basic_usage_example sourceforge wiki.] The only example on this page is incomplete and appears to be an early draft made during development.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Ideally, it should be easy to find a complete guide with the general format:&lt;br /&gt;
* Simple use case:&lt;br /&gt;
 `toAmos -s path/to/fastaFile.seq -o path/to/fastaFile.afg`&lt;br /&gt;
 `minimus path/to/fastaFile(prefix)`&lt;br /&gt;
* Necessary tools for set up (toAmos)&lt;br /&gt;
* Other options&lt;br /&gt;
* etc&lt;br /&gt;
&lt;br /&gt;
The description found on the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus/README Minimus/README] page (linked to from the middle of the starting page) is more appropriate, but features use cases that may no longer be common and references another required tool (toAmos) without linking to it or describing how to access it. A description of this tool can be found on the Amos [http://sourceforge.net/apps/mediawiki/amos/index.php?title=File_conversion_utilities File Conversion Utilities] page (again, linked to from the starting page), but it is less organized than what I&#039;ve come to expect from a project page, and it is easy to get lost or distracted by the rest of the Amos documentation while trying to piece together the necessary steps for a basic assembly.&lt;br /&gt;
&lt;br /&gt;
=== Comparative Network Analysis pt. 2 ===&lt;br /&gt;
* Meeting with Volker this Friday to discuss how best to apply network alignment to what he&#039;s doing&lt;br /&gt;
* I&#039;m simultaneously trying to find a way to apply my network alignment technique to predicting genes in metagenomic samples&lt;br /&gt;
** I&#039;ve been trying to find a way to get beyond the restriction that my current program requires genes to be annotated with an EC number. A potentially interesting next step may be to use BioPython to BLAST the sequence of each enzyme annotated in every micro-organism in KEGG against a metagenomic library.&lt;br /&gt;
*** The results would be stretches of linked reactions that have been annotated in KEGG pathways.&lt;br /&gt;
*** This method could be applied to contigs just as easily as finished sequences. In a scenario with low coverage, it could be used to identify genes that are probably present but simply weren&#039;t sampled, by showing the presence of the rest of the pathway. In short, this could finally accomplish what Mihai asked me to work on when I showed up.&lt;br /&gt;
*** The major theoretical shortcoming of this approach is that it could only identify relatively well characterized pathways.&lt;br /&gt;
*** The practical shortcomings of this approach start with obtaining a fairly complete copy of KEGG (which, as we&#039;ve learned, is a mess to parse locally and unusably slow to call through the API), and continue with the computational challenge of such a large-scale BLAST operation.&lt;br /&gt;
** Ask Bo about this when he gets back. He may have already done this.&lt;br /&gt;
&lt;br /&gt;
== January 22, 2010 ==&lt;br /&gt;
* Met with Dan and Sergey to talk about the Minimus-Bambus pipeline&lt;br /&gt;
** Minimus is running fine. I&#039;ve begun characterizing its run-time behavior (see next week&#039;s entry)&lt;br /&gt;
** After some tweaking by Sergey, Bambus was able to finish but did not generate a scaffold. We&#039;re going to talk about this after the meeting on Monday.&lt;br /&gt;
** Sergey had an interesting idea for making a better read simulator:&lt;br /&gt;
*** Error-free reads are cheap and easy to generate. The problem is with the error model.&lt;br /&gt;
*** The &amp;quot;best&amp;quot; tool (that we are aware of) which includes error models is MetaSim, but its error models are years out of date and the author has been historically unreachable. While Mihai has now shown me a reasonable way to edit the models via flat files that characterize base substitutions, I&#039;m not convinced that writing a program to modify these files would be faster or easier than writing an entirely new program; and given the amount of time I&#039;ve spent trying to use MetaSim, I&#039;m more than ready to walk away from it. Oh yeah, and MetaSim doesn&#039;t work from the command line, so no scripting.&lt;br /&gt;
*** Sergey has pointed out that most companies will assemble &#039;&#039;E. coli&#039;&#039; when they release a new sequencer. Conveniently, there are many high quality assemblies of &#039;&#039;E. coli&#039;&#039; available for reference. It might therefore be possible to generate new error models for these sequencers in an automated fashion by mapping the &#039;&#039;E. coli&#039;&#039; reads to the available reference genomes, collecting the error frequencies, and then using them to mask synthesized reads.&lt;br /&gt;
*** I also talked with Mohammad and Mihai about this, and they also seemed to think it was a pretty good idea. Mihai has proposed having Sergey or Mohammad add the described error-model generator to his read sampler (written in C) when they have time, though not in time for the oral microbiome data.&lt;br /&gt;
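A rough sketch of Sergey&#039;s idea: tally per-cycle substitution rates from reads mapped to a finished reference, then replay those rates onto error-free synthetic reads. Everything below is invented toy data; a real version would parse actual alignments.&lt;br /&gt;

```python
import random

# Per-cycle substitution rates tallied from reads mapped against a
# finished reference (counts invented here), replayed onto error-free
# synthetic reads. A real version would parse actual alignments.

read_len = 5                        # toy length instead of 75bp
mismatches = [2, 1, 1, 4, 30]       # mismatch count per cycle (invented)
aligned = [1000] * read_len         # aligned bases per cycle (invented)
rate = [m / n for m, n in zip(mismatches, aligned)]

def add_errors(read, rate, rng):
    out = []
    for pos, base in enumerate(read):
        # decide "error at this cycle?" with the empirical probability
        err = rng.choices([True, False], weights=[rate[pos], 1.0 - rate[pos]])[0]
        out.append(rng.choice([b for b in "ACGT" if b != base]) if err else base)
    return "".join(out)

rng = random.Random(0)
print(add_errors("ACGTA", rate, rng))
```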
&lt;br /&gt;
* Met with James to discuss my work with Volker&lt;br /&gt;
** Told him about my meeting with Volker and the paper he wanted me to prepare, more or less by myself. The concept of the paper is this:&lt;br /&gt;
*** Most available genomic sequences of mycobacteria are of a very small subset of highly pathogenic organisms.&lt;br /&gt;
*** Subtractive comparative genomics can be used to identify genes that are potentially responsible for differing phenotypes (such as extreme pathogenicity), but genomic sequences must be available for closely related organisms with differing phenotypes.&lt;br /&gt;
*** Volker has sequenced 2 more non-pathogenic strains of mycobacteria (&#039;&#039;gastri&#039;&#039;, and &#039;&#039;kansasiiW58&#039;&#039;) with the intention of increasing the effectiveness of these subtractive comparative genomic studies.&lt;br /&gt;
*** The meat of the paper would be comparing the results of subtractive comparative genomic analysis using all currently available strains in RefSeq, with the results from also using the two novel sequences.&lt;br /&gt;
*** The other, smaller publishable portion of this project would be a comparison of &#039;&#039;gastri&#039;&#039; and &#039;&#039;kansasiiW58&#039;&#039; to each other, because they are thought to be extremely closely related and yet have distinct phenotypes (which I&#039;ve now forgotten).&lt;br /&gt;
*** James seemed to think this could make an okay paper, and he confirmed that he had not realized Volker was looking for someone to do all of the analysis, both computational and biological, with Volker only contributing analysis of the analysis after it was all over.&lt;br /&gt;
** Ended up also discussing his work on differential abundance in populations of microorganisms.&lt;br /&gt;
*** I&#039;m going to start working on taking over and expanding Metastats this semester.&lt;br /&gt;
*** I&#039;m also going to start talking to Bo when he gets back about exactly what he&#039;s doing, and how I might be able to include pathway prediction in my expansion of Metastats without stepping on his toes.&lt;br /&gt;
*** Mihai has given me his approval to focus on this.&lt;br /&gt;
&lt;br /&gt;
* Met with Mihai to discuss working with Volker&lt;br /&gt;
** Explained that rather than looking for someone to do only the complex portions of the computational analysis, Volker was/is looking for someone to do the complete analysis.&lt;br /&gt;
** In exchange, Volker is offering first authorship and, if need be, to split the student&#039;s funding with their primary PI.&lt;br /&gt;
** I think I&#039;m capable of doing this within 3 or 4 months but it would consume my time pretty thoroughly.&lt;br /&gt;
** Mihai agreed that this is a reasonable deal, but noted that I have no personal interest in studying mycobacteria, and that it&#039;s therefore unwise of me to invest a bunch of time becoming an expert on an organism I have no interest in continuing to study or work with. I&#039;ve therefore offered to work closely with one of Volker&#039;s graduate students, who could meet with me every week or two. I would be willing to do all of the computational analysis and explain it to them, but they would have to actually look up the potentially interesting genes and relationships I discover and help me keep the analysis biologically interesting and relevant.&lt;br /&gt;
&lt;br /&gt;
* Met with Mihai and Mohammad to discuss our impending huge-ass(embly) problem&lt;br /&gt;
** Talked about strategies for iterative assembly as an approach to assembling intractably large data sets. Most have glaring shortcomings and complications.&lt;br /&gt;
** Discovered that Mike Schatz has a map-reduce implementation of an assembler that uses de Bruijn graphs and is better suited to assemblies with high coverage but short read lengths.&lt;br /&gt;
&lt;br /&gt;
== January 29, 2010 ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Minimus Performance Analysis ===&lt;br /&gt;
I&#039;m testing minimus and bambus in preparation for the oral microbiome data, and after spamming several lab members with email, it occurred to me that it would be considerably more considerate to put the information here instead.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Memory Usage Analysis&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|RAM used by the Overlapper (in GB):&lt;br /&gt;
| 1.2 || 2.4 || 4.5 || 8.7 || 17 || 21.5 || ~1.1 GB * (#Reads in Millions) = (Memory Used)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|RAM used by the Tigger (in GB):&lt;br /&gt;
| 3 || 6 || 12 || 25 || 48.4 || (60) || ~3 GB * (#Reads in Millions) = (Memory Used)&lt;br /&gt;
|}&lt;br /&gt;
* The 16 million read assembly data is from Walnut, all other numbers are rough averages from both Privet and Walnut.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Run Time Analysis on Privet&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Overlapper (in min):&lt;br /&gt;
| 3 || 9 || 34 || 130 || (576) || 783 || 2.96 * (#Reads in Millions)&amp;lt;sup&amp;gt;1.87&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Tigger (in min):&lt;br /&gt;
| 9 || 66 || 473 || (3,456) || (25,088) || (47,493) || 9.03 * (#Reads in Millions)&amp;lt;sup&amp;gt;2.86&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|}&lt;br /&gt;
* Privet has 2.4GHz Opteron 850 processors and 32GB of RAM. Minimus is not parallelized and therefore only uses a single core.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
* The models were generated by plotting the data points in OpenOffice and fitting a power-law trendline. The R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; value for each was 1.&lt;br /&gt;
* &#039;&#039;&#039;For reference: There are 1,440 minutes in one day, and 10,080 minutes in one week&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Run Time Analysis on Walnut&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Overlapper (in min):&lt;br /&gt;
| 2.7 || 8 || 27.5 || 102 || (325) || (481.5) || 2.54 * (#Reads in Millions)&amp;lt;sup&amp;gt;1.75&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Tigger (in min):&lt;br /&gt;
| 14 || 81 || 471.5 || (2,752) || (16,006) || (28,212) || 13.99 * (#Reads in Millions)&amp;lt;sup&amp;gt;2.54&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|}&lt;br /&gt;
* Walnut has 2.8GHz Opteron 875 processors and 64GB of RAM. Minimus is not parallelized and therefore only uses a single core.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
* The models were generated by plotting the data points in OpenOffice and fitting a power-law trendline. The R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; value for each was 1.&lt;br /&gt;
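Since the fitted models are simple power laws, extrapolating them is one line of arithmetic. The coefficients below are copied from the tables above; everything else is just evaluation.&lt;br /&gt;

```python
# Evaluate the fitted power-law run-time models from the tables above:
# minutes = a * (reads in millions) ** b

models = {
    ("privet", "overlapper"): (2.96, 1.87),
    ("privet", "tigger"): (9.03, 2.86),
    ("walnut", "overlapper"): (2.54, 1.75),
    ("walnut", "tigger"): (13.99, 2.54),
}

def predicted_minutes(machine, stage, million_reads):
    a, b = models[(machine, stage)]
    return a * million_reads ** b

# e.g. the predicted 20M-read Tigger run on Privet, in days
minutes = predicted_minutes("privet", "tigger", 20)
print(round(minutes), "minutes, or about", round(minutes / 1440, 1), "days")
```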
&lt;br /&gt;
&lt;br /&gt;
==== Other Observations About the Assemblies ====&lt;br /&gt;
* Because of the short read length, every million reads is only 75Mbp of sequence. This is roughly 10-20x coverage of an average single bacterium. Since these test sets have reads sampled from roughly 100 bacterial genomic sequences, I would expect the coverage to be on the order of 0.1x on average.&lt;br /&gt;
* Unsurprisingly, a cursory glance through the contig files shows that each contig comprises only about 2 or 3 reads.&lt;br /&gt;
* The N50 analysis for the smaller assemblies shows that only 2-3 reads are being added to each contig on average, leaving both N50s and average lengths just below 150bp.&lt;br /&gt;
* Therefore if the complexity of the oral microbiome data is high and/or the contamination of human DNA is extreme (80-95%), the coverage may be extremely low. This may make the use of Mike&#039;s assembler impractical, or at least that&#039;s how I&#039;m going to keep justifying this testing to myself until someone corrects me.&lt;br /&gt;
** &#039;&#039;&#039;Update:&#039;&#039;&#039; Apparently Mike and Dan have talked about this, and somewhere around 75-80bp, the performance of minimus catches up with Mike&#039;s de Bruijn graph assembler anyway. I also did not know that Dan&#039;s map-reduce minimus was running and would be used to assemble the data alongside Mike&#039;s.&lt;br /&gt;
* I learned on Feb. 1, 2010 that the 454 error model allows wild variation with respect to read length, so these assemblies might not actually be representative of the performance with the Illumina data we&#039;re expecting on Feb. 20&lt;br /&gt;
&lt;br /&gt;
=== UMIACS Resources ===&lt;br /&gt;
I just discovered that the information listed on the CBCB intranet Resources page is inaccurate and very out of date, so I&#039;m making my own table.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:500px; height:200px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Umiacs Resources&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Machine !! Processor !! Speed !! Cores !! RAM&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Walnut&lt;br /&gt;
| Dual Core AMD Opteron 8220 || 2.8GHz || 16 || 64GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Privet&lt;br /&gt;
| AMD Opteron 850 || 2.4GHz || 4 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Larch&lt;br /&gt;
| AMD Opteron 850 || 2.4GHz || 4 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Sycamore&lt;br /&gt;
| Dual Core AMD Opteron 875 || 1GHz || 8 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Shagbark&lt;br /&gt;
| Intel Core 2 Quad || 2.83GHz || 4 || 4GB&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== February 5, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Meeting with Volker and Sarada on Feb 3 ===&lt;br /&gt;
* Need to teach Sarada how to run local BLAST searches on some sequences they have that aren&#039;t yet in GenBank&lt;br /&gt;
* Trying to set up a meeting with Volker to find out for sure if he wants me to work on this project&lt;br /&gt;
&lt;br /&gt;
=== Biomarker Assembly ===&lt;br /&gt;
Bo, Mohammad, and I spent a couple hours discussing biomarker assembly today. I&#039;m going to try to efficiently summarize our conclusions, but it might be difficult without an easy way to make images. We eventually decided it would be best to attempt several methods in tandem, due to the severe time constraints. The general approach of each method is to fish out and bin reads by one method or another, and then assemble the reads in each bin using minimus. All sequence identity values will be determined using BLASTx.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Preliminary Steps&#039;&#039;&#039;&lt;br /&gt;
* Gather biomarker consensus amino acid sequences&lt;br /&gt;
* Gather amino acid sequences for associated genes from each bacterial genome in refseq&lt;br /&gt;
* Cluster amino acid sequences within each biomarker set&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Sequence Identity Threshold Determination&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
There are 31 biomarkers and about 1,000 bacterial genomes in which they occur. This means that there are 31 sets of roughly 1,000 sequences that are all relatively similar to one another. Because of the sequence similarity and the short read length, it&#039;s possible that a significant number of reads will map equally well to multiple sequences within each biomarker set. For this reason, it is better to allow a single read to be placed in any bin containing a sequence to which the read mapped above some minimum threshold. This will protect against artificially lowering the coverage of extremely well conserved regions, and with any luck, incorrectly binned reads will simply not be included in the assembly. There are several ways to approach the determination of this threshold.&lt;br /&gt;
* Determine the lowest level of sequence identity between the consensus sequence for each biomarker and any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.&lt;br /&gt;
** The obvious shortcoming of this approach is that the sequence identity between two homologous gene-length sequences can be much lower than between two homologous read-length sequences.&lt;br /&gt;
* Align 75mers to determine the lowest score between any two 75mers in the consensus sequence for each biomarker and the corresponding 75mer in any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.&lt;br /&gt;
** While this solves the problem with the above approach, it is significantly more complicated and the data is going to be here soon.&lt;br /&gt;
* Choose a sequence identity level, or try a few different levels and see which produces the most complete biomarker proteins without creating overly complex graphs.&lt;br /&gt;
** While there&#039;s no good theoretical justification for this approach, it&#039;s probably what we&#039;ll do and it will probably work well enough.&lt;br /&gt;
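The 75mer variant of the threshold search could be sketched like this (toy sequences and a window of 4 instead of 75; a real version would use BLASTx scores rather than ungapped identity):&lt;br /&gt;

```python
# Slide a window of length k along an alignment of the consensus and a
# member sequence; the minimum windowed identity is a candidate read
# binning threshold. Toy sequences, k=4 instead of 75, ungapped only.

def min_window_identity(consensus, member, k):
    assert len(consensus) == len(member)
    worst = 1.0
    for i in range(len(consensus) - k + 1):
        a, b = consensus[i:i + k], member[i:i + k]
        worst = min(worst, sum(x == y for x, y in zip(a, b)) / k)
    return worst

consensus = "MKVLADERTQ"  # invented consensus fragment
member = "MKVLGDERSQ"     # invented member of the same marker set
print(min_window_identity(consensus, member, 4))
```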
&lt;br /&gt;
&#039;&#039;&#039;Schemes&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
After writing absurdly complicated descriptions of the various approaches that I felt weren&#039;t very clear, I used Keynote to recreate the diagrams we&#039;d drawn on the whiteboard and printed them to a PDF. Unfortunately, I&#039;m not sure how to embed a PDF in the wiki, so if you&#039;re reading this, email me at trgibbons@gmail.com and I&#039;ll send it to you.&lt;br /&gt;
# Marker-wise assembly&lt;br /&gt;
#* Bin reads that align to any sequence in a given marker set, and/or the consensus sequence for that marker&lt;br /&gt;
# Cluster-wise assembly&lt;br /&gt;
#* Cluster protein sequences&lt;br /&gt;
#* Bin reads that align to any protein sequence in a given cluster&lt;br /&gt;
# Gene-wise assembly&lt;br /&gt;
#* Bin reads that align to a particular protein sequence&lt;br /&gt;
* Marker-wise and cluster-wise binning should be better for assembling novel sequences&lt;br /&gt;
* Gene-wise binning should produce higher quality assemblies of markers for known organisms or those that are closely related&lt;br /&gt;
&lt;br /&gt;
== February 12, 2010 ==&lt;br /&gt;
&#039;&#039;&#039;SNOW!!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== February 19, 2010 ==&lt;br /&gt;
Met with James to discuss Metastats. I&#039;m going to attempt the following two updates by the end of the semester (I may have described them incorrectly, but I&#039;ll work that out later):&lt;br /&gt;
# Find a better way to compute the false discovery rate (FDR)&lt;br /&gt;
#* Currently computed by using the lowest 100 p-values from each sample (look at source code)&lt;br /&gt;
#* Need to find a more algebraically rigorous way to compute it&lt;br /&gt;
#* The expected number of false positives scales with the number of tests: with 1,000 tests, p=0.05 =&amp;gt; roughly 50 null hypotheses will be incorrectly rejected; so if the null hypothesis is rejected for only 100 features, about 50 of those rejections are expected to be false&lt;br /&gt;
#* James just thinks this sucks and needs to be fixed&lt;br /&gt;
# Compute F-tests for features across all samples&lt;br /&gt;
#* Most requested feature&lt;br /&gt;
&lt;br /&gt;
I spent too much time talking with people about science and not enough time doing it this week...&lt;br /&gt;
&lt;br /&gt;
== March 26, 2010 ==&lt;br /&gt;
I didn&#039;t realize it had been a whole month since I updated. Let&#039;s see: I nearly dropped Dr. Song&#039;s statistical genomics course, but then I didn&#039;t. I &amp;lt;i&amp;gt;did&amp;lt;/i&amp;gt;, however, learn that we don&#039;t have a class project. So the Metastats upgrades are going on the back burner for now because ZOMGZ I HAVE TO PRESENT MY PROPOSAL BY THE END OF THIS YEAR!!!&lt;br /&gt;
&lt;br /&gt;
My Thesis Project:&lt;br /&gt;
* I&#039;m generally interested in pathways shared between micro-organisms in a community, and also between micro-organisms and their multicellular hosts.&lt;br /&gt;
** I&#039;m particularly interested in studying the metabolic pathways shared between micro-organisms in the human gut, both with each other and their human hosts.&lt;br /&gt;
* James has created time-series models, and is interested in tackling spatial models with me.&lt;br /&gt;
* I would really like to correlate certain metabolic pathways with his modeled relationships.&lt;br /&gt;
&lt;br /&gt;
Volker&#039;s Project:&lt;br /&gt;
* has taken a big hit this week.&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Ted-Report&amp;diff=6745</id>
		<title>Cbcb:Pop-Lab:Ted-Report</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Ted-Report&amp;diff=6745"/>
		<updated>2010-02-19T19:52:42Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* February 19, 2010 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Older Entries ==&lt;br /&gt;
[[Cbcb:Pop-Lab:Ted-Report-2009 | 2009]]&lt;br /&gt;
&lt;br /&gt;
== January 15, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Minimus Documentation ===&lt;br /&gt;
&lt;br /&gt;
Presently, the only relevant Google hit for &amp;quot;minimus&amp;quot; on the first page of results is the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus#Basic_usage_example sourceforge wiki.] The only example on this page is incomplete and appears to be an early draft made during development.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Ideally, it should be easy to find a complete guide with the general format:&lt;br /&gt;
* Simple use case:&lt;br /&gt;
 `toAmos -s path/to/fastaFile.seq -o path/to/fastaFile.afg`&lt;br /&gt;
 `minimus path/to/fastaFile(prefix)`&lt;br /&gt;
* Necessary tools for set up (toAmos)&lt;br /&gt;
* Other options&lt;br /&gt;
* etc&lt;br /&gt;
&lt;br /&gt;
The description found on the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus/README Minimus/README] page (linked to from the middle of the starting page) is more appropriate, but features use cases that may no longer be common and references another required tool (toAmos) without linking to it or describing how to access it. A description of this tool can be found on the Amos [http://sourceforge.net/apps/mediawiki/amos/index.php?title=File_conversion_utilities File Conversion Utilities] page (again, linked to from the starting page), but it is less organized than what I&#039;ve come to expect from a project page, and it is easy to get lost or distracted by the rest of the Amos documentation while trying to piece together the necessary steps for a basic assembly.&lt;br /&gt;
&lt;br /&gt;
=== Comparative Network Analysis pt. 2 ===&lt;br /&gt;
* Meeting with Volker this Friday to discuss how best to apply network alignment to what he&#039;s doing&lt;br /&gt;
* I&#039;m simultaneously trying to find a way to apply my network alignment technique to predicting genes in metagenomic samples&lt;br /&gt;
** I&#039;ve been trying to find a way to get beyond the restriction that my current program requires genes to be annotated with an EC number. A potentially interesting next step may be to use BioPython to BLAST the sequence of each enzyme annotated in every micro-organism in KEGG against a metagenomic library.&lt;br /&gt;
*** The results would be stretches of linked reactions that have been annotated in KEGG pathways.&lt;br /&gt;
*** This method could be applied to contigs just as easily as finished sequences. In a scenario with low coverage, it could be used to identify genes that are probably present but simply weren&#039;t sampled, by showing the presence of the rest of the pathway. In short, this could finally accomplish what Mihai asked me to work on when I showed up.&lt;br /&gt;
*** The major theoretical shortcoming of this approach is that it could only identify relatively well characterized pathways.&lt;br /&gt;
*** The practical shortcomings of this approach start with obtaining a fairly complete copy of KEGG (which, as we&#039;ve learned, is a mess to parse locally and unusably slow to call through the API), and continue with the computational challenge of such a large-scale BLAST operation.&lt;br /&gt;
** Ask Bo about this when he gets back. He may have already done this.&lt;br /&gt;
&lt;br /&gt;
== January 22, 2010 ==&lt;br /&gt;
* Met with Dan and Sergey to talk about the Minimus-Bambus pipeline&lt;br /&gt;
** Minimus is running fine. I&#039;ve begun characterizing its run-time behavior (see next week&#039;s entry)&lt;br /&gt;
** After some tweaking by Sergey, Bambus was able to finish but did not generate a scaffold. We&#039;re going to talk about this after the meeting on Monday.&lt;br /&gt;
** Sergey had an interesting idea for making a better read simulator:&lt;br /&gt;
*** Error-free reads are cheap and easy to generate. The problem is with the error model.&lt;br /&gt;
*** The &amp;quot;best&amp;quot; tool (that we are aware of) which includes error models is MetaSim, but its error models are years out of date and the author has been historically unreachable. While Mihai has now shown me a reasonable way to edit the models via flat files that characterize base substitutions, I&#039;m not convinced that writing a program to modify these files would be faster or easier than writing an entirely new program; and given the amount of time I&#039;ve spent trying to use MetaSim, I&#039;m more than ready to walk away from it. Oh yeah, and MetaSim doesn&#039;t work from the command line, so no scripting.&lt;br /&gt;
*** Sergey has pointed out that most companies will assemble &#039;&#039;E. coli&#039;&#039; when they release a new sequencer. Conveniently, there are many high quality assemblies of &#039;&#039;E. coli&#039;&#039; available for reference. It might therefore be possible to generate new error models for these sequencers in an automated fashion by mapping the &#039;&#039;E. coli&#039;&#039; reads to the available reference genomes, collecting the error frequencies, and then using them to mask synthesized reads.&lt;br /&gt;
*** I also talked with Mohammad and Mihai about this; they also seemed to think it was a pretty good idea. Mihai has proposed having Sergey or Mohammad add the described error-model generator to his read sampler (written in C) when they have time, but not in preparation for the oral microbiome data.&lt;br /&gt;
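The error-model idea above can be sketched in a few lines of Python. This is a toy sketch under assumed inputs (gap-free read/reference alignments of equal length); a real implementation would work from actual mappings of &#039;&#039;E. coli&#039;&#039; reads to a trusted reference assembly.&lt;br /&gt;

```python
import random

def substitution_rates(read_ref_pairs):
    # Estimate a per-position substitution rate from reads aligned to a
    # trusted reference. Input is a list of (read, reference) string pairs
    # of equal length (gap-free, for simplicity).
    subs, totals = [], []
    for read, ref in read_ref_pairs:
        while len(totals) < len(read):
            subs.append(0)
            totals.append(0)
        for i, (a, b) in enumerate(zip(read, ref)):
            totals[i] += 1
            if a != b:
                subs[i] += 1
    return [s / t for s, t in zip(subs, totals)]

def mask_read(read, rates, alphabet="ACGT"):
    # Apply the empirical error model to an error-free synthetic read.
    out = []
    for i, base in enumerate(read):
        if random.random() < rates[i]:
            out.append(random.choice([b for b in alphabet if b != base]))
        else:
            out.append(base)
    return "".join(out)
```

Error-free reads stay cheap to generate; only the masking step changes as new sequencers (and new &#039;&#039;E. coli&#039;&#039; control runs) appear.&lt;br /&gt;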
&lt;br /&gt;
* Met with James to discuss my work with Volker&lt;br /&gt;
** Told him about my meeting with Volker and the paper he wanted me to prepare, more or less by myself. The concepts of the papers are these:&lt;br /&gt;
*** Most available genomic sequences of mycobacteria are of a very small subset of highly pathogenic organisms.&lt;br /&gt;
*** Subtractive comparative genomics can be used to identify genes that are potentially responsible for differing phenotypes (such as extreme pathogenicity), but genomic sequences must be available for closely related organisms with differing phenotypes.&lt;br /&gt;
*** Volker has sequenced 2 more non-pathogenic strains of mycobacteria (&#039;&#039;gastri&#039;&#039;, and &#039;&#039;kansasiiW58&#039;&#039;) with the intention of increasing the effectiveness of these subtractive comparative genomic studies.&lt;br /&gt;
*** The meat of the paper would be comparing the results of subtractive comparative genomic analysis using all currently available strains in RefSeq, with the results from also using the two novel sequences.&lt;br /&gt;
*** The other, smaller publishable portion of this project would be a comparison of &#039;&#039;gastri&#039;&#039; and &#039;&#039;kansasiiW58&#039;&#039; to each other, because they are thought to be extremely closely related and yet have distinct phenotypes (which I&#039;ve now forgotten).&lt;br /&gt;
*** James seemed to think this could make an okay paper, and he confirmed that he had not realized Volker was looking for someone to do all of the analysis, both computational and biological, with Volker only contributing analysis of the analysis after it was all over.&lt;br /&gt;
** Ended up also discussing his work on differential abundance in populations of microorganisms.&lt;br /&gt;
*** I&#039;m going to start working on taking over and expanding Metastats this semester.&lt;br /&gt;
*** I&#039;m also going to start talking to Bo when he gets back about exactly what he&#039;s doing, and how I might be able to include pathway prediction in my expansion of Metastats without stepping on his toes.&lt;br /&gt;
*** Mihai has given me his approval to focus on this.&lt;br /&gt;
&lt;br /&gt;
* Met with Mihai to discuss working with Volker&lt;br /&gt;
** Explained that rather than looking for someone to do only the complex portions of the computational analysis, Volker was/is looking for someone to do the complete analysis.&lt;br /&gt;
** In exchange, Volker is offering first authorship and, if need be, to split the student&#039;s funding with their primary PI.&lt;br /&gt;
** I think I&#039;m capable of doing this within 3 or 4 months but it would consume my time pretty thoroughly.&lt;br /&gt;
** Mihai agreed that this is a reasonable deal, but noted that I have no personal interest in studying mycobacteria, and it&#039;s therefore unwise of me to invest a bunch of time becoming an expert on an organism I have no interest in continuing to study or work with. I&#039;ve therefore offered to have one of Volker&#039;s graduate students work closely with me, meeting every week or two. I would be willing to do all of the computational analysis and explain it to them, but they would have to actually look up potentially interesting genes and relationships I discover and help me keep the analysis biologically interesting and relevant.&lt;br /&gt;
&lt;br /&gt;
* Met with Mihai and Mohammad to discuss our impending huge-ass(embly) problem&lt;br /&gt;
** Talked about strategies for iterative assembly as an approach to assembling intractably large data sets. Most have glaring shortcomings and complications.&lt;br /&gt;
** Discovered Mike Schatz has a map-reduce implementation of an assembler that uses de Bruijn graphs and is better suited to assemblies with high coverage but short read lengths.&lt;br /&gt;
&lt;br /&gt;
== January 29, 2010 ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Minimus Performance Analysis ===&lt;br /&gt;
I&#039;m testing minimus and bambus in preparation for the oral microbiome data, and after spamming several lab members with email, it occurred to me that it would be considerably more considerate to put the information here instead.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Memory Usage Analysis&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|RAM used by the Overlapper (in GB):&lt;br /&gt;
| 1.2 || 2.4 || 4.5 || 8.7 || 17 || 21.5 || ~1.1 GB * (#Reads in Millions) = (Memory Used)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|RAM used by the Tigger (in GB):&lt;br /&gt;
| 3 || 6 || 12 || 25 || 48.4 || (60) || ~3 GB * (#Reads in Millions) = (Memory Used)&lt;br /&gt;
|}&lt;br /&gt;
* The 16 million read assembly data is from Walnut; all other numbers are rough averages from both Privet and Walnut.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
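The linear memory models in the table are simple enough to evaluate directly; a quick sanity check of the fits (the ~1.1 and ~3 GB-per-million-read slopes come straight from the Model column above):&lt;br /&gt;

```python
def predict_memory_gb(reads_millions, gb_per_million_reads):
    # Linear fit from the table: memory scales with the number of reads.
    return gb_per_million_reads * reads_millions

# The tigger slope of ~3 GB per million reads reproduces the
# parenthesized 60 GB prediction for 20 million reads:
tigger_at_20m = predict_memory_gb(20, 3.0)
```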
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Run Time Analysis on Privet&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Overlapper (in min):&lt;br /&gt;
| 3 || 9 || 34 || 130 || (576) || 783 || 2.96 * (#Reads in Millions)&amp;lt;sup&amp;gt;1.87&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Tigger (in min):&lt;br /&gt;
| 9 || 66 || 473 || (3,456) || (25,088) || (47,493) || 9.03 * (#Reads in Millions)&amp;lt;sup&amp;gt;2.86&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|}&lt;br /&gt;
* Privet has 2.4GHz Opteron 850 processors and 32GB of RAM. Minimus is not parallelized and therefore only uses a single core.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
* The models were generated by plotting the data points in OpenOffice and fitting a polynomial trendline. The R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; value for each was 1.&lt;br /&gt;
* &#039;&#039;&#039;For reference: There are 1,440 minutes in one day, and 10,080 minutes in one week&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Run Time Analysis on Walnut&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Overlapper (in min):&lt;br /&gt;
| 2.7 || 8 || 27.5 || 102 || (325) || (481.5) || 2.54 * (#Reads in Millions)&amp;lt;sup&amp;gt;1.75&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Tigger (in min):&lt;br /&gt;
| 14 || 81 || 471.5 || (2,752) || (16,006) || (28,212) || 13.99 * (#Reads in Millions)&amp;lt;sup&amp;gt;2.54&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|}&lt;br /&gt;
* Walnut has 2.8GHz Opteron 875 processors and 64GB of RAM. Minimus is not parallelized and therefore only uses a single core.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
* The models were generated by plotting the data points in OpenOffice and fitting a polynomial trendline. The R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; value for each was 1.&lt;br /&gt;
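The fitted power-law run-time models from both tables can be evaluated the same way; a small sketch reproducing the parenthesized predictions (coefficients copied from the Model columns above):&lt;br /&gt;

```python
def runtime_min(reads_millions, a, b):
    # Power-law fit from the tables: run time = a * n**b minutes,
    # where n is the number of reads in millions.
    return a * reads_millions ** b

# Privet tigger at 8M reads (a=9.03, b=2.86) lands near the predicted
# 3,456 minutes; Walnut tigger (a=13.99, b=2.54) near 2,752 minutes.
privet = runtime_min(8, 9.03, 2.86)
walnut = runtime_min(8, 13.99, 2.54)
```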
&lt;br /&gt;
&lt;br /&gt;
==== Other Observations About the Assemblies ====&lt;br /&gt;
* Because of the short read length, every million reads is only 75MB of sequence, which is roughly 10-20x coverage of an average single bacterium. Since these test sets have reads sampled from roughly 100 bacterial genomic sequences, I would expect the per-genome coverage to be on the order of 0.1-0.2x on average.&lt;br /&gt;
* Unsurprisingly, a cursory glance through the contig files shows that each contig is composed of only about 2 or 3 reads.&lt;br /&gt;
* The N50 analysis for the smaller assemblies shows that only 2-3 reads are being added to each contig on average, leaving both N50s and average lengths just below 150bp.&lt;br /&gt;
* Therefore if the complexity of the oral microbiome data is high and/or the contamination of human DNA is extreme (80-95%), the coverage may be extremely low. This may make the use of Mike&#039;s assembler impractical, or at least that&#039;s how I&#039;m going to keep justifying this testing to myself until someone corrects me.&lt;br /&gt;
** &#039;&#039;&#039;Update:&#039;&#039;&#039; Apparently Mike and Dan have talked about this, and somewhere around 75-80bp, the performance of minimus catches up with Mike&#039;s de Bruijn graph assembler anyway. I also did not know that Dan&#039;s map-reduce minimus was running and would be used to assemble the data alongside Mike&#039;s.&lt;br /&gt;
* I learned on Feb. 1, 2010 that the 454 error model allows wild variation with respect to read length, so these assemblies might not actually be representative of the performance with the Illumina data we&#039;re expecting on Feb. 20.&lt;br /&gt;
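The coverage arithmetic in the first bullet can be written out explicitly (the ~5 Mbp average bacterial genome size is my assumption, not a measured value):&lt;br /&gt;

```python
read_len_bp = 75
reads = 1_000_000
avg_genome_bp = 5_000_000   # assumed average bacterial genome size
n_genomes = 100             # approximate number of source genomes

total_bp = read_len_bp * reads         # 75 Mbp of sequence per million reads
single_cov = total_bp / avg_genome_bp  # ~15x if it all came from one genome
spread_cov = single_cov / n_genomes    # ~0.15x spread across ~100 genomes
```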
&lt;br /&gt;
=== UMIACS Resources ===&lt;br /&gt;
I just discovered the information listed on the CBCB intranet Resources page is inaccurate and very out of date, so I&#039;m making my own table.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:500px; height:200px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Umiacs Resources&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Machine !! Processor !! Speed !! Cores !! RAM&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Walnut&lt;br /&gt;
| Dual Core AMD Opteron 8220 || 2.8GHz || 16 || 64GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Privet&lt;br /&gt;
| AMD Opteron 850 || 2.4GHz || 4 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Larch&lt;br /&gt;
| AMD Opteron 850 || 2.4GHz || 4 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Sycamore&lt;br /&gt;
| Dual Core AMD Opteron 875 || 1GHz || 8 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Shagbark&lt;br /&gt;
| Intel Core 2 Quad || 2.83GHz || 4 || 4GB&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== February 5, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Meeting with Volker and Sarada on Feb 3 ===&lt;br /&gt;
* Need to teach Sarada how to run local BLAST searches on some sequences they have that aren&#039;t yet in GenBank&lt;br /&gt;
* Trying to set up a meeting with Volker to find out for sure if he wants me to work on this project&lt;br /&gt;
&lt;br /&gt;
=== Biomarker Assembly ===&lt;br /&gt;
Bo, Mohammad, and I spent a couple hours discussing biomarker assembly today. I&#039;m going to try to efficiently summarize our conclusions, but it might be difficult without an easy way to make images. We eventually decided it would be best to attempt several methods in tandem, due to the severe time constraints. The general approach of each method is to fish out and bin reads through one method or another, and then assemble the reads in each bin using minimus. All sequence identity values will be determined using BLASTx.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Preliminary Steps&#039;&#039;&#039;&lt;br /&gt;
* Gather biomarker consensus amino acid sequences&lt;br /&gt;
* Gather amino acid sequences for associated genes from each bacterial genome in refseq&lt;br /&gt;
* Cluster amino acid sequences within each biomarker set&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Sequence Identity Threshold Determination&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
There are 31 biomarkers and about 1,000 bacterial genomes in which they occur. This means that there are 31 sets of 1,000 sequences that are all relatively similar to one another. Because of the sequence similarity and the short read length, it&#039;s possible that a significant number of reads will map equally well to multiple sequences within each biomarker set. For this reason, it is better to allow a single read to be placed in any bin containing a sequence to which the read mapped above some minimum threshold. This will protect against artificially lowering the coverage of extremely well conserved regions, and with any luck, incorrectly binned reads will simply not be included in the assembly. There are several ways to approach the determination of this threshold.&lt;br /&gt;
* Determine the lowest level of sequence identity between the consensus sequence for each biomarker and any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.&lt;br /&gt;
** The obvious shortcoming of this approach is that the sequence identity between two homologous gene-length sequences can be much lower than between two homologous read-length sequences.&lt;br /&gt;
* Align 75mers to determine the lowest score between any two 75mers in the consensus sequence for each biomarker and the corresponding 75mer in any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.&lt;br /&gt;
** While this solves the problem with the above approach, it is significantly more complicated and the data is going to be here soon.&lt;br /&gt;
* Choose a sequence identity level, or try a few different levels and see which produces the most complete biomarker proteins without creating overly complex graphs.&lt;br /&gt;
** While there&#039;s no good theoretical justification for this approach, it&#039;s probably what we&#039;ll do and it will probably work well enough.&lt;br /&gt;
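Whatever threshold we settle on, the binning rule itself is simple: a read goes into every bin containing a sequence it hits at or above the threshold. A toy sketch (the input shape is hypothetical; real identities would come from BLASTx output):&lt;br /&gt;

```python
def bin_reads(hits, min_identity):
    # hits maps read id -> list of (biomarker_bin, percent_identity).
    # A read is placed in every bin it hits at or above min_identity,
    # so well-conserved regions do not lose coverage to an arbitrary
    # single-bin assignment.
    bins = {}
    for read_id, read_hits in hits.items():
        for bin_id, identity in read_hits:
            if identity >= min_identity:
                bins.setdefault(bin_id, set()).add(read_id)
    return bins
```

As noted above, reads that land in the wrong bin are expected to simply fail to assemble there.&lt;br /&gt;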
&lt;br /&gt;
&#039;&#039;&#039;Schemes&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
After making absurdly complicated descriptions of the various approaches, which I felt weren&#039;t very clear, I used Keynote to recreate the diagrams we&#039;d drawn on the whiteboard and then printed them to a PDF. Unfortunately, I&#039;m not sure exactly how to embed that in the wiki, so email me at trgibbons@gmail.com if you&#039;re reading this and I&#039;ll send it to you.&lt;br /&gt;
# Marker-wise assembly&lt;br /&gt;
#* Bin reads that align to any sequence in a given marker set, and/or the consensus sequence for that marker&lt;br /&gt;
# Cluster-wise assembly&lt;br /&gt;
#* Cluster protein sequences&lt;br /&gt;
#* Bin reads that align to any protein sequence in a given cluster&lt;br /&gt;
# Gene-wise assembly&lt;br /&gt;
#* Bin reads that align to a particular protein sequence&lt;br /&gt;
* Marker-wise and cluster-wise binning should be better for assembling novel sequences&lt;br /&gt;
* Gene-wise binning should produce higher quality assemblies of markers for known organisms or those that are closely related&lt;br /&gt;
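The three schemes differ only in how a protein hit is mapped to a bin, so they can share one sketch (the names below are illustrative, not real data):&lt;br /&gt;

```python
def bin_by(hits, group_of):
    # hits maps read id -> list of protein ids the read aligned to;
    # group_of maps a protein id to a bin key. Marker-wise binning maps
    # every protein to its marker set, cluster-wise to its cluster, and
    # gene-wise maps each protein to itself.
    bins = {}
    for read_id, proteins in hits.items():
        for protein in proteins:
            bins.setdefault(group_of[protein], set()).add(read_id)
    return bins
```

Coarser groupings pool more reads per bin (better for novel sequences), while gene-wise bins stay small and homogeneous (better for known organisms), matching the trade-off described above.&lt;br /&gt;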
&lt;br /&gt;
== February 12, 2010 ==&lt;br /&gt;
&#039;&#039;&#039;SNOW!!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== February 19, 2010 ==&lt;br /&gt;
Met with James to discuss Metastats. I&#039;m going to attempt the following two updates by the end of the semester (I probably incorrectly described them, but I&#039;ll work it out later):&lt;br /&gt;
# Find a better way to compute the false discovery rate (FDR)&lt;br /&gt;
#* Currently computed by using the lowest 100 p-values from each sample (look at source code)&lt;br /&gt;
#* Need to find a more statistically rigorous way to compute it&lt;br /&gt;
#* The false positive rate is set by the p-value: with 1,000 tests at p=0.05, about 50 null hypotheses will be incorrectly rejected; so if the null hypothesis is thrown out for 100 samples, about 50 of those can be expected to be incorrect&lt;br /&gt;
#* James just thinks this sucks and needs to be fixed&lt;br /&gt;
# Compute F-tests for features across all samples&lt;br /&gt;
#* Most requested feature&lt;br /&gt;
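One standard candidate for more rigorous FDR control, noted here as a possible direction rather than the agreed-on plan, is the Benjamini-Hochberg step-up procedure:&lt;br /&gt;

```python
def benjamini_hochberg(pvals, alpha=0.05):
    # Reject the null for the k smallest p-values, where k is the largest
    # 1-based rank i such that p_(i) <= alpha * i / m. Returns the
    # indices (into pvals) of the rejected hypotheses.
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= alpha * rank / m:
            k = rank
    return sorted(order[:k])
```

Unlike the current lowest-100-p-values heuristic, this bounds the expected fraction of false discoveries at alpha regardless of how many features are tested.&lt;br /&gt;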
&lt;br /&gt;
I spent too much time talking with people about science and not enough time doing it this week...&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Ted-Report&amp;diff=6739</id>
		<title>Cbcb:Pop-Lab:Ted-Report</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Ted-Report&amp;diff=6739"/>
		<updated>2010-02-19T17:46:34Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* February 5, 2010 */ Created entries for the next two weeks&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Older Entries ==&lt;br /&gt;
[[Cbcb:Pop-Lab:Ted-Report-2009 | 2009]]&lt;br /&gt;
&lt;br /&gt;
== January 15, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Minimus Documentation ===&lt;br /&gt;
&lt;br /&gt;
Presently, the only relevant Google hit for &amp;quot;minimus&amp;quot; on the first page of results is the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus#Basic_usage_example sourceforge wiki.] The only example on this page is incomplete and appears to be an early draft made during development.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Ideally, it should be easy to find a complete guide with the general format:&lt;br /&gt;
* Simple use case:&lt;br /&gt;
 `toAmos -s path/to/fastaFile.seq -o path/to/fastaFile.afg`&lt;br /&gt;
 `minimus path/to/fastaFile(prefix)`&lt;br /&gt;
* Necessary tools for set up (toAmos)&lt;br /&gt;
* Other options&lt;br /&gt;
* etc&lt;br /&gt;
&lt;br /&gt;
The description found on the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus/README Minimus/README] page (linked to from the middle of the starting page) is more appropriate, but it features use cases that may no longer be common and references another required tool (toAmos) without linking to it or describing how to access it. A description of this tool can be found on the Amos [http://sourceforge.net/apps/mediawiki/amos/index.php?title=File_conversion_utilities File Conversion Utilities] page (again, linked to from the starting page), but it is less organized than what I&#039;ve come to expect from a project page, and it is easy to get lost or distracted by the rest of the Amos documentation while trying to piece together the necessary steps for a basic assembly.&lt;br /&gt;
&lt;br /&gt;
=== Comparative Network Analysis pt. 2 ===&lt;br /&gt;
* Meeting with Volker this Friday to discuss how best to apply network alignment to what he&#039;s doing&lt;br /&gt;
* I&#039;m simultaneously trying to find a way to apply my network alignment technique to predicting genes in metagenomic samples&lt;br /&gt;
** I&#039;ve been trying to find a way to get beyond the restriction that my current program requires genes to be annotated with an EC number. A potentially interesting next step may be to use BioPython to BLAST the sequence of each enzyme annotated in every micro-organism in KEGG against a metagenomic library.&lt;br /&gt;
*** The results would be stretches of linked reactions that have been annotated in KEGG pathways.&lt;br /&gt;
*** This method could be applied to contigs just as easily as to finished sequences. In a scenario with low coverage, it could be used to identify genes that are probably present but simply weren&#039;t sampled, by showing the presence of the rest of the pathway. In short, this could finally accomplish what Mihai asked me to work on when I showed up.&lt;br /&gt;
*** The major theoretical shortcoming of this approach is that it could only identify relatively well characterized pathways.&lt;br /&gt;
*** The practical shortcomings of this approach start with obtaining a fairly complete copy of KEGG (which, as we&#039;ve learned, is a mess to parse locally and unusably slow to call through the API) and continue to the computational challenge of such a large-scale BLAST operation.&lt;br /&gt;
** Ask Bo about this when he gets back. He may have already done this.&lt;br /&gt;
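The low-coverage use case above can be sketched as a simple pathway-completeness check: if most of a pathway&#039;s enzymes are detected among the contigs, flag the missing ones as likely present but unsampled (the 0.8 cutoff is an arbitrary illustrative choice, and the EC numbers in the test are placeholders):&lt;br /&gt;

```python
def likely_unsampled(pathway_ecs, detected_ecs, min_fraction=0.8):
    # pathway_ecs: EC numbers annotated for one KEGG pathway.
    # detected_ecs: EC numbers with BLAST hits in the metagenomic contigs.
    # If the detected fraction clears the cutoff, return the undetected
    # enzymes as candidates that were probably present but not sampled.
    present = pathway_ecs & detected_ecs
    if len(present) / len(pathway_ecs) >= min_fraction:
        return pathway_ecs - detected_ecs
    return set()
```

The theoretical limitation noted above still applies: this can only ever flag genes in relatively well characterized pathways.&lt;br /&gt;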
&lt;br /&gt;
== January 22, 2010 ==&lt;br /&gt;
* Met with Dan and Sergey to talk about the Minimus-Bambus pipeline&lt;br /&gt;
** Minimus is running fine. I&#039;ve begun characterizing its run-time behavior (see next week&#039;s entry)&lt;br /&gt;
** After some tweaking by Sergey, Bambus was able to finish, but it did not generate a scaffold. We&#039;re going to talk about this after the meeting on Monday.&lt;br /&gt;
** Sergey had an interesting idea for making a better read simulator:&lt;br /&gt;
*** Error-free reads are cheap and easy to generate. The problem is with the error model.&lt;br /&gt;
*** The &amp;quot;best&amp;quot; tool we are aware of that includes error models is MetaSim, but its error models are years out of date and the authors have historically been unreachable. While Mihai has now shown me how to edit the models in a reasonable way from flat files, allowing me to characterize base substitutions, I&#039;m not convinced that writing a program to modify these files would be faster or easier than writing an entirely new program; and given the amount of time I&#039;ve spent trying to use MetaSim, I&#039;m more than ready to walk away from it. Oh yeah, and MetaSim doesn&#039;t work from the command line, so no scripting.&lt;br /&gt;
*** Sergey has pointed out that most companies will assemble &#039;&#039;E. coli&#039;&#039; when they release a new sequencer. Conveniently, there are many high quality assemblies of &#039;&#039;E. coli&#039;&#039; available for reference. It might therefore be possible to generate new error models for these sequencers in an automated fashion by mapping the &#039;&#039;E. coli&#039;&#039; reads to the available reference genomes, collecting the error frequencies, and then using them to mask synthesized reads.&lt;br /&gt;
*** I also talked with Mohammad and Mihai about this; they also seemed to think it was a pretty good idea. Mihai has proposed having Sergey or Mohammad add the described error-model generator to his read sampler (written in C) when they have time, but not in preparation for the oral microbiome data.&lt;br /&gt;
&lt;br /&gt;
* Met with James to discuss my work with Volker&lt;br /&gt;
** Told him about my meeting with Volker and the paper he wanted me to prepare, more or less by myself. The concepts of the papers are these:&lt;br /&gt;
*** Most available genomic sequences of mycobacteria are of a very small subset of highly pathogenic organisms.&lt;br /&gt;
*** Subtractive comparative genomics can be used to identify genes that are potentially responsible for differing phenotypes (such as extreme pathogenicity), but genomic sequences must be available for closely related organisms with differing phenotypes.&lt;br /&gt;
*** Volker has sequenced 2 more non-pathogenic strains of mycobacteria (&#039;&#039;gastri&#039;&#039;, and &#039;&#039;kansasiiW58&#039;&#039;) with the intention of increasing the effectiveness of these subtractive comparative genomic studies.&lt;br /&gt;
*** The meat of the paper would be comparing the results of subtractive comparative genomic analysis using all currently available strains in RefSeq, with the results from also using the two novel sequences.&lt;br /&gt;
*** The other, smaller publishable portion of this project would be a comparison of &#039;&#039;gastri&#039;&#039; and &#039;&#039;kansasiiW58&#039;&#039; to each other, because they are thought to be extremely closely related and yet have distinct phenotypes (which I&#039;ve now forgotten).&lt;br /&gt;
*** James seemed to think this could make an okay paper, and he confirmed that he had not realized Volker was looking for someone to do all of the analysis, both computational and biological, with Volker only contributing analysis of the analysis after it was all over.&lt;br /&gt;
** Ended up also discussing his work on differential abundance in populations of microorganisms.&lt;br /&gt;
*** I&#039;m going to start working on taking over and expanding Metastats this semester.&lt;br /&gt;
*** I&#039;m also going to start talking to Bo when he gets back about exactly what he&#039;s doing, and how I might be able to include pathway prediction in my expansion of Metastats without stepping on his toes.&lt;br /&gt;
*** Mihai has given me his approval to focus on this.&lt;br /&gt;
&lt;br /&gt;
* Met with Mihai to discuss working with Volker&lt;br /&gt;
** Explained that rather than looking for someone to do only the complex portions of the computational analysis, Volker was/is looking for someone to do the complete analysis.&lt;br /&gt;
** In exchange, Volker is offering first authorship and, if need be, to split the student&#039;s funding with their primary PI.&lt;br /&gt;
** I think I&#039;m capable of doing this within 3 or 4 months but it would consume my time pretty thoroughly.&lt;br /&gt;
** Mihai agreed that this is a reasonable deal, but noted that I have no personal interest in studying mycobacteria, and it&#039;s therefore unwise of me to invest a bunch of time becoming an expert on an organism I have no interest in continuing to study or work with. I&#039;ve therefore offered to have one of Volker&#039;s graduate students work closely with me, meeting every week or two. I would be willing to do all of the computational analysis and explain it to them, but they would have to actually look up potentially interesting genes and relationships I discover and help me keep the analysis biologically interesting and relevant.&lt;br /&gt;
&lt;br /&gt;
* Met with Mihai and Mohammad to discuss our impending huge-ass(embly) problem&lt;br /&gt;
** Talked about strategies for iterative assembly as an approach to assembling intractably large data sets. Most have glaring shortcomings and complications.&lt;br /&gt;
** Discovered Mike Schatz has a map-reduce implementation of an assembler that uses de Bruijn graphs and is better suited to assemblies with high coverage but short read lengths.&lt;br /&gt;
&lt;br /&gt;
== January 29, 2010 ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Minimus Performance Analysis ===&lt;br /&gt;
I&#039;m testing minimus and bambus in preparation for the oral microbiome data, and after spamming several lab members with email, it occurred to me that it would be considerably more considerate to put the information here instead.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Memory Usage Analysis&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|RAM used by the Overlapper (in GB):&lt;br /&gt;
| 1.2 || 2.4 || 4.5 || 8.7 || 17 || 21.5 || ~1.1 GB * (#Reads in Millions) = (Memory Used)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|RAM used by the Tigger (in GB):&lt;br /&gt;
| 3 || 6 || 12 || 25 || 48.4 || (60) || ~3 GB * (#Reads in Millions) = (Memory Used)&lt;br /&gt;
|}&lt;br /&gt;
* The 16 million read assembly data is from Walnut; all other numbers are rough averages from both Privet and Walnut.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Run Time Analysis on Privet&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Overlapper (in min):&lt;br /&gt;
| 3 || 9 || 34 || 130 || (576) || 783 || 2.96 * (#Reads in Millions)&amp;lt;sup&amp;gt;1.87&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Tigger (in min):&lt;br /&gt;
| 9 || 66 || 473 || (3,456) || (25,088) || (47,493) || 9.03 * (#Reads in Millions)&amp;lt;sup&amp;gt;2.86&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|}&lt;br /&gt;
* Privet has 2.4GHz Opteron 850 processors and 32GB of RAM. Minimus is not parallelized and therefore only uses a single core.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
* The models were generated by plotting the data points in OpenOffice and fitting a polynomial trendline. The R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; value for each was 1.&lt;br /&gt;
* &#039;&#039;&#039;For reference: There are 1,440 minutes in one day, and 10,080 minutes in one week&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Run Time Analysis on Walnut&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Overlapper (in min):&lt;br /&gt;
| 2.7 || 8 || 27.5 || 102 || (325) || (481.5) || 2.54 * (#Reads in Millions)&amp;lt;sup&amp;gt;1.75&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Tigger (in min):&lt;br /&gt;
| 14 || 81 || 471.5 || (2,752) || (16,006) || (28,212) || 13.99 * (#Reads in Millions)&amp;lt;sup&amp;gt;2.54&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|}&lt;br /&gt;
* Walnut has 2.8GHz Opteron 875 processors and 64GB of RAM. Minimus is not parallelized and therefore only uses a single core.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
* The models were generated by plotting the data points in OpenOffice and fitting a polynomial trendline. The R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; value for each was 1.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Other Observations About the Assemblies ====&lt;br /&gt;
* Because of the short read length, every million reads is only 75MB of sequence, which is roughly 10-20x coverage of an average single bacterium. Since these test sets have reads sampled from roughly 100 bacterial genomic sequences, I would expect the per-genome coverage to be on the order of 0.1-0.2x on average.&lt;br /&gt;
* Unsurprisingly, a cursory glance through the contig files shows that each contig is composed of only about 2 or 3 reads.&lt;br /&gt;
* The N50 analysis for the smaller assemblies shows that only 2-3 reads are being added to each contig on average, leaving both N50s and average lengths just below 150bp.&lt;br /&gt;
* Therefore if the complexity of the oral microbiome data is high and/or the contamination of human DNA is extreme (80-95%), the coverage may be extremely low. This may make the use of Mike&#039;s assembler impractical, or at least that&#039;s how I&#039;m going to keep justifying this testing to myself until someone corrects me.&lt;br /&gt;
** &#039;&#039;&#039;Update:&#039;&#039;&#039; Apparently Mike and Dan have talked about this, and somewhere around 75-80bp, the performance of minimus catches up with Mike&#039;s de Bruijn graph assembler anyway. I also did not know that Dan&#039;s map-reduce minimus was running and would be used to assemble the data alongside Mike&#039;s.&lt;br /&gt;
* I learned on Feb. 1, 2010 that the 454 error model allows wild variation wrt read length. So these assemblies might not actually be representative of the performance with the illumina data we&#039;re expecting on Feb. 20&lt;br /&gt;
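The coverage arithmetic in the first bullet above can be checked back-of-the-envelope. The 5 Mbp figure for an average bacterial genome is my assumed round number; the rest comes from the text.

```python
# Back-of-the-envelope check of the coverage claims above.
# The 5 Mbp average bacterial genome size is an assumed round number.
read_len_bp = 75
reads = 1_000_000
total_bp = read_len_bp * reads            # 75 Mbp of sequence per million reads
genome_bp = 5_000_000                     # assumed average bacterial genome
single_genome_cov = total_bp / genome_bp  # 15x if all reads came from one genome
spread_cov = total_bp / (100 * genome_bp) # 0.15x spread across ~100 genomes
print(single_genome_cov, spread_cov)
```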
&lt;br /&gt;
=== UMIACS Resources ===&lt;br /&gt;
I just discovered that the information listed on the CBCB intranet Resources page is inaccurate and very out of date, so I&#039;m making my own table.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:500px; height:200px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Umiacs Resources&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Machine !! Processor !! Speed !! Cores !! RAM&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Walnut&lt;br /&gt;
| Dual Core AMD Opteron 8220 || 2.8GHz || 16 || 64GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Privet&lt;br /&gt;
| AMD Opteron 850 || 2.4GHz || 4 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Larch&lt;br /&gt;
| AMD Opteron 850 || 2.4GHz || 4 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Sycamore&lt;br /&gt;
| Dual Core AMD Opteron 875 || 1GHz || 8 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Shagbark&lt;br /&gt;
| Intel Core 2 Quad || 2.83GHz || 4 || 4GB&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== February 5, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Meeting with Volker and Sarada on Feb 3 ===&lt;br /&gt;
* Need to teach Sarada how to perform local blast on some sequences they have that aren&#039;t yet in genbank&lt;br /&gt;
* Trying to set up a meeting with Volker to find out for sure if he wants me to work on this project&lt;br /&gt;
&lt;br /&gt;
=== Biomarker Assembly ===&lt;br /&gt;
Bo, Mohammad, and I spent a couple of hours discussing biomarker assembly today. I&#039;m going to try to summarize our conclusions efficiently, but it might be difficult without an easy way to make images. We eventually decided it would be best to attempt several methods in tandem, due to the severe time constraints. The general approach of each method is to fish out and bin reads through one method or another, and then assemble the reads in each bin using minimus. All sequence identity values will be determined using BLASTx.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Preliminary Steps&#039;&#039;&#039;&lt;br /&gt;
* Gather biomarker consensus amino acid sequences&lt;br /&gt;
* Gather amino acid sequences for associated genes from each bacterial genome in refseq&lt;br /&gt;
* Cluster amino acid sequences within each biomarker set&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Sequence Identity Threshold Determination&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
There are 31 biomarkers and about 1,000 bacterial genomes in which they occur. This means that there are 31 sets of 1,000 sequences that are all relatively similar to one another. Because of the sequence similarity and the short read length, it&#039;s possible that a significant number of reads will map equally well to multiple sequences within each biomarker set. For this reason, it is better to allow a single read to be placed in any bin containing a sequence to which the read mapped above some minimum threshold. This will protect against artificially lowering the coverage of extremely well conserved regions, and with any luck, incorrectly binned reads will simply not be included in the assembly. There are several ways to approach the determination of this threshold.&lt;br /&gt;
* Determine the lowest level of sequence identity between the consensus sequence for each biomarker and any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.&lt;br /&gt;
** The obvious shortcoming of this approach is that the sequence identity between two homologous gene-length sequences can be much lower than between two homologous read-length sequences.&lt;br /&gt;
* Align 75mers to determine the lowest score between any two 75mers in the consensus sequence for each biomarker and the corresponding 75mer in any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.&lt;br /&gt;
** While this solves the problem with the above approach, it is significantly more complicated and the data is going to be here soon.&lt;br /&gt;
* Choose a sequence identity level, or try a few different levels and see which produces the most complete biomarker proteins without creating overly complex graphs.&lt;br /&gt;
** While there&#039;s no good theoretical justification for this approach, it&#039;s probably what we&#039;ll do and it will probably work well enough.&lt;br /&gt;
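The multi-bin placement rule described above can be sketched as follows. The hit tuples and the 70% identity cutoff are invented for illustration; in practice the identities would come from the BLASTx output.

```python
# Sketch of the binning rule above: a read is placed in EVERY bin whose
# sequence it matched at or above the identity threshold, not just its
# best hit. Hits and the 70% cutoff are made up for illustration.
from collections import defaultdict

def bin_reads(hits, threshold=70.0):
    """hits: iterable of (read_id, marker_set, percent_identity) tuples."""
    bins = defaultdict(set)
    for read_id, marker, identity in hits:
        if identity >= threshold:
            bins[marker].add(read_id)
    return bins

hits = [("r1", "markerA", 92.0), ("r1", "markerB", 71.5), ("r2", "markerA", 55.0)]
bins = bin_reads(hits)
# r1 lands in both markerA and markerB bins; r2 falls below the cutoff everywhere
```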
&lt;br /&gt;
&#039;&#039;&#039;Schemes&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
After writing absurdly complicated descriptions of the various approaches, which I felt weren&#039;t very clear, I used Keynote to recreate the diagrams we&#039;d drawn on the whiteboard and then printed them to a PDF. Unfortunately, I&#039;m not sure exactly how to embed that in the wiki, so email me at trgibbons@gmail.com if you&#039;re reading this and I&#039;ll send it to you.&lt;br /&gt;
# Marker-wise assembly&lt;br /&gt;
#* Bin reads that align to any sequence in a given marker set, and/or the consensus sequence for that marker&lt;br /&gt;
# Cluster-wise assembly&lt;br /&gt;
#* Cluster protein sequences&lt;br /&gt;
#* Bin reads that align to any protein sequence in a given cluster&lt;br /&gt;
# Gene-wise assembly&lt;br /&gt;
#* Bin reads that align to a particular protein sequence&lt;br /&gt;
* Marker-wise and cluster-wise binning should be better for assembling novel sequences&lt;br /&gt;
* Gene-wise binning should produce higher quality assemblies of markers for known organisms or those that are closely related&lt;br /&gt;
&lt;br /&gt;
== February 12, 2010 ==&lt;br /&gt;
&#039;&#039;&#039;SNOW!!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== February 19, 2010 ==&lt;br /&gt;
Met with James to discuss Metastats. I&#039;m going to attempt the following two updates by the end of the semester (I probably incorrectly described them, but I&#039;ll work it out later):&lt;br /&gt;
# Find a better way to compute the false discovery rate (FDR)&lt;br /&gt;
#* Currently computed by using the lowest 100 p-values from each sample (look at source code)&lt;br /&gt;
#* Need to find a more algebraically rigorous way to compute it&lt;br /&gt;
#* The expected number of false positives across 1000 tests is the p-value times the number of tests (p=0.05 =&amp;gt; 50 H_a&#039;s will be incorrectly predicted; so if the null hypothesis is rejected for 100 samples, 50 of those would be expected to be incorrect)&lt;br /&gt;
#* James just thinks this sucks and needs to be fixed&lt;br /&gt;
# Compute F-tests for features across all samples&lt;br /&gt;
#* Most requested feature&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Ted-Report&amp;diff=6590</id>
		<title>Cbcb:Pop-Lab:Ted-Report</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Ted-Report&amp;diff=6590"/>
		<updated>2010-02-03T02:09:23Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* Biomarker Assembly */ Finished entry&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Older Entries ==&lt;br /&gt;
[[Cbcb:Pop-Lab:Ted-Report-2009 | 2009]]&lt;br /&gt;
&lt;br /&gt;
== January 15, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Minimus Documentation ===&lt;br /&gt;
&lt;br /&gt;
Presently, the only relevant Google hit for &amp;quot;minimus&amp;quot; on the first page of results is the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus#Basic_usage_example sourceforge wiki.] The only example on this page is incomplete and appears to be an early draft made during development.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Ideally, it should be easy to find a complete guide with the general format:&lt;br /&gt;
* Simple use case:&lt;br /&gt;
 `toAmos -s path/to/fastaFile.seq -o path/to/fastaFile.afg`&lt;br /&gt;
 `minimus path/to/fastaFile(prefix)`&lt;br /&gt;
* Necessary tools for set up (toAmos)&lt;br /&gt;
* Other options&lt;br /&gt;
* etc&lt;br /&gt;
&lt;br /&gt;
The description found on the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus/README Minimus/README] page (linked to from the middle of the starting page) is more appropriate, but it features use cases that may no longer be common and references another required tool (toAmos) without linking to it or describing how to access it. A description of this tool can be found on the Amos [http://sourceforge.net/apps/mediawiki/amos/index.php?title=File_conversion_utilities File Conversion Utilities] page (again, linked to from the starting page), but it is less organized than what I&#039;ve come to expect from a project page, and it is easy to get lost or distracted by the rest of the Amos documentation while trying to piece together the necessary steps for a basic assembly.&lt;br /&gt;
&lt;br /&gt;
=== Comparative Network Analysis pt. 2 ===&lt;br /&gt;
* Meeting with Volker this Friday to discuss how best to apply network alignment to what he&#039;s doing&lt;br /&gt;
* I&#039;m simultaneously trying to find a way to apply my network alignment technique to predicting genes in metagenomic samples&lt;br /&gt;
** I&#039;ve been trying to find a way to get beyond the restriction that my current program requires genes to be annotated with an EC number. A potentially interesting next step may be to use BioPython to BLAST the sequence of each enzyme annotated in every micro-organism in KEGG against a metagenomic library.&lt;br /&gt;
*** The results would be stretches of linked reactions that have been annotated in KEGG pathways.&lt;br /&gt;
*** This method could be applied to contigs just as easily as to finished sequences. In a low-coverage scenario, it could be used to identify genes which are probably there but just weren&#039;t sampled, by showing the presence of the rest of the pathway. In short, this could finally accomplish what Mihai asked me to work on when I showed up.&lt;br /&gt;
*** The major theoretical shortcoming of this approach is that it could only identify relatively well characterized pathways.&lt;br /&gt;
*** The practical shortcoming of this approach will start by obtaining a fairly complete copy of KEGG (which as we&#039;ve learned is a mess to parse locally and unusably slow to call through the API), and will continue to the computational challenge of such a large scale BLAST operation.&lt;br /&gt;
** Ask Bo about this when he gets back. He may have already done this.&lt;br /&gt;
&lt;br /&gt;
== January 22, 2010 ==&lt;br /&gt;
* Met with Dan and Sergey to talk about the Minimus-Bambus pipeline&lt;br /&gt;
** Minimus is running fine. I&#039;ve begun characterizing its run-time behavior (see next week&#039;s entry)&lt;br /&gt;
** After some tweaking by Sergey, Bambus was able to finish but did not generate a scaffold. We&#039;re going to talk about this after the meeting on Monday.&lt;br /&gt;
** Sergey had an interesting idea for making a better read simulator:&lt;br /&gt;
*** Error-free reads are cheap and easy to generate. The problem is with the error model.&lt;br /&gt;
*** The &amp;quot;best&amp;quot; tool (that we are aware of) which includes error models is MetaSim, but the error models are years out of date and the author has been historically unreachable. While Mihai has now shown me how to edit the models in a reasonable way from flat files that characterize base substitutions, I&#039;m not convinced it would be faster or easier to write a program to modify these files than it would be to just write an entirely new program; and given the amount of time I&#039;ve spent trying to use MetaSim, I&#039;m more than ready to walk away from it. Oh yeah, and MetaSim doesn&#039;t work from the command line, so no scripting.&lt;br /&gt;
*** Sergey has pointed out that most companies will assemble &#039;&#039;E. coli&#039;&#039; when they release a new sequencer. Conveniently, there are many high quality assemblies of &#039;&#039;E. coli&#039;&#039; available for reference. It might therefore be possible to generate new error models for these sequencers in an automated fashion by mapping the &#039;&#039;E. coli&#039;&#039; reads to the available reference genomes, collecting the error frequencies, and then using them to mask synthesized reads.&lt;br /&gt;
*** I also talked with Mohammad and Mihai about this, who seemed to also think it was a pretty good idea. Mihai has proposed having Sergey or Mohammad add the described error model-generator to his read sampler (written in C) when they have time, but not in preparation of the oral microbiome data.&lt;br /&gt;
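Sergey's idea above amounts to tallying per-position error frequencies from read-to-reference alignments. A minimal sketch, assuming toy gap-free alignments of equal-length read/reference string pairs (real input would come from a mapper's output):

```python
# Minimal sketch of the proposed error-model generator: map reads from a
# well-assembled genome (e.g. E. coli) back to the reference, then tally
# mismatch frequency at each read position. Alignments here are toy,
# gap-free (read, reference) string pairs.
def positional_error_rates(alignments, read_len):
    mismatches = [0] * read_len
    total = [0] * read_len
    for read, ref in alignments:
        for i, (r, t) in enumerate(zip(read, ref)):
            total[i] += 1
            if r != t:
                mismatches[i] += 1
    return [m / n if n else 0.0 for m, n in zip(mismatches, total)]

toy = [("ACGT", "ACGA"), ("ACGT", "ACGT")]
rates = positional_error_rates(toy, 4)  # [0.0, 0.0, 0.0, 0.5]
```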
&lt;br /&gt;
* Met with James to discuss my work with Volker&lt;br /&gt;
** Told him about my meeting with Volker and the paper he wanted me to prepare, more or less by myself. The concepts of the paper are these:&lt;br /&gt;
*** Most available genomic sequences of mycobacteria are of a very small subset of highly pathogenic organisms.&lt;br /&gt;
*** Subtractive comparative genomics can be used to identify genes that are potentially responsible for differing phenotypes (such as extreme pathogenicity), but there must be available genomic sequences for closely related organisms with differing phenotypes.&lt;br /&gt;
*** Volker has sequenced 2 more non-pathogenic strains of mycobacteria (&#039;&#039;gastri&#039;&#039;, and &#039;&#039;kansasiiW58&#039;&#039;) with the intention of increasing the effectiveness of these subtractive comparative genomic studies.&lt;br /&gt;
*** The meat of the paper would be comparing the results of subtractive comparative genomic analysis using all currently available strains in RefSeq, with the results from also using the two novel sequences.&lt;br /&gt;
*** The other, smaller publishable portion of this project would be a comparison of &#039;&#039;gastri&#039;&#039; and &#039;&#039;kansasiiW58&#039;&#039; to each other, because they are thought to be extremely closely related, and yet they have distinct phenotypes (which I&#039;ve now forgotten).&lt;br /&gt;
*** James seemed to think this could make an okay paper, and he confirmed that he did not understand that Volker was looking for someone to do all of the analysis, both computational and biological, with Volker only contributing analysis of the analysis after it was all over.&lt;br /&gt;
** Ended up also discussing his work on differential abundance in populations of microorganisms.&lt;br /&gt;
*** I&#039;m going to start working on taking over and expanding Metastats this semester.&lt;br /&gt;
*** I&#039;m also going to start talking to Bo when he gets back about exactly what he&#039;s doing, and how I might be able to include pathway prediction in my expansion of Metastats without stepping on his toes.&lt;br /&gt;
*** Mihai has given me his approval to focus on this.&lt;br /&gt;
&lt;br /&gt;
* Met with Mihai to discuss working with Volker&lt;br /&gt;
** Explained that rather than looking for someone to do only the complex portions of the computational analysis, Volker was/is looking for someone to do the complete analysis.&lt;br /&gt;
** In exchange, Volker is offering first authorship and, if need be, to split the student&#039;s funding with their primary PI.&lt;br /&gt;
** I think I&#039;m capable of doing this within 3 or 4 months but it would consume my time pretty thoroughly.&lt;br /&gt;
** Mihai agreed that this is a reasonable deal, but that I have no personal interest in studying mycobacteria, and it&#039;s therefore unwise of me to invest a bunch of time becoming an expert on an organism I have no interest in continuing to study or work with. I&#039;ve therefore proposed to Volker that I work closely with one of his graduate students, who could meet with me every week or two. I would be willing to do all of the computational analysis and explain it to them, but they would have to actually look up potentially interesting genes and relationships I discover and help me keep the analysis biologically interesting and relevant.&lt;br /&gt;
&lt;br /&gt;
* Met with Mihai and Mohammad to discuss our impending huge-ass(embly) problem&lt;br /&gt;
** Talked about strategies for iterative assembly as an approach to assembling intractably large data sets. Most have glaring shortcomings and complications.&lt;br /&gt;
** Discovered Mike Schatz has a map-reduce implementation of an assembler that uses De Bruijn graphs and is better suited to assemblies with high coverage but short read lengths.&lt;br /&gt;
&lt;br /&gt;
== January 29, 2010 ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Minimus Performance Analysis ===&lt;br /&gt;
I&#039;m testing minimus and bambus in preparation for the oral microbiome data, and after spamming several lab members with email, it occurred to me that it would be considerably more considerate to put the information here instead.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Memory Usage Analysis&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|RAM used by the Overlapper (in GB):&lt;br /&gt;
| 1.2 || 2.4 || 4.5 || 8.7 || 17 || 21.5 || ~1.1 GB * (#Reads in Millions) = (Memory Used)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|RAM used by the Tigger (in GB):&lt;br /&gt;
| 3 || 6 || 12 || 25 || 48.4 || (60) || ~3 GB * (#Reads in Millions) = (Memory Used)&lt;br /&gt;
|}&lt;br /&gt;
* The 16 million read assembly data is from Walnut, all other numbers are rough averages from both Privet and Walnut.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Run Time Analysis on Privet&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Overlapper (in min):&lt;br /&gt;
| 3 || 9 || 34 || 130 || (576) || 783 || 2.96 * (#Reads in Millions)&amp;lt;sup&amp;gt;1.87&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Tigger (in min):&lt;br /&gt;
| 9 || 66 || 473 || (3,456) || (25,088) || (47,493) || 9.03 * (#Reads in Millions)&amp;lt;sup&amp;gt;2.86&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|}&lt;br /&gt;
* Privet has 2.4GHz Opteron 850 processors and 32GB of RAM. Minimus is not parallelized and therefore only uses a single core.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
* The models were generated by plotting the data points in OpenOffice and fitting a power trendline. The R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; value for each was 1.&lt;br /&gt;
* &#039;&#039;&#039;For reference: There are 1,440 minutes in one day, and 10,080 minutes in one week&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Run Time Analysis on Walnut&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Overlapper (in min):&lt;br /&gt;
| 2.7 || 8 || 27.5 || 102 || (325) || (481.5) || 2.54 * (#Reads in Millions)&amp;lt;sup&amp;gt;1.75&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Tigger (in min):&lt;br /&gt;
| 14 || 81 || 471.5 || (2,752) || (16,006) || (28,212) || 13.99 * (#Reads in Millions)&amp;lt;sup&amp;gt;2.54&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|}&lt;br /&gt;
* Walnut has 2.8GHz Opteron 875 processors and 64GB of RAM. Minimus is not parallelized and therefore only uses a single core.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
* The models were generated by plotting the data points in OpenOffice and fitting a power trendline. The R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; value for each was 1.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Other Observations About the Assemblies ====&lt;br /&gt;
* Because of the short read length, every million reads is only 75MB of sequence. This is roughly 10-20x coverage of an average single bacterium. These test sets have reads sampled from roughly 100 bacterial genomic sequences, so I would expect the coverage to be on the order of 0.1% on average.&lt;br /&gt;
* Unsurprisingly, a cursory glance through the contig files shows that each contig comprises only about 2 or 3 reads.&lt;br /&gt;
* The N50 analysis for the smaller assemblies shows that only 2-3 reads are being added to each contig on average, leaving both N50s and average lengths just below 150bp.&lt;br /&gt;
* Therefore, if the complexity of the oral microbiome data is high and/or the contamination of human DNA is extreme (80-95%), the coverage may be extremely low. This may make the use of Mike&#039;s assembler impractical, or at least that&#039;s how I&#039;m going to keep justifying this testing to myself until someone corrects me.&lt;br /&gt;
** &#039;&#039;&#039;Update:&#039;&#039;&#039; Apparently Mike and Dan have talked about this, and somewhere around 75-80bp, the performance of minimus catches up with Mike&#039;s de Bruijn graph assembler anyway. I also did not know that Dan&#039;s map-reduce minimus was running and would be used to assemble the data alongside Mike&#039;s.&lt;br /&gt;
* I learned on Feb. 1, 2010 that the 454 error model allows wild variation with respect to read length, so these assemblies might not actually be representative of the performance with the Illumina data we&#039;re expecting on Feb. 20.&lt;br /&gt;
&lt;br /&gt;
=== UMIACS Resources ===&lt;br /&gt;
I just discovered that the information listed on the CBCB intranet Resources page is inaccurate and very out of date, so I&#039;m making my own table.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:500px; height:200px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Umiacs Resources&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Machine !! Processor !! Speed !! Cores !! RAM&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Walnut&lt;br /&gt;
| Dual Core AMD Opteron 8220 || 2.8GHz || 16 || 64GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Privet&lt;br /&gt;
| AMD Opteron 850 || 2.4GHz || 4 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Larch&lt;br /&gt;
| AMD Opteron 850 || 2.4GHz || 4 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Sycamore&lt;br /&gt;
| Dual Core AMD Opteron 875 || 1GHz || 8 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Shagbark&lt;br /&gt;
| Intel Core 2 Quad || 2.83GHz || 4 || 4GB&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== February 5, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Meeting with Volker and Sarada on Feb 3 ===&lt;br /&gt;
* Need to teach Sarada how to perform local blast on some sequences they have that aren&#039;t yet in genbank&lt;br /&gt;
* Trying to set up a meeting with Volker to find out for sure if he wants me to work on this project&lt;br /&gt;
&lt;br /&gt;
=== Biomarker Assembly ===&lt;br /&gt;
Bo, Mohammad, and I spent a couple of hours discussing biomarker assembly today. I&#039;m going to try to summarize our conclusions efficiently, but it might be difficult without an easy way to make images. We eventually decided it would be best to attempt several methods in tandem, due to the severe time constraints. The general approach of each method is to fish out and bin reads through one method or another, and then assemble the reads in each bin using minimus. All sequence identity values will be determined using BLASTx.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Preliminary Steps&#039;&#039;&#039;&lt;br /&gt;
* Gather biomarker consensus amino acid sequences&lt;br /&gt;
* Gather amino acid sequences for associated genes from each bacterial genome in refseq&lt;br /&gt;
* Cluster amino acid sequences within each biomarker set&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Sequence Identity Threshold Determination&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
There are 31 biomarkers and about 1,000 bacterial genomes in which they occur. This means that there are 31 sets of 1,000 sequences that are all relatively similar to one another. Because of the sequence similarity and the short read length, it&#039;s possible that a significant number of reads will map equally well to multiple sequences within each biomarker set. For this reason, it is better to allow a single read to be placed in any bin containing a sequence to which the read mapped above some minimum threshold. This will protect against artificially lowering the coverage of extremely well conserved regions, and with any luck, incorrectly binned reads will simply not be included in the assembly. There are several ways to approach the determination of this threshold.&lt;br /&gt;
* Determine the lowest level of sequence identity between the consensus sequence for each biomarker and any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.&lt;br /&gt;
** The obvious shortcoming of this approach is that the sequence identity between two homologous gene-length sequences can be much lower than between two homologous read-length sequences.&lt;br /&gt;
* Align 75mers to determine the lowest score between any two 75mers in the consensus sequence for each biomarker and the corresponding 75mer in any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.&lt;br /&gt;
** While this solves the problem with the above approach, it is significantly more complicated and the data is going to be here soon.&lt;br /&gt;
* Choose a sequence identity level, or try a few different levels and see which produces the most complete biomarker proteins without creating overly complex graphs.&lt;br /&gt;
** While there&#039;s no good theoretical justification for this approach, it&#039;s probably what we&#039;ll do and it will probably work well enough.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Schemes&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
After writing absurdly complicated descriptions of the various approaches, which I felt weren&#039;t very clear, I used Keynote to recreate the diagrams we&#039;d drawn on the whiteboard and then printed them to a PDF. Unfortunately, I&#039;m not sure exactly how to embed that in the wiki, so email me at trgibbons@gmail.com if you&#039;re reading this and I&#039;ll send it to you.&lt;br /&gt;
# Marker-wise assembly&lt;br /&gt;
#* Bin reads that align to any sequence in a given marker set, and/or the consensus sequence for that marker&lt;br /&gt;
# Cluster-wise assembly&lt;br /&gt;
#* Cluster protein sequences&lt;br /&gt;
#* Bin reads that align to any protein sequence in a given cluster&lt;br /&gt;
# Gene-wise assembly&lt;br /&gt;
#* Bin reads that align to a particular protein sequence&lt;br /&gt;
* Marker-wise and cluster-wise binning should be better for assembling novel sequences&lt;br /&gt;
* Gene-wise binning should produce higher quality assemblies of markers for known organisms or those that are closely related&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Ted-Report&amp;diff=6588</id>
		<title>Cbcb:Pop-Lab:Ted-Report</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Ted-Report&amp;diff=6588"/>
		<updated>2010-02-02T23:37:01Z</updated>

		<summary type="html">&lt;p&gt;Tgibbons: /* February 5, 2010 */ Part two. Still have more to write.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Older Entries ==&lt;br /&gt;
[[Cbcb:Pop-Lab:Ted-Report-2009 | 2009]]&lt;br /&gt;
&lt;br /&gt;
== January 15, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Minimus Documentation ===&lt;br /&gt;
&lt;br /&gt;
Presently, the only relevant Google hit for &amp;quot;minimus&amp;quot; on the first page of results is the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus#Basic_usage_example sourceforge wiki.] The only example on this page is incomplete and appears to be an early draft made during development.&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Ideally, it should be easy to find a complete guide with the general format:&lt;br /&gt;
* Simple use case:&lt;br /&gt;
 `toAmos -s path/to/fastaFile.seq -o path/to/fastaFile.afg`&lt;br /&gt;
 `minimus path/to/fastaFile(prefix)`&lt;br /&gt;
* Necessary tools for set up (toAmos)&lt;br /&gt;
* Other options&lt;br /&gt;
* etc&lt;br /&gt;
&lt;br /&gt;
The description found on the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus/README Minimus/README] page (linked to from the middle of the starting page) is more appropriate, but it features use cases that may no longer be common and references another required tool (toAmos) without linking to it or describing how to access it. A description of this tool can be found on the Amos [http://sourceforge.net/apps/mediawiki/amos/index.php?title=File_conversion_utilities File Conversion Utilities] page (again, linked to from the starting page), but it is less organized than what I&#039;ve come to expect from a project page, and it is easy to get lost or distracted by the rest of the Amos documentation while trying to piece together the necessary steps for a basic assembly.&lt;br /&gt;
&lt;br /&gt;
=== Comparative Network Analysis pt. 2 ===&lt;br /&gt;
* Meeting with Volker this Friday to discuss how best to apply network alignment to what he&#039;s doing&lt;br /&gt;
* I&#039;m simultaneously trying to find a way to apply my network alignment technique to predicting genes in metagenomic samples&lt;br /&gt;
** I&#039;ve been trying to find a way to get beyond the restriction that my current program requires genes to be annotated with an EC number. A potentially interesting next step may be to use BioPython to BLAST the sequence of each enzyme annotated in every micro-organism in KEGG against a metagenomic library.&lt;br /&gt;
*** The results would be stretches of linked reactions that have been annotated in KEGG pathways.&lt;br /&gt;
*** This method could be applied to contigs just as easily as to finished sequences. In a low-coverage scenario, it could be used to identify genes which are probably there but just weren&#039;t sampled, by showing the presence of the rest of the pathway. In short, this could finally accomplish what Mihai asked me to work on when I showed up.&lt;br /&gt;
*** The major theoretical shortcoming of this approach is that it could only identify relatively well characterized pathways.&lt;br /&gt;
*** The practical shortcomings of this approach begin with obtaining a fairly complete copy of KEGG (which, as we&#039;ve learned, is a mess to parse locally and unusably slow to call through the API), and continue with the computational challenge of such a large-scale BLAST operation.&lt;br /&gt;
** Ask Bo about this when he gets back. He may have already done this.&lt;br /&gt;
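The enzyme-vs-metagenome idea above can be sketched in Python (a minimal sketch: the input follows the leading query-id/subject-id/percent-identity columns of standard BLAST tabular output, but the enzyme IDs, read names, identity threshold, and the pathway-coverage helper are all hypothetical):&lt;br /&gt;

```python
# Hypothetical sketch: map KEGG enzyme queries to the metagenomic reads they
# hit, then ask what fraction of an annotated pathway was detected.
def detected_enzymes(blast_tab_lines, min_identity=50.0):
    """Map each enzyme query to the reads it hit at or above min_identity.

    blast_tab_lines: BLAST tabular lines; only the first three columns
    (query id, subject id, percent identity) are used here.
    """
    hits = {}
    for line in blast_tab_lines:
        fields = line.split("\t")
        enzyme, read, pident = fields[0], fields[1], float(fields[2])
        if pident >= min_identity:
            hits.setdefault(enzyme, set()).add(read)
    return hits

def pathway_coverage(pathway_enzymes, hits):
    """Fraction of a pathway with at least one supporting read."""
    found = [e for e in pathway_enzymes if e in hits]
    return len(found) / len(pathway_enzymes)
```

A pathway with high but incomplete coverage would then flag its missing enzymes as candidates that are probably present but unsampled.&lt;br /&gt;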
&lt;br /&gt;
== January 22, 2010 ==&lt;br /&gt;
* Met with Dan and Sergey to talk about the Minimus-Bambus pipeline&lt;br /&gt;
** Minimus is running fine. I&#039;ve begun characterizing its run-time behavior (see next week&#039;s entry)&lt;br /&gt;
** After some tweaking by Sergey, Bambus was able to finish but did not generate a scaffold. We&#039;re going to talk about this after the meeting on Monday.&lt;br /&gt;
** Sergey had an interesting idea for making a better read simulator:&lt;br /&gt;
*** Error-free reads are cheap and easy to generate. The problem is with the error model.&lt;br /&gt;
*** The &amp;quot;best&amp;quot; tool (that we are aware of) which includes error models is MetaSim, but its error models are years out of date and the authors have been historically unreachable. Mihai has now shown me how to edit the models in a reasonable way from flat files, allowing one to characterize base substitutions, but I&#039;m not convinced it would be faster or easier to write a program that modifies these files than it would be to write an entirely new program; and given the amount of time I&#039;ve spent trying to use MetaSim, I&#039;m more than ready to walk away from it. MetaSim also doesn&#039;t work from the command line, so no scripting.&lt;br /&gt;
*** Sergey has pointed out that most companies will assemble &#039;&#039;E. coli&#039;&#039; when they release a new sequencer. Conveniently, there are many high quality assemblies of &#039;&#039;E. coli&#039;&#039; available for reference. It might therefore be possible to generate new error models for these sequencers in an automated fashion by mapping the &#039;&#039;E. coli&#039;&#039; reads to the available reference genomes, collecting the error frequencies, and then using them to mask synthesized reads.&lt;br /&gt;
*** I also talked with Mohammad and Mihai about this, and both thought it was a pretty good idea. Mihai has proposed having Sergey or Mohammad add the described error-model generator to his read sampler (written in C) when they have time, though not in time for the oral microbiome data.&lt;br /&gt;
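Sergey&#039;s error-model idea could be sketched as follows (a simplified, hypothetical sketch: it assumes gap-free read-to-reference alignments and substitution errors only, whereas real 454 data is dominated by indels):&lt;br /&gt;

```python
import random
from collections import Counter

# Hypothetical sketch of the proposed error-model generator. In practice the
# E. coli reads would first be mapped to a reference assembly; here the
# aligned (read, reference) string pairs are given directly.
def substitution_rates(aligned_pairs):
    """Count (reference base, read base) substitutions per aligned base."""
    subs, total = Counter(), 0
    for read, ref in aligned_pairs:
        for r, t in zip(read, ref):
            total += 1
            if r != t:
                subs[(t, r)] += 1
    return {pair: n / total for pair, n in subs.items()}, total

def mask_read(read, rate, rng):
    """Apply uniform substitution errors to an error-free synthetic read."""
    bases = "ACGT"
    out = []
    for b in read:
        if rate > rng.random():
            out.append(rng.choice([x for x in bases if x != b]))
        else:
            out.append(b)
    return "".join(out)
```

A real implementation would use per-base-pair rates (and positional effects) rather than the single uniform rate shown here.&lt;br /&gt;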
&lt;br /&gt;
* Met with James to discuss my work with Volker&lt;br /&gt;
** Told him about my meeting with Volker and the paper he wanted me to prepare, more or less by myself. The concepts of the paper are these:&lt;br /&gt;
*** Most available genomic sequences of mycobacteria are of a very small subset of highly pathogenic organisms.&lt;br /&gt;
*** Subtractive comparative genomics can be used to identify genes that are potentially responsible for differing phenotypes (such as extreme pathogenicity), but there must be available genomic sequences for closely related organisms with differing phenotypes.&lt;br /&gt;
*** Volker has sequenced 2 more non-pathogenic strains of mycobacteria (&#039;&#039;gastri&#039;&#039;, and &#039;&#039;kansasiiW58&#039;&#039;) with the intention of increasing the effectiveness of these subtractive comparative genomic studies.&lt;br /&gt;
*** The meat of the paper would be comparing the results of subtractive comparative genomic analysis using all currently available strains in RefSeq, with the results from also using the two novel sequences.&lt;br /&gt;
*** The other, smaller publishable portion of this project would be a comparison of &#039;&#039;gastri&#039;&#039; and &#039;&#039;kansasiiW58&#039;&#039; to each other, because they are thought to be extremely closely related and yet have distinct phenotypes (which I&#039;ve now forgotten).&lt;br /&gt;
*** James seemed to think this could make an okay paper, and he confirmed that he had not realized Volker was looking for someone to do all of the analysis, both computational and biological, with Volker only contributing analysis of the analysis after it was all over.&lt;br /&gt;
** Ended up also discussing his work on differential abundance in populations of microorganisms.&lt;br /&gt;
*** I&#039;m going to start working on taking over and expanding Metastats this semester.&lt;br /&gt;
*** I&#039;m also going to start talking to Bo when he gets back about exactly what he&#039;s doing, and how I might be able to include pathway prediction in my expansion of Metastats without stepping on his toes.&lt;br /&gt;
*** Mihai has given me his approval to focus on this.&lt;br /&gt;
&lt;br /&gt;
* Met with Mihai to discuss working with Volker&lt;br /&gt;
** Explained that rather than looking for someone to do only the complex portions of the computational analysis, Volker was/is looking for someone to do the complete analysis.&lt;br /&gt;
** In exchange, Volker is offering first authorship and, if need be, to split the student&#039;s funding with their primary PI.&lt;br /&gt;
** I think I&#039;m capable of doing this within 3 or 4 months but it would consume my time pretty thoroughly.&lt;br /&gt;
** Mihai agreed that this is a reasonable deal, but noted that I have no personal interest in studying mycobacteria, and that it&#039;s therefore unwise for me to invest a lot of time becoming an expert on an organism I have no interest in continuing to study or work with. I&#039;ve therefore offered to work closely with one of Volker&#039;s graduate students, who could meet with me every week or two. I would be willing to do all of the computational analysis and explain it to them, but they would have to actually look up the potentially interesting genes and relationships I discover and help me keep the analysis biologically interesting and relevant.&lt;br /&gt;
&lt;br /&gt;
* Met with Mihai and Mohammad to discuss our impending huge-ass(embly) problem&lt;br /&gt;
** Talked about strategies for iterative assembly as an approach to assembling intractably large data sets. Most have glaring shortcomings and complications.&lt;br /&gt;
** Discovered Mike Schatz has a map-reduce implementation of an assembler that uses de Bruijn graphs and is better suited to assemblies with high coverage but short read lengths.&lt;br /&gt;
&lt;br /&gt;
== January 29, 2010 ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Minimus Performance Analysis ===&lt;br /&gt;
I&#039;m testing minimus and bambus in preparation for the oral microbiome data, and after spamming several lab members with email, it occurred to me that it would be considerably more considerate to put the information here instead.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Memory Usage Analysis&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|RAM used by the Overlapper (in GB):&lt;br /&gt;
| 1.2 || 2.4 || 4.5 || 8.7 || 17 || 21.5 || ~1.1 GB * (#Reads in Millions) = (Memory Used)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|RAM used by the Tigger (in GB):&lt;br /&gt;
| 3 || 6 || 12 || 25 || 48.4 || (60) || ~3 GB * (#Reads in Millions) = (Memory Used)&lt;br /&gt;
|}&lt;br /&gt;
* The 16 million read assembly data is from Walnut; all other numbers are rough averages from both Privet and Walnut.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Run Time Analysis on Privet&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Overlapper (in min):&lt;br /&gt;
| 3 || 9 || 34 || 130 || (576) || 783 || 2.96 * (#Reads in Millions)&amp;lt;sup&amp;gt;1.87&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Tigger (in min):&lt;br /&gt;
| 9 || 66 || 473 || (3,456) || (25,088) || (47,493) || 9.03 * (#Reads in Millions)&amp;lt;sup&amp;gt;2.86&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|}&lt;br /&gt;
* Privet has 2.4GHz Opteron 850 processors and 32GB of RAM. Minimus is not parallelized and therefore only uses a single core.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
* The models were generated by plotting the data points in OpenOffice and fitting a power trendline. The R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; value for each was 1.&lt;br /&gt;
* &#039;&#039;&#039;For reference: There are 1,440 minutes in one day, and 10,080 minutes in one week&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:1000px; height:100px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Minimus Run Time Analysis on Walnut&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Overlapper (in min):&lt;br /&gt;
| 2.7 || 8 || 27.5 || 102 || (325) || (481.5) || 2.54 * (#Reads in Millions)&amp;lt;sup&amp;gt;1.75&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Run Time of the Tigger (in min):&lt;br /&gt;
| 14 || 81 || 471.5 || (2,752) || (16,006) || (28,212) || 13.99 * (#Reads in Millions)&amp;lt;sup&amp;gt;2.54&amp;lt;/sup&amp;gt; = (Run Time in Min)&lt;br /&gt;
|}&lt;br /&gt;
* Walnut has 2.8GHz Opteron 875 processors and 64GB of RAM. Minimus is not parallelized and therefore only uses a single core.&lt;br /&gt;
* Numbers listed in parentheses are predictions made using the listed models.&lt;br /&gt;
* The models were generated by plotting the data points in OpenOffice and fitting a power trendline. The R&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt; value for each was 1.&lt;br /&gt;
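Both run-time tables use power-law models of the same form, so the parenthesized predictions can be reproduced with a one-line helper (a sketch; the coefficients are the ones listed in the tables):&lt;br /&gt;

```python
def runtime_minutes(reads_millions, a, b):
    """Power-law run-time model from the tables: a * n^b minutes."""
    return a * reads_millions ** b

# Privet Tigger model at 8 million reads (the table lists 3,456 min):
print(round(runtime_minutes(8, 9.03, 2.86)))
# Walnut Tigger model at 16 million reads (the table lists 16,006 min):
print(round(runtime_minutes(16, 13.99, 2.54)))
```

The linear memory models work the same way, with b = 1.&lt;br /&gt;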
&lt;br /&gt;
&lt;br /&gt;
==== Other Observations About the Assemblies ====&lt;br /&gt;
* Because of the short read length, every million reads is only 75Mbp of sequence, which is roughly 10-20x coverage of an average single bacterium. Because these test sets have reads sampled from roughly 100 bacterial genomic sequences, I would expect the average per-genome coverage to be on the order of 0.1-0.2x.&lt;br /&gt;
* Unsurprisingly, a cursory glance through the contig files shows that each contig comprises only about 2 or 3 reads.&lt;br /&gt;
* The N50 analysis for the smaller assemblies shows that only 2-3 reads are being added to each contig on average, leaving both N50s and average lengths just below 150bp.&lt;br /&gt;
* Therefore if the complexity of the oral microbiome data is high and/or the contamination of human DNA is extreme (80-95%), the coverage may be extremely low. This may make the use of Mike&#039;s assembler impractical, or at least that&#039;s how I&#039;m going to keep justifying this testing to myself until someone corrects me.&lt;br /&gt;
** &#039;&#039;&#039;Update:&#039;&#039;&#039; Apparently Mike and Dan have talked about this, and somewhere around 75-80bp, the performance of minimus catches up with Mike&#039;s de Bruijn graph assembler anyway. I also did not know that Dan&#039;s map-reduce minimus was running and would be used to assemble the data alongside Mike&#039;s.&lt;br /&gt;
* I learned on Feb. 1, 2010 that the 454 error model allows wild variation with respect to read length, so these assemblies might not actually be representative of the performance with the Illumina data we&#039;re expecting on Feb. 20.&lt;br /&gt;
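The coverage arithmetic above can be made explicit (a sketch; 4.5 Mbp is an assumed average bacterial genome size, and the reads are assumed to be sampled evenly):&lt;br /&gt;

```python
def per_genome_coverage(n_reads, read_len_bp, n_genomes, genome_bp=4.5e6):
    """Average depth of coverage per genome, assuming evenly sampled reads."""
    return (n_reads * read_len_bp) / (n_genomes * genome_bp)

print(per_genome_coverage(1_000_000, 75, 1))    # roughly 17x for one genome
print(per_genome_coverage(1_000_000, 75, 100))  # roughly 0.17x across 100 genomes
```
&lt;br /&gt;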
&lt;br /&gt;
=== UMIACS Resources ===&lt;br /&gt;
I just discovered that the information listed on the CBCB intranet Resources page is inaccurate and very out of date, so I&#039;m making my own table.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; style=&amp;quot;text-align:center; width:500px; height:200px&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|+ &#039;&#039;&#039;Umiacs Resources&#039;&#039;&#039;&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Machine !! Processor !! Speed !! Cores !! RAM&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Walnut&lt;br /&gt;
| Dual Core AMD Opteron 8220 || 2.8GHz || 16 || 64GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Privet&lt;br /&gt;
| AMD Opteron 850 || 2.4GHz || 4 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Larch&lt;br /&gt;
| AMD Opteron 850 || 2.4GHz || 4 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Sycamore&lt;br /&gt;
| Dual Core AMD Opteron 875 || 1GHz || 8 || 32GB&lt;br /&gt;
|-&lt;br /&gt;
!align=&amp;quot;left&amp;quot;|Shagbark&lt;br /&gt;
| Intel Core 2 Quad || 2.83GHz || 4 || 4GB&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== February 5, 2010 ==&lt;br /&gt;
&lt;br /&gt;
=== Meeting with Volker and Sarada on Feb 3 ===&lt;br /&gt;
* Need to teach Sarada how to perform a local BLAST on some sequences they have that aren&#039;t yet in GenBank&lt;br /&gt;
* Trying to set up a meeting with Volker to find out for sure if he wants me to work on this project&lt;br /&gt;
&lt;br /&gt;
=== Biomarker Assembly ===&lt;br /&gt;
Bo, Mohammad, and I spent a couple of hours discussing biomarker assembly today. I&#039;m going to try to summarize our conclusions efficiently, but it might be difficult without an easy way to make images. We eventually decided it would be best to attempt several methods in tandem, due to the severe time constraints. The general approach of each method is to fish out and bin reads by one method or another, and then assemble the reads in each bin using minimus. All sequence identity values will be determined using BLASTx.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Preliminary Steps&#039;&#039;&#039;&lt;br /&gt;
* Gather biomarker consensus amino acid sequences&lt;br /&gt;
* Gather amino acid sequences for associated genes from each bacterial genome in refseq&lt;br /&gt;
* Cluster amino acid sequences within each biomarker set&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Sequence Identity Threshold Determination&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
There are 31 biomarkers and about 1,000 bacterial genomes in which they occur. This means there are 31 sets of roughly 1,000 sequences that are all relatively similar to one another. Because of the sequence similarity and the short read length, it&#039;s possible that a significant number of reads will map equally well to multiple sequences within each biomarker set. For this reason, it is better to allow a single read to be placed in every bin containing a sequence to which the read mapped above some minimum threshold. This will protect against artificially lowering the coverage of extremely well-conserved regions, and with any luck, incorrectly binned reads will simply not be included in the assembly. There are several ways to approach the determination of this threshold.&lt;br /&gt;
* Determine the lowest level of sequence identity between the consensus sequence for each biomarker and any actual gene in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.&lt;br /&gt;
** The obvious shortcoming of this approach is that the sequence identity between two homologous gene-length sequences can be much lower than between two homologous read-length sequences.&lt;br /&gt;
&lt;br /&gt;
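The multi-bin placement rule described above can be sketched in Python (a hypothetical sketch: BLASTx hits are assumed to arrive as (read, gene, percent identity) tuples, along with a map from each biomarker gene to its biomarker set):&lt;br /&gt;

```python
def bin_reads(blastx_hits, gene_to_marker, min_identity):
    """Place each read in EVERY biomarker bin it hit above the threshold,
    not just the bin of its single best hit."""
    bins = {}
    for read, gene, pident in blastx_hits:
        if pident >= min_identity and gene in gene_to_marker:
            bins.setdefault(gene_to_marker[gene], set()).add(read)
    return bins
```

Each resulting bin would then be assembled independently with minimus.&lt;br /&gt;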
&#039;&#039;&#039;Schemes&#039;&#039;&#039;&lt;br /&gt;
# Marker-wise assembly&lt;br /&gt;
## Fish out reads that map to any gene in a biomarker set (including the consensus?) above some minimum sequence identity threshold, and place them in a bin.&lt;br /&gt;
## Assemble each bin with minimus.&lt;/div&gt;</summary>
		<author><name>Tgibbons</name></author>
	</entry>
</feed>