Cbcb:Pop-Lab:Ted-Report
== Older Entries ==
[[Cbcb:Pop-Lab:Ted-Report-2009 | 2009]]


== Summer 2009 Goals ==
* Research traditional approaches taken to gene-level analysis of metagenomic data
* Critically evaluate the traditional approaches in general and in the context of current Pop Lab projects
* Identify portions of the analysis that can be automated
* Develop scalable tools to do the automated analysis


== May 15, 2009 ==
* Read the [[VisANT]] paper and user manual[http://visant.bu.edu/]. Determined that VisANT will work for manual metabolic pathway analysis of even large-scale data sets and can be automated by running it in "Batch Mode".
* Need to read about FastHMM[http://www.microbesonline.org/fasthmm/]
* Still need to make a "Welcome Wiki" for n00bs (read: new members)


== May 22, 2009 ==
* Made the Welcome Wiki
* Read metagenomics papers
* Determined that VisANT can be used with Bo's data by importing it as microarray data


== May 29, 2009 ==
* Took an early summer vacation last weekend:
** Drove to NC to see a friend graduate with BSs in CS & Physics
** Went sailing for the first time at my girlfriend's parents' place in VA
* Refined the Welcome Wiki
* Read metagenomics/pathway reconstruction/analysis papers
* Organized a reading group for the Palsson systems biology book


== June 5, 2009 ==
* Read metagenomics/pathway reconstruction/analysis papers and the first two chapters of the Palsson book.
* Currently building a test set for incorporating Phymm into the metagenomics pipeline.
** A single archaeal genome was chosen from the findings of Mihai's 2006 Science paper analyzing the human distal gut.
** Two pairs of bacterial genomes were chosen for the test set using columns on the NCBI RefSeq ''Complete Genomes'' website[http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi]:
*# The pairs of bacterial genomes were taken from the ''Groups'' Bacteroidetes/Chlorobi and Firmicutes because they are the two most predominant groups present in the human gut.
*# I chose genomes with a complete set of NCBI ''Tools'' available.
*# After this I attempted to choose genomes with significantly different GC content.
*# Finally, preference was given to organisms I recognized from gut microbiome papers/discussions, or failing that, just a really awesome name.


=== Description of Trial Metagenomic Dataset ===
::{| class="wikitable" border="1"
|-
!  Organism
!  Classification
!  Genome Length
|- align="center"
|  align="left"|Methanobrevibacter smithii ATCC 35061
|  archaea
|  1,853,160 bp
|- align="center"
|  align="left"|Bacteroides fragilis NCTC 9343
|  bacteroidetes
|  5,205,140 bp
|- align="center"
|  align="left"|Porphyromonas gingivalis W83
|  bacteroidetes
|  2,343,476 bp
|- align="center"
|  align="left"|Aster yellows witches'-broom phytoplasma AYWB
|  firmicutes
|  706,569 bp
|- align="center"
|  align="left"|Bacillus subtilis subsp. subtilis str. 168
|  firmicutes
|  4,214,630 bp
|}
Note: plasmid DNA was not included <br>
The combined length of all the genomes is 14,322,975 bp.


== January 15, 2010 ==

=== Minimus Documentation ===
Presently, the only relevant Google hit for "minimus" on the first page of results is the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus#Basic_usage_example sourceforge wiki.] The only example on this page is incomplete and appears to be an early draft made during development.

Ideally, it should be easy to find a complete guide with the general format:
* Simple use case:
 toAmos -s path/to/fastaFile.seq -o path/to/fastaFile.afg
 minimus path/to/fastaFile(prefix)
* Necessary tools for setup (toAmos)
* Other options
* etc.

The description found on the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus/README Minimus/README] page (linked to from the middle of the starting page) is more appropriate, but it features use cases that may no longer be common and references another required tool (toAmos) without linking to it or describing how to access it. A description of this tool can be found on the Amos [http://sourceforge.net/apps/mediawiki/amos/index.php?title=File_conversion_utilities File Conversion Utilities] page (again, linked to from the starting page), but it is less organized than what I've come to expect from a project page, and it is easy to get lost or distracted by the rest of the Amos documentation while trying to piece together the necessary steps for a basic assembly.

=== Comparative Network Analysis pt. 2 ===
* Meeting with Volker this Friday to discuss how best to apply network alignment to what he's doing
* I'm simultaneously trying to find a way to apply my network alignment technique to predicting genes in metagenomic samples
** I've been trying to find a way to get beyond the restriction that my current program requires genes to be annotated with an EC number. A potentially interesting next step may be to use BioPython to BLAST the sequence of each enzyme annotated in every micro-organism in KEGG against a metagenomic library (see the sketch after this list).
*** The results would be stretches of linked reactions that have been annotated in KEGG pathways.
*** This method could be applied to contigs just as easily as to finished sequences. In a low-coverage scenario, it could be used to identify genes that are probably present but simply weren't sampled, by showing the presence of the rest of the pathway. In short, this could finally accomplish what Mihai asked me to work on when I showed up.
*** The major theoretical shortcoming of this approach is that it could only identify relatively well-characterized pathways.
*** The practical shortcomings begin with obtaining a fairly complete copy of KEGG (which, as we've learned, is a mess to parse locally and unusably slow to call through the API), and continue to the computational challenge of such a large-scale BLAST operation.
** Ask Bo about this when he gets back. He may have already done this.
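Roughly what I have in mind for that BLAST step, as a Python sketch using the legacy NCBI BLAST tools. The file names are hypothetical: it assumes the KEGG enzyme sequences have already been dumped to a protein FASTA file and the metagenomic reads have been formatted into a nucleotide database with formatdb.

 import subprocess
 
 KEGG_ENZYMES = "kegg_enzymes.faa"  # protein sequences pulled from KEGG (hypothetical file)
 READ_DB = "metagenome_reads"       # built with: formatdb -i reads.fna -p F -n metagenome_reads
 
 # tblastn searches protein queries against a translated nucleotide database,
 # which fits protein enzymes vs unannotated metagenomic reads
 subprocess.check_call([
     "blastall", "-p", "tblastn",
     "-i", KEGG_ENZYMES,
     "-d", READ_DB,
     "-e", "1e-5",   # permissive cutoff; would need tuning
     "-m", "8",      # tabular output, one hit per line
     "-o", "enzymes_vs_reads.b8",
 ])

Each line of the tabular report names an enzyme and the read it hit, which is the raw material for stitching hits back into KEGG reactions.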


== June 12, 2009 ==
* Today is my birthday!!! :D
* Last week's meeting was a success!
** The books came in Wednesday, so the reading is now readily available.
** Read chapter 3 and part of the 2006 paper for Friday.
* Met with Arthur to discuss Phymm.
** I have gotten better at using MetaSim and have generated the previously described test data set composed of 1 million 200bp reads, approximately 15x coverage (actual average read length: 215.42bp).
** I have been relearning how to use the AMOS package in preparation for piping the output from Phymm into it.
** Note: It appears that Phymm can be "parallelized" by dividing the query file into smaller files and merging the output files. According to Arthur, each read is scored independently, so the only limits are the number of reads and the number of processors (see the sketch below).
*** I have parsed the test set into 10 files and intend to run them simultaneously on Ginkgo.
** I backed up Arthur's copy of RefSeq Bacteria/Archaea and removed the test sequences from the original database. I still need to find a way to remove the sequences from the BLAST database, and then I'm ready to go.
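A minimal sketch of the query-splitting step, assuming a standard multi-FASTA input (the 10-way split above was done ad hoc; this just automates it):

 from itertools import cycle
 
 def split_fasta(path, n_chunks=10):
     """Deal records from a multi-FASTA round-robin into n_chunks files,
     so each chunk gets roughly the same number of reads."""
     outs = [open("%s.part%02d" % (path, i), "w") for i in range(n_chunks)]
     chunks = cycle(outs)
     current = None
     for line in open(path):
         if line.startswith(">"):
             current = next(chunks)  # new record: rotate to the next chunk
         current.write(line)
     for out in outs:
         out.close()

Because each read is scored independently, Phymm can then be launched on every chunk in parallel and the per-chunk outputs merged afterward.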


== June 19 & 26, 2009 ==
* Forgot to update these weeks, so this is from memory
* I moved all my stuff to /fs/szattic-asmg4/tgibbons/
** Hopefully they get the backups set up soon...
* Arthur and I got PhymmBL to run on those files I mentioned last time
* I concatenated the output files back together using merge sort:
 sort -um results.03.combined_<input file prefix>* > <output file prefix>.fna
* I wrote a post-processing script for the Phymm output called phymm2minimus.pl
** The script outputs to STDOUT the completeness of the classifications at each taxonomic level (a Python sketch of the same computation follows this entry)
** Version 3 of the script then iterates through the original output, creating bins for each taxonomic level
** The scripts and test files are located in /fs/szattic-asmg4/tgibbons/phymm/
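The completeness computation from phymm2minimus.pl, sketched in Python. The column layout here is an assumption for illustration only (one tab-delimited line per read, one column per taxonomic level, empty where Phymm made no call); the real implementation is the Perl script above.

 LEVELS = ["phylum", "class", "order", "family", "genus", "species"]
 
 def completeness(path):
     """Print the fraction of reads classified at each taxonomic level."""
     counts = [0] * len(LEVELS)
     total = 0
     for line in open(path):
         labels = line.rstrip("\n").split("\t")[1:]  # drop the read ID
         total += 1
         for i, label in enumerate(labels[:len(LEVELS)]):
             if label:
                 counts[i] += 1
     for level, n in zip(LEVELS, counts):
         print("%s\t%.1f%%" % (level, 100.0 * n / total))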


== July 3, 2009 ==
I'm back to reading about metabolic network analysis and finally getting some tractable ideas. <br>
I'm trying to create a data structure that will store the matrix: <br>
 RN.#####            (potential reactions the query sequences are capable of)
         .EC.#####    (array of ECs of proteins necessary for reaction)
                   .# (stoichiometry of protein in reaction; not sure if this will work as a child of a pointer)
         .CO.#####    (array of compounds necessary for reaction)
                   .# (stoichiometry of compound in reaction; not sure if this will work as a child of a pointer)
         .r.bool      (reversibility of reaction, shown by => / <=> in KEGG RN equations)


Plus two additional data structures for manipulation by protein and/or compound:
 EC.#####             (array of ECs in query)
         .RN.#####    (array of reactions in which each KO participates)
 
 CO.#####             (array of compounds used in any reactions)
         .RN.#####    (array of reactions in which each compound participates)
         .ChEBI.##### (ChEBI number for compound, or maybe just use the ChEBI number in the first place)


This data structure should be thought of as an R x C matrix plus metadata (such as necessary proteins). Other thoughts on the use of the matrix:
* It would be interesting to evaluate the matrix using chemical moieties (e.g. CoA, AMP) as described in section 6.5 of Palsson's book, but that would require more significant processing of the KEGG information and might break the above-described data structure. Palsson offers a list of common metabolic moieties in Table 6.2 on p. 102, which he has adapted from another bioengineering book. The KEGG database implicitly treats enzymes as moieties.
* A common question will be to find the metabolites required to make the essential building blocks for an organism. To answer this we will need a list of said building blocks, or target products. Palsson offers such a list in Table 6.1 on p. 98, which he has adapted from the same bioengineering book.
* Next we will likely want to fill in the gaps. This may require storing additional information from the assignment of protein sequences to KOs, as well as expanding the arrays. Alternatively, it may be better to simply create all new data structures, but that would likely end up being redundant. With that information we could then attempt statistical analysis of partial reactions to determine the likelihood that a missing component was simply overlooked during sampling.
* A further step may be to evaluate the gene set's ability to take up the necessary precursors through either passive or active means.
* Another interesting form of analysis, when possible, would be to multiply the reaction matrix by the concentrations of substrates in the environment (presumably determined by mass spec) and observe the potential output concentrations.


Useful files I'm considering using for the network analysis:
* [ftp://ftp.genome.jp/pub/kegg/genes/ko ftp://ftp.genome.jp/pub/kegg/genes/ko] contains detailed information about each KO
* [ftp://ftp.genome.jp/pub/kegg/ligand/enzyme/enzyme ftp://ftp.genome.jp/pub/kegg/ligand/enzyme/enzyme] contains detailed enzyme information by EC number instead of KO
* [ftp://ftp.genome.jp/pub/kegg/ligand/reaction/reaction ftp://ftp.genome.jp/pub/kegg/ligand/reaction/reaction] contains detailed information about each RN, including enzyme and compound information
* [ftp://ftp.genome.jp/pub/kegg/ligand/compound/compound ftp://ftp.genome.jp/pub/kegg/ligand/compound/compound] contains, as you might be guessing by now, compound information


Proposed method of building arrays (sketched in Python below):
# Assign protein sequences to KOs
# Build a list of reactions associated with each KO
# Eliminate any reaction lacking its necessary protein(s)
#* Alternatively, store the partial reactions elsewhere or mark them as incomplete
# Build R, KO, and C from the final list of reactions using the files above
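A minimal sketch of the structure above using plain Python dicts, just to pin down the shape (the July 10 goals call for a C++ implementation; the KEGG identifiers here are only illustrative):

 # One entry per KEGG reaction (RN number). Enzymes and compounds map to
 # stoichiometric coefficients; negative = consumed, positive = produced.
 reactions = {
     "R00200": {
         "enzymes": {"2.7.1.40": 1},   # EC number -> copies required
         "compounds": {"C00002": -1,   # ATP (substrate)
                       "C00022": -1,   # pyruvate (substrate)
                       "C00008": 1,    # ADP (product)
                       "C00074": 1},   # PEP (product)
         "reversible": True,           # '<=>' vs '=>' in the KEGG equation
     },
 }
 
 # The two companion structures are just inverted indexes:
 reactions_by_enzyme = {}    # EC -> set of RNs
 reactions_by_compound = {}  # CO -> set of RNs
 for rn, info in reactions.items():
     for ec in info["enzymes"]:
         reactions_by_enzyme.setdefault(ec, set()).add(rn)
     for co in info["compounds"]:
         reactions_by_compound.setdefault(co, set()).add(rn)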


== July 10, 2009 ==
Goals for July:
* Decide on a method for assigning protein sequences to KOs and implement it
* Implement the data structure described in the July 3rd entry using C++
** Test the output by printing to tab-delimited files that can be viewed in Excel
** Start playing around with software tools using test data
* Finish Part II of the Palsson book
Algorithm for elucidating all necessary precursors (warning: iz fugly; a Python sketch follows):
# Push all the precursors onto a stack
# Iterate through the precursors
# For any reaction containing a positive coefficient for the precursor, push the reactants (compounds with negative coefficients) of the reaction onto the stack
# *A miracle happens*
# Print the output as a network of reactions(?)
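A sketch of that traversal over the reaction table from the July 3 entry, with a visited set standing in for the miracle step (it cuts off cycles but does nothing clever about them):

 def required_precursors(targets, reactions):
     """targets: compound IDs that must be produced.
     Returns every compound consumed, directly or indirectly, to make them."""
     stack = list(targets)
     needed = set()
     while stack:
         compound = stack.pop()
         if compound in needed:
             continue  # already expanded; this is what breaks cycles
         needed.add(compound)
         for info in reactions.values():
             if info["compounds"].get(compound, 0) > 0:  # produces it
                 for co, coeff in info["compounds"].items():
                     if coeff < 0:  # its reactants become new targets
                         stack.append(co)
     return needed

Printing the result as a network of reactions is left exactly as unresolved as step 5 above.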


== July 17, 2009 ==
Talked to Dan:
* Regenerated the test set (see the June 5 entry) of 1,000,000 reads with the default 454 error model and a uniform read length of 200bp
Other than that I'm trying to read up on pathway analysis and then help James with his project for Volker, but complications from moving last weekend are making this a disappointingly unproductive week.


== July 24 & 31, 2009 ==
* I have been learning bash scripting to finish integrating Phymm into the AMOS pipeline. I seem to be nearly done, but I'm running into problems with the MetaSim data set. Hopefully repeating the procedures will at least help smooth out the edges.
* I've also resumed using Google Code via SVN to back up and log my work, as well as to make it available to the lab: code.google.com/p/pop-lab/
** Note: Google Code is currently testing an implementation of Mercurial and should begin supporting it for the general user pool soon!
* As for the pathway analysis, I realized at this week's lab meeting that I've been brainstorming more or less exactly what Bo's been implementing. So I'm back to reading without a strong sense of direction.
* After much playing around with MetaSim I became more comfortable with how to use it and figured out some of the quirks not described in the manual. I've generated another test data set from the same 5 sequences and found some strange inconsistencies. The same number of files was output, but the new files are smaller, while the Phymm output files are larger. Phymm also appears to have created significantly different bins for the new data set, despite it being generated from the same sequences. The average taxonomic completeness at a given taxonomic level for the first set was about 98%. For the new set the average is around 80% (with the noteworthy exception of Family at 98%).


== August 7, 2009 ==
* I have apparently fixed all the bugs in my phymm2minimus.pl script regarding handling special characters in the NCBI taxonomy file! I am pretty happy/excited about this. Consequently, I have another set of bins I can attempt to assemble now.
* Arrrgh! toAmos is still telling me the fasta sequence file has 208 entries while the fasta trace file has 312 entries! Blaaaarrgh!!!


== August 28, 2009 ==
Focus on automated analysis of host-pathogen interactions in metagenomic samples.


== September 25, 2009 ==
I've been making good progress testing the effects of prebinning on minimus assemblies. So far I'm still using the test set described in the [[Cbcb:Pop-Lab:Ted-Report#June_5.2C_2009|June 5 entry]], although I'm now generating reads using a simpler perl script written by Arthur Brady instead of MetaSim.


== January 22, 2010 ==
* Met with Dan and Sergey to talk about the Minimus-Bambus pipeline
** Minimus is running fine. I've begun characterizing its run-time behavior (see next week's entry)
** After some tweaking by Sergey, Bambus was able to finish but did not generate a scaffold. We're going to talk about this after the meeting on Monday.
** Sergey had an interesting idea for making a better read simulator:
*** Error-free reads are cheap and easy to generate. The problem is with the error model.
*** The "best" tool (that we are aware of) which includes error models is MetaSim, but its error models are years out of date and the author has been historically unreachable. While Mihai has now shown me how to edit the models in a reasonable way from flat files, allowing me to characterize base substitutions, I'm not convinced it would be faster or easier to write a program to modify these files than it would be to just write an entirely new program; and given the amount of time I've spent trying to use MetaSim, I'm more than ready to walk away from it. Oh yeah, and MetaSim doesn't work from the command line, so no scripting.
*** Sergey has pointed out that most companies will assemble ''E. coli'' when they release a new sequencer. Conveniently, there are many high-quality assemblies of ''E. coli'' available for reference. It might therefore be possible to generate new error models for these sequencers in an automated fashion by mapping the ''E. coli'' reads to the available reference genomes, collecting the error frequencies, and then using them to mask synthesized reads.
*** I also talked with Mohammad and Mihai about this, and they seemed to think it was a pretty good idea too. Mihai has proposed having Sergey or Mohammad add the described error-model generator to his read sampler (written in C) when they have time, but not in preparation for the oral microbiome data.
* Met with James to discuss my work with Volker
** Told him about my meeting with Volker and the paper he wanted me to prepare, more or less by myself. The concepts of the paper are these:
*** Most available genomic sequences of mycobacteria are of a very small subset of highly pathogenic organisms.
*** Subtractive comparative genomics can be used to identify genes that are potentially responsible for differing phenotypes (such as extreme pathogenicity), but there must be available genomic sequences for closely related organisms with differing phenotypes.
*** Volker has sequenced 2 more non-pathogenic strains of mycobacteria (''gastri'' and ''kansasii'' W58) with the intention of increasing the effectiveness of these subtractive comparative genomic studies.
*** The meat of the paper would be comparing the results of subtractive comparative genomic analysis using all currently available strains in RefSeq with the results from also using the two novel sequences.
*** The other, smaller publishable portion of this project would be a comparison of ''gastri'' and ''kansasii'' W58 to each other, because they are thought to be extremely closely related and yet have distinct phenotypes (which I've now forgotten).
*** James seemed to think this could make an okay paper, and he confirmed that he had not understood that Volker was looking for someone to do all of the analysis, both computational and biological, with Volker only contributing analysis of the analysis after it was all over.
** Ended up also discussing his work on differential abundance in populations of microorganisms.
*** I'm going to start working on taking over and expanding Metastats this semester.
*** I'm also going to start talking to Bo when he gets back about exactly what he's doing, and how I might be able to include pathway prediction in my expansion of Metastats without stepping on his toes.
*** Mihai has given me his approval to focus on this.
* Met with Mihai to discuss working with Volker
** Explained that rather than looking for someone to do only the complex portions of the computational analysis, Volker was/is looking for someone to do the complete analysis.
** In exchange, Volker is offering first authorship and, if need be, to split the student's funding with their primary PI.
** I think I'm capable of doing this within 3 or 4 months, but it would consume my time pretty thoroughly.
** Mihai agreed that this is a reasonable deal, but noted that I have no personal interest in studying mycobacteria, and it's therefore unwise of me to invest a bunch of time becoming an expert on an organism I have no interest in continuing to study or work with. I've therefore offered Volker to work closely with one of his graduate students, who could meet with me every week or two. I would be willing to do all of the computational analysis and explain it to them, but they would have to actually look up the potentially interesting genes and relationships I discover and help me keep the analysis biologically interesting and relevant.
* Met with Mihai and Mohammad to discuss our impending huge-ass(embly) problem
** Talked about strategies for iterative assembly as an approach to assembling intractably large data sets. Most have glaring shortcomings and complications.
** Discovered Mike Schatz has a map-reduce implementation of an assembler that uses de Bruijn graphs and is better suited to assemblies with high coverage but short read lengths.


== January 29, 2010 ==

=== Minimus Performance Analysis ===
I'm testing minimus and bambus in preparation for the oral microbiome data, and after spamming several lab members with email, it occurred to me that it would be considerably more considerate to put the information here instead.

{| class="wikitable" style="text-align:center; width:1000px; height:100px" border="1"
|+ '''Minimus Memory Usage Analysis'''
!align="left"|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model
|-
!align="left"|RAM used by the Overlapper (in GB):
| 1.2 || 2.4 || 4.5 || 8.7 || 17 || 21.5 || ~1.1 GB * (#Reads in Millions) = (Memory Used)
|-
!align="left"|RAM used by the Tigger (in GB):
| 3 || 6 || 12 || 25 || 48.4 || (60) || ~3 GB * (#Reads in Millions) = (Memory Used)
|}
* The 16 million read assembly data is from Walnut; all other numbers are rough averages from both Privet and Walnut.
* Numbers listed in parentheses are predictions made using the listed models.


{| class="wikitable" style="text-align:center; width:1000px; height:100px" border="1"
|+ '''Minimus Run Time Analysis on Privet'''
!align="left"|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model
|-
!align="left"|Run Time of the Overlapper (in min):
| 3 || 9 || 34 || 130 || (576) || 783 || 2.96 * (#Reads in Millions)<sup>1.87</sup> = (Run Time in Min)
|-
!align="left"|Run Time of the Tigger (in min):
| 9 || 66 || 473 || (3,456) || (25,088) || (47,493) || 9.03 * (#Reads in Millions)<sup>2.86</sup> = (Run Time in Min)
|}
* Privet has 2.4GHz Opteron 850 processors and 32GB of RAM. Minimus is not parallelized and therefore only uses a single core.
* Numbers listed in parentheses are predictions made using the listed models.
* The models were generated by plotting the data points in OpenOffice and fitting a polynomial trendline. The R<sup>2</sup> value for each was 1.
* '''For reference: there are 1,440 minutes in one day, and 10,080 minutes in one week.'''


{| class="wikitable" style="text-align:center; width:1000px; height:100px" border="1"
|+ '''Minimus Run Time Analysis on Walnut'''
!align="left"|Number of 75bp Reads (in millions): !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model
|-
!align="left"|Run Time of the Overlapper (in min):
| 2.7 || 8 || 27.5 || 102 || (325) || (481.5) || 2.54 * (#Reads in Millions)<sup>1.75</sup> = (Run Time in Min)
|-
!align="left"|Run Time of the Tigger (in min):
| 14 || 81 || 471.5 || (2,752) || (16,006) || (28,212) || 13.99 * (#Reads in Millions)<sup>2.54</sup> = (Run Time in Min)
|}
* Walnut has 2.8GHz Opteron 875 processors and 64GB of RAM. Minimus is not parallelized and therefore only uses a single core.
* Numbers listed in parentheses are predictions made using the listed models.
* The models were generated by plotting the data points in OpenOffice and fitting a polynomial trendline. The R<sup>2</sup> value for each was 1.

==== Other Observations About the Assemblies ====
* Because of the short read length, every million reads is only 75Mbp of sequence. This is roughly 10-20x coverage of an average single bacterium, but these test sets have reads sampled from roughly 100 bacterial genomic sequences, so I would expect the coverage to be on the order of 0.1x on average.
* Unsurprisingly, a cursory glance through the contig files shows that each contig is only composed of about 2 or 3 reads.
* The n50 analysis for the smaller assemblies shows that only 2-3 reads are being added to each contig on average, leaving both n50s and average lengths just below 150bp.
* Therefore, if the complexity of the oral microbiome data is high and/or the contamination with human DNA is extreme (80-95%), the coverage may be extremely low. This may make the use of Mike's assembler impractical, or at least that's how I'm going to keep justifying this testing to myself until someone corrects me.
** '''Update:''' Apparently Mike and Dan have talked about this, and somewhere around 75-80bp the performance of minimus catches up with Mike's de Bruijn graph assembler anyway. I also did not know that Dan's map-reduce minimus was running and would be used to assemble the data alongside Mike's.
* I learned on Feb. 1, 2010 that the 454 error model allows wild variation with respect to read length, so these assemblies might not actually be representative of the performance with the Illumina data we're expecting on Feb. 20.
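The fitted run-time models from the tables above, wrapped as a quick estimator (coefficients copied straight from the tables; these are purely empirical fits, so extrapolations are only as trustworthy as the parenthesized predictions):

 MODELS = {  # (machine, stage) -> (coefficient, exponent)
     ("privet", "overlapper"): (2.96, 1.87),
     ("privet", "tigger"):     (9.03, 2.86),
     ("walnut", "overlapper"): (2.54, 1.75),
     ("walnut", "tigger"):     (13.99, 2.54),
 }
 
 def predicted_minutes(machine, stage, million_reads):
     """Predicted wall-clock minutes for a given number of reads (in millions)."""
     a, b = MODELS[(machine, stage)]
     return a * million_reads ** b
 
 # e.g. predicted_minutes("privet", "tigger", 20) reproduces the (47,493)
 # prediction in the Privet table, up to rounding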


=== UMIACS Resources ===
I just discovered the information listed on the CBCB intranet Resources page is inaccurate and very out of date, so I'm making my own table.

{| class="wikitable" style="text-align:center; width:500px; height:200px" border="1"
|+ '''UMIACS Resources'''
!align="left"|Machine !! Processor !! Speed !! Cores !! RAM
|-
!align="left"|Walnut
| Dual Core AMD Opteron 8220 || 2.8GHz || 16 || 64GB
|-
!align="left"|Privet
| AMD Opteron 850 || 2.4GHz || 4 || 32GB
|-
!align="left"|Larch
| AMD Opteron 850 || 2.4GHz || 4 || 32GB
|-
!align="left"|Sycamore
| Dual Core AMD Opteron 875 || 1GHz || 8 || 32GB
|-
!align="left"|Shagbark
| Intel Core 2 Quad || 2.83GHz || 4 || 4GB
|}


== October 9, 2009 ==
Generated three sets of a million 200bp reads, assembled each with Minimus, then for each set:
* Labeled each read with PhymmBL (which generates labels with BLAST as an intermediate step)
* Generated bins at six phylogenetic levels (phylum-phylotype) using each labeling method (BLAST & PhymmBL)
* Assembled each bin using Minimus
* Combined the contigs from each bin along with the unbinned reads (not all reads have a classification at all phylogenetic levels), and then assembled the full set with Minimus
I then calculated the n50s (sketched after this entry) and compared each assembly with dnadiff:
* Unlike with the reads generated with MetaSim, I did not observe a ~2.5x increase in n50 size. In fact, the n50 was a little bit lower after binning.
** Unfortunately, because the MetaSim reads were suspect due to excessive incompleteness of the taxonomic assignment, these results seem more reliable.
** The n50s were
* The dnadiff reports were also not encouraging, showing a reduction in the total length of all contigs from the expected 14Mbp to 11Mbp.
Next step:
* Find out what went wrong. For now this means looking for:
** Differences in read assignments between the binned and unbinned assemblies
** Homogeneity of reads in bins (i.e. make sure that even if reads from a single genome have been scattered across many bins, bins do not contain reads from a variety of genomes)
To-do list for next week:
# Check for homogeneity in Phymm bins
# Create gene annotation/comparison pipeline for Volker
# Modify Phymm
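For reference, the n50 statistic used throughout these comparisons, sketched in Python:

 def n50(contig_lengths):
     """Smallest contig length L such that contigs of length >= L
     contain at least half of the total assembled bases."""
     total = sum(contig_lengths)
     running = 0
     for length in sorted(contig_lengths, reverse=True):
         running += length
         if 2 * running >= total:
             return length
     return 0
 
 # e.g. n50([150, 150, 300, 400]) -> 300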


== October 16, 2009 ==
* Check for homogeneity in Phymm bins
** Traced the read labels back to the original sequences, only to discover that the source labels are lost while sampling reads :(
** So now I'm generating a new set (testSet10) with equal 10x coverage for each of the five genomes, and will cat the multi-fasta read files instead of the genome fasta files. Then I'll repeat two days of computation and finally be where I thought I'd started. The new test set will have the composition described below.


=== Description of Regenerated Trial Metagenomic Dataset ===
::{| class="wikitable" border="1"
|-
!  Organism
!  ID
!  Classification
!  Genome Length
!  # 200bp Reads
|- align="center"
|  align="left"|Methanobrevibacter smithii ATCC 35061
|  NC_009515
|  archaea
|  1,853,160 bp
|  92,658
|- align="center"
|  align="left"|Bacteroides fragilis NCTC 9343
|  NC_003228
|  bacteroidetes
|  5,205,140 bp
|  260,257
|- align="center"
|  align="left"|Porphyromonas gingivalis W83
|  NC_002950
|  bacteroidetes
|  2,343,476 bp
|  117,174
|- align="center"
|  align="left"|Aster yellows witches'-broom phytoplasma AYWB
|  NC_007716
|  firmicutes
|  706,569 bp
|  35,328
|- align="center"
|  align="left"|Bacillus subtilis subsp. subtilis str. 168
|  NC_000964
|  firmicutes
|  4,214,630 bp
|  210,732
|}

* Create gene annotation/comparison pipeline for Volker
*# BLAST genes against genomes
*#* All genes vs all genomes probably isn't a very practical approach. Consider an initial setup step that blasts all genes vs all genomes and stores the resulting hits in a database for later use, or allowing the user to select the queries and targets. The first approach would probably be better because it wouldn't take up <i>too</i> much space and would allow rapid, flexible analysis. I should talk to Bo about setting up something like that.
*# Rank genes by %ID ratio between user-defined partitions
*# Provide other associated information for genes, such as locale
** This is all inspired by feedback I got from Volker and Petros Karakousis at Johns Hopkins. After showing them an ordered list of genes ranked by their presence in virulent strains minus their presence in non-virulent strains (where "virulence" was defined as being a pulmonary pathogen), they had many questions about the top hits. They found it interesting that there were nearly 80 genes perfectly conserved across all virulent strains, yet completely absent in the non-virulent strains. They didn't recognize most of the genes, however, and seemed curious about the kinds of data that could be provided by James' other scripts.
* <s>Modify Phymm</s>
** <s>Create personal directory in phymmTests filled with mostly links back to phymmTests</s>
** <s>done.</s> <s>Broke something.</s> Fixed it.
* I have midterms the next two weeks, so things will probably be slow going. I should have known when I resisted studying at the beginning of the semester that it would catch up with my research eventually.


== October 30, 2009 ==
Trying to frame the mycobacterium genomic analysis such that it's appropriate to use as my project in Carl's Biological Networks class:
* Have a collection of mycobacterium genomes and plasmids, some annotated, some not
* Need to compare networks across genomes/plasmids
** Look into PathBLAST, GeneMerge, and other comparative genomics software
* First must:
** Quickly annotate genes in novel sequences
*** Gene-finding software
*** BLAST known genes in other strains
*** Find some way to combine the results of both
** Identify pathways in all sequences
*** Assign genes to KO #s


=== NCBI Info Summary for Mycobacterium Genome Sequences ===
* [http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi NCBI] has 22 completed genomic sequences representing as many strains, plus 9 plasmid sequences non-uniformly distributed among them
* [http://www.genome.jp/kegg/catalog/org_list.html KEGG] has pathway annotations for 21 strains
* 3 strains were annotated by TIGR, none containing plasmids. These will be treated as the highest-confidence annotations.
** [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj&cmd=Retrieve&dopt=Overview&list_uids=88 Mycobacterium avium 104]
** [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj&cmd=Retrieve&dopt=Overview&list_uids=92 Mycobacterium smegmatis str. MC2 155]
** [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj&cmd=Retrieve&dopt=Overview&list_uids=223 Mycobacterium tuberculosis CDC1551]
* 5 strains were annotated by JGI, including 3 strains containing a total of 6 plasmids. These will be considered as having the next highest level of confidence.
** [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj&cmd=Retrieve&dopt=Overview&list_uids=15760 Mycobacterium gilvum PYR-GCK] - contains 3 plasmids
** [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj&cmd=Retrieve&dopt=Overview&list_uids=16079 Mycobacterium sp. JLS]
** [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj&cmd=Retrieve&dopt=Overview&list_uids=16081 Mycobacterium sp. KMS] - contains 2 plasmids
** [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj&cmd=Retrieve&dopt=Overview&list_uids=15762 Mycobacterium sp. MCS] - contains 1 plasmid
** [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj&cmd=Retrieve&dopt=Overview&list_uids=15761 Mycobacterium vanbaalenii PYR-1]
* 6 strains were annotated by other centers I've heard regularly referenced and will probably treat with some intermediate level of confidence if needed.
** [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj&cmd=Retrieve&dopt=Overview&list_uids=89 Mycobacterium bovis AF2122/97] - Sanger
** [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj&cmd=Retrieve&dopt=Overview&list_uids=90 Mycobacterium leprae TN] - Sanger
** [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj&cmd=Retrieve&dopt=Overview&list_uids=16725 Mycobacterium marinum M] - Sanger, 1 plasmid
** [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj&cmd=Retrieve&dopt=Overview&list_uids=15642 Mycobacterium tuberculosis F11] - Broad
** [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj&cmd=Retrieve&dopt=Overview&list_uids=224 Mycobacterium tuberculosis H37Rv] - Sanger
** [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj&cmd=Retrieve&dopt=Overview&list_uids=21055 Mycobacterium tuberculosis KZN 1435] - Broad
* 8 strains were annotated by various other institutes and will be treated with relatively low confidence.
** [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj&cmd=Retrieve&dopt=Overview&list_uids=15691 Mycobacterium abscessus] - Genoscope, 1 plasmid
** [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj&cmd=Retrieve&dopt=Overview&list_uids=91 Mycobacterium avium subsp. paratuberculosis K-10] - U Minnesota
** [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj&cmd=Retrieve&dopt=Overview&list_uids=18279 Mycobacterium bovis BCG] - Fiocruz Brazil
** [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj&cmd=Retrieve&dopt=Overview&list_uids=18059 Mycobacterium bovis BCG str. Pasteur 1173P2] - Institut Pasteur
** [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj&cmd=Retrieve&dopt=Overview&list_uids=31211 Mycobacterium bovis BCG str. Tokyo 172] - Japan BCG
** [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj&cmd=Retrieve&dopt=Overview&list_uids=31271 Mycobacterium leprae Br4923] - Institut Pasteur
** [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj&cmd=Retrieve&dopt=Overview&list_uids=18883 Mycobacterium tuberculosis H37Ra] - Chinese National HGC, Shanghai
** [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj&cmd=Retrieve&dopt=Overview&list_uids=16230 Mycobacterium ulcerans Agy99] - Institut Pasteur, 1 plasmid
* All of these, plus the two novel sequences, will comprise the study set.
** Kansasii
** Gastri


== February 5, 2010 ==

=== Meeting with Volker and Sarada on Feb 3 ===
* Need to teach Sarada how to perform a local BLAST on some sequences they have that aren't yet in GenBank
* Trying to set up a meeting with Volker to find out for sure if he wants me to work on this project

=== Biomarker Assembly ===
Bo, Mohammad, and I spent a couple hours discussing biomarker assembly today. I'm going to try to efficiently summarize our conclusions, but it might be difficult without an easy way to make images. We eventually decided it would be best to attempt several methods in tandem, due to the severe time constraints. The general approach of each method is to fish out and bin reads through one method or another, and then assemble the reads in each bin using minimus. All sequence identity values will be determined using BLASTx.

'''Preliminary Steps'''
* Gather biomarker consensus amino acid sequences
* Gather amino acid sequences for the associated genes from each bacterial genome in RefSeq
* Cluster the amino acid sequences within each biomarker set

'''Sequence Identity Threshold Determination''' <br>
There are 31 biomarkers and about 1,000 bacterial genomes in which they occur. This means that there are 31 sets of 1,000 sequences that are all relatively similar to one another. Because of the sequence similarity and the short read length, it's possible that a significant number of reads will map equally well to multiple sequences within each biomarker set. For this reason, it is better to allow a single read to be placed in any bin containing a sequence to which the read mapped above some minimum threshold. This will protect against synthetically lowering the coverage of extremely well-conserved regions, and with any luck, incorrectly binned reads will simply not be included in the assembly. There are several ways to approach the determination of this threshold.
* Determine the lowest level of sequence identity between the consensus sequence for each biomarker and any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.
** The obvious shortcoming of this approach is that the sequence identity between two homologous gene-length sequences can be much lower than between two homologous read-length sequences.
* Align 75mers to determine the lowest score between any two 75mers in the consensus sequence for each biomarker and the corresponding 75mer in any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.
** While this solves the problem with the above approach, it is significantly more complicated, and the data is going to be here soon.
* Choose a sequence identity level, or try a few different levels and see which produces the most complete biomarker proteins without creating overly complex graphs.
** While there's no good theoretical justification for this approach, it's probably what we'll do, and it will probably work well enough.

'''Schemes''' <br>
After making absurdly complicated descriptions of the various approaches, which I felt weren't very clear, I used Keynote to recreate the diagrams we'd drawn on the whiteboard and then printed them to a PDF. Unfortunately I'm not sure exactly how to embed that in the wiki, so email me at trgibbons@gmail.com if you're reading this and I'll send it to you.
# Marker-wise assembly
#* Bin reads that align to any sequence in a given marker set, and/or the consensus sequence for that marker
# Cluster-wise assembly
#* Cluster protein sequences
#* Bin reads that align to any protein sequence in a given cluster
# Gene-wise assembly
#* Bin reads that align to a particular protein sequence
* Marker-wise and cluster-wise binning should be better for assembling novel sequences
* Gene-wise binning should produce higher-quality assemblies of markers for known organisms or those that are closely related


== February 12, 2010 ==
'''SNOW!!'''


== February 19, 2010 ==
Met with James to discuss Metastats. I'm going to attempt the following two updates by the end of the semester (I may have described them incorrectly, but I'll work it out later):
# Find a better way to compute the false discovery rate (FDR) (see the sketch after this list)
#* Currently computed by using the lowest 100 p-values from each sample (look at the source code)
#* Need to find a more algebraically rigorous way to compute it
#* The false positive rate for 1000 samples is the p-value (p=0.05 => 50 H_a's will be incorrectly predicted; so if the null hypothesis is rejected for 100 samples, 50 of them would be expected to be incorrect)
#* James just thinks this sucks and needs to be fixed
# Compute F-tests for features across all samples
#* Most requested feature
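One standard candidate for that "more rigorous way" is the Benjamini-Hochberg step-up procedure; a sketch of it below (a note to self, not what Metastats currently does):

 def benjamini_hochberg(pvals):
     """Benjamini-Hochberg adjusted p-values (q-values), in input order."""
     m = len(pvals)
     order = sorted(range(m), key=lambda i: pvals[i])  # indexes, smallest p first
     qvals = [0.0] * m
     prev = 1.0
     for offset, i in enumerate(reversed(order)):
         rank = m - offset  # 1-based rank of p-value i
         prev = min(prev, pvals[i] * m / rank)
         qvals[i] = prev
     return qvals
 
 # Features with q-value <= 0.05 can be reported with an expected FDR of 5%.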


::{| class="wikitable" border="1"
My Thesis Project:
|-
* I'm generally interested in pathways shared between micro-organisms in a community, and also between micro-organisms and their multicellular hosts.
!  Dataset
** I'm particularly interested in studying the metabolic pathways shared between micro-organisms in the human gut, both with each other and their human hosts.
!  Read Length
* James has created time-series models, and is interested in tackling spacial models with me.
!  Error Model
* I would really like to correlate certain metabolic pathways with his modeled relationships.
!  Overall Coverage
!  Dispersion
!  Total Assemblies
|- align="center"
|  align="left"|simMC, simLC, simHC
|  200bp, 400bp, 600bp
|  none, 454
|  1x, 3x, 10x
|  uniform, random skew (light, medium, heavy), dominant organism(s)
|  54 x dispersion cases
|- align="center"
|  align="left"|simMC, simLC, simHC
|  50bp, 100bp
|  none, Solexa
|  1x, 3x, 10x
|  uniform, random skew (light, medium, heavy), dominant organism(s)
|  36 x dispersion cases
|}
Current plan of attack for analyzing mycobacterial genomes (listed as [http://www.genome.jp/kegg/soap/doc/keggapi_manual.html KEGG SOAP API commands]):
* [http://www.genome.jp/kegg/soap/doc/keggapi_manual.html#label:106 get_genes_by_organism]
* get_*_by_gene where * represents: enzymes, best_neighbors, best_best_neighbors, reverse_best_neighbors, paralogs, motifs, ko, or pathways
Will build stoichiometry arrays with these functions and manipulate them with python


== November 13, 2009 ==
Volker's Project:
* Has taken a big hit this week.
* I'm going to go forward, with Bo's help, using the plan outlined in my project proposal for Mihai's biosequence analysis class:
** Use reciprocal best BLAST hits to map H37Rv genes to annotated genes in all available virulent and non-virulent strains of mycobacteria (sketched below)
** Use the results from gene mapping to identify a core set of tuberculosis genes, as well as a set of predicted virulence genes
** Use a variety of comparison schemes to study the effect that considering different subsets of non-virulent strains has on the set of predicted virulence genes
** Use stable virulence predictions to rank genes as virulence targets

=== Subsets of Mycobacterial Strains ===
Met with Volker and trimmed down the study set considerably. I am now focusing on three phenotypically defined subsets:
# Pulmonary Pathogens
#* H37Rv
#* CDC1551
#* Bovis
# Facultative Pathogens
#* Abscessus
#* Avium
#* Kansasii W58
#* Bovis BCG
#* H37Ra
# Environmental Non-Pathogenic
#* Smegmatis
#* Gastri
#* Vanbaalenii
#* Gilvum

With the intention of making the following comparisons: <br>
# Pulmonary vs Fac-Path
# Pulmonary vs Attenuated
# Fac-Path vs Non-Path
# Gastri vs Kansasii
#* Non-path vs Fac-path
#* Closely related
# Bovis vs (H37Rv & CDC1551)
#* Bovis is not very contagious between humans, but just as deadly
# Pulmonary vs Non-Path
#* If time
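The reciprocal-best-hit mapping from the plan above, sketched from a pair of tabular (-m 8) BLAST reports: H37Rv genes searched against a strain's genes and vice versa (file names hypothetical):

 def best_hits(blast_m8):
     """Map each query to the subject of its single best-scoring hit
     (column 12 of -m 8 output is the bit score)."""
     best = {}
     for line in open(blast_m8):
         fields = line.split("\t")
         query, subject, bits = fields[0], fields[1], float(fields[11])
         if query not in best or bits > best[query][1]:
             best[query] = (subject, bits)
     return dict((q, sb[0]) for q, sb in best.items())
 
 def reciprocal_best_hits(fwd_m8, rev_m8):
     """Gene pairs that are each other's best hit in both directions."""
     fwd = best_hits(fwd_m8)  # e.g. h37rv_vs_strain.b8
     rev = best_hits(rev_m8)  # e.g. strain_vs_h37rv.b8
     return [(g, h) for g, h in fwd.items() if rev.get(h) == g]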


=== Downloading Information From KEGG ===
Using SOAPpy to download information gene-by-gene for complete sets of genes takes hours. Luckily, I've discovered that Python supports "pickling" data structures, which saves them to files that can be easily loaded by other programs. I've generated the following Python pickles. <br>
Most of the pickles are generated directly from SOAPpy functions, as follows: <br>
"x_by_y.kegg" is generated with the associated SOAPpy function "get_x_by_y(y[i])" over all i (a sketch of this loop follows the lists below). <br>
For nested data structures, "x_by_y_by_z.kegg", I nested the functions as you would expect, except where otherwise noted.

Information by Organism:
# pathways_by_organism.kegg
#* each pathway name is appended with the organism name
#* filled directly from the associated SOAPpy function
# genPaths_by_organism.kegg
#* generalized pathway names
#* created by stripping the appended organism name
# genes_by_organism.kegg
#* filled directly from the associated SOAPpy function
Information by Gene:
# enzymes_by_gene.kegg
#* filled directly from the associated SOAPpy function
# kos_by_gene.kegg
#* filled directly from the associated SOAPpy function
Information by Enzyme:
# reactions_by_enzyme.kegg
#* filled directly from the associated SOAPpy function
# compounds_by_enzyme.kegg
#* filled directly from the associated SOAPpy function
Information by Reaction:
# enzymes_by_reaction.kegg
#* filled directly from the associated SOAPpy function
# compounds_by_reaction.kegg
#* filled directly from the associated SOAPpy function
Information by Pathway:
# reactions_by_pathway.kegg
#* filled directly from the associated SOAPpy function


=== Computationally Comparing the Metabolic Networks ===
For an early comparison, I like the representation scheme used in [http://www.pnas.org/content/105/38/14482.full this paper by Borenstein et al.] They represent pathways as directed networks with compounds as nodes and reactions as edges.
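That representation falls out of the reaction table sketched in the July 3 entry in a few lines (each reaction contributes substrate-to-product edges, in both directions when reversible):

 def compound_graph(reactions):
     """Directed edges (substrate, product) over all reactions."""
     edges = set()
     for info in reactions.values():
         substrates = [c for c, k in info["compounds"].items() if k < 0]
         products = [c for c, k in info["compounds"].items() if k > 0]
         for s in substrates:
             for p in products:
                 edges.add((s, p))
                 if info["reversible"]:
                     edges.add((p, s))
     return edges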

== March 26, 2010 ==
I didn't realize it had been a whole month since I updated. Let's see: I nearly dropped Dr. Song's statistical genomics course, but then I didn't. I <i>did</i>, however, learn that we don't have a class project. So the Metastats upgrades are going on the back burner for now, because ZOMGZ I HAVE TO PRESENT MY PROPOSAL BY THE END OF THIS YEAR!!!

My Thesis Project:
* I'm generally interested in pathways shared between micro-organisms in a community, and also between micro-organisms and their multicellular hosts.
** I'm particularly interested in studying the metabolic pathways shared between micro-organisms in the human gut, both with each other and with their human hosts.
* James has created time-series models, and is interested in tackling spatial models with me.
* I would really like to correlate certain metabolic pathways with his modeled relationships.

Metastats:
* As I mentioned, this is on hold
* I intend to pick this back up after I'm done with Volker's project, as it could be instrumental to my thesis work


== April 2, 2010 ==
More on my Thesis Project:
* I read the most recent (Science, Nov. 2009) paper by Gordon and Knight on their ongoing gut microbiota experiments
* Pretty much every section addressed the potential thesis topics I'd imagined while reading the preceding section. Frustrating, but reaffirming (I'm trying to learn from Mihai about not getting bummed about being scooped).
* Something that seems interesting and useful to me is to do more rigorous statistical analysis to attempt to correlate particular genes and pathways with the time-series and spatial data. I will have to work closely with Bo, at least at first.
* As a starting point, James has recommended building spatial models similar to his time-series models
* James is essentially mentoring me on my project at this point. It's pretty excellent.


== July 23, 2010 ==
It's been several months, I don't feel any closer to finding a thesis project, and it's really starting to stress me out. I've finally stopped making excuses for not reading and have been steadily reading about 10 abstracts and 2 papers per week for the last month or two, but it doesn't appear to be nearly enough. I met with Mihai today to talk about it and then foolishly went for a run in the heat of the afternoon, where I decided on a new direction in a state of euphoric delirium.
# Read the book Mihai loaned me within the next week (or so): ''Microbial Inhabitants of Humans'' by Michael Wilson
#* Mihai says the book is a summary of what was known about the human microbiome 5 years ago. The table of contents for the first chapter is essentially identical to the list of Wikipedia pages I've read in the past week, so I'm pretty excited to now have a more thorough, authoritative source.
# Go back to looking for papers describing quorum sensing, especially in organisms known to be present in the human microbiome, either stably or transiently.
#* Try not to get too side-tracked reading about biofilms.
#* Search for an existing database of quorum sensing genes to use as references to potentially identify novel quorum sensing genes in microbiome WGS data. Consider making one if it's not already available.
# Look for a core metabolome (at this point I think a core microbiome seems unlikely) using metapath (or something similar) in the new HMP data for the gut, oral, and vaginal samples from the 100 reference individuals, as well as other sources like MetaHIT.
#* Start with a fast and dirty approach pulling all KOs associated with organisms identified using 16S rDNA sequencing, and then possibly attempt more accurate gene assemblies and annotation from WGS sequencing projects.
# Try to stay focused on a research topic for more than 2 weeks so I don't keep wasting time and effort.
# Don't make a habit of using the wiki like a personal journal...

=== Possible Research Projects Inspired by ''Microbial Inhabitants of Humans'' ===
* I transferred this to a new page I created that's dedicated to my potential [[User:Tgibbons:Project-Ideas | project ideas]]


== July 30, 2010 ==

=== Metagenomics Papers and Subjects Related to the Content of ''Microbial Inhabitants of Humans'' ===
# MetaHIT
# Vaginal Microbiome
# Acquisition of Microbiome
#* [http://www.pnas.org/content/early/2010/06/08/1002601107.full.pdf Vaginal birth vs cesarean section]


== August 13, 2010 ==

=== Other Potential Research Projects ===
* I transferred this to a new page I created that's dedicated to my potential [[User:Tgibbons:Project-Ideas | project ideas]]


== September 10, 2010 ==
I concatenated together the sequences from a subset of mated pairs from Mihai's oral microbiome data, and was able to align one of them to human DNA with >97% identity. This is surprising because neither of these reads could be aligned to a human reference using Bowtie. I plan to concatenate the remainder of the filtered reads and attempt to align them to human DNA using BLAST.
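A sketch of the concatenate-and-BLAST plan (the pairing convention and file names here are made up; the real mate-pair naming in the data may differ):

 def concat_mates(pairs, outpath):
     """pairs: iterable of (read_id, mate1_seq, mate2_seq) tuples.
     Writes one concatenated FASTA record per mate pair."""
     out = open(outpath, "w")
     for read_id, mate1, mate2 in pairs:
         out.write(">%s\n%s%s\n" % (read_id, mate1, mate2))
     out.close()
 
 # The resulting file can then be searched against the human reference with
 # the legacy BLAST tools, e.g.:
 #   blastall -p blastn -i mates.fna -d human_ref -e 1e-10 -m 8 -o mates_vs_human.b8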

Latest revision as of 05:54, 5 September 2010

Older Entries

2009

January 15, 2010

Minimus Documentation

Presently, the only relevant Google hit for "minimus" on the first page of results is the sourceforge wiki. The only example on this page is incomplete and appears to be an early draft made during development.

Ideally, it should be easy to find a complete guide with the general format:

  • Simple use case:
`toAmos -s path/to/fastaFile.seq -o path/to/fastaFile.afg`
`minimus path/to/fastaFile(prefix)`
  • Necessary tools for set up (toAmos)
  • Other options
  • etc

The description found on the Minimus/README page (linked to from the middle of the starting page) is more appropriate, but features use cases that may no longer be common and references another required tool (toAmos) without linking to it or describing how to access it. A description of this tool can be found on Amos File Conversion Utilities page (again, linked to from the starting page), but it is less organized than what I've come to expect from a project page and it is easy to get lost or distracted by the rest of the Amos documentation while trying to peace together the necessary steps for a basic assembly.

Comparative Network Analysis pt. 2

  • Meeting with Volker this Friday to discuss how best to apply network alignment to what he's doing
  • I'm simultaneously trying to find a way to apply my network alignment technique to predicting genes in metagenomic samples
    • I've been trying to find a way to get beyond the restriction that my current program requires genes to be annotated with an EC number. A potentially interesting next step may be to use BioPython to BLAST the sequence of each enzyme annotated in every micro-organism in KEGG against a metagenomic library.
      • The results would be stretches of linked reactions that have been annotated in KEGG pathways.
      • This method could be applied to contigs just as easily as finished sequences. In a scenario where perhaps there was low coverage, it could be used to identify genes which are probably there but just weren't sampled by showing the presence of the rest pathway. In short, this could finally accomplish what Mihai asked me to work on when I showed up.
      • The major theoretical shortcoming of this approach is that it could only identify relatively well characterized pathways.
      • The practical shortcoming of this approach will start by obtaining a fairly complete copy of KEGG (which as we've learned is a mess to parse locally and unusably slow to call through the API), and will continue to the computational challenge of such a large scale BLAST operation.
    • Ask Bo about this when he gets back. He may have already done this.

January 22, 2010

  • Met with Dan and Sergey to talk about the Minimus-Bambus pipeline
    • Minimus is running fine. I've begun characterizing its run-time behavior (see next week's entry)
    • After some tweeking by Sergey, Bambus was able to finish but did not generate a scaffold. We're going to talk about this after the meeting on Monday.
    • Sergey had an interesting idea for making a better read simulator:
      • Error-free reads are cheap and easy to generate. The problem is with the error model.
      • The "best" tool (that we are aware of) which includes error models is MetaSim, but the error models are years out of date and the authors has been historically unreachable. While Mihai has now shown me how to edit the models in a reasonable way from flat files allowing to characterize base substitutions, I'm not convinced it would be faster or easier to write a program that would modify these files than it would be to just write an entirely new program; and given the amount of time I've spent trying to use MetaSim, I'm more than ready to walk away from it. Oh yeah, and MetaSim doesn't work from the command line, so no scripting.
      • Sergey has pointed out that most companies will assemble E. coli when they release a new sequencer. Conveniently, there are many high quality assemblies of E. coli available for reference. It might therefore be possible to generate new error models for these sequencers in an automated fashion by mapping the E. coli reads to the available reference genomes, collecting the error frequencies, and then using them to mask synthesized reads.
      • I also talked with Mohammad and Mihai about this, who seemed to also think it was a pretty good idea. Mihai has proposed having Sergey or Mohammad add the described error model-generator to his read sampler (written in C) when they have time, but not in preparation of the oral microbiome data.
  • Met with James to discuss my work with Volker
    • Told him about my meeting with Volker and the paper he wanted me to prepare, more or less by myself. The concepts of the papers are these:
      • Most available genomic sequences of mycobacteria are of a very small subset of highly pathogenic organisms.
      • Subtractive comparative genomics can be used to identify genes that are potentially responsible for differing phenotypes (such as extreme pathogenicity), but there must be an available genomic sequences for closely related organisms with differing phenotypes.
      • Volker has sequenced 2 more non-pathogenic strains of mycobacteria (gastri, and kansasiiW58) with the intention of increasing the effectiveness of these subtractive comparative genomic studies.
      • The meat of the paper would be comparing the results of subtractive comparative genomic analysis using all currently available strains in RefSeq, with the results from also using the two novel sequences.
      • The other, smaller publishable portion of this project would be a comparison of gastri and kansasiiW58 to each other because they are allegedly thought to be extremely closely related, and yet they have distinct phenotypes (which I've now forgotten).
      • James seemed to think this could make an okay paper, and he confirmed that he did not understand that Volker was looking for someone to do all of the analysis, both computational and biological, with Volker only contributing analysis of the analysis after it was all over.
    • Ended up also discussing his work on differential abundance in populations of microorganisms.
      • I'm going to start working on taking over and expanding Metastats this semester.
      • I'm also going to start talking to Bo when he gets back about exactly what he's doing, and how I might be able to include pathway prediction in my expansion of Metastats without stepping on his toes.
      • Mihai has given me his approval to focus on this.
* Met with Mihai to discuss working with Volker
** Explained that rather than looking for someone to do only the complex portions of the computational analysis, Volker was/is looking for someone to do the complete analysis.
** In exchange, Volker is offering first authorship and, if need be, to split the student's funding with their primary PI.
** I think I'm capable of doing this within 3 or 4 months, but it would consume my time pretty thoroughly.
** Mihai agreed that this is a reasonable deal, but noted that I have no personal interest in studying mycobacteria, and that it would therefore be unwise to invest so much time becoming an expert on an organism I have no interest in continuing to study or work with. I've therefore offered to work closely with one of Volker's graduate students, who could meet with me every week or two. I would be willing to do all of the computational analysis and explain it to them, but they would have to actually look up the potentially interesting genes and relationships I discover and help keep the analysis biologically interesting and relevant.
* Met with Mihai and Mohammad to discuss our impending huge-ass(embly) problem
** Talked about strategies for iterative assembly as an approach to assembling intractably large data sets. Most have glaring shortcomings and complications.
** Discovered that Mike Schatz has a map-reduce implementation of an assembler that uses de Bruijn graphs and is better suited to assemblies with high coverage but short read lengths.
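
The error-model idea above, as a minimal sketch. It assumes a hypothetical tab-delimited file of E. coli read-vs-reference alignments (read cycle, reference base, read base); the file format and function names are illustrative only, and the real version would live in Sergey's C read sampler:

 import csv
 from collections import defaultdict
 
 def positional_error_rates(alignment_tsv):
     """Tally per-cycle mismatch frequencies from read-vs-reference
     alignments. Expects rows of (read_cycle, reference_base, read_base);
     this file format is hypothetical."""
     mismatches = defaultdict(int)
     totals = defaultdict(int)
     with open(alignment_tsv) as handle:
         for cycle, ref_base, read_base in csv.reader(handle, delimiter='\t'):
             cycle = int(cycle)
             totals[cycle] += 1
             if ref_base.upper() != read_base.upper():
                 mismatches[cycle] += 1
     return {c: mismatches[c] / totals[c] for c in sorted(totals)}
 
 # A read simulator could then mutate base i of each synthesized read
 # with probability rates[i] to mimic the new sequencer's error profile.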

== January 29, 2010 ==

=== Minimus Performance Analysis ===

I'm testing minimus and bambus in preparation for the oral microbiome data, and after spamming several lab members with email, it occurred to me that it would be considerably more considerate to put the information here instead.

{| class="wikitable"
|+ Minimus Memory Usage Analysis
! Number of 75bp reads (millions) !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model
|-
! RAM used by the Overlapper (GB)
| 1.2 || 2.4 || 4.5 || 8.7 || 17 || 21.5 || ≈1.1 GB × (reads in millions)
|-
! RAM used by the Tigger (GB)
| 3 || 6 || 12 || 25 || 48.4 || (60) || ≈3 GB × (reads in millions)
|}
* The 16-million-read assembly data is from Walnut; all other numbers are rough averages from both Privet and Walnut.
* Numbers listed in parentheses are predictions made using the listed models.


{| class="wikitable"
|+ Minimus Run Time Analysis on Privet
! Number of 75bp reads (millions) !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model
|-
! Run time of the Overlapper (min)
| 3 || 9 || 34 || 130 || (576) || 783 || 2.96 × (reads in millions)<sup>1.87</sup>
|-
! Run time of the Tigger (min)
| 9 || 66 || 473 || (3,456) || (25,088) || (47,493) || 9.03 × (reads in millions)<sup>2.86</sup>
|}
* Privet has 2.4GHz Opteron 850 processors and 32GB of RAM. Minimus is not parallelized and therefore uses only a single core.
* Numbers listed in parentheses are predictions made using the listed models.
* The models were generated by plotting the data points in OpenOffice and fitting a power-law trendline. The R<sup>2</sup> value for each was 1.
* For reference: there are 1,440 minutes in one day and 10,080 minutes in one week.


{| class="wikitable"
|+ Minimus Run Time Analysis on Walnut
! Number of 75bp reads (millions) !! 1 !! 2 !! 4 !! 8 !! 16 !! 20 !! Model
|-
! Run time of the Overlapper (min)
| 2.7 || 8 || 27.5 || 102 || (325) || (481.5) || 2.54 × (reads in millions)<sup>1.75</sup>
|-
! Run time of the Tigger (min)
| 14 || 81 || 471.5 || (2,752) || (16,006) || (28,212) || 13.99 × (reads in millions)<sup>2.54</sup>
|}
* Walnut has 2.8GHz Opteron 875 processors and 64GB of RAM. Minimus is not parallelized and therefore uses only a single core.
* Numbers listed in parentheses are predictions made using the listed models.
* The models were generated by plotting the data points in OpenOffice and fitting a power-law trendline. The R<sup>2</sup> value for each was 1.
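
For convenience, a minimal sketch for evaluating the fitted models at other input sizes. The coefficients come straight from the tables above; because they are rounded, the output won't exactly match the parenthesized predictions:

 def power_law(reads_in_millions, coefficient, exponent):
     """Evaluate a fitted run-time model of the form c * n^e (minutes)."""
     return coefficient * reads_in_millions ** exponent
 
 # Privet models from the table above (rounded coefficients)
 print(power_law(16, 2.96, 1.87))  # Overlapper at 16M reads: ~529 min
 print(power_law(20, 9.03, 2.86))  # Tigger at 20M reads: ~47,500 min (about a month)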


=== Other Observations About the Assemblies ===

* Because of the short read length, every million reads is only 75Mbp of sequence. This is roughly 10-20x coverage of an average single bacterial genome, but these test sets have reads sampled from roughly 100 bacterial genomic sequences, so I would expect the per-genome coverage to be on the order of 0.1-0.2x on average (a quick sanity check of this arithmetic follows this list).
* Unsurprisingly, a cursory glance through the contig files shows that each contig comprises only about 2 or 3 reads.
* The N50 analysis for the smaller assemblies shows that only 2-3 reads are being added to each contig on average, leaving both N50s and average contig lengths just below 150bp.
* Therefore, if the complexity of the oral microbiome data is high and/or the contamination with human DNA is extreme (80-95%), the coverage may be extremely low. This may make the use of Mike's assembler impractical, or at least that's how I'm going to keep justifying this testing to myself until someone corrects me.
** Update: Apparently Mike and Dan have talked about this, and somewhere around 75-80bp read lengths the performance of minimus catches up with Mike's de Bruijn graph assembler anyway. I also did not know that Dan's map-reduce minimus was running and would be used to assemble the data alongside Mike's assembler.
* I learned on Feb. 1, 2010 that the 454 error model allows wild variation with respect to read length, so these assemblies might not actually be representative of the performance on the Illumina data we're expecting on Feb. 20.
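
The coverage arithmetic above as a sanity check (the ~5 Mbp average genome size is an assumption, not a measured value):

 reads = 1000000
 read_length = 75        # bp
 genomes = 100           # approximate number of source genomes in the test set
 genome_size = 5000000   # assumed ~5 Mbp average bacterial genome
 
 total_bp = reads * read_length                      # 75 Mbp per million reads
 coverage = total_bp / float(genomes * genome_size)
 print(coverage)                                     # 0.15, i.e. ~0.15x per genome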

=== UMIACS Resources ===

I just discovered that the information listed on the CBCB intranet Resources page is inaccurate and very out of date, so I'm making my own table.

{| class="wikitable"
|+ UMIACS Resources
! Machine !! Processor !! Speed !! Cores !! RAM
|-
| Walnut || Dual Core AMD Opteron 8220 || 2.8GHz || 16 || 64GB
|-
| Privet || AMD Opteron 850 || 2.4GHz || 4 || 32GB
|-
| Larch || AMD Opteron 850 || 2.4GHz || 4 || 32GB
|-
| Sycamore || Dual Core AMD Opteron 875 || 1GHz || 8 || 32GB
|-
| Shagbark || Intel Core 2 Quad || 2.83GHz || 4 || 4GB
|}

== February 5, 2010 ==

=== Meeting with Volker and Sarada on Feb 3 ===

* Need to teach Sarada how to run local BLAST on some sequences they have that aren't yet in GenBank
* Trying to set up a meeting with Volker to find out for sure whether he wants me to work on this project

=== Biomarker Assembly ===

Bo, Mohammad, and I spent a couple of hours discussing biomarker assembly today. I'm going to try to summarize our conclusions efficiently, but it might be difficult without an easy way to include images. Due to the severe time constraints, we eventually decided it would be best to attempt several methods in tandem. The general approach of each method is to fish out and bin reads by one method or another, and then assemble the reads in each bin using minimus. All sequence identity values will be determined using BLASTx.

=== Preliminary Steps ===

* Gather biomarker consensus amino acid sequences
* Gather amino acid sequences for the associated genes from each bacterial genome in RefSeq
* Cluster the amino acid sequences within each biomarker set

=== Sequence Identity Threshold Determination ===
There are 31 biomarkers and about 1,000 bacterial genomes in which they occur. This means there are 31 sets of roughly 1,000 sequences that are all relatively similar to one another. Because of this sequence similarity and the short read length, it's possible that a significant number of reads will map equally well to multiple sequences within each biomarker set. For this reason, it is better to allow a single read to be placed in every bin containing a sequence to which the read maps above some minimum threshold. This will protect against artificially lowering the coverage of extremely well-conserved regions, and with any luck, incorrectly binned reads will simply not be included in the assembly (a sketch of this binning rule follows the list below). There are several ways to approach determining this threshold.

* Determine the lowest level of sequence identity between the consensus sequence for each biomarker and any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest value from any biomarker set as the minimum threshold for all biomarker sets.
** The obvious shortcoming of this approach is that the sequence identity between two homologous gene-length sequences can be much lower than between two homologous read-length sequences.
* Align 75mers to determine the lowest score between any 75mer in the consensus sequence for each biomarker and the corresponding 75mer in any actual protein sequence in that biomarker set. Use that as the minimum threshold for each biomarker set, or use the lowest from any biomarker set as the minimum threshold for all biomarker sets.
** While this solves the problem with the above approach, it is significantly more complicated, and the data is going to be here soon.
* Choose a sequence identity level, or try a few different levels and see which produces the most complete biomarker proteins without creating overly complex graphs.
** While there's no good theoretical justification for this approach, it's probably what we'll do, and it will probably work well enough.
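
A minimal sketch of the multi-bin placement rule described above, assuming the BLASTx results have already been parsed into (read, subject, percent identity) tuples and that we have a lookup from each subject sequence to the bins containing it; all names here are illustrative:

 from collections import defaultdict
 
 def bin_reads(hits, subject_to_bins, min_identity):
     """Place each read into every bin that contains a subject it hit
     at or above the identity threshold. 'hits' is an iterable of
     (read_id, subject_id, percent_identity) tuples from BLASTx."""
     bins = defaultdict(set)
     for read_id, subject_id, identity in hits:
         if identity >= min_identity:
             for bin_name in subject_to_bins.get(subject_id, ()):
                 bins[bin_name].add(read_id)
     return bins
 
 # Each bin's reads would then be written out and assembled
 # independently with minimus.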

=== Schemes ===
After making absurdly complicated written descriptions of the various approaches, which I felt weren't very clear, I used Keynote to recreate the diagrams we'd drawn on the whiteboard and then printed them to a PDF. Unfortunately, I'm not sure exactly how to embed that in the wiki, so email me at trgibbons@gmail.com if you're reading this and I'll send it to you.

# Marker-wise assembly
#* Bin reads that align to any sequence in a given marker set, and/or to the consensus sequence for that marker
# Cluster-wise assembly
#* Cluster the protein sequences
#* Bin reads that align to any protein sequence in a given cluster
# Gene-wise assembly
#* Bin reads that align to a particular protein sequence

* Marker-wise and cluster-wise binning should be better for assembling novel sequences
* Gene-wise binning should produce higher-quality assemblies of markers for known organisms or those that are closely related

== February 12, 2010 ==

SNOW!!

== February 19, 2010 ==

Met with James to discuss Metastats. I'm going to attempt the following two updates by the end of the semester (I've probably described them incorrectly, but I'll work it out later):

# Find a better way to compute the false discovery rate (FDR)
#* It's currently computed using the lowest 100 p-values from each sample (look at the source code)
#* Need to find a more statistically rigorous way to compute it (one standard candidate is sketched after this list)
#* The false positive rate across 1,000 tests is set by the p-value cutoff (at p=0.05, about 50 null hypotheses will be rejected by chance alone; so if the null hypothesis is thrown out for 100 features, 50 of them would be expected to be incorrect)
#* James just thinks this sucks and needs to be fixed
# Compute F-tests for features across all samples
#* This is the most requested feature
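
For the record, one standard, more rigorous option is the Benjamini-Hochberg step-up procedure. This is a minimal sketch of my own suggestion, not necessarily the fix James has in mind:

 def benjamini_hochberg(p_values, alpha=0.05):
     """Return the indices of hypotheses rejected at FDR level alpha
     using the Benjamini-Hochberg step-up procedure."""
     m = len(p_values)
     order = sorted(range(m), key=lambda i: p_values[i])
     k = 0  # largest rank whose p-value clears its threshold rank*alpha/m
     for rank, i in enumerate(order, start=1):
         if p_values[i] <= rank * alpha / m:
             k = rank
     return sorted(order[:k])  # reject these; retain all others
 
 # e.g. benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6]) -> [0, 1]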

I spent too much time talking with people about science and not enough time doing it this week...

== March 26, 2010 ==

I didn't realize it had been a whole month since I updated. Let's see: I nearly dropped Dr. Song's statistical genomics course, but then I didn't. I did, however, learn that we don't have a class project. So the Metastats upgrades are going on a back burner for now, because ZOMGZ I HAVE TO PRESENT MY PROPOSAL BY THE END OF THIS YEAR!!!

=== My Thesis Project ===

* I'm generally interested in pathways shared between microorganisms in a community, and also between microorganisms and their multicellular hosts.
** I'm particularly interested in studying the metabolic pathways shared between microorganisms in the human gut, both with each other and with their human hosts.
* James has created time-series models and is interested in tackling spatial models with me.
* I would really like to correlate particular metabolic pathways with his modeled relationships.

=== Volker's Project ===

* This project has taken a big hit this week.
* I'm going to go forward, with Bo's help, using the plan outlined in my project proposal for Mihai's biosequence analysis class (a sketch of the first step follows this list):
** Use reciprocal best BLAST hits to map H37Rv genes to annotated genes in all available virulent and non-virulent strains of mycobacteria
** Use the results of the gene mapping to identify a core set of tuberculosis genes, as well as a set of predicted virulence genes
** Use a variety of comparison schemes to study how considering different subsets of non-virulent strains changes the set of predicted virulence genes
** Use the stability of the virulence predictions to rank genes as virulence targets
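
A minimal sketch of the reciprocal-best-hit step, assuming BLAST has been run in both directions with 12-column tabular output (column 12 is the bit score); the file names are placeholders:

 def best_hits(blast_tab):
     """Map each query to its single best subject by bit score,
     from a BLAST tabular output file (-m 8 / -outfmt 6)."""
     best = {}
     with open(blast_tab) as handle:
         for line in handle:
             fields = line.rstrip('\n').split('\t')
             query, subject, bitscore = fields[0], fields[1], float(fields[11])
             if query not in best or bitscore > best[query][1]:
                 best[query] = (subject, bitscore)
     return dict((q, s) for q, (s, _) in best.items())
 
 def reciprocal_best_hits(forward_tab, reverse_tab):
     """Return (gene_a, gene_b) pairs that are each other's best hits."""
     forward = best_hits(forward_tab)
     reverse = best_hits(reverse_tab)
     return [(a, b) for a, b in forward.items() if reverse.get(b) == a]
 
 # e.g. reciprocal_best_hits('H37Rv_vs_strain.tab', 'strain_vs_H37Rv.tab')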

=== Metastats ===

* As I mentioned, this is on hold.
* I intend to pick it back up after I'm done with Volker's project, as it could be instrumental to my thesis work.

== April 2, 2010 ==

=== More on my Thesis Project ===

* I read the most recent (Science, Nov. 2009) paper by Gordon and Knight on their ongoing gut microbiota experiments.
* Pretty much every section addressed the potential thesis topics I'd imagined while reading the preceding section. Frustrating, but reaffirming (I'm trying to learn from Mihai about not getting bummed when scooped).
* Something that seems interesting and useful to me is to do more rigorous statistical analysis to attempt to correlate particular genes and pathways with the time-series and spatial data. I will have to work closely with Bo, at least at first.
* As a starting point, James has recommended building spatial models similar to his time-series models.
* James is essentially mentoring me on my project at this point. It's pretty excellent.

== July 23, 2010 ==

It's been several months, I don't feel any closer to finding a thesis project, and it's really starting to stress me out. I've finally stopped making excuses for not reading and have been steadily reading about 10 abstracts and 2 papers per week for the last month or two, but it doesn't appear to be nearly enough. I met with Mihai today to talk about it and then foolishly went for a run in the heat of the afternoon, where I decided on a new direction in a state of euphoric delirium.

# Read the book Mihai loaned me within the next week (or so): Microbial Inhabitants of Humans by Michael Wilson
#* Mihai says the book is a summary of what was known about the human microbiome 5 years ago. The table of contents for the first chapter is essentially identical to the list of Wikipedia pages I've read in the past week, so I'm pretty excited to now have a more thorough, authoritative source.
# Go back to looking for papers describing quorum sensing, especially in organisms known to be present in the human microbiome, either stably or transiently.
#* Try not to get too sidetracked reading about biofilms.
#* Search for an existing database of quorum sensing genes to use as references for identifying potentially novel quorum sensing genes in microbiome WGS data. Consider making one if it's not already available.
# Look for a core metabolome (at this point I think a core microbiome seems unlikely) using metapath (or something similar) in the new HMP data for the gut, oral, and vaginal samples from the 100 reference individuals, as well as in other sources like MetaHIT.
#* Start with a fast-and-dirty approach pulling all KOs associated with organisms identified using 16S rDNA sequencing, and then possibly attempt more accurate gene assemblies and annotation from the WGS sequencing projects.
# Try to stay focused on a research topic for more than 2 weeks so I don't keep wasting time and effort.
# Don't make a habit of using the wiki like a personal journal...

=== Possible Research Projects Inspired by Microbial Inhabitants of Humans ===

* I transferred this to a new page I created that's dedicated to my potential project ideas

== July 30, 2010 ==

=== Metagenomics Papers and Subjects Related to the Content of Microbial Inhabitants of Humans ===

# MetaHIT
# Vaginal Microbiome
# Acquisition of the Microbiome

== August 13, 2010 ==

=== Other Potential Research Projects ===

* I transferred this to a new page I created that's dedicated to my potential project ideas


== September 10, 2010 ==

I concatenated the sequences from a subset of mated pairs from Mihai's oral microbiome data and was able to align one of them to human DNA with >97% identity. This is surprising because neither of these reads could be aligned to a human reference using Bowtie. I plan to concatenate the remainder of the filtered reads and attempt to align them to human DNA using BLAST.
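
For reference, a minimal sketch of the concatenation step, assuming the mates sit in two FASTA files in matching order (the file names and the in-order assumption are mine):

 def read_fasta(path):
     """Yield (header, sequence) pairs from a FASTA file."""
     header, seq = None, []
     with open(path) as handle:
         for line in handle:
             line = line.rstrip()
             if line.startswith('>'):
                 if header is not None:
                     yield header, ''.join(seq)
                 header, seq = line[1:], []
             else:
                 seq.append(line)
     if header is not None:
         yield header, ''.join(seq)
 
 def concatenate_mates(mates1, mates2, out_path):
     """Write one concatenated sequence per mate pair, assuming both
     files list the mates in the same order."""
     with open(out_path, 'w') as out:
         for (h1, s1), (h2, s2) in zip(read_fasta(mates1), read_fasta(mates2)):
             out.write('>%s|%s\n%s%s\n' % (h1, h2, s1, s2))
 
 # e.g. concatenate_mates('mates_1.fa', 'mates_2.fa', 'concatenated.fa')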