Cbcb:Pop-Lab:Ted-Report
Summer 2009 Goals
- Research traditional approaches taken to gene-level analysis of metagenomic data
- Critically evaluate the traditional approaches in general and in the context of current Pop Lab projects
- Identify portions of the analysis that can be automated
- Develop scalable tools to do the automated analysis
May 15, 2009
- Read the VisANT paper and user manual[1]. Determined that VisANT will work for manual metabolic pathway analysis of even large-scale data sets and can be automated by running it in "Batch Mode".
- Need to read about FastHMM[2]
- Still need to make "Welcome Wiki" for n00bs (read: new members)
May 22, 2009
- Made Welcome Wiki
- Read metagenomics papers
- Determined that VisANT can be used with Bo's data by importing it as MicroArray data
May 29, 2009
- Took an early Summer vacation last weekend:
- Drove to NC to see a friend graduate with BSs in CS & Physics
- Went sailing for the first time at my girlfriend's parents' place in VA
- Refined Welcome Wiki
- Read metagenomics/pathway reconstruction/analysis papers
- Organized reading group for Palsson Systems Bio book
June 5, 2009
- Read metagenomics/pathway reconstruction/analysis papers and first two chapters of Palsson book.
- Currently building test set for incorporation of Phymm into metagenomics pipeline.
- A single archaeal genome was chosen from the findings of Mihai's 2006 Science paper analyzing the human distal gut.
- Two pairs of bacterial genomes were chosen for the test set using columns on the NCBI RefSeq Complete Genomes website[3]:
- The pairs of bacterial genomes were taken from the Groups: Bacteroidetes/Chlorobi and Firmicutes because they are the two most predominant groups present in the human gut.
- I chose genomes with a complete set of NCBI Tools available.
- After this I attempted to choose genomes with significantly different GC content.
- Finally, preference was given to organisms I recognized from gut microbiome papers/discussions, or failing that, just a really awesome name.
- The final list is:
Organism | Classification | Genome Length |
---|---|---|
Methanobrevibacter smithii ATCC 35061 | archaea | 1853160 bp |
Bacteroides fragilis NCTC 9343 | bacteroidetes | 5205140 bp |
Porphyromonas gingivalis W83 | bacteroidetes | 2343476 bp |
Aster yellows witches'-broom phytoplasma AYWB | firmicutes | 706569 bp |
Bacillus subtilis subsp. subtilis str. 168 | firmicutes | 4214630 bp |
Note: plasmid DNA was not included
June 12, 2009
- Today is my birthday!!! :D
- Last week's meeting was a success!
- The books came in on Wednesday, so the reading is now readily available.
- Read chapter 3 and part of the 2006 paper for Friday.
- Met with Arthur to discuss Phymm.
- I have gotten better at using MetaSim and have generated the previously described test data set composed of 1 million 200bp reads, approximately 15x coverage (actual average read length: 215.42bp).
- I have been relearning how to use the AMOS package in preparation for piping the output from Phymm into it
- Note: It appears that Phymm can be "parallelized" by dividing the query file into smaller files and merging the output files. According to Arthur, each read is scored independently, so the only limits are the number of reads and the number of processors.
- I have parsed the test set into 10 files and intend to run them simultaneously on Ginkgo.
- I backed up Arthur's copy of RefSeq Bacteria/Archaea and removed the test sequences from the original database. I still need to find a way to remove the sequences from the BLAST database, and then I'm ready to go.
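The file-splitting step from the parallelization note above can be sketched in C++ (the language planned for the rest of this project). This is only an illustration of the round-robin idea, not the actual commands used to make the 10 files; the function name and I/O handling are my own:

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Distribute multi-FASTA records round-robin across the supplied output
// streams, so each chunk can be scored by a separate Phymm process.
// Returns the number of records written.
size_t splitFasta(std::istream& in, std::vector<std::ostream*>& outs) {
    std::string line;
    size_t record = 0;
    std::ostream* cur = nullptr;
    while (std::getline(in, line)) {
        if (!line.empty() && line[0] == '>') {   // header starts a new record
            cur = outs[record % outs.size()];
            ++record;
        }
        if (cur) *cur << line << '\n';
    }
    return record;
}
```

Because each read is scored independently, the per-chunk result files can simply be merged afterwards.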
June 19 & 26, 2009
- Forgot to update these weeks so this is from memory
- I moved all my stuff to /fs/szattic-asmg4/tgibbons/
- Hopefully they get the backups set up soon...
- Arthur and I got phymmBL to run on those files I mentioned last time
- I concatenated the output files back together using merge sort:
sort -um results.03.combined_<input file prefix>* > <output file prefix>.fna
- I wrote a post-processing script for the phymm output called phymm2minimus.pl
- The script prints to STDOUT the completeness of classification at each taxonomic level
- Version 3 of the script then iterates over the original Phymm output, creating bins for each taxonomic level
- The scripts and test files are located in /fs/szattic-asmg4/tgibbons/phymm/
July 3, 2009
I'm back to reading about metabolic network analysis and finally getting some tractable ideas.
I'm trying to create a data structure that will store the matrix:
- RN.##### (potential reactions capable by query sequences)
  - EC.##### (array of ECs of proteins necessary for reaction)
    - # (stoichiometry of protein in reaction; not sure if this will work as a child of a pointer)
  - CO.##### (array of compounds necessary for reaction)
    - # (stoichiometry of compound in reaction; not sure if this will work as a child of a pointer)
  - r.bool (reversibility of reaction, shown by => / <=> in KEGG RN equations)
Plus two additional data structures for manipulation by protein and/or compound:
- EC.##### (array of ECs in query)
  - RN.##### (array of reactions in which each KO participates)
- CO.##### (array of compounds used in any reactions)
  - RN.##### (array of reactions in which each compound participates)
  - ChEBI.##### (ChEBI number for compound, or maybe just use the ChEBI number in the first place)
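As a first pass, the nested records above might map onto C++ types like these. All type and field names are my own invention, and KEGG identifiers (RN/EC/CO numbers) are simply stored as strings:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One RN.##### entry: the enzymes and compounds it needs, plus reversibility.
struct Reaction {
    std::map<std::string, int> enzymes;    // EC number -> stoichiometry of protein in reaction
    std::map<std::string, int> compounds;  // CO number -> stoichiometry (negative = consumed)
    bool reversible = false;               // true for "<=>" in the KEGG RN equation
};

// RN.##### -> reaction details (the R x C matrix plus metadata)
using ReactionTable = std::map<std::string, Reaction>;

// EC.##### -> reactions in which the enzyme participates
using EnzymeIndex = std::map<std::string, std::vector<std::string>>;

// CO.##### -> reactions the compound participates in, plus its ChEBI number
struct CompoundEntry {
    std::vector<std::string> reactions;
    std::string chebi;
};
using CompoundIndex = std::map<std::string, CompoundEntry>;

// Build the two secondary structures from the primary reaction table.
inline void buildIndexes(const ReactionTable& rt, EnzymeIndex& ei, CompoundIndex& ci) {
    for (const auto& [rn, rxn] : rt) {
        for (const auto& [ec, coeff] : rxn.enzymes)   ei[ec].push_back(rn);
        for (const auto& [co, coeff] : rxn.compounds) ci[co].reactions.push_back(rn);
    }
}
```

Keeping the reaction table primary and deriving the EC and CO views from it avoids storing the same relationship three times by hand.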
This data structure should be thought of as an R x C matrix plus metadata (such as necessary proteins). Other thoughts on the use of the matrix:
- It would be interesting to evaluate the matrix using chemical moieties (e.g. CoA, AMP) as described in section 6.5 of Palsson's book, but that would require more significant processing of the KEGG information and might break the above-described data structure. Palsson offers a list of common metabolic moieties in Table 6.2 on p. 102, which he adapted from another bioengineering book. The KEGG database implicitly treats enzymes as moieties.
- A common question will be to find the metabolites required to make the essential building blocks for an organism. To answer this we will need a list of said building blocks, or target products. Palsson offers such a list in Table 6.1 on p. 98, which he has adapted from the same bioengineering book.
- Next we will likely want to fill in the gaps. This may require storing additional information from the assignment of protein sequences to KOs, as well as expanding the arrays. Alternatively, it may be better to simply create all-new data structures, but that would likely end up being redundant. With that information we could then attempt statistical analysis of partial reactions to determine the likelihood that the missing component was simply overlooked by sampling.
- A further step may be to evaluate the gene set's ability to take up the necessary precursors through either passive or active means.
- Another interesting form of analysis, when possible, would be to multiply the reaction matrix by the concentrations of substrates in the environment (presumably determined by mass spec) and observe the potential output concentrations.
Useful files I'm considering using for the network analysis:
- ftp://ftp.genome.jp/pub/kegg/genes/ko contains detailed information about each KO
- ftp://ftp.genome.jp/pub/kegg/ligand/enzyme/enzyme contains detailed enzyme information by EC number instead of KO
- ftp://ftp.genome.jp/pub/kegg/ligand/reaction/reaction contains detailed information about each RN, including enzyme and compound information
- ftp://ftp.genome.jp/pub/kegg/ligand/compound/compound contains, as you might be guessing by now, compound information
Proposed method of building arrays:
- Assign protein sequences to KOs
- Build list of reactions associated with each KO
- Eliminate any reaction lacking necessary protein(s)
- Alternatively, store the partial reactions elsewhere or mark them as incomplete or something
- Build R, KO, and C from the final list of reactions using the files above
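Steps 2 and 3 of the proposed method might look like the sketch below, assuming the KO-to-reaction and reaction-to-KO maps have already been parsed from the KEGG files listed above (the function and parameter names are invented for illustration):

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Given the KOs assigned to the query proteins, collect candidate reactions
// and drop any reaction for which a required KO was not observed.
// koToReactions: KO -> reactions it participates in
// reactionToKos: reaction -> full set of KOs it requires
std::set<std::string> completeReactions(
        const std::set<std::string>& observedKos,
        const std::map<std::string, std::vector<std::string>>& koToReactions,
        const std::map<std::string, std::set<std::string>>& reactionToKos) {
    // Step 2: build the list of reactions associated with each observed KO.
    std::set<std::string> candidates;
    for (const auto& ko : observedKos) {
        auto it = koToReactions.find(ko);
        if (it == koToReactions.end()) continue;
        candidates.insert(it->second.begin(), it->second.end());
    }
    // Step 3: eliminate any reaction lacking a necessary protein.
    std::set<std::string> complete;
    for (const auto& rn : candidates) {
        auto rit = reactionToKos.find(rn);
        if (rit == reactionToKos.end()) continue;
        bool ok = true;
        for (const auto& ko : rit->second)
            if (!observedKos.count(ko)) { ok = false; break; }  // incomplete
        if (ok) complete.insert(rn);
    }
    return complete;
}
```

The alternative in the sub-bullet (keeping partial reactions around, flagged as incomplete) would just mean routing the failing reactions into a second set instead of discarding them.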
July 10, 2009
Goals for July:
- Decide on a method for assigning protein sequences to KOs and implement it
- Implement data structure described in July 3rd entry using C++
- Test output by printing to tab-delimited files that can be viewed in Excel
- Start playing around with software tools using test data
- Finish Part II of the Palsson book
Algorithm for elucidating all necessary precursors (warning: iz fugly):
- Push all the precursors onto a stack
- Iterate through precursors
- For any reaction containing a positive coefficient for the precursor, push the reactants (compounds with negative coefficients) of the reaction onto a stack
- *A miracle happens*
- Print the output as a network of reactions(?)
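Minus the miracle, the stack walk above could start out like this. Reactions are represented here as plain compound-to-coefficient maps, and everything is a sketch of the idea rather than the planned implementation; note that a visited set is what keeps cyclic pathways from looping forever:

```cpp
#include <cassert>
#include <map>
#include <set>
#include <stack>
#include <string>
#include <vector>

// Each reaction maps compound -> stoichiometric coefficient
// (negative = reactant, positive = product).
using Rxn = std::map<std::string, int>;

// Starting from the target precursors, walk backwards through the reactions:
// whenever a reaction produces a needed compound, its reactants become needed
// too. Returns every compound required, directly or indirectly.
std::set<std::string> allPrecursors(const std::vector<Rxn>& reactions,
                                    const std::vector<std::string>& targets) {
    std::stack<std::string> pending;
    std::set<std::string> needed;
    for (const auto& t : targets) pending.push(t);   // push all precursors
    while (!pending.empty()) {
        std::string c = pending.top(); pending.pop();
        if (!needed.insert(c).second) continue;      // already expanded
        for (const auto& rxn : reactions) {
            auto it = rxn.find(c);
            if (it == rxn.end() || it->second <= 0) continue;  // doesn't produce c
            for (const auto& [compound, coeff] : rxn)
                if (coeff < 0) pending.push(compound);         // reactant of rxn
        }
    }
    return needed;
}
```

Printing the result as a network of reactions would mean recording, for each popped compound, which reactions triggered its expansion.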
July 17, 2009
Talked to Dan:
- Regenerated test set (see June 5 entry) of 1,000,000 reads with default 454 error model and a uniform read length of 200bp
Other than that I'm trying to read up on pathway analysis and then help James with his project for Volker, but complications from moving last weekend are making this a disappointingly unproductive week.