Cbcb:Pop-Lab:Ted-Report-2009
Summer 2009 Goals
- Research traditional approaches taken to gene-level analysis of metagenomic data
- Critically evaluate the traditional approaches in general and in the context of current Pop Lab projects
- Identify portions of the analysis that can be automated
- Develop scalable tools to do the automated analysis
May 15, 2009
- Read VisANT paper and user manual[1]. Determined VisANT will work for manual metabolic pathway analysis of even large-scale data sets and can be automated by running it in "Batch Mode".
- Need to read about FastHMM[2]
- Still need to make "Welcome Wiki" for n00bs (read: new members)
May 22, 2009
- Made Welcome Wiki
- Read metagenomics papers
- Determined that VisANT can be used with Bo's data by importing it as MicroArray data
May 29, 2009
- Took an early Summer vacation last weekend:
- Drove to NC to see a friend graduate with BSs in CS & Physics
- Went sailing for the first time at my girlfriend's parents' place in VA
- Refined Welcome Wiki
- Read metagenomics/pathway reconstruction/analysis papers
- Organized reading group for Palsson Systems Bio book
June 5, 2009
- Read metagenomics/pathway reconstruction/analysis papers and first two chapters of Palsson book.
- Currently building test set for incorporation of Phymm into metagenomics pipeline.
- A single archaeal genome was chosen from the findings of Mihai's 2006 Science paper analyzing the human distal gut.
- Two pairs of bacterial genomes were chosen for the test set using columns on the NCBI RefSeq Complete Genomes website[3]:
- The pairs of bacterial genomes were taken from the groups Bacteroidetes/Chlorobi and Firmicutes because they are the two predominant groups in the human gut.
- I chose genomes with a complete set of NCBI Tools available.
- After this I attempted to choose genomes with significantly different GC content.
- Finally, preference was given to organisms I recognized from gut microbiome papers/discussions, or failing that, just a really awesome name.
Description of Trial Metagenomic Dataset
Organism | Classification | Genome Length
Methanobrevibacter smithii ATCC 35061 | archaea | 1,853,160 bp
Bacteroides fragilis NCTC 9343 | bacteroidetes | 5,205,140 bp
Porphyromonas gingivalis W83 | bacteroidetes | 2,343,476 bp
Aster yellows witches'-broom phytoplasma AYWB | firmicutes | 706,569 bp
Bacillus subtilis subsp. subtilis str. 168 | firmicutes | 4,214,630 bp
Note: plasmid DNA was not included
The combined length of all the genomes is 14,322,975 bp
June 12, 2009
- Today is my birthday!!! :D
- Last week's meeting was a success!
- The books came in Wednesday so the reading is now readily available.
- Read chapter 3 and part of the 2006 paper for Friday.
- Met with Arthur to discuss Phymm.
- I have gotten better at using MetaSim and have generated the previously described test data set composed of 1 million 200bp reads, approximately 15x coverage (actual average read length: 215.42bp).
- I have been relearning how to use the AMOS package in preparation for piping the output from Phymm into it.
- Note: It appears that Phymm can be "parallelized" by dividing the query file into smaller files and merging the output files. According to Arthur, each read is scored independently. So the only limits are the number of reads and the number of processors.
- I have parsed the test set into 10 files (see the splitting sketch at the end of this entry) and intend to run them simultaneously on Ginkgo.
- I backed up Arthur's copy of RefSeq Bacteria/Archaea and removed the test sequences from the original database. I still need to find a way to remove the sequences from the BLAST database, and then I'm ready to go.
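A minimal sketch of the read-file splitting, assuming round-robin assignment of FASTA records to chunks (the chunk count and file names are illustrative, not the exact commands used):

#!/usr/bin/env python
# Sketch: split a multi-FASTA file into N chunks so each chunk can be
# run through Phymm independently (each read is scored independently).
import sys

def split_fasta(in_path, n_chunks=10, prefix="chunk"):
    outs = [open("%s.%02d.fna" % (prefix, i), "w") for i in range(n_chunks)]
    idx = -1
    for line in open(in_path):
        if line.startswith(">"):           # each header starts a new record
            idx = (idx + 1) % n_chunks     # assign records round-robin
        if idx >= 0:
            outs[idx].write(line)
    for f in outs:
        f.close()

if __name__ == "__main__":
    split_fasta(sys.argv[1])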
June 19 & 26, 2009
- Forgot to update these weeks so this is from memory
- I moved all my stuff to /fs/szattic-asmg4/tgibbons/
- Hopefully they get the backups set up soon...
- Arthur and I got PhymmBL to run on those files I mentioned last time
- I concatenated the output files back together using merge sort (-m merges already-sorted files, -u drops duplicate lines):
sort -um results.03.combined_<input file prefix>* > <output file prefix>.fna
- I wrote a post-processing script for the phymm output called phymm2minimus.pl
- The script prints to STDOUT the completeness of the classifications at each taxonomic level
- Version 3 of the script then iterates through the original Phymm output, creating bins for each taxonomic level
- The scripts and test files are located in /fs/szattic-asmg4/tgibbons/phymm/
July 3, 2009
I'm back to reading about metabolic network analysis and finally getting some tractable ideas
I'm trying to create a data structure that will store the matrix:
RN.##### (potential reactions catalyzable by the query sequences)
    .EC.##### (array of ECs of proteins necessary for the reaction)
        .# (stoichiometry of the protein in the reaction; not sure if this will work as a child of a pointer)
    .CO.##### (array of compounds necessary for the reaction)
        .# (stoichiometry of the compound in the reaction; not sure if this will work as a child of a pointer)
    .r.bool (reversibility of the reaction, shown by => / <=> in KEGG RN equations)
Plus two additional data structures for manipulation by protein and/or compound:
EC.##### (array of ECs in the query)
    .RN.##### (array of reactions in which each EC participates)
CO.##### (array of compounds used in any reaction)
    .RN.##### (array of reactions in which each compound participates)
    .ChEBI.##### (ChEBI number for the compound, or maybe just use the ChEBI number as the key in the first place)
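To make the structure concrete, here is a minimal sketch in Python (the planned implementation is C++; all accession numbers are illustrative placeholders):

# Nested KEGG reaction structure described above, as plain dicts.
reactions = {
    "RN:R00001": {                          # illustrative RN accession
        "enzymes":   {"EC:1.1.1.1": 1},     # EC -> stoichiometry of protein
        "compounds": {"CO:C00001": -1,      # negative = consumed
                      "CO:C00002": 2},      # positive = produced
        "reversible": True,                 # '<=>' vs '=>' in KEGG equations
    },
}

# Companion indexes for manipulation by protein or compound
reactions_by_enzyme   = {"EC:1.1.1.1": ["RN:R00001"]}
reactions_by_compound = {"CO:C00001": ["RN:R00001"],
                         "CO:C00002": ["RN:R00001"]}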
This data structure should be thought of as an R x C matrix plus metadata (such as the necessary proteins). Other thoughts on the use of the matrix:
- It would be interesting to evaluate the matrix using chemical moieties (e.g. CoA, AMP) as described in section 6.5 of Palsson's book, but that would require more significant processing of the KEGG information and might break the above-described data structure. Palsson offers a list of common metabolic moieties in Table 6.2 on p. 102, which he has adapted from another bioengineering book. The KEGG database implicitly treats enzymes as moieties.
- A common question will be to find the metabolites required to make the essential building blocks for an organism. To answer this we will need a list of said building blocks, or target products. Palsson offers such a list in Table 6.1 on p. 98, which he has adapted from the same bioengineering book.
- Next we will likely want to fill in the gaps. This may require storing additional information from the assignment of protein sequences into KOs, as well as expanding the arrays. Alternatively, it may be better to create all new data structures, but that would likely end up being redundant. With that information we could then attempt statistical analysis of partial reactions to determine the likelihood that the missing component was simply overlooked during sampling.
- A further step may be to evaluate the gene set's ability to take up the necessary precursors through either passive or active means.
- Another interesting form of analysis, when possible, would be to multiply the reaction matrix by the concentrations of substrates in the environment (presumably determined by mass spec) and observe the potential output concentrations.
Useful files I'm considering using for the network analysis:
- ftp://ftp.genome.jp/pub/kegg/genes/ko contains detailed information about each KO
- ftp://ftp.genome.jp/pub/kegg/ligand/enzyme/enzyme contains detailed enzyme information by EC number instead of KO
- ftp://ftp.genome.jp/pub/kegg/ligand/reaction/reaction contains detailed information about each RN, including enzyme and compound information
- ftp://ftp.genome.jp/pub/kegg/ligand/compound/compound contains, as you might be guessing by now, compound information
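A hedged sketch of pulling reaction records out of the ligand/reaction file above, assuming the standard KEGG flat-file layout (12-character field names, records terminated by "///"):

# Parse EQUATION and ENZYME fields from the KEGG reaction flat file.
def parse_reactions(path):
    entries, entry, field = {}, None, None
    for line in open(path):
        if line.startswith("///"):          # end of record
            entry = None
            continue
        if line[:12].strip():               # a new field name begins
            field = line[:12].strip()
        value = line[12:].strip()
        if field == "ENTRY":
            entry = value.split()[0]        # e.g. the RN accession
            entries[entry] = {}
        elif entry and field in ("EQUATION", "ENZYME"):
            entries[entry].setdefault(field, []).append(value)
    return entries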
Proposed method of building arrays:
- Assign protein sequences to KOs
- Build list of reactions associated with each KO
- Eliminate any reaction lacking its necessary protein(s) (see the sketch after this list)
- Alternatively, store the partial reactions elsewhere or mark them as incomplete or something
- Build R, KO, and C from the final list of reactions using the files above
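A sketch of the elimination step above, assuming reactions are stored in the dict-of-dicts form sketched in the July 3 entry:

# Keep a reaction only if every EC it requires was assigned to the query.
def filter_reactions(reactions, query_ecs):
    complete, partial = {}, {}
    for rn, info in reactions.items():
        if set(info["enzymes"]) <= set(query_ecs):
            complete[rn] = info
        else:
            partial[rn] = info    # stored elsewhere / marked incomplete
    return complete, partial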
July 10, 2009
Goals for July:
- Decide on a method for assigning protein sequences to KOs and implement it
- Implement data structure described in July 3rd entry using C++
- Test output by printing to tab-delimited files that can be viewed in Excel
- Start playing around with software tools using test data
- Finish Part II of the Palsson book
Algorithm for elucidating all necessary precursors (warning: iz fugly; a Python sketch follows the list):
- Push all the precursors onto a stack
- Iterate through precursors
- For any reaction containing a positive coefficient for the precursor, push the reactants (compounds with negative coefficients) of the reaction onto a stack
- *A miracle happens*
- Print the output as a network of reactions(?)
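A rough sketch of the idea, where the "miracle" is just a visited set that keeps the traversal from looping; the reaction format follows the July 3 sketch and all inputs are illustrative:

def trace_precursors(targets, reactions):
    stack = list(targets)      # push all the target products
    needed = set()             # compounds required somewhere upstream
    used = set()               # reactions in the resulting network
    while stack:
        compound = stack.pop()
        for rn, info in reactions.items():
            # any reaction producing this compound (positive coefficient)
            if info["compounds"].get(compound, 0) > 0:
                used.add(rn)
                for c, coef in info["compounds"].items():
                    if coef < 0 and c not in needed:    # its reactants
                        needed.add(c)
                        stack.append(c)
    return needed, used        # 'used' can be printed as a reaction network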
July 17, 2009
Talked to Dan:
- Regenerated the test set (see the June 5 entry): 1,000,000 reads with the default 454 error model and a uniform read length of 200bp
Other than that I'm trying to read up on pathway analysis and then help James with his project for Volker, but complications from moving last weekend are making this a disappointingly unproductive week.
July 24 & 31, 2009
- I have been learning bash scripting to finish integrating phymm into the AMOS pipeline. I seem to be nearly done, but I'm running into problems with the MetaSim data set. Hopefully repeating the procedures will at least help smooth out the edges.
- I've also resumed using Google Code via SVN to back up and log my work, as well as make it available to the lab: code.google.com/p/pop-lab/
- Note: Google Code is currently testing an implementation of Mercurial and should begin supporting it for the general user pool soon!
- As for the pathway analysis, I realized at this week's lab meeting that I've been brainstorming more or less exactly what Bo's been implementing. So I'm back to reading without a strong sense of direction.
- After much playing around with MetaSim I became more comfortable with how to use it and figured out some of the quirks not described in the manual. I've generated another test data set from the same 5 sequences and found some strange inconsistencies: the same number of files was output, but the new read files are smaller while the Phymm output files are larger. Phymm also appears to have created significantly different bins for the new data set, despite it being generated from the same sequences. The average taxonomic completeness at a given level for the first set was about 98%; for the new set the average is around 80% (with the noteworthy exception of Family at 98%).
August 7, 2009
- I have apparently fixed all the bugs in my phymm2minimus.pl script regarding handling special characters in the NCBI taxonomy file! I am pretty happy/excited about this. Consequently, I have another set of bins I can attempt to assemble now.
- Arrrgh! toAmos is still telling me the fasta sequence file has 208 entries while the fasta trace file has 312 entries! Blaaaarrgh!!!
August 28, 2009
Focus on automated analysis of host-pathogen interactions in metagenomic samples
September 25, 2009
I've been making good progress testing the effects of prebinning on Minimus assemblies. So far I'm still using the test set described in the June 5 entry, although I'm now generating reads using a simpler Perl script written by Arthur Brady instead of MetaSim.
October 9, 2009
Generated three sets of a million 200bp reads, assembled each with Minimus, then for each set:
- Labeled each read with PhymmBL (which generates labels with BLAST as an intermediate step)
- Generated bins at six phylogenetic levels (phylum-phylotype) using each labeling method (BLAST & PhymmBL)
- Assembled each bin using Minimus
- Combined the contigs from each bin along with the unbinned reads (not all reads have a classification at all phylogenetic levels), and then assembled the full set with Minimus
I then calculated the n50s and compared each assembly with dnadiff (a sketch of the n50 calculation follows this list):
- Unlike with the reads generated with MetaSim, I did not observe a ~2.5x increase in n50 size. In fact, the n50 was a little bit lower after binning.
- Unfortunately, because the MetaSim reads were suspect due to excessive incompleteness of the taxonomic assignment, these new results seem more reliable
- The n50s were
- The dnadiff reports were also not encouraging, showing a reduction in the total length of all contigs from the expected 14 Mbp to 11 Mbp
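For reference, n50 here means the length L such that contigs of length >= L contain at least half of the total assembled bases; a minimal sketch:

def n50(contig_lengths):
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2.0
    running = 0
    for length in lengths:         # walk down from the longest contig
        running += length
        if running >= half:        # first length covering half the bases
            return length
    return 0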
Next step:
- Find out what went wrong. For now this means looking for:
  - Differences in read assignments between the binned and unbinned assemblies
  - Homogeneity of reads in bins (i.e., make sure that even if reads from a single genome have been scattered across many bins, bins do not contain reads from a variety of genomes)
To do list for next week:
- Check for homogeneity in Phymm Bins
- Create gene annotation/comparison pipeline for Volker
- Modify Phymm
October 16, 2009
- Check for homogeneity in Phymm bins (a sketch of the check follows the table below)
- Traced read labels back to the original sequences, only to discover that the source labels are lost while sampling reads :(
- So now I'm generating a new set (testSet10) with equal 10x coverage for each of the five genomes, and will cat the multi-fasta read files instead of the genome fasta files. (At 10x coverage with 200bp reads, each genome contributes its length / 20 reads.) Then I'll repeat two days of computation and finally be where I thought I'd started. The new test set will have the following composition:
Description of Regenerated Trial Metagenomic Dataset
Organism | ID | Classification | Genome Length | # 200bp Reads
Methanobrevibacter smithii ATCC 35061 | NC_009515 | archaea | 1,853,160 bp | 92,658
Bacteroides fragilis NCTC 9343 | NC_003228 | bacteroidetes | 5,205,140 bp | 260,257
Porphyromonas gingivalis W83 | NC_002950 | bacteroidetes | 2,343,476 bp | 117,174
Aster yellows witches'-broom phytoplasma AYWB | NC_007716 | firmicutes | 706,569 bp | 35,328
Bacillus subtilis subsp. subtilis str. 168 | NC_000964 | firmicutes | 4,214,630 bp | 210,732
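A sketch of the homogeneity check itself, assuming each read ID in testSet10 carries its source accession as a prefix (the ID convention is an assumption):

from collections import defaultdict

def bin_composition(bins):
    # bins maps bin name -> list of read IDs like "NC_003228_000042"
    comp = {}
    for name, reads in bins.items():
        counts = defaultdict(int)
        for rid in reads:
            source = rid.rsplit("_", 1)[0]    # strip the per-read suffix
            counts[source] += 1
        comp[name] = dict(counts)             # e.g. {"NC_003228": 9812, ...}
    return comp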
- Create gene annotation/comparison pipeline for Volker
- BLAST genes against genomes
- All genes vs all genomes probably isn't a very practical approach to run interactively. Consider an initial setup step that BLASTs all genes vs all genomes and stores the resulting hits in a database for later use, or alternatively allowing the user to select the queries and targets. The first approach would probably be better because it wouldn't take up too much space and would allow rapid, flexible analysis (see the sketch below). I should talk to Bo about setting up something like that.
- Rank genes by %ID ratio between user-defined partitions
- Provide other associated information for genes such as locale
- This is all inspired by feedback I got from Volker and Petros Karakousis at Johns Hopkins. After showing them an ordered list of genes ranked by their presence in virulent strains minus their presence in non-virulent strains (where "virulence" was defined as being a pulmonary pathogen), they had many questions about the top hits. They found it interesting that nearly 80 genes were perfectly conserved across all virulent strains, yet completely absent in the non-virulent strains. They didn't recognize most of the genes, however, and seemed curious about the kinds of data that could be provided by James' other scripts.
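A hedged sketch of that setup step: run the all-vs-all BLAST once and load the tabular hits into a small database for later queries (the table layout and file names are assumptions; the blastall flags follow NCBI BLAST 2.x conventions):

import sqlite3, subprocess

def load_hits(db_path, gene_fasta, genome_db):
    # -m 8 tabular columns: query, subject, %id, ..., e-value, bit score
    out = subprocess.Popen(
        ["blastall", "-p", "blastn", "-i", gene_fasta,
         "-d", genome_db, "-m", "8"],
        stdout=subprocess.PIPE).communicate()[0].decode()
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS hits
                    (gene TEXT, genome TEXT, pct_id REAL, evalue REAL)""")
    for line in out.splitlines():
        if not line:
            continue
        f = line.split("\t")
        conn.execute("INSERT INTO hits VALUES (?,?,?,?)",
                     (f[0], f[1], float(f[2]), float(f[10])))
    conn.commit()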
- Modify Phymm
  - Create a personal directory in phymmTests filled with mostly links back to phymmTests: done. Broke something. Fixed it.
- I have midterms the next two weeks, so things will probably be slow going. Should have known when I resisted studying at the beginning of the semester that it would catch up with my research eventually.
October 30, 2009
Trying to frame the Mycobacterium genomic analysis so that it's appropriate to use as my project in Carl's Biological Networks class:
- Have a collection of Mycobacterium genomes and plasmids, some annotated, some not
- Need to compare networks across genomes/plasmids
- Look into PathBLAST, GeneMerge, and other comparative genomics software
- First must:
  - Quickly annotate genes in novel sequences
    - Gene finding software
    - BLAST known genes from other strains
    - Find some way to combine the results of both
  - Identify pathways in all sequences
    - Assign genes to KO #s
NCBI Info Summary for Mycobacterium Genome Sequences
- NCBI has 22 completed genomic sequences representing as many strains, plus 9 plasmid sequences non-uniformly distributed among them
- KEGG has pathway annotations for 21 strains
- 3 strains were annotated by TIGR, none containing plasmids. These will be treated as the highest confidence annotations.
- 5 strains were annotated by JGI, including 3 strains containing a total of 6 plasmids. These will be considered as having the next highest level of confidence.
  - Mycobacterium gilvum PYR-GCK - contains 3 plasmids
  - Mycobacterium sp. JLS
  - Mycobacterium sp. KMS - contains 2 plasmids
  - Mycobacterium sp. MCS - contains 1 plasmid
  - Mycobacterium vanbaalenii PYR-1
- 6 strains were annotated by other centers I've heard regularly referenced and will probably treat with some intermediate level of confidence if needed.
  - Mycobacterium bovis AF2122/97 - Sanger
  - Mycobacterium leprae TN - Sanger
  - Mycobacterium marinum M - Sanger, 1 plasmid
  - Mycobacterium tuberculosis F11 - Broad
  - Mycobacterium tuberculosis H37Rv - Sanger
  - Mycobacterium tuberculosis KZN 1435 - Broad
- 8 strains were annotated by various other institutes and will be treated with relatively low confidence.
  - Mycobacterium abscessus - Genoscope, 1 plasmid
  - Mycobacterium avium subsp. paratuberculosis K-10 - U Minnesota
  - Mycobacterium bovis BCG - Fiocruz Brazil
  - Mycobacterium bovis BCG str. Pasteur 1173P2 - Institut Pasteur
  - Mycobacterium bovis BCG str. Tokyo 172 - Japan BCG
  - Mycobacterium leprae Br4923 - Institut Pasteur
  - Mycobacterium tuberculosis H37Ra - Chinese National HGC, Shanghai
  - Mycobacterium ulcerans Agy99 - Institut Pasteur, 1 plasmid
- All of these, plus the two novel sequences, will comprise the study set:
  - Kansasii
  - Gastri
November 6, 2009
Start testing Minimus for assembling the metagenomic data sets from the Nature Methods paper (Mavromatis et al. 2007) under the following conditions:
Metagenomic Assembly Performance Analysis
Dataset | Read Length | Error Model | Overall Coverage | Dispersion | Total Assemblies
simMC, simLC, simHC | 200bp, 400bp, 600bp | none, 454 | 1x, 3x, 10x | uniform, random skew (light, medium, heavy), dominant organism(s) | 54 x dispersion cases
simMC, simLC, simHC | 50bp, 100bp | none, Solexa | 1x, 3x, 10x | uniform, random skew (light, medium, heavy), dominant organism(s) | 36 x dispersion cases
Current plan of attack for analyzing mycobacterial genomes (listed as KEGG SOAP API commands):
- get_genes_by_organism
- get_*_by_gene where * represents: enzymes, best_neighbors, best_best_neighbors, reverse_best_neighbors, paralogs, motifs, ko, or pathways
Will build stoichiometry arrays with these functions and manipulate them with Python
November 13, 2009
Subsets of Mycobacterial Strains
Met with Volker and trimmed down the study set considerably. I am now focusing on three phenotypically defined subsets:
- Pulmonary Pathogens
  - H37Rv
  - CDC1551
  - Bovis
- Facultative Pathogens
  - Abscessus
  - Avium
  - Kansasii W58
  - Bovis BCG
  - H37Ra
- Environmental Non-Pathogenic
  - Smegmatis
  - Gastri
  - Vanbaalenii
  - Gilvum
With the intention of making the following comparisons:
- Pulmonary vs Fac-Path
- Pulmonary vs Attenuated
- Fac-Path vs Non-Path
- Gastri vs Kansasii
  - Non-path vs Fac-path
  - Closely related
- Bovis vs (H37Rv & CDC1551)
  - Bovis is not very contagious between humans, but just as deadly
- Pulmonary vs Non-Path
  - If time allows
Downloading Information From KEGG
Using SOAPpy to download information by gene for complete sets of genes takes hours. Luckily, I've discovered that Python supports "pickling" data structures, which saves them to files that can be easily imported by other programs. I've generated the following Python pickles.
Most of the pickles are generated directly from SOAPpy functions as follows:
"x_by_y.kegg" is generated with the associated SOAPpy function "get_x_by_y(y[i])" over all i.
For nested data structures, "x_by_y_by_z.kegg", I nested the functions as you would expect except where otherwise noted.
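As an illustration, the pattern for most of the pickles looks roughly like this (the SOAP endpoint and function names follow the KEGG API of the time; the organism codes, range arguments, and file names are illustrative):

import pickle
from SOAPpy import WSDL

serv = WSDL.Proxy("http://soap.genome.jp/KEGG.wsdl")

# genes_by_organism.kegg: filled directly from the associated SOAPpy function
genes_by_organism = {}
for org in ["mtu", "mtc"]:                  # illustrative organism codes
    genes_by_organism[org] = serv.get_genes_by_organism(org, 1, 10000)

# enzymes_by_gene.kegg: one SOAP call per gene (this is the hours-long part)
enzymes_by_gene = {}
for org in genes_by_organism:
    for g in genes_by_organism[org]:
        enzymes_by_gene[g] = serv.get_enzymes_by_gene(g)

# Pickle the structures so later scripts can load them without re-downloading
pickle.dump(genes_by_organism, open("genes_by_organism.kegg", "wb"))
pickle.dump(enzymes_by_gene, open("enzymes_by_gene.kegg", "wb"))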
Information by Organism:
- pathways_by_organism.kegg
  - each pathway name is appended with the organism name
  - filled directly from the associated SOAPpy function
- genPaths_by_organism.kegg
  - generalized pathway names
  - created by stripping the appended organism name
- genes_by_organism.kegg
  - filled directly from the associated SOAPpy function
Information by Gene:
- enzymes_by_gene.kegg
  - filled directly from the associated SOAPpy function
- kos_by_gene.kegg
  - filled directly from the associated SOAPpy function
Information by Enzyme:
- reactions_by_enzyme.kegg
  - filled directly from the associated SOAPpy function
- compounds_by_enzyme.kegg
  - filled directly from the associated SOAPpy function
Information by Reaction:
- enzymes_by_reaction.kegg
  - filled directly from the associated SOAPpy function
- compounds_by_reaction.kegg
  - filled directly from the associated SOAPpy function
Information by Pathway:
- reactions_by_pathway.kegg
  - filled directly from the associated SOAPpy function
Computationally Comparing the Metabolic Networks
For an early comparison, I like the representation scheme used in a paper by Borenstein et al. They represent pathways as directed networks with compounds as nodes and reactions as edges.
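A minimal sketch of that representation, reusing the illustrative reaction dict from the July 3 entry: compounds become nodes, and each substrate-to-product pair becomes a directed edge labeled by its reaction.

def build_network(reactions):
    edges = set()
    for rn, info in reactions.items():
        subs  = [c for c, k in info["compounds"].items() if k < 0]
        prods = [c for c, k in info["compounds"].items() if k > 0]
        for s in subs:
            for p in prods:
                edges.add((s, p, rn))           # directed edge s -> p
                if info["reversible"]:
                    edges.add((p, s, rn))       # reversible: add p -> s too
    return edges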