Cbcb:Pop-Lab:Ted-Report-2009


Summer 2009 Goals

  • Research traditional approaches taken to gene-level analysis of metagenomic data
  • Critically evaluate the traditional approaches in general and in the context of current Pop Lab projects
  • Identify portions of the analysis that can be automated
  • Develop scalable tools to do the automated analysis

May 15, 2009

  • Read VisANT paper and user manual[1]. Determined VisANT will work for manual metabolic pathway analysis of even large-scale data sets and can be automated by running in "Batch Mode".
  • Need to read about FastHMM[2]
  • Still need to make "Welcome Wiki" for n00bs (read: new members)

May 22, 2009

  • Made Welcome Wiki
  • Read metagenomics papers
  • Determined that VisANT can be used with Bo's data by importing it as MicroArray data

May 29, 2009

  • Took an early Summer vacation last weekend:
    • Drove to NC to see a friend graduate with BSs in CS & Physics
    • Went sailing for the first time at my girlfriend's parents' place in VA
  • Refined Welcome Wiki
  • Read metagenomics/pathway reconstruction/analysis papers
  • Organized reading group for Palsson Systems Bio book

June 5, 2009

  • Read metagenomics/pathway reconstruction/analysis papers and first two chapters of Palsson book.
  • Currently building a test set for incorporating Phymm into the metagenomics pipeline.
    • A single archaeal genome was chosen from the findings of Mihai's 2006 Science paper analyzing the human distal gut.
    • Two pairs of bacterial genomes were chosen for the test set using columns on the NCBI RefSeq Complete Genomes website[3]:
    1. The pairs of bacterial genomes were taken from the Groups: Bacteroidetes/Chlorobi and Firmicutes because they are the two most predominant groups present in the human gut.
    2. I chose genomes with a complete set of NCBI Tools available.
    3. After this I attempted to choose genomes with significantly different GC content.
    4. Finally, preference was given to organisms I recognized from gut microbiome papers/discussions, or failing that, just a really awesome name.

Description of Trial Metagenomic Dataset

Organism | Classification | Genome Length
Methanobrevibacter_smithii_ATCC_35061 | archaea | 1,853,160 bp
Bacteroides fragilis NCTC 9343 | bacteroidetes | 5,205,140 bp
Porphyromonas gingivalis W83 | bacteroidetes | 2,343,476 bp
Aster yellows witches'-broom phytoplasma AYWB | firmicutes | 706,569 bp
Bacillus subtilis subsp. subtilis str. 168 | firmicutes | 4,214,630 bp

Note: plasmid DNA was not included
The combined length of all five genomes is 14,322,975 bp

June 12, 2009

  • Today is my birthday!!! :D
  • Last week's meeting was a success!
    • The books came in Wednesday so the reading is now readily available.
    • Read chapter 3 and part of the 2006 paper for Friday.
  • Met with Arthur to discuss Phymm.
    • I have gotten better at using MetaSim and have generated the previously described test data set composed of 1 million 200bp reads, approximately 15x coverage (actual average read length: 215.42bp).
    • I have been relearning how to use the AMOS package in preparation for piping the output from Phymm into it
    • Note: It appears that Phymm can be "parallelized" by dividing the query file into smaller files and merging the output files. According to Arthur, each read is scored independently, so the only limits are the number of reads and the number of processors.
      • I have parsed the test set into 10 files and intend to run them simultaneously on Ginkgo (a sketch of the splitting step follows this list).
    • I backed up his copy of RefSeq Bacteria/Archaea and removed the test sequences from the original database. I now need to find a way to remove the sequences from the BLAST database and I'm ready to go.
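A minimal Python sketch of that splitting step (file names, chunk count, and the round-robin scheme are illustrative; this just divides a multi-FASTA file so the chunks can be scored independently):

#!/usr/bin/env python
# split_fasta.py: divide a multi-FASTA file into N roughly equal chunks
# so each chunk can be scored by Phymm independently and the results
# merged afterwards. Usage: python split_fasta.py reads.fna 10 chunk
import sys

def split_fasta(infile, n_chunks, prefix):
    # Slurp records; fine for a simulated read set of this size.
    records, header, seq = [], None, []
    for line in open(infile):
        if line.startswith('>'):
            if header is not None:
                records.append(header + ''.join(seq))
            header, seq = line, []
        else:
            seq.append(line)
    if header is not None:
        records.append(header + ''.join(seq))
    # Deal records round-robin into the output chunks.
    outs = [open('%s.%02d.fna' % (prefix, i), 'w') for i in range(n_chunks)]
    for i, record in enumerate(records):
        outs[i % n_chunks].write(record)
    for out in outs:
        out.close()

if __name__ == '__main__':
    split_fasta(sys.argv[1], int(sys.argv[2]), sys.argv[3])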

June 19 & 26, 2009

  • Forgot to update these weeks, so this is from memory
  • I moved all my stuff to /fs/szattic-asmg4/tgibbons/
    • Hopefully they get the backups set up soon...
  • Arthur and I got phymmBL to run on those files I mentioned last time
  • I concatenated the output files back together using merge sort:
sort -um results.03.combined_<input file prefix>* > <output file prefix>.fna
  • I wrote a post-processing script for the phymm output called phymm2minimus.pl
    • The script prints to STDOUT the completeness of the classifications at each taxonomic level
    • Version 3 of the script then iterates back through the original reads, creating bins at each taxonomic level (a Python sketch of the binning idea follows this list)
    • The scripts and test files are located in /fs/szattic-asmg4/tgibbons/phymm/
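For reference, a minimal Python sketch of the binning idea behind phymm2minimus.pl (the real script is Perl, and the two-column "read_id<TAB>taxon" label format assumed here is a simplification of the actual Phymm output):

# bin_reads.py: write each read into a FASTA bin named after its
# taxonomic label at one chosen level. Usage:
#   python bin_reads.py labels.txt reads.fna
import sys
from collections import defaultdict

# Load read -> taxon labels (assumed format: "read_id<TAB>taxon").
label_of = {}
for line in open(sys.argv[1]):
    fields = line.rstrip('\n').split('\t')
    label_of[fields[0]] = fields[1]

# Scatter reads into per-taxon bins.
bins = defaultdict(list)
current = 'unclassified'
for line in open(sys.argv[2]):
    if line.startswith('>'):
        current = label_of.get(line[1:].split()[0], 'unclassified')
    bins[current].append(line)

for taxon, lines in bins.items():
    out = open('bin_%s.fna' % taxon.replace(' ', '_'), 'w')
    out.writelines(lines)
    out.close()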

July 3, 2009

I'm back to reading about metabolic network analysis and finally getting some tractable ideas.
I'm trying to create a data structure that will store the matrix:

RN.#####            (potential reactions capable by query sequences)
        .EC.#####   (array of ECs of proteins necessary for reaction)
                 .# (stoichiometry of protein in reaction; not sure if this will work as a child of a pointer)
        .CO.#####   (array of compounds necessary for reaction)
                 .# (stoichiometry of compound in reaction; not sure if this will work as a child of a pointer)
        .r.bool     (reversibility of reaction, shown by => / <=> in KEGG RN equations)

Plus two additional data structures for manipulation by protein and/or compound:

EC.#####             (array of ECs in query)
        .RN.#####    (array of reactions in which each KO participates)
CO.#####             (array of compounds used in any reactions)
        .RN.#####    (array of reactions in which each compound participates)
        .ChEBI.##### (ChEBI number for compound, or maybe just use the ChEBI number in the first place)
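In Python-dict terms, just to show the shape (the plan is to implement this in C++, and all IDs below are made up):

# One entry per reaction, with enzyme and compound stoichiometries and
# a reversibility flag ("<=>" vs "=>" in the KEGG RN equations).
reactions = {
    'RN00001': {
        'ec': {'1.1.1.1': 1},          # enzymes required, with stoichiometry
        'co': {'C00001': -1,           # negative coefficient = consumed
               'C00002':  1},          # positive coefficient = produced
        'reversible': True,
    },
}

# Companion indices for manipulation by protein and/or compound:
reactions_by_ec = {'1.1.1.1': ['RN00001']}
reactions_by_co = {'C00001': ['RN00001'], 'C00002': ['RN00001']}
chebi_by_co = {'C00001': 'CHEBI:12345'}   # or just key on ChEBI ids directly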

This data structure should be thought of as an R x C matrix plus metadata (such as necessary proteins). Other thoughts on the use of the matrix:

  • It would be interesting to evaluate the matrix using chemical moieties (ex. CoA, AMP) as described in section 6.5 of Palsson's book, but that would require more significant processing of the KEGG information and might break the above-described data structure. Palsson offers a list of common metabolic moieties in Table 6.2 on p. 102, which he has adapted from another bioengineering book. The KEGG database implicitly treats enzymes as moieties.
  • A common question will be to find the metabolites required to make the essential building blocks for an organism. To answer this we will need a list of said building blocks, or target products. Palsson offers such a list in Table 6.1 on p. 98, which he has adapted from the same bioengineering book.
  • Next we will likely want to fill in the gaps. This may require storing additional information from the assignments of protein sequences into KOs, as well as expanding the arrays. Alternatively, it may be better to simply create all-new data structures, but that would likely end up being redundant. With that information we could then attempt statistical analysis of partial reactions to determine the likelihood that the missing component was simply overlooked during sampling.
  • A further step may be to evaluate the gene set's ability to take up the necessary precursors through either passive or active means.
  • Another interesting form of analysis, when possible, would be to multiply the reaction matrix by the concentrations of substrates in the environment (presumably determined by mass spec) and observe the potential output concentrations (a toy numpy version of this idea follows this list).
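A toy numpy version of that last idea, reading it as Palsson's dx/dt = S·v (all numbers invented):

# Rows = compounds, columns = reactions; entries are stoichiometric
# coefficients (negative = consumed, positive = produced). Multiplying
# S by a vector of reaction rates gives the net production rate of
# each compound.
import numpy as np

S = np.array([[-1,  0],    # compound A: consumed by reaction 1
              [ 1, -1],    # compound B: produced by r1, consumed by r2
              [ 0,  1]])   # compound C: produced by reaction 2
v = np.array([2.0, 1.0])   # hypothetical reaction rates
print(S.dot(v))            # net rate of change per compound: [-2. 1. 1.]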

Useful files I'm considering using for the network analysis:

Proposed method of building arrays:

  1. Assign protein sequences to KOs
  2. Build list of reactions associated with each KO
  3. Eliminate any reaction lacking necessary protein(s)
    • Alternatively, store the partial reactions elsewhere or mark them as incomplete
  4. Build R, KO, and C from the final list of reactions using the files above (a sketch of step 3 follows this list)
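A sketch of step 3, over the structures from the July 3 entry (names hypothetical):

# Keep only reactions whose required enzymes all appear in the set of
# ECs assigned to the query sequences; optionally hold on to the
# partial reactions for the gap-filling analysis later.
def filter_reactions(reactions, query_ecs, keep_partial=False):
    complete, partial = {}, {}
    for rid, rxn in reactions.items():
        if set(rxn['ec']) <= set(query_ecs):
            complete[rid] = rxn
        elif keep_partial:
            partial[rid] = rxn      # stored elsewhere, marked incomplete
    return complete, partial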

July 10, 2009

Goals for July:

  • Decide on a method for assigning protein sequences to KOs and implement it
  • Implement data structure described in July 3rd entry using C++
    • Test output by printing to tab-delimited files that can be viewed in Excel
    • Start playing around with software tools using test data
  • Finish Part II of the Palsson book

Algorithm for elucidating all necessary precursors (warning: iz fugly):

  1. Push all the precursors onto a stack
  2. Iterate through precursors
  3. For any reaction containing a positive coefficient for the precursor, push the reactants (compounds with negative coefficients) of the reaction onto a stack
  4. *A miracle happens*
  5. Print the output as a network of reactions(?) (a less miraculous sketch of steps 2-4 follows)
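A Python sketch of steps 2-4, using the structures from the July 3 entry; a visited set stands in for the miracle by keeping cycles from looping forever:

# Backward traversal from the target precursors to everything needed to
# make them. reactions[rid]['co'] maps compounds to coefficients
# (negative = consumed, positive = produced); reactions_by_co maps a
# compound to the reactions it participates in.
def required_compounds(targets, reactions, reactions_by_co):
    stack = list(targets)
    seen = set(targets)
    needed_reactions = set()
    while stack:
        compound = stack.pop()
        for rid in reactions_by_co.get(compound, []):
            if reactions[rid]['co'].get(compound, 0) <= 0:
                continue            # only reactions that *produce* it
            needed_reactions.add(rid)
            for co, coeff in reactions[rid]['co'].items():
                if coeff < 0 and co not in seen:    # a reactant
                    seen.add(co)
                    stack.append(co)
    # The network of reactions (step 5) can be printed from this set.
    return seen, needed_reactions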

July 17, 2009

Talked to Dan:

  • Regenerated test set (see June 5 entry) of 1,000,000 reads with default 454 error model and a uniform read length of 200bp

Other than that I'm trying to read up on pathway analysis and then help James with his project for Volker, but complications from moving last weekend are making this a disappointingly unproductive week.

July 24 & 31, 2009

  • I have been learning bash scripting to finish integrating phymm into the AMOS pipeline. I seem to be nearly done, but I'm running into problems with the MetaSim data set. Hopefully repeating the procedures will at least help smooth out the edges.
  • I've also resumed using Google Code via SVN to back up and log my work, as well as make it available to the lab: code.google.com/p/pop-lab/
    • Note: Google Code is currently testing an implementation of Mercurial and should begin supporting it for the general user pool soon!
  • As for the pathway analysis, I realized at this week's lab meeting that I've been brainstorming more or less exactly what Bo's been implementing. So I'm back to reading without a strong sense of direction.
  • After much playing around with MetaSim I became more comfortable with how to use it and figured out some of the quirks not described in the manual. I've generated another test data set from the same 5 sequences and found some strange inconsistencies. The same number of files was output; however, the new read files are smaller, while the Phymm output files are larger. Phymm also appears to have created significantly different bins for the new data set, despite it being generated from the same sequences. The average completeness at a given taxonomic level for the first set was about 98%; for the new set the average is around 80% (with the noteworthy exception of Family, at 98%).

August 7, 2009

  • I have apparently fixed all the bugs in my phymm2minimus.pl script regarding handling special characters in the NCBI taxonomy file! I am pretty happy/excited about this. Consequently, I have another set of bins I can attempt to assemble now.
  • Arrrgh! toAmos is still telling me the FASTA sequence file has 208 entries while the FASTA trace file has 312! Blaaaarrgh!!!

August 28, 2009

Focus on automated analysis of host-pathogen interactions in metagenomic samples

September 25, 2009

I've been making good progress testing the effects of prebinning on minimus assemblies. So far I'm still using the test set described in the June 5 entry, although I'm now generating reads using a simpler perl script written by Arthur Brady instead of MetaSim.

October 9, 2009

Generated three sets of a million 200bp reads, assembled each with Minimus, then for each set:

  • Labeled each read with PhymmBL (which generates labels with BLAST as an intermediate step)
  • Generated bins at six phylogenetic levels (phylum-phylotype) using each labeling method (BLAST & PhymmBL)
  • Assembled each bin using Minimus
  • Combined the contigs from each bin along with the unbinned reads (not all reads have a classification at all phylogenetic levels), and then assembled the full set with Minimus

I then calculated the n50s and compared each assembly with dnadiff:

  • Unlike with the reads generated by MetaSim, I did not observe a ~2.5x increase in n50 size. In fact, the n50 was a little lower after binning.
    • Unfortunately, because the MetaSim reads were suspect due to the excessive incompleteness of their taxonomic assignments, these new results seem the more reliable of the two
    • The n50s were
  • The dnadiff reports were also not encouraging, showing a reduction in the total length of all contigs from the expected 14Mbp to 11Mbp

Next step:

  • Find out what went wrong. For now this means looking for:
    • Differences in read assignments between the binned and unbinned assemblies
    • Homogeneity of reads in bins (i.e., make sure that even if reads from a single genome have been scattered across many bins, bins do not contain reads from a variety of genomes)

To do list for next week:

  1. Check for homogeneity in Phymm Bins
  2. Create gene annotation/comparison pipeline for Volker
  3. Modify Phymm

October 16, 2009

  • Check for homogeneity in Phymm Bins
    • Traced read labels back to the original sequences, only to discover that the source labels are lost during read sampling :(
    • So now I'm generating a new set (testSet10) with equal 10x coverage for each of the five genomes and will cat the multi-fasta read files instead of the genome fasta files. Then, I'll repeat two days of computation and finally be where I thought I'd started. The new test will have the following composition:

Description of Regenerated Trial Metagenomic Dataset

Organism | ID | Classification | Genome Length | # 200bp Reads
Methanobrevibacter_smithii_ATCC_35061 | NC_009515 | archaea | 1,853,160 bp | 92,658
Bacteroides fragilis NCTC 9343 | NC_003228 | bacteroidetes | 5,205,140 bp | 260,257
Porphyromonas gingivalis W83 | NC_002950 | bacteroidetes | 2,343,476 bp | 117,174
Aster yellows witches'-broom phytoplasma AYWB | NC_007716 | firmicutes | 706,569 bp | 35,328
Bacillus subtilis subsp. subtilis str. 168 | NC_000964 | firmicutes | 4,214,630 bp | 210,732
  • Create gene annotation/comparison pipeline for Volker
    1. BLAST genes against genomes
      • Running all genes vs. all genomes for every query probably isn't very practical. Consider an initial setup step that BLASTs all genes vs. all genomes and stores the resulting hits in a database for later use, or allowing the user to select the queries and targets. The first approach would probably be better because it wouldn't take up too much space and would allow rapid, flexible analysis. I should talk to Bo about setting up something like that.
    2. Rank genes by %ID ratio between user-defined partitions
    3. Provide other associated information for genes, such as genomic location
    • This is all inspired by feedback I got from Volker and Petros Karakousis at Johns Hopkins. After showing them a list of genes ranked by the difference between their presence in virulent strains and their presence in non-virulent strains (where "virulence" was defined as being a pulmonary pathogen), they had many questions about the top hits. They found it interesting that there were nearly 80 genes perfectly conserved across all virulent strains, yet completely absent from the non-virulent strains. They didn't recognize most of the genes, however, and seemed curious about the kinds of data that could be provided by James' other scripts. (A sketch of the ranking step follows this entry.)
  • Modify Phymm
    • Create personal directory in phymmTests filled mostly with links back to phymmTests
    • Done. Broke something. Fixed it.
  • I have midterms the next two weeks, so things will probably be slow going. Should have known when I resisted studying at the beginning of the semester that it would catch up with my research eventually.
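Following up on the ranking idea above: a minimal Python sketch, assuming a presence/absence table has already been built from the BLAST hits (all names hypothetical):

# Rank genes by the difference in their presence between two
# user-defined strain groups, e.g. virulent vs. non-virulent.
# "presence" maps gene -> set of strains with a hit.
def rank_genes(presence, group_a, group_b):
    scores = {}
    for gene, strains in presence.items():
        frac_a = len(strains & set(group_a)) / float(len(group_a))
        frac_b = len(strains & set(group_b)) / float(len(group_b))
        # 1.0 means perfectly conserved in group A and absent from B
        scores[gene] = frac_a - frac_b
    return sorted(scores.items(), key=lambda kv: -kv[1])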

October 30, 2009

Trying to frame the mycobacterial genome analysis such that it's appropriate to use as my project in Carl's Biological Networks class:

  • Have a collection of mycobacterial genomes and plasmids, some annotated, some not
  • Need to compare networks across genomes/plasmids
    • Look into PathBLAST, GeneMerge, and other comparative genomics software
  • First must:
    • Quickly annotate genes in novel sequences
      • Gene finding software
      • Blast known genes in other strains
      • Find some way to combine results of both
    • Identify pathways in all sequences
      • Assign genes to KO #s

NCBI Info Summary for Mycobacterium Genome Sequences

November 6, 2009

Start testing Minimus for assembling the metagenomic data sets from the Nature Methods paper (Mavromatis et al. 2007) under the following conditions:

Metagenomic Assembly Performance Analysis

Dataset | Read Length | Error Model | Overall Coverage | Dispersion | Total Assemblies
simMC, simLC, simHC | 200bp, 400bp, 600bp | none, 454 | 1x, 3x, 10x | uniform, random skew (light, medium, heavy), dominant organism(s) | 54 x dispersion cases
simMC, simLC, simHC | 50bp, 100bp | none, Solexa | 1x, 3x, 10x | uniform, random skew (light, medium, heavy), dominant organism(s) | 36 x dispersion cases

Current plan of attack for analyzing mycobacterial genomes (listed as KEGG SOAP API commands):

  • get_genes_by_organism
  • get_*_by_gene where * represents: enzymes, best_neighbors, best_best_neighbors, reverse_best_neighbors, paralogs, motifs, ko, or pathways

Will build stoichiometry arrays with these functions and manipulate them with Python (a SOAPpy sketch follows)
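A minimal SOAPpy sketch of that plan; the WSDL URL and call signatures follow the KEGG SOAP API documentation as I understand it, so treat the details (organism code, offset/limit arguments) as assumptions:

# Pull gene ids for an organism, then the enzyme links per gene.
from SOAPpy import WSDL

serv = WSDL.Proxy('http://soap.genome.jp/KEGG.wsdl')

# e.g. the first 100 genes of M. tuberculosis H37Rv ('mtu')
genes = serv.get_genes_by_organism('mtu', 1, 100)
enzymes_by_gene = {}
for gene in genes:
    enzymes_by_gene[gene] = serv.get_enzymes_by_gene(gene)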

November 13, 2009

Subsets of Mycobacterial Strains

Met with Volker and trimmed down the study set considerably. I am now focusing on three phenotypically defined subsets:

  1. Pulmonary Pathogens
    • H37Rv
    • CDC1551
    • Bovis
  2. Facultative Pathogens
    • Abscessus
    • Avium
    • Kansasii W58
    • Bovis BCG
    • H37Ra
  3. Environmental Non-Pathogenic
    • Smegmatis
    • Gastri
    • Vanbaalenii
    • Gilvum

With the intention of making the following comparisons:

  1. Pulmonary vs Fac-Path
  2. Pulmonary vs Attenuated
  3. Fac-Path vs Non-Path
  4. Gastri vs Kansasii
    • Non-path vs Fac-path
    • Closely related
  5. Bovis vs (H37Rv & CDC1551)
    • Bovis is not very contagious between humans, but just as deadly
  6. Pulmonary vs Non-Path
    • If time

Downloading Information From KEGG

Using SOAPpy to download information gene by gene for complete sets of genes takes hours. Luckily, I've discovered that Python supports "pickling" data structures, which saves them to files that can be easily loaded by other programs. I've generated the following Python pickles.

Most of the pickles are generated directly from SOAPpy functions as follows:
"x_by_y.kegg" is generated with the associated SOAPpy function "get_x_by_y(y[i])" over all i.
For nested data structures, "x_by_y_by_z.kegg", I nested the functions as you would expect, except where otherwise noted (a sketch of the basic pattern follows).
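A sketch of the basic pattern (SOAPpy dispatches methods by name, so a generic helper works; the helper itself is made up):

# Build "x_by_y.kegg": call get_x_by_y over every y and pickle the
# resulting dict so later scripts can load it without re-querying KEGG.
import pickle
from SOAPpy import WSDL

serv = WSDL.Proxy('http://soap.genome.jp/KEGG.wsdl')

def build_x_by_y(x, y, y_items):
    func = getattr(serv, 'get_%s_by_%s' % (x, y))
    table = dict((item, func(item)) for item in y_items)
    out = open('%s_by_%s.kegg' % (x, y), 'wb')
    pickle.dump(table, out)
    out.close()
    return table

# e.g. build_x_by_y('enzymes', 'gene', genes), given a list of gene ids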

Information by Organism:

  1. pathways_by_organism.kegg
    • each pathway name is appended by the organism name
    • filled directly from the associated SOAPpy function
  2. genPaths_by_organism.kegg
    • generalized pathway names
    • created by stripping the appended organism name
  3. genes_by_organism.kegg
    • filled directly from the associated SOAPpy function

Information by Gene:

  1. enzymes_by_gene.kegg
    • filled directly from the associated SOAPpy function
  2. kos_by_gene.kegg
    • filled directly from the associated SOAPpy function

Information by Enzyme:

  1. reactions_by_enzyme.kegg
    • filled directly from the associated SOAPpy function
  2. compounds_by_enzyme.kegg
    • filled directly from the associated SOAPpy function

Information by Reaction:

  1. enzymes_by_reaction.kegg
    • filled directly from the associated SOAPpy function
  2. compounds_by_reaction.kegg
    • filled directly from the associated SOAPpy function

Information by Pathway:

  1. reactions_by_pathway.kegg
    • filled directly from the associated SOAPpy function

Computationally Comparing the Metabolic Networks

For an early comparison, I like the representation scheme used in this paper by Borenstein et al.: they represent pathways as directed networks with compounds as nodes and reactions as edges (a sketch of building such a network from the pickles above follows).
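A sketch of that representation, using networkx. Note that compounds_by_reaction alone doesn't record direction, so the substrate/product split here is assumed to come from parsing the KEGG reaction equations:

# Compounds are nodes; each reaction adds edges from its substrates to
# its products (both directions if the reaction is reversible).
import networkx as nx

def build_compound_graph(substrates_by_rn, products_by_rn, reversible):
    G = nx.DiGraph()
    for rn in substrates_by_rn:
        for s in substrates_by_rn[rn]:
            for p in products_by_rn[rn]:
                G.add_edge(s, p, reaction=rn)
                if rn in reversible:
                    G.add_edge(p, s, reaction=rn)
    return G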