Cbcb:Pop-Lab:Ted-Report

Revision as of 17:53, 2 July 2009

Research traditional approaches taken to gene-level analysis of metagenomic data
Critically evaluate the traditional approaches in general and in the context of current Pop Lab projects
Identify portions of the analysis that can be automated
Develop scalable tools to do the automated analysis

Read the VisANT paper and user manual[1]. Determined that VisANT will work for manual metabolic pathway analysis of even large-scale data sets, and that it can be automated by running in "Batch Mode".
Need to read about FastHMM[2]
Still need to make "Welcome Wiki" for n00bs (read: new members)

Made Welcome Wiki
Read metagenomics papers
Determined that VisANT can be used with Bo's data by importing it as MicroArray data

Took an early Summer vacation last weekend:
- Drove to NC to see a friend graduate with BS degrees in CS & Physics
- Went sailing for the first time at my girlfriend's parents' place in VA
Refined Welcome Wiki
Read metagenomics/pathway reconstruction/analysis papers
Organized reading group for Palsson Systems Bio book

Read metagenomics/pathway reconstruction/analysis papers and first two chapters of Palsson book.
Currently building test set for incorporation of Phymm into metagenomics pipeline.
- A single archaeal genome was chosen from the findings of Mihai's 2006 Science paper analyzing the human distal gut.
- Two pairs of bacterial genomes were chosen for the test set using columns on the NCBI RefSeq Complete Genomes website[3]:
1. The pairs of bacterial genomes were taken from the groups Bacteroidetes/Chlorobi and Firmicutes because they are the two predominant groups present in the human gut.
2. I chose genomes with a complete set of NCBI Tools available.
3. After this I attempted to choose genomes with significantly different GC content.
4. Finally, preference was given to organisms I recognized from gut microbiome papers/discussions or, failing that, to organisms with a really awesome name.
- The final list is:

Organism	Classification	Genome Length
Methanobrevibacter_smithii_ATCC_35061	archaea	1853160 bp
Bacteroides fragilis NCTC 9343	bacteroidetes	5205140 bp
Porphyromonas gingivalis W83	bacteroidetes	2343476 bp
Aster yellows witches'-broom phytoplasma AYWB	firmicutes	706569 bp
Bacillus subtilis subsp. subtilis str. 168	firmicutes	4214630 bp

Note: plasmid DNA was not included
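
As a rough sanity check on the size of this test set, the genome lengths in the table can be summed and combined with a read count and mean read length to estimate the average coverage. The sketch below is only that: the function names are made up for illustration, and it assumes reads are spread uniformly over all five genomes, which may not match how the reads were actually sampled. The read count and mean read length used in the example are the figures reported later in these notes.

```python
# Genome lengths in bp, copied from the table above (plasmids excluded).
GENOME_LENGTHS = {
    "Methanobrevibacter smithii ATCC 35061": 1853160,
    "Bacteroides fragilis NCTC 9343": 5205140,
    "Porphyromonas gingivalis W83": 2343476,
    "Aster yellows witches'-broom phytoplasma AYWB": 706569,
    "Bacillus subtilis subsp. subtilis str. 168": 4214630,
}


def total_length(lengths):
    """Total number of bases across all genomes in the test set."""
    return sum(lengths.values())


def expected_coverage(n_reads, mean_read_len, lengths):
    """Average depth, ASSUMING reads are spread uniformly over all genomes."""
    return n_reads * mean_read_len / total_length(lengths)


if __name__ == "__main__":
    print(total_length(GENOME_LENGTHS))  # 14322975
    # 1 million reads with a mean length of 215.42 bp
    print(round(expected_coverage(1_000_000, 215.42, GENOME_LENGTHS), 2))  # 15.04
```

Under these assumptions the estimate agrees with the "approximately 15x" figure quoted for the simulated reads.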

Today is my birthday!!! :D
Last week's meeting was a success!
- The books came in Wednesday so the reading is now readily available.
- Read chapter 3 and part of the 2006 paper for Friday.
Met with Arthur to discuss Phymm.
- I have gotten better at using MetaSim and have generated the previously described test data set composed of 1 million 200bp reads, approximately 15x coverage (actual average read length: 215.42bp).
- I have been relearning how to use the AMOS package in preparation for piping the output from Phymm into it.
- Note: It appears that Phymm can be "parallelized" by dividing the query file into smaller files and merging the output files. According to Arthur, each read is scored independently. So the only limits are the number of reads and the number of processors.
  - I have parsed the test set into 10 files and intend to run them simultaneously on Ginkgo.
- I backed up Arthur's copy of RefSeq Bacteria/Archaea and removed the test sequences from the original database. I still need to find a way to remove the sequences from the BLAST database, and then I'm ready to go.
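
The file-splitting idea above can be sketched as follows. This is a minimal illustration, not the script actually used: it assumes the query file is plain FASTA, and the function names and the `chunk_NN.fa` naming scheme are hypothetical. Records are dealt out round-robin so the chunks end up nearly equal in size, and because each read is scored independently, the order in which reads land in chunks should not matter.

```python
import os


def read_fasta(path):
    """Yield (header_line, sequence_lines) for each record in a FASTA file."""
    header, lines = None, []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                if header is not None:
                    yield header, lines
                header, lines = line, []
            elif header is not None:
                lines.append(line)
    if header is not None:
        yield header, lines


def split_fasta(path, n_chunks, out_dir):
    """Deal records round-robin into n_chunks files and return their paths."""
    os.makedirs(out_dir, exist_ok=True)
    # Hypothetical naming scheme: chunk_00.fa, chunk_01.fa, ...
    paths = [os.path.join(out_dir, f"chunk_{i:02d}.fa") for i in range(n_chunks)]
    outs = [open(p, "w") for p in paths]
    try:
        for i, (header, lines) in enumerate(read_fasta(path)):
            out = outs[i % n_chunks]
            out.write(header)
            out.writelines(lines)
    finally:
        for out in outs:
            out.close()
    return paths
```

Calling `split_fasta("reads.fa", 10, "chunks")` (file names are placeholders) would produce ten files that can each be given to a separate Phymm process.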

Forgot to update these weeks (June 19 & 26, 2009), so this is from memory
I moved all my stuff to /fs/szattic-asmg4/tgibbons/
- Hopefully they get the backups set up soon...
Arthur and I got phymmBL to run on those files I mentioned last time
I concatenated the output files back together using merge sort
- [command]
I wrote a post-processing script for the phymm output called phymm2minimus.pl
- The script prints to STDOUT the completeness of the classifications at each taxonomic level
- Version 3 of the script then iterates through the original script, creating bins for each taxonomic level
- The scripts and test files are located in /fs/szattic-asmg4/tgibbons/phymm/
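
To make the post-processing idea concrete, here is a small sketch of computing per-level completeness and building bins. It is not phymm2minimus.pl (which is a Perl script whose internals are not shown here), and it does not parse real Phymm output: the record layout (a dict with a `read_id` key and one key per taxonomic level) and the list of levels are assumptions chosen purely for illustration.

```python
from collections import defaultdict

# ASSUMED set of levels; the real script may use different ones.
LEVELS = ["phylum", "class", "order", "family", "genus", "species"]


def completeness(records, levels=LEVELS):
    """Fraction of reads with a non-empty label at each taxonomic level."""
    total = len(records)
    return {
        level: (sum(1 for r in records if r.get(level)) / total if total else 0.0)
        for level in levels
    }


def bin_reads(records, level):
    """Group read IDs by their label at one level; unlabeled reads are skipped."""
    bins = defaultdict(list)
    for record in records:
        label = record.get(level)
        if label:
            bins[label].append(record["read_id"])
    return dict(bins)
```

With records in this assumed form, `completeness(records)` gives one fraction per level and `bin_reads(records, "genus")` gives one list of read IDs per genus, which is roughly the kind of binning needed before handing reads to an assembler.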

I'm back to reading about metabolic network analysis and am finally getting some tractable ideas
- I'm trying to create a data structure that will store the matrix
  - [notes soon to follow]
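
Until those notes exist, one possible shape for such a structure is sketched below. It assumes "the matrix" means a stoichiometric matrix (metabolites as rows, reactions as columns) of the kind used in metabolic network analysis; that reading, the class name, and the method names are all assumptions. Since each reaction touches only a few metabolites, the sketch stores only nonzero coefficients in nested dicts rather than a dense array.

```python
from collections import defaultdict


class StoichiometricMatrix:
    """Sparse metabolite-by-reaction matrix stored as nested dicts (a sketch)."""

    def __init__(self):
        # reaction name -> {metabolite name: coefficient}
        self._columns = defaultdict(dict)
        self._metabolites = set()

    def add_reaction(self, name, coefficients):
        """Add a column; negative coefficients are consumed, positive produced."""
        column = self._columns[name]
        for metabolite, coeff in coefficients.items():
            self._metabolites.add(metabolite)
            if coeff != 0:
                column[metabolite] = coeff

    def get(self, metabolite, reaction):
        """Coefficient of a metabolite in a reaction (0 if absent)."""
        return self._columns.get(reaction, {}).get(metabolite, 0)

    def to_dense(self):
        """Return (row names, column names, dense list-of-lists matrix)."""
        rows = sorted(self._metabolites)
        cols = sorted(self._columns)
        return rows, cols, [[self.get(m, r) for r in cols] for m in rows]


if __name__ == "__main__":
    s = StoichiometricMatrix()
    # Hexokinase: glucose + ATP -> G6P + ADP
    s.add_reaction("HEX1", {"glucose": -1, "ATP": -1, "G6P": 1, "ADP": 1})
    print(s.get("ATP", "HEX1"))  # -1
```

The dense form from `to_dense()` is convenient for small test networks, while the dict form keeps memory proportional to the number of nonzero entries.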
