Cbcb:Pop-Lab:Serge-Report
February
February 16th, 2009:
- I have been working on the materials for my preliminary exam. My preliminary paper is available here and the preliminary paper is here.
- I also met with Dr. Alan Sussman, a member of my preliminary committee, to discuss parallel options for graph algorithms. He pointed me to several papers on graph and hypergraph partitioning. I had been thinking of using randomized graph algorithms for speedup, but with efficient partitioning the standard graph algorithms can be distributed while minimizing communication. The other option is to use shared-memory machines, which are more expensive and less available than distributed-memory machines.
- I also implemented two methods for repeat detection in Bambus 2.
- The first splits the graph into strongly connected components using the SeqAn library. The astat is then computed on each strongly connected component separately. The motivation is to subdivide metagenomic samples into their constituent organisms and compute the astat on one organism at a time, to avoid inadvertently marking abundant organisms as repeats.
- The second computes all-pairs shortest paths and counts the number of shortest paths each node lies on. Nodes that lie on a large number of shortest paths (more than 1.5 standard deviations above the mean) are marked as repetitive.
- Using the combination of these two approaches, Bambus 2 correctly identifies the repeats in Brucella_suis_1330 that were previously causing mis-assembly. I also tried the methods on Acid Mine, but SeqAn was crashing.
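
The two repeat-detection steps described above can be sketched roughly as follows. This is only an illustrative Python sketch, not the Bambus 2 code (which uses the C++ SeqAn library). It assumes an unweighted, directed graph stored as a dictionary from node to successor list, and all function names (strongly_connected_components, bfs_counts, shortest_path_load, flag_repeats) are hypothetical. The astat computation itself is left out because the report does not give its formula.

```python
from collections import deque
from statistics import mean, pstdev


def _all_nodes(graph):
    return set(graph) | {v for succ in graph.values() for v in succ}


def strongly_connected_components(graph):
    """Kosaraju's algorithm, iterative; graph maps node -> list of successors."""
    nodes = _all_nodes(graph)
    reverse = {n: [] for n in nodes}
    for u, succ in graph.items():
        for v in succ:
            reverse[v].append(u)
    # Pass 1: record nodes in order of DFS finish time.
    order, seen = [], set()
    for start in sorted(nodes):
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(graph.get(start, [])))]
        while stack:
            node, successors = stack[-1]
            nxt = next((v for v in successors if v not in seen), None)
            if nxt is None:
                order.append(node)
                stack.pop()
            else:
                seen.add(nxt)
                stack.append((nxt, iter(graph.get(nxt, []))))
    # Pass 2: sweep the reversed graph in decreasing finish time.
    components, assigned = [], set()
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        component, stack = [], [root]
        while stack:
            node = stack.pop()
            component.append(node)
            for v in reverse[node]:
                if v not in assigned:
                    assigned.add(v)
                    stack.append(v)
        components.append(sorted(component))
    return components


def bfs_counts(graph, source):
    """Return (distance, number of shortest paths) from source to each reachable node."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
    return dist, sigma


def shortest_path_load(graph):
    """Count, for each node, the shortest s-t paths that pass through it (s != v != t)."""
    nodes = sorted(_all_nodes(graph))
    info = {s: bfs_counts(graph, s) for s in nodes}
    load = {v: 0 for v in nodes}
    for s in nodes:
        dist_s, sigma_s = info[s]
        for t, d_st in dist_s.items():
            if t == s:
                continue
            for v in dist_s:
                if v == s or v == t:
                    continue
                dist_v, sigma_v = info[v]
                if t in dist_v and dist_s[v] + dist_v[t] == d_st:
                    load[v] += sigma_s[v] * sigma_v[t]
    return load


def flag_repeats(load, k=1.5):
    """Flag nodes whose load exceeds mean + k * (population) standard deviation."""
    values = list(load.values())
    cutoff = mean(values) + k * pstdev(values)
    return sorted(v for v, x in load.items() if x > cutoff)
```

In this sketch, one would call strongly_connected_components first, compute the astat per component, and separately pass the output of shortest_path_load to flag_repeats. The quadratic-per-source counting loop is chosen for clarity, not speed; a betweenness-style accumulation would scale better on large scaffold graphs.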
 
- My next steps are to continue writing the preliminary paper, focusing on the proposed work section. I also plan to look at the results on Brucella_suis_1330 in detail to evaluate the correctness of contig placement given my repeat identification.