# September 2009

## September 17, 2009 (Thursday)

### Compute the Number of k-edge Subgraphs

The problem I'm working on is to calculate or estimate the number of n-edge subgraph in a larger graph, which can be used for Bonferroni correction for my multiple hypothesis testing problem. I've tried to google different keywords ("counting subgraph") in order to find previous methods, and most of the previous studies are kind of related with network motif finding. In brief, I could not find efficient algorithms or tools that can solve the problem I have. In graph motif finding, in order to find over- or under-represented motifs, generally two approaches are used: (1) Explicitly enumerate all n-edge subgraph, then count the frequencies, to see which motifs are abundant or depleted. This approaches is not applicable for large motif finding because of computation time. (2) Another approach I noticed is based on sampling (here is the paper Kashtan N, Itzkovitz S, Milo R, Alon U: Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics 2004, 20(11):1746-1758.). Basically, they randomly sample some subgraphs and then estimate the subgraph concentrations. Hence, this method can only estimates the relative abundance of subgraphs, but not the frequencies. In addition, this method also only applies to motif detection of size less than 8 vertices.

My problem is a little bit simpler than the motif finding problem, because I'm not trying to find explicit motifs, but only the number of n-edge subgraphs. I tried two approaches (unfortunately, both failed): (1) Suppose the network has n edges, and we try to find the number of k-edge subgraphs. I randomly sample k edges (do not need to be connected), to see if they are connected with each other. If I sample 1000 times, and in 3 times the k edges are connected, then approximately 0.3% of all possible k edges are connected. Consequently, the total number of k-edge subgraph can be approximated by 0.3% X (n chooose k), where n choose k is the number of all possible combinations of k edges. However, the problem is n choose k is really huge, which means in order to get a good approximation, huge amount of samplings is needed. (2) Explicitly find all k-edge subgraph. Start from a node n1, expand to its neighbors n2, then expand to their (n1 and n2) neighbors until (k-1)th neighbor. This approach requires to keep track of all found subgraphs, because same k-edge subgraph could be found in different ways (multiple times). Again, this process is very expensive for large subgraphs (takes more than a hour for k=6).

### Some Interesting Visualizations from Class

1. Want to be rich? Get a Doctorate Degree First!

2. VA, TX, and CA take in high percentage of federal contracts

3. Spain Ranks Third Following US and China on Contributing GenBank Database

4. Women Obtaining Bachelor's Degrees Outnumber Men Significantly

5. Medicare Benefits Wealthy Most.