September 2009

From Cbcb

Latest revision as of 23:54, 17 September 2009

September 17, 2009 (Thursday)

Compute the Number of k-edge Subgraphs

The problem I'm working on is to calculate or estimate the number of k-edge subgraphs in a larger graph; this count is needed for the Bonferroni correction in my multiple hypothesis testing problem. I googled several keywords (e.g. "counting subgraph") to find existing methods, and most of the previous studies are related to network motif finding. In brief, I could not find an efficient algorithm or tool that solves my problem.

In graph motif finding, two approaches are generally used to find over- or under-represented motifs: (1) Explicitly enumerate all k-edge subgraphs, then count their frequencies to see which motifs are abundant or depleted. This approach is not applicable to large motifs because of the computation time. (2) Sample subgraphs (see Kashtan N, Itzkovitz S, Milo R, Alon U: Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics 2004, 20(11):1746-1758). Basically, they randomly sample some subgraphs and then estimate the subgraph concentrations. Hence, this method only estimates the relative abundance of subgraphs, not their absolute frequencies. In addition, it only applies to the detection of motifs with fewer than 8 vertices.
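For context, here is how the subgraph count would enter the correction. This is only a sketch, under the assumption that each k-edge subgraph is tested as one hypothesis; the symbols $N_k$ (number of k-edge subgraphs) and $\alpha$ (desired family-wise significance level) are introduced here for illustration and do not appear in the notes above.

```latex
% Bonferroni correction with N_k tests (assumption: one test per k-edge subgraph)
\alpha_{\text{per test}} = \frac{\alpha}{N_k},
\qquad
\text{reject } H_i \iff p_i \le \frac{\alpha}{N_k}
```

Under this reading, overestimating $N_k$ only makes the correction more conservative, which is why an approximate count may be acceptable.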

My problem is a little simpler than the motif finding problem, because I'm not trying to find explicit motifs, only the number of k-edge subgraphs. I tried two approaches (unfortunately, both failed): (1) Sampling. Suppose the network has n edges and we want the number of connected k-edge subgraphs. I randomly sample k edges (they need not be connected) and check whether they form a connected subgraph. If I sample 1000 times and the k edges are connected in 3 of the samples, then approximately 0.3% of all possible k-edge sets are connected. Consequently, the total number of k-edge subgraphs can be approximated by 0.3% × (n choose k), where n choose k is the number of all possible combinations of k edges. However, n choose k is huge and typically only a small fraction of those combinations is connected, so most samples contain no hit and a very large number of samples is needed to get a good approximation. (2) Explicitly enumerate all k-edge subgraphs. Start from a node n1, expand to its neighbors n2, then expand to the neighbors of n1 and n2, and so on up to the (k-1)th neighbor. This approach has to keep track of all subgraphs found so far, because the same k-edge subgraph can be reached in different ways (multiple times). Again, this process is very expensive for large subgraphs (it takes more than an hour for k=6).
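The two approaches can be sketched in Python as follows. This is a minimal illustration, not the code I actually ran: the function names, the edge-list representation (pairs of node ids, simple undirected graph, no duplicate edges) and the toy graph are all assumptions made for the example. The enumeration grows subgraphs one edge at a time rather than node by node, and uses a set of frozensets to discard subgraphs that are reached more than once.

```python
import math
import random


def is_connected(edge_subset):
    """Return True if the given edges form one connected subgraph."""
    adjacency = {}
    for u, v in edge_subset:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    start = next(iter(adjacency))
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    # Every node here is an endpoint of a sampled edge, so reaching all
    # nodes means all sampled edges are in one component.
    return len(seen) == len(adjacency)


def estimate_by_sampling(edges, k, trials, rng):
    """Approach (1): fraction of connected random k-edge sets times C(n, k)."""
    hits = sum(is_connected(rng.sample(edges, k)) for _ in range(trials))
    return hits / trials * math.comb(len(edges), k)


def count_by_enumeration(edges, k):
    """Approach (2): grow connected subgraphs edge by edge, deduplicating."""
    edges = [tuple(sorted(e)) for e in edges]
    incident = {}
    for e in edges:
        for node in e:
            incident.setdefault(node, []).append(e)
    # Start with every single edge as a 1-edge subgraph.
    current = {frozenset([e]) for e in edges}
    for _ in range(k - 1):
        grown = set()
        for sub in current:
            nodes = {node for e in sub for node in e}
            for node in nodes:
                for e in incident[node]:
                    if e not in sub:
                        # The set drops subgraphs found via different orders.
                        grown.add(sub | {e})
        current = grown
    return len(current)


if __name__ == "__main__":
    # Hypothetical toy graph: a square 0-1-2-3 with one diagonal 0-2.
    toy_edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
    rng = random.Random(0)
    print("exact:", count_by_enumeration(toy_edges, 2))
    print("sampled:", estimate_by_sampling(toy_edges, 2, 10000, rng))
```

The sampling estimator is a scaled binomial proportion, so with hit probability p and T samples its relative standard error is about sqrt((1 - p) / (p T)). For p = 0.3% and T = 1000 that is roughly 58%, which quantifies why so many samples are needed. The enumeration stores every intermediate subgraph, so its memory and time grow with the number of connected subgraphs of each size up to k.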

Some Interesting Visualizations from Class

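1. [http://manyeyes.alphaworks.ibm.com/manyeyes/visualizations/want-to-be-rich-get-a-doctorate-de Want to be rich? Get a Doctorate Degree First!]

2. [http://manyeyes.alphaworks.ibm.com/manyeyes/visualizations/comparison-of-federal-contracts-pe/comments/28cf13cca26411deb146000255111976 VA, TX, and CA take in high percentage of federal contracts]

3. [http://manyeyes.alphaworks.ibm.com/manyeyes/visualizations/number-of-records-in-genbank-datab Spain Ranks Third Following US and China on Contributing GenBank Database]

4. [http://manyeyes.alphaworks.ibm.com/manyeyes/visualizations/women-obtaining-bachelors-degrees- Women Obtaining Bachelor's Degrees Outnumber Men Significantly]

5. [http://manyeyes.alphaworks.ibm.com/manyeyes/visualizations/medicare-benefits-wealthy-most Medicare Benefits Wealthy Most.]

6. [http://manyeyes.alphaworks.ibm.com/manyeyes/visualizations/what-is-the-most-common-female-nam What is the most common female name in US?]

7. [http://manyeyes.alphaworks.ibm.com/manyeyes/visualizations/1a403aaea2e411de857a000255111976/comments/489f598ea2e411de9e1f000255111976 Where are rich people?]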