Title: Modularity and Community Structure in Networks*
1Modularity and Community Structure in Networks
- Final project
- Based on a paper by M.E.J Newman in PNAS 2006
2Introduction
3Networks
- A network is represented by a graph G(V,E): V nodes, E edges (links between node pairs)
- Examples of real-life networks
  - social networks (V = people)
  - World Wide Web (V = webpages)
  - protein-protein interaction networks (V = proteins)
4Protein-protein Interaction Networks
- Nodes: proteins (6K); edges: interactions (15K).
- Reflect the cell's machinery and signaling pathways.
5Communities (clusters) in a network
- A community (cluster) is a densely connected
group of vertices, with only sparser connections
to other groups.
6Searching for communities in a network
- There are numerous algorithms with different "target functions"
  - "Homogeneity": densely connected clusters
  - "Separation": graph partitioning, min-cut approach
- Clustering is important for understanding the structure of the network
  - Provides an overview of the network
7Distilling Modules from Networks
Motivation: identifying protein complexes responsible for certain functions in the cell
8Newman's network division algorithm
9Important features of Newman's clustering
algorithm
- The number and size of the clusters are determined by the algorithm
- Attempts to find a division that maximizes a modularity score Q
  - heuristic algorithm
- Notifies when the network is non-modular
10Modularity of a division (Q)
$Q$ = (edges within groups) - E(edges within groups in a RANDOM graph with the same node degrees)
Trivial division: all vertices in one group $\Rightarrow$ $Q$(trivial division) $= 0$
- $k_i$: degree of node $i$
- $M = \sum_i k_i = 2|E|$
- $A_{ij} = 1$ if $(i,j) \in E$, 0 otherwise
- $E_{ij}$: expected number of edges between $i$ and $j$ in a random graph with the same node degrees
Lemma: $E_{ij} \approx k_i k_j / M$
$Q = \sum_{i,j \text{ in the same group}} (A_{ij} - k_i k_j / M)$, where the $A_{ij}$ term counts the edges within groups
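The definition above can be checked on small graphs with a direct double loop. The following is a minimal sketch, assuming a dense 0/1 adjacency matrix in row-major order and an array of group ids; the function name `modularity_score` and its signature are illustrative and not part of the project specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Unnormalized modularity of a division, as defined on the slide:
 *   Q = sum over ordered pairs (i,j) in the same group of (A_ij - k_i*k_j/M).
 * adj is a dense, symmetric 0/1 adjacency matrix (row-major, n*n entries);
 * group[i] is the group id of node i.
 * Returns 0.0 for a graph without edges or if allocation fails. */
double modularity_score(const int *adj, size_t n, const int *group)
{
    double *k, m = 0.0, q = 0.0;
    size_t i, j;

    assert(adj != NULL && group != NULL);
    k = calloc(n ? n : 1, sizeof *k);
    if (k == NULL)
        return 0.0;

    /* k[i] = degree of node i, m = sum of all degrees */
    for (i = 0; i < n; ++i) {
        for (j = 0; j < n; ++j)
            k[i] += adj[i * n + j];
        m += k[i];
    }

    if (m > 0.0) {
        for (i = 0; i < n; ++i)
            for (j = 0; j < n; ++j)
                if (group[i] == group[j])
                    q += adj[i * n + j] - k[i] * k[j] / m;
    }

    free(k);
    return q;
}
```

For two triangles joined by a single edge, the division into the two triangles scores 12 - 7 = 5, while the trivial division scores 0, in line with the slide.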
11Algorithm 1 Division into two groups(1)
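$Q = \sum_{i,j \text{ in the same group}} (A_{ij} - k_i k_j / M)$
- Suppose we have $n$ vertices $1, \ldots, n$
- $s$: a $\pm 1$ vector of size $n$, representing a 2-division
- $s_i = s_j$ iff $i$ and $j$ are in the same group
- $\frac{1}{2}(s_i s_j + 1) = 1$ if $s_i = s_j$, 0 otherwise
- $\Rightarrow Q = \frac{1}{2} \sum_{i,j} (A_{ij} - k_i k_j / M)(s_i s_j + 1)$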
12Algorithm 1 Division into two groups (2)
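Since $\sum_{i,j} (A_{ij} - k_i k_j / M) = 0$:
$Q = \frac{1}{2} \sum_{i,j} B_{ij} s_i s_j = \frac{1}{2} s^T B s$, where $B_{ij} = A_{ij} - k_i k_j / M$
$B$ is the modularity matrix:
- symmetric
- every row sums to 0, so 0 is an eigenvalue of $B$ (with eigenvector $(1, \ldots, 1)$)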
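A small sketch of building the modularity matrix $B_{ij} = A_{ij} - k_i k_j / M$ from a dense adjacency matrix may help when validating the row-sum property. The name `build_modularity_matrix` and the dense row-major layout are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Fills b (n*n doubles, row-major) with B_ij = A_ij - k_i*k_j/M,
 * where k_i is the degree of node i and M is the sum of all degrees.
 * Returns 0 on success, -1 if the graph has no edges or allocation fails. */
int build_modularity_matrix(const int *adj, size_t n, double *b)
{
    double *k, m = 0.0;
    size_t i, j;

    assert(adj != NULL && b != NULL);
    k = calloc(n ? n : 1, sizeof *k);
    if (k == NULL)
        return -1;

    for (i = 0; i < n; ++i) {
        for (j = 0; j < n; ++j)
            k[i] += adj[i * n + j];
        m += k[i];
    }
    if (m == 0.0) {            /* B is undefined without edges */
        free(k);
        return -1;
    }

    for (i = 0; i < n; ++i)
        for (j = 0; j < n; ++j)
            b[i * n + j] = adj[i * n + j] - k[i] * k[j] / m;

    free(k);
    return 0;
}
```

For the path graph 0-1-2 (degrees 1, 2, 1 and M = 4) the first row of B is (-0.25, 0.5, -0.25), which sums to 0 as stated above.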
13Modularity matrix example
14Algorithm 1 Division into two groups (3)
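$B$ is symmetric $\Rightarrow$ $B$ is diagonalizable (real eigenvalues)
- $u_1, \ldots, u_n$: $B$'s corresponding eigenvectors (orthonormal)
- $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n$: $B$'s eigenvalues
- $B u_i = \lambda_i u_i$
- $n = \|s\|^2 = \sum_i a_i^2$, where $a_i = u_i^T s$
- Which vector $s$ maximizes $Q$?
- Clearly an $s$ proportional to $u_1$ maximizes $Q$, but $u_1$ may not be a $\pm 1$ vector
- Greedy heuristic: choose $s$ close to $u_1$: $s_i = 1$ if $(u_1)_i > 0$, $s_i = -1$ otherwise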
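The greedy rounding step and the resulting score $\frac{1}{2} s^T B s$ can be sketched as follows. The function names are hypothetical, the matrix is assumed dense and row-major, and the positivity test reuses the `IS_POSITIVE` threshold that the deck defines in its implementation slides.

```c
#include <assert.h>
#include <stddef.h>

#define IS_POSITIVE(X) ((X) > 0.00001)

/* Greedy rounding of the leading eigenvector:
 * s[i] = 1 if u1[i] is positive, -1 otherwise. */
void sign_division(const double *u1, size_t n, int *s)
{
    size_t i;

    assert(u1 != NULL && s != NULL);
    for (i = 0; i < n; ++i)
        s[i] = IS_POSITIVE(u1[i]) ? 1 : -1;
}

/* Score of the 2-division s: 1/2 * s^T B s for a dense row-major B. */
double division_score(const double *b, const int *s, size_t n)
{
    double q = 0.0;
    size_t i, j;

    assert(b != NULL && s != NULL);
    for (i = 0; i < n; ++i)
        for (j = 0; j < n; ++j)
            q += b[i * n + j] * s[i] * s[j];
    return 0.5 * q;
}
```

If every entry of `s` ends up with the same sign, the score is 0 (rows of B sum to 0), which corresponds to the trivial division.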
16Example a 2-division of a social network
[Figure: the karate club network, with the known group leaders marked]
Color matches the entries of the eigenvector $u_1$: light = positive entry ($s_i = 1$), dark = negative entry ($s_i = -1$)
A network showing relationships between people in
a karate club which eventually split into 2. The
division algorithm predicts exactly the two
groups after the split
17Dividing into more than 2(1)
- How do we divide into more than 2 groups?
- Idea: apply the algorithm recursively on every group.
- Indicator term: 1 iff $i$ and $j$ are in the same group, 0 otherwise
- Only the $(i,j)$ pairs inside the divided group need to be updated in $Q$
18Dividing into more than 2(2)
- $g$: a group of $n_g$ vertices
- $s$: a $\pm 1$ vector of size $n_g$
- Compute $\Delta Q$ for a 2-division of $g$
19Dividing into more than 2(3)
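$B[g]$: the submatrix of $B$ defined by $g$
$\Delta Q = \frac{1}{2} s^T \hat{B}[g] s$, where $\hat{B}[g]_{ij} = B[g]_{ij} - \delta_{ij} f_i(g)$
$f_i(g)$: sum of the $i$th row of $B[g]$; $f_i(\{1, \ldots, n\}) = 0$
$\hat{B}[g]$: the generalized modularity matrix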
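As a sketch of the construction described above, the generalized modularity matrix of a group can be derived from a dense copy of B by extracting the submatrix and subtracting each row sum from the diagonal. The name `generalized_modularity_matrix`, the index array `g` and the dense layout are assumptions for illustration; a dense version like this is mainly useful for validating an optimized implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Builds the generalized modularity matrix of group g into bg
 * (ng*ng doubles, row-major) from the full n*n matrix b:
 *   bg[i][j] = b[g[i]][g[j]] - (i == j ? f_i : 0),
 * where f_i is the sum of row i of the submatrix B[g].
 * g holds the ng node indices that form the group. */
void generalized_modularity_matrix(const double *b, size_t n,
                                   const size_t *g, size_t ng, double *bg)
{
    size_t i, j;

    assert(b != NULL && g != NULL && bg != NULL);
    for (i = 0; i < ng; ++i) {
        double f = 0.0;

        assert(g[i] < n);
        for (j = 0; j < ng; ++j) {
            bg[i * ng + j] = b[g[i] * n + g[j]];
            f += bg[i * ng + j];
        }
        bg[i * ng + i] -= f;   /* subtract the row sum on the diagonal */
    }
}
```

By construction every row of the result sums to 0, and when `g` contains all n nodes the result equals B itself because all f_i are 0.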
20Generalized modularity matrix example
$g = \{1, 4, 5\}$ (1 is the minimal index)
21A "generalized" 2-division algorithm (divides a
group in a network)
23Further techniques for modularity maximization
- (Combined with Newman's "generalized" 2-division algorithm)
24A heuristic for 2-division
1. $g_1, g_2$: an initial 2-division of $g$
2. While there is an unmoved node:
   1. Let $v$ be an unmoved node whose move between $g_1$ and $g_2$ maximizes $\Delta Q$
   2. Move $v$ between $g_1$ and $g_2$
3. From the $n_g$ 2-divisions generated in the previous step, let $g_1, g_2$ be the one with maximum $\Delta Q$
4. If $\Delta Q > 0$, go to 1
Note: the last iteration of step 2 produces a 2-division which equals the initial 2-division
25Computing $\Delta Q$ for each node
Choosing $j'$ with maximum $\Delta Q$
Moving $j'$ and storing its $\Delta Q$
2. While there is an unmoved node:
   1. Let $v$ be an unmoved node whose move between $g_1$ and $g_2$ maximizes $\Delta Q$
   2. Move $v$ between $g_1$ and $g_2$
26Algorithm 4 -cont.
3. From the $n_g$ 2-divisions generated in the previous step, let $g_1, g_2$ be the one with maximum $\Delta Q$
4. If $\Delta Q > 0$, go to 1
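The four steps of the improvement heuristic can be sketched directly, at the cost of efficiency. The version below is deliberately naive: it recomputes the score $\frac{1}{2} s^T \hat{B} s$ from scratch for every candidate move instead of using the fast updates the deck describes later, so it is only suitable for toy inputs and for cross-checking a faster implementation. The name `improve_division`, the dense matrix argument and the $\pm 1$ encoding of $g_1, g_2$ are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define IS_POSITIVE(X) ((X) > 0.00001)

/* Score of a 2-division: 1/2 * s^T B s (dense row-major B). */
static double half_quadratic(const double *b, const int *s, size_t n)
{
    double q = 0.0;
    size_t i, j;

    for (i = 0; i < n; ++i)
        for (j = 0; j < n; ++j)
            q += b[i * n + j] * s[i] * s[j];
    return 0.5 * q;
}

/* Improves the 2-division s (entries +1 / -1) in place.
 * Returns the total score gain, or 0.0 if allocation fails. */
double improve_division(const double *b, size_t n, int *s)
{
    int *moved, *best;
    double total = 0.0, gain;
    size_t i, step;

    assert(b != NULL && s != NULL);
    moved = malloc((n ? n : 1) * sizeof *moved);
    best = malloc((n ? n : 1) * sizeof *best);
    if (moved == NULL || best == NULL) {
        free(moved);
        free(best);
        return 0.0;
    }

    do {
        double base = half_quadratic(b, s, n), best_q = base;

        for (i = 0; i < n; ++i) {            /* step 1: nothing moved yet */
            moved[i] = 0;
            best[i] = s[i];
        }
        for (step = 0; step < n; ++step) {   /* step 2: move each node once */
            size_t v = n;
            double v_q = 0.0;

            for (i = 0; i < n; ++i) {
                double q;

                if (moved[i])
                    continue;
                s[i] = -s[i];                /* try the move */
                q = half_quadratic(b, s, n);
                s[i] = -s[i];                /* undo it */
                if (v == n || q > v_q) {
                    v = i;
                    v_q = q;
                }
            }
            s[v] = -s[v];
            moved[v] = 1;
            if (v_q > best_q) {              /* step 3: remember the best */
                best_q = v_q;
                for (i = 0; i < n; ++i)
                    best[i] = s[i];
            }
        }
        for (i = 0; i < n; ++i)
            s[i] = best[i];
        gain = best_q - base;
        total += gain;
    } while (IS_POSITIVE(gain));             /* step 4: repeat while improving */

    free(moved);
    free(best);
    return total;
}
```

Starting from a division of the two-triangle graph with one misplaced node, a single pass moves that node back and a second pass finds no further gain, so the loop stops.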
27Finding the leading eigen-pair
28The Power Method (1)
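- $A$: a diagonalizable matrix
- Let $(\lambda_1, V_1), \ldots, (\lambda_n, V_n)$ be $n$ eigenpairs of $A$, where $|\lambda_1| > |\lambda_2| \geq |\lambda_3| \geq \ldots \geq |\lambda_n|$
- The power method finds the dominant eigenpair of $A$, i.e. $(\lambda_1, V_1)$ (note that $\lambda_1$ is not necessarily the leading eigenvalue)
- $X_0$: any vector
- $\Rightarrow X_0 = c_1 V_1 + \ldots + c_n V_n$, where $c_i = X_0 \cdot V_i$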
29The Power Method (2)
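- $X_1 = A X_0 = A(c_1 V_1 + \ldots + c_n V_n) = c_1 A V_1 + \ldots + c_n A V_n = c_1 \lambda_1 V_1 + \ldots + c_n \lambda_n V_n$
- $X_2 = A^2 X_0 = A X_1 = A(c_1 \lambda_1 V_1 + \ldots + c_n \lambda_n V_n) = c_1 \lambda_1^2 V_1 + \ldots + c_n \lambda_n^2 V_n$
- ...
- $X_m = A^m X_0 = A X_{m-1} = A(c_1 \lambda_1^{m-1} V_1 + \ldots + c_n \lambda_n^{m-1} V_n) = c_1 \lambda_1^m V_1 + \ldots + c_n \lambda_n^m V_n$
- If $m$ is large enough $\Rightarrow X_m \approx c_1 \lambda_1^m V_1$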
30Power Method (3)
$X_m = A X_{m-1} = A^m X_0$
- Suppose $V_1 \cdot Y \neq 0$. For $m$ large enough: $\lambda_1 \approx \frac{X_{m+1} \cdot Y}{X_m \cdot Y}$
31Power method - Example
We perform only matrix-vector multiplications!
Convergence usually occurs within O(n) iterations
32Power method convergence condition
The desired precision
To avoid numerical problems due to large numbers, normalize $X_i$ before computing $X_{i+1} = A X_i$:
$X_0 = X / \|X\|$, $X_1 = A X_0 / \|A X_0\|$, $X_2 = A X_1 / \|A X_1\|$, ...
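A normalized power iteration might look like the sketch below. Several choices here are assumptions rather than the deck's specification: the matrix is dense and row-major, the vector is normalized by its largest absolute entry (any norm works for this purpose, and this one avoids the math library), the eigenvalue is estimated with the Rayleigh quotient, and iteration stops when two successive estimates differ by less than `eps`. The name `power_method` and its return codes are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* Power iteration on a dense row-major n*n matrix a.
 * x holds the start vector on entry and the eigenvector estimate on exit.
 * Returns 0 if converged, 1 if max_iter was reached or a zero vector
 * appeared, -1 if allocation fails. */
int power_method(const double *a, size_t n, double *x,
                 double eps, int max_iter, double *lambda_out)
{
    double *y, lambda = 0.0;
    int iter, status = 1;

    assert(a != NULL && x != NULL && lambda_out != NULL);
    y = malloc((n ? n : 1) * sizeof *y);
    if (y == NULL)
        return -1;

    for (iter = 0; iter < max_iter; ++iter) {
        double xx = 0.0, xy = 0.0, norm = 0.0, next;
        size_t i, j;

        for (i = 0; i < n; ++i) {          /* y = A x */
            y[i] = 0.0;
            for (j = 0; j < n; ++j)
                y[i] += a[i * n + j] * x[j];
            xx += x[i] * x[i];
            xy += x[i] * y[i];
            if (abs_d(y[i]) > norm)
                norm = abs_d(y[i]);
        }
        if (xx == 0.0 || norm == 0.0)
            break;                         /* x or A x is the zero vector */

        next = xy / xx;                    /* Rayleigh quotient estimate */
        for (i = 0; i < n; ++i)
            x[i] = y[i] / norm;            /* normalize before the next step */

        if (iter > 0 && abs_d(next - lambda) < eps) {
            lambda = next;
            status = 0;
            break;
        }
        lambda = next;
    }

    free(y);
    *lambda_out = lambda;
    return status;
}
```

For the matrix with rows (2, 1) and (1, 2), starting from (1, 0), the estimates approach the dominant eigenvalue 3 and the vector approaches a multiple of (1, 1).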
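33Finding the leading eigenpair using matrix shifting
- Let $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n$ be the eigenvalues of $A$, and $U_1, \ldots, U_n$ their corresponding eigenvectors
- Let $\|A\|_1$ be the L1-norm of $A$; then $\|A\|_1 \geq \max_i |\lambda_i|$ (exercise)
- Q: What is the dominant eigenpair of $A + \|A\|_1 I$?
- A: $(\lambda_1 + \|A\|_1, U_1)$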
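The shifting step itself is short. The sketch below assumes a dense row-major matrix and takes the L1-norm to be the maximum absolute column sum, which is the usual matrix 1-norm; the names `l1_norm` and `shift_diagonal` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Matrix 1-norm: the maximum over columns of the sum of absolute entries. */
double l1_norm(const double *a, size_t n)
{
    double best = 0.0;
    size_t i, j;

    assert(a != NULL);
    for (j = 0; j < n; ++j) {
        double col = 0.0;

        for (i = 0; i < n; ++i) {
            double v = a[i * n + j];
            col += v < 0.0 ? -v : v;
        }
        if (col > best)
            best = col;
    }
    return best;
}

/* In-place shift: A <- A + shift * I. */
void shift_diagonal(double *a, size_t n, double shift)
{
    size_t i;

    assert(a != NULL);
    for (i = 0; i < n; ++i)
        a[i * n + i] += shift;
}
```

After running the power method on the shifted matrix, subtracting the same shift from the dominant eigenvalue it reports gives the leading eigenvalue of the original matrix; the eigenvector is unchanged.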
34Implementation
- Robustness and Efficiency
35Checking "positiveness"
- #define IS_POSITIVE(X) ((X) > 0.00001)
- Instead of "x > 0", use IS_POSITIVE(x)
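36Efficient multiplications in the (extended) modularity matrix: $O(n)$ instead of $O(n^2)$
$(\hat{B}[g] x)_i = \sum_j A[g]_{ij} x_j - k_i \frac{k[g] \cdot x}{M} - f_i(g) x_i$, plus $\|\hat{B}[g]\|_1 x_i$ when "matrix shifting" is used
- $A[g] x$: multiplication in a sparse matrix
- $k[g] \cdot x$: inner product
- $f_i(g) x_i$: one multiplication per entry
- $\|\hat{B}[g]\|_1 x$: "matrix shifting"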
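The decomposition into a sparse product, an inner product, a per-entry term and an optional shift can be sketched as one routine. The compressed-row arrays (`values`, `colind`, `rowptr` with n+1 entries), the parameter order and the name `modularity_mult` are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Computes y = (Bhat[g] + shift * I) x without forming the dense matrix:
 *   y_i = sum_j A[g]_ij x_j  -  k_i * (k . x) / M  -  f_i * x_i  +  shift * x_i
 * A[g] is given in compressed-row form: row i occupies positions
 * rowptr[i] .. rowptr[i+1]-1 of values/colind (rowptr has n+1 entries).
 * k holds the degrees of the group's nodes in the whole network,
 * m is the sum of all degrees, f holds the row sums f_i(g). */
void modularity_mult(size_t n, const double *values, const int *colind,
                     const int *rowptr, const double *k, double m,
                     const double *f, double shift,
                     const double *x, double *y)
{
    double kx = 0.0;
    size_t i;
    int p;

    assert(values != NULL && colind != NULL && rowptr != NULL);
    assert(k != NULL && f != NULL && x != NULL && y != NULL);
    assert(m > 0.0);

    for (i = 0; i < n; ++i)                 /* inner product k . x */
        kx += k[i] * x[i];

    for (i = 0; i < n; ++i) {
        double ax = 0.0;

        for (p = rowptr[i]; p < rowptr[i + 1]; ++p)   /* sparse row times x */
            ax += values[p] * x[colind[p]];
        y[i] = ax - k[i] * kx / m - f[i] * x[i] + shift * x[i];
    }
}
```

The cost is proportional to n plus the number of stored non-zeros, and a dense B kept in debug mode can be used to confirm that both paths give the same vector.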
37sparse_matrix_arr
typedef struct {
    int n;          /* matrix size */
    elem *values;   /* the non-zero elements, ordered by rows */
    int *colind;    /* column indices */
    int *rowptr;    /* pointers to where rows begin in the values array */
} sparse_matrix_arr;
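A matrix-vector product over this structure only touches the stored non-zeros. The sketch below makes two assumptions that the slide does not state: `elem` is taken to be `double`, and `rowptr` has n+1 entries so that `rowptr[n]` equals the number of non-zeros. The function name `sparse_mult` is illustrative.

```c
#include <assert.h>
#include <stddef.h>

typedef double elem;   /* assumption: the slide does not define elem */

typedef struct {
    int n;          /* matrix size */
    elem *values;   /* the non-zero elements, ordered by rows */
    int *colind;    /* column indices */
    int *rowptr;    /* assumed n+1 entries; rowptr[n] = number of non-zeros */
} sparse_matrix_arr;

/* y = A x, visiting only the stored non-zero elements of A. */
void sparse_mult(const sparse_matrix_arr *a, const elem *x, elem *y)
{
    int i, p;

    assert(a != NULL && x != NULL && y != NULL);
    for (i = 0; i < a->n; ++i) {
        elem sum = 0;

        for (p = a->rowptr[i]; p < a->rowptr[i + 1]; ++p)
            sum += a->values[p] * x[a->colind[p]];
        y[i] = sum;
    }
}
```

An empty row is handled naturally: its two consecutive `rowptr` entries are equal, so the inner loop does not run and the output entry is 0.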
38Fast score computations
Algorithm 4
Computing $\Delta Q$ for each node from scratch $\Rightarrow O(n^2)$
Computing $\Delta Q$ for each node in $O(n)$ before moving the 1st node
Updating the score AFTER a move of a node k (s is
already updated)
39Project specifications
40programs
- sparse_mlpl < matrix_vec.in
- modularity_mat <adj_matrix> <group>
- spectral_div <adj_matrix> <group> <precision>: computing a 2-division (<precision> is for the power method)
- improve_div <adj_matrix> <group> <subgroup>
- cluster <adj_matrix> <precision>: the complete clustering algorithm, including the improvement (<precision> is for the power method)
41Implementation process
- Read and understand the document
- Design ALL programs
  - Data structures
  - Functions used by more than one program
- Check your code
  - "Toy" examples on website: easy to debug
  - Your own created LARGE examples
- Run your code on yeast/fly networks
42Analyzing clusters in yeast and fly
protein-protein interaction networks
Organisms: Saccharomyces cerevisiae (yeast), Drosophila melanogaster (fly)
- Input: true PPI network + 2 random networks
- Task 1: infer the true network
  - Solution: the true network is more modular
- Task 2: compute associated functions (using Cytoscape + BiNGO)
43Cytoscape, BiNGO
- www.cytoscape.com (version 2.5.1)
- A framework for analyzing networks
- Provides visualization of networks and clusters
- http://www.psb.ugent.be/cbd/papers/BiNGO/
- Finding functions associated with gene cluster
- Runs from cytoscape
- Version 2.3 is not suitable for our project!!! (due to a bug) $\Rightarrow$ use version 2.4 (when available) or version 2.0 (available under ozery/public/cytoscape-v2.5.1/plugins/BiNGO.jar).
44BiNGO output (GO Gene Ontology)
45Visualization with cytoscape
46How is the project checked?
- Most checks (points): "BLACK BOX"
  - The common checks in the "real world"
  - Running with fixed input files, comparing to fixed output files
  - Score = (successful checks) / (total checks)
- "WHITE BOX" checks: code review (10 points maximum)
  - code simplicity / efficiency
47A simple data structure for maintaining a division
typedef struct Division_ {
    int n;           /* number of nodes in the network */
    int *group_ids;  /* for each node, its group id (initially 0: all nodes within one group) */
    int numGroups;
    double Q;
} Division;
- Complexity
  - Finding all the elements of a group: O(n)
  - Splitting a group into 2: O(n)
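Splitting a group in O(n) with this structure amounts to one or two passes over the group ids. The sketch below is one possible shape, assuming the field is spelled `group_ids` and that the 2-division is passed as a length-n array of +1 / -1 values that is only consulted for nodes of the group being split; the function `division_split` is hypothetical and leaves `Q` untouched.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Division_ {
    int n;           /* number of nodes in the network */
    int *group_ids;  /* for each node, its group id */
    int numGroups;
    double Q;
} Division;

/* Splits group gid in O(n): nodes of that group with s[i] < 0 receive a
 * new group id, the others keep gid. s has n entries and is ignored for
 * nodes outside gid. Returns the new group's id, or -1 if one side of the
 * split would be empty (the division is left unchanged in that case). */
int division_split(Division *d, int gid, const int *s)
{
    int i, stay = 0, leave = 0;

    assert(d != NULL && d->group_ids != NULL && s != NULL);
    for (i = 0; i < d->n; ++i) {
        if (d->group_ids[i] == gid) {
            if (s[i] < 0)
                ++leave;
            else
                ++stay;
        }
    }
    if (stay == 0 || leave == 0)
        return -1;

    for (i = 0; i < d->n; ++i)
        if (d->group_ids[i] == gid && s[i] < 0)
            d->group_ids[i] = d->numGroups;
    return d->numGroups++;
}
```

Returning -1 for an empty side gives the caller a simple signal that the group is indivisible and should not be split further.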
48Maintaining the generalized modularity matrix
- Should we maintain the modularity matrix?
- No: 1) we do not use it explicitly; 2) it is a dense matrix that consumes a large amount of memory
- Yes: 1) despite its large size, it can be kept in memory; 2) it can simplify code (e.g. deriving B[g] from B, computing the L1-norm); 3) it can be used in validating the correctness of optimized multiplications (debug mode only!)
49Suggestion for modules
- Sparse matrices
  - Data structure: sparse_matrix_lst
  - Reading a sparse matrix (file / stdin)
  - Multiplication by a vector
  - Computing A[g]
  - Methods hiding the inner structure (allows a simple replacement of sparse_matrix_lst with another data structure for holding sparse matrices)
- The generalized modularity matrix
  - Data structure: A[g], k[g], M, f(g), L1-norm
  - Multiplication by a vector
  - Computing Q
  - Printing the modularity matrix
- The spectral algorithm
  - 2-division
  - full-division
- The improvement algorithm
- Group
- Division
50Good luck!