Title: Detecting topological patterns in protein networks
1Lecture 4
2- Modules/communities in networks
3What is a module?
- Nodes in a given module (or community, group, or functional unit) tend to connect with other nodes in the same module
- Biology: proteins of the same function (e.g. DNA repair) or sub-cellular localization (e.g. nucleus)
- WWW: websites on a common topic (e.g. physics) or organization (e.g. EPFL)
- Internet: Autonomous systems/routers by geography (e.g. Switzerland) or domain (e.g. educational or military)
4Sometimes easy to discover
5Sometimes hard
6Hierarchical clustering
- Calculate the similarity weight W_ij for all pairs of vertices (e.g. the number of independent paths between i and j)
- Start with all n vertices disconnected
- Add edges between pairs one by one in order of decreasing weight
- Result: nested components, where one can take a slice at any level of the tree
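The procedure above can be sketched with a union-find structure. This is a minimal illustration, not the lecturer's code: the function names and the toy weights are hypothetical, and the weights are simply given rather than computed from independent paths.

```python
def hierarchical_clustering(n, weights):
    """Return the merge history obtained by adding edges in decreasing weight.

    n       -- number of vertices, labelled 0..n-1
    weights -- dict mapping a pair (i, j) to its similarity weight W_ij
    """
    parent = list(range(n))  # union-find forest: every vertex starts alone

    def find(x):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    merges = []
    for (i, j), w in sorted(weights.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri != rj:          # this edge joins two separate components
            parent[ri] = rj
            merges.append((w, i, j))
    return merges


def components_at(n, merges, threshold):
    """Slice the tree: components formed by merges with weight >= threshold."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    for w, i, j in merges:
        if w >= threshold:
            parent[find(i)] = find(j)
    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())


# Hypothetical toy weights: two tight pairs joined by weak links.
W = {(0, 1): 0.9, (2, 3): 0.8, (1, 2): 0.2, (0, 3): 0.1}
merges = hierarchical_clustering(4, W)
print(components_at(4, merges, threshold=0.5))   # two modules
print(components_at(4, merges, threshold=0.15))  # one component
```

Lowering the threshold moves the slice towards the root of the tree, so the modules found at a high threshold are nested inside those found at a lower one.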
7Girvan-Newman (2002) betweenness clustering
- Betweenness of an edge i -- j is the number of shortest paths going through this edge
- Algorithm:
- compute the betweenness of all edges
- remove the edge with the highest betweenness
- recalculate betweenness
- Caveats:
- Betweenness needs to be recalculated at each step
- very expensive: all-pairs shortest paths cost O(N^3)
- may need to repeat up to N times
- does not scale to more than a few hundred nodes, even with the fastest algorithms
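A minimal sketch of one Girvan-Newman splitting step on an unweighted, undirected graph, where the edge with the highest betweenness is the one removed. The betweenness routine follows the breadth-first accumulation scheme of Brandes; that choice, the function names, and the toy graph are assumptions made for illustration. The graph is assumed to have no self-loops.

```python
from collections import deque


def edge_betweenness(adj):
    """Number of shortest paths through each edge (adj: node -> set of neighbours)."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = {v: 0.0 for v in adj}
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path fractions.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    # Every pair (s, t) was counted once from each end.
    return {e: b / 2.0 for e, b in score.items()}


def components(adj):
    seen, parts = set(), []
    for s in adj:
        if s in seen:
            continue
        part, stack = [], [s]
        seen.add(s)
        while stack:
            v = stack.pop()
            part.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        parts.append(sorted(part))
    return sorted(parts)


def girvan_newman_split(adj):
    """Remove highest-betweenness edges until the number of components grows."""
    adj = {v: set(nb) for v, nb in adj.items()}   # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start and any(adj.values()):
        bet = edge_betweenness(adj)               # recalculated at every step
        u, v = max(bet, key=bet.get)              # edge with the highest score
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Hypothetical toy graph: two triangles joined by the bridge 2 -- 3.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(girvan_newman_split(graph))
```

Applying the split repeatedly to each resulting component would produce the full hierarchy, at the recomputation cost listed in the caveats.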
8- Using random walks/diffusion to discover modules in networks
K. Eriksen, I. Simonsen, S. Maslov, K. Sneppen, PRL 90, 148701 (2003)
9Why diffusion?
- Any dynamical process would equilibrate faster on modules and slower between modules
- Thus its slow modes reveal modules
- Diffusion is the simplest dynamical process (people also use others, like the Ising/Potts model, etc.)
10Random walkers on a network
- Study the behavior of many VIRTUAL random walkers on a network
- At each time step each random walker steps onto a randomly selected neighbor
- They equilibrate to a steady state n_i ∝ k_i (in solid state physics n_i = const)
- Slow modes of equilibration to the steady state allow one to detect modules in a network
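A small sketch of the walker dynamics, evolving the expected number of walkers per node rather than simulating individual walkers. The toy network is hypothetical. It is chosen to be connected and non-bipartite so that the density settles down, and in the steady state the number of walkers on a node is proportional to its degree k_i.

```python
def step(adj, n):
    """Advance the walker density by one time step.

    adj -- dict: node -> list of neighbours (undirected network)
    n   -- dict: node -> current (expected) number of walkers on that node
    """
    new = {v: 0.0 for v in adj}
    for j, walkers in n.items():
        share = walkers / len(adj[j])   # walkers split evenly among neighbours
        for i in adj[j]:
            new[i] += share
    return new


# Hypothetical toy network: a triangle 0-1-2 with a pendant node 3 attached to 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
n = {v: 1.0 for v in adj}               # start with one walker per node
for _ in range(200):
    n = step(adj, n)

# In the steady state n_i / k_i is the same for every node.
ratios = {v: n[v] / len(adj[v]) for v in adj}
print(ratios)
```

How quickly the ratios become equal is set by the slow modes of this update, which is the property used to detect modules.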
11Matrix formalism
12Eigenvectors of the transfer matrix T_ij
13Similarity transformation
- Matrix T_ij is asymmetric, so it could in principle have complex eigenvalues/eigenvectors
- Luckily, S_ij = 1/(√K_i √K_j) (for connected i and j) has the same eigenvalues, and its eigenvectors are v_i/√K_i
- This is known as a similarity transformation
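A quick numerical check of the similarity transformation on a toy network. The slide defining the transfer matrix did not survive extraction, so this sketch assumes T_ij = A_ij/K_j (a walker on node j moves to each neighbour i with probability 1/K_j), with A the adjacency matrix. The toy network and variable names are hypothetical.

```python
import math

# Hypothetical toy network: triangle 0-1-2 with a pendant node 3 attached to 2.
A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
N = len(A)
K = [sum(row) for row in A]                      # node degrees

# Assumed transfer matrix T_ij = A_ij / K_j and its symmetric partner S_ij.
T = [[A[i][j] / K[j] for j in range(N)] for i in range(N)]
S = [[A[i][j] / math.sqrt(K[i] * K[j]) for j in range(N)] for i in range(N)]


def matvec(M, x):
    return [sum(M[i][j] * x[j] for j in range(N)) for i in range(N)]


# S equals D^(-1/2) T D^(1/2) with D = diag(K), checked element by element.
max_diff = max(abs(S[i][j] - T[i][j] * math.sqrt(K[j]) / math.sqrt(K[i]))
               for i in range(N) for j in range(N))

# The steady state v_i = K_i is an eigenvector of T with eigenvalue 1 ...
v = [float(k) for k in K]
Tv = matvec(T, v)
# ... and u_i = v_i / sqrt(K_i) is the matching eigenvector of S.
u = [v[i] / math.sqrt(K[i]) for i in range(N)]
Su = matvec(S, u)

print(max_diff, Tv, Su)
```

Because S is real and symmetric its eigenvalues are real, and the element-wise identity shows that T shares them.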
14Density of states ρ(λ)
- filled circles: real AS-network
- empty squares: degree-preserving randomized version
15Participation ratio PR(λ) = 1/Σ_i (v_i^(λ))^4
[Figure: Participation Ratio (vertical axis, 0 to 250) versus λ (horizontal axis, -1 to 1)]
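A minimal sketch of the participation ratio, assuming the usual definition PR = 1/Σ_i v_i^4 for an eigenvector normalized to Σ_i v_i^2 = 1. The function normalizes its input itself, and the example vectors are made up.

```python
def participation_ratio(v):
    """Effective number of nodes on which the vector v has weight."""
    norm2 = sum(x * x for x in v)
    # Equivalent to 1 / sum(v_i^4) after rescaling v to unit length.
    return norm2 ** 2 / sum(x ** 4 for x in v)


# A mode spread evenly over 4 of 6 nodes, and one localized on a single node.
print(participation_ratio([1, 1, 1, 1, 0, 0]))
print(participation_ratio([0, 0, 1, 0, 0, 0]))
```

A mode with a small participation ratio lives on few nodes, which is what makes it a candidate for marking a module.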
16US Military
17
2  0.9626  RU RU RU RU CA RU RU | ?? ?? US US US US ?? (US Department of Defence)
3  0.9561  ?? FR FR FR ?? FR ?? | RU RU RU ?? ?? RU ??
4  0.9523  US ?? US ?? ?? ?? ?? (US Navy) | NZ NZ NZ NZ NZ NZ NZ
5  0.9474  KR KR KR KR KR ?? KR | UA UA UA UA UA UA UA
18Hacked Ford AS
20- Using random walks/diffusion to rank information networks
e.g. Google's PageRank made it $160 billion
21Information networks
- 3x10^5 Phys Rev articles connected by 3x10^6 citation links
- 10^10 webpages in the world
- To find relevant information one needs to search and rank efficiently!
22Ranking webpages
- Assign an importance factor G_i to every webpage
- Given a keyword (say "jaguar"), find all the pages that have it in their text and display them in order of descending G_i
- One solution, still used in scientific publishing, is G_i = K_in(i) (the number of incoming links), but:
- Too democratic: it doesn't take into account the importance of the nodes sending links
- Easy to trick and artificially boost the ranking (for the WWW)
23How Google works
- Google's recipe (circa 1998) is to simulate the behavior of many virtual "random surfers"
- PageRank G_i is the number of virtual hits the page gets. It is also the steady-state number of random surfers at a given page
- Popular pages send more surfers your way ⇒ PageRank is K_in weighted by the popularity of the webpage sending each hyperlink
- Surfers get bored following links ⇒ with probability α = 0.15 a surfer jumps to a randomly selected page (not following any hyperlinks)
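The random-surfer recipe can be sketched as an iteration on the expected number of surfers per page, with the jump probability written α here. This is an illustrative sketch, not Google's implementation. The tiny web is hypothetical, ranks are normalized so that the average is 1, and sending surfers from pages without out-links to a random page is an assumption the slides do not spell out.

```python
def pagerank(out_links, alpha=0.15, iterations=100):
    """Steady-state number of random surfers per page (average rank = 1).

    out_links -- dict: page -> list of pages it links to
    alpha     -- probability that a surfer jumps to a random page
    """
    pages = list(out_links)
    n = len(pages)
    G = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for p in pages:
            followers = (1.0 - alpha) * G[p]      # surfers who follow a link
            targets = out_links[p]
            if targets:
                for q in targets:
                    new[q] += followers / len(targets)
            else:
                # Assumption: surfers on a page without links jump at random.
                for q in pages:
                    new[q] += followers / n
        jump = alpha * sum(G.values()) / n        # bored surfers, spread evenly
        for q in pages:
            new[q] += jump
        G = new
    return G


# Hypothetical four-page web; page "d" has no incoming links.
web = {"a": ["b"], "b": ["c"], "c": ["a", "b"], "d": ["c"]}
G = pagerank(web)
for page in sorted(G, key=G.get, reverse=True):
    print(page, round(G[page], 3))
```

A page that nobody links to, like "d", is reached only by random jumps and so sits at the minimal rank α.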
24- How communities in the WWW influence Google ranking
H. Xie, K.-K. Yan, SM, cond-mat/0409087; physics/0510107; Physica A 373 (2007) 831-836
25How do WWW communities influence their average Gi?
- Pages in a web-community preferentially link to each other. Examples:
- Pages from the same organization (e.g. EPFL)
- Pages devoted to a common topic (e.g. Physics)
- Pages in the same geographical location (e.g. Switzerland)
- Naïve argument: communities "trap" random surfers to spend more time inside ⇒ they should increase the average Google ranking of the community
26Test of the naïve argument
[Figure: log10(<G>_c) plotted against intra-community links, for Community 1 and Community 2]
- The naïve argument is wrong: it could go either way
27[Figure labels: E_ww, E_cc]
28- G_c: average Google rank of pages in the community; G_w ≈ 1 in the outside world
- E_cw G_c/<K_out>_c: current from C to W
- It must be equal to E_wc G_w/<K_out>_w: current from W to C
- Thus G_c depends on the ratio between E_cw and E_wc, the numbers of edges (hyperlinks) from the community to the world and back
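Setting the two currents equal and solving for the community's average rank gives the dependence explicitly. This is a short worked step using only the quantities defined on this slide:

```latex
\frac{E_{cw}\,G_c}{\langle K_{out}\rangle_c}
  = \frac{E_{wc}\,G_w}{\langle K_{out}\rangle_w}
\quad\Longrightarrow\quad
G_c = G_w\,\frac{E_{wc}}{E_{cw}}\,
      \frac{\langle K_{out}\rangle_c}{\langle K_{out}\rangle_w}
```

With few outgoing links (small E_cw) the community's rank goes up, with few incoming links (small E_wc) it goes down, so the community's rank can move in either direction.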
29Balancing currents for nonzero α
- J_cw = (1-α) E_cw G_c/<K_out>_c + α G_c N_c: current from C to W
- It must be equal to J_wc = (1-α) E_wc G_w/<K_out>_w + α G_w N_w (N_c/N_w): current from W to C
30What are the consequences?
- For very isolated communities (E_cw/E^(r)_cw < α and E_wc/E^(r)_wc < α) one has G_c = 1. Their Google rank is decoupled from the outside world!
- Overall range: α < G_c < 1/α
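The balance of currents for a nonzero jump probability (written α here) can be solved for G_c and evaluated in the limiting cases. This is a sketch under stated assumptions: the function and argument names are hypothetical, and the "random" link counts are taken to be roughly N_c times the average out-degree, which is a guess at what E^(r) means on this slide.

```python
def community_rank(E_cw, E_wc, kout_c, kout_w, N_c, alpha=0.15, G_w=1.0):
    """Average rank G_c of a community from J_cw = J_wc.

    E_cw, E_wc     -- number of links from community to world, and back
    kout_c, kout_w -- average out-degree inside the community / in the world
    N_c            -- number of pages in the community
    """
    # J_cw = G_c * [(1 - alpha) * E_cw / kout_c + alpha * N_c]
    outflow_per_unit_rank = (1.0 - alpha) * E_cw / kout_c + alpha * N_c
    # J_wc = G_w * [(1 - alpha) * E_wc / kout_w + alpha * N_c]
    inflow = G_w * ((1.0 - alpha) * E_wc / kout_w + alpha * N_c)
    return inflow / outflow_per_unit_rank


# Hypothetical community of 100 pages with average out-degree 10.
isolated = community_rank(E_cw=0, E_wc=0, kout_c=10, kout_w=10, N_c=100)
only_incoming = community_rank(E_cw=0, E_wc=1000, kout_c=10, kout_w=10, N_c=100)
only_outgoing = community_rank(E_cw=1000, E_wc=0, kout_c=10, kout_w=10, N_c=100)
print(isolated, only_incoming, only_outgoing)
```

Under these assumptions the three cases reproduce G_c = 1 for an isolated community and the two ends of the range, 1/α and α.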
31WWW - the empirical data
- We have data for 10 US universities (and all UK and Australian universities)
- Looked closely at UCLA and Long Island University (LIU)
- UCLA has different departments
- LIU has 4 campuses
32α = 0.15
33α = 0.001
Abnormally high PageRank
34Top PageRank LIU websites for α = 0.001 don't make sense
- #1 www.cwpost.liu.edu/cwis/cwp/edu/edleader/higher_ed/hear.html
- #5 /higher_ed/index.html
- #9 /higher_ed/courses.html
[Figure labels: Strongly connected component, World]
37What about citation networks?
- Better to use α = 0.5 instead of α = 0.15: people don't click through papers as easily as through webpages
- Time arrow: papers only cite older papers. Small values of α give older papers an unfair advantage
- New algorithm: CiteRank (as in PageRank). Random walkers start from recent papers, weighted by exp(-t/τ_d)
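One plausible reading of the CiteRank idea, sketched on a toy citation list. The assumptions: walkers start at a paper with probability proportional to exp(-t/τ_d), where t is taken to be the paper's age, and at each step a walker stops with probability α or otherwise follows a random reference. The score of a paper is the total traffic through it. The default τ_d, the ages, and the papers are hypothetical.

```python
import math


def citerank(cites, age, alpha=0.5, tau_d=2.0, max_steps=50):
    """Traffic through each paper from walkers that start on recent papers.

    cites -- dict: paper -> list of (older) papers it cites
    age   -- dict: paper -> age t of the paper, in the same units as tau_d
    """
    weights = {p: math.exp(-age[p] / tau_d) for p in cites}
    total = sum(weights.values())
    walkers = {p: w / total for p, w in weights.items()}   # starting points
    traffic = dict(walkers)
    for _ in range(max_steps):
        moved = {p: 0.0 for p in cites}
        for p, amount in walkers.items():
            refs = cites[p]
            for q in refs:
                # A fraction (1 - alpha) keeps going, split over the references.
                moved[q] += (1.0 - alpha) * amount / len(refs)
        if sum(moved.values()) < 1e-15:
            break                      # all walkers have stopped
        for p in cites:
            traffic[p] += moved[p]
        walkers = moved
    return traffic


# Hypothetical three-paper citation chain.
cites = {"old": [], "mid": ["old"], "new": ["mid", "old"]}
age = {"old": 10.0, "mid": 5.0, "new": 1.0}
traffic = citerank(cites, age)
print(traffic)
```

Because walkers start on recent papers, an old paper gains traffic only if newer work still cites it, which is how the time arrow is compensated.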
39Summary
- Diffusion and modules (communities) in a network affect each other
- In the hardware part of the Internet (Autonomous systems or routers), diffusion allows one to detect modules
- In the software part:
- A diffusion-like process is used for ranking (Google's PageRank)
- WWW communities affect this ranking in a non-trivial way
40THE END
42Part 2: Opinion networks
"Extracting Hidden Information from Knowledge Networks", S. Maslov and Y.-C. Zhang, Phys. Rev. Lett. (2001).
"Exploring an opinion network for taste prediction: an empirical study", M. Blattner, Y.-C. Zhang, and S. Maslov, in preparation.
43Predicting customers' tastes from their opinions on products
- Each of us has personal tastes
- Information about them is contained in our opinions on products
- Matchmaking: opinions of customers with tastes similar to mine could be used to forecast my opinions on untested products
- The Internet allows this to be done on a large scale (see amazon.com and many others)
44Opinion networks
[Figure: two opinion networks: opinions of movie-goers (Customers) on Movies, and, in the WWW, opinions of Webpages on other webpages]
45Storing opinions
[Figure: opinions stored as a matrix of opinions Ω_IJ and as a network of opinions linking Customers and Movies]
46Using correlations to reconstruct customers' tastes
- Similar opinions ⇒ similar tastes
- Simplest model:
- Movie-goers → M-dimensional vector of tastes T_I
- Movies → M-dimensional vector of features F_J
- Opinions → scalar product
- Ω_IJ = T_I · F_J
[Figure: network of opinions linking Customers and Movies]
47Loop correlation
- Predictive power: 1/M^((L-1)/2)
- One needs many loops to best reconstruct unknown opinions
L = 5 known opinions: predictive power of an unknown opinion is 1/M^2
[Figure label: An unknown opinion]
48Main parameter: density of edges
- The larger the density of edges p, the easier the prediction
- At p_1 ≈ 1/N (N = N_customers + N_movies) macroscopic prediction becomes possible. Nodes are connected but vectors T_I and F_J are not fixed: ordinary percolation threshold
- At p_2 ≈ 2M/N > p_1 all tastes and features (T_I and F_J) can be uniquely reconstructed: rigidity percolation threshold
49Real empirical data (EachMovie dataset) on opinions of customers on movies: 5-star ratings of 1600 movies by 73000 users, 1.6 million opinions!
50Spectral properties of Ω
- For M < N the matrix Ω_IJ has N-M zero eigenvalues and M positive ones: Ω = R · R^T
- Using SVD one can "diagonalize" R = U · D · V^T such that matrices V and U are orthogonal (V^T · V = 1, U^T · U = 1) and D is diagonal. Then Ω = U · D^2 · U^T
- The amount of information contained in Ω is NM - M(M-1)/2 << N(N-1)/2, the number of off-diagonal elements
51Recursive algorithm for the prediction of unknown opinions
- Start with Ω_0, where all unknown elements are filled with <Ω> (zero in our case)
- Diagonalize and keep only the M largest eigenvalues and eigenvectors
- In the resulting truncated version of Ω_0 replace all known elements with their exact values, use the result as the new starting matrix, and go to step 1
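A minimal pure-Python sketch of the iteration described above, for a symmetric opinion matrix whose diagonal is also treated as unknown. The eigen-solver is a plain shifted power iteration with deflation, chosen only to keep the example free of dependencies. The function names, the iteration counts, and the rank-1 toy data are hypothetical.

```python
import math


def top_eigenpairs(A, M, sweeps=200):
    """The M algebraically largest eigenpairs of a symmetric matrix A."""
    n = len(A)
    A = [row[:] for row in A]
    # Row-sum bound on |eigenvalues|: shifting by it makes power iteration
    # converge to the largest (not the largest-magnitude) eigenvalue.
    shift = max(sum(abs(x) for x in row) for row in A)
    pairs = []
    for k in range(M):
        v = [1.0 + 0.1 * ((7 * i + 3 * k) % 5) for i in range(n)]  # fixed start
        for _ in range(sweeps):
            w = [sum(A[i][j] * v[j] for j in range(n)) + shift * v[i]
                 for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * A[i][j] * v[j] for i in range(n) for j in range(n))
        pairs.append((lam, v))
        for i in range(n):                 # deflate the mode just found
            for j in range(n):
                A[i][j] -= lam * v[i] * v[j]
    return pairs


def predict_opinions(known, n, M, rounds=150):
    """known: dict {(i, j): value} of known off-diagonal opinions."""
    full = [[0.0] * n for _ in range(n)]   # unknown entries start at <Omega> = 0
    for (i, j), value in known.items():
        full[i][j] = full[j][i] = value
    for _ in range(rounds):
        pairs = top_eigenpairs(full, M)
        # Keep only the M largest eigenvalues and their eigenvectors.
        full = [[sum(lam * v[i] * v[j] for lam, v in pairs) for j in range(n)]
                for i in range(n)]
        for (i, j), value in known.items():  # restore the exact known values
            full[i][j] = full[j][i] = value
    return full


# Hypothetical data: M = 1 taste per agent, opinions Omega_ij = r_i * r_j.
r = [1.0, 2.0, 1.5, 0.5, 1.0, 2.0]
hidden = {(0, 3), (2, 4)}
known = {(i, j): r[i] * r[j] for i in range(6) for j in range(i + 1, 6)
         if (i, j) not in hidden}
omega = predict_opinions(known, n=6, M=1)
print(round(omega[0][3], 4), round(omega[2][4], 4))
```

In this toy case the hidden opinions r_0 r_3 and r_2 r_4 are pinned down by the known entries, so the iteration has a unique rank-1 matrix to converge to.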
52Convergence of the algorithm
- Above p_2 the algorithm converges exponentially to the exact values of unknown elements
- The rate of convergence scales as (p-p_2)^2
53Reality check: sources of errors
- Customers are not rational! Ω_IJ = r_I · b_J + ε_IJ (idiosyncrasy)
- Opinions are delivered to the matchmaker through a narrow channel:
- Binary channel: S_IJ = sign(Ω_IJ), 1 or 0 (liked or not)
- Experience rated on a scale of 1 to 5, or 1 to 10 at best
- If the number of edges K and the size N are large, while M is small, these errors can be reduced
54How to determine M?
- In real systems M is not fixed: there are always finer and finer details of tastes
- Given the number of known opinions K one should choose M_eff ≈ K/(N_readers + N_books) so that systems are below the second transition p_2 ⇒ tastes should be determined hierarchically
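As a rough worked example, the estimate can be applied to the EachMovie numbers quoted earlier (1600 movies, 73000 users, 1.6 million opinions). This reads the garbled formula as M_eff ≈ K/(N_readers + N_books), which is what setting p = p_2 ≈ 2M/N gives if the edge density is p ≈ 2K/N^2. Both that reading and the variable names are assumptions.

```python
K = 1.6e6          # known opinions in the EachMovie dataset
N_users = 73000    # plays the role of N_readers
N_movies = 1600    # plays the role of N_books

M_eff = K / (N_users + N_movies)
print(round(M_eff, 1))  # about 21.4
```

On this reading the data could support roughly twenty taste dimensions, with the exact value best chosen by a training/test split.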
55Avoid overfitting
- Divide known votes into training and test sets
- Select M_eff so as to avoid overfitting!
56Knowledge networks in biology
- Interacting biomolecules: key and lock principle
- Matrix of interactions (binding energies): Ω_IJ = k_I · l_J + l_I · k_J
- Matchmaker (bioinformatics researcher) tries to guess yet unknown interactions based on the pattern of known ones
- Many experiments measure S_IJ = θ(Ω_IJ - Ω_th)
[Figure labels: k(1), k(2), l(2), l(1)]
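A small sketch of the key-and-lock model, assuming each molecule I carries a key vector k_I and a lock vector l_I, the binding energy is Ω_IJ = k_I · l_J + l_I · k_J, and an experiment reports only whether Ω_IJ exceeds a threshold Ω_th. The molecules, vectors, and threshold are made up for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def binding_energy(I, J):
    """Key of I in the lock of J, plus lock of I around the key of J."""
    return dot(I["k"], J["l"]) + dot(I["l"], J["k"])


def measured(omega, omega_th):
    """Step function: the experiment only reports above/below threshold."""
    return 1 if omega > omega_th else 0


# Hypothetical molecules with 2-dimensional keys and locks.
molecules = {
    "A": {"k": [1.0, 0.0], "l": [0.0, 1.0]},
    "B": {"k": [0.0, 1.0], "l": [1.0, 0.0]},
    "C": {"k": [0.5, 0.5], "l": [0.0, 0.0]},
}
omega_th = 1.0
names = sorted(molecules)
S = {(a, b): measured(binding_energy(molecules[a], molecules[b]), omega_th)
     for a in names for b in names if a < b}
print(S)
```

The matchmaker's task is the reverse direction: recovering the likely keys and locks, and hence the untested pairs, from such thresholded measurements.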
57THE END