Detecting topological patterns in protein networks - PowerPoint PPT Presentation

Provided by: sergei9

Transcript and Presenter's Notes

Title: Detecting topological patterns in protein networks

1
Lecture 4
2

Modules/communities
in networks

3
What is a module?

Nodes in a given module (or community, group, or
functional unit) tend to connect with other nodes
in the same module
Biology: proteins of the same function (e.g. DNA
repair) or sub-cellular localization (e.g.
nucleus)
WWW: websites on a common topic (e.g. physics)
or organization (e.g. EPFL)
Internet: Autonomous Systems/routers grouped by
geography (e.g. Switzerland) or domain (e.g.
educational or military)

4
Sometimes easy to discover
5
Sometimes hard
6
Hierarchical clustering

calculate the similarity weight $W_{ij}$ for all
pairs of vertices (e.g. the number of independent paths
$i \to j$)
start with all n vertices disconnected
add edges between pairs one by one in order of
decreasing weight
result: nested components, where one can take a
slice at any level of the tree
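
The steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the lecturer's code: the function name `hierarchical_clustering`, the union-find bookkeeping, and the example weights are all assumptions made for the sketch. Each recorded merge is one level of the tree at which a slice could be taken.

```python
def hierarchical_clustering(n, weights):
    """Add edges in order of decreasing weight and record every merge.

    n       -- number of vertices, labelled 0..n-1
    weights -- dict mapping a pair (i, j) to its similarity weight W_ij
    Returns a list of (weight, components after the merge).
    """
    parent = list(range(n))  # union-find forest: every vertex starts alone

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    merges = []
    for (i, j), w in sorted(weights.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # this edge falls inside an existing component
        parent[ri] = rj
        groups = {}
        for v in range(n):
            groups.setdefault(find(v), []).append(v)
        merges.append((w, sorted(groups.values())))
    return merges


# Hypothetical similarity weights: two tight pairs joined by weak links.
example_weights = {(0, 1): 0.9, (2, 3): 0.8, (1, 2): 0.1, (0, 3): 0.05}
levels = hierarchical_clustering(4, example_weights)
```

Slicing the tree just below weight 0.8 in this toy example leaves the two pairs as separate components; slicing lower merges everything.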

7
Girvan & Newman (2002) betweenness clustering

Betweenness of an edge i -- j is the number of
shortest paths going through this edge
Algorithm:
compute the betweenness of all edges
remove the edge with the highest betweenness
recalculate betweenness
Caveats:
Betweenness needs to be recalculated at each step
very expensive: all-pairs shortest paths cost $O(N^3)$
may need to repeat up to N times
does not scale to more than a few hundred nodes,
even with the fastest algorithms
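
A compact sketch of one splitting step of this procedure is given below, for an undirected, unweighted graph without self-loops. It uses a breadth-first accumulation of shortest-path counts (in the style of Brandes' algorithm) for the edge betweenness. The function names and the two-triangle example are assumptions made for illustration.

```python
from collections import deque


def edge_betweenness(adj):
    """Number of shortest paths through each edge (paths split equally on ties)."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and predecessors
        dist = {s: 0}
        sigma = {v: 0.0 for v in adj}
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # accumulate path counts from the farthest vertices back to s
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    # every pair (s, t) was counted once from each endpoint
    return {edge: total / 2.0 for edge, total in score.items()}


def components(adj):
    """Connected components as a sorted list of sorted vertex lists."""
    seen, parts = set(), []
    for s in adj:
        if s in seen:
            continue
        seen.add(s)
        stack, part = [s], []
        while stack:
            v = stack.pop()
            part.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        parts.append(sorted(part))
    return sorted(parts)


def girvan_newman_split(adj):
    """Remove highest-betweenness edges until one more component appears."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start:
        scores = edge_betweenness(adj)  # recalculated after every removal
        if not scores:
            break  # no edges left to remove
        u, v = max(scores, key=scores.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Hypothetical example: two triangles joined by the single edge 2 -- 3.
two_triangles = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
                 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
split = girvan_newman_split(two_triangles)
```

In this example the bridge 2 -- 3 carries all nine shortest paths between the triangles, so it is removed first and the two triangles come out as modules.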

Using random walks/diffusion to discover modules
in networks

K. Eriksen, I. Simonsen, S. Maslov, K. Sneppen,
Phys. Rev. Lett. 90, 148701 (2003)
9
Why diffusion?

Any dynamical process would equilibrate faster on
modules and slower between modules
Thus its slow modes reveal modules
Diffusion is the simplest dynamical process
(people also use others like Ising/Potts model,
etc.)

10
Random walkers on a network

Study the behavior of many VIRTUAL random walkers
on a network
At each time step each random walker steps to a
randomly selected neighbor
They equilibrate to a steady state $n_i \sim k_i$
(in solid state physics $n_i = \mathrm{const}$)
Slow modes of equilibration to the steady state
allow one to detect modules in a network
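
The steady state can be checked numerically with a short sketch. It evolves the average walker density rather than simulating individual walkers, and assumes a connected, non-bipartite, undirected graph. The function name and the small example graph are illustrative assumptions.

```python
def walker_steady_state(adj, steps=200):
    """Evolve the walker density: everyone moves to a random neighbor each step."""
    nodes = sorted(adj)
    density = {v: 1.0 / len(nodes) for v in nodes}  # start from a uniform density
    for _ in range(steps):
        new = {v: 0.0 for v in nodes}
        for j in nodes:
            share = density[j] / len(adj[j])  # node j splits its walkers evenly
            for i in adj[j]:
                new[i] += share
        density = new
    return density


# Hypothetical example: a triangle 0-1-2 with a pendant node 3 attached to 2.
example_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
steady = walker_steady_state(example_graph)
total_degree = sum(len(nbrs) for nbrs in example_graph.values())
```

For this graph the density converges to the degree of each node divided by the total degree (here 8), which is the proportionality to $k_i$ stated on the slide.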

11
Matrix formalism
12
Eigenvectors of the transfer matrix $T_{ij}$
13
Similarity transformation

Matrix $T_{ij}$ is asymmetric ⇒
could in principle result in complex
eigenvalues/eigenvectors
Luckily, $S_{ij} = 1/(\sqrt{K_i}\sqrt{K_j})$ has the same eigenvalues,
and eigenvectors $v_i/\sqrt{K_i}$
Known as a similarity transformation
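
The statement can be verified in two lines. The transcript does not define the transfer matrix, so the derivation below assumes $T_{ij} = A_{ij}/K_j$ (a walker at node $j$ moves to each neighbor with probability $1/K_j$), where $A$ is the adjacency matrix and $D = \mathrm{diag}(K_1,\dots,K_N)$; with a different convention the factors of $D$ would move accordingly.

```latex
T_{ij} = \frac{A_{ij}}{K_j}, \qquad
S_{ij} = \frac{A_{ij}}{\sqrt{K_i}\,\sqrt{K_j}}
\quad\Longrightarrow\quad
S = D^{-1/2}\, T\, D^{1/2}

T v = \lambda v
\quad\Longrightarrow\quad
S \left( D^{-1/2} v \right)
  = D^{-1/2}\, T\, D^{1/2} D^{-1/2} v
  = \lambda \left( D^{-1/2} v \right)
```

Since $S$ is real and symmetric, its eigenvalues are real, and therefore so are those of $T$.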

14
Density of states $\rho(\lambda)$

filled circles: real AS-network
empty squares: degree-preserving randomized
version

15
Participation ratio $PR(\lambda) = 1 / \sum_i \left(v^{(\lambda)}_i\right)^4$

[Figure: participation ratio (axis 0 to 250) versus eigenvalue $\lambda$ (axis -1 to 1)]
16
US Military
17
Mode  Eigenvalue  Country codes of the nodes at the two ends of the eigenvector
2     0.9626      RU RU RU RU CA RU RU | ?? ?? US US US US ?? (US Department of Defence)
3     0.9561      ?? FR FR FR ?? FR ?? | RU RU RU ?? ?? RU ??
4     0.9523      US ?? US ?? ?? ?? ?? (US Navy) | NZ NZ NZ NZ NZ NZ NZ
5     0.9474      KR KR KR KR KR ?? KR | UA UA UA UA UA UA UA
18
Hacked Ford AS
19
(No Transcript)
20

Using random walks/diffusion to rank information
networks

e.g. Google's PageRank made it $160 billion
21
Information networks

$3 \times 10^5$ Phys Rev articles connected by $3 \times 10^6$
citation links
$10^{10}$ webpages in the world
To find relevant information one needs to
efficiently search and rank!

22
Ranking webpages

Assign an importance factor $G_i$ to every webpage
Given a keyword (say "jaguar"), find all the pages
that have it in their text and display them in
the order of descending $G_i$.
One solution still used in scientific publishing
is $G_i = K_{in}(i)$ (the number of incoming links), but:
Too democratic: it doesn't take into account the
importance of nodes sending links
Easy to trick and artificially boost the ranking
(for the WWW)

23
How Google works

Google's recipe (circa 1998) is to simulate the
behavior of many virtual "random surfers"
PageRank $G_i$ ~ the number of virtual hits the
page gets. It is also the steady-state number
of random surfers at a given page
Popular pages send more surfers your way ⇒
PageRank ~ $K_{in}$ weighted by the popularity of the
webpage sending each hyperlink
Surfers get bored following links ⇒ with
probability $\alpha = 0.15$ a surfer jumps to a randomly
selected page (not following any hyperlinks)
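
The random-surfer recipe can be sketched as a simple iteration. This is an illustrative sketch, not Google's implementation: the function name, the normalization (average rank equal to 1), the name `alpha` for the jump probability, and the rule that a page without outgoing links sends all its surfers to random pages are assumptions.

```python
def pagerank(out_links, alpha=0.15, steps=100):
    """out_links: dict mapping every page to the list of pages it links to."""
    pages = sorted(out_links)
    n = len(pages)
    rank = {p: 1.0 for p in pages}  # surfers per page; the average stays 1
    for _ in range(steps):
        new = {p: 0.0 for p in pages}
        jumping = 0.0  # surfers that jump to a random page this step
        for j in pages:
            targets = out_links[j]
            if targets:
                jumping += alpha * rank[j]
                share = (1.0 - alpha) * rank[j] / len(targets)
                for i in targets:
                    new[i] += share
            else:
                jumping += rank[j]  # assumption: dead ends always jump
        for p in pages:
            new[p] += jumping / n  # jumpers land uniformly on all pages
        rank = new
    return rank


# Hypothetical web: a, b and c link to "hub"; the hub links only to a.
example_links = {"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": ["a"]}
ranks = pagerank(example_links)
```

In this toy web, page `a` has a single incoming link, but it comes from the popular hub, so `a` ends up ranked far above `b` and `c`, which receive surfers only through random jumps.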

How communities in the WWW influence Google
ranking

H. Xie, K.-K. Yan, SM, cond-mat/0409087; physics/0510107;
Physica A 373 (2007) 831-836
25
How do WWW communities influence their average Gi?

Pages in a web-community preferentially link to
each other. Examples:
Pages from the same organization (e.g. EPFL)
Pages devoted to a common topic (e.g. physics)
Pages in the same geographical location (e.g.
Switzerland)
Naïve argument: communities "trap" random surfers
to spend more time inside ⇒ they should increase
the average Google ranking of the community

26
Test of a naïve argument

[Figure: $\log_{10}\langle G\rangle_c$ versus the number of intra-community links, for Community 1 and Community 2]

Naïve argument is wrong: it could go either way

27
[Figure labels: $E_{ww}$, $E_{cc}$]
28

$G_c$: average Google rank of pages in the
community; $G_w \approx 1$: in the outside world
$E_{cw} G_c/\langle K_{out}\rangle_c$: current from C to W
It must be equal to $E_{wc} G_w/\langle K_{out}\rangle_w$:
current from W to C
Thus $G_c$ depends on the ratio between $E_{cw}$ and
$E_{wc}$, the numbers of edges (hyperlinks) between
the community and the world

29
Balancing currents for nonzero $\alpha$

$J_{cw} = (1-\alpha) E_{cw} G_c/\langle K_{out}\rangle_c + \alpha G_c N_c$: current
from C to W
It must be equal to $J_{wc} = (1-\alpha) E_{wc} G_w/\langle K_{out}\rangle_w
+ \alpha G_w N_w (N_c/N_w)$: current from W to C
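
Setting the two currents equal gives the community's average rank in closed form. The sketch below solves the balance for $G_c$, using $N_w (N_c/N_w) = N_c$ for the jump term; the function and argument names are assumptions, and the numbers in the usage lines are made up for illustration.

```python
def community_rank(e_cw, e_wc, kout_c, kout_w, n_c, alpha=0.15, g_w=1.0):
    """Solve J_cw = J_wc for the average community rank G_c.

    J_cw = (1 - alpha) * e_cw * G_c / kout_c + alpha * G_c * n_c
    J_wc = (1 - alpha) * e_wc * g_w / kout_w + alpha * g_w * n_c
    e_cw, e_wc     -- number of links from C to W and from W to C
    kout_c, kout_w -- average out-degree inside C and in the outside world
    n_c            -- number of pages in the community
    """
    inflow = (1.0 - alpha) * e_wc * g_w / kout_w + alpha * g_w * n_c
    outflow_per_unit_rank = (1.0 - alpha) * e_cw / kout_c + alpha * n_c
    return inflow / outflow_per_unit_rank


# Hypothetical community of 100 pages with average out-degree 10.
isolated = community_rank(e_cw=0, e_wc=0, kout_c=10, kout_w=10, n_c=100)
leaky = community_rank(e_cw=1000, e_wc=0, kout_c=10, kout_w=10, n_c=100)
```

With no links in either direction the result is `isolated == 1.0`: only random jumps connect the community to the world. If every link of the community points outward and none comes back, `leaky` evaluates to `alpha` (0.15 here).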

30
What are the consequences?

For very isolated communities ($E_{cw}/E^{(r)}_{cw} < \alpha$ and
$E_{wc}/E^{(r)}_{wc} < \alpha$) one has $G_c = 1$. Their Google rank is
decoupled from the outside world!
Overall range: $\alpha < G_c < 1/\alpha$

31
WWW - the empirical data

We have data for 10 US universities (plus all UK
and Australian universities)
Looked closely at UCLA and Long Island University
(LIU)
UCLA has different departments
LIU has 4 campuses

32
$\alpha = 0.15$
33
$\alpha = 0.001$
Abnormally high PageRank
34
Top PageRank LIU websites for $\alpha = 0.001$ don't make
sense

#1 www.cwpost.liu.edu/cwis/cwp/edu/edleader/higher_ed/hear.html
#5 /higher_ed/index.html
#9 /higher_ed/courses.html

[Figure labels: Strongly connected component, World]
35
(No Transcript)
36
(No Transcript)
37
What about citation networks?

Better to use $\alpha = 0.5$ instead of $\alpha = 0.15$: people don't
click through papers as easily as through
webpages
Time arrow: papers only cite older papers. Small
values of $\alpha$ give older papers an unfair advantage
New algorithm: CiteRank (as in PageRank). Random
walkers start from recent papers, $\sim \exp(-t/\tau_d)$
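
One possible reading of this recipe is sketched below. The slide only states that walkers start preferentially from recent papers. The rest is an assumption made for illustration: at each step a walker follows a random reference with probability `1 - alpha` and otherwise stops, and a paper's score is the total traffic it receives. The function name, the default `tau_d`, and the three-paper example are hypothetical.

```python
import math


def citerank(citations, age, alpha=0.5, tau_d=2.0, max_steps=50):
    """citations: paper -> list of older papers it cites; age: paper -> years."""
    papers = sorted(citations)
    # walkers start from a paper of age t with weight exp(-t / tau_d)
    start = {p: math.exp(-age[p] / tau_d) for p in papers}
    norm = sum(start.values())
    walkers = {p: start[p] / norm for p in papers}
    traffic = dict(walkers)  # total number of visits each paper receives
    for _ in range(max_steps):
        moved = {p: 0.0 for p in papers}
        for j in papers:
            refs = citations[j]
            if refs:
                # a fraction (1 - alpha) of walkers follows a random reference
                share = (1.0 - alpha) * walkers[j] / len(refs)
                for i in refs:
                    moved[i] += share
        walkers = moved
        for p in papers:
            traffic[p] += walkers[p]
    return traffic


# Hypothetical chain of citations: new -> mid -> old.
example_citations = {"old": [], "mid": ["old"], "new": ["mid"]}
example_age = {"old": 10.0, "mid": 5.0, "new": 0.0}
traffic = citerank(example_citations, example_age)
low_alpha = citerank(example_citations, example_age, alpha=0.05)
```

In this toy chain, lowering `alpha` from 0.5 to 0.05 lets walkers travel further back in time, so the older papers collect more traffic, in line with the slide's remark about small values.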

38
(No Transcript)
39
Summary

Diffusion and modules (communities) in a network
affect each other
In the "hardware" part of the Internet
(Autonomous Systems or routers) diffusion allows
one to detect modules
In the "software" part:
a diffusion-like process is used for ranking
(Google's PageRank)
WWW communities affect this ranking in a
non-trivial way

40
THE END
41
(No Transcript)
42
Part 2: Opinion networks
"Extracting Hidden Information from Knowledge
Networks", S. Maslov and Y.-C. Zhang, Phys. Rev.
Lett. (2001).
"Exploring an opinion network for
taste prediction: an empirical study", M.
Blattner, Y.-C. Zhang, and S. Maslov, in
preparation.
43
Predicting customers' tastes from their opinions
on products

Each of us has personal tastes
Information about them is contained in our
opinions on products
Matchmaking: opinions of customers with tastes
similar to mine could be used to forecast my
opinions on untested products
The Internet allows one to do it on a large scale (see
amazon.com and many others)

44
Opinion networks
[Figure: two opinion networks. Opinions of movie-goers on movies (Customers and Movies); WWW, opinions of webpages on other webpages (Webpages and Other webpages)]
45
Storing opinions
Matrix of opinions $\Omega_{IJ}$
Network of opinions
[Figure: the opinions of Customers on Movies, shown as a matrix and as a network]
46
Using correlations to reconstruct customers'
tastes

Similar opinions ⇒ similar tastes
Simplest model:
Movie-goers → M-dimensional vector of tastes $T_I$
Movies → M-dimensional vector of features $F_J$
Opinions → scalar product
$\Omega_{IJ} = T_I \cdot F_J$

[Figure: network of opinions linking Customers to Movies]
47
Loop correlation

Predictive power $\sim 1/M^{(L-1)/2}$
One needs many loops to best reconstruct
unknown opinions

$L = 5$ known opinions: predictive power of an
unknown opinion is $1/M^2$
[Figure label: an unknown opinion]
48
Main parameter: density of edges

The larger the density of edges p, the easier
the prediction
At $p_1 \approx 1/N$ ($N = N_{customers} + N_{movies}$) macroscopic
prediction becomes possible. Nodes are connected
but vectors $T_I$ and $F_J$ are not fixed: ordinary
percolation threshold
At $p_2 \approx 2M/N > p_1$ all tastes and features ($T_I$
and $F_J$) can be uniquely reconstructed: rigidity
percolation threshold

49
Real empirical data (EachMovie dataset) on
opinions of customers on movies: 5-star ratings
of 1600 movies by 73000 users, 1.6 million
opinions!
50
Spectral properties of $\Omega$

For $M < N$ the matrix $\Omega_{IJ}$ has $N-M$ zero eigenvalues
and $M$ positive ones: $\Omega = R R^\dagger$.
Using SVD one can diagonalize $R = U D V^\dagger$
such that matrices V and U are orthogonal, $V^\dagger V
= 1$, $U^\dagger U = 1$, and D is diagonal. Then $\Omega = U
D^2 U^\dagger$
The amount of information contained in $\Omega$:
$NM - M(M-1)/2 \ll N(N-1)/2$, the number of off-diagonal
elements

51
Recursive algorithm for the prediction of unknown
opinions

Start with $\Omega_0$ where all unknown elements are
filled with $\langle\Omega\rangle$ (zero in our case)
Diagonalize and keep only the M largest eigenvalues
and eigenvectors
In the resulting truncated matrix $\Omega_0$ replace
all known elements with their exact values and go
to step 1
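
A minimal pure-Python sketch of this loop follows. It makes several simplifying assumptions that are not in the slides: the leading eigenpairs are found by power iteration with deflation (which presumes the leading eigenvalues are positive and well separated), the diagonal elements are treated as known, and the toy data are an exactly rank-1 symmetric matrix. All function names are hypothetical.

```python
def top_eigenpairs(matrix, m, iters=200):
    """Leading m eigenpairs of a symmetric matrix (power iteration + deflation)."""
    n = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for _ in range(m):
        v = [1.0 + 0.1 * i for i in range(n)]  # fixed positive starting vector
        for _ in range(iters):
            w = [sum(work[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = sum(x * x for x in w) ** 0.5
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * work[i][j] * v[j] for i in range(n) for j in range(n))
        pairs.append((lam, v))
        for i in range(n):  # deflate: remove the mode that was just found
            for j in range(n):
                work[i][j] -= lam * v[i] * v[j]
    return pairs


def complete(known, n, m, rounds=30):
    """known: dict (i, j) -> exact value; each symmetric entry is given once."""
    def restore(mat):
        for (i, j), value in known.items():
            mat[i][j] = value
            mat[j][i] = value

    omega = [[0.0] * n for _ in range(n)]  # unknown elements start at zero
    restore(omega)
    for _ in range(rounds):
        pairs = top_eigenpairs(omega, m)  # keep only the m leading modes
        omega = [[sum(lam * v[i] * v[j] for lam, v in pairs) for j in range(n)]
                 for i in range(n)]
        restore(omega)  # put the known elements back, then repeat
    return omega


# Hypothetical rank-1 data (M = 1): Omega_ij = r_i * r_j, one element hidden.
r = [1.0, 2.0, 3.0, 4.0, 5.0]
known = {(i, j): r[i] * r[j]
         for i in range(5) for j in range(i, 5) if (i, j) != (0, 1)}
recovered = complete(known, n=5, m=1)
```

In this example the hidden element `recovered[0][1]` should converge to the true value `r[0] * r[1] = 2`, because all other elements already fix the single underlying vector.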

52
Convergence of the algorithm

Above $p_2$ the algorithm exponentially converges
to the exact values of unknown elements
The rate of convergence scales as $(p - p_2)^2$

53
Reality check: sources of errors

Customers are not rational! $\Omega_{IJ} = r_I \cdot b_J +
\epsilon_{IJ}$ (idiosyncrasy)
Opinions are delivered to the matchmaker through
a narrow channel:
Binary channel: $S_{IJ} = \mathrm{sign}(\Omega_{IJ})$, 1 or 0 (liked or
not)
Experience rated on a scale of 1 to 5, or 1 to 10 at
best
If the number of edges K and the size N are large
while M is small, these errors could be reduced

54
How to determine M?

In real systems M is not fixed: there are always
finer and finer details of tastes
Given the number of known opinions K one should
choose $M_{eff} \approx K/(N_{readers} + N_{books})$ so that systems
are below the second transition $p_2$ ⇒ tastes
should be determined hierarchically

55
Avoid overfitting

Divide known votes into training and test sets
Select $M_{eff}$ so as to avoid overfitting!

56
Knowledge networks in biology

Interacting biomolecules: key and lock principle
Matrix of interactions (binding energies): $\Omega_{IJ} =
k_I \cdot l_J + l_I \cdot k_J$
Matchmaker (bioinformatics researcher) tries to
guess yet unknown interactions based on the
pattern of known ones
Many experiments measure $S_{IJ} = \theta(\Omega_{IJ} - \Omega_{th})$

[Figure labels: k(1), k(2), l(2), l(1)]
57
THE END
