V2 - PowerPoint PPT Presentation

Slides: 40
Provided by: volkhar


Transcript and Presenter's Notes

Title: V2

V2 network topologies
- Random graphs: a classical field in graph theory. Well studied
  analytically and numerically. Literature (heavy): Bela Bollobas,
  Modern Graph Theory; Random Graphs.
- Scale-free networks: quite new. Their properties have mostly been
  studied numerically and heuristically (so far).
- Evolution of domain linkage networks.
- Classification of network topologies.

A graph G is an ordered pair of disjoint sets (V,E) such that E is a
subset of the set $V^{(2)}$ of unordered pairs of V. V and E are always
assumed to be finite. V is the set of vertices; E is the set of edges.
A weighted graph has a real-valued weight assigned to each edge. A
subgraph of a graph G is a graph whose vertex and edge sets are subsets
of those of G.
Given an undirected graph, two vertices u and v are called connected if
there exists a path from u to v. Otherwise they are called disconnected.
The graph is called a connected graph if every pair of vertices in the
graph is connected.
The giant component is a network theory term referring to a connected
subgraph that contains a majority of the entire graph's vertices.

A path in a graph is a sequence of vertices such
that from each of its vertices there is an edge
to the successor vertex. The first vertex is
called the start vertex and the last vertex is
called the end vertex. Both of them are called
end or terminal vertices of the path. The other
vertices in the path are internal vertices. Two
paths are independent (alternatively, internally
vertex-disjoint) if they do not have any internal
vertex in common.

Shortest path problem
The shortest path problem is the problem of
finding a path between two vertices such that the
sum of the weights of its constituent edges is
minimized. More formally, given a weighted graph (V,E) and two
elements n, n' ∈ V, find a path P from n to n' so that
$\sum_{e \in P} w(e)$, the sum of the weights w(e) of its edges, is
minimal among all paths connecting n to n'. The all-pairs shortest path
problem is a similar problem, in which we have to find such paths for
every pair of vertices n, n'.

Dijkstra's algorithm
Dijkstra's algorithm solves the shortest path
problem for a directed graph with non-negative
edge weights. Input: a weighted directed graph
G = (V,E) and a source vertex s in G. Each edge of
the graph is an ordered pair of vertices (u,v)
representing a connection from vertex u to vertex
v. The weight w(u,v) is the non-negative cost of
moving from vertex u to vertex v. The cost of
an edge can be thought of as (a generalization
of) the distance between those two vertices. The
cost of a path between two vertices is the sum of
costs of the edges in that path. For a given
pair of vertices s and t in V, the algorithm
finds the path from s to t with lowest cost (i.e.
the shortest path). It can also be used for
finding costs of shortest paths from a single
vertex s to all other vertices in the graph.

Description of the algorithm
The algorithm works by keeping for each vertex v the cost d[v] of the
shortest path found so far between s and v. Initially, this value is 0
for the source vertex s (d[s] = 0), and infinity for all other vertices,
representing the fact that we do not know any path leading to those
vertices (d[v] = ∞ for every v in V, except s). When the algorithm
finishes, d[v] will be the cost of the shortest path from s to v, or
infinity if no such path exists.
The basic operation of Dijkstra's algorithm is edge relaxation: if there
is an edge from u to v, then the shortest known path from s to u (of
cost d[u]) can be extended to a path from s to v by adding edge (u,v) at
the end. This path will have length d[u] + w(u,v). If this is less than
the current d[v], we can replace the current value of d[v] with the new
value.

Description of the algorithm
Edge relaxation is applied until all values d[v] represent the cost of
the shortest path from s to v. The algorithm is organized so that each
edge (u,v) is relaxed only once, when d[u] has reached its final value.
The algorithm maintains two sets of vertices, S and Q. Set S contains
all vertices for which we know that the value d[v] is already the cost
of the shortest path, and set Q contains all other vertices. Set S
starts empty, and in each step one vertex is moved from Q to S. This
vertex is chosen as the vertex u in Q with the lowest value of d[u].
When a vertex u is moved to S, the algorithm relaxes every outgoing
edge (u,v).

In the following algorithm, u := Extract-Min(Q) searches for the vertex
u in the vertex set Q that has the smallest d[u] value. That vertex is
removed from the set Q and returned to the user. Q := Update(Q) updates
the weight field of the current vertex in the vertex set Q.

 1  function Dijkstra(G, w, s)
 2     for each vertex v in V[G]                // Initialization
 3        do d[v] := infinity
 4           previous[v] := undefined
 5     d[s] := 0
 6     S := empty set
 7     Q := set of all vertices
 8     while Q is not an empty set
 9        do u := Extract-Min(Q)
10           S := S ∪ {u}
11           for each edge (u,v) outgoing from u
12              do if d[v] > d[u] + w(u,v)       // Relax (u,v)
13                 then d[v] := d[u] + w(u,v)
14                      previous[v] := u
15                      Q := Update(Q)

If we are only interested in a shortest path between vertices s and t,
we can terminate the search at line 9 if u = t.
Now we can read the shortest path from s to t by iteration:

 1  S := empty sequence
 2  u := t
 3  while defined u
 4     do insert u to the beginning of S
 5        u := previous[u]

Now sequence S is the list of vertices on the shortest path from s to t.

Running time
The simplest implementation of Dijkstra's algorithm stores the n
vertices of set Q in an ordinary linked list or array, and the operation
Extract-Min(Q) is simply a linear search through all vertices in Q. In
this case, the running time is O(n^2). For sparse graphs, that is,
graphs with far fewer than n^2 edges, Dijkstra's
algorithm can be implemented more efficiently.
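
One possible way to get the more efficient variant is sketched here in Python. It is only an illustration, not the implementation the slides refer to. Assumptions: the graph is a dict that maps every vertex to a list of (neighbor, weight) pairs, the set Q is kept in a binary heap from the standard `heapq` module, and Update(Q) is replaced by pushing a fresh entry and skipping stale ones. The names `dijkstra` and `read_path` and the toy graph are made up for this example.

```python
import heapq

def dijkstra(graph, s):
    """graph: dict u -> list of (v, w(u,v)) with w(u,v) >= 0; every vertex is a key."""
    d = {v: float("inf") for v in graph}      # d[v]: cost of best path found so far
    previous = {v: None for v in graph}
    d[s] = 0
    done = set()                              # the set S of finished vertices
    queue = [(0, s)]                          # the set Q, kept as a heap of (d[v], v)
    while queue:
        du, u = heapq.heappop(queue)          # Extract-Min(Q)
        if u in done:
            continue                          # stale heap entry, u was finished earlier
        done.add(u)
        for v, w in graph[u]:
            if d[v] > du + w:                 # relax edge (u,v)
                d[v] = du + w
                previous[v] = u
                heapq.heappush(queue, (d[v], v))   # stands in for Update(Q)
    return d, previous

def read_path(previous, t):
    """Walk the previous[] pointers back from t and return the path as a list."""
    path = []
    u = t
    while u is not None:
        path.insert(0, u)
        u = previous[u]
    return path

# Hypothetical toy graph
graph = {
    "s": [("a", 2), ("b", 5)],
    "a": [("b", 1), ("t", 6)],
    "b": [("t", 2)],
    "t": [],
}
d, previous = dijkstra(graph, "s")
path = read_path(previous, "t")
```

With a binary heap, each edge causes at most one push, so the work is on the order of (number of edges) x log(number of vertices) instead of n^2.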

Description of Dijkstra's algorithm taken from

Erdös-Renyi model of a random graph
n nodes (vertices) joined by edges that have been chosen and placed
between pairs of nodes uniformly at random.
G(n,p): each possible edge in the graph on n nodes is present with
probability p and absent with probability 1 - p.
Average number of edges in G(n,p): $p \, n(n-1)/2$.
Each edge connects two vertices ⇒ average degree of a vertex:
$p \, (n-1) \approx pn$.
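
A minimal sketch of sampling G(n,p) and comparing the realized edge count and mean degree with the expected values p n(n-1)/2 and p(n-1). The function name `gnp` and the values n = 200, p = 0.05 are chosen only for illustration.

```python
import random
from itertools import combinations

def gnp(n, p, seed=None):
    """Return the edge list of one G(n,p) sample on nodes 0..n-1."""
    rng = random.Random(seed)
    # every unordered pair is present independently with probability p
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]

n, p = 200, 0.05
edges = gnp(n, p, seed=1)
expected_edges = p * n * (n - 1) / 2      # 995 for these values
expected_degree = p * (n - 1)             # 9.95 for these values
mean_degree = 2 * len(edges) / n          # each edge contributes to two degrees
```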

Erdös-Renyi model: components
Erdös and Renyi studied how the expected topology of a random graph
with n nodes changes as a function of the number of edges m.
When m is small, the graph is likely fragmented into many small
connected components having vertex sets of size at most O(log n). As m
increases, the components grow, at first by linking to isolated nodes
and later by fusing with other components.
A transition happens at m = n/2, when many clusters cross-link
spontaneously to form a unique largest component called the giant
component. Its vertex set size is much larger than the vertex set sizes
of any other components. It contains O(n) nodes, while the second
largest component contains O(log n) nodes.
In statistical physics, this phenomenon is called percolation.
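
The transition can be observed numerically. The sketch below is an illustration only: it draws random graphs with n nodes and m distinct edges and measures the fraction of nodes in the largest component with a union-find structure. The helper names and the choice n = 2000 are assumptions of this example.

```python
import random

def largest_component(n, edges):
    """Size of the largest connected component, via union-find."""
    parent = list(range(n))
    size = [1] * n

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            if size[ru] < size[rv]:
                ru, rv = rv, ru
            parent[rv] = ru                 # attach the smaller tree to the larger
            size[ru] += size[rv]
    return max(size[find(x)] for x in range(n))

def random_graph_m(n, m, rng):
    """m distinct edges chosen uniformly at random among all node pairs."""
    edges = set()
    while len(edges) < m:
        u, v = rng.sample(range(n), 2)
        edges.add((min(u, v), max(u, v)))
    return edges

rng = random.Random(0)
n = 2000
fractions = {m: largest_component(n, random_graph_m(n, m, rng)) / n
             for m in (n // 4, n // 2, n, 2 * n)}
```

Well below m = n/2 the largest component holds only a tiny fraction of the nodes. Well above it, most nodes belong to it.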

Erdös-Renyi model: shortest path length
  • The shortest path length between any pair of
    nodes in the giant component grows like log n.
  • Therefore, these graphs are called small worlds.
  • The properties of random graphs have been studied
    very extensively.
  • Literature: B. Bollobas. Random Graphs. Academic,
    London, 1985, 2004
  • However, random graphs are not adequate models for
    real-world networks because
  • (1) real networks appear to have a power-law degree
    distribution (while random graphs have a Poisson
    degree distribution)
  • (2) real networks show strong clustering, while
    the clustering coefficient of a random graph is
    C = p, independent of whether two vertices have a
    common neighbor.
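
The statement C = p can be checked on a sample. This sketch assumes the clustering coefficient is measured as the fraction of connected triples that are closed into a triangle. Other definitions (e.g. the average of local coefficients) give similar values for G(n,p). Function names and parameter values are made up for the example.

```python
import random
from itertools import combinations

def gnp_adjacency(n, p, rng):
    """One G(n,p) sample as a dict node -> set of neighbors."""
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def clustering_coefficient(adj):
    """Fraction of connected triples a-v-b in which a and b are also linked."""
    closed = triples = 0
    for v, nbrs in adj.items():
        k = len(nbrs)
        triples += k * (k - 1) // 2
        closed += sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return closed / triples if triples else 0.0

p = 0.1
C = clustering_coefficient(gnp_adjacency(300, p, random.Random(3)))
```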

Generalized Random Graphs
  • Aim: allow a power-law degree distribution in a
    graph while leaving all other aspects as in the
    random graph model.
  • Given a degree sequence (e.g. drawn from a power-law
    distribution), one can generate a random graph by
    assigning to each vertex i a degree k_i from the
    given degree sequence. Then choose pairs of
    vertices uniformly at random to make edges so
    that the assigned degrees remain preserved.
  • When all degrees have been used up to make edges,
    the resulting graph is a random member of the set
    of graphs with the desired degree distribution.
  • Problem: the method does not allow the clustering
    coefficient to be specified.
  • On the other hand, this property makes it
    possible to exactly determine many properties of
    these graphs in the limit of large n.
  • E.g. almost all random graphs with a fixed degree
    distribution and no nodes of degree smaller than
    2 have a unique giant component.
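
The construction can be sketched with "stub matching": every vertex i gets k_i half-edges (stubs), and the stubs are paired at random. This is a simplified illustration under stated assumptions: self-loops and multiple edges are not rejected, and the example degree sequence is made up.

```python
import random
from collections import Counter

def configuration_model(degrees, seed=None):
    """Pair up stubs at random; returns an edge list (may contain self-loops
    and repeated edges, which this sketch does not filter out)."""
    if sum(degrees) % 2:
        raise ValueError("the degree sum must be even")
    rng = random.Random(seed)
    # vertex v appears degrees[v] times in the stub list
    stubs = [v for v, k in enumerate(degrees) for _ in range(k)]
    rng.shuffle(stubs)
    return list(zip(stubs[0::2], stubs[1::2]))

degrees = [3, 3, 2, 2, 1, 1]
edges = configuration_model(degrees, seed=42)

# every edge end uses up one stub, so the assigned degrees are preserved
realized = Counter()
for u, v in edges:
    realized[u] += 1
    realized[v] += 1
```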

Barabasi's construction algorithm for scale-free networks
Input: (n0, m, t), where n0 is the initial number of vertices, m
(m ≤ n0) is the number of edges added every time one new vertex is
added to the graph, and t is the number of iterations.
Algorithm:
a) Start with n0 isolated nodes.
b) Every time we add one new node v, m edges will be linked from v to
the existing nodes with the preferential attachment probability
$\Pi(k_i) = k_i / \sum_j k_j$, where k_i is the number of links (degree)
of the i-th node.
Eventually, the graph will have (n0 + t) nodes and m·t edges.
Problem of pure mathematicians with this algorithm: how to start
from n0 = 0?
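
A minimal sketch of the construction. Two choices here are assumptions of the example and not part of the slide: while all degrees are still zero the targets are picked uniformly among the n0 initial nodes, and the m targets of a new node are forced to be distinct by redrawing, which slightly changes the exact attachment probabilities.

```python
import random

def barabasi_albert(n0, m, t, seed=None):
    """Grow a graph by t nodes, each bringing m edges; returns the edge list."""
    if not 1 <= m <= n0:
        raise ValueError("need 1 <= m <= n0")
    rng = random.Random(seed)
    edges = []
    endpoints = []   # node i appears k_i times, so a uniform pick is proportional to k_i
    for step in range(t):
        v = n0 + step                            # label of the new node
        targets = set()
        while len(targets) < m:
            if endpoints:
                targets.add(rng.choice(endpoints))
            else:
                targets.add(rng.randrange(n0))   # assumption: uniform while all k_i = 0
        for u in sorted(targets):
            edges.append((v, u))
            endpoints.extend((v, u))
    return edges

n0, m, t = 3, 2, 50
edges = barabasi_albert(n0, m, t, seed=7)
```

Initial nodes that are not picked in the first step keep degree 0 forever in this sketch, which is one face of the start-up problem mentioned above.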

Properties of Barabasi-Albert scale-free model
$P(k) \sim k^{-\gamma}$ with $\gamma = 3$. Real networks often show
$\gamma \approx 2.1$ to $2.4$.
Observation: if either growth or preferential attachment is eliminated,
the resulting network does not exhibit scale-free properties.
The average path length in the BA-model is proportional to
ln n / ln ln n, which is shorter than in random graphs ⇒ scale-free
networks are ultrasmall worlds.
Observation: non-trivial correlations (clustering) between the degrees
of connected nodes.
Numerical result for the BA-model: $C \sim n^{-0.75}$. No analytical
predictions of C so far.

Properties of scale-free models
Scale-free networks are resistant to random failures (robustness)
because a few high-degree hubs dominate their topology: a node that
fails at random probably has a small degree, and its failure thus does
not severely affect the rest of the network.
However, scale-free networks are quite vulnerable to attacks on the
hubs. See the example from the last lecture about the lethality of gene
deletions in yeast.
These properties have been confirmed numerically and analytically by
studying the average path length and the size of
the giant component.

Properties of Barabasi-Albert scale-free model
  • BA-model is a minimal model that captures the
    mechanisms responsible for the power-law degree
    distribution observed in real networks.
  • A discrepancy is the fixed exponent of the
    predicted power-law distribution (γ = 3).
  • Does the BA-model describe the true biological
    evolution of networks?
  • Recent efforts:
  • study variants with cleaner mathematical
    properties (Bollobas, LCD-model),
  • include effects of adding or re-wiring edges,
  • allow nodes to age so that they can no longer
    accept new edges,
  • or vary the form of preferential attachment.
  • These models also predict exponential and
    truncated power-law degree distributions in some
    parameter regimes.

2 Scale-free behavior in protein domain networks
  • Domains are fundamental units of protein
  • Most proteins only contain one single domain.
  • Some sequences appear as multidomain proteins.
  • On average, they have 2-3 domains, but can have
    up to 130 domains!
  • Most new sequences show homologies to parts of
    known protein sequences
  • most proteins may have descended from relatively
    few ancestral types.
  • Sequences of large proteins often seem to have
    evolved by
  • joining preexisting domains in new combinations,
    domain shuffling
  • domain duplication or domain insertion.

Wuchty Mol. Biol. Evol. 18, 1694 (2001)
Protein domain database SMART
http://smart.embl-heidelberg.de/ contains (in 2001):
  • 153 signalling domains
  • 176 nuclear domains, e.g. HLH domains
  • 225 extracellular domains
  • 115 other domains
Wuchty Mol. Biol. Evol. 18, 1694 (2001)
Protein Domain databases
  • Prosite (http://expasy.proteome.org.au/prosite/) contains 1400
    biologically significant motifs and profiles.
  • Pfam (http://www.sanger.ac.uk/Software/Pfam/index.shtml):
    collection of multiple-sequence alignments of protein families and
    profile HMMs. Curated documentation on 2500 families.
  • ProDom (http://www.toulouse.inra.fr/prodom.html) contains all
    160.000 protein domain families that can be automatically generated
    from the SwissProt and TrEMBL databases. Here, only families with
    ≥10 members are considered ⇒ 6000 ProDom families.
  • InterPro: Proteome Analysis of 41 nonredundant proteomes of genomes
    of archaea, bacteria, and eukaryotes (http://www.ebi.ac.uk/proteome)
    yields domains which appear along with other domains in a protein
    sequence ⇒ domains are vertices; co-appearance in a protein sequence
    means an edge.

Wuchty Mol. Biol. Evol. 18, 1694 (2001)
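
The graph construction (domains are vertices, co-appearance in one protein sequence is an edge) can be sketched as follows. The protein names and domain annotations are hypothetical toy data, not taken from any of the databases, and `domain_graph` is a name made up for this example.

```python
from itertools import combinations
from collections import defaultdict

def domain_graph(proteins):
    """proteins: dict protein -> list of domains found in its sequence.
    Returns dict domain -> set of domains it co-occurs with."""
    adj = defaultdict(set)
    for domains in proteins.values():
        # a repeated domain within one protein counts once; no self-links
        for a, b in combinations(sorted(set(domains)), 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj

# Hypothetical toy annotations
proteins = {
    "protein_1": ["SH3", "SH2", "kinase"],
    "protein_2": ["PH", "SH3"],
    "protein_3": ["kinase"],
    "protein_4": ["PH", "kinase", "PH"],
}
adj = domain_graph(proteins)
degree = {d: len(nbrs) for d, nbrs in adj.items()}   # number of links to other domains
```

The histogram of these degree values is the quantity P(number of links to other domains) studied in the following slides.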

[Figure: P(number of links to other domains) plotted against the number of links to other domains. Wuchty Mol. Biol. Evol. 18, 1694 (2001)]
Which ones are highly connected domains?
The majority of highly connected InterPro domains
appear in signalling pathways. List of the 10
best linked domains in various species.

From left to right: the number of links increases. The number of
signalling domains (PH, SH3), their ligands (proline-rich extensions),
and receptors (GPCR/RHODOPSIN) increases.
⇒ An evolutionary trend toward compartmentalization of the cell and
multicellularity demands a higher degree of organization.
Wuchty Mol. Biol. Evol. 18, 1694 (2001)
Evolutionary Aspects
  • BA-model of scale-free networks is constructed by
    preferential attachment of newly added vertices
    to already well connected ones.
  • Fell and Wagner (2000) argued that vertices with
    many connections in metabolic network were
    metabolites originating very early in the course
    of evolution where they shaped a core metabolism.
  • Analogously, highly connected domains could have
    also originated very early.
  • Is this true?

No. The majority of highly connected domains in Methanococcus and in
E. coli are concerned with maintenance of metabolism. None of the
highly connected domains of higher organisms is found here. On the
other hand, helicase C has roughly similar degrees of connection in all
organisms.
Wuchty Mol. Biol. Evol. 18, 1694 (2001)
  • Expansion of protein families in multicellular
    vertebrates coincides with higher connectivity of
    the respective domains.
  • Extensive shuffling of domains to increase
    combinatorial diversity might provide protein
    sets which are sufficient to preserve cellular
    procedures without dramatically expanding the
    absolute size of the protein complement.
  • greater proteome complexity of higher eukaryotes
    is not simply a consequence of the genome size,
    but must also be a consequence of innovations in
    domain arrangements.
  • highly linked domains represent functional
    centers in various different cellular aspects.
  • They could be treated as evolutionary hubs
    which help to organize the domain space.

Wuchty Mol. Biol. Evol. 18, 1694 (2001)
Network growth mechanism
How can we know what the true growth mechanism of real biological
networks is?
Question 1: Is it important to know this? Yes.
Question 2: What measure do we use to distinguish networks produced by
different growth mechanisms? ⇒ Look at the fine structure (motifs) of
biological networks.

Analysis of Drosophila melanogaster protein interaction network
Data set: protein-protein interaction map for Drosophila by Giot et al.
Problem: the data set is subject to numerous false positives. Giot et
al. assign a confidence score p ∈ [0,1] to each interaction, measuring
how likely the interaction is to occur in vivo.
What threshold p* should be used? Measure the size of the components
for all possible values of p*. Observation: for p* = 0.65, the two
largest components are connected ⇒ use this value as threshold.
Edges in the graph correspond to interactions for which p > p*. Remove
self-interactions and isolated vertices ⇒ 3359 (4625) nodes with 2795
(4683) edges for p* = 0.65 (0.5).
Middendorf et al., DOI q-bio.QM/0408010, arXiv,
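
The threshold scan can be sketched as follows: keep only interactions with confidence above a threshold, drop self-interactions, and list the component sizes. The scored interactions below are hypothetical toy data (not the Giot et al. data set), and the function names are made up for this example.

```python
def threshold_graph(scored, p_star):
    """Keep interactions (u, v, p) with p > p_star and without self-interaction."""
    return [(u, v) for u, v, p in scored if p > p_star and u != v]

def component_sizes(edges):
    """Sizes of the connected components, largest first (union-find).
    Isolated vertices never appear because only edge endpoints are registered."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    sizes = {}
    for x in parent:
        r = find(x)
        sizes[r] = sizes.get(r, 0) + 1
    return sorted(sizes.values(), reverse=True)

# Hypothetical toy interactions: (protein, protein, confidence score)
scored = [("A", "B", 0.90), ("B", "C", 0.70), ("C", "D", 0.60),
          ("D", "E", 0.95), ("E", "F", 0.80), ("F", "F", 0.99),
          ("G", "A", 0.30)]
scan = {p_star: component_sizes(threshold_graph(scored, p_star))
        for p_star in (0.25, 0.5, 0.65)}
```

In this toy example two components of equal size exist at the threshold 0.65 and merge once the threshold is lowered to 0.5.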
Network evolution models considered
Duplication-mutation-complementation (DMC) algorithm: based on a model
which proposes that most of the duplicate genes observed today have
been preserved by functional complementation. If either the gene or its
copy loses one of its functions (edges), the other becomes essential in
assuring the organism's survival.
Algorithm: a duplication step is followed by mutations that preserve
functional complementarity.
  • At every time step, choose a node v at random. A twin vertex v_twin
    is introduced, copying all of v's edges.
  • For each edge of v, delete with probability q_del either the
    original edge or its corresponding edge of v_twin.
  • Conjoin the twins themselves with independent probability q_con,
    representing an interaction of a protein with its own copy.
No edges are created by mutations ⇒ the DMC algorithm assumes that the
probability of creating new advantageous functions by random mutations
is negligible.
Middendorf et al., DOI q-bio.QM/0408010, arXiv,
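
One DMC time step can be sketched as below. This is an illustration under assumptions, not the authors' code: the graph is a dict node -> set of neighbors with nodes labelled 0..N-1, when an edge pair is hit by a deletion the losing side (original or twin) is chosen with equal probability, and the seed graph and the values q_del = 0.5, q_con = 0.3 are arbitrary.

```python
import random

def dmc_step(adj, q_del, q_con, rng):
    """Duplicate a random node, then mutate while keeping complementarity."""
    v = rng.choice(sorted(adj))
    twin = len(adj)                    # assumption: nodes are labelled 0..N-1
    adj[twin] = set(adj[v])            # duplication: twin copies all of v's edges
    for u in adj[twin]:
        adj[u].add(twin)
    for u in list(adj[v]):             # each shared neighbor u
        if rng.random() < q_del:
            loser = rng.choice((v, twin))   # only one of the two copies loses the edge
            adj[loser].discard(u)
            adj[u].discard(loser)
    if rng.random() < q_con:           # link the twins to each other
        adj[v].add(twin)
        adj[twin].add(v)
    return twin

rng = random.Random(5)
adj = {0: {1}, 1: {0}}                 # hypothetical seed graph: one edge
while len(adj) < 50:
    dmc_step(adj, q_del=0.5, q_con=0.3, rng=rng)
```

Because at most one edge of each (original, twin) pair is deleted, every neighbor of v stays connected to v or to its twin. Isolated nodes can still appear and would be removed before analysis.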
Network evolution models considered
Variant of DMC: duplication-random mutations (DMR) algorithm. Possible
interactions between twins are neglected. Instead, edges between v_twin
and the neighbors of v can be removed with probability q_del, and new
edges can be created at random between v_twin and any other vertices
with probability q_new/N, where N is the current total number of
vertices. DMR emphasizes the creation of new advantageous functions by
mutation.
Other models:
- linear preferential attachment (LPA) (Barabasi)
- random static networks (RDS) (Erdös-Renyi)
- random growing networks (RDG): growing graphs where new edges are
  created randomly between existing nodes
- aging vertex networks (AGV): growing graphs modeling citation
  networks, where the probability for new edges decreases with the age
  of the vertex
- small-world networks (SMW): interpolation between regular ring
  lattices and randomly connected graphs
Middendorf et al., DOI q-bio.QM/0408010, arXiv,
Training set
Create 1000 graphs as training data for each of the seven different
models. Every graph is generated with the same number of edges and
nodes as measured in Drosophila.
Quantify the topology of a network by counting all possible subgraphs
up to a given cut-off, which could be the number of nodes, the number
of edges, or the length of a given walk. Here: count all subgraphs that
can be constructed by a walk of length 8 (148 non-isomorphic subgraphs)
or length 7 (130 non-isomorphic subgraphs). Use these counts as input
features for the classifier.
Note that the average shortest path between two nodes of the Drosophila
network's giant component is 11.6 (9.4) for p* = 0.65 (0.5). ⇒ Walks of
length 8 can traverse large parts of the network.
Middendorf et al., DOI q-bio.QM/0408010, arXiv,
Visualization of subgraphs
A qualitative and more intuitive way of
interpreting the classification result is
visualizing the subgraph profiles. Subgraphs
associated with Figures 3 and 1. A representative
subset of 50 subgraphs out of 148 is shown.
Middendorf et al., DOI q-bio.QM/0408010, arXiv,
Learning algorithm: Alternating Decision Tree
Rectangles: decision nodes. A given network's subgraph counts determine
paths in the tree dictated by the inequalities specified by the
decision nodes. For each class, the Alternating Decision Tree outputs a
real-valued prediction score, which is the sum of all weights over all
paths. The class with the highest score wins.
Middendorf et al., DOI q-bio.QM/0408010, arXiv,
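
How an alternating decision tree turns subgraph counts into a score can be sketched as follows. All trees, thresholds, weights and feature names are hypothetical toy values, and this shows only the scoring step, not the boosting procedure that learns the tree.

```python
from dataclasses import dataclass, field

@dataclass
class Prediction:
    score: float                                  # weight added when this node is reached
    splits: list = field(default_factory=list)    # decision nodes hanging below it

@dataclass
class Split:
    feature: str          # which subgraph count is tested
    threshold: float
    below: Prediction     # followed if count < threshold
    above: Prediction     # followed otherwise

def adt_score(node, counts):
    """Sum the weights of all prediction nodes reached for these counts."""
    total = node.score
    for s in node.splits:          # all decision nodes below a reached node are evaluated
        branch = s.below if counts[s.feature] < s.threshold else s.above
        total += adt_score(branch, counts)
    return total

# Hypothetical toy trees, one per class
trees = {
    "DMC": Prediction(-0.25, [Split("3-cycles", 10,
               Prediction(-1.0),
               Prediction(1.5, [Split("8-edge chains", 100,
                                      Prediction(0.5), Prediction(-0.5))]))]),
    "LPA": Prediction(0.25, [Split("3-cycles", 10,
               Prediction(0.75), Prediction(-1.0))]),
}
counts = {"3-cycles": 25, "8-edge chains": 40}
scores = {c: adt_score(tree, counts) for c, tree in trees.items()}
winner = max(scores, key=scores.get)
```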
Performance on training set
Can the Decision Tree separate the graphs generated by the different
growth mechanisms? The confusion matrix shows truth and prediction for
the test sets. 5 out of 7 classes have nearly perfect prediction
accuracy. AGV is constructed as an interpolation between LPA and a ring
lattice ⇒ the AGV, LPA and SMW mechanisms are equivalent in specific
parameter regimes and show a non-negligible overlap.
Middendorf et al., DOI q-bio.QM/0408010, arXiv,
Task: discriminate different growth mechanisms
Ten graphs of two different mechanisms exhibit similar average geodesic
lengths and almost identical degree distributions and clustering
coefficients.
(a) Cumulative degree distribution p(k > k0), average clustering
coefficient ⟨C⟩ and average geodesic length ⟨L⟩, all quantities averaged
over a set of 10 graphs. ⇒ Global topology descriptors cannot separate
between growth mechanisms.
(b) Prediction score for all ten graphs and all five cross-validated
ADTs. The two sets of graphs can now be perfectly separated by the
classifier.
Middendorf et al., DOI q-bio.QM/0408010, arXiv,
Learning algorithm: Alternating Decision Tree
The figure shows the first few decision nodes (out of 120) of a
resulting ADT. The prediction scores reveal that a high count of
3-cycles suggests a DMC network. The DMC mechanism indeed facilitates
the creation of many 3-cycles by allowing 2 copies to attach to each
other, thus creating 3-cycles with their common neighbors. A low count
of 3-cycles but a high count of 8-edge linear chains is a good
predictor for LPA and DMR networks.
Middendorf et al., DOI q-bio.QM/0408010, arXiv,
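
Two of the subgraph counts mentioned here, 3-cycles and linear chains with a given number of edges, can be computed by brute force as sketched below. This is a simplified illustration and not the walk-based enumeration of all 148 subgraphs. It assumes an undirected graph stored as dict node -> set of neighbors with sortable node labels, and it is only practical for small graphs.

```python
def count_3_cycles(adj):
    """Number of triangles; the ordering u < v < w counts each one once."""
    return sum(1 for u in adj for v in adj[u] if u < v
                 for w in adj[v] if v < w and w in adj[u])

def count_linear_chains(adj, n_edges):
    """Number of simple paths with n_edges edges (n_edges >= 1)."""
    def extend(path, remaining):
        if remaining == 0:
            return 1
        return sum(extend(path + [w], remaining - 1)
                   for w in adj[path[-1]] if w not in path)
    # every path is found once from each of its two ends
    return sum(extend([v], n_edges) for v in adj) // 2

# Two small test graphs: the complete graph on 4 nodes and a 4-node chain
k4 = {v: {u for u in range(4) if u != v} for v in range(4)}
chain = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
features = {
    "k4": (count_3_cycles(k4), count_linear_chains(k4, 3)),
    "chain": (count_3_cycles(chain), count_linear_chains(chain, 3)),
}
```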
Prediction for Drosophila melanogaster network
Now use this classifier (ADT) with its good prediction accuracy to
determine the network mechanism that best reproduces the Drosophila
network (or any network of the same size).
Prediction scores for the Drosophila protein network for different
confidence thresholds p* and different cut-offs in subgraph size.
Drosophila is consistently classified as a DMC network, with an
especially strong prediction for a confidence threshold of p* = 0.65,
independently of the cut-off in subgraph size.
Middendorf et al., DOI q-bio.QM/0408010, arXiv,
Subgraph profiles
The average subgraph count of the training data
for every mechanism is shown for the 50
representative subgraphs S1-S50. Black lines
indicate that this model is closest to Drosophila
based on the absolute difference between the
subgraph counts. For 60% of the subgraphs
(S1-S30), the counts for Drosophila are closest
to the DMC model. All of these subgraphs contain
one or more cycles, including highly connected
subgraphs (S1) and long linear chains ending in
cycles (S16, S18, S22, S23, S25).
The DMC algorithm is the only mechanism that
produces such cycles with a high occurrence.
Middendorf et al., DOI q-bio.QM/0408010, arXiv,
Robustness against noise
Edges in the Drosophila network are randomly replaced and the network
is classified. Plotted are prediction scores for each of the 7 classes
as more and more edges are replaced. Every point is an average over 200
independent random replacements.
For a high noise level (beyond 80%), the network is classified as an
Erdös-Renyi (RDS) graph. For low noise (< 30%), the confidence in the
classification as a DMC network is even higher than in the
classification as an
RDS network for high noise. The prediction score y(c) for class c is
related to the estimated probability p(c) for the tested network to be
in class c by $p(c) = e^{2y(c)} / (1 + e^{2y(c)})$.
Middendorf et al., DOI q-bio.QM/0408010, arXiv,
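
The noise procedure (randomly replacing a fraction of the edges while keeping the edge count fixed) can be sketched as follows. The exact replacement rule of the original study is not given in the slides, so this is an assumption: removed edges are chosen uniformly, and new edges are drawn uniformly among node pairs that are not currently linked. The ring network is toy data.

```python
import random

def replace_edges(edges, nodes, fraction, rng):
    """Remove a fraction of the edges and add the same number of random ones."""
    kept = {tuple(sorted(e)) for e in edges}
    total = len(kept)
    for e in rng.sample(sorted(kept), round(fraction * total)):
        kept.remove(e)
    while len(kept) < total:                 # refill with random node pairs
        u, v = rng.sample(nodes, 2)
        kept.add((min(u, v), max(u, v)))     # the set ignores already present pairs
    return kept

rng = random.Random(11)
nodes = list(range(30))
ring = [(i, (i + 1) % 30) for i in nodes]    # hypothetical toy network
noisy = replace_edges(ring, nodes, 0.5, rng)
unchanged = replace_edges(ring, nodes, 0.0, rng)
```

Averaging the classifier scores over many such randomized copies at each noise level gives one point of the plotted curves.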
A very nice (!) method that allows one to infer growth mechanisms for
real networks. The method is robust against noise and data subsampling;
no prior assumption about network features/topology is required.
The learning algorithm does not assume any relationships between
features (e.g. orthogonality). Therefore the input space can be
augmented with various features in addition to subgraph counts.
The protein interaction network of Drosophila is confidently classified
as a DMC network. However, further growth mechanisms need to be
explored in the future. Input from evolutionary biology is needed.
Here, we mostly concentrated on the technique of characterizing the
resulting network topologies.
Middendorf et al., DOI q-bio.QM/0408010, arXiv,