Title: Network Statistics
1 Network Statistics
2 Yeast protein interactions
3 Summary statistics
- Vertex degree distribution (the degree of a vertex is the number of vertices connected to it by an edge)
- Clustering coefficient: the average proportion of neighbours of a vertex that are themselves neighbours
- Shortest distance between two vertices; also average shortest distance, maximal distance, and average of inverse distance (efficiency)
- Betweenness of a vertex: the number of shortest paths that go through a given vertex (similarly for an edge)
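The first three statistics can be sketched on a toy graph as follows. This is a minimal illustration, not code from the talk: the adjacency-dictionary representation, the function names, the toy graph, and the convention that vertices with fewer than two neighbours contribute 0 to the clustering average are all assumptions.

```python
from collections import deque

def degrees(adj):
    """Degree of each vertex: the number of vertices joined to it by an edge."""
    return {v: len(nbrs) for v, nbrs in adj.items()}

def clustering_coefficient(adj):
    """Average over vertices of the proportion of neighbour pairs that are linked."""
    total = 0.0
    for v, nbrs in adj.items():
        nbrs = list(nbrs)
        k = len(nbrs)
        if k < 2:
            continue  # assumed convention: such vertices contribute 0
        links = sum(1 for i in range(k) for j in range(i + 1, k)
                    if nbrs[j] in adj[nbrs[i]])
        total += links / (k * (k - 1) / 2)
    return total / len(adj)

def shortest_distances(adj, source):
    """Breadth-first search: shortest distance from source to each reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def average_shortest_distance(adj):
    """Mean shortest distance over all pairs of distinct, mutually reachable vertices."""
    total, pairs = 0, 0
    for s in adj:
        for t, d in shortest_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs

# Toy graph (hypothetical): triangle 0-1-2 with a pendant vertex 3 attached to 2.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
```

On the toy graph, vertex 2 has degree 3, and only one of its three neighbour pairs is linked, so it contributes 1/3 to the clustering average.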
4 Some examples for real networks (in averages)
Network | Size | Vertex degree | Shortest path | Shortest path in fitted random graph | Clustering | Clustering in random graph
Film actors | 225,226 | 61 | 3.65 | 2.99 | 0.79 | 0.00027
MEDLINE coauthorship | 1,520,251 | 18.1 | 4.6 | 4.91 | 0.43 | 1.8 x 10^-4
E. coli substrate graph | 282 | 7.35 | 2.9 | 3.04 | 0.32 | 0.026
C. elegans | 282 | 14 | 2.65 | 2.25 | 0.28 | 0.05
5 Underlying model assumptions
- Network consisting of vertices and edges
- Randomness in edges
- Here we assume edges are undirected, with no self-loops and no multiple edges
6 Main model 1: Random Graph
- Bernoulli random graph (Erdős-Rényi 1959, 1960)
- L vertices, any two connected by an edge with probability p, independently of each other
- Need not be connected
- Phase transition: once the edge probability p(L) exceeds (log L)/L, the random graph becomes connected with high probability
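The Bernoulli random graph and its connectivity threshold can be explored with a small simulation. This is a sketch under stated assumptions: the function names, the adjacency-dictionary representation, and the use of Python's `random.Random` for reproducibility are illustrative choices, not part of the original slides.

```python
import math
import random

def bernoulli_random_graph(L, p, rng):
    """Join each of the L*(L-1)/2 vertex pairs independently with probability p."""
    adj = {v: set() for v in range(L)}
    for u in range(L):
        for v in range(u + 1, L):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def is_connected(adj):
    """Depth-first search from an arbitrary vertex; connected if it reaches all."""
    if not adj:
        return True
    start = next(iter(adj))
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(adj)

def fraction_connected(L, p, trials, seed=0):
    """Proportion of simulated graphs that are connected."""
    rng = random.Random(seed)
    hits = sum(is_connected(bernoulli_random_graph(L, p, rng))
               for _ in range(trials))
    return hits / trials

def connectivity_threshold(L):
    """The (log L)/L edge probability around which connectivity sets in."""
    return math.log(L) / L
```

Calling, say, `fraction_connected(200, c * connectivity_threshold(200), trials=50)` for a few multipliers `c` below and above 1 gives a feel for how sharply the proportion of connected graphs changes around the threshold.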
7 Main model 2: Watts-Strogatz Small World (1998)
L vertices, each connected to its m nearest neighbours; in addition, random links are added, each with probability p (originally, rewiring edges instead of adding edges was proposed, but then the resulting network need not be connected)
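The "add shortcuts" variant of the small-world model can be sketched as below. Two assumptions are made for illustration: the vertices sit on a ring, and "m nearest neighbours" is read as m neighbours on each side (so 2m hard-wired links per vertex). The function name and representation are hypothetical.

```python
import random

def small_world(L, m, p, rng):
    """Ring lattice with hard-wired links to the m nearest vertices on each side,
    plus a shortcut between every other pair independently with probability p."""
    if L <= 2 * m:
        raise ValueError("need L > 2m so that the ring neighbours are distinct")
    adj = {v: set() for v in range(L)}
    # Hard-wired links: always present, so the network is always connected.
    for v in range(L):
        for step in range(1, m + 1):
            w = (v + step) % L
            adj[v].add(w)
            adj[w].add(v)
    # Shortcuts: only between pairs that are not already joined.
    for u in range(L):
        for v in range(u + 1, L):
            if v not in adj[u] and rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj
```

Because shortcuts are added on top of the ring rather than obtained by rewiring, the ring itself always survives, which is the connectivity advantage noted above.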
8 Main model 3: Scale-free network
- Network growth models: start with one vertex; each new vertex attaches to existing vertices by preferential attachment, that is, it tends to choose a vertex according to that vertex's degree (Barabási-Albert 1999, Price 1965)
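A minimal sketch of growth by preferential attachment follows. It makes two simplifying assumptions that are not in the slides: the process starts from two vertices joined by an edge (so that every existing vertex has positive degree when the first choice is made), and each new vertex attaches by exactly one edge. The function names are illustrative.

```python
import random

def preferential_attachment(n, rng):
    """Grow a graph to n >= 2 vertices; each newcomer links to one existing
    vertex chosen with probability proportional to its current degree."""
    if n < 2:
        raise ValueError("this sketch starts from two linked vertices")
    edges = [(0, 1)]
    # Each vertex appears in `endpoints` once per incident edge, so a uniform
    # draw from this list is a degree-proportional draw over vertices.
    endpoints = [0, 1]
    for new in range(2, n):
        target = rng.choice(endpoints)
        edges.append((new, target))
        endpoints.extend((new, target))
    return edges

def degree_counts(edges):
    """Degree of every vertex that appears in the edge list."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg
```

Tabulating `degree_counts(preferential_attachment(n, random.Random(seed)))` for a large `n` shows a few high-degree vertices alongside many vertices of degree 1.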
9 Watts-Strogatz Small World
- Amenable to mathematical analysis
- More realistic than random graphs
- Shortest path length
- Motif counts
- Vertex degrees
- Predicting links
- Generalization: hard-wired links only present with a certain probability
10 Shortest path length
- Put $\rho = 2(L - 2m - 1)p$, where p is the probability of a shortcut
- Approximation: the continuous model gives an expected shortest path length of approximately
- $\frac{1}{\rho}\left(\frac{1}{2}\log(L\rho) + 0.2886\right)$
- (distribution also available: Barbour and R.)
- In the discrete case, the distribution may be concentrated on one or two points.
11 Example: 6 degrees of separation?
- If the number of vertices is L = 200,000,000, and we observe l = 6, then we can estimate $\rho$ as approximately 1.54
- This gives for L = 60,000,000 that the expected shortest path length is approximately 5.81
- For L = 100,000 it gives approximately 3.73
- For L = 6,500,000,000 it gives approximately 7.33
12 Motif counts
- Triangles: relate to clustering coefficient
- Cycles: biologically relevant
- Distributions approximately compound Poisson
- Can get joint distribution for cycle counts of different lengths (also using compound Poisson): dependence!
- Goal: assess statistical significance of counts
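As a rough illustration of what a motif count is, the sketch below counts triangles and compares the count with its expectation in a plain Bernoulli random graph, where each of the C(L, 3) vertex triples forms a triangle with probability p^3. This simpler reference model is an assumption made for illustration; it is not the compound Poisson approximation mentioned above, and the function names and toy graph are hypothetical.

```python
from itertools import combinations
from math import comb

def count_triangles(adj):
    """Number of vertex triples that are pairwise joined by edges."""
    return sum(1 for u, v, w in combinations(sorted(adj), 3)
               if v in adj[u] and w in adj[u] and w in adj[v])

def expected_triangles_bernoulli(L, p):
    """Expected triangle count when every pair is joined independently
    with probability p (Bernoulli random graph, assumed reference model)."""
    return comb(L, 3) * p ** 3

# Toy graph (hypothetical): triangle 0-1-2 with a pendant vertex 3 attached to 2.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
# Complete graph on four vertices: every triple is a triangle.
k4 = {v: {w for w in range(4) if w != v} for v in range(4)}
```

An observed count far above the reference expectation is only a first hint; judging significance needs the distribution of the count, which is the goal stated on the slide.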
13 Vertex degrees
- Random graph superimposed on hard-wired networks
- Poisson approximation for number of vertices with degree at least k, say
- Normal approximation for joint distribution of some vertex degrees
- Goal: assess scale-free appearance
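To give a feel for the Poisson approximation, the sketch below works in the plain Bernoulli random graph rather than in a random graph superimposed on hard-wired links; that simplification is an assumption made here for illustration. In that model each vertex degree is Binomial(L-1, p), so the expected number of vertices with degree at least k is L times a binomial tail probability, and a Poisson approximation with that mean gives an approximate probability of seeing no such vertex. Function names are hypothetical.

```python
from math import comb, exp

def prob_degree_at_least(L, p, k):
    """P(Binomial(L-1, p) >= k): chance that a given vertex has degree >= k."""
    n = L - 1
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j)
               for j in range(k, n + 1))

def expected_high_degree_vertices(L, p, k):
    """Mean number of vertices with degree at least k."""
    return L * prob_degree_at_least(L, p, k)

def poisson_prob_no_high_degree_vertex(L, p, k):
    """Poisson approximation to P(no vertex has degree >= k)."""
    return exp(-expected_high_degree_vertices(L, p, k))
```

The approximation is most useful when k is large enough that high-degree vertices are rare, so that the mean count is moderate.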
14 Predicting links
- Use Bayesian analysis and biochemical properties to predict which proteins might interact
- Use H. pylori interactions to construct a prior for E. coli interactions
- Assess whether there is small-world structure; if so, use a parametric model
15 Statistical significance
- Clustering coefficient, vertex degrees, shortest path length are not independent
- Long-term goal: joint distribution of summary statistics to assess whether networks are similar or not
16 People
- Research students
- Kaisheng Lin (motif counts, metabolic networks, vertex degrees)
- Pao-Yang Chen (protein interaction networks)
- KimHuat Lim (epidemics on networks)
- Collaborators
- Andrew Barbour (shortest path length)
- Charlotte Deane (protein interaction networks)
- Susan Holmes (bottlenecks)