Title: Data Mining for Complex Network
1Data Mining for Complex Network
- Introduction and Background
2Welcome!
- Instructor: Ruoming Jin
- Homepage: www.cs.kent.edu/jin/
- Office: 264 MCS Building
- Email: jin_at_cs.kent.edu
3Course overview
- The course goal
- First of all, this is a research course, or a special topic course. There is no textbook, and not even a fixed definition of which topics belong under the course name.
- Discussing the state-of-the-art techniques for mining complex networks
- Review paper and course project
4Simple Concepts (Review)
- Erdős-Rényi model (random graph model)
- Markov Chain and Random Walk
- Maximum Likelihood/Model Selection
5Requirement
- Each of you will have one presentation
- Review Paper
- Select a topic and review at least four papers
- Bonus: develop a new idea on this topic
- Course Project
- One or two students per group
- Collect data, preprocess data, analyze the data
- Final Grade: 30% presentations, 25% review, 35% project, and 10% class participation
6What is a network?
- Network: a collection of entities that are interconnected with links.
- people that are friends
- computers that are interconnected
- web pages that point to each other
- proteins that interact
7Graphs
- In mathematics, networks are called graphs, the entities are nodes, and the links are edges
- Graph theory starts in the 18th century, with Leonhard Euler
- The problem of Königsberg bridges
- Since then graphs have been studied extensively.
8Networks in the past
- Graphs have been used in the past to model existing networks (e.g., networks of highways, social networks)
- usually these networks were small
- the network can be studied by visual inspection, which can reveal a lot of information
9Networks now
- More and larger networks appear
- Products of technological advancement
- e.g., Internet, Web
- Result of our ability to collect more, better, and more complex data
- e.g., gene regulatory networks
- Networks of thousands, millions, or billions of nodes
- impossible to visualize
10The internet map
11Understanding large graphs
- What are the statistics of real life networks?
- Can we explain how the networks were generated?
- What else? A still very young field!
- (What are the basic principles, and what do those principles mean?)
12Measuring network properties
- Around 1999
- Watts and Strogatz, Dynamics and small-world phenomenon
- Faloutsos, Faloutsos, and Faloutsos, On power-law relationships of the Internet Topology
- Kleinberg et al., The Web as a graph
- Barabasi and Albert, The emergence of scaling in random networks
13Real network properties
- Most nodes have only a small number of neighbors (degree), but there are some nodes with very high degree (power-law degree distribution)
- scale-free networks
- If a node x is connected to y and z, then y and z are likely to be connected
- high clustering coefficient
- Most nodes are just a few edges away from each other on average.
- small world networks
- Networks from very diverse areas (from internet to biological networks) have similar properties
- Is it possible that there is a unifying underlying generative process?
14Generating random graphs
- Classic graph theory model (Erdős-Rényi)
- each edge is generated independently with probability p
- Very well studied model, but
- most vertices have about the same degree
- the probability of two nodes being linked is independent of whether they share a neighbor
- the average paths are short
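The model described on this slide can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function names `gnp_random_graph` and `degrees`, the seed, and the values of `n` and `p` are all assumptions chosen for the example.

```python
import random
from itertools import combinations

def gnp_random_graph(n, p, seed=None):
    """Return the edge list of a G(n, p) graph on nodes 0..n-1."""
    rng = random.Random(seed)
    # Each of the n(n-1)/2 possible edges is kept independently with probability p.
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]

def degrees(n, edges):
    """Count how many edges touch each node."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

n, p = 1000, 0.01          # illustrative values, not from the slides
edges = gnp_random_graph(n, p, seed=42)
deg = degrees(n, edges)
mean_degree = sum(deg) / n

print("expected mean degree:", p * (n - 1))
print("observed mean degree:", mean_degree)
print("min / max degree:", min(deg), max(deg))
```

In a run like this the minimum and maximum degree stay close to the mean, which is the "most vertices have about the same degree" property that real networks tend to violate.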
15Modeling real networks
- Real life networks are not random
- Can we define a model that generates graphs with statistical properties similar to those in real life?
- a flurry of models for random graphs
16Processes on networks
- Why is it important to understand the structure of networks?
- Epidemiology: viruses propagate much faster in scale-free networks
- Vaccination of random nodes does not work, but targeted vaccination is very effective
17The future of networks
- Networks seem to be here to stay
- More and more systems are modeled as networks
- Scientists from various disciplines are working on networks (physicists, computer scientists, mathematicians, biologists, sociologists, economists)
- There are many questions to understand.
18Basic Mathematical Tools
- Graph theory
- Probability theory
- Linear Algebra
19Graph Theory
- Graph G = (V, E)
- V = set of vertices
- E = set of edges
[Figure: undirected graph on nodes 1, 2, 3, 4, 5]
undirected graph: E = {(1,2), (1,3), (2,3), (3,4), (4,5)}
20Graph Theory
- Graph G = (V, E)
- V = set of vertices
- E = set of edges
[Figure: directed graph on nodes 1, 2, 3, 4, 5]
directed graph: E = {⟨1,2⟩, ⟨2,1⟩, ⟨1,3⟩, ⟨3,2⟩, ⟨3,4⟩, ⟨4,5⟩}
21Undirected graph
- degree d(i) of node i
- number of edges incident on node i
- degree sequence
- [d(1),d(2),d(3),d(4),d(5)]
- [2,2,2,1,1]
- degree distribution
- [(1,2),(2,3)], as (degree, number of nodes) pairs
[Figure: undirected graph on nodes 1, 2, 3, 4, 5]
22Directed Graph
- in-degree d_in(i) of node i
- number of edges pointing to node i
- out-degree d_out(i) of node i
- number of edges leaving node i
- in-degree sequence
- [1,2,1,1,1]
- out-degree sequence
- [2,1,2,1,0]
[Figure: directed graph on nodes 1, 2, 3, 4, 5]
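The in-degree and out-degree sequences on this slide can be recomputed from an edge list. The sketch below assumes the directed example graph has the edges 1→2, 2→1, 1→3, 3→2, 3→4, 4→5 (read from the earlier directed-graph slide); the variable names are illustrative.

```python
from collections import Counter

# Assumed edge list of the directed example graph, as (source, target) pairs.
edges = [(1, 2), (2, 1), (1, 3), (3, 2), (3, 4), (4, 5)]
nodes = [1, 2, 3, 4, 5]

out_deg = Counter(u for u, _ in edges)  # edges leaving each node
in_deg = Counter(v for _, v in edges)   # edges pointing to each node

# Counter returns 0 for nodes that never appear, e.g. node 5 has no out-edges.
in_seq = [in_deg[i] for i in nodes]
out_seq = [out_deg[i] for i in nodes]

print(in_seq)   # [1, 2, 1, 1, 1]
print(out_seq)  # [2, 1, 2, 1, 0]
```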
23Paths
- Path from node i to node j: a sequence of edges (directed or undirected) from node i to node j
- path length: number of edges on the path
- nodes i and j are connected if there is a path between them
- cycle: a path that starts and ends at the same node
[Figure: undirected and directed example graphs on nodes 1, 2, 3, 4, 5]
24Shortest Paths
- Shortest Path from node i to node j
- also known as BFS path, or geodesic path
[Figure: undirected and directed example graphs on nodes 1, 2, 3, 4, 5]
25Diameter
- The longest shortest path in the graph
[Figure: undirected and directed example graphs on nodes 1, 2, 3, 4, 5]
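Shortest paths and the diameter can be computed with breadth-first search on an unweighted graph. The following is a small sketch, assuming the undirected example graph has edges (1,2), (1,3), (2,3), (3,4), (4,5); the helper name `bfs_distances` is made up for this example.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:          # first visit is along a shortest path
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]  # assumed example graph
adj = {i: [] for i in range(1, 6)}
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)

all_dist = {s: bfs_distances(adj, s) for s in adj}
# Diameter: the longest of all shortest paths (the graph is connected here).
diameter = max(max(d.values()) for d in all_dist.values())

print(all_dist[1])  # {1: 0, 2: 1, 3: 1, 4: 2, 5: 3}
print(diameter)     # 3
```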
26Undirected graph
- Connected graph: a graph where every pair of nodes is connected
- Disconnected graph: a graph that is not connected
- Connected Components: maximal subsets of vertices that are connected
[Figure: undirected graph on nodes 1, 2, 3, 4, 5]
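Connected components can be found by repeatedly searching outward from any node not yet visited. The sketch below uses a hypothetical disconnected graph (a triangle on 1, 2, 3 plus a separate edge between 4 and 5), which is not necessarily the graph in the slide's figure.

```python
def connected_components(nodes, edges):
    """Return the connected components of an undirected graph as sets of nodes."""
    adj = {u: set() for u in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first search collects everything reachable from `start`.
        component, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u in component:
                continue
            component.add(u)
            stack.extend(adj[u] - component)
        seen |= component
        components.append(component)
    return components

# Hypothetical example: a triangle and a separate edge.
components = connected_components([1, 2, 3, 4, 5], [(1, 2), (1, 3), (2, 3), (4, 5)])
print([sorted(c) for c in components])  # [[1, 2, 3], [4, 5]]
print(len(components) == 1)             # False: the graph is disconnected
```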
27Fully Connected Graph
- Clique K_n
- A graph that has all possible n(n-1)/2 edges
[Figure: example graph on nodes 1, 2, 3, 4, 5]
28Directed Graph
- Strongly connected graph: there exists a path from every i to every j
- Weakly connected graph: if edges are made to be undirected, the graph is connected
[Figure: directed graph on nodes 1, 2, 3, 4, 5]
29Subgraphs
- Subgraph: Given V' ⊆ V and E' ⊆ E, the graph G' = (V', E') is a subgraph of G.
- Induced subgraph: Given V' ⊆ V, let E' ⊆ E be the set of all edges between the nodes in V'. The graph G' = (V', E') is an induced subgraph of G
[Figure: example graph on nodes 1, 2, 3, 4, 5]
30Trees
- Connected Undirected graphs without cycles
[Figure: example graph on nodes 1, 2, 3, 4, 5]
31Bipartite graphs
- Graphs where the set V can be partitioned into
two sets L and R, such that all edges are between
nodes in L and R, and there is no edge within L
or R
32Linear Algebra
- Adjacency Matrix
- symmetric matrix for undirected graphs
[Figure: example undirected graph on nodes 1, 2, 3, 4, 5]
33Linear Algebra
- Adjacency Matrix
- asymmetric matrix (in general) for directed graphs
[Figure: example directed graph on nodes 1, 2, 3, 4, 5]
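As a small illustration of the two adjacency-matrix slides, the sketch below builds the matrix for an undirected and a directed graph and checks symmetry. The edge lists are assumed to be those of the example graphs from the earlier Graph Theory slides, and the helper names are made up for this example.

```python
def adjacency_matrix(n, edges, directed=False):
    """Build an n x n 0/1 adjacency matrix for nodes labeled 1..n."""
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u - 1][v - 1] = 1
        if not directed:
            A[v - 1][u - 1] = 1   # an undirected edge is stored in both directions
    return A

def is_symmetric(A):
    n = len(A)
    return all(A[i][j] == A[j][i] for i in range(n) for j in range(n))

undirected = adjacency_matrix(5, [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)])
directed = adjacency_matrix(5, [(1, 2), (2, 1), (1, 3), (3, 2), (3, 4), (4, 5)],
                            directed=True)

print(is_symmetric(undirected))        # True
print(is_symmetric(directed))          # False
print([sum(row) for row in directed])  # row sums are out-degrees: [2, 1, 2, 1, 0]
```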
34Random Walks
- Start from a node, and follow links uniformly at random.
- Stationary distribution: the fraction of times that you visit node i, as the number of steps of the random walk approaches infinity
- if the graph is strongly connected, the stationary distribution converges to a unique vector.
35Random Walks
- stationary distribution: principal left eigenvector of the normalized adjacency matrix P
- x = xP
- for undirected graphs, the stationary distribution is proportional to the node degrees
[Figure: example graph on nodes 1, 2, 3, 4, 5]
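The relation x = xP can be checked numerically by repeatedly multiplying a starting distribution by P. This is a rough sketch under stated assumptions: the graph is the undirected example with edges (1,2), (1,3), (2,3), (3,4), (4,5), and plain power iteration converges here because that graph is connected and contains a triangle (so the walk is not periodic). The function name and step count are illustrative.

```python
def stationary_distribution(P, steps=500):
    """Approximate x = xP by iterating from the uniform distribution."""
    n = len(P)
    x = [1.0 / n] * n
    for _ in range(steps):
        # One step of the walk: new_x[j] = sum_i x[i] * P[i][j]
        x = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
    return x

edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]  # assumed example graph
n = 5
A = [[0] * n for _ in range(n)]
for u, v in edges:
    A[u - 1][v - 1] = A[v - 1][u - 1] = 1

degree = [sum(row) for row in A]
# Normalized adjacency matrix: from node i, each neighbor is chosen with prob 1/d(i).
P = [[a / degree[i] for a in A[i]] for i in range(n)]

pi = stationary_distribution(P)
expected = [d / sum(degree) for d in degree]  # d(i) / 2m for undirected graphs

print([round(p, 4) for p in pi])
print(expected)
```

For an undirected graph the two printed vectors should agree up to rounding, which is the "proportional to degree" statement on the slide.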
36Eigenvalues and Eigenvectors
- The value λ is an eigenvalue of matrix A if there exists a non-zero vector x such that Ax = λx. Vector x is an eigenvector of matrix A
- The largest eigenvalue is called the principal eigenvalue
- The corresponding eigenvector is the principal eigenvector
- Corresponds to the direction of maximum change
37Types of networks
- Social networks
- Knowledge (Information) networks
- Technology networks
- Biological networks
38Social Networks
- Links denote a social interaction
- Networks of acquaintances
- actor networks
- co-authorship networks
- director networks
- phone-call networks
- e-mail networks
- IM networks
- Microsoft buddy network
- Bluetooth networks
- sexual networks
- home page networks
39Knowledge (Information) Networks
- Nodes store information, links associate information
- Citation network (directed acyclic)
- The Web (directed)
- Peer-to-Peer networks
- Word networks
- Networks of Trust
- Bluetooth networks
40Technological networks
- Networks built for distribution of commodity
- The Internet
- router level, AS level
- Power Grids
- Airline networks
- Telephone networks
- Transportation Networks
- roads, railways, pedestrian traffic
- Software graphs
41Biological networks
- Biological systems represented as networks
- Protein-Protein Interaction Networks
- Gene regulation networks
- Metabolic pathways
- The Food Web
- Neural Networks
42Now what?
- The world is full of networks. What do we do with them?
- understand their topology and measure their properties
- study their evolution and dynamics
- create realistic models
- create algorithms that make use of the network structure
43Measuring Networks
- Degree distributions
- Small world phenomena
- Clustering Coefficient
- Mixing patterns
- Degree correlations
- Communities and clusters
44Degree distributions
f_k = fraction of nodes with degree k
    = probability that a randomly selected node has degree k
[Figure: plot of frequency f_k against degree k]
- Problem: find the probability distribution that best fits the observed data
45Power-law distributions
- The degree distributions of most real-life networks follow a power law
- Right-skewed/Heavy-tail distribution
- there is a non-negligible fraction of nodes that has very high degree (hubs)
- scale-free: no characteristic scale, average is not informative
- In stark contrast with the random graph model!
- highly concentrated around the mean
- the probability of very high degree nodes is exponentially small
p(k) = C k^{-α}
46Power-law signature
- Power-law distribution gives a line in the log-log plot
- α: power-law exponent (typically 2 ≤ α ≤ 3)
log p(k) = -α log k + log C
[Figure: frequency against degree, and log frequency against log degree, a line with slope -α]
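The log-log signature can be illustrated on synthetic data. The sketch below generates exact values of p(k) = C k^{-α} for assumed constants (α = 2.5, C = 0.7, both chosen only for illustration) and recovers the slope with ordinary least squares on the logs. This is only meant to show the straight-line relationship; on noisy empirical degree data a plain least-squares fit is a crude estimator of the exponent.

```python
import math

def loglog_fit(ks, ps):
    """Least-squares slope and intercept of log p(k) against log k."""
    xs = [math.log(k) for k in ks]
    ys = [math.log(p) for p in ps]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

alpha, C = 2.5, 0.7                 # illustrative constants, not from the slides
ks = list(range(1, 101))
ps = [C * k ** (-alpha) for k in ks]

slope, intercept = loglog_fit(ks, ps)
print("slope (should be close to -alpha):", slope)
print("exp(intercept) (should be close to C):", math.exp(intercept))
```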
47Examples
Taken from Newman 2003
48A random graph example
49Maximum degree
- For random graphs, the maximum degree is highly concentrated around the average degree z
- For power law graphs
- Rough argument: solve n P[X ≥ k] = 1
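The rough argument can be worked out as follows. This is a sketch under the assumptions that the degree distribution is p(k) = C k^{-α} with α > 1 and that the tail sum is approximated by an integral; the symbol k_max for the resulting maximum degree is introduced here for illustration.

```latex
\begin{align*}
P[X \ge k] &\approx \int_k^{\infty} C\,x^{-\alpha}\,dx
            = \frac{C}{\alpha-1}\,k^{-(\alpha-1)}, \qquad \alpha > 1,\\
n\,P[X \ge k_{\max}] = 1
  &\;\Longrightarrow\;
  k_{\max} = \left(\frac{C\,n}{\alpha-1}\right)^{1/(\alpha-1)}
  \;\propto\; n^{1/(\alpha-1)}.
\end{align*}
```

Under these assumptions the maximum degree grows polynomially with the number of nodes n, in contrast to the random graph case where it stays close to the average degree.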
50Exponential distribution
- Observed in some technological or collaboration networks
- Identified by a line in the log-linear plot
p(k) = λ e^{-λk}
log p(k) = -λk + log λ
[Figure: log frequency against degree, a line with slope -λ]
51Collective Statistics (M. Newman 2003)
52Clustering (Transitivity) coefficient
- Measures the density of triangles (local clusters) in the graph
- Two different ways to measure it
- The ratio of the means
53Example
[Figure: example graph on nodes 1, 2, 3, 4, 5]
54Clustering (Transitivity) coefficient
- Clustering coefficient for node i
- The mean of the ratios
55Example
- The two clustering coefficients give different measures
- C(2) gives more weight to nodes with low degree
[Figure: example graph on nodes 1, 2, 3, 4, 5]
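The difference between the two definitions can be seen on a small graph. The sketch below assumes a hypothetical five-node graph (a triangle on nodes 1, 2, 3, with node 3 also linked to 4 and 5), which may or may not match the slide's figure. It also assumes the common convention that a node with fewer than two neighbors has local clustering 0. Here `c1` is the "ratio of the means" (total triangles over total connected triples) and `c2` is the "mean of the ratios" (average of the per-node coefficients).

```python
from fractions import Fraction
from itertools import combinations

def clustering_coefficients(nodes, edges):
    adj = {u: set() for u in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    triangles, triples = {}, {}
    for i in nodes:
        k = len(adj[i])
        # Connected triples centered at i: unordered pairs of neighbors of i.
        triples[i] = k * (k - 1) // 2
        # Triangles at i: neighbor pairs that are themselves linked.
        triangles[i] = sum(1 for a, b in combinations(sorted(adj[i]), 2)
                           if b in adj[a])

    # Ratio of the means: sum of triangles over sum of triples.
    c1 = Fraction(sum(triangles.values()), sum(triples.values()))
    # Mean of the ratios: average local coefficient (0 when degree < 2, by assumption).
    local = [Fraction(triangles[i], triples[i]) if triples[i] else Fraction(0)
             for i in nodes]
    c2 = sum(local) / len(nodes)
    return c1, c2

edges = [(1, 2), (1, 3), (2, 3), (3, 4), (3, 5)]  # hypothetical example graph
c1, c2 = clustering_coefficients([1, 2, 3, 4, 5], edges)
print(c1, c2)  # 3/8 13/30
```

The low-degree nodes 1 and 2 each contribute a full local coefficient of 1 to `c2`, while in `c1` they are outweighed by the six triples centered at the hub node 3.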
56Collective Statistics (M. Newman 2003)
57Clustering coefficient for random graphs
- The probability of two of your neighbors also being neighbors is p, independent of local structure
- clustering coefficient C = p
- when z is fixed, C = z/n = O(1/n)
58Small world phenomena
- Small worlds: networks with short paths
Stanley Milgram (1933-1984): The man who shocked the world
Obedience to authority (1963)
Small world experiment (1967)
59Small world experiment
- Letters were handed out to people in Nebraska to be sent to a target in Boston
- People were instructed to pass on the letters to someone they knew on first-name basis
- The letters that reached the destination followed paths of length around 6
- Six degrees of separation (play of John Guare)
- Also
- The Kevin Bacon game
- The Erdős number
- Small world project: http://smallworld.columbia.edu/index.html
60Measuring the small world phenomenon
- d_ij = shortest path between i and j
- Diameter
- Characteristic path length
- Harmonic mean
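The three measures listed here can be computed from all-pairs shortest-path lengths. The sketch below makes several assumptions that are not spelled out on the slide: the graph is the connected undirected example with edges (1,2), (1,3), (2,3), (3,4), (4,5); the characteristic path length is the plain average of d_ij over all unordered pairs of distinct nodes; and the harmonic mean is the reciprocal of the average of 1/d_ij over the same pairs.

```python
from collections import deque
from itertools import combinations

def bfs_distances(adj, source):
    """Shortest-path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

nodes = [1, 2, 3, 4, 5]
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]  # assumed example graph
adj = {u: [] for u in nodes}
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)

dist = {s: bfs_distances(adj, s) for s in nodes}
pair_dists = [dist[i][j] for i, j in combinations(nodes, 2)]

diameter = max(pair_dists)
char_path_length = sum(pair_dists) / len(pair_dists)
harmonic_mean = len(pair_dists) / sum(1 / d for d in pair_dists)

print(diameter)          # 3
print(char_path_length)  # 1.7
print(harmonic_mean)     # about 1.395
```

One reason to consider the harmonic mean is that it stays well defined on disconnected graphs if unreachable pairs are taken to contribute 1/d_ij = 0, whereas the plain average does not.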
61Collective Statistics (M. Newman 2003)
62Mixing patterns
- Assume that we have various types of nodes. What is the probability that two nodes of different type are linked?
- assortative mixing (homophily)
E = mixing matrix
p(i,j) = mixing probability
p(j | i) = conditional mixing probability
63Mixing coefficient
- Gupta, Anderson, May 1989
- Advantages
- Q = 1 if the matrix is diagonal
- Q = 0 if the matrix is uniform
- Disadvantages
- sensitive to transposition
- does not weight the entries
64Mixing coefficient
- Newman 2003
- Advantages
- r = 1 for diagonal matrix, r = 0 for uniform matrix
- not sensitive to transposition, accounts for weighting
(row marginal)
(column marginal)
r = 0.621
Q = 0.528
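The two mixing coefficients can be sketched as below. The formulas are assumptions taken from the way these measures are usually stated (for example in Newman's 2003 review), since the slide formulas themselves did not survive: Q = (Σ_i p(i|i) - 1) / (N - 1) for N node types, and r = (Σ_i e_ii - Σ_i a_i b_i) / (1 - Σ_i a_i b_i), where e is the mixing matrix normalized to sum to 1 and a_i, b_i are its row and column marginals. The function names and the two test matrices are illustrative, and the slide's example values (r = 0.621, Q = 0.528) are not reproduced here because the underlying matrix is not given.

```python
def gupta_q(e):
    """Q computed from the conditional probabilities p(i | i) on the diagonal."""
    n = len(e)
    # p(j | i) = e[i][j] / (row sum of i); only the diagonal entries are needed.
    diag_cond = sum(e[i][i] / sum(e[i]) for i in range(n))
    return (diag_cond - 1) / (n - 1)

def newman_r(e):
    """r computed from the trace and the row/column marginals of e."""
    n = len(e)
    a = [sum(e[i]) for i in range(n)]                       # row marginals
    b = [sum(e[i][j] for i in range(n)) for j in range(n)]  # column marginals
    trace = sum(e[i][i] for i in range(n))
    ab = sum(a[i] * b[i] for i in range(n))
    return (trace - ab) / (1 - ab)

diagonal = [[0.5, 0.0], [0.0, 0.5]]    # nodes only link within their own type
uniform = [[0.25, 0.25], [0.25, 0.25]]  # links ignore node type

print(gupta_q(diagonal), newman_r(diagonal))  # 1.0 1.0
print(gupta_q(uniform), newman_r(uniform))    # 0.0 0.0
```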
65Degree correlations
- Do high degree nodes tend to link to high degree nodes?
- Pastor-Satorras et al.
- plot the mean degree of the neighbors as a function of the degree
- Newman
- compute the correlation coefficient of the degrees of the two endpoints of an edge
- assortative/disassortative
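The correlation-coefficient approach can be sketched as a Pearson correlation over edge endpoints. This is a simplified illustration, assuming an undirected graph where each edge is counted in both orientations and full degrees are used (some formulations use the "remaining" degree, which gives the same correlation since it only shifts every value by one). The function name and the two test graphs are made up for the example, and the function is undefined for regular graphs, where the degree variance is zero.

```python
def degree_correlation(edges):
    """Pearson correlation of the degrees at the two ends of an edge."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1

    # Count each undirected edge in both orientations so the measure is symmetric.
    xs = [deg[u] for u, v in edges] + [deg[v] for u, v in edges]
    ys = [deg[v] for u, v in edges] + [deg[u] for u, v in edges]

    n = len(xs)
    mean = sum(xs) / n  # xs and ys hold the same values, so they share a mean
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return cov / var

star = [(0, 1), (0, 2), (0, 3), (0, 4)]  # hub linked only to leaves
path = [(1, 2), (2, 3), (3, 4)]

print(degree_correlation(star))  # -1.0, perfectly disassortative
print(degree_correlation(path))  # about -0.5
```

A positive value would indicate assortative mixing by degree (high-degree nodes linking to high-degree nodes), a negative value disassortative mixing.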
66Collective Statistics (M. Newman 2003)
67Communities and Clusters
- Use the graph structure to discover communities of nodes
- essentially clustering and classification on graphs
68Other measures
- Frequent (or interesting) motifs
- bipartite cliques in the web graph
- patterns in biological and software graphs
- Use graphlets to compare models (Przulj, Corneil, Jurisica 2004)
69Other measures
- Network resilience
- against random or targeted node deletions
- Graph eigenvalues
70Other measures
- The giant component
- Other?
71References
- M. E. J. Newman, The structure and function of complex networks, SIAM Review, 45(2): 167-256, 2003
- M. E. J. Newman, Random graphs as models of networks, in Handbook of Graphs and Networks, S. Bornholdt and H. G. Schuster (eds.), Wiley-VCH, Berlin (2003).
- N. Alon and J. Spencer, The Probabilistic Method