Data Mining for Complex Network

Slides: 72

Provided by: admi2704

Learn more at: https://www.cs.kent.edu

Transcript and Presenter's Notes

1
Data Mining for Complex Network

Introduction and Background

2
Welcome!

Instructor: Ruoming Jin
Homepage: www.cs.kent.edu/jin/
Office: 264 MCS Building
Email: jin_at_cs.kent.edu

3
Course overview

The course goal
First of all, this is a research course, or a special topic course. There is no textbook, and there is not even a fixed definition of which topics belong under the course name.
Discussing the state-of-the-art techniques for mining complex networks
Review, Course Project

4
Simple Concepts (Review)

Erdős–Rényi model (random graph model)
Markov Chain and Random Walk
Maximum Likelihood/Model Selection

5
Requirement

Each of you will have one presentation
Review Paper
Select a topic and review at least four papers
Bonus: develop a new idea on this topic
Course Project
One or two people per group
Collect data, preprocess the data, analyze the data
Final Grade: 30% presentations, 25% review, 35% project, and 10% class participation

6
What is a network?

Network: a collection of entities that are interconnected with links.
people that are friends
computers that are interconnected
web pages that point to each other
proteins that interact

7
Graphs

In mathematics, networks are called graphs, the
entities are nodes, and the links are edges
Graph theory starts in the 18th century, with
Leonhard Euler
The problem of Königsberg bridges
Since then graphs have been studied extensively.

8
Networks in the past

Graphs have been used in the past to model
existing networks (e.g., networks of highways,
social networks)
usually these networks were small
a small network can be studied by visual inspection, which can reveal a lot of information

9
Networks now

More and larger networks appear
Products of technological advancement
e.g., Internet, Web
Result of our ability to collect more, better,
and more complex data
e.g., gene regulatory networks
Networks of thousands, millions, or billions of
nodes
impossible to visualize

10
The internet map
11
Understanding large graphs

What are the statistics of real life networks?
Can we explain how the networks were generated?
What else? A still very young field!
(What are the basic principles, and what do those principles mean?)

12
Measuring network properties

Around 1999
Watts and Strogatz, Collective dynamics of "small-world" networks
Faloutsos, Faloutsos, and Faloutsos, On power-law relationships of the Internet topology
Kleinberg et al., The Web as a graph
Barabási and Albert, Emergence of scaling in random networks

13
Real network properties

Most nodes have only a small number of neighbors
(degree), but there are some nodes with very high
degree (power-law degree distribution)
scale-free networks
If a node x is connected to y and z, then y and z
are likely to be connected
high clustering coefficient
Most nodes are just a few edges away on average.
small world networks
Networks from very diverse areas (from internet
to biological networks) have similar properties
Is it possible that there is a unifying
underlying generative process?

14
Generating random graphs

Classic graph theory model (Erdős–Rényi)
each edge is generated independently with probability p
Very well studied model, but:
most vertices have about the same degree
the probability of two nodes being linked is independent of whether they share a neighbor
the average paths are short
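
The model above can be sketched in a few lines. This is a minimal illustration, not the course's code: the function names `erdos_renyi` and `degrees`, the node labels 0..n-1, and the parameter values are all assumptions chosen for the example. It generates each possible edge independently with probability p and then reports how tightly the degrees cluster around the expected value p(n-1).

```python
import random
from itertools import combinations


def erdos_renyi(n, p, seed=None):
    """Return the edge list of a G(n, p) random graph on nodes 0..n-1."""
    rng = random.Random(seed)
    # Each of the n(n-1)/2 possible edges is kept independently with probability p.
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]


def degrees(n, edges):
    """Return the list of node degrees for an undirected edge list."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


# Hypothetical parameters, chosen only for illustration.
n, p = 1000, 0.01
edges = erdos_renyi(n, p, seed=0)
deg = degrees(n, edges)
mean_degree = sum(deg) / n
print("expected degree:", p * (n - 1))
print("observed mean, min, max degree:", mean_degree, min(deg), max(deg))
```

The minimum and maximum printed at the end give a quick feel for the "most vertices have about the same degree" property.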

15
Modeling real networks

Real life networks are not random
Can we define a model that generates graphs with
statistical properties similar to those in real
life?
a flurry of models for random graphs

16
Processes on networks

Why is it important to understand the structure
of networks?
Epidemiology: viruses propagate much faster in scale-free networks
Vaccination of random nodes does not work, but targeted vaccination is very effective

17
The future of networks

Networks seem to be here to stay
More and more systems are modeled as networks
Scientists from various disciplines are working on networks (physicists, computer scientists, mathematicians, biologists, sociologists, economists)
There are many questions to understand.

18
Basic Mathematical Tools

Graph theory
Probability theory
Linear Algebra

19
Graph Theory

Graph G = (V, E)
V: set of vertices
E: set of edges

(figure: undirected graph on nodes 1 to 5)
undirected graph: E = {(1,2), (1,3), (2,3), (3,4), (4,5)}

20
Graph Theory

Graph G = (V, E)
V: set of vertices
E: set of edges

(figure: directed graph on nodes 1 to 5)
directed graph: E = {(1,2), (2,1), (1,3), (3,2), (3,4), (4,5)}

21
Undirected graph

degree d(i) of node i: number of edges incident on node i

degree sequence: [d(1), d(2), d(3), d(4), d(5)] = [2, 2, 2, 1, 1]

degree distribution, as (degree, number of nodes) pairs: [(1,2), (2,3)]
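
A small sketch of how the degree sequence and degree distribution can be computed from an edge list. The figure for this slide did not survive, so the edge list below is a hypothetical one, chosen only because it reproduces the sequence [2, 2, 2, 1, 1] listed above; the function names are likewise illustrative.

```python
from collections import Counter

# Hypothetical undirected graph (an assumption, not taken from the slide's figure).
nodes = [1, 2, 3, 4, 5]
edges = [(1, 2), (1, 3), (2, 3), (4, 5)]


def degree_sequence(nodes, edges):
    """Return [d(v) for v in nodes], counting edges incident on each node."""
    deg = {v: 0 for v in nodes}
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return [deg[v] for v in nodes]


def degree_distribution(seq):
    """Return sorted (degree, number of nodes with that degree) pairs."""
    return sorted(Counter(seq).items())


seq = degree_sequence(nodes, edges)
dist = degree_distribution(seq)
print(seq)   # [2, 2, 2, 1, 1]
print(dist)  # [(1, 2), (2, 3)]
```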

22
Directed Graph

in-degree d_in(i) of node i: number of edges pointing to node i

out-degree d_out(i) of node i: number of edges leaving node i

in-degree sequence: [1, 2, 1, 1, 1]
out-degree sequence: [2, 1, 2, 1, 0]
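
As a check on the two sequences above, here is a short sketch that counts in-degrees and out-degrees for the directed edge list given on the earlier directed "Graph Theory" slide. The function name `in_out_degrees` is an assumption made for this example.

```python
# Directed edge list from the earlier "Graph Theory" slide; (u, v) means u -> v.
nodes = [1, 2, 3, 4, 5]
edges = [(1, 2), (2, 1), (1, 3), (3, 2), (3, 4), (4, 5)]


def in_out_degrees(nodes, edges):
    """Return (in-degree sequence, out-degree sequence) in node order."""
    d_in = {v: 0 for v in nodes}
    d_out = {v: 0 for v in nodes}
    for u, v in edges:
        d_out[u] += 1  # edge leaves u
        d_in[v] += 1   # edge points to v
    return [d_in[v] for v in nodes], [d_out[v] for v in nodes]


in_seq, out_seq = in_out_degrees(nodes, edges)
print(in_seq)   # [1, 2, 1, 1, 1]
print(out_seq)  # [2, 1, 2, 1, 0]
```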
23
Paths

Path from node i to node j: a sequence of edges (directed or undirected) from node i to node j
path length: number of edges on the path
nodes i and j are connected if there is a path between them
cycle: a path that starts and ends at the same node

24
Shortest Paths

Shortest path from node i to node j: a path with the fewest edges among all paths from i to j
also known as BFS path, or geodesic path
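
The name "BFS path" comes from breadth-first search, which finds a shortest path in an unweighted graph. A minimal sketch follows, using the undirected edge list from the earlier "Graph Theory" slide; the helper names `adjacency_list` and `bfs_shortest_path` are assumptions for this example.

```python
from collections import deque

# Undirected edge list from the earlier "Graph Theory" slide.
nodes = [1, 2, 3, 4, 5]
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]


def adjacency_list(nodes, edges):
    """Build an undirected adjacency list {node: [neighbors]}."""
    adj = {v: [] for v in nodes}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj


def bfs_shortest_path(adj, source, target):
    """Return one shortest path from source to target as a node list, or None."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            # Walk back through the parents to recover the path.
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj[u]:
            if v not in parent:  # first visit is along a shortest route
                parent[v] = u
                queue.append(v)
    return None


adj = adjacency_list(nodes, edges)
path = bfs_shortest_path(adj, 1, 5)
print(path, "length:", len(path) - 1)  # [1, 3, 4, 5] length: 3
```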
25
Diameter

The longest shortest path in the graph

26
Undirected graph

Connected graph: a graph in which every pair of nodes is connected
Disconnected graph: a graph that is not connected
Connected components: maximal subsets of vertices that are connected
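
A short sketch of how connected components can be found with a graph traversal. The example graph is hypothetical (the slide's figure is not available), and `connected_components` is an illustrative name.

```python
def connected_components(nodes, edges):
    """Return the connected components of an undirected graph as a list of sets."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first traversal collects everything reachable from `start`.
        component = set()
        stack = [start]
        seen.add(start)
        while stack:
            u = stack.pop()
            component.add(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        components.append(component)
    return components


# Hypothetical disconnected graph: a triangle plus a separate edge.
components = connected_components([1, 2, 3, 4, 5], [(1, 2), (1, 3), (2, 3), (4, 5)])
print(components)  # [{1, 2, 3}, {4, 5}]
```

A graph is connected exactly when this returns a single component.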
27
Fully Connected Graph

Clique K_n: a graph that has all possible n(n-1)/2 edges
28
Directed Graph

Strongly connected graph: there exists a path from every i to every j

Weakly connected graph: if the edges are made undirected, the graph is connected
29
Subgraphs

Subgraph: given V' ⊆ V and E' ⊆ E, the graph G' = (V', E') is a subgraph of G.
Induced subgraph: given V' ⊆ V, let E' ⊆ E be the set of all edges between the nodes in V'. The graph G' = (V', E') is an induced subgraph of G.
30
Trees

Connected Undirected graphs without cycles

31
Bipartite graphs

Graphs where the set V can be partitioned into
two sets L and R, such that all edges are between
nodes in L and R, and there is no edge within L
or R

32
Linear Algebra

Adjacency Matrix
symmetric matrix for undirected graphs

33
Linear Algebra

Adjacency Matrix
asymmetric matrix (in general) for directed graphs
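
The matrices shown on these two slides did not survive, so here is a minimal sketch that rebuilds them from the two edge lists on the earlier "Graph Theory" slides. It assumes the usual convention A[i][j] = 1 when there is an edge from node i to node j; the function names are illustrative.

```python
def adjacency_matrix(n, edges, directed=False):
    """Return the n x n 0/1 adjacency matrix for nodes labelled 1..n."""
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u - 1][v - 1] = 1
        if not directed:
            A[v - 1][u - 1] = 1  # an undirected edge is stored in both directions
    return A


def is_symmetric(A):
    """Return True if A equals its transpose."""
    n = len(A)
    return all(A[i][j] == A[j][i] for i in range(n) for j in range(n))


undirected_edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
directed_edges = [(1, 2), (2, 1), (1, 3), (3, 2), (3, 4), (4, 5)]

A_undirected = adjacency_matrix(5, undirected_edges)
A_directed = adjacency_matrix(5, directed_edges, directed=True)

for row in A_undirected:
    print(row)
print(is_symmetric(A_undirected), is_symmetric(A_directed))  # True False
```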
34
Random Walks

Start from a node, and follow links uniformly at
random.
Stationary distribution: the fraction of times that you visit node i, as the number of steps of the random walk approaches infinity
if the graph is strongly connected, these fractions converge to a unique vector.

35
Random Walks

stationary distribution: the principal left eigenvector of the normalized adjacency matrix P
x = xP
for undirected graphs, the stationary distribution is proportional to the node degrees
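
The relation x = xP can be checked numerically. The sketch below is one way to do it under stated assumptions: it uses the undirected edge list from the earlier "Graph Theory" slide, builds P by dividing each row of the adjacency matrix by the node's degree, and approximates x by repeatedly multiplying a starting vector by P (power iteration, chosen here for simplicity). It assumes there are no isolated nodes; the function names are illustrative.

```python
def transition_matrix(nodes, edges):
    """Row-normalized adjacency matrix P of an undirected graph (no isolated nodes)."""
    index = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    A = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        A[index[u]][index[v]] = 1.0
        A[index[v]][index[u]] = 1.0
    # P[i][j] = probability of stepping from node i to node j.
    return [[a / sum(row) for a in row] for row in A]


def stationary_distribution(P, steps=1000):
    """Approximate the left eigenvector x = xP by power iteration."""
    n = len(P)
    x = [1.0 / n] * n
    for _ in range(steps):
        x = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
    return x


nodes = [1, 2, 3, 4, 5]
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
P = transition_matrix(nodes, edges)
pi = stationary_distribution(P)

# For an undirected graph, pi(i) should equal d(i) / (2 * number of edges).
deg = [2, 2, 3, 2, 1]
expected = [d / (2 * len(edges)) for d in deg]
print([round(p, 6) for p in pi])
print(expected)
```

The iteration converges here because the example graph is connected and contains a triangle, so the walk is not periodic.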
36
Eigenvalues and Eigenvectors

The value λ is an eigenvalue of matrix A if there exists a non-zero vector x such that Ax = λx.
Vector x is an eigenvector of matrix A
The largest eigenvalue is called the principal
eigenvalue
The corresponding eigenvector is the principal
eigenvector
Corresponds to the direction of maximum change

37
Types of networks

Social networks
Knowledge (Information) networks
Technology networks
Biological networks

38
Social Networks

Links denote a social interaction
Networks of acquaintances
actor networks
co-authorship networks
director networks
phone-call networks
e-mail networks
IM networks
Microsoft buddy network
Bluetooth networks
sexual networks
home page networks

39
Knowledge (Information) Networks

Nodes store information, links associate
information
Citation network (directed acyclic)
The Web (directed)
Peer-to-Peer networks
Word networks
Networks of Trust
Bluetooth networks

40
Technological networks

Networks built for the distribution of a commodity
The Internet
router level, AS level
Power Grids
Airline networks
Telephone networks
Transportation Networks
roads, railways, pedestrian traffic
Software graphs

41
Biological networks

Biological systems represented as networks
Protein-Protein Interaction Networks
Gene regulation networks
Metabolic pathways
The Food Web
Neural Networks

42
Now what?

The world is full of networks. What do we do with them?
understand their topology and measure their
properties
study their evolution and dynamics
create realistic models
create algorithms that make use of the network
structure

43
Measuring Networks

Degree distributions
Small world phenomena
Clustering Coefficient
Mixing patterns
Degree correlations
Communities and clusters

44
Degree distributions
f_k: fraction of nodes with degree k
= probability that a randomly selected node has degree k
(figure: frequency f_k plotted against degree k)

Problem: find the probability distribution that best fits the observed data

45
Power-law distributions

The degree distributions of most real-life networks follow a power law
$p(k) = C k^{-\alpha}$
Right-skewed/heavy-tailed distribution
there is a non-negligible fraction of nodes that has very high degree (hubs)
scale-free: no characteristic scale, the average is not informative
In stark contrast with the random graph model, where the degree distribution is
highly concentrated around the mean
the probability of very high degree nodes is exponentially small

46
Power-law signature

Power-law distribution gives a line in the log-log plot
$\log p(k) = -\alpha \log k + \log C$
α: power-law exponent (typically 2 ≤ α ≤ 3)
(figure: frequency against degree, and log frequency against log degree, where the line has slope -α)
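
The log-log signature suggests a simple way to read off the exponent: fit a straight line to (log k, log p(k)) and take the negative slope. The sketch below does this with ordinary least squares on synthetic, noise-free values of p(k) = C k^(-α); the values of α and C, the range of k, and the function name `fit_line` are assumptions for the example. On real, noisy degree data a least-squares fit of this kind is only a rough diagnostic.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic power law with hypothetical parameters (not data from the slides).
alpha_true, C = 2.5, 0.5
ks = list(range(1, 101))
pk = [C * k ** (-alpha_true) for k in ks]

slope, intercept = fit_line([math.log(k) for k in ks], [math.log(p) for p in pk])
alpha_hat = -slope          # slope of the log-log line is -alpha
C_hat = math.exp(intercept)  # intercept of the log-log line is log C
print(alpha_hat, C_hat)
```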
47
Examples
Taken from Newman 2003
48
A random graph example
49
Maximum degree

For random graphs, the maximum degree is highly concentrated around the average degree z
For power-law graphs, a rough argument gives the maximum degree: solve $n P[X \ge k] = 1$ for k
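
The rough argument can be worked out as follows. This is a sketch under stated assumptions: the degree distribution is treated as continuous, α > 1, p(k) = C k^(-α) as on the power-law slide, and k_max (a symbol introduced here) is the degree above which about one node is expected.

```latex
% Assumptions: continuous approximation of p(k) = C k^{-\alpha}, with \alpha > 1.
\begin{align*}
P[X \ge k] &\approx \int_k^{\infty} C x^{-\alpha}\,dx
            = \frac{C}{\alpha - 1}\, k^{-(\alpha - 1)}, \\
n\,P[X \ge k_{\max}] = 1
  &\;\Longrightarrow\;
  k_{\max} = \left(\frac{nC}{\alpha - 1}\right)^{1/(\alpha - 1)}
  \propto n^{1/(\alpha - 1)}.
\end{align*}
```

For example, with α = 3 the maximum degree grows like the square root of n, which is far above the average degree in a large sparse network.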

50
Exponential distribution

Observed in some technological or collaboration networks
Identified by a line in the log-linear plot

$p(k) = \lambda e^{-\lambda k}$
$\log p(k) = -\lambda k + \log \lambda$
(figure: log frequency against degree, a line with slope -λ)
51
Collective Statistics (M. Newman 2003)
52
Clustering (Transitivity) coefficient

Measures the density of triangles (local
clusters) in the graph
Two different ways to measure it
The ratio of the means

53
Example
(figure: example graph on nodes 1 to 5)
54
Clustering (Transitivity) coefficient

Clustering coefficient for node i
The mean of the ratios

55
Example

The two clustering coefficients give different values
C(2) gives more weight to nodes with low degree
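
The formulas and the example figure for these slides are missing from the transcript, so the sketch below uses the standard definitions as an assumption: a "triple centered at i" is a pair of neighbors of i, and a "triangle centered at i" is such a pair that is itself linked. C(1), the ratio of the means, divides the total number of triangles by the total number of triples; C(2), the mean of the ratios, averages the per-node ratio C_i. Nodes with fewer than two neighbors are given C_i = 0 here, which is one common convention. The example graph and the function name are hypothetical.

```python
from itertools import combinations


def clustering_coefficients(nodes, edges):
    """Return (C1, C2): ratio of the means and mean of the ratios."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    triangles, triples = {}, {}
    for v in nodes:
        nbrs = sorted(adj[v])
        # Triples centered at v: unordered pairs of neighbors of v.
        triples[v] = len(nbrs) * (len(nbrs) - 1) // 2
        # Triangles centered at v: neighbor pairs that are themselves linked.
        triangles[v] = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    c1 = sum(triangles.values()) / sum(triples.values())
    local = [triangles[v] / triples[v] if triples[v] else 0.0 for v in nodes]
    c2 = sum(local) / len(nodes)
    return c1, c2


# Hypothetical graph: a triangle 1-2-3 with two extra leaves attached to node 3.
nodes = [1, 2, 3, 4, 5]
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (3, 5)]
c1, c2 = clustering_coefficients(nodes, edges)
print(c1, c2)  # 3/8 = 0.375 and 13/30, about 0.4333
```

In this example the low-degree nodes 1 and 2 each have C_i = 1, which pulls C(2) above C(1).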

56
Collective Statistics (M. Newman 2003)
57
Clustering coefficient for random graphs

The probability of two of your neighbors also being neighbors is p, independent of local structure
clustering coefficient C = p
when z is fixed, C = z/n = O(1/n)

58
Small world phenomena

Small worlds: networks with short paths

Stanley Milgram (1933-1984): "The man who shocked the world"
Obedience to authority (1963)
Small world experiment (1967)
59
Small world experiment

Letters were handed out to people in Nebraska to
be sent to a target in Boston
People were instructed to pass on the letters to
someone they knew on first-name basis
The letters that reached the destination followed
paths of length around 6
Six degrees of separation (play by John Guare)
Also:
The Kevin Bacon game
The Erdős number
Small world project: http://smallworld.columbia.edu/index.html

60
Measuring the small world phenomenon

d_ij: length of the shortest path between i and j
Diameter
Characteristic path length
Harmonic mean
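
The formulas for these three measures are missing from the transcript. The sketch below uses common definitions as an assumption: the diameter is the largest d_ij, the characteristic path length is the average of d_ij over all unordered pairs i < j, and the harmonic mean is the reciprocal of the average of 1/d_ij over the same pairs. It assumes a connected undirected graph and reuses the edge list from the earlier "Graph Theory" slide; the function names are illustrative.

```python
from collections import deque
from itertools import combinations


def bfs_distances(adj, source):
    """Return {node: number of edges on a shortest path from source}."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def path_statistics(nodes, edges):
    """Return (diameter, characteristic path length, harmonic mean distance).

    Assumes the graph is connected, so every d_ij is finite.
    """
    adj = {v: [] for v in nodes}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    all_dist = {v: bfs_distances(adj, v) for v in nodes}
    d = [all_dist[i][j] for i, j in combinations(nodes, 2)]
    diameter = max(d)
    char_length = sum(d) / len(d)
    harmonic = len(d) / sum(1.0 / x for x in d)
    return diameter, char_length, harmonic


nodes = [1, 2, 3, 4, 5]
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
diameter, char_length, harmonic = path_statistics(nodes, edges)
print(diameter, char_length, harmonic)  # 3, 1.7, 60/43 (about 1.395)
```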

61
Collective Statistics (M. Newman 2003)
62
Mixing patterns

Assume that we have various types of nodes. What
is the probability that two nodes of different
type are linked?
assortative mixing (homophily)

E: mixing matrix
p(i,j): mixing probability
p(j | i): conditional mixing probability
63
Mixing coefficient

Gupta, Anderson, May 1989
Advantages
Q = 1 if the matrix is diagonal
Q = 0 if the matrix is uniform
Disadvantages
sensitive to transposition
does not weight the entries

64
Mixing coefficient

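Newman 2003
Advantages
r = 1 for a diagonal matrix, r = 0 for a uniform matrix
not sensitive to transposition, accounts for weighting

(formula and example table not recovered; the formula's terms were labelled "row marginal" and "column marginal")
Example values: r = 0.621, Q = 0.528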
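
The formula on this slide is missing from the transcript. As an assumption, the sketch below uses the form of Newman's coefficient that I am familiar with: with e the mixing matrix normalized to sum to 1, a_i its row marginals and b_i its column marginals, r = (Σ_i e_ii - Σ_i a_i b_i) / (1 - Σ_i a_i b_i). The matrices are hypothetical and are not the data behind the example values above; the function name is illustrative.

```python
def assortativity_coefficient(e):
    """Mixing coefficient r for a square mixing matrix e (counts or fractions)."""
    total = sum(sum(row) for row in e)
    e = [[x / total for x in row] for row in e]  # normalize so entries sum to 1
    k = len(e)
    a = [sum(e[i][j] for j in range(k)) for i in range(k)]  # row marginals
    b = [sum(e[i][j] for i in range(k)) for j in range(k)]  # column marginals
    trace = sum(e[i][i] for i in range(k))
    ab = sum(a[i] * b[i] for i in range(k))
    # Undefined (division by zero) when all links join a single type.
    return (trace - ab) / (1 - ab)


# Hypothetical mixing matrices for two node types.
diagonal = [[0.5, 0.0], [0.0, 0.5]]
uniform = [[0.25, 0.25], [0.25, 0.25]]
skewed = [[0.4, 0.1], [0.2, 0.3]]
transposed = [[0.4, 0.2], [0.1, 0.3]]

print(assortativity_coefficient(diagonal))    # 1.0
print(assortativity_coefficient(uniform))     # 0.0
print(assortativity_coefficient(skewed))      # same value as for its transpose
print(assortativity_coefficient(transposed))
```

The last two lines illustrate the "not sensitive to transposition" property: transposing the matrix swaps the two sets of marginals and leaves r unchanged.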
65
Degree correlations

Do high degree nodes tend to link to high degree nodes?
Pastor-Satorras et al.: plot the mean degree of the neighbors as a function of the degree
Newman: compute the correlation coefficient of the degrees of the two endpoints of an edge
assortative/disassortative
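
The correlation-coefficient approach can be sketched as a plain Pearson correlation over edge endpoints. The details below are assumptions made for this example: each undirected edge is counted once in each direction so that the measure is symmetric, and the star graph is a hypothetical test case. The function name is illustrative.

```python
def degree_correlation(edges):
    """Pearson correlation of the degrees at the two ends of each edge."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    xs, ys = [], []
    for u, v in edges:
        # Count each undirected edge in both directions.
        xs += [deg[u], deg[v]]
        ys += [deg[v], deg[u]]
    n = len(xs)
    mean = sum(xs) / n  # xs and ys have the same mean by symmetry
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    # Undefined (division by zero) when all nodes have the same degree.
    return cov / var


# Hypothetical star graph: the hub links only to degree-1 leaves.
star = [(0, 1), (0, 2), (0, 3), (0, 4)]
r_star = degree_correlation(star)
print(r_star)  # -1.0, fully disassortative
```

A positive value indicates assortative mixing by degree, a negative value disassortative mixing.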

66
Collective Statistics (M. Newman 2003)
67
Communities and Clusters

Use the graph structure to discover communities
of nodes
essentially clustering and classification on
graphs

68
Other measures

Frequent (or interesting) motifs
bipartite cliques in the web graph
patterns in biological and software graphs
Use graphlets to compare models
Pržulj, Corneil, Jurisica 2004

69
Other measures

Network resilience
against random or targeted node deletions
Graph eigenvalues

70
Other measures

The giant component
Other?

71
References

M. E. J. Newman, The structure and function of complex networks, SIAM Review, 45(2): 167-256, 2003.
M. E. J. Newman, Random graphs as models of networks, in Handbook of Graphs and Networks, S. Bornholdt and H. G. Schuster (eds.), Wiley-VCH, Berlin, 2003.
N. Alon and J. Spencer, The Probabilistic Method.
