Title: Using Structure Indices for Efficient Approximation of Network Properties
- Matthew J. Rattigan, Marc Maier, and David
Jensen
- University of Massachusetts Amherst
- Data Mining
- November 27, 2006
- Deborah Stoffer
2The Problem
- Recent research works with very large networks
- Millions of nodes
- Calculating network statistics on very large
networks can be difficult
- Shortest paths
- Betweenness centrality
- The proportion of all shortest paths in the
network that run through a given node
- Closeness centrality
- The average distance from the given node to every
other node in the network
3The Problem
- The most efficient known algorithms for
calculating betweenness centrality and closeness
centrality are $O(ne + n^2 \log n)$
- n = number of nodes
- e = number of edges
- Calculations for path finding can have even
higher complexity
- Require bidirectional breadth-first search
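To make the cost of path finding concrete, here is a minimal Python sketch of bidirectional breadth-first search on an unweighted, undirected graph. The adjacency-dict representation and the name `bidirectional_bfs` are assumptions made for illustration; they do not come from the slides.

```python
def bidirectional_bfs(adj, source, target):
    """Return (shortest path length, nodes explored); length is None if no path exists."""
    if source == target:
        return 0, 1
    # Distance maps double as the visited sets of the two searches.
    dist_s, dist_t = {source: 0}, {target: 0}
    frontier_s, frontier_t = [source], [target]
    while frontier_s and frontier_t:
        # Expand the smaller frontier by one full level.
        expanding_source = len(frontier_s) <= len(frontier_t)
        if expanding_source:
            frontier, dist, other = frontier_s, dist_s, dist_t
        else:
            frontier, dist, other = frontier_t, dist_t, dist_s
        next_frontier, best = [], None
        for u in frontier:
            for v in adj[u]:
                if v in dist:
                    continue
                dist[v] = dist[u] + 1
                if v in other:
                    # The two searches meet at v; keep the shortest meeting.
                    total = dist[v] + other[v]
                    best = total if best is None else min(best, total)
                next_frontier.append(v)
        if best is not None:
            return best, len(dist_s) + len(dist_t)
        if expanding_source:
            frontier_s = next_frontier
        else:
            frontier_t = next_frontier
    return None, len(dist_s) + len(dist_t)
```

The second return value counts every node touched by either search, which is the kind of exploration cost the Rexa example reports.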
4The Problem
- Example - Rexa citation graph
- Papers in computer science and related fields
- Largest connected component contains 165,000
nodes (papers) and 321,000 edges (citations)
- Finding a path of length 15 requires the
exploration of 65,000 nodes
5The Problem
6Network Structure Index (NSI)
- Similar to the type of index commonly used to
speed queries in modern database systems
- Can be constructed once for a given graph and
then used to speed the calculations of many
measures on the graph
- Two components of an NSI
- Set of annotations on every node in the network
that provide information about relative or
absolute location
- For G(V,E) the annotations define $A: V \to S$, where
S is an arbitrarily complex annotation space
- A distance function that uses the annotations to
define graph distance between pairs of nodes by
mapping pairs of node annotations to a positive
real number: $D: S \times S \to \mathbb{R}$
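The two components above map naturally onto a small data structure: a table of annotations and a distance function over annotations. The sketch below is one possible Python rendering, together with a greedy best-first search that uses the index to pick which node to expand next. The class name `NSI`, the method `estimate`, and the function `best_first_path` are hypothetical names chosen for this example.

```python
import heapq
from dataclasses import dataclass
from typing import Any, Callable, Dict, Hashable


@dataclass
class NSI:
    annotations: Dict[Hashable, Any]        # A: maps each node to its annotation
    distance: Callable[[Any, Any], float]   # D: maps a pair of annotations to a number

    def estimate(self, u, v):
        """Estimated graph distance between nodes u and v."""
        return self.distance(self.annotations[u], self.annotations[v])


def best_first_path(adj, nsi, source, target):
    """Greedy best-first search guided by the NSI; returns a path or None."""
    parent = {source: None}
    counter = 0  # tie-breaker so the heap never compares node objects
    heap = [(nsi.estimate(source, target), counter, source)]
    while heap:
        _, _, u = heapq.heappop(heap)
        if u == target:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                counter += 1
                heapq.heappush(heap, (nsi.estimate(v, target), counter, v))
    return None
```

The path returned is not guaranteed to be optimal; how close it comes depends on how well the distance function tracks true graph distance.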
7Types of Network Structure Indices
- All Pairs Shortest Path (APSP)
- Degree
- Landmark
- Global Network Positioning (GNP)
- Zone
- Distance to Zone (DTZ)
8All Pairs Shortest Path NSI
- Node annotations
- Consist of an n x n matrix (n = |V|) containing
the optimal path distances between all pairs of
nodes
- Distance function
- A simple lookup in the matrix
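For an unweighted graph, the APSP annotation can be built by running one breadth-first search per node. The sketch below assumes an adjacency-dict graph; `build_apsp` and `apsp_distance` are illustrative names. The quadratic memory footprint is what makes this index impractical for networks with millions of nodes.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def build_apsp(adj):
    """Return (node -> row index, n x n distance matrix); unreachable pairs are inf."""
    nodes = list(adj)
    index = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    matrix = [[float("inf")] * n for _ in range(n)]
    for s in nodes:
        for t, d in bfs_distances(adj, s).items():
            matrix[index[s]][index[t]] = d
    return index, matrix


def apsp_distance(index, matrix, s, t):
    """The distance function is a plain matrix lookup."""
    return matrix[index[s]][index[t]]
```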
9Degree NSI
- Node annotations
- Annotate each node with its undirected degree
within the graph
- Distance function between source node s and
target node t
- $D_{Degree}(s, t) = 2n - \mathrm{degree}(s) - \mathrm{degree}(t)$
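A small sketch of the Degree NSI follows. It assumes the distance function is read as 2n minus the two degrees, so that higher-degree nodes look "closer" and a guided search prefers hubs; that reading of the slide's formula is an assumption, as are the function names.

```python
def degree_annotations(adj):
    """Annotate each node with its undirected degree (distinct neighbours)."""
    return {v: len(set(neighbours)) for v, neighbours in adj.items()}


def degree_distance(n, degree_s, degree_t):
    """Assumed form: 2n - degree(s) - degree(t); smaller for higher-degree nodes."""
    return 2 * n - degree_s - degree_t
```

For example, in a star with five leaves (n = 6), the hub scores 12 - 5 - 1 = 6 against a leaf target, while another leaf scores 12 - 1 - 1 = 10, so the hub is expanded first.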
10Landmark NSI
- Randomly designate a small number of nodes in the
network to serve as navigational beacons
- Node annotations
- Annotate nodes in the graph by flooding out from
each landmark and recording the graph distance to
each node in the network
- Gives a vector of graph distances for each node
- Distance function
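The Landmark annotation step (flooding out from each landmark) can be sketched with one breadth-first search per landmark. The slide's distance function did not survive extraction, so the function below substitutes a standard triangle-inequality upper bound, the smallest detour through any landmark. That substitution is an assumption and is not necessarily the formula the authors use; all names are illustrative.

```python
import random
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def landmark_annotations(adj, num_landmarks, seed=0):
    """Return (node -> vector of distances to each landmark, list of landmarks)."""
    rng = random.Random(seed)
    landmarks = rng.sample(list(adj), num_landmarks)
    floods = [bfs_distances(adj, l) for l in landmarks]
    annotations = {v: [f.get(v, float("inf")) for f in floods] for v in adj}
    return annotations, landmarks


def landmark_upper_bound(vec_s, vec_t):
    """ASSUMED distance function: min over landmarks of d(s, l) + d(l, t)."""
    return min(a + b for a, b in zip(vec_s, vec_t))
```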
12Global Network Positioning NSI
- Node annotation
- Annotation uses a nonlinear optimization
algorithm to create a multidimensional coordinate
system that encodes the location of each node
within the network
- Distance function is the Manhattan distance
between node pairs
- $D_{GNP}(s, t) = \sum_{i} |s_i - t_i|$, where $s_i$ and $t_i$ are the i-th coordinates of s and t
13Zone NSI
- Node annotations
- Each node is annotated with a d-dimensional
vector of zone labels
- Distance function
14Zone NSI Algorithm
- For d dimensions
- Randomly select k seed nodes, assign them zone
labels 1 through k, and place them in the labeled
set
- Place all other nodes in the unlabeled set
- While the unlabeled set is not empty
- Randomly select a node l from the labeled set
- Randomly select a node u from the unlabeled set
that is a neighbor to l
- Assign u to the same zone as l and move it to the
labeled set
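The zone-growing procedure above can be sketched in Python as follows. This is a minimal version that assumes a connected, undirected graph stored as an adjacency dict; the function name and the `active` list (labeled nodes that may still have unlabeled neighbours, kept only to avoid wasted draws) are choices made for this example.

```python
import random


def zone_annotations(adj, k, d, seed=0):
    """Return node -> list of d zone labels, one per independent random partition."""
    rng = random.Random(seed)
    nodes = list(adj)
    labels = {v: [] for v in nodes}
    for _ in range(d):
        zone = {}
        seeds = rng.sample(nodes, k)
        for z, s in enumerate(seeds, start=1):
            zone[s] = z                      # seeds get zone labels 1..k
        active = list(seeds)
        while len(zone) < len(nodes) and active:
            i = rng.randrange(len(active))
            l = active[i]                    # random labeled node
            unlabeled = [u for u in adj[l] if u not in zone]
            if not unlabeled:
                # l can never grow its zone again: swap-remove it and redraw.
                active[i] = active[-1]
                active.pop()
                continue
            u = rng.choice(unlabeled)        # random unlabeled neighbour of l
            zone[u] = zone[l]
            active.append(u)
        for v in nodes:
            # Nodes unreachable from every seed stay None (disconnected graphs).
            labels[v].append(zone.get(v))
    return labels
```

Because each dimension uses fresh random seeds, two nodes that share a zone in one dimension may be split in another, which is what lets the vector of labels carry location information.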
16Distance to Zone (DTZ) NSI
- Hybrid between Landmark and Zone NSIs
- Node annotations
- Divide the graph into zones and for each node u
and zone Z calculate the distance from u to the
closest node in Z
- Distance function
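The DTZ annotation step can be sketched with one multi-source breadth-first search per zone, which yields each node's distance to the closest member of that zone. The sketch covers a single zone partition (one dimension). The slide's distance function was lost in extraction; `dtz_distance` below assumes one plausible form, the distance from u to v's zone plus the distance from v to u's zone, and should be treated as an assumption rather than the authors' definition.

```python
from collections import deque


def multi_source_bfs(adj, sources):
    """Distance from every reachable node to its nearest source."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def dtz_annotations(adj, zone_of):
    """Return node -> {zone label: distance to the closest node in that zone}."""
    zones = set(zone_of.values())
    to_zone = {
        z: multi_source_bfs(adj, [v for v in adj if zone_of[v] == z]) for z in zones
    }
    return {v: {z: to_zone[z].get(v, float("inf")) for z in zones} for v in adj}


def dtz_distance(annotations, zone_of, u, v):
    """ASSUMED form: dist(u, zone(v)) + dist(v, zone(u))."""
    return annotations[u][zone_of[v]] + annotations[v][zone_of[u]]
```

Under this assumed form, two nodes in the same zone get an estimate of zero, so in practice several zone dimensions would be summed to separate them.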
18Complexity of Different NSIs
19Search Performance
- Optimality of the lengths of paths found
- Path ratio
- pf is the length of the found paths
- po is the length of the optimal paths
- r is the number of randomly selected pairs of
nodes in the graph
- P = 1.0 indicates an NSI that finds optimal
paths
- P > 1.0 indicates a poorer-performing NSI, since
found paths are longer than optimal
20Search Performance
- Performance gain
- Exploration ratio
- ef is the number of nodes explored by best-first
search
- eb is the number of nodes that are explored using
a bidirectional breadth-first search
- r is the number of pairs of nodes in the graph
- E values close to zero indicate good search
performance
- E values greater than 1.0 indicate poor search
performance
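Both the path ratio P and the exploration ratio E compare an NSI-guided quantity with a baseline over r sampled node pairs. The exact formulas did not survive in this transcript, so the sketch below assumes each is a ratio of totals (sum of guided values over sum of baseline values); a mean of per-pair ratios would be an equally plausible reading. The function name is hypothetical.

```python
def aggregate_ratio(guided, baseline):
    """ASSUMED form: sum of guided values divided by sum of baseline values.

    For P: guided = found path lengths, baseline = optimal path lengths.
    For E: guided = nodes explored by best-first search,
           baseline = nodes explored by bidirectional breadth-first search.
    """
    if len(guided) != len(baseline) or not guided:
        raise ValueError("need one guided and one baseline value per sampled pair")
    return sum(guided) / sum(baseline)
```

With found lengths [3, 5] and optimal lengths [3, 4], this gives P = 8 / 7, slightly above the optimum of 1.0.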
21Search Performance
- NSIs evaluated on synthetic graphs
- Random
- Rewired lattices
- Forest Fire
26Constant Time Distance Estimation
- Can sometimes use an NSI to directly estimate the
graph distance between any two nodes
- Can use the DTZ annotation distance to estimate
actual graph distances
- Annotate the graph as described for the DTZ NSI
- Randomly sample p pairs of nodes in the graph and
perform breadth-first search to obtain their
exact graph distance
- Use linear regression to obtain an equation for
estimated distance
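The regression step can be sketched with ordinary least squares on a single predictor: the DTZ annotation distance of each sampled pair is the input and the exact breadth-first-search distance is the response. The sketch assumes a simple linear model with one slope and one intercept; the slides do not state the model form, and the function names are illustrative.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # requires at least two distinct x values
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_distance(model, annotation_distance):
    """Constant-time estimate: plug the annotation distance into the fitted line."""
    slope, intercept = model
    return slope * annotation_distance + intercept
```

Once the model is fitted on the p sampled pairs, every later estimate costs one annotation lookup and one multiply-add, independent of graph size.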
29Constant Time Distance Estimation
- Simple distance can be used to produce a wide
variety of attributes on nodes, which can be used
by data mining algorithms that analyze graphs
- Label nodes with their distance to a particular
node in a graph
- How close is each actor to Kevin Bacon?
- Label nodes with the minimum or maximum distance
to one of a set of designated nodes
- How close is each actor to an Academy Award
winner?
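As an illustration of these distance-based attributes, the sketch below labels every node with its minimum and maximum distance to a set of designated nodes. It uses exact breadth-first search for clarity; in the setting of these slides an NSI-based estimate would stand in for the exact distances. The function names are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def distance_labels(adj, designated):
    """Return node -> (min, max) distance to the designated nodes it can reach."""
    floods = [bfs_distances(adj, s) for s in designated]
    labels = {}
    for v in adj:
        reachable = [f[v] for f in floods if v in f]
        labels[v] = (min(reachable), max(reachable)) if reachable else (None, None)
    return labels
```

Passing a single designated node reproduces the "distance to Kevin Bacon" style of attribute, since the minimum and maximum then coincide.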
30Closeness Centrality
- Measures the proximity of a given node in a
network to every other node
- Important to social network dynamics
- Accurate estimates of closeness centrality often
impossible to calculate for large data sets
- Using an NSI for path finding can estimate
closeness centrality efficiently
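One way to picture the estimate: average the distance from the given node to a random sample of other nodes, using whatever distance routine is available. In the sketch below `distance_fn` is a placeholder for that routine (for example the length of a path found by NSI-guided search); the sampling scheme and all names are assumptions for illustration.

```python
import random


def estimate_closeness(nodes, u, distance_fn, sample_size, seed=0):
    """Average distance from u to a random sample of other nodes (lower = more central)."""
    rng = random.Random(seed)
    others = [v for v in nodes if v != u]
    sample = rng.sample(others, min(sample_size, len(others)))
    return sum(distance_fn(u, v) for v in sample) / len(sample)
```

Ranking nodes by this value requires only `sample_size` distance evaluations per node instead of a full traversal of the graph.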
32Closeness Centrality
- A measure of centrality can be used to produce
attributes on nodes that may be useful to
knowledge discovery algorithms
- Determine the closeness of every node to a
collection of key nodes
- Closeness to all winners of Academy Awards for
best actor in the past 10 years
- Constrain closeness calculations for members of
clusters
- Closeness rank of an actor within their movie
industry
- Weight closeness based on the attributes of the
outlying nodes
- Closeness to winners of Academy Awards weighted
by how recent an award
33Betweenness Centrality
- Measures the number of short paths on which a
given node lies
- Important to social network dynamics
- Accurate estimates of betweenness centrality
often impossible to calculate for large data
sets
34Betweenness Centrality
- Can estimate betweenness using the paths
identified through NSI navigation
- Randomly sample pairs of nodes and discover the
shortest path between them
- Count the number of times each node in the graph
appears on one of these paths to obtain a
betweenness ranking
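The sampling procedure above can be sketched as follows. The path finder is pluggable: the default here is an exact breadth-first search, standing in for NSI-guided navigation. Counting only interior nodes (not the endpoints of each path) is a choice made in this sketch, and the function names are hypothetical.

```python
import random
from collections import Counter, deque


def bfs_path(adj, s, t):
    """A shortest path from s to t as a list of nodes, or None if unreachable."""
    parent = {s: None}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None


def betweenness_ranking(adj, num_pairs, find_path=bfs_path, seed=0):
    """Rank nodes by how often they appear inside paths between sampled pairs."""
    rng = random.Random(seed)
    nodes = list(adj)
    counts = Counter({v: 0 for v in nodes})
    for _ in range(num_pairs):
        s, t = rng.sample(nodes, 2)
        path = find_path(adj, s, t)
        if path:
            counts.update(path[1:-1])  # interior nodes only
    return [v for v, _ in counts.most_common()]
```

The result is a ranking rather than a calibrated betweenness score, which matches the goal stated above.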
36Betweenness Centrality
- A high betweenness score can indicate a bridge
between two communities
- An actor that has played in movies belonging to
different movie industries
- Betweenness centrality can be used to create
features on nodes that are useful for data
mining
- Calculate betweenness centrality for particular
groups of nodes
- Actors that sit between winners of Academy Awards
for best picture and IMDb's Bottom 100, the
worst 100 movies as voted by users of the
Internet Movie Database
37Conclusions
- The Zone and DTZ NSIs allow efficient and
accurate estimation of path lengths between
arbitrary nodes in a network
- Efficient calculation of network statistics
allows a broader range of potential approaches to
knowledge discovery
- The space of potential NSIs has not been
exhaustively researched
- NSIs could have other applications
- Finding connection subgraphs
- Approximating neighborhood functions