Using Structure Indices for Efficient Approximation of Network Properties

Slides: 39
Provided by: csK4
Learn more at: https://www.cs.kent.edu

Transcript and Presenter's Notes

1
Using Structure Indices for Efficient
Approximation of Network Properties
  • Matthew J. Rattigan, Marc Maier, and David
    Jensen
  • University of Massachusetts Amherst
  • Data Mining
  • November 27, 2006
  • Deborah Stoffer

2
The Problem
  • Recent research works with very large networks
  • Millions of nodes
  • Calculating network statistics on very large
    networks can be difficult
  • Shortest paths
  • Betweenness centrality
  • The proportion of all shortest paths in the
    network that run through a given node
  • Closeness centrality
  • The average distance from the given node to every
    other node in the network

3
The Problem
  • The most efficient known algorithms for
    calculating betweenness centrality and closeness
    centrality are $O(ne + n^2 \log n)$
  • $n$ = number of nodes
  • $e$ = number of edges
  • Calculations for path finding can have even
    higher complexity
  • Require bidirectional breadth-first search
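
For reference, the bidirectional breadth-first search mentioned above can be sketched as follows. This is a minimal illustration, not the presenters' code: it assumes an undirected, unweighted graph stored as a dict mapping each node to its neighbors, and the function name and the returned "explored" count are illustrative choices.

```python
from collections import deque


def bidirectional_bfs(graph, source, target):
    """Return (distance, explored) for an undirected, unweighted graph.

    graph is assumed to map each node to an iterable of its neighbors.
    distance is None when the two nodes are not connected.
    """
    if source == target:
        return 0, 1
    dist_s, dist_t = {source: 0}, {target: 0}
    frontier_s, frontier_t = deque([source]), deque([target])
    while frontier_s and frontier_t:
        # Expand the smaller frontier by one full level.
        if len(frontier_s) <= len(frontier_t):
            frontier, dist, other = frontier_s, dist_s, dist_t
        else:
            frontier, dist, other = frontier_t, dist_t, dist_s
        best = None
        for _ in range(len(frontier)):
            node = frontier.popleft()
            for nbr in graph[node]:
                if nbr in other:
                    # The two searches meet across the edge (node, nbr).
                    cand = dist[node] + 1 + other[nbr]
                    best = cand if best is None else min(best, cand)
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    frontier.append(nbr)
        if best is not None:
            return best, len(dist_s) + len(dist_t)
    return None, len(dist_s) + len(dist_t)
```

The second return value counts every node touched by either search, which is the kind of cost the following slides compare against.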

4
The Problem
  • Example - Rexa citation graph
  • Papers in computer science and related fields
  • Largest connected component contains 165,000
    nodes (papers) and 321,000 edges (citations)
  • Finding a path of length 15 requires the
    exploration of 65,000 nodes

5
The Problem
6
Network Structure Index (NSI)
  • Similar to the type of index commonly used to
    speed queries in modern database systems
  • Can be constructed once for a given graph and
    then used to speed the calculations of many
    measures on the graph
  • Two components of an NSI
  • Set of annotations on every node in the network
    that provide information about relative or
    absolute location
  • For a graph $G(V, E)$, the annotations define $A: V \to S$, where
    $S$ is an arbitrarily complex annotation space
  • A distance function that uses the annotations to
    define graph distance between pairs of nodes by
    mapping pairs of node annotations to a positive
    real number
  • $D: S \times S \to \mathbb{R}$
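
The two components can be pictured as a small data structure: a table of per-node annotations plus a function over pairs of annotations. The sketch below is only an illustration of that shape; the class and field names are hypothetical and do not come from the presentation.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Hashable


@dataclass
class NetworkStructureIndex:
    """Hypothetical container for the two NSI components."""

    annotations: Dict[Hashable, Any]          # the mapping from nodes to annotations
    distance_fn: Callable[[Any, Any], float]  # the function on pairs of annotations

    def distance(self, u, v) -> float:
        # Look up both annotations, then apply the distance function.
        return self.distance_fn(self.annotations[u], self.annotations[v])
```

For example, annotating nodes with a single made-up coordinate and using absolute difference as the distance function gives `NetworkStructureIndex({"a": 0, "b": 3}, lambda x, y: abs(x - y)).distance("a", "b") == 3`.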

7
Types of Network Structure Indices
  • All Pairs Shortest Path (APSP)
  • Degree
  • Landmark
  • Global Network Positioning (GNP)
  • Zone
  • Distance to Zone (DTZ)

8
All Pairs Shortest Path NSI
  • Node annotations
  • Consist of an $n \times n$ matrix ($n = |V|$) containing
    the optimal path distances between all pairs of
    nodes
  • Distance function
  • A simple lookup in the matrix
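
A minimal sketch of this index for an unweighted graph: run one breadth-first search per node to fill the table, then answer distance queries by lookup. The dict-of-dicts layout and the function names are illustrative assumptions; the quadratic memory cost is what makes this index impractical for very large graphs.

```python
from collections import deque


def bfs_distances(graph, source):
    """Hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def build_apsp(graph):
    """Annotation step: one BFS per node, stored as a table of tables."""
    return {node: bfs_distances(graph, node) for node in graph}


def apsp_distance(table, s, t):
    """Distance function: a lookup (infinity if t is unreachable from s)."""
    return table[s].get(t, float("inf"))
```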

9
Degree NSI
  • Node annotations
  • Annotate each node with its undirected degree
    within the graph
  • Distance function between source node s and
    target node t
  • $D_{Degree}(s, t) = 2n - degree(s) - degree(t)$
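
A small sketch of the Degree NSI, reading the distance function above as 2n minus the degrees of the two nodes. That reading is an interpretation of the flattened formula on the slide and should be treated as an assumption. Under it, the neighbor with the smallest distance to any fixed target is simply the neighbor with the highest degree; the helper name below is hypothetical.

```python
def degree_annotations(graph):
    """Annotate each node with its undirected degree."""
    return {node: len(set(graph[node])) for node in graph}


def degree_distance(annotations, n, s, t):
    # Assumed reading of the slide: 2n - degree(s) - degree(t).
    return 2 * n - annotations[s] - annotations[t]


def highest_degree_neighbor(graph, annotations, node, target):
    """Neighbor of node that minimizes the degree distance to target."""
    n = len(graph)
    return min(
        graph[node],
        key=lambda nbr: degree_distance(annotations, n, nbr, target),
    )
```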

10
Landmark NSI
  • Randomly designate a small number of nodes in the
    network to serve as navigational beacons
  • Node annotations
  • Annotate nodes in the graph by flooding out from
    each landmark and recording the graph distance to
    each node in the network
  • Gives a vector of graph distances for each node
  • Distance function
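
The annotation step can be sketched as one breadth-first flood per landmark. The distance function did not survive in this transcript, so the one below is an assumption chosen for illustration: the triangle-inequality upper bound, i.e. the shortest route from s to t that passes through some landmark. Function names and the `seed` parameter are likewise illustrative.

```python
import random
from collections import deque


def bfs_distances(graph, source):
    """Hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def landmark_annotations(graph, num_landmarks, seed=0):
    """Give each node a vector of distances, one entry per random landmark."""
    rng = random.Random(seed)
    landmarks = rng.sample(list(graph), num_landmarks)
    floods = [bfs_distances(graph, lm) for lm in landmarks]
    inf = float("inf")
    return {node: [f.get(node, inf) for f in floods] for node in graph}


def landmark_distance(vec_s, vec_t):
    # Assumed distance function: best detour through any single landmark.
    return min(a + b for a, b in zip(vec_s, vec_t))
```

With this choice the estimate never falls below the true graph distance, and it is exact whenever some landmark lies on a shortest path between the two nodes.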

11
Landmark NSI
12
Global Network Positioning NSI
  • Node annotation
  • Annotation uses a nonlinear optimization
    algorithm to create a multidimensional coordinate
    system that encodes the location of each node
    within the network
  • Distance function is the Manhattan distance
    between node pairs

13
Zone NSI
  • Node annotations
  • Each node is annotated with a d-dimensional
    vector of zone labels
  • Distance function

14
Zone NSI Algorithm
  • For d dimensions
  • Randomly select k seed nodes, assign them zone
    labels 1 through k, and place them in the labeled
    set
  • Place all other nodes in the unlabeled set
  • While the unlabeled set is not empty
  • Randomly select a node l from the labeled set
  • Randomly select a node u from the unlabeled set
    that is a neighbor to l
  • Assign u to the same zone as l and move it to the
    labeled set
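
The algorithm above can be sketched as follows, assuming a connected, undirected graph stored as a dict of neighbor lists. Two details are assumptions rather than content from the slides: the bookkeeping list `active`, which only avoids re-drawing labeled nodes that have no unlabeled neighbors left, and the distance function, taken here to be the number of dimensions in which two nodes' zone labels differ.

```python
import random


def zone_labels_one_dimension(graph, k, rng):
    """Grow k zones outward from random seed nodes until all nodes are labeled."""
    nodes = list(graph)
    seeds = rng.sample(nodes, k)
    zone = {seed: label for label, seed in enumerate(seeds, start=1)}
    active = list(seeds)  # labeled nodes that may still have unlabeled neighbors
    while len(zone) < len(nodes) and active:
        i = rng.randrange(len(active))
        labeled = active[i]
        candidates = [u for u in graph[labeled] if u not in zone]
        if not candidates:
            # Nothing left to claim from this node: drop it and draw again.
            active[i] = active[-1]
            active.pop()
            continue
        u = rng.choice(candidates)
        zone[u] = zone[labeled]
        active.append(u)
    return zone


def zone_annotations(graph, k, d, seed=0):
    """Repeat the labeling d times to get a d-dimensional label vector per node."""
    rng = random.Random(seed)
    dims = [zone_labels_one_dimension(graph, k, rng) for _ in range(d)]
    return {node: [dim.get(node) for dim in dims] for node in graph}


def zone_distance(vec_s, vec_t):
    # Assumed distance function: count of dimensions with different labels.
    return sum(a != b for a, b in zip(vec_s, vec_t))
```

Because each node inherits its label from an already labeled neighbor, every zone is a connected region of the graph. Nodes unreachable from all seeds would be left with a label of None.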

15
Zone NSI
16
Distance to Zone (DTZ) NSI
  • Hybrid between Landmark and Zone NSIs
  • Node annotations
  • Divide the graph into zones and for each node u
    and zone Z calculate the distance from u to the
    closest node in Z
  • Distance function
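
The annotation step can be sketched with one multi-source breadth-first search per zone, started from all members of the zone at once. The sketch covers a single zone dimension and takes the zone assignment as given. The distance function is missing from this transcript, so the one shown (distance from u to v's zone plus distance from v to u's zone) is an assumption used only to make the example complete.

```python
from collections import deque


def distance_to_zone(graph, zone_of):
    """Return {node: {zone: hops from node to the nearest member of zone}}."""
    annotations = {node: {} for node in graph}
    for z in set(zone_of.values()):
        members = [n for n in graph if zone_of[n] == z]
        dist = {m: 0 for m in members}
        queue = deque(members)
        while queue:
            node = queue.popleft()
            for nbr in graph[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        for node, hops in dist.items():
            annotations[node][z] = hops
    return annotations


def dtz_distance(annotations, zone_of, u, v):
    # Assumed distance function, not taken from the slides.
    return annotations[u][zone_of[v]] + annotations[v][zone_of[u]]
```

The value is a relative score for guiding search rather than a hop count: on a six-node path split into three zones of two nodes each, the two end nodes score 8 although they are 5 hops apart.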

17
Distance to Zone (DTZ) NSI
18
Complexity of Different NSIs
19
Search Performance
  • Optimality of the lengths of paths found
  • Path ratio
  • pf is the length of the found paths
  • po is the length of the optimal paths
  • r is the number of randomly selected pairs of
    nodes in the graph
  • P = 1.0 indicates an NSI that finds optimal
    paths
  • P > 1.0 indicates a poorly performing NSI
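
The formula for the path ratio is not preserved in this transcript. One form consistent with the symbols defined above is the total length of the found paths divided by the total length of the optimal paths over the r sampled pairs; the sketch below assumes that form. A mean of per-pair ratios would be another reasonable reading.

```python
def path_ratio(found_lengths, optimal_lengths):
    """Assumed form: sum of found path lengths over sum of optimal lengths."""
    if len(found_lengths) != len(optimal_lengths):
        raise ValueError("one found length is needed per optimal length")
    return sum(found_lengths) / sum(optimal_lengths)
```

Under this form a found path can never be shorter than the optimal one, so the ratio is at least 1.0 and equals 1.0 only when every found path is optimal.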

20
Search Performance
  • Performance gain
  • Exploration ratio
  • ef is the number of nodes explored by best-first
    search
  • eb is the number of nodes that are explored using
    a bidirectional breadth-first search
  • r is the number of pairs of nodes in the graph
  • E values close to zero indicate good search
    performance
  • E values greater than 1.0 indicate poor search
    performance
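
A sketch of how the exploration counts might be gathered: a greedy best-first search that always expands the frontier node whose NSI distance to the target is smallest, and counts how many nodes it expands. The `estimate` callback stands in for any NSI distance function. The ratio is assumed here to be a ratio of totals over the sampled pairs, since the slide's formula did not survive.

```python
import heapq
from itertools import count


def best_first_search(graph, source, target, estimate):
    """Greedy best-first search; returns (path, number of nodes explored)."""
    tie = count()  # tie-breaker so the heap never compares node objects
    heap = [(estimate(source, target), next(tie), source)]
    parent = {source: None}
    explored = 0
    while heap:
        _, _, node = heapq.heappop(heap)
        explored += 1
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1], explored
        for nbr in graph[node]:
            if nbr not in parent:
                parent[nbr] = node
                heapq.heappush(heap, (estimate(nbr, target), next(tie), nbr))
    return None, explored


def exploration_ratio(explored_best_first, explored_bidirectional):
    """Assumed form: total nodes explored by each method, as a ratio."""
    return sum(explored_best_first) / sum(explored_bidirectional)
```

The path returned is the one the search happened to follow, so it is not guaranteed to be shortest; that is why path optimality is measured separately.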

21
Search Performance
  • NSIs evaluated on synthetic graphs
  • Random
  • Rewired lattices
  • Forest Fire

22
Search Performance
23
Search Performance
24
Search Performance
25
Search Performance
26
Constant Time Distance Estimation
  • Can sometimes use an NSI to directly estimate the
    graph distance between any two nodes
  • Can use the DTZ annotation distance to estimate
    actual graph distances
  • Annotate the graph as described for the DTZ NSI
  • Randomly sample p pairs of nodes in the graph and
    perform breadth-first search to obtain their
    exact graph distance
  • Use linear regression to obtain an equation for
    estimated distance
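
The regression step can be sketched with ordinary least squares on a single predictor. Both callbacks are placeholders: `annotation_distance` stands for the DTZ annotation distance and `exact_distance` for a breadth-first search on the sampled pairs. The function names are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def make_distance_estimator(annotation_distance, exact_distance, sampled_pairs):
    """Fit on the sampled pairs, then estimate any pair without a search."""
    xs = [annotation_distance(u, v) for u, v in sampled_pairs]
    ys = [exact_distance(u, v) for u, v in sampled_pairs]
    slope, intercept = fit_line(xs, ys)
    return lambda u, v: slope * annotation_distance(u, v) + intercept
```

Once fitted, each estimate costs one annotation lookup and one multiply-add, which is what makes the estimate constant time per pair.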

27
Constant Time Distance Estimation
28
Constant Time Distance Estimation
29
Constant Time Distance Estimation
  • Simple distance can be used to produce a wide
    variety of attributes on nodes, which can be used
    by data mining algorithms that analyze graphs
  • Label nodes with their distance to a particular
    node in a graph
  • How close is each actor to Kevin Bacon?
  • Label nodes with the minimum or maximum distance
    to one of a set of designated nodes
  • How close is each actor to an Academy Award
    winner?

30
Closeness Centrality
  • Measures the proximity of a given node in a
    network to every other node
  • Important to social network dynamics
  • Accurate estimates of closeness centrality are often
    infeasible to calculate for large data sets
  • Using an NSI for path finding can estimate
    closeness centrality efficiently
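
One way to picture the estimate: instead of measuring the distance from a node to every other node, average the distance to a random sample of other nodes. The `dist` callback below is a placeholder for the length of a path found by NSI-guided search or for a constant-time distance estimate; the sampling scheme and names are assumptions made for this sketch.

```python
import random


def estimate_closeness(nodes, node, dist, sample_size, seed=0):
    """Average distance from node to a random sample of the other nodes."""
    rng = random.Random(seed)
    others = [n for n in nodes if n != node]
    sample = rng.sample(others, min(sample_size, len(others)))
    return sum(dist(node, target) for target in sample) / len(sample)
```

Lower values mean the node is closer, on average, to the rest of the network.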

31
Closeness Centrality
32
Closeness Centrality
  • A measure of centrality can be used to produce
    attributes on nodes that may be useful to
    knowledge discovery algorithms
  • Determine the closeness of every node to a
    collection of key nodes
  • Closeness to all winners of Academy Awards for
    best actor in the past 10 years
  • Constrain closeness calculations for members of
    clusters
  • Closeness rank of an actor within their movie
    industry
  • Weight closeness based on the attributes of the
    outlying nodes
  • Closeness to winners of Academy Awards weighted
    by how recent an award

33
Betweenness Centrality
  • Measures the number of shortest paths on which a
    given node lies
  • Important to social network dynamics
  • Accurate estimates of betweenness centrality are
    often infeasible to calculate for large data
    sets

34
Betweenness Centrality
  • Can estimate betweenness using the paths
    identified through NSI navigation
  • Randomly sample pairs of nodes and discover the
    shortest path between them
  • Count the number of times each node in the graph
    appears on one of these paths to obtain a
    betweenness ranking
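
The sampling procedure above can be sketched as follows. A plain breadth-first search stands in for NSI-guided navigation so that the example is self-contained; any function with the same signature could be passed as `find_path`. Counting only interior nodes of each path, not its endpoints, is a choice made for this sketch.

```python
import random
from collections import Counter, deque


def bfs_path(graph, source, target):
    """Shortest path by breadth-first search (stand-in for NSI navigation)."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nbr in graph[node]:
            if nbr not in parent:
                parent[nbr] = node
                queue.append(nbr)
    return None


def estimate_betweenness(graph, num_pairs, find_path=bfs_path, seed=0):
    """Count how often each node lies inside a path between sampled pairs."""
    rng = random.Random(seed)
    nodes = list(graph)
    counts = Counter()
    for _ in range(num_pairs):
        s, t = rng.sample(nodes, 2)
        path = find_path(graph, s, t)
        if path:
            counts.update(path[1:-1])  # interior nodes only
    return counts
```

Sorting nodes by their counts, for example with `counts.most_common()`, gives the betweenness ranking.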

35
Betweenness Centrality
36
Betweenness Centrality
  • A high betweenness score can indicate a bridge
    between two communities
  • An actor that has played in movies belonging to
    different movie industries
  • Betweenness centrality can be used to create
    features on nodes that are useful for data
    mining
  • Calculate betweenness centrality for particular
    groups of nodes
  • Actors that sit between winners of Academy Awards
    for best picture and IMDb's Bottom 100, the
    worst 100 movies as voted by users of the
    Internet Movie Database

37
Conclusions
  • The Zone and DTZ NSIs allow efficient and
    accurate estimation of path lengths between
    arbitrary nodes in a network
  • Efficient calculations of network statistics
    allow a better range of potential approaches to
    knowledge discovery
  • The space of potential NSIs has not been
    exhaustively researched
  • NSIs could have other applications
  • Finding connection subgraphs
  • Approximating neighborhood functions

38
  • Questions?