1
EDA with Graphs
  • Chris Volinsky
  • Shannon Laboratory
  • AT&T Labs-Research
  • Workshop on Statistical Inference, Computing and
    Visualization for Graphs
  • Stanford University
  • August 2, 2003

2
Introduction
  • Some suggestions about looking at graphs
  • Our way of analyzing graphs: COI
  • Two motivating examples
  • Challenges for the room

Main point: sometimes EDA is all you need!
3
Preaching to the choir
  • Visualize, even when you can't
  • Speech example
  • Learn a little graph theory, even if you don't
    want to
  • Expand your toolbox with
  • bridges
  • cutpoints
  • centroids
  • pseudo cliques
  • strongly connected components
  • Etc.
  • Look at node and edge variables, even if they are
    not there
  • Variables induced by the graph itself are often
    useful (in-out degree, centrality, boundary)
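
As a minimal sketch of "variables induced by the graph itself", the snippet below derives in-degree and out-degree for every node from a plain directed edge list. The function name `degree_variables` and the toy edges are illustrative assumptions, not taken from the talk.

```python
from collections import Counter


def degree_variables(edges):
    """Return {node: (in_degree, out_degree)} for a directed edge list."""
    in_deg, out_deg = Counter(), Counter()
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    # Counter returns 0 for nodes that never appear on one side.
    nodes = set(in_deg) | set(out_deg)
    return {node: (in_deg[node], out_deg[node]) for node in nodes}


# Hypothetical toy graph: each pair is (caller, callee).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
degrees = degree_variables(edges)
```

These per-node values can then be attached to the nodes and explored like any other variable, even when the raw data carries no node attributes.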

4
Our data
  • Huge! Hundreds of millions of nodes and edges,
    mostly connected
  • Modelling, or even EDA, on the entire graph may
    not be possible
  • COI: Communities of Interest are one way of
    analyzing these data
  • Storage - Break it down
  • Analysis - Build up from signatures
  • Updating - Through time via exponential smoothing

5
Storage - Break it down
  • Consider the atomic units of the graph, which we
    call a COI signature
  • For every node in the graph, store
  • Top k numbers inbound
  • Top k numbers outbound
  • Weights on each edge
  • overflow bin
  • In short, we are storing a huge graph as many
    little graphs, which are easily accessible (via
    indexed storage) for analysis.
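
The storage idea above can be sketched as follows: for each node keep the top k inbound and top k outbound neighbours with their edge weights, and fold everything else into an overflow bin. The record layout, the dictionary keys, and keeping one overflow total per direction are assumptions made for illustration; the slide does not specify them.

```python
from collections import defaultdict


def build_signatures(calls, k=2):
    """Build {node: signature} from (caller, callee, weight) records."""
    out_w = defaultdict(lambda: defaultdict(float))
    in_w = defaultdict(lambda: defaultdict(float))
    for caller, callee, weight in calls:
        out_w[caller][callee] += weight
        in_w[callee][caller] += weight

    def top_k(weights):
        # Heaviest edges first; ties broken by name for determinism.
        ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
        kept = dict(ranked[:k])
        overflow = sum(w for _, w in ranked[k:])
        return kept, overflow

    signatures = {}
    for node in set(out_w) | set(in_w):
        outbound, overflow_out = top_k(out_w.get(node, {}))
        inbound, overflow_in = top_k(in_w.get(node, {}))
        signatures[node] = {
            "outbound": outbound,
            "inbound": inbound,
            "overflow_out": overflow_out,
            "overflow_in": overflow_in,
        }
    return signatures


# Hypothetical call records: (caller, callee, weight).
calls = [("A", "B", 5), ("A", "C", 3), ("A", "D", 1), ("B", "A", 2)]
signatures = build_signatures(calls, k=2)
```

Each signature is a small, fixed-size record, which is what makes keyed lookup per node cheap.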

6
Analysis - Build up from signatures
  • Fraud: we build signatures
  • When, how long, but not to whom
  • We use the COI signature to build a Community of
    Interest for everyone, and then use that for
    analysis
  • Example
  • Communities are everywhere (e.g. Amazon), but
    representing (and visualizing) them as a graph
    gives a lot of insight.

7
Updating through time
  • Our graph is dynamic
  • 3M new/old numbers per week!
  • We use an exponentially weighted moving average
    as a way to smoothly update through time
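
A minimal sketch of an exponentially weighted update applied to one node's edge weights. It assumes the common form new = q * old + (1 - q) * current, with the smoothing parameter named `q` to match the "Selecting q" slide; the exact update used in the talk is not shown in the transcript.

```python
def ewma_update(old_weights, period_weights, q):
    """Blend stored edge weights with this period's activity.

    old_weights, period_weights: {neighbour: weight}
    q: smoothing parameter in (0, 1); larger q keeps more history.
    """
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    neighbours = set(old_weights) | set(period_weights)
    return {
        n: q * old_weights.get(n, 0.0) + (1.0 - q) * period_weights.get(n, 0.0)
        for n in neighbours
    }


# Hypothetical example: B was active before but silent this period,
# C is a new contact this period.
updated = ewma_update({"B": 10.0}, {"B": 0.0, "C": 4.0}, q=0.5)
```

Edges that stop appearing decay geometrically rather than vanishing at once, and new edges enter with a reduced weight.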

8
Two motivating examples
  • Two examples where looking at local network
    behavior via COI helped answer the questions of
    interest, without modeling
  • Viral Marketing
  • Fraud

9
Viral Marketing plans
  • Viral Marketing: let your customers sell for you
  • COI was the perfect tool to throw at this: by
    capturing the local neighborhood of the
    enrollees, we can test the viral hypothesis
  • We can also track through time
  • What did we do?
  • For the enrollees, find the induced subgraph from
    their COI
  • Look at a control group
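
The "induced subgraph from their COI" step can be sketched as below, followed by a connected-components pass, since the wrap-up slide mentions using connected components as clusters. The data layout (a dict of neighbour sets) and all names are assumptions for illustration.

```python
def induced_subgraph(coi, seeds):
    """Keep the seeds and their COI neighbours, plus edges among kept nodes.

    coi: {node: set of neighbours}; seeds: iterable of enrollee nodes.
    Returns an undirected adjacency dict {node: set of neighbours}.
    """
    keep = set(seeds)
    for seed in seeds:
        keep |= coi.get(seed, set())
    adj = {node: set() for node in keep}
    for node in keep:
        for nbr in coi.get(node, set()):
            if nbr in keep:
                adj[node].add(nbr)
                adj[nbr].add(node)
    return adj


def connected_components(adj):
    """Return a list of node sets, one per connected component."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            component.add(node)
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
        components.append(component)
    return components


# Hypothetical COI data: e1, e2, e3 are enrollees.
coi = {"e1": {"x", "e2"}, "e2": {"e1"}, "e3": {"y"}, "z": {"q"}}
clusters = connected_components(induced_subgraph(coi, ["e1", "e2", "e3"]))
```

Running the same two steps on a control group of non-enrollees gives a baseline for how much clustering to expect by chance.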

10
Cluster results
Let's look at some
11
What's up with the big cluster?
12
RDD: Repetitive Debtors Database
  • Lots of people can't pay their bill, but they want
    phone service anyway

13
RDD Process
  • A big matching problem.
  • Every day:
  • we get restricted TNs (telephone numbers), 4K / day
  • we get connected TNs, 40K / day
  • Look over a 30-day period (possibly 4B
    comparisons!)
  • Compare the COI graphs of the disconnected number
    and the new number
  • We need a metric for graph distance
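
One way to read the "4B comparisons" figure is each day's connected TNs compared against every TN restricted during the 30-day window; that reading is an assumption. The sketch also shows Jaccard distance between two COI member sets as a simple stand-in for a graph distance, which is not necessarily the metric used in the talk.

```python
RESTRICTED_PER_DAY = 4_000   # from the slide
CONNECTED_PER_DAY = 40_000   # from the slide
WINDOW_DAYS = 30             # from the slide

# Assumed reading: one day's new connects vs. a 30-day pool of restricted TNs.
restricted_pool = RESTRICTED_PER_DAY * WINDOW_DAYS
candidate_pairs = restricted_pool * CONNECTED_PER_DAY


def jaccard_distance(coi_a, coi_b):
    """1 - |A & B| / |A | B| for two sets of COI members."""
    union = coi_a | coi_b
    if not union:
        return 0.0
    return 1.0 - len(coi_a & coi_b) / len(union)


# Hypothetical COI member sets for a disconnected and a new number.
distance = jaccard_distance({"x", "y", "z"}, {"y", "z", "w"})
```

Under this reading the pair count is 4.8 billion per day, which is why a cheap first-pass filter matters before any finer distance is computed.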

14
Matching Strategy
  • We use a combination of
  • Intersection > 2 (to pare down)
  • Name/address overlap (to weed out)
  • $ owed (to prioritize)
  • Here's where modeling could help, or maybe not
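
A rough sketch of the pare-down and prioritize steps: keep pairs whose COIs share more than two members, then rank by the amount owed on the disconnected number. The name/address step is left out because the slide does not say in which direction it filters. Function and parameter names are hypothetical.

```python
def shortlist(disconnected, connected, owed, min_intersection=2):
    """Return candidate (old_tn, new_tn, shared) triples, highest debt first.

    disconnected, connected: {tn: set of COI members}
    owed: {old_tn: amount owed}
    """
    pairs = []
    for old_tn, old_coi in disconnected.items():
        for new_tn, new_coi in connected.items():
            shared = len(old_coi & new_coi)
            if shared > min_intersection:
                pairs.append((old_tn, new_tn, shared))
    # Largest debt first, then largest overlap.
    pairs.sort(key=lambda p: (-owed.get(p[0], 0.0), -p[2]))
    return pairs


# Hypothetical data.
disconnected = {"old1": {"a", "b", "c", "d"}, "old2": {"a", "b", "c"}}
connected = {"new1": {"a", "b", "c", "z"}, "new2": {"a", "q"}}
owed = {"old1": 100.0, "old2": 900.0}
cases = shortlist(disconnected, connected, owed)
```

The double loop is shown for clarity only; at the volumes quoted earlier an inverted index from COI member to TN would be needed to avoid enumerating every pair.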

15
Wrap up
  • Viral Marketing
  • Used connected components of reduced data as
    clusters
  • Looked for centers of clusters for retention
  • Visualized clusters for understanding
  • Used boundary to predict new customers
  • COI was the best predictive variable in a
    marketing study
  • Fraud
  • Attacked massive matching via simple measures of
    distance
  • Fraud reps use visualized clusters to work cases
  • We detected RDD with an 80% success rate
  • Is this EDA?

16
Challenges
  • Viewing graphs through time
  • What if I don't know what is coming next?
  • Graph distance metrics
  • What does distance between graphs mean?
  • Tools for looking at many graphs
  • what do union and intersection mean?
  • Modelling and EDA go hand in hand
  • Viral marketing models define network value, feed
    this into graph to do EDA.

17
An answer for Duncan
  • What do I want and who is going to do it?
  • Tools that combine
  • Interactive capability
  • Graph operations
  • Statistical analysis
  • It's happening
  • It's great!!
  • It's a little confusing
  • This model works for me. Do you agree?

18
  • What I want.
  • powerful ways to do union/intersection
  • unclear actually what that means
  • statistical measures of distances between
    graphs, what is the metric of interest, really?
  • use variables on nodes and edges to easily
    define new graphs, and automatically point me
    towards the interesting ones (largest, densest)
  • standard tools for finding graph theoretic
    concepts like cliques, pseudo cliques, density,
    bridge edges, boundary
  • ability to visualize the temporal component of
    graphs: is there another paradigm other than
    "plot the ubergraph"?

19
  • Points to make
  • if each TN is a graph, and we are looking for
    similar graphs, we could be doing millions or
    billions of these comparisons. SNA stuff is great,
    but it doesn't really work!
  • sometimes EDA is the answer: it is the best we
    can do, or perhaps it is sufficient for the user.
  • think graphs and plot it! Even if you can't
    plot the whole thing, plot some of it (do speech
    example).
  • network value might be important; this might
    not be the same as density. It may be a
    sunburst, which is not a high-density subgraph,
    or highest value; it may depend on time
  • Modelling can be great: find pseudo edges, use
    latent space models, etc.

20
  • Visualize, even when you can't
  • always a way to subset or threshold, or
    something
  • Speech example
  • Learn some graph theory
  • bridge nodes/edges
  • Density, definitions of cliques and pseudo cliques
  • DFS/BFS, minimum spanning trees
  • Strongly connected components
  • subset
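
As one example of the graph-theoretic tools listed above, here is a sketch of finding bridge edges with a depth-first search (the standard low-link approach). The adjacency format and the function name are assumptions; the recursion is fine for small subgraphs but would need an iterative version for very deep ones.

```python
def find_bridges(adj):
    """Return sorted bridge edges of an undirected simple graph.

    adj: {node: set of neighbours}, symmetric.
    """
    disc, low, bridges = {}, {}, []
    counter = [0]

    def dfs(node, parent):
        disc[node] = low[node] = counter[0]
        counter[0] += 1
        for nxt in adj[node]:
            if nxt == parent:
                continue
            if nxt in disc:
                # Back edge: node can reach an earlier-discovered vertex.
                low[node] = min(low[node], disc[nxt])
            else:
                dfs(nxt, node)
                low[node] = min(low[node], low[nxt])
                # No back edge from nxt's subtree reaches node or above.
                if low[nxt] > disc[node]:
                    bridges.append(tuple(sorted((node, nxt))))

    for start in adj:
        if start not in disc:
            dfs(start, None)
    return sorted(bridges)


# Hypothetical graph: a triangle a-b-c with a pendant node d attached to c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
bridges = find_bridges(adj)
```

Removing a bridge disconnects the graph, so bridges are natural candidates for the edges that tie two communities together.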

21
Storing COI Signatures
  • COI sigs are stored in Hancock, a C-based
    domain-specific language designed for large
    amounts of signature-type data (Rogers, Fisher,
    et al.)
  • Indexed by TN, so it is easy and fast to get COI
    for large lists of TNs, and use spiders for
    recursion.
  • e.g. cycling over all TNs to learn something
    about our customer base takes minutes. We could
    never do this before!
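
The "spider" idea, recursively expanding from one TN through the indexed signatures, can be sketched as a breadth-first walk with a depth limit. A plain Python dict stands in for the Hancock index here; nothing below reflects Hancock's actual interface.

```python
from collections import deque


def spider(index, seed, depth=2):
    """Return {tn: hops from seed} for every TN within `depth` hops.

    index: {tn: iterable of neighbour tns}, a stand-in for keyed storage.
    """
    hops = {seed: 0}
    queue = deque([seed])
    while queue:
        tn = queue.popleft()
        if hops[tn] == depth:
            continue  # do not expand beyond the depth limit
        for nbr in index.get(tn, ()):
            if nbr not in hops:
                hops[nbr] = hops[tn] + 1
                queue.append(nbr)
    return hops


# Hypothetical index of COI neighbours.
index = {"A": ["B", "C"], "B": ["D"], "D": ["E"]}
reached = spider(index, "A", depth=2)
```

Because each step is a keyed lookup of a small signature, the cost grows with the size of the neighbourhood visited rather than with the size of the whole graph.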

22
(No Transcript)
23
Informative overlap score
  • Calculate the informative overlap score
  • Where
  • w_ao: weight of edge from a to o
  • w_ob: weight of edge from o to b
  • w_o: sum of the weights of edges to o
  • d_ao, d_ob: the graph distances from a and b to
    o
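
The formula itself is missing from the transcript. The sketch below assumes one plausible way to combine exactly the quantities defined above: sum over shared nodes o of (w_ao * w_ob / w_o) / (d_ao * d_ob). This combination is an assumption made for illustration and is not recovered from the slide.

```python
def informative_overlap(w_ao, w_ob, w_o, d_ao, d_ob):
    """Assumed overlap score between the COIs of a and b.

    w_ao[o]: weight of edge from a to o
    w_ob[o]: weight of edge from o to b
    w_o[o]:  sum of the weights of edges to o
    d_ao[o], d_ob[o]: graph distances from a and from b to o
    """
    score = 0.0
    for o in set(w_ao) & set(w_ob):
        shared_weight = w_ao[o] * w_ob[o] / w_o[o]
        score += shared_weight / (d_ao[o] * d_ob[o])
    return score


# Hypothetical values: only o1 is shared between a and b.
score = informative_overlap(
    w_ao={"o1": 2.0, "o2": 1.0},
    w_ob={"o1": 4.0, "o3": 1.0},
    w_o={"o1": 8.0, "o2": 5.0, "o3": 5.0},
    d_ao={"o1": 1, "o2": 1},
    d_ob={"o1": 2, "o3": 1},
)
```

Under this assumed form, dividing by w_o downweights shared nodes that receive a lot of traffic from everyone, and dividing by the distances downweights shared nodes that are far from a or b.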

24
Selecting q
  • Calls fade out over time
  • The larger q is, the longer a call has
    non-negligible weight
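
To make the effect of q concrete, the sketch below counts how many update periods it takes for a one-off call's weight to fall below a threshold, assuming the weight is multiplied by q each period. The threshold `eps` and the function name are illustrative choices, not values from the talk.

```python
def periods_until_negligible(q, eps=0.01):
    """Number of periods until a weight decayed by q per period drops below eps."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    periods, weight = 0, 1.0
    while weight >= eps:
        weight *= q
        periods += 1
    return periods


fast_fade = periods_until_negligible(0.5)
slow_fade = periods_until_negligible(0.9)
```

With these assumptions a call fades below 1% of its initial weight after 7 periods at q = 0.5 and after 44 periods at q = 0.9, so q directly sets how long the graph remembers old activity.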