Statistical Inference for Large Directed Graphs with Communities of Interest

1
Statistical Inference for Large Directed Graphs
with Communities of Interest
  • Deepak Agarwal

2
Outline
  • Communities of Interest overview
  • Why a probabilistic model?
  • Bayesian Stochastic Blockmodels
  • Example
  • Ongoing work

3
Communities of interest
  • Goal: understand the calling behavior of every TN on the
    AT&T LD network, a massive graph
  • Corinna, Daryl and Chris invented COIs to scale the
    computation, using Hancock (Anne Rogers and
    Kathleen Fisher)
  • Definition: the COI of TN X is a subgraph centered
    around X
  • Top k TNs called by X, plus "other"
  • Top k TNs calling X, plus "other"

4
COI signature
[Figure: seed node X with inbound edges plus an "other inbound"
bin, and outbound edges plus an "other outbound" bin]
5
  • Entire graph = union of COIs
  • Extend a COI by recursively growing the spider
  • Captures calling behavior more accurately
  • Definition for this work:
  • Grow the spider to depth 3. Only retain depth-3
    edges that are between depth-2 nodes.
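The depth-3 rule above can be sketched as a breadth-first expansion of the seed's COI. This is a sketch, not the Hancock implementation; `adj` is assumed to be a node → neighbor-set map, with in- and out-neighbors merged for growth:

```python
from collections import deque

def extended_coi(adj, seed):
    # BFS from the seed out to depth 2. Keeping every edge whose two
    # endpoints both survive realizes the slide's rule automatically:
    # depth-3 edges are retained exactly when they join two depth-2 nodes.
    depth = {seed: 0}
    frontier = deque([seed])
    while frontier:
        u = frontier.popleft()
        if depth[u] == 2:
            continue                      # do not expand past depth 2
        for v in adj.get(u, ()):
            if v not in depth:
                depth[v] = depth[u] + 1
                frontier.append(v)
    nodes = set(depth)
    edges = {(u, v) for u in nodes for v in adj.get(u, ()) if v in nodes}
    return nodes, edges
```

An edge from a depth-2 node to a depth-3 node is dropped, because the depth-3 endpoint is never added to `nodes`.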

6
Extended COI
[Figure: extended COI around the seed "me", with depth-1 and
depth-2 nodes X and their "other" bins]
7
Enhancing a COI!
  • Missed calls:
  • local calls where the TNs are not AT&T local
  • outbound OCC calls
  • calls to/from the bin "other"
  • Big outbound and inbound TNs:
  • dominate the COI, a lot of clutter;
  • need to down-weight their calls.
  • Other issues:
  • want to quantify things like tendency to call,
    tendency of being called, and tendency of returning
    calls, for every TN.

8
Our approach so far
  • COI → social network
  • Want a statistical model that estimates missing
    edges, adds desired ones, and removes (or
    down-weights) undesired ones.

9
[Figure: COI built from the top-probability edges of a statistical
model. The model adds new edges (brown arrows) and removes
undesired ones.]
10
Getting a sense of the data
  • Some descriptive statistics based on a random
    sample of 500 residential COIs.

11
density = 100 · ne / (g(g − 1)), where ne = number of edges and
g = number of nodes
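A minimal sketch of the density formula in code:

```python
def density(num_edges, num_nodes):
    # density = 100 * ne / (g * (g - 1)): the percentage of the g(g-1)
    # possible directed edges (no self-loops) that are present.
    g = num_nodes
    return 100.0 * num_edges / (g * (g - 1))
```

For the 117-node, 172-edge COI of the later example this gives about 1.27%.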
12
(No Transcript)
13
Under random: average, conditional on out-degrees
14
Under random: conditional on out-degrees
15
Under random: conditional on in-degrees
16
(No Transcript)
17
Distribution of "Other"
18
Representing the Data
  • Collection of all edges with activity
  • Matrix with no diagonal entries
  • Collection of several 2x2 contingency tables

19
COI = g×g matrix without diagonal entries
20
COI = collection of 2×2 tables
  • Data matrix = a collection of g(g−1)/2 2×2 tables
    (called dyads).

                  j→i present   j→i absent   Row total
  i→j present        mij           aij          pij
  i→j absent         aji           nij        1 − pij
  Column total       pji         1 − pji          1
21
  • More probabilities than edges.
  • Need to express them in terms of fewer
    parameters which could be learned from data.

22
The optimizer goes crazy due to the presence of so many
zero degrees. Do regularization, known as shrinkage
estimation in statistics: incur bias for small-degree
nodes but get a reduction in variance.
All Greek letters are estimated from data.
Computation: 2 minutes for a typical COI on fry.
Likelihood, gradient and Hessian computed in C;
optimizer in R.
23
Meaning of parameters
  • Node i:
  • αi = expansiveness (tendency to call)
  • βi = attractiveness (tendency of being called)
  • Global parameters:
  • θ = density of COI (decreases with increasing
    sparseness)
  • ρ = reciprocity of COI (tendency to return calls)
  • γs = caller-specific effect
  • γr = callee-specific effect
  • γ = call-specific effect
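A minimal sketch of how these parameters could combine into an edge probability, assuming a p1-style logistic form (the slides do not show the exact likelihood, and `x_ij` is a hypothetical call-specific covariate):

```python
import math

def edge_prob(theta, alpha_i, beta_j, gamma=0.0, x_ij=0.0):
    # Log-odds of the edge i -> j: COI density (theta) + expansiveness
    # of the caller (alpha_i) + attractiveness of the callee (beta_j)
    # + an assumed call-specific covariate effect (gamma * x_ij).
    logit = theta + alpha_i + beta_j + gamma * x_ij
    return 1.0 / (1.0 + math.exp(-logit))
```

With the θ = −6.28 estimated in the later example, a dyad with zero node and covariate effects gets probability ≈ 0.002, matching the sparseness of a residential COI.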

24
Differential reciprocity
  • Different reciprocity for each node
  • Add another parameter ρi to node i
  • Replace ρM by ρM + Σi ρi Mi in the likelihood
  • Called the differential reciprocity model
  • Computationally challenging; we have implemented it.

25
Missing edges?
  • Can estimate all parameters as long as we have
    some observed edges in the data matrix:
  • for each row (to estimate expansiveness)
  • for each column (to estimate attractiveness)
  • Missing local calls → o.k.
  • OCC → problem, the entire row is missing.
  • Impute data m times using reasonable assumptions
    (typically m = 3 is o.k.) and combine the results.
    Work in progress.
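One standard way to combine the m imputed analyses is Rubin's rules; the slides say only "combine results", so this is an assumption:

```python
import numpy as np

def combine_imputations(estimates, variances):
    # Rubin's rules: pool m point estimates and their variances from
    # m imputed data sets into one estimate and one total variance.
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    qbar = q.mean()                      # pooled point estimate
    within = u.mean()                    # within-imputation variance
    between = q.var(ddof=1)              # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return qbar, total
```

The (1 + 1/m) inflation of the between-imputation variance is what makes m = 3 acceptable in practice.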

26
Incorporating edge weights
  • Edge weights are binned into k bins using a random
    sample of 500 COIs. Weights in the ith bin are
    assigned a score i.
  • tij is unknown; the w's are the binned weights on
    dyad (i, j); tij is imputed using the hypergeometric
    distribution.

                               Row total
  i→j:         tij        ·       wij
                ·         ·     k − wij
  Column
  totals:      wji     k − wji      k
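Given the margins of such a 2×2 table, the unknown cell is hypergeometric, so the imputation step can be sketched with NumPy (function name and signature are illustrative):

```python
import numpy as np

def impute_tij(w_ij, w_ji, k, rng):
    # In a 2x2 table with grand total k, row total w_ij and column
    # total w_ji, the cell t_ij given the margins is hypergeometric:
    # draw w_ji items from a population of k containing w_ij "successes".
    return rng.hypergeometric(w_ij, k - w_ij, w_ji)
```

The draw is automatically confined to the feasible range max(0, w_ij + w_ji − k) ≤ tij ≤ min(w_ij, w_ji).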
27
Example
  • COI with 117 nodes, 172 edges.
  • 14 missing edges: local calls from 14 non-AT&T
    local customers to the seed node (local list
    provided by Gus).
  • One edge attribute: number of common buddies
    between TN i and TN j.
  • Tried "Bizocity" and "Localness to seed" for the
    caller and callee effects; eventually settled on one
    caller effect, viz. localness to seed, and no callee
    effect.

28
Parameter estimates
  • θ = −6.28, ρ = 2.76 (on the higher side)
  • γs = .29 (TNs local to the seed have a higher
    tendency to call)
  • γ = .41 (common acquaintances between two TNs
    increase their tendency to call each other)

29
(No Transcript)
30
Pruning the big (red) nodes
  • Down-weight expansiveness/attractiveness based on
    the proportion of volume going to "other"; higher
    values get down-weighted more, by adding an offset
  • Renormalize the new probability matrix to have
    the same mass as the original one.
  • Offset function used:
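A sketch of the pruning step, assuming a simple exponential offset (the actual offset function appears only on a figure slide, so the form here is a placeholder):

```python
import numpy as np

def prune(P, other_frac, strength=1.0):
    # P: edge-probability matrix; other_frac[i]: proportion of node i's
    # volume going to the "other" bin. Nodes with more "other" volume
    # receive a smaller multiplier (exponential form is assumed).
    w = np.exp(-strength * np.asarray(other_frac, dtype=float))
    Q = P * np.outer(w, w)               # down-weight both endpoints
    return Q * (P.sum() / Q.sum())       # restore the original mass
```

Rescaling by `P.sum() / Q.sum()` is the renormalization step: total mass is preserved while probability shifts away from edges touching big "other"-heavy nodes.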

31
(No Transcript)
32
Matrix obtained by taking the union of the top 50 data
edges, the top 50 edges from the original model, and the
top 50 edges from the pruned model.
33
(No Transcript)
34
Where to from here?
  • Estimate missing OCC calls: multiple imputation.
  • Scale the algorithm to get parameter estimates
    for every TN, maybe on a weekly basis; enrich the
    customer signature.
  • Can compute the Hellinger distance between two COIs
    in closed form. Could be useful in supervised
    learning tasks like tracking repetitive debtors.
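For Bernoulli edge probabilities, the per-dyad Bhattacharyya coefficient, and hence the Hellinger distance, has a closed form; a sketch (the slides' exact whole-COI formula is not shown and may differ):

```python
import numpy as np

def hellinger(P, Q):
    # Per-edge Bhattacharyya coefficient of two Bernoulli edge
    # probabilities, turned into an average Hellinger distance.
    bc = np.sqrt(P * Q) + np.sqrt((1.0 - P) * (1.0 - Q))
    return np.sqrt(np.clip(1.0 - bc, 0.0, None)).mean()
```

Two identical COI probability matrices have distance 0; the more their edge probabilities disagree, the closer the distance gets to 1.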