Statistical Inference for Large Directed Graphs with Communities of Interest

1
Statistical Inference for Large Directed Graphs
with Communities of Interest

Deepak Agarwal

2
Outline

Communities of Interest overview
Why a probabilistic model?
Bayesian Stochastic Blockmodels
Example
Ongoing work

3
Communities of interest

Goal: understand the calling behavior of every TN (telephone number) on
the AT&T LD (long-distance) network, a massive graph
Corinna, Daryl and Chris invented COIs to scale
computation using Hancock (Anne Rogers and
Kathleen Fisher)
Definition: the COI of TN X is a subgraph centered
around X
Top k TNs called by X, plus "other"
Top k TNs calling X, plus "other"
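
The COI definition above can be sketched as a small data structure. This is a minimal illustration only: the `(caller, callee, weight)` record format, the function name `coi_signature`, and the toy call records are assumptions, not part of the original work.

```python
from collections import Counter

def coi_signature(calls, x, k):
    """Summarize TN x as its top-k outbound and inbound partners plus an 'other' bin.

    calls: iterable of (caller, callee, weight) records (hypothetical input format).
    """
    outbound, inbound = Counter(), Counter()
    for caller, callee, weight in calls:
        if caller == x:
            outbound[callee] += weight
        if callee == x:
            inbound[caller] += weight

    def top_k_plus_other(counts):
        top = dict(counts.most_common(k))
        # Everything outside the top k is lumped into a single "other" total.
        other = sum(counts.values()) - sum(top.values())
        return top, other

    return {"outbound": top_k_plus_other(outbound),
            "inbound": top_k_plus_other(inbound)}

# Toy records, made up for illustration.
calls = [("X", "A", 5), ("X", "B", 3), ("X", "C", 1), ("D", "X", 4), ("E", "X", 2)]
signature = coi_signature(calls, "X", k=2)
```

With k = 2, the call to C falls into the outbound "other" bin, while both inbound callers fit in the top k.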

4
COI signature
(Diagram labels: "Other inbound", X, "Other outbound".)
5

Entire graph = union of COIs
Extend a COI by recursively growing the spider
Captures calling behavior more accurately
Definition for this work:
Grow the spider to depth 3. Retain only those depth-3
edges that are between depth-2 nodes.
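
One possible reading of this definition, sketched under stated assumptions: nodes are kept up to depth 2 from the seed, and an edge is kept when both of its endpoints are kept, so a third hop survives only if it lands back on a depth-2 node. The function name `extended_coi`, the edge-list format, and the choice to follow calls in both directions are assumptions; top-k pruning is omitted.

```python
from collections import defaultdict, deque

def extended_coi(edges, seed, max_depth=2):
    """Return node depths and the retained directed edges around `seed`."""
    neighbors = defaultdict(set)
    for caller, callee in edges:
        # Follow both outbound and inbound calls when growing the spider.
        neighbors[caller].add(callee)
        neighbors[callee].add(caller)

    depth = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # do not admit nodes beyond max_depth
        for other in neighbors[node]:
            if other not in depth:
                depth[other] = depth[node] + 1
                queue.append(other)

    # Keep an edge only when both endpoints were admitted.
    kept = [(u, v) for u, v in edges if u in depth and v in depth]
    return depth, kept

edges = [("me", "a"), ("a", "b"), ("c", "a"), ("b", "c"), ("b", "d"), ("d", "e")]
depth, kept = extended_coi(edges, "me")
```

Here the edge b -> c is a third hop between two depth-2 nodes and is retained, while b -> d leads to a depth-3 node and is dropped.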

6
Extended COI
(Diagram labels: x, "other", "me", "other", X.)
7
Enhancing a COI

Missed calls:
Local calls where the TNs are not AT&T local customers
Outbound OCC calls
Calls to/from the "other" bin
Big outbound and inbound TNs:
They dominate the COI and add a lot of clutter.
Need to down-weight their calls.
Other issues:
Want to quantify, for every TN, the tendency to call,
the tendency of being called, and the tendency of
returning calls.

8
Our approach so far

COI -> social network
Want a statistical model that estimates missing
edges, adds desired ones, and removes (or
down-weights) undesired ones.

9
COI around "me", built from the top-probability edges of a
statistical model. The model adds new edges (brown arrows)
and removes undesired ones.
10
Getting a sense of data

Some descriptive statistics
based on a random sample
of 500 residential COIs.

11
density = 100 n_e / (g(g-1)), where n_e is the number of edges and g is
the number of nodes
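
As a quick worked example of the density formula, the sketch below computes the percentage of possible directed edges that are present. The function name `coi_density` is an assumption; the node and edge counts are those of the example COI described later in the talk (117 nodes, 172 edges).

```python
def coi_density(num_edges, num_nodes):
    """Percentage of the g(g-1) possible directed edges that are present."""
    if num_nodes < 2:
        raise ValueError("need at least two nodes")
    return 100.0 * num_edges / (num_nodes * (num_nodes - 1))

example_density = coi_density(172, 117)  # roughly 1.27 percent
```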
12
(No Transcript)
13
Under random: average conditional on out-degrees
14
Under random: conditional on out-degrees
15
Under random: conditional on in-degrees
16
(No Transcript)
17
Distribution of "Other"
18
Representing the Data

Collection of all edges with activity
Matrix with no diagonal entries
Collection of several 2x2 contingency tables

19
COI: g x g matrix without diagonal entries
20
COI: collection of 2x2 tables

Data matrix: a collection of g(g-1)/2 2x2 tables
(called dyads), one for each pair (i, j).

                    j -> i present   j -> i absent   Row total
i -> j present      m_ij             a_ij            p_ij
i -> j absent       a_ji             n_ij            1 - p_ij
Column total        p_ji             1 - p_ji        1
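
To make the dyad representation concrete, the sketch below classifies every unordered pair of a 0/1 adjacency matrix into the four cells of the 2x2 table (mutual, i -> j only, j -> i only, null). The function name `dyad_tables` and the toy matrix are assumptions for illustration.

```python
import numpy as np

def dyad_tables(adjacency):
    """Count dyad states over all pairs i < j of a 0/1 directed adjacency matrix."""
    y = np.asarray(adjacency)
    g = y.shape[0]
    counts = {"mutual": 0, "i_to_j_only": 0, "j_to_i_only": 0, "null": 0}
    for i in range(g):
        for j in range(i + 1, g):
            if y[i, j] and y[j, i]:
                counts["mutual"] += 1
            elif y[i, j]:
                counts["i_to_j_only"] += 1
            elif y[j, i]:
                counts["j_to_i_only"] += 1
            else:
                counts["null"] += 1
    return counts

# Toy 3-node graph: 0 <-> 1 is mutual, 0 -> 2 is one-way, 1 and 2 are unconnected.
adjacency = [[0, 1, 1],
             [1, 0, 0],
             [0, 0, 0]]
census = dyad_tables(adjacency)
```

The four counts always add up to g(g-1)/2, the number of dyads.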
21

More probabilities than edges.
Need to express them in terms of fewer
parameters which could be learned from data.

22
The optimizer becomes unstable because so many nodes have
zero degree.
Remedy: regularization, known as shrinkage estimation in
statistics. This incurs bias for small-degree nodes but
reduces variance.
All Greek letters are to be estimated from data.
Computation: 2 minutes for a typical COI on fry.
Likelihood, gradient and Hessian computed in C; optimizer
in R.
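
A minimal sketch of what a shrinkage-penalized likelihood could look like. The transcript does not preserve the model equations, so this assumes the standard p1 dyad model of Holland and Leinhardt (theta for density, rho for reciprocity, alpha for expansiveness, beta for attractiveness) and a simple quadratic (ridge) penalty on the node effects. The function name, the penalty form, and the parameter `lam` are all assumptions; covariate effects are omitted.

```python
import numpy as np

def penalized_neg_log_lik(y, theta, rho, alpha, beta, lam):
    """Negative log-likelihood of a p1-style dyad model plus a ridge penalty."""
    y = np.asarray(y)
    g = y.shape[0]
    nll = 0.0
    for i in range(g):
        for j in range(i + 1, g):
            ij = theta + alpha[i] + beta[j]  # log-weight of edge i -> j
            ji = theta + alpha[j] + beta[i]  # log-weight of edge j -> i
            # Unnormalized log-weights: null, i->j only, j->i only, mutual.
            log_w = np.array([0.0, ij, ji, ij + ji + rho])
            state = y[i, j] + 2 * y[j, i]    # index of the observed dyad state
            nll -= log_w[state] - np.logaddexp.reduce(log_w)
    # Shrink node effects toward zero (assumed quadratic penalty).
    penalty = 0.5 * lam * (np.sum(np.square(alpha)) + np.sum(np.square(beta)))
    return nll + penalty

y = [[0, 1, 1], [1, 0, 0], [0, 0, 0]]
zeros = [0.0, 0.0, 0.0]
nll_flat = penalized_neg_log_lik(y, 0.0, 0.0, zeros, zeros, lam=1.0)
```

With all parameters at zero each dyad state has probability 1/4, so the value is 3 log 4. A node with no observed edges contributes little to the likelihood, which is why a penalty of this kind keeps its effects from drifting to extreme values.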
23
Meaning of parameters

Node i
α_i: expansiveness (tendency to call)
β_i: attractiveness (tendency of being called)
Global parameters
θ: density of COI (decreases with increasing
sparseness)
ρ: reciprocity of COI (tendency to return calls)
Caller-specific effect (subscript s)
Callee-specific effect (subscript r)
Call-specific effect

24
Differential reciprocity

Different reciprocity for each node
Add another parameter ρ_i for node i
Replace ρM by ρM + Σ_i ρ_i M_i in the likelihood
Called the differential reciprocity model
Computationally challenging; have implemented it.
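
The reciprocity term of the differential reciprocity model can be computed directly from the adjacency matrix. In this sketch, ρ is the assumed symbol for the global reciprocity parameter, M is taken to be the number of mutual dyads, and M_i the number of mutual dyads involving node i; these readings, and the function name `reciprocity_term`, are assumptions.

```python
import numpy as np

def reciprocity_term(y, rho, rho_node):
    """Return rho*M + sum_i rho_i*M_i together with M and the vector of M_i."""
    y = np.asarray(y)
    mutual = y * y.T                 # 1 where both i -> j and j -> i are present
    m_per_node = mutual.sum(axis=1)  # M_i: mutual dyads involving node i
    m_total = mutual.sum() / 2       # M: the matrix counts each mutual dyad twice
    term = rho * m_total + float(np.dot(rho_node, m_per_node))
    return term, m_total, m_per_node

y = [[0, 1, 1], [1, 0, 0], [0, 0, 0]]
term, m_total, m_per_node = reciprocity_term(y, rho=2.0, rho_node=[0.5, 0.25, 0.0])
```

Setting every ρ_i to zero recovers the single-reciprocity term ρM.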

25
Missing edges?

Can estimate all parameters as long as we have
some observed edges in the data matrix
for each row (to estimate expansiveness)
for each column (to estimate attractiveness)
Missing local calls -> o.k.
OCC -> problem, entire row missing.
Impute data using reasonable assumptions m times
(typically m = 3 is o.k.) and combine results. Working
on it.
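
The talk does not say how the m sets of results are combined. A common choice is Rubin's rules, sketched here as an assumption: the pooled estimate is the mean across imputations, and the total variance adds the between-imputation variance, inflated by (1 + 1/m), to the average within-imputation variance. The numbers are made up for illustration.

```python
import numpy as np

def combine_imputations(estimates, variances):
    """Pool one parameter's estimates from m imputed data sets (Rubin's rules)."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    pooled = estimates.mean()
    within = variances.mean()        # average sampling variance
    between = estimates.var(ddof=1)  # variance across the m imputations
    total_var = within + (1 + 1 / m) * between
    return pooled, total_var

# m = 3 imputations of a single parameter (illustrative values only).
pooled, total_var = combine_imputations([1.0, 1.2, 1.4], [0.04, 0.05, 0.03])
```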

26
Incorporating edge weights

Edge weights are binned into k bins using a random
sample of 500 COIs. Weights in the ith bin are assigned
a score i.
t_ij is unknown; the w's are the weights on dyad (i,j).
t_ij is imputed using the hypergeometric distribution.

                                      Row total
                t_ij                  w_ij
                                      k - w_ij
Column total    w_ji      k - w_ji    k
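
A sketch of the two steps described on this slide, under stated assumptions: bins are taken to be equal-frequency (quantile) bins of a reference sample, and the unknown cell t_ij of the 2x2 table with row total w_ij, column total w_ji and grand total k is drawn from the matching hypergeometric distribution. The function names and the synthetic sample are hypothetical; the slide does not say how the bins were chosen.

```python
import numpy as np

def weight_scores(weights, sample_weights, k):
    """Score each weight 1..k by its quantile bin in a reference sample."""
    # k - 1 interior cut points split the sample into k equal-frequency bins.
    cuts = np.quantile(sample_weights, np.linspace(0, 1, k + 1)[1:-1])
    return np.searchsorted(cuts, weights, side="right") + 1

def impute_t(w_ij, w_ji, k, rng):
    """Draw the unknown cell of a 2x2 table with margins w_ij, w_ji and total k."""
    return rng.hypergeometric(ngood=w_ij, nbad=k - w_ij, nsample=w_ji)

rng = np.random.default_rng(0)
sample = rng.exponential(scale=10.0, size=500)  # made-up stand-in for sampled weights
scores = weight_scores([0.5, 8.0, 60.0], sample, k=5)
w_ij, w_ji = int(scores[1]), int(scores[2])
t_ij = impute_t(w_ij, w_ji, k=5, rng=rng)
```

The draw always respects the table margins: t_ij lies between max(0, w_ij + w_ji - k) and min(w_ij, w_ji).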
27
Example

COI with 117 nodes, 172 edges.
14 missing edges: local calls from 14 non-AT&T
local customers to the seed node (local list provided
by Gus).
One edge attribute: number of common buddies
between TN i and TN j
Tried Bizocity and localness to seed for caller
and callee effects; eventually settled on one
caller effect, viz. localness to seed, and no callee
effect.

28
Parameter estimates

θ (density) = -6.28, ρ (reciprocity) = 2.76 (on the higher side)
Caller-specific effect = 0.29 (TNs local to the seed have a higher
tendency to call)
Call-specific effect = 0.41 (common acquaintances between two TNs
increase their tendency to call each other)

29
(No Transcript)
30
Pruning the big (red) nodes

Down-weight expansiveness/attractiveness based on the
proportion of volume going to "other"; nodes with higher
values get down-weighted more, by adding an offset.
Renormalize the new probability matrix to have
the same mass as the original one.
Offset function used
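
The offset function itself is not preserved in this transcript. The sketch below therefore uses a purely hypothetical linear offset (proportional to each node's share of volume going to "other") applied on the log-odds scale of the edge probabilities, and then rescales the matrix so its total mass matches the original. The function name, the `strength` parameter, and the toy matrix are assumptions.

```python
import numpy as np

def prune_and_renormalize(prob, other_share, strength=1.0):
    """Down-weight edges of 'other'-heavy nodes, then restore the original mass."""
    prob = np.asarray(prob, dtype=float)
    share = np.asarray(other_share, dtype=float)
    off_diag = ~np.eye(prob.shape[0], dtype=bool)
    # Hypothetical offset: caller share (rows) plus callee share (columns).
    offset = strength * (share[:, None] + share[None, :])
    logit = np.zeros_like(prob)
    logit[off_diag] = np.log(prob[off_diag] / (1.0 - prob[off_diag]))
    pruned = np.zeros_like(prob)
    pruned[off_diag] = 1.0 / (1.0 + np.exp(-(logit - offset)[off_diag]))
    # Rescale so the pruned matrix has the same total mass as the original.
    return pruned * (prob[off_diag].sum() / pruned.sum())

prob = np.full((3, 3), 0.2)
np.fill_diagonal(prob, 0.0)
pruned = prune_and_renormalize(prob, other_share=[0.9, 0.1, 0.1])
```

Node 0 sends most of its volume to "other", so its edges end up with less probability than the edge between nodes 1 and 2. A plain rescaling like this can push individual entries above 1 in extreme cases, so a real implementation would need to guard against that.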

31
(No Transcript)
32
Matrix obtained by taking the union of the top 50 data
edges, the top 50 edges from the original model, and the
top 50 edges from the pruned model.
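
A small sketch of how such a union of top edges could be assembled from three g x g matrices (observed weights, original model probabilities, pruned model probabilities). The function names and toy matrices are assumptions; the example uses the top 1 edge instead of the top 50 to stay readable.

```python
import numpy as np

def top_edges(matrix, n):
    """Return the n largest off-diagonal entries as a set of (i, j) pairs."""
    m = np.asarray(matrix, dtype=float).copy()
    np.fill_diagonal(m, -np.inf)  # self-edges never qualify
    g = m.shape[0]
    n = min(n, g * (g - 1))       # cannot return more edges than exist
    flat = np.argsort(m, axis=None)[::-1][:n]
    rows, cols = np.unravel_index(flat, m.shape)
    return {(int(i), int(j)) for i, j in zip(rows, cols)}

def union_of_top_edges(matrices, n=50):
    """Union of the top-n edge sets of several matrices."""
    return set().union(*(top_edges(m, n) for m in matrices))

data = [[0, 5, 1], [0, 0, 2], [3, 0, 0]]
model = [[0, 0.1, 0.9], [0.2, 0, 0.3], [0.1, 0.1, 0]]
pruned = [[0, 0.1, 0.2], [0.2, 0, 0.8], [0.1, 0.1, 0]]
edges = union_of_top_edges([data, model, pruned], n=1)
```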
33
(No Transcript)
34
Where to from here?

Estimate missing OCC calls: multiple imputation.
Scale the algorithm to get parameter estimates
for every TN, maybe on a weekly basis; enrich the
customer signature.
Can compute the Hellinger distance between two COIs
in closed form. Could be useful in supervised
learning tasks like tracking repetitive debtors.
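
One way a closed form can arise, sketched under assumptions: if each COI is modeled as a set of independent dyads over the same node pairs, each with four state probabilities (mutual, i -> j only, j -> i only, null), then the squared Hellinger distance is one minus the product of the per-dyad Bhattacharyya coefficients. The function name and the toy probabilities are hypothetical, and the talk does not state which closed form was used.

```python
import numpy as np

def hellinger_distance(dyads_p, dyads_q):
    """Hellinger distance between two products of independent dyad distributions.

    Each argument has one row per dyad holding the four state probabilities.
    """
    p = np.asarray(dyads_p, dtype=float)
    q = np.asarray(dyads_q, dtype=float)
    bc_per_dyad = np.sqrt(p * q).sum(axis=1)  # Bhattacharyya coefficient per dyad
    # Clamp at zero to absorb floating-point round-off for identical inputs.
    return float(np.sqrt(max(0.0, 1.0 - np.prod(bc_per_dyad))))

coi_a = [[0.10, 0.20, 0.20, 0.50], [0.05, 0.15, 0.10, 0.70]]
coi_b = [[0.30, 0.10, 0.10, 0.50], [0.05, 0.05, 0.20, 0.70]]
distance = hellinger_distance(coi_a, coi_b)
```

The distance is 0 for identical COIs and 1 when some dyad has no state with positive probability under both models.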
