Title: Statistical Inference for Large Directed Graphs with Communities of Interest
1Statistical Inference for Large Directed Graphs
with Communities of Interest
2Outline
- Communities of Interest overview
- Why a probabilistic model?
- Bayesian Stochastic Blockmodels
- Example
- Ongoing work
3Communities of interest
- Goal: understand the calling behavior of every TN on the ATT LD network (a massive graph)
- Corinna, Daryl and Chris invented COIs to scale computation using Hancock (Anne Rogers and Kathleen Fisher)
- Definition: the COI of TN X is a subgraph centered around X
- Top k called by X, plus other
- Top k calling X, plus other
4COI signature
(Figure: node X with "Other inbound" and "Other outbound" bins.)
5- Entire graph: the union of all COIs
- Extend a COI by recursively growing the spider
- Captures calling behavior more accurately
- Definition for this work
- Grow the spider to depth 3, but retain only those depth 3 edges that run between depth 2 nodes.
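The definition above can be sketched in code. This is a minimal sketch under stated assumptions, not the Hancock implementation: the function names `top_k` and `extended_coi`, the input format (a dict of weighted directed edges), and the reading of "depth 2 nodes" as the nodes first reached in the second round of growth are all assumptions. The "other" bin is left out.

```python
from collections import defaultdict

def top_k(weights, k):
    """Return the k heaviest neighbours from a {neighbour: weight} dict."""
    return sorted(weights, key=weights.get, reverse=True)[:k]

def extended_coi(calls, seed, k=2):
    """calls: {(caller, callee): weight}. Returns (nodes, edges)."""
    out_nbrs, in_nbrs = defaultdict(dict), defaultdict(dict)
    for (u, v), w in calls.items():
        out_nbrs[u][v] = w
        in_nbrs[v][u] = w

    def spider(x):
        # One COI: top-k outbound and top-k inbound edges of x.
        edges = {(x, v) for v in top_k(out_nbrs[x], k)}
        edges |= {(u, x) for u in top_k(in_nbrs[x], k)}
        return edges

    nodes, edges, frontier = {seed}, set(), {seed}
    for _depth in (1, 2):
        new_edges = set().union(*(spider(x) for x in frontier))
        edges |= new_edges
        reached = {n for e in new_edges for n in e}
        frontier = reached - nodes      # nodes first seen at this depth
        nodes |= reached
    # Depth 3: no new nodes; keep only edges between depth 2 nodes.
    for x in frontier:
        edges |= {(u, v) for (u, v) in spider(x)
                  if u in frontier and v in frontier}
    return nodes, edges
```

With this reading, the third round of growth never adds a node; it only fills in edges among nodes already at depth 2.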
6Extended COI
(Figure: extended COI around the node "me", with X and "other" nodes.)
7Enhancing a COI !!
- Missed calls
- Local calls where the TNs are not ATT local customers
- Outbound OCC calls
- Calls to/from the bin other
- Big outbound and inbound TNs
- Dominate the COI and add a lot of clutter.
- Need to down-weight their calls.
- Other issues
- Want to quantify, for every TN, the tendency to call, the tendency of being called, and the tendency of returning calls.
8Our approach so far
- COI -> social network
- Want a statistical model that estimates missing edges, adds desired ones, and removes (or down-weights) undesired ones.
9(Figure: COI around the node "me".)
COI built from the top-probability edges of a statistical model. The model adds new edges (brown arrows) and removes undesired ones.
10Getting a sense of data
- Some descriptive statistics, based on a random sample of 500 residential COIs.
11density = 100 ne / (g(g-1)), where ne = number of edges and g = number of nodes
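As a quick check of the density formula, here is a small sketch. The function name `coi_density` is an assumption; the numbers are those of the 117-node, 172-edge COI in the Example slide.

```python
def coi_density(n_edges, n_nodes):
    """Percentage of the g(g-1) possible directed edges that are present."""
    possible = n_nodes * (n_nodes - 1)
    return 100.0 * n_edges / possible

# COI from the Example slide: 117 nodes, 172 edges.
print(round(coi_density(172, 117), 2))  # prints 1.27
```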
13Under random: average, conditional on out-degrees
14Under random: conditional on out-degrees
15Under random: conditional on in-degrees
17Distribution of "Other"
18Representing the Data
- Collection of all edges with activity
- Matrix with no diagonal entries
- Collection of several 2x2 contingency tables
19COI: g x g matrix without diagonal entries
20COI: collection of 2x2 tables
- Data matrix: a collection of g(g-1)/2 2x2 tables (called dyads).

                 j->i present   j->i absent   Row total
i->j present     mij            aij           pij
i->j absent      aji            nij           1-pij
Column total     pji            1-pji         1
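To make the dyad representation concrete, the sketch below turns a 0/1 adjacency matrix into its g(g-1)/2 observed 2x2 tables and tallies them. The names `dyad_tables` and `dyad_census` are assumptions for illustration; rows of each table index i->j (present, absent) and columns index j->i (present, absent), matching the layout above.

```python
import numpy as np

def dyad_tables(adj):
    """Yield (i, j, table) for every unordered pair i < j.

    Each table is 2x2 with a single 1 in the observed cell.
    """
    adj = np.asarray(adj)
    g = adj.shape[0]
    for i in range(g):
        for j in range(i + 1, g):
            table = np.zeros((2, 2), dtype=int)
            row = 0 if adj[i, j] else 1   # i->j present / absent
            col = 0 if adj[j, i] else 1   # j->i present / absent
            table[row, col] = 1
            yield i, j, table

def dyad_census(adj):
    """Count mutual, asymmetric and null dyads."""
    total = sum((t for _, _, t in dyad_tables(adj)),
                np.zeros((2, 2), dtype=int))
    return {"mutual": int(total[0, 0]),
            "asymmetric": int(total[0, 1] + total[1, 0]),
            "null": int(total[1, 1])}
```

The diagonal of the adjacency matrix is never read, which matches the "matrix with no diagonal entries" view of the data.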
21- More probabilities than edges.
- Need to express them in terms of fewer
parameters which could be learned from data.
22- Optimizer goes crazy due to the presence of so many zero degrees.
- Do regularization, known as shrinkage estimation in statistics. Incur bias for small-degree nodes but get a reduction in variance.
- All Greek letters are to be estimated from data.
- Computation: 2 minutes for a typical COI on fry.
- Likelihood, gradient and Hessian computed using C, optimizer in R.
23Meaning of parameters
- Node i
- αi: expansiveness (tendency to call)
- βi: attractiveness (tendency of being called)
- Global parameters
- ?: density of COI (reduces with increasing sparseness)
- ?: reciprocity of COI (tendency to return calls)
- ?s: caller-specific effect
- ?r: callee-specific effect
- ?: call-specific effect
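One way to see how these parameters act together is a p1-type log-linear dyad model. The sketch below assumes that form; it is an illustration, not necessarily the exact likelihood used in this work. The names `theta` (density) and `rho` (reciprocity) are labels chosen here, and the caller, callee and call-specific covariate effects are omitted for brevity.

```python
import math

def dyad_probs(theta, rho, alpha_i, beta_i, alpha_j, beta_j):
    """Return (mij, aij, aji, nij) for one dyad under a p1-type model.

    theta: density, rho: reciprocity (names are assumptions),
    alpha: expansiveness, beta: attractiveness.
    """
    w_ij = math.exp(theta + alpha_i + beta_j)   # i calls j only
    w_ji = math.exp(theta + alpha_j + beta_i)   # j calls i only
    w_m = w_ij * w_ji * math.exp(rho)           # calls in both directions
    z = 1.0 + w_ij + w_ji + w_m                 # null dyad has weight 1
    return w_m / z, w_ij / z, w_ji / z, 1.0 / z

m, a_ij, a_ji, n = dyad_probs(-2.0, 1.0, 0.5, 0.0, 0.0, 0.3)
p_ij = m + a_ij   # marginal probability that i calls j
```

In this form a more negative density pushes mass toward the null dyad, and a larger reciprocity moves mass from the asymmetric cells to the mutual cell.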
24Differential reciprocity
- Different reciprocity for each node
- Add another parameter ?i for node i
- Replace ?M by ?M + Σi ?i Mi in the likelihood
- Called the differential reciprocity model
- Computationally challenging, but we have implemented it.
25Missing edges?
- Can estimate all parameters as long as we have some observed edges in the data matrix
- for each row (to estimate expansiveness)
- for each column (to estimate attractiveness)
- Missing local calls -> o.k.
- OCC -> problem, entire row missing.
- Impute data using reasonable assumptions m times (typically m = 3 is o.k.) and combine results. Working on it.
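The "combine results" step is not spelled out here. One standard choice is Rubin's rules for multiple imputation, sketched below as an assumption rather than as the method actually used; the function name `pool_imputations` is hypothetical.

```python
import statistics

def pool_imputations(estimates, variances):
    """Combine m point estimates and their variances with Rubin's rules.

    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)            # average within-imputation variance
    between = statistics.variance(estimates) if m > 1 else 0.0
    total_var = within + (1 + 1 / m) * between      # Rubin's total variance
    return pooled, total_var

# Three imputed data sets (m = 3), one parameter estimate from each.
pooled, total_var = pool_imputations([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term is what carries the extra uncertainty due to the missing row.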
26Incorporating edge weights
- Edge weights binned into k bins using a random sample of 500 COIs. Weights in the ith bin are assigned a score i.
- w's: weights on dyad (i,j).
- tij unknown; imputed using the hypergeometric distribution.

                                  Row total
                 tij     .        wij
                 .       .        k - wij
Column total     wji     k - wji  k
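The two steps on this slide can be sketched as follows. The binning rule is an assumption (quantile cut points from the reference sample), and `make_scorer` and `impute_t` are hypothetical names. The hypergeometric draw is done with an urn so that only the standard library is needed.

```python
import bisect
import random
import statistics

def make_scorer(sample_weights, k):
    """Build a function mapping an edge weight to a score in 1..k.

    Assumption: bins are delimited by quantiles of the reference sample.
    """
    cuts = statistics.quantiles(sample_weights, n=k)    # k-1 cut points
    return lambda w: bisect.bisect_right(cuts, w) + 1

def impute_t(w_ij, w_ji, k, rng):
    """Draw tij from the hypergeometric law fixed by the table margins.

    Urn of k balls, w_ji of them marked; draw w_ij without replacement
    and count the marked ones.
    """
    urn = [1] * w_ji + [0] * (k - w_ji)
    return sum(rng.sample(urn, w_ij))

rng = random.Random(0)
score = make_scorer(list(range(1, 101)), k=4)
t_ij = impute_t(score(30), score(60), 4, rng)
```

Any draw respects the margins: tij lies between max(0, wij + wji - k) and min(wij, wji).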
27Example
- COI with 117 nodes, 172 edges.
- 14 missing edges: local calls from 14 non-ATT local customers to the seed node (local list provided by Gus).
- One edge attribute: number of common buddies between TN i and TN j.
- Tried Bizocity and Localness to seed as caller and callee effects; eventually settled on one caller effect, viz. localness to seed, and no callee effect.
28Parameter estimates.
- Density = -6.28, reciprocity = 2.76 (higher side)
- Caller-specific effect = .29 (TNs local to seed have a higher tendency to call)
- Call-specific effect = .41 (common acquaintances between two TNs increase their tendency to call each other)
30Pruning the big (red) nodes
- Down-weight expansiveness/attractiveness based on the proportion of volume going to other; higher values get down-weighted more by adding an offset
- Renormalize the new probability matrix to have the same mass as the original one.
- Offset function used
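A rough sketch of the down-weight-and-renormalize idea follows. The offset used here (linear in the proportion of volume going to other, with a made-up `strength` constant) is purely hypothetical and stands in for the actual offset function. The offset is applied on the log-odds scale of each edge probability, which is also an assumption.

```python
import numpy as np

def prune(prob, other_out, other_in, strength=2.0):
    """Down-weight edges of heavy 'other' nodes, then restore total mass.

    prob: g x g edge-probability matrix (diagonal ignored).
    other_out / other_in: per-node proportion of outbound / inbound
    volume going to the bin other. The linear offset is hypothetical.
    """
    prob = np.asarray(prob, dtype=float)
    offset = -strength * (np.asarray(other_out, dtype=float)[:, None]
                          + np.asarray(other_in, dtype=float)[None, :])
    off_diag = ~np.eye(prob.shape[0], dtype=bool)
    p = np.clip(prob, 1e-12, 1 - 1e-12)
    logit = np.log(p / (1 - p))
    pruned = 1.0 / (1.0 + np.exp(-(logit + offset)))
    pruned = np.where(off_diag, pruned, 0.0)
    # Rescale so the pruned matrix has the same mass as the original.
    # For sparse COIs entries are small; very large entries could exceed 1.
    mass = prob[off_diag].sum()
    return pruned * (mass / pruned[off_diag].sum())
```

After renormalization the ranking of edges changes even though the total probability mass does not: edges touching nodes with a high proportion of "other" volume drop relative to the rest.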
32Matrix obtained by taking the union of the top 50 data edges, the top 50 edges from the original model, and the top 50 edges from the pruned model.
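The union can be sketched as below, assuming each source (data weights, original model probabilities, pruned model probabilities) is available as a g x g matrix. The names `top_edges` and `union_of_top_edges` are hypothetical.

```python
import numpy as np

def top_edges(matrix, n):
    """Return the n largest positive off-diagonal entries as (i, j) pairs."""
    m = np.asarray(matrix, dtype=float).copy()
    np.fill_diagonal(m, -np.inf)                   # never pick self-loops
    flat = np.argsort(m, axis=None)[::-1][:n]      # indices of the n largest
    rows, cols = np.unravel_index(flat, m.shape)
    return {(int(i), int(j)) for i, j in zip(rows, cols) if m[i, j] > 0}

def union_of_top_edges(matrices, n=50):
    """Union of the top-n edge sets of several matrices."""
    return set().union(*(top_edges(m, n) for m in matrices))
```

A call such as `union_of_top_edges([data, model, pruned], n=50)` then yields at most 150 distinct edges.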
34Where to from here?
- Estimate missing OCC calls: multiple imputation.
- Scale the algorithm to get parameter estimates for every TN, maybe on a weekly basis; enrich the customer signature.
- Can compute the Hellinger distance between two COIs in closed form. Could be useful in supervised learning tasks like tracking repetitive debtors.
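One way such a closed form can arise is sketched below. It assumes that both COIs are defined on the same dyads and that dyads are independent under each model, so the Bhattacharyya coefficient factorizes over dyads. The function name `hellinger` and the convention H^2 = 1 - BC are choices made for this sketch.

```python
import math

def hellinger(dyads_p, dyads_q):
    """Hellinger distance between two products of independent dyad laws.

    Each argument is a list of (mij, aij, aji, nij) probability tuples,
    one per dyad, in the same dyad order.
    """
    log_bc = 0.0
    for p, q in zip(dyads_p, dyads_q):
        bc = sum(math.sqrt(x * y) for x, y in zip(p, q))  # per-dyad coefficient
        if bc == 0.0:
            return 1.0              # disjoint support on some dyad
        log_bc += math.log(bc)      # the product over dyads becomes a sum of logs
    return math.sqrt(max(0.0, 1.0 - math.exp(log_bc)))
```

Summing logs keeps the product stable when a COI has many dyads.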