Title: The Probability Mechanics of Social Networks
1The Probability Mechanics of Social Networks
- Ian McCulloh
- Carnegie Mellon University
- Pittsburgh, PA 15213
- Joshua Lospinoso
- United States Military Academy
- West Point, NY 10996
2Agenda
- Background
- Motivation
- Probability Spaces
- Complications with Social Networks
- Previous work
- Network Probability Matrices
- Statistical Distributions
- Applications
- References
3Background
- Two broad areas of modeling: deterministic and stochastic.
- Differential Equations
- Predator prey systems
- Regression Analysis
- Economics
- Design and Analysis of Experiments
- Social Network Analysis (SNA) should be
interested in what regression analysis ignores
and differential equations assume away.
4Motivation
- Regression analysis enjoys a century of study and rigorous research.
- The SNA community needs the same rigorous set of probability mechanics.
- Enable application of statistical techniques
- Error analysis
- Design of efficient sociological experiments
- There are approaches currently being refined to answer this call, and all rely on assumptions about underlying entity and edge behaviors.
- We wish to provide a framework flexible to a wide array of these initial assumptions.
5What is a probability space?
- Introduced by the famous mathematician Andrey Kolmogorov, the probability space is the foundation of all probability theory.
- A probability space contains a sample space of outcomes, an event space, and an associated probability measure: $(\Omega, \mathcal{F}, P)$
- Consider the simple example of two flips of a coin.
- Assume the two flips are independent and the coin lands either heads or tails.
- The possible outcomes for two flips are as follows, where H is heads, T is tails, and outcomes are of the form (Trial 1, Trial 2):
- (H,H) (H,T) (T,H) (T,T)
- These four sample outcomes make up Ω. The event space F is the collection of subsets of Ω to which a probability can be assigned. Assuming a fair coin, the associated probability P of each outcome is 1/4 = 0.25.
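The two-flip example is small enough to enumerate directly. The sketch below is a minimal illustration, assuming a fair coin; the names `sample_space`, `prob`, and `event_probability` are hypothetical and chosen only for this example.

```python
from fractions import Fraction
from itertools import product

# Sample space: every ordered pair (Trial 1, Trial 2) of two flips.
sample_space = list(product("HT", repeat=2))

# Assumption: fair coin and independent flips, so all outcomes are equally likely.
prob = {outcome: Fraction(1, len(sample_space)) for outcome in sample_space}


def event_probability(event):
    """Probability of an event, i.e. a set of sample outcomes."""
    return sum(prob[outcome] for outcome in event)


# Example event: at least one head in the two flips.
at_least_one_head = {o for o in sample_space if "H" in o}

print(sample_space)
print(event_probability(at_least_one_head))  # 3/4
```

Any event is just a subset of the four outcomes, so its probability is the sum of the outcome probabilities it contains.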
6Probability Spaces and Social Network Analysis
- At its core, a social network is stochastically arrayed, for human behavior governs its construction and maintenance.
- If we accept the notion that there is a probability p that two nodes form an edge (given sufficient conditions), adjacency matrices must have probability spaces.
- These probability spaces must be explored to truly understand and research dynamic networks.
7Definitions
- The following terms depend largely on the domain and convenience:
- Vertices (aka nodes, entities)
- Edges (relationships)
- Adjacency matrix: a data structure which holds edge information
- Weighted, directed
8Complications
- Simulation can sometimes exhaustively explore the probability space of a system.
- Unfortunately, graphs get extremely large very quickly.
- Let n be the number of nodes and e the number of edges.
- Fixing n and e, the number of possible graphs (network structures), counting directed graphs without self-loops, is
$$\binom{n(n-1)}{e}$$
- And if we only fix n, the number of possible graphs (directed, without self-loops) is given by
$$2^{n(n-1)}$$
- For example, a network of 30 nodes has 7.87 x 10^261 configurations.
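The size of this space can be checked with exact integer arithmetic. The sketch below assumes directed graphs without self-loops, which is the convention that reproduces the 30-node figure on this slide; the function names are hypothetical.

```python
from math import comb


def possible_edges(n):
    """Number of ordered node pairs (directed edges, no self-loops)."""
    return n * (n - 1)


def graphs_with_e_edges(n, e):
    """Number of directed graphs on n labeled nodes with exactly e edges."""
    return comb(possible_edges(n), e)


def graphs_on_n_nodes(n):
    """Number of directed graphs on n labeled nodes with any number of edges."""
    return 2 ** possible_edges(n)


print(f"{graphs_on_n_nodes(30):.2e}")  # 7.87e+261
```

Summing `graphs_with_e_edges(n, e)` over every e from 0 to n(n-1) gives `graphs_on_n_nodes(n)`, which is why exhaustive simulation stops being an option after only a handful of nodes.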
9Previous Work with Random Graphs and Social Networks
- Based on assumptions about node and edge behavior, researchers have postulated how dynamic networks array themselves (termed random graphs).
- Degrees of nodes (Newman, Scott, Wasserman and Faust)
- If each p in an adjacency matrix is equal, then the degree of each node as well as network size follow a binomial distribution, which asymptotically approaches a Poisson distribution.
- Empirical work shows that the equal p assumption is too strong for many applications.
- Scale-free graphs (Yule-Simon distribution) such as the internet
- McCulloh et al. 2007
- Small world, six degrees of separation (Travers and Milgram 1969)
- Watts and Strogatz (1998) propose the clustering coefficient to explore scale-free structure:
$$C_i = \frac{\text{number of connections among the neighbors of } i}{\text{number of possible connections among the neighbors of } i}$$
- Translation: consider the neighborhood of a node i (consisting of each node, or neighbor, directly connected to i). The clustering coefficient of node i is the ratio of connections among its neighborhood to the total number of possible connections in its neighborhood.
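The verbal definition of the clustering coefficient translates into a few lines of code. This is a minimal sketch assuming an undirected graph stored as a dictionary of neighbor sets; the toy graph and the choice to return 0 for nodes with fewer than two neighbors are assumptions made for the example.

```python
from itertools import combinations


def clustering_coefficient(adj, i):
    """Ratio of existing to possible connections among the neighbors of i.

    adj maps each node to the set of its neighbors (undirected graph).
    Nodes with fewer than two neighbors are assigned 0 by convention here.
    """
    neighbors = adj[i]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return links / (k * (k - 1) / 2)


# Toy graph: a triangle a-b-c with one pendant node d attached to a.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

print(clustering_coefficient(adj, "a"))  # 1 of 3 neighbor pairs is connected
print(clustering_coefficient(adj, "b"))  # the only neighbor pair is connected
```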
10Previous Work with Probability Spaces and Social Networks
- Albert and Barabasi (2002)
- Using varied datasets, Albert and Barabasi show that empirical social networks have higher clustering coefficients than random networks of equal dimensions.
- Many empirical social networks follow a power-law statistical distribution (as measured by node degree).
- The verdict is still out: is demonstrating that node degree follows a power-law distribution sufficient to apply scale-free properties?
- Instead of analyzing the degree distribution, we propose that it may be advantageous to estimate the stochastic process that dynamically generates degree over time.
11Statistical Distributions and Social Networks
- We introduce the idea of a Network Probability Matrix (NPM), which describes a network of size N = 15.
12Statistical Distributions and Social Networks
- To illustrate how our original NPM may be applied to the scale-free question:
- This NPM models three groups which interact within their neighborhood with an 80% probability and outside their neighborhood with a 20% probability.
- The clustering coefficient is 0.463 for this graph, compared to a clustering coefficient of 0.329 for an NPM with equal probabilities (according to a Monte-Carlo simulation of the previous NPM).
- The conclusion should be intuitive and obvious. Instead of starting with a theoretical NPM, we would like to generate NPMs from real world data to draw similar conclusions.
- So what do Network Probability Matrices (NPMs) look like in real dynamic networks?
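A Monte-Carlo comparison of this kind can be sketched as follows. This is not the authors' simulation: the three groups of five nodes, the undirected edges, the equal-probability baseline set to the block NPM's mean probability, and all function names are assumptions made for illustration, so the numbers it prints need not match the values quoted on the slide.

```python
import random
from itertools import combinations


def block_npm(group_sizes, p_in, p_out):
    """NPM with probability p_in inside a group and p_out between groups."""
    labels = [g for g, size in enumerate(group_sizes) for _ in range(size)]
    n = len(labels)
    return [[0.0 if i == j else (p_in if labels[i] == labels[j] else p_out)
             for j in range(n)] for i in range(n)]


def sample_graph(npm, rng):
    """Draw one undirected graph: each pair (i, j) is an edge w.p. npm[i][j]."""
    n = len(npm)
    adj = [set() for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if rng.random() < npm[i][j]:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def average_clustering(adj):
    """Mean local clustering coefficient (nodes with < 2 neighbors count as 0)."""
    total = 0.0
    for nbrs in adj:
        k = len(nbrs)
        if k >= 2:
            links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
            total += links / (k * (k - 1) / 2)
    return total / len(adj)


def mean_clustering(npm, runs=1000, seed=1):
    """Average the clustering coefficient over many sampled graphs."""
    rng = random.Random(seed)
    return sum(average_clustering(sample_graph(npm, rng)) for _ in range(runs)) / runs


n = 15
blocks = block_npm([5, 5, 5], p_in=0.8, p_out=0.2)
mean_p = sum(map(sum, blocks)) / (n * (n - 1))  # same expected density
equal = [[0.0 if i == j else mean_p for j in range(n)] for i in range(n)]

block_cc = mean_clustering(blocks)
equal_cc = mean_clustering(equal)
print(f"block NPM: {block_cc:.3f}, equal-probability NPM: {equal_cc:.3f}")
```

Holding the expected density fixed isolates the effect of group structure: any gap between the two averages comes from how the probabilities are arranged in the NPM, not from how many edges there are.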
13Distribution Theory and Social Networks
- We can use statistical distributions to generate adjacency matrices over time.
- One source of variation in SNA experiments is that researchers must choose how to define edges in an interaction matrix.
- McCulloh et al. (2007) studied email traffic to analyze shifts in network structure.
- In order to define edges, the researchers had to decide on what blocks of time to analyze.
- We looked at the email data from this experiment and fitted well-known statistical distributions to each directed edge using parameter estimation techniques.
- All of the arrival times were log-normally distributed.
14Distribution Theory and Social Networks
- The probability that an email is sent from i to j within some period of time t is
$$p_{ij}(t) = \int_0^t f_{ij}(s)\,ds$$
- (p, as a function of t, is a CDF; f is the PDF that best fits cell ij in an NPM)
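When the fitted PDF for a cell is log-normal, as reported for the email data, this probability has a closed form in terms of the error function. The sketch below assumes hypothetical parameters `mu` and `sigma` (the mean and standard deviation of the log arrival time); they are illustrative values, not estimates from the study.

```python
from math import erf, log, sqrt


def lognormal_cdf(t, mu, sigma):
    """P(T <= t) for a log-normal arrival time T with log-scale parameters mu, sigma."""
    if t <= 0:
        return 0.0
    return 0.5 * (1.0 + erf((log(t) - mu) / (sigma * sqrt(2.0))))


# Hypothetical parameters for one directed edge i -> j (time measured in hours).
mu, sigma = 2.0, 1.0

for hours in (1, 8, 24):
    print(hours, round(lognormal_cdf(hours, mu, sigma), 3))
```

Evaluating the function at different window lengths t shows directly how the choice of time block changes the probability that an edge appears.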
15Distribution Theory and Social Networks
- The probability that two emails are sent from i
to j within some period of time t is -
16Distribution Theory and Social Networks
- The probability that x emails are sent from i to
j within some period of time t is -
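One way to approximate such a probability without a closed form is simulation. The sketch below is one possible reading rather than the authors' expression: it assumes that the gaps between successive emails from i to j are independent log-normal draws with hypothetical parameters `mu` and `sigma`, that the clock starts at time 0, and that "x emails within t" means at least x emails.

```python
import random


def prob_at_least_x(x, t, mu, sigma, runs=20000, seed=0):
    """Monte-Carlo estimate of P(at least x emails arrive within time t).

    Assumption: inter-arrival gaps are i.i.d. log-normal(mu, sigma), so the
    x-th email arrives at the sum of x gaps.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        elapsed = sum(rng.lognormvariate(mu, sigma) for _ in range(x))
        if elapsed <= t:
            hits += 1
    return hits / runs


# Hypothetical parameters; the estimate shrinks as x grows for a fixed window t.
for x in (1, 2, 3):
    print(x, prob_at_least_x(x, t=2.0, mu=0.0, sigma=1.0))
```

For x = 1 the estimate should agree with the single-email CDF; for larger x it approximates the CDF of a sum of x arrival gaps.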
17From NPM to Adjacency
- An adjacency matrix is normally treated as a structure of scalar values.
- It is imperative to understand that an adjacency matrix is a function of many elements, including definitional considerations (weighted, directed) and time.
- If we accept the notion of an NPM, the adjacency matrix is a structure of random variables.
- Analogy: you may remember from stochastic processes that if arrival times are distributed exponentially, then the number of arrivals in a given interval is distributed Poisson.
- The NPM can be regarded as the exponential distribution and the adjacency matrix as the Poisson in this case.
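The exponential/Poisson analogy can be checked numerically. The sketch below draws exponential gaps between arrivals, counts arrivals in a fixed window, and compares the counts with the Poisson distribution; the rate, window length, and function names are hypothetical values chosen for the example.

```python
import random
from math import exp, factorial


def count_arrivals(rate, window, rng):
    """Count arrivals in [0, window] when gaps are exponential with the given rate."""
    count = 0
    clock = rng.expovariate(rate)
    while clock <= window:
        count += 1
        clock += rng.expovariate(rate)
    return count


def poisson_pmf(k, lam):
    """Poisson probability of exactly k arrivals when the mean count is lam."""
    return exp(-lam) * lam ** k / factorial(k)


rng = random.Random(0)
rate, window, runs = 2.0, 3.0, 20000
counts = [count_arrivals(rate, window, rng) for _ in range(runs)]

mean = sum(counts) / runs
variance = sum((c - mean) ** 2 for c in counts) / runs
print(f"sample mean {mean:.2f}, sample variance {variance:.2f}, rate * window = {rate * window}")

# Compare the simulated and theoretical frequency of one particular count.
k = 6
print(counts.count(k) / runs, poisson_pmf(k, rate * window))
```

A Poisson count has equal mean and variance, so both sample statistics should sit close to rate * window.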
18From NPM to Adjacency
- An adjacency matrix can be derived analytically, but doing so would require future research of varying complexity.
- Simulations are much more practical for applied work and can be constructed for very specific applications. We are developing an extension of this for a variety of applications (sampling distributions for network measures, network perturbations, etc.).
19Why does it matter to you?
- Understanding the probability space of the adjacency matrix and how it relates to the definition of an edge aids both applied and theoretical research:
- Hypothesis testing/confidence intervals on network measures
- Mitigation of time chunking considerations by researchers
- Organizational simulations
- Error analysis (in measurement)
- A flexible framework for analytics based on myriad initial assumptions
20Future work
- We are currently refining an algorithmic approach to determining the sampling distributions of network measures given an NPM; we are interested in a practical approach to create sampling distributions of network measures over time.
- An application of this technique to McCulloh's IkeNET data is undergoing final revision.
- Developing the closed form relationships between NPMs and adjacency matrices could provide a platform for exploring random graphs.
21Back Ups
- Works Cited
- Albert, R. and Barabási, A. (2002) Statistical Mechanics of Complex Networks. Reviews of Modern Physics, 74: 47-97.
- Barabási, A. and Albert, R. (1999) Emergence of Scaling in Random Networks. Science, 286: 509-512.
- Barabási, A. (2003) Linked: How Everything is Connected to Everything Else and What It Means for Business, Science, and Everyday Life. Plume, New York. ISBN 0-452-28439-2
- Barabási, A. (2003) Scale-Free Networks. Scientific American, 288: 60-69.
- Dorogovtsev, S.N. and Mendes, J.F.F. (2003) Evolution of Networks: From Biological Networks to the Internet and WWW. Oxford University Press. ISBN 0-19-851590-1
- Dorogovtsev, S.N., Mendes, J.F.F. and Samukhin, A.N. (2000) Structure of Growing Networks: Exact Solution of the Barabási-Albert's Model. Physical Review Letters, 85, 4633.
- Erdős, P. and Rényi, A. (1960) On the Evolution of Random Graphs. Mathematical Institute of the Hungarian Academy of Science, 5, 17-61.
- Faloutsos, M., Faloutsos, P. and Faloutsos, C. (1999) On power-law relationships of the internet topology. Computer Communication Review, 29, 251.
- Guare, J. (1990) Six Degrees of Separation: A Play. Vintage Books, New York.
- Milgram, S. (1967) The small world problem. Psychology Today, 2, 60-67.
- Newman, M. (2005) The Mathematics of Complex Networks. Unpublished paper.
- Newman, M. (2003) The Structure and Function of Complex Networks. SIAM Review, 45(2): 167-256.
- Travers, J. and Milgram, S. (1969) An Experimental Study of the Small World Problem. Sociometry, 32(4): 425-443.
- Watts, D.J. and Strogatz, S.H. (1998) Collective dynamics of small-world networks. Nature, 393(6684): 440-442.
22Statistical Distributions
- Statistical distributions of stochastic processes are rarely known in practice, and are often assumed to be normal.
- Statistical distributions can be estimated by using a variety of techniques (Maximum Likelihood Estimation, Least Squares Estimation, Method of Moments).
- If we know the distributions associated with a stochastic process, a whole world of statistical tools is made available.
23Statistical Distributions
- For a random variable X:
- A probability density function f(X) defines the relative likelihood that X takes on a certain value.
- A cumulative distribution function F(X) defines the probability (from 0 to 1) that X takes on a value less than or equal to a certain value.
24Distribution Fitting
- Maximum Likelihood Estimation
- Maximizes the likelihood function (the product of the probability density function evaluated at each empirical data point, for a given set of parameters)
- Least squares estimation
- Minimizes the sum of squared error between the empirical distribution function and the cumulative distribution function for a given set of parameters
- Method of Moments Estimation
- Fits a set of moment equations (number equal to the number of parameters required) by solving the set for the parameters and substituting sample statistics for the moments.
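For the log-normal distribution, two of these estimators have simple closed forms, which makes them easy to compare. The sketch below fits simulated data rather than the email data; the true parameters, sample size, and function names are assumptions chosen for illustration.

```python
import random
from math import log, sqrt


def fit_lognormal_mle(data):
    """Maximum likelihood: mean and (1/n) standard deviation of the log data."""
    logs = [log(x) for x in data]
    mu = sum(logs) / len(logs)
    sigma = sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma


def fit_lognormal_moments(data):
    """Method of moments: match the sample mean m and variance v.

    For a log-normal, E[X] = exp(mu + sigma^2 / 2) and
    Var[X] = (exp(sigma^2) - 1) * E[X]^2, which solve to the lines below.
    """
    n = len(data)
    m = sum(data) / n
    v = sum((x - m) ** 2 for x in data) / n
    sigma_sq = log(1.0 + v / m ** 2)
    return log(m) - sigma_sq / 2.0, sqrt(sigma_sq)


# Simulated arrival times with hypothetical true parameters mu = 1.0, sigma = 0.5.
rng = random.Random(42)
times = [rng.lognormvariate(1.0, 0.5) for _ in range(20000)]

print("MLE:    ", fit_lognormal_mle(times))
print("Moments:", fit_lognormal_moments(times))
```

Both estimators should land near the true parameters on a sample this large; on small or heavy-tailed samples the method-of-moments fit tends to be the less stable of the two because it depends on the sample variance of the raw data.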