Title: Stochastic Block Models of Mixed Membership
1 Stochastic Block Models of Mixed Membership
- Edo Airoldi (1,2), Dave Blei (2), Steve Fienberg (1), Eric Xing (1)
- (1) Carnegie Mellon University, (2) Princeton University
SAMSI, High Dimensional Inference and Random
Matrices, September 17th, 2006
2 The Scientific Problem
- Protein-protein interactions in Yeast
- Different studies test protein interactions with
different technologies (precision)
3 The Data: Interaction Graphs
- M proteins in a graph (nodes)
- M^2 observations on pairs of proteins
- Edges are random quantities, Y_{n,m}
- Interactions are not independent
- Interacting proteins form a protein complex
- T graphs on the same set of proteins
- Partial annotations for each protein, X_n
M = 871 nodes, M^2 ≈ 750K entries
4 The Scientific Problems
- What are stable protein complexes?
- They perform many cellular processes
- A protein may be a member of several ones
- How many are there?
- How do stable protein complexes interact?
- Test hypotheses (inform new analyses)
- Learn complex-to-complex interaction patterns
5 More Network Data
Disease Spread
Electronic Circuit
Food Web
Internet
Social Network
6 An Abstraction of the Data
- A collection of unipartite graphs G_{1:T} = (Y_{1:T}, N)
- Integer, real, multivariate edge weights Y_t = { Y_t(n,m) : n,m ∈ N }
- Node-specific (multivariate) attributes X_{1:T} = { X_t(n) : n ∈ N }
- Partially observable Y_{1:T} and X_{1:T}
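One way to hold this abstraction in code is sketched below. It is a minimal illustration, not the authors' implementation: the class name `GraphCollection`, its fields, and the choice of dictionaries with missing keys standing for unobserved entries are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Edge = Tuple[int, int]  # ordered pair of node indices (n, m)


@dataclass
class GraphCollection:
    """T graphs over one shared node set; a missing key means 'unobserved'."""
    num_nodes: int
    # edges[t][(n, m)] holds the observed edge weight Y_t(n, m)
    edges: List[Dict[Edge, float]] = field(default_factory=list)
    # attributes[t][n] holds the observed attribute vector X_t(n)
    attributes: List[Dict[int, List[float]]] = field(default_factory=list)

    def edge(self, t: int, n: int, m: int) -> Optional[float]:
        """Return Y_t(n, m), or None when that pair was not observed."""
        return self.edges[t].get((n, m))

    def observed_fraction(self, t: int) -> float:
        """Share of the num_nodes**2 ordered pairs observed in graph t."""
        return len(self.edges[t]) / float(self.num_nodes ** 2)


# Toy usage: one graph on three nodes with three observed pairs.
data = GraphCollection(num_nodes=3)
data.edges.append({(0, 1): 1.0, (1, 0): 1.0, (0, 2): 0.0})
data.attributes.append({0: [1.0, 0.0]})
```

Storing only observed entries keeps "observed zero" distinct from "not measured", which matters when Y and X are only partially observable.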
7 The Challenge
- Given the data abstraction and the goals of the analysis
- Can we posit a rich class of models that is instrumental for thinking about the scientific problems we face? Amenable to theoretical analyses?
8 Modeling Ideas
- Hierarchical Bayes
- Latent variables encode semantic elements
- Assume structure on observable-latent elements
- Combination of two classes of models
1. Models of mixed membership
2. Network models (block models)
=> Stochastic block models of mixed membership
9 Graphical Model Representation
[Figure labels: Stochastic Blocks; Mixed Membership]
10 A Hierarchical Likelihood
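The formula on this slide did not survive extraction. As a hedged reference point, the generative process below is the one given for binary edges in the authors' later paper on mixed membership stochastic blockmodels (Airoldi, Blei, Fienberg, Xing; JMLR 2008). The symbols π, z, B and α follow that paper and are only assumed to match what the slide showed; the node indices n, m follow this deck.

```latex
\begin{align*}
\pi_n &\sim \mathrm{Dirichlet}(\alpha), && n = 1,\dots,M,\\
z_{n \to m} &\sim \mathrm{Multinomial}(1, \pi_n), && \text{group of } n \text{ when interacting with } m,\\
z_{n \leftarrow m} &\sim \mathrm{Multinomial}(1, \pi_m), && \text{group of } m \text{ when interacted with by } n,\\
Y_{nm} &\sim \mathrm{Bernoulli}\!\left(z_{n \to m}^{\top}\, B\, z_{n \leftarrow m}\right), && B \in [0,1]^{K \times K}.
\end{align*}

\begin{equation*}
p(Y, \pi, z \mid \alpha, B)
  = \prod_{n} p(\pi_n \mid \alpha)
    \prod_{n,m} p(z_{n \to m} \mid \pi_n)\,
                p(z_{n \leftarrow m} \mid \pi_m)\,
                p(Y_{nm} \mid z_{n \to m}, z_{n \leftarrow m}, B).
\end{equation*}
```

The hierarchy is visible in the factorization: node-level membership vectors sit above pair-level group indicators, which in turn sit above the observed edges.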
11 More Modeling Issues
- Technical: Sparsity
- Introduce a parameter that modulates the relative importance of ones and zeros (binary edges) in the cost function that drives the clustering
- Biological: Ribosomes, Distress
- Some protein complexes act like hubs because they are involved, e.g., in protein production or cell recovery (Y2H technology is invasive)
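To make the sparsity idea concrete, here is a small sampler for a mixed membership block model with a sparsity parameter. It is a sketch under stated assumptions: the parameter is called `rho` and scales the edge probability by (1 - rho), as in the authors' later published model; the function names, the exclusion of self-interactions, and the toy values are illustrative choices, not taken from the slides.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_group(pi, rng):
    """Draw one group index with probabilities pi."""
    return rng.choices(range(len(pi)), weights=pi, k=1)[0]


def sample_mmsb(num_nodes, alpha, B, rho, seed=0):
    """Sample membership vectors and a binary interaction matrix.

    B[g][h] is the interaction probability between groups g and h.
    rho in [0, 1] is the assumed sparsity parameter: larger values thin
    out the ones, so zeros carry less evidence about block structure.
    """
    rng = random.Random(seed)
    pi = [sample_dirichlet(alpha, rng) for _ in range(num_nodes)]
    Y = [[0] * num_nodes for _ in range(num_nodes)]
    for n in range(num_nodes):
        for m in range(num_nodes):
            if n == m:
                continue  # assumption: self-interactions are skipped
            g = sample_group(pi[n], rng)  # role n takes toward m
            h = sample_group(pi[m], rng)  # role m takes toward n
            p_edge = (1.0 - rho) * B[g][h]
            Y[n][m] = 1 if rng.random() < p_edge else 0
    return pi, Y


# Toy usage with two groups and hypothetical parameter values.
pi, Y = sample_mmsb(5, [0.5, 0.5], [[0.9, 0.05], [0.05, 0.8]], rho=0.2, seed=1)
```

Setting `rho` to 0 recovers the plain block model, while values near 1 produce an almost empty graph regardless of B.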
12 Large Scale Computation
- Masses of data
- 750K observations in a small problem (M = 871)
- 2.5M observations with M = 1578
- 3M expressions for 6K genes/proteins in Yeast
- Variational inference (Jordan et al., 2001)
- Naïve implementation does not work
- We develop a novel nested variational algorithm
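The observation counts above are M^2 ordered pairs, and a back-of-the-envelope count shows why keeping per-pair variational parameters in memory becomes costly. The sketch below assumes one length-K vector per node and two length-K vectors per ordered pair; that layout, the function name, and K = 15 are assumptions made for illustration, not figures from the talk.

```python
def variational_storage(num_nodes, num_groups):
    """Count ordered pairs and rough variational-parameter storage.

    Assumes one length-K vector per node, plus either two length-K
    vectors for every ordered pair (all kept in memory) or just two
    reusable length-K buffers (per-pair values discarded after use).
    """
    pairs = num_nodes ** 2
    per_node = num_nodes * num_groups
    keep_all_pairs = per_node + 2 * pairs * num_groups
    reuse_buffers = per_node + 2 * num_groups
    return pairs, keep_all_pairs, reuse_buffers


K = 15  # hypothetical number of latent groups, chosen only for illustration
for m in (871, 1578):
    pairs, keep_all, reuse = variational_storage(m, K)
    print(f"M={m}: {pairs:,} pairs; {keep_all:,} vs {reuse:,} stored values")
```

With these assumptions, storage grows with M^2 when every pair is kept and only with M when pair-level quantities are recomputed on the fly.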
13 Example: A Scientific Question
- Do PPI contain information about functions?
[Figure labels: Model; Approximate Posterior on Membership Vectors; YLD014W; Raw data; Functional Annotations]
14 Interactions in Yeast (MIPS)
- Do PPI contain information about functions?
YLD014W
15 Results: Identifiability
- In this example we map latent groups to known
functional categories
Known Annotations
Unknown Annotations
16 Results: Functional Annotations
17 Results: Mixed Membership
- The estimated membership vectors support the
mixed membership assumption
18 Results: Stochastic Block Model
19 General Bayesian Formulation
- Assumptions for unipartite graphs
- Population: existence of K sub-populations
- Latent variable: mixed-membership vectors π_n ~ D_α
- Subject: exchangeable edges given block memberships, Y_nm ~ f( · | π_n, B, π_m )
- Sampling scheme: the graphs are IID
- Additional data, e.g., attributes, annotations
- Integrated model formulation (descriptive/predictive)
20 Variational Algorithms
- Naïve algorithm
- init (γ_i ∀i, φ_ij ∀ij)
- while (log-lik ↑): update (φ_ij ∀ij); update (γ_i ∀i)
- Nested algorithm
- init (γ_i ∀i)
- while (log-lik ↑): loop over ij
- init φ_ij
- while (log-lik ↑): update φ_ij
- partially update (γ_i, γ_j)
We trade space for time but
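The nested scheme on this slide can be sketched in plain Python as follows. This is a simplified illustration, not the authors' code: it uses the mean-field updates from their later published model, holds the block matrix B fixed (entries strictly between 0 and 1), and refreshes the node-level parameters once per sweep from accumulated pair-level values rather than partially after each pair. The names `fit_pair`, `nested_sweep` and `nested_vi` are hypothetical.

```python
import math


def digamma(x):
    """Digamma for x > 0: shift x upward, then use the asymptotic series."""
    shift = 0.0
    while x < 6.0:
        shift -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return shift + math.log(x) - 0.5 / x - series


def expected_log_pi(gamma_n):
    """E[log pi_n] under a Dirichlet(gamma_n) variational posterior."""
    total = digamma(sum(gamma_n))
    return [digamma(g) - total for g in gamma_n]


def normalize(weights):
    s = sum(weights)
    return [w / s for w in weights]


def fit_pair(y, elog_n, elog_m, B, inner_iters=50, tol=1e-8):
    """Inner loop: init phi for one pair, then iterate until it settles."""
    K = len(B)
    # log-probability of the observed edge value y under each block pair
    ll = [[math.log(B[g][h] if y else 1.0 - B[g][h]) for h in range(K)]
          for g in range(K)]
    send, recv = [1.0 / K] * K, [1.0 / K] * K
    for _ in range(inner_iters):
        new_send = normalize([math.exp(elog_n[g] + sum(recv[h] * ll[g][h] for h in range(K)))
                              for g in range(K)])
        new_recv = normalize([math.exp(elog_m[h] + sum(new_send[g] * ll[g][h] for g in range(K)))
                              for h in range(K)])
        change = max(abs(a - b) for a, b in zip(new_send + new_recv, send + recv))
        send, recv = new_send, new_recv
        if change < tol:
            break
    return send, recv


def nested_sweep(Y, B, alpha, gamma):
    """One outer pass: fit phi per pair, discard it, accumulate new gamma."""
    N, K = len(Y), len(alpha)
    elog = [expected_log_pi(g) for g in gamma]
    new_gamma = [list(alpha) for _ in range(N)]
    for n in range(N):
        for m in range(N):
            if n == m:
                continue
            send, recv = fit_pair(Y[n][m], elog[n], elog[m], B)
            for k in range(K):
                new_gamma[n][k] += send[k]
                new_gamma[m][k] += recv[k]
    return new_gamma


def nested_vi(Y, B, alpha, sweeps=20):
    """Run a fixed number of sweeps; return posterior-mean memberships."""
    N, K = len(Y), len(alpha)
    gamma = [[alpha[k] + 2.0 * (N - 1) / K for k in range(K)] for _ in range(N)]
    for _ in range(sweeps):
        gamma = nested_sweep(Y, B, alpha, gamma)
    return [normalize(g) for g in gamma]


# Toy usage: nodes 0-2 form a dense clique, nodes 3-5 have no edges.
Y = [[1 if (n != m and n < 3 and m < 3) else 0 for m in range(6)] for n in range(6)]
B = [[0.9, 0.1], [0.1, 0.1]]
memberships = nested_vi(Y, B, alpha=[0.5, 0.5])
```

Because the pair-level values are recomputed inside each sweep and never stored, memory stays proportional to the number of nodes, at the price of the extra inner iterations. A fuller version would also update B and monitor the variational lower bound as the stopping rule.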
21 Variational Algorithms for MMSB
[Figure labels: Nested; Naïve]
- On a single machine we empirically observed
faster convergence (offsets extra computation),
and more stable paths to convergence.
22 Take Home Points
- Bayesian formulation is integral to the biology
- A novel class of models that combines MM for soft clustering with network models for dependent data
- Latent aspects => patterns that correlate with, and help predict, functional processes in the cell
- Current implementation allows for fast inference on large matrices through variational approximation => considerable opportunity to improve upon both computation and efficiency of the approximation
23
- Data and Problems: Gavin et al. (2002) Nature; Ho et al. (2002) Nature; Mewes et al. (2004) Nucleic Acids Research; Krogan et al. (2006) Nature
- Mixed Membership Models: Pritchard et al. (2000); Erosheva (2002); Rosenberg et al. (2002); Blei et al. (2003); Xing et al. (2003a,b); Erosheva et al. (2004); Airoldi et al. (2005); Blei & Lafferty (2006); Xing et al. (2006)
- Stochastic network models: Wasserman et al. (1980, 1994, 1996); Fienberg et al. (1985); Frank & Strauss (1986); Nowicki & Snijders (2001); Hoff et al. (2002); Airoldi et al. (2006)
- More material on the Web at http://www.cs.cmu.edu/eairoldi/
- ICML Workshop on Statistical Network Analysis: Models, Issues and New Directions, on June 29 at Carnegie Mellon, Pittsburgh PA, http://nlg.cs.cmu.edu/