Title: New Models for Relational Classification
1. New Models for Relational Classification
- Ricardo Silva (Statslab)
- Joint work with Wei Chu and Zoubin Ghahramani
2. The talk
- Classification with non-iid data
- A source of non-iid-ness: relational information
- A new family of models, and what is new
- Applications to classification of text documents
3. The prediction problem
[Figure: input X predicts output Y]
4. Standard setup
[Figure: N i.i.d. training pairs (X, Y); predict the unknown Ynew for a new input Xnew]
5. Prediction with non-iid data
[Figure: inputs X1, X2, Xnew; the outputs Y1, Y2 and the unknown Ynew are dependent]
6. Where does the non-iid information come from?
- Relations
  - Links between data points
  - Webpage A links to Webpage B
  - Movie A and Movie B are often rented together
- Relations as data
  - Linked webpages are likely to present similar content
  - Movies that are rented together often have correlated personal ratings
7. The vanilla relational domain: time series
- Relations: Yi precedes Yi+k, k > 0
- Dependencies: Markov structure G
[Figure: Markov chain Y1 → Y2 → Y3]
8. A model for integrating link data
- How to model the class label dependencies?
- Movies that are often rented together might share all sorts of common, unmeasured factors
- These hidden common causes affect the ratings
9. Example
- Same director?
- Same genre?
- Both released in same year?
- Target same age groups?
10. Integrating link data
- Of course, many of these common causes will be measured
- Many will not
- Idea:
  - Postulate a hidden common cause structure, based on the relations
  - Define a model that is Markov with respect to this structure
  - Design an adequate inference algorithm
11. Example: the Political Books database
- A network of books about recent US politics sold by the online bookseller Amazon.com
  - Valdis Krebs, http://www.orgnet.com/
- Relations: frequent co-purchasing of books by the same buyers
- Political inclination factors as the hidden common causes
12. Political Books: relations
13. Political Books database
- Features
  - I collected the Amazon.com front page for each of the books
  - Bag-of-words, tf-idf features, normalized to unity (see the sketch below)
- Task
  - Binary classification: liberal or not-liberal books
  - 43 liberal books out of 105
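The following is an illustrative sketch only, not the talk's actual preprocessing: unit-normalized tf-idf bag-of-words features built with scikit-learn. The document strings and variable names are hypothetical.

    # Minimal sketch of tf-idf bag-of-words features normalized to unity (illustrative only).
    from sklearn.feature_extraction.text import TfidfVectorizer

    pages = ["text of the Amazon front page for book 1 ...",
             "text of the Amazon front page for book 2 ..."]   # hypothetical raw documents

    vectorizer = TfidfVectorizer(norm="l2")   # norm="l2" gives unit-length feature vectors
    X = vectorizer.fit_transform(pages)       # one row per book, one column per word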
14. Contribution
- We will
  - show a classical multiple linear regression model
  - build a relational variation
  - generalize it with a more complex set of independence constraints
  - generalize it using Gaussian processes
15. Seemingly unrelated regression (Zellner, 1962)
- Y = (Y1, Y2), X = (X1, X2)
- Suppose you regress Y1 on X1, X2 and
  - X2 turns out to be useless
- Analogously for Y2 on X1, X2 (X1 vanishes)
- Suppose you regress Y1 on X1, X2, Y2
  - And now every variable is a relevant predictor (see the numerical sketch below)
[Figure: two regression graphs, X1 → Y1 and X2 → Y2, with the error terms of Y1 and Y2 connected]
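A small numerical illustration of the SUR effect above; this is my own sketch with made-up coefficients, not part of the original slides.

    # Sketch: with correlated errors, Y2 (and X2) become relevant predictors of Y1.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 5000
    x1, x2 = rng.normal(size=n), rng.normal(size=n)

    common = rng.normal(size=n)            # hidden common cause shared by both errors
    e1 = common + 0.5 * rng.normal(size=n)
    e2 = common + 0.5 * rng.normal(size=n)

    y1 = 2.0 * x1 + e1                     # X2 is useless for Y1 on its own ...
    y2 = -1.0 * x2 + e2                    # ... and X1 is useless for Y2

    def ols(features, target):
        beta, *_ = np.linalg.lstsq(np.column_stack(features), target, rcond=None)
        return beta

    print(ols([x1, x2], y1))       # coefficient on x2 is close to 0
    print(ols([x1, x2, y2], y1))   # now both x2 and y2 get clearly nonzero coefficients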
16Graphically, with latents
Capital(GE)
Capital(Westinghouse)
X
Stock price(GE)
Stock price(Westinghouse)
Y
Industry factor k?
Industry factor 2
Industry factor 1
17The Directed Mixed Graph (DMG)
Capital(GE)
Capital(Westinghouse)
X
Stock price(GE)
Stock price(Westinghouse)
Y
Richardson (2003), Richardson and Spirtes (2002)
18. A new family of relational models
- Inspired by SUR
- Structure: DMG graphs
- Edges postulated from the given relations
[Figure: a DMG over inputs X1, ..., X5 and outputs Y1, ..., Y5, with directed edges Xi → Yi and bi-directed edges between related outputs]
19. Model for binary classification
- Nonparametric probit regression
- Zero-mean Gaussian process prior over f( · )
P(yi = 1 | xi) = P(y(xi) > 0), where y(xi) = f(xi) + εi and εi ~ N(0, 1)
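As a reading aid, here is a minimal generative sketch of this standard probit GP classifier (sampling only, no inference); the squared-exponential kernel and all names are my own choices, not the authors' code.

    # Sketch of the generative model on slide 19: y_i = 1 iff f(x_i) + eps_i > 0.
    import numpy as np
    from scipy.stats import norm

    def rbf_kernel(X, lengthscale=1.0):
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale**2)

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(50, 1))                     # inputs
    K = rbf_kernel(X)                                        # zero-mean GP prior over f(.)
    f = rng.multivariate_normal(np.zeros(len(X)), K + 1e-8 * np.eye(len(X)))
    eps = rng.normal(size=len(X))                            # eps_i ~ N(0, 1)
    y = (f + eps > 0).astype(int)                            # binary labels

    p = norm.cdf(f)   # equivalently, integrating eps out: P(y_i = 1 | f(x_i)) = Phi(f(x_i))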
20. Relational dependency model
- Make ε dependent: a multivariate Gaussian
- For convenience, decouple it into two error terms:
ε = ξ + λ
21. Dependency model: the decomposition
ε = ξ + λ, with the two error terms independent from each other
- ξ: dependent according to the relations; Σξ is not diagonal, with 0s only on unrelated pairs
- λ: marginally independent; Σλ is diagonal
- Hence Σε = Σξ + Σλ
22. Dependency model: the decomposition
y(xi) = f(xi) + εi = f(xi) + ξi + λi = g(xi) + λi
- If K was the original kernel matrix for f( · ), the covariance matrix of g( · ) is simply
Σg = K + Σξ
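Continuing the sketch (my notation, following the reconstruction above): given a base kernel matrix K and the relational error covariance Σξ, g has covariance K + Σξ, and labels can be drawn as below. The 10^-4 noise variance matches the value quoted on slide 30.

    # Sketch: sampling labels from the relational model of slides 20-22.
    import numpy as np

    def sample_relational_labels(K, Sigma_xi, lam_var=1e-4, seed=0):
        """g ~ N(0, K + Sigma_xi); y_i = 1 iff g(x_i) + lambda_i > 0."""
        rng = np.random.default_rng(seed)
        n = K.shape[0]
        g = rng.multivariate_normal(np.zeros(n), K + Sigma_xi)
        lam = rng.normal(scale=np.sqrt(lam_var), size=n)   # small independent error term
        return (g + lam > 0).astype(int)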
23. Approximation
- The posterior over f( · ), g( · ) is a truncated Gaussian, hard to integrate
- Approximate the posterior with a Gaussian
  - Expectation Propagation (Minka, 2001)
- The reason for λ becomes apparent in the EP approximation
24. Approximation
- The likelihood does not factorize over f( · ), but it factorizes over g( · )
- Approximate each factor p(yi | g(xi)) with a Gaussian
  - If λ were 0, yi would be a deterministic function of g(xi)
p(g | x, y) ∝ p(g | x) ∏i p(yi | g(xi))
25. Generalizations
- This can be generalized to any number of relations:
ε = λ + ξ1 + ξ2 + ξ3
[Figure: Y1, ..., Y5 linked by bi-directed edges coming from three different relations]
26. But how to parameterize Σξ?
- Non-trivial
- Desiderata:
  - Positive definite
  - Zeroes in the right places
  - Few parameters, but a broad family
  - Easy to compute
27. But how to parameterize Σξ?
- Poking zeroes into a positive definite matrix doesn't work (a numerical check follows below)
Positive definite:
  1    0.8  0.8
  0.8  1    0.8
  0.8  0.8  1
Not positive definite (a zero poked in for the unrelated pair Y1, Y3):
  1    0.8  0
  0.8  1    0.8
  0    0.8  1
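A quick numerical check of the claim above (my own, with numpy): zeroing one entry of the all-0.8 correlation matrix produces a negative eigenvalue.

    # Verify the positive-definiteness claim on slide 27.
    import numpy as np

    full = np.array([[1.0, 0.8, 0.8],
                     [0.8, 1.0, 0.8],
                     [0.8, 0.8, 1.0]])
    poked = np.array([[1.0, 0.8, 0.0],
                      [0.8, 1.0, 0.8],
                      [0.0, 0.8, 1.0]])

    print(np.linalg.eigvalsh(full))    # eigenvalues 0.2, 0.2, 2.6: positive definite
    print(np.linalg.eigvalsh(poked))   # smallest eigenvalue is about -0.13: not positive definite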
28Approach 1
- Assume we can find all cliques for the
bi-directed subgraph of relations - Create a factor analysis model, where
- for each clique Ci there is a latent variable Li
- members of each clique are the only children of
Li - Set of latents L is a set of N(0, 1) variables
- coefficients in the model are equal to 1
29. Approach 1
[Figure: a bi-directed relation graph over Y1, ..., Y4 and the corresponding factor analysis graph, with one latent Li per clique pointing into that clique's members]
30. Approach 1
- In practice, we set the variance of each λ to a small constant (10^-4)
- The covariance between any two Ys is
  - proportional to the number of cliques they belong to together
  - inversely proportional to the number of cliques they belong to individually
31. Approach 1
- Let U be the correlation matrix obtained from the proposed procedure
- To define the error covariance, use a single hyperparameter η ∈ [0, 1]:
Σξ = (I - η Udiag) + η U = (1 - η) I + η U    (since Udiag = I for a correlation matrix)
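A sketch of Approach 1 as described on slides 28-31 (the clique list is assumed given; all names are mine): build the correlation matrix U implied by one N(0, 1) latent per clique with unit coefficients and a small per-node variance, then mix with the identity using η.

    # Sketch of Approach 1 (slides 28-31).
    import numpy as np

    def clique_correlation(n, cliques, noise_var=1e-4):
        """Correlation matrix U implied by one N(0, 1) latent per clique, coefficients 1."""
        cov = np.zeros((n, n))
        for clique in cliques:
            for i in clique:
                for j in clique:
                    cov[i, j] += 1.0       # a shared latent adds 1 to the covariance
        cov += noise_var * np.eye(n)       # small independent variance per node
        d = np.sqrt(np.diag(cov))
        return cov / np.outer(d, d)

    def sigma_xi(U, eta):
        """Sigma_xi = (1 - eta) * I + eta * U, with eta in [0, 1]."""
        return (1.0 - eta) * np.eye(U.shape[0]) + eta * U

    # Example: Y1-Y2-Y3 form one clique, Y3-Y4 another (0-indexed nodes).
    U = clique_correlation(4, cliques=[(0, 1, 2), (2, 3)])
    S = sigma_xi(U, eta=0.5)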
32. Approach 1
- Notice: if everybody is connected, the model is exchangeable and simple
[Figure: a single latent L1 pointing into all of Y1, ..., Y4]
Σξ =
  1  η  η  η
  η  1  η  η
  η  η  1  η
  η  η  η  1
33Approach 1
- Finding all cliques is impossible, what to do?
- Triangulate and them extract cliques
- Can be done in polynomial time
- This is a relaxation of the problem, since
constraints are thrown away - Can have bad side effects the Blow-Up effect
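A plain-Python sketch of the triangulation step (a generic greedy min-degree elimination, not necessarily the exact procedure used in the talk); it returns the maximal cliques of the triangulated graph, which can then be fed to the Approach 1 construction above.

    # Sketch: triangulate via a greedy elimination ordering, then read off cliques.
    def triangulated_cliques(n, edges):
        adj = {i: set() for i in range(n)}
        for a, b in edges:
            adj[a].add(b)
            adj[b].add(a)
        remaining = set(range(n))
        cliques = []
        while remaining:
            v = min(remaining, key=lambda u: len(adj[u] & remaining))  # min-degree heuristic
            nbrs = adj[v] & remaining
            cliques.append(tuple(sorted({v} | nbrs)))
            for a in nbrs:                 # fill-in edges: connect v's remaining neighbors
                for b in nbrs:
                    if a != b:
                        adj[a].add(b)
            remaining.remove(v)
        # keep only the maximal cliques
        return [c for c in cliques if not any(set(c) < set(d) for d in cliques)]

    # Example: a 4-cycle Y1-Y2-Y3-Y4-Y1 gets one fill-in edge and two cliques of size 3.
    print(triangulated_cliques(4, [(0, 1), (1, 2), (2, 3), (3, 0)]))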
34. Political Books dataset
35. Political Books dataset: the blow-up effect
36. Approach 2
- Don't look for cliques: create a latent for each related pair of variables
- Very fast to compute, zeroes respected
[Figure: a bi-directed relation graph over Y1, ..., Y4 and the corresponding factor analysis graph, with one latent Lij per related pair]
37. Approach 2
- The correlations, however, are given by
Corr(ξi, ξj) ≈ 1 / sqrt( |neigh(i)| · |neigh(j)| )
- This penalizes nodes with many neighbors, even if Yi and Yj have many neighbors in common
- We call this the pulverization effect (a sketch follows below)
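A companion sketch of Approach 2 (slides 36-37), under the same conventions as the Approach 1 sketch: one N(0, 1) latent per related pair, unit coefficients, small per-node variance.

    # Sketch of Approach 2: one latent per related pair.
    import numpy as np

    def pairwise_correlation(n, edges, noise_var=1e-4):
        cov = noise_var * np.eye(n)
        for i, j in edges:
            cov[i, i] += 1.0               # each pair latent adds variance to both endpoints
            cov[j, j] += 1.0
            cov[i, j] += 1.0               # ... and covariance 1 to the related pair
            cov[j, i] += 1.0
        d = np.sqrt(np.diag(cov))
        return cov / np.outer(d, d)        # Corr ~ 1 / sqrt(|neigh(i)| * |neigh(j)|)

    # Example: node 0 related to 1, 2, 3; its correlations are "pulverized" to about 0.58.
    U = pairwise_correlation(4, edges=[(0, 1), (0, 2), (0, 3)])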
38. Political Books dataset
39. Political Books dataset: the pulverization effect
40. WebKB dataset: links of pages at the University of Washington
41. Approach 1
42. Approach 2
43. Comparison: undirected models
- Generative stories
- Conditional random fields (Lafferty, McCallum, Pereira, 2001)
- Chu et al., 2006 / Richardson and Spirtes, 2002
[Figure: an undirected model over X1, X2, X3 and Y1, Y2, Y3]
44. Wei Chu's model
- Dependency family equivalent to a pairwise Markov random field
[Figure: inputs X1, X2, X3; relation indicators R12 = 1 and R23 = 1 induce pairwise potentials linking Y1-Y2 and Y2-Y3]
45. Properties of undirected models
- MRFs propagate information among test points
[Figure: a relational graph over Y1, ..., Y12]
46. Properties of DMG models
- DMGs propagate information among training points
[Figure: the same relational graph over Y1, ..., Y12]
47. Properties of DMG models
- In a DMG, each test point will have a whole training component in its Markov blanket
[Figure: the same relational graph over Y1, ..., Y12]
48. Properties of DMG models
- It seems acceptable that a typical relational domain will not have an extrapolation pattern
  - Like typical structured output problems, e.g., NLP domains
- Ultimately, the choice of model concerns the question: hidden common causes or relational indicators?
49. Experiment 1
- A subset of the CORA database
  - 4,285 machine learning papers, 7 classes
  - Links: citations between papers
  - Hidden common cause interpretation: the particular ML subtopic being treated
- Experiment: 7 binary classification problems, Class 5 vs. others
- Criterion: AUC (see the sketch below)
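For reference, the AUC criterion is the area under the ROC curve of the predicted class-membership scores; a minimal scikit-learn sketch with made-up numbers:

    # Sketch: the AUC criterion for one binary (one-vs-rest) problem.
    from sklearn.metrics import roc_auc_score

    y_true = [1, 0, 0, 1, 0]              # 1 = paper belongs to the target class
    y_score = [0.9, 0.2, 0.4, 0.7, 0.1]   # predicted probability of class membership
    print(roc_auc_score(y_true, y_score)) # 1.0 here, since all positives rank above negatives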
50. Experiment 1
- Comparisons
  - Regular GP
  - Regular GP + citation adjacency matrix
  - Wei Chu's Relational GP (RGP)
  - Our method, the miXed graph GP (XGP)
- Fairly easy task
- Analysis of low-sample tasks
  - Uses 1% of the data (roughly 10 data points for training)
  - Not that useful for XGP, but more useful for RGP
51. Experiment 1
- Wei Chu's method gets up to 0.99 in several of those
52. Experiment 2
- Political Books database
  - 105 datapoints, 100 runs using 50 of them for training
- Comparison with standard Gaussian processes
  - Linear kernels
- Results
  - 0.92 for the regular GP
  - 0.98 for XGP (using the pairwise kernel generator)
  - Hyperparameters optimized by grid search
  - Difference of 0.06 with std 0.02
  - Wei Chu's method does the same
53. Experiment 3
- WebKB
  - Collections of webpages from 4 different universities
- Task: outlier classification
  - Identify which pages are not student, course, project or faculty pages
- 10% for training data (still not that hard)
- However, an order of magnitude more data than in Cora
54. Experiment 3
- As far as I know, XGP easily gets the best results on this task
55. Future work
- Tons of possibilities for how to parameterize the output covariance matrix
- Incorporating relation attributes too
- Heteroscedastic relational noise
- Mixtures of relations
- New approximation algorithms
- Clustering problems
- On-line learning
56. Thank You