Title: New Models for Relational Classification
1. New Models for Relational Classification
- Ricardo Silva (Statslab)
- Joint work with Wei Chu and Zoubin Ghahramani
2. The talk
- Classification with non-iid data
- A source of non-iid-ness: relational information
- A new family of models, and what is new
- Applications to classification of text documents
3. The prediction problem
[Figure: input X predicts output Y]
4. Standard setup
[Figure: N i.i.d. training pairs (X, Y); predict the unknown Ynew for a new input Xnew]
5. Prediction with non-iid data
[Figure: inputs X1, X2, Xnew; the outputs Y1, Y2 and the unknown Ynew are dependent]
6. Where does the non-iid information come from?
- Relations
  - Links between data points
  - Webpage A links to Webpage B
  - Movie A and Movie B are often rented together
- Relations as data
  - Linked webpages are likely to present similar content
  - Movies that are rented together often have correlated personal ratings
7. The vanilla relational domain: time series
- Relations: Yi precedes Yi+k, k > 0
- Dependencies: Markov structure G
[Figure: Markov chain Y1 → Y2 → Y3]
8. A model for integrating link data
- How to model the class label dependencies?
- Movies that are often rented together might share all sorts of common, unmeasured factors
- These hidden common causes affect the ratings
9. Example
- Same director?
- Same genre?
- Both released in same year?
- Target same age groups?
10. Integrating link data
- Of course, many of these common causes will be measured
- Many will not
- Idea:
  - Postulate a hidden common cause structure, based on the relations
  - Define a model that is Markov with respect to this structure
  - Design an adequate inference algorithm
11. Example: the Political Books database
- A network of books about recent US politics sold by the online bookseller Amazon.com
  - Valdis Krebs, http://www.orgnet.com/
- Relations: frequent co-purchasing of books by the same buyers
- Political inclination factors as the hidden common causes
12. Political Books: relations
13. Political Books database
- Features
  - I collected the Amazon.com front page for each of the books
  - Bag-of-words, tf-idf features, normalized to unity (see the sketch below)
- Task
  - Binary classification: liberal or not-liberal books
  - 43 liberal books out of 105
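The following is an illustrative sketch only, not the talk's actual preprocessing: unit-normalized tf-idf bag-of-words features built with scikit-learn. The document strings and variable names are hypothetical.

    # Minimal sketch of tf-idf bag-of-words features normalized to unity (illustrative only).
    from sklearn.feature_extraction.text import TfidfVectorizer

    pages = ["text of the Amazon front page for book 1 ...",
             "text of the Amazon front page for book 2 ..."]   # hypothetical raw documents

    vectorizer = TfidfVectorizer(norm="l2")   # norm="l2" gives unit-length feature vectors
    X = vectorizer.fit_transform(pages)       # one row per book, one column per word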
14. Contribution
- We will
  - show a classical multiple linear regression model
  - build a relational variation
  - generalize it with a more complex set of independence constraints
  - generalize it using Gaussian processes
15. Seemingly unrelated regression (Zellner, 1962)
- Y = (Y1, Y2), X = (X1, X2)
- Suppose you regress Y1 on X1, X2 and
  - X2 turns out to be useless
- Analogously for Y2 on X1, X2 (X1 vanishes)
- Suppose you regress Y1 on X1, X2, Y2
  - And now every variable is a relevant predictor (see the numerical sketch below)
[Figure: two regression graphs, X1 → Y1 and X2 → Y2, with the error terms of Y1 and Y2 connected]
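A small numerical illustration of the SUR effect above; this is my own sketch with made-up coefficients, not part of the original slides.

    # Sketch: with correlated errors, Y2 (and X2) become relevant predictors of Y1.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 5000
    x1, x2 = rng.normal(size=n), rng.normal(size=n)

    common = rng.normal(size=n)            # hidden common cause shared by both errors
    e1 = common + 0.5 * rng.normal(size=n)
    e2 = common + 0.5 * rng.normal(size=n)

    y1 = 2.0 * x1 + e1                     # X2 is useless for Y1 on its own ...
    y2 = -1.0 * x2 + e2                    # ... and X1 is useless for Y2

    def ols(features, target):
        beta, *_ = np.linalg.lstsq(np.column_stack(features), target, rcond=None)
        return beta

    print(ols([x1, x2], y1))       # coefficient on x2 is close to 0
    print(ols([x1, x2, y2], y1))   # now both x2 and y2 get clearly nonzero coefficients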
16Graphically, with latents
Capital(GE)
Capital(Westinghouse)
X
Stock price(GE)
Stock price(Westinghouse)
Y
Industry factor k?
Industry factor 2
Industry factor 1
17The Directed Mixed Graph (DMG)
Capital(GE)
Capital(Westinghouse)
X
Stock price(GE)
Stock price(Westinghouse)
Y
Richardson (2003), Richardson and Spirtes (2002)
18. A new family of relational models
- Inspired by SUR
- Structure: DMG graphs
- Edges postulated from the given relations
[Figure: a DMG over inputs X1, ..., X5 and outputs Y1, ..., Y5, with directed edges Xi → Yi and bi-directed edges between related outputs]
19. Model for binary classification
- Nonparametric probit regression
- Zero-mean Gaussian process prior over f( · )
P(yi = 1 | xi) = P(y(xi) > 0), where y(xi) = f(xi) + εi and εi ~ N(0, 1)
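As a reading aid, here is a minimal generative sketch of this standard probit GP classifier (sampling only, no inference); the squared-exponential kernel and all names are my own choices, not the authors' code.

    # Sketch of the generative model on slide 19: y_i = 1 iff f(x_i) + eps_i > 0.
    import numpy as np
    from scipy.stats import norm

    def rbf_kernel(X, lengthscale=1.0):
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale**2)

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(50, 1))                     # inputs
    K = rbf_kernel(X)                                        # zero-mean GP prior over f(.)
    f = rng.multivariate_normal(np.zeros(len(X)), K + 1e-8 * np.eye(len(X)))
    eps = rng.normal(size=len(X))                            # eps_i ~ N(0, 1)
    y = (f + eps > 0).astype(int)                            # binary labels

    p = norm.cdf(f)   # equivalently, integrating eps out: P(y_i = 1 | f(x_i)) = Phi(f(x_i))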
20. Relational dependency model
- Make ε dependent: a multivariate Gaussian
- For convenience, decouple it into two error terms:
ε = ξ + λ
21. Dependency model: the decomposition
ε = ξ + λ, with the two error terms independent from each other
- ξ: dependent according to the relations; Σξ is not diagonal, with 0s only on unrelated pairs
- λ: marginally independent; Σλ is diagonal
- Hence Σε = Σξ + Σλ
22. Dependency model: the decomposition
y(xi) = f(xi) + εi = f(xi) + ξi + λi = g(xi) + λi
- If K was the original kernel matrix for f( · ), the covariance matrix of g( · ) is simply
Σg = K + Σξ
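Continuing the sketch (my notation, following the reconstruction above): given a base kernel matrix K and the relational error covariance Σξ, g has covariance K + Σξ, and labels can be drawn as below. The 10^-4 noise variance matches the value quoted on slide 30.

    # Sketch: sampling labels from the relational model of slides 20-22.
    import numpy as np

    def sample_relational_labels(K, Sigma_xi, lam_var=1e-4, seed=0):
        """g ~ N(0, K + Sigma_xi); y_i = 1 iff g(x_i) + lambda_i > 0."""
        rng = np.random.default_rng(seed)
        n = K.shape[0]
        g = rng.multivariate_normal(np.zeros(n), K + Sigma_xi)
        lam = rng.normal(scale=np.sqrt(lam_var), size=n)   # small independent error term
        return (g + lam > 0).astype(int)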
23. Approximation
- The posterior over f( · ), g( · ) is a truncated Gaussian, hard to integrate
- Approximate the posterior with a Gaussian
  - Expectation Propagation (Minka, 2001)
- The reason for λ becomes apparent in the EP approximation
24. Approximation
- The likelihood does not factorize over f( · ), but it factorizes over g( · )
- Approximate each factor p(yi | g(xi)) with a Gaussian
  - If λ were 0, yi would be a deterministic function of g(xi)
p(g | x, y) ∝ p(g | x) ∏i p(yi | g(xi))
25. Generalizations
- This can be generalized to any number of relations:
ε = λ + ξ1 + ξ2 + ξ3
[Figure: Y1, ..., Y5 linked by bi-directed edges coming from three different relations]
26. But how to parameterize Σξ?
- Non-trivial
- Desiderata:
  - Positive definite
  - Zeroes in the right places
  - Few parameters, but a broad family
  - Easy to compute
27. But how to parameterize Σξ?
- Poking zeroes into a positive definite matrix doesn't work (a numerical check follows below)
Positive definite:
  1    0.8  0.8
  0.8  1    0.8
  0.8  0.8  1
Not positive definite (a zero poked in for the unrelated pair Y1, Y3):
  1    0.8  0
  0.8  1    0.8
  0    0.8  1
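A quick numerical check of the claim above (my own, with numpy): zeroing one entry of the all-0.8 correlation matrix produces a negative eigenvalue.

    # Verify the positive-definiteness claim on slide 27.
    import numpy as np

    full = np.array([[1.0, 0.8, 0.8],
                     [0.8, 1.0, 0.8],
                     [0.8, 0.8, 1.0]])
    poked = np.array([[1.0, 0.8, 0.0],
                      [0.8, 1.0, 0.8],
                      [0.0, 0.8, 1.0]])

    print(np.linalg.eigvalsh(full))    # eigenvalues 0.2, 0.2, 2.6: positive definite
    print(np.linalg.eigvalsh(poked))   # smallest eigenvalue is about -0.13: not positive definite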
28Approach 1
- Assume we can find all cliques for the
bi-directed subgraph of relations - Create a factor analysis model, where
- for each clique Ci there is a latent variable Li
- members of each clique are the only children of
Li - Set of latents L is a set of N(0, 1) variables
- coefficients in the model are equal to 1
29. Approach 1
[Figure: a bi-directed relation graph over Y1, ..., Y4 and the corresponding factor analysis graph, with one latent Li per clique pointing into that clique's members]
30. Approach 1
- In practice, we set the variance of each λ to a small constant (10^-4)
- The covariance between any two Ys is
  - proportional to the number of cliques they belong to together
  - inversely proportional to the number of cliques they belong to individually
31. Approach 1
- Let U be the correlation matrix obtained from the proposed procedure
- To define the error covariance, use a single hyperparameter η ∈ [0, 1]:
Σξ = (I - η Udiag) + η U = (1 - η) I + η U    (since Udiag = I for a correlation matrix)
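A sketch of Approach 1 as described on slides 28-31 (the clique list is assumed given; all names are mine): build the correlation matrix U implied by one N(0, 1) latent per clique with unit coefficients and a small per-node variance, then mix with the identity using η.

    # Sketch of Approach 1 (slides 28-31).
    import numpy as np

    def clique_correlation(n, cliques, noise_var=1e-4):
        """Correlation matrix U implied by one N(0, 1) latent per clique, coefficients 1."""
        cov = np.zeros((n, n))
        for clique in cliques:
            for i in clique:
                for j in clique:
                    cov[i, j] += 1.0       # a shared latent adds 1 to the covariance
        cov += noise_var * np.eye(n)       # small independent variance per node
        d = np.sqrt(np.diag(cov))
        return cov / np.outer(d, d)

    def sigma_xi(U, eta):
        """Sigma_xi = (1 - eta) * I + eta * U, with eta in [0, 1]."""
        return (1.0 - eta) * np.eye(U.shape[0]) + eta * U

    # Example: Y1-Y2-Y3 form one clique, Y3-Y4 another (0-indexed nodes).
    U = clique_correlation(4, cliques=[(0, 1, 2), (2, 3)])
    S = sigma_xi(U, eta=0.5)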
32. Approach 1
- Notice: if everybody is connected, the model is exchangeable and simple
[Figure: a single latent L1 pointing into all of Y1, ..., Y4]
Σξ =
  1  η  η  η
  η  1  η  η
  η  η  1  η
  η  η  η  1
33Approach 1
- Finding all cliques is impossible, what to do?
- Triangulate and them extract cliques
- Can be done in polynomial time
- This is a relaxation of the problem, since
constraints are thrown away - Can have bad side effects the Blow-Up effect
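A plain-Python sketch of the triangulation step (a generic greedy min-degree elimination, not necessarily the exact procedure used in the talk); it returns the maximal cliques of the triangulated graph, which can then be fed to the Approach 1 construction above.

    # Sketch: triangulate via a greedy elimination ordering, then read off cliques.
    def triangulated_cliques(n, edges):
        adj = {i: set() for i in range(n)}
        for a, b in edges:
            adj[a].add(b)
            adj[b].add(a)
        remaining = set(range(n))
        cliques = []
        while remaining:
            v = min(remaining, key=lambda u: len(adj[u] & remaining))  # min-degree heuristic
            nbrs = adj[v] & remaining
            cliques.append(tuple(sorted({v} | nbrs)))
            for a in nbrs:                 # fill-in edges: connect v's remaining neighbors
                for b in nbrs:
                    if a != b:
                        adj[a].add(b)
            remaining.remove(v)
        # keep only the maximal cliques
        return [c for c in cliques if not any(set(c) < set(d) for d in cliques)]

    # Example: a 4-cycle Y1-Y2-Y3-Y4-Y1 gets one fill-in edge and two cliques of size 3.
    print(triangulated_cliques(4, [(0, 1), (1, 2), (2, 3), (3, 0)]))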
34. Political Books dataset
35. Political Books dataset: the blow-up effect
36. Approach 2
- Don't look for cliques: create a latent for each related pair of variables
- Very fast to compute, zeroes respected
[Figure: a bi-directed relation graph over Y1, ..., Y4 and the corresponding factor analysis graph, with one latent Lij per related pair]
37. Approach 2
- The correlations, however, are given by
Corr(ξi, ξj) ≈ 1 / sqrt( |neigh(i)| · |neigh(j)| )
- This penalizes nodes with many neighbors, even if Yi and Yj have many neighbors in common
- We call this the pulverization effect (a sketch follows below)
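A companion sketch of Approach 2 (slides 36-37), under the same conventions as the Approach 1 sketch: one N(0, 1) latent per related pair, unit coefficients, small per-node variance.

    # Sketch of Approach 2: one latent per related pair.
    import numpy as np

    def pairwise_correlation(n, edges, noise_var=1e-4):
        cov = noise_var * np.eye(n)
        for i, j in edges:
            cov[i, i] += 1.0               # each pair latent adds variance to both endpoints
            cov[j, j] += 1.0
            cov[i, j] += 1.0               # ... and covariance 1 to the related pair
            cov[j, i] += 1.0
        d = np.sqrt(np.diag(cov))
        return cov / np.outer(d, d)        # Corr ~ 1 / sqrt(|neigh(i)| * |neigh(j)|)

    # Example: node 0 related to 1, 2, 3; its correlations are "pulverized" to about 0.58.
    U = pairwise_correlation(4, edges=[(0, 1), (0, 2), (0, 3)])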
38. Political Books dataset
39. Political Books dataset: the pulverization effect
40. WebKB dataset: links of pages at the University of Washington
41. Approach 1
42. Approach 2
43. Comparison: undirected models
- Generative stories
- Conditional random fields (Lafferty, McCallum, Pereira, 2001)
- Chu et al., 2006 / Richardson and Spirtes, 2002
[Figure: an undirected model over X1, X2, X3 and Y1, Y2, Y3]
44. Wei Chu's model
- Dependency family equivalent to a pairwise Markov random field
[Figure: inputs X1, X2, X3; relation indicators R12 = 1 and R23 = 1 induce pairwise potentials linking Y1-Y2 and Y2-Y3]
45. Properties of undirected models
- MRFs propagate information among test points
[Figure: a relational graph over Y1, ..., Y12]
46. Properties of DMG models
- DMGs propagate information among training points
[Figure: the same relational graph over Y1, ..., Y12]
47. Properties of DMG models
- In a DMG, each test point will have a whole training component in its Markov blanket
[Figure: the same relational graph over Y1, ..., Y12]
48. Properties of DMG models
- It seems acceptable that a typical relational domain will not have an extrapolation pattern
  - Like typical structured output problems, e.g., NLP domains
- Ultimately, the choice of model concerns the question: hidden common causes or relational indicators?
49. Experiment 1
- A subset of the CORA database
  - 4,285 machine learning papers, 7 classes
  - Links: citations between papers
  - Hidden common cause interpretation: the particular ML subtopic being treated
- Experiment: 7 binary classification problems, Class 5 vs. others
- Criterion: AUC (see the sketch below)
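For reference, the AUC criterion is the area under the ROC curve of the predicted class-membership scores; a minimal scikit-learn sketch with made-up numbers:

    # Sketch: the AUC criterion for one binary (one-vs-rest) problem.
    from sklearn.metrics import roc_auc_score

    y_true = [1, 0, 0, 1, 0]              # 1 = paper belongs to the target class
    y_score = [0.9, 0.2, 0.4, 0.7, 0.1]   # predicted probability of class membership
    print(roc_auc_score(y_true, y_score)) # 1.0 here, since all positives rank above negatives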
50. Experiment 1
- Comparisons
  - Regular GP
  - Regular GP + citation adjacency matrix
  - Wei Chu's Relational GP (RGP)
  - Our method, the miXed graph GP (XGP)
- Fairly easy task
- Analysis of low-sample tasks
  - Uses 1% of the data (roughly 10 data points for training)
  - Not that useful for XGP, but more useful for RGP
51. Experiment 1
- Wei Chu's method gets up to 0.99 in several of those
52. Experiment 2
- Political Books database
  - 105 datapoints, 100 runs using 50 of them for training
- Comparison with standard Gaussian processes
  - Linear kernels
- Results
  - 0.92 for the regular GP
  - 0.98 for XGP (using the pairwise kernel generator)
  - Hyperparameters optimized by grid search
  - Difference of 0.06 with std 0.02
  - Wei Chu's method does the same
53. Experiment 3
- WebKB
  - Collections of webpages from 4 different universities
- Task: outlier classification
  - Identify which pages are not student, course, project or faculty pages
- 10% for training data (still not that hard)
- However, an order of magnitude more data than in Cora
54. Experiment 3
- As far as I know, XGP easily gets the best results on this task
55. Future work
- Tons of possibilities for how to parameterize the output covariance matrix
- Incorporating relation attributes too
- Heteroscedastic relational noise
- Mixtures of relations
- New approximation algorithms
- Clustering problems
- On-line learning
56. Thank You