1
New Models for Relational Classification
  • Ricardo Silva (Statslab)
  • Joint work with Wei Chu and Zoubin Ghahramani

2
The talk
  • Classification with non-iid data
  • A source of non-iid-ness: relational information
  • A new family of models, and what is new about it
  • Applications to classification of text documents

3
The prediction problem
[Graph: features X predict class label Y]
4
Standard setup
[Plate diagram: N iid training pairs (X, Y); predict the unknown Ynew from Xnew]
5
Prediction with non-iid data
[Graph: training pairs (X1, Y1), (X2, Y2) and test pair (Xnew, Ynew), now with dependencies among the Ys]
6
Where does the non-iid information come from?
  • Relations: links between data points
  • Webpage A links to Webpage B
  • Movie A and Movie B are often rented together
  • Relations as data:
  • Linked webpages are likely to present similar
    content
  • Movies that are often rented together have
    correlated personal ratings

7
The vanilla relational domain: time series
  • Relations: Yi precedes Yi+k, k > 0
  • Dependencies: Markov structure

[Chain graph: Y1 → Y2 → Y3 → …]
8
A model for integrating link data
  • How to model the class label dependencies?
  • Movies that are often rented together might share
    all sorts of common, unmeasured factors
  • These hidden common causes affect the ratings

9
Example
  • Same director?
  • Same genre?
  • Both released in the same year?
  • Target the same age groups?
10
Integrating link data
  • Of course, many of these common causes will be
    measured
  • Many will not
  • Idea:
  • Postulate a hidden common cause structure, based
    on the relations
  • Define a model that is Markov with respect to this
    structure
  • Design an adequate inference algorithm

11
Example: the Political Books database
  • A network of books about recent US politics sold
    by the online bookseller Amazon.com
  • Valdis Krebs, http://www.orgnet.com/
  • Relations: frequent co-purchasing of books by the
    same buyers
  • Political inclination factors as the hidden
    common causes

12
Political Books relations
13
Political Books database
  • Features:
  • I collected the Amazon.com front page for each of
    the books
  • Bag-of-words, tf-idf features, normalized to
    unity (a code sketch follows below)
  • Task:
  • Binary classification: liberal vs. not-liberal
    books
  • 43 liberal books out of 105
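
A minimal sketch of this feature pipeline, assuming scikit-learn; the file
names are hypothetical, since the talk only states that bag-of-words tf-idf
features were extracted and normalized to unity:

  from sklearn.feature_extraction.text import TfidfVectorizer

  # One text file per book front page; the naming scheme is made up.
  pages = [open("book_%03d.txt" % i).read() for i in range(105)]
  vectorizer = TfidfVectorizer(norm="l2")  # l2 norm = unit-length rows
  X = vectorizer.fit_transform(pages)      # 105 x vocabulary-size sparse matrix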

14
Contribution
  • We will:
  • show a classical multiple linear regression model
  • build a relational variation of it
  • generalize it with a more complex set of
    independence constraints
  • generalize it using Gaussian processes

15
Seemingly unrelated regression (Zellner, 1962)
  • Y = (Y1, Y2), X = (X1, X2)
  • Suppose you regress Y1 on X1, X2, and
  • X2 turns out to be useless
  • Analogously for Y2 on X1, X2 (X1 vanishes)
  • Suppose you regress Y1 on X1, X2, Y2
  • And now every variable is a relevant predictor

[Graphs: X1 and X2 as regressors for Y1 and for Y2, with correlated error terms across the two regressions]
16
Graphically, with latents
[Graph: Capital(GE) and Capital(Westinghouse) (the X variables) point to Stock price(GE) and Stock price(Westinghouse) (the Y variables); latent nodes Industry factor 1, 2, ..., k point to both stock prices]

17
The Directed Mixed Graph (DMG)
[Graph: the same X and Y variables, with the latent industry factors summarized by a bi-directed edge between Stock price(GE) and Stock price(Westinghouse)]
Richardson (2003), Richardson and Spirtes (2002)
18
A new family of relational models
  • Inspired by SUR
  • Structure: directed mixed graphs
  • Edges postulated from the given relations

[Graph: X1, ..., X5 each point to their Y; bi-directed edges connect the Ys of related points]
19
Model for binary classification
  • Nonparametric probit regression
  • Zero-mean Gaussian process prior over f(·)

P(yi = 1 | xi) = P(y(xi) > 0), where y(xi) = f(xi) + εi, εi ~ N(0, 1)
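
A small generative sketch of this model in numpy, not the authors'
implementation; the RBF kernel and the random inputs are assumptions made
purely for illustration:

  import numpy as np

  def rbf_kernel(X, lengthscale=1.0):
      # Squared-exponential kernel matrix over the rows of X
      d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
      return np.exp(-0.5 * d2 / lengthscale ** 2)

  rng = np.random.default_rng(0)
  X = rng.normal(size=(20, 5))                  # 20 points, 5 features
  K = rbf_kernel(X) + 1e-8 * np.eye(20)         # jitter for numerical stability
  f = rng.multivariate_normal(np.zeros(20), K)  # f ~ zero-mean GP at the inputs
  eps = rng.normal(size=20)                     # independent probit noise, N(0, 1)
  y = (f + eps > 0).astype(int)                 # yi = 1 iff y(xi) > 0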
?i, ?i N(0, 1)
20
Relational dependency model
  • Make ε dependent: a multivariate Gaussian
  • For convenience, decouple it into two error terms

ε = ξ + ζ
21
Dependency model: the decomposition

ε = ξ + ζ
(ξ and ζ are independent from each other; ξ is dependent according to the relations, ζ is marginally independent)

Σε = Σξ + Σζ
(Σξ is not diagonal, with 0s only on unrelated pairs; Σζ is diagonal)
22
Dependency model: the decomposition

y(xi) = f(xi) + εi = f(xi) + ξi + ζi = g(xi) + ζi

  • If K was the original kernel matrix for f(·),
    the covariance of g(·) is simply

Σg = K + Σξ
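
In code, the relational term simply adds to the kernel matrix. A sketch
reusing K and numpy from the snippet above; Sigma_xi is a placeholder here,
since slides 26 to 37 are precisely about how to parameterize it:

  Sigma_xi = 0.5 * np.eye(20)     # stand-in for the relational error covariance
  Sigma_g = K + Sigma_xi          # covariance of g(.) = f(.) + xi
  Sigma_y = Sigma_g + np.eye(20)  # adding diagonal zeta noise (identity assumed)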
23
Approximation
  • Posterior for f(·), g(·) is a truncated Gaussian,
    hard to integrate
  • Approximate the posterior with a Gaussian:
    Expectation Propagation (Minka, 2001)
  • The reason for ζ becomes apparent in the EP
    approximation

24
Approximation
  • Likelihood does not factorize over f(·), but
    factorizes over g(·)
  • Approximate each factor p(yi | g(xi)) with a
    Gaussian
  • If ζ were 0, yi would be a deterministic
    function of g(xi)

p(g | X, y) ∝ p(g | X) Πi p(yi | g(xi))
25
Generalizations
  • This can be generalized to any number of
    relations

[Graph: Y1, ..., Y5 with one set of bi-directed edges per relation]

ε = ζ + ξ1 + ξ2 + ξ3
26
But how to parameterize Σξ?
  • Non-trivial
  • Desiderata:
  • Positive definite
  • Zeroes in the right places
  • Few parameters, but a broad family
  • Easy to compute

27
But how to parameterize Σξ?
  • Poking zeroes into a positive definite matrix
    doesn't work

    1    0.8  0.8            1    0.8  0
    0.8  1    0.8            0.8  1    0.8
    0.8  0.8  1              0    0.8  1
    positive definite        not positive definite
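
The claim is easy to check numerically, since a positive definite matrix
must have only positive eigenvalues:

  import numpy as np

  dense = np.array([[1.0, 0.8, 0.8],
                    [0.8, 1.0, 0.8],
                    [0.8, 0.8, 1.0]])
  poked = dense.copy()
  poked[0, 2] = poked[2, 0] = 0.0    # "poke" a zero into the corner

  print(np.linalg.eigvalsh(dense))   # [0.2, 0.2, 2.6]: positive definite
  print(np.linalg.eigvalsh(poked))   # smallest is 1 - 0.8*sqrt(2) < 0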
28
Approach 1
  • Assume we can find all cliques of the
    bi-directed subgraph of relations
  • Create a factor analysis model where:
  • for each clique Ci there is a latent variable Li
  • members of each clique are the only children of
    Li
  • the set of latents L is a set of N(0, 1) variables
  • all coefficients in the model are equal to 1
    (a code sketch follows below)
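
A minimal sketch of this construction, assuming the cliques are given as
lists of variable indices; the resulting matrix is the correlation matrix U
used on slide 31:

  import numpy as np

  def clique_correlation(cliques, n):
      # Implied covariance of the factor model: one N(0, 1) latent per
      # clique, unit coefficients, plus a small epsilon variance (slide 30).
      cov = 1e-4 * np.eye(n)
      for c in cliques:
          for i in c:
              for j in c:
                  cov[i, j] += 1.0
      d = np.sqrt(np.diag(cov))
      return cov / np.outer(d, d)   # normalize covariance to correlation

  # Hypothetical clique set in the spirit of the next slide's example
  U = clique_correlation([[0, 1], [1, 2, 3]], n=4)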

29
Approach 1
[Graph: relations among Y1, ..., Y4 (left) and the corresponding factor analysis model with clique latents L1 and L2 (right)]
  • Y1 = L1 + ε1
  • Y2 = L1 + L2 + ε2

30
Approach 1
  • In practice, we set the variance of each ε to a
    small constant (10^-4)
  • The covariance between any two Ys is:
  • proportional to the number of cliques they
    jointly belong to
  • inversely proportional to the number of cliques
    they belong to individually

31
Approach 1
  • Let U be the correlation matrix obtained from the
    proposed procedure
  • To define the error covariance, use a single
    hyperparameter ν ∈ [0, 1]:

Σξ = (I − νUdiag) + νU = (1 − ν)I + νU, since the diagonal part Udiag of U is the identity
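
In code, reusing the clique_correlation sketch above: this is a convex
combination of the identity and U, so it stays positive definite for any
ν in [0, 1] and preserves the zeroes of U:

  nu = 0.5                                          # hyperparameter in [0, 1]
  Sigma_xi = (1 - nu) * np.eye(U.shape[0]) + nu * U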
32
Approach 1
  • Notice: if everybody is connected, the model is
    exchangeable and simple

[Graph: fully connected Y1, ..., Y4 (left) and a single latent L1 pointing to all of them (right)]

       1  ν  ν  ν
Σξ  =  ν  1  ν  ν
       ν  ν  1  ν
       ν  ν  ν  1
33
Approach 1
  • Finding all cliques is intractable in general;
    what to do?
  • Triangulate and then extract the cliques
  • This can be done in polynomial time (see the
    sketch below)
  • It is a relaxation of the problem, since some
    constraints are thrown away
  • Can have bad side effects: the blow-up effect
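
One possible realization of the triangulation step, assuming a recent
networkx (the talk does not commit to any library). The fill-in edges added
by the chordal completion are exactly where the spurious correlations behind
the blow-up effect come from:

  import networkx as nx

  G = nx.cycle_graph(5)                   # a relation graph with a chordless cycle
  H, _ = nx.complete_to_chordal_graph(G)  # polynomial-time triangulation
  cliques = [sorted(c) for c in nx.chordal_graph_cliques(H)]
  U_relaxed = clique_correlation(cliques, n=5)  # reuse the Approach 1 sketch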

34
Political Books dataset
35
Political Books dataset: the blow-up effect
36
Approach 2
  • Don't look for cliques: create a latent for each
    linked pair of variables
  • Very fast to compute, and the zeroes are respected

[Graph: relations among Y1, ..., Y4 (left) and the pairwise model (right), with one latent Lij per linked pair (Yi, Yj)]
37
Approach 2
  • Correlations, however, are given by

Corr(ξi, ξj) = 1 / sqrt(|neigh(i)| · |neigh(j)|)

  • This penalizes nodes with many neighbors, even if
    Yi and Yj have many neighbors in common
  • We call this the pulverization effect
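
A sketch in the same style as Approach 1: one N(0, 1) latent per linked
pair gives covariance 1 for related pairs and variance deg(i) for node i,
which reproduces the formula above (up to the small 1e-4 noise):

  import numpy as np

  def pairwise_correlation(edges, n):
      cov = 1e-4 * np.eye(n)
      for i, j in edges:
          cov[i, i] += 1.0    # each pair latent adds variance 1
          cov[j, j] += 1.0    # to both endpoints...
          cov[i, j] += 1.0    # ...and covariance 1 to the pair
          cov[j, i] += 1.0
      d = np.sqrt(np.diag(cov))
      return cov / np.outer(d, d)

  U2 = pairwise_correlation([(0, 1), (1, 2), (1, 3)], n=4)
  # U2[0, 2] == 0, and U2[1, 2] is approximately 1/sqrt(3 * 1)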
38
Political Books dataset
39
Political Books dataset: the pulverization effect
40
WebKB dataset: links between pages at the University of Washington
41
Approach 1
42
Approach 2
43
Comparison: undirected models
  • Generative stories
  • Conditional random fields (Lafferty, McCallum,
    Pereira, 2001)
  • Chu et al., 2006; Richardson and Spirtes, 2002

[Undirected graph: X1, X2, X3 with their labels Y1, Y2, Y3 pairwise connected]
44
Wei Chu's model
  • Dependency family equivalent to a pairwise Markov
    random field

[Graph: X1, X2, X3 point to Y1, Y2, Y3; observed relation indicators R12 = 1 and R23 = 1 tie the pairs (Y1, Y2) and (Y2, Y3)]
45
Properties of undirected models
  • MRFs propagate information among test points

[Network over Y1, ..., Y12: information flows along edges between unlabeled test points]
46
Properties of DMG models
  • DMGs propagate information among training points

[The same network over Y1, ..., Y12: information flows along edges between labeled training points]
47
Properties of DMG models
  • In a DMG, each test point will have in its
    Markov blanket a whole training component

[The same network over Y1, ..., Y12: the Markov blanket of a test point contains an entire connected training component]
48
Properties of DMG models
  • It seems acceptable that a typical relational
    domain will not have an extrapolation pattern
  • like typical structured output problems, e.g.,
    NLP domains
  • Ultimately, the choice of model concerns the
    question:
  • Hidden common causes or relational
    indicators?

49
Experiment 1
  • A subset of the CORA database
  • 4,285 machine learning papers, 7 classes
  • Links: citations between papers
  • Hidden common cause interpretation: the particular
    ML subtopic being treated
  • Experiment: 7 binary classification problems,
    each class vs. the others
  • Criterion: AUC

50
Experiment 1
  • Comparisons:
  • Regular GP
  • Regular GP + citation adjacency matrix as extra
    features
  • Wei Chu's Relational GP (RGP)
  • Our method, the miXed graph GP (XGP)
  • Fairly easy task
  • Analysis of low-sample tasks:
  • uses 1% of the data (roughly 10 data points for
    training)
  • Not that useful for XGP, but more useful for RGP

51
Experiment 1
  • Wei Chu's method gets up to 0.99 AUC on several of
    those tasks

52
Experiment 2
  • Political Books database
  • 105 data points, 100 runs using 50 for training
  • Comparison with standard Gaussian processes
  • Linear kernels
  • Results:
  • 0.92 AUC for the regular GP
  • 0.98 for XGP (using the pairwise kernel generator)
  • Hyperparameters optimized by grid search
  • Difference of 0.06 with std 0.02
  • Wei Chu's method does the same

53
Experiment 3
  • WebKB
  • Collections of webpages from 4 different
    universities
  • Task: outlier classification
  • Identify which pages are not student, course,
    project or faculty pages
  • 10% for training data (still not that hard)
  • However, an order of magnitude more data than
    in Cora

54
Experiment 3
  • As far as I know, XGP easily gets the best
    results on this task

55
Future work
  • Tons of possibilities for parameterizing the
    output covariance matrix
  • Incorporating relation attributes too
  • Heteroscedastic relational noise
  • Mixtures of relations
  • New approximation algorithms
  • Clustering problems
  • On-line learning

56
Thank You