Transfer Learning by Mapping and Revising Relational Knowledge
1
Transfer Learning by Mapping and Revising
Relational Knowledge
  • Raymond J. Mooney
  • University of Texas at Austin
  • with acknowledgements to
  • Lily Mihalkova, Tuyen Huynh

2
Transfer Learning
  • Most machine learning methods learn each new task
    from scratch, failing to utilize previously
    learned knowledge.
  • Transfer learning concerns using knowledge
    acquired in a previous source task to facilitate
    learning in a related target task.
  • It is usually assumed that significant training data
    is available in the source domain but only limited
    training data in the target domain.
  • By exploiting knowledge from the source, learning
    in the target can be:
  • More accurate: learned knowledge makes better
    predictions.
  • Faster: training time is reduced.

3
Transfer Learning Curves
  • Transfer learning increases accuracy in the
    target domain.

[Figure: learning curves; y-axis: predictive accuracy, x-axis: amount of training
data in the target domain.]
4
Recent Work on Transfer Learning
  • A recent DARPA program on transfer learning has led
    to significant research in the area.
  • Some work focuses on feature-vector
    classification:
  • Hierarchical Bayes (Yu et al., 2005; Lawrence &
    Platt, 2004)
  • Informative Bayesian Priors (Raina et al., 2005)
  • Boosting for transfer learning (Dai et al., 2007)
  • Structural Correspondence Learning (Blitzer et
    al., 2007)
  • Some work focuses on Reinforcement Learning:
  • Value-function transfer (Taylor & Stone, 2005,
    2007)
  • Advice-based policy transfer (Torrey et al.,
    2005, 2007)

5
Similar Research Problems
  • Multi-Task Learning (Caruana, 1997)
  • Learn multiple tasks simultaneously, each one
    helped by the others.
  • Life-Long Learning (Thrun, 1996)
  • Transfer learning from a number of prior source
    problems, picking the correct source problems to
    use.

6
Logical Paradigm
  • Represents knowledge and data in binary symbolic
    logic such as First Order Predicate Calculus.
  • Strength: a rich representation that handles
    arbitrary sets of objects, with properties,
    relations, quantifiers, etc.
  • Weakness: unable to handle uncertain knowledge and
    probabilistic reasoning.

7
Probabilistic Paradigm
  • Represents knowledge and data as a fixed set of
    random variables with a joint probability
    distribution.
  • Strength: handles uncertain knowledge and
    probabilistic reasoning.
  • Weakness: unable to handle arbitrary sets of
    objects, with properties, relations, quantifiers,
    etc.

8
Statistical Relational Learning (SRL)
  • Most machine learning methods assume i.i.d.
    examples represented as fixed-length feature
    vectors.
  • Many domains require learning and making
    inferences about unbounded sets of entities that
    are richly relationally connected.
  • SRL methods attempt to integrate methods from
    predicate logic and probabilistic graphical
    models to handle such structured,
    multi-relational data.

9
Statistical Relational Learning
Multi-relational data (example):
  Actor: pacino, brando
  Director: coppola
  Movie: (godFather, pacino), (godFather, brando),
    (godFather, coppola), (streetCar, brando)
  WorkedFor: (pacino, coppola), (brando, coppola)

Multi-Relational Data → Learning Algorithm → Probabilistic Graphical Model
10
Multi-Relational Data Challenges
  • Examples cannot be effectively represented as
    feature vectors.
  • Predictions for connected facts are not
    independent (e.g. WorkedFor(brando,
    coppola), Movie(godFather, brando)).
  • Data is not i.i.d.
  • Requires collective inference (classification)
    (Taskar et al., 2001)
  • A single independent example (mega-example) often
    contains information about a large number of
    interconnected entities and can vary in length.
  • Leave one university out testing (Craven et al.,
    1998)

11
TL and SRL and I.I.D.
  • Standard Machine Learning assumes examples are
  • Independent and Identically Distributed

TL breaks the assumption that test examples
are drawn from the same distribution as the
training instances.
SRL breaks the assumption that examples are
independent.
12
Multi-Relational Domains
  • Domains about people
  • Academic departments (UW-CSE)
  • Movies (IMDB)
  • Biochemical domains
  • Mutagenesis
  • Alzheimer drug design
  • Linked text domains
  • WebKB
  • Cora

13
Relational Learning Methods
  • Inductive Logic Programming (ILP)
  • Produces sets of first-order rules
  • Not appropriate for probabilistic reasoning
  • If a student wrote a paper with a professor, then
    the professor is the student's advisor.
  • SRL models and learning algorithms:
  • SLPs (Muggleton, 1996)
  • PRMs (Koller, 1999)
  • BLPs (Kersting & De Raedt, 2001)
  • RMNs (Taskar et al., 2002)
  • MLNs (Richardson & Domingos, 2006)

14
MLN Transfer (Mihalkova, Huynh, & Mooney, 2007)
  • Given two multi-relational domains (e.g. an
    academic department as source and movies as target):
  • Transfer a Markov logic network learned in the
    Source to the Target by
  • Mapping the Source predicates to the Target
  • Revising the mapped knowledge

15
First-Order Logic Basics
  • Literal: a predicate (or its negation) applied to
    constants and/or variables.
  • Gliteral: a ground literal, e.g. WorkedFor(brando,
    coppola)
  • Vliteral: a variablized literal, e.g. WorkedFor(A, B)
  • We assume predicates have typed arguments.
  • For example, Movie(godFather, coppola) has
    arguments of type movie and person.

16
First-Order Clauses
  • Clause: a disjunction of literals
  • Can be rewritten as a set of rules; for example,
    ¬Movie(T, A) ∨ ¬WorkedFor(A, B) ∨ Movie(T, B)
    is equivalent to Movie(T, A) ∧ WorkedFor(A, B) ⇒ Movie(T, B).

17
Representing the Data
Actor(pacino) WorkedFor(pacino, coppola) Movie(godFather, pacino)
Actor(brando) WorkedFor(brando, coppola) Movie(godFather, brando)
Director(coppola) Movie(godFather, coppola)
  • Makes a closed world assumption:
  • The gliterals listed are true; the rest are false.
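A minimal sketch of this closed-world reading of the data, assuming toy hand-coded
types and constants (not Alchemy's input format): every type-consistent gliteral
not listed as true is taken to be false.

    # Closed-world assumption sketch: enumerate all type-consistent gliterals
    # and mark the listed ones true, everything else false.
    from itertools import product

    constants = {"person": ["pacino", "brando", "coppola"], "movie": ["godFather"]}
    pred_types = {"Actor": ("person",), "Director": ("person",),
                  "WorkedFor": ("person", "person"), "Movie": ("movie", "person")}
    true_gliterals = {("Actor", ("pacino",)), ("Actor", ("brando",)),
                      ("Director", ("coppola",)),
                      ("WorkedFor", ("pacino", "coppola")),
                      ("WorkedFor", ("brando", "coppola")),
                      ("Movie", ("godFather", "pacino")),
                      ("Movie", ("godFather", "brando")),
                      ("Movie", ("godFather", "coppola"))}

    world = {(p, args): (p, args) in true_gliterals
             for p, types in pred_types.items()
             for args in product(*(constants[t] for t in types))}
    print(sum(world.values()), "true of", len(world), "gliterals")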

18
Markov Logic Networks (Richardson & Domingos, 2006)
  • Set of first-order clauses, each assigned a
    weight.
  • Larger weight indicates stronger belief that the
    clause should hold.
  • The clauses are called the structure of the MLN.

19
Markov Networks (Pearl, 1988)
  • A concise representation of the joint probability
    distribution of a set of random variables using
    an undirected graph.

[Figure: example Markov network over variables including Reputation of Author
and Quality of Paper, with a table of joint probabilities over their value
combinations.]
The same probability distribution can be represented
as the product of a set of functions defined over
the cliques of the graph.
20
Markov Network Equations
  • General form
  • Log-linear models
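In standard notation (following Pearl, 1988, and Richardson & Domingos, 2006),
these are:

    General form:
        P(X = x) = \frac{1}{Z} \prod_k \phi_k(x_{\{k\}}),
        \qquad Z = \sum_{x} \prod_k \phi_k(x_{\{k\}})

    Log-linear form:
        P(X = x) = \frac{1}{Z} \exp\Big(\sum_j w_j f_j(x)\Big)

    where the \phi_k are potential functions over the cliques of the graph, the
    f_j are real-valued feature functions with weights w_j, and Z is the
    normalizing partition function.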

21
Ground Markov Network for an MLN
  • MLNs are templates for constructing Markov
    networks for a given set of constants
  • Include a node for each type-consistent grounding
    (a gliteral) of each predicate in the MLN.
  • Two nodes are connected by an edge if their
    corresponding gliterals appear together in any
    grounding of any clause in the MLN.
  • Include a feature for each grounding of each
    clause in the MLN with weight equal to the weight
    of the clause.
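A minimal sketch of this template semantics, assuming a single toy clause and
hand-coded typed constants rather than Alchemy's actual data structures:

    # Grounding sketch: nodes are type-consistent gliterals; two nodes are
    # connected if they co-occur in some grounding of a clause.
    from itertools import product

    constants = {"person": ["brando", "coppola"], "movie": ["godFather"]}
    pred_types = {"Actor": ("person",), "Director": ("person",),
                  "WorkedFor": ("person", "person"), "Movie": ("movie", "person")}
    var_types = {"A": "person", "B": "person"}

    # One illustrative clause: Actor(A) v !WorkedFor(A, B) v Director(B)
    clause = [("Actor", ("A",)), ("WorkedFor", ("A", "B")), ("Director", ("B",))]

    # Nodes: one per type-consistent grounding of every predicate.
    nodes = {(p, args) for p, types in pred_types.items()
             for args in product(*(constants[t] for t in types))}

    # Edges: gliterals appearing together in any grounding of the clause.
    edges = set()
    clause_vars = sorted({v for _, vs in clause for v in vs})
    for binding in product(*(constants[var_types[v]] for v in clause_vars)):
        env = dict(zip(clause_vars, binding))
        gliterals = [(p, tuple(env[v] for v in vs)) for p, vs in clause]
        for i in range(len(gliterals)):
            for j in range(i + 1, len(gliterals)):
                edges.add(frozenset((gliterals[i], gliterals[j])))

    print(len(nodes), "nodes,", len(edges), "edges")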

22
  • Constants: coppola, brando, godFather
[Figure: the resulting ground Markov network, with clause weights 1.3, 1.2, 0.5
and one node per gliteral: Actor(brando), Director(brando),
WorkedFor(brando, brando), WorkedFor(brando, coppola), Movie(godFather, brando),
Movie(godFather, coppola), WorkedFor(coppola, brando), WorkedFor(coppola, coppola),
Director(coppola), Actor(coppola).]
23
MLN Equations
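In the notation of Richardson & Domingos (2006), the probability an MLN assigns
to a possible world x is:

    P(X = x) = \frac{1}{Z} \exp\Big(\sum_i w_i\, n_i(x)\Big),
    \qquad Z = \sum_{x'} \exp\Big(\sum_i w_i\, n_i(x')\Big)

    where n_i(x) is the number of true groundings of clause i in world x and
    w_i is the clause's weight.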
24
MLN Equation Intuition
  • A possible world (a truth assignment to all
    gliterals) becomes exponentially less likely as
    the total weight of all the grounded clauses it
    violates increases.

25
MLN Inference
  • Given truth assignments for a given set of evidence
    gliterals, infer the probability that each member
    of a set of unknown query gliterals is true.

26
[Figure: the same ground Markov network, used to illustrate inference over query
gliterals given evidence gliterals.]
27
MLN Inference Algorithms
  • Gibbs Sampling (Richardson & Domingos, 2006)
  • MC-SAT (Poon & Domingos, 2006)
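A toy, self-contained sketch of Gibbs sampling over query gliterals; the ground
clauses, weights, and evidence below are made up for illustration, and Alchemy's
actual implementation differs (e.g. burn-in, blocking):

    # Toy Gibbs sampler for MLN inference. A ground clause is
    # (weight, [(gliteral, positive?)...]); a world maps gliterals to booleans.
    import math, random

    random.seed(0)
    ground_clauses = [
        (1.5, [(("WorkedFor", ("brando", "coppola")), False),
               (("Movie", ("godFather", "brando")), True)]),  # WorkedFor(b,c) => Movie(gF,b)
        (0.8, [(("Actor", ("brando",)), True)]),
    ]
    evidence = {("WorkedFor", ("brando", "coppola")): True}
    query = [("Movie", ("godFather", "brando")), ("Actor", ("brando",))]

    world = dict(evidence)
    for q in query:
        world[q] = random.random() < 0.5          # random initial assignment

    def satisfied(lits, w):
        return any(w[g] == pos for g, pos in lits)

    def weight_if(gliteral, value):
        # Total weight of satisfied ground clauses containing gliteral,
        # with gliteral set to the given value.
        w = dict(world); w[gliteral] = value
        return sum(wt for wt, lits in ground_clauses
                   if any(g == gliteral for g, _ in lits) and satisfied(lits, w))

    counts = {q: 0 for q in query}
    n_samples = 2000
    for _ in range(n_samples):
        for q in query:                            # resample each query gliteral in turn
            p_true = 1.0 / (1.0 + math.exp(weight_if(q, False) - weight_if(q, True)))
            world[q] = random.random() < p_true
            counts[q] += world[q]

    for q in query:
        print(q, counts[q] / n_samples)            # estimated marginal probability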

28
MLN Learning
  • Weight-learning (Richardson & Domingos, 2006;
    Lowd & Domingos, 2007)
  • Performed using optimization methods.
  • Structure-learning (Kok & Domingos, 2005)
  • Proceeds in iterations of beam search, adding the
    best-performing clause after each iteration to
    the MLN.
  • Clauses are evaluated using WPLL score.

29
WPLL (Kok & Domingos, 2005)
  • Weighted pseudo log-likelihood
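In the formulation of Kok & Domingos (2005), the weighted pseudo-log-likelihood
of a world x under weights w is:

    \mathrm{WPLL}(w, x) = \sum_{r=1}^{R} c_r \sum_{k=1}^{g_r}
        \log P_w\big(X_{r,k} = x_{r,k} \mid MB_x(X_{r,k})\big)

    where r ranges over the first-order predicates, g_r is the number of
    groundings of predicate r, MB_x(X_{r,k}) is the Markov blanket of gliteral
    X_{r,k}, and c_r = 1/g_r so that each predicate contributes equally.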

30
Alchemy
  • Open-source package of MLN software provided by
    UW that includes
  • Inference algorithms
  • Weight learning algorithms
  • Structure learning algorithm
  • Sample data sets
  • All our software uses and extends Alchemy.

31
TAMAR (Transfer via Automatic Mapping And Revision)
[Figure: TAMAR system diagram, taking the source MLN and Target (IMDB) Data as
input to the mapping and revision steps.]
32
Predicate Mapping
  • Each clause is mapped independently of the
    others.
  • The algorithm considers all possible ways to map
    a clause such that
  • Each predicate in the source clause is mapped to
    some target predicate.
  • Each argument type in the source is mapped to
    exactly one argument type in the target.
  • Each mapped clause is evaluated by measuring its
    WPLL for the target data, and the most accurate
    mapping is kept.
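A rough sketch of this per-clause mapping search, with toy hand-coded predicate
signatures; real TAMAR translates each legal mapping and scores it by WPLL on
the target data, a step omitted here:

    # Enumerate type-consistent predicate mappings for one clause.
    from itertools import product

    source_types = {"Publication": ("title", "person"), "AdvisedBy": ("person", "person")}
    target_types = {"Movie": ("name", "person"), "WorkedFor": ("person", "person"),
                    "Genre": ("person", "genre")}
    source_clause = ["Publication", "AdvisedBy", "Publication"]

    def consistent_mappings(clause):
        """Yield (predicate map, type map) pairs where every source argument
        type is mapped to exactly one target argument type."""
        preds = sorted(set(clause))
        for choice in product(target_types, repeat=len(preds)):
            pred_map, type_map, ok = dict(zip(preds, choice)), {}, True
            for s_pred, t_pred in pred_map.items():
                s_args, t_args = source_types[s_pred], target_types[t_pred]
                if len(s_args) != len(t_args):
                    ok = False
                    break
                for s_t, t_t in zip(s_args, t_args):
                    if type_map.setdefault(s_t, t_t) != t_t:   # type already mapped elsewhere
                        ok = False
                        break
                if not ok:
                    break
            if ok:
                yield pred_map, type_map

    for pred_map, type_map in consistent_mappings(source_clause):
        # TAMAR would translate the clause under this mapping and evaluate its
        # WPLL on the target data, keeping the best-scoring mapping.
        print(pred_map, type_map)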

33
Predicate Mapping Example
Consistent type mapping: title → name, person → person
34
Predicate Mapping Example 2
Consistent type mapping: title → person, person → gend
35
TAMAR (Transfer via Automatic Mapping And Revision)
[Figure: the TAMAR system diagram again, taking the source MLN and Target (IMDB)
Data as input to the mapping and revision steps.]
36
Transfer Learning as Revision
  • Regard mapped source MLN as an approximate model
    for the target task that needs to be accurately
    and efficiently revised.
  • Thus our general approach is similar to that
    taken by theory revision systems (Richards &
    Mooney, 1995).
  • Revisions are proposed in a bottom-up fashion.

37
R-TAMAR
[Figure: R-TAMAR flow; the relational data drives new clause discovery, and new
candidate clauses are scored by their change in WPLL (e.g. 0.1, -0.2, 0.5, 1.7,
1.3).]
38
R-TAMAR Self-Diagnosis
  • Use mapped source MLN to make inferences in the
    target and observe the behavior of each clause
  • Consider each predicate P in the domain in turn.
  • Use Gibbs sampling to infer truth values for the
    gliterals of P, using the remaining gliterals as
    evidence.
  • Bin the clauses containing gliterals of P based
    on whether they behave as desired.
  • Revisions are focused only on clauses in the
    "Bad" bins.

39
Self-Diagnosis Clause Bins
Data: Actor(brando), Director(coppola), Movie(godFather,
brando), Movie(godFather, coppola), Movie(rainMaker,
coppola), WorkedFor(brando, coppola)
Current gliteral: Actor(brando)
  • Clauses containing the current gliteral are placed
    into one of four bins:
  • Relevant; Good
  • Relevant; Bad
  • Irrelevant; Good
  • Irrelevant; Bad

40
(Build of the previous slide.)
41
(Build of the previous slide.)
42
(Build of the previous slide.)
43
Structure Revisions
  • Uses a directed beam search:
  • Literal deletions are attempted only for clauses
    marked for shortening.
  • Literal additions are attempted only for clauses
    marked for lengthening.
  • Training is much faster since search space is
    constrained by
  • Limiting the clauses considered for updates.
  • Restricting the type of updates allowed.
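A rough sketch of such a constrained beam search; the `score` argument is a
hypothetical stand-in for the WPLL-based evaluation TAMAR actually uses:

    # Constrained revision search: only deletions for "shorten" clauses,
    # only additions for "lengthen" clauses.
    def revise_clause(clause, mark, candidate_literals, score, beam_width=4):
        """clause: list of literals; mark: 'shorten' or 'lengthen'."""
        beam = [clause]
        best = max(beam, key=score)
        improved = True
        while improved:
            improved = False
            candidates = []
            for c in beam:
                if mark == "shorten":              # only deletions are tried
                    candidates += [c[:i] + c[i+1:] for i in range(len(c)) if len(c) > 1]
                else:                              # only additions are tried
                    candidates += [c + [lit] for lit in candidate_literals if lit not in c]
            beam = [list(c) for c in
                    sorted(set(map(tuple, candidates)), key=score, reverse=True)[:beam_width]]
            if beam and score(beam[0]) > score(best):
                best, improved = beam[0], True
        return best

    # Toy usage with a made-up scoring function that simply prefers shorter clauses:
    clause = ["Movie(T,A)", "!WorkedFor(A,B)", "!Relative(A,B)", "Movie(T,B)"]
    print(revise_clause(clause, "shorten", [], score=lambda c: -len(c)))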

44
New Clause Discovery
  • Uses Relational Pathfinding (Richards & Mooney,
    1992)

Actor(brando) Director(coppola) Movie(godFather,
brando) Movie(godFather, coppola) Movie(rainMaker,
coppola) WorkedFor(brando, coppola)
[Figure: a graph over the constants brando, coppola, godFather, and rainMaker
whose edges are the WorkedFor and Movie facts above; paths between brando and
coppola become candidate clauses.]
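A minimal sketch of relational pathfinding on the facts above: search for short
paths of ground facts connecting two constants, which would then be variablized
into candidate clauses (the variablization step is omitted here):

    # Relational pathfinding sketch: BFS over a graph whose nodes are constants
    # and whose edges come from facts with two or more arguments.
    from collections import defaultdict, deque

    facts = [("Actor", ("brando",)), ("Director", ("coppola",)),
             ("Movie", ("godFather", "brando")), ("Movie", ("godFather", "coppola")),
             ("Movie", ("rainMaker", "coppola")), ("WorkedFor", ("brando", "coppola"))]

    graph = defaultdict(list)
    for pred, args in facts:
        for a in args:
            for b in args:
                if a != b:
                    graph[a].append((b, (pred, args)))

    def find_paths(start, goal, max_len=3):
        """Breadth-first search for paths of facts from start to goal."""
        queue, paths = deque([(start, [])]), []
        while queue:
            node, path = queue.popleft()
            if node == goal and path:
                paths.append(path)
                continue
            if len(path) >= max_len:
                continue
            for nxt, fact in graph[node]:
                if fact not in path:               # don't reuse a fact within one path
                    queue.append((nxt, path + [fact]))
        return paths

    for p in find_paths("brando", "coppola"):
        print(p)   # e.g. the direct WorkedFor fact, or the path through godFather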
45
Weight Revision
Publication(T, A) ∧ AdvisedBy(A, B) ⇒ Publication(T, B)
Target (IMDB) Data
Movie(T, A) ∧ WorkedFor(A, B) ⇒ Movie(T, B)
Movie(T, A) ∧ WorkedFor(A, B) ∧ Relative(A, B) ⇒ Movie(T, B)
46
Experiments: Domains
  • UW-CSE
  • Data about members of the UW CSE department
  • Predicates include Professor, Student, AdvisedBy,
    TaughtBy, Publication, etc.
  • IMDB
  • Data about 20 movies
  • Predicates include Actor, Director, Movie,
    WorkedFor, Genre, etc.
  • WebKB
  • Entity relations from the original WebKB domain
    (Craven et al. 1998)
  • Predicates include Faculty, Student, Project,
    CourseTA, etc.

47
Dataset Statistics
Data is organized as mega-examples
  • Each mega-example contains information about a
    group of related entities.
  • Mega-examples are independent and disconnected
    from each other.

Data Set   Mega-Examples   Constants   Types   Predicates   True Gliterals   Total Gliterals
IMDB       5               316         4       10           1,540            32,615
UW-CSE     5               1,323       9       15           2,673            678,899
WebKB      4               1,700       3       6            2,065            688,193
48
Manually Developed Source KB
  • UW-KB is a hand-built knowledge base (set of
    clauses) for the UW-CSE domain.
  • When used as a source domain, transfer learning
    is a form of theory refinement that also includes
    mapping to a new domain with a different
    representation.

49
Systems Compared
  • TAMAR: the complete transfer system.
  • ScrKD: the algorithm of Kok & Domingos (2005),
    learning from scratch.
  • TrKD: the algorithm of Kok & Domingos (2005)
    performing transfer, using M-TAMAR to produce a
    mapping.

50
Methodology: Training & Testing
  • Generated learning curves using leave-one-out
    cross-validation (CV):
  • Each run keeps one mega-example for testing and
    trains on the remaining ones, provided one by
    one.
  • Curves are averages over all runs.
  • Evaluated learned MLN by performing inference for
    all gliterals of each predicate in turn,
    providing the rest as evidence, and averaging the
    results.
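A schematic sketch of this protocol; `train_mln` and `evaluate` are hypothetical
stand-ins for the actual Alchemy-based training and inference steps:

    # Leave-one-mega-example-out learning curves, averaged over all runs.
    def learning_curves(mega_examples, train_mln, evaluate):
        curves = []
        for i, test_ex in enumerate(mega_examples):
            train_pool = mega_examples[:i] + mega_examples[i+1:]
            curve = []
            for k in range(1, len(train_pool) + 1):   # provide training examples one by one
                mln = train_mln(train_pool[:k])
                curve.append(evaluate(mln, test_ex))  # infer each predicate given the rest
            curves.append(curve)
        # average the curves point-wise over all leave-one-out runs
        return [sum(run[k] for run in curves) / len(curves)
                for k in range(len(curves[0]))]

    # Toy usage with dummy stand-ins:
    print(learning_curves([1, 2, 3],
                          train_mln=lambda exs: len(exs),
                          evaluate=lambda mln, ex: mln / 2.0))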

51
Methodology: Metrics (Kok & Domingos, 2005)
  • CLL: Conditional Log Likelihood
  • The log of the probability predicted by the model
    that a gliteral has the correct truth value given
    in the data.
  • Averaged over all test gliterals.
  • AUC: Area under the precision-recall (PR) curve
  • Produce a PR curve by varying the probability
    threshold.
  • Compute the area under this curve.
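A toy sketch of the two metrics on made-up predictions; this is not Alchemy's
evaluation code, and the AUC-PR integration below is a crude approximation:

    # CLL and AUC-PR on toy predictions.
    # probs[i]: predicted probability that test gliteral i is true; truth[i]: its value.
    import math

    probs = [0.9, 0.8, 0.3, 0.6, 0.1]
    truth = [True, True, False, True, False]

    # CLL: average log probability assigned to the correct truth value,
    # clamped to avoid log(0).
    eps = 1e-6
    cll = sum(math.log(max(eps, p if t else 1 - p))
              for p, t in zip(probs, truth)) / len(probs)

    # AUC-PR: sweep the threshold over the predicted probabilities and integrate
    # precision against recall with a simple trapezoidal rule.
    points = []
    for thr in sorted(set(probs), reverse=True):
        predicted = [p >= thr for p in probs]
        tp = sum(pr and t for pr, t in zip(predicted, truth))
        fp = sum(pr and not t for pr, t in zip(predicted, truth))
        fn = sum((not pr) and t for pr, t in zip(predicted, truth))
        if tp + fp:
            points.append((tp / (tp + fn), tp / (tp + fp)))   # (recall, precision)
    points.sort()
    auc_pr = sum((r2 - r1) * (p1 + p2) / 2
                 for (r1, p1), (r2, p2) in zip(points, points[1:]))

    print("CLL:", round(cll, 3), "AUC-PR:", round(auc_pr, 3))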

52
Metrics to Summarize Curves
  • Transfer Ratio (Cohen et al., 2007)
  • Gives an overall measure of the improvement
    achieved over learning from scratch.
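As we understand the Cohen et al. (2007) metric, it is the ratio of the areas
under the two learning curves:

    \mathrm{TR} = \frac{\text{area under the learning curve of the transfer learner}}
                       {\text{area under the learning curve of the learner trained from scratch}}

    A value greater than 1 indicates that transfer helped overall.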

53
Transfer Scenarios
  • Source/target pairs tested
  • WebKB → IMDB
  • UW-CSE → IMDB
  • UW-KB → IMDB
  • WebKB → UW-CSE
  • IMDB → UW-CSE
  • WebKB not used as a target since one mega-example
    is sufficient to learn an accurate theory for its
    limited predicate set.

54
(No Transcript)
55
(No Transcript)
56
Sample Learning Curve
[Figure: sample learning curve comparing ScrKD, TrKD with hand mapping, TAMAR
with hand mapping, TrKD, and TAMAR.]
57
(No Transcript)
58
Future Research Issues
  • More realistic application domains.
  • Application to other SRL models (e.g. SLPs,
    BLPs).
  • More flexible predicate mapping:
  • Allow argument ordering or arity to change.
  • Map one predicate to a conjunction of > 1 predicates,
    e.g. AdvisedBy(X, Y) ⇔ Movie(M, X) ∧ Director(M, Y)

59
Multiple Source Transfer
  • Transfer from multiple source problems to a given
    target problem.
  • Determine which clauses to map and revise from
    different source MLNs.

60
Source Selection
  • Select useful source domains from a large number
    of previously learned tasks.
  • Ideally, picking source domain(s) is sub-linear
    in the number of previously learned tasks.

61
Conclusions
  • Presented TAMAR, a complete transfer system for
    SRL that
  • Maps relational knowledge in the source to the
    target domain.
  • Revises the mapped knowledge to further improve
    accuracy.
  • Showed experimentally that TAMAR improves speed
    and accuracy over existing methods.

62
Questions?
  • Related papers at
  • http://www.cs.utexas.edu/users/ml/publication/transfer.html

63
Why MLNs?
  • Inherit the expressivity of first-order logic
  • Can apply insights from ILP
  • Inherit the flexibility of probabilistic
    graphical models
  • Can deal with noisy, uncertain environments
  • Undirected models
  • Do not need to learn causal directions
  • Subsume all other SRL models that are special
    cases of first-order logic or probabilistic
    graphical models (Richardson, 2004)
  • Publicly available software package (Alchemy)

64
Predicate Mapping Comments
  • A particular source predicate can be mapped to
    different target predicates in different clauses.
  • This makes our approach context-sensitive.
  • It is also more scalable:
  • In the worst case, the number of mappings is
    exponential in the number of predicates considered.
  • The number of predicates in a clause is generally
    much smaller than the total number of predicates
    in a domain.

65
Relationship to Structure Mapping Engine
(Falkenhainer et al., 1989)
  • A system for mapping relations using analogy
    based on a psychological theory.
  • Mappings are evaluated based only on the
    structural relational similarity between the two
    domains.
  • Does not consider the accuracy of mapped
    knowledge in the target when determining the
    preferred mapping.
  • Determines a single global mapping for a given
    source and target.

66
Summary of Methodology
  1. Learn MLNs for each point on learning curve
  2. Perform inference over learned models
  3. Summarize inference results using two metrics, CLL
    and AUC, thus producing two learning curves
  4. Summarize each learning curve using transfer
    ratio and percentage improvement from one
    mega-example

67
(No Transcript)
68
(No Transcript)