Learning and Inference in Natural Language Understanding - PowerPoint PPT Presentation

About This Presentation
Title:

Learning and Inference in Natural Language Understanding


Slides: 51
Provided by: danr169

Transcript and Presenter's Notes

Title: Learning and Inference in Natural Language Understanding


1
Learning and Inference in Natural Language
Understanding
  • Dan Roth
  • Department of Computer Science
  • University of Illinois at Urbana-Champaign

With thanks to collaborators: Ming-Wei Chang,
Vasin Punyakanok, Lev Ratinov, Nick Rizzolo, Mark
Sammons, Scott Yih, Dav Zimak
Funding: ARDA, under the AQUAINT program; NSF ITR
IIS-0085836, ITR IIS-0428472, ITR IIS-0085980,
SoD-HCER-0613885; a DOI grant under the
Reflex program; DHS; DASH Optimization
(Xpress-MP)
February 2009, Need 4 Speed Seminar @ UIUC
2
Comprehension
A process that maintains and updates a collection
of propositions about the state of affairs.
  • (ENGLAND, June, 1989) - Christopher Robin is
    alive and well. He lives in England. He is the
    same person that you read about in the book,
    Winnie the Pooh. As a boy, Chris lived in a
    pretty home called Cotchfield Farm. When Chris
    was three years old, his father wrote a poem
    about him. The poem was printed in a magazine
    for others to read. Mr. Robin then wrote a book.
    He made up a fairy tale land where Chris lived.
    His friends were animals. There was a bear
    called Winnie the Pooh. There was also an owl
    and a young pig, called a piglet. All the
    animals were stuffed toys that Chris owned. Mr.
    Robin made them come to life with his words. The
    places in the story were all near Cotchfield
    Farm. Winnie the Pooh was written in 1925.
    Children still love to read about Christopher
    Robin and his animal friends. Most people don't
    know he is a real person who is grown now. He
    has written two books of his own. They tell what
    it is like to be famous.

1. Christopher Robin was born in England.
2. Winnie the Pooh is a title of a book.
3. Christopher Robin's dad was a magician.
4. Christopher Robin must be at least 65 now.
Multiple context-sensitive disambiguation
problems are solved (via machine learning), and
global decisions are derived as constrained
optimization over learned variables.
3
Nice to Meet You
4
Learning and Inference
  • Global decisions in which several local decisions
    play a role, but there are mutual dependencies on
    their outcome.
  • E.g., structured output problems: multiple
    dependent output variables
  • (Learned) models/classifiers for different
    sub-problems
  • In some cases, not all local models can be
    learned simultaneously
  • Key examples in NLP are Textual Entailment and QA
  • In these cases, constraints may appear only at
    evaluation time
  • Incorporate the models' information, along with prior
    knowledge/constraints, in making coherent
    decisions
  • Decisions that respect the local models as well
    as domain- and context-specific knowledge/constraints.

5
Constrained Conditional Models
  • A general inference framework that combines
  • Learning conditional models with using
    declarative expressive constraints
  • Within a constrained optimization framework
  • Formulate a decision process as a constrained
    optimization problem
  • Break up a complex problem into a set of
    sub-problems and require the components' outcomes to
    be consistent modulo constraints
  • Has been shown useful in the context of many NLP
    problems
  • SRL, summarization, co-reference, information
    extraction, transliteration
  • [Roth & Yih 04, 07; Punyakanok et al. 05, 08; Chang
    et al. 07, 08; Clarke & Lapata 06, 07;
    Denis & Baldridge 07; Goldwasser & Roth 08]

6
Outline
  • Constrained Conditional Models
  • Motivation
  • Examples
  • A tutorial on the computational processes
    involved
  • Classification
  • Inference
  • Pipeline
  • Scale of problems
  • A quick view of some current research questions
    in this line
  • Training Paradigms: investigate ways for
    training models and combining constraints
  • Joint Learning and Inference vs. decoupling
    Learning & Inference
  • Guiding Semi-Supervised Learning with Constraints
  • Features vs. Constraints
  • Hard and Soft Constraints

7
Inference with General Constraint Structure
[Roth & Yih 04, 07] Recognizing Entities and
Relations
Improvement over no inference: 2-5%
Non-Sequential
  • Some Questions
  • How to guide the global inference?
  • Why not learn jointly?

Models could be learned separately; constraints
may come up only at decision time.
8
Structured Output
  • For each instance, assign values to a set of
    variables
  • Output variables depend on each other
  • Common tasks in
  • Natural language processing
  • Parsing, semantic parsing, summarization,
    transliteration, co-reference resolution, ...
  • Information extraction
  • Entities, relations, ...
  • Many pure machine learning approaches exist
  • Hidden Markov Models (HMMs), CRFs
  • Structured Perceptrons and SVMs
  • However, ...

9
Motivation II
Information Extraction via Hidden Markov Models
Input citation:
Lars Ole Andersen . Program analysis and
specialization for the C Programming language.
PhD thesis. DIKU , University of Copenhagen, May
1994 .
Prediction result of a trained HMM:
Lars Ole Andersen . Program analysis and
specialization for the C Programming language .
PhD thesis . DIKU , University of Copenhagen ,
May 1994 .
Labels: AUTHOR, TITLE, EDITOR, BOOKTITLE,
TECH-REPORT, INSTITUTION, DATE
(the per-word label assignment shown on the slide
is not preserved in this transcript)
Unsatisfactory results!
10
Strategies for Improving the Results
  • (Pure) Machine Learning Approaches
  • Higher Order HMM/CRF?
  • Increasing the window size?
  • Adding a lot of new features
  • Requires a lot of labeled examples
  • What if we only have a few labeled examples?
  • Any other options?
  • Humans can immediately tell bad outputs
  • The output does not make sense

Increasing the model complexity
Can we keep the learned model simple and still
make expressive decisions?
11
Information extraction without Prior Knowledge
Lars Ole Andersen . Program analysis and
specialization for the C Programming language.
PhD thesis. DIKU , University of Copenhagen, May
1994 .
Violates lots of natural constraints!
12
Examples of Constraints
  • Each field must be a consecutive list of words
    and can appear at most once in a citation.
  • State transitions must occur on punctuation
    marks.
  • The citation can only start with AUTHOR or
    EDITOR.
  • The words "pp." and "pages" correspond to PAGE.
  • Four-digit numbers of the form 19xx or 20xx are DATE.
  • Quotations can appear only in TITLE.
  • ...

Easy-to-express pieces of knowledge.
Non-propositional; may use quantifiers.
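A few of the constraints listed on this slide can be written directly as checks over a labeled token sequence. The following is a minimal sketch, not the talk's implementation: the function name `violations`, the constraint tags it returns, the punctuation set, and the `bad` labeling are all hypothetical, and "transition on punctuation" is read here as "the token just before a label change is a punctuation mark".

```python
import re

PUNCT = {".", ",", ";", ":"}  # assumed punctuation set

def violations(tokens, labels):
    """Return the names of the constraints violated by one labeled citation."""
    found = []
    # The citation can only start with AUTHOR or EDITOR.
    if labels[0] not in ("AUTHOR", "EDITOR"):
        found.append("start")
    # Each field is one consecutive block: a label may not reappear once it has ended.
    closed = set()
    for prev, cur in zip(labels, labels[1:]):
        if cur != prev:
            closed.add(prev)
            if cur in closed:
                found.append("consecutive")
                break
    # State transitions must occur on punctuation marks.
    for i in range(1, len(labels)):
        if labels[i] != labels[i - 1] and tokens[i - 1] not in PUNCT:
            found.append("transition")
            break
    # Four digits of the form 19xx or 20xx are DATE.
    for tok, lab in zip(tokens, labels):
        if re.fullmatch(r"(19|20)\d\d", tok) and lab != "DATE":
            found.append("date")
            break
    return found

tokens = ("Lars Ole Andersen . Program analysis and specialization for the "
          "C Programming language . PhD thesis . DIKU , University of "
          "Copenhagen , May 1994 .").split()
good = (["AUTHOR"] * 4 + ["TITLE"] * 10 + ["TECH-REPORT"] * 3
        + ["INSTITUTION"] * 6 + ["DATE"] * 3)
bad = list(good)
bad[5] = "AUTHOR"  # hypothetical error: "analysis" labeled AUTHOR inside the title

print(violations(tokens, good))  # no violations
print(violations(tokens, bad))   # the field is split and changes label mid-phrase
```

A checker like this only detects violations; the talk's point is that the same knowledge can instead be enforced during inference, so that the best output satisfying the constraints is returned.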
13
Information Extraction with Constraints
  • Adding constraints, we get correct results!
  • Without changing the model
  • AUTHOR: Lars Ole Andersen .
  • TITLE: Program analysis and specialization for
    the C Programming language .
  • TECH-REPORT: PhD thesis .
  • INSTITUTION: DIKU , University of Copenhagen ,
  • DATE: May 1994 .

14
Problem Setting
  • Random variables: Y
  • Conditional distributions: P (learned by
    models/classifiers)
  • Constraints: C, any Boolean function defined
    over partial assignments (possibly with weights W)
  • Goal: find the best assignment
  • The assignment that achieves the highest global
    performance.
  • This is an Integer Programming problem

[Figure: output variables (e.g., y7) above the observations]
$Y^* = \arg\max_Y \; P \cdot Y$ subject to
constraints $C$
15
Formal Model
[Formula not preserved in this transcript; its
annotations read "Subject to constraints" and
"(Soft) constraints component".]
How to solve? This is an Integer Linear
Program. Solving it with ILP packages gives an
exact solution. Search techniques are also
possible.
How to train? How to decompose the global
objective function? Should we incorporate
constraints in the learning process?
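The objective on this slide did not survive in the transcript. For orientation only, one form in which a constrained conditional model objective is commonly written in the literature is sketched below. The symbols are assumptions, not recovered from the slide: $\phi_i$ are feature functions with learned weights $w_i$, $C_k$ are constraints, $d_{C_k}$ measures how far an assignment is from satisfying $C_k$, and $\rho_k$ is the penalty for violating it.

```latex
y^{*} \;=\; \arg\max_{y} \;
  \underbrace{\sum_{i} w_i \, \phi_i(x, y)}_{\text{learned model}}
  \;-\;
  \underbrace{\sum_{k} \rho_k \, d_{C_k}(x, y)}_{\text{(soft) constraints component}}
```

Under this reading, a hard constraint corresponds to an infinite penalty $\rho_k$, which is equivalent to maximizing the first term subject to $C_k$.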
16
Example: Semantic Role Labeling
Who did what to whom, when, where, why, ...
  • I left my pearls to my daughter in my will .
  • [I]A0 left [my pearls]A1 [to my daughter]A2 [in
    my will]AM-LOC .
  • A0: Leaver
  • A1: Things left
  • A2: Benefactor
  • AM-LOC: Location
  • I left my pearls to my daughter in my will .

Special case (structured output problem): here,
all the data is available at one time. In
general, classifiers might be learned from
different sources, at different times, in
different contexts. This has implications for
training paradigms.
Constraints: no overlapping arguments; if A2 is
present, A1 must also be present.
17
Semantic Role Labeling (2/2)
  • PropBank [Palmer et al. 05] provides a large
    human-annotated corpus of semantic verb-argument
    relations.
  • It adds a layer of generic semantic labels to
    Penn TreeBank II.
  • (Almost) all the labels are on the constituents
    of the parse trees.
  • Core arguments: A0-A5 and AA
  • Different semantics for each verb
  • Specified in the PropBank frame files
  • 13 types of adjuncts, labeled as AM-arg
  • where arg specifies the adjunct type

18
Algorithmic Approach
Identify Vocabulary
candidate arguments
  • Identify argument candidates
  • Pruning [Xue & Palmer, EMNLP 04]
  • Argument Identifier
  • Binary classification (SNoW)
  • Classify argument candidates
  • Argument Classifier
  • Multi-class classification (SNoW)
  • Inference
  • Use the estimated probability distribution given
    by the argument classifier
  • Use structural and linguistic constraints
  • Infer the optimal global output

EASY
but could be expensive
Inference over (old and new) Vocabulary
I left my nice pearls to her
19
Inference
I left my nice pearls to her
  • The output of the argument classifier often
    violates some constraints, especially when the
    sentence is long.
  • Finding the best legitimate output is formalized
    as an optimization problem and solved via Integer
    Linear Programming [Punyakanok et al. 04;
    Roth & Yih 04, 05, 07].
  • Input
  • The probability estimation (by the argument
    classifier)
  • Structural and linguistic constraints
  • Allows incorporating expressive (non-sequential)
    constraints on the variables (the argument
    types).

20
Integer Linear Programming Inference
  • For each argument $a_i$
  • Set up a Boolean variable $a_{i,t}$ indicating
    whether $a_i$ is classified as $t$
  • Goal is to maximize
  • $\sum_{i,t} \mathrm{score}(a_i = t)\, a_{i,t}$
  • Subject to the (linear) constraints
  • If $\mathrm{score}(a_i = t) = P(a_i = t)$, the objective is
    to find the assignment that maximizes the
    expected number of arguments that are correct and
    satisfies the constraints.

The Constrained Conditional Model is completely
decomposed during training
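To make the formulation above concrete, here is a minimal sketch of the same 0-1 program on a toy instance. It is an illustration under stated assumptions rather than the system from the talk: the probabilities are made up, the helper `solve_01_program` is hypothetical, and exhaustive enumeration stands in for a real ILP package such as the Xpress-MP solver mentioned in the acknowledgments.

```python
from itertools import product

def solve_01_program(scores, constraints):
    """Maximize sum_k scores[k] * x[k] over x in {0,1}^n, subject to linear
    constraints given as (coeffs, rhs), each meaning coeffs . x <= rhs.
    Brute force: only suitable for tiny illustrative instances."""
    best_x, best_val = None, float("-inf")
    for x in product((0, 1), repeat=len(scores)):
        feasible = all(sum(c * v for c, v in zip(coeffs, x)) <= rhs
                       for coeffs, rhs in constraints)
        if feasible:
            val = sum(s * v for s, v in zip(scores, x))
            if val > best_val:
                best_x, best_val = x, val
    return best_x, best_val

LABELS = ("A0", "A1", "NULL")
# Made-up classifier outputs P(a_i = t) for two argument candidates.
P = [
    {"A0": 0.5, "A1": 0.3, "NULL": 0.2},
    {"A0": 0.6, "A1": 0.3, "NULL": 0.1},
]
n, k = len(P), len(LABELS)
# Indicator variable x[i*k + j] is 1 iff candidate i takes label LABELS[j].
scores = [P[i][t] for i in range(n) for t in LABELS]

constraints = []
for i in range(n):
    row = [1 if j // k == i else 0 for j in range(n * k)]
    constraints.append((row, 1))                 # sum_t x[i,t] <= 1
    constraints.append(([-c for c in row], -1))  # sum_t x[i,t] >= 1
a0 = [1 if LABELS[j % k] == "A0" else 0 for j in range(n * k)]
constraints.append((a0, 1))                      # no duplicate A0

x, value = solve_01_program(scores, constraints)
decoded = [LABELS[j % k] for j in range(n * k) if x[j] == 1]
print(decoded, round(value, 3))
```

Taken independently, both candidates would be labeled A0; the "no duplicate A0" constraint forces the solver to give A0 to the candidate with the higher A0 probability and relabel the other.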
21
Constraints
Any Boolean rule can be encoded as a linear
constraint.
  • No duplicate argument classes
  • $\sum_{a \in \mathrm{POTARG}} x_{\{a = \mathrm{A0}\}} \le 1$
  • R-ARG: if there is an R-ARG phrase, there is an ARG
    phrase
  • $\forall a_2 \in \mathrm{POTARG}:\ \sum_{a \in \mathrm{POTARG}} x_{\{a = \mathrm{A0}\}} \ge x_{\{a_2 = \text{R-A0}\}}$
  • C-ARG: if there is a C-ARG phrase, there is an ARG
    before it
  • $\forall a_2 \in \mathrm{POTARG}:\ \sum_{a \in \mathrm{POTARG},\ a \text{ is before } a_2} x_{\{a = \mathrm{A0}\}} \ge x_{\{a_2 = \text{C-A0}\}}$
  • Many other possible constraints
  • Unique labels
  • No overlapping or embedding
  • Relations between number of arguments; order
    constraints
  • If verb is of type A, no argument of type B

These are universally quantified rules.
LBJ allows a developer to encode constraints in
FOL; these are compiled into linear inequalities
automatically.
Joint inference can also be used to combine
different SRL systems.
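The claim that a Boolean rule can be encoded as linear inequalities can be checked exhaustively on a small case. The sketch below takes the R-ARG rule from this slide ("if there is an R-A0 phrase, there is an A0 phrase") and verifies that the Boolean rule and its linear encoding agree on every labeling of three candidates. The reduced label set and the function names are assumptions made for illustration; this is not how LBJ compiles constraints.

```python
from itertools import product

LABELS = ("A0", "R-A0", "NULL")  # reduced label set, for illustration only

def rule_holds(assignment):
    """Boolean rule: if some candidate is labeled R-A0, some candidate is labeled A0."""
    return ("R-A0" not in assignment) or ("A0" in assignment)

def linear_holds(assignment):
    """Linear encoding: for every a2, sum_a x[a = A0] >= x[a2 = R-A0]."""
    # x[(i, t)] is the 0/1 indicator that candidate i takes label t.
    x = {(i, t): int(lab == t) for i, lab in enumerate(assignment) for t in LABELS}
    n = len(assignment)
    total_a0 = sum(x[(a, "A0")] for a in range(n))
    return all(total_a0 >= x[(a2, "R-A0")] for a2 in range(n))

# Compare the two formulations on all 3^3 labelings of three candidates.
agree = all(rule_holds(y) == linear_holds(y) for y in product(LABELS, repeat=3))
print(agree)
```

Note that the universally quantified rule turns into one inequality per candidate a2, which is why the number of constraints grows with the number of candidates in the sentence.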
22
Semantic Role Labeling
Screenshot from a CCG demo: http://L2R.cs.uiuc.edu/cogcomp
Semantic parsing reveals several relations in
the sentence along with their arguments.
This approach produces a very good semantic
parser: F1 ~90%. Easy and fast: 7 sentences/sec
(using Xpress-MP).
Top-ranked system in the CoNLL 05 shared task. The
key difference is the inference.
23
Outline
  • Constrained Conditional Models
  • Motivation
  • Examples
  • A tutorial on the computational processes
    involved
  • Classification
  • Inference
  • Pipeline
  • Scale of problems
  • A quick peek into some current research
    questions
  • Training Paradigms: investigate ways for
    training models and combining constraints
  • Joint Learning and Inference vs. decoupling
    Learning & Inference
  • Guiding Semi-Supervised Learning with Constraints
  • Features vs. Constraints
  • Hard and Soft Constraints

24
Classification Ambiguity Resolution
  • Illinois bored of education → board
  • Nissan car and truck plant; plant and
    animal kingdom
  • (This Art) (can N) (will MD) (rust V):
    V, N, N
  • The dog bit the kid. He was taken to a
    veterinarian / a hospital
  • Tiger was in Washington for the PGA Tour
  • → Finance; Banking; World News; Sports
  • Important or not important; love or hate

We don't know how to program these decisions
explicitly. We resort to statistical machine
learning methods to learn these mappings.
25
Classification
  • The goal is to learn a function f: X → Y that maps
    observations in a domain to one of several
    categories.
  • Task: decide which of {board, bored} is more
    likely in the given context
  • X: some representation of
  • "The Illinois _______ of education met yesterday"
  • Y: {board, bored}
  • Typical learning protocol
  • Observe a collection of labeled examples
    (x, y) ∈ X × Y
  • Use it to learn a function f: X → Y that is
    consistent with the observed examples and
    (hopefully) performs well on new, previously
    unobserved examples.

26
Classification (cont.)
  • The goal is to learn a function f: X → Y that maps
    observations in a domain to one of several
    categories.
  • Assume (for a minute) Y = {-1, 1}; then
  • The learning problem is to find
  • a function that
  • best separates the data
  • What function?
  • Expressivity: is linear good enough?
  • There could be multiple options
  • How to find it?

Linear: X = data representation; w = the
classifier; Y = sgn(w^T X)
27
Not all functions are linear (XOR)
  • (x1 ∧ x2) ∨ (¬x1 ∧ ¬x2)
  • In general: a parity function.
  • xi ∈ {0, 1}
  • f(x1, x2, ..., xn) = 1
  • iff Σ xi is even
  • This function is not
  • linearly separable.

28
Functions can be made linear
y3 ∨ y4 ∨ y7: the new discriminator is functionally
simpler
  • x1 x2 x4 ∨ x2 x4 x5 ∨ x1 x3 x7
  • Space: X = x1, x2, ..., xn
  • Input transformation →
    new space: Y = {y1, y2, ...} = {xi, xi xj, xi xj xk, ...}

An important outcome of this view is that we deal
with high dimensionality. 10^6 features is common.
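A small sketch of this idea on the two-variable parity function: no linear threshold over (x1, x2) classifies it correctly, but adding the conjunction x1·x2 as a feature makes it linear. The grid search below is only an illustration (it checks a finite set of weights, it is not a proof), and the particular weights used in the lifted space are one hand-picked choice.

```python
from itertools import product

# Even parity on two bits: label +1 iff x1 + x2 is even.
DATA = [((0, 0), 1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]

def separates(w, b, phi):
    """True iff sign(w . phi(x) + b) matches the label on every example."""
    return all((sum(wi * fi for wi, fi in zip(w, phi(x))) + b > 0) == (y == 1)
               for x, y in DATA)

def identity(x):
    return x

def with_conjunction(x):
    # Input transformation: add the product feature x1 * x2.
    return (x[0], x[1], x[0] * x[1])

# Search a coarse grid of linear separators in the original space.
grid = [v / 2 for v in range(-6, 7)]  # -3.0, -2.5, ..., 3.0
linear_found = any(separates((w1, w2), b, identity)
                   for w1, w2, b in product(grid, repeat=3))

# In the lifted space, -x1 - x2 + 2*x1*x2 + 0.5 > 0 exactly on the even-parity inputs.
lifted_ok = separates((-1, -1, 2), 0.5, with_conjunction)
print(linear_found, lifted_ok)
```

The price of the simpler discriminator is the larger feature space, which is where the high dimensionality mentioned on the slide comes from.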
29
Feature Space
  • Data are not separable in one dimension
  • Not separable if you insist on using a specific
    class of functions

[Figure: data points plotted along the x axis]
30
Blown up Feature Space
  • Data are separable in <x, x^2> space

Key issue: what features to use.
Computationally, the mapping can be done explicitly
or implicitly (kernels), but there are tradeoffs.
[Figure: the same data plotted in the (x, x^2) plane]
  • Expressivity: is linear good enough?
  • There could be multiple options
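The explicit-versus-implicit remark can be illustrated with a degree-2 polynomial kernel, chosen here as an example (the slide does not name a specific kernel). Computing the kernel on the original two-dimensional inputs gives the same number as mapping both inputs into the expanded feature space and taking a dot product there.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit feature map for the homogeneous degree-2 polynomial kernel in 2D."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, z):
    """Implicit computation: (x . z)^2, without building the expanded features."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(z))   # dot product in the blown-up space
implicit = poly_kernel(x, z)     # same value, computed in the original space
print(explicit, implicit)
```

The tradeoff mentioned on the slide shows up directly: the explicit map costs memory and time proportional to the size of the expanded space, while the implicit form needs a kernel evaluation against stored examples at prediction time.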

31
Regularization
  • Among several options that are consistent with
    the data at the same level, choose the simplest
    one.
  • Choosing to use linear functions is in itself a
    form of regularization
  • There are many ways to regularize, and there is a
    good understanding of how regularization drives
    better generalization.

32
Learning Classifiers
  • Overall, the problem we solve is the following
    optimization problem
  • Find the weight vector w such that
  • $w^* = \arg\min_w \; \mathrm{EmpiricalLoss}(w) + \mathrm{RegularizationTerm}(w)$
  • Loss could be
  • $\sum_i [\mathrm{sgn}(w^T x_i) \ne y_i]$:
    number of mistakes
  • $\sum_i \max(0, 1 - y_i w^T x_i)$ or $\sum_i \max(0, 1 - y_i w^T x_i)^2$:
    L1 or L2 losses
  • $\sum_i \log(1 + \exp(-y_i w^T x_i))$:
    probabilistic interpretation

  • Regularization term could be
  • $\|w\|_2$ (maximizing the margin) or
    $\|w\|_1$ (sparse solution).
  • Algorithmically, many solutions exist
  • Perceptron family (Perceptron, Winnow, Averaged
    Perceptron) -- online
  • (Stochastic) Gradient Descent -- could be online
  • SVMs -- batch
  • It is well understood today that all converge to
    about the same solution but differ in
    generalization properties, number of examples
    required, and efficiency.
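The losses listed on this slide and the simplest member of the Perceptron family can be sketched in a few lines. This is an illustrative sketch only: the toy data set is made up, the last input coordinate is a constant 1 acting as a bias term, and no regularization term is included.

```python
import math

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def zero_one(w, data):
    """Number of mistakes: count of examples with sgn(w.x) != y."""
    return sum(1 for x, y in data if (1 if dot(w, x) > 0 else -1) != y)

def hinge(w, data):
    """L1 (hinge) loss: sum_i max(0, 1 - y_i w.x_i)."""
    return sum(max(0.0, 1 - y * dot(w, x)) for x, y in data)

def squared_hinge(w, data):
    """L2 (squared hinge) loss."""
    return sum(max(0.0, 1 - y * dot(w, x)) ** 2 for x, y in data)

def logistic(w, data):
    """Logistic loss: sum_i log(1 + exp(-y_i w.x_i))."""
    return sum(math.log(1 + math.exp(-y * dot(w, x))) for x, y in data)

def perceptron(data, epochs=10):
    """Online mistake-driven updates: w <- w + y x on each mistake."""
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        mistakes = 0
        for x, y in data:
            if y * dot(w, x) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:  # converged on the training data
            break
    return w

# Made-up separable data; the third coordinate is a constant bias feature.
DATA = [((2.0, 2.0, 1.0), 1), ((1.0, 3.0, 1.0), 1),
        ((-1.0, -1.0, 1.0), -1), ((-2.0, 0.0, 1.0), -1)]
w = perceptron(DATA)
print(w, zero_one(w, DATA), hinge(w, DATA), logistic(w, DATA))
```

On this data the learned vector makes no mistakes and has zero hinge loss, while the logistic loss stays strictly positive, which is one way to see that the losses rank the same solution differently.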

33
Classification is Well Understood
  • Theoretically: generalization bounds
  • How many examples does one need to see in order to
    guarantee good behavior on previously unobserved
    examples?
  • Algorithmically: good learning algorithms for
    linear representations.
  • Can deal with very high dimensionality (10^6
    features)
  • Very efficient in terms of computation and of
    examples. On-line.
  • Parallelization: data parallelism is easy at
    evaluation time, not at training time.
  • Key issues remaining
  • Learning protocols: how to minimize interaction
    (supervision); how to map domain/task information
    to supervision; semi-supervised learning; active
    learning; ranking.
  • What are the features? No good theoretical
    understanding here.

34
Feature Extraction (1)
  • The process of feature extraction dominates the
    computational time both in training and at
    evaluation time.
  • NLP Structures
  • 1. Context Sensitive Disambiguation
  • I did not know ___ to laugh or cry
  • Feature representation
  • know ___; ___ to; ___ to laugh; ___ to Verb;
    Verb ___ to; ...
  • ...
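The feature representation for the blank in "I did not know ___ to laugh or cry" can be sketched as a small extractor over the neighboring words and part-of-speech tags. The function name, the feature templates, and the tag names below are assumptions chosen to reproduce the five features listed on the slide; they are not the talk's actual feature extractor.

```python
BLANK = "___"

def context_features(tokens, pos_tags, i):
    """Features for the blank at position i, built from nearby words and POS tags."""
    feats = []
    if i > 0:
        feats.append(f"{tokens[i - 1]} {BLANK}")                 # word before
    if i + 1 < len(tokens):
        feats.append(f"{BLANK} {tokens[i + 1]}")                 # word after
    if i + 2 < len(tokens):
        feats.append(f"{BLANK} {tokens[i + 1]} {tokens[i + 2]}")   # two words after
        feats.append(f"{BLANK} {tokens[i + 1]} {pos_tags[i + 2]}") # word, then POS
    if i > 0 and i + 1 < len(tokens):
        feats.append(f"{pos_tags[i - 1]} {BLANK} {tokens[i + 1]}") # POS before, word after
    return feats

TOKENS = "I did not know ___ to laugh or cry".split()
# Assumed coarse POS tags; the tag for the blank itself is a placeholder.
POS = ["Pron", "Verb", "Adv", "Verb", "_", "TO", "Verb", "Conj", "Verb"]

for feature in context_features(TOKENS, POS, TOKENS.index(BLANK)):
    print(feature)
```

Each feature string would become one dimension of a sparse binary vector, which is how a single sentence position ends up in the very high-dimensional spaces discussed earlier.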

35
Feature Extraction (2)
  • The process of feature extraction dominates the
    computational time both in training and at
    evaluation time.
  • NLP Structures
  • 2. Sequential Problems (e.g. Named Entity
    Tagging)
  • Features

36
Feature Extraction (3)
  • The process of feature extraction dominates the
    computational time both in training and at
    evaluation time.
  • NLP Structures
  • 3. Tree Representation
  • Features: how many times (or whether) you see a
    given sub-structure in x? [sub-structure image
    not preserved]

37
Feature Extraction
  • The process of feature extraction dominates the
    computational time both in training and at
    evaluation time.
  • 1. know ___; ___ to; ___ to laugh; ___ to Verb;
    Verb ___ to; ...
  • 2.
  • 3.
  • Feature extraction is expensive
  • Many features (lexical)
  • (Small) sub-graph isomorphism

38
Algorithmic Approach
Identify Vocabulary
candidate arguments
  • Identify argument candidates
  • Pruning [Xue & Palmer, EMNLP 04]
  • Argument Identifier
  • Binary classification (SNoW)
  • Classify argument candidates
  • Argument Classifier
  • Multi-class classification (SNoW)
  • Inference
  • Use the estimated probability distribution given
    by the argument classifier
  • Use structural and linguistic constraints
  • Infer the optimal global output

EASY
but could be expensive
Inference over (old and new) Vocabulary
I left my nice pearls to her
39
Pipeline
  • Most problems are not single classification
    problems

[Figure: pipeline stages: Raw Data, POS Tagging,
Phrases, Semantic Entities, Relations, Parsing,
WSD, Semantic Role Labeling]
  • Conceptually
  • Pipelining is a crude approximation: interactions
    occur across levels, and downstream decisions
    often interact with previous decisions.
  • Leads to propagation of errors
  • Occasionally, later-stage problems are easier, but
    upstream mistakes will not be corrected.
  • Interesting research questions here.
  • Computationally
  • People in NLP are not thinking about how to do it
    right; a lot of room for improvement here.

40
Inference with Constraints
Any Boolean rule can be encoded as a linear
constraint.
  • No duplicate argument classes
  • $\sum_{a \in \mathrm{POTARG}} x_{\{a = \mathrm{A0}\}} \le 1$
  • R-ARG: if there is an R-ARG phrase, there is an ARG
    phrase
  • $\forall a_2 \in \mathrm{POTARG}:\ \sum_{a \in \mathrm{POTARG}} x_{\{a = \mathrm{A0}\}} \ge x_{\{a_2 = \text{R-A0}\}}$
  • C-ARG: if there is a C-ARG phrase, there is an ARG
    before it
  • $\forall a_2 \in \mathrm{POTARG}:\ \sum_{a \in \mathrm{POTARG},\ a \text{ is before } a_2} x_{\{a = \mathrm{A0}\}} \ge x_{\{a_2 = \text{C-A0}\}}$
  • Many other possible constraints
  • Unique labels
  • No overlapping or embedding
  • Relations between number of arguments; order
    constraints
  • If verb is of type A, no argument of type B

LBJ allows a developer to encode constraints in
FOL; these are compiled into linear inequalities
automatically.
Computation is dominated by the # of variables and
the # of constraints: typically, up to a few hundred.
Inference is done per instance (sentence).
Sparsity often makes the computation feasible.
41
Outline
  • Constrained Conditional Models
  • Motivation
  • Examples
  • Training Paradigms: investigate ways for
    training models and combining constraints
  • Joint Learning and Inference vs. decoupling
    Learning & Inference
  • Guiding Semi-Supervised Learning with Constraints
  • Features vs. Constraints
  • Hard and Soft Constraints
  • Examples
  • Semantic Parsing
  • Information Extraction
  • Pipeline processes

42
Textual Entailment
Phrasal verb paraphrasing [Connor & Roth 07]
Semantic Role Labeling [Punyakanok et al. 05, 08]
Entity matching [Li et al., AAAI 04, NAACL 04]
Inference for Entailment [Braz et al. 05, 07]
Is it true that ...? (Textual Entailment)
Text: Eyeing the huge market potential, currently
led by Google, Yahoo took over search company
Overture Services Inc. last year.
Candidate hypotheses:
Yahoo acquired Overture
Overture is a search company
Google is a search company
Google owns Overture
...
43
Training Paradigms that Support Global Inference
  • Algorithmic approach: incorporating general
    constraints
  • Allow both statistical and expressive declarative
    constraints [ICML 05]
  • Allow non-sequential constraints (generally
    difficult) [CoNLL 04]
  • Coupling vs. decoupling training and inference.
  • Incorporating global constraints is important, but
  • Should it be done only at evaluation time or also
    at training time?
  • How to decompose the objective function and train
    in parts?
  • Issues related to
  • Modularity, efficiency and performance,
    availability of training data
  • Problem-specific considerations

44
Training in the presence of Constraints
  • General training paradigm
  • First term: learning from data (could be further
    decomposed)
  • Second term: guiding the model by constraints
  • Can choose whether the constraints' weights are
    trained, when and how, or whether constraints are
    taken into account only at evaluation time.

Decompose model (SRL case)
Decompose model from constraints
45
LI: Learning plus Inference
Cartoon: each model can be more complex and may
have a view on a set of output variables.
Training: without constraints. Testing: inference
with constraints.
IBT: Inference-based Training
Learning the components together!
[Figure: input X and output variables Y]
46
Perceptron-based Global Learning
[Figure: local models f1(x), ..., f5(x) connecting
the input X to the output variables Y]
A lot more inference. Which one is better? When
and why?
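The difference between the two paradigms can be sketched with a perceptron whose only switch is whether constrained inference is run inside the training loop. This is a toy illustration under stated assumptions, not the algorithm from the cited work: the data, the "all labels different" constraint, the brute-force inference, and the function names are all made up for the example.

```python
from itertools import product

LABELS = ("A", "B")

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def all_different(ys):
    """Toy global constraint: no label may be used twice."""
    return len(set(ys)) == len(ys)

def infer(W, xs, constraint=None):
    """Best-scoring joint labeling; with a constraint, only feasible ones are considered."""
    best, best_score = None, float("-inf")
    for ys in product(LABELS, repeat=len(xs)):
        if constraint is not None and not constraint(ys):
            continue
        score = sum(dot(W[y], x) for x, y in zip(xs, ys))
        if score > best_score:
            best, best_score = ys, score
    return best

def train(examples, dim, use_inference, epochs=5):
    """use_inference=False: local training, constraints ignored (the LI setting).
    use_inference=True: predictions during training come from constrained
    inference, and updates are made only where that output is wrong (IBT)."""
    W = {label: [0.0] * dim for label in LABELS}
    for _ in range(epochs):
        for xs, gold in examples:
            pred = infer(W, xs, all_different if use_inference else None)
            for x, y, p in zip(xs, gold, pred):
                if p != y:
                    W[y] = [w + xi for w, xi in zip(W[y], x)]  # promote gold label
                    W[p] = [w - xi for w, xi in zip(W[p], x)]  # demote predicted label
    return W

EXAMPLES = [
    ([(1.0, 0.0), (0.0, 1.0)], ("A", "B")),
    ([(0.0, 1.0), (1.0, 0.0)], ("B", "A")),
]
W_li = train(EXAMPLES, dim=2, use_inference=False)
W_ibt = train(EXAMPLES, dim=2, use_inference=True)
# In both settings, constraints are applied at test time.
li_out = [infer(W_li, xs, all_different) for xs, _ in EXAMPLES]
ibt_out = [infer(W_ibt, xs, all_different) for xs, _ in EXAMPLES]
print(li_out, ibt_out)
```

On this easy toy problem both settings recover the gold labelings. The IBT loop calls the (expensive) inference procedure on every training example, which is the "a lot more inference" cost noted on the slide.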
47
Claims
  • When the local models are easy to learn, LI
    outperforms IBT.
  • In many applications, the components are
    identifiable and easy to learn (e.g., argument,
    open-close, PER).
  • Only when the local problems become difficult to
    solve in isolation does IBT outperform LI, but it
    needs a larger number of training examples.
  • Other training paradigms are possible
  • Pipeline-like sequential models [Roth, Small,
    Titov: AISTATS 09]
  • Identify a preferred ordering among components
  • Learn the k-th model jointly with previously learned
    models

LI: cheaper computationally; modular. IBT is
better in the limit, and in other extreme cases.
48
Bound Prediction
LI vs. IBT: the more identifiable the individual
problems are, the better the overall performance is
with LI
  • Local $\le \epsilon_{opt} + \left( (d \log m + \log 1/\delta) / m \right)^{1/2}$
  • Global $\le 0 + \left( (cd \log m + c^2 d + \log 1/\delta) / m \right)^{1/2}$

Indication for hardness of problem
49
Relative Merits: SRL
In some cases problems are hard due to lack of
training data. Semi-supervised learning (not
today).
[Figure: x axis: difficulty of the learning problem
(# features), from easy to hard]
50
Conclusion
  • Constrained Conditional Models combine
  • Learning conditional models with using
    declarative expressive constraints
  • Within a constrained optimization framework
  • Use constraints! The framework supports
  • A clean way of incorporating constraints to bias
    and improve decisions of supervised learning
    models
  • Significant success on several NLP and IE tasks
    (often, with ILP)
  • A clean way to use (declarative) prior knowledge
    to guide semi-supervised learning
  • A tutorial on computational issues in machine
    learning (and inference)

LBJ (Learning Based Java): http://L2R.cs.uiuc.edu/cogcomp
A modeling language for Constrained
Conditional Models. Supports programming along
with building learned models, high-level
specification of constraints, and inference with
constraints.