Title: Decision List
1. Decision List
2. Outline
- Basic concepts and properties
- Case study
3. Definitions
- A decision list (DL) is an ordered list of conjunctive rules.
- Rules can overlap, so the order is important.
- A decision list determines an example's class by using the first matched rule.
4. An example
- A simple DL over x = (f1, f2, f3)
  - If f1=v11 ∧ f2=v21 then c1
  - If f2=v21 ∧ f3=v34 then c2
- Classify the example (v11, v21, v34)
  - → c1 or c2? (Both rules match, but the first matched rule wins, so c1; see the sketch below.)
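A minimal Python sketch of first-match classification for the toy DL above; the dictionary-based rule encoding is an illustrative choice, not part of the slides.

```python
def classify(example, decision_list, default="unknown"):
    """Return the class of the first rule whose conditions all hold."""
    for conditions, label in decision_list:
        if all(example.get(feat) == val for feat, val in conditions.items()):
            return label
    return default

dl = [
    ({"f1": "v11", "f2": "v21"}, "c1"),   # rule 1: if f1=v11 and f2=v21 then c1
    ({"f2": "v21", "f3": "v34"}, "c2"),   # rule 2: if f2=v21 and f3=v34 then c2
]

print(classify({"f1": "v11", "f2": "v21", "f3": "v34"}, dl))  # -> c1 (first match wins)
```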
5. Decision list
- A decision list is a list of pairs
  - (t1, v1), ..., (tr, vr)
  - ti are terms, and tr = true.
- A term in this context is a conjunction of literals
  - f1=v11 is a literal.
  - f1=v11 ∧ f2=v21 is a term.
6. How to build a decision list
- Decision tree → decision list
- Greedy, iterative algorithm that builds DLs directly.
7. Decision tree → decision list
- [Figure: a decision tree on Income; the "low" branch leads to "Nothing", the "high" branch leads to "Respond".]
8. The greedy algorithm
- RuleList ← {}, E ← training_data
- Repeat until E is empty or the gain is small:
  - t ← Find_best_term(E)
  - Let E' be the examples covered by t
  - Let c be the most common class in E'
  - Add (t, c) to RuleList
  - E ← E − E'
- (A Python sketch of this loop follows below.)
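A hedged Python sketch of the greedy loop above; `find_best_term` and `min_gain` are placeholder names standing in for whichever term-scoring routine and stopping threshold are used.

```python
from collections import Counter

def learn_decision_list(examples, find_best_term, min_gain=0.0):
    """Greedy DL learner following the pseudocode above.

    `examples` is a list of (features, label) pairs; `find_best_term` is any
    scoring routine that returns (term, gain), where a term is a predicate
    over the feature dict.  Both names are placeholders for this sketch.
    """
    rule_list = []
    remaining = list(examples)
    while remaining:
        term, gain = find_best_term(remaining)
        if term is None or gain <= min_gain:
            break
        covered = [(x, y) for x, y in remaining if term(x)]
        if not covered:
            break
        # c = most common class among the covered examples E'
        c = Counter(y for _, y in covered).most_common(1)[0][0]
        rule_list.append((term, c))
        remaining = [(x, y) for x, y in remaining if not term(x)]  # E <- E - E'
    return rule_list
```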
9. Problems with the greedy algorithm
- The interpretation of a rule depends on the preceding rules.
- Each iteration reduces the number of training examples.
- Poor rule choices at the beginning of the list can significantly reduce the accuracy of the learned DL.
- → Several papers propose alternative algorithms.
10. Algorithms for building DLs
- AQ algorithm (Michalski, 1969)
- CN2 algorithm (Clark and Niblett, 1989)
- Segal and Etzioni (1994)
- Goodman (2002)
11. Probabilistic DL
- DL: a rule is (t, v)
- Probabilistic DL: a rule is
  - (t, c1/p1, c2/p2, ..., cn/pn)
- (See the sketch below.)
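A small sketch of the probabilistic-DL idea, assuming rules are stored as (conditions, class distribution) pairs; the representation is illustrative only.

```python
def pdl_distribution(example, prob_rules, default=None):
    """A probabilistic DL rule maps a term to a class distribution; the first
    matching rule's distribution is returned (a sketch of the idea above)."""
    for conditions, dist in prob_rules:
        if all(example.get(f) == v for f, v in conditions.items()):
            return dist
    return default

pdl = [
    ({"A": True, "B": True}, {"c1": 0.8, "c2": 0.2}),  # if A and B then (c1, 0.8) (c2, 0.2)
    ({}, {"c1": 0.5, "c2": 0.5}),                      # default rule (term = true)
]
print(pdl_distribution({"A": True, "B": True}, pdl))    # -> {'c1': 0.8, 'c2': 0.2}
```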
12. Case study (Yarowsky, 1994)
13. Case study: accent restoration
- Task: restore accents in Spanish and French text
  - → A special case of WSD (word sense disambiguation)
- Ex: ambiguous de-accented forms
  - cesse → cesse, cessé
  - cote → côté, côte, cote, coté
- Algorithm: build a DL for each ambiguous de-accented form, e.g., one for cesse, another one for cote
- Attributes: words within a window
14. The algorithm
- Training
  - Find the list of de-accented forms that are ambiguous.
  - For each ambiguous form, build a decision list.
- Testing: check each word in a sentence
  - if it is ambiguous,
  - then restore the accented form according to the DL.
15. Algorithm for building DLs
- Select feature templates
- Build an attribute-value table
- Find the feature f_t that maximizes the log-likelihood ratio
  - | log( P(c1 | f_t) / P(c2 | f_t) ) |
- Split the data and iterate.
- (A small scoring sketch follows below.)
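A rough sketch of the feature-scoring step, assuming the log-likelihood ratio |log(P(c1|f)/P(c2|f))| shown above with simple add-alpha smoothing; the smoothing scheme is an assumption, not taken from the paper.

```python
import math
from collections import Counter, defaultdict

def rank_features(examples, alpha=0.1):
    """Score each feature by |log P(c1|f) / P(c2|f)|.

    `examples` is a list of (feature_set, label) pairs with two labels.
    `alpha` is an illustrative smoothing constant.
    """
    labels = sorted({y for _, y in examples})
    assert len(labels) == 2, "this sketch assumes a binary problem"
    counts = defaultdict(Counter)             # counts[feature][label]
    for feats, y in examples:
        for f in feats:
            counts[f][y] += 1
    ranked = []
    for f, c in counts.items():
        p1 = (c[labels[0]] + alpha) / (sum(c.values()) + 2 * alpha)
        p2 = 1.0 - p1
        score = abs(math.log(p1 / p2))
        best = labels[0] if p1 > p2 else labels[1]
        ranked.append((score, f, best))
    ranked.sort(reverse=True)                 # highest log-likelihood ratio first
    return ranked
```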
16. In this paper
- Binary classification problem: each form has only two possible accent patterns.
- Each rule tests only one feature.
- Very high baseline: 98.7%
- Notation
  - Accent pattern: the label/target/y
  - Collocation: a feature
17. Step 1: Identify forms that are ambiguous
18. Step 2: Collect training contexts
- Context: the previous three and the next three words.
- Strip the accents from the data. Why?
19. Step 3: Measure collocational distributions
- Feature types are pre-defined.
20. Collocations (a.k.a. features)
21. Step 4: Rank decision rules by log-likelihood
- There are many alternative ranking criteria.
- A rule may test a specific word or a word class.
22. Step 5: Pruning DLs
- Pruning
  - Cross-validation
  - Remove redundant rules: e.g., if a WEEKDAY rule precedes the domingo rule, the domingo rule can never fire and can be removed.
23. Summary of the algorithm
- For a de-accented form w, find all possible accented forms
- Collect training contexts
  - collect k words on each side of w
  - strip the accents from the data
- Measure collocational distributions
  - use pre-defined attribute combinations
  - Ex: -1w, +1w, +2w (the word at position -1, +1, or +2 relative to w)
- Rank decision rules by log-likelihood
- Optional pruning and interpolation
- (A sketch of the context-collection step follows below.)
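A sketch of collecting window features for an ambiguous form, assuming the -1w/+1w/+2w and +/-k-window templates listed above; the exact feature inventory in the paper may differ.

```python
def collocations(tokens, i, k=3):
    """Collect window features for the word at position i.

    The window size and the feature templates used here are illustrative
    assumptions based on the templates listed above.
    """
    feats = set()
    n = len(tokens)
    if i > 0:
        feats.add(("-1w", tokens[i - 1]))
    if i + 1 < n:
        feats.add(("+1w", tokens[i + 1]))
    if i + 2 < n:
        feats.add(("+2w", tokens[i + 2]))
    for j in range(max(0, i - k), min(n, i + k + 1)):
        if j != i:
            feats.add(("win", tokens[j]))   # word anywhere in the +/-k window
    return feats
```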
24. Experiments
- Prior (baseline): choose the most common accented form.
25. Global probabilities vs. residual probabilities
- Two ways to calculate the log-likelihood log P(ci | ft):
  - Global probabilities: computed from the full data set
  - Residual probabilities: computed from the residual training data
    - More relevant, but less data and more expensive to compute.
  - Interpolation: use both (a small sketch follows below)
- In practice, global probabilities work better.
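A minimal sketch of interpolating global and residual estimates of P(c | f); the mixing weight and the dictionary representation are illustrative assumptions, not values from the paper.

```python
def interpolated_prob(f, c, global_est, residual_est, lam=0.7):
    """Interpolate the two estimates of P(c | f) discussed above.

    `global_est` and `residual_est` map (feature, class) to probabilities
    computed from the full data and from the residual data; `lam` is an
    illustrative mixing weight.
    """
    g = global_est.get((f, c), 0.0)
    r = residual_est.get((f, c), 0.0)
    return lam * g + (1.0 - lam) * r
```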
26. Combining vs. not combining evidence
- Each decision is based on a single piece of evidence (i.e., one feature).
  - Run-time efficiency and easy modeling
- It works well, at least for this task, but why?
  - Combining all available evidence rarely produces a different result.
  - The gross exaggeration of probabilities that comes from combining many non-independent log-likelihoods is avoided (cf. Naïve Bayes).
27. Summary of the case study
- It allows a wider context (compared to n-gram methods).
- It allows the use of multiple, highly non-independent evidence types (compared to Bayesian methods).
- A "kitchen-sink approach of the best kind" (at that time).
28. Summary of decision lists
- Rules are easily understood by humans (but remember that the order matters).
- DLs tend to be relatively small, and fast and easy to apply in practice.
- Learning: the greedy algorithm and other improved algorithms
- Extension: probabilistic DL
  - Ex: if A ∧ B then (c1, 0.8), (c2, 0.2)
- DL is related to DT, CNF, DNF, and TBL (see the additional slides).
29. Additional slides
30. Rivest's paper
- It assumes that all attributes (including the goal attribute) are binary.
- It shows that DLs are easily learnable from examples.
31. Assignments and formulae
- Input attributes: x1, ..., xn
- An assignment gives each input attribute a value (1 or 0), e.g., 10001.
- A boolean formula (function) maps each assignment to a value (1 or 0).
32. Assignments and formulae (cont.)
- Two formulae are equivalent if they give the same value for the same input.
- Total number of different formulae: 2^(2^n) (there are 2^n assignments, and each can be mapped to 0 or 1).
- → Classification problem: learn a formula given a partial table.
33. CNF and DNF
- Literal: a variable or its negation
- Term: a conjunction (AND) of literals
- Clause: a disjunction (OR) of literals
- CNF (conjunctive normal form): a conjunction of clauses.
- DNF (disjunctive normal form): a disjunction of terms.
- k-CNF and k-DNF: each clause/term contains at most k literals.
34. A slightly different definition of DT
- A decision tree (DT) is a binary tree where each internal node is labeled with a variable, and each leaf is labeled with 0 or 1.
- k-DT: the depth of the DT is at most k.
- A DT defines a boolean formula: look at the paths whose leaf node is labeled 1.
- An example
35. Decision list
- A decision list is a list of pairs
  - (f1, v1), ..., (fr, vr)
  - fi are terms, and fr = true.
- A decision list defines a boolean function:
  - given an assignment x, DL(x) = vj, where j is the least index s.t. fj(x) = 1.
- (A small evaluation sketch follows below.)
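A small sketch of evaluating DL(x) as defined above, with terms represented as lists of (attribute, value) literals and the final default rule encoded as an empty term (i.e., true).

```python
def dl_value(assignment, rules):
    """DL(x) = v_j for the least j with f_j(x) = 1; the last rule's term is
    the constant true, so some rule always fires.

    `assignment` maps attribute names to 0/1; each term is a list of
    (attribute, required_value) literals.
    """
    for term, v in rules:
        if all(assignment[a] == val for a, val in term):
            return v
    raise ValueError("a well-formed DL ends with the rule (true, v_r)")

# Example: [(x1=1 and x2=0 -> 1), (x3=1 -> 0), (true -> 1)]
rules = [([("x1", 1), ("x2", 0)], 1), ([("x3", 1)], 0), ([], 1)]
print(dl_value({"x1": 0, "x2": 0, "x3": 1}, rules))   # -> 0
```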
36. Relations among different representations
- CNF, DNF, DT, DL
- k-CNF, k-DNF, k-DT, k-DL
- For any k < n, k-DL is a proper superset of the other three.
- Compared to a DT, a DL has a simpler structure, but the complexity of the decisions allowed at each node is greater.
37. k-CNF and k-DNF are proper subsets of k-DL
- k-DNF is a subset of k-DL
  - Each term t of the DNF is converted into a decision rule (t, 1).
  - Ex:
- k-CNF is a subset of k-DL
  - Every k-CNF is the complement of a k-DNF: k-CNF and k-DNF are duals of each other.
  - The complement of a k-DL is also a k-DL.
  - Ex:
- Neither k-CNF nor k-DNF is a subset of the other
  - Ex: 1-DNF
38. k-DT is a proper subset of k-DL
- k-DT is a subset of k-DNF
  - Each leaf labeled with 1 maps to a term in the k-DNF.
- k-DT is a subset of k-CNF
  - Each leaf labeled with 0 maps to a clause in the k-CNF.
- → k-DT is a subset of the intersection of k-DNF and k-CNF (and hence of k-DL).
39. k-DT, k-CNF, k-DNF and k-DL
- [Figure: Venn diagram showing k-DT contained in both k-CNF and k-DNF, with all three contained in k-DL.]
40. Learnability
- Positive examples vs. negative examples of the concept being learned.
  - In some domains, positive examples are easier to collect.
- A sample is a set of examples.
- A boolean function is consistent with a sample if it does not contradict any example in the sample.
41. Two properties of a learning algorithm
- A learning algorithm is economical if it requires few examples to identify the correct concept.
- A learning algorithm is efficient if it requires little computational effort to identify the correct concept.
- → We prefer algorithms that are both economical and efficient.
42. Hypothesis space
- Hypothesis space F: the set of concepts being considered.
- Hopefully, the concept being learned is in the hypothesis space of the learning algorithm.
- The goal of a learning algorithm is to select the right concept from F given the training data.
43. Discrepancy and confidence
- Discrepancy between two functions f and g: the probability, under the example distribution Pn, that f(x) ≠ g(x).
- Ideally, we want the discrepancy ε to be as small as possible.
- To deal with bad luck in drawing examples according to Pn, we define a confidence parameter δ.
44. Polynomially learnable
- A set of Boolean functions is polynomially learnable if there exist an algorithm A and a polynomial function m(n, 1/ε, 1/δ) such that
  - when given a sample of f of size at least m(n, 1/ε, 1/δ) drawn according to Pn, A will, with probability at least 1 − δ, output a function g whose discrepancy from f is at most ε.
  - Furthermore, A's running time is polynomially bounded in n and m.
- k-DL is polynomially learnable.
- (A formal statement of this condition is sketched below.)
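For reference, a standard PAC-style statement of the condition above, written in conventional notation (m, ε, δ, Pn); this follows the usual formulation rather than the paper's exact wording.

```latex
% Given m >= p(n, 1/epsilon, 1/delta) examples of f drawn i.i.d. from P_n,
% the algorithm A outputs g such that, with probability at least 1 - delta,
% the discrepancy between f and g is at most epsilon:
\[
\Pr\Big[\, d_{P_n}(f, g) \le \varepsilon \,\Big] \;\ge\; 1 - \delta,
\qquad
d_{P_n}(f, g) \;=\; \Pr_{x \sim P_n}\big[\, f(x) \ne g(x) \,\big].
\]
```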
45. The algorithm in (Rivest, 1987)
- If the example set S is empty, halt.
- Examine each term of length at most k until a term t is found s.t. all examples in S which make t true are of the same type v.
- Add (t, v) to the decision list and remove those examples from S.
- Repeat the three steps above. (A Python sketch follows below.)
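A hedged Python sketch of Rivest's greedy procedure as summarized above, enumerating terms of size at most k (including the empty term, i.e., true); it assumes the sample is consistent with some k-DL.

```python
from itertools import combinations, product

def rivest_kdl(sample, attributes, k):
    """Greedy k-DL learner following the steps above.

    `sample` is a list of (assignment, label) pairs with boolean labels;
    `attributes` is the list of attribute names.  A term is a conjunction
    of at most k literals.  This is a sketch, not an optimized version.
    """
    def terms_up_to_k():
        for size in range(k + 1):                    # size 0 is the term "true"
            for attrs in combinations(attributes, size):
                for vals in product([0, 1], repeat=size):
                    yield list(zip(attrs, vals))

    def satisfies(assignment, term):
        return all(assignment[a] == v for a, v in term)

    rules, remaining = [], list(sample)
    while remaining:
        for term in terms_up_to_k():
            covered = [(x, y) for x, y in remaining if satisfies(x, term)]
            labels = {y for _, y in covered}
            if covered and len(labels) == 1:         # all covered examples same type
                rules.append((term, labels.pop()))
                remaining = [(x, y) for x, y in remaining if not satisfies(x, term)]
                break
        else:
            raise ValueError("sample is not consistent with any k-DL")
    return rules
```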
46. Summary of (Rivest, 1987)
- Gives a formal definition of DL.
- Shows the relations between k-DT, k-CNF, k-DNF and k-DL.
- Proves that k-DL is polynomially learnable.
- Gives a simple greedy algorithm to build a k-DL.
47. In practice
- Input attributes and the goal are not necessarily binary.
  - Ex: the previous word
- A term → a feature (it is not necessarily a conjunction of literals)
  - Ex: the word appears in a k-word window
- Only some feature types are considered, instead of all possible features.
  - Ex: previous word and next word
- Greedy algorithm: a quality measure is needed.
  - Ex: pick the feature with minimum entropy (see the sketch below).
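A sketch of a minimum-entropy quality measure for a single feature, as mentioned in the last bullet; the presence/absence split is an illustrative choice.

```python
import math
from collections import Counter

def entropy_after_feature(examples, feature):
    """Weighted entropy of the class labels after splitting on one feature.

    `examples` is a list of (feature_set, label) pairs; the minimum-entropy
    criterion mentioned above would pick the feature whose presence/absence
    split gives the lowest value returned here.
    """
    def entropy(labels):
        total = len(labels)
        counts = Counter(labels)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    with_f = [y for feats, y in examples if feature in feats]
    without_f = [y for feats, y in examples if feature not in feats]
    n = len(examples)
    h = 0.0
    if with_f:
        h += len(with_f) / n * entropy(with_f)
    if without_f:
        h += len(without_f) / n * entropy(without_f)
    return h
```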