Title: Decision List
1. Decision List
2. Outline
- Basic concepts and properties
- Case study
3. Definitions
- A decision list (DL) is an ordered list of conjunctive rules.
- Rules can overlap, so the order is important.
- A decision list determines an example's class by using the first matched rule.
4. An example
- A simple DL over x = (f1, f2, f3)
  - If f1=v11 ∧ f2=v21 then c1
  - If f2=v21 ∧ f3=v34 then c2
- Classify the example (v11, v21, v34)
  - → c1 or c2? (Both rules match, but the first matched rule wins, so c1; see the sketch below.)
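A minimal Python sketch of first-match classification for the toy DL above; the dictionary-based rule encoding is an illustrative choice, not part of the slides.

```python
def classify(example, decision_list, default="unknown"):
    """Return the class of the first rule whose conditions all hold."""
    for conditions, label in decision_list:
        if all(example.get(feat) == val for feat, val in conditions.items()):
            return label
    return default

dl = [
    ({"f1": "v11", "f2": "v21"}, "c1"),   # rule 1: if f1=v11 and f2=v21 then c1
    ({"f2": "v21", "f3": "v34"}, "c2"),   # rule 2: if f2=v21 and f3=v34 then c2
]

print(classify({"f1": "v11", "f2": "v21", "f3": "v34"}, dl))  # -> c1 (first match wins)
```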
5. Decision list
- A decision list is a list of pairs
  - (t1, v1), ..., (tr, vr)
  - ti are terms, and tr = true.
- A term in this context is a conjunction of literals
  - f1=v11 is a literal.
  - f1=v11 ∧ f2=v21 is a term.
6. How to build a decision list
- Decision tree → decision list
- Greedy, iterative algorithm that builds DLs directly.
7. Decision tree → decision list
- [Figure: a decision tree on Income; the "low" branch leads to "Nothing", the "high" branch leads to "Respond".]
8. The greedy algorithm
- RuleList ← {}, E ← training_data
- Repeat until E is empty or the gain is small:
  - t ← Find_best_term(E)
  - Let E' be the examples covered by t
  - Let c be the most common class in E'
  - Add (t, c) to RuleList
  - E ← E − E'
- (A Python sketch of this loop follows below.)
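A hedged Python sketch of the greedy loop above; `find_best_term` and `min_gain` are placeholder names standing in for whichever term-scoring routine and stopping threshold are used.

```python
from collections import Counter

def learn_decision_list(examples, find_best_term, min_gain=0.0):
    """Greedy DL learner following the pseudocode above.

    `examples` is a list of (features, label) pairs; `find_best_term` is any
    scoring routine that returns (term, gain), where a term is a predicate
    over the feature dict.  Both names are placeholders for this sketch.
    """
    rule_list = []
    remaining = list(examples)
    while remaining:
        term, gain = find_best_term(remaining)
        if term is None or gain <= min_gain:
            break
        covered = [(x, y) for x, y in remaining if term(x)]
        if not covered:
            break
        # c = most common class among the covered examples E'
        c = Counter(y for _, y in covered).most_common(1)[0][0]
        rule_list.append((term, c))
        remaining = [(x, y) for x, y in remaining if not term(x)]  # E <- E - E'
    return rule_list
```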
9. Problems with the greedy algorithm
- The interpretation of a rule depends on the preceding rules.
- Each iteration reduces the number of training examples.
- Poor rule choices at the beginning of the list can significantly reduce the accuracy of the learned DL.
- → Several papers propose alternative algorithms.
10. Algorithms for building DLs
- AQ algorithm (Michalski, 1969)
- CN2 algorithm (Clark and Niblett, 1989)
- Segal and Etzioni (1994)
- Goodman (2002)
11. Probabilistic DL
- DL: a rule is (t, v)
- Probabilistic DL: a rule is
  - (t, c1/p1, c2/p2, ..., cn/pn)
- (See the sketch below.)
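A small sketch of the probabilistic-DL idea, assuming rules are stored as (conditions, class distribution) pairs; the representation is illustrative only.

```python
def pdl_distribution(example, prob_rules, default=None):
    """A probabilistic DL rule maps a term to a class distribution; the first
    matching rule's distribution is returned (a sketch of the idea above)."""
    for conditions, dist in prob_rules:
        if all(example.get(f) == v for f, v in conditions.items()):
            return dist
    return default

pdl = [
    ({"A": True, "B": True}, {"c1": 0.8, "c2": 0.2}),  # if A and B then (c1, 0.8) (c2, 0.2)
    ({}, {"c1": 0.5, "c2": 0.5}),                      # default rule (term = true)
]
print(pdl_distribution({"A": True, "B": True}, pdl))    # -> {'c1': 0.8, 'c2': 0.2}
```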
12. Case study (Yarowsky, 1994)
13. Case study: accent restoration
- Task: restore accents in Spanish and French text
  - → A special case of WSD (word sense disambiguation)
- Ex: ambiguous de-accented forms
  - cesse → cesse, cessé
  - cote → côté, côte, cote, coté
- Algorithm: build a DL for each ambiguous de-accented form, e.g., one for cesse, another one for cote
- Attributes: words within a window
14. The algorithm
- Training
  - Find the list of de-accented forms that are ambiguous.
  - For each ambiguous form, build a decision list.
- Testing: check each word in a sentence
  - if it is ambiguous,
  - then restore the accented form according to the DL.
15. Algorithm for building DLs
- Select feature templates
- Build an attribute-value table
- Find the feature f_t that maximizes the log-likelihood ratio
  - | log( P(c1 | f_t) / P(c2 | f_t) ) |
- Split the data and iterate.
- (A small scoring sketch follows below.)
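A rough sketch of the feature-scoring step, assuming the log-likelihood ratio |log(P(c1|f)/P(c2|f))| shown above with simple add-alpha smoothing; the smoothing scheme is an assumption, not taken from the paper.

```python
import math
from collections import Counter, defaultdict

def rank_features(examples, alpha=0.1):
    """Score each feature by |log P(c1|f) / P(c2|f)|.

    `examples` is a list of (feature_set, label) pairs with two labels.
    `alpha` is an illustrative smoothing constant.
    """
    labels = sorted({y for _, y in examples})
    assert len(labels) == 2, "this sketch assumes a binary problem"
    counts = defaultdict(Counter)             # counts[feature][label]
    for feats, y in examples:
        for f in feats:
            counts[f][y] += 1
    ranked = []
    for f, c in counts.items():
        p1 = (c[labels[0]] + alpha) / (sum(c.values()) + 2 * alpha)
        p2 = 1.0 - p1
        score = abs(math.log(p1 / p2))
        best = labels[0] if p1 > p2 else labels[1]
        ranked.append((score, f, best))
    ranked.sort(reverse=True)                 # highest log-likelihood ratio first
    return ranked
```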
16. In this paper
- Binary classification problem: each form has only two possible accent patterns.
- Each rule tests only one feature.
- Very high baseline: 98.7%
- Notation
  - Accent pattern: the label/target/y
  - Collocation: a feature
17. Step 1: Identify forms that are ambiguous
18. Step 2: Collect training contexts
- Context: the previous three and the next three words.
- Strip the accents from the data. Why?
19. Step 3: Measure collocational distributions
- Feature types are pre-defined.
20. Collocations (a.k.a. features)
21. Step 4: Rank decision rules by log-likelihood
- There are many alternative ranking criteria.
- A rule may test a specific word or a word class.
22. Step 5: Pruning DLs
- Pruning
  - Cross-validation
  - Remove redundant rules: e.g., if a WEEKDAY rule precedes the domingo rule, the domingo rule can never fire and can be removed.
23. Summary of the algorithm
- For a de-accented form w, find all possible accented forms
- Collect training contexts
  - collect k words on each side of w
  - strip the accents from the data
- Measure collocational distributions
  - use pre-defined attribute combinations
  - Ex: -1w, +1w, +2w (the word at position -1, +1, or +2 relative to w)
- Rank decision rules by log-likelihood
- Optional pruning and interpolation
- (A sketch of the context-collection step follows below.)
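A sketch of collecting window features for an ambiguous form, assuming the -1w/+1w/+2w and +/-k-window templates listed above; the exact feature inventory in the paper may differ.

```python
def collocations(tokens, i, k=3):
    """Collect window features for the word at position i.

    The window size and the feature templates used here are illustrative
    assumptions based on the templates listed above.
    """
    feats = set()
    n = len(tokens)
    if i > 0:
        feats.add(("-1w", tokens[i - 1]))
    if i + 1 < n:
        feats.add(("+1w", tokens[i + 1]))
    if i + 2 < n:
        feats.add(("+2w", tokens[i + 2]))
    for j in range(max(0, i - k), min(n, i + k + 1)):
        if j != i:
            feats.add(("win", tokens[j]))   # word anywhere in the +/-k window
    return feats
```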
24. Experiments
- Prior (baseline): choose the most common accented form.
25. Global probabilities vs. residual probabilities
- Two ways to calculate the log-likelihood log P(ci | ft):
  - Global probabilities: computed from the full data set
  - Residual probabilities: computed from the residual training data
    - More relevant, but less data and more expensive to compute.
  - Interpolation: use both (a small sketch follows below)
- In practice, global probabilities work better.
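A minimal sketch of interpolating global and residual estimates of P(c | f); the mixing weight and the dictionary representation are illustrative assumptions, not values from the paper.

```python
def interpolated_prob(f, c, global_est, residual_est, lam=0.7):
    """Interpolate the two estimates of P(c | f) discussed above.

    `global_est` and `residual_est` map (feature, class) to probabilities
    computed from the full data and from the residual data; `lam` is an
    illustrative mixing weight.
    """
    g = global_est.get((f, c), 0.0)
    r = residual_est.get((f, c), 0.0)
    return lam * g + (1.0 - lam) * r
```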
26. Combining vs. not combining evidence
- Each decision is based on a single piece of evidence (i.e., one feature).
  - Run-time efficiency and easy modeling
- It works well, at least for this task, but why?
  - Combining all available evidence rarely produces a different result.
  - The gross exaggeration of probabilities that comes from combining many non-independent log-likelihoods is avoided (cf. Naïve Bayes).
27. Summary of the case study
- It allows a wider context (compared to n-gram methods).
- It allows the use of multiple, highly non-independent evidence types (compared to Bayesian methods).
- A "kitchen-sink approach of the best kind" (at that time).
28. Summary of decision lists
- Rules are easily understood by humans (but remember that the order matters).
- DLs tend to be relatively small, and fast and easy to apply in practice.
- Learning: the greedy algorithm and other improved algorithms
- Extension: probabilistic DL
  - Ex: if A ∧ B then (c1, 0.8), (c2, 0.2)
- DL is related to DT, CNF, DNF, and TBL (see the additional slides).
29. Additional slides
30. Rivest's paper
- It assumes that all attributes (including the goal attribute) are binary.
- It shows that DLs are easily learnable from examples.
31. Assignments and formulae
- Input attributes: x1, ..., xn
- An assignment gives each input attribute a value (1 or 0), e.g., 10001.
- A boolean formula (function) maps each assignment to a value (1 or 0).
32. Assignments and formulae (cont.)
- Two formulae are equivalent if they give the same value for the same input.
- Total number of different formulae: 2^(2^n) (there are 2^n assignments, and each can be mapped to 0 or 1).
- → Classification problem: learn a formula given a partial table.
33. CNF and DNF
- Literal: a variable or its negation
- Term: a conjunction (AND) of literals
- Clause: a disjunction (OR) of literals
- CNF (conjunctive normal form): a conjunction of clauses.
- DNF (disjunctive normal form): a disjunction of terms.
- k-CNF and k-DNF: each clause/term contains at most k literals.
34. A slightly different definition of DT
- A decision tree (DT) is a binary tree where each internal node is labeled with a variable, and each leaf is labeled with 0 or 1.
- k-DT: the depth of the DT is at most k.
- A DT defines a boolean formula: look at the paths whose leaf node is labeled 1.
- An example
35. Decision list
- A decision list is a list of pairs
  - (f1, v1), ..., (fr, vr)
  - fi are terms, and fr = true.
- A decision list defines a boolean function:
  - given an assignment x, DL(x) = vj, where j is the least index s.t. fj(x) = 1.
- (A small evaluation sketch follows below.)
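A small sketch of evaluating DL(x) as defined above, with terms represented as lists of (attribute, value) literals and the final default rule encoded as an empty term (i.e., true).

```python
def dl_value(assignment, rules):
    """DL(x) = v_j for the least j with f_j(x) = 1; the last rule's term is
    the constant true, so some rule always fires.

    `assignment` maps attribute names to 0/1; each term is a list of
    (attribute, required_value) literals.
    """
    for term, v in rules:
        if all(assignment[a] == val for a, val in term):
            return v
    raise ValueError("a well-formed DL ends with the rule (true, v_r)")

# Example: [(x1=1 and x2=0 -> 1), (x3=1 -> 0), (true -> 1)]
rules = [([("x1", 1), ("x2", 0)], 1), ([("x3", 1)], 0), ([], 1)]
print(dl_value({"x1": 0, "x2": 0, "x3": 1}, rules))   # -> 0
```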
36. Relations among different representations
- CNF, DNF, DT, DL
- k-CNF, k-DNF, k-DT, k-DL
- For any k < n, k-DL is a proper superset of the other three.
- Compared to a DT, a DL has a simpler structure, but the complexity of the decisions allowed at each node is greater.
37. k-CNF and k-DNF are proper subsets of k-DL
- k-DNF is a subset of k-DL
  - Each term t of the DNF is converted into a decision rule (t, 1).
  - Ex:
- k-CNF is a subset of k-DL
  - Every k-CNF is the complement of a k-DNF: k-CNF and k-DNF are duals of each other.
  - The complement of a k-DL is also a k-DL.
  - Ex:
- Neither k-CNF nor k-DNF is a subset of the other
  - Ex: 1-DNF
38. k-DT is a proper subset of k-DL
- k-DT is a subset of k-DNF
  - Each leaf labeled with 1 maps to a term in the k-DNF.
- k-DT is a subset of k-CNF
  - Each leaf labeled with 0 maps to a clause in the k-CNF.
- → k-DT is a subset of the intersection of k-DNF and k-CNF (and hence of k-DL).
39. k-DT, k-CNF, k-DNF and k-DL
- [Figure: Venn diagram showing k-DT contained in both k-CNF and k-DNF, with all three contained in k-DL.]
40. Learnability
- Positive examples vs. negative examples of the concept being learned.
  - In some domains, positive examples are easier to collect.
- A sample is a set of examples.
- A boolean function is consistent with a sample if it does not contradict any example in the sample.
41. Two properties of a learning algorithm
- A learning algorithm is economical if it requires few examples to identify the correct concept.
- A learning algorithm is efficient if it requires little computational effort to identify the correct concept.
- → We prefer algorithms that are both economical and efficient.
42. Hypothesis space
- Hypothesis space F: the set of concepts being considered.
- Hopefully, the concept being learned is in the hypothesis space of the learning algorithm.
- The goal of a learning algorithm is to select the right concept from F given the training data.
43. Discrepancy and confidence
- Discrepancy between two functions f and g: the probability, under the example distribution Pn, that f(x) ≠ g(x).
- Ideally, we want the discrepancy ε to be as small as possible.
- To deal with bad luck in drawing examples according to Pn, we define a confidence parameter δ.
44. Polynomially learnable
- A set of Boolean functions is polynomially learnable if there exist an algorithm A and a polynomial function m(n, 1/ε, 1/δ) such that
  - when given a sample of f of size at least m(n, 1/ε, 1/δ) drawn according to Pn, A will, with probability at least 1 − δ, output a function g whose discrepancy from f is at most ε.
  - Furthermore, A's running time is polynomially bounded in n and m.
- k-DL is polynomially learnable.
- (A formal statement of this condition is sketched below.)
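For reference, a standard PAC-style statement of the condition above, written in conventional notation (m, ε, δ, Pn); this follows the usual formulation rather than the paper's exact wording.

```latex
% Given m >= p(n, 1/epsilon, 1/delta) examples of f drawn i.i.d. from P_n,
% the algorithm A outputs g such that, with probability at least 1 - delta,
% the discrepancy between f and g is at most epsilon:
\[
\Pr\Big[\, d_{P_n}(f, g) \le \varepsilon \,\Big] \;\ge\; 1 - \delta,
\qquad
d_{P_n}(f, g) \;=\; \Pr_{x \sim P_n}\big[\, f(x) \ne g(x) \,\big].
\]
```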
45. The algorithm in (Rivest, 1987)
- If the example set S is empty, halt.
- Examine each term of length at most k until a term t is found s.t. all examples in S which make t true are of the same type v.
- Add (t, v) to the decision list and remove those examples from S.
- Repeat the three steps above. (A Python sketch follows below.)
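A hedged Python sketch of Rivest's greedy procedure as summarized above, enumerating terms of size at most k (including the empty term, i.e., true); it assumes the sample is consistent with some k-DL.

```python
from itertools import combinations, product

def rivest_kdl(sample, attributes, k):
    """Greedy k-DL learner following the steps above.

    `sample` is a list of (assignment, label) pairs with boolean labels;
    `attributes` is the list of attribute names.  A term is a conjunction
    of at most k literals.  This is a sketch, not an optimized version.
    """
    def terms_up_to_k():
        for size in range(k + 1):                    # size 0 is the term "true"
            for attrs in combinations(attributes, size):
                for vals in product([0, 1], repeat=size):
                    yield list(zip(attrs, vals))

    def satisfies(assignment, term):
        return all(assignment[a] == v for a, v in term)

    rules, remaining = [], list(sample)
    while remaining:
        for term in terms_up_to_k():
            covered = [(x, y) for x, y in remaining if satisfies(x, term)]
            labels = {y for _, y in covered}
            if covered and len(labels) == 1:         # all covered examples same type
                rules.append((term, labels.pop()))
                remaining = [(x, y) for x, y in remaining if not satisfies(x, term)]
                break
        else:
            raise ValueError("sample is not consistent with any k-DL")
    return rules
```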
46. Summary of (Rivest, 1987)
- Gives a formal definition of DL.
- Shows the relations between k-DT, k-CNF, k-DNF and k-DL.
- Proves that k-DL is polynomially learnable.
- Gives a simple greedy algorithm to build a k-DL.
47. In practice
- Input attributes and the goal are not necessarily binary.
  - Ex: the previous word
- A term → a feature (it is not necessarily a conjunction of literals)
  - Ex: the word appears in a k-word window
- Only some feature types are considered, instead of all possible features.
  - Ex: previous word and next word
- Greedy algorithm: a quality measure is needed.
  - Ex: pick the feature with minimum entropy (see the sketch below).
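A sketch of a minimum-entropy quality measure for a single feature, as mentioned in the last bullet; the presence/absence split is an illustrative choice.

```python
import math
from collections import Counter

def entropy_after_feature(examples, feature):
    """Weighted entropy of the class labels after splitting on one feature.

    `examples` is a list of (feature_set, label) pairs; the minimum-entropy
    criterion mentioned above would pick the feature whose presence/absence
    split gives the lowest value returned here.
    """
    def entropy(labels):
        total = len(labels)
        counts = Counter(labels)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    with_f = [y for feats, y in examples if feature in feats]
    without_f = [y for feats, y in examples if feature not in feats]
    n = len(examples)
    h = 0.0
    if with_f:
        h += len(with_f) / n * entropy(with_f)
    if without_f:
        h += len(without_f) / n * entropy(without_f)
    return h
```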