1
Decision List
  • LING 572
  • Fei Xia
  • 1/18/06

2
Outline
  • Basic concepts and properties
  • Case study

3
Definitions
  • A decision list (DL) is an ordered list of
    conjunctive rules.
  • Rules can overlap, so the order is important.
  • A decision list determines an example's class by
    using the first rule that matches.

4
An example
  • A simple DL over x = (f1, f2, f3):
  • If f1 = v11 ∧ f2 = v21 then c1
  • If f2 = v21 ∧ f3 = v34 then c2
  • Classify the example (v11, v21, v34):
  • → c1 or c2? (c1: the first matching rule decides)

5
Decision list
  • A decision list is a list of pairs
  • (t1, v1), …, (tr, vr)
  • ti are terms, and tr = true.
  • A term in this context is a conjunction of
    literals:
  • f1 = v11 is a literal.
  • f1 = v11 ∧ f2 = v21 is a term.
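A DL of this form is easy to operationalize. Here is a minimal Python sketch; the dict-based rule encoding is illustrative, not from the lecture:

  def matches(term, example):
      """A term is a dict of feature=value literals; it matches when
      every literal agrees with the example."""
      return all(example.get(f) == v for f, v in term.items())

  def classify(decision_list, example):
      """Return the label of the first matching rule; order matters."""
      for term, label in decision_list:
          if matches(term, example):
              return label

  # The example from the previous slide: both rules match
  # (v11, v21, v34), but the first one wins, so the answer is c1.
  dl = [({"f1": "v11", "f2": "v21"}, "c1"),
        ({"f2": "v21", "f3": "v34"}, "c2"),
        ({}, "default")]   # empty term = true, the final catch-all
  print(classify(dl, {"f1": "v11", "f2": "v21", "f3": "v34"}))  # c1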

6
How to build a decision list
  • Decision tree → Decision list
  • Greedy, iterative algorithm that builds DLs
    directly.

7
Decision tree → Decision list
[Figure: a one-node decision tree testing Income; the "low" branch
leads to Nothing, the "high" branch to Respond.]
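Reading each root-to-leaf path as one rule performs the conversion. A small Python sketch under that reading; the tree encoding is illustrative:

  def tree_to_dl(node, term=None):
      """Flatten a decision tree into a decision list: every
      root-to-leaf path becomes one (term, label) rule. A leaf is a
      bare label; an internal node is (feature, {value: subtree}).
      Tree paths never overlap, so any rule order classifies alike."""
      term = dict(term or {})
      if not isinstance(node, tuple):            # leaf
          return [(term, node)]
      feature, branches = node
      rules = []
      for value, subtree in branches.items():
          rules += tree_to_dl(subtree, {**term, feature: value})
      return rules

  tree = ("Income", {"low": "Nothing", "high": "Respond"})
  print(tree_to_dl(tree))
  # [({'Income': 'low'}, 'Nothing'), ({'Income': 'high'}, 'Respond')]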
8
The greedy algorithm
  • RuleList = {}, E = training_data
  • Repeat until E is empty or the gain is small:
  • t = Find_best_term(E)
  • Let E' be the examples covered by t
  • Let c be the most common class in E'
  • Add (t, c) to RuleList
  • E = E - E'
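As a concrete reading of this pseudocode, here is a Python sketch; the example representation and the find_best_term interface are assumptions, and the slide leaves the gain criterion open:

  from collections import Counter

  def learn_dl(examples, find_best_term, min_gain=0.0):
      """Greedy decision-list learner following the slide's loop.
      `examples` is a list of (features, label) pairs, features being
      a dict; `find_best_term` returns (term, gain) for the remaining
      data E."""
      rule_list, E = [], list(examples)
      while E:
          term, gain = find_best_term(E)
          if gain <= min_gain:                      # gain is small: stop
              break
          covered = [(x, y) for x, y in E
                     if all(x.get(f) == v for f, v in term.items())]
          if not covered:
              break
          c = Counter(y for _, y in covered).most_common(1)[0][0]
          rule_list.append((term, c))               # add (t, c)
          E = [e for e in E if e not in covered]    # E = E - E'
      return rule_list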

9
Problems with the greedy algorithm
  • The interpretation of each rule depends on the
    preceding rules.
  • Each iteration reduces the number of training
    examples.
  • Poor rule choices at the beginning of the list
    can significantly reduce the accuracy of the
    learned DL.
  • → Several papers propose alternative algorithms.

10
Algorithms for building DLs
  • AQ algorithm (Michalski, 1969)
  • CN2 algorithm (Clark and Niblett, 1989)
  • Segal and Etzioni (1994)
  • Goodman (2002)

11
Probabilistic DL
  • DL: a rule is (t, v)
  • Probabilistic DL: a rule is
  • (t, c1/p1, c2/p2, …, cn/pn)
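A probabilistic rule can be represented as a term paired with a class distribution. A minimal sketch; the representation is mine, not from a specific paper:

  def classify_prob(dl, example):
      """Return the class distribution of the first matching rule."""
      for term, dist in dl:
          if all(example.get(f) == v for f, v in term.items()):
              return dist

  prob_dl = [({"A": 1, "B": 1}, {"c1": 0.8, "c2": 0.2}),
             ({},               {"c1": 0.5, "c2": 0.5})]   # default rule
  print(classify_prob(prob_dl, {"A": 1, "B": 1}))  # {'c1': 0.8, 'c2': 0.2}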

12
Case study (Yarowsky, 1994)
13
Case study: accent restoration
  • Task: restore accents in Spanish and French text
  • → A special case of WSD
  • Ex: ambiguous de-accented forms
  • cesse → cesse, cessé
  • cote → côté, côte, cote, coté
  • Algorithm: build a DL for each ambiguous
    de-accented form, e.g., one for cesse, another
    one for cote
  • Attributes: words within a window

14
The algorithm
  • Training:
  • Find the list of de-accented forms that are
    ambiguous.
  • For each ambiguous form, build a decision list.
  • Testing: check each word in a sentence;
  • if it is ambiguous,
  • then restore the accented form according to the
    DL (see the sketch below).
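The testing step might look like the following sketch. The two context features and the sentence-boundary symbols are illustrative, and the per-form DLs are assumed to end in a catch-all rule:

  def restore_accents(sentence, dls):
      """For each token: if it is an ambiguous de-accented form
      (i.e., it has a learned DL), output the accented form chosen
      by the first matching rule; otherwise keep the token as-is."""
      restored = []
      for i, token in enumerate(sentence):
          if token not in dls:
              restored.append(token)
              continue
          context = {"-1w": sentence[i - 1] if i > 0 else "<s>",
                     "+1w": sentence[i + 1] if i + 1 < len(sentence) else "</s>"}
          for term, accented in dls[token]:         # first match wins
              if all(context.get(f) == v for f, v in term.items()):
                  restored.append(accented)
                  break
      return restored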

15
Algorithm for building DLs
  • Select feature templates.
  • Build the attribute-value table.
  • Find the feature ft that maximizes the ranking
    score (log-likelihood in this paper).
  • Split the data and iterate.

16
In this paper
  • Binary classification problem: each form has only
    two possible accent patterns.
  • Each rule tests only one feature.
  • Very high baseline: 98.7%
  • Notation:
  • Accent pattern: label/target/y
  • Collocation: feature

17
Step 1: Identify forms that are ambiguous
18
Step 2: Collect training contexts
Context: the previous three and the next three
words. Strip the accents from the data. Why?
Because test input is de-accented, so training
contexts must match that condition.
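Steps 1-2 amount to grouping corpus words by their de-accented form. A sketch using Unicode normalization; the helper names are mine:

  import unicodedata

  def strip_accents(word):
      """De-accent a word, e.g. 'cessé' -> 'cesse'."""
      return "".join(c for c in unicodedata.normalize("NFD", word)
                     if unicodedata.category(c) != "Mn")

  def find_ambiguous_forms(corpus_words):
      """Map each de-accented form to the accented forms that collapse
      to it; a form is ambiguous when more than one variant remains."""
      variants = {}
      for w in corpus_words:
          variants.setdefault(strip_accents(w), set()).add(w)
      return {base: forms for base, forms in variants.items()
              if len(forms) > 1}

  print(find_ambiguous_forms(["cote", "côté", "côte", "coté",
                              "cesse", "cessé"]))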
19
Step 3: Measure collocational distributions
Feature types are pre-defined.
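This step reduces to counting how often each collocation co-occurs with each accent pattern. A sketch with an assumed data layout:

  from collections import Counter, defaultdict

  def collocation_counts(training_contexts):
      """Tally feature/accent-pattern co-occurrences.
      `training_contexts` is a list of (features, pattern) pairs,
      features being a dict of template -> value."""
      counts = defaultdict(Counter)
      for features, pattern in training_contexts:
          for template, value in features.items():
              counts[(template, value)][pattern] += 1
      return counts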
20
Collocations (a.k.a. features)
21
Step 4: Rank decision rules by log-likelihood
There are many alternative ranking metrics.
Collocations may be individual words or word classes.
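For the binary case in this paper, the score is the absolute log ratio of the two pattern probabilities given the collocation. A sketch with additive smoothing; the smoothing constant is an assumption, and the paper discusses several alternatives:

  import math

  def rank_rules(counts, patterns, alpha=0.1):
      """Sort candidate collocations by
      |log P(pattern1 | feature) / P(pattern2 | feature)|
      using smoothed counts. `counts` maps feature -> Counter of the
      two accent patterns (as built in the Step 3 sketch)."""
      p1, p2 = patterns
      ranked = []
      for feat, c in counts.items():
          score = abs(math.log((c[p1] + alpha) / (c[p2] + alpha)))
          ranked.append((score, feat, p1 if c[p1] >= c[p2] else p2))
      return sorted(ranked, reverse=True)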
22
Step 5 Pruning DLs
  • Pruning:
  • Cross-validation
  • Remove redundant rules: e.g., if a WEEKDAY rule
    precedes the domingo rule, the domingo rule is
    redundant.

23
Summary of the algorithm
  • For a de-accented form w, find all possible
    accented forms.
  • Collect training contexts:
  • collect k words on each side of w;
  • strip the accents from the data.
  • Measure collocational distributions:
  • use pre-defined attribute combinations,
  • Ex: -1w, +1w, +2w (see the template sketch below).
  • Rank decision rules by log-likelihood.
  • Optional: pruning and interpolation.
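The pre-defined attribute combinations can be realized as a small template extractor. This sketch assumes positional templates plus a bag-of-words window; the exact template inventory is illustrative:

  def extract_features(sentence, i, k=3):
      """Collocation templates for the word at position i: single
      words at fixed offsets (-2w, -1w, +1w, +2w) plus any word
      within a +/-k window."""
      feats = {}
      for offset in (-2, -1, 1, 2):
          j = i + offset
          if 0 <= j < len(sentence):
              feats["%+dw" % offset] = sentence[j]
      for w in sentence[max(0, i - k):i] + sentence[i + 1:i + 1 + k]:
          feats["win:" + w] = True
      return feats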

24
Experiments
Prior (baseline): choose the most common form.
25
Global probabilities vs. residual probabilities
  • Two ways to calculate the log-likelihood
    log P(ci | ft):
  • Global probabilities: computed from the full
    data set.
  • Residual probabilities: computed from the
    residual training data (the examples not yet
    covered by earlier rules).
  • More relevant, but less data and more expensive
    to compute.
  • Interpolation: use both (see the sketch below).
  • In practice, global probabilities work better.
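A sketch of the interpolated score; the mixing weight and smoothing constant are assumed hyperparameters, not values from the paper:

  import math

  def interpolated_score(global_counts, residual_counts,
                         lam=0.5, alpha=0.1):
      """Mix the global and residual log-likelihood estimates for one
      feature. Each argument is an (n_class1, n_class2) count pair."""
      def log_ratio(n1, n2):
          return math.log((n1 + alpha) / (n2 + alpha))
      return (lam * log_ratio(*global_counts)
              + (1 - lam) * log_ratio(*residual_counts))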

26
Combining vs. not combining evidence
  • Each decision is based on a single piece of
    evidence (i.e., feature).
  • Run-time efficiency and easy modeling
  • It works well, at least for this task, but why?
  • Combining all available evidence rarely produces
    a different result.
  • The gross exaggeration of probabilities that comes
    from combining these non-independent
    log-likelihoods is avoided (cf. Naïve Bayes).

27
Summary of case study
  • It allows a wider context (compared to n-gram
    methods)
  • It allows the use of multiple, highly
    non-independent evidence types (compared to
    Bayesian methods)
  • A "kitchen-sink" approach of the best kind
    (at that time).

28
Summary of decision list
  • Rules are easily understood by humans (but
    remember that order matters).
  • DLs tend to be relatively small, and fast and
    easy to apply in practice.
  • Learning: the greedy algorithm and other improved
    algorithms.
  • Extension: probabilistic DLs.
  • Ex: if A ∧ B then (c1, 0.8), (c2, 0.2)
  • DLs are related to DTs, CNF, DNF, and TBL (see
    the additional slides).

29
Additional slides
30
Rivest's paper
  • It assumes that all attributes (including the
    goal attribute) are binary.
  • It shows that DLs are easily learnable from
    examples.

31
Assignments and formulae
  • Input attributes: x1, …, xn
  • An assignment gives each input attribute a value
    (1 or 0), e.g., 10001.
  • A boolean formula (function) maps each assignment
    to a value (1 or 0).

32
  • Two formulae are equivalent if they give the same
    value for every input.
  • Total number of distinct formulae: 2^(2^n), since
    each of the 2^n assignments can be mapped to 0
    or 1.
  • → Classification problem: learn a formula given a
    partial table.
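The count can be checked by brute force for small n; a quick sketch:

  from itertools import product

  n = 2
  assignments = list(product((0, 1), repeat=n))        # 2**n of them
  # A formula is any 0/1 labeling of the assignments,
  # so there are 2**(2**n) distinct formulae.
  formulae = list(product((0, 1), repeat=len(assignments)))
  print(len(assignments), len(formulae))               # 4 16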

33
CNF and DNF
  • Literal: a variable or its negation
  • Term: a conjunction (and) of literals
  • Clause: a disjunction (or) of literals
  • CNF (conjunctive normal form): a conjunction of
    clauses.
  • DNF (disjunctive normal form): a disjunction of
    terms.
  • k-CNF and k-DNF: each clause/term contains at
    most k literals.

34
A slightly different definition of DT
  • A decision tree (DT) is a binary tree where each
    internal node is labeled with a variable, and
    each leaf is labeled with 0 or 1.
  • k-DT: the depth of the DT is at most k.
  • A DT defines a boolean formula: take the
    disjunction of the paths whose leaf node is 1.

35
Decision list
  • A decision list is a list of pairs
  • (f1, v1), …, (fr, vr)
  • fi are terms, and fr = true.
  • A decision list defines a boolean function:
  • given an assignment x, DL(x) = vj, where j is
    the least index s.t. fj(x) = 1.

36
Relations among different representations
  • CNF, DNF, DT, DL
  • k-CNF, k-DNF, k-DT, k-DL
  • For any k < n, k-DL is a proper superset of the
    other three.
  • Compared to a DT, a DL has a simpler structure,
    but the complexity of the decisions allowed at
    each node is greater.

37
k-CNF and k-DNF are proper subsets of k-DL
  • k-DNF is a subset of k-DL:
  • Each term t of the DNF is converted into a
    decision rule (t, 1) (see the sketch below).
  • k-CNF is a subset of k-DL:
  • Every k-CNF is the complement of a k-DNF (k-CNF
    and k-DNF are duals of each other), and
  • the complement of a k-DL is also a k-DL.
  • Neither k-CNF nor k-DNF is a subset of the other.
  • Ex: x1 ∨ … ∨ xn is 1-DNF but, as a single
    n-literal clause, not k-CNF for any k < n.
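The k-DNF-to-k-DL direction can be written directly; a sketch, with terms left abstract:

  def dnf_to_dl(dnf_terms):
      """Each term t of the DNF becomes a rule (t, 1); a final
      (true, 0) rule catches every assignment no term covers."""
      return [(t, 1) for t in dnf_terms] + [((), 0)]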

38
k-DT is a proper subset of k-DL
  • k-DT is a subset of k-DNF:
  • Each leaf labeled with 1 maps to a term in the
    k-DNF.
  • k-DT is a subset of k-CNF:
  • Each leaf labeled with 0 maps to a clause in the
    k-CNF.
  • → k-DT is a subset of k-DNF ∩ k-CNF.

39
k-DT, k-CNF, k-DNF and k-DL
[Figure: a Venn diagram in which k-DL contains both k-CNF and
k-DNF, and their intersection contains k-DT.]
40
Learnability
  • Positive examples vs. negative examples of the
    concept being learned.
  • In some domains, positive examples are easier to
    collect.
  • A sample is a set of examples.
  • A boolean function is consistent with a sample if
    it does not contradict any example in the sample.

41
Two properties of a learning algorithm
  • A learning algorithm is economical if it requires
    few examples to identify the correct concept.
  • A learning algorithm is efficient if it requires
    little computational effort to identify the
    correct concept.
  • → We prefer algorithms that are both economical
    and efficient.

42
Hypothesis space
  • Hypothesis space F: the set of concepts that are
    being considered.
  • Hopefully, the concept being learned should be in
    the hypothesis space of a learning algorithm.
  • The goal of a learning algorithm is to select the
    right concept from F given the training data.

43
  • Discrepancy between two functions f and g:
    Δ(f, g) = Pr_Pn[ f(x) ≠ g(x) ]
  • Ideally, we want Δ(f, g) to be as small as
    possible.
  • To deal with bad luck in drawing examples
    according to Pn, we define a confidence
    parameter δ.

44
Polynomially learnable
  • A set F of boolean functions is polynomially
    learnable if there exist an algorithm A and a
    polynomial function m(n, 1/ε, 1/δ) s.t.
  • when given a sample of f of size m(n, 1/ε, 1/δ)
    drawn according to Pn, A will with probability
    at least 1 - δ output a g in F s.t.
    Δ(f, g) ≤ ε.
  • Furthermore, A's running time is polynomially
    bounded in n and m.
  • k-DL is polynomially learnable.

45
The algorithm in (Rivest, 1987)
  1. If the example set S is empty, halt.
  2. Examine each term of length at most k until a
     term t is found s.t. all examples in S which
     make t true are of the same type v.
  3. Add (t, v) to the decision list and remove those
     examples from S.
  4. Repeat steps 1-3.
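A brute-force Python rendering of these steps; the example encoding is assumed, and the term scan is the naive one implied by step 2:

  from itertools import combinations, product

  def rivest_learn(S, n, k):
      """Build a k-DL from sample S (a list of (assignment, label)
      pairs over n binary variables) by repeatedly finding a term of
      length <= k whose covered examples all share one label.
      Returns None when no term works, i.e. no consistent k-DL
      exists for this sample."""
      def terms():
          for size in range(k + 1):
              for idxs in combinations(range(n), size):
                  for bits in product((0, 1), repeat=size):
                      yield tuple(zip(idxs, bits))
      dl, S = [], list(S)
      while S:                                          # step 1
          for t in terms():                             # step 2
              covered = [(x, y) for x, y in S
                         if all(x[i] == b for i, b in t)]
              if covered and len({y for _, y in covered}) == 1:
                  dl.append((t, covered[0][1]))         # step 3
                  S = [e for e in S if e not in covered]
                  break
          else:
              return None
      return dl                                         # step 4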

46
Summary of (Rivest, 1987)
  • Gives a formal definition of DLs.
  • Shows the relations between k-DL, k-CNF, k-DNF
    and k-DT.
  • Proves that k-DL is polynomially learnable.
  • Gives a simple greedy algorithm to build a k-DL.

47
In practice
  • Input attributes and the goal are not necessarily
    binary.
  • Ex: the previous word
  • A term → a feature (it is not necessarily a
    conjunction of literals).
  • Ex: a word appears in a k-word window
  • Only some feature types are considered, instead
    of all possible features.
  • Ex: previous word and next word
  • Greedy algorithm: needs a quality measure.
  • Ex: pick the feature with minimum entropy (see
    the sketch below).
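For the entropy-based quality measure, a sketch; the covering convention is an assumption:

  import math
  from collections import Counter

  def feature_entropy(examples, feature):
      """Class entropy of the examples a candidate feature covers
      (lower is better). `examples` is a list of (features, label)
      pairs; a feature covers an example when it appears in its
      feature dict."""
      labels = [y for x, y in examples if feature in x]
      if not labels:
          return float("inf")
      total = len(labels)
      return -sum(c / total * math.log2(c / total)
                  for c in Counter(labels).values())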