Decision Trees and more!

Transcript and Presenter's Notes
1
Decision Trees and more!
2
Learning OR with few attributes
  • Target function: an OR of k literals
  • Goal: learn in time
  • polynomial in k and log n
  • with ε and δ constants
  • ELIM makes slow progress
  • might disqualify only one literal per round
  • might be left with O(n) candidate literals

3
ELIM Algorithm for learning OR
  • Keep a list of all candidate literals (sketch below)
  • For every example whose classification is 0
  • erase all the literals that are 1 on it.
  • Correctness
  • Our hypothesis h: an OR of our set of literals.
  • Our set of literals includes the target OR's
    literals.
  • Every time h predicts zero we are correct.
  • Sample size
  • m > (1/ε) ln(3^n/δ) = O(n/ε + (1/ε) ln(1/δ))
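A minimal Python sketch of ELIM under one possible encoding (the function names and the (index, sign) representation of literals are mine, not from the slides): examples are pairs (x, y) with x a list of n bits and y in {0, 1}.

    def elim(examples, n):
        # Start with all 2n candidate literals; (i, True) is x_i, (i, False) its negation.
        literals = {(i, s) for i in range(n) for s in (True, False)}
        for x, y in examples:
            if y == 0:
                # A literal that evaluates to 1 on a negative example cannot be in
                # the target OR, so erase it (keep only literals that are 0 on x).
                literals = {(i, s) for (i, s) in literals if (x[i] == 1) != s}
        return literals

    def predict(literals, x):
        # The hypothesis h is the OR of all surviving literals.
        return int(any((x[i] == 1) == s for (i, s) in literals))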

4
Set Cover - Definition
  • Input: S1 , … , St with Si ⊆ U
  • Output: Si1 , … , Sik such that ∪j Sij = U
  • Question: are there k sets that cover U?
  • NP-complete

5
Set Cover Greedy algorithm
  • j = 0; Uj = U; C = ∅ (sketch below)
  • While Uj ≠ ∅
  • Let Si be arg max |Si ∩ Uj|
  • Add Si to C
  • Let Uj+1 = Uj − Si
  • j = j + 1
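A minimal Python sketch of the greedy rule above (the function name and the list-of-sets representation are mine):

    def greedy_set_cover(universe, sets):
        # universe: a set U; sets: a list of subsets of U whose union is U.
        uncovered = set(universe)
        cover = []
        while uncovered:
            # Pick the set covering the most still-uncovered elements.
            best = max(sets, key=lambda s: len(s & uncovered))
            if not (best & uncovered):
                raise ValueError("the given sets do not cover the universe")
            cover.append(best)
            uncovered -= best
        return cover

For example, greedy_set_cover({1, 2, 3, 4, 5, 6}, [{1, 2, 3}, {3, 4}, {4, 5, 6}, {1, 6}]) returns a cover of U; by the analysis on the next slide it uses at most k ln|U| sets whenever a cover of size k exists.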

6
Set Cover Greedy Analysis
  • At termination, C is a cover.
  • Assume there is a cover C* of size k.
  • C* is a cover of every Uj
  • so some S in C* covers at least |Uj|/k elements of Uj
  • hence |Uj+1| ≤ |Uj| − |Uj|/k
  • Solving the recursion (see below):
  • number of sets j ≤ k ln(|U1|)
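Filling in the recursion step (a short LaTeX sketch; constants are as on the slide):

    \[
      |U_{j+1}| \le \Big(1 - \tfrac{1}{k}\Big)|U_j|
      \;\Longrightarrow\;
      |U_{j+1}| \le \Big(1 - \tfrac{1}{k}\Big)^{j} |U_1| \le e^{-j/k}\,|U_1| ,
    \]
    \[
      \text{so the cover is complete once } e^{-j/k}\,|U_1| < 1 ,
      \text{ i.e. } j > k \ln |U_1| .
    \]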

7
Building an Occam algorithm
  • Given a sample T of size m
  • Run ELIM on T
  • Let LIT be the set of remaining literals
  • Assume there exist k literals in LIT that
    classify all of the sample T correctly
  • Negative examples T−:
  • any subset of LIT classifies T− correctly

8
Building an Occam algorithm
  • Positive examples T+:
  • search for a small subset of LIT that classifies
    T+ correctly (sketch below)
  • For a literal z build Sz = { x ∈ T+ : z satisfies x }
  • Our assumption: there are k sets that cover T+
  • Greedy finds at most k ln m sets that cover T+
  • Output h = OR of these k ln m literals
  • size(h) ≤ k ln m · log(2n)
  • Sample size m = O(k log n · log(k log n))
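A minimal Python sketch of the Occam algorithm, reusing the elim sketch above and running greedy set cover over the sets S_z (the name occam_or and the dictionary layout are mine):

    def occam_or(examples, n):
        lit = elim(examples, n)                    # surviving candidate literals
        positives = [x for x, y in examples if y == 1]
        # For each literal z in LIT, the set S_z of positive examples it satisfies.
        s = {z: {j for j, x in enumerate(positives) if (x[z[0]] == 1) == z[1]}
             for z in lit}
        uncovered, chosen = set(range(len(positives))), []
        while uncovered:                           # greedy set cover over the S_z
            best = max(s, key=lambda z: len(s[z] & uncovered))
            if not (s[best] & uncovered):
                break                              # cannot happen under the assumption
            chosen.append(best)
            uncovered -= s[best]
        return chosen                              # hypothesis h: the OR of these literals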

9
k-DNF
  • Definition:
  • a disjunction of terms, each with at most k literals
  • Term: T = x3 ∧ x1 ∧ x5
  • DNF: T1 ∨ T2 ∨ T3 ∨ T4

10
Learning k-DNF
  • Extended input:
  • for each AND of at most k literals define a new input T
  • Example: T = x3 ∧ x1 ∧ x5
  • Number of new inputs: at most (2n)^k
  • The new inputs are easy to compute, in time k(2n)^k
    (sketch below)
  • The k-DNF is an OR over the new inputs.
  • Run the ELIM algorithm over the new inputs.
  • Sample size: O((2n)^k/ε + (1/ε) ln(1/δ))
  • Running time: same.
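A minimal Python sketch of the input expansion (expand_example and the term representation are mine): each new input corresponds to a conjunction of at most k literals and is 1 exactly when the conjunction is satisfied.

    from itertools import combinations, product

    def expand_example(x, k):
        # x: a list of n bits. Returns a dict mapping each term (a tuple of
        # (index, sign) literals) to its 0/1 value on x.
        n = len(x)
        new_inputs = {}
        for size in range(1, k + 1):
            for idxs in combinations(range(n), size):
                for signs in product((True, False), repeat=size):
                    term = tuple(zip(idxs, signs))
                    # The term is 1 iff every literal in it is satisfied by x.
                    new_inputs[term] = int(all((x[i] == 1) == s for i, s in term))
        return new_inputs

ELIM can then be run on the expanded examples exactly as before, treating each term as one input.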

11
Learning Decision Lists
  • Definition

[Figure: a decision list — a sequence of nodes testing the literals x4, x7, x1; at each node, if its literal evaluates to 1 the list outputs that node's label (±1), otherwise it continues to the next node, ending in a default output.]
12
Learning Decision Lists
  • Similar to ELIM (sketch below).
  • Input: a sample S of size m.
  • While S is not empty
  • For each literal z build Tz = { x ∈ S : z satisfies x }
  • Find a z for which all of Tz has the same classification b
  • Add the node (z, b) to the decision list
  • Update S = S − Tz
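A minimal Python sketch of the decision-list learner above (names and the (literal, label) node representation are mine):

    def learn_decision_list(examples, n):
        # examples: list of (x, y); a literal is (i, s) as in the ELIM sketch.
        literals = [(i, s) for i in range(n) for s in (True, False)]
        remaining = list(examples)
        dl = []                                   # the decision list: (literal, label) nodes
        while remaining:
            for z in literals:
                i, s = z
                t_z = [(x, y) for x, y in remaining if (x[i] == 1) == s]
                labels = {y for _, y in t_z}
                if t_z and len(labels) == 1:      # all of T_z has one classification
                    dl.append((z, labels.pop()))
                    remaining = [(x, y) for x, y in remaining if (x[i] == 1) != s]
                    break
            else:
                return None                       # no literal works: no consistent DL
        return dl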

13
DL algorithm correctness
  • The output decision list is consistent.
  • Number of decision lists:
  • length ≤ n+1
  • node: one of 2n literals
  • leaf: one of 2 values
  • total bound: (2 · 2n)^(n+1)
  • Sample size (derivation below):
  • m = O(n log n/ε + (1/ε) ln(1/δ))
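The sample size follows from the counting bound via the standard bound for a consistent learner, m ≥ (1/ε)(ln|H| + ln(1/δ)) (a short LaTeX sketch):

    \[
      |H_{\mathrm{DL}}| \le (2 \cdot 2n)^{\,n+1}
      \quad\Longrightarrow\quad
      \ln |H_{\mathrm{DL}}| = O(n \log n),
    \]
    \[
      m \;\ge\; \frac{1}{\epsilon}\Big(\ln |H_{\mathrm{DL}}| + \ln \tfrac{1}{\delta}\Big)
      \;=\; O\!\Big(\frac{n \log n}{\epsilon} + \frac{1}{\epsilon}\ln\frac{1}{\delta}\Big).
    \]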

14
k-DL
  • Each node is a conjunction of at most k literals
  • Includes k-DNF (and k-CNF)

15
Learning k-DL
  • Extended input:
  • for each AND of at most k literals define a new input
  • Example: T = x3 ∧ x1 ∧ x5
  • Number of new inputs: at most (2n)^k
  • The new inputs are easy to compute, in time k(2n)^k
  • The k-DL is a DL over the new inputs.
  • Run the DL algorithm over the new inputs.
  • Sample size and running time: as for DL, with the n
    original inputs replaced by the (2n)^k new inputs.

16
Open Problems
  • Attribute-efficient learning:
  • decision lists: very limited results
  • parity functions: negative?
  • k-DNF and k-DL

17
Decision Trees
[Figure: a small decision tree with internal nodes x1 (root) and x6; edges are labeled 0 and 1, and the leaves hold classification values.]
18
Learning Decision Trees Using DL
  • Consider a decision tree T of size r.
  • Theorem:
  • there exists a log(r+1)-DL L that computes T.
  • Claim: there exists a leaf in T of depth ≤ log(r+1)
    (see the sketch below).
  • Learn a decision tree using a decision list:
  • running time n^(log s)
  • n = number of attributes
  • s = tree size.
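An informal sketch of the claim and of how it yields the decision list (my phrasing; the constant in the depth bound is not tuned):

    % If every leaf of T had depth greater than d, then every node of T at depth
    % at most d would be internal, so T would contain the complete binary tree
    % of depth d and hence
    \[
      r \;\ge\; 2^{d+1} - 1 .
    \]
    % Taking d = \log(r+1) gives r \ge 2(r+1) - 1 > r, a contradiction, so some
    % leaf has depth at most \log(r+1).  Turn that leaf into the next DL node:
    % its test is the conjunction of the at most \log(r+1) literals on its path
    % and its output is the leaf's label; remove the leaf and repeat.  Every node
    % of the resulting list tests at most \log(r+1) literals, so L is a
    % \log(r+1)-DL, and the k-DL reduction with k = \log s runs in time n^{O(\log s)}.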

19
Decision Trees
[Figure: a decision tree over real-valued attributes, with threshold predicates x1 > 5 at the root and x6 > 2 at an internal node.]
20
Decision Trees Basic Setup.
  • Basic class of hypotheses H.
  • Input: a sample of examples
  • Output: a decision tree
  • each internal node: a predicate from H
  • each leaf: a classification value
  • Goal (Occam's Razor):
  • a small decision tree
  • that classifies all (most) examples correctly.

21
Decision Tree Why?
  • Efficient algorithms for
  • construction
  • classification
  • Performance: comparable to other methods
  • Software packages:
  • CART
  • C4.5 and C5.0

22
Decision Trees This Lecture
  • Algorithms for constructing DTs
  • a theoretical justification
  • using boosting
  • Future lecture:
  • DT pruning.

23
Decision Trees Algorithm Outline
  • A natural recursive procedure:
  • decide on a predicate h for the root
  • split the data using h
  • build the right subtree (for h(x)=1)
  • build the left subtree (for h(x)=0)
  • Running time:
  • T(s) = O(s) + T(s+) + T(s−) = O(s log s)
  • s = tree size

24
DT Selecting a Predicate
  • Basic setting (see the figure below)
  • Clearly q = u·p + (1−u)·r

[Figure: a node v is split by predicate h into children v1 (h=0) and v2 (h=1), with q = Pr[f=1] at v, u = Pr[h=0], 1−u = Pr[h=1], p = Pr[f=1 | h=0], and r = Pr[f=1 | h=1].]
25
Potential function setting
  • Compare predicates using a potential function.
  • Inputs: q, u, p, r
  • Output: a value
  • Node dependent:
  • for each node and predicate, assign a value.
  • Given a split: u·val(v1) + (1−u)·val(v2)
  • For a tree: a weighted sum over the leaves.

26
PF classification error
  • Let val(v) = min{q, 1−q}
  • this is exactly the classification error.
  • The average potential only drops.
  • Termination:
  • when the average is zero
  • we have perfect classification.

27
PF classification error
  • Is this a good split?
  • Initial error: 0.2
  • After the split: (1/2)·0.4 + (1/2)·0 = 0.2,
    so the classification-error potential shows no drop
    even though the split makes progress.

28
Potential Function requirements
  • When zero: perfect classification.
  • Strictly concave.

29
Potential Function requirements
  • Every change is an improvement (see the figure below).

[Figure: since q = u·p + (1−u)·r and val is concave, the chord value u·val(p) + (1−u)·val(r) lies below val(q), so any split can only lower the weighted potential.]
30
Potential Functions Candidates
  • Potential functions (sketch below):
  • val(q) = Gini(q) = 2q(1−q) (CART)
  • val(q) = entropy(q) = −q log q − (1−q) log(1−q)
    (C4.5)
  • val(q) = 2·sqrt(q(1−q))
  • Assumptions:
  • symmetric: val(q) = val(1−q)
  • concave
  • val(0) = val(1) = 0 and val(1/2) = 1
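A minimal Python sketch of the three candidate potentials and of the split value they induce (function names are mine):

    import math

    def gini(q):                      # CART: 2q(1-q)
        return 2.0 * q * (1.0 - q)

    def entropy(q):                   # C4.5: binary entropy, base 2
        if q in (0.0, 1.0):
            return 0.0
        return -q * math.log2(q) - (1.0 - q) * math.log2(1.0 - q)

    def sqrt_potential(q):            # 2*sqrt(q(1-q))
        return 2.0 * math.sqrt(q * (1.0 - q))

    def split_value(u, p, r, val=gini):
        # Weighted potential of a split: u*val(p) + (1-u)*val(r).
        return u * val(p) + (1.0 - u) * val(r)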

31
DT Construction Algorithm
  • Procedure DT(S), where S is the sample (sketch below)
  • If all the examples in S have the same classification b:
  • create a leaf with value b and return
  • For each h compute val(h,S)
  • val(h,S) = uh·val(ph) + (1−uh)·val(rh)
  • Let h* = arg min_h val(h,S)
  • Split S using h* into S0 and S1
  • Recursively invoke DT(S0) and DT(S1)
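A minimal Python sketch of DT(S), taking the input attributes themselves as the predicates (h(x) = x[i]) and the Gini potential as the default; the nested-dict tree representation is mine:

    def dt(sample, attributes, val=lambda q: 2 * q * (1 - q)):
        # sample: list of (x, y) with y in {0, 1}; attributes: candidate indices.
        labels = {y for _, y in sample}
        if len(labels) == 1:                      # all examples share one label b
            return {"leaf": labels.pop()}

        def split_val(i):                         # val(h, S) for h(x) = x[i]
            s0 = [(x, y) for x, y in sample if x[i] == 0]
            s1 = [(x, y) for x, y in sample if x[i] == 1]
            if not s0 or not s1:
                return float("inf")               # h does not split S
            u = len(s0) / len(sample)             # u_h
            p = sum(y for _, y in s0) / len(s0)   # p_h
            r = sum(y for _, y in s1) / len(s1)   # r_h
            return u * val(p) + (1 - u) * val(r)

        best = min(attributes, key=split_val)     # h* = arg min val(h, S)
        s0 = [(x, y) for x, y in sample if x[best] == 0]
        s1 = [(x, y) for x, y in sample if x[best] == 1]
        if not s0 or not s1:                      # no attribute splits S: stop
            return {"leaf": round(sum(y for _, y in sample) / len(sample))}
        return {"attr": best, "0": dt(s0, attributes, val),
                "1": dt(s1, attributes, val)}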

32
DT Analysis
  • Potential function:
  • val(T) = Σ_{v leaf of T} Pr[v]·val(qv)
  • For simplicity, use the true probabilities.
  • Bounding the classification error:
  • error(T) ≤ val(T)
  • so we study how fast val(T) drops
  • Given a tree T, define T(l,h), where
  • h is a predicate and l is a leaf:
  • T(l,h) replaces leaf l of T by an internal node that tests h.

33
Top-Down algorithm
  • Input: s = size, H = set of predicates, val()
  • T0 = the single-leaf tree
  • For t from 1 to s do (sketch below)
  • Let (l,h) = arg max_(l,h) [ val(Tt) − val(Tt(l,h)) ]
  • Tt+1 = Tt(l,h)
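A minimal Python sketch of the top-down loop; to keep it short it tracks only the example sets reaching the current leaves (enough to follow how val(T) drops), again with binary attributes as predicates. Names are mine.

    def top_down(sample, attributes, s, val=lambda q: 2 * q * (1 - q)):
        leaves = [list(sample)]                  # T_0: a single leaf
        m = len(sample)
        for _ in range(s):
            best = None                          # (drop, leaf index, split parts)
            for li, leaf in enumerate(leaves):
                old = len(leaf) / m * val(sum(y for _, y in leaf) / len(leaf))
                for i in attributes:
                    s0 = [(x, y) for x, y in leaf if x[i] == 0]
                    s1 = [(x, y) for x, y in leaf if x[i] == 1]
                    if not s0 or not s1:
                        continue
                    new = (len(s0) / m * val(sum(y for _, y in s0) / len(s0)) +
                           len(s1) / m * val(sum(y for _, y in s1) / len(s1)))
                    if best is None or old - new > best[0]:
                        best = (old - new, li, s0, s1)
            if best is None:
                break                            # no leaf can be split any further
            _, li, s0, s1 = best
            leaves[li:li + 1] = [s0, s1]         # T_{t+1} = T_t(l, h)
        return leaves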

34
Theoretical Analysis
  • Assume H satisfies the weak learning hypothesis:
  • for every distribution D there is an h s.t. error(h) < 1/2 − γ
  • Show that in every step
  • there is a significant drop in val(T)
  • The results are weaker than AdaBoost's
  • but the algorithm was never intended to match it!
  • Using weak learning:
  • show a large drop in val(T) at each step
  • modify the initial distribution to be unbiased.

35
Theoretical Analysis
  • Let val(q) = 2q(1−q)
  • Local drop at a node: at least 16γ²·(q(1−q))²
  • Claim: at every step t there is a leaf l s.t.
  • Pr[l] ≥ εt/(2t)
  • error(l) = min{ql, 1−ql} ≥ εt/2
  • where εt is the error at stage t
  • Proof!

36
Theoretical Analysis
  • Drop at time t: at least
  • Pr[l]·γ²·(ql(1−ql))² ≥ γ²·εt³ / t
  • For the Gini index
  • val(q) = 2q(1−q)
  • min{q, 1−q} ≥ q(1−q) = val(q)/2
  • Drop: at least Ω(γ²·val(Tt)³ / t)

37
Theoretical Analysis
  • Need to solve for when val(Tk) < ε
  • and bound k (sketch below).
  • Time: exp(O((1/γ²)·(1/ε²)))
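A rough sketch of how the time bound follows from the per-step drop (constants folded into c; the recurrence is treated as a differential inequality):

    \[
      val(T_{t+1}) \;\le\; val(T_t) \;-\; c\,\gamma^{2}\,\frac{val(T_t)^{3}}{t}
      \;\Longrightarrow\;
      \frac{d\,val}{val^{3}} \;\le\; -\,c\,\gamma^{2}\,\frac{dt}{t} ,
    \]
    \[
      \frac{1}{2\,val(T_k)^{2}} - \frac{1}{2\,val(T_1)^{2}} \;\ge\; c\,\gamma^{2}\ln k ,
    \]
    % so val(T_k) < \epsilon once \ln k = \Omega(1/(\gamma^{2}\epsilon^{2})), i.e.
    \[
      k \;=\; \exp\!\Big(O\big(\tfrac{1}{\gamma^{2}}\cdot\tfrac{1}{\epsilon^{2}}\big)\Big).
    \]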

38
Something to think about
  • AdaBoost: very good bounds
  • DT with the Gini index: exponential
  • Comparable results in practice
  • How can that be?