Title: Decision Trees and more!
1. Decision Trees and more!
2. Learning OR with few attributes
- Target function: an OR of k literals
- Goal: learn in time polynomial in k and log n (treating ε and δ as constants)
- ELIM makes slow progress:
  - it might disqualify only one literal per round
  - O(n) candidate literals might remain
3. ELIM Algorithm for learning OR
- Keep a list of all candidate literals (see the sketch below)
- For every example whose classification is 0:
  - erase every literal that evaluates to 1 on it
- Correctness:
  - Our hypothesis h is the OR of the current set of literals.
  - Our set of literals always includes the literals of the target OR.
  - Every time h predicts zero we are correct.
- Sample size:
  - m > (1/ε) ln(3^n/δ) = O(n/ε + (1/ε) ln(1/δ))
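As a concrete illustration of the slide above, here is a minimal Python sketch of ELIM; the function and variable names (elim, literals) are mine, and examples are assumed to be (x, y) pairs with x a 0/1 vector and y in {0, 1}.

    def elim(sample, n):
        # Candidate literals: (i, True) stands for x_i, (i, False) for its negation.
        literals = {(i, s) for i in range(n) for s in (True, False)}
        for x, y in sample:
            if y == 0:
                # A negative example kills every literal that is 1 on it.
                literals -= {(i, s) for (i, s) in literals if bool(x[i]) == s}
        # Hypothesis: the OR of the surviving literals.
        def h(x):
            return int(any(bool(x[i]) == s for (i, s) in literals))
        return literals, h

On every negative example h correctly outputs 0 (all literals that fire there were removed), and the target's literals are never removed, so errors can only be false negatives.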
4. Set Cover - Definition
- Input: S1, ..., St with each Si ⊆ U
- Output: Si1, ..., Sik such that ∪j Sij = U
- Question: are there k sets that cover U?
- NP-complete
5. Set Cover - Greedy algorithm
- j ← 0, Uj ← U, C ← ∅
- While Uj ≠ ∅:
  - let Si be arg max |Si ∩ Uj| (see the sketch below)
  - add Si to C
  - let Uj+1 = Uj \ Si
  - j ← j + 1
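A minimal Python sketch of the greedy rule above; the names (greedy_set_cover, universe, sets) are illustrative.

    def greedy_set_cover(universe, sets):
        uncovered = set(universe)
        cover = []
        while uncovered:
            # Greedy choice: the set covering the most still-uncovered elements.
            best = max(sets, key=lambda s: len(s & uncovered))
            if not best & uncovered:
                raise ValueError("the given sets do not cover the universe")
            cover.append(best)
            uncovered -= best
        return cover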
6. Set Cover - Greedy Analysis
- At termination, C is a cover.
- Assume there is a cover C* of size k.
- C* is a cover of every Uj.
- So some S in C* covers at least |Uj|/k elements of Uj.
- Analysis of |Uj|: |Uj+1| ≤ |Uj| - |Uj|/k = (1 - 1/k)|Uj|
- Solving the recursion (see the derivation below).
- Number of sets: j ≤ k ln|U|
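Filling in the step that solves the recursion (a short derivation consistent with the bound above):

    |Uj| ≤ (1 - 1/k)^j · |U| ≤ e^(-j/k) · |U|,

which drops below 1, i.e., Uj becomes empty, once j > k ln|U|.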
7. Building an Occam algorithm
- Given a sample T of size m
- Run ELIM on T
- Let LIT be the set of remaining literals
- Assume there exist k literals in LIT that classify all of the sample T correctly.
- Negative examples T-:
  - any subset of LIT classifies T- correctly
8. Building an Occam algorithm (cont.)
- Positive examples T+:
  - Search for a small subset of LIT that classifies T+ correctly.
  - For a literal z build Sz = {x in T+ : z satisfies x}
  - By our assumption there are k such sets that cover T+
  - Greedy finds k ln m sets that cover T+ (see the sketch below)
  - Output h = OR of the corresponding k ln m literals
  - size(h) < k ln m · log(2n) bits
- Sample size: m = O(k log n · log(k log n))
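A minimal Python sketch putting the two steps together (ELIM, then greedy set cover over the positive examples); it reuses the elim sketch above, and all names are illustrative. It assumes, as the slide does, that some k surviving literals cover all positives, so the loop terminates.

    def occam_or(sample, n):
        literals, _ = elim(sample, n)             # negatives are handled by ELIM
        positives = [x for x, y in sample if y == 1]
        # For each surviving literal z = (i, s), the positives it satisfies.
        covers = {z: {j for j, x in enumerate(positives) if bool(x[z[0]]) == z[1]}
                  for z in literals}
        uncovered, picked = set(range(len(positives))), []
        while uncovered:
            z = max(covers, key=lambda lit: len(covers[lit] & uncovered))  # greedy step
            picked.append(z)
            uncovered -= covers[z]
        def h(x):
            return int(any(bool(x[i]) == s for (i, s) in picked))
        return picked, h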
9. k-DNF
- Definition:
  - a disjunction of terms, each of at most k literals
  - Term: T = x3 ∧ x1 ∧ x5
  - DNF: T1 ∨ T2 ∨ T3 ∨ T4
10. Learning k-DNF
- Extended input:
  - for each AND of at most k literals define a new input T (see the sketch below)
  - Example: T = x3 ∧ x1 ∧ x5
  - Number of new inputs: at most (2n)^k
  - The new inputs can be computed easily, in time k(2n)^k
- The k-DNF is an OR over the new inputs.
- Run the ELIM algorithm over the new inputs.
- Sample size: O((2n)^k/ε + (1/ε) ln(1/δ))
- Running time same.
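A minimal Python sketch of the feature expansion (the helper names are mine; learn_k_dnf reuses the elim sketch above):

    from itertools import combinations

    def expand(x, k):
        # One new 0/1 input per conjunction of at most k literals over x.
        lits = [(i, s) for i in range(len(x)) for s in (True, False)]
        feats = []
        for size in range(1, k + 1):
            for term in combinations(lits, size):
                feats.append(int(all(bool(x[i]) == s for (i, s) in term)))
        return feats

    def learn_k_dnf(sample, n, k):
        # A k-DNF over x is an OR over the expanded inputs, so ELIM applies directly.
        expanded = [(expand(x, k), y) for x, y in sample]
        return elim(expanded, len(expanded[0][0]))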
11. Learning Decision Lists
[Figure: a decision list. The nodes test the literals x4, x7 and x1 in order; each node's 0/1 branches either output a label in {-1, +1} or fall through to the next node.]
12. Learning Decision Lists
- Similar to ELIM.
- Input: a sample S of size m.
- While S is not empty:
  - for each literal z build Tz = {x in S : z satisfies x}
  - find a z for which all the examples in Tz have the same classification b
  - add the node (z, b) to the decision list (see the sketch below)
  - update S = S - Tz
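A minimal Python sketch of this learner. The tie-breaking order over literals and the default label of the final leaf are my own choices, not from the slides.

    def learn_decision_list(sample, n):
        S = list(sample)
        lits = [(i, s) for i in range(n) for s in (True, False)]
        dl = []
        while S:
            for (i, s) in lits:
                T = [(x, y) for x, y in S if bool(x[i]) == s]
                labels = {y for _, y in T}
                if T and len(labels) == 1:            # all of Tz agree on a label b
                    dl.append(((i, s), labels.pop()))
                    S = [(x, y) for x, y in S if bool(x[i]) != s]
                    break
            else:
                raise ValueError("no consistent 1-decision list exists for this sample")
        def h(x):
            for (i, s), b in dl:
                if bool(x[i]) == s:
                    return b
            return 0                                  # default label (arbitrary choice)
        return dl, h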
13. DL algorithm correctness
- The output decision list is consistent.
- Number of decision lists:
  - length ≤ n + 1
  - node: one of 2n literals
  - leaf: one of 2 values
  - total bound: (2·2n)^(n+1)
- Sample size (see the derivation below):
  - m = O(n log n/ε + (1/ε) ln(1/δ))
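Spelling out the counting step behind the sample size (consistent with the bound above):

    log((2·2n)^(n+1)) = (n+1) log(4n) = O(n log n),

so an Occam/consistency bound gives m = O(n log n/ε + (1/ε) ln(1/δ)).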
14. k-DL
- Each node is a conjunction of at most k literals
- Includes k-DNF (and k-CNF)
15. Learning k-DL
- Extended input:
  - for each AND of at most k literals define a new input
  - Example: T = x3 ∧ x1 ∧ x5
  - Number of new inputs: at most (2n)^k
  - The new inputs can be computed easily, in time k(2n)^k
- The k-DL is a DL over the new inputs.
- Run the DL algorithm over the new inputs.
- Sample size
- Running time
16. Open Problems
- Attribute-efficient learning:
  - decision lists: very limited results
  - parity functions: negative?
  - k-DNF and k-DL
17. Decision Trees
[Figure: a small decision tree; the root tests x1 and one branch tests x6, with edges labeled 0/1.]
18. Learning Decision Trees Using DL
- Consider a decision tree T of size r.
- Theorem:
  - there exists a log(r+1)-DL L that computes T.
- Claim: there exists a leaf in T of depth at most log(r+1) (see the argument below).
- Learn a decision tree using a decision list:
  - running time: n^(log s)
  - n = number of attributes
  - s = tree size.
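A short argument for the claim (my sketch, not verbatim from the slides): if every leaf of T had depth greater than d, then every node at depth less than d would be internal, so T would contain a complete binary tree of depth d and hence more than 2^d leaves; taking d = log(r+1) contradicts the size bound r, so some leaf has depth at most log(r+1). Peeling off that shallow leaf's path as a decision-list node (a conjunction of at most log(r+1) literals with the leaf's label) and recursing on the remaining tree yields the log(r+1)-DL of the theorem.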
19. Decision Trees
[Figure: a decision tree with threshold predicates, e.g. x1 > 5 at the root and x6 > 2 at an internal node.]
20. Decision Trees - Basic Setup
- Basic class of hypotheses H.
- Input: a sample of examples
- Output: a decision tree
  - each internal node: a predicate from H
  - each leaf: a classification value
- Goal (Occam's Razor):
  - a small decision tree
  - that classifies all (most) examples correctly.
21. Decision Trees - Why?
- Efficient algorithms
- Construction.
- Classification
- Performance: comparable to other methods
- Software packages
- CART
- C4.5 and C5
22. Decision Trees - This Lecture
- Algorithms for constructing DT
- A theoretical justification
- Using boosting
- Future lecture
- DT pruning.
23. Decision Trees - Algorithm Outline
- A natural recursive procedure.
- Decide a predicate h at the root.
- Split the data using h
- Build the right subtree (for h(x) = 1)
- Build the left subtree (for h(x) = 0)
- Running time
- T(s) = O(s) + T(s+) + T(s-) = O(s log s)
- s = tree size
24. DT - Selecting a Predicate
- Basic setting (see the figure): at node v the label probability is Pr[f=1] = q; a predicate h splits v into children v1 (for h=0) and v2 (for h=1), with Pr[h=0] = u, Pr[h=1] = 1-u, Pr[f=1 | h=0] = p and Pr[f=1 | h=1] = r.
- Clearly q = u·p + (1-u)·r (see below).
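The identity on the slide is just the law of total probability:

    q = Pr[f=1] = Pr[h=0]·Pr[f=1 | h=0] + Pr[h=1]·Pr[f=1 | h=1] = u·p + (1-u)·r.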
25. Potential function setting
- Compare predicates using a potential function.
- Inputs: q, u, p, r
- Output: a value
- Node dependent:
  - for each node and predicate, assign a value.
  - for a given split: u·val(v1) + (1-u)·val(v2)
  - for a tree: a weighted sum over the leaves.
26. PF: classification error
- Let val(v) = min{q, 1-q}
  - this is exactly the classification error at the node.
- The average potential only drops.
- Termination:
  - when the average reaches zero
  - we have perfect classification
27. PF: classification error
- Is this a good split?
- Initial error: 0.2
- After the split: (1/2)·min{0.4, 0.6} + (1/2)·min{0, 1} = 0.2, so the measured error does not drop.
28. Potential Function: requirements
- When zero: perfect classification.
- Strictly concave.
29. Potential Function: requirements
- Every change is an improvement: since q = u·p + (1-u)·r, strict concavity gives u·val(p) + (1-u)·val(r) ≤ val(q), with a strict drop whenever p ≠ r.
[Figure: val plotted as a concave curve; the chord through (p, val(p)) and (r, val(r)) lies below val(q) at the mixture point q = u·p + (1-u)·r.]
30. Potential Functions: Candidates
- Potential functions (see the sketch below):
  - val(q) = Gini(q) = 2q(1-q) [CART]
  - val(q) = entropy(q) = -q log q - (1-q) log(1-q) [C4.5]
  - val(q) = sqrt(2q(1-q))
- Assumptions:
  - symmetric: val(q) = val(1-q)
  - concave
  - val(0) = val(1) = 0 and val(1/2) = 1
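A small Python sketch of the three candidates (function names are mine; the normalization of the square-root criterion follows the slide as written):

    import math

    def gini(q):                      # CART
        return 2 * q * (1 - q)

    def entropy(q):                   # C4.5, with the convention 0·log 0 = 0
        if q in (0, 1):
            return 0.0
        return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

    def sqrt_gini(q):
        return math.sqrt(2 * q * (1 - q))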
31. DT Construction Algorithm
- Procedure DT(S), where S is the sample (see the sketch below):
  - if all the examples in S have the same classification b:
    - create a leaf of value b and return
  - for each h compute val(h,S)
    - val(h,S) = u_h·val(p_h) + (1-u_h)·val(r_h)
  - let h* = arg min_h val(h,S)
  - split S using h* into S0 and S1
  - recursively invoke DT(S0) and DT(S1)
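A minimal Python sketch of DT(S), instantiated with the Gini potential from above (a hedged sketch: predicates are assumed to be boolean functions of x, and the tuple representation of the tree is my own choice):

    def build_dt(S, predicates, val=gini):
        labels = {y for _, y in S}
        if len(labels) == 1:                       # pure node: make a leaf
            return ("leaf", labels.pop())
        def split_val(h):
            S0 = [(x, y) for x, y in S if not h(x)]
            S1 = [(x, y) for x, y in S if h(x)]
            if not S0 or not S1:
                return float("inf")                # h does not split S
            u = len(S0) / len(S)
            p = sum(y for _, y in S0) / len(S0)    # Pr[f=1 | h=0]
            r = sum(y for _, y in S1) / len(S1)    # Pr[f=1 | h=1]
            return u * val(p) + (1 - u) * val(r)
        h = min(predicates, key=split_val)
        S0 = [(x, y) for x, y in S if not h(x)]
        S1 = [(x, y) for x, y in S if h(x)]
        if not S0 or not S1:                       # no predicate splits S: majority leaf
            return ("leaf", max(labels, key=lambda b: sum(y == b for _, y in S)))
        return ("node", h, build_dt(S0, predicates, val), build_dt(S1, predicates, val))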
32. DT Analysis
- Potential function:
  - val(T) = Σ_{v leaf of T} Pr[v]·val(q_v)
  - for simplicity, use the true probabilities
- Bounding the classification error:
  - error(T) ≤ val(T)
  - so we study how fast val(T) drops
- Given a tree T, a leaf l and a predicate h, define T(l,h):
  - the tree obtained from T by splitting leaf l with predicate h.
33. Top-Down algorithm
- Input: size s, predicate class H, potential val()
- T0 = the single-leaf tree
- For t from 1 to s do (see the sketch below):
  - let (l,h) = arg max_(l,h) [ val(Tt) - val(Tt(l,h)) ]
  - Tt+1 = Tt(l,h)
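A minimal Python sketch of the top-down loop (a hedged sketch: the tree is represented only by the partition of the sample into leaves, which is all that val(T) depends on; the names are mine and gini is the helper above):

    def top_down(sample, predicates, s, val=gini):
        m = len(sample)
        leaves = [list(sample)]                      # T0: a single leaf with all examples
        def leaf_potential(S):
            q = sum(y for _, y in S) / len(S)
            return (len(S) / m) * val(q)             # Pr[v] * val(q_v)
        for _ in range(s):
            best = None                              # (drop, leaf index, S0, S1)
            for i, S in enumerate(leaves):
                for h in predicates:
                    S1 = [(x, y) for x, y in S if h(x)]
                    S0 = [(x, y) for x, y in S if not h(x)]
                    if not S0 or not S1:
                        continue
                    drop = leaf_potential(S) - leaf_potential(S0) - leaf_potential(S1)
                    if best is None or drop > best[0]:
                        best = (drop, i, S0, S1)
            if best is None or best[0] <= 0:
                break                                # no (l, h) decreases val(T)
            _, i, S0, S1 = best
            leaves[i:i + 1] = [S0, S1]               # T_{t+1} = T_t(l, h)
        return leaves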
34. Theoretical Analysis
- Assume H satisfies the weak learning hypothesis:
  - for every distribution D there is an h s.t. error(h) < 1/2 - γ
- Show that every step gives a significant drop in val(T).
- The results are weaker than AdaBoost's
  - but the algorithm was never intended to do boosting!
- Use weak learning to show a large drop in val(T) at each step.
- Modify the initial distribution to be unbiased.
35. Theoretical Analysis
- Let val(q) = 2q(1-q)
- Local drop at a node: at least 16γ²·(q(1-q))²
- Claim: at every step t there is a leaf l s.t.
  - Pr[l] ≥ ε_t/(2t)
  - error(l) = min{q_l, 1-q_l} ≥ ε_t/2
  - where ε_t is the error at stage t
- Proof!
36. Theoretical Analysis
- Drop at time t: at least Pr[l]·γ²·(q_l(1-q_l))² ≥ γ²·ε_t³/t (up to constants)
- For the Gini index:
  - val(q) = 2q(1-q)
  - min{q, 1-q} ≥ q(1-q) = val(q)/2
- So the drop is at least Ω(γ²·val(T_t)³/t)
37. Theoretical Analysis
- Need to solve for when val(Tk) < ε.
- Bound k.
- Time: exp(O(1/γ² · 1/ε²))
38. Something to think about
- AdaBoost: very good bounds
- DT with the Gini index: exponential bounds
- Comparable results in practice
- How can it be?