Title: DECISION TREES
1. DECISION TREES: NOISY / R-BOOSTING
2. BOOSTING EXAMPLE
- X = {0,1}^n
- F_n = { f(x) = MAJ_{i∈S}(x_i) : S ⊆ [n], |S| odd }
- Weak learner L: output h(x) = x_i with minimum emp-err.
- Thm: For any n ≥ 1 and m ≥ 16 n² log(5n), the above is a γ = 1/(4n)-weak learner, i.e., E_{Z^m}[err(h)] ≤ ½ − γ.
- Proof: Each training example's label agrees with at least (|S|+1)/2 of the bits in S, i.e., with a ≥ ½ + 1/(2n) fraction.
- ⇒ emp-err(h) ≤ ½ − 1/(2n). From Lecture 4,
  E[max_{f∈F} |err(f) − emp-err(f)|] ≤ (log(5|F|)/m)^{1/2} ≤ 1/(4n).
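As a minimal Python sketch (toy data and names are mine, not the lecture's): the weak learner L just scans all n coordinate hypotheses h(x) = x_i and returns the one with the smallest empirical error; the sample size follows the m ≥ 16 n² log(5n) bound above.

```python
import numpy as np

def coordinate_weak_learner(X, y):
    """Return (i, emp-err) for the single coordinate h(x) = x_i with minimum
    empirical error.  X: (m, n) array over {0,1}; y: (m,) array over {0,1}."""
    emp_errs = (X != y[:, None]).mean(axis=0)        # emp-err of h(x) = x_i, for every i
    i = int(np.argmin(emp_errs))
    return i, float(emp_errs[i])

# Toy check: labels are the majority vote over an odd subset S of coordinates.
rng = np.random.default_rng(0)
n = 10
m = int(16 * n**2 * np.ceil(np.log(5 * n)))          # m >= 16 n^2 log(5n)
S = [0, 3, 7]                                        # |S| odd
X = rng.integers(0, 2, size=(m, n))
y = (X[:, S].sum(axis=1) > len(S) / 2).astype(int)   # MAJ over S
i, err = coordinate_weak_learner(X, y)
print(i, err)                                        # expect i in S and err <= 1/2 - 1/(2n)
```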
3. BOOSTING EXAMPLE
- X = {0,1}^n
- F_n = { f(x) = MAJ_{i∈S}(x_i) : S ⊆ [n], |S| odd }
- Weak learner L: output h(x) = x_i with minimum emp-err.
- Therefore, boost(L) AC-learns F_n.
4. DECISION TREE EXAMPLE
- X = ℝ × {0,1,2} × ℝ, Y = {0,1}, distribution μ over X × Y
- err(f) = P_{(x,y)~μ}[f(x) ≠ y]
- Example: f(x) = (x1 ≤ ¼) ∨ (x1 ≤ 10 ∧ x3 ≤ ½)
- [Figure: SIZE-10 DECISION TREE T: X → Y, with internal nodes testing x1 ≤ 10, x2 ∈ {0,1,2}, x3 ≤ ½, x3 ≤ ¼, and leaves labeled 0 or 1.]
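A minimal sketch, assuming a simple threshold-node representation, of how a small decision tree like T can be stored and evaluated; the splits and leaf labels below are placeholders, since the figure's exact structure is only partly recoverable.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Node:
    feature: int                  # index into x
    threshold: float              # test: x[feature] <= threshold
    yes: Union["Node", int]       # subtree, or leaf label, when the test holds
    no: Union["Node", int]        # subtree, or leaf label, otherwise

def predict(tree, x):
    """Follow threshold tests from the root down to a leaf label."""
    while isinstance(tree, Node):
        tree = tree.yes if x[tree.feature] <= tree.threshold else tree.no
    return tree

# Placeholder tree using splits that appear in the figure (x1 <= 10, x3 <= 1/2, x3 <= 1/4).
T = Node(0, 10.0, Node(2, 0.5, 1, 0), Node(2, 0.25, 1, 0))
print(predict(T, (3.0, 1, 0.4)))   # -> 1
```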
5. REGRESSION TREE EXAMPLE
- X = ℝ × {0,1,2} × ℝ, Y = [0,1], distribution μ over X × Y
- err(f) = ? (defined on the next slide)
- [Figure: SIZE-10 REGRESSION TREE T: X → Y, with the same threshold splits as above but real-valued leaf predictions such as 0.2, 0.7, and 0.8.]
6. SQUARED ERROR
- Suppose P_μ[y=1] = ¼, P_μ[y=0] = ¾ (regardless of x).
- Absolute error E_{(x,y)~μ}[|f(x) − y|]:
  - f(x) ≡ ¼ ⇒ E|f(x) − y| = ¼·¾ + ¾·¼ = 3/8
  - f(x) ≡ 0 ⇒ E|f(x) − y| = ¼
- Squared error E_{(x,y)~μ}[(f(x) − y)²]: min_{c∈[0,1]} E[(c − y)²] is attained at c = E[y] = ¼.
- Def: η(x) = E_{(x,y)~μ}[y | x].
- E_{(x,y)~μ}[(f(x) − y)²] = E[(f(x) − η(x))²] + E[(y − η(x))²]   (homework)
- E[(y − η(x))²] = variance (noise term).
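A quick numeric check of the arithmetic above (Python, values as in the slide): the constant ¼ has larger absolute error than the constant 0, while for squared error the best constant is c = E[y] = ¼.

```python
import numpy as np

p1 = 0.25                              # P[y = 1]; P[y = 0] = 0.75, regardless of x
ys = np.array([1.0, 0.0])
ps = np.array([p1, 1 - p1])

def abs_err(c):
    return float(np.sum(ps * np.abs(c - ys)))

def sq_err(c):
    return float(np.sum(ps * (c - ys) ** 2))

print(abs_err(0.25), abs_err(0.0))     # 0.375 vs 0.25: absolute error prefers the constant 0
cs = np.linspace(0, 1, 1001)
print(cs[int(np.argmin([sq_err(c) for c in cs]))])   # ~0.25: squared error prefers c = E[y]
```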
7. R BATCH LEARNING
- Set X, Y = [0,1]
- Family F of f: X → Y
- Distribution μ over X × Y
- emp-err(f) = (1/m) Σ_i (f(x_i) − y_i)²
- Define η(x) = E[y | x]
- err(f) = E_{(x,y)~μ}[(f(x) − η(x))²]
- err(f) = E[(f(x) − y)²] − E[(y − η(x))²]
- Assume η ∈ F
- Special binary cases:
  - P_{(x,y)~μ}[y ∈ {0,1}] = 1
  - Noiseless: ∀f ∈ F, f: X → {0,1}
  - Random noise: ∀f ∈ F, f: X → {η, 1−η}
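The identity err(f) = E[(f(x) − y)²] − E[(y − η(x))²] can be sanity-checked on a small finite distribution; the distribution and predictor below are made up for illustration.

```python
import numpy as np

# A tiny finite distribution over X x Y with X = {0, 1, 2} and y in {0, 1}.
px  = np.array([0.5, 0.3, 0.2])        # marginal distribution of x
eta = np.array([0.1, 0.6, 0.9])        # eta(x) = E[y | x]
fx  = np.array([0.0, 0.3, 0.8])        # values f(x) of some predictor f: X -> [0, 1]

E_f_to_y = np.sum(px * (eta * (fx - 1.0) ** 2 + (1 - eta) * fx ** 2))   # E[(f(x) - y)^2]
noise    = np.sum(px * eta * (1 - eta))                                 # E[(y - eta(x))^2]
err      = np.sum(px * (fx - eta) ** 2)                                 # E[(f(x) - eta(x))^2]
print(np.isclose(err, E_f_to_y - noise))                                # True
```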
8. GROWING TREES TOP-DOWN
9. DATA CALIBRATION
- Data calibration minimizes empirical error.
- Value at each leaf L is the mean of the training labels that reach L.
- Proof: emp-err(T, Z^m) decomposes as a sum over leaves, and within each leaf the mean of the labels minimizes the sum of squared errors.
- [Figure: tree with splits x1 ≤ 10, x2 ≤ 4, x3 ≤ ¼ and calibrated leaf values 0.5, 0.2, 0.2, 0.8.]
10. DATA CALIBRATION
- Data calibration minimizes empirical error.
- Any split can only reduce empirical error.
- Value at each leaf L is the mean of the training labels that reach L.
- [Figure: the same tree as above.]
11. DATA CALIBRATION
- Data calibration minimizes empirical error.
- Any split can only reduce empirical error.
- Proof: after splitting a leaf, calibrating the two new leaves is at least as good as keeping the old value (which remains available to both).
- Value at each leaf L is the mean of the training labels that reach L.
- [Figure: the same tree with the 0.5 leaf further split on x3 ≤ ½ into two new leaves whose values are to be calibrated.]
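A small sketch of both claims for squared error, on made-up leaf data: calibrating a leaf to the mean of its labels beats any other constant, and splitting a leaf (then recalibrating) can only lower the empirical error.

```python
import numpy as np

def emp_err(values, groups):
    """Mean squared error when each group of training labels is predicted by a single value."""
    m = sum(len(g) for g in groups)
    return sum(float(np.sum((g - v) ** 2)) for v, g in zip(values, groups)) / m

rng = np.random.default_rng(1)
leaf_labels = rng.random(20)                        # labels reaching one leaf

# Calibration: the leaf mean beats any other constant at this leaf.
mean = leaf_labels.mean()
print(emp_err([mean], [leaf_labels]) <= emp_err([0.3], [leaf_labels]))     # True

# Splitting: partition the leaf (here arbitrarily) and calibrate both parts.
left, right = leaf_labels[:8], leaf_labels[8:]
split_err = emp_err([left.mean(), right.mean()], [left, right])
print(split_err <= emp_err([mean], [leaf_labels]))                          # True
```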
12. TOP-DOWN ALGORITHM
- Input: Z^m ∈ (ℝ^n × [0,1])^m, size s ≥ 1 (number of internal nodes)
- Output: (binary) decision tree
- Start with the one-node tree.
- For i = 1 to s:
  - Find the split (over any leaf L) that results in the calibrated tree of smallest emp-error.
  - Make the split.
- For a decision tree, round each leaf value to {0,1}.
- Runtime: poly(n, m)?
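A minimal sketch of this top-down procedure for regression trees with axis-aligned threshold splits (function and variable names are mine); each iteration makes the single split that yields the smallest empirical error of the calibrated tree.

```python
import numpy as np

def sse(y_leaf):
    """Sum of squared errors of a calibrated leaf (its value = mean of its labels)."""
    return float(((y_leaf - y_leaf.mean()) ** 2).sum()) if len(y_leaf) else 0.0

def best_split(X, y, idx):
    """Best axis-aligned threshold split of one leaf, scored by post-split SSE."""
    best_score, best = np.inf, None
    for j in range(X.shape[1]):
        for t in np.unique(X[idx, j])[:-1]:           # thresholds between observed values
            left, right = idx[X[idx, j] <= t], idx[X[idx, j] > t]
            score = sse(y[left]) + sse(y[right])
            if score < best_score:
                best_score, best = score, (j, t, left, right)
    return best_score, best

def top_down(X, y, s):
    """Start with one leaf; make s greedy splits, each chosen to minimize the
    empirical error of the calibrated tree.  Returns the leaves (index sets)
    and the (feature, threshold) splits that were made."""
    leaves, splits = [np.arange(len(y))], []
    for _ in range(s):
        options = [(i,) + best_split(X, y, idx)
                   for i, idx in enumerate(leaves) if len(idx) > 1]
        options = [o for o in options if o[2] is not None]
        if not options:
            break
        # total SSE after a candidate split = its own post-split SSE + SSE of the other leaves
        i, _, (j, t, left, right) = min(
            options,
            key=lambda o: o[1] + sum(sse(y[l]) for k, l in enumerate(leaves) if k != o[0]))
        splits.append((j, float(t)))
        leaves[i:i + 1] = [left, right]
    return leaves, splits

# Toy usage on synthetic data: labels mostly depend on a threshold of feature 0.
rng = np.random.default_rng(0)
X = rng.random((50, 3))
y = (X[:, 0] > 0.5).astype(float) + 0.1 * rng.random(50)
leaves, splits = top_down(X, y, s=3)
print(splits[0])          # the first split should be on feature 0, near 0.5
```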
13. EMP-ERR & IMPURITY
- [Figure: derivation relating emp-err(T, Z^m) to impurity(T) via a function g(Z^m).]
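One concrete instantiation, assuming the impurity is the Gini index g(q) = q(1 − q): for {0,1} labels, the mean squared error of a calibrated leaf equals the Gini impurity of its labels, so the calibrated emp-err is a leaf-weighted impurity. The slide's g may be a different (concave) function.

```python
import numpy as np

# For {0,1} labels, a calibrated leaf predicts q = fraction of 1s in the leaf,
# and its mean squared error equals q(1 - q), the Gini impurity g(q).
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=17).astype(float)
q = labels.mean()
mse_calibrated = float(((labels - q) ** 2).mean())
print(np.isclose(mse_calibrated, q * (1 - q)))      # True
```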
14. TOP-DOWN ALGORITHM
- Input: Z^m ∈ (ℝ^n × [0,1])^m, size s ≥ 1
- Output: (binary) decision tree
- Start with the one-node tree.
- For i = 1 to s:
  - Find the split (over any leaf L) that results in the calibrated tree with the largest impurity decrease.
  - Make the split.
- Or use other splitting criteria, e.g., information gain (entropy).
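A short sketch of the alternative criterion mentioned above, information gain with entropy as the impurity (toy data, illustrative names).

```python
import numpy as np

def entropy(y):
    """Binary entropy of a {0,1} label vector (0 for an empty or pure leaf)."""
    if len(y) == 0:
        return 0.0
    q = float(np.mean(y))
    return 0.0 if q in (0.0, 1.0) else -q * np.log2(q) - (1 - q) * np.log2(1 - q)

def information_gain(x, y, t):
    """Impurity decrease of splitting on x <= t, with entropy as the impurity."""
    left, right = y[x <= t], y[x > t]
    w = len(left) / len(y)
    return entropy(y) - (w * entropy(left) + (1 - w) * entropy(right))

# A split aligned with the labels has a larger gain than a poorly placed one.
x = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
y = np.array([0, 0, 0, 1, 1, 1])
print(information_gain(x, y, 0.5), information_gain(x, y, 0.15))   # 1.0 vs ~0.19
```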
15. WHAT SIZE TREE?
16. WHAT SIZE TREE?
- Theory: take size(T) ≪ m.
- Practice: divide Z^m into (A, B) of sizes (0.9m, 0.1m).
- Stopping criterion:
  - Build the tree for s = 1, 2, … on A.
  - Among all trees generated, choose the one that minimizes error on B.
- Pruning:
  - Build the complete tree T on A (s = ∞).
  - Use the sub-tree with minimum error on B (efficient bottom-up).
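A sketch of the "stopping criterion" recipe, using scikit-learn's DecisionTreeRegressor as a stand-in for the top-down grower, with tree size controlled by max_leaf_nodes; the data is synthetic.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor   # stand-in for the top-down grower

rng = np.random.default_rng(0)
m, n = 500, 5
X = rng.random((m, n))
y = (X[:, 0] > 0.5).astype(float) + 0.1 * rng.standard_normal(m)   # noisy labels

# Divide Z^m into A (90%) and B (10%).
cut = int(0.9 * m)
XA, yA, XB, yB = X[:cut], y[:cut], X[cut:], y[cut:]

# Grow trees of increasing size on A; keep the size with the smallest error on B.
errors = {}
for s in range(2, 40):                           # here s = maximum number of leaves
    tree = DecisionTreeRegressor(max_leaf_nodes=s, random_state=0).fit(XA, yA)
    errors[s] = float(np.mean((tree.predict(XB) - yB) ** 2))
best_s = min(errors, key=errors.get)
print(best_s, errors[best_s])
```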
17. BOOSTING D.T. & NOISE
[Dietterich '99]
- AdaBoost performs poorly with noise.
- AdaBoost's reweighting quickly identifies outliers by putting (exponentially) large weight on them.
18. DECISION TREE BOOSTING
- Replace the split test (x_i ≤ θ) with a weak learner's output.
- Natural divide-and-conquer boosting.
- Problem: boost to learn f(x) = MAJ_{i∈[n]}(x_i).
- Need an exponential-size tree!
- Solution: merge nodes (graph instead of tree).
19. DECISION GRAPH
- X = {0,1}^5
- f(x) = MAJ(x1, x2, x3, x4, x5)
- [Figure: a decision graph computing MAJ, with layered nodes testing x1 ≥ ½, x2 ≥ ½, …, x5 ≥ ½; nodes in the same layer that agree on the count of 1s seen so far are merged, and the leaves are labeled 0 and 1.]
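A back-of-the-envelope sketch of why merging helps for MAJ: if nodes that agree on the number of 1s seen so far are merged, the graph needs at most one node per (level, count) pair, i.e. O(n²) nodes, versus 2^n − 1 internal nodes for the unmerged tree. The count below is an upper bound; the figure's graph may merge further.

```python
from itertools import product

def maj_graph_eval(x):
    """Evaluate MAJ(x) the way the decision graph does: read x1..xn in order,
    carrying only the number of 1s seen so far (nodes with equal counts merge)."""
    count = 0
    for bit in x:                     # one merged node per (level, count) pair
        count += bit
    return int(count > len(x) / 2)

n = 5
# At most one node per (level, count-of-1s-so-far) pair for the merged graph,
# versus 2^n - 1 internal nodes for the unmerged complete tree.
graph_nodes = sum(i + 1 for i in range(n))       # levels 0..n-1, counts 0..i
tree_nodes = 2 ** n - 1
print(graph_nodes, tree_nodes)                   # 15 vs 31 for n = 5; the gap grows exponentially

# Sanity check against the direct definition of majority over {0,1}^5.
assert all(maj_graph_eval(x) == int(sum(x) > n / 2) for x in product([0, 1], repeat=n))
```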