Title: Machine Learning, Decision Trees, Overfitting
1. Machine Learning, Decision Trees, Overfitting
Reading: Mitchell, Chapter 3
- Machine Learning 10-601
- Tom M. Mitchell
- Machine Learning Department
- Carnegie Mellon University
- January 14, 2008
2. Machine Learning 10-601
- Instructors
- William Cohen
- Tom Mitchell
- TAs
- Andrew Arnold
- Mary McGlohon
- Course assistant
- Sharon Cavlovich
- See webpage for
- Office hours
- Grading policy
- Final exam date
- Late homework policy
- Syllabus details
- ...
webpage: www.cs.cmu.edu/~tom/10601
3. Machine Learning
- Study of algorithms that
- improve their performance P
- at some task T
- with experience E
well-defined learning task: <P, T, E>
4. Learning to Predict Emergency C-Sections
(Sims et al., 2000)
9714 patient records, each with 215 features
5. Learning to detect objects in images
(Prof. H. Schneiderman)
Example training images for each orientation
6. Learning to classify text documents
Company home page vs. personal home page vs. university home page vs. ...
7. Reading a noun (vs. verb)
(Rustandi et al., 2005)
8. Machine Learning - Practice
Application areas shown: speech recognition, control learning, text analysis
- Supervised learning
- Bayesian networks
- Hidden Markov models
- Unsupervised clustering
- Reinforcement learning
- ...
9. Machine Learning - Theory
PAC learning theory (supervised concept learning) relates:
- # of training examples (m)
- representational complexity (H)
- error rate (ε)
- failure probability (δ)
Also relating:
- # of mistakes made during learning
- the learner's query strategy
- convergence rate
- asymptotic performance
- bias, variance
Other theories for:
- reinforcement skill learning
- semi-supervised learning
- active student querying
- ...
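One concrete instance of how m, H, ε, and δ relate, a standard PAC bound that the slide alludes to but does not spell out: for a learner that outputs a hypothesis consistent with the training data, drawn from a finite hypothesis space H,

  m \ge \frac{1}{\varepsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)

training examples suffice to guarantee, with probability at least 1 - δ, that every hypothesis consistent with the sample has true error at most ε.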
10. Growth of Machine Learning
- Machine learning is already the preferred approach to:
  - speech recognition, natural language processing
  - computer vision
  - medical outcomes analysis
  - robot control
  - ...
- This ML niche is growing, driven by:
  - improved machine learning algorithms
  - increased data capture and networking
  - software too complex to write by hand
  - new sensors / IO devices
  - demand for self-customization to user and environment
(Figure: ML applications as a growing subset of all software applications)
11. Function Approximation and Decision Tree Learning
12. Function approximation
- Setting:
  - set of possible instances X
  - unknown target function f : X → Y
  - set of function hypotheses H = { h | h : X → Y }
- Given:
  - training examples <xi, yi> of the unknown target function f
- Determine:
  - hypothesis h ∈ H that best approximates f (see the sketch below)
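A minimal Python sketch of this setting, assuming a hypothesis space small enough to enumerate; the helper names (training_error, best_hypothesis) are illustrative, not from the course:

def training_error(h, examples):
    """0-1 loss of hypothesis h on training examples <x, y>."""
    return sum(1 for x, y in examples if h(x) != y) / len(examples)

def best_hypothesis(H, examples):
    """Search the hypothesis space H for the h that best fits the sample."""
    return min(H, key=lambda h: training_error(h, examples))

# Toy usage: X = {0,1}^2, Y = {0,1}; H holds three candidate boolean functions.
examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
H = [lambda x: 0, lambda x: x[0] & x[1], lambda x: x[0] | x[1]]
h = best_hypothesis(H, examples)  # picks AND, which fits the sample exactly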
13. How would you represent (A ∧ B) ∨ (C ∧ D ∧ ¬E)?
- Each internal node tests one attribute Xi
- Each branch from a node selects one value for Xi
- Each leaf node predicts Y (or P(Y | X ∈ leaf)); one concrete encoding is sketched below
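A hedged Python sketch of that representation: nodes are small dicts, each internal node tests one attribute, and leaves predict Y. The subtree for C ∧ D ∧ ¬E must appear under both outcomes of the A/B test, showing how trees express disjunctions by duplicating subtrees. (Illustrative code, not from the course.)

def leaf(y):
    return {"leaf": y}

def node(attr, branches):
    return {"attr": attr, "branches": branches}

def predict(tree, x):
    # Walk from the root, following the branch chosen by x, until a leaf.
    while "leaf" not in tree:
        tree = tree["branches"][x[tree["attr"]]]
    return tree["leaf"]

def cde():  # subtree encoding C AND D AND (NOT E)
    return node("C", {0: leaf(False),
                      1: node("D", {0: leaf(False),
                                    1: node("E", {0: leaf(True), 1: leaf(False)})})})

# (A AND B) OR (C AND D AND NOT E)
tree = node("A", {0: cde(),
                  1: node("B", {0: cde(), 1: leaf(True)})})

print(predict(tree, {"A": 1, "B": 0, "C": 1, "D": 1, "E": 0}))  # True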
14. (No Transcript)
15. ID3, C4.5, ...
node = Root
Main loop:
1. A ← the "best" decision attribute for the next node
2. Assign A as the decision attribute for node
3. For each value of A, create a new descendant of node
4. Sort training examples to the leaf nodes
5. If training examples are perfectly classified, then STOP; else iterate over new leaf nodes
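A compact Python sketch of this loop, reusing the leaf/node helpers from the slide-13 sketch and the information-gain criterion gain() that the next slides define (a sketch of it appears after slide 19). It assumes categorical attributes and is illustrative rather than the course's reference code:

from collections import Counter

def id3(examples, attributes):
    """Greedy top-down tree growing: examples are (x, y) pairs, x a dict."""
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:            # training examples perfectly classified
        return leaf(labels[0])
    if not attributes:                   # no attributes left: majority vote
        return leaf(Counter(labels).most_common(1)[0][0])
    A = max(attributes, key=lambda a: gain(examples, a))  # the "best" attribute
    branches = {}
    for v in set(x[A] for x, _ in examples):
        subset = [(x, y) for x, y in examples if x[A] == v]  # sort examples down
        branches[v] = id3(subset, [a for a in attributes if a != A])
    return node(A, branches)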
16. Entropy
- Entropy H(X) of a random variable X:
  H(X) = -Σi P(X = i) log2 P(X = i), where the sum ranges over the possible values of X
- H(X) is the expected number of bits needed to encode a randomly drawn value of X (under the most efficient code)
- Why? Information theory:
  - the most efficient code assigns -log2 P(X = i) bits to encode the message "X = i"
  - so the expected number of bits to code one random X is Σi P(X = i) · (-log2 P(X = i)) = H(X)
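A small Python check of this formula, with the probabilities estimated from a sample (illustrative only):

from math import log2
from collections import Counter

def entropy(values):
    """H(X) = -sum_i P(X=i) log2 P(X=i), with P estimated from the sample.
    (Written as (c/n) * log2(n/c), the same quantity, to avoid a -0.0 result.)"""
    n = len(values)
    return sum((c / n) * log2(n / c) for c in Counter(values).values())

print(entropy([0, 1]))        # 1.0 bit for a fair coin
print(entropy([0, 0, 0, 1]))  # ~0.811 bits when P(X=1) = 1/4
print(entropy([0, 0, 0, 0]))  # 0.0: a certain outcome needs no bits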
17. Entropy
- Entropy H(X) of a random variable X:
  H(X) = -Σi P(X = i) log2 P(X = i)
- Specific conditional entropy H(X | Y = v) of X given Y = v:
  H(X | Y = v) = -Σi P(X = i | Y = v) log2 P(X = i | Y = v)
- Conditional entropy H(X | Y) of X given Y:
  H(X | Y) = Σv P(Y = v) H(X | Y = v)
- Mutual information (a.k.a. information gain) of X and Y:
  I(X, Y) = H(X) - H(X | Y) = H(Y) - H(Y | X)
18. Sample Entropy
19. Sv = subset of S for which A = v
Gain(S, A) = mutual information between A and the target class variable over sample S:
Gain(S, A) = Entropy(S) - Σv (|Sv| / |S|) · Entropy(Sv)
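A Python sketch of Gain(S, A), reusing entropy() from the slide-16 sketch; examples are (x, y) pairs with x a dict of attribute values (illustrative names, not course code):

def gain(examples, a):
    """Gain(S, A) = Entropy(S) - sum_v (|Sv|/|S|) * Entropy(Sv)."""
    n = len(examples)
    total = entropy([y for _, y in examples])
    remainder = 0.0
    for v in set(x[a] for x, _ in examples):
        sv = [y for x, y in examples if x[a] == v]  # labels of subset with A = v
        remainder += (len(sv) / n) * entropy(sv)
    return total - remainder

# Toy usage: "wind" splits the labels perfectly, so its gain equals the
# full label entropy (1 bit for a 50/50 class split).
S = [({"wind": "weak"}, "yes"), ({"wind": "weak"}, "yes"),
     ({"wind": "strong"}, "no"), ({"wind": "strong"}, "no")]
print(gain(S, "wind"))  # 1.0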
20-22. (No Transcript)
23. Decision Tree Learning Applet
- http://www.cs.ualberta.ca/~aixplore/learning/DecisionTrees/Applet/DecisionTreeApplet.html
24. Which Tree Should We Output?
- ID3 performs a heuristic search through the space of decision trees
- It stops at the smallest acceptable tree. Why?
Occam's razor: prefer the simplest hypothesis that fits the data
25. Why Prefer Short Hypotheses? (Occam's Razor)
- Argument in favor:
  - there are fewer short hypotheses than long ones
  - → a short hypothesis that fits the data is therefore less likely to be a statistical coincidence
  - → it is highly probable that a sufficiently complex hypothesis will fit the data by chance alone
- Argument opposed:
  - there are also fewer hypotheses with a prime number of nodes and attributes beginning with Z
  - what's so special about short hypotheses?
26-29. (No Transcript)
30. Split data into a training set and a validation set
Create a tree that classifies the training set correctly
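Mitchell's Chapter 3 (the assigned reading) develops this split into reduced-error pruning: greedily replace subtrees with majority-vote leaves as long as accuracy on the validation set does not drop. A simplified sketch, building on the leaf/node/predict helpers from the slide-13 sketch; it prunes bottom-up and compares accuracies only on the validation examples that reach each node, which is one common simplification:

from collections import Counter

def accuracy(tree, examples):
    return sum(1 for x, y in examples if predict(tree, x) == y) / len(examples)

def majority_leaf(examples):
    return leaf(Counter(y for _, y in examples).most_common(1)[0][0])

def prune(tree, train, val):
    """Replace a subtree with a majority leaf when that does not hurt
    validation accuracy. Assumes every attribute value appearing in val
    was also seen in training (so predict never hits a missing branch)."""
    if "leaf" in tree or not val:
        return tree
    a = tree["attr"]
    for v in list(tree["branches"]):
        tree["branches"][v] = prune(tree["branches"][v],
                                    [(x, y) for x, y in train if x[a] == v],
                                    [(x, y) for x, y in val if x[a] == v])
    candidate = majority_leaf(train)
    return candidate if accuracy(candidate, val) >= accuracy(tree, val) else tree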
31-36. (No Transcript)
37. What you should know
- Well-posed function approximation problems:
  - instance space X
  - sample of labeled training data { <xi, yi> }
  - hypothesis space H, e.g. { f : X → Y }
- Learning is a search/optimization problem over H
  - various objective functions:
    - minimize training error (0-1 loss)
    - among hypotheses that minimize training error, select the shortest
- Decision tree learning:
  - greedy top-down learning of decision trees (ID3, C4.5, ...)
  - overfitting and tree/rule post-pruning
  - extensions
38. Questions to think about (1)
- Why use information gain to select attributes in decision trees? What other criteria seem reasonable, and what are the tradeoffs in making this choice?
39. Questions to think about (2)
- ID3 and C4.5 are heuristic algorithms that search through the space of decision trees. Why not just do an exhaustive search?
40. Questions to think about (3)
- Consider target function f : <x1, x2> → y, where x1 and x2 are real-valued and y is boolean. What is the set of decision surfaces describable with decision trees that use each attribute at most once?