Title: Machine Learning: Concept Learning
1. Machine Learning: Concept Learning
Decision-Tree Learning
Medical Decision Support Systems
2. Machine Learning
- Learning: improving a program's performance in some task with experience
- Multiple application domains, such as
- Game playing (e.g., TD-Gammon)
- Speech recognition (e.g., Sphinx)
- Data mining (e.g., marketing)
- Driving autonomous vehicles (e.g., ALVINN)
- Classification of ER and ICU patients
- Prediction of financial and other fraud
- Prediction of pneumonia patients' recovery rate
3. Concept Learning
- Inference of a boolean-valued function (concept) from its I/O training examples
- The concept c is defined over a set of instances X: c: X → {0, 1}
- The learner is presented with a set of positive/negative training examples <x, c(x)> taken from X
- There is a set H of possible hypotheses that the learner might consider regarding the concept
- Goal: find a hypothesis h such that ∀x ∈ X, h(x) = c(x)
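To make the representation concrete, here is a minimal Python sketch (mine, not from the slides) of the usual conjunctive-hypothesis representation, where '?' accepts any attribute value; the attribute values follow Mitchell's EnjoySport example and are illustrative only:

```python
def matches(h, x):
    """Return True iff instance x satisfies hypothesis h ('?' = any value)."""
    return all(c == '?' or c == v for c, v in zip(h, x))

x = ('Sun', 'Warm', 'Normal', 'Strong', 'Warm', 'Same')  # an instance from X
h = ('Sun', '?', '?', 'Strong', '?', '?')                # a hypothesis from H
print(matches(h, x))  # True: h classifies x as positive
```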
4. A Concept-Learning Example
5. The Inductive Learning Hypothesis
- Any hypothesis approximating the target
function well over a large set of training
examples will also approximate that target
function well over other, unobserved examples
6. Concept Learning as Search
- Learning is searching through a large space of hypotheses
- The space is implicitly defined by the hypothesis representation
- General-to-specific ordering of hypotheses:
  - H1 is more-general-than-or-equal-to H2 if any instance that satisfies H2 also satisfies H1
  - <Sun, ?, ?, ?, ?, ?> ≥g <Sun, ?, ?, Strong, ?, ?>
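A minimal sketch of this ordering test for '?'-style conjunctive hypotheses (my illustration; it ignores the empty constraint ∅ introduced on the next slide):

```python
def more_general_or_equal(h1, h2):
    """h1 >=_g h2: every constraint in h1 is at least as permissive as the
    corresponding constraint in h2, so any instance satisfying h2 satisfies h1."""
    return all(a == '?' or a == b for a, b in zip(h1, h2))

print(more_general_or_equal(('Sun', '?', '?', '?', '?', '?'),
                            ('Sun', '?', '?', 'Strong', '?', '?')))  # True
```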
7. The Find-S Algorithm
- Start with the most specific hypothesis h in H
  - h ← <∅, ∅, ∅, ∅, ∅, ∅>
- Generalize h by the next more general constraint (for each appropriate attribute) whenever it fails to classify a positive training example correctly (see the sketch after this list)
- Here this finally leads to h = <Sun, Warm, ?, Strong, ?, ?>
- Finds only one (the most specific) hypothesis
- Cannot detect inconsistencies
- Ignores negative examples!
- Assumes no noise and no errors in the input
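A hedged Python sketch of Find-S under the tuple representation above; the training examples are the standard EnjoySport data and are assumed here, not taken from the slides' (missing) example table:

```python
def find_s(examples):
    """Find-S sketch: start from the most specific hypothesis and minimally
    generalize it on each positive example; negative examples are ignored."""
    h = [None] * len(examples[0][0])   # None plays the role of the empty constraint
    for x, label in examples:
        if not label:                  # Find-S ignores negative examples
            continue
        for i, value in enumerate(x):
            if h[i] is None:           # first positive example: copy its values
                h[i] = value
            elif h[i] != value:        # conflicting value: generalize to '?'
                h[i] = '?'
    return tuple(h)

# Assumed EnjoySport-style training data:
examples = [
    (('Sun', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'),   True),
    (('Sun', 'Warm', 'High',   'Strong', 'Warm', 'Same'),   True),
    (('Rain', 'Cold', 'High',  'Strong', 'Warm', 'Change'), False),
    (('Sun', 'Warm', 'High',   'Strong', 'Cool', 'Change'), True),
]
print(find_s(examples))  # ('Sun', 'Warm', '?', 'Strong', '?', '?')
```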
8. The Candidate-Elimination (CE) Algorithm (Mitchell, 1977, 1979)
- Version space: the subset of hypotheses of H consistent with the training-example set D
- A version space can be represented by:
  - Its general (maximally general) boundary set G of hypotheses consistent with D (G0 = {<?, ?, ..., ?>})
  - Its specific (minimally general) boundary set S of hypotheses consistent with D (S0 = {<∅, ∅, ..., ∅>})
- The CE algorithm updates the general and specific boundaries given each positive and negative example (a brute-force sketch follows this list)
- The resultant version space contains all and only the hypotheses consistent with the training set
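Because the boundary-set update rules are intricate, here is instead a brute-force sketch (mine) that computes the version space by enumeration, feasible only for tiny hypothesis spaces; the real CE algorithm represents the same set compactly via its S and G boundaries. The data are the assumed EnjoySport examples from the Find-S sketch:

```python
from itertools import product

def matches(h, x):
    return all(c == '?' or c == v for c, v in zip(h, x))

def consistent(h, examples):
    """h is consistent with D iff it labels every training example correctly."""
    return all(matches(h, x) == label for x, label in examples)

def version_space(examples):
    """Enumerate every conjunctive hypothesis over the observed attribute
    values (plus '?') and keep the consistent ones."""
    domains = [sorted({x[i] for x, _ in examples}) + ['?']
               for i in range(len(examples[0][0]))]
    return [h for h in product(*domains) if consistent(h, examples)]

examples = [
    (('Sun', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'),   True),
    (('Sun', 'Warm', 'High',   'Strong', 'Warm', 'Same'),   True),
    (('Rain', 'Cold', 'High',  'Strong', 'Warm', 'Change'), False),
    (('Sun', 'Warm', 'High',   'Strong', 'Cool', 'Change'), True),
]
vs = version_space(examples)
print(len(vs))                               # 6 consistent hypotheses remain

# A partially learned concept can still classify a new instance with
# certainty when every hypothesis in the version space agrees on it:
x_new = ('Sun', 'Warm', 'Normal', 'Strong', 'Cool', 'Change')
print(all(matches(h, x_new) for h in vs))    # True: unanimously positive
```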
9. Properties of the CE Algorithm
- Converges to the correct hypothesis if:
  - There are no errors in the training set
    - Otherwise, the correct target concept is always eliminated!
  - There is in fact such a hypothesis in H
- The next best query (new training example to ask for) separates the hypotheses in the version space maximally, at best into two halves (see the sketch after this list)
- Partially learned concepts might suffice to classify a new instance with certainty, or at least with some confidence
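A small illustrative sketch (mine) of the query-selection idea; the toy version space and candidate instances below are invented for illustration:

```python
def matches(h, x):
    return all(c == '?' or c == v for c, v in zip(h, x))

def best_query(candidates, vs):
    """Choose the candidate instance whose label splits the version space as
    close to half/half as possible; its true label then removes ~half of vs."""
    def imbalance(x):
        pos = sum(matches(h, x) for h in vs)
        return abs(2 * pos - len(vs))        # 0 means a perfect 50/50 split
    return min(candidates, key=imbalance)

vs = [('Sun', '?'), ('?', 'Strong'), ('Sun', 'Strong'), ('?', '?')]
candidates = [('Sun', 'Weak'), ('Rain', 'Strong'), ('Sun', 'Strong')]
print(best_query(candidates, vs))            # ('Sun', 'Weak'): a 2-2 split
```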
10. Inductive Biases
- Every learning method is implicitly biased towards a certain hypothesis space H
- The conjunctive hypothesis space (only one value per attribute) can represent only 973 of the 2^96 possible target concepts in our example domain
- Without an inductive bias (no a priori assumptions regarding the target concept) there is no way to classify new, unseen instances!
  - The S boundary will always be the disjunction of the positive example instances; the G boundary will be the negated disjunction of the negative example instances
  - Convergence is possible only when all of X has been seen!
- Strongly biased methods make more inductive leaps
- Inductive bias of CE: the target concept c is in H
11. Decision-Tree Learning
- Decision trees: a method for representing classification functions
  - Can be represented as a set of if-then rules
- Each node represents a test of some attribute
- An instance is classified by starting at the root, testing the attribute at each node, and moving along the branch corresponding to that attribute's value
12. Example Decision Tree
Outlook?
├─ Sun → Humidity?
│   ├─ High → No
│   └─ Normal → Yes
├─ Overcast → Yes
└─ Rain → Wind?
    ├─ Strong → No
    └─ Weak → Yes
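One simple way to realize this classification walk (a sketch, not the slides' implementation) is to encode the tree as nested dicts and follow one branch per tested attribute:

```python
# The slide's tree: internal nodes map an attribute to {value: subtree},
# leaves are class labels.
tree = {'Outlook': {
    'Sun':      {'Humidity': {'High': 'No', 'Normal': 'Yes'}},
    'Overcast': 'Yes',
    'Rain':     {'Wind': {'Strong': 'No', 'Weak': 'Yes'}},
}}

def classify(tree, instance):
    """Walk from the root, testing one attribute per node, until a leaf."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[instance[attribute]]
    return tree

print(classify(tree, {'Outlook': 'Sun', 'Humidity': 'Normal', 'Wind': 'Weak'}))  # Yes
```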
13. When Should Decision Trees Be Used?
- When instances are <attribute, value> pairs
  - Values are typically discrete, but can be continuous
- The target function has discrete output values
- Disjunctive descriptions might be needed
  - Natural representation of disjunctions of rules
- Training data might contain errors
  - Robust to errors in classification and in attribute values
- The training data might contain missing values
  - Several methods exist for completing unknown values
14. The Basic Decision-Tree Learning Algorithm: ID3 (Quinlan, 1986)
- A top-down greedy search through the hypothesis space of possible decision trees
- Originally intended for boolean-valued functions
  - Extensions incorporated in C4.5 (Quinlan, 1993)
- In each step, the best attribute for testing is selected using some measure, branching occurs along its values, and the process continues (see the sketch below)
- Ends when all attributes have been used, or when all examples in the current node are either positive or negative
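A compact, hedged sketch of the ID3 recursion, using the information-gain measure defined on the next slides; the dataset and attribute names are illustrative, not the slides' example:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Impurity of a multiset of class labels (defined on the next slides)."""
    total = len(labels)
    return -sum((c / total) * log2(c / total)
                for c in Counter(labels).values())

def gain(examples, attribute):
    """Expected entropy reduction from splitting on `attribute`."""
    split = {}
    for x, label in examples:
        split.setdefault(x[attribute], []).append(label)
    remainder = sum(len(sub) / len(examples) * entropy(sub)
                    for sub in split.values())
    return entropy([label for _, label in examples]) - remainder

def id3(examples, attributes):
    """Top-down greedy construction: pick the best attribute, branch on its
    values, and recurse until a node is pure or no attributes remain."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:              # all positive or all negative
        return labels[0]
    if not attributes:                     # no tests left: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a))
    rest = [a for a in attributes if a != best]
    return {best: {value: id3([(x, l) for x, l in examples
                               if x[best] == value], rest)
                   for value in {x[best] for x, _ in examples}}}

# Illustrative toy data (instances are attribute->value dicts):
data = [({'Outlook': 'Sun',      'Wind': 'Weak'},   'No'),
        ({'Outlook': 'Sun',      'Wind': 'Strong'}, 'No'),
        ({'Outlook': 'Overcast', 'Wind': 'Weak'},   'Yes'),
        ({'Outlook': 'Rain',     'Wind': 'Weak'},   'Yes'),
        ({'Outlook': 'Rain',     'Wind': 'Strong'}, 'No')]
print(id3(data, ['Outlook', 'Wind']))
```

The returned nested-dict tree has the same shape as the one in the slide-12 sketch, so the same `classify` walk applies to it.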
15. Which Attribute is Best to Test?
- The central choice in the ID3 algorithm and similar approaches
- Here, an information-gain measure is used, which quantifies how well each attribute separates the training examples according to their target classification
16. Entropy
- Entropy: an information-theory measure that characterizes the (im)purity of an example set S using the proportions of positive (p⊕) and negative (p⊖) instances
  - Informally: the number of bits needed to encode the classification of an arbitrary member of S
- Entropy(S) = −p⊕ log2 p⊕ − p⊖ log2 p⊖ (a numeric check follows this list)
  - Entropy(S) is in [0, 1]
  - Entropy(S) is 0 if all members are positive or all are negative
  - Entropy is maximal (1) when p⊕ = p⊖ = 0.5 (uniform distribution of positive and negative cases)
- If there are c different values of the target concept, Entropy(S) = Σi=1..c −pi log2 pi (pi is the proportion of class i)
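A small numeric check of the boolean formula (my sketch); the [9+, 5−] set anticipates the worked example on slide 19:

```python
from math import log2

def entropy(p_pos):
    """Entropy of a boolean-labelled set with positive proportion p_pos."""
    if p_pos in (0.0, 1.0):            # lim p->0 of p*log2(p) is 0
        return 0.0
    p_neg = 1.0 - p_pos
    return -p_pos * log2(p_pos) - p_neg * log2(p_neg)

print(round(entropy(9 / 14), 3))       # 0.940: the [9+, 5-] set used later
print(entropy(0.5), entropy(1.0))      # 1.0 (maximal impurity), 0.0 (pure set)
```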
17. Entropy Function for a Boolean Classification
[Figure: Entropy(S) as a function of p⊕, rising from 0 at p⊕ = 0 to its maximum of 1.0 at p⊕ = 0.5 and falling back to 0 at p⊕ = 1.0]
18. Information Gain of an Attribute
- The expected reduction in entropy E(S) caused by partitioning the examples in S using attribute A and all its corresponding values
- Gain(S, A) = E(S) − Σv∈Values(A) (|Sv| / |S|) E(Sv)
- The attribute with maximal information gain is chosen by ID3 for splitting the node
19. Information Gain Example
S: [9+, 5−], E = 0.940

Humidity?
├─ High → [3+, 4−], E = 0.985
└─ Normal → [6+, 1−], E = 0.592

Gain(S, Humidity) = 0.940 − (7/14)(0.985) − (7/14)(0.592) = 0.151

S: [9+, 5−], E = 0.940

Wind?
├─ Strong → [3+, 3−], E = 1.0
└─ Weak → [6+, 2−], E = 0.811

Gain(S, Wind) = 0.940 − (8/14)(0.811) − (6/14)(1.0) = 0.048
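Both gains can be reproduced numerically (my sketch); note that the slide's 0.151 comes from rounded intermediate entropies, while the exact value rounds to 0.152:

```python
from math import log2

def entropy(pos, neg):
    """Entropy of a set with `pos` positive and `neg` negative examples."""
    total = pos + neg
    return sum(-c / total * log2(c / total) for c in (pos, neg) if c)

e_s = entropy(9, 5)                                    # 0.940
gain_humidity = e_s - (7/14) * entropy(3, 4) - (7/14) * entropy(6, 1)
gain_wind     = e_s - (8/14) * entropy(6, 2) - (6/14) * entropy(3, 3)
print(round(gain_humidity, 3), round(gain_wind, 3))    # 0.152 0.048
```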
20. Properties of ID3
- Searches the hypothesis space of decision trees, a complete space of all finite discrete-valued functions (unlike the conjunctive hypothesis space)
- Maintains only a single hypothesis (unlike CE)
- Performs no backtracking; thus, it might get stuck in a local optimum
- Uses all training examples at every step to refine the current hypothesis (unlike Find-S or CE)
- (Approximate) inductive bias: prefers shorter trees over larger trees (Occam's razor), and trees that place high-information-gain attributes close to the root over those that do not
21. The Data Over-Fitting Problem
- Occurs due to noise in the data or too few examples
- Handling the over-fitting problem:
  - Stop growing the tree earlier, or
  - Prune the final tree retrospectively (a pruning sketch follows this list)
- In either case, the correct final tree size is determined by:
  - A separate validation set of examples, or
  - Using all examples and deciding whether expansion is likely to help, or
  - Using an explicit measure to encode the training examples and the tree, stopping when the measure is minimized
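A minimal sketch of the retrospective-pruning idea (reduced-error pruning against a validation set), reusing the nested-dict tree encoding from the earlier sketch; the tree and validation data here are invented for illustration:

```python
from collections import Counter

def classify(tree, x):
    while isinstance(tree, dict):
        attr, branches = next(iter(tree.items()))
        tree = branches[x[attr]]
    return tree

def reduced_error_prune(tree, validation):
    """Bottom-up: replace a subtree by its majority leaf whenever that does
    not lower accuracy on the validation examples reaching that subtree."""
    if not isinstance(tree, dict) or not validation:
        return tree
    attr, branches = next(iter(tree.items()))
    pruned = {attr: {v: reduced_error_prune(sub,
                         [(x, y) for x, y in validation if x[attr] == v])
                     for v, sub in branches.items()}}
    majority = Counter(y for _, y in validation).most_common(1)[0][0]
    keep = sum(classify(pruned, x) == y for x, y in validation)
    leaf = sum(majority == y for _, y in validation)
    return majority if leaf >= keep else pruned

tree = {'Outlook': {'Sun': {'Wind': {'Weak': 'Yes', 'Strong': 'No'}},
                    'Rain': 'No'}}
validation = [({'Outlook': 'Sun', 'Wind': 'Weak'},   'Yes'),
              ({'Outlook': 'Sun', 'Wind': 'Strong'}, 'Yes'),
              ({'Outlook': 'Rain', 'Wind': 'Weak'},  'No')]
print(reduced_error_prune(tree, validation))
# {'Outlook': {'Sun': 'Yes', 'Rain': 'No'}}: the noisy Wind split is pruned
```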
22. Other Improvements to ID3
- Handling continuous attribute values
  - Pick a threshold that maximizes information gain
- Avoiding the selection of many-valued attributes such as Date, by using more sophisticated measures such as gain ratio (dividing the gain of S relative to A and the target concept by the entropy of S with respect to the values of A); see the sketch after this list
- Handling missing values (average value or distribution)
- Handling costs of measuring attributes (e.g., laboratory tests) by including cost in the attribute-selection process
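A small sketch (mine) of the split-information denominator used by gain ratio, showing why a many-valued attribute like Date is penalized; the counts are illustrative:

```python
from math import log2

def split_information(value_counts):
    """Entropy of S with respect to the values of attribute A."""
    total = sum(value_counts)
    return -sum(c / total * log2(c / total) for c in value_counts if c)

# GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
print(split_information([1] * 14))   # 3.807: Date splits 14 examples into
                                     # 14 singletons, shrinking its gain ratio
print(split_information([7, 7]))     # 1.0: an even binary split is barely penalized
```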
23. Summary: Concept and Decision-Tree Learning
- Concept learning is a search through a hypothesis space
- The Candidate-Elimination algorithm uses the general-to-specific ordering of hypotheses to compute the version space
- Inductive learning algorithms can classify unseen examples only because of their implicit inductive bias
- ID3 searches through the space of decision trees
  - ID3 searches a complete hypothesis space and can handle noise and missing values in the training set
- Over-fitting the training data is a common problem and requires handling by methods such as post-pruning