Machine Learning: Concept Learning

1
Machine Learning: Concept Learning
Decision-Tree Learning
Medical Decision Support Systems
  • Yuval Shahar M.D., Ph.D.

2
Machine Learning
  • Learning: improving (a program's) performance in
    some task with experience
  • Multiple application domains, such as
  • Game playing (e.g., TD-gammon)
  • Speech recognition (e.g., Sphinx)
  • Data mining (e.g., marketing)
  • Driving autonomous vehicles (e.g., ALVINN)
  • Classification of ER and ICU patients
  • Prediction of financial and other fraud
  • Prediction of pneumonia patients' recovery rate

3
Concept Learning
  • Inference of a boolean-valued function (concept)
    from its I/O training examples
  • The concept c is defined over a set of instances
    X
  • c: X → {0, 1}
  • The learner is presented with a set of
    positive/negative training examples <x, c(x)>
    taken from X
  • There is a set H of possible hypotheses that the
    learner might consider regarding the concept
  • Goal: find a hypothesis h, s.t. ∀x ∈ X, h(x) = c(x)

4
A Concept-Learning Example
5
The Inductive Learning Hypothesis
  • Any hypothesis approximating the target
    function well over a large set of training
    examples will also approximate that target
    function well over other, unobserved, examples

6
Concept Learning as Search
  • Learning is searching through a large space of
    hypotheses
  • Space is implicitly defined by the hypothesis
    representation
  • General-to-specific ordering of hypotheses
  • H1 is more-general-than-or-equal-to H2 (H1 ≥g H2)
    if any instance that satisfies H2 also satisfies H1
  • <Sun, ?, ?, ?, ?, ?> ≥g <Sun, ?, ?, Strong, ?, ?>
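The ≥g relation above can be sketched in Python (a minimal sketch; the tuple encoding and the function name are illustrative assumptions, and the empty constraint Ø is assumed absent):

```python
# A sketch of the more-general-than-or-equal-to relation (>=g), assuming
# hypotheses are tuples of attribute constraints in which '?' matches any
# value; the empty constraint O is assumed absent here.
def more_general_or_equal(h1, h2):
    """h1 >=g h2 iff every instance satisfying h2 also satisfies h1.
    For conjunctive hypotheses this holds iff, attribute by attribute,
    h1's constraint is '?' or equals h2's constraint."""
    return all(c1 == '?' or c1 == c2 for c1, c2 in zip(h1, h2))

# The slide's example: <Sun, ?, ?, ?, ?, ?> >=g <Sun, ?, ?, Strong, ?, ?>
h_general  = ('Sun', '?', '?', '?', '?', '?')
h_specific = ('Sun', '?', '?', 'Strong', '?', '?')
print(more_general_or_equal(h_general, h_specific))   # True
print(more_general_or_equal(h_specific, h_general))   # False
```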

7
The Find-S Algorithm
  • Start with the most specific hypothesis h in H
  • h ← <Ø, Ø, Ø, Ø, Ø, Ø>
  • Generalize h to the next more general constraint
    (for each appropriate attribute) whenever it
    fails to correctly classify a positive training
    example
  • Leads here finally to h = <Sun, Warm, ?, Strong,
    ?, ?>
  • Finds only one (the most specific) hypothesis
  • Cannot detect inconsistencies
  • Ignores negative examples!
  • Assumes no noise and no errors in the input
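A minimal sketch of Find-S as described above, using Mitchell's standard EnjoySport training set (the data itself does not appear in this transcript and is an assumption):

```python
# A minimal Find-S sketch for conjunctive hypotheses; instances are attribute
# tuples, and None stands for the maximally specific constraint O.
def find_s(examples):
    """Return the most specific hypothesis consistent with the positives.
    `examples` is a list of (instance, label) pairs."""
    n = len(examples[0][0])
    h = [None] * n                       # start with <O, O, ..., O>
    for x, label in examples:
        if not label:                    # Find-S ignores negative examples
            continue
        for i, value in enumerate(x):
            if h[i] is None:             # first positive: adopt its values
                h[i] = value
            elif h[i] != value:          # mismatch: generalize to '?'
                h[i] = '?'
    return tuple(h)

# Mitchell's EnjoySport training set (assumed, matching the slides' result):
D = [
    (('Sun', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'),   True),
    (('Sun', 'Warm', 'High',   'Strong', 'Warm', 'Same'),   True),
    (('Rain', 'Cold', 'High',  'Strong', 'Warm', 'Change'), False),
    (('Sun', 'Warm', 'High',   'Strong', 'Cool', 'Change'), True),
]
print(find_s(D))   # ('Sun', 'Warm', '?', 'Strong', '?', '?')
```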

8
The Candidate-Elimination (CE) Algorithm (Mitchell,
1977, 1979)
  • A Version Space The subset of hypotheses of H
    consistent with the training examples set D
  • A version space can be represented by
  • Its general (maximally general) boundary set G of
    hypotheses consistent with D (G0 ← {<?, ?, ..., ?>})
  • Its specific (minimally general) boundary set S
    of hypotheses consistent with D (S0 ← {<Ø, Ø, ..., Ø>})
  • The CE algorithm updates the general and specific
    boundaries given each positive and negative
    example
  • The resultant version space contains all and only
    the hypotheses consistent with the training set
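The version-space definition can be illustrated by brute force: enumerate candidate conjunctive hypotheses and keep those consistent with D (a sketch only; the EnjoySport data and the helper names are assumptions, not part of the slides):

```python
# Brute-force version space: every conjunctive hypothesis over the observed
# attribute values (plus '?') that is consistent with the training set D.
from itertools import product

def covers(h, x):
    """Does hypothesis h classify instance x as positive?"""
    return all(c == '?' or c == v for c, v in zip(h, x))

def version_space(examples):
    """All hypotheses (without O constraints) consistent with every example."""
    values = [sorted({x[i] for x, _ in examples}) + ['?']
              for i in range(len(examples[0][0]))]
    return [h for h in product(*values)
            if all(covers(h, x) == label for x, label in examples)]

# Mitchell's EnjoySport training set (assumed for illustration):
D = [
    (('Sun', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'),   True),
    (('Sun', 'Warm', 'High',   'Strong', 'Warm', 'Same'),   True),
    (('Rain', 'Cold', 'High',  'Strong', 'Warm', 'Change'), False),
    (('Sun', 'Warm', 'High',   'Strong', 'Cool', 'Change'), True),
]
VS = version_space(D)
for h in VS:
    print(h)
# The result contains the S boundary ('Sun', 'Warm', '?', 'Strong', '?', '?')
# and the maximally general members, e.g. ('Sun', '?', '?', '?', '?', '?').
```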

9
Properties of The CE Algorithm
  • Converges to the correct hypothesis if
  • There are no errors in the training set
  • (Else, the correct target concept is eventually
    eliminated!)
  • There is in fact such a hypothesis in H
  • The next best query (new training example to ask
    for) maximally separates the hypotheses in the
    version space (at best, into two halves)
  • Partially learned concepts might suffice to
    classify a new instance with certainty, or at
    least with some confidence

10
Inductive Biases
  • Every learning method is implicitly biased
    towards a certain hypothesis space H
  • The conjunctive hypothesis space (only one value
    per attribute) can only represent 973 out of 2^96
    target concepts in our example domain
  • Without an inductive bias (no a priori
    assumptions regarding the target concept) there
    is no way to classify new, unseen instances!
  • The S boundary will always be the disjunction of
    the positive example instances; the G boundary
    will be the negated disjunction of the negative
    example instances
  • Convergence possible only when all of X is seen!
  • Strongly biased methods make more inductive leaps
  • Inductive bias of CE The target concept c is in H

11
Decision Tree learning
  • Decision trees: a method for representing
    classification functions
  • Can be represented as a set of If-Then rules
  • Each node represents a test of some attribute
  • An instance is classified by starting at the
    root, testing the attribute at each node, and
    moving along the branch corresponding to that
    attribute's value

12
Example Decision Tree
Outlook?
  Sun      → Humidity?
               High   → No
               Normal → Yes
  Overcast → Yes
  Rain     → Wind?
               Strong → No
               Weak   → Yes
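The tree above corresponds to nested If-Then tests; a sketch in Python (the dict-based instance format and the function name are illustrative assumptions):

```python
# The slide's example decision tree, written as the nested If-Then tests
# it represents.
def classify(instance):
    """Classify a PlayTennis-style instance with the tree from the slide."""
    if instance['Outlook'] == 'Sun':
        return 'Yes' if instance['Humidity'] == 'Normal' else 'No'
    elif instance['Outlook'] == 'Overcast':
        return 'Yes'
    else:                                 # Outlook == 'Rain'
        return 'Yes' if instance['Wind'] == 'Weak' else 'No'

print(classify({'Outlook': 'Sun', 'Humidity': 'High'}))   # No
print(classify({'Outlook': 'Overcast'}))                  # Yes
print(classify({'Outlook': 'Rain', 'Wind': 'Weak'}))      # Yes
```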
13
When Should Decision Trees Be Used?
  • When instances are <attribute, value> pairs
  • Values are typically discrete, but can be
    continuous
  • The target function has discrete output values
  • Disjunctive descriptions might be needed
  • Natural representation of disjunction of rules
  • Training data might contain errors
  • Robust to errors of classification and attribute
    values
  • The training data might contain missing values
  • Several methods for completion of unknown values

14
The Basic Decision-Tree Learning Algorithm: ID3
(Quinlan, 1986)
  • A top-down greedy search through the hypothesis
    space of possible decision trees
  • Originally intended for boolean-valued functions
  • Extensions incorporated in C4.5 (Quinlan, 1993)
  • In each step, the best attribute for testing is
    selected using some measure, and branching occurs
    along its values, continuing the process
  • Ends when all attributes have been used, or all
    examples in this node are either positive or
    negative

15
Which Attribute is Best to Test?
  • The central choice in the ID3 algorithm and
    similar approaches
  • Here, an information gain measure is used, which
    measures how well each attribute separates
    training examples according to their target
    classification

16
Entropy
  • Entropy: an information-theory measure that
    characterizes the (im)purity of an example set S
    using the proportion of positive (p⊕) and
    negative (p⊖) instances
  • Informally: the number of bits needed to encode
    the classification of an arbitrary member of S
  • Entropy(S) = -p⊕ log2 p⊕ - p⊖ log2 p⊖
  • Entropy(S) is in [0, 1]
  • Entropy(S) is 0 if all members are positive or
    all are negative
  • Entropy is maximal (1) when p⊕ = p⊖ = 0.5
    (uniform distribution of positive and negative
    cases)
  • If there are c different values of the target
    concept, Entropy(S) = -Σi=1..c pi log2 pi (pi is
    the proportion of class i)
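The entropy formula can be transcribed directly (a sketch; the function name and count-based interface are assumptions):

```python
# Entropy(S) = -sum_i p_i log2 p_i over class proportions p_i, computed
# from class counts; 0*log2(0) terms are treated as 0 by convention.
from math import log2

def entropy(counts):
    """Entropy of a set with the given per-class example counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(f"{entropy([9, 5]):.3f}")    # 0.940 -- the [9+, 5-] set from slide 19
print(entropy([7, 7]))             # 1.0   -- maximal at p = 0.5
print(entropy([14, 0]) == 0.0)     # True  -- a pure set has zero entropy
```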

17
Entropy Function for a Boolean Classification
(Plot: Entropy(S), from 0.0 to 1.0, as a function of p⊕, from 0.0 to
1.0; the curve is 0 at p⊕ = 0 and p⊕ = 1 and peaks at 1.0 at p⊕ = 0.5.)
18
Information Gain of an Attribute
  • The expected reduction in entropy E(S) caused by
    partitioning the examples in S using the
    attribute A and all its corresponding values
  • Gain(S, A) ≡ E(S) - Σv ∈ Values(A) (|Sv|/|S|) E(Sv)
  • The attribute with maximal information gain is
    chosen by ID3 for splitting the node

19
Information Gain Example
S: [9+, 5-], E = 0.940

Humidity?
  High:   [3+, 4-], E = 0.985
  Normal: [6+, 1-], E = 0.592
Gain(S, Humidity) = 0.940 - (7/14)·0.985 - (7/14)·0.592
                  = 0.151

Wind?
  Strong: [3+, 3-], E = 1.0
  Weak:   [6+, 2-], E = 0.811
Gain(S, Wind) = 0.940 - (8/14)·0.811 - (6/14)·1.0
              = 0.048
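The slide's numbers can be checked against the Gain formula (a sketch; note the unrounded Humidity gain is 0.152, and the slide's 0.151 comes from rounding the entropies first):

```python
# Gain(S, A) = E(S) - sum_v (|Sv|/|S|) E(Sv), with the subset counts read
# off the slide: [3+,4-]/[6+,1-] for Humidity, [6+,2-]/[3+,3-] for Wind.
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def gain(total_counts, subsets):
    """Information gain of the attribute inducing the given subsets."""
    n = sum(sum(s) for s in subsets)
    return entropy(total_counts) - sum((sum(s) / n) * entropy(s)
                                       for s in subsets)

S = [9, 5]                                      # [9+, 5-], E = 0.940
print(f"{gain(S, [[3, 4], [6, 1]]):.3f}")       # Humidity: 0.152
print(f"{gain(S, [[6, 2], [3, 3]]):.3f}")       # Wind:     0.048
```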
20
Properties of ID3
  • Searches the hypothesis space of decision trees
  • A complete space of all finite discrete-valued
    functions (unlike using conjunctive hypotheses)
  • Maintains only a single hypothesis (unlike CE)
  • Performs no backtracking; thus, it might get
    stuck in a local optimum
  • Uses all training examples at every step to
    refine the current hypothesis (unlike Find-S or
    CE)
  • (Approximate) Inductive bias: prefers shorter
    trees over larger trees (Occam's razor), and
    trees that place high-information-gain attributes
    close to the root over those that do not

21
The Data Over-Fitting Problem
  • Occurs due to noise in data or too-few examples
  • Handling the data over-fitting problem
  • Stop growing the tree earlier, or
  • Prune the final tree retrospectively
  • In either case, correct final tree size is
    determined by
  • A separate validation set of examples, or
  • Using all examples, deciding if expansion is
    likely to help
  • Using an explicit measure of the cost of encoding
    the training examples and the tree, and stopping
    when this measure is minimized

22
Other Improvements to ID3
  • Handling continuous values of attributes
  • Pick a threshold that maximizes information gain
  • Avoid selection of many-valued attributes such as
    date by using more sophisticated measures, such
    as gain ratio (dividing the gain of S relative to
    A and the target concept by the entropy of S with
    respect to the values of A)
  • Handling missing values (average value or
    distribution)
  • Handling costs of measuring attributes (e.g.,
    laboratory tests) by including cost in the
    attribute selection process
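The gain-ratio idea described above can be sketched as follows (SplitInformation is the entropy of S with respect to A's values; the function names and the numbers are illustrative assumptions):

```python
# Gain ratio = Gain(S, A) / SplitInformation(S, A), where SplitInformation
# is the entropy of S w.r.t. the partition induced by A's values; this
# penalizes many-valued attributes such as Date.
from math import log2

def split_information(subset_sizes):
    """Entropy of S w.r.t. the partition induced by attribute A's values."""
    n = sum(subset_sizes)
    return -sum((s / n) * log2(s / n) for s in subset_sizes if s)

def gain_ratio(info_gain, subset_sizes):
    return info_gain / split_information(subset_sizes)

# A binary split of 14 examples vs. a Date-like split into 14 singletons:
print(split_information([7, 7]))             # 1.0
print(f"{split_information([1] * 14):.3f}")  # 3.807 (= log2 14)
# The same raw gain is penalized far more under the many-valued split:
print(f"{gain_ratio(0.151, [7, 7]):.3f}")    # 0.151
print(f"{gain_ratio(0.151, [1] * 14):.3f}")  # 0.040
```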

23
Summary: Concept and Decision-Tree Learning
  • Concept learning is a search through a hypothesis
    space
  • The Candidate Elimination algorithm uses
    general-to-specific ordering of hypotheses to
    compute the version space
  • Inductive learning algorithms can classify unseen
    examples only because of their implicit inductive
    bias
  • ID3 searches through the space of decision trees
  • ID3 searches a complete hypothesis space and can
    handle noise and missing values in the training
    set
  • Over-fitting the training set is a common problem
    and requires handling by methods such as post-pruning