Learning Agents - PowerPoint PPT Presentation

Slides: 44
Provided by: cs07
Transcript and Presenter's Notes

Title: Learning Agents


1
Machine Learning: Decision Tree Learning
2
The Induction Task [Quinlan, 1986]
Reminder: Inductive inference is the process of
reaching a general conclusion from specific
examples. The general conclusion should apply to
unseen examples.
  • Each object in the universe belongs to one of a
    set of mutually exclusive classes.
  • For example, in two-class induction tasks,
    objects can be referred to as positive instances
    and negative instances of the concept being
    learned.
  • The other major ingredient is a training set of
    objects whose class is known.
  • The induction task is to develop a
    classifier/classification rule that can determine
    the class of any object.

3
The Induction Task: Universe of Objects, Attributes
A universe of objects can be described in terms
of a collection of attributes AT = {A_j | j = 1, 2, ..., L},
where each attribute A_j = {a_i | i = 1, 2, ..., n} is
given by its set of values. Each attribute measures some
important feature of an object. Its range is limited
to a set of discrete, mutually exclusive values.
4
Universe of Objects, Attributes: an Example
[Quinlan, 1986]
  • For example, if the objects were Saturday
    mornings and the classification task involved the
    weather, attributes might be:
  • Outlook, with values sunny, overcast, rain
  • Temperature, with values cool, mild, hot
  • Humidity, with values high, normal
  • Windy, with values true, false
  • AT = {Outlook, Temperature, Humidity, Windy},
  • i.e. A1 = Outlook, A2 = Temperature, A3 = Humidity,
    A4 = Windy, L = 4.
  • A1 = {sunny, overcast, rain}, A2 = {cool, mild, hot},
    A3 = {high, normal}, A4 = {true, false}
  • Taken together, the attributes provide a language
    for characterising objects in the universe, e.g.
    a Saturday morning might be described as
  • Outlook = overcast
  • Temperature = cool
  • Humidity = normal
  • Windy = false

5
A Decision Tree Structure
  • Leaves of a decision tree are class names,
    other nodes represent attribute-based tests with
    a branch for each possible outcome.
  • In order to classify an object, we start at the
    root of the tree, evaluate the test, and take the
    branch appropriate to the outcome.
  • The process continues until a leaf is
    encountered, at which time the object is asserted
    to belong to the class named by the leaf.

[Figure: decision tree diagram labelling the root, nodes, a branch, and the leaves]
6
Decision Tree Learning: Top-down Induction of
Decision Trees
Task: Concept Learning, Classification
Method: Top-down Induction of Decision Trees
Given: A set of classified examples in attribute-value
representation, where the set of classes is
finite.
Find: A decision tree that is as small
as possible and fits the data.
The induction task in Decision Tree Learning is
to develop a classification rule that can
determine the class of any object from the values
of its attributes.
7
Decision Tree Learning
The hypothesis space in decision tree learning is
the set of all possible decision trees. Decision
tree algorithms search through this space to find
a decision tree that determines the best
classification of the training data (i.e.
classification task). That is to find the best
tree which fits the training data.
Decision tree learning is a method for
approximating discrete-valued concepts (i.e.
target functions) in which the learned function
is represented by a decision tree. The learned
function can also be represented as a set of
if-then rules to improve human readability.
8
The Basic Decision Tree Learning Algorithm: ID3
Decision tree learning is typically associated
with a supervised, inductive learning technique.
Most algorithms that have been developed for
learning decision trees are variations on a core
algorithm that employs a top-down induction
method, e.g. the basic ID3 algorithm
[Quinlan, 1986] and its extension C4.5 [Quinlan,
1993].
Reminder: any inductive algorithm may
misclassify data.
9
ID3 Algorithm: Data Requirements
  • Requirements for the sample data used by ID3 are:
  • Attribute-value description: the same attributes
    must describe each example and have a fixed number
    of values.
  • Predefined classes: the classes of the examples must be
    defined in advance, i.e. they are not learned by ID3.
  • Discrete classes: classes should be clearly
    delineated, with no overlap between them.
  • Sufficient examples: there must be enough training
    cases to distinguish valid patterns from chance
    occurrences.

10
Decision Tree Learning
A decision tree algorithm constructs the tree
by answering the question "which attribute is the
best classifier?" at each step.
How do we choose the most effective attribute to
split on at any branch node?
We may define a measurable valuation function
(e.g. cost, pay, profit) so that we can compare
the alternative ways to act and finally choose
the way of acting that produces the
highest/lowest value for this function.
11
ID3: General Description
The central choice in the ID3 algorithm is
selecting which attribute to test at each node in
the tree. ID3 uses a statistical
property/measure, called information gain, that
measures how well a given attribute separates the
training examples according to their target
concept/classification.
That is, ID3 uses this information gain measure
to select the most effective attribute among the
candidate attributes at each step while growing
the tree.
12
Entropy
In order to define information gain, we first define a
measure that is commonly used in information
theory, called entropy. Entropy (in data
transmission and information theory) is a measure
of the loss of information in a transmitted
signal or message. Entropy characterises the
(im)purity of an arbitrary collection of examples.
Gain is the expected reduction in entropy caused
by knowing the value of an attribute.
The higher the entropy of the data, the greater
the loss of information. The aim is to
reduce the entropy (i.e. the loss of information).
13
Entropy: Formal Definition
The entropy, or uncertainty still remaining about
the class $c_i$, $i = 1, 2, \ldots, N$, of an example
drawn from a collection S is defined as
$$\mathrm{Entropy}(S) = -\sum_{i=1}^{N} p_i \log_2 p_i$$
where $p_i$ is the probability
that an example drawn from S belongs to
class $c_i$, and the summation is over all N
classes. Entropy
measures the impurity of S. In order to
calculate entropy we need to know the
probability $p_i$, which in general is not known. We can
estimate it from sample statistics. That is, we can
define $p_i = |S_{c_i}| / |S|$, the number of examples in
S belonging to class $c_i$ divided by the total
number of examples in S.
14
Entropy: Formal Definition, Boolean
Classification
Given a collection S, containing positive (+) and
negative (-) examples of some target concept, the
entropy of S relative to this Boolean
classification (+ or -) is
$$\mathrm{Entropy}(S) = -p_{+} \log_2 p_{+} - p_{-} \log_2 p_{-}$$
where S is a collection of examples, $p_{+}$ is
the proportion of positive examples in S, and $p_{-}$
is the proportion of negative examples in S.
In all calculations involving entropy we define
0 log 0 to be 0.
15
An Example: Entropy (1)
Values(Wind) = {Weak, Strong}
Suppose that S is a
collection of 14 examples of some Boolean
concept, including 9 positive and 5 negative
examples: S = [9+, 5-]. Suppose 6 of the
positive and 2 of the negative examples have
Wind = Weak; the remainder have Wind = Strong:
S_weak = [6+, 2-], S_strong = [3+, 3-]
16
An Example: Entropy (2)
S = [9+, 5-]
Entropy([9+, 5-])
= -(9/14) log2(9/14) - (5/14) log2(5/14)
= 0.940
Note: the entropy is 1 when the
collection contains an equal number of positive
and negative examples; the entropy is 0 when all
members of the collection are positive or all
members of the collection are negative.
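The entropy calculation for a collection described by its class counts can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides; the function name `entropy` and the list-of-counts input are illustrative assumptions.

```python
from math import log2


def entropy(counts):
    """Entropy (in bits) of a collection given its per-class example counts.

    Classes with zero examples are skipped, which implements the
    convention that 0 log 0 is defined to be 0.
    """
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


e_s = entropy([9, 5])       # the [9+, 5-] collection from the example
e_even = entropy([7, 7])    # equal numbers of positive and negative examples
e_pure = entropy([14, 0])   # all examples in one class
print(round(e_s, 3), e_even, abs(e_pure))
```

Passing counts rather than raw examples keeps the sketch close to the [9+, 5-] notation used on the slides.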
17
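Gain
$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \, \mathrm{Entropy}(S_v)$$
where Values(A) is the set of all possible
values for attribute A, and $S_v$ is the
subset of S for which attribute A has value v
(i.e. $S_v = \{ s \in S \mid A(s) = v \}$).
Reminder: Gain(S, A) is the expected reduction in
entropy caused by knowing the value of attribute
A.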
18
An Example: Gain
S_weak = [6+, 2-], S_strong = [3+, 3-]
Entropy(S_weak) = 0.811, Entropy(S_strong) = 1.00
Gain(S, Wind) = 0.940 - (8/14) 0.811 - (6/14) 1.00
= 0.048
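The gain for Wind can be reproduced numerically. The sketch below is an illustration under assumptions: the names `entropy` and `information_gain` and the representation of each subset as a list of [positive, negative] counts are not from the slides.

```python
from math import log2


def entropy(counts):
    """Entropy in bits from per-class counts (0 log 0 treated as 0)."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, subsets):
    """Gain = Entropy(S) minus the size-weighted entropy of the subsets S_v.

    parent_counts: class counts of S, e.g. [9, 5]
    subsets: class counts of each S_v, e.g. [[6, 2], [3, 3]]
    """
    total = sum(parent_counts)
    remainder = sum(sum(s) / total * entropy(s) for s in subsets)
    return entropy(parent_counts) - remainder


# S = [9+, 5-], S_weak = [6+, 2-], S_strong = [3+, 3-]
gain_wind = information_gain([9, 5], [[6, 2], [3, 3]])
print(round(gain_wind, 3))
```

A split that leaves every subset with the same class mix as its parent yields a gain of zero, which is why low-gain attributes are poor candidates for a test node.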
19
An Illustrative Example (1), adapted from
[Quinlan, 1986]
20
An Illustrative Example (2)
The weather attributes are Outlook, Temperature,
Humidity, and Wind speed. They have the following
values:
Outlook = {Sunny, Overcast, Rain}
Temperature = {Hot, Mild, Cool}
Humidity = {High, Normal}
Wind = {Weak, Strong}
We need to
calculate the proportions of positive and
negative training cases that are currently
available at a node.
21
An Illustrative Example (3)
  • We need to find which attribute will be the root
    of the decision tree:
  • Gain(S, Outlook) = ?
  • Gain(S, Humidity) = ?
  • Gain(S, Wind) = ?
  • Gain(S, Temperature) = ?

where S denotes the collection of training
examples from the Table.
Entropy(S) = Entropy([9+, 5-])
= -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
Outlook = {Sunny, Overcast, Rain}: [2+, 3-],
[4+, 0-], [3+, 2-]
Gain(S, Outlook) = Entropy(S) - (5/14) Entropy(Outlook = Sunny)
- (4/14) Entropy(Outlook = Overcast)
- (5/14) Entropy(Outlook = Rain) = 0.247
22
An Illustrative Example (4)
We need to find which attribute will be the root
of the decision tree:
Gain(S, Outlook) = 0.247
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Gain(S, Temperature) = 0.029
Outlook has the highest gain, so it is selected as the root.
Here S denotes the collection of training
examples from the Table.
23
An Illustrative Example (5)
Outlook: {D1, D2, ..., D14} [9+, 5-]
  Sunny: {D1, D2, D8, D9, D11} [2+, 3-]
  Overcast: {D3, D7, D12, D13} [4+, 0-]
  Rain: {D4, D5, D6, D10, D14} [3+, 2-]
24
An Illustrative Example (6)
S_sunny = {D1, D2, D8, D9, D11}, Entropy(S_sunny) = 0.970
Gain(S_sunny, Temperature) = 0.970 - (2/5) 0.0 - (2/5) 1.0 - (1/5) 0.0
= 0.570
Gain(S_sunny, Humidity) = 0.970 - (3/5) 0.0 - (2/5) 0.0
= 0.970
Gain(S_sunny, Wind) = 0.970 - (2/5) 1.0 - (3/5) 0.918
= 0.019
i.e. Gain(S_sunny, Humidity) = 0.970 is the highest, so the Humidity attribute
should be selected as the node under the Sunny branch.
25
An Illustrative Example (7)
[Figure: partially grown decision tree with branches labelled Sunny and Rain]
26
An Illustrative Example (8)
The final result is a decision tree. The decision
tree can be converted into a rule set:
If Outlook = Sunny And Humidity = High Then PlayTennis = no
If Outlook = Sunny And Humidity = Normal Then PlayTennis = yes
If Outlook = Overcast Then PlayTennis = yes
If Outlook = Rain And Wind = Strong Then PlayTennis = no
If Outlook = Rain And Wind = Weak Then PlayTennis = yes
[Figure: final decision tree with branches labelled Sunny and Rain]
27
ID3 Algorithm: Summary
  • Input: a training set of examples
  • Output: a decision tree
  • Steps:
  • If all members of the collection are positive or
    all are negative, then terminate the process and return
    a leaf labelled with that class.
  • Else
  • Compute the Information Gain for all attributes,
    select the attribute with the highest Gain, and
    create a root node for that attribute
  • Make a branch from the root for every value of
    the root attribute
  • Assign members to branches
  • Recursively build the subtree found by
    following each branch.
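The steps above can be sketched in Python. The training table is not reproduced in this transcript, so the sketch assumes the widely used 14-row PlayTennis table (rows D1 to D14 in order), which agrees with the counts quoted on the slides. The tree format (a leaf is a label, an internal node is a pair of attribute and branches) and all function names are illustrative choices, not the presenter's code.

```python
from collections import Counter
from math import log2

ATTRS = ("Outlook", "Temperature", "Humidity", "Wind")
# Assumed training table (D1..D14); the last column is the class PlayTennis.
TABLE = """Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No"""
EXAMPLES = [(dict(zip(ATTRS, row.split()[:4])), row.split()[4])
            for row in TABLE.splitlines()]


def entropy(labels):
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def gain(examples, attr):
    """Expected reduction in entropy from splitting examples on attr."""
    remainder = 0.0
    for value in {x[attr] for x, _ in examples}:
        subset = [y for x, y in examples if x[attr] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy([y for _, y in examples]) - remainder


def id3(examples, attrs):
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:          # all positive or all negative: leaf
        return labels[0]
    if not attrs:                      # no attribute left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(examples, a))
    rest = [a for a in attrs if a != best]
    branches = {}
    for value in sorted({x[best] for x, _ in examples}):
        subset = [(x, y) for x, y in examples if x[best] == value]
        branches[value] = id3(subset, rest)   # recurse on each branch
    return (best, branches)


def classify(tree, x):
    """Follow the tests from the root until a leaf (class name) is reached."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[x[attr]]
    return tree


TREE = id3(EXAMPLES, list(ATTRS))
print(TREE)
```

Branches are only created for attribute values that actually occur in the examples at a node, so classifying an object with an unseen value raises a `KeyError` in this sketch; a fuller implementation would need a fallback such as the majority class.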

28
Quality Measures: Measures of Split Goodness
  • The GINI Index of Diversity
  • Gain-Ratio Measure
  • Marshall Correction
  • The G Statistic
  • The Chi-Square Contingency Table Statistic

A comparison of a number of statistics [Mingers,
1989] shows that the predictive accuracy of the
induced decision trees is not sensitive to the
choice of statistic.
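Two of the listed measures are short enough to illustrate. The slides only name them, so the sketch below uses the standard textbook definitions as an assumption: the Gini index as one minus the sum of squared class proportions, and the gain ratio as information gain divided by the entropy of the split sizes. The function names are illustrative.

```python
from math import log2


def entropy(counts):
    """Entropy in bits from per-class counts (0 log 0 treated as 0)."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def gini(counts):
    """Gini index of diversity: 1 - sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gain_ratio(parent_counts, subsets):
    """Information gain divided by the split information of the partition."""
    total = sum(parent_counts)
    gain = entropy(parent_counts) - sum(
        sum(s) / total * entropy(s) for s in subsets)
    split_info = entropy([sum(s) for s in subsets])
    # A split into a single non-empty subset carries no split information.
    return gain / split_info if split_info > 0 else 0.0


gini_s = gini([9, 5])                                 # S = [9+, 5-]
ratio_wind = gain_ratio([9, 5], [[6, 2], [3, 3]])     # split on Wind
print(round(gini_s, 3), round(ratio_wind, 3))
```

Dividing by the split information penalises attributes with many values, which plain information gain tends to favour.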
29
Learning Model Classification Task
30
Decision Tree Windowing
  • The window is a subset of the training data.
    (Windowing is used if the training set is very
    large.)
  • The window is chosen randomly to build an initial
    tree.
  • This tree correctly classifies all objects in the
    window.
  • All other objects in the training set are then
    classified using the tree.

If the tree gives the correct answer for all
these objects then it is correct for the entire
training set and the process terminates. If not,
a selection of the incorrectly classified objects
is added to the window and the process continues.
The windowing procedure may produce a different
tree each time it is run, since the initial window
is chosen randomly.
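The windowing loop can be sketched independently of the tree learner. In the sketch below the learner is passed in as a function; `memorising_learner` is a hypothetical stand-in (not ID3) that classifies its window correctly, which is the only property the loop relies on. For simplicity all misclassified objects are added to the window, whereas the slides speak of a selection of them.

```python
import random
from collections import Counter


def memorising_learner(window):
    """Stand-in learner: remembers the window, else predicts the majority class."""
    table = {features: label for features, label in window}
    default = Counter(label for _, label in window).most_common(1)[0][0]
    return lambda features: table.get(features, default)


def windowing(examples, learn, initial_size, seed=0):
    """Grow a window until the learned model fits the whole training set.

    examples: list of (features, label) pairs; initial_size must be >= 1.
    Returns the final model and the final window size.
    """
    rng = random.Random(seed)
    order = list(range(len(examples)))
    rng.shuffle(order)                       # random initial window
    window, rest = order[:initial_size], order[initial_size:]
    while True:
        model = learn([examples[i] for i in window])
        wrong = [i for i in rest
                 if model(examples[i][0]) != examples[i][1]]
        if not wrong:                        # correct on all other objects
            return model, len(window)
        window += wrong                      # add the misclassified objects
        rest = [i for i in rest if i not in wrong]


# Tiny illustrative data set: the XOR of two binary features.
EXAMPLES = [((a, b), a ^ b) for a in (0, 1) for b in (0, 1)]
MODEL, WINDOW_SIZE = windowing(EXAMPLES, memorising_learner, initial_size=1)
print(WINDOW_SIZE)
```

The loop terminates because every iteration either stops or moves at least one object into the window, and the window can grow no larger than the training set.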
31
Issues in Decision Tree Learning
So far we have assumed that the training data consist of
entirely accurate information.
Reminder: Types of Attributes. Attributes can be,
basically, of two types:
  • nominal (or categorical),
    with symbolic, unordered values (e.g. letters of
    the alphabet), i.e. with values
    possible only from a predefined and finite set of
    labels;
  • numeric (i.e. numerical attributes), if
    values are expressed by numbers
    instead of letters.
A nominal value gives a name (or names) rather than
measuring an amount.
32
Issues in Decision Tree Learning: Problems with
Data
  • Missing values pose an obvious problem. It is
    not clear which branch should be taken when a
    node tests an attribute whose value is missing.
  • If the training set includes noisy data (values may
    not always be correct) or if the attributes are
    not always sufficient to classify the data
    correctly, the tree should only include branches
    that are justified adequately.
  • Unknown Attribute Values: training sets may
    have records with unknown attribute values. A DT
    algorithm, e.g. C4.5, considers the gain ratio
    for only the records where the attribute value is
    defined.
  • Continuous Attribute Values
33
Issues in Decision Tree Learning
  • How deeply to grow the decision tree
  • Choosing an appropriate attribute selection
    measure
  • Improving computational efficiency

34
Decision Tree Overfitting
Overfitting occurs when an alternative decision tree
exists that has more classification errors on the
training set, but fewer classification errors on
the test set.
Hypothesis h overfits the training data set if there exists an
alternative hypothesis h' such that
error_train(h) < error_train(h') and error_test(h) >
error_test(h').
Causes: sample too small
(decisions based on too little data), noise,
coincidence.
35
Effect of Noise in Training Data: an Example
ID3 produced the original tree h for the training
instances in Table 1.
Consider the effect of adding an incorrectly labelled
instance <Outlook = Sunny, Temperature = Hot,
Humidity = Normal, Wind = Strong, PlayTennis = No>.
[Figure: decision tree with branches labelled Sunny and Rain; the node affected by the new instance is marked "!?"]
The tree may fit noise or other coincidental regularities.
36
Decision Tree Overfitting
<Sunny, Hot, Normal, Strong, -> The example is noisy
because the correct label is +. The previously
constructed tree h misclassifies it. The new
hypothesis h' is expected to perform worse than
h. How shall h be revised (incremental
learning)?
37
Decision Tree Overfitting
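[Figure: accuracy on the training data and on the testing data plotted against the size of the tree (number of nodes); the optimal tree size is marked, and the region beyond it is labelled overfitting]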
Overfitting can cause strong performance
degradation.
38
Decision Tree Overfitting
  • There are several approaches to avoiding
    over-fitting in decision tree learning.
  • These can be grouped into two classes
  • approaches that stop growing the tree earlier,
    before it reaches the point where it perfectly
    classifies the training data
  • approaches that allow the tree to over-fit the
    data, and then post prune the tree.

39
Decision Tree Pruning
Pruning a decision node consists of removing the
subtree rooted at that node, making a leaf node,
and assigning it the most common classification
of the training examples affiliated with that
node.
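The pruning operation can be illustrated on a small tree. This is a sketch under assumptions: the tree format (a leaf is a class label, an internal node is a pair of attribute and branches), the function name `prune`, and the five Sunny training examples (taken from the commonly used PlayTennis table, which matches the [2+, 3-] count on the slides) are all illustrative.

```python
from collections import Counter

TREE = ("Outlook", {
    "Sunny": ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain": ("Wind", {"Strong": "No", "Weak": "Yes"}),
})

# Assumed training examples that reach the Sunny branch: (attributes, class).
SUNNY_EXAMPLES = [
    ({"Outlook": "Sunny", "Humidity": "High"}, "No"),
    ({"Outlook": "Sunny", "Humidity": "High"}, "No"),
    ({"Outlook": "Sunny", "Humidity": "High"}, "No"),
    ({"Outlook": "Sunny", "Humidity": "Normal"}, "Yes"),
    ({"Outlook": "Sunny", "Humidity": "Normal"}, "Yes"),
]


def prune(tree, path, examples):
    """Return a copy of tree with the node reached via path turned into a leaf.

    path: branch values to follow from the root, e.g. ["Sunny"].
    The new leaf gets the most common class of the training examples
    that reach the pruned node.
    """
    if not path:
        return Counter(label for _, label in examples).most_common(1)[0][0]
    attribute, branches = tree
    value = path[0]
    reaching = [(x, y) for x, y in examples if x[attribute] == value]
    new_branches = dict(branches)            # leave the original tree intact
    new_branches[value] = prune(branches[value], path[1:], reaching)
    return (attribute, new_branches)


PRUNED = prune(TREE, ["Sunny"], SUNNY_EXAMPLES)
print(PRUNED)
```

Whether a given node should be pruned is a separate decision, typically made by comparing the accuracy of the tree before and after pruning on data that was not used to grow it.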
40
Decision Tree Pruning Overfitting
  • Basic Approaches are
  • Prevention
  • Pre-pruning (avoidance)
  • Post-pruning (recovery)

41
C4.5
  • C4.5 is an extension of the base algorithm ID3.
  • Some of the additional features are
  • Incorporation of numerical (continuous)
    attributes
  • Can deal with incomplete information (missing
    attribute values)
  • Nominal (discrete) values of a single attribute
    may be grouped together to support more complex
    tests.

42
The Strengths of Decision Tree Methods
  • Decision trees perform classification without
    requiring much computation.
  • Decision trees are able to generate
    understandable rules.
  • Decision trees provide a clear indication of
    which fields are most important for prediction or
    classification.
  • Decision trees are able to handle both continuous
    and categorical variables.

43
The Weaknesses of Decision Tree Methods
  • Decision trees are less appropriate for
    estimation tasks where the goal is to predict the
    value of a continuous variable such as income,
    blood pressure, or interest rate.
  • Most decision tree algorithms examine a single
    field at a time.
  • Decision trees are prone to errors in
    classification problems with many classes and
    relatively small number of training examples.
  • Decision trees are computationally expensive to
    train.