Chapter 4 Classification and Scoring

Description:

Why the most compact? Occam's razor principle. UIC - CS 594. B. Liu. 15 ... Building a compact tree ... Class Association Rules (CARs) Mining rules with a fixed target ... – PowerPoint PPT presentation

Slides: 48

Provided by: dis12

Learn more at: https://www.cs.uic.edu

Transcript and Presenter's Notes

1
Chapter 4: Classification and Scoring
2
An example application

An emergency room in a hospital measures 17
variables (e.g., blood pressure, age, etc.) of
newly admitted patients. A decision has to be
taken whether to put the patient in an
intensive-care unit. Due to the high cost of ICU,
those patients who may survive less than a month
are given higher priority. The problem is to
predict high-risk patients and discriminate them
from low-risk patients.

3
Another application

A credit card company typically receives
thousands of applications for new cards. The
application contains information regarding
several different attributes, such as annual
salary, any outstanding debts, age, etc. The
problem is to categorize applications into those
who have good credit, bad credit, or fall into a
gray area (thus requiring further human
analysis).

4
Classification

Data: It has k attributes A1, ..., Ak. Each tuple (case or example) is described by values of the attributes and a class label.
Goal: To learn rules or to build a model that can be used to predict the classes of new (or future or test) cases.
The data used for building the model is called the training data.

5
An example data
6
Classification: A Two-Step Process

Model construction: describing a set of predetermined classes based on a training set. It is also called learning.
Each tuple/sample is assumed to belong to a predefined class
The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: for classifying future test data/objects
Estimate accuracy of the model
The known label of each test example is compared with the classified result from the model
Accuracy rate is the percentage of test cases that are correctly classified by the model
If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.
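
The accuracy estimate described above can be sketched in a few lines of Python. The function name `accuracy` and the toy label lists are hypothetical, chosen only for illustration.

```python
def accuracy(true_labels, predicted_labels):
    """Percentage of test cases whose predicted label matches the known label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)

# Hypothetical test set: known labels vs. labels produced by some model.
known = ["yes", "no", "yes", "no"]
predicted = ["yes", "no", "no", "no"]
print(accuracy(known, predicted))  # 3 of 4 correct
```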

7
Classification Process (1): Model Construction
Classification Algorithms
IF rank = "professor" OR years > 6 THEN tenured = "yes"
8
Classification Process (2): Use the Model in Prediction
(Jeff, Professor, 4)
Tenured?
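
As a minimal sketch, the learned rule from the model construction slide can be applied to the unseen tuple (Jeff, Professor, 4). The function name `is_tenured` is an assumption made for this illustration.

```python
def is_tenured(rank: str, years: int) -> str:
    """Apply the rule: IF rank = professor OR years > 6 THEN tenured = yes."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

# The unseen tuple from the slide: (Jeff, Professor, 4)
name, rank, years = "Jeff", "Professor", 4
print(name, "tenured?", is_tenured(rank, years))
```

Jeff matches the first condition of the rule (rank is professor), so the model predicts tenured = yes even though years is below 6.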
9
Supervised vs. Unsupervised Learning

Supervised learning: classification is seen as supervised learning from examples.
Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the classes of the observations/cases.
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

10
Evaluating Classification Methods

Predictive accuracy
Speed and scalability
time to construct the model
time to use the model
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
understandability of and insight provided by the model
Compactness of the model: size of the tree, or the number of rules.

11
Different classification techniques

There are many techniques for classification
Decision trees
Naïve Bayesian classifiers
Using association rules
Neural networks
Logistic regression
and many more ...

12
Building a decision tree: an example training dataset
13
Output: A Decision Tree for buys_computer
[Figure: decision tree. Root node age? with branches <=30, 30..40, and >40; internal nodes student? (branches no, yes) and credit rating? (branches fair, excellent); leaves labeled yes or no.]
14
Inducing a decision tree

There are many possible trees
let's try it on credit data
How to find the most compact one that is consistent with the data?
Why the most compact?
Occam's razor principle

15
Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive
manner
At start, all the training examples are at the
root
Attributes are categorical (we will talk about
continuous-valued attributes later)
Examples are partitioned recursively based on
selected attributes
Test attributes are selected on the basis of a
heuristic or statistical measure (e.g.,
information gain)
Conditions for stopping partitioning
All examples for a given node belong to the same class
There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
There are no examples left
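
The greedy top-down procedure above can be sketched as follows. This is an illustrative sketch for categorical attributes only, not the exact ID3/C4.5 implementation; the function names and the four-row toy data are hypothetical. Because it branches only on attribute values that actually occur at a node, the "no examples left" case does not arise here.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Entropy before the split minus expected entropy after splitting on attr."""
    n = len(rows)
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs):
    """Return a class label (leaf) or a nested dict {attr: {value: subtree}}."""
    if len(set(labels)) == 1:            # all examples belong to the same class
        return labels[0]
    if not attrs:                        # no attributes left: majority voting
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    rest = [a for a in attrs if a != best]
    tree = {best: {}}
    for value in sorted(set(r[best] for r in rows)):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        tree[best][value] = build_tree([rows[i] for i in idx],
                                       [labels[i] for i in idx], rest)
    return tree

# Hypothetical toy data: attribute "a" determines the class, "b" is irrelevant.
rows = [{"a": "x", "b": "p"}, {"a": "x", "b": "q"},
        {"a": "y", "b": "p"}, {"a": "y", "b": "q"}]
labels = ["yes", "yes", "no", "no"]
print(build_tree(rows, labels, ["a", "b"]))
```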

16
Building a compact tree

The key to building a decision tree is deciding which attribute to branch on.
The heuristic is to choose the attribute with the
maximum Information Gain based on information
theory.
Another explanation is to reduce uncertainty as
much as possible.

17
Information theory

Information theory provides a mathematical basis
for measuring the information content.
To understand the notion of information, think
about it as providing the answer to a question,
for example, whether a coin will come up heads.
If one already has a good guess about the answer,
then the actual answer is less informative.
If one already knows that the coin is rigged so that it will come up heads with probability 0.99, then a message (advance information) about the actual outcome of a flip is worth less than it would be for an honest coin.

18
Information theory (cont.)

For a fair (honest) coin, you have no information, and you are willing to pay more (say, in terms of $) for advance information: the less you know, the more valuable the information.
Information theory uses this same intuition, but instead of measuring the value of information in dollars, it measures information content in bits. One bit of information is enough to answer a yes/no question about which one has no idea, such as the flip of a fair coin.

19
Information theory

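In general, if the possible answers v_i have probabilities P(v_i), then the information content I (entropy) of the actual answer is given by
$$I(P(v_1), \ldots, P(v_n)) = \sum_{i=1}^{n} -P(v_i) \log_2 P(v_i)$$
For example, for the tossing of a fair coin we get
$$I\left(\tfrac{1}{2}, \tfrac{1}{2}\right) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1 \text{ bit}$$
If the coin is loaded to give 99% heads we get I = 0.08 bits, and as the probability of heads goes to 1, the information of the actual answer goes to 0.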
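
The two coin values can be checked numerically. This is a minimal sketch; the function name `entropy` is an assumption, and it uses the standard definition of information content as the sum of -p log2 p over the possible answers.

```python
import math

def entropy(probs):
    """Information content in bits: sum of -p * log2(p), skipping p = 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))               # fair coin: 1.0 bit
print(round(entropy([0.99, 0.01]), 2))   # coin loaded to 99% heads: 0.08 bits
```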

20
Back to decision tree learning

For a given example, what is the correct
classification?
We may think of a decision tree as conveying
information about the classification of examples
in the table (of examples)
The entropy measure characterizes the (im)purity
of an arbitrary collection of examples.

21
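Attribute Selection Measure: Information Gain (ID3/C4.5)

S contains s_i tuples of class C_i for i = 1, ..., m, with s tuples in total.
The information (entropy) required to classify an arbitrary tuple is
$$I(s_1, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}$$
Assume a set of training examples, S. If we make attribute A, with v values, the root of the current tree, this will partition S into v subsets S_1, ..., S_v, where S_j contains s_{ij} tuples of class C_i. The expected information needed to complete the tree after making A the root is
$$E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} \, I(s_{1j}, \ldots, s_{mj})$$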

22
Information gain

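The information gained by branching on attribute A is
$$Gain(A) = I(s_1, \ldots, s_m) - E(A)$$
where E(A) is the expected information still needed to complete the tree after making A the root.
We will choose the attribute with the highest information gain to branch the current tree.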

23
Attribute Selection by info gain

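Class P: buys_computer = "yes"
Class N: buys_computer = "no"
I(p, n) = I(9, 5) = 0.940
Compute the entropy for age:
$$E(age) = \frac{5}{14} I(2, 3) + \frac{4}{14} I(4, 0) + \frac{5}{14} I(3, 2) = 0.694$$
Here (5/14) I(2, 3) means age <= 30 has 5 out of 14 samples, with 2 yeses and 3 nos. Hence
$$Gain(age) = I(p, n) - E(age) = 0.246$$
Similarly,
$$Gain(income) = 0.029, \quad Gain(student) = 0.151, \quad Gain(credit\_rating) = 0.048$$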
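
The information gain for age can be reproduced with a short script. This is a sketch under stated assumptions: the counts I(9, 5) and the (2 yes, 3 no) split for age <= 30 come from the slide, while the splits assumed for the other two age groups, (4, 0) and (3, 2), are taken from the standard 14-row buys_computer dataset and are not shown in this transcript. The helper name `info` is hypothetical.

```python
import math

def info(*counts):
    """Entropy in bits of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# (yes, no) counts per age value; the last two rows are assumptions (see above).
age_partitions = {"<=30": (2, 3), "31..40": (4, 0), ">40": (3, 2)}
n = sum(sum(c) for c in age_partitions.values())   # 14 samples

i_all = info(9, 5)                                  # entropy before splitting
e_age = sum(sum(c) / n * info(*c) for c in age_partitions.values())
gain_age = i_all - e_age

print(round(i_all, 3), round(e_age, 3), round(gain_age, 3))
```

With unrounded intermediate values the gain comes out near 0.247; rounding I(9, 5) and E(age) to three decimals before subtracting gives 0.246.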

24
We build the following tree
[Figure: decision tree. Root node age? with branches <=30, 30..40, and >40; internal nodes student? (branches no, yes) and credit rating? (branches fair, excellent); leaves labeled yes or no.]
25
Extracting Classification Rules from Trees

Represent the knowledge in the form of IF-THEN
rules
One rule is created for each path from the root
to a leaf
Each attribute-value pair along a path forms a
conjunction. The leaf node holds the class
prediction
Rules are easier for humans to understand
Example
IF age = "<=30" AND student = "no" THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31..40" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "no"
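
Rule extraction amounts to walking every root-to-leaf path. The sketch below assumes the tree is stored as a nested dictionary, a representation chosen here for illustration; the function name `extract_rules` is hypothetical. The leaf labels follow the rules listed on the slide, with the credit_rating test assumed to sit under the age > 40 branch.

```python
def extract_rules(tree, conditions=(), target="buys_computer"):
    """Return one IF-THEN rule string per root-to-leaf path of a nested-dict tree."""
    if not isinstance(tree, dict):                    # leaf: holds the class prediction
        lhs = " AND ".join(f'{a} = "{v}"' for a, v in conditions) or "TRUE"
        return [f'IF {lhs} THEN {target} = "{tree}"']
    (attr, branches), = tree.items()                  # internal node tests one attribute
    rules = []
    for value, subtree in branches.items():
        rules.extend(extract_rules(subtree, conditions + ((attr, value),), target))
    return rules

tree = {"age": {
    "<=30": {"student": {"no": "no", "yes": "yes"}},
    "31..40": "yes",
    ">40": {"credit_rating": {"excellent": "yes", "fair": "no"}},
}}

for rule in extract_rules(tree):
    print(rule)
```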

26
Avoid Overfitting in Classification

Overfitting: A tree may overfit the training data
Good accuracy on training data but poor on test examples
Too many branches, some may reflect anomalies due to noise or outliers
Two approaches to avoid overfitting
Prepruning: Halt tree construction early
Difficult to decide when to stop
Postpruning: Remove branches from a fully grown tree to get a sequence of progressively pruned trees.
This method is commonly used (based on a validation set, a statistical estimate, or MDL)

27
Enhancements to basic decision tree induction

Allow for continuous-valued attributes
Dynamically define new discrete-valued attributes
that partition the continuous attribute value
into a discrete set of intervals
Handle missing attribute values
Assign the most common value of the attribute
Assign probability to each of the possible values
Attribute construction
Create new attributes based on existing ones that
are sparsely represented.
This reduces fragmentation, repetition, and
replication

28
Bayesian Classification: Why?

Probabilistic learning: Classification learning can also be seen as computing P(C = c | d), i.e., given a data tuple d, what is the probability that d is of class c? (C is the class attribute.)
How?

29
Naïve Bayesian Classifier

Let A1 through Ak be attributes with discrete values. They are used to predict a discrete class C.
Given an example with observed attribute values a1 through ak.
The prediction is the class c such that
P(C = c | A1 = a1 ∧ ... ∧ Ak = ak)
is maximal.

30
Compute Probabilities

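By Bayes' rule, the above can be expressed as
$$P(C = c \mid A_1 = a_1 \wedge \cdots \wedge A_k = a_k) = \frac{P(A_1 = a_1 \wedge \cdots \wedge A_k = a_k \mid C = c)\, P(C = c)}{P(A_1 = a_1 \wedge \cdots \wedge A_k = a_k)}$$
P(C = c) can be easily estimated from training data.
P(A1 = a1 ∧ ... ∧ Ak = ak) is irrelevant for decision making since it is the same for every class value c.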

31
Computing probabilities

We only need P(A1 = a1 ∧ ... ∧ Ak = ak | C = c), which can be written as
P(A1 = a1 | A2 = a2 ∧ ... ∧ Ak = ak, C = c) × P(A2 = a2 ∧ ... ∧ Ak = ak | C = c)
Recursively, the second factor above can be written in the same way, and so on.

32
Computing probabilities

Now suppose we assume that all attributes are conditionally independent given the class c.
Formally, we assume
P(A1 = a1 | A2 = a2 ∧ ... ∧ Ak = ak, C = c) = P(A1 = a1 | C = c)
and so on for A2 through Ak.
We are done.
How do we estimate P(A1 = a1 | C = c)?

33
Training dataset
Class: C1: buys_computer = "yes"; C2: buys_computer = "no"
Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair)
34
An Example

Compute P(Ai = ai | C = c) for each class:
P(age <= 30 | buys_computer = yes) = 2/9 = 0.222
P(age <= 30 | buys_computer = no) = 3/5 = 0.6
P(income = medium | buys_computer = yes) = 4/9 = 0.444
P(income = medium | buys_computer = no) = 2/5 = 0.4
P(student = yes | buys_computer = yes) = 6/9 = 0.667
P(student = yes | buys_computer = no) = 1/5 = 0.2
P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
P(credit_rating = fair | buys_computer = no) = 2/5 = 0.4
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
P(X | C = c):
P(X | buys_computer = yes) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X | buys_computer = no) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X | C = c) P(C = c):
P(X | buys_computer = yes) P(buys_computer = yes) = 0.028
P(X | buys_computer = no) P(buys_computer = no) = 0.007
X belongs to class buys_computer = yes
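
The worked example can be reproduced in a few lines. This is a sketch that hard-codes the conditional probabilities listed above rather than estimating them from a table; the priors 9/14 and 5/14 are taken from the class counts I(9, 5) used in the information gain example, and the dictionary keys are hypothetical names.

```python
from math import prod

# Class priors P(C = c), from 9 "yes" and 5 "no" training tuples.
priors = {"yes": 9 / 14, "no": 5 / 14}

# Conditional probabilities P(Ai = ai | C = c) for the attribute values of X.
cond = {
    "yes": {"age<=30": 2 / 9, "income=medium": 4 / 9,
            "student=yes": 6 / 9, "credit_rating=fair": 6 / 9},
    "no": {"age<=30": 3 / 5, "income=medium": 2 / 5,
           "student=yes": 1 / 5, "credit_rating=fair": 2 / 5},
}

x = ["age<=30", "income=medium", "student=yes", "credit_rating=fair"]

# Naive Bayes score: P(X | C = c) * P(C = c), with P(X | c) a product of factors.
scores = {c: priors[c] * prod(cond[c][f] for f in x) for c in priors}
prediction = max(scores, key=scores.get)

for c, s in scores.items():
    print(c, round(s, 3))
print("predicted class:", prediction)
```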

35
On Naïve Bayesian Classifier

Advantages
Easy to implement
Good results obtained in many applications
Disadvantages
Assumption: class conditional independence, therefore loss of accuracy when the assumption is not true.
Practically, dependencies exist
How to deal with these dependencies?
Bayesian Belief Networks

36
Use of Association Rules: Classification

Classification: mine a small set of rules existing in the data to form a classifier or predictor.
It has a target attribute (on the right side): the class attribute
Association: has no fixed target, but we can fix a target.

37
Class Association Rules (CARs)

Mining rules with a fixed target
The right-hand side of the rules is fixed to a single attribute, which can have a number of values
E.g., X = a, Y = d → Class = yes
X = b → Class = no
Call such rules class association rules

38
Mining Class Association Rules

Itemset in class association rules:
<condset, class_value>
condset: a set of items
item: an attribute-value pair, e.g., attribute1 = a
class_value: a value in the class attribute
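
To make the <condset, class_value> notation concrete, the sketch below counts how often a ruleitem holds in a small table. It assumes the usual association-rule definitions, which the slides do not spell out: support is the fraction of all rows matching both condset and class_value, and confidence is that count divided by the number of rows matching condset. The function name `rule_stats` and the four-row table are hypothetical.

```python
def rule_stats(rows, condset, class_value, class_attr="Class"):
    """Return (support, confidence) of the ruleitem <condset, class_value>."""
    def matches(row):
        return all(row.get(attr) == val for attr, val in condset.items())

    cond_count = sum(1 for r in rows if matches(r))
    rule_count = sum(1 for r in rows if matches(r) and r[class_attr] == class_value)
    support = rule_count / len(rows)
    confidence = rule_count / cond_count if cond_count else 0.0
    return support, confidence

# Hypothetical table-form data.
rows = [
    {"X": "a", "Y": "d", "Class": "yes"},
    {"X": "a", "Y": "d", "Class": "yes"},
    {"X": "a", "Y": "e", "Class": "no"},
    {"X": "b", "Y": "d", "Class": "no"},
]

print(rule_stats(rows, {"X": "a", "Y": "d"}, "yes"))  # X = a, Y = d -> Class = yes
print(rule_stats(rows, {"X": "b"}, "no"))             # X = b -> Class = no
```

A miner would keep only ruleitems whose support and confidence exceed user-chosen thresholds; this sketch evaluates one candidate at a time and does not implement the Apriori-style search itself.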

39
Classification Based on Associations (CBA)

Two steps
Find all class association rules
Using a modified Apriori algorithm
Build a classifier
There can be many ways, e.g.,
Choose a small set of rules to cover the data
Numeric attributes need to be discretized.

40
Advantages of the CBA Model

One algorithm performs 3 tasks
mine class association rules
build an accurate classifier (or predictor)
mine normal association rules
by treating class as a dummy in <condset, class_value>, so that condset = itemset

41
Advantages of the CBA Model

Existing classification systems use
Table data.
CBA can build classifiers using either
Table form data or
Transaction form data (sparse data)
CBA is able to find rules that existing
classification systems cannot.

42
Assoc. Rules can be Used in Many Ways for
Prediction

We have so many rules
Select a subset of rules
Using Bayesian probability together with the rules
Using rule combinations
A number of systems have been designed and
implemented.

43
Other classification techniques

Support vector machines
Logistic regression
K-nearest neighbor
Neural networks
Genetic algorithms
Etc.

44
How to Estimate Classification Accuracy or Error Rates

Partition: Training-and-testing
use two independent data sets, e.g., training set (2/3), test set (1/3)
used for data sets with a large number of examples
Cross-validation
divide the data set into k subsamples
use k-1 subsamples as training data and one subsample as test data: k-fold cross-validation
for data sets of moderate size
leave-one-out: for small data sets
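
The k-fold partitioning can be sketched as an index generator. This is a minimal illustration without shuffling; the function name `k_fold_indices` is hypothetical, and in practice the examples would usually be shuffled first.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for each of k folds over n examples."""
    if not 1 < k <= n:
        raise ValueError("k must satisfy 1 < k <= n")
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

# 10 examples, 3 folds: each example is used for testing exactly once.
for train, test in k_fold_indices(10, 3):
    print("train:", train, "test:", test)
```

Setting k equal to the number of examples gives leave-one-out: every fold tests on a single example and trains on all the others.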

45
Scoring the data

Scoring is related to classification.
Normally, we are only interested in a single class (called the positive class), e.g., the buyers class in a marketing database.
Instead of assigning each test example a definite
class, scoring assigns a probability estimate
(PE) to indicate the likelihood that the example
belongs to the positive class.

46
Ranking and lift analysis

After each example is given a score, we can rank
all examples according to their PEs.
We then divide the data into n (say 10) bins. A lift curve can be drawn according to how many positive examples are in each bin. This is called lift analysis.
Classification systems can be used for scoring.
Need to produce a probability estimate.
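
The ranking and binning step can be sketched as follows. The function names and the ten scored examples are hypothetical, and plotting the cumulative fraction of positives per bin is one common way to draw a lift curve, assumed here rather than taken from the slides.

```python
def positives_per_bin(scores, labels, n_bins=10):
    """Rank examples by score (highest first) and count positives in each bin."""
    ranked = [y for _, y in sorted(zip(scores, labels),
                                   key=lambda pair: pair[0], reverse=True)]
    n = len(ranked)
    counts = []
    for b in range(n_bins):
        lo, hi = b * n // n_bins, (b + 1) * n // n_bins
        counts.append(sum(ranked[lo:hi]))
    return counts

def cumulative_fraction(counts):
    """Cumulative share of all positives captured up to and including each bin."""
    total = sum(counts)
    running, curve = 0, []
    for c in counts:
        running += c
        curve.append(running / total)
    return curve

# Hypothetical probability estimates (PEs) and true labels (1 = positive class).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

counts = positives_per_bin(scores, labels, n_bins=5)
print(counts)
print(cumulative_fraction(counts))
```

A scoring model that ranks well concentrates the positives in the first few bins, so the cumulative curve rises quickly and then flattens.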

47
Lift curve
[Figure: lift curve plotted over bins 1 to 10.]
