Title: Chapter 4 Classification and Scoring
1Chapter 4 Classification and Scoring
2An example application
- An emergency room in a hospital measures 17
variables (e.g., blood pressure, age, etc.) of
newly admitted patients. A decision has to be
taken whether to put the patient in an
intensive-care unit. Due to the high cost of ICU,
those patients who may survive less than a month
are given higher priority. The problem is to
predict high-risk patients and discriminate them
from low-risk patients.
3Another application
- A credit card company typically receives
thousands of applications for new cards. The
application contains information regarding
several different attributes, such as annual
salary, any outstanding debts, age, etc. The
problem is to categorize applications into those
who have good credit, bad credit, or fall into a
gray area (thus requiring further human
analysis).
4Classification
- Data: It has k attributes A1, ..., Ak. Each tuple
(case or example) is described by values of the
attributes and a class label.
- Goal: To learn rules or to build a model that can
be used to predict the classes of new (or future
or test) cases.
- The data used for building the model is called
the training data.
5An example data
6Classification: A Two-Step Process
- Model construction: describing a set of
predetermined classes based on a training set. It
is also called learning.
- Each tuple/sample is assumed to belong to a
predefined class.
- The model is represented as classification rules,
decision trees, or mathematical formulae.
- Model usage: for classifying future test
data/objects.
- Estimate the accuracy of the model.
- The known label of each test example is compared with
the classified result from the model.
- Accuracy rate is the percentage of test cases that are
correctly classified by the model.
- If the accuracy is acceptable, use the model to
classify data tuples whose class labels are not
known.
7Classification Process (1): Model Construction
Classification Algorithms
IF rank = "professor" OR years > 6 THEN tenured = "yes"
8Classification Process (2): Use the Model in
Prediction
(Jeff, Professor, 4)
Tenured?
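The rule from the model-construction step can be applied to the unseen tuple with a few lines of Python. This is only a sketch: the function name `predict_tenured` and the string labels are illustrative choices, and "years gt 6" is read as strictly greater than 6.

```python
def predict_tenured(rank: str, years: int) -> str:
    """Apply the learned rule: IF rank = professor OR years > 6 THEN tenured = yes."""
    if rank.lower() == "professor" or years > 6:
        return "yes"
    return "no"


# The unseen tuple from the slide: (Jeff, Professor, 4)
name, rank, years = ("Jeff", "Professor", 4)
print(name, "tenured?", predict_tenured(rank, years))
```

Because the rule is a disjunction, the rank alone is enough to classify Jeff; the number of years only matters for ranks other than professor.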
9Supervised vs. Unsupervised Learning
- Supervised learning: classification is seen as
supervised learning from examples.
- Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the classes of the observations/cases.
- New data is classified based on the training set.
- Unsupervised learning (clustering)
- The class labels of the training data are unknown.
- Given a set of measurements, observations, etc.,
the aim is to establish the existence of
classes or clusters in the data.
10Evaluating Classification Methods
- Predictive accuracy
- Speed and scalability
- time to construct the model
- time to use the model
- Robustness: handling noise and missing values
- Scalability: efficiency in disk-resident
databases
- Interpretability
- understanding and insight provided by the model
- Compactness of the model: size of the tree, or
the number of rules.
11Different classification techniques
- There are many techniques for classification
- Decision trees
- Naïve Bayesian classifiers
- Using association rules
- Neural networks
- Logistic regression
- and many more ...
12Building a decision tree: an example training
dataset
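13Output: A Decision Tree for buys_computer
[Figure: decision tree. The root tests age?, with branches <=30, 31..40 and >40. The <=30 branch leads to student? (no -> no, yes -> yes), the 31..40 branch leads to the leaf yes, and the >40 branch leads to credit rating? (branches fair and excellent, each ending in a yes/no leaf).]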
14Inducing a decision tree
- There are many possible trees
- let's try it on a credit data set
- How to find the most compact one
- that is consistent with the data?
- Why the most compact?
- Occam's razor principle
15Algorithm for Decision Tree Induction
- Basic algorithm (a greedy algorithm)
- Tree is constructed in a top-down recursive
manner
- At start, all the training examples are at the
root
- Attributes are categorical (we will talk about
continuous-valued attributes later)
- Examples are partitioned recursively based on
selected attributes
- Test attributes are selected on the basis of a
heuristic or statistical measure (e.g.,
information gain)
- Conditions for stopping partitioning
- All examples for a given node belong to the same
class
- There are no remaining attributes for further
partitioning; majority voting is employed for
classifying the leaf
- There are no examples left
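The greedy, top-down procedure and its three stopping conditions can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function names, the nested-dict tree representation, and the toy data are all assumptions made for this sketch, and information gain is used as the selection heuristic.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def info_gain(rows, labels, attr):
    """Entropy of the node minus the expected entropy after splitting on attr."""
    total = len(rows)
    expected = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        expected += len(subset) / total * entropy(subset)
    return entropy(labels) - expected


def build_tree(rows, labels, attrs, domains, default=None):
    """Top-down recursive induction; returns a class label or a nested dict."""
    if not rows:                      # stop: no examples left
        return default
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:         # stop: all examples in one class
        return labels[0]
    if not attrs:                     # stop: no attributes left, majority vote
        return majority
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    remaining = [a for a in attrs if a != best]
    node = {best: {}}
    for value in domains[best]:
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in pairs]
        sub_labels = [l for _, l in pairs]
        node[best][value] = build_tree(sub_rows, sub_labels, remaining, domains, majority)
    return node


# Toy data invented for this sketch (not the chapter's training set).
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "play", "stay"]
domains = {"outlook": ["sunny", "rainy"], "windy": ["no", "yes"]}
tree = build_tree(rows, labels, ["outlook", "windy"], domains)
print(tree)
```

An empty partition inherits the majority class of its parent, which is one common way to handle the "no examples left" case; other choices are possible.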
16Building a compact tree
- The key to building a decision tree: which
attribute to choose in order to branch.
- The heuristic is to choose the attribute with the
maximum information gain, based on information
theory.
- Another explanation is to reduce uncertainty as
much as possible.
17Information theory
- Information theory provides a mathematical basis
for measuring information content.
- To understand the notion of information, think
about it as providing the answer to a question,
for example, whether a coin will come up heads.
- If one already has a good guess about the answer,
then the actual answer is less informative.
- If one already knows that the coin is rigged so
that it will come up heads with probability
0.99, then a message (advance information) about
the actual outcome of a flip is worth less than
it would be for an honest coin.
18Information theory (cont.)
- For a fair (honest) coin, you have no
information, and you are willing to pay more (say,
in dollars) for advance information: the less
you know, the more valuable the information.
- Information theory uses this same intuition, but
instead of measuring the value of information in
dollars, it measures information content in
bits. One bit of information is enough to answer
a yes/no question about which one has no idea,
such as the flip of a fair coin.
19Information theory
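- In general, if the possible answers v_i have
probabilities P(v_i), then the information content
I (entropy) of the actual answer is given by
$$I(P(v_1), \dots, P(v_n)) = \sum_{i=1}^{n} -P(v_i) \log_2 P(v_i)$$
- For example, for the tossing of a fair coin we
get
$$I(1/2, 1/2) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1 \text{ bit}$$
- If the coin is loaded to give 99% heads we get
$I(1/100, 99/100) = 0.08$ bits, and as the probability of heads goes to 1,
the information of the actual answer goes to 0.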
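The coin figures can be checked numerically. The sketch below computes the entropy as the sum of -P(v) log2 P(v) over the possible answers; the function name `information_content` is an illustrative choice.

```python
from math import log2


def information_content(probabilities):
    """Entropy in bits: sum of -P(v) * log2 P(v), skipping zero-probability answers."""
    return sum(-p * log2(p) for p in probabilities if p > 0)


fair = information_content([0.5, 0.5])       # fair coin
loaded = information_content([0.99, 0.01])   # coin that gives heads 99% of the time
certain = information_content([1.0, 0.0])    # outcome known in advance
print(f"fair: {fair:.2f} bits, loaded: {loaded:.2f} bits, certain: {certain:.2f} bits")
```

Zero-probability answers are skipped because log2(0) is undefined; by convention they contribute nothing to the sum.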
20Back to decision tree learning
- For a given example, what is the correct
classification?
- We may think of a decision tree as conveying
information about the classification of examples
in the table (of examples).
- The entropy measure characterizes the (im)purity
of an arbitrary collection of examples.
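21Attribute Selection Measure: Information Gain
(ID3/C4.5)
- S contains s_i tuples of class C_i for i = 1, ...,
m, and s tuples in total.
- The information measure gives the info (entropy) required to
classify any arbitrary tuple:
$$I(s_1, \dots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}$$
- Assume a set of training examples, S. If we make
attribute A, with v values, the root of the
current tree, this will partition S into v
subsets S_1, ..., S_v, where S_j contains s_ij tuples
of class C_i. The expected information needed to
complete the tree after making A the root is
$$E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \dots + s_{mj}}{s} \, I(s_{1j}, \dots, s_{mj})$$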
22Information gain
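- The information gained by branching on attribute A is
the entropy of the current node minus the expected
information still needed after the split:
$$Gain(A) = I(s_1, \dots, s_m) - E(A)$$
- We will choose the attribute with the highest
information gain to branch the current tree.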
23Attribute Selection by info gain
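- Class P: buys_computer = "yes"
- Class N: buys_computer = "no"
- I(p, n) = I(9, 5) = 0.940
- Compute the entropy for age, E(age), as the weighted
sum of I(p_i, n_i) over the age groups.
- The term (5/14) I(2, 3) means age "<=30" has 5 out of 14
samples, with 2 yeses and 3 nos.
- Hence Gain(age) = I(p, n) - E(age).
- Similarly, the gain is computed for each of the other attributes.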
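The arithmetic for age can be sketched in Python. The class counts (9 yes, 5 no) and the first age group (2 yes, 3 no) come from the text above. The counts for the other two age groups, (4, 0) and (3, 2), are assumptions taken from the commonly used version of this buys_computer example, since the training table is not reproduced here. The function names are illustrative.

```python
from math import log2


def info(*counts):
    """I(s1, ..., sm): entropy in bits of a node with the given class counts."""
    total = sum(counts)
    return sum(-c / total * log2(c / total) for c in counts if c > 0)


def expected_info(partitions):
    """E(A): weighted entropy of the subsets produced by splitting on A."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(*p) for p in partitions)


# (yes, no) counts per age group. Only the first pair is given in the text;
# the other two are ASSUMED for illustration.
age_partitions = [(2, 3), (4, 0), (3, 2)]

i_root = info(9, 5)
e_age = expected_info(age_partitions)
gain_age = i_root - e_age
print(f"I(9, 5) = {i_root:.3f}, E(age) = {e_age:.3f}, Gain(age) = {gain_age:.3f}")
```

Under these assumed counts the pure middle group contributes no entropy, which is why age yields a comparatively large gain.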
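24We build the following tree
[Figure: decision tree. The root tests age?, with branches <=30, 31..40 and >40. The <=30 branch leads to student? (no -> no, yes -> yes), the 31..40 branch leads to the leaf yes, and the >40 branch leads to credit rating? (branches fair and excellent, each ending in a yes/no leaf).]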
25Extracting Classification Rules from Trees
- Represent the knowledge in the form of IF-THEN
rules
- One rule is created for each path from the root
to a leaf
- Each attribute-value pair along a path forms a
conjunction. The leaf node holds the class
prediction
- Rules are easier for humans to understand
- Example
- IF age = "<=30" AND student = "no" THEN
buys_computer = "no"
- IF age = "<=30" AND student = "yes" THEN
buys_computer = "yes"
- IF age = "31..40" THEN buys_computer = "yes"
- IF age = ">40" AND credit_rating = "excellent"
THEN buys_computer = "yes"
- IF age = ">40" AND credit_rating = "fair" THEN
buys_computer = "no"
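Rule extraction is a walk over every root-to-leaf path. The sketch below assumes the tree is stored as nested dictionaries (a hypothetical encoding chosen for this example), with leaf labels following the example rules above and the credit_rating test placed under the >40 branch, as in the tree.

```python
def extract_rules(tree, conditions=()):
    """Yield one IF-THEN rule per root-to-leaf path of a nested-dict tree."""
    if not isinstance(tree, dict):            # leaf: holds the class prediction
        body = " AND ".join(f'{a} = "{v}"' for a, v in conditions) or "TRUE"
        yield f'IF {body} THEN buys_computer = "{tree}"'
        return
    (attribute, branches), = tree.items()     # internal node tests one attribute
    for value, subtree in branches.items():
        yield from extract_rules(subtree, conditions + ((attribute, value),))


# Hypothetical encoding of the buys_computer tree as nested dicts.
tree = {
    "age": {
        "<=30": {"student": {"no": "no", "yes": "yes"}},
        "31..40": "yes",
        ">40": {"credit_rating": {"excellent": "yes", "fair": "no"}},
    }
}

for rule in extract_rules(tree):
    print(rule)
```

Each recursive call extends the conjunction by one attribute-value pair, so the number of rules equals the number of leaves.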
26Avoid Overfitting in Classification
- Overfitting: A tree may overfit the training
data
- Good accuracy on training data but poor on test
examples
- Too many branches, some may reflect anomalies due
to noise or outliers
- Two approaches to avoid overfitting
- Prepruning: Halt tree construction early
- Difficult to decide when to halt
- Postpruning: Remove branches from a fully grown
tree to get a sequence of progressively pruned
trees.
- This method is commonly used (based on a validation
set, a statistical estimate, or MDL)
27Enhancements to basic decision tree induction
- Allow for continuous-valued attributes
- Dynamically define new discrete-valued attributes
that partition the continuous attribute value
into a discrete set of intervals
- Handle missing attribute values
- Assign the most common value of the attribute
- Assign probability to each of the possible values
- Attribute construction
- Create new attributes based on existing ones that
are sparsely represented.
- This reduces fragmentation, repetition, and
replication
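Two of these enhancements are easy to illustrate: filling a missing value with the most common value of the attribute, and mapping a continuous value to a discrete interval. The sketch below is illustrative only. The helper names and the sample columns are hypothetical, and the cut points are fixed by hand, whereas a real tree learner would choose them dynamically (for example, by information gain).

```python
from collections import Counter


def fill_most_common(values, missing=None):
    """Replace missing entries with the most common observed value."""
    observed = [v for v in values if v is not missing]
    most_common = Counter(observed).most_common(1)[0][0]
    return [most_common if v is missing else v for v in values]


def discretize(value, cut_points, labels):
    """Map a continuous value to the label of the interval it falls in."""
    for cut, label in zip(cut_points, labels):
        if value <= cut:
            return label
    return labels[-1]


# Hypothetical attribute columns.
credit = ["fair", None, "excellent", "fair", None]
print(fill_most_common(credit))

ages = [23, 35, 47, 30, 41]
age_labels = ["<=30", "31..40", ">40"]
print([discretize(a, cut_points=[30, 40], labels=age_labels) for a in ages])
```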
28Bayesian Classification: Why?
- Probabilistic learning: Classification learning
can also be seen as computing P(C = c | d), i.e.,
given a data tuple d, what is the probability
that d is of class c. (C is the class attribute.)
- How?
29Naïve Bayesian Classifier
- Let A1 through Ak be attributes with discrete
values. They are used to predict a discrete class
C.
- Given an example with observed attribute values
a1 through ak.
- The prediction is the class c such that
$$P(C = c \mid A_1 = a_1 \wedge \dots \wedge A_k = a_k)$$
is maximal.
30Compute Probabilities
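- By Bayes' rule, the above can be expressed as
$$P(C = c \mid A_1 = a_1 \wedge \dots \wedge A_k = a_k) = \frac{P(A_1 = a_1 \wedge \dots \wedge A_k = a_k \mid C = c)\, P(C = c)}{P(A_1 = a_1 \wedge \dots \wedge A_k = a_k)}$$
- P(C = c) can be easily estimated from training
data.
- $P(A_1 = a_1 \wedge \dots \wedge A_k = a_k)$ is irrelevant for decision
making since it is the same for every class value
c.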
31Computing probabilities
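- We only need $P(A_1 = a_1 \wedge \dots \wedge A_k = a_k \mid C = c)$, which can
be written as
$$P(A_1 = a_1 \mid A_2 = a_2 \wedge \dots \wedge A_k = a_k, C = c) \times P(A_2 = a_2 \wedge \dots \wedge A_k = a_k \mid C = c)$$
- Recursively, the second factor above can be
written in the same way, and so on.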
32Computing probabilities
- Now suppose we assume that all attributes are
conditionally independent given the class c.
Formally, we assume
$$P(A_1 = a_1 \mid A_2 = a_2 \wedge \dots \wedge A_k = a_k, C = c) = P(A_1 = a_1 \mid C = c)$$
- and so on for A2 through Ak.
- We are done.
- How do we estimate $P(A_1 = a_1 \mid C = c)$?
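Putting the steps together gives the usual naive Bayes decision rule. The block below is a sketch of the derivation's end result. The relative-frequency estimate in the last line is the simplest choice and is the one the worked example that follows appears to use; the symbol $c^{*}$ for the predicted class is introduced here for convenience.

```latex
\begin{align*}
P(A_1 = a_1 \wedge \dots \wedge A_k = a_k \mid C = c)
  &= \prod_{i=1}^{k} P(A_i = a_i \mid C = c) \\
c^{*}
  &= \arg\max_{c} \; P(C = c) \prod_{i=1}^{k} P(A_i = a_i \mid C = c) \\
P(A_i = a_i \mid C = c)
  &\approx \frac{\#\{\text{training examples with } A_i = a_i \text{ and } C = c\}}
               {\#\{\text{training examples with } C = c\}}
\end{align*}
```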
33Training dataset
Class: C1: buys_computer = "yes"; C2: buys_computer = "no"
Data sample X = (age <= 30, income = medium,
student = yes, credit_rating = fair)
34An Example
- Compute P(Ai = ai | C = c) for each class
- P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
- P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
- P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
- P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
- P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
- P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
- P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
- P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
- X = (age <= 30, income = medium, student = yes,
credit_rating = fair)
- P(X | buys_computer = "yes") = 0.222 x 0.444 x 0.667
x 0.667 = 0.044
- P(X | buys_computer = "no") = 0.6 x 0.4 x 0.2 x
0.4 = 0.019
- P(X | C = c) P(C = c):
- P(X | buys_computer = "yes") P(buys_computer = "yes") = 0.028
- P(X | buys_computer = "no") P(buys_computer = "no") = 0.007
- X belongs to class buys_computer = "yes"
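The worked example can be reproduced in a few lines of Python. The conditional probabilities are the fractions listed above; the priors 9/14 and 5/14 follow from the class counts (9 yes, 5 no) used elsewhere in the chapter. The dictionary layout and the item strings are illustrative choices.

```python
from fractions import Fraction
from math import prod

# Class priors and conditional probabilities, copied from the counts in the example.
priors = {"yes": Fraction(9, 14), "no": Fraction(5, 14)}
conditionals = {
    "yes": {"age<=30": Fraction(2, 9), "income=medium": Fraction(4, 9),
            "student=yes": Fraction(6, 9), "credit_rating=fair": Fraction(6, 9)},
    "no": {"age<=30": Fraction(3, 5), "income=medium": Fraction(2, 5),
           "student=yes": Fraction(1, 5), "credit_rating=fair": Fraction(2, 5)},
}


def score(x, c):
    """P(X | C = c) * P(C = c) under the conditional-independence assumption."""
    return priors[c] * prod(conditionals[c][item] for item in x)


x = ["age<=30", "income=medium", "student=yes", "credit_rating=fair"]
scores = {c: float(score(x, c)) for c in priors}
prediction = max(scores, key=scores.get)
print(scores, "->", prediction)
```

The scores are not probabilities, because the common denominator P(X) is dropped; only their ranking matters for the prediction.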
35On Naïve Bayesian Classifier
- Advantages
- Easy to implement
- Good results obtained in many applications
- Disadvantages
- Assumption: class conditional independence,
therefore loss of accuracy when the assumption is
not true.
- Practically, dependencies exist
- How to deal with these dependencies?
- Bayesian Belief Networks
36Use of Association Rules: Classification
- Classification: mine a small set of rules
existing in the data to form a classifier or
predictor.
- It has a target attribute (on the right side):
the class attribute
- Association: has no fixed target, but we can fix
a target.
37Class Association Rules (CARs)
- Mining rules with a fixed target
- The right-hand sides of the rules are fixed to a
single attribute, which can have a number of
values
- E.g., X = a, Y = d → Class = yes
- X = b → Class = no
- Call such rules class association rules
38Mining Class Association Rules
- Itemset in class association rules
- <condset, class_value>
- condset: a set of items
- item: attribute-value pair, e.g.,
- attribute1 = a
- class_value: a value in the class attribute
39Classification Based on Associations (CBA)
- Two steps
- Find all class association rules
- Using a modified Apriori algorithm
- Build a classifier
- There can be many ways, e.g.,
- Choose a small set of rules to cover the data
- Numeric attributes need to be discretized.
40Advantages of the CBA Model
- One algorithm performs 3 tasks
- mine class association rules
- build an accurate classifier (or predictor)
- mine normal association rules
- by treating class as a dummy in
- <condset, class_value>
- then condset = itemset
41Advantages of the CBA Model
- Existing classification systems use
- Table data.
- CBA can build classifiers using either
- Table form data or
- Transaction form data (sparse data)
- CBA is able to find rules that existing
classification systems cannot.
42Assoc. Rules can be Used in Many Ways for
Prediction
- We have so many rules
- Select a subset of rules
- Using Bayesian probability together with the rules
- Using rule combinations
-
- A number of systems have been designed and
implemented.
43Other classification techniques
- Support vector machines
- Logistic regression
- K-nearest neighbor
- Neural networks
- Genetic algorithms
- Etc.
44How to Estimate Classification Accuracy or Error
Rates
- Partition: Training-and-testing
- use two independent data sets, e.g., training set
(2/3), test set (1/3)
- used for data sets with a large number of examples
- Cross-validation
- divide the data set into k subsamples
- use k-1 subsamples as training data and one
subsample as test data (k-fold cross-validation)
- for data sets of moderate size
- leave-one-out for small data sets
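K-fold cross-validation and leave-one-out (the special case where k equals the number of examples) can be sketched without any library. Everything named here is illustrative: the helper functions, the fixed random seed, the hypothetical label list, and the placeholder majority-class "model" standing in for a real classifier.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(labels, k, train_and_predict):
    """Average accuracy over k folds; each fold is the test set exactly once."""
    folds = k_fold_indices(len(labels), k)
    accuracies = []
    for test_idx in folds:
        train_idx = [i for fold in folds if fold is not test_idx for i in fold]
        predictions = train_and_predict(train_idx, test_idx)
        correct = sum(p == labels[i] for p, i in zip(predictions, test_idx))
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


# Hypothetical labels and a placeholder "model" that predicts the majority class.
labels = ["yes"] * 9 + ["no"] * 5


def majority_classifier(train_idx, test_idx):
    majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
    return [majority] * len(test_idx)


print(f"5-fold accuracy: {cross_validate(labels, 5, majority_classifier):.2f}")
print(f"leave-one-out accuracy: {cross_validate(labels, len(labels), majority_classifier):.2f}")
```

The test examples never take part in building the model for their own fold, which is what makes the accuracy estimate honest.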
45Scoring the data
- Scoring is related to classification.
- Normally, we are only interested in a single class
(called the positive class), e.g., the buyers class in a
marketing database.
- Instead of assigning each test example a definite
class, scoring assigns a probability estimate
(PE) to indicate the likelihood that the example
belongs to the positive class.
46Ranking and lift analysis
- After each example is given a score, we can rank
all examples according to their PEs.
- We then divide the data into n (say 10) bins. A
lift curve can be drawn according to how many
positive examples are in each bin. This is called
lift analysis.
- Classification systems can be used for scoring,
but they need to produce a probability estimate.
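The ranking and binning steps can be sketched as follows. The scores and labels are hypothetical, and the function name `lift_table` and its output layout (bin number, cumulative share of positives captured, share expected from a random ordering) are assumptions made for this illustration.

```python
def lift_table(scores, is_positive, n_bins=10):
    """Rank examples by score and report the cumulative share of positives per bin."""
    ranked = sorted(zip(scores, is_positive), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(is_positive)
    n = len(ranked)
    cumulative, table = 0, []
    for b in range(1, n_bins + 1):
        start, end = (b - 1) * n // n_bins, b * n // n_bins
        cumulative += sum(pos for _, pos in ranked[start:end])
        table.append((b, cumulative / total_pos, b / n_bins))
    return table


# Hypothetical probability estimates and true labels (1 = positive class).
scores = [0.95, 0.90, 0.85, 0.80, 0.70, 0.65, 0.60, 0.55, 0.50, 0.45,
          0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.12, 0.10, 0.05, 0.02]
is_positive = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0,
               0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

for bin_no, captured, random_baseline in lift_table(scores, is_positive):
    print(f"bin {bin_no:2d}: {captured:5.0%} of positives (random: {random_baseline:4.0%})")
```

Plotting the captured share against the bin number gives the lift curve; the further it sits above the random baseline in the early bins, the better the scoring model ranks the positive class.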
47Lift curve
[Figure: lift curve over bins 1 to 10.]