Title: Classification and Prediction
1. Classification and Prediction
Bamshad Mobasher, DePaul University
2. What Is Classification?
- The goal of data classification is to organize and categorize data into distinct classes
- A model is first created based on the data distribution
- The model is then used to classify new data
- Given the model, a class can be predicted for new data
- Classification = prediction for discrete and nominal values
- With classification, I can predict in which bucket to put the ball, but I can't predict the weight of the ball
3. Prediction, Clustering, Classification
- What is Prediction?
- The goal of prediction is to forecast or deduce the value of an attribute based on the values of other attributes
- A model is first created based on the data distribution
- The model is then used to predict future or unknown values
- Supervised vs. Unsupervised Classification
- Supervised Classification = Classification
- We know the class labels and the number of classes
- Unsupervised Classification = Clustering
- We do not know the class labels and may not know the number of classes
4. Classification: A 3-Step Process
- 1. Model construction (Learning)
- Each record (instance) is assumed to belong to a predefined class, as determined by one of the attributes, called the class label
- The set of all records used for construction of the model is called the training set
- The model is usually represented in the form of classification rules (IF-THEN statements) or decision trees
- 2. Model evaluation (Accuracy)
- Estimate the accuracy rate of the model based on a test set
- The known label of each test sample is compared with the classified result from the model
- Accuracy rate = percentage of test set samples correctly classified by the model
- The test set must be independent of the training set, otherwise over-fitting will occur
- 3. Model use (Classification)
- The model is used to classify unseen instances (assigning class labels)
- Predict the value of an actual attribute
5. Model Construction
6. Model Evaluation
7. Model Use (Classification)
8. Classification Methods
- Decision Tree Induction
- Neural Networks
- Bayesian Classification
- Association-Based Classification
- K-Nearest Neighbor
- Case-Based Reasoning
- Genetic Algorithms
- Fuzzy Sets
- Many More
9. Decision Trees
- A decision tree is a flow-chart-like tree structure
- An internal node denotes a test on an attribute (feature)
- A branch represents an outcome of the test
- All records in a branch have the same value for the tested attribute
- A leaf node represents a class label or class label distribution
10. Decision Trees
- Example: is it a good day to play golf?
- A set of attributes and their possible values:
- outlook: sunny, overcast, rain
- temperature: cool, mild, hot
- humidity: high, normal
- windy: true, false
A particular instance in the training set might be
<overcast, hot, normal, false>: play
In this case, the target class is a binary attribute, so each instance represents a positive or a negative example.
11. Using Decision Trees for Classification
- Examples can be classified as follows:
- 1. Look at the example's value for the feature specified at the current node
- 2. Move along the edge labeled with this value
- 3. If you reach a leaf, return the label of the leaf
- 4. Otherwise, repeat from step 1
- Example (a decision tree to decide whether to go on a picnic)
So a new instance
<rainy, hot, normal, true>: ?
will be classified as noplay
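As an illustration of the procedure above, here is a minimal sketch in Python. The nested-dict encoding of the tree and the classify helper are assumptions for illustration, not from the slides; the tree mirrors the golf/picnic example.

# A nested-dict tree (hypothetical encoding): internal nodes map a feature
# name to {value: subtree}, leaves are plain class labels.
def classify(tree, instance):
    # step 3: a leaf -> return its label
    if not isinstance(tree, dict):
        return tree
    # steps 1-2: look up the tested feature and follow the matching branch
    feature, branches = next(iter(tree.items()))
    return classify(branches[instance[feature]], instance)  # step 4: repeat

golf_tree = {"outlook": {
    "sunny":    {"humidity": {"high": "noplay", "normal": "play"}},
    "overcast": "play",
    "rainy":    {"windy": {True: "noplay", False: "play"}},
}}

# <rainy, hot, normal, true> is classified as noplay
print(classify(golf_tree, {"outlook": "rainy", "temperature": "hot",
                           "humidity": "normal", "windy": True}))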
12. Decision Trees and Decision Rules
If attributes are continuous, internal nodes may test against a threshold.
Each path in the tree represents a decision rule:
Rule 1: If (outlook = sunny) AND (humidity < 0.75) Then (play = yes)
Rule 2: If (outlook = rainy) AND (wind > 20) Then (play = no)
Rule 3: If (outlook = overcast) Then (play = yes)
. . .
13. Top-Down Decision Tree Generation
- The basic approach usually consists of two phases:
- Tree construction
- At the start, all the training examples are at the root
- Examples are recursively partitioned based on selected attributes
- Tree pruning
- Remove tree branches that may reflect noise in the training data and lead to errors when classifying test data
- Improves classification accuracy
- Basic steps in decision tree construction:
- The tree starts as a single node representing all data
- If the samples are all of the same class, the node becomes a leaf labeled with that class label
- Otherwise, select the feature that best separates the samples into individual classes
- Recursion stops when:
- Samples in a node belong to the same class (majority)
- There are no remaining attributes on which to split
14. Tree Construction Algorithm (ID3)
- Decision Tree Learning Method (ID3)
- Input: a set of training examples S, a set of features F
- 1. If every element of S has class value "yes", return "yes"; if every element of S has class value "no", return "no"
- 2. Otherwise, choose the best feature f from F (if there are no features remaining, then return failure)
- 3. Extend the tree from f by adding a new branch for each attribute value of f
- 3.1. Set F = F - {f}
- 4. Distribute the training examples to the leaf nodes (so each leaf node n represents the subset Sn of S with the corresponding attribute value)
- 5. Repeat steps 1-5 for each leaf node n, with Sn as the new set of training examples and F as the new set of features
- Main question: how do we choose the best feature at each step? (A sketch of the full procedure is given below.)
Note: the ID3 algorithm only deals with categorical attributes, but can be extended (as in C4.5) to handle continuous attributes
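A minimal sketch of the ID3 procedure above, assuming each training example is a dict with a "class" key; best_feature uses the information-gain criterion defined on the next slides.

import math
from collections import Counter

def entropy(examples):
    """I(s1,...,sm) = -sum pi*log2(pi) over the class proportions."""
    counts = Counter(ex["class"] for ex in examples)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_feature(examples, features):
    """Pick the feature with the highest information gain."""
    def gain(f):
        remainder = 0.0
        for v in {ex[f] for ex in examples}:
            subset = [ex for ex in examples if ex[f] == v]
            remainder += len(subset) / len(examples) * entropy(subset)
        return entropy(examples) - remainder
    return max(features, key=gain)

def id3(examples, features):
    classes = {ex["class"] for ex in examples}
    if len(classes) == 1:                 # step 1: pure node -> leaf
        return classes.pop()
    if not features:                      # no features left -> majority label
        return Counter(ex["class"] for ex in examples).most_common(1)[0][0]
    f = best_feature(examples, features)  # step 2: choose the best feature
    tree = {f: {}}
    for v in {ex[f] for ex in examples}:  # steps 3-5: branch and recurse
        subset = [ex for ex in examples if ex[f] == v]
        tree[f][v] = id3(subset, [g for g in features if g != f])
    return tree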
15. Choosing the Best Feature
- Use Information Gain to find the best (most discriminating) feature
- Assume there are two classes, P and N (e.g., P = yes and N = no)
- Let the set of instances S (training data) contain p elements of class P and n elements of class N
- The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined in terms of entropy, I(p,n):
I(p,n) = -Pr(P)·log2(Pr(P)) - Pr(N)·log2(Pr(N))
- Note that Pr(P) = p / (p+n) and Pr(N) = n / (p+n)
16. Choosing the Best Feature
- More generally, if we have m classes, and s1, s2, ..., sm are the numbers of instances of S in each class, then the entropy is
I(s1, s2, ..., sm) = -Σ_{i=1..m} pi·log2(pi)
- where pi = si / |S| is the probability that an arbitrary instance belongs to class i
17. Choosing the Best Feature
- Now assume that, using attribute A, the set S of instances will be partitioned into sets S1, S2, ..., Sv, each corresponding to a distinct value of attribute A
- If Si contains pi cases of P and ni cases of N, the entropy, or the expected information needed to classify objects in all subtrees Si, is
E(A) = Σ_{i=1..v} ((pi + ni) / (p + n)) · I(pi, ni)
- where (pi + ni) / (p + n) is the probability that an arbitrary instance in S belongs to the partition Si
- The encoding information that would be gained by branching on A is
Gain(A) = I(p, n) - E(A)
- At any point we want to branch using the attribute that provides the highest information gain
18. Attribute Selection - Example
- The Golf example: what attribute should we choose as the root?
S: [9+, 5-]
Outlook? splits S into: sunny [2+, 3-], overcast [4+, 0-], rainy [3+, 2-]
I(9,5) = -(9/14)·log(9/14) - (5/14)·log(5/14) = 0.94
I(4,0) = -(4/4)·log(4/4) - (0/4)·log(0/4) = 0
I(2,3) = -(2/5)·log(2/5) - (3/5)·log(3/5) = 0.97
I(3,2) = -(3/5)·log(3/5) - (2/5)·log(2/5) = 0.97
Gain(outlook) = .94 - (4/14)·0 - (5/14)·.97 - (5/14)·.97 = .24
(A quick check of these numbers in code follows below.)
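A quick check of the numbers above in Python (log is base 2, as in the slides):

import math

def I(p, n):
    """Entropy of a node with p positive and n negative examples."""
    total = p + n
    return -sum(x / total * math.log2(x / total) for x in (p, n) if x > 0)

# Root: S = [9+, 5-]; outlook splits it into sunny [2+,3-],
# overcast [4+,0-], rainy [3+,2-].
gain_outlook = I(9, 5) - (5/14) * I(2, 3) - (4/14) * I(4, 0) - (5/14) * I(3, 2)
print(round(I(9, 5), 2), round(gain_outlook, 3))   # 0.94 0.247 (the slide rounds to .24)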
19. Attribute Selection - Example (Cont.)
S: [9+, 5-] (I = 0.940)
humidity? splits S into: high [3+, 4-] (I = 0.985), normal [6+, 1-] (I = 0.592)
Gain(humidity) = .940 - (7/14)·.985 - (7/14)·.592 = .151
S: [9+, 5-] (I = 0.940)
wind? splits S into: weak [6+, 2-] (I = 0.811), strong [3+, 3-] (I = 1.00)
Gain(wind) = .940 - (8/14)·.811 - (6/14)·1.0 = .048
So, classifying examples by humidity provides more information gain than by wind. Similarly, we must find the information gain for temp. In this case, however, you can verify that outlook has the largest information gain, so it'll be selected as the root.
20. Attribute Selection - Example (Cont.)
- Partially learned decision tree: which attribute should be tested at the sunny and rainy branches?
S: [9+, 5-] = {D1, D2, ..., D14}
Outlook splits S into:
sunny: [2+, 3-] = {D1, D2, D8, D9, D11} (still to be expanded: ?)
overcast: [4+, 0-] = {D3, D7, D12, D13} (leaf: yes)
rainy: [3+, 2-] = {D4, D5, D6, D10, D14} (still to be expanded: ?)
Ssunny = {D1, D2, D8, D9, D11}
Gain(Ssunny, humidity) = .970 - (3/5)·0.0 - (2/5)·0.0 = .970
Gain(Ssunny, temp) = .970 - (2/5)·0.0 - (2/5)·1.0 - (1/5)·0.0 = .570
Gain(Ssunny, wind) = .970 - (2/5)·1.0 - (3/5)·.918 = .019
21. Dealing With Continuous Variables
- Partition a continuous attribute into a discrete set of intervals
- Sort the examples according to the continuous attribute A
- Identify adjacent examples that differ in their target classification
- Generate a set of candidate thresholds midway between them (a small sketch follows below)
- Problem: may generate too many intervals
- Another solution:
- Take a minimum threshold M of the examples of the majority class in each adjacent partition, then merge adjacent partitions with the same majority class
Example (M = 3): candidate thresholds at temperature 70.5 and 77.5; adjacent partitions with the same majority class are merged
Final mapping: temperature <= 77.5 => yes; temperature > 77.5 => no
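A minimal sketch of the midpoint-threshold step, with hypothetical temperature values and labels (the 70.5 and 77.5 above come from the full golf data):

def candidate_thresholds(values, labels):
    """Midpoints between adjacent (sorted) values whose class labels differ."""
    pairs = sorted(zip(values, labels))
    return [(a + c) / 2
            for (a, b), (c, d) in zip(pairs, pairs[1:])
            if b != d]

# hypothetical temperatures and play labels, only for illustration
print(candidate_thresholds(
    [64, 69, 70, 71, 75, 80, 83, 85],
    ["yes", "yes", "yes", "no", "yes", "no", "yes", "no"]))
# [70.5, 73.0, 77.5, 81.5, 84.0]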
22. Over-fitting in Classification
- A generated tree may over-fit the training examples due to noise or too small a set of training data
- Two approaches to avoid over-fitting:
- (Stop earlier) Stop growing the tree earlier
- (Post-prune) Allow over-fitting and then post-prune the tree
- Approaches to determine the correct final tree size:
- Use separate training and test sets, or use cross-validation
- Use all the data for training, but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node is likely to improve performance over the entire distribution
- Use the Minimum Description Length (MDL) principle: halt growth of the tree when the encoding is minimized
- Rule post-pruning (C4.5): converting to rules before pruning
23. Pruning the Decision Tree
- A decision tree constructed using the training data may need to be pruned
- Over-fitting may result in branches or leaves based on too few examples
- Pruning is the process of removing branches and subtrees that are generated due to noise; this improves classification accuracy
- Subtree replacement: merge a subtree into a leaf node
- Uses a set of data different from the training data
- At a tree node, if the accuracy without splitting is higher than the accuracy with splitting, replace the subtree with a leaf node; label it using the majority class
Suppose with the test set we find 3 red "no" examples and 2 blue "yes" examples. We can replace the subtree with a single "no" node. After replacement there will be only 2 errors instead of 5.
24. Bayesian Methods
- Bayes's theorem plays a critical role in probabilistic learning and classification
- Uses the prior probability of each category given no information about an item
- Categorization produces a posterior probability distribution over the possible categories given a description of an item
- The models are incremental in the sense that each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
- Given a data sample X with an unknown class label, H is the hypothesis that X belongs to a specific class C
- The conditional probability of hypothesis H given observation X, Pr(H|X), follows Bayes's theorem:
Pr(H|X) = Pr(X|H)·Pr(H) / Pr(X)
- Practical difficulty: requires initial knowledge of many probabilities, and significant computational cost
25. Axioms of Probability Theory
- All probabilities are between 0 and 1: 0 <= P(A) <= 1
- A true proposition has probability 1, a false proposition has probability 0: P(true) = 1, P(false) = 0
- The probability of a disjunction is P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
26. Conditional Probability
- P(A | B) is the probability of A given B
- Assumes that B is all and only the information known
- Defined by: P(A | B) = P(A ∧ B) / P(B)
27. Independence
- A and B are independent iff P(A | B) = P(A) and P(B | A) = P(B)
- These two constraints are logically equivalent
- Therefore, if A and B are independent: P(A ∧ B) = P(A)·P(B)
- Bayes's Rule: P(H | E) = P(E | H)·P(H) / P(E)
28. Bayesian Categorization
- Let the set of categories be {c1, c2, ..., cn}
- Let E be the description of an instance
- Determine the category of E by determining, for each ci,
P(ci | E) = P(ci)·P(E | ci) / P(E)
- P(E) can be determined since the categories are complete and disjoint:
P(E) = Σ_i P(ci)·P(E | ci)
29. Bayesian Categorization (cont.)
- Need to know:
- Priors P(ci) and conditionals P(E | ci)
- P(ci) are easily estimated from data
- If ni of the examples in D are in ci, then P(ci) = ni / |D|
- Assume an instance is a conjunction of binary features/attributes: E = e1 ∧ e2 ∧ ... ∧ em
30. Naïve Bayesian Categorization
- Problem: too many possible instances (exponential in m) to estimate all P(E | ci)
- If we assume the features/attributes of an instance are independent given the category ci (conditionally independent):
P(E | ci) = P(e1 ∧ e2 ∧ ... ∧ em | ci) = Π_j P(ej | ci)
- Therefore, we then only need to know P(ej | ci) for each feature and category
31. Estimating Probabilities
- Normally, probabilities are estimated based on observed frequencies in the training data
- If D contains ni examples in category ci, and nij of these ni examples contain feature/attribute ej, then P(ej | ci) = nij / ni
- However, estimating such probabilities from small training sets is error-prone
- If, due only to chance, a rare feature ek is always false in the training data, then for all ci, P(ek | ci) = 0
- If ek then occurs in a test example E, the result is that for all ci, P(E | ci) = 0 and P(ci | E) = 0
32. Smoothing
- To account for estimation from small samples, probability estimates are adjusted or smoothed
- Laplace smoothing using an m-estimate assumes that each feature is given a prior probability p that is assumed to have been previously observed in a "virtual" sample of size m:
P(ej | ci) = (nij + m·p) / (ni + m)
- For binary features, p is simply assumed to be 0.5
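A minimal sketch of the m-estimate above (the default values of p and m here are assumptions for illustration):

def m_estimate(n_ij, n_i, p=0.5, m=1.0):
    """Smoothed P(e_j | c_i) = (n_ij + m*p) / (n_i + m)."""
    return (n_ij + m * p) / (n_i + m)

# the unsmoothed estimate would be 0/9 = 0; the m-estimate keeps it small but nonzero
print(m_estimate(0, 9))   # 0.05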
33. Naïve Bayesian Classifier - Example
- Here, we have two classes: C1 = yes (positive) and C2 = no (negative)
- Pr(yes) = # instances with "yes" / # all instances = 9/14
- If a new instance X has outlook = sunny, then Pr(outlook = sunny | yes) = 2/9
- (since there are 9 instances with "yes" (or P), of which 2 have outlook = sunny)
- Similarly, for humidity = high, Pr(humidity = high | no) = 4/5
- And so on.
34. Naïve Bayes (Example Continued)
- Now, given the training set, we can compute all the probabilities
- Suppose we have a new instance X = <sunny, mild, high, true>. How should it be classified?
X = <sunny, mild, high, true>
Pr(X | no) = 3/5 · 2/5 · 4/5 · 3/5
Similarly, Pr(X | yes) = 2/9 · 4/9 · 3/9 · 3/9
35. Naïve Bayes (Example Continued)
- To find out to which class X belongs, we need to maximize Pr(X | Ci)·Pr(Ci) for each class Ci (here "yes" and "no")
X = <sunny, mild, high, true>
Pr(X | no)·Pr(no) = (3/5 · 2/5 · 4/5 · 3/5) · 5/14 = 0.04
Pr(X | yes)·Pr(yes) = (2/9 · 4/9 · 3/9 · 3/9) · 9/14 = 0.007
- To convert these to probabilities, we can normalize by dividing each by the sum of the two:
- Pr(no | X) = 0.04 / (0.04 + 0.007) = 0.85
- Pr(yes | X) = 0.007 / (0.04 + 0.007) = 0.15
- Therefore the new instance X will be classified as "no"
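A minimal sketch reproducing this computation in Python; the conditional probabilities are the counts from the previous slides, and the feature-name strings are just labels for illustration.

import math

priors = {"yes": 9/14, "no": 5/14}
cond = {
    "yes": {"outlook=sunny": 2/9, "temp=mild": 4/9, "humidity=high": 3/9, "windy=true": 3/9},
    "no":  {"outlook=sunny": 3/5, "temp=mild": 2/5, "humidity=high": 4/5, "windy=true": 3/5},
}
x = ["outlook=sunny", "temp=mild", "humidity=high", "windy=true"]

# score(c) = Pr(c) * product of Pr(feature | c); then normalize
scores = {c: priors[c] * math.prod(cond[c][f] for f in x) for c in priors}
total = sum(scores.values())
for c, s in scores.items():
    print(c, round(s, 3), round(s / total, 2))
# yes 0.007 0.15
# no  0.041 0.85   (the slide rounds 0.041 to 0.04)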
36. Association-Based Classification
- Recall quantitative association rules
- If the right-hand side of the rules is restricted to the class attribute to be predicted, the rules can be used directly for classification
- It mines high-support and high-confidence rules of the form
- cond_set => Y
- where Y is a class label
- Has been shown to work better than decision tree models in some cases
37. Measuring Effectiveness of Classification Models
- When the output field is ordinal or nominal (e.g., in two-class prediction), we use the classification table, the so-called confusion matrix, to evaluate the resulting model
- Example:
                 Predicted T   Predicted F
Actual Class T       18             2
Actual Class F        3            15
- Overall correct classification rate = (18 + 15) / 38 = 87%
- Given T, correct classification rate = 18 / 20 = 90%
- Given F, correct classification rate = 15 / 18 = 83%
38. Measuring Effectiveness: Lift
- Usually used for classification, but can be adapted to other methods
- Measures the change in conditional probability of a target class when going from the general population (full test set) to a biased sample
- Example (a minimal computation follows below):
- Suppose the expected response rate to a direct mailing campaign is 5% in the training set
- Use the classifier to assign a "yes" or "no" value to the target class (predicted to respond)
- The "yes" group will contain a higher proportion of actual responders than the test set
- Suppose the "yes" group (our biased sample) contains 50% actual responders
- This gives a lift of 10 = 0.5 / 0.05
- What if the lift sample is too small?
- Need to increase the sample size
- Trade-off between lift and sample size
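The lift in the example is just the ratio of the two response rates; a minimal check in Python, with the values taken from the example above:

def lift(sample_rate, base_rate):
    """Response rate in the biased ("yes") sample over the rate in the full test set."""
    return sample_rate / base_rate

print(lift(0.50, 0.05))   # 10.0, as in the example above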
39. What Is Prediction?
- Prediction is similar to classification
- First, construct a model
- Second, use the model to predict unknown values
- Prediction is different from classification
- Classification refers to predicting a categorical class label (e.g., "yes", "no")
- Prediction models are used to predict values of a numeric target attribute
- They can be thought of as continuous-valued functions
- The major method for prediction is regression
- Linear and multiple regression
- Non-linear regression
- K-Nearest-Neighbor
- Most common application domains:
- Recommender systems, credit scoring, customer lifetime value
40. Prediction: Regression Analysis
- The most common approaches to prediction are linear or multiple regression
- Linear regression: Y = α + β X
- The model is a line which best reflects the data distribution; the line allows for prediction of the Y attribute value based on the single attribute X
- The two parameters α and β specify the line and are to be estimated using the data at hand
- Common approach: apply the least squares criterion to the known values Y1, Y2, ... and X1, X2, ... (a small sketch follows below)
- Regression applet: http://www.math.csusb.edu/faculty/stanton/probstat/regression.html
- Multiple regression: Y = b0 + b1 X1 + b2 X2 + ...
- Necessary when prediction must be made based on multiple attributes
- E.g., predict Customer LTV based on Age, Income, Spending, Items purchased, etc.
- Many nonlinear functions can be transformed into the above
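A minimal sketch of fitting Y = α + β·X by the least squares criterion, using NumPy and hypothetical sample data (not from the slides):

import numpy as np

# hypothetical sample values, only for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# least squares estimates: slope = cov(x, y) / var(x), intercept from the means
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
print(a, b)            # estimated intercept and slope of the fitted line
print(a + b * 6.0)     # predicted Y for a new X = 6.0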
41. Measuring Effectiveness of Prediction
- Predictive models are evaluated based on the accuracy of their predictions on unseen data
- Accuracy is measured in terms of error rate (usually the percentage of records classified incorrectly)
- The error rate on a pre-classified evaluation set estimates the real error rate
- Prediction effectiveness:
- The difference between predicted scores and the actual results (from the evaluation set)
- Typically the accuracy of the model is measured in terms of variance (i.e., the average of the squared differences)
- E.g., Root Mean Squared Error: the square root of the mean of the squared differences between predicted and actual values (a small sketch follows below)
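A minimal sketch of the RMSE computation described above, with hypothetical predicted and actual values:

import math

def rmse(predicted, actual):
    """Root mean squared error between predicted and actual values."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

print(rmse([3.5, 2.0, 4.5], [4.0, 2.0, 5.0]))   # ~0.408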
42. Example: Recommender Systems
- Basic formulation as a prediction problem:
- Given a profile Pu for a user u, and a target item it, predict the interest score of user u on item it
- Typically, the profile Pu contains interest scores by u on some other items i1, ..., ik, different from it
- Interest scores on i1, ..., ik may have been obtained explicitly (e.g., movie ratings) or implicitly (e.g., time spent on a product page or news article)
43. Example: Recommender Systems
- Content-based recommenders
- Predictions for unseen (target) items are computed based on their similarity (in terms of content) to items in the user profile
- E.g., given the items in the user profile Pu, target items whose content is very similar are recommended highly, and less similar ones are recommended mildly
44. Content-Based Recommender Systems
45. Example: Recommender Systems
- Collaborative filtering recommenders
- Predictions for unseen (target) items are computed based on the other users with similar interest scores on items in user u's profile
- i.e., users with similar tastes (aka "nearest neighbors")
- Requires computing correlations between user u and other users according to interest scores or ratings
Can we predict Karen's rating on the unseen item Independence Day?
46. Example: Recommender Systems
- Collaborative filtering recommenders
- Predictions for unseen (target) items are computed based on the other users with similar interest scores on items in user u's profile
- i.e., users with similar tastes (aka "nearest neighbors")
- Requires computing correlations between user u and other users according to interest scores or ratings
Prediction: Karen's rating on Independence Day is estimated from the ratings of her K nearest neighbors, weighted by their correlation to Karen (a sketch of this computation follows below)
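Below is a minimal sketch of the neighborhood-based prediction described above: Pearson correlation between users, then a correlation-weighted average over the positively correlated neighbors. The user names and ratings are hypothetical, only for illustration.

import math

def pearson(u, v):
    """Pearson correlation between two users over their co-rated items."""
    common = [i for i in u if i in v]
    if len(common) < 2:
        return 0.0
    mu = sum(u[i] for i in common) / len(common)
    mv = sum(v[i] for i in common) / len(common)
    num = sum((u[i] - mu) * (v[i] - mv) for i in common)
    den = math.sqrt(sum((u[i] - mu) ** 2 for i in common) *
                    sum((v[i] - mv) ** 2 for i in common))
    return num / den if den else 0.0

def predict(target, others, item, k=3):
    """Correlation-weighted average of the k most similar users' ratings on `item`."""
    weighted = [(pearson(target, o), o) for o in others if item in o]
    neighbors = sorted((wo for wo in weighted if wo[0] > 0),
                       key=lambda wo: wo[0], reverse=True)[:k]
    num = sum(w * o[item] for w, o in neighbors)
    den = sum(w for w, _ in neighbors)
    return num / den if den else None

# hypothetical users and ratings; "karen" has not rated "Independence Day"
karen = {"Star Wars": 5, "Titanic": 1, "Matrix": 4}
others = [
    {"Star Wars": 5, "Titanic": 2, "Matrix": 5, "Independence Day": 4},
    {"Star Wars": 1, "Titanic": 5, "Matrix": 2, "Independence Day": 2},
    {"Star Wars": 4, "Titanic": 1, "Matrix": 5, "Independence Day": 5},
]
print(round(predict(karen, others, "Independence Day"), 2))   # ~4.48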
47. Possible Interesting Project Ideas
- Build a content-based recommender for:
- Movies (e.g., previous example)
- News stories (requires basic text processing and indexing of documents)
- Music (based on features such as genre, artist, etc.)
- Build a collaborative recommender for:
- Movies (using movie ratings), e.g., movielens.org
- Music, e.g., pandora.com
- Recommend songs or albums based on collaborative ratings
- Or, recommend whole playlists based on playlists from other users (this might be a good candidate application for association rule mining; why?)
48. Other Forms of Collaborative and Social Filtering
- Social Tagging (Folksonomy)
- People add free-text tags to their content
- Where people happen to use the same terms, their content is linked
- Frequently used terms float to the top, creating a kind of positive feedback loop for popular tags
- Examples:
- Del.icio.us
- Flickr
49. Social Tagging
- Deviating from standard mental models
- No browsing of topical, categorized navigation, or searching for an explicit term or phrase
- Instead: use the language I use to define my world (tagging)
- Sharing my language and contexts will create community
- Tagging creates community through the overlap of perspectives
- This leads to the creation of social networks, which may further develop and evolve
- But does this lead to dynamic evolution of complex concepts or knowledge? Collective intelligence?
50. Clustering and Collaborative Filtering: clustering based on ratings (movielens)
51. Clustering and Collaborative Filtering: tag clustering example
52. Classification Example - Bank Data
- Want to determine likely responders to a direct mail campaign for a new product, a "Personal Equity Plan" (PEP)
- The training data include records kept about how previous customers responded and bought the product
- In this case the target class is "pep", with a binary value
- We want to build a model and apply it to new data (a customer list) in which the value of the class attribute is not available
53. Data Preparation
- Several steps to prepare the data for Weka and for See5:
- Open the training data in Excel, remove the id column, and save the result as a comma-delimited file (e.g., bank.csv)
- Do the same with the new customer data, but also add a new column called "pep" as the last column; the value of this column for each record should be "?"
- Weka
- Must convert the data to ARFF format
- The attribute specification and the data are in the same file
- The data portion is just the comma-delimited data file without the label row
- See5/C5
- Create a names file and a data file
- The names file contains the attribute specification; the data file is the same as above
- The first line of the names file must be the name(s) of the target class(es), in this case "pep"
54. Data File Format for Weka

Training Data:

@relation 'train-bank-data'
@attribute 'age' real
@attribute 'sex' {'MALE','FEMALE'}
@attribute 'region' {'INNER_CITY','RURAL','TOWN','SUBURBAN'}
@attribute 'income' real
@attribute 'married' {'YES','NO'}
@attribute 'children' real
@attribute 'car' {'YES','NO'}
@attribute 'save_act' {'YES','NO'}
@attribute 'current_act' {'YES','NO'}
@attribute 'mortgage' {'YES','NO'}
@attribute 'pep' {'YES','NO'}
@data
48,FEMALE,INNER_CITY,17546,NO,1,NO,NO,NO,NO,YES
40,MALE,TOWN,30085.1,YES,3,YES,NO,YES,YES,NO
. . .

New Cases:

@relation 'new-bank-data'
@attribute 'age' real
@attribute 'region' {'INNER_CITY','RURAL','TOWN','SUBURBAN'}
. . .
@attribute 'pep' {'YES','NO'}
@data
23,MALE,INNER_CITY,18766.9,YES,0,YES,YES,NO,YES,?
30,MALE,RURAL,9915.67,NO,1,NO,YES,NO,YES,?
55. C4.5 Implementation in Weka
- To build a model (decision tree), use the classifiers.trees.J48 class

Decision Tree Output (pruned):

children <= 2
|   children <= 0
|   |   married = YES
|   |   |   mortgage = YES
|   |   |   |   save_act = YES: NO (16.0/2.0)
|   |   |   |   save_act = NO: YES (9.0/1.0)
|   |   |   mortgage = NO: NO (59.0/6.0)
|   |   married = NO
|   |   |   mortgage = YES
|   |   |   |   save_act = YES: NO (12.0)
|   |   |   |   save_act = NO: YES (3.0)
|   |   |   mortgage = NO: YES (29.0/2.0)
|   children > 0
|   |   income <= 29622
|   |   |   children <= 1
|   |   |   |   income <= 12640.3: NO (5.0)
|   |   |   |   income > 12640.3
|   |   |   |   |   current_act = YES: YES (28.0/1.0)
|   |   |   |   |   current_act = NO
|   |   |   |   |   |   income <= 17390.1: NO (3.0)
|   |   |   |   |   |   income > 17390.1: YES (6.0)
|   |   |   children > 1: NO (47.0/3.0)
|   |   income > 29622: YES (48.0/2.0)
children > 2
|   income <= 43228.2: NO (30.0/2.0)
|   income > 43228.2: YES (5.0)
56. C4.5 Implementation in Weka

Error on training data:
Correctly Classified Instances       281      93.6667 %
Incorrectly Classified Instances      19       6.3333 %
Mean absolute error                    0.1163
Root mean squared error                0.2412
Relative absolute error               23.496  %
Root relative squared error           48.4742 %
Total Number of Instances            300

Confusion Matrix:
   a   b   <-- classified as
 122  13 |  a = YES
   6 159 |  b = NO

Stratified cross-validation:
Correctly Classified Instances       274      91.3333 %
Incorrectly Classified Instances      26       8.6667 %
Mean absolute error                    0.1434
Root mean squared error                0.291
Relative absolute error               28.9615 %
Root relative squared error           58.4922 %
Total Number of Instances            300

Confusion Matrix:
   a   b   <-- classified as
 118  17 |  a = YES
   9 156 |  b = NO

The rest of the output contains statistical information about the model, including the confusion matrix, error rates, etc.
The model can be saved to be later applied to the test data (or to new unclassified instances).