Title: Classification and Prediction
1. Classification and Prediction
Bamshad Mobasher, DePaul University
2. What Is Classification?
- The goal of data classification is to organize and categorize data into distinct classes
- A model is first created based on the data distribution
- The model is then used to classify new data
- Given the model, a class can be predicted for new data
- Classification = prediction for discrete and nominal values
- With classification, I can predict in which bucket to put the ball, but I can't predict the weight of the ball
3. Prediction, Clustering, Classification
- What is Prediction?
- The goal of prediction is to forecast or deduce the value of an attribute based on the values of other attributes
- A model is first created based on the data distribution
- The model is then used to predict future or unknown values
- Supervised vs. Unsupervised Classification
- Supervised Classification = Classification
- We know the class labels and the number of classes
- Unsupervised Classification = Clustering
- We do not know the class labels and may not know the number of classes
4. Classification: A 3-Step Process
- 1. Model construction (Learning)
- Each record (instance) is assumed to belong to a predefined class, as determined by one of the attributes, called the class label
- The set of all records used for construction of the model is called the training set
- The model is usually represented in the form of classification rules (IF-THEN statements) or decision trees
- 2. Model evaluation (Accuracy)
- Estimate the accuracy rate of the model based on a test set
- The known label of each test sample is compared with the classified result from the model
- Accuracy rate = percentage of test set samples correctly classified by the model
- The test set must be independent of the training set, otherwise over-fitting will occur
- 3. Model use (Classification)
- The model is used to classify unseen instances (assigning class labels)
- Predict the value of an actual attribute
5. Model Construction
6. Model Evaluation
7. Model Use (Classification)
8. Classification Methods
- Decision Tree Induction
- Neural Networks
- Bayesian Classification
- Association-Based Classification
- K-Nearest Neighbor
- Case-Based Reasoning
- Genetic Algorithms
- Fuzzy Sets
- Many More
9. Decision Trees
- A decision tree is a flow-chart-like tree structure
- An internal node denotes a test on an attribute (feature)
- A branch represents an outcome of the test
- All records in a branch have the same value for the tested attribute
- A leaf node represents a class label or class label distribution
10. Decision Trees
- Example: is it a good day to play golf?
- A set of attributes and their possible values:
- outlook: sunny, overcast, rain
- temperature: cool, mild, hot
- humidity: high, normal
- windy: true, false
A particular instance in the training set might be
<overcast, hot, normal, false>: play
In this case, the target class is a binary attribute, so each instance represents a positive or a negative example.
11. Using Decision Trees for Classification
- Examples can be classified as follows:
- 1. Look at the example's value for the feature specified at the current node
- 2. Move along the edge labeled with this value
- 3. If you reach a leaf, return the label of the leaf
- 4. Otherwise, repeat from step 1
- Example (a decision tree to decide whether to go on a picnic)
So a new instance
<rainy, hot, normal, true>: ?
will be classified as noplay
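As an illustration of the procedure above, here is a minimal sketch in Python. The nested-dict encoding of the tree and the classify helper are assumptions for illustration, not from the slides; the tree mirrors the golf/picnic example.

# A nested-dict tree (hypothetical encoding): internal nodes map a feature
# name to {value: subtree}, leaves are plain class labels.
def classify(tree, instance):
    # step 3: a leaf -> return its label
    if not isinstance(tree, dict):
        return tree
    # steps 1-2: look up the tested feature and follow the matching branch
    feature, branches = next(iter(tree.items()))
    return classify(branches[instance[feature]], instance)  # step 4: repeat

golf_tree = {"outlook": {
    "sunny":    {"humidity": {"high": "noplay", "normal": "play"}},
    "overcast": "play",
    "rainy":    {"windy": {True: "noplay", False: "play"}},
}}

# <rainy, hot, normal, true> is classified as noplay
print(classify(golf_tree, {"outlook": "rainy", "temperature": "hot",
                           "humidity": "normal", "windy": True}))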
12. Decision Trees and Decision Rules
If attributes are continuous, internal nodes may test against a threshold.
Each path in the tree represents a decision rule:
Rule 1: If (outlook = sunny) AND (humidity < 0.75) Then (play = yes)
Rule 2: If (outlook = rainy) AND (wind > 20) Then (play = no)
Rule 3: If (outlook = overcast) Then (play = yes)
. . .
13. Top-Down Decision Tree Generation
- The basic approach usually consists of two phases:
- Tree construction
- At the start, all the training examples are at the root
- Examples are recursively partitioned based on selected attributes
- Tree pruning
- Remove tree branches that may reflect noise in the training data and lead to errors when classifying test data
- Improves classification accuracy
- Basic steps in decision tree construction:
- The tree starts as a single node representing all data
- If the samples are all of the same class, the node becomes a leaf labeled with that class label
- Otherwise, select the feature that best separates the samples into individual classes
- Recursion stops when:
- Samples in a node belong to the same class (majority)
- There are no remaining attributes on which to split
14. Tree Construction Algorithm (ID3)
- Decision Tree Learning Method (ID3)
- Input: a set of training examples S, a set of features F
- 1. If every element of S has class value "yes", return "yes"; if every element of S has class value "no", return "no"
- 2. Otherwise, choose the best feature f from F (if there are no features remaining, then return failure)
- 3. Extend the tree from f by adding a new branch for each attribute value of f
- 3.1. Set F = F - {f}
- 4. Distribute the training examples to the leaf nodes (so each leaf node n represents the subset Sn of S with the corresponding attribute value)
- 5. Repeat steps 1-5 for each leaf node n, with Sn as the new set of training examples and F as the new set of features
- Main question: how do we choose the best feature at each step? (A sketch of the full procedure is given below.)
Note: the ID3 algorithm only deals with categorical attributes, but can be extended (as in C4.5) to handle continuous attributes
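A minimal sketch of the ID3 procedure above, assuming each training example is a dict with a "class" key; best_feature uses the information-gain criterion defined on the next slides.

import math
from collections import Counter

def entropy(examples):
    """I(s1,...,sm) = -sum pi*log2(pi) over the class proportions."""
    counts = Counter(ex["class"] for ex in examples)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_feature(examples, features):
    """Pick the feature with the highest information gain."""
    def gain(f):
        remainder = 0.0
        for v in {ex[f] for ex in examples}:
            subset = [ex for ex in examples if ex[f] == v]
            remainder += len(subset) / len(examples) * entropy(subset)
        return entropy(examples) - remainder
    return max(features, key=gain)

def id3(examples, features):
    classes = {ex["class"] for ex in examples}
    if len(classes) == 1:                 # step 1: pure node -> leaf
        return classes.pop()
    if not features:                      # no features left -> majority label
        return Counter(ex["class"] for ex in examples).most_common(1)[0][0]
    f = best_feature(examples, features)  # step 2: choose the best feature
    tree = {f: {}}
    for v in {ex[f] for ex in examples}:  # steps 3-5: branch and recurse
        subset = [ex for ex in examples if ex[f] == v]
        tree[f][v] = id3(subset, [g for g in features if g != f])
    return tree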
15. Choosing the Best Feature
- Use Information Gain to find the best (most discriminating) feature
- Assume there are two classes, P and N (e.g., P = yes and N = no)
- Let the set of instances S (training data) contain p elements of class P and n elements of class N
- The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined in terms of entropy, I(p,n):
I(p,n) = -Pr(P)·log2(Pr(P)) - Pr(N)·log2(Pr(N))
- Note that Pr(P) = p / (p+n) and Pr(N) = n / (p+n)
16. Choosing the Best Feature
- More generally, if we have m classes, and s1, s2, ..., sm are the numbers of instances of S in each class, then the entropy is
I(s1, s2, ..., sm) = -Σ_{i=1..m} pi·log2(pi)
- where pi = si / |S| is the probability that an arbitrary instance belongs to class i
17. Choosing the Best Feature
- Now assume that, using attribute A, the set S of instances will be partitioned into sets S1, S2, ..., Sv, each corresponding to a distinct value of attribute A
- If Si contains pi cases of P and ni cases of N, the entropy, or the expected information needed to classify objects in all subtrees Si, is
E(A) = Σ_{i=1..v} ((pi + ni) / (p + n)) · I(pi, ni)
- where (pi + ni) / (p + n) is the probability that an arbitrary instance in S belongs to the partition Si
- The encoding information that would be gained by branching on A is
Gain(A) = I(p, n) - E(A)
- At any point we want to branch using the attribute that provides the highest information gain
18. Attribute Selection - Example
- The Golf example: what attribute should we choose as the root?
S: [9+, 5-]
Outlook? splits S into: sunny [2+, 3-], overcast [4+, 0-], rainy [3+, 2-]
I(9,5) = -(9/14)·log(9/14) - (5/14)·log(5/14) = 0.94
I(4,0) = -(4/4)·log(4/4) - (0/4)·log(0/4) = 0
I(2,3) = -(2/5)·log(2/5) - (3/5)·log(3/5) = 0.97
I(3,2) = -(3/5)·log(3/5) - (2/5)·log(2/5) = 0.97
Gain(outlook) = .94 - (4/14)·0 - (5/14)·.97 - (5/14)·.97 = .24
(A quick check of these numbers in code follows below.)
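A quick check of the numbers above in Python (log is base 2, as in the slides):

import math

def I(p, n):
    """Entropy of a node with p positive and n negative examples."""
    total = p + n
    return -sum(x / total * math.log2(x / total) for x in (p, n) if x > 0)

# Root: S = [9+, 5-]; outlook splits it into sunny [2+,3-],
# overcast [4+,0-], rainy [3+,2-].
gain_outlook = I(9, 5) - (5/14) * I(2, 3) - (4/14) * I(4, 0) - (5/14) * I(3, 2)
print(round(I(9, 5), 2), round(gain_outlook, 3))   # 0.94 0.247 (the slide rounds to .24)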
19. Attribute Selection - Example (Cont.)
S: [9+, 5-] (I = 0.940)
humidity? splits S into: high [3+, 4-] (I = 0.985), normal [6+, 1-] (I = 0.592)
Gain(humidity) = .940 - (7/14)·.985 - (7/14)·.592 = .151
S: [9+, 5-] (I = 0.940)
wind? splits S into: weak [6+, 2-] (I = 0.811), strong [3+, 3-] (I = 1.00)
Gain(wind) = .940 - (8/14)·.811 - (6/14)·1.0 = .048
So, classifying examples by humidity provides more information gain than by wind. Similarly, we must find the information gain for temp. In this case, however, you can verify that outlook has the largest information gain, so it'll be selected as the root.
20. Attribute Selection - Example (Cont.)
- Partially learned decision tree: which attribute should be tested at the sunny and rainy branches?
S: [9+, 5-] = {D1, D2, ..., D14}
Outlook splits S into:
sunny: [2+, 3-] = {D1, D2, D8, D9, D11} (still to be expanded: ?)
overcast: [4+, 0-] = {D3, D7, D12, D13} (leaf: yes)
rainy: [3+, 2-] = {D4, D5, D6, D10, D14} (still to be expanded: ?)
Ssunny = {D1, D2, D8, D9, D11}
Gain(Ssunny, humidity) = .970 - (3/5)·0.0 - (2/5)·0.0 = .970
Gain(Ssunny, temp) = .970 - (2/5)·0.0 - (2/5)·1.0 - (1/5)·0.0 = .570
Gain(Ssunny, wind) = .970 - (2/5)·1.0 - (3/5)·.918 = .019
21. Dealing With Continuous Variables
- Partition a continuous attribute into a discrete set of intervals
- Sort the examples according to the continuous attribute A
- Identify adjacent examples that differ in their target classification
- Generate a set of candidate thresholds midway between them (a small sketch follows below)
- Problem: may generate too many intervals
- Another solution:
- Take a minimum threshold M of the examples of the majority class in each adjacent partition, then merge adjacent partitions with the same majority class
Example (M = 3): candidate thresholds at temperature 70.5 and 77.5; adjacent partitions with the same majority class are merged
Final mapping: temperature <= 77.5 => yes; temperature > 77.5 => no
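A minimal sketch of the midpoint-threshold step, with hypothetical temperature values and labels (the 70.5 and 77.5 above come from the full golf data):

def candidate_thresholds(values, labels):
    """Midpoints between adjacent (sorted) values whose class labels differ."""
    pairs = sorted(zip(values, labels))
    return [(a + c) / 2
            for (a, b), (c, d) in zip(pairs, pairs[1:])
            if b != d]

# hypothetical temperatures and play labels, only for illustration
print(candidate_thresholds(
    [64, 69, 70, 71, 75, 80, 83, 85],
    ["yes", "yes", "yes", "no", "yes", "no", "yes", "no"]))
# [70.5, 73.0, 77.5, 81.5, 84.0]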
22. Over-fitting in Classification
- A generated tree may over-fit the training examples due to noise or too small a set of training data
- Two approaches to avoid over-fitting:
- (Stop earlier) Stop growing the tree earlier
- (Post-prune) Allow over-fitting and then post-prune the tree
- Approaches to determine the correct final tree size:
- Use separate training and test sets, or use cross-validation
- Use all the data for training, but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node is likely to improve performance over the entire distribution
- Use the Minimum Description Length (MDL) principle: halt growth of the tree when the encoding is minimized
- Rule post-pruning (C4.5): converting to rules before pruning
23. Pruning the Decision Tree
- A decision tree constructed using the training data may need to be pruned
- Over-fitting may result in branches or leaves based on too few examples
- Pruning is the process of removing branches and subtrees that are generated due to noise; this improves classification accuracy
- Subtree replacement: merge a subtree into a leaf node
- Uses a set of data different from the training data
- At a tree node, if the accuracy without splitting is higher than the accuracy with splitting, replace the subtree with a leaf node; label it using the majority class
Suppose with the test set we find 3 red "no" examples and 2 blue "yes" examples. We can replace the subtree with a single "no" node. After replacement there will be only 2 errors instead of 5.
24. Bayesian Methods
- Bayes's theorem plays a critical role in probabilistic learning and classification
- Uses the prior probability of each category given no information about an item
- Categorization produces a posterior probability distribution over the possible categories given a description of an item
- The models are incremental in the sense that each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
- Given a data sample X with an unknown class label, H is the hypothesis that X belongs to a specific class C
- The conditional probability of hypothesis H given observation X, Pr(H|X), follows Bayes's theorem:
Pr(H|X) = Pr(X|H)·Pr(H) / Pr(X)
- Practical difficulty: requires initial knowledge of many probabilities, and significant computational cost
25. Axioms of Probability Theory
- All probabilities are between 0 and 1: 0 <= P(A) <= 1
- A true proposition has probability 1, a false proposition has probability 0: P(true) = 1, P(false) = 0
- The probability of a disjunction is P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
26. Conditional Probability
- P(A | B) is the probability of A given B
- Assumes that B is all and only the information known
- Defined by: P(A | B) = P(A ∧ B) / P(B)
27. Independence
- A and B are independent iff P(A | B) = P(A) and P(B | A) = P(B)
- These two constraints are logically equivalent
- Therefore, if A and B are independent: P(A ∧ B) = P(A)·P(B)
- Bayes's Rule: P(H | E) = P(E | H)·P(H) / P(E)
28. Bayesian Categorization
- Let the set of categories be {c1, c2, ..., cn}
- Let E be the description of an instance
- Determine the category of E by determining, for each ci,
P(ci | E) = P(ci)·P(E | ci) / P(E)
- P(E) can be determined since the categories are complete and disjoint:
P(E) = Σ_i P(ci)·P(E | ci)
29. Bayesian Categorization (cont.)
- Need to know:
- Priors P(ci) and conditionals P(E | ci)
- P(ci) are easily estimated from data
- If ni of the examples in D are in ci, then P(ci) = ni / |D|
- Assume an instance is a conjunction of binary features/attributes: E = e1 ∧ e2 ∧ ... ∧ em
30. Naïve Bayesian Categorization
- Problem: too many possible instances (exponential in m) to estimate all P(E | ci)
- If we assume the features/attributes of an instance are independent given the category ci (conditionally independent):
P(E | ci) = P(e1 ∧ e2 ∧ ... ∧ em | ci) = Π_j P(ej | ci)
- Therefore, we then only need to know P(ej | ci) for each feature and category
31. Estimating Probabilities
- Normally, probabilities are estimated based on observed frequencies in the training data
- If D contains ni examples in category ci, and nij of these ni examples contain feature/attribute ej, then P(ej | ci) = nij / ni
- However, estimating such probabilities from small training sets is error-prone
- If, due only to chance, a rare feature ek is always false in the training data, then for all ci, P(ek | ci) = 0
- If ek then occurs in a test example E, the result is that for all ci, P(E | ci) = 0 and P(ci | E) = 0
32. Smoothing
- To account for estimation from small samples, probability estimates are adjusted or smoothed
- Laplace smoothing using an m-estimate assumes that each feature is given a prior probability p that is assumed to have been previously observed in a "virtual" sample of size m:
P(ej | ci) = (nij + m·p) / (ni + m)
- For binary features, p is simply assumed to be 0.5
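A minimal sketch of the m-estimate above (the default values of p and m here are assumptions for illustration):

def m_estimate(n_ij, n_i, p=0.5, m=1.0):
    """Smoothed P(e_j | c_i) = (n_ij + m*p) / (n_i + m)."""
    return (n_ij + m * p) / (n_i + m)

# the unsmoothed estimate would be 0/9 = 0; the m-estimate keeps it small but nonzero
print(m_estimate(0, 9))   # 0.05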
33. Naïve Bayesian Classifier - Example
- Here, we have two classes: C1 = yes (positive) and C2 = no (negative)
- Pr(yes) = # instances with "yes" / # all instances = 9/14
- If a new instance X has outlook = sunny, then Pr(outlook = sunny | yes) = 2/9
- (since there are 9 instances with "yes" (or P), of which 2 have outlook = sunny)
- Similarly, for humidity = high, Pr(humidity = high | no) = 4/5
- And so on.
34. Naïve Bayes (Example Continued)
- Now, given the training set, we can compute all the probabilities
- Suppose we have a new instance X = <sunny, mild, high, true>. How should it be classified?
X = <sunny, mild, high, true>
Pr(X | no) = 3/5 · 2/5 · 4/5 · 3/5
Similarly, Pr(X | yes) = 2/9 · 4/9 · 3/9 · 3/9
35. Naïve Bayes (Example Continued)
- To find out to which class X belongs, we need to maximize Pr(X | Ci)·Pr(Ci) for each class Ci (here "yes" and "no")
X = <sunny, mild, high, true>
Pr(X | no)·Pr(no) = (3/5 · 2/5 · 4/5 · 3/5) · 5/14 = 0.04
Pr(X | yes)·Pr(yes) = (2/9 · 4/9 · 3/9 · 3/9) · 9/14 = 0.007
- To convert these to probabilities, we can normalize by dividing each by the sum of the two:
- Pr(no | X) = 0.04 / (0.04 + 0.007) = 0.85
- Pr(yes | X) = 0.007 / (0.04 + 0.007) = 0.15
- Therefore the new instance X will be classified as "no"
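A minimal sketch reproducing this computation in Python; the conditional probabilities are the counts from the previous slides, and the feature-name strings are just labels for illustration.

import math

priors = {"yes": 9/14, "no": 5/14}
cond = {
    "yes": {"outlook=sunny": 2/9, "temp=mild": 4/9, "humidity=high": 3/9, "windy=true": 3/9},
    "no":  {"outlook=sunny": 3/5, "temp=mild": 2/5, "humidity=high": 4/5, "windy=true": 3/5},
}
x = ["outlook=sunny", "temp=mild", "humidity=high", "windy=true"]

# score(c) = Pr(c) * product of Pr(feature | c); then normalize
scores = {c: priors[c] * math.prod(cond[c][f] for f in x) for c in priors}
total = sum(scores.values())
for c, s in scores.items():
    print(c, round(s, 3), round(s / total, 2))
# yes 0.007 0.15
# no  0.041 0.85   (the slide rounds 0.041 to 0.04)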
36. Association-Based Classification
- Recall quantitative association rules
- If the right-hand side of the rules is restricted to the class attribute to be predicted, the rules can be used directly for classification
- It mines high-support and high-confidence rules of the form
- cond_set => Y
- where Y is a class label
- Has been shown to work better than decision tree models in some cases
37. Measuring Effectiveness of Classification Models
- When the output field is ordinal or nominal (e.g., in two-class prediction), we use the classification table, the so-called confusion matrix, to evaluate the resulting model
- Example:
                 Predicted T   Predicted F
Actual Class T       18             2
Actual Class F        3            15
- Overall correct classification rate = (18 + 15) / 38 = 87%
- Given T, correct classification rate = 18 / 20 = 90%
- Given F, correct classification rate = 15 / 18 = 83%
38. Measuring Effectiveness: Lift
- Usually used for classification, but can be adapted to other methods
- Measures the change in conditional probability of a target class when going from the general population (full test set) to a biased sample
- Example (a minimal computation follows below):
- Suppose the expected response rate to a direct mailing campaign is 5% in the training set
- Use the classifier to assign a "yes" or "no" value to the target class (predicted to respond)
- The "yes" group will contain a higher proportion of actual responders than the test set
- Suppose the "yes" group (our biased sample) contains 50% actual responders
- This gives a lift of 10 = 0.5 / 0.05
- What if the lift sample is too small?
- Need to increase the sample size
- Trade-off between lift and sample size
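The lift in the example is just the ratio of the two response rates; a minimal check in Python, with the values taken from the example above:

def lift(sample_rate, base_rate):
    """Response rate in the biased ("yes") sample over the rate in the full test set."""
    return sample_rate / base_rate

print(lift(0.50, 0.05))   # 10.0, as in the example above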
39. What Is Prediction?
- Prediction is similar to classification
- First, construct a model
- Second, use the model to predict unknown values
- Prediction is different from classification
- Classification refers to predicting a categorical class label (e.g., "yes", "no")
- Prediction models are used to predict values of a numeric target attribute
- They can be thought of as continuous-valued functions
- The major method for prediction is regression
- Linear and multiple regression
- Non-linear regression
- K-Nearest-Neighbor
- Most common application domains:
- Recommender systems, credit scoring, customer lifetime value
40. Prediction: Regression Analysis
- The most common approaches to prediction are linear or multiple regression
- Linear regression: Y = α + β X
- The model is a line which best reflects the data distribution; the line allows for prediction of the Y attribute value based on the single attribute X
- The two parameters α and β specify the line and are to be estimated using the data at hand
- Common approach: apply the least squares criterion to the known values Y1, Y2, ... and X1, X2, ... (a small sketch follows below)
- Regression applet: http://www.math.csusb.edu/faculty/stanton/probstat/regression.html
- Multiple regression: Y = b0 + b1 X1 + b2 X2 + ...
- Necessary when prediction must be made based on multiple attributes
- E.g., predict Customer LTV based on Age, Income, Spending, Items purchased, etc.
- Many nonlinear functions can be transformed into the above
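A minimal sketch of fitting Y = α + β·X by the least squares criterion, using NumPy and hypothetical sample data (not from the slides):

import numpy as np

# hypothetical sample values, only for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# least squares estimates: slope = cov(x, y) / var(x), intercept from the means
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
print(a, b)            # estimated intercept and slope of the fitted line
print(a + b * 6.0)     # predicted Y for a new X = 6.0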
41. Measuring Effectiveness of Prediction
- Predictive models are evaluated based on the accuracy of their predictions on unseen data
- Accuracy is measured in terms of error rate (usually the percentage of records classified incorrectly)
- The error rate on a pre-classified evaluation set estimates the real error rate
- Prediction effectiveness:
- The difference between predicted scores and the actual results (from the evaluation set)
- Typically the accuracy of the model is measured in terms of variance (i.e., the average of the squared differences)
- E.g., Root Mean Squared Error: the square root of the mean of the squared differences between predicted and actual values (a small sketch follows below)
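A minimal sketch of the RMSE computation described above, with hypothetical predicted and actual values:

import math

def rmse(predicted, actual):
    """Root mean squared error between predicted and actual values."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

print(rmse([3.5, 2.0, 4.5], [4.0, 2.0, 5.0]))   # ~0.408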
42. Example: Recommender Systems
- Basic formulation as a prediction problem:
- Given a profile Pu for a user u, and a target item it, predict the interest score of user u on item it
- Typically, the profile Pu contains interest scores by u on some other items i1, ..., ik, different from it
- Interest scores on i1, ..., ik may have been obtained explicitly (e.g., movie ratings) or implicitly (e.g., time spent on a product page or news article)
43. Example: Recommender Systems
- Content-based recommenders
- Predictions for unseen (target) items are computed based on their similarity (in terms of content) to items in the user profile
- E.g., given the items in the user profile Pu, target items whose content is very similar are recommended highly, and less similar ones are recommended mildly
44. Content-Based Recommender Systems
45. Example: Recommender Systems
- Collaborative filtering recommenders
- Predictions for unseen (target) items are computed based on the other users with similar interest scores on items in user u's profile
- i.e., users with similar tastes (aka "nearest neighbors")
- Requires computing correlations between user u and other users according to interest scores or ratings
Can we predict Karen's rating on the unseen item Independence Day?
46. Example: Recommender Systems
- Collaborative filtering recommenders
- Predictions for unseen (target) items are computed based on the other users with similar interest scores on items in user u's profile
- i.e., users with similar tastes (aka "nearest neighbors")
- Requires computing correlations between user u and other users according to interest scores or ratings
Prediction: Karen's rating on Independence Day is estimated from the ratings of her K nearest neighbors, weighted by their correlation to Karen (a sketch of this computation follows below)
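Below is a minimal sketch of the neighborhood-based prediction described above: Pearson correlation between users, then a correlation-weighted average over the positively correlated neighbors. The user names and ratings are hypothetical, only for illustration.

import math

def pearson(u, v):
    """Pearson correlation between two users over their co-rated items."""
    common = [i for i in u if i in v]
    if len(common) < 2:
        return 0.0
    mu = sum(u[i] for i in common) / len(common)
    mv = sum(v[i] for i in common) / len(common)
    num = sum((u[i] - mu) * (v[i] - mv) for i in common)
    den = math.sqrt(sum((u[i] - mu) ** 2 for i in common) *
                    sum((v[i] - mv) ** 2 for i in common))
    return num / den if den else 0.0

def predict(target, others, item, k=3):
    """Correlation-weighted average of the k most similar users' ratings on `item`."""
    weighted = [(pearson(target, o), o) for o in others if item in o]
    neighbors = sorted((wo for wo in weighted if wo[0] > 0),
                       key=lambda wo: wo[0], reverse=True)[:k]
    num = sum(w * o[item] for w, o in neighbors)
    den = sum(w for w, _ in neighbors)
    return num / den if den else None

# hypothetical users and ratings; "karen" has not rated "Independence Day"
karen = {"Star Wars": 5, "Titanic": 1, "Matrix": 4}
others = [
    {"Star Wars": 5, "Titanic": 2, "Matrix": 5, "Independence Day": 4},
    {"Star Wars": 1, "Titanic": 5, "Matrix": 2, "Independence Day": 2},
    {"Star Wars": 4, "Titanic": 1, "Matrix": 5, "Independence Day": 5},
]
print(round(predict(karen, others, "Independence Day"), 2))   # ~4.48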
47. Possible Interesting Project Ideas
- Build a content-based recommender for:
- Movies (e.g., previous example)
- News stories (requires basic text processing and indexing of documents)
- Music (based on features such as genre, artist, etc.)
- Build a collaborative recommender for:
- Movies (using movie ratings), e.g., movielens.org
- Music, e.g., pandora.com
- Recommend songs or albums based on collaborative ratings
- Or, recommend whole playlists based on playlists from other users (this might be a good candidate application for association rule mining; why?)
48. Other Forms of Collaborative and Social Filtering
- Social Tagging (Folksonomy)
- People add free-text tags to their content
- Where people happen to use the same terms, their content is linked
- Frequently used terms float to the top, creating a kind of positive feedback loop for popular tags
- Examples:
- Del.icio.us
- Flickr
49. Social Tagging
- Deviating from standard mental models
- No browsing of topical, categorized navigation, or searching for an explicit term or phrase
- Instead: use the language I use to define my world (tagging)
- Sharing my language and contexts will create community
- Tagging creates community through the overlap of perspectives
- This leads to the creation of social networks, which may further develop and evolve
- But does this lead to dynamic evolution of complex concepts or knowledge? Collective intelligence?
50. Clustering and Collaborative Filtering: clustering based on ratings (movielens)
51. Clustering and Collaborative Filtering: tag clustering example
52. Classification Example - Bank Data
- Want to determine likely responders to a direct mail campaign for a new product, a "Personal Equity Plan" (PEP)
- The training data include records kept about how previous customers responded and bought the product
- In this case the target class is "pep", with a binary value
- We want to build a model and apply it to new data (a customer list) in which the value of the class attribute is not available
53. Data Preparation
- Several steps to prepare the data for Weka and for See5:
- Open the training data in Excel, remove the id column, and save the result as a comma-delimited file (e.g., bank.csv)
- Do the same with the new customer data, but also add a new column called "pep" as the last column; the value of this column for each record should be "?"
- Weka
- Must convert the data to ARFF format
- The attribute specification and the data are in the same file
- The data portion is just the comma-delimited data file without the label row
- See5/C5
- Create a names file and a data file
- The names file contains the attribute specification; the data file is the same as above
- The first line of the names file must be the name(s) of the target class(es), in this case "pep"
54. Data File Format for Weka

Training Data:

@relation 'train-bank-data'
@attribute 'age' real
@attribute 'sex' {'MALE','FEMALE'}
@attribute 'region' {'INNER_CITY','RURAL','TOWN','SUBURBAN'}
@attribute 'income' real
@attribute 'married' {'YES','NO'}
@attribute 'children' real
@attribute 'car' {'YES','NO'}
@attribute 'save_act' {'YES','NO'}
@attribute 'current_act' {'YES','NO'}
@attribute 'mortgage' {'YES','NO'}
@attribute 'pep' {'YES','NO'}
@data
48,FEMALE,INNER_CITY,17546,NO,1,NO,NO,NO,NO,YES
40,MALE,TOWN,30085.1,YES,3,YES,NO,YES,YES,NO
. . .

New Cases:

@relation 'new-bank-data'
@attribute 'age' real
@attribute 'region' {'INNER_CITY','RURAL','TOWN','SUBURBAN'}
. . .
@attribute 'pep' {'YES','NO'}
@data
23,MALE,INNER_CITY,18766.9,YES,0,YES,YES,NO,YES,?
30,MALE,RURAL,9915.67,NO,1,NO,YES,NO,YES,?
55. C4.5 Implementation in Weka
- To build a model (decision tree), use the classifiers.trees.J48 class

Decision Tree Output (pruned):

children <= 2
|   children <= 0
|   |   married = YES
|   |   |   mortgage = YES
|   |   |   |   save_act = YES: NO (16.0/2.0)
|   |   |   |   save_act = NO: YES (9.0/1.0)
|   |   |   mortgage = NO: NO (59.0/6.0)
|   |   married = NO
|   |   |   mortgage = YES
|   |   |   |   save_act = YES: NO (12.0)
|   |   |   |   save_act = NO: YES (3.0)
|   |   |   mortgage = NO: YES (29.0/2.0)
|   children > 0
|   |   income <= 29622
|   |   |   children <= 1
|   |   |   |   income <= 12640.3: NO (5.0)
|   |   |   |   income > 12640.3
|   |   |   |   |   current_act = YES: YES (28.0/1.0)
|   |   |   |   |   current_act = NO
|   |   |   |   |   |   income <= 17390.1: NO (3.0)
|   |   |   |   |   |   income > 17390.1: YES (6.0)
|   |   |   children > 1: NO (47.0/3.0)
|   |   income > 29622: YES (48.0/2.0)
children > 2
|   income <= 43228.2: NO (30.0/2.0)
|   income > 43228.2: YES (5.0)
56. C4.5 Implementation in Weka

Error on training data:
Correctly Classified Instances       281      93.6667 %
Incorrectly Classified Instances      19       6.3333 %
Mean absolute error                    0.1163
Root mean squared error                0.2412
Relative absolute error               23.496  %
Root relative squared error           48.4742 %
Total Number of Instances            300

Confusion Matrix:
   a   b   <-- classified as
 122  13 |  a = YES
   6 159 |  b = NO

Stratified cross-validation:
Correctly Classified Instances       274      91.3333 %
Incorrectly Classified Instances      26       8.6667 %
Mean absolute error                    0.1434
Root mean squared error                0.291
Relative absolute error               28.9615 %
Root relative squared error           58.4922 %
Total Number of Instances            300

Confusion Matrix:
   a   b   <-- classified as
 118  17 |  a = YES
   9 156 |  b = NO

The rest of the output contains statistical information about the model, including the confusion matrix, error rates, etc.
The model can be saved to be later applied to the test data (or to new unclassified instances).