Title: Data Mining
1Data Mining
- Instructor: Bajuna Salehe
- Email: bajunar_at_yahoo.com
- Web: http://www.ifm.ac.tz/staff/bajuna/courses
2Classification and Prediction
- Classification and prediction are two forms of
data analysis that can be used to extract models
describing important data classes or to predict
future data trends. Such analysis can help
provide us with a better understanding of the
data at large.
3An example application
- An emergency room in a hospital measures 17
variables (e.g., blood pressure, age, etc.) of
newly admitted patients.
- A decision is needed whether to put a new
patient in an intensive-care unit (ICU).
- Due to the high cost of ICU, those patients who
may survive less than a month are given higher
priority.
- Problem: to predict high-risk patients and
discriminate them from low-risk patients.
4Another application
- A credit card company receives thousands of
applications for new cards. Each application
contains information about an applicant:
- age
- marital status
- annual salary
- outstanding debts
- credit rating
- etc.
- Problem: to decide whether an application should
be approved, or to classify applications into two
categories, approved and not approved.
5Machine learning and our focus
- Like human learning from past experiences.
- A computer does not have experiences.
- A computer system learns from data, which
represent some past experiences of an
application domain.
- Our focus: learn a target function that can be
used to predict the values of a discrete class
attribute, e.g., approved or not approved, and
high-risk or low-risk.
- The task is commonly called supervised learning,
classification, or inductive learning.
6Classification and Prediction
- Whereas classification predicts categorical
(discrete, unordered) labels, prediction models
continuous valued functions.
7Classification and Prediction
- For example, we can build a classification model
to categorize bank loan applications as either
safe or risky, or a prediction model to predict
the expenditures in dollars of potential
customers on computer equipment given their
income and occupation.
8Classification
- Classification is the process of finding a model
(or function) that describes and distinguishes
data classes or concepts, for the purpose of
being able to use the model to predict the class
of objects whose class label is unknown.
- The derived model is based on the analysis of a
set of training data (i.e., data objects whose
class label is known).
9What is Classification
- Classification is the task of assigning objects
to their respective categories.
- Examples include classifying email messages as
spam or non-spam based upon the message header
and content, and classifying galaxies based upon
their respective shapes.
10What is Classification
- Classification can provide valuable support for
informed decision making in an organisation.
- For example, suppose a mobile phone company would
like to promote a new cell-phone product to the
public. Instead of mass mailing the promotional
catalog to everyone, the company may be able to
reduce the campaign cost by targeting only a
small segment of the population.
11What is Classification
- It may classify each person as a potential buyer
or non-buyer based on their personal information
such as income, occupation, lifestyle, and credit
ratings.
12Discrete Data
- Discrete Data: A set of data is said to be
discrete if the values / observations belonging
to it are distinct and separate, i.e. they can be
counted (1, 2, 3, ...). Examples might include: the
number of kittens in a litter; the number of
patients in a doctor's surgery; the number of
flaws in one metre of cloth; gender (male,
female); blood group (O, A, B, AB).
13Discrete Data
- Any data measurements that are not quantified on
an infinitely divisible numeric scale. Includes
items like counts, proportions, ratios, or
percentages of a characteristic (e.g., sex, loan
forms, department attendance, etc.) that have
measurements like pass or fail, leak or no leak,
small, medium, or large, go or no-go tests.
(SixSigma.com Dictionary)
14Continuous Data
- Continuous/Variable Data: A set of data is said
to be continuous if the values / observations
belonging to it may take on any value within a
finite or infinite interval. You can count, order
and measure continuous data. For example: height,
weight, temperature, the amount of sugar in an
orange, the time required to run a mile.
15Continuous Data
- Variable data have real numbers in the
measurement, like 2.34, 2.55, etc. (i.e. data that
can be measured on a continuous scale).
16Categorical Data
- Categorical Data: A set of data is said to be
categorical if the values or observations
belonging to it can be sorted according to
category. Each value is chosen from a set of
non-overlapping categories. For example, shoes in
a cupboard can be sorted according to colour; the
characteristic 'colour' can have non-overlapping
categories 'black', 'brown', 'red' and 'other'.
People have the characteristic of 'gender' with
categories 'male' and 'female'.
17Nominal Data
- Nominal Data: A set of data is said to be
nominal if the values / observations belonging to
it can be assigned a code in the form of a number
where the numbers are simply labels. You can
count but not order or measure nominal data. For
example, in a data set males could be coded as 0,
females as 1; marital status of an individual
could be coded as Y if married, N if single.
18Ordinal Data
- Ordinal Data: A set of data is said to be
ordinal if the values / observations belonging to
it can be ranked (put in order) or have a rating
scale attached. You can count and order, but not
measure, ordinal data.
19Ordinal Data
- The categories for an ordinal set of data have a
natural order. For example, suppose a group of
people were asked to taste varieties of biscuit
and classify each biscuit on a rating scale of 1
to 5, representing strongly dislike, dislike,
neutral, like, strongly like. A rating of 5
indicates more enjoyment than a rating of 4,
so such data are ordinal.
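The difference between nominal and ordinal data can be illustrated with a minimal Python sketch. The sample values below are made up for illustration; only the Y/N marital status coding and the five-point biscuit rating scale come from the slides.

```python
from collections import Counter

# Nominal data: codes are only labels, so counting is meaningful
# but ordering is not. (Sample values are hypothetical.)
marital_status = ["Y", "N", "Y", "Y", "N"]
counts = Counter(marital_status)
print(counts["Y"], counts["N"])

# Ordinal data: the rating scale has a natural order, so the
# values can be counted and also ranked.
RATING_SCALE = ["strongly dislike", "dislike", "neutral", "like", "strongly like"]
rank = {label: i + 1 for i, label in enumerate(RATING_SCALE)}  # 1 to 5

ratings = ["like", "neutral", "strongly like", "dislike"]  # hypothetical answers
ordered = sorted(ratings, key=rank.get)
print(ordered)
```

Note that the ranks 1 to 5 only express order; the gap between two ratings is not a measured quantity.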
20Preliminaries
- The input data for a classification task is given
in the form of a collection of records.
- Each record, also known as an instance or example, is
characterised by a tuple (x, y), where x is the
attribute set and y is the class label.
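One possible way to hold such (x, y) records in code is sketched below in Python. The `Record` type, the attribute names, and the three sample rows are assumptions made for illustration; they are not taken from the course data.

```python
from typing import NamedTuple

class Record(NamedTuple):
    x: dict  # attribute set
    y: str   # class label

# Hypothetical records, for illustration only.
records = [
    Record(x={"body_temperature": "warm-blooded", "gives_birth": True, "can_fly": False},
           y="mammal"),
    Record(x={"body_temperature": "warm-blooded", "gives_birth": False, "can_fly": True},
           y="bird"),
    Record(x={"body_temperature": "cold-blooded", "gives_birth": False, "can_fly": False},
           y="reptile"),
]

# Each record unpacks into its attribute set and class label.
for attributes, label in records:
    print(attributes, "->", label)
```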
21Preliminaries
- Table 1. Vertebrate Data Set
22Preliminaries
- In the above slide, the table shows a sample data
set used for classifying vertebrates into one of
the following categories: mammal, bird, fish,
reptile, or amphibian.
- The attribute set includes properties of a
vertebrate such as its body temperature, skin
cover, method of reproduction, ability to fly and
ability to live in water.
23Preliminaries
- The attribute set may contain discrete and
continuous features; however, in the table above the
attribute set contains mostly discrete values.
- The class label, on the other hand, must be a
discrete attribute.
- This is a key characteristic that distinguishes
classification from another predictive modeling
task known as regression, where y is a continuous
attribute.
24What is Classification
- Classification can be described as the task of
assigning objects to one of several predefined
categories.
- [Figure: Input (attribute set x) -> Classification Model -> Output (class label y)]
- The diagram shows classification as the task
of mapping an input attribute set x into its
class label y.
25Simple Definition
- Classification is the task of learning a target
function f that maps each attribute set x into
one of the pre-defined class labels y.
- The target function is also known informally as a
classification model.
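To make the idea of a target function concrete, here is a minimal Python sketch of a function f that maps an attribute set x to a class label y. The rule is hand-written and deliberately simplified (two hypothetical labels, "mammal" and "non-mammal", and assumed attribute names); a real classifier would learn f from training data rather than have it coded by hand.

```python
def f(x: dict) -> str:
    """Map an attribute set x to one of the predefined class labels.

    Hand-written stand-in for a learned model; the rule is only a
    rough illustration and will misclassify some animals.
    """
    if x["body_temperature"] == "warm-blooded" and x["gives_birth"]:
        return "mammal"
    return "non-mammal"

# Hypothetical attribute sets.
print(f({"body_temperature": "warm-blooded", "gives_birth": True}))
print(f({"body_temperature": "cold-blooded", "gives_birth": False}))
```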
26Usefulness of Classification Model
- A classification model is useful for the
following purposes:
- It may serve as an explanatory tool to
distinguish between objects of different classes
(Descriptive Modeling).
- It may also be used to predict the class label of
unknown records (Predictive Modeling). Consider
the table below.
27Usefulness of Classification Model
- A classification model can be treated as a black
box that automatically assigns a class label when
presented with the attribute set of an unknown
record.
- Example: you may be given the characteristics of
a creature known as a gila monster.
28Usefulness of Classification Model
- By building a classification model from the data
set shown in Table 1, you may use the model to
determine the class to which the creature
belongs.
- Classification models are most suited for
predicting or describing data sets with binary or
nominal target attributes.
29Classification Prediction
- Classification
- Predicts categorical class labels
- Classifies data (constructs a model) based on the
training set and the values (class labels) in a
classifying attribute and uses it in classifying
new data
- Prediction
- Models continuous-valued functions, i.e.,
predicts unknown or missing values
- Typical Applications
- Credit approval
- Target marketing
- Medical diagnosis
- Treatment effectiveness analysis
30Classification Techniques
31Classification Technique
- A classification technique is a systematic
approach for building classification models from
an input data set.
- Examples of classification techniques include:
- Decision Tree Classifiers
- Rule-Based Classifiers
- Neural Networks
- Support Vector Machines
- Naive Bayes Classifiers
- Nearest-Neighbor Classifiers
32Classification Technique
- Each technique employs a learning algorithm to
identify a model that best fits the relationship
between the attribute set and class label of the
input data (produces outputs consistent with the
class labels of the input data).
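As one illustration of a learning algorithm, the sketch below implements a very small 1-nearest-neighbor classifier in Python. It is only a sketch under stated assumptions: the training records and attribute names are hypothetical, and the distance (number of mismatching attributes) is one simple choice among many.

```python
def mismatch_count(a: dict, b: dict) -> int:
    """Number of attributes on which two attribute sets differ."""
    return sum(1 for key in a if a[key] != b[key])

def nearest_neighbor_predict(training_set, x):
    """Return the class label of the training record closest to x."""
    _, nearest_label = min(training_set, key=lambda rec: mismatch_count(rec[0], x))
    return nearest_label

# Hypothetical training records (attribute set, class label).
training_set = [
    ({"body_temperature": "warm-blooded", "gives_birth": True, "can_fly": False}, "mammal"),
    ({"body_temperature": "warm-blooded", "gives_birth": False, "can_fly": True}, "bird"),
    ({"body_temperature": "cold-blooded", "gives_birth": False, "can_fly": False}, "reptile"),
]

unseen = {"body_temperature": "cold-blooded", "gives_birth": False, "can_fly": False}
print(nearest_neighbor_predict(training_set, unseen))
```

On the training records themselves this model reproduces the known labels, which is what "consistent with the class labels of the input data" means here.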
33Classification Technique
- A good classification model must predict
correctly the class labels of records it has
never seen before.
- Building models with good generalization
capability, i.e., models that accurately predict
the class labels of previously unseen records, is
therefore a key objective of the learning
algorithm.
34General Approach to Solve a Classification Problem
- A general strategy for solving a classification
problem is as follows.
- First, the input data is divided into two
disjoint sets, known as the training set and test
set, respectively.
- The training set is used for building a
classification model.
- The induced model is later applied to the test
set to predict the class label of each test
record.
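The split into disjoint training and test sets can be sketched in Python as follows. The function name `train_test_split`, the 30% test fraction, and the fixed seed are assumptions for illustration; the slides do not prescribe a particular ratio.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle the records and split them into two disjoint sets."""
    shuffled = list(records)               # copy, so the input is not modified
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical records: (attribute set, class label).
records = [(("record", i), i % 2) for i in range(10)]
train_set, test_set = train_test_split(records)
print(len(train_set), len(test_set))
```

The model would then be built from `train_set` only and applied to `test_set` to predict the label of each test record.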
35Why are we dividing the data into two sets?
- This strategy of dividing the data into
independent training and test sets allows us to
obtain an unbiased estimate of the performance of
a model on previously unseen records.
- The figure on the next slide depicts this general approach.
36General Approach to Solve a Classification Problem
37Performance Measurement of Model
- Evaluation of the performance of a classification
model is based upon the number of test records
predicted correctly and wrongly by the model.
- The counts are tabulated in a table known as a
confusion matrix.
38Performance Measurement of Model
- Table 2 depicts the confusion matrix for a binary
classification problem.
39Performance Measurement of Model
- Each entry $f_{ij}$ in this table denotes the number
of records from class $i$ predicted to be of class
$j$.
- For instance, $f_{01}$ is the number of records from
class 0 wrongly predicted as class 1.
- Based on the entries in the confusion matrix, the
total number of correct predictions made by the
model is $(f_{11} + f_{00})$ and the total number of
wrong predictions is $(f_{10} + f_{01})$.
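The counting described above can be sketched in Python. The actual and predicted labels below are made-up values for illustration; the dictionary key (i, j) plays the role of the entry for class i predicted as class j.

```python
def confusion_matrix(actual, predicted):
    """Count records from class i predicted as class j, for binary labels 0/1."""
    f = {(i, j): 0 for i in (0, 1) for j in (0, 1)}
    for i, j in zip(actual, predicted):
        f[(i, j)] += 1
    return f

# Hypothetical test-set labels and model predictions.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

f = confusion_matrix(actual, predicted)
correct = f[(1, 1)] + f[(0, 0)]  # correct predictions
wrong = f[(1, 0)] + f[(0, 1)]    # wrong predictions
print(f, correct, wrong)
```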
40Performance Measurement of Model
- Although a confusion matrix provides the
information needed to determine how well a
classification model performs, it is useful to summarize
this information into a single number.
- This would make it more convenient to compare the
performance of different classification models.
41Performance Measurement of Model
- There are several performance metrics available
for doing this. One of the most popular metrics
is model accuracy, which is defined as
$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} = \frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}}$$
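The accuracy formula translates directly into a short Python function. The counts passed in the example call are hypothetical, chosen only to show the arithmetic.

```python
def accuracy(f11: int, f10: int, f01: int, f00: int) -> float:
    """Fraction of predictions that are correct, from confusion-matrix counts."""
    total = f11 + f10 + f01 + f00
    if total == 0:
        raise ValueError("accuracy is undefined when there are no predictions")
    return (f11 + f00) / total

# Hypothetical counts: 3 + 3 correct out of 8 predictions.
print(accuracy(f11=3, f10=1, f01=1, f00=3))  # 0.75
```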
42Performance Measurement of Model
- Equivalently, the performance of a model can be
expressed in terms of its error rate, given by the
following equation
$$\text{Error rate} = \frac{\text{Number of wrong predictions}}{\text{Total number of predictions}} = \frac{f_{10} + f_{01}}{f_{11} + f_{10} + f_{01} + f_{00}}$$
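The two measures are equivalent because every prediction is either correct or wrong, so accuracy and error rate always sum to 1. Writing $N$ for the total number of predictions (a symbol introduced here for brevity), and using hypothetical counts $f_{11} = 3$, $f_{10} = 1$, $f_{01} = 1$, $f_{00} = 3$ for the worked example:

```latex
\[
N = f_{11} + f_{10} + f_{01} + f_{00}
\]
\[
\text{Accuracy} + \text{Error rate}
  = \frac{f_{11} + f_{00}}{N} + \frac{f_{10} + f_{01}}{N}
  = \frac{N}{N} = 1
\]
\[
\text{Error rate} = \frac{1 + 1}{3 + 1 + 1 + 3} = \frac{2}{8} = 0.25,
\qquad
\text{Accuracy} = 1 - 0.25 = 0.75
\]
```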
44Decision Trees