Data Mining - PowerPoint PPT Presentation

About This Presentation

Title:

Data Mining

Slides: 45

Provided by: ifmAcTzs

Transcript and Presenter's Notes

1
Data Mining

Instructor: Bajuna Salehe
Email: bajunar_at_yahoo.com
Web: http://www.ifm.ac.tz/staff/bajuna/courses

Classification and Prediction
2
Classification and Prediction

Classification and prediction are two forms of
data analysis that can be used to extract models
describing important data classes or to predict
future data trends. Such analysis can help
provide us with a better understanding of the
data at large.

3
An example application

An emergency room in a hospital measures 17
variables (e.g., blood pressure, age, etc.) of
newly admitted patients.
A decision is needed on whether to put a new
patient in an intensive-care unit (ICU).
Due to the high cost of ICU, those patients who
may survive less than a month are given higher
priority.
Problem: predict high-risk patients and
discriminate them from low-risk patients.

4
Another application

A credit card company receives thousands of
applications for new cards. Each application
contains information about an applicant:
age
marital status
annual salary
outstanding debts
credit rating
etc.
Problem: decide whether an application should be
approved, i.e., classify applications into two
categories, approved and not approved.

5
Machine learning and our focus

Machine learning is like human learning from past
experiences.
A computer does not have experiences.
A computer system learns from data, which
represent some past experiences of an
application domain.
Our focus: learn a target function that can be
used to predict the values of a discrete class
attribute, e.g., approved or not approved, and
high-risk or low-risk.
The task is commonly called supervised learning,
classification, or inductive learning.

6
Classification and Prediction

Whereas classification predicts categorical
(discrete, unordered) labels, prediction models
continuous-valued functions.

7
Classification and Prediction

For example, we can build a classification model
to categorize bank loan applications as either
safe or risky, or a prediction model to predict
the expenditures in dollars of potential
customers on computer equipment given their
income and occupation.
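The contrast between the two kinds of output can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the debt-to-income threshold, and the sample numbers are hypothetical and do not come from the presentation. The prediction side uses ordinary least squares on a single input (income), which is one simple way to model a continuous-valued function.

```python
def classify_loan(income: float, debt: float) -> str:
    """Classification: return a categorical label.

    The 50% debt-to-income threshold is a made-up rule for illustration.
    """
    return "safe" if debt <= 0.5 * income else "risky"


def fit_line(xs, ys):
    """Prediction: fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Hypothetical (income, expenditure) pairs, in thousands of dollars.
    incomes = [20.0, 40.0, 60.0, 80.0]
    spending = [1.0, 2.1, 2.9, 4.2]
    slope, intercept = fit_line(incomes, spending)
    print(classify_loan(income=50.0, debt=10.0))  # a label
    print(slope * 70.0 + intercept)               # a number
```

The classifier returns one of a fixed set of labels, while the fitted line returns a real number for any income.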

8
Classification

Classification is the process of finding a model
(or function) that describes and distinguishes
data classes or concepts, for the purpose of
being able to use the model to predict the class
of objects whose class label is unknown.
The derived model is based on the analysis of a
set of training data (i.e., data objects whose
class label is known).

9
What is Classification

Classification is the task of assigning objects
to their respective categories.
Examples include classifying email messages as
spam or non-spam based upon the message header
and content, and classifying galaxies based upon
their respective shapes.

10
What is Classification

Classification can provide valuable support for
informed decision making in an organisation.
For example, suppose a mobile phone company would
like to promote a new cell-phone product to the
public. Instead of mass mailing the promotional
catalog to everyone, the company may be able to
reduce the campaign cost by targeting only a
small segment of the population.

11
What is Classification

It may classify each person as a potential buyer
or non-buyer based on their personal information
such as income, occupation, lifestyle, and credit
ratings.

12
Discrete Data

Discrete Data: A set of data is said to be
discrete if the values / observations belonging
to it are distinct and separate, i.e. they can be
counted (1, 2, 3, ...). Examples might include the
number of kittens in a litter; the number of
patients in a doctor's surgery; the number of
flaws in one metre of cloth; gender (male,
female); blood group (O, A, B, AB).

13
Discrete Data

Any data measurements that are not quantified on
an infinitely divisible numeric scale. Includes
items like counts, proportions, ratios, or
percentages of a characteristic (e.g. sex, loan
forms, department attendance, etc.) that have
measurements like pass or fail, leak or no leak,
small, medium, or large, go or no-go tests.
(SixSigma.com Dictionary)

14
Continuous Data

Continuous/Variable Data: A set of data is said
to be continuous if the values / observations
belonging to it may take on any value within a
finite or infinite interval. You can count, order
and measure continuous data. For example: height,
weight, temperature, the amount of sugar in an
orange, the time required to run a mile.

15
Continuous Data

Variable data have real-number measurements
such as 2.34, 2.55, etc. (i.e. data that
can be measured on a continuous scale).

16
Categorical Data

Categorical Data: A set of data is said to be
categorical if the values or observations
belonging to it can be sorted according to
category. Each value is chosen from a set of
non-overlapping categories. For example, shoes in
a cupboard can be sorted according to colour: the
characteristic 'colour' can have non-overlapping
categories 'black', 'brown', 'red' and 'other'.
People have the characteristic of 'gender' with
categories 'male' and 'female'.

17
Nominal Data

Nominal Data: A set of data is said to be
nominal if the values / observations belonging to
it can be assigned a code in the form of a number
where the numbers are simply labels. You can
count but not order or measure nominal data. For
example, in a data set males could be coded as 0,
females as 1; marital status of an individual
could be coded as Y if married, N if single.

18
Ordinal Data

Ordinal Data: A set of data is said to be
ordinal if the values / observations belonging to
it can be ranked (put in order) or have a rating
scale attached. You can count and order, but not
measure, ordinal data.

19
Ordinal Data

The categories for an ordinal set of data have a
natural order. For example, suppose a group of
people were asked to taste varieties of biscuit
and classify each biscuit on a rating scale of 1
to 5, representing strongly dislike, dislike,
neutral, like, strongly like. A rating of 5
indicates more enjoyment than a rating of 4, for
example, so such data are ordinal.
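The difference between nominal and ordinal data shows up in which comparisons make sense in code. The sketch below is illustrative only: the gender coding and the biscuit rating scale are taken from the slides, while the names `GENDER_CODES`, `RATING_SCALE`, `rating_rank` and `prefers` are assumptions introduced here.

```python
# Nominal data: the numbers are labels only, so equality is the only
# meaningful comparison between two codes.
GENDER_CODES = {"male": 0, "female": 1}

# Ordinal data: the biscuit rating scale, listed in its natural order.
RATING_SCALE = ["strongly dislike", "dislike", "neutral", "like", "strongly like"]


def rating_rank(label: str) -> int:
    """Return the 1-to-5 rank of a rating label."""
    return RATING_SCALE.index(label) + 1


def prefers(rating_a: str, rating_b: str) -> bool:
    """True if rating_a indicates more enjoyment than rating_b."""
    return rating_rank(rating_a) > rating_rank(rating_b)


if __name__ == "__main__":
    print(rating_rank("strongly like"))
    print(prefers("strongly like", "like"))
    # Comparing GENDER_CODES values with < or > would run, but the result
    # would carry no meaning, because nominal codes have no order.
```

Note that the ranks can be ordered but not measured: the gap between ranks 4 and 5 is not claimed to equal the gap between ranks 1 and 2.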

20
Preliminaries

The input data for a classification task is given
in the form of a collection of records.
Each record, also known as an instance or example,
is characterised by a tuple (x, y), where x is the
attribute set and y is the class label.

21
Preliminaries

Table 1. Vertebrate Data Set

22
Preliminaries

The table on the previous slide shows a sample data
set used for classifying vertebrates into one of
the following categories: mammal, bird, fish,
reptile, or amphibian.
The attribute set includes properties of a
vertebrate such as its body temperature, skin
cover, method of reproduction, ability to fly and
ability to live in water.
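A record (x, y) of this kind can be sketched as a small Python structure. The rows below are illustrative assumptions based on general knowledge of the animals, not a copy of Table 1, and the field names (`body_temperature`, `gives_birth`, and so on) are hypothetical names chosen for the attributes listed above.

```python
from typing import NamedTuple


class Record(NamedTuple):
    """One instance (x, y): attribute set x plus class label y."""
    attributes: dict  # x, the attribute set
    label: str        # y, the class label


# Illustrative rows only; these are not the contents of Table 1.
records = [
    Record({"body_temperature": "warm-blooded", "skin_cover": "hair",
            "gives_birth": "yes", "can_fly": "no", "lives_in_water": "no"},
           "mammal"),
    Record({"body_temperature": "warm-blooded", "skin_cover": "feathers",
            "gives_birth": "no", "can_fly": "yes", "lives_in_water": "no"},
           "bird"),
    Record({"body_temperature": "cold-blooded", "skin_cover": "scales",
            "gives_birth": "no", "can_fly": "no", "lives_in_water": "yes"},
           "fish"),
]


def class_labels(data):
    """Return the distinct class labels present in the data, sorted."""
    return sorted({record.label for record in data})


if __name__ == "__main__":
    x, y = records[0]  # unpack one record into attribute set and label
    print(x["skin_cover"], y)
    print(class_labels(records))
```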

23
Preliminaries

The attribute set may contain discrete and
continuous features; in Table 1, however, the
attribute set contains mostly discrete values.
The class label, on the other hand, must be a
discrete attribute.
This is a key characteristic that distinguishes
classification from another predictive modeling
task known as regression, where y is a continuous
attribute.

24
What is Classification

Classification can be described as the task of
assigning objects to one of several predefined
categories.
Input: Attribute Set (x) -> Classification Model -> Output: Class label (y)
The diagram shows classification as the task
of mapping an input attribute set x into its
class label y.
25
Simple Definition

Classification is the task of learning a target
function f that maps each attribute set x into
one of the pre-defined class labels y.
The target function is also known informally as a
classification model.

26
Usefulness of Classification Model

A classification model is useful for the
following purposes:
It may serve as an explanatory tool to
distinguish between objects of different classes
(Descriptive Modeling).
It may also be used to predict the class label of
unknown records (Predictive Modeling). Consider
the table below.

27
Usefulness of Classification Model

A classification model can be treated as a black
box that automatically assigns a class label when
presented with the attribute set of an unknown
record.
Example: you may be given the characteristics of a
creature known as a Gila monster.

28
Usefulness of Classification Model

By building a classification model from the data
set shown in Table 1, you may use the model to
determine the class to which the creature
belongs.
Classification models are most suited for
predicting or describing data sets with binary or
nominal target attributes.
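The Gila monster example can be sketched with a nearest-neighbor classifier, one of the techniques the presentation lists. This is not necessarily the method the slides have in mind, and everything concrete below is an assumption: the training rows are illustrative stand-ins for Table 1, the attribute names are hypothetical, and the Gila monster's attribute values come from general knowledge (a cold-blooded, scaly, egg-laying lizard that neither flies nor lives in water).

```python
def hamming(a: dict, b: dict) -> int:
    """Count the attributes on which two attribute sets disagree."""
    return sum(a[key] != b[key] for key in a)


def predict_1nn(training, x: dict) -> str:
    """Return the label of the training record closest to x (1-nearest-neighbor)."""
    nearest_attributes, nearest_label = min(
        training, key=lambda record: hamming(record[0], x)
    )
    return nearest_label


def make(temp, skin, birth, fly, water):
    """Build an attribute set from the five attributes used in this sketch."""
    return {"body_temperature": temp, "skin_cover": skin,
            "gives_birth": birth, "can_fly": fly, "lives_in_water": water}


# Illustrative training records (attribute set, class label); not Table 1.
TRAINING = [
    (make("warm", "hair", "yes", "no", "no"), "mammal"),        # human
    (make("warm", "feathers", "no", "yes", "no"), "bird"),      # pigeon
    (make("cold", "scales", "no", "no", "yes"), "fish"),        # salmon
    (make("cold", "scales", "no", "no", "no"), "reptile"),      # komodo dragon
    (make("cold", "none", "no", "no", "semi"), "amphibian"),    # frog
]

if __name__ == "__main__":
    gila_monster = make("cold", "scales", "no", "no", "no")
    print(predict_1nn(TRAINING, gila_monster))
```

The model here acts as the black box described earlier: it receives only the attribute set of the unknown record and returns a class label.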

29
Classification vs. Prediction

Classification:
Predicts categorical class labels.
Classifies data (constructs a model) based on the
training set and the values (class labels) in a
classifying attribute, and uses it in classifying
new data.
Prediction:
Models continuous-valued functions, i.e.,
predicts unknown or missing values.
Typical applications:
Credit approval
Target marketing
Medical diagnosis
Treatment effectiveness analysis

30
Classification Techniques
31
Classification Technique

A classification technique is a systematic
approach for building classification models from
an input data set.
Examples of classification techniques include:
Decision Tree Classifiers
Rule-Based Classifiers
Neural Networks
Support Vector Machines
Naive Bayes Classifiers
Nearest-Neighbor Classifiers

32
Classification Technique

Each technique employs a learning algorithm to
identify a model that best fits the relationship
between the attribute set and class label of the
input data (produces outputs consistent with the
class labels of the input data).

33
Classification Technique

A good classification model must predict
correctly the class labels of records it has
never seen before.
Building models with good generalization
capability, i.e., models that accurately predict
the class labels of previously unseen records, is
therefore a key objective of the learning
algorithm.

34
General Approach to Solve a Classification Problem

A general strategy for solving a classification
problem is as follows.
First, the input data is divided into two
disjoint sets, known as the training set and test
set, respectively.
The training set will be used for building a
classification model.
The induced model is later applied to the test
set to predict the class label of each test
record.
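The split into two disjoint sets can be sketched as follows. The function name `train_test_split`, the 30% default test fraction, and the fixed seed are assumptions made for this illustration; the slides do not prescribe a split ratio.

```python
import random


def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle the records and divide them into disjoint training and test sets."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(records)               # copy, so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for repeatability
    n_test = max(1, round(len(shuffled) * test_fraction))
    test_set = shuffled[:n_test]
    training_set = shuffled[n_test:]
    return training_set, test_set


if __name__ == "__main__":
    data = list(range(10))  # stand-in for ten labeled records
    training_set, test_set = train_test_split(data)
    print(len(training_set), len(test_set))
```

Shuffling before splitting avoids a test set that simply mirrors the order in which the records were collected.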

35
Why are we dividing the data into two sets?

This strategy of dividing the data into
independent training and test sets allows us to
obtain an unbiased estimate of the performance of
a model on previously unseen records.
The figure on the next slide depicts this approach.

36
General Approach to Solve a Classification Problem
37
Performance Measurement of Model

Evaluation of the performance of a classification
model is based upon the number of test records
predicted correctly and wrongly by the model.
The counts are tabulated in a table known as a
confusion matrix.

38
Performance Measurement of Model

Table 2 depicts the confusion matrix for a binary
classification problem.

39
Performance Measurement of Model

Each entry $f_{ij}$ in this table denotes the number
of records from class i predicted to be of class
j.
For instance, $f_{01}$ is the number of records from
class 0 wrongly predicted as class 1.
Based on the entries in the confusion matrix, the
total number of correct predictions made by the
model is $(f_{11} + f_{00})$ and the total number of
wrong predictions is $(f_{10} + f_{01})$.
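The four entries of a binary confusion matrix can be tallied directly from the actual and predicted labels. In this sketch the function name `confusion_counts` and the sample label lists are assumptions for illustration; the key (i, j) follows the slide's convention of actual class i predicted as class j.

```python
def confusion_counts(actual, predicted):
    """Return a dict mapping (i, j) to the number of class-i records predicted as j."""
    counts = {(i, j): 0 for i in (0, 1) for j in (0, 1)}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical labels for eight test records.
    actual = [1, 1, 1, 0, 0, 0, 0, 1]
    predicted = [1, 0, 1, 0, 1, 0, 0, 1]
    f = confusion_counts(actual, predicted)
    correct = f[(1, 1)] + f[(0, 0)]  # correct predictions
    wrong = f[(1, 0)] + f[(0, 1)]    # wrong predictions
    print(f)
    print(correct, wrong)
```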

40
Performance Measurement of Model

Although a confusion matrix provides the
information needed to determine how well a
classification model performs, it is useful to
summarize this information in a single number.
This would make it more convenient to compare the
performance of different classification models.

41
Performance Measurement of Model

There are several performance metrics available
for doing this. One of the most popular metrics
is model accuracy, which is defined as
$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} = \frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}}$$
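Accuracy follows directly from the four confusion-matrix counts. The function below is a minimal sketch; its name and the sample counts are assumptions for illustration.

```python
def accuracy(f11: int, f10: int, f01: int, f00: int) -> float:
    """Fraction of predictions that are correct: (f11 + f00) / total."""
    total = f11 + f10 + f01 + f00
    if total == 0:
        raise ValueError("the confusion matrix is empty")
    return (f11 + f00) / total


if __name__ == "__main__":
    # Hypothetical counts: 3 + 3 correct predictions out of 8 test records.
    print(accuracy(f11=3, f10=1, f01=1, f00=3))
```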

42
Performance Measurement of Model

Equivalently, the performance of a model can be
expressed in terms of its error rate given by the
following equation
$$\text{Error rate} = \frac{\text{Number of wrong predictions}}{\text{Total number of predictions}} = \frac{f_{10} + f_{01}}{f_{11} + f_{10} + f_{01} + f_{00}}$$
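The error rate can also be computed straight from the label lists, without building the confusion matrix first. This is a sketch with a hypothetical function name and made-up labels. Since every prediction is either correct or wrong, the result equals one minus the accuracy.

```python
def error_rate(actual, predicted) -> float:
    """Fraction of records whose predicted label differs from the actual label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("no predictions to evaluate")
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)


if __name__ == "__main__":
    # Hypothetical labels for eight test records, two of them misclassified.
    actual = [1, 1, 1, 0, 0, 0, 0, 1]
    predicted = [1, 0, 1, 0, 1, 0, 0, 1]
    rate = error_rate(actual, predicted)
    print(rate)      # error rate
    print(1 - rate)  # the corresponding accuracy
```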

44
Decision Trees
