Data Preparation for Data Mining Part I - PowerPoint PPT Presentation

About This Presentation
Title:

Data Preparation for Data Mining Part I

Description:

How much of your time is spent assuring data quality, then preparing that data ... Source: http://en.wikipedia.org/wiki/Sepal. Iris ... – PowerPoint PPT presentation

Provided by: drtehy

Transcript and Presenter's Notes

1
Data Preparation for Data Mining (Part I)

2
Question
  • How much of your time is spent assuring data
    quality, and then preparing that data for
    developing and deploying predictive models?

3
Why Data Preprocessing?
  • Data in the real world is dirty:
  • incomplete,
  • noisy,
  • inconsistent.
  • No quality data, no quality mining results!

4
Data Understanding
5
Data Understanding
  • a phase of the CRISP-DM data mining methodology
  • gaining a first view of, and insight into, the
    existing data
  • finding initial relationships and interesting
    patterns

6
COLLECT INITIAL DATA
  • involves the identification of relevant
    attributes or factors.

7
Example
  • You want to play golf outdoors with your friends.
  • What are the relevant attributes or factors
    considered?

8
DESCRIBE DATA
  • The metadata view in Clementine consists of:
  • type (Range, Set, Flag)
  • values
  • the total sample size
  • the total number of attributes of each type

9
EXPLORE DATA
  • Data exploration uses
  • visualisation (distributions, scatter plots)
  • to reveal the
  • first patterns or
  • correlations among attributes, and
  • to show
  • the distribution of attributes or factors, and
  • pairs of numeric attributes plotted against the
    target attribute.

10
IRIS.txt (Example)
11
Source: http://en.wikipedia.org/wiki/Sepal
12
Iris
  • Iris-setosa (Source: http://www.badbear.com/signa/signa.pl?Iris-setosa)
  • Iris-versicolor (Source: http://en.wikipedia.org/wiki/Iris_versicolor)
  • Iris-virginica (Source: http://plants.usda.gov/java/profile?symbol=IRVI)

13
Quality data
  • Data are of high quality
  • if they are fit for their intended uses in
    operations, decision making, and planning.
  • Poor quality data may lead to uninteresting
    patterns.

14
Data Quality Team
  • The data quality team should carefully screen
    and check the data to ensure that the quality
    and quantity of the data being gathered meet the
    business objectives, so that the data mining
    project succeeds.

15
data analysts and business analysts
  • Olson says that the data quality team should
    consist of data analysts and business analysts.

16
Data analysts
  • should be good at data architecture and
  • should know how to apply tools to navigate large
    volumes of data
  • to find patterns relevant to data quality.
  • Source: http://download.oracle.com/docs/cd/B10500_01/server.920/a96520/schemas.htm

17
Business analysts
  • should understand the best practices of business
    and the current business processes
  • to find patterns relevant to data quality as
    well.

18
Exercise 1
19
DATA CLEANING
  • Large volumes of data must be handled during the
    data cleaning process, and
  • some missing values are likely in the databases
    or files.

20
  • How do we deal with missing values?

21
Point estimation
  • one of several data cleaning techniques
  • involves the use of sample data to calculate a
    single value
  • that serves as a "best guess" for a population
    parameter
  • commonly used to estimate the mean
  • can be applied to estimate missing values.

22
For example
  • 100 employee records in an employee file.
  • 99 employee records have salary information.
  • The mean salary of these 99 is 900.
  • 900 is selected as the value for the remaining
    employee's salary.
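The mean-substitution step above can be sketched in Python (the salary figures are hypothetical; the slide's 99-of-100 scenario is reduced to a small list):

```python
def impute_with_mean(values):
    """Replace missing entries (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

# A small stand-in for the employee file: one salary is missing.
salaries = [800, 900, 1000, None]
print(impute_with_mean(salaries))  # the None becomes 900.0, the mean of the rest
```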

23
Good Estimator

24
Jackknife estimate
  • is one of the well-known point estimation
    techniques.
  • It leaves one data value out of the set of
    observed values at a time and
  • recomputes the statistic on the remaining data,
    iterating over the whole data set.

25
Example
  • Given the following set of values: 2, 3, 4, 8, 25,
  • determine the jackknife estimate for the mean.
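The leave-one-out computation for this example can be sketched in Python (assuming the statistic of interest is the mean):

```python
def jackknife_mean(data):
    """Jackknife estimate of the mean: average the leave-one-out means."""
    n = len(data)
    # Each leave-one-out mean drops one observation and averages the rest.
    loo_means = [(sum(data) - x) / (n - 1) for x in data]
    return sum(loo_means) / n

values = [2, 3, 4, 8, 25]
print(jackknife_mean(values))  # 8.4 -- for the mean, this equals the sample mean
```

The leave-one-out means are 10, 9.75, 9.5, 8.5, and 4.25, and their average is 8.4.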

26
MAXIMUM LIKELIHOOD ESTIMATE (MLE)
  • is another technique for point estimation.
  • is used to calculate the best way of fitting a
    mathematical model to some data.
  • is to determine the parameters that maximise the
    probability (likelihood) of the sample data.
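As an illustration (not from the slides): for normally distributed data, the µ that maximises the likelihood is exactly the sample mean. A crude grid search over the log-likelihood makes this visible:

```python
import math

def log_likelihood(mu, data, sigma=1.0):
    """Log-likelihood of the data under a Normal(mu, sigma) model."""
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (x - mu)**2 / (2 * sigma**2) for x in data)

data = [2.0, 3.0, 4.0, 8.0, 25.0]
# Scan candidate means and keep the one with the highest likelihood.
candidates = [i / 10 for i in range(0, 301)]  # 0.0, 0.1, ..., 30.0
best_mu = max(candidates, key=lambda mu: log_likelihood(mu, data))
print(best_mu)  # 8.4, which is the sample mean of the data
```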

27
Expectation-Maximization
  • solves estimation problems with incomplete data.
  • The EM algorithm finds an MLE for a parameter
    (such as a mean) using a two-step process:
    expectation and maximization.
  • Obtain initial estimates for the parameters.
  • Iteratively use the estimates to fill in the
    missing data, re-estimate, and continue until
    convergence.

28
Example
  • Given the following set of values: 2, 4, 10, 16
    (known items), with two data items missing
    (unknown items),
  • determine the MLE for the mean, starting from
    the initial guess µ(0) = 3.
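The slide's example can be worked in Python: the E-step replaces each of the two missing items with the current mean estimate, and the M-step recomputes the mean over all six items. For these values the iteration converges to 8:

```python
def em_mean(known, n_missing, mu=3.0, tol=1e-6):
    """EM for the mean with missing items: the E-step fills each missing
    value with the current mu; the M-step recomputes mu over all items."""
    while True:
        filled = known + [mu] * n_missing   # E-step: impute missing values
        new_mu = sum(filled) / len(filled)  # M-step: re-estimate the mean
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu

print(em_mean([2, 4, 10, 16], n_missing=2, mu=3.0))  # converges to 8 (up to tol)
```

The fixed point satisfies µ = (32 + 2µ)/6, which gives µ = 8.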

29
Data Preparation for Data Mining (Part II)
30
Missing value
  • no data value is stored for the variable in the
    current observation

31
Patients.txt
32
Patients.txt
  Variable  Description            Type       Valid Values
  PATNO     Patient Number         Character  Numerals
  GENDER    Gender                 Character  'M' or 'F'
  VISIT     Visit Date             MMDDYY10   Any valid date
  HR        Heart Rate             Numeric    40 to 100
  SBP       Systolic Blood Pres.   Numeric    80 to 200
  DBP       Diastolic Blood Pres.  Numeric    60 to 120
  DX        Diagnosis Code         Character  1 to 3 digits
  AE        Adverse Event          Character  '0' or '1'
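The valid-value column above suggests simple range checks during cleaning. A sketch in Python (the field names and limits come from the table; the sample record is hypothetical):

```python
# Valid ranges for the numeric fields, taken from the Patients.txt description.
RANGES = {"HR": (40, 100), "SBP": (80, 200), "DBP": (60, 120)}

def out_of_range(record):
    """Return the names of numeric fields whose values fall outside the valid ranges."""
    bad = []
    for field, (lo, hi) in RANGES.items():
        value = record.get(field)
        if value is not None and not lo <= value <= hi:
            bad.append(field)
    return bad

patient = {"PATNO": "001", "GENDER": "M", "HR": 210, "SBP": 120, "DBP": 70}
print(out_of_range(patient))  # ['HR'] -- a heart rate of 210 is outside 40..100
```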

33
Data integration
  • is the process of combining/merging data
    residing at different heterogeneous sources
    (internal data sources and external data
    sources),
  • providing the user with a unified view of these
    data.

34
SCHEMA INTEGRATION
  • Different sources may use different
    representations or definitions in their schemas
    while referring to the same information.
  • This is known as the entity identification
    problem.

35
For example
  • How can we identify that customer_id in one data
    set and customer_no in another refer to the same
    entity?

36
Schema matching
  • Currently, most of the schema matching is done
    manually.
  • tedious,
  • time-consuming,
  • error-prone.

37
  • We need automated support for schema matching
    that is
  • faster,
  • less error-prone, and
  • less labor-intensive.
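A very rough sketch of such automated support: compare attribute names by string similarity using Python's difflib (the 0.7 threshold is an arbitrary assumption; real matchers also exploit types, constraints, and data values):

```python
from difflib import SequenceMatcher

def match_attributes(schema_a, schema_b, threshold=0.7):
    """Pair attributes from two schemas whose names are sufficiently similar."""
    pairs = []
    for a in schema_a:
        for b in schema_b:
            score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if score >= threshold:
                pairs.append((a, b, round(score, 2)))
    return pairs

print(match_attributes(["customer_id", "order_date"], ["customer_no", "ship_date"]))
# only ('customer_id', 'customer_no') clears the threshold
```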

38
A mapping between Global Schema and Local Schema
39
The architecture for data integration
40
Correlation Analysis
  • To detect redundancy between attributes,
  • apply correlation analysis.

41
  • Given two attributes (X1, X2)
  • measure the correlation of one attribute (X1)
    with the other attribute (X2).
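Measuring the correlation between two numeric attributes can be sketched with the Pearson coefficient (plain Python, no external libraries; the sample values are hypothetical):

```python
import math

def pearson(x1, x2):
    """Pearson correlation coefficient between two equal-length numeric attributes."""
    n = len(x1)
    mean1, mean2 = sum(x1) / n, sum(x2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(x1, x2))
    sd1 = math.sqrt(sum((a - mean1) ** 2 for a in x1))
    sd2 = math.sqrt(sum((b - mean2) ** 2 for b in x2))
    return cov / (sd1 * sd2)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # ~1.0: perfectly correlated, i.e. redundant
```

A coefficient near +1 or -1 signals that one attribute is largely redundant given the other.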

42
(No Transcript)
43
(No Transcript)
44
DATA TRANSFORMATION
  • In metadata terms, a data transformation
    converts data from a source data format into a
    destination data format.
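As a small illustration (field names follow the Patients.txt example; the destination format is an assumption), converting a source record's MM/DD/YYYY visit date into ISO format:

```python
from datetime import datetime

def transform_record(source):
    """Convert a source-format record into a (hypothetical) destination
    format: ISO dates and integer vital signs."""
    return {
        "patient_no": source["PATNO"],
        "visit_date": datetime.strptime(source["VISIT"], "%m/%d/%Y").date().isoformat(),
        "heart_rate": int(source["HR"]),
    }

print(transform_record({"PATNO": "001", "VISIT": "10/21/1998", "HR": "68"}))
# {'patient_no': '001', 'visit_date': '1998-10-21', 'heart_rate': 68}
```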

45
  • Thank You