Data Preparation for Data Mining Part I - PowerPoint PPT Presentation

About This Presentation
Title:

Data Preparation for Data Mining Part I

Description:

How much of your time is spent assuring data quality, then preparing that data ... Source: http://en.wikipedia.org/wiki/Sepal. Iris ... – PowerPoint PPT presentation

Provided by: drtehy

Transcript and Presenter's Notes

1
Data Preparation for Data Mining (Part I)

2
Question
  • How much of your time is spent assuring data
    quality, and then preparing that data for
    developing and deploying predictive models?

3
Why Data Preprocessing?
  • Data in the real world is dirty:
  • incomplete,
  • noisy,
  • inconsistent.
  • No quality data, no quality mining results!

4
Data Understanding
5
Data Understanding
  • a phase of the CRISP-DM data mining methodology
  • gaining a first view of, and insight into, the
    existing data
  • finding initial relationships and interesting
    patterns

6
COLLECT INITIAL DATA
  • involves the identification of relevant
    attributes or factors.

7
Example
  • You want to play golf outdoors with your friends.
  • What are the relevant attributes or factors
    considered?

8
DESCRIBE DATA
  • The metadata view in Clementine consists of:
  • type (Range, Set, Flag)
  • values
  • the total sample size
  • the total number of attributes of each type

9
EXPLORE DATA
  • Data exploration uses
  • visualisation (distributions, scatter plots)
  • to reveal the
  • first patterns or
  • correlations among attributes, and
  • to show
  • the distribution of attributes or factors, and
  • pairs of numeric attributes plotted against the
    target attribute.

10
IRIS.txt (Example)
11
Source: http://en.wikipedia.org/wiki/Sepal
12
Iris
  • Iris-setosa (Source: http://www.badbear.com/signa/signa.pl?Iris-setosa)
  • Iris-versicolor (Source: http://en.wikipedia.org/wiki/Iris_versicolor)
  • Iris-virginica (Source: http://plants.usda.gov/java/profile?symbol=IRVI)

13
Quality data
  • Data are of high quality
  • if they are fit for their intended uses in
    operations, decision making, and planning.
  • Poor quality data may lead to uninteresting
    patterns.

14
Data Quality Team
  • The data quality team should carefully screen
    and check the data to ensure that the quality
    and quantity of the data being gathered meet the
    business objectives, so that the data mining
    project succeeds.

15
data analysts and business analysts
  • Olson says that the data quality team should
    consist of data analysts and business analysts.

16
Data analysts
  • should be good at data architecture and
  • should know how to apply tools to navigate large
    volumes of data
  • to find patterns relevant to data quality.
  • Source: http://download.oracle.com/docs/cd/B10500_01/server.920/a96520/schemas.htm

17
Business analysts
  • should understand the best practices of business
    and the current business processes
  • to find patterns relevant to data quality as
    well.

18
Exercise 1
19
DATA CLEANING
  • Large volumes of data must be handled during the
    data cleaning process, and
  • some missing values are likely in the databases
    or files.

20
  • How do we deal with missing values?

21
Point estimation
  • one of several data cleaning techniques
  • involves the use of sample data to calculate a
    single value
  • that serves as a "best guess" for a population
    parameter
  • commonly used to estimate the mean
  • can be applied to estimate missing values.

22
For example
  • 100 employee records in an employee file.
  • 99 employee records have salary information.
  • The mean salary of these 99 is 900.
  • 900 is selected as the value for the remaining
    employee's salary.
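The mean-substitution step above can be sketched in Python (the salary figures are hypothetical; the slide's 99-of-100 scenario is reduced to a small list):

```python
def impute_with_mean(values):
    """Replace missing entries (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

# A small stand-in for the employee file: one salary is missing.
salaries = [800, 900, 1000, None]
print(impute_with_mean(salaries))  # the None becomes 900.0, the mean of the rest
```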

23
Good Estimator

24
Jackknife estimate
  • is one of the well-known point estimation
    techniques.
  • It leaves one data value out of the set of
    observed values at a time and
  • recomputes the statistic on the remaining data,
    iterating over the whole data set.

25
Example
  • Given the following set of values: 2, 3, 4, 8, 25,
  • determine the jackknife estimate for the mean.
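The leave-one-out computation for this example can be sketched in Python (assuming the statistic of interest is the mean):

```python
def jackknife_mean(data):
    """Jackknife estimate of the mean: average the leave-one-out means."""
    n = len(data)
    # Each leave-one-out mean drops one observation and averages the rest.
    loo_means = [(sum(data) - x) / (n - 1) for x in data]
    return sum(loo_means) / n

values = [2, 3, 4, 8, 25]
print(jackknife_mean(values))  # 8.4 -- for the mean, this equals the sample mean
```

The leave-one-out means are 10, 9.75, 9.5, 8.5, and 4.25, and their average is 8.4.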

26
MAXIMUM LIKELIHOOD ESTIMATE (MLE)
  • is another technique for point estimation.
  • is used to calculate the best way of fitting a
    mathematical model to some data.
  • is to determine the parameters that maximise the
    probability (likelihood) of the sample data.
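As an illustration (not from the slides): for normally distributed data, the µ that maximises the likelihood is exactly the sample mean. A crude grid search over the log-likelihood makes this visible:

```python
import math

def log_likelihood(mu, data, sigma=1.0):
    """Log-likelihood of the data under a Normal(mu, sigma) model."""
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (x - mu)**2 / (2 * sigma**2) for x in data)

data = [2.0, 3.0, 4.0, 8.0, 25.0]
# Scan candidate means and keep the one with the highest likelihood.
candidates = [i / 10 for i in range(0, 301)]  # 0.0, 0.1, ..., 30.0
best_mu = max(candidates, key=lambda mu: log_likelihood(mu, data))
print(best_mu)  # 8.4, which is the sample mean of the data
```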

27
Expectation-Maximization
  • solves estimation problems with incomplete data.
  • The EM algorithm finds an MLE for a parameter
    (such as a mean) using a two-step process:
    expectation and maximization.
  • Obtain initial estimates for the parameters.
  • Iteratively use the estimates to fill in the
    missing data, re-estimate, and continue until
    convergence.

28
Example
  • Given the following set of values: 2, 4, 10, 16
    (known items), with two data items missing
    (unknown items),
  • determine the MLE for the mean, starting from
    the initial guess µ(0) = 3.
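The slide's example can be worked in Python: the E-step replaces each of the two missing items with the current mean estimate, and the M-step recomputes the mean over all six items. For these values the iteration converges to 8:

```python
def em_mean(known, n_missing, mu=3.0, tol=1e-6):
    """EM for the mean with missing items: the E-step fills each missing
    value with the current mu; the M-step recomputes mu over all items."""
    while True:
        filled = known + [mu] * n_missing   # E-step: impute missing values
        new_mu = sum(filled) / len(filled)  # M-step: re-estimate the mean
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu

print(em_mean([2, 4, 10, 16], n_missing=2, mu=3.0))  # converges to 8 (up to tol)
```

The fixed point satisfies µ = (32 + 2µ)/6, which gives µ = 8.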

29
Data Preparation for Data Mining (Part II)
30
Missing value
  • no data value is stored for the variable in the
    current observation

31
Patients.txt
32
Patients.txt
  Variable  Description            Type       Valid Values
  PATNO     Patient Number         Character  Numerals
  GENDER    Gender                 Character  'M' or 'F'
  VISIT     Visit Date             MMDDYY10   Any valid date
  HR        Heart Rate             Numeric    40 to 100
  SBP       Systolic Blood Pres.   Numeric    80 to 200
  DBP       Diastolic Blood Pres.  Numeric    60 to 120
  DX        Diagnosis Code         Character  1 to 3 digits
  AE        Adverse Event          Character  '0' or '1'
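The valid-value column above suggests simple range checks during cleaning. A sketch in Python (the field names and limits come from the table; the sample record is hypothetical):

```python
# Valid ranges for the numeric fields, taken from the Patients.txt description.
RANGES = {"HR": (40, 100), "SBP": (80, 200), "DBP": (60, 120)}

def out_of_range(record):
    """Return the names of numeric fields whose values fall outside the valid ranges."""
    bad = []
    for field, (lo, hi) in RANGES.items():
        value = record.get(field)
        if value is not None and not lo <= value <= hi:
            bad.append(field)
    return bad

patient = {"PATNO": "001", "GENDER": "M", "HR": 210, "SBP": 120, "DBP": 70}
print(out_of_range(patient))  # ['HR'] -- a heart rate of 210 is outside 40..100
```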

33
Data integration
  • is the process of combining/merging data
    residing at different heterogeneous sources
    (internal data sources and external data
    sources),
  • providing the user with a unified view of these
    data.

34
SCHEMA INTEGRATION
  • Different sources may use different
    representations or definitions in their schemas
    while referring to the same information.
  • This is known as the entity identification
    problem.

35
For example
  • How can we identify that customer_id in one data
    set and customer_no in another refer to the same
    entity?

36
Schema matching
  • Currently, most of the schema matching is done
    manually.
  • tedious,
  • time-consuming,
  • error-prone.

37
  • We need automated support for schema matching
    that is
  • faster,
  • less error-prone, and
  • less labor-intensive.
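A very rough sketch of such automated support: compare attribute names by string similarity using Python's difflib (the 0.7 threshold is an arbitrary assumption; real matchers also exploit types, constraints, and data values):

```python
from difflib import SequenceMatcher

def match_attributes(schema_a, schema_b, threshold=0.7):
    """Pair attributes from two schemas whose names are sufficiently similar."""
    pairs = []
    for a in schema_a:
        for b in schema_b:
            score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if score >= threshold:
                pairs.append((a, b, round(score, 2)))
    return pairs

print(match_attributes(["customer_id", "order_date"], ["customer_no", "ship_date"]))
# only ('customer_id', 'customer_no') clears the threshold
```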

38
A mapping between Global Schema and Local Schema
39
The architecture for data integration
40
Correlation Analysis
  • To detect redundancy between attributes,
  • apply correlation analysis.

41
  • Given two attributes (X1, X2)
  • measure the correlation of one attribute (X1)
    with the other attribute (X2).
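Measuring the correlation between two numeric attributes can be sketched with the Pearson coefficient (plain Python, no external libraries; the sample values are hypothetical):

```python
import math

def pearson(x1, x2):
    """Pearson correlation coefficient between two equal-length numeric attributes."""
    n = len(x1)
    mean1, mean2 = sum(x1) / n, sum(x2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(x1, x2))
    sd1 = math.sqrt(sum((a - mean1) ** 2 for a in x1))
    sd2 = math.sqrt(sum((b - mean2) ** 2 for b in x2))
    return cov / (sd1 * sd2)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # ~1.0: perfectly correlated, i.e. redundant
```

A coefficient near +1 or -1 signals that one attribute is largely redundant given the other.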

42
(No Transcript)
43
(No Transcript)
44
DATA TRANSFORMATION
  • In metadata terms, a data transformation
    converts data from a source data format into a
    destination data format.
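As a small illustration (field names follow the Patients.txt example; the destination format is an assumption), converting a source record's MM/DD/YYYY visit date into ISO format:

```python
from datetime import datetime

def transform_record(source):
    """Convert a source-format record into a (hypothetical) destination
    format: ISO dates and integer vital signs."""
    return {
        "patient_no": source["PATNO"],
        "visit_date": datetime.strptime(source["VISIT"], "%m/%d/%Y").date().isoformat(),
        "heart_rate": int(source["HR"]),
    }

print(transform_record({"PATNO": "001", "VISIT": "10/21/1998", "HR": "68"}))
# {'patient_no': '001', 'visit_date': '1998-10-21', 'heart_rate': 68}
```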

45
  • Thank You