Title: Data Preprocessing
Data Preprocessing
- An important issue for data warehousing and data mining: real-world data tend to be incomplete, noisy, and inconsistent
- Includes
- data cleaning
- data integration
- data transformation
- data reduction
Forms of Data Preprocessing
[Figure: data cleaning; data integration; data transformation (e.g. -2, 32, 100, 59, 48 normalized to -0.02, 0.32, 1.00, 0.59, 0.48); data reduction (attributes A1, A2, A3, ..., A126 reduced to A1, A2, A3, ..., A115; tuples T1, T2, ..., T2000 reduced to T1, T4, ..., T1456)]
Data Preprocessing
- Data cleaning
- fill in missing values
- smooth noisy data
- identify outliers
- correct inconsistencies
Data Preprocessing
- Data integration
- combines data from multiple sources to form a coherent data store
- metadata, correlation analysis, data conflict detection, and resolution of semantic heterogeneity contribute towards smooth data integration
Data Preprocessing
- Data transformation
- converts the data into forms appropriate for mining
- e.g., attribute data may be normalized to fall within a small range such as 0.0 to 1.0
Data Preprocessing
- Data reduction
- data cube aggregation, dimensionality reduction, data compression, numerosity reduction, and discretization
- used to obtain a reduced representation of the data while minimizing the loss of information content
Data Preprocessing
- Automatic generation of concept hierarchies for numeric data
- binning, histogram analysis
- cluster analysis, entropy-based discretization
- segmentation by natural partitioning
- For categorical data, concept hierarchies may be generated based on the number of distinct values of the attributes defining the hierarchy
Data Cleaning
- Handling data that are
- incomplete,
- noisy, and
- inconsistent
It is an imperfect world
Data Cleaning: Missing Values
- Methods for filling in missing values (see the sketch after this list)
- Ignore the tuple
- Fill in the missing value manually
- Use a global constant
- Use the attribute mean
- Use the attribute mean for all samples belonging to the same class
- Use the most probable value
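A minimal sketch of the two mean-based strategies using pandas; the toy DataFrame and its price and label columns are hypothetical:

```python
import pandas as pd

# Hypothetical toy data with missing prices.
df = pd.DataFrame({
    "label": ["a", "a", "b", "b"],
    "price": [4.0, None, 25.0, None],
})

# Fill with the overall attribute mean.
df["price_mean"] = df["price"].fillna(df["price"].mean())

# Fill with the attribute mean of samples belonging to the same class.
df["price_class_mean"] = df["price"].fillna(
    df.groupby("label")["price"].transform("mean")
)
print(df)
```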
Data Cleaning: Noisy Data
- Noise: random error or variance in a measured variable
- smooth out the data to remove the noise
Data Cleaning: Noisy Data
- Data smoothing techniques
- Binning
- smooths a sorted data value by consulting its neighborhood
- the sorted values are distributed into a number of buckets, or bins
- smoothing by bin means
- smoothing by bin medians
- smoothing by bin boundaries
Simple Discretization Methods: Binning
- Equal-width (distance) partitioning
- divides the range into N intervals of equal size (uniform grid)
- if A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A)/N
- the most straightforward method, but outliers may dominate the presentation
- skewed data are not handled well
- Equal-depth (frequency) partitioning
- divides the range into N intervals, each containing approximately the same number of samples
- good data scaling
- managing categorical attributes can be tricky
Binning Methods for Data Smoothing
- Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
- Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
- Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
- Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
(see the sketch below)
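The following plain-Python sketch reproduces the slide's numbers; the function names are just for illustration:

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

def equi_depth_bins(values, n_bins):
    """Split sorted values into n_bins bins of equal size."""
    values = sorted(values)
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closest bin boundary (min or max)."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

bins = equi_depth_bins(prices, 3)
print(bins)                        # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```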
Cluster Analysis
- Clustering
- outliers may be detected by clustering, where similar values are organized into groups, or clusters
- Combined computer and human inspection
- Regression
Cluster Analysis
[Figure: similar values organized into clusters; outliers fall outside the clusters]
Regression
[Figure: data fitted to a straight line, e.g. y = x + 1, through the point (X1, Y1); see the sketch below]
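A minimal regression-smoothing sketch with NumPy; the data points are hypothetical values scattered around the line y = x + 1:

```python
import numpy as np

# Hypothetical noisy observations roughly following y = x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

# Fit a straight line y = a*x + b by least squares.
a, b = np.polyfit(x, y, deg=1)

# Smooth the data by replacing each y with its fitted value on the line.
y_smoothed = a * x + b
print(round(a, 2), round(b, 2))  # close to 1.0 and 1.0
print(y_smoothed)
```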
Data Smoothing Techniques: Binning
- Example
- sorted data for price: 4, 8, 15, 21, 21, 24, 25, 28, 34
- Partition into equi-depth bins:
- Bin 1: 4, 8, 15
- Bin 2: 21, 21, 24
- Bin 3: 25, 28, 34
Data Smoothing Techniques: Binning
- smoothing by bin means:
- Bin 1: 9, 9, 9
- Bin 2: 22, 22, 22
- Bin 3: 29, 29, 29
- smoothing by bin boundaries:
- Bin 1: 4, 4, 15
- Bin 2: 21, 21, 24
- Bin 3: 25, 25, 34
Data Cleaning: Inconsistent Data
- Can be corrected manually using external references
- Source of inconsistency
- errors made at data entry, which can be corrected using a paper trace
Data Integration and Transformation
- Data integration
- combines data from multiple sources into a coherent data store, e.g. a data warehouse
- sources may include multiple databases, data cubes, or flat files
- Issues in data integration
- schema integration
- redundancy
- detection and resolution of data value conflicts
- Data transformation
- data are transformed or consolidated into forms appropriate for mining
- involves
- smoothing
- aggregation
- generalization
- normalization
- attribute construction
Data Integration
- Schema integration
- integrate metadata from different sources
- entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id vs. B.cust-
- Detecting and resolving data value conflicts
- for the same real-world entity, attribute values from different sources may differ
- possible reasons: different representations, different scales, e.g., metric vs. British units
Data Integration
- Redundant data occur often when integrating multiple databases
- the same attribute may have different names in different databases
- one attribute may be a derived attribute in another table, e.g., annual revenue
- Redundancy may be detected by correlation analysis (see the sketch below)
- Careful integration of data from multiple sources may help reduce or avoid redundancies and inconsistencies and improve mining speed and quality
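A minimal correlation-analysis sketch with NumPy; the two revenue attributes are hypothetical, with one derived from the other:

```python
import numpy as np

# Hypothetical attributes from two integrated sources: annual_revenue is
# derived from monthly_revenue, so the pair is redundant.
monthly_revenue = np.array([10.0, 12.0, 9.0, 15.0, 11.0])
annual_revenue = 12 * monthly_revenue

# A correlation coefficient near +1 or -1 flags a likely redundancy.
r = np.corrcoef(monthly_revenue, annual_revenue)[0, 1]
print(r)  # 1.0 here, since one attribute is a multiple of the other
```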
Data Transformation
- Smoothing: remove noise from the data
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing
- Normalization: scale to fall within a small, specified range
- min-max normalization
- z-score normalization
- normalization by decimal scaling
- Attribute/feature construction
- new attributes constructed from the given ones
Data Transformation: Normalization
- min-max normalization:
  v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
- z-score normalization:
  v' = (v - mean_A) / stand_dev_A
- normalization by decimal scaling:
  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
(see the sketch below)
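A sketch of the three normalizations applied to the figure's values -2, 32, 100, 59, 48, using NumPy:

```python
import numpy as np

data = np.array([-2.0, 32.0, 100.0, 59.0, 48.0])

# Min-max normalization onto [0.0, 1.0].
minmax = (data - data.min()) / (data.max() - data.min())

# z-score normalization.
zscore = (data - data.mean()) / data.std()

# Decimal scaling: divide by 10^j so that all |values| fall below 1.
j = 0
while (np.abs(data) / 10 ** j).max() >= 1:
    j += 1
decimal = data / 10 ** j  # here j = 3, since 100 / 10^2 = 1 is not < 1

print(minmax)   # approx. [0.00, 0.33, 1.00, 0.60, 0.49]
print(zscore)
print(decimal)  # [-0.002  0.032  0.1    0.059  0.048]
```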
Data Reduction
- To obtain a reduced representation of the data set that is
- much smaller in volume
- but closely maintains the integrity of the original data
- mining on the reduced data set should be more efficient yet produce the same analytical results
Data Reduction
- Strategies
- data cube aggregation
- dimensionality reduction
- data compression
- numerosity reduction
- discretization and concept hierarchy generation
Data Cube Aggregation
- The lowest level of a data cube
- the aggregated data for an individual entity of interest
- e.g., a customer in a phone-calling data warehouse
- Multiple levels of aggregation in data cubes
- further reduce the size of the data to deal with
- Reference appropriate levels
- use the smallest representation that is sufficient to solve the task
- Queries regarding aggregated information should be answered using the data cube when possible
Data Cube Aggregation
Sales data for company AllElectronics for 1997-1999. Quarterly sales for 1997 aggregate to annual sales (see the sketch below):

  Year 1997:
  Quarter   Sales
  Q1        224,000
  Q2        408,000
  Q3        350,000
  Q4        586,000

  Year   Sales
  1997   1,568,000
  1998   2,356,000
  1999   3,594,000
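A minimal pandas sketch of this aggregation step, using the 1997 figures from the table (1998 and 1999 would be handled the same way):

```python
import pandas as pd

# Quarterly sales for 1997, from the slide.
quarterly = pd.DataFrame({
    "year":    [1997, 1997, 1997, 1997],
    "quarter": ["Q1", "Q2", "Q3", "Q4"],
    "sales":   [224_000, 408_000, 350_000, 586_000],
})

# Aggregate one level up the cube: quarterly totals become annual totals.
annual = quarterly.groupby("year")["sales"].sum()
print(annual)  # 1997 -> 1568000
```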
Dimensionality Reduction
[Figure: the role of dimension reduction in data mining. Data preparation brings the data into a standard form; dimension reduction yields a data subset that feeds prediction methods and evaluation]
Dimensionality Reduction
- Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining task or redundant
- Dimensionality reduction reduces the data set size by removing such attributes
Dimensionality Reduction
- How can we find a good subset of the original attributes?
- attribute subset selection finds a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes
Dimensionality Reduction
- Attribute subset selection techniques (a greedy sketch follows this list)
- Stepwise forward selection
- starts with an empty set of attributes
- the best of the original attributes is determined and added to the set
- at each subsequent iteration, the best of the remaining original attributes is added to the set
- Stepwise backward elimination
- starts with the full set of attributes
- at each step, it removes the worst attribute remaining in the set
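A minimal greedy forward-selection sketch in plain Python; the score function is an assumption standing in for any subset-quality measure, such as cross-validated classification accuracy:

```python
def forward_selection(attributes, score, k):
    """Greedily grow a subset of at most k attributes, adding at each
    step the attribute whose inclusion scores best."""
    selected = []
    remaining = list(attributes)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical usage: score could wrap a classifier's validation accuracy.
# subset = forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], my_score, k=3)
```

Backward elimination is symmetric: start from the full set and repeatedly drop the attribute whose removal hurts the score least.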
Dimensionality Reduction
- Attribute subset selection techniques
- Combination of forward selection and backward elimination
- at each step, the procedure selects the best attribute and removes the worst from among the remaining attributes
Dimensionality Reduction
- Attribute subset selection techniques
- Decision tree induction
- ID3 and C4.5, originally intended for classification
- constructs a flow-chart-like structure where each internal (non-leaf) node denotes a test on an attribute
- each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction
- at each node, the algorithm chooses the best attribute to partition the data into individual classes
Example of Decision Tree Induction
Initial attribute set: {A1, A2, A3, A4, A5, A6}
[Figure: decision tree with root test A4?; its branches lead to tests A6? and A1?, whose leaves predict Class 1 or Class 2]
Reduced attribute set: {A1, A4, A6}
(see the sketch below)
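A sketch of tree-based attribute selection with scikit-learn; this is an assumption on my part, since scikit-learn implements CART rather than the ID3/C4.5 algorithms named on the slide, and the data here are synthetic:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 6))                       # attributes A1..A6
y = (X[:, 3] + 0.5 * X[:, 0] > 1).astype(int)  # class depends on A4 and A1

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)

# Attributes actually tested in the tree form the reduced attribute set.
used = {f"A{i + 1}" for i in np.unique(tree.tree_.feature) if i >= 0}
print(used)  # typically a small subset such as {'A1', 'A4'}
```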
Dimensionality Reduction
- Attribute subset selection techniques
- Reduct computation by rough set theory
- attributes are selected using the concept of discernibility relations of classes in the data set
- will be discussed in the next class
Data Compression
- Apply data encoding or transformations to obtain a reduced or compressed representation of the original data
- lossless
- although the original data can be reconstructed exactly, lossless methods allow only limited manipulation of the data
- lossy
Data Compression
- Two methods of lossy data compression
- Wavelet transforms
- Principal Component Analysis
Data Compression
- Wavelet transforms
- a linear signal processing technique that, when applied to a data vector D, transforms it to a numerically different vector D' of wavelet coefficients (see the sketch below)
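A minimal one-level Haar transform in plain Python, as an illustration of the idea (a full multi-level DWT would repeat this step on the averages):

```python
def haar_step(data):
    """One Haar step: pairwise averages (the coarse approximation)
    followed by pairwise half-differences (the detail coefficients)."""
    avg = [(a + b) / 2 for a, b in zip(data[0::2], data[1::2])]
    diff = [(a - b) / 2 for a, b in zip(data[0::2], data[1::2])]
    return avg + diff

d = [2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0]  # hypothetical data vector D
print(haar_step(d))  # [2.0, 1.0, 4.0, 4.0, 0.0, -1.0, -1.0, 0.0]
```

Small detail coefficients can then be truncated to zero, which is where the lossy compression comes from.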
Data Compression
- Principal Component Analysis
- suppose the data to be compressed consist of N tuples from k dimensions
- PCA searches for c k-dimensional orthogonal vectors that can best be used to represent the data, where c <= k
- the original data are projected onto a much smaller space (see the sketch below)
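A PCA-compression sketch with scikit-learn (an assumed library choice; the slide describes the method generically, and the data here are synthetic):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((100, 8))           # N = 100 tuples, k = 8 dimensions

pca = PCA(n_components=3)          # c = 3 orthogonal component vectors
X_reduced = pca.fit_transform(X)   # project onto the smaller space

print(X_reduced.shape)                      # (100, 3)
print(pca.explained_variance_ratio_.sum())  # fraction of variance kept
```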
Numerosity Reduction
- Numerosity reduction techniques can be applied to reduce the data volume by choosing alternative, smaller forms of data representation
- Techniques
- regression and log-linear models
- histograms
- clustering
- sampling
Discretization
- Three types of attributes
- nominal: values from an unordered set
- ordinal: values from an ordered set
- continuous: real numbers
- Discretization
- divides the range of a continuous attribute into intervals
- some classification algorithms only accept categorical attributes
- reduces data size
- prepares for further analysis
Discretization and Concept Hierarchy
- Discretization
- reduces the number of values for a given continuous attribute by dividing the range of the attribute into intervals; interval labels can then be used to replace actual data values
- Concept hierarchies
- reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior)
Discretization
- Example
- manual discretization of the AUS data set
Discretization and Concept Hierarchy Generation for Numeric Data
- Binning (see earlier sections)
- Histogram analysis (see earlier sections)
- Clustering analysis (see earlier sections)
- Entropy-based discretization
- Segmentation by natural partitioning
Entropy-Based Discretization
- Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
  E(S, T) = (|S1|/|S|) * Ent(S1) + (|S2|/|S|) * Ent(S2)
- The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization (a sketch follows)
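A minimal binary-split sketch in plain Python; the toy ages and class labels are hypothetical:

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return 0.0 - sum((c / n) * log2(c / n)
                     for c in (labels.count(l) for l in set(labels)))

def best_split(values, labels):
    """Try every midpoint between consecutive values as boundary T and
    return the one minimizing the weighted entropy E(S, T)."""
    pairs = sorted(zip(values, labels))
    best_t, best_e = None, float("inf")
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e

print(best_split([23, 25, 30, 41, 45, 50], ["y", "y", "y", "n", "n", "n"]))
# (35.5, 0.0): zero weighted entropy, the boundary separates the classes perfectly
```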
Entropy-Based Discretization
- The process is recursively applied to the partitions obtained until some stopping criterion is met
- Experiments show that it may reduce data size and improve classification accuracy
Segmentation by Natural Partitioning
- A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, natural intervals (a sketch follows)
- if an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals
- if it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
- if it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals (see Fig. 3.16, p. 137)
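The digit-counting core of the rule as a small plain-Python sketch; computing the actual interval boundaries from a data range is omitted:

```python
def intervals_345(distinct_msd_values):
    """Map the count of distinct values at the most significant digit
    of the range onto the number of equi-width intervals."""
    if distinct_msd_values in (3, 6, 7, 9):
        return 3
    if distinct_msd_values in (2, 4, 8):
        return 4
    if distinct_msd_values in (1, 5, 10):
        return 5
    raise ValueError("count not covered by the 3-4-5 rule")

# E.g. a range spanning 3 units at the most significant digit
# is split into 3 equi-width intervals.
print(intervals_345(3))  # 3
```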
Concept Hierarchy Generation
- Many techniques can be applied recursively to provide a hierarchical partitioning of the attribute: a concept hierarchy
- Concept hierarchies are useful for mining at multiple levels of abstraction
Concept Hierarchy Generation for Categorical Data
- Specification of a partial ordering of attributes explicitly at the schema level by users or experts
- street < city < state < country
- Specification of a portion of a hierarchy by explicit data grouping
- {Urbana, Champaign, Chicago} < Illinois
- Specification of a set of attributes
- the system automatically generates a partial ordering by analysis of the number of distinct values
- e.g., street < city < state < country
- Specification of only a partial set of attributes
- e.g., only street < city, not the others
Automatic Concept Hierarchy Generation
- Some concept hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the given data set
- the attribute with the most distinct values is placed at the lowest level of the hierarchy
- note the exception: weekday, month, quarter, year
- country: 15 distinct values
- province_or_state: 365 distinct values
- city: 3,567 distinct values
- street: 674,339 distinct values
(see the sketch below)
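A minimal sketch of this ordering heuristic, using the distinct-value counts from the slide:

```python
# Distinct-value counts from the slide.
distinct_counts = {
    "country": 15,
    "province_or_state": 365,
    "city": 3_567,
    "street": 674_339,
}

# Most distinct values -> lowest level of the hierarchy.
levels = sorted(distinct_counts, key=distinct_counts.get, reverse=True)
print(" < ".join(levels))  # street < city < province_or_state < country
```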
Discretization and Concept Hierarchy Generation
- Manual discretization
- the information needed to convert continuous values into discrete values is obtained from an expert in the domain area
- example (refer to the UCI machine learning data banks)
Data Discretization
[Table 5: the invariance features for mathematical symbols]
[Table 6: discretization of the mathematical symbols]
Summary
- Data preparation is a big issue for both warehousing and mining
- Data preparation includes
- data cleaning and data integration
- data reduction and feature selection
- discretization
- Many methods have been developed, but this is still an active area of research