Data Preprocessing
1
Data Preprocessing
2
Data Preprocessing
  • An important issue for data warehousing and data
    mining
  • Real-world data tend to be incomplete, noisy,
    and inconsistent
  • Preprocessing includes:
  • data cleaning
  • data integration
  • data transformation
  • data reduction

3
Forms of Data Preprocessing

[Diagram: the four forms of data preprocessing]
  Data cleaning
  Data integration
  Data transformation:  -2, 32, 100, 59, 48  ->  -0.02, 0.32, 1.00, 0.59, 0.48
  Data reduction:  attributes A1 A2 A3 ... A126  ->  A1 A2 A3 ... A115
                   transactions T1 T2 ... T2000  ->  T1 T4 ... T1456
4
Data Preprocessing
  • Data cleaning
  • fill in missing values
  • smooth noisy data
  • identify outliers
  • correct data inconsistency

5
Data Preprocessing
  • Data integration
  • combines data from multiple sources to form a
    coherent data store.
  • Metadata, correlation analysis, data conflict
    detection and resolution of semantic
    heterogeneity contribute towards smooth data
    integration.

6
Data Preprocessing
  • Data transformation
  • convert the data into appropriate forms for
    mining.
  • E.g. attribute data may be normalized to fall
    within a small range such as 0.0 to 1.0

7
Data Preprocessing
  • Data reduction
  • data cube aggregation, dimension reduction, data
    compression, numerosity reduction and
    discretization.
  • Used to obtain a reduced representation of the
    data while minimizing the loss of information
    content.

8
Data Preprocessing
  • Automatic generation of concept hierarchies for
    numeric data
  • binning, histogram analysis
  • cluster analysis, entropy based discretization
  • segmentation by natural partitioning
  • for categorical data, concept hierarchies may be
    generated based on the number of distinct values
    of the attributes defining the hierarchies.

9
Forms of Data Preprocessing
[Recap diagram: data cleaning, data integration, data transformation, data reduction]
10
Data Cleaning
  • Handling data that are
  • incomplete,
  • noisy and
  • inconsistent

It is an imperfect world
11
Data Cleaning: Missing Values
  • Method of filling the missing values
  • Ignore the tuple
  • Fill in the missing value manually
  • Use a global constant
  • Use the attribute mean
  • Use the attribute mean for all samples belonging
    to the same class
  • Use the most probable value

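The filling strategies above can be sketched in a few lines. This is a minimal illustration, not the deck's own code: the `fill_missing` helper and the price data are hypothetical, and the most frequent value stands in for the "most probable value".

```python
from statistics import mean, mode

def fill_missing(values, strategy="mean"):
    """Fill None entries with the attribute mean, or with the most
    frequent value as a stand-in for the 'most probable value'."""
    present = [v for v in values if v is not None]
    fill = mean(present) if strategy == "mean" else mode(present)
    return [fill if v is None else v for v in values]

prices = [4, 8, None, 15, 21, None, 24]
print(fill_missing(prices, "mean"))  # each None becomes 14.4, the mean of the rest
```

A global constant (another strategy on the slide) is the degenerate case where `fill` is fixed, e.g. "unknown" or 0.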
12
Data Cleaning: Noisy Data
  • Noise - random error or variance in a measured
    variable
  • smooth out the data to remove the noise

13
Data Cleaning: Noisy Data
  • Data Smoothing Techniques
  • Binning
  • smooth a sorted data value by consulting its
    neighborhood
  • the sorted values are distributed into a number
    of buckets or bins
  • smoothing by bin means
  • smoothing by bin medians
  • smoothing by bin boundaries

14
Simple Discretization Methods: Binning
  • Equal-width (distance) partitioning
  • Divides the range into N intervals of equal size
    (uniform grid)
  • if A and B are the lowest and highest values of
    the attribute, the width of the intervals will be
    W = (B - A)/N
  • The most straightforward, but outliers may
    dominate the presentation
  • Skewed data are not handled well
  • Equal-depth (frequency) partitioning
  • Divides the range into N intervals, each
    containing approximately the same number of
    samples
  • Good data scaling
  • Managing categorical attributes can be tricky

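Both partitioning schemes can be sketched in a few lines of Python. The helper names are illustrative; the data are the sorted prices used on the next slide, and the equal-width version assumes the attribute has more than one distinct value.

```python
def equal_width_bins(values, n):
    """Partition the range [A, B] into n intervals of width W = (B - A)/n.
    Assumes max(values) > min(values)."""
    a, b = min(values), max(values)
    w = (b - a) / n
    bins = [[] for _ in range(n)]
    for v in values:
        # interval i covers [a + i*w, a + (i+1)*w); the top value joins the last bin
        i = min(int((v - a) / w), n - 1)
        bins[i].append(v)
    return bins

def equal_depth_bins(sorted_values, n):
    """Distribute the sorted values into n bins of (approximately) equal count."""
    size = len(sorted_values) // n
    return [sorted_values[i * size:(i + 1) * size] for i in range(n)]

data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
print(equal_width_bins(data, 3))  # width (34 - 4)/3 = 10; outlier-heavy bins possible
print(equal_depth_bins(data, 3))  # four values per bin
```

Note how the equal-width bins end up uneven in count ([4, 8, 9] vs. six values in the top bin), which is the skew problem the slide mentions.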
15
Binning Methods for Data Smoothing
  • Sorted data for price (in dollars): 4, 8, 9,
    15, 21, 21, 24, 25, 26, 28, 29, 34
  • Partition into (equi-depth) bins:
  • - Bin 1: 4, 8, 9, 15
  • - Bin 2: 21, 21, 24, 25
  • - Bin 3: 26, 28, 29, 34
  • Smoothing by bin means:
  • - Bin 1: 9, 9, 9, 9
  • - Bin 2: 23, 23, 23, 23
  • - Bin 3: 29, 29, 29, 29
  • Smoothing by bin boundaries:
  • - Bin 1: 4, 4, 4, 15
  • - Bin 2: 21, 21, 25, 25
  • - Bin 3: 26, 26, 26, 34

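The two smoothing variants can be reproduced directly from the equi-depth bins on this slide; a small sketch with illustrative helper names (means are rounded to the nearest integer, as the slide's numbers are):

```python
def smooth_by_means(bins):
    """Replace every value in a bin with the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace each value with whichever bin boundary (min or max) is closer."""
    out = []
    for b in bins:
        lo, hi = min(b), max(b)
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```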
16
Cluster Analysis
  • Clustering
  • Outliers may be detected by clustering, where
    similar values are organized into groups or
    clusters.
  • Combined computer and human inspection
  • Regression

17
Cluster Analysis
[Figure: similar values organized into clusters; values falling outside every cluster are potential outliers]
18
Regression
[Figure: data points in the (x, y) plane fitted by a regression line, e.g. y = x + 1]
19
Data Smoothing Techniques: Binning
  • Example
  • sorted data for price:
  • 4, 8, 15, 21, 21, 24, 25, 28, 34
  • Partition into equi-depth bins:
  • Bin 1: 4, 8, 15
  • Bin 2: 21, 21, 24
  • Bin 3: 25, 28, 34

20
Data Smoothing Techniques: Binning
  • smoothing by bin means:
  • Bin 1: 9, 9, 9
  • Bin 2: 22, 22, 22
  • Bin 3: 29, 29, 29
  • smoothing by bin boundaries:
  • Bin 1: 4, 4, 15
  • Bin 2: 21, 21, 24
  • Bin 3: 25, 25, 34

21
Data Cleaning: Inconsistent Data
  • Can be corrected manually using external
    references
  • Source of inconsistency:
  • errors made at data entry, which can be corrected
    using a paper trace

22
Forms of Data Preprocessing
[Recap diagram: data cleaning, data integration, data transformation, data reduction]
23
Data Integration and Transformation
  • Data integration
  • combines data from multiple sources into a
    coherent data store, e.g. a data warehouse
  • sources may include multiple databases, data
    cubes or flat files
  • Issues in data integration:
  • schema integration
  • redundancy
  • detection and resolution of data value conflicts
  • Data transformation
  • data are transformed or consolidated into forms
    appropriate for mining
  • involves:
  • smoothing
  • aggregation
  • generalization
  • normalization
  • attribute construction
24
Data Integration
  • Schema integration
  • integrate metadata from different sources
  • Entity identification problem: identify real
    world entities from multiple data sources, e.g.,
    A.cust-id ≡ B.cust-#
  • Detecting and resolving data value conflicts
  • for the same real world entity, attribute values
    from different sources are different
  • possible reasons: different representations,
    different scales, e.g., metric vs. British units
25
Data Integration
  • Redundant data often occur when integrating
    multiple databases
  • The same attribute may have different names in
    different databases
  • One attribute may be a derived attribute in
    another table, e.g., annual revenue
  • Redundant data may be detected by correlation
    analysis
  • Careful integration of the data from multiple
    sources may help reduce/avoid redundancies and
    inconsistencies and improve mining speed and
    quality

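The correlation analysis mentioned above can be sketched with the Pearson coefficient: a value near +1 or -1 suggests one attribute is (nearly) derivable from the other and hence redundant. The helper name and the revenue data are illustrative, not from the deck.

```python
def pearson_r(a, b):
    """Pearson correlation coefficient between two numeric attributes."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sd_a = sum((x - ma) ** 2 for x in a) ** 0.5
    sd_b = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sd_a * sd_b)

annual_revenue = [100, 200, 300, 400]
monthly_average = [x / 12 for x in annual_revenue]  # a derived, redundant attribute
print(pearson_r(annual_revenue, monthly_average))   # close to 1.0: flag as redundant
```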
26
Data Transformation
  • Smoothing: remove noise from data
  • Aggregation: summarization, data cube
    construction
  • Generalization: concept hierarchy climbing
  • Normalization: scale to fall within a small,
    specified range
  • min-max normalization
  • z-score normalization
  • normalization by decimal scaling
  • Attribute/feature construction:
  • new attributes constructed from the given ones

27
Data Transformation: Normalization
  • min-max normalization
  • v' = (v - min_A) / (max_A - min_A) *
    (new_max_A - new_min_A) + new_min_A
  • z-score normalization
  • v' = (v - mean_A) / std_dev_A
  • normalization by decimal scaling
  • v' = v / 10^j, where j is the smallest integer
    such that Max(|v'|) < 1
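The three normalization formulas can be sketched directly; a minimal Python version with illustrative function and parameter names, applied to the -2, 32, 100, 59, 48 example used elsewhere in this deck:

```python
def min_max(v, mn, mx, new_min=0.0, new_max=1.0):
    """Map v from [mn, mx] onto [new_min, new_max]."""
    return (v - mn) / (mx - mn) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    """Center on the attribute mean and scale by its standard deviation."""
    return (v - mean) / std

def decimal_scaling(values):
    """Divide by 10^j, with j the smallest integer making every |v'| < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

data = [-2, 32, 100, 59, 48]
print([min_max(v, min(data), max(data)) for v in data])  # mapped onto [0.0, 1.0]
print(decimal_scaling(data))  # j = 3 here, since max |v| = 100
```

Note that the deck's diagram example (-2, ..., 100 mapped to -0.02, ..., 1.00) divides by 100; strict decimal scaling picks j = 3 instead, because 1.00 is not < 1.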
28
Forms of Data Preprocessing
[Recap diagram: data cleaning, data integration, data transformation, data reduction]
29
Data Reduction
  • To obtain a reduced representation of the data
    set that is
  • much smaller in volume
  • but closely maintains the integrity of the
    original data
  • mining on the reduced dataset should be more
    efficient yet produce the same analytical results.

30
Data Reduction

[Diagram: data reduction strategies]
  • Data cube aggregation
  • Dimensionality reduction
  • Data compression
  • Numerosity reduction
  • Discretization and concept hierarchy generation
31
Data Cube Aggregation
  • The lowest level of a data cube
  • the aggregated data for an individual entity of
    interest
  • e.g., a customer in a phone calling data
    warehouse.
  • Multiple levels of aggregation in data cubes
  • Further reduce the size of data to deal with
  • Reference appropriate levels
  • Use the smallest representation which is enough
    to solve the task
  • Queries regarding aggregated information should
    be answered using the data cube when possible

32
Data Cube Aggregation
Sales data for company AllElectronics for 1997 - 1999:

  Year 1997, by quarter:  Q1 224,000   Q2 408,000   Q3 350,000   Q4 586,000
  Aggregated by year:     1997 1,568,000   1998 2,356,000   1999 3,594,000
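A roll-up from the quarter level to the year level can be sketched as below. The dictionary-of-cells representation and the `roll_up_to_year` name are illustrative, not the deck's implementation; real cubes store many dimensions and precompute such aggregates.

```python
quarterly_sales = {
    ("AllElectronics", 1997, "Q1"): 224_000,
    ("AllElectronics", 1997, "Q2"): 408_000,
    ("AllElectronics", 1997, "Q3"): 350_000,
    ("AllElectronics", 1997, "Q4"): 586_000,
}

def roll_up_to_year(cells):
    """Aggregate quarterly cells one level up the cube, to (company, year)."""
    yearly = {}
    for (company, year, _quarter), amount in cells.items():
        key = (company, year)
        yearly[key] = yearly.get(key, 0) + amount
    return yearly

print(roll_up_to_year(quarterly_sales))  # the 1997 quarters sum to 1,568,000
```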
33
Data Reduction

[Recap diagram: data reduction strategies — cube aggregation, dimensionality reduction, compression, numerosity reduction, discretization and concept hierarchy generation]
34
Dimensionality Reduction
[Diagram: the role of dimension reduction in Data Mining — data in standard form passes through data preparation and dimension reduction to yield a data subset, which feeds prediction methods and evaluation]
35
Dimensionality Reduction
  • Data sets for analysis may contain hundreds of
    attributes that may be irrelevant to the mining
    task or redundant
  • Dimensionality reduction reduces the dataset size
    by removing such attributes

36
Dimensionality Reduction
  • How can we find a good subset of the original
    attributes?
  • The goal of attribute subset selection is to find
    a minimum set of attributes such that the
    resulting probability distribution of the data
    classes is as close as possible to the original
    distribution obtained using all attributes.

37
Dimensionality Reduction
  • Attribute subset selection techniques
  • Forward selection
  • start with empty set of attributes
  • the best of the original attributes is determined
    and added to the set.
  • At each subsequent iteration or step, the best of
    the remaining original attributes is added to the
    set.
  • Stepwise backward elimination
  • starts with the full set of attributes
  • At each step, it removes the worst attribute
    remaining in the set.

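Greedy forward selection, as described above, can be sketched as follows. The `score` callback is a stand-in for whatever measure judges a subset (e.g. how well it preserves the class distribution); the fixed per-attribute "usefulness" table is a toy assumption for illustration only.

```python
def forward_selection(attributes, score, k):
    """Start from the empty set; at each step add the attribute that gives
    the best-scoring subset, until k attributes are selected."""
    selected = []
    remaining = list(attributes)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy score: pretend each attribute has a fixed usefulness and subsets add up.
usefulness = {"A1": 0.9, "A2": 0.1, "A3": 0.2, "A4": 0.8, "A5": 0.05, "A6": 0.7}
print(forward_selection(usefulness, lambda s: sum(usefulness[a] for a in s), 3))
```

Backward elimination is the mirror image: start from the full set and repeatedly drop the attribute whose removal hurts the score least.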
38
Dimensionality Reduction
  • Attribute subset selection techniques
  • Combination of forward selection and backward
    elimination
  • at each step, the procedure selects the best
    attribute and removes the worst from among the
    remaining attributes

39
Dimensionality Reduction
  • Attribute subset selection techniques
  • Decision tree induction
  • ID3 and C4.5, originally intended for
    classification
  • construct a flowchart-like structure where each
    internal (nonleaf) node denotes a test on an
    attribute
  • each branch corresponds to an outcome of the test
    and each external (leaf) node denotes a class
    prediction
  • At each node the algorithm chooses the best
    attribute to partition the data into individual
    classes.

40
Example of Decision Tree Induction
Initial attribute set: {A1, A2, A3, A4, A5, A6}

[Decision tree: the root tests A4; its branches test A6 and A1; the leaves predict Class 1 or Class 2. Attributes that never appear in the tree are dropped.]

Reduced attribute set: {A1, A4, A6}
41
Dimensionality Reduction
  • Attribute subset selection techniques
  • Reduct computation by rough set theory
  • selected attributes are identified via the
    concept of discernibility relations of classes in
    the dataset
  • Will be discussed in the next class.

42
Data Reduction
[Recap diagram: data reduction strategies — cube aggregation, dimensionality reduction, compression, numerosity reduction, discretization and concept hierarchy generation]
43
Data Compression
  • Apply data encoding or transformation to obtain a
    reduced or compressed representation of the
    original data
  • lossless: the original data can be reconstructed
    exactly, but typically only limited manipulation
    of the data is possible
  • lossy: only an approximation of the original data
    can be reconstructed

44
Data Compression
  • Two methods of lossy data compression
  • Wavelet Transforms
  • Principal Component Analysis

45
Data Compression
  • Wavelet Transforms
  • a linear signal processing technique that, when
    applied to a data vector D, transforms it to a
    numerically different vector D' of wavelet
    coefficients

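One level of the Haar transform, the simplest wavelet, illustrates how D becomes a vector D' of coefficients: each pair turns into an average and a small "detail". This sketch assumes an even-length vector; compression comes from keeping only the strongest coefficients.

```python
def haar_step(d):
    """One level of the Haar wavelet transform:
    each pair (a, b) becomes an average (a+b)/2 and a detail (a-b)/2."""
    averages = [(d[i] + d[i + 1]) / 2 for i in range(0, len(d), 2)]
    details = [(d[i] - d[i + 1]) / 2 for i in range(0, len(d), 2)]
    return averages + details

D = [4, 8, 9, 15]
D2 = haar_step(D)  # [6.0, 12.0, -2.0, -3.0]: two averages, two details
# The transform is invertible: average +/- detail recovers the original pair,
# exactly (lossless) if all coefficients are kept, approximately if small
# details are dropped (lossy).
reconstructed = [D2[0] + D2[2], D2[0] - D2[2], D2[1] + D2[3], D2[1] - D2[3]]
print(D2, reconstructed)
```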
46
Data Compression
  • Principal Component Analysis
  • suppose the data to be compressed consist of N
    tuples from k dimensions.
  • PCA searches for c k-dimensional orthogonal
    vectors that can best be used to represent the
    data, where c <= k.
  • the original data are projected onto a much
    smaller space

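The projection step can be sketched with NumPy's SVD, which yields the orthogonal directions PCA needs; this is a minimal illustration (function name and toy data are mine, and it assumes NumPy is available), not a production implementation.

```python
import numpy as np

def pca_project(X, c):
    """Project N tuples with k dimensions onto the top c principal components."""
    Xc = X - X.mean(axis=0)                 # center the data
    # right singular vectors of the centered data = principal directions
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:c]                     # c orthonormal k-dimensional vectors
    return Xc @ components.T                # N x c reduced representation

X = np.array([[2.0, 0.0], [0.0, 2.0], [3.0, 1.0], [1.0, 3.0]])
Z = pca_project(X, 1)
print(Z.shape)  # (4, 1): each 2-dimensional tuple is now a single coordinate
```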
47
Data Reduction
[Recap diagram: data reduction strategies — cube aggregation, dimensionality reduction, compression, numerosity reduction, discretization and concept hierarchy generation]
48
Numerosity Reduction
  • Numerosity reduction techniques can be applied to
    reduce the data volume by choosing alternative,
    smaller forms of data representation
  • techniques
  • Regression and Log-Linear Models
  • Histograms
  • Clustering
  • Sampling

49
Data Reduction
[Recap diagram: data reduction strategies — cube aggregation, dimensionality reduction, compression, numerosity reduction, discretization and concept hierarchy generation]
50
Discretization
  • Three types of attributes
  • Nominal: values from an unordered set
  • Ordinal: values from an ordered set
  • Continuous: real numbers
  • Discretization
  • divide the range of a continuous attribute into
    intervals
  • Some classification algorithms only accept
    categorical attributes.
  • Reduce data size by discretization
  • Prepare for further analysis

51
Discretization and Concept hierarchy
  • Discretization
  • reduce the number of values for a given
    continuous attribute by dividing the range of the
    attribute into intervals. Interval labels can
    then be used to replace actual data values
  • Concept hierarchies
  • reduce the data by collecting and replacing low
    level concepts (such as numeric values for the
    attribute age) by higher level concepts (such as
    young, middle-aged, or senior)

52
Discretization
  • Example
  • Manual discretization of AUS data set

53
Discretization and Concept Hierarchy Generation
for Numeric Data
  • Binning (see sections before)
  • Histogram analysis (see sections before)
  • Clustering analysis (see sections before)
  • Entropy-based discretization
  • Segmentation by natural partitioning

54
Entropy-Based Discretization
  • Given a set of samples S, if S is partitioned
    into two intervals S1 and S2 using boundary T,
    the entropy after partitioning is
  • E(S,T) = (|S1|/|S|) Ent(S1) + (|S2|/|S|) Ent(S2)
  • The boundary that minimizes the entropy function
    over all possible boundaries is selected as a
    binary discretization.

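Choosing the boundary T that minimizes the weighted entropy E(S, T) can be sketched as below; the helper names and the small labeled price sample are illustrative, not from the deck.

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Ent(S): class entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def split_entropy(samples, boundary):
    """E(S, T): weighted entropy of the two intervals cut at boundary T.
    `samples` is a list of (value, class_label) pairs."""
    s1 = [lab for v, lab in samples if v <= boundary]
    s2 = [lab for v, lab in samples if v > boundary]
    n = len(samples)
    return len(s1) / n * entropy(s1) + len(s2) / n * entropy(s2)

samples = [(4, "low"), (8, "low"), (15, "low"),
           (21, "high"), (25, "high"), (34, "high")]
best = min({v for v, _ in samples}, key=lambda t: split_entropy(samples, t))
print(best)  # 15: the cut that separates the two classes has entropy 0
```

The recursive version would now repeat the search inside each of the two intervals until the stopping criterion is met.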
55
Entropy-Based Discretization
  • The process is recursively applied to the
    partitions obtained until some stopping criterion
    is met.
  • Experiments show that it may reduce data size and
    improve classification accuracy

56
Segmentation by Natural Partitioning
  • A simple 3-4-5 rule can be used to segment
    numeric data into relatively uniform, natural
    intervals.
  • If an interval covers 3, 6, 7 or 9 distinct
    values at the most significant digit, partition
    the range into 3 equi-width intervals
  • If it covers 2, 4, or 8 distinct values at the
    most significant digit, partition the range into
    4 intervals
  • If it covers 1, 5, or 10 distinct values at the
    most significant digit, partition the range into
    5 intervals (see Fig. 3.16, pg. 137)

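The three cases above can be sketched as follows. This is a simplified illustration: the textbook rule also trims extreme values (e.g. using the 5th and 95th percentiles) and recurses into sub-intervals, which this sketch omits, and it assumes high > low.

```python
from math import floor, log10

def three_four_five(low, high):
    """Count the distinct values the range spans at its most significant
    digit, then cut into 3, 4, or 5 equi-width intervals accordingly."""
    msd = 10 ** floor(log10(high - low))   # most-significant-digit unit
    distinct = round((high - low) / msd)   # distinct msd values covered
    parts = {3: 3, 6: 3, 7: 3, 9: 3,       # 3, 6, 7, 9 -> 3 intervals
             2: 4, 4: 4, 8: 4,             # 2, 4, 8    -> 4 intervals
             1: 5, 5: 5, 10: 5}[distinct]  # 1, 5, 10   -> 5 intervals
    width = (high - low) / parts
    return [(low + i * width, low + (i + 1) * width) for i in range(parts)]

print(three_four_five(0, 9000))  # spans 9 thousands -> 3 intervals of 3000
```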
57
Concept Hierarchy Generation
  • Many techniques can be applied recursively in
    order to provide a hierarchical partitioning of
    the attribute - concept hierarchy
  • Concept hierarchy useful for mining at multiple
    levels of abstraction

58
Concept Hierarchy Generation for Categorical Data
  • Specification of a partial ordering of attributes
    explicitly at the schema level by users or
    experts
  • streetltcityltstateltcountry
  • Specification of a portion of a hierarchy by
    explicit data grouping
  • Urbana, Champaign, ChicagoltIllinois
  • Specification of a set of attributes.
  • System automatically generates partial ordering
    by analysis of the number of distinct values
  • E.g., street lt city ltstate lt country
  • Specification of only a partial set of attributes
  • E.g., only street lt city, not others

59
Automatic Concept Hierarchy Generation
  • Some concept hierarchies can be automatically
    generated based on the analysis of the number of
    distinct values per attribute in the given data
    set
  • The attribute with the most distinct values is
    placed at the lowest level of the hierarchy
  • Note the exception: weekday, month, quarter, year

[Diagram: a hierarchy generated from distinct-value counts, top to bottom]
  country             15 distinct values
  province_or_state   365 distinct values
  city                3,567 distinct values
  street              674,339 distinct values
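The distinct-value-count heuristic is straightforward to sketch: count distinct values per attribute and order from most (bottom of the hierarchy) to fewest (top). The helper name and the toy columns are illustrative.

```python
def hierarchy_by_distinct_values(columns):
    """Order attributes bottom-up: most distinct values at the lowest level,
    fewest at the highest (the heuristic on this slide)."""
    counts = {name: len(set(vals)) for name, vals in columns.items()}
    bottom_up = sorted(counts, key=counts.get, reverse=True)
    return " < ".join(bottom_up)

columns = {
    "country": ["US", "US", "US"],
    "city": ["Urbana", "Champaign", "Chicago"],
    "state": ["IL", "ON", "IL"],
}
print(hierarchy_by_distinct_values(columns))  # city < state < country
```

The weekday/month/quarter/year case is exactly where this heuristic fails, since year has the fewest distinct values only within a short time span.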
60
Discretization and Concept Hierarchy Generation
  • Manual Discretization
  • The information to convert the continuous values
    into discrete values is obtained from an expert
    in the domain area
  • Example (refer to the UCI machine learning data
    banks)

61
Data Discretization
62
Data Discretization
Table 5: The invariance features for mathematical symbols
63
Data Discretization
Table 6: Discretization of the mathematical symbols
64
Summary
  • Data preparation is a big issue for both
    warehousing and mining
  • Data preparation includes
  • Data cleaning and data integration
  • Data reduction and feature selection
  • Discretization
  • A lot of methods have been developed, but this is
    still an active area of research