Transcript and Presenter's Notes

Title: Data Preprocessing


1
Data Preprocessing
Chapter 2
2
Chapter Objectives
  • Realize the importance of preprocessing real-world
    data before data mining or the construction of
    data warehouses.
  • Get an overview of some data preprocessing issues
    and techniques.

3
The course
[Course-map diagram: data sources (DS) feed a data warehouse (DW); data preprocessing (chapters 2 and 3) prepares the data, and the warehouse supports OLAP (chapter 4) and data mining (DM): association (5), classification (6), and clustering (7). Legend: DS = data source, DW = data warehouse, DM = data mining.]
4
- The Chapter
[Diagram of the chapter's structure: the preprocessing tasks labeled with their book sections (2.3, 2.4, 2.5); the full outline follows on the next slide.]
5
- Chapter Outline
  • Introduction (2.1, 2.2)
  • Data Cleaning (2.3)
  • Data Integration (2.4)
  • Data Transformation (2.4)
  • Data Reduction (2.5)
  • Concept Hierarchy (2.6)

6
- Introduction
  • Introduction
  • Why Preprocess the Data (2.1)
  • Where Preprocess Data
  • Identifying Typical Properties of Data (2.2)

7
-- Why Preprocess the Data
  • A well-accepted multi-dimensional view of data
    quality includes
  • Accuracy
  • Completeness
  • Consistency
  • Timeliness
  • Believability
  • Value added
  • Interpretability
  • Accessibility

8
-- Why Preprocess the Data
  • Reason for data cleaning
  • Incomplete data (missing data)
  • Noisy data (contains errors)
  • Inconsistent data (containing discrepancies)
  • Reasons for data integration
  • Data comes from multiple sources
  • Reason for data transformation
  • Some data must be transformed to be used for
    mining
  • Reasons for data reduction
  • Performance
  • No quality data → no quality mining results!

9
-- Where Preprocess Data
[Architecture diagram: data sources (DS) feed a staging database (SD), which loads the data warehouse (DW); the warehouse supports OLAP and data mining (association, classification, clustering). Data preprocessing is done in the staging database. Legend: DS = data source, DW = data warehouse, DM = data mining, SD = staging database.]
10
-- Identifying Typical Properties of Data
  • Descriptive data summarization techniques can be
    used to identify the typical properties of the
    data and to help decide which values should be
    treated as noise. For many data preprocessing
    tasks it is useful to know the following measures
    of the data
  • The central tendency
  • The dispersion

11
--- Measuring the Central Tendency
  • Central Tendency measures
  • Mean
  • Median
  • Mode
  • Midrange: (max() + min()) / 2
  • For data mining purposes, we need to know how to
    compute these measures efficiently in large
    databases. It is important to know whether the
    measure is
  • distributive,
  • algebraic, or
  • holistic

12
--- Measuring the Central Tendency
  • Distributive measure: a measure that can be
    computed by partitioning the data, computing the
    measure for each partition, and then merging the
    results to arrive at the measure's value for the
    entire data set.
  • E.g. sum(), count(), max(), min().
  • Algebraic measure: a measure that can be computed
    by applying an algebraic function to one or more
    distributive measures.
  • E.g. avg(), which is sum()/count()
  • Holistic measure: the entire data set is needed to
    compute the measure
  • E.g. median. (A partitioned computation of avg()
    is sketched below.)

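A minimal sketch (plain Python; the partitions are illustrative stand-ins for chunks of a large table) of how the algebraic measure avg() is obtained from the distributive measures sum() and count() computed per partition, while a holistic measure such as the median cannot be merged this way:

```python
# Sketch: computing avg() (algebraic) from per-partition sum() and count() (distributive).
# The partitions stand in for chunks of a large database table.
partitions = [
    [4, 8, 9, 15],
    [21, 21, 24, 25],
    [26, 28, 29, 34],
]

# Distributive step: summarize each partition independently.
partial = [(sum(p), len(p)) for p in partitions]

# Merge step: combine the partial results, then apply the algebraic function.
total_sum = sum(s for s, _ in partial)
total_count = sum(c for _, c in partial)
mean = total_sum / total_count
print(mean)  # identical to computing the mean over all values at once

# A holistic measure such as the median cannot be assembled from such partial
# results; it needs the entire data set.
```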
13
--- Measuring the Dispersion
  • Dispersion or variance is the degree to which
    numerical data tends to spread.
  • The most common measures are
  • Standard deviation
  • Range: max() - min()
  • Quartiles
  • The five-number summary
  • Interquartile range (IQR)
  • Boxplot Analysis

14
---- Quartiles
  • The kth percentile of data sorted in ascending
    order is the value x with the property that k
    percent of the data entries lie at or below x.
  • The first quartile, Q1, is the 25th percentile;
    Q2, the median, is the 50th percentile; and Q3 is
    the 75th percentile.
  • IQR = Q3 - Q1 is a simple measure that gives the
    spread of the middle half of the data.
  • A common rule of thumb for identifying suspected
    outliers is to single out values more than
    1.5 × IQR above Q3 or below Q1.
  • The five-number summary: min, Q1, median, Q3, max
  • Box plots can be drawn from the five-number
    summary and are useful tools for identifying
    outliers (a small numeric sketch follows below).

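A minimal sketch (assuming NumPy is available; the price values are only illustrative) of computing the five-number summary and flagging suspected outliers with the 1.5 × IQR rule of thumb:

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

q1, median, q3 = np.percentile(prices, [25, 50, 75])
iqr = q3 - q1

# Five-number summary: min, Q1, median, Q3, max
summary = (prices.min(), q1, median, q3, prices.max())

# 1.5 * IQR rule of thumb for suspected outliers
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = prices[(prices < low) | (prices > high)]

print(summary, outliers)
```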
15
---- Boxplot Analysis
  • Boxplot
  • Data is represented with a box
  • The ends of the box are Q1 and Q3,
  • i.e., the height of the box is the IQR
  • The median is marked by a line within the box
  • Whiskers: two lines outside the box extend to the
    minimum and maximum values

[Boxplot figure: box spanning Q1 to Q3 with the median marked inside; whiskers extend down to the lowest and up to the highest value.]
16
- Chapter Outline
  • Introduction (2.1, 2.2)
  • Data Cleaning (2.3)
  • Data Integration (2.4)
  • Data Transformation (2.4)
  • Data Reduction (2.5)
  • Concept Hierarchy (2.6)

17
- Data Cleaning
  • Importance
  • Data cleaning is the number one problem in data
    warehousing
  • In data cleaning, the following data problems are
    resolved
  • Incomplete data (missing data)
  • Noisy data (contains errors)
  • Inconsistent data (containing discrepancies)

18
-- Missing Data
  • Data is not always available
  • E.g., many tuples have no recorded value for
    several attributes, such as customer income in
    sales data
  • Missing data may be due to
  • equipment malfunction
  • inconsistent with other recorded data and thus
    deleted
  • data not entered due to misunderstanding
  • certain data may not be considered important at
    the time of entry

19
--- How to Handle Missing Data?
  • Fill in missing values manually (often infeasible)
  • Fill in with a global constant such as "unknown"
    or "n/a"; not recommended, since a data mining
    algorithm may treat it as a normal value
  • Fill in with the attribute mean or median
  • Fill in with the class mean or median (the classes
    need to be known)
  • Fill in with the most likely value (using
    regression, decision trees, most similar records,
    etc.)
  • Use other attributes to predict the value (e.g. if
    a postcode is missing, use the suburb value)
  • Ignore the record
  • (Some of these strategies are sketched below.)

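A minimal sketch (assuming pandas; the column names income and class are hypothetical) of a few of the fill strategies listed above:

```python
import pandas as pd

# Illustrative data with one missing income value
df = pd.DataFrame({
    "class": ["A", "A", "B", "B"],
    "income": [52_000.0, None, 41_000.0, 47_000.0],
})

# Fill with the attribute (column) mean
by_attr_mean = df["income"].fillna(df["income"].mean())

# Fill with the class mean (classes must be known)
by_class_mean = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))

# Ignore (drop) records with a missing value
dropped = df.dropna(subset=["income"])

print(by_attr_mean.tolist(), by_class_mean.tolist(), len(dropped))
```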
20
-- Noisy Data
  • Noise: random error or variance in a measured
    variable
  • Incorrect attribute values may be due to
  • faulty data collection
  • data entry problems
  • data transmission problems
  • data conversion errors
  • data decay problems
  • technology limitations, e.g. buffer overflow or
    field size limits

21
--- How to Handle Noisy Data?
  • Binning
  • First sort data and partition into
    (equal-frequency) bins, then one can smooth by
    bin means, or by bin median, or by bin
    boundaries, etc.
  • Regression
  • smooth by fitting the data into regression
    functions
  • Clustering
  • detect and remove outliers
  • Combined computer and human inspection
  • detect suspicious values and have them checked by
    a human.

22
--- Binning Methods for Data Smoothing
  • Sorted data for price: 4, 8, 9, 15, 21, 21, 24,
    25, 26, 28, 29, 34
  • Partition into equal-frequency (equi-depth) bins:
  • Bin 1: 4, 8, 9, 15
  • Bin 2: 21, 21, 24, 25
  • Bin 3: 26, 28, 29, 34
  • Smoothing by bin means:
  • Bin 1: 9, 9, 9, 9
  • Bin 2: 23, 23, 23, 23
  • Bin 3: 29, 29, 29, 29
  • Smoothing by bin boundaries:
  • Bin 1: 4, 4, 4, 15
  • Bin 2: 21, 21, 25, 25
  • Bin 3: 26, 26, 26, 34
  • (A small sketch of this procedure follows below.)

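A minimal sketch (plain Python, assuming the data is already sorted) of equal-frequency binning followed by smoothing by bin means and by bin boundaries:

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

n_bins = 3
size = len(prices) // n_bins  # equal-frequency (equi-depth) bins
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]

# Smoothing by bin means: every value becomes its bin's (rounded) mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snaps to the nearer of its bin's min/max
by_bounds = [
    [min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
    for b in bins
]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```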
23
--- Regression
[Scatter plot: data points with a fitted regression line y = x + 1 used for smoothing.]
24
--- Cluster Analysis
25
-- Inconsistent data
  • Inconsistent data can be due to
  • data entry errors
  • data integration errors (different formats,
    codes, etc.)
  • Handling inconsistent data
  • Important to have data entry verification (check
    both format and values of data entered)
  • Correct with help of external reference data

26
-- Data Cleaning as a Process
  • Data discrepancy detection
  • Use metadata (e.g., domain, range, correlation,
    distribution, DDS)
  • Check field overloading
  • Inconsistent use of codes or representations
    (e.g. dates written as 5/12/2004 vs. 12/5/2004)
  • Check uniqueness rule, consecutive rule, and null
    rule
  • Use commercial tools
  • Data scrubbing: use simple domain knowledge
    (e.g., postal codes, spell-checking) to detect
    errors and make corrections
  • Data auditing: analyze the data to discover rules
    and relationships and to detect violators (e.g.,
    use correlation and clustering to find outliers)

27
--- Properties of Normal Distribution Curve
  • The normal (distribution) curve
  • From µ - σ to µ + σ: contains about 68% of the
    measurements (µ: mean, σ: standard deviation)
  • From µ - 2σ to µ + 2σ: contains about 95% of them
  • From µ - 3σ to µ + 3σ: contains about 99.7% of them

28
--- Correlation
[Three scatter plots illustrating positive correlation, negative correlation, and no correlation.]
29
- Chapter Outline
  • Introduction (2.1, 2.2)
  • Data Cleaning (2.3)
  • Data Integration (2.4)
  • Data Transformation (2.4)
  • Data Reduction (2.5)
  • Concept Hierarchy (2.6)

30
- Data Integration
  • Data integration Combines data from multiple
    sources into a coherent data store
  • Main problems
  • Entity identification problem
  • Identify real-world entities across multiple data
    sources, e.g., A.cust-id ≡ B.cust-#
  • Redundancy problem
  • An attribute is redundant if it can be derived
    from other attribute(s)
  • Inconsistencies in attribute naming can also cause
    redundancy
  • Solutions
  • Entity identification problems can be resolved
    using metadata
  • Some redundancy problems can also be resolved
    using metadata; others can be detected by
    correlation analysis.

31
-- Correlation
[Three scatter plots illustrating positive correlation, negative correlation, and no correlation.]
32
--- Correlation Analysis (Numerical Data)
  • Correlation coefficient (also called Pearson's
    product-moment coefficient):
    r_A,B = Σ(a_i - Ā)(b_i - B̄) / (n·σ_A·σ_B)
          = (Σ a_i·b_i - n·Ā·B̄) / (n·σ_A·σ_B)
  • where n is the number of tuples, Ā and B̄ are the
    respective means of A and B, σ_A and σ_B are the
    respective standard deviations of A and B, and
    Σ a_i·b_i is the sum of the AB cross-products
  • If r_A,B > 0, A and B are positively correlated
  • r_A,B = 0: A and B are independent (no linear
    correlation)
  • r_A,B < 0: A and B are negatively correlated
  • (A small computation is sketched below.)

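A minimal sketch (assuming NumPy; the two arrays are illustrative) computing the Pearson correlation coefficient both from the formula above and with the library routine as a cross-check:

```python
import numpy as np

A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
B = np.array([1.0, 3.0, 5.0, 9.0, 11.0])

n = len(A)
# Formula: r = (sum(a*b) - n * mean(A) * mean(B)) / (n * std(A) * std(B))
r = (np.sum(A * B) - n * A.mean() * B.mean()) / (n * A.std() * B.std())

# Cross-check with NumPy's built-in (np.std's default ddof=0 matches the formula)
r_builtin = np.corrcoef(A, B)[0, 1]

print(round(r, 4), round(r_builtin, 4))  # both ≈ 0.99; r > 0 means positive correlation
```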
33
--- Correlation Analysis (Categorical Data)
  • χ² (chi-square) test:
    χ² = Σ (observed count - expected count)² / expected count
  • The larger the χ² value, the more likely the
    variables are related
  • The cells that contribute the most to the χ²
    value are those whose actual count is very
    different from the expected count

34
--- Chi-Square Calculation An Example
  • χ² (chi-square) calculation (numbers in
    parentheses are expected counts, calculated from
    the marginal totals of the two categories)
  • The result shows that like_science_fiction and
    play_chess are correlated in this group (the
    sketch below reproduces the calculation)

                           Play chess   Not play chess   Sum (row)
  Like science fiction       250 (90)       200 (360)        450
  Not like science fiction    50 (210)     1000 (840)       1050
  Sum (col.)                  300           1200             1500
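A minimal sketch (plain Python) reproducing the calculation: expected counts come from the marginals (row total × column total / grand total), and χ² sums (observed - expected)² / expected over the four cells:

```python
# Observed contingency table (rows: like sci-fi / not, cols: play chess / not)
observed = [[250, 200],
            [50, 1000]]

row_totals = [sum(row) for row in observed]          # [450, 1050]
col_totals = [sum(col) for col in zip(*observed)]    # [300, 1200]
grand = sum(row_totals)                              # 1500

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand   # e.g. 450*300/1500 = 90
        chi2 += (obs - expected) ** 2 / expected

print(round(chi2, 1))  # ≈ 507.9: a very large value, so the two attributes are correlated
```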
35
- Chapter Outline
  • Introduction (2.1, 2.2)
  • Data Cleaning (2.3)
  • Data Integration (2.4)
  • Data Transformation (2.4)
  • Data Reduction (2.5)
  • Concept Hierarchy (2.6)

36
- Data Transformation
  • In data transformation, data is transformed or
    consolidated into forms appropriate for mining.
    Data transformation can involve
  • Smoothing: remove noise from the data using
    binning, regression, or clustering
  • Aggregation: e.g. sales data can be aggregated
    into monthly totals
  • Generalization: concept-hierarchy climbing, e.g.
    cities can be generalized to countries, and ages
    can be generalized to youth, middle-aged, and
    senior
  • Normalization: attribute data scaled to fall
    within a small, specified range
  • Attribute/feature construction: new attributes
    constructed from the given ones

37
-- Data Transformation Normalization
  • Min-max normalization to [new_min_A, new_max_A]:
    v' = (v - min_A) / (max_A - min_A) × (new_max_A - new_min_A) + new_min_A
  • Ex. Let income, ranging from 12,000 to 98,000, be
    normalized to [0.0, 1.0]. Then 73,000 is mapped to
    (73,000 - 12,000) / (98,000 - 12,000) ≈ 0.709
  • Z-score normalization (µ: mean, σ: standard
    deviation):
    v' = (v - µ_A) / σ_A
  • Ex. Let µ = 54,000 and σ = 16,000. Then 73,000 is
    mapped to (73,000 - 54,000) / 16,000 ≈ 1.19
  • Normalization by decimal scaling:
    v' = v / 10^j, where j is the smallest integer
    such that max(|v'|) < 1
  • (All three methods are sketched below.)
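A minimal sketch (assuming NumPy; the income values are illustrative) of the three normalization methods:

```python
import numpy as np

income = np.array([12_000.0, 30_000.0, 54_000.0, 73_000.0, 98_000.0])

# Min-max normalization to [0.0, 1.0]
new_min, new_max = 0.0, 1.0
minmax = (income - income.min()) / (income.max() - income.min()) * (new_max - new_min) + new_min

# Z-score normalization (here using the sample's own mean and std)
zscore = (income - income.mean()) / income.std()

# Decimal scaling: divide by 10^j, j the smallest integer with max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(income).max() + 1)))
decimal_scaled = income / 10 ** j

print(minmax.round(3))    # 73,000 -> ~0.709
print(zscore.round(3))
print(decimal_scaled)     # 98,000 -> 0.98 (j = 5)
```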
38
-- Attribute/feature construction
  • Sometimes it is helpful or necessary to construct
    new attributes or features
  • Helpful for understanding and accuracy
  • For example: create an attribute volume from the
    attributes height, depth, and width (see the
    sketch below)
  • Construction is based on mathematical or logical
    operations
  • Attribute construction can help to discover
    missing information about the relationships
    between data attributes

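A minimal sketch (assuming pandas; the column names are illustrative) of constructing the volume attribute mentioned above:

```python
import pandas as pd

df = pd.DataFrame({
    "height": [2.0, 3.0, 1.5],
    "depth":  [1.0, 2.0, 2.0],
    "width":  [4.0, 1.0, 2.0],
})

# New attribute built with a mathematical operation on existing ones
df["volume"] = df["height"] * df["depth"] * df["width"]

print(df)
```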
39
- Chapter Outline
  • Introduction (2.1, 2.2)
  • Data Cleaning (2.3)
  • Data Integration (2.4)
  • Data Transformation (2.4)
  • Data Reduction (2.5)
  • Concept Hierarchy (2.6)

40
- Data Reduction
  • The data is often too large. Reducing the data
    can improve performance. Data reduction consists
    of reducing the representation of the data set
    while producing the same (or almost the same)
    results.
  • Data Reduction Includes
  • Reducing the number of rows (objects)
  • Reducing the number of attributes (features)
  • Compression
  • Discretization (will be covered in the next
    section)

41
-- Reducing the number of Rows
  • Aggregation
  • Aggregation of data into a higher concept level
  • We can have multiple levels of aggregation, e.g.,
    weekly, monthly, quarterly, yearly, and so on
  • For data reduction, use the highest aggregation
    level that is still sufficient for the task
  • Numerosity reduction
  • Data volume can be reduced by choosing
    alternative forms of data representation

42
--- Types of Numerosity reduction
  • Parametric
  • Assume the data fits some model, estimate model
    parameters, store only the parameters, and
    discard the data (except possible outliers)
  • E.g. linear regression: data are modeled to fit
    a straight line
  • Non-parametric
  • Histograms
  • Clustering
  • Sampling

43
---- Reduction with Histograms
  • A popular data reduction technique. Divide data
    into buckets and store representation of buckets
    (sum, count, etc.)
  • Histogram Types
  • Equal-width: divides the range into N intervals
    of equal size
  • Equal-depth: divides the range into N intervals,
    each containing approximately the same number of
    samples
  • V-optimal: considers all histogram types for a
    given number of buckets and chooses the one with
    the least variance
  • MaxDiff: after sorting the data to be
    approximated, the borders of the buckets are
    placed where adjacent values differ the most
  • (An equal-width/equal-depth sketch follows below.)

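A minimal sketch (assuming NumPy) of summarizing a column with equal-width and equal-depth buckets; for reduction, only the per-bucket counts and borders would be stored instead of the raw values:

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
n_buckets = 3

# Equal-width: buckets of equal range; store the count per bucket
counts, edges = np.histogram(prices, bins=n_buckets)

# Equal-depth: buckets with (approximately) equal numbers of samples;
# borders are taken at evenly spaced percentiles
depth_edges = np.percentile(prices, np.linspace(0, 100, n_buckets + 1))

print(counts, edges.round(1))   # how many values fall in each equal-width range
print(depth_edges.round(1))     # borders of the equal-depth buckets
```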
44
---- Example Histogram
45
---- Reduction with Clustering
  • Partition data into clusters based on closeness
    in space. Retain representatives of clusters
    (centroids) and outliers. Effectiveness depends
    upon the distribution of data. Hierarchical
    clustering is possible (multi-resolution).

[Scatter plot: each cluster is summarized by its centroid; points that fall far from any cluster are outliers.]
46
---- Reduction with Sampling
  • Allows a large data set to be represented by a
    much smaller random sample (subset) of the data
  • Will the patterns in the sample represent the
    patterns in the data?
  • How to select a random sample?
  • Simple random sampling without replacement (SRSWOR)
  • Simple random sampling with replacement (SRSWR)
  • Cluster sampling (SRSWOR or SRSWR from clusters)
  • Stratified sampling (each stratum is a group based
    on an attribute value); a sketch follows below

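A minimal sketch (plain Python, using the standard random module; the records and the stratum attribute are illustrative) of SRSWOR, SRSWR, and a stratified sample:

```python
import random

records = [{"id": i, "grade": "A" if i % 3 == 0 else "B"} for i in range(30)]
n = 6

# SRSWOR: simple random sample without replacement
srswor = random.sample(records, n)

# SRSWR: simple random sample with replacement (the same record may appear twice)
srswr = [random.choice(records) for _ in range(n)]

# Stratified sample: draw proportionally from each stratum (here, each grade)
strata = {}
for r in records:
    strata.setdefault(r["grade"], []).append(r)
stratified = [
    r
    for group in strata.values()
    for r in random.sample(group, max(1, round(n * len(group) / len(records))))
]

print(len(srswor), len(srswr), len(stratified))
```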
47
---- Sampling
[Figure: drawing a simple random sample from the raw data without replacement (SRSWOR) and with replacement (SRSWR).]
48
Sampling Example
[Figure: cluster and stratified samples drawn from the raw data.]
49
-- Reduce the number of Attributes
  • Reduce the number of attributes or dimensions or
    features.
  • Select a minimum set of attributes (features)
    that is sufficient for the data mining or
    analytical task.
  • Purpose
  • Avoid the curse of dimensionality, which leads to
    a sparse data space and poor clusters
  • Reduce amount of time and memory required by data
    mining algorithms
  • Allow data to be more easily visualized
  • May help to eliminate irrelevant and duplicate
    features or reduce noise

50
--- Reduce the number of Attributes techniques
  • Step-wise forward selection
  • E.g. {} → {A1} → {A1, A3} → {A1, A3, A5}
  • Step-wise backward elimination
  • E.g. {A1, A2, A3, A4, A5} → {A1, A3, A4, A5} →
    {A1, A3, A5}
  • Combining forward selection and backward
    elimination
  • Decision-tree induction (covered in Chapter 5)
  • (A forward-selection sketch follows below.)

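A minimal sketch (plain Python) of step-wise forward selection; score() is a hypothetical stand-in for whatever evaluation the mining task uses, e.g. the validation accuracy of a classifier trained on the candidate attribute subset:

```python
def forward_selection(attributes, score, k):
    """Greedily grow an attribute subset, adding at each step the
    attribute that improves the (task-specific) score the most."""
    selected = []
    while len(selected) < k:
        best_attr, best_score = None, float("-inf")
        for a in attributes:
            if a in selected:
                continue
            s = score(selected + [a])
            if s > best_score:
                best_attr, best_score = a, s
        selected.append(best_attr)
    return selected

# Toy scoring function (hypothetical): prefers A1, A3, A5
weights = {"A1": 3, "A3": 2, "A5": 1, "A2": 0, "A4": 0}
print(forward_selection(list(weights), lambda subset: sum(weights[a] for a in subset), k=3))
# ['A1', 'A3', 'A5']
```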
51
-- Data Compression
  • Data compression reduces the size of data and can
    be used for all sorts of data.
  • saves storage space.
  • saves communication time.
  • Compression can be lossless (e.g., ZIP) or lossy
    (e.g., the discrete wavelet transform (DWT) and
    principal component analysis (PCA))
  • For data mining, data compression is most
    beneficial if the mining algorithms can manipulate
    the compressed data directly, without
    uncompressing it. String compression (e.g., ZIP),
    for instance, allows only limited manipulation of
    the data.

52
- Chapter Outline
  • Introduction (2.1, 2.2)
  • Data Cleaning (2.3)
  • Data Integration (2.4)
  • Data Transformation (2.4)
  • Data Reduction (2.5)
  • Data Discretization and Concept Hierarchy (2.6)

53
- Discretization
  • Three types of attributes
  • Nominal: values from an unordered set, e.g.,
    color, profession
  • Ordinal: values from an ordered set, e.g.,
    military or academic rank
  • Continuous: numeric values, e.g., integers or real
    numbers
  • Discretization
  • Divide the range of a continuous attribute into
    intervals
  • Some classification algorithms only accept
    categorical attributes
  • Reduces data size
  • Prepares the data for further analysis

54
-- Discretization and Concept Hierarchy
  • Discretization
  • Reduce the number of values for a given
    continuous attribute by dividing the range of the
    attribute into intervals
  • Interval labels can then be used to replace
    actual data values
  • Supervised vs. unsupervised
  • Split (top-down) vs. merge (bottom-up)
  • Discretization can be performed recursively on an
    attribute
  • Concept hierarchy formation
  • Recursively reduce the data by collecting and
    replacing low level concepts (such as numeric
    values for age) by higher level concepts (such as
    young, middle-aged, or senior)

55
-- Discretization and Concept Hierarchy
Generation for Numeric Data
  • Typical methods (all can be applied recursively)
  • Binning (covered above)
  • Top-down split, unsupervised,
  • Histogram analysis (covered above)
  • Top-down split, unsupervised
  • Clustering analysis (covered above)
  • Either top-down split or bottom-up merge,
    unsupervised
  • Entropy-based discretization: supervised,
    top-down split
  • Interval merging by χ² analysis: unsupervised,
    bottom-up merge
  • Segmentation by natural partitioning: top-down
    split, unsupervised

56
--- Entropy-Based Discretization
  • Given a set of samples S, if S is partitioned
    into two intervals S1 and S2 using boundary T, the
    class information (weighted entropy) after
    partitioning is
    I(S, T) = (|S1| / |S|)·Entropy(S1) + (|S2| / |S|)·Entropy(S2)
  • Entropy is calculated from the class distribution
    of the samples in the set. Given m classes, the
    entropy of S1 is
    Entropy(S1) = - Σ_{i=1..m} p_i·log2(p_i)
    where p_i is the probability (relative frequency)
    of class i in S1
  • The boundary T that minimizes the entropy function
    over all possible boundaries is selected for the
    binary discretization
  • The process is applied recursively to the
    partitions obtained until some stopping criterion
    is met (a sketch of one split follows below)

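A minimal sketch (plain Python; the (value, class) pairs are illustrative) of a single entropy-based split: every candidate boundary between adjacent values is tried, and the one minimizing I(S, T) is kept; applying the same function to the resulting partitions gives the recursive procedure:

```python
from math import log2

def entropy(labels):
    """Class entropy: -sum(p_i * log2(p_i)) over the classes present."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def best_boundary(samples):
    """samples: list of (value, class_label). Returns the boundary T
    minimizing I(S, T) = |S1|/|S| * Ent(S1) + |S2|/|S| * Ent(S2)."""
    samples = sorted(samples)
    values = [v for v, _ in samples]
    best_t, best_info = None, float("inf")
    for i in range(1, len(samples)):
        t = (values[i - 1] + values[i]) / 2          # candidate boundary between adjacent values
        left = [c for v, c in samples if v <= t]
        right = [c for v, c in samples if v > t]
        info = (len(left) * entropy(left) + len(right) * entropy(right)) / len(samples)
        if info < best_info:
            best_t, best_info = t, info
    return best_t

data = [(21, "young"), (23, "young"), (27, "young"),
        (45, "senior"), (52, "senior"), (60, "senior")]
print(best_boundary(data))  # 36.0: the boundary that separates the two classes
```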
57
--- Interval Merge by ?2 Analysis
  • Merging-based (bottom-up) vs. splitting-based
    methods
  • Merge: find the best neighboring intervals and
    merge them to form larger intervals, recursively
  • ChiMerge
  • Initially, each distinct value of the numerical
    attribute A is considered to be one interval
  • χ² tests are performed for every pair of adjacent
    intervals
  • Adjacent intervals with the least χ² values are
    merged, since low χ² values for a pair indicate
    similar class distributions
  • The merge process proceeds recursively until a
    predefined stopping criterion is met (e.g., a
    significance level, a maximum number of intervals,
    or a maximum inconsistency)

58
--- Segmentation by Natural Partitioning
  • A simple 3-4-5 rule can be used to segment
    numeric data into relatively uniform, natural
    intervals.
  • If an interval covers 3, 6, 7 or 9 distinct
    values at the most significant digit, partition
    the range into 3 equi-width intervals
  • If it covers 2, 4, or 8 distinct values at the
    most significant digit, partition the range into
    4 intervals
  • If it covers 1, 5, or 10 distinct values at the
    most significant digit, partition the range into
    5 intervals

59
---- Example of 3-4-5 Rule
[Worked example (figure): the 3-4-5 rule applied step by step to values in the range (-$400, $5,000).]
60
-- Concept Hierarchy Generation for Categorical
Data
  • Categorical data
  • Discrete, finite cardinality, unordered
  • E.g. geographic location, job category, product
  • Problem: how to impose an order on categorical
    data
  • E.g. organize location values into a hierarchy
  • street < city < province < country

61
-- Concept Hierarchy Generation for Categorical
Data
  • A partial order of attributes at the schema level
  • E.g. street < city < state < country
  • A portion of a hierarchy given by explicit data
    grouping
  • E.g. {Jeddah, Riyadh} < Saudi Arabia
  • A set of attributes
  • A partial order generated from the cardinality
    (number of distinct values) of the attributes
  • E.g. street < city < state < country
  • Only a partial set of attributes specified
  • E.g. only street < city; the system automatically
    fills in the others

62
--- Automatic Concept Hierarchy Generation
  • Some hierarchies can be automatically generated
    based on the analysis of the number of distinct
    values per attribute in the data set
  • The attribute with the most distinct values is
    placed at the lowest level of the hierarchy (see
    the sketch below)
  • Exceptions exist, e.g., weekday has only 7
    distinct values yet is not a higher-level concept
    than month, quarter, or year

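A minimal sketch (plain Python; the distinct-value counts are illustrative, loosely following the location example above) of the heuristic: sort the attributes by their number of distinct values and place the attribute with the most distinct values at the lowest level:

```python
# Illustrative distinct-value counts drawn from a location table
distinct_counts = {
    "country": 15,
    "state": 365,
    "city": 3_567,
    "street": 674_339,
}

# Most distinct values -> lowest level of the concept hierarchy
hierarchy = sorted(distinct_counts, key=distinct_counts.get)  # top level first
print(" < ".join(reversed(hierarchy)))  # street < city < state < country
```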
63
End