Title: Chap. 2 Data Preprocessing
1. Chap. 2 Data Preprocessing
2. Why Data Preprocessing?
- Data in the real world is dirty
  - Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    - e.g., a blank occupation field
  - Noisy: containing errors or outliers
    - e.g., Salary = -10
  - Inconsistent: containing discrepancies in codes or names
    - e.g., gender vs. sex
    - e.g., sex = "woman" vs. sex = "female"
- No quality data, no quality mining results!
  - Quality decisions must be based on quality data
  - A data warehouse needs consistent integration of quality data
3. Major Tasks
- Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration
  - Integrate multiple databases, data cubes, or files
- Data transformation
  - Normalization and aggregation
- Data reduction
  - Obtain a reduced representation that is much smaller in volume yet produces the same or similar analytical results
- Data discretization
  - Part of data reduction, of particular importance for numerical data
4. Major Tasks (figure)
5. Data Cleaning
- Data cleaning tasks
  - Fill in missing values
  - Identify outliers and smooth out noisy data
  - Correct inconsistent data
  - Resolve redundancy caused by data integration
6. Missing Data
- Data is not always available
  - Many tuples have no recorded value for several attributes, such as customer income in sales data
- Missing data may be due to
  - Equipment malfunction
  - Inconsistency with other recorded data, leading to deletion
  - Data not entered due to misunderstanding
  - Certain data not considered important at the time of entry
- Missing data may need to be inferred
7. How to Handle Missing Data?
- Ignore the tuple
  - Usually done when the class label is missing (assuming a classification task)
- Fill in manually
  - Time-consuming
- Use a global constant
  - e.g., "unknown", 0, or -∞
- Use the attribute mean
- Use the attribute mean for all samples of the same class (see the sketch below)
  - e.g., for a customer of the risk_high class, fill in the average of risk_high people
- Use the most probable value
  - Inference-based, such as a Bayesian formula or a decision tree
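A minimal pandas sketch of the constant, mean, and class-conditional-mean strategies; the column names and values are made up for illustration:

```python
import pandas as pd

# Toy data: "income" has missing values, "risk" is the class label.
df = pd.DataFrame({
    "income": [30_000, None, 52_000, None, 41_000],
    "risk":   ["high", "high", "low", "low", "high"],
})

# Global constant: replace missing values with a sentinel such as 0.
filled_const = df["income"].fillna(0)

# Attribute mean: replace missing values with the overall mean income.
filled_mean = df["income"].fillna(df["income"].mean())

# Class-conditional mean: replace with the mean income of the same risk class.
filled_class_mean = df["income"].fillna(
    df.groupby("risk")["income"].transform("mean")
)
print(filled_class_mean)
```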
8. Noisy Data
- Noise
  - Random error or variance in a measured variable
- Incorrect attribute values may be due to
  - Faulty data collection instruments
  - Data entry problems
  - Data transmission problems
  - Inconsistent naming conventions
9. How to Handle Noisy Data?
- Binning
  - First sort the data and partition it into (equi-depth) bins
  - Then smooth by bin means, bin medians, bin boundaries, etc.
- Clustering
  - Similar values are organized into groups (clusters) → detect and remove outliers
- Combined computer and human inspection
  - Detect suspicious values and have a human check them
- Regression
  - Smooth by fitting the data to regression functions
10. Binning
- Equal-width (distance) partitioning
  - Divides the data value range into N intervals of equal size
  - Outliers may dominate the presentation
- Equal-depth (frequency) partitioning
  - Divides the range into N intervals, each containing approximately the same number of samples
- Example (reproduced in code below)
  - Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
  - Partition into (equi-depth) bins:
    - Bin 1: 4, 8, 9, 15
    - Bin 2: 21, 21, 24, 25
    - Bin 3: 26, 28, 29, 34
  - Smoothing by bin means:
    - Bin 1: 9, 9, 9, 9
    - Bin 2: 23, 23, 23, 23
    - Bin 3: 29, 29, 29, 29
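A short NumPy sketch of the slide's equi-depth binning and mean smoothing:

```python
import numpy as np

# Sorted prices from the example above.
prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equi-depth partitioning: split the sorted values into 3 equal-size bins.
bins = np.split(prices, 3)

# Smoothing by bin means: replace each value by its bin's (rounded) mean.
smoothed = [np.full(len(b), round(b.mean())) for b in bins]
for i, b in enumerate(smoothed, start=1):
    print(f"Bin {i}: {b.tolist()}")
# Bin 1: [9, 9, 9, 9]; Bin 2: [23, 23, 23, 23]; Bin 3: [29, 29, 29, 29]
```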
11. Cluster Analysis (figure)
12. Regression (figure)
13. Data Integration
- Data integration
  - Combines data from multiple sources into a coherent store
- Schema integration
  - Integrate metadata from different sources
  - Entity identification problem: identify the same real-world entities across multiple data sources
    - e.g., customer_id vs. cust-No
- Detecting and resolving data value conflicts
  - For the same real-world entity, attribute values from different sources differ
  - Possible reasons: different representations, different scales
    - e.g., 100 vs. 100.00 dollars
14. Data Integration
- Redundancy
  - One attribute may be derived from another attribute
    - e.g., monthly sales vs. annual sales
- Detecting redundancy
  - Some redundancy can be detected by correlation analysis (how strongly one attribute implies the other); see the sketch below
    - r > 0: positively correlated (A increases → B increases)
    - r = 0: independent
    - r < 0: negatively correlated
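A quick NumPy sketch of correlation-based redundancy detection; the paired attribute values are made up:

```python
import numpy as np

# Hypothetical paired attributes: average monthly sales and annual sales.
monthly_avg = np.array([10.0, 12.0, 9.0, 15.0, 11.0])
annual      = np.array([121.0, 146.0, 108.0, 181.0, 133.0])

# Pearson correlation coefficient r between the two attributes.
r = np.corrcoef(monthly_avg, annual)[0, 1]
print(f"r = {r:.3f}")
# r close to 1 → one attribute is (nearly) derivable from the other: redundancy.
```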
15. Data Integration
- For categorical data, correlation can be discovered by the χ² (chi-square) test
- The larger the χ² value, the more likely the variables are related
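A sketch of the χ² test of independence using SciPy, on a made-up 2×2 contingency table:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = one categorical attribute,
# columns = another (e.g., gender vs. preferred product).
observed = np.array([[250,  200],
                     [ 50, 1000]])

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.1f}, p = {p:.3g}")
# A large chi2 (tiny p-value) suggests the two attributes are related.
```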
16. Data Transformation
- Data transformation
  - Change data into forms appropriate for mining
- Smoothing
  - Remove noise from the data (binning, clustering)
- Aggregation
  - Summarization, data cube construction
- Generalization
  - Concept hierarchy climbing
- Normalization
  - Scale values to fall within a small, specified range (e.g., [-1.0, 1.0])
- Attribute/feature construction
  - New attributes constructed from the given ones
17. Normalization
- Min-max normalization
  - v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
- Z-score normalization (zero-mean normalization)
  - v' = (v - mean_A) / std_dev_A
- Normalization by decimal scaling
  - v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
- All three are sketched in code below
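A NumPy sketch of all three normalizations on a made-up value vector:

```python
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization into [new_min, new_max].
new_min, new_max = -1.0, 1.0
minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization: subtract the mean, divide by the standard deviation.
zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j, the smallest power of 10 exceeding max(|v|).
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
decimal = v / 10**j

print(minmax, zscore, decimal, sep="\n")
```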
18. Data Reduction
- A warehouse may store terabytes of data
  - Complex data mining may take a very long time to run on the complete data set
- Data reduction
  - Obtains a reduced representation of the data set that is much smaller yet produces the same (or nearly the same) analytical results
- Data reduction strategies
  - Data cube aggregation
  - Dimensionality reduction
  - Data compression
  - Numerosity reduction
  - Discretization and concept hierarchy generation
19. Data Cube Aggregation
- Multiple levels of aggregation in data cubes
  - Further reduce the size of the data to deal with
  - e.g., monthly data → annual data
- Reference appropriate levels
  - Use the smallest available cuboid relevant to the given task
20. Dimensionality Reduction
- Attribute (feature) selection
  - Remove irrelevant or redundant attributes (dimensions)
  - e.g., remove the telephone number in customer data analysis
- Step-wise forward selection
- Step-wise backward elimination
- Attribute selection using a decision tree
  - Based on information theory
  - Selects the attributes that best classify the data
21. Dimensionality Reduction
(figure: decision-tree induction over the initial attribute set {A1, A2, A3, A4, A5, A6}; the tree tests A4, A6, and A1, with leaves Class 1 and Class 2 → reduced attribute set {A1, A4, A6})
22. Data Compression
- String compression
  - Typically lossless
  - Only limited manipulation is possible without decompression
- Wavelet transform
  - An N-dimensional data vector D is transformed into an N-dimensional vector D' (DWT, DFT)
  - Store only a small fraction of the coefficients after transformation
  - Typically lossy compression
- Principal component analysis (PCA); see the sketch below
  - An N-dimensional data vector is projected onto a k-dimensional vector (N > k)
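A minimal PCA sketch with scikit-learn, projecting hypothetical 4-dimensional data down to k = 2 dimensions:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))         # 100 samples, N = 4 dimensions (made up)

pca = PCA(n_components=2)             # keep k = 2 principal components (N > k)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # fraction of variance each component keeps
```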
23. Numerosity Reduction
- Linear regression
  - Data points (x1, y1), (x2, y2), ... are modeled to fit a straight line
  - Y = α + βX
  - Uses the least-squares method to minimize the error (see the sketch below)
- Histogram
  - Divide the data into buckets and store the average (or sum) for each bucket
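A least-squares fit of Y = α + βX with NumPy, on made-up points:

```python
import numpy as np

# Made-up (x, y) points lying near a straight line.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Solve for [alpha, beta] minimizing ||y - (alpha + beta * x)||^2.
A = np.column_stack([np.ones_like(x), x])
(alpha, beta), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"Y ≈ {alpha:.2f} + {beta:.2f} X")
```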
24. Numerosity Reduction
- Clustering
  - Partition the data set into clusters → store only the cluster representations
  - Can use hierarchical clustering, stored in multi-dimensional index tree structures
  - Choices of clustering algorithms are detailed in Chapter 8
- Sampling
  - Choose a subset of the data
  - Simple random sampling → may perform very poorly on skewed data sets
  - Stratified sampling → approximates the percentage of each class (see the sketch below)
    - e.g., sampling customers in several age groups
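A stratified-sampling sketch with pandas, drawing the same fraction from each (hypothetical) age group so that class percentages are preserved:

```python
import pandas as pd

# Toy customer data with a skewed age-group distribution (made-up values).
df = pd.DataFrame({
    "customer": range(10),
    "age_group": ["young"] * 6 + ["middle"] * 3 + ["senior"] * 1,
})

# Simple random sampling: group proportions may be badly distorted.
simple = df.sample(frac=0.5, random_state=0)

# Stratified sampling: draw 50% from each age group separately.
stratified = (
    df.groupby("age_group", group_keys=False)
      .sample(frac=0.5, random_state=0)
)
print(stratified["age_group"].value_counts())
```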
25. Numerosity Reduction
(figure: raw data reduced by clustering and by stratified sampling)
26. Discretization
- Discretization
  - Divide the range of a continuous attribute into intervals
  - Interval labels can then replace the actual data values
  - Reduces the number of values for a continuous attribute
- Concept hierarchy
  - Defines a discretization: low-level concepts → higher-level concepts
  - e.g., Age (integer) → young, middle-aged, senior
    - (18, 15, 27, 14, 19, 63, 32, ...) → (Y, Y, M, Y, Y, S, M, ...); see the sketch below
  - Can be generated automatically based on the data distribution
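A one-line discretization of the ages above with pandas.cut; the cut points (25 and 60) are assumptions chosen to reproduce the slide's labels:

```python
import pandas as pd

ages = pd.Series([18, 15, 27, 14, 19, 63, 32])

# Assumed boundaries: young <= 25 < middle-aged <= 60 < senior.
labels = pd.cut(ages, bins=[0, 25, 60, 120], labels=["Y", "M", "S"])
print(list(labels))  # ['Y', 'Y', 'M', 'Y', 'Y', 'S', 'M']
```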
27. Concept Hierarchy Generation for Numerical Data
- Binning, histogram analysis, clustering analysis
- Entropy-based discretization (see the sketch below)
  - Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
    E(S, T) = (|S1| / |S|) Ent(S1) + (|S2| / |S|) Ent(S2)
  - The boundary that minimizes the entropy function over all possible boundaries is selected for binary discretization
  - The process is applied recursively to the resulting partitions until a stopping criterion is met
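A compact sketch of one entropy-based split; the sample values and class labels are made up:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Ent(S): Shannon entropy of the class labels in S."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum() + 0.0  # +0.0 normalizes -0.0

def best_split(values, labels):
    """Find the boundary T minimizing E(S, T) over all candidate boundaries."""
    order = np.argsort(values)
    values, labels = np.asarray(values)[order], np.asarray(labels)[order]
    best_t, best_e = None, float("inf")
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue                          # not a valid boundary
        t = (values[i] + values[i - 1]) / 2   # candidate boundary T
        s1, s2 = labels[:i], labels[i:]
        e = (len(s1) * entropy(s1) + len(s2) * entropy(s2)) / len(labels)
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e

# Made-up samples: attribute values with class labels.
vals = [1, 2, 3, 10, 11, 12]
labs = ["a", "a", "a", "b", "b", "b"]
print(best_split(vals, labs))  # (6.5, 0.0): a clean split at T = 6.5
```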
28. Concept Hierarchy Generation for Numerical Data
- Natural partitioning
  - Use the 3-4-5 rule to segment data into "natural" intervals (see the sketch below)
  - e.g., (213.98, 802.34] from entropy-based discretization vs. the more natural (200, 800]
  - Check the number of distinct values at the most significant digit:
    - 3, 6, 7, or 9 distinct values → partition the range into 3 intervals
    - 2, 4, or 8 distinct values → partition the range into 4 intervals
    - 1, 5, or 10 distinct values → partition the range into 5 intervals
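A simplified sketch of the first level of the 3-4-5 rule; it rounds the range at the most significant digit and skips the deeper recursion (and the special handling some formulations give the 7-value case):

```python
import math

def natural_partition(low, high):
    """Round the range at the most significant digit, count the distinct
    values at that digit, and split into 3, 4, or 5 equi-width intervals."""
    msd = 10 ** int(math.floor(math.log10(high - low)))  # most significant digit unit
    lo, hi = round(low / msd) * msd, round(high / msd) * msd
    distinct = round((hi - lo) / msd)
    if distinct in (3, 6, 7, 9):
        n = 3
    elif distinct in (2, 4, 8):
        n = 4
    else:  # 1, 5, or 10 distinct values
        n = 5
    width = (hi - lo) / n
    return [(lo + i * width, lo + (i + 1) * width) for i in range(n)]

print(natural_partition(213.98, 802.34))
# [(200.0, 400.0), (400.0, 600.0), (600.0, 800.0)]: the slide's (200, 800] range
```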
30. Concept Hierarchy Generation for Categorical Data
- Manual
  - Specification of a partial ordering by experts
  - e.g., define street < city < country
- Automatic (see the sketch below)
  - Specification of a set of attributes, but not of their ordering
  - Generate the hierarchy based on the number of distinct values per attribute
  - e.g., select the (country, street, city) attributes for location:
    street has 674,339 distinct values, city has 3,567, and country has 15
    → generated hierarchy: street < city < country
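A small pandas sketch of the automatic method: order the attributes by distinct-value count, with the most distinct at the bottom of the hierarchy. The table contents are made up; real counts come from the data itself:

```python
import pandas as pd

# Hypothetical location data.
df = pd.DataFrame({
    "country": ["US", "US", "CA", "US"],
    "city":    ["NYC", "SF", "Toronto", "NYC"],
    "street":  ["5th Ave", "Market St", "King St", "Broadway"],
})

# Count distinct values per attribute; more distinct values = lower level.
levels = df.nunique().sort_values(ascending=False)
print(" < ".join(levels.index))  # street < city < country
```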