Title: Data Mining: Concepts and Techniques
1. Data Mining: Concepts and Techniques
Chapter 2
2. Chapter 2: Data Preprocessing
- Why preprocess the data?
- Descriptive data summarization
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary
3. Why Data Preprocessing?
- Data in the real world is dirty
- incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
- e.g., occupation = " " (missing value)
- noisy: containing errors or outliers
- e.g., Salary = "-10"
- inconsistent: containing discrepancies in codes or names
- e.g., Age = "42", Birthday = "03/07/1997"
- e.g., was rating "1, 2, 3", now rating "A, B, C"
- e.g., discrepancy between duplicate records
4. Why Is Data Dirty?
- Incomplete data may come from
- "Not applicable" data value when collected
- Different considerations between the time when the data was collected and when it is analyzed
- Human/hardware/software problems
- Noisy data (incorrect values) may come from
- Faulty data collection instruments
- Human or computer error at data entry
- Errors in data transmission
- Inconsistent data may come from
- Different data sources
- Functional dependency violation (e.g., modify some linked data)
- Duplicate records also need data cleaning
5. Why Is Data Preprocessing Important?
- No quality data, no quality mining results!
- Quality decisions must be based on quality data
- e.g., duplicate or missing data may cause incorrect or even misleading statistics
- Data warehouse needs consistent integration of quality data
- Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
6. Multi-Dimensional Measure of Data Quality
- A well-accepted multidimensional view
- Accuracy
- Completeness
- Consistency
- Timeliness
- Believability
- Value added
- Interpretability
- Accessibility
- Broad categories
- Intrinsic, contextual, representational, and
accessibility
7. Major Tasks in Data Preprocessing
- Data cleaning
- Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration
- Integration of multiple databases, data cubes, or files
- Data transformation
- Normalization and aggregation
- Data reduction
- Obtains a reduced representation in volume but produces the same or similar analytical results
- Data discretization
- Part of data reduction but with particular importance, especially for numerical data
8. Forms of Data Preprocessing
9. Chapter 2: Data Preprocessing
- Why preprocess the data?
- Descriptive data summarization
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary
10. Mining Data Descriptive Characteristics
- Motivation
- To better understand the data: central tendency, variation and spread
- Data dispersion characteristics
- median, max, min, quantiles, outliers, variance, etc.
- Numerical dimensions correspond to sorted intervals
- Data dispersion: analyzed with multiple granularities of precision
- Boxplot or quantile analysis on sorted intervals
- Dispersion analysis on computed measures
- Folding measures into numerical dimensions
- Boxplot or quantile analysis on the transformed cube
11. Measuring the Central Tendency
- Mean (algebraic measure) (sample vs. population)
- Weighted arithmetic mean
- Trimmed mean: chopping extreme values
- Median: a holistic measure
- Middle value if odd number of values, or average of the middle two values otherwise
- Estimated by interpolation (for grouped data)
- Mode
- Value that occurs most frequently in the data
- Unimodal, bimodal, trimodal
- Empirical formula (for moderately skewed data): mean - mode ≈ 3 × (mean - median)
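As a quick illustration of these measures, here is a minimal Python sketch (not from the slides; the sample values are made up and it assumes NumPy and SciPy are available):

```python
# Central tendency: mean, weighted mean, trimmed mean, median, mode.
import numpy as np
from scipy import stats

x = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])  # illustrative values
w = np.ones_like(x, dtype=float)                                  # weights for a weighted mean

mean = x.mean()
weighted_mean = np.average(x, weights=w)
trimmed_mean = stats.trim_mean(x, proportiontocut=0.1)  # chop 10% off each end

median = np.median(x)

values, counts = np.unique(x, return_counts=True)
mode = values[counts.argmax()]                           # most frequent value (unimodal case)

# Empirical relation for moderately skewed data: mean - mode ≈ 3 * (mean - median)
print(mean, weighted_mean, trimmed_mean, median, mode)
```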
12. Symmetric vs. Skewed Data
- Median, mean and mode of symmetric, positively
and negatively skewed data
13. Measuring the Dispersion of Data
- Quartiles, outliers and boxplots
- Quartiles: Q1 (25th percentile), Q3 (75th percentile)
- Inter-quartile range: IQR = Q3 - Q1
- Five-number summary: min, Q1, M (median), Q3, max
- Boxplot: ends of the box are the quartiles, median is marked, whiskers extend outward, and outliers are plotted individually
- Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
- Variance and standard deviation (sample: s, population: σ)
- Variance (algebraic, scalable computation): s² = 1/(n-1) Σ(xᵢ - x̄)², σ² = 1/N Σ(xᵢ - µ)²
- Standard deviation s (or σ) is the square root of variance s² (or σ²)
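A minimal NumPy sketch of these dispersion measures (the values are the same illustrative sample used above; sample sizes and data are made up):

```python
# Five-number summary, IQR-based outlier rule, variance and standard deviation.
import numpy as np

x = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])  # illustrative values

q1, med, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1
five_number = (x.min(), q1, med, q3, x.max())

# Outlier rule: more than 1.5 * IQR below Q1 or above Q3
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]

sample_var = x.var(ddof=1)      # s^2, divides by n - 1
population_var = x.var(ddof=0)  # sigma^2, divides by N
sample_std = np.sqrt(sample_var)

print(five_number, iqr, outliers, sample_var, population_var, sample_std)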
14. Properties of Normal Distribution Curve
- The normal (distribution) curve
- From µ-σ to µ+σ: contains about 68% of the measurements (µ: mean, σ: standard deviation)
- From µ-2σ to µ+2σ: contains about 95% of it
- From µ-3σ to µ+3σ: contains about 99.7% of it
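The 68-95-99.7 rule can be checked directly from the normal CDF; a small sketch assuming SciPy:

```python
# Coverage of the intervals mu +/- k*sigma under a normal distribution.
from scipy.stats import norm

for k in (1, 2, 3):
    coverage = norm.cdf(k) - norm.cdf(-k)   # P(mu - k*sigma <= X <= mu + k*sigma)
    print(f"within {k} sigma: {coverage:.4f}")
# prints approximately 0.6827, 0.9545, 0.9973
```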
15. Boxplot Analysis
- Five-number summary of a distribution: Minimum, Q1, M, Q3, Maximum
- Boxplot
- Data is represented with a box
- The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
- The median is marked by a line within the box
- Whiskers: two lines outside the box extend to Minimum and Maximum
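A boxplot of this kind can be drawn with Matplotlib; a minimal sketch on made-up data (Matplotlib's default whiskers stop at 1.5 × IQR, with points beyond plotted as outliers):

```python
# Boxplots of two illustrative groups of values.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = [rng.normal(50, 10, 200), rng.normal(60, 15, 200)]  # hypothetical "unit price" per branch

plt.boxplot(data, labels=["branch 1", "branch 2"], showfliers=True)
plt.ylabel("unit price")
plt.show()
```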
16. Visualization of Data Dispersion: Boxplot Analysis
17. Histogram Analysis
- Graph displays of basic statistical class descriptions
- Frequency histograms
- A univariate graphical method
- Consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data
18. Quantile Plot
- Displays all of the data (allowing the user to
assess both the overall behavior and unusual
occurrences)
19. Quantile-Quantile (Q-Q) Plot
- Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
- Allows the user to view whether there is a shift in going from one distribution to another
20. Scatter Plot
- Provides a first look at bivariate data to see clusters of points, outliers, etc.
- Each pair of values is treated as a pair of coordinates and plotted as points in the plane
21. Loess Curve
- Adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence
- Loess curve is fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials that are fitted by the regression
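A hedged sketch of a loess-style smooth using statsmodels' LOWESS on made-up data; note that statsmodels exposes only the smoothing fraction `frac` and fits local linear polynomials, so the polynomial degree is not a free parameter in this particular implementation:

```python
# Scatter plot with a LOWESS smooth overlaid.
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)   # illustrative noisy dependence

smoothed = lowess(y, x, frac=0.3)                     # returns sorted (x, fitted y) pairs
plt.scatter(x, y, s=10)
plt.plot(smoothed[:, 0], smoothed[:, 1], color="red")
plt.show()
```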
22. Positively and Negatively Correlated Data
23. Not Correlated Data
24. Chapter 2: Data Preprocessing
- Why preprocess the data?
- Descriptive data summarization
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary
25. Data Cleaning
- Importance
- "Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball)
- "Data cleaning is the number one problem in data warehousing" (DCI survey)
- Data cleaning tasks
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data
- Resolve redundancy caused by data integration
26. Missing Data
- Data is not always available
- E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
- Missing data may be due to
- equipment malfunction
- inconsistent with other recorded data and thus deleted
- data not entered due to misunderstanding
- certain data may not have been considered important at the time of entry
- no history or changes of the data registered
- Missing data may need to be inferred.
27. How to Handle Missing Data?
- Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
- Fill in the missing value manually: tedious + infeasible?
- Fill it in automatically with
- a global constant: e.g., "unknown", a new class?!
- the attribute mean
- the attribute mean for all samples belonging to the same class: smarter
- the most probable value: inference-based, such as a Bayesian formula or decision tree
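The first three automatic strategies are easy to sketch with pandas; the column names and values below are hypothetical, and the inference-based fill (decision tree or Bayesian) is left out:

```python
# Filling missing values: global constant, attribute mean, per-class mean.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [40.0, np.nan, 55.0, np.nan, 72.0, 61.0],
    "class":  ["low", "low", "high", "high", "high", "low"],
})

# 1. Global constant (a new "unknown"-style value)
filled_const = df["income"].fillna(-1)

# 2. Attribute mean
filled_mean = df["income"].fillna(df["income"].mean())

# 3. Attribute mean for all samples of the same class (smarter)
filled_class_mean = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))

print(filled_const.tolist(), filled_mean.tolist(), filled_class_mean.tolist())
```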
28. Noisy Data
- Noise: random error or variance in a measured variable
- Incorrect attribute values may be due to
- faulty data collection instruments
- data entry problems
- data transmission problems
- technology limitation
- inconsistency in naming convention
- Other data problems which require data cleaning
- duplicate records
- incomplete data
- inconsistent data
29. How to Handle Noisy Data?
- Binning
- first sort data and partition into (equal-frequency) bins
- then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
- Regression
- smooth by fitting the data into regression functions
- Clustering
- detect and remove outliers
- Combined computer and human inspection
- detect suspicious values and check by human (e.g., deal with possible outliers)
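A minimal sketch of equal-frequency binning with smoothing by bin means and by bin boundaries (the price values are illustrative only):

```python
# Equal-frequency (equal-depth) binning and two smoothing strategies.
import numpy as np

prices = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
n_bins = 3
bins = np.array_split(prices, n_bins)            # equal-frequency partition of the sorted data

# Smoothing by bin means: every value in a bin is replaced by the bin mean
smoothed_by_means = [np.full(len(b), b.mean()) for b in bins]

def smooth_by_boundaries(b):
    # Replace each value by the closer of the bin's min or max boundary
    lo, hi = b.min(), b.max()
    return np.where(b - lo <= hi - b, lo, hi)

smoothed_by_boundaries = [smooth_by_boundaries(b) for b in bins]

print(smoothed_by_means)
print(smoothed_by_boundaries)
```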
30. Data Cleaning as a Process
- Data discrepancy detection
- Use metadata (e.g., domain, range, dependency, distribution)
- Check field overloading
- Check uniqueness rule, consecutive rule and null rule
- Use commercial tools
- Data scrubbing: use simple domain knowledge (e.g., postal codes, spell-check) to detect errors and make corrections
- Data auditing: analyze data to discover rules and relationships in order to detect violators (e.g., correlation and clustering to find outliers)
- Data migration and integration
- Data migration tools: allow transformations to be specified
- ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
- Integration of the two processes
- Iterative and interactive (e.g., Potter's Wheel)
31. Chapter 2: Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary
32. Data Integration
- Data integration
- Combines data from multiple sources into a coherent store
- Schema integration: e.g., A.cust-id ≡ B.cust-#
- Integrate metadata from different sources
- Entity identification problem
- Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
- Detecting and resolving data value conflicts
- For the same real-world entity, attribute values from different sources are different
- Possible reasons: different representations, different scales, e.g., metric vs. British units
- Redundant attributes may be detected by correlation analysis (χ² test)
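A hedged sketch of such a chi-square test between two categorical attributes, using SciPy on an illustrative contingency table (the counts are made up for demonstration):

```python
# Chi-square test of independence between two categorical attributes.
import numpy as np
from scipy.stats import chi2_contingency

# rows: likes science fiction / does not; columns: plays chess / does not
observed = np.array([[250,  200],
                     [ 50, 1000]])

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, p_value, dof)
# A very small p-value means the attributes are correlated (not independent),
# so one of them may be redundant given the other.
```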
33. Data Transformation
- Smoothing: remove noise from data
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing
- Normalization: scaled to fall within a small, specified range
- min-max normalization: v' = (v - min_A) / (max_A - min_A) × (new_max_A - new_min_A) + new_min_A
- z-score normalization: v' = (v - µ_A) / σ_A
- normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
- Attribute/feature construction
- New attributes constructed from the given ones
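The three normalization methods in one minimal NumPy sketch (the attribute values and the target range [0, 1] are illustrative):

```python
# min-max, z-score, and decimal-scaling normalization of one numeric attribute.
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# min-max normalization to [new_min, new_max]
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# z-score normalization
v_zscore = (v - v.mean()) / v.std(ddof=0)

# decimal scaling: divide by 10^j for the smallest j with max(|v'|) < 1
j = 0
while (np.abs(v) / (10 ** j)).max() >= 1:
    j += 1
v_decimal = v / (10 ** j)

print(v_minmax, v_zscore, v_decimal)
```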
34. Chapter 2: Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary
35. Data Reduction Strategies
- Why data reduction?
- A database/data warehouse may store terabytes of data
- Complex data analysis/mining may take a very long time to run on the complete data set
- Data reduction
- Obtain a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results
- Data reduction strategies
- Data cube aggregation
- Dimensionality reduction: e.g., remove unimportant attributes
- Data compression
- Numerosity reduction: e.g., fit data into models
- Discretization and concept hierarchy generation
36. Data Compression
- String compression
- There are extensive theories and well-tuned algorithms
- Typically lossless
- But only limited manipulation is possible without expansion
- Audio/video compression
- Typically lossy compression, with progressive refinement
- Sometimes small fragments of signal can be reconstructed without reconstructing the whole
- Time sequence is not audio
- Typically short and varies slowly with time
37. Data Compression
(Figure: lossless compression maps the original data to compressed data and back exactly; lossy compression recovers only an approximation of the original data)
38. Dimensionality Reduction: Wavelet Transformation
- Discrete wavelet transform (DWT): linear signal processing, multi-resolutional analysis
- Compressed approximation: store only a small fraction of the strongest of the wavelet coefficients
- Similar to discrete Fourier transform (DFT), but better lossy compression, localized in space
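A hedged sketch of this idea using PyWavelets (an assumed third-party dependency, `pywt`): decompose a signal, keep only the strongest coefficients, and reconstruct a lossy approximation. The signal and the 10% retention level are illustrative:

```python
# Wavelet-based reduction: keep only the strongest DWT coefficients.
import numpy as np
import pywt

rng = np.random.default_rng(0)
signal = rng.standard_normal(256).cumsum()               # illustrative 1-D signal

coeffs = pywt.wavedec(signal, "haar", level=4)            # multi-level DWT
flat = np.concatenate(coeffs)
threshold = np.quantile(np.abs(flat), 0.90)                # keep roughly the top 10% of coefficients

compressed = [np.where(np.abs(c) >= threshold, c, 0.0) for c in coeffs]
approx = pywt.waverec(compressed, "haar")                  # lossy reconstruction

print("max reconstruction error:", np.max(np.abs(approx - signal)))
```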
39. DWT for Image Compression
(Figure: the image is repeatedly passed through low-pass and high-pass filters, producing a multi-resolution decomposition)
40. Dimensionality Reduction: Principal Component Analysis (PCA)
- Given N data vectors from n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data
- Works for numeric data only
- Used when the number of dimensions is large
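A minimal sketch of PCA via NumPy's SVD; the data, and the choice of k = 2 components, are illustrative:

```python
# PCA: project N points in n dimensions onto the k strongest principal components.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))        # N = 500 vectors, n = 5 numeric dimensions
k = 2

Xc = X - X.mean(axis=0)                   # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:k]                       # k orthogonal principal components
X_reduced = Xc @ components.T             # reduced representation (N x k)

explained = (S[:k] ** 2).sum() / (S ** 2).sum()
print("fraction of variance retained:", explained)
```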
41. Principal Component Analysis
(Figure: principal components Y1 and Y2 of a data cloud plotted in the original axes X1 and X2)
42. Data Reduction Method (3): Clustering
- Partition data set into clusters based on similarity, and store cluster representations (e.g., centroid and diameter) only
- Can be very effective if data is clustered but not if data is "smeared"
- Can have hierarchical clustering and be stored in multi-dimensional index tree structures
- There are many choices of clustering definitions and clustering algorithms
- Cluster analysis will be studied in depth in Chapter 7
43. Data Reduction Method (4): Sampling
- Sampling: obtaining a small sample s to represent the whole data set N
- Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
- Choose a representative subset of the data
- Simple random sampling may have very poor performance in the presence of skew
- Develop adaptive sampling methods
- Stratified sampling
- Approximate the percentage of each class (or subpopulation of interest) in the overall database
- Used in conjunction with skewed data
- Note: sampling may not reduce database I/Os (page at a time)
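A minimal pandas/NumPy sketch of simple random sampling with and without replacement and of stratified sampling; the data frame, class proportions, and 10% sample size are all hypothetical:

```python
# SRSWOR, SRSWR, and stratified sampling of a data frame.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.normal(50, 15, 1000),
    "class":  rng.choice(["young", "middle", "senior"], size=1000, p=[0.2, 0.5, 0.3]),
})
frac = 0.10

srswor = df.sample(frac=frac, replace=False, random_state=0)   # without replacement
srswr  = df.sample(frac=frac, replace=True,  random_state=0)   # with replacement

# Stratified: sample the same fraction within each class, preserving class percentages
stratified = df.groupby("class", group_keys=False).sample(frac=frac, random_state=0)

print(len(srswor), len(srswr))
print(stratified["class"].value_counts(normalize=True))
```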
44. Sampling With or Without Replacement
- SRSWOR (simple random sample without replacement)
- SRSWR (simple random sample with replacement)
45. Sampling: Cluster or Stratified Sampling
(Figure: raw data reduced to a cluster/stratified sample)
46. Chapter 2: Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary
47. Discretization
- Three types of attributes
- Nominal: values from an unordered set, e.g., color, profession
- Ordinal: values from an ordered set, e.g., military or academic rank
- Continuous: real numbers, e.g., integer or real numbers
- Discretization
- Divide the range of a continuous attribute into intervals
- Some classification algorithms only accept categorical attributes
- Reduce data size by discretization
- Prepare for further analysis
48. Discretization and Concept Hierarchy
- Discretization
- Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals
- Interval labels can then be used to replace actual data values
- Supervised vs. unsupervised
- Split (top-down) vs. merge (bottom-up)
- Discretization can be performed recursively on an attribute
- Concept hierarchy formation
- Recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) by higher-level concepts (such as young, middle-aged, or senior)
49. Discretization and Concept Hierarchy Generation for Numeric Data
- Typical methods (all the methods can be applied recursively)
- Binning (covered above): top-down split, unsupervised
- Histogram analysis (covered above): top-down split, unsupervised
- Clustering analysis (covered above): either top-down split or bottom-up merge, unsupervised
- Entropy-based discretization: supervised, top-down split
- Interval merging by χ² analysis: unsupervised, bottom-up merge
- Segmentation by natural partitioning: top-down split, unsupervised
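A hedged sketch of a single entropy-based (supervised, top-down) split: choose the boundary on a numeric attribute that minimizes the weighted class-label entropy of the two resulting intervals. The attribute and label values are made up, and in practice the split is applied recursively with a stopping rule (e.g., MDL):

```python
# One step of entropy-based discretization of a numeric attribute.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_split(values, labels):
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    n = len(values)
    best_t, best_e = None, np.inf
    for i in range(1, n):
        if values[i] == values[i - 1]:
            continue
        t = (values[i] + values[i - 1]) / 2                    # candidate boundary (midpoint)
        e = (i / n) * entropy(labels[:i]) + ((n - i) / n) * entropy(labels[i:])
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e

age = np.array([23, 25, 31, 35, 40, 42, 45, 51, 60, 63])
buys = np.array(["no", "no", "no", "yes", "yes", "yes", "yes", "yes", "no", "no"])
print(best_split(age, buys))
```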
50. Summary
- Data preparation or preprocessing is a big issue for both data warehousing and data mining
- Descriptive data summarization is needed for quality data preprocessing
- Data preparation includes
- Data cleaning and data integration
- Data reduction and feature selection
- Discretization
- A lot of methods have been developed, but data preprocessing is still an active area of research