1
Data Mining: Concepts and Techniques
Chapter 2
2
Chapter 2: Data Preprocessing
  • Why preprocess the data?
  • Descriptive data summarization
  • Data cleaning
  • Data integration and transformation
  • Data reduction
  • Discretization and concept hierarchy generation
  • Summary

3
Why Data Preprocessing?
  • Data in the real world is dirty
  • incomplete: lacking attribute values, lacking
    certain attributes of interest, or containing
    only aggregate data
  • e.g., occupation = ""
  • noisy: containing errors or outliers
  • e.g., Salary = "-10"
  • inconsistent: containing discrepancies in codes
    or names
  • e.g., Age = "42", Birthday = "03/07/1997"
  • e.g., was rating "1, 2, 3", now rating "A, B, C"
  • e.g., discrepancy between duplicate records

4
Why Is Data Dirty?
  • Incomplete data may come from
  • Not applicable data value when collected
  • Different considerations between the time when
    the data was collected and when it is analyzed.
  • Human/hardware/software problems
  • Noisy data (incorrect values) may come from
  • Faulty data collection instruments
  • Human or computer error at data entry
  • Errors in data transmission
  • Inconsistent data may come from
  • Different data sources
  • Functional dependency violation (e.g., modify
    some linked data)
  • Duplicate records also need data cleaning

5
Why Is Data Preprocessing Important?
  • No quality data, no quality mining results!
  • Quality decisions must be based on quality data
  • e.g., duplicate or missing data may cause
    incorrect or even misleading statistics.
  • Data warehouse needs consistent integration of
    quality data
  • Data extraction, cleaning, and transformation
    comprise the majority of the work of building a
    data warehouse

6
Multi-Dimensional Measure of Data Quality
  • A well-accepted multidimensional view
  • Accuracy
  • Completeness
  • Consistency
  • Timeliness
  • Believability
  • Value added
  • Interpretability
  • Accessibility
  • Broad categories
  • Intrinsic, contextual, representational, and
    accessibility

7
Major Tasks in Data Preprocessing
  • Data cleaning
  • Fill in missing values, smooth noisy data,
    identify or remove outliers, and resolve
    inconsistencies
  • Data integration
  • Integration of multiple databases, data cubes, or
    files
  • Data transformation
  • Normalization and aggregation
  • Data reduction
  • Obtains a reduced representation that is much
    smaller in volume but produces the same or
    similar analytical results
  • Data discretization
  • Part of data reduction but with particular
    importance, especially for numerical data

8
Forms of Data Preprocessing
9
Chapter 2: Data Preprocessing
  • Why preprocess the data?
  • Descriptive data summarization
  • Data cleaning
  • Data integration and transformation
  • Data reduction
  • Discretization and concept hierarchy generation
  • Summary

10
Mining Data Descriptive Characteristics
  • Motivation
  • To better understand the data: central tendency,
    variation, and spread
  • Data dispersion characteristics
  • median, max, min, quantiles, outliers, variance,
    etc.
  • Numerical dimensions correspond to sorted
    intervals
  • Data dispersion analyzed with multiple
    granularities of precision
  • Boxplot or quantile analysis on sorted intervals
  • Dispersion analysis on computed measures
  • Folding measures into numerical dimensions
  • Boxplot or quantile analysis on the transformed
    cube

11
Measuring the Central Tendency
  • Mean (algebraic measure) (sample vs. population):
    sample mean x̄ = (Σ xᵢ) / n, population mean µ = (Σ x) / N
  • Weighted arithmetic mean: x̄ = (Σ wᵢxᵢ) / (Σ wᵢ)
  • Trimmed mean: chopping extreme values
  • Median: a holistic measure
  • Middle value if odd number of values, or average
    of the middle two values otherwise
  • Estimated by interpolation (for grouped data)
  • Mode
  • Value that occurs most frequently in the data
  • Unimodal, bimodal, trimodal
  • Empirical formula: mean - mode ≈ 3 × (mean - median)

12
Symmetric vs. Skewed Data
  • Median, mean and mode of symmetric, positively
    and negatively skewed data

13
Measuring the Dispersion of Data
  • Quartiles, outliers and boxplots
  • Quartiles: Q1 (25th percentile), Q3 (75th
    percentile)
  • Inter-quartile range: IQR = Q3 - Q1
  • Five number summary: min, Q1, M, Q3, max
  • Boxplot: the ends of the box are the quartiles,
    the median is marked, whiskers extend outward,
    and outliers are plotted individually
  • Outlier: usually, a value more than 1.5 × IQR
    above Q3 or below Q1
  • Variance and standard deviation (sample: s,
    population: σ)
  • Variance (algebraic, scalable computation):
    s² = (1/(n-1)) Σ (xᵢ - x̄)², σ² = (1/N) Σ (xᵢ - µ)²
  • Standard deviation s (or σ) is the square root of
    variance s² (or σ²)

14
Properties of Normal Distribution Curve
  • The normal (distribution) curve
  • From µ-σ to µ+σ: contains about 68% of the
    measurements (µ: mean, σ: standard deviation)
  • From µ-2σ to µ+2σ: contains about 95% of it
  • From µ-3σ to µ+3σ: contains about 99.7% of it

15
Boxplot Analysis
  • Five-number summary of a distribution
  • Minimum, Q1, M, Q3, Maximum
  • Boxplot
  • Data is represented with a box
  • The ends of the box are at the first and third
    quartiles, i.e., the height of the box is IQR
  • The median is marked by a line within the box
  • Whiskers: two lines outside the box extend to
    Minimum and Maximum

16
Visualization of Data Dispersion: Boxplot Analysis
17
Histogram Analysis
  • Graph displays of basic statistical class
    descriptions
  • Frequency histograms
  • A univariate graphical method
  • Consists of a set of rectangles that reflect the
    counts or frequencies of the classes present in
    the given data

18
Quantile Plot
  • Displays all of the data (allowing the user to
    assess both the overall behavior and unusual
    occurrences)

19
Quantile-Quantile (Q-Q) Plot
  • Graphs the quantiles of one univariate
    distribution against the corresponding quantiles
    of another
  • Allows the user to view whether there is a shift
    in going from one distribution to another

20
Scatter plot
  • Provides a first look at bivariate data to see
    clusters of points, outliers, etc.
  • Each pair of values is treated as a pair of
    coordinates and plotted as points in the plane

21
Loess Curve
  • Adds a smooth curve to a scatter plot in order to
    provide better perception of the pattern of
    dependence
  • Loess curve is fitted by setting two parameters:
    a smoothing parameter, and the degree of the
    polynomials that are fitted by the regression

22
Positively and Negatively Correlated Data
23
Not Correlated Data
24
Chapter 2: Data Preprocessing
  • Why preprocess the data?
  • Descriptive data summarization
  • Data cleaning
  • Data integration and transformation
  • Data reduction
  • Discretization and concept hierarchy generation
  • Summary

25
Data Cleaning
  • Importance
  • "Data cleaning is one of the three biggest
    problems in data warehousing" (Ralph Kimball)
  • "Data cleaning is the number one problem in data
    warehousing" (DCI survey)
  • Data cleaning tasks
  • Fill in missing values
  • Identify outliers and smooth out noisy data
  • Correct inconsistent data
  • Resolve redundancy caused by data integration

26
Missing Data
  • Data is not always available
  • E.g., many tuples have no recorded value for
    several attributes, such as customer income in
    sales data
  • Missing data may be due to
  • equipment malfunction
  • inconsistent with other recorded data and thus
    deleted
  • data not entered due to misunderstanding
  • certain data may not be considered important at
    the time of entry
  • history or changes of the data not registered
  • Missing data may need to be inferred.

27
How to Handle Missing Data?
  • Ignore the tuple: usually done when the class
    label is missing (assuming the task is
    classification); not effective when the
    percentage of missing values per attribute varies
    considerably
  • Fill in the missing value manually: tedious +
    infeasible?
  • Fill it in automatically with (see the sketch
    after this list)
  • a global constant: e.g., "unknown", a new
    class?!
  • the attribute mean
  • the attribute mean for all samples belonging to
    the same class: smarter
  • the most probable value: inference-based, such as
    a Bayesian formula or decision tree

28
Noisy Data
  • Noise: random error or variance in a measured
    variable
  • Incorrect attribute values may be due to
  • faulty data collection instruments
  • data entry problems
  • data transmission problems
  • technology limitation
  • inconsistency in naming convention
  • Other data problems which require data cleaning
  • duplicate records
  • incomplete data
  • inconsistent data

29
How to Handle Noisy Data?
  • Binning (see the sketch after this list)
  • first sort data and partition into
    (equal-frequency) bins
  • then one can smooth by bin means, bin medians,
    bin boundaries, etc.
  • Regression
  • smooth by fitting the data into regression
    functions
  • Clustering
  • detect and remove outliers
  • Combined computer and human inspection
  • detect suspicious values and check by human
    (e.g., deal with possible outliers)

30
Data Cleaning as a Process
  • Data discrepancy detection
  • Use metadata (e.g., domain, range, dependency,
    distribution)
  • Check field overloading
  • Check uniqueness rule, consecutive rule and null
    rule
  • Use commercial tools
  • Data scrubbing: use simple domain knowledge
    (e.g., postal code, spell-check) to detect errors
    and make corrections
  • Data auditing: analyze data to discover rules and
    relationships and to detect violators (e.g.,
    correlation and clustering to find outliers)
  • Data migration and integration
  • Data migration tools allow transformations to be
    specified
  • ETL (Extraction/Transformation/Loading) tools
    allow users to specify transformations through a
    graphical user interface
  • Integration of the two processes
  • Iterative and interactive (e.g., Potter's Wheel)

31
Chapter 2: Data Preprocessing
  • Why preprocess the data?
  • Data cleaning
  • Data integration and transformation
  • Data reduction
  • Discretization and concept hierarchy generation
  • Summary

32
Data Integration
  • Data integration
  • Combines data from multiple sources into a
    coherent store
  • Schema integration: e.g., A.cust-id ≡ B.cust-#
  • Integrate metadata from different sources
  • Entity identification problem
  • Identify real world entities from multiple data
    sources, e.g., Bill Clinton = William Clinton
  • Detecting and resolving data value conflicts
  • For the same real world entity, attribute values
    from different sources are different
  • Possible reasons: different representations,
    different scales, e.g., metric vs. British units
  • Redundancy between attributes can be detected by
    correlation analysis (e.g., a chi-square test,
    sketched below)
33
Data Transformation
  • Smoothing: remove noise from data
  • Aggregation: summarization, data cube
    construction
  • Generalization: concept hierarchy climbing
  • Normalization: scaled to fall within a small,
    specified range (see the sketch after this list)
  • min-max normalization
  • z-score normalization
  • normalization by decimal scaling
  • Attribute/feature construction
  • New attributes constructed from the given ones

34
Chapter 2: Data Preprocessing
  • Why preprocess the data?
  • Data cleaning
  • Data integration and transformation
  • Data reduction
  • Discretization and concept hierarchy generation
  • Summary

35
Data Reduction Strategies
  • Why data reduction?
  • A database/data warehouse may store terabytes of
    data
  • Complex data analysis/mining may take a very long
    time to run on the complete data set
  • Data reduction
  • Obtain a reduced representation of the data set
    that is much smaller in volume yet produces the
    same (or almost the same) analytical results
  • Data reduction strategies
  • Data cube aggregation
  • Dimensionality reduction: e.g., remove
    unimportant attributes
  • Data compression
  • Numerosity reduction: e.g., fit data into models
  • Discretization and concept hierarchy generation

36
Data Compression
  • String compression
  • There are extensive theories and well-tuned
    algorithms
  • Typically lossless
  • But only limited manipulation is possible without
    expansion
  • Audio/video compression
  • Typically lossy compression, with progressive
    refinement
  • Sometimes small fragments of signal can be
    reconstructed without reconstructing the whole
  • Time sequence data is not audio
  • Typically short and varies slowly with time

37
Data Compression
[Figure: lossless compression recovers the original data exactly from the
compressed data; lossy compression recovers only an approximation]
38
Dimensionality Reduction: Wavelet Transformation
  • Discrete wavelet transform (DWT): linear signal
    processing, multi-resolutional analysis
  • Compressed approximation: store only a small
    fraction of the strongest wavelet coefficients
    (sketched below)
  • Similar to discrete Fourier transform (DFT), but
    better lossy compression, localized in space

39
DWT for Image Compression
[Figure: an image is decomposed by repeatedly applying low-pass and
high-pass filters at successive resolutions]
40
Dimensionality Reduction: Principal Component Analysis (PCA)
  • Given N data vectors from n dimensions, find k ≤ n
    orthogonal vectors (principal components) that can
    best be used to represent the data
  • Works for numeric data only
  • Used when the number of dimensions is large

41
Principal Component Analysis
[Figure: principal components Y1 and Y2 of data plotted in the X1-X2 plane]
42
Data Reduction Method (3): Clustering
  • Partition data set into clusters based on
    similarity, and store cluster representation
    (e.g., centroid and diameter) only
  • Can be very effective if data is clustered but
    not if data is smeared
  • Can have hierarchical clustering and be stored in
    multi-dimensional index tree structures
  • There are many choices of clustering definitions
    and clustering algorithms
  • Cluster analysis will be studied in depth in
    Chapter 7

43
Data Reduction Method (4): Sampling
  • Sampling: obtaining a small sample s to represent
    the whole data set N
  • Allows a mining algorithm to run in complexity
    that is potentially sub-linear to the size of the
    data
  • Choose a representative subset of the data
  • Simple random sampling may have very poor
    performance in the presence of skew
  • Develop adaptive sampling methods
  • Stratified sampling
  • Approximate the percentage of each class (or
    subpopulation of interest) in the overall
    database
  • Used in conjunction with skewed data
  • Note: sampling may not reduce database I/Os (page
    at a time)

44
Sampling with or without Replacement
[Figure: SRSWOR (simple random sampling without replacement) vs. SRSWR
(simple random sampling with replacement)]
45
Sampling: Cluster or Stratified Sampling
[Figure: raw data compared with the resulting cluster/stratified sample]
46
Chapter 2: Data Preprocessing
  • Why preprocess the data?
  • Data cleaning
  • Data integration and transformation
  • Data reduction
  • Discretization and concept hierarchy generation
  • Summary

47
Discretization
  • Three types of attributes
  • Nominal: values from an unordered set, e.g.,
    color, profession
  • Ordinal: values from an ordered set, e.g.,
    military or academic rank
  • Continuous: real numbers, e.g., integer or real
    values
  • Discretization
  • Divide the range of a continuous attribute into
    intervals
  • Some classification algorithms only accept
    categorical attributes.
  • Reduce data size by discretization
  • Prepare for further analysis

48
Discretization and Concept Hierarchy
  • Discretization
  • Reduce the number of values for a given
    continuous attribute by dividing the range of the
    attribute into intervals
  • Interval labels can then be used to replace
    actual data values
  • Supervised vs. unsupervised
  • Split (top-down) vs. merge (bottom-up)
  • Discretization can be performed recursively on an
    attribute
  • Concept hierarchy formation
  • Recursively reduce the data by collecting and
    replacing low level concepts (such as numeric
    values for age) by higher level concepts (such as
    young, middle-aged, or senior)

49
Discretization and Concept Hierarchy Generation
for Numeric Data
  • Typical methods (all can be applied recursively)
  • Binning (covered above): top-down split,
    unsupervised
  • Histogram analysis (covered above): top-down
    split, unsupervised
  • Clustering analysis (covered above): either
    top-down split or bottom-up merge, unsupervised
  • Entropy-based discretization: supervised,
    top-down split (see the sketch after this list)
  • Interval merging by χ² analysis: supervised
    (class-based), bottom-up merge
  • Segmentation by natural partitioning: top-down
    split, unsupervised

50
Summary
  • Data preparation or preprocessing is a big issue
    for both data warehousing and data mining
  • Descriptive data summarization is needed for
    quality data preprocessing
  • Data preparation includes
  • Data cleaning and data integration
  • Data reduction and feature selection
  • Discretization
  • A lot of methods have been developed, but data
    preprocessing is still an active area of research
