1
Data Mining: Concepts and Techniques
Chapter 2
2
Chapter 2: Data Preprocessing
  • Why preprocess the data?
  • Descriptive data summarization
  • Data cleaning
  • Data integration and transformation
  • Data reduction
  • Discretization and concept hierarchy generation
  • Summary

3
Why Data Preprocessing?
  • Data in the real world is dirty
  • incomplete: lacking attribute values, lacking
    certain attributes of interest, or containing
    only aggregate data
  • e.g., occupation = ""
  • noisy: containing errors or outliers
  • e.g., Salary = "-10"
  • inconsistent: containing discrepancies in codes
    or names
  • e.g., Age = "42", Birthday = "03/07/1997"
  • e.g., was rating "1, 2, 3", now rating "A, B, C"
  • e.g., discrepancy between duplicate records

4
Why Is Data Dirty?
  • Incomplete data may come from
  • Not applicable data value when collected
  • Different considerations between the time when
    the data was collected and when it is analyzed.
  • Human/hardware/software problems
  • Noisy data (incorrect values) may come from
  • Faulty data collection instruments
  • Human or computer error at data entry
  • Errors in data transmission
  • Inconsistent data may come from
  • Different data sources
  • Functional dependency violation (e.g., modify
    some linked data)
  • Duplicate records also need data cleaning

5
Why Is Data Preprocessing Important?
  • No quality data, no quality mining results!
  • Quality decisions must be based on quality data
  • e.g., duplicate or missing data may cause
    incorrect or even misleading statistics.
  • Data warehouse needs consistent integration of
    quality data
  • Data extraction, cleaning, and transformation
    comprise the majority of the work of building a
    data warehouse

6
Multi-Dimensional Measure of Data Quality
  • A well-accepted multidimensional view
  • Accuracy
  • Completeness
  • Consistency
  • Timeliness
  • Believability
  • Value added
  • Interpretability
  • Accessibility
  • Broad categories
  • Intrinsic, contextual, representational, and
    accessibility

7
Major Tasks in Data Preprocessing
  • Data cleaning
  • Fill in missing values, smooth noisy data,
    identify or remove outliers, and resolve
    inconsistencies
  • Data integration
  • Integration of multiple databases, data cubes, or
    files
  • Data transformation
  • Normalization and aggregation
  • Data reduction
  • Obtains a reduced representation that is much
    smaller in volume but produces the same or
    similar analytical results
  • Data discretization
  • Part of data reduction but with particular
    importance, especially for numerical data

8
Forms of Data Preprocessing
9
Chapter 2: Data Preprocessing
  • Why preprocess the data?
  • Descriptive data summarization
  • Data cleaning
  • Data integration and transformation
  • Data reduction
  • Discretization and concept hierarchy generation
  • Summary

10
Mining Data Descriptive Characteristics
  • Motivation
  • To better understand the data: central tendency,
    variation, and spread
  • Data dispersion characteristics
  • median, max, min, quantiles, outliers, variance,
    etc.
  • Numerical dimensions correspond to sorted
    intervals
  • Data dispersion analyzed with multiple
    granularities of precision
  • Boxplot or quantile analysis on sorted intervals
  • Dispersion analysis on computed measures
  • Folding measures into numerical dimensions
  • Boxplot or quantile analysis on the transformed
    cube

11
Measuring the Central Tendency
  • Mean (algebraic measure) (sample vs. population):
    sample mean x̄ = (Σ xᵢ) / n, population mean µ = (Σ x) / N
  • Weighted arithmetic mean: x̄ = (Σ wᵢxᵢ) / (Σ wᵢ)
  • Trimmed mean: chopping extreme values
  • Median: a holistic measure
  • Middle value if odd number of values, or average
    of the middle two values otherwise
  • Estimated by interpolation (for grouped data)
  • Mode
  • Value that occurs most frequently in the data
  • Unimodal, bimodal, trimodal
  • Empirical formula: mean - mode ≈ 3 × (mean - median)

12
Symmetric vs. Skewed Data
  • Median, mean and mode of symmetric, positively
    and negatively skewed data

13
Measuring the Dispersion of Data
  • Quartiles, outliers and boxplots
  • Quartiles: Q1 (25th percentile), Q3 (75th
    percentile)
  • Inter-quartile range: IQR = Q3 - Q1
  • Five number summary: min, Q1, M, Q3, max
  • Boxplot: the ends of the box are the quartiles,
    the median is marked, whiskers extend outward,
    and outliers are plotted individually
  • Outlier: usually, a value more than 1.5 × IQR
    above Q3 or below Q1
  • Variance and standard deviation (sample: s,
    population: σ)
  • Variance (algebraic, scalable computation):
    s² = (1/(n-1)) Σ (xᵢ - x̄)², σ² = (1/N) Σ (xᵢ - µ)²
  • Standard deviation s (or σ) is the square root of
    variance s² (or σ²)

14
Properties of Normal Distribution Curve
  • The normal (distribution) curve
  • From µ-σ to µ+σ: contains about 68% of the
    measurements (µ: mean, σ: standard deviation)
  • From µ-2σ to µ+2σ: contains about 95% of it
  • From µ-3σ to µ+3σ: contains about 99.7% of it

15
Boxplot Analysis
  • Five-number summary of a distribution
  • Minimum, Q1, M, Q3, Maximum
  • Boxplot
  • Data is represented with a box
  • The ends of the box are at the first and third
    quartiles, i.e., the height of the box is IQR
  • The median is marked by a line within the box
  • Whiskers: two lines outside the box extend to
    Minimum and Maximum

16
Visualization of Data Dispersion: Boxplot Analysis
17
Histogram Analysis
  • Graph displays of basic statistical class
    descriptions
  • Frequency histograms
  • A univariate graphical method
  • Consists of a set of rectangles that reflect the
    counts or frequencies of the classes present in
    the given data

18
Quantile Plot
  • Displays all of the data (allowing the user to
    assess both the overall behavior and unusual
    occurrences)

19
Quantile-Quantile (Q-Q) Plot
  • Graphs the quantiles of one univariate
    distribution against the corresponding quantiles
    of another
  • Allows the user to view whether there is a shift
    in going from one distribution to another

20
Scatter plot
  • Provides a first look at bivariate data to see
    clusters of points, outliers, etc.
  • Each pair of values is treated as a pair of
    coordinates and plotted as points in the plane

21
Loess Curve
  • Adds a smooth curve to a scatter plot in order to
    provide better perception of the pattern of
    dependence
  • Loess curve is fitted by setting two parameters:
    a smoothing parameter, and the degree of the
    polynomials that are fitted by the regression

22
Positively and Negatively Correlated Data
23
Not Correlated Data
24
Chapter 2: Data Preprocessing
  • Why preprocess the data?
  • Descriptive data summarization
  • Data cleaning
  • Data integration and transformation
  • Data reduction
  • Discretization and concept hierarchy generation
  • Summary

25
Data Cleaning
  • Importance
  • "Data cleaning is one of the three biggest
    problems in data warehousing" (Ralph Kimball)
  • "Data cleaning is the number one problem in data
    warehousing" (DCI survey)
  • Data cleaning tasks
  • Fill in missing values
  • Identify outliers and smooth out noisy data
  • Correct inconsistent data
  • Resolve redundancy caused by data integration

26
Missing Data
  • Data is not always available
  • E.g., many tuples have no recorded value for
    several attributes, such as customer income in
    sales data
  • Missing data may be due to
  • equipment malfunction
  • inconsistent with other recorded data and thus
    deleted
  • data not entered due to misunderstanding
  • certain data may not be considered important at
    the time of entry
  • history or changes of the data not registered
  • Missing data may need to be inferred.

27
How to Handle Missing Data?
  • Ignore the tuple: usually done when the class
    label is missing (assuming the task is
    classification); not effective when the
    percentage of missing values per attribute varies
    considerably
  • Fill in the missing value manually: tedious +
    infeasible?
  • Fill it in automatically with (see the sketch
    after this list)
  • a global constant: e.g., "unknown", a new
    class?!
  • the attribute mean
  • the attribute mean for all samples belonging to
    the same class: smarter
  • the most probable value: inference-based, such as
    a Bayesian formula or decision tree

28
Noisy Data
  • Noise: random error or variance in a measured
    variable
  • Incorrect attribute values may be due to
  • faulty data collection instruments
  • data entry problems
  • data transmission problems
  • technology limitation
  • inconsistency in naming convention
  • Other data problems which require data cleaning
  • duplicate records
  • incomplete data
  • inconsistent data

29
How to Handle Noisy Data?
  • Binning (see the sketch after this list)
  • first sort data and partition into
    (equal-frequency) bins
  • then one can smooth by bin means, bin medians,
    bin boundaries, etc.
  • Regression
  • smooth by fitting the data into regression
    functions
  • Clustering
  • detect and remove outliers
  • Combined computer and human inspection
  • detect suspicious values and check by human
    (e.g., deal with possible outliers)

30
Data Cleaning as a Process
  • Data discrepancy detection
  • Use metadata (e.g., domain, range, dependency,
    distribution)
  • Check field overloading
  • Check uniqueness rule, consecutive rule and null
    rule
  • Use commercial tools
  • Data scrubbing: use simple domain knowledge
    (e.g., postal code, spell-check) to detect errors
    and make corrections
  • Data auditing: analyze data to discover rules and
    relationships and to detect violators (e.g.,
    correlation and clustering to find outliers)
  • Data migration and integration
  • Data migration tools allow transformations to be
    specified
  • ETL (Extraction/Transformation/Loading) tools
    allow users to specify transformations through a
    graphical user interface
  • Integration of the two processes
  • Iterative and interactive (e.g., Potter's Wheel)

31
Chapter 2: Data Preprocessing
  • Why preprocess the data?
  • Data cleaning
  • Data integration and transformation
  • Data reduction
  • Discretization and concept hierarchy generation
  • Summary

32
Data Integration
  • Data integration
  • Combines data from multiple sources into a
    coherent store
  • Schema integration: e.g., A.cust-id ≡ B.cust-#
  • Integrate metadata from different sources
  • Entity identification problem
  • Identify real world entities from multiple data
    sources, e.g., Bill Clinton = William Clinton
  • Detecting and resolving data value conflicts
  • For the same real world entity, attribute values
    from different sources are different
  • Possible reasons: different representations,
    different scales, e.g., metric vs. British units
  • Redundancy between attributes can be detected by
    correlation analysis (e.g., a chi-square test,
    sketched below)
33
Data Transformation
  • Smoothing: remove noise from data
  • Aggregation: summarization, data cube
    construction
  • Generalization: concept hierarchy climbing
  • Normalization: scaled to fall within a small,
    specified range (see the sketch after this list)
  • min-max normalization
  • z-score normalization
  • normalization by decimal scaling
  • Attribute/feature construction
  • New attributes constructed from the given ones

34
Chapter 2: Data Preprocessing
  • Why preprocess the data?
  • Data cleaning
  • Data integration and transformation
  • Data reduction
  • Discretization and concept hierarchy generation
  • Summary

35
Data Reduction Strategies
  • Why data reduction?
  • A database/data warehouse may store terabytes of
    data
  • Complex data analysis/mining may take a very long
    time to run on the complete data set
  • Data reduction
  • Obtain a reduced representation of the data set
    that is much smaller in volume yet produces the
    same (or almost the same) analytical results
  • Data reduction strategies
  • Data cube aggregation
  • Dimensionality reduction: e.g., remove
    unimportant attributes
  • Data compression
  • Numerosity reduction: e.g., fit data into models
  • Discretization and concept hierarchy generation

36
Data Compression
  • String compression
  • There are extensive theories and well-tuned
    algorithms
  • Typically lossless
  • But only limited manipulation is possible without
    expansion
  • Audio/video compression
  • Typically lossy compression, with progressive
    refinement
  • Sometimes small fragments of signal can be
    reconstructed without reconstructing the whole
  • Time sequence data is not audio
  • Typically short and varies slowly with time

37
Data Compression
[Figure: lossless compression recovers the original data exactly from the
compressed data; lossy compression recovers only an approximation]
38
Dimensionality Reduction: Wavelet Transformation
  • Discrete wavelet transform (DWT): linear signal
    processing, multi-resolutional analysis
  • Compressed approximation: store only a small
    fraction of the strongest wavelet coefficients
    (sketched below)
  • Similar to discrete Fourier transform (DFT), but
    better lossy compression, localized in space

39
DWT for Image Compression
[Figure: an image is decomposed by repeatedly applying low-pass and
high-pass filters at successive resolutions]
40
Dimensionality Reduction: Principal Component Analysis (PCA)
  • Given N data vectors from n dimensions, find k ≤ n
    orthogonal vectors (principal components) that can
    best be used to represent the data
  • Works for numeric data only
  • Used when the number of dimensions is large

41
Principal Component Analysis
[Figure: principal components Y1 and Y2 of data plotted in the X1-X2 plane]
42
Data Reduction Method (3): Clustering
  • Partition data set into clusters based on
    similarity, and store cluster representation
    (e.g., centroid and diameter) only
  • Can be very effective if data is clustered but
    not if data is smeared
  • Can have hierarchical clustering and be stored in
    multi-dimensional index tree structures
  • There are many choices of clustering definitions
    and clustering algorithms
  • Cluster analysis will be studied in depth in
    Chapter 7

43
Data Reduction Method (4): Sampling
  • Sampling: obtaining a small sample s to represent
    the whole data set N
  • Allows a mining algorithm to run in complexity
    that is potentially sub-linear to the size of the
    data
  • Choose a representative subset of the data
  • Simple random sampling may have very poor
    performance in the presence of skew
  • Develop adaptive sampling methods
  • Stratified sampling
  • Approximate the percentage of each class (or
    subpopulation of interest) in the overall
    database
  • Used in conjunction with skewed data
  • Note: sampling may not reduce database I/Os (page
    at a time)

44
Sampling with or without Replacement
[Figure: SRSWOR (simple random sampling without replacement) vs. SRSWR
(simple random sampling with replacement)]
45
Sampling: Cluster or Stratified Sampling
[Figure: raw data compared with the resulting cluster/stratified sample]
46
Chapter 2: Data Preprocessing
  • Why preprocess the data?
  • Data cleaning
  • Data integration and transformation
  • Data reduction
  • Discretization and concept hierarchy generation
  • Summary

47
Discretization
  • Three types of attributes
  • Nominal: values from an unordered set, e.g.,
    color, profession
  • Ordinal: values from an ordered set, e.g.,
    military or academic rank
  • Continuous: real numbers, e.g., integer or real
    values
  • Discretization
  • Divide the range of a continuous attribute into
    intervals
  • Some classification algorithms only accept
    categorical attributes.
  • Reduce data size by discretization
  • Prepare for further analysis

48
Discretization and Concept Hierarchy
  • Discretization
  • Reduce the number of values for a given
    continuous attribute by dividing the range of the
    attribute into intervals
  • Interval labels can then be used to replace
    actual data values
  • Supervised vs. unsupervised
  • Split (top-down) vs. merge (bottom-up)
  • Discretization can be performed recursively on an
    attribute
  • Concept hierarchy formation
  • Recursively reduce the data by collecting and
    replacing low level concepts (such as numeric
    values for age) by higher level concepts (such as
    young, middle-aged, or senior)

49
Discretization and Concept Hierarchy Generation
for Numeric Data
  • Typical methods (all can be applied recursively)
  • Binning (covered above): top-down split,
    unsupervised
  • Histogram analysis (covered above): top-down
    split, unsupervised
  • Clustering analysis (covered above): either
    top-down split or bottom-up merge, unsupervised
  • Entropy-based discretization: supervised,
    top-down split (see the sketch after this list)
  • Interval merging by χ² analysis: supervised
    (class-based), bottom-up merge
  • Segmentation by natural partitioning: top-down
    split, unsupervised

50
Summary
  • Data preparation or preprocessing is a big issue
    for both data warehousing and data mining
  • Descriptive data summarization is needed for
    quality data preprocessing
  • Data preparation includes
  • Data cleaning and data integration
  • Data reduction and feature selection
  • Discretization
  • A lot of methods have been developed, but data
    preprocessing is still an active area of research
