Data Mining and Data Warehousing
1
Data Mining and Data Warehousing
  • Introduction
  • OLAP and Data warehousing
  • Data preprocessing for mining and warehousing
  • Concept description characterization and
    discrimination
  • Classification and prediction
  • Clustering analysis
  • Association analysis
  • Mining complex data and advanced mining
    techniques
  • Trends and research issues

2
Session 3: Data Preprocessing
  • Data cleansing and integration
  • Feature Selection
  • Discretization
  • Summary

3
Data Cleansing
  • A data warehouse contains data that is analyzed
    for business decisions, and knowledge discovered
    from the data will be reused in the future
  • Detecting data anomalies and rectifying them
    early has huge payoffs

4
Real World Data Are Dirty
  • Typical data quality problems
  • incomplete data -- values of certain fields are
    missing
  • duplicate data -- two records refer to the same
    real-world object
  • inaccurate data -- values do not reflect the
    real-world situation
  • inconsistent data -- related data records violate
    integrity constraints

5
Cleansing Dirty Data
  • Remove duplicate records
  • identifying duplicate records is not an easy task
  • merge-purge approach
  • Incomplete data
  • drop data items that are incomplete
  • fill in missing values
  • Inconsistent data
  • external references
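A minimal Python sketch of these options using pandas (an illustration, not part of the original slides; the toy table and column names are invented):

```python
# Toy cleansing example; the table and column names are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "customer": ["Ann", "Ann", "Bob", "Cal", "Dee"],
    "age":      [34,    34,    None,  51,    28],
    "spend":    [120.0, 120.0, 80.0,  None,  45.0],
})

# Remove exact duplicate records (a full merge-purge pass would also
# sort on a key and match near-duplicates within a sliding window).
df = df.drop_duplicates()

# Incomplete data, option 1: drop records with missing values.
dropped = df.dropna()

# Incomplete data, option 2: fill missing numeric values with column means.
filled = df.fillna(df.mean(numeric_only=True))
```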

6
Data Integration
  • Schema integration
  • integrate metadata from different sources
  • Entity identification problem
  • to identify real world entities from multiple
    data sources
  • Detecting and resolving data value conflicts
  • for the same real world entity, attribute values
    from different sources are different
  • possible reasons different representation,
    different scale

7
Session 3: Data Preprocessing
  • Data cleansing and integration
  • Feature Selection
  • Discretization
  • Summary

8
Feature Selection for Classification
  • Given a data set with N features (attributes),
    feature selection can be defined as the process
    of selecting a minimum set of M features, where
    M ≤ N, such that the probability distribution of
    the classes given the values of those M features
    is as close as possible to the original
    distribution given the values of all features.
  • If F_N is the original feature set and F_M is the
    selected feature set, then P(C | F_M = f_M)
    should be as close as possible to P(C | F_N = f_N)
    for every possible class C. Here f_M and f_N
    represent vectors of values of the respective
    feature vectors F_M and F_N.

9
Feature Selection Approaches
  • Filter: acts as a preprocessing stage for the
    learning algorithm by removing the features that
    are not required for correct classification.
  • Wrapper: uses the learning algorithm itself as the
    evaluation function when choosing the required
    features.
  • any learning algorithm is biased, so selecting
    relevant features according to a particular
    learning algorithm is equivalent to changing the
    data to fit that algorithm
  • time complexity is usually high
  • may have problems with very large data sets

10
Feature Selection Algorithms -- Relief
  • Proposed by Kira and Rendell, 1992
  • Initialize the weights of all attributes to zero
  • Randomly choose a tuple (instance) and find its
    near-hit and near-miss using a Euclidean distance
    measure
  • Adjust the weight of each attribute by adding the
    squared difference to the near-miss and
    subtracting the squared difference to the near-hit
  • Repeat the two previous steps N times and divide
    the weights by N
  • Attributes whose weight exceeds a threshold are
    chosen as relevant (see the sketch below)
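As a concrete reading of these steps, here is a compact NumPy sketch (an illustration, not the authors' original code); it assumes numeric features on comparable scales and that every sampled instance has at least one hit and one miss:

```python
# Sketch of Relief: weights features by how well they separate an
# instance from its near-miss versus its near-hit.
import numpy as np

def relief(X, y, n_iters, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        i = rng.integers(n)
        x, c = X[i], y[i]
        dist = np.linalg.norm(X - x, axis=1)      # Euclidean distances
        dist[i] = np.inf                          # exclude the instance itself
        near_hit = X[np.where(y == c, dist, np.inf).argmin()]
        near_miss = X[np.where(y != c, dist, np.inf).argmin()]
        # add the squared difference to the near-miss,
        # subtract the squared difference to the near-hit
        w += (x - near_miss) ** 2 - (x - near_hit) ** 2
    return w / n_iters                            # divide the weights by N

# Attributes with weight above a chosen threshold are kept as relevant.
```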

11
Feature Selection Algorithms -- Relief
(Figure: a chosen instance with its near-hit in the same class and its near-miss in a different class)
12
Selecting Features Based on Inconsistency
  • Select a subset of attributes
  • Examine whether any inconsistency exists
  • two samples are inconsistent if they agree on all
    the selected attributes but do not agree on the
    class label (see the sketch below)
  • Repeat the above procedure until a subset with no
    inconsistency is found
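A small Python illustration of this check (a hypothetical helper; `subset` holds the positions of the selected attributes):

```python
# Inconsistency check: samples that agree on the selected attributes
# but carry different class labels count as inconsistent.
from collections import defaultdict

def inconsistency_rate(rows, labels, subset):
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        key = tuple(row[a] for a in subset)   # projection onto the subset
        groups[key].append(label)
    # every label beyond the majority in a matching pattern is inconsistent
    bad = sum(len(ls) - max(ls.count(l) for l in set(ls))
              for ls in groups.values())
    return bad / len(rows)

rows   = [(0, 1, 0), (0, 1, 1), (1, 0, 0)]
labels = ["yes", "no", "no"]
print(inconsistency_rate(rows, labels, subset=(0, 1)))  # 1/3: rows 1-2 collide
```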

13
Selecting Subset For Testing Relevance
  • Different ways of selecting subsets
  • start from subsets with cardinality 1, then 2, ...
  • systematically generate the possible subsets
  • LVF (Liu and Setiono, 1996)
  • select subsets from all possible sets using a Las
    Vegas algorithm
  • for each subset, compute the inconsistency rate
    (the number of inconsistent samples divided by
    the total number of samples)
  • retain the subset with the minimum inconsistency
    rate as the best set
  • repeat the above a pre-determined number of times;
    the best set retained is the final result (see the
    sketch below)
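A sketch of the LVF loop under these definitions (one common formulation keeps the smallest subset whose inconsistency rate stays within a tolerance `gamma`, an assumed parameter; 0 means strict consistency). The inconsistency-rate helper from the previous sketch is repeated so the block runs on its own:

```python
# Sketch of LVF: Las Vegas-style random subset search driven by the
# inconsistency rate.
import random
from collections import defaultdict

def inconsistency_rate(rows, labels, subset):
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[tuple(row[a] for a in subset)].append(label)
    bad = sum(len(ls) - max(ls.count(l) for l in set(ls))
              for ls in groups.values())
    return bad / len(rows)

def lvf(rows, labels, n_features, max_tries=1000, gamma=0.0, seed=0):
    random.seed(seed)
    best = tuple(range(n_features))          # start with all features
    for _ in range(max_tries):
        k = random.randint(1, len(best))     # only consider smaller subsets
        cand = tuple(sorted(random.sample(range(n_features), k)))
        if inconsistency_rate(rows, labels, cand) <= gamma:
            best = cand                      # retain the smaller consistent set
    return best
```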

14
Session 3: Data Preprocessing
  • Data cleansing and integration
  • Feature Selection
  • Discretization
  • Summary

15
Discretization
  • Three types of attributes
  • nominal -- values from an unordered set
  • ordinal -- values from an ordered set
  • continuous -- real numbers
  • Discretization: divide the range of a continuous
    attribute into intervals.
  • Some classification algorithms only accept
    categorical attributes.
  • Discretization reduces the data size.
  • It also prepares the data for further analysis.

16
Static vs. Dynamic Discretization
  • Dynamic discretization: some classification
    algorithms have a built-in mechanism to discretize
    continuous attributes
  • Static discretization: a preprocessing step in the
    data mining process (or other applications)

17
Simple Discretization Methods
  • Equal width (distance) intervals
  • divide the range into N intervals of equal size
  • if A and B are the lowest and highest values of
    the attribute, the width of the intervals will be
    W = (B - A) / N
  • the interval boundaries are at
    A + W, A + 2W, ..., A + (N - 1)W
  • Equal frequency intervals
  • divide the range into N intervals
  • each interval contains approximately the same
    number of samples
  • Both processes ignore the class information (see
    the sketch below).
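Both schemes in a few lines of NumPy (an illustration; the sample values are invented and assumed to have distinct quantiles):

```python
# Equal-width and equal-frequency binning.
import numpy as np

values = np.array([4.0, 5.5, 6.1, 7.0, 8.2, 9.9, 12.4, 15.0])
N = 4  # number of intervals

# Equal width: boundaries at A+W, A+2W, ..., A+(N-1)W with W = (B-A)/N.
A, B = values.min(), values.max()
W = (B - A) / N
width_edges = A + W * np.arange(1, N)
width_bins = np.digitize(values, width_edges)

# Equal frequency: boundaries at the 1/N, 2/N, ... quantiles.
freq_edges = np.quantile(values, np.arange(1, N) / N)
freq_bins = np.digitize(values, freq_edges)

print(width_edges, width_bins)
print(freq_edges, freq_bins)
```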

18
ChiMerge (Kerber, 1992)
  • Quality of discretization is hard to define.
  • ChiMerge's view:
  • relative class frequencies should be fairly
    consistent within an interval (otherwise the
    interval should be split)
  • two adjacent intervals should not have similar
    relative class frequencies (otherwise they should
    be merged)

19
χ² Test and Discretization
  • χ² is a statistical measure used to test the
    hypothesis that two discrete attributes are
    statistically independent.
  • For two intervals, if the χ² test concludes that
    the class is independent of the intervals, the
    intervals should be merged. If the χ² test
    concludes that they are not independent, i.e.,
    the difference in relative class frequency is
    statistically significant, the two intervals
    should remain separate.

20
Computing χ²
  • The value is computed as follows:

    χ² = Σ_{i=1}^{2} Σ_{j=1}^{K} (A_ij - E_ij)² / E_ij

    K    = number of classes
    A_ij = number of samples in the i-th interval, j-th class
    E_ij = expected frequency of A_ij = (R_i × C_j) / N
    R_i  = number of samples in the i-th interval
    C_j  = number of samples in the j-th class
    N    = total number of samples
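A direct transcription of the formula into Python (a sketch; `counts` is the 2×K table of class counts for a pair of adjacent intervals):

```python
# Chi-square statistic for two adjacent intervals.
import numpy as np

def chi2_stat(counts):
    counts = np.asarray(counts, dtype=float)   # shape (2, K): A_ij
    R = counts.sum(axis=1, keepdims=True)      # R_i: samples per interval
    C = counts.sum(axis=0, keepdims=True)      # C_j: samples per class
    E = R * C / counts.sum()                   # E_ij = (R_i * C_j) / N
    E[E == 0] = 1e-12                          # guard against zero expected counts
    return ((counts - E) ** 2 / E).sum()

print(chi2_stat([[2, 1], [1, 3]]))             # ~1.22 for this toy table
```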
21
ChiMerge -- The Algorithm
  • Compute the χ² value for each pair of adjacent
    intervals
  • Merge the pair of adjacent intervals with the
    lowest χ² value
  • Repeat the two steps above until the χ² values of
    all adjacent pairs exceed a threshold
  • The threshold is determined by the significance
    level and the degrees of freedom (number of
    classes - 1); see the sketch below
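The full loop as a sketch (assuming per-interval class counts are already tabulated and ordered by attribute value; SciPy's chi2.ppf turns the significance level and K-1 degrees of freedom into the threshold):

```python
# ChiMerge loop: repeatedly merge the adjacent pair with the lowest
# chi-square value until all pairs exceed the threshold.
import numpy as np
from scipy.stats import chi2

def chi2_stat(counts):                          # same statistic as above
    counts = np.asarray(counts, dtype=float)
    R = counts.sum(axis=1, keepdims=True)
    C = counts.sum(axis=0, keepdims=True)
    E = R * C / counts.sum()
    E[E == 0] = 1e-12
    return ((counts - E) ** 2 / E).sum()

def chimerge(intervals, alpha=0.05):
    """intervals: list of per-interval class-count rows, ordered by value."""
    K = len(intervals[0])
    threshold = chi2.ppf(1 - alpha, df=K - 1)
    intervals = [np.asarray(r, dtype=float) for r in intervals]
    while len(intervals) > 1:
        stats = [chi2_stat([intervals[i], intervals[i + 1]])
                 for i in range(len(intervals) - 1)]
        i = int(np.argmin(stats))
        if stats[i] > threshold:               # all pairs differ significantly
            break
        intervals[i] = intervals[i] + intervals[i + 1]
        del intervals[i + 1]
    return intervals
```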

22
An Example
(Figure: the merge results under χ² thresholds of 1.4 and 4.6)
23
Entropy Based Discretization
  • Given a set of samples S, if S is partitioned
    into two intervals S1 and S2 using boundary T,
    the entropy after partitioning is
    E(S, T) = (|S1| / |S|) Ent(S1) + (|S2| / |S|) Ent(S2)
  • The boundary that minimizes the entropy function
    over all possible boundaries is selected for the
    binary discretization.
  • The process is applied recursively to the
    partitions obtained until some stopping criterion
    is met, e.g., the information gain Ent(S) - E(S, T)
    falls below a small threshold (see the sketch
    below).
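A sketch of the binary split search (an illustration; Ent is the usual class entropy, and midpoints between adjacent distinct values serve as candidate boundaries):

```python
# Entropy-based binary split: find the boundary T minimizing E(S, T).
import numpy as np

def ent(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_split(values, labels):
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    best_t, best_e = None, np.inf
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue                             # no boundary between ties
        t = (values[i] + values[i - 1]) / 2      # candidate boundary
        e = (i * ent(labels[:i])
             + (len(values) - i) * ent(labels[i:])) / len(values)
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e

# Apply recursively to each partition until the information gain
# Ent(S) - E(S, T) drops below a chosen delta.
```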

24
Effects of Discretization
  • Experimental results indicate that with
    discretization
  • data size can be reduced
  • classification accuracy can be improved

25
Session 3: Summary
  • Data preparation is a big issue for both
    warehousing and mining
  • Need to consolidate research work conducted in
    different areas