Title: Data Mining and Data Warehousing
1. Data Mining and Data Warehousing
- Introduction
- OLAP and data warehousing
- Data preprocessing for mining and warehousing
- Concept description: characterization and discrimination
- Classification and prediction
- Clustering analysis
- Association analysis
- Mining complex data and advanced mining techniques
- Trends and research issues
2. Session 3: Data Preprocessing
- Data cleansing and integration
- Feature Selection
- Discretization
- Summary
3. Data Cleansing
- A data warehouse contains data that is analyzed for business decisions; knowledge discovered from the data will be used in the future
- Detecting data anomalies and rectifying them early has huge payoffs
4. Real-World Data Are Dirty
- Typical data quality problems
- incomplete data
- values of certain fields are missing
- duplicate data
- two records refer to the same real-world object
- inaccurate data
- values do not reflect the real-world situation
- inconsistent data
- relevant data records violate integrity constraints
5. Cleansing Dirty Data
- Remove duplicate records
- identifying duplicate records is not an easy task
- merge/purge approach
- Incomplete data
- drop data items that are incomplete
- fill in missing values (see the sketch after this slide)
- Inconsistent data
- correct using external references
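A minimal pandas sketch of the two cleansing steps above, dropping duplicates and filling missing values; the DataFrame and column names are made up for illustration, and a real merge/purge step needs fuzzy record matching rather than exact row comparison.

import pandas as pd

# Hypothetical customer records pulled from two operational sources.
records = pd.DataFrame({
    "customer_id": [101, 101, 102, 103],
    "name": ["Ann Lee", "Ann Lee", "Bo Chan", "Cy Dorn"],
    "age": [34, 34, None, 45],
    "city": ["Tokyo", "Tokyo", None, "Osaka"],
})

# Remove exact duplicate rows (a very simple stand-in for merge/purge).
records = records.drop_duplicates()

# Fill missing values: numeric fields with the column mean,
# categorical fields with a constant placeholder.
records["age"] = records["age"].fillna(records["age"].mean())
records["city"] = records["city"].fillna("unknown")

print(records)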
6. Data Integration
- Schema integration
- integrate metadata from different sources
- Entity identification problem
- identify real-world entities across multiple data sources
- Detecting and resolving data value conflicts (see the sketch after this slide)
- for the same real-world entity, attribute values from different sources differ
- possible reasons: different representations, different scales
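A small sketch of resolving a value conflict caused by different scales; the product record, field names, and tolerance are hypothetical.

# One (hypothetical) source reports product weight in pounds, another in kilograms.
POUNDS_PER_KG = 2.20462

source_a = {"sku": "X-100", "weight_kg": 2.5}
source_b = {"sku": "X-100", "weight_lb": 5.51}

# Normalize both to kilograms before comparing or merging.
weight_a = source_a["weight_kg"]
weight_b = source_b["weight_lb"] / POUNDS_PER_KG

# Treat the values as consistent if they agree within a small tolerance.
consistent = abs(weight_a - weight_b) < 0.01
print(weight_a, round(weight_b, 3), consistent)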
7. Session 3: Data Preprocessing
- Data cleansing and integration
- Feature Selection
- Discretization
- Summary
8. Feature Selection for Classification
- Given a data set with N features (attributes), feature selection can be defined as the process of selecting a minimum set of M features, where M ≤ N, such that the probability distribution of the different classes given the values of those M features is as close as possible to the original distribution given the values of all N features.
- If FN is the original feature set and FM is the selected feature set, then P(C | FM = fM) should be as close as possible to P(C | FN = fN) for every possible class C. Here fM and fN represent vectors of values of the respective feature vectors FM and FN. (A small example follows this slide.)
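To make the criterion concrete, here is a tiny Python sketch (the toy data and function name are made up) that estimates the class distribution conditioned on a feature subset; for the reduced subset {0} the conditional class distributions match those obtained with all features, so feature 1 can be dropped.

from collections import Counter, defaultdict

def class_distribution(samples, labels, subset):
    """Empirical P(C | selected features = their values) for each value combination."""
    groups = defaultdict(list)
    for row, label in zip(samples, labels):
        groups[tuple(row[i] for i in subset)].append(label)
    return {key: {c: n / len(g) for c, n in Counter(g).items()} for key, g in groups.items()}

# Toy data: feature 0 determines the class, feature 1 is redundant.
samples = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
labels  = ["yes", "yes", "no", "no"]

print(class_distribution(samples, labels, subset=[0]))     # same class structure...
print(class_distribution(samples, labels, subset=[0, 1]))  # ...as with all features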
9. Feature Selection Approaches
- Filter: acts as a preprocessing stage of the learning algorithm, removing the features that are not required for correct classification
- Wrapper: uses the learning algorithm itself as the evaluation function when choosing the required features
- any learning algorithm is biased; selecting features according to a particular algorithm amounts to changing the data to fit that algorithm
- time complexity is usually high
- may have problems with very large data sets
10. Feature Selection Algorithms -- Relief
- Proposed by Kira and Rendell, 1992 (a sketch follows this slide)
- Initialize the weights of all attributes to zero
- Randomly choose a tuple (instance) and find its near-hit and near-miss using a Euclidean distance measure
- Adjust the weight of each attribute by subtracting its squared difference to the near-hit and adding its squared difference to the near-miss
- Repeat the previous two steps N times and divide the weights by N
- Attributes whose weight exceeds a threshold are chosen as relevant
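A minimal NumPy sketch of the Relief procedure described above; the function name, synthetic data, and default parameters are illustrative assumptions, and real implementations also normalize attribute ranges before taking differences.

import numpy as np

def relief(X, y, n_iterations=100, threshold=0.0, rng=None):
    """Basic Relief weights for a two-class data set (illustrative sketch)."""
    rng = np.random.default_rng(rng)
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)

    for _ in range(n_iterations):
        i = rng.integers(n_samples)
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf  # exclude the chosen instance itself
        same = (y == y[i])
        near_hit = np.argmin(np.where(same, dists, np.inf))
        near_miss = np.argmin(np.where(~same, dists, np.inf))
        # Subtract squared differences to the near-hit, add them for the near-miss.
        weights += (X[i] - X[near_miss]) ** 2 - (X[i] - X[near_hit]) ** 2

    weights /= n_iterations
    return np.flatnonzero(weights > threshold)

# Tiny synthetic example: feature 0 separates the classes, feature 1 is noise.
X = np.array([[0.1, 5.0], [0.2, 1.0], [0.9, 4.8], [1.0, 1.2]])
y = np.array([0, 0, 1, 1])
print(relief(X, y, n_iterations=50, rng=0))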
11. Feature Selection Algorithms -- Relief
(Figure: a chosen instance shown with its near-hit and near-miss neighbors)
12. Selecting Features Based on Inconsistency
- Select a subset of attributes
- Examine whether any inconsistency exists (see the check sketched after this slide)
- two samples are inconsistent if they agree on all the selected attributes but do not agree on the class label
- Repeat the above until a subset without inconsistency is found
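A small sketch of the inconsistency check above; the counting rule (every sample beyond the majority class of each matching group counts as inconsistent) follows the inconsistency-rate definition on the next slide, and the toy data is made up.

from collections import defaultdict

def inconsistency_count(samples, labels, subset):
    """Count samples that share the selected attribute values but differ in class."""
    groups = defaultdict(list)
    for row, label in zip(samples, labels):
        groups[tuple(row[i] for i in subset)].append(label)
    count = 0
    for group in groups.values():
        # All but the most frequent class in a group are inconsistent.
        count += len(group) - max(group.count(c) for c in set(group))
    return count

# Illustrative data: attribute 0 alone cannot separate the classes.
samples = [("a", 1), ("a", 2), ("b", 1), ("b", 2)]
labels  = ["yes", "no", "no", "no"]
print(inconsistency_count(samples, labels, subset=[0]))     # > 0: subset {0} is inconsistent
print(inconsistency_count(samples, labels, subset=[0, 1]))  # 0: the full set is consistent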
13. Selecting Subsets for Testing Relevance
- Different ways of selecting subsets
- start from subsets of cardinality 1, then 2, ...
- systematically generate the possible subsets
- LVF (Liu and Setiono, 1996) -- see the sketch after this slide
- select subsets from all possible sets using a Las Vegas algorithm
- for each subset, compute the inconsistency rate (the number of inconsistent samples divided by the total number of samples)
- retain the subset with the minimum inconsistency rate as the best set
- repeat the above a pre-determined number of times; the best set retained is the final result
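A Python sketch of the LVF loop as described on this slide; the function names, the tie-breaking rule that favours smaller subsets, and the toy data are assumptions for illustration (the published LVF uses an allowed-inconsistency threshold rather than strict minimization).

import random
from collections import defaultdict

def inconsistency_rate(samples, labels, subset):
    """Inconsistent samples divided by total samples, for the selected attributes."""
    groups = defaultdict(list)
    for row, label in zip(samples, labels):
        groups[tuple(row[i] for i in subset)].append(label)
    inconsistent = sum(len(g) - max(g.count(c) for c in set(g)) for g in groups.values())
    return inconsistent / len(samples)

def lvf(samples, labels, n_attributes, max_tries=200, seed=0):
    """Randomly sample attribute subsets, keep the one with the lowest inconsistency rate."""
    rng = random.Random(seed)
    best = list(range(n_attributes))
    best_rate = inconsistency_rate(samples, labels, best)
    for _ in range(max_tries):
        subset = rng.sample(range(n_attributes), rng.randint(1, n_attributes))
        rate = inconsistency_rate(samples, labels, subset)
        # Prefer lower inconsistency; break ties in favour of fewer attributes.
        if rate < best_rate or (rate == best_rate and len(subset) < len(best)):
            best, best_rate = subset, rate
    return sorted(best), best_rate

samples = [("a", 1, "x"), ("a", 2, "x"), ("b", 1, "y"), ("b", 2, "y")]
labels  = ["yes", "no", "no", "no"]
print(lvf(samples, labels, n_attributes=3))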
14. Session 3: Data Preprocessing
- Data cleansing and integration
- Feature Selection
- Discretization
- Summary
15. Discretization
- Three types of attributes
- nominal -- values from an unordered set
- ordinal -- values from an ordered set
- continuous -- real numbers
- Discretization: divide the range of a continuous attribute into intervals
- Some classification algorithms only accept categorical attributes
- Reduces data size
- Prepares the data for further analysis
16. Static vs. Dynamic Discretization
- Dynamic discretization: some classification algorithms have a built-in mechanism to discretize continuous attributes
- Static discretization: a preprocessing step in the process of data mining (or other applications)
17. Simple Discretization Methods
- Equal-width (distance) intervals
- divide the range into N intervals of equal size
- if A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A) / N
- the interval boundaries are at A + W, A + 2W, ..., A + (N - 1)W
- Equal-frequency intervals
- divide the range into N intervals
- each interval contains approximately the same number of samples
- Both methods ignore the class information (see the sketch after this slide)
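A short NumPy sketch of the two unsupervised binning schemes above; the sample values and the choice of N are made up for illustration.

import numpy as np

values = np.array([2.0, 3.5, 4.1, 7.8, 8.0, 9.9, 15.2, 21.0])
N = 4  # number of intervals

# Equal-width: boundaries at A + W, A + 2W, ..., A + (N-1)W with W = (B - A) / N.
A, B = values.min(), values.max()
W = (B - A) / N
width_edges = A + W * np.arange(1, N)
width_bins = np.digitize(values, width_edges)

# Equal-frequency: boundaries at quantiles so each bin holds about len(values)/N samples.
freq_edges = np.quantile(values, [i / N for i in range(1, N)])
freq_bins = np.digitize(values, freq_edges)

print("equal-width bins:    ", width_bins)
print("equal-frequency bins:", freq_bins)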
18. ChiMerge (Kerber, 1992)
- Quality of discretization is hard to define
- ChiMerge's view
- relative class frequencies should be fairly consistent within an interval (otherwise it should be split)
- two adjacent intervals should not have similar relative class frequencies (otherwise they should be merged)
19. χ² Test and Discretization
- χ² is a statistical measure used to test the hypothesis that two discrete attributes are statistically independent
- For two adjacent intervals: if the χ² test concludes that the class is independent of the intervals, the intervals should be merged; if it concludes that they are not independent, i.e., the difference in relative class frequencies is statistically significant, the two intervals should remain separate
20. Computing χ²
- The value is computed as

  χ² = Σ (i = 1..2) Σ (j = 1..k) (Aij - Eij)² / Eij

  where
  k   = number of classes
  Aij = number of samples in the ith interval, jth class
  Eij = expected frequency of Aij = (Ri × Cj) / N
  Ri  = number of samples in the ith interval
  Cj  = number of samples in the jth class
  N   = total number of samples
21. ChiMerge -- The Algorithm
- Compute the χ² value for each pair of adjacent intervals
- Merge the pair of adjacent intervals with the lowest χ² value
- Repeat the previous two steps until the χ² values of all adjacent pairs exceed a threshold
- Threshold determined by the significance level and the degrees of freedom (number of classes - 1)
- (A sketch of the merging loop follows this slide.)
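A minimal Python sketch of the ChiMerge loop above; the chi2 helper follows the formula on the previous slide, while the sample data, the threshold value, and the choice to seed one interval per distinct value are illustrative assumptions.

import numpy as np

def chi2(counts_a, counts_b):
    """χ² statistic for two adjacent intervals given their per-class counts."""
    table = np.array([counts_a, counts_b], dtype=float)
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row * col / table.sum()
    expected[expected == 0] = 1e-9  # avoid division by zero for empty classes
    return float(((table - expected) ** 2 / expected).sum())

def chimerge(values, labels, threshold):
    """Merge adjacent intervals until every adjacent pair has χ² above the threshold."""
    classes = sorted(set(labels))
    # Start with one interval per distinct value, each holding per-class counts.
    intervals = []
    for v in sorted(set(values)):
        counts = [sum(1 for x, c in zip(values, labels) if x == v and c == cls)
                  for cls in classes]
        intervals.append([v, counts])
    while len(intervals) > 1:
        chis = [chi2(intervals[i][1], intervals[i + 1][1]) for i in range(len(intervals) - 1)]
        i = int(np.argmin(chis))
        if chis[i] > threshold:
            break
        # Merge interval i+1 into interval i.
        intervals[i][1] = [a + b for a, b in zip(intervals[i][1], intervals[i + 1][1])]
        del intervals[i + 1]
    return [iv[0] for iv in intervals]  # lower boundaries of the resulting intervals

values = [1, 3, 7, 8, 9, 11, 23, 37, 39, 45, 46, 59]
labels = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "A", "A", "A"]
print(chimerge(values, labels, threshold=2.7))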
22. An Example
(Figure: merge results for χ² thresholds of 1.4 and 4.6)
23. Entropy-Based Discretization
- Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is

  E(S, T) = (|S1| / |S|) × Ent(S1) + (|S2| / |S|) × Ent(S2)

- The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization (see the sketch after this slide)
- The process is applied recursively to the partitions obtained until some stopping criterion is met, e.g., the information gain Ent(S) - E(S, T) falls below a threshold
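A small Python sketch of the binary split step above; the toy data and function names are assumptions, and a full discretizer would recurse on both partitions until the information gain drops below a threshold.

import math
from collections import Counter

def entropy(labels):
    """Class entropy Ent(S) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def best_split(values, labels):
    """Boundary T that minimizes the post-partition entropy E(S, T)."""
    pairs = sorted(zip(values, labels))
    best_t, best_e = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # a boundary must fall between distinct values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [c for v, c in pairs[:i]]
        right = [c for v, c in pairs[i:]]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e

values = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]
labels = ["low", "low", "low", "high", "high", "high"]
print(best_split(values, labels))  # boundary near 5.5 with post-partition entropy 0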
24. Effects of Discretization
- Experimental results indicate that with discretization
- data size can be reduced
- classification accuracy can be improved
25. Session 3: Summary
- Data preparation is a big issue for both warehousing and mining
- Need to consolidate research work conducted in different areas