Title: Data Mining and Data Warehousing
1. Data Mining and Data Warehousing
- Introduction
- OLAP and data warehousing
- Data preprocessing for mining and warehousing
- Concept description: characterization and discrimination
- Classification and prediction
- Clustering analysis
- Association analysis
- Mining complex data and advanced mining techniques
- Trends and research issues
2. Session 3: Data Preprocessing
- Data cleansing and integration
- Feature Selection
- Discretization
- Summary
3. Data Cleansing
- A data warehouse contains data that is analyzed for business decisions; knowledge discovered from the data will be used in the future
- Detecting data anomalies and rectifying them early has huge payoffs
4. Real-World Data Are Dirty
- Typical data quality problems
- incomplete data
- values of certain fields are missing
- duplicate data
- two records refer to the same real-world object
- inaccurate data
- values do not reflect the real-world situation
- inconsistent data
- relevant data records violate integrity constraints
5. Cleansing Dirty Data
- Remove duplicate records
- identifying duplicate records is not an easy task
- merge/purge approach
- Incomplete data
- drop data items that are incomplete
- fill in missing values (see the sketch after this slide)
- Inconsistent data
- correct using external references
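A minimal pandas sketch of the two cleansing steps above, dropping duplicates and filling missing values; the DataFrame and column names are made up for illustration, and a real merge/purge step needs fuzzy record matching rather than exact row comparison.

import pandas as pd

# Hypothetical customer records pulled from two operational sources.
records = pd.DataFrame({
    "customer_id": [101, 101, 102, 103],
    "name": ["Ann Lee", "Ann Lee", "Bo Chan", "Cy Dorn"],
    "age": [34, 34, None, 45],
    "city": ["Tokyo", "Tokyo", None, "Osaka"],
})

# Remove exact duplicate rows (a very simple stand-in for merge/purge).
records = records.drop_duplicates()

# Fill missing values: numeric fields with the column mean,
# categorical fields with a constant placeholder.
records["age"] = records["age"].fillna(records["age"].mean())
records["city"] = records["city"].fillna("unknown")

print(records)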
6. Data Integration
- Schema integration
- integrate metadata from different sources
- Entity identification problem
- identify real-world entities across multiple data sources
- Detecting and resolving data value conflicts (see the sketch after this slide)
- for the same real-world entity, attribute values from different sources differ
- possible reasons: different representations, different scales
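A small sketch of resolving a value conflict caused by different scales; the product record, field names, and tolerance are hypothetical.

# One (hypothetical) source reports product weight in pounds, another in kilograms.
POUNDS_PER_KG = 2.20462

source_a = {"sku": "X-100", "weight_kg": 2.5}
source_b = {"sku": "X-100", "weight_lb": 5.51}

# Normalize both to kilograms before comparing or merging.
weight_a = source_a["weight_kg"]
weight_b = source_b["weight_lb"] / POUNDS_PER_KG

# Treat the values as consistent if they agree within a small tolerance.
consistent = abs(weight_a - weight_b) < 0.01
print(weight_a, round(weight_b, 3), consistent)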
7. Session 3: Data Preprocessing
- Data cleansing and integration
- Feature Selection
- Discretization
- Summary
8. Feature Selection for Classification
- Given a data set with N features (attributes), feature selection can be defined as the process of selecting a minimum set of M features, where M ≤ N, such that the probability distribution of the different classes given the values of those M features is as close as possible to the original distribution given the values of all N features.
- If FN is the original feature set and FM is the selected feature set, then P(C | FM = fM) should be as close as possible to P(C | FN = fN) for every possible class C. Here fM and fN represent vectors of values of the respective feature vectors FM and FN. (A small example follows this slide.)
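To make the criterion concrete, here is a tiny Python sketch (the toy data and function name are made up) that estimates the class distribution conditioned on a feature subset; for the reduced subset {0} the conditional class distributions match those obtained with all features, so feature 1 can be dropped.

from collections import Counter, defaultdict

def class_distribution(samples, labels, subset):
    """Empirical P(C | selected features = their values) for each value combination."""
    groups = defaultdict(list)
    for row, label in zip(samples, labels):
        groups[tuple(row[i] for i in subset)].append(label)
    return {key: {c: n / len(g) for c, n in Counter(g).items()} for key, g in groups.items()}

# Toy data: feature 0 determines the class, feature 1 is redundant.
samples = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
labels  = ["yes", "yes", "no", "no"]

print(class_distribution(samples, labels, subset=[0]))     # same class structure...
print(class_distribution(samples, labels, subset=[0, 1]))  # ...as with all features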
9. Feature Selection Approaches
- Filter: acts as a preprocessing stage of the learning algorithm, removing the features that are not required for correct classification
- Wrapper: uses the learning algorithm itself as the evaluation function when choosing the required features
- any learning algorithm is biased; selecting features according to a particular algorithm amounts to changing the data to fit that algorithm
- time complexity is usually high
- may have problems with very large data sets
10. Feature Selection Algorithms -- Relief
- Proposed by Kira and Rendell, 1992 (a sketch follows this slide)
- Initialize the weights of all attributes to zero
- Randomly choose a tuple (instance) and find its near-hit and near-miss using a Euclidean distance measure
- Adjust the weight of each attribute by subtracting its squared difference to the near-hit and adding its squared difference to the near-miss
- Repeat the previous two steps N times and divide the weights by N
- Attributes whose weight exceeds a threshold are chosen as relevant
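A minimal NumPy sketch of the Relief procedure described above; the function name, synthetic data, and default parameters are illustrative assumptions, and real implementations also normalize attribute ranges before taking differences.

import numpy as np

def relief(X, y, n_iterations=100, threshold=0.0, rng=None):
    """Basic Relief weights for a two-class data set (illustrative sketch)."""
    rng = np.random.default_rng(rng)
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)

    for _ in range(n_iterations):
        i = rng.integers(n_samples)
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf  # exclude the chosen instance itself
        same = (y == y[i])
        near_hit = np.argmin(np.where(same, dists, np.inf))
        near_miss = np.argmin(np.where(~same, dists, np.inf))
        # Subtract squared differences to the near-hit, add them for the near-miss.
        weights += (X[i] - X[near_miss]) ** 2 - (X[i] - X[near_hit]) ** 2

    weights /= n_iterations
    return np.flatnonzero(weights > threshold)

# Tiny synthetic example: feature 0 separates the classes, feature 1 is noise.
X = np.array([[0.1, 5.0], [0.2, 1.0], [0.9, 4.8], [1.0, 1.2]])
y = np.array([0, 0, 1, 1])
print(relief(X, y, n_iterations=50, rng=0))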
11. Feature Selection Algorithms -- Relief
(Figure: a chosen instance shown with its near-hit and near-miss neighbors)
12. Selecting Features Based on Inconsistency
- Select a subset of attributes
- Examine whether any inconsistency exists (see the check sketched after this slide)
- two samples are inconsistent if they agree on all the selected attributes but do not agree on the class label
- Repeat the above until a subset without inconsistency is found
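A small sketch of the inconsistency check above; the counting rule (every sample beyond the majority class of each matching group counts as inconsistent) follows the inconsistency-rate definition on the next slide, and the toy data is made up.

from collections import defaultdict

def inconsistency_count(samples, labels, subset):
    """Count samples that share the selected attribute values but differ in class."""
    groups = defaultdict(list)
    for row, label in zip(samples, labels):
        groups[tuple(row[i] for i in subset)].append(label)
    count = 0
    for group in groups.values():
        # All but the most frequent class in a group are inconsistent.
        count += len(group) - max(group.count(c) for c in set(group))
    return count

# Illustrative data: attribute 0 alone cannot separate the classes.
samples = [("a", 1), ("a", 2), ("b", 1), ("b", 2)]
labels  = ["yes", "no", "no", "no"]
print(inconsistency_count(samples, labels, subset=[0]))     # > 0: subset {0} is inconsistent
print(inconsistency_count(samples, labels, subset=[0, 1]))  # 0: the full set is consistent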
13. Selecting Subsets for Testing Relevance
- Different ways of selecting subsets
- start from subsets of cardinality 1, then 2, ...
- systematically generate the possible subsets
- LVF (Liu and Setiono, 1996) -- see the sketch after this slide
- select subsets from all possible sets using a Las Vegas algorithm
- for each subset, compute the inconsistency rate (the number of inconsistent samples divided by the total number of samples)
- retain the subset with the minimum inconsistency rate as the best set
- repeat the above a pre-determined number of times; the best set retained is the final result
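A Python sketch of the LVF loop as described on this slide; the function names, the tie-breaking rule that favours smaller subsets, and the toy data are assumptions for illustration (the published LVF uses an allowed-inconsistency threshold rather than strict minimization).

import random
from collections import defaultdict

def inconsistency_rate(samples, labels, subset):
    """Inconsistent samples divided by total samples, for the selected attributes."""
    groups = defaultdict(list)
    for row, label in zip(samples, labels):
        groups[tuple(row[i] for i in subset)].append(label)
    inconsistent = sum(len(g) - max(g.count(c) for c in set(g)) for g in groups.values())
    return inconsistent / len(samples)

def lvf(samples, labels, n_attributes, max_tries=200, seed=0):
    """Randomly sample attribute subsets, keep the one with the lowest inconsistency rate."""
    rng = random.Random(seed)
    best = list(range(n_attributes))
    best_rate = inconsistency_rate(samples, labels, best)
    for _ in range(max_tries):
        subset = rng.sample(range(n_attributes), rng.randint(1, n_attributes))
        rate = inconsistency_rate(samples, labels, subset)
        # Prefer lower inconsistency; break ties in favour of fewer attributes.
        if rate < best_rate or (rate == best_rate and len(subset) < len(best)):
            best, best_rate = subset, rate
    return sorted(best), best_rate

samples = [("a", 1, "x"), ("a", 2, "x"), ("b", 1, "y"), ("b", 2, "y")]
labels  = ["yes", "no", "no", "no"]
print(lvf(samples, labels, n_attributes=3))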
14. Session 3: Data Preprocessing
- Data cleansing and integration
- Feature Selection
- Discretization
- Summary
15. Discretization
- Three types of attributes
- nominal -- values from an unordered set
- ordinal -- values from an ordered set
- continuous -- real numbers
- Discretization: divide the range of a continuous attribute into intervals
- Some classification algorithms only accept categorical attributes
- Reduces data size
- Prepares the data for further analysis
16. Static vs. Dynamic Discretization
- Dynamic discretization: some classification algorithms have a built-in mechanism to discretize continuous attributes
- Static discretization: a preprocessing step in the process of data mining (or other applications)
17. Simple Discretization Methods
- Equal-width (distance) intervals
- divide the range into N intervals of equal size
- if A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A) / N
- the interval boundaries are at A + W, A + 2W, ..., A + (N - 1)W
- Equal-frequency intervals
- divide the range into N intervals
- each interval contains approximately the same number of samples
- Both methods ignore the class information (see the sketch after this slide)
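A short NumPy sketch of the two unsupervised binning schemes above; the sample values and the choice of N are made up for illustration.

import numpy as np

values = np.array([2.0, 3.5, 4.1, 7.8, 8.0, 9.9, 15.2, 21.0])
N = 4  # number of intervals

# Equal-width: boundaries at A + W, A + 2W, ..., A + (N-1)W with W = (B - A) / N.
A, B = values.min(), values.max()
W = (B - A) / N
width_edges = A + W * np.arange(1, N)
width_bins = np.digitize(values, width_edges)

# Equal-frequency: boundaries at quantiles so each bin holds about len(values)/N samples.
freq_edges = np.quantile(values, [i / N for i in range(1, N)])
freq_bins = np.digitize(values, freq_edges)

print("equal-width bins:    ", width_bins)
print("equal-frequency bins:", freq_bins)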
18. ChiMerge (Kerber, 1992)
- Quality of discretization is hard to define
- ChiMerge's view
- relative class frequencies should be fairly consistent within an interval (otherwise it should be split)
- two adjacent intervals should not have similar relative class frequencies (otherwise they should be merged)
19. χ² Test and Discretization
- χ² is a statistical measure used to test the hypothesis that two discrete attributes are statistically independent
- For two adjacent intervals: if the χ² test concludes that the class is independent of the intervals, the intervals should be merged; if it concludes that they are not independent, i.e., the difference in relative class frequencies is statistically significant, the two intervals should remain separate
20. Computing χ²
- The value is computed as

  χ² = Σ (i = 1..2) Σ (j = 1..k) (Aij - Eij)² / Eij

  where
  k   = number of classes
  Aij = number of samples in the ith interval, jth class
  Eij = expected frequency of Aij = (Ri × Cj) / N
  Ri  = number of samples in the ith interval
  Cj  = number of samples in the jth class
  N   = total number of samples
21. ChiMerge -- The Algorithm
- Compute the χ² value for each pair of adjacent intervals
- Merge the pair of adjacent intervals with the lowest χ² value
- Repeat the previous two steps until the χ² values of all adjacent pairs exceed a threshold
- Threshold determined by the significance level and the degrees of freedom (number of classes - 1)
- (A sketch of the merging loop follows this slide.)
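A minimal Python sketch of the ChiMerge loop above; the chi2 helper follows the formula on the previous slide, while the sample data, the threshold value, and the choice to seed one interval per distinct value are illustrative assumptions.

import numpy as np

def chi2(counts_a, counts_b):
    """χ² statistic for two adjacent intervals given their per-class counts."""
    table = np.array([counts_a, counts_b], dtype=float)
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row * col / table.sum()
    expected[expected == 0] = 1e-9  # avoid division by zero for empty classes
    return float(((table - expected) ** 2 / expected).sum())

def chimerge(values, labels, threshold):
    """Merge adjacent intervals until every adjacent pair has χ² above the threshold."""
    classes = sorted(set(labels))
    # Start with one interval per distinct value, each holding per-class counts.
    intervals = []
    for v in sorted(set(values)):
        counts = [sum(1 for x, c in zip(values, labels) if x == v and c == cls)
                  for cls in classes]
        intervals.append([v, counts])
    while len(intervals) > 1:
        chis = [chi2(intervals[i][1], intervals[i + 1][1]) for i in range(len(intervals) - 1)]
        i = int(np.argmin(chis))
        if chis[i] > threshold:
            break
        # Merge interval i+1 into interval i.
        intervals[i][1] = [a + b for a, b in zip(intervals[i][1], intervals[i + 1][1])]
        del intervals[i + 1]
    return [iv[0] for iv in intervals]  # lower boundaries of the resulting intervals

values = [1, 3, 7, 8, 9, 11, 23, 37, 39, 45, 46, 59]
labels = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "A", "A", "A"]
print(chimerge(values, labels, threshold=2.7))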
22. An Example
(Figure: merge results for χ² thresholds of 1.4 and 4.6)
23. Entropy-Based Discretization
- Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is

  E(S, T) = (|S1| / |S|) × Ent(S1) + (|S2| / |S|) × Ent(S2)

- The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization (see the sketch after this slide)
- The process is applied recursively to the partitions obtained until some stopping criterion is met, e.g., the information gain Ent(S) - E(S, T) falls below a threshold
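A small Python sketch of the binary split step above; the toy data and function names are assumptions, and a full discretizer would recurse on both partitions until the information gain drops below a threshold.

import math
from collections import Counter

def entropy(labels):
    """Class entropy Ent(S) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def best_split(values, labels):
    """Boundary T that minimizes the post-partition entropy E(S, T)."""
    pairs = sorted(zip(values, labels))
    best_t, best_e = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # a boundary must fall between distinct values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [c for v, c in pairs[:i]]
        right = [c for v, c in pairs[i:]]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e

values = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]
labels = ["low", "low", "low", "high", "high", "high"]
print(best_split(values, labels))  # boundary near 5.5 with post-partition entropy 0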
24. Effects of Discretization
- Experimental results indicate that with discretization
- data size can be reduced
- classification accuracy can be improved
25. Session 3: Summary
- Data preparation is a big issue for both warehousing and mining
- Need to consolidate research work conducted in different areas