Title: The Knowledge Discovery Process; Data Preparation
1. The Knowledge Discovery Process: Data Preparation and Preprocessing
Bamshad Mobasher, DePaul University
2. The Knowledge Discovery Process
- The KDD Process
3. Steps in the KDD Process
- Learning the application domain
  - translate the business problem into a data mining problem
- Gathering and integrating data
- Cleaning and preprocessing data
  - may be the most resource-intensive part
- Reducing and selecting data
  - find useful features, dimensionality reduction, etc.
- Choosing the functions of data mining
  - summarization, classification, regression, association, clustering, etc.
- Choosing the mining algorithm(s)
- Data mining: discover patterns of interest, construct models
- Evaluating patterns and models
- Interpretation and analysis of results
  - visualization, alteration, removing redundant patterns, querying
- Use of discovered knowledge
4. Data Mining: What Kind of Data?
- Structured Databases
  - relational, object-relational, etc.
  - can use SQL to perform parts of the process
  - e.g., SELECT COUNT(*) FROM Items WHERE type = 'video' GROUP BY category
5. Data Mining: What Kind of Data?
- Flat Files
  - most common data source
  - can be text (or HTML) or binary
  - may contain transactions, statistical data, measurements, etc.
- Transactional Databases
  - a set of records, each with a transaction id, a time stamp, and a set of items
  - may have an associated description file for the items
  - typical source of data used in market basket analysis
6. Data Mining: What Kind of Data?
- Other Types of Databases
  - legacy databases
  - multimedia databases (usually very high-dimensional)
  - spatial databases (containing geographical information such as maps, satellite imaging data, etc.)
  - time series / temporal data (time-dependent information such as stock market data; usually very dynamic)
- World Wide Web
  - basically a large, heterogeneous, distributed database
  - need for new or additional tools and techniques
    - information retrieval, filtering, and extraction
    - agents to assist in browsing and filtering
    - Web content, usage, and structure (linkage) mining tools
7. Data Mining: What Kind of Data?
- Data Warehouses
  - a data warehouse is a repository of data collected from multiple (often heterogeneous) data sources, so that analysis can be done under the same unified schema
  - data from the different sources are loaded, cleaned, transformed, and integrated
  - allows for interactive analysis and decision-making
  - usually modeled by a multi-dimensional data structure (a data cube) to facilitate decision-making
  - aggregate values can be pre-computed and stored along many dimensions
  - each dimension of the data cube contains a hierarchy of values for one attribute
  - data cubes are well suited for fast interactive querying and analysis of data at different conceptual levels, known as On-Line Analytical Processing (OLAP)
  - OLAP operations allow navigation of the data at different levels of abstraction: drill-down, roll-up, slice, dice, etc.
8. DM Tasks: Classification
- Classifying observations/instances into given classes (i.e., classification is supervised)
  - example: classifying credit applicants as low, medium, or high risk
  - normally uses a training set where all objects are already associated with known class labels
  - the classification algorithm learns from the training set and builds a model
  - the model is used to classify new objects
- Suitable data mining tools
  - decision tree algorithms (CHAID, C5.0, and CART)
  - memory-based reasoning
  - neural networks
- Example (hypothetical video store)
  - build a model of users based on their rental history (returning on time, payments, etc.)
  - observe the current customer's rental and payment history
  - decide whether a deposit should be charged to the current customer
9. DM Tasks: Prediction
- Same as classification, except that instances are classified according to some predicted or estimated future value
- In prediction, historical data is used to build a (predictive) model that explains the currently observed behavior
- The model can then be applied to new instances to predict future behavior or forecast the future value of some missing attribute
- Examples
  - predicting the size of a balance transfer if the prospect accepts the offer
  - predicting the load on a Web server in a particular time period
- Suitable data mining tools
  - association rule discovery
  - memory-based reasoning
  - decision trees
  - neural networks
10. DM Tasks: Affinity Grouping
- Determine what items often go together (usually in transactional databases)
- Often referred to as Market Basket Analysis
  - used in retail for planning the arrangement of items on shelves
  - used for identifying cross-selling opportunities
  - can be used to determine the best link structure for a Web site
- Examples
  - people who buy milk and beer also tend to buy diapers
  - people who access pages A and B are likely to place an online order
- Suitable data mining tools
  - association rule discovery
  - clustering
  - nearest-neighbor analysis (memory-based reasoning)
11. DM Tasks: Clustering
- Like classification, clustering is the organization of data into classes
  - however, the class labels are unknown, and it is up to the clustering algorithm to discover acceptable classes
  - also called unsupervised classification, because the classification is not dictated by given class labels
- There are many clustering approaches
  - all are based on the principle of maximizing the similarity between objects in the same class (intra-class similarity) and minimizing the similarity between objects of different classes (inter-class similarity)
- Examples
  - market segmentation of customers based on buying patterns and demographic attributes
  - grouping user transactions on a Web site based on their navigational patterns
12. Characterization and Discrimination
- Data characterization is a summarization of the general features of objects in a target class
  - the data relevant to the target class are retrieved by a database query and run through a summarization module to extract the essence of the data at different levels of abstraction
  - example: characterize the video store's customers who regularly rent more than 30 movies a year
- Data discrimination compares the general features of objects between two classes
  - compares the relevant features of objects between a target class and a contrasting class
  - example: compare the general characteristics of the customers who rented more than 30 movies in the last year with those who rented fewer than 5
- The techniques used for data discrimination are very similar to those used for data characterization, except that data discrimination results include comparative measures
13. Example: Moviegoer Database
14. Example: Moviegoer Database
SELECT moviegoers.name, moviegoers.sex, moviegoers.age, sources.source, movies.name
FROM movies, sources, moviegoers
WHERE sources.source_ID = moviegoers.source_ID
  AND movies.movie_ID = moviegoers.movie_ID
ORDER BY moviegoers.name
15. Example: Moviegoer Database
- Classification
  - determine sex based on age, source, and movies seen
  - determine source based on sex, age, and movies seen
  - determine the most recent movie based on past movies, age, sex, and source
- Prediction or Estimation
  - for prediction, we need a continuous variable (e.g., age)
  - predict age as a function of source, sex, and past movies
  - if we had a rating field for each moviegoer, we could predict the rating a new moviegoer gives to a movie based on age, sex, past movies, etc.
- Clustering
  - find groupings of movies that are often seen by the same people
  - find groupings of people that tend to see the same movies
  - clustering might reveal relationships that are not necessarily recorded in the data (e.g., we may find a cluster that is dominated by people with young children, or a cluster of movies that correspond to a particular genre)
16. Example: Moviegoer Database
- Affinity Grouping
  - market basket analysis (MBA): which movies go together?
  - need to create a transaction for each moviegoer containing the movies seen by that moviegoer
  - may result in association rules such as
    - {Phenomenon, The Birdcage} => {Trainspotting}
    - {Trainspotting, The Birdcage} => {sex = f}
- Sequence Analysis
  - similar to MBA, but the order in which items appear in the pattern is important
  - e.g., people who rent The Birdcage during a visit tend to rent Trainspotting on the next visit
17. The Knowledge Discovery Process
- The KDD Process
18. Data Preprocessing
- Why do we need to prepare the data?
  - in real-world applications data can be inconsistent, incomplete, and/or noisy
    - data entry, data transmission, or data collection problems
    - discrepancies in naming conventions
    - duplicated records
    - incomplete data
    - contradictions in the data
- What happens when the data cannot be trusted?
  - can the decisions be trusted? Decision making is jeopardized
  - there is a better chance of discovering useful knowledge when the data is clean
19. Data Preprocessing
20. Data Preprocessing
- Data Cleaning
- Data Integration
- Data Transformation (e.g., normalizing -2, 32, 100, 59, 48 to -0.02, 0.32, 1.00, 0.59, 0.48)
- Data Reduction
21. Data Cleaning
- Real-world application data can be incomplete, noisy, and inconsistent
  - no recorded values for some attributes
  - not considered at time of entry
  - random errors
  - irrelevant records or fields
- Data cleaning attempts to
  - fill in missing values
  - smooth out noisy data
  - correct inconsistencies
  - remove irrelevant data
22. Dealing with Missing Values
- Data is not always available (missing attribute values in records)
  - equipment malfunction
  - deleted due to inconsistency or misunderstanding
  - not considered important at the time of data gathering
- Handling missing data
  - ignore the record with missing values
  - fill in the missing values manually
  - use a global constant to fill in missing values (NULL, unknown, etc.)
  - use the attribute's mean value to fill in missing values of that attribute
  - use the attribute mean over all samples belonging to the same class to fill in the missing values
  - infer the most probable value to fill in the missing value
    - may need methods such as Bayesian classification or decision trees to automatically infer missing attribute values
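The two mean-based strategies above can be sketched as follows; the records and class labels here are hypothetical stand-ins, with None marking a missing value.

```python
from statistics import mean

# Hypothetical records of the form (class label, attribute value);
# None marks a missing value.
records = [("yes", 20.0), ("yes", None), ("yes", 24.0),
           ("no", 30.0), ("no", 34.0), ("no", None)]

# Strategy 1: fill with the overall attribute mean.
overall_mean = mean(v for _, v in records if v is not None)
filled_overall = [(c, overall_mean if v is None else v) for c, v in records]

# Strategy 2: fill with the attribute mean of samples in the same class.
labels = {c for c, _ in records}
class_means = {c: mean(v for c2, v in records if c2 == c and v is not None)
               for c in labels}
filled_by_class = [(c, class_means[c] if v is None else v) for c, v in records]

print(filled_overall[1])   # ('yes', 27.0)
print(filled_by_class[1])  # ('yes', 22.0)
```

Note how the class-conditional fill (22.0 for class "yes") stays closer to the values actually observed in that class than the global mean (27.0).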
23. Smoothing Noisy Data
- The purpose of data smoothing is to eliminate noise

Binning:
- Original data for price (after sorting): 4, 8, 15, 21, 21, 24, 25, 28, 34
- Partition into equi-depth bins: Bin1: 4, 8, 15; Bin2: 21, 21, 24; Bin3: 25, 28, 34
- Smoothing by bin means: each value in a bin is replaced by the mean value of the bin
  - means: Bin1: 9, 9, 9; Bin2: 22, 22, 22; Bin3: 29, 29, 29
- Smoothing by bin boundaries: the min and max values in each bin are identified (the boundaries), and each value in a bin is replaced with the closest boundary value
  - boundaries: Bin1: 4, 4, 15; Bin2: 21, 21, 24; Bin3: 25, 25, 34
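The binning example above can be reproduced with a short sketch; the data are the price values from the slide.

```python
# Equi-depth binning with the two smoothing rules from the example.
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3
bins = [data[i:i + bin_size] for i in range(0, len(data), bin_size)]

# Smoothing by bin means: every value becomes its bin's mean.
by_means = [[sum(b) // len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: every value becomes the closer of min/max.
by_boundaries = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
                 for b in bins]

print(by_means)       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_boundaries)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```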
24. Smoothing Noisy Data
- Clustering: similar values are organized into groups (clusters); values falling outside of clusters may be considered outliers and become candidates for elimination
- Regression: fit the data to a function; linear regression finds the best line to fit two variables, and multiple regression can handle multiple variables; the values given by the function are used instead of the original values
25. Smoothing Noisy Data - Example
- We want to smooth the Temperature attribute using bin means, with bins of size 3
  - first, sort the values of the attribute (keep track of the ID or key so that the transformed values can be replaced in the original table)
  - divide the data into bins of size 3 (the last bin may be smaller)
  - convert the values in each bin to the mean value for that bin
  - put the resulting values into the original table
26. Smoothing Noisy Data - Example
The value of every record in each bin is changed to the mean value for that bin. If it is necessary to keep the value as an integer, the mean values are rounded to the nearest integer.
27. Smoothing Noisy Data - Example
The final table with the new values for the Temperature attribute.
28. Data Integration
- Data analysis may require combining data from multiple sources into a coherent data store
- Challenges in data integration
  - schema integration (e.g., CID, C_number, and Cust-id naming the same attribute in different sources)
  - semantic heterogeneity
  - data value conflicts (different representations, scales, etc.)
  - synchronization (especially important in Web usage mining)
  - redundant attributes (an attribute is redundant if it can be derived from other attributes); redundancies may be identified via correlation analysis
- Meta-data is often necessary for successful data integration
- Correlation analysis: corr(A, B) = Pr(A, B) / (Pr(A) x Pr(B))
  - = 1: independent; > 1: positive correlation; < 1: negative correlation
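The correlation (lift) measure above can be sketched on a small hypothetical set of binary observations, where each row records whether events A and B occurred.

```python
# Hypothetical binary data: (A occurred, B occurred) per transaction.
rows = [(1, 1), (1, 1), (1, 0), (0, 1), (0, 0), (1, 1)]
n = len(rows)

pr_a = sum(a for a, _ in rows) / n
pr_b = sum(b for _, b in rows) / n
pr_ab = sum(1 for a, b in rows if a and b) / n

# corr > 1: positive correlation; == 1: independent; < 1: negative.
corr = pr_ab / (pr_a * pr_b)
print(round(corr, 3))  # 1.125
```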
29. Normalization
- Min-max normalization: a linear transformation from v to v'
  - v' = (v - min) / (max - min) x (newmax - newmin) + newmin
  - example: transform 30000 in the range 10000..45000 into 0..1:
    v' = (30000 - 10000) / 35000 x (1 - 0) + 0 = 0.571
- z-score normalization: normalize v into v' based on the attribute's mean and standard deviation
  - v' = (v - Mean) / StandardDeviation
- Normalization by decimal scaling
  - moves the decimal point of v by j positions, where j is the minimum number of positions such that the absolute maximum value falls in 0..1
  - v' = v / 10^j
  - example: if v ranges between -56 and 9976, then j = 4, and v' ranges between -0.0056 and 0.9976
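The three normalization schemes can be sketched directly from the formulas above, using the slide's own example values.

```python
# Min-max normalization: v' = (v - min)/(max - min) * (newmax - newmin) + newmin
def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

# z-score normalization: v' = (v - mean) / stdev
def z_score(v, mean, stdev):
    return (v - mean) / stdev

# Decimal scaling: find the smallest j such that max(|v|) / 10**j is in [0, 1].
def decimal_scaling(values):
    j = 0
    while max(abs(v) for v in values) / 10 ** j > 1:
        j += 1
    return [v / 10 ** j for v in values]

print(round(min_max(30000, 10000, 45000), 3))  # 0.571
print(decimal_scaling([-56, 9976]))            # [-0.0056, 0.9976]
```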
30. Normalization Example
- z-score normalization: v' = (v - Mean) / Stdev
- Example: normalizing the Humidity attribute, with Mean = 80.3 and Stdev = 9.84
31. Normalization Example II
- Min-max normalization on an employee database
  - max distance (range) for salary: 100000 - 19000 = 81000
  - max distance (range) for age: 52 - 27 = 25
  - new min for age and salary = 0; new max for age and salary = 1
32. Data Reduction
- Data is often too large; reducing the data can improve performance
- Data reduction consists of reducing the representation of the data set while producing the same (or almost the same) results
- Data reduction includes
  - data cube aggregation
  - dimensionality reduction
  - discretization
  - numerosity reduction
    - regression
    - histograms
    - clustering
    - sampling
33. Data Cube Aggregation
- Reduce the data to the concept level needed in the analysis
- Use the smallest (most detailed) level necessary to solve the problem
- Queries regarding aggregated information should be answered using the data cube when possible
34. Dimensionality Reduction
- Feature selection (i.e., attribute subset selection)
  - select only the necessary attributes
  - the goal is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes
  - there are an exponential number of possibilities, so use heuristics: select the locally best (most pertinent) attribute, e.g., using information gain
- Heuristic methods
  - step-wise forward selection: {A1} -> {A1, A3} -> {A1, A3, A5}
  - step-wise backward elimination: {A1, A2, A3, A4, A5} -> {A1, A3, A4, A5} -> {A1, A3, A5}
  - combining forward selection and backward elimination
  - decision-tree induction
35. Decision Tree Induction
36. Numerosity Reduction
- Reduction via histograms
  - divide the data into buckets and store a representation of each bucket (sum, count, etc.)
- Reduction via clustering
  - partition the data into clusters based on closeness in space
  - retain representatives of the clusters (centroids) and the outliers
- Reduction via sampling
  - will the patterns in the sample represent the patterns in the data?
  - random sampling can produce poor results
  - stratified sampling (a stratum is a group based on an attribute value)
37Sampling Techniques
SRSWOR (simple random sample without
replacement)
SRSWR
Raw Data
Cluster/Stratified Sample
Raw Data
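A minimal sketch of the three sampling schemes, on hypothetical records keyed 1..100 (the odd/even strata are an illustrative assumption, not from the slides):

```python
import random

random.seed(42)
data = list(range(1, 101))  # hypothetical record keys

# SRSWOR: simple random sample without replacement (no duplicates possible).
srswor = random.sample(data, 10)

# SRSWR: simple random sample with replacement (records may repeat).
srswr = [random.choice(data) for _ in range(10)]

# Stratified sample: draw from each stratum (here, odd vs. even keys)
# so that every stratum is represented.
strata = {"odd": [x for x in data if x % 2],
          "even": [x for x in data if x % 2 == 0]}
stratified = [random.choice(group) for group in strata.values()
              for _ in range(5)]

print(len(srswor), len(set(srswor)))  # 10 10 (all distinct)
```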
38. Discretization
- Three types of attributes
  - nominal: values from an unordered set (also called categorical attributes)
  - ordinal: values from an ordered set
  - continuous: real numbers (but sometimes also integer values)
- Discretization is used to reduce the number of values for a given continuous attribute
  - usually done by dividing the range of the attribute into intervals
  - interval labels are then used to replace the actual data values
- Some data mining algorithms only accept categorical attributes and cannot handle a range of continuous attribute values
- Discretization can also be used to generate concept hierarchies
  - reduce the data by collecting and replacing low-level concepts (e.g., numeric values for age) with higher-level concepts (e.g., young, middle-aged, old)
39. Discretization - Example
- Example: discretizing the Humidity attribute using 3 bins
  - Low: 60-69; Normal: 70-79; High: 80 and above
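The 3-bin mapping above can be sketched as a simple lookup; the sample readings are hypothetical.

```python
# Map a continuous Humidity value into the 3 interval labels above.
def discretize_humidity(value):
    if value < 70:
        return "Low"     # 60-69
    if value < 80:
        return "Normal"  # 70-79
    return "High"        # 80 and above

readings = [65, 70, 75, 80, 85, 96]
print([discretize_humidity(v) for v in readings])
# ['Low', 'Normal', 'Normal', 'High', 'High', 'High']
```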
40. Converting Categorical Attributes to Numerical Attributes
- Attributes: Outlook (overcast, rain, sunny), Temperature (real), Humidity (real), Windy (true, false)
- Standard spreadsheet format: create a separate column for each value of a categorical attribute (e.g., 3 columns for the values of the Outlook attribute and two for the values of the Windy attribute); there is no change to the numerical attributes
41. Visualizing Patterns

                     Windy   Not Windy
Outlook = sunny        2         3
Outlook = rain         2         3
Outlook = overcast     2         2
42. Evaluating Models
- To train and evaluate models, the data are often divided into three sets: the training set, the test set, and the evaluation set
- Training set
  - used to build the initial model
  - may need to enrich the data to get enough of the special cases
- Test set
  - used to adjust the initial model
  - models can be tweaked to be less idiosyncratic to the training data and adapted into a more general model
  - the idea is to prevent over-training (i.e., finding patterns where none exist)
- Evaluation set
  - used to evaluate the model's performance
43. Training Sets
- The training set is used to train the models
  - most important consideration: it needs to cover the full range of values for all the features that the model might encounter
  - it is a good idea to have several examples for each value of a categorical feature, and for a range of values of each numerical feature
- Data enrichment
  - the training set should have a sufficient number of examples of rare events
  - a random sample of the available data is not sufficient, since common examples will swamp the rare ones
  - the training set needs to over-sample the rare cases so that the model can learn the features of rare events instead of predicting everything to be the common output
44. Test and Evaluation Sets
- Reading too much into the training set (overfitting)
  - a common problem with most data mining algorithms
  - the resulting model works well on the training set but performs miserably on unseen data
  - the test set can be used to tweak the initial model and to remove unnecessary inputs or features
- Evaluation set for final performance evaluation
  - the test set cannot be used to evaluate model performance on future unseen data
  - the error rate on the evaluation set is a good predictor of the error rate on unseen data
- Insufficient data to divide into three disjoint sets?
  - validation techniques can play a major role
    - cross validation
    - bootstrap validation
45. Cross Validation
- Cross validation is a heuristic that works as follows
  - randomly divide the data into n folds, each with approximately the same number of records
  - create n models using the same algorithm and training parameters; each model is trained on n-1 folds of the data and tested on the remaining fold
  - can be used to find the best algorithm and its optimal training parameters
- Steps in cross validation
  1. Divide the available data into a training set and an evaluation set
  2. Split the training data into n folds
  3. Select an architecture (algorithm) and training parameters
  4. Train and test the n models
  5. Repeat steps 2 to 4 using different architectures and parameters
  6. Select the best model
  7. Use all the training data to train the model
  8. Assess the final model using the evaluation set
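The train-on-(n-1)-folds, test-on-one loop can be sketched as below. The data set and the trivial majority-class "model" are hypothetical stand-ins for a real learner, kept simple so the fold mechanics stand out.

```python
import random
from statistics import mean

random.seed(0)
# Hypothetical labeled records: pos for values above 50, neg otherwise.
data = [(x, "pos" if x > 50 else "neg") for x in range(100)]
random.shuffle(data)

n_folds = 5
folds = [data[i::n_folds] for i in range(n_folds)]  # 5 folds of 20 records

accuracies = []
for i in range(n_folds):
    test = folds[i]
    train = [rec for j, fold in enumerate(folds) if j != i for rec in fold]
    # "Train": pick the majority class seen in the n-1 training folds.
    majority = max(["pos", "neg"],
                   key=lambda c: sum(1 for _, y in train if y == c))
    # "Test": score the model on the held-out fold.
    accuracies.append(mean(1 if y == majority else 0 for _, y in test))

print(round(mean(accuracies), 2))  # cross-validated accuracy estimate
```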
46. Bootstrap Validation
- Based on the statistical procedure of sampling with replacement
  - a data set of n instances is sampled n times (with replacement) to give another data set of n instances
  - since some elements will be repeated, there will be elements in the original data set that are not picked
  - these remaining instances are used as the test set
- How many instances end up in the test set?
  - probability of not getting picked in one sampling: 1 - 1/n
  - Pr(not getting picked in n samples) = (1 - 1/n)^n ~ e^-1 = 0.368
  - so, for a large data set, the test set will contain about 36.8% of the instances
  - to compensate for the smaller training sample (63.2%), the test-set error rate is combined with the re-substitution error on the training set:
    e = 0.632 x e(test instances) + 0.368 x e(training instances)
- Bootstrap validation increases the variance that can occur in each fold
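The ~36.8% figure can be checked empirically with a quick simulation: draw n instances with replacement and count how many of the original n were never picked.

```python
import random

random.seed(1)
n = 10000  # number of instances in the (hypothetical) data set

# One bootstrap sample: n draws with replacement from the n instance indices.
bootstrap_sample = [random.randrange(n) for _ in range(n)]

# Instances never drawn form the test ("out-of-bag") set.
never_picked = n - len(set(bootstrap_sample))

print(round(never_picked / n, 3))  # close to 1/e ~ 0.368 for large n
```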
47. Measuring Effectiveness
- Predictive models are measured based on the accuracy of their predictions on unseen data
- Classification and prediction tasks
  - accuracy is measured in terms of error rate (usually the % of records classified incorrectly)
  - the error rate on a pre-classified evaluation set estimates the real error rate
  - can also use the cross-validation methods discussed before
- Estimation effectiveness
  - the difference between the predicted scores and the actual results (from the evaluation set)
  - the accuracy of the model is measured in terms of variance (i.e., the average of the squared differences)
  - to express this in terms of the original units, compute the standard deviation (i.e., the square root of the variance)
48. Ordinal or Nominal Outputs
- When the output field is ordinal or nominal (e.g., in two-class prediction), we use the classification table, the so-called confusion matrix, to evaluate the resulting model
- Example (rows: actual class; columns: predicted class)
  - overall correct classification rate: (18 + 15) / 38 = 87%
  - given T, correct classification rate: 18 / 20 = 90%
  - given F, correct classification rate: 15 / 18 = 83%
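The rates in the example can be recomputed from the confusion matrix; the off-diagonal counts (2 and 3) are inferred here from the example's totals (20 actual T, 18 actual F).

```python
# (actual, predicted) -> count; off-diagonal counts inferred from the totals.
matrix = {("T", "T"): 18, ("T", "F"): 2,   # actual T: 18 correct, 2 wrong
          ("F", "F"): 15, ("F", "T"): 3}   # actual F: 15 correct, 3 wrong

total = sum(matrix.values())
overall = (matrix[("T", "T")] + matrix[("F", "F")]) / total
rate_t = matrix[("T", "T")] / (matrix[("T", "T")] + matrix[("T", "F")])
rate_f = matrix[("F", "F")] / (matrix[("F", "F")] + matrix[("F", "T")])

print(f"{overall:.0%} {rate_t:.0%} {rate_f:.0%}")  # 87% 90% 83%
```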
49. Measuring Effectiveness
- Market Basket Analysis
  - MBA may be used for estimation or prediction (e.g., people who buy milk and diapers tend to also buy beer)
  - confidence: the percentage of transactions containing milk and diapers that also contain beer (i.e., a conditional probability)
  - support: the percentage of transactions in the whole training set in which milk, diapers, and beer occur together (i.e., a prior probability)
- Distance-Based Techniques
  - clustering and memory-based reasoning: a measure of distance is used to evaluate the "closeness" or similarity of a point to cluster centers or to a neighborhood
  - regression: the line of best fit minimizes the sum of the distances between the line and the observations
  - for numeric variables, the Euclidean distance measure is often used (the square root of the sum of the squares of the differences along each dimension)
50Measuring Effectiveness Lift Charts
- usually used for classification, but can be
adopted to other methods - measure change in conditional probability of a
target class when going from the general
population (full test set) to a biased sample - Example
- suppose expected response rate to a direct
mailing campaign is 5 in the training set - use classifier to assign yes or no value to a
target class predicted to respond - the yes group will contain a higher proportion
of actual responders than the test set - suppose the yes group (our biased sample)
contains 50 actual responders - this gives lift of 10 0.5 / 0.05
- What if the lift sample is too small
- need to increase the sample size
- trade-off between lift and sample size
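The lift computation from the example can be sketched on hypothetical data constructed to match the stated rates (5% baseline responders, 50% responders in the model's "yes" group):

```python
# (predicted group, actually responded) per prospect; counts chosen so that
# 100 of 2000 respond overall (5%) and 50 of the 100 "yes" prospects respond.
population = ([("yes", True)] * 50 + [("yes", False)] * 50
              + [("no", True)] * 50 + [("no", False)] * 1850)

baseline = sum(1 for _, responded in population if responded) / len(population)
yes_group = [responded for group, responded in population if group == "yes"]
biased = sum(yes_group) / len(yes_group)

# Lift = response rate in the biased sample / baseline response rate.
print(baseline, biased, biased / baseline)  # 0.05 0.5 10.0
```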