Title: The Knowledge Discovery Process; Data Preparation
1. The Knowledge Discovery Process: Data Preparation and Preprocessing
Bamshad Mobasher, DePaul University
2. The Knowledge Discovery Process
- The KDD Process
3. Steps in the KDD Process
- Learning the application domain
  - translate the business problem into a data mining problem
- Gathering and integrating data
- Cleaning and preprocessing data
  - may be the most resource-intensive part
- Reducing and selecting data
  - find useful features, dimensionality reduction, etc.
- Choosing the functions of data mining
  - summarization, classification, regression, association, clustering, etc.
- Choosing the mining algorithm(s)
- Data mining: discover patterns of interest, construct models
- Evaluating patterns and models
- Interpretation and analysis of results
  - visualization, alteration, removing redundant patterns, querying
- Use of discovered knowledge
4. Data Mining: What Kind of Data?
- Structured Databases
  - relational, object-relational, etc.
  - can use SQL to perform parts of the process
  - e.g., SELECT COUNT(*) FROM Items WHERE type = 'video' GROUP BY category
5. Data Mining: What Kind of Data?
- Flat Files
  - most common data source
  - can be text (or HTML) or binary
  - may contain transactions, statistical data, measurements, etc.
- Transactional Databases
  - a set of records, each with a transaction id, a time stamp, and a set of items
  - may have an associated description file for the items
  - typical source of data used in market basket analysis
6. Data Mining: What Kind of Data?
- Other Types of Databases
  - legacy databases
  - multimedia databases (usually very high-dimensional)
  - spatial databases (containing geographical information such as maps, satellite imaging data, etc.)
  - time series / temporal data (time-dependent information such as stock market data; usually very dynamic)
- World Wide Web
  - basically a large, heterogeneous, distributed database
  - need for new or additional tools and techniques
    - information retrieval, filtering, and extraction
    - agents to assist in browsing and filtering
    - Web content, usage, and structure (linkage) mining tools
7. Data Mining: What Kind of Data?
- Data Warehouses
  - a data warehouse is a repository of data collected from multiple (often heterogeneous) data sources, so that analysis can be done under the same unified schema
  - data from the different sources are loaded, cleaned, transformed, and integrated
  - allows for interactive analysis and decision-making
  - usually modeled by a multi-dimensional data structure (a data cube) to facilitate decision-making
  - aggregate values can be pre-computed and stored along many dimensions
  - each dimension of the data cube contains a hierarchy of values for one attribute
  - data cubes are well suited for fast interactive querying and analysis of data at different conceptual levels, known as On-Line Analytical Processing (OLAP)
  - OLAP operations allow navigation of the data at different levels of abstraction: drill-down, roll-up, slice, dice, etc.
8. DM Tasks: Classification
- Classifying observations/instances into given classes (i.e., classification is supervised)
  - example: classifying credit applicants as low, medium, or high risk
  - normally uses a training set where all objects are already associated with known class labels
  - the classification algorithm learns from the training set and builds a model
  - the model is used to classify new objects
- Suitable data mining tools
  - decision tree algorithms (CHAID, C5.0, and CART)
  - memory-based reasoning
  - neural networks
- Example (hypothetical video store)
  - build a model of users based on their rental history (returning on time, payments, etc.)
  - observe the current customer's rental and payment history
  - decide whether a deposit should be charged to the current customer
9. DM Tasks: Prediction
- Same as classification, except that instances are classified according to some predicted or estimated future value
- In prediction, historical data is used to build a (predictive) model that explains the currently observed behavior
- The model can then be applied to new instances to predict future behavior or forecast the future value of some missing attribute
- Examples
  - predicting the size of a balance transfer if the prospect accepts the offer
  - predicting the load on a Web server in a particular time period
- Suitable data mining tools
  - association rule discovery
  - memory-based reasoning
  - decision trees
  - neural networks
10. DM Tasks: Affinity Grouping
- Determine what items often go together (usually in transactional databases)
- Often referred to as Market Basket Analysis
  - used in retail for planning the arrangement of items on shelves
  - used for identifying cross-selling opportunities
  - can be used to determine the best link structure for a Web site
- Examples
  - people who buy milk and beer also tend to buy diapers
  - people who access pages A and B are likely to place an online order
- Suitable data mining tools
  - association rule discovery
  - clustering
  - nearest-neighbor analysis (memory-based reasoning)
11. DM Tasks: Clustering
- Like classification, clustering is the organization of data into classes
  - however, the class labels are unknown, and it is up to the clustering algorithm to discover acceptable classes
  - also called unsupervised classification, because the classification is not dictated by given class labels
- There are many clustering approaches
  - all are based on the principle of maximizing the similarity between objects in the same class (intra-class similarity) and minimizing the similarity between objects of different classes (inter-class similarity)
- Examples
  - market segmentation of customers based on buying patterns and demographic attributes
  - grouping user transactions on a Web site based on their navigational patterns
12. Characterization and Discrimination
- Data characterization is a summarization of the general features of objects in a target class
  - the data relevant to the target class are retrieved by a database query and run through a summarization module to extract the essence of the data at different levels of abstraction
  - example: characterize the video store's customers who regularly rent more than 30 movies a year
- Data discrimination compares the general features of objects between two classes
  - compares the relevant features of objects between a target class and a contrasting class
  - example: compare the general characteristics of the customers who rented more than 30 movies in the last year with those who rented fewer than 5
- The techniques used for data discrimination are very similar to those used for data characterization, except that data discrimination results include comparative measures
13. Example: Moviegoer Database
14. Example: Moviegoer Database
SELECT moviegoers.name, moviegoers.sex, moviegoers.age, sources.source, movies.name
FROM movies, sources, moviegoers
WHERE sources.source_ID = moviegoers.source_ID
  AND movies.movie_ID = moviegoers.movie_ID
ORDER BY moviegoers.name
15. Example: Moviegoer Database
- Classification
  - determine sex based on age, source, and movies seen
  - determine source based on sex, age, and movies seen
  - determine the most recent movie based on past movies, age, sex, and source
- Prediction or Estimation
  - for prediction, we need a continuous variable (e.g., age)
  - predict age as a function of source, sex, and past movies
  - if we had a rating field for each moviegoer, we could predict the rating a new moviegoer gives to a movie based on age, sex, past movies, etc.
- Clustering
  - find groupings of movies that are often seen by the same people
  - find groupings of people that tend to see the same movies
  - clustering might reveal relationships that are not necessarily recorded in the data (e.g., we may find a cluster that is dominated by people with young children, or a cluster of movies that correspond to a particular genre)
16. Example: Moviegoer Database
- Affinity Grouping
  - market basket analysis (MBA): which movies go together?
  - need to create a transaction for each moviegoer containing the movies seen by that moviegoer
  - may result in association rules such as
    - {Phenomenon, The Birdcage} => {Trainspotting}
    - {Trainspotting, The Birdcage} => {sex = f}
- Sequence Analysis
  - similar to MBA, but the order in which items appear in the pattern is important
  - e.g., people who rent The Birdcage during a visit tend to rent Trainspotting on the next visit
17. The Knowledge Discovery Process
- The KDD Process
18. Data Preprocessing
- Why do we need to prepare the data?
  - in real-world applications data can be inconsistent, incomplete, and/or noisy
    - data entry, data transmission, or data collection problems
    - discrepancies in naming conventions
    - duplicated records
    - incomplete data
    - contradictions in the data
- What happens when the data cannot be trusted?
  - can the decisions be trusted? Decision making is jeopardized
  - there is a better chance of discovering useful knowledge when the data is clean
19. Data Preprocessing
20. Data Preprocessing
- Data Cleaning
- Data Integration
- Data Transformation (e.g., normalizing -2, 32, 100, 59, 48 to -0.02, 0.32, 1.00, 0.59, 0.48)
- Data Reduction
21. Data Cleaning
- Real-world application data can be incomplete, noisy, and inconsistent
  - no recorded values for some attributes
  - not considered at time of entry
  - random errors
  - irrelevant records or fields
- Data cleaning attempts to
  - fill in missing values
  - smooth out noisy data
  - correct inconsistencies
  - remove irrelevant data
22. Dealing with Missing Values
- Data is not always available (missing attribute values in records)
  - equipment malfunction
  - deleted due to inconsistency or misunderstanding
  - not considered important at the time of data gathering
- Handling missing data
  - ignore the record with missing values
  - fill in the missing values manually
  - use a global constant to fill in missing values (NULL, unknown, etc.)
  - use the attribute's mean value to fill in missing values of that attribute
  - use the attribute mean over all samples belonging to the same class to fill in the missing values
  - infer the most probable value to fill in the missing value
    - may need methods such as Bayesian classification or decision trees to automatically infer missing attribute values
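The two mean-based strategies above can be sketched as follows; the records and class labels here are hypothetical stand-ins, with None marking a missing value.

```python
from statistics import mean

# Hypothetical records of the form (class label, attribute value);
# None marks a missing value.
records = [("yes", 20.0), ("yes", None), ("yes", 24.0),
           ("no", 30.0), ("no", 34.0), ("no", None)]

# Strategy 1: fill with the overall attribute mean.
overall_mean = mean(v for _, v in records if v is not None)
filled_overall = [(c, overall_mean if v is None else v) for c, v in records]

# Strategy 2: fill with the attribute mean of samples in the same class.
labels = {c for c, _ in records}
class_means = {c: mean(v for c2, v in records if c2 == c and v is not None)
               for c in labels}
filled_by_class = [(c, class_means[c] if v is None else v) for c, v in records]

print(filled_overall[1])   # ('yes', 27.0)
print(filled_by_class[1])  # ('yes', 22.0)
```

Note how the class-conditional fill (22.0 for class "yes") stays closer to the values actually observed in that class than the global mean (27.0).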
23. Smoothing Noisy Data
- The purpose of data smoothing is to eliminate noise

Binning:
- Original data for price (after sorting): 4, 8, 15, 21, 21, 24, 25, 28, 34
- Partition into equi-depth bins: Bin1: 4, 8, 15; Bin2: 21, 21, 24; Bin3: 25, 28, 34
- Smoothing by bin means: each value in a bin is replaced by the mean value of the bin
  - means: Bin1: 9, 9, 9; Bin2: 22, 22, 22; Bin3: 29, 29, 29
- Smoothing by bin boundaries: the min and max values in each bin are identified (the boundaries), and each value in a bin is replaced with the closest boundary value
  - boundaries: Bin1: 4, 4, 15; Bin2: 21, 21, 24; Bin3: 25, 25, 34
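The binning example above can be reproduced with a short sketch; the data are the price values from the slide.

```python
# Equi-depth binning with the two smoothing rules from the example.
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3
bins = [data[i:i + bin_size] for i in range(0, len(data), bin_size)]

# Smoothing by bin means: every value becomes its bin's mean.
by_means = [[sum(b) // len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: every value becomes the closer of min/max.
by_boundaries = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
                 for b in bins]

print(by_means)       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_boundaries)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```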
24. Smoothing Noisy Data
- Clustering: similar values are organized into groups (clusters); values falling outside of clusters may be considered outliers and become candidates for elimination
- Regression: fit the data to a function; linear regression finds the best line to fit two variables, and multiple regression can handle multiple variables; the values given by the function are used instead of the original values
25. Smoothing Noisy Data - Example
- We want to smooth the Temperature attribute using bin means, with bins of size 3
  - first, sort the values of the attribute (keep track of the ID or key so that the transformed values can be replaced in the original table)
  - divide the data into bins of size 3 (the last bin may be smaller)
  - convert the values in each bin to the mean value for that bin
  - put the resulting values into the original table
26. Smoothing Noisy Data - Example
The value of every record in each bin is changed to the mean value for that bin. If it is necessary to keep the value as an integer, the mean values are rounded to the nearest integer.
27. Smoothing Noisy Data - Example
The final table with the new values for the Temperature attribute.
28. Data Integration
- Data analysis may require combining data from multiple sources into a coherent data store
- Challenges in data integration
  - schema integration (e.g., CID, C_number, and Cust-id naming the same attribute in different sources)
  - semantic heterogeneity
  - data value conflicts (different representations, scales, etc.)
  - synchronization (especially important in Web usage mining)
  - redundant attributes (an attribute is redundant if it can be derived from other attributes); redundancies may be identified via correlation analysis
- Meta-data is often necessary for successful data integration
- Correlation analysis: corr(A, B) = Pr(A, B) / (Pr(A) x Pr(B))
  - = 1: independent; > 1: positive correlation; < 1: negative correlation
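The correlation (lift) measure above can be sketched on a small hypothetical set of binary observations, where each row records whether events A and B occurred.

```python
# Hypothetical binary data: (A occurred, B occurred) per transaction.
rows = [(1, 1), (1, 1), (1, 0), (0, 1), (0, 0), (1, 1)]
n = len(rows)

pr_a = sum(a for a, _ in rows) / n
pr_b = sum(b for _, b in rows) / n
pr_ab = sum(1 for a, b in rows if a and b) / n

# corr > 1: positive correlation; == 1: independent; < 1: negative.
corr = pr_ab / (pr_a * pr_b)
print(round(corr, 3))  # 1.125
```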
29. Normalization
- Min-max normalization: a linear transformation from v to v'
  - v' = (v - min) / (max - min) x (newmax - newmin) + newmin
  - example: transform 30000 in the range 10000..45000 into 0..1:
    v' = (30000 - 10000) / 35000 x (1 - 0) + 0 = 0.571
- z-score normalization: normalize v into v' based on the attribute's mean and standard deviation
  - v' = (v - Mean) / StandardDeviation
- Normalization by decimal scaling
  - moves the decimal point of v by j positions, where j is the minimum number of positions such that the absolute maximum value falls in 0..1
  - v' = v / 10^j
  - example: if v ranges between -56 and 9976, then j = 4, and v' ranges between -0.0056 and 0.9976
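The three normalization schemes can be sketched directly from the formulas above, using the slide's own example values.

```python
# Min-max normalization: v' = (v - min)/(max - min) * (newmax - newmin) + newmin
def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

# z-score normalization: v' = (v - mean) / stdev
def z_score(v, mean, stdev):
    return (v - mean) / stdev

# Decimal scaling: find the smallest j such that max(|v|) / 10**j is in [0, 1].
def decimal_scaling(values):
    j = 0
    while max(abs(v) for v in values) / 10 ** j > 1:
        j += 1
    return [v / 10 ** j for v in values]

print(round(min_max(30000, 10000, 45000), 3))  # 0.571
print(decimal_scaling([-56, 9976]))            # [-0.0056, 0.9976]
```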
30. Normalization Example
- z-score normalization: v' = (v - Mean) / Stdev
- Example: normalizing the Humidity attribute, with Mean = 80.3 and Stdev = 9.84
31. Normalization Example II
- Min-max normalization on an employee database
  - max distance (range) for salary: 100000 - 19000 = 81000
  - max distance (range) for age: 52 - 27 = 25
  - new min for age and salary = 0; new max for age and salary = 1
32. Data Reduction
- Data is often too large; reducing the data can improve performance
- Data reduction consists of reducing the representation of the data set while producing the same (or almost the same) results
- Data reduction includes
  - data cube aggregation
  - dimensionality reduction
  - discretization
  - numerosity reduction
    - regression
    - histograms
    - clustering
    - sampling
33. Data Cube Aggregation
- Reduce the data to the concept level needed in the analysis
- Use the smallest (most detailed) level necessary to solve the problem
- Queries regarding aggregated information should be answered using the data cube when possible
34. Dimensionality Reduction
- Feature selection (i.e., attribute subset selection)
  - select only the necessary attributes
  - the goal is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes
  - there are an exponential number of possibilities, so use heuristics: select the locally best (most pertinent) attribute, e.g., using information gain
- Heuristic methods
  - step-wise forward selection: {A1} -> {A1, A3} -> {A1, A3, A5}
  - step-wise backward elimination: {A1, A2, A3, A4, A5} -> {A1, A3, A4, A5} -> {A1, A3, A5}
  - combining forward selection and backward elimination
  - decision-tree induction
35. Decision Tree Induction
36. Numerosity Reduction
- Reduction via histograms
  - divide the data into buckets and store a representation of each bucket (sum, count, etc.)
- Reduction via clustering
  - partition the data into clusters based on closeness in space
  - retain representatives of the clusters (centroids) and the outliers
- Reduction via sampling
  - will the patterns in the sample represent the patterns in the data?
  - random sampling can produce poor results
  - stratified sampling (a stratum is a group based on an attribute value)
37Sampling Techniques
SRSWOR (simple random sample without
replacement)
SRSWR
Raw Data
Cluster/Stratified Sample
Raw Data
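A minimal sketch of the three sampling schemes, on hypothetical records keyed 1..100 (the odd/even strata are an illustrative assumption, not from the slides):

```python
import random

random.seed(42)
data = list(range(1, 101))  # hypothetical record keys

# SRSWOR: simple random sample without replacement (no duplicates possible).
srswor = random.sample(data, 10)

# SRSWR: simple random sample with replacement (records may repeat).
srswr = [random.choice(data) for _ in range(10)]

# Stratified sample: draw from each stratum (here, odd vs. even keys)
# so that every stratum is represented.
strata = {"odd": [x for x in data if x % 2],
          "even": [x for x in data if x % 2 == 0]}
stratified = [random.choice(group) for group in strata.values()
              for _ in range(5)]

print(len(srswor), len(set(srswor)))  # 10 10 (all distinct)
```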
38. Discretization
- Three types of attributes
  - nominal: values from an unordered set (also called categorical attributes)
  - ordinal: values from an ordered set
  - continuous: real numbers (but sometimes also integer values)
- Discretization is used to reduce the number of values for a given continuous attribute
  - usually done by dividing the range of the attribute into intervals
  - interval labels are then used to replace the actual data values
- Some data mining algorithms only accept categorical attributes and cannot handle a range of continuous attribute values
- Discretization can also be used to generate concept hierarchies
  - reduce the data by collecting and replacing low-level concepts (e.g., numeric values for age) with higher-level concepts (e.g., young, middle-aged, old)
39. Discretization - Example
- Example: discretizing the Humidity attribute using 3 bins
  - Low: 60-69; Normal: 70-79; High: 80 and above
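The 3-bin mapping above can be sketched as a simple lookup; the sample readings are hypothetical.

```python
# Map a continuous Humidity value into the 3 interval labels above.
def discretize_humidity(value):
    if value < 70:
        return "Low"     # 60-69
    if value < 80:
        return "Normal"  # 70-79
    return "High"        # 80 and above

readings = [65, 70, 75, 80, 85, 96]
print([discretize_humidity(v) for v in readings])
# ['Low', 'Normal', 'Normal', 'High', 'High', 'High']
```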
40. Converting Categorical Attributes to Numerical Attributes
- Attributes: Outlook (overcast, rain, sunny), Temperature (real), Humidity (real), Windy (true, false)
- Standard spreadsheet format: create a separate column for each value of a categorical attribute (e.g., 3 columns for the values of the Outlook attribute and two for the values of the Windy attribute); there is no change to the numerical attributes
41. Visualizing Patterns

                     Windy   Not Windy
Outlook = sunny        2         3
Outlook = rain         2         3
Outlook = overcast     2         2
42. Evaluating Models
- To train and evaluate models, the data are often divided into three sets: the training set, the test set, and the evaluation set
- Training set
  - used to build the initial model
  - may need to enrich the data to get enough of the special cases
- Test set
  - used to adjust the initial model
  - models can be tweaked to be less idiosyncratic to the training data and adapted into a more general model
  - the idea is to prevent over-training (i.e., finding patterns where none exist)
- Evaluation set
  - used to evaluate the model's performance
43. Training Sets
- The training set is used to train the models
  - most important consideration: it needs to cover the full range of values for all the features that the model might encounter
  - it is a good idea to have several examples for each value of a categorical feature, and for a range of values of each numerical feature
- Data enrichment
  - the training set should have a sufficient number of examples of rare events
  - a random sample of the available data is not sufficient, since common examples will swamp the rare ones
  - the training set needs to over-sample the rare cases so that the model can learn the features of rare events instead of predicting everything to be the common output
44. Test and Evaluation Sets
- Reading too much into the training set (overfitting)
  - a common problem with most data mining algorithms
  - the resulting model works well on the training set but performs miserably on unseen data
  - the test set can be used to tweak the initial model and to remove unnecessary inputs or features
- Evaluation set for final performance evaluation
  - the test set cannot be used to evaluate model performance on future unseen data
  - the error rate on the evaluation set is a good predictor of the error rate on unseen data
- Insufficient data to divide into three disjoint sets?
  - validation techniques can play a major role
    - cross validation
    - bootstrap validation
45. Cross Validation
- Cross validation is a heuristic that works as follows
  - randomly divide the data into n folds, each with approximately the same number of records
  - create n models using the same algorithm and training parameters; each model is trained on n-1 folds of the data and tested on the remaining fold
  - can be used to find the best algorithm and its optimal training parameters
- Steps in cross validation
  1. Divide the available data into a training set and an evaluation set
  2. Split the training data into n folds
  3. Select an architecture (algorithm) and training parameters
  4. Train and test the n models
  5. Repeat steps 2 to 4 using different architectures and parameters
  6. Select the best model
  7. Use all the training data to train the model
  8. Assess the final model using the evaluation set
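The train-on-(n-1)-folds, test-on-one loop can be sketched as below. The data set and the trivial majority-class "model" are hypothetical stand-ins for a real learner, kept simple so the fold mechanics stand out.

```python
import random
from statistics import mean

random.seed(0)
# Hypothetical labeled records: pos for values above 50, neg otherwise.
data = [(x, "pos" if x > 50 else "neg") for x in range(100)]
random.shuffle(data)

n_folds = 5
folds = [data[i::n_folds] for i in range(n_folds)]  # 5 folds of 20 records

accuracies = []
for i in range(n_folds):
    test = folds[i]
    train = [rec for j, fold in enumerate(folds) if j != i for rec in fold]
    # "Train": pick the majority class seen in the n-1 training folds.
    majority = max(["pos", "neg"],
                   key=lambda c: sum(1 for _, y in train if y == c))
    # "Test": score the model on the held-out fold.
    accuracies.append(mean(1 if y == majority else 0 for _, y in test))

print(round(mean(accuracies), 2))  # cross-validated accuracy estimate
```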
46. Bootstrap Validation
- Based on the statistical procedure of sampling with replacement
  - a data set of n instances is sampled n times (with replacement) to give another data set of n instances
  - since some elements will be repeated, there will be elements in the original data set that are not picked
  - these remaining instances are used as the test set
- How many instances end up in the test set?
  - probability of not getting picked in one sampling: 1 - 1/n
  - Pr(not getting picked in n samples) = (1 - 1/n)^n ~ e^-1 = 0.368
  - so, for a large data set, the test set will contain about 36.8% of the instances
  - to compensate for the smaller training sample (63.2%), the test-set error rate is combined with the re-substitution error on the training set:
    e = 0.632 x e(test instances) + 0.368 x e(training instances)
- Bootstrap validation increases the variance that can occur in each fold
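The ~36.8% figure can be checked empirically with a quick simulation: draw n instances with replacement and count how many of the original n were never picked.

```python
import random

random.seed(1)
n = 10000  # number of instances in the (hypothetical) data set

# One bootstrap sample: n draws with replacement from the n instance indices.
bootstrap_sample = [random.randrange(n) for _ in range(n)]

# Instances never drawn form the test ("out-of-bag") set.
never_picked = n - len(set(bootstrap_sample))

print(round(never_picked / n, 3))  # close to 1/e ~ 0.368 for large n
```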
47. Measuring Effectiveness
- Predictive models are measured based on the accuracy of their predictions on unseen data
- Classification and prediction tasks
  - accuracy is measured in terms of error rate (usually the % of records classified incorrectly)
  - the error rate on a pre-classified evaluation set estimates the real error rate
  - can also use the cross-validation methods discussed before
- Estimation effectiveness
  - the difference between the predicted scores and the actual results (from the evaluation set)
  - the accuracy of the model is measured in terms of variance (i.e., the average of the squared differences)
  - to express this in terms of the original units, compute the standard deviation (i.e., the square root of the variance)
48. Ordinal or Nominal Outputs
- When the output field is ordinal or nominal (e.g., in two-class prediction), we use the classification table, the so-called confusion matrix, to evaluate the resulting model
- Example (rows: actual class; columns: predicted class)
  - overall correct classification rate: (18 + 15) / 38 = 87%
  - given T, correct classification rate: 18 / 20 = 90%
  - given F, correct classification rate: 15 / 18 = 83%
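The rates in the example can be recomputed from the confusion matrix; the off-diagonal counts (2 and 3) are inferred here from the example's totals (20 actual T, 18 actual F).

```python
# (actual, predicted) -> count; off-diagonal counts inferred from the totals.
matrix = {("T", "T"): 18, ("T", "F"): 2,   # actual T: 18 correct, 2 wrong
          ("F", "F"): 15, ("F", "T"): 3}   # actual F: 15 correct, 3 wrong

total = sum(matrix.values())
overall = (matrix[("T", "T")] + matrix[("F", "F")]) / total
rate_t = matrix[("T", "T")] / (matrix[("T", "T")] + matrix[("T", "F")])
rate_f = matrix[("F", "F")] / (matrix[("F", "F")] + matrix[("F", "T")])

print(f"{overall:.0%} {rate_t:.0%} {rate_f:.0%}")  # 87% 90% 83%
```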
49. Measuring Effectiveness
- Market Basket Analysis
  - MBA may be used for estimation or prediction (e.g., people who buy milk and diapers tend to also buy beer)
  - confidence: the percentage of transactions containing milk and diapers that also contain beer (i.e., a conditional probability)
  - support: the percentage of transactions in the whole training set in which milk, diapers, and beer occur together (i.e., a prior probability)
- Distance-Based Techniques
  - clustering and memory-based reasoning: a measure of distance is used to evaluate the "closeness" or similarity of a point to cluster centers or to a neighborhood
  - regression: the line of best fit minimizes the sum of the distances between the line and the observations
  - for numeric variables, the Euclidean distance measure is often used (the square root of the sum of the squares of the differences along each dimension)
50Measuring Effectiveness Lift Charts
- usually used for classification, but can be
adopted to other methods - measure change in conditional probability of a
target class when going from the general
population (full test set) to a biased sample - Example
- suppose expected response rate to a direct
mailing campaign is 5 in the training set - use classifier to assign yes or no value to a
target class predicted to respond - the yes group will contain a higher proportion
of actual responders than the test set - suppose the yes group (our biased sample)
contains 50 actual responders - this gives lift of 10 0.5 / 0.05
- What if the lift sample is too small
- need to increase the sample size
- trade-off between lift and sample size
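The lift computation from the example can be sketched on hypothetical data constructed to match the stated rates (5% baseline responders, 50% responders in the model's "yes" group):

```python
# (predicted group, actually responded) per prospect; counts chosen so that
# 100 of 2000 respond overall (5%) and 50 of the 100 "yes" prospects respond.
population = ([("yes", True)] * 50 + [("yes", False)] * 50
              + [("no", True)] * 50 + [("no", False)] * 1850)

baseline = sum(1 for _, responded in population if responded) / len(population)
yes_group = [responded for group, responded in population if group == "yes"]
biased = sum(yes_group) / len(yes_group)

# Lift = response rate in the biased sample / baseline response rate.
print(baseline, biased, biased / baseline)  # 0.05 0.5 10.0
```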