Chapter 2 Overview of the Data Mining Process - PowerPoint PPT Presentation

Slides: 36

Provided by: was111

Learn more at: https://www.washburn.edu

Transcript and Presenter's Notes

1
Chapter 2: Overview of the Data Mining Process
2
Introduction

Data Mining
Predictive analysis
Tasks of classification and prediction
Core of Business Intelligence
Database Methods
OLAP
SQL
Do not involve statistical modeling

3
Core Ideas in Data Mining

Analytical Methods Used in Predictive Analytics
Classification
Used with categorical response variables
E.g., will a purchase be made or not?
Prediction
Predict (estimate) the value of a continuous response variable
"Prediction" is sometimes used for categorical responses as well
Association Rules
Affinity analysis: "what goes with what"
Seeks correlations among data

4
Core Ideas in Data Mining

Data Reduction
Reduce variables
Group together similar variables
Data Exploration
View data as evidence
Get a feel for the data
Data Visualization
Graphical representation of data
Locate trends, correlations, etc.

5
Supervised Learning

"Supervised learning" algorithms are those used in classification and prediction.
Data is available in which the value of the outcome of interest is known.
"Training data" are the data from which the classification or prediction algorithm "learns," or is "trained," about the relationship between predictor variables and the outcome variable.
This process results in a model
Classification Model
Predictive Model

6
Supervised Learning

The model is then run with another sample of data, the "validation data."
The outcome is known, but we wish to see how well the model performs.
If many different models are being tried out, a third sample of known outcomes, the "test data," is used with the final, selected model to predict how well it will do.
The model can then be used to classify or predict
the outcome of interest in new cases where the
outcome is unknown.

7
Supervised Learning

Linear regression analysis is an example of supervised learning.
The Y variable is the (known) outcome variable
The X variable is some predictor variable.
A regression line is drawn to minimize the sum of
squared deviations between the actual Y values
and the values predicted by this line.
The regression line can now be used to predict Y
values for new values of X for which we do not
know the Y value.
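The regression idea on this slide can be sketched in a few lines of Python. This is a minimal illustration using the standard closed-form least-squares formulas for one predictor; the function names `fit_line` and `predict` and the toy data are assumptions made for the example, not part of the presentation.

```python
def fit_line(xs, ys):
    """Return (a, b) for the line y = a + b*x that minimizes
    the sum of squared deviations between actual and predicted Y."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: co-variation of X and Y divided by variation of X.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a = y_mean - b * x_mean
    return a, b


def predict(a, b, x):
    """Predict Y for a new X whose Y value is unknown."""
    return a + b * x


# Toy training data (hypothetical): Y is known for each X.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
a, b = fit_line(xs, ys)
new_prediction = predict(a, b, 5.0)
```

The known (X, Y) pairs play the role of training data; `predict` is then applied to a new X for which Y is not known.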

8
Unsupervised Learning

No outcome variable to predict or classify
No learning from cases
Unsupervised learning methods
Association Rules
Data Reduction Methods
Clustering Techniques

9
The Steps in Data Mining

1. Develop an understanding of the purpose of the
data mining project
It may be a one-shot effort to answer a question or questions, or
an application (if it is an ongoing procedure).
2. Obtain the dataset to be used in the analysis.
Random sampling from a large database to capture
records to be used in an analysis
Pulling together data from different databases.
Internal (e.g., past purchases made by customers)
External (e.g., credit ratings).
Usually the analysis to be done requires only
thousands or tens of thousands of records.

10
The Steps in Data Mining

3. Explore, clean, and preprocess the data
Verifying that the data are in reasonable
condition.
How should missing data be handled?
Are the values in a reasonable range, given what you would expect for each variable?
Are there obvious "outliers"?
Data are reviewed graphically
For example, a matrix of scatter plots showing
the relationship of each variable with each other
variable.
Ensure consistency in the definitions of fields,
units of measurement, time periods, etc.

11
The Steps in Data Mining

4. Reduce the data
If supervised training is involved, separate the data into training, validation, and test datasets.
Eliminating unneeded variables
Transforming variables
(e.g., turning "money spent" into "spent > 100" vs. "spent ≤ 100")
Creating new variables
A variable that records whether at least one of
several products was purchased
Make sure you know what each variable means, and
whether it is sensible to include it in the
model.
5. Determine the data mining task
Classification, prediction, clustering, etc.
6. Choose the data mining techniques to be used
Regression, neural nets, hierarchical clustering,
etc.
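The two transformations mentioned in step 4 (binning "money spent" and deriving an "at least one product purchased" flag) might look like the sketch below. The record layout, field names, and the helper names `spent_over` and `bought_any` are hypothetical, chosen only to illustrate the idea.

```python
def spent_over(amount, threshold=100):
    """Turn a continuous 'money spent' value into a yes/no variable."""
    return amount > threshold


def bought_any(purchases, products):
    """New variable: was at least one of several products purchased?"""
    return any(purchases.get(p, 0) > 0 for p in products)


# Hypothetical customer records.
records = [
    {"money_spent": 250.0, "purchases": {"A": 1, "B": 0}},
    {"money_spent": 40.0, "purchases": {"A": 0, "B": 0}},
]

for r in records:
    r["spent_gt_100"] = spent_over(r["money_spent"])
    r["bought_any"] = bought_any(r["purchases"], ["A", "B"])
```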

12
The Steps in Data Mining

7. Use algorithms to perform the task.
Iterative process - trying multiple variants, and
often using multiple variants of the same
algorithm (choosing different variables or
settings within the algorithm).
When appropriate, feedback from the algorithm's
performance on validation data is used to refine
the settings.
8. Interpret the results of the algorithms.
Choose the best algorithm to deploy,
Use the final choice on the test data to get an idea of how well it will perform.
9. Deploy the model.
Integrate the model into operational systems
Run it on real records to produce decisions or
actions.
For example, the model might be applied to a purchased list of possible customers, and the action might be "include in the mailing if the predicted amount of purchase is > 10."
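The deployment example in step 9 amounts to applying a fitted model to new records and acting on a cutoff. A minimal sketch follows; the function `select_for_mailing`, the `past_spend` field, and the stand-in model are all assumptions for illustration, since the presentation does not specify a model.

```python
def select_for_mailing(prospects, predict_amount, cutoff=10):
    """Keep prospects whose predicted purchase amount exceeds the cutoff."""
    return [p for p in prospects if predict_amount(p) > cutoff]


def toy_model(prospect):
    """Stand-in for a real fitted model (hypothetical rule)."""
    return prospect["past_spend"] / 10


# Hypothetical purchased list of possible customers.
prospects = [
    {"name": "p1", "past_spend": 50},    # predicted 5.0
    {"name": "p2", "past_spend": 200},   # predicted 20.0
    {"name": "p3", "past_spend": 100},   # predicted 10.0, not above cutoff
]

mailing_list = select_for_mailing(prospects, toy_model)
```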

13
Preliminary Steps

Organization of datasets
Records in rows
Variables in columns
In supervised learning, one of these will be the outcome variable
Typically listed in the first or last column
Sampling from a database
Use samples to create, validate, and test the model
Oversampling rare events
If the response value of interest is seldom found in the data, increase its share of the sample
Adjust algorithm as necessary
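One way to oversample a rare event is to keep every rare record and draw only a matching number of common records. The sketch below assumes a 50/50 target mix and a hypothetical `responded` field; neither comes from the presentation, and other mixes are equally plausible.

```python
import random


def oversample_rare(records, is_rare, seed=0):
    """Keep all rare records plus an equal-sized random sample of the rest."""
    rare = [r for r in records if is_rare(r)]
    common = [r for r in records if not is_rare(r)]
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    sampled_common = rng.sample(common, min(len(common), len(rare)))
    return rare + sampled_common


# Hypothetical data: 5 responders out of 100 records.
records = [{"id": i, "responded": i < 5} for i in range(100)]
sample = oversample_rare(records, lambda r: r["responded"])
```

Because the sample no longer reflects the true response rate, results from a model built on it need adjusting before they are read as real-world rates.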

14
Preliminary Steps (Pre-processing and Cleaning the Data)

Types of variables
Continuous: assumes any real numerical value (generally within a specified range)
Categorical: assumes one of a limited number of values
Text (e.g., payments ∈ {current, not current, bankrupt})
Numerical (e.g., age ∈ {0, ..., 120})
Nominal (payments)
Ordinal (age)

15
Preliminary Steps (Pre-processing and Cleaning the Data)

Handling categorical variables
If a categorical variable is ordered, it can be used as a continuous variable (e.g., age, level of credit, etc.)
Use dummy variables when the range of values is not large
E.g., variable occupation ∈ {student, unemployed, employed, retired}
Create binary (yes/no) dummy variables
Student yes/no
Unemployed yes/no
Employed yes/no
Retired yes/no
Variable selection
The more predictor variables, the more records are needed to build the model
Reduce number of variables whenever appropriate
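The occupation example above can be sketched as follows. The function name `to_dummies` is an assumption for illustration; the four categories are the ones listed on the slide.

```python
CATEGORIES = ["student", "unemployed", "employed", "retired"]


def to_dummies(value, categories=CATEGORIES):
    """Expand one categorical value into binary (1 = yes, 0 = no) variables."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return {c: int(value == c) for c in categories}


row = to_dummies("employed")
```

Each record ends up with one yes/no column per category, exactly one of which is 1.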

16
Preliminary Steps (Pre-processing and Cleaning the Data)

Overfitting
Building a model - describe relationships among
variables in order to predict future outcome
(dependent) values on the basis of future
predictor (independent) values.
Avoid explaining variation in the data that was
nothing more than chance variation. Avoid
mislabeling noise in the data as if it were a
signal
Caution - if the dataset is not much larger than
the number of predictor variables, then it is
very likely that a spurious relationship like
this will creep into the model

17
Overfitting
18
Preliminary Steps (Pre-processing and Cleaning
the Data)

How many variables and how much data?
A good rule of thumb is to have ten records for every predictor variable.
For classification procedures
At least 6 × m × p records, where m = number of outcome classes and p = number of variables
Compactness or parsimony is a desirable feature
in a model.
A matrix of x-y plots can be useful in variable
selection.
Can see at a glance x-y plots for all variable
combinations.
A straight line would be an indication that one
variable is exactly correlated with another.
We would want to include only one of them in our
model.
Weed out irrelevant and redundant variables from
our model
Consult domain expert whenever possible
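The two sizing rules on this slide are simple arithmetic. As a quick worked example, with hypothetical function names and example values of m and p chosen only for illustration:

```python
def min_records_rule_of_thumb(p):
    """Ten records for every predictor variable."""
    return 10 * p


def min_records_classification(m, p):
    """At least 6 * m * p records (m outcome classes, p variables)."""
    return 6 * m * p


# Example: a two-class problem with 10 predictor variables.
general_minimum = min_records_rule_of_thumb(p=10)
classification_minimum = min_records_classification(m=2, p=10)
```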

19
Preliminary Steps (Pre-processing and Cleaning the Data)

Outliers
Values that lie far away from the bulk of the data are called outliers.
No statistical rule can tell us whether such an outlier is the result of an error.
These are judgments best made by someone with "domain" knowledge.
If the number of records with outliers is very small, they might be treated as missing data.
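Since no rule can decide whether an outlier is an error, code can at most flag candidates for a domain expert to review. The sketch below flags values more than a chosen number of standard deviations from the mean; the function name, the default threshold of 3, and the sample values are assumptions, not something the presentation prescribes.

```python
import statistics


def flag_outlier_candidates(values, threshold=3.0):
    """Return indices of values far from the bulk of the data.
    These are candidates for human review, not confirmed errors."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [i for i, v in enumerate(values) if abs(v - mean) / sd > threshold]


# Hypothetical measurements with one suspicious value.
values = [10, 11, 9, 10, 12, 10, 11, 9, 10, 100]
candidates = flag_outlier_candidates(values, threshold=2.0)

# If review confirms the value is bad, treat it as missing (None).
cleaned = [None if i in candidates else v for i, v in enumerate(values)]
```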

20
Preliminary Steps (Pre-processing and Cleaning the Data)

Missing values
If the number of records with missing values is small, those records might be omitted
The more variables, the more records will be dropped
Alternative: replace each missing value with the average computed from records with valid data for that variable
This reduces variability in the dataset
Human judgment can be used to determine best way
to handle missing data
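Mean imputation, and its side effect of shrinking variability, can be seen in a small sketch. Missing values are represented here as `None`, and the function name `impute_mean` is hypothetical.

```python
import statistics


def impute_mean(values):
    """Replace missing values (None) with the mean of the valid values."""
    valid = [v for v in values if v is not None]
    mean = sum(valid) / len(valid)
    return [mean if v is None else v for v in values]


# Hypothetical column with one missing value.
column = [10, None, 20, 30]
filled = impute_mean(column)

# Compare spread before (valid values only) and after imputation.
sd_before = statistics.stdev([v for v in column if v is not None])
sd_after = statistics.stdev(filled)
```

The imputed value sits exactly at the mean, so it adds a record without adding any spread, which is why the standard deviation drops.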

21
Preliminary Steps (Pre-processing and Cleaning the Data)

Normalizing (standardizing) the data
To normalize the data, we subtract the mean from each value, and divide by the standard deviation of the resulting deviations from the mean
This expresses each value as the number of standard deviations away from the mean: the z-score
Needed if variables are in different units, e.g., hours, thousands of dollars, etc.
Clustering algorithms measure how far records are from each other, so the variables need a common scale for distance.
Data mining software, including XLMiner,
typically has an option that normalizes the data
in those algorithms where it may be required
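The z-score calculation described above can be sketched directly. The function name is hypothetical, and the choice of the sample standard deviation (`statistics.stdev`) rather than the population version is an assumption; the slide does not say which is used.

```python
import statistics


def z_scores(values):
    """Express each value as standard deviations away from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample standard deviation
    return [(v - mean) / sd for v in values]


# Hypothetical variable measured in hours.
hours = [2, 4, 6]
normalized = z_scores(hours)
```

After this step, a variable measured in hours and one measured in thousands of dollars are on the same unitless scale.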

22
Preliminary Steps

Use and creation of partition
Training partition
The largest partition
Contains the data used to build the various
models
Same training partition is generally used to
develop multiple models.
Validation partition
Used to assess the performance of each model,
Used to compare models and pick the best one.
In some algorithms (e.g., classification and regression trees), the validation partition may be used automatically to tune and improve the model.
Test partition
Sometimes called the "holdout" or "evaluation" partition; used to assess the performance of a chosen model with new data.
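A random three-way split can be sketched as below. The 50/30/20 proportions are an assumption for illustration; the slide says only that the training partition is the largest. The function name `partition` and the fixed seed are likewise hypothetical.

```python
import random


def partition(records, train_frac=0.5, valid_frac=0.3, seed=0):
    """Randomly split records into training, validation, and test partitions."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatability
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains
    return train, valid, test


records = list(range(100))
train, valid, test = partition(records)
```

Models are built on `train`, compared on `valid`, and the chosen model is assessed once on `test`.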

23
The Three Data Partitions and Their Role in the
Data Mining Process
24
Simple Regression Example
25
Simple Regression Model

Make prediction about the starting salary of a
current college graduate
Data set of starting salaries of recent college
graduates

Data Set
Compute Average Salary
How certain are we of this prediction? There is variability in the data.
26
Simple Regression Model

Use total variation as an index of uncertainty
about our prediction

The smaller the amount of total variation the
more accurate (certain) will be our prediction.

27
Simple Regression Model

How can we explain the variability? Perhaps it depends on the student's GPA

28
Simple Regression Model

Find a linear relationship between GPA and
starting salary
As GPA increases/decreases starting salary
increases/decreases

29
Simple Regression Model

Least Squares Method to find regression model
Choose a and b in the regression model (equation) so that it minimizes the sum of the squared deviations, where each deviation is the actual Y value minus the predicted Y value (Y-hat)
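Written out, the least squares criterion and its standard solution for one predictor look like this. The index $i$ over records and the bar notation for means are notational assumptions added here; $a$, $b$, $Y$, and $\hat{Y}$ follow the slide.

```latex
\hat{Y}_i = a + b X_i
\qquad
\min_{a,\,b} \; \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2

b = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}
\qquad
a = \bar{Y} - b \bar{X}
```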

30
Simple Regression Model

How good is the model?

a = 4,779, b = 5,370 (a computer program computed these values)

u-hat is a residual value
The sum of all u-hats is zero
The sum of all squared u-hats is the total variation not explained by the model
Unexplained variation is 7,425,926

31
Simple Regression Model
Total Variation 23,000,000
32
Simple Regression Model
Total Unexplained Variation 7,425,726
33
Simple Regression Model

Relative Goodness of Fit
Summarize the improvement in prediction using
regression model
Compute R², the coefficient of determination

The regression model (equation) is a better predictor than guessing the average salary.
GPA is a more accurate predictor of starting salary than guessing the average.
R² is the performance measure for the model.
Predicted Starting Salary = 4,779 + 5,370 × GPA
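Using the figures quoted on the preceding slides, the coefficient of determination can be checked with the standard definition, one minus unexplained variation over total variation. The function names are hypothetical, and reading the fitted equation as intercept plus slope times GPA is an assumption based on the values of a and b given earlier.

```python
def r_squared(total_variation, unexplained_variation):
    """Share of total variation explained by the regression model."""
    return 1 - unexplained_variation / total_variation


def predicted_salary(gpa, a=4779, b=5370):
    """Fitted line from the slides, assumed to be a + b * GPA."""
    return a + b * gpa


# Total variation and unexplained variation as quoted in the slides.
r2 = r_squared(23_000_000, 7_425_926)

# Example prediction for a hypothetical graduate with a 3.0 GPA.
salary_at_3 = predicted_salary(3.0)
```

With these inputs R² comes out at roughly 0.68, meaning the GPA-based model accounts for about two thirds of the variation that remains when simply guessing the average salary.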
34
Building a Model - An Example with Linear
Regression
35
Problems