Title: Chapter 2 Overview of the Data Mining Process
1 Chapter 2: Overview of the Data Mining Process
2 Introduction
- Data Mining
- Predictive analysis
- Tasks of classification and prediction
- Core of Business Intelligence
- Database Methods
- OLAP
- SQL
- Do not involve statistical modeling
3 Core Ideas in Data Mining
- Analytical Methods Used in Predictive Analytics
- Classification
- Used with categorical response variables
- E.g. Will purchase be made / not made?
- Prediction
- Predict (estimate) the value of a continuous response
variable
- The term "prediction" is sometimes used with
categorical responses as well
- Association Rules
- Affinity analysis: "what goes with what"
- Seeks correlations among data
4 Core Ideas in Data Mining
- Data Reduction
- Reduce variables
- Group together similar variables
- Data Exploration
- View data as evidence
- Get a feel for the data
- Data Visualization
- Graphical representation of data
- Locate trends, correlations, etc.
5 Supervised Learning
- "Supervised learning" algorithms are those used
in classification and prediction.
- Data is available in which the value of the
outcome of interest is known.
- "Training data" are the data from which the
classification or prediction algorithm "learns,"
or is "trained," about the relationship between
predictor variables and the outcome variable.
- This process results in a model
- Classification Model
- Predictive Model
6 Supervised Learning
- The model is then run with another sample of data,
the "validation data"
- The outcome is known, but we wish to see how well
the model performs
- If many different models are being tried out, a
third sample of known outcomes, the "test data," is
used with the final, selected model to predict
how well it will do.
- The model can then be used to classify or predict
the outcome of interest in new cases where the
outcome is unknown.
7 Supervised Learning
- Linear regression analysis is an example of
supervised learning
- The Y variable is the (known) outcome variable
- The X variable is some predictor variable.
- A regression line is drawn to minimize the sum of
squared deviations between the actual Y values
and the values predicted by this line.
- The regression line can now be used to predict Y
values for new values of X for which we do not
know the Y value.
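The regression example above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `fit_line` and the training values are hypothetical, and the fit uses the standard closed-form least-squares solution for a single predictor.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope: co-variation of X and Y divided by the variation of X.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Hypothetical training data: X is the predictor, Y the known outcome.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [2.1, 3.9, 6.2, 7.8, 10.0]

a, b = fit_line(train_x, train_y)

# Predict Y for a new X value whose outcome is unknown.
new_x = 6.0
predicted_y = a + b * new_x
print(f"a = {a:.2f}, b = {b:.2f}, predicted Y at X = {new_x}: {predicted_y:.2f}")
```

The training data play the role of the known outcomes; `new_x` stands for a new case where Y is not known.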
8 Unsupervised Learning
- No outcome variable to predict or classify
- No learning from cases where an outcome is known
- Unsupervised learning methods
- Association Rules
- Data Reduction Methods
- Clustering Techniques
9 The Steps in Data Mining
- 1. Develop an understanding of the purpose of the
data mining project or application
- Project (if it is a one-shot effort to answer a
question or questions), or
- Application (if it is an ongoing procedure).
- 2. Obtain the dataset to be used in the analysis.
- Random sampling from a large database to capture
records to be used in an analysis
- Pulling together data from different databases.
- Internal (e.g. past purchases made by customers)
- External (e.g. credit ratings).
- Usually the analysis to be done requires only
thousands or tens of thousands of records.
10 The Steps in Data Mining
- 3. Explore, clean, and preprocess the data
- Verify that the data are in reasonable
condition.
- How should missing data be handled?
- Are the values in a reasonable range, given what
you would expect for each variable?
- Are there obvious "outliers"?
- Data are reviewed graphically
- For example, a matrix of scatter plots showing
the relationship of each variable with each other
variable.
- Ensure consistency in the definitions of fields,
units of measurement, time periods, etc.
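The checks listed in step 3 (missing values, values outside a reasonable range) can be sketched as a small report. Everything here is illustrative: the records, the variable names, and the `expected_range` bounds are assumptions, and real projects would also review the data graphically.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 29, "income": None},
    {"age": 430, "income": 48000},  # suspicious value, possibly a data-entry error
]

# Assumed plausible ranges for each variable (illustrative only).
expected_range = {"age": (0, 120), "income": (0, 1_000_000)}

def check_records(rows, ranges):
    """Count missing values and list out-of-range values per variable."""
    report = {}
    for var, (low, high) in ranges.items():
        values = [row[var] for row in rows]
        missing = sum(v is None for v in values)
        out_of_range = [v for v in values
                        if v is not None and not low <= v <= high]
        report[var] = {"missing": missing, "out_of_range": out_of_range}
    return report

report = check_records(records, expected_range)
for var, summary in report.items():
    print(var, summary)
```

A flagged value is only a candidate for review; whether it is an error remains a judgment call.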
11 The Steps in Data Mining
- 4. Reduce the data
- If supervised training is involved, separate the
data into training, validation and test datasets.
- Eliminating unneeded variables
- Transforming variables
- Turning "money spent" into "spent > 100" vs.
"spent ≤ 100"
- Creating new variables
- A variable that records whether at least one of
several products was purchased
- Make sure you know what each variable means, and
whether it is sensible to include it in the
model.
- 5. Determine the data mining task
- Classification, prediction, clustering, etc.
- 6. Choose the data mining techniques to be used
- Regression, neural nets, hierarchical clustering,
etc.
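The two variable operations mentioned in step 4 (turning an amount into an above/below-threshold flag, and recording whether at least one of several products was purchased) might look like this. The field names, the record values, and the helper `reduce_record` are hypothetical.

```python
# Hypothetical customer records (amounts and product flags are made up).
customers = [
    {"money_spent": 250.0, "bought_a": False, "bought_b": True},
    {"money_spent": 40.0, "bought_a": False, "bought_b": False},
    {"money_spent": 100.0, "bought_a": True, "bought_b": False},
]

def reduce_record(rec, threshold=100):
    """Derive two binary variables from the raw fields."""
    return {
        # Transformed variable: continuous amount becomes a yes/no flag.
        "spent_over_threshold": rec["money_spent"] > threshold,
        # New variable: was at least one of the products purchased?
        "bought_any": rec["bought_a"] or rec["bought_b"],
    }

reduced = [reduce_record(c) for c in customers]
for row in reduced:
    print(row)
```

Note that a customer who spent exactly 100 falls in the "not over" group, matching the "> 100" vs. "≤ 100" split.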
12 The Steps in Data Mining
- 7. Use algorithms to perform the task.
- Iterative process: trying multiple variants, and
often using multiple variants of the same
algorithm (choosing different variables or
settings within the algorithm).
- When appropriate, feedback from the algorithm's
performance on validation data is used to refine
the settings.
- 8. Interpret the results of the algorithms.
- Choose the best algorithm to deploy
- Use the final choice on the test data to get an
idea of how well it will perform.
- 9. Deploy the model.
- Integrate the model into operational systems
- Run it on real records to produce decisions or
actions.
- For example, the model might be applied to a
purchased list of possible customers, and the
action might be "include in the mailing if the
predicted amount of purchase is > 10."
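The deployment rule in step 9 can be sketched as a filter over scored records. The scoring function below is a stand-in with made-up coefficients, not a fitted model, and the prospect records are hypothetical.

```python
def predict_purchase_amount(record):
    """Stand-in for a deployed model; the coefficients are hypothetical."""
    return 2.0 + 0.05 * record["past_purchases_total"]

# Hypothetical purchased list of possible customers.
prospects = [
    {"id": 1, "past_purchases_total": 300.0},
    {"id": 2, "past_purchases_total": 100.0},
    {"id": 3, "past_purchases_total": 200.0},
]

MAILING_THRESHOLD = 10  # decision rule: mail if predicted amount > 10

mailing_list = [p["id"] for p in prospects
                if predict_purchase_amount(p) > MAILING_THRESHOLD]
print(mailing_list)
```

In an operational system the stand-in function would be replaced by whatever model was selected in step 8.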
13 Preliminary Steps
- Organization of datasets
- Records in rows
- Variables in columns
- In supervised learning one of these will be the
outcome variable
- Typically placed in the first or last column
- Sampling from a database
- Use samples to create, validate, and test the model
- Oversampling rare events
- If a value of the response variable is seldom
found in the data, increase the number of such
records in the sample
- Adjust algorithm as necessary
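One way to oversample a rare outcome is to draw the rare and common records separately. This is a sketch under stated assumptions: the function `oversample`, the 50% target share, and the synthetic records are all hypothetical, and rare records are drawn with replacement because there may be few of them.

```python
import random

def oversample(records, is_rare, rare_share, sample_size, seed=0):
    """Draw a sample in which rare records make up about `rare_share`."""
    rng = random.Random(seed)
    rare = [r for r in records if is_rare(r)]
    common = [r for r in records if not is_rare(r)]
    n_rare = round(sample_size * rare_share)
    n_common = sample_size - n_rare
    # Rare cases: with replacement. Common cases: without replacement
    # (assumes there are at least n_common common records).
    sample = rng.choices(rare, k=n_rare) + rng.sample(common, n_common)
    rng.shuffle(sample)
    return sample

# Hypothetical data: 20 responders among 1000 records (2%).
records = [{"id": i, "responded": i % 50 == 0} for i in range(1000)]

sample = oversample(records, lambda r: r["responded"],
                    rare_share=0.5, sample_size=40)
rare_count = sum(r["responded"] for r in sample)
print(f"{rare_count} of {len(sample)} sampled records are responders")
```

Because the sample no longer reflects the true response rate, results from a model built on it generally need to be adjusted before they are interpreted.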
14 Preliminary Steps (Pre-processing and Cleaning
the Data)
- Types of variables
- Continuous: assumes any real numerical value
(generally within a specified range)
- Categorical: assumes one of a limited number of
values
- Text (e.g. Payments ∈ {current, not current,
bankrupt})
- Numerical (e.g. Age ∈ 0 to 120)
- Nominal (payments)
- Ordinal (age)
15 Preliminary Steps (Pre-processing and Cleaning
the Data)
- Handling categorical variables
- If a categorical variable is ordered, it can be
used as a continuous variable (e.g. age, level of
credit, etc.)
- Use dummy variables when the number of categories
is not large
- E.g. variable occupation ∈ {student, unemployed,
employed, retired}
- Create binary (yes/no) dummy variables
- Student yes/no
- Unemployed yes/no
- Employed yes/no
- Retired yes/no
- Variable selection
- The more predictor variables, the more records
are needed to build the model
- Reduce number of variables whenever appropriate
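The occupation example above maps directly onto a small encoding function. The names `to_dummies` and the `is_` prefix are illustrative choices, with 1 standing for "yes" and 0 for "no".

```python
CATEGORIES = ["student", "unemployed", "employed", "retired"]

def to_dummies(value, categories=CATEGORIES):
    """Encode one categorical value as binary (1 = yes, 0 = no) dummies."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return {f"is_{c}": int(value == c) for c in categories}

encoded = to_dummies("employed")
print(encoded)
```

Exactly one dummy is 1 for each record. Some modeling methods, such as regression with an intercept, use one dummy fewer than the number of categories, since the last one is implied by the others.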
16 Preliminary Steps (Pre-processing and Cleaning
the Data)
- Overfitting
- Building a model: describing relationships among
variables in order to predict future outcome
(dependent) values on the basis of future
predictor (independent) values.
- Avoid explaining variation in the data that was
nothing more than chance variation. Avoid
mislabeling "noise" in the data as if it were a
"signal"
- Caution: if the dataset is not much larger than
the number of predictor variables, then it is
very likely that a spurious relationship like
this will creep into the model
17 Overfitting
18 Preliminary Steps (Pre-processing and Cleaning
the Data)
- How many variables, how much data?
- A good rule of thumb is to have ten records for
every predictor variable.
- For classification procedures
- At least 6 × m × p records,
- Where m = number of outcome classes, and p =
number of variables
- Compactness or parsimony is a desirable feature
in a model.
- A matrix of x-y plots can be useful in variable
selection.
- Can see at a glance x-y plots for all variable
combinations.
- A straight line would be an indication that one
variable is exactly correlated with another.
- We would want to include only one of them in our
model.
- Weed out irrelevant and redundant variables from
our model
- Consult a domain expert whenever possible
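The two sample-size rules of thumb above are simple arithmetic. The function names and the example values of m and p are hypothetical.

```python
def min_records_general(p):
    """Rule of thumb: ten records for every predictor variable."""
    return 10 * p

def min_records_classification(m, p):
    """Rule of thumb for classification: at least 6 * m * p records."""
    return 6 * m * p

# Hypothetical case: 8 predictor variables, 2 outcome classes.
p, m = 8, 2
general = min_records_general(p)
classification = min_records_classification(m, p)
print(f"general rule: {general} records, classification rule: {classification} records")
```

Both numbers are lower bounds suggested by the rules, not guarantees of a good model.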
19 Preliminary Steps (Pre-processing and Cleaning
the Data)
- Outliers
- Values that lie far away from the bulk of the
data are called outliers
- No statistical rule can tell us whether such an
outlier is the result of an error
- These are judgments best made by someone with
"domain" knowledge
- If the number of records with outliers is very
small, they might be treated as missing data.
20 Preliminary Steps (Pre-processing and Cleaning
the Data)
- Missing values
- If the number of records with missing values is
small, those records might be omitted
- The more variables, the more records will be
dropped
- Solution: use the average value computed from
records with valid data for the variable with
missing data
- Reduces variability in the data set
- Human judgment can be used to determine the best
way to handle missing data
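Mean imputation, and the loss of variability it causes, can be seen on a small made-up example. The values below are hypothetical, and the sample standard deviation is used as the measure of variability.

```python
from statistics import mean, stdev

# Hypothetical variable with two missing values (None).
values = [10.0, 12.0, None, 18.0, None, 20.0]

# Average computed only from records with valid data.
observed = [v for v in values if v is not None]
fill_value = mean(observed)

# Replace each missing value with that average.
imputed = [fill_value if v is None else v for v in values]

spread_before = stdev(observed)
spread_after = stdev(imputed)
print(f"fill value: {fill_value}")
print(f"std. dev. of observed values: {spread_before:.3f}")
print(f"std. dev. after imputation:   {spread_after:.3f}")
```

The imputed values sit exactly at the mean, so they add records without adding any spread, which is why the standard deviation shrinks.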
21 Preliminary Steps (Pre-processing and Cleaning
the Data)
- Normalizing (standardizing) the data
- To normalize the data, we subtract the mean from
each value, and divide by the standard deviation
of the resulting deviations from the mean
- Expressing each value as the number of standard
deviations away from the mean: the z-score
- Needed if variables are in different units, e.g.
hours, thousands of dollars, etc.
- Clustering algorithms measure how far records are
from each other, so the variables need a common
scale for distance.
- Data mining software, including XLMiner,
typically has an option that normalizes the data
in those algorithms where it may be required
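The z-score calculation described above is short enough to write out. The two variables are hypothetical, and the sample standard deviation is an assumption, since the slides do not say whether the sample or population version is meant.

```python
from statistics import mean, stdev

def z_scores(values):
    """Subtract the mean, then divide by the (sample) standard deviation."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical variables measured in different units.
hours = [2.0, 4.0, 6.0, 8.0]
thousands_of_dollars = [20.0, 40.0, 60.0, 80.0]

z_hours = z_scores(hours)
z_dollars = z_scores(thousands_of_dollars)
print([round(z, 3) for z in z_hours])
print([round(z, 3) for z in z_dollars])
```

After normalization the two variables are on the same scale, so neither dominates a distance calculation merely because of its units.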
22 Preliminary Steps
- Use and creation of partitions
- Training partition
- The largest partition
- Contains the data used to build the various
models
- The same training partition is generally used to
develop multiple models.
- Validation partition
- Used to assess the performance of each model
- Used to compare models and pick the best one.
- In classification and regression tree algorithms
the validation partition may be used
automatically to tune and improve the model.
- Test partition
- Sometimes called the "holdout" or "evaluation"
partition; used to assess the performance of a
chosen model with new data.
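A random three-way partition can be sketched as follows. The 50/30/20 split, the function name `partition`, and the fixed seed are assumptions chosen for illustration; the slides only say that the training partition is the largest.

```python
import random

def partition(records, train=0.5, validation=0.3, seed=42):
    """Randomly split records into training, validation and test sets."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * validation)
    training = shuffled[:n_train]
    validation_set = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains
    return training, validation_set, test

# Hypothetical dataset of 100 record IDs.
data = list(range(100))
train_set, valid_set, test_set = partition(data)
print(len(train_set), len(valid_set), len(test_set))
```

Every record lands in exactly one partition, so the test set stays untouched while models are built and compared.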
23 The Three Data Partitions and Their Role in the
Data Mining Process
24 Simple Regression Example
25 Simple Regression Model
- Make a prediction about the starting salary of a
current college graduate
- Data set of starting salaries of recent college
graduates
Data Set
Compute Average Salary
How certain are we of this prediction? There is
variability in the data.
26 Simple Regression Model
- Use total variation as an index of uncertainty
about our prediction
- The smaller the amount of total variation the
more accurate (certain) will be our prediction.
27 Simple Regression Model
- How can we explain the variability? Perhaps it
depends on the student's GPA
28 Simple Regression Model
- Find a linear relationship between GPA and
starting salary
- As GPA increases/decreases, starting salary
increases/decreases
29 Simple Regression Model
- Least Squares Method to find the regression model
- Choose a and b in the regression model (equation)
so that it minimizes the sum of the squared
deviations: actual Y value minus predicted Y
value (Y-hat)
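Written out, the least-squares criterion for a one-predictor model takes the standard form below. The index i, the sample size n, and the residual symbol are notation introduced here for illustration, with X standing for GPA and Y for starting salary.

```latex
\hat{Y}_i = a + b X_i, \qquad \hat{u}_i = Y_i - \hat{Y}_i

\min_{a,\,b} \; \sum_{i=1}^{n} \hat{u}_i^{2}
  = \min_{a,\,b} \; \sum_{i=1}^{n} \left( Y_i - a - b X_i \right)^{2}

b = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}
         {\sum_{i=1}^{n} (X_i - \bar{X})^{2}},
\qquad
a = \bar{Y} - b \bar{X}
```

The last line is the closed-form solution that statistical software computes for the single-predictor case.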
30 Simple Regression Model
a = 4,779, b = 5,370 (a computer program
computed these values)
- u-hat is a residual value
- The sum of all u-hats is zero
- The sum of all u-hats squared is the total
variation not explained by the model
- Unexplained variation is 7,425,926
31 Simple Regression Model
Total Variation = 23,000,000
32 Simple Regression Model
Total Unexplained Variation = 7,425,726
33 Simple Regression Model
- Relative Goodness of Fit
- Summarize the improvement in prediction using
the regression model
- Compute R², the coefficient of determination
The regression model (equation) is a better
predictor than guessing the average salary. GPA is
a more accurate predictor of starting salary than
guessing the average. R² is the performance
measure for the model.
Predicted Starting Salary = 4,779 + 5,370 × GPA
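As a worked illustration, R² can be computed from the two quantities reported on the earlier slides: a total variation of 23,000,000 and an unexplained variation of roughly 7.43 million. The GPA of 3.0 in the second line is a hypothetical input, not a value from the data set.

```latex
R^{2} = 1 - \frac{\text{unexplained variation}}{\text{total variation}}
      \approx 1 - \frac{7.43 \times 10^{6}}{23.0 \times 10^{6}}
      \approx 0.68

\widehat{\text{Salary}}\,(\text{GPA} = 3.0) = 4{,}779 + 5{,}370 \times 3.0 = 20{,}889
```

Read this way, the regression on GPA accounts for about two thirds of the variation that remains when the average salary alone is used as the prediction.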
34 Building a Model: An Example with Linear
Regression
35 Problems