Title: Outline
1 Outline
- The five pillars of data mining
- Supervised and unsupervised learning
2 Data mining
- The process of selecting, exploring, modifying, modeling, and assessing large amounts of data to uncover previously unknown patterns
3 SEMMA
- Sample the data by creating one or more data tables
- Explore the data by searching for (i) anticipated relationships and trends, (ii) unanticipated relationships and trends, and (iii) anomalies
- Modify the data by transforming variables and combining existing variables into new variables
- Model the data by searching for a combination of the data that reliably predicts a desired outcome
- Assess the data by evaluating the usefulness and reliability of the findings from the data mining process
4 Sample the data and create data tables
Cases and variables
Objects and attributes
5 Examine anticipated relationships: electricity consumption and temperature
6 Examine the presence of outliers
Total nitrogen concentrations in Swedish rivers determined by two different methods
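One way to screen paired measurements like these for outliers is to look at the difference between the two methods for each sample. The sketch below is an illustration only: the data values, the function name `flag_outliers`, and the rule itself (a difference more than three median absolute deviations away from the median difference) are assumptions, not taken from the slide.

```python
import statistics

def flag_outliers(method_a, method_b, threshold=3.0):
    """Return indices of samples whose between-method difference is unusual."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    center = statistics.median(diffs)
    # Median absolute deviation: a spread measure that outliers barely affect.
    mad = statistics.median([abs(d - center) for d in diffs])
    if mad == 0:
        return []  # no spread to scale by, so this rule cannot be applied
    return [i for i, d in enumerate(diffs) if abs(d - center) / mad > threshold]

# Hypothetical concentrations for six samples, measured by two methods.
method_a = [1.0, 2.1, 3.0, 4.2, 5.0, 6.1]
method_b = [1.1, 2.0, 3.1, 4.0, 9.0, 6.0]
flagged = flag_outliers(method_a, method_b)
print(flagged)
```

Only the fifth sample, where the two methods disagree by about 4 units, is flagged here. A flagged sample is a candidate for closer inspection, not automatically an error.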
7 Modifying inputs
- Transforming inputs or outputs
- Combining existing variables into new variables
- Aggregating inputs
- Reducing the dimension of the inputs
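The first three kinds of modification can be sketched in a few lines. The records, field names, and the choice of a log transform and a ratio are hypothetical, chosen only to make each step concrete. Dimension reduction is left out because it needs more machinery than fits here.

```python
import math

# Hypothetical cases: monthly income, monthly loan payment, and daily spending.
cases = [
    {"income": 3000.0, "loan_payment": 600.0, "daily_spend": [20.0, 35.0, 5.0]},
    {"income": 12000.0, "loan_payment": 1200.0, "daily_spend": [80.0, 120.0, 100.0]},
]

def modify(case):
    """Return a new record with transformed, combined, and aggregated inputs."""
    return {
        # Transforming an input: a log scale compresses a skewed variable.
        "log_income": math.log(case["income"]),
        # Combining existing variables into a new variable.
        "payment_to_income": case["loan_payment"] / case["income"],
        # Aggregating inputs: many daily values summarised by their mean.
        "mean_daily_spend": sum(case["daily_spend"]) / len(case["daily_spend"]),
    }

modified = [modify(case) for case in cases]
for record in modified:
    print(record)
```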
8 Model selection: credit scoring
- Candidate predictors
- Age
- Sex
- Income
- Marital status
- Education
- Savings
- Loans
- Payment records
- Homeowner
- ...
- Subset selection aims to produce a model that is interpretable and possibly has lower prediction error than the model with all candidate predictors
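A minimal sketch of one form of subset selection: fit a least-squares model on every subset of the candidate predictors and keep the subset with the lowest error on held-out data. All names (`solve`, `fit`, `mse`, `best_subset`) and the numbers are hypothetical. The three columns stand in for three candidate predictors, and the synthetic outcome depends only on the first one. An exhaustive search like this is only feasible for a small number of predictors.

```python
from itertools import combinations

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit(rows, y, subset):
    """Least-squares coefficients (intercept first) using only columns in subset."""
    design = [[1.0] + [row[j] for j in subset] for row in rows]
    k = len(subset) + 1
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

def mse(rows, y, subset, coef):
    """Mean squared prediction error of a fitted model on the given rows."""
    preds = [coef[0] + sum(c * row[j] for c, j in zip(coef[1:], subset)) for row in rows]
    return sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)

def best_subset(train_x, train_y, valid_x, valid_y):
    """Try every non-empty subset, smallest first, and keep the best on validation."""
    best = None
    p = len(train_x[0])
    for size in range(1, p + 1):
        for subset in combinations(range(p), size):
            coef = fit(train_x, train_y, subset)
            err = mse(valid_x, valid_y, subset, coef)
            # Require a clear improvement, so ties go to the smaller subset.
            if best is None or err < best[0] - 1e-12:
                best = (err, subset)
    return best

# Synthetic data: the outcome is 1 + 2 * (first predictor); the rest are irrelevant.
train_x = [[1, 5, 2], [2, 3, 7], [3, 8, 1], [4, 1, 6], [5, 6, 3], [6, 2, 9]]
train_y = [1 + 2 * row[0] for row in train_x]
valid_x = [[7, 4, 4], [8, 9, 2], [9, 0, 5]]
valid_y = [1 + 2 * row[0] for row in valid_x]

best_err, best_columns = best_subset(train_x, train_y, valid_x, valid_y)
print(best_columns, best_err)
```

On this synthetic data the search keeps only the first column, which is the interpretable model the slide has in mind. With noisy real data the winning subset depends on the validation sample, so the result should be read with that in mind.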
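9 Bias, Variance and Model Complexity
[Figure: prediction error plotted against model complexity (low to high) for the training sample and the test sample. Low complexity corresponds to high bias and low variance; high complexity corresponds to low bias and high variance.]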
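The gap between training and test error can be illustrated with a small experiment. The sketch below uses k-nearest-neighbour averaging as a stand-in learner; that choice is an assumption, not something the slide specifies. For this learner, small k means high complexity and large k means low complexity. The data and function names are hypothetical.

```python
def knn_predict(train, x, k):
    """Average the outcomes of the k training points closest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mean_squared_error(train, sample, k):
    """Mean squared error on sample of a k-NN model built from train."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in sample) / len(sample)

# Hypothetical (feature, outcome) pairs: a roughly linear trend with scatter.
train = [(0.0, 0.2), (1.0, 0.9), (2.0, 2.3), (3.0, 2.8), (4.0, 4.1), (5.0, 5.2)]
test = [(0.5, 0.5), (1.5, 1.4), (2.5, 2.6), (3.5, 3.4), (4.5, 4.6)]

# errors[k] = (training error, test error)
errors = {
    k: (mean_squared_error(train, train, k), mean_squared_error(train, test, k))
    for k in (1, 2, 3, 6)
}
for k, (train_err, test_err) in errors.items():
    print(k, train_err, test_err)
```

With k = 1 the training error is zero by construction, because each training point is its own nearest neighbour, yet the test error is not zero. With k equal to the sample size every prediction is the overall mean, which is the high-bias end of the scale. Where the test error is lowest depends on the data.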
10 Statistical learning
- Supervised learning (prediction, classification)
- We have a training set of data, in which we observe the outcome and feature measurements for a set of objects
- Using these data we build a prediction model, or learner, which will enable us to predict the outcome for new unseen objects
- Unsupervised learning (association analysis, clustering)
- We observe only the features and have no measurements of the outcome
- Our task is to describe how the data are organized and clustered
Hastie, Tibshirani, and Friedman: The Elements of Statistical Learning
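To make the unsupervised setting concrete, the sketch below clusters a single feature into two groups without using any outcome. It is a two-cluster, one-dimensional variant of k-means, chosen here as an assumption for illustration; the function name `two_means` and the values are hypothetical.

```python
def two_means(values, iterations=10):
    """Split 1-D values into two clusters by alternating assignment and updates."""
    centers = [min(values), max(values)]
    groups = ([], [])
    for _ in range(iterations):
        groups = ([], [])
        for v in values:
            # Assign each value to the closer of the two current centers.
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Move each center to the mean of the values assigned to it.
        centers = [sum(g) / len(g) for g in groups]
    return centers, groups

# Hypothetical feature measurements; no outcome is observed.
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
centers, groups = two_means(values)
print(centers)
print(groups)
```

The function assumes the values are not all identical, so that both clusters stay non-empty. A supervised learner would instead be given an outcome for each value and be judged by how well it predicts that outcome for new objects.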
11 Statistical learning problems: some examples
- Supervised learning (prediction, classification)
- Predict tomorrow's electricity consumption from weather forecasts and calendar records (season, weekday, holiday)
- Identify the numbers in a handwritten ZIP code from a digitized image
- Unsupervised learning (association analysis)
- Identify buying patterns that can be used to design sales promotions
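For the buying-pattern example, association analysis is commonly summarised by the support and confidence of a rule such as "bread implies butter". The sketch below computes both from a handful of hypothetical baskets; the function names and the data are assumptions for illustration.

```python
def count(baskets, items):
    """Number of baskets that contain every item in items."""
    items = set(items)
    return sum(1 for basket in baskets if items <= basket)

def support(baskets, items):
    """Fraction of all baskets that contain every item in items."""
    return count(baskets, items) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Among baskets with the antecedent, the fraction that also hold the consequent."""
    both = set(antecedent) | set(consequent)
    return count(baskets, both) / count(baskets, antecedent)

# Hypothetical shopping baskets.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]
rule_support = support(baskets, {"bread", "butter"})
rule_confidence = confidence(baskets, {"bread"}, {"butter"})
print(rule_support, rule_confidence)
```

Here 3 of the 5 baskets contain both items, and 3 of the 4 baskets with bread also contain butter. A rule with high support and high confidence is a candidate for a joint promotion.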
12 Supervised learning: statistical terminology
- Prediction of one or more outputs using observations of one or more inputs
- Statistical terminology
- Inputs: predictors, independent variables, explanatory variables
- Outputs: responses, dependent variables
13 Naming convention
- Regression
- Prediction of quantitative outputs using one or more inputs
- Classification
- Prediction of qualitative outputs using observations of one or more inputs
14 Prediction by learning from data
Assume that we have a data set which shows the outcome (response) y for a set of investigated objects with features x1, ..., xp.
Prediction by learning from data implies that we derive a function that can be used to foresee the outcome for new objects (with known or observed features).
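As a minimal sketch of this idea, take the simplest case of a single feature (p = 1) and derive the function by least squares. The choice of a straight line, the function name `learn`, and the temperature and consumption figures are all assumptions made for illustration.

```python
def learn(xs, ys):
    """Derive a prediction function f(x) = intercept + slope * x by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x

    def f(x):
        # Predicted outcome for a new object with observed feature x.
        return intercept + slope * x

    return f

# Hypothetical training data: outdoor temperature and electricity consumption.
temperature = [-10.0, 0.0, 10.0, 20.0]
consumption = [50.0, 40.0, 30.0, 20.0]

f = learn(temperature, consumption)
print(f(5.0))  # prediction for a new day with a temperature of 5 degrees
```

The returned `f` is the learned function: it was derived from objects with known outcomes and can be applied to new objects whose features are observed.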
15 Some major types of quantitative prediction models
- Linear or nonlinear regression models with i.i.d. error terms
- Time series regression models with stochastic noise
- Transfer function models