Title: Data mining and the knowledge discovery process
1 Data mining and the knowledge discovery process
- Summer Course 2006
- H.H.L.M. Donkers
2 Content
- Opening / acquaintance
- What is data mining
- Data mining methodology
- Course perspective
- Course contents
3 Data - Information - Knowledge -
- Data: symbols
- Information: data that are processed to be useful; provides answers to "who", "what", "where", and "when" questions
- Knowledge: application of data and information; answers "how" questions
- Understanding: appreciation of "why"
- Wisdom: evaluated understanding
(Russell Ackoff - http://www.outsights.com/systems/dikw/dikw.htm)
4 Data - Information - Knowledge -
- Source: http://www.outsights.com/systems/dikw/dikw.htm
5 What is Data Mining: Traditionally
- "Data mining is the extraction of implicit, previously unknown, and potentially useful information from data."
- Witten & Frank (2000). Data Mining.
6 What is Data Mining: Traditionally
- The application of specific algorithms for extracting patterns from data; it is a part of knowledge discovery from databases.
- Fayyad (1997). From data mining to knowledge discovery in databases.
7 What is Data Mining: Traditionally
- "Data mining is a process, not just a series of statistical analyses."
- SAS Institute (2003). Finding the solution to data mining.
8 What is Data Mining: Traditionally
- Computer Science
  - (Semi-)automated application of algorithms for pattern discovery
  - Algorithms developed in the field of Artificial Intelligence (machine learning)
  - Part of the process of knowledge discovery
- Statistics
  - Process of discovering patterns in data
  - (Manual) application of a series of statistical techniques (among which machine learning)
  - Incorporates
    - Exploration
    - Sampling
    - Modeling
    - Validation
Data mining Statistics Marketing
9 What is Data Mining: A Fusion
- "An analytic process designed to explore data in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. The ultimate goal is prediction."
- Statsoft (2003). Data Mining Techniques.
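The explore-then-validate idea in this definition can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are randomly generated, the "pattern" is a single threshold rule, and the names `find_threshold` and `accuracy` are made up for this sketch rather than taken from any tool used in the course.

```python
import random

def find_threshold(rows):
    """Explore: pick the threshold on x that best separates the labels.

    Each row is (x, label) with a boolean label; the candidate pattern is
    the rule 'label is True when x >= threshold'.
    """
    best_threshold, best_hits = None, -1
    for candidate, _ in rows:
        hits = sum((x >= candidate) == label for x, label in rows)
        if hits > best_hits:
            best_threshold, best_hits = candidate, hits
    return best_threshold

def accuracy(threshold, rows):
    """Validate: fraction of rows on which the rule predicts the label."""
    return sum((x >= threshold) == label for x, label in rows) / len(rows)

# Hypothetical data: the label is True roughly when x >= 50, with noise.
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.uniform(0, 100)
    flipped = rng.random() < 0.1          # about 10% of labels are flipped
    data.append((x, (x >= 50) != flipped))

explored, new_subset = data[:150], data[150:]
threshold = find_threshold(explored)
print("threshold found:", round(threshold, 1))
print("accuracy on explored data:", accuracy(threshold, explored))
print("accuracy on new subset:   ", accuracy(threshold, new_subset))
```

The score on the new subset is the one that matters for prediction; the score on the explored data is typically optimistic because the threshold was tuned on exactly those rows.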
10 What is Data Mining: A Fusion
- "An information extraction activity whose goal is to discover hidden facts contained in databases. Using a combination of machine learning, statistical analysis, modeling techniques and database technology, data mining finds patterns and subtle relationships in data and infers rules that allow the prediction of future results."
- Rudjer Boskovic Institute (2001). DMS Tutorial.
11 Data Mining: In This Course
- We use the book by Witten & Frank
- Computer science (machine learning) approach
- Emphasis on algorithms for pattern discovery and rule extraction
  - What are the underlying models
  - What are the properties of the algorithms
  - When to use (for which tasks)
  - How to apply and to tune
  - How to interpret and assess the results
12 Data Mining Process
- These algorithms are only part of a process that computer scientists call Knowledge Discovery and the statisticians call Data Mining
- The process starts with the recognition of a problem and ends with the control of a deployed solution
- The whole process needs to be supported for a successful application
13 Methodologies for Data Mining
- As Data Mining is coming of age, several methodologies have been developed, each with its own perspective. We will discuss three of them:
- Fayyad et al. (Computer science)
  - E.g., WEKA
- SEMMA (SAS) (Statistics)
  - SAS Enterprise Miner, R
- CRISP-DM (SPSS, OHRA, a.o.) (Business)
  - SPSS Clementine
14 Fayyad's KDD Methodology
[Figure: KDD process diagram, starting from "data"]
15 SEMMA Methodology
Supported by the SAS Enterprise Miner environment
16 CRISP-DM Methodology
- Developed by data-mining companies (SPSS, NCR, OHRA, DaimlerChrysler), funded by the European Commission
- Tool-independent / industry-independent
- Hierarchical process model
  - 1 Generic phases
  - 2 Generic tasks
  - 3 Specific tasks
  - 4 Task instances
- Supported by SPSS Clementine environment
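One way to picture the four-level hierarchical process model is as a nested data structure. The sketch below is an assumption-laden illustration, not part of CRISP-DM itself: the class names, the `outline` helper, and the example specific task and task instance are all invented here.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TaskInstance:
    """Level 4: a record of what was actually done in one project."""
    description: str

@dataclass
class SpecificTask:
    """Level 3: a generic task worked out for a particular situation."""
    name: str
    instances: List[TaskInstance] = field(default_factory=list)

@dataclass
class GenericTask:
    """Level 2: a task that applies to any data mining project."""
    name: str
    specific_tasks: List[SpecificTask] = field(default_factory=list)

@dataclass
class Phase:
    """Level 1: one of the generic phases."""
    name: str
    generic_tasks: List[GenericTask] = field(default_factory=list)

def outline(phase):
    """Return the hierarchy under `phase` as indented text lines."""
    lines = [phase.name]
    for task in phase.generic_tasks:
        lines.append("  " + task.name)
        for specific in task.specific_tasks:
            lines.append("    " + specific.name)
            for instance in specific.instances:
                lines.append("      " + instance.description)
    return lines

# Hypothetical example content, invented for illustration only.
preparation = Phase("Data preparation", [
    GenericTask("Clean data", [
        SpecificTask("Handle missing values", [
            TaskInstance("Replaced missing ages in the customer table by the median"),
        ]),
    ]),
    GenericTask("Format data"),
])

print("\n".join(outline(preparation)))
```

The two upper levels are the same for every project, while the two lower levels are filled in per situation and per project, which is what keeps the model tool-independent.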
17CRISP-DM Methodology
TASKS Business objective Assess situation Data
mining goals Project plan
Business understanding
Data understanding
Data Preparation
Deployment
Modeling
Evaluation
18CRISP-DM Methodology
TASKS Collect data Describe data Explore
data Verify data quality
Business understanding
Data understanding
Data Preparation
Deployment
Modeling
Evaluation
19CRISP-DM Methodology
TASKS Select data Clean data Construct
data Integrate data Format data
Business understanding
Data understanding
Data Preparation
Deployment
Modeling
Evaluation
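The five data preparation tasks can be walked through on a toy example. This is a minimal sketch under assumptions: the customer records, the orders, the mean imputation, and the derived attribute `is_senior` are hypothetical and chosen only to give each task something to do.

```python
# Hypothetical raw data, invented for illustration.
customers = [
    {"id": 1, "age": 34, "city": "Maastricht", "note": "vip"},
    {"id": 2, "age": None, "city": "Heerlen", "note": ""},
    {"id": 3, "age": 51, "city": "Maastricht", "note": ""},
]
orders = {1: [20.0, 35.5], 2: [12.0], 3: []}

# Select data: keep only the attributes relevant to the mining goal.
selected = [{k: c[k] for k in ("id", "age", "city")} for c in customers]

# Clean data: fill missing ages with the mean of the known ages.
known_ages = [c["age"] for c in selected if c["age"] is not None]
mean_age = sum(known_ages) / len(known_ages)
cleaned = [
    {**c, "age": c["age"] if c["age"] is not None else mean_age}
    for c in selected
]

# Construct data: derive a new attribute from an existing one.
constructed = [{**c, "is_senior": c["age"] >= 50} for c in cleaned]

# Integrate data: merge in information from a second source.
integrated = [
    {**c, "total_spent": sum(orders.get(c["id"], []))} for c in constructed
]

# Format data: produce rows with a fixed column order for the modeling tool.
columns = ["age", "city", "is_senior", "total_spent"]
table = [[c[col] for col in columns] for c in integrated]

for row in table:
    print(row)
```

In practice the tasks are rarely run once in this order; cleaning choices such as the imputation used here are often revisited after the first models have been assessed.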
20CRISP-DM Methodology
TASKS Select modeling techniques Design the
test Build model Assess model
Business understanding
Data understanding
Data Preparation
Deployment
Modeling
Evaluation
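The four modeling tasks can be illustrated with a small comparison of two techniques. The sketch below makes several assumptions: the two candidate techniques (a majority-class baseline and a 1-nearest-neighbour rule), the k-fold cross-validation test design, and the toy data are choices made for this example and are not prescribed by CRISP-DM.

```python
from collections import Counter

# Select modeling techniques: two simple candidates.
def majority_class(train):
    """Build a baseline model that always predicts the most frequent label."""
    label = Counter(lbl for _, lbl in train).most_common(1)[0][0]
    return lambda x: label

def nearest_neighbour(train):
    """Build a 1-nearest-neighbour model on a single numeric attribute."""
    return lambda x: min(train, key=lambda row: abs(row[0] - x))[1]

# Design the test: k-fold cross-validation.
def k_folds(rows, k):
    """Split rows into k disjoint folds by taking every k-th row."""
    return [rows[i::k] for i in range(k)]

def cross_validate(build, rows, k=3):
    """Build on k-1 folds, assess on the held-out fold, average the accuracy."""
    folds = k_folds(rows, k)
    scores = []
    for i, test in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = build(train)                       # Build model
        correct = sum(model(x) == lbl for x, lbl in test)
        scores.append(correct / len(test))         # Assess model
    return sum(scores) / len(scores)

# Hypothetical data: one numeric attribute and a class label.
rows = [(1.0, "low"), (2.1, "low"), (3.3, "low"), (4.6, "low"), (6.0, "low"),
        (7.5, "low"), (9.1, "high"), (10.8, "high"), (12.6, "high"),
        (14.5, "high")]

for name, build in [("majority class", majority_class),
                    ("1-nearest neighbour", nearest_neighbour)]:
    print(name, round(cross_validate(build, rows), 3))
```

Fixing the test design before building any model means both techniques are assessed on exactly the same held-out rows, so their scores can be compared directly.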
21CRISP-DM Methodology
TASKS Evaluate results Review
process Determine next steps
Business understanding
Data understanding
Data Preparation
Deployment
Modeling
Evaluation
22CRISP-DM Methodology
TASKS Plan deployment Plan monitoring and
maintenance Final report Review project
Business understanding
Data understanding
Data Preparation
Deployment
Modeling
Evaluation
23 A Comparison
Business understanding
Data understanding
Data Preparation
Deployment
Modeling
Evaluation
24 A Small Poll (July 2002)
Source: http://www.kdnuggets.com/polls/2002/methodology.htm
25 Poll repeated (2004)
Source: http://www.kdnuggets.com/polls/2004/data_mining_methodology.htm
26 Course perspective and goal
- The perspective is from computer science (machine learning): Fayyad's approach
- The emphasis is on techniques for the automated discovery of patterns in data and the automated extraction of rules (the model phase of SEMMA and CRISP-DM)
- The goal is to get acquainted with these techniques, so you can use them in the methodology of your choice
27 Course contents
- Data preparation (Tuesday)
  - Selection, preprocessing, transformation
- Techniques, algorithms and models
  - Decision trees (Monday)
  - Instance based and Bayesian learning (Wednesday)
  - Neural networks (Wednesday)
  - Association rules (Thursday)
  - Clustering (Thursday)
  - Support Vector Machines (Friday)
- Evaluation of learned models (Tuesday)
28 Course contents
- For each technique you learn:
  - For which tasks it is suitable
    - Classification, rules, prediction, ...
  - Restrictions on input data (numerical, symbolic, etc.)
  - What algorithms are available
  - What parameters should be tuned
  - How to interpret the results
  - How to evaluate the model