Data mining and the knowledge discovery process


1
Data mining and the knowledge discovery process

Summer Course 2006
H.H.L.M. Donkers

2
Content

Opening / acquaintance
What is data mining
Data mining methodology
Course perspective
Course contents

3
Data - Information - Knowledge -

Data: symbols
Information: data that are processed to be useful; provides answers to "who", "what", "where", and "when" questions
Knowledge: application of data and information; answers "how" questions
Understanding: appreciation of "why"
Wisdom: evaluated understanding
(Russell Ackoff - http://www.outsights.com/systems/dikw/dikw.htm)

4
Data - Information - Knowledge -

http://www.outsights.com/systems/dikw/dikw.htm

5
What is Data Mining: Traditionally

Data mining is the extraction of implicit,
previously unknown, and potentially useful
information from data.
Witten & Frank (2000). Data Mining.

6
What is Data Mining: Traditionally

The application of specific algorithms for
extracting patterns from data; it is a part of
knowledge discovery from databases.
Fayyad (1997). From data mining to knowledge
discovery in databases.

7
What is Data Mining: Traditionally

Data mining is a process, not just a series of
statistical analyses.
SAS Institute (2003). Finding the solution to
data mining.

8
What is Data Mining: Traditionally

Computer Science:
(Semi-)automated application of algorithms for pattern discovery
Algorithms developed in the field of Artificial Intelligence (machine learning)
Part of the process of knowledge discovery

Statistics:
Process of discovering patterns in data
(Manual) application of a series of statistical techniques (among which machine learning)
Incorporates: exploration, sampling, modeling, validation

(Diagram labels: Data mining, Statistics, Marketing)
9
What is Data Mining: A Fusion

An analytic process designed to explore data in
search of consistent patterns and/or systematic
relationships between variables, and then to
validate the findings by applying the detected
patterns to new subsets of data. The ultimate
goal is prediction.
StatSoft (2003). Data Mining Techniques.
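The explore-then-validate loop in this definition can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not StatSoft's procedure: the synthetic data, the 140/60 split, and the helper names `find_threshold` and `accuracy` are all invented for this sketch.

```python
import random

def find_threshold(rows):
    """Explore: pick the cut on x that best separates the labels on these rows."""
    best_cut, best_hits = None, -1
    for cut in sorted({x for x, _ in rows}):
        hits = sum((x >= cut) == label for x, label in rows)
        if hits > best_hits:
            best_cut, best_hits = cut, hits
    return best_cut

def accuracy(cut, rows):
    """Fraction of rows for which the rule 'x >= cut' predicts the label."""
    return sum((x >= cut) == label for x, label in rows) / len(rows)

# Hypothetical synthetic data: the label is True when x exceeds 50,
# with roughly 10% of the labels flipped as noise.
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.uniform(0, 100)
    label = (x > 50) != (rng.random() < 0.1)
    data.append((x, label))

# Hold out a subset that plays the role of "new" data.
train, holdout = data[:140], data[140:]

cut = find_threshold(train)            # pattern found during exploration
train_acc = accuracy(cut, train)       # how well it fits the explored data
holdout_acc = accuracy(cut, holdout)   # validation on data not used for the search

print(f"pattern: x >= {cut:.1f}")
print(f"accuracy on explored data: {train_acc:.2f}")
print(f"accuracy on new subset:    {holdout_acc:.2f}")
```

The number that matters for prediction is the accuracy on the held-out subset; the accuracy on the explored data tends to be optimistic because the pattern was chosen to fit it.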

10
What is Data Mining: A Fusion

An information extraction activity whose goal is
to discover hidden facts contained in databases.
Using a combination of machine learning,
statistical analysis, modeling techniques and
database technology, data mining finds patterns
and subtle relationships in data and infers rules
that allow the prediction of future results.
Rudjer Boskovic Institute (2001). DMS Tutorial.

11
Data Mining: In This Course

We use the book by Witten & Frank
Computer science (machine learning) approach
Emphasis on algorithms for pattern discovery and
rule extraction
What are the underlying models?
What are the properties of the algorithms?
When to use them (for which tasks)?
How to apply and tune them?
How to interpret and assess the results?
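As one concrete instance of rule extraction, here is a minimal sketch of a one-attribute rule learner in the spirit of the 1R method covered in Witten & Frank's book. The toy table, the attribute names, and the `one_r` helper are hypothetical and exist only for this illustration; the course's own tools may implement the method differently.

```python
from collections import Counter, defaultdict

def one_r(rows, attributes, target):
    """Return (attribute, rules, errors) for the single attribute whose
    value -> majority-class rules make the fewest errors on rows."""
    best = None
    for attr in attributes:
        # Count how often each class occurs for each value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[target]] += 1
        # One rule per value: predict the most frequent class.
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Every row outside the majority class of its value is an error.
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

# Hypothetical toy data, invented for this sketch.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]

attr, rules, errors = one_r(rows, ["outlook", "windy"], "play")
print(f"best attribute: {attr} ({errors} error(s) on {len(rows)} rows)")
for value, predicted in rules.items():
    print(f"  if {attr} = {value} then play = {predicted}")
```

The sketch touches several of the questions above: the underlying model is a set of rules on one attribute, the error count is a first way to assess the result, and the tie on one value shows where the rules are least reliable.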

12
Data Mining Process

These algorithms are only part of a process that
computer scientists call Knowledge Discovery and
statisticians call Data Mining.
The process starts with the recognition of a
problem and ends with the control of a deployed
solution.
The whole process needs to be supported for a
successful application.

13
Methodologies for Data Mining

As Data Mining is coming of age, several
methodologies have been developed, each with
its own perspective. We will discuss three of
them:
Fayyad et al. (Computer science)
  E.g., WEKA
SEMMA (SAS) (Statistics)
  SAS Enterprise Miner, R
CRISP-DM (SPSS, OHRA, among others) (Business)
  SPSS Clementine

14
Fayyad's KDD Methodology
15
SEMMA Methodology
Supported by the SAS Enterprise Miner environment
16
CRISP-DM Methodology

Developed by data-mining companies (SPSS, NCR,
OHRA, DaimlerChrysler), funded by the European
Commission
Tool-independent / industry-independent
Hierarchical process model:
1. Generic phases
2. Generic tasks
3. Specific tasks
4. Task instances
Supported by the SPSS Clementine environment

17
CRISP-DM Methodology
Phases: Business understanding, Data understanding, Data preparation, Modeling, Evaluation, Deployment
Phase: Business understanding
TASKS: Business objective, Assess situation, Data mining goals, Project plan
18
CRISP-DM Methodology
Phase: Data understanding
TASKS: Collect data, Describe data, Explore data, Verify data quality
19
CRISP-DM Methodology
Phase: Data preparation
TASKS: Select data, Clean data, Construct data, Integrate data, Format data
20
CRISP-DM Methodology
Phase: Modeling
TASKS: Select modeling techniques, Design the test, Build model, Assess model
21
CRISP-DM Methodology
Phase: Evaluation
TASKS: Evaluate results, Review process, Determine next steps
22
CRISP-DM Methodology
Phase: Deployment
TASKS: Plan deployment, Plan monitoring and maintenance, Final report, Review project
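The tasks listed on the preceding CRISP-DM slides can be gathered into one small data structure, which makes the phase-to-task layers of the hierarchical model easy to inspect. This is a sketch for orientation only: the task wording is taken from the slides, the phase order is the usual CRISP-DM order, and the names `CRISP_DM` and `checklist` are hypothetical.

```python
# Phase -> generic tasks, with task names as they appear on the slides.
CRISP_DM = {
    "Business understanding": [
        "Business objective", "Assess situation",
        "Data mining goals", "Project plan",
    ],
    "Data understanding": [
        "Collect data", "Describe data",
        "Explore data", "Verify data quality",
    ],
    "Data preparation": [
        "Select data", "Clean data", "Construct data",
        "Integrate data", "Format data",
    ],
    "Modeling": [
        "Select modeling techniques", "Design the test",
        "Build model", "Assess model",
    ],
    "Evaluation": [
        "Evaluate results", "Review process", "Determine next steps",
    ],
    "Deployment": [
        "Plan deployment", "Plan monitoring and maintenance",
        "Final report", "Review project",
    ],
}

def checklist(phases=CRISP_DM):
    """Flatten the phase -> task mapping into a numbered text checklist."""
    lines = []
    for number, (phase, tasks) in enumerate(phases.items(), start=1):
        lines.append(f"{number}. {phase}")
        lines.extend(f"   - {task}" for task in tasks)
    return lines

print("\n".join(checklist()))
```

A linear checklist is a simplification: in practice a project often moves back to an earlier phase, for example from Evaluation back to Business understanding.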
23
A Comparison
(Figure labels: Business understanding, Data understanding, Data Preparation, Deployment, Modeling, Evaluation)
24
A Small Poll (July 2002)
Source: http://www.kdnuggets.com/polls/2002/methodology.htm
25
Poll repeated (2004)
Source: http://www.kdnuggets.com/polls/2004/data_mining_methodology.htm
26
Course perspective and goal

The perspective is from computer science
(machine learning): Fayyad's approach
The emphasis is on techniques for the automated
discovery of patterns in data and the automated
extraction of rules (the modeling phase of SEMMA
and CRISP-DM)
The goal is to get acquainted with these
techniques, so you can use them in the
methodology of your choice

27
Course contents