Part I Data Mining Fundamentals Chapter 1 Data Mining: A First View

1
Part I Data Mining Fundamentals
Chapter 1 Data Mining: A First View
Jason C. H. Chen, Ph.D., Professor of MIS
School of Business Administration, Gonzaga University
Spokane, WA 99223
chen_at_jepson.gonzaga.edu
2
1.1 Data Mining: A Definition
3
1.1 Data Mining: A Definition

The process of employing one or more computer
learning techniques to automatically analyze and
extract knowledge from data.

4
Induction-based Learning

The process of forming general concept
definitions by observing specific examples of
concepts to be learned.

Knowledge Discovery in Databases (KDD)

The application of the scientific method to data
mining. Data mining is one step of the KDD
process.

5
Data Mining Examples

A telephone company used a data mining tool to
analyze its customer data warehouse. The tool
found about 10,000 supposedly residential
customers who were spending over $1,000 a month
on phone bills.
After further study, the phone company discovered
that they were really small business owners
trying to avoid paying business rates.

6
Other Data Mining Examples

65% of customers who did not use the credit card
in the last six months are 88% likely to cancel
their accounts.
If age < 30 and income < $25,000 and credit
rating < 3 and credit amount > $25,000, then the
minimum loan term is 10 years.
82% of customers who bought a new TV 27" or
larger are 90% likely to buy an entertainment
center within the next 4 weeks.
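
The loan-term example reads naturally as a conditional test. The following is a minimal Python sketch of that one rule, assuming an applicant described by four values; the function and parameter names are hypothetical, since the slide gives only the conditions and not a schema.

```python
def minimum_loan_term_years(age, income, credit_rating, credit_amount):
    """Return the minimum loan term implied by the example rule, or None.

    Parameter names are illustrative assumptions; only the thresholds
    come from the slide.
    """
    if (age < 30 and income < 25_000
            and credit_rating < 3 and credit_amount > 25_000):
        return 10  # the rule fires: minimum loan term is 10 years
    return None  # the rule does not apply to this applicant


# Two made-up applicants: the first satisfies every condition, the second does not.
print(minimum_loan_term_years(age=27, income=22_000, credit_rating=2, credit_amount=30_000))
print(minimum_loan_term_years(age=45, income=80_000, credit_rating=5, credit_amount=30_000))
```

Returning None when the rule does not fire keeps the sketch honest: the rule says nothing about applicants outside its conditions.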

7
1.2 What Can Computers Learn?
8
Four Levels of Learning

Fact
a simple statement of truth
Concept
a set of objects, symbols, or events grouped
together because they share certain
characteristics
Procedure
a step-by-step course of action to achieve a
goal. We use procedures in our everyday
functioning as well as in the solution of
difficult problems.
Principle
represents the highest level of learning.
Principles are general truths or laws that are
basic to other truths.

Source: Merrill and Tennyson, 1977 (p. 5 of the text)
9
Concepts

Computers are good at learning concepts. Concepts
are the output of a data mining session.

Three Concept Views

Classical View
Probabilistic View
Exemplar View

10
Three Concept Views

Classical View
Attests that all concepts have definite defining
properties.
Probabilistic View
Concepts are represented by properties that are
probable of concept members.
Exemplar View
States that a given instance is determined to be
an example of a particular concept if the
instance is similar enough to a set of one or
more known examples of the concept.
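
The three views can be contrasted as three different membership tests. The sketch below is one possible reading, not a definition from the text: the concept, its properties, the weights, the thresholds, and the use of Jaccard similarity are all assumptions invented for illustration.

```python
def classical_member(instance, defining):
    """Classical view: every defining property must be present."""
    return defining <= instance


def probabilistic_member(instance, typical, threshold):
    """Probabilistic view: enough of the probable properties are present."""
    score = sum(weight for prop, weight in typical.items() if prop in instance)
    return score / sum(typical.values()) >= threshold


def exemplar_member(instance, exemplars, threshold):
    """Exemplar view: similar enough to at least one known example."""
    def jaccard(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 1.0
    return any(jaccard(instance, known) >= threshold for known in exemplars)


# Hypothetical concept "good credit risk" with made-up properties.
DEFINING = {"owns_home", "steady_job", "income_over_30k"}
TYPICAL = {"owns_home": 0.8, "steady_job": 0.9, "income_over_30k": 0.7}
EXEMPLARS = [
    {"owns_home", "steady_job", "income_over_30k"},
    {"steady_job", "income_over_30k", "no_late_payments"},
]

applicant = {"steady_job", "income_over_30k"}
print(classical_member(applicant, DEFINING))
print(probabilistic_member(applicant, TYPICAL, threshold=0.6))
print(exemplar_member(applicant, EXEMPLARS, threshold=0.6))
```

The same applicant fails the classical test (one defining property is missing) yet passes the other two, which is the practical difference between the views.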

11
[Figure: A hierarchy of data mining strategies. Labels recovered from the figure: No output attributes; Categorical/discrete (current behavior); Numeric; Future outcome (categorical/numeric)]
12
Supervised Learning
Supervised learning is the process of building
classification models using data instances of
known origin.

Two purposes
1. Build a learner (classification) model using
data instances of known origin. This is an
induction process.
2. Use the model to determine the outcome of new
instances of unknown origin. This is a deduction
process.
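
The two purposes can be sketched as two functions: one that induces a model from labeled instances and one that applies it to an unlabeled instance. The learner here is a deliberately tiny stand-in (it maps each value of a single attribute to its majority class) and is not the method used in the text; the attribute, class labels, and training instances are invented.

```python
from collections import Counter, defaultdict


def induce(training, attribute):
    """Induction: for each value of `attribute`, learn its most common class."""
    counts = defaultdict(Counter)
    for features, label in training:
        counts[features[attribute]][label] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}


def deduce(model, attribute, instance, default=None):
    """Deduction: apply the learned mapping to an instance of unknown class."""
    return model.get(instance[attribute], default)


# Invented training instances of known origin (not the slide's Table 1.1).
training = [
    ({"outlook": "sunny"}, "no"),
    ({"outlook": "sunny"}, "no"),
    ({"outlook": "sunny"}, "yes"),
    ({"outlook": "rainy"}, "yes"),
    ({"outlook": "rainy"}, "yes"),
]

model = induce(training, "outlook")
print(model)
print(deduce(model, "outlook", {"outlook": "sunny"}))
```

Attribute values never seen during induction fall through to `default`, which makes explicit that the model can only generalize from what it observed.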

13
Supervised Learning: A Decision Tree Example

14
Decision Tree

A tree structure where non-terminal nodes
represent tests on one or more attributes and
terminal nodes reflect decision outcomes.

Table 1.1 Hypothetical Training Data for
Disease Diagnosis
15
Figure 1.1 A decision tree for the data in
Table 1.1
16
Table 1.1 Hypothetical Training Data for
Disease Diagnosis
17
Production Rules
We can translate any decision tree into a set of
production rules. They are rules of the form
IF <antecedent conditions> THEN <consequent conditions>

IF Swollen Glands = Yes
THEN Diagnosis = Strep Throat
IF Swollen Glands = No AND Fever = Yes
THEN Diagnosis = Cold
IF Swollen Glands = No AND Fever = No
THEN Diagnosis = Allergy
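
The three production rules can be stored as data and applied by a small matcher. This is a minimal Python sketch; the rule table mirrors the slide, while the `diagnose` function and the dictionary representation of a patient are illustrative assumptions.

```python
# Each rule pairs antecedent conditions with a consequent diagnosis.
RULES = [
    ({"Swollen Glands": "Yes"}, "Strep Throat"),
    ({"Swollen Glands": "No", "Fever": "Yes"}, "Cold"),
    ({"Swollen Glands": "No", "Fever": "No"}, "Allergy"),
]


def diagnose(patient):
    """Return the consequent of the first rule whose antecedent matches."""
    for antecedent, consequent in RULES:
        if all(patient.get(attr) == value for attr, value in antecedent.items()):
            return consequent
    return None  # no rule covers this patient


print(diagnose({"Swollen Glands": "Yes", "Fever": "No"}))
print(diagnose({"Swollen Glands": "No", "Fever": "Yes"}))
print(diagnose({"Swollen Glands": "No", "Fever": "No"}))
```

Because each rule corresponds to one root-to-leaf path of the tree, the antecedents are mutually exclusive and rule order does not affect the result.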

18
Unsupervised Clustering

A data mining method that builds models from data
without predefined classes (see Table 1.3).
Data instances are grouped together based on a
similarity scheme defined by the clustering
system.
With the help of one or several evaluation
techniques, it is up to us to decide the meaning
of the formed clusters.
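
To make "grouped together based on a similarity scheme" concrete, here is a minimal sketch of one very simple scheme: each instance joins the first cluster whose first member it resembles closely enough, otherwise it starts a new cluster. This leader-style procedure, the similarity measure, the threshold, and the customer records are all assumptions chosen for illustration; they are not taken from Table 1.3 or from the text.

```python
def similarity(a, b):
    """Fraction of attributes on which two instances agree."""
    return sum(a[key] == b[key] for key in a) / len(a)


def cluster(instances, threshold):
    """Group instances without using any predefined class labels."""
    clusters = []
    for inst in instances:
        for members in clusters:
            if similarity(inst, members[0]) >= threshold:
                members.append(inst)
                break
        else:
            clusters.append([inst])  # nothing similar enough: new cluster
    return clusters


# Invented customer records with categorical attributes only.
customers = [
    {"account": "joint", "method": "online", "margin": "no"},
    {"account": "joint", "method": "online", "margin": "yes"},
    {"account": "custodial", "method": "broker", "margin": "no"},
    {"account": "custodial", "method": "broker", "margin": "yes"},
]

groups = cluster(customers, threshold=0.6)
print([len(g) for g in groups])
```

The procedure only produces the groups; deciding what each group means is still left to the analyst.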

19
Table 1.3 Acme Investors Incorporated
20
Possible Questions
Questions for supervised learning

1. Can I develop a general profile of an online
investor? If so, what characteristics distinguish
online investors from investors that use a
broker?
2. Can I determine if a new customer who does not
initially open a margin account is likely to do
so in the future?
3. Can I build a model able to accurately predict
the average number of trades per month for a new
investor?
4. What characteristics differentiate female and
male investors?

Questions for unsupervised learning

1. What attribute similarities group customers of
Acme Investors together?
2. What differences in attribute values segment
the customer database?

21
1.3 Is Data Mining Appropriate for My Problem?
22
Data Mining or Data Query?

Shallow Knowledge
Is factual. Tools used: DBMS/SQL
Multidimensional Knowledge
Is factual. Tools used: OLAP
Hidden Knowledge
Represents patterns or regularities in data that
cannot be easily found. Tools used: data mining
Deep Knowledge
Knowledge stored in a database that can only be
found if we are given some direction.

23
Data Mining vs. Data Query: An Example

Use data query if you already know, or almost
know, what you are looking for.
Use data mining to find regularities in data
that are not obvious.
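
A data query presupposes that the question is already known. The sketch below uses Python's standard sqlite3 module with an in-memory database; the table name, columns, and rows are hypothetical and exist only to show the shape of such a query.

```python
import sqlite3


def high_bill_residential(conn, limit):
    """Query: which residential customers have a monthly bill above `limit`?"""
    cursor = conn.execute(
        "SELECT id FROM customers "
        "WHERE plan = 'residential' AND monthly_bill > ? ORDER BY id",
        (limit,),
    )
    return [row[0] for row in cursor.fetchall()]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, plan TEXT, monthly_bill REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "residential", 45.0), (2, "residential", 1250.0), (3, "business", 980.0)],
)

result = high_bill_residential(conn, 1000)
print(result)
```

The query works only because the analyst already suspects that high-spending residential accounts are worth a look; data mining is for the case where no one has thought to ask.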

24
1.4 Expert Systems or Data Mining?
25
Expert System and Knowledge Engineer

An expert system is a computer program that
emulates the problem-solving skills of one or
more human experts.
A knowledge engineer is a person trained to
interact with an expert in order to capture their
knowledge.

27
1.5 A Simple Data Mining Process Model
28
Figure 1.3 - A simple data mining process model
29
Characteristics of Data Warehouse

Data Warehouse
Definition: a subject-oriented, integrated,
time-variant, non-updatable collection of data
used in support of management decision-making
processes
Subject-oriented: e.g., customers, patients,
students, products
Integrated: consistent naming conventions,
formats, and encoding structures from multiple
data sources
Time-variant: can study trends and changes
Non-updatable: read-only, periodically refreshed
Data Mart
A data warehouse that is limited in scope