Title: Part I Data Mining Fundamentals Chapter 1 Data Mining: A First View
Jason C. H. Chen, Ph.D.
Professor of MIS
School of Business Administration
Gonzaga University
Spokane, WA 99223
chen_at_jepson.gonzaga.edu
1.1 Data Mining: A Definition
- The process of employing one or more computer
learning techniques to automatically analyze and
extract knowledge from data.
Induction-based Learning
- The process of forming general concept
definitions by observing specific examples of
concepts to be learned.
Knowledge Discovery in Databases (KDD)
- The application of the scientific method to data
mining. Data mining is one step of the KDD
process.
Data Mining Examples
- A telephone company used a data mining tool to
analyze its customer data warehouse. The
data mining tool found about 10,000 supposedly
residential customers who were spending over
$1,000 monthly on phone bills.
- After further study, the phone company discovered
that they were really small business owners
trying to avoid paying business rates.
Other Data Mining Examples
- 65% of customers who did not use their credit card
in the last six months are 88% likely to cancel
their accounts.
- If age < 30 and income < 25,000 and credit
rating < 3 and credit amount > 25,000, then the
minimum loan term is 10 years.
- 82% of customers who bought a new TV 27" or
larger are 90% likely to buy an entertainment
center within the next 4 weeks.
1.2 What Can Computers Learn?
Four Levels of Learning
- Fact
  - a simple statement of truth
- Concept
  - a set of objects, symbols, or events grouped
    together because they share certain
    characteristics
- Procedure
  - a step-by-step course of action to achieve a
    goal. We use procedures in our everyday
    functioning as well as in the solution of
    difficult problems.
- Principle
  - represents the highest level of learning.
    Principles are general truths or laws that are
    basic to other truths.
Source: Merrill and Tennyson, 1977 (p. 5 of the text)
Concepts
- Computers are good at learning concepts. Concepts
are the output of a data mining session.
Three Concept Views
- Classical View
- Probabilistic View
- Exemplar View
Three Concept Views
- Classical View
  - Attests that all concepts have definite defining
    properties.
- Probabilistic View
  - Concepts are represented by properties that are
    probable of concept members.
- Exemplar View
  - States that a given instance is determined to be
    an example of a particular concept if the
    instance is similar enough to a set of one or
    more known examples of the concept.
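The exemplar view can be illustrated with a small sketch. It assumes a simple similarity measure (the fraction of attributes on which two instances agree) and an arbitrary threshold of 0.75; the slides prescribe neither. The attribute names and the exemplars are hypothetical.

```python
def similarity(a, b):
    """Fraction of shared attributes on which two instances agree."""
    keys = a.keys() & b.keys()
    if not keys:
        return 0.0
    return sum(a[k] == b[k] for k in keys) / len(keys)


def is_example_of(instance, exemplars, threshold=0.75):
    """Exemplar view: accept the instance if it is similar enough
    to at least one known example of the concept."""
    return any(similarity(instance, e) >= threshold for e in exemplars)


# Hypothetical known examples of the concept "cold" (illustration only).
known_colds = [
    {"sore_throat": "Yes", "fever": "Yes", "swollen_glands": "No", "congestion": "Yes"},
    {"sore_throat": "No", "fever": "Yes", "swollen_glands": "No", "congestion": "Yes"},
]

candidate = {"sore_throat": "Yes", "fever": "Yes", "swollen_glands": "No", "congestion": "No"}
outlier = {"sore_throat": "No", "fever": "No", "swollen_glands": "Yes", "congestion": "No"}

print(is_example_of(candidate, known_colds))
print(is_example_of(outlier, known_colds))
```

The candidate agrees with the first exemplar on three of four attributes, so it meets the threshold; the outlier does not come close to either exemplar.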
Figure: A hierarchy of data mining strategies
(figure annotations: No output attributes; Categorical/discrete (current behavior); Numeric; Future outcome (categorical/numeric))
Supervised Learning
Supervised learning is the process of building
classification models using data instances of
known origin.
- Two purposes
  - 1. Build a learner (classification) model using
    data instances of known origin. This is an
    induction process.
  - 2. Use the model to determine the outcome of new
    instances of unknown origin. This is a
    deduction process.
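The two purposes above can be sketched in a few lines. The learner here is a deliberately simple one-attribute rule learner (often called 1R), chosen only for brevity; it is not the decision tree method used in the text. The training instances are hypothetical and are not the contents of Table 1.1.

```python
from collections import Counter, defaultdict


def induce_one_rule(instances, target):
    """Induction: pick the single attribute whose value -> majority-class
    table makes the fewest errors on the training instances."""
    best = None
    attributes = [a for a in instances[0] if a != target]
    for attr in attributes:
        counts = defaultdict(Counter)
        for row in instances:
            counts[row[attr]][row[target]] += 1
        table = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(table[row[attr]] != row[target] for row in instances)
        if best is None or errors < best[0]:
            best = (errors, attr, table)
    _, attr, table = best
    return attr, table


def classify(model, instance, default=None):
    """Deduction: apply the learned rule to an instance of unknown origin."""
    attr, table = model
    return table.get(instance[attr], default)


# Hypothetical instances of known origin (illustration only).
training = [
    {"swollen_glands": "Yes", "fever": "Yes", "diagnosis": "Strep Throat"},
    {"swollen_glands": "Yes", "fever": "No", "diagnosis": "Strep Throat"},
    {"swollen_glands": "No", "fever": "Yes", "diagnosis": "Cold"},
    {"swollen_glands": "No", "fever": "Yes", "diagnosis": "Cold"},
    {"swollen_glands": "No", "fever": "No", "diagnosis": "Allergy"},
]

model = induce_one_rule(training, target="diagnosis")
new_patient = {"swollen_glands": "Yes", "fever": "No"}
print(model[0], classify(model, new_patient))
```

Building `model` from labeled instances is the induction step; calling `classify` on `new_patient`, whose diagnosis is unknown, is the deduction step.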
Supervised Learning: A Decision Tree Example
Decision Tree
- A tree structure where non-terminal nodes
represent tests on one or more attributes and
terminal nodes reflect decision outcomes.
Table 1.1: Hypothetical Training Data for Disease Diagnosis
Figure 1.1: A decision tree for the data in Table 1.1
Production Rules
We can translate any decision tree into a set of
production rules. They are rules of the form IF
<antecedent conditions> THEN <consequent
conditions>.
- IF Swollen Glands = Yes
  THEN Diagnosis = Strep Throat
- IF Swollen Glands = No AND Fever = Yes
  THEN Diagnosis = Cold
- IF Swollen Glands = No AND Fever = No
  THEN Diagnosis = Allergy
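As a minimal sketch, the three production rules above can be stored as data, each rule being a pair of antecedent conditions and a consequent. The names `RULES` and `diagnose` are illustrative and do not come from the text.

```python
# Each rule: (antecedent conditions, consequent diagnosis).
RULES = [
    ({"Swollen Glands": "Yes"}, "Strep Throat"),
    ({"Swollen Glands": "No", "Fever": "Yes"}, "Cold"),
    ({"Swollen Glands": "No", "Fever": "No"}, "Allergy"),
]


def diagnose(patient):
    """Return the consequent of the first rule whose antecedent
    conditions all hold for the patient, or None if no rule fires."""
    for antecedent, consequent in RULES:
        if all(patient.get(attr) == value for attr, value in antecedent.items()):
            return consequent
    return None


print(diagnose({"Swollen Glands": "No", "Fever": "Yes"}))
```

Because the antecedents are mutually exclusive, the order of the rules does not affect the result for patients that have both attributes recorded.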
Unsupervised Clustering
- A data mining method that builds models from data
without predefined classes (see Table 1.3).
- Data instances are grouped together based on a
similarity scheme defined by the clustering
system.
- With the help of one or several evaluation
techniques, it is up to us to decide the meaning
of the formed clusters.
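To make the idea of grouping by similarity concrete, here is a minimal sketch that uses k-means on a single numeric attribute. K-means is one possible similarity scheme, chosen here only as an illustration; the slides do not name a specific algorithm. The values (read them as trades per month) are hypothetical, and seeding the centers with the first k values is a simplifying assumption.

```python
def k_means_1d(values, k, max_iterations=100):
    """Group numeric values into k clusters by distance to the nearest center."""
    centers = list(values[:k])  # assumption: seed centers with the first k values
    assignment = None
    for _ in range(max_iterations):
        new_assignment = [
            min(range(k), key=lambda c: abs(v - centers[c])) for v in values
        ]
        if new_assignment == assignment:
            break  # no instance changed cluster: converged
        assignment = new_assignment
        for c in range(k):
            members = [v for v, a in zip(values, assignment) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = sum(members) / len(members)
    clusters = [[v for v, a in zip(values, assignment) if a == c] for c in range(k)]
    return clusters, centers


# Hypothetical trades-per-month values (illustration only).
trades_per_month = [1, 2, 3, 10, 11, 12]
clusters, centers = k_means_1d(trades_per_month, k=2)
print(clusters, centers)
```

The algorithm returns two groups but attaches no meaning to them; deciding that they represent, say, occasional and frequent traders is left to the analyst.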
Table 1.3: Acme Investors Incorporated
Possible Questions
Questions for supervised learning
- 1. Can I develop a general profile of an online
investor? If so, what characteristics distinguish
online investors from investors who use a
broker?
- 2. Can I determine if a new customer who does not
initially open a margin account is likely to do
so in the future?
- 3. Can I build a model able to accurately predict
the average number of trades per month for a new
investor?
- 4. What characteristics differentiate female and
male investors?
Questions for unsupervised learning
- 1. What attribute similarities group customers of
Acme Investors together?
- 2. What differences in attribute values segment
the customer database?
1.3 Is Data Mining Appropriate for My Problem?
Data Mining or Data Query?
- Shallow Knowledge
  - Is factual. Tools used: DBMS/SQL
- Multidimensional Knowledge
  - Is factual. Tools used: OLAP
- Hidden Knowledge
  - Represents patterns or regularities in data that
    cannot be easily found. Tools used: data mining
- Deep Knowledge
  - Knowledge stored in a database that can only be
    found if we are given some direction.
Data Mining vs. Data Query: An Example
- Use data query if you already know, or almost know, what
you are looking for.
- Use data mining to find regularities in data
that are not obvious.
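A data query fits when the condition can be written down in advance. The sketch below uses Python's standard `sqlite3` module with an in-memory database; the `customers` table, its columns, and its rows are hypothetical, loosely modeled on the telephone company example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, kind TEXT, monthly_bill REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1, "residential", 45.0),
        (2, "residential", 1250.0),
        (3, "business", 2100.0),
        (4, "residential", 80.0),
    ],
)

# We already know what we are looking for: residential customers
# whose monthly bill exceeds a threshold that we chose ourselves.
rows = conn.execute(
    "SELECT id, monthly_bill FROM customers "
    "WHERE kind = ? AND monthly_bill > ? ORDER BY id",
    ("residential", 1000),
).fetchall()
conn.close()
print(rows)
```

The query works only because the analyst supplied both the customer type and the threshold. In the data mining case, neither condition is known beforehand, and the tool has to surface the regularity itself.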
1.4 Expert Systems or Data Mining?
Expert System and Knowledge Engineer
- An expert system is a computer program that
emulates the problem-solving skills of one or
more human experts.
- A knowledge engineer is a person trained to
interact with an expert in order to capture the
expert's knowledge.
1.5 A Simple Data Mining Process Model
Figure 1.3: A simple data mining process model
Characteristics of a Data Warehouse
- Data Warehouse
  - Definition: a subject-oriented, integrated,
    time-variant, nonupdatable collection of data
    used in support of management decision-making
    processes
  - Subject-oriented: e.g., customers, patients,
    students, products
  - Integrated: consistent naming conventions,
    formats, and encoding structures from multiple data
    sources
  - Time-variant: can study trends and changes
  - Nonupdatable: read-only, periodically refreshed
- Data Mart
  - A data warehouse that is limited in scope
A four-step process for performing a data mining
session
- 1. Assembling the data
  - Operational database (relational databases and
    flat files) vs. data warehouse
- 2. Mining the data (giving the data to a mining
tool)
  - Instances for building the model or testing the
    model
- 3. Interpreting the results
- 4. Result application
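The four steps above can be walked through in a rough sketch. The "mining tool" here is only a majority-class baseline standing in for a real tool, the instances are hypothetical, and the two-thirds training split is an arbitrary assumption.

```python
from collections import Counter


def split(instances, train_fraction=0.67):
    """Separate instances for building the model from instances for testing it."""
    cut = int(len(instances) * train_fraction)
    return instances[:cut], instances[cut:]


def mine(training):
    """Stand-in mining tool: a model that always predicts the most common outcome."""
    majority = Counter(label for _, label in training).most_common(1)[0][0]
    return lambda attributes: majority


def accuracy(model, test):
    """Share of test instances whose known outcome the model predicts."""
    correct = sum(model(attrs) == label for attrs, label in test)
    return correct / len(test)


# Step 1: assemble the data (hypothetical instances: attributes, known outcome).
data = [
    ({"margin_account": "No"}, "online"),
    ({"margin_account": "Yes"}, "online"),
    ({"margin_account": "No"}, "broker"),
    ({"margin_account": "Yes"}, "online"),
    ({"margin_account": "No"}, "online"),
    ({"margin_account": "Yes"}, "broker"),
]
training, test = split(data)

# Step 2: mine the data.
model = mine(training)

# Step 3: interpret the results on instances held back for testing.
test_accuracy = accuracy(model, test)

# Step 4: apply the result to a new instance.
prediction = model({"margin_account": "No"})
print(test_accuracy, prediction)
```

A test accuracy this low would normally send the analyst back to step 1 or step 2 rather than forward to step 4.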
1.7 Data Mining Applications (p. 24)
- Fraud Detection
- Health care
- Business and finance
- Scientific applications
- Sports and gaming
Customer Intrinsic Value
[Figure with labels A, B, and C]