Title: Issues in Data Mining Infrastructure
1Issues in Data Mining Infrastructure
Authors:
Nemanja Jovanovic, nemko_at_acm.org
Valentina Milenkovic, tina_at_eunet.yu
Prof. Dr. Veljko Milutinovic, vm_at_etf.bg.ac.yu
http://galeb.etf.bg.ac.yu/vm
2Data Mining in a Nutshell
- Uncovering the hidden knowledge
- Huge NP-complete search space
- Multidimensional interface
3A Problem
You are a marketing manager for a cellular phone
company
- Problem: Churn is too high
- Turnover (after contract expires) is 40%
- Customers receive a free phone (cost $125) with
contract
- You pay a sales commission of $250 per contract
- Giving a new telephone to everyone whose
contract is expiring is very expensive (as well
as wasteful)
- Bringing back a customer after quitting is both
difficult and expensive
4 A Solution
- Three months before a contract expires, predict
which customers will leave
- If you want to keep a customer that is predicted
to churn, offer them a new phone
- The ones that are not predicted to churn need no
attention
- If you don't want to keep the customer, do nothing
- How can you predict future behavior?
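The economics behind the problem and the proposed solution can be sketched with a small calculation. The example below reads the slide figures as a 40% turnover rate, a phone cost of 125 and a commission of 250 (currency units). The base of 1,000 expiring contracts, the idea that a lost customer is replaced at the cost of one phone plus one commission, and the perfect prediction in the targeted case are all assumptions made only for illustration.

```python
# Hypothetical illustration; the contract base and the cost model
# below are assumptions, not figures taken from the slides.
EXPIRING_CONTRACTS = 1000   # assumed number of expiring contracts
CHURN_RATE = 0.40           # turnover after contract expiry
PHONE_COST = 125            # cost of the free phone
COMMISSION = 250            # sales commission per new contract


def cost_phone_for_everyone(n):
    """Give every expiring customer a new phone."""
    return n * PHONE_COST


def cost_do_nothing(n, churn_rate):
    """Replace every churner with a new customer (phone + commission)."""
    churners = round(n * churn_rate)
    return churners * (PHONE_COST + COMMISSION)


def cost_targeted(n, churn_rate):
    """Give a phone only to would-be churners (perfect prediction assumed)."""
    churners = round(n * churn_rate)
    return churners * PHONE_COST


everyone = cost_phone_for_everyone(EXPIRING_CONTRACTS)
nothing = cost_do_nothing(EXPIRING_CONTRACTS, CHURN_RATE)
targeted = cost_targeted(EXPIRING_CONTRACTS, CHURN_RATE)
print(everyone, nothing, targeted)
```

Under these assumptions the targeted offer is the cheapest of the three options, which is why an accurate churn prediction is valuable. A real predictor is imperfect, so the actual saving would be smaller.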
5Still Skeptical?
6The Definition
The automated extraction of predictive
information from (large) databases
7History of Data Mining
8Repetition in Solar Activity
9The Return of Halley's Comet
Edmund Halley (1656 - 1742)
Returns shown: 239 BC, 1531, 1607, 1682, 1910, 1986, 2061 ???
10Data Mining is Not
- Online Analytical Processing (OLAP)
11Data Mining is
- Automated extraction of predictive
information from various data sources
- Powerful technology with great potential to help
users focus on the most important information
stored in data warehouses or streamed through
communication lines
12Data Mining can
- Answer questions that were too time consuming to
resolve in the past
- Predict future trends and behaviors, allowing us
to make proactive, knowledge-driven decisions
13Focus of this Presentation
- Data Mining problem types
- Data Mining models and algorithms
14Data Mining Problem Types
15Data Mining Problem Types
- 6 types
- Often a combination solves the problem
16Data Description and Summarization
- Aims at concise description of data
characteristics
- At the lower end of the scale of problem types
- Provides the user with an overview of the data
structure
17Segmentation
- Separates the data into interesting and
meaningful subgroups or classes
- Manual or (semi)automatic
- A problem in itself or just a step in solving
a problem
18Classification
- Assumption: existence of objects with
characteristics that belong to different classes
- Building classification models that assign
correct class labels to previously unseen objects
- Arises in a wide range of applications
- Segmentation can provide labels or restrict data
sets
19Concept Description
- Understandable description of concepts or classes
- Close connection to both segmentation and
classification
- Similarities to and differences from classification
20Prediction (Regression)
- Finds the numerical value of the target
attribute for unseen objects
- Similar to classification; the difference: the
discrete target becomes continuous
21Dependency Analysis
- Finding the model that describes significant
dependences between data items or events
- Prediction of value of a data item
- Special case: associations
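Associations are usually scored with two standard measures, support and confidence, which the slides do not name explicitly. The sketch below computes both for the rule "phone -> case" over a handful of hypothetical transactions invented for illustration.

```python
# Hypothetical market-basket transactions, invented for illustration.
transactions = [
    {"phone", "case", "charger"},
    {"phone", "case"},
    {"phone", "charger"},
    {"case"},
    {"phone", "case", "headset"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) over the transactions."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)


rule_support = support({"phone", "case"}, transactions)
rule_confidence = confidence({"phone"}, {"case"}, transactions)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

Here three of the five transactions contain both items, and three of the four transactions with a phone also contain a case.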
22Data Mining Models
23Neural Networks
- Characterizes processed data with a single numeric
value
- Efficient modeling of large and complex problems
- Based on biological structures: neurons
- Network consists of neurons grouped into layers
24Neuron Functionality
[Figure: a neuron with inputs I1, I2, I3, ..., In, weights W1, W2, W3, ..., Wn, a function f, and one output]
Output = f(W1*I1, W2*I2, ..., Wn*In)
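A minimal sketch of a single neuron follows. The slide only states that the output is a function f of the weighted inputs; combining them by summation and using a logistic (sigmoid) function for f are common choices assumed here, and the input and weight values are illustrative only.

```python
import math


def sigmoid(x):
    """Logistic activation; one common choice for f (an assumption here)."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(inputs, weights, activation=sigmoid):
    """Sum the weighted inputs W_i * I_i, then apply the activation f."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(w * i for w, i in zip(weights, inputs))
    return activation(weighted_sum)


# Illustrative values only.
inputs = [1.0, 0.5, -1.0]
weights = [0.4, -0.2, 0.1]
output = neuron_output(inputs, weights)
print(round(output, 4))
```

Training a network amounts to adjusting the weights so that outputs like this one match known target values.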
25Training Neural Networks
26Neural Networks - Conclusion
- Once trained, Neural Networks can efficiently
estimate the value of the output variable for a given input
- Neurons and network topology are essential
- Usually used for prediction or regression
problem types
- Data pre-processing often required
27Decision Trees
- A way of representing a series of rules that
lead to a class or value
- Iterative splitting of data into discrete groups
maximizing distance between them at each split
- Classification trees and regression trees
- Univariate splits and multivariate splits
- Unlimited growth and stopping rules
- CHAID, CART, QUEST, C5.0
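One step of the iterative splitting described above can be sketched as a search for the best univariate threshold. The slides do not say how the distance between groups is measured; the example assumes Gini impurity, a measure commonly associated with CART-style trees, and uses hypothetical data.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best split 'value < threshold'."""
    best = (None, float("inf"))
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [lab for v, lab in zip(values, labels) if v < threshold]
        right = [lab for v, lab in zip(values, labels) if v >= threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if impurity < best[1]:
            best = (threshold, impurity)
    return best


# Hypothetical data: customer age and whether the customer churned.
ages = [22, 25, 30, 35, 40, 52]
churned = ["yes", "yes", "yes", "no", "no", "no"]
threshold, impurity = best_threshold(ages, churned)
print(threshold, impurity)
```

A full tree builder would apply the same search recursively to each resulting group until a stopping rule is met.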
28Decision Trees
[Figure: decision tree with the following branch tests]
Balance>10
Balance<10
Age<32
Age>32
Married=NO
Married=YES
29Decision Trees
30Rule Induction
- Method of deriving a set of rules to classify
cases
- Creates independent rules that are unlikely to
form a tree
- Rules may not cover all possible situations
- Rules may sometimes conflict in a prediction
31Rule Induction
If balance>100.000 then confidence=HIGH
weight=1.7
If balance>25.000 and status=married then
confidence=HIGH weight=2.3
If balance<40.000 then confidence=LOW
weight=1.9
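The three example rules can be written as a small rule set to show both properties mentioned earlier: rules may conflict, and rules may fail to cover a case. In this sketch "100.000" is read as one hundred thousand, and conflicts are resolved by summing the weights per class; both are assumptions, since the slides do not state a resolution strategy.

```python
from collections import defaultdict

# Each rule: (condition, predicted confidence class, weight).
# "100.000" in the slide is read as one hundred thousand (an assumption).
RULES = [
    (lambda c: c["balance"] > 100_000, "HIGH", 1.7),
    (lambda c: c["balance"] > 25_000 and c["status"] == "married", "HIGH", 2.3),
    (lambda c: c["balance"] < 40_000, "LOW", 1.9),
]


def classify(case):
    """Return the class with the largest total weight, or None if no rule fires.

    Summing weights to resolve conflicts is an assumed strategy.
    """
    totals = defaultdict(float)
    for condition, label, weight in RULES:
        if condition(case):
            totals[label] += weight
    if not totals:
        return None
    return max(totals, key=totals.get)


conflict = classify({"balance": 30_000, "status": "married"})    # rules 2 and 3 fire
uncovered = classify({"balance": 50_000, "status": "single"})    # no rule fires
agreement = classify({"balance": 150_000, "status": "married"})  # rules 1 and 2 fire
print(conflict, uncovered, agreement)
```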
32K-nearest Neighbor and Memory-Based Reasoning
(MBR)
- Usage of knowledge of previously solved similar
problems in solving the new problem
- Assigning the new case to the class to which most
of its k nearest neighbors belong
- First step: finding a suitable measure for
distance between attributes in the data
- How far is black from green?
- Easy handling of non-standard data types
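A minimal k-nearest-neighbor sketch is shown below. The distance measure is the assumption that matters most: numeric attributes use the absolute difference, and categorical attributes such as color use a simple 0/1 mismatch, which is one possible answer to "how far is black from green?". The stored cases and attribute names are hypothetical.

```python
from collections import Counter


def distance(a, b):
    """Mixed-type distance: absolute difference for numbers, 0/1 mismatch otherwise."""
    total = 0.0
    for x, y in zip(a, b):
        if isinstance(x, (int, float)) and isinstance(y, (int, float)):
            total += abs(x - y)
        else:
            total += 0.0 if x == y else 1.0
    return total


def knn_classify(query, examples, k=3):
    """Assign the class held by most of the k nearest stored examples."""
    nearest = sorted(examples, key=lambda ex: distance(query, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical stored cases: (normalized age, phone color) -> class.
examples = [
    ((0.20, "black"), "stays"),
    ((0.25, "black"), "stays"),
    ((0.30, "green"), "stays"),
    ((0.80, "green"), "churns"),
    ((0.85, "green"), "churns"),
    ((0.90, "black"), "churns"),
]
first = knn_classify((0.25, "black"), examples, k=3)
second = knn_classify((0.82, "green"), examples, k=3)
print(first, second)
```

Because a color mismatch costs a full unit while ages lie between 0 and 1, the categorical attribute dominates here; scaling the attributes differently would change which neighbors count as near.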
33K-nearest Neighbor and Memory-Based Reasoning
(MBR)
34Data Mining Models and Algorithms
- Many other available models and algorithms
- Logistic regression
- Discriminant analysis
- Generalized Additive Models (GAM)
- Genetic algorithms
- Etc
- Many application specific variations of known
models
- Final implementation usually involves several
techniques
- Selection of the solution that gives the best results
35Efficient Data Mining
36Is It Working?
[Flowchart; each question branches on YES / NO, and the branch layout is not recoverable from the extracted text]
- Don't Mess With It!
- Did You Mess With It?
- You Shouldn't Have!
- Will It Explode In Your Hands?
- Anyone Else Knows?
- You're in TROUBLE!
- Can You Blame Someone Else?
- Hide It
- Look The Other Way
- NO PROBLEM!
37DM Process Model
- 5A: used by SPSS Clementine (Assess, Access,
Analyze, Act and Automate)
- SEMMA: used by SAS Enterprise Miner (Sample,
Explore, Modify, Model and Assess)
- CRISP-DM: tends to become a standard
38CRISP-DM
- CRoss-Industry Standard Process for Data Mining
- Conceived in 1996 by three companies
39CRISP-DM methodology
Four-level breakdown of the CRISP-DM methodology:
- Phases
- Generic Tasks
- Specialized Tasks
- Process Instances
40Mapping generic models to specialized models
- Analyze the specific context
- Remove any details not applicable to the context
- Add any details specific to the context
- Specialize generic contents according to concrete
characteristics of the context
- Possibly rename generic contents to provide more
explicit meanings
41Generalized and Specialized Cooking
- Preparing food on your own
- Find out what you want to eat
- Find the recipe for that meal
- Gather the ingredients
- Prepare the meal
- Enjoy your food
- Clean up everything (or leave it for later)
- Raw steak with vegetables?
- Check the cookbook or call mom
- Defrost the meat (if you had it in the freezer)
- Buy missing ingredients or borrow them from the
neighbors
- Cook the vegetables and fry the meat
- Enjoy your food or even more
- You were cooking so convince someone else to do
the dishes
42CRISP-DM model
Business understanding
Data understanding
Data preparation
Modeling
Evaluation
Deployment
43Business Understanding
- Determine business objectives
- Determine data mining goals
44Data Understanding
45Data Preparation
46Modeling
- Select modeling technique
47Evaluation
results = models + findings
48Deployment
- Plan monitoring and maintenance
49At Last
50Available Software
51Conclusions
52WWW.NBA.COM
53Se7en
54? CD ROM ?
55Credits
Anne Stern, SPSS, Inc.
Djuro Gluvajic, ITE, Denmark
Obrad Milivojevic, PC PRO, Yugoslavia
56References
- Bruha, I., Data Mining, KDD and Knowledge
Integration: Methodology and a Case Study,
SSGRR 2000
- Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy,
R., Advances in Knowledge Discovery and Data
Mining, MIT Press, 1996
- Glymour, C., Madigan, D., Pregibon, D., Smyth,
P., Statistical Themes and Lessons for Data
Mining, Data Mining and Knowledge Discovery 1,
11-28, 1997
- Hecht-Nielsen, R., Neurocomputing,
Addison-Wesley, 1990
- Pyle, D., Data Preparation for Data Mining,
Morgan Kaufmann, 1999
- galeb.etf.bg.ac.yu/vm
- www.thearling.com
- www.crisp-dm.com
- www.twocrows.com
- www.sas.com/products/miner
- www.spss.com/clementine
57The END