Title: Knowledge Discovery and Data Mining Lecture 1
1 Knowledge Discovery and Data Mining (Lecture 1)
2 Objectives
- Fundamental techniques of knowledge discovery and data mining (KDD)
- Issues in the practical use of KDD, and KDD tools
- Common data mining tasks
3 Overview of KDD and Data Mining
1. What is KDD and why?
2. The KDD Process
3. KDD Applications
4. Data Mining Methods
5. Challenges for KDD
4 KDD: A Definition
KDD is the automatic extraction of non-obvious,
hidden knowledge from large volumes of data.
- Large volumes: 10^6 to 10^12 bytes. We never see the whole data set or
put it in the memory of computers.
- What knowledge? How to represent and use it?
- Which data mining algorithms?
5 Data, Information, Knowledge
We often see data as a string of bits, or
numbers and symbols, or objects which we
collect daily.
Information is data stripped of redundancy, and
reduced to the minimum necessary to characterize
the data.
Knowledge is integrated information, including
facts and their relations, which have been
perceived, discovered, or learned as our mental
pictures.
Knowledge can be considered
data at a high level of abstraction and
generalization.
6 From Data to Knowledge
Medical data by Dr. Tsumoto, Tokyo Med. Dent. Univ., 38 attributes
...
10, M, 0, 10, 10, 0, 0, 0, SUBACUTE, 37, 2, 1, 0,15,-,-, 6000, 2, 0, abnormal, abnormal,-, 2852, 2148, 712, 97, 49, F,-,multiple,,2137, negative, n, n, ABSCESS,VIRUS
12, M, 0, 5, 5, 0, 0, 0, ACUTE, 38.5, 2, 1, 0,15, -,-, 10700,4,0,normal, abnormal, , 1080, 680, 400, 71, 59, F,-,ABPCCZX,, 70, negative, n, n, n, BACTERIA, BACTERIA
15, M, 0, 3, 2, 3, 0, 0, ACUTE, 39.3, 3, 1, 0,15, -, -, 6000, 0,0, normal, abnormal, , 1124, 622, 502, 47, 63, F, -,FMOXAMK, , 48, negative, n, n, n, BACTE(E), BACTERIA
16, M, 0, 32, 32, 0, 0, 0, SUBACUTE, 38, 2, 0, 0, 15, -, , 12600, 4, 0,abnormal, abnormal, , 41, 39, 2, 44, 57, F, -, ABPCCZX, ?, ? ,negative, ?, n, n, ABSCESS, VIRUS
...
Annotations on the data: numerical attribute, categorical attribute,
missing values, class labels
IF cell_poly < 220 AND Risk = n AND Loc_dat
AND Nausea > 15 THEN Prediction = VIRUS
(87.5%: confidence, predictive accuracy)
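As a rough illustration of what the confidence of such a rule means, the sketch below counts how many records a rule covers and what fraction of those carry the predicted class. The records, the attribute names `risk`, `nausea` and `diagnosis`, and the helper `rule_confidence` are hypothetical toy values, not the medical data set shown above; the `Loc_dat` condition is left out because its value is not given.

```python
def rule_confidence(records, antecedent, target_attr, target_value):
    """Return (number of covered records, confidence) for an IF-THEN rule."""
    covered = [r for r in records if antecedent(r)]
    if not covered:
        return 0, None  # the rule fires on no record, confidence is undefined
    correct = sum(1 for r in covered if r[target_attr] == target_value)
    return len(covered), correct / len(covered)


# Hypothetical toy records, invented for illustration only.
records = [
    {"cell_poly": 100, "risk": "n", "nausea": 20, "diagnosis": "VIRUS"},
    {"cell_poly": 150, "risk": "n", "nausea": 16, "diagnosis": "VIRUS"},
    {"cell_poly": 200, "risk": "n", "nausea": 30, "diagnosis": "BACTERIA"},
    {"cell_poly": 50, "risk": "n", "nausea": 18, "diagnosis": "VIRUS"},
    {"cell_poly": 900, "risk": "n", "nausea": 20, "diagnosis": "BACTERIA"},
    {"cell_poly": 100, "risk": "p", "nausea": 20, "diagnosis": "VIRUS"},
]


def antecedent(r):
    # IF cell_poly < 220 AND Risk = n AND Nausea > 15
    return r["cell_poly"] < 220 and r["risk"] == "n" and r["nausea"] > 15


covered, confidence = rule_confidence(records, antecedent, "diagnosis", "VIRUS")
print(covered, confidence)
```

On these toy records the rule covers four records, three of which are VIRUS, so its confidence is 0.75.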
7 Data Rich, Knowledge Poor
How to acquire knowledge for
knowledge-based systems remains the
main difficult and
crucial problem.
People gather and store so much data because
they think some valuable assets are implicitly
coded within it.
[Figure labels: ?, knowledge base, inference engine]
Raw data is rarely of direct benefit.
Its true value depends on the ability to extract
information useful for decision support.
- Tradition: via knowledge engineers
- Impractical: manual data analysis
- New trend: via automatic programs
8 Benefits of Knowledge Discovery
[Figure labels: Value, Volume, Generate, Disseminate, Rapid Response, EDP, MIS, DSS]
- EDP: Electronic Data Processing
- MIS: Management Information Systems
- DSS: Decision Support Systems
10 The KDD Process
"The non-trivial process of identifying valid,
novel, potentially useful, and ultimately
understandable patterns in data" (Fayyad,
Piatetsky-Shapiro, Smyth, 1996)
11 The Knowledge Discovery Process
[Figure: the KDD process as five numbered steps, 1 to 5]
Data mining: a step in the KDD process consisting of methods
that produce useful patterns or models from the
data, under some acceptable computational
efficiency limitations.
KDD is inherently interactive and iterative.
12 The KDD Process
1. Create/select target database; data warehousing (data organized by function)
2. Select sampling technique and sample data; supply missing values; eliminate noisy data
3. Normalize values; transform values; create derived attributes; find important attributes and value ranges
4. Select DM task(s); select DM method(s); extract knowledge; test knowledge; refine knowledge
5. Transform to different representation
Other labels in the figure: query and report generation, aggregation and
sequences, advanced methods
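Two of the preparation steps above, supplying missing values and normalizing values, can be sketched as follows. This is a minimal sketch assuming mean imputation and min-max scaling, which are illustrative choices and not necessarily the ones intended in the lecture; the function names and the sample column are hypothetical.

```python
def supply_missing(values):
    """Replace missing entries (None) with the mean of the known entries."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("cannot impute a column with no known values")
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def normalize(values):
    """Rescale values linearly to the range [0, 1] (min-max scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # a constant column carries no spread
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical numeric column with one missing value.
column = [6000.0, None, 10700.0, 12600.0]
prepared = normalize(supply_missing(column))
print(prepared)
```

Imputing before normalizing keeps the filled-in value inside the observed range, so the scaled column still spans exactly 0 to 1.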
13 Main Contributing Areas of KDD
- Statistics: infer information from data (deduction and induction,
mainly numeric data)
- Databases: store, access, search, update data (deduction);
data warehouses (integrated data); OLAP (On-Line Analytical Processing)
- Machine Learning: computer algorithms that improve automatically
through experience (mainly induction, symbolic data)
15 Potential Applications
- Business information: marketing and sales data analysis, investment
analysis, loan approval, fraud detection, etc.
- Manufacturing information: controlling and scheduling, network
management, experiment result analysis, etc.
- Scientific information: sky survey cataloging, biosequence databases,
geosciences (Quakefinder), etc.
- Personal information
16 KDD: Opportunity and Challenges
Factors surrounding KDD in the figure:
- Competitive pressure
- Data rich, knowledge poor (the resource)
- Data mining technology mature
- Enabling technology (interactive MIS, OLAP,
parallel computing, Web, etc.)
18 Data Mining
- Mining or discovery of new information, in terms
of patterns or rules, from vast amounts of data,
based on the following techniques
  - Machine learning
  - Statistics
  - Neural networks
  - Genetic algorithms
- Applications
  - Retail/Marketing
    - Consumer behaviour based on buying patterns
  - Finance
    - Creditworthiness of clients
    - Performance analysis of financial investments
  - Health care/Medicine
    - Effectiveness and side effects of treatments
19 2.1 Data Mining Strategies
20 Supervised Learning
- Learning to assign objects to classes given
examples
- Learner (classifier)
[Figure: a typical supervised text learning scenario]
21 Unsupervised Learning
- The target goal is not predefined. For example,
items in a database are gathered into groups such that
items in the same group are similar (clustering).
- Examples
  - Clustering
  - Association rule discovery
22 Common Tasks of Data Mining
- Classification: finding the description of several predefined
classes and classifying a data item into one of
them
- Clustering: identifying a finite set of categories or
clusters to describe the data
- Association rule: finding correlations between items in a database
- Summarization: finding a compact description for a subset of
data
- Deviation and change detection: discovering the most significant
changes in the data
23 Classification
What factors determine cancerous cells?
[Figure: examples (cancerous cell data) are fed to a classification
(mining) algorithm, which produces general patterns]
- Classification algorithms: rule induction, decision tree, neural network
24 Classification
- Learning is supervised.
- The dependent variable is categorical.
- Well-defined classes.
- Current rather than future behavior.
25 Classification
- Example
  - Input feature x: face length, distance between
eyes
  - Output y: whether it is a man or a woman
- Two phases
  - Training
  - Testing
[Figure: classified examples (labeled data) go to a learner, which
produces a model; the model turns new unclassified examples
(unlabeled data) into classified examples]
26 Classification
- Mathematically, assume there is some function
F(x) = y producing the data. Given many pairs (x,
y), find F.
[Figure: scatter plot of examples (o) with axes face length and
distance between eyes]
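One simple way to approximate F from labeled pairs (x, y) is a k-nearest-neighbour vote, one of the methods listed later in this lecture. The sketch below is illustrative only: the measurements, the labels and the function name `knn_predict` are hypothetical.

```python
import math
from collections import Counter


def knn_predict(training, x, k=3):
    """Predict the label of x by majority vote among its k nearest examples."""
    neighbours = sorted(training, key=lambda pair: math.dist(pair[0], x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical labeled examples: (face length, distance between eyes) -> label.
training = [
    ((20.0, 6.5), "man"),
    ((21.0, 6.8), "man"),
    ((19.5, 6.4), "man"),
    ((17.0, 5.8), "woman"),
    ((16.5, 5.9), "woman"),
    ((17.5, 6.0), "woman"),
]

print(knn_predict(training, (20.5, 6.6)))
print(knn_predict(training, (17.0, 5.9)))
```

Training here only stores the examples; all the work happens at testing time, when the distances to the new example are computed.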
27 Classification
- Issues
  - Expressiveness: how flexible is the modeling
method?
  - Scalability: how fast can it learn a model from N
features and M examples?
  - Overfitting: fitting the labeled examples too
exactly often ends up degrading
generalization performance. It is usually caused by
training for too long in order to derive a perfect model.
A remedy is to use a test data set or
cross-validation.
  - N-fold cross-validation: the training
data is divided into n partitions. The model is
learned on n-1 partitions and tested
on the held-out partition. The process is repeated n times,
so that each partition is held out once, and the results
of the n repetitions are averaged to obtain the accuracy of
the model.
  - Generalization: performance of the learned model on
unseen (test) examples (or beyond training
examples)
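The n-fold procedure described above can be sketched in a few lines. The sketch assumes the caller supplies a `train` function and an `accuracy` function; those two names, the majority-class learner and the toy examples are hypothetical and serve only to make the example runnable.

```python
import random
from collections import Counter


def make_folds(examples, n_folds, seed=0):
    """Shuffle the examples and split them into n_folds disjoint partitions."""
    if not 2 <= n_folds <= len(examples):
        raise ValueError("n_folds must be between 2 and the number of examples")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]


def cross_validate(examples, n_folds, train, accuracy, seed=0):
    """Average the held-out accuracy over n_folds train/test splits."""
    folds = make_folds(examples, n_folds, seed)
    scores = []
    for i, held_out in enumerate(folds):
        # Learn on the other n-1 partitions, test on the held-out one.
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train(training)
        scores.append(accuracy(model, held_out))
    return sum(scores) / n_folds


def train_majority(training):
    """Hypothetical baseline learner: always predict the most common label."""
    return Counter(label for _, label in training).most_common(1)[0][0]


def accuracy_majority(model, test_set):
    return sum(1 for _, label in test_set if label == model) / len(test_set)


examples = [(i, "healthy") for i in range(8)] + [(i, "cancerous") for i in range(2)]
print(cross_validate(examples, 5, train_majority, accuracy_majority))
```

Every example is used for testing exactly once, so the averaged score is an estimate of performance on data the model has not seen.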
28 Classification
- Methods
  - K-nearest neighbors
  - Decision trees
  - Rule induction
  - Associative classification
  - Naïve Bayes and Bayesian belief networks
  - Artificial neural networks
29 Classification: Rule Induction
What factors determine a cell is cancerous?
If Color = light and Tails = 1 and
Nuclei = 2 Then Healthy Cell (certainty = 92%)
If Color = dark and Tails = 2 and
Nuclei = 2 Then Cancerous Cell (certainty = 87%)
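Induced rules like the two above can be applied to a new cell by checking them in order and returning the first one that matches. This is a minimal sketch of rule application, not of the induction algorithm itself; the data layout, the lowercase attribute names and the function `classify` are assumptions, and the certainty figures are read as percentages.

```python
# Each rule: (conditions, predicted class, certainty as a fraction).
RULES = [
    ({"color": "light", "tails": 1, "nuclei": 2}, "healthy", 0.92),
    ({"color": "dark", "tails": 2, "nuclei": 2}, "cancerous", 0.87),
]


def classify(cell, rules=RULES):
    """Return (class, certainty) of the first matching rule, or None."""
    for conditions, label, certainty in rules:
        if all(cell.get(attr) == value for attr, value in conditions.items()):
            return label, certainty
    return None  # no rule covers this cell


print(classify({"color": "dark", "tails": 2, "nuclei": 2}))
print(classify({"color": "light", "tails": 2, "nuclei": 1}))
```

A cell that no rule covers yields `None`, which makes the limited coverage of a small rule set explicit.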
30 Classification: Decision Trees
[Figure: decision tree that splits on Color (dark, light), then on
nuclei (1, 2), then on tails (1, 2), with leaves labeled healthy or
cancerous]
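A decision tree can be stored as nested nodes and applied by following one branch per attribute until a leaf is reached. The exact shape of the tree in the figure cannot be recovered from the slide text, so the tree below is a hypothetical one that merely uses the same attributes and class labels; the names `TREE` and `classify` are also assumptions.

```python
# An internal node is (attribute, {value: subtree}); a leaf is a class label.
# Hypothetical tree, not necessarily the one drawn on the slide.
TREE = ("color", {
    "dark": ("nuclei", {
        1: "cancerous",
        2: ("tails", {1: "healthy", 2: "cancerous"}),
    }),
    "light": ("nuclei", {
        1: "healthy",
        2: ("tails", {1: "healthy", 2: "cancerous"}),
    }),
})


def classify(tree, cell):
    """Walk from the root to a leaf, choosing branches by the cell's values."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[cell[attribute]]
    return tree


print(classify(TREE, {"color": "dark", "nuclei": 2, "tails": 2}))
print(classify(TREE, {"color": "light", "nuclei": 1, "tails": 1}))
```

Each root-to-leaf path corresponds to one IF-THEN rule, which is why trees and rule sets are often presented side by side.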
31 Classification: Neural Networks
What factors determine a cell is cancerous?
[Figure: neural network with inputs Color = dark, nuclei = 1, tails = 2
and outputs Healthy, Cancerous]
32 Association Rules
- Which feature values are commonly associated with
each other?
- Knowledge of the form
  - IF (feature1 = value1) THEN (feature2 = value2)
- Sample applications
  - Market basket analysis
  - Recommender systems
  - Microarray analysis
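For market basket analysis, the strength of a rule "IF A THEN B" is commonly measured by its support (the fraction of baskets containing both A and B) and its confidence (the fraction of baskets containing A that also contain B). The slide does not name these measures, so treat the sketch below as one standard way to score a rule; the baskets and function names are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    ante = support(transactions, antecedent)
    if ante == 0:
        return 0.0  # the antecedent never occurs
    return support(transactions, set(antecedent) | set(consequent)) / ante


# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(baskets, {"bread", "milk"}))
print(confidence(baskets, {"bread"}, {"milk"}))
```

Here bread and milk appear together in 2 of 4 baskets, and 2 of the 3 baskets with bread also contain milk.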
33 Association Rule Mining vs. Classification
34 Clustering
- Also called unsupervised learning
- Grouping data based on similarity
- Need a similarity or distance function
- Need a domain expert to interpret results
35 Clustering
- Methods
  - Agglomerative clustering methods
  - K-means clustering
  - SOM (self-organizing map)
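K-means, one of the methods listed above, alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points. The sketch below is a minimal version that assumes Euclidean distance as the distance function and takes the first k points as initial centroids; both are simplifying assumptions, and the points are hypothetical.

```python
import math


def assign(points, centroids):
    """Group points by the index of their nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def k_means(points, k, iterations=100):
    """Return (centroids, clusters) after running k-means to convergence."""
    centroids = [tuple(p) for p in points[:k]]  # naive initialisation
    for _ in range(iterations):
        clusters = assign(points, centroids)
        updated = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]  # keep the centroid of an empty cluster
            for i, cluster in enumerate(clusters)
        ]
        if updated == centroids:
            break  # assignments can no longer change
        centroids = updated
    return centroids, assign(points, centroids)


# Hypothetical two-dimensional points forming two visible groups.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids, clusters = k_means(points, k=2)
print(centroids)
```

The result depends on the initial centroids, which is one reason a domain expert is still needed to judge whether the clusters are meaningful.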