Knowledge Discovery and Data Mining Lecture 1 - PowerPoint PPT Presentation

About This Presentation

Title:

Knowledge Discovery and Data Mining Lecture 1

Slides: 36

Provided by: hotu7

Transcript and Presenter's Notes

Title: Knowledge Discovery and Data Mining Lecture 1

1
Knowledge Discovery and Data Mining (Lecture 1)
2
Objectives

fundamental techniques of knowledge discovery and
data mining (KDD)
issues in KDD practical use and tools
Common Data Mining Tasks

3
Overview of KDD and Data Mining
1. What is KDD and Why ?
2. The KDD Process
3. KDD Applications
4. Data Mining Methods
5. Challenges for KDD
4

KDD: A Definition
KDD is the automatic extraction of non-obvious,
hidden knowledge from large volumes of data.
10^6 to 10^12 bytes: we never see the whole data set
or put it in the memory of computers
What knowledge? How to represent and use it?
Data mining algorithms?
5
Data, Information, Knowledge
We often see data as a string of bits, or
numbers and symbols, or objects which we
collect daily.
Information is data stripped of redundancy, and
reduced to the minimum necessary to characterize
the data.
Knowledge is integrated information, including
facts and their relations, which have been
perceived, discovered, or learned as our mental
pictures.
Knowledge can be considered
data at a high level of abstraction and
generalization.
6
From Data to Knowledge
Medical Data by Dr. Tsumoto, Tokyo Med. Dent.
Univ., 38 attributes
... 10, M, 0, 10, 10, 0, 0, 0, SUBACUTE, 37, 2,
1, 0,15,-,-, 6000, 2, 0, abnormal, abnormal,-,
2852, 2148, 712, 97, 49, F,-,multiple,,2137,
negative, n, n, ABSCESS,VIRUS 12, M, 0, 5, 5, 0,
0, 0, ACUTE, 38.5, 2, 1, 0,15, -,-,
10700,4,0,normal, abnormal, , 1080, 680, 400,
71, 59, F,-,ABPCCZX,, 70, negative, n, n, n,
BACTERIA, BACTERIA 15, M, 0, 3, 2, 3, 0, 0,
ACUTE, 39.3, 3, 1, 0,15, -, -, 6000, 0,0, normal,
abnormal, , 1124, 622, 502, 47, 63, F,
-,FMOXAMK, , 48, negative, n, n, n, BACTE(E),
BACTERIA 16, M, 0, 32, 32, 0, 0, 0, SUBACUTE, 38,
2, 0, 0, 15, -, , 12600, 4, 0,abnormal,
abnormal, , 41, 39, 2, 44, 57, F, -, ABPCCZX,
?, ? ,negative, ?, n, n, ABSCESS, VIRUS ...
Numerical attribute, categorical attribute,
missing values, class labels
IF cell_poly < 220 AND Risk = n AND Loc_dat
AND Nausea > 15 THEN Prediction = VIRUS (87.5%:
confidence, predictive accuracy)
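
The confidence attached to a rule like the one above can be read as the fraction of records matching the IF part that also carry the predicted class. The sketch below is only an illustration under stated assumptions: the records and the field names (`cell_poly`, `risk`, `nausea`, `diagnosis`) are invented, not taken from the medical data, and the Loc_dat test is left out because its value is not given on the slide.

```python
# Hypothetical records; the values are invented for illustration only.
records = [
    {"cell_poly": 100, "risk": "n", "nausea": 20, "diagnosis": "VIRUS"},
    {"cell_poly": 150, "risk": "n", "nausea": 30, "diagnosis": "VIRUS"},
    {"cell_poly": 200, "risk": "n", "nausea": 16, "diagnosis": "BACTERIA"},
    {"cell_poly": 500, "risk": "n", "nausea": 20, "diagnosis": "BACTERIA"},
    {"cell_poly": 120, "risk": "p", "nausea": 40, "diagnosis": "VIRUS"},
]


def condition(record):
    """IF part of the rule (the Loc_dat test is omitted: its value is unknown)."""
    return (record["cell_poly"] < 220
            and record["risk"] == "n"
            and record["nausea"] > 15)


def rule_confidence(records, condition, predicted_class):
    """Share of the records covered by the rule that have the predicted class."""
    covered = [r for r in records if condition(r)]
    if not covered:
        return None  # the rule covers nothing, so confidence is undefined
    correct = sum(r["diagnosis"] == predicted_class for r in covered)
    return correct / len(covered)


confidence = rule_confidence(records, condition, "VIRUS")
print(confidence)
```

With these made-up records the rule covers three records, two of which are labeled VIRUS.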
7
Data Rich Knowledge Poor
How to acquire knowledge for
knowledge-based systems remains the
main difficult and crucial problem.
People gather and store so much data because
they think some valuable assets are implicitly
coded within it.
?
knowledge base
inference engine
Raw data is rarely of direct benefit.
Its true value depends on the ability to extract
information useful for decision support.
Traditional approach: via knowledge engineers
Manual data analysis is impractical
New trend: via automatic programs
8

Benefits of Knowledge Discovery
[Figure labels: Value, Volume, Generate, Rapid Response, Disseminate, EDP, MIS, DSS]
EDP: Electronic Data Processing
MIS: Management Information Systems
DSS: Decision Support Systems
9
Overview of KDD
1. What is KDD and Why ?
2. The KDD Process
3. KDD Applications
4. Data Mining Methods
5. Challenges for KDD
10
The KDD process
The non-trivial process of identifying valid,
novel, potentially useful, and ultimately
understandable patterns in data - Fayyad,
Piatetsky-Shapiro, Smyth (1996)
11
The Knowledge Discovery Process
[Figure: the KDD process in five numbered steps]
Data mining: a step in the KDD process consisting
of methods that produce useful patterns or models
from the data, under some acceptable computational
efficiency limitations
KDD is inherently interactive and iterative
12
The KDD Process
Data organized by function
Create/select target database
Data warehousing
1
Select sampling technique and sample data
Supply missing values
Eliminate noisy data
2
Normalize values
Transform values
Create derived attributes
Find important attributes and value ranges
4
3
Select DM task(s)
Select DM method(s)
Extract knowledge
Test knowledge
Refine knowledge
Query and report generation, aggregation and
sequences, advanced methods
Transform to different representation
5
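
The data preparation steps listed above (supply missing values, create derived attributes, normalize values) can be sketched as follows. This is a minimal illustration, not the lecture's method: the attribute names `temp` and `wbc`, the mean-fill strategy, the min-max scaling, and the `fever` threshold are all assumptions.

```python
# Hypothetical raw records; None marks a missing value.
raw = [
    {"age": 10, "temp": 37.0, "wbc": 6000},
    {"age": 12, "temp": 38.5, "wbc": 10700},
    {"age": 15, "temp": None, "wbc": 6000},
    {"age": 16, "temp": 38.0, "wbc": 12600},
]


def supply_missing(rows, key):
    """Replace missing values of `key` with the mean of the known values."""
    known = [r[key] for r in rows if r[key] is not None]
    mean = sum(known) / len(known)
    return [{**r, key: mean if r[key] is None else r[key]} for r in rows]


def add_derived(rows):
    """Create a derived attribute: a fever flag (threshold is an assumption)."""
    return [{**r, "fever": r["temp"] >= 38.0} for r in rows]


def normalize(rows, key):
    """Min-max scale `key` into the range [0, 1]."""
    values = [r[key] for r in rows]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid dividing by zero when all values are equal
    return [{**r, key: (r[key] - lo) / span} for r in rows]


cleaned = normalize(add_derived(supply_missing(raw, "temp")), "wbc")
for row in cleaned:
    print(row)
```

Each function returns new records instead of modifying its input, so the steps can be reordered or repeated, which suits the interactive and iterative nature of the process.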
13
Main Contributing Areas of KDD
Statistics
Infer info from data (deduction and induction,
mainly numeric data)
Databases
Store, access, search, update data (deduction)
Data warehouses: integrated data
OLAP: On-Line Analytical Processing
Machine Learning
Computer algorithms that improve automatically
through experience (mainly induction, symbolic
data)
14
Lecture 1 Overview of KDD
1. What is KDD and Why ?
2. The KDD Process
3. KDD Applications
4. Data Mining Methods
5. Challenges for KDD
15
Potential Applications
Business information
- Marketing and sales data analysis
- Investment analysis
- Loan approval
- Fraud detection
- etc.
Manufacturing information
- Controlling and scheduling
- Network management
- Experiment result analysis
- etc.
Scientific information
- Sky survey cataloging
- Biosequence databases
- Geosciences: Quakefinder
- etc.
Personal information
16
KDD Opportunity and Challenges
Competitive Pressure
Data Rich Knowledge Poor (the resource)
KDD
Data Mining Technology Mature
Enabling Technology (Interactive MIS, OLAP,
parallel computing, Web, etc.)
17
Lecture 1 Overview of KDD
1. What is KDD and Why ?
2. The KDD Process
3. KDD Applications
4. Data Mining and its Common Methods
5. Challenges for KDD
18
Data mining

Mining or discovery of new information, in terms
of patterns or rules, from vast amounts of data,
based on the following techniques:
- Machine learning
- Statistics
- Neural networks
- Genetic algorithms
Applications
- Retail/Marketing: consumer behaviour based on buying patterns
- Finance: creditworthiness of clients; performance analysis of financial investments
- Health care/Medicine: effectiveness and side effects of treatments

19
2.1 Data Mining Strategies
20
Supervised learning

Learning to assign objects to classes given
examples
Learner (classifier)

A typical supervised text learning scenario.
21
Unsupervised Learning

The target is not predefined, e.g. grouping the
items in a database so that items in the same
group are similar (clustering).
Examples
Clustering
Association Rule Discovery
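
As an illustration of clustering, here is a minimal k-means sketch. The slides do not name a particular clustering algorithm, so k-means is an assumption. The sample points and the naive initialization (the first k points) are also made up for the example.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, k, iterations=10):
    """Group points into k clusters of mutually similar items."""
    centers = list(points[:k])  # naive initialization: the first k points
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster happens to be empty.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical 2-D items forming two visible groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.9), (0.8, 1.1), (4.8, 5.1)]
centers, clusters = kmeans(points, k=2)
print(clusters)
```

No class labels are supplied. The groups emerge only from the similarity between items.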

22
Common Tasks of Data Mining
Classification
finding the description of several predefined
classes and classifying a data item into one of
them.
Clustering
identifying a finite set of categories or
clusters to describe the data.
Association Rule
finding correlations between items in a database
Summarization
finding a compact description for a subset of
data
Deviation and change detection
discovering the most significant changes in the
data
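
For the association rule task, correlations between items are commonly quantified with support and confidence. These two measures are not named on this slide, so their use here is an assumption, and the transactions are invented for illustration.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated probability of `consequent` given `antecedent`."""
    base = support(antecedent, transactions)
    if base == 0:
        return None  # the antecedent never occurs
    return support(antecedent | consequent, transactions) / base


# Rule "bread => milk"
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```

Here bread and milk occur together in 2 of 4 transactions, and milk appears in 2 of the 3 transactions that contain bread.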
23
Classification
What factors determine cancerous cells?
Examples
General patterns
Data
Mining Algorithm
- Rule Induction - Decision tree - Neural Network
Classification Algorithm
Cancerous Cell Data
24
Classification

Learning is supervised.
The dependent variable is categorical.
Well-defined classes.
Current rather than future behavior.

25
Classification

Example
Input feature x: face length, distance between
eyes
Output y: whether it is a man or a woman
Two phases
Training: classified examples (labeled data) -> learner -> model
Testing: new unclassified examples (unlabeled data) -> model -> classified examples
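
The two phases can be sketched with a 1-nearest-neighbour learner. The choice of learner and the measurements below are assumptions made for illustration; the numbers are invented and carry no claim about real faces.

```python
import math

# Hypothetical labeled examples: (face length, distance between eyes) -> class.
labeled = [
    ((20.0, 6.5), "man"),
    ((21.0, 6.8), "man"),
    ((18.0, 6.0), "woman"),
    ((17.5, 5.8), "woman"),
]


def train(examples):
    """Training phase: a 1-nearest-neighbour 'model' just stores the examples."""
    return list(examples)


def classify(model, x):
    """Testing phase: give x the class of the closest stored example."""
    _, label = min(model, key=lambda example: math.dist(example[0], x))
    return label


model = train(labeled)
unlabeled = [(20.5, 6.6), (17.8, 5.9)]
predictions = [classify(model, x) for x in unlabeled]
print(predictions)
```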
26
Classification

Mathematically, assume there is some function
F(x) = y producing the data. Given many pairs (x,
y), find F.

[Figure: scatter plot of the examples, with axes face length and distance between eyes]
27
Classification

Issues
Expressiveness: how flexible is the modeling
method?
Scalability: how fast can it learn a model from N
features and M examples?
Overfitting: fitting the labeled examples too
exactly often ends up degrading generalization
performance. It is usually caused by training for
too long in pursuit of a perfect model of the
training data. It can be detected and controlled
by evaluating on a separate test data set or by
cross-validation.
N-fold cross-validation: divide the training data
into n partitions; learn the model on n-1
partitions and test it on the held-out partition.
Repeat the process n times, holding out a
different partition each time (the partitions are
formed at random), and average the n results to
estimate the accuracy of the model.
Generalization: performance of the learned model
on unseen (test) examples, i.e. beyond the
training examples.
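
The cross-validation procedure can be sketched as below. The majority-class learner, the function names, and the toy data are assumptions chosen to keep the example self-contained. Any learner with a train and a classify function could be plugged in.

```python
import random
from collections import Counter


def train_majority(examples):
    """Toy learner (an assumption): remember the most common label."""
    return Counter(label for _, label in examples).most_common(1)[0][0]


def classify_majority(model, x):
    """Predict the remembered label regardless of x."""
    return model


def cross_validate(examples, n, train, classify, seed=0):
    """Average accuracy over n folds; needs at least n examples."""
    data = list(examples)
    random.Random(seed).shuffle(data)        # form the partitions at random
    folds = [data[i::n] for i in range(n)]   # n partitions of near-equal size
    accuracies = []
    for i, held_out in enumerate(folds):
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train(training)              # learn on the other n-1 partitions
        correct = sum(classify(model, x) == y for x, y in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / n


# Hypothetical labeled data: eight examples of class "a", two of class "b".
labeled = [(i, "a") for i in range(8)] + [(i, "b") for i in range(8, 10)]
accuracy = cross_validate(labeled, 5, train_majority, classify_majority)
print(accuracy)
```

Every example is used for testing exactly once and is never part of the training data for the fold it is tested in.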

28
Classification

Methods
K-Nearest neighbors
Decision trees
Rule Induction
Associative Classification
Naïve bayes and Bayesian belief networks
Artificial neural networks

29

Classification: Rule Induction
What factors determine whether a cell is cancerous?
If Color = light and Tails = 1 and
Nuclei = 2 Then Healthy Cell (certainty
92%)
If Color = dark and Tails = 2 and
Nuclei = 2 Then Cancerous Cell (certainty
87%)
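
Once induced, rules of this form can be applied to new cells. In the sketch below the conditions are read as equality tests and the first matching rule wins. Both choices are assumptions, as the slide shows only the rules themselves.

```python
# Each rule: (conditions, predicted class, certainty in percent).
RULES = [
    ({"color": "light", "tails": 1, "nuclei": 2}, "healthy", 92),
    ({"color": "dark", "tails": 2, "nuclei": 2}, "cancerous", 87),
]


def classify_cell(cell, rules=RULES):
    """Return (class, certainty) from the first rule whose conditions all hold."""
    for conditions, label, certainty in rules:
        if all(cell.get(attr) == value for attr, value in conditions.items()):
            return label, certainty
    return None  # no rule covers this cell


result = classify_cell({"color": "dark", "tails": 2, "nuclei": 2})
print(result)
```

A cell that matches no rule is left unclassified, so a practical rule set usually also needs a default class.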
30
Classification: Decision Trees
Color dark
Color light
nuclei1
nuclei2
nuclei1
nuclei2
cancerous
healthy
tails1
tails2
tails1
tails2
healthy
cancerous
healthy
cancerous
31
Classification: Neural Networks
What factors determine whether a cell is cancerous?
Color = dark, nuclei = 1, tails = 2
Healthy
Cancerous
32
Association Rules