MIS 467 Data Mining Chapter 1 introduction Fall 2003

1
MIS 467: Data Mining
Chapter 1: Introduction
Fall 2003
2
Personal Information

Instructor: Bertan Badur, Ph.D.
Office: HKA 226
Phone: 0 212 358 15 40, ext. 2027
E-mail: badur_at_boun.edu.tr
Office Hours: Mondays 14.00-15.00, Tuesdays 14.00-15.00, or by appointment

3
Course Information

Course Hours: Wednesdays 5 (13:00-13:50), Thursdays 1, 2 (9:00-10:50)
Lab Hours: Wednesdays 4 (12:00-12:50)
Place: HKA 303
Course Assistant: Gülçin Buruncuk
Web page: www.mis.boun.edu.tr/badur/MIS467

4
Course Description

This course aims at introducing basic
methodologies and techniques of data mining.
Basic data mining functionalities such as
association, concept description, classification,
prediction and clustering are introduced and
various algorithms to achieve them are presented.
Applications of these concepts and techniques to
real world problems are discussed. Data mining
software programs are introduced in lab hours.

5
Text Book

Main
Data Mining: Concepts and Techniques, by Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers, 2001
Supplementary Text Books
Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, by Ian H. Witten and Eibe Frank, Morgan Kaufmann Publishers, 2000
Machine Learning, by Tom M. Mitchell, McGraw-Hill International Editions, 1997
Mastering Data Mining: The Art and Science of Customer Relationship Management, by Michael J. A. Berry and Gordon Linoff, Wiley Computer Publishing, 2000
Predictive Data Mining, by S. M. Weiss and N. Indurkhya, Morgan Kaufmann Publishers, 1998

6
Han's CS397 course slides

http://www-courses.cs.uiuc.edu/cs397han/index.htm

7
Course Outline (1)

Introduction: 1 W, Ch. 1
An Overview of Data Warehouses and OLAP: 1 W, Ch. 2
Data Preprocessing: 2 W, Ch. 3
Concept Description: 1 W, Ch. 5

8
Course Outline (2)

Association Rule Mining: 2 W, Ch. 6
Classification and Prediction: 4 W, Ch. 7
Decision Trees
Bayesian Classification
Classification by Backpropagation
Regression for Classification and Prediction
Classification Accuracy
Cluster Analysis: 2 W, Ch. 8

9
Prerequisites

Basic notion of probability
Elementary calculus
Elementary knowledge of matrices
Basic concepts of database systems

10
Grading

Midterm: 25% (due 19.11.2003)
Homework: 50%
Project: optional
Final Exam: 25%

11
Project

Each student or group of students is required to
prepare a term project. Implementation of
selected data mining algorithms, application of
studied techniques to a small-scale real-world
problem, or a literature survey can be accepted
as a term project.

12
Software

DBMiner: DBMiner 2.0 Educational Version
developed by J. Han (author of the book Data Mining: Concepts and Techniques) and his team
compatible with the text book; performs
association, classification, and cluster analysis
Microsoft SQL Server Analysis Services
SPSS
Neural Connection: performs neural network
modelling for classification and prediction
Answer Tree: decision tree analysis

13
Data Sources

FoodMart database coming with Analysis Services
WareMart database
Data sources from internet
UCI KDD Archive
UCI Machine Learning Repository
Financial data from IMKB
A database about disabled people in Turkey

14
Where to Find the Set of Slides for the Text Book?

Tutorial sections (MS PowerPoint files)
http://www.cs.sfu.ca/han/dmbook
Other conference presentation slides (.ppt)
http://db.cs.sfu.ca/ or http://www.cs.sfu.ca/han
Research papers, DBMiner system, and other
related information
http://db.cs.sfu.ca/ or http://www.cs.sfu.ca/han

15
Chapter 1. Introduction

Motivation: Why data mining?
What is data mining?
Business applications of data mining
Data mining: On what kind of data?
Data mining functionality
Are all the patterns interesting?
Classification of data mining systems
Major issues in data mining

16
Motivation: "Necessity Is the Mother of Invention"

Data explosion problem
Automated data collection tools and mature
database technology lead to tremendous amounts of
data stored in databases, data warehouses and
other information repositories
Need to convert such data into knowledge and
information
Applications
Business management
Production control
Market analysis
Engineering design
Science exploration

17
Evolution of Database Technology (1)

Data collection, database creation
Data management
data storage and retrieval
database transaction processing
Data analysis and understanding
Data mining and data warehousing

18
Evolution of Database Technology (2) (See Fig. 1.1)

1960s
Data collection, database creation, IMS and
network DBMS
1970s
Relational data model, relational DBMS
implementation
1980s
RDBMS, advanced data models (extended-relational,
OO, deductive, etc.) and application-oriented
DBMS (spatial, scientific, engineering, etc.)
1990s-2000s
Data mining and data warehousing, multimedia
databases, and Web databases

19
Developments in computer hardware

Powerful and affordable computers
data collection equipment
storage media

20
Data Warehouse

Data cleaning
Data integration
OLAP On-Line Analytical Processing
summarization
consolidation
aggregation
view information from different angles
but additional data analysis tools are needed for
classification
clustering
characterization of data changing over time

21
Data-rich, information-poor situation

Abundance of data
need for powerful data analysis tools
data tombs - data archives
seldom visited
Important decisions are made
not on the information-rich data stored in
databases
but on a decision maker's intuition
no tool to extract the knowledge embedded in vast
amounts of data
Expert system technology
domain experts to input knowledge
time consuming and costly

22
What Is Data Mining?

Data mining (knowledge discovery in databases)
Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful)
information or patterns from data in large
databases
Alternative names and their inside stories
Data mining: a misnomer?
Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information
harvesting, business intelligence, etc.
What is not data mining?
Query processing
Expert systems or small ML/statistical programs

23
Potential Business Applications

Market analysis and management
target marketing, customer relation management,
market basket analysis, cross selling, market
segmentation
Risk analysis and management
Banks assume a financial risk when they grant
loans
risk models attempt to predict the probability of
default, i.e., failure to pay back the borrowed amount
Credit cards
Insurance companies
Fraud detection and management
Other Applications
Text mining (news group, email, documents) and
Web analysis.
Intelligent query answering

24
Market Analysis and Management (1)

Where are the data sources for analysis?
Credit card transactions, loyalty cards, discount
coupons, customer complaint calls, plus (public)
lifestyle studies, clickstreams
Customer profiling and segmentation
data mining can tell you what types of customers
buy what products (clustering or classification)
Target marketing
Find clusters of model customers who share the
same characteristics: interest, income level,
spending habits, etc.

25
Market Analysis and Management (2)

Effectiveness of sales campaigns
Advertisements, coupons, discounts, bonuses
promote products and attract customers
can help improve profits
Compare amount of sales and number of
transactions
during the sales period versus before or after
the sales campaign
Association analysis
which items are likely to be purchased together
with the items on sale

26
Market Analysis and Management (3)

Customer retention: analysis of customer loyalty
sequences of purchases of particular customers
goods purchased at different periods by the same
customers can be grouped into sequences
changes in customer consumption or loyalty
suggest adjustments to the pricing and variety
of goods
to retain old customers and attract new customers
Cross-selling and up-selling
associations from sales records
a customer who buys a PC is likely to buy a
printer
purchase recommendations

27
Fraud Detection and Management

Applications
widely used in health care, retail, credit card
services, telecommunications (phone card fraud),
etc.
Approach
use historical data to build models of fraudulent
behavior and use data mining to help identify
similar instances
Examples
Auto insurance: detect a group of people who
stage accidents to collect on insurance
Money laundering: detect suspicious money
transactions (US Treasury's Financial Crimes
Enforcement Network)
Detecting telephone fraud
Telephone call model: destination of the call,
duration, time of day or week. Analyze patterns
that deviate from an expected norm.

28
Financial Data Analysis

Financial data
complete, reliable, high quality
Loan payment prediction and customer credit
policy analysis

29
Loan payment prediction and customer credit
policy analysis

Factors influencing loan payment performance
loan-to-value ratio
term of the loan
debt ratio (total monthly debt / total monthly
income)
payment-to-income ratio
income level
education level
residence region
credit history
analysis may find that
the payment-to-income ratio is a dominant factor while
education level and debt ratio are not

30
Data Mining for the Telecommunication Industry

Telecommunication data are multidimensional
calling time, duration
location of caller, location of callee
type of call
used to identify and compare
data traffic, system workload
resource usage, user group behavior
profit
fraudulent pattern analysis and identification of
unusual patterns
to achieve customer loyalty
characteristics of customers affecting line usage

31
Other Applications

Sports
IBM Advanced Scout analyzed NBA game statistics
(shots blocked, assists, and fouls) to gain
competitive advantage for New York Knicks and
Miami Heat
Astronomy
JPL and the Palomar Observatory discovered 22
quasars with the help of data mining
Internet Web Surf-Aid
IBM Surf-Aid applies data mining algorithms to
Web access logs for market-related pages to
discover customer preference and behavior pages,
analyzing effectiveness of Web marketing,
improving Web site organization, etc.

32
Steps of a KDD Process (1)

1. Learning the application domain
relevant prior knowledge and goals of application
2. Creating a target data set: data selection
3. Data cleaning and preprocessing (may take
60-80% of the effort!)
removal of noise or outliers
strategies for missing data fields
accounting for time sequence information
4. Data reduction and transformation
Find useful features, dimensionality/variable
reduction, invariant representation.

33
Steps of a KDD Process (2)

5. Choosing functions of data mining
summarization, classification, regression,
association, clustering.
6. Choosing the mining algorithm(s)
which models or parameters
7. Data mining: search for patterns of interest
8. Pattern evaluation and knowledge presentation
visualization, transformation, removing redundant
patterns, etc.
9. Use of discovered knowledge
incorporating into the performance system
documenting
reporting to interested parties

34
An example: customer segmentation

1. The marketing department wants to perform a
segmentation study on the customers of AE Company
2. Decide on relevant variables from a data
warehouse on customers, sales, and promotions
Customers: name, ID, income, age, education, ...
Sales: history of sales
Promotions: promotion types, durations, ...
3. Handle missing incomes, addresses, ...
Determine outliers, if any
4. Generate new index variables representing the
wealth of customers
$\text{Wealth} = a \cdot \text{income} + b \cdot \text{houses} + c \cdot \text{cars} + \dots$
Make necessary transformations (e.g., z-scores) so that
some data mining algorithms work more efficiently

5. Choose clustering as the data mining
functionality, as it is the natural one for a
segmentation study: it finds groups of
customers with similar characteristics
6. Choose a clustering algorithm
K-means, k-medoids, or any other one suitable for that
problem
7. Apply the algorithm
Find clusters or segments
8. Make reverse transformations and visualize the
customer segments
9. Present the results in the form of a report to
the marketing department
Implement the segmentation as part of a DSS so
that it can be applied repeatedly at certain
intervals as new customers arrive
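
Steps 4 to 7 of this example can be sketched in a few lines of Python. This is a minimal illustration, not the course's lab software: the customer values are made up, the helper names `z_scores` and `k_means` are hypothetical, and plain k-means with a fixed random seed stands in for whichever clustering algorithm is chosen in step 6.

```python
import math
import random


def z_scores(values):
    """Standardize a list of numbers to mean 0 and standard deviation 1."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def k_means(points, k, iterations=20, seed=0):
    """Plain k-means: returns one cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


# Hypothetical customers: yearly income and age (made-up values).
incomes = [20, 22, 25, 80, 85, 90]
ages = [25, 27, 24, 50, 55, 52]

# Step 4: z-scores put both variables on the same scale.
points = list(zip(z_scores(incomes), z_scores(ages)))

# Steps 5-7: cluster the standardized customers into two segments.
segments = k_means(points, k=2)
```

Without the z-score step, income (with its larger numeric range) would dominate the Euclidean distance, which is one reason the transformation in step 4 matters for distance-based algorithms.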

36
Data Mining: A KDD Process

Data mining: the core of the knowledge discovery
process.

[Figure: the KDD process as a sequence of stages: Databases, Data Cleaning and Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation, Knowledge]
37
Architecture of a Typical Data Mining System
[Figure: architecture diagram with the components Graphical user interface, Pattern evaluation, Data mining engine, Knowledge base, Database or data warehouse server, Filtering, Data cleaning and data integration, Data Warehouse, Databases]
38
Architecture of a Typical Data Mining System

Database, data warehouse
Database or data warehouse server
Knowledge base
concept hierarchies
user beliefs
used to assess pattern interestingness
other thresholds
Data mining engine
functional modules
characterization, association, classification,
cluster analysis, evolution and deviation
analysis
Pattern evaluation module
Graphical user interface

39
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Visualization, Information Science, and Other Disciplines]
40
Efficient and Scalable Techniques

For an algorithm to be scalable,
its running time should grow linearly in
proportion to the size of the database

41
Data Mining: On What Kind of Data?

Relational databases
Data warehouses
Transactional databases
Advanced DB and information repositories
Object-oriented and object-relational databases
Spatial databases
Time-series data and temporal data
Text databases and multimedia databases
Heterogeneous and legacy databases
WWW

42
An Example Problem

AllElectronics is a multi-branch retail company
relational tables include
customer: ID, name, address, age, income, education, sex, marital status
items: ID, name, brand, category, type, price, place_made, supplier, cost
employee: ID, name, department, education, salary
branch
purchases: transID, item_sold, customer_ID, emp_ID, date, time, method_paid, amount

43
Two styles of data mining

Descriptive data mining
Characterizes the general properties of the data
in the database
finds patterns in data, and
the user determines which ones are important
Predictive data mining
perform inference on the current data to make
predictions
we know what to predict
Not mutually exclusive
used together

44
Descriptive Data Mining (1)

Discovering new patterns inside the data
Used during the data exploration steps
Typical questions answered by descriptive data
mining
What is in the data?
What does it look like?
Are there any unusual patterns?
What does the data suggest for customer
segmentation?
Users may have no idea
which kinds of patterns may be interesting

45
Descriptive Data Mining (2)

Patterns at various granularities
geography
country - city - region - street
student
university - faculty - department - minor
Functionalities of descriptive data mining
Clustering
Ex: customer segmentation
Summarization
Visualization
Association
Ex: market basket analysis

46
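A model is a black box
X: vector of independent variables (the inputs X1, X2, ...)
Y = f(X): the output, where f is an unknown function
[Figure: the inputs X1, X2 enter a box labeled Model, which produces the output Y]
The user does not care what the model is doing; it
is a black box. The user is interested in the accuracy of its
predictions.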
47
Predictive Data Mining (1)

Using known examples the model is trained
the unknown function is learned from data
the more data with known outcomes are available,
the better the predictive power of the model
Used to predict outcomes whose inputs are known
but whose output values are not realized yet
Never 100% accurate

48
Predictive Data Mining (2)

The performance of a model on past data,
that is, on predicting outcomes that are already known,
is not what matters
Its performance on unknown data is much more
important

49
Typical questions answered by predictive models

Who is likely to respond to our next offer?
based on the history of previous marketing campaigns
Which customers are likely to leave in the next
six months?
Which transactions are likely to be fraudulent?
based on known examples of fraud
What will a customer's total spending be
in the next month?

50
Data Mining Functionalities (1)

Concept description: characterization and
discrimination
Generalize, summarize, and contrast data
characteristics, e.g., dry vs. wet regions
Association (correlation and causality)
Multi-dimensional vs. single-dimensional
association
$\text{age}(X, \text{"20..29"}) \wedge \text{income}(X, \text{"20..29K"}) \Rightarrow \text{buys}(X, \text{"PC"})$ [support = 2%, confidence = 60%]
$\text{contains}(T, \text{"computer"}) \Rightarrow \text{contains}(T, \text{"software"})$ [support = 1%, confidence = 75%]

51
Data Mining Functionalities (2)

Classification and Prediction
Finding models (functions) that describe and
distinguish classes or concepts for future
prediction
E.g., classify countries based on climate, or
classify cars based on gas mileage
Presentation: decision tree, classification rule,
neural network
Prediction: predict some unknown or missing
numerical values
Cluster analysis
Class label is unknown: group data to form new
classes, e.g., cluster houses to find
distribution patterns
Clustering is based on the principle of maximizing the
intra-class similarity and minimizing the
inter-class similarity

52
Data Mining Functionalities (3)

Outlier analysis
Outlier: a data object that does not comply with
the general behavior of the data
It can be considered noise or an exception, but it is
quite useful in fraud detection and rare-event
analysis
Trend and evolution analysis
Trend and deviation: regression analysis
Sequential pattern mining, periodicity analysis
Similarity-based analysis
Other pattern-directed or statistical analyses

53
Concept Description

Characterization
Discrimination
Data classes or concepts
classes of items for sale:
computers, printers
concepts of customers:
bigSpenders
budgetSpenders

54
Data Characterization

Summarizing the data of the class under study
(the target class)
Methods
SQL queries
OLAP roll-up operation
user-controlled data summarization
along a specified dimension
attribute-oriented induction
without step-by-step user interaction
The output of characterization
pie charts, bar charts, curves, multidimensional
data cubes, or cross tabs
in rule form, as characteristic rules

55
Characterization example

Description summarizing the characteristics of
customers who spend more than 1000 a year at
AllElectronics
age, employment, income
drill down on any dimension
on occupation: view these customers according to their type
of employment

56
Data Discrimination

Comparing the target class with one or a set of
comparative classes (contrasting classes)
these classes can be specified by the user
through database queries
methods and output
similar to those used for characterization
include comparative measures to distinguish
between the target and contrasting classes

57
Discrimination examples

Example 1: Compare the general features of
software products
whose sales increased by 10% in the last year
whose sales decreased by at least 30% during the
same period
Example 2: Compare two groups of AE customers
I) those who shop for computer products regularly,
more than two times a month
II) those who rarely shop for such products,
less than three times a year
The resulting description
80% of group I customers
university education
ages 20-40
60% of group II customers
seniors or young
no university degree

58
Multidimensional Data

Sales according to region, month, and product type

Dimensions: Product, Location, Time
Hierarchical summarization paths
Product: Industry - Category - Product
Location: Region - Country - City - Office
Time: Year - Quarter - Month or Week - Day
[Figure: data cube with the axes Region, Product, and Month]
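
Summarizing along one of these hierarchies (a roll-up) amounts to grouping the sales facts by a coarser key and adding up the measure. The sketch below is only an illustration: the sales rows are invented and `roll_up` is a hypothetical helper, not a function of any OLAP product.

```python
from collections import defaultdict

# Hypothetical sales facts: (city, country, month, product_type, amount).
sales = [
    ("Istanbul", "Turkey", "2003-01", "computer", 120),
    ("Istanbul", "Turkey", "2003-02", "printer", 40),
    ("Ankara", "Turkey", "2003-01", "computer", 80),
    ("Chicago", "USA", "2003-01", "computer", 200),
    ("Chicago", "USA", "2003-02", "printer", 60),
]


def roll_up(facts, key):
    """Sum the amount of every fact that shares the same key."""
    totals = defaultdict(int)
    for fact in facts:
        totals[key(fact)] += fact[4]
    return dict(totals)


# City level: the finest location granularity in this toy data.
by_city = roll_up(sales, key=lambda f: f[0])
# Roll up City -> Country along the location hierarchy.
by_country = roll_up(sales, key=lambda f: f[1])
# Two dimensions at once: country by product type.
by_country_and_type = roll_up(sales, key=lambda f: (f[1], f[3]))
```

Drilling down is the reverse move: going from `by_country` back to `by_city` exposes the detail that the roll-up summed away.
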
59
Association Analysis

Discovery of association rules showing
attribute-value conditions that occur frequently
together in a given set of data
widely used for
market basket and
transaction data analysis
more formally
$X \Rightarrow Y$, that is,
$A_1 \wedge A_2 \wedge \dots \wedge A_k \Rightarrow B_1 \wedge B_2 \wedge \dots \wedge B_l$
where the $A_i$ and $B_j$ are attribute-value pairs

60
Example: association analysis

From the AllElectronics database
$\text{age}(X, \text{"20..29"}) \wedge \text{income}(X, \text{"1 billion..2 billion"}) \Rightarrow \text{buy}(X, \text{"CD player"})$
[support = 2%, confidence = 60%]
X is a variable representing a customer
2% of the AE customers
are between 20 and 29 years of age,
have incomes ranging from 1 to 2 billion TL, and
buy a CD player
there is a 60% probability that a customer in those age
and income groups will buy a CD player
a multidimensional association rule
contains more than one attribute or predicate

61
Market basket analysis

Customers' buying behavior is investigated
based only on the transaction data
no information about customer properties such as age or
income
Managers
are interested in which products or product
groups are sold together

62
Example: basket analysis rule

$\text{buy}(\text{computer}) \Rightarrow \text{buy}(\text{printer})$
[support = 1%, confidence = 60%]
1% of all transactions contain both
a computer and a printer
if a transaction contains a computer,
there is a 60% chance that it contains a printer as
well
a single-dimensional association rule
contains a single predicate
an association rule is interesting if
its support exceeds a minimum threshold and
its confidence exceeds a minimum threshold
These minimum values are set by specialists
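
The two measures can be computed directly from a list of transactions. The following is a small sketch with invented baskets; the function names and the two minimum thresholds are assumptions chosen for illustration, not values from the course.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical baskets; each transaction is a set of purchased items.
baskets = [
    {"computer", "printer"},
    {"computer", "printer", "software"},
    {"computer", "software"},
    {"printer"},
    {"computer"},
]

# Rule under study: buy(computer) => buy(printer).
rule_support = support(baskets, {"computer", "printer"})
rule_confidence = confidence(baskets, {"computer"}, {"printer"})

# Made-up minimum thresholds; in practice a specialist sets these.
MIN_SUPPORT, MIN_CONFIDENCE = 0.3, 0.4
is_interesting = rule_support > MIN_SUPPORT and rule_confidence > MIN_CONFIDENCE
```

Here two of the five baskets contain both items, and four contain a computer, so the rule's support is 2/5 and its confidence is 2/4.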

63
Classification and Prediction

Classification: finding models (functions) that
describe and distinguish classes or concepts for
future prediction
The derived model is based on the analysis of a
set of training data
E.g., classify countries based on climate, or
classify cars based on gas mileage
Presentation: decision tree, classification rule,
neural network
Prediction: predict some unknown or missing
numerical values
may need to be preceded by relevance analysis
which attempts to identify attributes that do not
contribute to the classification or prediction
process
these attributes can be excluded

64
Steps of classification process

Train the model
using a training set
objects whose class labels are known
Test the model
on a test sample
whose class labels are known but not used for
training the model
Use the model for classification
on new data whose class labels are unknown
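
The three steps can be illustrated with a deliberately simple classifier. The sketch below uses a 1-nearest-neighbor rule, which is an assumption made for brevity (the slides do not prescribe it), and the labelled customers are invented.

```python
import math


def nearest_neighbor(train, x):
    """Return the label of the training example closest to x (1-NN)."""
    _, label = min(train, key=lambda example: math.dist(example[0], x))
    return label


def accuracy(train, test):
    """Share of test examples whose predicted label matches the known one."""
    correct = sum(1 for x, label in test if nearest_neighbor(train, x) == label)
    return correct / len(test)


# Hypothetical labelled customers: ((income, wealth), class label).
labelled = [
    ((10, 5), "risky"), ((12, 8), "risky"), ((15, 6), "risky"),
    ((60, 70), "good"), ((65, 80), "good"), ((70, 75), "good"),
    ((11, 7), "risky"), ((68, 72), "good"),
]

# 1. Train: for 1-NN, "training" is just storing the labelled examples.
train_set = labelled[:6]
# 2. Test: the labels are known but were not used for training.
test_set = labelled[6:]
test_accuracy = accuracy(train_set, test_set)
# 3. Use: classify a new customer whose label is unknown.
new_customer_label = nearest_neighbor(train_set, (14, 9))
```

Keeping the test examples out of `train_set` is what makes `test_accuracy` an estimate of performance on unseen data rather than on data the model has already memorized.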

65
A hypothetical example
Historical data: each customer's type is known; each
customer has a label

Testing set: customers whose labels are also
known but not used in
training the model

New customers: their type has to be
estimated
Each new customer has to be classified as risky,
normal, or good

66
A hypothetical example (cont.)

Based on historical data, develop a classification
model
Decision tree, neural network, regression, ...
Test the performance of the model on a portion of
the historical data
If the accuracy of the model is satisfactory,
use the model on the new customers
11 and 27 to assign a type to these new
customers

67
Example
[Figure: scatter plot of customers with the axes yearly income and wealth; points are labeled OK or DEFAULT]
68
Decision Trees
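$x_1$: yearly income, $x_2$: wealth; $y = 0$: DEFAULT, $y = 1$: OK

The numerical values of
$\theta_1$ and $\theta_2$
are estimated
by the algorithm

69
Solution
[Figure: regions labeled OK and DEFAULT, with the threshold $\theta_2$ marked]
Rule: IF yearly income > $\theta_1$ AND wealth > $\theta_2$
THEN OK ELSE DEFAULT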
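
As a rough illustration of how an algorithm might estimate the two thresholds in this rule, the sketch below tries every observed value as a candidate and keeps the pair with the fewest training errors. This exhaustive search is a simplification assumed here for clarity; real decision-tree algorithms choose splits with other criteria. The customer history and the function names are hypothetical.

```python
def classify(income, wealth, income_threshold, wealth_threshold):
    """The rule above: OK only if both values exceed their thresholds."""
    if income > income_threshold and wealth > wealth_threshold:
        return "OK"
    return "DEFAULT"


def fit_thresholds(customers):
    """Try every observed value as a threshold; keep the pair with fewest errors."""
    best = None
    for t1 in sorted({c[0] for c in customers}):
        for t2 in sorted({c[1] for c in customers}):
            errors = sum(
                classify(inc, w, t1, t2) != label for inc, w, label in customers
            )
            if best is None or errors < best[0]:
                best = (errors, t1, t2)
    return best


# Hypothetical history: (yearly income, wealth, known outcome).
history = [
    (10, 50, "DEFAULT"), (20, 10, "DEFAULT"), (30, 15, "DEFAULT"),
    (40, 60, "OK"), (50, 70, "OK"), (60, 20, "DEFAULT"),
]

errors, income_threshold, wealth_threshold = fit_thresholds(history)
new_customer = classify(45, 65, income_threshold, wealth_threshold)
```
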
70
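Artificial Neural Nets: Perceptron
[Figure: a perceptron with inputs $x_0 = 1, x_1, x_2, \dots, x_d$, weights $w_0, w_1, w_2, \dots, w_d$, a function $g$, and output $y$]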
Training ANNs
Learning set
Find w which minimizes the error on X
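
One classic way to search for such a weight vector is the perceptron learning rule, sketched below. Several details are assumptions, since the slide's formulas did not survive: a step function is used as the activation, the bias is handled as the weight w0 on a constant input x0 = 1, and the learning set (logical AND) is a made-up, linearly separable toy problem.

```python
def predict(weights, x):
    """Step activation applied to the weighted sum; weights[0] is the bias w0."""
    total = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return 1 if total > 0 else 0


def train_perceptron(examples, epochs=50, rate=1.0):
    """Perceptron learning rule: adjust the weights after each mistake."""
    dims = len(examples[0][0])
    weights = [0.0] * (dims + 1)  # w0 (bias) followed by w1..wd
    for _ in range(epochs):
        for x, target in examples:
            error = target - predict(weights, x)  # -1, 0, or +1
            weights[0] += rate * error
            for i, xi in enumerate(x):
                weights[i + 1] += rate * error * xi
    return weights


# Hypothetical, linearly separable learning set: logical AND of two inputs.
learning_set = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights = train_perceptron(learning_set)
training_errors = sum(predict(weights, x) != t for x, t in learning_set)
```

The rule only converges to zero training error when the classes are linearly separable; multi-layer networks trained by backpropagation are used when they are not.
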
72
ANN for classification
73
Prediction methods

linear regression
$Y_i = a_0 + a_1 X_{1,i} + a_2 X_{2,i} + \dots + a_k X_{k,i} + u_i$
non-linear regression
$Y_i = f(X_{1,i}, X_{2,i}, \dots, X_{k,i}; a_1, a_2, \dots, a_k, u_i)$
generalized linear regression
logistic
logit, probit
when the dependent variable is categorical
good customer / bad customer, or employed / unemployed
Poisson regression
for count variables
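
For the simplest case, a linear regression with a single independent variable, the coefficients have a closed-form least-squares solution. The sketch below assumes that one-variable setting; the income and expense figures are invented and `fit_line` is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a0, a1) for the model y = a0 + a1 * x + u."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx               # slope
    a0 = mean_y - a1 * mean_x    # intercept
    return a0, a1


# Hypothetical data: monthly income (x) and monthly expense (y).
incomes = [1000, 1500, 2000, 2500, 3000]
expenses = [700, 950, 1200, 1450, 1700]

a0, a1 = fit_line(incomes, expenses)
predicted_expense = a0 + a1 * 1800  # prediction for a new customer
```

With several independent variables the same idea applies, but the estimates come from solving a system of linear equations rather than from these two formulas.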

74
Example: Prediction and Classification

Classification is used to classify customers
applying for credit cards
known class labels: risky, reliable
when a new customer applies, her
characteristics are examined
(income, age, education, wealth, region, ...)
and the customer's class is predicted
Prediction: the monthly expense of a new customer
(a real continuous variable) is predicted based
on personal information
independent variables
income, education, wealth, profession, ...
Some are numeric, some categorical

75
Cluster Analysis

Class label is unknown: group data to form new
classes,
assign class labels to each data object
e.g., cluster houses to find distribution
patterns
Clustering is based on the principle of maximizing the
intra-class similarity and minimizing the
inter-class similarity
Objects within a cluster are highly similar
to one another
but are very dissimilar to objects in other
clusters
there may be a hierarchy of classes

76
Example Clustering

Can be performed on AE customer data
to identify homogenous subpopulations of
customers
represent individual target groups for marketing

77

[Figure: scatter plot of customers with the axes income and distance to store; points are marked Type 1, Type 2, and Type 3]
Clustering according to income and distance to
store: three clusters of data points are evident
78
Outlier Analysis

Outlier: a data object that does not comply with
the general behavior of the data
It can be considered noise or an exception, but it is
quite useful in fraud detection and rare-event
analysis
Detected using
statistical tests
distance measures
visually inspecting the data
Examples

79
Reasons for outliers

Measurement errors
Coding errors
age is entered as 999
Nature of the data
the salary of the general manager is much higher
than that of the other employees
during a crisis the interest rate was on the order of
1000s
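
A coding error such as an age of 999 is easy to flag with a simple statistical test. The sketch below uses the modified z-score, which is based on the median and the median absolute deviation (MAD); this choice of test and the cutoff of 3.5 are assumptions made for illustration, and the ages are invented.

```python
import statistics


def modified_z_scores(values):
    """Score each value by its distance from the median, scaled by the MAD."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    # Assumes mad > 0, i.e., the values are not almost all identical.
    return [0.6745 * (v - median) / mad for v in values]


def find_outliers(values, cutoff=3.5):
    """Return the values whose modified z-score exceeds the cutoff."""
    scores = modified_z_scores(values)
    return [v for v, s in zip(values, scores) if abs(s) > cutoff]


# Hypothetical ages with one coding error.
ages = [23, 35, 41, 29, 52, 999, 38, 46]
outliers = find_outliers(ages)
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value such as 999 inflates the latter two and can hide itself in a small sample.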

80
Evolution Analysis

Describes and models regularities or trends for
objects whose behavior changes over time
Distinct features include
Trend and deviation: time-series data analysis
Sequential pattern mining, periodicity analysis
Similarity-based analysis
Example
Stock market predictions: future stock prices
for overall stock indexes or individual company
stocks

81
Are All the Discovered Patterns Interesting?

A data mining system/query may generate thousands
of patterns; not all of them are interesting.
Are all patterns interesting?
Typically not: only a small fraction of patterns
is interesting to any given user
Interestingness measures: a pattern is
interesting if
it is easily understood by humans,
valid on new or test data with some degree of
certainty,
potentially useful,
novel, or
validates some hypothesis that a user seeks to
confirm

82
Objective vs. subjective interestingness measures

Objective: based on statistics and structures of
patterns, e.g.,
support of $X \Rightarrow Y$:
$P(X \cup Y)$, the probability that a transaction
contains both X and Y
confidence: degree of certainty of the detected
association
$P(Y \mid X)$, the conditional probability
that a transaction containing X also
contains Y
thresholds, controlled by the user
ex: rules that do not satisfy a confidence
threshold of 50% are uninteresting
Subjective: based on the user's belief in the data,
e.g., unexpectedness, novelty, actionability,
etc.

83
Can We Find All and Only Interesting Patterns?

Find all the interesting patterns: completeness
Can a data mining system find all the interesting
patterns?
Association vs. classification vs. clustering
Search for only interesting patterns:
optimization
Can a data mining system find only the
interesting patterns?
Approaches
First generate all the patterns and then filter
out the uninteresting ones.
Generate only the interesting patterns: mining
query optimization

84
Data Mining Classification Schemes

General functionality
Descriptive data mining
Predictive data mining
Different views, different classifications
Kinds of databases to be mined
Kinds of knowledge to be discovered
Kinds of techniques utilized
Kinds of applications adapted

85
A Multi-Dimensional View of Data Mining Classification

Databases to be mined
Relational, transactional, object-oriented,
object-relational, active, spatial, time-series,
text, multi-media, heterogeneous, legacy, WWW,
etc.
Knowledge to be mined
Characterization, discrimination, association,
classification, clustering, trend, deviation and
outlier analysis, etc.
Multiple/integrated functions and mining at
multiple levels
Techniques utilized
Database-oriented, data warehouse (OLAP), machine
learning, statistics, visualization, neural
network, etc.
Applications adapted
Retail, telecommunication, banking, fraud
analysis, DNA mining, stock market analysis, Web
mining, Weblog analysis, etc.

86
Major Issues in Data Mining (1)

Mining methodology and user interaction
Mining different kinds of knowledge in databases
Interactive mining of knowledge at multiple
levels of abstraction
Incorporation of background knowledge
Data mining query languages and ad-hoc data
mining
Expression and visualization of data mining
results
Handling noise and incomplete data
Pattern evaluation: the interestingness problem
Performance and scalability
Efficiency and scalability of data mining
algorithms
Parallel, distributed and incremental mining
methods

87
Major Issues in Data Mining (2)

Issues relating to the diversity of data types
Handling relational and complex types of data
Mining information from heterogeneous databases
and global information systems (WWW)
Issues related to applications and social impacts
Application of discovered knowledge
Domain-specific data mining tools
Intelligent query answering
Process control and decision making
Integration of the discovered knowledge with
existing knowledge: a knowledge fusion problem
Protection of data security, integrity, and
privacy

88
Summary

Data mining: discovering interesting patterns
from large amounts of data
A natural evolution of database technology, in
great demand, with wide applications
A KDD process includes data cleaning, data
integration, data selection, transformation, data
mining, pattern evaluation, and knowledge
presentation
Mining can be performed in a variety of
information repositories
Data mining functionalities: characterization,
discrimination, association, classification,
clustering, outlier and trend analysis, etc.
Classification of data mining systems
Major issues in data mining

89
A Brief History of Data Mining Society

1989 IJCAI Workshop on Knowledge Discovery in
Databases (Piatetsky-Shapiro)
Knowledge Discovery in Databases (G.
Piatetsky-Shapiro and W. Frawley, 1991)
1991-1994 Workshops on Knowledge Discovery in
Databases
Advances in Knowledge Discovery and Data Mining
(U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and
R. Uthurusamy, 1996)
1995-1998 International Conferences on Knowledge
Discovery in Databases and Data Mining
(KDD95-98)
Journal of Data Mining and Knowledge Discovery
(1997)
1998 ACM SIGKDD, SIGKDD 1999-2001 conferences,
and SIGKDD Explorations
More conferences on data mining
PAKDD, PKDD, SIAM-Data Mining, (IEEE) ICDM, etc.

90
Where to Find References?

Data mining and KDD (SIGKDD member CD-ROM)
Conference proceedings: KDD, and others, such as
PKDD, PAKDD, etc.
Journal: Data Mining and Knowledge Discovery
Database field (SIGMOD member CD-ROM)
Conference proceedings: ACM-SIGMOD, ACM-PODS,
VLDB, ICDE, EDBT, DASFAA
Journals: ACM-TODS, J. ACM, IEEE-TKDE, JIIS, etc.
AI and Machine Learning
Conference proceedings: Machine Learning, AAAI,
IJCAI, etc.
Journals: Machine Learning, Artificial
Intelligence, etc.
Statistics
Conference proceedings: Joint Stat. Meeting, etc.
Journals: Annals of Statistics, etc.
Visualization
Conference proceedings: CHI, etc.
Journals: IEEE Trans. Visualization and Computer
Graphics, etc.

91
References

U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and
R. Uthurusamy. Advances in Knowledge Discovery
and Data Mining. AAAI/MIT Press, 1996.
J. Han and M. Kamber. Data Mining: Concepts and
Techniques. Morgan Kaufmann, 2000.
T. Imielinski and H. Mannila. A database
perspective on knowledge discovery.
Communications of the ACM, 39:58-64, 1996.
G. Piatetsky-Shapiro, U. Fayyad, and P. Smyth.
From data mining to knowledge discovery: An
overview. In U. M. Fayyad, et al. (eds.), Advances
in Knowledge Discovery and Data Mining, 1-35.
AAAI/MIT Press, 1996.
G. Piatetsky-Shapiro and W. J. Frawley. Knowledge
Discovery in Databases. AAAI/MIT Press, 1991.
