Title: MIS 467 Data Mining Chapter 1 introduction Fall 2003
1MIS 467 Data Mining, Chapter 1: Introduction
Fall 2003
2Personal Information
- Instructor: Bertan Badur, Ph.D.
- Office: HKA 226
- Phone: 0 212 358 15 40 ext. 2027
- E-mail: badur_at_boun.edu.tr
- Office Hours: Mondays 14.00-15.00, Tuesdays 14.00-15.00, or by appointment
3Course Information
- Course Hours: Wednesdays 5 (13:00-13:50)
- Thursdays 1, 2 (9:00-10:50)
- Lab Hours: Wednesdays 4 (12:00-12:50)
- Place: HKA303
- Course Assistant: Gülçin Buruncuk
- Web page: www.mis.boun.edu.tr/badur/MIS467
4Course Description
- This course aims at introducing basic
methodologies and techniques of data mining.
Basic data mining functionalities such as
association, concept description, classification,
prediction and clustering are introduced and
various algorithms to achieve them are presented.
Applications of these concepts and techniques to
real world problems are discussed. Data mining
software programs are introduced in lab hours.
5Text Book
- Main
- Data Mining: Concepts and Techniques, by Jiawei Han and M. Kamber, Morgan Kaufmann Publishers, 2001
- Supplementary Text Books
- Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, by Ian H. Witten, Morgan Kaufmann Publishers, 2000
- Machine Learning, by Tom M. Mitchell, McGraw-Hill International Editions, 1997
- Mastering Data Mining: The Art and Science of Customer Relationship Management, by Michael J. A. Berry and Gordon Linoff, Wiley Computer Publishing, 2000
- Predictive Data Mining, by S. M. Weiss and N. Indurkhya, Morgan Kaufmann Publishers, 1998
6Han's CS397 course slides
- http://www-courses.cs.uiuc.edu/cs397han/index.htm
7Course Outline (1)
- Introduction: 1 week, Ch. 1
- An Overview of Data Warehouses and OLAP: 1 week, Ch. 2
- Data Preprocessing: 2 weeks, Ch. 3
- Concept Description: 1 week, Ch. 5
8Course Outline (2)
- Association Rule Mining: 2 weeks, Ch. 6
- Classification and Prediction: 4 weeks, Ch. 7
- Decision Trees
- Bayesian Classification
- Classification by Backpropagation
- Regression for Classification and Prediction
- Classification Accuracy
- Cluster Analysis: 2 weeks, Ch. 8
9Prerequisites
- Basic notion of probability
- Elementary calculus
- Elementary knowledge of matrices
- Basic concepts of database systems
10Grading
- Midterm: 25%, on 19.11.2003
- Homework: 50%
- Project: optional
- Final Exam: 25%
11Project
- Each student or group of students is required to prepare a term project. Implementation of selected data mining algorithms, application of the studied techniques to a small-scale real-world problem, or a literature survey can be accepted as a term project.
12Software
- DBMiner: DBMiner 2.0 Educational Version, developed by J. Han (author of the textbook Data Mining: Concepts and Techniques) and his team; compatible with the textbook; performs association, classification and cluster analysis
- Microsoft SQL Server Analysis Services
- SPSS
- Neural Connection: performs neural network modelling for classification and prediction
- Answer Tree: decision tree analysis
13Data Sources
- FoodMart database coming with Analysis Services
- WareMart database
- Data sources from internet
- UCI KDD Archive
- UCI Machine Learning Library
- Financial data from IMKB
- A database about disabled people in Turkey
14Where to Find the Set of Slides for the Text Book?
- Tutorial sections (MS PowerPoint files)
- http://www.cs.sfu.ca/han/dmbook
- Other conference presentation slides (.ppt)
- http://db.cs.sfu.ca/ or http://www.cs.sfu.ca/han
- Research papers, DBMiner system, and other related information
- http://db.cs.sfu.ca/ or http://www.cs.sfu.ca/han
15Chapter 1. Introduction
- Motivation: Why data mining?
- What is data mining?
- Business applications of data mining
- Data mining: On what kind of data?
- Data mining functionality
- Are all the patterns interesting?
- Classification of data mining systems
- Major issues in data mining
16Motivation: Necessity is the Mother of Invention
- Data explosion problem
- Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
- Need to convert such data into knowledge and information
- Applications
- Business management
- Production control
- Market analysis
- Engineering design
- Science exploration
17Evolution of Database Technology (1)
- Data collection, database creation
- Data management
- data storage and retrieval
- database transaction processing
- Data analysis and understanding
- Data mining and data warehousing
18Evolution of Database Technology (2) (See Fig. 1.1)
- 1960s
- Data collection, database creation, IMS and network DBMS
- 1970s
- Relational data model, relational DBMS implementation
- 1980s
- RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
- 1990s-2000s
- Data mining and data warehousing, multimedia databases, and Web databases
19Developments in computer hardware
- Powerful and affordable computers
- data collection equipment
- storage media
20Data Warehouse
- Data cleaning
- Data integration
- OLAP On-Line Analytical Processing
- summarization
- consolidation
- aggregation
- view information from different angles
- but additional data analysis tools are needed for
- classification
- clustering
- characterization of data changing over time
21Data rich, information poor situation
- Abundance of data
- need for powerful data analysis tools
- data tombs: data archives
- seldom visited
- Important decisions are made
- not on the information-rich data stored in databases
- but on a decision maker's intuition
- no tool to extract knowledge embedded in vast amounts of data
- Expert system technology
- domain experts to input knowledge
- time consuming and costly
22What Is Data Mining?
- Data mining (knowledge discovery in databases)
- Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
- Alternative names and their inside stories
- Data mining: a misnomer?
- Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
- What is not data mining?
- Query processing
- Expert systems or small ML/statistical programs
23Potential Business Applications
- Market analysis and management
- target marketing, customer relation management, market basket analysis, cross selling, market segmentation
- Risk analysis and management
- Banks assume a financial risk when they grant loans
- risk models attempt to predict the probability of default, i.e., failure to pay back the borrowed amount
- Credit cards
- Insurance companies
- Fraud detection and management
- Other Applications
- Text mining (news group, email, documents) and Web analysis
- Intelligent query answering
24Market Analysis and Management (1)
- Where are the data sources for analysis?
- Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies, clickstreams
- Customer profiling and segmentation
- data mining can tell you what types of customers buy what products (clustering or classification)
- Target marketing
- Find clusters of model customers who share the same characteristics: interest, income level, spending habits, etc.
25Market Analysis and Management (2)
- Effectiveness of sales campaigns
- Advertisements, coupons, discounts, bonuses
- promote products and attract customers
- can help improve profits
- Compare amount of sales and number of transactions
- during the sales period versus before or after the sales campaign
- Association analysis
- which items are likely to be purchased together with the items on sale
26Market Analysis and Management (3)
- Customer retention: analysis of customer loyalty
- sequences of purchases of particular customers
- goods purchased at different periods by the same customers can be grouped into sequences
- changes in customer consumption or loyalty
- suggest adjustments to the pricing and variety of goods
- to retain old customers and attract new customers
- Cross-selling and up-selling
- associations from sales records
- a customer who buys a PC is likely to buy a printer
- purchase recommendations
27Fraud Detection and Management
- Applications
- widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
- Approach
- use historical data to build models of fraudulent behavior and use data mining to help identify similar instances
- Examples
- auto insurance: detect a group of people who stage accidents to collect on insurance
- money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
- Detecting telephone fraud
- Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
28Financial Data Analysis
- Financial data
- complete, reliable, high quality
- Loan payment prediction and customer credit
policy analysis
29Loan payment prediction and customer credit
policy analysis
- Factors influencing loan payment performance
- loan-to-value ratio
- term of the loan
- debt ratio (total monthly debt / total monthly income)
- payment-to-income ratio
- income level
- education level
- residence region
- credit history
- analysis may find that
- payment-to-income ratio is a dominant factor while
- education level and debt ratio are not
30Data Mining for the Telecommunication Industry
- Telecommunication data are multidimensional
- calling time, duration
- location of caller, location of callee
- type of call
- used to identify and compare
- data traffic, system workload
- resource usage, user group behavior
- profit
- fraudulent pattern analysis and identification of unusual patterns
- to achieve customer loyalty
- characteristics of customers affecting line usage
31Other Applications
- Sports
- IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat
- Astronomy
- JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
- Internet Web Surf-Aid
- IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.
32Steps of a KDD Process (1)
- 1. Learning the application domain
- relevant prior knowledge and goals of application
- 2. Creating a target data set: data selection
- 3. Data cleaning and preprocessing (may take 60-80% of effort!)
- removal of noise or outliers
- strategies for missing data fields
- accounting for time sequence information
- 4. Data reduction and transformation
- Find useful features, dimensionality/variable reduction, invariant representation
33Steps of a KDD Process (2)
- 5. Choosing functions of data mining
- summarization, classification, regression, association, clustering
- 6. Choosing the mining algorithm(s)
- which models or parameters
- 7. Data mining: search for patterns of interest
- 8. Pattern evaluation and knowledge presentation
- visualization, transformation, removing redundant patterns, etc.
- 9. Use of discovered knowledge
- incorporating into the performance system
- documenting
- reporting to interested parties
34An example: customer segmentation
- 1. Marketing department wants to perform a segmentation study on the customers of AE Company
- 2. Decide on relevant variables from a data warehouse on customers, sales, promotions
- Customers: name, ID, income, age, education, ...
- Sales: history of sales
- Promotion: promotion types, durations, ...
- 3. Handle missing income, addresses, ...
- determine outliers if any
- 4. Generate new index variables representing wealth of customers
- $\text{Wealth} = a \cdot \text{income} + b \cdot \text{houses} + c \cdot \text{cars} + \dots$
- Make necessary transformations (e.g., z scores) so that some data mining algorithms work more efficiently
35
- 5. Choose clustering as the data mining functionality, as it is the natural one for a segmentation study, so as to find groups of customers with similar characteristics
- 6. Choose a clustering algorithm
- K-means or k-medoids or any suitable one for that problem
- 7. Apply the algorithm
- Find clusters or segments
- 8. Make reverse transformations, visualize the customer segments
- 9. Present the results in the form of a report to the marketing department
- Implement the segmentation as part of a DSS so that it can be applied repeatedly at certain intervals as new customers arrive
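Steps 4 to 7 of the segmentation example can be sketched in a few lines of Python. This is only an illustrative sketch: the customer values, the choice of two variables (income and age) and the helper names `z_scores` and `k_means` are assumptions made for the illustration, not part of the course material or of any particular data mining package.

```python
import random
import statistics

def z_scores(column):
    """Standardize one numeric column to mean 0 and standard deviation 1."""
    mean = statistics.mean(column)
    stdev = statistics.pstdev(column)
    return [(v - mean) / stdev for v in column]

def k_means(points, k, iterations=20, seed=0):
    """Plain k-means on a list of equal-length tuples; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(statistics.mean(dim) for dim in zip(*members))
    return labels

# Hypothetical customers (made-up values): yearly income and age.
income = [12, 14, 13, 55, 60, 58]
age = [25, 27, 24, 50, 52, 49]

# Step 4: z-score transformation so both variables have comparable scales.
points = list(zip(z_scores(income), z_scores(age)))

# Steps 5-7: cluster the customers into two segments.
segments = k_means(points, k=2)
print(segments)
```

The first three customers should end up in one segment and the last three in the other; which segment gets index 0 depends on the random starting centers.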
36Data Mining: A KDD Process
- Data mining: the core of the knowledge discovery process.
[Figure: KDD process flow: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
37Architecture of a Typical Data Mining System
[Figure: layered architecture, top to bottom: Graphical user interface; Pattern evaluation; Data mining engine; Database or data warehouse server; Data cleaning, data integration and filtering; Databases and Data Warehouse. A Knowledge base sits beside the upper layers.]
38Architecture of a Typical Data Mining System
- Database, data warehouse
- Database or data warehouse server
- Knowledge base
- concept hierarchies
- user beliefs
- assess pattern interestingness
- other thresholds
- Data mining engine
- functional modules
- characterization, association, classification, cluster analysis, evolution and deviation analysis
- Pattern evaluation module
- Graphical user interface
39Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Visualization, Information Science and Other Disciplines]
40Efficient and Scalable Techniques
- For an algorithm to be scalable
- its running time should grow approximately linearly in proportion to the size of the database
41Data Mining On What Kind of Data?
- Relational databases
- Data warehouses
- Transactional databases
- Advanced DB and information repositories
- Object-oriented and object-relational databases
- Spatial databases
- Time-series data and temporal data
- Text databases and multimedia databases
- Heterogeneous and legacy databases
- WWW
42An example problem
- AllElectronics is a multi-branch retail company
- relational tables include
- customer
- ID, name, address, age, income, education, sex, marital status
- items
- ID, name, brand, category, type, price, place_made, supplier, cost
- employee
- ID, name, department, education, salary
- branch
- purchases
- transID, item_sold, customer ID, emp_ID, date, time, method_paid, amount
43Two styles of data mining
- Descriptive data mining
- Characterizes the general properties of the data in the database
- finds patterns in data and
- user determines which ones are important
- Predictive data mining
- performs inference on the current data to make predictions
- we know what to predict
- Not mutually exclusive
- used together
44Descriptive Data Mining (1)
- Discovering new patterns inside the data
- Used during the data exploration steps
- Typical questions answered by descriptive data mining
- what is in the data
- what does it look like
- are there any unusual patterns
- what does the data suggest for customer segmentation
- users may have no idea
- which kind of patterns may be interesting
45Descriptive Data Mining (2)
- patterns at various granularities
- geography
- country - city - region - street
- student
- university - faculty - department - minor
- Functionalities of descriptive data mining
- Clustering
- Ex: customer segmentation
- Summarization
- Visualization
- Association
- Ex: market basket analysis
46A model is a black box
- X: vector of independent variables
- Y = f(X), where f is an unknown function
[Figure: inputs X1, X2, ... -> Model -> output Y]
- The user does not care what the model is doing: it is a black box. The user is interested in the accuracy of its predictions.
47Predictive Data Mining (1)
- Using known examples the model is trained
- the unknown function is learned from data
- the more data with known outcomes is available
- the better the predictive power of the model
- Used to predict outcomes whose inputs are known but whose output values are not realized yet
- Never 100% accurate
48Predictive Data Mining (2)
- The performance of a model on past data (predicting the already known outcomes) is not important
- Its performance on unknown data is much more important
49Typical questions answered by predictive models
- Who is likely to respond to our next offer?
- based on the history of previous marketing campaigns
- Which customers are likely to leave in the next six months?
- Which transactions are likely to be fraudulent?
- based on known examples of fraud
- What is the total amount of spending of a customer in the next month?
50Data Mining Functionalities (1)
- Concept description: characterization and discrimination
- Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
- Association (correlation and causality)
- Multi-dimensional vs. single-dimensional association
- $\text{age}(X, \text{20..29}) \wedge \text{income}(X, \text{20..29K}) \Rightarrow \text{buys}(X, \text{PC})$ [support = 2%, confidence = 60%]
- $\text{contains}(T, \text{computer}) \Rightarrow \text{contains}(T, \text{software})$ [support = 1%, confidence = 75%]
51Data Mining Functionalities (2)
- Classification and Prediction
- Finding models (functions) that describe and distinguish classes or concepts for future prediction
- E.g., classify countries based on climate, or classify cars based on gas mileage
- Presentation: decision tree, classification rule, neural network
- Prediction: predict some unknown or missing numerical values
- Cluster analysis
- Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
- Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity
52Data Mining Functionalities (3)
- Outlier analysis
- Outlier: a data object that does not comply with the general behavior of the data
- It can be considered as noise or an exception, but it is quite useful in fraud detection and rare events analysis
- Trend and evolution analysis
- Trend and deviation: regression analysis
- Sequential pattern mining, periodicity analysis
- Similarity-based analysis
- Other pattern-directed or statistical analyses
53Concept Description
- Characterization
- Discrimination
- Data
- classes or
- concepts
- classes of items for sale
- computers, printers
- concepts of customers
- bigSpenders
- budgetSpenders
54Data Characterization
- Summarizing the data of the class under study (target class)
- Methods
- SQL queries
- OLAP roll-up operation
- user-controlled data summarization
- along a specified dimension
- attribute-oriented induction
- without step-by-step user interaction
- the output of characterization
- pie charts, bar charts, curves, multidimensional data cubes, or cross tabs
- in rule form, as characteristic rules
55Characterization example
- Description summarizing the characteristics of customers who spend more than 1000 a year at AllElectronics
- age, employment, income
- drill down on any dimension
- on occupation: view these customers according to their type of employment
56Data Discrimination
- Comparing the target class with one or a set of comparative classes (contrasting classes)
- these classes can be specified by the user
- database queries
- methods and output
- similar to those used for characterization
- include comparative measures to distinguish between the target and contrasting classes
57Discrimination examples
- Example 1: Compare the general features of software products
- whose sales increased by 10% in the last year
- whose sales decreased by at least 30% during the same period
- Example 2: Compare two groups of AE customers
- I) who shop for computer products regularly
- more than two times a month
- II) who rarely shop for such products
- less than three times a year
- The resulting description
- 80% of group I customers
- university education
- ages 20-40
- 60% of group II customers
- seniors or young
- no university degree
58Multidimensional Data
- sales according to region, month and product type
- Dimensions: Product, Location, Time
- Hierarchical summarization paths
- Product: Industry - Category - Product
- Location: Region - Country - City - Office
- Time: Year - Quarter - Month or Week - Day
[Figure: data cube with axes Region, Product and Month]
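A hierarchical summarization path can be illustrated by summing the same sales facts at two levels of granularity. The sketch below is only an illustration under stated assumptions: the sales rows are invented, and `roll_up` is a hypothetical helper, not an OLAP server operation.

```python
from collections import defaultdict

# Hypothetical sales facts: (city, country, month, quarter, product, amount).
sales = [
    ("Istanbul", "Turkey", "Jan", "Q1", "PC", 100),
    ("Istanbul", "Turkey", "Feb", "Q1", "PC", 150),
    ("Ankara", "Turkey", "Jan", "Q1", "printer", 40),
    ("Berlin", "Germany", "Apr", "Q2", "PC", 200),
]

def roll_up(rows, key):
    """Sum the amount of every row that shares the same key(row)."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[5]
    return dict(totals)

# Fine granularity: sales per city and month.
by_city_month = roll_up(sales, key=lambda r: (r[0], r[2]))

# Rolled up along both hierarchies: city -> country, month -> quarter.
by_country_quarter = roll_up(sales, key=lambda r: (r[1], r[3]))
print(by_country_quarter)
```

Drilling down is the reverse move: going from `by_country_quarter` back to `by_city_month`.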
59Association Analysis
- Discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data
- widely used
- market basket
- transaction data analysis
- more formally
- $X \Rightarrow Y$, that is
- $A_1 \wedge A_2 \wedge \dots \wedge A_k \Rightarrow B_1 \wedge B_2 \wedge \dots \wedge B_l$
- $A_i$, $B_j$ are attribute-value pairs
60Example association analysis
- From the AllElectronics (AE) database
- $\text{age}(X, \text{20..29}) \wedge \text{income}(X, \text{1 billion..2 billion}) \Rightarrow \text{buy}(X, \text{CD player})$
- (support = 2%, confidence = 60%)
- X is a variable representing a customer
- 2% of the AE customers are
- between 20 and 29 years of age
- have incomes ranging from 1 to 2 billion TL
- buy a CD player
- there is a 60% probability that customers in those age and income groups will buy a CD player
- a multidimensional association rule
- contains more than one attribute or predicate
61Market basket analysis
- customers' buying behavior is investigated
- Based only on the transaction data
- no information about customer properties (age, income)
- Managers
- are interested in which products or product groups are sold together
62Example basket analysis rule
- $\text{buy}(\text{computer}) \Rightarrow \text{buy}(\text{printer})$
- (support = 1%, confidence = 60%)
- 1% of all transactions contain
- computer and printer
- if a transaction contains computer
- there is a 60% chance that it contains printer as well
- a single-dimensional association rule
- contains a single predicate
- an association rule is interesting if
- its support exceeds a minimum threshold and
- its confidence exceeds a minimum threshold
- These minimum values are set by specialists
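Support and confidence of a basket rule can be computed directly from transaction data. The sketch below assumes a tiny invented set of transactions; the function names `support`, `confidence` and `is_interesting` are hypothetical helpers chosen for this illustration.

```python
# Hypothetical market baskets; each transaction is the set of items bought together.
transactions = [
    {"computer", "printer"},
    {"computer", "software"},
    {"computer", "printer", "software"},
    {"printer"},
    {"software"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Share of the transactions containing the antecedent that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def is_interesting(antecedent, consequent, transactions, min_support, min_confidence):
    """A rule passes only if it reaches both user-set minimum thresholds."""
    both = set(antecedent) | set(consequent)
    return (support(both, transactions) >= min_support
            and confidence(antecedent, consequent, transactions) >= min_confidence)

rule_support = support({"computer", "printer"}, transactions)          # 2 of 5 baskets
rule_confidence = confidence({"computer"}, {"printer"}, transactions)  # 2 of the 3 computer baskets
print(rule_support, rule_confidence)
```

In this toy data, buy(computer) => buy(printer) has support 2/5 and confidence 2/3, so it passes a 30% / 60% threshold pair but fails if the minimum confidence is raised to 70%.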
63Classification and Prediction
- Classification: finding models (functions) that describe and distinguish classes or concepts for future prediction
- The derived model is based on the analysis of a set of training data
- E.g., classify countries based on climate, or classify cars based on gas mileage
- Presentation: decision tree, classification rule, neural network
- Prediction: predict some unknown or missing numerical values
- may need to be preceded by relevance analysis, which attempts to identify attributes that do not contribute to the classification or prediction process
- these attributes can be excluded
64Steps of classification process
- Train the model
- using a training set
- objects whose class labels are known
- Test the model
- on a test sample
- whose class labels are known but not used for training the model
- Use the model for classification
- on new data whose class labels are unknown
65A hypothetical example
- Historical data: each customer's type is known; each customer has a label
- Training set: used for training the model
- Testing set: labels are also known but not used in training the model
- New customers: their type has to be estimated
- Each new customer has to be classified as risky, normal or good
66A hypothetical example (cont.)
- Based on historical data, develop a classification model
- Decision tree, neural network, regression, ...
- Test the performance of the model on a portion of the historical data
- If the accuracy of the model is satisfactory
- Use the model on the new customers
- 11 and 27, to assign a type to these new customers
67Example
[Figure: scatter plot of customers labeled OK or DEFAULT; axes: yearly income and wealth]
68Decision Trees
- x1: yearly income, x2: wealth
- y = 0: DEFAULT, y = 1: OK
- Numerical values of
- q1 and q2
- are estimated
- by the algorithm
69Solution
[Figure: the income-wealth plane split by the thresholds q1 and q2 into an OK region and a DEFAULT region]
Rule: IF yearly income > q1 AND wealth > q2 THEN OK ELSE DEFAULT
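The train, test and use cycle for this rule can be sketched as follows. The sketch makes several assumptions: the labeled customers are invented, and the thresholds q1 and q2 are estimated by a brute-force search over the observed values, which is a stand-in for a real decision tree algorithm and not the method of any particular package.

```python
def predict(income, wealth, q1, q2):
    """Rule from the slide: OK only if both thresholds are exceeded."""
    return "OK" if income > q1 and wealth > q2 else "DEFAULT"

def accuracy(rows, q1, q2):
    """Fraction of labeled rows that the rule classifies correctly."""
    hits = sum(predict(inc, w, q1, q2) == label for inc, w, label in rows)
    return hits / len(rows)

def fit_thresholds(rows):
    """Try every observed income/wealth value as a threshold; keep the most accurate pair."""
    incomes = sorted({inc for inc, _, _ in rows})
    wealths = sorted({w for _, w, _ in rows})
    return max(
        ((q1, q2) for q1 in incomes for q2 in wealths),
        key=lambda q: accuracy(rows, *q),
    )

# Hypothetical labeled customers: (yearly income, wealth, label).
training = [
    (10, 5, "DEFAULT"), (15, 40, "DEFAULT"), (20, 10, "DEFAULT"),
    (45, 8, "DEFAULT"), (50, 30, "OK"), (60, 45, "OK"), (70, 25, "OK"),
]
testing = [(12, 50, "DEFAULT"), (55, 35, "OK"), (65, 5, "DEFAULT")]
new_customers = [(58, 28), (12, 60)]  # labels unknown

q1, q2 = fit_thresholds(training)                   # 1. train the model
test_accuracy = accuracy(testing, q1, q2)           # 2. test on held-out labeled data
labels = [predict(inc, w, q1, q2) for inc, w in new_customers]  # 3. use on new data
print(q1, q2, test_accuracy, labels)
```

The test set is kept out of `fit_thresholds` on purpose: accuracy on data the model has already seen says little about how it will do on new customers.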
70Artificial Neural Nets: Perceptron
[Figure: perceptron; inputs x0 = 1, x1, x2, ..., xd with weights w0, w1, w2, ..., wd feed a unit g that produces the output y]
71Training ANNs
Learning set
Find w which minimizes the error on X
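One simple way to search for such a weight vector is the classic perceptron learning rule, sketched below. This is an illustrative assumption: the slide does not say which training procedure is meant, the learning set (the logical AND function) is invented, and the function names are hypothetical.

```python
def perceptron_output(weights, x):
    """Threshold unit: 1 if w0*1 + w1*x1 + ... + wd*xd > 0, else 0."""
    total = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return 1 if total > 0 else 0

def train_perceptron(samples, epochs=50, rate=1.0):
    """Perceptron rule: after each mistake, shift the weights toward the target output."""
    d = len(samples[0][0])
    weights = [0.0] * (d + 1)  # weights[0] is the bias weight w0 (its input x0 is fixed at 1)
    for _ in range(epochs):
        for x, target in samples:
            error = target - perceptron_output(weights, x)
            weights[0] += rate * error
            for j, xi in enumerate(x):
                weights[j + 1] += rate * error * xi
    return weights

# Hypothetical learning set: the logical AND of two binary inputs.
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(samples)
predictions = [perceptron_output(w, x) for x, _ in samples]
print(w, predictions)
```

This rule only reaches zero error when the classes are linearly separable, as they are for AND; multi-layer networks trained by backpropagation are used for harder cases.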
72ANN for classification
73Prediction methods
- linear regression
- $Y_i = a_0 + a_1 X_{1,i} + a_2 X_{2,i} + \dots + a_k X_{k,i} + u_i$
- non-linear regression
- $Y_i = f(X_{1,i}, X_{2,i}, \dots, X_{k,i}; a_1, a_2, \dots, a_k, u_i)$
- generalized linear regression
- logistic
- logit, probit
- when the dependent variable is categorical
- good customer / bad customer, or employed / unemployed
- Poisson regression
- for count variables
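For the linear case with a single independent variable, the coefficients a0 and a1 can be estimated by ordinary least squares. The sketch below is a minimal illustration; the income and expense figures are invented and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares estimates of a0 and a1 in Y = a0 + a1*X + u."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx              # slope
    a0 = mean_y - a1 * mean_x   # intercept
    return a0, a1

# Hypothetical customers: monthly expense is exactly 2 + 0.5 * income here.
income = [10, 20, 30, 40]
expense = [7, 12, 17, 22]

a0, a1 = fit_line(income, expense)
predicted = a0 + a1 * 50  # predicted expense of a new customer with income 50
print(a0, a1, predicted)
```

Because the toy data lie exactly on a line, the fit recovers a0 = 2 and a1 = 0.5; with real data the error term u would leave some scatter around the fitted line.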
74Example: Prediction and Classification
- Classification is used to classify customers applying for credit cards
- known class labels: risky, reliable
- when a new customer applies, looking at her characteristics
- income, age, education, wealth, region, ...
- the customer's class is predicted
- Prediction: the monthly expense of a new customer (a real continuous variable) is predicted based on personal information
- independent variables
- income, education, wealth, profession, ...
- Some are numeric, some categorical
75Cluster Analysis
- Class label is unknown: group data to form new classes
- assign class labels to each data object
- e.g., cluster houses to find distribution patterns
- Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity
- Objects within a cluster have high similarity in comparison to one another
- but are very dissimilar to objects in other clusters
- there may be a hierarchy of classes
76Example Clustering
- Can be performed on AE customer data
- to identify homogeneous subpopulations of customers
- these represent individual target groups for marketing
77
[Figure: scatter plot of customers; axes: income and distance; three groups labeled Type 1, Type 2 and Type 3]
Clustering according to income and distance to store: three clusters of data points are evident.
78Outlier Analysis
- Outlier: a data object that does not comply with the general behavior of the data
- It can be considered as noise or an exception, but it is quite useful in fraud detection and rare events analysis
- Detected using
- statistical tests
- distance measures
- visually inspecting the data
- Examples
79Reasons for outliers
- Measurement errors
- coding errors
- age is entered as 999
- nature of data
- salary of the general manager is much higher than that of the other employees
- in a crisis the interest rate was in the order of 1000%
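A coding error such as an age of 999 can be flagged with a very simple statistical test. The sketch below uses z scores with a cutoff of 2 standard deviations; the age values, the cutoff and the function name `z_score_outliers` are assumptions made for illustration, and real projects often prefer more robust tests.

```python
import statistics

def z_score_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical customer ages; 999 is a coding error.
ages = [23, 35, 41, 29, 52, 38, 999, 45, 31, 27]
outliers = z_score_outliers(ages)
print(outliers)
```

A flagged value still needs a human decision: the age of 999 is an error to correct, while the general manager's salary is a genuine value that may be the most interesting record in the data.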
80Evolution Analysis
- Describes and models regularities or trends for objects whose behavior changes over time
- Distinct features include
- Trend and deviation: time-series data analysis
- Sequential pattern mining, periodicity analysis
- Similarity-based analysis
- Example
- Stock market predictions: future stock prices
- for overall stock indexes or individual company stocks
81Are All the Discovered Patterns Interesting?
- A data mining system/query may generate thousands of patterns; not all of them are interesting.
- Are all patterns interesting?
- Typically not: only a small fraction of patterns are interesting to any given user
- Interestingness measures: a pattern is interesting if
- it is easily understood by humans,
- valid on new or test data with some degree of certainty,
- potentially useful,
- novel, or
- validates some hypothesis that a user seeks to confirm
82Objective vs. subjective interestingness measures
- Objective: based on statistics and structures of patterns, e.g.,
- support
- $\text{support}(X \Rightarrow Y) = P(X \cup Y)$: the probability that a transaction contains both X and Y
- confidence: degree of certainty of the detected association
- $\text{confidence}(X \Rightarrow Y) = P(Y \mid X)$: the conditional probability that a transaction containing X also contains Y
- thresholds: controlled by the user
- e.g., rules that do not satisfy a confidence threshold of 50% are uninteresting
- Subjective: based on the user's belief in the data, e.g., unexpectedness, novelty, actionability, etc.
83Can We Find All and Only Interesting Patterns?
- Find all the interesting patterns: completeness
- Can a data mining system find all the interesting patterns?
- Association vs. classification vs. clustering
- Search for only interesting patterns: optimization
- Can a data mining system find only the interesting patterns?
- Approaches
- First generate all the patterns and then filter out the uninteresting ones
- Generate only the interesting patterns: mining query optimization
84Data Mining Classification Schemes
- General functionality
- Descriptive data mining
- Predictive data mining
- Different views, different classifications
- Kinds of databases to be mined
- Kinds of knowledge to be discovered
- Kinds of techniques utilized
- Kinds of applications adapted
85A Multi-Dimensional View of Data Mining
Classification
- Databases to be mined
- Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW, etc.
- Knowledge to be mined
- Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.
- Multiple/integrated functions and mining at multiple levels
- Techniques utilized
- Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc.
- Applications adapted
- Retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.
86Major Issues in Data Mining (1)
- Mining methodology and user interaction
- Mining different kinds of knowledge in databases
- Interactive mining of knowledge at multiple levels of abstraction
- Incorporation of background knowledge
- Data mining query languages and ad-hoc data mining
- Expression and visualization of data mining results
- Handling noise and incomplete data
- Pattern evaluation: the interestingness problem
- Performance and scalability
- Efficiency and scalability of data mining algorithms
- Parallel, distributed and incremental mining methods
87Major Issues in Data Mining (2)
- Issues relating to the diversity of data types
- Handling relational and complex types of data
- Mining information from heterogeneous databases and global information systems (WWW)
- Issues related to applications and social impacts
- Application of discovered knowledge
- Domain-specific data mining tools
- Intelligent query answering
- Process control and decision making
- Integration of the discovered knowledge with existing knowledge: a knowledge fusion problem
- Protection of data security, integrity, and privacy
88Summary
- Data mining: discovering interesting patterns from large amounts of data
- A natural evolution of database technology, in great demand, with wide applications
- A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
- Mining can be performed in a variety of information repositories
- Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
- Classification of data mining systems
- Major issues in data mining
89A Brief History of Data Mining Society
- 1989 IJCAI Workshop on Knowledge Discovery in Databases (Piatetsky-Shapiro)
- Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
- 1991-1994 Workshops on Knowledge Discovery in Databases
- Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
- 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD'95-98)
- Journal of Data Mining and Knowledge Discovery (1997)
- 1998 ACM SIGKDD, SIGKDD'1999-2001 conferences, and SIGKDD Explorations
- More conferences on data mining
- PAKDD, PKDD, SIAM-Data Mining, (IEEE) ICDM, etc.
90Where to Find References?
- Data mining and KDD (SIGKDD member CD-ROM)
- Conference proceedings: KDD, and others, such as PKDD, PAKDD, etc.
- Journal: Data Mining and Knowledge Discovery
- Database field (SIGMOD member CD-ROM)
- Conference proceedings: ACM-SIGMOD, ACM-PODS, VLDB, ICDE, EDBT, DASFAA
- Journals: ACM-TODS, J. ACM, IEEE-TKDE, JIIS, etc.
- AI and Machine Learning
- Conference proceedings: Machine Learning, AAAI, IJCAI, etc.
- Journals: Machine Learning, Artificial Intelligence, etc.
- Statistics
- Conference proceedings: Joint Stat. Meeting, etc.
- Journals: Annals of Statistics, etc.
- Visualization
- Conference proceedings: CHI, etc.
- Journals: IEEE Trans. Visualization and Computer Graphics, etc.
91References
- U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
- J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.
- T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of ACM, 39:58-64, 1996.
- G. Piatetsky-Shapiro, U. Fayyad, and P. Smyth. From data mining to knowledge discovery: An overview. In U. M. Fayyad, et al. (eds.), Advances in Knowledge Discovery and Data Mining, 1-35. AAAI/MIT Press, 1996.
- G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991.