Title: Motivation: Necessity is the Mother of Invention
1. Motivation: Necessity is the Mother of Invention
- Data explosion problem
- Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
- Solution: Data warehousing and data mining
- Data warehousing and on-line analytical processing
- Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
2. Why Data Mining? Potential Applications
- Database analysis and decision support
- Market analysis and management
- Risk analysis and management
- Other Applications
- Text mining (newsgroups, email, documents) and Web analysis
- Intelligent query answering
3. Market Analysis and Management (1)
- Where are the data sources for analysis?
- Credit card transactions, customer complaint calls, proxy logs, multimedia databases, etc.
- What is analyzed?
- Find clusters of customers who share the same characteristics: interest, income level, spending habits, etc.
- Determine customer purchasing patterns over time
- Associations/correlations between product sales
- Prediction based on the association information
- Data mining can tell you what types of customers buy what products (clustering or classification)
4. Market Analysis and Management (2)
- Identifying customer requirements
- Identifying the best products for different customers
- Using prediction to find what factors will attract new customers
- Providing summary information
- Various multidimensional summary reports
- Statistical summary information
5. Other Applications
- Sports
- Internet Web mining
- web site organization
- proxy server prefetch
- improve search engine performance
- Multimedia database
- Mobile database
6. Data Mining Functionalities (1)
- Association
- Multi-dimensional vs. single-dimensional association
- age(X, 20..29) ∧ income(X, 20..29K) → buys(X, PC) [support = 2%, confidence = 60%]
- contains(T, computer) → contains(T, software) [1%, 75%]
7. Data Mining Functionalities (2)
- Classification and Prediction
- Finding models (functions) that describe and distinguish classes or concepts for future prediction
- Presentation: decision tree, classification rule, neural network
- Prediction: predict some unknown or missing numerical values
- Cluster analysis
- Class label is unknown: group data to form new classes
- Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity
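The clustering principle above can be checked numerically on a toy example. This is only a sketch: the two groups of ages are hypothetical data, and absolute difference is an assumed distance measure (a small distance stands in for high similarity).

```python
from itertools import combinations

def avg_distance(pairs):
    """Average absolute difference over an iterable of value pairs."""
    pairs = list(pairs)
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

# Hypothetical customer ages, already grouped into two candidate clusters.
clusters = [[18, 20, 22], [48, 50, 52]]

# Intra-class: pairs of objects that fall in the same cluster.
intra = avg_distance(p for c in clusters for p in combinations(c, 2))

# Inter-class: pairs of objects that fall in different clusters.
inter = avg_distance((a, b) for a in clusters[0] for b in clusters[1])

# A good grouping has small intra-class distance (high similarity)
# and large inter-class distance (low similarity).
print(f"intra = {intra:.2f}, inter = {inter:.2f}")
```

A grouping that mixed young and old customers in the same cluster would raise the intra-class distance and lower the inter-class distance, which is what a clustering algorithm tries to avoid.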
8. Data Mining Functionalities (3)
- Outlier analysis
- Outlier: a data object that does not comply with the general behavior of the data
- It can be considered noise or an exception, but is quite useful in rare-event analysis
- Sequential pattern mining, periodicity analysis
- Privacy-preserving data mining
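One simple way to make "does not comply with the general behavior of the data" concrete is a z-score test. The sketch below is an illustration, not a method named on the slide: the transaction amounts are hypothetical, and the threshold of 2 standard deviations is an assumption.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical credit card transaction amounts; 500 is the unusual one.
amounts = [20, 22, 19, 21, 23, 20, 18, 22, 21, 500]
outliers = zscore_outliers(amounts)
print(outliers)
```

In a rare-event setting such as fraud detection, the flagged value is the interesting result rather than noise to be discarded.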
9. Are All the Discovered Patterns Interesting?
- A data mining system/query may generate thousands of patterns; not all of them are interesting.
- Interestingness measures: a pattern is interesting if it is easily understood by humans, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
- support, confidence
10. What Is Association Mining?
- Association rule mining
- Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
- Applications
- Basket data analysis, clustering, classification, etc.
- Examples
- Rule form: Body → Head [support, confidence]
- buys(x, diapers) → buys(x, beers) [0.5%, 60%]
- major(x, CS) ∧ takes(x, DB) → grade(x, A) [1%, 75%]
11. Association Rule: Basic Concepts
- Given: (1) a database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)
- Find: all rules that correlate the presence of one set of items with that of another set of items
- E.g., 98% of people who purchase computers and printers also purchase scanners
- Measures
- support
- confidence
- Some terms
- minimum support, minimum confidence (threshold)
- k-itemset
- frequent k-itemset
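The measures and terms above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the transactions are hypothetical toy data, the function names are made up for the example, and frequent itemsets are found by brute-force enumeration rather than by any particular mining algorithm.

```python
from itertools import combinations

# Hypothetical toy transactions (not from the slides); each is a set of items.
transactions = [
    {"computer", "printer", "scanner"},
    {"computer", "printer", "scanner"},
    {"computer", "printer"},
    {"computer", "software"},
    {"printer", "scanner"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(body, head, transactions):
    """support(body and head together) divided by support(body)."""
    both = set(body) | set(head)
    return support(both, transactions) / support(body, transactions)

def frequent_itemsets(transactions, k, min_support):
    """All k-itemsets whose support meets the minimum support threshold."""
    items = sorted(set().union(*transactions))
    return [
        frozenset(c)
        for c in combinations(items, k)
        if support(c, transactions) >= min_support
    ]

# Rule: {computer, printer} -> {scanner}
rule_support = support({"computer", "printer", "scanner"}, transactions)
rule_confidence = confidence({"computer", "printer"}, {"scanner"}, transactions)
frequent_2 = frequent_itemsets(transactions, 2, min_support=0.4)

print(f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")
print(sorted(sorted(s) for s in frequent_2))
```

Here the rule holds in 2 of 5 transactions (support), and in 2 of the 3 transactions that contain both a computer and a printer (confidence). Brute-force enumeration is only practical for tiny item sets.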
12. Association Rule Mining: A Road Map
- Boolean vs. quantitative associations (based on the types of values handled)
- buys(x, SQLServer) ∧ buys(x, DMBook) → buys(x, DBMiner) [0.2%, 60%]
- age(x, 30..39) ∧ income(x, 42..48K) → buys(x, PC) [1%, 75%]
- Single-dimensional vs. multi-dimensional associations (see the examples above)
- Single-level vs. multiple-level analysis
- What brands of beers are associated with what brands of diapers?
- Various extensions
- Maxpatterns
- Cyclic rules
13. Classification vs. Prediction
- Classification
- predicts categorical class labels
- classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data
- Prediction
- models continuous-valued functions, i.e., predicts unknown or missing values
- Typical applications
- credit approval
- target marketing
14. Classification: A Two-Step Process
- Model construction: describing a set of predetermined classes
- Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
- The set of tuples used for model construction: training set
- The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: for classifying future or unknown objects
- Estimate accuracy of the model
- The known label of each test sample is compared with the classified result from the model
- Accuracy rate is the percentage of test set samples that are correctly classified by the model
- The test set must be independent of the training set; otherwise the accuracy estimate is too optimistic and over-fitting goes undetected
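The accuracy estimate in the second step can be sketched as follows. The model and test samples are hypothetical stand-ins: `toy_model` is a hard-coded rule playing the role of a trained classifier, and the income cut-off is an assumption made for the example.

```python
def accuracy_rate(model, test_set):
    """Percentage of test samples whose predicted label equals the known label."""
    correct = sum(1 for features, label in test_set if model(features) == label)
    return 100.0 * correct / len(test_set)

# Hypothetical stand-in for a trained credit-approval model:
# approve when income (in thousands) is at least 40.
def toy_model(features):
    return "approve" if features["income"] >= 40 else "reject"

# Held-out test samples with known labels, kept separate from any training data.
test_set = [
    ({"income": 55}, "approve"),
    ({"income": 20}, "reject"),
    ({"income": 45}, "reject"),   # the model gets this one wrong
    ({"income": 70}, "approve"),
]

acc = accuracy_rate(toy_model, test_set)
print(f"accuracy = {acc:.1f}%")
```

If the same tuples were used both to build the model and to measure it, the reported accuracy would mostly reflect how well the model memorized them.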
15. Classification Process (1): Model Construction
Classification Algorithms
IF rank = professor OR years > 6 THEN tenured = yes
16. Classification Process (2): Use the Model in Prediction
(Jeff, Professor, 4)
Tenured?
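Applying the classification rule from the model-construction slide (read here as: rank = professor OR years > 6 implies tenured = yes) to the unseen tuple can be sketched as below. The function name is made up for the example, and returning "no" when the rule does not fire is an assumption, since the slide only states the "yes" case.

```python
def predict_tenured(rank, years):
    """Rule from the slide: IF rank = professor OR years > 6 THEN tenured = yes.

    Assumption: any tuple the rule does not cover is labeled "no".
    """
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

# The unseen tuple from the slide: (name, rank, years).
name, rank, years = ("Jeff", "Professor", 4)
result = predict_tenured(rank, years)
print(f"{name}: tenured = {result}")
```

Jeff has only 4 years, but the rank condition alone satisfies the OR, so the model predicts "yes".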
17. Classification by Decision Tree Induction
- Decision tree
- A flow-chart-like tree structure
- Internal node denotes a test on an attribute
- Branch represents an outcome of the test
- Leaf nodes represent class labels or class distribution
- Decision tree generation consists of two phases
- Tree construction
- At start, all the training examples are at the root
- Partition examples recursively based on selected attributes
- Tree pruning
- Identify and remove branches that reflect noise or outliers
- Use of decision tree: classifying an unknown sample
- Test the attribute values of the sample against the decision tree
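Classifying an unknown sample amounts to walking from the root to a leaf. The sketch below assumes a nested-dictionary representation: an internal node names the attribute it tests, each branch is keyed by a test outcome, and a leaf is a plain class label. The tree, attribute names, and values are hypothetical and chosen only for illustration.

```python
# Internal node: {"attribute": name, "branches": {outcome: subtree}}
# Leaf: a class label string.
toy_tree = {
    "attribute": "age",
    "branches": {
        "young": {"attribute": "student", "branches": {"no": "no", "yes": "yes"}},
        "middle": "yes",
        "senior": {
            "attribute": "credit_rating",
            "branches": {"excellent": "no", "fair": "yes"},
        },
    },
}

def classify(tree, sample):
    """Test the sample's attribute values against the tree until a leaf is reached."""
    node = tree
    while isinstance(node, dict):
        outcome = sample[node["attribute"]]   # value of the tested attribute
        node = node["branches"][outcome]      # follow the matching branch
    return node

sample = {"age": "young", "student": "yes", "credit_rating": "fair"}
label = classify(toy_tree, sample)
print(label)
```

Only the attributes on the path from root to leaf are consulted; here `credit_rating` is never tested for a young customer.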
18. Training Dataset
This follows an example from Quinlan's ID3.
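19. Output: A Decision Tree for buys_computer
[Figure: decision tree. Root test: age? with branches <30, 30..40, and >40. Internal tests: student? (branches no / yes) and credit rating? (branches fair / excellent). Leaves are labeled yes or no.]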