Title: Knowledge Discovery
1Knowledge Discovery Data Mining
- process of extracting previously unknown, valid,
and actionable (understandable) information from
large databases - Data mining is a step in the KDD process of
applying data analysis and discovery algorithms - Machine learning, pattern recognition,
statistics, databases, data visualization. - Traditional techniques may be inadequate
- large data
2Why Mine Data?
- Huge amounts of data being collected and
warehoused - Walmart records 20 millions transactions per day
- WebLogs Millions of hits per day on major sites
- health care transactions multi-gigabyte
databases - Mobil Oil geological data of over 100 terabytes
- Affordable computing
- Competitive pressure
- gain an edge by providing improved, customized
services - information as a product in its own right
3- Knowledge discovery in databases (KDD) is the
non-trivial process of identifying valid,
potentially useful and ultimately understandable
patterns in data
Data Mining
Clean, Collect, Summarize
Data Preparation
Training Data
Data Warehouse
Model Patterns
Verification, Evaluation
Operational Databases
4Data mining
- Pattern
- 12121?
- 12 pattern is found often enough So, with some
confidence we can say ? is 2 - If 1 then 2 follows
- Pattern ? Model
- Confidence
- 121212?
- 12121231212123121212?
- 121212? 3
- Models are created using historical data by
detecting patterns. It is a calculated guess
about likelihood of repetition of pattern.
5- Note Models and patterns A pattern can be
thought of as an instantiation of a model. Eg.
f(x) 3 x2 x is a pattern whereas f(x) ax2
bx is considered a model. - Data mining involves fitting models to and
determining patterns from observed data.
6Data Mining
- Prediction Methods
- using some variables to predict unknown or future
values of other variables - It uses database fields (predictors) for
prediction model, using the field values we can
make predictions - Descriptive Methods
- finding human-interpretable patterns describing
the data
7Data Mining Techniques
- Classification
- Clustering
- Association Rule Discovery
- Sequential Pattern Discovery
- Regression
- Deviation Detection
8Classification
- Data defined in terms of attributes, one of which
is the class - Find a model for class attribute as a function of
the values of other(predictor) attributes, such
that previously unseen records can be assigned a
class as accurately as possible. - Training Data used to build the model
- Test data used to validate the model (determine
accuracy of the model) - Given data is usually divided into training and
test sets.
9Classification
- Given old data about customers and payments,
predict new applicants loan eligibility.
Previous customers
Classifier
Decision rules
Age Salary Profession Location Customer type
Salary gt 5 L
Good/ bad
Prof. Exec
New applicants data
10Classification methods
- Goal Predict class Ci f(x1, x2, .. Xn)
- Regression (linear or any other polynomial)
- Decision tree classifier divide decision space
into piecewise constant regions. - Neural networks partition by non-linear
boundaries - Probabilistic/generative models
- Lazy learning methods nearest neighbor
11Decision trees
- Tree where internal nodes are simple decision
rules on one or more attributes and leaf nodes
are predicted class labels.
Salary lt 50 K
Prof teacher
Age lt 30
12ClassificationExample
13Decision Tree
Training Dataset
14Output A Decision Tree for buys_computer
15Algorithm for Decision Tree Induction
- Basic algorithm (a greedy algorithm)
- Tree is constructed in a top-down recursive
divide-and-conquer manner - At start, all the training examples are at the
root - Attributes are categorical (if continuous-valued,
they are discretized in advance) - Examples are partitioned recursively based on
selected attributes - Test attributes are selected on the basis of a
heuristic or statistical measure (e.g.,
information gain) - Conditions for stopping partitioning
- All samples for a given node belong to the same
class - There are no remaining attributes for further
partitioning majority voting is employed for
classifying the leaf - There are no samples left
16Attribute Selection Measure Information Gain
(ID3/C4.5)
- Select the attribute with the highest information
gain - S contains si tuples of class Ci for i 1, ,
m - information measures info required to classify
any arbitrary tuple - entropy of attribute A with values a1,a2,,av
- information gained by branching on attribute A
17Attribute Selection by Information Gain
Computation
- Class P buys_computer yes
- Class N buys_computer no
- I(p, n) I(9, 5) 0.940
- Compute the entropy for age
- means age lt30 has 5 out of 14
samples, with 2 yeses and 3 nos. Hence - Similarly,
18Classification Direct Marketing
- Goal Reduce cost of soliciting (mailing) by
targeting a set of consumers likely to buy a new
product. - Data
- for similar product introduced earlier
- we know which customers decided to buy and which
did not buy, not buy class attribute - collect various demographic, lifestyle, and
company related information about all such
customers - as possible predictor variables. - Learn classifier model
19Classification Fraud detection
- Goal Predict fraudulent cases in credit card
transactions. - Data
- Use credit card transactions and information on
its account-holder as input variables - label past transactions as fraud or fair.
- Learn a model for the class of transactions
- Use the model to detect fraud by observing credit
card transactions on a given account.
20Clustering
- Given a set of data points, each having a set of
attributes, and a similarity measure among them,
find clusters such that - data points in one cluster are more similar to
one another - data points in separate clusters are less similar
to one another. - Similarity measures
- Euclidean distance, if attributes are continuous
- Problem specific measures
21Clustering Market Segmentation
- Goal subdivide a market into distinct subsets of
customers where any subset may conceivably be
selected as a market target to be reached with a
distinct marketing mix. - Approach
- collect different attributes on customers based
on geographical, and lifestyle related
information - identify clusters of similar customers
- measure the clustering quality by observing
buying patterns of customers in same cluster vs.
those from different clusters.
22Association Rule Discovery
- Given a set of records, each of which contain
some number of items from a given collection - produce dependency rules which will predict
occurrence of an item based on occurences of
other items
23Association Rule Basic Concepts
- Given (1) database of transactions, (2) each
transaction is a list of items (purchased by a
customer in a visit) - Find all rules that correlate the presence of
one set of items with that of another set of
items - E.g., 98 of people who purchase tires and auto
accessories also get automotive services done - Applications
- ? Maintenance Agreement (What the store
should do to boost Maintenance Agreement sales) - Home Electronics ? (What other products
should the store stocks up?) - Attached mailing in direct marketing
24Association Rule Basic Concepts
- number of tuples containing both A and B
- Support (A? B) ---------------------------------
-------------- - total number of tuples
- number of tuples containing both A and B
- Confidence (A? B) ------------------------------
----------- - total number of tuples containg A
25Rule Measures Support and Confidence
Customer buys both
- Find all the rules X Y ? Z with minimum
confidence and support - support, s, probability that a transaction
contains X , Y , Z - confidence, c, conditional probability that a
transaction having X , Y also contains Z
Customer buys d
Customer buys b
- Let minimum support 50, and minimum confidence
50, we have - A ? C (50, 66.6)
- C ? A (50, 100)
26Mining Association RulesAn Example
Min. support 50 Min. confidence 50
- For rule A ? C
- support support(A , C) 50
- confidence support(A , C)/support(A) 66.6
27Association RulesApplication
- Marketing and Sales Promotion
- Consider discovered rule
- Bagels, --gt Potato Chips
- Potato Chips as consequent can be used to
determine what may be done to boost sales - Bagels as an antecedent can be used to see which
products may be affected if bagels are
discontinued - Can be used to see which products should be sold
with Bagels to promote sale of Potato Chips
28Association Rules Application
- Supermarket shelf management
- Goal to identify items which are bought together
(by sufficiently many customers) - Approach process point-of-sale data (collected
with barcode scanners) to find dependencies among
items. - Example
- If a customer buys Diapers and Milk, then he is
very likely to buy Beer - so stack six-packs next to diapers?
29Sequential Pattern Discovery
- Given set of objects, each associated with its
own timeline of events, find rules that predict
strong sequential dependencies among different
events, of the form (A B) (C) (D E) --gt (F)
- xg max allowed time between consecutive
- event-sets
- ng min required time between consecutive
- event sets
- ws window-size, max time difference between
- earliest and latest events in an event-set
(events - within an event-set may occur in any order)
- ms max allowed time between earliest and
- latest events of the sequence.
30Sequential Pattern Discovery Examples
- Sequences in which customers purchase
goods/services - Understanding long term customer behavior --
timely promotions. - In point-of--sale transaction sequences
- Computer bookstore
- (Intro to Visual C) (C Primer) --gt (Perl
for Dummies, Tcl/Tk) - Athletic Apparel Store
- (Shoes) (Racket, Racquetball) --gt (Sports
Jacket)