Knowledge Discovery - PowerPoint PPT Presentation

About This Presentation

Title:

Knowledge Discovery

Description:

Knowledge Discovery & Data Mining: the process of extracting previously unknown, valid, and actionable (understandable) information from large databases

Slides: 31

Provided by: Dr557

Transcript and Presenter's Notes

1
Knowledge Discovery & Data Mining

Knowledge discovery: the process of extracting previously unknown, valid,
and actionable (understandable) information from
large databases
Data mining is a step in the KDD process: it applies
data analysis and discovery algorithms
It draws on machine learning, pattern recognition,
statistics, databases, and data visualization
Traditional techniques may be inadequate for
large data

2
Why Mine Data?

Huge amounts of data being collected and
warehoused
Walmart records 20 million transactions per day
Web logs: millions of hits per day on major sites
Health care transactions: multi-gigabyte
databases
Mobil Oil: geological data of over 100 terabytes
Affordable computing
Competitive pressure
gain an edge by providing improved, customized
services
information as a product in its own right

Knowledge discovery in databases (KDD) is the
non-trivial process of identifying valid,
potentially useful and ultimately understandable
patterns in data

[Figure: KDD process diagram with the stages labeled Operational Databases; Clean, Collect, Summarize; Data Warehouse; Data Preparation; Training Data; Data Mining; Model, Patterns; Verification, Evaluation]
4
Data mining

Pattern
12121?
The pattern 12 is found often enough, so with some
confidence we can say ? is 2
Rule: if 1, then 2 follows
Pattern --> Model
Confidence
121212?
12121231212123121212?
121212? --> 3
Models are created using historical data by
detecting patterns. It is a calculated guess
about likelihood of repetition of pattern.
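
The guess about "?" can be made concrete by counting how often each symbol follows a given context in the historical sequence. The sketch below is only an illustration of that idea; the function name `next_symbol_confidence` and the choice of context strings are assumptions, not part of the slides.

```python
from collections import Counter

def next_symbol_confidence(sequence, context):
    """Return {symbol: fraction} for the symbols seen right after `context`."""
    followers = Counter()
    width = len(context)
    # Stop one window short of the end so a following symbol always exists.
    for start in range(len(sequence) - width):
        if sequence[start:start + width] == context:
            followers[sequence[start + width]] += 1
    total = sum(followers.values())
    if total == 0:
        return {}
    return {symbol: count / total for symbol, count in followers.items()}

# In the short history, every 1 that has a successor is followed by 2.
short_history = next_symbol_confidence("12121", "1")
# In the longer history, each earlier run of 121212 is followed by 3.
long_history = next_symbol_confidence("12121231212123121212", "121212")
print(short_history, long_history)
```

With more history the same context can point to a different continuation, which is why the model is a calculated guess rather than a certainty.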

Note: models and patterns. A pattern can be
thought of as an instantiation of a model. E.g.,
f(x) = 3x^2 + x is a pattern, whereas
f(x) = ax^2 + bx is considered a model.
Data mining involves fitting models to and
determining patterns from observed data.

6
Data Mining

Prediction Methods
Use some variables to predict unknown or future
values of other variables
Database fields (predictors) are the inputs to the
prediction model; given their values, the model
makes predictions
Descriptive Methods
finding human-interpretable patterns describing
the data

7
Data Mining Techniques

Classification
Clustering
Association Rule Discovery
Sequential Pattern Discovery
Regression
Deviation Detection

8
Classification

Data defined in terms of attributes, one of which
is the class
Find a model for the class attribute as a function of
the values of the other (predictor) attributes, such
that previously unseen records can be assigned a
class as accurately as possible.
Training data: used to build the model
Test data: used to validate the model (determine
its accuracy)
The given data is usually divided into training and
test sets.
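
A minimal sketch of the train/test idea, assuming made-up records with a single `salary` field and a deliberately simple threshold model. The names `train_test_split`, `fit_threshold`, and `accuracy` are hypothetical helpers written for this illustration, not a library API.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and hold out a fraction as the test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_threshold(train_set):
    """Toy model: predict "good" at or above the lowest "good" salary seen in training."""
    cutoff = min(attrs["salary"] for attrs, label in train_set if label == "good")
    return lambda attrs: "good" if attrs["salary"] >= cutoff else "bad"

def accuracy(model, test_set):
    """Fraction of held-out records whose predicted class matches the true class."""
    hits = sum(1 for attrs, label in test_set if model(attrs) == label)
    return hits / len(test_set)

# Hypothetical labelled records: salary in thousands and an invented class label.
records = [({"salary": s}, "good" if s >= 50 else "bad") for s in range(10, 110, 10)]
train_set, test_set = train_test_split(records)
model = fit_threshold(train_set)
test_accuracy = accuracy(model, test_set)
print(f"test accuracy: {test_accuracy:.2f}")
```

The model never sees the test records while it is being built, so the reported accuracy estimates how it would do on previously unseen records.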

9
Classification

Given old data about customers and payments,
predict new applicants' loan eligibility.

[Figure: previous customers' data (Age, Salary, Profession, Location, Customer type) is fed to a classifier, which yields decision rules such as "Salary > 5 L" and "Prof. = Exec"; the rules label new applicants' data as good/bad]
10
Classification methods

Goal: predict class Ci = f(x1, x2, ..., xn)
Regression (linear or any other polynomial)
Decision tree classifier: divides the decision space
into piecewise constant regions
Neural networks: partition by non-linear
boundaries
Probabilistic/generative models
Lazy learning methods: nearest neighbor

11
Decision trees

Tree where internal nodes are simple decision
rules on one or more attributes and leaf nodes
are predicted class labels.

[Figure: example tree with the tests "Salary < 50 K", "Prof = teacher", and "Age < 30"]
12
Classification: Example
13
Decision Tree
Training Dataset
14
Output: A Decision Tree for buys_computer
15
Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive
divide-and-conquer manner
At start, all the training examples are at the
root
Attributes are categorical (if continuous-valued,
they are discretized in advance)
Examples are partitioned recursively based on
selected attributes
Test attributes are selected on the basis of a
heuristic or statistical measure (e.g.,
information gain)
Conditions for stopping partitioning
All samples for a given node belong to the same
class
There are no remaining attributes for further
partitioning; majority voting is employed for
classifying the leaf
There are no samples left
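
The greedy, top-down procedure above can be sketched in a few lines. This is an illustrative ID3-style sketch under stated assumptions: attributes are categorical, rows are dictionaries, and the four training rows are invented for the example. The names `build_tree` and `classify` are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(rows, target):
    """Information needed to classify a row drawn from `rows`."""
    counts = Counter(row[target] for row in rows)
    return -sum((c / len(rows)) * log2(c / len(rows)) for c in counts.values())

def information_gain(rows, attribute, target):
    """Entropy before the split minus the weighted entropy after it."""
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset, target)
    return entropy(rows, target) - remainder

def build_tree(rows, attributes, target):
    classes = Counter(row[target] for row in rows)
    if len(classes) == 1:              # stop: all samples belong to one class
        return next(iter(classes))
    if not attributes:                 # stop: no attributes left, majority vote
        return classes.most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    # Branch only on values present in `rows`, so no branch is ever empty.
    branches = {}
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        branches[value] = build_tree(subset, remaining, target)
    return (best, branches)

def classify(tree, row, default=None):
    """Walk the tree; return `default` for an attribute value never seen in training."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        if row[attribute] not in branches:
            return default
        tree = branches[row[attribute]]
    return tree

# Hypothetical training rows (not the buys_computer table from the slides).
rows = [
    {"student": "no", "credit": "fair", "buys": "no"},
    {"student": "no", "credit": "excellent", "buys": "no"},
    {"student": "yes", "credit": "fair", "buys": "yes"},
    {"student": "yes", "credit": "excellent", "buys": "yes"},
]
tree = build_tree(rows, ["credit", "student"], "buys")
print(tree)
```

Because the sketch only creates branches for values that occur in the current partition, the "no samples left" case shows up at prediction time instead, where `classify` falls back to `default`.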

16
Attribute Selection Measure: Information Gain
(ID3/C4.5)

Select the attribute with the highest information
gain
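S contains s_i tuples of class C_i for i = 1, ..., m,
with s = s_1 + ... + s_m
Information: measures the info required to classify
any arbitrary tuple
I(s_1, ..., s_m) = -\sum_{i=1}^{m} (s_i / s) \log_2 (s_i / s)
Entropy of attribute A with values a_1, a_2, ..., a_v
(s_{ij} is the number of class C_i tuples with A = a_j)
E(A) = \sum_{j=1}^{v} ((s_{1j} + ... + s_{mj}) / s) I(s_{1j}, ..., s_{mj})
Information gained by branching on attribute A
Gain(A) = I(s_1, ..., s_m) - E(A)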
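
As a rough numeric check of these three measures, the sketch below computes them directly from class counts. The function names `info`, `expected_info`, and `gain` are illustrative, and the counts passed in are example inputs rather than a full dataset.

```python
from math import log2

def info(*class_counts):
    """Expected information needed to classify a tuple, given per-class counts."""
    total = sum(class_counts)
    return -sum((s / total) * log2(s / total) for s in class_counts if s > 0)

def expected_info(partitions):
    """Entropy of an attribute: `partitions` holds one tuple of class counts per value."""
    total = sum(sum(counts) for counts in partitions)
    return sum(sum(counts) / total * info(*counts) for counts in partitions)

def gain(class_counts, partitions):
    """Information gained by branching on the attribute that induces `partitions`."""
    return info(*class_counts) - expected_info(partitions)

# A two-class set with 9 tuples of one class and 5 of the other.
print(f"{info(9, 5):.3f}")
# A subset holding 2 tuples of one class and 3 of the other.
print(f"{info(2, 3):.3f}")
```

An attribute that separates the classes perfectly has zero entropy after the split, so its gain equals the information of the unsplit set.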

17
Attribute Selection by Information Gain
Computation

Class P: buys_computer = "yes"
Class N: buys_computer = "no"
I(p, n) = I(9, 5) = 0.940
Compute the entropy for age

The term (5/14) I(2, 3) means "age <= 30" has 5 out of 14
samples, with 2 yeses and 3 nos. Hence
Similarly,

18
Classification: Direct Marketing

Goal: reduce the cost of soliciting (mailing) by
targeting a set of consumers likely to buy a new
product.
Data
for a similar product introduced earlier,
we know which customers decided to buy and which
did not; this buy / not buy decision is the class attribute
collect various demographic, lifestyle, and
company-related information about all such
customers as possible predictor variables.
Learn a classifier model

19
Classification: Fraud Detection

Goal: predict fraudulent cases in credit card
transactions.
Data
Use credit card transactions and information on
the account holder as input variables
Label past transactions as fraud or fair.
Learn a model for the class of transactions
Use the model to detect fraud by observing credit
card transactions on a given account.

20
Clustering

Given a set of data points, each having a set of
attributes, and a similarity measure among them,
find clusters such that
data points in one cluster are more similar to
one another
data points in separate clusters are less similar
to one another.
Similarity measures
Euclidean distance, if attributes are continuous
Problem specific measures

21
Clustering: Market Segmentation

Goal: subdivide a market into distinct subsets of
customers where any subset may conceivably be
selected as a market target to be reached with a
distinct marketing mix.
Approach
collect different attributes on customers based
on geographical and lifestyle-related
information
identify clusters of similar customers
measure the clustering quality by observing
buying patterns of customers in same cluster vs.
those from different clusters.

22
Association Rule Discovery

Given a set of records, each of which contains
some number of items from a given collection,
produce dependency rules which will predict the
occurrence of an item based on occurrences of
other items

23
Association Rule: Basic Concepts

Given (1) a database of transactions, where (2) each
transaction is a list of items (purchased by a
customer in a visit)
Find all rules that correlate the presence of
one set of items with that of another set of
items
E.g., 98% of people who purchase tires and auto
accessories also get automotive services done
Applications
* => Maintenance Agreement (what the store
should do to boost Maintenance Agreement sales)
Home Electronics => * (what other products
should the store stock up on?)
Attached mailing in direct marketing

24
Association Rule: Basic Concepts

Support(A => B) = (number of tuples containing both A and B) / (total number of tuples)
Confidence(A => B) = (number of tuples containing both A and B) / (number of tuples containing A)

25
Rule Measures: Support and Confidence

[Figure: Venn diagram with regions "Customer buys d", "Customer buys b", and "Customer buys both"]

Find all the rules X & Y => Z with minimum
confidence and support
support, s: probability that a transaction
contains X, Y, and Z
confidence, c: conditional probability that a
transaction having X and Y also contains Z

Let minimum support be 50% and minimum confidence
50%; then we have
A => C (50%, 66.6%)
C => A (50%, 100%)

26
Mining Association Rules: An Example
Min. support 50%, min. confidence 50%

For rule A => C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%
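
The two measures are easy to compute by counting. The transaction table for this example did not survive in the transcript, so the four transactions below are an assumption, chosen only because they are consistent with the quoted figures. The helper names `support` and `confidence` are illustrative.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent together, relative to the antecedent alone."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Assumed transactions, not taken from the slides.
transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

print(f"A => C: support {support(transactions, 'AC'):.1%}, "
      f"confidence {confidence(transactions, 'A', 'C'):.1%}")
print(f"C => A: support {support(transactions, 'CA'):.1%}, "
      f"confidence {confidence(transactions, 'C', 'A'):.1%}")
```

Both rules share the same support because they involve the same itemset; only the denominator of the confidence changes with the direction of the rule.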

27
Association Rules: Application

Marketing and Sales Promotion
Consider the discovered rule
{Bagels, ...} --> {Potato Chips}
Potato Chips as consequent: can be used to
determine what may be done to boost its sales
Bagels as an antecedent: can be used to see which
products may be affected if bagels are
discontinued
Can be used to see which products should be sold
with Bagels to promote sales of Potato Chips

28
Association Rules: Application

Supermarket shelf management
Goal: to identify items which are bought together
(by sufficiently many customers)
Approach: process point-of-sale data (collected
with barcode scanners) to find dependencies among
items.
Example
If a customer buys Diapers and Milk, then they are
very likely to buy Beer,
so stack six-packs next to diapers?

29
Sequential Pattern Discovery

Given a set of objects, each associated with its
own timeline of events, find rules that predict
strong sequential dependencies among different
events, of the form (A B) (C) (D E) --> (F)
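
A rule of this form can be scored the same way as an association rule, once "contains" is defined for ordered events. The sketch below assumes each timeline is a list of item sets in time order; the timelines are invented, and `contains_sequence` and `rule_confidence` are hypothetical helper names.

```python
def contains_sequence(timeline, pattern):
    """True if the pattern's item sets occur, in order, inside distinct events."""
    position = 0
    for itemset in pattern:
        # Advance to the earliest remaining event that includes this item set.
        while position < len(timeline) and not set(itemset) <= set(timeline[position]):
            position += 1
        if position == len(timeline):
            return False
        position += 1  # the next item set must match a strictly later event
    return True

def rule_confidence(timelines, antecedent, consequent):
    """Among timelines matching the antecedent, the share that also continue with the consequent."""
    matching = [t for t in timelines if contains_sequence(t, antecedent)]
    if not matching:
        return 0.0
    full = [t for t in matching if contains_sequence(t, antecedent + consequent)]
    return len(full) / len(matching)

# Hypothetical timelines, one per object.
timelines = [
    [{"A", "B"}, {"C"}, {"D", "E"}, {"F"}],
    [{"A", "B", "X"}, {"C"}, {"D", "E"}, {"G"}],
    [{"C"}, {"A", "B"}],
]
antecedent = [{"A", "B"}, {"C"}, {"D", "E"}]
rule_conf = rule_confidence(timelines, antecedent, [{"F"}])
print(rule_conf)
```

Matching each item set against the earliest possible event is enough here, because taking an earlier match never rules out a later one.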