Data Mining - PowerPoint PPT Presentation

About This Presentation
Title:

Data Mining

Slides: 36
Provided by: johanne

Transcript and Presenter's Notes

Title: Data Mining


1
Data Mining
  • 198541

2
Definition
  • Data mining is the exploration and analysis of
    large quantities of data in order to discover
    valid, novel, potentially useful, and ultimately
    understandable patterns in data.
  • Example pattern (Census Bureau data): If
    (relationship = husband), then (gender = male).
    99.6%

3
Definition (Cont.)
  • Data mining is the exploration and analysis of
    large quantities of data in order to discover
    valid, novel, potentially useful, and ultimately
    understandable patterns in data.
  • Valid: The patterns hold in general.
  • Novel: We did not know the pattern beforehand.
  • Useful: We can devise actions from the patterns.
  • Understandable: We can interpret and comprehend
    the patterns.

4
Why Use Data Mining Today?
  • Human analysis skills are inadequate
  • Volume and dimensionality of the data
  • High data growth rate
  • Availability of:
      • Data
      • Storage
      • Computational power
      • Off-the-shelf software
      • Expertise

5
An Abundance of Data
  • Supermarket scanners, POS data
  • Preferred customer cards
  • Credit card transactions
  • Direct mail response
  • Call center records
  • ATM machines
  • Demographic data
  • Sensor networks
  • Cameras
  • Web server logs
  • Customer web site trails

6
Why Use Data Mining Today?
  • Competitive pressure!
  • "The secret of success is to know something that
    nobody else knows." (Aristotle Onassis)
  • Competition on service, not only on price (Banks,
    phone companies, hotel chains, rental car
    companies)
  • Personalization, CRM
  • The real-time enterprise
  • Systemic listening
  • Security, homeland defense

7
The Knowledge Discovery Process
  • Steps
  • Identify business problem
  • Data mining
  • Action
  • Evaluation and measurement
  • Deployment and integration into business
    processes

8
Data Mining Step in Detail
  • 2.1 Data preprocessing
      • Data selection: Identify target datasets and
        relevant fields
      • Data cleaning: Remove noise and outliers
      • Data transformation: Create common units,
        generate new fields
  • 2.2 Data mining model construction
  • 2.3 Model evaluation
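
The preprocessing sub-steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names, the median-based outlier rule, and the derived `bmi` field are all invented for the example and do not come from the slides.

```python
from statistics import median

# Hypothetical raw records; every field name and value is invented.
raw = [
    {"id": 1, "height": 180.0, "height_unit": "cm", "weight_kg": 75.0, "notes": "a"},
    {"id": 2, "height": 70.0, "height_unit": "in", "weight_kg": 82.0, "notes": "b"},
    {"id": 3, "height": 165.0, "height_unit": "cm", "weight_kg": 900.0, "notes": "c"},
    {"id": 4, "height": 172.0, "height_unit": "cm", "weight_kg": 68.0, "notes": "d"},
]

def select_fields(rec, fields):
    """Data selection: keep only the fields relevant to the analysis."""
    return {f: rec[f] for f in fields}

def remove_outliers(records, field, max_ratio=3.0):
    """Data cleaning: drop records whose value exceeds max_ratio times the median.
    The threshold rule is an assumption, one of many possible outlier tests."""
    m = median(r[field] for r in records)
    return [r for r in records if r[field] <= max_ratio * m]

def to_common_units(rec):
    """Data transformation: express every height in centimetres."""
    out = {k: v for k, v in rec.items() if k not in ("height", "height_unit")}
    factor = 2.54 if rec["height_unit"] == "in" else 1.0
    out["height_cm"] = rec["height"] * factor
    return out

def add_bmi(rec):
    """Data transformation: generate a new field from existing ones."""
    out = dict(rec)
    out["bmi"] = out["weight_kg"] / (out["height_cm"] / 100) ** 2
    return out

records = [select_fields(r, ["id", "height", "height_unit", "weight_kg"]) for r in raw]
records = remove_outliers(records, "weight_kg")
records = [add_bmi(to_common_units(r)) for r in records]
```

After these steps `records` holds three cleaned rows (the record with the implausible weight is gone), each with `height_cm` and the derived `bmi` field, ready for model construction.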

9
Preprocessing and Mining
[Figure: knowledge discovery pipeline. Original Data → Data Integration and Selection → Target Data → Preprocessing → Preprocessed Data → Model Construction → Patterns → Interpretation → Knowledge]

10
Example Application: Sky Survey
  • Input data: 3 TB of image data with 2 billion sky
    objects, took more than six years to complete
  • Goal: Generate a catalog with all objects and
    their type
  • Method: Use decision trees as data mining model
  • Results
  • 94% accuracy in predicting sky object classes
  • Increased number of faint objects classified by
    300%
  • Helped team of astronomers to discover 16 new
    high red-shift quasars in one order of magnitude
    less observation time

11
Gold Nuggets?
  • Investment firm mailing list: Discovered that old
    people do not respond to IRA mailings
  • Bank clustered their customers. One cluster:
    older customers, no mortgage, less likely to have
    a credit card
  • Bank of 1911
  • Customer churn example

12
What is a Data Mining Model?
  • A data mining model is a description of a
    specific aspect of a dataset. It produces output
    values for an assigned set of input values.
  • Examples
  • Linear regression model
  • Classification model
  • Clustering

13
Data Mining Models (Contd.)
  • A data mining model can be described at two
    levels
  • Functional level
  • Describes model in terms of its intended
    usage. Examples: classification, clustering
  • Representational level
  • Specific representation of a model. Examples:
    log-linear model, classification tree, nearest
    neighbor method.
  • Black-box models versus transparent models

14
Data Mining: Types of Data
  • Relational data and transactional data
  • Spatial and temporal data, spatio-temporal
    observations
  • Time-series data
  • Text
  • Images, video
  • Mixtures of data
  • Sequence data
  • Features from processing other data sources

15
Types of Variables
  • Numerical: Domain is ordered and can be
    represented on the real line (e.g., age, income)
  • Nominal or categorical: Domain is a finite set
    without any natural ordering (e.g., occupation,
    marital status, race)
  • Ordinal: Domain is ordered, but absolute
    differences between values are unknown (e.g.,
    preference scale, severity of an injury)

16
Data Mining Techniques
  • Supervised learning
  • Classification and regression
  • Unsupervised learning
  • Clustering
  • Dependency modeling
  • Associations, summarization, causality
  • Outlier and deviation detection
  • Trend analysis and change detection

17
Market Basket Analysis
  • Consider shopping cart filled with several items
  • Market basket analysis tries to answer the
    following questions
  • Who makes purchases?
  • What do customers buy together?
  • In what order do customers purchase items?

18
Market Basket Analysis
  • Given
  • A database of customer transactions
  • Each transaction is a set of items
  • Example: Transaction with TID 111 contains items
    Pen, Ink, Milk, Juice

19
Market Basket Analysis (Contd.)
  • Co-occurrences
  • 80% of all customers purchase items X, Y and Z
    together.
  • Association rules
  • 60% of all customers who purchase X and Y also
    buy Z.
  • Sequential patterns
  • 60% of customers who first buy X also purchase Y
    within three weeks.

20
Confidence and Support
  • We prune the set of all possible association
    rules (LHS $\Rightarrow$ RHS) using two interestingness
    measures
  • Confidence of a rule (% of transactions with LHS
    that also contain RHS)
  • $X \Rightarrow Y$ has confidence $c$ if $P(Y \mid X) = c$
  • Support of a rule (% of transactions that
    contain LHS $\cup$ RHS)
  • $X \Rightarrow Y$ has support $s$ if $P(XY) = s$
  • We can also define
  • Support of an itemset (a co-occurrence) $XY$
  • $XY$ has support $s$ if $P(XY) = s$

21
Example
  • Examples
  • Pen $\Rightarrow$ Milk: support 75%, confidence 75%
  • Ink $\Rightarrow$ Pen: support 75%, confidence 100%
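
To make the two measures concrete, here is a minimal Python sketch. The transaction table that accompanied this slide did not survive in the transcript, so only TID 111 below comes from the slides; transactions 112 to 114 are assumptions, chosen so that the computed values agree with the figures quoted above. The function names are likewise illustrative.

```python
# TID 111 is taken from the slides; TIDs 112-114 are assumed for illustration.
transactions = {
    111: {"Pen", "Ink", "Milk", "Juice"},
    112: {"Pen", "Ink", "Milk"},
    113: {"Pen", "Milk"},
    114: {"Pen", "Ink", "Juice"},
}

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(lhs, rhs, transactions):
    """Fraction of transactions containing lhs that also contain rhs."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

pen_milk = (support({"Pen", "Milk"}, transactions),
            confidence({"Pen"}, {"Milk"}, transactions))
ink_pen = (support({"Ink", "Pen"}, transactions),
           confidence({"Ink"}, {"Pen"}, transactions))
```

With the assumed table, `pen_milk` evaluates to `(0.75, 0.75)` and `ink_pen` to `(0.75, 1.0)`. The rule support is the support of the combined itemset, and the confidence divides that by the support of the left-hand side.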

22
Example
  • Find all itemsets with support $\geq$ 75%?

23
Example
  • Can you find all association rules with support
    $\geq$ 50%?

24
Market Basket Analysis Applications
  • Sample Applications
  • Direct marketing
  • Fraud detection for medical insurance
  • Floor/shelf planning
  • Web site layout
  • Cross-selling

25
Association Rule Algorithms
  • Find all large itemsets
  • For each large itemset, find all association
    rules with sufficient confidence

Apriori Algorithm (Agrawal et al.)
Any subset of a frequent itemset has to be
frequent.
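
The first step (finding all large itemsets) can be sketched as a level-wise search that uses the subset property to prune candidates. This is a simplified illustration, not the published algorithm in full: the function name, the join strategy, and the sample baskets are assumptions made for the example.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: count} for every itemset contained in at least
    min_count transactions, searching level by level (1-itemsets first)."""
    transactions = [frozenset(t) for t in transactions]
    candidates = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    k = 1
    while candidates:
        # Count the support of each candidate k-itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
        # Join: combine frequent (k-1)-itemsets into k-itemset candidates.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: a candidate survives only if all of its (k-1)-subsets are frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

# Hypothetical baskets, invented for this example.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
    {"eggs"},
]
frequent = apriori(baskets, min_count=3)
```

Here {milk, butter} occurs in only two baskets, so the candidate {bread, butter, milk} is pruned without ever being counted against the data.
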
26
Problem Redux (Contd.)
  • Definitions
  • An itemset is frequent if it is a subset of at
    least x transactions. (FI.)
  • An itemset is maximally frequent if it is
    frequent and it does not have a frequent
    superset. (MFI.)
  • Goal: Given x, find all frequent (maximally
    frequent) itemsets (to be stored in the FI
    (MFI)).
  • Obvious relationship: MFI $\subseteq$ FI
  • Example
  • $D = \{\{1,2,3\}, \{1,2,3\}, \{1,2,3\}, \{1,2,4\}\}$
  • Minimum support $x = 3$
  • $\{1,2\}$ is frequent
  • $\{1,2,3\}$ is maximal frequent
  • Support($\{1,2\}$) $= 4$
  • All maximal frequent itemsets: $\{1,2,3\}$
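
The example can be checked by brute force. The sketch below enumerates every non-empty itemset, which is only feasible for tiny item universes; the variable and function names are illustrative.

```python
from itertools import combinations

# Database and threshold from the example on this slide.
D = [{1, 2, 3}, {1, 2, 3}, {1, 2, 3}, {1, 2, 4}]
x = 3

def support_count(itemset, db):
    """Number of transactions that contain itemset as a subset."""
    return sum(1 for t in db if itemset <= t)

items = sorted(set().union(*D))

# FI: every non-empty itemset contained in at least x transactions.
FI = [
    frozenset(c)
    for k in range(1, len(items) + 1)
    for c in combinations(items, k)
    if support_count(frozenset(c), D) >= x
]

# MFI: frequent itemsets with no frequent proper superset.
MFI = [f for f in FI if not any(f < g for g in FI)]
```

`FI` contains seven itemsets ({1}, {2}, {3}, {1,2}, {1,3}, {2,3}, {1,2,3}), while `MFI` contains only {1,2,3}, matching the slide.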

27
The Itemset Lattice

[Figure: lattice of all itemsets over the items 1, 2, 3, 4, from the single items up to {1,2,3,4}]

28
Frequent Itemsets

[Figure: the itemset lattice over the items 1, 2, 3, 4, divided into frequent itemsets and infrequent itemsets]

29
Apriori 1-Itemsets

[Figure: the itemset lattice during the Apriori pass over 1-itemsets. Legend: infrequent, frequent, currently examined, don't know]
The Apriori principle: I is infrequent if (I - {x}) is
infrequent.

30
Apriori 2-Itemsets

[Figure: the itemset lattice during the Apriori pass over 2-itemsets. Legend: infrequent, frequent, currently examined, don't know]

31
Apriori 3-Itemsets

[Figure: the itemset lattice during the Apriori pass over 3-itemsets. Legend: infrequent, frequent, currently examined, don't know]

32
Extensions
  • Imposing constraints
  • Only find rules involving the dairy department
  • Only find rules involving expensive products
  • Only find expensive rules
  • Only find rules with whiskey on the right-hand
    side
  • Only find rules with milk on the left-hand side
  • Hierarchies on the items
  • Calendars (every Sunday, every 1st of the month)

33
Itemset Constraints
  • Definition
  • A constraint is an arbitrary property of
    itemsets.
  • Examples
  • The itemset has support greater than 1000.
  • No element of the itemset costs more than 40.
  • The items in the set average more than 20.
  • Goal
  • Find all itemsets satisfying a given constraint
    P.
  • Solution
  • If P is a support constraint, use the Apriori
    Algorithm.
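
A constraint can be modeled as a predicate on itemsets. The sketch below is a naive illustration that enumerates all itemsets and filters them; the item names and prices are hypothetical, and the two predicates mirror the second and third examples above.

```python
from itertools import combinations

# Hypothetical price list, invented for this example.
prices = {"wine": 45.0, "cheese": 30.0, "bread": 5.0, "olives": 25.0}

def all_itemsets(items):
    """Yield every non-empty subset of items as a frozenset."""
    items = sorted(items)
    for k in range(1, len(items) + 1):
        for c in combinations(items, k):
            yield frozenset(c)

def max_price_at_most(limit):
    """Constraint: no element of the itemset costs more than limit."""
    return lambda s: all(prices[i] <= limit for i in s)

def average_price_above(limit):
    """Constraint: the items in the set average more than limit."""
    return lambda s: sum(prices[i] for i in s) / len(s) > limit

def satisfying(items, constraint):
    """All itemsets over items for which the constraint holds."""
    return [s for s in all_itemsets(items) if constraint(s)]

cheap = satisfying(prices, max_price_at_most(40))
pricey = satisfying(prices, average_price_above(20))
```

The two constraints behave differently. If an itemset satisfies the price cap, so does every subset of it, which is the same downward-closure property that support has. The average constraint lacks this property: {wine, bread} satisfies it although its subset {bread} does not, so level-wise pruning in the style of Apriori cannot be applied to it directly.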

34
Finding Association Rules
  • Identify frequent itemsets
  • (Itemsets with support $\geq$ minsup)
  • Generate candidate rules
  • Divide each frequent itemset X into pairs of LHS
    and RHS itemsets (LHS $\cup$ RHS = X)
  • Compute the confidence of the rule
  • support(X) / support(LHS)
  • (From Apriori, all LHS and RHS are frequent)
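
The rule-generation step can be sketched as follows, assuming the frequent itemsets and their support counts have already been computed. The function name, the `min_conf` parameter, and the sample counts are assumptions for illustration. The lookup `support[lhs]` relies on the point made above: every subset of a frequent itemset is itself frequent, so it is present in the table.

```python
from itertools import combinations

def generate_rules(support, min_conf):
    """Split each frequent itemset X into (LHS, RHS) pairs with LHS | RHS == X
    and keep the rules whose confidence support(X) / support(LHS) is at
    least min_conf. `support` maps frozenset -> support count and must
    contain every frequent itemset."""
    rules = []
    for x, sup_x in support.items():
        for k in range(1, len(x)):  # LHS sizes 1 .. len(x) - 1
            for lhs in map(frozenset, combinations(sorted(x), k)):
                conf = sup_x / support[lhs]
                if conf >= min_conf:
                    rules.append((lhs, x - lhs, conf))
    return rules

# Hypothetical support counts, invented for this example.
support = {
    frozenset({"A"}): 4,
    frozenset({"B"}): 3,
    frozenset({"A", "B"}): 3,
}
rules = generate_rules(support, min_conf=0.9)
```

With these counts, B ⇒ A has confidence 3/3 = 1.0 and is kept, while A ⇒ B has confidence 3/4 = 0.75 and is dropped at a 0.9 threshold.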

35
Applications
  • Spatial association rules
  • Web mining
  • Market basket analysis
  • User/customer profiling