Data Mining - PowerPoint PPT Presentation

Slides: 29
Provided by: stanm1

Transcript and Presenter's Notes

1
Data Mining
  • What is data mining?
  • Motivating example
  • Why now?
  • Technological foundations
  • Tasks
  • Architectures and processes
  • data warehouse, data mart
  • middleware
  • OLAP
  • Conclusion

http://www.site.uottawa.ca/stan/csi5387/dm35171.pdf
2
Definition
  • Technology that finds implicit, unexpected
    relationships in the data
  • the K-mart example

3
Why now?
  • Bar codes
  • networks/connectivity
  • IT-maturity of management

4
Technological foundations
  • Databases
  • machine learning
  • visualization
  • statistics

5
Tasks
  • Associations / market basket analysis (MBA)
  • estimation
  • classification
  • clustering
  • ...

6
Associations
  • Given:
  • $I = \{i_1, \ldots, i_m\}$: a set of items
  • $D$: a set of transactions (a database); each
    transaction $T$ is a set of items, $T \in 2^I$
  • Association rule $X \Rightarrow Y$, where $X \subset I$, $Y \subset I$, $X \cap Y = \emptyset$
  • confidence $c$: ratio of the # of transactions that
    contain both $X$ and $Y$ to the # of all transactions that contain $X$
  • support $s$: ratio of the # of transactions that
    contain both $X$ and $Y$ to the # of transactions in $D$

7
  • An association rule $A \Rightarrow B$ is a conditional
    implication among itemsets $A$ and $B$, where $A \subset I$,
    $B \subset I$ and $A \cap B = \emptyset$.
  • The confidence of an association rule $r: A \Rightarrow B$ is
    the conditional probability that a transaction
    contains $B$, given that it contains $A$.
  • The support of rule $r$ is defined as
    $\mathrm{sup}(r) = \mathrm{sup}(A \cup B)$. The confidence of rule $r$ can be
    expressed as $\mathrm{conf}(r) = \mathrm{sup}(A \cup B) / \mathrm{sup}(A)$.
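
The two definitions above can be checked on a small example. The following is a minimal sketch, not taken from the slides: the four-transaction database `D` and the function names `support` and `confidence` are made up for illustration.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(a: Iterable[str], b: Iterable[str],
               transactions: List[Transaction]) -> float:
    """conf(A => B) = sup(A u B) / sup(A); A must occur at least once."""
    sup_a = support(a, transactions)
    if sup_a == 0:
        raise ValueError("antecedent never occurs, confidence is undefined")
    return support(set(a) | set(b), transactions) / sup_a


# Hypothetical toy database (an assumption, not data from the slides).
D = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "diapers", "beer"}),
    frozenset({"milk", "diapers", "beer"}),
    frozenset({"bread", "milk", "diapers", "beer"}),
]

# Rule {diapers} => {beer}: both items occur together in 3 of 4 transactions,
# and every transaction with diapers also has beer.
sup_rule = support({"diapers", "beer"}, D)
conf_rule = confidence({"diapers"}, {"beer"}, D)
```

Note that support is symmetric in the two sides of the rule, while confidence is not: swapping A and B changes the denominator.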

8
Associations - mining
  • Given D, generate all association rules with c > minc
    and s > mins (given thresholds)
  • (items are ordered, e.g. by barcode)
  • Idea:
  • find all itemsets that have transaction support
    > mins (large itemsets)

9
Associations - mining
  • to do that, start with individual items with large
    support
  • in each next step, k:
  • use the large itemsets from step k-1 to generate the new
    candidate itemsets $C_k$,
  • count the support of $C_k$ (by counting the
    candidates which are contained in each transaction t),
  • prune the ones that are not large
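
The level-wise loop described above can be sketched as follows. This is an illustrative simplification under stated assumptions: candidates are generated naively as unions of two large (k-1)-itemsets, not with the hash-tree machinery, and the function name `large_itemsets` and the four-transaction database are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List

Itemset = FrozenSet[int]


def large_itemsets(transactions: List[Itemset],
                   min_support: float) -> Dict[Itemset, float]:
    """Return every itemset whose support is > min_support, with its support."""
    n = len(transactions)

    def keep_large(candidates: Iterable[Itemset]) -> Dict[Itemset, float]:
        # Count the support of each candidate and prune the ones that are not large.
        result = {}
        for c in candidates:
            s = sum(1 for t in transactions if c <= t) / n
            if s > min_support:  # strict ">" as written on the slides
                result[c] = s
        return result

    # Step 1: individual items with large support.
    items = {i for t in transactions for i in t}
    current = keep_large(frozenset([i]) for i in items)
    found = dict(current)

    k = 2
    while current:
        prev = list(current)
        # Naive candidate generation (assumption): unions of two large
        # (k-1)-itemsets that contain exactly k items.
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        current = keep_large(candidates)
        found.update(current)
        k += 1
    return found


# Hypothetical toy database, items written as integers (e.g. barcodes).
D = [frozenset({1, 3, 4}), frozenset({2, 3, 5}),
     frozenset({1, 2, 3, 5}), frozenset({2, 5})]
found = large_itemsets(D, min_support=0.4)
```

The loop stops as soon as a level produces no large itemsets, because no larger itemset can then be large either.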

10
Associations - mining
Only keep those that are contained in some
transaction
11
Candidate generation
$C_k = \text{apriori-gen}(L_{k-1})$
12
Subset function
Subset($C_k$, t) finds the itemsets of $C_k$ that are contained in a
transaction t. It is done via a tree structure,
through a series of hashing steps:
  • root: hash on every item in t; itemsets
    not containing anything from t are ignored
  • interior node: if you got here by hashing item i of t, hash on
    all following items of t
  • leaf (a set of itemsets): check which itemsets in this leaf
    are contained in t
13
Example
  • $L_3 = \{\{1\ 2\ 3\}, \{1\ 2\ 4\}, \{1\ 3\ 4\}, \{1\ 3\ 5\}, \{2\ 3\ 4\}\}$
  • $C_4 = \{\{1\ 2\ 3\ 4\}, \{1\ 3\ 4\ 5\}\}$
  • pruning deletes {1 3 4 5} because {1 4 5} is not
    in $L_3$.
  • See http://www.almaden.ibm.com/u/ragrawal/pubs.html#associations
    for details
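
The example above can be reproduced with a short sketch of candidate generation. It assumes the usual two-phase form (a join of itemsets that share their first k-2 items, followed by pruning of candidates with a non-large subset); the function names `join_step`, `prune_step` and `apriori_gen` are chosen here for illustration.

```python
from itertools import combinations
from typing import Set, Tuple

Itemset = Tuple[int, ...]  # items kept in sorted order, as on the slides


def join_step(large_prev: Set[Itemset]) -> Set[Itemset]:
    """Merge pairs of (k-1)-itemsets that agree on all but their last item."""
    return {
        p + (q[-1],)
        for p in large_prev
        for q in large_prev
        if p[:-1] == q[:-1] and p[-1] < q[-1]
    }


def prune_step(candidates: Set[Itemset], large_prev: Set[Itemset]) -> Set[Itemset]:
    """Drop candidates that have a (k-1)-subset which is not large."""
    return {
        c for c in candidates
        if all(sub in large_prev for sub in combinations(c, len(c) - 1))
    }


def apriori_gen(large_prev: Set[Itemset]) -> Set[Itemset]:
    return prune_step(join_step(large_prev), large_prev)


L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
joined = join_step(L3)   # {(1, 2, 3, 4), (1, 3, 4, 5)}
C4 = apriori_gen(L3)     # (1, 3, 4, 5) is dropped: (1, 4, 5) is not in L3
```

Keeping each itemset as a sorted tuple is what makes the prefix comparison in the join valid, which is why the slides note that items are ordered.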

14
Lift chart
Population: 100; response rate: 5%. Contacting the 10%
with the best chances, we obtain 20% of the 5 who
respond, so 1 person. Without a model, 0.5 person.
The lift is 2. Oftentimes, cost has to be taken
into account for samples of small and large size.
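
The arithmetic behind this lift figure can be written out as a small sketch. The function name `lift` and its parameter names are assumptions made for illustration; the numbers are the ones from the slide.

```python
def lift(population: int, response_rate: float,
         contacted_fraction: float, captured_fraction: float) -> float:
    """Responders reached with the model divided by responders reached at random."""
    responders = population * response_rate        # 5 people respond overall
    contacted = population * contacted_fraction    # 10 people are contacted
    with_model = responders * captured_fraction    # 20% of the 5, i.e. 1 person
    without_model = contacted * response_rate      # 5% of the 10, i.e. 0.5 person
    return with_model / without_model


# Slide numbers: population 100, 5% response, contact 10%, capture 20%.
lift_value = lift(100, 0.05, 0.10, 0.20)  # 2.0, up to floating-point rounding
```

The population and response rate cancel out, so the lift reduces to the captured fraction divided by the contacted fraction (20% / 10%).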
15
Architectures
  • data warehouse
  • metadata
  • middleware
  • data mart
  • data cube

16
Architecture - defs
  • data warehouse: several heterogeneous databases
    that contain data relevant to a given problem
    (e.g. transactions, customer info, ...)
  • metadata: data about the data. Describes the
    hierarchy of attributes and the logical
    organization of the data (e.g. customer data
    consists of the number, name, accounts,
    accounts is ...)
  • the database schema is an example of metadata
  • metadata describes the data in the data
    warehouse from the business perspective

17
Architecture - defs
  • middleware: software protocol for a single
    interface to a distributed DW. E.g. standards
    such as Open Database Connectivity (ODBC) and
    Java Database Connectivity (JDBC) APIs
  • problems (efficiency) when querying
  • multitiered approach
  • data marts: data warehouse needed for a given department

19
Processes: OLAP (On-Line Analytical Processing)
  • from de-normalized data (source system, e.g.
    transactions) to a star topology
  • analyzing the reports that are likely to be
    needed
  • the star and the reports define the dimensions
  • the dimensions define the cube

20
Example
  • moviegoers database (de-normalized)

    name   sex  age  source          movie name
    Amy    f    27   Oberlin         Independence Day
    Andy   m    34   Oberlin         The Birdcage
    Bob    m    51   Pinewoods       Schindler's List
    Cathy  f    39   124 Mt. Auburn  The Birdcage
    Curt   m    30   MRJ             Judgement Day
    David  m    40   MRJ             Independence Day
    Erica  f    23   124 Mt. Auburn  Trainspotting

21
(Figure: star topology, with a central fact table linked to dimension tables)
22
Typical reports
  • # of times each movie was seen, for movies seen > 5
    times
  • for what movies is the avg age of viewers > 30?
  • the # of people and their ages by source
  • the # of people from each source by gender
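
The reports above can be sketched as plain aggregations over the de-normalized moviegoers records. The function names are hypothetical, and because the sample has only seven rows, the view-count threshold is passed as a parameter so the example can use 1 where the slide says 5.

```python
from collections import Counter, defaultdict

# (name, sex, age, source, movie) rows from the moviegoers example.
MOVIEGOERS = [
    ("Amy", "f", 27, "Oberlin", "Independence Day"),
    ("Andy", "m", 34, "Oberlin", "The Birdcage"),
    ("Bob", "m", 51, "Pinewoods", "Schindler's List"),
    ("Cathy", "f", 39, "124 Mt. Auburn", "The Birdcage"),
    ("Curt", "m", 30, "MRJ", "Judgement Day"),
    ("David", "m", 40, "MRJ", "Independence Day"),
    ("Erica", "f", 23, "124 Mt. Auburn", "Trainspotting"),
]


def times_seen(rows, more_than):
    """# of times each movie was seen, for movies seen > `more_than` times."""
    counts = Counter(movie for *_, movie in rows)
    return {movie: n for movie, n in counts.items() if n > more_than}


def movies_with_avg_age_over(rows, limit):
    """Movies whose viewers have an average age > `limit`."""
    ages = defaultdict(list)
    for _name, _sex, age, _source, movie in rows:
        ages[movie].append(age)
    return {m for m, a in ages.items() if sum(a) / len(a) > limit}


def ages_by_source(rows):
    """The people's ages grouped by source; len() of each list is the # of people."""
    result = defaultdict(list)
    for _name, _sex, age, source, _movie in rows:
        result[source].append(age)
    return dict(result)


def people_by_source_and_gender(rows):
    """# of people for each (source, sex) pair."""
    return Counter((source, sex) for _name, sex, _age, source, _movie in rows)


popular = times_seen(MOVIEGOERS, more_than=1)
older_audience = movies_with_avg_age_over(MOVIEGOERS, limit=30)
by_source = ages_by_source(MOVIEGOERS)
by_source_gender = people_by_source_and_gender(MOVIEGOERS)
```

Each report groups by one or two attributes (movie, source, gender), which is what makes those attributes natural candidates for dimensions.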

23
Cube
  • is formed by representing the whole database
    (denormalized) by the dimensions
  • the size of the cube does not depend on the
    number of people
  • the cube has subcubes, each containing the key
    info that identifies it plus the summary
    aggregate info
  • cube = MDD (multi-dimensional database)
  • real cubes have more than three dimensions
  • each record belongs to exactly one subcube
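
A minimal sketch of these properties, using the moviegoers records: each subcube is keyed by its dimension values and stores only summary aggregates. The choice of dimensions (sex, source), the measure (age), and the name `build_cube` are assumptions made for illustration, not the layout used in the slides.

```python
from collections import defaultdict

FIELDS = ("name", "sex", "age", "source", "movie")
ROWS = [dict(zip(FIELDS, r)) for r in [
    ("Amy", "f", 27, "Oberlin", "Independence Day"),
    ("Andy", "m", 34, "Oberlin", "The Birdcage"),
    ("Bob", "m", 51, "Pinewoods", "Schindler's List"),
    ("Cathy", "f", 39, "124 Mt. Auburn", "The Birdcage"),
    ("Curt", "m", 30, "MRJ", "Judgement Day"),
    ("David", "m", 40, "MRJ", "Independence Day"),
    ("Erica", "f", 23, "124 Mt. Auburn", "Trainspotting"),
]]


def build_cube(rows, dims, measure):
    """Map each combination of dimension values to its summary aggregates."""
    cube = defaultdict(lambda: {"count": 0, "total": 0})
    for row in rows:
        key = tuple(row[d] for d in dims)  # each record lands in exactly one subcube
        cube[key]["count"] += 1
        cube[key]["total"] += row[measure]
    return dict(cube)


cube = build_cube(ROWS, dims=("sex", "source"), measure="age")

# Tripling the number of people changes the aggregates but not the set of
# subcubes: the cube's size is bounded by the dimension values, not by the
# number of records.
bigger = build_cube(ROWS * 3, dims=("sex", "source"), measure="age")
```

Storing a count and a total per subcube is enough to answer average-age questions without revisiting the original records.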

24
Tasks
  • drilling: looking inside a subcube at the
    records (of the original database) that are
    represented in that subcube
  • churning/attrition: losing customers.
  • can be cast as a classification problem on
    historical data (two classes: customers who have
    churned and those who have not)
  • then a classification system (e.g. decision tree
    induction) can induce the classifiers
  • fraud detection: learning regular patterns,
    watching for discrepancies

27
Data mining - tools
  • IBM Intelligent Miner
  • SAS Enterprise Miner
  • SGI MineSet
  • RuleQuest (you can download it for a trial!)

28
Data mining - conclusion
  • Treats historical data as an organizational
    asset, rather than a burden
  • tries to:
  • find out the unknown
  • predict the unknown
  • applies to:
  • marketing
  • internet mining
  • E-commerce
  • ...