Title: Data Mining
1Data Mining
- What is data mining?
- Motivating example
- Why now?
- Technological foundations
- Tasks
- Architectures and processes
- data warehouse, data mart
- middleware
- OLAP
- Conclusion
http://www.site.uottawa.ca/stan/csi5387/dm35171.pdf
2Definition
- Technology that finds implicit, unexpected relationships in the data
- the K-mart example
3Why now?
- Bar codes
- networks/connectivity
- IT-maturity of management
4Technological foundations
- Databases
- machine learning
- visualization
- statistics
5Tasks
- Associations/MBA
- estimation
- classification
- clustering
- ...
6Associations
- Given
- I = {i1, ..., im}: set of items
- D: set of transactions (a database); each transaction T is a set of items, T ∈ 2^I (T ⊆ I)
- Association rule X ⇒ Y, with X ⊂ I, Y ⊂ I, X ∩ Y = ∅
- confidence c: ratio of the # of transactions that contain both X and Y to the # of all transactions that contain X
- support s: ratio of the # of transactions that contain both X and Y to the # of transactions in D
7
- An association rule A ⇒ B is a conditional implication among itemsets A and B, where A ⊂ I, B ⊂ I and A ∩ B = ∅.
- The confidence of an association rule r: A ⇒ B is the conditional probability that a transaction contains B, given that it contains A.
- The support of rule r is defined as sup(r) = sup(A ∪ B). The confidence of rule r can be expressed as conf(r) = sup(A ∪ B) / sup(A).
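The support and confidence definitions above can be made concrete with a minimal Python sketch. The toy transactions and the helper names `support` and `confidence` are illustrative assumptions and do not come from the slides.

```python
# Toy database D: each transaction is a set of items (hypothetical data).
D = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(A, B, transactions):
    """conf(A => B) = sup(A ∪ B) / sup(A)."""
    return support(set(A) | set(B), transactions) / support(A, transactions)

A, B = {"diapers"}, {"beer"}
print(support(A | B, D))    # share of transactions containing both diapers and beer
print(confidence(A, B, D))  # share of diaper transactions that also contain beer
```

Note that support is symmetric in A and B, while confidence is not: conf(A ⇒ B) and conf(B ⇒ A) generally differ.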
8Associations - mining
- Given D, generate all association rules with c, s > thresholds minc, mins
- (items are ordered, e.g. by barcode)
- Idea
- find all itemsets that have transaction support > mins: the large itemsets
9Associations - mining
- to do that, start with individual items with large support
- in each next step, k,
- use the large itemsets from step k-1 to generate new candidate itemsets Ck,
- count the support of Ck (by counting the candidates that are contained in each transaction t),
- prune the ones that are not large
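The level-wise search described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the reference algorithm: the function name `large_itemsets`, the toy transactions, and the simplified candidate generation (unions of two large itemsets from the previous step) are all choices made for this example.

```python
from itertools import combinations

def large_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets with support > min_support."""
    n = len(transactions)

    def keep_large(candidates):
        # Count, for each candidate, the transactions t that contain it,
        # then prune the candidates that are not large.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: k / n for c, k in counts.items() if k / n > min_support}

    # Step 1: individual items with large support.
    items = {item for t in transactions for item in t}
    current = keep_large({frozenset([i]) for i in items})
    result = dict(current)

    k = 2
    while current:
        # Candidates Ck: unions of two large (k-1)-itemsets that have size k
        # (a simplified candidate generation, assumed for this sketch).
        prev = list(current)
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        current = keep_large(candidates)
        result.update(current)
        k += 1
    return result

# Hypothetical toy database of four transactions.
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
for itemset, s in sorted(large_itemsets(D, 0.4).items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), s)
```

The loop stops as soon as a step produces no large itemsets, since no larger itemset can then be large either.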
10Associations - mining
Only keep those that are contained in some
transaction
11Candidate generation
Ck = apriori-gen(Lk-1)
12Subset function
Subset(Ck, t) finds the candidate itemsets in Ck that are contained in a transaction t. It is done via a tree structure through a series of hashing steps:
- root: hash on every item in t; itemsets not containing anything from t are ignored
- interior node: if you got here by hashing item i of t, hash on all following items of t
- leaf (a set of itemsets): check which itemsets in this leaf are contained in t
13Example
- L3 = {{1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}}
- C4 = {{1 2 3 4}, {1 3 4 5}}
- pruning deletes {1 3 4 5} because {1 4 5} is not in L3.
- See http://www.almaden.ibm.com/u/ragrawal/pubs.html#associations for details
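The example above can be reproduced with a short Python sketch of candidate generation as a join step followed by a prune step. The function name `apriori_gen` and the representation of itemsets as sorted tuples are assumptions made for this illustration.

```python
from itertools import combinations

def apriori_gen(large_prev):
    """Generate candidate k-itemsets from the large (k-1)-itemsets.

    Itemsets are sorted tuples, mirroring the ordered items on the slides.
    """
    if not large_prev:
        return []
    large_prev = sorted(large_prev)
    prev_set = set(large_prev)
    size = len(large_prev[0])  # k-1

    # Join step: merge two itemsets that agree on all but their last item.
    joined = [
        p + (q[-1],)
        for p in large_prev
        for q in large_prev
        if p[:-1] == q[:-1] and p[-1] < q[-1]
    ]

    # Prune step: drop candidates that have a (k-1)-subset which is not large.
    return [
        c for c in joined
        if all(s in prev_set for s in combinations(c, size))
    ]

L3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
print(apriori_gen(L3))  # only (1, 2, 3, 4) survives the prune step
```

The join produces (1, 2, 3, 4) and (1, 3, 4, 5); the prune step then removes (1, 3, 4, 5) because its subset (1, 4, 5) is missing from L3.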
14Lift chart
Population: 100; response rate: 5%. Contacting the 10% with the best chances, we obtain 20% of the 5 who respond, so 1 person. Without a model, 0.5 person. The lift is 2. Oftentimes, cost has to be taken into account for samples of small and large size.
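The arithmetic behind this lift figure can be checked with a small Python sketch. The function name `lift` and its parameter names are assumptions introduced here; the numbers are the ones from the slide.

```python
def lift(population, response_rate, contacted_fraction, captured_fraction):
    """Lift = responders reached with the model / responders reached at random."""
    responders = population * response_rate
    with_model = responders * captured_fraction
    at_random = population * contacted_fraction * response_rate
    return with_model / at_random

# Figures from the slide: 100 people, 5% response rate, contact the best 10%,
# which captures 20% of the responders.
print(lift(100, 0.05, 0.10, 0.20))
```

Algebraically the population and response rate cancel, so the lift reduces to the captured fraction divided by the contacted fraction.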
15Architectures
- data warehouse
- metadata
- middleware
- data mart
- data cube
16Architecture - defs
- data warehouse: several heterogeneous databases that contain data relevant to a given problem (e.g. transactions, customer info, ...)
- metadata: data about the data. Describes the hierarchy of attributes and the logical organization of the data (e.g. customer data consists of the number, name, accounts, ...; accounts is ...)
- the database schema is an example of metadata
- metadata describes the data in the data warehouse from the business perspective
17Architecture - defs
- middleware: software protocol for a single interface to a distributed DW. E.g. standards such as the Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) APIs
- problems (efficiency) when querying
- multitiered approach
- data marts: the data warehouse needed for a given dept.
19Processes: OLAP (On-Line Analytical Processing)
- from de-normalized data (source system, e.g. transactions) to a star topology
- analyzing the reports that are likely to be needed
- the star and the reports define the dimensions
- the dimensions define the cube
20Example
- moviegoers database (de-normalized)
name   sex  age  source          movie name
Amy    f    27   Oberlin         Independence day
Andy   m    34   Oberlin         The Birdcage
Bob    m    51   Pinewoods       Schindlers list
Cathy  f    39   124 Mt. Auburn  The Birdcage
Curt   m    30   MRJ             Judgement day
David  m    40   MRJ             Independence day
Erica  f    23   124 Mt. Auburn  Trainspotting
21[Figure: star topology with a central fact table and dimension tables]
22Typical reports
- # of times each movie was seen, for movies seen > 5 times
- for what movies is the avg age of viewers > 30?
- the # of people and their ages by source
- the # of people from each source by gender
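As a rough illustration, some of these reports can be computed over the seven moviegoer records from the earlier example with plain Python aggregations. The function names and the adjustable thresholds are assumptions made for this sketch; a real OLAP system would answer such queries from precomputed aggregates.

```python
from collections import Counter, defaultdict

# Rows from the moviegoers example: (name, sex, age, source, movie).
rows = [
    ("Amy", "f", 27, "Oberlin", "Independence day"),
    ("Andy", "m", 34, "Oberlin", "The Birdcage"),
    ("Bob", "m", 51, "Pinewoods", "Schindlers list"),
    ("Cathy", "f", 39, "124 Mt. Auburn", "The Birdcage"),
    ("Curt", "m", 30, "MRJ", "Judgement day"),
    ("David", "m", 40, "MRJ", "Independence day"),
    ("Erica", "f", 23, "124 Mt. Auburn", "Trainspotting"),
]

def times_seen(rows, more_than=5):
    """Number of times each movie was seen, for movies seen > `more_than` times."""
    counts = Counter(movie for *_, movie in rows)
    return {m: c for m, c in counts.items() if c > more_than}

def movies_with_avg_age_over(rows, threshold=30):
    """Movies whose viewers have an average age above `threshold`."""
    ages = defaultdict(list)
    for _, _, age, _, movie in rows:
        ages[movie].append(age)
    averages = {m: sum(a) / len(a) for m, a in ages.items()}
    return {m: avg for m, avg in averages.items() if avg > threshold}

def people_by_source_and_gender(rows):
    """Number of people from each source, broken down by gender."""
    return Counter((source, sex) for _, sex, _, source, _ in rows)

print(times_seen(rows, more_than=1))
print(movies_with_avg_age_over(rows))
print(people_by_source_and_gender(rows))
```

With only seven records, no movie is seen more than 5 times, so the first report is shown with a lower threshold.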
23Cube
- is formed by representing the whole database (denormalized) by the dimensions
- the size of the cube does not depend on the number of people
- the cube has subcubes, each containing the key info that identifies it plus the summary aggregate info
- cube = MDD (multi-dimensional database)
- real cubes have more than three dimensions
- each record belongs to exactly one subcube
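A minimal Python sketch of this idea, using the moviegoer records from the earlier example: records are grouped into subcubes keyed by their dimension values, and each subcube stores summary aggregates. The choice of two dimensions (sex and source), the aggregates (count and average age), and the name `build_cube` are assumptions made to keep the example small.

```python
from collections import defaultdict

# Rows from the moviegoers example: (name, sex, age, source, movie).
rows = [
    ("Amy", "f", 27, "Oberlin", "Independence day"),
    ("Andy", "m", 34, "Oberlin", "The Birdcage"),
    ("Bob", "m", 51, "Pinewoods", "Schindlers list"),
    ("Cathy", "f", 39, "124 Mt. Auburn", "The Birdcage"),
    ("Curt", "m", 30, "MRJ", "Judgement day"),
    ("David", "m", 40, "MRJ", "Independence day"),
    ("Erica", "f", 23, "124 Mt. Auburn", "Trainspotting"),
]

def build_cube(rows):
    """Group records into subcubes keyed by the dimensions (sex, source)."""
    members = defaultdict(list)
    for name, sex, age, source, movie in rows:
        members[(sex, source)].append(age)
    # Each subcube is identified by its key and stores summary aggregates.
    return {
        key: {"count": len(ages), "avg_age": sum(ages) / len(ages)}
        for key, ages in members.items()
    }

cube = build_cube(rows)
for key, summary in cube.items():
    print(key, summary)

# Every record falls into exactly one subcube, so the counts add up to len(rows).
assert sum(cell["count"] for cell in cube.values()) == len(rows)
```

This dictionary stores only the non-empty subcubes. The full cube has one cell per combination of dimension values, which is why its size depends on the dimensions and not on the number of people.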
24Tasks
- drilling: looking inside a subcube at the records (of the original database) that are represented in that subcube
- churning/attrition: losing customers.
- can be cast as a classification problem on historical data (two classes: customers who have churned and those who have not)
- then a classification system (e.g. decision tree induction) can induce the classifiers
- fraud detection: learning regular patterns, watching for discrepancies
27Data mining - tools
- IBM Intelligent Miner
- SAS Enterprise Miner
- SGI MineSet
- RuleQuest (you can download it for a trial!)
28Data mining - conclusion
- Treats historical data as an organizational asset, rather than a burden
- tries to
- find out the unknown
- predict the unknown
- applies to
- marketing
- internet mining
- E-commerce
- ...