Title: MINING
1MINING WAREHOUSING (MSS2) BY CHANDRA S.
AMARAVADI
2EXTENSIONS TO DSS
- BI systems (aka EIS)
- Geographical Information Systems (GIS)
- Collaborative Systems (formerly GDSS)
- Expert Systems
- OLAP/Data mining/warehousing
3DATA WAREHOUSES
4DATA WAREHOUSE
A large collection of historical data that is
organized specifically for use in decision
support (i.e. OLAP, data mining)
5DATA WAREHOUSING
USERS
APPLICATIONS
WAREHOUSE
User1
Prodn.
S E R V E R
Mktg.
User2
Fin.
User3
User4
Acctg.
6THE DATA LIFE CYCLE
Dashboard/BI
Internal Data
Data Visualization
Data Marts
OLAP
External Data
Data Warehouse
Decisions
Internal Data
Data Mining
Data Marts
Knowledge
The activities taking place with respect to data
for warehouse/OLAP/mining
7INTEGRATION OF DATA
Data Warehouse
application A m,f
M/F application B
1,0 application C x,y application D male,
female
8CHARACTERISTICS OF WAREHOUSES
Data warehouses have a number of characteristics
- Subject-oriented
- Integrated
- Time-variant
- Non-volatile
9KEY FEATURES OF DATA WAREHOUSES
Subject-oriented A data warehouse is organized
around major subjects, such as customer,
supplier, product, and sales. Integrated A data
warehouse is usually constructed by integrating
data from multiple heterogeneous sources, such as
relational databases, flat files, and on-line
transaction records. Time-variant Data are
stored to provide information from a historical
perspective (e.g., the past 5-10 years). Every
key structure in the data warehouse contains,
either implicitly or explicitly, an element of
time Nonvolatile The data in a warehouse is
permanent.
10DESIGN OF WAREHOUSE
Design of warehouses is similar to databases
- Schema not radically different
- W/H designed primarily for queries
- Database will be converted to cube at run time.
11ORGANIZING THE W/H
STAR SCHEMA Consists of a large central table
and a set of smaller tables, one each for each
dimension. SNOWFLAKE SCHEMA A variant of the
star schema, Where some dimension tables are
normalized, thereby splitting the data into
additional tables. CONSTELLATION SCHEMA A
collection of stars.
12STAR SCHEMA OF A DATA WAREHOUSE (FYI)
BRANCH
ACCOUNTS
TRANSACTIONS
What are the dimensions here?
13SNOWFLAKE SCHEMA OF A DATA WAREHOUSE (FYI)
BRANCH
ACCOUNTS
TRANSACTIONS
CREDIT
DEBIT
14CONSTELLATION SCHEMA OF A DATA WAREHOUSE FOR
SALES AND SHIPPING (FYI)
DEPARTMENTS
BRANCH
EMPLOYEES
ACCOUNTS
TRANSACTIONS
CREDIT
DEBIT
15EXAMPLE OF WAREHOUSE MARTS
Weekly sales by region 2007-2013
Sales by Product Line 2010-2013
Weekly sales by state 2005-2013
Weekly sales by product 2008-2013
Data Mart
Data Mart
Warehouse Sales Detail
Sales Detail 2003-2013
16OLAP DATA WAREHOUSES
17ONLINE ANALYTICAL PROCESSING
OLAP Tools to analyze data in a warehouse for
decision support. How many light bulbs sold in
December?
- Aggregation (data summarization)
- Dimension
- Slice Dice
- Cube organization (Multi-dimensional database)
18AGGREGATION CUBE
Region
Units
NE
SE
SW
North. E 40 South E.
20 South W. 30 Midwest
20 South W.
50 North E. 65
65
20
30
40
50
sales in the Northern region?
19DIMENSIONS CONCEPT HIERARCHIES
A dimension is an aspect of the data, it is a
characteristic of a variable such as location,
for sales variable. Dimensions can have
hierarchies (or various levels of aggregations) A
concept hierarchy defines a sequence of mappings
from a set of low-level concepts to higher-level,
more general concepts
20CONCEPT HIERARCHY
Mfr. dimension
iPhone5
iPhone4
150,600
GALII
GAL
FX
G
G2
13,600
30,000
15,067
18,240
51,326
21MULTI-DIMENSIONAL ORGANIZATION
Sales, costs etc.
nw
Products (tables, desks, lamps..)
sw
Regions
Cube organization supports slice dice
22MULTI-DIMENSIONAL ORGANIZATION..
LG MW SW SE NW NE TOTAL
G 3,780 4,893 7,494 6,520 2,450 25,137
G2 2,342 1,200 1,400 1,678 950 7,570
FX 3,780 7,893 5,647 6,493 7,839 31,652
TOTAL 9,902 13,986 14,541 14,691 11,239 64,359
LG MW SW SE NW NE TOTAL
G 3,780 4,893 7,494 6,520 2,450 25,137
G2 2,342 1,200 1,400 1,678 950 7,570
FX 3,780 7,893 5,647 6,493 7,839 31,652
TOTAL 9,902 13,986 14,541 14,691 11,239 64,359
March sales
LG MW SW SE NW NE TOTAL
G 3,780 4,893 7,494 6,520 2,450 25,137
G2 2,342 1,200 1,400 1,678 950 7,570
FX 3,780 7,893 5,647 6,493 7,839 31,652
TOTAL 9,902 13,986 14,541 14,691 11,239 64,359
February sales
January sales
shows multi-dimensional/cube organization
23CUBE ORGANIZATION
- Data from warehouse imported into memory
- A sophisticated 3D representation is created
- Referred to as sparse matrix
- Sides of cube are dimensions
- Allows slice dice
- Answers to high level queries/reports
24DATA MINING
25DATA MINING
Application of statistical and AI techniques to
identify patterns that exist in large databases
but are hidden in the vast amounts of data.
e.g. sequence/association, classification, and
clustering
- Buyer non-buyer
- Success/failure of an operation
- Customers who spend more
- Preferences of target groups
26SOME DATA MINING APPLICATIONS
- Predicting the probability of default for
consumer loans - Predicting audience response to TV advertisements
- Predicting the probability that a cancer patient
will respond to radiation therapy. - Predicting the probability that an offshore well
will produce oil - A phone company mined data from 140 million
households, each with as many as 10,000
attributes, including life-style and calling
habits. Identified 22 profiles
27DATA MINING AS A STEP IN THE PROCESS OF
KNOWLEDGE DISCOVERY
Evaluation and Presentation
Knowledge
Data Mining
Selection and Transformation
Patterns
Data Warehouse
Cleaning and Integration
Database
Flat files
28DATA MINING PROCESS
Cleaning integration data is brought in from
multiple sources Selection transformation
sometimes called dimensionality reduction, it is
concerned with selection of dimensions and
sometimes the raw data needs to be transformed to
suit the problem e.g. calculate margin. Data
mining - process of extracting data patterns,
using statistical or AI techniques. Pattern
evaluation - identifying patterns useful and
relevant to the organizational context.
Knowledge presentation -- Visualization and
knowledge representation techniques are used to
present the mined knowledge to the user.
29OLAP, MINING WAREHOUSING...
Data warehousing refers to the use of high
speed/high capacity servers to store historical
transaction information and to make this
information accessible to decision makers.
OLAP is used to perform high level analysis of
data based on data summarization (aggregation)
and slice and dice operations. For e.g. how many
shoes sold in midwest in Feb? Data mining refers
to identification of patterns from data.
30DATA MINING TECHNIQUES
31TYPES OF ANALYSES
Sequence -- Activities occurring one after
another e.g. loan after buying car,
warranty. Association -- (AKA Market Basket
Analysis) Activities
which occur together (e.g. bread and
meat) Classification -- Identifying profiles of
data classified into
pre-defined groups (frequent infrequent
shoppers) Clustering --
Identifying natural characteristics of data
(what major areas are cust.
coming from?)
32SEQUENTIAL ANALYSIS
33SEQUENTIAL ANALYSIS
Predict purchasing
- 40 of customers buy a gray skirt six months
after buying a red jacket - Identified by time-series analysis, Neural
networks or genetic algorithms
34IDENTIFYING SEQUENTIAL PATTERNS..
applications in forecasting exchange rates, meat
consumption , bankruptcies etc.
35ASSOCIATIVE ANALYSIS
36ASSOCIATION
- Identifies items purchased together
- Customers who bought items A, B, C also buy Y
- 85 percent of customers who buy a certain wine
brand also buy a certain type of pasta - Many men buy a six-pack when they purchase
diapers - Enhances product placement in stores
37ANALYSIS TECHNIQUES ASSOCIATION
Algorithm A-priori
TID Items
100 A C D
200 B C E
300 A B C E
400 B E
- list all items 1 item set (C1)
- filter by min. transaction support (L1)
- identify 2-item sets (L1 L1) (C2 )
- filter and generate L2
- repeat process...
Database
Min. transaction support is the number (sometimes
given as ) of transactions in which the item
must occur.
38A-PRIORI ALGORITHM
TID Items
100 A C D
200 B C E
300 A B C E
400 B E
Itemset Sup.
A 2
B 3
C 3
D 1
E 3
Itemset Sup.
A 2
B 3
C 3
E 3
minimum Support 2
Scan Database
C1
L1
Itemset
A B
A C
A E
B C
B E
C E
Itemset Sup.
A B 1
A C 2
A E 1
B C 2
B E 3
C E 2
Itemset Sup.
A C 2
B C 2
B E 3
C E 2
Scan Database
minimum Support 2
L2
C2
C2
Scan Database
Itemset Sup.
B C E 2
Itemset
B C E
Itemset Sup.
B C E 2
L3
C3
C3
39DISCUSSION
Apply associative rule mining (Use A-priori
algorithm) to the following portfolios of clients
of a brokerage company, to identify stocks that
are purchased together. Use a minimum support of
two.
Client1 AAPL, ORCL, WFC, BAC
Client2 BAC, WTU
Client3 GOOG, IBM, C
Client4 WTU, WFC, C, IBM
40CLASSIFICATION ANALYSIS
41CLASSIFICATION
A technique for grouping data into pre-defined
classes using certain attributes of the data.
E.g. defaulter or not, cruise customer, 4G
subscriber or not etc.
- Define classes
- Identify classification criteria
- Select suitable technique
- application
42CLASSIFICATION TECHNIQUES
- Discriminant analysis (categorical membership)
- Bayesian classification
- Logistic regression (probabilistic membership)
- Neural networks
43DISCRIMINANT ANALYSIS
A method of classification that uses a
Discriminant Function to decide classes
- DF based on attributes and weights.
- Calculated as a weighted average of attributes
and weights (of training data) (Zave score) - E.g. (GMAT 200 UGPA) gt 1200
- Given a new set of data, calculate Z score
- Compare with Zav
DF Discriminant Function
a simplified version
44CLASSIFICATION WITH BAYESIAN ANALYSIS
45REFRESHER ON PROBABILITY
Probability is the chance that an event/outcome
will take place.
- E.g. Cubs had 43 wins 36 losses in one season
- Probability of winning?
46PRIOR (CONDITIONAL) PROBABILITIES
Prior probabilities are knowledge of other events
which may help improve predictions
- E.g. p(IPO success) 0.33
- Suppose we know that a big company behind IPO
- p(IPO success/big company) 0.99
Which is higher?
P(successful cellphone call) or P(successful
call/subscriber in service area)
47CONDITIONAL PROBABILITIES..
College of students
CBT 1,800
COAS 2,300
COFAC 1,200
COE 6,700
Total 12,000
If we see a student in the union and he/she is a
WIU student What is the probability he/she is a)
CBT?, b) COAS?, c) COFAC?, d) COE?
48BAYESIAN ANALYSIS
Bayes theorem can be exploited for classification
- Bayes theorem
- P(A/B) P(B/A) x P(A)/P(B)
- Classify into class I or II based on conditional
probability - P(class I/x) gt P(class II/x)?
- E.g. wiu student in union, which college are
they?
A method for classifying objects/events into
classes based on probabilities of occurrence of
the objects/events
x is some condition e.g. surgery or being a
shopper in a retail chain
49INTRODUCTION TO BAYESIAN ANALYSIS
We are interested in p(person becoming a
manager/mba) How can we use Bayes theorem?
ASSUME
- 300 m population 100 m employees
- 500,000 are managers
- 10,000 managers go to college for an MBA
- 20 m go to college
- 100,000 do MBA
you need to write formula using terms from the
problem
50THIS SLIDE IS INTENTIONALLY BLANK
51BAYESIAN ANALYSIS - EXAMPLE
PROBLEM
- An observer has collected information about
- Eagles Hawks for a long time. If a new bird
- is spotted with a certain wingspan (x), need to
know - whether eagle or hawk
From Data Mining Methods for Knowledge
Discovery by K.Cios, W.Pedrycz, R.Swiniarski
52BAYESIAN ANALYSIS - EXAMPLE
CLASSIFICATION PROBLEM
- Compare p(eagle/x) and p(hawk/x)
- Whichever is higher ? unknown bird
- We dont know p(eagle/x) or p(hawk/x)
53BAYESIAN ANALYSIS - EXAMPLE
Bayes theorem
Decision rule?
54PROBABILITY DENSITY FUNCTION
Shows from observations of birds, the
probability of a bird having a particular wingspan
P(xeagle), P(x hawk)
55PROBABILITY OF EAGLE OR HAWK
N Number of birds
neagle nhawk
0.8
0.2
56USING BAYES THEOREM
New birds size 45 cm
(from known probability density functions)
p(45eagle) 2.22 x 10-2 p(45hawk) 1.10 x 10-2
2.22 x 10-2 x 0.8 vs 1.10 x 10-2 x 0.2
0.01776 gt 0.0021
Decision rule predicts eagle
57CLUSTERING
58DATA MINING FUNCTIONS
Clustering
- Goal is to identify natural groupings of data.
applicationications in market segmentation,
discovering affinity groups, and defect analysis
- based on euclidean distance
- similar to centroid
- trial error approach
- need to guess of clusters
59CLUSTER ANALYSIS OF CUSTOMER LOCATIONS IN A CITY
60CUSTOMERS ARE CLUSTERED INTO FOUR SEGMENTS
Income Medium Children 2
Income High Children 1 Car Luxury
Income Medium Children3
Car Sedan and Car Truck
Cluster 4
Cluster 1
Cluster 3
Income Low Children 0 Car Compact
Cluster 2
61A SIMPLE EXAMPLE
ACCT MONTHLY SPENDING
109905 200
112343 30
124205 1,510
125687 11,078
130984 341
131987 543
140001 178
140387 2,000
150933 1,750
167912 230
62(No Transcript)
63(No Transcript)
64DISCUSSION QUESTIONS
- What kind of data is stored in a data warehouse?
- What are the characteristics of a data warehouse?
- How is data organized in a warehouse?
- What tool(s) can we use to analyze data in a
warehouse? - What type of multidimensional organization of
data is also called a sparse matrix? - What is a concept hierarchy?
- What technique can you use to predict the
probability that a cancer patient will respond to
radiation therapy? - What type of mining identifies natural
characteristics of data? - What technique is useful for predicting trends?
- What is minimum transaction support?
- How is Bayes theorem used for classification?