MINING - PowerPoint PPT Presentation

About This Presentation

Title:

MINING

Description:

DATA WAREHOUSE. A large collection of historical data that is organized specifically for use in decision support (i.e. OLAP, data mining)

Slides: 65

Provided by: wiu62

Transcript and Presenter's Notes

1
MINING & WAREHOUSING (MSS2) BY CHANDRA S. AMARAVADI
2
EXTENSIONS TO DSS

BI systems (aka EIS)
Geographical Information Systems (GIS)
Collaborative Systems (formerly GDSS)
Expert Systems
OLAP/Data mining/warehousing

3
DATA WAREHOUSES
4
DATA WAREHOUSE
A large collection of historical data that is
organized specifically for use in decision
support (i.e. OLAP, data mining)
5
DATA WAREHOUSING
[Figure: data warehousing; labels: USERS (User1 to User4), SERVER, WAREHOUSE, APPLICATIONS (Prodn., Mktg., Fin., Acctg.)]
6
THE DATA LIFE CYCLE
[Figure: data life cycle; labels: Internal Data, External Data, Data Warehouse, Data Marts, OLAP, Data Visualization, Dashboard/BI, Decisions, Data Mining, Knowledge]
The activities taking place with respect to data
for warehouse/OLAP/mining
7
INTEGRATION OF DATA
application A: m, f
application B: 1, 0
application C: x, y
application D: male, female
Data Warehouse (integrated): M/F
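A minimal sketch of this kind of integration, assuming the four applications encode gender as m/f, 1/0, x/y and male/female and that the warehouse stores a single M/F code. Which of 1/0 and x/y stands for male is an assumption made here for illustration, and the names `SOURCE_ENCODINGS` and `integrate` are hypothetical.

```python
# Hypothetical source encodings, one mapping per application on the slide.
SOURCE_ENCODINGS = {
    "A": {"m": "M", "f": "F"},
    "B": {"1": "M", "0": "F"},        # assumption: 1 = male, 0 = female
    "C": {"x": "M", "y": "F"},        # assumption: x = male, y = female
    "D": {"male": "M", "female": "F"},
}


def integrate(application, raw_value):
    """Translate one application's code into the warehouse's M/F code."""
    mapping = SOURCE_ENCODINGS[application]
    return mapping[str(raw_value).strip().lower()]


# Hypothetical records arriving from the four applications.
records = [("A", "f"), ("B", 1), ("C", "y"), ("D", "Male")]
integrated = [integrate(app, value) for app, value in records]
print(integrated)
```

Every source value ends up in the same encoding, which is what "integrated" means for a warehouse.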
8
CHARACTERISTICS OF WAREHOUSES
Data warehouses have a number of characteristics

Subject-oriented
Integrated
Time-variant
Non-volatile

9
KEY FEATURES OF DATA WAREHOUSES
Subject-oriented: A data warehouse is organized around major subjects, such as customer, supplier, product, and sales.
Integrated: A data warehouse is usually constructed by integrating data from multiple heterogeneous sources, such as relational databases, flat files, and on-line transaction records.
Time-variant: Data are stored to provide information from a historical perspective (e.g., the past 5-10 years). Every key structure in the data warehouse contains, either implicitly or explicitly, an element of time.
Nonvolatile: The data in a warehouse is permanent.
10
DESIGN OF WAREHOUSE
Design of warehouses is similar to that of databases

Schema not radically different
W/H designed primarily for queries
Database will be converted to cube at run time.

11
ORGANIZING THE W/H
STAR SCHEMA: Consists of a large central table and a set of smaller tables, one for each dimension.
SNOWFLAKE SCHEMA: A variant of the star schema, where some dimension tables are normalized, thereby splitting the data into additional tables.
CONSTELLATION SCHEMA: A collection of stars.
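A small sketch of a star schema, using Python's built-in sqlite3 module. It assumes TRANSACTIONS is the central table and BRANCH and ACCOUNTS are dimension tables; the column names and rows are hypothetical, chosen only to make the query runnable.

```python
import sqlite3

# Assumption: TRANSACTIONS is the central (fact) table; BRANCH and ACCOUNTS
# are the dimension tables. All columns and rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE branch   (branch_id  INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, kind TEXT);
    CREATE TABLE transactions (
        txn_id     INTEGER PRIMARY KEY,
        branch_id  INTEGER REFERENCES branch(branch_id),
        account_id INTEGER REFERENCES accounts(account_id),
        amount     REAL
    );
    INSERT INTO branch   VALUES (1, 'North'), (2, 'South');
    INSERT INTO accounts VALUES (1, 'Checking'), (2, 'Savings');
    INSERT INTO transactions VALUES
        (1, 1, 1, 100.0), (2, 1, 2, 50.0), (3, 2, 1, 25.0);
""")

# A typical warehouse query: join the central table to one dimension
# and aggregate the measure.
totals = conn.execute("""
    SELECT b.name, SUM(t.amount)
    FROM transactions AS t
    JOIN branch AS b ON b.branch_id = t.branch_id
    GROUP BY b.name
    ORDER BY b.name
""").fetchall()
print(totals)  # [('North', 150.0), ('South', 25.0)]
```

A snowflake variant would split a dimension further, for example moving a branch's city into its own table referenced from `branch`.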
12
STAR SCHEMA OF A DATA WAREHOUSE (FYI)
[Figure: star schema with tables BRANCH, ACCOUNTS, TRANSACTIONS]
What are the dimensions here?
13
SNOWFLAKE SCHEMA OF A DATA WAREHOUSE (FYI)
[Figure: snowflake schema with tables BRANCH, ACCOUNTS, TRANSACTIONS, CREDIT, DEBIT]
14
CONSTELLATION SCHEMA OF A DATA WAREHOUSE FOR
SALES AND SHIPPING (FYI)
[Figure: constellation schema with tables DEPARTMENTS, BRANCH, EMPLOYEES, ACCOUNTS, TRANSACTIONS, CREDIT, DEBIT]
15
EXAMPLE OF WAREHOUSE MARTS
Warehouse: Sales Detail 2003-2013
Data Marts:
Weekly sales by region 2007-2013
Sales by Product Line 2010-2013
Weekly sales by state 2005-2013
Weekly sales by product 2008-2013
16
OLAP DATA WAREHOUSES
17
ONLINE ANALYTICAL PROCESSING
OLAP: Tools to analyze data in a warehouse for
decision support, e.g. how many light bulbs were sold in
December?

Aggregation (data summarization)
Dimension
Slice & Dice
Cube organization (Multi-dimensional database)

18
AGGREGATION CUBE
Region     Units
North E.   40
South E.   20
South W.   30
Midwest    20
South W.   50
North E.   65
[Figure: aggregation cube with regions NE, SE, SW and cell values 65, 20, 30, 40, 50]
Sales in the Northern region?
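A minimal sketch of aggregation, reading the slide's detail data as (region, units) pairs in the order they appear in the transcript; that pairing is a reconstruction, not something the transcript states outright.

```python
from collections import defaultdict

# Detail rows as read from the slide (pairing reconstructed from the transcript).
detail_rows = [
    ("North E.", 40), ("South E.", 20), ("South W.", 30),
    ("Midwest", 20), ("South W.", 50), ("North E.", 65),
]

# Aggregation: summarize units per region.
units_by_region = defaultdict(int)
for region, units in detail_rows:
    units_by_region[region] += units

# Answer the slide's question by summing every region whose name starts with "North".
northern_units = sum(
    units for region, units in units_by_region.items() if region.startswith("North")
)
print(dict(units_by_region))
print(northern_units)  # 105
```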
19
DIMENSIONS CONCEPT HIERARCHIES
A dimension is an aspect of the data: a characteristic of a variable, such as location for a sales variable.
Dimensions can have hierarchies (or various levels of aggregation).
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts.
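As a rough illustration, a concept hierarchy on a location dimension can be written as a chain of mappings (city to state, state to region), and rolling up means re-aggregating along one mapping. The cities, states, regions and sales figures below are all hypothetical.

```python
# Hypothetical concept hierarchy for a location dimension: city -> state -> region.
CITY_TO_STATE = {"Chicago": "IL", "Peoria": "IL", "Des Moines": "IA", "Austin": "TX"}
STATE_TO_REGION = {"IL": "Midwest", "IA": "Midwest", "TX": "South W."}

# Hypothetical sales at the lowest level of the hierarchy.
sales_by_city = {"Chicago": 120, "Peoria": 30, "Des Moines": 45, "Austin": 80}


def roll_up(values, mapping):
    """Aggregate values one level up the hierarchy described by `mapping`."""
    totals = {}
    for key, amount in values.items():
        parent = mapping[key]
        totals[parent] = totals.get(parent, 0) + amount
    return totals


sales_by_state = roll_up(sales_by_city, CITY_TO_STATE)
sales_by_region = roll_up(sales_by_state, STATE_TO_REGION)
print(sales_by_state)   # {'IL': 150, 'IA': 45, 'TX': 80}
print(sales_by_region)  # {'Midwest': 195, 'South W.': 80}
```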
20
CONCEPT HIERARCHY
[Figure: concept hierarchy for the Mfr. dimension; models shown: iPhone5, iPhone4, GALII, GAL, FX, G, G2; values shown: 150,600; 13,600; 30,000; 15,067; 18,240; 51,326]
21
MULTI-DIMENSIONAL ORGANIZATION
[Figure: cube with cells holding Sales, costs etc.; axes: Products (tables, desks, lamps..) and Regions (nw, sw, ..)]
Cube organization supports slice & dice
22
MULTI-DIMENSIONAL ORGANIZATION..

LG      MW      SW       SE       NW       NE       TOTAL
G       3,780   4,893    7,494    6,520    2,450    25,137
G2      2,342   1,200    1,400    1,678    950      7,570
FX      3,780   7,893    5,647    6,493    7,839    31,652
TOTAL   9,902   13,986   14,541   14,691   11,239   64,359

The slide stacks three such tables, labeled January sales, February sales and March sales (the transcript repeats the same figures for each), to show multi-dimensional/cube organization.
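A sketch of slice and dice on a small cube built from the table above, with dimensions month, model and region. Because the transcript shows the same figures for January, February and March, the sketch reuses one table for all three months; `slice_cube` and `dice_cube` are hypothetical helper names.

```python
REGIONS = ["MW", "SW", "SE", "NW", "NE"]
MONTHS = ["January", "February", "March"]

# One month's table from the slide (units per model and region).
MONTH_TABLE = {
    "G":  [3780, 4893, 7494, 6520, 2450],
    "G2": [2342, 1200, 1400, 1678, 950],
    "FX": [3780, 7893, 5647, 6493, 7839],
}

# The cube: one cell per (month, model, region).
# Assumption: every month reuses the same table, as in the transcript.
cube = {
    (month, model, region): units
    for month in MONTHS
    for model, row in MONTH_TABLE.items()
    for region, units in zip(REGIONS, row)
}


def slice_cube(cube, month):
    """Slice: fix one dimension (month) and keep a 2-D table."""
    return {(model, region): u for (m, model, region), u in cube.items() if m == month}


def dice_cube(cube, months, models, regions):
    """Dice: keep a sub-cube by restricting several dimensions at once."""
    return {
        key: u for key, u in cube.items()
        if key[0] in months and key[1] in models and key[2] in regions
    }


march = slice_cube(cube, "March")
march_total = sum(march.values())
sub_cube = dice_cube(cube, {"January", "February"}, {"G", "FX"}, {"SW", "SE"})
sub_total = sum(sub_cube.values())
print(march_total)  # 64359, the TOTAL cell of the table
print(sub_total)    # 51854
```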
23
CUBE ORGANIZATION

Data from warehouse imported into memory
A sophisticated 3D representation is created
Referred to as sparse matrix
Sides of cube are dimensions
Allows slice & dice
Answers to high level queries/reports

24
DATA MINING
25
DATA MINING
Application of statistical and AI techniques to
identify patterns that exist in large databases
but are hidden in the vast amounts of data.
e.g. sequence/association, classification, and
clustering

Buyer/non-buyer
Success/failure of an operation
Customers who spend more
Preferences of target groups

26
SOME DATA MINING APPLICATIONS

Predicting the probability of default for
consumer loans
Predicting audience response to TV advertisements
Predicting the probability that a cancer patient
will respond to radiation therapy.
Predicting the probability that an offshore well
will produce oil
A phone company mined data from 140 million
households, each with as many as 10,000
attributes, including life-style and calling
habits. It identified 22 profiles.

27
DATA MINING AS A STEP IN THE PROCESS OF
KNOWLEDGE DISCOVERY
[Figure: Databases and flat files -> Cleaning and Integration -> Data Warehouse -> Selection and Transformation -> Data Mining -> Patterns -> Evaluation and Presentation -> Knowledge]
28
DATA MINING PROCESS
Cleaning & integration: data is brought in from multiple sources.
Selection & transformation: sometimes called dimensionality reduction, it is concerned with the selection of dimensions; sometimes the raw data needs to be transformed to suit the problem, e.g. calculate margin.
Data mining: the process of extracting data patterns, using statistical or AI techniques.
Pattern evaluation: identifying patterns that are useful and relevant to the organizational context.
Knowledge presentation: visualization and knowledge representation techniques are used to present the mined knowledge to the user.
29
OLAP, MINING & WAREHOUSING...
Data warehousing refers to the use of high
speed/high capacity servers to store historical
transaction information and to make this
information accessible to decision makers.
OLAP is used to perform high level analysis of
data based on data summarization (aggregation)
and slice and dice operations, e.g. how many
shoes were sold in the Midwest in Feb?
Data mining refers to identification of patterns from data.
30
DATA MINING TECHNIQUES
31
TYPES OF ANALYSES
Sequence -- Activities occurring one after another, e.g. loan after buying car, warranty.
Association -- (AKA Market Basket Analysis) Activities which occur together (e.g. bread and meat)
Classification -- Identifying profiles of data classified into pre-defined groups (frequent & infrequent shoppers)
Clustering -- Identifying natural characteristics of data (what major areas are cust. coming from?)
32
SEQUENTIAL ANALYSIS
33
SEQUENTIAL ANALYSIS
Predict purchasing

40% of customers buy a gray skirt six months
after buying a red jacket
Identified by time-series analysis, neural
networks or genetic algorithms

34
IDENTIFYING SEQUENTIAL PATTERNS..
Applications in forecasting exchange rates, meat
consumption, bankruptcies etc.
35
ASSOCIATIVE ANALYSIS
36
ASSOCIATION

Identifies items purchased together
Customers who bought items A, B, C also buy Y
85 percent of customers who buy a certain wine
brand also buy a certain type of pasta
Many men buy a six-pack when they purchase
diapers
Enhances product placement in stores

37
ANALYSIS TECHNIQUES: ASSOCIATION
Algorithm: A-priori

Database
TID   Items
100   A C D
200   B C E
300   A B C E
400   B E

list all items: 1-item sets (C1)
filter by min. transaction support (L1)
identify 2-item sets by joining L1 with L1 (C2)
filter and generate L2
repeat process...

Min. transaction support is the number (sometimes
given as %) of transactions in which the item
must occur.
38
A-PRIORI ALGORITHM
Database
TID   Items
100   A C D
200   B C E
300   A B C E
400   B E

Scan Database -> C1
Itemset   Sup.
A         2
B         3
C         3
D         1
E         3

Minimum support 2 -> L1
Itemset   Sup.
A         2
B         3
C         3
E         3

C2 (candidates from L1)
Itemset
A B
A C
A E
B C
B E
C E

Scan Database -> C2
Itemset   Sup.
A B       1
A C       2
A E       1
B C       2
B E       3
C E       2

Minimum support 2 -> L2
Itemset   Sup.
A C       2
B C       2
B E       3
C E       2

C3 (candidates from L2)
Itemset
B C E

Scan Database -> C3
Itemset   Sup.
B C E     2

L3
Itemset   Sup.
B C E     2
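The walkthrough above can be sketched in a few lines of Python. This is a simplified reading of A-priori (candidate generation by set union plus a subset check), not a tuned implementation; the function name `apriori` and the dictionary layout are choices made here for illustration.

```python
from itertools import combinations

# The slide's database: transaction id -> items.
DATABASE = {
    100: {"A", "C", "D"},
    200: {"B", "C", "E"},
    300: {"A", "B", "C", "E"},
    400: {"B", "E"},
}


def apriori(transactions, min_support):
    """Return [L1, L2, ...], each a dict mapping a frequent itemset to its support."""

    def support(itemset):
        # Scan the database: count the transactions containing the itemset.
        return sum(1 for items in transactions.values() if itemset <= items)

    all_items = sorted({item for items in transactions.values() for item in items})
    candidates = [frozenset([item]) for item in all_items]  # C1
    levels = []
    k = 1
    while candidates:
        counted = {c: support(c) for c in candidates}                        # Ck
        frequent = {c: s for c, s in counted.items() if s >= min_support}   # Lk
        if not frequent:
            break
        levels.append(frequent)
        k += 1
        # Join Lk-1 with itself, keeping only unions of size k ...
        unions = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k}
        # ... and prune candidates that have an infrequent (k-1)-subset.
        candidates = [
            u for u in unions
            if all(frozenset(s) in frequent for s in combinations(u, k - 1))
        ]
    return levels


levels = apriori(DATABASE, min_support=2)
for size, level in enumerate(levels, start=1):
    readable = {" ".join(sorted(s)): n for s, n in level.items()}
    print(f"L{size}: {readable}")
```

With minimum support 2 this yields the same L1, L2 and L3 as the slide, ending with the single 3-item set B C E (support 2).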
39
DISCUSSION
Apply association rule mining (use the A-priori
algorithm) to the following portfolios of clients
of a brokerage company, to identify stocks that
are purchased together. Use a minimum support of
two.
Client1: AAPL, ORCL, WFC, BAC
Client2: BAC, WTU
Client3: GOOG, IBM, C
Client4: WTU, WFC, C, IBM
40
CLASSIFICATION ANALYSIS
41
CLASSIFICATION
A technique for grouping data into pre-defined
classes using certain attributes of the data.
E.g. defaulter or not, cruise customer, 4G
subscriber or not etc.

Define classes
Identify classification criteria
Select suitable technique
application

42
CLASSIFICATION TECHNIQUES

Discriminant analysis (categorical membership)
Bayesian classification
Logistic regression (probabilistic membership)
Neural networks

43
DISCRIMINANT ANALYSIS
A method of classification that uses a
Discriminant Function to decide classes

DF based on attributes and weights.
Calculated as a weighted average of attributes
and weights (of training data) (Zave score)
E.g. (GMAT + 200 x UGPA) > 1200
Given a new set of data, calculate Z score
Compare with Zave

DF = Discriminant Function
(a simplified version)
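A minimal sketch of the simplified rule, reading the slide's example as GMAT + 200 x UGPA > 1200. The operators are an assumption, since the transcript dropped them. The class labels "class I" and "class II" and the sample applicants are hypothetical.

```python
Z_CUTOFF = 1200  # cutoff taken from the slide's example


def z_score(gmat, ugpa, w_gmat=1.0, w_ugpa=200.0):
    """Discriminant function: a weighted sum of the attributes.

    The weights 1 and 200 follow the assumed reading GMAT + 200 x UGPA.
    """
    return w_gmat * gmat + w_ugpa * ugpa


def classify(gmat, ugpa):
    """Assign a class by comparing the Z score with the cutoff."""
    return "class I" if z_score(gmat, ugpa) > Z_CUTOFF else "class II"


# Hypothetical applicants.
print(z_score(650, 3.2), classify(650, 3.2))  # 1290.0 class I
print(z_score(500, 3.0), classify(500, 3.0))  # 1100.0 class II
```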
44
CLASSIFICATION WITH BAYESIAN ANALYSIS
45
REFRESHER ON PROBABILITY
Probability is the chance that an event/outcome
will take place.

E.g. Cubs had 43 wins and 36 losses in one season
Probability of winning?
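One way to answer, assuming the probability is estimated as the relative frequency of wins over the games played:

```python
wins, losses = 43, 36

# Relative-frequency estimate: wins divided by total games.
p_win = wins / (wins + losses)
print(round(p_win, 3))  # 0.544
```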

46
PRIOR (CONDITIONAL) PROBABILITIES
Prior probabilities are knowledge of other events
which may help improve predictions

E.g. p(IPO success) = 0.33
Suppose we know that a big company is behind the IPO
p(IPO success/big company) = 0.99

Which is higher?
P(successful cellphone call) or P(successful
call/subscriber in service area)
47
CONDITIONAL PROBABILITIES..
College   # of students
CBT       1,800
COAS      2,300
COFAC     1,200
COE       6,700
Total     12,000
If we see a student in the union and he/she is a
WIU student, what is the probability he/she is in a)
CBT?, b) COAS?, c) COFAC?, d) COE?
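A short sketch of the calculation, assuming the student seen in the union is equally likely to be any one of the 12,000 enrolled students:

```python
# Enrollment per college, from the slide.
enrollment = {"CBT": 1800, "COAS": 2300, "COFAC": 1200, "COE": 6700}
total_students = sum(enrollment.values())

# Probability of each college, given that the person is a WIU student.
p_college = {college: n / total_students for college, n in enrollment.items()}
for college, p in p_college.items():
    print(f"p({college}/WIU student) = {p:.3f}")
```

Under that assumption the four probabilities come to 0.150, 0.192, 0.100 and 0.558.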
48
BAYESIAN ANALYSIS
Bayes theorem can be exploited for classification

Bayes theorem
P(A/B) = P(B/A) x P(A) / P(B)
Classify into class I or II based on conditional
probability
P(class I/x) > P(class II/x)?
E.g. WIU student in union, which college are
they?

A method for classifying objects/events into
classes based on probabilities of occurrence of
the objects/events
x is some condition e.g. surgery or being a
shopper in a retail chain
49
INTRODUCTION TO BAYESIAN ANALYSIS
We are interested in p(person becoming a
manager/MBA). How can we use Bayes theorem?
ASSUME

300 m population, 100 m employees
500,000 are managers
10,000 managers go to college for an MBA
20 m go to college
100,000 do MBA

You need to write the formula using terms from the
problem
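One possible way to set this up, offered as a sketch rather than the intended solution. It assumes the 100 m employees are the base population for every probability (the base cancels out of the final ratio) and reads "10,000 managers go to college for an MBA" as the number of managers who hold an MBA.

```python
employees = 100_000_000      # assumed base population; it cancels in the result
managers = 500_000
mba_holders = 100_000
managers_with_mba = 10_000   # assumed reading of "10,000 managers go ... for an MBA"

p_manager = managers / employees                    # P(manager)
p_mba = mba_holders / employees                     # P(MBA)
p_mba_given_manager = managers_with_mba / managers  # P(MBA/manager)

# Bayes theorem: P(manager/MBA) = P(MBA/manager) x P(manager) / P(MBA)
p_manager_given_mba = p_mba_given_manager * p_manager / p_mba
print(round(p_manager_given_mba, 3))  # 0.1
```

Under this reading the result equals 10,000 / 100,000, the share of MBA holders who are managers.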
50
THIS SLIDE IS INTENTIONALLY BLANK
51
BAYESIAN ANALYSIS - EXAMPLE
PROBLEM

An observer has collected information about
Eagles & Hawks for a long time. If a new bird
is spotted with a certain wingspan (x), we need to
know whether it is an eagle or a hawk

From "Data Mining Methods for Knowledge
Discovery" by K. Cios, W. Pedrycz, R. Swiniarski
52
BAYESIAN ANALYSIS - EXAMPLE
CLASSIFICATION PROBLEM

Compare p(eagle/x) and p(hawk/x)
Whichever is higher -> unknown bird
We don't know p(eagle/x) or p(hawk/x)

53
BAYESIAN ANALYSIS - EXAMPLE
Bayes theorem
Decision rule?
54
PROBABILITY DENSITY FUNCTION
Shows, from observations of birds, the
probability of a bird having a particular wingspan
P(x|eagle), P(x|hawk)
55
PROBABILITY OF EAGLE OR HAWK
N = Number of birds
P(eagle) = n_eagle / N = 0.8
P(hawk) = n_hawk / N = 0.2
56
USING BAYES THEOREM
New bird's size: 45 cm
(from known probability density functions)
p(45|eagle) = 2.22 x 10^-2, p(45|hawk) = 1.10 x 10^-2
2.22 x 10^-2 x 0.8 vs 1.10 x 10^-2 x 0.2
0.01776 > 0.0022
Decision rule predicts eagle
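The comparison can be sketched as follows, using the slide's priors (0.8 for eagle, 0.2 for hawk) and its density values at 45 cm. The normalization step at the end is an addition for illustration; the decision rule itself only needs the two products.

```python
# Priors and likelihoods at x = 45 cm, taken from the slides.
priors = {"eagle": 0.8, "hawk": 0.2}
likelihood_45cm = {"eagle": 2.22e-2, "hawk": 1.10e-2}

# Numerator of Bayes theorem for each class: p(x/class) x P(class).
scores = {bird: likelihood_45cm[bird] * priors[bird] for bird in priors}

# Decision rule: pick the class with the larger score.
prediction = max(scores, key=scores.get)

# Optional: divide by p(x) to turn the scores into posterior probabilities.
evidence = sum(scores.values())
posteriors = {bird: score / evidence for bird, score in scores.items()}

print(prediction)  # eagle
print({bird: round(p, 2) for bird, p in posteriors.items()})
```

Since p(x) is the same for both classes, leaving it out does not change which class wins.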
57
CLUSTERING
58
DATA MINING FUNCTIONS
Clustering

Goal is to identify natural groupings of data.
Applications in market segmentation,
discovering affinity groups, and defect analysis

based on Euclidean distance
similar to centroid
trial & error approach
need to guess # of clusters

59
CLUSTER ANALYSIS OF CUSTOMER LOCATIONS IN A CITY
60
CUSTOMERS ARE CLUSTERED INTO FOUR SEGMENTS
[Figure: four customer segments, Cluster 1 to Cluster 4; profiles shown: Income Medium, Children 2; Income High, Children 1, Car Luxury; Income Medium, Children 3; Car Sedan and Car Truck; Income Low, Children 0, Car Compact]
61
A SIMPLE EXAMPLE
ACCT     MONTHLY SPENDING
109905   200
112343   30
124205   1,510
125687   11,078
130984   341
131987   543
140001   178
140387   2,000
150933   1,750
167912   230
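The slides do not name a clustering algorithm. The sketch below swaps in a one-dimensional k-means, which matches the description given earlier (distance-based, centroid-like, number of clusters guessed up front). The choice of three clusters and the starting centroids 100, 1500 and 10000 are hypothetical guesses.

```python
# Monthly spending values from the slide, in the order listed.
spending = [200, 30, 1510, 11078, 341, 543, 178, 2000, 1750, 230]


def kmeans_1d(values, centroids, max_iter=100):
    """Plain k-means on one attribute; distance is the absolute difference."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no movement: converged
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical initial guess: three clusters (low, medium, high spenders).
centroids, clusters = kmeans_1d(spending, centroids=[100, 1500, 10000])
for center, members in zip(centroids, clusters):
    print(round(center, 2), members)
```

With this starting guess the accounts separate into six low spenders, three mid-range spenders and one very high spender (11,078). A different guess for the number of clusters or the starting centroids can give a different grouping, which is the trial and error aspect.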
62
(No Transcript)
63
(No Transcript)
64
DISCUSSION QUESTIONS

What kind of data is stored in a data warehouse?
What are the characteristics of a data warehouse?
How is data organized in a warehouse?
What tool(s) can we use to analyze data in a
warehouse?
What type of multidimensional organization of
data is also called a sparse matrix?
What is a concept hierarchy?
What technique can you use to predict the
probability that a cancer patient will respond to
radiation therapy?
What type of mining identifies natural
characteristics of data?
What technique is useful for predicting trends?
What is minimum transaction support?
How is Bayes theorem used for classification?