Title: Data Warehousing and Data Mining
1Data Warehousing and Data Mining
- Sue Patience/David Nelson
- April 2004
2Contents
- Data Warehousing
- Data Mining
- Further Reading
3Data Warehousing
- OLTP (online transaction processing) systems
- range in size from megabytes to terabytes
- high transaction throughput
- Decision makers require access to all data
- A data warehouse is a subject-oriented,
integrated, time-variant and non-volatile
collection of data in support of managements
decision-making process (Inmon 1993)
4Benefits
- Potential high returns on investment
- 90 of companies in 1996 reported return of
investment (over three years) of gt 40 - Competitive advantage
- Data can reveal previously unknown, unavailable
and untapped information - Increased productivity of corporate
decision-makers - Integration allows more substantive, accurate and
consistent analysis
5Comparison
Source Connolly Begg, p 1049
6Data Warehouses
- Types of Data
- Detailed
- Summarised
- Meta-data
- Archive
- Back-up
7Architecture
Mainframe operational data
Warehouse mgr
Reporting query, app development,EIS tools
Meta-data
Highly summarized data
Departmental RDBMS data
Load mgr
Query manager
OLAP tools
Lightly summarized data
Private data
DBMS
Detailed data
Warehouse mgr
Data-mining tools
External data
Archive/backup
Source Connolly Begg p1053
8Information Flows
Operational data source 1
Warehouse Mgr
Meta-flow
Reporting query, app development,EIS tools
Meta- data
Highly summ. data
Inflow
Outflow
Load mgr
Query manager
OLAP tools
Lightly summ.
Upflow
DBMS
Detailed data
Warehouse mgr
Data-mining tools
Downflow
Operational data source n
Archive/backup
Source Connolly Begg p1058
9Information Flow Processes
- Five primary information flows
- Inflow - extraction, cleansing and loading of
data from source systems into warehouse - Upflow - adding value to data in warehouse
through summarizing, packaging and distributing
data - Downflow - archiving and backing up data in
warehouse - Outflow - making data available to end users
- Metaflow - managing the metadata
10Problems of Data Warehousing
- Underestimation of resources for data loading
- Hidden problems with source systems
- Required data not captured
- Increased end-user demands
- Data homogenization
- High demand for resources
- Data ownership
- High maintenance
- Long duration projects
- Complexity of integration
11Data Warehousing Tools Technologies
- Extraction, Cleansing Transformation Tools
- Code generators
- Database data replication tools
- Dynamic transformation engines
12Data Warehouse Requirements
- Data Warehouse DBMS
- Load performance Processing
- Data Quality management
- Query performance
- Scalability
- Advanced Query functionality
13Data Warehouse Design
- Data must be designed to allow ad-hoc queries to
be answered with acceptable performance
constraints - Queries usually require access to factual data
generated by business transactions - e.g. find the average number of properties rented
out with a monthly rent greater than 700 at each
branch office over the last six months - Uses Dimensionality Modelling
14Dimensionality Modelling
- Similar to E-R modelling but with constraints
- composed of one fact table with a composite
primary key - dimension tables have a simple primary key which
corresponds exactly to one foreign key in the
fact table - uses surrogate keys based on integer values
- Can efficiently and easily support ad-hoc
end-user queries
15Star Schemas
- The most common dimensional model
- A fact table surrounded by dimension tables
- Fact tables
- contains FK for each dimension table
- large relative to dimension tables
- read-only
- Dimension tables
- reference data
- query performance can be speeded up by
denormalising into a single dimension table - Normally create a dimension table for time
16E-R Model Example
17Star Schema Example
18Data Mining
- The process of extracting valid, previously
unknown, comprehensible and actionable
information from large databases and using it to
make crucial business decisions. (Simoudis 1996) - focus is to reveal information which is hidden or
unexpected - patterns and relationships are identified by
examining the underlying rules and features of
the data - work from data up
- require large volumes of data
19Data Mining Techniques
- Four Operations
- predictive modelling
- database segmentation
- link analysis
- deviation direction
20Data Mining Techniques
- Predictive Modelling
- using observations to form a model of the
important characteristics of some phenomenon. - Two techniques-
- Classification
- Value prediction
21Classification Example Tree Induction
Customer renting property gt 2 years
No
Yes
Customer age gt 25 years?
Rent property
No
Yes
Buy property
Rent property
22Data Mining Techniques
- Database Segmentation
- to partition a database into an unknown number of
segments (or clusters) of records which share a
number of properties
23Data Mining Techniques
- Link Analysis
- establish associations between individual records
(or sets of records) in a database - e.g. when a customer rents property for more
than two years and is more than 25 year olds,
then in 40 of cases, the customer will buy the
property - Techniques
- Association discovery
- Sequential pattern discovery
- Similar time sequence discovery
24Data Mining Techniques
- Deviation Detection
- identify outliers, something which deviates
from some known expectation or norm
25Mining and Warehousing
- Data mining needs single, separate, clean,
integrated, self-consistent data source - Data warehouse well equipped
- populated with clean, consistent data
- contains multiple sources
- utilises query capabilities
- capability to go back to data source
26Further Reading
- Connolly and Begg, 3rd edition, chapters 30, 31
and 32. - Connolly and Begg, 4th edition, chapters 31 - 34.
- W H Inmon, Building the Data Warehouse, New York,
Wiley and Sons, 1993. - Benyon-Davies P, Database Systems (2nd ed),
Macmillan Press, 2000, ch 34, 35 36.