Title: Problems with Today's Information Environment
1Data Warehousing
Xintao Wu
2Can You Easily Answer These Questions?
What are Personnel Services costs across all
departments for all funding sources?
What is the correlation between expenditures and
collection of delinquent taxes?
What are the effects of outsourcing specific
services?
What is the impact on revenues and expenditures
of changing the operating hours of the Dept. of
Motor Vehicles?
What is the economic impact of the small business
initiative in our district?
3Overview Data Warehousing and OLAP Technology
for Data Mining
- What is a data warehouse?
- Why a data warehouse?
- A multi-dimensional data model
- Data warehouse architecture
- Data warehouse implementation
- From data warehousing to data mining
4What is a Warehouse?
- Collection of diverse data
- subject oriented
- aimed at executive, decision maker
- often a copy of operational data
- with value-added data (e.g., summaries, history)
- integrated
- time-varying
- non-volatile
5What is a Warehouse?
- Collection of tools
- gathering data
- cleansing, integrating, ...
- querying, reporting, analysis
- data mining
- monitoring, administering warehouse
6Data Warehouse vs. Operational DBMS
- OLTP (on-line transaction processing)
- Major task of traditional relational DBMS
- Day-to-day operations purchasing, inventory,
banking, manufacturing, payroll, registration,
accounting, etc. - OLAP (on-line analytical processing)
- Major task of data warehouse system
- Data analysis and decision making
- Distinct features (OLTP vs. OLAP)
- User and system orientation customer vs. market
- Data contents current, detailed vs. historical,
consolidated - Database design ER application vs. star
subject - View current, local vs. evolutionary, integrated
- Access patterns update vs. read-only but complex
queries
7OLTP vs. OLAP
8Overview Data Warehousing and OLAP Technology
for Data Mining
- What a data warehouse?
- Why a data warehouse?
- A multi-dimensional data model
- Data warehouse architecture
- Data warehouse implementation
- From data warehousing to data mining
9Why Separate Data Warehouse?
- High performance for both systems
- DBMS tuned for OLTP access methods, indexing,
concurrency control, recovery - Warehousetuned for OLAP complex OLAP queries,
multidimensional view, consolidation. - Different functions and different data
- missing data Decision support requires
historical data which operational DBs do not
typically maintain - data consolidation DS requires consolidation
(aggregation, summarization) of data from
heterogeneous sources - data quality different sources typically use
inconsistent data representations, codes and
formats which have to be reconciled
10Warehouse Architecture
Metadata
11Advantages of Warehousing
- High query performance
- Queries not visible outside warehouse
- Local processing at sources unaffected
- Can operate when sources unavailable
- Can query data not stored in a DBMS
- Extra information at warehouse
- Modify, summarize (store aggregates)
- Add historical information
12Overview Data Warehousing and OLAP Technology
for Data Mining
- What a data warehouse?
- Why a data warehouse?
- A multi-dimensional data model
- Data warehouse architecture
- Data warehouse implementation
- From data warehousing to data mining
13From Tables and Spreadsheets to Data Cubes
- A data warehouse is based on a multidimensional
data model which views data in the form of a data
cube - A data cube, such as sales, allows data to be
modeled and viewed in multiple dimensions - Dimension tables, such as item (item_name, brand,
type), or time(day, week, month, quarter, year) - Fact table contains measures (such as
dollars_sold) and keys to each of the related
dimension tables - In data warehousing literature, an n-D base cube
is called a base cuboid. The top most 0-D cuboid,
which holds the highest-level of summarization,
is called the apex cuboid. The lattice of
cuboids forms a data cube.
14Cube A Lattice of Cuboids
all
0-D(apex) cuboid
time
item
location
supplier
1-D cuboids
time,item
time,location
item,location
location,supplier
2-D cuboids
time,supplier
item,supplier
time,location,supplier
time,item,location
3-D cuboids
item,location,supplier
time,item,supplier
4-D(base) cuboid
time, item, location, supplier
15Conceptual Modeling of Data Warehouses
- Modeling data warehouses dimensions measures
- Star schema A fact table in the middle connected
to a set of dimension tables - Snowflake schema A refinement of star schema
where some dimensional hierarchy is normalized
into a set of smaller dimension tables, forming a
shape similar to snowflake - Fact constellations Multiple fact tables share
dimension tables, viewed as a collection of
stars, therefore called galaxy schema or fact
constellation
16Example of Star Schema
Sales Fact Table
time_key
item_key
branch_key
location_key
units_sold
dollars_sold
avg_sales
Measures
17Example of Snowflake Schema
Sales Fact Table
time_key
item_key
branch_key
location_key
units_sold
dollars_sold
avg_sales
Measures
18Example of Fact Constellation
Shipping Fact Table
time_key
Sales Fact Table
item_key
time_key
shipper_key
item_key
from_location
branch_key
to_location
location_key
dollars_cost
units_sold
units_shipped
dollars_sold
avg_sales
Measures
19Typical OLAP Operations
- Roll up (drill-up) summarize data
- by climbing up hierarchy or by dimension
reduction - Drill down (roll down) reverse of roll-up
- from higher level summary to lower level summary
or detailed data, or introducing new dimensions - Slice and dice
- project and select
- Pivot (rotate)
- reorient the cube, visualization, 3D to series of
2D planes. - Other operations
- drill across involving (across) more than one
fact table - drill through through the bottom level of the
cube to its back-end relational tables (using SQL)
20Overview Data Warehousing and OLAP Technology
for Data Mining
- What a data warehouse?
- Why a data warehouse?
- A multi-dimensional data model
- Data warehouse architecture
- Data warehouse implementation
- From data warehousing to data mining
21Multi-Tiered Architecture
Monitor Integrator
OLAP Server
Metadata
Analysis Query Reports Data mining
Serve
Data Warehouse
Data Marts
Data Sources
OLAP Engine
Front-End Tools
Data Storage
22OLAP Server Architectures
- Relational OLAP (ROLAP)
- ROLAP - provides a Multi-dimensional view of a
relational DB (e.g. MicroStrategy) - Use relational or extended-relational DBMS to
store and manage warehouse data and OLAP middle
ware to support missing pieces - Include optimization of DBMS backend,
implementation of aggregation navigation logic,
and additional tools and services - greater scalability
- Multidimensional OLAP (MOLAP)
- Array-based multidimensional storage engine
(sparse matrix techniques) - fast indexing to pre-computed summarized data
- Hybrid OLAP (HOLAP)
- User flexibility, e.g., low level relational,
high-level array - Specialized SQL servers
- specialized support for SQL queries over
star/snowflake schemas
23MOLAP versus ROLAP
- MOLAP
- Multidimensional OLAP
- Data stored in multi-dimensional cube
- Transformation required
- Data retrieved directly from cube for analysis
- Faster analytical processing
- Cube size limitations
- ROLAP
- Relational OLAP
- Data stored in relational database as virtual
cube - No transformation needed
- Data retrieved via SQL from database for analysis
- Slower analytical processing
- No size limitations
24Data Warehouse Back-End Tools and Utilities
- Data extraction
- get data from multiple, heterogeneous, and
external sources - Data cleaning
- detect errors in the data and rectify them when
possible - Data transformation
- convert data from legacy or host format to
warehouse format - Load
- sort, summarize, consolidate, compute views,
check integrity, and build indices and partitions - Refresh
- propagate the updates from the data sources to
the warehouse
25Overview Data Warehousing and OLAP Technology
for Data Mining
- What a data warehouse?
- Why a data warehouse?
- A multi-dimensional data model
- Data warehouse architecture
- Data warehouse implementation
- From data warehouse to data mining
26Data Mining A KDD Process
Knowledge
Pattern Evaluation
- Data mining the core of knowledge discovery
process.
Data Mining
Task-relevant Data
Selection
Data Warehouse
Data Cleaning
Data Integration
Databases
27Steps of a KDD Process
- Learning the application domain
- relevant prior knowledge and goals of application
- Creating a target data set data selection
- Data cleaning and preprocessing (may take 60 of
effort!) - Data reduction and transformation
- Find useful features, dimensionality/variable
reduction, invariant representation. - Choosing functions of data mining
- summarization, classification, regression,
association, clustering. - Choosing the mining algorithm(s)
- Data mining search for patterns of interest
- Pattern evaluation and knowledge presentation
- visualization, transformation, removing redundant
patterns, etc. - Use of discovered knowledge
28Data Mining and Business Intelligence
Increasing potential to support business decisions
End User
Making Decisions
Business Analyst
Data Presentation
Visualization Techniques
Data Mining
Data Analyst
Information Discovery
Data Exploration
Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts
OLAP, MDA
DBA
Data Sources
Paper, Files, Information Providers, Database
Systems, OLTP
29Summary
- Data warehouse
- A subject-oriented, integrated, time-variant, and
nonvolatile collection of data in support of
managements decision-making process - A multi-dimensional model of a data warehouse
- Star schema, snowflake schema, fact
constellations - A data cube consists of dimensions measures
- OLAP operations drilling, rolling, slicing,
dicing and pivoting - OLAP servers ROLAP, MOLAP, HOLAP
- From OLAP to OLAM