Problems with Today's Information Environment - PowerPoint PPT Presentation

1 / 29
About This Presentation
Title:

Problems with Today's Information Environment

Description:

Data Warehousing Xintao Wu – PowerPoint PPT presentation

Number of Views:84
Avg rating:3.0/5.0
Slides: 30
Provided by: Geral216
Category:

less

Transcript and Presenter's Notes

Title: Problems with Today's Information Environment


1
Data Warehousing

Xintao Wu
2
Can You Easily Answer These Questions?
What are Personnel Services costs across all
departments for all funding sources?
What is the correlation between expenditures and
collection of delinquent taxes?
What are the effects of outsourcing specific
services?
What is the impact on revenues and expenditures
of changing the operating hours of the Dept. of
Motor Vehicles?
What is the economic impact of the small business
initiative in our district?
3
Overview Data Warehousing and OLAP Technology
for Data Mining
  • What is a data warehouse?
  • Why a data warehouse?
  • A multi-dimensional data model
  • Data warehouse architecture
  • Data warehouse implementation
  • From data warehousing to data mining

4
What is a Warehouse?
  • Collection of diverse data
  • subject oriented
  • aimed at executive, decision maker
  • often a copy of operational data
  • with value-added data (e.g., summaries, history)
  • integrated
  • time-varying
  • non-volatile

5
What is a Warehouse?
  • Collection of tools
  • gathering data
  • cleansing, integrating, ...
  • querying, reporting, analysis
  • data mining
  • monitoring, administering warehouse

6
Data Warehouse vs. Operational DBMS
  • OLTP (on-line transaction processing)
  • Major task of traditional relational DBMS
  • Day-to-day operations purchasing, inventory,
    banking, manufacturing, payroll, registration,
    accounting, etc.
  • OLAP (on-line analytical processing)
  • Major task of data warehouse system
  • Data analysis and decision making
  • Distinct features (OLTP vs. OLAP)
  • User and system orientation customer vs. market
  • Data contents current, detailed vs. historical,
    consolidated
  • Database design ER application vs. star
    subject
  • View current, local vs. evolutionary, integrated
  • Access patterns update vs. read-only but complex
    queries

7
OLTP vs. OLAP
8
Overview Data Warehousing and OLAP Technology
for Data Mining
  • What a data warehouse?
  • Why a data warehouse?
  • A multi-dimensional data model
  • Data warehouse architecture
  • Data warehouse implementation
  • From data warehousing to data mining

9
Why Separate Data Warehouse?
  • High performance for both systems
  • DBMS tuned for OLTP access methods, indexing,
    concurrency control, recovery
  • Warehousetuned for OLAP complex OLAP queries,
    multidimensional view, consolidation.
  • Different functions and different data
  • missing data Decision support requires
    historical data which operational DBs do not
    typically maintain
  • data consolidation DS requires consolidation
    (aggregation, summarization) of data from
    heterogeneous sources
  • data quality different sources typically use
    inconsistent data representations, codes and
    formats which have to be reconciled

10
Warehouse Architecture
Metadata
11
Advantages of Warehousing
  • High query performance
  • Queries not visible outside warehouse
  • Local processing at sources unaffected
  • Can operate when sources unavailable
  • Can query data not stored in a DBMS
  • Extra information at warehouse
  • Modify, summarize (store aggregates)
  • Add historical information

12
Overview Data Warehousing and OLAP Technology
for Data Mining
  • What a data warehouse?
  • Why a data warehouse?
  • A multi-dimensional data model
  • Data warehouse architecture
  • Data warehouse implementation
  • From data warehousing to data mining

13
From Tables and Spreadsheets to Data Cubes
  • A data warehouse is based on a multidimensional
    data model which views data in the form of a data
    cube
  • A data cube, such as sales, allows data to be
    modeled and viewed in multiple dimensions
  • Dimension tables, such as item (item_name, brand,
    type), or time(day, week, month, quarter, year)
  • Fact table contains measures (such as
    dollars_sold) and keys to each of the related
    dimension tables
  • In data warehousing literature, an n-D base cube
    is called a base cuboid. The top most 0-D cuboid,
    which holds the highest-level of summarization,
    is called the apex cuboid. The lattice of
    cuboids forms a data cube.

14
Cube A Lattice of Cuboids
all
0-D(apex) cuboid
time
item
location
supplier
1-D cuboids
time,item
time,location
item,location
location,supplier
2-D cuboids
time,supplier
item,supplier
time,location,supplier
time,item,location
3-D cuboids
item,location,supplier
time,item,supplier
4-D(base) cuboid
time, item, location, supplier
15
Conceptual Modeling of Data Warehouses
  • Modeling data warehouses dimensions measures
  • Star schema A fact table in the middle connected
    to a set of dimension tables
  • Snowflake schema A refinement of star schema
    where some dimensional hierarchy is normalized
    into a set of smaller dimension tables, forming a
    shape similar to snowflake
  • Fact constellations Multiple fact tables share
    dimension tables, viewed as a collection of
    stars, therefore called galaxy schema or fact
    constellation

16
Example of Star Schema

Sales Fact Table
time_key
item_key
branch_key
location_key
units_sold
dollars_sold
avg_sales
Measures
17
Example of Snowflake Schema
Sales Fact Table
time_key
item_key
branch_key
location_key
units_sold
dollars_sold
avg_sales
Measures
18
Example of Fact Constellation
Shipping Fact Table
time_key
Sales Fact Table
item_key
time_key
shipper_key
item_key
from_location
branch_key
to_location
location_key
dollars_cost
units_sold
units_shipped
dollars_sold
avg_sales
Measures
19
Typical OLAP Operations
  • Roll up (drill-up) summarize data
  • by climbing up hierarchy or by dimension
    reduction
  • Drill down (roll down) reverse of roll-up
  • from higher level summary to lower level summary
    or detailed data, or introducing new dimensions
  • Slice and dice
  • project and select
  • Pivot (rotate)
  • reorient the cube, visualization, 3D to series of
    2D planes.
  • Other operations
  • drill across involving (across) more than one
    fact table
  • drill through through the bottom level of the
    cube to its back-end relational tables (using SQL)

20
Overview Data Warehousing and OLAP Technology
for Data Mining
  • What a data warehouse?
  • Why a data warehouse?
  • A multi-dimensional data model
  • Data warehouse architecture
  • Data warehouse implementation
  • From data warehousing to data mining

21
Multi-Tiered Architecture
Monitor Integrator
OLAP Server
Metadata
Analysis Query Reports Data mining
Serve
Data Warehouse
Data Marts
Data Sources
OLAP Engine
Front-End Tools
Data Storage
22
OLAP Server Architectures
  • Relational OLAP (ROLAP)
  • ROLAP - provides a Multi-dimensional view of a
    relational DB (e.g. MicroStrategy)
  • Use relational or extended-relational DBMS to
    store and manage warehouse data and OLAP middle
    ware to support missing pieces
  • Include optimization of DBMS backend,
    implementation of aggregation navigation logic,
    and additional tools and services
  • greater scalability
  • Multidimensional OLAP (MOLAP)
  • Array-based multidimensional storage engine
    (sparse matrix techniques)
  • fast indexing to pre-computed summarized data
  • Hybrid OLAP (HOLAP)
  • User flexibility, e.g., low level relational,
    high-level array
  • Specialized SQL servers
  • specialized support for SQL queries over
    star/snowflake schemas

23
MOLAP versus ROLAP
  • MOLAP
  • Multidimensional OLAP
  • Data stored in multi-dimensional cube
  • Transformation required
  • Data retrieved directly from cube for analysis
  • Faster analytical processing
  • Cube size limitations
  • ROLAP
  • Relational OLAP
  • Data stored in relational database as virtual
    cube
  • No transformation needed
  • Data retrieved via SQL from database for analysis
  • Slower analytical processing
  • No size limitations

24
Data Warehouse Back-End Tools and Utilities
  • Data extraction
  • get data from multiple, heterogeneous, and
    external sources
  • Data cleaning
  • detect errors in the data and rectify them when
    possible
  • Data transformation
  • convert data from legacy or host format to
    warehouse format
  • Load
  • sort, summarize, consolidate, compute views,
    check integrity, and build indices and partitions
  • Refresh
  • propagate the updates from the data sources to
    the warehouse

25
Overview Data Warehousing and OLAP Technology
for Data Mining
  • What a data warehouse?
  • Why a data warehouse?
  • A multi-dimensional data model
  • Data warehouse architecture
  • Data warehouse implementation
  • From data warehouse to data mining

26
Data Mining A KDD Process
Knowledge
Pattern Evaluation
  • Data mining the core of knowledge discovery
    process.

Data Mining
Task-relevant Data
Selection
Data Warehouse
Data Cleaning
Data Integration
Databases
27
Steps of a KDD Process
  • Learning the application domain
  • relevant prior knowledge and goals of application
  • Creating a target data set data selection
  • Data cleaning and preprocessing (may take 60 of
    effort!)
  • Data reduction and transformation
  • Find useful features, dimensionality/variable
    reduction, invariant representation.
  • Choosing functions of data mining
  • summarization, classification, regression,
    association, clustering.
  • Choosing the mining algorithm(s)
  • Data mining search for patterns of interest
  • Pattern evaluation and knowledge presentation
  • visualization, transformation, removing redundant
    patterns, etc.
  • Use of discovered knowledge

28
Data Mining and Business Intelligence
Increasing potential to support business decisions
End User
Making Decisions
Business Analyst
Data Presentation
Visualization Techniques
Data Mining
Data Analyst
Information Discovery
Data Exploration
Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts
OLAP, MDA
DBA
Data Sources
Paper, Files, Information Providers, Database
Systems, OLTP
29
Summary
  • Data warehouse
  • A subject-oriented, integrated, time-variant, and
    nonvolatile collection of data in support of
    managements decision-making process
  • A multi-dimensional model of a data warehouse
  • Star schema, snowflake schema, fact
    constellations
  • A data cube consists of dimensions measures
  • OLAP operations drilling, rolling, slicing,
    dicing and pivoting
  • OLAP servers ROLAP, MOLAP, HOLAP
  • From OLAP to OLAM
Write a Comment
User Comments (0)
About PowerShow.com