Chap' 3 Data Warehouse and OLAP Technology - PowerPoint PPT Presentation

1 / 31
About This Presentation
Title:

Chap' 3 Data Warehouse and OLAP Technology

Description:

Star schema. A fact table in the middle connected to a set of dimension tables. Snowflake schema ... and reporting using crosstabs, tables, charts and graphs ... – PowerPoint PPT presentation

Number of Views:140
Avg rating:3.0/5.0
Slides: 32
Provided by: jiaw185
Category:

less

Transcript and Presenter's Notes

Title: Chap' 3 Data Warehouse and OLAP Technology


1
Chap. 3 Data Warehouse and OLAP Technology
  • Data Mining

2
What is Data Warehouse?
  • DW decision support DB
  • Maintained separately from the operational
    database
  • Collection of data in support of managements
    decision-making
  • subject-oriented
  • Integrated
  • time-variant
  • nonvolatile
  • Data warehousing
  • The process of constructing and using data
    warehouses

3
DW - Subject-Oriented
  • Organized around major subjects
  • Customer, product, sales
  • Focusing on the modeling and analysis of data
  • not on daily operations or transaction processing
  • Provide a simple and concise view
  • Around particular subject
  • Excluding data that are not useful in the
    decision support process

4
DW - Integrated
  • Integrating multiple, heterogeneous data sources
  • Relational databases, flat files, on-line
    transaction records
  • Data cleaning and data integration
  • Ensure consistency in naming conventions,
    encoding structures, attribute measures, etc.
    among different data sources
  • E.g., Hotel price currency, tax, breakfast
    covered, etc.
  • When data is moved to the warehouse, it is
    converted.

5
DW - Time Variant
  • The time horizon is significantly longer
  • Operational database current value data
  • Data warehouse data provide information from a
    historical perspective (e.g., past 5-10 years)
  • Every key structure contains an element of time
  • Explicitly or implicitly

6
DW - Non-Volatile
  • A physically separate store of data
  • Transformed from the operational environment
  • Operational update of data does not occur
  • Does not require transaction processing,
    recovery, and concurrency control mechanisms
  • Requires only two operations
  • initial loading of data
  • access of data

7
Heterogeneous DBMS vs. DW
  • Traditional heterogeneous DB integration
  • Build wrappers/mediators on top of heterogeneous
    databases
  • Query driven approach
  • A query is translated into queries appropriate
    for individual heterogeneous sites involved
  • The results are integrated into a global answer
    set
  • Data warehouse
  • Update-driven approach
  • Information from heterogeneous sources is
    integrated in advance and stored in warehouses
    for direct query and analysis
  • High performance

8
OLTP(DBMS) vs. OLAP(DW)
  • OLTP (on-line transaction processing)
  • Major task of traditional relational DBMS
  • Day-to-day operations purchasing, inventory,
    banking, manufacturing, payroll, registration,
    accounting, etc.
  • OLAP (on-line analytical processing)
  • Major task of data warehouse system
  • Data analysis and decision making
  • Distinct features (OLTP vs. OLAP)
  • Data contents current, detailed vs. historical,
    consolidated
  • View current, local vs. evolutionary, integrated
  • Access patterns update vs. read-only but complex
    queries

9
OLTP vs. OLAP
10
Why Separate DW?
  • High performance for both systems
  • DBMS - tuned for OLTP access methods, indexing,
    concurrency control, recovery
  • Warehouse - tuned for OLAP complex OLAP queries,
    multidimensional view, consolidation
  • Different functions and different data
  • Decision support requires historical data which
    operational DB do not typically maintain
  • Decision support requires consolidation
    (aggregation, summarization) of data from
    heterogeneous sources
  • Inconsistent data representations, codes and
    formats have to be reconciled

11
Multidimensional Data
Dimensions Product, Location, Time
Region
Industry Region Year Category
Country Quarter Product City Month
Week Office Day
Product
Month
12
Multidimensional Data
Total annual sales of TV in U.S.A.
13
Data Cubes
  • Data cubes
  • A multidimensional data model
  • Dimensions data are kept in different
    dimensions
  • item (item_id, brand, type)
  • time(day, month, quarter, year)
  • location(city, country)
  • Facts numerical measure
  • dollars_sold
  • units_sold
  • Cuboid
  • Each data cube that represents different degree
    of summarization

14
Data Cubes
  • Example sales data cube
  • 2-D cube (time, item)

15
Data Cubes
  • 3-D cube (time, item, location)

16
Data Cubes
  • 4-D cube (time, item, location, supplier)

17
Cube A Lattice of Cuboids
  • Total of cuboids?
  • Item PC, Printer
  • Time 2003, 2004
  • Location Korea, Japan, USA
  • ? Sales for each year or
  • Sales for each nation or
  • Sales for each year for each item or

18
Cube A Lattice of Cuboids
all
0-D(apex) cuboid
time
item
location
supplier
1-D cuboids
time,item
time,location
item,location
location,supplier
2-D cuboids
time,supplier
item,supplier
time,location,supplier
time,item,location
3-D cuboids
time,item,supplier
item,location,supplier
4-D(base) cuboid
time, item, location, supplier
19
Browsing a Data Cube
  • Visualization
  • OLAP capabilities
  • Interactive manipulation

20
Conceptual Modeling of DW
  • Modeling data warehouses
  • Star schema
  • A fact table in the middle connected to a set of
    dimension tables
  • Snowflake schema
  • Some dimensional hierarchy is normalized into a
    set of smaller dimension tables
  • Fact constellations
  • Multiple fact tables share dimension tables,
    viewed as a collection of stars

21
Star Schema

time
item
Sales Fact Table
time_key day day_of_the_week month quarter year
item_key item_name brand type supplier_type
time_key
item_key
branch_key
location
branch
location_key
location_key street city province_or_street countr
y
branch_key branch_name branch_type
units_sold
dollars_sold
avg_sales
Dimension Table
Measures
22
Measures
  • A numerical function that can be evaluated at
    each point in the data cube space
  • Distributive
  • It can be computed in a distributed manner.
  • sum(), count(), min(), max().
  • Algebraic
  • If it can be computed by an algebraic function
    with arguments which are obtained by applying a
    distributive function.
  • avg() ( sum()/count() ), min_N(),
    standard_deviation().
  • Holistic
  • If there is no algebraic function for the
    computation
  • median(), rank().

23
A Concept Hierarchy
  • Allows data to be handled at various levels of
    abstraction

24
Typical OLAP Operations
  • Roll up (drill-up) summarize data
  • by climbing up hierarchy or by dimension
    reduction
  • Drill down (roll down) reverse of roll-up
  • from higher level summary to lower level summary
  • Slice and dice
  • project and select
  • Pivot (rotate)
  • reorient the cube, visualization, 3D to series of
    2D planes.

25
(No Transcript)
26
DW Architecture
Monitor Integrator
OLAP Server
Metadata
Analysis Query Reports Data mining
Extract Transform Load Refresh
Data Warehouse
Serve
Data Marts
Data Sources
OLAP Engine
Front-End Tools
Data Storage
27
Three DW Models
  • Enterprise warehouse
  • collects all of the information about subjects
    spanning the entire organization
  • Data Mart
  • a subset of corporate-wide data that is of value
    to a specific groups of users.
  • Exgt marketing data mart
  • Virtual warehouse
  • A set of views over operational databases
  • Only some of the possible summary views may be
    materialized

28
OLAP Server Architecture
  • Relational OLAP (ROLAP)
  • Use relational or extended-relational DBMS to
    store and manage warehouse data and OLAP
    middleware to support missing pieces
  • Include optimization of DBMS backend,
    implementation of aggregation navigation logic,
    and additional tools and services
  • greater scalability
  • Multidimensional OLAP (MOLAP)
  • Array-based multidimensional storage engine
    (sparse matrix techniques)
  • fast indexing to pre-computed summarized data
  • Hybrid OLAP (HOLAP)
  • User flexibility, e.g., low level relational,
    high-level array

29
Efficient Data Cube Computation
  • Data cube - lattice of cuboids
  • How many cuboids in an n-dimensional cube with L
    levels?
  • Materialization of data cube
  • Materialize
  • Every (full materialization), none (no
    materialization), or some (partial
    materialization) cuboids
  • Selection of which cuboids to materialize
  • Based on size, sharing, access frequency, etc.

30
Data Warehouse Usage
  • Information processing
  • supports querying, basic statistical analysis,
    and reporting using crosstabs, tables, charts and
    graphs
  • Analytical processing
  • multidimensional analysis of data warehouse data
  • supports basic OLAP operations, slice-dice,
    drilling, pivoting
  • Data mining
  • knowledge discovery from hidden patterns
  • supports associations, constructing analytical
    models, performing classification and prediction,
    and presenting the mining results using
    visualization tools.

31
OLAP Mining (OLAM)
  • Why mining on OLAP?
  • High quality of data
  • DW contains integrated, consistent, cleaned data
  • Available information processing structure
  • ODBC, Web accessing facilities, reporting and
    OLAP tools
  • OLAP-based exploratory data analysis
  • Drilling, dicing, pivoting, etc.
  • On-line selection of data mining functions
  • integration and swapping of multiple mining
    functions, algorithms, and tasks
Write a Comment
User Comments (0)
About PowerShow.com