Title: Design%20of%20Multidimensional%20Data%20Models%20for%20Data%20Warehouses%20and%20OLAP
1Design ofMultidimensional Data Models forData
Warehousesand OLAP
2Who I am
- Freelance Data Base Consultant
- More than 30 years of Experience with DBMSs as a
Vendor Person and as Freelance Consultant - 10 years of Experience with Data Warehouse
construction - Have trained more than 350 persons in
Multidimensional Modelling and Star Schema Design - Charter member of the IAIDQ
- Board member of The Data Warehouse Institute
Denmark - Locations Nordic Countries (principal residence
in Denmark) and Côte dAzur (secondary residence
in Antibes)
3Acknowledgements
- Based on
- Ralph Kimballs books The Data Warehouse
Toolkit, Wiley 1996, ISBN 0-471-15337-0, The
Data Warehouse Lifecycle Toolkit ), Wiley 1998,
ISBN 0-471-25547-5, The Data Webhouse Toolkit
), Wiley 2000, ISBN 0-471-37680-9, and selected
parts of Ralphs published papers - Some examples from Data Warehouse Design
Solutions, Christopher Adamson and Michael
Venerable, Wiley 1998, ISBN 0-471-25195-X. - Microsoft OLAP Solutions, Erik Thomsen, George
Spofford and Dick Chase, Wiley 1999, ISBN
0-471-33258-5 - Telco example from The Official Guide to
Informix/Red Brick Data Warehousing, MT Books
(IDG) 2000, ISBN 0-7645-4694-5 - DWLIST discussions
- My own practical experiences and readings
- ) With Laura Reeves, Margy Ross and Warren
Thornthwaite. - ) With Richard Merz
4Agenda
- History
- Data Warehousing Objectives and Architecture
- Dimensional Design Basics
- Industry examples
- The Data Webhouse (Clickstream Analysis)
- Advanced Design Issues
- The bigger picture
- Data Quality
- The future of the Data Warehouse
- Literature and web-adresses
5History
- How the Multidimensional Model came into existence
6The early years of database
Online! Pioneers General Electric, IDS, Charlie
Bachman Rockwell/IBM, IMS Techniques Hashing,
pointers, physical colocation, and concurrency /
transaction control
7The VisionThe Information System
8Relational The Codd Date Seminars
9Relational / SQL
- Database language standard
- The Query Optimizer (automated navigation)
- DBMSs become commodities
- End-user tools by the hundreds
10Around 1987 Getting closer
Tactical
Operational
The Information Warehouse Concept
11Innovation for AnalystsThe Multidimensional
Database
Sales Districts
Products
Timeperiods
12IT Business OpportunitiesDecision Support
Systems
Executive Information Systems
Strategic
On-Line Analytical Processing
Tactical
On-Line Transaction Processing
Operational
13OLTP and OLAPConflict of Purpose
(millions of records)
(hundreds of records)
14OLAP Categoriesand Sample Products
- ROLAP Relational DBMSs with Star Schema support
and specific tools - MOLAP Multidimensional Databases
- HOLAP Hybrid OLAP, the combination of the two
- ROLAP Most RDBMSs, tools like Microstrategy,
Business Objects, Informix Metacube, IBM DB2 OLAP
Server, Oracle 9i OLAP Server and many more - MOLAP Hyperion Essbase, Applix TM1, Cognos,
Microsoft OLAP Services, IBM DB2 OLAP Server - HOLAP Microsoft SQL Server with OLAP Services,
(IBM DB2 OLAP Server, Hyperion Essbase 7)
15ROLAP vs. MOLAP
- Hot debate in the 90-es (The Shootout at the
OLAP Corral) - The major strength of ROLAP
- Large volumes of data (billions of rows,
terabytes of data) - The major weakness of ROLAP
- Performance on large result sets
- The major strength of MOLAP
- Fast response times, also on large result sets
(aggregated queries) - The major drawbacks of MOLAP
- Pre-calculation times
- Scalability (the cubes explode as the volume and
complexity increase) - So, what to do?
- Use both (HOLAP)
16The Power of the 2
MOLAP / HOLAP
ROLAP Star Schema
Smooth, often automatic/transparent,
integration Detailed levels in ROLAP (many
rows) Two good examples - Microsoft SQL Server
with OLAP Services - IBM DB2 OLAP Server
17Its here!
Analytical Applications in OLAP
Distributed On-Line Transaction Processing
18Data Warehousing Objectivesand Architecture
19Data Warehouse Objectives
- Provide one data source for reporting, analysis,
and mining - consistent answers across the organization or
organizational unit (data marts)
20The Feeding System(Extract, Transformation and
Load - ETL Tools)
Operational systems
Extracting Cleaning Standardizing Consolidating Ag
gregating Transforming Keygenerating Reformatting
and much more
Datawarehouse
21ETL where the time is spent!
- Do it yourself (SQL, scripts, COBOL etc.)
- ETL tool products (e.g. Informatica)
- 80 of your time is spent here!
- Buy and read The Data Warehouse ETL Toolkit,
Ralph Kimball and Joe Caserta, Wiley 2004, ISBN
0-7645-6757-8
22Data Warehouse statistics
- Development time
- lt 6 months 16
- 6-12 months 32
- 12-24 months 26
- 24-60 months 20
- gt 60 months 6
- 50 of all are over 3 years of age (9 over 10
years) - 33 over 1 TB (10 over 10 TB)
Source Business Intelligence Journal, Fall 2005
23The Points of Measurements in Business Processes
- Facts are numerical
- Counts
- Volumes
- Money
- and represent the key subject areas in business.
- Each point of measurement becomes a star.
24One, big Data Warehouse
One global DW
Finished goods inventory
Customer dimension
Shipments
Distribution inventory
Product dimension
Depletions
Store inventory
Etc. etc. etc.
Salestransactions
Time dimension
25Anarchistic Data Marts
Data Mart
Data Mart
Data Mart
Data Mart
Data Mart
Data Mart
26Coordinated Data Marts
Enterprise Data Warehouse (maybe also an
Operational Data Store)
Data Mart
Data Mart
Data Mart
Data Mart
Data Mart
Data Mart
27Top 10 reasons for a layer of atomic data in
front of data marts
- Coordinated transformations at mart level
(consistency) - Minimized impact on source systems (one extract)
- Temporal integrity across marts
- Single source for cross-subject mining
- Allows smaller marts - with drill-down to atomic
level - Coordinated meta data across entire DW
- Scalability of entire DW (quick population of new
marts) - Facilitates building stars
- Mart recovery
- If volumes not too high, operational reporting
may take place
Doug Laney, Prism Solutions on DWLIST
28The Data Warehouse Bus Architecture
Design Conformed Dimensions and Conformed
Facts Physical Data Staging facilities as
necessary
Data Mart
Data Mart
Data Mart
Data Mart
Data Mart
Data Mart
29Roles of ROLAP, MOLAPand HOLAP
Data Staging Area in Relational DBMS E/R May be
used for data cleansing Star Schema For all
facts and dimensions
Data Mart 1, ROLAP
Data Mart 2, MOLAP
Data Mart 3, HOLAP
Data Mart 4, MOLAP
Data Mart 5, ROLAP
Data Mart 5, MOLAP
30Components and structures in Decision Support
Systems
Multidimensional Cube(s)
User Interface, ie. Windows or Web
Ad hoc OLAP tools
Own applications developed with OLAP tools
Off the shelf applications
Typically on aggregated levels
Metadata
Relational Star Schemas
SQL
Typically on the atomic, event level
Datawarehouse database
31A few words on Meta Data
- You need it and it is important
- Read Meta Meta Data Data, Ralph Kimball
(www.dbmsmag.com/9803d05.html) - Comes with client tools, data transformation
tools, stand-alone tools, modelling tools, DBMSs
etc. - Microsoft Repository Open Information Model
- Meta Data Coalition OIM 1.1 (www.mdcinfo.com )
- Parallel work in the OMG (Common Warehouse
Metamodel) (www.omg.org/technology/cwm ) - September 2000 MDC integrates into OMG CWM, the
two standards become one.
J
32Common Warehouse Metamodel
- Expressed in UML
- XML for metadata interchange (XMI)
- Check www.cwmforum.org
- CWM 1.0 February 2001
- Partners IBM, Unisys, NCR, Hyperion, Oracle,
UBS, Genesis, Dimension EDI - Supporters Deere, Sun, HP, Data Access, InLine,
Aonix, Hitachi, SAS, Meta Integration, Adaptive - Common Warehouse Metamodel, John Poole, Dan
Chang, Douglas Tolbert, and David Mellor, OMG
Press / Wiley 2002, ISBN 0-471-20052-2 - Actual product support? Still maybe a little too
early to tell
33Data Warehousing Objectives
- Publish what is important
- Provide the means to find out why
- Promote well-informed decisions
34Multidimensional Design Basics
- The Art of Constructing the Stars!
35Dimensional Design Methodology
- Design begins
- with business requirements gathered from the
decision makers and analysts - and data sourced from
- the corporations operational systems
- external data sources
- Design requires
- user involvement at all stages
- Design resembles
- a series of successive approximations (3-4
revisions)
36Simple, E/R model for OLTP
37Problems with E/R
- Humans cant navigate or remember an E/R
- Software cant navigate an E/R
- Every path gives a different answer
- The shortest path is meaningless
- Bad performance
38Do you need a warehouse E/R model
- Not necessarily
- the data relationships in the enterprise E/R
model can suffice
39Can I model by subject area with a phased approach
- Yes
- models are extensible
- the key is conformable dimensions
- dimensions can then be shared and are extensible
themselves
40Heart of the matter
- Business Views
- Must look like the business
- Recognized by business types
- Relevant for business types
- Three design rules
- Simplicity
- Simplicity
- Simplicity
K.I.S.S. !
41Design objectives
- Business View Schema must be readily understood
and navigatable by the users - Important information must not be obscured by
unimportant detail and complexity - The implementation(s) must provide rapid response
time against large volumes of historical data - The implementation(s) must be legible and
navigatable for extract processing mining
42Classic definition, ROLAP
- STAR schema
- A relational schema organized around a central
table joined to smaller tables using foreign key
references
43Dimensional Model
Facts/Measures 95 of data base storage
Dimensions
Dimensions
most of the fields
44Terminology Facts, Measures, Accounts
- ROLAP The Fact Table(s)
- Contains Facts and (foreign key) references to
the dimensions - MOLAP The Measures Dimension
- Contains Measures and references to the lowest
level Member Keys of the dimensions - Essbase (MOLAP) Measures Dimensions are called
Accounts Dimensions
45Advantages over classic datamodels
(Entity/Relationship)
- Humans can navigate remember a multidimensional
star or cube - Software can navigate deterministically
- The two major purposes of Multidimensional
Modelling are - Reducing complexity
- Deliver good response times also for large
aggregations
46The Dimensions
- Dimensions
- Define the business dimensions, in terms already
familiar to users, by which the central table is
to be analyzed - Numerous columns of text, highly descriptive
- Represent the hierarchies of different levels of
reporting (eg. Year-gtQuarter-gtMonth-gtDay) - Usually less than a million rows, can be much
larger in some businesses
47Technically speaking (ROLAP)
- Dimension tables
- Must have primary key
- Joined to fact table through foreign key
reference - Typically represent ninety percent of the data
elements - Commonly occurs in constraints and GROUP BY
clauses - Heavily indexed
48Typical dimensions
- Generic Industry specific
- Time period(s) Frequent flier, stayer
- Geographic region Service level,
procedure, (markets, cities) operation - Products Room type, service,
- (also in bank/insurance) classification,
seat - Promotions/campaign Drug, medicine
- Customers Vendors, distributor, (Account
number) warehouse - Sales rep, buyer,
- organisation
49The Time Dimension
- Problem SQL and many OLAP products do not
support date arithmetic well enough - How many working days in a month?
- How many days in the Easter season?
- When is the industrial vacation period?
- How to calculate AVG(xxx) per working day?
50The Time Dimension
- Give it all the attributes you need to make life
easy for the end-user - Daynumber
- Date
- Day in week
- Name of day
- Type of day
- Season
- Week in year
- Workingdays in week
- Month
Name of month Type of month Day in month Days
in month Workingdays in month End of Month
Flag Quarter Year Day in year Workingdays in
year
Think about this - it is worth the effort!
51The Time of Day Dimension
52The Status Dimension
S-key Status Fulfilled-flag Ordertype 1 Current
Yes Mail 2 Current No Mail 3 Old Yes Ma
il 4 Old No Mail 5 Current Yes Phone 6 Curr
ent No Phone etc...
(Cartesian product of all values) Is a real
dimension
53Hierachies within Dimensions
ROLAP One table (denormalised)
Company Divison Region Business Unit CC
Country CC State CC City Cost Center (PK)
MOLAP One or two dimensions, depending on your
product(s)
(Dimension diagram technique proposed by Ralph
Kimball et al in The Data Warehouse Lifecycle
Toolkit using eg. Microsoft Visio)
54Fact of life (1) Ragged Hierarchies
55Fact of life Ragged Hierarchies
Data enters at this level
56ROLAP Solution to the Ragged Hierarchy Problem
Base table
End-user and/or MOLAP view Select Category_Descr
iption, Case when Account_Description is not
null then Account_description else
Category_Description end as Account_Description f
rom Chart_of_Accounts order by Category_ID
57MOLAP Solutions to the Ragged Hierarchy Problem
- Depends on your product
- Might require manual definitions
- Supported in at least
- Microsoft SQL Server 2000
- Hyperion Essbase
- IBM DB2 OLAP Server
- Applix TM1
58Fact of Life (2)Unique Members
- Member Often used in OLAP to designate the
individual entries on the individual levels (eg.
Country France, Year 2001 etc.). - Some products require that members (values) be
unique across - The whole level
- The whole dimension(!)
- Check your selected product(s) before getting too
deep into the implementation!
59Fact of Life (2)Unique Members
- All parent-child relationships MUST be
one-to-many - Checking your levels
- select quarter, count(distinct year) as count_col
from time_period group by quarter having
count_col ltgt 1 - The result set of the query above must be empty!
- If your product needs global uniqueness, you must
do similar checking across all levels - If you dont do this, your users will get
meaningsless answers to their queries!
60Fact of Life (3) Sparsity
- Some dimension members may occur quite
infrequently (eg. demographic data only available
on 10 percent of your customers) - This is called sparsity (a sparse dimension)
- Also look for dimensions, which do not intersect
with other dimensions in a star they are not
interesting only take up space - ROLAP Not a big problem, some wasted disk space
- MOLAP A very big problem leads to cube
explosion (many unused cells) - Try to eliminate as much as possible
- Make sure that you tune your product
configuration well (many MOLAP products are
sparsity aware)
61Fact of Life (4) Flat Dimensions
- E.g. A dimension on Standard Industry Codes
(SIC) - In ROLAP just another attribute on your customer
(maybe) - In MOLAP, a member attribute on the lowest level
of your customer hierarchy (if your product
permits it) - Keep the number of dimensions down!
62Dimension Data Become Row-/Column Headers in
Reports
Star Schema
Customers
Product Xxx Yyy Zzz Jan - -
- Febr - - - Mar -
- - Apr May ....
Products
age
Productkey Productname etc.
gender
Sales Detail
income level
education
(billions of records)
Not only your model, but also your data must
look good!
Time
Markets
Time_key Day Month Year ...
Marketkey Sales_area ....
63Slowly Changing Dimensions
- When data is changed, what can you do?
- Overwrite (if you dont care loosing the history)
- Create another dimension record (if for instance
a customer moves) - (what about the keys? --gt use
surrogate keys!) - Create current and previous value fields (for
instance changing sales territories)
64Surrogate keys
- Because the meaning of IDs change (SKUs,
moving customers etc.) - Because concatened primary keys are impractical
- Keep external keys as (a) dimension field(s)
- Use plain integers for data warehouse keys (users
shouldnt see them, they are just used for joins) - In short Always repeat ALWAYS use them!
- You will want to hide them from the users by way
of using views (ROLAP) and external, natural key
values, which the users are familiar with
65SCD Type 2 The Past
Customer dimension
Sales facts
66The Change The customer moves
Customer dimension
Sales facts
SCD Type 2 History is preserved, but how many
customers do we have?
67Effective Dates?
- Effective_begin_date effective_end_date
- Might be the only way to deal with late arriving
records - But
- What is the meaning of Manufactured from/to
versus Sold from/to? - Which attributes are affected?
- Makes query construction complicated
- If you use them, use a current_flag also!
68Type 6 (231)
- When sales districts change randomly
- Sales Team Key
- Sales Team Name
- Sales Physical Address
- Begin Effective Date
- End Effective Date
- Is_current_flag (type 2)
- Current District (type 1)
- Old district (type 3)
- .
69Impact of SCD gt 1
- Small matter of programming
- How to detect changes?
- Cyclic Redundancy Check (CRC) is a possibility
- Which changes are important?
- Type 6 requires many updates
- Changes can cascade
70Pragmatic preservation of history
- Make historical copies of your MOLAP cubes or
ROLAP databases per year - Make copies of the complete dimension, when major
changes occur, and use those copies for
historical analysis, maybe in a separate
environment
71The Facts/Measures
- Mostly raw numeric items, relevant measures, and
dimension keys only. Can signify events or
coverage - Try to use as few measures as you can get away
with, these are costly - From some million to more than a billion
observations - Items are typically additive. May be
semi-additive or non-additive in special cases - Access primarily via dimensions
- Families of facts are common
72Value of additivity
- Prevent incorrect computations
- Percentages and other statistical measures cannot
always be simply added together - For example, average bank balances
- Good advice
- Store base measures
- Calculate percentages and other statistics when
facts are retrieved - Be careful with NULLs in the database!
73Additive measures
- Numeric datatypes
- Units sold
- Dollars sold versus per unit dollars
- Claim amount
- Discount dollars
- Profit before tax
- Tax dollars
- Service charges
- Number of calls
- Number of transactions
74Typical facts
- Sales and purchases
- Daily, weekly, monthly, quarterly sales
- Policies sold, claims sold
- Orders, shipments
- Budget forecasts, actuals
75Reasons for going to the transaction level
- Behavior analysis
- Time-of-day analysis / queue analysis
- Time gap analysis
- Sequential behavior
- Fraud detection
- Cancellation warnings
- Basket analysis
76Technically speaking (ROLAP)
- Fact table
- Must have primary key
- Joined to dimensions through foreign key
references - Usually physically sorted by time dimension and
the primary analysis path
77Mystery fields in the fact records
- Facts Only measures and keys to dimensions
- Sometimes you see fields, which are not that and
which also not appear to be textual attributes of
dimensions or other foreign keys. - Most often these fields are codes and are sparse
- If they are really necessary, try to create one
or more mystery dimensions out of them - Look at correlations between values of the
mystery fields - Assume X has 200 values and Y has 1000 values
- If 1000 combinations of X, Y exist, then X is a
parent of Y - If 100000 combinations exist, then they are
completely uncorrelated, ie. two dimensions.
78MOLAP Hierarchies in the Measures Dimension
- Some (most) MOLAP products allow you to set up
hierarchies within the measures dimension (a.k.a.
the Accounts Dimension in Essbase) - This is particularly useful in financial
reporting - Requires some level of manually entered
definitions - ROLAP
- Push down parent to its children
- More than one fact table
- A helper table (discussed later)
79An Essbase Example
80Using degenerate dimensions(ROLAP)
- Unique, primary key of a sales fact table
- Time
- Product
- Store/Register
- Promotion
- Customer
- Employee
- Ticket (degenerate)
- Line (degenerate)
81Good reasons for getting the fact table primary
key right
- Global Warming
- Avoid more rows than necessary (if granularity of
fact and dimensions do not match) - Lost dimensions
- Could be the reason for the problem
- Lost attributes
- Not getting a dimension detailed enough
- Low-cost Insurance
- For avoiding duplicate rows
- Kids and Matches
- Nobody will be tempted to join two fact tables
- (Thanks to Jim Stagnitto of Questral, Inc.)
82Technically speaking (ROLAP)
- Fact table
- non key columns are usually not indexed, rapid
access is through the dimensions - Columns often occur in sum(), rank(), min()
functions
83Reliable relations(ROLAP)
- All joins between dimensions and fact tables
- Completely understood
- One to many (dimension to fact) relation based on
foreign key references of the fact tables
multi-part key - Referential integrity enforced. Always
84Oops - forgot one thingThe Indexes!
Rate period
Cust
In an ordinary (universal) database, you will
have 10 indexes on the fact table 1 PK 9 FKs.
Discount type
Call type
Call detail
Access method
Batch
Juris- diction
Hour
Day
85Sample ROLAP Index Calculation
- You want to
- Keep the
- number of
- dimensions down!
- Use integer keys!
- Find alternatives
- to B-trees!
86Audit, Balance and Control
- Source feed file name for the row
- Job instance that processed the row
- Record number in the feed file
- Can also contribute to a unique key for the row
87Families of Facts
- Related facts
- Aggregated facts
88Related Facts
- Shared dimensions must be the same
- Value chain, eg. in Manufacturing
Manufactoring Inventory
Manufactoring Shipments
Distribution Inventory
Distribution Shipments
Store Inventory
Store Sales
Flow of Product Each Process a set of Facts
89Drilling Across The Value Chain
Customer Dimension
Time Dimension
Manufactoring Inventory
Manufactoring Shipments
Distribution Inventory
Distribution Shipments
Store Inventory
Store Sales
Product Dimension
90Drilling AcrossThe Budgetting Chain
Time Dimension (Month)
Department Dimension
Budget Foreign keys Budget amount
Line Item Dimension
Account Dimension
Commitments Foreign keys Commitment Amount
Payment Dimension
Commitment Dimension
Payments Foreign keys Payment Amount
91Building SupermartsThe Data Warehouse Bus
Architecture
- Conformed Dimensions
- Standard Fact Definitions
- Revenue, Profit, Price, Cost etc.
- Granularity at the lowest level in front of the
data marts - Data Marts constructed from these standard
sources as necessary
92Aggregated Facts (ROLAP)
- Multiple fact tables
- Share one or more dimensions
- Daily fact table
- Monthly fact table
- Monthly Category fact table
- Caveat What keys to use in dimension tables?
- Do not use level fields!
- Use MOLAP or HOLAP whereever possible!
93Drawbacks of aggregate tables
- There are so many of them!!!
- You is the one, who must manage them!
- All they do for you is enabling you to get better
performance in routine queries - The application is programmed to your aggregation
scheme if that changes, then . - Digression If you also use logical partitioning,
you will have hundreds of tables!
94Aggregate Tables in ROLAP
- Be careful out there!
- You must use an aggregate navigator
Client tool
Plain SQL
Metadata and statistics
Navigator
Aggregate aware SQL
Data and aggregates
DBMS
95Families of Facts
- Heterogeneous Productlines, such as in Banking
Month_key Product_key Customer_key Status_key Earn
ed_dollars Paid_dollars Average_balance ----------
--- Num ATM Trans Num Branch Trans Num
Overdrafts Tot Overdraft Fees Overdraft
Limit Declined Trans
Core keys and facts, kept in one common
table Facts applicable only to checking
accounts
96Families of Facts
- Transactions
- Date_key
- Product_key
- Sales_person_key
- Customer_key
- Transaction_key
- Amount
- And/or snapshots
- Month_key
- Product_key
- Sales_person_key
- Customer_key
- Status_key
- Earned_dollars
- Paid_dollars
- Average_balance
Why not both?
97Using snapshots
- When transactions are not pieces of revenue
- Deposits / withdrawals
- Payments in advance
- Insurance coverage premiums
- Consider a current rolling snapshot (period to
date) - Consider a status dimension
- Many fields in the snapshot fact table, some
possibly semi-additive
98Related Facts in MOLAP
- MOLAP products allow only one measures dimension
per cube - You must, if necessary at all, combine related
facts into one measures dimension - Use a Version (Scenario) dimension with
values like eg. - Actual
- Budget 2001-01
- Prognosis 2001-05-01
- Maybe also a Type dimension, eg.
- Income
- Expenses
- Taxes
- Or build several cubes, one for Income, one for
Expenses etc. - Many MOLAP products allow you to combine cubes
into Virtual Cubes using a library of shared
dimensions
99Similar models in each industry
- A common framework for every industry
- Retail
- Telecommunications
- Transportation
- Insurance
- Healthcare
- Manufacturers
- Banking
- Government
- Websites
- E-business
100Retail (1) The Grocery Store
101Retail (2) Orders
102Telco The Billing CDR
103Transportation
104Insurance (1) Bus Architecture
105Insurance(2) Some fact tables
106Manufacturing (1)
107Manufacturing (2)
108Manufacturing (3)
109Inventory
110Deliveries
111Drilling Across The Value Chain
Customer Dimension
Time Dimension
Manufactoring Inventory
Manufactoring Shipments
Distribution Inventory
Distribution Shipments
Store Inventory
Store Sales
Product Dimension
112Banking
113Website Analysis
- Going beyond log file statistics
- WebTrends, Analog, NetTracker etc.
- Time-series analysis is necessary
- What happened during a site visit?
- Why was the visit abandoned?
- What is the effectiveness of a targetted
promotion? - What is the trend in the above over time?
- The Clickstream Data Warehouse
114So, whats new?
- From Sales Facts to User Activity Facts
- (How) do we know the User (customer)?
- Sessions
- Pages
- Events
- Probable cause of the visit/sale
- Where did the customer come from?
- The World Wide Web (247365, multiple languages,
cultures, timezones ) - From CRM to eRM electronic Relationship
Management
115Website Design also matters
- The clickstream data warehouse needs information
about - Pages
- Cookies
- Users
- Clickable objects
- URLs
- Events
- Etc. Etc.
- This content and event information must be
presentable to end-users! - Some can be obtained by using log file processors
- It is likely you will have to do a considerable
amount of data clean-up, if the website is not
well-designed!
116Web effects on old dimensions
- Calender
- Local time, global time
- Time of Day
- Customer
- Becoming a user dimension
- Cookies
- Named users
- Integration with eg. CRM
- The global perspective
- Promotion
- Must take web advertising into consideration
- Dynamic, individual, targeted special offers,
maybe in real time
117New Web Dimensions
- Page
- Event
- e.g. Open Page, Refresh Page, Click Link, Enter
Data - Session
- Type of session, context/mission, success etc.
- Referral
- How did the customer/visitor arrive?
118Choosing the grain
- Page event
- Session
- Aggregated levels
119E-business
120The Data Webhouse
- Coined by Ralph Kimball in The Data Webhouse
Toolkit, Ralph Kimball and Richard Merz, Wiley
2000, ISBN 0-471-37680-9 - The Data Warehouse on the Web
- The Data Warehouse as the driver of the website
(closing the loop)
121The Birth of the Data Webhouse
122Synonyms
- Basic function
- avoids multiple join paths between two tables
- makes schema more legible and thus less prone to
query formulation errors - use view to rename columns for ease of use with
query tools - Rationale over separate dimension tables
- Reduces need to duplicate data
- Simplifies administration
- CREATE FIRST_OPEN_TIME AS SYNONYM FOR TIME
- Not all databases support this - you may use
views instead
123Typical synonyms
- City tables
- origin and destination (travel facts)
- Period tables
- order date and ship date
- Customer tables
- customers for ship to and bill to
124Factless Fact Tables
Promotion Coverage Fact Table time_key product_key
store_key promo_key
Time
Product
Promotion
Store
125A Factless snapshot table
126The Effect of Multidimensional Design on the ETL
system
- Existence checks
- Denormalisation (N-way, multilevel merge)
- Lookup values
- Deal with missing values for foreign keys
127Missing Values
- Which are the problems?
- Default values in source data
- Dummy values in source data
- Missing values in source data
- What can be done?
- Fix the source system(s) it is a data quality
issue - Try to infer values
- Use a dummy value
- More than one dummy value? (Unknown, Unavailable,
Not applicabale) - Missing dimension keys?
- (The dummy dimension record Cust.No. 1, Name
Unknown) - What about dummy dates? And numeric values?
- (See Dealing with Missing Values In The Data
Warehouse from www.sbti.com (Author John Hess,
Stonebridge Technologies 1998, now to be found on
www.olap.it/articles !)
128The 38 subsystems of ETL!
From the new book The Data Warehouse ETL
Toolkit, Ralph Kimball and Joe Caserta, Wiley
2004, ISBN 0764567578
- Extract
- Change data capture
- Data profiling
- Data cleansing
- Data conformer
- Audit dimension
- Quality screener
- Error event handler
- Surrogate key creation
- Slowly Changing Dims
- Late arriving dims
- Fixed hierarchy dims
- Variable hierarchy dims
- Multivalued dims
- Junk dimensions
- Facttable transaction load
- Periodic snapshot load
- Accum. Snapshot load
- Surrogate key pipeline
- Late arrivals handler
- Aggregate builder
- Cube builder
- Partition builder
- Dimension manager
- Facttable provider
- Job scheduler
- Workflow monitor
- Recovery and restart
- Parallelizing system
- Problem escalation
- Version control
- Version migration
- Lineage dependencies
- Compliance reporter
- Security
- Backup
- Metadata repository
- Project management
129Documentation
- Data Mart structure
- Logical Model
- Read The Data Warehouse Lifecycle Toolkit for the
complete picture!
130Advanced Design Issues
- When the going gets tough,
- the tough get going!
131Tricky stuff
- Monster Dimensions
- Multivalued Dimensions
- Multilevel Hierarchies
- Really Difficult Business Questions
132Monster Dimensions
- Some dimension tables grow VERY big, eg.
customers, who also have many attributes - Some demographic attributes change often
(income, number of children etc.) - Some demographic attributes are not used very
much - Result A lot of wasted space, indexes and
complexity in keeping up to date
133Dealing with Monster Dimensions
Demographics dimension
Customer dimension
customer_key name address birth date other
stable attributes
demographics_key income_band purchases_band no_chi
ldren education_level etc.
Sales facts
Customer_key Demographics_key ......
Note Contains all possible combinations predefin
ed and preloaded!
134Multivalued Dimensions
- Healthcare billing fact
- Date
- Patient
- Doctor
- Location
- Service performed
- Diagnosis
- Payer
- etc.
Do people only have one illness at a time?
135What are the alternatives?
- Forget the dimension!
- Choose one value (Primary Diagnosis)
- Fixed number of diagnoses
- Going MM?
136Helper Tables
Payer
Patient
Doctor
Diagnosis Group diagnosis_group_key diagnosis_key
weight_factor
Bill fact . diagnosis_group_key . . .
Location
Diagnosis Dimension diagnosis_key description type
category
Service
Day
137Solving a Multivalued Dimension with a Helper
Table
- Use only when absolutely necessary
- Weight factors usually equal size within a group
and should always add up to one - May use a view to hide the helper table
- Healthcare, Retail Banks, SIC-codes
138Customers Hierarchies
The Problem is The Customer decides how many
levels there are!!!
139The Customer Hierarchy Model
customer_key name address base_org always
populated level5_org level4_org level3_org level2_
org populated if 2 or more top_org always
populated other attributes
140Using a Helper Table
Service
Seller
Customer Dimension customer_key etc. etc.
Customer_tree_path parent_customer_key child_custo
mer_key depth_from_parent lowest_flag topmost_flag
Bill fact . customer_key . . .
Contains one record for each separate path from
each node to itself and to every node below it
Day
141Pros and cons
- Works like a normal dimension constraint
- Use depth_from_parent to eg. get only immediate
subsidiaries - Use lowest_flag to get only leaf nodes
- Reversing the joins will take you upwards
- Maybe add begin_effective_date and
end_effective_date, but be careful - Maybe add weighting factor to support partially
owned subsidiaries - May grow very big, quickly!
142The bigger picture
- Classification of analytical applications
- Data Mining besides SQL and OLAP
- International issues
- Recommandations
143Really DifficultBusiness Questions
- Simple Constraints
- Simple Subqueries
- Correlated Subqueries
- Simple Behavioral Queries
- Derived Behavioral Queries
- Progressive Subsetting Queries
- Classification Queries
1441) Simple Constraint
- Constraints against literal constants
- Show the sales of candy products in September 1997
1452) Simple Subquery
- Constraints against a global value found in the
data - Show the sales of candy products in September
1997 in those stores that had above average sales
of candy products
1463) Correlated Subquery
- Constraints against a value defined by each
output row - Show the sales of candy products for each month
of 1997 in those stores that had above average
sales of candy in that month
1474) Simple Behavioral Query
- Constraints against values resulting from an
exception report or a complex series of queries
that isolate desired behavior - Show the sales of those candy products in
September 1997 whose household penetration for
our grocery chain in the 12 months prior to
September were more than two standard deviations
less than the household penetration of the same
products across our 10 biggest retail competitors
1485) Derived Behavioral Query
- Constraints against values found in set
operations on more than one complex exception
report or series of queries - Show the sales of those candy products identified
in example 4, and which also experienced a
merchandise return rate of more than two standard
deviations greater then our 10 biggest retail
competitors - (Intersection of two behavioral queries)
1496) Progressive Subsetting Query
- Constraints against values, as in number 4, but
temporally ordered so that membership in an
exception report is dependent on membership in a
previous exception report - Show the sales of those candy products in example
number 4 that were similarly selected in August
1997 but were not similarly selected in either
June or July 1997
1507) Classification Queries
- Constraints on values that are the results of
classifying records against a set of defined
clusters using nearest neighbor and fuzzy
matching logic - Show the percentage of low-calorie candy sales
contained in the 1000 market baskets whose
content most closely matches a young,
health-conscious family profile
151Challenges
- Most query tools do not support the more complex
query types - You may build in support for those, which your
users need
152Multi-step approach (SQL)
Query
Query
Result table
Query
Build result table(s) that contains only keys
(and maybe also time information) of those
objects, who display the desired, special
behavior(s)
Query
153Extended ROLAP or MOLAPfor Behavioral Analysis
Sales facts
Regular dimensions
Special behavior dimension(s) )
Time_key Product_key Customer_key Store_key Promot
ion_key Ticket_number Line_number Units_sold Dolla
rs_sold
Time dimension
Product_key defined by complex behavior study
Product dimension
Customer dimension
Store dimension
Promotion dimension
) May be hidden in views
154Preparing for Data Mining
- Rigorous quality assurance on the values, which
you are going to mine - Supply values as text, not codes
- Eliminate context dependencies
- Flag normal, abnormal, out of bound or impossible
facts - Mask out random or noise values
- Uniform treatment of NULL/missing/unknown
- Use Status dimensions (eg. Customer about to
cancel etc.) - Find training, test and evaluation sets
- Supply computed values (eg. Profit)
- Band continous values
155Data Mining Algorithms
- Clustering
- Decision Trees
- Neural Networks
- Statistics
- Fuzzy Logic
- Genetic Algorithms
- Ants
- Hybrids
156Integrated Data Mining
- The Multidimensional Model is ideal as a source
for Data Mining - ROLAP is necessary for time-series / sequential
analysis - Ideal case for integration of software tools
- Microsoft Analysis Services 2000
- Oracle 9i
- Others?
157Microsoft SQL Server 2000 Analysis Services
158International Issues
- Languages
- Alphabets and character sets
- Names
- Adresses
- Numbers
- Telephone numbers
- Currencies
- Time of day
- Calendars
- Handling unsupported characters
- Collating sequencies
- See The Data Webhouse Toolkit, Ralph Kimball
and Richard Merz, Wiley 2000, ISBN 0-471-37680-9
159Comments on MS SQL Server 2005
- Unified dimensional model
- Multivalued dimensions
- Possible to have many flat dimensions
- No need to build cubes can put a layer on top
of a non-multidimensional schema - But
- Users need the ease of use
- Machines need the speed
- We will see, just how well this works in a short
period of time
160The biggest challenge Data Quality
161What is Data Quality?
- Accuracy
- Does the data accurately represent reality?
- Integrity
- Is the structure of data and relationships among
entities and attributes maintained consistently? - Consistency
- Are data elements consistently defined and
understood? - Completeness
- Is all necessary data present?
- Validity
- Do data values fall within acceptable ranges
defined by the business? - Timeliness
- Is data available when needed?
- Accessibility
- Is the data easily accessible, understandable,
and usable?
162Data quality and integration issues
- Legacy data issues
- Data accessibility availability
- Insufficient time to analyse
- Inaccurate Metadata, documentation
- Lack of resources
- Disparate systems
- Data ownership issues
- Semantic differences
- Structure violations
- Rule violations
-
Home-grown applications
Web applications
ERP
CRM
Legacy applications
Data Warehouse
163Why do data integration projects fail?
- The source data is not fully understood
- Complexity is underestimated
- Planning is actually guesswork
- Systems do not join as expected
- Delays when analysts/programmers need to
interpret results - Poor data analysis leads to complex development
cycle and unpredictable rework - Problems are uncovered ad-hoc and late
- Manual analysis on samples is time consuming,
laborious and inaccurate - Full volume testing is done too late
164Data profiling analysis
- Data profiling analysis is critical to
- Understand the scope and nature of the problem
- Determine success criteria
- Accurate planning
- Automated tools are available
- Do it right the first time
- Then keep on doing it!
165Examples of how to find data quality problems
using Data Profiling
- How do you identify those customer records where
values are missing or incomplete ? - Are the product codes correct in my order entry
system? - Will my data actually integrate?
- Are all the order shipment dates correct?
- Are my supplier details correct?
- How do I find misspellings of any attribute
values? - What about redundant data how do I find it?
- Are the relationships held within my data
consistent ? - Will my data support the new business
requirements?
166Benefits of automated Data Profiling
- Improve Business responsiveness
- Time to market reduced
- Enhance Data Quality
- Ensure data is accurate and fit-for-purpose
- Project Planning
- Early Accurate
- Reduce Risks
- Identify ALL data-related issues at the start
- Manage Resources
- Deliver with less effort
- Reduce Costs
- Reduce cost of analysis and build
167Data Quality is a Business Issue
- The Business owns the data
- Set data quality standards across the company
- Build company-wide metadata knowledge
- Data Quality must be managed
168Data Warehouse in the future?
169Data Warehouse andthe Enterprise Nervous System
- Contemporary Enterprise Information Architecture
calls for - Realtime
- Integration
- Message brokers
- Service Oriented Architectures etc.
- What is the role of the Data Warehouse in this?
- This is based on an actual and recent
architecture study for a Danish government body
170Characteristics of a mature Data Warehouse
- Vision One environment, one version of the Truth
- Integration of data from disparate sources
- Refinement of data
- Searching and browsing
- Production data
- Other data (partners, public / purchased
information) - Master data
- The Historical Data Warehouse
- Management Information
- Statistics
- Data Exchange
- Production reporting
- Ad hoc
171More characteristics
- Much multidimensionality
- Refined data, both details and aggregations
- 1000 frequent users
- Also external recipients
- New systems The interface(s) to the DW should be
defined - Development projects Decide for either a
production system or a Data Warehouse (based on
technical feasibility) - Standardised data model (most often not
documented, frequently changed) - Analysis
- Reporting
- Ad hoc tasks (incl. operational one-time systems)
172Components of a matureData Warehouse
- Database(s)
- Normalised
- Multidimensional
- Load processes
- Predominantly batch
- Homegrown (COBOL)
- ETL jobs in e.g. Informatica
- BI tools such as e.g. Business Objects
- Applications
- Predefined reports
- Ad hoc environments
- Statistics
- SAS
173Business benefits from a DW
- Holistic view of data across disparate systems
- Integration of data from different sources,
including external partners - History
- Refinement of data
- Presentation of data for business people
- Analysis, incl. decision support
- Reporting
- Ad hoc solutions
- Data foundation for statistics
- Data exchange
174The Enterprise Nervous System
Portal platform
Intranet (internal portal)
Internet (external portal)
Portlet api
Portlet api
CMS- modules
Email/ Calendar
Workflow
New services
Data mining
Eksternal WS UDDI
Internal WS UDDI
Info services
New services
CMS- modules
Info services
Web Service
OCES certificate handling
Service- and Integration platform (ENS)
Common services
Analysis
Update
Query
Statistics
Report
Infrastructure services - single
signon security, administration, mv.
Business Activity Monitoring and Workflow
management
Integration platform with persistence
Integration Broker
User catalog
New systems data
Data- Warehouse
Messages
CMS data and metadata
175The Data Warehouse hasserved us well
- The concept of a centralised database is maybe
not as necessary from now on - But the practises of Data Warehousing are
important - Good Data Management
- Good performance
- Data refinement
- Data presentation
- Realtime integration is certainly useful and
practical - Business Intelligence is evolving, Business
Activity Monitoring is a natural next. - We have learned a lot now is the time to do it
right!
176What is going on in data modelling?
strong semantics
Source Leo Obrst, The Mitre Corp.
weak semantics
177Taxonomies The unstructured world of documents
178Conceptual model (ORM)
Syntax defined in Nijssen, G.M. and T.A. Halpin.
Conceptual Schema and Relational Database Design
- A fact oriented approach. Prentice Hall 1989.
179RDF based schema for Family
Property labels t rdftype s
rdfssubClassOf d rdfsdomain r
rdfsrange et rdfsxcollectionElementType
Kilde Stephen Cranefield, Journal of Digital
Information, Volume 1 Issue 8, Article No. 44,
2001-02-15
180UML Class diagrams
Kilde Stephen Cranefield, Journal of Digital
Information, Volume 1 Issue 8, Article No. 44,
2001-02-15
181Web Ontology Language (OWL)
- lt?xml version"1.0"?gt
- ltrdfRDF
- xmlns"http//mySite.com/myOntology"
- xmlnsrdf"http//www.w3.org/1999/02/22-rdf-synt
ax-ns" - xmlnsrdfs"http//www.w3.org/2000/01/rdf-schema
" - xmlnsowl"http//www.w3.org/2002/07/owl"
- xmlbase"http//mySite.com/myOntology"gt
- ltowlClass rdfID"Person"/gt
- ltowlClass rdfID"Mother"gt
- ltrdfssubClassOf rdfresource"Person"/gt
- lt/owlClassgt
- ltowlObjectProperty rdfID"hasChild"gt
- ltrdfsdomain rdfresource"Mother"/gt
- ltrdfsrange rdfresource"Parent"/gt
- lt/owlObjectPropertygt
- ltowlequivalentClassgt
- ltowlClassgt
- ltowlintersectionOf rdfparseType"Collection"
gt - ltowlClass rdfabout"Mother"/gt
- ltowlRestrictiongt
- ltowlonPropertygt
- ltowlObjectProperty rdfabout"hasChild"/gt
- lt/owlonPropertygt
- ltowlsomeValuesFromgt
- ltowlClass rdfID"Parent"/gt
- lt/owlsomeValuesFromgt
- lt/owlRestrictiongt
- lt/owlintersectionOfgt
- lt/owlClassgt
- lt/owlequivalentClassgt
182Information Management II
- Data Warehouse showed us how to
- Semantics is the key
- Ontologies is the foundation
- Repositories is the technology
- Gartner Enterprise Information Architecture
- The sponsor is The needs for integration!
183Wrap up
184Listen to Ralph
- Embed all knowledge of the data in the data
- Stick to one level of dimension tables
- Aggregates should be separate tables
- Use an aggregate navigator (serverside)
- Even better Use MOLAP or HOLAP
- Stick to simple star schemas
- Properly design conformed dimensions and
conformed facts, first!
185Listen to me
- K.I.S.S.
- Keep It Simple, Stupid!
186Ralph Kimballs 20 criteria
187Literature
- The Data Warehouse Toolkit Second Edition,
Wiley 2002, ISBN 0-471-20024-7 - The Data Warehouse Lifecycle Toolkit, Ralph
Kimball, Laura Reeves, Margy Ross and Warren
Thornthwaite, Wiley 1998, ISBN 0-471-25547-5 - The Data Webhouse Toolkit, Ralph Kimball and
Richard Merz, Wiley 2000, ISBN 0-471-37680-9 - Data Warehouse Design Solutions, Christopher
Adamsan and Michael Venerable, Wiley 1998, ISBN
0-471-25195-X - Microsoft OLAP Solutions, Erik Thomsen, George
Spofford and Dick Chase, Wiley 1999, ISBN
0-471-33258-5 - Improving Data Warehouse and Business
Information Quality, Larry P. English, Wiley
1999, ISBN 0-471-25383-9
188Usefull web-adresses
- www.ralphkimball.com (Ralph Kimball)
- www.dwinfocenter.org (Larry Greenfields Data
Warehouse Info Center) - www.intelligententerprise.com (Intelligent
Enterprise Magazine, previously DBMS magazine -
many articles by Ralph Kimball and other fine
people - for example August 97 A Dimensional
Modelling Manifesto) - www.datawarehousing.com (home of DWLIST)
- www.tdwi.org - The Data Warehouse Institute
- www.iaidq.org, The International Association for
Information and Data Quality (IAIDQ) - http//www.tondering.dk/claus/calendar.html,
everything you would ever want to know about
calendars!
189What to do first?
- Buy and Read Ralph Kimballs The Data Warehouse
Toolkit Second Edition, Wiley 2002, ISBN
0-471-20024-7 - Buy and read The Data Warehouse Lifecycle
Toolkit, Wiley 1998, ISBN 0-471-25547-5, Ralph
Kimball, Laura Reeves, Margy Ross and Warren
Thornthwaite - Buy and read Data Warehouse Design Solutions,
Christopher Adamson and Michael Venerable, Wiley
1998, ISBN 0-471-25195-X. - Buy and read The Data Warehouse ETL Toolkit,
Ralph Kimball and Joe Caserta, Wiley 2004, ISBN
0-7645-6757-8 - Just Do It!
190Thank You!
thomasf_at_tf-informatik.dk 45-40 54 83 40
(GSM) 04.93.33.88.93 (occasionally) 45-49 70 83
40 (Landline)