Title: NIC Exposure Level Training
1NIC Exposure Level Training
- Vijayendra Gururao
- Business Intelligence Consultant
2Agenda
- Data warehousing Concepts - Day 1
- Govt Case Study - Day 2
- Defense HR Case study - Day 3
- Manufacturing Case Study - Day 4
- Data mining - Day 5
3The Evolution of Business Intelligence
4Data Warehousing
5Introduction
- Definitions
- Legacy Systems
- Dimensions
- Data Dependencies Model
- Dimensional Model
6An ER Model
7Why Data Warehouses?
- To meet the long-sought-after goal of providing the user with more flexible databases containing data that can be accessed in many different ways.
8OLTP vs. OLAP
- OLTP (Online Transaction Processing) has been the mainstay of IS and DP for the last thirty years. Most legacy systems are quite good at capturing data but do not facilitate data access.
- OLAP (Online Analytical Processing) is a set of procedures for defining and using a dimensional framework for decision support
9The Goals for and Characteristics of a DW
- Make organizational data accessible
- Facilitate consistency
- Adaptable and yet resilient to change
- Secure and reliable
- Designed with a focus on supporting decision
making
10The Goals for and Characteristics of a DW
- Generate an environment in which data can be sliced and diced in multiple ways
- It is more than data; it is a set of tools to query, analyze, and present information
- The DW is the place where operational data are published (cleaned up, assembled, etc.)
11Data Warehousing is Changing!
Application requirements, not just data requirements, are now driving the need:
- Customer Relationship Mgmt.
- Campaign Management
- ERP
- Call Center
- Knowledge Management
- Target Marketing
- Supply Chain
- E-commerce
12Organization of data in the presentation area of
the data warehouse
- Data in the warehouse are dimensional, not normalized relations
- However, data that are ultimately presented in the data warehouse will often be derived directly from relational DBs
- Data should be atomic someplace in the warehouse, even if the presentation is aggregate
- Uses the bus architecture to support a decentralized set of data marts
13Updates to a data warehouse
- For many years, the dogma stated that data warehouses are never updated.
- This is unrealistic, since labels, titles, etc. change.
- Some components will, therefore, be changed, albeit via a managed load (as opposed to transactional updates)
14Basic elements of the data warehouse
(Diagram of the four basic elements and the flow between them:
- Operational Source Systems: data are extracted into the staging area
- Data Staging Area: services that clean, combine, and standardize data and conform dimensions; no user query services; data stored in flat files and relational tables; processing is mainly sorting and sequential processing
- Data Presentation Area: the DW bus of conformed facts and dimensions; Data Mart 1 is dimensional, holds atomic and summary data, and is based on a single business process; Data Mart 2 has a similar design; each mart is loaded from staging and accessed by users
- Data Access Tools: ad hoc query tools, report writers, and analytical applications such as modeling, forecasting, scoring, and data mining)
15Data Staging Area
- Extract-Transformation-Load (ETL)
- Extract: reading the source data and copying the data to the staging area
- Transformation:
  - Cleaning
  - Combining
  - De-duplicating
  - Assigning keys
- Load: present the data to the bulk-loading facilities of the data mart
16Dimensional Modeling Terms and Concepts
- Fact table
- Dimension tables
17Fact Tables
- Fact table: a table in the data warehouse that contains
  - Numerical performance measures
  - Foreign keys that tie the fact table to the dimension tables
18Fact Tables
- Each row records a measurement describing a transaction
  - Where?
  - When?
  - Who?
  - How much?
  - How many?
- The level of detail represented by this data is referred to as the grain of the data warehouse
- Questions can only be asked down to a level corresponding with the grain of the data warehouse
19Dimension tables
- Tables containing textual descriptors of the business
- Dimension tables are usually wide (e.g., 100 columns)
- Dimension tables are usually shallow (hundreds of thousands to a few million rows)
- Values in the dimensions usually provide
  - Constraints on queries (e.g., view customers by region)
  - Report headings
20Dimension tables
- The quality of the dimensions will determine the quality of the data warehouse; that is, the DW is only as good as its dimension attributes
- Dimensions are often split into hierarchical branches (i.e., snowflakes) because of the hierarchical nature of organizations
  - Product part → Product → Brand
- Dimensions are usually highly denormalized
21Dimension tables
- The dimension attributes define the constraints for the DW. Without good dimensions, it becomes difficult to narrow down to an answer when the DW is used for decision support
22Bringing together facts and dimensions Building
the dimensional Model
- Start with the normalized ER Model
- Group the ER diagram components into segments based on common business processes and model each as a unit
- Find M:M relationships in the model with numeric and additive non-key facts and include them in a fact table
- Denormalize the other tables as needed and designate one field as a primary key (a minimal sketch of the resulting schema follows)
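To make these steps concrete, here is a minimal sketch of the kind of star schema they produce, built in SQLite from Python. The table and column names (f_orders, d_time, d_product, d_customer) are illustrative assumptions, not the schema of any of the case studies.

```python
import sqlite3

# In-memory database used purely to illustrate a star schema layout.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
-- Dimension tables: wide, denormalized, textual descriptors, surrogate primary keys.
CREATE TABLE d_time     (day_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE d_product  (product_id INTEGER PRIMARY KEY, description TEXT, product_type TEXT, product_line TEXT);
CREATE TABLE d_customer (customer_id INTEGER PRIMARY KEY, last_name TEXT, first_name TEXT, address TEXT);

-- Fact table: numeric, additive measures plus a foreign key to each dimension.
-- The grain here is one row per product, per customer, per day.
CREATE TABLE f_orders (
    day_id         INTEGER REFERENCES d_time(day_id),
    product_id     INTEGER REFERENCES d_product(product_id),
    customer_id    INTEGER REFERENCES d_customer(customer_id),
    cost           REAL,
    number_ordered INTEGER
);
""")
conn.commit()
```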
23A Dimensional Model
24Kimball Methodology
25Review A Private Data Mart
- A data mart containing one fact table and three dimension tables. We delivered all the tables by executing a fact build.
- What if we want to add another fact table called F_Sales that will reference the three existing dimension tables?
26Understand Conformed Dimensions
- A conformed dimension is a dimension that is standardized across all data marts.
(Diagram: Sales Fact, Distribution Fact, and Order Fact sharing conformed dimensions such as Time, Location, Customer, Distributor, Promotion, and Product.)
27Advantages of Conformed Dimensions
- Deliver incremental data marts in a short period of time.
- Independent data marts become part of a fully integrated data warehouse.
- Deliver a consistent view across your business.
28Conformed Dimensions Within Bus Architecture
- Identifying and designing the conformed
dimensions is a critical step in the architecture
phase of data warehouse design.
29Design of Conformed Dimensions
- A commitment to using conformed dimensions is more than just a technical consideration. It must be a business mandate.
- Lay out a broad dimensional map for the enterprise.
- Define conformed dimensions at the most granular (atomic) level possible.
- Conformed dimensions should always use surrogate keys.
- Define standard definitions for dimension and fact attributes.
30Granularity in Conformed Dimensions
- Conformed dimensions should be defined at the most granular (atomic) level so that each record in these tables corresponds to a single record in the base-level fact table.
(Example schema:
- Order Fact: Day Id, Product Id, Customer Id, Cost, Number Ordered
- D_Time (Day): Day Id, Day, Month, Year
- D_Product: Product Id, Description, Product Type, Type Description, Product Line, Line Description
- D_Customer: Customer Id, Last Name, First Name, Address)
31Flexibility of Conformed Dimensions
- Conformed dimensions are usually designed within star schema data marts. For fact tables of multiple granularities, higher-level views of dimensions can be used (or a snowflake table), as in the sketch below.
(Example: the day-grain Order Fact (Day Id, Product Id, Customer Id, Cost, Number Ordered) references Time (Day): Day Id, Day, Month Id, Period; the month-grain Sales Fact (Month Id, Product Id, Customer Id, Amount Sold, Revenue) references a Time (Month) view or snowflake table: Month Id, Month, Period; both share the Customer and Product dimensions.)
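A minimal sketch of the higher-level-view idea, assuming a day-grain time dimension; the table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
-- Day-grain conformed time dimension.
CREATE TABLE time_day (day_id INTEGER PRIMARY KEY, day TEXT, month_id INTEGER, period TEXT);

-- Month-level view: one row per month, usable by a month-grain fact table
-- without creating a second, unconformed time dimension.
CREATE VIEW time_month AS
SELECT DISTINCT month_id, period
FROM time_day;
""")
conn.commit()
```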
32So, What is a DW?
- "A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data in support of management's decisions" (W.H. Inmon, the father of DW)
33Subject Oriented
- Data in a data warehouse are organized around the
major subjects of the organization
34Integrated
- Data from multiple sources are standardized
(scrubbed, cleansed, etc.) and brought into one
environment
35Non-Volatile
- Once added to the DW, data are not changed
(barring the existence of major errors)
36Time Variant
- The DW captures data at a specific moment; thus, it is a snapshot view of the organization at that moment in time. As these snapshots accumulate, the analyst is able to examine the organization over time (a time series!)
- The snapshot is called a production data extract
37Need for Data Warehousing
- Integrated, company-wide view of high-quality information (from disparate databases)
- Separation of operational and informational systems and data (for improved performance)
38(No Transcript)
39Data Warehouse Architectures
- Generic Two-Level Architecture
- Independent Data Mart
- Dependent Data Mart and Operational Data Store
- Logical Data Mart and @ctive Warehouse
- Three-Layer architecture
All involve some form of extraction,
transformation and loading (ETL)
40Generic two-level architecture
L
One, company-wide warehouse
T
E
Periodic extraction ? data is not completely
current in warehouse
41Independent Data Mart
42Dependent data mart with operational data store
43Logical data mart and @ctive data warehouse
44Three-layer architecture
45DW Design
- Mainly consists of
- Logical Design
- Physical Design
46Logical Design of DW
- Identification of
  - Entities
  - Relationships
  - Attributes
  - Unique identifiers
- Conceptual and abstract; results in fact and dimension tables
- Created using pen and paper or modeling tools
47Physical Design of DW
- Conversion of the structures gathered in logical design into a physical database structure
- Mainly driven by query performance
48Logical → Physical
49Data Characteristics: Status vs. Event Data
Example of a DBMS log entry
Event: a database action (create/update/delete) that results from a transaction
50Data Characteristics: Transient vs. Periodic Data
Figure 11-8: Transient operational data
Changes to existing records are written over previous records, thus destroying the previous data content
51Data Characteristics: Transient vs. Periodic Data
Periodic warehouse data
Data are never physically altered or deleted once they have been added to the store
52Data Reconciliation
- Typical operational data is
  - Transient: not historical
  - Not normalized (perhaps due to denormalization for performance)
  - Restricted in scope: not comprehensive
  - Sometimes poor quality: inconsistencies and errors
- After ETL, data should be
  - Detailed: not summarized yet
  - Historical: periodic
  - Denormalized
  - Comprehensive: enterprise-wide perspective
  - Quality controlled: accurate with full integrity
53Extract Transform Load
- Extract data from the operational system, transform it, and load it into the data warehouse
- Why ETL?
  - Will your warehouse produce correct information with the current data?
  - How can I ensure warehouse credibility?
54Excuses for NOT Transforming Legacy Data
- Old data works fine; new will work as well.
- Data will be fixed at point of entry through the GUI.
- If needed, data will be cleaned after the new system is populated (after the proof-of-concept pilot).
- Keys join the data most of the time.
- Users will not agree to modifying or standardizing their data.
55Levels of Migration Problem
- Existing metadata is insufficient and unreliable
  - Metadata must hold for all occurrences
  - Metadata must represent business and technical attributes
- Data values incorrectly typed and accessible
  - A value's form is extracted from storage
  - A value's meaning is inferred from its content
- Entity keys unreliable or unavailable
  - Inferred from related values
56Metadata Challenge
- Metadata gets out of synch with the details it summarizes
- Business grows faster than the systems designed to capture business info
- Not at the right level of detail
  - Multiple values in a single field
  - Multiple meanings for a single field
  - No fixed format for values
- Expressed in awkward or limited terms
  - Program/compiler view rather than business view
57Character-level Challenge
- Value instance level
- Spelling, aliases
- Abbreviations, truncations, transpositions
- Inconsistent storage formats
- Named type level
- Multiple meanings, contextual meanings
- Synonyms, homonyms
- Entity level
- No common keys or representation
- No integrated view across records, files, systems
58The ETL Process
- Capture
- Scrub or data cleansing
- Transform
- Load and Index
ETL = Extract, Transform, and Load
59The ETL Process
(Diagram: Source Systems → Extract → Staging Area → Transform → Load → Presentation System)
60Source Data
- Record the name, location, and content of the data that exists in the TPS environment:
  - File names and locations
  - Layout
  - Attribute meanings
61Extraction
- Copy specific data directly from the source
tables into a working dataset in the staging area.
62Transformation (Dimension Tables)
- Generate a surrogate key in a primary-to-surrogate key table. Make this permanent.
- Insert the surrogate key into the working dimension tables.
- Conduct any editing/cleaning operations you need (usually on the working table).
- Generate any derived attributes you need.
- Generate and retain process logs (a minimal sketch of the surrogate-key steps follows).
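A minimal sketch of the surrogate-key steps, assuming a simple in-memory key map keyed by the business key; the names key_map, working_dim, and the office attributes are illustrative, not part of the case studies:

```python
# Assign warehouse surrogate keys to incoming dimension rows.
# key_map is the permanent primary-to-surrogate table (business key -> surrogate);
# working_dim is the working dimension table being prepared in the staging area.

key_map = {"OTA": 128, "TOR": 129}           # loaded from previous runs; kept permanently
next_surrogate = max(key_map.values()) + 1   # next key to hand out

working_dim = [
    {"office_code": "OTA", "office_name": "Ottawa"},
    {"office_code": "VAN", "office_name": "Vancouver"},   # new member: needs a new key
]

for row in working_dim:
    business_key = row["office_code"]
    if business_key not in key_map:            # first time this member has been seen
        key_map[business_key] = next_surrogate
        next_surrogate += 1
    row["office_sur"] = key_map[business_key]  # insert the surrogate into the working table

print(working_dim)   # each row now carries both its business key and its surrogate key
```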
63Transformation(Fact tables)
- Join all dimensions to the fact table (using original primary keys) and insert the surrogate keys (see the sketch below)
- Generate derived facts
- Generate indicator flags
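And a matching sketch for the fact side: each incoming fact row is joined to the key map on its original business key and the surrogate is substituted before loading (the row layouts are illustrative):

```python
# Replace business keys in incoming fact rows with warehouse surrogate keys.
key_map = {"OTA": 128, "VAN": 130}   # primary-to-surrogate map built during the dimension load

incoming_facts = [
    {"office_code": "OTA", "sales_amount": 1500.0},
    {"office_code": "VAN", "sales_amount": 980.0},
]

fact_rows = []
for row in incoming_facts:
    fact_rows.append({
        "office_sur": key_map[row["office_code"]],  # surrogate key used in the warehouse
        "sales_amount": row["sales_amount"],        # additive measure
    })

print(fact_rows)
```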
64Target Data
- Describe the presentation data structure.
- Model
- Metadata
- Usage and constraints
65Flow Documentation
- DFD for the ETL process
- ERD for Source, Staging and Target databases.
- Metadata
- Usage notes.
66Steps in data reconciliation
Capture = extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse
Incremental extract: capturing changes that have occurred since the last static extract
Static extract: capturing a snapshot of the source data at a point in time
67Steps in data reconciliation (continued)
Scrub = cleanse: uses pattern recognition and AI techniques to upgrade data quality
Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
Also: decoding, reformatting, time-stamping, conversion, key generation, merging, error detection/logging, locating missing data
68Steps in data reconciliation (continued)
Transform = convert data from the format of the operational system to the format of the data warehouse
Record-level:
- Selection: data partitioning
- Joining: data combining
- Aggregation: data summarization
Field-level:
- Single-field: from one field to one field
- Multi-field: from many fields to one field, or from one field to many
69Steps in data reconciliation (continued)
Load/Index = place transformed data into the warehouse and create indexes
Refresh mode: bulk rewriting of target data at periodic intervals
Update mode: only changes in source data are written to the data warehouse
70Single-field transformation
In general, some transformation function translates data from the old form to the new form
Algorithmic transformation: uses a formula or logical expression
Table lookup: another approach (both are sketched below)
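A small sketch of both single-field approaches; the field names, formula, and lookup table are illustrative assumptions:

```python
# Single-field transformations: one source field becomes one target field.

# Algorithmic transformation: a formula converts the stored value.
def fahrenheit_to_celsius(temp_f: float) -> float:
    return (temp_f - 32.0) * 5.0 / 9.0

# Table lookup: a code in the source is replaced by its warehouse description.
STATE_LOOKUP = {"KA": "Karnataka", "MH": "Maharashtra", "DL": "Delhi"}

source_row = {"temp_f": 98.6, "state_code": "KA"}
target_row = {
    "temp_c": round(fahrenheit_to_celsius(source_row["temp_f"]), 1),
    "state_name": STATE_LOOKUP.get(source_row["state_code"], "Unknown"),
}
print(target_row)   # {'temp_c': 37.0, 'state_name': 'Karnataka'}
```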
71Multi field transformation
M:1: from many source fields to one target field
1:M: from one source field to many target fields (see the sketch below)
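A small sketch of both multi-field directions, again with illustrative field names:

```python
# Multi-field transformations.

# M:1 -- many source fields combined into one target field.
src = {"house_no": "221B", "street": "Baker Street", "city": "London"}
full_address = ", ".join([src["house_no"] + " " + src["street"], src["city"]])

# 1:M -- one source field split into many target fields.
full_name = "Smith, Mary"
last_name, first_name = [part.strip() for part in full_name.split(",", 1)]

print(full_address)            # 221B Baker Street, London
print(first_name, last_name)   # Mary Smith
```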
72Derived Data
- Objectives
- Ease of use for decision support applications
- Fast response to predefined user queries
- Customized data for particular target audiences
- Ad-hoc query support
- Data mining capabilities
- Characteristics
  - Detailed (mostly periodic) data
  - Aggregate (for summary)
  - Distributed (to departmental servers)
- Most common data model: the star schema (also called the dimensional model)
73Components of a star schema
- Fact tables contain factual or quantitative data
- Dimension tables are denormalized to maximize performance
- 1:N relationship between dimension tables and fact tables
- Dimension tables contain descriptions about the subjects of the business
- Excellent for ad-hoc queries, but bad for online transaction processing (a query sketch follows)
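For example, a typical ad-hoc star-schema query joins the fact table to the constraining dimensions and aggregates the measures. A minimal sketch using pandas, with an invented three-table schema:

```python
import pandas as pd

# Tiny illustrative star schema held in DataFrames.
d_product = pd.DataFrame({"product_id": [1, 2], "product_line": ["Soup", "Beans"]})
d_time    = pd.DataFrame({"day_id": [10, 11], "month": ["Mar", "Apr"], "year": [2004, 2004]})
f_orders  = pd.DataFrame({
    "product_id": [1, 1, 2],
    "day_id":     [10, 11, 10],
    "cost":       [5.0, 7.5, 3.0],
})

# Ad-hoc query: total cost by product line and month.
# The dimensions supply the constraints and headings; the fact table supplies the measures.
report = (
    f_orders
    .merge(d_product, on="product_id")   # 1:N join from dimension to fact
    .merge(d_time, on="day_id")
    .groupby(["product_line", "month"], as_index=False)["cost"].sum()
)
print(report)
```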
74Star schema example
Fact table provides statistics for sales broken
down by product, period and store dimensions
75Star schema with sample data
76Advanced concepts
- Slowly Changing dimensions
- Ragged Hierarchies
77What if Our Data is not Static?
- Small, occasional changes in dimension data are normal in business.
- Examples of these changes include
  - addition of new members (a new product is launched)
  - changing of relationships within the dimension (a sales rep moves to another branch)
  - properties of members changed (a product is reformulated or renamed)
  - deletion of members (this is rare in data warehousing)
78Understand Surrogates
- 4-byte integer key (can hold more than 2 billion positive integers)
- Internally assigned and meaningless
  - ensures uniqueness
  - always known
- Used in conjunction with business keys
  - the business key is often mnemonic, for example, OTA used for the Ottawa office
  - the surrogate key is numeric, for example, 000128
- Surrogate keys are never used in reports. They are used to link dimension tables to fact tables.
79Understand Surrogate Keys Used In Operational
Systems
- Operational databases also sometimes use surrogate keys (for example, Employee_No). These keys typically cannot be used as the data mart surrogate keys.
- A single member in a data mart (for example, a particular employee) may have several data mart surrogate keys assigned over time to deal with slowly changing dimensions.
- You may have to merge entities from separate operational systems, each with its own operational surrogate key (for example, customers from separate banking and insurance applications).
- Operational surrogate keys are usually considered business keys in the data mart.
80Understand Natural Keys Example
(Example with natural keys:
- Customer Dimension (Cust Code, Name): SA 1 11, Safeway; LO 2 22, Loblaws; SE 5 55, 7-11
- Fact Table (Prod Code, Cust Code, measures): PR X 002 39 / SA 1 11; PR X 003 40 / LO 2 22; PR Y 003 40 / SE 5 55
- The Product Dimension and the metrics columns appear in the diagram.)
81Understand Surrogate Keys Example
(Example with surrogate keys:
- Product Dimension (Prod Sur, Prod Code, Name): 1, PR X 002 39, Soup; 2, PR X 003 40, Beans; 3, PR Y 003 40, Peas
- Customer Dimension (Cust Sur, Cust Code, Name): 10, SA 1 11, Safeway; 20, LO 2 22, Loblaws; 30, SE 5 55, 7-11
- Fact Table (Prod Sur, Cust Sur, measures): 1, 10; 2, 20; 3, 30)
82Track Dimensional Changes Over Time
- Operational systems tend to contain data about the current state of the business.
- A data warehouse is expected to hold data for five to 10 years.
- Users may need to query data as of any particular date (for example, at which office(s) did Mary Smith work between January 1999 and December 1999?).
- If Mary Smith changes offices, to which office do her sales apply, the old one or the new one?
83Understand Slowly Changing Dimensions (SCD)
- Operational dimensional data may often be thought of as static. It may only need to reflect the current state.
- Data warehouse dimensional data often must show how the dimensional data change over time. They are not static.
- The term Slowly Changing Dimension (SCD) refers to the tracking of changes to dimensional data over time.
84Understand Issues With Slowly Changing Dimensions
- Maintaining SCDs can be complex without
surrogates.
85Use Different Methods of Handling Dimensional
Changes
- The two most commonly used types of SCDs (according to Kimball):
  - Type 1: Overwrite the old value with the new value (do not track the changes).
  - Type 2: Add a new dimension record with a new surrogate key (track changes over time).
- A single row may have a combination of columns of different types.
86Type 1 Overwrite the Original Value
- The organization may choose not to track certain data changes because
  - the original data may have been incorrect
  - the change is not considered relevant for tracking
(Diagram: Sales Rep Dimension Table and Sales Fact Table; a Type 1 sketch follows.)
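A minimal sketch of Type 1 handling, assuming a tiny in-memory sales-rep dimension keyed by surrogate key (all names are illustrative): the changed attribute is overwritten in place, so no history is kept.

```python
# Type 1: overwrite the attribute in place; no history is kept.
sales_rep_dim = {
    # surrogate key -> dimension attributes
    101: {"rep_code": "MS01", "name": "Mary Smith", "office": "Ottawa"},
}

def scd_type1_update(dim, surrogate_key, **changes):
    """Overwrite changed attributes on the existing dimension row."""
    dim[surrogate_key].update(changes)

scd_type1_update(sales_rep_dim, 101, office="Toronto")
print(sales_rep_dim[101])   # office is now Toronto; the Ottawa value is gone
```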
87Type 2 Add a New Dimension Record
- When a tracked change is detected, a new surrogate key is assigned and a new row is added to the dimension table.
- Usually, an effective begin/end date is also updated on the new and old rows.
- Multiple rows may have the same business key, but they will always have unique surrogate keys.
(Diagram: Sales Rep Dimension Table and Sales Fact Table; a Type 2 sketch follows.)
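A sketch of Type 2 handling under the same illustrative layout: the current row is closed off with an effective end date and a new row with a new surrogate key is appended, so old fact rows keep pointing at the old version of the member.

```python
from datetime import date

# Type 2: add a new dimension row with a new surrogate key and effective dates.
sales_rep_dim = [
    {"rep_sur": 101, "rep_code": "MS01", "name": "Mary Smith",
     "office": "Ottawa", "eff_from": date(1999, 1, 1), "eff_to": None},
]

def scd_type2_change(dim, rep_code, change_date, **changes):
    """Close off the current row and append a new row carrying the change."""
    current = next(r for r in dim if r["rep_code"] == rep_code and r["eff_to"] is None)
    current["eff_to"] = change_date                           # the old version now ends
    new_row = {**current, **changes,
               "rep_sur": max(r["rep_sur"] for r in dim) + 1,  # brand-new surrogate key
               "eff_from": change_date,
               "eff_to": None}
    dim.append(new_row)

scd_type2_change(sales_rep_dim, "MS01", date(2000, 6, 1), office="Toronto")
for row in sales_rep_dim:
    print(row)   # two rows share business key MS01 but have different surrogate keys
```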
88Balanced and Ragged Hierarchies
- Dimensional data are usually structured as hierarchies, either balanced or ragged (unbalanced).
- Balanced hierarchies (those with a fixed number of levels) are most common and are the easiest to understand and analyze.
- In ragged hierarchies, each branch does not break down into the same number of levels. They are harder to analyze and report against.
- Also, PowerPlay requires that all leaf (lowest-level) nodes be at the same level to aggregate properly.
89Parent-Child Relationships
- Parent-child relationships are recursive relationships.
- The levels of the hierarchy are determined by rows of the same table.
(Diagram: an Employees table with a recursive Reports To relationship, related to an Orders table.)
90Ragged Hierarchies
- Leaf nodes have no children.
- DecisionStream fact builds only look for leaf
nodes at the lowest level.
91Resolve Ragged Hierarchies Step 1
- Create an auto-level hierarchy to obtain the
number of levels. - Create a dimension build to create a physical
table that will identify for each row the level
it belongs to.
92Resolve Ragged Hierarchies Step 1 (cont'd): Use Auto-Level Hierarchies
- The purpose of auto-level hierarchies in DecisionStream is to determine the number of levels in a hierarchy (a level-derivation sketch follows).
Example levels:
- Top Level (Level 1)
- Reports to Andrew Fuller (Level 2)
- Reports to Steven Buchanan (Level 3)
- Reports to Andrew Fuller (Level 2)
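A minimal sketch of the underlying idea, deriving the level of each member by walking a parent-child (reports-to) table up to the root; the employee data are illustrative and this is plain Python, not DecisionStream:

```python
# Derive the level of each member in a recursive (parent-child) hierarchy.
# A parent of None marks the top of the hierarchy.
reports_to = {
    "Andrew Fuller":   None,               # level 1
    "Steven Buchanan": "Andrew Fuller",    # level 2
    "Janet Leverling": "Andrew Fuller",    # level 2 (leaf on a short branch)
    "Michael Suyama":  "Steven Buchanan",  # level 3 (leaf on a longer branch -> ragged)
}

def level_of(member, parents):
    """Number of steps from the member up to the root, plus one."""
    level = 1
    while parents[member] is not None:
        member = parents[member]
        level += 1
    return level

levels = {m: level_of(m, reports_to) for m in reports_to}
print(levels)   # leaves sit at different levels, which is what makes the hierarchy ragged
```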
93Issues Regarding Star Schema
- Dimension table keys must be surrogate (non-intelligent and non-business related), because
  - keys may change over time
  - length/format consistency
- Granularity of the fact table: what level of detail do you want?
  - Transactional grain: finest level
  - Aggregated grain: more summarized
- Finer grain → better market-basket analysis capability
- Finer grain → more dimension tables, more rows in the fact table
94The User Interface: Metadata (data catalog)
- Identify subjects of the data mart
- Identify dimensions and facts
- Indicate how data are derived from the enterprise data warehouse, including derivation rules
- Indicate how data are derived from the operational data store, including derivation rules
- Identify available reports and predefined queries
- Identify data analysis techniques (e.g., drill-down)
- Identify responsible people
96Multi-dimensional data
97On-Line Analytical Processing (OLAP)
- The use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques
- Relational OLAP (ROLAP)
  - Traditional relational representation
- Multidimensional OLAP (MOLAP)
  - Cube structure
- OLAP Operations
  - Cube slicing: come up with a 2-D view of the data
  - Drill-down: going from summary to more detailed views (a small sketch follows)
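A small sketch of slicing and drill-down on a cube-like table using pandas; the data and column names are illustrative:

```python
import pandas as pd

# A tiny "cube" as a flat table: three dimensions (month, product, region) and one measure.
sales = pd.DataFrame({
    "month":   ["Jan", "Jan", "Feb", "Feb"],
    "product": ["Soup", "Beans", "Soup", "Beans"],
    "region":  ["East", "West", "East", "West"],
    "revenue": [100, 80, 120, 90],
})

# Slice: fix one dimension member (month == "Jan") to get a 2-D view of the rest.
jan_slice = sales[sales["month"] == "Jan"].pivot_table(
    index="product", columns="region", values="revenue", aggfunc="sum")

# Drill-down: start from a summary by month, then break it down by product.
summary    = sales.groupby("month")["revenue"].sum()
drill_down = sales.groupby(["month", "product"])["revenue"].sum()

print(jan_slice)
print(summary)
print(drill_down)
```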
98Overall Plan
- We need fast answers to analytical questions
- Relational model may not be the answer
- We can restructure data specifically for analysis
- First we need to find out how people analyse data
99Overall Plan
- Analysing how people analyse data reveals the importance of measures and dimensions
- So we structure the data with that in mind
- The star schema is the physical structure that emerges
- We can implement this as ROLAP, MOLAP, or HOLAP
- We achieve our objective: rapid analytical processing
100How do we make databases faster?
- Indexing
- Query design
- Application design
- Care with locking
- Lots of ways
- Data structuring
101Relational
102Data structuring Relational model
- Pros
- Data integrity in the face of user updates
- Small data volume
- Good for transactional queries
- Cons
- Poor analytical query performance
103Poor analytical query performance
- Why are relational databases slow?
- Joins
- Functions
- Aggregations
104Poor analytical query performance
- So, there is a tension between
- Transactions
- Analytical querying
- Solution
- Split them up
- Take a copy of the transactional database and
structure it in a totally different way that is
optimised for analytical querying
105Structure for analytical querying
- Great idea, but first we need to find out how
people analyse their data
106(No Transcript)
107(No Transcript)
108(No Transcript)
109How people analyse their data
- People analyse their data in terms of
- Graphs
- Grids
- Reports
- Do these have anything in common?
110How people analyse their data
- Do these have anything in common?
- Measures
- Numerical values
- Typically plotted on the Y axis
- Dimensions
- Discontinuous variables
- Slice the measures into aggregated groups
- Typically plotted on the X axis
111(No Transcript)
112(No Transcript)
113(No Transcript)
114How people analyse their data
- Dimensions are often hierarchical
- People want to analyse
- Time by Year, Quarter, Month, Day
- Product by Warehouse, Type, Product
- Customer by Country, County, Person
115How people analyse their data
- So, we need to summarise all of this..
- Measures
- Dimensions
- Hierarchies
116Squashed Octopus
(Diagram: the logical model sketched as a "squashed octopus", with measures such as Profit, Quantity, and Delay in the centre and surrounding dimension labels including Customer, Product, Item, Class, Warehouse, Country, Region, County, Name, Employee, and Time (Year, Quarter, Month, WeekDay).)
117Squashed Octopus
- The SO is a logical model
- What about a physical model?
- (Recap: why the relational model is slow for analytical querying)
  - Joins
  - Functions
  - Aggregations
118Relational
119Star Schema
120Star Schema
- What is in the fact table?
- Facts
121(No Transcript)
122Star Schema
- What is in a dimension table?
- Dimensional information
- Hierarchical information
123(No Transcript)
124Star Schema
- How do dimension and fact tables work together?
125(No Transcript)
126(No Transcript)
127Star Schema
128Query times
Query | Time (Relational) | Time (Star Schema)
Monthly totals | 70 | 60
Sales in March by Product | 18 | 6
Sales in March 2004 by Product | 12 | 2
129Star Schema
- Is it faster?
- Yes
- How can we make it even faster?
- Aggregation
130(No Transcript)
131Query times
Query | Time (Relational) | Time (Star Schema) | Time (Aggregated Star Schema)
Monthly totals | 70 | 60 | <1
Sales in March by Product | 18 | 6 | <1
Sales in March 2004 by Product | 12 | 2 | <1
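A sketch of why the aggregated numbers drop so sharply: the monthly summary is computed once during the load, so analytical queries read a small aggregate table instead of scanning the detailed fact table. The tables and column names below are illustrative.

```python
import pandas as pd

# Detailed, fine-grain fact table (illustrative).
fact = pd.DataFrame({
    "month":   ["Jan", "Jan", "Feb", "Feb", "Feb"],
    "product": ["Soup", "Beans", "Soup", "Soup", "Beans"],
    "revenue": [10.0, 12.0, 9.0, 11.0, 8.0],
})

# Precomputed aggregate: one row per month per product, built once during the load.
agg_month_product = fact.groupby(["month", "product"], as_index=False)["revenue"].sum()

# Analytical queries now read the small aggregate instead of the detailed fact table.
monthly_totals = agg_month_product.groupby("month")["revenue"].sum()
print(monthly_totals)
```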
132Star Schema
- If we leave this as a set of tables then it is ROLAP (Relational OLAP)
- (OLAP = On-Line Analytical Processing)
- But it is a pain to manage
  - All those aggregation tables
133Star Schema
- So, the answer is MOLAP
- (Multi-dimensional OLAP)
134Star Schema
- Finally HOLAP
- (Hybrid OLAP)
135On-Line Analytical Processing (OLAP)
- OLAP Operations
  - Cube slicing: come up with a 2-D view of the data
  - Drill-down: going from summary to more detailed views
136Figure 11-22 Slicing a data cube
137Summary report
Example of drill-down
Drill-down with color added
138Summary
- We need fast answers to analytical questions
- Relational model may not be the answer
- We can restructure data specifically for analysis
- First we need to find out how people analyse data
139Summary
- Analysing how people analyse data reveals the importance of measures and dimensions
- So we structure the data with that in mind
- The star schema is the physical structure that emerges
- We can implement this as ROLAP, MOLAP, or HOLAP
- We achieve our objective: rapid analytical processing