Title: NIC Exposure Level Training
1NIC Exposure Level Training
- Vijayendra Gururao
- Business Intelligence Consultant
2Agenda
- Data warehousing Concepts - Day 1
- Govt Case Study - Day 2
- Defense HR Case study - Day 3
- Manufacturing Case Study - Day 4
- Data mining - Day 5
3The Evolution of Business Intelligence
4Data Warehousing
5Introduction
- Definitions
- Legacy Systems
- Dimensions
- Data Dependencies Model
- Dimensional Model
6An ER Model
7Why Data Warehouses?
- To meet the long-sought-after goal of providing the user with more flexible databases containing data that can be accessed in many different ways.
8OLTP vs. OLAP
- OLTP (Online Transaction Processing) has been the mainstay of IS and DP for the last thirty years. Most legacy systems are quite good at capturing data but do not facilitate data access.
- OLAP (Online Analytical Processing) is a set of procedures for defining and using a dimensional framework for decision support
9The Goals for and Characteristics of a DW
- Make organizational data accessible
- Facilitate consistency
- Adaptable and yet resilient to change
- Secure and reliable
- Designed with a focus on supporting decision
making
10The Goals for and Characteristics of a DW
- Generate an environment in which data can be sliced and diced in multiple ways
- It is more than data; it is a set of tools to query, analyze, and present information
- The DW is the place where operational data are published (cleaned up, assembled, etc.)
11Data Warehousing is Changing!
Application requirements, not just data requirements, are now driving the need:
- Customer Relationship Mgmt.
- Campaign Management
- ERP
- Call Center
- Knowledge Management
- Target Marketing
- Supply Chain
- E-commerce
12Organization of data in the presentation area of
the data warehouse
- Data in the warehouse are dimensional, not normalized relations
- However, data that are ultimately presented in the data warehouse will often be derived directly from relational DBs
- Data should be atomic someplace in the warehouse, even if the presentation is aggregate
- Uses the bus architecture to support a decentralized set of data marts
13Updates to a data warehouse
- For many years, the dogma stated that data warehouses are never updated.
- This is unrealistic, since labels, titles, etc. change.
- Some components will, therefore, be changed, albeit via a managed load (as opposed to transactional updates)
14Basic elements of the data warehouse
(Diagram of the four basic elements and the flow between them:
- Operational Source Systems: data are extracted into the staging area
- Data Staging Area: services that clean, combine, and standardize data and conform dimensions; no user query services; data stored in flat files and relational tables; processing is mainly sorting and sequential processing
- Data Presentation Area: the DW bus of conformed facts and dimensions; Data Mart 1 is dimensional, holds atomic and summary data, and is based on a single business process; Data Mart 2 has a similar design; each mart is loaded from staging and accessed by users
- Data Access Tools: ad hoc query tools, report writers, and analytical applications such as modeling, forecasting, scoring, and data mining)
15Data Staging Area
- Extract-Transformation-Load (ETL)
- Extract: reading the source data and copying the data to the staging area
- Transformation:
  - Cleaning
  - Combining
  - De-duplicating
  - Assigning keys
- Load: present the data to the bulk-loading facilities of the data mart
16Dimensional Modeling Terms and Concepts
- Fact table
- Dimension tables
17Fact Tables
- Fact table: a table in the data warehouse that contains
  - Numerical performance measures
  - Foreign keys that tie the fact table to the dimension tables
18Fact Tables
- Each row records a measurement describing a transaction
  - Where?
  - When?
  - Who?
  - How much?
  - How many?
- The level of detail represented by this data is referred to as the grain of the data warehouse
- Questions can only be asked down to a level corresponding with the grain of the data warehouse
19Dimension tables
- Tables containing textual descriptors of the business
- Dimension tables are usually wide (e.g., 100 columns)
- Dimension tables are usually shallow (hundreds of thousands to a few million rows)
- Values in the dimensions usually provide
  - Constraints on queries (e.g., view customers by region)
  - Report headings
20Dimension tables
- The quality of the dimensions will determine the quality of the data warehouse; that is, the DW is only as good as its dimension attributes
- Dimensions are often split into hierarchical branches (i.e., snowflakes) because of the hierarchical nature of organizations
  - Product part → Product → Brand
- Dimensions are usually highly denormalized
21Dimension tables
- The dimension attributes define the constraints for the DW. Without good dimensions, it becomes difficult to narrow down to an answer when the DW is used for decision support
22Bringing together facts and dimensions Building
the dimensional Model
- Start with the normalized ER Model
- Group the ER diagram components into segments based on common business processes and model each as a unit
- Find M:M relationships in the model with numeric and additive non-key facts and include them in a fact table
- Denormalize the other tables as needed and designate one field as a primary key (a minimal sketch of the resulting schema follows)
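To make these steps concrete, here is a minimal sketch of the kind of star schema they produce, built in SQLite from Python. The table and column names (f_orders, d_time, d_product, d_customer) are illustrative assumptions, not the schema of any of the case studies.

```python
import sqlite3

# In-memory database used purely to illustrate a star schema layout.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
-- Dimension tables: wide, denormalized, textual descriptors, surrogate primary keys.
CREATE TABLE d_time     (day_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE d_product  (product_id INTEGER PRIMARY KEY, description TEXT, product_type TEXT, product_line TEXT);
CREATE TABLE d_customer (customer_id INTEGER PRIMARY KEY, last_name TEXT, first_name TEXT, address TEXT);

-- Fact table: numeric, additive measures plus a foreign key to each dimension.
-- The grain here is one row per product, per customer, per day.
CREATE TABLE f_orders (
    day_id         INTEGER REFERENCES d_time(day_id),
    product_id     INTEGER REFERENCES d_product(product_id),
    customer_id    INTEGER REFERENCES d_customer(customer_id),
    cost           REAL,
    number_ordered INTEGER
);
""")
conn.commit()
```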
23A Dimensional Model
24Kimball Methodology
25Review A Private Data Mart
- A data mart containing one fact table and three dimension tables. We delivered all the tables by executing a fact build.
- What if we want to add another fact table called F_Sales that will reference the three existing dimension tables?
26Understand Conformed Dimensions
- A conformed dimension is a dimension that is standardized across all data marts.
(Diagram: Sales Fact, Distribution Fact, and Order Fact sharing conformed dimensions such as Time, Location, Customer, Distributor, Promotion, and Product.)
27Advantages of Conformed Dimensions
- Deliver incremental data marts in a short period of time.
- Independent data marts become part of a fully integrated data warehouse.
- Deliver a consistent view across your business.
28Conformed Dimensions Within Bus Architecture
- Identifying and designing the conformed
dimensions is a critical step in the architecture
phase of data warehouse design.
29Design of Conformed Dimensions
- A commitment to using conformed dimensions is more than just a technical consideration. It must be a business mandate.
- Lay out a broad dimensional map for the enterprise.
- Define conformed dimensions at the most granular (atomic) level possible.
- Conformed dimensions should always use surrogate keys.
- Define standard definitions for dimension and fact attributes.
30Granularity in Conformed Dimensions
- Conformed dimensions should be defined at the most granular (atomic) level so that each record in these tables corresponds to a single record in the base-level fact table.
(Example schema:
- Order Fact: Day Id, Product Id, Customer Id, Cost, Number Ordered
- D_Time (Day): Day Id, Day, Month, Year
- D_Product: Product Id, Description, Product Type, Type Description, Product Line, Line Description
- D_Customer: Customer Id, Last Name, First Name, Address)
31Flexibility of Conformed Dimensions
- Conformed dimensions are usually designed within star schema data marts. For fact tables of multiple granularities, higher-level views of dimensions can be used (or a snowflake table), as in the sketch below.
(Example: the day-grain Order Fact (Day Id, Product Id, Customer Id, Cost, Number Ordered) references Time (Day): Day Id, Day, Month Id, Period; the month-grain Sales Fact (Month Id, Product Id, Customer Id, Amount Sold, Revenue) references a Time (Month) view or snowflake table: Month Id, Month, Period; both share the Customer and Product dimensions.)
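A minimal sketch of the higher-level-view idea, assuming a day-grain time dimension; the table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
-- Day-grain conformed time dimension.
CREATE TABLE time_day (day_id INTEGER PRIMARY KEY, day TEXT, month_id INTEGER, period TEXT);

-- Month-level view: one row per month, usable by a month-grain fact table
-- without creating a second, unconformed time dimension.
CREATE VIEW time_month AS
SELECT DISTINCT month_id, period
FROM time_day;
""")
conn.commit()
```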
32So, What is a DW?
- "A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data in support of management's decisions" (W.H. Inmon, the father of DW)
33Subject Oriented
- Data in a data warehouse are organized around the
major subjects of the organization
34Integrated
- Data from multiple sources are standardized
(scrubbed, cleansed, etc.) and brought into one
environment
35Non-Volatile
- Once added to the DW, data are not changed
(barring the existence of major errors)
36Time Variant
- The DW captures data at a specific moment; thus, it is a snapshot view of the organization at that moment in time. As these snapshots accumulate, the analyst is able to examine the organization over time (a time series!)
- The snapshot is called a production data extract
37Need for Data Warehousing
- Integrated, company-wide view of high-quality information (from disparate databases)
- Separation of operational and informational systems and data (for improved performance)
38(No Transcript)
39Data Warehouse Architectures
- Generic Two-Level Architecture
- Independent Data Mart
- Dependent Data Mart and Operational Data Store
- Logical Data Mart and @ctive Warehouse
- Three-Layer architecture
All involve some form of extraction,
transformation and loading (ETL)
40Generic two-level architecture
L
One, company-wide warehouse
T
E
Periodic extraction ? data is not completely
current in warehouse
41Independent Data Mart
42Dependent data mart with operational data store
43Logical data mart and @ctive data warehouse
44Three-layer architecture
45DW Design
- Mainly consists of
- Logical Design
- Physical Design
46Logical Design of DW
- Identification of
  - Entities
  - Relationships
  - Attributes
  - Unique identifiers
- Conceptual and abstract; results in fact and dimension tables
- Created using pen and paper or modeling tools
47Physical Design of DW
- Conversion of the structures gathered in logical design into a physical database structure
- Mainly driven by query performance
48Logical → Physical
49Data Characteristics: Status vs. Event Data
Example of a DBMS log entry
Event: a database action (create/update/delete) that results from a transaction
50Data Characteristics: Transient vs. Periodic Data
Figure 11-8: Transient operational data
Changes to existing records are written over previous records, thus destroying the previous data content
51Data Characteristics: Transient vs. Periodic Data
Periodic warehouse data
Data are never physically altered or deleted once they have been added to the store
52Data Reconciliation
- Typical operational data is
  - Transient: not historical
  - Not normalized (perhaps due to denormalization for performance)
  - Restricted in scope: not comprehensive
  - Sometimes poor quality: inconsistencies and errors
- After ETL, data should be
  - Detailed: not summarized yet
  - Historical: periodic
  - Denormalized
  - Comprehensive: enterprise-wide perspective
  - Quality controlled: accurate with full integrity
53Extract Transform Load
- Extract data from the operational system, transform it, and load it into the data warehouse
- Why ETL?
  - Will your warehouse produce correct information with the current data?
  - How can I ensure warehouse credibility?
54Excuses for NOT Transforming Legacy Data
- Old data works fine; new will work as well.
- Data will be fixed at point of entry through the GUI.
- If needed, data will be cleaned after the new system is populated (after the proof-of-concept pilot).
- Keys join the data most of the time.
- Users will not agree to modifying or standardizing their data.
55Levels of Migration Problem
- Existing metadata is insufficient and unreliable
  - Metadata must hold for all occurrences
  - Metadata must represent business and technical attributes
- Data values incorrectly typed and accessible
  - A value's form is extracted from storage
  - A value's meaning is inferred from its content
- Entity keys unreliable or unavailable
  - Inferred from related values
56Metadata Challenge
- Metadata gets out of synch with the details it summarizes
- Business grows faster than the systems designed to capture business info
- Not at the right level of detail
  - Multiple values in a single field
  - Multiple meanings for a single field
  - No fixed format for values
- Expressed in awkward or limited terms
  - Program/compiler view rather than business view
57Character-level Challenge
- Value instance level
- Spelling, aliases
- Abbreviations, truncations, transpositions
- Inconsistent storage formats
- Named type level
- Multiple meanings, contextual meanings
- Synonyms, homonyms
- Entity level
- No common keys or representation
- No integrated view across records, files, systems
58The ETL Process
- Capture
- Scrub or data cleansing
- Transform
- Load and Index
ETL = Extract, Transform, and Load
59The ETL Process
(Diagram: Source Systems → Extract → Staging Area → Transform → Load → Presentation System)
60Source Data
- Record the name, location, and content of the data that exists in the TPS environment:
  - File names and locations
  - Layout
  - Attribute meanings
61Extraction
- Copy specific data directly from the source
tables into a working dataset in the staging area.
62Transformation (Dimension Tables)
- Generate a surrogate key in a primary-to-surrogate key table. Make this permanent.
- Insert the surrogate key into the working dimension tables.
- Conduct any editing/cleaning operations you need (usually on the working table).
- Generate any derived attributes you need.
- Generate and retain process logs (a minimal sketch of the surrogate-key steps follows).
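A minimal sketch of the surrogate-key steps, assuming a simple in-memory key map keyed by the business key; the names key_map, working_dim, and the office attributes are illustrative, not part of the case studies:

```python
# Assign warehouse surrogate keys to incoming dimension rows.
# key_map is the permanent primary-to-surrogate table (business key -> surrogate);
# working_dim is the working dimension table being prepared in the staging area.

key_map = {"OTA": 128, "TOR": 129}           # loaded from previous runs; kept permanently
next_surrogate = max(key_map.values()) + 1   # next key to hand out

working_dim = [
    {"office_code": "OTA", "office_name": "Ottawa"},
    {"office_code": "VAN", "office_name": "Vancouver"},   # new member: needs a new key
]

for row in working_dim:
    business_key = row["office_code"]
    if business_key not in key_map:            # first time this member has been seen
        key_map[business_key] = next_surrogate
        next_surrogate += 1
    row["office_sur"] = key_map[business_key]  # insert the surrogate into the working table

print(working_dim)   # each row now carries both its business key and its surrogate key
```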
63Transformation(Fact tables)
- Join all dimensions to the fact table (using original primary keys) and insert the surrogate keys (see the sketch below)
- Generate derived facts
- Generate indicator flags
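And a matching sketch for the fact side: each incoming fact row is joined to the key map on its original business key and the surrogate is substituted before loading (the row layouts are illustrative):

```python
# Replace business keys in incoming fact rows with warehouse surrogate keys.
key_map = {"OTA": 128, "VAN": 130}   # primary-to-surrogate map built during the dimension load

incoming_facts = [
    {"office_code": "OTA", "sales_amount": 1500.0},
    {"office_code": "VAN", "sales_amount": 980.0},
]

fact_rows = []
for row in incoming_facts:
    fact_rows.append({
        "office_sur": key_map[row["office_code"]],  # surrogate key used in the warehouse
        "sales_amount": row["sales_amount"],        # additive measure
    })

print(fact_rows)
```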
64Target Data
- Describe the presentation data structure.
- Model
- Metadata
- Usage and constraints
65Flow Documentation
- DFD for the ETL process
- ERD for Source, Staging and Target databases.
- Metadata
- Usage notes.
66Steps in data reconciliation
Capture = extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse
Incremental extract: capturing changes that have occurred since the last static extract
Static extract: capturing a snapshot of the source data at a point in time
67Steps in data reconciliation (continued)
Scrub = cleanse: uses pattern recognition and AI techniques to upgrade data quality
Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
Also: decoding, reformatting, time-stamping, conversion, key generation, merging, error detection/logging, locating missing data
68Steps in data reconciliation (continued)
Transform = convert data from the format of the operational system to the format of the data warehouse
Record-level:
- Selection: data partitioning
- Joining: data combining
- Aggregation: data summarization
Field-level:
- Single-field: from one field to one field
- Multi-field: from many fields to one field, or from one field to many
69Steps in data reconciliation (continued)
Load/Index = place transformed data into the warehouse and create indexes
Refresh mode: bulk rewriting of target data at periodic intervals
Update mode: only changes in source data are written to the data warehouse
70Single-field transformation
In general, some transformation function translates data from the old form to the new form
Algorithmic transformation: uses a formula or logical expression
Table lookup: another approach (both are sketched below)
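A small sketch of both single-field approaches; the field names, formula, and lookup table are illustrative assumptions:

```python
# Single-field transformations: one source field becomes one target field.

# Algorithmic transformation: a formula converts the stored value.
def fahrenheit_to_celsius(temp_f: float) -> float:
    return (temp_f - 32.0) * 5.0 / 9.0

# Table lookup: a code in the source is replaced by its warehouse description.
STATE_LOOKUP = {"KA": "Karnataka", "MH": "Maharashtra", "DL": "Delhi"}

source_row = {"temp_f": 98.6, "state_code": "KA"}
target_row = {
    "temp_c": round(fahrenheit_to_celsius(source_row["temp_f"]), 1),
    "state_name": STATE_LOOKUP.get(source_row["state_code"], "Unknown"),
}
print(target_row)   # {'temp_c': 37.0, 'state_name': 'Karnataka'}
```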
71Multi field transformation
M:1: from many source fields to one target field
1:M: from one source field to many target fields (see the sketch below)
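A small sketch of both multi-field directions, again with illustrative field names:

```python
# Multi-field transformations.

# M:1 -- many source fields combined into one target field.
src = {"house_no": "221B", "street": "Baker Street", "city": "London"}
full_address = ", ".join([src["house_no"] + " " + src["street"], src["city"]])

# 1:M -- one source field split into many target fields.
full_name = "Smith, Mary"
last_name, first_name = [part.strip() for part in full_name.split(",", 1)]

print(full_address)            # 221B Baker Street, London
print(first_name, last_name)   # Mary Smith
```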
72Derived Data
- Objectives
- Ease of use for decision support applications
- Fast response to predefined user queries
- Customized data for particular target audiences
- Ad-hoc query support
- Data mining capabilities
- Characteristics
  - Detailed (mostly periodic) data
  - Aggregate (for summary)
  - Distributed (to departmental servers)
- Most common data model: the star schema (also called the dimensional model)
73Components of a star schema
- Fact tables contain factual or quantitative data
- Dimension tables are denormalized to maximize performance
- 1:N relationship between dimension tables and fact tables
- Dimension tables contain descriptions about the subjects of the business
- Excellent for ad-hoc queries, but bad for online transaction processing (a query sketch follows)
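For example, a typical ad-hoc star-schema query joins the fact table to the constraining dimensions and aggregates the measures. A minimal sketch using pandas, with an invented three-table schema:

```python
import pandas as pd

# Tiny illustrative star schema held in DataFrames.
d_product = pd.DataFrame({"product_id": [1, 2], "product_line": ["Soup", "Beans"]})
d_time    = pd.DataFrame({"day_id": [10, 11], "month": ["Mar", "Apr"], "year": [2004, 2004]})
f_orders  = pd.DataFrame({
    "product_id": [1, 1, 2],
    "day_id":     [10, 11, 10],
    "cost":       [5.0, 7.5, 3.0],
})

# Ad-hoc query: total cost by product line and month.
# The dimensions supply the constraints and headings; the fact table supplies the measures.
report = (
    f_orders
    .merge(d_product, on="product_id")   # 1:N join from dimension to fact
    .merge(d_time, on="day_id")
    .groupby(["product_line", "month"], as_index=False)["cost"].sum()
)
print(report)
```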
74Star schema example
Fact table provides statistics for sales broken
down by product, period and store dimensions
75Star schema with sample data
76Advanced concepts
- Slowly Changing dimensions
- Ragged Hierarchies
77What if Our Data is not Static?
- Small, occasional changes in dimension data are normal in business.
- Examples of these changes include
  - addition of new members (a new product is launched)
  - changing of relationships within the dimension (a sales rep moves to another branch)
  - properties of members changed (a product is reformulated or renamed)
  - deletion of members (this is rare in data warehousing)
78Understand Surrogates
- 4-byte integer key (can hold more than 2 billion positive integers)
- Internally assigned and meaningless
  - ensures uniqueness
  - always known
- Used in conjunction with business keys
  - the business key is often mnemonic, for example, OTA used for the Ottawa office
  - the surrogate key is numeric, for example, 000128
- Surrogate keys are never used in reports. They are used to link dimension tables to fact tables.
79Understand Surrogate Keys Used In Operational
Systems
- Operational databases also sometimes use surrogate keys (for example, Employee_No). These keys typically cannot be used as the data mart surrogate keys.
- A single member in a data mart (for example, a particular employee) may have several data mart surrogate keys assigned over time to deal with slowly changing dimensions.
- You may have to merge entities from separate operational systems, each with its own operational surrogate key (for example, customers from separate banking and insurance applications).
- Operational surrogate keys are usually considered business keys in the data mart.
80Understand Natural Keys Example
(Example with natural keys:
- Customer Dimension (Cust Code, Name): SA 1 11, Safeway; LO 2 22, Loblaws; SE 5 55, 7-11
- Fact Table (Prod Code, Cust Code, measures): PR X 002 39 / SA 1 11; PR X 003 40 / LO 2 22; PR Y 003 40 / SE 5 55
- The Product Dimension and the metrics columns appear in the diagram.)
81Understand Surrogate Keys Example
(Example with surrogate keys:
- Product Dimension (Prod Sur, Prod Code, Name): 1, PR X 002 39, Soup; 2, PR X 003 40, Beans; 3, PR Y 003 40, Peas
- Customer Dimension (Cust Sur, Cust Code, Name): 10, SA 1 11, Safeway; 20, LO 2 22, Loblaws; 30, SE 5 55, 7-11
- Fact Table (Prod Sur, Cust Sur, measures): 1, 10; 2, 20; 3, 30)
82Track Dimensional Changes Over Time
- Operational systems tend to contain data about the current state of the business.
- A data warehouse is expected to hold data for five to 10 years.
- Users may need to query data as of any particular date (for example, at which office(s) did Mary Smith work between January 1999 and December 1999?).
- If Mary Smith changes offices, to which office do her sales apply, the old one or the new one?
83Understand Slowly Changing Dimensions (SCD)
- Operational dimensional data may often be thought of as static. It may only need to reflect the current state.
- Data warehouse dimensional data often must show how the dimensional data change over time. They are not static.
- The term Slowly Changing Dimension (SCD) refers to the tracking of changes to dimensional data over time.
84Understand Issues With Slowly Changing Dimensions
- Maintaining SCDs can be complex without
surrogates.
85Use Different Methods of Handling Dimensional
Changes
- The two most commonly used types of SCDs (according to Kimball):
  - Type 1: Overwrite the old value with the new value (do not track the changes).
  - Type 2: Add a new dimension record with a new surrogate key (track changes over time).
- A single row may have a combination of columns of different types.
86Type 1 Overwrite the Original Value
- The organization may choose not to track certain data changes because
  - the original data may have been incorrect
  - the change is not considered relevant for tracking
(Diagram: Sales Rep Dimension Table and Sales Fact Table; a Type 1 sketch follows.)
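A minimal sketch of Type 1 handling, assuming a tiny in-memory sales-rep dimension keyed by surrogate key (all names are illustrative): the changed attribute is overwritten in place, so no history is kept.

```python
# Type 1: overwrite the attribute in place; no history is kept.
sales_rep_dim = {
    # surrogate key -> dimension attributes
    101: {"rep_code": "MS01", "name": "Mary Smith", "office": "Ottawa"},
}

def scd_type1_update(dim, surrogate_key, **changes):
    """Overwrite changed attributes on the existing dimension row."""
    dim[surrogate_key].update(changes)

scd_type1_update(sales_rep_dim, 101, office="Toronto")
print(sales_rep_dim[101])   # office is now Toronto; the Ottawa value is gone
```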
87Type 2 Add a New Dimension Record
- When a tracked change is detected, a new surrogate key is assigned and a new row is added to the dimension table.
- Usually, an effective begin/end date is also updated on the new and old rows.
- Multiple rows may have the same business key, but they will always have unique surrogate keys.
(Diagram: Sales Rep Dimension Table and Sales Fact Table; a Type 2 sketch follows.)
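A sketch of Type 2 handling under the same illustrative layout: the current row is closed off with an effective end date and a new row with a new surrogate key is appended, so old fact rows keep pointing at the old version of the member.

```python
from datetime import date

# Type 2: add a new dimension row with a new surrogate key and effective dates.
sales_rep_dim = [
    {"rep_sur": 101, "rep_code": "MS01", "name": "Mary Smith",
     "office": "Ottawa", "eff_from": date(1999, 1, 1), "eff_to": None},
]

def scd_type2_change(dim, rep_code, change_date, **changes):
    """Close off the current row and append a new row carrying the change."""
    current = next(r for r in dim if r["rep_code"] == rep_code and r["eff_to"] is None)
    current["eff_to"] = change_date                           # the old version now ends
    new_row = {**current, **changes,
               "rep_sur": max(r["rep_sur"] for r in dim) + 1,  # brand-new surrogate key
               "eff_from": change_date,
               "eff_to": None}
    dim.append(new_row)

scd_type2_change(sales_rep_dim, "MS01", date(2000, 6, 1), office="Toronto")
for row in sales_rep_dim:
    print(row)   # two rows share business key MS01 but have different surrogate keys
```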
88Balanced and Ragged Hierarchies
- Dimensional data are usually structured as hierarchies, either balanced or ragged (unbalanced).
- Balanced hierarchies (those with a fixed number of levels) are most common and are the easiest to understand and analyze.
- In ragged hierarchies, each branch does not break down into the same number of levels. They are harder to analyze and report against.
- Also, PowerPlay requires that all leaf (lowest-level) nodes be at the same level to aggregate properly.
89Parent-Child Relationships
- Parent-child relationships are recursive relationships.
- The levels of the hierarchy are determined by rows of the same table.
(Diagram: an Employees table with a recursive Reports To relationship, related to an Orders table.)
90Ragged Hierarchies
- Leaf nodes have no children.
- DecisionStream fact builds only look for leaf
nodes at the lowest level.
91Resolve Ragged Hierarchies Step 1
- Create an auto-level hierarchy to obtain the
number of levels. - Create a dimension build to create a physical
table that will identify for each row the level
it belongs to.
92Resolve Ragged Hierarchies Step 1 (cont'd): Use Auto-Level Hierarchies
- The purpose of auto-level hierarchies in DecisionStream is to determine the number of levels in a hierarchy (a level-derivation sketch follows).
Example levels:
- Top Level (Level 1)
- Reports to Andrew Fuller (Level 2)
- Reports to Steven Buchanan (Level 3)
- Reports to Andrew Fuller (Level 2)
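A minimal sketch of the underlying idea, deriving the level of each member by walking a parent-child (reports-to) table up to the root; the employee data are illustrative and this is plain Python, not DecisionStream:

```python
# Derive the level of each member in a recursive (parent-child) hierarchy.
# A parent of None marks the top of the hierarchy.
reports_to = {
    "Andrew Fuller":   None,               # level 1
    "Steven Buchanan": "Andrew Fuller",    # level 2
    "Janet Leverling": "Andrew Fuller",    # level 2 (leaf on a short branch)
    "Michael Suyama":  "Steven Buchanan",  # level 3 (leaf on a longer branch -> ragged)
}

def level_of(member, parents):
    """Number of steps from the member up to the root, plus one."""
    level = 1
    while parents[member] is not None:
        member = parents[member]
        level += 1
    return level

levels = {m: level_of(m, reports_to) for m in reports_to}
print(levels)   # leaves sit at different levels, which is what makes the hierarchy ragged
```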
93Issues Regarding Star Schema
- Dimension table keys must be surrogate (non-intelligent and non-business related), because
  - keys may change over time
  - length/format consistency
- Granularity of the fact table: what level of detail do you want?
  - Transactional grain: finest level
  - Aggregated grain: more summarized
- Finer grain → better market-basket analysis capability
- Finer grain → more dimension tables, more rows in the fact table
94The User Interface: Metadata (data catalog)
- Identify subjects of the data mart
- Identify dimensions and facts
- Indicate how data are derived from the enterprise data warehouse, including derivation rules
- Indicate how data are derived from the operational data store, including derivation rules
- Identify available reports and predefined queries
- Identify data analysis techniques (e.g., drill-down)
- Identify responsible people
96Multi-dimensional data
97On-Line Analytical Processing (OLAP)
- The use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques
- Relational OLAP (ROLAP)
  - Traditional relational representation
- Multidimensional OLAP (MOLAP)
  - Cube structure
- OLAP Operations
  - Cube slicing: come up with a 2-D view of the data
  - Drill-down: going from summary to more detailed views (a small sketch follows)
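A small sketch of slicing and drill-down on a cube-like table using pandas; the data and column names are illustrative:

```python
import pandas as pd

# A tiny "cube" as a flat table: three dimensions (month, product, region) and one measure.
sales = pd.DataFrame({
    "month":   ["Jan", "Jan", "Feb", "Feb"],
    "product": ["Soup", "Beans", "Soup", "Beans"],
    "region":  ["East", "West", "East", "West"],
    "revenue": [100, 80, 120, 90],
})

# Slice: fix one dimension member (month == "Jan") to get a 2-D view of the rest.
jan_slice = sales[sales["month"] == "Jan"].pivot_table(
    index="product", columns="region", values="revenue", aggfunc="sum")

# Drill-down: start from a summary by month, then break it down by product.
summary    = sales.groupby("month")["revenue"].sum()
drill_down = sales.groupby(["month", "product"])["revenue"].sum()

print(jan_slice)
print(summary)
print(drill_down)
```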
98Overall Plan
- We need fast answers to analytical questions
- Relational model may not be the answer
- We can restructure data specifically for analysis
- First we need to find out how people analyse data
99Overall Plan
- Analysing how people analyse data reveals the importance of measures and dimensions
- So we structure the data with that in mind
- The star schema is the physical structure that emerges
- We can implement this as ROLAP, MOLAP, or HOLAP
- We achieve our objective: rapid analytical processing
100How do we make databases faster?
- Indexing
- Query design
- Application design
- Care with locking
- Lots of ways
- Data structuring
101Relational
102Data structuring Relational model
- Pros
- Data integrity in the face of user updates
- Small data volume
- Good for transactional queries
- Cons
- Poor analytical query performance
103Poor analytical query performance
- Why are relational databases slow?
- Joins
- Functions
- Aggregations
104Poor analytical query performance
- So, there is a tension between
- Transactions
- Analytical querying
- Solution
- Split them up
- Take a copy of the transactional database and
structure it in a totally different way that is
optimised for analytical querying
105Structure for analytical querying
- Great idea, but first we need to find out how
people analyse their data
106(No Transcript)
107(No Transcript)
108(No Transcript)
109How people analyse their data
- People analyse their data in terms of
- Graphs
- Grids
- Reports
- Do these have anything in common?
110How people analyse their data
- Do these have anything in common?
- Measures
- Numerical values
- Typically plotted on the Y axis
- Dimensions
- Discontinuous variables
- Slice the measures into aggregated groups
- Typically plotted on the X axis
111(No Transcript)
112(No Transcript)
113(No Transcript)
114How people analyse their data
- Dimensions are often hierarchical
- People want to analyse
- Time by Year, Quarter, Month, Day
- Product by Warehouse, Type, Product
- Customer by Country, County, Person
115How people analyse their data
- So, we need to summarise all of this..
- Measures
- Dimensions
- Hierarchies
116Squashed Octopus
(Diagram: the logical model sketched as a "squashed octopus", with measures such as Profit, Quantity, and Delay in the centre and surrounding dimension labels including Customer, Product, Item, Class, Warehouse, Country, Region, County, Name, Employee, and Time (Year, Quarter, Month, WeekDay).)
117Squashed Octopus
- The SO is a logical model
- What about a physical model?
- (Recap: why the relational model is slow for analytical querying)
  - Joins
  - Functions
  - Aggregations
118Relational
119Star Schema
120Star Schema
- What is in the fact table?
- Facts
121(No Transcript)
122Star Schema
- What is in a dimension table?
- Dimensional information
- Hierarchical information
123(No Transcript)
124Star Schema
- How do dimension and fact tables work together?
125(No Transcript)
126(No Transcript)
127Star Schema
128Query times
Query | Time (Relational) | Time (Star Schema)
Monthly totals | 70 | 60
Sales in March by Product | 18 | 6
Sales in March 2004 by Product | 12 | 2
129Star Schema
- Is it faster?
- Yes
- How can we make it even faster?
- Aggregation
130(No Transcript)
131Query times
Query | Time (Relational) | Time (Star Schema) | Time (Aggregated Star Schema)
Monthly totals | 70 | 60 | <1
Sales in March by Product | 18 | 6 | <1
Sales in March 2004 by Product | 12 | 2 | <1
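A sketch of why the aggregated numbers drop so sharply: the monthly summary is computed once during the load, so analytical queries read a small aggregate table instead of scanning the detailed fact table. The tables and column names below are illustrative.

```python
import pandas as pd

# Detailed, fine-grain fact table (illustrative).
fact = pd.DataFrame({
    "month":   ["Jan", "Jan", "Feb", "Feb", "Feb"],
    "product": ["Soup", "Beans", "Soup", "Soup", "Beans"],
    "revenue": [10.0, 12.0, 9.0, 11.0, 8.0],
})

# Precomputed aggregate: one row per month per product, built once during the load.
agg_month_product = fact.groupby(["month", "product"], as_index=False)["revenue"].sum()

# Analytical queries now read the small aggregate instead of the detailed fact table.
monthly_totals = agg_month_product.groupby("month")["revenue"].sum()
print(monthly_totals)
```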
132Star Schema
- If we leave this as a set of tables then it is ROLAP (Relational OLAP)
- (OLAP = On-Line Analytical Processing)
- But it is a pain to manage
  - All those aggregation tables
133Star Schema
- So, the answer is MOLAP
- (Multi-dimensional OLAP)
134Star Schema
- Finally HOLAP
- (Hybrid OLAP)
135On-Line Analytical Processing (OLAP)
- OLAP Operations
  - Cube slicing: come up with a 2-D view of the data
  - Drill-down: going from summary to more detailed views
136Figure 11-22 Slicing a data cube
137Summary report
Example of drill-down
Drill-down with color added
138Summary
- We need fast answers to analytical questions
- Relational model may not be the answer
- We can restructure data specifically for analysis
- First we need to find out how people analyse data
139Summary
- Analysing how people analyse data reveals the importance of measures and dimensions
- So we structure the data with that in mind
- The star schema is the physical structure that emerges
- We can implement this as ROLAP, MOLAP, or HOLAP
- We achieve our objective: rapid analytical processing