Title: Data Warehousing
1 Data Warehousing
- A Look Back, Moving Forward
Dale Sanders, June 2005
2 Introduction and Warnings
- Why am I here?
- Teach
- Stimulate some thought
- Share some of my experiences and lessons
- Learn
- From you, please
- Ask questions, challenge opinions, share your knowledge
- I'll do my best to live up to my end of the bargain
- Warnings
- The pictures in this presentation
- May or may not have any relevance whatsoever to the topic or slide
- Mostly intended to break up the monotony
3 Expectation Management
- My DW strengths (according to others)
- I know what not to do as much as I know what to do
- Seen and made all the big mistakes
- Vision, strategy, system architecture, data management, DW modeling, complex cultural issues, leapfrog problem solving
- My DW weaknesses
- My programming skills suck and I'm not a DBA
- Haven't written a decent line of code in four years!
- Some might say it's been 24 years :-)
- Knowledge of leading products is very rusty
- Though I'm beefing up on Microsoft and Cognos
Within these expectations, make no mistake about it: I know data warehousing :-)
4 Today's Material
Giving due credit.
- I am a good Idea Guy
- But ideas are worthless without someone to implement and enhance them
- Steve Barlow, Dan Lidgard, Jon Despain, Chuck Lyon, Laure Shull, Kris Mitchell, Peter Hess, Nancy McLeod, Ron Gault, Rob Carpenter, Tom Robison, Stan Smith, my wife, and many others
- My greatest strength and blessing
- The ability to recognize, listen to, and hold onto good people
- (Knock on wood)
- My achievements in personal and professional life
- More a function of those around me than a reflection on me
5 DW Best Practices: The Most Important Metrics
- Employee satisfaction
- Without it, long-term customer satisfaction is impossible
- Customer satisfaction
- That's the nature of the Information Services career field
- Some people in our profession still don't get it
- We are here to serve
- The Organizational Laugh Metric
- How many times do you hear laughter in the day-to-day operations of your team?
- It is the single most important vital sign of organizational health and business success
6 My Background
- Three eight-year chapters
- Captain, Information Systems Engineer, US Air Force
- Nuclear warfare battle management
- Force status data integration
- Intelligence and attack warning data fusion
- Consultant in several capacities
- TRW
- National Security Agency (NSA)
- Intel New Mexico Data Repository (NMDR)
- Air Force
- Integrated Minuteman Data Base (IMDB)
- Peacekeeper Information Retrieval System (PIRS)
- Information Technology International (ITI, Inc.)
- Healthcare
- Intermountain Health Care Enterprise Data Warehouse
- Consultant to other healthcare organizations' data warehouses
- Now at Northwestern University Healthcare System
7 Overview
- Data warehousing history
- According to Sanders
- Why and how did this become a sub-specialty in information systems?
- What have we learned so far?
- My take on Best Practices
- Key lessons learned
- My thoughts on the most popular authors in the field
- What they contribute, where they detract
8 Data Warehousing History
Newspaper Rock 100 B.C.
American Retail 2005 A.D.
9 What Happened in the Cloud?
- Stage 1: Laziness
- Operators grew tired of hanging tapes
- In response to requests for historical financial data
- They stored data on-line, in unauthorized mainframe databases
- Stage 2: End of the mainframe bully
- Computing moved out from finance to the rest of the business
- Unix and relational databases
- Distributed computing created islands of information
- Stage 2.1: The government gets involved
- Consolidating IRS and military databases to save money on mainframes
- "Hey, look what I can do with this data"
- Stage 3: Deming comes along
- Push towards constant business reengineering
- Cultural emphasis on continuous quality improvement and business innovation drives the need for data
10 The Real Truth
- Data warehousing is a symptom of a problem
- The technological inability to deploy single-platform information systems that
- Capture data once and reuse it throughout an enterprise
- Support high transaction rates (single-record CREATE, SELECT, UPDATE, DELETE) and analytic queries on the same computing platform, with the same data, at the same time
- Someday, maybe, we will address the root cause
- Until then, it's a good way to make a living
11 The Ideal Library Analogy
- Stores all of the books and other reference material you need to conduct your research
- The enterprise data warehouse
- A single place to visit
- One database environment
- Contents are kept current and refreshed
- Timely, well-choreographed data loads
- Staffed with friendly, knowledgeable people who can help you find your way around
- Your Data Warehouse team
- Organized for easy navigation and use
- Metadata
- Data models
- User-friendly naming conventions
- Solid architectural infrastructure
- Hardware, software, standards, metrics
12 Cultural Detractors
- The two biggies
- The business supported by the data warehouse must be motivated by a desire for constant improvement and fact-based decision making
- The data warehouse team falls victim to the Politics of Data
- Through naivety
- Through misguided motives of their own
13 Business Culture
"I love data!"
- Does your CEO
- Talk about constant improvement, constantly?
- Drive corporate goals that are SMART?
- Specific, Measurable, Attainable, Realistic, Tangible
- Crave data to make better-informed decisions?
- Become visibly, buoyantly excited at a demo of a data cube?
- If so, the success of your data warehouse is right around the corner... sort of
14 Political Best Practices
- You will be called a data thief
- Get used to it
- Encourage life-cycle ownership of the OLTP data, even in the EDW
- You will be called dangerous
- "You don't understand our data!"
- OLTP owners know their data better than you do; acknowledge it and leverage it
- You will be blamed for poor data quality in the OLTP systems
- This is a natural reaction
- Data warehouses raise the visibility of poor data quality
- Use the EDW as a tool for raising overall data quality
- You will be called a job robber
- The EDW is perceived as a replacement for OLTP systems
- Educate people: the EDW depends on OLTP systems for its existence
- Stick to your values and pure motives
- The politics will fade away
15 Data Quality
- Pitfall
- Taking accountability for data quality on the source system
- Spending gobs of time and money cleansing data before it's loaded into the DW
- It's a never-ending, never-win battle
- You will always be one step behind data quality
- You will always be in the cross-hairs of blame
- Best Practice
- Push accountability where it belongs: to the source system
- Use the data warehouse as a tool to reveal data quality, either good or bad
- Be prepared to weather the initial storm of blame
16 Measuring Data Quality
- Data Quality = Completeness x Validity
- Can it be measured objectively?
- Measuring Completeness
- Number of null values in a column
- Measuring Validity
- Cardinality is a simple way to measure validity
- "We only have four standard regions in the business, but we have 18 distinct values in the region column."
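Both measurements are easy to express in SQL. A minimal sketch, assuming a hypothetical CUSTOMER table with a REGION column (the table, column, and region values are illustrative, not from any real schema):

  -- Completeness: what fraction of rows actually have a value in REGION?
  SELECT COUNT(*)                           AS total_rows,
         COUNT(region)                      AS populated_rows,
         ROUND(COUNT(region) / COUNT(*), 3) AS completeness
    FROM customer;

  -- Validity via cardinality: how many distinct values does REGION hold,
  -- compared to the four standard regions the business recognizes?
  SELECT COUNT(DISTINCT region) AS distinct_regions
    FROM customer;

  -- List the offending values so the source system owner can clean them up
  SELECT region, COUNT(*) AS row_count
    FROM customer
   WHERE region NOT IN ('NORTH', 'SOUTH', 'EAST', 'WEST')
   GROUP BY region
   ORDER BY row_count DESC;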
17 Business Validity
- How can you measure it? You can't
- "I collect this data from our customers, but I have to guess sometimes because I don't speak Spanish."
- "This data is valid for trend analysis decisions before 9/11/2001, but should not be used after that date, due to changes in security procedures."
- "You can't use insurance billing and reimbursement data to make clinical, patient care decisions."
- "This customer purchased four copies of Zamfir, Master of the Pan Flute, therefore he loves everything about Zamfir."
- What Amazon didn't know: I bought them for my mom and her sewing circle.
Where do you capture subjective data quality? Metadata.
18 The Importance of Metadata
- Maybe the most over-hyped, underserved area of data warehousing: common sense
- Vendors want to charge you big for their tools
- Consultants would like you to think that it's the Holy Grail in disguise and only they can help you find it
- Authors who have never been in an operational environment would have you chasing your tail in pursuit of an esoteric, mythological Metadata Nirvana
- Don't listen to the confusing messages! You know the answer; just listen to your common sense
19 Metadata: Keep It Simple!
- Ultimately, what are the most valuable business motives behind metadata?
- Make data more understandable to those who are not familiar with it
- Data quality issues
- Data timeliness and temporal issues
- Context in which it was collected
- Translating physical names to natural language
- Make data more findable to those who don't know where it is
- Organize it
- Take a lesson from library science and the card catalog
20 Metadata on the Web
21 Example Metadata Entry
- Physical Table Name
- LKUP.POSTAL_CD_MASTER
- Plain Language Name
- Postal Code Master - IHC
- Description
- Contains Postal (Zip) codes for the IHC referral region and IHC-specific descriptions. These descriptions allow for specific IHC groupings used in various analyses.
- Data Steward
- Jim Allred, ext. 3518
22 Table Metadata
- Required Elements
- Long Name (or English name)
- Description
- Semi-optional
- Source
- Example
- Data Steward
23 Column Metadata
- Required
- Long Name
- Description
- Notable ETL transforms
- Optional
- Value Range
- Data Quality
- Associated Lookup
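A rough sketch of how the table-level and column-level elements above might be stored; the table names, column names, and sizes are illustrative, not the IHC implementation:

  -- Table-level metadata: one row per warehouse table
  CREATE TABLE meta_table (
      physical_table_name  VARCHAR2(61)   NOT NULL,   -- e.g. LKUP.POSTAL_CD_MASTER
      long_name            VARCHAR2(200)  NOT NULL,   -- plain-language name
      description          VARCHAR2(4000) NOT NULL,
      source_system        VARCHAR2(100),             -- semi-optional
      example_usage        VARCHAR2(4000),            -- semi-optional
      data_steward         VARCHAR2(100),             -- semi-optional
      CONSTRAINT meta_table_pk PRIMARY KEY (physical_table_name)
  );

  -- Column-level metadata: one row per column
  CREATE TABLE meta_column (
      physical_table_name  VARCHAR2(61)   NOT NULL,
      physical_column_name VARCHAR2(30)   NOT NULL,
      long_name            VARCHAR2(200)  NOT NULL,
      description          VARCHAR2(4000) NOT NULL,
      etl_transforms       VARCHAR2(4000),            -- notable ETL transforms
      value_range          VARCHAR2(1000),            -- optional
      data_quality_notes   VARCHAR2(4000),            -- optional
      associated_lookup    VARCHAR2(61),              -- optional
      CONSTRAINT meta_column_pk
          PRIMARY KEY (physical_table_name, physical_column_name),
      CONSTRAINT meta_column_fk
          FOREIGN KEY (physical_table_name)
          REFERENCES meta_table (physical_table_name)
  );

  -- The postal code entry from the earlier slide would then be captured as:
  INSERT INTO meta_table (physical_table_name, long_name, description, data_steward)
  VALUES ('LKUP.POSTAL_CD_MASTER',
          'Postal Code Master - IHC',
          'Contains Postal (Zip) codes for the IHC referral region and IHC-specific descriptions used for groupings in various analyses.',
          'Jim Allred, ext. 3518');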
24 The Data Model
25 Some Metadata Is Free
- It can be collected from the data dictionary
- For example
- Primary and Foreign Keys
- Indexed Columns
- Table Creation Dates
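For example, on Oracle this "free" metadata can be pulled straight from the data dictionary (USER_ views shown; use the ALL_ or DBA_ views if the EDW account does not own the tables):

  -- Primary and foreign keys ('P' = primary key, 'R' = foreign key)
  SELECT table_name, constraint_name, constraint_type
    FROM user_constraints
   WHERE constraint_type IN ('P', 'R');

  -- Indexed columns
  SELECT table_name, index_name, column_name, column_position
    FROM user_ind_columns
   ORDER BY table_name, index_name, column_position;

  -- Table creation dates
  SELECT object_name AS table_name, created
    FROM user_objects
   WHERE object_type = 'TABLE';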
26 The Most Valuable Metadata Is Subjective
- The Human Factor
- Most metadata is not automatically collected by tools because it does NOT exist in that form
- Interviews with data stewards are the key
- It can take months of effort to collect initial metadata.
27 Holding Feet to the Fire
- Data architects
- Responsible for metadata in their subject areas
- Torturing it from data stewards
- Firm Principle
- No new data marts go live without metadata
28 Is Metadata Worth the Effort?
- Good metadata will push the ROI of a ho-hum DW into the stratosphere
- It does for DWing what the Yellow Pages did for the business ROI of the telephone
29 It Gets Used
- At Intermountain Health Care
- 210 web hits on average each weekday
- (23,000 employees, $2B revenue)
30 What's New in Metadata
31 Report Quality
- A function of
- Data quality
- How well does the report reflect the intent behind the question being asked?
- "This report doesn't make sense. I'm trying to find out how many widgets we can produce next year, based on the last four years' production."
- "That's not what you asked for."
- SQL and other programming accuracy
- Statistical validity: population size of the data
- Timeliness of the data relative to the decision
- Event Correlation
- Best Practice
- An accompanying meta-report for every report that involves significant, high-risk decisions
32 Meta Report
- A document, associated with a published report, which defines the report
33 Repository
- A central place for storing and sharing information about business reports
34 IHC Analysts' Use of Meta Reports
35 Meta Report
- Recommended Elements
- Business Owner
- Run Frequency
- Intended Audience
- Statistical Tests
- Software Used
- Source Code
- Formulas
- Relevant Issues Commentary
- Core Elements
- Author Information
- Report Name
- Report Purpose
- Data Source(s)
- Report Methods
36 - Title
- Location
- Author
- Owner
37 - Purpose
- Frequency
- Audience
- Data Source(s)
38 - Selection Criteria
- Statistics
- Software
- Source Code
- Formulas
39 What's It Look Like?
40 (No Transcript)
41 Utilization and Creation Rate
42 EDW Operations Best Practices
- Think Mission Control
- Mission Control Commander
- Dan Lidgard, EDW Operations Manager
- Operations metrics
- Customized ETL library
- Schedule of operations
- Alerting tool
- Storage strategies / backups
- Performance monitoring and tuning
43 IHC Architecture
- EDW
- Oracle v9.2.0.3 on AIX 5.2
- Storage: IBM SAN (Shark), >3TB
- ETL tools
- Ascential's DataStage
- Korn shell (Unix), SQL scripts, PL/SQL scripting
- OLAP: MS Analysis Services
- BI: Business Objects (Crystal Enterprise)
- With a cube presentation layer
- Dashboard: Visual Mining's NetCharts
- EDW Team: 16 FTEs, plus SAs and DBAs
44 Customized ETL Library
45 History
- One of the IHC ETL programmers noticed he kept doing the same things over and over for all of his ETL jobs
- Rather than copying and pasting this repetitive code, he created a library
- Now everyone on the team uses it
- Named the library EDW_UTIL
- EDW Utilities
46 Implementation
- Executes via Oracle stored procedures
- Supported by associated tables to hold data when necessary
- Error table
- QA table
- Index table
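A minimal sketch of what one corner of such a library could look like: an error table plus a procedure any ETL job can call. The names and columns are illustrative, not the actual EDW_UTIL code:

  -- Supporting table that holds captured ETL errors
  CREATE TABLE edw_util_error (
      job_name    VARCHAR2(100),
      error_time  DATE DEFAULT SYSDATE,
      error_code  NUMBER,
      error_text  VARCHAR2(4000)
  );

  -- Library procedure: log an error without disturbing the caller's transaction
  CREATE OR REPLACE PROCEDURE log_etl_error (
      p_job_name   IN VARCHAR2,
      p_error_code IN NUMBER,
      p_error_text IN VARCHAR2
  ) AS
      PRAGMA AUTONOMOUS_TRANSACTION;   -- the log row survives even if the job rolls back
  BEGIN
      INSERT INTO edw_util_error (job_name, error_code, error_text)
      VALUES (p_job_name, p_error_code, p_error_text);
      COMMIT;
  END log_etl_error;
  /

  -- Typical use inside an ETL job's exception handler:
  --   EXCEPTION WHEN OTHERS THEN
  --       log_etl_error('CASEMIX', SQLCODE, SQLERRM);
  --       RAISE;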
47 Benefits
- Provides standardization
- Eliminates code rewrites
- Can hide complexities
- Such as the appropriate way to analyze and gather statistics on tables
- Very accessible to all of our ETL tools
- Simply an Oracle stored procedure call
48 Index Management
- Past process included
- Dropping the table's indexes with a script
- Loading the table
- Creating the indexes with a script
- The past process resulted in messy scripts to manage and coordinate
49 Index Management
- New process includes
- Capturing a table's existing index metadata
- Dropping the table's indexes with a single procedure call
- Loading the table
- Recreating the indexes with a single procedure call
- There are no more messy scripts to manage and coordinate
- No more lost indexes, neglected when adding to the create-index script
50 Index Management
- Samples
- IMPORT_SCHEMA_INDEX_DATA
- IMPORT_TABLE_INDEX_DATA
- DROP_TABLE_INDEXES
- CREATE_TABLE_INDEXES
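A sketch of the general shape of DROP_TABLE_INDEXES and CREATE_TABLE_INDEXES, using the data dictionary and DBMS_METADATA to capture and replay index DDL. This is not the IHC code; real code would also skip indexes that enforce constraints and handle very long DDL:

  -- Holds the captured index DDL between the drop and the recreate
  CREATE TABLE edw_util_index (
      table_name  VARCHAR2(30),
      index_name  VARCHAR2(30),
      index_ddl   CLOB,
      captured_on DATE DEFAULT SYSDATE
  );

  CREATE OR REPLACE PROCEDURE drop_table_indexes (p_table_name IN VARCHAR2) AS
  BEGIN
      FOR ix IN (SELECT index_name
                   FROM user_indexes
                  WHERE table_name = UPPER(p_table_name)) LOOP
          -- Save the full CREATE INDEX statement...
          INSERT INTO edw_util_index (table_name, index_name, index_ddl)
          VALUES (UPPER(p_table_name), ix.index_name,
                  DBMS_METADATA.GET_DDL('INDEX', ix.index_name));
          -- ...then drop the index so the load runs unindexed
          EXECUTE IMMEDIATE 'DROP INDEX ' || ix.index_name;
      END LOOP;
      COMMIT;
  END drop_table_indexes;
  /

  CREATE OR REPLACE PROCEDURE create_table_indexes (p_table_name IN VARCHAR2) AS
  BEGIN
      FOR ix IN (SELECT index_ddl
                   FROM edw_util_index
                  WHERE table_name = UPPER(p_table_name)) LOOP
          -- Replay the saved DDL (assumes it fits in 32K; Oracle 11g can execute the CLOB directly)
          EXECUTE IMMEDIATE DBMS_LOB.SUBSTR(ix.index_ddl, 32767, 1);
      END LOOP;
      DELETE FROM edw_util_index WHERE table_name = UPPER(p_table_name);
      COMMIT;
  END create_table_indexes;
  /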
51 Background Table Loads
- Motive: higher availability ratings
- Solution
- Load data into tables which are not accessible to end users. A simple rename puts them into production.
- The library helps transfer the identical attributes from the live table to the background table
- Samples
- COPY_TABLE_METADATA
- TRANSFER_TABLE_PRIVS
- DROP_TABLE_INDEXES
- CREATE_TABLE_INDEXES (create on the background table, identical to the production table)
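The rename trick itself is worth a sketch; the table names (SALES_FACT, STAGING_SALES) are illustrative:

  -- 1. Build a background copy with the same structure as production
  CREATE TABLE sales_fact_bg AS
      SELECT * FROM sales_fact WHERE 1 = 0;

  -- 2. Load the background table (direct path is fine, since nobody queries it yet)
  INSERT /*+ APPEND */ INTO sales_fact_bg
      SELECT * FROM staging_sales;
  COMMIT;

  -- 3. Run QA checks, recreate indexes, and re-grant privileges on SALES_FACT_BG here

  -- 4. Swap: the renames are nearly instantaneous, so users see little or no outage
  RENAME sales_fact TO sales_fact_old;
  RENAME sales_fact_bg TO sales_fact;

  -- 5. Keep the old copy until the new table is verified, then drop it
  DROP TABLE sales_fact_old;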
52 Tracking Load Times, Errors, QA
- We had no idea who was loading what and when
- Each staff member logged in their own way and for their own interest
- ETL error capturing and QA was difficult
- We can now capture errors and QA information in a somewhat standardized fashion
53 Load Times, Errors, QA
- Samples
- BEGIN_JOB_TIME
- (ex: CASEMIX)
- BEGIN_LOAD_TIME
- (ex: CASEMIX INDEX)
- END_LOAD_TIME
- END_JOB_TIME
- COMPLETE_LOAD_TIME (begin and end together)
- LOAD_TIME_ERROR (alert on these errors)
- LOAD_TIME_METRICS (QA, row counts)
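The slide gives only the procedure names; the call signatures below are guesses, meant just to show how an instrumented job might read if these calls live in the EDW_UTIL library:

  BEGIN
      edw_util.begin_job_time('CASEMIX');                    -- the whole job
      edw_util.begin_load_time('CASEMIX', 'CASEMIX_FACT');   -- one table load within it

      -- ... load CASEMIX_FACT here ...

      edw_util.end_load_time('CASEMIX', 'CASEMIX_FACT');
      edw_util.load_time_metrics('CASEMIX', 'CASEMIX_FACT', 1234567);  -- QA row count (illustrative)
      edw_util.end_job_time('CASEMIX');
  EXCEPTION
      WHEN OTHERS THEN
          edw_util.load_time_error('CASEMIX', SQLERRM);      -- the alerting tool keys off these rows
          RAISE;
  END;
  /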
54 Miscellaneous Procedures
- Motive: hide the gory details and make Oracle look pretty
- Such as Oracle's table analyze command
- Gives consistent application of system-wide parameters such as
- A new box with a different number of CPUs (parallel slaves), or
- A new version of Oracle
- We populate some metadata too, such as last load date
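For instance, a wrapper around statistics gathering might look roughly like this (the procedure name and settings are illustrative; DBMS_STATS is the Oracle package behind the old ANALYZE command):

  CREATE OR REPLACE PROCEDURE analyze_edw_table (p_table_name IN VARCHAR2) AS
      c_degree CONSTANT NUMBER := 8;     -- tuned once per box (number of parallel slaves)
  BEGIN
      DBMS_STATS.GATHER_TABLE_STATS(
          ownname          => USER,
          tabname          => UPPER(p_table_name),
          estimate_percent => 10,        -- one system-wide sampling standard
          degree           => c_degree,
          cascade          => TRUE);     -- include the table's indexes
      -- The real library would also stamp "last load / last analyze" metadata here
  END analyze_edw_table;
  /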
55 DW Schedule of Operations
- Motives
- Choreography of operations
- Avoid stepping on each other's toes
- Maximize shrinking windows of opportunity
- Communicating with customers
- What's happening when
- Challenges to a complete schedule
- Some loads are ad hoc, not scheduled
- Users query in an ad hoc fashion
- Use of a variety of ETL tools
- Use of a variety of user categories
- DBA, SA, ETL user, end users
- Lots of servers to manage
- Production EDW, Stage EDW, ETL servers, OLAP servers, presentation layer servers
56 Solution
- Focus on operations around the production EDW
- Still working on all the reporting aspects
- A sample on the next slide
- Pull this information out of the load-times data captured by these ETL library calls
- BEGIN_JOB_TIME
- BEGIN_LOAD_TIME
- END_LOAD_TIME
- END_JOB_TIME
- COMPLETE_LOAD_TIME
57 Sample Report
58 DW Alerting Tool
- Motive: business dashboard alerts
- Aggregate data alerts, such as, "The number of patients seen in the ER in the last week with a chief complaint of flu indicates the onset of an outbreak."
- A simple tool was created which sends a text email, based on the existence of data returned from a query
- Healthcare isn't ready for this, yet
- But the EDW Team took advantage of the tool in unplanned ways
59 Features
- Web based
- Open to all EDW users
- Flexible run schedules
- Daily, weekly, every two weeks, monthly, quarterly
- Wakes every 5 minutes
- Ability to enter the query in SQL
- Alert (email) on 3 situations
- Query returns data
- Query returns no data
- Always
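The core of such a tool is small. A sketch, with illustrative table and procedure names; the e-mail call is left as a comment because it depends on the mail setup (UTL_SMTP, UTL_MAIL, or an external mailer):

  CREATE TABLE edw_alert (
      alert_id     NUMBER PRIMARY KEY,
      alert_name   VARCHAR2(200),
      query_text   VARCHAR2(4000),      -- the user-entered SQL
      notify_when  VARCHAR2(10),        -- 'DATA', 'NO_DATA', or 'ALWAYS'
      email_to     VARCHAR2(200),
      frequency    VARCHAR2(20)         -- DAILY, WEEKLY, MONTHLY, QUARTERLY, ...
  );

  CREATE OR REPLACE PROCEDURE check_alert (p_alert_id IN NUMBER) AS
      v_alert edw_alert%ROWTYPE;
      v_rows  NUMBER;
  BEGIN
      SELECT * INTO v_alert FROM edw_alert WHERE alert_id = p_alert_id;

      -- Count the rows returned by the stored query
      EXECUTE IMMEDIATE 'SELECT COUNT(*) FROM (' || v_alert.query_text || ')'
          INTO v_rows;

      IF (v_alert.notify_when = 'DATA'    AND v_rows > 0)
      OR (v_alert.notify_when = 'NO_DATA' AND v_rows = 0)
      OR  v_alert.notify_when = 'ALWAYS' THEN
          NULL;  -- send a plain-text e-mail to v_alert.email_to here
      END IF;
  END check_alert;
  /

A scheduler that wakes every 5 minutes would simply loop through the alerts whose frequency says they are due and call CHECK_ALERT for each one.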
60 User Interface
61 Examples
- 100 alerts in use
- Live performance check
- Every 4 hours
- Alert: check for inactive sessions holding active slaves
- Daily
- Alert: look for any active sessions older than 72 hours
- ETL monitoring
- Alert only if there is a problem
- Alert on errors logged via the ETL_UTIL library (manage by exception)
- Alert on the existence of bad records captured during ETL
62 Storage and Backup
- The inherited state of affairs
- DW running like any OLTP database
- High-end, expensive SANs
- FULL nightly online backups
- Out of space? Just buy more
63 Nightmare in the Making
- Exponential growth
- More data sources
- More summary tables
- More indexes
- No data has yet been purged
- Relaxed attitude
- "Disk is cheap"
- Reality: disk management is expensive
- Consequence: negating the cost benefits of cheaper storage by becoming careless with the commodity
64 Looming Crisis
- Backups running 16 hours or more
- Performance degradation witnessed by users
- Good backups obtained less than 50% of the time
- Squeezing into load windows
- Running out of space
- Gross underestimating
- Financial cost
- The system administrators (SAs) quadrupled the price of disk purchases from the previous budget year
- They included the price of tape drives, etc.
65 Major Changes in Operations
- Stop performing full daily backups
- Move towards a less expensive disk option
- IBM offers several levels of SANs
- Transfer some disk ownership AND backup responsibilities to the DW team, away from SAs and DBAs
- EDW team more aware of upcoming space demands
- EDW team more in tune with which data sets are easily recreated from the source and don't need a backup
66 Tracking and Predicting Storage Use
67 Changes to Backup Strategy
- Perform a full backup once monthly during downtime
- Perform no data backup on DEV/STAGE environments
- Do back up DDL (all code) daily in all environments
- Implement daily incremental backups
68 Daily Incremental Backups
- Easier said than done
- Resorted to a table-level backup
- The DW team owns which tables are exported
- The DW team populates a table, the "export table list", with each table's export frequency
- Populated via an application in development
- The DBAs run an export based on the export table list
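A sketch of that export table list and how it might drive the DBAs' nightly export; the table and column names are illustrative, and the exp command line is only an example:

  -- The DW team decides what gets exported and how often
  CREATE TABLE export_table_list (
      owner            VARCHAR2(30),
      table_name       VARCHAR2(30),
      export_frequency VARCHAR2(20),    -- e.g. DAILY, WEEKLY, MONTHLY
      last_exported    DATE
  );

  -- The nightly job reads the list...
  SELECT owner || '.' || table_name AS export_target
    FROM export_table_list
   WHERE export_frequency = 'DAILY';

  -- ...and feeds the targets to Oracle's export utility, along the lines of
  --   exp userid=system/<password> tables=EDW.CASEMIX_FACT,EDW.LAB_RESULT file=edw_daily.dmp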
69 Use Cheaper Disk
- General practice: you can take greater risks with DW reliability and availability vs. OLTP systems
- Use it to your advantage
- Our SAN vendor (IBM) offers a few levels of SANs. The next level down is a big step down in price, a small step down in features.
- Feature loss
- Read cache (referring to disk cache, not box memory)
- We rarely read the same thing twice anyway
- No "phone home" to IBM (auto paging)
- Mean time to failure is higher, but still acceptable
70 Performance Monitoring and Tuning
- Err on the side of freedom and empowerment
- How much harm can really be done?
- We'd rather not constrain our customers
- Pounding queries will find their way to production
- Opportunity to educate users
- Opportunity to tune underlying structures
71 Performance Focus Areas
- Indexing
- Well-defined criteria for when and how to apply indexes
- Is this a lost art?
- Big use of BITMAP indexes
- Composite index trick (acts like a table)
- Partitioning for performance, rather than data management
- Exploiting Oracle's Direct Path INSERT feature
- Avoiding UPDATE and DELETE commands
- Copy with MINUS instead
- Implementing Oracle's Parallel Query
- Turn off referential integrity in the DW... a no-brainer
- That's the job of the source system
72 DW Monitoring: Empowering End Users
- Motive
- Too many calls from end users about their queries
- "Please kill it."
- "Is it still running or is my PC locked up?"
- "Why is the DW so slow?"
- Give them the insight and tools
- Give them the ability to kill their own queries
- Still in the works
73 The Insight
74 Tracking Long-Running Queries
- At IHC, we used Pinecone (from Ambeo) to monitor all queries and the SQL
- Each week, review the longest queries
- Typical outcome?
- Add indexes
- Denormalize
- Contact the user and assist them with writing a better query
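If you do not have a monitoring product like Pinecone, a rough substitute is to mine Oracle's V$SQL view for the most expensive statements still in the shared pool (not the IHC approach, just an alternative sketch):

  SELECT sql_text,
         executions,
         ROUND(elapsed_time / 1000000) AS total_elapsed_sec,   -- elapsed_time is in microseconds
         disk_reads,
         buffer_gets
    FROM v$sql
   WHERE elapsed_time > 30 * 60 * 1000000                      -- more than 30 minutes in total
   ORDER BY elapsed_time DESC;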
75 The DW Sandbox
- Motives
- Lots of little MS Access databases containing valuable data, spread all over the place
- Analysts were joining them with DW data
- Costly to maintain
- Security issues
- PC hogs
- Solution
- Provide customers with their own sandbox on the DW, with DBA-like privs
76 Sandbox Freedom
- Web-based tool for creating tables and loading disparate data to the DW
- Simple, easy-to-use interface
- Privileges
- Users have full rights to the tables they create
- Can grant rights to others
- Big, big victory for customer service and data maturity
- 10% of DW customers use the Sandbox
- About 600 tables in use now
- About 2GB of data
77 Best Practices in the Design Phase
78 Design-Build
- Design horizontally, build vertically
- Start by building data marts that address analytic needs in one area of the business with a fairly limited data set
- But design with the horizontal needs of the company in mind, so that you will eventually tie all of these vertical data marts together with a common semantic layer, or data bus
79 Creating Value in Both Axes
(Diagram: value created along both the Design and Build axes)
80 For Example
Oncology Data Integration Strategy
- Top-down reporting requirements and data model
- An Integrated Analytic Data Model for Cancer Care
- Disparate sources connected semantically to the data bus
- Cancer Registry
- Mammography / Radiology
- Pathology
- Laboratory
- Continuing Care and Follow-Up
- Quality of Life Survey
- Radiation Therapy
- Health Plans Claims
- Ambulatory Casemix
- Acute Care Casemix
81 Development Process
Architectural Data Flow Diagram
Logical Data Model
Physical Data Model
Analytic Reporting Requirements
ETL Requirements
Source-to-Target Data Map and Logic
ETL Programming
Analytic Reporting System
Data Processing Requirements
Analytic Application Development
End User Interface Options
82 The Logic Layer in Data Warehouses
(Diagram: transaction/source systems feed the data warehouse through the ETL process; the data layer, logic layer, and presentation layer sit between them and the analytic systems and reports, with "Here" and "Not Here" marking where the logic layer belongs)
83 Evidence of Business Process Alignment
- Map out your high-level business process
- Don't fall prey to analysis paralysis with endless business process modeling diagrams!
- Identify and associate the transaction systems that support those processes
- Identify the common, overlapping semantics/data attributes and their utilization rates
- Build your data marts within an enterprise framework that is aligned with the processes you are trying to understand
84 For Example
(Diagram: a healthcare business process (health need, patient perception, diagnosis, procedure, results/outcomes, claims processing, AP/AR), supported by non-integrated data in transaction systems (HELP, Lab, HPI MC400, AS400, IDX, HDM, CIS/CDR, HNA, Rx, Survey) and integrated in the Data Warehouse)
85 Event Correlation
- A leading-edge Best Practice
- The third dimension to rows and columns
- Overlays the data that underlies a report or graph
- "In 2004, we experienced a drop in revenue as a result of the earthquake that destroyed our plant in the Philippines."
- "In January of 2005, we saw a spike in the North America market for snow shovel sales that coincided with an increase in sales of pain relievers. This correlates to the record snowfall in that region and should not be considered a trend. Barring major product innovation, we consider the market for snow shovels in this area saturated. Sales will be slow for the next several years."
86 Standardizing Semantics
- Sweet irony: the many synonyms for "standard semantics"
- Data dictionary
- Vocabulary
- Dimensions
- Data elements
- Data attributes
- The bottom-line issue: standardizing the terms you use to describe key facts about your business
87 Standardizing Names of Things
- You had better do it within the first two months of your data warehouse project
- If you are beyond that point, you had better stop and do it now, lest you pay a bigger price later
- Don't
- Push the standard on the source systems, unless it's easy to accomplish
- This was one of the common pitfalls of early data warehousing project failures
- Try to standardize everything under the sun!
- Focus on the high-value facts
88 Where Are the High-Value Semantics?
In the high-overlap, high-utilization areas
(Diagram: overlapping source systems, e.g., Source System X)
89 Another Perspective
(Diagram: semantic overlap vs. semantic utilization)
90 The Standard Semantic Layer
91 Data Modeling
- Star schemas are great and simple, but they aren't the end-all, be-all of analytic data modeling
- Best practice: do what makes sense; don't be a schema bigot
- I've seen great analytic value from 3NF models
- Maintain data familiarity for your customers
- When meeting vertical needs
- Don't make massive changes to the way the model looks and feels, nor the naming conventions; you will alienate existing users of the data
- Use views to achieve new or standards-compliant perspectives on data
- When meeting horizontal needs
92 For Example
(Diagram: the source perspective and the DW perspective of the same data, serving the horizontal data customer)
93 The Case for Timely Updates
Generally, to minimize Total Cost of Ownership (TCO), your update frequency should be no greater than the decision-making cycle associated with the data. But everyone wants more timely data.
94 Best Practice: Measure Yourself
The Data Warehouse Dashboard
- Total number of query-able tables
- Total number of query-able columns
- Number of users
- Average rows delivered per month
- Storage utilization
- CPU utilization
- Downtime per month by data mart
- Employee satisfaction
- Customer satisfaction
- Average number of queries/month
- Number of queries above a threshold (30 minutes?)
- Average query response time
- Total number of records
95 Other Best Practices
- The Data Warehouse Information Systems Team reports to the CIO
- Most data analysts can, and probably should, report to the business units
- Change management / service level agreements with the source systems
- No changes in the source systems unless they are coordinated with the data warehouse team
96 More Best Practices
- Skills of the Data Warehouse IS Team
- Experienced chief architect/project manager
- Procedural/script programmers
- SQL/declarative programmers
- Data warehouse storage management architects
- Data warehouse hardware architects and system administrators
- Data architects/modelers
- DBAs
97 More Best Practices
- Evidence of project collaboration
- A cross-section of members and expertise from the data warehouse IS team
- Statisticians and data analysts who understand the business domain
- A customer that understands the process(es) being measured and can influence change
- A data steward, usually someone from the front lines who knows how the data is collected
- Project: complex reports or a data mart
98 More Best Practices
- Whenever possible, extract as close to the source as possible
99 The Most Popular Authors
- I appreciate
- The interest they stir
- The vocabulary and semantics of this new specialty that they helped create
- The downside
- The buzzwords that are more buzz than substance
- Corporate Information Factories
- Endless, meaningless debate
- "That's not an Operational Data Store!"
- "Do you follow Kimball or Inmon?"
- Follow your own common sense
- Most of these authors have not had to build a data warehouse from scratch and live with their decisions through a complete lifecycle
100 ETL Operations
- Besides the cultural risks and challenges, the riskiest part of a data warehouse
- Good book
- Westerman, Wal-Mart Data Warehousing
- The Extract, Transform, and Load processes
- Worthy of their own Best Practices discussion
- Suffice it to say, mitigate risks in this area carefully and deliberately
- The major design errors don't show up until late in the lifecycle, when the cost of repair is great
101 Two Essential ETL Functions
- Initial loads
- How far back do we go in history?
- Maintenance loads
- Differential loads or total refresh?
- How often?
- You will run and tune these processes several times before you go into production
- How many records are we dealing with?
- How long will this take to run?
- What's the impact on the source system's performance?
102 Maintenance Loads
- Total refresh vs. incremental loads
- Total refresh: truncate and reload everything from the source system
- Incremental: load only the new and updated records
- For small data sets, a total refresh strategy is the easiest to implement
- How do you define small? You will know it when you see it.
- Sometimes the fastest strategy when you are trying to show quick results
- Grab and go
103 Incremental Loads
- How do we get a snapshot of the data that has changed since the last load?
- Many source systems will have an existing log file of some kind
- Take advantage of these when you can; otherwise, incremental loads can be complicated
104 File Transfer Formats
- Design your extract so that it uses
- A fixed, predetermined length for all records and fields
- Avoid variable length if at all possible
- A unique character that separates each field in a record
- A standard format for header records across all source systems
- Such as the first three records in each file
- Include the name of the source system, the file, the record count, and the number of fields in the record
- This will be handy for monitoring jobs and collecting load metadata
105 Benefits of a Standard File Transfer Format
- Compatible with standard database and operating system utilities
- Dynamically create initial and maintenance load scripts
- Read the table definitions (DDL), then merge that with the standard transfer file format
- Dynamically generate load monitoring data
- Read the header row, insert that into a Load Status table with status "Running", # of records, start time
- At EOF, change status to "Complete" and capture the end-of-load time
- I wish I had thought about this topic more, and earlier in my career
106 Westerman Makes a Good Point
- If you design for instantaneous updates from the beginning, it translates to less ETL maintenance and labor time for the EDW staff later
- My experience: ETL is the least tasteful and productive use of a veteran EDW Team member, so I like Westerman's insight on this topic
107 Messaging Applied to ETL
- Motive behind the concept
- Continuous updates in a mixed-workload environment
- Mixed: analytical processing at the same time as transaction-oriented, constant updates, deletes, and inserts
- Basic concepts
- Use a load message queue for records that need to be updated, coming from the source systems
- When the EDW analytical processing workload is low (off-peak), pick the next message off the load queue and load the data
- Run this in parallel so that you can process several load messages at the same time while you have a window of opportunity
- Sometimes called throttling
- Speed up and slow down based upon traffic conditions
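A minimal sketch of the idea using a plain queue table rather than a full messaging product; the table, threshold, and batch size are illustrative:

  CREATE TABLE etl_load_queue (
      msg_id       NUMBER PRIMARY KEY,
      target_table VARCHAR2(30),
      payload_file VARCHAR2(200),                 -- file or staging table holding the records
      queued_on    DATE DEFAULT SYSDATE,
      status       VARCHAR2(10) DEFAULT 'QUEUED'  -- QUEUED / RUNNING / DONE
  );

  -- The loader wakes periodically and throttles itself against the analytic workload
  DECLARE
      v_active_sessions NUMBER;
  BEGIN
      SELECT COUNT(*) INTO v_active_sessions
        FROM v$session
       WHERE status = 'ACTIVE' AND type = 'USER';

      IF v_active_sessions < 20 THEN               -- off-peak: process a batch of messages
          FOR msg IN (SELECT msg_id, target_table, payload_file
                        FROM (SELECT msg_id, target_table, payload_file
                                FROM etl_load_queue
                               WHERE status = 'QUEUED'
                               ORDER BY queued_on)
                       WHERE ROWNUM <= 4) LOOP     -- parallel batch size
              UPDATE etl_load_queue SET status = 'RUNNING' WHERE msg_id = msg.msg_id;
              -- hand msg.payload_file to a loader process here
          END LOOP;
          COMMIT;
      END IF;                                      -- peak load: do nothing this cycle
  END;
  /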
108 ETL Message Queue Process
109 Four Data Maintenance Processes
- Initial load
- Loading into an empty table
- Append load
- Update process
- Delete process
- As much as practical, use your database utilities for these processes
- Study and know your database's utilities for data warehousing; they are getting better all the time
- I see some bad strategies in this area, e.g., companies spending time building their own utilities.
110 A Few Planning Thoughts
- Understand the percentage of records that will be updated, deleted, or inserted
- You'll probably develop a different process for 90% inserts vs. 90% updates
- Logging
- In general, turn logging off during these processes, if logging was on at all
- Field- vs. record-level updates
- Some folks, in the interest of purity, will build complex update processes for passing only field (attribute) level changes
- No-brainer: pass the whole record
111 Initial Load
- Every table will, at some time, require an initial load
- For some tables, it will be the best choice for data maintenance
- Total data refresh
- Best for small tables
- Simple process to implement
- Simply delete (or truncate) and reload with fresh data
112 A Better Initial Load Process
- Background load
- Safer: protects against corrupt files
- Higher availability to customers
- Three or four steps, maybe 6?
- Create a temporary table
- Load the temporary table
- Run quality checks
- Rename the temporary table to the production table name
- Delete the old table
- Re-grant rights, if necessary
- Westerman: "You want to use as many initial load processes as possible."
- I agree!
113 Append Load
- For larger tables that accumulate historical data
- There are no updates, just appends
- A hard fact that will not change
- Example
- Sales that are closed
- Lab results
114 Append Load Options
- Load a single part of a table
- Load a partition and attach it to the table
- Create a new, empty partition
- Load the new records
- Attach the partition to the table
- Look to use the LOAD APPEND command in your database
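In Oracle, the "attach a partition" option is a partition exchange. A sketch, assuming a fact table range-partitioned by date with no MAXVALUE partition; the names are illustrative:

  -- Make room for the new slice of history
  ALTER TABLE sales_fact ADD PARTITION p_2005
      VALUES LESS THAN (DATE '2006-01-01');

  -- Load the new records into a standalone table of identical structure
  CREATE TABLE sales_fact_2005 AS
      SELECT * FROM sales_fact WHERE 1 = 0;

  INSERT /*+ APPEND */ INTO sales_fact_2005
      SELECT * FROM sales_stage;
  COMMIT;

  -- The exchange is a data dictionary operation, so it is effectively instantaneous
  ALTER TABLE sales_fact
      EXCHANGE PARTITION p_2005 WITH TABLE sales_fact_2005;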
115 Another Append Option
- Create a temp table identical to the one you are loading
- Load the new records into the empty temp table
- Issue an INSERT/SELECT
- INSERT INTO Big_Table (SELECT * FROM Temp_Big_Table)
- Delete the temp table
IF # OF RECORDS IN TEMP IS MUCH < # OF RECORDS IN BIG THEN GOOD TECHNIQUE ELSE NOT GOOD
116 Update Process
- The most difficult and risky to build
- Use this process only if the tables are too large for a complete-refresh, Initial Load process
- Updates affect data that changes over time
- Like purchase orders, hospital transactions, etc.
- Medical records, if you treat the data maintenance at the macroscopic level
117 Update Process Options
- Simple process
- Separate the affected records into an update file, an insert file, or a delete file
- Do this on the source system, if possible
- Transfer the files to the data warehouse staging area
- Create and run two processes
- A delete process for deleting the records in the production table that need to be updated or deleted
- An insert process for inserting the entire newly updated record into the production table, as well as the true inserts
- Simple, but typically not very fast
118 Simple Process
- The Delete Process identifies records for deletion from the Production Table based upon the contents of the Updates file.
- The Delete Process identifies records for deletion from the Production Table based upon the contents of the Deletes file.
- The Delete Process deletes records from the Production Table.
- The Insert Process identifies records for insert into the Production Table based upon the contents of the Updates file.
- The Insert Process identifies records for insert into the Production Table based upon the contents of the Inserts file.
- The Insert Process inserts records into the Production Table.
(Diagram: the source system sends updated, deleted, and new records to the EDW staging area as Updates, Deletes, and Inserts files; the Delete Process and the Insert Process apply them to the EDW Production Table in the order above)
119 When You Are Unsure
- Sometimes, source system log and audit files make it difficult to know if a record was updated or inserted (i.e., created)
- Try this
- Load the records into a temp table that is identical to the production table to be updated
- Delete the corresponding records from the production table
- DELETE FROM prod_table WHERE key_field IN (SELECT temp_key_field FROM temp_table)
- Insert all the records from the temp table into the production table
- Most databases now support this with an UPSERT
120 Massive Deletes
- Just as with Updates and Inserts, the number of Deletes you have to manage is inversely proportional to the frequency of your ETL processes
- Infrequent ETL means massive data operations
- Partitions work well for this, again
- E.g., keeping a 5-year window of data
- Insert the most recent year as a partition
- Delete the oldest year's partition
- Blazing fast! (see the sketch below)
(Diagram: a five-partition rolling window; the oldest partition is deleted and a new partition is inserted)
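Sketched in Oracle SQL against an illustrative fact table range-partitioned by year:

  -- The massive delete becomes a partition drop: a whole year of rows gone in seconds,
  -- with no DELETE statements (note: global indexes go UNUSABLE unless you add
  -- UPDATE GLOBAL INDEXES or rebuild them afterwards)
  ALTER TABLE encounter_fact DROP PARTITION p_2000;

  -- The matching insert side is the partition add/exchange shown under the append load options
  ALTER TABLE encounter_fact ADD PARTITION p_2005
      VALUES LESS THAN (DATE '2006-01-01');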
121 Raw Data Standards for ETL
- Makes the process of communicating with your source system partners much easier
- Data types (e.g., format for date-time stamps)
- File formats (ASCII vs. EBCDIC)
- Header records
- Control characters
- Rule of thumb
- Never transfer data at the binary level unless you are transferring between binary-compatible computer systems
- Use only text-displayable characters
- Less rework time vs. less storage space and faster transfer speed
- Storage and CPU time are cheap compared to labor
122 Last Thought: Indexing Strategies
- Define these early, practice them religiously, use them extensively
- This is Database Design 101
- Don't fall prey to this most common performance problem!
123 In Summary
- Best Practices
- They are as much about what you should do as what you should not do
- Riskiest areas of data warehousing
- Poor risk management
- Risk = Probability x Consequence
- Data quality and ownership politics lead to project quagmire
- Lack of data warehousing skills on the IS team
- Bad, basic, old IS operations around the DW
- Like storage, backup, indexing, managing by metrics
- Exploding and imploding ETL
- A corporate culture that doesn't care about improvement
124 My Thanks
- For being invited
- For your time and attention
- For the many folks who have worked for and with me over the years and made me look better as a result
- Please contact me if you have any questions
- dsanders@nmff.org
- PH: 312-695-8618