Title: Extract, Transform and Load
1. Extract, Transform and Load
2. Methodology Review
[Diagram labels: Business Needs; Delivery; Storage; Engineering; Acquisition; Access]
- Define business requirements, with focus on information requirements
- Define dimensional data model for the data mart (facts and dimensions)
- Define data model for the DW
- Identify authoritative source of required data from operational systems and analyze source data
- Map source data, define transformation requirements
- Design, build and test extract, transformation and load mechanisms
- Design and build end user application
- Demonstrate and refine user application via prototyping techniques
- Train and conduct user acceptance
(Progress: the first four steps are DONE; WE ARE HERE at step 5, mapping and transformation requirements.)
3. Step 5: Mapping and Transformation Requirements
- This step defines:
  - How data from source system(s) will map to the data warehouse and data marts
  - The process and logic that will be used to extract data from source systems
  - The requirements for transformation of source data into a form that is ready to load into the data warehouse (a mapping-specification sketch follows this list)
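To make the mapping deliverable concrete, here is a minimal sketch of how a source-to-target mapping specification might be recorded. The table, column, and rule names are hypothetical illustrations, not from any particular project; Python is used only as a convenient notation:

```python
# A minimal sketch of a source-to-target mapping specification.
# All table, column, and rule names below are hypothetical.

# Each entry maps a source system field to a warehouse field and
# names the transformation rule that must be applied on the way in.
MAPPING_SPEC = [
    # (source table.column,  target table.column,       transformation rule)
    ("ORDERS.ORD_DT",        "fact_sales.order_date",   "parse YYYYMMDD string to DATE"),
    ("ORDERS.CUST_NO",       "fact_sales.customer_key", "lookup surrogate key in dim_customer"),
    ("ORDERS.AMT",           "fact_sales.sales_amount", "convert packed decimal to NUMERIC(12,2)"),
    ("CUSTMAST.CUST_NM",     "dim_customer.name",       "trim, uppercase, standardize"),
]

def print_mapping_document(spec):
    """Render the mapping spec as a simple requirements document."""
    for source, target, rule in spec:
        print(f"{source:20} -> {target:26} [{rule}]")

if __name__ == "__main__":
    print_mapping_document(MAPPING_SPEC)
```

A specification of this shape is the main output of step 5: it drives the design of the extract and transformation routines built in the next step.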
4. Design, Build and Test Extract and Transformation Routines
- In this step, the programs required to extract, transform and load data into the data warehouse are developed.
- This step is very similar to any information systems project (design, code and test).
- In order to perform technical design, we must take into consideration the programming language (C, COBOL, PL/SQL) or ETL tool (Informatica, DataStage, ETI) that will be used.
5. Extract, Transform and Load Tools
- Today, most large-scale data warehouse projects use ETL tools to perform ETL functions
- Rationale:
  - The tools provide a number of key data warehouse functions that greatly enhance developer productivity (more on this next class)
  - The tools provide robust metadata that allows quick reaction to changes in source systems or evolution of the data warehouse
  - They require less developer expertise, allowing more junior staff to perform development activities, reducing costs
  - Many tools provide job scheduling and monitoring functions
6. ETL Tools
- First Generation Tools
  - Generated program code (typically C or COBOL) that performed ETL functions
  - Vendors: Platinum, ETI
- Second Generation Tools
  - Engine-based: code generation is transparent to developers and is stored within the tool as metadata
  - Transformations are done in memory, on the fly (a sketch of this style follows below)
  - Advantage: flexible, easy to learn and use, can develop applications more quickly than with code generators
  - Disadvantage: may not perform as well as code generators and may not be able to handle the same degree of complexity
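As an illustration of the engine-based, in-memory style (not any specific vendor's engine), here is a minimal Python sketch in which records stream through transformation steps one at a time, with no generated source code and no intermediate files. The field names are hypothetical:

```python
# Records flow through composed generators: each row is extracted,
# transformed, and handed to the load stage entirely in memory.

def extract(rows):
    """Source stage: yield raw records one at a time."""
    for row in rows:
        yield dict(row)

def clean(rows):
    """Transformation stage: trim and standardize the name field."""
    for row in rows:
        row["name"] = row["name"].strip().upper()
        yield row

def convert(rows):
    """Transformation stage: convert the amount field to a number."""
    for row in rows:
        row["amount"] = float(row["amount"])
        yield row

def load(rows):
    """Target stage: here we just collect; a real engine writes to the DW."""
    return list(rows)

source = [{"name": " acme corp ", "amount": "125.50"},
          {"name": "globex",      "amount": "99.00"}]

# Each record passes through every step "on the fly".
warehouse_rows = load(convert(clean(extract(source))))
print(warehouse_rows)
```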
7. ETL Tools
- Code Generators
  - Today, most organizations choose engine-based products over code generators
  - Leading tools: Informatica, Ascential, Epiphany
8. Data Acquisition
- Purpose
  - This component acquires all data, internal and external, based on the target data models within the Storage component

[Diagram labels: Source Data (Internal and External); Metadata; Extraction process]
9. Data Acquisition
- Functions/Considerations for Design
  - Source Data
    - Platform/DBMS
    - Location of extraction
    - Method of obtaining new/changed data
    - Specific data to obtain
  - Audit and Error Trapping
10. Data Acquisition Considerations
- Platform and DBMS from which data will be extracted
  - If the source is relational, the extraction process is, generally speaking, easier
  - This is because the metadata is readily available
  - However, in large organizations, sources may still be non-relational (VSAM files and others)
  - Issue: metadata is available only from things like COBOL copybooks; such metadata is more complex and adds time to the design process to understand
- Choice of ETL tool often depends on the sources from which data must be extracted.
11. Data Acquisition Considerations
- Location of extraction
  - Need to consider where the extraction process will take place: on the same server where the source system resides, or elsewhere
  - Most common approach: on the same server as the source system
  - May also do extraction on the DW server, if resources are not available on the source system server
12. Data Acquisition Considerations
- Method for extracting new or changed information
  - Each time we extract data, we only want data that has been added or changed since the last time we extracted
  - Ideally, we'd like each record in each source system table to contain a reliable last-change timestamp
  - We could then compare this timestamp with a timestamp we maintain in the DW to record the last time/date we extracted (see the sketch below)
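A minimal sketch of this timestamp/watermark comparison, assuming the source table carries a reliable last_change_ts column and the warehouse keeps one watermark per table. Table and column names are hypothetical, and SQLite stands in for the source system:

```python
import sqlite3

# Demo setup: an in-memory "source system" table with a last-change timestamp.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE customers (id INTEGER, name TEXT, last_change_ts TEXT)")
src.executemany("INSERT INTO customers VALUES (?, ?, ?)", [
    (1, "ACME",   "2024-01-01T10:00:00"),
    (2, "GLOBEX", "2024-03-15T09:30:00"),
])

# In a real DW the watermark would live in a control/metadata table.
watermarks = {}

def run_incremental_extract(source_conn, watermark_store, table="customers"):
    # 1. Read the watermark recorded after the previous extract.
    watermark = watermark_store.get(table, "1900-01-01T00:00:00")
    # 2. Pull only rows added or changed since then.
    rows = source_conn.execute(
        "SELECT id, name, last_change_ts FROM customers "
        "WHERE last_change_ts > ?", (watermark,)).fetchall()
    # 3. Advance the watermark to the newest timestamp we saw.
    if rows:
        watermark_store[table] = max(r[2] for r in rows)
    return rows

print(run_incremental_extract(src, watermarks))  # first run: both rows
print(run_incremental_extract(src, watermarks))  # second run: [] (no changes)
```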
13. Data Acquisition Considerations
- Unfortunately, this situation is not really common: either the timestamp does not exist, or it is not reliable for our purposes
- Often, we need to use other mechanisms to identify new or changed records
  - Log/journal
  - Worst-case scenario: delta processing, i.e. extract an entire table and compare it to a previous version to identify changes (sketched after this list)
- Delta processing is complex and very resource intensive
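A minimal sketch of delta processing against full snapshots (keys and fields hypothetical). Note that both snapshots must be held and compared in their entirety, which is exactly why the approach is resource intensive at scale:

```python
def delta(previous, current, key="id"):
    """Compare two full table snapshots (lists of dicts) keyed by `key`."""
    prev = {row[key]: row for row in previous}
    curr = {row[key]: row for row in current}
    inserts = [curr[k] for k in curr.keys() - prev.keys()]
    deletes = [prev[k] for k in prev.keys() - curr.keys()]
    updates = [curr[k] for k in curr.keys() & prev.keys()
               if curr[k] != prev[k]]
    return inserts, updates, deletes

yesterday = [{"id": 1, "name": "ACME"}, {"id": 2, "name": "GLOBEX"}]
today     = [{"id": 1, "name": "ACME Corp"}, {"id": 3, "name": "INITECH"}]

ins, upd, dele = delta(yesterday, today)
print("inserts:", ins)   # id 3 is new
print("updates:", upd)   # id 1 changed
print("deletes:", dele)  # id 2 disappeared
```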
14. Data Acquisition Considerations
- Determine all data to be extracted
  - Obviously, we need to extract all data required by the target data model
  - Additionally, we need to extract any data required for processing (last change dates, etc.)
- Audit
  - Design should consider requirements for auditing the extraction process
  - Gather statistics relative to extract volumes, processing times, etc. (a sketch follows below)
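A minimal sketch of extract auditing, where the extract step is wrapped so that row counts and elapsed times are recorded. The audit record layout is a hypothetical illustration:

```python
import time
from datetime import datetime, timezone

audit_log = []  # in a real DW this would be an audit table

def audited_extract(table_name, extract_fn):
    """Run an extract function and capture basic audit statistics."""
    started = time.perf_counter()
    rows = extract_fn()
    audit_log.append({
        "table": table_name,
        "run_at": datetime.now(timezone.utc).isoformat(),
        "rows_extracted": len(rows),
        "elapsed_seconds": round(time.perf_counter() - started, 3),
    })
    return rows

rows = audited_extract("customers", lambda: [{"id": 1}, {"id": 2}])
print(audit_log)
```

Statistics like these make it possible to spot abnormal extract volumes or runtimes, which is often the first sign of a source system problem.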
15. Data Acquisition Considerations
- Error trapping and exception handling
  - Design must consider key errors that should be captured during the extract process
  - Design must also consider how errors will be handled when they are encountered
  - ETL tools have features that help with the above (a sketch follows below)
- Examples of the types of errors we might want to capture?
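A minimal sketch of error trapping during extract/transform, in which bad records (missing keys, unconvertible values) are diverted to a reject list with a reason rather than aborting the whole run. The fields and rules are hypothetical:

```python
def transform_row(row):
    """Raise ValueError for the conditions we want to trap."""
    if row.get("id") is None:
        raise ValueError("missing primary key")
    if not str(row.get("amount", "")).replace(".", "", 1).isdigit():
        raise ValueError(f"non-numeric amount: {row.get('amount')!r}")
    return {"id": int(row["id"]), "amount": float(row["amount"])}

def run_with_error_trapping(rows):
    """Good rows continue down the pipeline; bad rows go to a reject list."""
    good, rejects = [], []
    for row in rows:
        try:
            good.append(transform_row(row))
        except ValueError as err:
            rejects.append({"row": row, "reason": str(err)})
    return good, rejects

good, rejects = run_with_error_trapping([
    {"id": 1, "amount": "10.00"},
    {"id": None, "amount": "5.00"},   # trapped: missing key
    {"id": 3, "amount": "N/A"},       # trapped: bad amount
])
print(good)
print(rejects)
```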
16. Data Engineering
- Purpose: applies the business meaning and rules to the incoming (extracted) data and performs all necessary transformations for compliance with those. Re-engineered data is the transformed data per the business and integrity requirements of the warehouse data models.

[Diagram labels: Metadata; Source Data Model(s); Warehouse Data Model; Transformation; Staged transformed data; Transformed Data]
17. Data Engineering
- Considerations for Design
  - Integration: assigning surrogate keys
  - Managing Slowly Changing Dimensions (a surrogate-key/SCD sketch follows this list)
  - Referential Integrity Checking
  - Denormalization/renormalization
  - Data type conversion
  - Cleaning
  - Standardization
  - Derivation
  - Aggregation(s)
  - Audit
  - Data lineage audit
  - NULL value management
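To illustrate the first two items, here is a minimal sketch of surrogate key assignment combined with a Type 2 slowly changing dimension, where each change creates a new row and expires the old one. All column names are hypothetical:

```python
from datetime import date

dim_customer = []          # the dimension table
next_surrogate_key = 1     # surrogate keys are generated by the warehouse

def apply_scd2(natural_key, attributes, effective=None):
    """Insert or version a dimension row, Type 2 style."""
    global next_surrogate_key
    effective = effective or date.today()
    current = next((r for r in dim_customer
                    if r["customer_id"] == natural_key and r["is_current"]), None)
    if current is not None:
        if current["attributes"] == attributes:
            return current["customer_key"]   # nothing changed, reuse the key
        current["is_current"] = False        # expire the old version
        current["end_date"] = effective
    row = {
        "customer_key": next_surrogate_key,  # surrogate key
        "customer_id": natural_key,          # natural/business key from source
        "attributes": attributes,
        "start_date": effective,
        "end_date": None,
        "is_current": True,
    }
    next_surrogate_key += 1
    dim_customer.append(row)
    return row["customer_key"]

apply_scd2("C001", {"city": "Ottawa"})
apply_scd2("C001", {"city": "Toronto"})  # change -> new version, old expired
for r in dim_customer:
    print(r)
```

Fact rows then reference customer_key rather than the natural key, so history is preserved: facts loaded before the change keep pointing at the old version of the dimension row.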