1
Extract, Transform and Load
  • Lecture 1

2
Methodology Review
(Methodology phases: Business Needs, Delivery, Storage, Engineering, Acquisition, Access)
  • Define business requirements, with focus on
    information requirements
  • Define dimensional data model for the data mart
    facts and dimensions
  • Define data model for the DW
  • Identify authoritative source of required data
    from operational systems and analyze source data
  • Map source data, define transformation
    requirements
  • Design, build and test extract, transformation and
    load mechanisms
  • Design and build end user application
  • Demonstrate and refine user application via
    prototyping techniques
  • Train users and conduct user acceptance testing

(Status annotations: the first four steps are marked DONE; WE ARE HERE marks the current step, step 5.)
3
Step 5: Mapping and Transformation Requirements
  • This step defines
  • How data from source system(s) will map to the
    data warehouse and data marts
  • The process and logic that will be used to
    extract data from source systems
  • The requirements for transformation of source
    data into a form that is ready to load into the
    data warehouse (one way to document such a
    mapping is sketched below)
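
As a concrete illustration, a source-to-target mapping can be recorded in a machine-readable form. A minimal Python sketch, with entirely hypothetical table, column and rule names:

  # A hypothetical source-to-target mapping for one fact table.
  # Each entry ties a source column to a warehouse column and
  # names the transformation rule applied during the load.
  SALES_FACT_MAPPING = [
      # (source column,    target column,    transformation rule)
      ("ORDERS.ORDER_DT",  "order_date_key", "look up date dimension key"),
      ("ORDERS.CUST_NO",   "customer_key",   "look up customer surrogate key"),
      ("ORDERS.AMT",       "sales_amount",   "cast to DECIMAL(12,2)"),
      ("ORDERS.STATUS_CD", "order_status",   "decode via reference table"),
  ]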

4
Design, Build and Test Extract and Transformation
Routines
  • In this step, the programs required to extract,
    transform and load data into the data warehouse
    are developed.
  • This step is very similar to any information
    system project (design, code and test)
  • To perform technical design, one must take into
    consideration the programming language (C, COBOL,
    PL/SQL) or ETL tool (Informatica, DataStage, ETI)
    that will be used.

5
Extract Transform and Load Tools
  • Today, most large-scale data warehouse projects
    use ETL tools to perform ETL functions
  • Rationale
  • The tools provide a number of key data warehouse
    functions that greatly enhance developer
    productivity (more on this next class)
  • The tools provide robust metadata that allows
    quick reaction to changes in source systems or
    evolution of the data warehouse
  • They require less developer expertise, allowing more
    junior staff to perform development activities and
    reducing costs
  • Many tools provide job scheduling and monitoring
    functions

6
ETL Tools
  • First-generation tools
  • Generated program code (typically C or COBOL)
    that performed ETL functions
  • Vendors: Platinum, ETI
  • Second-generation tools
  • Engine-based: code generation is transparent
    to developers, and the logic is stored within the
    tool as metadata
  • Transformations are done in memory, on the fly
  • Advantage: flexible, easy to learn and use;
    applications can be developed more quickly than
    with code generators
  • Disadvantage: may not perform as well as code
    generators and may not be able to handle the same
    degree of complexity

7
ETL Tools
  • Code Generators
  • Today, most organizations choose engine-based
    products over code generators
  • Leading tools: Informatica, Ascential, Epiphany

8
Data Acquisition
(Architecture diagram: Acquisition, Engineering, Storage, Data Delivery, Access)
  • Purpose
  • This component acquires all data, internal and
    external, based on the target data models within
    the Storage component

(Diagram elements: Source Data (Internal and External), Extraction process, Metadata)
9
Data Acquisition
  • Functions/Considerations for Design
  • Source Data
  • Platform/DBMS
  • Location of extraction
  • Method of obtaining new/changed data
  • Specific data to obtain
  • Audit and Error Trapping

10
Data Acquisition Considerations
  • Platform and DBMS from which data will be
    extracted
  • If the source is relational, the extraction
    process is generally easier, because the metadata
    is readily available
  • However, in large organizations, sources may still
    be non-relational (VSAM files and others)
  • Issue: metadata is available only from artifacts
    like COBOL copybooks. This metadata is more
    complex and adds time to the design process
  • Choice of ETL tool often depends on the sources
    from which data must be extracted.

11
Data Acquisition Considerations
  • Location of extraction
  • Need to consider where the extraction process will
    take place: on the same server where the source
    system resides, or elsewhere
  • Most common approach: on the same server as the
    source system
  • May also do extraction on the DW server, if
    resources are not available on the source system
    server

12
Data Acquisition Considerations
  • Method for extracting new or changed information
  • Each time we extract data, we only want data that
    has been added or changed since the last time we
    extracted
  • Ideally, we'd like each record in each source
    system table to contain a reliable last-change
    timestamp
  • We could then compare this timestamp with a
    timestamp we maintain in the DW recording the
    last date/time we extracted (sketched below)
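
A minimal sketch of this timestamp-based incremental extract in Python, using sqlite3 as a stand-in for the real source and warehouse connections; the etl_control table and all column names are assumptions:

  import sqlite3  # stand-in for real source/DW connectivity
  from datetime import datetime, timezone

  def extract_changed_rows(source_conn, dw_conn, table="ORDERS"):
      # Read the high-water mark recorded after the previous extract.
      row = dw_conn.execute(
          "SELECT last_extract_ts FROM etl_control WHERE table_name = ?",
          (table,),
      ).fetchone()
      last_ts = row[0] if row else "1900-01-01 00:00:00"

      # Pull only rows added or changed since that point.
      changed = source_conn.execute(
          "SELECT * FROM " + table + " WHERE last_change_ts > ?",
          (last_ts,),
      ).fetchall()

      # Advance the high-water mark for the next run.
      now = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
      dw_conn.execute(
          "UPDATE etl_control SET last_extract_ts = ? WHERE table_name = ?",
          (now, table),
      )
      dw_conn.commit()
      return changed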

13
Data Acquisition Considerations
  • Unfortunately, this situation is not common:
    either the timestamp does not exist, or it is not
    reliable for our purposes
  • Often, need to use other mechanisms to identify
    new or changed records
  • Log/journal
  • Worst case scenario: delta processing. Extract an
    entire table and compare it to a previous version
    to identify changes (sketched below)
  • Delta processing is complex and very resource
    intensive
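
A minimal sketch of the comparison step, assuming both snapshots fit in memory as lists of dicts keyed by a stable primary key (all names hypothetical):

  def delta(previous, current, key="id"):
      # Index both snapshots by primary key for O(1) lookups.
      prev = {r[key]: r for r in previous}
      curr = {r[key]: r for r in current}
      inserts = [r for k, r in curr.items() if k not in prev]
      deletes = [r for k, r in prev.items() if k not in curr]
      updates = [r for k, r in curr.items()
                 if k in prev and r != prev[k]]
      return inserts, updates, deletes

  # Example: row 3 is new, row 2 changed, row 1 disappeared.
  prev = [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}]
  curr = [{"id": 2, "amt": 25}, {"id": 3, "amt": 30}]
  ins, upd, dels = delta(prev, curr)

In practice the dominant cost is not the comparison itself but reading and retaining full snapshots of large tables, which is why delta processing is so resource intensive.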

14
Data Acquisition Considerations
  • Determine all data to be extracted
  • Obviously, we need to extract all data required
    by the target data model
  • Additionally, we need to extract any data
    required for processing (last change dates, etc.)
  • Audit
  • Design should consider requirements for auditing
    the extraction process
  • Gather statistics relative to extract volumes,
    processing times, etc. (a sketch follows below)
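
One simple way to capture such statistics is to wrap each extract step in an audit shim. A minimal Python sketch; the audit_log list and record fields are assumptions, not features of any particular tool:

  import time

  def run_with_audit(extract_fn, audit_log, table):
      # Time the extract and record row counts for later review.
      start = time.time()
      rows = extract_fn(table)  # assumed to return a sequence of rows
      audit_log.append({
          "table": table,
          "rows_extracted": len(rows),
          "elapsed_seconds": round(time.time() - start, 2),
          "run_at": time.strftime("%Y-%m-%d %H:%M:%S"),
      })
      return rows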

15
Data Acquisition Considerations
  • Error trapping and exception handling
  • Design must consider key errors that should be
    captured during the extract process
  • Design must also consider how errors will be
    handled when they are encountered
  • ETL tools have features that help with the above

Examples of the types of errors we might want to
capture?
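
A few typical answers: missing required keys, non-numeric amounts, and code values outside the expected domain. The sketch below routes such rows to a reject list rather than aborting the load; the field names and valid codes are invented:

  def validate(row, rejects):
      # Collect every problem with the row, not just the first.
      errors = []
      if not row.get("cust_no"):
          errors.append("missing required key cust_no")
      try:
          float(row.get("amt", ""))
      except (TypeError, ValueError):
          errors.append("non-numeric amount")
      if row.get("status_cd") not in {"OPEN", "SHIPPED", "CLOSED"}:
          errors.append("unknown status code")
      if errors:
          # Route the bad row aside so the load can continue.
          rejects.append({"row": row, "errors": errors})
          return False
      return True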
16
Data Engineering
Purpose: Applies the business meaning and rules to the
incoming (extracted) data and performs all
transformations necessary to comply with them.
Re-engineered data is the transformed data, per the
business and integrity requirements of the warehouse
data models.
(Diagram elements: Source Data Model(s), Warehouse Data Model, Metadata, Transformation, Staged transformed data, Transformed Data)
17
Data Engineering
  • Considerations for Design
  • Integration: assigning surrogate keys (see the
    sketch after this list, which also covers slowly
    changing dimensions)
  • Managing Slowly Changing Dimensions
  • Referential Integrity Checking
  • Denormalization/renormalization
  • Data type conversion
  • Cleaning
  • Standardization
  • Derivation
  • Aggregation(s)
  • Audit
  • Data Lineage audit
  • NULL value management
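
To make the first two considerations concrete, here is a minimal sketch of surrogate key assignment combined with Type 2 slowly changing dimension handling. The dimension is held as an in-memory list and every name is hypothetical:

  from datetime import date

  def load_customer_row(incoming, dimension, next_key):
      # Find the current version of this customer, if any.
      current = next(
          (r for r in dimension
           if r["cust_no"] == incoming["cust_no"] and r["is_current"]),
          None,
      )
      if current and current["city"] == incoming["city"]:
          return current["customer_key"]  # no change: reuse the key
      if current:
          # Type 2 change: expire the old version, keep its history.
          current["is_current"] = False
          current["end_date"] = date.today()
      new_row = {
          "customer_key": next_key[0],     # new surrogate key
          "cust_no": incoming["cust_no"],  # natural key retained
          "city": incoming["city"],        # the tracked attribute
          "start_date": date.today(),
          "end_date": None,
          "is_current": True,
      }
      next_key[0] += 1  # next_key is a one-element list so it mutates
      dimension.append(new_row)
      return new_row["customer_key"]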