ETL - PowerPoint PPT Presentation

1 / 22
About This Presentation
Title:

ETL

Description:

ETL By Dr. Gabriel ETL Process 4 major components: Extracting Gathering raw data from source systems and storing it in ETL staging environment Cleaning and conforming ... – PowerPoint PPT presentation

Number of Views:1015
Avg rating:3.0/5.0
Slides: 23
Provided by: I161
Category:
Tags: etl

less

Transcript and Presenter's Notes

Title: ETL


1
ETL
  • By Dr. Gabriel

2
ETL Process
  • 4 major components
  • Extracting
  • Gathering raw data from source systems and
    storing it in ETL staging environment
  • Cleaning and conforming
  • Processing data to improve its quality, format
    it, merge from multiple sources, enforce
    conformed dimensions
  • Delivering
  • Loading data into data warehouse tables
  • Managing
  • Management of ETL environment

3
ETL Extracting
  • Data profiling
  • Identifying data that changed since last load
  • extraction

4
ETL Cleaning and Conforming
  • Data cleansing
  • Recording error events
  • Audit dimensions
  • Deduping
  • Creating and maintaining conformed dimensions and
    facts

5
ETL Delivering
  • Implementation of SCD logic
  • Surrogate key generation
  • Managing hierarchies in dimensions
  • Managing special dimensions such as date and
    time, junk, mini, shrunken, small static, and
    user-maintained dimensions
  • Mini dimensions
  • used to track changes of dimension attribute
    when type 2 technique is infeasible.
  • Similar to junk dimensions Typically is used for
    large dimensions
  • Combinations can be built in advance or on the
    fly
  • Built from dimension table input

6
ETL Delivering (Cont)
  • Small static dimensions
  • Dimensions created by the ETL system without real
    source
  • Lookup dimensions for translations of codes, etc.
  • User maintained dimensions
  • Master dimensions without real source system
  • Descriptions, groupings, hierarchies created for
    reporting and analysis purposes.

7
ETL Delivering (Cont)
  • Fact table loading
  • Building and maintaining bridge dimension tables
  • Handling late arriving data
  • Management of conformed dimensions
  • Administration of fact tables
  • Building aggregations
  • Building OLAP cubes
  • Transferring DW data to other environment for
    specific purposes

8
ETL Managing
  • Management of ETL environment
  • Goals
  • Reliability
  • Availability
  • Manageability
  • Job scheduler
  • backup system
  • Recovery and restart system
  • Version control system

9
ETL Managing (Cont.)
  • Version migration system
  • Workflow monitor
  • Sorting system
  • Analyzing dependencies and lineage
  • Problem escalation system
  • Parallelization
  • Security system
  • Compliance manager
  • Metadata repository manager

10
ETL Process
  • Planning
  • High level source to target data flow diagram
  • Selection and implementation of ETL tool
  • Development of default strategies for dimension
    management, error handling, and other processes
  • Development data transformations diagrams by
    target table
  • Development of job sequencing

11
ETL Process
  • Developing one-time historic load
  • Build and test the historic dimension and fact
    tables load
  • Developing incremental load process
  • Build and test dimension and fact tables
    incremental load processes
  • Build and test aggregate table loads and/or OLAP
    processing
  • Design, build, and test the ETL system automation

12
ETL Tools Build vs Buy
  • Many off-the-shelf tools exist
  • Benefits are not seen right away
  • Setup
  • Learning curve
  • High-end tools may not justify value for smaller
    warehouses

13
Off-the-shelf ETL Tools
14
ETL Specification Document
  • Can be as large as 100 pages per business
    process In reality, the work starts after the
    high level design is documented in a few pages.
  • Source-to-target mappings
  • Data profiling reports
  • Physical design decisions
  • Default strategy for extracting from each major
    source system
  • Archival strategy
  • Data quality tracking and metadata
  • Default strategy for managing changes to
    dimension attributes

15
ETL Specification Document (Cont)
  • System availability requirements and strategy
  • Design of data auditing mechanism
  • Location of staging areas
  • Historic and incremental load strategies for each
    table
  • Detailed table design
  • Historic data load parameters ( of months) and
    volumes ( of rows)
  • Incremental data volumes

16
ETL Specification Document (Cont)
  • Handling of late arriving data
  • Load frequency
  • Handling of changes in each dimension attribute
    (types 1,2,3)
  • Table partitioning
  • Overview of data sources discussion of
    source-specific characteristics
  • Extract strategy for the source data
  • Change data capture logic for each source table
  • Dependencies
  • Transformation logic (diagram or pseudo code)

17
ETL Specification Document (Cont)
  • Preconditions to avoid error conditions
  • Recovery and restart assumptions for each major
    step of the ETL pipeline
  • Archiving assumptions for each table
  • Cleanup steps
  • Estimated effort
  • Overall workflow
  • Job sequencing
  • Logical dependencies

18
Loading Pointers
  • One time historic load
  • Disable RI constraints (FKs) and re-enable them
    after the load is complete
  • Drop indexes and re-create them after the load is
    complete
  • Use bulk loading techniques
  • Not always the case

19
Loading Pointers (Cont)
  • Incremental load

20
Loading Pointers (Cont)
  • Sometimes historic and incremental load logic is
    the same many times- is similar.
  • Updating aggregations, if necessary
  • Error handling

21
Sample Generation of Surrogate Keys on SQL Server
  • As simple as
  • DECLARE _at_i INTEGER
  • SELECT _at_i MAX(ID) 1
  • FROM TableName
  • But may not work with concurrent processes
  • OR
  • Create PROCEDURE pGetNextID
  • (_at_SeedName VARCHAR(32),
  • _at_SeedValue BIGINT OUTPUT)
  • AS
  • UPDATE Lookup_Seed
  • SET _at_SeedValue SeedValue SeedValue 1
  • WHERE SeedID _at_SeedName
  • Lookup_Seed table
  • SeedID varchar (32)
  • SeedValue bigint

22
Questions ?
Write a Comment
User Comments (0)
About PowerShow.com