1
Module 1
  • DS324EE DataStage Enterprise Edition
  • Concept Review

2
Ascential's Enterprise Data Integration Platform
(Platform diagram: a command and control layer spans the integration flow, which moves data from ANY SOURCE - CRM, ERP, SCM, RDBMS, legacy, real-time, client-server, Web services, data warehouse, other applications - to ANY TARGET of the same kinds, including BI/Analytics.)
3
Course Objectives
  • You will learn to
  • Build DataStage EE jobs using complex logic
  • Utilize parallel processing techniques to
    increase job performance
  • Build custom stages based on application needs
  • Course emphasis is
  • Advanced usage of DataStage EE
  • Application job development
  • Best practices techniques

4
Course Agenda
  • Day 1
  • Review of EE Concepts
  • Sequential Access
  • Standards
  • DBMS Access
  • Day 2
  • EE Architecture
  • Transforming Data
  • Sorting Data
  • Day 3
  • Combining Data
  • Configuration Files
  • Day 4
  • Extending EE
  • Meta Data Usage
  • Job Control
  • Testing

5
Module Objectives
  • Provide a background for completing work in the
    DSEE advanced course
  • Ensure all students will have a successful
    advanced class
  • Tasks
  • Review parallel processing concepts

6
Review Topics
  • DataStage architecture
  • DataStage client review
  • Administrator
  • Manager
  • Designer
  • Director
  • Parallel processing paradigm
  • DataStage Enterprise Edition

7
Client-Server Architecture
(Architecture diagram: the DataStage clients - Administrator, Manager, Designer, and Director - run on Microsoft Windows NT/2000/XP and connect to the DataStage Server and Repository, which run on Microsoft Windows NT or UNIX. Data moves from ANY SOURCE - CRM, ERP, SCM, BI/Analytics, RDBMS, real-time, client-server, Web services, data warehouse, other applications - through the Discover, Prepare, Transform, and Extend phases (extract, cleanse, transform, integrate) to ANY TARGET.)
8
Process Flow
  • Administrator: add/delete projects, set defaults
  • Manager: import meta data, back up projects
  • Designer: assemble jobs, compile, and execute
  • Director: execute jobs, examine job run logs

9
Administrator Licensing and Timeout
10
Administrator Project Creation/Removal
Functions specific to a project.
11
Administrator Project Properties
RCP for parallel jobs should be enabled
Variables for parallel processing
12
Administrator Environment Variables
Variables are category specific
13
OSH is what is run by the EE Framework
14
DataStage Manager
15
Export Objects to MetaStage
Push meta data to MetaStage
16
Designer Workspace
Can execute the job from Designer
17
DataStage Generated OSH
The EE Framework runs OSH
18
Director Executing Jobs
Messages from previous run in different color
19
Stages
Can now customize the Designer's palette
20
Popular Stages
Row generator
Peek
21
Row Generator
  • Can build test data

Edit row in column tab
Repeatable property
22
Peek
  • Displays field values
  • Will be displayed in job log or sent to a file
  • Skip records option
  • Can control number of records to be displayed
  • Can be used as stub stage for iterative
    development (more later)

23
Why EE is so Effective
  • Parallel processing paradigm
  • More hardware, faster processing
  • Level of parallelization is determined by a
    configuration file read at runtime
  • Emphasis on memory
  • Data read into memory and lookups performed like
    hash table

24
Scalable Systems
  • Parallel processing: executing your application
    on multiple CPUs
  • Scalable processing: add more resources (CPUs,
    RAM, and disks) to increase system performance
  • Example: a system containing 6 CPUs (or processing
    nodes) and disks

25
Scalable Systems: Examples
  • Three main types of scalable systems
  • Symmetric Multiprocessors (SMP): shared memory
  • Clusters: UNIX systems connected via networks
  • MPP (Massively Parallel Processing): shared nothing

26
SMP Shared Everything
  • Multiple CPUs with a single operating system
  • Programs communicate using shared memory
  • All CPUs share system resources (OS, memory with
    single linear address space, disks, I/O)
  • When used with enterprise edition
  • Data transport uses shared memory
  • Simplified startup

enterprise edition treats NUMA (Non-Uniform Memory
Access) as SMP
27
Traditional Batch Processing
  • Traditional approach to batch processing
  • Write to disk and read from disk before each
    processing operation
  • Sub-optimal utilization of resources
  • a 10 GB stream leads to 70 GB of I/O
  • processing resources can sit idle during I/O
  • Very complex to manage (lots and lots of small
    jobs)
  • Becomes impractical with big data volumes
  • disk I/O consumes the processing
  • terabytes of disk required for temporary staging

28
Pipeline Multiprocessing
Data Pipelining
  • Transform, clean and load processes are
    executing simultaneously on the same processor
  • rows are moving forward through the flow

(Diagram: rows flow from the Source - operational data and archived data - through Transform, Clean, and Load processes running simultaneously into the Target data warehouse.)
  • Start a downstream process while an upstream
    process is still running.
  • This eliminates intermediate storing to disk,
    which is critical for big data.
  • This also keeps the processors busy.
  • Still has limits on scalability

Think of a conveyor belt moving the rows from
process to process!
29
Partition Parallelism
Data Partitioning
  • Break up big data into partitions
  • Run one partition on each processor
  • 4X faster on 4 processors - with data
    big enough
  • 100X faster on 100 processors
  • This is exactly how parallel databases
    work!
  • Data partitioning requires applying the same
    transform to all partitions: Aaron Abbott and
    Zygmund Zorn undergo the same transform

30
Combining Parallelism Types
Putting It All Together: Parallel Dataflow
Source
31
EE Program Elements
  • Dataset: a uniform set of rows in the Framework's
    internal representation
  • Three flavors:
  • 1. file sets (*.fs): stored as flat files on
    multiple Unix files
  • 2. persistent (*.ds): stored on multiple
    Unix files in Framework format
  • read and written using the DataSet Stage
  • 3. virtual (*.v): links, in
    Framework format, NOT stored on disk
  • The Framework processes only datasets, hence the
    possible need for Import
  • Different datasets typically have different
    schemas
  • Convention: "dataset" = Framework data set
  • Partition: a subset of rows in a dataset earmarked
    for processing by the same node (virtual CPU,
    declared in a configuration file)
  • All the partitions of a dataset follow
    the same schema: that of the dataset

32
Repartitioning
Putting It All Together: Parallel Dataflow with
Repartitioning on-the-fly
Source
  • Without Landing To Disk!

33
DataStage EE Architecture
34
Introduction to DataStage EE
  • DSEE
  • Automatically scales to fit the machine
  • Handles data flow among multiple CPUs and disks
  • With DSEE you can
  • Create applications for SMPs, clusters and
    MPPs enterprise edition is architecture-neutral
  • Access relational databases in parallel
  • Execute external applications in parallel
  • Store data across multiple disks and nodes

35
Job Design vs. Execution
User assembles the flow using the DataStage
Designer
  • and gets parallel access, propagation,
    transformation, and load.
  • The design is good for 1 node, 4 nodes,
  • or N nodes. To change nodes, just swap
    configuration file.
  • No need to modify or recompile your design!

36
Partitioners and Collectors
  • Partitioners distribute rows into partitions
  • implement data-partition parallelism
  • Collectors: the inverse of partitioners
  • Both live on input links of stages running
  • in parallel (partitioners)
  • sequentially (collectors)
  • Use a choice of methods

37
Example Partitioning Icons
partitioner
38
Exercise
  • Complete exercises 1-1, 1-2, and 1-3

39
Module 2
  • DSEE Sequential Access

40
Module Objectives
  • You will learn to
  • Import sequential files into the EE Framework
  • Utilize parallel processing techniques to
    increase sequential file access
  • Understand usage of the Sequential, DataSet,
    FileSet, and LookupFileSet stages
  • Manage partitioned data stored by the Framework

41
Types of Sequential Data Stages
  • Sequential
  • Fixed or variable length
  • File Set
  • Lookup File Set
  • Data Set

42
Sequential Stage Introduction
  • The EE Framework processes only datasets
  • For files other than datasets, such as flat
    files, enterprise edition must perform import
    and export operations; this is performed by the
    import and export OSH operators (generated by
    Sequential or FileSet stages)
  • During import or export DataStage performs format
    translations into, or out of, the EE internal
    format
  • Data is described to the Framework in a schema

43
How the Sequential Stage Works
  • Generates Import/Export operators
  • Types of transport:
  • Performs direct C file I/O on streams
  • Source programs that feed stdout (e.g., gunzip)
    send their stdout into EE via a sequential pipe

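A minimal sketch of the pipe idea using standard Unix commands (file and pipe names are hypothetical): an external program decompresses to stdout, and the Sequential File stage is pointed at the named pipe instead of a regular file.

  mkfifo /tmp/customers.pipe
  gunzip -c customers.dat.gz > /tmp/customers.pipe &
  # the Sequential File stage then reads /tmp/customers.pipe as its source file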
44
Using the Sequential File Stage
  • Both import and export of general files (text,
    binary) are performed by the SequentialFile
    Stage.
  • Data import
  • Data export

Importing/Exporting Data
EE internal format
EE internal format
45
Working With Flat Files
  • Sequential File Stage
  • Normally will execute in sequential mode
  • Can execute in parallel if reading multiple files
    (file pattern option)
  • Can use multiple readers within a node on fixed
    width file
  • DSEE needs to know
  • How file is divided into rows
  • How row is divided into columns

46
Processes Needed to Import Data
  • Recordization
  • Divides input stream into records
  • Set on the format tab
  • Columnization
  • Divides the record into columns
  • Default set on the format tab but can be
    overridden on the columns tab
  • Can be incomplete if using a schema or not even
    specified in the stage if using RCP

47
File Format Example
48
Sequential File Stage
  • To set the properties, use stage editor
  • Page (general, input/output)
  • Tabs (format, columns)
  • Sequential stage link rules
  • One input link
  • One output link (except for reject link
    definition)
  • One reject link
  • Will reject any records not matching meta data in
    the column definitions

49
Job Design Using Sequential Stages
Stage categories
50
General Tab Sequential Source
Show records
Multiple output links
51
Properties Multiple Files
Click to add more files having the same meta
data.
52
Properties - Multiple Readers
Multiple readers option allows you to set number
of readers
53
Format Tab
Record into columns
File into records
54
Read Methods
55
Reject Link
  • Reject mode = Output
  • Source:
  • All records not matching the meta data (the
    column definitions)
  • Target:
  • All records that are rejected for any reason
  • Meta data: one column, data type raw

56
File Set Stage
  • Can read or write file sets
  • Files suffixed by .fs
  • File set consists of
  • Descriptor file: contains the location of the raw
    data files and the meta data
  • Individual raw data files
  • Can be processed in parallel

57
File Set Stage Example
Descriptor file
58
File Set Usage
  • Why use a file set?
  • 2G limit on some file systems
  • Need to distribute data among nodes to prevent
    overruns
  • If used in parallel, runs faster than a sequential
    file

59
Lookup File Set Stage
  • Can create file sets
  • Usually used in conjunction with Lookup stages

60
Lookup File Set > Properties
Key column specified
Key column dropped in descriptor file
61
Data Set
  • Operating system (Framework) file
  • Suffixed by .ds
  • Referred to by a control file
  • Managed by Data Set Management utility from GUI
    (Manager, Designer, Director)
  • Represents persistent data
  • Key to good performance in set of linked jobs

62
Persistent Datasets
  • Accessed from/to disk with DataSet Stage.
  • Two parts
  • Descriptor file
  • contains metadata, data location, but NOT the
    data itself
  • Data file(s)
  • contain the data
  • multiple Unix files (one per node), accessible in
    parallel

input.ds
record ( partno: int32; description: string; )
Data files: node1:/local/disk1/  node2:/local/disk2/
63
Quiz!
  • True or False?
  • Everything that has been data-partitioned must be
    collected in same job

64
Data Set Stage
Is the data partitioned?
65
Engine Data Translation
  • Occurs on import
  • From sequential files or file sets
  • From RDBMS
  • Occurs on export
  • From datasets to file sets or sequential files
  • From datasets to RDBMS
  • Engine most efficient when processing internally
    formatted records (i.e., data contained in
    datasets)

66
Managing DataSets
  • GUI (Manager, Designer, Director): Tools > Data
    Set Management
  • Alternative methods:
  • orchadmin
  • Unix command-line utility
  • List records
  • Remove data sets (will remove all components)
  • dsrecords
  • Lists number of records in a dataset

67
Data Set Management
Display data
Schema
68
Data Set Management From Unix
  • Alternative method of managing file sets and data
    sets
  • dsrecords
  • Gives record count
  • Unix command-line utility
  • dsrecords ds_name
  • e.g., dsrecords myDS.ds
  • 156999 records
  • orchadmin
  • Manages EE persistent data sets
  • Unix command-line utility
  • e.g., orchadmin rm myDataSet.ds

69
Exercise
  • Complete exercises 2-1, 2-2, 2-3, and 2-4.

70
Blank
71
Module 3
  • Standards and Techniques

72
Objectives
  • Establish standard techniques for DSEE
    development
  • Will cover
  • Job documentation
  • Naming conventions for jobs, links, and stages
  • Iterative job design
  • Useful stages for job development
  • Using configuration files for development
  • Using environmental variables
  • Job parameters

73
Job Presentation
Document using the annotation stage
74
Job Properties Documentation
Organize jobs into categories
Description shows in DS Manager and MetaStage
75
Naming conventions
  • Stages named after the
  • Data they access
  • Function they perform
  • DO NOT leave defaulted stage names like
    Sequential_File_0
  • Links named for the data they carry
  • DO NOT leave defaulted link names like DSLink3

76
Stage and Link Names
Stages and links renamed to data they handle
77
Create Reusable Job Components
  • Use enterprise edition shared containers when
    feasible

Container
78
Use Iterative Job Design
  • Use copy or peek stage as stub
  • Test job in phases small first, then increasing
    in complexity
  • Use Peek stage to examine records

79
Copy or Peek Stage Stub
Copy stage
80
Transformer Stage Techniques
  • Suggestions -
  • Always include reject link.
  • Always test for null value before using a column
    in a function.
  • Try to use RCP and only map columns that have a
    derivation other than a copy. More on RCP later.
  • Be aware of Column and Stage variable Data Types.
  • Often user does not pay attention to Stage
    Variable type.
  • Avoid type conversions.
  • Try to maintain the data type as imported.

81
The Copy Stage
  • With 1 link in, 1 link out
  • the Copy Stage is the ultimate "no-op"
    (place-holder)
  • Partitioners
  • Sort / Remove Duplicates
  • Rename, Drop column
  • can be inserted on
  • input link (Partitioning) Partitioners, Sort,
    Remove Duplicates)
  • output link (Mapping page) Rename, Drop.
  • Sometimes replace the transformer
  • Rename,
  • Drop,
  • Implicit type Conversions
  • Link Constraint break up schema

82
Developing Jobs
  • Keep it simple
  • Jobs with many stages are hard to debug and
    maintain.
  • Start small and Build to final Solution
  • Use view data, copy, and peek.
  • Start from source and work out.
  • Develop with a 1 node configuration file.
  • Solve the business problem before the performance
    problem.
  • Don't worry too much about partitioning until the
    sequential flow works as expected.
  • If you have to write to Disk use a Persistent
    Data set.

83
Final Result
84
Good Things to Have in each Job
  • Use job parameters
  • Some helpful environmental variables to add to
    job parameters
  • APT_DUMP_SCORE
  • Reports the parallel job score (operators,
    processes, data sets) to the message log
  • APT_CONFIG_FILE
  • Establishes runtime parameters for the EE engine,
    e.g., degree of parallelization

85
Setting Job Parameters
Click to add environment variables
86
DUMP SCORE Output
Setting APT_DUMP_SCORE yields:
Double-click
Partitioner and Collector
Mapping: node --> partition

87
Use Multiple Configuration Files
  • Make a set for 1X, 2X, ...
  • Use different ones for test versus production
  • Include as a parameter in each job

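A hedged sketch of the idea, with hypothetical file names and paths; because the configuration file is referenced through the APT_CONFIG_FILE job parameter, switching between test and production only changes its value:

  # one-node file for development and unit testing
  export APT_CONFIG_FILE=/opt/dsee/configs/1node.apt
  # four-node file for volume testing or production
  export APT_CONFIG_FILE=/opt/dsee/configs/4node.apt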
88
Exercise
  • Complete exercise 3-1

89
Module 4
  • DBMS Access

90
Objectives
  • Understand how DSEE reads and writes records to
    an RDBMS
  • Understand how to handle nulls on DBMS lookup
  • Utilize this knowledge to
  • Read and write database tables
  • Use database tables to lookup data
  • Use null handling options to clean data

91
Parallel Database Connectivity
Traditional Client-Server vs. enterprise edition
(Diagram: in the traditional model, each client holds a single connection to the parallel RDBMS; with enterprise edition, parallel application processes such as Sort and Load hold parallel connections to the parallel RDBMS.)
  • enterprise edition:
  • Parallel server runs APPLICATIONS
  • Application has parallel connections to RDBMS
  • Suitable for large data volumes
  • Higher levels of integration possible
  • Traditional client-server:
  • Only RDBMS is running in parallel
  • Each application has only one connection
  • Suitable only for small data volumes

92
RDBMS Access: Supported Databases
  • enterprise edition provides high performance /
    scalable interfaces for
  • DB2
  • Informix
  • Oracle
  • Teradata
  • Users must be granted specific privileges,
    depending on RDBMS.

93
RDBMS Access: Supported Databases
  • Automatically convert RDBMS table layouts to/from
    enterprise edition Table Definitions
  • RDBMS nulls converted to/from nullable field
    values
  • Support for standard SQL syntax for specifying
  • field list for SELECT statement
  • filter for WHERE clause
  • open command, close command
  • Can write an explicit SQL query to access RDBMS
  • EE supplies additional information in the SQL
    query

94
RDBMS Stages
  • DB2/UDB Enterprise
  • Informix Enterprise
  • Oracle Enterprise
  • Teradata Enterprise
  • ODBC

95
RDBMS Usage
  • As a source
  • Extract data from table (stream link)
  • Extract as table, generated SQL, or user-defined
    SQL
  • User-defined can perform joins, access views
  • Lookup (reference link)
  • Normal lookup is memory-based (all table data
    read into memory)
  • Can perform one lookup at a time in DBMS (sparse
    option)
  • Continue/drop/fail options
  • As a target
  • Inserts
  • Upserts (Inserts and updates)
  • Loader

96
RDBMS Source Stream Link
Stream link
97
DBMS Source - User-defined SQL
Columns in SQL statement must match the meta data
in columns tab
98
Exercise
  • User-defined SQL
  • Exercise 4-1

99
DBMS Source Reference Link
Reject link
100
Lookup Reject Link
Output option automatically creates the reject
link
101
Null Handling
  • Must handle null condition if lookup record is
    not found and continue option is chosen
  • Can be done in a transformer stage

102
Lookup Stage Mapping
Link name
103
Lookup Stage Properties
Reference link
Must have same column name in input and reference
links. You will get the results of the lookup in
the output column.
104
DBMS as a Target
105
DBMS As Target
  • Write Methods
  • Delete
  • Load
  • Upsert
  • Write (DB2)
  • Write mode for load method
  • Truncate
  • Create
  • Replace
  • Append

106
Target Properties
Generated code can be copied
Upsert mode determines options
107
Checking for Nulls
  • Use Transformer stage to test for fields with
    null values (Use IsNull functions)
  • In Transformer, can reject or load default value

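For example, a Transformer derivation can substitute a default value when a lookup did not find a match; the link and column names below (lkp, cust_name) are only illustrative:

  If IsNull(lkp.cust_name) Then "UNKNOWN" Else lkp.cust_name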
108
Exercise
  • Complete exercise 4-2

109
Module 5
  • Platform Architecture

110
Objectives
  • Understand how enterprise edition Framework
    processes data
  • You will be able to
  • Read and understand OSH
  • Perform troubleshooting

111
Concepts
  • The EE Platform
  • OSH (generated by DataStage Parallel Canvas, and
    run by DataStage Director)
  • Conductor, Section Leaders, Players
  • Configuration files (only one active at a time,
    describes H/W)
  • Schemas/tables
  • Schema propagation/RCP
  • Buildop, Wrapper
  • Datasets (data in Framework's internal
    representation)

112
DS-EE Program Elements
EE Stages Involve A Series Of Processing Steps
  • Piece of Application Logic Running Against
    Individual Records
  • Parallel or Sequential
  • Three Sources
  • Ascential Supplied
  • Commercial tools/applications
  • Custom/Existing programs

(Diagram: an EE Stage consists of a partitioner, an input interface, the business logic, and an output interface. The input and output data sets each carry a schema, e.g. prov_num:int16; member_num:int8; custid:int32.)
113
DS-EE Program Elements: Stage Execution
Dual Parallelism Eliminates Bottlenecks!
  • EE Delivers Parallelism in Two Ways
  • Pipeline
  • Partition
  • Block Buffering Between Components
  • Eliminates Need for Program Load Balancing
  • Maintains Orderly Data Flow

(Diagram: a producer stage feeds a consumer stage through a pipeline, with both running across multiple partitions.)
114
Stages Control Partition Parallelism
  • Execution Mode (sequential/parallel) is
    controlled by Stage
  • default parallel for most Ascential-supplied
    Stages
  • User can override default mode
  • Parallel Stage inserts the default partitioner
    (Auto) on its input links
  • Sequential Stage inserts the default collector
    (Auto) on its input links
  • user can override default
  • execution mode (parallel/sequential) of Stage
    (Advanced tab)
  • choice of partitioner/collector (Input
    Partitioning Tab)

115
How Parallel Is It?
  • Degree of Parallelism is determined by the
    configuration file
  • Total number of logical nodes in default pool,
    or a subset if using "constraints".
  • Constraints are assigned to specific pools as
    defined in configuration file and can be
    referenced in the stage

116
OSH
  • DataStage EE GUI generates OSH scripts
  • Ability to view OSH turned on in Administrator
  • OSH can be viewed in Designer using job
    properties
  • The Framework executes OSH
  • What is OSH?
  • Orchestrate shell
  • Has a UNIX command-line interface

117
OSH Script
  • An osh script is a quoted string which specifies:
  • The operators and connections of a single
    Orchestrate step
  • In its simplest form, it is:
  • osh "op < in.ds > out.ds"
  • Where:
  • op is an Orchestrate operator
  • in.ds is the input data set
  • out.ds is the output data set

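A slightly fuller sketch of a single step, using placeholder operator names (op1, op2); the pipe between them is a virtual dataset, while in.ds and out.ds are persistent datasets:

  osh "op1 < in.ds | op2 > out.ds"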
118
OSH Operators
  • Operator is an instance of a C++ class inheriting
    from APT_Operator
  • Developers can create new operators
  • Examples of existing operators
  • Import
  • Export
  • RemoveDups

119
Enable Visible OSH in Administrator
Will be enabled for all projects
120
View OSH in Designer
Operator
Schema
121
OSH Practice
  • Exercise 5-1

122
Orchestrate May Add Operators to Your Command
Let's revisit the following OSH command:
osh "echo 'Hello world!' [par] > outfile"
The Framework silently inserts operators (steps
1, 2, 3, 4)
123
Elements of a Framework Program
Steps, with internal and terminal datasets and
links, described by schemas
  • Step: unit of an OSH program
  • one OSH command = one step
  • at end of step: synchronization, storage to disk
  • Datasets: sets of rows processed by the Framework
  • Orchestrate data sets:
  • persistent (terminal): *.ds, and
  • virtual (internal): *.v
  • Also flat file sets: *.fs
  • Schema: data description (metadata) for datasets
    and links.

124
Orchestrate Datasets
  • Consist of Partitioned Data and Schema
  • Can be Persistent (.ds) or Virtual
    (.v, Link)
  • Overcome 2 GB File Limit

(Diagram: what you program - a single Operator A in the GUI or in OSH, e.g. osh "operator_A > x.ds"; what gets processed - one instance of Operator A per partition on Node 1 through Node 4; what gets generated - multiple files per partition, each file up to 2 GBytes or larger.)
125
Computing Architectures Definition
(Diagram: three architectures compared by how CPUs share memory and disk.)
Uniprocessor - dedicated disk and memory
  • PC
  • Workstation
  • Single processor server
SMP System (Symmetric Multiprocessor) - shared memory, shared disk
  • IBM, Sun, HP, Compaq
  • 2 to 64 processors
  • Majority of installations
Clusters and MPP Systems - shared nothing (each node has its own CPU, memory, and disk)
  • 2 to hundreds of processors
  • MPP: IBM and NCR Teradata
  • each node is a uniprocessor or SMP

126
Job ExecutionOrchestrate
  • Conductor - initial DS/EE process
  • Step Composer
  • Creates Section Leader processes (one per node)
  • Consolidates messages, outputs them
  • Manages orderly shutdown.
  • Section Leader
  • Forks Player processes (one per Stage)
  • Manages up/down communication.
  • Players
  • The actual processes associated with Stages
  • Combined players: one process only
  • Send stderr to SL
  • Establish connections to other players for data
    flow
  • Clean up upon completion.

(Diagram: the Conductor, plus a Section Leader and its Players on each Processing Node.)
  • Communication:
  • - SMP: shared memory
  • - MPP: TCP

127
Working with Configuration Files
  • You can easily switch between config files:
  • '1-node' file - for sequential
    execution, lighter reports; handy for testing
  • 'MedN-nodes' file - aims at a mix of pipeline
    and data-partitioned parallelism
  • 'BigN-nodes' file - aims at full
    data-partitioned parallelism
  • Only one file is active while a step is running
  • The Framework queries (first) the environment
    variable
  • APT_CONFIG_FILE
  • the number of nodes declared in the config file
    need not match the number of CPUs
  • The same configuration file can be used on
    development and target machines

128
SchedulingNodes, Processes, and CPUs
  • DS/EE does not
  • know how many CPUs are available
  • schedule
  • Who knows what?
  • Who does what?
  • DS/EE creates (Nodes x Ops) Unix processes
  • The O/S schedules these processes on the CPUs

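A quick worked example of the resulting process count, assuming a 4-node configuration file, a job score with 5 operators, and no player combining:

  players         = 4 nodes x 5 operators = 20
  section leaders = 4 (one per node)
  conductor       = 1
  total           = 25 Unix processes, scheduled on the CPUs by the O/S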
129
Configuring DSEE Node Pools
node "n1" fastname "s1" pool ""
"n1" "s1" "app2" "sort" resource disk
"/orch/n1/d1" resource disk "/orch/n1/d2"
resource scratchdisk "/temp" "sort"
node "n2" fastname "s2" pool "" "n2"
"s2" "app1" resource disk "/orch/n2/d1"
resource disk "/orch/n2/d2" resource
scratchdisk "/temp" node "n3"
fastname "s3" pool "" "n3" "s3" "app1"
resource disk "/orch/n3/d1" resource
scratchdisk "/temp" node "n4"
fastname "s4" pool "" "n4" "s4" "app1"
resource disk "/orch/n4/d1" resource
scratchdisk "/temp"
130
Configuring DSEE Disk Pools
node "n1" fastname "s1" pool ""
"n1" "s1" "app2" "sort" resource disk
"/orch/n1/d1" resource disk "/orch/n1/d2"
"bigdata" resource scratchdisk "/temp"
"sort" node "n2" fastname "s2"
pool "" "n2" "s2" "app1" resource disk
"/orch/n2/d1" resource disk "/orch/n2/d2"
"bigdata" resource scratchdisk "/temp"
node "n3" fastname "s3" pool ""
"n3" "s3" "app1" resource disk "/orch/n3/d1"
resource scratchdisk "/temp" node
"n4" fastname "s4" pool "" "n4" "s4"
"app1" resource disk "/orch/n4/d1"
resource scratchdisk "/temp"
131
Re-Partitioning
Parallel-to-parallel flow may incur
reshuffling: records may jump between nodes
(Diagram: a partitioner between two parallel stages;
rows may move between node 1 and node 2.)
132
Re-Partitioning X-ray
  • Partitioner with parallel import
  • When a partitioner receives
  • sequential input (1 partition), it creates N
    partitions
  • parallel input (N partitions), it outputs N
    partitions, may result in re-partitioning
  • Assuming no constraints

(Diagram: N input partitions flow through the
partitioner into N output partitions across node 1
and node 2.)
133
Automatic Re-Partitioning
In most cases, automatic re-partitioning is
benign (no reshuffling), preserving the same
partitioning as upstream. Re-partitioning can
be forced to be benign, using either Same or
Preserve Partitioning.
If Stage 2 runs in parallel, DS/EE silently
inserts a partitioner upstream of it. If Stage
1 also runs in parallel, re-partitioning occurs.
(Diagram: Stage 1 feeds partition 1 and partition 2
through the inserted partitioner into Stage 2.)
134
Partitioning Methods
  • Auto
  • Hash
  • Entire
  • Range
  • Range Map

135
Collectors
  • Collectors combine partitions of a dataset into a
    single input stream to a sequential Stage

...
data partitions
collector
sequential Stage
  • Collectors do NOT synchronize data

136
Partitioning and Repartitioning Are Visible On
Job Design
137
Partitioning and Collecting Icons
Partitioner
Collector
138
Setting a Node Constraint in the GUI
139
Reading Messages in Director
  • Set APT_DUMP_SCORE to true
  • Can be specified as job parameter
  • Messages sent to Director log
  • If set, parallel job will produce a report
    showing the operators, processes, and datasets in
    the running job

140
Messages With APT_DUMP_SCORE True
141
Exercise
  • Complete exercise 5-2

142
Blank
143
Module 6
  • Transforming Data

144
Module Objectives
  • Understand ways DataStage allows you to transform
    data
  • Use this understanding to
  • Create column derivations using user-defined code
    or system functions
  • Filter records based on business criteria
  • Control data flow based on data conditions

145
Transformed Data
  • Transformed data is
  • Outgoing column is a derivation that may, or may
    not, include incoming fields or parts of incoming
    fields
  • May be comprised of system variables
  • Frequently uses functions performed on something
    (i.e., incoming columns)
  • Divided into categories, e.g.:
  • Date and time
  • Mathematical
  • Logical
  • Null handling
  • More

146
Stages Review
  • Stages that can transform data
  • Transformer
  • Parallel
  • Basic (from Parallel palette)
  • Aggregator (discussed in later module)
  • Sample stages that do not transform data
  • Sequential
  • FileSet
  • DataSet
  • DBMS

147
Transformer Stage Functions
  • Control data flow
  • Create derivations

148
Flow Control
  • Separate record flow down links based on data
    condition specified in Transformer stage
    constraints
  • Transformer stage can filter records
  • Other stages can filter records but do not
    exhibit advanced flow control
  • Sequential
  • Lookup
  • Filter

149
Rejecting Data
  • Reject option on sequential stage
  • Data does not agree with meta data
  • Output consists of one column with binary data
    type
  • Reject links (from Lookup stage) result from the
    drop option of the property If Not Found
  • Lookup failed
  • All columns on reject link (no column mapping
    option)
  • Reject constraints are controlled from the
    constraint editor of the transformer
  • Can control column mapping
  • Use the Other/Log checkbox

150
Rejecting Data Example
Constraint Other/log option
If Not Found property
Property Reject Mode Output
151
Transformer Stage Properties
152
Transformer Stage Variables
  • First of transformer stage entities to execute
  • Execute in order from top to bottom
  • Can write a program by using one stage variable
    to point to the results of a previous stage
    variable
  • Multi-purpose
  • Counters
  • Hold values for previous rows to make comparison
  • Hold derivations to be used in multiple field
    derivations
  • Can be used to control execution of constraints

153
Stage Variables
Show/Hide button
154
Transforming Data
  • Derivations
  • Using expressions
  • Using functions
  • Date/time
  • Transformer Stage Issues
  • Sometimes require sorting before the transformer
    stage, e.g., using a stage variable as accumulator
    and needing to break on change of column value
  • Checking for nulls

155
Checking for Nulls
  • Nulls can get introduced into the dataflow
    because of failed lookups and the way in which
    you chose to handle this condition
  • Can be handled in constraints, derivations, stage
    variables, or a combination of these

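One way to do this in a constraint rather than a derivation: allow only rows where the lookup returned a value, and let the Other/Log option catch the rest. The link and column names below are illustrative:

  IsNotNull(lkp.description)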
156
Nullability
Can set the value of null, e.g., if the value of the
column is null, put 'NULL' in the outgoing column.
Source Field    Destination Field    Result
not_nullable    not_nullable         Source value propagates to destination.
not_nullable    nullable             Source value propagates; destination value is never null.
nullable        not_nullable         WARNING messages in log. If the source value is null, a fatal error occurs. Must handle in transformer.
nullable        nullable             Source value or null propagates.
157
Transformer Stage- Handling Rejects
  • Constraint Rejects
  • All expressions are false and reject row is
    checked
  • Expression Error Rejects
  • Improperly Handled Null

158
Transformer Execution Order
  • Derivations in stage variables are executed
    first
  • Constraints are executed before derivations
  • Column derivations in earlier links are executed
    before later links
  • Derivations in higher columns are executed
    before lower columns

159
Two Transformers for the Parallel Palette
  • All > Processing >
  • Transformer
  • Is the non-Universe transformer
  • Has a specific set of functions
  • No DS routines available
  • Parallel > Processing >
  • Basic Transformer
  • Makes server style transforms available on the
    parallel palette
  • Can use DS routines
  • No need for shared container to get Universe
    functionality on the parallel palette
  • Program in Basic for both transformers

160
Transformer Functions From Derivation Editor
  • Date & Time
  • Logical
  • Mathematical
  • Null Handling
  • Number
  • Raw
  • String
  • Type Conversion
  • Utility

161
Timestamps and Dates
  • Date & Time
  • Also some in Type Conversion

162
Exercise
  • Complete exercises 6-1, 6-2, and 6-3

163
Module 7
  • Sorting Data

164
Objectives
  • Understand DataStage EE sorting options
  • Use this understanding to create sorted list of
    data to enable functionality within a transformer
    stage

165
Sorting Data
  • Important because
  • Transformer may be using stage variables for
    accumulators or control breaks and order is
    important
  • Other stages may run faster, e.g., Aggregator
  • Facilitates the RemoveDups stage, order is
    important
  • Job has partitioning requirements
  • Can be performed
  • Option within stages (use the Input > Partitioning
    tab and set partitioning to anything other than
    Auto)
  • As a separate stage (more complex sorts)

166
Sorting Alternatives
  • Alternative representation of same flow

167
Sort Option on Stage Link
168
Sort Stage
169
Sort Utility
  • DataStage (the default)
  • SyncSort
  • UNIX

170
Sort Stage - Outputs
  • Specifies how the output is derived

171
Sort Specification Options
  • Input Link Property
  • Limited functionality
  • Max memory/partition is 20 MB, then spills to
    scratch
  • Sort Stage
  • Tunable to use more memory before spilling to
    scratch.
  • Note Spread I/O by adding more scratch file
    systems to each node of the APT_CONFIG_FILE

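A hedged sketch of what that looks like in the configuration file (host and path names are hypothetical); each node lists more than one scratchdisk so sort spills are spread across file systems:

  node "n1" {
    fastname "host1"
    pools ""
    resource disk "/data/n1" {pools ""}
    resource scratchdisk "/scratch1/n1" {pools ""}
    resource scratchdisk "/scratch2/n1" {pools ""}
  }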
172
Removing Duplicates
  • Can be done by Sort
  • Use unique option
  • OR
  • Remove Duplicates stage
  • Has more sophisticated ways to remove duplicates

173
Exercise
  • Complete exercise 7-1

174
Blank
175
Module 8
  • Combining Data

176
Objectives
  • Understand how DataStage can combine data using
    the Join, Lookup, Merge, and Aggregator stages
  • Use this understanding to create jobs that will
  • Combine data from separate input streams
  • Aggregate data to form summary totals

177
Combining Data
  • There are two ways to combine data
  • Horizontally: several input links; one output
    link (+ optional rejects) made of columns from
    different input links. E.g.,
  • Joins
  • Lookup
  • Merge
  • Vertically: one input link; output with columns
    combining values from all input rows. E.g.,
  • Aggregator

178
Recall the Join, Lookup, and Merge Stages
  • These "three Stages" combine two or more input
    links according to values of user-designated
    "key" column(s).
  • They differ mainly in
  • Memory usage
  • Treatment of rows with unmatched key values
  • Input requirements (sorted, de-duplicated)

179
Joins - Lookup - Merge: Not all Links are Created
Equal!
  • enterprise edition distinguishes between
  • - The Primary Input (Framework port 0)
  • - Secondary - in some cases "Reference" (other
    ports)
  • Naming convention
  • Tip
  • Check "Input Ordering" tab to make sure
    intended Primary is listed first

180
Join Stage Editor
Link Order immaterial for Inner and Full Outer
Joins (but VERY important for Left/Right Outer
and Lookup and Merge)
  • One of four variants
  • Inner
  • Left Outer
  • Right Outer
  • Full Outer

Several key columns allowed
181
1. The Join Stage
  • Four types
  • 2 sorted input links, 1 output link
  • "left" on primary input, "right" on secondary
    input
  • Pre-sort makes joins "lightweight": few rows need
    to be in RAM
  • Inner
  • Left Outer
  • Right Outer
  • Full Outer

182
2. The Lookup Stage
  • Combines
  • one source link with
  • one or more duplicate-free table links

No pre-sort necessary; allows multiple keys;
flexible exception handling for source input
rows with no match
(Diagram: the source input enters the Lookup stage on
port 0 and one or more lookup tables (LUTs) on ports
1, 2, ...; the stage produces an Output link and a
Reject link.)
183
The Lookup Stage
  • Lookup Tables should be small enough to fit into
    physical memory (otherwise, performance hit due
    to paging)
  • Space/time trade-off: pre-sort vs. in-RAM table
  • On an MPP you should partition the lookup tables
    using entire partitioning method, or partition
    them the same way you partition the source link
  • On an SMP, no physical duplication of a Lookup
    Table occurs

184
The Lookup Stage
  • Lookup File Set
  • Like a persistent data set only it contains
    metadata about the key.
  • Useful for staging lookup tables
  • RDBMS LOOKUP
  • SPARSE
  • Select for each row.
  • Might become a performance bottleneck.
  • NORMAL
  • Loads to an in memory hash table first.

185
3. The Merge Stage
  • Combines
  • one sorted, duplicate-free master (primary) link
    with
  • one or more sorted update (secondary) links.
  • Pre-sort makes merge "lightweight": few rows need
    to be in RAM (as with joins, but opposite to
    lookup).
  • Follows the Master-Update model
  • Master row and one or more updates row are merged
    if they have the same value in user-specified
    key column(s).
  • A non-key column occurs in several inputs? The
    lowest input port number prevails (e.g., master
    over update; update values are ignored)
  • Unmatched ("Bad") master rows can be either
  • kept
  • dropped
  • Unmatched ("Bad") update rows in input link can
    be captured in a "reject" link
  • Matched update rows are consumed.

186
The Merge Stage
Allows composite keys; multiple update links;
matched update rows are consumed; unmatched updates
can be captured; lightweight. Space/time trade-off:
pre-sorts vs. in-RAM table.
(Diagram: the Master link enters the Merge stage on
port 0 and one or more update links on ports 1, 2,
...; the stage produces an Output link plus one
Rejects link per update link.)
187
Synopsis: Joins, Lookup, Merge
  • In this table:
  • "," (comma) is the separator between primary and
    secondary input links
  • (out and reject links)

188
The Aggregator Stage
  • Purpose Perform data aggregations
  • Specify
  • Zero or more key columns that define the
    aggregation units (or groups)
  • Columns to be aggregated
  • Aggregation functions
  • count (nulls/non-nulls), sum,
    max/min/range
  • standard error, coeff. of variation
  • sum of weights, un/corrected sum of squares
  • variance, mean, standard deviation
  • The grouping method (hash table or pre-sort) is a
    performance issue

189
Grouping Methods
  • Hash: results for each aggregation group are
    stored in a hash table, and the table is written
    out after all input has been processed
  • doesn't require sorted data
  • good when the number of unique groups is small.
    The running tally for each group's aggregate
    calculations needs to fit easily into memory.
    Requires about 1 KB of RAM per group.
  • Example: average family income by state requires
    about 50 groups x 1 KB = 0.05 MB of RAM
  • Sort: results for only a single aggregation group
    are kept in memory; when a new group is seen (key
    value changes), the current group is written out.
  • requires input sorted by grouping keys
  • can handle unlimited numbers of groups
  • Example: average daily balance by credit card

190
Aggregator Functions
  • Sum
  • Min, max
  • Mean
  • Missing value count
  • Non-missing value count
  • Percent coefficient of variation

191
Aggregator Properties
192
Aggregation Types
Aggregation types
193
Containers
  • Two varieties
  • Local
  • Shared
  • Local
  • Simplifies a large, complex diagram
  • Shared
  • Creates reusable object that many jobs can include

194
Creating a Container
  • Create a job
  • Select (loop around) the portion to containerize
  • Edit > Construct Container > Local or Shared

195
Using a Container
  • Select as though it were a stage

196
Exercise
  • Complete exercise 8-1

197
Module 9
  • Configuration Files

198
Objectives
  • Understand how DataStage EE uses configuration
    files to determine parallel behavior
  • Use this understanding to
  • Build a EE configuration file for a computer
    system
  • Change node configurations to support adding
    resources to processes that need them
  • Create a job that will change resource
    allocations at the stage level

199
Configuration File Concepts
  • Determine the processing nodes and disk space
    connected to each node
  • When the system changes, you need only change the
    configuration file; no need to recompile jobs
  • When DataStage job runs, platform reads
    configuration file
  • Platform automatically scales the application to
    fit the system

200
Processing Nodes Are
  • Locations on which the framework runs
    applications
  • Logical rather than physical construct
  • Do not necessarily correspond to the number of
    CPUs in your system
  • Typically one node for two CPUs
  • Can define one processing node for multiple
    physical nodes or multiple processing nodes for
    one physical node

201
Optimizing Parallelism
  • Degree of parallelism determined by number of
    nodes defined
  • Parallelism should be optimized, not maximized
  • Increasing parallelism distributes work load but
    also increases Framework overhead
  • Hardware influences degree of parallelism
    possible
  • System hardware partially determines
    configuration

202
More Factors to Consider
  • Communication amongst operators
  • Should be optimized by your configuration
  • Operators exchanging large amounts of data should
    be assigned to nodes communicating by shared
    memory or high-speed link
  • SMP leave some processors for operating system
  • Desirable to equalize partitioning of data
  • Use an experimental approach
  • Start with small data sets
  • Try different parallelism while scaling up data
    set sizes

203
Factors Affecting Optimal Degree of Parallelism
  • CPU intensive applications
  • Benefit from the greatest possible parallelism
  • Applications that are disk intensive
  • Number of logical nodes equals the number of disk
    spindles being accessed

204
EE Configuration File
  • Text file containing string data that is passed
    to the Framework
  • Sits on server side
  • Can be displayed and edited
  • Name and location found in environmental variable
    APT_CONFIG_FILE
  • Components
  • Node
  • Fast name
  • Pools
  • Resource

205
Sample Configuration File
{
  node "Node1" {
    fastname "BlackHole"
    pools "" "node1"
    resource disk "/usr/dsadm/Ascential/DataStage/Datasets" {pools ""}
    resource scratchdisk "/usr/dsadm/Ascential/DataStage/Scratch" {pools ""}
  }
}

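To scale the same project to two nodes, a second node entry can be added alongside the first inside the same outer braces; the host name and paths below simply reuse the sample values and are illustrative:

  node "Node2" {
    fastname "BlackHole"
    pools "" "node2"
    resource disk "/usr/dsadm/Ascential/DataStage/Datasets" {pools ""}
    resource scratchdisk "/usr/dsadm/Ascential/DataStage/Scratch" {pools ""}
  }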
206
Node Options
  • Node name - name of a processing node used by EE
  • Typically the network name
  • Use command uname -n to obtain network name
  • Fastname
  • Name of node as referred to by fastest network in
    the system
  • Operators use physical node name to open
    connections
  • NOTE: for SMP, all CPUs share a single connection
    to the network
  • Pools
  • Names of pools to which this node is assigned
  • Used to logically group nodes
  • Can also be used to group resources
  • Resource
  • Disk
  • Scratchdisk

207
Node Pools
  • node "node1"
  • fastname "server_name" pools "pool_name"
  • "pool_name" is the name of the node pool, e.g.,
    "extra"
  • Node pools group processing nodes based on
    usage.
  • Example: memory capacity and high-speed I/O.
  • One node can be assigned to multiple pools.
  • The default node pool ("") is made up of each node
    defined in the config file, unless the node is
    qualified as belonging to a different pool
    and is not designated as belonging to the
    default pool (see following example).

208
Resource Disk and Scratchdisk
  • node "node_0"
  • fastname "server_name" pools "pool_name"
  • resource disk "path" {pools "pool_1"}
  • resource scratchdisk "path" {pools "pool_1"}
  • ...
  • Resource type can be disk(s) or scratchdisk(s)
  • "pool_1" is the disk or scratchdisk pool,
    allowing you to group disks and/or
    scratchdisks.

209
Disk Pools
  • Disk pools allocate storage
  • Pooling applies to both disk types
  • By default, EE uses the default pool,
    specified by ""

(Example of a named disk pool: pools "bigdata")
210
Sorting Requirements
  • Resource pools can also be specified for
    sorting
  • The Sort stage looks first for scratch disk
    resources in a sort pool, and then in the
    default disk pool
  • Sort uses as many scratch disks as defined in the
    first pool it finds

211
Configuration File Example
node "n1" fastname s1" pool ""
"n1" "s1" "sort" resource disk "/data/n1/d1"
resource disk "/data/n1/d2"
resource scratchdisk "/scratch" "sort"
node "n2" fastname "s2" pool "" "n2"
"s2" "app1" resource disk "/data/n2/d1"
resource scratchdisk "/scratch" node
"n3" fastname "s3" pool "" "n3" "s3"
"app1" resource disk "/data/n3/d1"
resource scratchdisk "/scratch" node
"n4" fastname "s4" pool "" "n4" "s4"
"app1" resource disk "/data/n4/d1"
resource scratchdisk "/scratch" ...
212
Resource Types
  • Disk
  • Scratchdisk
  • DB2
  • Oracle
  • Saswork
  • Sortwork
  • Can exist in a pool
  • Groups resources together

213
Using Different Configurations
Lookup stage where DBMS is using a sparse lookup
type
214
Building a Configuration File
  • Scoping the hardware
  • Is the hardware configuration SMP, cluster, or MPP?