Title: Module 1
1Module 1
- DS324EE DataStage Enterprise Edition
- Concept Review
2 Ascential's Enterprise Data Integration Platform
[Diagram: under common command and control, the platform moves data from any source (CRM, ERP, SCM, RDBMS, legacy, real-time, client-server, Web services, data warehouse, other apps.) to any target (CRM, ERP, SCM, BI/Analytics, RDBMS, real-time, client-server, Web services, data warehouse, other apps.).]
3 Course Objectives
- You will learn to
- Build DataStage EE jobs using complex logic
- Utilize parallel processing techniques to increase job performance
- Build custom stages based on application needs
- Course emphasis is
- Advanced usage of DataStage EE
- Application job development
- Best practices techniques
4Course Agenda
- Day 1
- Review of EE Concepts
- Sequential Access
- Standards
- DBMS Access
- Day 2
- EE Architecture
- Transforming Data
- Sorting Data
- Day 3
- Combining Data
- Configuration Files
- Day 4
- Extending EE
- Meta Data Usage
- Job Control
- Testing
5 Module Objectives
- Provide a background for completing work in the DSEE advanced course
- Ensure all students will have a successful advanced class
- Tasks
- Review parallel processing concepts
6Review Topics
- DataStage architecture
- DataStage client review
- Administrator
- Manager
- Designer
- Director
- Parallel processing paradigm
- DataStage Enterprise Edition
7 Client-Server Architecture
[Diagram: the DataStage clients (Manager, Designer, Director, Administrator) run on Microsoft Windows NT/2000/XP and connect to the DataStage Server and Repository, which run on Microsoft Windows NT or UNIX. Data moves from any source to any target (CRM, ERP, SCM, BI/Analytics, RDBMS, real-time, client-server, Web services, data warehouse, other apps.) through Discover, Prepare, Transform, and Extend phases (Extract, Cleanse, Transform, Integrate).]
8 Process Flow
- Administrator: add/delete projects, set defaults
- Manager: import meta data, back up projects
- Designer: assemble jobs, compile, and execute
- Director: execute jobs, examine job run logs
9Administrator Licensing and Timeout
10Administrator Project Creation/Removal
Functions specific to a project.
11Administrator Project Properties
RCP for parallel jobs should be enabled
Variables for parallel processing
12Administrator Environment Variables
Variables are category specific
13OSH is what is run by the EE Framework
14DataStage Manager
15Export Objects to MetaStage
Push meta data to MetaStage
16Designer Workspace
Can execute the job from Designer
17DataStage Generated OSH
The EE Framework runs OSH
18Director Executing Jobs
Messages from previous run in different color
19 Stages
Can now customize the Designer's palette
20Popular Stages
Row generator
Peek
21Row Generator
Edit row in column tab
Repeatable property
22Peek
- Displays field values
- Will be displayed in job log or sent to a file
- Skip records option
- Can control number of records to be displayed
- Can be used as stub stage for iterative
development (more later)
23 Why EE is so Effective
- Parallel processing paradigm
- More hardware, faster processing
- Level of parallelization is determined by a configuration file read at runtime
- Emphasis on memory
- Data read into memory and lookups performed like a hash table
24 Scalable Systems
- Parallel processing: executing your application on multiple CPUs
- Scalable processing: add more resources (CPUs, RAM, and disks) to increase system performance
- Example: system containing 6 CPUs (or processing nodes) and disks
25 Scalable Systems Examples
- Three main types of scalable systems
- Symmetric Multiprocessors (SMP): shared memory
- Clusters: UNIX systems connected via networks
- MPP
26 SMP: Shared Everything
- Multiple CPUs with a single operating system
- Programs communicate using shared memory
- All CPUs share system resources (OS, memory with a single linear address space, disks, I/O)
- When used with enterprise edition
- Data transport uses shared memory
- Simplified startup
- enterprise edition treats NUMA (Non-Uniform Memory Access) as SMP
27 Traditional Batch Processing
- Traditional approach to batch processing
- Write to disk and read from disk before each processing operation
- Sub-optimal utilization of resources
- a 10 GB stream leads to 70 GB of I/O (a worked example follows)
- processing resources can sit idle during I/O
- Very complex to manage (lots and lots of small jobs)
- Becomes impractical with big data volumes
- disk I/O consumes the processing
- terabytes of disk required for temporary staging
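As a rough illustration (the operation count here is an assumption, not stated on the slide): if the 10 GB stream is landed to disk between, say, three intermediate batch steps, the data is read once from the source (10 GB) and then written and re-read at each of the three landings (3 x 20 GB = 60 GB), for roughly 70 GB of total I/O against only 10 GB of useful data.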
28 Pipeline Multiprocessing
Data Pipelining
- Transform, clean and load processes are executing simultaneously on the same processor
- rows are moving forward through the flow
[Diagram: operational and archived source data flow through Transform, Clean, and Load processes into the data warehouse target.]
- Start a downstream process while an upstream process is still running.
- This eliminates intermediate storing to disk, which is critical for big data.
- This also keeps the processors busy.
- Still has limits on scalability
Think of a conveyor belt moving the rows from process to process!
29 Partition Parallelism
Data Partitioning
- Break up big data into partitions
- Run one partition on each processor
- 4X faster on 4 processors; with data big enough, 100X faster on 100 processors (worked example below)
- This is exactly how the parallel databases work!
- Data partitioning requires applying the same transform to all partitions: Aaron Abbott and Zygmund Zorn undergo the same transform
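For example (the figures are illustrative, not from the slide): a 100 GB source split into 4 partitions gives each processor about 25 GB to transform concurrently, so elapsed time approaches one quarter of a single-processor run, provided the data is large enough that startup and partitioning overhead stay negligible.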
30 Combining Parallelism Types
Putting It All Together: Parallel Dataflow
31 EE Program Elements
- Dataset: a uniform set of rows in the Framework's internal representation
- Three flavors:
- 1. file sets (.fs): stored on multiple Unix files as flat files
- 2. persistent (.ds): stored on multiple Unix files in Framework format; read and written using the DataSet stage
- 3. virtual (.v): links, in Framework format, NOT stored on disk
- The Framework processes only datasets, hence the possible need for import
- Different datasets typically have different schemas
- Convention: "dataset" = Framework data set
- Partition: a subset of rows in a dataset earmarked for processing by the same node (virtual CPU, declared in a configuration file)
- All the partitions of a dataset follow the same schema: that of the dataset
32 Repartitioning
Putting It All Together: Parallel Dataflow with Repartitioning on-the-fly
33DataStage EE Architecture
34Introduction to DataStage EE
- DSEE
- Automatically scales to fit the machine
- Handles data flow among multiple CPUs and disks
- With DSEE you can
- Create applications for SMPs, clusters and MPPs; enterprise edition is architecture-neutral
- Access relational databases in parallel
- Execute external applications in parallel
- Store data across multiple disks and nodes
35 Job Design vs. Execution
User assembles the flow using the DataStage Designer
- and gets parallel access, propagation, transformation, and load.
- The design is good for 1 node, 4 nodes, or N nodes. To change nodes, just swap the configuration file.
- No need to modify or recompile your design!
36Partitioners and Collectors
- Partitioners distribute rows into partitions
- implement data-partition parallelism
- Collectors: the inverse of partitioners
- Live on input links of stages running
- in parallel (partitioners)
- sequentially (collectors)
- Use a choice of methods
37Example Partitioning Icons
partitioner
38Exercise
- Complete exercises 1-1 and 1-2, and 1-3
39Module 2
40 Module Objectives
- You will learn to
- Import sequential files into the EE Framework
- Utilize parallel processing techniques to increase sequential file access
- Understand usage of the Sequential, DataSet, FileSet, and LookupFileSet stages
- Manage partitioned data stored by the Framework
41Types of Sequential Data Stages
- Sequential
- Fixed or variable length
- File Set
- Lookup File Set
- Data Set
42 Sequential Stage Introduction
- The EE Framework processes only datasets
- For files other than datasets, such as flat files, enterprise edition must perform import and export operations; this is performed by import and export OSH operators (generated by Sequential or FileSet stages)
- During import or export DataStage performs format translations into, or out of, the EE internal format
- Data is described to the Framework in a schema
43 How the Sequential Stage Works
- Generates import/export operators
- Types of transport
- Performs direct C file I/O streams
- Source programs that feed stdout (e.g., gunzip) send stdout into EE via a sequential pipe
44 Using the Sequential File Stage
- Both import and export of general files (text, binary) are performed by the Sequential File stage.
- Data import: into the EE internal format
- Data export: out of the EE internal format
45 Working With Flat Files
- Sequential File stage
- Normally will execute in sequential mode
- Can execute in parallel if reading multiple files (file pattern option)
- Can use multiple readers within a node on fixed-width files
- DSEE needs to know
- How the file is divided into rows
- How a row is divided into columns
46 Processes Needed to Import Data
- Recordization
- Divides the input stream into records
- Set on the Format tab
- Columnization
- Divides the record into columns
- Default set on the Format tab but can be overridden on the Columns tab
- Can be incomplete if using a schema, or not even specified in the stage if using RCP
47File Format Example
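A hedged sketch of what such a file format definition can look like: a comma-delimited file such as

  1,Widget,9.99
  2,Gadget,14.50

might be described by a record schema along these lines (the column names, delimiters, and format properties are illustrative assumptions, not the slide's actual example):

  record {final_delim=end, record_delim='\n', delim=','}
  (
    partno: int32;
    description: string[max=30];
    price: decimal[5,2];
  )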
48 Sequential File Stage
- To set the properties, use the stage editor
- Pages (General, Input/Output)
- Tabs (Format, Columns)
- Sequential stage link rules
- One input link
- One output link (except for reject link definition)
- One reject link
- Will reject any records not matching the meta data in the column definitions
49Job Design Using Sequential Stages
Stage categories
50General Tab Sequential Source
Show records
Multiple output links
51Properties Multiple Files
Click to add more files having the same meta
data.
52Properties - Multiple Readers
Multiple readers option allows you to set number
of readers
53Format Tab
Record into columns
File into records
54 Read Methods
55 Reject Link
- Reject mode = output
- Source
- All records not matching the meta data (the column definitions)
- Target
- All records that are rejected for any reason
- Meta data: one column, datatype raw
56 File Set Stage
- Can read or write file sets
- Files suffixed by .fs
- File set consists of
- Descriptor file: contains locations of the raw data files and the meta data
- Individual raw data files
- Can be processed in parallel
57File Set Stage Example
Descriptor file
58 File Set Usage
- Why use a file set?
- 2 GB limit on some file systems
- Need to distribute data among nodes to prevent overruns
- If used in parallel, runs faster than a sequential file
59Lookup File Set Stage
- Can create file sets
- Usually used in conjunction with Lookup stages
60 Lookup File Set > Properties
Key column specified
Key column dropped in descriptor file
61 Data Set
- Operating system (Framework) file
- Suffixed by .ds
- Referred to by a control file
- Managed by the Data Set Management utility from the GUI (Manager, Designer, Director)
- Represents persistent data
- Key to good performance in a set of linked jobs
62 Persistent Datasets
- Accessed from/to disk with the DataSet stage.
- Two parts
- Descriptor file
- contains metadata and data location, but NOT the data itself
- Data file(s)
- contain the data
- multiple Unix files (one per node), accessible in parallel
Example: input.ds with schema
  record (partno: int32; description: string)
and data files under node1/local/disk1/ and node2/local/disk2/
63Quiz!
- True or False?
- Everything that has been data-partitioned must be
collected in same job
64Data Set Stage
Is the data partitioned?
65 Engine Data Translation
- Occurs on import
- From sequential files or file sets
- From RDBMS
- Occurs on export
- From datasets to file sets or sequential files
- From datasets to RDBMS
- Engine is most efficient when processing internally formatted records (i.e., data contained in datasets)
66 Managing DataSets
- GUI (Manager, Designer, Director): Tools > Data Set Management
- Alternative methods
- orchadmin
- Unix command-line utility
- List records
- Remove data sets (will remove all components)
- dsrecords
- Lists the number of records in a dataset
67Data Set Management
Display data
Schema
68 Data Set Management From Unix
- Alternative method of managing file sets and data sets
- dsrecords
- Gives record count
- Unix command-line utility
- dsrecords ds_name
- e.g., dsrecords myDS.ds
- 156999 records
- orchadmin
- Manages EE persistent data sets
- Unix command-line utility
- e.g., orchadmin rm myDataSet.ds
69Exercise
- Complete exercises 2-1, 2-2, 2-3, and 2-4.
70Blank
71Module 3
72 Objectives
- Establish standard techniques for DSEE development
- Will cover
- Job documentation
- Naming conventions for jobs, links, and stages
- Iterative job design
- Useful stages for job development
- Using configuration files for development
- Using environmental variables
- Job parameters
73Job Presentation
Document using the annotation stage
74Job Properties Documentation
Organize jobs into categories
Description shows in DS Manager and MetaStage
75 Naming Conventions
- Stages named after the
- Data they access
- Function they perform
- DO NOT leave defaulted stage names like Sequential_File_0
- Links named for the data they carry
- DO NOT leave defaulted link names like DSLink3
76Stage and Link Names
Stages and links renamed to data they handle
77 Create Reusable Job Components
- Use enterprise edition shared containers when feasible
Container
78 Use Iterative Job Design
- Use the Copy or Peek stage as a stub
- Test the job in phases: small first, then increasing in complexity
- Use the Peek stage to examine records
79Copy or Peek Stage Stub
Copy stage
80 Transformer Stage Techniques
- Suggestions:
- Always include a reject link.
- Always test for null values before using a column in a function.
- Try to use RCP and only map columns that have a derivation other than a copy. More on RCP later.
- Be aware of column and stage variable data types.
- Often the user does not pay attention to the stage variable type.
- Avoid type conversions.
- Try to maintain the data type as imported.
81 The Copy Stage
- With 1 link in and 1 link out, the Copy stage is the ultimate "no-op" (place-holder)
- Partitioners
- Sort / Remove Duplicates
- Rename, Drop column
- can be inserted on
- the input link (Partitioning tab): Partitioners, Sort, Remove Duplicates
- the output link (Mapping page): Rename, Drop
- Sometimes replaces the Transformer for
- Rename
- Drop
- Implicit type conversions
- Link constraint: break up a schema
82 Developing Jobs
- Keep it simple
- Jobs with many stages are hard to debug and maintain.
- Start small and build to the final solution
- Use View Data, Copy, and Peek.
- Start from the source and work out.
- Develop with a 1-node configuration file.
- Solve the business problem before the performance problem.
- Don't worry too much about partitioning until the sequential flow works as expected.
- If you have to write to disk, use a persistent data set.
83Final Result
84 Good Things to Have in Each Job
- Use job parameters
- Some helpful environment variables to add to job parameters
- APT_DUMP_SCORE
- Reports OSH to the message log (see the sketch after this list)
- APT_CONFIG_FILE
- Establishes runtime parameters to the EE engine, e.g., degree of parallelization
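At the UNIX level APT_DUMP_SCORE is an ordinary environment variable; a minimal sketch of turning it on outside the GUI (the value follows the Director notes later in the deck, the shell context is an assumption):

  export APT_DUMP_SCORE=true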
85Setting Job Parameters
Click to add environment variables
86 DUMP SCORE Output
Setting APT_DUMP_SCORE yields (double-click the log entry):
Partitioner and Collector
Mapping: node --> partition
87 Use Multiple Configuration Files
- Make a set for 1X, 2X, ...
- Use different ones for test versus production
- Include as a parameter in each job
88Exercise
89Module 4
90Objectives
- Understand how DSEE reads and writes records to
an RDBMS - Understand how to handle nulls on DBMS lookup
- Utilize this knowledge to
- Read and write database tables
- Use database tables to lookup data
- Use null handling options to clean data
91 Parallel Database Connectivity
[Diagram contrasts traditional client-server access (many clients, separate Sort and Load steps, single connections to the parallel RDBMS) with enterprise edition (the parallel server runs the application with parallel connections to the parallel RDBMS).]
- Enterprise edition
- Parallel server runs APPLICATIONS
- Application has parallel connections to the RDBMS
- Suitable for large data volumes
- Higher levels of integration possible
- Traditional client-server
- Only the RDBMS is running in parallel
- Each application has only one connection
- Suitable only for small data volumes
92 RDBMS Access: Supported Databases
- enterprise edition provides high performance, scalable interfaces for
- DB2
- Informix
- Oracle
- Teradata
- Users must be granted specific privileges, depending on the RDBMS.
93 RDBMS Access: Supported Databases
- Automatically convert RDBMS table layouts to/from enterprise edition table definitions
- RDBMS nulls converted to/from nullable field values
- Support for standard SQL syntax for specifying
- field list for the SELECT statement
- filter for the WHERE clause
- open command, close command
- Can write an explicit SQL query to access the RDBMS
- EE supplies additional information in the SQL query
94RDBMS Stages
- DB2/UDB Enterprise
- Informix Enterprise
- Oracle Enterprise
95 RDBMS Usage
- As a source
- Extract data from a table (stream link)
- Extract as table, generated SQL, or user-defined SQL
- User-defined SQL can perform joins, access views
- Lookup (reference link)
- Normal lookup is memory-based (all table data read into memory)
- Can perform one lookup at a time in the DBMS (sparse option)
- Continue/drop/fail options
- As a target
- Inserts
- Upserts (inserts and updates)
- Loader
96RDBMS Source Stream Link
Stream link
97DBMS Source - User-defined SQL
Columns in SQL statement must match the meta data
in columns tab
98Exercise
- User-defined SQL
- Exercise 4-1
99DBMS Source Reference Link
Reject link
100 Lookup Reject Link
The Output option automatically creates the reject link
101 Null Handling
- Must handle the null condition if the lookup record is not found and the continue option is chosen
- Can be done in a Transformer stage
102Lookup Stage Mapping
Link name
103Lookup Stage Properties
Reference link
Must have same column name in input and reference
links. You will get the results of the lookup in
the output column.
104DBMS as a Target
105DBMS As Target
- Write Methods
- Delete
- Load
- Upsert
- Write (DB2)
- Write mode for load method
- Truncate
- Create
- Replace
- Append
106Target Properties
Generated code can be copied
Upsert mode determines options
107 Checking for Nulls
- Use the Transformer stage to test for fields with null values (use IsNull functions; a sketch follows)
- In the Transformer, can reject or load a default value
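A minimal sketch of such a derivation (the link and column names are hypothetical, not from the course exercises):

  If IsNull(lkp_customer.description) Then 'UNKNOWN' Else lkp_customer.description

A constraint built on the same IsNull() test could instead route the row to a reject link.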
108Exercise
109Module 5
110 Objectives
- Understand how the enterprise edition Framework processes data
- You will be able to
- Read and understand OSH
- Perform troubleshooting
111 Concepts
- The EE Platform
- OSH (generated by the DataStage Parallel Canvas, and run by DataStage Director)
- Conductor, Section Leaders, Players
- Configuration files (only one active at a time; describes the H/W)
- Schemas/tables
- Schema propagation/RCP
- Buildop, Wrapper
- Datasets (data in the Framework's internal representation)
112 DS-EE Program Elements
EE Stages Involve a Series of Processing Steps
- Piece of application logic running against individual records
- Parallel or sequential
- Three sources
- Ascential supplied
- Commercial tools/applications
- Custom/existing programs
[Diagram: an EE stage with an input interface, business logic, an output interface, and a partitioner; the input and output data sets share the schema prov_num:int16; member_num:int8; custid:int32.]
113 DS-EE Program Elements: Stage Execution
Dual Parallelism Eliminates Bottlenecks!
- EE delivers parallelism in two ways
- Pipeline
- Partition
- Block buffering between components
- Eliminates need for program load balancing
- Maintains orderly data flow
[Diagram: producer and consumer stages connected by a pipeline, each running across multiple partitions.]
114 Stages Control Partition Parallelism
- Execution mode (sequential/parallel) is controlled by the stage
- default is parallel for most Ascential-supplied stages
- User can override the default mode
- A parallel stage inserts the default partitioner (Auto) on its input links
- A sequential stage inserts the default collector (Auto) on its input links
- user can override the default
- execution mode (parallel/sequential) of the stage (Advanced tab)
- choice of partitioner/collector (Input > Partitioning tab)
115 How Parallel Is It?
- The degree of parallelism is determined by the configuration file
- Total number of logical nodes in the default pool, or a subset if using "constraints"
- Constraints are assigned to specific pools as defined in the configuration file and can be referenced in the stage
116OSH
- DataStage EE GUI generates OSH scripts
- Ability to view OSH turned on in Administrator
- OSH can be viewed in Designer using job
properties - The Framework executes OSH
- What is OSH?
- Orchestrate shell
- Has a UNIX command-line interface
117 OSH Script
- An osh script is a quoted string which specifies
- The operators and connections of a single Orchestrate step
- In its simplest form, it is
- osh "op < in.ds > out.ds"
- Where
- op is an Orchestrate operator
- in.ds is the input data set
- out.ds is the output data set
118 OSH Operators
- An operator is an instance of a C++ class inheriting from APT_Operator
- Developers can create new operators
- Examples of existing operators (a schematic pipeline example follows the list)
- Import
- Export
- RemoveDups
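Extending the one-operator form from the previous slide, operators can be piped together inside a single step; a schematic sketch (op1 and op2 are placeholders, not real operator names):

  osh "op1 < in.ds | op2 > out.ds"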
119Enable Visible OSH in Administrator
Will be enabled for all projects
120View OSH in Designer
Operator
Schema
121OSH Practice
122 Orchestrate May Add Operators to Your Command
Let's revisit the following OSH command:
osh "echo 'Hello world!' par > outfile"
The Framework silently inserts operators (steps 1, 2, 3, 4)
123 Elements of a Framework Program
Steps, with internal and terminal datasets and links, described by schemas
- Step: unit of an OSH program
- one OSH command = one step
- at end of step: synchronization, storage to disk
- Datasets: sets of rows processed by the Framework
- Orchestrate data sets:
- persistent (terminal), .ds, and
- virtual (internal), .v
- Also flat file sets, .fs
- Schema: data description (metadata) for datasets and links
124 Orchestrate Datasets
- Consist of partitioned data and a schema
- Can be persistent (.ds) or virtual (.v, link)
- Overcome the 2 GB file limit
[Diagram: what you program in the GUI/OSH (osh operator_A > x.ds) is what gets generated and processed as one instance of Operator A per node (Node 1 ... Node 4), producing multiple files per partition, each up to 2 GB (or larger).]
125 Computing Architectures: Definition
[Diagram contrasts dedicated-disk, shared-memory/shared-disk, and shared-nothing architectures for the three system types below.]
- Uniprocessor
- PC, workstation, single processor server
- SMP System (Symmetric Multiprocessor)
- IBM, Sun, HP, Compaq
- 2 to 64 processors
- Majority of installations
- Clusters and MPP Systems
- 2 to hundreds of processors
- MPP: IBM and NCR Teradata
- each node is a uniprocessor or SMP
126 Job Execution: Orchestrate
- Conductor: the initial DS/EE process
- Step composer
- Creates Section Leader processes (one per node)
- Consolidates messages, outputs them
- Manages orderly shutdown
- Section Leader
- Forks Player processes (one per stage)
- Manages up/down communication
- Players
- The actual processes associated with stages
- Combined players: one process only
- Send stderr to the Section Leader
- Establish connections to other players for data flow
- Clean up upon completion
- Communication
- SMP: shared memory
- MPP: TCP
127 Working with Configuration Files
- You can easily switch between config files (example below):
- '1-node' file: for sequential execution, lighter reports; handy for testing
- 'MedN-nodes' file: aims at a mix of pipeline and data-partitioned parallelism
- 'BigN-nodes' file: aims at full data-partitioned parallelism
- Only one file is active while a step is running
- The Framework queries (first) the environment variable APT_CONFIG_FILE
- nodes declared in the config file need not match CPUs
- The same configuration file can be used on development and target machines
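A hedged sketch of switching configurations from the shell (the file names and paths are assumptions):

  export APT_CONFIG_FILE=/home/dsadm/configs/1node.apt    # sequential testing
  export APT_CONFIG_FILE=/home/dsadm/configs/4node.apt    # data-partitioned run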
128 Scheduling: Nodes, Processes, and CPUs
- DS/EE does not
- know how many CPUs are available
- schedule
- Who knows what? Who does what?
- DS/EE creates (nodes x operators) Unix processes
- The O/S schedules these processes on the CPUs
129 Configuring DSEE: Node Pools
{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "app2" "sort"
    resource disk "/orch/n1/d1"
    resource disk "/orch/n1/d2"
    resource scratchdisk "/temp" "sort"
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/orch/n2/d1"
    resource disk "/orch/n2/d2"
    resource scratchdisk "/temp"
  }
  node "n3" {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/orch/n3/d1"
    resource scratchdisk "/temp"
  }
  node "n4" {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/orch/n4/d1"
    resource scratchdisk "/temp"
  }
}
130 Configuring DSEE: Disk Pools
{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "app2" "sort"
    resource disk "/orch/n1/d1"
    resource disk "/orch/n1/d2" "bigdata"
    resource scratchdisk "/temp" "sort"
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/orch/n2/d1"
    resource disk "/orch/n2/d2" "bigdata"
    resource scratchdisk "/temp"
  }
  node "n3" {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/orch/n3/d1"
    resource scratchdisk "/temp"
  }
  node "n4" {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/orch/n4/d1"
    resource scratchdisk "/temp"
  }
}
131 Re-Partitioning
Parallel-to-parallel flow may incur reshuffling: records may jump between nodes
node 1
node 2
partitioner
132 Re-Partitioning X-ray
- Partitioner with parallel import
- When a partitioner receives
- sequential input (1 partition), it creates N partitions
- parallel input (N partitions), it outputs N partitions; this may result in re-partitioning
- Assuming no constraints
133 Automatic Re-Partitioning
In most cases, automatic re-partitioning is benign (no reshuffling), preserving the same partitioning as upstream. Re-partitioning can be forced to be benign, using either "same" or "preserve partitioning".
If Stage 2 runs in parallel, DS/EE silently inserts a partitioner upstream of it. If Stage 1 also runs in parallel, re-partitioning occurs.
[Diagram: Stage 1 feeds a partitioner, which feeds Stage 2 across partition 1 and partition 2.]
134Partitioning Methods
- Auto
- Hash
- Entire
- Range
- Range Map
135Collectors
- Collectors combine partitions of a dataset into a
single input stream to a sequential Stage
...
data partitions
collector
sequential Stage
- Collectors do NOT synchronize data
136Partitioning and Repartitioning Are Visible On
Job Design
137Partitioning and Collecting Icons
Partitioner
Collector
138Setting a Node Constraint in the GUI
139Reading Messages in Director
- Set APT_DUMP_SCORE to true
- Can be specified as job parameter
- Messages sent to Director log
- If set, parallel job will produce a report
showing the operators, processes, and datasets in
the running job
140Messages With APT_DUMP_SCORE True
141Exercise
142Blank
143Module 6
144 Module Objectives
- Understand ways DataStage allows you to transform data
- Use this understanding to
- Create column derivations using user-defined code or system functions
- Filter records based on business criteria
- Control data flow based on data conditions
145 Transformed Data
- Transformed data is
- An outgoing column is a derivation that may, or may not, include incoming fields or parts of incoming fields
- May be comprised of system variables
- Frequently uses functions performed on something (i.e., incoming columns)
- Divided into categories, e.g.:
- Date and time
- Mathematical
- Logical
- Null handling
- More
146Stages Review
- Stages that can transform data
- Transformer
- Parallel
- Basic (from Parallel palette)
- Aggregator (discussed in later module)
- Sample stages that do not transform data
- Sequential
- FileSet
- DataSet
- DBMS
147Transformer Stage Functions
- Control data flow
- Create derivations
148 Flow Control
- Separate record flow down links based on data conditions specified in Transformer stage constraints
- The Transformer stage can filter records
- Other stages can filter records but do not exhibit advanced flow control
- Sequential
- Lookup
- Filter
149 Rejecting Data
- Reject option on the Sequential stage
- Data does not agree with the meta data
- Output consists of one column with binary data type
- Reject links (from the Lookup stage) result from the drop option of the property If Not Found
- Lookup failed
- All columns on the reject link (no column mapping option)
- Reject constraints are controlled from the constraint editor of the Transformer
- Can control column mapping
- Use the Other/Log checkbox
150 Rejecting Data Example
Constraint Other/Log option
If Not Found property
Property Reject Mode = Output
151Transformer Stage Properties
152 Transformer Stage Variables
- First of the Transformer stage entities to execute
- Execute in order from top to bottom
- Can write a program by using one stage variable to point to the results of a previous stage variable
- Multi-purpose
- Counters
- Hold values from previous rows to make comparisons
- Hold derivations to be used in multiple field derivations
- Can be used to control execution of constraints
153Stage Variables
Show/Hide button
154 Transforming Data
- Derivations
- Using expressions
- Using functions
- Date/time
- Transformer stage issues
- Sometimes require sorting before the Transformer stage, e.g., when using a stage variable as an accumulator and needing to break on a change of column value
- Checking for nulls
155 Checking for Nulls
- Nulls can get introduced into the data flow because of failed lookups and the way in which you chose to handle this condition
- Can be handled in constraints, derivations, stage variables, or a combination of these
156 Nullability
Can set the value of a null, i.e., if the value of the column is null, put "NULL" in the outgoing column.
Source Field    Destination Field    Result
not_nullable    not_nullable         Source value propagates to destination.
not_nullable    nullable             Source value propagates; destination value is never null.
nullable        not_nullable         WARNING messages in log. If the source value is null, a fatal error occurs. Must handle in a transformer.
nullable        nullable             Source value or null propagates.
157 Transformer Stage: Handling Rejects
- Constraint rejects
- All expressions are false and the reject row option is checked
- Expression error rejects
- Improperly handled null
158 Transformer Execution Order
- Derivations in stage variables are executed first
- Constraints are executed before derivations
- Column derivations in earlier links are executed before later links
- Derivations in higher columns are executed before lower columns
159 Two Transformers for the Parallel Palette
- All > Processing > Transformer
- Is the non-Universe transformer
- Has a specific set of functions
- No DS routines available
- Parallel > Processing > Basic Transformer
- Makes server-style transforms available on the parallel palette
- Can use DS routines
- No need for a shared container to get Universe functionality on the parallel palette
- Program in BASIC for both transformers
160 Transformer Functions From Derivation Editor
- Date & Time
- Logical
- Mathematical
- Null Handling
- Number
- Raw
- String
- Type Conversion
- Utility
161 Timestamps and Dates
- Date & Time functions
- Also some in Type Conversion
162Exercise
- Complete exercises 6-1, 6-2, and 6-3
163Module 7
164Objectives
- Understand DataStage EE sorting options
- Use this understanding to create sorted list of
data to enable functionality within a transformer
stage
165 Sorting Data
- Important because
- The Transformer may be using stage variables for accumulators or control breaks, and order is important
- Other stages may run faster, e.g., Aggregator
- Facilitates the RemoveDups stage, where order is important
- The job has partitioning requirements
- Can be performed
- As an option within stages (use the Input > Partitioning tab and set partitioning to anything other than Auto)
- As a separate stage (more complex sorts)
166 Sorting Alternatives
- Alternative representation of same flow
167Sort Option on Stage Link
168Sort Stage
169Sort Utility
- DataStage the default
- SyncSort
- UNIX
170Sort Stage - Outputs
- Specifies how the output is derived
171 Sort Specification Options
- Input link property
- Limited functionality
- Max memory per partition is 20 MB, then spills to scratch
- Sort stage
- Tunable to use more memory before spilling to scratch
- Note: spread I/O by adding more scratch file systems to each node of the APT_CONFIG_FILE
172Removing Duplicates
- Can be done by Sort
- Use unique option
- OR
- Remove Duplicates stage
- Has more sophisticated ways to remove duplicates
173Exercise
174Blank
175Module 8
176 Objectives
- Understand how DataStage can combine data using the Join, Lookup, Merge, and Aggregator stages
- Use this understanding to create jobs that will
- Combine data from separate input streams
- Aggregate data to form summary totals
177 Combining Data
- There are two ways to combine data:
- Horizontally: several input links, one output link (plus optional rejects) made of columns from different input links. E.g.,
- Joins
- Lookup
- Merge
- Vertically: one input link, output with columns combining values from all input rows. E.g.,
- Aggregator
178 Recall the Join, Lookup, Merge Stages
- These "three stages" combine two or more input links according to values of user-designated "key" column(s).
- They differ mainly in
- Memory usage
- Treatment of rows with unmatched key values
- Input requirements (sorted, de-duplicated)
179 Joins - Lookup - Merge: Not all Links are Created Equal!
- enterprise edition distinguishes between
- The primary input (Framework port 0)
- Secondary inputs, in some cases "reference" (other ports)
- Naming convention
- Tip: check the "Input Ordering" tab to make sure the intended primary is listed first
180 Join Stage Editor
Link order is immaterial for inner and full outer joins (but VERY important for left/right outer joins, and for Lookup and Merge)
- One of four variants
- Inner
- Left Outer
- Right Outer
- Full Outer
Several key columns allowed
181 1. The Join Stage
- Four types:
- Inner
- Left Outer
- Right Outer
- Full Outer
- 2 sorted input links, 1 output link
- "left" on the primary input, "right" on the secondary input
- Pre-sort makes joins "lightweight": few rows need to be in RAM
182 2. The Lookup Stage
- Combines
- one source link with
- one or more duplicate-free table links
- No pre-sort necessary; allows multiple-key LUTs; flexible exception handling for source input rows with no match
[Diagram: a source input (port 0) and one or more lookup tables (LUTs) feed the Lookup stage, which has an output link and a reject link.]
183 The Lookup Stage
- Lookup tables should be small enough to fit into physical memory (otherwise, there is a performance hit due to paging)
- Space/time trade-off: pre-sort vs. in-RAM table
- On an MPP you should partition the lookup tables using the entire partitioning method, or partition them the same way you partition the source link
- On an SMP, no physical duplication of a lookup table occurs
184 The Lookup Stage
- Lookup File Set
- Like a persistent data set, only it contains metadata about the key
- Useful for staging lookup tables
- RDBMS lookup
- SPARSE
- Select for each row
- Might become a performance bottleneck
- NORMAL
- Loads into an in-memory hash table first
185 3. The Merge Stage
- Combines
- one sorted, duplicate-free master (primary) link with
- one or more sorted update (secondary) links
- Pre-sort makes merge "lightweight": few rows need to be in RAM (as with joins, but opposite to lookup)
- Follows the master-update model
- A master row and one or more update rows are merged if they have the same value in the user-specified key column(s)
- A non-key column occurs in several inputs? The lowest input port number prevails (e.g., master over update; update values are ignored)
- Unmatched ("bad") master rows can be either
- kept
- dropped
- Unmatched ("bad") update rows in an input link can be captured in a "reject" link
- Matched update rows are consumed
186 The Merge Stage
- Allows composite keys
- Multiple update links
- Matched update rows are consumed
- Unmatched updates can be captured
- Lightweight
- Space/time trade-off: pre-sorts vs. in-RAM table
[Diagram: a master link (port 0) and one or more update links feed the Merge stage, which has an output plus reject links for the updates.]
187 Synopsis: Joins, Lookup, Merge
- In this table
- "," (comma): separator between primary and secondary input links
- (out and reject links)
188 The Aggregator Stage
- Purpose: perform data aggregations
- Specify
- Zero or more key columns that define the aggregation units (or groups)
- Columns to be aggregated
- Aggregation functions, including:
- count (nulls/non-nulls), sum, max/min/range
- standard error, coeff. of variation, sum of weights, un/corrected sum of squares
- variance, mean, standard deviation
- The grouping method (hash table or pre-sort) is a performance issue
189 Grouping Methods
- Hash: results for each aggregation group are stored in a hash table, and the table is written out after all input has been processed
- doesn't require sorted data
- good when the number of unique groups is small; the running tally for each group's aggregate calculations needs to fit easily into memory (about 1 KB of RAM per group)
- Example: average family income by state requires about 0.05 MB of RAM
- Sort: results for only a single aggregation group are kept in memory; when a new group is seen (key value changes), the current group is written out
- requires input sorted by the grouping keys
- can handle unlimited numbers of groups
- Example: average daily balance by credit card
190Aggregator Functions
- Sum
- Min, max
- Mean
- Missing value count
- Non-missing value count
- Percent coefficient of variation
191Aggregator Properties
192Aggregation Types
Aggregation types
193Containers
- Two varieties
- Local
- Shared
- Local
- Simplifies a large, complex diagram
- Shared
- Creates reusable object that many jobs can include
194 Creating a Container
- Create a job
- Select (loop around) the portions to containerize
- Edit > Construct Container > Local or Shared
195Using a Container
- Select as though it were a stage
196Exercise
197Module 9
198 Objectives
- Understand how DataStage EE uses configuration files to determine parallel behavior
- Use this understanding to
- Build an EE configuration file for a computer system
- Change node configurations to support adding resources to processes that need them
- Create a job that will change resource allocations at the stage level
199 Configuration File Concepts
- Determines the processing nodes and disk space connected to each node
- When the system changes, you need only change the configuration file; no need to recompile jobs
- When a DataStage job runs, the platform reads the configuration file
- The platform automatically scales the application to fit the system
200 Processing Nodes Are
- Locations on which the framework runs applications
- Logical rather than physical constructs
- Do not necessarily correspond to the number of CPUs in your system
- Typically one node for two CPUs
- Can define one processing node for multiple physical nodes, or multiple processing nodes for one physical node
201 Optimizing Parallelism
- Degree of parallelism is determined by the number of nodes defined
- Parallelism should be optimized, not maximized
- Increasing parallelism distributes the work load but also increases Framework overhead
- Hardware influences the degree of parallelism possible
- System hardware partially determines configuration
202 More Factors to Consider
- Communication amongst operators
- Should be optimized by your configuration
- Operators exchanging large amounts of data should be assigned to nodes communicating by shared memory or a high-speed link
- SMP: leave some processors for the operating system
- Desirable to equalize partitioning of data
- Use an experimental approach
- Start with small data sets
- Try different degrees of parallelism while scaling up data set sizes
203 Factors Affecting Optimal Degree of Parallelism
- CPU-intensive applications
- Benefit from the greatest possible parallelism
- Disk-intensive applications
- Number of logical nodes equals the number of disk spindles being accessed
204 EE Configuration File
- Text file containing string data that is passed to the Framework
- Sits on the server side
- Can be displayed and edited
- Name and location found in the environment variable APT_CONFIG_FILE
- Components
- Node
- Fast name
- Pools
- Resource
205 Sample Configuration File
{
  node "Node1"
  {
    fastname "BlackHole"
    pools "" "node1"
    resource disk "/usr/dsadm/Ascential/DataStage/Datasets" {pools ""}
    resource scratchdisk "/usr/dsadm/Ascential/DataStage/Scratch" {pools ""}
  }
}
206 Node Options
- Node name: name of a processing node used by EE
- Typically the network name
- Use the command uname -n to obtain the network name
- Fastname
- Name of the node as referred to by the fastest network in the system
- Operators use the physical node name to open connections
- NOTE: for SMP, all CPUs share a single connection to the network
- Pools
- Names of pools to which this node is assigned
- Used to logically group nodes
- Can also be used to group resources
- Resource
- Disk
- Scratchdisk
207 Node Pools
node "node1" {
  fastname "server_name"
  pools "pool_name"
  ...
}
- "pool_name" is the name of the node pool, e.g., "extra"
- Node pools group processing nodes based on usage
- Example: memory capacity and high-speed I/O
- One node can be assigned to multiple pools (see the sketch below)
- The default node pool ("") is made up of every node defined in the config file, unless a node is qualified as belonging to a different pool and is not designated as belonging to the default pool (see following example)
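A hedged sketch of one node belonging to the default pool plus an additional pool (node, server, and path names are illustrative):

  node "node1" {
    fastname "server1"
    pools "" "extra"
    resource disk "/data/node1" {pools ""}
    resource scratchdisk "/scratch/node1" {pools ""}
  }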
208 Resource Disk and Scratchdisk
node "node_0" {
  fastname "server_name"
  pools "pool_name"
  resource disk "path" {pools "pool_1"}
  resource scratchdisk "path" {pools "pool_1"}
  ...
}
- Resource type can be disk(s) or scratchdisk(s)
- "pool_1" is the disk or scratchdisk pool, allowing you to group disks and/or scratchdisks
209 Disk Pools
- Disk pools allocate storage
- Pooling applies to both disk types
- By default, EE uses the default pool, specified by ""
[Slide callout: pool "bigdata"]
210 Sorting Requirements
- Resource pools can also be specified for sorting
- The Sort stage looks first for scratch disk resources in a "sort" pool, and then in the default disk pool
- Sort uses as many scratch disks as are defined in the first pool it finds
211 Configuration File Example
{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "sort"
    resource disk "/data/n1/d1"
    resource disk "/data/n1/d2"
    resource scratchdisk "/scratch" "sort"
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/data/n2/d1"
    resource scratchdisk "/scratch"
  }
  node "n3" {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/data/n3/d1"
    resource scratchdisk "/scratch"
  }
  node "n4" {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/data/n4/d1"
    resource scratchdisk "/scratch"
  }
  ...
}
212Resource Types
- Disk
- Scratchdisk
- DB2
- Oracle
- Saswork
- Sortwork
- Can exist in a pool
- Groups resources together
213Using Different Configurations
Lookup stage where DBMS is using a sparse lookup
type
214Building a Configuration File
- Scoping the hardware
- Is the hardware configuration SMP, Cluster,