1
Real Time load balancing of parallel application
  • ECE696b
  • Yeliang Zhang

2
Agenda
  • Introduction
  • Parallel paradigms
  • Performance analysis
  • Real time load balancing project
  • Other research work example
  • Future work

3
What is Parallel Computing?
  • Using more than one computer at the same time to
    solve a problem, or using a computer that has
    more than one processor working simultaneously (a
    parallel computer).
  • The same program can be run on different machines at the same time (SPMD)
  • Different programs can be run on different machines at the same time (MPMD)

4
Why is it interesting?
  • Uses computer capability efficiently
  • Solves problems that would take a single-CPU machine months or years to solve
  • Provides redundancy for certain applications

5
Why is it interesting? (continued)
  • Limits of single CPU computing
  • Available memory
  • Performance
  • Parallel computing allows us to
  • Solve problems that don't fit in a single CPU's memory space
  • Solve problems that can't be solved in a reasonable time
  • We can run
  • Larger problems
  • Faster

6
One Application Example
  • Weather Modeling and Forecasting
  • Consider a region of 3000 x 3000 miles, with a height of 11 miles.
  • For modeling, partition it into segments of 0.1 x 0.1 x 0.1 cubic miles, giving about 10^11 segments.
  • Take a 2-day period with parameters recomputed every 30 min. Assume the computation takes 100 instrs. per segment, so a single update takes about 10^13 instrs. For two days the total is about 10^15 instrs. A serial computer executing 10^10 instrs./sec takes on the order of 10^5 seconds — roughly a day — to predict the next 48 hours!
  • Now take 1000 processors, each capable of 10^8 instrs./sec. Each processor handles 10^8 segments, so for 2 days it executes about 10^12 instrs. The calculation is done in about 3 hours!
  • Currently all major weather forecast centers (US, Europe, Asia) have supercomputers with 1000s of processors.
  • (A back-of-envelope check of these figures follows this slide.)
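
A back-of-envelope check of these figures, as a minimal C sketch (the slide rounds to powers of ten; the exact segment count below gives about 26 hours serial and 2.6 hours parallel):

    #include <stdio.h>

    int main(void) {
        /* Domain: 3000 x 3000 miles, 11 miles high, 0.1-mile cubic segments */
        double segments = (3000.0 / 0.1) * (3000.0 / 0.1) * (11.0 / 0.1);   /* ~1e11 */

        double instrs_per_update = segments * 100.0;        /* ~1e13 instructions */
        double updates = 2.0 * 24.0 * 2.0;                  /* 2 days, every 30 min = 96 */
        double total_instrs = instrs_per_update * updates;  /* ~1e15 instructions */

        double serial_rate = 1e10;         /* instrs/sec, single CPU (from the slide) */
        double parallel_rate = 1000 * 1e8; /* 1000 processors at 1e8 instrs/sec each */

        printf("segments      ~ %.1e\n", segments);
        printf("total instrs  ~ %.1e\n", total_instrs);
        printf("serial time   ~ %.1f hours\n", total_instrs / serial_rate / 3600.0);
        printf("parallel time ~ %.1f hours\n", total_instrs / parallel_rate / 3600.0);
        return 0;
    }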

7
Some Other Applications
  • Database queries
  • Simulation of supernova explosions
  • Fluid dynamics calculations
  • Cosmic microwave background data analysis
  • Ocean modeling
  • Genetic research

8
Types of Parallelism: Two Extremes
  • Data parallel
  • Each processor performs the same task on
    different data
  • Example - grid problems
  • Task parallel
  • Each processor performs a different task
  • Example - signal processing
  • Most applications fall somewhere on the continuum
    between these two extremes

9
Basics: Data Parallelism
  • Data parallelism exploits the concurrency that
    derives from the application of the same
    operation to multiple elements of a data
    structure
  • Ex: add 2 to all elements of an array (see the sketch below)
  • Ex: increase the salary of all employees with 5 years of service
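
A minimal sketch of the first example in C with OpenMP (the array name and size are arbitrary); the same operation is applied to every element, and the iterations are split across processors. Without an OpenMP flag such as -fopenmp the pragma is simply ignored and the loop runs serially.

    #include <stdio.h>

    #define N 1024

    int main(void) {
        int a[N];
        for (int i = 0; i < N; i++) a[i] = i;

        /* Data parallelism: the same operation (add 2) on every element;
           OpenMP divides the iterations among the available processors. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] += 2;

        printf("a[0]=%d  a[%d]=%d\n", a[0], N - 1, a[N - 1]);
        return 0;
    }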

10
Typical Task Parallel Application
  • N tasks, if they do not overlap, can be run on N processors

[Diagram: an application decomposed into Task 1, Task 2, ..., Task n]
11
Limits of Parallel Computing
  • Theoretical Upper Limits
  • Amdahl's Law
  • Practical Limits
  • Load balancing
  • Non-computational sections
  • Other Considerations
  • Sometimes the code needs to be rewritten

12
Amdahl's Law
  • Amdahl's Law places a strict limit on the speedup that can be realized by using multiple processors.
  • Effect of multiple processors on run time
  • Effect of multiple processors on speedup
  • (The standard form of these expressions is reproduced after this list.)
  • Where
  • fs = serial fraction of code
  • fp = parallel fraction of code
  • N = number of processors
  • tN = time to run on N processors
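
A standard statement of the two expressions, consistent with the variables defined above (t_1 is the time on a single processor, and f_s + f_p = 1):

    \[
      t_N \;=\; \left( f_s + \frac{f_p}{N} \right) t_1,
      \qquad
      S \;=\; \frac{t_1}{t_N} \;=\; \frac{1}{f_s + f_p / N} \;\le\; \frac{1}{f_s}
    \]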

13
Practical Limits: Amdahl's Law vs. Reality
  • Amdahl's Law provides a theoretical upper limit on parallel speedup, assuming that there are no costs for communication. In reality, communication results in a further degradation of performance.

14
Practical Limits: Amdahl's Law vs. Reality
  • In reality, Amdahl's Law is limited by many things:
  • Communications
  • I/O
  • Load balancing
  • Scheduling (shared processors or memory)

15
Other Considerations
  • Writing effective parallel applications is difficult!
  • Load balance is important
  • Communication can limit parallel efficiency
  • Serial time can dominate
  • Is it worth your time to rewrite your
    application?
  • Do the CPU requirements justify parallelization?
  • Will the code be used just once?

16
Sources of Parallel Overhead
  • Interprocessor communication: The time to transfer data between processors is usually the most significant source of parallel processing overhead.
  • Load imbalance: In some parallel applications it is impossible to distribute the subtask workload equally to each processor. So at some point all but one processor might be done and waiting for that one processor to complete.
  • Extra computation: Sometimes the best sequential algorithm is not easily parallelizable, and one is forced to use a parallel algorithm based on a poorer but easily parallelizable sequential algorithm. Sometimes repetitive work is done on each of the N processors instead of send/recv, which leads to extra computation.

17
Parallel Program Performance Touchstone
  • Execution time is the principal measure of performance

18
Programming Parallel Computers
  • Programming single-processor systems is
    (relatively) easy due to
  • single thread of execution
  • single address space
  • Programming shared memory systems can benefit
    from the single address space
  • Programming distributed memory systems is the most difficult due to multiple address spaces and the need to access remote data
  • Both parallel systems (shared memory and distributed memory) offer the ability to perform independent operations on different data (MIMD) and implement task parallelism
  • Both can be programmed in a data parallel, SPMD
    fashion

19
Single Program, Multiple Data (SPMD)
  • SPMD: the dominant programming model for shared and distributed memory machines.
  • One source code is written
  • Code can have conditional execution based on which processor is executing the copy (see the sketch below)
  • All copies of the code are started simultaneously and communicate and synchronize with each other periodically
  • MPMD: more general, and possible in hardware, but no system/programming software enables it
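
A minimal SPMD sketch in C with MPI (the messages and roles are illustrative, not from the slides): one source file, every process runs the same copy, and behavior is conditioned on the rank returned by MPI_Comm_rank.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Same program on every process; conditional execution based on rank. */
        if (rank == 0)
            printf("Coordinator: %d copies of this program are running\n", size);
        else
            printf("Worker %d: doing my share of the work\n", rank);

        /* Periodic synchronization point shared by all copies. */
        MPI_Barrier(MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }

Built and launched with, e.g., mpicc and mpirun -np 4, all four copies execute the same binary.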

20
Shared Memory vs. Distributed Memory
  • Tools can be developed to make any system appear to be a different kind of system
  • Distributed memory systems can be programmed as if they had shared memory, and vice versa
  • Such tools do not produce the most efficient code, but might enable portability
  • HOWEVER, the most natural way to program any machine is to use tools and languages that express the algorithm explicitly for the architecture.

21
Shared Memory Programming OpenMP
  • Shared memory systems have a single address
    space
  • applications can be developed in which loop
    iterations (with no dependencies) are executed by
    different processors
  • shared memory codes are mostly data parallel,
    SPMD kinds of codes
  • OpenMP is the new standard for shared memory
    programming (compiler directives)
  • Vendors offer native compiler directives

22
Accessing Shared Variables
  • If multiple processors want to write to a shared variable at the same time, there may be conflicts
  • Processors 1 and 2 each
  • read X
  • compute X+1
  • write X
  • The programmer, language, and/or architecture must provide ways of resolving conflicts (see the sketch below)
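
A minimal C/OpenMP sketch of the read/compute/write conflict described above, plus one way the language resolves it (the variable names and iteration count are arbitrary):

    #include <stdio.h>

    int main(void) {
        int x_racy = 0, x_safe = 0;

        /* Unprotected: each thread reads x, computes x+1, writes x back.
           With OpenMP enabled, updates can interleave and be lost, so the
           final value is unpredictable. */
        #pragma omp parallel for
        for (int i = 0; i < 100000; i++)
            x_racy = x_racy + 1;

        /* Protected: the atomic directive makes each read-modify-write indivisible. */
        #pragma omp parallel for
        for (int i = 0; i < 100000; i++) {
            #pragma omp atomic
            x_safe = x_safe + 1;
        }

        printf("racy: %d  safe: %d (expected 100000)\n", x_racy, x_safe);
        return 0;
    }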

23
OpenMP Example: Parallel Loop
  • !$OMP PARALLEL DO
  •   do i = 1, 128
  •     b(i) = a(i) + c(i)
  •   end do
  • !$OMP END PARALLEL DO
  • The first directive specifies that the loop
    immediately following should be executed in
    parallel. The second directive specifies the end
    of the parallel section (optional).
  • For codes that spend the majority of their time
    executing the content of simple loops, the
    PARALLEL DO directive can result in significant
    parallel performance.

24
MPI Basics
  • What is MPI?
  • A message-passing library specification
  • Extended message-passing model
  • Not a language or compiler specification
  • Not a specific implementation or product
  • Designed to permit the development of parallel
    software libraries
  • Designed to provide access to advanced parallel
    hardware for
  • End users
  • Library writers
  • Tool developers

25
Features of MPI
  • General
  • Communications combine context and group for
    message security
  • Thread safety
  • Point-to-point communication (a minimal example follows this list)
  • Structured buffers and derived datatypes; heterogeneity
  • Modes: normal (blocking and non-blocking), synchronous, ready (to allow access to fast protocols), buffered
  • Collective
  • Both built-in and user-defined collective operations
  • Large number of data movement routines
  • Subgroups defined directly or by topology
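
A minimal C sketch of the two communication styles listed above (the payload, tag and reduction are illustrative): a blocking point-to-point send/receive followed by a built-in collective reduction.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Point-to-point: rank 0 sends one integer to rank 1 (blocking mode). */
        if (size > 1) {
            int payload = 42;
            if (rank == 0)
                MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            else if (rank == 1) {
                MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                printf("rank 1 received %d\n", payload);
            }
        }

        /* Collective: every rank contributes its rank number; rank 0 gets the sum. */
        int sum = 0;
        MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum of ranks = %d\n", sum);

        MPI_Finalize();
        return 0;
    }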

26
Performance Analysis
  • Performance analysis process includes
  • Data collection
  • Data transformation
  • Data visualization

27
Data Collection Techniques
  • Profile
  • Record the amount of time spent in different parts of a program (a simple timing/counter sketch follows this list)
  • Counters
  • Record either frequencies of events or cumulative
    times
  • Event Traces
  • Record each occurrence of various specified events
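
A minimal C sketch of the first two techniques, independent of any particular tool (the phase names and the counted event are made up): wall-clock time is recorded per phase, and a counter accumulates event occurrences.

    #include <stdio.h>
    #include <time.h>

    /* Wall-clock time in seconds (POSIX monotonic clock). */
    static double now(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        long events = 0;   /* counter: one "event" per loop iteration here */
        double t0 = now();

        /* "Compute" phase */
        double s = 0.0;
        for (long i = 0; i < 10000000; i++) { s += i * 1e-9; events++; }
        double t1 = now();

        /* "I/O" phase */
        printf("result = %f\n", s);
        double t2 = now();

        /* Profile: time spent in the different parts of the program. */
        printf("compute: %.3f s   io: %.3f s   events: %ld\n", t1 - t0, t2 - t1, events);
        return 0;
    }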

28
Performance Analysis Tools
  • ParaGraph
  • A portable trace analysis and visualization package developed at Oak Ridge National Laboratory for MPI programs
  • Upshot
  • A trace analysis and visualization package developed at Argonne National Laboratory for MPI programs
  • SvPablo
  • Provides a variety of mechanisms for collecting,
    transforming, and visualizing data and is
    designed to be extensible, so that the programmer
    can incorporate new data formats, data collection
    mechanisms, data reduction modules and displays

29
Load Balance
  • Load Balance
  • Static load balancing
  • The task and data distribution are determined at compile time
  • Not optimal, because application behavior is data dependent
  • Dynamic load balancing
  • Work is assigned to nodes at runtime (a scheduling sketch follows this list)
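
The static/dynamic distinction above concerns distributing work across nodes at compile time vs. at runtime. The same idea can be illustrated within a single node by OpenMP loop scheduling (a C sketch, not the talk's own mechanism; compile with an OpenMP flag such as -fopenmp): schedule(static) fixes the iteration distribution up front, while schedule(dynamic) hands out chunks as threads become idle, absorbing data-dependent costs.

    #include <stdio.h>
    #include <omp.h>

    /* Simulated task whose cost depends on its data: later tasks are heavier. */
    static double work(int i) {
        double s = 0.0;
        for (int k = 0; k < i; k++) s += k * 1e-9;
        return s;
    }

    int main(void) {
        const int n = 20000;
        double total = 0.0;

        double t0 = omp_get_wtime();
        /* Static: iterations are divided among threads before the loop runs,
           so the thread holding the heaviest contiguous block is the bottleneck. */
        #pragma omp parallel for schedule(static) reduction(+:total)
        for (int i = 0; i < n; i++) total += work(i);
        double t1 = omp_get_wtime();

        /* Dynamic: each idle thread grabs the next chunk at runtime,
           which balances the uneven costs automatically. */
        #pragma omp parallel for schedule(dynamic, 64) reduction(+:total)
        for (int i = 0; i < n; i++) total += work(i);
        double t2 = omp_get_wtime();

        printf("static: %.3f s   dynamic: %.3f s   (total=%g)\n",
               t1 - t0, t2 - t1, total);
        return 0;
    }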

30
Load balance for heterogeneous tasks
  • Load balance for heterogeneous tasks is difficult
  • Different tasks have different costs
  • Data dependencies between tasks can be very
    complex
  • Consider data dependencies when doing load
    balancing

31
General load balance architecture (research from Carnegie Mellon Univ.)
  • Used for dynamic load balancing and applied to heterogeneous applications

32
General load balance architecture (continued)
  • Global load balancer
  • Includes a set of simple load balancing
    strategies for each of the task types
  • Manages the interaction between the different
    task types and their load balancers.

33
Tasks with different dependency types
34
Explanation of the general load balancer architecture
  • Task scheduler
  • Collects status information from the nodes and
    issues task migration instructions based on this
    information
  • Task scheduler supports three load balancing
    policies for homogeneous tasks

35
Why real-time application monitoring is important
  • To gain high performance, a distributed and parallel application needs
  • Acquisition and use of substantial amounts of information about programs, about the systems on which they are running, and about specific program runs
  • This information is difficult to predict accurately prior to a program's execution
  • Ex. Experimentation must be conducted to determine the performance effects of a program's load on processors and communication links, or of a program's usage of certain operating system facilities

36
PRAGMA: An Infrastructure for Runtime Management of Grid Applications (U of A)
  • The overall goal of Pragma
  • Realize a next-generation adaptive runtime
    infrastructure capable of
  • Reactively and proactively managing and
    optimizing application execution
  • Gather current system and application state,
    system behavior and application performance in
    real time
  • Network control based on agent technology

37
Key challenges addressed by Pragma
  • Formulation of predictive performance functions
  • Mechanisms for application state monitoring and characterization
  • Design and deployment of an active control
    network combining application sensors and
    actuators

38
Performance Function
  • Performance functions hierarchically combine analytical, experimental and empirical performance models
  • A performance function is used along with current system/network state information to predict the application performance

39
Identifying Performance Function
  • 1. Identify the attributes that can accurately
    express and quantify the operation and
    performance of a resource
  • 2. Use experimental and analytical techniques to
    obtain the performance function
  • 3. Compose the component performance functions to generate an overall performance function

40
Performance function example
  • A performance function models and analyzes a simple network system
  • Two computers (PC1 and PC2) connected through an Ethernet switch
  • PC1 performs a matrix multiplication and sends the result to PC2 through the switch
  • The same for PC2
  • We want to find the performance function to analyze the response time (delay) of the whole application

41
Performance function example (continued)
  • Attribute: data size
  • The performance function determines the application response time with respect to this attribute
  • Measure the task processing time in terms of data size and feed it to a neural network (a simplified curve-fitting sketch follows this list)
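
The slides fit the measured (data size, processing time) pairs with a neural network; as a simpler stand-in, this C sketch fits a straight line T(D) = a*D + b by ordinary least squares to a few made-up measurements (the numbers are illustrative only, not from the experiments):

    #include <stdio.h>

    int main(void) {
        /* Illustrative measurements: data size D (KB) and response time T (ms). */
        double D[] = {100, 200, 400, 800, 1600};
        double T[] = {12.1, 22.8, 44.5, 87.9, 176.2};
        int n = 5;

        /* Ordinary least squares for T = a*D + b. */
        double sd = 0, st = 0, sdd = 0, sdt = 0;
        for (int i = 0; i < n; i++) {
            sd += D[i]; st += T[i]; sdd += D[i] * D[i]; sdt += D[i] * T[i];
        }
        double a = (n * sdt - sd * st) / (n * sdd - sd * sd);
        double b = (st - a * sd) / n;

        printf("performance function: T(D) = %.4f * D + %.2f\n", a, b);
        printf("predicted T at D=1000: %.1f ms\n", a * 1000 + b);
        return 0;
    }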

42
Performance function example (continued)
  • a_j, b_j, c_j, d_i are constants and D is the data size

43
Pragma components
  • System characterization and abstraction component
  • Abstracts the current state of the underlying computational environment and predicts its behavior
  • Application characterization component
  • Abstracts the AMR application in terms of its communication and computational requirements

44
Pragma components (continued)
  • Active network control
  • Sensor
  • Actuator
  • Management/policy agents of adaptive runtime
    control
  • Policy base
  • A programmable database of adaptation policies used by agents to drive the overall adaptation process

45
Adaptive Mesh Refinement Basics
  • Concentrating computational effort on appropriate regions
  • Tracking regions in the domain that require additional resolution by overlaying finer grids over these regions
  • Refinement proceeds recursively

46
AMR Basics (continued)
47
System Characterization and Abstraction
  • Objective
  • Monitor, abstract and characterize the current
    state of the underlying computational environment
  • Use this information to drive the predictive
    performance functions and models that can
    estimate its performance in the near future

48
Block diagram of the system model
49
Agent-based runtime adaptation
  • The underlying mechanism for adaptive run-time management of grid applications is realized by an active control network of sensors, actuators and management agents

50
Agent-based runtime management architecture
51
Sensors and actuators for active adaptation
  • Sensors and actuators embedded within the
    application and/or system software
  • Define the adaptation interface and implement the
    mechanics of adaptation
  • Sensors and actuators can be deployed directly within the application's computational data structures

52
Adaptation Policy knowledge-base
  • Adaptation policy base maintains policies used by
    the management agents to drive decision-making
    during runtime application management and
    adaptation
  • Knowledge base is programmable
  • The knowledge base provides an interface for agents to formulate partial queries and perform fuzzy reasoning

53
Results of using Pragma architecture
  • Our experiments are on RM3D (Richtmyer-Meshkov
    CFD kernel)

54
PF results
  • We generate performance functions on two different platforms
  • IBM SP (Seaborg), located at the National Energy Research Scientific Computing Center
  • Linux Beowulf cluster (Discover), located at Rutgers, the State University of New Jersey

55
PF on IBM SP
  • We obtain two PFs
  • PF for small loads (< 30,000 work units)
  • PF for high loads (> 30,000 work units)

56
IBM SP PF coefficient
57
PF on Linux Beowulf
  • A single PF is generated on the Linux Beowulf cluster
  • Coefficients:

58
Accuracy of PF on IBM SP
59
Accuracy of PF on Linux Beowulf Cluster
60
Execution Time Gain (Beowulf)
  • Self-optimizing performance gain for a 4-processor cluster

61
Execution Time Gain (continued)
  • Self-optimizing performance gain for an 8-processor cluster with problem size 64x64x32

62
Other University Work Example
  • Load balancing across different nodes needs to be exploited more
  • According to research at Northwestern Univ., perfect load balancing might not be optimal (Multilevel Spectral Bisection application), as the table below shows

# of procs   Exec. Time (Balanced)   Exec. Time (Imbalanced, % imbalance)
16           2808.28 ms              2653.07 ms (5.2%)
32           1501.97 ms              1491.57 ms (7.4%)
64           854.06 ms               854.843 ms (10.6%)
63
Lessons from this research
  • Allowing some load imbalance can provide significant reductions in parallel execution time
  • As the number of processors increases for a fixed-size problem, the amount of load imbalance that can be exploited generally increases

64
Future Work
  • Generate context-aware, self-configuring, self-adapting and self-optimizing components
  • Provide a vGrid infrastructure that offers autonomic middleware services

65
Questions?