Title: Real-Time Load Balancing of Parallel Applications
1. Real-Time Load Balancing of Parallel Applications
2. Agenda
- Introduction
- Parallel paradigms
- Performance analysis
- Real-time load balancing project
- Other research work examples
- Future work
3. What is Parallel Computing?
- Using more than one computer at the same time to solve a problem, or using a computer that has more than one processor working simultaneously (a parallel computer).
- The same program can be run on different machines at the same time (SPMD).
- Different programs can be run on different machines at the same time (MPMD).
4. Why is it interesting?
- Uses computer capability efficiently
- Solves problems that would take a single-CPU machine months or years to solve
- Provides redundancy for certain applications
5. Continued
- Limits of single-CPU computing
- Available memory
- Performance
- Parallel computing allows us to
- Solve problems that don't fit in a single CPU's memory space
- Solve problems that can't be solved in a reasonable time
- We can run
- Larger problems
- Faster
6. One Application Example
- Weather Modeling and Forecasting
- Consider a region of 3000 x 3000 miles with a height of 11 miles.
- For modeling, partition it into segments of 0.1 x 0.1 x 0.1 cubic miles, giving roughly 10^11 segments.
- Take a 2-day period in which the parameters must be computed every 30 minutes, and assume each segment update takes 100 instructions. A single update of the whole domain then takes 10^13 instructions, and the two days require about 10^15 instructions in total. A serial computer executing 10^9 instructions/sec needs roughly 280 hours to predict the next 48 hours!
- Now take 1000 processors, each capable of 10^8 instructions/sec. Each processor handles 10^8 segments and executes about 10^12 instructions over the 2 days, so the calculation is done in about 3 hours! (The arithmetic is written out below.)
- Currently all major weather forecast centers (US, Europe, Asia) have supercomputers with 1000s of processors.
7. Some Other Applications
- Database queries
- Simulation of supernova explosions
- Fluid dynamics calculations
- Cosmic microwave background data analysis
- Ocean modeling
- Genetics research
8. Types of Parallelism: Two Extremes
- Data parallel
- Each processor performs the same task on different data
- Example: grid problems
- Task parallel
- Each processor performs a different task
- Example: signal processing
- Most applications fall somewhere on the continuum between these two extremes
9. Basics: Data Parallelism
- Data parallelism exploits the concurrency that derives from applying the same operation to multiple elements of a data structure
- Ex: add 2 to all elements of an array (see the sketch below)
- Ex: increase the salary of all employees with 5 years of service
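- A minimal sketch of the array example in C with OpenMP (the language, array name, and size are assumptions of this example, not the slide's): every thread applies the same "add 2" operation to its own chunk of the data.

    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static double a[N];               /* the data structure being updated */

        /* Data parallelism: the same operation (add 2) is applied to every
           element, with the iterations divided among the threads. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            a[i] += 2.0;
        }

        printf("a[0] = %.1f, a[N-1] = %.1f\n", a[0], a[N - 1]);
        return 0;
    }

- Compile with an OpenMP flag (e.g. gcc -fopenmp); without it the pragma is ignored and the loop simply runs serially.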
10. Typical Task Parallel Application
- N tasks: if they do not overlap, they can be run on N processors
[Figure: an application decomposed into Task 1, Task 2, ..., Task n]
11. Limits of Parallel Computing
- Theoretical upper limits
- Amdahl's Law
- Practical limits
- Load balancing
- Non-computational sections
- Other considerations
- Sometimes the code needs to be rewritten
12. Amdahl's Law
- Amdahl's Law places a strict limit on the speedup that can be realized by using multiple processors.
- Effect of multiple processors on run time (first formula below)
- Effect of multiple processors on speedup (second formula below)
- Where
- fs = serial fraction of code
- fp = parallel fraction of code
- N = number of processors
- tN = time to run on N processors
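- The two relationships named above, written out (the original slide showed them as equations; this is the standard form of Amdahl's Law, with t_1 denoting the single-processor run time and f_s + f_p = 1):

    t_N = t_1 \left( f_s + \frac{f_p}{N} \right), \qquad
    S(N) = \frac{t_1}{t_N} = \frac{1}{f_s + f_p / N} \le \frac{1}{f_s}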
13. Practical Limits: Amdahl's Law vs. Reality
- Amdahl's Law provides a theoretical upper limit on parallel speedup, assuming that there are no costs for communication. In reality, communication results in a further degradation of performance.
14. Practical Limits: Amdahl's Law vs. Reality
- In reality, the speedup predicted by Amdahl's Law is further limited by many things
- Communications
- I/O
- Load balancing
- Scheduling (shared processors or memory)
15. Other Considerations
- Writing effective parallel applications is difficult!
- Load balance is important
- Communication can limit parallel efficiency
- Serial time can dominate
- Is it worth your time to rewrite your application?
- Do the CPU requirements justify parallelization?
- Will the code be used just once?
16. Sources of Parallel Overhead
- Interprocessor communication: the time to transfer data between processors is usually the most significant source of parallel processing overhead.
- Load imbalance: in some parallel applications it is impossible to distribute the subtask workload equally across processors, so at some point all but one processor may be done and waiting for the last processor to complete.
- Extra computation: sometimes the best sequential algorithm is not easily parallelizable, and one is forced to use a parallel algorithm based on a poorer but easily parallelizable sequential algorithm. Sometimes repetitive work is done on each of the N processors instead of using send/recv, which leads to extra computation.
17. Parallel Program Performance Touchstone
- Execution time is the principal measure of performance
18. Programming Parallel Computers
- Programming single-processor systems is (relatively) easy because there is a single thread of execution and a single address space
- Programming shared memory systems can benefit from the single address space
- Programming distributed memory systems is the most difficult, due to multiple address spaces and the need to access remote data
- Both kinds of parallel system (shared memory and distributed memory) offer the ability to perform independent operations on different data (MIMD) and implement task parallelism
- Both can be programmed in a data parallel, SPMD fashion
19. Single Program, Multiple Data (SPMD)
- SPMD is the dominant programming model for shared and distributed memory machines
- One source code is written
- The code can have conditional execution based on which processor is executing the copy (see the sketch below)
- All copies of the code are started simultaneously and communicate and synchronize with each other periodically
- MPMD is more general, and possible in hardware, but no system/programming software enables it
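- A minimal SPMD sketch in C with MPI (using MPI here is this example's assumption; the slide does not prescribe a library): every process runs the same source, and the rank decides which branch each copy executes.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which copy am I? */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many copies are running? */

        if (rank == 0) {
            /* conditional execution: only the copy on rank 0 takes this branch */
            printf("rank 0: coordinating %d processes\n", size);
        } else {
            printf("rank %d: working on my share of the data\n", rank);
        }

        MPI_Barrier(MPI_COMM_WORLD);            /* copies synchronize periodically */

        MPI_Finalize();
        return 0;
    }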
20. Shared Memory vs. Distributed Memory
- Tools can be developed to make one kind of system appear to be a different kind of system
- Distributed memory systems can be programmed as if they had shared memory, and vice versa
- Such tools do not produce the most efficient code, but they can enable portability
- HOWEVER, the most natural way to program any machine is to use tools and languages that express the algorithm explicitly for the architecture
21. Shared Memory Programming: OpenMP
- Shared memory systems have a single address space
- Applications can be developed in which loop iterations (with no dependencies) are executed by different processors
- Shared memory codes are mostly data parallel, SPMD kinds of codes
- OpenMP is the standard for shared memory programming (compiler directives)
- Vendors also offer native compiler directives
22. Accessing Shared Variables
- If multiple processors want to write to a shared variable at the same time, there may be conflicts
- Processors 1 and 2 each:
- read X
- compute X+1
- write X
- The programmer, language, and/or architecture must provide ways of resolving conflicts (one resolution is sketched below)
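- A minimal C/OpenMP sketch of this conflict and one common resolution (the variable name and loop count are this example's own): without protection, the read/compute/write steps of different threads interleave and updates are lost; an atomic update makes the sequence indivisible.

    #include <stdio.h>

    int main(void) {
        int x = 0;

        #pragma omp parallel for
        for (int i = 0; i < 100000; i++) {
            /* Each thread reads x, computes x+1, and writes x back.
               The atomic directive resolves the conflict by making the
               update indivisible; remove it and updates are lost. */
            #pragma omp atomic
            x += 1;
        }

        printf("x = %d (expected 100000)\n", x);
        return 0;
    }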
23. OpenMP Example: Parallel Loop

    !$OMP PARALLEL DO
    do i = 1, 128
       b(i) = a(i) + c(i)
    end do
    !$OMP END PARALLEL DO

- The first directive specifies that the loop immediately following should be executed in parallel. The second directive marks the end of the parallel section (it is optional).
- For codes that spend the majority of their time executing the content of simple loops, the PARALLEL DO directive can result in significant parallel performance gains.
24. MPI Basics
- What is MPI?
- A message-passing library specification
- An extended message-passing model
- Not a language or compiler specification
- Not a specific implementation or product
- Designed to permit the development of parallel software libraries
- Designed to provide access to advanced parallel hardware for
- End users
- Library writers
- Tool developers
25. Features of MPI
- General
- Communicators combine context and group for message security
- Thread safety
- Point-to-point communication (a blocking send/receive sketch follows this list)
- Structured buffers and derived datatypes; heterogeneity
- Modes: normal (blocking and non-blocking), synchronous, ready (to allow access to fast protocols), buffered
- Collective
- Both built-in and user-defined collective operations
- A large number of data movement routines
- Subgroups defined directly or by topology
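- A minimal point-to-point sketch in C using the normal (blocking) mode named above; only standard MPI calls are used, and the message contents are this example's own.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            /* blocking send: returns once the buffer can safely be reused */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* blocking receive: waits for the matching message from rank 0 */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d from rank 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }

- Run with at least two processes, e.g. mpirun -np 2 ./a.out.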
26. Performance Analysis
- The performance analysis process includes
- Data collection
- Data transformation
- Data visualization
27. Data Collection Techniques
- Profiles
- Record the amount of time spent in different parts of a program (see the sketch after this list)
- Counters
- Record either frequencies of events or cumulative times
- Event traces
- Record each occurrence of various specified events
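- A minimal sketch of profile- and counter-style collection in C, using MPI_Wtime as the timer (the instrumented routine and the variable names are this example's own):

    #include <stdio.h>
    #include <mpi.h>

    static double compute_time  = 0.0;   /* profile: cumulative time in compute() */
    static long   compute_calls = 0;     /* counter: number of calls to compute() */

    static void compute(void) {
        for (volatile int i = 0; i < 1000000; i++) ;   /* stand-in for real work */
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        for (int step = 0; step < 10; step++) {
            double t0 = MPI_Wtime();
            compute();
            compute_time  += MPI_Wtime() - t0;
            compute_calls += 1;
        }

        printf("compute(): %ld calls, %.6f s total\n", compute_calls, compute_time);
        MPI_Finalize();
        return 0;
    }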
28. Performance Analysis Tools
- ParaGraph
- A portable trace analysis and visualization package developed at Oak Ridge National Laboratory for MPI programs
- Upshot
- A trace analysis and visualization package developed at Argonne National Laboratory for MPI programs
- SvPablo
- Provides a variety of mechanisms for collecting, transforming, and visualizing data, and is designed to be extensible so that the programmer can incorporate new data formats, data collection mechanisms, data reduction modules, and displays
29. Load Balance
- Static load balancing
- The task and data distribution are determined at compile time
- Not optimal, because application behavior is data dependent
- Dynamic load balancing
- Work is assigned to nodes at runtime (a master-worker sketch follows below)
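- A minimal sketch of one common dynamic scheme in C with MPI, a master-worker task queue (the scheme, tags, and task function here are this example's own illustration, not the architecture discussed on the following slides): the master hands the next task to whichever worker finishes first, so less-loaded or faster nodes automatically receive more work.

    #include <stdio.h>
    #include <mpi.h>

    #define NTASKS   100
    #define TAG_WORK 1
    #define TAG_STOP 2

    static double do_task(int task) { return (double)task * task; }   /* stand-in work */

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                          /* master: owns the task queue */
            int next = 0, active = 0, dummy = -1;
            double result;
            MPI_Status st;

            for (int w = 1; w < size; w++) {      /* prime each worker with one task */
                if (next < NTASKS) {
                    MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                    next++; active++;
                } else {
                    MPI_Send(&dummy, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
                }
            }
            while (active > 0) {                  /* reassign work as results return */
                MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &st);
                if (next < NTASKS) {
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                    next++;
                } else {
                    MPI_Send(&dummy, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
                    active--;
                }
            }
        } else {                                  /* worker: pull tasks until told to stop */
            int task;
            MPI_Status st;
            while (1) {
                MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_STOP) break;
                double result = do_task(task);
                MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
            }
        }

        MPI_Finalize();
        return 0;
    }

- Run with at least two processes (one master plus one or more workers).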
30. Load balance for heterogeneous tasks
- Load balancing for heterogeneous tasks is difficult
- Different tasks have different costs
- Data dependencies between tasks can be very complex
- Data dependencies must be considered when doing load balancing
31. General load balance architecture (research from Carnegie Mellon Univ.)
- Used for dynamic load balancing and applied to heterogeneous applications
32. General load balance architecture (continued)
- Global load balancer
- Includes a set of simple load balancing strategies for each of the task types
- Manages the interaction between the different task types and their load balancers
33. Tasks with different dependency types
34. Explanation of the general load balancer architecture
- Task scheduler
- Collects status information from the nodes and issues task migration instructions based on this information
- The task scheduler supports three load balancing policies for homogeneous tasks
35. Why real-time application monitoring is important
- To achieve high performance, a distributed and parallel application needs
- Acquisition and use of substantial amounts of information about programs, about the systems on which they are running, and about specific program runs
- This information is difficult to predict accurately prior to a program's execution
- Ex: experimentation must be conducted to determine the performance effects of a program's load on processors and communication links, or of a program's usage of certain operating system facilities
36. PRAGMA: An Infrastructure for Runtime Management of Grid Applications (U of A)
- The overall goal of PRAGMA is to
- Realize a next-generation adaptive runtime infrastructure capable of
- Reactively and proactively managing and optimizing application execution
- Gathering current system and application state, system behavior, and application performance in real time
- Network control based on agent technology
37. Key challenges addressed by PRAGMA
- Formulation of predictive performance functions
- Mechanisms for application state monitoring and characterization
- Design and deployment of an active control network combining application sensors and actuators
38. Performance Function
- The performance function hierarchically combines analytical, experimental, and empirical performance models
- The performance function is used along with current system/network state information to predict the application performance
39. Identifying the Performance Function
- 1. Identify the attributes that can accurately express and quantify the operation and performance of a resource
- 2. Use experimental and analytical techniques to obtain the component performance functions
- 3. Compose the component performance functions to generate an overall performance function
40. Performance function example
- A performance function is used to model and analyze a simple network system
- Two computers (PC1 and PC2) connected through an Ethernet switch
- PC1 performs a matrix multiplication and sends the result to PC2 through the switch
- PC2 does the same
- We want to find the performance function that describes the response time (delay) of the whole application
41. Performance function example (continued)
- Attribute
- Data size
- The performance function determines the application response time with respect to this attribute
- Measure the task processing time in terms of data size and feed it to a neural network (a measurement sketch follows below)
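- A minimal sketch in C of the measurement step only (the matrix-multiply kernel, the sizes swept, and the output format are this example's assumptions; the neural-network fitting itself is not shown): time the task at several data sizes and record (size, time) pairs for training.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* naive n x n matrix multiplication standing in for the task on PC1 */
    static void matmul(int n, const double *a, const double *b, double *c) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double s = 0.0;
                for (int k = 0; k < n; k++)
                    s += a[i * n + k] * b[k * n + j];
                c[i * n + j] = s;
            }
    }

    int main(void) {
        for (int n = 100; n <= 500; n += 100) {            /* sweep the data size */
            double *a = malloc(sizeof(double) * n * n);
            double *b = malloc(sizeof(double) * n * n);
            double *c = malloc(sizeof(double) * n * n);
            for (int i = 0; i < n * n; i++) { a[i] = 1.0; b[i] = 2.0; }

            clock_t t0 = clock();
            matmul(n, a, b, c);
            double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

            printf("%d %f\n", n, secs);                    /* (data size, time) pair */
            free(a); free(b); free(c);
        }
        return 0;
    }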
42. Performance function example (continued)
- a_j, b_j, c_j, and d_i are constants, and D is the data size
43. PRAGMA components
- System characterization and abstraction component
- Abstracts the current state of the underlying computational environment and predicts its behavior
- Application characterization component
- Abstracts the adaptive mesh refinement (AMR) application in terms of its communication and computational requirements
44. PRAGMA components (continued)
- Active network control
- Sensors
- Actuators
- Management/policy agents for adaptive runtime control
- Policy base
- A programmable database of adaptation policies used by agents to drive the overall adaptation process
45. Adaptive Mesh Refinement (AMR) Basics
- Concentrates computational effort on the appropriate regions
- Tracks regions in the domain that require additional resolution by overlaying finer grids over these regions
- Refinement proceeds recursively
46. AMR Basics (continued)
47. System Characterization and Abstraction
- Objective
- Monitor, abstract, and characterize the current state of the underlying computational environment
- Use this information to drive the predictive performance functions and models that can estimate its performance in the near future
48. Block diagram of the system model
49. Agent-based runtime adaptation
- The underlying mechanism for adaptive runtime management of grid applications is realized by an active control network of sensors, actuators, and management agents
50. Agent-based runtime management architecture
51. Sensors and actuators for active adaptation
- Sensors and actuators are embedded within the application and/or system software
- They define the adaptation interface and implement the mechanics of adaptation
- Sensors and actuators can be deployed directly with the application's computational data structures
52. Adaptation policy knowledge base
- The adaptation policy base maintains policies used by the management agents to drive decision-making during runtime application management and adaptation
- The knowledge base is programmable
- The knowledge base provides an interface for agents to formulate partial queries and perform fuzzy reasoning
53. Results of using the PRAGMA architecture
- Our experiments are on RM3D (a Richtmyer-Meshkov CFD kernel)
54. PF results
- We generate performance functions (PFs) on two different platforms:
- IBM SP Seaborg, located at the National Energy Research Scientific Computing Center
- Linux Beowulf cluster Discover, located at the State University of New Jersey
55. PF on the IBM SP
- We obtain two PFs
- A PF for small loads (< 30,000 work units)
- A PF for high loads (> 30,000 work units)
56. IBM SP PF coefficients
57. PF on the Linux Beowulf cluster
- A single PF is generated on the Linux Beowulf cluster
- Coefficients
58. Accuracy of the PF on the IBM SP
59. Accuracy of the PF on the Linux Beowulf cluster
60. Execution Time Gain (Beowulf)
- Self-optimizing performance gain for a 4-processor cluster
61. Execution Time Gain (continued)
- Self-optimizing performance gain for an 8-processor cluster with problem size 64x64x32
62. Other Univ. work example
- Load balancing across different nodes needs to be explored further
- According to research at Northwestern Univ., perfect load balancing might not be optimal (Multilevel Spectral Bisection application)
# of procs    Exec. Time (Balanced)    Exec. Time (Imbalanced, imbalance in %)
16            2808.28 ms               2653.07 ms (5.2%)
32            1501.97 ms               1491.57 ms (7.4%)
64            854.06 ms                854.843 ms (10.6%)
63. Lessons from this research
- Allowing some load imbalance can provide significant reductions in parallel execution time
- As the number of processors increases for a fixed-size problem, the amount of load imbalance that can be exploited generally increases
64. Future Work
- Generate context-aware, self-configuring, self-adapting, and self-optimizing components
- Provide a vGrid infrastructure that offers autonomic middleware services
65. Questions?