Title: ROSS: Parallel Discrete-Event Simulations on Near Petascale Supercomputers
- Christopher D. Carothers
- Department of Computer Science
- Rensselaer Polytechnic Institute
- chrisc_at_cs.rpi.edu
- Sponsors: NSF CAREER, NeTS, PetaApps, ANL/ALCF
- Why Parallel Discrete-Event Simulation (DES)?
- Large-scale systems are difficult to understand
- Analytical models are often constrained
- Parallel DES offers
- Dramatically shorter model execution time
- Prediction of future "what-if" system performance
- Potential for real-time decision support
- Minutes instead of days
- Analysis can be done right away
- Example models: national air space (NAS), ISP backbone(s), distributed content caches, next-generation supercomputer systems.
3. Ex: Movies over the Internet
- Suppose we want to model 1 million home ISP customers downloading a 2 GB movie
- How long to compute?
- Assume a nominal 100K ev/sec sequential simulator
- Assume on average each packet takes 8 hops
- 2 GB movies yield 2 trillion 1K data packets.
- @ 8 hops yields 16 trillion events
- 16 trillion events @ 100K ev/sec
- Over 1,900 days!!! Or
- 5 years!!!
- Need massively parallel simulation to make
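The arithmetic on this slide can be checked with a few lines of Python. This is only a back-of-envelope sketch: it assumes binary units (2 GB = 2^31 bytes, 1K = 2^10 bytes), which is one reading that reproduces the "over 1,900 days" figure, and all constant names are illustrative.

```python
# Back-of-envelope estimate for the movie-download example.
# Assumption: binary units (2 GB = 2**31 bytes, 1K packet = 2**10 bytes).
CUSTOMERS = 1_000_000
MOVIE_BYTES = 2 * 2**30
PACKET_BYTES = 2**10
HOPS_PER_PACKET = 8
SEQ_EVENTS_PER_SEC = 100_000   # nominal sequential simulator rate

packets = CUSTOMERS * MOVIE_BYTES // PACKET_BYTES   # about 2 trillion
events = packets * HOPS_PER_PACKET                  # about 16 trillion
seconds = events / SEQ_EVENTS_PER_SEC
days = seconds / 86_400

print(f"packets={packets:.3e} events={events:.3e} "
      f"days={days:,.0f} years={days / 365:.1f}")
```

With decimal units instead, the same calculation gives roughly 1,850 days, so the conclusion (about five years of sequential run time) does not depend on the choice.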
- Intro to DES
- Time Warp and other PDES Schemes
- Reverse Computation
- Blue Gene/L & /P
- Implementation
- Performance Results
- Observations on PDES Performance
- Future Directions
5. Discrete Event Simulation (DES)
- Discrete event simulation: a computer model for a system where changes in the state of the system occur at discrete points in simulation time.
- Fundamental concepts
- system state (state variables)
- state transitions (events)
- A DES computation can be viewed as a sequence of event computations, with each event computation assigned a (simulation time) time stamp
- Each event computation can
- modify state variables
- schedule new events
6. DES Computation
Example: air traffic at an airport. Events: aircraft arrival, landing, departure.
[Figure: events on a simulation time axis: arrival 8:00, landed 8:05, departure 9:15, arrival 9:30]
- Unprocessed events are stored in a pending list
- Events are processed in time stamp order
7. Discrete Event Simulation System
[Figure labels: model of the physical system; independent of the simulation application]
8. Event-Oriented World View
[Figure: a simulation application made up of state variables (Integer InTheAir, Integer OnTheGround, Boolean RunwayFree) and event handler procedures (Arrival Event, Landed Event, Departure Event)]
9. Ex: Air traffic at an Airport
- Model aircraft arrivals and departures, arrival queueing
- Single runway model; ignores departure queueing
- R: time runway is used for each landing aircraft (const)
- G: time required on the ground before departing (const)
- State Variables
- Now: current simulation time
- InTheAir: number of aircraft landing or waiting to land
- OnTheGround: number of landed aircraft
- RunwayFree: Boolean, true if runway available
- Model Events
- Arrival: denotes aircraft arriving in air space of airport
- Landed: denotes aircraft landing
- Departure: denotes aircraft leaving
10. Arrival Events
New aircraft arrives at airport. If the runway is free, it will begin to land. Otherwise, the aircraft must circle, and wait to land.
- R: time runway is used for each landing aircraft
- G: time required on the ground before departing
- Now: current simulation time
- InTheAir: number of aircraft landing or waiting to land
- OnTheGround: number of landed aircraft
- RunwayFree: Boolean, true if runway available
- Arrival Event
- InTheAir := InTheAir + 1
- If (RunwayFree)
- RunwayFree := FALSE
- Schedule Landed event @ Now + R
11. Landed Event
An aircraft has completed its landing.
- R: time runway is used for each landing aircraft
- G: time required on the ground before departing
- Now: current simulation time
- InTheAir: number of aircraft landing or waiting to land
- OnTheGround: number of landed aircraft
- RunwayFree: Boolean, true if runway available
- Landed Event
- InTheAir := InTheAir - 1
- OnTheGround := OnTheGround + 1
- Schedule Departure event @ Now + G
- If (InTheAir > 0)
- Schedule Landed event @ Now + R
- Else
- RunwayFree := TRUE
12. Departure Event
An aircraft now on the ground departs for a new
- R: time runway is used for each landing aircraft
- G: time required on the ground before departing
- Now: current simulation time
- InTheAir: number of aircraft landing or waiting to land
- OnTheGround: number of landed aircraft
- RunwayFree: Boolean, true if runway available
- Departure Event
- OnTheGround := OnTheGround - 1
13. Execution Example
- DES computation is a sequence of event computations
- Modify state variables
- Schedule new events
- DES System = model + simulation executive
- Data structures
- Pending event list to hold unprocessed events
- State variables
- Simulation time clock variable
- Program (Code)
- Main event processing loop
- Event procedures
- Events processed in time stamp order
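The airport model from the preceding slides can be written as a small sequential simulator. The following is a minimal sketch, not ROSS code: the values of R and G, the function name `run_airport`, and the use of Python's `heapq` as the pending event list are all assumptions made for illustration.

```python
import heapq

R = 3   # assumed runway time per landing (the slides give no value)
G = 4   # assumed ground time before departure (the slides give no value)

def run_airport(arrival_times):
    """Sequential DES of the single-runway airport model."""
    state = {"now": 0, "in_the_air": 0, "on_the_ground": 0, "runway_free": True}
    pending = []   # pending event list: (time stamp, sequence number, kind)
    seq = 0        # tie-breaker so equal time stamps pop in scheduling order
    trace = []

    def schedule(ts, kind):
        nonlocal seq
        heapq.heappush(pending, (ts, seq, kind))
        seq += 1

    for t in arrival_times:
        schedule(t, "arrival")

    while pending:                              # main event processing loop
        ts, _, kind = heapq.heappop(pending)    # smallest time stamp first
        state["now"] = ts
        if kind == "arrival":
            state["in_the_air"] += 1
            if state["runway_free"]:
                state["runway_free"] = False
                schedule(ts + R, "landed")
        elif kind == "landed":
            state["in_the_air"] -= 1
            state["on_the_ground"] += 1
            schedule(ts + G, "departure")
            if state["in_the_air"] > 0:
                schedule(ts + R, "landed")
            else:
                state["runway_free"] = True
        elif kind == "departure":
            state["on_the_ground"] -= 1
        trace.append((ts, kind))
    return trace, state

trace, state = run_airport([1, 2])
print(trace)
```

The second aircraft arrives while the runway is busy, so its Landed event is scheduled by the first aircraft's Landed handler rather than by its own Arrival handler.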
16. How to Synchronize Parallel Simulations?
[Figure legend: processed event, straggler event]
17. Massively Parallel Discrete-Event Simulation Via Time Warp
- Local Control Mechanism: error detection and rollback
- (1) undo state Δ's
- (2) cancel sent events
- Global Control Mechanism: compute Global Virtual Time (GVT)
- collect versions of state / events and perform I/O operations that are < GVT
[Figure: two panels showing events for LP 1, LP 2 and LP 3 against virtual time. Legend: unprocessed event, processed event, straggler event, committed event]
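The local and global control mechanisms can be illustrated with a toy logical process. This is a simplified sketch under stated assumptions: it undoes state by restoring saved copies, it does not model cancellation of sent events (anti-messages), and the class name `TimeWarpLP` and its counter state are hypothetical.

```python
import copy

class TimeWarpLP:
    """Toy logical process: optimistic execution with state-saving rollback."""

    def __init__(self):
        self.state = {"count": 0}   # hypothetical model state
        self.lvt = 0.0              # local virtual time
        self.processed = []         # (time stamp, state saved before the event)
        self.rollbacks = 0

    def _execute(self, ts):
        self.processed.append((ts, copy.deepcopy(self.state)))
        self.state["count"] += 1    # stand-in for a real event handler
        self.lvt = ts

    def receive(self, ts):
        """Process an event, rolling back first if it is a straggler."""
        redo = []
        if ts < self.lvt:                       # error detection
            self.rollbacks += 1
            while self.processed and self.processed[-1][0] > ts:
                old_ts, saved = self.processed.pop()
                self.state = saved              # undo state changes
                redo.append(old_ts)             # must be re-executed
            self.lvt = self.processed[-1][0] if self.processed else 0.0
        for t in sorted([ts] + redo):
            self._execute(t)

    def fossil_collect(self, gvt):
        """Discard saved history older than GVT; those events are committed."""
        self.processed = [(t, s) for t, s in self.processed if t >= gvt]

lp = TimeWarpLP()
for ts in [1.0, 3.0, 5.0, 2.0]:    # 2.0 arrives late: a straggler
    lp.receive(ts)
print(lp.rollbacks, [t for t, _ in lp.processed])
```

The straggler at time 2.0 forces the events at 3.0 and 5.0 to be undone and re-executed, which is the overhead that the rest of the talk tries to reduce.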
18. Whew, Time Warp sounds expensive... are there other PDES Schemes?
- Non-rollback options
- Called Conservative because they disallow out-of-order event execution.
- Deadlock Avoidance
- NULL Message Algorithm
- Deadlock Detection and Recovery
19. Deadlock Avoidance Using Null Messages
Null Message Algorithm (executed by each LP)
Goal: Ensure events are processed in time stamp order and avoid deadlock
WHILE (simulation is not over)
    wait until each FIFO contains at least one message
    remove smallest time stamped event from its FIFO
    process that event
    send null messages to neighboring LPs with time stamp indicating a
        lower bound on future messages sent to that LP (current time
        plus minimum transit time between airports)
END-LOOP
- Variation: LP requests null message when FIFO becomes empty
- Fewer null messages
- Delay to get time stamp information
20. The Time Creep Problem
Five null messages to process a single event!
- Many null messages if minimum flight time is
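Time creep can be quantified with a deliberately simplified model: a single lower-bound guarantee circulates among blocked LPs, and each null message raises it by the lookahead. The function name and the numbers in the usage lines are illustrative assumptions, not values taken from the slide's figure.

```python
def null_messages_until_safe(event_time, start_time, lookahead):
    """Count null messages needed before the guaranteed lower bound on
    future time stamps reaches the time stamp of the pending event."""
    if lookahead <= 0:
        raise ValueError("zero lookahead: the bound never advances (livelock)")
    bound = start_time      # time stamp carried by the circulating null message
    sent = 0
    while bound < event_time:
        bound += lookahead  # each LP adds its lookahead and forwards
        sent += 1
    return sent

print(null_messages_until_safe(7.0, 5.0, 0.5))   # small lookahead
print(null_messages_until_safe(7.0, 5.0, 2.0))   # large lookahead
```

In this model the message count grows inversely with the lookahead, and a lookahead of zero never terminates, which is why the function rejects it.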
21. Livelock Can Occur!
Suppose the minimum delay between airports is
Livelock: un-ending cycle of null messages where no LP can advance its simulation time.
There cannot be a cycle where, for each LP in the cycle, an incoming message with time stamp T results in a new message sent to the next LP in the cycle with time stamp T (zero lookahead cycle).
- The null message algorithm relies on a prediction ability referred to as lookahead
- ORD at simulation time 5, minimum transit time between airports is 3, so the next message sent by ORD must have a time stamp of at least 8
- Lookahead is a constraint on an LP's behavior
- Link lookahead: if an LP is at simulation time T, and an outgoing link has lookahead Li, then any message sent on that link must have a time stamp of at least T + Li
- LP lookahead: if an LP is at simulation time T, and has a lookahead of L, then any message sent by that LP will have a time stamp of at least T + L
- Equivalent to link lookahead where the lookahead on each outgoing link is the same
23. Lookahead and the Simulation Model
- Lookahead is clearly dependent on the simulation model
- could be derived from physical constraints in the system being modeled, such as minimum simulation time for one entity to affect another (e.g., a weapon fired from a tank requires L units of time to reach another tank, or maximum speed of the tank places a lower bound on how soon it can affect another entity)
- could be derived from characteristics of the simulation entities, such as non-preemptable behavior (e.g., a tank is traveling north at 30 mph, and nothing in the federation model can cause its behavior to change over the next 10 minutes, so all output from the tank simulator can be generated immediately up to time local clock + 10 minutes)
- could be derived from tolerance to temporal inaccuracies (e.g., users cannot perceive a temporal difference of 100 milliseconds, so messages may be timestamped 100 milliseconds into the future)
- a simulation may be able to precompute when its next interaction with another simulation will be (e.g., if time until next interaction is stochastic, pre-sample the random number generator to determine the time of next interaction)
Lookahead changes as LP topology changes, which can have a profound impact on the performance of network models (wired or wireless).
24. Why is Lookahead Important?
Each LP A using logical time declares a lookahead value L_A; the time stamp of any event generated by the LP must be ≥ LT_A + L_A, where LT_A is the LP's current logical time.
- Lookahead is used in virtually all conservative synchronization protocols
- Essential to allow concurrent processing of events
Lookahead is necessary to allow concurrent processing of events with different time stamps (unless optimistic event processing is used)
25. Null Message Algorithm: Speed Up
- toroid topology
- message density: 4 per LP
- 1 millisecond computation per event
- vary time stamp increment distribution
- ILAR = lookahead / average time stamp increment
Conservative algorithms live or die by their
26. Deadlock Detection & Recovery
Algorithm A (executed by each LP)
Goal: Ensure events are processed in time stamp order
WHILE (simulation is not over)
    wait until each FIFO contains at least one message
    remove smallest time stamped event from its FIFO
    process that event
END-LOOP
- No null messages
- Allow simulation to execute until deadlock occurs
- Provide a mechanism to detect deadlock
- Provide a mechanism to recover from deadlocks
27. Deadlock Recovery
Deadlock recovery: identify safe events (events that can be processed w/o violating local
- Which events are safe?
- Time stamp 7: smallest time stamped event in system
- Time stamp 8, 9: safe because of lookahead constraint
- Time stamp 10: OK if events with the same time stamp can be processed in any order
- No lookahead creep!
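The rule for picking safe events after a deadlock can be sketched as a small function. The lookahead of 3 used in the example is an assumption (it matches the earlier ORD example and reproduces the 7, 8, 9, 10 pattern); the function name and the extra pending time stamps are hypothetical.

```python
def safe_events(pending, lookahead, allow_ties=True):
    """Time stamps that can be processed once deadlock has been detected.

    The smallest time stamp t_min is always safe. Processing it can only
    create events at t_min + lookahead or later, so anything strictly
    earlier than that bound is safe too.
    """
    t_min = min(pending)
    bound = t_min + lookahead
    return sorted(t for t in pending
                  if t == t_min or t < bound or (allow_ties and t == bound))

print(safe_events([7, 8, 9, 10, 11, 12], lookahead=3))
print(safe_events([7, 8, 9, 10, 11, 12], lookahead=3, allow_ties=False))
```

Setting `allow_ties=False` models the case where events with equal time stamps must be processed in a fixed order, so the event at the bound itself has to wait.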
28. Preventing LA Creep Using Next Event Time Info
- Observation: smallest time stamped event is safe to process
- Lookahead creep avoided by allowing the synchronization algorithm to immediately advance to (global) time of the next event
- Synchronization algorithm must know time stamp of LP's next event
- Each LP guarantees a logical time T such that if no additional events are delivered to LP with TS < T, all subsequent messages that LP produces have a time stamp at least T + L (L = lookahead)
29. No Free Lunch for PDES!
- Time Warp → State saving overheads
- Null message algorithm → Lookahead creep problem
- No zero lookahead cycles allowed
- Lookahead → Essential for concurrent processing of events for conservative algorithms
- Has large effect on performance → need to program it
- Deadlock Detection and Recovery → Smallest time stamp event safe to process
- Others may also be safe (requires additional work to determine this)
- Use time of next event to avoid lookahead creep, but hard to compute at scale
- Can we avoid some of these overheads and
31. Our Solution: Reverse Computation...
- Use Reverse Computation (RC)
- automatically generate reverse code from model source
- undo by executing reverse code
- Delivers better performance
- negligible overhead for forward computation
- significantly lower memory utilization
[Figure labels: Original Code, Modified Code, Reverse Code]
32. Ex: Simple Network Switch
on packet arrival...
if( qlen < B ) { qlen++; delays[qlen]++; } else lost++;
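A forward/reverse pair for the switch handler might look like the sketch below. It assumes the handler increments qlen and a delays histogram when there is room and a lost counter otherwise, and that a single bit b1 records which branch was taken; the class and method names are illustrative, not ROSS code.

```python
class Switch:
    def __init__(self, B):
        self.B = B                    # buffer capacity
        self.qlen = 0
        self.lost = 0
        self.delays = [0] * (B + 1)   # histogram indexed by queue length

    def forward(self):
        """Handle a packet arrival; return the 1-bit control state b1."""
        if self.qlen < self.B:
            self.qlen += 1
            self.delays[self.qlen] += 1
            return 1
        self.lost += 1
        return 0

    def reverse(self, b1):
        """Undo one arrival using only the saved bit, not a state copy."""
        if b1 == 1:
            self.delays[self.qlen] -= 1
            self.qlen -= 1
        else:
            self.lost -= 1

sw = Switch(B=2)
bits = [sw.forward() for _ in range(3)]   # third packet is dropped
for b in reversed(bits):                  # roll back in reverse order
    sw.reverse(b)
print(bits, sw.qlen, sw.lost, sw.delays)
```

Only one bit per event is kept for rollback, while a state-saving approach would have to copy qlen, lost and the whole delays array.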
33. Benefits of Reverse Computation
- State size reduction
- from B+2 words to 1 word
- e.g. B=100 => 100x reduction!
- Negligible overhead in forward computation
- removed from forward computation
- moved to rollback phase
- Result
- significant increase in speed
- significant decrease in memory
- How?...
34. Beneficial Application Properties
- 1. Majority of operations are constructive
- e.g., ++, --, etc.
- 2. Size of control state < size of data state
- e.g., size of b1 < size of qlen, sent, lost, etc.
- 3. Perfectly reversible high-level operations
- gleaned from irreversible smaller operations
- e.g., random number generation
35. Rules for Automation...
Generation rules, and upper-bounds on bit
requirements for various statement types
36. Destructive Assignment...
- Destructive assignment (DA)
- examples: x = y; x %= y;
- requires all modified bytes to be saved
- Caveat
- reversing technique for DAs can degenerate to traditional incremental state saving
- Good news
- certain collections of DAs are perfectly reversible!
- queueing network models contain collections of easily/perfectly reversible DAs
- queue handling (swap, shift, tree insert/delete, ...)
- statistics collection (increment, decrement, ...)
- random number generation (reversible RNGs)
37. Reversing an RNG?
    double RNGGenVal(Generator g)
    {
        long k, s;
        double u;
        u = 0.0;

        s = Cg[0][g];
        k = s / 46693;
        s = 45991 * (s - k * 46693) - k * 25884;
        if (s < 0) s = s + 2147483647;
        Cg[0][g] = s;
        u = u + 4.65661287524579692e-10 * s;

        s = Cg[1][g];
        k = s / 10339;
        s = 207707 * (s - k * 10339) - k * 870;
        if (s < 0) s = s + 2147483543;
        Cg[1][g] = s;
        u = u - 4.65661310075985993e-10 * s;
        if (u < 0) u = u + 1.0;

        s = Cg[2][g];
        k = s / 15499;
        s = 138556 * (s - k * 15499) - k * 3979;
        if (s < 0) s = s + 2147483423;
        Cg[2][g] = s;
        u = u + 4.65661336096842131e-10 * s;
        if (u >= 1.0) u = u - 1.0;

        s = Cg[3][g];
        k = s / 43218;
        s = 49689 * (s - k * 43218) - k * 24121;
        if (s < 0) s = s + 2147483323;
        Cg[3][g] = s;
        u = u - 4.65661357780891134e-10 * s;
        if (u < 0) u = u + 1.0;

        return u;
    }
Observation: k = s / 46693 is a Destructive Assignment.
Result: RC degrades to classic state-saving... can we do better?
38. RNGs: A Higher Level View
The previous RNG is based on the following recurrence:
$x_{i,n} = a_i x_{i,n-1} \bmod m_i$
where $x_{i,n}$ is one of the four seed values in the $n$th set, $m_i$ is one of the four largest primes less than $2^{31}$, and $a_i$ is a primitive root of $m_i$.
Now, the above recurrence is in fact reversible. The inverse of $a_i$ modulo $m_i$ is defined:
$b_i = a_i^{m_i - 2} \bmod m_i$
Using $b_i$, we can generate the reverse recurrence as follows:
$x_{i,n-1} = b_i x_{i,n} \bmod m_i$
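The reverse recurrence is easy to try out for one of the four streams. This sketch uses the multiplier 45991 and modulus 2147483647 that appear in the first stage of the generator on the previous slide; the function names and the seed are illustrative, and the combined four-stream generator is not reproduced here.

```python
M = 2147483647             # modulus of the first stream (a prime)
A = 45991                  # multiplier of the first stream
A_INV = pow(A, M - 2, M)   # modular inverse of A via Fermat's little theorem

def forward(x):
    """One step of x_n = A * x_(n-1) mod M."""
    return (A * x) % M

def reverse(x):
    """One step backwards: x_(n-1) = A_INV * x_n mod M."""
    return (A_INV * x) % M

seed = 12345               # any value in 1 .. M-1
x = seed
for _ in range(5):
    x = forward(x)
for _ in range(5):
    x = reverse(x)
print(x == seed)
```

No per-event state has to be saved to undo a draw: the previous seed is recomputed from the current one, at the cost of one modular multiplication.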
39. Reverse Code Efficiency...
- Property...
- Non-reversibility of individual steps does NOT imply that the computation as a whole is not reversible.
- Can we automatically find this higher-level reversibility?
- Other Reversible Structures Include...
- Circular shift operation
- Insertion & deletion operations on trees (i.e., priority queues).
Reverse computation is well-suited for small grain event models!
40. RC Applications
- PDES applications include
- Wireless telephone networks
- Distributed content caches
- Large-scale Internet models
- TCP over AT&T backbone
- Leverages RC swaps
- Hodgkin-Huxley neuron models
- Plasma physics models using PIC
- Non-DES include
- Debugging
- PISA: Reversible instruction set architecture for low power computing
- Quantum computing
42. Target Systems: Blue Gene/L & /P
- Configuration
- BG/L nodes: 2x700 MHz PPC cores
- BG/P nodes: 4x850 MHz PPC cores
- Dedicated compute and I/O nodes (32:1 or 8:1).
- 3-D torus P2P network
- Additional barrier, collective, I/O and ethernet networks
- Can partition system into dedicated slices from 32 nodes to whole systems
- Properties for GOOD scaling
- Balanced architecture between network(s) and processor speed
- Exclusive access to network and process
- Exceptionally low OS jitter
- Collective overheads not adversely impacted at large node counts
[Figure: 1 rack of IBM Blue Gene/L]
43. Blue Gene/L Layout
44. Blue Gene/L SoC
45. Blue Gene/L Network
46. Blue Gene/P Layout
47. Blue Gene/P Architectural Highlights
- Scaled performance via density and frequency increase
- 2x performance increase via doubling the processors per node.
- 1.2x from frequency increase 700 MHz → 850 MHz.
- Enhanced function
- 4-way SMP → 3 modes: SMP / DUAL / VNM
- L2, L3 changed for SMP mode
- DMA for torus, remote put-get, user prog. memory prefetch
- Enhanced 64-bit performance counters via PPC450 core
- Double Hummer FPU and networks are the same..except..
- Better Network
- 2.4x more bandwidth, lower latency Torus and Tree networks
- 10x higher Ethernet I/O bandwidth
- 72K nodes in 72 racks for 1 PF peak performance
- Low power via aggressive power management
48. Blue Gene L vs. P
49. Blue Gene/P Compute Card
51. Local Control Implementation
- MPI_Isend/MPI_Irecv used to send/recv off-core events
- Event & network memory is managed directly.
- Pool is allocated @ startup
- Event list kept sorted using a splay tree (log N)
- LP-2-Core mapping tables are computed and not stored to avoid the need for large global LP
[Figure: Local Control Mechanism: error detection and rollback, (1) undo state Δ's, (2) cancel sent events; LP 1, LP 2 and LP 3 shown against virtual time]
52. Global Control Implementation
- GVT (kicks off when memory is low)
- 1. Each core counts sent, recv
- 2. Recv all pending MPI msgs.
- 3. MPI_Allreduce Sum on (sent - recv)
- 4. If sent - recv != 0 goto 2
- 5. Compute local core's lower bound time-stamp (LVT).
- 6. GVT = MPI_Allreduce Min on LVTs
- Algorithm needs efficient MPI collectives
- LC/GC can be very sensitive to OS jitter
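The GVT procedure above (drain pending messages, sum sent minus received across cores, repeat until the sum is zero, then take the minimum LVT) can be mimicked in a single process. In this sketch plain Python `sum` and `min` stand in for the two MPI_Allreduce calls, and the data layout (`cores` as a list of dicts, `arrival_batches` as the messages that show up on each pass) is an assumption made for illustration.

```python
def compute_gvt(cores, arrival_batches):
    """cores: list of dicts with 'sent', 'recv' counters and 'pending',
    the time stamps of unprocessed local events.
    arrival_batches: one list of (dest_core, time stamp) per receive pass."""
    batches = iter(arrival_batches)
    # Stand-in for the Allreduce(SUM): non-zero means messages are in flight.
    while sum(c["sent"] - c["recv"] for c in cores) != 0:
        batch = next(batches, None)
        if batch is None:
            raise RuntimeError("messages still in flight but none arrived")
        for dest, ts in batch:               # receive all pending messages
            cores[dest]["pending"].append(ts)
            cores[dest]["recv"] += 1
    # Each core's lower bound time stamp (LVT), then the Allreduce(MIN).
    lvts = [min(c["pending"], default=float("inf")) for c in cores]
    return min(lvts)

cores = [
    {"sent": 2, "recv": 0, "pending": [10.0]},   # core 0 sent two events
    {"sent": 0, "recv": 0, "pending": [12.0]},   # core 1 has not seen them yet
]
gvt = compute_gvt(cores, [[(1, 4.0)], [(1, 7.5)]])
print(gvt)
```

Taking the minimum before the in-flight messages are accounted for would give 10.0 here, which is too high: the event with time stamp 4.0 would then fall below GVT after its history had already been discarded.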
[Figure: Global Control Mechanism: compute Global Virtual Time (GVT); collect versions of state / events and perform I/O operations that are < GVT; LP 1, LP 2 and LP 3 shown against virtual time]
So, how does this translate into Time Warp performance on BG/L & BG/P?
54. Performance Results: Setup
- Synthetic benchmark model
- 1024x1024 grid of LPs
- Each LP has 10 initial events
- Event routed randomly among all LPs based on a configurable percent remote parameter
- Time stamps are exponentially distributed with a mean of 1.0 (i.e., lookahead is 0).
- TLM: Transmission Line Matrix
- Discrete electromagnetic propagation wave model
- Used to model the physical layer of MANETs
- As accurate as previous ray tracing models, but dramatically faster
- Considers wave attenuation effects
- Event population grows cubically outward from the single radio source.
- ROSS parameters
- GVT_Interval: number of times thru scheduler loop before computing GVT.
- Batch: number of local events to process before checking network for new events.
- Batch × GVT_Interval events processed per GVT epoch
- KPs: kernel processes that hold the aggregated processed event lists for LPs to lower search overheads for fossil collection of old events.
- Send/Recv Buffers: number of network events for sending or recving. Used as a flow control
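The synthetic benchmark described above can be sketched as a sequential simulation. This is an illustrative model rather than the ROSS implementation: the function name `phold`, its default sizes, and the simplification that a "remote" event goes to any LP chosen uniformly (possibly the sender itself) are assumptions.

```python
import heapq
import random

def phold(num_lps=16, events_per_lp=10, percent_remote=0.1,
          end_time=5.0, seed=1):
    """Sequential sketch of a PHOLD-style synthetic workload."""
    rng = random.Random(seed)
    pending = []                        # (time stamp, destination LP)
    for lp in range(num_lps):
        for _ in range(events_per_lp):
            heapq.heappush(pending, (rng.expovariate(1.0), lp))

    processed = remote = 0
    while pending and pending[0][0] <= end_time:
        ts, lp = heapq.heappop(pending)
        processed += 1
        dest = lp                       # by default the event stays local
        if rng.random() < percent_remote:
            dest = rng.randrange(num_lps)
            remote += 1
        # Exponential increment with mean 1.0, so there is no lookahead.
        heapq.heappush(pending, (ts + rng.expovariate(1.0), dest))
    return processed, remote, len(pending)

processed, remote, in_system = phold()
print(processed, remote, in_system)
```

Every processed event schedules exactly one new event, so the event population stays fixed at the number of LPs times the initial events per LP.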
55-58. PHOLD on 8192 BG/L cores
59. 7.5 billion ev/sec for 10% remote on 32,768 cores
2.7 billion ev/sec for 100% remote on 32,768 cores
Stable performance across processor configurations attributed to near noiseless OS
60. 12.27 billion ev/sec for 10% remote on 65,536 cores
4 billion ev/sec for 100% remote on 65,536 cores!!
61. Rollback Efficiency = 1 - E_rb / E_net
64. History of PHOLD Performance
These results are not completely comparable, which explains the large variation in event rate and processor efficiency
65. Movies over the Internet Revisited
- Suppose we want to model 1 million home ISP customers over AT&T downloading a 2 GB movie
- How long to compute with massively parallel DES?
- 16 trillion events @ 1 billion ev/sec
- 4.5 hours!!
- ROSS on Blue Gene indicates billion-events-per-second models are feasible today!
- Yields significant TIME COMPRESSION of current models..
- LP to PE mapping less of a concern
- Past systems were very sensitive to this
- 90 TF systems can yield Giga-scale event rates.
- Tera-event models require teraflop systems.
- Assumes most of event processing time is spent in event-list management (splay tree enqueue/dequeue).
- Potential: 10 PF supercomputers will be able to model near peta-event systems
- 100 trillion to 1 quadrillion events in less than 1.4 to 14 hours
- Current testbed emulators don't come close to this for Network Modeling and Simulation..
68. Future Models Enabled by X-Scale Computing
- Discrete transistor level models for whole multi-core architectures
- Potential for more rapid improvements in processor technology
- Model nearly whole U.S. Internet at packet level
- Potential to radically improve overall QoS for all
- Model all C4I network/systems for a whole theatre of war faster than real-time many times over..
- Enables real-time active network control..
69. Future Models Enabled by X-Scale Computing
- Realistic discrete model of the human brain
- 100 billion neurons w/ 100 trillion synapses (e.g. connections: huge fan-out)
- Potential for several exa-events per run
- Detailed discrete agent-based model for every human on the earth for..
- global economic modeling
- pandemic flu/disease modeling
- food / water / energy usage modeling
70. Thank you!!
- Acknowledgments
- David Bauer, HPTi
- David Jefferson, LLNL, for helping us get discretionary access to Intrepid @ ALCF
- Sysadmins: Ray Loy (ANL), Tisha Stacey (ANL) and Adam Todorski (CCNI)
- Sponsors
- NSF NeTS and CAREER programs