Title: ROSS: Parallel Discrete-Event Simulations on Near Petascale Supercomputers
1 ROSS: Parallel Discrete-Event Simulations on Near Petascale Supercomputers
- Christopher D. Carothers
- Department of Computer Science
- Rensselaer Polytechnic Institute
- chrisc@cs.rpi.edu
2 Outline
- Motivation for PDES
- Overview of HPC Platforms
- ROSS Implementation
- Performance Results
- Summary
3 Motivation
- Why Parallel Discrete-Event Simulation (DES)?
- Large-scale systems are difficult to understand
- Analytical models are often constrained
- Parallel DES offers
- Dramatically shorter model execution times
- Prediction of future "what-if" system performance
- Potential for real-time decision support
- Minutes instead of days
- Analysis can be done right away
- Example models: national air space (NAS), ISP backbone(s), distributed content caches, next-generation supercomputer systems
4 Model a 10 PF Supercomputer
- Suppose we want to model a 10 PF supercomputer at the MPI message level
- How long to execute the DES model?
- 10% flop rate → 1 PF sustained
- @ 0.2 bytes/sec per flop @ 1% usage → 2 TB/sec
- @ 1 KB MPI msgs → 2 billion msgs per simulated second
- @ 8 hops per msg → 16 billion events per simulated second
- @ 1000 simulated seconds → 16 trillion events for the DES model
- No I/O included!!
- Nominal seq. DES simulator → 100K events/sec
- 16 trillion events @ 100K ev/sec → ~5 years!!!
- Need massively parallel simulation to make this tractable
5 Blue Gene/L Layout
- CCNI fen
- 32K cores/ 16 racks
- 12 TB / 8 TB usable RAM
- 1 PB of disk over GPFS
- Custom OS kernel
6 Blue Gene/P Layout
- ALCF/ANL Intrepid
- 163K cores/ 40 racks
- 80TB RAM
- 8 PB of disk over GPFS
- Custom OS kernel
7 Blue Gene/L vs. P
8 How to Synchronize Parallel Simulations?
(Figure: event timeline contrasting an already-processed event with a straggler event arriving in its past)
9 Massively Parallel Discrete-Event Simulation via Time Warp
Local Control Mechanism: error detection and rollback; (1) undo state Δs, (2) cancel sent events
Global Control Mechanism: compute Global Virtual Time (GVT); collect versions of state/events and perform I/O operations that are < GVT
(Figure: virtual-time axes for LP 1, LP 2, LP 3 showing unprocessed, processed, straggler, and committed events relative to GVT)
10 Our Solution: Reverse Computation
- Use Reverse Computation (RC)
- automatically generate reverse code from model source
- undo by executing reverse code
- Delivers better performance
- negligible overhead for forward computation
- significantly lower memory utilization
(Figure: compiler takes the original code and emits modified forward code plus reverse code)
11 Ex: Simple Network Switch
Original code, on packet arrival:

    if( qlen < B ) {
        qlen++;
        delays[qlen]++;
    } else {
        lost++;
    }

(Figure: switch with N input ports and buffer capacity B)
12 Beneficial Application Properties
- 1. Majority of operations are constructive
- e.g., ++, --, etc.
- 2. Size of control state < size of data state
- e.g., size of b1 < size of qlen, sent, lost, etc.
- 3. Perfectly reversible high-level operations
- gleaned from irreversible smaller operations
- e.g., random number generation
13 Destructive Assignment...
- Destructive assignment (DA)
- examples: x = y, x %= y
- requires all modified bytes to be saved
- Caveat
- reversing technique for DAs can degenerate to traditional incremental state saving
- Good news
- certain collections of DAs are perfectly reversible!
- queueing network models contain collections of easily/perfectly reversible DAs
- queue handling (swap, shift, tree insert/delete, ...)
- statistics collection (increment, decrement, ...)
- random number generation (reversible RNGs)
14 RC Applications
- PDES applications include
- Wireless telephone networks
- Distributed content caches
- Large-scale Internet models
- TCP over AT&T backbone
- Leverages RC swaps
- Hodgkin-Huxley neuron models
- Plasma physics models using PIC
- POSE -- UIUC
- Non-DES include
- Debugging
- PISA: reversible instruction set architecture for low-power computing
- Quantum computing
15 Local Control Implementation
- MPI_Isend/MPI_Irecv used to send/recv off-core events
- Event/network memory is managed directly
- Pool is allocated @ startup
- Event list kept sorted using a Splay Tree (O(log N))
- LP-to-core mapping tables are computed and not stored, to avoid the need for large global LP maps
(Figure: Local Control Mechanism, error detection and rollback on each LP's virtual-time axis: (1) undo state Δs, (2) cancel sent events)
16 Global Control Implementation
- GVT (kicks off when memory is low)
- 1. Each core counts sent, recv
- 2. Recv all pending MPI msgs
- 3. MPI_Allreduce Sum on (sent - recv)
- 4. If sent - recv != 0, goto 2
- 5. Compute local core's lower-bound time-stamp (LVT)
- 6. GVT = MPI_Allreduce Min on LVTs
- Algorithm needs an efficient MPI collective
- LC/GC can be very sensitive to OS jitter
(Figure: Global Control Mechanism, compute Global Virtual Time (GVT); collect versions of state/events and perform I/O operations that are < GVT)
So, how does this translate into Time Warp performance on BG/L and BG/P?
17 Performance Results Setup
- PHOLD
- Synthetic benchmark model
- 1024x1024 grid of LPs
- Each LP has 10 initial events
- Events are routed randomly among all LPs based on a configurable percent-remote parameter
- Time stamps are exponentially distributed with a mean of 1.0 (i.e., lookahead is 0)
- TLM: Transmission Line Matrix
- Discrete electromagnetic wave propagation model
- Used to model the physical layer of MANETs
- As accurate as previous ray-tracing models, but dramatically faster
- Considers wave attenuation effects
- Event population grows cubically outward from the single radio source
- ROSS parameters
- GVT_Interval → number of times thru the scheduler loop before computing GVT
- Batch → number of local events to process before checking the network for new events
- Batch x GVT_Interval events processed per GVT epoch
- KPs → kernel processes that hold the aggregated processed-event lists for LPs, to lower search overheads for fossil collection of old events
- Send/Recv Buffers → number of network events for sending or recving; used as a flow-control mechanism
18 7.5 billion ev/sec for 10% remote on 32,768 cores!!
2.7 billion ev/sec for 100% remote on 32,768 cores!!
Stable performance across processor configurations attributed to a near-noiseless OS
19 Performance falls off after just 100 processors on a PS3 cluster w/ Gigabit Ethernet
20 12.27 billion ev/sec for 10% remote on 65,536 cores!!
4 billion ev/sec for 100% remote on 65,536 cores!!
21 Rollback Efficiency = 1 - E_rb / E_net
23 Model a 10 PF Supercomputer (revisited)
- Suppose we want to model a 10 PF supercomputer at the MPI message level
- How long to execute the parallel DES model?
- 16 trillion events @ 10 billion ev/sec
- ~27 mins
24 Observations
- ROSS on Blue Gene indicates billion-events-per-second models are feasible today!
- Yields significant TIME COMPRESSION of current models
- LP-to-PE mapping is less of a concern
- Past systems were very sensitive to this
- 90 TF systems can yield giga-scale event rates
- Tera-event models require teraflop systems
- Assumes most event-processing time is spent in event-list management (splay tree enqueue/dequeue)
- Potential 10 PF supercomputers will be able to model near peta-event systems
- 100 trillion to 1 quadrillion events in less than 1.4 to 14 hours
- Current testbed emulators don't come close to this for network modeling and simulation
25 Future Models Enabled by X-Scale Computing
- Discrete transistor-level models for whole multi-core architectures
- Potential for more rapid improvements in processor technology
- Model nearly the whole U.S. Internet at the packet level
- Potential to radically improve overall QoS for all
- Model all C4I networks/systems for a whole theatre of war faster than real time, many times over
- Enables real-time/active network control
26 Future Models Enabled by X-Scale Computing
- Realistic discrete model of the human brain
- 100 billion neurons w/ 100 trillion synapses (i.e., connections with huge fan-out)
- Potential for several exa-events per run
- Detailed discrete agent-based model for every human on the earth, for...
- Global economic modeling
- Pandemic flu/disease modeling
- Food / water / energy usage modeling
But to get there, investments must be made in code that is COMPLETELY parallel from start to finish!!
27 Thank you!!
- Additional Acknowledgments
- David Bauer (HPTi)
- David Jefferson (LLNL), for helping us get discretionary access to Intrepid @ ALCF
- Sysadmins: Ray Loy (ANL), Tisha Stacey (ANL) and Adam Todorski (CCNI)
- ROSS Sponsors
- NSF PetaApps, NeTS, CAREER programs
- ALCF/ANL