ROSS: Parallel Discrete-Event Simulations on Near Petascale Supercomputers


Transcript and Presenter's Notes

1
ROSS: Parallel Discrete-Event Simulations on Near Petascale Supercomputers
  • Christopher D. Carothers
  • Department of Computer Science
  • Rensselaer Polytechnic Institute
  • chrisc@cs.rpi.edu

2
Outline
  • Motivation for PDES
  • Overview of HPC Platforms
  • ROSS Implementation
  • Performance Results
  • Summary

3
Motivation
  • Why Parallel Discrete-Event Simulation (DES)?
  • Large-scale systems are difficult to understand
  • Analytical models are often constrained
  • Parallel DES simulation offers:
  • Dramatic reduction in a model's execution time
  • Prediction of future "what-if" systems' performance
  • Potential for real-time decision support
  • Minutes instead of days
  • Analysis can be done right away
  • Example models: national air space (NAS), ISP backbone(s), distributed content caches, next-generation supercomputer systems

4
Model a 10 PF Supercomputer
  • Suppose we want to model a 10 PF supercomputer at the MPI message level
  • How long to execute the DES model?
  • @ 10% of peak flop rate → 1 PF sustained
  • @ 0.2 bytes/sec per flop @ 1% usage → 2 TB/sec
  • @ 1 KB MPI msgs → 2 billion msgs per simulated second
  • @ 8 hops per msg → 16 billion events per simulated second
  • @ 1000 simulated seconds → 16 trillion events for the DES model
  • No I/O included!!
  • Nominal seq. DES simulator → 100K events/sec
  • 16 trillion events @ 100K ev/sec
  • 5 years!!!
  • Need massively parallel simulation to make this tractable

5
Blue Gene /L Layout
  • CCNI fen
  • 32K cores/ 16 racks
  • 12 TB / 8 TB usable RAM
  • 1 PB of disk over GPFS
  • Custom OS kernel

6
Blue Gene /P Layout
  • ALCF/ANL Intrepid
  • 163K cores/ 40 racks
  • 80TB RAM
  • 8 PB of disk over GPFS
  • Custom OS kernel

7
Blue Gene L vs. P
8
How to Synchronize Parallel Simulations?
(Diagram: LP timelines where a straggler event arrives with a timestamp earlier than an already processed event.)
9
Massively Parallel Discrete-Event Simulation via Time Warp
  • Local Control Mechanism: error detection and rollback
  • (1) undo state Δs, (2) cancel sent events
  • Global Control Mechanism: compute Global Virtual Time (GVT)
  • collect versions of state / events; perform I/O operations that are < GVT
(Diagram: virtual-time axes for LPs 1-3 showing unprocessed, processed, straggler, and committed events relative to GVT.)
10
Our Solution Reverse Computation...
  • Use Reverse Computation (RC)
  • automatically generate reverse code from model
    source
  • undo by executing reverse code
  • Delivers better performance
  • negligible overhead for forward computation
  • significantly lower memory utilization

(Diagram: a compiler transforms the Original Code into Modified Code plus Reverse Code.)
11
Ex: Simple Network Switch
On packet arrival (original code):
if( qlen < B ) { qlen++; delays[qlen]++; } else lost++;
(Diagram: a switch with N input links and a buffer of size B.)
12
Beneficial Application Properties
  • 1. Majority of operations are constructive
  • e.g., ++, --, etc.
  • 2. Size of control state < size of data state
  • e.g., size of b1 < size of qlen, sent, lost, etc.
  • 3. Perfectly reversible high-level operations
  • gleaned from irreversible smaller operations
  • e.g., random number generation

13
Destructive Assignment...
  • Destructive assignment (DA)
  • examples: assignments such as x = y that overwrite x's previous value
  • requires all modified bytes to be saved
  • Caveat:
  • the reversing technique for DAs can degenerate to traditional incremental state saving
  • Good news:
  • certain collections of DAs are perfectly reversible!
  • queueing network models contain collections of easily/perfectly reversible DAs
  • queue handling (swap, shift, tree insert/delete, etc.)
  • statistics collection (increment, decrement, etc.)
  • random number generation (reversible RNGs)

14
RC Applications
  • PDES applications include:
  • Wireless telephone networks
  • Distributed content caches
  • Large-scale Internet models
  • TCP over the AT&T backbone
  • Leverages RC swaps
  • Hodgkin-Huxley neuron models
  • Plasma physics models using PIC
  • POSE -- UIUC
  • Non-DES applications include:
  • Debugging
  • PISA: a reversible instruction set architecture for low-power computing
  • Quantum computing

15
Local Control Implementation
  • MPI_Isend/MPI_Irecv used to send/recv off-core events
  • Event and network memory is managed directly
  • Pool is allocated @ startup
  • Event list kept sorted using a Splay Tree (O(log N))
  • LP-to-core mapping tables are computed and not stored, to avoid the need for large global LP maps

16
Global Control Implementation
  • GVT (kicks off when memory is low):
  • 1. Each core counts #sent, #recv
  • 2. Recv all pending MPI msgs
  • 3. MPI_Allreduce Sum on (sent - recv)
  • 4. If sent - recv != 0, goto 2
  • 5. Compute local core's lower-bound time-stamp (LVT)
  • 6. GVT = MPI_Allreduce Min on LVTs
  • Algorithm needs efficient MPI collectives
  • LC/GC can be very sensitive to OS jitter
So, how does this translate into Time Warp performance on BG/L and BG/P?
17
Performance Results Setup
  • PHOLD
  • Synthetic benchmark model
  • 1024x1024 grid of LPs
  • Each LP has 10 initial events
  • Events routed randomly among all LPs based on a configurable percent-remote parameter
  • Time stamps are exponentially distributed with a mean of 1.0 (i.e., lookahead is 0)
  • TLM: Transmission Line Matrix
  • Discrete electromagnetic wave-propagation model
  • Used to model the physical layer of MANETs
  • As accurate as previous ray-tracing models, but dramatically faster
  • Considers wave attenuation effects
  • Event population grows cubically outward from the single radio source
  • ROSS parameters
  • GVT_Interval → number of times through the scheduler loop before computing GVT
  • Batch → number of local events to process before checking the network for new events
  • Batch × GVT_Interval events processed per GVT epoch
  • KPs → kernel processes that hold the aggregated processed-event lists for LPs, to lower search overheads for fossil collection of old events
  • Send/Recv Buffers → number of network events for sending or receiving; used as a flow-control mechanism

18
7.5 billion ev/sec for 10% remote on 32,768 cores!!
2.7 billion ev/sec for 100% remote on 32,768 cores!!
Stable performance across processor configurations is attributed to the near-noiseless OS
19
Performance falls off after just 100 processors on a PS3 cluster with Gigabit Ethernet
20
12.27 billion ev/sec for 10% remote on 65,536 cores!!
4 billion ev/sec for 100% remote on 65,536 cores!!
21
Rollback Efficiency = 1 - E_rb / E_net
23
Model a 10 PF Supercomputer (revisited)
  • Suppose we want to model a 10 PF supercomputer at the MPI message level
  • How long to execute the parallel DES model?
  • 16 trillion events @ 10 billion ev/sec
  • ~27 mins

24
Observations
  • ROSS on Blue Gene indicates billion-event-per-second models are feasible today!
  • Yields significant TIME COMPRESSION of current models
  • LP-to-PE mapping is less of a concern
  • Past systems were very sensitive to this
  • 90 TF systems can yield giga-scale event rates
  • Tera-event models require teraflop systems
  • Assumes most of event-processing time is spent in event-list management (splay tree enqueue/dequeue)
  • Potential 10 PF supercomputers will be able to model near peta-event systems
  • 100 trillion to 1 quadrillion events in less than 1.4 to 14 hours
  • Current testbed emulators don't come close to this for network modeling and simulation

25
Future Models Enabled by X-Scale Computing
  • Discrete transistor-level models of whole multi-core architectures
  • Potential for more rapid improvements in processor technology
  • Model nearly the whole U.S. Internet at the packet level
  • Potential to radically improve overall QoS for all
  • Model all C4I networks/systems for a whole theatre of war, faster than real time many times over
  • Enables real-time, active network control

26
Future Models Enabled by X-Scale Computing
  • Realistic discrete model of the human brain
  • 100 billion neurons with 100 trillion synapses (i.e., connections with huge fan-out)
  • Potential for several exa-events per run
  • Detailed discrete agent-based model of every human on Earth, for:
  • Global economic modeling
  • Pandemic flu/disease modeling
  • Food / water / energy usage modeling

But to get there, investments must be made in codes that are COMPLETELY parallel from start to finish!!
27
Thank you!!
  • Additional Acknowledgments
  • David Bauer (HPTi)
  • David Jefferson (LLNL), for helping us get discretionary access to Intrepid @ ALCF
  • Sysadmins: Ray Loy (ANL), Tisha Stacey (ANL), and Adam Todorski (CCNI)
  • ROSS Sponsors
  • NSF PetaApps and NeTS CAREER programs
  • ALCF/ANL