Transcript and Presenter's Notes

Title: ROSS: Parallel Discrete-Event Simulations on Near Petascale Supercomputers


1
ROSS: Parallel Discrete-Event Simulations on Near
Petascale Supercomputers
  • Christopher D. Carothers
  • Department of Computer Science
  • Rensselaer Polytechnic Institute
  • chrisc@cs.rpi.edu
  • Sponsors: NSF CAREER, NeTS, PetaApps, ANL/ALCF

2
Motivation
  • Why Parallel Discrete-Event Simulation (DES)?
  • Large-scale systems are difficult to understand
  • Analytical models are often constrained
  • Parallel DES offers
  • Dramatically shrinks model execution time
  • Prediction of future "what-if" system
    performance
  • Potential for real-time decision support
  • Minutes instead of days
  • Analysis can be done right away
  • Example models: national air space (NAS), ISP
    backbone(s), distributed content caches, next
    generation supercomputer systems.

3
Ex: Movies over the Internet
  • Suppose we want to model 1 million home ISP
    customers downloading a 2 GB movie
  • How long to compute?
  • Assume a nominal 100K ev/sec seq. simulator
  • Assume on avg. each packet takes 8 hops
  • 2 GB movies yield 2 trillion 1 KB data packets
    across the 1 million customers
  • @ 8 hops yields 16 trillion events
  • 16 trillion events @ 100K ev/sec
  • Over 1,850 days, or about 5 years!!! (see the
    back-of-envelope check below)
  • Need massively parallel simulation to make
    tractable
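
As a back-of-envelope check of the arithmetic above (the packet size, hop count, and event rate are the slide's own assumptions), a few lines of C reproduce the event count and runtime estimate:

#include <stdio.h>

int main(void)
{
    double customers = 1e6;               /* 1 million ISP customers            */
    double packets   = 2e9 / 1e3;         /* 2 GB movie split into 1 KB packets */
    double hops      = 8.0;               /* average hops per packet            */
    double events    = customers * packets * hops;   /* ~16 trillion events    */

    double rate      = 1e5;               /* 100K events/sec, sequential        */
    double days      = events / rate / 86400.0;

    printf("events = %.3g, days = %.0f (~%.1f years)\n",
           events, days, days / 365.0);   /* ~1.6e13 events, ~1852 days         */
    return 0;
}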

4
Outline
  • Intro to DES
  • Time Warp and other PDES Schemes
  • Reverse Computation
  • Blue Gene/L /P
  • Implementation
  • Performance Results
  • PHOLD & PCS
  • Observations on PDES Performance
  • Future Directions

5
Discrete Event Simulation (DES)
  • Discrete event simulation: a computer model for a
    system where changes in the state of the system
    occur at discrete points in simulation time.
  • Fundamental concepts
  • system state (state variables)
  • state transitions (events)
  • A DES computation can be viewed as a sequence of
    event computations, where each event computation
    is assigned a (simulation time) time stamp
  • Each event computation can
  • modify state variables
  • schedule new events

6
DES Computation
Example: air traffic at an airport. Events: aircraft arrival, landing, departure.
[Timeline figure: arrival at 8:00, landed at 8:05, departure at 9:15, arrival at 9:30, laid out along simulation time, with arrows showing which event computations schedule which new events.]
  • Unprocessed events are stored in a pending list
  • Events are processed in time stamp order

7
Discrete Event Simulation System
[Figure: the DES system is split into the simulation application (the model of the physical system) and the simulation executive (independent of the simulation application).]
8
Event-Oriented World View
[Figure: the simulation application consists of state variables (Integer InTheAir, Integer OnTheGround, Boolean RunwayFree) and event handler procedures for the Arrival, Landed, and Departure events.]
9
Ex: Air Traffic at an Airport
  • Model aircraft arrivals and departures, arrival
    queueing
  • Single runway model ignores departure queueing
  • R = time runway is used for each landing aircraft
    (const)
  • G = time required on the ground before departing
    (const)
  • State Variables
  • Now = current simulation time
  • InTheAir = number of aircraft landing or waiting
    to land
  • OnTheGround = number of landed aircraft
  • RunwayFree = Boolean, true if runway available
  • Model Events
  • Arrival: denotes aircraft arriving in air space
    of airport
  • Landed: denotes aircraft landing
  • Departure: denotes aircraft leaving

10
Arrival Events
New aircraft arrives at airport. If the runway
is free, it will begin to land. Otherwise, the
aircraft must circle, and wait to land.
  • R = time runway is used for each landing aircraft
  • G = time required on the ground before departing
  • Now = current simulation time
  • InTheAir = number of aircraft landing or waiting
    to land
  • OnTheGround = number of landed aircraft
  • RunwayFree = Boolean, true if runway available
  • Arrival Event
  • InTheAir = InTheAir + 1
  • If (RunwayFree)
  • RunwayFree = FALSE
  • Schedule Landed event @ Now + R

11
Landed Event
An aircraft has completed its landing.
  • R = time runway is used for each landing aircraft
  • G = time required on the ground before departing
  • Now = current simulation time
  • InTheAir = number of aircraft landing or waiting
    to land
  • OnTheGround = number of landed aircraft
  • RunwayFree = Boolean, true if runway available
  • Landed Event
  • InTheAir = InTheAir - 1
  • OnTheGround = OnTheGround + 1
  • Schedule Departure event @ Now + G
  • If (InTheAir > 0)
  • Schedule Landed event @ Now + R
  • Else
  • RunwayFree = TRUE

12
Departure Event
An aircraft now on the ground departs for a new
dest.
  • R = time runway is used for each landing aircraft
  • G = time required on the ground before departing
  • Now = current simulation time
  • InTheAir = number of aircraft landing or waiting
    to land
  • OnTheGround = number of landed aircraft
  • RunwayFree = Boolean, true if runway available
  • Departure Event
  • OnTheGround = OnTheGround - 1
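
The three event handlers above translate almost line for line into C. The sketch below is a hypothetical rendering for illustration only: the schedule() stub, the event type constants, and the particular R and G values are assumptions, not the slides' code.

#include <stdio.h>

enum { ARRIVAL, LANDED, DEPARTURE };

static double Now;              /* current simulation time                */
static int    InTheAir;         /* aircraft landing or waiting to land    */
static int    OnTheGround;      /* landed aircraft                        */
static int    RunwayFree = 1;   /* nonzero if the runway is available     */

static const double R = 5.0;    /* runway time per landing (assumed)      */
static const double G = 60.0;   /* ground time before departure (assumed) */

/* Stand-in for the simulation executive: would insert into the pending list */
static void schedule(int type, double ts)
{
    printf("schedule event %d @ %.1f\n", type, ts);
}

void arrival_event(void)
{
    InTheAir = InTheAir + 1;
    if (RunwayFree) {
        RunwayFree = 0;
        schedule(LANDED, Now + R);
    }
}

void landed_event(void)
{
    InTheAir    = InTheAir - 1;
    OnTheGround = OnTheGround + 1;
    schedule(DEPARTURE, Now + G);
    if (InTheAir > 0)
        schedule(LANDED, Now + R);   /* next circling aircraft lands */
    else
        RunwayFree = 1;
}

void departure_event(void)
{
    OnTheGround = OnTheGround - 1;
}

int main(void)
{
    arrival_event();   /* at Now = 0: runway free, schedules Landed @ R */
    arrival_event();   /* second aircraft must circle                   */
    landed_event();    /* first lands; second is scheduled to land      */
    return 0;
}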

13
Execution Example
14
Summary
  • DES computation is a sequence of event computations
  • Modify state variables
  • Schedule new events
  • DES System = model + simulation executive
  • Data structures
  • Pending event list to hold unprocessed events
  • State variables
  • Simulation time clock variable
  • Program (Code)
  • Main event processing loop
  • Event procedures
  • Events processed in time stamp order
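
A minimal sketch of such a sequential simulation executive, assuming nothing beyond what the summary lists: a pending event list kept in time stamp order, a simulation clock, and the main dispatch loop. The toy self-rescheduling event stands in for a real model.

#include <stdio.h>
#include <stdlib.h>

typedef struct event {
    double        ts;     /* time stamp         */
    int           type;   /* model-defined type */
    struct event *next;
} event_t;

static event_t *pending = NULL;   /* pending event list, sorted by ts */
static double   now     = 0.0;    /* simulation time clock            */

static void schedule(double ts, int type)
{
    event_t *e = malloc(sizeof *e);
    e->ts = ts;  e->type = type;
    event_t **p = &pending;               /* keep the list in time stamp order */
    while (*p && (*p)->ts <= ts)
        p = &(*p)->next;
    e->next = *p;  *p = e;
}

int main(void)
{
    schedule(1.0, 0);                     /* prime the pending list        */
    while (pending && now < 5.0) {        /* main event processing loop    */
        event_t *e = pending;             /* smallest time stamp first     */
        pending = e->next;
        now = e->ts;                      /* advance the clock             */
        printf("processing event type %d at t = %.1f\n", e->type, now);
        schedule(now + 1.0, e->type);     /* handler: modify state and
                                             schedule new events           */
        free(e);
    }
    return 0;
}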

15
Outline
  • Intro to DES
  • Time Warp and other PDES Schemes
  • Reverse Computation
  • Blue Gene/L /P
  • Implementation
  • Performance Results
  • PHOLD & PCS
  • Observations on PDES Performance
  • Future Directions

16
How to Synchronize Parallel Simulations?
[Figure: LPs processing events concurrently; a straggler event arrives in one LP's past, behind events that have already been processed.]
17
Massively Parallel Discrete-Event Simulation Via
Time Warp
[Figure: two side-by-side views of Time Warp execution, with virtual time on the vertical axis and LPs 1-3 on the horizontal axis.
Local Control Mechanism (error detection and rollback, within a processor): (1) undo state deltas, (2) cancel sent events.
Global Control Mechanism (compute Global Virtual Time, GVT): collect versions of state/events and perform I/O operations that are < GVT.
Legend: unprocessed, processed, straggler, and committed events.]
18
Whew! Time Warp sounds expensive; are there other
PDES schemes?
  • Non-rollback options
  • Called "conservative" because they disallow
    out-of-order event execution.
  • Deadlock Avoidance
  • NULL Message Algorithm
  • Deadlock Detection and Recovery

19
Deadlock Avoidance Using Null Messages
Null Message Algorithm (executed by each LP)
Goal: ensure events are processed in time stamp
order and avoid deadlock
WHILE (simulation is not over)
    wait until each FIFO contains at least one message
    remove smallest time stamped event from its FIFO
    process that event
    send null messages to neighboring LPs with time stamp
      indicating a lower bound on future messages sent
      to that LP (current time plus minimum transit time
      between airports)
END-LOOP
  • Variation: LP requests a null message when its FIFO
    becomes empty
  • Fewer null messages
  • Delay to get time stamp information
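
A hedged sketch of one LP's loop under the null message algorithm. The per-neighbor FIFO helpers and the minimum transit time constant are assumptions introduced only for illustration, not an actual model API.

#define NUM_NEIGHBORS 2
#define MIN_TRANSIT   3.0               /* lookahead: minimum transit time */

typedef struct { double ts; int is_null; } msg_t;

/* Assumed helpers over per-neighbor FIFO queues (not shown here) */
int    fifo_empty(int nbr);
msg_t  fifo_peek(int nbr);
msg_t  fifo_pop(int nbr);
void   send_msg(int nbr, msg_t m);
void   process_event(msg_t m);   /* may send events with ts >= now + MIN_TRANSIT */

void null_message_lp(double end_time)
{
    double now = 0.0;
    while (now < end_time) {
        /* wait until every input FIFO contains at least one message */
        for (int n = 0; n < NUM_NEIGHBORS; n++)
            while (fifo_empty(n))
                ;                        /* block or poll the network */

        /* remove the smallest time stamped message among the FIFO heads */
        int min = 0;
        for (int n = 1; n < NUM_NEIGHBORS; n++)
            if (fifo_peek(n).ts < fifo_peek(min).ts)
                min = n;
        msg_t m = fifo_pop(min);

        now = m.ts;
        if (!m.is_null)
            process_event(m);

        /* promise each neighbor a lower bound on future messages */
        msg_t null_msg = { now + MIN_TRANSIT, 1 };
        for (int n = 0; n < NUM_NEIGHBORS; n++)
            send_msg(n, null_msg);
    }
}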

20
The Time Creep Problem
Five null messages to process a single event!
  • Many null messages if minimum flight time is
    small!

21
Livelock Can Occur!
Suppose the minimum delay between airports is
zero!
[Figure: a cycle of LPs, each holding an event at time 5.0, endlessly exchanging null messages that carry no time advance.]
Livelock: an unending cycle of null messages in which
no LP can advance its simulation time. To rule it out,
there cannot be a cycle where, for each LP in the cycle,
an incoming message with time stamp T results in a new
message sent to the next LP in the cycle with time
stamp T (a zero lookahead cycle).
22
Lookahead
  • The null message algorithm relies on a
    prediction ability referred to as lookahead
  • ORD is at simulation time 5, minimum transit time
    between airports is 3, so the next message sent
    by ORD must have a time stamp of at least 8
  • Lookahead is a constraint on an LP's behavior
  • Link lookahead: if an LP is at simulation time T,
    and an outgoing link has lookahead Li, then any
    message sent on that link must have a time stamp
    of at least T + Li
  • LP lookahead: if an LP is at simulation time T,
    and has a lookahead of L, then any message sent
    by that LP must have a time stamp of at
    least T + L
  • Equivalent to link lookahead where the lookahead
    on each outgoing link is the same
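
As a small illustration of the link lookahead constraint, a send wrapper can simply assert that every outgoing time stamp is at least T + Li. The link table and raw_send() below are hypothetical names, used only to make the constraint concrete.

#include <assert.h>

#define NUM_LINKS 4

static double now;                        /* LP's current simulation time T */
static double link_lookahead[NUM_LINKS];  /* Li for each outgoing link      */

void raw_send(int link, double ts, const void *msg);   /* assumed low-level send */

/* Any message sent on 'link' must carry a time stamp >= now + Li */
void send_with_lookahead(int link, double ts, const void *msg)
{
    assert(ts >= now + link_lookahead[link]);
    raw_send(link, ts, msg);
}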

23
Lookahead and the Simulation Model
  • Lookahead is clearly dependent on the simulation
    model
  • could be derived from physical constraints in the
    system being modeled, such as minimum simulation
    time for one entity to affect another (e.g., a
    weapon fired from a tank requires L units of time
    to reach another tank, or maximum speed of the
    tank places lower bound on how soon it can affect
    another entity)
  • could be derived from characteristics of the
    simulation entities, such as non-preemptable
    behavior (e.g., a tank is traveling north at 30
    mph, and nothing in the federation model can
    cause its behavior to change over the next 10
    minutes, so all output from the tank simulator
    can be generated immediately up to time local
    clock + 10 minutes)
  • could be derived from tolerance to temporal
    inaccuracies (e.g., users cannot perceive
    temporal difference of 100 milliseconds, so
    messages may be timestamped 100 milliseconds into
    the future).
  • a simulation may be able to precompute when its
    next interaction with another simulation will be
    (e.g., if the time until the next interaction is
    stochastic, pre-sample the random number generator
    to determine the time of the next interaction).

Lookahead changes as the LP topology changes, which
can have a profound impact on the performance of
network models (wired or wireless).
24
Why is Lookahead Important?
Each LP A using logical time declares a lookahead
value LA; the time stamp of any event generated
by the LP must be >= LTA + LA
  • Lookahead is used in virtually all conservative
    synch. protocols
  • Essential to allow concurrent processing of events

Lookahead is necessary to allow concurrent
processing of events with different time stamps
(unless optimistic event processing is used)
25
Null Message Algorithm Speed Up
  • toroid topology
  • message density: 4 per LP
  • 1 millisecond computation per event
  • vary time stamp increment distribution
  • ILAR = lookahead / average time stamp increment

Conservative algorithms live or die by their
lookahead!
26
Deadlock Detection and Recovery
Algorithm A (executed by each LP)
Goal: ensure events are processed in time stamp order
WHILE (simulation is not over)
    wait until each FIFO contains at least one message
    remove smallest time stamped event from its FIFO
    process that event
END-LOOP
  • No null messages
  • Allow simulation to execute until deadlock occurs
  • Provide a mechanism to detect deadlock
  • Provide a mechanism to recover from deadlocks

27
Deadlock Recovery
Deadlock recovery: identify safe events (events
that can be processed without violating local
causality).
  • Which events are safe?
  • Time stamp 7: smallest time stamped event in
    the system
  • Time stamps 8, 9: safe because of the lookahead
    constraint
  • Time stamp 10: OK if events with the same time
    stamp can be processed in any order
  • No lookahead creep!

28
Preventing LA Creep Using Next Event Time Info
  • Observation smallest time stamped event is safe
    to process
  • Lookahead creep avoided by allowing the
    synchronization algorithm to immediately advance
    to (global) time of the next event
  • Synchronization algorithm must know time stamp of
    the LP's next event
  • Each LP guarantees a logical time T such that if
    no additional events are delivered to the LP with TS
    < T, all subsequent messages that LP produces
    have a time stamp of at least T + L (L = lookahead)

29
No Free Lunch for PDES!
  • Time Warp -> state saving overheads
  • Null message algorithm -> lookahead creep problem
  • No zero lookahead cycles allowed
  • Lookahead -> essential for concurrent processing
    of events for conservative algorithms
  • Has large effect on performance -> need to program
    it
  • Deadlock detection and recovery -> smallest time
    stamp event is safe to process
  • Others may also be safe (requires additional work
    to determine this)
  • Use time of next event to avoid lookahead creep,
    but hard to compute at scale
  • Can we avoid some of these overheads and
    complexities??

30
Outline
  • Intro to DES
  • Time Warp and other PDES Schemes
  • Reverse Computation
  • Blue Gene/L /P
  • Implementation
  • Performance Results
  • PHOLD & PCS
  • Observations on PDES Performance
  • Future Directions

31
Our Solution Reverse Computation...
  • Use Reverse Computation (RC)
  • automatically generate reverse code from model
    source
  • undo by executing reverse code
  • Delivers better performance
  • negligible overhead for forward computation
  • significantly lower memory utilization

[Figure: a compiler takes the original model code and produces the modified (instrumented forward) code plus the generated reverse code.]
32
Ex Simple Network Switch
Original code, executed on packet arrival (N input links feed a switch with buffer capacity B):

if (qlen < B)
    qlen++
    delays[qlen]++
else
    lost++
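
Below is a sketch of how reverse computation instruments this handler: the forward code records which branch was taken in a single bit of control state (b1, a name chosen here for illustration), and the reverse code undoes the increments in reverse order. It follows the scheme these slides describe rather than quoting actual generated code.

#define B 100

static int qlen;            /* current queue length                        */
static int lost;            /* dropped packets                             */
static int delays[B + 1];   /* histogram of queueing delays                */
static int b1;              /* 1 bit of control state; in practice stored
                               with the event so it can be rolled back     */

/* Forward event handler: on packet arrival */
void arrival_forward(void)
{
    if ((b1 = (qlen < B))) {
        qlen++;
        delays[qlen]++;
    } else {
        lost++;
    }
}

/* Reverse event handler: undo arrival_forward() using only b1 */
void arrival_reverse(void)
{
    if (b1) {
        delays[qlen]--;     /* undo in the reverse order */
        qlen--;
    } else {
        lost--;
    }
}

Only the branch bit is kept per event instead of a copy of qlen, lost, and the delays array, which is where the roughly B+2 words to 1 word reduction on the next slide comes from.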
33
Benefits of Reverse Computation
  • State size reduction
  • from B+2 words to 1 word
  • e.g., B = 100 => 100x reduction!
  • Negligible overhead in forward computation
  • removed from forward computation
  • moved to rollback phase
  • Result
  • significant increase in speed
  • significant decrease in memory
  • How?...

34
Beneficial Application Properties
  • 1. Majority of operations are constructive
  • e.g., ++, --, etc.
  • 2. Size of control state < size of data state
  • e.g., size of b1 < size of qlen, sent, lost, etc.
  • 3. Perfectly reversible high-level operations
  • gleaned from irreversible smaller operations
  • e.g., random number generation

35
Rules for Automation...
Generation rules, and upper-bounds on bit
requirements for various statement types
36
Destructive Assignment...
  • Destructive assignment (DA)
  • examples: x = y, x %= y
  • requires all modified bytes to be saved
  • Caveat
  • reversing technique for DAs can degenerate to
    traditional incremental state saving
  • Good news
  • certain collections of DAs are perfectly
    reversible!
  • queueing network models contain collections of
    easily/perfectly reversible DAs
  • queue handling (swap, shift, tree insert/delete,
    ...)
  • statistics collection (increment, decrement, ...)
  • random number generation (reversible RNGs)
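
To make the contrast concrete, here is a minimal sketch (variable names are mine): a genuinely destructive assignment forces the old value to be saved, i.e., RC degenerates to incremental state saving, while a swap is perfectly reversible with nothing saved at all.

static int x, y;
static int saved_x;     /* log entry needed only for the destructive case */

/* Destructive assignment: the old value of x is lost, so the reverse
   code must restore it from a saved copy (incremental state saving). */
void assign_forward(void) { saved_x = x; x = y; }
void assign_reverse(void) { x = saved_x; }

/* Swap: perfectly reversible - the reverse code is the swap itself. */
void swap_forward(void) { int t = x; x = y; y = t; }
void swap_reverse(void) { int t = x; x = y; y = t; }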

37
Reversing an RNG?
double RNGGenVal(Generator g)
{
    long k, s;
    double u;
    u = 0.0;

    s = Cg[0][g];
    k = s / 46693;
    s = 45991 * (s - k * 46693) - k * 25884;
    if (s < 0) s = s + 2147483647;
    Cg[0][g] = s;
    u = u + 4.65661287524579692e-10 * s;

    s = Cg[1][g];
    k = s / 10339;
    s = 207707 * (s - k * 10339) - k * 870;
    if (s < 0) s = s + 2147483543;
    Cg[1][g] = s;
    u = u - 4.65661310075985993e-10 * s;
    if (u < 0) u = u + 1.0;

    s = Cg[2][g];
    k = s / 15499;
    s = 138556 * (s - k * 15499) - k * 3979;
    if (s < 0) s = s + 2147483423;
    Cg[2][g] = s;
    u = u + 4.65661336096842131e-10 * s;
    if (u >= 1.0) u = u - 1.0;

    s = Cg[3][g];
    k = s / 43218;
    s = 49689 * (s - k * 43218) - k * 24121;
    if (s < 0) s = s + 2147483323;
    Cg[3][g] = s;
    u = u - 4.65661357780891134e-10 * s;
    if (u < 0) u = u + 1.0;

    return (u);
}
Observation: k = s / 46693 is a destructive
assignment. Result: RC degrades to classic
state saving. Can we do better?
38
RNGs A Higher Level View
The previous RNG is based on the following
recurrence:
    x_{i,n} = a_i * x_{i,n-1} mod m_i
where x_{i,n} is one of the four seed values in the nth
set, m_i is one of the four largest primes less than 2^31,
and a_i is a primitive root of m_i. Now, the above
recurrence is in fact reversible: the inverse of a_i
modulo m_i is defined as
    b_i = a_i^(m_i - 2) mod m_i
Using b_i, we can generate the reverse recurrence as
follows:
    x_{i,n-1} = b_i * x_{i,n} mod m_i
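
A self-contained sketch of this reversal for the first of the four streams, using the constants from the code above (a = 45991, m = 2147483647). The helper names are mine, and a production reversible RNG would precompute each b_i once rather than on every call.

#include <stdio.h>
#include <stdint.h>

/* Products of two values < 2^31 fit in a signed 64-bit integer */
static int64_t mulmod(int64_t a, int64_t b, int64_t m) { return (a * b) % m; }

/* base^exp mod m by square-and-multiply */
static int64_t powmod(int64_t base, int64_t exp, int64_t m)
{
    int64_t r = 1;
    base %= m;
    while (exp > 0) {
        if (exp & 1) r = mulmod(r, base, m);
        base = mulmod(base, base, m);
        exp >>= 1;
    }
    return r;
}

int main(void)
{
    const int64_t a = 45991, m = 2147483647;    /* stream 0 of the RNG above  */
    const int64_t b = powmod(a, m - 2, m);      /* b_i = a_i^(m_i-2) mod m_i  */

    int64_t x0 = 123456789;
    int64_t x1 = mulmod(a, x0, m);              /* forward: x_n = a*x_{n-1} mod m */
    int64_t xr = mulmod(b, x1, m);              /* reverse: x_{n-1} = b*x_n mod m */

    printf("x0 = %lld, x1 = %lld, recovered = %lld\n",
           (long long)x0, (long long)x1, (long long)xr);   /* recovered == x0 */
    return 0;
}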
39
Reverse Code Efficiency...
  • Property...
  • Non-reversibility of individual steps DOES NOT
    imply that the computation as a whole is not
    reversible.
  • Can we automatically find this higher-level
    reversibility?
  • Other Reversible Structures Include...
  • Circular shift operation
  • Insertion deletion operations on trees (i.e.,
    priority queues).

Reverse computation is well-suited for small
grain event models!
40
RC Applications
  • PDES applications include
  • Wireless telephone networks
  • Distributed content caches
  • Large-scale Internet models
  • TCP over AT&T backbone
  • Leverages RC swaps
  • Hodgkin-Huxley neuron models
  • Plasma physics models using PIC
  • Non-DES include
  • Debugging
  • PISA: reversible instruction set architecture
    for low-power computing
  • Quantum computing

41
Outline
  • Intro to DES
  • Time Warp and other PDES Schemes
  • Reverse Computation
  • Blue Gene/L /P
  • Implementation
  • Performance Results
  • PHOLD & PCS
  • Observations on PDES Performance
  • Future Directions

42
Target Systems: Blue Gene/L & /P
  • Configuration
  • BG/L nodes: 2 x 700 MHz PPC cores
  • BG/P nodes: 4 x 850 MHz PPC cores
  • Dedicated compute and I/O nodes (32:1 or 8:1).
  • 3-D torus P2P network
  • Additional barrier, collective, I/O and Ethernet
    networks
  • Can partition system into dedicated slices from
    32 nodes to whole systems
  • Properties for GOOD scaling
  • Balanced architecture between network(s) and
    processor speed
  • Exclusive access to network and process
  • Exceptionally low OS jitter
  • Collective overheads not adversely impacted at
    large node counts

1 rack of IBM Blue Gene/L
43
Blue Gene /L Layout
44
Blue Gene/L SoC
45
Blue Gene/L Network
46
Blue Gene /P Layout
47
Blue Gene/P Architectural Highlights
  • Scaled performance via density and frequency
    increase
  • 2x performance increase via doubling the
    processors per node.
  • 1.2x from frequency increase: 700 MHz -> 850 MHz.
  • Enhanced function
  • 4-way SMP -> 3 modes: SMP / DUAL / VNM
  • L2, L3 changed for SMP mode
  • DMA for torus, remote put-get, user prog. memory
    prefetch
  • Enhanced 64-bit performance counters via PPC450
    core
  • Double Hummer FPU and networks are the
    same... except...
  • Better Network
  • 2.4x more bandwidth, lower latency Torus and Tree
    networks
  • 10x higher Ethernet I/O bandwidth
  • 72K nodes in 72 racks for 1 PF peak performance
  • Low power via aggressive power management

48
Blue Gene L vs. P
49
Blue Gene /P Compute Card
50
Outline
  • Intro to DES
  • Time Warp and other PDES Schemes
  • Reverse Computation
  • Blue Gene/L /P
  • Implementation
  • Performance Results
  • PHOLD & PCS
  • Observations on PDES Performance
  • Future Directions

51
Local Control Implementation
  • MPI_Isend/MPI_Irecv used to send/recv off-core
    events
  • Event & Network memory is managed directly.
  • Pool is allocated @ startup
  • Event list kept sorted using a Splay Tree (O(log N))
  • LP-2-Core mapping tables are computed and not
    stored to avoid the need for large global LP
    maps.

[Figure: Local Control Mechanism (error detection and rollback within a processor): (1) undo state deltas, (2) cancel sent events; virtual time axis over LPs 1-3.]
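
A minimal sketch of the off-core event exchange pattern described above using non-blocking MPI. The fixed-size event struct, the tag, and the buffer count are assumptions for illustration, not ROSS's actual data structures.

#include <mpi.h>

typedef struct { double ts; long dest_lp; char payload[32]; } net_event_t;

#define RECV_BUFS 16    /* assumed receive buffer count (flow control knob) */

static net_event_t recv_buf[RECV_BUFS];
static MPI_Request recv_req[RECV_BUFS];

/* Post a pool of receives at startup; the pool size bounds in-flight events. */
void post_receives(void)
{
    for (int i = 0; i < RECV_BUFS; i++)
        MPI_Irecv(&recv_buf[i], sizeof(net_event_t), MPI_BYTE,
                  MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &recv_req[i]);
}

/* Send one event to the core that owns dest_lp (mapping computed, not stored). */
void send_event(const net_event_t *ev, int dest_core)
{
    static net_event_t out;                 /* one outstanding send, for the sketch */
    static MPI_Request req = MPI_REQUEST_NULL;
    MPI_Wait(&req, MPI_STATUS_IGNORE);      /* complete the previous send, if any   */
    out = *ev;
    MPI_Isend(&out, sizeof out, MPI_BYTE, dest_core, 0, MPI_COMM_WORLD, &req);
}

/* Drain completed receives and hand the events to the scheduler. */
void poll_network(void (*enqueue)(const net_event_t *))
{
    for (int i = 0; i < RECV_BUFS; i++) {
        int done = 0;
        MPI_Test(&recv_req[i], &done, MPI_STATUS_IGNORE);
        if (done) {
            enqueue(&recv_buf[i]);
            MPI_Irecv(&recv_buf[i], sizeof(net_event_t), MPI_BYTE,
                      MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &recv_req[i]);
        }
    }
}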
52
Global Control Implementation
  • GVT (kicks off when memory is low)
  • Each core counts sent, recv
  • Recv all pending MPI msgs.
  • MPI_Allreduce Sum on (sent - recv)
  • If sent - recv != 0, goto 2
  • Compute local core's lower bound time-stamp
    (LVT).
  • GVT = MPI_Allreduce Min on LVTs
  • Algorithm needs efficient MPI collectives (sketched
    below)
  • LC/GC can be very sensitive to OS jitter
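
Expressed with MPI collectives, the steps above look roughly like the sketch below; the counter and hook names are illustrative, not ROSS internals.

#include <mpi.h>

/* Assumed per-core counters and hooks, named here for illustration only */
extern long   events_sent, events_recvd;         /* counted by this core             */
extern void   drain_pending_mpi_messages(void);  /* step 2: receive in-flight events */
extern double local_lvt(void);                   /* step 5: local lower bound (LVT)  */

double compute_gvt(void)
{
    long local_diff, global_diff;

    /* Steps 2-4: keep receiving until sends and receives balance globally */
    do {
        drain_pending_mpi_messages();
        local_diff = events_sent - events_recvd;
        MPI_Allreduce(&local_diff, &global_diff, 1,
                      MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
    } while (global_diff != 0);

    /* Steps 5-6: GVT is the global minimum of the local lower bounds */
    double lvt = local_lvt(), gvt;
    MPI_Allreduce(&lvt, &gvt, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);
    return gvt;   /* state/events older than gvt can be fossil collected */
}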

[Figure: Global Control Mechanism (compute Global Virtual Time, GVT): collect versions of state/events and perform I/O operations that are < GVT; virtual time axis over LPs 1-3.]
So, how does this translate into Time Warp
performance on BG/L & BG/P?
53
Outline
  • Intro to DES
  • Time Warp and other PDES Schemes
  • Reverse Computation
  • Blue Gene/L /P
  • Implementation
  • Performance Results
  • PHOLD & TLM
  • Observations on PDES Performance
  • Future Directions

54
Performance Results Setup
  • PHOLD
  • Synthetic benchmark model
  • 1024x1024 grid of LPs
  • Each LP has 10 initial events
  • Events are routed randomly among all LPs based on a
    configurable percent remote parameter (a handler
    sketch follows this list)
  • Time stamps are exponentially distributed with a
    mean of 1.0 (i.e., lookahead is 0).
  • TLM: Transmission Line Matrix
  • Discrete electromagnetic propagation wave model
  • Used to model the physical layer of MANETs
  • As accurate as previous ray tracing models, but
    dramatically faster
  • Considers wave attenuation effects
  • Event population grows cubically outward from
    the single radio source.
  • ROSS parameters
  • GVT_Interval -> number of times through the scheduler
    loop before computing GVT.
  • Batch -> number of local events to process before
    checking the network for new events.
  • Batch x GVT_Interval = events processed per GVT
    epoch
  • KPs -> kernel processes that hold the aggregated
    processed event lists for LPs to lower search
    overheads for fossil collection of old events.
  • Send/Recv Buffers -> number of network events for
    sending or receiving; used as a flow control
    mechanism.
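
For reference, the PHOLD workload itself is tiny: each processed event schedules exactly one new event, on the same LP or, with the configured remote probability, on a random LP, with an exponentially distributed time stamp increment. A generic sketch under those stated assumptions (not the ROSS benchmark source; the schedule_event() hook and the stand-in RNG are placeholders):

#include <stdlib.h>
#include <math.h>

#define NUM_LPS        (1024 * 1024)   /* 1024x1024 grid of LPs                */
#define PERCENT_REMOTE 0.10            /* configurable "percent remote"        */
#define MEAN_INCREMENT 1.0             /* exponential mean; zero lookahead     */

void schedule_event(long dest_lp, double ts);   /* assumed executive hook */

static double exp_increment(void)
{
    double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);   /* stand-in RNG */
    return -MEAN_INCREMENT * log(u);
}

/* PHOLD handler for LP 'self' processing an event at time 'now' */
void phold_event(long self, double now)
{
    long dest = self;
    if ((double)rand() / RAND_MAX < PERCENT_REMOTE)
        dest = rand() % NUM_LPS;                 /* route to a random LP       */
    schedule_event(dest, now + exp_increment()); /* always a positive increment */
}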

55
PHOLD on 8192 BG/L cores
56
PHOLD on 8192 BG/L cores
57
PHOLD on 8192 BG/L cores
58
PHOLD on 8192 BG/L cores
59
7.5 billion ev/sec for 10% remote on 32,768
cores!!
2.7 billion ev/sec for 100% remote on 32,768
cores!!
Stable performance across processor
configurations attributed to near noiseless OS
60
12.27 billion ev/sec for 10% remote on 65,536
cores!!
4 billion ev/sec for 100% remote on 65,536 cores!!
61
Rollback Efficiency = 1 - E_rb / E_net
62
63
Outline
  • Intro to DES
  • Parallel DES via Time Warp
  • Reverse Computation
  • Blue Gene/L /P
  • Implementation
  • Performance Results
  • PHOLD & TLM
  • Observations on PDES Performance
  • Future Directions

64
History of PHOLD Performance
These results are not completely comparable,
which explains the large variation in event rate and
processor efficiency
65
Movies over the Internet Revisited
  • Suppose we want to model 1 million home ISP
    customers over AT&T downloading a 2 GB movie
  • How long to compute with massively parallel DES?
  • 16 trillion events @ 1 billion ev/sec
  • 4.5 hours!!

66
Observations
  • ROSS on Blue Gene indicates billion-events per
    second model are feasible today!
  • Yields significant TIME COMPRESSION of current
    models.
  • LP to PE mapping less of a concern
  • Past systems were very sensitive to this
  • 90 TF systems can yield Giga-scale event
    rates.
  • Tera-event models require teraflop systems.
  • Assumes most of event processing time is spent in
    event-list management (splay tree
    enqueue/dequeue).
  • Potential 10 PF supercomputers will be able to
    model near peta-event systems
  • 100 trillion to 1 quadrillion events in less than
    1.4 to 14 hours
  • Current testbed emulators don't come close to
    this for network modeling and simulation.

67
Outline
  • Intro to DES
  • Parallel DES via Time Warp
  • Reverse Computation
  • Blue Gene/L /P
  • Implementation
  • Performance Results
  • PHOLD & TLM
  • Observations on PDES Performance
  • Future Directions

68
Future Models Enabled by X-Scale Computing
  • Discrete transistor level models for whole
    multi-core architectures
  • Potential for more rapid improvements in
    processor technology
  • Model nearly whole U.S. Internet at packet level
  • Potential to radically improve overall QoS for
    all
  • Model all C4I network/systems for a whole theatre
    of war faster than real time, many times over.
  • Enables real-time, active network control.

69
Future Models Enabled by X-Scale Computing
  • Realistic discrete model of the human brain
  • 100 billion neurons w/ 100 trillion synapses (e.g.,
    connections with huge fan-out)
  • Potential for several exa-events per run
  • Detailed discrete agent-based model for every
    human on the earth for..
  • global economic modeling
  • pandemic flu/disease modeling
  • food / water / energy usage modeling

70
Thank you!!
  • Acknowledgments
  • David Bauer, HPTi
  • David Jefferson, LLNL, for helping us get
    discretionary access to Intrepid @ ALCF
  • Sysadmins: Ray Loy (ANL), Tisha Stacey (ANL) and
    Adam Todorski (CCNI)
  • Sponsors
  • NSF NeTS and CAREER programs
  • ALCF/ANL