Title: Workload Benchmarks that Scale in Multiple Dimensions
1 Workload Benchmarks that Scale in Multiple Dimensions
- John L. Gustafson
- Sun Labs HPC Workload Characterization and System
Analysis Team
2 Goal
- Create a suite of purpose-based benchmarks representative of HPC, which adjust in several dimensions to match real workloads.
- Description must be architecture-independent and language-independent.
- Conjecture: This approach will yield improved predictive methods that are relatively invariant as technology evolves.
3 What Does it Mean to Scale a Workload?
- Performance = Work/Time.
- Work is usually undefined in benchmarks, so it's fixed to avoid the issue.
- FLOPS are not work. Not in 2002, anyway.
- Multiple instances (small, medium, large) don't make a workload scalable.
- True scalability requires an objective function.
4 Latency Tradeoffs Drive Need for New HPC Approaches
- ILLIAC IV, 1970: latency 1000 nsec
- SGI O2000, 1996: latency 800 nsec
- Sun SF15K, 2001: latency 400 nsec
- But traditional complexity analysis counts operations, ignores memory latency!
[Figure: time for one memory access and time for one operation, plotted from 1950 to 2010 on a scale from 1 ns to 1 ms]
5 Moore's Law near limit for micro-processors, and this time we mean it!
- The clock can't make it across a 20 mm die at current GHz rates
- Either we go to multiple cores or use Non-Uniform Cache Access (NUCA)
- This is a physical limit, not technological
[Figure: process generations 130 nm, 100 nm, 70 nm, 35 nm]
Source: Chuck Moore, UT-Austin
6 Bandwidth Burns Energy: Maybe Our Measure of Work is Joules?
Source: Bill Dally, Stanford
7 Purpose-Based Benchmarks
- A purpose-based benchmark states an objective function that has direct interest to humans.
- An activity-based benchmark states computer operations to be performed, usually defined by source code.
- Either can be made to scale in multiple dimensions, but it's harder to do with activity-based benchmarks.
8 Means-Based vs Ends-Based Metrics
9 Two-Way Taxonomy Examples
[2x2 table: Fixed-Size vs. Scalable against Activity-Based vs. Purpose-Based]
10 Example of Prediction Failure
11 Peak FLOPS as Inverse Predictor
12 Why F.P. Op Counts Make Poor Workload Definitions
13 Peak Bandwidth Sometimes Works Surprisingly Well
14 An HPC Taxonomy
- For each of these, we seek to match
  - Problem size
  - Data locality
  - Predominant data type
  - Dynamic behavior in time
  - Spatial irregularity
  - Demands on I/O
- To avoid the toy benchmark problem.
Application areas: Electronic Design; Nuclear Applications; Mechanical Design; Mechanical Engineering; Radar Cross-Section; Crash Simulation; Fluid Dynamics; Weather, Climate modeling; Signal/Image Processing; Encryption; Life Sciences; Financial Modeling; Petroleum
15 Some Invariant Workload Dimensions
[Figure: diagram labels include Workload Category (Real-Time, Physical Simulation, Math), Data Parallelism (Highly data-parallel), No locality of reference, and data types (Boolean, Small Discrete, Large Discrete, Low-precision continuous)]
16 Example of Workload Dimension: Data Parallelism (estimated)
Data-parallel content, from 100% down to 0%:
- Gene-matching, SETI, factoring large integers
- Vibrational analysis, ray-tracing
- Dense matrix multiply, factoring; convolution and filtering
- Eulerian fluid dynamics (decomposed spatially)
- N-Body problems
- Stress-strain analysis with finite elements; crash testing
- Lagrangian fluid dynamics (decomposed by fluid element)
- Easy databases (conflict-free, few updates)
- FFTs (frequency-space filtering)
- Particle simulations with imposed fields, game-tree exploration
- Circuit simulation
- Economic simulations
- Typical database applications; transaction processing
17 Prior Art: HINT English Description of Scalable Task
18 Region of Computation as Workload Scales
19 Serial Systems Used to Test Model
20 Curve Crossings Predict Inconsistent Rankings
21 Shared Memory Systems Tested
22 Here, Differences are Not Subtle!
23 Parallel Systems Tested
24 HINT Curves for Parallel Systems
25 SPECint Correlation with HINT
26 SPECfp Correlation with HINT
27 SPEC Linear Fits
28 Heuristic App Profile: hydro2d
29 Worst Prediction: FT, from NPB
30 Scalability Allows >0.995 Correlation with Other Benchmarks
You know those EnergyGuide stickers you see on refrigerators, water heaters, air conditioners, etc.?
31 Wouldn't it be great if they had something like this at retail computer stores?
32 Benchmark 1: Truss Design
[Figure: a load point (L, y0, z0) carrying 100 tons, L meters from three attachment points (0, y1, z1), (0, y2, z2), (0, y3, z3)]
TASK: Given a point that must support a 100 ton load and three attachment points, find the geometry of struts and cables that creates the lightest structure. Each node requires 1 kg of steel. Structure must support its own weight.
33 Strut Design (continued)
[Figure: axis label Stiffness (strength)]
More complex structures have higher strength. The benchmark initially sets the topology, and then perturbs the xyz positions of node points to optimize the resulting total mass required.
34 Strut Algorithm (Preprocess)
Read problem geometry and N = number of vertices from nonvolatile storage.
Iterate until there are N vertices:
  Sort edges (cables or struts) by length.
  Bisect longest edge, creating new vertex.
  Adjust vertex position to make it non-collinear.
  Add two non-coplanar edges between new vertex and neighbor vertices.
  Compute new lengths of edges.
Save entire mesh description to nonvolatile storage.
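The preprocessing steps above can be sketched in Python as follows. This is a minimal illustration, not the benchmark's actual code. The name `refine_mesh`, the `nudge` parameter, and the rule of linking each new vertex to its two nearest other vertices are assumptions. The sketch does not verify that the added edges are non-coplanar, and it omits reading and saving the mesh.

```python
import math

def refine_mesh(vertices, edges, n_target, nudge=0.05):
    """Bisect the longest edge repeatedly until the mesh has n_target vertices.

    vertices: list of (x, y, z); edges: list of (i, j) vertex index pairs.
    Needs at least four starting vertices so two neighbors are available.
    """
    vertices = [list(v) for v in vertices]
    edges = [tuple(e) for e in edges]

    def length(e):
        # Lengths are recomputed on demand from the current positions.
        return math.dist(vertices[e[0]], vertices[e[1]])

    while len(vertices) < n_target:
        edges.sort(key=length)              # sort edges by length
        a, b = edges.pop()                  # the longest edge is last
        d = [q - p for p, q in zip(vertices[a], vertices[b])]
        mid = [(p + q) / 2 for p, q in zip(vertices[a], vertices[b])]
        # Assumed rule: shift the midpoint along the axis least aligned
        # with the edge, so the new vertex is not collinear with a and b.
        k = min(range(3), key=lambda i: abs(d[i]))
        mid[k] += nudge * math.hypot(*d)
        m = len(vertices)
        vertices.append(mid)
        edges += [(a, m), (m, b)]           # the two halves of the old edge
        # Assumed rule: two extra edges to the nearest other vertices.
        others = sorted((i for i in range(m) if i not in (a, b)),
                        key=lambda i: math.dist(vertices[i], mid))
        edges += [(i, m) for i in others[:2]]
    return vertices, edges

# Hypothetical starting geometry: one load point and three attachment points.
start_vertices = [(10, 0, 0), (0, 0, 2), (0, 2, -1), (0, -2, -1)]
start_edges = [(0, 1), (0, 2), (0, 3)]
mesh_vertices, mesh_edges = refine_mesh(start_vertices, start_edges, n_target=6)
```

Each pass adds one vertex and a net three edges, so the vertex count N directly controls problem size.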
35 Strut Algorithm (Inner Solver)
For each edge:
  For each edge that touches the same vertex:
    Project edge vectors to this edge to obtain Aij.
Compute external force from edge weights + load.
Solve linear system such that Σ vertex forces = 0.
Compute required cross-sections of cables and struts needed.
Return the total weight of the truss.
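As a rough illustration of the force-balance step, the sketch below handles the smallest possible case: one loaded node held by three members running to fixed anchors. It is not the benchmark's solver. The names `solve3` and `truss_mass`, the allowable stress, and the steel density are assumptions, and member self-weight is ignored here even though the task statement requires it.

```python
import math

def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    n = 3
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def truss_mass(node, anchors, load, stress=250e6, density=7850.0):
    """Return (total member mass in kg, member forces in N) for one loaded node.

    stress (Pa) and density (kg/m^3) are assumed values for steel.
    A positive force means tension (the member pulls the node toward its anchor).
    """
    units, lengths = [], []
    for anchor in anchors:
        d = [ai - ni for ai, ni in zip(anchor, node)]
        L = math.hypot(*d)
        units.append([c / L for c in d])
        lengths.append(L)
    # Column j of A holds the unit vector of member j (the projections A_ij).
    A = [[units[j][i] for j in range(3)] for i in range(3)]
    # Force balance at the node: sum_j f_j * u_j + load = 0.
    forces = solve3(A, [-c for c in load])
    # Size each member so its stress equals the allowable stress.
    areas = [abs(f) / stress for f in forces]
    mass = sum(density * a * L for a, L in zip(areas, lengths))
    return mass, forces
```

A full truss would assemble one such balance per free vertex into a larger system; the 3x3 case only shows how the projections and the load enter the equations.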
36 Strut Optimization: Point Method
Iterate until time limit reached:
  Pick a strut (randomly or sequentially).
  Vary xyz coordinates of a vertex, adjusting edge lengths that connect.
  Modify equations and re-solve.
  If truss weight is reduced, keep the modification.
  Else restore original mass distribution (or use annealing).
Report best solution found for structure.
This imitates the actions of a human engineer exploring a design space. Note the possibility of massive parallelism at the job level: each processor can try a different variation of the structure. The best solution found is then shared globally and used as a new starting point.
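The accept-or-restore loop above might look like this in Python. It is a sketch under stated assumptions: `point_search` and `toy_weight` are hypothetical names, the quadratic `toy_weight` merely stands in for a real truss solve, and a fixed trial count replaces the time limit so that runs are repeatable.

```python
import random

def point_search(weight, x0, trials=2000, step=0.1, seed=0):
    """Perturb one coordinate at a time; keep the change only if weight drops."""
    rng = random.Random(seed)
    best = list(x0)
    best_w = weight(best)
    for _ in range(trials):
        i = rng.randrange(len(best))        # pick a coordinate to vary
        old = best[i]
        best[i] = old + rng.uniform(-step, step)
        w = weight(best)                    # "modify equations and re-solve"
        if w < best_w:
            best_w = w                      # improvement: keep the modification
        else:
            best[i] = old                   # otherwise restore the original
    return best, best_w

def toy_weight(x):
    # Stand-in objective (assumption): minimum value 5.0 at x = (1, 1, 1).
    return sum((xi - 1.0) ** 2 for xi in x) + 5.0

best_x, best_w = point_search(toy_weight, [0.0, 0.0, 0.0])
```

For job-level parallelism, each processor could run `point_search` with a different `seed` and the lowest `best_w` could seed the next round.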
37 Strut Optimization: Interval Method
Set initial bounds on vertex positions (can be very conservative).
Iterate until time limit reached:
  Subdivide M-dimensional space of vertex positions into subregions.
  For each subregion:
    Compute the truss weight, as an interval bound.
  Share bounds globally to exclude subregions from search.
Report best solution (range) found for structure.
This replaces trial-and-error with rigorous exclusion of infeasible sets. Massive parallelism is easy: each processor can try a different subspace.
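A small Python sketch of interval exclusion follows, using a two-variable toy objective in place of the truss weight. All names here (`sq_bounds`, `toy_weight_bounds`, `split`, `interval_search`) are hypothetical, a fixed number of rounds replaces the time limit, and the arithmetic uses ordinary floating point without outward rounding, so the bounds are not rigorous in the way a real interval library would make them.

```python
def sq_bounds(lo, hi):
    """Range of t*t for t in [lo, hi]."""
    ends = (lo * lo, hi * hi)
    return (0.0 if lo <= 0.0 <= hi else min(ends)), max(ends)

def toy_weight_bounds(box):
    # Stand-in objective (assumption): (x - 1)^2 + (y + 0.5)^2 + 5.
    (x0, x1), (y0, y1) = box
    a = sq_bounds(x0 - 1.0, x1 - 1.0)
    b = sq_bounds(y0 + 0.5, y1 + 0.5)
    return a[0] + b[0] + 5.0, a[1] + b[1] + 5.0

def split(box):
    """Bisect a box (tuple of (lo, hi) pairs) along its widest dimension."""
    k = max(range(len(box)), key=lambda i: box[i][1] - box[i][0])
    lo, hi = box[k]
    mid = (lo + hi) / 2
    return (box[:k] + ((lo, mid),) + box[k + 1:],
            box[:k] + ((mid, hi),) + box[k + 1:])

def interval_search(bounds_fn, box, rounds=30):
    """Return ((lower, upper) bounds on the minimum, surviving boxes)."""
    live = [tuple(box)]
    for _ in range(rounds):
        children = [c for b in live for c in split(b)]
        bounded = [(bounds_fn(c), c) for c in children]
        # The smallest upper bound is the value shared globally.
        best_upper = min(hi for (lo, hi), _ in bounded)
        # A box whose lower bound exceeds it cannot contain the minimum.
        live = [c for (lo, hi), c in bounded if lo <= best_upper]
    final = [bounds_fn(b) for b in live]
    return (min(lo for lo, _ in final), min(hi for _, hi in final)), live

(min_lo, min_hi), boxes = interval_search(
    toy_weight_bounds, ((-4.0, 4.0), (-4.0, 4.0)))
```

The returned pair brackets the true minimum, which is the "range" the slide refers to; in a parallel run each processor would bound its own share of `children` and exchange only `best_upper`.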
38 Strut Optimization Dimensions
- Number of vertices (if too many, fewer trials)
- Number of trials (if too many, fewer vertices)
- Type of solver (iterative, direct)
- Precision of solver (match to workload!)
- Search strategy (point, point parallel, interval
exclusion)
39 Benchmark 2: Radiosity
[Figure: room geometry with dimension labels 5, 4, 1, 4, 2, 1, 3, 1]
TASK: Given the geometry above and three 1 by 1 diffuse light sources, find the placement of the light sources that results in the most even illumination of the bottom surface. All surfaces are Lambertian reflectors. Reflectivity of the top surface, lights, and vertical surfaces is 0.95; reflectivity of the bottom surface and the occluder is 0.70. Figure of merit: brightest/darkest ratio.
40 Radiosity (continued)
Surfaces are subdivided into patches. Using more patches gives a better result. The point method uses Monte Carlo and only subdivides the bottom surface once; the interval method uses an iterative solver and recursively subdivides all surfaces.
41 Radiosity Solver (Point)
Read problem geometry and N = number of patches from nonvolatile storage.
Subdivide bottom surface into N patches.
Until all patches have three-sigma confidence:
  Fire a random photon from a light source.
  Track reflections using probabilities until photon is absorbed.
  If absorbed in a bottom surface patch, increment histogram.
Find maximum and minimum photon counts.
Save bottom surface radiosities to nonvolatile storage.
Compute ratio.
Note: Highly parallel if care is used to create independent random number generation. Easy tests for occlusion compared to interval method.
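The photon loop can be illustrated with a deliberately crude Python sketch. It has no real geometry or occlusion: each bounce lands on the bottom surface with an assumed fixed probability `p_hit_bottom`, and the bottom patch is chosen uniformly. The occluder is ignored, and a fixed photon count replaces the three-sigma stopping rule. Only the two reflectivities come from the task statement; `photon_histogram` is a hypothetical name.

```python
import random

def photon_histogram(n_patches, n_photons, p_hit_bottom=0.25, seed=0):
    """Count photons absorbed in each bottom-surface patch (toy model)."""
    rng = random.Random(seed)
    # Reflectivities from the task statement; "other" lumps together the
    # top surface, lights, and vertical surfaces.
    reflectivity = {"bottom": 0.70, "other": 0.95}
    counts = [0] * n_patches
    for _ in range(n_photons):
        while True:
            # Assumption: the landing surface is drawn at random, not traced.
            surface = "bottom" if rng.random() < p_hit_bottom else "other"
            if rng.random() >= reflectivity[surface]:
                # Absorbed: record it only if it ended on the bottom surface.
                if surface == "bottom":
                    counts[rng.randrange(n_patches)] += 1
                break
            # Otherwise the photon is reflected and bounces again.
    return counts

counts = photon_histogram(n_patches=4, n_photons=20000)
ratio = max(counts) / min(counts)   # figure of merit: brightest/darkest
```

Because this toy treats every patch identically, the ratio only reflects sampling noise; a real solver would replace the random landing rule with ray tracing against the room geometry. Giving each worker its own `seed` is one simple way to keep parallel random streams separate.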
42 Radiosity Solver (Interval)
Read problem geometry from nonvolatile storage.
Create initial subdivision into large rectangles.
Set up form factor matrix, and initial bounds on radiosity.
Until all patch intervals are within 1% of the lightest-darkest range:
  Subdivide the patch with the largest uncertainty.
  Use radiosity equation (contractive mapping) to find new bounds.
  Compute lightest-darkest range on bottom surface.
Save subdivision geometry and patch ranges to nonvolatile storage.
Compute ratio.
Note: Parallel if asynchronous updating of radiosity is allowed (runs will not be exactly repeatable but will always converge). Closed-form expressions for form factors exist for this problem.
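The contractive-mapping step can be sketched as a bound-tightening iteration on the standard radiosity equation, B_i = E_i + rho_i * sum_j F_ij * B_j. The Python below is an illustration only: `radiosity_bounds` is a hypothetical name, the three-patch form factor matrix is made up, patches are never subdivided, updates are synchronous, and plain floating point is used without outward rounding. It assumes non-negative inputs, reflectivities below 1, and form factor rows that sum to at most 1.

```python
def radiosity_bounds(emission, reflectivity, form, tol=1e-6):
    """Shrink [lo, hi] bounds on each patch radiosity until width <= tol."""
    n = len(emission)
    lo = [0.0] * n
    # Valid upper bound under the stated assumptions: max(E) / (1 - max(rho)).
    hi = [max(emission) / (1.0 - max(reflectivity))] * n

    def apply(b):
        # One application of the radiosity equation to a vector of values.
        return [emission[i] + reflectivity[i] *
                sum(form[i][j] * b[j] for j in range(n))
                for i in range(n)]

    # The map is monotone and contractive, so applying it to both ends
    # keeps the true solution bracketed while the interval shrinks.
    while max(h - l for l, h in zip(lo, hi)) > tol:
        lo, hi = apply(lo), apply(hi)
    return lo, hi

# Hypothetical data: patch 0 is a light, patch 2 is the bottom surface.
E = [1.0, 0.0, 0.0]
rho = [0.95, 0.95, 0.70]
F = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0]]
lo, hi = radiosity_bounds(E, rho, F)
```

Updating entries in place, in any order, would mimic the asynchronous variant mentioned in the note: the bounds would still tighten, but the sequence of intermediate values would depend on scheduling.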
43 Radiosity Optimization: Point
Iterate until time limit reached:
  Pick a light source coordinate.
  Vary it by some small amount, like 0.1 meter.
  Re-solve the radiosity problem using the point method.
  Compare radiation evenness on bottom surface.
  If ratio is closer to 1.0, keep the modification.
  Else restore original light position.
Report best solution found for position of lights.
As before, this allows job parallelism if information about the search is shared by all processors after each run.
44 Radiosity Optimization: Interval
Set initial bounds on light positions (must be on ceiling, disjoint).
Iterate until time limit reached:
  Subdivide the 6-dimensional space of light positions.
  For each subset of the search space:
    Bound the ratio of lightest/darkest surface patch.
  Share bounds globally to exclude subregions from search.
Report best solution (range) found for light positions.
Parallelism exists at the job level and within each solver step.
45 Benchmark 3: Life Sciences (Proteomics)
Given a sequence of N peptides and a time limit T, find the minimum energy conformation of the peptide sequence.
Figure of merit: N, or N/T. This approaches protein folding as N grows. Answer validity can be tested against experiment. Currently, N = 55 is the frontier.
46 Benchmark 4: Electronic Design
Design an N-bit adder with carry look-ahead in a given process technology (like 0.10 micron, FO4, 6-layer Cu interconnect). Simulate with a cycle-based simulator for a complete set of test vectors. Optimize to minimize clock cycle and chip area.
Figure of merit: Clock speed or area. This captures both integer (logical) and floating-point (analog) aspects of electronic design.
47 Benchmark 5: Financial Modeling
Generate real-time market behavior drawn from historical data to drive workload. Execute trades based on estimates of future value for N financial instruments over a period T.
Objective function: Profit!
48 Benchmark 6: Weather Modeling
Generate real-time weather behavior drawn from historical data to drive workload. Predict weather (temperature, precipitation, cloud cover, pressure, wind speed) for N days in advance.
Objective function: Minimum total log(error)/time.
49 Benchmark 7: Petroleum Reservoir Management
Given a geological structure containing oil, water, gas, and a set of M injector wells and N extraction wells, position the wells to maximize the total oil and gas extracted over a period of time T.
Objective function: Maximum fuel extracted.
50 SUMMARY
- Scalability is much easier if the workloads are purpose-based.
- Multiple dimensions of scalability arise naturally as adjustable parameters.
- Predictive value looks promising based on prior experience with HINT.
- We will share our HPC workload benchmarks with the HPC community when completed.