Workload Benchmarks that Scale in Multiple Dimensions

About This Presentation

Title:

Workload Benchmarks that Scale in Multiple Dimensions

Slides: 51

Provided by: johngus8

Learn more at: https://iiswc.org

Transcript and Presenter's Notes

1
Workload Benchmarks that Scale in Multiple
Dimensions

John L. Gustafson
Sun Labs HPC Workload Characterization and System
Analysis Team

2
Goal

Create a suite of purpose-based benchmarks
representative of HPC, which adjust in several
dimensions to match real workloads.
Description must be architecture-independent and
language-independent.
Conjecture: This approach will yield improved
predictive methods that are relatively invariant
as technology evolves.

3
What Does it Mean to Scale a Workload?

Performance = Work/Time.
Work is usually undefined in benchmarks, so it's
fixed to avoid the issue.
FLOPS are not work. Not in 2002, anyway.
Multiple instances (small, medium, large) don't
make a workload scalable.
True scalability requires an objective function.

4
Latency Tradeoffs Drive Need for New HPC
Approaches

ILLIAC IV, 1970: latency 1000 nsec
SGI O2000, 1996: latency 800 nsec
Sun SF15K, 2001: latency 400 nsec
...but traditional complexity analysis counts
operations and ignores memory latency!

[Figure: time for one memory access and time for one operation, 1950 to 2010; time axis with ticks at 1 ms, 1 µs, and 1 ns]
5
Moore's Law near limit for microprocessors, and
this time we mean it!

The clock can't make it across a 20 mm die at
current GHz rates.
Either we go to multiple cores or use Non-Uniform
Cache Access (NUCA).
This is a physical limit, not a technological one.

[Figure: process generations 130 nm, 100 nm, 70 nm, 35 nm]
Source: Chuck Moore, UT-Austin
6
Bandwidth Burns Energy: Maybe Our Measure of Work
is Joules?
Source: Bill Dally, Stanford
7
Purpose-Based Benchmarks

A purpose-based benchmark states an objective
function that has direct interest to humans.
An activity-based benchmark states computer
operations to be performed, usually defined by
source code.
Either can be made to scale in multiple
dimensions, but it's harder to do with
activity-based benchmarks.

8
Means-Based vs Ends-Based Metrics
9
Two-Way Taxonomy Examples
[Table: Fixed-Size vs. Scalable on one axis, Activity-Based vs. Purpose-Based on the other; entries not included in this transcript]
10
Example of Prediction Failure
11
Peak FLOPS as Inverse Predictor
12
Why F.P. Op Counts Make Poor Workload Definitions
13
Peak Bandwidth Sometimes Works Surprisingly Well
14
An HPC Taxonomy

For each of the application areas listed below, we seek to match:
Problem size
Data locality
Predominant data type
Dynamic behavior in time
Spatial irregularity
Demands on I/O
to avoid the toy benchmark problem.

Application areas:
Electronic Design
Nuclear Applications
Mechanical Design
Mechanical Engineering
Radar Cross-Section
Crash Simulation
Fluid Dynamics
Weather, Climate modeling
Signal/Image Processing
Encryption
Life Sciences
Financial Modeling
Petroleum
15
Some Invariant Workload Dimensions
[Figure labels: Workload Category; Real-Time; Physical Simulation; Math; Data Parallelism; Highly data-parallel; No locality of reference; Boolean; Small Discrete; Large Discrete; Low-precision continuous]
16
Example of Workload Dimension: Data Parallelism
(estimated)
Ordered from 100% data-parallel content down to 0%:
Gene-matching, SETI, factoring large integers
Vibrational analysis, ray-tracing
Dense matrix multiply, factoring, convolution and filtering
Eulerian fluid dynamics (decomposed spatially)
N-Body problems
Stress-strain analysis with finite elements, crash testing
Lagrangian fluid dynamics (decomposed by fluid element)
Easy databases (conflict-free, few updates)
FFTs (frequency-space filtering)
Particle simulations with imposed fields, game-tree exploration
Circuit simulation
Economic simulations
Typical database applications, transaction processing
17
Prior Art: HINT. English Description of Scalable
Task
18
Region of Computation as Workload Scales
19
Serial Systems Used to Test Model
20
Curve Crossings Predict Inconsistent Rankings
21
Shared Memory Systems Tested
22
Here, Differences are Not Subtle!
23
Parallel Systems Tested
24
HINT Curves for Parallel Systems
25
SPECint Correlation with HINT
26
SPECfp Correlation with HINT
27
SPEC Linear Fits
28
Heuristic App Profile: hydro2d
29
Worst Prediction: FT, from NPB
30
Scalability Allows >0.995 Correlation with Other
Benchmarks
You know those EnergyGuide stickers you see on
refrigerators, water heaters, air
conditioners, etc.?
31
Wouldn't it be great if they had something like
this at retail computer stores?
32
Benchmark 1: Truss Design
[Figure: a load point at (L, y0, z0) carrying 100 tons, L meters from three attachment points at (0, y1, z1), (0, y2, z2), and (0, y3, z3)]
TASK: Given a point that must support a 100-ton
load and three attachment points, find the
geometry of struts and cables that creates the
lightest structure. Each node requires 1 kg of
steel. The structure must support its own weight.
33
Strut Design, continued
Stiffness (strength)
More complex structures have higher strength. The
benchmark initially sets the topology, and then
perturbs the xyz positions of node points to
optimize the resulting total mass required.
34
Strut Algorithm (Preprocess)
Read problem geometry and N = number of vertices
from nonvolatile storage.
Iterate until there are N vertices:
  Sort edges (cables or struts) by length.
  Bisect longest edge, creating new vertex.
  Adjust vertex position to make it non-collinear.
  Add two non-coplanar edges between new vertex
  and neighbor vertices.
  Compute new lengths of edges.
Save entire mesh description to nonvolatile storage.
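
The refinement loop above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the benchmark's reference code: the function name `refine_mesh`, the `nudge` parameter, the rule for offsetting the midpoint, and the choice of the two nearest other vertices as "neighbors" are all hypothetical. The coplanarity test and the file I/O steps are omitted.

```python
import math

def refine_mesh(vertices, edges, target_count, nudge=0.05):
    """Bisect the longest edge until the mesh has target_count vertices.

    vertices: list of (x, y, z) tuples; edges: list of (i, j) index pairs.
    Assumes at least four starting vertices and no zero-length edges.
    """
    verts = [tuple(v) for v in vertices]
    edges = [tuple(e) for e in edges]

    def length(edge):
        return math.dist(verts[edge[0]], verts[edge[1]])

    while len(verts) < target_count:
        edges.sort(key=length)          # sort edges by length
        i, j = edges.pop()              # the longest edge is last
        a, b = verts[i], verts[j]
        mid = [(p + q) / 2 for p, q in zip(a, b)]
        # Assumed rule: shift the midpoint along the axis on which the edge
        # direction is smallest, so the new vertex is not collinear with a, b.
        direction = [q - p for p, q in zip(a, b)]
        axis = min(range(3), key=lambda k: abs(direction[k]))
        mid[axis] += nudge * length((i, j))
        m = len(verts)
        verts.append(tuple(mid))
        edges += [(i, m), (m, j)]       # the two halves of the bisected edge
        # Assumed neighbor choice: the two nearest vertices other than i and j.
        others = [v for v in range(m) if v not in (i, j)]
        others.sort(key=lambda v: math.dist(verts[v], verts[m]))
        edges += [(m, v) for v in others[:2]]
    return verts, edges
```

Each pass adds one vertex and a net three edges, so the vertex count N directly sets the size of the linear system the inner solver must handle.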
35
Strut Algorithm (Inner Solver)
For each edge:
  For each edge that touches the same vertex:
    Project edge vectors to this edge to obtain Aij.
  Compute external force from edge weights and load.
Solve linear system such that Σ vertex forces = 0.
Compute required cross-sections of cables and
struts needed.
Return the total weight of the truss.
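
For the simplest topology (one load point tied to three attachment points), the force-balance solve can be sketched as below. This is an illustrative sketch only: `solve3` and `truss_mass` are hypothetical names, the 250 MPa stress limit and 7850 kg/m^3 density are assumed values for structural steel, and self-weight, node mass, and strut buckling are ignored.

```python
import math

def solve3(a, b):
    """Solve a 3x3 system a x = b by Gaussian elimination with partial pivoting."""
    n = 3
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def truss_mass(load_point, anchors, load_force,
               stress_limit=250e6, density=7850.0):
    """Return (member forces in N, total member mass in kg).

    Positive force means tension (a cable would do), negative means
    compression (a strut is needed).
    """
    units, lengths = [], []
    for anchor in anchors:
        d = [q - p for p, q in zip(load_point, anchor)]
        ln = math.sqrt(sum(c * c for c in d))
        lengths.append(ln)
        units.append([c / ln for c in d])
    # Equilibrium at the load point: sum_k t_k * u_k + F = 0.
    a = [[units[k][row] for k in range(3)] for row in range(3)]
    rhs = [-f for f in load_force]
    forces = solve3(a, rhs)
    # Cross-section = |force| / stress limit; mass = density * area * length.
    mass = sum(density * abs(t) / stress_limit * ln
               for t, ln in zip(forces, lengths))
    return forces, mass
```

A call such as `truss_mass((0, 0, 0), [(1, 0, 0), (0, 1, 0), (0, 0, 1)], (0.0, 0.0, -1000.0))` puts the whole load on the vertical member. Larger meshes need one such balance per vertex, which is where the choice of direct or iterative solver starts to matter.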
36
Strut Optimization: Point Method
Iterate until time limit reached:
  Pick a strut (randomly or sequentially).
  Vary xyz coordinates of a vertex, adjusting the
  edge lengths that connect.
  Modify equations and re-solve.
  If truss weight is reduced, keep the modification.
  Else restore original mass distribution (or use
  annealing).
Report best solution found for structure.
This imitates the actions of a human engineer
exploring a design space. Note the possibility of
massive parallelism at the job level: each
processor can try a different variation of the
structure. The best solution found is then shared
globally and used as a new starting point.
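
The accept-or-restore loop of the point method can be sketched as a simple random search. This is a hedged illustration: `point_search` is a hypothetical name, a fixed trial count stands in for the time limit, and the `weight` callable stands in for "modify equations and re-solve". Annealing is not shown.

```python
import random

def point_search(weight, start, trials, step=0.1, seed=0):
    """Perturb one coordinate at a time; keep a change only if weight drops."""
    rng = random.Random(seed)
    best = list(start)
    best_weight = weight(best)
    for _ in range(trials):                  # stand-in for the time limit
        candidate = list(best)
        k = rng.randrange(len(candidate))    # pick a coordinate to vary
        candidate[k] += rng.uniform(-step, step)
        w = weight(candidate)                # stand-in for the re-solve
        if w < best_weight:
            best, best_weight = candidate, w # keep the modification
        # otherwise the candidate is dropped, which restores the original
    return best, best_weight

def toy_weight(position):
    """Hypothetical weight function with its minimum of 5.0 at all ones."""
    return 5.0 + sum((x - 1.0) ** 2 for x in position)
```

Job-level parallelism in this sketch amounts to running `point_search` with a different `seed` on each processor and restarting every worker from whichever result has the lowest weight.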
37
Strut Optimization: Interval Method
Set initial bounds on vertex positions (can be
very conservative).
Iterate until time limit reached:
  Subdivide M-dimensional space of vertex
  positions into subregions.
  For each subregion:
    Compute the truss weight, as an interval bound.
  Share bounds globally to exclude subregions
  from search.
Report best solution (range) found for structure.
This replaces trial-and-error with rigorous
exclusion of infeasible sets. Massive parallelism
is easy: each processor can try a different
subspace.
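
The exclusion idea can be sketched in one dimension. This is a minimal sketch under assumptions: `interval_search`, `weight_bounds`, and the toy weight w(x) = (x - 1)^2 + 5 are hypothetical, a round count stands in for the time limit, and ordinary floating point is used without outward rounding, so the bounds are not fully rigorous in the way a true interval library would make them.

```python
def interval_square(lo, hi):
    """Bounds on x*x for every x in [lo, hi]."""
    if lo <= 0.0 <= hi:
        return 0.0, max(lo * lo, hi * hi)
    return min(lo * lo, hi * hi), max(lo * lo, hi * hi)

def weight_bounds(lo, hi):
    """Interval bound on the toy weight w(x) = (x - 1)^2 + 5 over [lo, hi]."""
    sq_lo, sq_hi = interval_square(lo - 1.0, hi - 1.0)
    return sq_lo + 5.0, sq_hi + 5.0

def interval_search(bounds_fn, lo, hi, rounds):
    """Subdivide, bound each subregion, and discard regions that cannot win."""
    regions = [(lo, hi)]
    for _ in range(rounds):                  # stand-in for the time limit
        halves = []
        for a, b in regions:
            mid = (a + b) / 2.0
            halves.extend([(a, mid), (mid, b)])
        bounded = [(bounds_fn(a, b), (a, b)) for a, b in halves]
        # The smallest upper bound is the value that would be shared globally.
        best_upper = min(upper for (_, upper), _ in bounded)
        # A region whose lower bound exceeds it cannot hold the minimum.
        regions = [region for (lower, _), region in bounded
                   if lower <= best_upper]
    lower = min(bounds_fn(a, b)[0] for a, b in regions)
    upper = min(bounds_fn(a, b)[1] for a, b in regions)
    return regions, (lower, upper)
```

The result is a range that encloses the minimum weight plus the list of surviving subregions, rather than a single best point. Each subregion can be bounded independently, which is the source of the easy parallelism.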
38
Strut Optimization Dimensions

Number of vertices (if too many, fewer trials)
Number of trials (if too many, fewer vertices)
Type of solver (iterative, direct)
Precision of solver (match to workload!)
Search strategy (point, point parallel, interval
exclusion)

39
Benchmark 2: Radiosity
[Figure: problem geometry with labeled dimensions; not reproduced in this transcript]
TASK: Given the geometry above and three 1 by 1
diffuse light sources, find the placement of the
light sources that results in the most even
illumination of the bottom surface. All surfaces
are Lambertian reflectors. Reflectivity of the
top surface, lights, and vertical surfaces is
0.95; reflectivity of the bottom surface and the
occluder is 0.70.
Figure of merit: brightest/darkest ratio.
40
Radiosity, continued
Surfaces are subdivided into patches. Using more
patches gives a better result. The point method
uses Monte Carlo and only subdivides the bottom
surface once; the interval method uses an
iterative solver and recursively subdivides all
surfaces.
41
Radiosity Solver (Point)
Read problem geometry and N = number of patches
from nonvolatile storage.
Subdivide bottom surface into N patches.
Until all patches have three-sigma confidence:
  Fire a random photon from a light source.
  Track reflections using probabilities until
  photon is absorbed.
  If absorbed in a bottom surface patch,
  increment histogram.
Find maximum and minimum photon counts.
Save bottom surface radiosities to nonvolatile
storage.
Compute ratio.
Note: Highly parallel if care is used to create
independent random number generation. Easy tests
for occlusion compared to interval method.
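
The photon loop can be sketched with the geometry abstracted away. This is a toy model, not the benchmark: `photon_histogram` is a hypothetical name, the fixed probability `p_bottom` that a bounce lands on the bottom surface replaces real ray tracing and occlusion tests, bottom patches are hit uniformly, and the three-sigma test assumes Poisson counting error (relative three-sigma error of a count c is about 3 / sqrt(c)).

```python
import random

def photon_histogram(n_patches, tolerance=0.2, seed=0, p_bottom=0.3,
                     bottom_reflect=0.70, other_reflect=0.95):
    """Count photons absorbed per bottom patch; return (counts, max/min)."""
    rng = random.Random(seed)   # one independent generator per worker
    counts = [0] * n_patches
    # Stop once every patch has enough hits for the requested tolerance.
    needed = (3.0 / tolerance) ** 2
    while min(counts) < needed:
        while True:             # follow one photon until it is absorbed
            hits_bottom = rng.random() < p_bottom
            reflect = bottom_reflect if hits_bottom else other_reflect
            if rng.random() >= reflect:          # photon absorbed here
                if hits_bottom:
                    counts[rng.randrange(n_patches)] += 1
                break
    return counts, max(counts) / min(counts)
```

Because each photon is independent, workers can run this loop with distinct seeds and sum their histograms, which is the care with random number generation that the note refers to.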
42
Radiosity Solver (Interval)
Read problem geometry from nonvolatile storage.
Create initial subdivision into large rectangles.
Set up form factor matrix, and initial bounds on
radiosity.
Until all patch intervals are less than 1% of the
lightest-darkest range:
  Subdivide the patch with the largest
  uncertainty.
  Use radiosity equation (contractive mapping) to
  find new bounds.
  Compute lightest-darkest range on bottom
  surface.
Save subdivision geometry and patch ranges to
nonvolatile storage.
Compute ratio.
Note: Parallel if asynchronous updating of
radiosity is allowed (runs will not be exactly
repeatable but will always converge). Closed-form
expressions for form factors exist for this
problem.
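
The bound-tightening step can be sketched with the standard radiosity equation, B_i = E_i + rho_i * sum_j F_ij * B_j, applied separately to lower and upper bounds. This is an illustrative sketch only: `tighten_bounds` is a hypothetical name, the patch set is fixed (no subdivision), each row of the form factor matrix is assumed to sum to at most 1, every reflectivity is assumed to be below 1, and floating-point rounding is not directed outward.

```python
def tighten_bounds(emission, reflectivity, form_factors, sweeps):
    """Return (lo, hi) lists that bracket the radiosity of each patch."""
    n = len(emission)
    lo = [0.0] * n
    # Valid starting upper bound under the stated assumptions:
    # B_max <= E_max + rho_max * B_max, so B_max <= E_max / (1 - rho_max).
    cap = max(emission) / (1.0 - max(reflectivity))
    hi = [cap] * n
    for _ in range(sweeps):
        # All coefficients are non-negative, so the map is monotone:
        # lower bounds produce lower bounds, upper bounds produce upper bounds.
        new_lo = [emission[i] + reflectivity[i] *
                  sum(form_factors[i][j] * lo[j] for j in range(n))
                  for i in range(n)]
        new_hi = [emission[i] + reflectivity[i] *
                  sum(form_factors[i][j] * hi[j] for j in range(n))
                  for i in range(n)]
        lo = [max(old, new) for old, new in zip(lo, new_lo)]
        hi = [min(old, new) for old, new in zip(hi, new_hi)]
    return lo, hi
```

With two patches, emission `[1.0, 0.0]`, reflectivity `[0.95, 0.70]`, and form factors `[[0.0, 0.5], [0.5, 0.0]]`, the intervals close in on the exact solution B_0 = 1 / 0.83375. Because the map is a contraction, the order in which patches are updated affects the path but not the limit, which is why asynchronous updating still converges.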
43
Radiosity Optimization: Point
Iterate until time limit reached:
  Pick a light source coordinate.
  Vary it by some small amount, like 0.1 meter.
  Re-solve the radiosity problem using the point
  method.
  Compare radiation evenness on bottom surface.
  If ratio is closer to 1.0, keep the modification.
  Else restore original light position.
Report best solution found for position of lights.
As before, this allows job parallelism if
information about the search is shared by all
processors after each run.
44
Radiosity Optimization: Interval
Set initial bounds on light positions (must be on
ceiling, disjoint).
Iterate until time limit reached:
  Subdivide the 6-dimensional space of light
  positions.
  For each subset of the search space:
    Bound the ratio of lightest/darkest surface
    patch.
  Share bounds globally to exclude subregions
  from search.
Report best solution (range) found for light
positions.
Parallelism exists at the job level and within
each solver step.
45
Benchmark 3: Life Sciences (Proteomics)
Given a sequence of N peptides and a time limit
T, find the minimum energy conformation of the
peptide sequence.
Figure of merit: N, or N/T.
This approaches protein folding as N grows.
Answer validity can be tested against experiment.
Currently, N = 55 is the frontier.
46
Benchmark 4: Electronic Design
Design an N-bit adder with carry look-ahead in a
given process technology (for example, 0.10 micron,
FO4, 6-layer Cu interconnect). Simulate with a
cycle-based simulator for a complete set of test
vectors. Optimize to minimize clock cycle and
chip area.
Figure of merit: clock speed or area.
This captures both integer (logical) and
floating-point (analog) aspects of electronic
design.
47
Benchmark 5: Financial Modeling
Generate real-time market behavior drawn from
historical data to drive workload. Execute trades
based on estimates of future value for N
financial instruments over a period T.
Objective function: Profit!
48
Benchmark 6: Weather Modeling
Generate real-time weather behavior drawn from
historical data to drive workload. Predict
weather (temperature, precipitation, cloud cover,
pressure, wind speed) for N days in advance.
Objective function: Minimum total log(error)/time.
49
Benchmark 7: Petroleum Reservoir Management
Given a geological structure containing oil,
water, gas, and a set of M injector wells and N
extraction wells, position the wells to maximize
the total oil and gas extracted over a period of
time T.
Objective function: Maximum fuel extracted.
50
SUMMARY

Scalability is much easier if the workloads are
purpose-based.
Multiple dimensions of scalability arise
naturally as adjustable parameters.
Predictive value looks promising based on prior
experience with HINT.
We will share our HPC workload benchmarks with
the HPC community when completed.
