Title: Grid performance, grid benchmarks, grid metrics
1Grid performance, grid benchmarks, grid metrics
- Zsolt Németh
- MTA SZTAKI Computer and Automation Research Institute
- zsnemeth_at_sztaki.hu
- http://www.lpds.sztaki.hu/zsnemeth
- What is the grid?
- What is grid performance?
- Are benchmarks useful?
- How can grid metrics be defined?
3What is the grid?
4Distributed applications
- A set of cooperative processes
5Distributed applications
- Processes require resources
I/O devices
6Distributed applications
- Resources can be found on computational nodes
I/O devices
7Distributed applications
Application Cooperative processes
- Process control?
- Security?
- Naming?
- Communication?
- Input / output?
- File access?
Physical layer Computational nodes
8Distributed applications
Application Cooperative processes
- Virtual machine
- Process control ?
- Security ?
- Naming ?
- Communication ?
- Input / output ?
- File access ?
Physical layer Computational nodes
9Conventional distributed environments and grids
- Distributed resources are virtually unified by a software layer
- A virtual machine is introduced between the application and the physical layer
- Provides a single system image to the application
- Types
- Conventional (PVM, some implementations of MPI)
- Grid (Globus, Legion)
10Conventional distributed environments and grids
- What is the essential difference?
14Conventional distributed environments and grids
- How is the virtual machine built up?
- What does execution mean?
- What is the semantics of execution?
15Description of grid
- flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions and resources (The Anatomy of the Grid)
- single, seamless, computational environment in which cycles, communication and data are shared (Legion: the Next Step Toward a Nationwide Virtual Computer)
- wide-area environment that transparently consists of workstations, personal computers, graphic rendering engines, supercomputers and nontraditional devices (Legion - A View from 50,000 Feet)
- collection of geographically separated resources connected by a high speed network, a software layer which transforms a collection of independent resources into a single, coherent virtual machine (Metacomputing - What's in it for me)
16Conventional environments
- Processes
- Have resource requests
- Mapping
- Processes are mapped onto nodes
- Resource assignment is implicit
Physical level
- Processes
- Have resource requirements
- Mapping
- Assign nodes to resources?
Physical layer
18Grid: the resource abstraction
- Processes
- Have resource needs
Physical layer
19Grid: the user abstraction
- Processes
- Belong to a user
- User of the virtual machine is authorised to use the constituent resources
- Has no login access to the node the resource belongs to
- Physical layer
- Local, physical users (user accounts)
20The grid abstraction
- Semantically the grid is nothing but abstraction
- Resource abstraction
- Physical resources can be assigned to virtual resource needs (matched by properties)
- Grid provides a mapping between virtual and physical resources
- User abstraction
- User of the physical machine may be different from the user of the virtual machine
- Grid provides a temporal mapping between virtual and physical users
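The two mappings can be illustrated with a small sketch. It is a hypothetical illustration only: the class names, property keys, node names and local account names are assumptions and do not correspond to the API of Globus, Legion or any other middleware.

```python
from dataclasses import dataclass


@dataclass
class PhysicalResource:
    node: str          # computational node that owns the resource
    kind: str          # e.g. "cpu", "storage"
    properties: dict   # advertised properties, e.g. {"mhz": 900}
    local_user: str    # local account that runs grid work on this node


@dataclass
class ResourceNeed:
    kind: str
    minimum: dict      # required minimum value per property


def matches(need, res):
    """A physical resource satisfies a virtual need if every property is met."""
    return need.kind == res.kind and all(
        res.properties.get(key, 0) >= value for key, value in need.minimum.items()
    )


def map_job(grid_user, needs, pool):
    """Resource abstraction: bind each need to a free physical resource.
    User abstraction: bind the grid user to a local account on each node;
    the binding is meant to last only for the lifetime of the job."""
    free = list(pool)
    resource_map, user_map = {}, {}
    for i, need in enumerate(needs):
        chosen = next((r for r in free if matches(need, r)), None)
        if chosen is None:
            raise LookupError(f"no resource matches need {i}: {need}")
        free.remove(chosen)
        resource_map[i] = chosen
        user_map[chosen.node] = (grid_user, chosen.local_user)
    return resource_map, user_map


# Hypothetical pool and job description
pool = [
    PhysicalResource("nodeA", "cpu", {"mhz": 900}, "grid01"),
    PhysicalResource("nodeB", "cpu", {"mhz": 400}, "guest7"),
    PhysicalResource("nodeB", "storage", {"gb": 80}, "guest7"),
]
needs = [ResourceNeed("cpu", {"mhz": 800}), ResourceNeed("storage", {"gb": 50})]
resource_map, user_map = map_job("Smith", needs, pool)
```

The grid user "Smith" never appears as a local account: on each node the job runs under whatever local user the node provides.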
21Conventional distributed environments and grids
[Figure labels: Smith, 4 nodes; Smith, 4 CPU, memory, storage; Smith, 1 CPU]
22Grid performance
23What is grid performance at all?
- Performance of grid infrastructure or performance of grid application?
- Traditionally performance is
- Speed
- Throughput
- Bandwidth, etc.
- Using grids
- Quantitative reasons
- Qualitative reasons: QoS
- Economic aspects
24Grid performance analysis scenarios
- Resource brokering: evaluate the performance of a given resource to see if it is appropriate for a certain job
- At runtime: check if a resource can maintain an acceptable/required performance
- At runtime: check if a job can evolve according to checkpoints
- Find obvious idling/waiting spots
- Find bad communication patterns
- Find serious performance skew
- Post mortem: see if brokering strategy was correct
- Etc.
28What is grid performance at all?
- supercomputer
- task is done in 20 minutes
- available tomorrow night
- costs 200/hour
- cluster
- task is done in 12 hours
- available now
- costs 15/hour
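The comparison can be worked through numerically. The sketch below uses the run times and hourly prices from the slide; the 30-hour wait standing in for "available tomorrow night" and the rule that only run time is billed are assumptions, and the currency unit is left unspecified as in the source.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    name: str
    run_hours: float       # execution time of the task
    wait_hours: float      # time until the resource becomes available
    price_per_hour: float

    @property
    def turnaround_hours(self):
        # time from now until the result is ready
        return self.wait_hours + self.run_hours

    @property
    def cost(self):
        # assumption: only the run time is billed
        return self.run_hours * self.price_per_hour


# wait_hours=30.0 is an assumed value for "available tomorrow night"
supercomputer = Offer("supercomputer", run_hours=20 / 60, wait_hours=30.0, price_per_hour=200.0)
cluster = Offer("cluster", run_hours=12.0, wait_hours=0.0, price_per_hour=15.0)
offers = (supercomputer, cluster)

fastest_run = min(offers, key=lambda o: o.run_hours)
earliest_result = min(offers, key=lambda o: o.turnaround_hours)
cheapest = min(offers, key=lambda o: o.cost)
```

Under these assumptions the supercomputer wins on run time and on cost, while the cluster delivers the result earlier. Which resource "performs better" depends on the criterion chosen.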
29What is grid performance at all?
- Grid is about resource sharing
- What is the benefit of sharing
- acceptable for resource owners
- acceptable for resource users
- Speed, bandwidth, capacity, etc. are just one aspect
- Properness, fairness, effectiveness of assignment of processes to resources
30Grid performance
31Grid performance
Virtual layer
Physical layer
33Interaction of application and the infrastructure
- Performance = application perf. × infrastructure perf.
- Signature model (Pablo group)
- Application signature
- e.g. instructions/FLOPs
- Scaling factor (capabilities of the resources)
- e.g. FLOPs/second
- Execution signature
- application signature × scaling factor
- e.g. instructions/second = instructions/FLOPs × FLOPs/second
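A minimal numeric sketch of the signature model follows, assuming the execution signature is the product of the application signature and the scaling factor, as the units in the example suggest. The function name, node names and all numbers are hypothetical.

```python
def execution_signature(application_signature, scaling_factor):
    """instructions/second = (instructions/FLOP) * (FLOPs/second)"""
    return application_signature * scaling_factor


# Hypothetical application signature: instructions issued per floating point operation
app_signature = 2.5

# Hypothetical scaling factors: sustained FLOPs/second of two resources
scaling_factors = {"nodeA": 4.0e8, "nodeB": 1.2e9}

# Projected execution signature of the same application on each resource
projected = {
    node: execution_signature(app_signature, factor)
    for node, factor in scaling_factors.items()
}
```

The application signature is measured once and is independent of the resource; only the scaling factor changes from resource to resource.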
34Possible performance problems in grids
- All that may occur in a distributed application
- Plus
- Effectiveness of resource brokering
- Synchronous availability of resources
- Resources may change during execution
- Various local policies
- Shared use of resources
- Higher costs of some activities
- The corresponding symptoms must be characterised
35Grid performance metrics
- Abstract representation of measurable quantities
- $M = R_1 \times R_2 \times \dots \times R_n$
- Usual metrics
- Speedup, efficiency
- Load, queue length, etc.
- Such strict values are not characteristic in grids
- Cannot be interpreted
- Cannot be compared
- New metrics
- Local metrics and grid metrics
- Symbolic description / metrics
36Processing monitoring information
- Trace data reduction
- Proportional to time t, processes P, metrics dimension n
- Statistical clustering (reducing P)
- Similar temporal behaviours are classified
- Questionable if it works for grids
- Representative processes are recorded for each class
- Statistical projection pursuit (reducing n)
- reduces the dimension by identifying significant metrics
- Sampling frequency (reducing t)
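The three reductions can be sketched on a toy trace. This is an illustration under stated assumptions, not the statistical methods named above: a simple leader clustering stands in for statistical clustering, and variance ranking stands in for projection pursuit. All function names, the threshold and the data are hypothetical.

```python
import math
from statistics import pvariance


def reduce_time(trace, step):
    """Reducing t: keep every `step`-th sample of each process."""
    return [samples[::step] for samples in trace]


def reduce_metrics(trace, keep):
    """Reducing n: keep the `keep` metrics with the largest variance
    (a crude stand-in for projection pursuit)."""
    n = len(trace[0][0])
    columns = [[s[m] for samples in trace for s in samples] for m in range(n)]
    ranked = sorted(range(n), key=lambda m: pvariance(columns[m]), reverse=True)
    kept = sorted(ranked[:keep])
    reduced = [[tuple(s[m] for m in kept) for s in samples] for samples in trace]
    return reduced, kept


def distance(a, b):
    """Euclidean distance between two traces of equal length."""
    return math.sqrt(sum((x - y) ** 2 for sa, sb in zip(a, b) for x, y in zip(sa, sb)))


def reduce_processes(trace, threshold):
    """Reducing P: a process joins the first class whose representative is
    within `threshold`; otherwise it becomes the representative of a new class."""
    representatives, membership = [], []
    for p, samples in enumerate(trace):
        for c, r in enumerate(representatives):
            if distance(samples, trace[r]) <= threshold:
                membership.append(c)
                break
        else:
            representatives.append(p)
            membership.append(len(representatives) - 1)
    return representatives, membership


# trace[p][t] is a tuple of n metric values for process p at time step t
trace = [
    [(10.0, 1.0), (20.0, 1.0), (30.0, 1.0), (40.0, 1.0)],
    [(11.0, 1.0), (19.0, 1.0), (31.0, 1.0), (41.0, 1.0)],
    [(90.0, 1.0), (80.0, 1.0), (95.0, 1.0), (85.0, 1.0)],
]
sampled = reduce_time(trace, step=2)
projected, kept = reduce_metrics(sampled, keep=1)
representatives, membership = reduce_processes(projected, threshold=5.0)
```

Only the traces of the representative processes would then be recorded, one per class.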
37Performance tuning, optimisation
- The execution cannot be reproduced
- Post-mortem optimisation is not viable
- On-line steering is necessary, though hard to realise
- Sensors and actuators
- Application and implementation dependent
- E.g. Autopilot, Falcon
- Average behaviour of applications can be improved
- Post-mortem tuning of the infrastructure (if possible)
- Brokering decisions
- Supporting services
38Grid benchmarking
39Grid performance, resource performance
- The traditional way: benchmarking
- As suggested by GGF-GBRG
43Running benchmarks
- Benchmarks are executed on a virtual machine
- The virtual machine may change (composed of different resources) from run to run
- Benchmark result is representative of one certain virtual machine
- What can it show about the entire grid?
- What can it show about a certain resource?
44Grid benchmarking
Virtual layer
Physical layer
45Grid metrics
46Local metrics
- Load averages, CPU user, system, idle percentages, network bandwidth, cache hit ratio, available memory, page faults, etc.
- Performance is a trajectory in a multi-dimensional space
- Cannot be compared
- Cannot be interpreted
- processes 55.2, user 70%, system 0%, idle 30%
- underloaded 64-CPU system
- processes 55.2, user 70%, system 30%, idle 0%
- 64-CPU system, serious overheads
- processes 72.8, user 99%, system 1%, idle 0%
- slightly overloaded 64-CPU system
- processes 4.1, user 99%, system 1%, idle 0%
- seriously overloaded 1-CPU system
- Fine details are even more complex to evaluate
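The four readings above show that the same local figures mean different things on different machines. The sketch below makes that explicit: the interpretation needs the CPU count in addition to the raw metrics. The function name and all thresholds are assumptions, chosen only so that the four examples from the slide are reproduced.

```python
def interpret(processes, user, system, idle, cpus):
    """Interpret local metrics (percentages for user, system, idle) of one machine.
    All thresholds are illustrative assumptions."""
    load_per_cpu = processes / cpus
    if load_per_cpu > 2.0:
        return "seriously overloaded"
    if load_per_cpu > 1.0:
        return "slightly overloaded"
    if system >= 20:
        return "serious overheads"
    if idle >= 20:
        return "underloaded"
    return "fully used"


readings = [
    interpret(55.2, user=70, system=0, idle=30, cpus=64),
    interpret(55.2, user=70, system=30, idle=0, cpus=64),
    interpret(72.8, user=99, system=1, idle=0, cpus=64),
    interpret(4.1, user=99, system=1, idle=0, cpus=1),
]
```

Without the `cpus` argument the third and fourth readings could not be told apart, since their user, system and idle percentages are identical.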
47Local metrics, global (grid) metrics
- Local metrics are transformed into some globally understandable performance figures
- What are the dimensions?
- What is the transformation?
48Global metrics
- MIPS, MFLOPS, Gbit/s, etc.
- Comparable, interpretable
- Most users have no idea about the computing power they really require
- These are usually nominal and not actual values
- Too general characterisation fine details are
49Benchmark metrics
- Benchmarks are for comparing computer systems
- A well selected benchmark set
- sensitive to different factors: CPU intensive, communication intensive, I/O intensive jobs
- able to show fine details: cache behaviour, floating point capabilities, etc.
- able to show behaviour at different levels: instruction, loop, procedure, application
- These figures can be obtained actively: require time, resources
50Benchmark metrics
- Given a local database with local and benchmark performance records
- get the local performance figures
- low cost: OS functionality
- look up the database for benchmark performance
- there may not be a record for the actual local performance
- symbolic (fuzzy) interpolation
- the actual benchmark figures can be estimated
- actual execution of benchmarks is costly if not impossible
- Estimated benchmark figures give a characterisation of the system in a comparable and interpretable way
- Sounds reasonable but not enough
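The lookup-and-estimate step can be sketched as follows. Plain inverse-distance weighting is used here as a stand-in; it is not the symbolic (fuzzy) interpolation named on the slide. The record layout, the choice of local metrics and every number are hypothetical.

```python
import math

# Hypothetical database: local metrics -> benchmark figure measured under that load.
# Local metrics are (load per CPU, free memory fraction); benchmark figures are in
# arbitrary units.
records = [
    ((0.0, 0.9), 100.0),
    ((0.5, 0.6), 70.0),
    ((1.0, 0.3), 40.0),
]


def estimate_benchmark(local, records, power=2):
    """Estimate the benchmark figure for the current local metrics without
    running the benchmark, by inverse-distance weighting over past records."""
    weight_sum, weighted_total = 0.0, 0.0
    for metrics, figure in records:
        d = math.dist(local, metrics)
        if d == 0.0:
            return figure  # an exact record exists, no interpolation needed
        w = 1.0 / d ** power
        weight_sum += w
        weighted_total += w * figure
    return weighted_total / weight_sum


exact = estimate_benchmark((0.5, 0.6), records)
estimated = estimate_benchmark((0.25, 0.75), records)
```

Obtaining the local metrics is cheap (OS functionality), so the estimate can be refreshed often, whereas the benchmark records are collected only occasionally.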
51Benchmark metrics
- Benchmarks may show actual execution performance but it is not enough
- Real-life experiments: execution time may show no correlation to actual load
- start every job and suffer resource starvation
- wait until resources are available and start specific jobs
- Resource management policy must be taken into account
52Job startup times
- corona.iif.hu, SUN Ultra Enterprise 10000, 64 CPU
- Sun Grid Engine
- Time between submission and actual start
- 1 processor job: within 1 minute
- 2 processor job: mostly within 1 minute
- 4 processor job: 2-3 hours
- 8 processor job: 1-2 days
- 9 processor job: 1-2 days
- 16 processor job: 2-3 days
- 25 processor job: > 4-5 days
- See online
- http//www.lpds.sztaki.hu/zsnemeth/apart/statisti
53Resource performance characterisation
- Execution phase: resource performance can be characterised in the space of benchmark metrics
- analyse relationship between local metrics and benchmark results
- find the principal components
- Waiting phase: a stochastic model
- find the parameters of the distribution
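For the waiting phase, finding the parameters of the distribution could look like the sketch below. The slides do not name the distribution, so the log-normal model is an assumption, and the waiting times are synthetic samples generated from known parameters rather than measurements.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def fit_lognormal(waits):
    """Estimate (mu, sigma) of a log-normal model from observed waiting times."""
    logs = [math.log(w) for w in waits]
    return fmean(logs), stdev(logs)


def prob_start_within(hours, mu, sigma):
    """Probability that a job starts within `hours` under the fitted model."""
    return NormalDist(mu, sigma).cdf(math.log(hours))


# Synthetic waiting times in hours, drawn with assumed parameters (not measured data)
rng = random.Random(0)
true_mu, true_sigma = 3.0, 1.0
waits = [rng.lognormvariate(true_mu, true_sigma) for _ in range(2000)]

mu, sigma = fit_lognormal(waits)
p_day = prob_start_within(24.0, mu, sigma)
```

Two numbers per job class are then enough to describe the waiting behaviour of a resource, which keeps the description small enough to publish.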
54Resource performance characterisation
- These parameters (?i, ?i, t1, t2, ..., tn) can be distributed in an information system
- Interpretable: the stochastic model and the benchmark set give an appropriate framework
- Comparable: figures have the same meaning within this framework
55Ongoing work
- Exploring the statistical properties of benchmarks and system parameters
- Intensive benchmark experiments
- Getting the most out of figures
- Principal component analysis: which figures are really meaningful
- Testing the stability of statistical data
- http://www.lpds.sztaki.hu/zsnemeth/apart/statistics/statistics.shtml
- Exploring how benchmark results can be estimated from past measurements
- Database management
- Symbolic interpolation
- A semantic definition for grids
- the presence of user and resource abstraction
- Grid performance has a more complex meaning
- Resource abstraction requires abstraction in the performance characterisation, too
- separation of local (physical) and global (virtual) metrics
- benchmarking is not viable
- but benchmarks can serve as metrics
- Experiments with resource characterisation