Title: Workload Characteristics and Representative Workloads
1 Workload Characteristics and Representative Workloads
- David Kaeli
- Department of Electrical and Computer Engineering
- Northeastern University
- Boston, MA
- kaeli_at_ece.neu.edu
2 Overview
- When we want to collect profiles to be used in the design of a next-generation computing system, we need to be very careful that we capture a representative sample in our profile
- Workload characteristics allow us to better understand the content of the samples we collect
- We need to select programs to study that represent the class of applications we will eventually run on the target system
3Examples of Workload Characteristics
- Instruction mix (see the sketch after this list)
- Static vs. dynamic instruction count
- Working sets
- Control flow behavior
- Working set behavior
- Inter-reference gap model (temporal locality)
- Database size
- Address and value predictability
- Application/library/OS breakdown
- Heap vs. stack allocation
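To make the first few characteristics concrete, here is a minimal Python sketch that computes the dynamic instruction mix, static vs. dynamic instruction counts, and an approximate data working set from a trace; the (pc, opcode, addr) record format and the 64-byte block granularity are assumptions of this sketch:

    from collections import Counter

    def characterize(trace):
        """Compute simple workload characteristics from a dynamic
        instruction trace.

        `trace` is assumed to yield one (pc, opcode, addr) tuple per
        dynamically executed instruction; addr is None for
        non-memory instructions."""
        mix = Counter()        # dynamic opcode frequencies
        static_pcs = set()     # distinct PCs -> static instruction count
        working_set = set()    # distinct data blocks touched
        dynamic = 0

        for pc, opcode, addr in trace:
            dynamic += 1
            mix[opcode] += 1
            static_pcs.add(pc)
            if addr is not None:
                working_set.add(addr & ~0x3F)   # 64-byte block granularity

        return {
            "dynamic_insts": dynamic,
            "static_insts": len(static_pcs),
            "mix": {op: n / max(dynamic, 1) for op, n in mix.items()},
            "working_set_blocks": len(working_set),
        }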
4 Benchmarks
- Real or synthetic programs/applications used to exercise a hardware system, generally representing a range of behaviors found in a particular class of applications
- Benchmark classes include:
- Toy benchmarks: Tower of Hanoi, qsort, fibo
- Synthetic benchmarks: Dhrystone, Whetstone
- Embedded: EEMBC, UTDSP
- CPU benchmarks: SPECint/fp
- Internet benchmarks: SPECjAppServ, SPECjvm
- Commercial benchmarks: TPC, SAP, SPECjAppServ
- Supercomputing: Perfect Club, Splash, Livermore Loops
5 SPEC Benchmarks Presentation
6 Is there a way to reduce the runtime of SPEC while maintaining representativeness?
- MinneSPEC (U. of Minn): using gprof statistics about the runtime of SPEC and various SimpleScalar simulation results (I-mix, cache misses, etc.), we can capture statistically similar, though significantly shorter, runs of the programs
- Provides three input sets that will run in:
- A few minutes
- A few hours
- A few days
- Published as an IEEE Computer Architecture Letters paper
7 How many programs do we need?
- Does the set of applications capture enough variation to be representative of the entire class of workloads?
- Should we consider using multiple benchmark suites to factor out similarities in programming styles?
- Can we utilize workload characteristics to identify the particular programs of interest?
8 Example of capturing program behavior
"Quantifying Behavioral Differences Between C and C++ Programs", Calder, Grunwald and Zorn, 1995
- C++ is a programming language growing in popularity
- We need to design tomorrow's computer architectures based on tomorrow's software paradigms
- How do workload characteristics change as we move to new programming paradigms?
9 Example of capturing program behavior
"Quantifying Behavioral Differences Between C and C++ Programs", Calder, Grunwald and Zorn, 1995
- First problem: find a set of representative programs from both the procedural and object-oriented (OO) domains
- Difficult for OO in 1995
- Differences between programming models:
- OO relies heavily on messages and methods
- Data locality will change due to the bundling together of data structures in objects
- Size of functions will be reduced in OO
- Polymorphism allows for indirect function invocation and runtime method selection
- OO programs will manage a larger number of dynamically allocated objects
10 Address and Value Profiling
- Lipasti observed that profiled instructions tend to repeat their behavior
- Many addresses are nearly constant
- Many values do not change between instruction executions
- Can we use profiles to better understand some of these behaviors, and then utilize this knowledge to optimize execution?
11 Address Profiling
- If an address remains unchanged, can we issue loads and stores early (similar to prefetching)?
- Do we even have to issue the load or store if we have not modified memory?
- What are the implications if indirect addressing is used?
- Can we detect patterns (i.e., strides) in the address values? (See the sketch below.)
- Can we do anything smart when we detect pointer chasing?
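As one way to act on an address profile, the following is a minimal Python sketch of a per-instruction stride detector; the (pc, addr) trace format and the two-confirmation confidence policy are assumptions of this sketch, not details of any specific published design:

    def detect_strides(address_trace):
        """Classify each memory-referencing PC by the stride between
        successive addresses it touches.

        `address_trace` is assumed to yield (pc, addr) pairs in program
        order. Returns {pc: stride} for PCs whose stride repeated at
        least twice in a row (a simple confidence policy)."""
        last_addr, last_stride, confidence = {}, {}, {}

        for pc, addr in address_trace:
            if pc in last_addr:
                stride = addr - last_addr[pc]
                if stride == last_stride.get(pc):
                    confidence[pc] = confidence.get(pc, 0) + 1
                else:
                    confidence[pc] = 0       # stride changed; reset
                last_stride[pc] = stride
            last_addr[pc] = addr

        return {pc: s for pc, s in last_stride.items()
                if confidence.get(pc, 0) >= 2}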
12 Data Value Profiling
- When we see that particular data values do not change, how can we take advantage of this?
- Lipasti noticed that a large percentage of store instructions overwrite memory with the value already stored there (see the sketch below)
- Can we avoid computing new results if we notice that our input operands have not changed?
- What can we do if a particular operand only takes on a small set of values?
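A minimal Python sketch of how a profiler could measure such "silent stores" (stores that write the value already held in memory); the (addr, value) trace format is an assumption of this sketch:

    def silent_store_fraction(store_trace):
        """Fraction of dynamic stores that overwrite memory with the
        value already held at that address.

        `store_trace` is assumed to yield one (addr, value) pair per
        dynamic store, in program order."""
        memory = {}          # simulated memory image: addr -> last value
        silent = total = 0

        for addr, value in store_trace:
            total += 1
            if memory.get(addr) == value:
                silent += 1  # store is "silent": value unchanged
            memory[addr] = value

        return silent / total if total else 0.0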
13 Parameter Value Profiling
- Profile the parameter values passed to functions
- If these parameters are predictable, we can exploit this fact during compilation
- We can study this on an individual function basis or on a call site basis
- Compiler optimizations such as code specialization and function cloning can be used (see the sketch below)
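As an illustration only (not the instrumentation used in the study on the next slide), a Python decorator can gather this kind of per-function parameter value profile; all names here are hypothetical:

    from collections import Counter
    from functools import wraps

    profiles = {}   # function name -> per-argument value counters

    def profile_params(fn):
        """Record the values passed in each positional argument of fn."""
        counters = profiles.setdefault(fn.__name__, {})

        @wraps(fn)
        def wrapper(*args, **kwargs):
            for i, v in enumerate(args):
                counters.setdefault(i, Counter())[v] += 1
            return fn(*args, **kwargs)
        return wrapper

    @profile_params
    def scale(x, factor):
        return x * factor

    for x in range(1000):
        scale(x, 3)      # 'factor' is invariant here; a specializer could
                         # clone scale() with factor hard-wired to 3

    print(profiles["scale"][1].most_common(1))   # -> [(3, 1000)]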
14 Parameter Value Profiling
- We have profiled a set of MS Windows NT 4.0 desktop applications:
- Word97
- FoxPro 6.0
- SQL Server 7.0
- VC 6.0
- Excel97
- PowerPoint97
- Access97
- We measured the value predictability of parameter values for all non-pointer-based parameters
15 Parameter Value Profiling
- We look for the predictability of parameters using the following metrics (sketched in code below):
- Invariance(1): probability that the most frequent value is passed
- Invariance(4): probability that one of the 4 most frequent values is passed
- Parameter values are more predictable on a call site basis than on a function basis (e.g., for Word97, 8% of the functions pass highly predictable parms, whereas when computed on individual call sites, over 16% of the call sites pass highly predictable parms)
- "Highly predictable" means that on over 90% of the calls the same value is observed
- We will discuss how to clone and specialize procedures when we discuss profile-guided data transformations
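A minimal Python sketch of the Invariance(1) and Invariance(4) metrics defined above, assuming we already have the list of values observed for one parameter at one call site (the example profile is hypothetical):

    from collections import Counter

    def invariance(values, k=1):
        """Invariance(k): fraction of calls in which one of the k most
        frequent values for this parameter was passed.

        `values` is the list of values observed for one parameter at
        one call site (or aggregated per function)."""
        counts = Counter(values)
        top_k = sum(n for _, n in counts.most_common(k))
        return top_k / len(values)

    observed = [0]*93 + [1]*4 + [2, 3, 4]    # hypothetical profile
    print(invariance(observed, 1))           # 0.93 -> "highly predictable"
    print(invariance(observed, 4))           # 0.99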
16 How can we reduce the runtime of a single benchmark and still maintain accuracy?
- SimPoint: attempts to collect a set of trace samples that best represents the whole execution of the program
- Identifies phase behavior in programs
- Considers a metric that captures the differences between two samples
- Computes the difference between these two intervals
- Selects the interval that is closest to all other intervals
17 SimPoint (Calder, ASPLOS 2002)
- Utilize basic block frequencies to build basic block vectors (bbf_0, bbf_1, ..., bbf_{n-1})
- Each frequency is weighted by the basic block's length (instruction count)
- The entire vector is normalized by dividing by the total number of basic blocks executed
- Take fixed-length samples (100M instructions)
- Compare BBVs using the following metrics (sketched below):
- Euclidean Distance: ED(a, b) = sqrt( sum_{i=1..n} (a_i - b_i)^2 )
- Manhattan Distance: MD(a, b) = sum_{i=1..n} |a_i - b_i|
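A minimal Python sketch of the BBV construction and the two distance metrics above; the input layout (parallel per-block count and length lists) is an assumption of this sketch:

    import math

    def bbv(block_counts, block_lengths):
        """Build one interval's normalized basic block vector: each
        block's execution count is weighted by the block's length
        (instruction count), then the vector is scaled to sum to 1."""
        weighted = [c * l for c, l in zip(block_counts, block_lengths)]
        total = sum(weighted)
        return [w / total for w in weighted]

    def euclidean(a, b):
        # ED(a, b) = sqrt( sum_{i=1..n} (a_i - b_i)^2 )
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def manhattan(a, b):
        # MD(a, b) = sum_{i=1..n} |a_i - b_i|
        return sum(abs(x - y) for x, y in zip(a, b))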
18 SimPoint (Calder, ASPLOS 2002)
- Manhattan Distance is used to build a similarity matrix
- An N x N matrix, where N is the number of sampling intervals in the program
- Element SM(x, y) is the Manhattan Distance between the BBVs of the 100M-instruction intervals at sample offsets x and y
- Plot the Similarity Matrix as an upper triangle (see the sketch below)
19 SimPoint (Calder, ASPLOS 2002)
- Basic Block Vectors can adequately capture the necessary representative characteristics of a program
- Distance metrics can help to identify the most representative samples
- Cluster analysis (k-means) can improve representativeness by selecting multiple samples
20 SimPoint (Calder, ISPASS 2004)
- Newer work on SimPoint considers using register def-use behavior on an interval basis
- Also, tracking loop behavior and procedure call frequencies provides similar accuracy to using basic block vectors
21 SimPoint (Calder, ASPLOS 2002)
- Algorithm overview:
- Profile the program by dividing it into fixed-size intervals (e.g., 1M, 10M, 100M insts)
- Collect frequency vectors (e.g., BBVs, def-use, etc.) and compute normalized frequencies
- Run the k-means clustering algorithm to divide the set of intervals into k partitions/sets, for values of k from 1 to K
- Compute a goodness-of-fit of the data for each value of k
- Select the clustering that keeps k small while providing a reasonable goodness-of-fit result
- The result is a selection of representative simulation points that best fit the entire application execution (see the sketch below)
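A compressed Python sketch of this pipeline using scikit-learn's k-means; the inertia-based elbow heuristic below stands in for the paper's actual goodness-of-fit scoring, and the nearest-to-centroid representative selection follows the SimPoint idea only loosely:

    import numpy as np
    from sklearn.cluster import KMeans

    def pick_simpoints(bbvs, max_k=10):
        """Cluster interval BBVs and return one representative interval
        index per cluster (the interval nearest its cluster centroid).

        `bbvs` is an (N, n_blocks) array of normalized BBVs, one row
        per fixed-size interval."""
        X = np.asarray(bbvs)
        best = None
        for k in range(1, max_k + 1):
            km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
            # Crude goodness-of-fit: stop once an extra cluster no longer
            # reduces within-cluster distance much (the paper uses a
            # principled criterion instead).
            if best is not None and km.inertia_ > 0.9 * best.inertia_:
                break
            best = km
        # Representative = interval closest to each centroid.
        reps = []
        for c in range(best.n_clusters):
            members = np.flatnonzero(best.labels_ == c)
            centroid = best.cluster_centers_[c]
            dists = np.abs(X[members] - centroid).sum(axis=1)  # Manhattan
            reps.append(int(members[dists.argmin()]))
        return sorted(reps)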
22 SimPoint Paper
23 Improvements to SimPoint (KDD 2005, ISPASS 2006)
- Utilize a Mixture of Multinomials instead of k-means
- Assumes data is generated by a mixture of K component density functions
- We utilize Expectation-Maximization (EM) to find a local maximum likelihood for the parameters of the density functions: iterate on E and M steps until convergence (see the sketch below)
- The number of clusters is selected using the Bayesian Information Criterion (BIC) approach to judge goodness of fit
- "A multinomial clustering model for fast simulation of computer architecture designs", K. Sanghai et al., Proc. of KDD 2005, Chicago, IL, pp. 808-813.
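A minimal NumPy sketch of EM for a k-component multinomial mixture over basic block count vectors; the random initialization, smoothing constant, and fixed iteration count are assumptions of this sketch (the paper selects the number of components with BIC, which would wrap this routine in a loop over candidate values):

    import numpy as np

    def multinomial_mixture_em(counts, k, iters=100, seed=0):
        """EM for a k-component multinomial mixture over basic block
        count vectors (one row of `counts` per interval).

        Returns (mixing weights, per-component probabilities,
        per-interval responsibilities)."""
        counts = np.asarray(counts, dtype=float)
        n, d = counts.shape
        rng = np.random.default_rng(seed)
        weights = np.full(k, 1.0 / k)               # mixing proportions
        params = rng.dirichlet(np.ones(d), size=k)  # p[j, i] per component

        for _ in range(iters):   # fixed budget; a convergence test also works
            # E-step: responsibilities, computed in the log domain
            # for numerical stability
            log_r = np.log(weights) + counts @ np.log(params).T
            log_r -= log_r.max(axis=1, keepdims=True)
            r = np.exp(log_r)
            r /= r.sum(axis=1, keepdims=True)

            # M-step: re-estimate mixing weights and multinomial params
            weights = r.mean(axis=0)
            params = r.T @ counts + 1e-9            # smooth to avoid log(0)
            params /= params.sum(axis=1, keepdims=True)

        return weights, params, r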
24 Summary
- Before you begin studying a new architecture, have a clear understanding of the target workloads for this system
- Perform a significant amount of workload characterization before you begin profiling work
- Benchmarks are very useful tools, but must be used properly to obtain meaningful results
- Value profiling is a rich area for future research
- SimPoints can be used to reduce the runtime of simulation and still maintain simulation fidelity