Title: Combining Statistical and Symbolic Simulation
1Combining Statistical and Symbolic Simulation
- Mark Oskin
- Fred Chong and Matthew Farrens
- Dept. of Computer Science, University of California at Davis
2Overview
- HLS is a hybrid performance simulation
- Statistical + Symbolic
- Fast
- Accurate
- Flexible
3Motivation
I-cache hit rate
Basic block size
Dispatch bandwidth
I-cache miss penalty
Branch miss-predict penalty
4Motivation
- Fast simulation
- seconds instead of hours or days
- ideally interactive
- Abstract simulation
- simulate performance of unknown designs
- application characteristics, not applications
5Outline
- Simulation technologies and HLS
- From applications to profiles
- Validation
- Examples
- Issues
- Conclusion
6Design Flow with HLS
[Figure: design flow diagram with stages labeled Design Issue, Possible solution, Profile, HLS, Estimate Performance, and Cycle-by-Cycle Simulation]
7Traditional Simulation Techniques
- Cycle-by-cycle (SimpleScalar, SimOS, etc.)
- accurate
- slow
- Native emulation/basic block models (Atom, Pixie)
- fast, complex applications
- useful to a point (no low-level modifications)
8Statistical / Symbolic Execution
- HLS
- fast (near interactive)
- accurate / within regions
- permits variation of low-level parameters
- arbitrary design points / use carefully
9HLS: A Superscalar Statistical and Symbolic Simulator
Statistical
Symbolic
10Workflow
[Figure: workflow diagram with elements labeled Code, Binary, sim-stat, sim-outorder, R10k, app profile, machine-profile, machine-configuration, Stat-binary, and HLS]
11Machine Configurations
- Number of Functional units (I,F,L,S,B)
- Functional unit pipeline depths
- Fetch, Dispatch and completion bandwidths
- Memory access latencies
- Mis-speculation penalties
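The parameters listed above could be gathered into a single configuration record. The following is a minimal sketch, not the HLS input format: the class name `MachineConfig`, every field name, and every default value are hypothetical placeholders, and the reading of I, F, L, S, B as integer, floating-point, load, store, and branch units is an assumption.

```python
from dataclasses import dataclass, field

# Assumed meaning of the unit classes: integer, float, load, store, branch.
UNIT_CLASSES = ("I", "F", "L", "S", "B")


def _default_units():
    # Placeholder counts, not taken from the slides.
    return {"I": 2, "F": 2, "L": 1, "S": 1, "B": 1}


def _default_depths():
    # Placeholder pipeline depths (in cycles), not taken from the slides.
    return {"I": 1, "F": 4, "L": 2, "S": 2, "B": 1}


@dataclass
class MachineConfig:
    functional_units: dict = field(default_factory=_default_units)
    pipeline_depths: dict = field(default_factory=_default_depths)
    fetch_bandwidth: int = 4
    dispatch_bandwidth: int = 4
    completion_bandwidth: int = 4
    l1_latency: int = 1            # memory access latencies, in cycles
    l2_latency: int = 6
    memory_latency: int = 34
    branch_mispredict_penalty: int = 3

    def validate(self):
        """Raise ValueError if any unit count, depth, or bandwidth is below 1."""
        for name in UNIT_CLASSES:
            if self.functional_units.get(name, 0) < 1:
                raise ValueError(f"need at least one {name} unit")
            if self.pipeline_depths.get(name, 0) < 1:
                raise ValueError(f"pipeline depth for {name} must be >= 1")
        for label in ("fetch_bandwidth", "dispatch_bandwidth", "completion_bandwidth"):
            if getattr(self, label) < 1:
                raise ValueError(f"{label} must be >= 1")


config = MachineConfig(fetch_bandwidth=8)
config.validate()
```

Keeping all knobs in one record makes it easy to sweep a single parameter while holding the others fixed, which is the kind of design-space exploration the talk describes.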
12Profiles
- Machine profile
- cache hit rates -> (?)
- branch prediction accuracy -> (?)
- Application profile
- basic block size -> (?,?)
- instruction mix (% of I,F,L,S,B)
- dynamic instruction distance (histogram)
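A profile of this kind is a handful of distribution parameters that a simulator can sample from. The sketch below shows one way to draw values from it. It assumes the two basic-block-size parameters are a mean and a standard deviation of a normal distribution, which the slide does not state, and the function names and histogram contents are hypothetical.

```python
import random


def sample_basic_block_size(mean, stddev, rng):
    """Draw a block size from a normal distribution, clamped to at least 1.

    Assumption: the two profile parameters are a mean and a standard deviation.
    """
    return max(1, round(rng.gauss(mean, stddev)))


def sample_from_histogram(histogram, rng):
    """Draw one value from a {value: count} histogram, weighted by count."""
    values = list(histogram)
    weights = [histogram[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]


rng = random.Random(0)
# Hypothetical dynamic instruction distance histogram: distance -> occurrences.
distance_histogram = {1: 50, 2: 30, 4: 15, 16: 5}
distance = sample_from_histogram(distance_histogram, rng)
block_size = sample_basic_block_size(5.0, 2.0, rng)
```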
13Statistical Binary
- 100 basic blocks
- Correlated
- random instruction mix
- random assignment of dynamic instruction distance
- random distribution of cache and branch behaviors
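The generation step above can be sketched as follows. This is an illustration under stated assumptions rather than the HLS generator: the profile dictionary layout and function names are hypothetical, every block is assumed to end in a branch, block sizes are assumed to be normally distributed, and the correlation mentioned on the slide is not modeled. Loads get one dependence and other instructions two, following the annotations shown on the next slide.

```python
import random


def make_instruction(kind, profile, rng):
    """Build one symbolic instruction with randomly assigned behaviors."""
    distances = list(profile["dep_distance"])
    weights = [profile["dep_distance"][d] for d in distances]
    n_deps = 1 if kind == "L" else 2
    inst = {
        "kind": kind,
        "l1_icache_hit": rng.random() < profile["l1_icache_hit_rate"],
        "l2_icache_hit": rng.random() < profile["l2_icache_hit_rate"],
        "deps": rng.choices(distances, weights=weights, k=n_deps),
    }
    if kind in ("L", "S"):  # memory operations also carry data-cache behavior
        inst["l1_dcache_hit"] = rng.random() < profile["l1_dcache_hit_rate"]
        inst["l2_dcache_hit"] = rng.random() < profile["l2_dcache_hit_rate"]
    if kind == "B":
        inst["predicted_correctly"] = rng.random() < profile["branch_accuracy"]
    return inst


def generate_statistical_binary(profile, num_blocks=100, seed=0):
    """Return a list of basic blocks; each block is a list of instructions."""
    rng = random.Random(seed)
    body_kinds = [k for k in profile["mix"] if k != "B"]
    body_weights = [profile["mix"][k] for k in body_kinds]
    binary = []
    for _ in range(num_blocks):
        size = max(1, round(rng.gauss(profile["bb_mean"], profile["bb_stddev"])))
        # Assumption: the last instruction of every basic block is a branch.
        kinds = rng.choices(body_kinds, weights=body_weights, k=size - 1) + ["B"]
        binary.append([make_instruction(k, profile, rng) for k in kinds])
    return binary


# Hypothetical profile values, for illustration only.
profile = {
    "mix": {"I": 0.45, "F": 0.05, "L": 0.25, "S": 0.10, "B": 0.15},
    "dep_distance": {1: 50, 2: 30, 4: 15, 16: 5},
    "bb_mean": 5.0, "bb_stddev": 2.0,
    "l1_icache_hit_rate": 0.98, "l2_icache_hit_rate": 0.99,
    "l1_dcache_hit_rate": 0.95, "l2_dcache_hit_rate": 0.98,
    "branch_accuracy": 0.90,
}
binary = generate_statistical_binary(profile)
```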
14Statistical Binary
dynamic instruction distance
branch predictor behavior
load (l1 i-cache, l2 i-cache, l1 d-cache, l2 d-cache, dependence 0)
integer (l1 i-cache, l2 i-cache, dependence 0, dependence 1)
integer (l1 i-cache, l2 i-cache, dependence 0, dependence 1)
branch (l1 i-cache, l2 i-cache, branch-predictor accuracy, dependence 0, dependence 1)
store (l1 i-cache, l2 i-cache, l1 d-cache, l2 d-cache, dependence 0, dependence 1)
load (l1 i-cache, l2 i-cache, l1 d-cache, l2 d-cache, dependence 0)
core functional unit requirements
cache behavior during I-fetch
cache behavior during data access
15HLS Instruction Fetch Stage
Fetches symbolic instructions and interacts with a statistical memory system and branch predictor model.
Similar to conventional instruction fetch:
- has a PC
- has a fetch window
- interacts with caches
- utilizes branch predictor
- passes instructions to dispatch
Differences:
- caches and branch predictor are statistical models
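To make the contrast with a conventional fetch stage concrete, here is a deliberately simplified sketch of a fetch loop in which the cache and branch predictor are replaced by random draws against a hit rate and an accuracy. It is not the HLS implementation: the function name, the parameters, the rule that a fetch window ends at a branch, and the choice to charge the mis-prediction penalty directly at fetch are all assumptions made for illustration.

```python
import random


def fetch_cycles(kinds, fetch_width, icache_hit_rate, icache_miss_penalty,
                 branch_accuracy, mispredict_penalty, seed=0):
    """Count cycles needed to fetch a stream of symbolic instruction kinds."""
    rng = random.Random(seed)
    pc = 0          # index of the next symbolic instruction to fetch
    cycles = 0
    while pc < len(kinds):
        cycles += 1
        # Statistical I-cache: no tags or addresses, just a hit probability.
        if rng.random() >= icache_hit_rate:
            cycles += icache_miss_penalty
        fetched = 0
        while fetched < fetch_width and pc < len(kinds):
            kind = kinds[pc]
            pc += 1
            fetched += 1
            if kind == "B":
                # Statistical branch predictor: correct with a fixed probability.
                if rng.random() >= branch_accuracy:
                    cycles += mispredict_penalty
                break  # assumption: the fetch window ends at a branch
    return cycles


stream = ["I", "I", "I", "I", "B", "I", "I"]
ideal = fetch_cycles(stream, 4, 1.0, 10, 1.0, 5)
```

With a perfect cache and predictor the stream above takes three fetch cycles; lowering either probability adds stall cycles without any change to the instruction stream itself, which is what lets low-level parameters be varied freely.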
16Validation - SimpleScalar vs. HLS
17Validation - R10k vs. HLS
18HLS Multi-value Validation with SimpleScalar
HLS
Simple-Scalar
(Perl)
19HLS Multi-Value Validation with SimpleScalar
HLS
Simple-Scalar
(Xlisp)
20Example use of HLS
An intuitive result: branch prediction accuracy becomes less important (crosses fewer iso-IPC contour lines) as basic block size increases.
(Perl)
21Example use of HLS
Another intuitive result: gains in IPC due to basic block size are front-loaded.
Trade-off between front-end (fetch/dispatch) and back-end (ILP) processor performance.
(Perl)
22Example use of HLS
This space intentionally left blank.
(Perl)
23Related work
- R. Carl and J.E. Smith. Modeling superscalar processors via statistical simulation. PAID Workshop, June 1998.
- N. Jouppi. The non-uniform distribution of instruction-level and machine parallelism and its effect on performance. IEEE Trans., 1989.
- D. Noonburg and John Shen. Theoretical modeling of superscalar processor performance. MICRO27, November 1994.
24Questions & Future Directions
- How important are different well-performing benchmarks anyway?
- easily summarized
- summaries are not precise -> yet precise enough
- Will the statistical + symbolic technique work for poorly behaved applications?
- Will it extend to deeper pipelines and more real processors (e.g. Alpha, P6 architecture)?
25Conclusion
- HLS: Statistical + Symbolic Execution
- Intuitive design space exploration
- Fast
- Accurate
- Flexible
- Validated against cycle-by-cycle and R10k
- Future work: deeper pipelines, more hardware validations, additional domains
- Source code at http://arch.cs.ucdavis.edu/oskin