Title: Catching Accurate Profiles in Hardware
1. Catching Accurate Profiles in Hardware
ICS 280/259
- Satish Narayanasamy, Timothy Sherwood, Suleyman Sair, Brad Calder, George Varghese
Presented by Jelena Trajkovic
2. Outline
- Introduction and Motivation
- Goal
- Related Work (Stratified Sampler)
- Interval-based Profiling for a Single Hash Profiler
  - Experimental results
- Multiple-hash Profiler
  - Experimental results
3. Introduction and Motivation
- SW used to gather program behavior information
- Architectural support for generating profiles at run-time
  - HW is used to assist SW
    - dependent on system SW (for management or aggregation of events)
  - HW-only profiler
4. Introduction and Motivation (cont.)
- HW optimizations that can take advantage of info gathered at run-time
  - Cache replacement and prefetching
    - identifying loads that cause the majority of misses
  - Value-based optimization
    - 50% of memory accesses are dominated by 10 distinct values
    - capture this dynamically? => this information is used for storing compressed values in the data cache
  - Trace formation
    - dynamically extracting and ordering frequently executed code => more efficient I-fetch
  - Multiple path execution
    - find branches that are hard to predict and execute down multiple paths
5. Goal
- The goal is to build a profiling scheme that satisfies the following properties:
  - Area Efficient: capacity constraints (fixed amount of area)
  - Accurate: identify important / frequent events and count them accurately
  - Timely: up-to-date information about program behavior
  - Performance Efficiency and SW Independence: independent of system SW support to manage profiles (accumulate and analyze events); events are identified in HW
6. Related Work
- SW profiling
  - Binary instrumentation (ATOM by Calder et al.)
- HW counter assisted profiling
  - DCPI system for Alpha processors
- HW table based profiling
  - Stratified sampling (Sastry et al.)
- Co-processor profiler
  - Distill information passed from main processor (Zilles and Sohi)
7. Profiling Events
- Profiling event: combination of several variables
  - instruction PC, load address, register value or name, cache miss
- Tuple: represents an event as a combination of 2 variables
  - <pc, value>
8. Related Work: Stratified Sampler
- Divides the original input stream into multiple streams via hashing (independently sampled)
- Table of counters
  - number of occurrences of different events
  - counter is selected by applying a hash function to the input event
  - incremented when the event appears in the input stream
  - on reaching the threshold value, the counter is reset and the event is reported (interrupt to the OS)
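The counter-table behavior described on this slide can be sketched in a few lines of Python. This is only a toy software model, not the hardware design from Sastry et al.: the class name `StratifiedSampler`, the table size, the threshold, and the use of Python's built-in `hash` are all assumptions made for illustration, and the `reported` list merely stands in for the interrupt to the OS.

```python
class StratifiedSampler:
    """Toy model of a hashed counter table that reports frequent events."""

    def __init__(self, num_counters=16, threshold=4):
        self.num_counters = num_counters      # assumed table size
        self.threshold = threshold            # assumed reporting threshold
        self.counters = [0] * num_counters
        self.reported = []                    # stands in for the OS interrupt

    def _index(self, event):
        # Hypothetical hash function: built-in hash reduced modulo table size.
        return hash(event) % self.num_counters

    def observe(self, event):
        i = self._index(event)
        self.counters[i] += 1                 # event appeared in the stream
        if self.counters[i] >= self.threshold:
            self.counters[i] = 0              # reset on reaching threshold
            self.reported.append(event)       # report the event


sampler = StratifiedSampler(num_counters=16, threshold=4)
for _ in range(8):
    sampler.observe((0x400, 7))               # a frequent <pc, value> tuple
sampler.observe((0x404, 1))                   # an infrequent tuple
print(sampler.reported)
```

Because the counters carry no tags, two different events that hash to the same counter are indistinguishable here, which is the aliasing problem the next slide addresses.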
9. Related Work: Stratified Sampler (cont.)
- To reduce aliasing and improve accuracy
  - Partial tags, miss counters, state information
  - Hit counters: number of occurrences
  - Miss counters: tuple hashes to a particular entry, but the tag differs (replacement policy)
- On reaching threshold value
  - Generate interrupt
  - Buffered, interrupt is sent when buffer fills up
  - Placed in associative counter table, passed to SW (via intermediate buffer)
- Accumulating information in SW (5% interrupt overhead)
10. Interval-based Profiling for a Single Hash Profiler
- Removing SW accumulator table
- Interval-based
  - significant number of occurrences within interval
  - reset hash-table counters after every interval
  - improving accuracy: shielding
- Divide execution time into intervals
  - interval length: fixed number of profiling events (tuples)
  - capture only events (candidate tuples) that occur more often than the candidate threshold (% of interval length)
11. Single Hash Architecture
- accumulator table is fully associative and tagged
- if (input tuple is in acc. table)
  - increment counter
- else
  - hash into hash-table
  - increment corresponding counter
  - hash-table does not contain tags: aliasing
- if (tuple reaches candidate threshold value)
  - if (acc. table is not full)
    - acc. table entry is allocated
    - mark entry as non-replaceable till the end of the interval
    - particular entry is not given as an input to the hash-table: shielding
- if (end of the interval)
  - flush hash-table
  - mark all entries in acc. table as replaceable
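The steps above can be modeled in software roughly as follows. This is a minimal sketch under several assumptions that are not in the slides: the class and parameter names are hypothetical, Python's `hash` stands in for the hardware hash function, a new accumulator entry starts from the hash-table counter value, and the accumulator is simply cleared at the end of an interval instead of marking entries replaceable (so the retaining optimization is not modeled).

```python
class SingleHashProfiler:
    """Toy model of the interval-based single-hash profiler."""

    def __init__(self, interval_length=100, threshold_count=5,
                 hash_entries=64, acc_capacity=20):
        self.interval_length = interval_length    # events per interval
        self.threshold_count = threshold_count    # occurrences to become a candidate
        self.hash_table = [0] * hash_entries      # untagged counters (aliasing possible)
        self.acc_capacity = acc_capacity
        self.acc = {}                             # tagged, fully associative: tuple -> count
        self.events_seen = 0
        self.profiles = []                        # candidates of each finished interval

    def observe(self, tup):
        if tup in self.acc:
            # Shielding: tuples already in the accumulator bypass the hash table.
            self.acc[tup] += 1
        else:
            i = hash(tup) % len(self.hash_table)  # hypothetical hash function
            self.hash_table[i] += 1
            if (self.hash_table[i] >= self.threshold_count
                    and len(self.acc) < self.acc_capacity):
                # Assumption: the new entry starts from the counter value.
                self.acc[tup] = self.hash_table[i]
        self.events_seen += 1
        if self.events_seen == self.interval_length:
            self._end_interval()

    def _end_interval(self):
        self.profiles.append(dict(self.acc))
        self.hash_table = [0] * len(self.hash_table)  # flush hash table
        self.acc.clear()   # simplification: drop entries instead of marking replaceable
        self.events_seen = 0


profiler = SingleHashProfiler(interval_length=100, threshold_count=5)
hot = (0x400, 7)
for _ in range(60):
    profiler.observe(hot)                         # frequent tuple
for n in range(40):
    profiler.observe((0x800 + 4 * n, n))          # 40 tuples seen once each
print(profiler.profiles[0][hot])
```

Since the hot tuple's hash-table counter stays at the threshold after promotion in this sketch, a cold tuple that aliases to it can be promoted as a false positive; the resetting optimization mentioned later in the deck targets that case.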
12. Single Hash Architecture (cont.)
- Calculate the worst-case number of entries in the acc. table (to avoid capacity and aliasing issues) as a function of profile interval length and candidate threshold
  - profile interval length: number of events that determine the profiling interval
  - candidate threshold: number of occurrences needed to get recorded in the acc. table (percentage of interval length)
- e.g. interval length 10,000
  - candidate threshold 1% => 100 entries
  - candidate threshold 0.1% => 1,000 entries
- 10,000 w/ 1% and 1 million w/ 0.1%
- Hash-table: 2K entries
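The worst-case sizing above follows from a counting argument: if every candidate must occur at least threshold × interval-length times, an interval of N events can hold at most N divided by that count. A small sketch of the arithmetic, where the function name `worst_case_entries` is hypothetical and the threshold is passed as a decimal string so the fraction is exact:

```python
from fractions import Fraction
import math


def worst_case_entries(interval_length, threshold):
    """Upper bound on candidate tuples in one interval.

    threshold is a fraction of the interval length, given as a string
    such as "0.01" (1%) so that it converts to an exact Fraction.
    """
    # Minimum number of occurrences a tuple needs to become a candidate.
    threshold_count = math.ceil(Fraction(threshold) * interval_length)
    # Each candidate consumes at least threshold_count of the events.
    return interval_length // threshold_count


small = worst_case_entries(10_000, "0.01")       # 1% of 10,000 events
medium = worst_case_entries(10_000, "0.001")     # 0.1% of 10,000 events
large = worst_case_entries(1_000_000, "0.001")   # 0.1% of 1 million events
print(small, medium, large)
```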
13. Single Hash Architecture (cont.)
- Hash functions for a given tuple <pc, value>
  - npc = flip(randomize(pc))
  - nv = randomize(value)
  - index = xor-fold(npc xor nv, index-size)
- Optimizations
  - Retaining: keeps top entries in the acc. table from the previous interval
  - Resetting: reset the counter in the hash-table after it reaches the candidate threshold
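One possible reading of the index computation is sketched below. The slides do not define `randomize` or `flip`, so both are assumptions here: `randomize` is modeled as a byte-wise lookup in a fixed table of pseudo-random bytes, and `flip` as a reversal of byte order in a 64-bit word. Only `xor_fold` follows directly from its name, XOR-ing successive index-size-bit chunks together. An index size of 11 bits corresponds to a 2K-entry hash table.

```python
import random

WORD_BYTES = 8                                   # assumed 64-bit pc and value
_rng = random.Random(0)
_TABLE = [_rng.randrange(256) for _ in range(256)]   # hypothetical random byte table


def randomize(x):
    """Assumed form: replace every byte of x via a fixed random table."""
    data = (x & (2 ** 64 - 1)).to_bytes(WORD_BYTES, "little")
    return int.from_bytes(bytes(_TABLE[b] for b in data), "little")


def flip(x):
    """Assumed form: reverse the byte order of a 64-bit word."""
    return int.from_bytes(x.to_bytes(WORD_BYTES, "little"), "big")


def xor_fold(x, index_size):
    """XOR together successive index_size-bit chunks of x."""
    mask = (1 << index_size) - 1
    out = 0
    while x:
        out ^= x & mask
        x >>= index_size
    return out


def table_index(pc, value, index_size=11):
    npc = flip(randomize(pc))
    nv = randomize(value)
    return xor_fold(npc ^ nv, index_size)


idx = table_index(0x120001000, 42)
print(idx)
```

Flipping only the pc side means the low-order bytes of the pc and the value, which tend to vary the most, land in different bit positions before the XOR; that motivation is an interpretation, not something stated on the slide.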
14. Experimental Setup
- Benchmarks
  - SPEC95: go, li, vortex
  - SPEC2K: gcc, vortex
  - deltablue, sis, burg
- Compilation
  - DEC Alpha 21164, DEC C (full optimizations)
- Profiling analysis: ATOM
- Fast forwarded and then ran for 500 million instructions
15. Error Calculation
- For each interval, compare candidates seen by the HW profiler and a perfect profiler
  - False Positive
  - False Negative
  - Neutral Positive
  - Neutral Negative
- Total error rate for an interval
16. Experimental Results
- Accuracy of HW profiling depends on
  - number of unique tuples in an interval (distinct tuples)
  - number of unique tuples that cross the threshold
- Analysis of candidate tuples
Figure: Number of distinct tuples seen in an interval, on average
17. Figure: Number of unique candidate tuples in an interval, on average
18. Figure: Percentage of variation of candidates from one interval to the next
19. Error Rates
- Single hash table with retaining/resetting: results across a set of benchmarks
20. Multiple-hash Profiler
- Independent hash functions (one for each table)
- if (no entry in acc. table)
  - hash to each table
  - update each counter
- if (all entries for a particular tuple in the hash tables reach the candidate threshold)
  - add entry to the acc. table
  - reset counters in the hash-tables (immediately or at the end of the interval)
- Conservative update: update just the smallest counter
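A software sketch of the multi-hash scheme with conservative update might look like the following. It is an illustration under assumptions, not the hardware design: the class name, the default sizes (4 tables of 512 entries, 2K in total), and the salted use of Python's `hash` as a stand-in for independent hash functions are all hypothetical. Ties for the smallest counter are all incremented, and counters are left untouched on promotion (the no-resetting variant).

```python
class MultiHashProfiler:
    """Toy model of a multi-hash profiler with optional conservative update."""

    def __init__(self, num_tables=4, entries_per_table=512,
                 threshold_count=100, conservative=True):
        self.tables = [[0] * entries_per_table for _ in range(num_tables)]
        self.threshold_count = threshold_count
        self.conservative = conservative
        self.acc = {}                             # accumulator: tuple -> count

    def _indices(self, tup):
        # Hypothetical independent hashes: salt the tuple with the table number.
        return [hash((salt, tup)) % len(table)
                for salt, table in enumerate(self.tables)]

    def observe(self, tup):
        if tup in self.acc:
            self.acc[tup] += 1                    # shielded from the hash tables
            return
        idx = self._indices(tup)
        smallest = min(t[i] for t, i in zip(self.tables, idx))
        for t, i in zip(self.tables, idx):
            # Conservative update: only the smallest counter(s) are incremented.
            if not self.conservative or t[i] == smallest:
                t[i] += 1
        counts = [t[i] for t, i in zip(self.tables, idx)]
        if min(counts) >= self.threshold_count:
            # Every table agrees the tuple crossed the threshold.
            self.acc[tup] = min(counts)

    def end_interval(self):
        for t in self.tables:
            t[:] = [0] * len(t)                   # flush all hash tables
        self.acc.clear()


profiler = MultiHashProfiler(num_tables=4, entries_per_table=512,
                             threshold_count=3)
hot = (0x400, 7)
for _ in range(5):
    profiler.observe(hot)
print(profiler.acc[hot])
```

A tuple is promoted only if its counters in every table reach the threshold, so a false positive requires aliasing with frequent tuples in all tables at once; conservative update further limits how much an aliased counter can be inflated.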
21. Figures:
- Multi-hash profiler for an interval of 10,000, a 1% candidate threshold, and a total of 2K hash-table entries
- Multi-hash profiler for an interval of 1 million, a 0.1% candidate threshold, and a total of 2K hash-table entries
22. Figures:
- Varying number of hash tables for the best multi-hash profiler: C1, R0 (w/ conservative update and w/o resetting) (left: 10,000, 1%; right: 1 million, 0.1%)
- Variation in the error across different intervals (left: BSH w/ resetting; right: multi-hash w/ conservative update and no resetting, 4 hash tables)
23. Summary
- Profiling architecture
  - Efficiently filters out important data
  - Efficient in terms of HW cost (6 KB (1 KB or 10 KB)) and overhead (no performance overhead)