Title: Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware
1. Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware
- Ann Gordon-Ross and Frank Vahid
- Department of Computer Science and Engineering, University of California, Riverside
- Also with the Center for Embedded Computer Systems, UC Irvine
This work was supported in part by the U.S. National Science Foundation and a U.S. Dept. of Education GAANN Fellowship.
International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, 2003.
2. System Optimizations - Static
- Specialization of a system for a particular application or suite of applications to improve power consumption and/or performance
- Static optimizations are performed at design time by the designer
- There are many static optimization approaches
  - Critical regions can be partitioned to configurable logic
  - Constant propagation and code specialization for statically determined invariant variables
  - Critical regions can be locked to a specialized cache
  - Many more
3. System Optimizations - Static
(Figure: static optimization performed at design time using modeled input stimulus, targeting power consumption; chip photo from http://www.ims.uconn.edu/metal/Undergrad/microchip.jpg)
4Static Optimization Drawbacks
- Designer must perform optimizations
- May disrupt standard software tool flows
- Runtime environment may present optimization
opportunities that are not evident during design
time simulation - Simulation is usually used
- Framework may take months to set up a testing
environment with realistic input stimuli - Exploration time may take weeks to run one of
hundreds of possible configurations
5. System Optimizations - Dynamic
- Dynamic software optimizations are becoming increasingly popular for improving software performance and power
- Dynamic optimizations are performed in-system during runtime
(Figure: the designer delivers the end product, which is then optimized at runtime based on runtime input, improving power consumption and execution time)
6. System Optimizations - Dynamic
- There are many dynamic optimization approaches
  - Dynamo performs dynamic software optimizations on the most frequently executed regions of code
  - Frequently executed regions of code can be remapped to non-interfering cache locations
  - Dynamic binary translation methods store translation results of frequently executed regions of code for quick look-up
  - Value profiling can determine runtime-invariant variables for constant propagation and/or code specialization
  - Many others
7. Dynamic Optimizations - Effectiveness
- For dynamic optimizations to be most effective, optimizations are typically applied to the most frequently executed regions of code
- For a large selection of the MediaBench benchmark suite, we observed that 90% of the execution time was spent in approximately 10% of the code
- Profiling is used to determine the critical regions of code
8. Previous Profiling Methods
Desktop
- Desktop-targeted profiling methods: instrumentation and sampling
- These methods are unsuitable for embedded systems because they disrupt run-time behavior
Embedded
- Early methods used logic analyzers, which is not possible for today's systems-on-a-chip (SOCs)
- The JTAG standard allows internal registers to be read
  - Typically used for testing and debugging
  - Interrupts the processor to write internal information to external pins
9. Profiling Methodology Goal
- The goal of our profiling approach is to design a
profiling tool suitable for embedded systems to
determine the most critical regions of code
10. Critical Region Detection - Operational Requirements
- Non-intrusion
  - Important for real-time systems
  - Minimizes the impact on current tool chains, i.e., no special compilers or binary modification tools
- Low power
  - Battery-operated systems
  - Systems with limited cooling
- Small area
  - Less significant due to the large transistor capacities of current and future chips
- Accuracy
  - Exact results are not required for the information to be useful; reasonable accuracy is acceptable
11. Frequent Loop Detection
- We analyzed the critical regions for various Powerstone and MediaBench benchmarks
- We translate the problem of finding the critical regions into finding the frequently executed loops
- A short backwards branch (sbb) instruction is typically the last instruction of a loop (see the sketch below)
Breakdown of all critical regions:
- 15% - subroutines with no inner loops
- 85% - small inner loops
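As a concrete illustration, here is a minimal C sketch of how an sbb might be recognized from a taken branch's address and target, under the simple assumption that the detector only checks whether the branch jumps backwards by a small fixed distance; the 256-byte threshold and the example addresses are illustrative assumptions, not values from the paper.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative threshold: how far backwards a branch may jump and
       still count as a short backwards branch (sbb). The exact distance
       used by the real detector is an assumption here. */
    #define SBB_MAX_DISTANCE 256

    /* A taken branch is flagged as an sbb when its target lies a short
       distance *before* the branch itself - the usual shape of the
       branch that closes a small loop body. */
    static bool is_sbb(uint32_t branch_pc, uint32_t target_pc)
    {
        return target_pc < branch_pc &&
               (branch_pc - target_pc) <= SBB_MAX_DISTANCE;
    }

    int main(void)
    {
        /* Hypothetical taken-branch events: (branch PC, target PC). */
        uint32_t trace[][2] = {
            { 0x1040, 0x1000 },   /* jumps back 0x40 bytes: an sbb         */
            { 0x2000, 0x3000 },   /* forward branch: not an sbb            */
            { 0x9000, 0x1000 },   /* long backwards jump: not a small loop */
        };

        for (size_t i = 0; i < sizeof trace / sizeof trace[0]; i++)
            printf("branch at 0x%04x -> 0x%04x : %s\n",
                   (unsigned)trace[i][0], (unsigned)trace[i][1],
                   is_sbb(trace[i][0], trace[i][1]) ? "sbb" : "not sbb");
        return 0;
    }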
12. Percentage of Execution Time for Frequent Loops
- In addition to detecting the frequent loops, we also want to know each loop's percentage contribution to total execution time (a simple estimate is sketched below)
Example: two applications with the same three frequent loops but very different distributions:
- Application X: Loop A - 32%, Loop B - 33%, Loop C - 35%
- Application Y: Loop A - 10%, Loop B - 10%, Loop C - 80%
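One simple way to estimate this from the detector's counters (my formulation, not taken from the paper, assuming each counted sbb execution corresponds to roughly one iteration of a similarly sized loop body):

    \[ \hat{p}_i \;\approx\; \frac{c_i}{\sum_j c_j} \times 100\% \]

where c_i is the frequency counter value held for loop i's sbb.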
13. Frequent Loop Detection - Cache-Based Architecture
(Block diagram: the frequent loop detector sits alongside the processor's connection to L1 memory; labeled signals include rd/wr, addr, and data on the memory interface, plus sbb and saturation signals for the frequent loop cache)
14. Cache Operation
(Animation: an example sbb trace; each detected sbb address, e.g. 1111001, indexes the cache and its frequency counter is updated)
15. Cache Operation - Conflict Resolution
- Most conflicts are resolved using associativity, with an LRU replacement policy for further conflicts (see the sketch below)
- Remaining conflicts may cause frequent loops to be constantly replaced in the cache - thrashing
- Our experiments did not suffer from this contention, but a victim buffer may be added if necessary
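A minimal software sketch of the frequent loop cache's behavior on each detected sbb: the sbb address selects a set, a hit increments that entry's frequency counter, and a miss allocates an empty or least-recently-used way. The index hash, field widths, and LRU encoding here are illustrative assumptions rather than the exact hardware design.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define NUM_SETS  16           /* 32 entries organized as 2-way set-associative */
    #define NUM_WAYS  2
    #define FREQ_BITS 24           /* frequency counter field width */
    #define FREQ_MAX  ((1u << FREQ_BITS) - 1)

    typedef struct {
        uint32_t tag;              /* sbb address stored in the entry       */
        uint32_t freq;             /* saturating frequency counter          */
        uint8_t  valid;
        uint8_t  lru;              /* higher value = less recently used     */
    } Entry;

    static Entry cache[NUM_SETS][NUM_WAYS];

    /* Mark `way` most recently used within its set. */
    static void touch(Entry *set, int way)
    {
        for (int w = 0; w < NUM_WAYS; w++)
            if (w != way && set[w].lru < NUM_WAYS - 1)
                set[w].lru++;
        set[way].lru = 0;
    }

    /* Record one execution of the sbb at `addr`; returns the entry's
       counter value after the update. */
    uint32_t record_sbb(uint32_t addr)
    {
        Entry *set = cache[(addr >> 2) % NUM_SETS];   /* index hash: an assumption */

        /* Hit: bump the counter (halving on saturation is sketched separately). */
        for (int w = 0; w < NUM_WAYS; w++) {
            if (set[w].valid && set[w].tag == addr) {
                if (set[w].freq < FREQ_MAX)
                    set[w].freq++;
                touch(set, w);
                return set[w].freq;
            }
        }

        /* Miss: pick an empty way if one exists, otherwise the LRU way. */
        int victim = 0;
        for (int w = 1; w < NUM_WAYS; w++) {
            if (!set[victim].valid)
                break;
            if (!set[w].valid || set[w].lru > set[victim].lru)
                victim = w;
        }
        set[victim].valid = 1;
        set[victim].tag   = addr;
        set[victim].freq  = 1;
        touch(set, victim);
        return 1;
    }

    int main(void)
    {
        memset(cache, 0, sizeof cache);
        uint32_t trace[] = { 0x1040, 0x1040, 0x1040, 0x2080, 0x1040 };
        for (size_t i = 0; i < sizeof trace / sizeof trace[0]; i++)
            printf("sbb 0x%04x -> count %u\n",
                   (unsigned)trace[i], (unsigned)record_sbb(trace[i]));
        return 0;
    }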
16. Cache Operation - Frequency Width
- Our goal is to find the smallest possible cache needed to determine the frequent loops
- We keep the cache small by allowing the frequency field width to be varied
- If the frequency field is too small, saturations can occur and frequency information may be lost
17. Cache Operation - Frequency Counter Saturation
- When a counter saturates, all frequencies are divided by 2 with a shift right (built as a special feature of the cache and activated by asserting the saturation signal to the cache)
(Animation: an sbb trace with three cached sbbs - 1111001 with counter 255, 1101010 with counter 100, and 1011010 with counter 2; when the counter at 255 saturates, every counter in the cache is shifted right, as sketched below)
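A small sketch of this saturation behavior: when a counter reaches its maximum, every counter in the cache is halved with a shift right, which preserves the loops' relative frequencies. The 8-bit width here just matches the 255 in the example and is otherwise an assumption.

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_ENTRIES 32
    #define FREQ_BITS   8                      /* matches the 255 in the example above */
    #define FREQ_MAX    ((1u << FREQ_BITS) - 1)

    static uint32_t freq[NUM_ENTRIES];         /* one counter per cache entry */

    /* Increment entry `i`; if it is already at the maximum, assert
       "saturation": halve every counter with a shift right, then count
       the new execution. */
    void increment_with_saturation(int i)
    {
        if (freq[i] == FREQ_MAX)
            for (int e = 0; e < NUM_ENTRIES; e++)
                freq[e] >>= 1;                 /* global shift right */
        freq[i]++;
    }

    int main(void)
    {
        freq[0] = 255;   /* sbb 1111001 in the slide's example */
        freq[1] = 100;   /* sbb 1101010                        */
        freq[2] = 2;     /* sbb 1011010                        */

        increment_with_saturation(0);
        printf("after saturation: %u %u %u\n",
               (unsigned)freq[0], (unsigned)freq[1], (unsigned)freq[2]);
        /* prints "after saturation: 128 50 1" - the ordering of the
           loops by frequency is unchanged */
        return 0;
    }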
18. Experimental Setup
- We ran extensive experiments to determine the best frequent loop cache configuration
- Cache configurations simulated:
  - Cache sizes: 16, 32, and 64 entries
  - Cache associativities: 1, 2, 4, and 8-way
  - Frequency counter field widths: 4 to 32 bits
  - 336 configurations in total
- To determine the accuracy of each cache configuration, we wrote a trace simulator for the cache architecture in C (a driver sketch follows below)
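For illustration only, a sketch of what such a configuration sweep might look like; simulate_trace() is a hypothetical stand-in for the trace-driven cache model (the sketches above), and the exact set of frequency widths behind the 336 configurations is not spelled out here, so the inner loop bounds are an assumption.

    #include <stdio.h>

    /* Hypothetical interface to the trace-driven cache model: returns an
       accuracy figure for one (entries, ways, counter-width) configuration
       when replayed against a recorded sbb trace. Stubbed for illustration. */
    static double simulate_trace(int entries, int ways, int freq_bits)
    {
        (void)entries; (void)ways; (void)freq_bits;
        return 0.0;   /* a real model would replay the benchmark's sbb trace */
    }

    int main(void)
    {
        const int entries[] = { 16, 32, 64 };
        const int ways[]    = { 1, 2, 4, 8 };

        for (int e = 0; e < 3; e++)
            for (int w = 0; w < 4; w++)
                for (int bits = 4; bits <= 32; bits++)   /* width range from the slide */
                    printf("%2d entries, %d-way, %2d-bit counters: accuracy %.2f\n",
                           entries[e], ways[w], bits,
                           simulate_trace(entries[e], ways[w], bits));
        return 0;
    }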
19. Experimental Setup
- Benchmarks
  - Selected Powerstone benchmarks running on a 32-bit MIPS instruction set simulator
  - Selected MediaBench benchmarks running on SimpleScalar
- Power consumption
  - UMC 0.18-micron CMOS technology running at 250 MHz at 1.8 V
  - Cache memory power consumption obtained using the Artisan memory compiler
  - Additional logic and functionality modeled in synthesizable VHDL using Synopsys Design Compiler
20. Accuracy - Sum of Differences
- We computed the average difference between the actual loop execution time percentage and the computed loop execution time percentage for the ten most frequently executed loops (written out below)
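In my notation (treating the "difference" as an absolute difference is an assumption), with a_i the actual and p_i the computed execution-time percentage of the i-th of the ten most frequently executed loops:

    \[ \text{error} = \frac{1}{10} \sum_{i=1}^{10} \lvert a_i - p_i \rvert \]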
21. Results - Sum of Differences
- Sum of differences results averaged over all
Powerstone benchmarks
22. Results - Best Cache Configuration
- We determined the smallest possible cache configuration necessary to give good results
- Overall best cache configuration:
  - 2-way, 32-entry cache with a frequency width of 24 bits
  - 95% accuracy for Powerstone
  - 90% accuracy for MediaBench
  - No change to system performance
23. Results - Base System Power and Area
- MIPS32 4Kp microprocessor core
  - Embedded processor with a cache
  - Small - area of 1.7 mm²
  - Low power - 528 mW
24. Results - Frequent Loop Detector Power Overhead
- For the best cache configuration:
  - The increase in average power consumption of the total system with the frequent loop detector is 2.4% (reconciled in the estimate below)
Power consumption of operations:
- 142 mW per cache read and increment
- 156 mW per cache write
- 20.7 mW per saturation
Average frequency of operations:
- Cache updates: 4.25%
- Saturations: 0.000051%
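As a rough sanity check of these figures (my reconstruction, assuming each cache update consists of one read-and-increment plus one write, and that the 4.25% update frequency is measured relative to total execution):

    \[ 0.0425 \times (142 + 156)\,\text{mW} \approx 12.7\,\text{mW} \approx 2.4\% \times 528\,\text{mW} \]

which matches the reported 2.4% system power overhead; the saturation contribution is negligible at a 0.000051% rate.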
25. Results - Frequent Loop Detector Area Overhead
- The resulting area overhead is 10.5% compared to the reported size of the MIPS32 4Kp
- Our numbers are pessimistic, while reported microprocessor areas are likely optimistic
Area overhead:
- Frequent loop cache controller, incrementor, and additional control/steering logic: 1400 gates (0.0012 mm²)
- Cache including saturation logic: 0.167 mm²
26. Reducing Power Overhead via Frequent Update Coalescing
- Since frequent loops tend to iterate many times, the same entry is updated in the frequent loop cache many times in a row
- Consecutive executions of the same sbb can be coalesced into one cache update to reduce the number of cache updates (see the sketch below)
(Animation: for the sbb trace 1110 1110 1110 1110, the uncoalesced cache performs four separate "increment frequency" operations, while the coalesced cache performs a single "add 4 to frequency" update)
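A minimal sketch of the coalescing idea, assuming the detector keeps the address and pending count of the most recent sbb in a small register and only touches the cache when a different sbb appears; this flush-on-change policy and record_sbb_add() (a cache update that adds n to an entry's counter) are illustrative assumptions.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical cache interface: add `n` to the counter of the entry
       for the sbb at `addr` (one cache read-modify-write). */
    static void record_sbb_add(uint32_t addr, uint32_t n)
    {
        printf("cache update: sbb 0x%04x += %u\n", (unsigned)addr, (unsigned)n);
    }

    /* Coalescing front end: consecutive executions of the same sbb are
       accumulated in a register and flushed as a single cache update
       when a different sbb is seen. */
    static uint32_t pending_addr;
    static uint32_t pending_count;

    void coalesce_sbb(uint32_t addr)
    {
        if (pending_count > 0 && addr == pending_addr) {
            pending_count++;                  /* same loop iterating: no cache access */
            return;
        }
        if (pending_count > 0)
            record_sbb_add(pending_addr, pending_count);   /* flush previous run */
        pending_addr  = addr;
        pending_count = 1;
    }

    int main(void)
    {
        /* Four back-to-back executions of one sbb, then a different sbb:
           the first four become a single "+= 4" cache update. */
        uint32_t trace[] = { 0x1110, 0x1110, 0x1110, 0x1110, 0x2220, 0x2220 };
        for (size_t i = 0; i < sizeof trace / sizeof trace[0]; i++)
            coalesce_sbb(trace[i]);
        if (pending_count > 0)
            record_sbb_add(pending_addr, pending_count);   /* final flush */
        return 0;
    }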
27. Reducing Power Overhead via Frequent Update Coalescing
(Results charts for the MediaBench and Powerstone benchmarks)
28. Sampling for Further Reduced Power Overhead
- Instead of tallying every sbb executed, only tally sbbs that occur at a fixed sampling interval (see the sketch below)
- This method does not require interrupting the processor
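A sketch of the sampling idea layered on top of the earlier update path, assuming a simple countdown register so that only every N-th detected sbb is tallied (N = 50 in the later results); whether the hardware counts sbbs or cycles between samples is an assumption here.

    #include <stdint.h>
    #include <stdio.h>

    #define SAMPLING_INTERVAL 50   /* tally only every 50th sbb (interval used in the results) */

    /* Hypothetical stand-in for the cache update from the earlier sketch. */
    static void record_sbb(uint32_t addr)
    {
        printf("tallied sbb 0x%04x\n", (unsigned)addr);
    }

    /* Countdown register: most sbbs are ignored; when the counter expires,
       the sbb is forwarded to the frequent loop cache and the counter
       reloads. No processor interrupt is involved at any point. */
    static unsigned countdown = SAMPLING_INTERVAL;

    void sample_sbb(uint32_t addr)
    {
        if (--countdown == 0) {
            record_sbb(addr);
            countdown = SAMPLING_INTERVAL;
        }
    }

    int main(void)
    {
        /* Feed 200 sbb events; only 4 of them (every 50th) reach the cache. */
        for (int i = 0; i < 200; i++)
            sample_sbb(0x1040);
        return 0;
    }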
29. Sum of Differences for Sampling
(Sum-of-differences charts for the Powerstone and MediaBench benchmarks)
30. Sum of Differences for Sampling
- When going from a sampling interval of 1 to 50:
  - Average accuracy decreases by 5% for the Powerstone benchmarks
  - Average accuracy increases by 2% for the MediaBench benchmarks
31. Results for a Sampling Interval of 50
- Coalescing plus sampling reduces the average system power overhead to a mere 0.02%
- Still no change to system performance
Average frequency of operations:
- Cache updates: 0.03%
- Coalesces: 0.06%
- No saturations
32. Example Use - Warp Processing
- The detector has been successfully incorporated
into a novel prototype system-on-a-chip
architecture performing what is presently known
as warp processing (also being developed at UCR)
33. Warp Processing
(Block diagram of the warp processing system-on-a-chip: several µP cores, a profiler, memory, configurable logic, and a dynamic partitioning module)
34. Conclusions
- We have presented a frequent loop detector that is small, power-efficient, non-intrusive, and accurately provides the relative frequencies of loops
  - 2-way set-associative, 32-entry cache with a 24-bit frequency counter
- Power overhead of 2.4% compared to a low-power 32-bit embedded processor
- Power overhead is easily reducible to well below 0.1% using simple coalescing and sampling methods
- Currently being used in the profiling step of the Warp processor at UCR