Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware

1
Frequent Loop Detection Using Efficient
Non-Intrusive On-Chip Hardware
  • Ann Gordon-Ross and Frank Vahid*
  • Department of Computer Science and Engineering
  • University of California, Riverside
  • *Also with the Center for Embedded Computer
    Systems, UC Irvine

This work was supported in part by the U.S.
National Science Foundation and a U.S. Dept. of
Education GAANN Fellowship
International Conference on Compilers,
Architecture, and Synthesis for Embedded Systems
2003.
2
System Optimizations - Static
  • Specialization of a system for a particular
    application or suite of applications to improve
    power consumption and/or performance
  • Static optimizations are performed at design time
    by the designer
  • There are many static optimization approaches
  • Critical regions can be partitioned to
    configurable logic
  • Constant propagation and code specialization for
    statically determined invariant variables
  • Critical regions can be locked to a specialized
    cache
  • Many more

3
System Optimizations - Static
[Figure: design-time optimization. Labels: Power Consumption, Modeled input stimulus, Design Time. Image source: http://www.ims.uconn.edu/metal/Undergrad/microchip.jpg]
4
Static Optimization Drawbacks
  • Designer must perform optimizations
  • May disrupt standard software tool flows
  • Runtime environment may present optimization
    opportunities that are not evident during design
    time simulation
  • Simulation is usually used
  • Setting up a testing environment with realistic
    input stimuli may take months
  • Exploration may take weeks to run just one of
    hundreds of possible configurations

5
System Optimizations - Dynamic
  • Dynamic software optimizations are becoming
    increasingly popular for improving software
    performance and power.
  • Dynamic optimizations are performed in system
    during runtime

[Figure: dynamic optimization in the end product. Labels: Designer, End Product, Power Consumption, Runtime Input, Execution Time]
6
System Optimizations - Dynamic
  • There are many dynamic optimization approaches
  • Dynamo performs dynamic software optimizations on
    the most frequently executed regions of code
  • Frequently executed regions of code can be
    remapped to non-interfering cache locations
  • Dynamic binary translation methods store
    translation results of frequently executed
    regions of code for quick look-up
  • Value profiling can determine runtime invariant
    variables for constant propagation and/or code
    specialization
  • Many others.

7
Dynamic Optimizations - Effectiveness
  • For dynamic optimizations to be most effective,
    optimizations are typically applied to the most
    frequently executed regions of code
  • For a large selection of the MediaBench benchmark
    suite, we observed that 90% of the execution time
    was spent in approximately 10% of the code
  • Profiling is used to determine the critical
    regions of code

8
Previous Profiling Methods
Desktop
  • Desktop targeted profiling methods
  • Instrumentation and sampling
  • These methods are unsuitable for embedded systems
  • They disrupt run-time behavior

Embedded
  • Early methods used logic analyzers
  • Not possible for today's systems-on-a-chip (SoCs)
  • JTAG standard allows internal registers to be
    read
  • Typically used for testing and debugging
  • Interrupts the processor to write internal
    information to external pins
9
Profiling Methodology Goal
  • The goal of our profiling approach is to design a
    profiling tool suitable for embedded systems to
    determine the most critical regions of code

10
Critical Region Detection - Operational
Requirements
  • Non-intrusion
  • Important for real-time systems
  • Minimizes the impact on current tool chains, i.e.,
    no special compilers or binary modification tools
  • Low power
  • Battery operated systems
  • Systems with limited cooling
  • Small area
  • Less significant due to the large transistor
    capacities of current and future chips
  • Accuracy
  • Exact results are not required for the
    information to be useful -- instead, reasonable
    accuracy is acceptable

11
Frequent Loop Detection
  • We analyzed the critical regions for various
    Powerstone and MediaBench benchmarks
  • We translate the problem of finding the critical
    regions into finding the frequently executed loops
  • A short backward branch (sbb) instruction is
    typically the last instruction of a loop

All Critical Regions
15% - Subroutines with no inner loops
85% - Small inner loops
12
Percentage of Execution Time for Frequent Loops
  • In addition to detecting frequent loops, we
    also want to know each loop's percentage
    contribution to total execution time.

          Application X    Application Y
Loop A    32%              10%
Loop B    33%              10%
Loop C    35%              80%
13
Frequent Loop Detection - Cache Based Architecture
[Figure: block diagram of the cache-based architecture. Signal labels: rd/wr, addr, data (to L1 memory), sbb, saturation]
14
Cache Operation
[Figure: cache operation on an sbb trace. Example sbb address: 1111001]
15
Cache Operation - Conflict Resolution
  • Most conflicts are resolved through associativity;
    an LRU replacement policy handles the rest
  • Remaining conflicts may cause frequent loops to
    be replaced constantly in the cache (thrashing)
  • Our experiments did not suffer from this
    contention, but a victim buffer may be added if
    necessary
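
As an illustration of the behavior described above, here is a minimal software sketch of a set-associative cache that counts sbb executions and evicts with LRU. It is not the authors' hardware design: the class name `FrequentLoopCache`, indexing by the low bits of the sbb address, and the use of an ordered dictionary to track recency are all assumptions made for this example.

```python
from collections import OrderedDict


class FrequentLoopCache:
    """Toy model of a set-associative cache keyed by sbb address (hypothetical)."""

    def __init__(self, entries=32, ways=2):
        assert entries % ways == 0
        self.ways = ways
        self.num_sets = entries // ways
        # One OrderedDict per set: sbb address -> frequency count.
        # Order within each dict doubles as LRU order (oldest first).
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def update(self, sbb_addr):
        """Record one execution of the sbb at sbb_addr."""
        # Assumption: the set index is the address modulo the number of sets.
        entries = self.sets[sbb_addr % self.num_sets]
        if sbb_addr in entries:
            entries[sbb_addr] += 1
            entries.move_to_end(sbb_addr)      # mark as most recently used
        else:
            if len(entries) == self.ways:
                entries.popitem(last=False)    # evict the least recently used entry
            entries[sbb_addr] = 1

    def counts(self):
        """Return all cached sbb addresses with their frequency counts."""
        return {addr: n for entries in self.sets for addr, n in entries.items()}
```

With `entries=4, ways=2`, three sbb addresses that map to the same set show the thrashing risk: the third address evicts the least recently used entry even if that entry has the highest count.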

16
Cache Operation - Frequency Width
  • Our goal is to find the smallest possible cache
    needed to determine the frequent loops
  • We keep the cache small by allowing the frequency
    field width to be varied
  • If the frequency field is too small, saturations
    can occur and frequency information may be lost

17
Cache Operation - Frequency Counter Saturation
  • All frequencies are divided by 2 with a right
    shift (built as a special feature of the cache
    and activated by asserting the saturation signal
    to the cache)

[Figure: sbb trace and cache contents at saturation]

Sbb address    Frequency
1111001        255
1101010        100
1011010        2
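
The saturation step can be sketched in software as follows. This is only an illustration under stated assumptions: the function name `increment_with_saturation` is hypothetical, the counter is assumed to be 8 bits wide (so 255 is the maximum), and the pending increment is assumed to be applied after the halving.

```python
def increment_with_saturation(freqs, addr, width_bits=8):
    """Increment freqs[addr]; if that counter is already at its maximum,
    first halve every counter with a right shift by one."""
    max_count = (1 << width_bits) - 1
    if freqs.get(addr, 0) == max_count:
        for key in freqs:
            freqs[key] >>= 1          # divide all frequencies by 2
    # Assumption: the increment that triggered saturation is still applied.
    freqs[addr] = freqs.get(addr, 0) + 1
    return freqs


freqs = {0b1111001: 255, 0b1101010: 100, 0b1011010: 2}
increment_with_saturation(freqs, 0b1111001)
# freqs is now {0b1111001: 128, 0b1101010: 50, 0b1011010: 1}
```

Halving keeps the counters' relative ordering roughly intact, which is why absolute counts can be lost without losing the relative frequency information.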
18
Experimental Setup
  • We ran extensive experiments to determine the
    best frequent loop cache configuration
  • Cache configurations simulated
  • To determine the accuracy of each cache
    configuration we wrote a trace simulator for the
    cache architecture in C

Cache Sizes                       16, 32, and 64 entries
  X
Cache Associativities             1, 2, 4, and 8-way
  X
Frequency Counter Field Widths    4 to 32 bits

336 configurations
19
Experimental Setup
  • Benchmarks
  • Selected Powerstone benchmarks running on a
    32-bit MIPS instruction set simulator
  • Selected MediaBench benchmarks running on
    SimpleScalar
  • Power consumption
  • UMC 0.18-micron CMOS technology running at 250
    MHz at 1.8V
  • Cache memory power consumption obtained using the
    Artisan memory compiler
  • Additional logic and functionality modeled in
    synthesizable VHDL using Synopsys Design Compiler

20
Accuracy - Sum of Differences
  • We computed the average difference between the
    actual loop execution time percentage and the
    computed loop execution time percentage for the
    ten most frequently executed loops

21
Results - Sum of Differences
  • Sum of differences results averaged over all
    Powerstone benchmarks

22
Results - Best Cache Configuration
  • We determine the smallest possible cache
    configuration necessary to give good results
  • Overall best cache configuration -
  • 2-way 32-entry cache with a frequency width of 24
    bits
  • 95% accuracy for Powerstone
  • 90% accuracy for MediaBench
  • No change to system performance

23
Results - Base System Power and Area
  • MIPS32 4Kp microprocessor core
  • Embedded processor with a cache
  • Small - area of 1.7 mm²
  • Low power - 528 mW

24
Results - Frequent Loop Detector Power Overhead
  • For the best cache configuration -
  • Increase in average power consumption of the
    total system with frequent loop detector is 2.4%

Power Consumption of Operations
  Cache read and increment    142 mW
  Cache write                 156 mW
  Saturation                  20.7 mW

Average Frequency of Operations
  Cache updates               4.25%
  Saturations                 0.000051%
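
The overhead figure can be checked with a short calculation. This is a sketch under two assumptions that the slides do not state explicitly: the operation frequencies are read as percentages, and each cache update is taken to cost one read-and-increment plus one write. The 528 mW base power is the MIPS32 4Kp figure quoted earlier in the deck.

```python
BASE_POWER_MW = 528.0              # base system power quoted in the slides
READ_INCREMENT_MW = 142.0          # per cache read and increment
WRITE_MW = 156.0                   # per cache write
SATURATION_MW = 20.7               # per saturation
UPDATE_FREQ = 4.25 / 100           # cache updates, as a fraction
SATURATION_FREQ = 0.000051 / 100   # saturations, as a fraction


def power_overhead_percent():
    """Average detector power as a percentage of the base system power."""
    # Assumption: one update = one read-and-increment plus one write.
    avg_mw = (UPDATE_FREQ * (READ_INCREMENT_MW + WRITE_MW)
              + SATURATION_FREQ * SATURATION_MW)
    return 100.0 * avg_mw / BASE_POWER_MW


print(round(power_overhead_percent(), 1))  # prints 2.4
```

Under these assumptions the average detector power is about 12.7 mW, and saturations contribute a negligible share of it.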
25
Results - Frequent Loop Detector Area Overhead
  • Resulting area overhead of 10.5% compared to the
    reported size of the MIPS32 4Kp
  • Our numbers are pessimistic while reported
    microprocessor areas are likely optimistic

Area Overhead
  Frequent loop cache controller, incrementor, and
  additional control/steering logic                   1400 gates (0.0012 mm²)
  Cache including saturation logic                    0.167 mm²
26
Reducing Power Overhead via Frequent Update
Coalescing
  • Since frequent loops tend to iterate many times,
    the same entry is updated in the frequent loop
    cache many times in a row
  • Coalesce consecutive sbb executions into one
    cache update to reduce cache updates

Sbb trace: 1110 1110 1110 1110

Without coalescing: increment frequency, once per sbb (four cache updates)
With coalescing: add 4 to frequency (one cache update)
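
Coalescing is essentially run-length encoding of the sbb trace. The sketch below illustrates the idea in software; the function name `coalesce` is hypothetical, and a hardware version would presumably also need a bounded run counter, which is not modeled here.

```python
from itertools import groupby


def coalesce(sbb_trace):
    """Collapse runs of identical consecutive sbb addresses into
    (address, run_length) pairs, one cache update per pair."""
    return [(addr, sum(1 for _ in run)) for addr, run in groupby(sbb_trace)]


print(coalesce([0b1110] * 4))  # prints [(14, 4)]
```

Each pair corresponds to a single cache update that adds `run_length` to the entry's frequency, instead of `run_length` separate increments.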
27
Reducing Power Overhead via Frequent Update
Coalescing
[Figure: two result panels, MediaBench and Powerstone]
28
Sampling for Further Reduced Power Overhead
  • Instead of tallying every sbb executed, only
    tally sbbs that occur at a fixed sampling
    interval
  • This method does not require interrupting the
    processor.
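
A minimal sketch of fixed-interval sampling follows. It assumes the interval is counted in executed sbbs (the slides do not say whether it is counted in sbbs, instructions, or cycles), and the function name `sample_trace` is hypothetical.

```python
from collections import Counter


def sample_trace(sbb_trace, interval=50):
    """Keep every interval-th sbb in the trace and drop the rest."""
    return sbb_trace[interval - 1::interval]


trace = [1] * 90 + [2] * 10            # loop 1 runs 90% of the time, loop 2 10%
print(Counter(sample_trace(trace, interval=10)))  # prints Counter({1: 9, 2: 1})
```

In this example the sampled counts keep the same 90/10 ratio while the number of tallied sbbs drops by a factor of ten.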

29
Sum of Differences for Sampling
[Figure: two result panels, Powerstone and MediaBench]
30
Sum of Differences for Sampling
  • When going from a sampling interval of 1 to 50 -
  • Average accuracy decreases for Powerstone
    benchmarks by 5%
  • Average accuracy increases for MediaBench
    benchmarks by 2%

31
Results for a Sampling Interval of 50
  • Coalescing plus sampling reduces the average
    system power overhead to a mere 0.02%
  • Still no change to system performance

Average Frequency of Operations
  Cache updates    0.03%
  Coalesces        0.06%
  Saturations      none
32
Example Use - Warp Processing
  • The detector has been successfully incorporated
    into a novel prototype system-on-a-chip
    architecture performing what is presently known
    as warp processing (also being developed at UCR)

33
Warp Processing
[Figure: warp processor architecture. Block labels: µP (four instances), Profiler, Mem, Configurable Logic, Dynamic Partitioning Module]
34
Conclusions
  • We have presented a frequent loop detector that
    is small, power-efficient, non-intrusive and
    accurately provides relative frequencies of loops
  • 2-way set-associative 32-entry cache with a
    24-bit frequency counter
  • Power overhead of 2.4% compared to a low-power
    32-bit embedded processor
  • Power overhead is easily reducible to well below
    0.1% using simple coalescing and sampling methods
  • Currently being used in the profiling step of the
    Warp processor at UCR