Lazy Logic

About This Presentation

Title:

Lazy Logic

Description:

Mikko Lipasti, University of Wisconsin Seminar--University of Toronto ... Mikko Lipasti, University of Wisconsin Seminar--University of Toronto. What is Lazy Logic? ... – PowerPoint PPT presentation

Number of Views:85

Avg rating:3.0/5.0

Slides: 77

Provided by: eceW

Category:

more less

Transcript and Presenter's Notes

Title: Lazy Logic

1
Lazy Logic

Mikko H. Lipasti
Associate Professor
Department of Electrical and Computer Engineering
University of WisconsinMadison

http//www.ece.wisc.edu/pharm
2
CMOS History

CMOS has been a faithful servant
40 years since invention
Tremendous advances
Device size, integration level
Voltage scaling
Yield, manufacturability, reliability
Nearly 20 years now as high-performance workhorse
Result life has been easy for architects
Ease leads to complacency laziness

3
CMOS Futures

The reports of my demise are greatly
exaggerated. Mark Twain
CMOS has some life left in it
Device scaling will continue
What comes after CMOS
Many new challenges
Process variability
Device reliability
Leakage power
Dynamic power

Focus of this talk
4
Dynamic Power

Static CMOS current flows when transistors
switch
Combinational logic evaluates new inputs
Flip-flop, latch captures new value (clock edge)?
Terms
C capacitance of circuit
wire length, number and size of transistors
V supply voltage
A activity factor
f frequency
Architects can/should focus on Ci x Ai
Reduce capacitance of each unit
Reduce activity of each unit

5
Design Objective Inversion

Historically, hardware was expensive
Every gate, wire, cable, unit mattered
Squeeze maximum utilization from each
Now, power is expensive
On-chip devices wires, not so much
Should minimize Ci x Ai
Logic should be simple, infrequently used
Both sequential and combinational
?Lazy Logic?

6
Talk Outline

Motivation
What is Lazy Logic?
Applications of Lazy Logic
Circuit-switched coherence
Stall-cycle redistribution
Dynamic scheduling
Conclusions
Research Group Overview

7
What is Lazy Logic?

Design philosophy
Some overall principles
Minimize unit utilization
Minimize unit complexity
OK to increase number of units/wires/devices
As long as reduced Ai (activity) compensates
Dont forget leakage
Result
Reject conventional good ideas
Reduce power without loss of performance
Sometimes improve performance

8
Lazy Logic Applications

CMP interconnection networks
Old Packet-switched, store-and-forward
New Circuit-switched, reconfigurable
Stall cycle redistribution
Transparent pipelines want fine-grained stalls
Redistribute coarse stalls into fine stalls
High-performance dynamic scheduling
Cycle time goal achieved by replicating ALUs

9
CMP Interconnection Networks

Options
Buses dont scale
Crossbars are too expensive
Rings are too slow
Packet-switched mesh
Attractive for all the DSM reasons
Scalable
Low latency
High link utilization

10
CMP Interconnection Networks

But
Cables/traces are now on-chip wires
Fast, cheap, plentiful
Short 1 cycle per hop
Router latency adds up
3-4 cpu cycles per hop
Store-and-forward
Lots of activity/power
Is this the right answer?

11
Circuit-switched Interconnects

Communication patterns
Spatial locality to memory
Pairwise communication
Circuit-switched links
Avoid switching/routing
Reduce latency
Save power?

12
Router Design

Switches can be logically configured to appear as
wires (no routing overhead)?
Can also act as packet-switched network
Can switch back and forth very easily
Detailed router design not presented here

13
Dirty Miss coverage
14
Directory Protocol

Initial 3-hop miss establishes CS path
Subsequent miss requests
Sent directly on CS path to predicted owner
Also in parallel to home node
Predicted owner sources data early
Directory acks update to sharing list
Benefits
Reduced 3-hop latency
Less activity, less power

15
Circuit-switched Performance
16
Link Activity
17
Buffer Activity
18
Circuit-switched Coherence

Summary
Reconfigurable interconnect
Circuit-switched links
Some performance benefit
Substantial reduction in activity
Current status (slides are out of date)?
Router design and physical/area models
Protocol tuning and tweaks, etc.
Initial results in CA Letters paper

19
Talk Outline

Motivation
What is Lazy Logic?
Applications of Lazy Logic
Circuit-switched coherence
Stall-cycle redistribution
Dynamic scheduling
Conclusions
Research Group Overview

20
Pipeline Clocking Revisited
A
B

Conventional pipeline clock gating
Each valid work unit gets clocked into each latch
This is needlessly conservative

Two units of work, 10 clock pulses
Latches clocked to propagate data
21
Transparent Pipeline Gating
A
B

Transparent pipelining novel approach to
clocking Jacobsen 2004, 2005
Both master and slave latch can remain
transparent
Gating logic ensures no races
Pipeline registers are clocked lazily only when
race occurs
Quite effective for low utilization pipelines
Gaps between valid work units enable transparent
mode

Two units of work, 5 clock pulses
return
22
Applications

Best suited for low utilization pipelines
E.g. FP, Media processing functional units
High utilization pipelines see least benefit
E.g. Instruction fetch pipelines
To benefit from transparent approach
Valid data items need fine-grained gaps (stalls)?
1-cycle gap provides lions share (50)?

23
Application Front-end Pipelines

Provide back-end with sufficient supply of
instructions to find ILP
High branch prediction accuracy
Low instruction cache miss rates
Little opportunity for clock gating
Designed to feed peak demand
Poor match for transparent pipeline gating

24
In-Order Execution Model

In-order Cores
Power efficient
Low design complexity
Throughput oriented CMP systems trending towards
simple cores (e.g. Sun Niagara)?
Data dependences cause fine-grained stalls at
dispatch
Can we project these back to fetch?
Exploit fetch slack

time
25
Pipeline Diagram
Issue Buffer
Bpred
clock vector
PC
RP Instruction Fetch
Execution Core
0x0
bpred update
26
Available Fetch Slack
27
Implementation

Stall cycle bits embedded in BTB
EPIC ISAs (IA64) could use stop bits
Verify prediction by observing unperturbed groups
Let high confidence groups periodically execute
unperturbed
Observe overall increase in execution time
Modeled Cell PPU-like PowerPC core with
aggressive clock gating

28
Latch Activity Reduction
29
FE Energy Delay Product
30
Stall Cycle Redistribution

Summary ISLPED 2006
Transparent pipelines reduce latch activity
Not effective in pipelines with coarse-grained
stalls (e.g. fetch)?
Coarse-grained stalls can be redistributed
without affecting performance (fetch slack)?
Benefits
Equivalent performance, lower power
Transparent fetch pipeline now attractive

31
Talk Outline

Motivation
What is Lazy Logic?
Applications of Lazy Logic
Circuit-switched coherence
Stall-cycle redistribution
Dynamic scheduling
Conclusions
Research Group Overview

32
A Brief Scheduler Overview

Data capture/ non-data capture scheduler

Speculative scheduling

Data capture scheduler desirable for many reasons
Cycle time is not competitive because of data
path delay
Current machines use speculative scheduling
Misscheduled/replayed instructions burn power
Depending on recovery policy, up to 17 issued
insts need to replay

33
Slicing the Core
Back-End
Front-End
OoO Core

Bitslice the core narrow (16b) and wide (64b)?
Narrow core can be full data capture
Still makes aggressive cycle time (with lazy
logic)?
Completely nonspeculative, virtually no replays
Further power benefits (not in this talk)?

34
Dynamic Scheduling with Partial Operand Values

Narrow core
Computes partial operand
Determines load latency
Avoids misscheduling
Wide core
Computes the rest of the operand (if needed)?

35
Scheduler w/ Narrow Data-Path

Non-data capture scheduler
Select mux tag bcast compare ready wr

Naïve narrow data capture scheduler
Select mux tag bcast compare ready wr

Select mux narrow ALU data bcast data wr
36
Scheduler w/ Embedded ALUs

With embedded ALUs
Select mux tag bcast compare ready wr

Max(select, data bcast mux narrow ALU) mux
latch setup

Lazy Logic
Replicated ALUs
Low utilization
Off critical delay path

37
Cycle Time, Area, Energy

32 entries, implemented using verilog
Synthesized using Synopsis Design Compiler and
LSI Logics gflxp 0.11um

38
Dynamic Scheduling Summary

Benefits JILP 2007
Save 25-30 of total OoO window energygt 12-18
total dynamic chip power
Reduce misspeculated loads by 75-80
Slightly improved IPC
Comparable cycle time
Enabled by
Lazy narrow ALUs
ALUs are cheap, so compute in parallel with
scheduling select logic

39
Talk Outline

Motivation
What is Lazy Logic?
Applications of Lazy Logic
Circuit-switched coherence
Stall-cycle redistribution
Dynamic scheduling
Conclusions
Research Group Overview

40
Conclusions

Lazy Logic
Promising new design philosophy
Some overall principles
Minimize unit utilization
Minimize unit complexity
OK to increase number of units/wires/devices
Initial Results
Circuit-switched CMP interconnects
Stall cycle redistribution
Dynamic Scheduling

41
Who Are We?

Faculty Mikko Lipasti
Current Ph.D. students
Profligate execution Gordie Bell (joining IBM in
2006)?
Coarse-grained coherence Jason Cantin (joining
IBM in 2006)?
Lazy Logic
Circuit-switched coherence Natalie Enright
Stall cycle redistribution Eric Hill
Dynamic scheduling Erika Gunadi
Dynamic code optimization Lixin Su
SMT/CMP scheduling/resource allocation Dana
Vantrease
Pharmed out
IBM Trey Cain, Brian Mestan
AMD Kevin Lepak
Intel Ilhyun Kim, Morris Marden, Craig Saldanha,
Madhu Seshadri
Sun Microsystems Matt Ramsay, Razvan Cheveresan,
Pranay Koka

42
Research Group Overview

Faculty Mikko Lipasti, since 1999
Current MS/PhD students
Gordie Bell, Natalie Enright Jerger, Erika
Gunadi, Atif Hashmi, Eric Hill, Lixin Su, Dana
Vantrease
Graduates, current employment
AMD Kevin Lepak
IBM Trey Cain, Jason Cantin, Brian Mestan
Intel Ilhyun Kim, Morris Marden, Craig Saldanha,
Madhu Seshadri
Sun Microsystems Matt Ramsay, Razvan Cheveresan,
Pranay Koka

43
Current Focus Areas

Multiprocessors
Coherence protocol optimization
Interconnection network design
Fairness issues in hierarchical systems
Microprocessor design
Complexity-effective microarchitecture
Scalable dynamic scheduling hardware
Speculation reduction for power savings
Transparent clock gating
Domain-specific ISA extensions
Software
Java Virtual Machine run-time optimization
Workload development and characterization

44
Funding

IBM
Faculty Partnership Awards
Shared University Research equipment
Intel
Research council support
Equipment donations
National Science Foundation
CSA, ITR, NGS, CPA
Career Award
Schneider ECE Faculty Fellowship
UW Graduate School

45
Questions?

http//www.ece.wisc.edu/pharm

46
Questions?
47
Backup slides
48
Technology Parameters

65 nm technology generation
16 tiled processors
Approximately 4 mm x 4mm
Signal can travel approximately 4 mm/cycle
Circuit switched interconnect consists of
5 mm unidirectional links

49
Broadcast Protocol

Broadcast to all nodes
Establish Circuit-Switched path with owner of
data
Future broadcasts will use Circuit-Switched path
to reduce power
Predict when CS path will suffice
Use LRU information for paths to tear down old
paths when resources need to be claimed by new
path

50
Switch Design from paper
51
Race example from paper (1 of 2)?
52
Race example (2 of 2)?
1a. CS Req
6. Inval Resp
P0
P1
P2
1b. CS Notify
4a. CS Resp (S)?
5. Invalidate
2. Upgrade
4b. Nack
Dir3
3.
53
LRU pairs for Dirty Misses

23 or fewer pairs capture gt80 of dirty misses
for 3 out of 4 benchmarks (16p)?

54
Local LRU pairs

2 Circuit-Switched Paths per processor covers
between 55 and 85 of dirty misses

55
Concurrent Links

5 concurrent links cover 90 necessary pairs
Captures 50-77 of overall opportunity

56
Experimental Setup

PHARMsim
Activity-based power model based on Wattch added
InOrder issue
4/2/2 fetch/issue/commit (based on Cell PPU)?
10 stage transparent front-end pipeline
(conventional latches at endpoints)?
Gshare (8k entry) branch predictor, 1024 set,
4-way BTB
32KB I/D cache (1/4), 512KB L2 cache (12)?
4 confidence bits / gt4 high conf threshold /
predictions checked randomly 10 of the time
Benchmarks simulated for 250M instructions