Title: Lazy Logic
1. Lazy Logic
- Mikko H. Lipasti
- Associate Professor
- Department of Electrical and Computer Engineering
- University of Wisconsin-Madison
- http://www.ece.wisc.edu/pharm
2. CMOS History
- CMOS has been a faithful servant
- 40 years since invention
- Tremendous advances
- Device size, integration level
- Voltage scaling
- Yield, manufacturability, reliability
- Nearly 20 years now as a high-performance workhorse
- Result: life has been easy for architects
- Ease leads to complacency and laziness
3. CMOS Futures
- "The reports of my demise are greatly exaggerated." - Mark Twain
- CMOS has some life left in it
- Device scaling will continue
- What comes after CMOS?
- Many new challenges
- Process variability
- Device reliability
- Leakage power
- Dynamic power
Focus of this talk
4. Dynamic Power
- In static CMOS, current flows when transistors switch
- Combinational logic evaluates new inputs
- Flip-flop or latch captures a new value (clock edge)
- Terms
- C: capacitance of circuit
- wire length, number and size of transistors
- V: supply voltage
- A: activity factor
- f: frequency
- Architects can/should focus on Ci x Ai
- Reduce capacitance of each unit
- Reduce activity of each unit
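The terms above combine into the standard dynamic-power expression; writing it as a sum over units i makes clear why architects should target the per-unit product Ci x Ai, since V and f are largely fixed by circuit and process constraints:

```latex
P_{dyn} = \sum_i C_i \, V^2 \, A_i \, f
```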
5. Design Objective Inversion
- Historically, hardware was expensive
- Every gate, wire, cable, unit mattered
- Squeeze maximum utilization from each
- Now, power is expensive
- On-chip devices and wires, not so much
- Should minimize Ci x Ai
- Logic should be simple and infrequently used
- Both sequential and combinational
- "Lazy Logic"
6. Talk Outline
- Motivation
- What is Lazy Logic?
- Applications of Lazy Logic
- Circuit-switched coherence
- Stall-cycle redistribution
- Dynamic scheduling
- Conclusions
- Research Group Overview
7. What is Lazy Logic?
- Design philosophy
- Some overall principles
- Minimize unit utilization
- Minimize unit complexity
- OK to increase number of units/wires/devices
- As long as reduced Ai (activity) compensates
- Don't forget leakage
- Result
- Reject conventional good ideas
- Reduce power without loss of performance
- Sometimes improve performance
8. Lazy Logic Applications
- CMP interconnection networks
- Old: packet-switched, store-and-forward
- New: circuit-switched, reconfigurable
- Stall cycle redistribution
- Transparent pipelines want fine-grained stalls
- Redistribute coarse stalls into fine stalls
- High-performance dynamic scheduling
- Cycle time goal achieved by replicating ALUs
9. CMP Interconnection Networks
- Options
- Buses don't scale
- Crossbars are too expensive
- Rings are too slow
- Packet-switched mesh
- Attractive for all the DSM reasons
- Scalable
- Low latency
- High link utilization
10. CMP Interconnection Networks
- But...
- Cables/traces are now on-chip wires
- Fast, cheap, plentiful
- Short: 1 cycle per hop
- Router latency adds up
- 3-4 CPU cycles per hop
- Store-and-forward
- Lots of activity/power
- Is this the right answer?
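A hedged back-of-the-envelope comparison using the per-hop numbers above: for an n-hop route, a packet-switched mesh pays the router latency at every hop, while an established circuit-switched path pays only wire delay plus a one-time setup cost:

```latex
T_{packet} \approx n\,(t_{router} + t_{link}), \qquad T_{circuit} \approx n\,t_{link} + T_{setup}
```

With t_router of 3-4 cycles and t_link of 1 cycle, skipping the routers saves roughly 3n cycles once a path is established.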
11. Circuit-switched Interconnects
- Communication patterns
- Spatial locality to memory
- Pairwise communication
- Circuit-switched links
- Avoid switching/routing
- Reduce latency
- Save power?
12. Router Design
- Switches can be logically configured to appear as wires (no routing overhead)
- Can also act as a packet-switched network
- Can switch back and forth very easily
- Detailed router design not presented here
13. Dirty Miss Coverage
14. Directory Protocol
- Initial 3-hop miss establishes CS path
- Subsequent miss requests
- Sent directly on CS path to predicted owner
- Also in parallel to home node
- Predicted owner sources data early
- Directory acks update to sharing list
- Benefits
- Reduced 3-hop latency
- Less activity, less power
15. Circuit-switched Performance
16. Link Activity
17. Buffer Activity
18. Circuit-switched Coherence
- Summary
- Reconfigurable interconnect
- Circuit-switched links
- Some performance benefit
- Substantial reduction in activity
- Current status (slides are out of date)
- Router design and physical/area models
- Protocol tuning and tweaks, etc.
- Initial results in CA Letters paper
19. Talk Outline
- Motivation
- What is Lazy Logic?
- Applications of Lazy Logic
- Circuit-switched coherence
- Stall-cycle redistribution
- Dynamic scheduling
- Conclusions
- Research Group Overview
20. Pipeline Clocking Revisited
- [Figure: two units of work, A and B, flow through the pipeline; latches are clocked to propagate data; 10 clock pulses total]
- Conventional pipeline clock gating
- Each valid work unit gets clocked into each latch
- This is needlessly conservative
21. Transparent Pipeline Gating
- [Figure: the same two units of work, A and B, need only 5 clock pulses]
- Transparent pipelining: a novel approach to clocking [Jacobsen 2004, 2005]
- Both master and slave latch can remain transparent
- Gating logic ensures no races
- Pipeline registers are clocked lazily, only when a race occurs
- Quite effective for low-utilization pipelines
- Gaps between valid work units enable transparent mode
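A toy model of the idea (a sketch, not the actual gating logic of Jacobsen 2004/2005): assume a pipeline register must pulse only when the stage behind it also holds a valid item that cycle, i.e. when two items would otherwise race through an open latch.

```python
def count_pulses(arrival_cycles, n_stages, transparent):
    """Count latch clock pulses for items entering an n-stage pipeline.
    Conventional gating: every valid item is clocked at every stage.
    Transparent gating (simplified rule): a stage's latch pulses only
    when the previous stage also holds a valid item that cycle."""
    # occupancy[cycle] = set of stages holding a valid item that cycle
    occupancy = {}
    for arrival in arrival_cycles:
        for stage in range(n_stages):
            occupancy.setdefault(arrival + stage, set()).add(stage)
    pulses = 0
    for stages in occupancy.values():
        for stage in stages:
            if not transparent:
                pulses += 1            # conventional: clock every valid item
            elif stage - 1 in stages:
                pulses += 1            # transparent: clock only to avoid a race
    return pulses
```

In this toy model, back-to-back items force pulses at every register between them, while well-separated items let registers stay transparent; a real design additionally keeps conventionally clocked latches at the pipeline endpoints.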
22. Applications
- Best suited for low-utilization pipelines
- E.g., FP and media-processing functional units
- High-utilization pipelines see the least benefit
- E.g., instruction fetch pipelines
- To benefit from the transparent approach
- Valid data items need fine-grained gaps (stalls)
- A 1-cycle gap provides the lion's share (50%)
23. Application: Front-end Pipelines
- Provide the back-end with a sufficient supply of instructions to find ILP
- High branch prediction accuracy
- Low instruction cache miss rates
- Little opportunity for clock gating
- Designed to feed peak demand
- Poor match for transparent pipeline gating
24. In-Order Execution Model
- In-order cores
- Power efficient
- Low design complexity
- Throughput-oriented CMP systems trending towards simple cores (e.g., Sun Niagara)
- Data dependences cause fine-grained stalls at dispatch
- Can we project these back to fetch?
- Exploit fetch slack
25. Pipeline Diagram
- [Figure: PC and branch predictor feed the instruction fetch pipeline and issue buffer, which feed the execution core; a clock vector gates the fetch stages, and branch predictor updates flow back from execute]
26. Available Fetch Slack
27. Implementation
- Stall cycle bits embedded in BTB
- EPIC ISAs (IA-64) could use stop bits
- Verify prediction by observing unperturbed groups
- Let high-confidence groups periodically execute unperturbed
- Observe overall increase in execution time
- Modeled a Cell PPU-like PowerPC core with aggressive clock gating
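A minimal sketch of the redistribution idea (a hypothetical helper, not the BTB-based mechanism above): take a fetch pattern whose idle cycles are bunched into one coarse stall, and spread the same number of idle cycles as single-cycle gaps between valid fetch groups, so total fetch bandwidth and elapsed cycles are unchanged but the transparent pipeline sees fine-grained gaps:

```python
def redistribute_stalls(pattern):
    """pattern: list of 1 (valid fetch group) / 0 (stall cycle).
    Returns a pattern of the same length with the same number of
    valid cycles, idle cycles spread as single-cycle gaps."""
    n_valid = sum(pattern)
    n_idle = len(pattern) - n_valid
    out = []
    for _ in range(n_valid):
        out.append(1)
        if n_idle > 0:              # insert a 1-cycle gap after this group
            out.append(0)
            n_idle -= 1
    out.extend([0] * n_idle)        # any leftover idle cycles trail
    return out
```

For example, a burst of four fetches followed by a four-cycle stall becomes alternating fetch/gap cycles, exactly the spacing from which transparent gating benefits most.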
28. Latch Activity Reduction
29. Front-End Energy-Delay Product
30. Stall Cycle Redistribution
- Summary [ISLPED 2006]
- Transparent pipelines reduce latch activity
- Not effective in pipelines with coarse-grained stalls (e.g., fetch)
- Coarse-grained stalls can be redistributed without affecting performance (fetch slack)
- Benefits
- Equivalent performance, lower power
- Transparent fetch pipeline now attractive
31. Talk Outline
- Motivation
- What is Lazy Logic?
- Applications of Lazy Logic
- Circuit-switched coherence
- Stall-cycle redistribution
- Dynamic scheduling
- Conclusions
- Research Group Overview
32. A Brief Scheduler Overview
- Data-capture vs. non-data-capture schedulers
- Data-capture scheduler desirable for many reasons
- Cycle time is not competitive because of data path delay
- Current machines use speculative scheduling
- Misscheduled/replayed instructions burn power
- Depending on recovery policy, up to 17% of issued insts need to replay
33. Slicing the Core
- [Figure: front-end, OoO core, back-end]
- Bitslice the core: narrow (16b) and wide (64b)
- Narrow core can be full data capture
- Still makes aggressive cycle time (with lazy logic)
- Completely nonspeculative, virtually no replays
- Further power benefits (not in this talk)
34. Dynamic Scheduling with Partial Operand Values
- Narrow core
- Computes partial operand
- Determines load latency
- Avoids misscheduling
- Wide core
- Computes the rest of the operand (if needed)
35. Scheduler w/ Narrow Data-Path
- Non-data-capture scheduler
- Select + mux + tag bcast + compare + ready wr
- Naive narrow data-capture scheduler
- Select + mux + tag bcast + compare + ready wr + select + mux + narrow ALU + data bcast + data wr
36. Scheduler w/ Embedded ALUs
- With embedded ALUs
- Select + mux + tag bcast + compare + ready wr
- max(select, data bcast + mux + narrow ALU) + mux + latch setup
- Lazy Logic
- Replicated ALUs
- Low utilization
- Off critical delay path
37. Cycle Time, Area, Energy
- 32 entries, implemented in Verilog
- Synthesized using Synopsys Design Compiler and LSI Logic's gflxp 0.11um library
38. Dynamic Scheduling Summary
- Benefits [JILP 2007]
- Save 25-30% of total OoO window energy (window energy is 12-18% of total dynamic chip power)
- Reduce misspeculated loads by 75-80%
- Slightly improved IPC
- Comparable cycle time
- Enabled by
- Lazy narrow ALUs
- ALUs are cheap, so compute in parallel with
scheduling select logic
39. Talk Outline
- Motivation
- What is Lazy Logic?
- Applications of Lazy Logic
- Circuit-switched coherence
- Stall-cycle redistribution
- Dynamic scheduling
- Conclusions
- Research Group Overview
40. Conclusions
- Lazy Logic
- Promising new design philosophy
- Some overall principles
- Minimize unit utilization
- Minimize unit complexity
- OK to increase number of units/wires/devices
- Initial Results
- Circuit-switched CMP interconnects
- Stall cycle redistribution
- Dynamic Scheduling
41. Who Are We?
- Faculty: Mikko Lipasti
- Current Ph.D. students
- Profligate execution: Gordie Bell (joining IBM in 2006)
- Coarse-grained coherence: Jason Cantin (joining IBM in 2006)
- Lazy Logic
- Circuit-switched coherence: Natalie Enright
- Stall cycle redistribution: Eric Hill
- Dynamic scheduling: Erika Gunadi
- Dynamic code optimization: Lixin Su
- SMT/CMP scheduling/resource allocation: Dana Vantrease
- Pharmed out
- IBM: Trey Cain, Brian Mestan
- AMD: Kevin Lepak
- Intel: Ilhyun Kim, Morris Marden, Craig Saldanha, Madhu Seshadri
- Sun Microsystems: Matt Ramsay, Razvan Cheveresan, Pranay Koka
42. Research Group Overview
- Faculty: Mikko Lipasti, since 1999
- Current MS/PhD students
- Gordie Bell, Natalie Enright Jerger, Erika Gunadi, Atif Hashmi, Eric Hill, Lixin Su, Dana Vantrease
- Graduates, current employment
- AMD: Kevin Lepak
- IBM: Trey Cain, Jason Cantin, Brian Mestan
- Intel: Ilhyun Kim, Morris Marden, Craig Saldanha, Madhu Seshadri
- Sun Microsystems: Matt Ramsay, Razvan Cheveresan, Pranay Koka
43. Current Focus Areas
- Multiprocessors
- Coherence protocol optimization
- Interconnection network design
- Fairness issues in hierarchical systems
- Microprocessor design
- Complexity-effective microarchitecture
- Scalable dynamic scheduling hardware
- Speculation reduction for power savings
- Transparent clock gating
- Domain-specific ISA extensions
- Software
- Java Virtual Machine run-time optimization
- Workload development and characterization
44. Funding
- IBM
- Faculty Partnership Awards
- Shared University Research equipment
- Intel
- Research council support
- Equipment donations
- National Science Foundation
- CSA, ITR, NGS, CPA
- Career Award
- Schneider ECE Faculty Fellowship
- UW Graduate School
45. Questions?
- http://www.ece.wisc.edu/pharm
46. Questions?
47. Backup Slides
48. Technology Parameters
- 65 nm technology generation
- 16 tiled processors
- Approximately 4 mm x 4 mm each
- Signal can travel approximately 4 mm/cycle
- Circuit switched interconnect consists of
- 5 mm unidirectional links
49. Broadcast Protocol
- Broadcast to all nodes
- Establish circuit-switched path with owner of data
- Future broadcasts will use the circuit-switched path to reduce power
- Predict when a CS path will suffice
- Use LRU information to tear down old paths when resources need to be claimed by a new path
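The LRU teardown policy above can be sketched with a small path table (the class name and interface here are hypothetical, chosen for illustration; capacity 2 mirrors the 2-paths-per-processor result in the backup slides):

```python
from collections import OrderedDict

class CircuitPathTable:
    """Per-processor table of established circuit-switched paths;
    evicts the least-recently-used path when a new one is needed."""
    def __init__(self, capacity=2):
        self.capacity = capacity
        self.paths = OrderedDict()          # dest -> True, in LRU order

    def lookup(self, dest):
        """True if a CS path to dest is already up; refreshes LRU order."""
        if dest in self.paths:
            self.paths.move_to_end(dest)
            return True
        return False

    def establish(self, dest):
        """Set up a path to dest; returns the torn-down dest, if any."""
        if self.lookup(dest):
            return None
        evicted = None
        if len(self.paths) >= self.capacity:
            evicted, _ = self.paths.popitem(last=False)   # tear down LRU path
        self.paths[dest] = True
        return evicted
```

A miss first checks `lookup`; a new pairwise communication calls `establish`, which may tear down the coldest path, matching the policy described above.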
50. Switch Design (from paper)
51. Race Example (from paper, 1 of 2)
52. Race Example (2 of 2)
- [Figure: race among processors P0, P1, P2 and directory Dir3; messages: 1a. CS Req, 1b. CS Notify, 2. Upgrade, 3. (directory action), 4a. CS Resp (S), 4b. Nack, 5. Invalidate, 6. Inval Resp]
53. LRU Pairs for Dirty Misses
- 23 or fewer pairs capture >80% of dirty misses for 3 out of 4 benchmarks (16p)
54. Local LRU Pairs
- 2 circuit-switched paths per processor cover between 55% and 85% of dirty misses
55. Concurrent Links
- 5 concurrent links cover 90% of necessary pairs
- Captures 50-77% of overall opportunity
56. Experimental Setup
- PHARMsim
- Activity-based power model based on Wattch added
- In-order issue
- 4/2/2 fetch/issue/commit (based on Cell PPU)
- 10-stage transparent front-end pipeline (conventional latches at endpoints)
- Gshare (8K-entry) branch predictor; 1024-set, 4-way BTB
- 32KB I/D cache (1/4), 512KB L2 cache (12)
- 4 confidence bits / >4 high-confidence threshold / predictions checked randomly 10% of the time
- Benchmarks simulated for 250M instructions
57. Branch Predictor Activity
58. Related Work
- Removing wrong-path instructions
- [Manne 1998]
- Flow-based throttling techniques
- [Baniasadi 2001, Karkhanis 2002]
59. Future Work
- Explore performance of other fetch gating schemes with transparent pipelining
- Explore dependence-driven gating on an Itanium machine model
- Explore latch soft-error vulnerability (TVF) when lazy clocking is used
- Explore change in AVF when fetch gating is used
- Less ACE state in flight
60. Scheduling Replay Example
- Squashing/non-selective replay [Alpha 21264]
- Replays all dependent and independent instructions issued under the load shadow
- Analogous to squashing recovery on branch misprediction
- Simple but high performance penalty
- Independent instructions are unnecessarily replayed
61. Narrow Core
- Narrow scheduler
- Captures partial operands
- Determines load latency (hit/miss)
- Narrow data-path
- Narrow ALU provides partial data to consumers
- Narrow LSQ and partial-tag cache
- Finds the only possible load data source
- Uses least-significant 16 bits
- Large enough to help predict load latency
- Small enough to achieve fast cycle time
62. L/S Disambiguation & Partial Tag Matching
- Exploits operand significance
- [Brooks et al. 1999, Canal et al. 2000]
- Load/store disambiguation
- 10 bits find 99% of matching stores
- Partial tag match
- 16 bits give 97% (mcf) to 99% (bzip2) accuracy
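A minimal sketch of partial-address matching (a hypothetical helper, not the actual LSQ circuit): compare only the low 16 bits of a load address against queued store addresses. A low-bits mismatch proves the store cannot conflict; a match only says it might, which is why the figures above are accuracies rather than guarantees:

```python
def partial_matches(load_addr, store_addrs, bits=16):
    """Return queued store addresses whose low `bits` match the load's.
    A mismatch is a guaranteed non-conflict; a match is only a possible
    conflict, since low bits can alias across different full addresses."""
    mask = (1 << bits) - 1
    return [s for s in store_addrs if (s & mask) == (load_addr & mask)]
```

Note the aliasing case: a store at a different full address whose low 16 bits coincide with the load's still matches, so the wide core must confirm any match.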
63. Outline
- Motivation
- Dynamic Scheduling with Narrow Values
- Scheduler with Narrow Data-Path
- Pipelined Data Cache
- Pipeline Integration
- Implementation and Experiments
- Conclusions and Future Work
64. Dynamic Scheduling with Partial Operands
- [Figure: front-end, OoO core, back-end]
- Stores a subset of operands in the scheduler
- Exploits partial operand knowledge
- Load-store disambiguation
- Partial tag match
65. Pipelined Cache w/ Early Bits
- Narrow bank for partial access, wide bank for the rest
- Uses partial tag match in the narrow bank
- Saves power in the wide bank
- Hides wide cache bank latency by starting early
66. Narrow LSQ
- Stores partial addresses of stores
- Used for partial load-store disambiguation
- Accessed in parallel with the narrow bank
- Saves power in the wide LSQ
- Cheaper direct-mapped access rather than fully associative search
67. Pipeline Integration
- Simple ALU insts link dependences in back-to-back cycles
- Load insts need another cycle to schedule dependences
- Complex ALU insts link dependences non-speculatively
68. Pipelined Data Cache & LSQ
- Modeled using modified CACTI 3.0
- Configuration: 16KB, 4-way, 64B blocks
69. Experiments
- SimpleScalar / Alpha 3.0 tool set
- Machine model
- 64-entry ROB
- 4-wide fetch/issue/commit
- 16-entry SQ, 16-entry LQ
- 32-entry scheduler
- 13-stage pipeline
- 64KB I-cache (2-cyc), 16KB D-cache (2-cyc)
- 2-cycle store-to-load forwarding
70. Energy Dissipation
- On average, narrow data-capture scheduling consumes 25% less energy than non-data-capture scheduling
71. Mispredicted Load Instructions
- Reduces misspeculated loads by 75-80%
72. Optimized Model
- Uses a refetch replay scheme to reduce replay complexity
- Clears scheduler entries once instructions are issued
- Decreases scheduler occupancy
- Instructions enter the OoO window sooner
- Reduces L1 cache latency from 2 cycles to 1 cycle
73. Optimized Model Performance
- Small variations
- Always performs as well or better
74. Future Work
- Implement a more accurate dynamic power model
- Study custom design vs. synthesized model
- Study opportunities for leakage power reduction
75. Delay Model
- Processor 0 can reach processor 15 in 9 fewer cycles
76. Pipeline Unrolling