1
Lazy Logic
  • Mikko H. Lipasti
  • Associate Professor
  • Department of Electrical and Computer Engineering
  • University of Wisconsin-Madison

http://www.ece.wisc.edu/pharm
2
CMOS History
  • CMOS has been a faithful servant
  • 40 years since invention
  • Tremendous advances
  • Device size, integration level
  • Voltage scaling
  • Yield, manufacturability, reliability
  • Nearly 20 years now as high-performance workhorse
  • Result: life has been easy for architects
  • Ease leads to complacency and laziness

3
CMOS Futures
  • "The reports of my demise are greatly
    exaggerated." - Mark Twain
  • CMOS has some life left in it
  • Device scaling will continue
  • What comes after CMOS?
  • Many new challenges
  • Process variability
  • Device reliability
  • Leakage power
  • Dynamic power

Focus of this talk
4
Dynamic Power
  • Static CMOS: current flows when transistors
    switch
  • Combinational logic evaluates new inputs
  • Flip-flop, latch captures new value (clock edge)
  • Terms
  • C: capacitance of circuit
  • wire length, number and size of transistors
  • V: supply voltage
  • A: activity factor
  • f: frequency
  • Architects can/should focus on Ci × Ai (see the
    equation below)
  • Reduce capacitance of each unit
  • Reduce activity of each unit
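
The terms above are the factors of the standard CMOS dynamic-power relation; it is reconstructed here for reference (textbook formula, not slide content):

```latex
% Dynamic power: activity-weighted capacitance, scaled by voltage and frequency.
% V and f are largely fixed by circuit and process constraints, so the
% architectural lever is the per-unit product C_i * A_i.
P_{\mathrm{dyn}} \;=\; V^{2} f \sum_{i} A_{i} C_{i}
```

Since the talk treats V and f as given, minimizing each unit's Ci × Ai is the stated design target.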

5
Design Objective Inversion
  • Historically, hardware was expensive
  • Every gate, wire, cable, unit mattered
  • Squeeze maximum utilization from each
  • Now, power is expensive
  • On-chip devices and wires, not so much
  • Should minimize Ci × Ai
  • Logic should be simple, infrequently used
  • Both sequential and combinational
  • ⇒ "Lazy Logic"

6
Talk Outline
  • Motivation
  • What is Lazy Logic?
  • Applications of Lazy Logic
  • Circuit-switched coherence
  • Stall-cycle redistribution
  • Dynamic scheduling
  • Conclusions
  • Research Group Overview

7
What is Lazy Logic?
  • Design philosophy
  • Some overall principles
  • Minimize unit utilization
  • Minimize unit complexity
  • OK to increase number of units/wires/devices
  • As long as reduced Ai (activity) compensates
  • Don't forget leakage
  • Result
  • Reject conventional good ideas
  • Reduce power without loss of performance
  • Sometimes improve performance

8
Lazy Logic Applications
  • CMP interconnection networks
  • Old: Packet-switched, store-and-forward
  • New: Circuit-switched, reconfigurable
  • Stall cycle redistribution
  • Transparent pipelines want fine-grained stalls
  • Redistribute coarse stalls into fine stalls
  • High-performance dynamic scheduling
  • Cycle time goal achieved by replicating ALUs

9
CMP Interconnection Networks
  • Options
  • Buses don't scale
  • Crossbars are too expensive
  • Rings are too slow
  • Packet-switched mesh
  • Attractive for all the DSM reasons
  • Scalable
  • Low latency
  • High link utilization

10
CMP Interconnection Networks
  • But
  • Cables/traces are now on-chip wires
  • Fast, cheap, plentiful
  • Short: 1 cycle per hop
  • Router latency adds up
  • 3-4 CPU cycles per hop (see the sketch below)
  • Store-and-forward
  • Lots of activity/power
  • Is this the right answer?
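
A back-of-envelope comparison of the two latency models; the per-hop numbers (1 cycle of wire, 3 cycles of router) come from the bullets above, and the function names are illustrative:

```python
# Per-message latency under the two interconnect styles, using the
# slide's rough per-hop costs.

def packet_switched_latency(hops, link_cycles=1, router_cycles=3):
    """Store-and-forward: every hop pays link delay plus router delay."""
    return hops * (link_cycles + router_cycles)

def circuit_switched_latency(hops, link_cycles=1):
    """Configured circuit: routers behave as wires, leaving only link delay."""
    return hops * link_cycles

for hops in (2, 4, 6):
    print(f"{hops} hops: packet={packet_switched_latency(hops)} cyc, "
          f"circuit={circuit_switched_latency(hops)} cyc")
```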

11
Circuit-switched Interconnects
  • Communication patterns
  • Spatial locality to memory
  • Pairwise communication
  • Circuit-switched links
  • Avoid switching/routing
  • Reduce latency
  • Save power?

12
Router Design
  • Switches can be logically configured to appear as
    wires (no routing overhead)
  • Can also act as packet-switched network
  • Can switch back and forth very easily (toy model
    below)
  • Detailed router design not presented here
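
A toy sketch of the hybrid behavior just described; the class and method names are hypothetical, not the actual router design (which the slide defers):

```python
class HybridRouter:
    """Toy model of a switch that passes some input->output pairs through
    as if they were wires (circuit mode) and packet-switches the rest."""

    ROUTER_DELAY = 3  # cycles paid only on the packet-switched path

    def __init__(self, num_ports=5):
        self.num_ports = num_ports
        self.circuits = {}  # in_port -> out_port for configured paths

    def configure_circuit(self, in_port, out_port):
        self.circuits[in_port] = out_port

    def tear_down(self, in_port):
        self.circuits.pop(in_port, None)

    def forward(self, in_port, packet_route):
        """Return (out_port, added_delay). Circuit hits bypass routing."""
        if in_port in self.circuits:
            return self.circuits[in_port], 0
        return packet_route, self.ROUTER_DELAY

r = HybridRouter()
r.configure_circuit(0, 3)
print(r.forward(0, packet_route=1))  # (3, 0): acts as a wire
print(r.forward(2, packet_route=1))  # (1, 3): packet-switched fallback
```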

13
Dirty Miss coverage
14
Directory Protocol
  • Initial 3-hop miss establishes CS path
  • Subsequent miss requests
  • Sent directly on CS path to predicted owner
  • Also in parallel to home node
  • Predicted owner sources data early
  • Directory acks update to sharing list (flow
    sketched below)
  • Benefits
  • Reduced 3-hop latency
  • Less activity, less power
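
A minimal sketch of the miss-handling flow above, assuming a simple message-list interface; the message names are illustrative, not the protocol's actual encoding:

```python
def handle_miss(addr, cs_path, home_node, messages):
    """Sketch of a requester's actions on a miss. With a circuit in place,
    data is requested from the predicted owner on the fast path while the
    home directory is updated in parallel."""
    if cs_path is not None:
        messages.append(("circuit", cs_path, "data_req", addr))
        messages.append(("packet", home_node, "dir_update", addr))
    else:
        # Conventional 3-hop miss; its reply establishes a CS path
        # to the owner for subsequent misses.
        messages.append(("packet", home_node, "data_req", addr))

msgs = []
handle_miss(0x40, cs_path=("P0", "P2"), home_node="P3", messages=msgs)
print(msgs)
```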

15
Circuit-switched Performance
16
Link Activity
17
Buffer Activity
18
Circuit-switched Coherence
  • Summary
  • Reconfigurable interconnect
  • Circuit-switched links
  • Some performance benefit
  • Substantial reduction in activity
  • Current status (slides are out of date)
  • Router design and physical/area models
  • Protocol tuning and tweaks, etc.
  • Initial results in Computer Architecture Letters
    paper

19
Talk Outline
  • Motivation
  • What is Lazy Logic?
  • Applications of Lazy Logic
  • Circuit-switched coherence
  • Stall-cycle redistribution
  • Dynamic scheduling
  • Conclusions
  • Research Group Overview

20
Pipeline Clocking Revisited
  • Conventional pipeline clock gating
  • Each valid work unit gets clocked into each latch
  • This is needlessly conservative

Two units of work, 10 clock pulses
Latches clocked to propagate data
21
Transparent Pipeline Gating
  • Transparent pipelining: novel approach to
    clocking [Jacobsen 2004, 2005]
  • Both master and slave latch can remain
    transparent
  • Gating logic ensures no races
  • Pipeline registers are clocked lazily, only when
    a race occurs
  • Quite effective for low utilization pipelines
  • Gaps between valid work units enable transparent
    mode

Two units of work, 5 clock pulses
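
A rough pulse-counting model consistent with the 10-vs-5 numbers on these two slides, assuming a 5-stage pipeline; the race accounting is a simplification, not the actual gating logic:

```python
def conventional_pulses(items, stages):
    """Conventional gating: every valid item is clocked at every stage."""
    return items * stages

def transparent_pulses(items, stages):
    """Toy model of transparent gating: a lone item flows through open
    latches with no pulses; each adjacent pair of back-to-back items
    needs one separating pulse per stage (a pulse only where a race
    would otherwise occur)."""
    return (items - 1) * stages

print(conventional_pulses(2, 5), transparent_pulses(2, 5))  # 10 5
```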
22
Applications
  • Best suited for low utilization pipelines
  • E.g. FP and media-processing functional units
  • High utilization pipelines see least benefit
  • E.g. Instruction fetch pipelines
  • To benefit from transparent approach
  • Valid data items need fine-grained gaps (stalls)
  • A 1-cycle gap provides the lion's share (50%)

23
Application: Front-end Pipelines
  • Provide back-end with sufficient supply of
    instructions to find ILP
  • High branch prediction accuracy
  • Low instruction cache miss rates
  • Little opportunity for clock gating
  • Designed to feed peak demand
  • Poor match for transparent pipeline gating

24
In-Order Execution Model
  • In-order Cores
  • Power efficient
  • Low design complexity
  • Throughput-oriented CMP systems trending towards
    simple cores (e.g. Sun Niagara)
  • Data dependences cause fine-grained stalls at
    dispatch
  • Can we project these back to fetch?
  • Exploit fetch slack

25
Pipeline Diagram
(Figure: fetch pipeline with branch predictor, issue
buffer, clock vector, and bpred update path)
26
Available Fetch Slack
27
Implementation
  • Stall cycle bits embedded in BTB (redistribution
    sketched below)
  • EPIC ISAs (IA-64) could use stop bits
  • Verify prediction by observing unperturbed groups
  • Let high confidence groups periodically execute
    unperturbed
  • Observe overall increase in execution time
  • Modeled Cell PPU-like PowerPC core with
    aggressive clock gating
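
A sketch of the redistribution idea, assuming stall counts are cached per fetch group (the BTB-embedded bits above); the data layout and interface are hypothetical:

```python
def fetch_schedule(groups, btb_stall_cycles):
    """Coarse stalls observed at dispatch are cached per fetch group
    (e.g. in spare BTB bits) and replayed as fine-grained 1-cycle gaps
    at fetch, where transparent gating can exploit them.

    groups: ordered fetch-group ids
    btb_stall_cycles: group id -> predicted downstream stall cycles
    Returns the fetch stream with explicit gap cycles as None."""
    stream = []
    for g in groups:
        stream.append(g)
        stream.extend([None] * btb_stall_cycles.get(g, 0))
    return stream

print(fetch_schedule(["A", "B", "C"], {"B": 1}))  # ['A', 'B', None, 'C']
```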

28
Latch Activity Reduction
29
FE Energy-Delay Product
30
Stall Cycle Redistribution
  • Summary [ISLPED 2006]
  • Transparent pipelines reduce latch activity
  • Not effective in pipelines with coarse-grained
    stalls (e.g. fetch)
  • Coarse-grained stalls can be redistributed
    without affecting performance (fetch slack)
  • Benefits
  • Equivalent performance, lower power
  • Transparent fetch pipeline now attractive

31
Talk Outline
  • Motivation
  • What is Lazy Logic?
  • Applications of Lazy Logic
  • Circuit-switched coherence
  • Stall-cycle redistribution
  • Dynamic scheduling
  • Conclusions
  • Research Group Overview

32
A Brief Scheduler Overview
  • Data capture vs. non-data capture schedulers
  • Speculative scheduling
  • Data capture scheduler desirable for many reasons
  • Cycle time is not competitive because of data
    path delay
  • Current machines use speculative scheduling
  • Misscheduled/replayed instructions burn power
  • Depending on recovery policy, up to 17% of issued
    insts need to replay

33
Slicing the Core
(Figure: pipeline sliced into front-end, OoO core,
and back-end)
  • Bitslice the core: narrow (16b) and wide (64b)
    (carry hand-off sketched below)
  • Narrow core can be full data capture
  • Still makes aggressive cycle time (with lazy
    logic)
  • Completely nonspeculative, virtually no replays
  • Further power benefits (not in this talk)
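
An illustrative carry hand-off between a 16-bit narrow slice and the remaining 48 bits of a 64-bit add; this shows the bitslicing principle only, not the paper's actual datapath:

```python
MASK16 = (1 << 16) - 1

def narrow_add(a, b):
    """Narrow (16b) slice of a 64b add: produces the low 16 result bits
    and the carry into bit 16, enough to wake consumers and start
    partial-tag/LSQ checks before the wide result exists."""
    s = (a & MASK16) + (b & MASK16)
    return s & MASK16, s >> 16

def wide_add(a, b, low16, carry):
    """Wide core finishes bits 16..63 using the narrow slice's carry-out."""
    hi = ((a >> 16) + (b >> 16) + carry) & ((1 << 48) - 1)
    return (hi << 16) | low16

a, b = 0x123456789ABCFFFF, 0x0000000000000001
lo, c = narrow_add(a, b)
assert wide_add(a, b, lo, c) == (a + b) & ((1 << 64) - 1)
```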

34
Dynamic Scheduling with Partial Operand Values
  • Narrow core
  • Computes partial operand
  • Determines load latency
  • Avoids misscheduling
  • Wide core
  • Computes the rest of the operand (if needed)

35
Scheduler w/ Narrow Data-Path
  • Non-data capture scheduler
  • Select + mux + tag bcast + compare + ready wr
  • Naïve narrow data capture scheduler
  • Select + mux + tag bcast + compare + ready wr

Select + mux + narrow ALU + data bcast + data wr
36
Scheduler w/ Embedded ALUs
  • With embedded ALUs
  • Select + mux + tag bcast + compare + ready wr

Max(select, data bcast + mux + narrow ALU) + mux +
latch setup
  • Lazy Logic
  • Replicated ALUs
  • Low utilization
  • Off critical delay path

37
Cycle Time, Area, Energy
  • 32 entries, implemented using Verilog
  • Synthesized using Synopsys Design Compiler and
    LSI Logic's gflxp 0.11um library

38
Dynamic Scheduling Summary
  • Benefits [JILP 2007]
  • Save 25-30% of total OoO window energy => 12-18%
    of total dynamic chip power
  • Reduce misspeculated loads by 75-80%
  • Slightly improved IPC
  • Comparable cycle time
  • Enabled by
  • Lazy narrow ALUs
  • ALUs are cheap, so compute in parallel with
    scheduling select logic

39
Talk Outline
  • Motivation
  • What is Lazy Logic?
  • Applications of Lazy Logic
  • Circuit-switched coherence
  • Stall-cycle redistribution
  • Dynamic scheduling
  • Conclusions
  • Research Group Overview

40
Conclusions
  • Lazy Logic
  • Promising new design philosophy
  • Some overall principles
  • Minimize unit utilization
  • Minimize unit complexity
  • OK to increase number of units/wires/devices
  • Initial Results
  • Circuit-switched CMP interconnects
  • Stall cycle redistribution
  • Dynamic Scheduling

41
Who Are We?
  • Faculty: Mikko Lipasti
  • Current Ph.D. students
  • Profligate execution: Gordie Bell (joining IBM in
    2006)
  • Coarse-grained coherence: Jason Cantin (joining
    IBM in 2006)
  • Lazy Logic
  • Circuit-switched coherence: Natalie Enright
  • Stall cycle redistribution: Eric Hill
  • Dynamic scheduling: Erika Gunadi
  • Dynamic code optimization: Lixin Su
  • SMT/CMP scheduling/resource allocation: Dana
    Vantrease
  • "Pharmed" out
  • IBM: Trey Cain, Brian Mestan
  • AMD: Kevin Lepak
  • Intel: Ilhyun Kim, Morris Marden, Craig Saldanha,
    Madhu Seshadri
  • Sun Microsystems: Matt Ramsay, Razvan Cheveresan,
    Pranay Koka

42
Research Group Overview
  • Faculty: Mikko Lipasti, since 1999
  • Current MS/PhD students
  • Gordie Bell, Natalie Enright Jerger, Erika
    Gunadi, Atif Hashmi, Eric Hill, Lixin Su, Dana
    Vantrease
  • Graduates, current employment
  • AMD: Kevin Lepak
  • IBM: Trey Cain, Jason Cantin, Brian Mestan
  • Intel: Ilhyun Kim, Morris Marden, Craig Saldanha,
    Madhu Seshadri
  • Sun Microsystems: Matt Ramsay, Razvan Cheveresan,
    Pranay Koka

43
Current Focus Areas
  • Multiprocessors
  • Coherence protocol optimization
  • Interconnection network design
  • Fairness issues in hierarchical systems
  • Microprocessor design
  • Complexity-effective microarchitecture
  • Scalable dynamic scheduling hardware
  • Speculation reduction for power savings
  • Transparent clock gating
  • Domain-specific ISA extensions
  • Software
  • Java Virtual Machine run-time optimization
  • Workload development and characterization

44
Funding
  • IBM
  • Faculty Partnership Awards
  • Shared University Research equipment
  • Intel
  • Research council support
  • Equipment donations
  • National Science Foundation
  • CSA, ITR, NGS, CPA
  • CAREER Award
  • Schneider ECE Faculty Fellowship
  • UW Graduate School

45
Questions?
  • http://www.ece.wisc.edu/pharm

46
Questions?
47
Backup slides
48
Technology Parameters
  • 65 nm technology generation
  • 16 tiled processors
  • Approximately 4 mm x 4 mm
  • Signal can travel approximately 4 mm/cycle
  • Circuit switched interconnect consists of
  • 5 mm unidirectional links

49
Broadcast Protocol
  • Broadcast to all nodes
  • Establish Circuit-Switched path with owner of
    data
  • Future broadcasts will use Circuit-Switched path
    to reduce power
  • Predict when CS path will suffice
  • Use LRU information for paths to tear down old
    paths when resources need to be claimed by a new
    path (see the sketch below)
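
A minimal LRU path-table sketch for the tear-down policy above; the capacity and interface are assumptions for illustration, not the paper's parameters:

```python
from collections import OrderedDict

class CircuitPathTable:
    """LRU table of established circuit-switched paths at one node."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self.paths = OrderedDict()  # destination -> path handle

    def lookup(self, dest):
        """Reuse an existing path (refreshing its LRU position) or miss."""
        if dest in self.paths:
            self.paths.move_to_end(dest)
            return self.paths[dest]
        return None

    def establish(self, dest, path):
        """Claim resources for a new path, tearing down the LRU victim."""
        if len(self.paths) >= self.capacity:
            self.paths.popitem(last=False)
        self.paths[dest] = path
```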

50
Switch Design (from paper)
51
Race example from paper (1 of 2)
52
Race example (2 of 2)
(Figure: message sequence among P0, P1, P2, and Dir3:
1a. CS Req; 1b. CS Notify; 2. Upgrade; 3.; 4a. CS
Resp (S); 4b. Nack; 5. Invalidate; 6. Inval Resp)
53
LRU pairs for Dirty Misses
  • 23 or fewer pairs capture >80% of dirty misses
    for 3 out of 4 benchmarks (16p)

54
Local LRU pairs
  • 2 Circuit-Switched Paths per processor cover
    between 55% and 85% of dirty misses

55
Concurrent Links
  • 5 concurrent links cover 90% of necessary pairs
  • Captures 50-77% of overall opportunity

56
Experimental Setup
  • PHARMsim
  • Activity-based power model based on Wattch added
  • In-order issue
  • 4/2/2 fetch/issue/commit (based on Cell PPU)
  • 10-stage transparent front-end pipeline
    (conventional latches at endpoints)
  • Gshare (8k entry) branch predictor; 1024-set,
    4-way BTB
  • 32KB I/D cache (1/4 cycles), 512KB L2 cache (12
    cycles)
  • 4 confidence bits / >4 high-confidence threshold /
    predictions checked randomly 10% of the time
  • Benchmarks simulated for 250M instructions

57
Branch Predictor Activity
58
Related Work
  • Removing Wrong Path Instructions
  • [Manne 1998]
  • Flow-Based Throttling Techniques
  • [Baniasadi 2001, Karkhanis 2002]

59
Future Work
  • Explore performance of other fetch gating schemes
    with transparent pipelining
  • Explore dependence driven gating on Itanium
    machine model
  • Explore latch soft error vulnerability (TVF) when
    lazy clocking is used
  • Explore change in AVF when fetch gating is used
  • Less ACE state in-flight

60
Scheduling Replay Example
  • Squashing/non-selective replay (Alpha 21264)
  • Replays all dependent and independent
    instructions issued under load shadow
  • Analogous to squashing recovery in branch
    misprediction
  • Simple but high performance penalty
  • Independent instructions are unnecessarily
    replayed

61
Narrow Core
  • Narrow Scheduler
  • Captures partial operands
  • Determines load latency (hit/miss)
  • Narrow Data-Path
  • Narrow ALU provides partial data to consumers
  • Narrow LSQ and partial tag cache
  • Finds only possible load data source
  • Uses least significant 16 bits
  • Large enough to help predict load latency
  • Small enough to achieve fast cycle time

62
L/S Disambiguation & Partial Tag Matching
  • Exploits operand significance
  • [Brooks et al. 1999, Canal et al. 2000]
  • Load/store disambiguation
  • 10 bits find 99% of matching stores
  • Partial tag match (sketch below)
  • 16 bits for 97% (mcf) to 99% (bzip2) accuracy
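
A sketch of low-order partial tag matching as described: a partial mismatch is a definite miss, while a partial match is a strong (97-99%) hit hint confirmed later against the full tags in the wide bank. Names are illustrative:

```python
TAG_BITS = 16
TAG_MASK = (1 << TAG_BITS) - 1

def partial_tag_match(addr_tag, stored_tags):
    """Compare only the low 16 tag bits. False positives are possible
    (two tags agreeing in the low bits), so a match is a prediction;
    a mismatch safely rules the entry out."""
    return any((addr_tag & TAG_MASK) == (t & TAG_MASK) for t in stored_tags)

print(partial_tag_match(0xABCD1234, [0x00001234, 0xFFFF5678]))  # True (hint)
print(partial_tag_match(0xABCD1234, [0x00005678]))              # False (definite miss)
```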

63
Outline
  • Motivation
  • Dynamic Scheduling with Narrow Values
  • Scheduler with Narrow Data-Path
  • Pipelined Data Cache
  • Pipeline Integration
  • Implementation and Experiments
  • Conclusions and Future Work

64
Dynamic Scheduling with Partial Operands
(Figure: pipeline sliced into front-end, OoO core,
and back-end)
  • Stores a subset of operands in scheduler
  • Exploits partial operand knowledge
  • Load-store disambiguation
  • Partial tag match

65
Pipelined Cache w/ Early Bits
  • Narrow bank for partial access, wide bank for the
    rest
  • Uses partial tag match in narrow bank
  • Saves power in wide bank
  • Hide wide cache bank latency by starting early

66
Narrow LSQ
  • Stores partial addresses of stores
  • Used for partial load-store disambiguation
  • Accessed in parallel with narrow bank
  • Saves power in the wide LSQ
  • Cheaper direct-mapped access rather than fully
    associative search

67
Pipeline Integration
  • Simple ALU insts link dependences in back-to-back
    cycles
  • Load insts need another cycle to schedule
    dependences
  • Complex ALU insts link dependences
    non-speculatively

68
Pipelined Data Cache & LSQ
  • Modeled using modified CACTI 3.0
  • Configuration: 16KB, 4-way, 64B blocks

69
Experiments
  • SimpleScalar/Alpha 3.0 tool set
  • Machine Model
  • 64-entry ROB
  • 4-wide fetch/issue/commit
  • 16-entry SQ, 16-entry LQ
  • 32-entry scheduler
  • 13-stage pipeline
  • 64KB I-Cache (2-cyc), 16KB D-Cache (2-cyc)
  • 2-cycle store to load forwarding

70
Energy Dissipation
  • On average, narrow data-capture scheduling
    consumes 25% less energy than non-data-capture
    scheduling

71
Mispredicted Load Instructions
  • Reduce misspeculated loads by 75-80%

72
Optimized model
  • Using a refetch replay scheme to reduce replay
    complexity
  • Clear the scheduler entries once instructions are
    issued
  • Decreases scheduler occupancy
  • Instructions enter the OoO window sooner
  • Reduce L1 cache latency from 2 cycles to 1 cycle

73
Optimized Model Performance
  • Small variations
  • Always performs as well or better

74
Future Work
  • Implement a more accurate dynamic power model
  • Study custom design vs. synthesized model
  • Study opportunities for leakage power reduction

75
Delay Model
  • Processor 0 can reach Processor 15 in 9 fewer
    cycles

76
Pipeline Unrolling