ELEC 669 Low Power Design Techniques Lecture 1 - PowerPoint PPT Presentation

About This Presentation
Title:

ELEC 669 Low Power Design Techniques Lecture 1

Description:

Low Power Design Techniques Lecture 1 Amirali Baniasadi amirali_at_ece.uvic.ca – PowerPoint PPT presentation

Slides: 54



1
ELEC 669: Low Power Design Techniques, Lecture 1
  • Amirali Baniasadi
  • amirali_at_ece.uvic.ca

2
ELEC 669 Low Power Design Techniques
  • Instructor: Amirali Baniasadi
  • Office: EOW 441, by appointment only. Call or
    email with your schedule.
  • Email: amirali_at_ece.uvic.ca
    Office tel.: 721-8613
  • Web page for this class:
    http://www.ece.uvic.ca/amirali/courses/ELEC669/elec669.html
  • Will use paper reprints
  • Lecture notes will be posted on the course web
    page.

3
Course Structure
  • Lectures:
  • 1-2 weeks on processor review
  • 5 weeks on low power techniques
  • 6 weeks of discussion, presentations, and meetings
  • A reading paper is posted on the web for each week.
  • Bring a 1-page review of the papers.
  • Presentations: each student should give two
    presentations in class.

4
Course Philosophy
  • Papers to be used as supplement for lectures (If
    a topic is not covered in the class, or a detail
    not presented in the class, that means I expect
    you to read on your own to learn those details)
  • One Project (50%)
  • Presentation (30%): will be announced in advance.
  • Final Exam, take home (20%)
  • IMPORTANT NOTE: You must get a passing grade in all
    components to pass the course. Failing any of the
    three components will result in failing the
    course.

5
Project
  • More on project later

6
Topics
  • High Performance Processors?
  • Low-Power Design
  • Low Power Branch Prediction
  • Low-Power Register Renaming
  • Low-Power SRAMs
  • Low-Power Front-End
  • Low-Power Back-End
  • Low-Power Issue Logic
  • Low-Power Commit
  • AND more

7
A Modern Processor
1. What does each part do? 2. Possible power optimizations?
[Figure: processor block diagram showing the front-end and back-end]
8
Power Breakdown
[Figure: power breakdown charts for the PentiumPro and the Alpha 21464]
9
Instruction Set Architecture (ISA)
  • Instruction Execution Cycle

10
What Should we Know?
  • A specific ISA (MIPS)
  • Performance issues - vocabulary and motivation
  • Instruction-Level Parallelism
  • How to Use Pipelining to improve performance
  • Exploiting Instruction-Level Parallelism w/
    Dynamic Approach
  • Memory caches and virtual memory

11
What is Expected From You?
  • Read papers!
  • Be up-to-date!
  • Come back with your input and questions for
    discussion!

12
Power?
  • Everything is done by tiny switches
  • Their charge represents logic values
  • Changing charge → energy
  • Power = energy over time
  • Devices are non-ideal → power → heat
  • Excess heat → circuits break down
  • Need to keep power within acceptable limits

13
POWER in the real world
14
Power as a Performance Limiter
Conventional Performance Scaling
  • Goal: max. performance w/ min. cost/complexity
  • How:
  • More and faster xtors.
  • More complex structures.
  • Power: "Don't fix if it ain't broken"
Not True Anymore
  • Power has increased rapidly
  • Power-Aware Architecture: a necessity
15
Power-Aware Architecture
Conventional Architecture
  • Goal: max. performance
  • How: do as much as you can.
This Work: Power-Aware Architecture
  • Goal: min. power while maintaining performance
  • How: do as little as you can, while maintaining
    performance
Challenging and new area
16
Why is this challenging
  • Identify actions that can be delayed/eliminated
  • Don't touch those that boost performance
  • Cost/power of doing so must not outweigh the
    benefits

17
Definitions
  • Performance is in units of things-per-second
  • bigger is better
  • If we are primarily concerned with response time:
  • $\text{performance}(x) = \dfrac{1}{\text{execution\_time}(x)}$
  • "X is n times faster than Y" means
  • $n = \dfrac{\text{Performance}(X)}{\text{Performance}(Y)}$

18
Amdahl's Law
  • Speedup due to enhancement E:
  • $\text{Speedup}(E) = \dfrac{\text{ExTime w/o } E}{\text{ExTime w/ } E} = \dfrac{\text{Performance w/ } E}{\text{Performance w/o } E}$
  • Suppose that enhancement E accelerates a fraction
    F of the task by a factor S, and the remainder of
    the task is unaffected. Then:
  • $\text{ExTime}(\text{with } E) = \left((1-F) + \dfrac{F}{S}\right) \times \text{ExTime}(\text{without } E)$
  • $\text{Speedup}(\text{with } E) = \dfrac{\text{ExTime}(\text{without } E)}{\left((1-F) + \frac{F}{S}\right) \times \text{ExTime}(\text{without } E)}$
  • $\text{Speedup}(\text{with } E) = \dfrac{1}{(1-F) + \frac{F}{S}}$

19
Amdahl's Law-example
  • A new CPU makes Web serving 10 times faster. The
    old CPU spent 40% of the time on computation and
    60% on waiting for I/O. What is the overall
    enhancement?
  • Fraction enhanced = 0.4
  • Speedup enhanced = 10
  • $\text{Speedup}_{\text{overall}} = \dfrac{1}{0.6 + \frac{0.4}{10}} \approx 1.56$
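The arithmetic in this example can be checked with a short Python sketch. The function name `amdahl_speedup` and its parameter names are illustrative choices, not part of the lecture.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when a fraction F of the task runs S times faster."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    # Speedup = 1 / ((1 - F) + F / S)
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Web-serving example: 40% computation, made 10 times faster.
web_speedup = amdahl_speedup(0.4, 10)
print(f"{web_speedup:.2f}")  # 1.56
```

Even with an arbitrarily large speedup of the computation part, the overall speedup stays below 1/0.6, because the 60% spent waiting for I/O is unaffected.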

20
Why Do Benchmarks?
  • How we evaluate differences
  • Different systems
  • Changes to a single system
  • Provide a target
  • Benchmarks should represent large class of
    important programs
  • Improving benchmark performance should help many
    programs
  • For better or worse, benchmarks shape a field
  • Good ones accelerate progress
  • good target for development
  • Bad benchmarks hurt progress
  • help real programs v. sell machines/papers?
  • Inventions that help real programs don't help
    the benchmark

21
SPEC first round
  • First round (1989): 10 programs, single number to
    summarize performance
  • One program: 99% of time in a single line of code
  • New front-end compiler could improve it dramatically

22
SPEC Evolution
  • Second round: SpecInt92 (6 integer programs) and
    SpecFP92 (14 floating point programs)
  • Add SPECbase: one flag setting for integer
    programs, one for FP
  • Third round (1995): new set of programs
  • "benchmarks useful for 3 years"
  • Now: SPEC 2000

23
SPEC95
  • Eighteen application benchmarks (with inputs)
    reflecting a technical computing workload
  • Eight integer:
  • go, m88ksim, gcc, compress, li, ijpeg, perl,
    vortex
  • Ten floating-point intensive:
  • tomcatv, swim, su2cor, hydro2d, mgrid, applu,
    turb3d, apsi, fpppp, wave5
  • Must run with standard compiler flags
  • eliminates special undocumented incantations that
    may not even generate working code for real
    programs

24
Summary
  • Time is the measure of computer performance!
  • Remember Amdahl's Law: improvement is limited by
    the unimproved part of the program

25
Execution Cycle
  • Instruction Fetch: Obtain instruction from program storage
  • Instruction Decode: Determine required actions and instruction size
  • Operand Fetch: Locate and obtain operand data
  • Execute: Compute result value or status
  • Result Store: Deposit results in storage for later use
  • Next Instruction: Determine successor instruction
26
What Must be Specified?
  • Instruction format or encoding
  • how is it decoded?
  • Location of operands and result
  • where other than memory?
  • how many explicit operands?
  • how are memory operands located?
  • which can or cannot be in memory?
  • Data type and size
  • Operations
  • what are supported
  • Successor instruction
  • jumps, conditions, branches

[Figure: execution cycle (Instruction Fetch, Instruction Decode, Operand Fetch, Execute, Result Store, Next Instruction)]
27
What Is ILP?
  • Principle: Many instructions in the code do not
    depend on each other
  • Result: Possible to execute them in parallel
  • ILP: Potential overlap among instructions (so
    they can be evaluated in parallel)
  • Issues:
  • Building compilers to analyze the code
  • Building special/smarter hardware to handle the
    code
  • ILP: Increase the amount of parallelism
    exploited among instructions
  • Seeks good results out of pipelining

28
What Is ILP?
  • CODE A                CODE B
  • LD  R1, (R2)100       LD  R1, (R2)100
  • ADD R4, R1            ADD R4, R1
  • SUB R5, R1            SUB R5, R4
  • CMP R1, R2            SW  R5, (R2)100
  • ADD R3, R1            LD  R1, (R2)100
  • Code A: Possible to execute 4 instructions in
    parallel.
  • Code B: Can't execute more than one instruction
    per cycle.
  • Code A has higher ILP
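The difference between Code A and Code B can be made concrete with a small dependence analysis. The sketch below is illustrative only and rests on several assumptions that are not stated on the slide: two-operand instructions read and overwrite their first register (`ADD R4, R1` means R4 = R4 + R1), `CMP` writes a hypothetical `FLAGS` location, the word at `(R2)100` is modeled as a single location called `MEM`, and only true (read-after-write) dependences constrain execution.

```python
from collections import Counter


def dataflow_levels(instrs):
    """Earliest step at which each instruction can run, honoring RAW only.

    Each instruction is (text, destinations, sources).
    """
    ready = {}   # location -> step at which its newest value is produced
    levels = []
    for _text, dests, srcs in instrs:
        level = 1 + max((ready.get(s, 0) for s in srcs), default=0)
        for d in dests:
            ready[d] = level
        levels.append(level)
    return levels


def max_parallel(levels):
    """Largest number of instructions that share one step."""
    return max(Counter(levels).values())


CODE_A = [
    ("LD  R1, (R2)100", ["R1"], ["R2", "MEM"]),
    ("ADD R4, R1", ["R4"], ["R4", "R1"]),
    ("SUB R5, R1", ["R5"], ["R5", "R1"]),
    ("CMP R1, R2", ["FLAGS"], ["R1", "R2"]),
    ("ADD R3, R1", ["R3"], ["R3", "R1"]),
]
CODE_B = [
    ("LD  R1, (R2)100", ["R1"], ["R2", "MEM"]),
    ("ADD R4, R1", ["R4"], ["R4", "R1"]),
    ("SUB R5, R4", ["R5"], ["R5", "R4"]),
    ("SW  R5, (R2)100", ["MEM"], ["R5", "R2"]),
    ("LD  R1, (R2)100", ["R1"], ["R2", "MEM"]),
]

levels_a = dataflow_levels(CODE_A)
levels_b = dataflow_levels(CODE_B)
print(levels_a, max_parallel(levels_a))  # [1, 2, 2, 2, 2] 4
print(levels_b, max_parallel(levels_b))  # [1, 2, 3, 4, 5] 1
```

In Code A the four instructions after the load depend only on R1, so they share one step. In Code B each instruction consumes the result of the one before it, so the chain is five steps long.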

29
Out of Order Execution
  • Programmer: Instructions execute in-order
  • Processor: Instructions may execute in any order
    if results remain the same at the end
Out-of-Order
  • B: ADD R3, R4
  • C: ADD R3, R5
  • A: LD R1, (R2)
  • D: CMP R3, R1
30
Assumptions
  • Five-stage integer pipeline
  • Branches have delay of one clock cycle
  • ID stage: comparisons done, decisions made, and PC
    loaded
  • No structural hazards
  • Functional units are fully pipelined or
    replicated (as many times as the pipeline depth)
  • FP Latencies

Integer load latency: 1. Integer ALU operation
latency: 0.
31
Simple Loop Assembler Equivalent
  • for (i = 1000; i > 0; i--) x[i] = x[i] + s;
  • Loop: LD F0, 0(R1)      ; F0 = array element
  • ADDD F4, F0, F2         ; add scalar in F2
  • SD F4, 0(R1)            ; store result
  • SUBI R1, R1, 8          ; decrement pointer by 8 bytes (DW)
  • BNE R1, R2, Loop        ; branch if R1 != R2
  • x[i] and s are of double/floating-point type
  • R1 is initially the address of the array element
    with the highest address
  • F2 contains the scalar value s
  • Register R2 is pre-computed so that 8(R2) is the
    last element to operate on

32
Where are the stalls?
  • Unscheduled:
  • Loop: LD F0, 0(R1)
  • stall
  • ADDD F4, F0, F2
  • stall
  • stall
  • SD F4, 0(R1)
  • SUBI R1, R1, 8
  • stall
  • BNE R1, R2, Loop
  • stall
  • 10 clock cycles
  • Can we minimize?
  • Scheduled:
  • Loop: LD F0, 0(R1)
  • SUBI R1, R1, 8
  • ADDD F4, F0, F2
  • stall
  • BNE R1, R2, Loop
  • SD F4, 8(R1)
  • 6 clock cycles
  • 3 cycles actual work, 3 cycles overhead
  • Can we minimize further?
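The two cycle counts can be reproduced with a small in-order issue model. This is a sketch under stated assumptions, not the lecture's method: the `LATENCY` table is inferred from the stalls shown above (one stall between a load and a dependent FP add, two between an FP add and a dependent store, one between an integer ALU operation and a dependent branch), an unfilled branch delay slot costs one cycle, and the names `count_cycles`, `UNSCHEDULED`, and `SCHEDULED` are hypothetical.

```python
# Stall cycles needed between a producer and a dependent consumer.
# Assumed values, inferred from the stalls shown on the slide.
LATENCY = {
    ("load", "fpalu"): 1,
    ("fpalu", "store"): 2,
    ("intalu", "branch"): 1,
}


def count_cycles(instrs, delay_slot_filled):
    """Cycles for one loop iteration; instrs are (kind, dest, sources)."""
    producers = {}   # register -> (issue cycle, kind) of its latest writer
    cycle = 0
    for kind, dest, srcs in instrs:
        earliest = cycle + 1
        for reg in srcs:
            if reg in producers:
                p_cycle, p_kind = producers[reg]
                stalls = LATENCY.get((p_kind, kind), 0)
                earliest = max(earliest, p_cycle + 1 + stalls)
        cycle = earliest
        if dest is not None:
            producers[dest] = (cycle, kind)
    # An empty branch delay slot wastes one more cycle.
    return cycle if delay_slot_filled else cycle + 1


UNSCHEDULED = [
    ("load", "F0", ["R1"]),          # LD   F0, 0(R1)
    ("fpalu", "F4", ["F0", "F2"]),   # ADDD F4, F0, F2
    ("store", None, ["F4", "R1"]),   # SD   F4, 0(R1)
    ("intalu", "R1", ["R1"]),        # SUBI R1, R1, 8
    ("branch", None, ["R1", "R2"]),  # BNE  R1, R2, Loop
]
SCHEDULED = [
    ("load", "F0", ["R1"]),          # LD   F0, 0(R1)
    ("intalu", "R1", ["R1"]),        # SUBI R1, R1, 8
    ("fpalu", "F4", ["F0", "F2"]),   # ADDD F4, F0, F2
    ("branch", None, ["R1", "R2"]),  # BNE  R1, R2, Loop
    ("store", None, ["F4", "R1"]),   # SD   F4, 8(R1), in the delay slot
]

print(count_cycles(UNSCHEDULED, delay_slot_filled=False))  # 10
print(count_cycles(SCHEDULED, delay_slot_filled=True))     # 6
```

The model only counts cycles. For the scheduled version it charges the one remaining stall to the store, whereas the slide places it before the branch so that the store stays in the delay slot; the total is the same.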

33
Loop Unrolling
Four copies of loop:
  • Loop: LD F0, 0(R1)
  • ADDD F4, F0, F2
  • SD F4, 0(R1)
  • SUBI R1, R1, 8
  • BNE R1, R2, Loop
  • LD F0, 0(R1)
  • ADDD F4, F0, F2
  • SD F4, 0(R1)
  • SUBI R1, R1, 8
  • BNE R1, R2, Loop
  • LD F0, 0(R1)
  • ADDD F4, F0, F2
  • SD F4, 0(R1)
  • SUBI R1, R1, 8
  • BNE R1, R2, Loop
  • LD F0, 0(R1)
  • ADDD F4, F0, F2
  • SD F4, 0(R1)
  • SUBI R1, R1, 8
  • BNE R1, R2, Loop

Four-iteration code:
  • Loop: LD F0, 0(R1)
  • ADDD F4, F0, F2
  • SD F4, 0(R1)
  • LD F6, -8(R1)
  • ADDD F8, F6, F2
  • SD F8, -8(R1)
  • LD F10, -16(R1)
  • ADDD F12, F10, F2
  • SD F12, -16(R1)
  • LD F14, -24(R1)
  • ADDD F16, F14, F2
  • SD F16, -24(R1)
  • SUBI R1, R1, 32
  • BNE R1, R2, Loop

Assumption: R1 is initially a multiple of 32, or the
number of loop iterations is a multiple of 4.
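The same transformation can be sketched at the source level. The Python below is an illustration rather than the lecture's code: `add_scalar` mirrors the original loop, `add_scalar_unrolled4` mirrors the four-iteration body with a single index update per pass, and both function names are hypothetical. The length check corresponds to the stated assumption that the iteration count is a multiple of 4.

```python
def add_scalar(x, s):
    """Original loop: one element per iteration, highest index first."""
    for i in range(len(x) - 1, -1, -1):
        x[i] = x[i] + s


def add_scalar_unrolled4(x, s):
    """Unrolled by four: one index update and one loop test per four elements."""
    if len(x) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    i = len(x) - 1
    while i >= 0:
        x[i] = x[i] + s          # offset 0
        x[i - 1] = x[i - 1] + s  # offset -8 bytes in the assembly version
        x[i - 2] = x[i - 2] + s  # offset -16
        x[i - 3] = x[i - 3] + s  # offset -24
        i -= 4                   # corresponds to SUBI R1, R1, 32


a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = list(a)
add_scalar(a, 0.5)
add_scalar_unrolled4(b, 0.5)
print(a == b)  # True
```

Unrolling removes three of every four pointer updates and branches, which is where the loop overhead is saved.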
34
Loop Unroll Schedule
Unrolled, before scheduling:
  • Loop: LD F0, 0(R1)
  • stall
  • ADDD F4, F0, F2
  • stall
  • stall
  • SD F4, 0(R1)
  • LD F6, -8(R1)
  • stall
  • ADDD F8, F6, F2
  • stall
  • stall
  • SD F8, -8(R1)
  • LD F10, -16(R1)
  • stall
  • ADDD F12, F10, F2
  • stall
  • stall
  • SD F12, -16(R1)
  • LD F14, -24(R1)
  • stall
  • ADDD F16, F14, F2
  • stall
  • stall
  • SD F16, -24(R1)
  • SUBI R1, R1, 32
  • stall
  • BNE R1, R2, Loop
  • stall
  • 28 clock cycles, or 7 per iteration
  • Can we minimize further?

Schedule:
  • Loop: LD F0, 0(R1)
  • LD F6, -8(R1)
  • LD F10, -16(R1)
  • LD F14, -24(R1)
  • ADDD F4, F0, F2
  • ADDD F8, F6, F2
  • ADDD F12, F10, F2
  • ADDD F16, F14, F2
  • SD F4, 0(R1)
  • SD F8, -8(R1)
  • SD F12, -16(R1)
  • SUBI R1, R1, 32
  • BNE R1, R2, Loop
  • SD F16, 8(R1)
  • No stalls! 14 clock cycles, or 3.5 per iteration
  • Can we minimize further?
35
Summary
  • Original iteration: 10 cycles
  • Scheduling only: 6 cycles
  • Unrolling only: 7 cycles per iteration
  • Unrolling plus scheduling: 3.5 cycles per iteration (no stalls)
36
Multiple Issue
  • Multiple Issue is the ability of the processor to
    start more than one instruction in a given cycle.
  • Superscalar processors
  • Very Long Instruction Word (VLIW) processors

37
A Modern Processor
[Figure: processor block diagram; labels: Multiple Issue, Front-end, Back-end]
38
1990s Superscalar Processors
  • Bottleneck: CPI >= 1
  • Limit on scalar performance (single instruction
    issue)
  • Hazards
  • Superpipelining? Diminishing returns (hazards +
    overhead)
  • How can we make the CPI = 0.5?
  • Multiple instructions in every pipeline stage
    (super-scalar)

            1    2    3    4    5    6    7
    Inst0   IF   ID   EX   MEM  WB
    Inst1   IF   ID   EX   MEM  WB
    Inst2        IF   ID   EX   MEM  WB
    Inst3        IF   ID   EX   MEM  WB
    Inst4             IF   ID   EX   MEM  WB
    Inst5             IF   ID   EX   MEM  WB
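The timing of this diagram follows from a simple rule: with an issue width of 2 and no hazards, instruction i enters IF in cycle floor(i/2) + 1. The sketch below generates the diagram under that idealized assumption; the function names are illustrative and hazards are not modeled.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_rows(num_instructions, issue_width):
    """For each instruction, map cycle number -> stage (ideal, no hazards)."""
    rows = []
    for i in range(num_instructions):
        start = i // issue_width + 1
        rows.append({start + k: stage for k, stage in enumerate(STAGES)})
    return rows


def total_cycles(num_instructions, issue_width):
    """Cycle in which the last instruction leaves WB."""
    return (num_instructions - 1) // issue_width + len(STAGES)


def render(rows, last_cycle):
    """Plain-text version of the pipeline diagram."""
    lines = []
    for i, row in enumerate(rows):
        cells = [f"{row.get(c, ''):<4}" for c in range(1, last_cycle + 1)]
        lines.append(f"Inst{i} " + "".join(cells).rstrip())
    return "\n".join(lines)


rows = pipeline_rows(6, issue_width=2)
print(render(rows, total_cycles(6, 2)))
print(total_cycles(6, 2))  # 7
```

In steady state such a pipeline completes two instructions per cycle, which is the CPI of 0.5 asked for above; any hazard pushes the real figure higher.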

39
Elements of Advanced Superscalars
  • High performance instruction fetching
  • Good dynamic branch and jump prediction
  • Multiple instructions per cycle, multiple
    branches per cycle?
  • Scheduling and hazard elimination
  • Dynamic scheduling
  • Not necessarily: Alpha 21064 and Pentium were
    statically scheduled
  • Register renaming to eliminate WAR and WAW
  • Parallel functional units, paths/buses/multiple
    register ports
  • High performance memory systems
  • Speculative execution

40
SS + DS + Speculation
  • Superscalar + Dynamic scheduling + Speculation
  • Three great tastes that taste great together
  • CPI >= 1?
  • Overcome with superscalar
  • Superscalar increases hazards
  • Overcome with dynamic scheduling
  • RAW dependences still a problem?
  • Overcome with a large window
  • Branches a problem for filling large window?
  • Overcome with speculation

41
The Big Picture
[Figure: static program → fetch and branch predict → issue → execution → reorder and commit]
42
Superscalar Microarchitecture
[Figure: superscalar microarchitecture; blocks: pre-decode, inst. cache, inst. buffer, decode/rename/dispatch, floating point inst. buffer, integer/address inst. buffer, floating point register file, integer register file, functional units, functional units and data cache, memory interface, reorder and commit]
43
Register renaming methods
  • First method:
  • Physical register file vs. logical
    (architectural) register file.
  • Mapping table used to associate a physical reg. w/
    the current value of a logical reg.
  • Use a free list of physical registers
  • Physical register file bigger than logical register
    file
  • Second method:
  • Physical register file same size as logical
  • Also, use a buffer w/ one entry per inst.:
    the reorder buffer.

44
Register Renaming Example
Unrolled, before scheduling:
  • Loop: LD F0, 0(R1)
  • stall
  • ADDD F4, F0, F2
  • stall
  • stall
  • SD F4, 0(R1)
  • LD F6, -8(R1)
  • stall
  • ADDD F8, F6, F2
  • stall
  • stall
  • SD F8, -8(R1)
  • LD F10, -16(R1)
  • stall
  • ADDD F12, F10, F2
  • stall
  • stall
  • SD F12, -16(R1)
  • LD F14, -24(R1)
  • stall
  • ADDD F16, F14, F2
  • stall
  • stall
  • SD F16, -24(R1)
  • SUBI R1, R1, 32
  • stall
  • BNE R1, R2, Loop
  • stall
  • 28 clock cycles, or 7 per iteration
  • Can we minimize further?

Schedule:
  • Loop: LD F0, 0(R1)
  • LD F6, -8(R1)
  • LD F10, -16(R1)
  • LD F14, -24(R1)
  • ADDD F4, F0, F2
  • ADDD F8, F6, F2
  • ADDD F12, F10, F2
  • ADDD F16, F14, F2
  • SD F4, 0(R1)
  • SD F8, -8(R1)
  • SD F12, -16(R1)
  • SUBI R1, R1, 32
  • BNE R1, R2, Loop
  • SD F16, 8(R1)
  • No stalls! 14 clock cycles, or 3.5 per iteration
  • Can we minimize further?
45
Register renaming first method
[Figure: mapping table and free list, before and after renaming Add r3,r3,4]
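The first renaming method can be sketched in a few lines of Python. This is a minimal illustration, not the hardware design from the slides: the class name `Renamer`, the `p<n>` physical register names, and the register counts are all assumptions, and returning registers to the free list at commit is left out.

```python
from collections import deque


class Renamer:
    """Map table plus free list, as in the first renaming method."""

    def __init__(self, num_logical, num_physical):
        if num_physical <= num_logical:
            raise ValueError("need more physical than logical registers")
        # Initially logical register r<i> lives in physical register p<i>.
        self.map_table = {f"r{i}": f"p{i}" for i in range(num_logical)}
        self.free_list = deque(f"p{i}" for i in range(num_logical, num_physical))

    def rename(self, dest, srcs):
        """Return (physical destination, physical sources) for one instruction."""
        # Read the sources first so that "add r3,r3,4" sees the old r3.
        phys_srcs = [self.map_table.get(s, s) for s in srcs]  # immediates pass through
        if not self.free_list:
            raise RuntimeError("no free physical register: rename must stall")
        phys_dest = self.free_list.popleft()
        self.map_table[dest] = phys_dest
        return phys_dest, phys_srcs


renamer = Renamer(num_logical=5, num_physical=10)
print(renamer.rename("r3", ["r3", "4"]))  # ('p5', ['p3', '4'])
print(renamer.map_table["r3"])            # p5
```

Because the destination gets a fresh physical register, a later writer of r3 cannot overwrite a value that an earlier reader still needs, which is how renaming removes WAR and WAW hazards.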
46
Superscalar Processors
  • Issues varying number of instructions per clock
  • Scheduling: static (by the compiler) or
    dynamic (by the hardware)
  • Superscalar has a varying number of
    instructions/cycle (1 to 8), scheduled by the
    compiler or by HW (Tomasulo).
  • IBM PowerPC, Sun UltraSparc, DEC Alpha, HP 8000

47
More Realistic HW Register Impact
  • Effect of limiting the number of renaming
    registers

[Figure: IPC as the number of renaming registers is limited; FP: 11 - 45, Integer: 5 - 15]
48
Reorder Buffer
  • Place data in entry when execution finished

  • Reserve entry at tail when dispatched
  • Remove from head when complete
  • Bypass to other instructions when needed
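The four reorder buffer operations listed on this slide can be sketched as a small queue. The class and method names below are illustrative assumptions, and the entry layout is simplified to a destination register, a value, and a done flag.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self, size):
        self.size = size
        self.entries = deque()   # head is the left end, tail the right end
        self.next_tag = 0

    def dispatch(self, dest):
        """Reserve an entry at the tail; return its tag."""
        if len(self.entries) == self.size:
            raise RuntimeError("reorder buffer full: dispatch must stall")
        tag = self.next_tag
        self.next_tag += 1
        self.entries.append({"tag": tag, "dest": dest, "value": None, "done": False})
        return tag

    def finish(self, tag, value):
        """Place the result in the entry when execution finishes."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["value"], entry["done"] = value, True
                return
        raise KeyError(tag)

    def bypass(self, tag):
        """Finished but not yet committed value for a waiting instruction, else None."""
        for entry in self.entries:
            if entry["tag"] == tag and entry["done"]:
                return entry["value"]
        return None

    def commit(self, regfile):
        """Remove finished entries from the head, in program order."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            regfile[entry["dest"]] = entry["value"]
            retired.append(entry["tag"])
        return retired


rob, regs = ReorderBuffer(4), {}
t0 = rob.dispatch("r1")
t1 = rob.dispatch("r3")
rob.finish(t1, 7)          # the younger instruction finishes first
print(rob.commit(regs))    # [] because the head entry is not done yet
print(rob.bypass(t1))      # 7
rob.finish(t0, 5)
print(rob.commit(regs))    # [0, 1]
```

Results may arrive out of order, but the register file is only updated from the head, so the architectural state changes in program order.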
49
Register renaming: reorder buffer
[Figure: register map (r0 to r4) and reorder buffer before and after add r3,r3,4; the source r3 is read from rob6 and the result is assigned to rob8, giving add rob8,rob6,4]
50
Instruction Buffers
[Figure: superscalar microarchitecture with the instruction buffers highlighted; blocks: pre-decode, inst. cache, inst. buffer, decode/rename/dispatch, floating point inst. buffer, integer/address inst. buffer, floating point register file, integer register file, functional units, functional units and data cache, memory interface, reorder and commit]
51
Issue Buffer Organization
  • a) Single, shared queue: no out-of-order, no renaming
  • b) Multiple queues, one per inst. type: no
    out-of-order inside queues; queues issue out of
    order
52
Issue Buffer Organization
  • c) Multiple reservation stations (one
    per instruction type or big pool)
  • NO FIFO ordering
  • Ready operands + hardware available →
    execution starts
  • Proposed by Tomasulo

[Figure: reservation stations fed from instruction dispatch]
53
Typical reservation station
Operation | source 1 | data 1 | valid 1 | source 2 | data 2 | valid 2 | destination
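The fields of a reservation station map naturally onto a small record. The sketch below is illustrative, not Tomasulo's actual design: the field names follow the slide, while `capture`, `ready`, and `select_ready` are hypothetical helper names, and tags such as `rob6` are example values only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReservationStation:
    operation: str
    source1: Optional[str]   # tag of the producer while data1 is not valid
    data1: Optional[int]
    valid1: bool
    source2: Optional[str]
    data2: Optional[int]
    valid2: bool
    destination: str

    def capture(self, tag: str, value: int) -> None:
        """Fill in any operand that is waiting on a broadcast result tag."""
        if not self.valid1 and self.source1 == tag:
            self.data1, self.valid1 = value, True
        if not self.valid2 and self.source2 == tag:
            self.data2, self.valid2 = value, True

    def ready(self) -> bool:
        return self.valid1 and self.valid2


def select_ready(stations: List[ReservationStation], units_free: int):
    """No FIFO ordering: any ready entry may start, up to the free units."""
    return [rs for rs in stations if rs.ready()][:units_free]


waiting = ReservationStation("add", "rob6", None, False, None, 4, True, "rob8")
other = ReservationStation("sub", None, 1, True, None, 2, True, "rob9")
print([rs.destination for rs in select_ready([waiting, other], 2)])  # ['rob9']
waiting.capture("rob6", 10)
print([rs.destination for rs in select_ready([waiting, other], 2)])  # ['rob8', 'rob9']
```

The younger `sub` can start before the older `add` because only operand readiness and a free unit matter, not queue position.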
54
Memory Hazard Detection Logic