Title: ELEC 669 Low Power Design Techniques Lecture 1
1 ELEC 669 Low Power Design Techniques, Lecture 1
- Amirali Baniasadi
- amirali_at_ece.uvic.ca
2 ELEC 669 Low Power Design Techniques
- Instructor: Amirali Baniasadi
- Office: EOW 441, only by appointment. Call or email with your schedule.
- Email: amirali_at_ece.uvic.ca
- Office Tel: 721-8613
- Web page for this class will be at http://www.ece.uvic.ca/amirali/courses/ELEC669/elec669.html
- Will use paper reprints
- Lecture notes will be posted on the course web page.
3 Course Structure
- Lectures
- 1-2 weeks on processor review
- 5 weeks on low power techniques
- 6 weeks of discussion, presentations, and meetings
- Reading: papers posted on the web for each week.
- Bring a one-page review of each paper.
- Presentations: each student should give two presentations in class.
4 Course Philosophy
- Papers are used as a supplement to lectures. (If a topic or a detail is not covered in class, I expect you to read on your own to learn those details.)
- One project (50%)
- Presentation (30%): will be announced in advance.
- Final exam, take home (20%)
- IMPORTANT NOTE: You must get a passing grade in all components to pass the course. Failing any of the three components will result in failing the course.
5 Project
6 Topics
- High Performance Processors?
- Low-Power Design
- Low Power Branch Prediction
- Low-Power Register Renaming
- Low-Power SRAMs
- Low-Power Front-End
- Low-Power Back-End
- Low-Power Issue Logic
- Low-Power Commit
- AND more
7 A Modern Processor
- 1. What does each part do? 2. Possible power optimizations?
(Diagram: a modern processor pipeline, split into front-end and back-end.)
8 Power Breakdown
(Charts: power breakdown for the Pentium Pro and the Alpha 21464.)
9 Instruction Set Architecture (ISA)
- Instruction Execution Cycle
10 What Should We Know?
- A specific ISA (MIPS)
- Performance issues: vocabulary and motivation
- Instruction-Level Parallelism
- How to use pipelining to improve performance
- Exploiting instruction-level parallelism with a dynamic approach
- Memory: caches and virtual memory
11 What Is Expected From You?
- Read papers!
- Be up-to-date!
- Come back with your input and questions for discussion!
12 Power?
- Everything is done by tiny switches
- Their charge represents logic values
- Changing charge requires energy
- Power is energy over time
- Devices are non-ideal: power turns into heat
- Excess heat causes circuits to break down
- Need to keep power within acceptable limits
13 POWER in the Real World
14 Power as a Performance Limiter
- Conventional performance scaling
- Goal: maximum performance with minimum cost/complexity
- How: more and faster transistors; more complex structures.
- Power: "Don't fix it if it ain't broken"
- Not true anymore: power has increased rapidly
- Power-aware architecture is a necessity
15 Power-Aware Architecture
- Conventional architecture
- Goal: maximum performance
- How: do as much as you can.
- This work: power-aware architecture
- Goal: minimize power and maintain performance
- How: do as little as you can, while maintaining performance.
- A challenging and new area
16 Why Is This Challenging?
- Identify actions that can be delayed/eliminated
- Don't touch those that boost performance
- The cost/power of doing so must not outweigh the benefits
17 Definitions
- Performance is in units of things-per-second
- bigger is better
- If we are primarily concerned with response time:
- performance(x) = 1 / execution_time(x)
- "X is n times faster than Y" means:
- n = Performance(X) / Performance(Y)
18 Amdahl's Law
- Speedup due to enhancement E:
- Speedup(E) = ExTime(without E) / ExTime(with E) = Performance(with E) / Performance(without E)
- Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:
- ExTime(with E) = ((1-F) + F/S) × ExTime(without E)
- Speedup(with E) = ExTime(without E) / ExTime(with E) = 1 / ((1-F) + F/S)
19 Amdahl's Law: Example
- A new CPU makes web serving 10 times faster. The old CPU spent 40% of the time on computation and 60% on waiting for I/O. What is the overall enhancement?
- Fraction enhanced = 0.4
- Speedup enhanced = 10
- Speedup overall = 1 / (0.6 + 0.4/10) = 1.56
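The example above can be checked with a short Python sketch of Amdahl's Law (the function name is mine, not from the lecture):

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction of execution time is
    accelerated by a given factor (Amdahl's Law)."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# The web-serving example: 40% of the time sped up 10x.
overall = amdahl_speedup(0.4, 10)
print(round(overall, 2))  # → 1.56
```

Note that even with a 10x faster CPU, the overall speedup is only 1.56x, because the 60% spent on I/O is untouched.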
20 Why Do Benchmarks?
- How we evaluate differences
- Different systems
- Changes to a single system
- Provide a target
- Benchmarks should represent a large class of important programs
- Improving benchmark performance should help many programs
- For better or worse, benchmarks shape a field
- Good ones accelerate progress
- good target for development
- Bad benchmarks hurt progress
- help real programs vs. sell machines/papers?
- Inventions that help real programs don't help the benchmark
21 SPEC: First Round
- First round, 1989: 10 programs, single number to summarize performance
- One program spent 99% of its time in a single line of code
- A new front-end compiler could improve results dramatically
22 SPEC Evolution
- Second round: SpecInt92 (6 integer programs) and SpecFP92 (14 floating point programs)
- Added SPECbase: one flag setting for integer programs, one for FP
- Third round, 1995: new set of programs
- benchmarks useful for 3 years
- Now: SPEC 2000
23 SPEC95
- Eighteen application benchmarks (with inputs) reflecting a technical computing workload
- Eight integer:
- go, m88ksim, gcc, compress, li, ijpeg, perl, vortex
- Ten floating-point intensive:
- tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fppp, wave5
- Must run with standard compiler flags
- eliminates special undocumented incantations that may not even generate working code for real programs
24 Summary
- Time is the measure of computer performance!
- Remember Amdahl's Law: improvement is limited by the unimproved part of the program
25 Execution Cycle
- Instruction Fetch: obtain instruction from program storage
- Instruction Decode: determine required actions and instruction size
- Operand Fetch: locate and obtain operand data
- Execute: compute result value or status
- Result Store: deposit results in storage for later use
- Next Instruction: determine successor instruction
26 What Must Be Specified?
(For each stage of the execution cycle: fetch, decode, operand fetch, execute, result store, next instruction.)
- Instruction format or encoding
- how is it decoded?
- Location of operands and result
- where other than memory?
- how many explicit operands?
- how are memory operands located?
- which can or cannot be in memory?
- Data type and size
- Operations
- what is supported?
- Successor instruction
- jumps, conditions, branches
27 What Is ILP?
- Principle: many instructions in the code do not depend on each other
- Result: possible to execute them in parallel
- ILP: potential overlap among instructions (so they can be evaluated in parallel)
- Issues:
- Building compilers to analyze the code
- Building special/smarter hardware to handle the code
- Goal: increase the amount of parallelism exploited among instructions
- Seeks good results out of pipelining
28 What Is ILP?
- Code A:              Code B:
- LD R1, (R2)100       LD R1, (R2)100
- ADD R4, R1           ADD R4, R1
- SUB R5, R1           SUB R5, R4
- CMP R1, R2           SW R5, (R2)100
- ADD R3, R1           LD R1, (R2)100
- Code A: possible to execute 4 instructions in parallel.
- Code B: can't execute more than one instruction per cycle.
- Code A has higher ILP.
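A crude way to see why Code A has more ILP: model each instruction as a destination plus a list of sources (with memory treated as one pseudo-register, "MEM", so the store and the reload in Code B are seen as dependent), then find the largest run of consecutive instructions with no read-after-write dependence among themselves. The encoding and helper below are my sketch, not part of the lecture:

```python
# Each instruction: (dest, [sources]). "MEM" models memory so that
# a store followed by a load to the same region counts as dependent.
# LD,  ADD, SUB, CMP, ADD
code_a = [("R1", ["R2", "MEM"]), ("R4", ["R1"]), ("R5", ["R1"]),
          (None, ["R1", "R2"]), ("R3", ["R1"])]
# LD,  ADD, SUB, SW,  LD
code_b = [("R1", ["R2", "MEM"]), ("R4", ["R1"]), ("R5", ["R4"]),
          ("MEM", ["R5", "R2"]), ("R1", ["R2", "MEM"])]

def max_parallel_group(code):
    """Largest run of consecutive instructions that are mutually
    independent (no RAW hazard inside the run)."""
    best = 0
    for start in range(len(code)):
        written, size = set(), 0
        for dest, srcs in code[start:]:
            if any(s in written for s in srcs):
                break  # this instruction reads a value produced in the run
            size += 1
            if dest is not None:
                written.add(dest)
        best = max(best, size)
    return best

print(max_parallel_group(code_a))  # → 4 (ADD, SUB, CMP, ADD)
print(max_parallel_group(code_b))  # → 1
```

The model only tracks RAW dependences, which is enough to reproduce the slide's conclusion: 4 parallel instructions in Code A, only 1 in Code B.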
29 Out-of-Order Execution
- Programmer: instructions execute in order.
- Processor: instructions may execute in any order if the results remain the same at the end.
- Out-of-order example:
- B: ADD R3, R4   C: ADD R3, R5   A: LD R1, (R2)   D: CMP R3, R1
30 Assumptions
- Five-stage integer pipeline
- Branches have a delay of one clock cycle
- ID stage: comparisons done, decisions made, and PC loaded
- No structural hazards
- Functional units are fully pipelined or replicated (as many times as the pipeline depth)
- FP latencies (table); integer load latency: 1; integer ALU operation latency: 0
31 Simple Loop and Its Assembler Equivalent
- for (i = 1000; i > 0; i--) x[i] = x[i] + s;
- Loop: LD F0, 0(R1)      ; F0 = array element
- ADDD F4, F0, F2         ; add scalar in F2
- SD F4, 0(R1)            ; store result
- SUBI R1, R1, 8          ; decrement pointer, 8 bytes (DW)
- BNE R1, R2, Loop        ; branch if R1 != R2
- x[i] and s are double/floating-point type
- R1 is initially the address of the array element with the highest address
- F2 contains the scalar value s
- Register R2 is pre-computed so that 8(R2) is the last element to operate on
32 Where Are the Stalls?
- Unscheduled:
- Loop: LD F0, 0(R1)
- stall
- ADDD F4, F0, F2
- stall
- stall
- SD F4, 0(R1)
- SUBI R1, R1, 8
- stall
- BNE R1, R2, Loop
- stall
- 10 clock cycles. Can we minimize?
- Scheduled:
- Loop: LD F0, 0(R1)
- SUBI R1, R1, 8
- ADDD F4, F0, F2
- stall
- BNE R1, R2, Loop
- SD F4, 8(R1)
- 6 clock cycles: 3 cycles of actual work + 3 cycles of overhead. Can we minimize further?
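The two cycle counts above can be reproduced with a small issue-cycle model in Python. The instruction encoding and the latency table are my simplification of the assumptions slide: a load's result is usable after one stall, an ADDD's after two (for the dependent store), and a branch needs its register one cycle early because it resolves in ID:

```python
# LATENCY[op] = extra cycles before op's result is usable by a
# dependent instruction (my simplified reading of the slides).
LATENCY = {"LD": 1, "ADDD": 2, "SUBI": 1, "SD": 0, "BNE": 0}

def cycles(schedule):
    """schedule: list of (op, dest, sources). Returns the total
    cycle count, adding one branch-delay cycle if BNE is last."""
    ready, t, last_op = {}, 0, None
    for op, dest, srcs in schedule:
        # Issue in order, but no earlier than all sources are ready.
        t = max(t + 1, 0, *(ready.get(s, 0) for s in srcs))
        if dest is not None:
            ready[dest] = t + 1 + LATENCY[op]
        last_op = op
    return t + 1 if last_op == "BNE" else t

unscheduled = [
    ("LD",   "F0", ["R1"]),
    ("ADDD", "F4", ["F0", "F2"]),
    ("SD",   None, ["F4", "R1"]),
    ("SUBI", "R1", ["R1"]),
    ("BNE",  None, ["R1", "R2"]),
]
scheduled = [
    ("LD",   "F0", ["R1"]),
    ("SUBI", "R1", ["R1"]),
    ("ADDD", "F4", ["F0", "F2"]),
    ("BNE",  None, ["R1", "R2"]),
    ("SD",   None, ["F4", "R1"]),  # sits in the branch delay slot
]
print(cycles(unscheduled), cycles(scheduled))  # → 10 6
```

The model issues one instruction per cycle in program order, delaying each until its operands are ready, which matches the 10-cycle and 6-cycle schedules on this slide.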
33 Loop Unrolling
- Four copies of the loop body (dropping the intermediate SUBI/BNE pairs and adjusting the offsets) give the four-iteration code:
- Loop: LD F0, 0(R1)
- ADDD F4, F0, F2
- SD F4, 0(R1)
- LD F6, -8(R1)
- ADDD F8, F6, F2
- SD F8, -8(R1)
- LD F10, -16(R1)
- ADDD F12, F10, F2
- SD F12, -16(R1)
- LD F14, -24(R1)
- ADDD F16, F14, F2
- SD F16, -24(R1)
- SUBI R1, R1, 32
- BNE R1, R2, Loop
- Assumption: R1 is initially a multiple of 32, i.e. the number of loop iterations is a multiple of 4.
34 Loop Unroll and Schedule
- Unrolled, unscheduled:
- Loop: LD F0, 0(R1)
- stall
- ADDD F4, F0, F2
- stall
- stall
- SD F4, 0(R1)
- LD F6, -8(R1)
- stall
- ADDD F8, F6, F2
- stall
- stall
- SD F8, -8(R1)
- LD F10, -16(R1)
- stall
- ADDD F12, F10, F2
- stall
- stall
- SD F12, -16(R1)
- LD F14, -24(R1)
- (the fourth copy follows the same pattern)
- 28 clock cycles, or 7 per iteration. Can we minimize further?
- Scheduled:
- Loop: LD F0, 0(R1)
- LD F6, -8(R1)
- LD F10, -16(R1)
- LD F14, -24(R1)
- ADDD F4, F0, F2
- ADDD F8, F6, F2
- ADDD F12, F10, F2
- ADDD F16, F14, F2
- SD F4, 0(R1)
- SD F8, -8(R1)
- SD F12, -16(R1)
- SUBI R1, R1, 32
- BNE R1, R2, Loop
- SD F16, 8(R1)
- No stalls! 14 clock cycles, or 3.5 per iteration. Can we minimize further?
35 Summary
- Original iteration: 10 cycles
- Scheduling: 6 cycles
- Unrolling: 7 cycles per iteration
- Unrolling + scheduling: 3.5 cycles per iteration (no stalls)
36 Multiple Issue
- Multiple issue is the ability of the processor to start more than one instruction in a given cycle.
- Superscalar processors
- Very Long Instruction Word (VLIW) processors
37 A Modern Processor
(Diagram: multiple issue across the front-end and back-end of the pipeline.)
38 1990s: Superscalar Processors
- Bottleneck: CPI > 1
- Limit on scalar performance (single instruction issue)
- Hazards
- Superpipelining? Diminishing returns (hazards + overhead)
- How can we make CPI = 0.5?
- Multiple instructions in every pipeline stage (superscalar):
-        1   2   3   4   5   6   7
- Inst0  IF  ID  EX  MEM WB
- Inst1  IF  ID  EX  MEM WB
- Inst2      IF  ID  EX  MEM WB
- Inst3      IF  ID  EX  MEM WB
- Inst4          IF  ID  EX  MEM WB
- Inst5          IF  ID  EX  MEM WB
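For the ideal two-wide pipeline above, the CPI approaches 0.5 as the instruction count grows. A back-of-the-envelope sketch (my model; it assumes no hazards, so pairs of instructions move through the five stages together):

```python
STAGES = 5   # IF ID EX MEM WB
WIDTH = 2    # instructions started per cycle

def completion_cycle(i):
    """Cycle in which instruction i (0-based) leaves WB,
    assuming a hazard-free two-wide five-stage pipeline."""
    return i // WIDTH + STAGES

n = 1000
print(completion_cycle(n - 1) / n)  # → 0.504 (CPI approaches 0.5)
```

The fixed pipeline-fill cost of 5 cycles is amortized over all instructions, so cycles-per-instruction tends to 1/WIDTH.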
39 Elements of Advanced Superscalars
- High performance instruction fetching
- Good dynamic branch and jump prediction
- Multiple instructions per cycle, multiple branches per cycle?
- Scheduling and hazard elimination
- Dynamic scheduling
- Not necessarily: the Alpha 21064 and Pentium were statically scheduled
- Register renaming to eliminate WAR and WAW hazards
- Parallel functional units, paths/buses/multiple register ports
- High performance memory systems
- Speculative execution
40 SS + DS + Speculation
- Superscalar + dynamic scheduling + speculation
- Three great tastes that taste great together
- CPI > 1?
- Overcome with superscalar
- Superscalar increases hazards
- Overcome with dynamic scheduling
- RAW dependences still a problem?
- Overcome with a large window
- Branches a problem for filling a large window?
- Overcome with speculation
41 The Big Picture
(Diagram: static program → fetch and branch prediction → issue → execution → reorder and commit.)
42 Superscalar Microarchitecture
(Diagram: instruction cache and pre-decode fill an instruction buffer; decode/rename/dispatch sends instructions to a floating-point instruction buffer and an integer/address instruction buffer; these feed the functional units and data cache through the floating-point and integer register files; a memory interface and the reorder-and-commit logic complete the picture.)
43 Register Renaming Methods
- First method:
- Physical register file vs. logical (architectural) register file
- A mapping table associates a physical register with the current value of each logical register
- Uses a free list of physical registers
- The physical register file is bigger than the logical register file
- Second method:
- Physical register file is the same size as the logical one
- Also uses a buffer with one entry per instruction: the reorder buffer.
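The first method (mapping table plus free list) can be sketched in a few lines of Python; the class and method names are mine:

```python
class Renamer:
    """Mapping-table renamer: logical registers map to physical
    registers, with fresh ones taken from a free list."""

    def __init__(self, n_logical, n_physical):
        # Logical register i starts out mapped to physical i.
        self.table = list(range(n_logical))
        self.free = list(range(n_logical, n_physical))

    def rename(self, dest, srcs):
        """Rename one instruction: sources read the current mapping;
        the destination gets a fresh physical register."""
        phys_srcs = [self.table[s] for s in srcs]
        phys_dest = self.free.pop(0)  # a real pipeline stalls if empty
        self.table[dest] = phys_dest
        return phys_dest, phys_srcs

r = Renamer(n_logical=8, n_physical=16)
# add r3, r3, 4: the source reads the old mapping (physical 3),
# the destination gets a fresh physical register (8).
print(r.rename(3, [3]))  # → (8, [3])
```

Renaming the same instruction again would read physical 8 and write physical 9, which is exactly how WAR and WAW hazards disappear: every write gets its own physical register.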
44 Register Renaming Example
(The unrolled, scheduled loop and its 14-cycle schedule from slide 34, repeated; each unrolled copy of the body writes different registers: F0/F4, F6/F8, F10/F12, F14/F16.)
45 Register Renaming: First Method
(Diagram: for "add r3, r3, 4", the mapping table and free list before and after renaming; r3's entry in the mapping table is updated to a physical register taken from the free list.)
46 Superscalar Processors
- Issue a varying number of instructions per clock
- Scheduling: static (by the compiler) or dynamic (by the hardware)
- A superscalar has a varying number of instructions/cycle (1 to 8), scheduled by the compiler or by hardware (Tomasulo).
- Examples: IBM PowerPC, Sun UltraSparc, DEC Alpha, HP 8000
47 More Realistic HW: Register Impact
- Effect of limiting the number of renaming registers
(Chart: IPC versus number of renaming registers; FP: 11 - 45, Integer: 5 - 15.)
48 Reorder Buffer
- Reserve an entry at the tail when an instruction is dispatched
- Place data in the entry when execution finishes
- Remove from the head when the instruction completes
- Bypass results to other instructions when needed
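The four operations above can be sketched as a small circular-buffer class; the entry layout and names are mine:

```python
from collections import deque

class ReorderBuffer:
    """Minimal reorder buffer: dispatch at the tail, finish out of
    order, commit in program order from the head."""

    def __init__(self, size):
        self.size = size
        self.entries = deque()  # head = oldest instruction

    def dispatch(self, dest):
        """Reserve an entry at the tail when dispatched."""
        if len(self.entries) == self.size:
            raise RuntimeError("ROB full: dispatch must stall")
        entry = {"dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def finish(self, entry, value):
        """Place data in the entry when execution finishes."""
        entry["value"], entry["done"] = value, True

    def commit(self):
        """Remove from the head when complete, in program order."""
        if self.entries and self.entries[0]["done"]:
            return self.entries.popleft()
        return None  # head not done yet: nothing commits

rob = ReorderBuffer(size=4)
e1 = rob.dispatch("r3")
e2 = rob.dispatch("r4")
rob.finish(e2, 7)             # younger instruction finishes first...
print(rob.commit())           # → None (head e1 still executing)
rob.finish(e1, 42)
print(rob.commit()["value"])  # → 42 (commits in program order)
```

The key property shown: even though the younger instruction finished first, nothing leaves the buffer until the head is done, so architectural state is always updated in program order.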
49 Register Renaming with a Reorder Buffer
(Diagram: before renaming, the instruction is "add r3, r3, 4"; r3's latest value is pending in reorder-buffer entry 6, so the instruction becomes "add rob8, rob6, 4", with entry 8 newly allocated at the tail to receive the result.)
50 Instruction Buffers
(Same microarchitecture diagram as slide 42; the focus here is the instruction buffers.)
51 Issue Buffer Organization
- a) Single, shared queue: no out-of-order, no renaming
- b) Multiple queues, one per instruction type: no out-of-order inside queues; queues issue out of order
52 Issue Buffer Organization
- c) Multiple reservation stations (one per instruction type, or one big pool)
- No FIFO ordering
- When operands are ready and hardware is available, execution starts
- Proposed by Tomasulo
- (Fed from instruction dispatch)
53 Typical Reservation Station
- Fields: operation | source 1 | data 1 | valid 1 | source 2 | data 2 | valid 2 | destination
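The entry layout above can be written as a small structure; the field names follow the slide, while the ready test and the example values are my additions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    """One reservation-station entry, following the slide's fields."""
    operation: str
    source1: int              # tag of the producer of operand 1
    data1: Optional[int]      # operand value, once captured
    valid1: bool              # True when data1 holds the real value
    source2: int
    data2: Optional[int]
    valid2: bool
    destination: int          # tag under which the result broadcasts

    def ready(self) -> bool:
        # Execution can start once both operands are valid
        # (and a functional unit is available).
        return self.valid1 and self.valid2

rs = ReservationStation("ADDD", source1=6, data1=None, valid1=False,
                        source2=0, data2=5, valid2=True, destination=8)
print(rs.ready())  # → False (still waiting on source 1)
```

When the producer tagged `source1` broadcasts its result, the entry captures the value, sets `valid1`, and becomes eligible to issue, which is the heart of Tomasulo's scheme.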
54 Memory Hazard Detection Logic