Title: Introduction to Energy Aware Computing
1. Introduction to Energy Aware Computing
- Henk Corporaal
- www.ics.ele.tue.nl/heco
- ASCI Winterschool on Energy Aware Computing
- Soesterberg, March 2012
2. Intel trends
- transistor count keeps following Moore's law
- but frequency and performance per core do not: frequency flattens out around 3 GHz, power around 100 W
3. Types of compute systems
4. A 20nm scenario (high-end processor)
- This means:
- a 2 cm² processor consumes 10 kW
- a bound of 100 W allows only 1% of it to be active ⇒ dark silicon
5. Intel's answer: a 48-core x86
6. Power versus Energy
- Power: P ≈ α·f·C·Vdd²
- α: switching activity (< 1), f: frequency, C: switching capacitance, Vdd: supply voltage
- heat / temperature constraint
- wear-out
- peak power delivery constraint
- Energy: E = P·t, or, for time-varying P, E = ∫P(t)·dt
- battery life
- cost: electricity bill
- Note: lowering f reduces P, but not necessarily E; E may even increase due to leakage (static power dissipation), see the short derivation below
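To make the note above concrete, here is a short derivation; the symbols N_cyc (cycle count of a fixed task) and P_leak (static leakage power) are introduced here for illustration and are not on the slide. The dynamic energy of the task is independent of f, while the leakage energy grows when the task takes longer:

    E_{dyn} = P \cdot t = \alpha f C V_{dd}^2 \cdot \frac{N_{cyc}}{f} = \alpha C V_{dd}^2 \, N_{cyc}

    E_{tot} = \alpha C V_{dd}^2 \, N_{cyc} + P_{leak} \cdot \frac{N_{cyc}}{f}

So lowering f alone leaves E_dyn unchanged and inflates the leakage term; only lowering Vdd (which a lower f enables) reduces the energy per operation.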
7. What's happening at the top?
8. Top500 nr. 1
- 1st: K Computer
- 10.51 Petaflop/s on Linpack
- 705,024 SPARC64 cores (8 per die, 45 nm, Fujitsu design)
- Tofu interconnect (6-D torus)
- 12.7 MegaWatt
9. Top500 nr. 2
- 2nd: the Chinese Tianhe-1A
- 2.57 Petaflop/s
- 186,368 cores (Xeon + NVIDIA processors)
- 4.0 MegaWatt
10. What's happening at the low end?
- March 14, 2012: ARM announced the Cortex-M0
- "The 32-bit Cortex-M0 consumes just 9 µA/MHz on a low-cost 90nm LP process, around one third of the energy of any 8- or 16-bit processor available today, while delivering significantly higher performance"
- 2-stage pipeline
- optional 1-cycle MUL
11. Low end: how much energy is in the air?
Rabaey, 2009
12. Computational efficiency (MOPS/mW): what do we need?
This means 1 pJ / operation, or 1 TeraOp/Watt
Woh et al., ISCA 2009
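These two numbers are the same figure of merit; the one-line unit conversion (nothing assumed beyond unit algebra):

    1\,\mathrm{pJ/op} \;\Leftrightarrow\; 10^{12}\,\mathrm{op/J} \;\Leftrightarrow\; 10^{12}\,\mathrm{op/s\ per\ Watt} \;=\; 1\,\mathrm{TeraOp/s\ per\ Watt}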
13. Green500: Top 10 in green supercomputing
14. Green500 evolution
- 2008 best result: 536 MFlops/Watt ⇒ 1.87 nJ / floating-point operation
- 2009 best result: 723 MFlops/Watt ⇒ 1.38 nJ / floating-point operation
- Cell cluster, ranking 110 in the Top500
- 2010 best result: 1684 MFlops/Watt ⇒ 594 pJ / floating-point operation
- IBM BlueGene/Q prototype 1, ranking 101 in the Top500, peak performance 65 TFlops; see also http://www.theregister.co.uk/2010/11/22/ibm_blue_gene_q_super/
- 2011 best result: 2097 MFlops/Watt ⇒ 476 pJ / floating-point operation
- IBM BlueGene/Q prototype 2
- power consumption 41 kW / peak 85 TFlop/s
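How the MFlops/Watt figures map onto energy per operation, worked out for the 2008 entry (the other years follow the same way):

    \frac{1\,\mathrm{W}}{536\times 10^{6}\,\mathrm{Flop/s}} \;=\; \frac{1\,\mathrm{J}}{536\times 10^{6}\,\mathrm{Flop}} \;\approx\; 1.87\times 10^{-9}\,\mathrm{J/Flop} \;=\; 1.87\,\mathrm{nJ/Flop}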
15. Energy cost
- At about $1M per MW per year, energy costs are substantial
- 1 petaflop in 2010 uses 3 MW
- 1 exaflop in 2018 is possible in 200 MW with the usual scaling
- 1 exaflop in 2018 at 20 MW is the DOE (Dept. of Energy) target
- see also the MontBlanc EU project, www.montblanc-project.eu
- goal: 200 PFlops for 10 MWatt in 2017
(Graph: normal vs. desired scaling of power with performance; from Katy Yelick, Berkeley)
16. Reducing power @ all design levels
- Algorithmic level
- Compiler level
- Architecture level
- Organization level
- Circuit level
- Silicon level
- Important concepts:
- Lower Vdd and frequency (even if errors occur) / dynamically adapt Vdd and frequency
- Reduce circuitry
- Exploit locality
- Reduce switching activity, glitches, etc.
P ≈ α·f·C·Vdd²
E = ∫P·dt ⇒ E/cycle ≈ α·C·Vdd²
17. Algorithmic level
- The best indicator for energy is ... the number of cycles
- Try alternative algorithms with lower complexity
- e.g. quick-sort, O(n log n), instead of bubble-sort, O(n²) (see the counting sketch below)
- but be aware of the 'constant': O(n log n) means c·(n log n)
- Heuristic approach
- Go for a good solution, not the best!!
Biggest gains at this level!!
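A minimal sketch of why the cycle (operation) count is the quantity to optimize: the snippet below just counts the dominant operations, comparisons, for bubble-sort versus a plain quick-sort on the same scrambled input. The function names, the counter and the input generator are illustrative assumptions, not from the slides; the point is only that the O(n²) vs. O(n log n) gap shows up directly as work done, and hence as energy.

#include <stdio.h>

static long cmps;                        /* comparison count: a stand-in for cycles/energy */

static void bubble_sort(int *a, int n) {
    for (int i = 0; i < n - 1; i++)
        for (int j = 0; j < n - 1 - i; j++) {
            cmps++;
            if (a[j] > a[j + 1]) { int t = a[j]; a[j] = a[j + 1]; a[j + 1] = t; }
        }
}

static void quick_sort(int *a, int lo, int hi) {   /* simple Lomuto-partition quick-sort */
    if (lo >= hi) return;
    int p = a[hi], i = lo;
    for (int j = lo; j < hi; j++) {
        cmps++;
        if (a[j] < p) { int t = a[i]; a[i] = a[j]; a[j] = t; i++; }
    }
    int t = a[i]; a[i] = a[hi]; a[hi] = t;
    quick_sort(a, lo, i - 1);
    quick_sort(a, i + 1, hi);
}

int main(void) {
    enum { N = 1000 };
    int a[N], b[N];
    for (int i = 0; i < N; i++) a[i] = b[i] = (i * 7919) % N;   /* scrambled input */

    cmps = 0; bubble_sort(a, N);        printf("bubble-sort: %ld comparisons\n", cmps);
    cmps = 0; quick_sort(b, 0, N - 1);  printf("quick-sort:  %ld comparisons\n", cmps);
    return 0;
}

For N = 1000 the first count is on the order of 500,000, the second on the order of 10,000: the algorithm-choice knob dwarfs anything a lower level can recover.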
18. Compiler level
- Source-to-source transformations
- loop transformations to improve locality
- Strength reduction
- e.g. replace Const × A by adds and shifts (see the sketch below)
- Replace floating point by fixed point
- Reduce register pressure / the number of accesses to the register file
- Use software bypassing
- Scenarios: current workloads are highly dynamic
- Determine and predict execution modes
- Group execution modes into scenarios
- Perform special optimizations per scenario
- DVFS: Dynamic Voltage and Frequency Scaling
- More advanced loop optimizations
- Reorder instructions to reduce bit transitions
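A minimal sketch of the strength-reduction item above: a multiplication by a known constant rewritten into shifts and adds. The constant 10 and the function names are illustrative assumptions, not from the slides; compilers apply this automatically for suitable constants.

#include <stdint.h>

/* Reference version: a full constant multiply. */
static inline uint32_t times10_mul(uint32_t a)   { return a * 10u; }

/* Strength-reduced version: 10*a = 8*a + 2*a = (a << 3) + (a << 1).
 * Shifts and adds are typically cheaper, and lower energy, than a full multiplier pass. */
static inline uint32_t times10_shift(uint32_t a) { return (a << 3) + (a << 1); }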
19. Architecture level
- Going parallel
- Going heterogeneous
- tune your architecture, exploit SFUs (special function units)
- trade off flexibility / programmability / genericity against efficiency
- Add local memories
- prefer a scratchpad instead of a cache
- Cluster FUs and register files (see next slide)
- Reduce bit-width
- sub-word parallelism (SIMD)
20. Organization (micro-arch.) level
- Enabling Vdd reduction
- Pipelining
- a cheap way of getting parallelism
- enables a lower frequency ⇒ lower Vdd
- Note 1: don't pipeline if you don't need the performance
- Note 2: don't exaggerate (like the 31-stage Pentium 4)
- Reduce register traffic
- avoid unnecessary reads and writes
- make bypass registers visible
21. Circuit level
- Clock gating
- Power gating
- Multiple Vdd modes
- Reduce glitches by balancing digital paths
- Exploit zeros
- Special SRAM cells
- normal SRAM cannot scale below Vdd ≈ 0.7 - 0.8 Volt
- Razor method: replay
- Allow errors and add redundancy to architecturally invisible structures
- branch predictor
- caches
- ... and many more ...
22. Silicon level
- Higher Vt (threshold voltage)
- Back-biasing control
- see the thesis of Maurice Meijer (2011)
- SOI (Silicon on Insulator)
- the silicon junction sits above an electrical insulator (silicon dioxide)
- lowers parasitic device capacitance
- Better transistors: FinFET
- multi-gate
- reduced leakage (off-state current)
- ... and many more
Wait for the lectures of Pineda on Friday
23. Let's detail a few examples
- Algorithmic level
- Exploiting locality
- Compiler level
- Software bypassing
- Architecture level
- Going parallel
- Organization level
- Razor
- Circuit level
- Exploiting zeros in a multiplier
- Silicon level
- Sub-threshold operation
24. Algorithm level: exploiting locality
(Figure: a generic platform with a four-level storage hierarchy, Level 1 to Level 4: CPUs and HW accelerators with I-cache, D-cache and local memories on chip; an on-chip L2 cache and on-chip busses behind a bus interface; off-chip main memory behind a bus bridge; and disks on a SCSI bus.)
25. Data transfer and storage power
26. Loop transformations
- Loop transformations
- improve regularity of accesses
- improve temporal locality: production → consumption
- Expected influence
- reduce temporary storage and (anticipated) background storage
- Work horse: loop merging
- typically many enabling transformations are needed before you can merge loops
27. Loop transformations: merging

for (i=0; i<N; i++) B[i] = f(A[i]);
for (j=0; j<N; j++) C[j] = f(B[j], A[j]);

becomes

for (i=0; i<N; i++) { B[i] = f(A[i]); C[i] = f(B[i], A[i]); }

Locality improved!
28. Loop transformations
Example:

for (i=0; i<N; i++) B[i] = f(A[i]);
for (i=0; i<N; i++) C[i] = g(B[i]);

becomes

for (i=0; i<N; i++) { B[i] = f(A[i]); C[i] = g(B[i]); }

(Annotations from the original figure: N cyc. / 2N cyc. / N cyc.; the unmerged version needs 2 background memory ports, the merged version only 1 background + 1 foreground port.)
29. Loop transformations
Example: an enabling transformation is required first (illustrated in the sketch below)
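The slide's own example is a figure and could not be reproduced here; as an illustration only, the sketch below shows one typical enabling transformation, loop bumping (shifting the iteration space), in the same style as the previous slides. Directly merging the two loops would be illegal because iteration i would read B[i+1] before it has been produced; after shifting the consumer loop by one, merging becomes legal and B is consumed right after it is produced.

/* Before: direct merging is illegal, iteration i would need B[i+1],
 * which the first loop has not produced yet at that point. */
for (i = 0; i < N;     i++) B[i] = f(A[i]);
for (j = 0; j < N - 1; j++) C[j] = g(B[j + 1]);

/* Enabling transformation (loop bumping): shift the second loop by one. */
for (i = 0; i < N; i++) B[i] = f(A[i]);
for (j = 1; j < N; j++) C[j - 1] = g(B[j]);

/* Now merging is legal; B[i] is consumed immediately after being produced. */
B[0] = f(A[0]);
for (i = 1; i < N; i++) { B[i] = f(A[i]); C[i - 1] = g(B[i]); }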
30. Compiler level: software bypassing
- The register file consumes a considerable amount of the total processor power
- > 15% in a simple 5-stage RISC (2R1W, 32-bit × 32 registers)
- Even more in VLIW and SIMD, as the size and the number of ports increase
31. Reducing RF accesses
- Many RF accesses can be eliminated
- Bypass read: read operands from the bypass network instead of the RF
- Writeback elimination: skip the writeback if the variable is dead
- Operand sharing: the same variable on the same port only needs to be read from the RF once
- In the slide's example, only 3 RF reads are actually needed.
32. Move-Pro: an improved TTA
- The original TTA has a few drawbacks
- Separate scheduling of operand moves may increase circuit activity
- The trigger port introduces extra scheduling constraints
- TTA code density is likely to be lower compared to RISC/VLIW
- May need more slots for the same performance
- Increases instruction fetching energy
33. Compiler framework
- Low-level IR
- similar to RISC assembly
- with extra metadata for the backend
- Local instruction scheduling
34. Scheduling example
- Direct translation results in bad code density
- More instructions also means worse performance
- Bypassing improves code density and reduces RF accesses
- Performance and energy consumption are also improved
Software bypassing + scheduling
35. Graph-based resource model
- Nodes represent resources
- Resources are duplicated for each cycle
- Edges represent connectivity or storage
- Each node has a capacity and a cost
- Cost is determined by the power model
- Instruction cost is taken into account
36. Energy results compared to RISC
- 3 configurations
- R1: RISC, 2R1W RF
- M2: 2-issue MOVE-Pro, 2R1W RF
- M3: 3-issue MOVE-Pro, 2R1W RF
- 8KB (32-bit) / 9KB (48-bit) I-Mem
- RF energy saving > 70%
- No loss in instruction-memory energy
- R1 and M2 have the same performance
37. Architecture level: going parallel
- Running into the
- frequency wall
- ILP wall
- memory wall
- energy wall
- Chip area is the enabler: Moore's law continues well below 22 nm
- What to do with all this area?
- Multiple processors fit easily on a single die
- Application demands
- Cost effective
- Reuse: just connect existing processors or processor cores
- Low power: parallelism may allow lowering Vdd
38. Low power through parallelism
- Sequential processor
- switching capacitance C
- frequency f
- voltage V
- P1 ≈ α·f·C·V²
- Parallel processor (two times the number of units)
- switching capacitance 2C
- frequency f/2
- voltage V' < V
- P2 ≈ α·(f/2)·2C·V'² = α·f·C·V'² < P1
- Check yourself whether this works for pipelining as well! (see the sketch below)
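One way to work the pipelining case out; the symbols C_reg (capacitance added by the extra pipeline registers) and V' are introduced here for illustration and are not on the slide. Pipelining keeps f and the work per cycle, but halves the logic depth per stage, so the same clock period can be met at a reduced supply V' < V:

    P_{seq}  \approx \alpha\, f\, C\, V^2

    P_{pipe} \approx \alpha\, f\, (C + C_{reg})\, V'^2 \;<\; P_{seq} \quad\text{provided}\quad (C + C_{reg})\, V'^2 < C\, V^2

So the saving holds as long as the Vdd reduction outweighs the capacitance of the added pipeline registers.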
39. 4-D model of parallel architectures
- How to speed up your favorite processor?
- Super-pipelining
- Powerful instructions
- MD-technique
- multiple data operands per operation
- MO-technique
- multiple operations per instruction
- Multiple instruction issue
- Single stream: Superscalar
- Multiple streams
- Single core, multiple threads: Simultaneous Multi-Threading
- Multiple cores
40. Architecture methods 1: Pipelined execution of instructions
IF: Instruction Fetch, DC: Instruction Decode, RF: Register Fetch, EX: Execute instruction, WB: Write Result Register
(Diagram: a simple 5-stage pipeline with 4 instructions flowing through the stages over 8 cycles.)
- Purpose of pipelining
- Reduce the number of gate levels in the critical path
- Reduce CPI close to one (instead of a large number for the multicycle machine)
- More efficient hardware
- Some bad news: hazards, i.e. pipeline stalls
- Structural hazards: add more hardware
- Control hazards, branch penalties: use branch prediction
- Data hazards: bypassing required
41. Architecture methods 1: Super-pipelining
- Superpipelining
- Split one or more of the critical pipeline stages
- Superpipelining degree S:
S(architecture) = Σ_{Op ∈ I_set} f(Op) · lt(Op)
where f(Op) is the frequency of operation Op and lt(Op) is the latency of operation Op
42. Architecture methods 2: Powerful instructions (1)
- MD-technique
- Multiple data operands per operation
- SIMD: Single Instruction Multiple Data

Vector instruction:
for (i=0; i<64; i++) c[i] = a[i] + 5*b[i];    or:    c = a + 5*b

Assembly:
set   vl,64
ldv   v1,0(r2)
mulvi v2,v1,5
ldv   v1,0(r1)
addv  v3,v1,v2
stv   v3,0(r3)
43. Architecture methods 2: Powerful instructions (1)
- SIMD computing
- All PEs (Processing Elements) execute the same operation
- Typical mesh or hypercube connectivity
- Exploits the data locality of e.g. image processing applications
- Dense encoding (few instruction bits needed)
44. Architecture methods 2: Powerful instructions (1)
- Sub-word parallelism
- SIMD on a restricted scale
- Used for multimedia instructions
- Many processors support this
- Examples
- MMX, SSE, SUN VIS, HP MAX-2, AMD K7/Athlon 3DNow!, TriMedia II
- Example: Σ_{i=1..4} |a_i - b_i| (see the sketch below)
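A minimal sketch of the sum-of-absolute-differences example above, written with SSE2 intrinsics as one concrete instance of sub-word parallelism. The function name and the choice of 16 bytes per call (rather than the 4 elements on the slide) are illustrative assumptions; the point is that the whole reduction maps onto a single psadbw-style instruction instead of a loop of subtracts, absolute values and adds.

#include <emmintrin.h>   /* SSE2 */
#include <stdint.h>

/* Sum of absolute differences over 16 unsigned bytes, computed sub-word parallel. */
static inline unsigned sad16(const uint8_t *a, const uint8_t *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i s  = _mm_sad_epu8(va, vb);     /* two partial sums, one per 64-bit half */
    return (unsigned)(_mm_extract_epi16(s, 0) + _mm_extract_epi16(s, 4));
}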
45. Architecture methods 2: Powerful instructions (2)
- MO-technique: multiple operations per instruction
- Two options
- CISC (Complex Instruction Set Computer)
- this is what we did in the 'old' days of microcoded processors
- VLIW (Very Long Instruction Word)
VLIW instruction example: one instruction consisting of five fields, one per FU:
FU 1: sub r8, r5, 3 | FU 2: and r1, r5, 12 | FU 3: mul r6, r5, r2 | FU 4: ld r3, 0(r5) | FU 5: bnez r5, 13
46. VLIW architecture: central register file
(Figure: a single central register file shared by nine execution units, grouped into three issue slots.)
Q: How many ports does the register file need for n-issue?
47. Clustered VLIW
- Clustering: splitting up the VLIW data path; the same can be done for the instruction path
- Exploit locality @ Level 0, for instructions and data
48. Architecture methods 3: Multiple instruction issue (per cycle)
- Who guarantees semantic correctness?
- i.e., can these instructions be executed in parallel?
- User: specifies multiple instruction streams
- Multi-processor: MIMD (Multiple Instruction, Multiple Data)
- HW: run-time detection of ready instructions
- Superscalar, single instruction stream
- Compiler: compile into a dataflow representation
- Dataflow processors
- Multi-threaded processors
49. Four-dimensional representation of the architecture design space <I, O, D, S>
Mpar = I × O × D × S
You should exploit this amount of parallelism!!!
50. Examples of many-core / PE architectures
- SIMD
- Xetal (320 PEs), IMAP (128 PEs), AnySP (Michigan Univ.)
- VLIW
- ADRES, TriMedia
- more dynamic: Itanium (static scheduling, run-time mapping), TRIPS/EDGE (run-time scheduling)
- Multi-threaded
- idea: hide long latencies
- Denelcor HEP (1982), SUN Niagara (2005)
- Multi-processor
- RaW, PicoChip, Intel/AMD, GRID, farms, ...
- Hybrid, like Imagine, GPUs, XC-Core, Cell
- actually, most are hybrid!!
51. In need of TeraFlops on your desk?
- 4 × Nvidia GTX 295
- 1920 PEs
- 7 TeraFlop
52. How do GPUs spend their die area?
GPUs are designed to match the workload of 3D graphics.
- Nvidia GTX 280
- most area is spent on processing
- relatively small on-chip memories
- huge off-chip memory latencies
J. Roca et al., "Workload Characterization of 3D Games", IISWC 2006; T. Mitra et al., "Dynamic 3D Graphics Workload Characterization and the Architectural Implications", Micro 1999
53. How do CPUs spend their die area?
CPUs are designed for low latency instead of high throughput.
Die photo of Intel Penryn (source: Intel)
54. Organization level: Razor
- Use a shadow latch clocked with a delayed clock
- Reduce Vdd as far as possible
- Detect an error: main FF ≠ shadow FF
- Correct the error: e.g. replay the instruction in the microprocessor
55. Razor used in a microprocessor
56. Razor: energy reduction
57. Circuit level: exploit the actual data width
- Multiplication is a very basic and widely-used operation
- Multipliers are usually among the most power-hungry units in many designs
- When operating on data of smaller width (e.g., a 16-bit multiplier processing 8-bit data), we would like the energy consumption to be close to that of a short-width multiplier (see the sketch below)
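A minimal software analogue of that wish; the function name, the 16x16 ⇒ 32-bit multiplier and the 8-bit threshold are illustrative assumptions, not the hardware design of the following slides. The idea is simply to detect that both operands fit in the narrow range and steer them to a cheaper narrow multiply; in hardware the same test would gate off the unused upper part of the multiplier array.

#include <stdint.h>

/* Hypothetical width-aware multiply: use the cheap 8x8 path when both operands
 * are effectively 8-bit values, the full 16x16 path otherwise. */
static inline int32_t mul16_width_aware(int16_t a, int16_t b)
{
    if (a >= -128 && a < 128 && b >= -128 && b < 128) {
        /* effective data width is 8 bits: a narrow multiplier (or the gated-off
         * upper half of the array) is sufficient */
        return (int32_t)(int8_t)a * (int8_t)b;
    }
    return (int32_t)a * b;               /* full-width multiply */
}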
58. Motivation
(Chart: normalized energy consumption per operation of signed multipliers of different sizes; Baugh-Wooley multiplier with Wallace tree.)
59. Unsigned data
(Chart: normalized energy consumption per operation of unsigned multipliers of different sizes.)
- Unlike signed multipliers, unsigned multipliers are naturally data-width aware
60. Signed multiplier: sign-magnitude
(Chart: normalized energy consumption per operation of signed multipliers of different sizes, sign-magnitude format.)
61. Signed multiplier: sign-magnitude
- A sign-magnitude multiplier is essentially an unsigned multiplier
- with sign-bit calculation logic (XOR)
- However, the sign-magnitude format has drawbacks
- It requires a different set of rules for arithmetic computation
- e.g., to add two numbers, we have to choose addition or subtraction depending on the sign bits of the two numbers
- Zero has two representations (+0: 0000 and -0: 1000)
62. Data-width-aware multiplier design
Architecture choices when the effective data input is only half width
63. Silicon level: sub/near-threshold JPEG accelerator
Pu Yu et al., ISSCC 2009
< 1 pJ/op at 400 mV, 65nm CMOS
64. Trends: low power, how far can we scale Vdd?
- Subthreshold JPEG encoder
- Vdd = 0.4 - 1.2 Volt
Pu Yu, ISSCC '09
65. A 280mV-to-1.2V IA-32 Processor in 32nm CMOS
Intel, ISSCC '12
66. Exercise: how far can we go? Let's consider a massively-parallel SIMD
(Block diagram: control logic with an instruction memory driving PE 0 ... PE N through an interconnect, backed by a shared data/frame memory.)
- SIMD: low-power architecture
- massively parallel: a large number of PEs, high performance
67. Xetal-II
68. Xetal-II processor details
- 600 mW
- 90 nm CMOS
- 53.5 GOPS (arithmetic only) @ 84 MHz
- The most computationally efficient programmable silicon in 2007
Kleihorst et al., 2007
69. Xetal-II block diagram
70. Xetal-II energy breakdown at 1.2V
- 125 MHz, 400 mW @ 65nm
- 25 ins./pixel (5x5 convolution)
- 240.8 pJ/pixel ⇒ ~10 pJ/ins. ⇒ ~5 pJ/op
- However: 69% is consumed by the FM (frame memory)!
71. Hybrid memory architecture
- Exploiting data locality
- The scratchpad memory can be
- bypassed and clock-gated
- 15% area overhead to the tile
- ACCU register
- short-term data
- Scratchpad Memory (SM, 32 entries)
- intermediate-term data
- Frame Memory (FM)
- long-term data
72. Energy breakdown @ 1.2 V
- SM is realized by commercial SRAM here
- Requires 1 extra instruction/px to implement the 5x5 filter
- SM and PEs dominate the energy consumption
- 151.9 pJ/pixel, a 1.6x reduction
- ~3 pJ/op
Implementing the SM with standard cells reduces the SM energy by 2x ⇒ a 2.1x reduction w.r.t. Xetal-II, ~2.4 pJ/op
73. Energy breakdown @ low Vdd
- SM realized by standard cells
- FM by commercial SRAM
- Optimal point: FM at 0.7V, SM at 0.38V, and PE at 0.42V
- 22.6 pJ/pixel, a 12.5x reduction; 0.45 pJ/16-bit op ⇒ 0.9 pJ/32-bit op
- Still delivering 0.7 GOPS
Sub-threshold scratchpad memory with super-threshold frame memory
74. ICE curve extended with Vdd scaling
1 pJ/op
75. Can we match the human brain???
- Performance: 100 billion (10^11) neurons × 1000 (10^3) connections/neuron × 200 (2·10^2) calculations per second per connection ≈ 2·10^16 calculations per second (worked out below)
- Memory: 100 billion (10^11) neurons × 1000 (10^3) connections/neuron × 10 bytes (information about connection strength, address of the output neuron, type of synapse) = 10^15 bytes = 1 PB = 1000 TB
- How far off are we?
- The brain needs only 20 Watt
- ... and processors need MegaWatts
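The two back-of-the-envelope products above, written out with the slide's own numbers (no new data):

    10^{11}\ \text{neurons} \times 10^{3}\ \tfrac{\text{connections}}{\text{neuron}} \times 2\cdot 10^{2}\ \tfrac{\text{calc/s}}{\text{connection}} \;=\; 2\cdot 10^{16}\ \text{calc/s}

    10^{11} \times 10^{3} \times 10\ \text{bytes} \;=\; 10^{15}\ \text{bytes} \;=\; 1\ \text{PB}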
76. Blue Brain research
- A software replica of one column of the neocortex
- the cortex is 85% of the brain's total mass
- required for language, learning, memory and complex thought
- the essential first step to simulating the whole brain
- Next: include circuitry from other brain regions and
- eventually the whole brain.
77. Are we / is CMOS running out of options?
- Google yourself
- Reversible logic gates
- Adiabatic logic
- Nano tubes
- Graphene
- Bio / molecular / DNA computing
- Approximate computing
- Analog computing
- e.g. with only 9 transistors you can build a Gilbert multiplier [Mead 1989]
- Or: much better algorithms
78. Reading
- Low Power Design Essentials (book), Jan Rabaey, Springer, 2009
- DATE 2012 tutorial: "Design Methodology and Techniques in Production Low-Power SOC Designs", Kaijian Shi (Cadence), Thomas Buechner (IBM)