Title: Introduction to Energy Aware Computing
1. Introduction to Energy Aware Computing
- Henk Corporaal
- www.ics.ele.tue.nl/heco
- ASCI Winterschool on Energy Aware Computing
- Soesterberg, March 2012
2. Intel trends
- transistor count keeps following Moore's law
- but frequency and performance per core do not: frequency flattens out around 3 GHz, power around 100 W
3. Types of compute systems
4. A 20nm scenario (high-end processor)
- This means:
- a 2 cm² processor consumes 10 kW
- a bound of 100 W allows only 1% of it to be active ⇒ dark silicon
5. Intel's answer: a 48-core x86
6. Power versus Energy
- Power: P ≈ α·f·C·Vdd²
- α: switching activity (< 1), f: frequency, C: switching capacitance, Vdd: supply voltage
- heat / temperature constraint
- wear-out
- peak power delivery constraint
- Energy: E = P·t, or, for time-varying P, E = ∫P(t)·dt
- battery life
- cost: electricity bill
- Note: lowering f reduces P, but not necessarily E; E may even increase due to leakage (static power dissipation), see the short derivation below
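To make the note above concrete, here is a short derivation; the symbols N_cyc (cycle count of a fixed task) and P_leak (static leakage power) are introduced here for illustration and are not on the slide. The dynamic energy of the task is independent of f, while the leakage energy grows when the task takes longer:

    E_{dyn} = P \cdot t = \alpha f C V_{dd}^2 \cdot \frac{N_{cyc}}{f} = \alpha C V_{dd}^2 \, N_{cyc}

    E_{tot} = \alpha C V_{dd}^2 \, N_{cyc} + P_{leak} \cdot \frac{N_{cyc}}{f}

So lowering f alone leaves E_dyn unchanged and inflates the leakage term; only lowering Vdd (which a lower f enables) reduces the energy per operation.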
7. What's happening at the top?
8. Top500 nr. 1
- 1st: K Computer
- 10.51 Petaflop/s on Linpack
- 705,024 SPARC64 cores (8 per die, 45 nm, Fujitsu design)
- Tofu interconnect (6-D torus)
- 12.7 MegaWatt
9. Top500 nr. 2
- 2nd: the Chinese Tianhe-1A
- 2.57 Petaflop/s
- 186,368 cores (Xeon + NVIDIA processors)
- 4.0 MegaWatt
10. What's happening at the low end?
- March 14, 2012: ARM announced the Cortex-M0
- "The 32-bit Cortex-M0 consumes just 9 µA/MHz on a low-cost 90nm LP process, around one third of the energy of any 8- or 16-bit processor available today, while delivering significantly higher performance"
- 2-stage pipeline
- optional 1-cycle MUL
11. Low end: how much energy is in the air?
Rabaey, 2009
12. Computational efficiency (MOPS/mW): what do we need?
This means 1 pJ / operation, or 1 TeraOp/Watt
Woh et al., ISCA 2009
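These two numbers are the same figure of merit; the one-line unit conversion (nothing assumed beyond unit algebra):

    1\,\mathrm{pJ/op} \;\Leftrightarrow\; 10^{12}\,\mathrm{op/J} \;\Leftrightarrow\; 10^{12}\,\mathrm{op/s\ per\ Watt} \;=\; 1\,\mathrm{TeraOp/s\ per\ Watt}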
13. Green500: Top 10 in green supercomputing
14. Green500 evolution
- 2008 best result: 536 MFlops/Watt ⇒ 1.87 nJ / floating-point operation
- 2009 best result: 723 MFlops/Watt ⇒ 1.38 nJ / floating-point operation
- Cell cluster, ranking 110 in the Top500
- 2010 best result: 1684 MFlops/Watt ⇒ 594 pJ / floating-point operation
- IBM BlueGene/Q prototype 1, ranking 101 in the Top500, peak performance 65 TFlops; see also http://www.theregister.co.uk/2010/11/22/ibm_blue_gene_q_super/
- 2011 best result: 2097 MFlops/Watt ⇒ 476 pJ / floating-point operation
- IBM BlueGene/Q prototype 2
- power consumption 41 kW / peak 85 TFlop/s
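How the MFlops/Watt figures map onto energy per operation, worked out for the 2008 entry (the other years follow the same way):

    \frac{1\,\mathrm{W}}{536\times 10^{6}\,\mathrm{Flop/s}} \;=\; \frac{1\,\mathrm{J}}{536\times 10^{6}\,\mathrm{Flop}} \;\approx\; 1.87\times 10^{-9}\,\mathrm{J/Flop} \;=\; 1.87\,\mathrm{nJ/Flop}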
15. Energy cost
- At about $1M per MW per year, energy costs are substantial
- 1 petaflop in 2010 uses 3 MW
- 1 exaflop in 2018 is possible in 200 MW with the usual scaling
- 1 exaflop in 2018 at 20 MW is the DOE (Dept. of Energy) target
- see also the MontBlanc EU project, www.montblanc-project.eu
- goal: 200 PFlops for 10 MWatt in 2017
(Graph: normal vs. desired scaling of power with performance; from Katy Yelick, Berkeley)
16. Reducing power @ all design levels
- Algorithmic level
- Compiler level
- Architecture level
- Organization level
- Circuit level
- Silicon level
- Important concepts:
- Lower Vdd and frequency (even if errors occur) / dynamically adapt Vdd and frequency
- Reduce circuitry
- Exploit locality
- Reduce switching activity, glitches, etc.
P ≈ α·f·C·Vdd²
E = ∫P·dt ⇒ E/cycle ≈ α·C·Vdd²
17. Algorithmic level
- The best indicator for energy is ... the number of cycles
- Try alternative algorithms with lower complexity
- e.g. quick-sort, O(n log n), instead of bubble-sort, O(n²) (see the counting sketch below)
- but be aware of the 'constant': O(n log n) means c·(n log n)
- Heuristic approach
- Go for a good solution, not the best!!
Biggest gains at this level!!
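A minimal sketch of why the cycle (operation) count is the quantity to optimize: the snippet below just counts the dominant operations, comparisons, for bubble-sort versus a plain quick-sort on the same scrambled input. The function names, the counter and the input generator are illustrative assumptions, not from the slides; the point is only that the O(n²) vs. O(n log n) gap shows up directly as work done, and hence as energy.

#include <stdio.h>

static long cmps;                        /* comparison count: a stand-in for cycles/energy */

static void bubble_sort(int *a, int n) {
    for (int i = 0; i < n - 1; i++)
        for (int j = 0; j < n - 1 - i; j++) {
            cmps++;
            if (a[j] > a[j + 1]) { int t = a[j]; a[j] = a[j + 1]; a[j + 1] = t; }
        }
}

static void quick_sort(int *a, int lo, int hi) {   /* simple Lomuto-partition quick-sort */
    if (lo >= hi) return;
    int p = a[hi], i = lo;
    for (int j = lo; j < hi; j++) {
        cmps++;
        if (a[j] < p) { int t = a[i]; a[i] = a[j]; a[j] = t; i++; }
    }
    int t = a[i]; a[i] = a[hi]; a[hi] = t;
    quick_sort(a, lo, i - 1);
    quick_sort(a, i + 1, hi);
}

int main(void) {
    enum { N = 1000 };
    int a[N], b[N];
    for (int i = 0; i < N; i++) a[i] = b[i] = (i * 7919) % N;   /* scrambled input */

    cmps = 0; bubble_sort(a, N);        printf("bubble-sort: %ld comparisons\n", cmps);
    cmps = 0; quick_sort(b, 0, N - 1);  printf("quick-sort:  %ld comparisons\n", cmps);
    return 0;
}

For N = 1000 the first count is on the order of 500,000, the second on the order of 10,000: the algorithm-choice knob dwarfs anything a lower level can recover.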
18. Compiler level
- Source-to-source transformations
- loop transformations to improve locality
- Strength reduction
- e.g. replace Const × A by adds and shifts (see the sketch below)
- Replace floating point by fixed point
- Reduce register pressure / the number of accesses to the register file
- Use software bypassing
- Scenarios: current workloads are highly dynamic
- Determine and predict execution modes
- Group execution modes into scenarios
- Perform special optimizations per scenario
- DVFS: Dynamic Voltage and Frequency Scaling
- More advanced loop optimizations
- Reorder instructions to reduce bit transitions
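A minimal sketch of the strength-reduction item above: a multiplication by a known constant rewritten into shifts and adds. The constant 10 and the function names are illustrative assumptions, not from the slides; compilers apply this automatically for suitable constants.

#include <stdint.h>

/* Reference version: a full constant multiply. */
static inline uint32_t times10_mul(uint32_t a)   { return a * 10u; }

/* Strength-reduced version: 10*a = 8*a + 2*a = (a << 3) + (a << 1).
 * Shifts and adds are typically cheaper, and lower energy, than a full multiplier pass. */
static inline uint32_t times10_shift(uint32_t a) { return (a << 3) + (a << 1); }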
19. Architecture level
- Going parallel
- Going heterogeneous
- tune your architecture, exploit SFUs (special function units)
- trade off flexibility / programmability / genericity against efficiency
- Add local memories
- prefer a scratchpad instead of a cache
- Cluster FUs and register files (see next slide)
- Reduce bit-width
- sub-word parallelism (SIMD)
20. Organization (micro-arch.) level
- Enabling Vdd reduction
- Pipelining
- a cheap way of getting parallelism
- enables a lower frequency ⇒ lower Vdd
- Note 1: don't pipeline if you don't need the performance
- Note 2: don't exaggerate (like the 31-stage Pentium 4)
- Reduce register traffic
- avoid unnecessary reads and writes
- make bypass registers visible
21. Circuit level
- Clock gating
- Power gating
- Multiple Vdd modes
- Reduce glitches by balancing digital paths
- Exploit zeros
- Special SRAM cells
- normal SRAM cannot scale below Vdd ≈ 0.7 - 0.8 Volt
- Razor method: replay
- Allow errors and add redundancy to architecturally invisible structures
- branch predictor
- caches
- ... and many more ...
22. Silicon level
- Higher Vt (threshold voltage)
- Back-biasing control
- see the thesis of Maurice Meijer (2011)
- SOI (Silicon on Insulator)
- the silicon junction sits above an electrical insulator (silicon dioxide)
- lowers parasitic device capacitance
- Better transistors: FinFET
- multi-gate
- reduced leakage (off-state current)
- ... and many more
Wait for the lectures of Pineda on Friday
23. Let's detail a few examples
- Algorithmic level
- Exploiting locality
- Compiler level
- Software bypassing
- Architecture level
- Going parallel
- Organization level
- Razor
- Circuit level
- Exploiting zeros in a multiplier
- Silicon level
- Sub-threshold operation
24. Algorithm level: exploiting locality
(Figure: a generic platform with a four-level storage hierarchy, Level 1 to Level 4: CPUs and HW accelerators with I-cache, D-cache and local memories on chip; an on-chip L2 cache and on-chip busses behind a bus interface; off-chip main memory behind a bus bridge; and disks on a SCSI bus.)
25. Data transfer and storage power
26. Loop transformations
- Loop transformations
- improve regularity of accesses
- improve temporal locality: production → consumption
- Expected influence
- reduce temporary storage and (anticipated) background storage
- Work horse: loop merging
- typically many enabling transformations are needed before you can merge loops
27. Loop transformations: merging

for (i=0; i<N; i++) B[i] = f(A[i]);
for (j=0; j<N; j++) C[j] = f(B[j], A[j]);

becomes

for (i=0; i<N; i++) { B[i] = f(A[i]); C[i] = f(B[i], A[i]); }

Locality improved!
28. Loop transformations
Example:

for (i=0; i<N; i++) B[i] = f(A[i]);
for (i=0; i<N; i++) C[i] = g(B[i]);

becomes

for (i=0; i<N; i++) { B[i] = f(A[i]); C[i] = g(B[i]); }

(Annotations from the original figure: N cyc. / 2N cyc. / N cyc.; the unmerged version needs 2 background memory ports, the merged version only 1 background + 1 foreground port.)
29. Loop transformations
Example: an enabling transformation is required first (illustrated in the sketch below)
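The slide's own example is a figure and could not be reproduced here; as an illustration only, the sketch below shows one typical enabling transformation, loop bumping (shifting the iteration space), in the same style as the previous slides. Directly merging the two loops would be illegal because iteration i would read B[i+1] before it has been produced; after shifting the consumer loop by one, merging becomes legal and B is consumed right after it is produced.

/* Before: direct merging is illegal, iteration i would need B[i+1],
 * which the first loop has not produced yet at that point. */
for (i = 0; i < N;     i++) B[i] = f(A[i]);
for (j = 0; j < N - 1; j++) C[j] = g(B[j + 1]);

/* Enabling transformation (loop bumping): shift the second loop by one. */
for (i = 0; i < N; i++) B[i] = f(A[i]);
for (j = 1; j < N; j++) C[j - 1] = g(B[j]);

/* Now merging is legal; B[i] is consumed immediately after being produced. */
B[0] = f(A[0]);
for (i = 1; i < N; i++) { B[i] = f(A[i]); C[i - 1] = g(B[i]); }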
30. Compiler level: software bypassing
- The register file consumes a considerable amount of the total processor power
- > 15% in a simple 5-stage RISC (2R1W, 32-bit × 32 registers)
- Even more in VLIW and SIMD, as the size and the number of ports increase
31. Reducing RF accesses
- Many RF accesses can be eliminated
- Bypass read: read operands from the bypass network instead of the RF
- Writeback elimination: skip the writeback if the variable is dead
- Operand sharing: the same variable on the same port only needs to be read from the RF once
- In the slide's example, only 3 RF reads are actually needed.
32. Move-Pro: an improved TTA
- The original TTA has a few drawbacks
- Separate scheduling of operand moves may increase circuit activity
- The trigger port introduces extra scheduling constraints
- TTA code density is likely to be lower compared to RISC/VLIW
- May need more slots for the same performance
- Increases instruction fetching energy
33. Compiler framework
- Low-level IR
- similar to RISC assembly
- with extra metadata for the backend
- Local instruction scheduling
34. Scheduling example
- Direct translation results in bad code density
- More instructions also means worse performance
- Bypassing improves code density and reduces RF accesses
- Performance and energy consumption are also improved
Software bypassing + scheduling
35. Graph-based resource model
- Nodes represent resources
- Resources are duplicated for each cycle
- Edges represent connectivity or storage
- Each node has a capacity and a cost
- Cost is determined by the power model
- Instruction cost is taken into account
36. Energy results compared to RISC
- 3 configurations
- R1: RISC, 2R1W RF
- M2: 2-issue MOVE-Pro, 2R1W RF
- M3: 3-issue MOVE-Pro, 2R1W RF
- 8KB (32-bit) / 9KB (48-bit) I-Mem
- RF energy saving > 70%
- No loss in instruction-memory energy
- R1 and M2 have the same performance
37. Architecture level: going parallel
- Running into the
- frequency wall
- ILP wall
- memory wall
- energy wall
- Chip area is the enabler: Moore's law continues well below 22 nm
- What to do with all this area?
- Multiple processors fit easily on a single die
- Application demands
- Cost effective
- Reuse: just connect existing processors or processor cores
- Low power: parallelism may allow lowering Vdd
38. Low power through parallelism
- Sequential processor
- switching capacitance C
- frequency f
- voltage V
- P1 ≈ α·f·C·V²
- Parallel processor (two times the number of units)
- switching capacitance 2C
- frequency f/2
- voltage V' < V
- P2 ≈ α·(f/2)·2C·V'² = α·f·C·V'² < P1
- Check yourself whether this works for pipelining as well! (see the sketch below)
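One way to work the pipelining case out; the symbols C_reg (capacitance added by the extra pipeline registers) and V' are introduced here for illustration and are not on the slide. Pipelining keeps f and the work per cycle, but halves the logic depth per stage, so the same clock period can be met at a reduced supply V' < V:

    P_{seq}  \approx \alpha\, f\, C\, V^2

    P_{pipe} \approx \alpha\, f\, (C + C_{reg})\, V'^2 \;<\; P_{seq} \quad\text{provided}\quad (C + C_{reg})\, V'^2 < C\, V^2

So the saving holds as long as the Vdd reduction outweighs the capacitance of the added pipeline registers.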
39. 4-D model of parallel architectures
- How to speed up your favorite processor?
- Super-pipelining
- Powerful instructions
- MD-technique
- multiple data operands per operation
- MO-technique
- multiple operations per instruction
- Multiple instruction issue
- Single stream: Superscalar
- Multiple streams
- Single core, multiple threads: Simultaneous Multi-Threading
- Multiple cores
40. Architecture methods 1: Pipelined execution of instructions
IF: Instruction Fetch, DC: Instruction Decode, RF: Register Fetch, EX: Execute instruction, WB: Write Result Register
(Diagram: a simple 5-stage pipeline with 4 instructions flowing through the stages over 8 cycles.)
- Purpose of pipelining
- Reduce the number of gate levels in the critical path
- Reduce CPI close to one (instead of a large number for the multicycle machine)
- More efficient hardware
- Some bad news: hazards, i.e. pipeline stalls
- Structural hazards: add more hardware
- Control hazards, branch penalties: use branch prediction
- Data hazards: bypassing required
41. Architecture methods 1: Super-pipelining
- Superpipelining
- Split one or more of the critical pipeline stages
- Superpipelining degree S:
S(architecture) = Σ_{Op ∈ I_set} f(Op) · lt(Op)
where f(Op) is the frequency of operation Op and lt(Op) is the latency of operation Op
42. Architecture methods 2: Powerful instructions (1)
- MD-technique
- Multiple data operands per operation
- SIMD: Single Instruction Multiple Data

Vector instruction:
for (i=0; i<64; i++) c[i] = a[i] + 5*b[i];    or:    c = a + 5*b

Assembly:
set   vl,64
ldv   v1,0(r2)
mulvi v2,v1,5
ldv   v1,0(r1)
addv  v3,v1,v2
stv   v3,0(r3)
43. Architecture methods 2: Powerful instructions (1)
- SIMD computing
- All PEs (Processing Elements) execute the same operation
- Typical mesh or hypercube connectivity
- Exploits the data locality of e.g. image processing applications
- Dense encoding (few instruction bits needed)
44. Architecture methods 2: Powerful instructions (1)
- Sub-word parallelism
- SIMD on a restricted scale
- Used for multimedia instructions
- Many processors support this
- Examples
- MMX, SSE, SUN VIS, HP MAX-2, AMD K7/Athlon 3DNow!, TriMedia II
- Example: Σ_{i=1..4} |a_i - b_i| (see the sketch below)
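A minimal sketch of the sum-of-absolute-differences example above, written with SSE2 intrinsics as one concrete instance of sub-word parallelism. The function name and the choice of 16 bytes per call (rather than the 4 elements on the slide) are illustrative assumptions; the point is that the whole reduction maps onto a single psadbw-style instruction instead of a loop of subtracts, absolute values and adds.

#include <emmintrin.h>   /* SSE2 */
#include <stdint.h>

/* Sum of absolute differences over 16 unsigned bytes, computed sub-word parallel. */
static inline unsigned sad16(const uint8_t *a, const uint8_t *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i s  = _mm_sad_epu8(va, vb);     /* two partial sums, one per 64-bit half */
    return (unsigned)(_mm_extract_epi16(s, 0) + _mm_extract_epi16(s, 4));
}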
45. Architecture methods 2: Powerful instructions (2)
- MO-technique: multiple operations per instruction
- Two options
- CISC (Complex Instruction Set Computer)
- this is what we did in the 'old' days of microcoded processors
- VLIW (Very Long Instruction Word)
VLIW instruction example: one instruction consisting of five fields, one per FU:
FU 1: sub r8, r5, 3 | FU 2: and r1, r5, 12 | FU 3: mul r6, r5, r2 | FU 4: ld r3, 0(r5) | FU 5: bnez r5, 13
46. VLIW architecture: central register file
(Figure: a single central register file shared by nine execution units, grouped into three issue slots.)
Q: How many ports does the register file need for n-issue?
47. Clustered VLIW
- Clustering: splitting up the VLIW data path; the same can be done for the instruction path
- Exploit locality @ Level 0, for instructions and data
48. Architecture methods 3: Multiple instruction issue (per cycle)
- Who guarantees semantic correctness?
- i.e., can these instructions be executed in parallel?
- User: specifies multiple instruction streams
- Multi-processor: MIMD (Multiple Instruction, Multiple Data)
- HW: run-time detection of ready instructions
- Superscalar, single instruction stream
- Compiler: compile into a dataflow representation
- Dataflow processors
- Multi-threaded processors
49. Four-dimensional representation of the architecture design space <I, O, D, S>
Mpar = I × O × D × S
You should exploit this amount of parallelism!!!
50. Examples of many-core / PE architectures
- SIMD
- Xetal (320 PEs), IMAP (128 PEs), AnySP (Michigan Univ.)
- VLIW
- ADRES, TriMedia
- more dynamic: Itanium (static scheduling, run-time mapping), TRIPS/EDGE (run-time scheduling)
- Multi-threaded
- idea: hide long latencies
- Denelcor HEP (1982), SUN Niagara (2005)
- Multi-processor
- RaW, PicoChip, Intel/AMD, GRID, farms, ...
- Hybrid, like Imagine, GPUs, XC-Core, Cell
- actually, most are hybrid!!
51. In need of TeraFlops on your desk?
- 4 × Nvidia GTX 295
- 1920 PEs
- 7 TeraFlop
52. How do GPUs spend their die area?
GPUs are designed to match the workload of 3D graphics.
- Nvidia GTX 280
- most area is spent on processing
- relatively small on-chip memories
- huge off-chip memory latencies
J. Roca et al., "Workload Characterization of 3D Games", IISWC 2006; T. Mitra et al., "Dynamic 3D Graphics Workload Characterization and the Architectural Implications", Micro 1999
53. How do CPUs spend their die area?
CPUs are designed for low latency instead of high throughput.
Die photo of Intel Penryn (source: Intel)
54. Organization level: Razor
- Use a shadow latch clocked with a delayed clock
- Reduce Vdd as far as possible
- Detect an error: main FF ≠ shadow FF
- Correct the error: e.g. replay the instruction in the microprocessor
55. Razor used in a microprocessor
56. Razor: energy reduction
57. Circuit level: exploit the actual data width
- Multiplication is a very basic and widely-used operation
- Multipliers are usually among the most power-hungry units in many designs
- When operating on data of smaller width (e.g., a 16-bit multiplier processing 8-bit data), we would like the energy consumption to be close to that of a short-width multiplier (see the sketch below)
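A minimal software analogue of that wish; the function name, the 16x16 ⇒ 32-bit multiplier and the 8-bit threshold are illustrative assumptions, not the hardware design of the following slides. The idea is simply to detect that both operands fit in the narrow range and steer them to a cheaper narrow multiply; in hardware the same test would gate off the unused upper part of the multiplier array.

#include <stdint.h>

/* Hypothetical width-aware multiply: use the cheap 8x8 path when both operands
 * are effectively 8-bit values, the full 16x16 path otherwise. */
static inline int32_t mul16_width_aware(int16_t a, int16_t b)
{
    if (a >= -128 && a < 128 && b >= -128 && b < 128) {
        /* effective data width is 8 bits: a narrow multiplier (or the gated-off
         * upper half of the array) is sufficient */
        return (int32_t)(int8_t)a * (int8_t)b;
    }
    return (int32_t)a * b;               /* full-width multiply */
}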
58. Motivation
(Chart: normalized energy consumption per operation of signed multipliers of different sizes; Baugh-Wooley multiplier with Wallace tree.)
59. Unsigned data
(Chart: normalized energy consumption per operation of unsigned multipliers of different sizes.)
- Unlike signed multipliers, unsigned multipliers are naturally data-width aware
60. Signed multiplier: sign-magnitude
(Chart: normalized energy consumption per operation of signed multipliers of different sizes, sign-magnitude format.)
61. Signed multiplier: sign-magnitude
- A sign-magnitude multiplier is essentially an unsigned multiplier
- with sign-bit calculation logic (XOR)
- However, the sign-magnitude format has drawbacks
- It requires a different set of rules for arithmetic computation
- e.g., to add two numbers, we have to choose addition or subtraction depending on the sign bits of the two numbers
- Zero has two representations (+0: 0000 and -0: 1000)
62. Data-width-aware multiplier design
Architecture choices when the effective data input is only half width
63. Silicon level: sub/near-threshold JPEG accelerator
Pu Yu et al., ISSCC 2009
< 1 pJ/op at 400 mV, 65nm CMOS
64. Trends: low power, how far can we scale Vdd?
- Subthreshold JPEG encoder
- Vdd = 0.4 - 1.2 Volt
Pu Yu, ISSCC '09
65. A 280mV-to-1.2V IA-32 Processor in 32nm CMOS
Intel, ISSCC '12
66. Exercise: how far can we go? Let's consider a massively-parallel SIMD
(Block diagram: control logic with an instruction memory driving PE 0 ... PE N through an interconnect, backed by a shared data/frame memory.)
- SIMD: low-power architecture
- massively parallel: a large number of PEs, high performance
67. Xetal-II
68. Xetal-II processor details
- 600 mW
- 90 nm CMOS
- 53.5 GOPS (arithmetic only) @ 84 MHz
- The most computationally efficient programmable silicon in 2007
Kleihorst et al., 2007
69. Xetal-II block diagram
70. Xetal-II energy breakdown at 1.2V
- 125 MHz, 400 mW @ 65nm
- 25 ins./pixel (5x5 convolution)
- 240.8 pJ/pixel ⇒ ~10 pJ/ins. ⇒ ~5 pJ/op
- However: 69% is consumed by the FM (frame memory)!
71. Hybrid memory architecture
- Exploiting data locality
- The scratchpad memory can be
- bypassed and clock-gated
- 15% area overhead to the tile
- ACCU register
- short-term data
- Scratchpad Memory (SM, 32 entries)
- intermediate-term data
- Frame Memory (FM)
- long-term data
72. Energy breakdown @ 1.2 V
- SM is realized by commercial SRAM here
- Requires 1 extra instruction/px to implement the 5x5 filter
- SM and PEs dominate the energy consumption
- 151.9 pJ/pixel, a 1.6x reduction
- ~3 pJ/op
Implementing the SM with standard cells reduces the SM energy by 2x ⇒ a 2.1x reduction w.r.t. Xetal-II, ~2.4 pJ/op
73. Energy breakdown @ low Vdd
- SM realized by standard cells
- FM by commercial SRAM
- Optimal point: FM at 0.7V, SM at 0.38V, and PE at 0.42V
- 22.6 pJ/pixel, a 12.5x reduction; 0.45 pJ/16-bit op ⇒ 0.9 pJ/32-bit op
- Still delivering 0.7 GOPS
Sub-threshold scratchpad memory with super-threshold frame memory
74. ICE curve extended with Vdd scaling
1 pJ/op
75. Can we match the human brain???
- Performance: 100 billion (10^11) neurons × 1000 (10^3) connections/neuron × 200 (2·10^2) calculations per second per connection ≈ 2·10^16 calculations per second (worked out below)
- Memory: 100 billion (10^11) neurons × 1000 (10^3) connections/neuron × 10 bytes (information about connection strength, address of the output neuron, type of synapse) = 10^15 bytes = 1 PB = 1000 TB
- How far off are we?
- The brain needs only 20 Watt
- ... and processors need MegaWatts
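The two back-of-the-envelope products above, written out with the slide's own numbers (no new data):

    10^{11}\ \text{neurons} \times 10^{3}\ \tfrac{\text{connections}}{\text{neuron}} \times 2\cdot 10^{2}\ \tfrac{\text{calc/s}}{\text{connection}} \;=\; 2\cdot 10^{16}\ \text{calc/s}

    10^{11} \times 10^{3} \times 10\ \text{bytes} \;=\; 10^{15}\ \text{bytes} \;=\; 1\ \text{PB}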
76. Blue Brain research
- A software replica of one column of the neocortex
- the cortex is 85% of the brain's total mass
- required for language, learning, memory and complex thought
- the essential first step to simulating the whole brain
- Next: include circuitry from other brain regions and
- eventually the whole brain.
77. Are we / is CMOS running out of options?
- Google yourself
- Reversible logic gates
- Adiabatic logic
- Nano tubes
- Graphene
- Bio / molecular / DNA computing
- Approximate computing
- Analog computing
- e.g. with only 9 transistors you can build a Gilbert multiplier [Mead 1989]
- Or: much better algorithms
78. Reading
- Low Power Design Essentials (book), Jan Rabaey, Springer, 2009
- DATE 2012 tutorial: "Design Methodology and Techniques in Production Low-Power SOC Designs", Kaijian Shi (Cadence), Thomas Buechner (IBM)