Introduction to Energy Aware Computing - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Introduction to Energy Aware Computing


1
Introduction to Energy Aware Computing
  • Henk Corporaal
  • www.ics.ele.tue.nl/heco
  • ASCI Winterschool on Energy Aware Computing
  • Soesterberg, March 2012

2
(Graph: Intel scaling trends; clock frequency plateaus around 3 GHz and power around 100 W)
  • Intel trends:
  • transistor count keeps following Moore's law
  • but clock frequency and performance/core do not

3
Types of compute systems
4
A 20nm scenario (high-end processor)
  • This means:
  • a 2 cm² processor consumes 10 kW
  • a 100 W power bound allows only 1% of it to be active →
    dark silicon

5
Intel's answer: 48-core x86
6
Power versus Energy
  • Power: P ≈ α·f·C·Vdd²
  • α: switching activity (< 1), f: frequency, C: switching capacitance, Vdd: supply voltage
  • heat / temperature constraint
  • wear-out
  • peak power delivery constraint
  • Energy: E = P·t, or for time-varying P: E = ∫P(t)·dt
  • battery life
  • cost: electricity bill
  • Note: lowering f reduces P, but not necessarily E; E may even increase due to leakage (static power dissipation) (see the worked note below)
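A quick worked note (not on the original slide) on why lowering f alone need not save energy:

  E_total = E_dynamic + E_static = N_cycles·α·C·Vdd² + P_leak·t,   with t = N_cycles / f

Lowering f leaves the dynamic term N_cycles·α·C·Vdd² unchanged but stretches the run time t, so the leakage term P_leak·t grows. Only lowering Vdd (which a lower f enables) reduces the dynamic energy, and it does so quadratically.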

7
What's happening at the top?
8
Top500 nr 1
  • 1st: K Computer
  • 10.51 Petaflop/s on Linpack
  • 705024 SPARC64 cores (8 per die, 45 nm) (Fujitsu design)
  • Tofu interconnect (6-D torus)
  • 12.7 MegaWatt

9
Top500 nr 2
  • 2nd: Chinese Tianhe-1A
  • 2.57 Petaflop/s
  • 186368 cores (Xeon + NVIDIA processors)
  • 4.0 MegaWatt

10
What's happening at the low end?
  • March 14, 2012: ARM announced the Cortex-M0+
  • "The 32-bit Cortex-M0+ consumes just 9 µA/MHz on a low-cost 90nm LP process, around one third of the energy of any 8- or 16-bit processor available today, while delivering significantly higher performance"
  • 2-stage pipeline
  • option: 1-cycle MUL

11
Low end: How much energy is in the air?
Rabaey 2009
12
Computational efficiency (MOPS/mW): what do we need?
This means 1 pJ / operation, or 1 TeraOp/Watt
Woh et al., ISCA 2009
13
Green500: Top 10 in green supercomputing
14
Green500 evolution
  • 2008 best result: 536 MFlops/Watt → 1.87 nJ / floating-point operation
  • 2009 best result: 723 MFlops/Watt → 1.38 nJ / floating-point operation
  • Cell cluster, ranking 110 in the top500
  • 2010 best result: 1684 MFlops/Watt → 594 pJ / floating-point operation
  • IBM BlueGene/Q prototype 1, ranking 101 in the top500, peak perf. 65 TFlops; see also http://www.theregister.co.uk/2010/11/22/ibm_blue_gene_q_super/
  • 2011 best result: 2097 MFlops/Watt → 476 pJ / floating-point operation
  • IBM BlueGene/Q prototype 2
  • power consumption 41 kW / peak 85 TFlop/s

15
Energy cost
  • At $1M per MW, energy costs are substantial
  • 1 petaflop in 2010 uses 3 MW
  • 1 exaflop in 2018 possible at 200 MW with the usual scaling
  • 1 exaflop in 2018 at 20 MW is the DOE (Dept. of Energy) target (see the arithmetic below)
  • see also the Mont-Blanc EU project, www.montblanc-project.eu
  • goal: 200 PFlops for 10 MWatt in 2017

(Graph: projected system power under normal scaling vs. desired scaling; from Katy Yelick, Berkeley)
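The per-operation energy budgets these targets imply (simple arithmetic, not spelled out on the slide):

  200 MW / 1 exaflop/s  =  2·10^8 W / 10^18 flop/s  =  200 pJ/flop   (usual scaling)
   20 MW / 1 exaflop/s  =  20 pJ/flop                                (DOE target)
   10 MW / 200 Pflop/s  =  10^7 W / 2·10^17 flop/s  =  50 pJ/flop    (Mont-Blanc goal)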
16
Reducing power @ all design levels
  • Algorithmic level
  • Compiler level
  • Architecture level
  • Organization level
  • Circuit level
  • Silicon level
  • Important concepts
  • Lower Vdd and freq. (even if errors occur) /
    dynamically adapt Vdd and freq.
  • Reduce circuit
  • Exploit locality
  • Reduce switching activity, glitches, etc.

P = α·f·C·Vdd²
E = ∫P·dt → E/cycle = α·C·Vdd²
17
Algorithmic level
  • The best indicator for energy is ... the number of cycles
  • Try alternative algorithms with lower complexity
  • E.g. quick-sort, O(n log n), vs. bubble-sort, O(n²)
  • but be aware of the 'constant': O(n log n) → c·(n log n)
  • Heuristic approach
  • Go for a good solution, not the best !! (a worked comparison follows below)

Biggest gains at this level !!
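A quick illustration of why the constant matters (the constants are made up for illustration):

  bubble-sort:  1·n²          quick-sort:  4·n·log2(n)
  n = 16:    1·256   = 256    vs.  4·16·4    = 256      → break-even
  n = 1024:  1·n²    ≈ 10^6   vs.  4·1024·10 ≈ 4·10^4   → quick-sort wins by ~25x

So for small inputs the asymptotically 'worse' algorithm may cost fewer cycles, and thus less energy; heuristic and hybrid choices exploit exactly this.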
18
Compiler level
  • Source-to-source transformations
  • loop trafos to improve locality
  • Strength reduction
  • E.g. replace Const * A with adds and shifts (a sketch follows below)
  • Replace floating point with fixed point
  • Reduce register pressure / number of accesses to the register file
  • Use software bypassing
  • Scenarios: current workloads are highly dynamic
  • Determine and predict execution modes
  • Group execution modes into scenarios
  • Perform special optimizations per scenario
  • DVFS: Dynamic Voltage and Frequency Scaling
  • More advanced loop optimizations
  • Reorder instructions to reduce bit-transitions
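A minimal C sketch of strength reduction and fixed-point substitution (illustrative, not taken from the slides):

  #include <stdint.h>

  /* y = 10*x without a multiplier: 10*x = 8*x + 2*x */
  static inline uint32_t mul10(uint32_t x)
  {
      return (x << 3) + (x << 1);
  }

  /* Fixed-point (Q8) multiply replacing a floating-point multiply:
     one integer multiply and a shift, no FPU needed. */
  static inline int32_t fx_mul_q8(int32_t a_q8, int32_t b_q8)
  {
      return (int32_t)(((int64_t)a_q8 * b_q8) >> 8);
  }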

19
Architecture level
  • Going parallel
  • Going heterogeneous
  • tune your architecture, exploit SFUs (special
    function units)
  • trade-off between flexibility / programmability /
    genericity and efficiency
  • Add local memories
  • prefer scratchpad i.s.o. cache
  • Cluster FUs and register files (see next slide)
  • Reduce bit-width
  • sub-word parallelism (SIMD)

20
Organization (micro-arch.) level
  • Enabling Vdd reduction
  • Pipelining
  • cheap way of parallelism
  • Enabling lower freq. → lower Vdd
  • Note 1: don't pipeline if you don't need the performance
  • Note 2: don't exaggerate (like the 31-stage Pentium 4)
  • Reduce register traffic
  • avoid unnecessary reads and writes
  • make bypass registers visible

21
Circuit level
  • Clock gating
  • Power gating
  • Multiple Vdd modes
  • Reduce glitches: balance digital paths
  • Exploit zeros
  • Special SRAM cells
  • normal SRAM cannot scale below Vdd ≈ 0.7 - 0.8 Volt
  • Razor method: replay
  • Allow errors and add redundancy to architecturally invisible structures
  • branch predictor
  • caches
  • ... and many more ...

22
Silicon level
  • Higher Vt (V_threshold)
  • Back Biasing control
  • see thesis Maurice Meijer (2011)
  • SOI (Silicon on Insulator)
  • silicon junction is above an electr. insulator
    (silicon dioxide)
  • lowers parasitic device capacitance
  • Better transistors: FinFET
  • multi-gate
  • reduces leakage (off-state current)
  • .. and many more

Wait for lectures of Pineda on Friday
23
Let's detail a few examples
  • Algorithmic level
  • Exploiting locality
  • Compiler level
  • Software bypassing
  • Architecture level
  • Going parallel
  • Organization level
  • Razor
  • Circuit level
  • Exploit zeros in a Multiplier
  • Silicon level
  • Sub-threshold

24
Algorithm level: Exploiting locality
(Diagram: generic platform memory hierarchy, Level-1 to Level-4: CPUs and a HW accelerator with I-cache, D-cache and local memories on chip, an L2 cache and bus interface/bridge on the on-chip busses, main memory on the board-level bus, and disks on the SCSI bus)
25
Data transfer and storage power
26
Loop transformations
  • Loop transformations
  • improve regularity of accesses
  • improve temporal locality: production → consumption
  • Expected influence
  • reduce temporary storage and (anticipated) background storage
  • Work horse: loop merging
  • typically many enabling trafos are needed before you can merge loops

27
Loop transformations: Merging

for (i=0; i<N; i++)  B[i] = f(A[i]);
for (j=0; j<N; j++)  C[j] = f(B[j], A[j]);

for (i=0; i<N; i++) { B[i] = f(A[i]);
                      C[i] = f(B[i], A[i]); }
Locality improved !
28
Loop transformations
Example

for (i=0; i<N; i++)  B[i] = f(A[i]);      /* N cyc.                                   */
for (i=0; i<N; i++)  C[i] = g(B[i]);      /* N cyc. → 2N cyc. total, 2 background ports */

for (i=0; i<N; i++) { B[i] = f(A[i]);
                      C[i] = g(B[i]); }   /* N cyc., 1 background + 1 foreground port   */
29
Loop transformations
Example: an enabling transformation is required before merging (a sketch follows below)
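The figure for this slide is not reproduced here; one classic case (an assumed example, not necessarily the slide's) is two loops that traverse B in opposite directions, so a loop reversal is needed as the enabling transformation before merging:

for (i=0; i<N; i++)  B[i] = f(A[i]);
for (j=0; j<N; j++)  C[j] = g(B[N-1-j]);        /* consumes B in reverse order */

/* enabling trafo: reverse the iteration order of the second loop */
for (i=0; i<N; i++)  C[N-1-i] = g(B[i]);        /* same result, B now accessed in order */

/* ... which allows merging, so each B[i] is consumed right after it is produced */
for (i=0; i<N; i++) { B[i] = f(A[i]);  C[N-1-i] = g(B[i]); }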
30
Compiler level: Software bypassing
  • The register file consumes a considerable amount of the total processor power
  • > 15% in a simple 5-stage RISC (2R1W, 32b x 32)
  • Even more in VLIW and SIMD, as size and number of ports increase

31
Reducing RF Accesses
  • Many RF accesses can be eliminated
  • Bypass read: read operands from the bypass network instead of the RF
  • Writeback elimination: skip the writeback if the variable is dead
  • Operand sharing: the same variable on the same port only needs to be read from the RF once

Only 3 RF reads are actually needed.
32
Move-Pro: an Improved TTA
  • The original TTA has a few drawbacks
  • Separate scheduling of operands may increase circuit activity
  • The trigger port introduces extra scheduling constraints
  • TTA code density is likely to be lower compared to RISC/VLIW
  • May need more slots for the same performance
  • Increases instruction fetch energy

33
Compiler Framework
  • Low level IR
  • Similar to RISC assembly
  • With extra metadata for the backend
  • Local instruction scheduling

34
Scheduling Example
  • Direct translation results in bad code density
  • More instructions also mean worse performance
  • Bypassing improves code density and reduces RF
    accesses
  • Performance and energy consumption are also
    improved

Software bypassing scheduling
35
Graph-based Resource Model
  • Nodes represent resources
  • Resources are duplicated for each cycle
  • Edges represent connectivity or storage
  • Each node has capacity and cost
  • Cost determined by power model
  • Instruction cost is taken into account

36
Energy Results Compared to RISC
  • 3 configurations
  • R1: RISC, 2R1W RF
  • M2: 2-issue MOVE-Pro, 2R1W RF
  • M3: 3-issue MOVE-Pro, 2R1W RF
  • 8KB (32-bit) / 9KB (48-bit) I-Mem
  • RF energy saving > 70%
  • No loss in instr-mem
  • R1 and M2 have the same performance

37
Architecture level: going parallel
  • Running into the:
  • Frequency wall
  • ILP wall
  • Memory wall
  • Energy wall
  • Chip area enabler: Moore's law goes well below 22 nm
  • What to do with all this area?
  • Multiple processors fit easily on a single die
  • Application demands
  • Cost effective
  • Reuse: just connect existing processors or processor cores
  • Low power: parallelism may allow lowering Vdd

38
Low power through parallelism
  • Sequential processor
  • Switching capacitance C
  • Frequency f
  • Voltage V
  • P1 = α·f·C·V²
  • Parallel processor (two times the number of units)
  • Switching capacitance 2C
  • Frequency f/2
  • Voltage V' < V
  • P2 = α·(f/2)·2C·V'² = α·f·C·V'² < P1
  • Check yourself whether this works for pipelining as well! (a short worked check follows below)
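A rough worked check with an assumed (illustrative) voltage-scaling factor: suppose halving f allows Vdd to drop to V' ≈ 0.7·V. Then

  P2 = α·f·C·(0.7·V)² ≈ 0.5·α·f·C·V² = 0.5·P1

at roughly the same aggregate throughput, so energy per operation also drops by about 2x. The same reasoning carries over to pipelining: a shorter critical path lets the circuit meet the same clock at a lower Vdd.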

39
4-D model of parallel architectures
  • How to speed up your favorite processor?
  • Super-pipelining
  • Powerful instructions
  • MD-technique
  • multiple data operands per operation
  • MO-technique
  • multiple operations per instruction
  • Multiple instruction issue
  • Single stream: superscalar
  • Multiple streams
  • Single core, multiple threads: Simultaneous Multi-Threading
  • Multiple cores

40
Architecture methods: 1. Pipelined Execution of Instructions
IF: Instruction Fetch, DC: Instruction Decode, RF: Register Fetch, EX: Execute instruction, WB: Write Result Register
(Pipeline diagram: instructions 1-4 flowing through the five stages over cycles 1-8)
Simple 5-stage pipeline
  • Purpose of pipelining:
  • Reduce the number of gate levels in the critical path
  • Reduce CPI close to one (instead of a large number for the multicycle machine)
  • More efficient hardware
  • Some bad news: hazards, or pipeline stalls
  • Structural hazards: add more hardware
  • Control hazards, branch penalties: use branch prediction
  • Data hazards: bypassing required

41
Architecture methods: 1. Super pipelining
  • Superpipelining
  • Split one or more of the critical pipeline stages
  • Superpipelining degree S:

S(architecture) = Σ (over Op ∈ I_set) f(Op) · lt(Op)

where f(Op) is the frequency of operation Op and lt(Op) is the latency of operation Op
42
Architecture methods: 2. Powerful Instructions (1)
  • MD-technique
  • Multiple data operands per operation
  • SIMD: Single Instruction Multiple Data

Vector instruction:
for (i=0; i<64; i++)  c[i] = a[i] + 5*b[i];
or:  c = a + 5*b

Assembly:
  set   vl, 64
  ldv   v1, 0(r2)
  mulvi v2, v1, 5
  ldv   v1, 0(r1)
  addv  v3, v1, v2
  stv   v3, 0(r3)
43
Architecture methods: 2. Powerful Instructions (1)
  • SIMD computing
  • All PEs (Processing Elements) execute the same operation
  • Typical: mesh or hypercube connectivity
  • Exploits the data locality of e.g. image-processing applications
  • Dense encoding (few instruction bits needed)

44
Architecture methods: 2. Powerful Instructions (1)
  • Sub-word parallelism
  • SIMD on a restricted scale
  • Used for multimedia instructions
  • Many processors support this
  • Examples:
  • MMX, SSE, SUN-VIS, HP MAX-2, AMD-K7/Athlon 3DNow!, TriMedia II
  • Example: Σ (i=1..4) |a_i - b_i| (a sketch follows below)
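A minimal C sketch of the sum-of-absolute-differences example using sub-word parallelism, here through the SSE2 PSADBW instruction via intrinsics (this particular coding is an illustration, not taken from the slides):

  #include <emmintrin.h>   /* SSE2 intrinsics */
  #include <stdint.h>

  /* Sum of absolute differences over 16 unsigned bytes: one vector
     instruction computes all 16 |a[i]-b[i]| terms and most of the sum. */
  uint32_t sad16(const uint8_t *a, const uint8_t *b)
  {
      __m128i va  = _mm_loadu_si128((const __m128i *)a);
      __m128i vb  = _mm_loadu_si128((const __m128i *)b);
      __m128i sad = _mm_sad_epu8(va, vb);              /* two partial 64-bit sums */
      return (uint32_t)_mm_cvtsi128_si32(sad)          /* lower 8-byte group      */
           + (uint32_t)_mm_extract_epi16(sad, 4);      /* upper 8-byte group      */
  }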

45
Architecture methods: 2. Powerful Instructions (2)
  • MO-technique: multiple operations per instruction
  • Two options:
  • CISC (Complex Instruction Set Computer)
  • this is what we did in the 'old' days of microcoded processors
  • VLIW (Very Long Instruction Word)

VLIW instruction example (one instruction, one field per FU):
  FU 1: sub  r8, r5, 3
  FU 2: and  r1, r5, 12
  FU 3: mul  r6, r5, r2
  FU 4: ld   r3, 0(r5)
  FU 5: bnez r5, 13
46
VLIW architecture: central Register File
(Diagram: nine exec units grouped into three issue slots, all sharing one central register file)
Q: How many ports does the register file need for n-issue? (a quick estimate follows below)
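A quick estimate, assuming two register source operands and one result per operation (an assumption, not stated on the slide):

  n-issue  →  about 2n read ports + n write ports

Since the area and energy of a central RF grow much faster than linearly with the port count, this motivates the clustered organization on the next slide.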
47
Clustered VLIW
  • Clustering: splitting up the VLIW data path; the same can be done for the instruction path
  • Exploit locality @ Level 0, for instructions and data

48
Architecture methods: 3. Multiple instruction issue (per cycle)
  • Who guarantees semantic correctness?
  • can instructions be executed in parallel?
  • User: he specifies multiple instruction streams
  • Multi-processor: MIMD (Multiple Instruction Multiple Data)
  • HW: run-time detection of ready instructions
  • Superscalar, single instruction stream
  • Compiler: compile into a dataflow representation
  • Dataflow processors
  • Multi-threaded processors

49
Four-dimensional representation of the architecture design space <I, O, D, S>

Mpar = I·O·D·S

You should exploit this amount of parallelism !!! (a worked example follows below)
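An illustrative reading with assumed numbers (not from the slide), taking I as instructions issued per cycle, O as operations per instruction, D as data elements per operation, and S as the superpipelining degree:

  I = 2, O = 2, D = 4, S = 1.5  →  Mpar = 2·2·4·1.5 = 24

so the application must expose roughly 24 independent operations at any moment to keep such a machine busy.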
50
Examples of many core / PE architectures
  • SIMD
  • Xetal (320 PEs), Imap (128 PEs), AnySP (Michigan Univ.)
  • VLIW
  • ADRES, TriMedia
  • more dynamic: Itanium (static sched., rt mapping), TRIPS/EDGE (rt scheduling)
  • Multi-threaded
  • idea: hide long latencies
  • Denelcor HEP (1982), SUN Niagara (2005)
  • Multi-processor
  • RaW, PicoChip, Intel/AMD, GRID, farms, ...
  • Hybrid, like Imagine, GPUs, XC-Core, Cell
  • actually, most are hybrid !!

51
In need of TeraFlops on your desk?
  • 4x Nvidia GTX295
  • 1920 PEs
  • 7 TeraFlop

52
How Do GPUs Spend Their Die Area?
GPUs are designed to match the workload of 3D
graphics.
  • Nvidia GTX 280
  • most area spent on processing
  • relatively small on-chip memories
  • huge off-chip memory latencies

J. Roca et al., "Workload Characterization of 3D Games", IISWC 2006, link
T. Mitra et al., "Dynamic 3D Graphics Workload Characterization and the Architectural Implications", Micro 1999, link
53
How Do CPUs Spend Their Die Area?
CPUs are designed for low latency instead of high
throughput
Die photo of Intel Penryn (source Intel)
54
Organization level: Razor
  • Use a shadow latch clocked with a delayed clock
  • Reduce Vdd as far as possible
  • Detect error: Main FF ≠ Shadow FF
  • Correct error: e.g. replay the instruction in a microprocessor

55
Razor used in Microprocessor
56
Razor: Energy reduction
57
Circuit level: exploit actual data width
  • Multiplication is a very basic and widely-used operation
  • Multipliers are usually among the most power-hungry units in many designs
  • When operating on data of smaller width (e.g., a 16-bit multiplier processing 8-bit data), we would like to see energy consumption close to that of a short-width multiplier.

58
Motivation
Normalized Energy Consumption/Operation
Energy consumption of signed multipliers of
different sizes (Baugh-Wooley multiplier with
Wallace tree)
59
Unsigned data
Normalized Energy Consumption/Operation
Energy consumption of unsigned multipliers of
different sizes
  • Unlike signed multipliers, unsigned multipliers are naturally data-width aware

60
Signed Multiplier: sign-magnitude
Normalized Energy Consumption/Operation
Energy consumption of signed multipliers of
different sizes (sign-magnitude format)
61
Signed Multiplier: sign-magnitude
  • A sign-magnitude multiplier is essentially an unsigned multiplier
  • with sign-bit calculation logic (XOR); see the sketch below
  • However, there are drawbacks to using the sign-magnitude format:
  • It requires a different set of rules for arithmetic computation
  • E.g., to add two numbers, we have to choose addition or subtraction depending on the sign bits of the two numbers
  • Zero has two representations (+0, 0000 and -0, 1000).
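A small C sketch of the idea in software terms (illustrative only; the slide is about the hardware multiplier): the signed product is formed as an unsigned multiply of the magnitudes plus an XOR of the sign bits, and small magnitudes keep the upper operand bits at zero.

  #include <stdint.h>
  #include <stdlib.h>

  int32_t sign_magnitude_mul16(int16_t a, int16_t b)
  {
      uint16_t ma = (uint16_t)abs(a);       /* magnitudes: narrow data leaves */
      uint16_t mb = (uint16_t)abs(b);       /* the upper bits zero            */
      uint32_t m  = (uint32_t)ma * mb;      /* unsigned multiply              */
      int neg     = (a < 0) ^ (b < 0);      /* sign-bit XOR                   */
      return neg ? -(int32_t)m : (int32_t)m;
  }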

62
Data-Width-Aware Multiplier Design
Architecture choices when the effective data input is only half width

63
Silicon level: Sub/Near-Threshold JPEG Accelerator
Pu Yu et al., ISSCC 2009
< 1 pJ/op at 400 mV, 65 nm CMOS
64
Trends: Low power, how far can we scale Vdd?
  • Subthreshold JPEG encoder
  • Vdd = 0.4 - 1.2 Volt

Pu Yu, ISSCC '09
65
A 280mV-to-1.2V IA-32 Processor in 32nm CMOS
Intel ISSCC'12
66
Exercise: how far can we go? Let's consider a Massively-Parallel SIMD
(Diagram: PE 0, PE 1, PE 2, PE 3, ..., PE N connected via an interconnect to a Data/Frame Memory, with a shared Instruction Memory and Control)
  • SIMD: low-power architecture
  • massively parallel: large number of PEs, high performance

67
Xetal-II
68
Xetal-II: Processor details
  • 600 mW
  • 90 nm CMOS
  • 53.5 GOPS (arithmetic only) @ 84 MHz
  • Most computationally efficient programmable silicon in 2007

Kleihorst et al., 2007
69
Xetal-II Block Diagram
70
Xetal-II: Energy Breakdown at 1.2V
  • 125 MHz, 400 mW @ 65 nm
  • 25 ins./pixel (5x5 convolution)
  • 240.8 pJ/pixel → 10 pJ/ins. → 5 pJ/op (worked out below)
  • However: 69% is consumed by the FM!
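The chain of numbers, worked out (the ~2 ops/instruction factor is inferred from the slide's own figures, not stated explicitly):

  240.8 pJ/pixel / 25 ins./pixel ≈ 9.6 ≈ 10 pJ/ins.
  10 pJ/ins. / ~2 ops/ins. ≈ 5 pJ/op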

71
Hybrid Memory Architecture
  • Exploiting data locality
  • The Scratchpad Memory can be bypassed and clock-gated
  • 15% area overhead to the tile
  • ACCU register: short-term data
  • Scratchpad Memory (SM, 32 entries): intermediate-term data
  • Frame Memory (FM): long-term data

72
Energy Breakdown @ 1.2 V
  • The SM is realized by commercial SRAM here
  • Requires 1 extra instruction/px to implement the 5x5 filter
  • SM and PEs dominate the energy consumption
  • 151.9 pJ/pixel, 1.6x reduction, ~3 pJ/op

Implementing the SM with standard cells reduces the SM energy 2x → 2.1x reduction w.r.t. Xetal-II → ~2.4 pJ/op
73
Energy breakdown @ low Vdd
  • SM realized by standard cells
  • FM by commercial SRAM
  • Optimal point: FM 0.7V, SM 0.38V, and PE 0.42V
  • 22.6 pJ/pixel, 12.5x reduction; 0.45 pJ/16-bit op → 0.9 pJ/32-bit op
  • Still rendering 0.7 GOPS

Sub-threshold Scratchpad Memory with
super-threshold Frame Memory
74
ICE Curve Extended with Vdd Scaling
1 pJ/op
75
Can we match the human brain ???
  • Performance: 100 Billion (10^11) Neurons × 1000 (10^3) Connections/Neuron × 200 (2×10^2) Calculations Per Second Per Connection = 2×10^16 Calculations Per Second
  • Memory: 100 Billion (10^11) Neurons × 1000 (10^3) Connections/Neuron × 10 bytes (information about connection strength and address of output neuron, type of synapse) = 10^15 bytes = 1 PB = 1000 TB
  • How far off are we? (a rough comparison below)
  • The brain needs only 20 Watt
  • ... and processors need MegaWatts
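A rough comparison using the slide's own numbers (the ~1 pJ/op figure comes from the earlier ICE-curve slides):

  brain:   20 W / 2×10^16 calculations/s  =  10^-15 J  =  1 fJ per 'calculation'
  silicon: ~1 pJ/op at best  →  roughly a 1000x gap in energy per operation,
           and 2×10^16 op/s at 1 pJ/op would already dissipate ~20 kW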

76
Blue brain research
  • Software replica of one column of the neocortex
  • cortex: 85% of the brain's total mass
  • required for language, learning, memory and complex thought
  • the essential first step to simulating the whole brain
  • Next: include circuitry from other brain regions and
  • eventually the whole brain.

77
Are we / CMOS running out of options?
  • Google yourself
  • Reversible logic gates
  • Adiabatic logic
  • Nano tubes
  • Graphene
  • Bio / Molecular / DNA computing
  • Approximate computing
  • Analog computing
  • e.g. with only 9 transistors you can build a Gilbert multiplier [Mead 1989]
  • Or much better algorithms

78
Reading
  • Low Power Design Essentials (book), Jan Rabaey, Springer 2009
  • DATE 2012 tutorial: "Design Methodology and Techniques in Production Low-Power SOC Designs", Kaijian Shi (Cadence), Thomas Buechner (IBM)