Embedded Computer Architecture - PowerPoint PPT Presentation

Description: Last modified by: Henk Corporaal. Created: 7/10/1998. Document presentation format: On-screen Show (4:3).

Transcript and Presenter's Notes
1
Embedded Computer Architecture
VLIW architectures: Generating VLIW code
  • TU/e 5kk73
  • Henk Corporaal

2
VLIW lectures overview
  • Enhance performance architecture methods
  • Instruction Level Parallelism
  • VLIW
  • Examples
  • C6
  • TM
  • TTA
  • Clustering and Reconfigurable components
  • Code generation
  • compiler basics
  • mapping and scheduling
  • TTA code generation
  • Design space exploration
  • Hands-on

3
Compiler basics
  • Overview
  • Compiler trajectory / structure / passes
  • Control Flow Graph (CFG)
  • Mapping and Scheduling
  • Basic block list scheduling
  • Extended scheduling scope
  • Loop scheduling
  • Loop transformations
  • separate lecture

4
Compiler basics trajectory
Source program → Preprocessor → Compiler → Assembler → Loader/Linker → Object program

(The compiler emits error messages; the linker adds library code.)
5
Compiler basics structure / passes
Source code
    ↓ Lexical analyzer: token generation
    ↓ Parsing: check syntax, check semantics, parse tree generation
Intermediate code
    ↓ Code optimization: data flow analysis, local optimizations, global optimizations
    ↓ Code generation: code selection, peephole optimizations
    ↓ Register allocation: build interference graph, graph coloring, spill code insertion, caller/callee save and restore code
Sequential code
    ↓ Scheduling and allocation: exploiting ILP
Object code
6
Compiler basics structure: Simple example from HLL to (sequential) assembly code

position := initial + rate * 60

    ↓ Lexical analyzer
id1 := id2 + id3 * 60

    ↓ Syntax analyzer
(parse tree)

    ↓ Intermediate code generator
temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3

    ↓ Code optimizer
temp1 := id3 * 60.0
id1 := id2 + temp1

    ↓ Code generator
movf id3, r2
mulf #60.0, r2, r2
movf id2, r1
addf r2, r1
movf r1, id1
7
Compiler basics: Control flow graph (CFG)

CFG shows the flow between basic blocks.

C input code:
    if (a > b) r = a % b;
    else       r = b % a;

1:  sub t1, a, b
    bgz t1, 2, 3
2:  rem r, a, b
    goto 4
3:  rem r, b, a
    goto 4
4:  ...

A program is a collection of functions, each function is a collection of basic blocks, each basic block contains a set of instructions, and each instruction consists of several transports.
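The basic-block partitioning described above can be computed with the classic "leader" rule: the first instruction, every branch target, and every instruction after a branch start a new block. A minimal Python sketch; the instruction encoding here is a hypothetical simplification (branch targets as indices):

```python
def basic_blocks(instrs):
    """Split a linear instruction list into basic blocks.

    Each instruction is (op, target), where target is the index of
    the branch destination (None for non-branches).
    """
    leaders = {0}
    for i, (op, target) in enumerate(instrs):
        if op in ('goto', 'bgz'):            # branch instructions
            if target is not None:
                leaders.add(target)          # branch target is a leader
            if i + 1 < len(instrs):
                leaders.add(i + 1)           # instruction after a branch
    starts = sorted(leaders)
    return [instrs[s:e] for s, e in zip(starts, starts[1:] + [len(instrs)])]

# The if/else example above: compare+branch, then-block, else-block, join
code = [('sub', None), ('bgz', 4),           # block 1
        ('rem', None), ('goto', 5),          # block 2: r = a % b
        ('rem', None),                       # block 3: r = b % a
        ('nop', None)]                       # block 4: join
```

Running `basic_blocks(code)` yields the four blocks of the CFG shown above.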
8
Compiler basics Basic optimizations
  • Machine independent optimizations
  • Machine dependent optimizations

9
Compiler basics Basic optimizations
  • Machine independent optimizations
  • Common subexpression elimination
  • Constant folding
  • Copy propagation
  • Dead-code elimination
  • Induction variable elimination
  • Strength reduction
  • Algebraic identities
  • Commutative expressions
  • Associativity / tree height reduction
  • Note: not always allowed (due to limited precision)
  • For details, check any good compiler book!
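Two of the machine-independent optimizations listed above — constant folding and algebraic identities — can be sketched on toy expression trees. The representation (tuples `('op', left, right)` with int or variable-name leaves) is an illustrative simplification, not the slides' own:

```python
def simplify(e):
    """Constant folding plus the identities x + 0 = x and x * 1 = x."""
    if not isinstance(e, tuple):
        return e                                  # leaf: int or variable
    op, l, r = e[0], simplify(e[1]), simplify(e[2])
    if isinstance(l, int) and isinstance(r, int): # constant folding
        return {'+': l + r, '*': l * r}[op]
    if op == '+' and r == 0:                      # algebraic identity
        return l
    if op == '*' and r == 1:                      # algebraic identity
        return l
    return (op, l, r)
```

For example, `simplify(('+', ('*', 'x', 1), ('+', 2, 3)))` reduces `x*1 + (2+3)` to `('+', 'x', 5)`.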

10
Compiler basics: Basic optimizations
  • Machine dependent optimization example
  • What's the optimal implementation of a * 34?
  • Use a multiplier: mul Tb, Ta, 34
  • Pro: no thinking required
  • Con: may take many cycles
  • Alternative:
  • SHL Tb, Ta, 1   (Tb = a * 2)
  • SHL Tc, Ta, 5   (Tc = a * 32)
  • ADD Tb, Tb, Tc  (Tb = a * 34)
  • Pros: may take fewer cycles
  • Cons:
  • Uses more registers
  • Additional instructions (→ I-cache load / code size)
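The shift/add trick above generalizes to any constant with few set bits. A small Python sketch (function names are mine) that derives and checks such a sequence; real compilers also use subtraction and factoring, e.g. a*15 = (a<<4) - a:

```python
def shift_amounts(c):
    """One left-shift per set bit of constant c (a minimal decomposition)."""
    return [i for i in range(c.bit_length()) if (c >> i) & 1]

def mul_by_const(a, c):
    """Emulate the shift/add sequence: sum of (a << i) over set bits of c."""
    return sum(a << i for i in shift_amounts(c))

# 34 = 2 + 32, so a*34 = (a << 1) + (a << 5): two shifts and one add,
# exactly the SHL/SHL/ADD sequence on the slide.
```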

11
Compiler basics Register allocation
  • Register organization
  • Conventions needed for parameter passing and register usage across function calls

12
Register allocation using graph coloring
  • Given a set of registers, what is the most efficient mapping of registers to program variables in terms of execution time of the program?
  • Some definitions:
  • A variable is defined at a point in a program when a value is assigned to it.
  • A variable is used at a point in a program when its value is referenced in an expression.
  • The live range of a variable is the execution range between definitions and uses of the variable.
13
Register allocation using graph coloring
Live Ranges

(Figure: live ranges, each spanning from a variable's definition to its uses.)
14
Register allocation using graph coloring
Interference Graph

(Figure: interference graph over variables a, b, c, d.)

Coloring: a = red, b = green, c = blue, d = green

Graph needs 3 colors ⇒ program needs 3 registers

Question: map coloring requires (at most) 4 colors; what is the maximum number of colors (= registers) needed for register interference graph coloring?
15
Register allocation using graph coloring
Spill/reload code

Spill/reload code is needed when there are not enough colors (registers) to color the interference graph.

Example: only two registers available!
16
Register allocation for a monolithic RF
Scheme of the optimistic register allocator:

Renumber → Build → Spill costs → Simplify → Select
(spill code insertion feeds back into the allocator)

The Select phase selects a color (= machine register) for a variable that minimizes the heuristic

    h = fdep(col, var) + caller_callee(col, var)

where
    fdep(col, var): a measure for the introduction of false dependencies
    caller_callee(col, var): cost for mapping var on a caller- or callee-saved register
17
Some explanation of the register allocation phases
  • Renumber: the first phase finds all live ranges in a procedure and numbers (renames) them uniquely.
  • Build: this phase constructs the interference graph.
  • Spill costs: in preparation for coloring, a spill cost estimate is computed for every live range. The cost is simply the sum of the execution frequencies of the transports that define or use the variable of the live range.
  • Simplify: this phase removes nodes with degree < k in an arbitrary order from the graph and pushes them on a stack. Whenever it discovers that all remaining nodes have degree ≥ k, it chooses a spill candidate. This node is also removed from the graph and optimistically pushed on the stack, hoping a color will be available in spite of its high degree.
  • Select: colors are selected for nodes. In turn, each node is popped from the stack, reinserted in the interference graph, and given a color.
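The Simplify/Select phases above can be sketched in Python. This is a minimal optimistic (Briggs-style) allocator; the example interference graph in the test is a guess consistent with the 3-coloring shown on the earlier slide, not taken from it:

```python
def color_graph(graph, k, spill_cost):
    """Optimistic Simplify/Select, as described above.

    graph: {node: set of interfering nodes}; k: number of registers;
    spill_cost: {node: cost} (e.g. sum of def/use frequencies).
    Returns (coloring, spilled).
    """
    work = {v: set(ns) for v, ns in graph.items()}
    degree = {v: len(ns) for v, ns in work.items()}
    stack = []
    while work:
        # Simplify: remove a node with degree < k if one exists
        cand = next((v for v in work if degree[v] < k), None)
        if cand is None:
            # all remaining nodes have degree >= k: optimistically push
            # the cheapest spill candidate, hoping a color is free later
            cand = min(work, key=lambda v: spill_cost[v])
        stack.append(cand)
        for n in work.pop(cand):
            if n in work:
                degree[n] -= 1
    # Select: pop nodes, reinsert, give each the lowest free color
    coloring, spilled = {}, []
    while stack:
        v = stack.pop()
        used = {coloring[n] for n in graph[v] if n in coloring}
        free = [c for c in range(k) if c not in used]
        if free:
            coloring[v] = free[0]
        else:
            spilled.append(v)      # spill/reload code must be inserted
    return coloring, spilled
```

With k = 2 on a graph that needs 3 colors, the allocator returns a non-empty spill list, which is exactly the situation of the spill/reload slide.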

18
Compiler basics: Code selection
  • CISC era (before 1985)
  • Code size important
  • Determine shortest sequence of code
  • Many options may exist
  • Pattern matching
  • Example M68020:
  • D1 := D1 + M[ M[10+A1] + 16*D2 + 20 ]  ?
  • ADD ([10,A1], D2*16, 20), D1
  • RISC era
  • Performance important
  • Only few possible code sequences
  • New implementations of old architectures optimize only the RISC part of the instruction set, e.g. i486 / Pentium / M68020
19
Overview
  • Enhance performance architecture methods
  • Instruction Level Parallelism
  • VLIW
  • Examples
  • C6
  • TM
  • TTA
  • Clustering
  • Code generation
  • Compiler basics
  • Mapping and Scheduling of Operations
  • Design Space Exploration TTA framework
  • What is scheduling
  • Basic Block Scheduling
  • Extended Basic Block Scheduling
  • Loop Scheduling

20
Mapping / Scheduling: placing operations in space and time
  • d = a * b
  • e = a + d
  • f = 2 * b + d
  • r = f - e
  • x = z + y
21
How to map these operations?
  • Architecture constraints
  • One Function Unit
  • All operations single cycle latency

22
How to map these operations?
  • Architecture constraints
  • One Add-sub and one Mul unit
  • All operations single cycle latency

23
There are many mapping solutions
24
Scheduling Overview
  • Transforming a sequential program into a parallel program:

    read sequential program
    read machine description file
    for each procedure do
        perform function inlining
    for each procedure do
        transform an irreducible CFG into a reducible CFG
        perform control flow analysis
        perform loop unrolling
        perform data flow analysis
        perform memory reference disambiguation
        perform register allocation
        for each scheduling scope do
            perform instruction scheduling
    write out the parallel program

25
Basic Block Scheduling
  • Basic block: a piece of code which can only be entered from the top (first instruction) and left at the bottom (final instruction)
  • Scheduling a basic block: assign resources and a cycle to every operation
  • List scheduling: heuristic scheduling approach, scheduling the operations one by one
  • Time complexity: O(N), where N is the number of operations
  • Optimal scheduling has time complexity O(exp(N))
  • Question: what is a good scheduling heuristic?

26
Basic Block Scheduling
  • Make a Data Dependence Graph (DDG)
  • Determine the minimal length of the DDG (for the given architecture)
  • = minimal number of cycles to schedule the graph (assuming sufficient resources)
  • Determine:
  • ASAP (As Soon As Possible) cycle: earliest cycle an instruction can be scheduled
  • ALAP (As Late As Possible) cycle: latest cycle an instruction can be scheduled
  • Slack of each operation = ALAP - ASAP
  • Priority of operations = f(slack, # descendants, register impact, ...)
  • Place each operation in the first cycle with sufficient resources
  • Notes:
  • Basic block: a (maximal) piece of consecutive instructions which can only be entered at the first instruction and left at the end
  • Scheduling order: sequential
  • Scheduling priority: determined by the heuristic used, e.g. slack + other contributions
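The ASAP/ALAP/slack computation above can be sketched directly, assuming unit-latency operations and a DDG given as a predecessor map (the representation is mine, not the slides'):

```python
def asap_alap_slack(preds):
    """ASAP, ALAP and slack for a DDG {op: [predecessor ops]},
    assuming every operation takes a single cycle."""
    succs = {v: [] for v in preds}
    for v, ps in preds.items():
        for u in ps:
            succs[u].append(v)

    asap = {}
    def asap_of(v):              # earliest: one cycle after all predecessors
        if v not in asap:
            asap[v] = 1 if not preds[v] else 1 + max(asap_of(u) for u in preds[v])
        return asap[v]
    for v in preds:
        asap_of(v)

    length = max(asap.values())  # minimal schedule length in cycles

    alap = {}
    def alap_of(v):              # latest: one cycle before all successors
        if v not in alap:
            alap[v] = length if not succs[v] else min(alap_of(s) for s in succs[v]) - 1
        return alap[v]
    for v in preds:
        alap_of(v)

    return asap, alap, {v: alap[v] - asap[v] for v in preds}
```

An operation off the critical path (here `d`) gets positive slack, and slack is one natural ingredient of the scheduling priority.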

27
Basic Block Scheduling: determine ASAP and ALAP cycles

(Figure: a DDG with inputs A, B, C, y, z and operations ADD, SUB, NEG, LD, MUL, each node annotated with its <ASAP, ALAP> cycle pair, e.g. <1,1>, <2,2>, <1,3>, <4,4>. We assume all operations are single cycle. slack = ALAP - ASAP.)
28
Cycle based list scheduling
proc Schedule (DDG = (V,E))
beginproc
    ready  := { v | ¬∃(u,v) ∈ E }
    ready' := ready
    sched  := ∅
    current_cycle := 0
    while sched ≠ V do
        for each v ∈ ready' (select in priority order) do
            if ¬ResourceConfl(v, current_cycle, sched) then
                cycle(v) := current_cycle
                sched := sched ∪ {v}
            endif
        endfor
        current_cycle := current_cycle + 1
        ready  := { v | v ∉ sched ∧ ∀(u,v) ∈ E: u ∈ sched }
        ready' := { v | v ∈ ready ∧ ∀(u,v) ∈ E: cycle(u) + delay(u,v) ≤ current_cycle }
    endwhile
endproc
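A Python rendering of the cycle-based list scheduler above, simplified to identical FUs with a fixed delay and a descendant-count priority (one possible instance of the priority heuristic; the pseudocode leaves the heuristic open):

```python
def list_schedule(preds, num_fus=1, delay=1):
    """Cycle-based list scheduling. preds: {op: [predecessor ops]}.
    Returns {op: issue cycle}."""
    succs = {v: [] for v in preds}
    for v, ps in preds.items():
        for u in ps:
            succs[u].append(v)
    cycle, current = {}, 0
    while len(cycle) < len(preds):
        # ready': unscheduled ops whose predecessors completed in time
        ready = [v for v in preds if v not in cycle
                 and all(u in cycle and cycle[u] + delay <= current
                         for u in preds[v])]
        ready.sort(key=lambda v: -len(succs[v]))   # priority heuristic
        issued = 0
        for v in ready:
            if issued < num_fus:                   # resource-conflict check
                cycle[v] = current
                issued += 1
        current += 1
    return cycle
```

With one FU the two independent sources of `c` are serialized; with two FUs they issue together and the schedule shortens, which is exactly the resource/latency trade-off the pseudocode expresses.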
29
Extended Scheduling Scope: look at the CFG

CFG: Control Flow Graph

Code:
    A;
    if cond then B else C;
    D;
    if cond then E else F;
    G;

Q: Why enlarge the scheduling scope?
30
Extended basic block scheduling: Code Motion
Q: Why move code?
  • Downward code motions?
  • a → B, a → C, a → D, c → D, d → D
  • Upward code motions?
  • c → A, d → A, e → B, e → C, e → A

31
Possible Scheduling Scopes
32
Create and Enlarge Scheduling Scope
33
Create and Enlarge Scheduling Scope
34
Comparing scheduling scopes
35
Code movement (upwards) within regions: what to check?

(Figure: an add operation moved from a source block upward into a destination block, past the intervening instructions I on the path between them.)
36
Extended basic block scheduling: Code Motion
  • A dominates B ⇔ A is always executed before B
  • Consequently:
  • A does not dominate B ⇒ code motion from B to A requires code duplication
  • B post-dominates A ⇔ B is always executed after A
  • Consequently:
  • B does not post-dominate A ⇒ code motion from B to A is speculative

Q1 does C dominate E? Q2 does C dominate D? Q3
does F post-dominate D? Q4 does D post-dominate
B?
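The dominance tests above rest on computing dominator sets. A minimal iterative-dataflow sketch (the CFG representation is mine):

```python
def dominators(succs, entry):
    """Dominator sets by iterative dataflow:
    dom(n) = {n} ∪ intersection of dom(p) over predecessors p,
    with dom(entry) = {entry}."""
    nodes = set(succs)
    preds = {n: set() for n in nodes}
    for n, ss in succs.items():
        for s in ss:
            preds[s].add(n)
    dom = {n: set(nodes) for n in nodes}     # start from "everything"
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            incoming = (set.intersection(*(dom[p] for p in preds[n]))
                        if preds[n] else set())
            new = {n} | incoming
            if new != dom[n]:
                dom[n], changed = new, True
    return dom
```

With dominators in hand, "A dominates B" is just `'A' in dom['B']`; post-dominators come from running the same computation on the reversed CFG, and the two checks decide whether a code motion needs duplication or is speculative.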
37
Scheduling Loops
Loop Optimizations

(Figure: a loop CFG with basic blocks A, B, C, D.)
38
Scheduling Loops
  • Problems with unrolling
  • Exploits only parallelism within sets of n
    iterations
  • Iteration start-up latency
  • Code expansion

(Figure: resource utilization over time for (a) basic block scheduling, (b) basic block scheduling with unrolling, and (c) software pipelining.)
39
Software pipelining
  • Software pipelining a loop is
  • Scheduling the loop such that iterations start
    before preceding iterations have finished
  • Or
  • Moving operations across the backedge

(Figure: LD-ML-ST iterations overlapped in time.)

Unrolling (3 times): 5/3 cycles/iteration
Software pipelining: 1 cycle/iteration
(Sequential loop: 3 cycles/iteration)
40
Software pipelining (contd)
  • Basic loop scheduling techniques
  • Modulo scheduling (Rau, Lam)
  • list scheduling with modulo resource constraints
  • Kernel recognition techniques
  • unroll the loop
  • schedule the iterations
  • identify a repeating pattern
  • Examples
  • Perfect pipelining (Aiken and Nicolau)
  • URPR (Su, Ding and Xia)
  • Petri net pipelining (Allan)
  • Enhanced pipeline scheduling (Ebcioglu)
  • fill first cycle of iteration
  • copy this instruction over the backedge

This algorithm is the one most used in commercial compilers.
41
Software pipelining Modulo scheduling
Example Modulo scheduling a loop
  • Prologue fills the SW pipeline with iterations
  • Epilogue drains the SW pipeline
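Modulo scheduling, named above, can be sketched as list scheduling with a modulo reservation table: an op using FU f at cycle c blocks slot (f, c mod II), so iterations started every II cycles never collide on resources. This toy version assumes a unit-latency chain of operations (a simplification of real dependence handling):

```python
def modulo_schedule(ops, ii, fu_count):
    """ops: ordered list of (name, fu) forming a unit-latency chain;
    ii: initiation interval; fu_count: {fu: available units}.
    Returns {name: cycle} or raises if II is infeasible."""
    usage = {}                      # (fu, cycle % ii) -> ops placed there
    cycle, sched = 0, {}
    for name, fu in ops:
        # modulo slots repeat with period II, so II consecutive
        # candidate cycles are enough to scan
        for c in range(cycle, cycle + ii):
            if usage.get((fu, c % ii), 0) < fu_count[fu]:
                cycle = c
                break
        else:
            raise ValueError("II too small: no free modulo slot for " + name)
        sched[name] = cycle
        usage[(fu, cycle % ii)] = usage.get((fu, cycle % ii), 0) + 1
        cycle += 1                  # successor starts next cycle
    return sched
```

With II = 2 the four-op loop body fits; with II = 1 two ALU ops would need the same modulo slot, so the scheduler reports the II as infeasible and a real compiler would retry with II + 1.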

42
Software pipelining: determine II, the Initiation Interval

Cyclic data dependences. Example:

    for (i = 0; ...) A[i+6] = 3 * A[i-1];

    ld  r1, (r2)
    mul r3, r1, 3
    sub r4, r3, 1
    st  r4, (r5)

(Figure: DDG with edges labeled (delay, iteration distance), e.g. (0,1) and (1,0) between successive operations, and a loop-carried edge with distance 6, (1,6), closing the cycle.)

Initiation Interval: II must satisfy

    cycle(v) ≥ cycle(u) + delay(u,v) - II * distance(u,v)
43
Modulo scheduling constraints
MII, the minimum initiation interval, is bounded by cyclic dependences and resources:

MII = max(ResMinII, RecMinII)
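Both bounds can be computed directly once the dependence cycles and resource usage are known. In this sketch the input encodings are my own simplification: each dependence cycle is given explicitly as a list of (delay, distance) edge labels (real compilers enumerate the cycles in the dependence graph):

```python
import math

def rec_min_ii(cycles):
    """Recurrence bound: for each cycle c, II * distance(c) >= delay(c),
    so II >= ceil(total delay / total iteration distance)."""
    return max(math.ceil(sum(d for d, _ in c) / sum(k for _, k in c))
               for c in cycles)

def res_min_ii(ops_per_iter, fus):
    """Resource bound: the most heavily used FU type limits II.
    ops_per_iter: {fu: ops of that type per iteration}; fus: {fu: count}."""
    return max(math.ceil(ops_per_iter[f] / fus[f]) for f in ops_per_iter)

def mii(cycles, ops_per_iter, fus):
    return max(rec_min_ii(cycles), res_min_ii(ops_per_iter, fus))
```

A modulo scheduler starts at MII and increases II until a legal schedule is found.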
44
Let's go back to The Role of the Compiler
  • 9 steps are required to translate an HLL program
  • (see the online book chapter)
  • Front-end compilation
  • Determine dependencies
  • Graph partitioning: make multiple threads (or tasks)
  • Bind partitions to compute nodes
  • Bind operands to locations
  • Bind operations to time slots: scheduling
  • Bind operations to functional units
  • Bind transports to buses
  • Execute operations and perform transports

45
Division of responsibilities between hardware and
compiler
(Figure: the steps (1) Frontend, (2) Determine Dependencies, (3) Binding of Operands, (4) Scheduling, (5) Binding of Operations, (6) Binding of Transports, (7) Execute, with the compiler/hardware boundary drawn per architecture class: Superscalar leaves steps 2-7 to hardware; Dataflow, Multi-threaded, Independence architectures and VLIW move the boundary progressively later; for TTA the compiler does steps 1-6 and hardware only executes.)
46
Overview
  • Enhance performance architecture methods
  • Instruction Level Parallelism
  • VLIW
  • Examples
  • C6
  • TM
  • TTA
  • Clustering
  • Code generation
  • Design Space Exploration TTA framework

47
Mapping applications to processors: the MOVE framework

(Figure: the MOVE framework — architecture parameters and user interaction feed an optimizer; a parametric compiler produces parallel object code and a hardware generator produces the chip of the TTA-based system; feedback loops close the design cycle.)
48
TTA (MOVE) organization

(Figure: TTA organization — function units connected through sockets to the transport buses, with data memory and instruction memory.)
49
Code generation trajectory for TTAs
  • Frontend: GCC or SUIF (adapted)

Application (C) → compiler frontend → sequential code → compiler backend → parallel code

The backend is driven by an architecture description and profiling data; both the sequential and the parallel code can be simulated (with input/output) to verify the mapping.
50
Exploration TTA resource reduction
51
Exploration: TTA connectivity reduction

(Figure: execution time vs. number of connections removed — at first critical connections disappear and bus delay is reduced; eventually the FU stage constrains the cycle time.)
52
Can we do better?
Yes we can !!
  • How ?
  • Code Transformations
  • SFUs Special Function Units
  • Vector processing
  • Multiple Processors

53
Transforming the specification (1)

Based on associativity of the operation: a + (b + c) = (a + b) + c
54
Transforming the specification (2)
d = a * b;  e = a + d;  f = 2*b + d;  r = f - e;  x = z + y

    ⇓  (substitute and simplify: f - e = (2b + d) - (a + d))

r = 2*b - a;  x = z + y
55
Changing the architecture: adding SFUs (special function units)

4-input adder: why is this faster?
56
Changing the architecture: adding SFUs (special function units)
  • In the extreme case, put everything into one unit!

Spatial mapping - no control flow
However: no flexibility / programmability! (but one could use FPGAs)
57
SFUs: fine grain patterns
  • Why use fine-grain SFUs?
  • Code size reduction
  • Register file ports reduction
  • Could be cheaper and/or faster
  • Transport reduction
  • Power reduction (avoid charging non-local wires)
  • Supports a whole application domain!
  • coarse-grain SFUs would only help certain specific applications
  • Which patterns need support?
  • Detection of recurring operation patterns needed
58
SFUs: covering results

Adding only 20 'patterns' of 2 operations dramatically reduces the number of operations (by about 40%)!
59
Exploration resulting architecture
  • Architecture for image processing
  • Several SFUs
  • Note the reduced connectivity

60
Conclusions
  • Billions of embedded processing systems per year
  • how to design these systems quickly, cheaply, correctly, at low power?
  • what will their processing platform look like?
  • VLIWs are very powerful and flexible
  • can be easily tuned to the application domain
  • TTAs are even more flexible, scalable, and lower power

61
Conclusions
  • Compilation for ILP architectures is mature
  • used in commercial compilers
  • However
  • Great discrepancy between available and
    exploitable parallelism
  • Advanced code scheduling techniques needed to
    exploit ILP

62
Bottom line
Do not pay for hardware if
you can do it in software !!
63
Hands-on 1 (2014)
  • HOW FAR ARE YOU?
  • VLIW processor of Silicon Hive (Intel)
  • Map your algorithm
  • Optimize the mapping
  • Optimize the architecture
  • Perform DSE (Design Space Exploration), trading off (→ Pareto curves):
  • Performance,
  • Energy and
  • Area (~ Cost)