Introduction to VLSI Programming Lecture 8: High Performance (DLX)

Slides: 26

Provided by: Keesvan4

1
Introduction to VLSI Programming Lecture 8
High Performance (DLX)

(course 2IN30)
Prof. dr. ir. Kees van Berkel
Dr. Johan Lukkien

2
Time table 2005
date      class (hours)  lab (hours)  subject
Aug. 30   2              0            intro VLSI
Sep. 6    3              0            handshake circuits
Sep. 13   3              0            handshake circuits       (assignment)
Sep. 20   3              0            Tangram
Sep. 27   no lecture
Oct. 4    no lecture
Oct. 11   1              2            demo, fifos, registers   (deadline assignment)
Oct. 18   1              2            design cases
Oct. 25   1              2            DLX introduction
Nov. 1    1              2            low-cost DLX
Nov. 8    1              2            high-speed DLX
Dec. 13   deadline final report
3
Lecture 8

Outline
Recapitulation of Lecture 7
VLSI programming for high performance
parallelism: expressions, commands, loops, pipelining
pipelining the DLX
Lab work: improve performance of Tangram DLX by introducing pipelining

4
DLX instruction formats
Bit fields: 31..26, 25..21, 20..16, 15..11, 10..0
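The bit boundaries above can be turned into a small field extractor. The sketch below is illustrative only: the slide gives just the bit ranges, so the field names (the conventional DLX register-type names opcode, rs1, rs2, rd, func) and the helpers `decode_fields` and `encode_fields` are assumptions.

```python
# Bit ranges taken from the slide: 31..26, 25..21, 20..16, 15..11, 10..0.
# The field names are an assumption (conventional DLX register-type naming).
FIELDS = {
    "opcode": (31, 26),
    "rs1": (25, 21),
    "rs2": (20, 16),
    "rd": (15, 11),
    "func": (10, 0),
}


def decode_fields(word):
    """Split a 32-bit instruction word into the five bit fields."""
    out = {}
    for name, (hi, lo) in FIELDS.items():
        width = hi - lo + 1
        out[name] = (word >> lo) & ((1 << width) - 1)
    return out


def encode_fields(fields):
    """Inverse of decode_fields: pack the five fields into one word."""
    word = 0
    for name, (hi, lo) in FIELDS.items():
        word |= fields[name] << lo
    return word
```

Immediate-type and jump-type instructions group the lower bits differently, so a full decoder would first inspect the opcode field.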
5
Example instructions
6
DLX interface, state
[Figure: DLX CPU with register file Reg (r0, r1, r2, ..., r31) and pc. Connected to the instruction memory (address, instruction) and to Mem, the data memory (address, data, r/w). Further signals: clock, interrupt.]
7
VLSI programming for

Low costs: introduce resource sharing.
Low delay (high throughput): introduce parallelism.
Low energy (low power): reduce activity.

8
VLSI programming for high performance

Keep it simple!!
Make the analysis focus on bottlenecks
Introduce parallelism: expressions, commands, loops, pipelining
Enable parallelism by reducing dependencies such as resource sharing

9
Expression-level parallelism

Examples
balancing: (v+w)+(x+y) is faster than v+w+x+y
substitution: z := g(f(x)) is faster than y := f(x); z := g(y)
carry-select adder
carry-save multiplier
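Why balancing helps can be seen by counting operator stages on the critical path. The sketch below assumes an associative binary operator with unit delay and assumes that an unparenthesized chain is evaluated left to right; `chain_depth` and `balanced_depth` are hypothetical helper names.

```python
def chain_depth(n):
    """Operator stages on the critical path of a left-to-right chain of n operands."""
    return max(n - 1, 0)


def balanced_depth(n):
    """Operator stages on the critical path of a balanced tree of n operands."""
    depth = 0
    while n > 1:
        n = (n + 1) // 2  # each level pairs up the remaining operands
        depth += 1
    return depth
```

For four operands the chain needs three stages in sequence, while the balanced form needs two, because the two inner operations run side by side.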

10
Command level parallelism

If S2 does not depend on the outcome of S1, then S1; S2 can be transformed into S1 || S2.
(Dependencies: data, sharing, synchronization.)
This reduces computation time τ, unless ordering is enforced through external synchronization.
τ(S1; S2) = τ(;) + τ(S1) + τ(S2)
τ(S1 || S2) = τ(||) + max(τ(S1), τ(S2))
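The two timing rules can be tried out numerically. This is a minimal sketch: the function names and the `overhead` parameter (standing for the cost of the composition operator itself) are assumptions made for illustration.

```python
def seq_time(t1, t2, overhead=0):
    """Time of sequential composition: operator overhead plus both commands."""
    return overhead + t1 + t2


def par_time(t1, t2, overhead=0):
    """Time of parallel composition: operator overhead plus the slower command."""
    return overhead + max(t1, t2)
```

With commands of 3 and 5 time units and no overhead, sequential composition takes 8 units and parallel composition 5. The gain is largest when both commands take about equally long.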

11
Exposure of cmd-level parallelism

Let *[S] be a shorthand for forever do S od.
Assume S0 must precede S1 and S1 must precede S2.
How to speed up *[S0; S1; S2]?
  *[S0; S1; S2]
= { loop unfolding }
  S0; *[S1; S2; S0]
= { S0 does not depend on S2 }
  S0; *[S1; (S2 || S0)]
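The effect of the transformation can be checked on a finite prefix of the loop. The sketch below lists the steps of the transformed program, in which S2 of one iteration overlaps with S0 of the next, and lets one verify that S0, S1, S2 of the same iteration still occur in order. The step model and the names `overlapped_schedule` and `step_of` are assumptions for illustration.

```python
def overlapped_schedule(n):
    """Steps of the transformed loop, cut off after n rounds.

    Each step is a set of (command, iteration) pairs that run in parallel.
    """
    steps = [{("S0", 0)}]  # the S0 that was unfolded out of the loop
    for k in range(n):
        steps.append({("S1", k)})
        steps.append({("S2", k), ("S0", k + 1)})
    return steps


def step_of(steps, cmd, k):
    """Index of the step in which command cmd of iteration k runs."""
    for idx, step in enumerate(steps):
        if (cmd, k) in step:
            return idx
    raise KeyError((cmd, k))
```

Three rounds take 7 steps here, against 9 steps for the fully sequential loop.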

12
wagging

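  forever do a?x; b!f(x) od
= { loop unrolling, renaming }
  forever do a?x; b!f(x); a?y; b!f(y) od
= { loop folding }
  a?x; forever do b!f(x); a?y; b!f(y); a?x od
=> { parallelization; increases slack by 1 }
  a?x; forever do (b!f(x) || a?y); (b!f(y) || a?x) od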
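That the final wagging program produces the same output sequence as the original loop can be checked with a small simulation on a finite input list. This is a sketch under simplifying assumptions: each parallel pair is modelled as one step that outputs the result for the current variable and then reads the next input into the other variable. The function names are hypothetical.

```python
def plain(inputs, f):
    """The original loop (read x, output f(x)) run over a finite input list."""
    return [f(x) for x in inputs]


def wagging(inputs, f):
    """The wagging loop: two variables x and y alternate roles."""
    it = iter(inputs)
    out = []
    regs = {"x": None, "y": None}
    cur, other = "x", "y"
    try:
        regs[cur] = next(it)          # initial a?x
    except StopIteration:
        return out
    while True:
        out.append(f(regs[cur]))      # b!f(cur) ...
        try:
            regs[other] = next(it)    # ... in parallel with a?other
        except StopIteration:
            return out
        cur, other = other, cur       # the two variables swap roles
```

The extra variable is what provides the additional slack: a new input can be accepted while the previous result is still being output.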

13
Parallel reads from REG file

Let RF be a register file. Then x := RF[i] ; y := RF[j] cannot be parallelized. (Register files have a single read port.)
Parallel read actions can be realized by doubling the register file:
write: <<RF[i], RG[i]>> := <<z, z>>
read:  <<x, y>> := <<RF[i], RG[j]>>
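The doubling scheme can be sketched in a few lines: every write goes to both copies, so the two copies always agree and each can serve one read. The class name, the method names, and the default size of 32 registers are assumptions for illustration.

```python
class DoubledRegisterFile:
    """Two identical single-read-port register files, RF and RG."""

    def __init__(self, size=32):
        self.rf = [0] * size
        self.rg = [0] * size

    def write(self, i, z):
        # A write updates both copies so they stay identical.
        self.rf[i] = z
        self.rg[i] = z

    def read2(self, i, j):
        # Each copy serves one of the two reads.
        return self.rf[i], self.rg[j]
```

The price of the extra read port is a second copy of the storage, which is the kind of cost versus speed trade-off the lecture is about.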

14
Pipelining in Tangram

Compare three programs
P0: forever do a?x0; b!f2(f1(f0(x0))) od
P1: forever do a?x0; x1 := f0(x0); x2 := f1(x1); b!f2(x2) od
P2: forever do a?x0; a1!f0(x0) od
    || forever do a1?x1; a2!f1(x1) od
    || forever do a2?x2; b!f2(x2) od
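The data flow of P0 and P2 can be mimicked with generators, one generator per pipeline stage. This sketch only models which values travel over which channel; generators run one at a time, so it says nothing about timing or true concurrency. The helper names `p0`, `p2`, and `stage` are assumptions.

```python
def p0(inputs, f0, f1, f2):
    """One process applying all three functions per input."""
    return [f2(f1(f0(x0))) for x0 in inputs]


def stage(source, f):
    """One pipeline stage: receive a value, send f(value), repeat."""
    for value in source:
        yield f(value)


def p2(inputs, f0, f1, f2):
    """Three stages chained through the internal channels a1 and a2."""
    a1 = stage(iter(inputs), f0)
    a2 = stage(a1, f1)
    b = stage(a2, f2)
    return list(b)
```

For any input list, `p0` and `p2` return the same output sequence.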

15
Pipelining in Tangram (cntd)

Output sequence b is identical for P0, P1, and P2.
P0 and P1 have the same communication behavior; P1 is larger, slower, and warmer.
P2 vs P1: similar in size, energy, and latency, but up to 3 times higher throughput, depending on the (relative) complexity of f0, f1, f2.
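The "up to 3 times" figure follows from a simple cycle-time model. The sketch below assumes that communication costs nothing, that P1 needs the sum of the three function delays per item, and that P2 is limited by its slowest stage; the function names and delay parameters are illustrative.

```python
def cycle_time_p1(d0, d1, d2):
    """P1 evaluates f0, f1, f2 one after the other for each item."""
    return d0 + d1 + d2


def cycle_time_p2(d0, d1, d2):
    """In steady state P2 is limited by its slowest stage."""
    return max(d0, d1, d2)


def speedup(d0, d1, d2):
    """Throughput of P2 relative to P1 under this model."""
    return cycle_time_p1(d0, d1, d2) / cycle_time_p2(d0, d1, d2)
```

Equal delays give the full factor of 3; if one function dominates, for example delays 1, 1, 4, the factor drops to 1.5.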

16
DLX 5-step sequential execution
IF -> ID -> EX -> MM -> WB
17
DLX pipelined execution
[Figure: pipeline diagram. Horizontal axis: time in clock cycles (1, 2, 3, 4, 5, 6, 7, 8, ...). Vertical axis: program execution, in instructions.]
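The staircase shape of such a diagram can be generated directly. The sketch below assumes an ideal pipeline (one instruction enters per clock cycle, no stalls) and uses the five step names IF, ID, EX, MM, WB from the sequential-execution slide; `pipeline_diagram` is a hypothetical helper.

```python
STAGES = ["IF", "ID", "EX", "MM", "WB"]


def pipeline_diagram(n_instructions):
    """Rows of an ideal pipeline chart.

    Row i is instruction i, column c is clock cycle c + 1;
    an empty string means the instruction is not in the pipeline.
    """
    n_cycles = n_instructions + len(STAGES) - 1
    rows = []
    for i in range(n_instructions):
        row = [""] * n_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = name  # instruction i enters stage s in cycle i + s + 1
        rows.append(row)
    return rows
```

Four instructions finish after 8 clock cycles in this model, against 20 cycles when each instruction runs all five steps before the next one starts.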
18
DLX pipelined execution
[Figure: pipelined datapath with the stages Instruction Fetch, Inst. Decode, EXecute, Memory, Write Back. Components: pc, Instr. mem, Reg, Mem. Other labels: "4", "0?".]
19
Lab work

Assignment 5
Create a 2-stage pipelined dlx2.tg. Throughput must exceed 5 MIPS (benchmark: GCD).
Design a reduced-costs version dlx2s.tg.
Note: use of shared variables is not allowed. Let command S1 || S2 be part of your DLX. When S1 has write access to variable x, S2 may neither read nor write x (and vice versa).
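The shared-variable rule can be phrased as a check on the read and write sets of the two commands. This is a minimal sketch: the function name and the representation of a command by two collections of variable names are assumptions, not part of the Tangram tool set.

```python
def may_run_in_parallel(reads1, writes1, reads2, writes2):
    """True if two commands respect the no-shared-variables rule.

    A variable written by one command may be neither read nor
    written by the other command.
    """
    r1, w1, r2, w2 = (set(s) for s in (reads1, writes1, reads2, writes2))
    return not (w1 & (r2 | w2)) and not (w2 & (r1 | w1))
```

Two commands that only read the same variable pass the check; as soon as one of them writes it, they do not.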

20
Next week: lecture 9

Outline
Pipelining the DLX, using branch-delay slots.
Lab work: Assignment 6 (3-stage DLX)

21
DLX system organization
[Figure: system_dlx() contains dlx(), rom(), and ram() inside the system boundary. Channels: ROMaddr, ROMdata, RAMaddr, datatoRAM, datafromRAM. Files: gcd.bin, RAMout, RAMin.]
22
dlx0.ht

include "types.ht"
dlx0 export proc ( ROMaddr!chan adtype
                   ROMdata?chan word
                   RAMaddr!chan rwadtype
                   datatoRAM!chan S30
                   datafromRAM?chan S30
                 ) .
begin
  RF ram array U5 of S30
end

23
system_dlx0.ht

include "dlx0.ht"
dlx0 proc ( ROMaddr!chan adtype
            ROMdata?chan word
            RAMaddr!chan rwadtype
            datatoRAM!chan S30
            datafromRAM?chan S30
          ) . import
env_dlx4 main proc (
            ROMfile? chan word
            RAMinfile? chan S30
            RAMfile! chan S30   / <<address,data>> /
          ) .
begin
  next slide
end

24
system_dlx0.ht main body

begin
  ROMaddr chan adtype
  ROMdata chan word
  RAMaddr chan rwadtype
  datatoRAM chan S30
  datafromRAM chan S30
  ROMinterface proc() . begin .. end
  RAMinterface proc() . begin .. end
  initialise()
  ROMinterface()
  RAMinterface()
  dlx0( ROMaddr, ROMdata, RAMaddr, datatoRAM, datafromRAM )
end

25
script