Title: Introduction to VLSI Programming Lecture 8: High Performance (DLX)
1 Introduction to VLSI Programming Lecture 8: High Performance (DLX)
- Course 2IN30
- Prof. dr. ir. Kees van Berkel
- Dr. Johan Lukkien
2 Time table 2005
date      class  lab      subject
Aug. 30   2      0 hours  intro VLSI
Sep. 6    3      0 hours  handshake circuits
Sep. 13   3      0 hours  handshake circuits; assignment
Sep. 20   3      0 hours  Tangram
Sep. 27                   no lecture
Oct. 4                    no lecture
Oct. 11   1      2 hours  demo, fifos, registers; deadline assignment
Oct. 18   1      2 hours  design cases
Oct. 25   1      2 hours  DLX introduction
Nov. 1    1      2 hours  low-cost DLX
Nov. 8    1      2 hours  high-speed DLX
Dec. 13                   deadline final report
3 Lecture 8
- Outline
- Recapitulation of Lecture 7
- VLSI programming for high performance
- parallelism: expressions, commands, loops, pipelining
- pipelining the DLX
- Lab work: improve performance of Tangram DLX by introducing pipelining
4 DLX instruction formats
- (Figure: 32-bit instruction word with field boundaries at bits 31-26, 25-21, 20-16, 15-11, and 10-0.)
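The field boundaries on this slide can be illustrated with a small decoder. This is a minimal sketch in Python, not part of the course material. The field names (opcode, rs1, rs2, rd, func) are an assumption taken from the usual DLX R-type layout, and `decode_fields` is a hypothetical helper.

```python
def decode_fields(word: int) -> dict:
    """Split a 32-bit instruction word at the bit boundaries listed on the slide."""
    if not 0 <= word < 2**32:
        raise ValueError("instruction word must fit in 32 bits")
    return {
        "opcode": (word >> 26) & 0x3F,  # bits 31..26 (6 bits)
        "rs1": (word >> 21) & 0x1F,     # bits 25..21 (5 bits), name assumed
        "rs2": (word >> 16) & 0x1F,     # bits 20..16 (5 bits), name assumed
        "rd": (word >> 11) & 0x1F,      # bits 15..11 (5 bits), name assumed
        "func": word & 0x7FF,           # bits 10..0 (11 bits), name assumed
    }


# Build a word from known field values and decode it again.
word = (0 << 26) | (1 << 21) | (2 << 16) | (3 << 11) | 0x20
fields = decode_fields(word)
```

Other DLX formats reuse the same word with wider fields (for example a 16-bit immediate in bits 15-0), so a full decoder would first inspect the opcode.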
5 Example instructions
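6 DLX interface, state
- (Figure: the DLX CPU, holding the register file Reg (r0, r1, r2, ..., r31) and pc. It connects to the Instruction memory (address, instruction) and to Mem, the Data memory (address, data, r/w). Further inputs: clock, interrupt.)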
7 VLSI programming for
- Low costs
- introduce resource sharing.
- Low delay (high throughput)
- introduce parallelism.
- Low energy (low power)
- reduce activity
8 VLSI programming for high performance
- Keep it simple!!
- Make the analysis focus on bottlenecks
- Introduce parallelism: expressions, commands, loops, pipelining
- Enable parallelism by reducing dependencies, such as resource sharing
9 Expression-level parallelism
- Examples
- balancing: (v+w)+(x+y) is faster than v+w+x+y
- substitution: z := g(f(x)) is faster than y := f(x) ; z := g(y)
- carry-select adder
- carry-save multiplier
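The balancing example can be checked by counting operator levels. The sketch below is an illustration only. It assumes every binary operator has the same unit delay and that all operands are available at the same time; `chain_depth` and `balanced_depth` are hypothetical names.

```python
def chain_depth(n_operands: int) -> int:
    """Operator levels when operands are combined one after another."""
    return max(n_operands - 1, 0)


def balanced_depth(n_operands: int) -> int:
    """Operator levels when operands are combined pairwise in a balanced tree."""
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1.
    return (n_operands - 1).bit_length() if n_operands > 1 else 0


# Four operands: a left-to-right chain versus a balanced tree.
chain = chain_depth(4)        # three operators in series
balanced = balanced_depth(4)  # two levels, the lower level runs in parallel
```

The gap grows with the number of operands: the chain is linear in the operand count, the balanced tree logarithmic.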
10 Command-level parallelism
- If S2 does not depend on the outcome of S1, then S1 ; S2 can be transformed into S1 || S2. (Dependencies: data, sharing, synchronization.)
- This reduces computation time $\tau$, unless ordering is enforced through external synchronization.
- $\tau(S1 ; S2) = \tau(;) + \tau(S1) + \tau(S2)$
- $\tau(S1 \parallel S2) = \tau(\parallel) + \max(\tau(S1), \tau(S2))$
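The two timing relations on this slide can be tried out with numbers. This is a minimal sketch: the delays and the composition overhead are made-up values, and `seq_time` and `par_time` are hypothetical names.

```python
def seq_time(t1: float, t2: float, overhead: float = 0.0) -> float:
    """Time of sequential composition: overhead plus both command times."""
    return overhead + t1 + t2


def par_time(t1: float, t2: float, overhead: float = 0.0) -> float:
    """Time of parallel composition: overhead plus the slower command."""
    return overhead + max(t1, t2)


# Assumed delays: S1 takes 3 units, S2 takes 5 units, composition costs 1 unit.
t_seq = seq_time(3.0, 5.0, overhead=1.0)
t_par = par_time(3.0, 5.0, overhead=1.0)
saving = t_seq - t_par  # equals the time of the faster command
```

The saving equals the time of the faster command only when both compositions have the same overhead, which need not hold in a real circuit.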
11 Exposure of cmd-level parallelism
- Let *[S] be a shorthand for forever do S od
- Assume S0 must precede S1 and S1 must precede S2. How to speed up *[S0 ; S1 ; S2]?
- *[S0 ; S1 ; S2]
- loop unfolding: S0 ; *[S1 ; S2 ; S0]
- S0 does not depend on S1: S0 ; *[S1 ; (S2 || S0)]
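Loop unfolding should not change the order in which commands run. The sketch below checks this by comparing the first commands of the original loop (forever do S0, S1, S2 in sequence) with those of the unfolded form (S0 once, then forever S1, S2, S0). Commands are modeled as plain strings, which is an assumption made for illustration.

```python
from itertools import chain, cycle, islice


def original():
    """Endless stream of command names for the loop body S0, S1, S2."""
    return cycle(["S0", "S1", "S2"])


def unfolded():
    """First S0 peeled off, followed by the rotated loop body S1, S2, S0."""
    return chain(["S0"], cycle(["S1", "S2", "S0"]))


n = 12
trace_original = list(islice(original(), n))
trace_unfolded = list(islice(unfolded(), n))
same_order = trace_original == trace_unfolded
```

Once S2 and the following S0 sit next to each other inside one loop body, they become candidates for parallel composition.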
12 wagging
- *[a?x ; b!f(x)]
- loop unrolling, renaming
- *[a?x ; b!f(x) ; a?y ; b!f(y)]
- loop folding
- a?x ; *[b!f(x) ; a?y ; b!f(y) ; a?x]
- parallel composition (increases slack by 1)
- a?x ; *[(b!f(x) || a?y) ; (b!f(y) || a?x)]
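The transformation steps on this slide should leave the output sequence on b unchanged. The sketch below compares the one-variable loop with the folded two-variable version. It runs sequentially, so it shows the equivalence of outputs only, not the gain in speed; channel a is modeled as a finite Python iterable and channel b as a list, both assumptions made for illustration.

```python
def simple(inputs, f):
    """One-variable loop: read x from a, write f(x) to b, repeat."""
    return [f(x) for x in inputs]


def wagged(inputs, f):
    """Folded two-variable loop: x and y take turns as the input buffer."""
    out = []
    it = iter(inputs)
    try:
        x = next(it)          # a?x, the prologue before the loop
        while True:
            out.append(f(x))  # b!f(x), may overlap with the next read
            y = next(it)      # a?y
            out.append(f(y))  # b!f(y), may overlap with the next read
            x = next(it)      # a?x
    except StopIteration:
        pass                  # input exhausted: stop the otherwise endless loop
    return out


def square(v):
    return v * v


data = [1, 2, 3, 4, 5]
out_simple = simple(data, square)
out_wagged = wagged(data, square)
```

In hardware the paired write and read would run concurrently, which is where the extra slack comes from.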
13 Parallel reads from REG file
- Let RF be a register file. Then x := RF[i] ; y := RF[j] cannot be parallelized. (Register files have a single read port.)
- Parallel read actions can be realized by doubling the register file:
- write: <<RF[i], RG[i]>> := <<z, z>>
- read: <<x, y>> := <<RF[i], RG[j]>>
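Doubling the register file can be modeled with two copies that are always written together. This is a minimal sketch: `DoubledRegisterFile` and its method names are hypothetical, and the size of 32 is assumed from the registers r0 to r31 shown earlier.

```python
class DoubledRegisterFile:
    """Two identical single-read-port banks, RF and RG, kept in step on writes."""

    def __init__(self, size: int = 32):
        self.rf = [0] * size  # bank RF
        self.rg = [0] * size  # bank RG, a mirror of RF

    def write(self, i: int, z: int) -> None:
        """Write z to register i in both banks, so the copies never diverge."""
        self.rf[i] = z
        self.rg[i] = z

    def read_pair(self, i: int, j: int):
        """Read register i from RF and register j from RG: one read per bank."""
        return self.rf[i], self.rg[j]


regs = DoubledRegisterFile()
regs.write(1, 10)
regs.write(2, 20)
x, y = regs.read_pair(1, 2)
```

The price is twice the storage and a write to both banks, in exchange for two reads that no longer compete for one port.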
14 Pipelining in Tangram
- Compare three programs
- P0: *[a?x0 ; b!f2(f1(f0(x0)))]
- P1: *[a?x0 ; x1 := f0(x0) ; x2 := f1(x1) ; b!f2(x2)]
- P2: *[a?x0 ; a1!f0(x0)] || *[a1?x1 ; a2!f1(x1)] || *[a2?x2 ; b!f2(x2)]
15 Pipelining in Tangram (cont'd)
- Output sequence b identical for P0, P1, and P2.
- P0 and P1 have the same communication behavior; P1 is larger, slower, and warmer.
- P2 vs P1: similar in size, energy, and latency, but up to 3 times higher throughput, depending on the (relative) complexity of f0, f1, f2.
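The "up to 3 times" bound follows from a simple cycle-time model. The sketch below assumes the sequential program needs the sum of the three function delays per item, while the three-stage pipeline is limited by its slowest stage. Communication overhead is ignored and the delay values are made up.

```python
def sequential_cycle(stage_delays):
    """Time per item when f0, f1, f2 run one after another."""
    return sum(stage_delays)


def pipelined_cycle(stage_delays):
    """Time per item in steady state: the slowest stage sets the pace."""
    return max(stage_delays)


def speedup(stage_delays):
    """Throughput gain of the pipelined over the sequential program."""
    return sequential_cycle(stage_delays) / pipelined_cycle(stage_delays)


balanced = speedup([1.0, 1.0, 1.0])  # equal stages reach the full factor 3
skewed = speedup([1.0, 1.0, 4.0])    # one slow stage dominates
```

With one stage four times slower than the others the gain drops to 1.5, which is why the relative complexity of f0, f1, and f2 matters.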
16 DLX 5-step sequential execution
- IF (instruction fetch), ID (instruction decode), EX (execute), MM (memory), WB (write back)
17 DLX pipelined execution
- (Figure: pipeline diagram. Horizontal axis: time in clock cycles (1, 2, 3, 4, 5, 6, 7, 8, ...). Vertical axis: program execution, in instructions.)
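A pipeline diagram of this kind can be generated in a few lines. The sketch below assumes an ideal five-stage pipeline in which a new instruction starts every clock cycle and nothing stalls; `pipeline_diagram` is a hypothetical helper and ".." marks cycles in which an instruction is not in the pipeline.

```python
STAGES = ["IF", "ID", "EX", "MM", "WB"]


def pipeline_diagram(n_instructions: int, n_cycles: int = 8):
    """Return one row per instruction; each row lists the stage per clock cycle."""
    rows = []
    for i in range(n_instructions):
        row = []
        for clock in range(1, n_cycles + 1):
            stage = clock - 1 - i  # instruction i enters IF in cycle i + 1
            row.append(STAGES[stage] if 0 <= stage < len(STAGES) else "..")
        rows.append(row)
    return rows


rows = pipeline_diagram(4)
for i, row in enumerate(rows):
    print(f"instr {i}: " + " ".join(row))
```

Reading a column top to bottom shows which stage each instruction occupies in that clock cycle.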
18 DLX pipelined execution
- (Figure: pipelined DLX datapath, divided into the stages Instruction Fetch, Inst. Decode, EXecute, Memory, and Write Back. Labelled elements: pc, 4, 0?, Instr. mem, Reg, Mem.)
19 Lab work
- Assignment 5
- Create a 2-stage pipelined dlx2.tg. Throughput must exceed 5 MIPS (benchmark GCD).
- Design a reduced-cost version dlx2s.tg.
- Note: use of shared variables is not allowed. Let command S1 || S2 be part of your DLX. When S1 has write access to variable x, S2 may neither read nor write x (and vice versa).
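The rule on shared variables can be stated as a check on the read and write sets of the two parallel commands. The sketch below is an illustration, not a tool from the course: `violates_sharing_rule` is a hypothetical function and the variable names in the example are invented.

```python
def violates_sharing_rule(reads1, writes1, reads2, writes2):
    """True if either command writes a variable that the other reads or writes."""
    conflict_1 = writes1 & (reads2 | writes2)  # S1 writes what S2 touches
    conflict_2 = writes2 & (reads1 | writes1)  # S2 writes what S1 touches
    return bool(conflict_1 or conflict_2)


# Invented example: a fetch stage writes ir and pc, an execute stage reads ir.
bad = violates_sharing_rule(
    reads1={"pc"}, writes1={"ir", "pc"},
    reads2={"ir"}, writes2={"acc"},
)
# Reading the same variable on both sides is allowed, as nobody writes it.
ok = violates_sharing_rule(
    reads1={"k"}, writes1={"u"},
    reads2={"k"}, writes2={"v"},
)
```

Two commands that only read a common variable pass the check, since the rule applies only when one side has write access.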
20 Next week: lecture 9
- Outline
- Pipelining the DLX, using branch-delay slots.
- Lab work: Assignment 6 (3-stage DLX)
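21 DLX system organization
- (Figure: system_dlx() encloses dlx(), rom(), and ram() within the system boundary. Channels at dlx(): ROMaddr and ROMdata towards rom(); RAMaddr, datatoRAM, and datafromRAM towards ram(). Files: gcd.bin, RAMin, RAMout.)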
22 dlx0.ht
- include types.ht
- dlx0 export proc ( ROMaddr!chan adtype
- ROMdata?chan word
- RAMaddr!chan rwadtype
- datatoRAM!chan S30
- datafromRAM?chan S30
- ) .
- begin
- RF ram array U5 of S30
- end
23 system_dlx0.ht
- include "dlx0.ht"
- dlx0 proc ( ROMaddr!chan adtype
- ROMdata?chan word
- RAMaddr!chan rwadtype
- datatoRAM!chan S30
- datafromRAM?chan S30
- ) . import
- env_dlx4 main proc (
- ROMfile? chan word
- RAMinfile? chan S30
- RAMfile! chan S30 /* <<address,data>> */
- ) .
- begin
- (main body: next slide)
- end
24 system_dlx0.ht main body
- begin
- ROMaddr chan adtype
- ROMdata chan word
- RAMaddr chan rwadtype
- datatoRAM chan S30
- datafromRAM chan S30
- ROMinterface proc() . begin .. end
- RAMinterface proc() . begin .. end
- initialise()
- ROMinterface()
- RAMinterface()
- dlx0( ROMaddr, ROMdata, RAMaddr, datatoRAM, datafromRAM )
- end
25 script
- htcomp -B system_dlx0
- htsim -limit 1000 system_dlx0 gcd.bin RAMin RAMout
- htview system_dlx0