Introduction to VLSI Programming Lecture 8: High Performance (DLX)

1
Introduction to VLSI Programming Lecture 8
High Performance (DLX)
  • (course 2IN30)
  • Prof. dr. ir. Kees van Berkel
  • Dr. Johan Lukkien

2
Timetable 2005
date      class    lab      subject
Aug. 30   2 hours  0 hours  intro VLSI
Sep. 6    3 hours  0 hours  handshake circuits
Sep. 13   3 hours  0 hours  handshake circuits; assignment
Sep. 20   3 hours  0 hours  Tangram
Sep. 27   no lecture
Oct. 4    no lecture
Oct. 11   1 hour   2 hours  demo, fifos, registers; deadline assignment
Oct. 18   1 hour   2 hours  design cases
Oct. 25   1 hour   2 hours  DLX introduction
Nov. 1    1 hour   2 hours  low-cost DLX
Nov. 8    1 hour   2 hours  high-speed DLX
Dec. 13                     deadline final report
3
Lecture 8
  • Outline
  • Recapitulation of Lecture 7
  • VLSI programming for high performance
  • parallelism: expressions, commands, loops, pipelining
  • pipelining the DLX
  • Lab work: improve the performance of the Tangram DLX by introducing pipelining

4
DLX instruction formats
[Figure: instruction format with fields at bit positions 31..26, 25..21, 20..16, 15..11, 10..0]
5
Example instructions
6
DLX interface, state
[Figure: the DLX CPU contains the pc and the register file Reg (r0, r1, r2, ..., r31); it is connected to an instruction memory (address out, instruction in) and to the data memory Mem (address, data, r/w), and has clock and interrupt inputs]
7
VLSI programming for
  • Low cost: introduce resource sharing.
  • Low delay (high throughput): introduce parallelism.
  • Low energy (low power): reduce activity.

8
VLSI programming for high performance
  • Keep it simple!!
  • Focus the analysis on bottlenecks.
  • Introduce parallelism: expressions, commands, loops, pipelining.
  • Enable parallelism by reducing dependencies, such as resource sharing.

9
Expression-level parallelism
  • Examples:
  • balancing: (v*w)*(x*y) is faster than v*w*x*y
  • substitution: z := g(f(x)) is faster than y := f(x); z := g(y)
  • carry-select adder (see the sketch below)
  • carry-save multiplier
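
The carry-select adder mentioned above is a good example of buying speed with parallel hardware. The following C sketch is illustrative only (it is not part of the course material; the 16-bit split and the name carry_select_add32 are chosen here for the example): both candidate high halves are computed while the low half is still adding, and the low half's carry merely selects one of them.

    #include <stdint.h>
    #include <stdio.h>

    /* Behavioral sketch of a 32-bit carry-select adder built from 16-bit halves.
       In hardware the three 16-bit additions proceed in parallel, so the critical
       path is roughly one 16-bit adder plus a multiplexer instead of a full
       32-bit ripple carry. */
    static uint32_t carry_select_add32(uint32_t a, uint32_t b)
    {
        uint32_t lo  = (a & 0xFFFFu) + (b & 0xFFFFu); /* low half; carry appears in bit 16 */
        uint32_t hi0 = (a >> 16) + (b >> 16);         /* high half, assuming carry-in = 0 */
        uint32_t hi1 = hi0 + 1u;                      /* high half, assuming carry-in = 1 */
        uint32_t c   = (lo >> 16) & 1u;               /* actual carry out of the low half */
        return ((c ? hi1 : hi0) << 16) | (lo & 0xFFFFu);
    }

    int main(void)
    {
        uint32_t a = 0x1234FFFFu, b = 0x00010001u;
        printf("%08x (expected %08x)\n", carry_select_add32(a, b), a + b);
        return 0;
    }

Printing both values makes the check easy: the function must agree with the machine's own 32-bit addition for any operands.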

10
Command level parallelism
  • If S2 does not depend on the outcome of S1, then S1 ; S2 can be transformed into S1 || S2. (Dependencies: data, sharing, synchronization.)
  • This reduces the computation time τ, unless the ordering is enforced through external synchronization.
  • τ(S1 ; S2) = τ(;) + τ(S1) + τ(S2)
  • τ(S1 || S2) = τ(||) + max(τ(S1), τ(S2))
  • For example, with τ(S1) = 3 and τ(S2) = 5, sequential composition takes about 8 time units and parallel composition about 5, each plus its composition overhead.

11
Exposure of cmd-level parallelism
  • Let S* be shorthand for forever do S od.
  • Assume S0 must precede S1, and S1 must precede S2. How to speed up (S0 ; S1 ; S2)* ?
  • (S0 ; S1 ; S2)*
  • loop unfolding: S0 ; (S1 ; S2 ; S0)*
  • S0 does not depend on S1: S0 ; (S1 || (S2 ; S0))* (see the sketch below)
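
The loop-unfolding step above can be mimicked in ordinary C (a sketch only; s0, s1, s2 are placeholder functions standing for the three commands). C stays sequential, but the rotation shows why S1 no longer has to wait for the S0 in the same loop body: that S0 already belongs to the next iteration, so in Tangram the rotated body may become S1 || (S2 ; S0).

    #include <stdio.h>

    /* Placeholders for the three commands S0, S1, S2. */
    static void s0(void) { puts("S0"); }
    static void s1(void) { puts("S1"); }
    static void s2(void) { puts("S2"); }

    int main(void)
    {
        /* original schedule: (S0 ; S1 ; S2)*
           for (;;) { s0(); s1(); s2(); }              */

        /* after loop unfolding: S0 ; (S1 ; S2 ; S0)*  */
        s0();
        for (int i = 0; i < 3; i++) {  /* bounded here so the sketch terminates */
            s1();
            s2();
            s0();   /* S0 of the next iteration, a candidate to run alongside S1 */
        }
        return 0;
    }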

12
wagging
  • (a?x ; b!f(x))*
  • loop unrolling, renaming:
  • (a?x ; b!f(x) ; a?y ; b!f(y))*
  • loop folding:
  • a?x ; (b!f(x) ; a?y ; b!f(y) ; a?x)*
  • parallelizing the output with the next input increases the slack by 1:
  • a?x ; ((b!f(x) || a?y) ; (b!f(y) || a?x))*

13
Parallel reads from REG file
  • Let RF be a register file. Then x := RF[i] || y := RF[j] cannot be parallelized. (Register files have a single read port.)
  • Parallel read actions can be realized by doubling the register file: write <<RF[i], RG[i]>> := <<z, z>> and read <<x, y>> := <<RF[i], RG[j]>> (a behavioral sketch follows below).
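
A behavioral C sketch of the doubled register file (illustrative, not from the slides; the copy names RF and RG follow the slide, and the 32-entry size matches the DLX register count). Every write updates both copies, so each copy needs only one read port and the two source-operand reads of an instruction can proceed in parallel.

    #include <stdint.h>
    #include <stdio.h>

    #define NREGS 32

    /* Two mirrored copies of the register file. */
    static int32_t RF[NREGS];
    static int32_t RG[NREGS];

    /* write: <<RF[i], RG[i]>> := <<z, z>> */
    static void reg_write(unsigned i, int32_t z)
    {
        RF[i] = z;
        RG[i] = z;
    }

    /* read: <<x, y>> := <<RF[i], RG[j]>> -- one read per copy, so both can
       happen at the same time in hardware */
    static void reg_read2(unsigned i, unsigned j, int32_t *x, int32_t *y)
    {
        *x = RF[i];
        *y = RG[j];
    }

    int main(void)
    {
        int32_t x, y;
        reg_write(1, 42);
        reg_write(2, -7);
        reg_read2(1, 2, &x, &y);
        printf("x=%d y=%d\n", x, y);  /* prints x=42 y=-7 */
        return 0;
    }

The price is a second copy of the storage, which matches the slide's point that this form of parallelism is bought with area.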

14
Pipelining in Tangram
  • Compare three programs (each body repeats forever; see the sketch below):
  • P0 = (a?x0 ; b!f2(f1(f0(x0))))*
  • P1 = (a?x0 ; x1 := f0(x0) ; x2 := f1(x1) ; b!f2(x2))*
  • P2 = (a?x0 ; a1!f0(x0))* || (a1?x1 ; a2!f1(x1))* || (a2?x2 ; b!f2(x2))*
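
The next C sketch emulates P1 and P2 cycle by cycle (illustrative only; f0, f1, f2 are arbitrary placeholder functions, and the valid flags v1, v2, vout stand in for the handshakes on channels a1 and a2). Both versions emit the same output sequence, but in the pipelined loop every step advances all three stages at once, so a result appears each step once the pipeline has filled.

    #include <stdio.h>

    /* Placeholder stage functions standing for f0, f1, f2. */
    static int f0(int v) { return v + 1; }
    static int f1(int v) { return v * 2; }
    static int f2(int v) { return v - 3; }

    #define N 6  /* number of input tokens */

    int main(void)
    {
        int input[N] = {10, 20, 30, 40, 50, 60};

        /* Non-pipelined (P0/P1): one result per "long" iteration. */
        for (int i = 0; i < N; i++)
            printf("P1 out: %d\n", f2(f1(f0(input[i]))));

        /* Pipelined (P2): three stages each hold one token; stages are updated
           back to front so that every step they all work on different tokens. */
        int x1 = 0, x2 = 0, out = 0;   /* stage registers */
        int v1 = 0, v2 = 0, vout = 0;  /* "this stage holds a token" flags */
        for (int step = 0; step < N + 2; step++) {
            if (vout) printf("P2 out: %d\n", out);
            out = v2 ? f2(x2) : 0;                  vout = v2;   /* stage 3 */
            x2  = v1 ? f1(x1) : 0;                  v2   = v1;   /* stage 2 */
            x1  = (step < N) ? f0(input[step]) : 0;
            v1  = (step < N);                                    /* stage 1 */
        }
        if (vout) printf("P2 out: %d\n", out);      /* drain the last token */
        return 0;
    }

Here the overlap is only simulated; in the Tangram version P2 the three loops really do run concurrently, which is where the throughput gain discussed on the next slide comes from.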

15
Pipelining in Tangram (cntd)
  • The output sequence on b is identical for P0, P1, and P2.
  • P0 and P1 have the same communication behavior; P1 is larger, slower, and warmer.
  • P2 is similar to P1 in size, energy, and latency, but offers up to 3 times higher throughput, depending on the (relative) complexity of f0, f1, and f2.

16
DLX 5-step sequential execution
Stages: IF, ID, EX, MM, WB (instruction fetch, instruction decode, execute, memory access, write back).
17
DLX pipelined execution
[Figure: pipeline diagram with time in clock cycles 1, 2, 3, ..., 8 on the horizontal axis and program execution (successive instructions) on the vertical axis]
18
DLX pipelined execution
[Figure: pipelined datapath with stages Instruction Fetch, Instruction Decode, EXecute, Memory, and Write Back; the pc, incremented by 4, addresses the instruction memory, the register file Reg is read during decode and written during write back, and Mem is accessed in the memory stage]
19
Lab work
  • Assignment 5:
  • Create a 2-stage pipelined dlx2.tg. Throughput must exceed 5 MIPS (benchmark: GCD).
  • Design a reduced-cost version dlx2s.tg.
  • Note: the use of shared variables is not allowed. Let command S1 || S2 be part of your DLX. When S1 has write access to a variable x, S2 may neither read nor write x (and vice versa).

20
Next week: lecture 9
  • Outline
  • Pipelining the DLX, using branch-delay slots.
  • Lab work: Assignment 6 (3-stage DLX).

21
DLX system organization
[Figure: DLX system organization. Inside the system boundary, system_dlx() instantiates dlx(), rom(), and ram(); dlx() connects to rom() via ROMaddr and ROMdata and to ram() via RAMaddr, datatoRAM, and datafromRAM; rom() is loaded from file gcd.bin, and ram() reads and writes the files RAMin and RAMout]
22
dlx0.ht
  • include "types.ht"
  • dlx0 = export proc ( ROMaddr!chan adtype
  •                   & ROMdata?chan word
  •                   & RAMaddr!chan rwadtype & datatoRAM!chan S30 & datafromRAM?chan S30
  •                   ).
  • begin
  •   RF : ram array U5 of S30
  • end

23
system_dlx0.ht
  • include "dlx0.ht"
  • dlx0 = proc ( ROMaddr!chan adtype
  •             & ROMdata?chan word
  •             & RAMaddr!chan rwadtype & datatoRAM!chan S30 & datafromRAM?chan S30
  •             ). import
  • env_dlx4 = main proc (
  •               ROMfile? chan word
  •             & RAMinfile? chan S30
  •             & RAMfile! chan S30 /* <<address,data>> */
  •             ).
  • begin
  •   (next slide)
  • end

24
system_dlx0.ht main body
  • begin
  •   ROMaddr : chan adtype
  •   ROMdata : chan word
  •   RAMaddr : chan rwadtype
  •   datatoRAM : chan S30
  •   datafromRAM : chan S30
  •   ROMinterface = proc (). begin .. end
  •   RAMinterface = proc (). begin .. end
  • | initialise()
  •   ; ( ROMinterface() || RAMinterface()
  •       || dlx0( ROMaddr, ROMdata, RAMaddr, datatoRAM, datafromRAM ) )
  • end

25
script
  • htcomp -B system_dlx0
  • htsim -limit 1000 system_dlx0 gcd.bin RAMin RAMout
  • htview system_dlx0