Title: Introduction to Silicon Programming in the Tangram/Haste language
1 Introduction to Silicon Programming in the Tangram/Haste language
- Material adapted from lectures by
  - Prof.dr.ir. Kees van Berkel
  - Dr. Johan Lukkien
  - Dr.ir. Ad Peeters
- at the Technical University of Eindhoven, the Netherlands
2 VLSI programming for
- Low costs: introduce resource sharing.
- Low delay (high throughput): introduce parallelism.
- Low energy (low power): reduce activity.
3 VLSI programming for high performance
- Keep it simple!
- Focus the analysis on bottlenecks.
- Introduce parallelism: expressions, commands, loops, pipelining.
- Enable parallelism by reducing dependencies, such as resource sharing.
4 Expression-level parallelism
- Examples:
  - balancing: (v+w)+(x+y) is faster than v+w+x+y
  - substitution: z := g(f(x)) is faster than y := f(x) ; z := g(y)
  - carry-select adder
  - carry-save multiplier
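The balancing example can be made concrete by counting operator levels on the critical path. The following is a minimal Python sketch, assuming an associative binary operator (written + here) with unit delay per operator; the function names are illustrative and not part of Tangram/Haste.

```python
def chain_depth(n_operands: int) -> int:
    """Operator levels in a left-to-right chain such as ((v+w)+x)+y."""
    return max(n_operands - 1, 0)


def balanced_depth(n_operands: int) -> int:
    """Operator levels in a balanced tree such as (v+w)+(x+y).

    Equals ceil(log2(n_operands)), computed exactly with integer arithmetic.
    """
    return (n_operands - 1).bit_length() if n_operands > 1 else 0


# Compare both shapes for a few operand counts.
for n in (2, 4, 8, 16):
    print(n, chain_depth(n), balanced_depth(n))
```

For four operands the chain has three operator levels and the balanced tree has two, which is the gain the balancing bullet refers to.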
5 Command-level parallelism
- If S2 does not depend on the outcome of S1, then S1 ; S2 can be transformed into S1 || S2. (Dependencies: data, sharing, synchronization.)
- This reduces computation time τ, unless ordering is enforced through external synchronization.
- τ(S1 ; S2) = τ(;) + τ(S1) + τ(S2)
- τ(S1 || S2) = τ(||) + max(τ(S1), τ(S2))
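The two timing rules above (sequential composition adds the delays, parallel composition takes the maximum) can be tried out numerically. This is a minimal Python sketch; the function names and the delay values are assumptions chosen for illustration only.

```python
def time_seq(t1: float, t2: float, t_op: float = 0.0) -> float:
    """Computation time of S1 ; S2: operator overhead plus both delays."""
    return t_op + t1 + t2


def time_par(t1: float, t2: float, t_op: float = 0.0) -> float:
    """Computation time of S1 || S2: operator overhead plus the slower branch."""
    return t_op + max(t1, t2)


# Hypothetical delays: S1 takes 3 units, S2 takes 5, each operator costs 1.
print(time_seq(3, 5, 1))  # 9
print(time_par(3, 5, 1))  # 6
```

The gain is largest when both commands take about the same time and vanishes when one of them dominates.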
6 Exposure of cmd-level parallelism
- Let *[S] be a shorthand for forever do S od.
- Assume S0 must precede S1 and S1 must precede S2. How to speed up *[S0 ; S1 ; S2]?
- *[S0 ; S1 ; S2]
- loop unfolding: S0 ; *[S1 ; S2 ; S0]
- if S0 does not depend on S2: S0 ; *[S1 ; (S2 || S0)]
7 Wagging
- *[a?x ; b!f(x)]
- loop unrolling, renaming:
  - *[a?x ; b!f(x) ; a?y ; b!f(y)]
- loop folding:
  - a?x ; *[b!f(x) ; a?y ; b!f(y) ; a?x]
- introducing parallelism increases slack by 1:
  - a?x ; *[(b!f(x) || a?y) ; (b!f(y) || a?x)]
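The transformation steps above should leave the output sequence on b unchanged. The following minimal Python sketch checks this for a finite input stream. It runs sequentially, so it only models the ordering of values and not the overlap of communications; the function f and the input values are stand-ins chosen for illustration.

```python
_END = object()  # sentinel marking the end of the finite input stream


def f(x):
    """Stand-in for the function f on the slide."""
    return 2 * x + 1


def original(inputs):
    """Models the loop: a?x ; b!f(x)."""
    out = []
    for x in inputs:
        out.append(f(x))
    return out


def wagged(inputs):
    """Models: a?x, then repeatedly (b!f(x) with a?y) ; (b!f(y) with a?x)."""
    out = []
    it = iter(inputs)
    x = next(it, _END)            # a?x, the prologue created by loop folding
    while x is not _END:
        y = next(it, _END)        # a?y, may overlap with b!f(x) in hardware
        out.append(f(x))          # b!f(x)
        if y is _END:
            break
        x = next(it, _END)        # a?x, may overlap with b!f(y) in hardware
        out.append(f(y))          # b!f(y)
    return out


print(original(range(5)) == wagged(range(5)))
```

The two variables x and y are used alternately, which is where the name wagging comes from.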
8 Parallel reads from REG file
- Let RF be a register file. Then x := RF[i] ; y := RF[j] cannot be parallelized. (Register files have a single read port.)
- Parallel read actions can be realized by doubling the register file:
  - <<RF[i], RG[i]>> := <<z, z>> (write)
  - <<x, y>> := <<RF[i], RG[j]>> (read)
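The doubling idea can be sketched in software as two copies that are always written together and read independently. This is a minimal Python model, not Tangram/Haste code; the class name, method names, and the size of 32 entries are assumptions made for the example.

```python
class DualReadRegisterFile:
    """Two identical single-read-port register files, RF and RG."""

    def __init__(self, size: int = 32):
        self.rf = [0] * size
        self.rg = [0] * size

    def write(self, i: int, z: int) -> None:
        """Write the same value to both copies so that they stay identical."""
        self.rf[i] = z
        self.rg[i] = z

    def read2(self, i: int, j: int):
        """Read RF[i] and RG[j]; in hardware these two reads can overlap."""
        return self.rf[i], self.rg[j]


regs = DualReadRegisterFile()
regs.write(3, 7)
regs.write(5, 9)
print(regs.read2(3, 5))  # (7, 9)
```

The price of the second read port is a doubling of the storage and of the write activity.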
9 Pipelining in Tangram
- Compare three programs:
  - P0: *[a?x0 ; b!f2(f1(f0(x0)))]
  - P1: *[a?x0 ; x1 := f0(x0) ; x2 := f1(x1) ; b!f2(x2)]
  - P2: *[a?x0 ; a1!f0(x0)] || *[a1?x1 ; a2!f1(x1)] || *[a2?x2 ; b!f2(x2)]
10 Pipelining in Tangram (cntd)
- The output sequence on b is identical for P0, P1, and P2.
- P0 and P1 have the same communication behavior; P1 is larger, slower, and warmer.
- P2 vs P1: similar in size, energy, and latency, but up to 3 times higher throughput, depending on the (relative) complexity of f0, f1, f2.
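The "up to 3 times" figure follows from a simple cycle-time model: the sequential program pays for all three functions per item, while the pipeline is limited by its slowest stage. The sketch below is a rough Python model under the assumption that communication and control overhead are ignored; the delay values are hypothetical.

```python
def cycle_time_sequential(delays):
    """Time per item when f0, f1, f2 are evaluated one after another (P1)."""
    return sum(delays)


def cycle_time_pipelined(delays):
    """Time per item in steady state when each stage runs concurrently (P2)."""
    return max(delays)


def throughput_gain(delays):
    """Ratio of the pipelined throughput to the sequential throughput."""
    return cycle_time_sequential(delays) / cycle_time_pipelined(delays)


print(throughput_gain([1, 1, 1]))  # 3.0, balanced stages
print(throughput_gain([1, 1, 4]))  # 1.5, one dominant stage
```

The factor of 3 is reached only when f0, f1, and f2 are equally complex; a single dominant stage limits the gain.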
11 A Processor Example: DLX (Deluxe)
- (AMD 29K + DECstation 3100 + HP 850 + IBM 801 + Intel i860 + MIPS M/120A + MIPS M/1000 + Motorola 88K + RISC I + SGI 4D/60 + SPARCstation-1 + Sun-4/110 + Sun-4/260) / 13 = DLX
- Other RISC examples include Cray-1,2,3, AMD2900, DEC Alpha, ARM.
12 DLX instruction formats
(Figure: instruction formats, with bit-field boundaries 31..26, 25..21, 20..16, 15..11, 10..0.)
13 Example instructions
14 GCD in DLX assembler
- pre:   LW   R1,4(R0)     ; R1 := Mem[4+0]
-        LW   R2,8(R0)     ; R2 := Mem[8+0]
- loop:  SUB  R3,R1,R2     ; R3 := R1-R2
-        BEQZ R3,exit      ; if (R3=0) then PC := exit
-        SLT  R4,R1,R2     ; R4 := (R1<R2)
-        BEQZ R4,pos2      ; if (R4=0) then PC := pos2
- pos1:  SUB  R2,R2,R1     ; R2 := R2-R1
-        J    loop         ; PC := loop
- pos2:  SUB  R1,R1,R2     ; R1 := R1-R2
-        J    loop         ; PC := loop
- exit:  SW   20(R0),R1    ; Mem[20+0] := R1
-        HLT
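The control flow of the assembler program may be easier to follow in a high-level rendering. The following Python sketch mirrors the loop instruction by instruction; the function name is made up for the example, and the check for positive inputs is an addition, since the subtraction loop would not terminate for zero or negative operands.

```python
def gcd_by_subtraction(a: int, b: int) -> int:
    """Mirror of the DLX loop above; a and b play the roles of R1 and R2."""
    if a <= 0 or b <= 0:
        raise ValueError("both operands must be positive")
    r1, r2 = a, b                 # pre:  LW R1,4(R0) and LW R2,8(R0)
    while True:
        r3 = r1 - r2              # loop: SUB R3,R1,R2
        if r3 == 0:               #       BEQZ R3,exit
            return r1             # exit: SW 20(R0),R1
        r4 = r1 < r2              #       SLT R4,R1,R2
        if r4:                    #       R4 is nonzero: fall through to pos1
            r2 = r2 - r1          # pos1: SUB R2,R2,R1
        else:                     #       BEQZ R4,pos2 taken
            r1 = r1 - r2          # pos2: SUB R1,R1,R2


print(gcd_by_subtraction(12, 18))  # 6
```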
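15 DLX interface, state
(Figure: the DLX CPU with its state, pc and the registers Reg r0, r1, r2, ..., r31. The CPU connects to the instruction memory through address and instruction, and to the data memory Mem through address, data, and r/w. Further inputs: clock and interrupt.)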
16 DLX Moore machine (ignoring interrupts)
- ⟨Reg[0], pc⟩ := ⟨0, 0⟩
- ; do ⟨Mem[Reg[rs1] + immediate], pc, Reg[rd]⟩ :=
-   ⟨ if SW → Reg[rd] fi
-   , if J → pc + 4 + offset
-     [] BEQZ → if Reg[rs1] = 0 → pc + 4 + immediate [] Reg[rs1] ≠ 0 → pc + 4 fi
-     [] else → pc + 4
-     fi
-   , if LW → Mem[Reg[rs1] + immediate]
-     [] ADD → ALU(add, Reg[rs1], Reg[rs2])
-     fi ⟩
- od
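The state update above can be read as a function from the current state and instruction to the next state. The Python sketch below models only the five instructions named on the slide (SW, J, BEQZ, LW, ADD). The dictionary-based instruction encoding, the helper names, and the sample program are assumptions made for illustration; they are not the DLX binary format.

```python
def step(state, instr):
    """Apply one state update and return the new state (the old one is untouched)."""
    reg, mem, pc = state["reg"], state["mem"], state["pc"]
    new_reg, new_mem = list(reg), dict(mem)
    next_pc = pc + 4                                  # default: else -> pc + 4
    op = instr["op"]
    if op == "SW":
        new_mem[reg[instr["rs1"]] + instr["imm"]] = reg[instr["rd"]]
    elif op == "LW":
        new_reg[instr["rd"]] = mem.get(reg[instr["rs1"]] + instr["imm"], 0)
    elif op == "ADD":
        new_reg[instr["rd"]] = reg[instr["rs1"]] + reg[instr["rs2"]]
    elif op == "J":
        next_pc = pc + 4 + instr["offset"]
    elif op == "BEQZ":
        if reg[instr["rs1"]] == 0:
            next_pc = pc + 4 + instr["imm"]
    else:
        raise ValueError("unsupported instruction: " + op)
    new_reg[0] = 0                                    # assumption: Reg[0] stays zero
    return {"reg": new_reg, "mem": new_mem, "pc": next_pc}


def run(program, mem, steps):
    """Start from Reg = 0 and pc = 0 and execute a fixed number of steps."""
    state = {"reg": [0] * 32, "mem": dict(mem), "pc": 0}
    for _ in range(steps):
        state = step(state, program[state["pc"] // 4])
    return state


# Hypothetical program: Mem[20] := Mem[4] + Mem[8]
program = [
    {"op": "LW", "rd": 1, "rs1": 0, "imm": 4},
    {"op": "LW", "rd": 2, "rs1": 0, "imm": 8},
    {"op": "ADD", "rd": 3, "rs1": 1, "rs2": 2},
    {"op": "SW", "rd": 3, "rs1": 0, "imm": 20},
]
final = run(program, {4: 5, 8: 7}, steps=4)
print(final["mem"][20], final["pc"])  # 12 16
```

Each call to step corresponds to one iteration of the do ... od loop; memory, pc, and registers are all derived from the old state, as in the concurrent assignment on the slide.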
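17 DLX 5-step sequential execution
(Figure: the five steps executed in sequence: IF (instruction fetch), ID (instruction decode), EX (execute), MM (memory), WB (write back).)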
18 DLX pipelined execution
(Figure: pipelined execution. Horizontal axis: time in clock cycles 1, 2, 3, 4, 5, 6, 7, 8, ... Vertical axis: program execution, in instructions.)
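The layout of such a timing diagram can be regenerated with a few lines of code. The Python sketch below assumes an ideal five-stage pipeline without stalls or hazards, with one new instruction entering per clock cycle; the stage abbreviations are those of the 5-step execution, and the function name is illustrative.

```python
STAGES = ["IF", "ID", "EX", "MM", "WB"]


def pipeline_diagram(n_instructions: int):
    """Return one row per instruction and one column per clock cycle."""
    n_cycles = n_instructions + len(STAGES) - 1
    rows = []
    for i in range(n_instructions):
        row = ["  "] * n_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = name     # instruction i is in stage s during cycle i + s
        rows.append(row)
    return rows


# Print the diagram for four instructions.
for row in pipeline_diagram(4):
    print(" ".join(row))
```

In this idealized model, n instructions finish in n + 4 cycles instead of the 5n cycles needed by purely sequential execution.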
19 DLX pipelined execution
(Figure: pipelined datapath with the stages Instruction Fetch, Inst. Decode, EXecute, Memory, and Write Back. Labelled elements: 4, 0?, pc, Instr. mem, Reg, Mem.)
20 DLX system organization
(Figure: system_dlx() contains dlx(), rom(), and ram() inside the system boundary. Channels to the ROM: ROMaddr, ROMdata. Channels to the RAM: RAMaddr, datatoRAM, datafromRAM. Files: gcd.bin, RAMin, RAMout.)
21 dlx0.ht
- #include "types.ht"
- & dlx0 : export proc ( ROMaddr!chan adtype
-   & ROMdata?chan word
-   & RAMaddr!chan rwadtype & datatoRAM!chan S30 & datafromRAM?chan S30 ) .
- begin
-   & RF : ram array U5 of S30
- end
22 system_dlx0.ht
- #include "dlx0.ht"
-
- & dlx0 : proc ( ROMaddr!chan adtype
-   & ROMdata?chan word
-   & RAMaddr!chan rwadtype & datatoRAM!chan S30 & datafromRAM?chan S30 ) . import
- & env_dlx4 : main proc (
-     ROMfile? chan word
-   & RAMinfile? chan S30
-   & RAMfile! chan S30 /* <<address,data>> */ ) .
- begin
-   (next slide)
- end
23 system_dlx0.ht main body
- begin
-   & ROMaddr : chan adtype
-   & ROMdata : chan word
-   & RAMaddr : chan rwadtype
-   & datatoRAM : chan S30
-   & datafromRAM : chan S30
-
-   & ROMinterface : proc() . begin .. end
-   & RAMinterface : proc() . begin .. end
- |
-   initialise() ; ROMinterface() || RAMinterface() || dlx0( ROMaddr, ROMdata, RAMaddr, datatoRAM, datafromRAM )
- end
24 script
- htcomp system_dlx0
- htsim -limit 1000 system_dlx0 RAMin RAMout
- htview system_dlx0
- htmap system_dlx0
25 DLX0 instruction loop
- do -halted then
-   ROMaddr!PC
- ; ROMdata?ir
- ; PC := PC+4
-   (annotation on this step, garbled in the source: auxPCPC4 PCPCaux)
- ; case (ir cast Itype.0)
-   is <<t,f,f,f,f,f>> then LW()
-   or <<t,f,f,f,f,t>> then SW()
-   or <<f,f,f,f,f,f>> then if (ir cast Rtype.4 = 1) then SLT() fi
-   or <<f,t,f,f,f,f>> then BEQZ()
-   or <<f,t,f,f,f,t>> then J()
-   or <<f,f,t,f,f,f>> then halted := true
-   si
- od