Title: Performance on Architecture
1. Performance on Architecture
- Phichai Leangtong
- University of Southern California
2. Introduction
- Today's microprocessors achieve high throughput through many techniques, such as prefetching, higher clock frequencies, and wider instruction issue.
- But the throughput of a processor is (number of instructions issued per cycle) x (clock speed); a worked example follows.
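A rough worked example of this bound, sketched in C; the 4-wide, 2 GHz figures are illustrative numbers, not taken from the slides.

  #include <stdio.h>

  int main(void) {
      /* Peak throughput = instructions issued per cycle x clock speed. */
      /* Illustrative numbers: a 4-wide issue core running at 2 GHz.    */
      double issue_width = 4.0;      /* instructions per cycle */
      double clock_hz    = 2.0e9;    /* cycles per second      */
      double peak_ips    = issue_width * clock_hz;

      printf("Peak throughput: %.1f billion instructions/s\n", peak_ips / 1e9);
      return 0;
  }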
3. Introduction
- Some instructions are SIMD-type, but they require specific uses such as multimedia instructions.
- General-purpose applications never get this speedup.
4. Key Problem
- The throughput of current processors is limited.
- Current processors focus on increasing performance only for certain applications.
5. Architecture Overview
6. Conventional Processor Architecture
- Instructions flow sequentially, one after another.
- Parallelism among low-level processing elements goes largely unused.
7. Future Architectures
- FPGA
- Raw Machine
- Smart Memories
- MorphoSys (an example of an application-specific processor)
8. FPGA
- Field Programmable Gate Array
9. FPGA Architecture
- Configurable Logic Block (CLB)
- Programmable Switch Matrix (PSM)
- Input/Output Block (IOB)
10. FPGA Architecture
- The CLB architecture uses RAM cells as the functional units (see the sketch after this list).
- Each CLB consists of 2 flip-flops and 2 16x1 RAM cells.
- Configurability:
  - 2 independent 4-input functions, or
  - 1 5-input function
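A minimal C sketch of the RAM-cell-as-functional-unit idea above: a 4-input function is just a 16x1 truth table, and evaluating it is a single lookup indexed by the four input bits. The AND configuration below is an arbitrary example, not taken from the slides.

  #include <stdint.h>
  #include <stdio.h>

  /* A 4-input look-up table: 16 one-bit entries held in a RAM cell.     */
  /* Configuring the CLB means writing the truth table; evaluating the   */
  /* function is one RAM read addressed by the four input bits.          */
  typedef struct {
      uint16_t truth_table;          /* bit k holds f(k) for k = 0..15 */
  } Lut4;

  static int lut4_eval(const Lut4 *lut, int a, int b, int c, int d) {
      int index = (a << 3) | (b << 2) | (c << 1) | d;  /* 4 bits -> 0..15 */
      return (lut->truth_table >> index) & 1;
  }

  int main(void) {
      Lut4 and4 = { .truth_table = 1u << 15 };  /* 4-input AND: only f(15)=1 */
      printf("AND(1,1,1,1) = %d\n", lut4_eval(&and4, 1, 1, 1, 1));
      printf("AND(1,0,1,1) = %d\n", lut4_eval(&and4, 1, 0, 1, 1));
      return 0;
  }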
11. FPGA Architecture Summary
- High fine-grained performance, since each CLB operates at the bit level.
- CLB-to-CLB communication latency varies, because each signal is dynamically routed through the PSM and the interconnect.
- Routing (logic-to-chip mapping) takes a long time.
12. Raw Machine Architecture
13. Raw Overview
14. Raw Tile Architecture
- Each tile of the Raw machine is basically an FPGA.
- Each tile is used to map software directly into hardware.
- When the compiler cannot map a function into hardware, it configures that tile as a general-purpose, MIPS-like processor.
15. Raw Performance
- The Raw machine achieves impressive performance on massively parallel applications.
16. Raw Problems
- Programming that takes full advantage of the Raw machine is far from the traditional programming paradigm.
- Mapping software onto the Raw tiles is a major problem: because the Raw architecture is based on FPGAs, compile times are measured in hours.
17. Smart Memories Architecture
18. Smart Memories Chip
[Chip diagram: tiles grouped into quads]
19. Tile Architecture
- Each tile of Smart Memories can be configured as a processor or as a 2-4 MB DRAM cell.
[Diagram: a tile configured as DRAM, as a superscalar processor, and as a VLIW-like processor]
20. Tile Processor Architecture
- Each tile consists of 2 integer clusters and 1 floating-point cluster.
- The instruction format and decode logic of each tile are configurable, so a tile can act as a superscalar processor, a VLIW-like processor, or a custom configuration (a hypothetical sketch follows).
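A hypothetical sketch of a tile-configuration descriptor for the idea above; the type and field names are invented for illustration and do not come from the Smart Memories design.

  #include <stdio.h>

  /* Hypothetical descriptor for one Smart Memories tile: the same tile    */
  /* is parameterized either as a memory block or as a processor whose     */
  /* instruction format and decode style are chosen at configuration time. */
  typedef enum { TILE_DRAM, TILE_SUPERSCALAR, TILE_VLIW, TILE_CUSTOM } TileMode;

  typedef struct {
      TileMode mode;
      int      dram_mbytes;     /* used when mode == TILE_DRAM (2-4 MB) */
      int      int_clusters;    /* 2 integer clusters per tile          */
      int      fp_clusters;     /* 1 floating-point cluster per tile    */
  } TileConfig;

  int main(void) {
      TileConfig mem_tile  = { TILE_DRAM, 4, 0, 0 };
      TileConfig vliw_tile = { TILE_VLIW, 0, 2, 1 };
      printf("tile 0: mode=%d, %d MB DRAM\n", mem_tile.mode, mem_tile.dram_mbytes);
      printf("tile 1: mode=%d, %d int + %d fp clusters\n",
             vliw_tile.mode, vliw_tile.int_clusters, vliw_tile.fp_clusters);
      return 0;
  }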
21. Tile Processor Architecture
- Memory system of 16 8 KB SRAMs
- Quad interface
- Crossbar interconnect
- Processor
22. Processor Unit
[Diagram: two integer clusters and an FP cluster (shared FP register file, FP adders, FP divide/sqrt, FP multiplier), with ports to the crossbar interconnect]
23. Smart Memories Summary
- To reduce the latency between processor and memory, Smart Memories integrates them on the same tile.
- To reduce inter-tile communication overhead in computation-heavy applications, Smart Memories groups 4 tiles into a quad and adds a quad-interface network module to each tile.
24. MorphoSys Architecture
- An example of an application-specific processor
25. MorphoSys Architecture
26. 8x8 RC Array
27. MorphoSys
- The basic idea of MorphoSys is SIMD-style execution.
- However, MorphoSys performs its instructions in parallel across the RC array.
28. Functional Unit Efficiency
29. Function Selection
[Diagram: the opcode of a conventional instruction selects a single functional unit (f, g, h, or i) to apply to the data, producing g(x)]
30. Function Selection
[Diagram: a VLIW instruction drives all functional units in parallel on operands a, b, c, d, producing f(a), g(b), h(c), and i(d)]
31. Function Selection
[Diagram: four opcodes, one per functional unit, each selecting the operation its unit performs on operands a, b, c, d: f(), g(), h(), i(); a code sketch contrasting these encodings follows]
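A small C sketch contrasting the encodings in these three diagrams: a conventional instruction carries one opcode that keeps a single functional unit busy, while a word with one opcode per unit can drive all of them at once. The four-unit f/g/h/i layout mirrors the diagrams; the type names are illustrative.

  #include <stdio.h>

  typedef enum { OP_NOP, OP_F, OP_G, OP_H, OP_I } Op;

  /* Conventional encoding: one opcode, one operand, one unit used per cycle. */
  typedef struct { Op op; int operand; } ScalarInsn;

  /* VLIW-style encoding: one opcode/operand pair per functional unit, so a   */
  /* single instruction word can keep all four units busy.                    */
  typedef struct { Op slot_op[4]; int slot_operand[4]; } VliwWord;

  int main(void) {
      ScalarInsn s = { OP_G, 7 };                   /* g(x): one unit busy  */
      VliwWord   w = { { OP_F, OP_G, OP_H, OP_I },  /* f(a) g(b) h(c) i(d)  */
                       { 1, 2, 3, 4 } };
      int busy = 0;
      for (int u = 0; u < 4; u++)
          if (w.slot_op[u] != OP_NOP) busy++;
      printf("scalar insn %d: 1 of 4 units busy\n", s.op);
      printf("VLIW word:     %d of 4 units busy\n", busy);
      return 0;
  }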
32. Solution to Maximize Throughput
- Configurable Processing Unit
- Configurable Control Unit
33. Parallel Model
[Diagram: two contexts, each with its own PC and register file, share a common memory and a common set of functional units]
34. Execution
[Diagram: PC1 and PC2 each fetch instructions (e.g., ADD, ROR, SHL); the Function Scheduler dispatches them to the functional-unit slots Int, Int, Logic, Shift, FP, M1, M2]
35. Programmable Microprogram
36. Parallel Model
[Diagram: as before, two contexts with their own PC and register file share memory and functional units, but each context now also has a microprogram counter (MPC) and a microprogram unit (MPU)]
37. MicroVLIW Architecture
[Diagram: each context has an MPC/MPU pair plus a PC and register file; each MPU emits a microinstruction word with one slot per functional unit (Int, Int, Logic, Shift, FP, M1, M2, Cond) and a microprogram address field (MP ADDR); the Function Scheduler merges the words and issues them to the shared functional units]
38Microcode Converter
Add
Operand
Int
Int
Logic
Shift
FP
M1
M2
Cond
ADD
NOP
NOP
NOP
NOP
NOP
NOP
NOP
39Microcode Converter
Add_Sub
Operand
Int
Int
Logic
Shift
FP
M1
M2
Cond
ADD
SUB
NOP
NOP
NOP
NOP
NOP
NOP
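A minimal sketch of the converter idea on the two slides above: each instruction name looks up a stored microcode word with one slot per functional unit, so ADD fills only the first integer slot while ADD_SUB fills both. The slot names follow the slides; the table-lookup code itself is an illustrative assumption.

  #include <stdio.h>
  #include <string.h>

  enum { SLOT_INT0, SLOT_INT1, SLOT_LOGIC, SLOT_SHIFT,
         SLOT_FP, SLOT_M1, SLOT_M2, SLOT_COND, NUM_SLOTS };

  /* One microcode word: an operation (or "NOP") for every functional unit. */
  typedef struct {
      const char *name;
      const char *slot[NUM_SLOTS];
  } MicroWord;

  /* The converter's table: instruction name -> pre-stored microcode word.  */
  static const MicroWord table[] = {
      { "ADD",     { "ADD", "NOP", "NOP", "NOP", "NOP", "NOP", "NOP", "NOP" } },
      { "ADD_SUB", { "ADD", "SUB", "NOP", "NOP", "NOP", "NOP", "NOP", "NOP" } },
  };

  static const MicroWord *convert(const char *insn) {
      for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
          if (strcmp(table[i].name, insn) == 0)
              return &table[i];
      return NULL;                    /* unknown instruction */
  }

  int main(void) {
      const MicroWord *w = convert("ADD_SUB");
      for (int s = 0; w && s < NUM_SLOTS; s++)
          printf("%s ", w->slot[s]);
      printf("\n");
      return 0;
  }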
40. Execution

          Int   Int   Logic  Shift  FP   M1   M2   Cond
  PC1:    ADD   NOP   NOP    NOP    NOP  NOP  NOP  NOP
  PC2:    ADD   NOP   NOP    NOP    NOP  NOP  NOP  NOP

  The Function Scheduler merges both words into the shared functional-unit slots (Int, Int, Logic, Shift, FP, M1, M2, Cond).
41. Execution

          Int   Int   Logic  Shift  FP   M1   M2   Cond
  PC1:    ADD   NOP   NOP    ROR    NOP  NOP  NOP  NOP
  PC2:    ADD   NOP   NOP    NOP    NOP  NOP  NOP  NOP

  The Function Scheduler merges both words into the shared functional-unit slots.
42. Execution

          Int   Int   Logic  Shift  FP   M1   M2   Cond
  PC1:    ADD   NOP   NOP    ROR    NOP  NOP  NOP  NOP
  PC2:    ADD   NOP   NOP    SHL    NOP  NOP  NOP  NOP

  The Function Scheduler merges both words into the shared functional-unit slots (a sketch of this merge follows).
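A sketch of the merge the Function Scheduler performs on these slides: the microcode words from PC1 and PC2 are combined slot by slot so both contexts issue in one cycle when they do not claim the same unit. The twin-integer-unit routing and the conflict rule are assumptions made for illustration.

  #include <stdio.h>
  #include <string.h>

  #define NUM_SLOTS 8
  static const char *slot_name[NUM_SLOTS] =
      { "Int", "Int", "Logic", "Shift", "FP", "M1", "M2", "Cond" };

  /* Place one operation into the merged word; the two integer units        */
  /* (slots 0 and 1) are treated as interchangeable.                        */
  static int place(const char *out[], int s, const char *op) {
      if (strcmp(out[s], "NOP") == 0) { out[s] = op; return 1; }
      if (s <= 1 && strcmp(out[1 - s], "NOP") == 0) { out[1 - s] = op; return 1; }
      return 0;                                  /* unit already claimed */
  }

  /* Merge the words from both program counters; 0 means a structural       */
  /* conflict, so the second word would wait for the next cycle.            */
  static int schedule(const char *pc1[], const char *pc2[], const char *out[]) {
      for (int s = 0; s < NUM_SLOTS; s++) out[s] = "NOP";
      for (int s = 0; s < NUM_SLOTS; s++)
          if (strcmp(pc1[s], "NOP") && !place(out, s, pc1[s])) return 0;
      for (int s = 0; s < NUM_SLOTS; s++)
          if (strcmp(pc2[s], "NOP") && !place(out, s, pc2[s])) return 0;
      return 1;
  }

  int main(void) {
      /* The words from slide 41: PC1 issues ADD and ROR, PC2 issues ADD.   */
      const char *pc1[NUM_SLOTS] = { "ADD", "NOP", "NOP", "ROR", "NOP", "NOP", "NOP", "NOP" };
      const char *pc2[NUM_SLOTS] = { "ADD", "NOP", "NOP", "NOP", "NOP", "NOP", "NOP", "NOP" };
      const char *merged[NUM_SLOTS];
      if (schedule(pc1, pc2, merged)) {
          for (int s = 0; s < NUM_SLOTS; s++)
              printf("%s=%s ", slot_name[s], merged[s]);
          printf("\n");
      } else {
          printf("conflict: the two words must issue in separate cycles\n");
      }
      return 0;
  }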
43. Example

  for (i = 0; i < n; i++) a[i] = b[i] + c[i];

      li    t1, 0
      li    t2, n
  loop:
      load  r2, array_b[t1]
      load  r3, array_c[t1]
      add   r1, r2, r3
      store r1, array_a[t1]
      add   t1, t1, 1
      jne   t1, t2, loop
44. Microprogram

      la    a1, array_a
      la    a2, array_b
      la    a3, array_c
      li    r1, n
      vadd                 (custom-configured instruction)

  Microprogram for vadd (functional-unit slots: Logic, Int, Int, Shift, M1, M2, Cond, FP; slots not listed hold NOP):

  00:  xor t4, t4, t4
  01:  load t2, t4, a2 | load t3, t4, a3 | if r1 == t4, finish
  02:  add t1, t2, t3
  03:  store t1, t4, a1
  04:  add t4, t4, 1 | goto 01
45. Execution
[Diagram: the vadd microcode words 00-04 issue in sequence through the functional-unit slots; in each cycle only the operations listed above (xor, the two loads with the loop-exit test, add, store, and the index increment with goto 01) occupy slots, and every other slot is NOP]
46. Performance Evaluation
47. Performance Evaluation
48. Conclusion
- Takes advantage of the VLIW architecture to increase parallelism; worst-case performance equals the maximum throughput of SMT on a RISC architecture.
- The instruction size is the same as in a RISC architecture.
- Overhead must be paid for instruction configuration and for function calls.
- Effective instruction bandwidth increases for loop code.
- Storage space is needed for the microprogram.