Title: Performance on Architecture
1. Performance on Architecture
- Phichai Leangtong
- University of Southern California
2. Introduction
- Today's microprocessors achieve high throughput through many techniques, such as prefetching, higher clock frequencies, and wider instruction issue.
- But the throughput of a processor is (number of instructions issued per cycle) x (clock speed); a worked example follows.
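A rough worked example of this bound, sketched in C; the 4-wide, 2 GHz figures are illustrative numbers, not taken from the slides.

  #include <stdio.h>

  int main(void) {
      /* Peak throughput = instructions issued per cycle x clock speed. */
      /* Illustrative numbers: a 4-wide issue core running at 2 GHz.    */
      double issue_width = 4.0;      /* instructions per cycle */
      double clock_hz    = 2.0e9;    /* cycles per second      */
      double peak_ips    = issue_width * clock_hz;

      printf("Peak throughput: %.1f billion instructions/s\n", peak_ips / 1e9);
      return 0;
  }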
3. Introduction
- Some instructions are SIMD-type, but they require specific uses such as multimedia instructions.
- General-purpose applications never get this speedup.
4. Key Problem
- The throughput of current processors is limited.
- Current processors focus on increasing performance only for certain applications.
5. Architecture Overview
6. Conventional Processor Architecture
- Instructions flow sequentially, one after another.
- Parallelism among low-level processing elements goes largely unused.
7. Future Architectures
- FPGA
- Raw Machine
- Smart Memories
- MorphoSys (an example of an application-specific processor)
8. FPGA
- Field Programmable Gate Array
9. FPGA Architecture
- Configurable Logic Block (CLB)
- Programmable Switch Matrix (PSM)
- Input/Output Block (IOB)
10. FPGA Architecture
- The CLB architecture uses RAM cells as the functional units (see the sketch after this list).
- Each CLB consists of 2 flip-flops and 2 16x1 RAM cells.
- Configurability:
  - 2 independent 4-input functions, or
  - 1 5-input function
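A minimal C sketch of the RAM-cell-as-functional-unit idea above: a 4-input function is just a 16x1 truth table, and evaluating it is a single lookup indexed by the four input bits. The AND configuration below is an arbitrary example, not taken from the slides.

  #include <stdint.h>
  #include <stdio.h>

  /* A 4-input look-up table: 16 one-bit entries held in a RAM cell.     */
  /* Configuring the CLB means writing the truth table; evaluating the   */
  /* function is one RAM read addressed by the four input bits.          */
  typedef struct {
      uint16_t truth_table;          /* bit k holds f(k) for k = 0..15 */
  } Lut4;

  static int lut4_eval(const Lut4 *lut, int a, int b, int c, int d) {
      int index = (a << 3) | (b << 2) | (c << 1) | d;  /* 4 bits -> 0..15 */
      return (lut->truth_table >> index) & 1;
  }

  int main(void) {
      Lut4 and4 = { .truth_table = 1u << 15 };  /* 4-input AND: only f(15)=1 */
      printf("AND(1,1,1,1) = %d\n", lut4_eval(&and4, 1, 1, 1, 1));
      printf("AND(1,0,1,1) = %d\n", lut4_eval(&and4, 1, 0, 1, 1));
      return 0;
  }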
11. FPGA Architecture Summary
- High fine-grained performance, since each CLB operates at the bit level.
- CLB-to-CLB communication latency varies, because each signal is dynamically routed through the PSM and the interconnect.
- Routing (logic-to-chip mapping) takes a long time.
12. Raw Machine Architecture
13. Raw Overview
14. Raw Tile Architecture
- Each tile of the Raw machine is basically an FPGA.
- Each tile is used to map software directly into hardware.
- When the compiler cannot map a function into hardware, it configures that tile as a general-purpose, MIPS-like processor.
15. Raw Performance
- The Raw machine achieves impressive performance on massively parallel applications.
16. Raw Problems
- Programming that takes full advantage of the Raw machine is far from the traditional programming paradigm.
- Mapping software onto the Raw tiles is a major problem: because the Raw architecture is based on FPGAs, compile times are measured in hours.
17. Smart Memories Architecture
18. Smart Memories Chip
[Chip diagram: tiles grouped into quads]
19. Tile Architecture
- Each tile of Smart Memories can be configured as a processor or as a 2-4 MB DRAM cell.
[Diagram: a tile configured as DRAM, as a superscalar processor, and as a VLIW-like processor]
20. Tile Processor Architecture
- Each tile consists of 2 integer clusters and 1 floating-point cluster.
- The instruction format and decode logic of each tile are configurable, so a tile can act as a superscalar processor, a VLIW-like processor, or a custom configuration (a hypothetical sketch follows).
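A hypothetical sketch of a tile-configuration descriptor for the idea above; the type and field names are invented for illustration and do not come from the Smart Memories design.

  #include <stdio.h>

  /* Hypothetical descriptor for one Smart Memories tile: the same tile    */
  /* is parameterized either as a memory block or as a processor whose     */
  /* instruction format and decode style are chosen at configuration time. */
  typedef enum { TILE_DRAM, TILE_SUPERSCALAR, TILE_VLIW, TILE_CUSTOM } TileMode;

  typedef struct {
      TileMode mode;
      int      dram_mbytes;     /* used when mode == TILE_DRAM (2-4 MB) */
      int      int_clusters;    /* 2 integer clusters per tile          */
      int      fp_clusters;     /* 1 floating-point cluster per tile    */
  } TileConfig;

  int main(void) {
      TileConfig mem_tile  = { TILE_DRAM, 4, 0, 0 };
      TileConfig vliw_tile = { TILE_VLIW, 0, 2, 1 };
      printf("tile 0: mode=%d, %d MB DRAM\n", mem_tile.mode, mem_tile.dram_mbytes);
      printf("tile 1: mode=%d, %d int + %d fp clusters\n",
             vliw_tile.mode, vliw_tile.int_clusters, vliw_tile.fp_clusters);
      return 0;
  }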
21. Tile Processor Architecture
- Memory system of 16 8 KB SRAMs
- Quad interface
- Crossbar interconnect
- Processor
22. Processor Unit
[Diagram: two integer clusters and an FP cluster (shared FP register file, FP adders, FP divide/sqrt, FP multiplier), with ports to the crossbar interconnect]
23. Smart Memories Summary
- To reduce the latency between processor and memory, Smart Memories integrates them on the same tile.
- To reduce inter-tile communication overhead in computation-heavy applications, Smart Memories groups 4 tiles into a quad and adds a quad-interface network module to each tile.
24. MorphoSys Architecture
- An example of an application-specific processor
25. MorphoSys Architecture
26. 8x8 RC Array
27. MorphoSys
- The basic idea of MorphoSys is SIMD-style execution.
- However, MorphoSys performs its instructions in parallel across the RC array.
28. Functional Unit Efficiency
29. Function Selection
[Diagram: the opcode of a conventional instruction selects a single functional unit (f, g, h, or i) to apply to the data, producing g(x)]
30. Function Selection
[Diagram: a VLIW instruction drives all functional units in parallel on operands a, b, c, d, producing f(a), g(b), h(c), and i(d)]
31. Function Selection
[Diagram: four opcodes, one per functional unit, each selecting the operation its unit performs on operands a, b, c, d: f(), g(), h(), i(); a code sketch contrasting these encodings follows]
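A small C sketch contrasting the encodings in these three diagrams: a conventional instruction carries one opcode that keeps a single functional unit busy, while a word with one opcode per unit can drive all of them at once. The four-unit f/g/h/i layout mirrors the diagrams; the type names are illustrative.

  #include <stdio.h>

  typedef enum { OP_NOP, OP_F, OP_G, OP_H, OP_I } Op;

  /* Conventional encoding: one opcode, one operand, one unit used per cycle. */
  typedef struct { Op op; int operand; } ScalarInsn;

  /* VLIW-style encoding: one opcode/operand pair per functional unit, so a   */
  /* single instruction word can keep all four units busy.                    */
  typedef struct { Op slot_op[4]; int slot_operand[4]; } VliwWord;

  int main(void) {
      ScalarInsn s = { OP_G, 7 };                   /* g(x): one unit busy  */
      VliwWord   w = { { OP_F, OP_G, OP_H, OP_I },  /* f(a) g(b) h(c) i(d)  */
                       { 1, 2, 3, 4 } };
      int busy = 0;
      for (int u = 0; u < 4; u++)
          if (w.slot_op[u] != OP_NOP) busy++;
      printf("scalar insn %d: 1 of 4 units busy\n", s.op);
      printf("VLIW word:     %d of 4 units busy\n", busy);
      return 0;
  }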
32. Solution to Maximize Throughput
- Configurable Processing Unit
- Configurable Control Unit
33. Parallel Model
[Diagram: two contexts, each with its own PC and register file, share a common memory and a common set of functional units]
34. Execution
[Diagram: PC1 and PC2 each fetch instructions (e.g., ADD, ROR, SHL); the Function Scheduler dispatches them to the functional-unit slots Int, Int, Logic, Shift, FP, M1, M2]
35. Programmable Microprogram
36. Parallel Model
[Diagram: as before, two contexts with their own PC and register file share memory and functional units, but each context now also has a microprogram counter (MPC) and a microprogram unit (MPU)]
37. MicroVLIW Architecture
[Diagram: each context has an MPC/MPU pair plus a PC and register file; each MPU emits a microinstruction word with one slot per functional unit (Int, Int, Logic, Shift, FP, M1, M2, Cond) and a microprogram address field (MP ADDR); the Function Scheduler merges the words and issues them to the shared functional units]
38Microcode Converter
Add
Operand
Int
Int
Logic
Shift
FP
M1
M2
Cond
ADD
NOP
NOP
NOP
NOP
NOP
NOP
NOP
39Microcode Converter
Add_Sub
Operand
Int
Int
Logic
Shift
FP
M1
M2
Cond
ADD
SUB
NOP
NOP
NOP
NOP
NOP
NOP
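A minimal sketch of the converter idea on the two slides above: each instruction name looks up a stored microcode word with one slot per functional unit, so ADD fills only the first integer slot while ADD_SUB fills both. The slot names follow the slides; the table-lookup code itself is an illustrative assumption.

  #include <stdio.h>
  #include <string.h>

  enum { SLOT_INT0, SLOT_INT1, SLOT_LOGIC, SLOT_SHIFT,
         SLOT_FP, SLOT_M1, SLOT_M2, SLOT_COND, NUM_SLOTS };

  /* One microcode word: an operation (or "NOP") for every functional unit. */
  typedef struct {
      const char *name;
      const char *slot[NUM_SLOTS];
  } MicroWord;

  /* The converter's table: instruction name -> pre-stored microcode word.  */
  static const MicroWord table[] = {
      { "ADD",     { "ADD", "NOP", "NOP", "NOP", "NOP", "NOP", "NOP", "NOP" } },
      { "ADD_SUB", { "ADD", "SUB", "NOP", "NOP", "NOP", "NOP", "NOP", "NOP" } },
  };

  static const MicroWord *convert(const char *insn) {
      for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
          if (strcmp(table[i].name, insn) == 0)
              return &table[i];
      return NULL;                    /* unknown instruction */
  }

  int main(void) {
      const MicroWord *w = convert("ADD_SUB");
      for (int s = 0; w && s < NUM_SLOTS; s++)
          printf("%s ", w->slot[s]);
      printf("\n");
      return 0;
  }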
40. Execution

          Int   Int   Logic  Shift  FP   M1   M2   Cond
  PC1:    ADD   NOP   NOP    NOP    NOP  NOP  NOP  NOP
  PC2:    ADD   NOP   NOP    NOP    NOP  NOP  NOP  NOP

  The Function Scheduler merges both words into the shared functional-unit slots (Int, Int, Logic, Shift, FP, M1, M2, Cond).
41. Execution

          Int   Int   Logic  Shift  FP   M1   M2   Cond
  PC1:    ADD   NOP   NOP    ROR    NOP  NOP  NOP  NOP
  PC2:    ADD   NOP   NOP    NOP    NOP  NOP  NOP  NOP

  The Function Scheduler merges both words into the shared functional-unit slots.
42. Execution

          Int   Int   Logic  Shift  FP   M1   M2   Cond
  PC1:    ADD   NOP   NOP    ROR    NOP  NOP  NOP  NOP
  PC2:    ADD   NOP   NOP    SHL    NOP  NOP  NOP  NOP

  The Function Scheduler merges both words into the shared functional-unit slots (a sketch of this merge follows).
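A sketch of the merge the Function Scheduler performs on these slides: the microcode words from PC1 and PC2 are combined slot by slot so both contexts issue in one cycle when they do not claim the same unit. The twin-integer-unit routing and the conflict rule are assumptions made for illustration.

  #include <stdio.h>
  #include <string.h>

  #define NUM_SLOTS 8
  static const char *slot_name[NUM_SLOTS] =
      { "Int", "Int", "Logic", "Shift", "FP", "M1", "M2", "Cond" };

  /* Place one operation into the merged word; the two integer units        */
  /* (slots 0 and 1) are treated as interchangeable.                        */
  static int place(const char *out[], int s, const char *op) {
      if (strcmp(out[s], "NOP") == 0) { out[s] = op; return 1; }
      if (s <= 1 && strcmp(out[1 - s], "NOP") == 0) { out[1 - s] = op; return 1; }
      return 0;                                  /* unit already claimed */
  }

  /* Merge the words from both program counters; 0 means a structural       */
  /* conflict, so the second word would wait for the next cycle.            */
  static int schedule(const char *pc1[], const char *pc2[], const char *out[]) {
      for (int s = 0; s < NUM_SLOTS; s++) out[s] = "NOP";
      for (int s = 0; s < NUM_SLOTS; s++)
          if (strcmp(pc1[s], "NOP") && !place(out, s, pc1[s])) return 0;
      for (int s = 0; s < NUM_SLOTS; s++)
          if (strcmp(pc2[s], "NOP") && !place(out, s, pc2[s])) return 0;
      return 1;
  }

  int main(void) {
      /* The words from slide 41: PC1 issues ADD and ROR, PC2 issues ADD.   */
      const char *pc1[NUM_SLOTS] = { "ADD", "NOP", "NOP", "ROR", "NOP", "NOP", "NOP", "NOP" };
      const char *pc2[NUM_SLOTS] = { "ADD", "NOP", "NOP", "NOP", "NOP", "NOP", "NOP", "NOP" };
      const char *merged[NUM_SLOTS];
      if (schedule(pc1, pc2, merged)) {
          for (int s = 0; s < NUM_SLOTS; s++)
              printf("%s=%s ", slot_name[s], merged[s]);
          printf("\n");
      } else {
          printf("conflict: the two words must issue in separate cycles\n");
      }
      return 0;
  }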
43. Example

  for (i = 0; i < n; i++) a[i] = b[i] + c[i];

      li    t1, 0
      li    t2, n
  loop:
      load  r2, array_b[t1]
      load  r3, array_c[t1]
      add   r1, r2, r3
      store r1, array_a[t1]
      add   t1, t1, 1
      jne   t1, t2, loop
44. Microprogram

      la    a1, array_a
      la    a2, array_b
      la    a3, array_c
      li    r1, n
      vadd                 (custom-configured instruction)

  Microprogram for vadd (functional-unit slots: Logic, Int, Int, Shift, M1, M2, Cond, FP; slots not listed hold NOP):

  00:  xor t4, t4, t4
  01:  load t2, t4, a2 | load t3, t4, a3 | if r1 == t4, finish
  02:  add t1, t2, t3
  03:  store t1, t4, a1
  04:  add t4, t4, 1 | goto 01
45. Execution
[Diagram: the vadd microcode words 00-04 issue in sequence through the functional-unit slots; in each cycle only the operations listed above (xor, the two loads with the loop-exit test, add, store, and the index increment with goto 01) occupy slots, and every other slot is NOP]
46. Performance Evaluation
47. Performance Evaluation
48. Conclusion
- Takes advantage of the VLIW architecture to increase parallelism; worst-case performance equals the maximum throughput of SMT on a RISC architecture.
- The instruction size is the same as in a RISC architecture.
- Overhead must be paid for instruction configuration and for function calls.
- Effective instruction bandwidth increases for loop code.
- Storage space is needed for the microprogram.