Performance on Architecture - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Performance on Architecture
  • Phichai Leangtong
  • University of Southern California

2
Introduction
  • Today's microprocessors achieve high throughput
    through many techniques, such as prefetching,
    higher clock frequencies, and wider instruction
    issue.
  • But the throughput of a processor is

Number of instructions issued per cycle x Clock speed
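As a rough illustration with made-up numbers (not from the slides), a 4-issue processor running at 2 GHz would peak at 8 billion instructions per second:

  /* Peak throughput = instructions issued per cycle x clock speed.
     Illustrative numbers only. */
  #include <stdio.h>

  int main(void) {
      double issue_per_cycle = 4.0;   /* hypothetical 4-issue core */
      double clock_hz        = 2.0e9; /* hypothetical 2 GHz clock  */
      printf("peak throughput = %.1f billion instructions/s\n",
             issue_per_cycle * clock_hz / 1e9);
      return 0;
  }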
3
Introduction
  • Some instructions are SIMD-type, but they are
    intended for specific uses such as multimedia
    instructions.
  • General-purpose applications will not get this
    speedup.

4
Key Problem
  • The throughput of current processors is limited.
  • Current processors focus on increasing
    performance only for certain applications.

5
Architecture Overview
6
Conventional Processor Architecture
  • Instructions flow in sequence, one after another.
  • Little use is made of low-level parallelism
    among processing elements.

7
Future Architectures
  • FPGA
  • Raw Machine
  • Smart Memories
  • MorphoSys (an example of an application-specific
    processor)

8
FPGA
  • Field Programmable Gate Array

9
FPGA Architecture
  • Configurable Logic Block (CLB)
  • Programmable Switch Matrix (PSM)
  • Input/Output Block (IOB)
10
FPGA Architecture
  • The CLB architecture uses RAM cells as
    functional units (see the lookup-table sketch
    below).
  • Each CLB consists of 2 flip-flops and two 16x1
    RAM cells.
  • Configurability:
  • two 4-input functions, or
  • one 5-input function
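A minimal sketch, in C, of the idea that a 16x1 RAM cell acts as a 4-input lookup table: the 16 stored bits are the truth table of the chosen function, and the 4 inputs form the read address. This is illustrative only, not the actual Xilinx CLB circuitry.

  #include <stdint.h>

  /* 16x1 RAM cell used as a 4-input LUT: 'table' holds one output
     bit per input combination, loaded at configuration time. */
  static uint16_t table;

  int lut4(int a, int b, int c, int d) {
      int addr = (a & 1) | ((b & 1) << 1) | ((c & 1) << 2) | ((d & 1) << 3);
      return (table >> addr) & 1;       /* read the selected bit */
  }

  int main(void) {
      table = 0x8000;                   /* truth table of a 4-input AND */
      return !lut4(1, 1, 1, 1);         /* 0 (success) if AND(1,1,1,1) == 1 */
  }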

11
FPGA Architecture Summary
  • High performance on fine-grained tasks, since
    each CLB operates at the bit level.
  • Communication latency from one CLB to another
    varies, because each signal is dynamically
    routed through the PSMs and interconnect.
  • Long routing (logic-to-chip mapping) time.

12
Raw Machine Architecture
13
Raw overview
14
Raw tile architecture
  • Each tile of the Raw machine is basically an FPGA.
  • Each tile is used to map software directly into
    hardware.
  • If the compiler cannot map a function into
    hardware, it configures that tile as a
    general-purpose (MIPS-like) processor.

15
Raw Performance
  • The Raw machine achieves impressive performance
    on massively parallel applications.

16
Raw Problems
  • Programming that takes full advantage of the Raw
    machine is far from the traditional programming
    paradigm.
  • Mapping software onto each Raw tile is a big
    problem, because the Raw architecture is based on
    FPGAs, whose compile times can be measured in
    hours.

17
Smart Memories Architecture
18
Smart Memories Chip
(Diagram) The chip is an array of tiles grouped into
quads.
19
Tile Architecture
  • Each tile of Smart Memories can be configured as
    a processor or as a DRAM cell (2-4 MB).
  • Example configurations: a tile configured as
    DRAM, as a superscalar processor, or as a
    VLIW-like processor.
20
Tile Processor Architecture
  • Each tile consists of 2 integer clusters and 1
    floating-point cluster.
  • The instruction format and decode logic of each
    tile are configurable. Thus, each tile can be
    configured as a superscalar processor, a
    VLIW-like processor, or a custom configuration
    (a sketch of such a configuration follows).
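A minimal sketch of what such a tile configuration could look like in software; the names and field values below are hypothetical, since Smart Memories configures this in hardware.

  /* Hypothetical per-tile configuration descriptor. */
  enum tile_mode { TILE_DRAM, TILE_SUPERSCALAR, TILE_VLIW, TILE_CUSTOM };

  struct tile_config {
      enum tile_mode mode;   /* memory or processor flavor        */
      int issue_width;       /* instructions decoded per cycle    */
      int int_clusters;      /* 2 integer clusters per tile       */
      int fp_clusters;       /* 1 floating-point cluster per tile */
  };

  static const struct tile_config vliw_tile = {
      .mode = TILE_VLIW, .issue_width = 3,  /* 3 is a made-up value */
      .int_clusters = 2, .fp_clusters = 1
  };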

21
Tile Processor Architecture
  • Memory system of 16 8 KB SRAMs
  • Quad interface
  • Crossbar interconnect
  • Processor
22
Process Unit
  • Ports to the crossbar interconnect
  • FP cluster: shared FP register file, 2 FP
    adders, FP divide/sqrt, FP multiplier
  • 2 integer clusters
23
Smart Memories Summary
  • To reduce the latency between processor and
    memory, Smart Memories integrates memory and
    processor together.
  • To reduce the communication overhead between
    tiles in computation-intensive applications,
    Smart Memories groups 4 tiles into a quad and
    adds a quad-interface network module to each
    tile.

24
MorphoSys Architecture
  • An example of an application-specific processor

25
MorphoSys Architecture
26
8x8 RC (Reconfigurable Cell) array
27
MorphoSys
  • The basic idea of MorphoSys is based on the SIMD
    style.
  • But MorphoSys executes instructions in parallel
    across the RC array.

28
Functional Unit Efficiency
29
Function selection
(Diagram) The op code selects a single function (f,
g, h, or i) to apply to the data, producing e.g.
g(x); the other functional units sit idle.
30
Function selection
(Diagram) A VLIW instruction supplies operands a, b,
c, d and drives all functional units at once,
producing f(a), g(b), h(c), and i(d) in the same
cycle.
31
Function selection
(Diagram) Op Code 1-4 each select a function for
operands a, b, c, d, so f(), g(), h(), and i()
execute in parallel.
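The contrast can be sketched in C (illustrative only; f, g, h, and i stand for arbitrary functional-unit operations): a conventional processor selects one function per op code, while a VLIW-style word drives every unit in the same cycle.

  static int f(int x) { return x + 1; }   /* placeholder operations */
  static int g(int x) { return x - 1; }
  static int h(int x) { return x << 1; }
  static int i(int x) { return x >> 1; }

  /* Conventional: one op code picks one function for one operand. */
  int scalar_issue(int opcode, int x) {
      switch (opcode) {
      case 0:  return f(x);
      case 1:  return g(x);
      case 2:  return h(x);
      default: return i(x);
      }
  }

  /* VLIW-style: one wide word supplies an operand per unit, so all
     four functions execute in the same cycle. */
  void vliw_issue(int a, int b, int c, int d, int out[4]) {
      out[0] = f(a);
      out[1] = g(b);
      out[2] = h(c);
      out[3] = i(d);
  }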
32
Solution to Maximize Throughput
  • Configurable Processing Unit
  • Configurable Control Unit

33
Parallel Model
(Diagram) Two contexts, each with its own PC and
register file, share the memory and a common pool of
functional units.
34
Execution
(Diagram) PC1 issues ADD and ROR, PC2 issues ADD and
SHL; the Function Scheduler dispatches them to the
functional units (Int, Int, Logic, Shift, FP, M1,
M2).
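A rough sketch of the Function Scheduler idea, with hypothetical data structures: each context offers one or more operations per cycle, and each operation is placed on the first free functional unit it can use; anything that cannot be placed waits for the next cycle.

  enum unit { U_INT0, U_INT1, U_LOGIC, U_SHIFT, U_FP, U_M1, U_M2, N_UNITS };

  /* One operation offered by a context (PC1 or PC2), with the range
     of unit slots that can execute it. */
  struct op { const char *name; int first_unit, last_unit; };

  /* Greedy scheduler: returns how many ops were placed this cycle. */
  int schedule(const struct op ops[], int nops, const char *slot[N_UNITS]) {
      int placed = 0;
      for (int u = 0; u < N_UNITS; u++) slot[u] = 0;   /* all units free */
      for (int k = 0; k < nops; k++)
          for (int u = ops[k].first_unit; u <= ops[k].last_unit; u++)
              if (!slot[u]) { slot[u] = ops[k].name; placed++; break; }
      return placed;
  }

With this model, ADD from PC1 and ADD from PC2 each take one of the two Int units, while ROR and SHL would compete for the single Shift unit.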
35
Programmable Microprogram
  • MicroVLIW architecture

36
Parallel Model
(Diagram) Parallel model with microprogram: each
context now has its own MPC and MPU in addition to
its PC and register file; all contexts share the
memory and the functional units.
37
MicroVLIW architecture
(Diagram) Each context has its own PC, register
file, MPC, and MPU. The MPU expands an instruction
into a wide micro-word with one slot per functional
unit (Int, Int, Logic, Shift, FP, M1, M2, Cond, plus
an MP ADDR field), and the Function Scheduler merges
the micro-words from both contexts onto the shared
functional units.
38
Microcode Converter
Instruction: Add  Operand

Int  Int  Logic  Shift  FP   M1   M2   Cond
ADD  NOP  NOP    NOP    NOP  NOP  NOP  NOP
39
Microcode Converter
Instruction: Add_Sub  Operand

Int  Int  Logic  Shift  FP   M1   M2   Cond
ADD  SUB  NOP    NOP    NOP  NOP  NOP  NOP
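A minimal sketch of the converter idea with a hypothetical encoding: each architectural instruction indexes a table of wide micro-words, one slot per functional unit, with NOP in the unused slots.

  /* Slots in a micro-word: Int, Int, Logic, Shift, FP, M1, M2, Cond. */
  enum { S_INT0, S_INT1, S_LOGIC, S_SHIFT, S_FP, S_M1, S_M2, S_COND, N_SLOTS };

  /* Conversion table: Add fills one Int slot, Add_Sub fills both. */
  static const char *ucode[][N_SLOTS] = {
      /* Add     */ { "ADD", "NOP", "NOP", "NOP", "NOP", "NOP", "NOP", "NOP" },
      /* Add_Sub */ { "ADD", "SUB", "NOP", "NOP", "NOP", "NOP", "NOP", "NOP" },
  };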
40
Execution
        Int  Int  Logic  Shift  FP   M1   M2   Cond
PC1     ADD  NOP  NOP    NOP    NOP  NOP  NOP  NOP
PC2     ADD  NOP  NOP    NOP    NOP  NOP  NOP  NOP

Function Scheduler: both ADDs are dispatched to the
two Int units in the same cycle.
41
Execution
        Int  Int  Logic  Shift  FP   M1   M2   Cond
PC1     ADD  NOP  NOP    ROR    NOP  NOP  NOP  NOP
PC2     ADD  NOP  NOP    NOP    NOP  NOP  NOP  NOP

Function Scheduler: the two ADDs go to the two Int
units and ROR goes to the Shift unit.
42
Execution
        Int  Int  Logic  Shift  FP   M1   M2   Cond
PC1     ADD  NOP  NOP    ROR    NOP  NOP  NOP  NOP
PC2     ADD  NOP  NOP    SHL    NOP  NOP  NOP  NOP

Function Scheduler: the two ADDs go to the two Int
units; ROR and SHL both target the single Shift
unit.
43
Example
for (i = 0; i < n; i++) a[i] = b[i] + c[i];

      li    t1, 0
      li    t2, n
loop: load  r2, array_b(t1)
      load  r3, array_c(t1)
      add   r1, r2, r3
      store r1, array_a(t1)
      add   t1, t1, 1
      jne   t1, t2, loop
44
Microprogram
la a1, array_a
la a2, array_b
la a3, array_c
li r1, n
vadd        (custom-configured instruction)

vadd microprogram (one micro-word per address, one
slot per functional unit; nop slots omitted):

Addr  Operations
00    xor t4,t4,t4
01    load t2,t4,a2 ; load t3,t4,a3 ; if r1=t4, finish
02    add t1,t2,t3
03    store t1,t4,a1
04    add t4,t4,1 ; goto 01
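A sketch, mirroring the table above, of how the vadd microprogram could be stored: one micro-word per micro-address, with only the non-nop slots listed (the representation is illustrative, not the actual hardware encoding).

  /* vadd microprogram: up to three non-nop operations per micro-word. */
  struct micro_row { int addr; const char *ops[3]; };

  static const struct micro_row vadd_ucode[] = {
      { 0, { "xor t4,t4,t4" } },
      { 1, { "load t2,t4,a2", "load t3,t4,a3", "if r1=t4, finish" } },
      { 2, { "add t1,t2,t3" } },
      { 3, { "store t1,t4,a1" } },
      { 4, { "add t4,t4,1", "goto 01" } },
  };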
45
Execution
(Diagram) Execution of the vadd microprogram, one
micro-word per cycle (nop slots omitted):
  xor t4,t4,t4
  load t2,t4,a2 ; load t3,t4,a3 ; if r1=t4, finish
  add t1,t2,t3
  store t1,t4,a1
  add t4,t4,1 ; goto 01
46
Performance Evaluation
47
Performance Evaluation
48
Conclusion
  • The design takes advantage of the VLIW
    architecture to increase parallelism; its
    worst-case performance equals the maximum
    throughput of SMT on a RISC architecture.
  • The instruction size is the same as in a RISC
    architecture.
  • There is overhead to pay for instruction
    configuration and function calls.
  • Effective instruction bandwidth increases for
    loop code.
  • Storage space is needed for the microprogram.