Application Specific Processors with VLIW Architecture - PowerPoint PPT Presentation

1
Application Specific Processors with VLIW Architecture

Anshul Kumar
anshul@cse.iitd.ernet.in
Dept. of CSE, I.I.T. Delhi
Jan 2, 2002

2
Outline

Why VLIW Architecture
Various Facets of VLIW Architecture
Exploring VLIW Design Space
Control Path
Data Path
VLIWs with Application Specific FUs

4
Motivation for ASIPs

Better performance and lower power consumption (compared to general-purpose processors)
Higher flexibility and reuse potential (compared to ASICs)
Our focus: high performance

5
Classification of Parallel Architectures
Parallel architectures (PAs)
  Data-parallel architectures (DPs)
    Vector architectures
    SIMDs
    Associative and neural architectures
    Systolic architectures
  Function-parallel architectures
    Instruction-level PAs (ILPs)
      Pipelined processors
      VLIWs
      Superscalar processors
    Thread-level PAs
    Process-level PAs (MIMDs)
      Distributed-memory MIMD
      Shared-memory MIMD
Ref: Sima et al
6
Distinction between VLIW and Superscalar Processors
VLIW Approach
[Figure: cache/memory feeds a fetch unit that fetches a single multi-operation instruction; its operations go to several FUs sharing one register file]
Ref: Sima et al
7
Distinction between VLIW and Superscalar Processors
Superscalar Approach
[Figure: cache/memory feeds a fetch unit that fetches a sequential stream of instructions; a decode and issue unit issues multiple instructions to several FUs sharing one register file. Legend: instruction/control, data, FU = functional unit]
Ref: Sima et al
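The two slides above differ mainly in *who* groups independent operations. The sketch below is a toy model, assuming unit-latency operations, in-order grouping, and hypothetical op names; the same greedy dependence check runs either once in the compiler (VLIW, producing fixed-width bundles padded with NOPs) or on every execution in the decode/issue hardware (superscalar).

```python
from dataclasses import dataclass

NOP = None  # an empty slot in a VLIW bundle


@dataclass(frozen=True)
class Op:
    name: str
    dest: str
    srcs: tuple


def group_independent(stream, width):
    """Greedy in-order grouping: close the current group when it is full or
    when the next op reads or overwrites a register written inside the group."""
    groups, group, written = [], [], set()
    for op in stream:
        conflict = op.dest in written or any(s in written for s in op.srcs)
        if len(group) == width or conflict:
            groups.append(group)
            group, written = [], set()
        group.append(op)
        written.add(op.dest)
    if group:
        groups.append(group)
    return groups


def superscalar_cycles(stream, width):
    # Superscalar: the decode/issue unit does this grouping in hardware,
    # every time the sequential instruction stream is executed.
    return len(group_independent(stream, width))


def vliw_compile(stream, width):
    # VLIW: the compiler does the grouping once and emits fixed-width
    # multi-operation instructions, padding unused slots with NOPs.
    return [[op.name for op in g] + [NOP] * (width - len(g))
            for g in group_independent(stream, width)]


stream = [
    Op("add_a", "a", ("x", "y")),
    Op("mul_b", "b", ("x",)),
    Op("add_c", "c", ("a", "b")),  # depends on add_a and mul_b
    Op("sub_d", "d", ("y",)),
]
bundles = vliw_compile(stream, width=3)
print(bundles)  # [['add_a', 'mul_b', None], ['add_c', 'sub_d', None]]
print(superscalar_cycles(stream, width=3))  # 2
```

The padded NOP slots in `bundles` are also a small illustration of the code-density problem of VLIWs.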
8
Instruction Execution Timings in Various Architectures
Ref: Hwang et al
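The timing figure is not reproduced here. As a rough companion, the sketch below uses the idealized model from Hwang's textbook (no stalls, k-stage pipeline, N instructions): a scalar base pipeline needs T(1,1) = k + N - 1 cycles and an m-issue machine needs T(m,1) = k + (N - m)/m. A VLIW packing m operations into each long instruction gives k + N/m - 1, which is the same expression. Function names are illustrative.

```python
def t_scalar(n_instr, k_stages):
    # T(1,1) = k + N - 1 base cycles for a scalar k-stage pipeline
    return k_stages + n_instr - 1


def t_multi_issue(n_instr, k_stages, m_issue):
    # T(m,1) = k + (N - m) / m base cycles for an ideal m-issue machine
    # (equal to k + N/m - 1, i.e. N/m long instructions through the pipeline)
    return k_stages + (n_instr - m_issue) / m_issue


def speedup(n_instr, k_stages, m_issue):
    # S(m,1) = m (N + k - 1) / (N + m (k - 1)); approaches m as N grows
    return t_scalar(n_instr, k_stages) / t_multi_issue(n_instr, k_stages, m_issue)


print(t_scalar(12, 4), t_multi_issue(12, 4, 3))  # 15 7.0
```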
9
VLIW History

The term was coined by J. A. Fisher (Yale) in 1983
ELI-512 (prototype)
Trace (commercial)
Origin lies in horizontal microcode optimization
Another pioneering work by B. Ramakrishna Rau in 1982
Polycyclic (prototype)
Cydra-5 (commercial)
Recent developments
TriMedia (Philips)
TMS320C6x (Texas Instruments)

10
Why are superscalar processors commercially more popular than VLIW processors?

Binary code compatibility among scalar and superscalar processors of the same family
The same compiler works for all processors (scalar and superscalar) of the same family
Assembly programming of VLIWs is tedious
Code density in VLIWs is very poor
- Instruction encoding schemes
Area Performance

11
VLIW Architecture for ASIPs

Advantages of superscalar processors don't hold in the ASIP domain:
Code compatibility with off-the-shelf processors
Use of off-the-shelf compilers
ASIPs require retargetable compilers or compiler generators
VLIWs fit nicely into the ASIP philosophy: analyze the application and adapt the architecture
Area saved by omitting dynamic scheduling can be used for application-specific features

13
Data path: A simple VLIW architecture
[Figure: three FUs sharing a single register file]
Scalability? Access time, area, and power consumption increase sharply with the number of register ports
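To make "increase sharply" concrete, the sketch below uses a common first-order model that is an assumption here, not taken from the slide: each FU needs 2 read ports and 1 write port, and each extra port adds roughly one wordline and one bitline per register cell, so cell area grows with the square of the port count. Splitting the same registers and FUs into clusters (as in the clustered alternatives that follow) then cuts total register-file area by about the square of the number of clusters, ignoring the extra ports and wiring of the interconnection network.

```python
def rf_ports(fus, reads_per_fu=2, writes_per_fu=1):
    # Assumed: every FU reads two operands and writes one result per cycle.
    return fus * (reads_per_fu + writes_per_fu)


def relative_rf_area(total_regs, total_fus, clusters):
    """Total register-file area in units of a single-ported register cell,
    assuming cell area grows with ports**2 (first-order wire-dominated model).
    Registers and FUs are split evenly across the clusters."""
    if total_fus % clusters or total_regs % clusters:
        raise ValueError("registers and FUs must divide evenly among clusters")
    ports = rf_ports(total_fus // clusters)
    return clusters * (total_regs // clusters) * ports ** 2


unified = relative_rf_area(64, 8, clusters=1)    # one RF with 24 ports
clustered = relative_rf_area(64, 8, clusters=4)  # four RFs with 6 ports each
print(unified, clustered, unified // clustered)  # 36864 2304 16
```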
14
Data path: Clustered VLIW architecture (distributed register file)
Alternative 1
[Figure: six FUs and three local register files, connected through an interconnection network]
15
Data path: Clustered VLIW architecture (distributed register file)
Alternative 2
[Figure: four FUs and two register files, connected through an interconnection network; outputs can be multicast]
16
Controlling FUs by Instructions

Data stationary encoding
All fields of an instruction are in the same word
Time stationary encoding
Fields of different instructions that act in the same time slot are in the same word
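The two encodings can be related by a simple re-indexing. The sketch below assumes a pipeline where instruction i is issued in cycle i and its field for stage s acts in cycle i + s; stage and field names are hypothetical. A data-stationary word holds all stage fields of one instruction, while time-stationary word t holds field s of instruction t - s for every stage s.

```python
NOP = "nop"


def to_time_stationary(program, n_stages):
    """program[i] is a data-stationary instruction: a tuple of n_stages
    fields, where field s controls pipeline stage s in cycle i + s.
    Returns one time-stationary word per cycle."""
    words = []
    for t in range(len(program) + n_stages - 1):
        word = tuple(
            program[t - s][s] if 0 <= t - s < len(program) else NOP
            for s in range(n_stages)
        )
        words.append(word)
    return words


# Two instructions, each with read / execute / write-back fields.
program = [("rd1", "ex1", "wb1"), ("rd2", "ex2", "wb2")]
for word in to_time_stationary(program, 3):
    print(word)
# ('rd1', 'nop', 'nop')
# ('rd2', 'ex1', 'nop')
# ('nop', 'ex2', 'wb1')
# ('nop', 'nop', 'wb2')
```

With time-stationary words the decoder needs no delay registers for later stages, but the compiler must place each field in the right word.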

17
Data Stationary Unicast-type UniOp
Control flow for data stationary unicast-type UniOp
18
Time Stationary Unicast-type UniOp
Control flow for time stationary unicast-type UniOp
19
Data Stationary Multicast-type UniOp
Control flow for data stationary multicast-type UniOp
20
Time Stationary Multicast-type UniOp
Control flow for time stationary multicast-type UniOp
22
Instruction Encoding: NOP Compression
Ref: Hwang et al (book)
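The slide's figure is not reproduced. As one illustration of the idea, the sketch below uses a simple mask-based scheme, which is an assumption and not necessarily the scheme shown by Hwang et al: each bundle is stored as a bit mask of occupied slots followed by only its non-NOP operations, and the decoder re-expands it to the full width.

```python
NOP = "nop"


def compress(bundle):
    """Return (mask, ops): bit s of mask is set when slot s holds a real op."""
    mask, ops = 0, []
    for slot, op in enumerate(bundle):
        if op != NOP:
            mask |= 1 << slot
            ops.append(op)
    return mask, ops


def decompress(mask, ops, width):
    """Re-expand a compressed bundle to `width` slots, refilling NOPs."""
    it = iter(ops)
    return [next(it) if (mask >> slot) & 1 else NOP for slot in range(width)]


bundle = ["add", "nop", "mul", "nop"]
mask, ops = compress(bundle)
print(bin(mask), ops)                 # 0b101 ['add', 'mul']
print(decompress(mask, ops, 4))       # ['add', 'nop', 'mul', 'nop']
```

The saving depends on how many slots are empty: a mask costs one bit per slot, while each dropped NOP saves a full operation field.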
23
Instruction Encoding: Application Specific

Identify the limited patterns (combinations of fields) actually used and encode them
Eliminate constant bits / fields
Eliminate bits / fields derivable from other bits / fields
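The three steps above can be sketched as follows, assuming the application's instructions are available as tuples of field values (field names and values here are hypothetical). Constant fields are dropped, the remaining field combinations that actually occur are numbered, and each instruction is encoded with ceil(log2(#patterns)) bits. Fields derivable from other fields collapse automatically, because they do not increase the number of distinct patterns.

```python
import math


def encode_patterns(instrs):
    """instrs: list of equal-length tuples of field values.
    Returns (constant_field_indices, pattern_to_code, code_width_bits)."""
    n_fields = len(instrs[0])
    constant = [f for f in range(n_fields)
                if len({ins[f] for ins in instrs}) == 1]
    varying = [f for f in range(n_fields) if f not in constant]
    # Distinct combinations of the varying fields, in first-seen order.
    patterns = list(dict.fromkeys(tuple(ins[f] for f in varying)
                                  for ins in instrs))
    code = {p: idx for idx, p in enumerate(patterns)}
    width = max(1, math.ceil(math.log2(len(patterns))))
    return constant, code, width


# Field 0 (unit) is derivable from field 1 (opcode); field 2 is constant.
instrs = [("alu", "add", "r1"), ("alu", "sub", "r1"),
          ("alu", "add", "r1"), ("mem", "ld", "r1")]
constant, code, width = encode_patterns(instrs)
print(constant, code, width)
# [2] {('alu', 'add'): 0, ('alu', 'sub'): 1, ('mem', 'ld'): 2} 2
```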

24
Synthesizing Instruction Set from Parameterized Microarchitecture
Huang et al, DAC 1994
Bit-width specification for some instruction field types
25
Synthesizing Instruction Set from Parameterized Microarchitecture
MOP specification
26
Synthesizing Instruction Set from Parameterized Microarchitecture
MO1: m(r20) <- r20
MO2: r0 <- r20
MO3: m(r21) <- r21
MO4: r1 <- r21
MO5: r2 <- r22
MO6: PC <- PC1024
Data/control flow graph of the MOPs of a basic block
27
Synthesizing Instruction Set from Parameterized Microarchitecture
Schedule I for the MOPs in the previous slide and the resulting instructions
28
Synthesizing Instruction Set from Parameterized Microarchitecture
Schedule II for the same MOPs and the resulting instructions
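Schedules I and II show that different schedules of the same MOPs yield different instruction sets. The sketch below illustrates that effect with a plain list scheduler, which is an assumption and not necessarily the algorithm of Huang et al. It uses unit latency, per-unit slot limits, and hypothetical MOPs A to D rather than MO1 to MO6, whose exact dependences are not recoverable from the transcript. Each scheduled cycle becomes one instruction, and its combination of units is an instruction format.

```python
def list_schedule(priority, deps, unit_of, capacity):
    """Pack MOPs into instructions (one per cycle). `priority` orders the
    MOPs; a MOP is ready once all its predecessors are in earlier words."""
    done, schedule, remaining = set(), [], list(priority)
    while remaining:
        used, word = {}, []
        for m in remaining:
            u = unit_of[m]
            if deps.get(m, set()) <= done and used.get(u, 0) < capacity[u]:
                word.append(m)
                used[u] = used.get(u, 0) + 1
        if not word:
            raise ValueError("cyclic dependencies")
        done.update(word)
        remaining = [m for m in remaining if m not in word]
        schedule.append(word)
    return schedule


def instruction_formats(schedule, unit_of):
    # The instruction set implied by a schedule: distinct unit combinations.
    return {tuple(sorted(unit_of[m] for m in word)) for word in schedule}


deps = {"C": {"A"}, "D": {"B"}}
unit_of = {"A": "alu", "B": "alu", "C": "mem", "D": "br"}
capacity = {"alu": 1, "mem": 1, "br": 1}

s1 = list_schedule(["A", "B", "C", "D"], deps, unit_of, capacity)
s2 = list_schedule(["B", "A", "C", "D"], deps, unit_of, capacity)
print(s1, instruction_formats(s1, unit_of))  # [['A'], ['B', 'C'], ['D']] ...
print(s2, instruction_formats(s2, unit_of))  # [['B'], ['A', 'D'], ['C']] ...
```

Both schedules take three cycles, but the first needs an alu+mem format and the second an alu+br format.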
30
Implementation of Basic Operations
Operations in the given application:
  primitive: implemented in the HW kernel
  basic: multiple implementation alternatives
    HW choices
    SW (using primitive and other basic ops)
Objective: maximize performance under given area and power constraints
M. Imai et al, ISSS 1996
31
Problem Formulation
Solution vector: $X = (x_0, x_1, \ldots, x_i, \ldots, x_n)$, where
  $x_0$: HW kernel for all primitive ops
  $x_i$: implementation method for the $i$th operation
Area constraint: $\sum_i a(x_i) \le A_{\max}$, where $a(x_i)$ is the area for $x_i$
Power constraint: $\sum_i p(x_i) \le P_{\max}$, where $p(x_i)$ is the power of $x_i$
Objective function: minimize $T(X)$, where $T(X)$ is the execution time for choice $X$
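For a handful of operations, the formulation can be solved by brute force. The sketch below enumerates every solution vector X, discards those violating the area or power constraint, and keeps the one with minimum T(X). Two simplifications are assumptions: T(X) is taken as a plain sum of per-choice cycle counts (the profile-based estimate comes on the following slides), and exhaustive search is exponential in n, so it is not necessarily the method of Imai et al. Candidate names and numbers are hypothetical.

```python
from itertools import product


def best_choice(options, a_max, p_max):
    """options[i]: candidates for x_i (options[0] = HW kernel), each a tuple
    (name, area, power, cycles). Returns (T, names) or None if infeasible."""
    best = None
    for x in product(*options):
        area = sum(c[1] for c in x)
        power = sum(c[2] for c in x)
        if area > a_max or power > p_max:
            continue  # violates sum a(x_i) <= A_max or sum p(x_i) <= P_max
        t = sum(c[3] for c in x)  # placeholder, additive T(X)
        if best is None or t < best[0]:
            best = (t, tuple(c[0] for c in x))
    return best


options = [
    [("kernel", 1, 1, 0)],                     # x0: fixed HW kernel
    [("hw", 3, 2, 1), ("sw", 0, 0, 5)],        # x1 alternatives
    [("hw", 4, 3, 1), ("sw", 0, 0, 6)],        # x2 alternatives
]
print(best_choice(options, a_max=6, p_max=5))  # (6, ('kernel', 'sw', 'hw'))
```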
32
Estimating Execution Time T(X)
$T(X) = \sum_j F_j \cdot \left( t(B_j, X) + C_j \right) - b$, where
  $F_j$: execution frequency of basic block $B_j$
  $t(B_j, X)$: execution cycles for $B_j$ using $X$
  $C_j$: cycles needed to branch from $B_j$ to other blocks
  $b$: execution cycles saved by untaken branches
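A direct transcription of the estimate, with per-block values supplied as lists. The block counts and cycle numbers in the example are hypothetical.

```python
def execution_time(F, t, C, b):
    """T(X) = sum_j F_j * (t(B_j, X) + C_j) - b.
    F[j]: execution frequency of block B_j; t[j]: cycles of B_j under X;
    C[j]: branch cycles out of B_j; b: cycles saved by untaken branches."""
    if not (len(F) == len(t) == len(C)):
        raise ValueError("F, t and C need one entry per basic block")
    return sum(f * (tj + cj) for f, tj, cj in zip(F, t, C)) - b


print(execution_time(F=[10, 5], t=[4, 8], C=[1, 2], b=3))  # 97
```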
33
Estimating Block Execution Time $t(B_j, X)$
Computed from $\bar{\tau}(i)$, the execution cycles for the $i$th basic operation:
$\bar{\tau}(i) = \left\lceil \dfrac{\sum_u k_i(u) \cdot \tau_i(u)}{f_i} \right\rceil$, where
  $u$: a distinct data value tuple for operation $i$
  $\tau_i(u)$: execution cycles for $u$
  $k_i(u)$: frequency count for $u$
  $f_i$: total frequency count
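The average above is a frequency-weighted mean over profiled operand values. The sketch reads the enclosing brackets as a ceiling (rounding up to whole cycles), which is an assumption about the garbled original; the profile entries are hypothetical.

```python
def avg_op_cycles(profile):
    """profile maps each distinct data value tuple u to (k_i(u), tau_i(u)).
    Returns ceil(sum_u k_i(u) * tau_i(u) / f_i), with f_i = sum_u k_i(u)."""
    f_i = sum(k for k, _ in profile.values())
    weighted = sum(k * tau for k, tau in profile.values())
    return -(-weighted // f_i)  # integer ceiling of weighted / f_i


# A SW routine that is fast for zero operands and slower otherwise.
profile = {(0, 0): (30, 1), (3, 7): (10, 4)}
print(avg_op_cycles(profile))  # (30*1 + 10*4) / 40 = 1.75 -> 2
```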
34
Intercluster Connections
Sanchez et al, ISSS 2000
[Figure: VLIW clustered architecture (clusters connected by buses, L1 caches) and detailed architecture of a single cluster (FUs, local register file, incoming value register, MUX network)]
35
Intercluster Connections: Compute Accelerator on Transmogrifier-2
Zhang et al, FPCCM 2000
[Figure: Field-Programmable Compute Accelerator top-level architecture; processing modules (PMs) connected by a top-level interconnection network, with a data/signal bus to the host computer]
36
Intercluster Connections: Compute Accelerator on Transmogrifier-2
Zhang et al, FPCCM 2000
[Figure: Processing Module (PM) architecture; FUs, register files (RF), memory, control logic, PC, and control store, joined by a PM-level interconnection block that connects to the top-level interconnection network and to/from the host computer]
37
Parameters for each Processing Module

Number, sizes, and ports of storage units
Number and types of FUs
PM-level interconnections
Top-level interconnections

39
Coarse-Grain FUs with VLIW Core
Busa et al, ISSS 2000
[Figure: IR and microcode controlling a multiplexer network that connects the registers (Reg1, Reg2) of a MULT, a RAM, an ALU, and a coarse-grain FU; program counter logic]
Embedded (co-)processors as FUs in a VLIW architecture
40
Application Specific FUs - 1
[Figure: an FU characterized by its number of inputs, number of outputs, functionality, latency, initiation interval, and I/O time shape]
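Latency and initiation interval together determine an FU's throughput. The sketch below is a hypothetical FU descriptor (field and instance names are illustrative): with latency L and initiation interval II, n independent operations finish after L + (n - 1) * II cycles, so a fully pipelined FU (II = 1) and an unpipelined one (II = L) can have the same latency but very different throughput. The I/O time shape (when individual inputs are consumed and outputs produced) is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FUSpec:
    name: str
    n_inputs: int
    n_outputs: int
    latency: int              # cycles from operand input to result
    initiation_interval: int  # cycles between successive operation starts

    def cycles_for(self, n_ops: int) -> int:
        """Cycles to finish n_ops independent operations, with a new one
        started every initiation_interval cycles."""
        if n_ops <= 0:
            return 0
        return self.latency + (n_ops - 1) * self.initiation_interval


mac = FUSpec("mac", n_inputs=3, n_outputs=1, latency=4, initiation_interval=1)
div = FUSpec("div", n_inputs=2, n_outputs=2, latency=8, initiation_interval=8)
print(mac.cycles_for(10), div.cycles_for(3))  # 13 24
```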
41
Application Specific FUs - 2
[Figure: an FU that can access memory/cache directly and include control, in addition to using the register file (RF)]
42
Conclusions: What needs to be done?