Title: Application Specific Processors with VLIW Architecture
1Application Specific Processors with VLIW Architecture
- Anshul Kumar
- anshul_at_cse.iitd.ernet.in
- Dept of CSE, I.I.T. Delhi
- Jan 2, 2002
2Outline
- Why VLIW Architecture
- Various Facets of VLIW Architecture
- Exploring VLIW Design Space
- Control Path
- Data Path
- VLIWs with Application Specific FUs
4Motivation for ASIPs
- Better performance and lower power consumption (compared to general purpose processors)
- Higher flexibility and reuse potential (compared to ASICs)
- Our focus: high performance
5Classification of Parallel Architectures
Parallel architectures (PAs)
- Data-parallel architectures (DPs)
  - Vector architectures
  - Associative and neural architectures
  - SIMDs
  - Systolic architectures
- Function-parallel architectures
  - Instruction-level PAs (ILPs)
    - Pipelined processors
    - VLIWs
    - Superscalar processors
  - Thread-level PAs
  - Process-level PAs (MIMDs)
    - Distributed memory MIMD
    - Shared memory MIMD
Ref Sima et al
6Distinction between VLIW and Superscalar processors
VLIW Approach
[Figure: the fetch unit reads a single multi-operation instruction from cache/memory; the multi-operation instruction drives the FUs, which share a register file]
Ref Sima et al
7Distinction between VLIW and Superscalar processors
Superscalar Approach
[Figure: the fetch unit reads a sequential stream of instructions from cache/memory; the decode and issue unit sends multiple instructions to the FUs, which share a register file. Instruction/control paths and data paths are drawn separately]
FU = Functional Unit
Ref Sima et al
8Instruction Execution Timings in various Architectures
Ref Hwang et al
9VLIW History
- The term was coined by J.A. Fisher (Yale) in 1983
  - ELI-512 (prototype)
  - Trace (commercial)
- Origin lies in horizontal microcode optimization
- Another pioneering work by B. Ramakrishna Rau in 1982
  - Polycyclic (prototype)
  - Cydra-5 (commercial)
- Recent developments
  - TriMedia (Philips)
  - TMS320C6x (Texas Instruments)
10Why are superscalar processors commercially more popular than VLIW processors?
- Binary code compatibility among scalar and superscalar processors of the same family
- Same compiler works for all processors (scalar and superscalar) of the same family
- Assembly programming of VLIWs is tedious
- Code density in VLIWs is very poor
  - Instruction encoding schemes: area vs. performance
11VLIW Architecture for ASIPs
- Advantages of superscalar processors don't hold in the ASIP domain
  - Code compatibility with off-the-shelf processors
  - Use of off-the-shelf compilers
- ASIPs require retargetable compilers or compiler generators
- VLIWs fit nicely into the ASIP philosophy: analyze the application and adapt the architecture
- Area saved by omitting dynamic scheduling can be used for application-specific features
12Outline
- Why VLIW Architecture
- Various Facets of VLIW Architecture
- Exploring VLIW Design Space
- Control Path
- Data Path
- VLIWs with Application Specific FUs
13Data path: A simple VLIW Architecture
[Figure: several FUs connected to a single shared register file]
Scalability? Access time, area, and power consumption increase sharply with the number of register ports.
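The scalability concern can be illustrated with a rough first-order model. The sketch below assumes that each FU needs two read ports and one write port, and that register file area grows linearly with the number of registers and quadratically with the number of ports. Both are illustrative assumptions, not figures from these slides, and the function names are hypothetical.

```python
# First-order register-file cost model (illustrative assumptions only).
READ_PORTS_PER_FU = 2   # assumed: two source operands per operation
WRITE_PORTS_PER_FU = 1  # assumed: one result per operation


def ports_needed(num_fus: int) -> int:
    """Ports a single shared register file needs to feed num_fus FUs."""
    return num_fus * (READ_PORTS_PER_FU + WRITE_PORTS_PER_FU)


def relative_area(num_registers: int, num_ports: int) -> int:
    """Assumed model: area is linear in registers and quadratic in ports."""
    return num_registers * num_ports ** 2


if __name__ == "__main__":
    for fus in (1, 2, 4, 8):
        ports = ports_needed(fus)
        print(f"{fus} FUs -> {ports} ports, relative area {relative_area(32, ports)}")
```

Under this assumed model, doubling the number of FUs on one register file quadruples its relative area, which is the kind of growth that motivates distributing the register file.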
14Data path: Clustered VLIW Architecture (distributed register file)
Alternative 1
[Figure: six FUs and three register files connected through an interconnection network]
15Data path: Clustered VLIW Architecture (distributed register file)
Alternative 2
[Figure: four FUs and two register files connected through an interconnection network; outputs can be multicast]
16Controlling FUs by Instructions
- Data stationary encoding
  - All fields of an instruction are in the same word
- Time stationary encoding
  - Fields of different instructions that act in the same time slot are in the same word
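The difference between the two encodings can be sketched in a few lines. The example below assumes a hypothetical three-stage pipeline (read, execute, write) in which one operation is issued per cycle and each operation carries one control field per stage; the stage names and field values are made up for illustration.

```python
from typing import Dict, List

STAGES = ["read", "execute", "write"]  # assumed three-stage pipeline

# Each operation carries one control field per pipeline stage (hypothetical values).
Operation = Dict[str, str]


def data_stationary(ops: List[Operation]) -> List[Operation]:
    """Word i holds every field of operation i, regardless of when each field acts."""
    return [dict(op) for op in ops]


def time_stationary(ops: List[Operation]) -> List[Operation]:
    """Word t holds the fields that act in cycle t, taken from different operations."""
    num_words = len(ops) + len(STAGES) - 1
    words: List[Operation] = []
    for t in range(num_words):
        word: Operation = {}
        for s, stage in enumerate(STAGES):
            i = t - s  # the operation issued in cycle i is in stage s at cycle t
            if 0 <= i < len(ops):
                word[stage] = ops[i][stage]
        words.append(word)
    return words


if __name__ == "__main__":
    program = [
        {"read": "r1,r2", "execute": "add", "write": "r3"},
        {"read": "r4,r5", "execute": "mul", "write": "r6"},
    ]
    for t, word in enumerate(time_stationary(program)):
        print(t, word)
```

In the time stationary form, the word for cycle 1 mixes the read field of the second operation with the execute field of the first, whereas the data stationary form keeps each operation's fields together and leaves the delaying of fields to the hardware.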
17Data Stationary Unicast type UniOp
Control flow for Data Stationary Unicast type UniOp
18Time Stationary Unicast type UniOp
Control flow for Time Stationary Unicast type UniOp
19Data Stationary Multicast type UniOp
Control flow for Data Stationary Multicast type UniOp
20Time Stationary Multicast type UniOp
Control flow for Time Stationary Multicast type UniOp
21Outline
- Why VLIW Architecture
- Various Facets of VLIW Architecture
- Exploring VLIW Design Space
- Control Path
- Data Path
- VLIWs with Application Specific FUs
22Instruction Encoding: NOP Compression
Book: Hwang et al
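The slide's figure is not available here, so the following is only a minimal sketch of one common way to compress NOPs: store a presence mask per instruction word plus the operations that are actually present. This mask-based scheme is an assumption chosen for illustration and is not necessarily the scheme shown in the referenced book; all names are hypothetical.

```python
from typing import List, Optional, Tuple

Op = Optional[str]               # None stands for a NOP slot
Packed = Tuple[int, List[str]]   # (presence mask, operations actually stored)


def compress(word: List[Op]) -> Packed:
    """Drop NOP slots; bit k of the mask is set when slot k holds a real operation."""
    mask = 0
    ops: List[str] = []
    for k, op in enumerate(word):
        if op is not None:
            mask |= 1 << k
            ops.append(op)
    return mask, ops


def decompress(packed: Packed, num_slots: int) -> List[Op]:
    """Rebuild the full-width word, re-inserting NOPs where the mask bit is clear."""
    mask, ops = packed
    remaining = iter(ops)
    return [next(remaining) if mask & (1 << k) else None for k in range(num_slots)]


if __name__ == "__main__":
    word = ["add", None, None, "ld"]
    packed = compress(word)
    print(packed, decompress(packed, len(word)) == word)
```

With this layout a four-slot word containing two NOPs stores two operations and a four-bit mask instead of four full operation fields.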
23Instruction Encoding: Application Specific
- Identify the limited patterns (combinations of fields) and encode
- Eliminate constant bits / fields
- Eliminate bits / fields derivable from other bits / fields
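The first two steps above can be sketched as follows. The example assumes a program is given as a non-empty list of equal-length tuples of field values; it drops fields that never change and then dictionary-encodes the remaining distinct field combinations. The third step (fields derivable from other fields) is not covered. Function names and the sample program are hypothetical.

```python
from math import ceil, log2
from typing import List, Sequence, Tuple

Word = Tuple[object, ...]


def constant_fields(words: Sequence[Word]) -> List[int]:
    """Indices of fields that hold the same value in every instruction word."""
    return [c for c in range(len(words[0])) if len({w[c] for w in words}) == 1]


def encode_patterns(words: Sequence[Word]) -> Tuple[List[Word], List[int], int]:
    """Drop constant fields, then replace each remaining pattern by a table index."""
    const = set(constant_fields(words))
    reduced = [tuple(v for c, v in enumerate(w) if c not in const) for w in words]
    table = sorted(set(reduced))              # one entry per distinct pattern
    index = {pattern: i for i, pattern in enumerate(table)}
    codes = [index[pattern] for pattern in reduced]
    bits = max(1, ceil(log2(len(table))))     # bits needed per encoded word
    return table, codes, bits


if __name__ == "__main__":
    program = [("add", "r1", 0), ("add", "r2", 0), ("add", "r1", 0), ("sub", "r1", 0)]
    print(constant_fields(program))
    print(encode_patterns(program))
```

The decoder then needs the pattern table (and the constant field values) in hardware, so the saving in instruction memory has to be weighed against the cost of that table.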
24Synthesizing Instruction Set from parameterized Micro Architecture
Huang et al, DAC 1994
Bit width specification for some instruction field types
25Synthesizing Instruction Set from parameterized Micro Architecture
MOP Specification
26Synthesizing Instruction Set from parameterized Micro Architecture
MO1: m(r20) <- r20
MO2: r0 <- r20
MO3: m(r21) <- r21
MO4: r1 <- r21
MO5: r2 <- r22
MO6: PC <- PC1024
Data/Control flow graph of MOP of a basic block
27Synthesizing Instruction Set from parameterized Micro Architecture
Schedule I for the MOP in the previous slide and the resulting instructions
28Synthesizing Instruction Set from parameterized Micro Architecture
Schedule II for the MOP in slide 26 and the resulting instructions
29Outline
- Why VLIW Architecture
- Various Facets of VLIW Architecture
- Exploring VLIW Design Space
- Control Path
- Data Path
- VLIWs with Application Specific FUs
30Implementation of Basic Operations
Operations in given application
- primitive: implemented in HW kernel
- basic: multiple implementation alternatives
  - HW choices
  - SW (primitive + other basic ops)
Objective: maximize performance under given area and power constraints
M. Imai et al, ISSS 1996
31Problem Formulation
Solution vector: $X = (x_0, x_1, \ldots, x_i, \ldots, x_n)$
  where $x_0$ = HW kernel for all primitive ops
        $x_i$ = implementation method for the $i$th operation
Area constraint: $\sum_{i} a(x_i) \le A_{\max}$
  where $a(x_i)$ = area for $x_i$
Power constraint: $\sum_{i} p(x_i) \le P_{\max}$
  where $p(x_i)$ = power of $x_i$
Objective function: minimize $T(X)$
  where $T(X)$ = execution time for choice $X$
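A minimal sketch of this formulation, assuming a brute-force search over all combinations of implementation alternatives. The slides do not say which search method is used, so exhaustive enumeration is only a stand-in; the alternative names, area and power numbers, and the toy execution-time function are all made up for illustration.

```python
from itertools import product
from typing import Callable, Optional, Sequence, Tuple

# One implementation alternative: (name, area, power). All numbers are made up.
Alternative = Tuple[str, float, float]
Choice = Tuple[Alternative, ...]


def best_choice(
    alternatives: Sequence[Sequence[Alternative]],
    a_max: float,
    p_max: float,
    exec_time: Callable[[Choice], float],
) -> Optional[Choice]:
    """Return the feasible choice X with minimum exec_time(X), or None if none fits."""
    best: Optional[Choice] = None
    best_t = float("inf")
    for x in product(*alternatives):          # every combination of alternatives
        area = sum(a for _, a, _ in x)
        power = sum(p for _, _, p in x)
        if area > a_max or power > p_max:     # violates a constraint
            continue
        t = exec_time(x)
        if t < best_t:
            best, best_t = x, t
    return best


if __name__ == "__main__":
    options = [
        [("kernel", 10, 5)],                      # x0: HW kernel, always present
        [("mul_hw", 8, 4), ("mul_sw", 0, 0)],     # x1: multiply in HW or SW
        [("div_hw", 12, 6), ("div_sw", 0, 0)],    # x2: divide in HW or SW
    ]
    cycles = {"kernel": 0, "mul_hw": 100, "mul_sw": 400, "div_hw": 50, "div_sw": 900}
    toy_time = lambda x: sum(cycles[name] for name, _, _ in x)
    print(best_choice(options, a_max=25, p_max=20, exec_time=toy_time))
```

With these made-up numbers, putting both operations in hardware exceeds the area budget, so the search keeps the divider in hardware and the multiplier in software. Enumeration grows exponentially with the number of operations, so a practical tool would need a smarter search.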
32Estimating Execution Time T(X)
$T(X) = \sum_{j} F_j \cdot (t(B_j, X) + C_j) - b$
  where $F_j$ = execution frequency of basic block $B_j$
        $t(B_j, X)$ = execution cycles for $B_j$ using $X$
        $C_j$ = cycles needed to branch from $B_j$ to other blocks
        $b$ = execution cycles reduced by untaken branch
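The estimate can be written as a short function. This sketch assumes the reading T(X) = sum over j of F_j * (t(B_j, X) + C_j), minus b: the operators were lost in the extracted slide text, so the plus sign inside the parentheses and the subtraction of b are assumptions based on the definitions given. The profile numbers are made up.

```python
from typing import Iterable, Tuple

# One profiled basic block: (F_j, t(B_j, X), C_j).
BlockProfile = Tuple[int, int, int]


def estimate_time(blocks: Iterable[BlockProfile], b: int = 0) -> int:
    """Frequency-weighted cycle count over all basic blocks, minus the b correction."""
    return sum(f * (t + c) for f, t, c in blocks) - b


if __name__ == "__main__":
    profile = [(10, 5, 2), (1, 20, 0)]  # made-up frequencies and cycle counts
    print(estimate_time(profile, b=4))
```

Because t(B_j, X) depends on the chosen implementation methods, this function would be re-evaluated for each candidate X during design space exploration.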
33Estimating Block Execution Time t(Bj,X)
Computed from $\bar{\tau}(i)$ = execution cycles for the $i$th basic operation
$\bar{\tau}(i) = \lceil (\sum_{u} k_i(u) \cdot \tau_i(u)) / f_i \rceil$
  where $u$ = a distinct data value tuple for operation $i$
        $\tau_i(u)$ = execution cycles
        $k_i(u)$ = frequency count for $u$
        $f_i$ = total frequency count
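The per-operation cycle count described here is a frequency-weighted average over the observed operand tuples. The sketch below assumes the average is rounded up to a whole cycle, which is how the garbled brackets in the extracted text are read here, and it takes the total frequency count as the sum of the per-tuple counts. The sample data is made up.

```python
from typing import Dict, Tuple

# Maps an operand tuple u to (k_i(u), cycles_i(u)): how often u occurred
# and how many cycles the operation takes for that u.
Samples = Dict[Tuple[int, ...], Tuple[int, int]]


def average_cycles(samples: Samples) -> int:
    """Frequency-weighted mean cycle count, rounded up (assumed) to a whole cycle."""
    total_count = sum(k for k, _ in samples.values())       # f_i
    weighted = sum(k * cycles for k, cycles in samples.values())
    return -(-weighted // total_count)                       # integer ceiling division


if __name__ == "__main__":
    observed = {(3, 5): (2, 4), (100, 7): (1, 9)}  # made-up profile data
    print(average_cycles(observed))
```

This matters for data-dependent operations, such as a sequential multiplier or divider, whose cycle count varies with the operand values seen in the profiled application.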
34Intercluster connections
Sanchez et al, ISSS 2000
[Figure: VLIW clustered architecture: clusters and L1 caches connected by buses]
[Figure: Detailed architecture of a single cluster: incoming value register, MUX, MUX network, local register file, and FUs]
35Intercluster connections: Compute Accelerator on Transmogrifier-2
Zhang et al, FPCCM 2000
[Figure: Field-Programmable Compute Accelerator, top level architecture: host computer, data/signal bus, processing modules (PMs), and top level interconnection network]
PM = Processing Module
36Intercluster connections: Compute Accelerator on Transmogrifier-2
Zhang et al, FPCCM 2000
[Figure: Processing Module (PM) architecture: FUs, register files (RF), memory, control store, PC, and control logic around a PM-level interconnection block, with links to/from the top-level interconnection network and from/to the host computer]
37Parameters for each Processing Module
- No., sizes, and ports of storage units
- No. and types of FUs
- PM level interconnections
- Top level interconnections
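These parameters could be captured in a small configuration record for design space exploration. The sketch below is one hypothetical way to do that; the class names, field names, and sample values are assumptions for illustration and do not come from Zhang et al.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StorageUnit:
    name: str
    words: int   # size of the unit
    ports: int   # number of access ports


@dataclass
class ProcessingModuleConfig:
    storage_units: List[StorageUnit]
    fu_counts: Dict[str, int]                               # FU type -> count
    pm_level_links: List[str] = field(default_factory=list)   # links inside the PM
    top_level_links: List[str] = field(default_factory=list)  # links to other PMs

    def total_fus(self) -> int:
        return sum(self.fu_counts.values())

    def total_storage_ports(self) -> int:
        return sum(unit.ports for unit in self.storage_units)


# Made-up example configuration.
example_pm = ProcessingModuleConfig(
    storage_units=[StorageUnit("RF0", 16, 4), StorageUnit("MEM", 1024, 1)],
    fu_counts={"alu": 2, "mult": 1},
    pm_level_links=["RF0-alu", "RF0-mult", "MEM-RF0"],
    top_level_links=["PM1"],
)

if __name__ == "__main__":
    print(example_pm.total_fus(), example_pm.total_storage_ports())
```

Keeping the parameters in one record makes it straightforward to enumerate candidate configurations and feed each one to an area or performance estimator.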
38Outline
- Why VLIW Architecture
- Various Facets of VLIW Architecture
- Exploring VLIW Design Space
- Control Path
- Data Path
- VLIWs with Application Specific FUs
39Coarse grain FUs with VLIW core
Busa et al, ISSS 2000
[Figure: embedded (co)-processors as FUs in a VLIW architecture. Components shown: multiplexer network, IR, micro code, program counter logic, Reg1/Reg2 register pairs, ALU, MULT, RAM, and a coarse grain FU]
40Application Specific FUs - 1
FU parameters:
- number of inputs
- number of outputs
- functionality
- latency, initiation interval, I/O time shape
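The timing parameters named on this slide can be collected in a small FU descriptor. The sketch below uses the usual meanings of the terms: latency is the number of cycles from issue to result, and initiation interval is the minimum number of cycles between successive issues to the same FU. The class name and the sample FUs are hypothetical, and the I/O time shape is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionalUnit:
    name: str
    num_inputs: int
    num_outputs: int
    latency: int              # cycles from issue to result
    initiation_interval: int  # minimum cycles between successive issues

    def cycles_for(self, n_ops: int) -> int:
        """Cycles to complete n_ops independent back-to-back operations on this FU."""
        if n_ops <= 0:
            return 0
        return self.latency + (n_ops - 1) * self.initiation_interval


if __name__ == "__main__":
    pipelined = FunctionalUnit("mac", 3, 1, latency=4, initiation_interval=1)
    blocking = FunctionalUnit("div", 2, 1, latency=4, initiation_interval=4)
    print(pipelined.cycles_for(10), blocking.cycles_for(10))
```

Two FUs with the same latency can differ greatly in throughput: a fully pipelined unit accepts a new operation every cycle, while a non-pipelined one must finish each operation before starting the next.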
41Application Specific FUs - 2
FUs may also:
- access memory
- include control
[Figure: FU connected to the RF and to Mem/Cache]
42Conclusions: What needs to be done?
- Define design space
- Profile and analyze application
- Identify architectural parameters
- Identify special functional units
- Estimate performance
- Synthesize processor
- Generate code
- Validate design
43Acknowledgements
- Manoj Kumar Jain
- manoj_at_cse.iitd.ernet.in
- C. P. Joshi
- csm99003_at_cse.iitd.ernet.in
44Thanks