Title: Benchmark Implementations on Raw and MONARCH processor architectures
1Benchmark Implementations on Raw and MONARCH
processor architectures
- Jinwoo Suh (jsuh_at_isi.edu)
- University of Southern California
- Information Sciences Institute
- 3811 N. Fairfax Dr. Suite 200
- Arlington, VA, 22203, USA
- August 10-11, 2007
2Overview
- Two Research Architectures
- Raw Processor
- Morphable Networked MicroArchitecture (MONARCH)
- Stream Virtual Machine
- Portability, performance, efficiency
- Benchmarks and Implementation Results
- Matrix Multiplication
- FIR bank
3Raw Processor
- Developed by MIT
- Contains 16 tiles
- Contains 2D mesh networks
- Dynamic network
- Two static networks
- Memory network
- Each tile is a single-issue, MIPS-like core
- A network port is mapped to a register
- Communication is as easy as reading from/writing to a register
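The register-mapped network port can be pictured with a small toy model. This is only a sketch under stated assumptions: the `Tile` class, the `"net"` register name, and the queue-based ports are hypothetical illustrations, not Raw's actual ISA or register numbering.

```python
from collections import deque


class Tile:
    """Toy model of a tile whose network port looks like a register."""

    NET_REG = "net"  # hypothetical name for the network-mapped register

    def __init__(self):
        self.regs = {}          # ordinary registers
        self.net_in = deque()   # words arriving from the network
        self.net_out = deque()  # words leaving for the network

    def read(self, reg):
        # Reading the network register consumes the next incoming word.
        if reg == self.NET_REG:
            return self.net_in.popleft()
        return self.regs[reg]

    def write(self, reg, value):
        # Writing the network register sends a word instead of storing it.
        if reg == self.NET_REG:
            self.net_out.append(value)
        else:
            self.regs[reg] = value

    def add(self, dst, src_a, src_b):
        # An ordinary ALU instruction; any operand may be the network port.
        self.write(dst, self.read(src_a) + self.read(src_b))
```

In this model, `tile.add("net", "net", "net")` receives two words, adds them, and sends the sum, with no separate load, store, or send/receive call.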
4Raw
[Figure: Raw tile block diagram. Labels: computing processor (8 stage 32 bit, single issue, in order); 4-stage pipelined FPU; communication processor; crossbar switch; 8 32-bit channels; caches labeled 32 KB I-Cache, 64 KB I-Cache, and 32 KB D-Cache]
- Network port is like a register
- Programmable switches
M. B. Taylor, et al., Scalar Operand Networks: On-chip Interconnect for ILP in Partitioned Architectures, ICHPC, 2003
M. B. Taylor, et al., A 16-issue multiple-program-counter microprocessor with point-to-point scalar operand network, IEEE ISSCC, 2003
5Raw Handheld Board
- Developed by ISI-East and MIT
- One Raw chip and several FPGAs for glue logic
6Morphable Networked MicroArchitecture (MONARCH) Processor
- By Raytheon and USC/ISI
- Both control flow processors and data flow processors
- 6 RISCs and 12 ALU clusters
- Provides high performance
- 64 GOPS peak ALU performance at 333 MHz
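The peak figure above can be converted into operations per clock cycle with simple arithmetic. The sketch below uses only the two numbers quoted on the slide; the function name is illustrative.

```python
PEAK_OPS_PER_SECOND = 64e9  # 64 GOPS peak ALU performance (from the slide)
CLOCK_HZ = 333e6            # 333 MHz clock (from the slide)


def ops_per_cycle(peak_ops_per_second, clock_hz):
    """Peak operations that must complete every clock cycle."""
    return peak_ops_per_second / clock_hz


if __name__ == "__main__":
    print(f"{ops_per_cycle(PEAK_OPS_PER_SECOND, CLOCK_HZ):.1f} ops per cycle")
```

The result is roughly 192 operations per cycle across the chip.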
7Stream Virtual Machine
- Stream processing processes input stream data and generates output stream data
- Exploits the properties of stream applications, such as parallelism and throughput orientation
- A uniform approach to stream processing for multiple input languages and multiple processor architectures
- Developed by Morphware forum (morphware.org)
- Centered around Stable Architecture Abstraction Layer
- Part of the layer is Stream Virtual Machine (SVM)
- Consists of three major components
- High Level Compiler
- Low Level Compiler
- Machine model
8Matrix Multiplication
- C = AB
- Boundary tiles emulate network input/output by
generating and consuming data
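[Figure: tile layout for matrix multiplication. Labels: A source, B source, C destination, matrix multiplication tiles, with streams A and B flowing in and C flowing out]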
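The arrangement on this slide, with boundary tiles generating and consuming data around the compute tiles, can be sketched with Python generators. This is a minimal illustration, not the SVM API or the actual tile mapping: the function names are hypothetical, A is assumed to stream by rows and B by columns, and a single generator stands in for all compute tiles.

```python
def a_source(a):
    """Boundary tile emulating network input: streams the rows of A."""
    for row in a:
        yield list(row)


def b_source(b):
    """Boundary tile emulating network input: streams the columns of B."""
    for column in zip(*b):
        yield list(column)


def multiply_tiles(a_rows, b_columns):
    """Compute stage: consumes both streams and emits the rows of C = AB."""
    columns = list(b_columns)  # B is reused for every row of A
    for row in a_rows:
        c_row = []
        for column in columns:
            acc = 0
            for x, y in zip(row, column):
                acc += x * y  # one multiplication-addition pair
            c_row.append(acc)
        yield c_row


def c_destination(c_rows):
    """Boundary tile emulating network output: consumes the rows of C."""
    return list(c_rows)
```

A call such as `c_destination(multiply_tiles(a_source(A), b_source(B)))` wires the three stages together in the same source, compute, destination order as the figure.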
9Matrix Multiplication Implementation
- Hand coded using the SVM API (not HLC-generated code)
- Cost analysis and optimizations
- Full implementation
- Full SVM stream communication through a dynamic network
- One stream per network
- Each stream is allocated to a network.
- Broadcast
- With broadcasting by switch processor
- Communication is off-loaded from compute processor.
- Network ports as operands
- Raw can use network ports as operands
- Reduces cycles since load/store operations are eliminated
10Matrix Multiplication Results
- Number of cycles per multiplication-addition pair
- Lower bound: 2
- Multiplication
- Addition
Number of cycles per multiplication-addition pair:
Best obtained result: 2.23
Lower bound: 2
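One way to read these two numbers is as an efficiency relative to the lower bound. The slide does not state this ratio itself, so the calculation below is an interpretation of the reported figures, and the function name is illustrative.

```python
def efficiency(lower_bound_cycles, measured_cycles):
    """Fraction of the lower-bound rate that was actually achieved."""
    return lower_bound_cycles / measured_cycles


if __name__ == "__main__":
    # 2 cycles is the lower bound; 2.23 cycles is the best obtained result.
    print(f"{efficiency(2, 2.23):.1%}")
```

With the slide's values, the best result reaches a little under 90% of the lower-bound rate.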
11FIR Banks
- Multiple FIR filters specified by Lincoln Lab
- Implemented by using radix-4 FFT, multiplication, and radix-4 IFFT
- Optimizations using hand-assembly in core operations
- Minimize pipeline bubbles
- Manual instruction scheduling
- Prevent register spilling
- Prone to this problem since radix-4 FFT requires more registers
- Minimizing register requirement
- Code expansion
- Minimize address calculation
- Using offset
- Duplicated and rearranged twiddle factors
- Minimize data copy operation
- Reverse the order of processing back to front
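The FFT, multiplication, IFFT structure named above can be sketched in plain Python. This is an illustration of the technique only, not the hand-optimized Raw code: the function names are hypothetical, the recursive radix-4 decimation-in-time form is an assumption, and zero-padding a block to 64 points to obtain a linear (rather than circular) convolution is also an assumption.

```python
import cmath


def fft_radix4(x, inverse=False):
    """Recursive radix-4 decimation-in-time FFT; len(x) must be a power of 4.

    The inverse transform is left unscaled (the caller divides by len(x)).
    """
    n = len(x)
    if n == 1:
        return list(x)
    if n % 4 != 0:
        raise ValueError("length must be a power of 4")
    sign = 1 if inverse else -1
    quarter = n // 4
    # Four interleaved sub-transforms of length n/4.
    subs = [fft_radix4(x[r::4], inverse) for r in range(4)]
    rot = complex(0, sign)  # W_N^(N/4): -j forward, +j inverse
    out = [0j] * n
    for q in range(quarter):
        # Apply twiddle factors W_N^(r*q) to the sub-transform outputs.
        t = [cmath.exp(sign * 2j * cmath.pi * r * q / n) * subs[r][q]
             for r in range(4)]
        # Radix-4 butterfly: only multiplications by 1, -1, j, -j.
        for m in range(4):
            out[q + m * quarter] = sum(rot ** (r * m) * t[r] for r in range(4))
    return out


def fir_via_fft(samples, taps, n=64):
    """FIR filtering as FFT, pointwise multiplication, then IFFT."""
    out_len = len(samples) + len(taps) - 1
    if out_len > n:
        raise ValueError("block too long for the chosen FFT size")
    xs = list(samples) + [0.0] * (n - len(samples))  # zero-pad input
    hs = list(taps) + [0.0] * (n - len(taps))        # zero-pad coefficients
    spectrum = [a * b for a, b in zip(fft_radix4(xs), fft_radix4(hs))]
    y = fft_radix4(spectrum, inverse=True)
    return [v.real / n for v in y[:out_len]]


def fir_direct(samples, taps):
    """Reference time-domain convolution, used only to check the result."""
    out = [0.0] * (len(samples) + len(taps) - 1)
    for i, s in enumerate(samples):
        for j, t in enumerate(taps):
            out[i + j] += s * t
    return out
```

The butterfly loop shows why radix-4 is attractive, since its inner rotations need no general complex multiplications, and also why it keeps more values live at once than radix-2, which is consistent with the register pressure noted in the list above.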
12FIR Bank Results
- Definitions
- LB (UB): lower (upper) bound based on the number of floating point operations
- ILB (IUB): lower (upper) bound based on the number of floating point operations and load/store instructions
- Hand Optimization: results of the hand-assembly work
- Compiler Optimization: only compiler optimization was done
- One FFT-multiplication-IFFT
- For 64 sample data
[Chart: number of operations per cycle]
13Implementation of FIR on MONARCH Processor
- FIR bank implemented
- Several sets of FIR in time domain
- Performance results collected for various numbers of sets, input data, and coefficients
- Manual coding using MONARCH assembly language
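A time-domain FIR bank, parameterized by the same three quantities the slide varies (number of sets, number of input data, number of coefficients), can be sketched as follows. The function names are hypothetical, and the direct-form loop is a plain reference formulation rather than the MONARCH assembly mapping.

```python
def fir_filter(samples, coefficients):
    """Direct-form FIR: y[n] = sum over k of c[k] * x[n - k]."""
    output = []
    for n in range(len(samples)):
        acc = 0.0
        for k, c in enumerate(coefficients):
            if n - k >= 0:  # samples before the start are treated as zero
                acc += c * samples[n - k]
        output.append(acc)
    return output


def fir_bank(input_sets, coefficient_sets):
    """Apply one independent FIR filter per set."""
    if len(input_sets) != len(coefficient_sets):
        raise ValueError("need one coefficient set per input set")
    return [fir_filter(x, h) for x, h in zip(input_sets, coefficient_sets)]
```

Each set is independent of the others, so the sets can be processed in parallel.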
14Implementation of FIR on MONARCH Processor
15Implementation of FIR on MONARCH Processor (Contd)
- At some points, relatively low efficiency observed
- Due to a non-integer ratio between the number of coefficients and the number of functional units
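A simple model shows how a non-integer ratio can lower efficiency. This is a hypothetical illustration, not the analysis from the talk: it assumes the coefficients are processed in passes of `num_units` at a time, so the last pass leaves some units idle whenever the coefficient count is not a multiple of the unit count. The unit counts used below are placeholders, not MONARCH's actual configuration.

```python
def mapping_efficiency(num_coefficients, num_units):
    """Fraction of functional-unit slots doing useful work.

    Assumes coefficients are handled in whole passes over num_units units.
    """
    passes = -(-num_coefficients // num_units)  # integer ceiling division
    return num_coefficients / (passes * num_units)


if __name__ == "__main__":
    for taps in (8, 12, 16, 24):
        print(taps, round(mapping_efficiency(taps, 12), 3))
```

Under this model, an integer ratio keeps every unit busy, while a count just above a multiple wastes most of the final pass.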
16Estimation of Efficiency for Better Algorithms
17Conclusion
- Raw provides 16 independent tiles on a chip
- Each tile can perform independent computations
- MONARCH provides multiple RISCs and data flow computations
- May take advantage of both worlds
- Our benchmark implementations exploit each architecture's capabilities
- Obtained high efficiency on both architectures
- Only data flow computation parts are used in this implementation