Benchmark Implementations on Raw and MONARCH processor architectures

1
Benchmark Implementations on Raw and MONARCH
processor architectures
  • Jinwoo Suh (jsuh_at_isi.edu)
  • University of Southern California
  • Information Sciences Institute
  • 3811 N. Fairfax Dr. Suite 200
  • Arlington, VA, 22203, USA
  • August 10-11, 2007

2
Overview
  • Two Research Architectures
  • Raw Processor
  • Morphable Networked MicroArchitecture (MONARCH)
  • Stream Virtual Machine
  • Portability, performance, efficiency
  • Benchmarks and Implementation Results
  • Matrix Multiplication
  • FIR bank

3
Raw Processor
  • Developed by MIT
  • Contains 16 tiles
  • Contains 2D mesh networks
  • Dynamic network
  • Two static networks
  • Memory network
  • Each tile is a single-issue, MIPS-like core
  • A network port is mapped to a register
  • Communication is as easy as reading from/writing
    to a register
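
The register-mapped port idea can be illustrated with a toy model. This is a minimal sketch in Python, not Raw code: the `Tile` class, the `NET_PORT` register name, and the simple FIFO behaviour are all assumptions made for illustration.

```python
from collections import deque

NET_PORT = "net"  # hypothetical name for the register mapped to the network port


class Tile:
    """Toy tile: ordinary registers plus one register name backed by FIFOs."""

    def __init__(self):
        self.regs = {}         # ordinary general-purpose registers
        self.inbox = deque()   # words arriving from the network
        self.outbox = deque()  # words leaving for the network

    def read(self, reg):
        # Reading the port register pops the next word from the network.
        # (This toy raises IndexError on an empty inbox instead of stalling.)
        if reg == NET_PORT:
            return self.inbox.popleft()
        return self.regs[reg]

    def write(self, reg, value):
        # Writing the port register pushes a word onto the network.
        if reg == NET_PORT:
            self.outbox.append(value)
        else:
            self.regs[reg] = value

    def add(self, dst, src1, src2):
        # One "instruction": any operand or the destination may be the port.
        self.write(dst, self.read(src1) + self.read(src2))


def connect(sender, receiver):
    """Move every word from the sender's outbox into the receiver's inbox."""
    while sender.outbox:
        receiver.inbox.append(sender.outbox.popleft())


producer, consumer = Tile(), Tile()
producer.write(NET_PORT, 3)             # sending is just a register write
producer.write(NET_PORT, 4)
connect(producer, consumer)
consumer.add("r1", NET_PORT, NET_PORT)  # both operands come from the network
```

In this model the consumer never issues a separate receive or load: the `add` takes its operands directly from the port, which is the property the slide describes.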

4
Raw
[Figure: Raw tile block diagram. Labels: computing processor (8-stage, 32-bit, single-issue, in-order); 4-stage pipelined FPU; 32 KB I-cache; 32 KB D-cache; communication processor with 64 KB I-cache; crossbar switch; 8 32-bit channels]
  • Network port is like a register
  • Programmable switches

M. B. Taylor, et al., Scalar Operand Networks: On-chip Interconnect for ILP in Partitioned Architectures, ICHPC, 2003
M. B. Taylor, et al., A 16-issue multiple-program-counter microprocessor with point-to-point scalar operand network, IEEE ISSCC, 2003

5
Raw Handheld Board
  • Developed by ISI-East and MIT
  • One Raw chip and several FPGAs for glue logic

6
Morphable Networked MicroArchitecture (MONARCH)
Processor
  • By Raytheon and USC/ISI
  • Both control flow processors and data flow
    processors
  • 6 RISCs and 12 ALU clusters
  • Provides high performance
  • 64 GOPS peak ALU performance at 333 MHz
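
As a quick sanity check on the peak figure, the two numbers on the slide imply a per-cycle operation count. The sketch below only does that arithmetic; dividing evenly across the 12 ALU clusters is an assumption, since the slide does not say how the peak is distributed.

```python
PEAK_OPS_PER_SECOND = 64e9  # 64 GOPS, from the slide
CLOCK_HZ = 333e6            # 333 MHz, from the slide
ALU_CLUSTERS = 12           # from the slide

# Operations that must complete every clock cycle to reach the peak rate.
ops_per_cycle = PEAK_OPS_PER_SECOND / CLOCK_HZ

# Assumption: the peak is spread evenly over the ALU clusters.
ops_per_cycle_per_cluster = ops_per_cycle / ALU_CLUSTERS

print(round(ops_per_cycle), round(ops_per_cycle_per_cluster))  # prints "192 16"
```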

7
Stream Virtual Machine
  • Stream processing processes input stream data and
    generates output stream data
  • Exploits properties of stream applications, such
    as parallelism and throughput orientation
  • A uniform approach for stream processing for
    multiple input languages and multiple processor
    architectures
  • Developed by the Morphware Forum (morphware.org)
  • Centered around the Stable Architecture
    Abstraction Layer
  • Part of the layer is Stream Virtual Machine (SVM)
  • Consists of three major components
  • High Level Compiler
  • Low Level Compiler
  • Machine model
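
The stream-processing model in the first bullet (kernels that consume an input stream and produce an output stream) can be sketched with Python generators. This is only an illustration of the general idea, not the SVM API; the kernel names `source`, `scale`, and `running_sum` are hypothetical.

```python
def source(samples):
    """Emit the input stream one element at a time."""
    for s in samples:
        yield s


def scale(stream, gain):
    """Stateless kernel: each output depends on one input element."""
    for s in stream:
        yield gain * s


def running_sum(stream):
    """Stateful kernel: keeps an accumulator across stream elements."""
    total = 0
    for s in stream:
        total += s
        yield total


# Kernels are chained; each element flows through the whole pipeline in turn.
output = list(running_sum(scale(source([1, 2, 3, 4]), gain=2)))
print(output)  # prints [2, 6, 12, 20]
```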

8
Matrix Multiplication
  • C = AB
  • Boundary tiles emulate network input/output by
    generating and consuming data

[Figure: tile layout for matrix multiplication, with boundary tiles acting as the A source, the B source, and the C destination]
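
The source/destination arrangement on this slide can be sketched as three stream stages. This is a minimal Python illustration, not the Raw implementation: the function names, the row/column streaming order, and the buffering of B are assumptions, since the slide does not specify the data flow.

```python
def a_source(A):
    """Boundary source: emit A one row at a time."""
    for row in A:
        yield list(row)


def b_source(B):
    """Boundary source: emit B one column at a time."""
    for j in range(len(B[0])):
        yield [B[k][j] for k in range(len(B))]


def multiply(a_rows, b_cols):
    """Compute stage: emit the elements of C = AB in row-major order."""
    cols = list(b_cols)       # B is reused for every row of A, so buffer it
    for row in a_rows:
        for col in cols:
            acc = 0
            for a, b in zip(row, col):
                acc += a * b  # one multiplication-addition pair
            yield acc


def c_destination(stream, n_cols):
    """Boundary sink: consume the C stream and rebuild the matrix."""
    flat = list(stream)
    return [flat[i:i + n_cols] for i in range(0, len(flat), n_cols)]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = c_destination(multiply(a_source(A), b_source(B)), n_cols=len(B[0]))
print(C)  # prints [[19, 22], [43, 50]]
```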
9
Matrix Multiplication Implementation
  • Hand-coded using the SVM API (not HLC-generated
    code)
  • Cost analysis and optimizations
  • Full implementation
  • Full SVM stream communication through a dynamic
    network
  • One stream per network
  • Each stream is allocated to a network.
  • Broadcast
  • With broadcasting by switch processor
  • Communication is off-loaded from compute
    processor.
  • Network ports as operands
  • Raw can use network ports as operands
  • Reduces cycles since load/store operations are
    eliminated

10
Matrix Multiplication Results
  • Number of cycles per multiplication-addition pair
  • Lower bound: 2
  • Multiplication
  • Addition

[Chart: number of cycles per multiplication-addition pair. Best obtained result: 2.23; lower bound: 2]
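
The two figures reported on this slide can be turned into an efficiency number. The sketch below is plain arithmetic on the slide's values; defining efficiency as lower bound divided by measured cycles is an assumption about how one might summarize the result, not a metric stated in the slides.

```python
LOWER_BOUND_CYCLES = 2.0     # cycles per multiplication-addition pair, from the slide
BEST_MEASURED_CYCLES = 2.23  # best obtained result, from the slide

# Fraction of the measured cycles that the lower bound accounts for.
efficiency = LOWER_BOUND_CYCLES / BEST_MEASURED_CYCLES

# Extra cycles spent per pair beyond the lower bound.
overhead_cycles = BEST_MEASURED_CYCLES - LOWER_BOUND_CYCLES

print(f"{efficiency:.1%} of the lower bound, {overhead_cycles:.2f} extra cycles per pair")
```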
11
FIR Banks
  • Multiple FIR filters specified by Lincoln Lab
  • Implemented by using radix-4 FFT, multiplication,
    and radix-4 IFFT
  • Optimizations using hand-assembly in core
    operations
  • Minimize pipeline bubbles
  • Manual instruction scheduling
  • Prevent register spilling
  • Prone to this problem since radix-4 FFT requires
    more registers
  • Minimizing register requirement
  • Code expansion
  • Minimize address calculation
  • Using offset
  • Duplicated and rearranged twiddle factors
  • Minimize data copy operation
  • Reverse the order of processing back to front
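
The FFT, multiplication, IFFT structure named above can be sketched in a few lines. This is a minimal, unoptimized Python illustration under stated assumptions, not the hand-scheduled Raw code: it uses a recursive radix-4 decimation-in-time FFT, zero-pads to 64 points so the circular convolution equals the linear one, and the test signal and taps are made up. The function names are hypothetical.

```python
import cmath


def fft_radix4(x):
    """Recursive radix-4 decimation-in-time FFT; len(x) must be a power of 4."""
    n = len(x)
    if n == 1:
        return list(x)
    q = n // 4
    f0, f1, f2, f3 = (fft_radix4(x[r::4]) for r in range(4))
    out = [0j] * n
    for k in range(q):
        w1 = cmath.exp(-2j * cmath.pi * k / n)  # twiddle factor W^k
        t0 = f0[k]
        t1 = w1 * f1[k]
        t2 = w1 * w1 * f2[k]
        t3 = w1 * w1 * w1 * f3[k]
        # Radix-4 butterfly: combine the four quarter-length transforms.
        out[k] = t0 + t1 + t2 + t3
        out[k + q] = t0 - 1j * t1 - t2 + 1j * t3
        out[k + 2 * q] = t0 - t1 + t2 - t3
        out[k + 3 * q] = t0 + 1j * t1 - t2 - 1j * t3
    return out


def ifft_radix4(X):
    """Inverse FFT via the conjugate trick: conj(FFT(conj(X))) / N."""
    n = len(X)
    return [v.conjugate() / n for v in fft_radix4([c.conjugate() for c in X])]


def fir_fft(signal, taps, n=64):
    """FIR filtering as FFT, pointwise multiplication, IFFT."""
    assert len(signal) + len(taps) - 1 <= n  # avoid circular wrap-around
    xs = list(signal) + [0.0] * (n - len(signal))
    hs = list(taps) + [0.0] * (n - len(taps))
    X, H = fft_radix4(xs), fft_radix4(hs)
    return [v.real for v in ifft_radix4([a * b for a, b in zip(X, H)])]


def fir_direct(signal, taps, n=64):
    """Reference: direct time-domain convolution, padded to n outputs."""
    y = [0.0] * n
    for i, s in enumerate(signal):
        for k, h in enumerate(taps):
            y[i + k] += s * h
    return y


signal = [float((3 * i) % 7 - 3) for i in range(48)]  # made-up test input
taps = [1.0 / (k + 1) for k in range(17)]             # made-up coefficients
max_error = max(abs(a - b) for a, b in
                zip(fir_fft(signal, taps), fir_direct(signal, taps)))
```

Comparing against `fir_direct` checks that the frequency-domain path reproduces ordinary convolution; `max_error` should be at the level of floating-point rounding.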

12
FIR Bank Results
  • Definitions
  • LB (UB): lower (upper) bound based on the number
    of floating point operations
  • ILB (IUB): lower (upper) bound based on the
    number of floating point operations and
    load/store instructions
  • Hand Optimization: results after hand-assembly
    work
  • Compiler Optimization: results with compiler
    optimization only
  • One FFT-multiplication-IFFT
  • For 64-sample data

[Chart: number of operations per cycle]
13
Implementation of FIR on MONARCH Processor
  • FIR bank implemented
  • Several sets of FIR in time domain
  • Performance results collected for varying
    numbers of sets, input data, and coefficients
  • Manual coding using MONARCH assembly language
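
A time-domain FIR bank, as described above, applies several independent filters, each with its own input set and coefficients. The following Python sketch shows the computation only; it is not MONARCH assembly, and the function names and sample values are hypothetical.

```python
def fir_time_domain(samples, coefficients):
    """Direct-form FIR: y[n] = sum_k h[k] * x[n - k], with x[n] = 0 for n < 0."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, h in enumerate(coefficients):
            if n - k >= 0:
                acc += h * samples[n - k]
        out.append(acc)
    return out


def fir_bank(input_sets, coefficient_sets):
    """Apply filter i to input set i; the sets are independent of each other."""
    return [fir_time_domain(x, h) for x, h in zip(input_sets, coefficient_sets)]


inputs = [[1.0, 2.0, 3.0, 4.0], [1.0, 0.0, 0.0, 0.0]]  # made-up input sets
coeffs = [[0.5, 0.5], [1.0, 2.0, 3.0]]                 # made-up coefficients
outputs = fir_bank(inputs, coeffs)
print(outputs)  # prints [[0.5, 1.5, 2.5, 3.5], [1.0, 2.0, 3.0, 0.0]]
```

The three parameters varied in the measurements map onto this sketch as the number of entries in `input_sets`, the length of each input list, and the length of each coefficient list.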

14
Implementation of FIR on MONARCH Processor
15
Implementation of FIR on MONARCH Processor
(Cont'd)
  • Relatively low efficiency observed at some
    points
  • Due to a non-integer ratio between the number of
    coefficients and the number of functional units
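
One way to see why a non-integer ratio costs efficiency is a simple utilization model. This is a sketch under assumptions, not the analysis from the presentation: it assumes each coefficient occupies one functional unit per pass and that a partially filled last pass leaves the remaining units idle. The unit count of 8 is hypothetical.

```python
import math


def utilization(num_coefficients, num_units):
    """Fraction of functional-unit slots doing useful work (toy model)."""
    passes = math.ceil(num_coefficients / num_units)  # last pass may be partial
    return num_coefficients / (passes * num_units)


NUM_UNITS = 8  # hypothetical number of functional units
for taps in (8, 12, 16, 17):
    print(taps, f"{utilization(taps, NUM_UNITS):.2f}")
# prints:
# 8 1.00
# 12 0.75
# 16 1.00
# 17 0.71
```

In this model, utilization is 1.0 only when the coefficient count is an exact multiple of the unit count, and it drops sharply just past each multiple.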

16
Estimation of Efficiency for Better Algorithms
17
Conclusion
  • Raw provides 16 independent tiles on a chip
  • Each tile can perform independent computations
  • MONARCH provides multiple RISCs and data flow
    computations
  • May take advantage of both worlds
  • Our benchmark implementations exploit each
    architecture's capabilities
  • Obtained high efficiency on both architectures
  • Only data flow computation parts are used in this
    implementation