Title: Lecture 16: Accelerator Design in the XUP Board
1. ECE 412 Microcomputer Laboratory
- Lecture 16: Accelerator Design in the XUP Board
2. Objectives
- Understand accelerator design considerations in a practical FPGA environment
- Gain knowledge of some details of the XUP platform required for efficient accelerator design
3. Four Fundamental Models of Accelerator Design
- No OS service (in simple embedded systems)
- Base OS service
- Accelerator as user-space mmapped I/O device
- Virtualized device with OS scheduling support
4. Hybrid Hardware/Software Execution Model
- Hardware accelerator as a DLL
- Seamless integration of hardware accelerators into the Linux software stack for use by mainstream applications
- The DLL approach enables transparent interchange of software and hardware components
- Application-level execution model
- Compiler deep analysis and transformations generate CPU code, hardware library stubs, and synthesized components
- FPGA bitmaps as hardware counterparts to existing software modules
- The same dynamic linking library interfaces and stubs apply to both software and hardware implementations
- OS resource management
- Services (API) for allocation, partial reconfiguration, saving and restoring status, and monitoring
- Multiprogramming scheduler can pre-fetch hardware accelerators in time for next use
- Control access to the new hardware to ensure trust under private or shared use
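The interchange of software and hardware components behind one library interface can be sketched as below. This is a minimal Python sketch under stated assumptions, not the lecture's implementation: `AcceleratorRegistry`, `configure`, `release`, `bind`, and the two `gain_*` functions are hypothetical names, and the "hardware" path is simulated by an ordinary function.

```python
from typing import Callable, Dict, List, Optional, Set

Impl = Callable[[List[int]], List[int]]


class AcceleratorRegistry:
    """Hypothetical stand-in for the dynamic linker's choice of implementation."""

    def __init__(self) -> None:
        self._software: Dict[str, Impl] = {}
        self._hardware: Dict[str, Impl] = {}
        self._configured: Set[str] = set()  # accelerators currently on the FPGA

    def register(self, name: str, software: Impl,
                 hardware: Optional[Impl] = None) -> None:
        """Record the software routine and, optionally, its hardware stub."""
        self._software[name] = software
        if hardware is not None:
            self._hardware[name] = hardware

    def configure(self, name: str) -> None:
        """Mark the accelerator as loaded (stands in for reconfiguration)."""
        if name not in self._hardware:
            raise KeyError(f"no hardware implementation for {name!r}")
        self._configured.add(name)

    def release(self, name: str) -> None:
        """Mark the accelerator as no longer available."""
        self._configured.discard(name)

    def bind(self, name: str) -> Impl:
        """Return the hardware stub if configured, else the software routine."""
        if name in self._configured:
            return self._hardware[name]
        return self._software[name]


def gain_software(samples: List[int]) -> List[int]:
    return [2 * s for s in samples]


def gain_hardware_stub(samples: List[int]) -> List[int]:
    # A real stub would pass the buffer to the FPGA; this one only simulates it.
    return [s << 1 for s in samples]


registry = AcceleratorRegistry()
registry.register("gain", gain_software, gain_hardware_stub)
```

Because callers only ever use `registry.bind("gain")`, the application code is identical whichever implementation is selected.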
5. MP3 Decoder: Madplay Library Dithering as DLL
- Madplay shared library dithering function as software and FPGA DLL
- audio_linear_dither() software profiling shows 97% of application time
- DL (dynamic linker) can switch the call to hardware or software implementation
- Used by 100 video and audio applications
[Figure: software and hardware dithering paths. Application: Read, Decode MP3, Write. DL selects between the Software Dithering DLL and, through a Stub, the Hardware Dithering DLL. OS: Sound driver. FPGA: Hardware Dithering (6 cycles), AC97. Dithering stages in both paths: Biasing, Dithering with Random generator, Noise Shaping, Clipping, Quantization. Data labels: Block, Sample.]
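The stages named in the figure (noise shaping, biasing, dithering, clipping, quantization) can be illustrated with a simplified sketch. It is not Madplay's code: the first-order error feedback, the `state` dictionary, and the default of 28 fractional bits are assumptions chosen for illustration.

```python
import random


def linear_dither(sample, state, out_bits=16, frac_bits=28, rng=random):
    """Reduce a fixed-point sample (1.0 == 1 << frac_bits) to out_bits.

    `state` carries {"error": int, "random": int} between calls.
    """
    scale_bits = frac_bits + 1 - out_bits   # low-order bits to discard
    mask = (1 << scale_bits) - 1
    one = 1 << frac_bits

    # Noise shaping: feed back the previous quantization error
    # (first-order feedback, an assumed simplification).
    sample += state["error"]

    # Biasing: add half an output step so truncation rounds to nearest.
    output = sample + (1 << (scale_bits - 1))

    # Dithering: add the difference of two successive random values.
    rnd = rng.getrandbits(scale_bits)
    output += rnd - state["random"]
    state["random"] = rnd

    # Clipping: keep both values inside [-1.0, 1.0).
    output = max(-one, min(one - 1, output))
    sample = max(-one, min(one - 1, sample))

    # Quantization: drop the low-order bits.
    output &= ~mask

    # Remember the error for the next sample's noise shaping.
    state["error"] = sample - output

    return output >> scale_bits


if __name__ == "__main__":
    rng = random.Random(0)
    state = {"error": 0, "random": 0}
    half = 1 << 27  # 0.5 in the assumed fixed-point format
    print([linear_dither(half, state, rng=rng) for _ in range(4)])
```

Every stage is a short integer operation on one sample, which is consistent with the slide's point that the routine maps well onto a small hardware pipeline.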
6. CPU-Accelerator Interconnect Options
- PLB (Processor Local Bus)
- Wide transfer: 64 bits
- Access to DRAM channel
- 1/3 CPU frequency
- Big penalty if the bus is busy during the first attempt to access it
- OCM (On-chip Memory) interconnect
- Narrower: 32 bits
- No direct access to DRAM channel
- CPU clock frequency
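The width and clock figures above give a quick peak-bandwidth comparison. The sketch below is only arithmetic on those two numbers: the 300 MHz CPU clock is an assumed value for illustration, and the result ignores arbitration, the PLB busy penalty, and the fact that only the PLB reaches DRAM.

```python
def peak_bandwidth_mb_s(width_bits: int, bus_clock_mhz: float) -> float:
    """Peak transfer rate in MB/s: bytes per transfer times transfers per microsecond."""
    return width_bits / 8 * bus_clock_mhz


CPU_CLOCK_MHZ = 300.0  # assumption for illustration, not taken from the slides

plb = peak_bandwidth_mb_s(64, CPU_CLOCK_MHZ / 3)  # 64 bits at 1/3 CPU frequency
ocm = peak_bandwidth_mb_s(32, CPU_CLOCK_MHZ)      # 32 bits at CPU frequency

print(f"PLB peak: {plb:.0f} MB/s")
print(f"OCM peak: {ocm:.0f} MB/s")
print(f"OCM/PLB ratio: {ocm / plb:.2f}")
```

Whatever the CPU clock, the ratio is (32 x 1) / (64 x 1/3) = 1.5, so the narrower OCM interconnect still has the higher peak rate.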
7. Motion Estimation Design Experience
- Significant overhead in mmap and open calls
- This arrangement can only support accelerators that will be invoked many times
- Notice dramatic reduction in computation time
- Notice large overhead in data marshalling and write
- Full Search gives 10% better compression
- Diamond Search is sequential, not suitable for acceleration
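A small sketch of full-search block matching shows why it suits acceleration: every candidate displacement is scored independently, so the scores can be computed in parallel. Diamond search, by contrast, chooses its next candidates from earlier results. The block size, search range, frame contents, and function names here are illustrative assumptions.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between a block in cur and a shifted block in ref."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n=4, search=2):
    """Score every displacement in the window; return (cost, dx, dy) of the best."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidates that fall outside the reference frame.
            if by + dy < 0 or by + dy + n > height:
                continue
            if bx + dx < 0 or bx + dx + n > width:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)  # independent of other candidates
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


def pattern(x, y):
    """Synthetic pixel values used only to build the example frames."""
    return (31 * y + 17 * x) % 251


SIZE = 12
ref_frame = [[pattern(x, y) for x in range(SIZE)] for y in range(SIZE)]
# The current frame is the reference content displaced by (dx, dy) = (2, 1).
cur_frame = [[pattern(x + 2, y + 1) for x in range(SIZE)] for y in range(SIZE)]

print(full_search(cur_frame, ref_frame, bx=4, by=4))
```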
8. JPEG: An Example
[Figure: JPEG pipeline. Original Image (RGB) → RGB to YUV → Y, U, V channels → Downsample → 2D Discrete Cosine Transform (DCT) → Quantization (QUANT) → Run-Length Encoding (RLE) → Huffman Coding (HC) → Compressed Image. Legend: Parallel Execution on Independent Blocks; Inherently Sequential Region; Implemented as Reconfigurable Logic; Accelerator Candidate.]
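The DCT and quantization stages work on independent 8x8 blocks, which is what makes them candidates for parallel hardware. The sketch below is a direct, unoptimized form of the 8x8 2D DCT followed by quantization. The uniform quantization table is an assumption for illustration, not a standard JPEG table.

```python
import math

N = 8


def dct_2d(block):
    """Direct 8x8 2D DCT-II with the usual 1/4 * C(u) * C(v) scaling."""
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            cu = 1 / math.sqrt(2) if u == 0 else 1.0
            cv = 1 / math.sqrt(2) if v == 0 else 1.0
            acc = 0.0
            for x in range(N):
                for y in range(N):
                    acc += (block[x][y]
                            * math.cos((2 * x + 1) * u * math.pi / 16)
                            * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = 0.25 * cu * cv * acc
    return out


def quantize(coeffs, qtable):
    """Divide each coefficient by its table entry and round to an integer."""
    return [[round(coeffs[u][v] / qtable[u][v]) for v in range(N)]
            for u in range(N)]


UNIFORM_Q = [[16] * N for _ in range(N)]   # assumed table, for illustration only
flat_block = [[100] * N for _ in range(N)]  # constant block: only DC is non-zero

quantized = quantize(dct_2d(flat_block), UNIFORM_Q)
print(quantized[0])
```

Each output coefficient depends only on the 64 inputs of its own block, so blocks can be transformed in any order or side by side.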
9. JPEG Accelerator Design Experience
- Based on Model (d)
- System call overhead for each invocation
- Better protection
- DCT and Quant are accelerated
- Data flows directly from DCT to Quant
- Data copy to user DMA buffer dominates cost
10. Execution Flow of DCT System Call
[Figure: timeline (Time →) across three columns: Application, Operating System, Hardware. PLB is labeled at the data copy steps.]
Application:
- open(/dev/accel) /* only once */
- /* construct macroblocks */
- macroblock syscall(macroblock, num_blocks)
Operating System (hardware involved in parentheses):
- Enable Accelerator Access for Application
- Data Copy (PPC, Memory)
- Flush Cache Range (PPC, Memory)
- Setup DMA Transfer (PPC, DMA Controller)
- Poll (PPC) while Accelerator (Executing)
- Setup DMA Transfer (PPC, DMA Controller)
- Invalidate Cache Range (PPC, Memory)
- Data Copy (PPC, Memory)
Application:
- /* macroblock now has transformed data */
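The order of operations inside one accelerator system call can be modelled as below. This is a pure-software sketch: `accel_syscall`, the `transform` callback, and the trace strings are hypothetical, and no real device, cache, or DMA controller is touched.

```python
def accel_syscall(macroblocks, transform, trace):
    """Model one system call: blocks are transformed in place, steps are logged."""
    trace.append("copy user data to DMA buffer")
    dma_buffer = [list(block) for block in macroblocks]

    trace.append("flush cache range")            # make memory match the cache
    trace.append("set up DMA transfer to accelerator")

    trace.append("poll while accelerator executes")
    results = [transform(block) for block in dma_buffer]  # stands in for hardware

    trace.append("set up DMA transfer from accelerator")
    trace.append("invalidate cache range")       # drop stale cached lines

    trace.append("copy DMA buffer back to user data")
    for dst, src in zip(macroblocks, results):
        dst[:] = src
    return len(macroblocks)


if __name__ == "__main__":
    blocks = [[1, 2, 3], [4, 5, 6]]
    steps = []
    accel_syscall(blocks, lambda b: [2 * v for v in b], steps)
    print(blocks)
    for step in steps:
        print("-", step)
```

Two of the seven logged steps are plain data copies, which matches the observation that copying to and from the DMA buffer is a large part of the cost.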
11. Software Versus Hardware Acceleration
Overhead is a major issue!
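A simple cost model shows how overhead limits the benefit of a fast accelerator. The model and all timing values below are illustrative assumptions, not measurements from the lecture: `t_setup` stands for one-time costs such as open and mmap, and `t_call_overhead` for per-invocation costs such as system calls and data copies.

```python
import math


def speedup(n_calls, t_sw, t_hw, t_call_overhead, t_setup):
    """Software time divided by accelerated time for n_calls invocations."""
    accelerated = t_setup + n_calls * (t_call_overhead + t_hw)
    return (n_calls * t_sw) / accelerated


def break_even_calls(t_sw, t_hw, t_call_overhead, t_setup):
    """Smallest call count with speedup above 1, or None if it never pays off."""
    gain_per_call = t_sw - (t_hw + t_call_overhead)
    if gain_per_call <= 0:
        return None
    return math.floor(t_setup / gain_per_call) + 1


# Made-up values in microseconds, chosen only to make the arithmetic easy.
T_SW, T_HW, T_OVERHEAD, T_SETUP = 100, 5, 45, 10_000

print("break-even calls:", break_even_calls(T_SW, T_HW, T_OVERHEAD, T_SETUP))
print("speedup after 1 call:", round(speedup(1, T_SW, T_HW, T_OVERHEAD, T_SETUP), 4))
print("speedup limit:", T_SW / (T_HW + T_OVERHEAD))
```

With these numbers the raw computation is 20 times faster in hardware, yet the overall speedup can never exceed 2, and the accelerator only pays off after the setup cost has been spread over a few hundred calls.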
12. Device Driver Access Cost