Lecture 16: Accelerator Design in the XUP Board

Slides: 13

Provided by: nichola95

Learn more at: http://web.cecs.pdx.edu

Transcript and Presenter's Notes

1
ECE 412 Microcomputer Laboratory

Lecture 16: Accelerator Design in the XUP Board

2
Objectives

Understand accelerator design considerations in a practical FPGA environment
Learn the details of the XUP platform required for efficient accelerator design

3
Four Fundamental Models of Accelerator Design
No OS service (in simple embedded systems)
Base
OS service: accelerator as a user-space mmapped I/O device
Virtualized device with OS scheduling support
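
The user-space mmapped I/O model can be sketched as follows. This is only an illustration under stated assumptions: the register window size, the register offset, and the helper names are hypothetical, and a temporary file stands in for a device node so the sketch runs without hardware. With a real driver, the mapped bytes would be device registers rather than file contents.

```python
import mmap
import os
import struct
import tempfile

WINDOW_BYTES = 16   # hypothetical register window: four 32-bit registers
REG_OPERAND = 0     # hypothetical register offset, for illustration only


def open_register_window(path):
    """Map WINDOW_BYTES of `path` read/write into this process."""
    fd = os.open(path, os.O_RDWR)
    try:
        return mmap.mmap(fd, WINDOW_BYTES, access=mmap.ACCESS_WRITE)
    finally:
        os.close(fd)  # the mapping remains valid after the descriptor is closed


def write_reg(window, offset, value):
    """Store a 32-bit big-endian word (the PowerPC byte order) at `offset`."""
    struct.pack_into(">I", window, offset, value)


def read_reg(window, offset):
    """Load a 32-bit big-endian word from `offset`."""
    return struct.unpack_from(">I", window, offset)[0]


def demo():
    """Write a register through the mapping and read it back."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(bytes(WINDOW_BYTES))   # stand-in for the device's register file
        path = f.name
    try:
        window = open_register_window(path)
        try:
            write_reg(window, REG_OPERAND, 0x1234)
            return read_reg(window, REG_OPERAND)
        finally:
            window.close()
    finally:
        os.remove(path)
```

Once the mapping exists, each register access is an ordinary load or store with no system call, which is the attraction of this model. The open and mmap calls themselves are the one-time cost.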
4
Hybrid Hardware/Software Execution Model

Hardware Accelerator as a DLL
Seamless integration of hardware accelerators into the Linux software stack for use by mainstream applications
The DLL approach enables transparent interchange of software and hardware components
Application-level execution model
Deep compiler analysis and transformations generate CPU code, hardware library stubs, and synthesized components
FPGA bitmaps serve as the hardware counterpart to existing software modules
The same dynamic linking library interfaces and stubs apply to both software and hardware implementations
OS resource management
Services (API) for allocation, partial reconfiguration, saving and restoring state, and monitoring
The multiprogramming scheduler can pre-fetch hardware accelerators in time for their next use
Access to the new hardware is controlled to ensure trust under private or shared use
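
The idea that one library interface serves both a software and a hardware implementation can be sketched as a small symbol table. Everything here is hypothetical (the class names, the `clip` symbol, and the stub that only imitates a hardware result); it is not the actual Linux dynamic linker mechanism, only a model of the interchange it enables.

```python
from typing import Callable, Dict, List

Impl = Callable[[List[int]], List[int]]


def clip_software(samples: List[int]) -> List[int]:
    """Plain software implementation: clip samples to the 16-bit range."""
    return [max(-32768, min(32767, s)) for s in samples]


class AcceleratorStub:
    """Hypothetical hardware stub with the same call signature.

    A real stub would pass the buffer to a device driver; this one computes
    the result in software and counts calls so the switch is observable.
    """

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, samples: List[int]) -> List[int]:
        self.calls += 1
        return [max(-32768, min(32767, s)) for s in samples]


class Linker:
    """Toy symbol resolver: prefer a loaded hardware stub, else software."""

    def __init__(self) -> None:
        self._software: Dict[str, Impl] = {}
        self._hardware: Dict[str, Impl] = {}

    def register_software(self, symbol: str, impl: Impl) -> None:
        self._software[symbol] = impl

    def load_hardware(self, symbol: str, impl: Impl) -> None:
        self._hardware[symbol] = impl

    def unload_hardware(self, symbol: str) -> None:
        self._hardware.pop(symbol, None)

    def resolve(self, symbol: str) -> Impl:
        if symbol in self._hardware:
            return self._hardware[symbol]
        return self._software[symbol]


def demo_dispatch():
    """Call the same symbol before and after a hardware stub is loaded."""
    linker = Linker()
    stub = AcceleratorStub()
    linker.register_software("clip", clip_software)
    data = [-40000, 0, 40000]
    before = linker.resolve("clip")(data)   # software path
    linker.load_hardware("clip", stub)
    after = linker.resolve("clip")(data)    # hardware path, same interface
    return before, after, stub.calls
```

The application code calling `resolve("clip")` does not change when the hardware version is loaded or unloaded, which is the transparency the DLL approach aims for.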

5
MP3 Decoder Madplay Lib. Dithering as DLL
Figure labels (software path): Application (Read, Decode MP3, Write); DL; Software Dithering DLL (Noise Shaping, Biasing, Dithering, Random generator, Clipping, Quantization); Block; Sample

Madplay shared library dithering function as software and FPGA DLL
Software profiling shows audio_linear_dither() takes 97% of application time
DL (dynamic linker) can switch the call to the hardware or software implementation
Used by 100 video and audio applications
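
The stages named in the figure (noise shaping, biasing, dithering with a random generator, clipping, quantization) can be sketched in software as below. This is loosely modeled on that stage order and is not madplay's source. The 28-bit fixed-point format, the `DitherState` class, and the helper names are assumptions made for illustration. The random generator is a 32-bit linear congruential generator with the widely used Numerical Recipes constants.

```python
from dataclasses import dataclass, field
from typing import List

FRACBITS = 28                      # assumed fixed-point format
ONE = 1 << FRACBITS
MIN_SAMPLE, MAX_SAMPLE = -ONE, ONE - 1


@dataclass
class DitherState:
    error: List[int] = field(default_factory=lambda: [0, 0, 0])
    random: int = 0


def prng(state: int) -> int:
    """32-bit linear congruential generator."""
    return (state * 0x0019660D + 0x3C6EF35F) & 0xFFFFFFFF


def linear_dither(bits: int, sample: int, st: DitherState) -> int:
    """Reduce one fixed-point sample to a signed `bits`-bit integer."""
    # Noise shaping: feed back the quantization error of earlier samples.
    sample += st.error[0] - st.error[1] + st.error[2]
    st.error[2] = st.error[1]
    st.error[1] = st.error[0] // 2   # floor division; C would truncate

    scalebits = FRACBITS + 1 - bits
    mask = (1 << scalebits) - 1

    # Biasing: add half a quantization step so truncation rounds.
    output = sample + (1 << (scalebits - 1))

    # Dithering: add the difference of two successive random values.
    rnd = prng(st.random)
    output += (rnd & mask) - (st.random & mask)
    st.random = rnd

    # Clipping: keep both the output and the shaped sample in range.
    if output > MAX_SAMPLE:
        output = MAX_SAMPLE
        sample = min(sample, MAX_SAMPLE)
    elif output < MIN_SAMPLE:
        output = MIN_SAMPLE
        sample = max(sample, MIN_SAMPLE)

    # Quantization: drop the low bits and remember the error.
    output &= ~mask
    st.error[0] = sample - output
    return output >> scalebits


def dither_block(bits: int, samples: List[int]) -> List[int]:
    """Dither a block of samples with a fresh state."""
    st = DitherState()
    return [linear_dither(bits, s, st) for s in samples]
```

Each sample depends on the error state left by the previous one, so the stages pipeline well in hardware even though the samples themselves must be processed in order.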

Figure labels (hardware path): Stub; Hardware Dithering DLL; OS; Sound driver; AC97; FPGA Hardware Dithering (Noise Shaping, Biasing, Dithering, Random generator, Clipping, Quantization); 6 cycles
6
CPU-Accelerator Interconnect Options

PLB (Processor Local Bus)
Wide transfers: 64 bits
Access to the DRAM channel
Runs at 1/3 of the CPU frequency
Big penalty if the bus is busy on the first access attempt
OCM (On-chip Memory) interconnect
Narrower: 32 bits
No direct access to the DRAM channel
Runs at the CPU clock frequency
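
A quick calculation makes the width versus clock trade-off concrete. The sketch below computes peak transfer rates only, ignoring arbitration and bus-busy penalties. The 300 MHz CPU clock is an assumed value for illustration and does not come from the slides.

```python
from fractions import Fraction

ASSUMED_CPU_HZ = 300_000_000   # assumption for illustration, not from the slides


def peak_bytes_per_cpu_cycle(width_bits, bus_to_cpu_clock_ratio):
    """Peak bytes moved per CPU clock cycle for a bus of the given width."""
    return Fraction(width_bits, 8) * bus_to_cpu_clock_ratio


def peak_mb_per_s(bytes_per_cycle, cpu_hz=ASSUMED_CPU_HZ):
    """Convert bytes per CPU cycle to MB/s (1 MB = 10**6 bytes)."""
    return float(bytes_per_cycle * cpu_hz) / 1e6


PLB_PEAK = peak_bytes_per_cpu_cycle(64, Fraction(1, 3))   # 8/3 bytes per cycle
OCM_PEAK = peak_bytes_per_cpu_cycle(32, Fraction(1))      # 4 bytes per cycle
```

Under these assumptions the narrower OCM interconnect has the higher peak rate (4 versus 8/3 bytes per CPU cycle). Since it has no direct access to the DRAM channel, data held in DRAM still has to reach it by another path.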

7
Motion Estimation Design Experience

Significant overhead in the mmap and open calls
This arrangement can only support accelerators that will be invoked many times
Notice the dramatic reduction in computation time
Notice the large overhead in data marshalling and write
Full Search gives 10% better compression
Diamond Search is sequential, so it is not suitable for acceleration
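
To illustrate why Full Search suits acceleration, here is a minimal sketch of full-search block matching using the sum of absolute differences (SAD). The function names, the frame layout (lists of rows), and the synthetic test frames are assumptions for illustration. Every candidate displacement is evaluated independently, so the SAD computations can run in parallel, whereas a diamond search picks each next candidate from the previous result.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the block of `ref` displaced by (dx, dy)."""
    return sum(abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
               for y in range(n) for x in range(n))


def full_search(cur, ref, bx, by, n, radius):
    """Try every displacement within `radius`; return (cost, dx, dy)."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # Skip candidates that fall outside the reference frame.
            if not (0 <= by + dy <= height - n and 0 <= bx + dx <= width - n):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


def demo_motion_vector():
    """Build a synthetic frame pair where the content moved by (2, 1)."""
    size = 16
    ref = [[(7 * x + 13 * y + x * y) % 251 for x in range(size)]
           for y in range(size)]
    cur = [[ref[y + 1][x + 2] if y + 1 < size and x + 2 < size else 0
            for x in range(size)] for y in range(size)]
    return full_search(cur, ref, bx=4, by=4, n=4, radius=3)
```

For a search radius r, the loop evaluates up to (2r + 1)^2 candidates per block, which is the computation an accelerator can spread across parallel SAD units.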

8
JPEG: An Example
Figure (JPEG pipeline): Original Image (RGB), RGB to YUV, Downsample (Y, U, V), 2D Discrete Cosine Transform (DCT), Quantization (QUANT), Run-Length Encoding (RLE), Huffman Coding (HC), Compressed Image
Figure legend: Parallel Execution on Independent Blocks; Inherently Sequential Region; Implemented as Reconfigurable Logic; Accelerator Candidate
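
The DCT and QUANT stages operate on independent 8x8 blocks, which is what makes them candidates for parallel hardware. A minimal software sketch of the two stages follows. It uses the direct form of the 2D DCT-II rather than a fast algorithm, and a single uniform quantization step instead of a JPEG quantization table; both choices, and the function names, are simplifying assumptions.

```python
import math

N = 8  # JPEG block size


def dct_2d(block):
    """Direct 2D DCT-II of an 8x8 block with orthonormal scaling."""
    def c(k):
        return math.sqrt(0.5) if k == 0 else 1.0

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = 0.25 * c(u) * c(v) * total
    return out


def quantize(coeffs, step):
    """Divide every coefficient by `step` and round to an integer."""
    return [[int(round(value / step)) for value in row] for row in coeffs]


def demo_flat_block():
    """A constant block has only a DC coefficient after the DCT."""
    block = [[100] * N for _ in range(N)]
    return quantize(dct_2d(block), step=16)
```

Feeding the DCT output straight into `quantize` mirrors a direct DCT-to-QUANT data path: the intermediate coefficients never need to return to the caller.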
9
JPEG Accelerator Design Experience

Based on Model (d)
System call overhead for each invocation
Better protection
DCT and Quant are accelerated
Data flows directly from DCT to Quant
Data copy to user DMA buffer dominates cost

10
Execution Flow of DCT System Call
Figure (timeline): columns Application, Operating System, and Hardware, with a Time axis. Hardware blocks are labeled PPC, Memory, DMA Controller, and Accelerator, with transfers over the PLB.
Application code:
open(/dev/accel) /* only once */
/* construct macroblocks */
macroblock syscall(macroblock, num_blocks)
/* macroblock now has transformed data */
Operating system steps, in order:
Enable Accelerator Access for Application
Data Copy
Flush Cache Range
Setup DMA Transfer
Poll (while the Accelerator is Executing)
Setup DMA Transfer
Invalidate Cache Range
Data Copy
11
Software Versus Hardware Acceleration
Overhead is a major issue!
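
A simple cost model shows how overhead can erase the benefit of a fast accelerator. The sketch below is not taken from the lecture's measurements: the parameter names and any numbers plugged into it are hypothetical. It assumes a one-time setup cost (such as open and mmap), a fixed per-call overhead (system call, data copies, DMA setup), and a compute time that the hardware divides by a fixed factor.

```python
import math


def software_time(n_calls, t_sw_compute):
    """Total time when every call runs in software."""
    return n_calls * t_sw_compute


def accelerated_time(n_calls, t_setup, t_call_overhead, t_sw_compute, hw_speedup):
    """Total time with the accelerator: one-time setup plus, per call,
    the fixed overhead and the reduced compute time."""
    return t_setup + n_calls * (t_call_overhead + t_sw_compute / hw_speedup)


def break_even_calls(t_setup, t_call_overhead, t_sw_compute, hw_speedup):
    """Smallest number of calls for which the accelerator is faster,
    or None if the per-call overhead alone already cancels the gain."""
    gain_per_call = t_sw_compute - (t_call_overhead + t_sw_compute / hw_speedup)
    if gain_per_call <= 0:
        return None
    return math.floor(t_setup / gain_per_call) + 1
```

For example, with hypothetical costs of 1000 time units for setup, 20 per call of overhead, 100 of software compute per call, and a 10x hardware speedup, the accelerator only wins from the 15th call onward. If the per-call overhead rises to 95 units, it never wins.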
12
Device Driver Access Cost
