Lecture 16: Accelerator Design in the XUP Board

1
ECE 412 Microcomputer Laboratory
  • Lecture 16: Accelerator Design in the XUP Board

2
Objectives
  • Understand accelerator design considerations in a
    practical FPGA environment
  • Learn the details of the XUP platform needed for
    efficient accelerator design

3
Four Fundamental Models of Accelerator Design
  • (a) No OS service (simple embedded systems)
  • (b) Basic OS service
  • (c) Accelerator as a user-space mmapped I/O device
  • (d) Virtualized device with OS scheduling support
4
Hybrid Hardware/Software Execution Model
  • Hardware accelerator as a DLL
    - Seamless integration of hardware accelerators
      into the Linux software stack for use by
      mainstream applications
    - The DLL approach enables transparent interchange
      of software and hardware components (see the
      sketch after this list)
  • Application-level execution model
    - Deep compiler analysis and transformations
      generate CPU code, hardware library stubs, and
      synthesized components
    - FPGA bitmaps act as the hardware counterpart to
      existing software modules
    - The same dynamic-linking interfaces and stubs
      apply to both the software and the hardware
      implementation
  • OS resource management
    - Services (API) for allocation, partial
      reconfiguration, saving and restoring state, and
      monitoring
    - A multiprogramming scheduler can pre-fetch
      hardware accelerators in time for their next use
    - Access to the new hardware is controlled to
      ensure trust under private or shared use
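
The dynamic-linking idea above can be illustrated with standard dlopen()/dlsym() calls. The sketch below is a minimal illustration, assuming two hypothetical libraries (libdither_sw.so and libdither_hw.so) that export the same symbol; it is not the lecture's actual stub code.

/* Minimal sketch of DLL-based dispatch between a software and a
   hardware implementation of the same function.  The library names,
   the environment variable, and the function signature are
   illustrative assumptions. */
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

typedef void (*dither_fn)(const int *in, int *out, int n);

int main(void)
{
    /* Choose the hardware-backed stub library if requested; otherwise
       use the plain software implementation.  Both export the same
       symbol, so the call site does not change. */
    const char *lib = getenv("USE_HW_ACCEL") ? "libdither_hw.so"
                                             : "libdither_sw.so";

    void *handle = dlopen(lib, RTLD_NOW);
    if (!handle) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return 1;
    }

    dither_fn dither = (dither_fn)dlsym(handle, "audio_dither_block");
    if (!dither) {
        fprintf(stderr, "dlsym: %s\n", dlerror());
        dlclose(handle);
        return 1;
    }

    int in[32] = {0}, out[32];
    dither(in, out, 32);  /* same call whether the body runs in SW or HW */

    dlclose(handle);
    return 0;
}

Compile with -ldl; the point is only that the dynamic linker, not the application, decides which implementation satisfies the symbol.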

5
MP3 Decoder: Madplay Library Dithering as a DLL
[Figure: MP3 playback data flow. The application's Decode MP3 stage
reads a block and writes samples through the dynamic linker (DL) to a
Software Dithering DLL whose stages are Noise Shaping, Quantization,
Clipping, Dithering (with a Random Generator), and Biasing.]
  • The Madplay shared library's dithering function is
    provided as both a software DLL and an FPGA DLL
  • Software profiling shows audio_linear_dither()
    accounts for 97% of application time
  • The DL (dynamic linker) can switch the call to the
    hardware or the software implementation
  • The library is used by 100+ video and audio
    applications

[Figure: Hardware path. The same call goes through a stub to a
Hardware Dithering DLL; the dithering stages (Quantization, Clipping,
Dithering with a Random Generator, Biasing, Noise Shaping) run as a
6-cycle pipeline on the FPGA, and the samples reach the AC97 codec
through the OS sound driver. A simplified sketch of these stages
follows.]
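
Reading the figure's stage names as a per-sample pipeline, a simplified software rendering might look like the sketch below. The constants, state layout, and 32-bit-to-16-bit scaling are assumptions for illustration; this is not madplay's actual audio_linear_dither() code.

#include <stdint.h>

/* Simplified per-sample dithering pipeline following the stage names
   on the slide: noise shaping, biasing, dithering (with a random
   generator), quantization, clipping.  Illustrative sketch only. */
struct dither_state {
    int32_t  error;  /* quantization error fed back for noise shaping */
    uint32_t seed;   /* state of the pseudo-random dither generator   */
};

static uint32_t prng(uint32_t x)          /* random generator */
{
    return x * 1664525u + 1013904223u;    /* common LCG constants */
}

static int16_t dither_sample(int32_t sample, struct dither_state *st)
{
    int64_t v = sample;

    v += st->error;                        /* noise shaping */
    v += 1 << 15;                          /* biasing before truncation */

    uint32_t r = prng(st->seed);           /* dithering */
    v += (int32_t)(r >> 16) - (int32_t)(st->seed >> 16);
    st->seed = r;

    int64_t q = v & ~(int64_t)0xffff;      /* quantization: drop low 16 bits */

    if (q > INT32_MAX) q = INT32_MAX;      /* clipping */
    if (q < INT32_MIN) q = INT32_MIN;

    st->error = (int32_t)(v - q);          /* error for the next sample */
    return (int16_t)(q >> 16);             /* 16-bit PCM output */
}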
6
CPU-Accelerator Interconnect Options
  • PLB (Processor Local Bus)
    - Wide transfers (64 bits)
    - Access to the DRAM channel
    - Runs at 1/3 of the CPU frequency
    - Large penalty if the bus is busy on the first
      attempt to access it
  • OCM (On-Chip Memory) interconnect
    - Narrower transfers (32 bits)
    - No direct access to the DRAM channel
    - Runs at the CPU clock frequency

7
Motion Estimation Design Experience
  • Significant overhead in the mmap and open calls
    (see the sketch after this list)
  • This arrangement can only support accelerators that
    will be invoked many times
  • Note the dramatic reduction in computation time
  • Note the large overhead in data marshalling and
    write-back
  • Full Search gives 10% better compression
  • Diamond Search is sequential and therefore not
    suitable for acceleration
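
A minimal sketch of the mmap-based access pattern referenced in the first bullet, assuming a hypothetical /dev/accel device node and made-up register offsets; the real design would map the addresses assigned to the accelerator in the platform's address map.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define ACCEL_MAP_SIZE 0x1000   /* one page of control registers      */
#define REG_CTRL       0x0      /* hypothetical register offsets      */
#define REG_STATUS     0x4
#define STATUS_DONE    0x1

int main(void)
{
    /* open() and mmap() are paid once; after that each invocation is
       just register reads/writes, which is why this model only pays
       off for accelerators that are invoked many times. */
    int fd = open("/dev/accel", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    void *map = mmap(NULL, ACCEL_MAP_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    volatile uint32_t *regs = (volatile uint32_t *)map;

    regs[REG_CTRL / 4] = 1;                        /* start the accelerator */
    while (!(regs[REG_STATUS / 4] & STATUS_DONE))  /* poll for completion   */
        ;

    munmap(map, ACCEL_MAP_SIZE);
    close(fd);
    return 0;
}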

8
JPEG: An Example
[Figure: JPEG pipeline. The original RGB image is converted to YUV,
each of the Y, U, and V channels is downsampled, and the blocks pass
through the 2D Discrete Cosine Transform (DCT) and Quantization
(QUANT), then Run-Length Encoding (RLE) and Huffman Coding (HC)
produce the compressed image. DCT and QUANT execute in parallel on
independent blocks and are the accelerator candidates implemented as
reconfigurable logic; RLE and HC form the inherently sequential
region. A reference sketch of the DCT follows.]
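
The DCT stage operates on 8x8 blocks. The sketch below is a plain, unoptimized C rendering of the standard 2D DCT-II formula, for reference only; it is not the accelerated hardware implementation from the lecture, which would use a separable, fixed-point design.

#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define N 8

/* Straightforward O(N^4) reference 2D DCT-II on an 8x8 block, taken
   directly from the textbook formula. */
void dct_8x8(const double in[N][N], double out[N][N])
{
    for (int u = 0; u < N; u++) {
        for (int v = 0; v < N; v++) {
            double sum = 0.0;
            for (int x = 0; x < N; x++)
                for (int y = 0; y < N; y++)
                    sum += in[x][y]
                         * cos((2 * x + 1) * u * M_PI / (2.0 * N))
                         * cos((2 * y + 1) * v * M_PI / (2.0 * N));
            double cu = (u == 0) ? sqrt(1.0 / N) : sqrt(2.0 / N);
            double cv = (v == 0) ? sqrt(1.0 / N) : sqrt(2.0 / N);
            out[u][v] = cu * cv * sum;
        }
    }
}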
9
JPEG Accelerator Design Experience
  • Based on Model (d)
    - System call overhead on each invocation
    - Better protection
  • DCT and Quant are accelerated
    - Data flows directly from DCT to Quant
  • Data copy to the user DMA buffer dominates the cost

10
Execution Flow of DCT System Call
(Figure columns: Application, Operating System, Hardware; time flows
downward along the PPC / PLB / DMA timeline)
  1. Application: open(/dev/accel) /* only once */,
     construct macroblocks, then issue
     syscall(macroblock, num_blocks)
  2. OS (PPC): enable accelerator access for the
     application
  3. OS (PPC, Memory, PLB): copy the data into the DMA
     buffer
  4. OS (PPC, Memory): flush the cache range
  5. OS (PPC): set up the DMA transfer to the
     accelerator (DMA controller)
  6. Hardware: accelerator executes while the PPC polls
  7. OS (PPC): set up the DMA transfer back
     (DMA controller)
  8. OS (PPC, Memory): invalidate the cache range
     /* macroblock now has transformed data */
  9. OS (PPC, Memory, PLB): copy the data back to the
     application
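
From the application side, the flow above could be driven by a wrapper like the sketch below, where an ioctl() on /dev/accel stands in for the custom macroblock system call shown on the slide; the request code and argument structure are assumptions.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define ACCEL_DCT_QUANT 0x4001    /* hypothetical request code */

struct accel_req {
    void     *macroblocks;        /* buffer of 8x8 macroblocks */
    uint32_t  num_blocks;
};

int run_dct_quant(int16_t *macroblocks, uint32_t num_blocks)
{
    static int fd = -1;

    if (fd < 0) {                 /* open(/dev/accel) only once */
        fd = open("/dev/accel", O_RDWR);
        if (fd < 0) { perror("open"); return -1; }
    }

    /* One invocation covers num_blocks macroblocks, amortizing the
       per-call overhead (data copy, cache flush/invalidate, DMA
       setup) shown in the flow above. */
    struct accel_req req = { .macroblocks = macroblocks,
                             .num_blocks  = num_blocks };
    if (ioctl(fd, ACCEL_DCT_QUANT, &req) < 0) {
        perror("ioctl");
        return -1;
    }

    /* macroblocks[] now holds the DCT + quantized coefficients */
    return 0;
}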
11
Software Versus Hardware Acceleration
Overhead is a major issue!
12
Device Driver Access Cost