Title: Bottlenecks of SIMD


1
Bottlenecks of SIMD
  • Haibin Wang
  • Wei Tong

2
Paper
  • Bottlenecks in Multimedia Processing with SIMD
    Style Extensions and Architectural Enhancements
  • IEEE TRANSACTIONS ON COMPUTERS, VOL. 52, NO. 8,
    AUGUST 2003
  • Deepu Talla, Member, IEEE, Lizy Kurian John,
    Senior Member, IEEE, and Doug Burger, Member, IEEE

3
Outline
  • Introduction
  • Bottlenecks Analysis
  • MediaBreeze Architecture
  • Summary

4
Introduction
  • Multimedia SIMD extensions are widely used to
    speed up media processing, but they are not
    used very efficiently.
  • 75 to 85 percent of the dynamic instructions in
    the processor instruction stream are supporting
    instructions.

5
Introduction
  • The bottlenecks are caused by the loop structure
    and the access patterns of media programs.
  • So, instead of exploiting more data-level
    parallelism, the paper focuses on improving the
    efficiency of the instructions that support the
    core computation.
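
A rough bound shows why more data-level parallelism alone cannot help much. This is a back-of-the-envelope sketch, not a result from the paper. It assumes f is the fraction of dynamic instructions that are supporting instructions (0.75 to 0.85, the range quoted on the earlier slide), that wider SIMD shrinks only the core instructions by a factor k, and that instruction count is a fair proxy for run time. N and N' are the instruction counts before and after.

```latex
\[
\frac{N}{N'} \;=\; \frac{1}{f + \dfrac{1-f}{k}} \;\le\; \frac{1}{f},
\qquad
\frac{1}{0.75} \approx 1.33,
\quad
\frac{1}{0.85} \approx 1.18 .
\]
```

Under these assumptions, even an unlimited SIMD width reduces the instruction count by only about 1.2x to 1.3x, which is why the supporting instructions are the more promising target.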

6
Introduction
  • This paper makes two major contributions:
  • First, it targets the supporting instructions,
    rather than the core computation, to improve
    SIMD performance, which is a new approach.
  • Second, it gives a method to reduce or
    eliminate supporting instructions with the
    MediaBreeze architecture.

7
Nested Loop

8
Analysis of the loop structure
  • The sub-block is very small, which limits the
    available DLP and requires many supporting
    instructions.
  • There are 5 nested loops for every block, so
    much time is wasted on branches.
  • The data must be reorganized before SIMD can be
    used.
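
The loop structure described above can be sketched in scalar C. This is an illustrative example, not code from the paper: the function name `blocked_row_transform`, the `loop_stats` counters, and the choice of a row-wise 8-point transform on 8x8 blocks are all assumptions made for the sketch. It counts one loop branch per iteration of each of the five loops and one multiply-accumulate per innermost iteration.

```c
#include <assert.h>

#define BLK 8  /* assumed sub-block size */

/* Hypothetical counters: core work versus loop-control branches. */
typedef struct {
    long macs;      /* multiply-accumulate operations (core computation) */
    long branches;  /* loop back-edge branches (supporting instructions) */
} loop_stats;

/*
 * Row-wise 8-point transform applied to every 8x8 block of an image.
 * coef is a flat BLK*BLK matrix: out_row[u] = sum_k in_row[k] * coef[u*BLK+k].
 * Five nested loops: block row, block column, row, output index, inner product.
 */
loop_stats blocked_row_transform(const short *in, int *out, const short *coef,
                                 int width, int height)
{
    assert(width % BLK == 0 && height % BLK == 0);
    loop_stats s = {0, 0};

    for (int by = 0; by < height; by += BLK) {          /* loop 1 */
        for (int bx = 0; bx < width; bx += BLK) {       /* loop 2 */
            for (int r = 0; r < BLK; r++) {             /* loop 3 */
                for (int u = 0; u < BLK; u++) {         /* loop 4 */
                    int acc = 0;
                    for (int k = 0; k < BLK; k++) {     /* loop 5 */
                        acc += in[(by + r) * width + bx + k] * coef[u * BLK + k];
                        s.macs++;
                        s.branches++;
                    }
                    out[(by + r) * width + bx + u] = acc;
                    s.branches++;
                }
                s.branches++;
            }
            s.branches++;
        }
        s.branches++;
    }
    return s;
}
```

For a single 8x8 block this counts 512 multiply-accumulates against 586 loop branches, before any address arithmetic is counted, so in scalar form the loop control alone outnumbers the core operations.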

9
Access patterns

10
Access patterns
  • The addressing sequences are complex and make up
    a large part of the program, so many supporting
    instructions are needed to generate them.
  • Using general-purpose instruction sets to
    generate multiple addressing sequences is not
    very efficient.
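
One way to picture what a dedicated address generator replaces is a small software model of a multi-level strided stream. This is a sketch under assumptions: the `addr_stream` layout, the names `stream_reset` and `stream_next`, and the use of five levels are hypothetical and are not the MediaBreeze encoding. Each call does the work that a general-purpose instruction set would spend several add, compare, and branch instructions on.

```c
#include <assert.h>

#define LEVELS 5  /* assumed: one level per nested loop */

/* Hypothetical stream descriptor: a base plus a count and stride per level. */
typedef struct {
    long base;            /* starting element index */
    long count[LEVELS];   /* iterations per level, outermost first */
    long stride[LEVELS];  /* element step per level */
    long idx[LEVELS];     /* current position in each level */
    int  done;            /* set once every address has been produced */
} addr_stream;

/* Rewind the stream to its first address. */
void stream_reset(addr_stream *s)
{
    s->done = 0;
    for (int l = 0; l < LEVELS; l++) {
        assert(s->count[l] > 0);
        s->idx[l] = 0;
    }
}

/* Return the current address, then advance like nested for-loops would. */
long stream_next(addr_stream *s)
{
    assert(!s->done);

    long addr = s->base;
    for (int l = 0; l < LEVELS; l++)
        addr += s->idx[l] * s->stride[l];

    /* Innermost level moves fastest; carry outward on wrap-around. */
    int l = LEVELS - 1;
    while (l >= 0) {
        if (++s->idx[l] < s->count[l])
            break;
        s->idx[l] = 0;
        l--;
    }
    if (l < 0)
        s->done = 1;

    return addr;
}
```

With counts {1, 2, 2, 1, 2} and strides {0, 2, 16, 0, 1}, the stream walks two 2x2 blocks of a 16-element-wide image in block order. A media kernel typically needs several such streams at once (inputs, coefficients, outputs), which multiplies the overhead when they are generated in software.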

11
The overhead instructions
  • Address generation: address calculation
  • Address transformation: data movement, data
    reorganization
  • Loads and stores: memory access
  • Branches: control transfer, for-loops

13
Architecture

14
Instruction Structure

15
Breeze Instruction Mapping of 1D-DCT

16
Full Map
  • five branches,
  • three loads and one store,
  • four address value generations (one on each
    stream, with each address generation
    representing multiple RISC instructions),
  • one SIMD operation (2-way to 16-way parallelism
    depending on each data element size),
  • one accumulation of SIMD result and one SIMD
    reduction operation,
  • four SIMD data reorganization (pack/unpack,
    permute, etc.) operations, and
  • shifting and saturation of SIMD results.
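
To make the list above concrete, here is a scalar sketch of the kinds of operations such an instruction bundles for a 1D-DCT-like kernel. It is an illustration only, not the semantics of the Breeze instruction: the names `transform_1d` and `saturate16` are hypothetical, only two input streams are used (the slide lists three loads), and data reorganization is omitted.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a wide accumulator to the signed 16-bit range. */
static int16_t saturate16(int64_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/*
 * out[u] = saturate16((sum over k of in[k] * coef[u*n + k]) >> shift)
 * Comments mark which item of the list each step corresponds to.
 */
void transform_1d(const int16_t *in, const int16_t *coef, int16_t *out,
                  int n, int shift)
{
    assert(n > 0 && shift >= 0 && shift < 32);

    for (int u = 0; u < n; u++) {                     /* branch */
        int64_t acc = 0;
        for (int k = 0; k < n; k++) {                 /* branch */
            /* two address calculations, two loads, multiply, accumulate */
            acc += (int32_t)in[k] * coef[u * n + k];
        }
        /* Right shift of a negative value is implementation-defined in C;
           this sketch assumes the usual arithmetic shift. */
        out[u] = saturate16(acc >> shift);            /* shift, saturate, store */
    }
}
```

In scalar form every line inside the loops is a separate instruction or several. The point of the list above is that one instruction can express the branches, address generation, memory accesses, and post-processing together with the SIMD operation.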

17
Performance Evaluation
  • cfa, dct, motest, scale
  • G711, decrypt
  • Aud, jpeg, ijpeg

18
Any improvement?
  • Why not higher efficiency in cfa?
  • Memory latency!
  • Solution?
  • Prefetch!
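
The slide does not say which prefetch mechanism is meant. As one possibility, the sketch below shows software prefetching on a general-purpose processor with the GCC/Clang builtin `__builtin_prefetch`. The function name `sum_with_prefetch` and the distance of 64 elements are assumptions chosen for illustration and would need tuning on real hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_DISTANCE 64  /* assumed look-ahead, in elements */

/* Sum a stream of samples while requesting data ahead of its use. */
int64_t sum_with_prefetch(const int16_t *data, size_t n)
{
    assert(data != NULL || n == 0);

    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__) || defined(__clang__)
        /* Read prefetch (0) with no expected reuse (0); a hint only. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 0);
#endif
        sum += data[i];
    }
    return sum;
}
```

The prefetch does not change the result; it only asks the memory system to start fetching data that a streaming kernel such as cfa will need shortly, so that memory latency overlaps with computation.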

19
Evaluation
  • Advantages
  • Eliminates or reduces overhead.
  • Much better than ordinary SIMD extensions.
  • 0.3% of processor area, less than 1% of total
    power consumption.
  • Drawbacks
  • Complicated instructions.
  • Who will design a compiler for this?