A GPU-Like Soft Processor for High-Throughput Acceleration

1
A GPU-Like Soft Processor for High-Throughput Acceleration
Jeffrey Kingyens and J. Gregory Steffan
Electrical and Computer Engineering
University of Toronto
2
FPGA-Based Acceleration
  • In-socket acceleration platforms
  • FPGA and CPU on same motherboard
  • Xtremedata, Nallatech, SGI RASC
  • Intel Quick-Assist, AMD Torrenza
  • How to program them?
  • HDL is for experts
  • Behavioural synthesis is limited
  • Can we provide a more familiar programming model?

XD1000
3
Potential Solution: Soft Processor
  • Advantages of soft processors
  • Familiar, portable, customizable
  • Our goal: develop a new soft processor (S.P.) architecture that is
  • Naturally capable of utilizing FPGA resources
  • Suitable for high-throughput workloads
  • Challenges
  • Memory latency
  • Pipeline latency and hazards
  • Exploiting parallelism
  • Scaling

4
Inspiration: GPU Architecture
  • Multithreading
  • Tolerate memory and pipeline latencies
  • Vector instructions
  • Data-level parallelism, scaling
  • Predication
  • Minimize impact of control flow
  • Multiple processors
  • Scaling
  • We propose a GPU-like S.P. and programming model

5
Overview
  • A GPU-based system
  • NVIDIA's Cg
  • AMD's CTM r5xx ISA
  • A GPU-like architecture
  • Overcoming port limitations
  • Avoiding stalls
  • Preliminary results
  • Simulation based on Xtremedata XD1000

6
A GPU-Based System
7
GPU Shader Processors
Shader Program
Separate in/out buffers simplify memory coherence
8
Software Compilation
CTM Binary
Our system behaves like a real graphics card!
9
NVIDIA's Cg Language (C-like)
Matrix-matrix element-wise multiplication offset
10
AMD's CTM r5xx ISA (simplified)
A B C
Each register is a 4-element vector
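The programming model on the preceding slides (a program run once per output coordinate, separate input and output buffers, 4-element vector registers) can be sketched in Python. This is an illustration only, not Cg or CTM code; the names `kernel` and `run` and the nested-list buffer layout are assumptions made for the example.

```python
# Illustrative Python model only; this is not Cg or CTM code.
# kernel(), run() and the buffer layout are hypothetical names and choices.

def kernel(a, b):
    """Per-coordinate program: one 4-wide element-wise multiply, C = A * B."""
    return tuple(x * y for x, y in zip(a, b))


def run(in_a, in_b):
    """Apply kernel() at every coordinate, writing to a separate output buffer."""
    height, width = len(in_a), len(in_a[0])
    out = [[None] * width for _ in range(height)]
    for y in range(height):          # stands in for the coordinate generator
        for x in range(width):
            out[y][x] = kernel(in_a[y][x], in_b[y][x])
    return out


a = [[(1, 2, 3, 4), (5, 6, 7, 8)]]
b = [[(2, 2, 2, 2), (1, 0, 1, 0)]]
print(run(a, b))  # [[(2, 4, 6, 8), (5, 0, 7, 0)]]
```

Because the kernel only reads the input buffers and only writes the output buffer, the invocations are independent and could run in any order.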
11
A GPU-Like Architecture
12
Soft Processor Architecture
Soft Processor
Coordinate Generator
Must tolerate port limitations and latencies
13
Overcoming Port Limitations
  • Problem: central register file
  • Needs four reads and two writes per cycle
  • FPGA block RAMs have only two ports
  • Solution: exploit symmetry of threads
  • Group threads into batches of 4
  • Fetch operands across batches in lock-step

Only read one operand per thread per cycle
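One possible reading of this scheme, as a minimal Python sketch: each thread in a batch of 4 has its own register file that serves at most one read per cycle, the batch reads one operand per thread per cycle in lock-step, and the results are transposed so that all operands of one thread are available together. The class and function names are hypothetical, the example assumes four register reads per instruction (the figure given on the slide), and the code is not the authors' hardware design.

```python
# Illustrative model only; names and structure are hypothetical, not the authors' RTL.
BATCH_SIZE = 4  # threads per batch, as on the slide


class ThreadRegisterFile:
    """Register file of one thread, limited to one read per cycle."""

    def __init__(self, values):
        self.values = list(values)
        self.last_read_cycle = None

    def read(self, reg, cycle):
        if self.last_read_cycle == cycle:
            raise RuntimeError("more than one read from a register file in one cycle")
        self.last_read_cycle = cycle
        return self.values[reg]


def fetch_batch_operands(reg_files, src_regs):
    """Fetch all source operands of one instruction for every thread in a batch."""
    assert len(reg_files) == BATCH_SIZE
    # Cycle k: every thread reads its k-th source operand, in lock-step.
    by_cycle = [
        [rf.read(reg, cycle) for rf in reg_files]
        for cycle, reg in enumerate(src_regs)
    ]
    # Transpose so that entry t holds every operand of thread t.
    return [tuple(ops) for ops in zip(*by_cycle)]


reg_files = [ThreadRegisterFile(10 * t + r for r in range(8)) for t in range(BATCH_SIZE)]
operands = fetch_batch_operands(reg_files, src_regs=[0, 1, 2, 3])
print(operands[2])  # operands of thread 2: (20, 21, 22, 23)
```

In this model no register file is ever read more than once in a cycle, which is the property that lets each one fit in a two-port block RAM.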
14
Transposed RegFile Access
[Figure: per-thread register files T3 RF, T2 RF, T1 RF, T0 RF]
15
Avoiding Stalls
  • Problem: long pipeline and memory latency
  • Frequent long stalls lead to underutilized ALU
    datapath
  • Solution: exploit abundance of threads
  • Store contexts for multiple batches of threads
  • Issue instructions from different batches to hide
    latencies

Requires logic to issue from and manage batches
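The latency-hiding idea can be illustrated with a toy Python model: each cycle, a round-robin scheduler issues one instruction from any batch that is ready, and a batch that has issued is unavailable for a fixed number of cycles. The function name and the latency value of 32 cycles are assumptions chosen for illustration; this is not the SystemC simulator used for the results, and it does not reproduce them.

```python
# Toy model of issuing from multiple batch contexts; not the authors' simulator.
# `latency` (cycles until a batch may issue again) is a hypothetical parameter.

def alu_utilization(num_batches, latency, cycles=3200):
    """Fraction of cycles in which some batch could issue an instruction."""
    ready_at = [0] * num_batches   # first cycle at which each batch may issue
    next_batch = 0                 # round-robin starting point
    issued = 0
    for cycle in range(cycles):
        for offset in range(num_batches):
            b = (next_batch + offset) % num_batches
            if ready_at[b] <= cycle:
                ready_at[b] = cycle + latency  # batch waits out the latency
                issued += 1
                next_batch = (b + 1) % num_batches
                break                          # at most one issue per cycle
    return issued / cycles


for n in (1, 2, 4, 8, 16, 32, 64):
    print(n, alu_utilization(n, latency=32))
```

In this simplified model utilization grows roughly in proportion to the number of batch contexts until there are as many contexts as cycles of latency, after which the ALU is never idle.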
16
Methodology and Results
17
Simulation Methodology
  • SystemC-based simulation
  • Parameterized to model XD1000
  • Assume conservative 100 MHz soft processor clock
  • Cycle accurate at the block interfaces
  • Models HyperTransport (bandwidth and latency)
  • Benchmarks
  • photon: Monte Carlo heat-transfer simulation
    (ALU-intensive)
  • matmatmult: dense matrix multiplication
    (mem-intensive)

18
ALU Utilization
[Figure: ALU utilization (%), 0 to 100, versus number of hardware batch contexts (1, 2, 4, 8, 16, 32, 64) for photon. Categories: Utilized, Not ALU, Mem Wait, Inside ALU]
19
ALU Utilization
[Figure categories: Utilized, Mem Wait, Not ALU, Inside ALU]
32 batches is sufficient
20
Conclusions
  • GPU-inspired soft processor architecture
  • exploits multithreading, vector operations,
    predication
  • Thread symmetry and batching allow
  • tolerating limited block RAM ports
  • tolerating long memory and pipeline latencies
  • 32 batches sufficient
  • to achieve 100% ALU utilization
  • Future work
  • customize programming model and arch. to FPGAs
  • exploit longer vectors, multiple CPUs, custom ops

21
Backups
22
ALU Datapath
64 Cycles!
23
Reducing Register File Size
  • Possible to shrink this?
  • Some programs use far fewer registers
  • Ex.: photon uses 4 → can replace below with 16 x
    M4K

24
Multi-threading
25
Implementation
  • Assume an Xtremedata XD1000 system (or similar)
  • FPGA plugs into second socket on dual CPU board
  • Communication with CPU via HyperTransport
  • 16-bit wide, 400 MHz DDR interface → 1.6 GB/s
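
The bandwidth figure follows from the link width and clock: a double-data-rate link makes two transfers per clock cycle. A small Python check, with a hypothetical helper name, treating the result as a raw peak that ignores any protocol overhead:

```python
# Raw peak bandwidth of a DDR link; the function name is illustrative.

def ddr_peak_bytes_per_second(width_bits, clock_hz):
    """Bytes per second for a link that transfers on both clock edges."""
    bytes_per_transfer = width_bits // 8
    transfers_per_second = 2 * clock_hz  # DDR: two transfers per cycle
    return bytes_per_transfer * transfers_per_second


print(ddr_peak_bytes_per_second(16, 400_000_000))  # 1600000000, i.e. 1.6 GB/s
```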

26
Estimating Memory Latency
27
ALU Utilization
[Figure categories: Utilized, Mem Wait, Not ALU, Inside ALU]
Matmatmult is bottlenecked on load latency
28
Performance (16 Bit HT Link)
[Figure: performance versus number of hardware batch contexts (2, 4, 8, 16, 32, 64)]