Title: A GPU-Like Soft Processor for High-Throughput Acceleration
Slide 1: A GPU-Like Soft Processor for High-Throughput Acceleration
Jeffrey Kingyens and J. Gregory Steffan
Electrical and Computer Engineering, University of Toronto
Slide 2: FPGA-Based Acceleration
- In-socket acceleration platforms
- FPGA and CPU on same motherboard
- Xtremedata, Nallatech, SGI RASC
- Intel Quick-Assist, AMD Torrenza
- How to program them?
- HDL is for experts
- Behavioural synthesis is limited
- Can we provide a more familiar programming model?
[Figure: XtremeData XD1000 in-socket accelerator module]
Slide 3: Potential Solution: Soft Processors
- Advantages of soft processors
- Familiar, portable, customizable
- Our goal: develop a new soft processor architecture that is
- Naturally capable of utilizing FPGA resources
- Suitable for high-throughput workloads
- Challenges
- Memory latency
- Pipeline latency and hazards
- Exploiting parallelism
- Scaling
Slide 4: Inspiration: GPU Architecture
- Multithreading
- Tolerate memory and pipeline latencies
- Vector instructions
- Data-level parallelism, scaling
- Predication
- Minimize impact of control flow (see the sketch at the end of this slide)
- Multiple processors
- Scaling
- We propose a GPU-like soft processor and programming model
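As a quick illustration of predication (a hedged C sketch; the names are mine, not from the talk): both sides of a branch are computed and a 0/1 predicate selects the result, so the threads in a batch never diverge on control flow.

    /* Hedged sketch of predication: both sides of a branch execute and
       a 0/1 predicate selects the result, so threads in a batch never
       diverge on control flow. Names are illustrative. */
    float predicated(int pred, float then_val, float else_val) {
        float p = pred ? 1.0f : 0.0f;                 /* predicate as a mask */
        return p * then_val + (1.0f - p) * else_val;  /* select, no jump     */
    }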
Slide 5: Overview
- A GPU-based system
- NVIDIA's Cg
- AMD's CTM r5xx ISA
- A GPU-like architecture
- Overcoming port limitations
- Avoiding stalls
- Preliminary results
- Simulation based on the XtremeData XD1000
Slide 6: A GPU-Based System
Slide 7: GPU Shader Processors
[Figure: a shader program running on an array of shader processors]
Separate in/out buffers simplify memory coherence
Slide 8: Software Compilation
[Figure: compilation flow producing a CTM binary]
Our system behaves like a real graphics card!
Slide 9: NVIDIA's Cg Language (C-like)
[Code listing: Cg shader for matrix-matrix element-wise multiplication, indexing the inputs by a coordinate offset]
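The Cg listing itself did not survive extraction. As a stand-in, here is a hedged C sketch of what such a shader computes per output pixel (all names, `elementwise_mult`, `a`, `b`, `c`, `width`, are illustrative, not from the original listing): the GPU generates an (x, y) coordinate, and the kernel converts it to a buffer offset and writes one product.

    #include <stdio.h>

    /* Hedged C analogue of the slide's Cg shader: one output element
       per invocation, indexed by the generated (x, y) coordinate.
       Names are illustrative, not from the original listing. */
    void elementwise_mult(const float *a, const float *b, float *c,
                          int width, int x, int y) {
        int offset = y * width + x;       /* coordinate -> buffer offset */
        c[offset] = a[offset] * b[offset];
    }

    int main(void) {
        float a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8}, c[4];
        for (int y = 0; y < 2; y++)       /* the GPU would invoke the   */
            for (int x = 0; x < 2; x++)   /* shader once per pixel      */
                elementwise_mult(a, b, c, 2, x, y);
        printf("c[3] = %g\n", c[3]);      /* 4 * 8 = 32 */
        return 0;
    }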
Slide 10: AMD's CTM r5xx ISA (simplified)
[Code listing: CTM instructions with operands A, B, and C]
Each register is a 4-element vector
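To make the 4-element-vector registers concrete, here is a hedged C sketch of what a single r5xx-style vector multiply does; the struct and function names (`vec4`, `vec4_mul`) are illustrative, not from the ISA specification.

    /* Hedged sketch: a CTM register holds four floats, and a single
       vector instruction operates on all four lanes at once. The
       struct and function names are illustrative, not from the ISA. */
    typedef struct { float x, y, z, w; } vec4;

    /* Acts like one vector multiply instruction: C = A * B, per lane. */
    vec4 vec4_mul(vec4 a, vec4 b) {
        return (vec4){ a.x * b.x, a.y * b.y, a.z * b.z, a.w * b.w };
    }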
Slide 11: A GPU-Like Architecture
Slide 12: Soft Processor Architecture
[Block diagram: soft processor with a coordinate generator]
Must tolerate port limitations and latencies
Slide 13: Overcoming Port Limitations
- Problem: the central register file
- Needs four reads and two writes per cycle
- FPGA block RAMs have only two ports
- Solution: exploit the symmetry of threads
- Group threads into batches of 4
- Fetch operands across batches in lock-step
Only read one operand per thread per cycle
Slide 14: Transposed RegFile Access
[Figure: per-thread register files T0-T3 accessed in a transposed, staggered pattern]
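A hedged C sketch of this transposed, staggered schedule, assuming batches of 4 threads and up to 4 register accesses per instruction (the constants and the exact rotation are assumptions): printing the schedule shows that each per-thread register file handles exactly one access per cycle, which a dual-ported block RAM can serve.

    #include <stdio.h>

    #define BATCH 4   /* threads per batch = register accesses per instr. */

    int main(void) {
        /* Hedged sketch: threads in a batch are staggered one cycle
           apart, so in any cycle each per-thread register file (T0-T3)
           services exactly one access. The rotation is illustrative. */
        for (int cycle = 0; cycle < BATCH; cycle++)
            for (int t = 0; t < BATCH; t++)
                printf("cycle %d: T%d RF serves access slot %d\n",
                       cycle, t, (cycle - t + BATCH) % BATCH);
        return 0;
    }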
Slide 15: Avoiding Stalls
- Problem: long pipeline and memory latencies
- Frequent long stalls leave the ALU datapath underutilized
- Solution: exploit the abundance of threads
- Store contexts for multiple batches of threads
- Issue instructions from different batches to hide latencies
Requires logic to issue from and manage batches (sketched below)
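A hedged C sketch of that issue logic: round-robin over the batch contexts, skipping any batch with an outstanding memory request, so the ALU pipeline keeps issuing. `NUM_BATCHES`, `batch_ctx`, and the ready test are assumptions, not the actual design.

    #include <stdbool.h>

    #define NUM_BATCHES 32          /* hardware batch contexts          */

    typedef struct {
        bool waiting_on_mem;        /* batch has an outstanding load?   */
        int  pc;                    /* next instruction for this batch  */
    } batch_ctx;

    /* Round-robin: return the next ready batch after `last`,
       or -1 if every batch is stalled (pipeline bubble). */
    int pick_batch(const batch_ctx b[NUM_BATCHES], int last) {
        for (int i = 1; i <= NUM_BATCHES; i++) {
            int cand = (last + i) % NUM_BATCHES;
            if (!b[cand].waiting_on_mem)
                return cand;        /* issue from this batch next cycle */
        }
        return -1;
    }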
Slide 16: Methodology and Results
Slide 17: Simulation Methodology
- SystemC-based simulation
- Parameterized to model XD1000
- Assume a conservative 100 MHz soft processor clock
- Cycle accurate at the block interfaces
- Models HyperTransport (bandwidth and latency)
- Benchmarks
- photon: Monte Carlo heat-transfer simulation (ALU-intensive)
- matmatmult: dense matrix-matrix multiplication (memory-intensive)
Slide 18: ALU Utilization
[Plot: ALU utilization (%) vs. number of hardware batch contexts (1, 2, 4, ..., 64) for photon; breakdown: Utilized, Not ALU, Mem Wait, Inside ALU]
Slide 19: ALU Utilization
[Plot: ALU utilization breakdown (Utilized, Mem Wait, Not ALU, Inside ALU)]
32 batches is sufficient
Slide 20: Conclusions
- GPU-inspired soft processor architecture
- exploits multithreading, vector operations, predication
- Thread symmetry and batching allow
- tolerating limited block RAM ports
- tolerating long memory and pipeline latencies
- 32 batches are sufficient
- to achieve 100% ALU utilization
- Future work
- customize programming model and arch. to FPGAs
- exploit longer vectors, multiple CPUs, custom ops
Slide 21: Backups
Slide 22: ALU Datapath
[Figure: ALU datapath pipeline, 64 cycles long!]
Slide 23: Reducing Register File Size
- Is it possible to shrink the register file?
- Some programs use far fewer registers
- Ex) photon uses 4 → can replace the register file below with 16 × M4K
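For scale (assuming 32-bit floats, so 128 bits per 4-element vector register, and one register file per thread): 4 registers/thread × 128 bits × 4 threads/batch × 32 batches = 64 Kbit, which is exactly 16 M4K blocks of 4 Kbit each.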
Slide 24: Multi-threading
Slide 25: Implementation
- Assume an XtremeData XD1000 system (or similar)
- FPGA plugs into second socket on dual CPU board
- Communication with CPU via HyperTransport
- 16-bit wide, 400 MHz DDR interface → 1.6 GB/s
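The quoted figure follows directly from the link parameters: 16 bits × 400 MHz × 2 transfers per clock (DDR) = 12.8 Gbit/s = 1.6 GB/s.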
Slide 26: Estimating Memory Latency
Slide 27: ALU Utilization
[Plot: ALU utilization breakdown (Utilized, Mem Wait, Not ALU, Inside ALU) for matmatmult]
Matmatmult is bottlenecked on load latency
Slide 28: Performance (16-Bit HT Link)
[Plot: performance vs. number of hardware batch contexts (2, 4, ..., 64)]