Title: A GPU-Like Soft Processor for High-Throughput Acceleration
Slide 1: A GPU-Like Soft Processor for High-Throughput Acceleration
Jeffrey Kingyens and J. Gregory Steffan
Electrical and Computer Engineering, University of Toronto
Slide 2: FPGA-Based Acceleration
- In-socket acceleration platforms
- FPGA and CPU on same motherboard
- Xtremedata, Nallatech, SGI RASC
- Intel Quick-Assist, AMD Torrenza
- How to program them?
- HDL is for experts
- Behavioural synthesis is limited
- Can we provide a more familiar programming model?
[Figure: XtremeData XD1000 in-socket accelerator module]
Slide 3: Potential Solution: Soft Processors
- Advantages of soft processors
- Familiar, portable, customizable
- Our goal: develop a new soft processor architecture that is
- Naturally capable of utilizing FPGA resources
- Suitable for high-throughput workloads
- Challenges
- Memory latency
- Pipeline latency and hazards
- Exploiting parallelism
- Scaling
Slide 4: Inspiration: GPU Architecture
- Multithreading
- Tolerate memory and pipeline latencies
- Vector instructions
- Data-level parallelism, scaling
- Predication
- Minimize impact of control flow (see the sketch at the end of this slide)
- Multiple processors
- Scaling
- We propose a GPU-like soft processor and programming model
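As a quick illustration of predication (a hedged C sketch; the names are mine, not from the talk): both sides of a branch are computed and a 0/1 predicate selects the result, so the threads in a batch never diverge on control flow.

    /* Hedged sketch of predication: both sides of a branch execute and
       a 0/1 predicate selects the result, so threads in a batch never
       diverge on control flow. Names are illustrative. */
    float predicated(int pred, float then_val, float else_val) {
        float p = pred ? 1.0f : 0.0f;                 /* predicate as a mask */
        return p * then_val + (1.0f - p) * else_val;  /* select, no jump     */
    }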
Slide 5: Overview
- A GPU-based system
- NVIDIA's Cg
- AMD's CTM r5xx ISA
- A GPU-like architecture
- Overcoming port limitations
- Avoiding stalls
- Preliminary results
- Simulation based on the XtremeData XD1000
Slide 6: A GPU-Based System
Slide 7: GPU Shader Processors
[Figure: a shader program running on an array of shader processors]
Separate in/out buffers simplify memory coherence
Slide 8: Software Compilation
[Figure: compilation flow producing a CTM binary]
Our system behaves like a real graphics card!
Slide 9: NVIDIA's Cg Language (C-like)
[Code listing: Cg shader for matrix-matrix element-wise multiplication, indexing the inputs by a coordinate offset]
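The Cg listing itself did not survive extraction. As a stand-in, here is a hedged C sketch of what such a shader computes per output pixel (all names, `elementwise_mult`, `a`, `b`, `c`, `width`, are illustrative, not from the original listing): the GPU generates an (x, y) coordinate, and the kernel converts it to a buffer offset and writes one product.

    #include <stdio.h>

    /* Hedged C analogue of the slide's Cg shader: one output element
       per invocation, indexed by the generated (x, y) coordinate.
       Names are illustrative, not from the original listing. */
    void elementwise_mult(const float *a, const float *b, float *c,
                          int width, int x, int y) {
        int offset = y * width + x;       /* coordinate -> buffer offset */
        c[offset] = a[offset] * b[offset];
    }

    int main(void) {
        float a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8}, c[4];
        for (int y = 0; y < 2; y++)       /* the GPU would invoke the   */
            for (int x = 0; x < 2; x++)   /* shader once per pixel      */
                elementwise_mult(a, b, c, 2, x, y);
        printf("c[3] = %g\n", c[3]);      /* 4 * 8 = 32 */
        return 0;
    }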
Slide 10: AMD's CTM r5xx ISA (simplified)
[Code listing: CTM instructions with operands A, B, and C]
Each register is a 4-element vector
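To make the 4-element-vector registers concrete, here is a hedged C sketch of what a single r5xx-style vector multiply does; the struct and function names (`vec4`, `vec4_mul`) are illustrative, not from the ISA specification.

    /* Hedged sketch: a CTM register holds four floats, and a single
       vector instruction operates on all four lanes at once. The
       struct and function names are illustrative, not from the ISA. */
    typedef struct { float x, y, z, w; } vec4;

    /* Acts like one vector multiply instruction: C = A * B, per lane. */
    vec4 vec4_mul(vec4 a, vec4 b) {
        return (vec4){ a.x * b.x, a.y * b.y, a.z * b.z, a.w * b.w };
    }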
Slide 11: A GPU-Like Architecture
Slide 12: Soft Processor Architecture
[Block diagram: soft processor with a coordinate generator]
Must tolerate port limitations and latencies
Slide 13: Overcoming Port Limitations
- Problem: the central register file
- Needs four reads and two writes per cycle
- FPGA block RAMs have only two ports
- Solution: exploit the symmetry of threads
- Group threads into batches of 4
- Fetch operands across batches in lock-step
Only read one operand per thread per cycle
Slide 14: Transposed RegFile Access
[Figure: per-thread register files T0-T3 accessed in a transposed, staggered pattern]
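A hedged C sketch of this transposed, staggered schedule, assuming batches of 4 threads and up to 4 register accesses per instruction (the constants and the exact rotation are assumptions): printing the schedule shows that each per-thread register file handles exactly one access per cycle, which a dual-ported block RAM can serve.

    #include <stdio.h>

    #define BATCH 4   /* threads per batch = register accesses per instr. */

    int main(void) {
        /* Hedged sketch: threads in a batch are staggered one cycle
           apart, so in any cycle each per-thread register file (T0-T3)
           services exactly one access. The rotation is illustrative. */
        for (int cycle = 0; cycle < BATCH; cycle++)
            for (int t = 0; t < BATCH; t++)
                printf("cycle %d: T%d RF serves access slot %d\n",
                       cycle, t, (cycle - t + BATCH) % BATCH);
        return 0;
    }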
Slide 15: Avoiding Stalls
- Problem: long pipeline and memory latencies
- Frequent long stalls leave the ALU datapath underutilized
- Solution: exploit the abundance of threads
- Store contexts for multiple batches of threads
- Issue instructions from different batches to hide latencies
Requires logic to issue from and manage batches (sketched below)
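A hedged C sketch of that issue logic: round-robin over the batch contexts, skipping any batch with an outstanding memory request, so the ALU pipeline keeps issuing. `NUM_BATCHES`, `batch_ctx`, and the ready test are assumptions, not the actual design.

    #include <stdbool.h>

    #define NUM_BATCHES 32          /* hardware batch contexts          */

    typedef struct {
        bool waiting_on_mem;        /* batch has an outstanding load?   */
        int  pc;                    /* next instruction for this batch  */
    } batch_ctx;

    /* Round-robin: return the next ready batch after `last`,
       or -1 if every batch is stalled (pipeline bubble). */
    int pick_batch(const batch_ctx b[NUM_BATCHES], int last) {
        for (int i = 1; i <= NUM_BATCHES; i++) {
            int cand = (last + i) % NUM_BATCHES;
            if (!b[cand].waiting_on_mem)
                return cand;        /* issue from this batch next cycle */
        }
        return -1;
    }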
Slide 16: Methodology and Results
Slide 17: Simulation Methodology
- SystemC-based simulation
- Parameterized to model XD1000
- Assume a conservative 100 MHz soft processor clock
- Cycle accurate at the block interfaces
- Models HyperTransport (bandwidth and latency)
- Benchmarks
- photon: Monte Carlo heat-transfer simulation (ALU-intensive)
- matmatmult: dense matrix-matrix multiplication (memory-intensive)
Slide 18: ALU Utilization
[Plot: ALU utilization (%) vs. number of hardware batch contexts (1, 2, 4, ..., 64) for photon; breakdown: Utilized, Not ALU, Mem Wait, Inside ALU]
Slide 19: ALU Utilization
[Plot: ALU utilization breakdown (Utilized, Mem Wait, Not ALU, Inside ALU)]
32 batches is sufficient
Slide 20: Conclusions
- GPU-inspired soft processor architecture
- exploits multithreading, vector operations, predication
- Thread symmetry and batching allow
- tolerating limited block RAM ports
- tolerating long memory and pipeline latencies
- 32 batches are sufficient
- to achieve 100% ALU utilization
- Future work
- customize programming model and arch. to FPGAs
- exploit longer vectors, multiple CPUs, custom ops
Slide 21: Backups
Slide 22: ALU Datapath
[Figure: ALU datapath pipeline, 64 cycles long!]
Slide 23: Reducing Register File Size
- Is it possible to shrink the register file?
- Some programs use far fewer registers
- Ex) photon uses 4 → can replace the register file below with 16 × M4K
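For scale (assuming 32-bit floats, so 128 bits per 4-element vector register, and one register file per thread): 4 registers/thread × 128 bits × 4 threads/batch × 32 batches = 64 Kbit, which is exactly 16 M4K blocks of 4 Kbit each.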
Slide 24: Multi-threading
Slide 25: Implementation
- Assume an XtremeData XD1000 system (or similar)
- FPGA plugs into second socket on dual CPU board
- Communication with CPU via HyperTransport
- 16-bit wide, 400 MHz DDR interface → 1.6 GB/s
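The quoted figure follows directly from the link parameters: 16 bits × 400 MHz × 2 transfers per clock (DDR) = 12.8 Gbit/s = 1.6 GB/s.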
Slide 26: Estimating Memory Latency
Slide 27: ALU Utilization
[Plot: ALU utilization breakdown (Utilized, Mem Wait, Not ALU, Inside ALU) for matmatmult]
Matmatmult is bottlenecked on load latency
Slide 28: Performance (16-Bit HT Link)
[Plot: performance vs. number of hardware batch contexts (2, 4, ..., 64)]