Title: MAPLD2005/C178
1. Sorting on the SRC 6 Reconfigurable Computer
- John Harkins, Tarek El-Ghazawi, Esam El-Araby, Miaoqing Huang
The George Washington University
Washington, DC
2. Algorithms
- Quick Sort
- Heap Sort
- Radix Sort
- Bitonic Sort
- Odd/Even Merge
3. SRC System Architecture
- 16-port crossbar switch, 1.6 GB/s peak port bandwidth
- Up to 16 nodes per switch
[Figure: processor node, FPGA node, and memory node attached to the crossbar switch; links labeled 64]
4-9. Example - Quick Sort
index   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
0      13  3 14 15 10  2  6  0  8  4 12  7  5 11  1  9
med     0  3 14 15 10  2  6  9  8  4 12  7  5 11  1 13
QS1     0  3  5  7  4  2  6  1  8  9 12 15 14 11 10 13
mL      0  3  5  7  4  2  6  1  8
PS      0  1  2  3  4  5  6  7  8
- 0: input keys
- med: median of 3 applied to the keys at indexes 0, 7, and 15 (13, 0, 9 become 0, 9, 13)
- QS1: first partition around the pivot 9
- mL: median of 3 on the left partition (already in order)
- PS: pipeline sort of the left partition
10. Quick Sort - MIMD Architecture
- 6 instances
- Median of 3 to select pivot
- Pipeline Sort for small partitions (threshold 10) vs. Insertion Sort (threshold 20)
[Figure: banks A-F, FPGA1, FPGA2, instances QS1-QS6; labels 90 and 84]
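The median-of-3 pivot selection and the small-partition cutoff listed on the Quick Sort slide can be sketched in plain C. This is a minimal processor-side sketch, not the MAP C code from the talk: the function names and the CUTOFF constant are hypothetical, insertion sort stands in for the small-partition sorter, and the partition step may order keys differently from the QS1 row in the example.

```c
#include <stdint.h>

#define CUTOFF 10  /* assumed small-partition threshold (hypothetical name) */

static void swap_u64(uint64_t *x, uint64_t *y) { uint64_t t = *x; *x = *y; *y = t; }

/* Insertion sort on keys[lo..hi], both ends inclusive. */
static void insertion_sort(uint64_t *keys, long lo, long hi)
{
    for (long i = lo + 1; i <= hi; i++) {
        uint64_t v = keys[i];
        long j = i - 1;
        while (j >= lo && keys[j] > v) { keys[j + 1] = keys[j]; j--; }
        keys[j + 1] = v;
    }
}

/* Order keys[lo], keys[mid], keys[hi] so that the median ends up at mid. */
static void median_of_3(uint64_t *keys, long lo, long mid, long hi)
{
    if (keys[mid] < keys[lo]) swap_u64(&keys[mid], &keys[lo]);
    if (keys[hi] < keys[lo])  swap_u64(&keys[hi], &keys[lo]);
    if (keys[hi] < keys[mid]) swap_u64(&keys[hi], &keys[mid]);
}

/* Quick sort on keys[lo..hi]; partitions of CUTOFF keys or fewer go to insertion sort. */
void quick_sort(uint64_t *keys, long lo, long hi)
{
    while (hi - lo + 1 > CUTOFF) {
        long mid = lo + (hi - lo) / 2;
        median_of_3(keys, lo, mid, hi);
        uint64_t pivot = keys[mid];
        long i = lo, j = hi;
        while (i <= j) {                      /* Hoare-style partition */
            while (keys[i] < pivot) i++;
            while (keys[j] > pivot) j--;
            if (i <= j) { swap_u64(&keys[i], &keys[j]); i++; j--; }
        }
        quick_sort(keys, lo, j);              /* recurse on the left part */
        lo = i;                               /* loop on the right part */
    }
    insertion_sort(keys, lo, hi);
}
```

Calling `median_of_3(keys, 0, 7, 15)` on the 16 example keys reproduces the "med" row of the example; `quick_sort(keys, 0, 15)` then sorts them.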
11-17. Example - Heap Sort
index   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
0      13  3 14 15 10  2  6  0  8  4 12  7  5 11  1  9
6      13  3 14 15 10  2  6  9  8  4 12  7  5 11  1  0
6      13  3 14 15 10  2 11  9  8  4 12  7  5  6  1  0
max    15 13 14  9 12  7 11  3  8  4 10  2  5  6  1  0
- 0: input keys
- 6 (first row): key 0 at index 7 has been swapped with its child 9 at index 15
- 6 (second row): key 6 at index 6 has been swapped with its larger child 11 at index 13
- max: completed max-heap
[Figure: binary-tree view of the array at each step]
18. Heap Sort - MIMD Architecture
- 6 instances
- Almost identical to processor code
[Figure: banks A-F, FPGA1, FPGA2, instances HS1-HS6; labels 55 and 5]
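The heap construction traced in the Heap Sort example can be sketched in C as follows. This is an illustrative array-based version under the usual layout assumption (children of index i at 2i+1 and 2i+2); the function names are hypothetical and this is not the code from the talk.

```c
#include <stddef.h>
#include <stdint.h>

/* Sift keys[root] down within keys[0..n) so the subtree is a max-heap. */
static void sift_down(uint64_t *keys, size_t root, size_t n)
{
    for (;;) {
        size_t child = 2 * root + 1;
        if (child >= n) return;                        /* no children */
        if (child + 1 < n && keys[child + 1] > keys[child])
            child++;                                   /* pick the larger child */
        if (keys[root] >= keys[child]) return;         /* heap property holds */
        uint64_t t = keys[root]; keys[root] = keys[child]; keys[child] = t;
        root = child;
    }
}

/* Build a max-heap bottom-up, starting at the last node that has a child. */
void build_max_heap(uint64_t *keys, size_t n)
{
    for (size_t i = n / 2; i-- > 0; )
        sift_down(keys, i, n);
}

/* Heap sort: repeatedly move the maximum to the end and restore the heap. */
void heap_sort(uint64_t *keys, size_t n)
{
    build_max_heap(keys, n);
    for (size_t end = n; end > 1; end--) {
        uint64_t t = keys[0]; keys[0] = keys[end - 1]; keys[end - 1] = t;
        sift_down(keys, 0, end - 1);
    }
}
```

On the 16 example keys, `build_max_heap` starts at index 7 and yields the array shown in the "max" row of the example.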
19-24. Example - Radix Sort
index   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
1      13  3 14 15 10  2  6  0  8  4 12  7  5 11  1  9
2       0  8  4 12 13  5  1  9 14 10  2  6  3 15  7 11
3       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Row 1 in binary: 1101 0011 1110 1111 1010 0010 0110 0000 1000 0100 1100 0111 0101 1011 0001 1001
Row 2 in binary: 0000 1000 0100 1100 1101 0101 0001 1001 1110 1010 0010 0110 0011 1111 0111 1011
Pass 1: count the keys for each 2-bit digit value
- count1 = count2 = count3 = count4 = 4
- index0 = 0; indexn = sum of counti for i = 1..n, n > 0
- index0 = 0, index1 = 4, index2 = 8, index3 = 12
Pass 2: place each key at the index of its low digit, increment that index, and count its high digit (count0..count3)
- key 13 (1101): placed at 4, index1 = 5, count3 = 1
- key 3 (0011): placed at 12, index3 = 13, count0 = 1
- key 14 (1110): placed at 8, index2 = 9, count3 = 2
- after all 16 keys: index0 = 4, index1 = 8, index2 = 12, index3 = 16 (row 2)
Pass 3: repeat on the high digit (row 3, sorted)
25. Radix Sort - MIMD Architecture
- 3 instances
- Uses enumeration sort
- Radix: 13 bits vs. 8 bits
[Figure: banks A-F, FPGA1, FPGA2, instances Radix Sort1-Radix Sort3; labels 33 and 5]
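The count, index, and placement passes from the Radix Sort example can be sketched in C as a least-significant-digit radix sort. This is a minimal sketch with hypothetical names and parameters, not the MAP C implementation; unlike the example, it recounts digits at the start of every pass instead of counting the next digit while placing keys.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* LSD radix sort of n keys, radix_bits bits per pass, over the low key_bits bits.
 * Assumes 1 <= radix_bits <= 16 and key_bits <= 64.
 * Returns 0 on success, -1 on allocation failure. */
int radix_sort(uint64_t *keys, size_t n, unsigned radix_bits, unsigned key_bits)
{
    if (n == 0) return 0;
    size_t buckets = (size_t)1 << radix_bits;
    uint64_t mask = (uint64_t)buckets - 1;
    uint64_t *tmp = malloc(n * sizeof *tmp);
    size_t *index = malloc(buckets * sizeof *index);
    if (!tmp || !index) { free(tmp); free(index); return -1; }

    for (unsigned shift = 0; shift < key_bits; shift += radix_bits) {
        /* Count keys per digit value. */
        memset(index, 0, buckets * sizeof *index);
        for (size_t i = 0; i < n; i++)
            index[(keys[i] >> shift) & mask]++;
        /* Turn counts into start indexes: index[b] = sum of counts below b. */
        size_t sum = 0;
        for (size_t b = 0; b < buckets; b++) {
            size_t c = index[b];
            index[b] = sum;
            sum += c;
        }
        /* Place each key at its digit's index and advance that index. */
        for (size_t i = 0; i < n; i++)
            tmp[index[(keys[i] >> shift) & mask]++] = keys[i];
        memcpy(keys, tmp, n * sizeof *keys);
    }
    free(tmp);
    free(index);
    return 0;
}
```

With `radix_bits = 2` and `key_bits = 4`, the 16 example keys go through two placement passes; stopping after one pass (`key_bits = 2`) gives the order 0 8 4 12 13 5 1 9 14 10 2 6 3 15 7 11.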
26. MIMD Code Structure
main.c
    int main( ) {
        int n = 5237706;
        int64 *buf;
        buf = cacheAlign(n);
        mapSort(buf, n);
        free(buf);
        exit(0);
    }
mapSort.mc
    void mapSort(int64 *buf, n) {
        OBM_BANK_A (bufA, int64, n/6)
        OBM_BANK_B (bufB, int64, n/6)
        ...
        OBM_BANK_F (bufF, int64, n/6)
        DMA_CPU(dir, bufA, stripes, buf, n);
        #pragma src parallel sections
        {
            #pragma src section
            { Xsort(bufA, n/6); }
            #pragma src section
            { Xsort(bufB, n/6); }
            ...
            #pragma src section
            { Xsort(bufF, n/6); }
        }
        DMA_CPU(dir, bufA, stripes, buf, n);
        return;
    }
27-34. Example - Bitonic Sort
Input keys, four per row
row 0   13  3 14 15
row 1   10  2  6  0
row 2    8  4 12  7
row 3    5 11  1  9
Schedule (row pairs): (0,1) (3,2) (0,2) (1,3) (0,1) (2,3)
Rows after (0,1) and (3,2), the last frame shown
row 0    0  2  3  6
row 1   10 13 14 15
row 2    8  9 11 12
row 3    1  4  5  7
[Figure: animation frames of the keys moving through the stages of the sorting network]
35. Bitonic Sort - SIMD Architecture
- 2 instances
- Parallel sorting network
[Figure: banks A-F, FPGA1, FPGA2, SIMD controller, 8-input bitonic sorting network (1), 4-input bitonic sort (2); labels 5 and 27]
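The schedule from the Bitonic Sort example, (0,1) (3,2) (0,2) (1,3) (0,1) (2,3), can be sketched in C. The sketch assumes each pair names two rows of four keys that pass together through an 8-input bitonic network, with the four smallest keys returning to the first row of the pair; that reading is inferred from the example frames. All names are hypothetical, and the network is written as software loops, not as the hardware pipeline.

```c
#include <stdint.h>

#define ROW_KEYS 4
#define NUM_ROWS 4

/* 8-input bitonic sorting network (ascending), written as compare-exchange loops. */
static void bitonic_sort8(uint64_t v[8])
{
    for (unsigned k = 2; k <= 8; k <<= 1) {
        for (unsigned j = k >> 1; j > 0; j >>= 1) {
            for (unsigned i = 0; i < 8; i++) {
                unsigned l = i ^ j;
                if (l > i) {
                    int ascending = (i & k) == 0;
                    if ((ascending && v[i] > v[l]) || (!ascending && v[i] < v[l])) {
                        uint64_t t = v[i]; v[i] = v[l]; v[l] = t;
                    }
                }
            }
        }
    }
}

/* Sort rows lo and hi together: four smallest keys to row lo, four largest to row hi. */
static void sort_row_pair(uint64_t rows[NUM_ROWS][ROW_KEYS], int lo, int hi)
{
    uint64_t v[2 * ROW_KEYS];
    for (int c = 0; c < ROW_KEYS; c++) {
        v[c] = rows[lo][c];
        v[ROW_KEYS + c] = rows[hi][c];
    }
    bitonic_sort8(v);
    for (int c = 0; c < ROW_KEYS; c++) {
        rows[lo][c] = v[c];
        rows[hi][c] = v[ROW_KEYS + c];
    }
}

/* Apply the six-step schedule from the example to four rows of four keys. */
void bitonic_sort_rows(uint64_t rows[NUM_ROWS][ROW_KEYS])
{
    static const int schedule[6][2] = {
        {0, 1}, {3, 2}, {0, 2}, {1, 3}, {0, 1}, {2, 3}
    };
    for (int s = 0; s < 6; s++)
        sort_row_pair(rows, schedule[s][0], schedule[s][1]);
}
```

After the first two steps the rows match the last frame of the example; the remaining four steps merge the rows into fully sorted order.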
36-41. Example - Odd/Even Merge
Input keys
A   0  1  2  4  7 11 12 14
B   3  5  6  8  9 10 13 15
Merged keys (last frame shown)
C   0  1  2  3
[Figure: animation frames of keys from A and B moving through a MUX and Z-1 / Z-2 delay elements into C; in the last frame A still holds 12 14 and B still holds 9 10 13 15]
42. Odd/Even Merge - SIMD Architecture
- 1 instance
- Parallel sorting network
- A/B odd, C/D even
[Figure: banks A-F, FPGA1, FPGA2, Odd Merge Two, Even Merge Two, Merge Out; labels 40 and 5]
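The split into an odd merge, an even merge, and a combining output stage can be illustrated with Batcher's odd-even merge in C. This is a software sketch under stated assumptions: both inputs are sorted and have the same even length n, indexes are 0-based (so "even" means indexes 0, 2, 4, ...), and a plain two-way merge stands in for the two hardware merge units. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Plain two-way merge of sorted x[0..nx) and y[0..ny) into out. */
static void merge_two(const uint64_t *x, size_t nx,
                      const uint64_t *y, size_t ny, uint64_t *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < nx && j < ny) out[k++] = (x[i] <= y[j]) ? x[i++] : y[j++];
    while (i < nx) out[k++] = x[i++];
    while (j < ny) out[k++] = y[j++];
}

/* Merge sorted a[0..n) and b[0..n) into c[0..2n) by odd-even merging.
 * Requires n > 0 and n even. Returns 0 on success, -1 on bad n or allocation failure. */
int odd_even_merge(const uint64_t *a, const uint64_t *b, size_t n, uint64_t *c)
{
    if (n == 0 || n % 2 != 0) return -1;
    size_t h = n / 2;
    uint64_t *scratch = malloc(4 * n * sizeof *scratch);
    if (!scratch) return -1;
    uint64_t *ae = scratch, *be = ae + h, *ao = be + h, *bo = ao + h;
    uint64_t *d = bo + h;   /* merged even-indexed keys, n entries */
    uint64_t *e = d + n;    /* merged odd-indexed keys, n entries */

    for (size_t i = 0; i < h; i++) {
        ae[i] = a[2 * i];  ao[i] = a[2 * i + 1];
        be[i] = b[2 * i];  bo[i] = b[2 * i + 1];
    }
    merge_two(ae, h, be, h, d);   /* even merge */
    merge_two(ao, h, bo, h, e);   /* odd merge */

    /* Output stage: one compare-exchange per interior pair (d[i], e[i-1]). */
    c[0] = d[0];
    for (size_t i = 1; i < n; i++) {
        uint64_t lo = d[i] < e[i - 1] ? d[i] : e[i - 1];
        uint64_t hi = d[i] < e[i - 1] ? e[i - 1] : d[i];
        c[2 * i - 1] = lo;
        c[2 * i] = hi;
    }
    c[2 * n - 1] = e[n - 1];
    free(scratch);
    return 0;
}
```

For the example inputs A and B (n = 8), the even merge produces 0 2 3 6 7 9 12 13, the odd merge produces 1 4 5 8 10 11 14 15, and the output stage interleaves them into 0 through 15.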
43. SIMD Code Structure
main.c
    int main( ) {
        int n = 5237706;
        int64 *buf;
        buf = cacheAlign(n);
        mapSort(buf, n);
        free(buf);
        exit(0);
    }
mapSort.mc
    void mapSort(int64 *buf, n) {
        OBM_BANK_A (AA, int64, n/6)
        OBM_BANK_B (BB, int64, n/6)
        ...
        OBM_BANK_F (FF, int64, n/6)
        DMA_CPU(dir, AA, stripes, buf, n);
        for (i = 0; i < rounds; i++) {
            schedule(r1, r2);
            bitonicSort8(AA[r1], BB[r1], CC[r1], DD[r1],
                         AA[r2], BB[r2], CC[r2], DD[r2],
                         AA[r1], BB[r1], CC[r1], DD[r1],
                         AA[r2], BB[r2], CC[r2], DD[r2]);
            bitonicSort4(EE[r1], FF[r1], EE[r2], FF[r2],
                         ...);
        }
        DMA_CPU(dir, AA, stripes, buf, n);
        return;
    }
44. Implementation Comparisons
Algorithm     Processor  Complexity  Language  Lines of Code  Recursion  FPGA Util. (slices)  Upper Bound (x10^6 keys/s)
Quick Sort    X86        N lg N      C         81
Quick Sort    FPGA       N lg N      MC        97/96          n/a        90,84                31.58
Heap Sort     X86        N lg N      C         55             -
Heap Sort     FPGA       N lg N      MC        56/54          n/a        55,0                 31.58
Radix Sort    X86        N           C         70             -
Radix Sort    FPGA       N           MC        81/64          n/a        33,0                 60.00
Bitonic Sort  X86        N lg^2 N    C         78
Bitonic Sort  FPGA       lg^2 N      VHDL      53/478/365     n/a        27,0                 6.32
O/E Merge     X86        N           C         52             -
O/E Merge     FPGA       N           MC        71/120         n/a        40,0                 60.87
Other columns on the slide: Compiler, MIMD/SIMD, Refactoring
Legend
- X86: Dual Xeon 2.8 GHz
- FPGA: Virtex2 XC6000 @ 100 MHz
- MC: MAP C
- Compiler: icc v8.0 -fast, mcc v1.8, mcc v1.9
- Refactoring: entirely, major changes, some, very little, almost none
45. Lesson Learned 1
- Know your tools
- Develop accurate assessments early
                                          Compiler   Quick Sort  Heap Sort  Radix Sort  Bitonic Sort  O/E Merge
2.8 GHz Xeon (x10^6 keys/s)               gcc        1.99        0.50       1.63        -             -
2.8 GHz Xeon (x10^6 keys/s)               icc -fast  5.66        1.06       4.72        -             -
FPGA upper bound estimate (x10^6 keys/s)             31.58       31.58      60.00       6.32          60.87
Upper bound on speedup vs gcc                        15.87       63.16      36.81       -             -
Upper bound on speedup vs icc                        5.58        29.79      12.71       -             -
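The speedup rows in the table follow from dividing the FPGA upper bound estimate by the measured processor throughput. A small C sketch of that arithmetic, using only the figures from the table (the function and variable names are hypothetical):

```c
#include <stdio.h>

/* Upper bound on speedup: FPGA upper bound estimate over processor throughput,
 * both in units of 10^6 keys/s. */
double speedup_bound(double fpga_upper_bound, double cpu_rate)
{
    return fpga_upper_bound / cpu_rate;
}

/* Print the speedup bounds for the three algorithms measured on the Xeon. */
void print_speedup_bounds(void)
{
    const char *name[3] = {"Quick Sort", "Heap Sort", "Radix Sort"};
    const double fpga[3] = {31.58, 31.58, 60.00};
    const double gcc[3]  = {1.99, 0.50, 1.63};
    const double icc[3]  = {5.66, 1.06, 4.72};

    for (int i = 0; i < 3; i++)
        printf("%-10s  vs gcc: %6.2f  vs icc: %6.2f\n", name[i],
               speedup_bound(fpga[i], gcc[i]),
               speedup_bound(fpga[i], icc[i]));
}
```

For example, Quick Sort against gcc gives 31.58 / 1.99, which rounds to the 15.87 in the table.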
46. Test Conditions
- 64-bit unsigned integer keys
- Uniformly distributed
- Randomly permuted
- Scores: average of 10 runs
- FPGA configuration time: 65 ms
- DMA time: 18 ms
- Typical key quantity: 3.14M
- Processor comparison: Xeon 2.8 GHz, 1 GB memory
47. Experimental Results - 64-bit keys
[Chart: throughput (x10^6 keys/s) for each sorting algorithm]
48. mcc Compiler
- Attempts to pipeline inner loops
- Maintains sequential behavior of C
- Reports dependencies/penalties
- Quick Sort: 1 penalty
- Heap Sort: 12 penalties
- Radix Sort: 2 penalties
- Bitonic Sort: 5 penalties
- Odd/Even Merge: 1 penalty
- Easy to build embarrassingly parallel code
- Resource usage: 2x HDL
49. Conclusion
- FPGAs not best choice for sorting
- Sorting is memory bound
- Tight loops, low computation suited to processor
- More parallel memory accesses
- Faster clock rates
- Refactoring for better performance
- FPGAs underutilized
- Understand compiler limitations
- Eliminate dependencies
50. Tight Loop Example
- Merge
- a[N] = b[N] = infinity; j = k = 0
  Loop i = 0 to 2N-1
    if (a[j] > b[k]) merged[i] = b[k++]
    else merged[i] = a[j++]
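The merge loop on this slide can be written as runnable C. This is a sketch with a hypothetical function name; it assumes each input array has one spare slot at index n for the sentinel and that no real key equals UINT64_MAX, which stands in for infinity.

```c
#include <stddef.h>
#include <stdint.h>

/* Merge sorted a[0..n) and b[0..n) into merged[0..2n).
 * a and b must each have room for n + 1 entries; slot n receives the sentinel.
 * Assumes every real key is smaller than UINT64_MAX. */
void sentinel_merge(uint64_t *a, uint64_t *b, size_t n, uint64_t *merged)
{
    a[n] = UINT64_MAX;              /* "infinity" sentinels remove the bounds checks */
    b[n] = UINT64_MAX;
    size_t j = 0, k = 0;
    for (size_t i = 0; i < 2 * n; i++) {
        if (a[j] > b[k]) merged[i] = b[k++];
        else             merged[i] = a[j++];
    }
}
```

The loop body is one comparison, one copy, and one index increment per output key, which is the low computation per memory access that the slide describes as a tight loop.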
51. Future Work
- More refactoring
- Greater use of block rams
- HW prediction to reduce penalties
- FPGA performance gain = f(computation density / memory access)