MAPLD2005/C178 - PowerPoint PPT Presentation

Slides: 52
Provided by: JohnHa51
Learn more at: http://klabs.org
Transcript and Presenter's Notes

1
Sorting on the SRC 6 Reconfigurable Computer
  • John Harkins, Tarek El-Ghazawi, Esam El-Araby, Miaoqing Huang
  • The George Washington University, Washington, DC

2
Algorithms
  • Quick Sort
  • Heap Sort
  • Radix Sort
  • Bitonic Sort
  • Odd/Even Merge

3
SRC System Architecture
  • 16 Port Crossbar Switch, 1.6 GB/s Peak Port BW
  • Node types: Processor Node, FPGA Node, Memory Node (links labeled 64)
  • Up to 16 Nodes per Switch
4
Example - Quick Sort
       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
0     13  3 14 15 10  2  6  0  8  4 12  7  5 11  1  9
med   13  3 14 15 10  2  6  0  8  4 12  7  5 11  1  9
5
Example - Quick Sort
       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
0     13  3 14 15 10  2  6  0  8  4 12  7  5 11  1  9
med    0  3 14 15 10  2  6  9  8  4 12  7  5 11  1 13
6
Example - Quick Sort
       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
0     13  3 14 15 10  2  6  0  8  4 12  7  5 11  1  9
med    0  3 14 15 10  2  6  9  8  4 12  7  5 11  1 13
QS1    0  3  5  7  4  2  6  1  8  9 12 15 14 11 10 13
7
Example - Quick Sort
       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
0     13  3 14 15 10  2  6  0  8  4 12  7  5 11  1  9
med    0  3 14 15 10  2  6  9  8  4 12  7  5 11  1 13
QS1    0  3  5  7  4  2  6  1  8  9 12 15 14 11 10 13
8
Example - Quick Sort
       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
0     13  3 14 15 10  2  6  0  8  4 12  7  5 11  1  9
med    0  3 14 15 10  2  6  9  8  4 12  7  5 11  1 13
QS1    0  3  5  7  4  2  6  1  8  9 12 15 14 11 10 13
mL     0  3  5  7  4  2  6  1  8
9
Example - Quick Sort
       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
0     13  3 14 15 10  2  6  0  8  4 12  7  5 11  1  9
med    0  3 14 15 10  2  6  9  8  4 12  7  5 11  1 13
QS1    0  3  5  7  4  2  6  1  8  9 12 15 14 11 10 13
mL     0  3  5  7  4  2  6  1  8
PS     0  1  2  3  4  5  6  7  8
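The med and QS1 rows of this example are consistent with a common textbook median-of-three partition, in which the three sampled keys are ordered in place and the pivot is parked next to the right end before scanning. The sketch below is an assumption about the scheme, not the presenters' MAP C code; the names `median3` and `qs_partition` are hypothetical, and plain `int` keys stand in for the 64-bit keys used in the tests.

```c
#include <assert.h>

/* Swap two keys in place. */
void swap_keys(int *x, int *y)
{
    int t = *x;
    *x = *y;
    *y = t;
}

/* Order a[left], a[center], a[right] so that the median sits at center. */
void median3(int *a, int left, int right)
{
    int center = (left + right) / 2;
    assert(right - left >= 2);
    if (a[center] < a[left])   swap_keys(&a[left], &a[center]);
    if (a[right]  < a[left])   swap_keys(&a[left], &a[right]);
    if (a[right]  < a[center]) swap_keys(&a[center], &a[right]);
}

/* Partition a[left..right] around the median chosen by median3().
 * Must be called right after median3(a, left, right), which leaves
 * a[left] <= pivot <= a[right] as scan sentinels.
 * Returns the final index of the pivot. */
int qs_partition(int *a, int left, int right)
{
    int center = (left + right) / 2;
    int pivot, i, j;
    assert(right - left >= 2);
    swap_keys(&a[center], &a[right - 1]);   /* park the pivot at right-1 */
    pivot = a[right - 1];
    i = left;
    j = right - 1;
    for (;;) {
        while (a[++i] < pivot) { }          /* scan right for a key >= pivot */
        while (a[--j] > pivot) { }          /* scan left for a key <= pivot  */
        if (i >= j)
            break;
        swap_keys(&a[i], &a[j]);
    }
    swap_keys(&a[i], &a[right - 1]);        /* move the pivot into place */
    return i;
}
```

Calling `median3(a, 0, 15)` and then `qs_partition(a, 0, 15)` on the 16 input keys reproduces the med and QS1 rows, with the pivot 9 ending at index 9.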
10
Quick Sort - MIMD Architecture
  • 6 Instances
  • Median of 3 to select pivot
  • Pipeline Sort for partitions < 10 vs. Insertion
    Sort < 20

[Figure: memory banks A-F; FPGA1 and FPGA2 running instances QS1-QS6; figure labels 90 and 84]
Example - Heap Sort
       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
0     13  3 14 15 10  2  6  0  8  4 12  7  5 11  1  9
[Figure: binary heap tree of the keys; node labels as extracted: 13 14 3 10 2 6 15 11 1 0 8 4 12 7 5 9]
12
Example - Heap Sort
[Figure: binary heap tree of the keys; node labels as extracted: 13 14 3 10 2 6 15 11 1 4 12 7 5 8]
13
Example - Heap Sort
[Figure: binary heap tree of the keys; node labels as extracted: 13 14 3 10 2 6 15 11 1 8 4 12 7 5 0 9]
14
Example - Heap Sort
[Figure: binary heap tree of the keys; node labels as extracted: 13 14 3 10 2 6 15 11 1 8 4 12 7 5 9 0]
15
Example - Heap Sort
       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
6     13  3 14 15 10  2  6  9  8  4 12  7  5 11  1  0
[Figure: binary heap tree of the keys; node labels as extracted: 13 14 3 10 2 15 6 11 1 8 4 12 7 5]
16
Example - Heap Sort
       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
6     13  3 14 15 10  2 11  9  8  4 12  7  5  6  1  0
[Figure: binary heap tree of the keys; node labels as extracted: 13 14 3 10 2 15 11 6 1 8 4 12 7 5]
17
Example - Heap Sort
       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
max   15 13 14  9 12  7 11  3  8  4 10  2  5  6  1  0
[Figure: binary heap tree of the keys; node labels as extracted: 15 14 13 12 7 11 9 6 1 3 8 4 10 2 5 0]
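The heap-building steps in this example match the standard bottom-up construction of a max-heap stored in an array, where node i has children 2i+1 and 2i+2. The following is a minimal sketch under that assumption; the function names are hypothetical and `int` keys stand in for the 64-bit keys used in the tests.

```c
#include <assert.h>
#include <stddef.h>

/* Sift a[i] down inside the heap a[0..n-1] until neither child is larger. */
void sift_down(int *a, size_t n, size_t i)
{
    assert(i < n);
    for (;;) {
        size_t largest = i;
        size_t l = 2 * i + 1;   /* left child  */
        size_t r = 2 * i + 2;   /* right child */
        int t;
        if (l < n && a[l] > a[largest]) largest = l;
        if (r < n && a[r] > a[largest]) largest = r;
        if (largest == i)
            return;
        t = a[i];
        a[i] = a[largest];
        a[largest] = t;
        i = largest;
    }
}

/* Build a max-heap bottom-up, starting from the last node that has a child. */
void build_max_heap(int *a, size_t n)
{
    size_t i;
    for (i = n / 2; i-- > 0; )
        sift_down(a, n, i);
}

/* Heap sort: repeatedly move the maximum to the end and restore the heap. */
void heap_sort(int *a, size_t n)
{
    size_t end;
    build_max_heap(a, n);
    for (end = n; end > 1; end--) {
        int t = a[0];
        a[0] = a[end - 1];
        a[end - 1] = t;
        sift_down(a, end - 1, 0);
    }
}
```

Running `build_max_heap` on the 16 input keys yields the array shown in the max row; the first two exchanges it performs are the 0/9 and 6/11 swaps visible in the earlier frames.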
18
Heap Sort - MIMD Architecture
  • 6 Instances
  • Almost identical to processor code

[Figure: memory banks A-F; FPGA1 and FPGA2 running instances HS1-HS6; figure labels 55 and 5]
19
Example - Radix Sort
       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
1     13  3 14 15 10  2  6  0  8  4 12  7  5 11  1  9
Pass 1
Keys (binary): 1101 0011 1110 1111 1010 0010 0110 0000 1000 0100 1100 0111 0101 1011 0001 1001
count1 = 4, count2 = 4, count3 = 4, count4 = 4
index0 → 0, index1 → 4, index2 → 8, index3 → 12
$index_0 = 0, \quad index_n = \sum_{i=1}^{n} count_i, \quad n > 0$
20
Example - Radix Sort
       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
2
Pass 2
Keys (binary): 1101 0011 1110 1111 1010 0010 0110 0000 1000 0100 1100 0111 0101 1011 0001 1001
count0 = 0, count1 = 0, count2 = 0, count3 = 0
index0 → 0, index1 → 4, index2 → 8, index3 → 12
21
Example - Radix Sort
       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
2                 13
Pass 2
Keys (binary): 1101 0011 1110 1111 1010 0010 0110 0000 1000 0100 1100 0111 0101 1011 0001 1001
count0 = 0, count1 = 0, count2 = 0, count3 = 1
index0 → 0
1101, index1 → 5
index2 → 8
index3 → 12
22
Example - Radix Sort
       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
2                 13                       3
Pass 2
Keys (binary): 1101 0011 1110 1111 1010 0010 0110 0000 1000 0100 1100 0111 0101 1011 0001 1001
count0 = 1, count1 = 0, count2 = 0, count3 = 1
index0 → 0
1101, index1 → 5
index2 → 8
0011, index3 → 13
23
Example - Radix Sort
       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
2                 13          14           3
Pass 2
Keys (binary): 1101 0011 1110 1111 1010 0010 0110 0000 1000 0100 1100 0111 0101 1011 0001 1001
count0 = 1, count1 = 0, count2 = 0, count3 = 2
index0 → 0
1101, index1 → 5
1110, index2 → 9
0011, index3 → 13
24
Example - Radix Sort
       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
3      0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Pass 3
Keys (binary): 1101 0011 1110 1111 1010 0010 0110 0000 1000 0100 1100 0111 0101 1011 0001 1001
Pass 2 result: 0000 1000 0100 1100 | 1101 0101 0001 1001 | 1110 1010 0010 0110 | 0011 1111 0111 1011
Pass 3 result: 0000 0001 0010 0011 | 0100 0101 0110 0111 | 1000 1001 1010 1011 | 1100 1101 1110 1111
index0 → 4, index1 → 8, index2 → 12, index3 → 16
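The three passes in this example read as a least-significant-digit radix sort with 2-bit digits: pass 1 only counts, and each later pass distributes keys on one digit while counting the next digit on the fly. The sketch below follows that reading; it is an interpretation of the slides, the function names are hypothetical, and the 2-bit radix is chosen to match the 4-bit example keys rather than the 13-bit radix mentioned for the FPGA version.

```c
#include <assert.h>
#include <stddef.h>

enum { RADIX_BITS = 2, BUCKETS = 1 << RADIX_BITS };

/* Extract the digit used by distribution step `step` (0 = least significant).
 * The shift must stay below the width of unsigned; true for the small demo. */
unsigned radix_digit(unsigned key, unsigned step)
{
    return (key >> (step * RADIX_BITS)) & (BUCKETS - 1u);
}

/* Counting-only pass: histogram of one digit over all keys. */
void radix_count(const unsigned *keys, size_t n, unsigned step,
                 size_t count[BUCKETS])
{
    size_t i;
    int b;
    for (b = 0; b < BUCKETS; b++)
        count[b] = 0;
    for (i = 0; i < n; i++)
        count[radix_digit(keys[i], step)]++;
}

/* Distribution pass on digit `step`, counting digit `step + 1` as it goes. */
void radix_distribute(const unsigned *src, unsigned *dst, size_t n,
                      unsigned step, const size_t count[BUCKETS],
                      size_t next_count[BUCKETS])
{
    size_t index[BUCKETS];
    size_t sum = 0;
    size_t i;
    int b;
    assert(count != next_count);
    /* index[0] = 0, index[b] = count[0] + ... + count[b-1] */
    for (b = 0; b < BUCKETS; b++) {
        index[b] = sum;
        sum += count[b];
        next_count[b] = 0;
    }
    for (i = 0; i < n; i++) {
        dst[index[radix_digit(src[i], step)]++] = src[i];   /* stable placement */
        next_count[radix_digit(src[i], step + 1)]++;        /* fused counting   */
    }
}
```

For the 16 example keys, one `radix_count` call followed by two `radix_distribute` calls (steps 0 and 1) gives the pass 2 intermediate order and then the fully sorted output.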
25
Radix Sort - MIMD Architecture
  • 3 Instances
  • Uses enumeration sort
  • Radix 13 bits vs. 8 bits

[Figure: memory banks A-F; FPGA1 and FPGA2 running Radix Sort 1, Radix Sort 2, Radix Sort 3; figure labels 33 and 5]
26
MIMD Code Structure
main.c
    int main( )
    {
        int n = 5237706;
        int64 *buf;
        buf = cacheAlign(n);
        mapSort(buf, n);
        free(buf);
        exit(0);
    }

mapSort.mc
    void mapSort(int64 *buf, int n)
    {
        OBM_BANK_A (bufA, int64, n/6)
        OBM_BANK_B (bufB, int64, n/6)
        ...
        OBM_BANK_F (bufF, int64, n/6)

        /* copy keys from processor memory into the on-board banks */
        DMA_CPU(dir, bufA, stripes, buf, n);

        #pragma src parallel sections
        {
            #pragma src section
            { Xsort(bufA, n/6); }
            #pragma src section
            { Xsort(bufB, n/6); }
            ...
            #pragma src section
            { Xsort(bufF, n/6); }
        }

        /* copy sorted keys back to processor memory */
        DMA_CPU(dir, bufA, stripes, buf, n);
        return;
    }


27
Example - Bitonic Sort
Input Keys
Schedule
13 3 14 15 10 2 6 0 8 4 12 7 5 11 1 9
0 1 2 3
(0,1) (3,2) (0,2) (1,3) (0,1) (2,3)
[Figure: sorting network; keys shown: 13 3 14 15]
28
Example - Bitonic Sort
Input Keys
Schedule
10 2 6 0 8 4 12 7 5 11 1 9
0 1 2 3
(0,1) (3,2) (0,2) (1,3) (0,1) (2,3)
[Figure: sorting network; keys shown: 3 13 15 14 10 2 6 0]
29
Example - Bitonic Sort
Input Keys
Schedule
8 4 12 7 5 11 1 9
0 1 2 3
(0,1) (3,2) (0,2) (1,3) (0,1) (2,3)
[Figure: sorting network; keys shown: 3 5 15 11 13 1 14 9 2 10 6 0]
30
Example - Bitonic Sort
Input Keys
Schedule
8 4 12 7
0 1 2 3
(0,1) (3,2) (0,2) (1,3) (0,1) (2,3)
[Figure: sorting network; keys shown: 5 3 11 13 9 14 1 15 6 8 2 4 10 12 0 7]
31
Example - Bitonic Sort
Input Keys
Schedule
0 2 3 6
0 1 2 3
(0,1) (3,2) (0,2) (1,3) (0,1) (2,3)
[Figure: sorting network; keys shown: 1 0 12 2 5 3 8 6 7 10 9 13 4 14 11 15]
32
Example - Bitonic Sort
Input Keys
Schedule
0 2 3 6 10 13 14 15
0 1 2 3
(0,1) (3,2) (0,2) (1,3) (0,1) (2,3)
[Figure: sorting network; keys shown: 1 7 4 5 9 10 12 13 8 14 11 15]
33
Example - Bitonic Sort
Input Keys
Schedule
0 2 3 6 10 13 14 15 1 4 5 7
0 1 2 3
(0,1) (3,2) (0,2) (1,3) (0,1) (2,3)
[Figure: sorting network; keys shown: 1 4 5 7 8 9 11 12]
34
Example - Bitonic Sort
Input Keys
Schedule
0 2 3 6 10 13 14 15 8 9 11 12 1 4 5 7
0 1 2 3
(0,1) (3,2) (0,2) (1,3) (0,1) (2,3)
[Figure: sorting network; keys shown: 8 9 11 12]
35
Bitonic Sort - SIMD Architecture
  • 2 Instances
  • Parallel sorting network
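The schedule shown throughout the Bitonic Sort example, (0,1) (3,2) (0,2) (1,3) (0,1) (2,3), can be read as six compare-exchange steps on four wires. The sketch below assumes that a pair (a,b) routes the smaller key to wire a and the larger to wire b; that convention is inferred from the example frames, and the function names are hypothetical. In hardware the two steps of each stage would run in parallel, whereas this sequential version applies them one after another.

```c
#include <assert.h>
#include <stddef.h>

/* Compare-exchange: the smaller key goes to wire lo, the larger to wire hi. */
void compare_exchange(int *w, int lo, int hi)
{
    assert(lo != hi);
    if (w[lo] > w[hi]) {
        int t = w[lo];
        w[lo] = w[hi];
        w[hi] = t;
    }
}

/* Comparator schedule for a 4-input bitonic sorting network, as listed on the slides. */
static const int SCHEDULE[6][2] = {
    {0, 1}, {3, 2},   /* stage 1: sort pairs in opposite directions (bitonic sequence) */
    {0, 2}, {1, 3},   /* stage 2: half-cleaner across the two pairs                    */
    {0, 1}, {2, 3}    /* stage 3: final ordering within each pair                      */
};

/* Sort four keys in ascending order by applying the schedule. */
void bitonic_sort4(int w[4])
{
    size_t s;
    for (s = 0; s < 6; s++)
        compare_exchange(w, SCHEDULE[s][0], SCHEDULE[s][1]);
}
```

Applied to the first four input keys 13 3 14 15, the first stage produces 3 13 15 14, and the full schedule ends with 3 13 14 15.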

[Figure: memory banks A-F; FPGA1 and FPGA2; 8 Input Bitonic Sorting Network (1), 4 Input Bitonic Sort (2), SIMD Controller; figure labels 5 and 27]
36
Example - Odd/Even Merge
Input Keys
A: 0 1 2 4 7 11 12 14    B: 3 5 6 8 9 10 13 15
Merged Keys
C:
[Figure: merge datapath with MUX and delay elements Z^-2 and Z^-1]
37
Example - Odd/Even Merge
Input Keys
A: 0 1 2 4 7 11 12 14    B: 3 5 6 8 9 10 13 15
Merged Keys
C:
[Figure: merge datapath with delay elements Z^-2 and Z^-1; keys shown: 0 3 1 5]
38
Example - Odd/Even Merge
Input Keys
A: 2 4 7 11 12 14    B: 6 8 9 10 13 15
Merged Keys
C:
[Figure: merge datapath with delay elements Z^-2 and Z^-1; keys shown: 0 2 3 1 4 5]
39
Example - Odd/Even Merge
Input Keys
A: 7 11 12 14    B: 6 8 9 10 13 15
Merged Keys
C:
[Figure: merge datapath with delay elements Z^-2 and Z^-1; keys shown: 0 2 7 3 4 1 11 5]
40
Example - Odd/Even Merge
Input Keys
A: 12 14    B: 6 8 9 10 13 15
Merged Keys
C: 0 1
[Figure: merge datapath with delay elements Z^-2 and Z^-1; keys shown: 2 3 7 0 6 5 4 11 1 8]
41
Example - Odd/Even Merge
Input Keys
A: 12 14    B: 9 10 13 15
Merged Keys
C: 0 1 2 3
[Figure: merge datapath with delay elements Z^-2 and Z^-1; keys shown: 4 6 7 2 9 8 5 11 3 10]
42
Odd/Even Merge - SIMD Architecture
  • 1 Instance
  • Parallel sorting network
  • A/B odd, C/D even
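The split into an odd merger, an even merger, and an output stage resembles one level of Batcher's odd-even merge: the even-indexed keys of both inputs are merged, the odd-indexed keys are merged, and a final row of compare-exchanges interleaves the two results. The sketch below illustrates that technique in software under this assumption; it is not the presenters' FPGA design, and the names `merge_two`, `odd_even_merge`, and `MAX_KEYS` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_KEYS = 64 };   /* largest input length this sketch accepts */

/* Two-way merge of n keys from each of a and b, reading every stride-th key. */
void merge_two(const int *a, const int *b, size_t n, size_t stride, int *out)
{
    size_t j = 0, k = 0, i = 0;
    while (j < n && k < n) {
        if (a[j * stride] <= b[k * stride])
            out[i++] = a[stride * j++];
        else
            out[i++] = b[stride * k++];
    }
    while (j < n) out[i++] = a[stride * j++];
    while (k < n) out[i++] = b[stride * k++];
}

/* Merge sorted a[0..n-1] and b[0..n-1] into c[0..2n-1]; n must be even. */
void odd_even_merge(const int *a, const int *b, size_t n, int *c)
{
    int even[MAX_KEYS], odd[MAX_KEYS];
    size_t h = n / 2;
    size_t i;
    assert(n > 0 && n % 2 == 0 && n <= MAX_KEYS);
    merge_two(a, b, h, 2, even);           /* a[0], a[2], ... with b[0], b[2], ... */
    merge_two(a + 1, b + 1, h, 2, odd);    /* a[1], a[3], ... with b[1], b[3], ... */
    c[0] = even[0];
    for (i = 1; i < n; i++) {              /* output stage: one compare-exchange per pair */
        int lo = odd[i - 1];
        int hi = even[i];
        if (lo > hi) {
            int t = lo;
            lo = hi;
            hi = t;
        }
        c[2 * i - 1] = lo;
        c[2 * i] = hi;
    }
    c[2 * n - 1] = odd[n - 1];
}
```

With the example inputs A and B, the even merger sees 0 and 3 first and the odd merger sees 1 and 5, and the merged output is 0 through 15 in order.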

[Figure: memory banks A-F; FPGA1 and FPGA2; Odd Merge Two, Even Merge Two, Merge Out; figure labels 40 and 5]
43
SIMD Code Structure
main.c
    int main( )
    {
        int n = 5237706;
        int64 *buf;
        buf = cacheAlign(n);
        mapSort(buf, n);
        free(buf);
        exit(0);
    }

mapSort.mc
    void mapSort(int64 *buf, int n)
    {
        OBM_BANK_A (AA, int64, n/6)
        OBM_BANK_B (BB, int64, n/6)
        ...
        OBM_BANK_F (FF, int64, n/6)

        /* copy keys from processor memory into the on-board banks */
        DMA_CPU(dir, AA, stripes, buf, n);

        for (i = 0; i < rounds; i++) {
            schedule(r1, r2);
            bitonicSort8(AA[r1], BB[r1], CC[r1], DD[r1],
                         AA[r2], BB[r2], CC[r2], DD[r2],
                         AA[r1], BB[r1], CC[r1], DD[r1],
                         AA[r2], BB[r2], CC[r2], DD[r2]);
            bitonicSort4(EE[r1], FF[r1], EE[r2], FF[r2],
                         ...);
        }

        /* copy sorted keys back to processor memory */
        DMA_CPU(dir, AA, stripes, buf, n);
        return;
    }

44
Implementation Comparisons
Algorithm    | Processor | Complexity | Language | Compiler | Lines Of Code | Recursion | FPGA Util. Slices | MIMD/SIMD | Refactoring | Upper Bound (x10^6 keys/s)
Quick Sort   | X86       | N lgN      | C        |          | 81            |           |                   |           |             |
Quick Sort   | FPGA      | N lgN      | MC       |          | 97/96         | n/a       | 90,84             |           |             | 31.58
Heap Sort    | X86       | N lgN      | C        |          | 55            | -         |                   |           |             |
Heap Sort    | FPGA      | N lgN      | MC       |          | 56/54         | n/a       | 55,0              |           |             | 31.58
Radix Sort   | X86       | N          | C        |          | 70            | -         |                   |           |             |
Radix Sort   | FPGA      | N          | MC       |          | 81/64         | n/a       | 33,0              |           |             | 60.00
Bitonic Sort | X86       | N lg^2 N   | C        |          | 78            |           |                   |           |             |
Bitonic Sort | FPGA      | lg^2 N     | VHDL     |          | 53/478/365    | n/a       | 27,0              |           |             | 6.32
O/E Merge    | X86       | N          | C        |          | 52            | -         |                   |           |             |
O/E Merge    | FPGA      | N          | MC       |          | 71/120        | n/a       | 40,0              |           |             | 60.87

Legend
Compiler: icc v8.0 -fast; mcc v1.8; mcc v1.9
Refactoring: entirely; major changes; some; very little; almost none
X86: Dual Xeon 2.8GHz
FPGA: Virtex2 XC6000 @ 100MHz
MC: MAP C
45
Lesson Learned 1
  • Know your tools
  • Develop accurate assessments early

                                         | Compiler  | Quick Sort | Heap Sort | Radix Sort | Bitonic Sort | O/E Merge
2.8 GHz Xeon (x10^6 keys/s)              | gcc       | 1.99       | 0.50      | 1.63       | -            | -
2.8 GHz Xeon (x10^6 keys/s)              | icc -fast | 5.66       | 1.06      | 4.72       | -            | -
FPGA upper bound estimate (x10^6 keys/s) |           | 31.58      | 31.58     | 60.00      | 6.32         | 60.87
Upper bound on speedup vs gcc            |           | 15.87      | 63.16     | 36.81      | -            | -
Upper bound on speedup vs icc            |           | 5.58       | 29.79     | 12.71      | -            | -
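The speedup rows appear to be the FPGA upper bound estimate divided by the measured processor throughput for the same algorithm. This reading is inferred from the numbers rather than stated on the slide; a worked check for three of the entries:

```latex
\[
\text{speedup bound} = \frac{\text{FPGA upper bound}}{\text{processor throughput}}
\]
\[
\frac{31.58}{1.99} \approx 15.87 \;\;(\text{Quick Sort vs gcc}), \qquad
\frac{31.58}{1.06} \approx 29.79 \;\;(\text{Heap Sort vs icc}), \qquad
\frac{60.00}{4.72} \approx 12.71 \;\;(\text{Radix Sort vs icc})
\]
```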
46
Test Conditions
  • 64 bit unsigned integer keys
  • Uniformly distributed
  • Randomly permuted
  • Scores: average of 10 runs
  • FPGA configuration time: 65 ms
  • DMA time: 18 ms
  • Typical key quantity: 3.14M
  • Processor comparison: Xeon 2.8GHz, 1GB mem

47
Experimental Results - 64 bit keys
[Figure: throughput in x10^6 keys/s for each of the Sorting Algorithms]
48
mcc Compiler
  • Attempts to pipeline inner loops
  • Maintains sequential behavior of C
  • Reports dependencies/penalties
  • Quick Sort: 1 penalty
  • Heap Sort: 12 penalties
  • Radix Sort: 2 penalties
  • Bitonic Sort: 5 penalties
  • Odd/Even Merge: 1 penalty
  • Easy to build embarrassingly parallel code
  • Resource usage 2x HDL

49
Conclusion
  • FPGAs not best choice for sorting
  • Sorting is memory bound
  • Tight loops, low computation suited to processor
  • More parallel memory accesses
  • Faster clock rates
  • Refactoring for better performance
  • FPGAs underutilized
  • Understand compiler limitations
  • Eliminate dependencies

50
Tight Loop Example
  • Merge
  • a[N] = b[N] = infinity; j = k = 0
    Loop i = 0 to 2N-1
      if (a[j] > b[k]) merged[i] = b[k++]
      else merged[i] = a[j++]
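The merge loop on this slide can be written in C roughly as follows. This is a minimal sketch: `INT_MAX` stands in for the slide's "infinity" sentinel, the function name is hypothetical, and the input keys are assumed to be strictly smaller than the sentinel.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Merge sorted a[0..n-1] and b[0..n-1] into merged[0..2n-1].
 * Both a and b must have room for one extra element at index n,
 * which is overwritten with the sentinel. */
void sentinel_merge(int *a, int *b, size_t n, int *merged)
{
    size_t i, j = 0, k = 0;
    /* real keys must be below the sentinel, or the loop could run past it */
    assert(n == 0 || (a[n - 1] < INT_MAX && b[n - 1] < INT_MAX));
    a[n] = INT_MAX;
    b[n] = INT_MAX;
    for (i = 0; i < 2 * n; i++) {
        if (a[j] > b[k])
            merged[i] = b[k++];
        else
            merged[i] = a[j++];
    }
}
```

The loop body is a single compare plus one load and one store per key, which is the kind of tight, low-computation loop the conclusion slide describes as well suited to a processor.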

51
Future Work
  • More refactoring
  • Greater use of block rams
  • HW prediction to reduce penalties
  • FPGA performance gain = ƒ(computation density / memory access)