Provided by: walidn

Title: Reconfigurable computing - a new supercomputing paradigm

1
Reconfigurable computing - a new supercomputing paradigm

Walid Najjar
Computer Science & Engineering
University of California, Riverside

2
ROCCC

Riverside Optimizing Compiler for Configurable Computing
Code acceleration
By mapping circuits to FPGAs
Achieves the same speed as hand-written VHDL code
Improved productivity
Allows design and algorithm space exploration
Keeps the user fully in control
We automate only what is very well understood

3
FPGA: A New HPC Platform?
David Strenski, "FPGAs Floating-Point Performance -- a pencil and paper evaluation," in HPCwire.com

Comparing a dual-core Opteron to an FPGA on floating-point performance
Opteron: 2.5 GHz, 1 add and 1 mult per cycle per core. 2.5 GHz x 2 flops/cycle x 2 cores = 10 Gflops
FPGAs: Xilinx V4 and V5 with DSP cores

Balanced allocation of double-precision floating-point adders, multipliers and registers
Use both DSP and logic for multipliers, run at lower speed
Logic for I/O interfaces

4
Balanced Designs

Same number of mults as adds (matrix
multiplication).
Double precision

Higher percentage of peak on FPGA (streaming)
1/3 of the power!

5
Challenges

FPGA is an amorphous mass of logic
Languages reflect the von Neumann execution model

6
ROCCC Overview
Compilation flow (from the diagram): C/C++, high-level transformations, Hi-CIRRF, low-level transformations, Lo-CIRRF, code generation
High-level transformations: procedure, loop and array optimizations
Low-level transformations: instruction scheduling, pipelining and storage optimizations
Other labels in the diagram: Java, SystemC
CIRRF: Compiler Intermediate Representation for Reconfigurable Fabrics

Limitations on the code
No recursion
No pointers

7
Focus

Extensive compile time optimizations
Maximize parallelism, speed and throughput
Minimize area and memory accesses
Optimizations
Loop level: fine-grained parallelism
Storage level: compiler-configured storage for data reuse
Circuit level: expression simplification, pipelining

8
Execution Model

A simplified model
Decoupled memory access from datapath
Parallel loop iterations
Pipelined datapath

9
So far, working compiler with

Extensive compiler optimizations and
transformations
Analysis and hardware support for data reuse
Efficient code generation and pipelining
Import of existing IP cores
Support for dynamic partial reconfiguration

10
So far, working compiler with

Extensive compiler optimizations and transformations
Loop, array and procedure transformations. Maximize clock speed and parallelism within the available resources. Under user control.
11
High Level Transformations
12
So far, working compiler with

Analysis and hardware support for data reuse
Smart buffer technique reduces off-chip memory accesses by > 98%
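
A minimal C sketch of the data-reuse idea behind the smart buffer, applied to the 5-tap FIR used elsewhere in this talk. It is a software analogy, not ROCCC's generated hardware: the function name fir5_windowed, the win array and the read counter are illustrative assumptions. Each element of A is fetched once and then reused from the window for the outputs that follow.

```c
#include <assert.h>

#define TAPS 5

/* 5-tap FIR that keeps the last TAPS inputs in a small window (the role a
   smart buffer plays in hardware), so each element of A is fetched once.
   Returns the number of reads issued to A. */
long fir5_windowed(const int *A, int *B, int n)
{
    static const int T[TAPS] = {3, 5, 7, 9, 11};
    int win[TAPS];
    long reads = 0;
    int i, k;

    assert(n >= TAPS);
    for (k = 0; k < TAPS - 1; k++) {      /* prime the window */
        win[k] = A[k];
        reads++;
    }
    for (i = 0; i + TAPS <= n; i++) {
        win[TAPS - 1] = A[i + TAPS - 1];  /* the only new fetch per output */
        reads++;
        B[i] = 0;
        for (k = 0; k < TAPS; k++)
            B[i] += T[k] * win[k];
        for (k = 0; k < TAPS - 1; k++)    /* slide the window by one */
            win[k] = win[k + 1];
    }
    return reads;
}
```

With n = 516 this version issues 516 reads of A, where a loop that re-reads all five operands for each of its 512 outputs would issue 2560. That is an 80% reduction for this toy case; the > 98% figure is the speaker's and is not reproduced by this sketch.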
13
So far, working compiler with

Efficient code generation and pipelining
Clock speed comparable to hand-written HDL code
14
So far, working compiler with

Import of existing IP cores
Huge wealth of existing IP cores. A wrapper makes a core look like a function call in C code.
15
So far, working compiler with

Support for dynamic partial reconfiguration
DPR allows reconfiguration of a subset of the FPGA, dynamically, under software control. Reduces configuration overhead.
16
Simple example

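5-tap FIR: B[i] = 3*A[i] + 5*A[i+1] + 7*A[i+2] + 9*A[i+3] + 11*A[i+4]

#define N 516
void begin_hw();
void end_hw();
int main()
{
    int i;
    const int T[5] = {3, 5, 7, 9, 11};
    int A[N], B[N];
    begin_hw();
L1: for (i = 0; i < (N - 5); i = i + 1)
        B[i] = T[0]*A[i] + T[1]*A[i+1] + T[2]*A[i+2] + T[3]*A[i+3] + T[4]*A[i+4];
    end_hw();
}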

17
Lo-CIRRF Viewer
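Example: 3-tap FIR unrolled once (two concurrent iterations)
[Figure: Lo-CIRRF viewer output, annotated with the indices of A and the coefficients]
int main()
{
    int i;
    int A[32];
    int B[32];
    for (i = 0; i < 28; i = i + 1)
        B[i] = 3*A[i] + 5*A[i+1] + 7*A[i+2];
}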
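
The unrolling named in this example can be sketched in C as follows. fir3 is the loop from the slide; fir3_unrolled is an assumed hand-written equivalent that produces two outputs per iteration, which is what allows two loop bodies to run concurrently in hardware. The function names and the LEN and ITERS macros are illustrative and are not ROCCC output.

```c
#include <assert.h>

#define LEN   32
#define ITERS 28

/* Loop from the slide: one output per iteration. */
void fir3(const int A[LEN], int B[LEN])
{
    int i;
    for (i = 0; i < ITERS; i = i + 1)
        B[i] = 3 * A[i] + 5 * A[i + 1] + 7 * A[i + 2];
}

/* Same loop unrolled once: two outputs per iteration. */
void fir3_unrolled(const int A[LEN], int B[LEN])
{
    int i;
    assert(ITERS % 2 == 0);   /* unroll factor must divide the trip count */
    for (i = 0; i < ITERS; i = i + 2) {
        B[i]     = 3 * A[i]     + 5 * A[i + 1] + 7 * A[i + 2];
        B[i + 1] = 3 * A[i + 1] + 5 * A[i + 2] + 7 * A[i + 3];
    }
}
```

The two bodies share A[i+1] and A[i+2], so one unrolled iteration needs four distinct elements of A rather than six.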
18
RC Platform Models
[Diagram: three platform models, numbered 1 to 3, composed of CPU, FPGA, memory and a fast network]
19
Platforms for RC

SGI Altix 4700
Shared memory machine, fast interconnect
12.8 GB/sec
Itanium 2, 1.6 GHz
RASC RC100 Blade: 2 Virtex-4 LX200
Xtremedata XD1000
Altera Stratix II, drop-in for AMD Opteron
Integrated interface to HyperTransport
16 bits @ 800 M transfers/sec
Memory interface
128 bits, DDR-333, up to 4 x 4 GB ECC
Flash memory
For FPGA configuration or data

20
SGI RASC RC100 Blade
[Block diagram labels: V4LX200 (two), SRAM, SSP, TIO, NL4, PCI, Selmap, Loader]
21
RC 100 Blade
22
Xtremedata XD1000
23
XD 1000
24
XD 1000 (drop-in)
25
Examples

Molecular dynamics
Computes the forces exerted by atoms on atoms in
a molecule and its environment
Time step: 1 femtosecond
Bioinformatics
Exact string edit distance computation
Using Smith-Waterman, a dynamic programming algorithm
Similar: dynamic time warping, motif discovery

26
Molecular Dynamics

Objective
Determine the shape of a molecule by computing
the forces exerted on each atom by all other
atoms, in the molecule and its environment.
N-body problem.
Forces
Electrostatic (Coulomb)
Van der Waals
Importance
Computationally intensive
Months and years of compute time for small problems
Impact: move biochemistry to digital simulation
Ultimate goal: protein folding

27
Algorithm
For every atom I in the system
    for each other atom J in the system
        compute the forces exerted by atom J on atom I
    sum all the forces
    compute its next position
Repeat until stable

Of course, not all forces have meaningful values (they fall off as 1/d^2)
More complex calculations on the boundaries
One loop body with 60 variants!
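
The double loop above can be sketched in C as shown here. This is a simplified illustration, not NAMD's kernel: it assumes a Lennard-Jones form for the van der Waals term only (no Coulomb term), a distance cutoff to skip pairs whose force is negligible, and no boundary handling. The names Vec3, compute_forces, eps, sigma and cutoff are hypothetical, and the position update and the repeat-until-stable outer loop are left out.

```c
#include <assert.h>

typedef struct { double x, y, z; } Vec3;

/* For every atom i, sum the force exerted on it by every other atom j.
   Pairs farther apart than the cutoff are skipped. Atoms are assumed
   not to coincide (r2 > 0). */
void compute_forces(const Vec3 *pos, Vec3 *force, int n,
                    double eps, double sigma, double cutoff)
{
    int i, j;

    assert(n > 0 && cutoff > 0.0);
    for (i = 0; i < n; i++) {
        Vec3 f = {0.0, 0.0, 0.0};
        for (j = 0; j < n; j++) {
            double dx, dy, dz, r2, s2, s6, scale;
            if (j == i)
                continue;
            dx = pos[i].x - pos[j].x;
            dy = pos[i].y - pos[j].y;
            dz = pos[i].z - pos[j].z;
            r2 = dx * dx + dy * dy + dz * dz;
            if (r2 > cutoff * cutoff)
                continue;                  /* force too small to matter */
            s2 = sigma * sigma / r2;       /* (sigma/r)^2 */
            s6 = s2 * s2 * s2;             /* (sigma/r)^6 */
            /* Lennard-Jones: F = 24*eps*(2*(sigma/r)^12 - (sigma/r)^6)/r^2 * (ri - rj) */
            scale = 24.0 * eps * (2.0 * s6 * s6 - s6) / r2;
            f.x += scale * dx;
            f.y += scale * dy;
            f.z += scale * dz;
        }
        force[i] = f;
    }
}
```

The inner loop body is the part that would be mapped to a pipelined datapath; the 60 variants mentioned on the slide correspond to conditions this sketch ignores.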

28
Nanoscale Molecular Dynamics

NAMD
MD code designed for high-performance simulation
of large biomolecular systems
Double precision floating-point
Critical loop
Computes the forces: 82% of execution time
60 variants to compute boundary conditions
Forces computed in X, Y and Z dimensions
52 FP operations per loop body

29
Characteristics of NAMD

Required bytes per iteration
Single precision: 48 bytes
Double precision: 96 bytes
RASC: 6.4 GB/s

30
NAMD Results

Itanium
Ideal: one full EPIC instruction/cycle
Measured: actual execution time

FPGA
Enough bandwidth for single precision
Double precision: two cycles to fetch the data for each iteration
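
The single versus double precision contrast follows from simple arithmetic under two assumptions that the slides do not state for NAMD: that the datapath runs at 100 MHz (the clock quoted later for the Smith-Waterman design) and that the 6.4 GB/s RASC figure is the bandwidth available for streaming operands. The function name below is illustrative.

```c
#include <assert.h>

/* Cycles needed to stream one iteration's operands at a given
   bandwidth and clock rate (integer ceiling division). */
long cycles_per_iteration(long long bytes_per_sec, long long clock_hz,
                          long bytes_per_iter)
{
    long long bytes_per_cycle;

    assert(bytes_per_sec > 0 && clock_hz > 0 && bytes_per_iter > 0);
    bytes_per_cycle = bytes_per_sec / clock_hz;
    assert(bytes_per_cycle > 0);
    return (long)((bytes_per_iter + bytes_per_cycle - 1) / bytes_per_cycle);
}
```

Under those assumptions the link delivers 64 bytes per cycle, so 48 bytes (single precision) fit in one cycle and 96 bytes (double precision) need two.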

31
Smith Waterman Algorithm

Dynamic programming string matching algorithm used widely in genetics-related research.
Computes a matching score of two input strings S and T using a 2D matrix.
Computation of each cell depends on the computed values of three neighboring cells: north, west and northwest.
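
A minimal software sketch of the recurrence described above, assuming a linear gap penalty. The scoring parameters (match, mismatch, gap), the MAXLEN bound and the function names are illustrative choices, not values from the talk.

```c
#include <assert.h>
#include <string.h>

#define MAXLEN 64

static int max2(int a, int b) { return a > b ? a : b; }

/* Best local-alignment score of s and t (Smith-Waterman, linear gap
   penalty). mismatch and gap are expected to be negative. */
int sw_score(const char *s, const char *t, int match, int mismatch, int gap)
{
    int H[MAXLEN + 1][MAXLEN + 1];
    int m = (int)strlen(s), n = (int)strlen(t);
    int i, j, best = 0;

    assert(m <= MAXLEN && n <= MAXLEN);
    for (i = 0; i <= m; i++) H[i][0] = 0;
    for (j = 0; j <= n; j++) H[0][j] = 0;
    for (i = 1; i <= m; i++) {
        for (j = 1; j <= n; j++) {
            int nw = H[i - 1][j - 1] + (s[i - 1] == t[j - 1] ? match : mismatch);
            int north = H[i - 1][j] + gap;
            int west = H[i][j - 1] + gap;
            H[i][j] = max2(0, max2(nw, max2(north, west)));
            best = max2(best, H[i][j]);
        }
    }
    return best;
}
```

Because each cell needs only its north, west and northwest neighbors, all cells on one anti-diagonal are independent of each other, which is the parallelism a hardware implementation can exploit.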

44
Smith-Waterman Code

Dynamic Programming
Used in protein modeling, bioinformatics, data mining
A wave-front algorithm with two input strings
A[i,j] = F(A[i,j-1], A[i-1,j-1], A[i-1,j])
F = CostMatrix(A[i,0], A[0,j])
Our Approach
Chunk the input strings into fixed-size pieces of length k
Build a k x k hardware template by compiling two nested loops (k iterations each) and fully unrolling both.
The host strip-mines the two outer loops over this template.

45
S-W View
46
After (many) Transformations

Transformations
Loop unrolling
Scalar replacement
Feedback store elimination
> 70 passes

47
Systolic execution
48
SW Performance
49
SW Potential on the RASC

100 MHz clock
3 cores of 2K cells each, 72% of FPGA area
3 x 2K x 100 MHz = 600 Gcups
Speedup over Itanium: 7140
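
The peak figure is the product of the three quantities on the slide. The sketch below assumes that every cell completes one update per clock cycle and reads 2K as 2000; the function name is illustrative.

```c
#include <assert.h>

/* Peak cell updates per second, assuming one update per cell per clock. */
long long peak_cups(int cores, int cells_per_core, long long clock_hz)
{
    assert(cores > 0 && cells_per_core > 0 && clock_hz > 0);
    return (long long)cores * cells_per_core * clock_hz;
}
```

peak_cups(3, 2000, 100000000LL) gives 600 x 10^9 cell updates per second (600 Gcups); reading 2K as 2048 gives 614.4 Gcups instead.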

50
Productivity Speedup
A ratio of 1,000 Productivity speedup
10x to 100x
51
Impact?
Performance time
Programmability time
Programmability Performance 2
Commoditization of HPC will take science and
engineering to a new revolution
52
Conclusion

FPGAs: a viable platform for supercomputing
Including single and double precision floating point
Main challenge is their programmability
ROCCC shown as bridging the gap between
HLL program representation, and
Circuit instantiation
A new paradigm: deskside supercomputing

Thank you
www.cs.ucr.edu/roccc

54
Intrusion Detection: Bloom filter

Bloom filter
Is a data structure used to test set membership of an element
Has an array of N bits, all of which are set to 0 initially
Members of the set are inserted into the filter using multiple hash functions, each of which returns a value in the range 0 to N-1; the bits at those positions are set to 1.
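
A minimal software sketch of the structure just described, with insertion and the matching membership test. The sizes NBITS and NHASH, the type and function names, and the seeded FNV-1a-style hash are all illustrative assumptions; the talk does not specify its hash functions at this point.

```c
#include <assert.h>
#include <string.h>

#define NBITS 1024   /* N: size of the bit array (assumed) */
#define NHASH 4      /* number of hash functions (assumed) */

typedef struct { unsigned char bits[NBITS / 8]; } Bloom;

/* Hypothetical hash family: FNV-1a style mixing with a per-function seed.
   Returns an index in the range 0 to NBITS-1. */
static unsigned bloom_hash(const char *key, unsigned seed)
{
    unsigned h = 2166136261u ^ (seed * 0x9E3779B9u);
    while (*key) {
        h ^= (unsigned char)*key++;
        h *= 16777619u;
    }
    return h % NBITS;
}

void bloom_init(Bloom *b) { memset(b->bits, 0, sizeof b->bits); }

void bloom_insert(Bloom *b, const char *key)
{
    unsigned k;
    assert(b != NULL && key != NULL);
    for (k = 0; k < NHASH; k++) {
        unsigned pos = bloom_hash(key, k);
        b->bits[pos / 8] |= (unsigned char)(1u << (pos % 8));
    }
}

/* Returns 1 if key may be in the set (false positives are possible),
   0 if it is definitely not in the set. */
int bloom_search(const Bloom *b, const char *key)
{
    unsigned k;
    assert(b != NULL && key != NULL);
    for (k = 0; k < NHASH; k++) {
        unsigned pos = bloom_hash(key, k);
        if (!(b->bits[pos / 8] & (1u << (pos % 8))))
            return 0;
    }
    return 1;
}
```

A key that was inserted always tests positive. A key that was never inserted can still test positive if other insertions happened to set all of its bits, which is why a second exact-match stage is needed when false positives are unacceptable.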

55
Search operation in a bloom filter

During a search operation, the same hash functions are applied to the incoming value.
If all the locations returned by the hash functions contain 1, then the element belongs to the set with a probability P; if any location contains 0, it is definitely not in the set.
Probability of a false positive

K: number of hash functions
m: number of bits in the Bloom filter array
n: number of elements inserted into the Bloom filter
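
A commonly used approximation for the false positive probability, written with the symbols defined above and assuming independent, uniformly distributed hash functions, is the following. It is the standard textbook form and may differ in detail from the expression shown on the slide.

```latex
P_{\text{false positive}} \approx \left(1 - e^{-Kn/m}\right)^{K}
```

For a fixed ratio m/n, this expression is minimized when K is close to (m/n) ln 2.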

56
Bloom filter for virus detection

Signature Processing Engine (SPE) contains the
generated bloom filter code
Bloom filter output contains false positives.
Hence a RAM is used for absolute string
comparison and to eliminate false positives.

Legend: SPE = Signature Processing Engine, FPE = False Positive Eliminator
57
Virus signatures

Uses the virus rules in the bleeding-snort database.
Each rule consists of a rule header and a rule option.
The header contains information to be used in packet classification.
The rule option contains the signatures to be used in intrusion detection.
Most of the signatures in the bleeding-snort database were under 32 bytes.

58
Bloom Filter C Code

for (i = 0; i < 248; i++) {
    for (j = 0; j < 7; j++) {
        value = input_stream[i][j];
        temp = value & 0x1;
        for (k = 0; k < 7; k++) {
            result_location1 = result_location1 ^ (hash_function1[k] & temp);
            result_location2 = result_location2 ^ (hash_function1[k] & temp);
            result_location3 = result_location3 ^ (hash_function1[k] & temp);
            result_location4 = result_location4 ^ (hash_function1[k] & temp);
            value = value >> 1;
        }
    }
    found = bit_array[result_location1] & bit_array[result_location2] &
            bit_array[result_location3] & bit_array[result_location4];
}

Compile time constant, folded
In data-path Table lookup
59
Datapath Analysis

Compiler exploits ILP by grouping instructions
into different execution levels.
Each level corresponds to a loop iteration and
the instructions are executed simultaneously.
ROCCC automatically places latches for pipelining

Each latched level corresponds to one pipeline stage and has a delay of one cycle. In the 3-stage pipeline, each XOR box corresponds to one byte of input being XORed with a hashing function.
60
Throughput evaluation