Title: Reconfigurable computing a new supercomputing paradigm
1Reconfigurable computing - a new supercomputing
paradigm
- Walid Najjar
- Computer Science & Engineering
- University of California, Riverside
2ROCCC
- Riverside Optimizing Compiler for Configurable Computing
- Code acceleration
- By mapping of circuits to FPGA
- Achieve same speed as hand-written VHDL codes
- Improved productivity
- Allows design and algorithm space exploration
- Keeps the user fully in control
- We automate only what is very well understood
3FPGA A New HPC Platform?
Dave Strenski, "FPGA Floating Point Performance -- a pencil and paper evaluation," in HPCwire.com
- Comparing a dual core Opteron to FPGA on fp performance
- Opteron: 2.5 GHz, 1 add and 1 mult per cycle. 2.5 x 2 x 2 = 10 Gflops
- FPGAs: Xilinx V4 and V5 with DSP cores
- Balanced allocation of dp fp adders, multipliers and registers
- Use both DSP and logic for multipliers, run at lower speed
- Logic for I/O interfaces
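The Opteron figure above is clock rate times cores times floating-point operations per core per cycle. A minimal sketch of that arithmetic follows; the helper name `peak_gflops` is hypothetical and not taken from the talk.

```c
#include <assert.h>

/* Peak rate in Gflops = clock (GHz) x cores x flops per core per cycle.
   peak_gflops is a hypothetical helper name used for illustration only. */
static double peak_gflops(double clock_ghz, int cores, int flops_per_cycle) {
    assert(clock_ghz > 0.0 && cores > 0 && flops_per_cycle > 0);
    return clock_ghz * cores * flops_per_cycle;
}
```

Calling `peak_gflops(2.5, 2, 2)` reproduces the 10 Gflops quoted for the dual core Opteron (one add and one multiply per cycle on each core).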
4Balanced Designs
- Same number of mults as adds (matrix multiplication).
- Double precision
- Higher percentage of peak on FPGA (streaming)
- 1/3 of the power!
5Challenges
- FPGA is an amorphous mass of logic
- Languages reflect the von Neumann execution model
6ROCCC Overview
[Figure: ROCCC compilation flow]
Languages: C/C++, Java, SystemC
High level transformations: procedure, loop and array optimizations
Hi-CIRRF
Low level transformations: instruction scheduling, pipelining and storage optimizations
Lo-CIRRF
Code generation
CIRRF: Compiler Intermediate Representation for Reconfigurable Fabrics
- Limitations on the code
- No recursion
- No pointers
7Focus
- Extensive compile time optimizations
- Maximize parallelism, speed and throughput
- Minimize area and memory accesses
- Optimizations
- Loop level: fine grained parallelism
- Storage level: compiler configured storage for data reuse
- Circuit level: expression simplification, pipelining
8Execution Model
- A simplified model
- Decoupled memory access from datapath
- Parallel loop iterations
- Pipelined datapath
9So far, working compiler with
- Extensive compiler optimizations and transformations
- Analysis and hardware support for data reuse
- Efficient code generation and pipelining
- Import of existing IP cores
- Support for dynamic partial reconfiguration
10So far, working compiler with
- Extensive compiler optimizations and transformations
Loop, array and procedure transformations. Maximize clock speed and parallelism, within resources. Under user control.
11High Level Transformations
12So far, working compiler with
- Analysis and hardware support for data reuse
Smart buffer technique reduces off chip memory accesses by > 98%
13So far, working compiler with
- Efficient code generation and pipelining
Clock speed comparable to hand written HDL codes
14So far, working compiler with
- Import of existing IP cores
Huge wealth of existing IP cores. Wrapper makes core look like a function call in C code.
15So far, working compiler with
- Support for dynamic partial reconfiguration
DPR allows reconfiguration of a subset of the FPGA, dynamically, under software control. Reduces configuration overhead.
16Simple example
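- 5-tap FIR: B[i] = 3*A[i] + 5*A[i+1] + 7*A[i+2] + 9*A[i+3] + 11*A[i+4]

#define N 516
void begin_hw();
void end_hw();
int main()
{
    int i;
    const int T[5] = {3, 5, 7, 9, 11};   /* filter coefficients */
    int A[N], B[N];
    begin_hw();                           /* start of the region mapped to hardware */
    L1: for (i = 0; i < (N - 5); i = i + 1)
    {
        B[i] = T[0]*A[i] + T[1]*A[i+1] + T[2]*A[i+2] + T[3]*A[i+3] + T[4]*A[i+4];
    }
    end_hw();                             /* end of the region mapped to hardware */
}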
17Lo-CIRRF Viewer
Example: 3-tap FIR unrolled once (two concurrent iterations)
[Figure labels: indices of A; coefficients]
int main()
{
    int i;
    int A[32];
    int B[32];
    for (i = 0; i < 28; i = i + 1)
        B[i] = 3*A[i] + 5*A[i+1] + 7*A[i+2];
}
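To make "unrolled once (two concurrent iterations)" concrete, here is a sketch of the same 3-tap loop written in rolled and unrolled-by-two form in plain C. The function names and the test data are illustrative assumptions; this is what the transformation means at source level, not the compiler's actual output.

```c
#include <assert.h>

#define LEN  32   /* array length, as in the example above */
#define ITER 28   /* loop bound, as in the example above */

/* Rolled form: one output element per iteration. */
static void fir3_rolled(const int A[LEN], int B[LEN]) {
    for (int i = 0; i < ITER; i = i + 1)
        B[i] = 3 * A[i] + 5 * A[i + 1] + 7 * A[i + 2];
}

/* Unrolled once: two independent outputs per iteration, which a
   hardware datapath can compute side by side. */
static void fir3_unrolled(const int A[LEN], int B[LEN]) {
    assert(ITER % 2 == 0);   /* no remainder loop is needed for an even bound */
    for (int i = 0; i < ITER; i = i + 2) {
        B[i]     = 3 * A[i]     + 5 * A[i + 1] + 7 * A[i + 2];
        B[i + 1] = 3 * A[i + 1] + 5 * A[i + 2] + 7 * A[i + 3];
    }
}

/* Returns 1 when both forms produce identical output on sample data. */
static int fir3_forms_agree(void) {
    int A[LEN], B1[LEN] = {0}, B2[LEN] = {0};
    for (int i = 0; i < LEN; i++)
        A[i] = i * i - 7 * i;          /* arbitrary test input */
    fir3_rolled(A, B1);
    fir3_unrolled(A, B2);
    for (int i = 0; i < LEN; i++)
        if (B1[i] != B2[i]) return 0;
    return 1;
}
```

The two statements in the unrolled body share the reads of A[i+1] and A[i+2], which is the kind of data reuse the storage-level optimizations target.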
18RC Platform Models
[Figure: three RC platform models (1, 2, 3) built from CPU, FPGA, memory and fast network blocks]
19Platforms for RC
- SGI Altix 4700
- Shared memory machine, fast interconnect
- 12.8 GB/sec
- Itanium 2, 1.6 GHz
- RASC RC100 Blade: 2 Virtex 4 LX200
- Xtremedata XD1000
- Altera Stratix II drop-in for AMD Opteron
- Integrated interface to Hypertransport
- 16 bits @ 800 M transfers/sec
- Memory interface
- 128 bits DDR-333, up to 4 x 4 GB ECC
- Flash memory
- For FPGA configuration or data
20SGI RASC RC100 Blade
[Figure: block diagram of the RC100 blade. Labels: V4LX200 (x2), TIO (x2), SSP, NL4, SRAM, PCI, Loader, Selmap]
21RC 100 Blade
22Xtremedata XD1000
23XD 1000
24XD 1000 (drop-in)
25Examples
- Molecular dynamics
- Computes the forces exerted by atoms on atoms in a molecule and its environment
- Time step: 1 femtosecond
- Bioinformatics
- Exact string edit distance computation
- Using Smith-Waterman, a dynamic programming algorithm
- Similar: dynamic time warping, motif discovery
26Molecular Dynamics
- Objective
- Determine the shape of a molecule by computing the forces exerted on each atom by all other atoms, in the molecule and its environment.
- N-body problem.
- Forces
- Electrostatic (Coulomb)
- Van der Waals
- Importance
- Computationally intensive
- Months and years of compute time for small problems
- Impact: move biochemistry to digital simulation
- Ultimate goal: protein folding
27Algorithm
For every atom I in system
  for each other atom J in system
    compute the forces exerted by atom J on atom I
  sum all the forces
  compute its next position
Repeat until stable
- Of course, not all forces have meaningful values (1/d^2)
- More complex calculations on the boundaries
- One loop body with 60 variants!
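The doubly nested force loop above can be sketched in C as follows. This is a deliberately reduced illustration, not NAMD's kernel: it keeps only the electrostatic (Coulomb) term, drops physical constants and units, ignores cutoffs and boundary variants, and omits the position update. The type and function names are hypothetical, and a small Newton iteration stands in for the library square root so the sketch has no math-library dependency.

```c
#include <assert.h>

typedef struct { double x, y, z, q; } Atom;   /* position and charge */
typedef struct { double x, y, z; } Vec3;      /* force accumulator   */

/* Newton iteration for the square root (assumption: used only to keep
   the sketch free of libm; real code would call sqrt). */
static double sqrt_newton(double s) {
    assert(s > 0.0);
    double r = 0.5 * (s + 1.0);               /* start at or above sqrt(s) */
    for (int it = 0; it < 60; it++)
        r = 0.5 * (r + s / r);
    return r;
}

/* For every atom i, sum the Coulomb force exerted by every other atom j.
   Force on i from j: q_i * q_j * (r_i - r_j) / |r_i - r_j|^3. */
static void compute_forces(const Atom a[], Vec3 f[], int n) {
    for (int i = 0; i < n; i++) {
        f[i].x = f[i].y = f[i].z = 0.0;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;             /* an atom exerts no force on itself */
            double dx = a[i].x - a[j].x;
            double dy = a[i].y - a[j].y;
            double dz = a[i].z - a[j].z;
            double d2 = dx * dx + dy * dy + dz * dz;
            double s  = a[i].q * a[j].q / (d2 * sqrt_newton(d2));
            f[i].x += s * dx;
            f[i].y += s * dy;
            f[i].z += s * dz;
        }
    }
}
```

The inner loop body is the part that would be replicated and pipelined in hardware; the 1/d^2 falloff is why distant pairs contribute little.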
28Nanoscale Molecular Dynamics
- NAMD
- MD code designed for high-performance simulation of large biomolecular systems
- Double precision floating-point
- Critical loop
- Computes the forces: 82% of execution time
- 60 variants to compute boundary conditions
- Forces computed in X, Y and Z dimensions
- 52 FP operations per loop body
29Characteristics of NAMD
- Required bytes per iteration
- Single precision: 48 bytes
- Double precision: 96 bytes
- RASC: 6.4 GB/s
30NAMD Results
- Itanium
- Ideal: one full EPIC instruction/cycle
- Measured: actual execution time
- FPGA
- Enough bandwidth for single precision
- Double precision: two cycles of data for each iteration
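The bandwidth remarks above can be checked against the per-iteration byte counts and the 6.4 GB/s RASC figure. The sketch below assumes a 100 MHz circuit clock, which is the clock quoted elsewhere in this talk for the RASC but is not stated for the NAMD design; the function name is hypothetical.

```c
#include <assert.h>

/* Cycles needed to stream one iteration's operands, rounded up.
   Bandwidth in MB/s divided by clock in MHz gives bytes per cycle. */
static unsigned cycles_per_iteration(unsigned bytes_per_iter,
                                     unsigned bandwidth_mb_per_s,
                                     unsigned clock_mhz) {
    assert(clock_mhz > 0 && bandwidth_mb_per_s >= clock_mhz);
    unsigned bytes_per_cycle = bandwidth_mb_per_s / clock_mhz;
    return (bytes_per_iter + bytes_per_cycle - 1) / bytes_per_cycle;
}
```

Under that assumption the link delivers 64 bytes per cycle, so 48 bytes (single precision) fit in one cycle while 96 bytes (double precision) need two.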
31Smith Waterman Algorithm
- Dynamic programming string matching algorithm used widely in genetics related research.
- Computes a matching score of two input strings S and T using a 2D matrix.
- Computation of each cell depends on the computed values of three neighboring cells: north, west and northwest.
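The three-neighbor dependence described above can be sketched as a plain software version of the algorithm. The scoring constants (match, mismatch, linear gap penalty) and the maximum string length are assumptions chosen for illustration; the talk does not give the cost matrix it used.

```c
#include <assert.h>
#include <string.h>

#define MAXLEN 64
enum { MATCH = 2, MISMATCH = -1, GAP = 1 };   /* assumed scoring values */

static int max2(int a, int b) { return a > b ? a : b; }

/* Returns the best local alignment score of s and t. */
static int sw_score(const char *s, const char *t) {
    size_t n = strlen(s), m = strlen(t);
    assert(n <= MAXLEN && m <= MAXLEN);
    int H[MAXLEN + 1][MAXLEN + 1];
    memset(H, 0, sizeof H);                   /* row 0 and column 0 stay 0 */
    int best = 0;
    for (size_t i = 1; i <= n; i++) {
        for (size_t j = 1; j <= m; j++) {
            /* northwest neighbor plus the match/mismatch score */
            int diag = H[i - 1][j - 1] + (s[i - 1] == t[j - 1] ? MATCH : MISMATCH);
            int up   = H[i - 1][j] - GAP;     /* north neighbor */
            int left = H[i][j - 1] - GAP;     /* west neighbor  */
            int h = max2(0, max2(diag, max2(up, left)));
            H[i][j] = h;
            best = max2(best, h);
        }
    }
    return best;
}
```

Because each cell needs only its north, west and northwest neighbors, all cells on one anti-diagonal are independent of each other, which is what makes a wave-front hardware mapping possible.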
44Smith-Waterman Code
- Dynamic Programming
- Used in protein modeling, bio-informatics, data mining
- A wave-front algorithm with two input strings
- A[i,j] = F(A[i,j-1], A[i-1,j-1], A[i-1,j])
- F = CostMatrix(A[i,0], A[0,j])
- Our Approach
- Chunk the input strings in fixed sizes k
- Build a k x k template hardware by compiling two nested loops (k each) and fully unrolling both.
- Host strip mines the two outer loops over this template.
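The chunking approach above can be mimicked in software: an inner routine fills one k x k tile (standing in for the fully unrolled hardware template) and the host loops walk the tiles. This is a sketch under assumptions: the chunk size, the scoring constants and all names are hypothetical, and the tile is an ordinary loop nest rather than unrolled logic.

```c
#include <assert.h>
#include <string.h>

#define MAXLEN 64
#define K 4                                   /* chunk size k (assumed) */
enum { MATCH = 2, MISMATCH = -1, GAP = 1 };   /* assumed scoring values */

static int H[MAXLEN + 1][MAXLEN + 1];         /* score matrix; row/column 0 are borders */

static int max2(int a, int b) { return a > b ? a : b; }

/* Software stand-in for the k x k template: fills the tile whose
   top-left cell is (i0, j0) and returns the best score seen in it. */
static int sw_tile(const char *s, const char *t, size_t n, size_t m,
                   size_t i0, size_t j0) {
    int best = 0;
    for (size_t i = i0; i < i0 + K && i <= n; i++) {
        for (size_t j = j0; j < j0 + K && j <= m; j++) {
            int diag = H[i - 1][j - 1] + (s[i - 1] == t[j - 1] ? MATCH : MISMATCH);
            int h = max2(0, max2(diag, max2(H[i - 1][j], H[i][j - 1]) - GAP));
            H[i][j] = h;
            best = max2(best, h);
        }
    }
    return best;
}

/* Host side: strip-mine both strings into chunks of K and visit tiles in
   row-major order, so the tiles to the north and west are already done. */
static int sw_tiled(const char *s, const char *t) {
    size_t n = strlen(s), m = strlen(t);
    assert(n <= MAXLEN && m <= MAXLEN);
    memset(H, 0, sizeof H);
    int best = 0;
    for (size_t i0 = 1; i0 <= n; i0 += K)
        for (size_t j0 = 1; j0 <= m; j0 += K)
            best = max2(best, sw_tile(s, t, n, m, i0, j0));
    return best;
}
```

Only the boundary row and column of each finished tile are needed by later tiles, so a hardware version would pass those edges between invocations instead of keeping the whole matrix.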
45S-W View
46After (many) Transformations
- Transformations
- Loop unrolling
- Scalar replacement
- Feedback store elimination
- > 70 passes
47Systolic execution
48SW Performance
49SW Potential on the RASC
- 100 MHz clock
- 3 cores of 2K cells each, 72% of FPGA area
- 3 x 2K x 100 MHz = 600 Gcups
- Speedup over Itanium: 7140
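The 600 Gcups estimate above is cores times cells per core times clock rate. A short sketch of that arithmetic, assuming one cell update per cell per cycle and reading "2K" as 2000 (the reading that yields 600); the function name is hypothetical.

```c
#include <assert.h>

/* Giga cell updates per second, assuming every cell completes one
   update per clock cycle (an assumption of this sketch). */
static double sw_gcups(int cores, int cells_per_core, double clock_mhz) {
    assert(cores > 0 && cells_per_core > 0 && clock_mhz > 0.0);
    return cores * (double)cells_per_core * clock_mhz / 1000.0;
}
```

With 3 cores, 2000 cells each and a 100 MHz clock this gives 600 Gcups.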
50Productivity Speedup
A ratio of 1,000 Productivity speedup
10x to 100x
51Impact?
Performance time
Programmability time
Programmability Performance 2
Commoditization of HPC will take science and
engineering to a new revolution
52Conclusion
- FPGAs: a viable platform for supercomputing
- Including single and double precision fp
- Main challenge is their programmability
- ROCCC shown as bridging the gap between
- HLL program representation, and
- Circuit instantiation
- A new paradigm: deskside supercomputing
53- Thank you
- www.cs.ucr.edu/roccc
54Intrusion Detection Bloom filter
- Bloom filter
- Is a data structure used to test set membership of an element
- Has an array of N elements, all of which are set to 0 initially
- Members of the set are inserted in the filter using multiple hash functions; each returns a unique value in the range of 0 to N-1.
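A minimal software sketch of the structure described above, including the membership test for completeness. The filter size, the number of hash functions and the seeded FNV-1a style hash are assumptions made for illustration; they are not the hash functions used in the talk's hardware design.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define N_BITS 1024u   /* filter size N (assumed) */
#define K_HASH 4u      /* number of hash functions (assumed) */

typedef struct { uint8_t bits[N_BITS / 8]; } Bloom;

/* Hypothetical hash family: FNV-1a style mixing with a per-function seed. */
static uint32_t bloom_hash(const char *key, uint32_t seed) {
    uint32_t h = 2166136261u ^ (seed * 0x9E3779B1u);
    for (const unsigned char *p = (const unsigned char *)key; *p; p++) {
        h ^= *p;
        h *= 16777619u;
    }
    return h % N_BITS;                         /* index in 0 .. N-1 */
}

/* All N entries start at 0. */
static void bloom_init(Bloom *b) { memset(b->bits, 0, sizeof b->bits); }

/* Insertion sets the bit selected by each hash function. */
static void bloom_insert(Bloom *b, const char *key) {
    for (uint32_t k = 0; k < K_HASH; k++) {
        uint32_t i = bloom_hash(key, k);
        b->bits[i / 8] |= (uint8_t)(1u << (i % 8));
    }
}

/* Returns 1 if key may be in the set, 0 if it is definitely not. */
static int bloom_query(const Bloom *b, const char *key) {
    for (uint32_t k = 0; k < K_HASH; k++) {
        uint32_t i = bloom_hash(key, k);
        if (!(b->bits[i / 8] & (1u << (i % 8))))
            return 0;
    }
    return 1;
}
```

An inserted key always queries as present; a key that was never inserted usually queries as absent, but can collide on all positions and produce a false positive.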
55Search operation in a bloom filter
- During a search operation, multiple hash functions are applied to an incoming value.
- If all the locations returned by the hash functions contain 1, then the element belongs to the set with a probability P.
- Probability of a false positive is a function of:
- k: number of hash functions
- m: number of bits in the Bloom filter array
- n: number of elements inserted into the Bloom filter
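For reference, the standard textbook estimate of the false positive probability in terms of the same k, m and n is given below. It assumes the hash functions are independent and uniformly distributed; whether the talk used the exact or the approximate form is not known.

```latex
P_{\text{false positive}}
  = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k}
  \approx \left(1 - e^{-kn/m}\right)^{k}
```

Here $(1 - 1/m)^{kn}$ is the probability that a given bit is still 0 after n insertions with k hash functions each, and a false positive requires all k probed bits to be 1.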
56Bloom filter for virus detection
- Signature Processing Engine (SPE) contains the generated Bloom filter code
- Bloom filter output contains false positives. Hence a RAM is used for exact string comparison to eliminate false positives.
Legend: SPE = Signature Processing Engine, FPE = False Positive Eliminator
57Virus signatures
- The virus rules are taken from the bleeding-snort database.
- Each rule consists of a rule header and an option.
- Header contains information to be used in packet classification.
- Rule option contains the signatures to be used in intrusion detection.
- Most of the signatures in the bleeding-snort database were under 32 bytes.
58Bloom Filter C Code
/* Operators reconstructed as XOR and multiply-by-bit; distinct hash
   tables per result location are also an assumption. */
for (i = 0; i < 248; i++) {
  for (j = 0; j < 7; j++) {
    value = input_stream[i][j];
    for (k = 0; k < 7; k++) {
      temp = value & 0x1;   /* current low-order bit of the input byte */
      result_location1 = result_location1 ^ (hash_function1[k] * temp);
      result_location2 = result_location2 ^ (hash_function2[k] * temp);
      result_location3 = result_location3 ^ (hash_function3[k] * temp);
      result_location4 = result_location4 ^ (hash_function4[k] * temp);
      value = value >> 1;
    }
    found = bit_array[result_location1] & bit_array[result_location2]
          & bit_array[result_location3] & bit_array[result_location4];
  }
}
Callouts: compile time constant, folded in data-path; table lookup.
59Datapath Analysis
- Compiler exploits ILP by grouping instructions into different execution levels.
- Each level corresponds to a loop iteration and the instructions are executed simultaneously.
- ROCCC automatically places latches for pipelining. Each latched level corresponds to one pipeline stage and has a delay of one cycle.
- In the 3-stage pipeline, each box of XOR corresponds to one byte of input being XORed with a hashing function.
60Throughput evaluation
- Clock frequency of the synthesized circuit is 73 MHz.
- The BRAM on our target FPGA can process 32 bytes per cycle.
- Throughput = bits per cycle x clock frequency
- 32 x 8 x 73 x 1,000,000 bits/sec
- 18.6 Gbps
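The throughput figure is bytes per cycle times 8 bits times the clock frequency. A small sketch of that calculation, working in Mbit/s to stay in integer arithmetic; the function name is hypothetical.

```c
#include <assert.h>

/* Throughput in Mbit/s = bytes per cycle x 8 bits x clock in MHz. */
static unsigned long throughput_mbps(unsigned bytes_per_cycle, unsigned clock_mhz) {
    assert(bytes_per_cycle > 0 && clock_mhz > 0);
    return (unsigned long)bytes_per_cycle * 8ul * clock_mhz;
}
```

For 32 bytes per cycle at 73 MHz this yields 18688 Mbit/s, in line with the quoted figure.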