Slide 1: Reconfigurable Computing (RC) Group
- Reconfigurable Architectures, Networks, and Services for COTS-based Cluster Computing Systems
- Appendix for Q3 Status Report
- DOD Project MDA904-03-R-0507
- February 9, 2004
Slide 2: Outline
- RC Group Overview
- Motivation
- CARMA Framework Overview
- CARMA Framework Updates
- Applications
- Benchmarking
- Application Mappers
- Job Scheduler
- Configuration Manager
- Resource Monitoring Service
- Conclusions
- Future Work
Slide 3: RC Group Overview
- Group Members (Fall 2003)
- Vikas Aggarwal, M.S. student in ECE
- Ramya Chandrasekaran, M.S. student in ECE
- Gall Gotfried, B.S. student in CISE
- Aju Jacob, M.S. student in ECE
- Matt Radlinski, Ph.D. student in ECE
- Ian Troxel, Ph.D. student in ECE, group leader
- Girish Venkatasubramanian, M.S. student in ECE
- Industry Collaborators
- Honeywell
- Xilinx (hardware and tools)
- Celoxica / Alpha Data / Tarari (boards and tools)
- Starbridge Systems (pending)
- Silicon Graphics (pending)
- Numerous other sponsors for cluster resources (Intel, AMD, etc.)
Slide 4: Motivation
- Key missing pieces in RC clusters for HPC
- Dynamic RC fabric discovery and management
- Coherent multitasking, multi-user environment
- Robust job scheduling and management
- Fault tolerance and scalability
- Performance monitoring down into the RC fabric
- Automated application mapping into management tool
- The HCS Lab's proposed Cluster-based Approach to Reconfigurable Management Architecture (CARMA) attempts to unify existing technologies as well as fill in missing pieces
Slide 5: CARMA Framework Overview
- CARMA seeks to integrate
- Graphical user interface
- Applications and Benchmarking (New)
- COTS application mapper
- Handel-C, Viva, CoreFire, etc.
- Graph-based job description
- Condensed Graphs, DAGMan, etc.
- Robust management tool
- Distributed, scalable job scheduling
- Checkpointing, rollback and recovery
- Distributed configuration management
- Multilevel monitoring service (GEMS)
- Clusters, networks, hosts, RC fabric
- Monitoring down into RC Fabric
- Bypass Middleware API (for future work)
- Multiple types of RC boards
- Multiple high-speed networks
- SCI, Myrinet, GigE, InfiniBand, etc.
Note: Substituting RC/HPC benchmarking for the middleware API task
Slide 6: Applications
- Test applications developed
- Block ciphers
- DES, Blowfish
- Floating-point FFT
- Sonar Beamforming
- Hyperspectral Imaging (c/o LANL)
- Future development
- Additional cryptanalysis applications
- RC4, RSA, Diffie-Hellman, Serpent, Elliptic Key systems
- RC/HPC benchmarks (c/o Honeywell TC and UCSD)
- Cryptanalysis benchmarks (c/o DoD)
- Other benchmarking algorithm possibilities
- N Queens, Monte Carlo Pi generator, many others
considered (see slide 8)
Slide 7: RC Benchmarking
- Parallel RC attributes
- RC speedup vs. software
- Parallel efficiency
- Parallel speedup
- Job throughput
- Isoefficiency
- Communication bandwidth
- Machine and RC overhead
- Reconfigurability
- Parallelism
- Versatility
- Capacity
- Time sensitivity
- Scalability
- RC metrics
- Area used (slices, LUTs)
- Execution time
- Speedup per FPGA
- Total configuration time
- Communication latency
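As a working reference for these attributes, the standard definitions assumed here (following the texts cited below, with T_1 the best sequential time and T_p the time on p processors or FPGAs) are:

    Parallel speedup:        S(p) = T_1 / T_p
    Parallel efficiency:     E(p) = S(p) / p
    RC speedup vs. software: S_RC = T_software / T_RC
    Isoefficiency:           the rate at which problem size must grow with p to hold E(p) constant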
- Selected References
- Parallel Computer Architecture: A Hardware/Software Approach, D. Culler and J. P. Singh
- Parallel Computing: Performance Metrics and Models, S. Sahni and V. Thanvantri
- Analyzing Scalability of Parallel Algorithms and Architectures, V. Kumar and A. Gupta
- A Benchmark Suite for Evaluating Configurable Computing Systems: Status, Reflections, and Future Directions, Honeywell Technology Center (FPGA 2000)
- The RAW Benchmark Suite: Computation Structures for General Purpose Computing, MIT Laboratory for Computer Science (FCCM '97)
Slide 8: RC Benchmarking
- Algorithms under consideration
- Binary heap
- DES
- FFT, DCT
- Game of Life
- Boolean satisfiability
- Matrix multiply
- N queens
- CORDIC algorithms
- Huffman encoding
- Jacobi relaxation
- Hanoi algorithms
- Permutation generator
- Monte Carlo Pi generator
- Bubble, quick and merge sort
- Wireless comm. algorithms
- Sieve of Eratosthenes prime number generator
- Search over discrete matrices
- Wavelet-based image compression
- Differential PCM
- Adaptive Viterbi Algorithm
- RC5, DES, Serpent, Blowfish key crack
- RSA, Diffie-Hellman
- Elliptic key cryptography
- Graph problems (SSP,SPM,TC)
- Micro benchmarks to be created as needed
- Benchmark suites
- NAS parallel
- Pallas
- SPEC HPC
- SparseBench
- PARKBENCH
- DoD crypto. emulation
Slide 9: Applications (Blowfish)
- Parallelization
- Blowfish B optimized for parallel network traffic encryption where the key remains fixed
- Blowfish C optimized for parallel cryptanalysis where the key changes rapidly (see the host-side sketch after the figure)
[Figure: Blowfish FU architectures. Blowfish A (single instance with control, S-boxes, P-arrays, init logic, F function, 2 iterations), Blowfish B (network packet processing optimized; multiple crypto units sharing S-boxes, P-arrays, and init logic), and Blowfish C (cryptanalysis optimized; per-unit S-boxes, P-arrays, and init logic). Based on a Virtex 1000E at 25 MHz; shading denotes shared resources.]
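A minimal host-side C sketch of the two parallelization styles, assuming hypothetical driver calls blowfish_set_key() and blowfish_encrypt_block() that stand in for the FU interface (not the actual CARMA/board API):

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* Illustrative host-side view only; the real work is done by the FUs. */
    void blowfish_set_key(const uint8_t *key, size_t len);  /* builds P-arrays/S-boxes */
    void blowfish_encrypt_block(const uint8_t in[8], uint8_t out[8]);

    /* Style B (network traffic): the key is fixed, so the P-array/S-box
     * initialization is done once and every block reuses it. */
    void encrypt_stream(const uint8_t *key, size_t keylen,
                        const uint8_t *in, uint8_t *out, size_t nblocks)
    {
        blowfish_set_key(key, keylen);            /* one-time init */
        for (size_t i = 0; i < nblocks; i++)
            blowfish_encrypt_block(in + 8*i, out + 8*i);
    }

    /* Style C (cryptanalysis): the key changes every trial, so each
     * candidate pays the full re-initialization cost; independent
     * candidates can be spread across separate crypto units. */
    int search_keys(const uint8_t plain[8], const uint8_t cipher[8],
                    const uint8_t (*candidates)[8], size_t ncand)
    {
        uint8_t out[8];
        for (size_t i = 0; i < ncand; i++) {
            blowfish_set_key(candidates[i], 8);   /* re-init per key */
            blowfish_encrypt_block(plain, out);
            if (memcmp(out, cipher, 8) == 0)
                return (int)i;                    /* matching key index */
        }
        return -1;
    }

The design point is that style B amortizes the costly P-array/S-box initialization over many blocks, while style C pays it on every candidate key but keeps candidates independent so they parallelize cleanly across crypto units.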
Slide 10: Applications
[Figure: remote FU access path - local CPU(s), memory, PCI bridge, and NIC on the local node, connected over the network to remote CPU(s) and remote RC boards.]
- Remote access to Functional Units (FUs)
- Remote processes and FUs access local FUs
- Potential for FUs to perform autonomous communication
- User's ID sets an access level for enhanced security (see the addressing sketch below)
- Authentication and encryption could be included
- Q3 Accomplishments
- Seven Blowfish B FUs addressable from the local processor
- Able to decrypt input data and encrypt output data within the FPGA for secure comm. between FPGAs and hosts (ex. FFT)
- Working to provide remote access and autonomous comm.
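A sketch of what an FU addressing header with user-based access control might look like; the structure and field names below are illustrative assumptions, not the implemented format:

    #include <stdint.h>

    /* Hypothetical on-the-wire header for addressing a functional unit (FU)
     * on a local or remote RC board; all field names are illustrative. */
    struct fu_request {
        uint16_t node_id;      /* cluster node hosting the RC board          */
        uint8_t  board_id;     /* board within the node (e.g. Tarari x4)     */
        uint8_t  fu_id;        /* FU within the FPGA (e.g. Blowfish B 0-6)   */
        uint32_t user_id;      /* requesting user; maps to an access level   */
        uint8_t  access_level; /* derived from user_id for enhanced security */
        uint8_t  opcode;       /* e.g. CONFIGURE, WRITE_DATA, READ_RESULT    */
        uint16_t length;       /* payload length in bytes                    */
    };

    /* A request is honored only if the user's access level meets the level
     * required by the addressed FU (assumed policy, not the implemented one). */
    int fu_access_allowed(const struct fu_request *req, uint8_t required_level)
    {
        return req->access_level >= required_level;
    }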
Slide 11: Application Mapper
- Evaluating three application mappers on the basis of:
- Ease of use, performance, hardware device independence, programming model, parallelization support, resource targeting, network support, stand-alone mapping
- Celoxica - SDK (Handel-C)
- Provides access to in-house boards
- ADM-XRC (x1) and Tarari (x4)
- StarBridge Systems - Viva
- Provides best option for hardware independence
- Annapolis Micro Systems - CoreFire
- Provides access to the AFRL-IFTC 48-node cluster
- Xilinx - ISE compulsory, evaluating JBits for partial RTR
[Photos: Tarari and ADM-XRC boards]
Slide 12: Application Mapper
- QFD Comparison (V1)
- Compared mappers in various categories
- No clear winner among application mappers
- Mapping efficiency will be examined next
- JBits
- Allows for flexibility
- Potential for splicing partial configurations
- Users can potentially create hazardous designs!
- Xilinx is unlikely to support JBits in the future
Slide 13: Job Scheduler (JS)
- Prototyping effort underway (forecasting)
- Completed first version of JS (coded Q2 but still under test)
- Single node
- Task-based execution using Directed Acyclic Graphs (DAGs)
- Separate processes and message queues for fault tolerance
- Second version of JS (Q4 completion)
- Multi-node
- Distributed job migration
- Checkpoint and rollback
- Links to Configuration Manager and GEMS
- External extensions to traditional tools (interoperability)
- Expand upon GWU/GMU work (Dr. El-Ghazawi's group)
- Code and algorithms reviewed but LSF required (now trying to acquire)
- Other COTS job schedulers under consideration
- Striving for a plug-and-play approach to JS within CARMA
[Figure c/o GWU/GMU]
Slide 14: Job Scheduler
DAG-based execution
- Q3 Accomplishments
- Rewritten JS in C
- Stable, secure, easily extendable
- Minimal overhead penalty
- Network enabled
- Uses RPC for local and network comm.
- Minimal overhead
- Standard interface
- Under development
- Enable dynamic job scheduling
- Fault tolerance (checkpointing, rollback)
- Evaluate interface with commercial job schedulers
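A minimal, self-contained C sketch of DAG-based task dispatch of the kind described above: each task tracks its unfinished predecessors and is released when that count reaches zero. In the actual JS this dispatch would go through the separate processes, message queues, and RPC interface; the data structures here are illustrative only.

    #include <stdio.h>

    #define MAX_TASKS 8

    struct task {
        const char *name;
        int unmet_deps;            /* predecessors not yet completed */
        int nsucc;                 /* number of successor tasks      */
        int succ[MAX_TASKS];       /* indices of successor tasks     */
        int done;
    };

    /* Dispatch every task whose dependencies are met; completing a task
     * decrements the counters of its successors (in the real JS this would
     * be handed to worker processes via message queues / RPC). */
    static void run_ready_tasks(struct task *t, int n)
    {
        int progress = 1;
        while (progress) {
            progress = 0;
            for (int i = 0; i < n; i++) {
                if (!t[i].done && t[i].unmet_deps == 0) {
                    printf("executing %s\n", t[i].name);  /* stand-in for RPC */
                    t[i].done = 1;
                    for (int s = 0; s < t[i].nsucc; s++)
                        t[t[i].succ[s]].unmet_deps--;
                    progress = 1;
                }
            }
        }
    }

    int main(void)
    {
        /* A -> B, A -> C, B & C -> D */
        struct task t[4] = {
            {"A", 0, 2, {1, 2}, 0},
            {"B", 1, 1, {3},    0},
            {"C", 1, 1, {3},    0},
            {"D", 2, 0, {0},    0},
        };
        run_ready_tasks(t, 4);
        return 0;
    }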
Slide 15: Configuration Manager (CM)
[Figure: CM block diagram - CMUI and proxy/stub front-end, decision maker, message queue, compiler, config file and network node registries, and Com modules linking the local CM to remote-node CMs over the networks; handles file location, transport, and loading of configurations to the RC fabric / boards.]
- Configuration Manager (CM)
- Application interface to the RC board
- Handles configuration caching, distribution and loading (see the cache sketch below)
- CM User Interface (CMUI)
- Allows user input to configure the CM
- Communication Module (Com)
- Used to transfer configuration files between CMs via TCP/Ethernet or SCI
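A hedged sketch of the cache-then-fetch logic implied above. The cache path and the helpers cm_fetch_remote() (standing in for a Com transfer over TCP/Ethernet or SCI) and rc_load_bitfile() (standing in for the board API) are assumptions for illustration:

    #include <stdio.h>

    /* Hypothetical helpers; not the actual CARMA calls. */
    int cm_fetch_remote(const char *name, const char *local_path); /* via Com   */
    int rc_load_bitfile(int board, const char *local_path);        /* board API */

    int cm_configure(int board, const char *name)
    {
        char path[256];
        snprintf(path, sizeof path, "/var/carma/cache/%s.bit", name); /* assumed cache dir */

        FILE *f = fopen(path, "rb");
        if (f) {                                  /* cache hit: load immediately */
            fclose(f);
        } else if (cm_fetch_remote(name, path) != 0) {
            return -1;                            /* remote CM could not supply it */
        }
        return rc_load_bitfile(board, path);
    }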
Slide 16: Management Schemes
MW and CS already built; CB and PP to be built in Q4
[Figure: four management schemes - Master-Worker (MW, jobs submitted centrally), Client-Server (CS, server houses configurations), Client-Broker (CB, server brokers configurations), and Peer-to-Peer (PP). Each maintains a global view of the system at all times; applications and mappers (APP, APP MAP) interact with global scheduling and management (GJS, GRMAN) and local resource monitors (LRMON) on each local system, exchanging tasks/states and results/statistics over the network.]
A multilevel approach is anticipated for large numbers of nodes, with different schemes at different levels
Slide 17: Configuration Manager
Five nodes in all (four workers, one master or server)
- Preliminary stress-testing measurements of the CM software infrastructure, excluding execution time on FPGAs
- Major component of completion time is CM queue latency (over 70% on average)
- CM queue latency is directly dependent on contention for configuration files
- MW and CS performance degrades above 5 configuration requests per second
- MW passes out tasks (i.e., requests) in round-robin fashion, so nodes have similar completion times
- CS produces a first-come-first-served order as each node fights for the server, and closer nodes receive preference (due to SCI rings), so there is variance in completion times across nodes
- CS provides better average completion latency while MW provides less variance between nodes (see the round-robin sketch below)
Note: config. file transfers via SCI
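For illustration, the MW ordering behavior described above amounts to a simple round-robin hand-out of requests to workers (sketch only; the deployed CM uses its own queues and messaging):

    #include <stdio.h>

    /* Round-robin assignment of configuration requests to worker nodes. */
    static int next_worker(int *rr, int nworkers)
    {
        int w = *rr;
        *rr = (*rr + 1) % nworkers;
        return w;
    }

    int main(void)
    {
        int rr = 0;
        for (int req = 0; req < 8; req++)        /* 8 requests, 4 workers */
            printf("request %d -> worker %d\n", req, next_worker(&rr, 4));
        return 0;
    }

This is why MW evens out completion times across nodes, whereas CS serves whichever node reaches the server first.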
Slide 18: Monitoring Service Options
- Custom agent per Functional Unit (FU)
- Provides customized information per FU
- Heavily burdens user (unless automated)
- Requires additional FPGA area
- Centralized agent per FPGA
- Possibly reduces area overhead
- Reduces data storage and communication
- Limits scalability
- Information storage and response
- Store on chip or on board
- Periodically send or respond when queried
- Key parameters to monitor - further study
- Custom parameters per algorithm
- Requires all-encompassing interface
- Automation needed for usability
- Monitoring is overhead, so use sparingly!
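One way a centralized per-FPGA agent could expose its data is a fixed status block that the host reads on demand (pull model). The layout and the monitor_query() call below are assumptions sketched for discussion, not a defined interface:

    #include <stdint.h>

    /* Hypothetical status entry kept by a centralized agent, one per FU. */
    struct fu_status {
        uint16_t fu_id;
        uint16_t busy;           /* 1 while the FU is processing             */
        uint32_t jobs_done;      /* completed operations since configuration */
        uint32_t cycles_active;  /* utilization counter                      */
    };

    struct fpga_status {
        uint32_t         config_id;   /* which bitstream is loaded     */
        uint16_t         num_fus;
        struct fu_status fu[16];      /* bounded to limit on-chip area */
    };

    /* Pull model: the host queries the agent only when the monitoring
     * service asks for an update, keeping overhead out of the datapath. */
    int monitor_query(int board, struct fpga_status *out); /* assumed driver call */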
Slide 19: Monitoring Service Parameters
- Many parameters to measure; will start with a subset
- Various security levels for each parameter type
GEMS is the gossip-enabled monitoring service developed by the HCS Lab for robust, scalable, multilevel monitoring of resource health and performance; for more info, see http://www.hcs.ufl.edu/gems
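For context, a gossip-style monitoring exchange can be sketched as each node periodically merging heartbeat information with a randomly chosen peer; the single-process C simulation below illustrates the idea only and is not the actual GEMS protocol or data layout:

    #include <stdlib.h>

    #define NODES 8

    /* heartbeat[i][n] is node i's current view of node n's heartbeat counter.
     * Shared table used here only to simulate the exchange in one process. */
    void gossip_round(unsigned heartbeat[NODES][NODES], int self)
    {
        int peer = rand() % NODES;
        if (peer == self) return;

        heartbeat[self][self]++;                  /* advance own heartbeat  */
        for (int n = 0; n < NODES; n++) {         /* keep the fresher view  */
            if (heartbeat[peer][n] > heartbeat[self][n])
                heartbeat[self][n] = heartbeat[peer][n];
        }
    }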
Slide 20: Results Summary
- Q1
- Several algorithms developed as test cases
- DES, Blowfish, RSA, Sonar Beamforming, Hyperspectral Imaging
- Prototyping of initial mechanisms for CARMA framework
- ExMan, ConfigMan, TaskMan, and simple JS over ADM-XRC API
- Evaluation of cluster management tools
- Many commercial tools identified and evaluated (ref GWU/GMU)
- Q2
- Parallelization options identified and under development
- Two Blowfish flavors with multi-board per node, multi-instance per FPGA
- Application mapper evaluation underway
- Handel-C, Viva, CoreFire, JBits
- Further prototyping and test of CARMA framework
- ConfigMan over TCP and SCI with MW scheme
Slide 21: Results Summary
- Q3
- FU remote access
- Addressing scheme designed, developed and tested
- Multiple Blowfish and Floating-Point FFT modules operable
- Additional algorithms under development for RC benchmarking
- N Queens, Serpent, Elliptic Curve Crypto., Monte Carlo Pi generator, etc.
- First phase of application mapper evaluation concluded
- No single winner among Handel-C, Viva, CoreFire, JBits
- Began to study mapper optimization / inefficiency issues
- Further prototyping of mechanisms for CARMA framework
- ConfigMan (MW and CS) and JS over ADM-XRC and Tarari API
- Hardware monitoring scheme designed
- Monitoring options, parameters and interfaces identified
Slide 22: Conclusions
- Broad coverage of RC design space
- With focus on COTS-based RC clusters for HPC
- Builds on lab strength in cluster computing and communications
- Key missing pieces in RC cluster design identified
- Initial framework for CARMA developed
- Design options being refined
- Prototyping of preliminary mechanisms underway
- Several test applications for RC developed
- Parallelization options for RC under development
- Collaboration with other RC groups
- Developing collaboration with key groups in academia
- Pursuing and hopeful of significant industry collaboration
Slide 23: CARMA Future Work (for Q4)
- Continue development and evaluation of CARMA
- Applications and Benchmarking
- Refine attribute, benchmark and metric definitions
- Map algorithms as appropriate
- Develop initial benchmark suite for HPC/RC
- Algorithm Mappers
- Determine efficiency (or inefficiency) of the three mappers under test vs. hand-coded VHDL
- Job Scheduler (JS)
- Enable dynamic job scheduling
- Build in fault tolerance (checkpointing, rollback)
- Provide for distributed job submission and scheduling
- Integrate with the CM
- Configuration Manager (CM)
- Scale MW and CS up to 32 nodes
- Provide a multiple-server CS to reduce completion latency variability
- Finish coding CB and PP
- Add support for additional boards as they become available
- Hardware Monitoring
- Further develop remote access to FPGA functional units / processing elements
Slide 24: Future Work (beyond Q4)
- Continue development and evaluation of CARMA
- Expanded features
- Support for additional boards, networks, etc.
- Functionality and performance optimization
- Extend early work in RC cluster simulation
- Extend previous analytical modeling work (architecture and software)
- Leverage modeling and simulation tools under development by the MS group @ HCS
- Forecast architecture limitations
- Forecast software/management limitations
- Determine key design tradeoffs
- Investigate network-attached RC resources
- Currently procuring evaluation boards for our donated FPGAs
- Could provide in-network content processing or pre/post processing
- Develop interfaces for network-attached RC devices
- Develop cores for high-performance networks (e.g. SCI, Myrinet, InfiniBand)
- The Virtex-II Pro X shows a trend toward merging high-performance networking and RC
- RC extensions to UPC, SHMEM, etc.
- Investigate/develop HPC/RC system programming model
- Consider additional RC hardware security challenges (e.g. FPGA viruses)