Title: SciDac2 Kickoff Meeting
1 SciDac2 Kickoff Meeting
- Karen A. Tomko
- Electrical and Computer Engineering Department
- email Karen.Tomko_at_uc.edu
2 Outline of talk
- Other Application areas
- Crash Worthiness
- Computational Electromagnetics
- Computational Fluid Dynamics
- Research Interests
- Application Performance Challenges
- Reconfigurable High Performance Computing
3 Crash Worthiness
- with S. Abraham, E.S. Davidson, Q. Stout on a Ford Motor Co. sponsored project
- Finite Element Method
- Newtonian physics, deformation models for crumpling of car body
- 100,000 lines of Fortran 77
1996 Ford Taurus Model http://crash.ncac.gwu.edu/pradeep/Models.html
- Parallelization and performance enhancement for shared memory and distributed memory systems
- Weighted and multi-constraint domain decomposition using graph partitioning algorithms
4 Computational Electromagnetics
- with L. Katehi, C. Sarris et al.
- Time-accurate wireless communication simulation (the transients are what is of interest)
- Solution of Maxwell's equations, based on Yee's FDTD approach
- Adaptive multi-resolution using the Haar-wavelet transform
- C with MPI, dynamic domain decomposition using Zoltan (K. Devine et al.)
- one level of a multi-resolution modeling problem
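As an illustration of the Haar-wavelet idea mentioned on this slide, here is a minimal sketch of one level of the transform in its simple averaging form. It is not the project's C/MPI code; the function names and the refinement threshold `tol` are hypothetical, introduced only for this example.

```python
def haar_step(signal):
    """One Haar level (averaging form): pairwise averages and half-differences."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    approx = [(signal[k] + signal[k + 1]) / 2 for k in range(0, len(signal), 2)]
    detail = [(signal[k] - signal[k + 1]) / 2 for k in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Rebuild the fine-level samples from one level of coefficients."""
    signal = []
    for a, d in zip(approx, detail):
        signal.extend([a + d, a - d])
    return signal


def cells_to_refine(detail, tol):
    """Hypothetical adaptivity rule: keep fine resolution only where detail is large."""
    return [k for k, d in enumerate(detail) if abs(d) > tol]


approx, detail = haar_step([4.0, 2.0, 5.0, 5.0])
print(approx, detail, cells_to_refine(detail, tol=0.5))
```

Small detail coefficients mark regions where the coarse level alone represents the field well, which is the basis for resolving only part of the domain at the fine level.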
5 COSITE INTERFERENCE IN A VEHICULAR TRANSCEIVER NETWORK WITHIN A FOREST ENVIRONMENT: A HYBRID FDTD/MOM APPROACH
- Problem Statement
- The in-forest communication between multi-antenna mobile transmit-receive units is considered.
Issues to address
- Forest propagation and multi-path (FDTD modeling requires enormous resources).
- Effect of arbitrary platform (MoM requires an extremely complex Green's function).
- Operation of transceiver electronics under cosite interference conditions (MoM is incompatible with SPICE-type solvers such as TRANSIM).
humvee.net
- Modeling Approach
- Use the Method of Moments [Sarabandi and Koh, IEEE AP-49, Feb. 2001] to model wave propagation through the forest.
- Enclose the vehicular transceivers in an FDTD mesh to model rigorously the effect of the platform and the transceiver architecture, as in [Sarris et al., Proc. 2001 IEEE AP-S].
Joint CEN-5/FCS work. Contributors:
CEN-5: C. D. Sarris, W. Thiel, L. P. Katehi
FCS: I.-S. Koh, K. Sarabandi
6 Computational Fluid Dynamics
- with D. Rizzetta, P. Morgan, M. Visbal and also with A. Hamed, D. Basu, Q. Liu
- Time accurate solution of the Navier-Stokes equations, overset (Chimera) grids
- Unsteady and turbulent fluid flow and acoustics modeling
- Fortran 77, coarse level parallelization with MPI
- Memory and cache analysis
- Variety of numerical models implemented and compared
- New effort: turbomachinery modeling with M. Turner
- multi-scale, unsteady, discontinuities, multi-domain modeling, periodicity and symmetry
7 Hybrid Turbulence Models
[Figure: cavity mid-span axial vorticity contours; panels: Baseline Grid, Fine Grid]
8 FPGA-based Reconfigurable High Performance Computing
- Field-programmable Gate Arrays (FPGA)
- Programmable digital logic
- Manufacturers: Xilinx, Altera, others
- Trends that make FPGA especially appealing
- Computational capacity of FPGA has been scaling faster than CPU
- Current generation chips are able to support large numbers of floating point units
[Figure labels: FPGA, Software, Hardware]
9 Programming the FPGA
- FPGAs are programmed, or configured, with a sequence of bits containing the contents of the LUTs and the control bits that determine the connections between LUTs, flip-flops, Block RAM, etc.
- This sequence of bits is referred to as the configuration or bit file.
- Programmed/reprogrammed on-the-fly in microseconds.
10 Xilinx Virtex-II Architecture
Figure from T. El-Ghazawi, K. Gaj, and D. Pointer, Reconfigurable Supercomputing Systems, tutorial, RSSI 05
11 Configurable Logic Block (CLB) of Xilinx Virtex™ 2.5 V FPGA
- 4 Logic cells
- 4 input Look up table
- Carry logic
- D flip-flop
Figure from Virtex™ 2.5 V FPGA Datasheet by Xilinx
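To make the idea of "LUT contents" concrete, the sketch below models a 4-input look-up table as a 16-bit truth table: the four inputs form an index, and the configuration bits supply the output. This is a conceptual model only, not the Xilinx bit-file format; the function names and the bit ordering are assumptions chosen for the example.

```python
def lut4(config, a, b, c, d):
    """Evaluate a 4-input LUT: the inputs select one of the 16 configuration bits."""
    index = a | (b << 1) | (c << 2) | (d << 3)
    return (config >> index) & 1


def config_from_function(fn):
    """Build the 16-bit LUT contents that implement a 4-input Boolean function."""
    config = 0
    for index in range(16):
        bits = [(index >> k) & 1 for k in range(4)]
        if fn(*bits):
            config |= 1 << index
    return config


# The same LUT hardware becomes a different gate purely by changing its contents.
XOR4 = config_from_function(lambda a, b, c, d: a ^ b ^ c ^ d)
AND4 = config_from_function(lambda a, b, c, d: a & b & c & d)
print(hex(XOR4), hex(AND4))
```

Reconfiguring the device amounts to loading new values of this kind, together with the routing control bits, into every logic cell.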
12 Trends in FPGA Floating Point Capabilities
from V. Natoli, A Computational Physicist's View of Reconfigurable High Performance Computing, Stone Ridge Technology, RSSI July 05
13 Xilinx XC4VLX200
- 32 bit Integer and Fixed Point
- Thousands of Arithmetic Units
- Floating Point
- 600 SP Floating Point Multipliers
- 100 SP Floating Point Dividers
- 100 DP Floating Point Multipliers
- 20 DP Floating Point Dividers
- SP ! 2 X DP
- Theoretical Peaks
- SP Floating Point 20-120 GFLOPs
- DP Floating Point 4-20 GFLOPs
- Integer 0.5-1 TOP
90 nm, 200,448 Logic Cells, 750 kB BRAM, 96 18x18 bit Multipliers, Clock up to 500 MHz
from V. Natoli, A Computational Physicist's View of Reconfigurable High Performance Computing, Stone Ridge Technology, RSSI July 05
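The theoretical peak ranges can be related to the unit counts with simple arithmetic. The sketch below assumes each floating point unit retires one result per clock cycle and that the units run at 200 MHz. Both are assumptions, not figures stated on the slide (which lists only a maximum clock of 500 MHz), but under them the divider and multiplier counts reproduce the quoted end points.

```python
def peak_gflops(units, clock_mhz):
    """Peak rate in GFLOPs, assuming one result per unit per clock cycle."""
    return units * clock_mhz / 1000.0


ASSUMED_CLOCK_MHZ = 200  # assumption: chosen because it matches the quoted ranges

sp_range = (peak_gflops(100, ASSUMED_CLOCK_MHZ), peak_gflops(600, ASSUMED_CLOCK_MHZ))
dp_range = (peak_gflops(20, ASSUMED_CLOCK_MHZ), peak_gflops(100, ASSUMED_CLOCK_MHZ))
print("SP GFLOPs:", sp_range)  # dividers only up to multipliers only
print("DP GFLOPs:", dp_range)
```

On this reading, the low end of each range corresponds to a design filled with dividers and the high end to one filled with multipliers.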
14 An FPGA-based FDTD Solver for Reconfigurable High Performance Computing
15 FDTD
- Maxwell's equations were solved using integral equations until Yee introduced the Finite-Difference Time-Domain (FDTD) method.
- The FDTD calculation is highly parallel, and is currently employed in parallel simulations on High Performance Computing (HPC) clusters.
- Fairly linear improvement in computation time as nodes are added.
- How to get even further speed-up on HPC systems?
16 FDTD
[Figure: Beowulf Cluster; three FPGA/CPU nodes connected by a Network]
- FPGA performs the computation
- Host Software moves the data.
- FPGA communication
- HPC communication
17 FDTD
- Relation of the Equations
- $H_x^{t}(i,j) = H_x^{t-1}(i,j) - \frac{dt}{\mu\,dy}\left(E_z^{t-0.5}(i+1,j+1) - E_z^{t-0.5}(i+1,j)\right)$
[Figure labels: Hx/Hy Calculations, Transfers; Ez Calculations, Transfers]
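18 FDTD
- The FDTD calculations have both temporal and spatial locality.
[Figure: Hx/Hy update datapath; blocks: Add, Delay, Multiply, Constant; signals: Ez(i,j), Hx/Hy(i,j)]
- $H_x^{t}(i,j) = H_x^{t-1}(i,j) - \frac{dt}{\mu\,dy}\left(E_z^{t-0.5}(i+1,j+1) - E_z^{t-0.5}(i+1,j)\right)$
- $H_y^{t}(i,j) = H_y^{t-1}(i,j) + \frac{dt}{\mu\,dx}\left(E_z^{t-0.5}(i+1,j+1) - E_z^{t-0.5}(i,j+1)\right)$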
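19 FDTD
- Ez calculation has more operations.
[Figure: Ez update datapath; blocks: Constant, Multiply, Add, Delay; signals: Hx(i,j), Hy(i,j), Hy(i-1,j), Ez(i,j)]
- $E_z^{t}(i,j) = E_z^{t-1}(i,j) + \frac{dt}{\epsilon\,dx}\left(H_y^{t-0.5}(i,j-1) - H_y^{t-0.5}(i-1,j-1)\right) - \frac{dt}{\epsilon\,dy}\left(H_x^{t-0.5}(i-1,j) - H_x^{t-0.5}(i-1,j-1)\right)$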
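The three update equations on the preceding slides can be written as a short software reference. The sketch below is a plain-Python illustration, not the FPGA update engines or the cluster code. It assumes the index offsets as read from the slides (with the plus signs restored), normalized material constants, and field values on the low-index Ez boundary that are simply left untouched; the function names are hypothetical.

```python
def make_grid(n):
    """n x n field initialised to zero, indexed as grid[i][j]."""
    return [[0.0] * n for _ in range(n)]


def fdtd_step(hx, hy, ez, dt, dx, dy, mu, eps):
    """Advance the 2D fields by one time step, in place."""
    n_i, n_j = len(ez), len(ez[0])
    ch_y, ch_x = dt / (mu * dy), dt / (mu * dx)
    ce_x, ce_y = dt / (eps * dx), dt / (eps * dy)

    # Magnetic updates: Hx and Hy read only Ez, so they can run in parallel.
    for i in range(n_i - 1):
        for j in range(n_j - 1):
            hx[i][j] -= ch_y * (ez[i + 1][j + 1] - ez[i + 1][j])
            hy[i][j] += ch_x * (ez[i + 1][j + 1] - ez[i][j + 1])

    # Electric update: reads both freshly updated magnetic fields.
    for i in range(1, n_i):
        for j in range(1, n_j):
            ez[i][j] += (ce_x * (hy[i][j - 1] - hy[i - 1][j - 1])
                         - ce_y * (hx[i - 1][j] - hx[i - 1][j - 1]))


# Usage: a unit Ez pulse in the middle of a small grid, normalized units.
hx, hy, ez = make_grid(8), make_grid(8), make_grid(8)
ez[4][4] = 1.0
fdtd_step(hx, hy, ez, dt=0.5, dx=1.0, dy=1.0, mu=1.0, eps=1.0)
print(hx[3][3], hy[3][3], ez[4][4])
```

Each output value depends only on a few neighbouring values from the previous half step, which is the spatial and temporal locality that the update engines exploit.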
20 Cray XD1 System Architecture
[Figure: Cray XD1 Chassis]
21 Cray XD1 Expansion Module
- AAP: FPGA, Xilinx Virtex-II Pro (xc2vp50-7)
- RAP: RapidArray Processor
[Figure: Cray XD1 Expansion Module]
22 Baseline Implementation
- Update engines created by Gandhi [2]
- Floating point units provided by Belanovic [3] at NEU
- Two clocks: system, and update engines
- Magnetic updates in parallel (Hx and Hy)
- Electric update (Ez) every 2 clock cycles
- Multiple update cycles w/o host intervention
- Local SRAMs for input and output data
- SRAMs as ping-pong buffers
- Slower than Opterons alone
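The ping-pong buffering mentioned above can be sketched in software as two banks whose roles swap after every update cycle, so that one bank is read while the other is written. This is a generic illustration of the technique. How the actual design maps fields onto the QDR II SRAMs is not stated here, and the class and function names are hypothetical.

```python
class PingPongBuffers:
    """Two equally sized banks; one is read while the other is written."""

    def __init__(self, size):
        self._banks = [[0.0] * size, [0.0] * size]
        self._read = 0  # index of the bank currently used as input

    @property
    def read_bank(self):
        return self._banks[self._read]

    @property
    def write_bank(self):
        return self._banks[1 - self._read]

    def swap(self):
        """Exchange roles: last cycle's output becomes the next cycle's input."""
        self._read = 1 - self._read


def run_cycles(buffers, update, cycles):
    """Run several update cycles back to back with no copying in between."""
    for _ in range(cycles):
        src, dst = buffers.read_bank, buffers.write_bank
        for k, value in enumerate(src):
            dst[k] = update(value)
        buffers.swap()
    return buffers.read_bank


result = run_cycles(PingPongBuffers(4), lambda v: v + 1.0, cycles=3)
print(result)
```

Because only the roles of the banks change, several update cycles can run in sequence without the host having to move data between them.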
23 FPGA Implementation in Cray XD1
[Figure: block diagram of app_fdtd; blocks: prog_clock_gen, rt_core, qdr2_core, mux, rt_client, qdr_fdtd; interfaces: Transmit Data Bus, Receive Data Bus, Fabric Request Interface, User Request Interface, Host Processor Interface, QDR 1-4 Interfaces, QDR II SRAM1-4 Interfaces, Clock Signals]
- Cray IP Cores: rt_core, qdr2_core
- rt_client: agent for host
- qdr_fdtd: instantiates and controls update engine operations
- mux: multiplexes requests from rt_client and qdr_fdtd
- prog_clock_gen: clock for different blocks, uses DCM
24 Performance Analysis: Existing Design
- Time for one electromagnetic field value update
- Tcray: total time taken by the Cray XD1 to update one electromagnetic field value
- Dcray: latency of QDR II SRAMs
- M: latency of the magnetic (Hx and Hy) update engine
- E: latency of the electric (Ez) update engine
- N: size of the electromagnetic matrix processed by the FPGA (N = Grid_Row x Grid_Column, Grid_Row mod 2 = 0 and Grid_Column mod 2 = 0)
- k: minimum 3
- Tu: time period of clock for update engines
- 2C: number of cycles of FDTD algorithm calculation
- Design constants: M = 22, E = 30, k = 3
- Significant variables: N, C, Tu (most significant)
- Reducing Tcray: decrease Tu (most significant), M and E (not significant)
- Increasing N and C (typically high): not very significant
25 Areas to improve performance
- Reducing time for update of one value
- Improve clock speed of update engines
- Higher clock speed of floating point units
- Single Clock signal
- Correct reset behavior
- SRAM R/W address generation scheme
- Increasing Throughput
- One Ez result per clock cycle (vs 1 per 2 cycles)
- FPGA-initiated boundary output data transfer
- Multiple copies of update engines
- FPGA-to-FPGA transfer of boundary data
- Using Pre-synthesized floating point units
(Sandia Labs, USA)
26 Performance Comparison
27 Performance Comparison
[Figure: performance comparison; series: Original Design Units; Sandia, Optimized Units]