1. Project F2 Application Performance Analysis
- John Curreri
- Seth Koehler
- Rafael Garcia
2. Outline
- Introduction
  - Application mappers
  - Historical background
  - Performance analysis today
- HLL runtime performance analysis tool
  - Motivation
  - Instrumentation
  - Framework
  - Visualization
- Case study
  - Molecular Dynamics
- Conclusions
- References
3. Application Mappers
- Translates C code to HDL
  - Higher level of abstraction
- Usually a subset of ANSI C
  - No pointers
  - No standard C libraries for the FPGA
- HDL is generated as a project file for Xilinx or Altera tools
- Built-in communication
  - Separate C source files are written for the CPU and the FPGA
  - Similar communication function calls on the CPU and FPGA sides
4. Application Mappers (continued)
- Computational parallelism
  - Pipelining of loops
    - for(), while(), etc.
  - Use of library functions
    - HDL-coded functions called from the HLL
    - FFT, floating-point operations
  - Replication of functions defined in hardware
- Types of communication (a stream-style kernel is sketched below)
  - DMA transfers
    - Efficient transfer of large chunks of data
  - Stream transfers
    - Steady flow of data
    - Buffered to absorb transfer-rate changes
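To make the stream model concrete, here is a minimal sketch of a stream-based kernel in the style of Impulse C (call names and signatures paraphrased from Impulse C examples; treat them as illustrative rather than exact):

#include "co.h"  /* Impulse C-style runtime header */

/* One hardware process: reads 32-bit values from an input stream,
   squares them, and writes results to an output stream. The while
   loop body is the unit the application mapper pipelines in HDL. */
void square_kernel(co_stream input, co_stream output)
{
    int32 value;
    co_stream_open(input,  O_RDONLY, INT_TYPE(32));
    co_stream_open(output, O_WRONLY, INT_TYPE(32));
    while (co_stream_read(input, &value, sizeof(value)) == co_err_none) {
        value = value * value;              /* pipelined datapath */
        co_stream_write(output, &value, sizeof(value));
    }
    co_stream_close(input);
    co_stream_close(output);
}

Replication then amounts to instantiating several such processes, each fed by its own stream.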
5. Introduction to the F2 Project
- Goals for performance analysis in RC
  - Productively identify and remedy performance bottlenecks in RC applications (CPUs and FPGAs)
- Motivations
  - Complex systems are difficult to analyze by hand
    - Manual instrumentation is unwieldy
    - Difficult to make sense of large volumes of raw data
  - Tools can help quickly locate performance problems
    - Collect and view performance data with little effort
    - Analyze performance data to indicate potential bottlenecks
  - Performance tools are a staple in HPC, limited in HPEC, and virtually non-existent in RC
- Challenges
  - How do we expand the notion of software performance analysis into the software-hardware realm of RC?
  - What are common bottlenecks for dual-paradigm applications?
  - What techniques are necessary to detect performance bottlenecks?
  - How do we analyze and present these bottlenecks to a user?
6. Historical Background
- gettimeofday() and printf()
  - VERY cumbersome, repetitive, manual, not optimized for speed (see the sketch below)
- Profilers date back to the 70s with prof (gprof, 1982)
  - Provide the user with information about application behavior
    - Percentage of time spent in a function
    - How often a function calls another function
- Simulators / emulators
  - Too slow or too inaccurate
  - Require significant development time
- PAPI (Performance Application Programming Interface)
  - Portable interface to hardware performance counters on modern CPUs
  - Provides information about caches, CPU functional units, main memory, and more

Source: Wikipedia
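For reference, the manual approach being dismissed here looks like the following; the boilerplate must be repeated by hand around every region of interest (do_work() is a stand-in for application code):

#include <stdio.h>
#include <sys/time.h>

static void do_work(void) { /* placeholder for the region under study */ }

int main(void)
{
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);            /* start timestamp */
    do_work();
    gettimeofday(&t1, NULL);            /* stop timestamp */
    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_usec - t0.tv_usec) / 1e6;
    printf("do_work: %.6f s\n", secs);  /* then repeat this everywhere... */
    return 0;
}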
7. Performance Analysis Today
- What does performance analysis look like today?
- Goals
  - Low impact on application behavior
  - High-fidelity performance data
  - Flexible
  - Portable
  - Automated
  - Concise visualization
- Techniques
  - Event-based, sample-based
  - Profile, trace
- Above all, we want to understand application behavior in order to locate performance problems!
8. Related Research and Tools: Parallel Performance Wizard (PPW)
- Open-source tool developed by the UPC Group at the University of Florida
- Performance analysis and optimization (PGAS systems and MPI support)
- Performance data can be analyzed for bottlenecks
- Offers several ways of exploring performance data
  - Graphs and charts to quickly view high-level performance information at a glance (right, top)
  - In-depth execution statistics for identifying communication and computational bottlenecks
  - Interacts with popular trace viewers (e.g., Jumpshot; right, bottom) for detailed analysis of trace data
  - Comprehensive support for correlating performance back to the original source code

Note: Partitioned Global Address Space (PGAS) languages allow partitioned memory to be treated as global shared memory by software.
9. Motivation for RC Performance Analysis
- Dual-paradigm applications gaining more traction in HPC and HPEC
  - Design flexibility allows best use of FPGAs and traditional processors
  - Drawback: more challenging to design applications for dual-paradigm systems
    - Parallel application tuning and FPGA core debugging are hard enough!
- No existing holistic solutions for analyzing dual-paradigm applications
  - Software-only views leave out low-level details
  - Hardware-only views provide incomplete performance information
  - Need a complete system view for effective tuning of the entire application
10. Motivation for RC Performance Analysis
- Q: Is my runtime load-balancing strategy working?
- A: ???

[Figure: ChipScope waveform]
11. Motivation for RC Performance Analysis
- Q: How well is my cores' pipelining strategy working?
- A: ???
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
51.52      2.55      2.55        5   510.04   510.04  USURP_Reg_poll
29.41      4.01      1.46       34    42.82    42.82  USURP_DMA_write
11.97      4.60      0.59       14    42.31    42.31  USURP_DMA_read
 4.06      4.80      0.20        1   200.80   200.80  USURP_Finalize
 2.23      4.91      0.11        5    22.09    22.09  localp
 1.22      4.97      0.06        5    12.05    12.05  USURP_Load
 0.00      4.97      0.00       10     0.00     0.00  USURP_Reg_write
 0.00      4.97      0.00        5     0.00     0.00  USURP_Set_clk
 0.00      4.97      0.00        5     0.00   931.73  rcwork
 0.00      4.97      0.00        1     0.00     0.00  USURP_Init

gprof output (N profiles, one for each node!)
12. Instrumentation Level
- High-level language (HLL)
  - Requires HLL timing functions
  - Application mapping disturbed by instrumentation
- Hardware description language (HDL)
  - Portable between HLLs and FPGA families
  - Selected level for instrumentation
- FPGA bitstream
  - Requires targeting a specific FPGA family
  - Instrumentation can be inserted in minutes
13. Instrumentation Selection
- Automated - computation
  - State machines
    - Used to preserve execution order in C functions
    - Used to control the state of pipelines
  - Control and status signals
    - Used by library functions
- Automated - communication
  - Control and status signals
    - Used for streaming communication
    - Used for DMA transfers
- Application-specific
  - Monitoring variables for meaningful values
14. Measurement Techniques
- Profiling
  - Counters
    - Record the number of occurrences of an event
  - Low overhead
    - Normally uses registers
    - Block RAM can be used for state machines
- Tracing
  - Timestamps
    - Indicate when an event occurred
  - Data
    - Associated with each event
  - Greater overhead
    - Uses memory to store timestamps and data
  - Greater fidelity
    - Allows reconstruction of the sequence of events
- A software analogy of the two techniques is sketched below
[Figure: Jumpshot-style timeline showing traced events on CPUs 0-3 over time]

Zaki, O., Lusk, E., Gropp, W., and Swider, D. Toward scalable performance visualization with Jumpshot. Int. J. High Perform. Comput. Appl., 13(3):277-288, Aug. 1999.
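As promised above, a minimal software analogy of profiling versus tracing (the tool itself implements these in hardware with registers and block RAM; all names below are illustrative):

#include <stddef.h>
#include <stdint.h>
#include <time.h>

#define NUM_EVENTS  16
#define TRACE_DEPTH 4096

/* Profiling: one counter per event type; constant, small memory. */
static uint64_t event_count[NUM_EVENTS];

/* Tracing: one timestamped record per occurrence; memory grows with
   the number of events, but the full ordering can be reconstructed. */
typedef struct {
    uint64_t timestamp_ns;
    uint16_t event_id;
    uint32_t data;          /* value associated with the event */
} trace_rec;
static trace_rec trace_buf[TRACE_DEPTH];
static size_t    trace_len;

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

void record_event(uint16_t id, uint32_t data)
{
    event_count[id]++;                      /* profile: cheap, lossy  */
    if (trace_len < TRACE_DEPTH)            /* trace: richer, costlier */
        trace_buf[trace_len++] = (trace_rec){ now_ns(), id, data };
}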
15. Hardware Measurement Module
16. Adding Instrumentation and Measurement

[Figure: instrumented HLL tool flow. Starting from the uninstrumented project, instrumentation is added to the application C source. CPU side: the instrumented C source and an HLL API wrapper are compiled to software. FPGA side: the C source for the FPGA is mapped to HDL, instrumentation and the hardware measurement module are added alongside the HLL hardware wrapper, and the hardware is implemented to produce the finished design.]
17. Reverse Mapping and Analysis
- Mapping of HDL data back to the HLL
  - Variable name-matching
  - Observing scope and other patterns
- Bottleneck detection (a simple load-balance check is sketched below)
  - Load-balancing of replicated functions
  - Monitoring for pipeline stalls
  - Detecting streaming-communication stalls
  - Finding shared-memory contention
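As a simple illustration of the load-balancing check (hypothetical code, not the tool's actual analysis): given per-kernel work counters read back from the replicated hardware functions, a large relative spread flags an imbalance.

#include <stdio.h>

#define NUM_KERNELS 4

/* Relative spread between the busiest and idlest kernel:
   0.0 = perfectly balanced, values near 1.0 = badly skewed. */
static double imbalance(const unsigned long work[NUM_KERNELS])
{
    unsigned long min = work[0], max = work[0];
    for (int i = 1; i < NUM_KERNELS; i++) {
        if (work[i] < min) min = work[i];
        if (work[i] > max) max = work[i];
    }
    return max ? (double)(max - min) / (double)max : 0.0;
}

int main(void)
{
    /* e.g., iteration counters sampled from 4 replicated kernels */
    unsigned long work[NUM_KERNELS] = { 980, 1010, 340, 995 };
    if (imbalance(work) > 0.25)
        printf("possible load imbalance: %.0f%% spread\n",
               100.0 * imbalance(work));
    return 0;
}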
18. Example RC Visualization
- Need unified visualizations that accentuate important statistics
- Must be scalable to many nodes
19. Molecular Dynamics
- Simulation
  - Interactions between atoms and molecules over discrete time intervals
- Models forces
  - Newtonian physics
  - Van der Waals forces (sketched below)
  - Other interactions
- Tracks molecule positions and velocities
  - X, Y, and Z directions

http://en.wikipedia.org/wiki/Molecular_dynamics
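The slides do not show the kernel itself, but as a rough illustration (assumed here, not taken from the case study), the Van der Waals term is commonly modeled with a 12-6 Lennard-Jones potential, and the inner loop below is the kind of fixed-arithmetic pairwise computation an MD kernel pipelines:

#define N_ATOMS 4096

typedef struct { float x, y, z; } vec3;

/* Illustrative O(N^2) pairwise Lennard-Jones force accumulation.
   The inner loop body -- a fixed chain of multiplies and adds --
   maps naturally onto a deep FPGA pipeline. */
void lj_forces(const vec3 pos[N_ATOMS], vec3 force[N_ATOMS],
               float eps, float sigma)
{
    for (int i = 0; i < N_ATOMS; i++) {
        vec3 fi = { 0.0f, 0.0f, 0.0f };
        for (int j = 0; j < N_ATOMS; j++) {
            if (j == i) continue;
            float dx = pos[i].x - pos[j].x;
            float dy = pos[i].y - pos[j].y;
            float dz = pos[i].z - pos[j].z;
            float r2 = dx * dx + dy * dy + dz * dz;
            float s2 = (sigma * sigma) / r2;
            float s6 = s2 * s2 * s2;
            /* F/r for the 12-6 potential: 24*eps*(2*s6^2 - s6)/r^2 */
            float f_over_r = 24.0f * eps * (2.0f * s6 * s6 - s6) / r2;
            fi.x += f_over_r * dx;
            fi.y += f_over_r * dy;
            fi.z += f_over_r * dz;
        }
        force[i] = fi;
    }
}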
20. Case Study Setup
- Impulse C v2.2
- XD1000 platform
  - Opteron 2.2 GHz
  - XD1000 module with Altera Stratix-II EP2S180 FPGA in the second processor socket
- MD communication architecture (CPU-side sketch below)
  - Chunks of MD data are read from SRAM
  - Data is streamed to multiple pipelined MD kernels
  - Results are stored back to SRAM
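A hypothetical CPU-side loop for this architecture (the fpga_* calls are stand-ins, not the actual XD1000 or Impulse C API):

#include <stddef.h>

/* Hypothetical board API -- illustrative stubs only. */
static void fpga_sram_write(const float *src, size_t bytes) { (void)src; (void)bytes; }
static void fpga_start_kernels(void) {}
static void fpga_wait_done(void) {}
static void fpga_sram_read(float *dst, size_t bytes) { (void)dst; (void)bytes; }

/* Stage a chunk of MD data in SRAM, let the pipelined kernels
   stream through it, then read the results back. */
void run_md(const float *atoms, float *results,
            size_t n_chunks, size_t floats_per_chunk)
{
    for (size_t c = 0; c < n_chunks; c++) {
        fpga_sram_write(atoms + c * floats_per_chunk,
                        floats_per_chunk * sizeof(float));
        fpga_start_kernels();       /* kernels drain SRAM via streams */
        fpga_wait_done();
        fpga_sram_read(results + c * floats_per_chunk,
                       floats_per_chunk * sizeof(float));
    }
}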
21. Impulse-C Profile Percentages

The output stream of the Molecular Dynamics kernel is a bottleneck.
22. The stream buffer size was increased 32x, allowing application speedup to increase from 6.2 to 7.8 vs. the serial baseline.
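In Impulse C terms, this tuning step amounts to creating the kernel's output stream with a greater depth; a minimal sketch, with illustrative depth values rather than the case study's actual figures:

#include "co.h"  /* Impulse C-style runtime header */

void configure_streams(void)
{
    /* Shallow buffer: the pipelined MD kernel stalls whenever the
       SRAM write-back briefly falls behind. */
    co_stream out = co_stream_create("md_out", INT_TYPE(32), 4);

    /* 32x deeper buffer: absorbs transfer-rate changes between
       producer and consumer, removing the output-stream stall. */
    co_stream out_deep = co_stream_create("md_out_deep", INT_TYPE(32), 128);

    (void)out; (void)out_deep;
}

A deeper buffer trades block RAM for tolerance to rate mismatch between the kernel and the consumer of its results.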
23. Performance Analysis Overhead
- Additional FPGA resource usage
  - Less than 4%
- Frequency reduction
24. Conclusions
- Developed a prototype HLL-oriented RC performance analysis tool
  - First such runtime performance analysis tool framework (per extensive literature review)
  - Tracing and profiling available
  - Automated instrumentation in progress
- Application case study performed
  - Observed minimal overhead from the tool
  - Speedup achieved due to performance analysis
- Future work
  - SRC support; automated instrumentation and analysis; integration with software PATs; further case studies
25. References
- Paul Graham, Brent Nelson, and Brad Hutchings. Instrumenting bitstreams for debugging FPGA circuits. In Proc. of the 9th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 41-50, Washington, DC, USA, Apr. 2001. IEEE Computer Society.
- Sameer S. Shende and Allen D. Malony. The TAU parallel performance system. International Journal of High Performance Computing Applications (HPCA), 20(2):287-311, May 2006.
- C. Eric Wu, Anthony Bolmarcich, Marc Snir, David Wootton, Farid Parpia, Anthony Chan, Ewing Lusk, and William Gropp. From trace generation to visualization: a performance framework for distributed parallel systems. In Proc. of the 2000 ACM/IEEE Conference on Supercomputing (SC), page 50, Washington, DC, USA, Nov. 2000. IEEE Computer Society.
- Adam Leko and Max Billingsley, III. Parallel Performance Wizard user manual. http://ppw.hcs.ufl.edu/docs/pdf/manual.pdf, 2007.
- S. Koehler, J. Curreri, and A. George. "Challenges for Performance Analysis in High-Performance Reconfigurable Computing," Proc. of Reconfigurable Systems Summer Institute 2007 (RSSI), Urbana, IL, July 17-20, 2007.
- J. Curreri, S. Koehler, B. Holland, and A. George. "Performance Analysis with High-Level Languages for High-Performance Reconfigurable Computing," Proc. of 16th IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), Palo Alto, CA, Apr. 14-15, 2008.