John Curreri - PowerPoint PPT Presentation

Provided by: gstitt

1
Project F2 Application Performance Analysis
  • John Curreri
  • Seth Koehler
  • Rafael Garcia

2
Outline
  • Introduction
  • Application mappers
  • Historical background
  • Performance analysis today
  • HLL runtime performance analysis tool
  • Motivation
  • Instrumentation
  • Framework
  • Visualization
  • Case study
  • Molecular Dynamics
  • Conclusions & References

3
Application Mappers
  • Translate C code to HDL
  • Higher level of abstraction
  • Usually a subset of ANSI C
  • No pointers
  • No standard C libraries for FPGA
  • HDL is generated as a project file for Xilinx or
    Altera tools
  • Built-in communication
  • Separate C source files are made for the CPU &
    FPGA
  • Similar communication function calls between CPU &
    FPGA

4
Application Mappers (continued)
  • Computational parallelism
  • Pipelining of loops
  • for(), while(), etc.
  • Use of library functions
  • HDL-coded functions called from the HLL
  • FFT, Floating point operations
  • Replication of functions defined in hardware
  • Types of communication
  • DMA transfers
  • Efficient transfer of large chunks of data
  • Stream transfers
  • Steady flow of data
  • Buffered for transfer rate changes

5
Introduction to the F2 project
  • Goals for performance analysis in RC
  • Productively identify and remedy performance
    bottlenecks in RC applications (CPUs and FPGAs)
  • Motivations
  • Complex systems are difficult to analyze by hand
  • Manual instrumentation is unwieldy
  • Difficult to make sense of large volume of raw
    data
  • Tools can help quickly locate performance
    problems
  • Collect and view performance data with little
    effort
  • Analyze performance data to indicate potential
    bottlenecks
  • Staple in HPC, limited in HPEC, and virtually
    non-existent in RC
  • Challenges
  • How do we expand the notion of software performance
    analysis into the software-hardware realm of RC?
  • What are common bottlenecks for dual-paradigm
    applications?
  • What techniques are necessary to detect
    performance bottlenecks?
  • How do we analyze and present these bottlenecks
    to a user?

6
Historical Background
  • Gettimeofday and printf
  • VERY cumbersome, repetitive, manual, not
    optimized for speed
  • Profilers date back to 70s with prof (gprof,
    1982)
  • Provide user with information about application
    behavior
  • Percentage of time spent in a function
  • How often a function calls another function
  • Simulators / Emulators
  • Too slow or too inaccurate
  • Require significant development time
  • PAPI (Performance Application Programming
    Interface)
  • Portable interface to hardware performance
    counters on modern CPUs
  • Provides information about caches, CPU
    functional units, main memory, and more

Source: Wikipedia
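
The manual approach from the first bullet can be sketched as follows. This is only an illustration in Python, where `time.perf_counter()` stands in for `gettimeofday()` and `print` stands in for `printf`; the helper `timed` and the workload `sum_of_squares` are hypothetical names, not part of any tool mentioned in the slides.

```python
import time


def timed(label, func, *args):
    """Run func(*args), print its wall-clock time, return (result, seconds)."""
    start = time.perf_counter()            # stand-in for gettimeofday() before the call
    result = func(*args)
    elapsed = time.perf_counter() - start  # stand-in for gettimeofday() after the call
    print(f"{label}: {elapsed * 1e3:.3f} ms")  # the "printf" half of the technique
    return result, elapsed


def sum_of_squares(n):
    """Hypothetical workload used only to have something to time."""
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    total, seconds = timed("sum_of_squares", sum_of_squares, 100_000)
```

Every region of interest needs its own wrapper call like this, which is why the slide describes the approach as cumbersome, repetitive, and manual.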
7
Performance Analysis Today
  • What does performance analysis look like today?
  • Goals
  • Low impact on application behavior
  • High-fidelity performance data
  • Flexible
  • Portable
  • Automated
  • Concise Visualization
  • Techniques
  • Event-based, sample-based
  • Profile, Trace
  • Above all, we want to understand application
    behavior in order to locate performance problems!

8
Related Research and Tools: Parallel Performance Wizard (PPW)
  • Open-source tool developed by UPC Group at
    University of Florida
  • Performance analysis and optimization (PGAS
    systems and MPI support)
  • Performance data can be analyzed for bottlenecks
  • Offers several ways of exploring performance data
  • Graphs and charts to quickly view high-level
    performance information at a glance (right, top)
  • In-depth execution statistics for identifying
    communication and computational bottlenecks
  • Interacts with popular trace viewers (e.g.
    Jumpshot; right, bottom) for detailed analysis
    of trace data
  • Comprehensive support for correlating performance
    back to original source code

Note: Partitioned Global Address Space (PGAS) languages
allow partitioned memory to be treated as global
shared memory by software.
9
Motivation for RC Performance Analysis
  • Dual-paradigm applications gaining more traction
    in HPC and HPEC
  • Design flexibility allows best use of FPGAs and
    traditional processors
  • Drawback: more challenging to design applications
    for dual-paradigm systems
  • Parallel application tuning and FPGA core
    debugging are hard enough!

[Figure: difficulty-level scale, from Less to More]
  • No existing holistic solutions for analyzing
    dual-paradigm applications
  • Software-only views leave out low-level details
  • Hardware-only views provide incomplete
    performance information
  • Need complete system view for effective tuning of
    entire application

10
Motivation for RC Performance Analysis
  • Q: Is my runtime load-balancing strategy working?
  • A: ???

ChipScope waveform
11
Motivation for RC Performance Analysis
  • Q: How well is my core's pipelining strategy
    working?
  • A: ???

Flat profile (each sample counts as 0.01 seconds):

  %      cumulative  self               self     total
  time   seconds     seconds   calls    ms/call  ms/call  name
  51.52  2.55        2.55      5        510.04   510.04   USURP_Reg_poll
  29.41  4.01        1.46      34       42.82    42.82    USURP_DMA_write
  11.97  4.60        0.59      14       42.31    42.31    USURP_DMA_read
  4.06   4.80        0.20      1        200.80   200.80   USURP_Finalize
  2.23   4.91        0.11      5        22.09    22.09    localp
  1.22   4.97        0.06      5        12.05    12.05    USURP_Load
  0.00   4.97        0.00      10       0.00     0.00     USURP_Reg_write
  0.00   4.97        0.00      5        0.00     0.00     USURP_Set_clk
  0.00   4.97        0.00      5        0.00     931.73   rcwork
  0.00   4.97        0.00      1        0.00     0.00     USURP_Init

gprof output (N, one for each node!)
12
Instrumentation Level
  • High-level language (HLL)
  • Requires HLL timing functions
  • Application mapping disturbed by instrumentation
  • Hardware Description Language (HDL)
  • Portable between HLLs and FPGA families
  • Selected level for instrumentation
  • FPGA bit stream
  • Requires targeting specific FPGA family
  • Instrument in minutes

13
Instrumentation Selection
  • Automated - Computation
  • State machines
  • Used for preserving execution order in C
    functions
  • Used to control state of pipelines
  • Control and status signals
  • Used by library function
  • Automated - Communication
  • Control and status signals
  • Used for streaming communication
  • Used for DMA transfers
  • Application specific
  • Monitoring variables for meaningful values
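
As a rough software model of what monitoring a state machine could look like, the sketch below keeps one cycle counter per state plus a count of state transitions. The class `StateMonitor`, the state names, and the per-cycle sequence are all hypothetical; the slides do not describe the actual hardware at this level of detail.

```python
from collections import Counter


class StateMonitor:
    """Software model of per-state cycle counters attached to a state machine."""

    def __init__(self):
        self.cycles = Counter()       # one counter per state
        self.transitions = Counter()  # counts of (previous state, next state)
        self._prev = None

    def clock(self, state):
        """Call once per clock cycle with the state machine's current state."""
        self.cycles[state] += 1
        if self._prev is not None and self._prev != state:
            self.transitions[(self._prev, state)] += 1
        self._prev = state

    def fraction(self, state):
        """Share of all observed cycles spent in the given state."""
        total = sum(self.cycles.values())
        return self.cycles[state] / total if total else 0.0


# Hypothetical per-cycle state sequence of one pipelined loop.
TRACE = ["IDLE", "LOAD", "RUN", "RUN", "STALL", "STALL", "RUN", "DONE"]

monitor = StateMonitor()
for current_state in TRACE:
    monitor.clock(current_state)

print(monitor.cycles["STALL"], monitor.fraction("STALL"))
```

In this made-up sequence, the pipeline spends two of eight cycles stalled, which is the kind of figure a user would want reported against the C function that owns the state machine.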

14
Measurement Techniques
  • Profiling
  • Counters
  • Records number of occurrences of event
  • Low overhead
  • Normally uses registers
  • Block RAM can be used for state machines
  • Tracing
  • Timestamps
  • Indicating when event occurred
  • Data
  • Associated with each event
  • Greater overhead
  • Uses memory to store timestamps and data
  • Greater fidelity
  • Reconstruction of sequence of events
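
The difference between the two techniques can be illustrated with a small software model. This is a sketch under stated assumptions: the classes `Profiler` and `Tracer` and the event stream `EVENTS` are hypothetical, and real hardware would use registers or block RAM rather than Python containers.

```python
from collections import Counter


class Profiler:
    """Profiling: keep only a count per event (small, fixed storage)."""

    def __init__(self):
        self.counts = Counter()

    def record(self, cycle, event, data=None):
        self.counts[event] += 1  # timestamp and data are discarded


class Tracer:
    """Tracing: keep a timestamp and data for every event (storage grows)."""

    def __init__(self):
        self.records = []

    def record(self, cycle, event, data=None):
        self.records.append((cycle, event, data))


# Hypothetical event stream: (cycle, event name, associated data).
EVENTS = [(3, "stall", 0), (4, "stall", 0), (9, "dma_done", 256), (12, "stall", 1)]


def run(monitor):
    """Feed the same event stream to either kind of monitor."""
    for cycle, event, data in EVENTS:
        monitor.record(cycle, event, data)
    return monitor


profile = run(Profiler())
trace = run(Tracer())
print(dict(profile.counts))   # how often each event occurred
print(trace.records)          # when each event occurred, and with what data
```

The profile answers "how many stalls?" with two stored numbers, while the trace stores four records but also shows that the last stall came after the DMA completed.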


[Figure: trace timeline for CPU-0 with events 1, 2, 3 along a Time axis]
Zaki, O., Lusk, E., Gropp, W., and Swider, D.
1999. Toward Scalable Performance Visualization
with Jumpshot. Int. J. High Perform. Comput.
Appl. 13, 3 (Aug. 1999), 277-288.
15
Hardware Measurement Module
16
Adding Instrumentation & Measurement
[Figure: HLL tool flow from C source to finished design]
  • CPU(s): Application (C source), Instrumentation,
    HLL API Wrapper; Compile software
  • FPGA(s): Application (C source), Application (HDL),
    Instrumentation, HLL Hardware Wrapper, Hardware
    Measurement Module; Implement hardware
  • Steps shown
  • Uninstrumented project
  • Instrumentation added to C source
  • Software-hardware mapping: C source for FPGA
    mapped to HDL
  • Instrumentation added to HDL
  • Implement hardware
  • Finished design
17
Reverse Mapping Analysis
  • Mapping of HDL data back to HLL
  • Variable name-matching
  • Observing scope and other patterns
  • Bottleneck detection
  • Load-balancing of replicated functions
  • Monitoring for pipeline stalls
  • Detecting streaming communication stalls
  • Finding shared-memory contention
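
Variable name-matching with a scope check could look roughly like the sketch below. The naming scheme (`<process>_<variable>` with an optional suffix), the table `HLL_VARIABLES`, and the function `map_signal` are assumptions made for illustration; the slides do not specify how any particular application mapper names its generated HDL signals.

```python
# Hypothetical table of HLL processes and the C variables declared in each.
HLL_VARIABLES = {
    "md_kernel": ["accel_x", "pos_x", "i"],
    "producer": ["count"],
}


def map_signal(signal, hll_variables=HLL_VARIABLES):
    """Return (process, variable) for an HDL signal name, or None if unmatched.

    Assumes generated signals are named <process>_<variable>[_<suffix>].
    """
    for process, variables in hll_variables.items():
        prefix = process + "_"
        if not signal.startswith(prefix):
            continue  # scope check: the signal must belong to this process
        rest = signal[len(prefix):]
        # Try longer names first so a short name like "i" cannot shadow them.
        for var in sorted(variables, key=len, reverse=True):
            if rest == var or rest.startswith(var + "_"):
                return process, var
    return None


print(map_signal("md_kernel_pos_x_reg"))
print(map_signal("clk"))
```

Signals that match nothing, such as a clock, are simply left unmapped rather than guessed at.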

18
Example RC Visualization
  • Need unified visualizations that accentuate
    important statistics
  • Must be scalable to many nodes

19
Molecular Dynamics
  • Simulation
  • Interactions between atoms and molecules
  • Discrete time intervals
  • Models forces
  • Newtonian physics
  • Van der Waals forces
  • Other interactions
  • Tracks each molecule's position and velocity
  • X, Y and Z directions

http://en.wikipedia.org/wiki/Molecular_dynamics
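
A minimal sketch of one such simulation step is shown below. Two choices here are assumptions rather than anything stated in the slides: the Lennard-Jones potential is used as a common model for van der Waals interactions, and velocity Verlet is used as the time integrator. Units are reduced (epsilon = sigma = mass = 1).

```python
def lj_force(r_vec, epsilon=1.0, sigma=1.0):
    """Lennard-Jones force on particle i due to j, where r_vec = r_i - r_j."""
    r2 = sum(c * c for c in r_vec)
    sr6 = (sigma * sigma / r2) ** 3
    # F = 24*eps*(2*(sigma/r)^12 - (sigma/r)^6) / r^2 * r_vec
    scale = 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / r2
    return [scale * c for c in r_vec]


def forces(positions):
    """Sum pairwise forces on every particle in X, Y and Z."""
    n = len(positions)
    total = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r_vec = [positions[i][k] - positions[j][k] for k in range(3)]
            f_ij = lj_force(r_vec)
            for k in range(3):
                total[i][k] += f_ij[k]
                total[j][k] -= f_ij[k]  # Newton's third law
    return total


def step(positions, velocities, dt, mass=1.0):
    """Advance positions and velocities by one discrete time interval dt."""
    f_old = forces(positions)
    new_pos = [[p[k] + v[k] * dt + 0.5 * f[k] / mass * dt * dt for k in range(3)]
               for p, v, f in zip(positions, velocities, f_old)]
    f_new = forces(new_pos)
    new_vel = [[v[k] + 0.5 * (fo[k] + fn[k]) / mass * dt for k in range(3)]
               for v, fo, fn in zip(velocities, f_old, f_new)]
    return new_pos, new_vel


if __name__ == "__main__":
    pos = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]  # closer than the equilibrium distance
    vel = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    pos, vel = step(pos, vel, dt=0.001)
    print(pos, vel)
```

The all-pairs loop in `forces` is the part that dominates run time, which is why it is a natural candidate for pipelined FPGA kernels.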
20
Case Study Setup
  • Impulse C v2.2
  • XD1000 platform
  • Opteron 2.2 GHz
  • XD1000 module with Altera Stratix-II EP2S180 FPGA
    in second processor socket
  • MD communication architecture
  • Chunks of MD data are read from SRAM
  • Data is streamed to multiple MD kernels that are
    pipelined
  • Results are stored back to SRAM

21
Impulse-C Profile Percentages
Output stream of Molecular Dynamics kernel is a
bottleneck.
22
Increasing the stream buffer size 32-fold raised
application speedup from 6.2 to 7.8 (vs. serial
baseline).
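
Why a deeper stream buffer can remove stalls is illustrated by the toy model below. It is not the Impulse C stream implementation and does not reproduce the measured figures: the function `simulate` and all of its parameters (burst length, compute time, drain rate) are hypothetical. A producer writes words in bursts, a consumer drains at a steady rate, and the producer stalls whenever the FIFO is full.

```python
def simulate(depth, total_words=64, burst_len=8, compute_len=8, drain_period=2):
    """Return (cycles, stall_cycles) for a bursty producer feeding a FIFO.

    The producer writes burst_len words back to back, then computes for
    compute_len cycles. The consumer removes one word every drain_period
    cycles. A write attempted while the FIFO holds `depth` words stalls.
    """
    fifo = 0          # words currently buffered
    written = 0       # words written so far
    in_burst = 0      # words written in the current burst
    compute_left = 0  # remaining compute cycles before the next burst
    stalls = 0
    cycle = 0
    while written < total_words:
        # Consumer side: steady drain.
        if cycle % drain_period == 0 and fifo > 0:
            fifo -= 1
        # Producer side: compute, write, or stall.
        if compute_left > 0:
            compute_left -= 1
        elif fifo < depth:
            fifo += 1
            written += 1
            in_burst += 1
            if in_burst == burst_len:
                in_burst = 0
                compute_left = compute_len
        else:
            stalls += 1
        cycle += 1
    return cycle, stalls


if __name__ == "__main__":
    for depth in (1, 2, 4, 8):
        cycles, stalls = simulate(depth)
        print(f"depth={depth}: {cycles} cycles, {stalls} stall cycles")
```

In this model the average production and drain rates are equal, so the stalls at small depths come only from burstiness, and a buffer deep enough to absorb one burst eliminates them.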
23
Performance Analysis Overhead
  • Additional FPGA resource usage
  • Less than 4%
  • Frequency reduction
  • Less than 3%

24
Conclusions
  • Developed prototype HLL-oriented RC performance
    analysis tool
  • First such runtime performance analysis tool
    framework (per extensive literature review)
  • Tracing & profiling available
  • Automated instrumentation in progress
  • Application case study performed
  • Observed minimal overhead from tool
  • Speedup achieved due to performance analysis
  • Future work
  • SRC support, automated instrumentation and
    analysis, integration with software PAT, further
    case studies

25
References
  • Paul Graham, Brent Nelson, and Brad Hutchings.
    Instrumenting bitstreams for debugging FPGA
    circuits. In Proc. of the 9th Annual IEEE
    Symposium on Field-Programmable Custom Computing
    Machines (FCCM), pages 41-50, Washington, DC,
    USA, Apr. 2001. IEEE Computer Society.
  • Sameer S. Shende and Allen D. Malony. The Tau
    parallel performance system. International
    Journal of High Performance Computing
    Applications (HPCA), 20(2):287-311, May 2006.
  • C. Eric Wu, Anthony Bolmarcich, Marc Snir,
    David Wootton, Farid Parpia, Anthony Chan, Ewing
    Lusk, and William Gropp. From trace generation to
    visualization: a performance framework for
    distributed parallel systems. In Proc. of the
    2000 ACM/IEEE conference on Supercomputing
    (CDROM) (SC), page 50, Washington, DC, USA, Nov.
    2000. IEEE Computer Society.
  • Adam Leko and Max Billingsley, III. Parallel
    performance wizard user manual.
    http://ppw.hcs.ufl.edu/docs/pdf/manual.pdf, 2007.
  • S. Koehler, J. Curreri, and A. George,
    "Challenges for Performance Analysis in
    High-Performance Reconfigurable Computing," Proc.
    of Reconfigurable Systems Summer Institute 2007
    (RSSI), Urbana, IL, July 17-20, 2007.
  • J. Curreri, S. Koehler, B. Holland, and A.
    George, "Performance Analysis with High-Level
    Languages for High-Performance Reconfigurable
    Computing," Proc. of 16th IEEE Symposium on
    Field-Programmable Custom Computing Machines
    (FCCM), Palo Alto, CA, Apr. 14-15, 2008.