Title: HP Caliper
1 HP Caliper
- Eric Gouriou
- September 2003
2 Today's Agenda
- Intended audience of this presentation
- What is HP Caliper?
- Measurements
- Usage
- Caliper cheat sheet
- Limitations
- Future plans
- More information on Caliper, DSPP
- Questions
3 Intended Audience
- Developers
- Tuning experts
- Engineers porting to HP-UX IPF
- Anyone interested in performance
4 What is HP Caliper?
- Performance analysis and improvement tool
- Dynamic performance measurement tool for C / C++ / Fortran / assembly applications
- Data collection vehicle for compiler feedback (PBO)
- Works on all programs as is (32/64 bits, debug or optimized, stripped, etc.)
- Multiple measurements via a unified interface
- Provides insights thanks to
- Itanium Performance Monitoring Unit (PMU)
- Dynamic binary instrumentation
5 Key Features
- Default measurement configurations, configurable
- Selective process and module measurements
- Text and HTML reports
- Performance datafile
- Three measurement models
- start application under Caliper
- attach to running process
- auto-invocation
6 Measurements
- PMU event counts
- Total count of selected hardware events per process
- Negligible overhead
- Default set of events, can be overridden
- 400 events described in the Itanium 2 documentation
- Non-default use for advanced users
  -------------------------------------------
  Counter               Priv. Mask   Count
  -------------------------------------------
  IA64_INST_RETIRED     8 (USER)     3414917
  NOPS_RETIRED          8 (USER)      684477
  CPU_CYCLES            8 (USER)     1899187
  BACK_END_BUBBLE_ALL   8 (USER)      810631
  -------------------------------------------
  % of Cycles lost due to stalls (lower is better):
    42.68 = 100 * BACK_END_BUBBLE_ALL / CPU_CYCLES
  Effective CPI (lower is better)
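The derived metrics in the report can be recomputed from the raw counts. The sketch below uses the four counts shown on the slide; the stall formula is the one printed in the report, while the effective CPI formula (cycles per retired non-NOP instruction) is an assumption, since the slide does not show it.

```python
# Raw event counts copied from the sample report on this slide.
counts = {
    "IA64_INST_RETIRED": 3414917,
    "NOPS_RETIRED": 684477,
    "CPU_CYCLES": 1899187,
    "BACK_END_BUBBLE_ALL": 810631,
}


def stall_cycle_percent(c):
    """Percentage of cycles lost to stalls, using the formula from the report."""
    return 100 * c["BACK_END_BUBBLE_ALL"] / c["CPU_CYCLES"]


def effective_cpi(c):
    """ASSUMPTION: cycles per retired non-NOP instruction.

    The slide names the metric but does not print its formula.
    """
    useful_instructions = c["IA64_INST_RETIRED"] - c["NOPS_RETIRED"]
    return c["CPU_CYCLES"] / useful_instructions


print(f"stall cycles: {stall_cycle_percent(counts):.2f}%")  # 42.68%, as in the report
print(f"effective CPI (assumed formula): {effective_cpi(counts):.2f}")
```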
7 Measurements (cont'd)
- Histograms from samples of PMU data
- Allows identification of hotspots
- Module summary, function summary, function details, selected global counts and derived metrics
- Flat profile, Cache / TLB / Branch Prediction / ALAT
- Per-thread data available
- Very low overhead
- Function Summary
  --------------------------------------------------------------------------------
  % Total    Cumulat
       IP       % of         IP
  Samples      Total    Samples   Function                                File
  --------------------------------------------------------------------------------
     9.71       9.71       1286   livermore::init                         livermore.c
     6.03      15.75        799   livermore::main                         livermore.c
     4.07      19.82        539   libc.so.1::T_19_3c30_cl___doprnt_main   doprnt.c
     0.60      20.42         79   libc.so.1::_f80_to_dec                  bindec.c
     0.45      20.86         59   libc.so.1::getenv                       getenv.c
  ...
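As a rough illustration of how a function summary like this can be derived from IP samples, the sketch below counts samples per function and accumulates a running total. The sample data and function names are hypothetical, and this is not Caliper's actual implementation.

```python
from collections import Counter


def flat_profile(samples):
    """Return rows of (percent, cumulative_percent, count, function), hottest first."""
    counts = Counter(samples)
    total = sum(counts.values())
    rows = []
    running = 0
    for func, n in counts.most_common():
        running += n
        # Accumulate raw counts, not rounded percentages, to avoid rounding drift.
        rows.append((100 * n / total, 100 * running / total, n, func))
    return rows


# Hypothetical stream of sampled instruction pointers, already mapped to functions.
samples = ["init"] * 5 + ["main"] * 3 + ["getenv"] * 2

for pct, cum, n, func in flat_profile(samples):
    print(f"{pct:6.2f} {cum:6.2f} {n:5d}  {func}")
```

Accumulating counts rather than rounded percentages is one plausible reason the slide shows 15.75 in the cumulative column even though 9.71 + 6.03 = 15.74.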
8 Measurements (cont'd)
- Traces of PMU samples
- Provides full details for each sample
- Low overhead but high volume of data
- Customize configuration file for relevant data
  ---------------------------------------------------------------------------------------------
  -----------------------DCache Miss----------------------   ------IP Samples------
  Sample   Addr:Slot                       Data Runtime         Latency   Bundle Address
  Number   (module::function)              Address                        (module::function)
  ---------------------------------------------------------------------------------------------
  1        0x3eda00                        0x200000007979f700   5         0x502f0
           (dld.so::MM_malloc)                                            (dld.so::BU_grow)
  2        0x211c00                        0x200000007950b200   26        0x212a0
           (dld.so::LE_finish_create)                                     (dld.so::LE_finish_create)
  3        0x37bf00                        0x20000000795297b8   172       0x37c40
           (dld.so::R_apply_eplt_relocs)                                  (dld.so::R_apply_eplt_relocs)
  ...
9 Measurements (cont'd)
- Source-level event counts
- Function call counts, arc counts
- High overhead, precise counts
- Done via dynamic binary instrumentation
- Function Count Details
  ----------------------------------------------------------
  Total   Function                      File          Line
  ----------------------------------------------------------
  150     livermore::abs                livermore.c   405
  104     libc.so.1::__milli_memset
  92      libc.so.1::__milli_memmove
  ...
- Arc Counts
  ------------------------------------------------------------------------------------------
  Total   Taken   % Taken   Source Address   Source Function   File          Line,Col
                            Target Address   Target Function   File          Line,Col
  ------------------------------------------------------------------------------------------
  28672   28616   99        0x4005e702       livermore::init   livermore.c   376,7
10 Measurements (cont'd)
- Call graph profile (gprof-like)
- Flat profile and call graph
- High overhead
- Hybrid of exact counts and PMU sampling
- Call Graph
  -----------------------------------------------------------------------------
                         De-     Called/Total    Parents
  Index  % Time  Self    scen-   Called+Self     Name                  Index
                         dants   Called/Total    Children
  -----------------------------------------------------------------------------
                 0.00    0.00    1/1             ROOT                  1
  2      25.1    0.00    0.00    1               livermore::main       2
                 0.00    0.00    150/150         livermore::abs        52
                 0.00    0.00    30/30           livermore::clock      45
                 0.00    0.00    14/14           livermore::init       3
                 0.00    0.00    18/18           libc.so.1::printf     4
  -----------------------------------------------------------------------------
                 0.00    0.00    14/14           livermore::main       2
11 Measurements (cont'd)
- PBO profile gathering configuration
- Auto-invoked when compiling using +Oprofile=collect (+I deprecated)
- Data used to improve compiler optimizations with +Oprofile=use (+P deprecated)
- Can be done manually (caliper pbo ...), however not recommended, sub-optimal
- Variable overhead
  > cc +Oprofile=collect -o livermore livermore.c
  > ./livermore
  ...
  > ls flow*
  flow.data flow.log
  > cc +Oprofile=use +O3 -o livermore livermore.c
12 Usage
- Typical command line
  caliper config_file caliper_options program program_args
- Example
  > caliper fprof --process=all cc -o livermore livermore.c
- Configuration files
- packaged ones
- copy/modify
- command-line overrides
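The command-line shape above (configuration file, then Caliper options, then the program and its arguments) can be sketched as a small helper that assembles the argument vector. The helper name `caliper_command` is hypothetical, and the option spelling `--process=all` is an assumption about the slide's example, whose punctuation did not survive in this transcript.

```python
import shlex


def caliper_command(config, program, program_args=(), caliper_options=()):
    """Assemble argv in the order shown on the slide:
    caliper config_file caliper_options program program_args
    """
    return ["caliper", config, *caliper_options, program, *program_args]


# Rebuild the slide's example: profile a compiler run with the fprof configuration.
argv = caliper_command(
    "fprof",
    "cc",
    program_args=["-o", "livermore", "livermore.c"],
    caliper_options=["--process=all"],  # assumed spelling of the slide's option
)
print(shlex.join(argv))
```

On a system with Caliper installed, the resulting list could be handed to `subprocess.run(argv)`; keeping it as a list avoids shell quoting problems with program arguments.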
13 Usage (cont'd)
  ------------------------------------------------------------------------------------------------
  Type                              Configuration Files              Comments
  ------------------------------------------------------------------------------------------------
  Histograms                        fprof,                           reduced samples,
                                    dcache_miss, icache_miss,        very low impact
                                    dtlb_miss, itlb_miss,
                                    branch_prediction,
                                    alat_miss
  Call graph                        cgprof                           sampled + exact, high impact
  Sampled details                   pmu_trace                        large data volume
  Total HW event counts             total_cpu                        exact totals, no impact
  Exact source-level event counts   arc_count, func_count            exact details, high impact
  Compiler feedback                 pbo                              black box
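The table of packaged configurations can be captured as a small lookup, which is handy when scripting measurement runs. This is only a sketch: the dictionary and helper names are hypothetical, and the split of the cache and TLB entries into separate data and instruction configurations is a reading of the slide rather than something it states outright.

```python
# Measurement type and comment for each packaged configuration, as listed on the slide.
CONFIGS = {
    "fprof": ("Histograms", "reduced samples, very low impact"),
    "dcache_miss": ("Histograms", "reduced samples, very low impact"),
    "icache_miss": ("Histograms", "reduced samples, very low impact"),
    "dtlb_miss": ("Histograms", "reduced samples, very low impact"),
    "itlb_miss": ("Histograms", "reduced samples, very low impact"),
    "branch_prediction": ("Histograms", "reduced samples, very low impact"),
    "alat_miss": ("Histograms", "reduced samples, very low impact"),
    "cgprof": ("Call graph", "sampled + exact, high impact"),
    "pmu_trace": ("Sampled details", "large data volume"),
    "total_cpu": ("Total HW event counts", "exact totals, no impact"),
    "arc_count": ("Exact source-level event counts", "exact details, high impact"),
    "func_count": ("Exact source-level event counts", "exact details, high impact"),
    "pbo": ("Compiler feedback", "black box"),
}


def configs_of_type(kind):
    """Return the configuration names for one measurement type, sorted by name."""
    return sorted(name for name, (t, _) in CONFIGS.items() if t == kind)


def high_impact_configs():
    """Return configurations the slide marks as high impact."""
    return sorted(name for name, (_, note) in CONFIGS.items() if "high impact" in note)


print(configs_of_type("Histograms"))
print(high_impact_configs())
```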
14 Caliper Cheat Sheet
- Where should I start?
- Global view
- fprof, both for profile and per-process derived metrics
- cgprof, caller/callee, check for surprises
- dcache_miss, use latency threshold to show expensive misses
- Drill-down
- Restrict processes, libraries, functions measured
- What is missing for a global view?
- System-wide measurements
- Multiplexed global counts (vs. many total_cpu runs)
15 Caliper Cheat Sheet (cont'd)
- Tuning the data collection parameters
- Multi-process application? Check process tree output, select processes using --process
- Multi-threaded application? Check --threads=all (per-thread histograms) versus --threads=sum-all (default, aggregated data)
- Libraries of interest or out of your control?
- Use --module-include / --module-exclude
- Functions of interest?
- Check --user-regions=rum/sum and/or triggered samples
16 Caliper Cheat Sheet (cont'd)
- Better reports
- Use HTML output (--html), text is the default
- Use datafiles
- Allow multiple reports for a single run
- Faster collection in multi-process runs
- Check source-level reporting
- --report-details=statement
- Vary amount of details generated
17 Caliper Cheat Sheet (cont'd)
- PBO
- Performance for free for some applications (almost)
- Use +Oprofile=collect/use
- caliper pbo works on +O1 binaries but isn't recommended
- Can use chatr +I enable to enable auto-invocation
- Trade-offs for large multi-process applications, 1 vs. many Caliper
18 Limitations
- Application characteristics
- no dynamic library reload
- Measurement control
- pbo profile collection requires +O1 binary (automatic when using +Oprofile=collect)
- HW limits the measurements possible per run
- per-thread data limited to histograms
- Other
- emulated PA binaries are not measured
- minimal dynamic code support
- limited gcc/g++ support
- setuid binaries require workaround
- limited support for MPAS binaries
19 Future Plans
- PMU Measurements
- multiplexed PMU runs
- richer derived metrics
- system-wide measurements
- kernel profiles
- PBO
- PMU cache data collected for PBO
- Data Files
- aggregation
- merging
- diffing
20 Future Plans (cont'd)
- Usability
- Graphs with HTML reports
- Reports on demand
- Function context
- Call stacks
- Remove limitations
- Detach for runs involving instrumentation
- MPAS applications
- Library load/unload
- Dynamically generated code
21 More Information
- The Caliper web page is on the DSPP website
- <http://www.hp.com/go/hpcaliper>
- Documentation / Support / Downloads
- The Caliper mailing lists
- Majordomo lists <majordomo_at_cxx.cup.hp.com>
- For product announcements
- <caliper-announce>
- For announcements and user forum
- <caliper>
22 DSPP Tools and Resources for Itanium 2 Set You Up for Success
- Software
- development environments, compilers, operating systems, installation/configuration tools, performance tools and more
- Technical documentation
- white papers, tutorials, reference documents and manuals, FAQs, known problems, sample code, etc.
- Training and Education
- online and classroom training
23 More DSPP Tools and Resources
- Community
- Itanium forums, source code repository, document sharing and mailing lists
- Equipment
- rentals and purchase discounts
- Partner Resources
- News and Events
24 Where to go
- Start with the Itanium web site for DSPP partners
- http://www.hp.com/go/dspp_itanium
- Contact points for additional information, general support, equipment, localization resources and more
- Americas: spp_at_cup.hp.com
- telephone: 1.800.249.3294
- Europe: dspp.emea_at_hp.com
- telephone: 800.100.929.70
- Asia-Pac: hpdev.support_at_hp.com or go to www.hp.com/go/dspp for local country contacts
25 Questions?