Title: Performance Tools in Managed Runtime Environments
1Performance Tools in Managed Runtime Environments
- Padma Apparao
- Performance Architect
- Intel Corporation
- March 23rd, 2003
2Outline
- Motivation
- Overview of Run-Time Workloads
- Characterization/Optimization Methodology
- Profiling Techniques & Tools
- Examples of Use
- Limitations & Desired Enhancements
3Introduction & Motivation
- Runtime environments introduce a level of indirection between the user code and the underlying hardware architecture
- Usage of tools in runtimes has subtle differences when compared to static apps
- Just-in-time compilation
- Code generation and layout differ between runs and within the same run
- For example, profiling tools need the ability to resolve an address down to its method and offset
- A whole new world of profiling the heap has opened up
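To illustrate the address-resolution point above, here is a minimal sketch. It assumes the runtime reports the start address and length of each compiled method; the `JitCodeMap` class, its `methodLoaded` callback, and the sample addresses are hypothetical and not part of any real profiler API.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical map from JIT-compiled code ranges to method names. */
final class JitCodeMap {
    /** One compiled copy of a method, covering [start, start + length). */
    private static final class Region {
        final String method;
        final long length;
        Region(String method, long length) {
            this.method = method;
            this.length = length;
        }
    }

    // Keyed by start address so floorEntry finds the closest region at or below an address.
    private final TreeMap<Long, Region> regions = new TreeMap<>();

    /** Assumed callback: invoked whenever a method is compiled or recompiled. */
    void methodLoaded(String method, long start, long length) {
        regions.put(start, new Region(method, length));
    }

    /** Resolves a sampled address to "method+0xoffset", or "unknown" if no region covers it. */
    String resolve(long address) {
        Map.Entry<Long, Region> e = regions.floorEntry(address);
        if (e == null) {
            return "unknown";
        }
        long offset = address - e.getKey();
        if (offset >= e.getValue().length) {
            return "unknown";
        }
        return e.getValue().method + "+0x" + Long.toHexString(offset);
    }

    public static void main(String[] args) {
        JitCodeMap map = new JitCodeMap();
        map.methodLoaded("Order.process", 0x1000, 0x80);
        map.methodLoaded("Cart.total", 0x2000, 0x40);
        System.out.println(map.resolve(0x1010));
        System.out.println(map.resolve(0x3000));
    }
}
```

Because code can be regenerated within a run, a real tool would presumably also need to timestamp regions so that an old sample is resolved against the layout that was live when it was taken; this sketch keeps only the latest layout.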
4Runtime App Characteristics
- Non-Steady State
- Non-uniformity of the app itself is a problem: not every transaction behaves the same way from start to finish
- Static applications may also have problems with steady state
- Managed runtimes may have additional steady-state issues due to
- Garbage collection characteristics that modify the behavior of the app
- Dynamic jitting
- Characterizing the homogeneity of the application is important
- Java programs tend to be overly synchronized
- Most locks are not contended (less than 10%)
- Locks are expensive (10% of the time is spent in lock-related instructions; varies by workload)
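One possible way to characterize homogeneity, sketched under assumptions: split a run into equal intervals, record throughput per interval, and look at the coefficient of variation. The `SteadyState` class, the 5% threshold, and the sample numbers are illustrative choices, not values from the talk.

```java
/** Hypothetical steady-state check based on per-interval throughput samples. */
final class SteadyState {
    /** Coefficient of variation: population standard deviation divided by the mean. */
    static double coefficientOfVariation(double[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        double sum = 0;
        for (double s : samples) {
            sum += s;
        }
        double mean = sum / samples.length;
        if (mean == 0) {
            throw new IllegalArgumentException("mean is zero");
        }
        double squares = 0;
        for (double s : samples) {
            squares += (s - mean) * (s - mean);
        }
        return Math.sqrt(squares / samples.length) / mean;
    }

    /** Treats the run as steady when the variation stays at or under the given threshold. */
    static boolean isSteady(double[] samples, double threshold) {
        return coefficientOfVariation(samples) <= threshold;
    }

    public static void main(String[] args) {
        double[] transactionsPerSecond = {980, 1010, 1005, 640, 995}; // made-up data
        System.out.println(coefficientOfVariation(transactionsPerSecond));
        System.out.println(isSteady(transactionsPerSecond, 0.05));
    }
}
```

A dip such as the fourth interval could correspond to a garbage collection or a burst of JIT activity; correlating the dips with GC logs is one way to tell them apart.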
5Runtime App Characteristics
- Workloads tend to be very branchy
- 1 branch in every 5 instructions or so.
- Pointer chasing, short methods.
- Large number of small methods, typical of OO code, leads to many calls and returns
- For Java apps, approx. 50 instructions per call
- Tradeoffs between inlining (code bloat) vs. extensive calls
Other names and brands may be claimed as the
property of others.
6Workload Homogeneity
Correlating performance metrics in non-steady
workloads is difficult
7Performance Methodology
- Magnitude of improvements depends on
- Maturity of application
- Previous level of performance tuning
- Performance Methodology
- Define and understand the workload: this is key
- Follow a systematic tuning approach
- Study effects at various levels: system, application, micro-architecture
- Use the right tools
- Use the Closed Loop Cycle: let the results of one iteration direct the next
- Make one change at a time
8Top-Down Closed Loop Methodology
Top-Down Approach
Closed Loop
9Types of Tools
- Hardware and Software
- Non-intrusive hardware counters
- Operating system or application code counters
- Profiling and Instrumentation
- System level profiling
- Application call-tree information via instrumentation
- Event Based and Time Based
- Sampling based on occurrence of particular events within the processor, e.g. cache misses, instructions retired
- Sampling based on clock ticks
- Select the appropriate kind of tool for your application
- Should be minimally intrusive
- Should provide relevant and accurate information to optimize your application
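The event-based sampling idea above can be sketched as a small simulation. This is not how a hardware counter is programmed; `EventSampler`, its `record` method, and the "sample after" period are hypothetical names used only to show how a sample is attributed to whatever method is running when the counter overflows.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Simulated event-based sampler: one sample per `sampleAfter` events. */
final class EventSampler {
    private final long sampleAfter;   // events between two samples (assumed counter period)
    private long untilNextSample;     // events left before the next overflow
    private final Map<String, Long> samples = new LinkedHashMap<>();

    EventSampler(long sampleAfter) {
        if (sampleAfter <= 0) {
            throw new IllegalArgumentException("sampleAfter must be positive");
        }
        this.sampleAfter = sampleAfter;
        this.untilNextSample = sampleAfter;
    }

    /** Reports that `events` events (e.g. cache misses) occurred while `method` was running. */
    void record(String method, long events) {
        if (events < 0) {
            throw new IllegalArgumentException("events must not be negative");
        }
        // Every time the counter overflows, charge one sample to the running method.
        while (events >= untilNextSample) {
            events -= untilNextSample;
            untilNextSample = sampleAfter;
            samples.merge(method, 1L, Long::sum);
        }
        untilNextSample -= events;
    }

    long samplesFor(String method) {
        return samples.getOrDefault(method, 0L);
    }

    public static void main(String[] args) {
        EventSampler sampler = new EventSampler(100);
        sampler.record("parse", 250);
        sampler.record("lookup", 60);
        sampler.record("parse", 90);
        System.out.println("parse=" + sampler.samplesFor("parse")
                + " lookup=" + sampler.samplesFor("lookup"));
    }
}
```

A larger period lowers intrusiveness but also lowers resolution for methods that generate few events; time-based sampling follows the same scheme with clock ticks as the event.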
10Tools Hierarchy
- System Level Monitoring
- Processor
- Memory
- Network
- Disk
- Application Level Profiling
- Lock contention
- Heap contention
- Threading
- Good vs. bad APIs
- Micro-Architecture Level Event monitoring
- Branch prediction
- Cache performance
- Data alignment
11Windows vs. Linux
Linux Tools depend on kernels supported
12System Level Tools
- Perfmon / iostat / sar
- Counters arranged by object (subsystem)
- Processor, Memory, Disk, Network, File System
- Derive standard formulas (ratios) for good vs. bad subsystem performance
- File system usage, CPI, cache behaviour
- Disk and network I/O bottlenecks
- Memory latency
- Advantages
- Low-hanging fruit
- Helps identify obvious problems early
- Low intrusiveness on the system
- These problems are usually easy to fix
13System Level Tools
- Perfmon on Windows
- Well integrated and complete
- New objects can be added to the registry and are detected automatically
- Counters available for the runtime in Microsoft .NET
- Object CLR Memory: GC counters are exposed
- On Linux, extensive information is available in /proc, but tools are needed to extract it
- New drivers may put information into /proc
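As an example of the kind of small tool needed to extract information from /proc, the sketch below turns the aggregate `cpu` line of `/proc/stat` into user, kernel, and idle shares. The `ProcStat` class and the grouping of counters into three buckets are illustrative assumptions; the fallback sample line is made up.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper that summarizes the aggregate "cpu" line of /proc/stat. */
final class ProcStat {
    /** Returns {user, kernel, idle} shares in the range 0..1. */
    static double[] shares(String cpuLine) {
        String[] f = cpuLine.trim().split("\\s+");
        if (f.length < 5 || !f[0].equals("cpu")) {
            throw new IllegalArgumentException("not an aggregate cpu line: " + cpuLine);
        }
        // Use at most the first eight counters (user, nice, system, idle, iowait, irq,
        // softirq, steal). Later guest fields are skipped to avoid possible double counting.
        int n = Math.min(f.length - 1, 8);
        long[] t = new long[8];
        long total = 0;
        for (int i = 0; i < n; i++) {
            t[i] = Long.parseLong(f[i + 1]);
            total += t[i];
        }
        if (total == 0) {
            throw new IllegalArgumentException("all counters are zero");
        }
        double user = t[0] + t[1];          // user + nice
        double kernel = t[2] + t[5] + t[6]; // system + irq + softirq
        double idle = t[3] + t[4];          // idle + iowait
        return new double[] {user / total, kernel / total, idle / total};
    }

    public static void main(String[] args) throws IOException {
        Path stat = Paths.get("/proc/stat");
        // Fall back to a made-up line on systems without /proc.
        String line = Files.exists(stat)
                ? Files.readAllLines(stat).get(0)
                : "cpu  400 0 100 450 50 0 0 0 0 0";
        double[] s = shares(line);
        System.out.printf("user=%.1f%% kernel=%.1f%% idle=%.1f%%%n",
                100 * s[0], 100 * s[1], 100 * s[2]);
    }
}
```

The counters are cumulative since boot, so a monitoring tool would normally take two snapshots and apply the same arithmetic to the differences.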
14IOstat Example
iostat on Linux can report queue sizes, wait time in the queue, and service time
15Application Level Tools
- Profilers
- VTune, Visual Quantify, Metrowerks CodeWarrior, JProbe, OptimizeIt, strace, ltrace
- Show where the time is being spent
- ntoskrnl.exe (kernel time)
- hal.dll (hardware drivers)
- ntdll.dll (synchronization, heap/memory management)
- vmlinux
- Analyze results to determine
- Kernel vs. User time
- Lock implementation latency
- Efficiency of protected resource
- Good vs. Bad APIs
- Thread and lock contention issues
16Application Level Profilers: Heap Profiling
- Memory and heap profiling
- Heap is a tangled web of object references
- Sizes of heap generations, heap expansion and shrinkage
- How many objects are created, and of what type/class
- Nature of object graph connectivity and depth
- How much heap is used, how much is transient
- Identification of memory leaks
- Hprof (with Java HAT and HPjmeter) used for heap profiling
- JProbe Memory Debugger
- http://www.sitraka.com/software/jprobe/jprobedebugger.html
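For the "how much heap is used" question, a rough in-process view is possible without a profiler. The sketch assumes a JVM that provides the `java.lang.management` API, which is newer than these slides; the `HeapSnapshot` class name is hypothetical, and the pool names printed depend on the JVM and collector in use.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

/** Hypothetical helper that prints overall and per-pool heap usage. */
final class HeapSnapshot {
    /** Bytes of heap currently in use, as reported by the JVM. */
    static long usedHeapBytes() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
    }

    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.printf("heap used=%d committed=%d max=%d%n",
                heap.getUsed(), heap.getCommitted(), heap.getMax());

        // Per-pool view, e.g. the generations of a generational collector.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage(); // may be null for an invalid pool
            if (pool.getType() == MemoryType.HEAP && usage != null) {
                System.out.printf("  %s used=%d%n", pool.getName(), usage.getUsed());
            }
        }
    }
}
```

This reports sizes only. Object types, reference graphs, and leak suspects still require a heap profiler such as the tools listed above.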
17Heap Profiling Example (using Hprof)
18Residual Objects (Hprof data displayed by HPjmeter)
Gives the objects still lingering in the heap and which method allocated these objects
19Call Graph Data showing CPU Time distribution (Hprof data displayed by HPjmeter)
CPU time spent in each method and call graph tree
20Garbage Collection
- Garbage Collection
- How often does GC kick in?
- Which method invoked GC?
- What is the GC pause time?
- How many objects were reclaimed during each GC?
- Use verbosegc to gather GC stats
- CLR Allocation Profiler (AP)
- Details of object allocation inside the CLR's heap
- Sizes and frequency of collection of each generation
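Besides verbose GC logging, the "how often" and "how long" questions can be approximated from inside a Java program. This is a minimal sketch assuming a JVM with the `java.lang.management` API (newer than these slides); `GcStats` is a hypothetical name, and the reported time is the accumulated collection time per collector, which is not necessarily the same as application pause time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper that sums GC counts and times over all collectors. */
final class GcStats {
    /** Returns {total collections, total collection time in milliseconds}. */
    static long[] totals() {
        long count = 0;
        long millis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the value is undefined for a collector.
            count += Math.max(0, gc.getCollectionCount());
            millis += Math.max(0, gc.getCollectionTime());
        }
        return new long[] {count, millis};
    }

    public static void main(String[] args) {
        long[] before = totals();
        // Allocate short-lived garbage so that the collector has something to do.
        for (int i = 0; i < 200_000; i++) {
            byte[] junk = new byte[1024];
            junk[0] = (byte) i;
        }
        long[] after = totals();
        System.out.printf("collections=%d, time=%d ms%n",
                after[0] - before[0], after[1] - before[1]);
    }
}
```

Sampling `totals()` around a transaction or a benchmark interval gives GC frequency per unit of work; it does not say which method triggered a collection or how many objects were reclaimed.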
21Garbage Collection Data with 6 GC threads
GCViewer tagtraum industries
22Garbage Collection Data with 10 GC threads
23Object Profiling
- Object profiling: can track objects, their sizes, lifetimes, memory allocations
- Object references and scope, hot objects
- Object access patterns
- Can start from an object and walk down its path of references
- Can start from a class and look at all its object allocations
- Allocator methods
24Object Lifetime Profiling
25Also possible: JIT and Lock Profiling
- Track JIT optimizations for profile information
- Rejitted code: how often does a method get jitted?
- Inlined functions
- Function splitting
- Lock profiling
- Useful for scalability/synchronization
- Thin and inflated lock statistics
- Contended locks
- With average contention, max contention
- Average and maximum hold times, acquisitions
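The lock statistics listed above (acquisitions, contended acquisitions, hold times) can be approximated at application level with a wrapper around `java.util.concurrent.locks.ReentrantLock`. This is a sketch under assumptions: `ProfiledLock` is a hypothetical class, a failed `tryLock()` is used as the definition of "contended", and JVM-internal details such as thin versus inflated locks are not visible from this level.

```java
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical lock wrapper that records acquisition and hold-time statistics. */
final class ProfiledLock {
    private final ReentrantLock lock = new ReentrantLock();
    // All statistics are read and written only while `lock` is held.
    private long acquisitions;
    private long contended;
    private long totalHoldNanos;
    private long maxHoldNanos;
    private long acquiredAt;

    void lock() {
        // If the fast path fails, another thread holds the lock: count it as contended.
        boolean wasContended = !lock.tryLock();
        if (wasContended) {
            lock.lock();
        }
        if (lock.getHoldCount() == 1) {
            acquiredAt = System.nanoTime(); // start timing at the outermost acquisition
        }
        acquisitions++;
        if (wasContended) {
            contended++;
        }
    }

    void unlock() {
        if (lock.getHoldCount() == 1) {
            long held = System.nanoTime() - acquiredAt;
            totalHoldNanos += held;
            maxHoldNanos = Math.max(maxHoldNanos, held);
        }
        lock.unlock();
    }

    /** Returns {acquisitions, contended acquisitions, total hold ns, max hold ns}. */
    long[] snapshot() {
        lock.lock();
        try {
            return new long[] {acquisitions, contended, totalHoldNanos, maxHoldNanos};
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ProfiledLock pl = new ProfiledLock();
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                pl.lock();
                try {
                    Math.sqrt(i); // stand-in for the protected work
                } finally {
                    pl.unlock();
                }
            }
        };
        Thread a = new Thread(work);
        Thread b = new Thread(work);
        a.start();
        b.start();
        a.join();
        b.join();
        long[] s = pl.snapshot();
        System.out.printf("acquisitions=%d contended=%d avgHoldNs=%d maxHoldNs=%d%n",
                s[0], s[1], s[2] / s[0], s[3]);
    }
}
```

The wrapper itself adds two `System.nanoTime()` calls per outermost acquisition, which is a reminder that lock profiling is intrusive and can change the contention it measures.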
26Thread Profiling
- Thread Analyzer: useful for multithreaded programs
- Detect race conditions
- Detect deadlocks and predict them
- Display status of threads: running, blocked, etc.
- Point to source code where contentions occur
- JProbe Thread Analyzer
- http://www.sitraka.com/software/jprobe/jprobethreadalyzer.html
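Two of the items above, thread status and deadlock detection, can be sketched with standard JDK calls, assuming a JVM that provides `java.lang.management` (newer than these slides). `ThreadReport` is a hypothetical name. The sketch only reports monitor deadlocks that already exist; predicting deadlocks or detecting data races needs the dedicated tools named above.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical helper that summarizes thread states and existing monitor deadlocks. */
final class ThreadReport {
    /** Counts live threads by state (RUNNABLE, BLOCKED, WAITING, ...). */
    static Map<Thread.State, Integer> countByState() {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            counts.merge(t.getState(), 1, Integer::sum);
        }
        return counts;
    }

    /** Number of threads that are currently deadlocked on object monitors. */
    static int deadlockedThreadCount() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findMonitorDeadlockedThreads(); // null when no deadlock is found
        return ids == null ? 0 : ids.length;
    }

    public static void main(String[] args) {
        System.out.println("states: " + countByState());
        System.out.println("deadlocked threads: " + deadlockedThreadCount());
    }
}
```

A steadily growing BLOCKED count across repeated snapshots is a hint of lock contention; the stack traces returned by `Thread.getAllStackTraces()` then point to the source lines where the threads are waiting.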
27Examples of Profiling Tools
- Optimizeit
- Has a Java Performance Suite with a profiler, a code coverage tool and a thread debugger
- Optimizeit Profiler has 2 profilers
- CPU profiler handles sampling and instrumentation techniques and provides information on execution time
- Memory profiler helps identify classes that consume the most memory or have the most instances
- JProbe
- Has Memory Debugger, Java Profiler, Thread Analyzer and Coverage
- Memory object lifetime analysis, identifying loitering objects and short-lived objects
- Thread profiler/analyzer identifies data race problems, detects deadlocks and predicts them
- VTune Performance Analyzer
- Well integrated with Java and .NET
28µArchitecture Tools
- Micro-Architecture Performance
- VTune and Emon (Intel Architecture specific tools)
- Knowledgeable of CPU perf counters
- Cache hit/miss ratios
- CPI (path length)
- Branch performance
- Misaligned data
- Use after all other system and application level issues have been resolved
- Bigger investment to optimize
- Restructuring code / fundamental changes to design, specific to Intel architecture
- Usually yields no more than a 5-10% increase
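The metrics above are ratios of raw counter values. The sketch below shows the arithmetic only; `CounterMetrics` is a hypothetical class, and the counter values in `main` are made up, chosen to mirror the rough figures quoted earlier in the talk (about 1 branch per 5 instructions and about 50 instructions per call).

```java
/** Hypothetical helper that derives ratios from raw performance-counter values. */
final class CounterMetrics {
    private static double ratio(long numerator, long denominator) {
        if (denominator <= 0) {
            throw new IllegalArgumentException("denominator must be positive");
        }
        return (double) numerator / denominator;
    }

    /** Cycles per instruction. */
    static double cpi(long cycles, long instructions) {
        return ratio(cycles, instructions);
    }

    /** Fraction of cache accesses that missed. */
    static double missRatio(long misses, long accesses) {
        return ratio(misses, accesses);
    }

    /** Fraction of branches that were mispredicted. */
    static double mispredictRatio(long mispredicted, long branches) {
        return ratio(mispredicted, branches);
    }

    /** Average number of instructions per event (e.g. per branch or per call). */
    static double instructionsPer(long instructions, long events) {
        return ratio(instructions, events);
    }

    public static void main(String[] args) {
        long cycles = 2_000_000;       // made-up counter values
        long instructions = 1_000_000;
        long branches = 200_000;
        long calls = 20_000;
        System.out.println("CPI = " + cpi(cycles, instructions));
        System.out.println("instructions per branch = " + instructionsPer(instructions, branches));
        System.out.println("instructions per call = " + instructionsPer(instructions, calls));
    }
}
```

Comparing such ratios before and after a single change fits the closed-loop methodology: the counters show whether the change affected the micro-architectural behavior it was meant to affect.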
29Desired Enhancements to Existing Tools
- Current profiling tools use JVMPI to profile applications
- Need a cheaper way to capture frequently generated events
- Selection of appropriate methods to track
- Byte code instrumentation also available in some tools
- Heap Profiling
- Extremely intrusive: about 30-50x slower
- Less expensive heap profiling needed
- Profiling of JIT events is least intrusive, but requires more effort
- A one-stop shop for all platforms and operating systems will reduce the learning curve for tool usage