Title: Using Jikes RVM to Understand the Hardware Performance of Java Applications
1 Using Jikes RVM to Understand the Hardware Performance of Java Applications
- Michael Hind
- Manager, Dynamic Optimization Group
- Joint work with
- Peter Sweeney, Brendon Cahoon, Perry Cheng, David Grove
- IBM Watson Research Center
2 Memory Wall is Getting Larger
- Distance to memory increasing
- Power4 estimates
- L1: 3 cycles
- L2: 12 cycles
- L3: 120 cycles
- Memory: 350 cycles!
- Interprocessor latencies also increasing
- Multiple processors on a chip
- Multi-chip modules
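To make the cost of the memory wall concrete, the Power4 latency estimates above can be folded into a weighted-average load latency. The sketch below is illustrative only: the class name and the fractions of loads served by each level are hypothetical and are not measurements from this talk.

```java
// Weighted-average load latency from the Power4 estimates on this slide.
// The per-level fractions used in main are made-up illustration values.
final class LoadLatencyEstimate {
    // Latency estimates in cycles for L1, L2, L3, memory (from the slide).
    static final double[] LATENCY_CYCLES = {3, 12, 120, 350};

    /** fractions[i] is the share of loads satisfied at level i; must sum to 1. */
    static double averageLatency(double[] fractions) {
        if (fractions.length != LATENCY_CYCLES.length) {
            throw new IllegalArgumentException("need one fraction per level");
        }
        double sum = 0.0;
        double weighted = 0.0;
        for (int i = 0; i < fractions.length; i++) {
            sum += fractions[i];
            weighted += fractions[i] * LATENCY_CYCLES[i];
        }
        if (Math.abs(sum - 1.0) > 1e-9) {
            throw new IllegalArgumentException("fractions must sum to 1");
        }
        return weighted;
    }

    public static void main(String[] args) {
        // Hypothetical mix: 90% L1, 6% L2, 3% L3, 1% memory.
        double avg = averageLatency(new double[] {0.90, 0.06, 0.03, 0.01});
        System.out.printf("average load latency: %.2f cycles%n", avg);
    }
}
```

With this hypothetical mix, the 1% of loads that reach memory account for about a third of the average latency, which is why a small change in miss rates can dominate load performance.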
3 Software Getting More Complex
- Complexity of understanding
- Instruction stream derived from diverse collection of components
- OS, runtimes (JVMs, GC, Compilation), database, application server, etc.
- Java is popular for modern workloads, foundation for web services
- Object-oriented
- More use of pointers
- Large number of small methods
- Many indirect (virtual) calls, can't be predetermined
- Garbage collection
- Dynamic semantics
- Compilation/optimization occurs at runtime (JIT)
4 Impact of GC on Performance: db (SPECjvm98)
5 db Key Data Structures
6 First Steps
- We need infrastructure to understand memory performance of Java workloads
- Approaches
- Simulated Execution
- Rich details of execution
- Configurable
- Cheap
- Hardware Execution
- Accurate performance information
- Completeness
- Speed of execution
7 Our Goal
- Extend existing HPM infrastructure to understand Java performance
- Quantify latency impact on Java workloads
- Correlate performance with threads/region of code
- Quantify VM (GC, opts, etc) impact on latency
- VM-aware approach
- Extend Jikes RVM
8 Hardware Performance Monitors (HPMs)
- Most architecture implementations provide HPMs to count hardware events
- Processor cycles, instructions completed, cache misses, synchronization, branch mispredictions, etc.
- AIX 5.1 provides a kernel extension API to access counters
- Initialize counters to count a group of events
- Start/Stop/Read counters
- Library deals with overflows and kernel thread context switches
- Also provides a command line interface
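One way to picture the overflow handling mentioned above is to widen a narrow, wrapping hardware counter into a 64-bit software total around a start/sample/stop cycle. The sketch below is not the AIX library's code: the class name, the method names, and the 32-bit counter width are all assumptions made for illustration, and raw counter values are passed in by the caller so the example runs without real HPM access.

```java
// Sketch of overflow handling: widen a wrapping 32-bit hardware counter
// into a 64-bit running total. Names and the 32-bit width are assumptions.
final class WideCounter {
    private static final long MASK_32 = 0xFFFF_FFFFL;
    private long lastRaw;     // last raw value seen, in 0..2^32-1
    private long total;       // accumulated 64-bit event count
    private boolean running;

    /** Begin counting from the given raw hardware value. */
    void start(long rawNow) {
        lastRaw = rawNow & MASK_32;
        running = true;
    }

    /** Fold in events since the last sample; tolerates one wrap per interval. */
    void sample(long rawNow) {
        if (!running) {
            return;
        }
        long raw = rawNow & MASK_32;
        total += (raw - lastRaw) & MASK_32;  // modular difference absorbs the wrap
        lastRaw = raw;
    }

    /** Take a final sample and stop counting. */
    void stop(long rawNow) {
        sample(rawNow);
        running = false;
    }

    long read() {
        return total;
    }
}
```

The modular difference stays correct only if the counter is sampled at least once per wrap, which is one reason a library would handle overflows itself instead of leaving them to each caller.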
9 AIX 5.1 Performance Monitor API
[Figure: an application's OS threads run on the operating system, which exposes the Performance Monitor API; each processor (up to Proc P) has its own performance counters (Perf Ctr).]
10 Using HPMs for Java
- Problems
- Distinguishing application and VM
- Distinguishing Java threads
- SMP support
- Low overhead
- Temporal (information may vary over time)
- Approach
- Modify Jikes RVM to address these issues without
requiring application modification
11 Jikes RVM
- Open source research VM that executes Java applications, written in Java
- Formerly called Jalapeno
- Runs on Linux/IA32, AIX/PPC, Linux/PPC
- Universities using system for a broad range of research topics
- 18 user publications in 2002 at top conferences
- 7 courses using system, teaching resources (tutorials) available on web site
- www.ibm.com/developerworks/oss/jikesrvm
12 Jikes RVM Features
- Implemented in Java programming language (300KLOC)
- Reduces seams between VM and applications
- VM can be dynamically optimized
- Compile-only strategy
- Multiple compilers, mixing code is seamless
- Aggressive optimizing compiler
- 3 levels of IR, all with Java type info (CFG, SSA, dominators, etc.)
- Lightweight m:n thread implementation
- Java threads are multiplexed on OS threads, important for scalability, GC transition
- Quasi-preemptive scheduling (using compiler-generated yield points)
- Adaptive optimization system
- Yieldpoint-based sampling, cost/benefit model, what to recompile and how
- Online feedback-directed inlining, code layout
- Type-accurate (exact) parallel GC/allocation
- Copying and noncopying, generational and nongenerational, hybrids
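The cost/benefit model in the adaptive optimization system bullet can be pictured as follows: recompile a method at a higher level only when the estimated compile cost plus the estimated faster future running time beats leaving the code alone. The sketch below is one common formulation of such a model and may differ in detail from what Jikes RVM implements; the class name, parameters, and all numbers are hypothetical.

```java
// Sketch of a recompilation cost/benefit decision. This is an illustrative
// formulation, not necessarily the exact model used inside Jikes RVM.
final class RecompileDecision {
    /**
     * futureTime:     estimated remaining running time if the method is left as is.
     * speedup[j]:     speed of level j relative to the current code (> 1 is faster).
     * compileCost[j]: estimated cost of compiling at level j, in the same time unit.
     * Returns the index of the most profitable level, or -1 to keep the current code.
     */
    static int choose(double futureTime, double[] speedup, double[] compileCost) {
        int best = -1;
        double bestTotal = futureTime;  // cost of doing nothing
        for (int j = 0; j < speedup.length; j++) {
            double total = compileCost[j] + futureTime / speedup[j];
            if (total < bestTotal) {
                bestTotal = total;
                best = j;
            }
        }
        return best;
    }
}
```

For example, with hypothetical speedups {1.5, 3.0} and compile costs {10, 80}, a method expected to run for 100 more time units is best recompiled at the cheaper level, one expected to run for 1000 justifies the expensive level, and one expected to run for 5 is left alone.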
13 Adaptive Optimization System
14 Jikes RVM M:N Threading
[Figure: Java threads 1 through M of the Java application are multiplexed by the JVM onto OS threads, which run on the operating system.]
An external timer fires every 10 ms to signal that a thread switch is needed; the actual interval can be shorter (voluntary yield) or longer (the OS runs another pthread).
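The interaction between the timer and compiler-generated yield points can be modeled in a few lines: the timer only sets a flag, and the running Java thread gives up its virtual processor the next time it reaches a yield point. This is a toy, single-threaded model with hypothetical names, not Jikes RVM's scheduler code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Toy model of quasi-preemptive scheduling with yield points.
// All class and method names here are hypothetical.
final class YieldpointScheduler {
    private final ArrayDeque<String> readyQueue = new ArrayDeque<>();
    private String running;
    private boolean switchRequested;
    final List<String> switchLog = new ArrayList<>();  // threads switched in, in order

    YieldpointScheduler(List<String> threads) {
        readyQueue.addAll(threads);
        running = readyQueue.removeFirst();
    }

    /** Called by the external timer (every 10 ms on the slide). */
    void timerTick() {
        switchRequested = true;
    }

    /** Compiler-inserted check, e.g. at method prologues and loop back edges. */
    void yieldpoint() {
        if (switchRequested) {
            threadSwitch();
        }
    }

    /** Voluntary yield: switches without waiting for the timer. */
    void voluntaryYield() {
        threadSwitch();
    }

    private void threadSwitch() {
        switchRequested = false;
        readyQueue.addLast(running);        // round-robin: outgoing thread goes to the back
        running = readyQueue.removeFirst();
        switchLog.add(running);
    }

    String running() {
        return running;
    }
}
```

Because a switch happens only at a yield point or a voluntary yield, every thread switch is a well-defined place in the VM, which is what makes it a convenient spot to read hardware counters.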
15 Jikes RVM Extension
[Figure: Java threads 1 through M of the Java application, multiplexed by the JVM onto OS threads; below the operating system, the Performance Monitor API exposes the performance counters (Perf Ctr) of each processor, up to Proc P.]
16 Jikes RVM Extension
17 Implementation Overview
- Modify Jikes RVM Java thread scheduler to capture HPM data for the duration of a Java thread's quantum
- Produces sequence of traces
- Mechanism applied to both application and VM threads
- One trace file per processor
- Overhead is 1.7% on SPECjbb2000
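Capturing HPM data per quantum amounts to reading the counters at every thread switch and attributing the difference since the previous switch to the outgoing Java thread. The sketch below shows that bookkeeping under stated assumptions: the names are hypothetical, and counter values are passed in by the caller so the example runs without real hardware counters.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: attribute hardware-counter deltas to the Java thread whose quantum
// just ended. All names are hypothetical; counters are supplied by the caller.
final class QuantumTracer {
    static final class Quantum {
        final int threadId;
        final long[] deltas;  // one delta per monitored event

        Quantum(int threadId, long[] deltas) {
            this.threadId = threadId;
            this.deltas = deltas;
        }
    }

    private final List<Quantum> trace = new ArrayList<>();
    private int currentThread;
    private long[] atQuantumStart;

    QuantumTracer(int firstThread, long[] countersNow) {
        currentThread = firstThread;
        atQuantumStart = countersNow.clone();
    }

    /** Scheduler hook: the outgoing thread's quantum ends, the incoming one's begins. */
    void onThreadSwitch(int incomingThread, long[] countersNow) {
        long[] deltas = new long[countersNow.length];
        for (int i = 0; i < countersNow.length; i++) {
            deltas[i] = countersNow[i] - atQuantumStart[i];
        }
        trace.add(new Quantum(currentThread, deltas));
        currentThread = incomingThread;
        atQuantumStart = countersNow.clone();
    }

    List<Quantum> trace() {
        return trace;
    }
}
```

Because the hook treats every thread the same way, application threads and VM threads (GC, compilation, scheduler) end up in the same trace and can be separated later by thread ID.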
18 Implementation Details
- Trace record data
- Processor ID
- Thread ID
- Real time start, duration
- Hardware counter values (cycles, cache misses, etc.)
- Trace record produced into buffer at thread switch
- consumed by separate Java thread
- one trace writer thread per processor
- two buffers used to avoid blocking
- Also provide a VM call to distinguish source code points that will generate a special trace record
- Allows programmer to focus on logical computations
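The trace record fields and the two-buffer hand-off described above might look roughly like the sketch below. It is a minimal illustration, not the Jikes RVM implementation: the class and field names are hypothetical, and dropping a record when the writer falls behind is this sketch's own policy, chosen only so that the producer never blocks.

```java
import java.util.ArrayList;
import java.util.List;

// One trace record per quantum; field names are hypothetical.
final class TraceRecord {
    final int processorId;
    final int threadId;
    final long startTime;
    final long duration;
    final long[] counters;  // cycles, cache misses, etc.

    TraceRecord(int processorId, int threadId, long startTime, long duration, long[] counters) {
        this.processorId = processorId;
        this.threadId = threadId;
        this.startTime = startTime;
        this.duration = duration;
        this.counters = counters;
    }
}

// Two buffers: the thread-switch code fills one while the trace writer
// thread drains the other, so the producer never waits on file I/O.
final class DoubleBuffer {
    private final int capacity;
    private List<TraceRecord> active = new ArrayList<>();
    private List<TraceRecord> ready = new ArrayList<>();  // awaiting the writer
    private int dropped;

    DoubleBuffer(int capacity) {
        this.capacity = capacity;
    }

    /** Producer side (thread switch). Returns false if the record was dropped. */
    synchronized boolean append(TraceRecord r) {
        if (active.size() == capacity) {
            if (!ready.isEmpty()) {  // writer has not drained the other buffer yet
                dropped++;
                return false;
            }
            List<TraceRecord> tmp = ready;  // swap the two buffers
            ready = active;
            active = tmp;
        }
        active.add(r);
        return true;
    }

    /** Consumer side (trace writer thread): take everything that is ready. */
    synchronized List<TraceRecord> drain() {
        List<TraceRecord> out = new ArrayList<>(ready);
        ready.clear();
        return out;
    }

    synchronized int dropped() {
        return dropped;
    }
}
```

Keeping one such pair of buffers per processor matches the one-trace-writer-per-processor design on the slide and avoids contention between processors.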
19 Initial Experience
- SPECjbb2000 benchmark
- Middle tier of 3-tier warehouse order system
- Each warehouse is a Java thread
- Configured to run 4 warehouses on 4 procs
- Measured over a 40 sec measurement period
- after a 10 sec ramp-up
- 27,951 records produced, 2.5MB
20 21 Threads
- 5 Application threads
- 1 main thread + 4 warehouses
- 4 Garbage collection threads
- Stop-the-world, parallel collection
- 4 Scheduler threads
- Simple load balancing
- 4 Trace writer threads
- Consumes trace records, writes to file
- 4 AOS threads
- 2 Profiler, Controller, Recompilation
21 Percentage of Total Cycles
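A breakdown like the one on this slide can be derived from the trace by summing the cycle counts of all records belonging to each thread and dividing by the grand total. The sketch below assumes a simplified record layout of {thread ID, cycles}; the class name, the layout, and the numbers in the example are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: per-thread share of total cycles from a list of trace records.
// The {threadId, cycles} record layout is an assumption for illustration.
final class CycleShare {
    /** records[i] = {threadId, cycles}; returns percent of total cycles per thread. */
    static Map<Integer, Double> percentByThread(long[][] records) {
        Map<Integer, Long> cycles = new LinkedHashMap<>();
        long total = 0;
        for (long[] r : records) {
            cycles.merge((int) r[0], r[1], Long::sum);
            total += r[1];
        }
        Map<Integer, Double> percent = new LinkedHashMap<>();
        for (Map.Entry<Integer, Long> e : cycles.entrySet()) {
            percent.put(e.getKey(), total == 0 ? 0.0 : 100.0 * e.getValue() / total);
        }
        return percent;
    }

    public static void main(String[] args) {
        long[][] records = {{1, 600}, {2, 300}, {1, 100}};  // made-up values
        System.out.println(percentByThread(records));
    }
}
```

Grouping thread IDs by role (application, GC, scheduler, trace writer, AOS) before summing would give a per-component view instead of a per-thread one.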
22 Investigating Load Latency
- Aggregate load latency is 22 cycles
- p690 Power4 Machine
23 Average Estimated Load Latency
24 Average Load Latency Cycles
25 Avg Load Latency Over Time For 1 Warehouse
26 Avg Load Latency Over Time For 1 Warehouse: Records > 1M cycles (1 msec)
27 Avg Load Latency Over Time For 1 Warehouse: 1 thread on 1 proc
Poor avg latency is mostly due to synchronization-induced thread yields
28 Applicability
- Other platforms?
- Jikes RVM also runs on Linux/IA32, Linux/PPC
- Straightforward to port code to access IA32 HPMs
- Does Linux provide OS thread distinction?
- Other JVMs without M:N scheduling?
- Aggregate information
- Yes, VM extension probably not required
- PM API already provides OS thread distinction
- Temporal information
- Requires access to thread switch points
- Need to modify OS?
29 Related Work
- Hardware counter tools
- DCPI (Alpha), VTune (Intel), SpeedShop (SGI)
- Ammons, Ball, Larus
- Profiling Java workloads
- Cain, Rajwar, Marden, Lipasti
- TPC-W, SPECweb99, SPECjbb2000 on 8-way RS64-III simulation
- Luo and John, Seshadri, John, Mericas
- SPECjbb2000, VolanoMark vs SPECint2000 on PIII and 2 PowerPC archs
- Karlsson, Moore, Hagersten, Wood
- SPECjbb2000, ECPerf on 16-way Sun Enterprise 6000 server
- HPM library packages
- HPM toolkit, PAPI, PCL
- JVMPI
- . . .
30 Future Work
- Short term
- Investigate components of load latency
- Refine intervals when scheduler is disabled (GC)
- Track other hardware events
- Graphical tool to navigate trace files
- Exploit other HPM mechanisms
- Correlate interval back to source
- Marker trace records
- Call stack
- Adaptive optimization system
- Exploit information in adaptive system
- . . .
31 Conclusions
- Understanding performance of MRTE is difficult, but important
- Thanks to Carole, Lizy, and Chris for organizing!
- Extended Jikes RVM to access HPMs
- Low overhead (< 2%)
- Distinguishing application and VM threads
- SMP support
- Provides temporal information
- First usage
- Load latency in SPECjbb2000