Using Jikes RVM to Understand the Hardware Performance of Java Applications


MRTE'03, March 23, 2003


1
Using Jikes RVM to Understand the Hardware
Performance of Java Applications

Michael Hind
Manager, Dynamic Optimization Group
Joint work with
Peter Sweeney, Brendon Cahoon, Perry Cheng, David
Grove
IBM Watson Research Center

2
Memory Wall is Getting Larger

Distance to memory increasing
Power4 estimates
L1: 3 cycles
L2: 12 cycles
L3: 120 cycles
Memory: 350 cycles!
Interprocessor latencies also increasing
Multiple processors on a chip
Multi-chip modules
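
To make the cost of these latencies concrete, the sketch below weights the Power4 estimates from this slide by the share of loads satisfied at each level. The hit fractions are illustrative assumptions only; the slides give no such breakdown.

```java
// Sketch: average load latency as a weighted sum of the per-level latencies.
// Latencies come from the slide; the hit fractions in main() are assumed.
final class MemoryWall {
    // Cycles for L1, L2, L3, and main memory (Power4 estimates from the slide).
    static final double[] LATENCY = {3, 12, 120, 350};

    // fractions[i] is the share of loads satisfied at level i; shares must sum to 1.
    static double averageLatency(double[] fractions) {
        if (fractions.length != LATENCY.length) {
            throw new IllegalArgumentException("expected one fraction per level");
        }
        double sum = 0.0;
        double average = 0.0;
        for (int i = 0; i < LATENCY.length; i++) {
            sum += fractions[i];
            average += fractions[i] * LATENCY[i];
        }
        if (Math.abs(sum - 1.0) > 1e-9) {
            throw new IllegalArgumentException("fractions must sum to 1");
        }
        return average;
    }

    public static void main(String[] args) {
        // Assumed mix: 90% L1, 7% L2, 2% L3, 1% memory.
        double[] assumed = {0.90, 0.07, 0.02, 0.01};
        System.out.println(averageLatency(assumed));
    }
}
```

With this assumed mix the average is about 9.44 cycles, and the 1% of loads that reach memory contribute more cycles than the 90% that hit in L1.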

3
Software Getting More Complex

Complexity of understanding
Instruction stream derived from diverse
collection of components
OS, runtimes (JVMs, GC, Compilation), database,
application server, etc.
Java is popular for modern workloads, foundation
for web services
Object-oriented
More use of pointers
Large number of small methods
Many indirect (virtual) calls, can't be
predetermined
Garbage collection
Dynamic semantics
Compilation/optimization occurs at runtime (JIT)

4
Impact of GC on Performance: db (SPECjvm98)
5
db Key Data Structures
6
First Steps

We need infrastructure to understand memory
performance of Java workloads
Approaches
Simulated Execution
Rich details of execution
Configurable
Cheap
Hardware Execution
Accurate performance information
Completeness
Speed of execution

7
Our Goal

Extend existing HPM infrastructure to understand
Java performance
Quantify latency impact on Java workloads
Correlate performance with threads/region of code
Quantify VM (GC, opts, etc) impact on latency
VM-aware approach
Extend Jikes RVM

8
Hardware Performance Monitors (HPMs)

Most architecture implementations provide HPMs to
count hardware events
Processor cycles, instructions completed, cache
misses, synchronization, branch mispredictions,
etc.
AIX 5.1 provides kernel extension API to access
counters
Initialize counters to count group of events
Start/Stop/Read counters
Library deals with overflows and kernel thread
context switches
Also provides command line interface
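
The initialize/start/stop/read cycle can be sketched as follows. The `EventCounters` interface and the `FakeCounters` stand-in are hypothetical names invented for this sketch; they are not the AIX API, and the event names are placeholders.

```java
// Hypothetical counter interface mirroring the steps on the slide.
// This is NOT the AIX Performance Monitor API; all names are assumptions.
interface EventCounters {
    void init(String... events); // choose the group of events to count
    void start();
    void stop();
    long[] read();               // one value per event, in init() order
}

// Stand-in implementation driven by explicit tick() calls,
// so the sketch runs without any hardware counter support.
final class FakeCounters implements EventCounters {
    private String[] events = new String[0];
    private long[] values = new long[0];
    private boolean running;

    public void init(String... events) {
        this.events = events.clone();
        this.values = new long[events.length];
        this.running = false;
    }

    public void start() { running = true; }

    public void stop() { running = false; }

    public long[] read() { return values.clone(); }

    // Simulates hardware activity: events are counted only while running.
    void tick(String event, long count) {
        if (!running) return;
        for (int i = 0; i < events.length; i++) {
            if (events[i].equals(event)) values[i] += count;
        }
    }
}
```

Activity outside a start/stop window is ignored, which is the property that lets counts be attributed to a specific interval.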

9
AIX 5.1 Performance Monitor API
[Diagram: the application runs on OS threads; the operating system exposes the Performance Monitor API, which reads the performance counters (Perf Ctr) of each processor, up to Proc P.]
10
Using HPMs for Java

Problems
Distinguishing application and VM
Distinguishing Java threads
SMP support
Low overhead
Temporal (information may vary over time)
Approach
Modify Jikes RVM to address these issues without
requiring application modification

11
Jikes RVM

Open source research VM that executes Java
applications, written in Java
Formerly called Jalapeno
Runs on Linux/IA32, AIX/PPC, Linux/PPC
Universities using system for a broad range of
research topics
18 user publications in 2002 at top conferences
7 courses using system, teaching resources
(tutorials) available on web site
www.ibm.com/developerworks/oss/jikesrvm

12
Jikes RVM Features

Implemented in Java programming language
(300KLOC)
Reduces seams between VM and applications
VM can be dynamically optimized
Compile-only strategy
Multiple compilers, mixing code is seamless
Aggressive optimizing compiler
3 levels of IR, all with Java type info (CFG,
SSA, dominators, etc.)
Lightweight m:n thread implementation
Java threads are multiplexed on OS threads,
important for scalability, GC transition
Quasi-preemptive scheduling (using
compiler-generated yield points)
Adaptive optimization system
Yieldpoint-based sampling, cost/benefit model,
what to recompile and how
Online feedback-directed inlining, code layout
Type-accurate (exact) parallel GC/allocation
Copying and noncopying, generational and
nongenerational, hybrids

13
Adaptive Optimization System
14
Jikes RVM M:N Threading
[Diagram: Java threads of the application (1 to I) and of the JVM (up to M) are multiplexed onto OS threads, which run on the operating system.]
An external 10 ms timer triggers the need for a thread switch; a quantum
can be shorter (voluntary yield) or longer (the OS runs another pthread)
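
A minimal sketch of quasi-preemptive scheduling with yield points, under stated assumptions: the timer only sets a flag, and the running thread notices it at the next yield point. The class and method names are invented for illustration, and the timer is simulated inside the loop rather than firing asynchronously.

```java
// Sketch only: names are hypothetical and the timer is simulated in-line.
final class YieldPointSketch {
    private boolean switchRequested; // set by the timer, cleared at a yield point
    private int switches;

    // Stands in for the external timer interrupt.
    void timerTick() { switchRequested = true; }

    // What a compiler-generated yield point amounts to here: a cheap flag check.
    void yieldPoint() {
        if (switchRequested) {
            switchRequested = false;
            switches++; // a real scheduler would pick the next Java thread here
        }
    }

    // Runs a loop with a yield point on every iteration (the loop back-edge).
    long runLoop(int iterations, int timerEvery) {
        if (timerEvery <= 0) throw new IllegalArgumentException("timerEvery must be positive");
        long sum = 0;
        for (int i = 0; i < iterations; i++) {
            if (i % timerEvery == 0) timerTick(); // simulated timer
            sum += i;
            yieldPoint();
        }
        return sum;
    }

    int switches() { return switches; }
}
```

Because a switch happens only at a yield point, the actual quantum depends on when the running code next reaches one, which is why it can differ from the timer period.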
15
Jikes RVM Extension
[Diagram: Java threads of the application and of the JVM are multiplexed onto OS threads; below the operating system, the Performance Monitor API exposes the performance counters (Perf Ctr) of each processor, up to Proc P.]
16
Jikes RVM Extension
17
Implementation Overview

Modify Jikes RVM Java thread scheduler to capture
HPM data for the duration of a Java thread's
quantum
Produces sequence of traces
Mechanism applied to both application and VM
threads
One trace file per processor
Overhead is 1.7 on SPECjbb2000
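
The per-quantum capture can be sketched as a hook called at every thread switch: read the counter, charge the delta since the previous switch to the outgoing thread, and start a new interval. `QuantumTracer` and its members are hypothetical names, and the `LongSupplier` stands in for reading a hardware cycle counter; this is not the Jikes RVM scheduler code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

// Sketch only: all names are assumptions made for illustration.
final class QuantumTracer {
    // One entry per scheduling quantum.
    static final class Quantum {
        final int processorId;
        final int threadId;
        final long cycles;

        Quantum(int processorId, int threadId, long cycles) {
            this.processorId = processorId;
            this.threadId = threadId;
            this.cycles = cycles;
        }
    }

    private final int processorId;
    private final LongSupplier cycleCounter; // stands in for a hardware counter read
    private final List<Quantum> trace = new ArrayList<>();
    private int currentThread;
    private long startCycles;

    QuantumTracer(int processorId, int firstThread, LongSupplier cycleCounter) {
        this.processorId = processorId;
        this.cycleCounter = cycleCounter;
        this.currentThread = firstThread;
        this.startCycles = cycleCounter.getAsLong();
    }

    // Called whenever the scheduler switches Java threads on this processor.
    void onThreadSwitch(int nextThread) {
        long now = cycleCounter.getAsLong();
        trace.add(new Quantum(processorId, currentThread, now - startCycles));
        currentThread = nextThread;
        startCycles = now;
    }

    List<Quantum> trace() { return trace; }
}
```

Since the hook sits in the scheduler, application and VM threads are traced the same way and the application needs no changes.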

18
Implementation Details

Trace record data
Processor ID
Thread ID
Real time start, duration
Hardware counter values (cycles, cache misses,
etc.)
Trace record written into a buffer at thread switch,
consumed by a separate Java thread
One trace writer thread per processor
Two buffers used to avoid blocking
Also provide a VM call that marks source code
points; each call generates a special trace record
Allows programmer to focus on logical
computations
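
A minimal sketch of the record layout and the two-buffer handoff. The field list follows the slide; the class names, field types, and the swap protocol are assumptions made for illustration, not the actual Jikes RVM implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Fields follow the slide; types and names are assumed.
final class TraceRecord {
    final int processorId;
    final int threadId;
    final long startTime;
    final long duration;
    final long[] counters; // e.g. cycles, cache misses

    TraceRecord(int processorId, int threadId, long startTime, long duration, long[] counters) {
        this.processorId = processorId;
        this.threadId = threadId;
        this.startTime = startTime;
        this.duration = duration;
        this.counters = counters.clone();
    }
}

// Two buffers: the scheduler appends to one while the trace writer drains the other.
final class DoubleBuffer {
    private List<TraceRecord> filling = new ArrayList<>();
    private List<TraceRecord> draining = new ArrayList<>();

    // Producer side, called at thread switch: a short append, no file I/O.
    synchronized void append(TraceRecord record) {
        filling.add(record);
    }

    // Consumer side: swap the buffers and return the full one.
    // The writer must finish with the returned list before calling swap() again,
    // because the next swap clears and reuses it.
    synchronized List<TraceRecord> swap() {
        List<TraceRecord> full = filling;
        draining.clear();
        filling = draining;
        draining = full;
        return full;
    }
}
```

The trace writer does its file output on the returned list outside the lock, so the thread switch path only ever waits for a list append.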

19
Initial Experience

SPECjbb2000 benchmark
Middle tier of 3-tier warehouse order system
Each warehouse is a Java thread
Configured to run 4 warehouses on 4 procs
Measured over a 40-second measurement period
after a 10-second ramp-up
27,951 records produced, 2.5MB

20
21 Threads

5 Application threads
1 main thread, 4 warehouses
4 Garbage collection threads
Stop-the-world, parallel collection
4 Scheduler threads
Simple load balancing
4 Trace writer threads
Consumes trace records, writes to file
4 AOS threads
2 Profiler, Controller, Recompilation

21
Percentage of Total Cycles
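
A breakdown of this kind can be derived from the trace by summing cycles per thread and dividing by the grand total. The sketch below assumes each record is reduced to a `{threadId, cycles}` pair; the class name and input layout are hypothetical, and the numbers in the usage note are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: input layout and names are assumptions.
final class CycleShare {
    // Each row of records is {threadId, cycles} for one trace record.
    static Map<Integer, Double> percentByThread(long[][] records) {
        Map<Integer, Long> totals = new LinkedHashMap<>();
        long all = 0;
        for (long[] r : records) {
            totals.merge((int) r[0], r[1], Long::sum);
            all += r[1];
        }
        Map<Integer, Double> percent = new LinkedHashMap<>();
        for (Map.Entry<Integer, Long> e : totals.entrySet()) {
            percent.put(e.getKey(), all == 0 ? 0.0 : 100.0 * e.getValue() / all);
        }
        return percent;
    }
}
```

For example, records of 300 and 100 cycles for thread 1 plus 100 cycles for thread 2 give thread 1 a share of 80% and thread 2 a share of 20%.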
22
Investigating Load Latency

Aggregate load latency is 22 cycles
p690 Power4 Machine

23
Average Estimated Load Latency
24
Average Load Latency Cycles
25
Avg Load Latency Over Time For 1 Warehouse
26
Avg Load Latency Over Time For 1 Warehouse:
Records > 1M cycles (1 msec)
27
Avg Load Latency Over Time For 1 Warehouse:
1 thread on 1 proc
Poor avg latency is mostly due to
synchronization-induced thread yields
28
Applicability