Using Jikes RVM to Understand the Hardware Performance of Java Applications

Michael Hind, Manager, Dynamic Optimization Group. MRTE'03, March 23, 2003.

1
Using Jikes RVM to Understand the Hardware
Performance of Java Applications
  • Michael Hind
  • Manager, Dynamic Optimization Group
  • Joint work with
  • Peter Sweeney, Brendon Cahoon, Perry Cheng, David
    Grove
  • IBM Watson Research Center

2
Memory Wall is Getting Larger
  • Distance to memory increasing
  • Power4 estimates
  • L1: 3 cycles
  • L2: 12 cycles
  • L3: 120 cycles
  • Memory: 350 cycles!
  • Interprocessor latencies also increasing
  • Multiple processors on a chip
  • Multi-chip modules
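
To make the cost of these distances concrete, average load latency can be viewed as a weighted sum of the per-level latencies. The sketch below uses the Power4 estimates from this slide; the hit fractions are made-up illustrative values, not measurements, and the class and method names are hypothetical.

```java
// Illustrative only: weighted-average load latency across the memory hierarchy.
// Latencies are the Power4 estimates from the slide; the hit fractions supplied
// by the caller are assumptions, not measured data.
final class LoadLatencyModel {
    // Cycles to satisfy a load from L1, L2, L3, and main memory.
    static final double[] LATENCY_CYCLES = {3, 12, 120, 350};

    // fractions[i] is the share of loads satisfied at level i; must sum to 1.
    static double averageLatency(double[] fractions) {
        if (fractions.length != LATENCY_CYCLES.length) {
            throw new IllegalArgumentException("expected one fraction per level");
        }
        double sum = 0.0;
        double total = 0.0;
        for (int i = 0; i < fractions.length; i++) {
            sum += fractions[i];
            total += fractions[i] * LATENCY_CYCLES[i];
        }
        if (Math.abs(sum - 1.0) > 1e-9) {
            throw new IllegalArgumentException("fractions must sum to 1");
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical mix: 90% L1, 6% L2, 3% L3, 1% memory.
        double[] mix = {0.90, 0.06, 0.03, 0.01};
        System.out.printf("average load latency: %.2f cycles%n", averageLatency(mix));
    }
}
```

With this hypothetical mix, the 1% of loads that reach memory contribute 3.5 cycles to the average, more than the 2.7 cycles contributed by the 90% that hit in L1.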

3
Software Getting More Complex
  • Complexity of understanding
  • Instruction stream derived from diverse
    collection of components
  • OS, runtimes (JVMs, GC, Compilation), database,
    application server, etc.
  • Java is popular for modern workloads, foundation
    for web services
  • Object-oriented
  • More use of pointers
  • Large number of small methods
  • Many indirect (virtual) calls, can't be
    predetermined
  • Garbage collection
  • Dynamic semantics
  • Compilation/optimization occurs at runtime (JIT)

4
Impact of GC on Performance: db (SPECjvm98)
5
db Key Data Structures
6
First Steps
  • We need infrastructure to understand memory
    performance of Java workloads
  • Approaches
  • Simulated Execution
  • Rich details of execution
  • Configurable
  • Cheap
  • Hardware Execution
  • Accurate performance information
  • Completeness
  • Speed of execution

7
Our Goal
  • Extend existing HPM infrastructure to understand
    Java performance
  • Quantify latency impact on Java workloads
  • Correlate performance with threads/region of code
  • Quantify VM (GC, opts, etc) impact on latency
  • VM-aware approach
  • Extend Jikes RVM

8
Hardware Performance Monitors (HPMs)
  • Most architecture implementations provide HPMs to
    count hardware events
  • Processor cycles, instructions completed, cache
    misses, synchronization, branch mispredictions,
    etc.
  • AIX 5.1 provides kernel extension API to access
    counters
  • Initialize counters to count group of events
  • Start/Stop/Read counters
  • Library deals with overflows and kernel thread
    context switches
  • Also provides command line interface
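
The initialize/start/stop/read cycle listed above can be sketched as follows. This is a minimal illustration, not the AIX Performance Monitor API: the `EventCounters` interface, the `SimulatedCounters` stand-in, and the event names are all hypothetical, chosen so the example runs without hardware access.

```java
import java.util.Arrays;

// Hypothetical counter interface mirroring the initialize/start/stop/read
// steps on the slide. It is NOT the AIX Performance Monitor API.
interface EventCounters {
    void init(String... events);  // choose the group of events to count
    void start();
    void stop();
    long[] read();                // one value per event, in init() order
}

// Stand-in implementation so the sketch runs without hardware access:
// counts accumulate only between start() and stop().
final class SimulatedCounters implements EventCounters {
    private long[] values = new long[0];
    private boolean running;

    public void init(String... events) {
        values = new long[events.length];
        running = false;
    }

    public void start() { running = true; }

    public void stop() { running = false; }

    public long[] read() { return values.clone(); }

    // Simulation hook: pretend the hardware observed n occurrences of event i.
    void simulate(int i, long n) {
        if (running) {
            values[i] += n;
        }
    }
}

final class CounterDemo {
    public static void main(String[] args) {
        SimulatedCounters c = new SimulatedCounters();
        c.init("cycles", "instructions");
        c.start();
        c.simulate(0, 1000);
        c.simulate(1, 400);
        c.stop();
        c.simulate(0, 999);  // ignored: counters are stopped
        System.out.println(Arrays.toString(c.read()));
    }
}
```

The point of the pattern is that only events occurring between `start()` and `stop()` are attributed to the measured region.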

9
AIX 5.1 Performance Monitor API
[Diagram: an application runs on OS threads; the operating system exposes the Performance Monitor API, which reads the performance counter (Perf Ctr) on each processor, through Proc P.]
10
Using HPMs for Java
  • Problems
  • Distinguishing application and VM
  • Distinguishing Java threads
  • SMP support
  • Low overhead
  • Temporal (information may vary over time)
  • Approach
  • Modify Jikes RVM to address these issues without
    requiring application modification

11
Jikes RVM
  • Open source research VM that executes Java
    applications, written in Java
  • Formerly called Jalapeno
  • Runs on Linux/IA32, AIX/PPC, Linux/PPC
  • Universities using system for a broad range of
    research topics
  • 18 user publications in 2002 at top conferences
  • 7 courses using system, teaching resources
    (tutorials) available on web site
  • www.ibm.com/developerworks/oss/jikesrvm

12
Jikes RVM Features
  • Implemented in Java programming language
    (300KLOC)
  • Reduces seams between VM and applications
  • VM can be dynamically optimized
  • Compile-only strategy
  • Multiple compilers, mixing code is seamless
  • Aggressive optimizing compiler
  • 3 levels of IR, all with Java type info (CFG,
    SSA, dominators, etc.)
  • Lightweight m:n thread implementation
  • Java threads are multiplexed on OS threads,
    important for scalability, GC transition
  • Quasi-preemptive scheduling (using
    compiler-generated yield points)
  • Adaptive optimization system
  • Yieldpoint-based sampling, cost/benefit model,
    what to recompile and how
  • Online feedback-directed inlining, code layout
  • Type-accurate (exact) parallel GC/allocation
  • Copying and noncopying, generational and
    nongenerational, hybrids

13
Adaptive Optimization System
14
Jikes RVM M:N Threading
[Diagram: Java threads 1 through M, belonging to the Java application and the JVM, are multiplexed onto OS threads provided by the operating system.]
An external 10 ms timer triggers the need for a
thread switch; a quantum can be shorter (voluntary
yield) or longer (the OS runs another pthread)
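
The interaction between the timer and compiler-generated yield points can be sketched as below. This is a simplified model under stated assumptions, not Jikes RVM code: the class and method names are hypothetical, and the loop back edge stands in for wherever the compiler actually places yield points.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of quasi-preemptive scheduling: the timer only *requests* a thread
// switch; the running thread acts on the request at its next yield point.
final class YieldPointSketch {
    private final AtomicBoolean switchRequested = new AtomicBoolean(false);
    private int switches;

    // Called by the (hypothetical) timer tick, e.g. every 10 ms.
    void timerTick() {
        switchRequested.set(true);
    }

    // Called at compiler-inserted yield points. Returns true if a switch
    // was pending and has now been taken.
    boolean yieldPoint() {
        if (switchRequested.compareAndSet(true, false)) {
            switches++;  // a real VM would save state and run another thread
            return true;
        }
        return false;
    }

    int switches() {
        return switches;
    }

    public static void main(String[] args) {
        YieldPointSketch vm = new YieldPointSketch();
        for (int i = 0; i < 5; i++) {
            if (i == 2) {
                vm.timerTick();            // timer fires mid-loop
            }
            if (vm.yieldPoint()) {         // stands in for a loop back edge
                System.out.println("thread switch at iteration " + i);
            }
        }
        System.out.println("switches: " + vm.switches());
    }
}
```

Because the switch happens only at a yield point, the length of a quantum depends on when the running code next reaches one, which is one reason quanta are not exactly 10 ms.
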
15
Jikes RVM Extension
[Diagram: the M:N threading picture (Java threads 1 through M multiplexed onto OS threads) extended with the Performance Monitor API, which sits between the operating system and the processors, through Proc P, each with a performance counter (Perf Ctr).]
16
Jikes RVM Extension
17
Implementation Overview
  • Modify Jikes RVM Java thread scheduler to capture
    HPM data for the duration of a Java thread's
    quantum
  • Produces sequence of traces
  • Mechanism applied to both application and VM
    threads
  • One trace file per processor
  • Overhead is 1.7% on SPECjbb2000
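
Capturing HPM data per quantum can be sketched as taking counter deltas at each thread switch. The code below is a minimal illustration, not the Jikes RVM scheduler: `QuantumRecorder`, its method names, and the counter values in `main` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: at each thread switch, attribute the counter delta since the
// previous switch to the thread that just finished its quantum.
final class QuantumRecorder {
    // One record per quantum; field names are illustrative.
    static final class Record {
        final int threadId;
        final long[] deltas;

        Record(int threadId, long[] deltas) {
            this.threadId = threadId;
            this.deltas = deltas;
        }
    }

    private final List<Record> trace = new ArrayList<>();
    private long[] atQuantumStart;
    private int currentThread;

    QuantumRecorder(int firstThread, long[] countersNow) {
        currentThread = firstThread;
        atQuantumStart = countersNow.clone();
    }

    // Called by the scheduler when switching from currentThread to nextThread.
    void onThreadSwitch(int nextThread, long[] countersNow) {
        long[] deltas = new long[countersNow.length];
        for (int i = 0; i < deltas.length; i++) {
            deltas[i] = countersNow[i] - atQuantumStart[i];
        }
        trace.add(new Record(currentThread, deltas));
        currentThread = nextThread;
        atQuantumStart = countersNow.clone();
    }

    List<Record> trace() {
        return trace;
    }

    public static void main(String[] args) {
        // Counter snapshots are {cycles, cache misses}; values are made up.
        QuantumRecorder r = new QuantumRecorder(1, new long[] {0, 0});
        r.onThreadSwitch(2, new long[] {1000, 40});  // thread 1 ran
        r.onThreadSwitch(1, new long[] {1500, 45});  // thread 2 ran
        for (Record rec : r.trace()) {
            System.out.println("thread " + rec.threadId
                + " cycles=" + rec.deltas[0] + " misses=" + rec.deltas[1]);
        }
    }
}
```

Since the same hook fires for every Java thread, application and VM threads are recorded by one mechanism, as the slide notes.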

18
Implementation Details
  • Trace record data
  • Processor ID
  • Thread ID
  • Real time start, duration
  • Hardware counter values (cycles, cache misses,
    etc.)
  • Trace record written into a buffer at each thread
    switch
  • Consumed by a separate Java thread
  • one trace writer thread per processor
  • two buffers used to avoid blocking
  • Also provide a VM call to distinguish source code
    points that will generate a special trace record
  • Allows programmer to focus on logical
    computations
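
The record layout and the two-buffer handoff described above might look like the following. This is a sketch under assumptions, not the actual implementation: the class and field names are hypothetical, and dropping a record when the writer falls behind is an assumed policy that the slides do not specify.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a per-processor trace with two buffers: the scheduler appends to
// the active buffer while the trace writer thread drains the other one.
final class DoubleBufferedTrace {
    // Fields follow the trace record data listed on the slide.
    static final class TraceRecord {
        final int processorId;
        final int threadId;
        final long startTime;
        final long duration;
        final long[] counters;

        TraceRecord(int processorId, int threadId, long startTime,
                    long duration, long[] counters) {
            this.processorId = processorId;
            this.threadId = threadId;
            this.startTime = startTime;
            this.duration = duration;
            this.counters = counters;
        }
    }

    private final int capacity;
    private List<TraceRecord> active = new ArrayList<>();
    private List<TraceRecord> other = new ArrayList<>();
    private boolean otherFull;  // true while `other` holds undrained records

    DoubleBufferedTrace(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    // Producer side, called at a thread switch. Returns false if the record
    // was dropped because both buffers are full (assumed policy).
    synchronized boolean append(TraceRecord r) {
        if (active.size() == capacity) {
            if (otherFull) {
                return false;
            }
            List<TraceRecord> t = active;  // swap the two buffers
            active = other;
            other = t;
            otherFull = true;
        }
        active.add(r);
        return true;
    }

    // Consumer side, called by the trace writer thread. Returns the records
    // of a full buffer (empty if none is ready); file I/O happens afterwards,
    // outside the lock.
    synchronized List<TraceRecord> drain() {
        if (!otherFull) {
            return new ArrayList<>();
        }
        List<TraceRecord> out = new ArrayList<>(other);
        other.clear();
        otherFull = false;
        return out;
    }

    public static void main(String[] args) {
        DoubleBufferedTrace trace = new DoubleBufferedTrace(2);
        for (int i = 1; i <= 3; i++) {
            trace.append(new TraceRecord(0, i, 0L, 1L, new long[] {i}));
        }
        System.out.println("records ready for the writer: " + trace.drain().size());
    }
}
```

The producer holds the lock only long enough to add a record or swap buffers, so the slow part (writing to the trace file) never delays a thread switch.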

19
Initial Experience
  • SPECjbb2000 benchmark
  • Middle tier of 3-tier warehouse order system
  • Each warehouse is a Java thread
  • Configured to run 4 warehouses on 4 procs
  • Measured over a 40 sec measurement period, after
    a 10 sec ramp-up
  • 27,951 records produced, 2.5MB

20
21 Threads
  • 5 Application threads
  • 1 main thread, 4 warehouse threads
  • 4 Garbage collection threads
  • Stop-the-world, parallel collection
  • 4 Scheduler threads
  • Simple load balancing
  • 4 Trace writer threads
  • Consumes trace records, writes to file
  • 4 AOS threads
  • 2 profilers, 1 controller, 1 recompilation

21
Percentage of Total Cycles
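
A breakdown like this one can be derived from the trace by summing cycles per thread group and dividing by the total. The sketch below shows only that arithmetic; the group names and cycle counts in `main` are placeholders, not results from the talk, and `CycleShare` is a hypothetical helper.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: convert per-group cycle totals into percentages of all cycles.
final class CycleShare {
    // Returns each group's share of the summed cycles, in percent,
    // preserving the iteration order of the input map.
    static Map<String, Double> percentages(Map<String, Long> cyclesByGroup) {
        long total = 0;
        for (long c : cyclesByGroup.values()) {
            total += c;
        }
        Map<String, Double> out = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : cyclesByGroup.entrySet()) {
            out.put(e.getKey(), total == 0 ? 0.0 : 100.0 * e.getValue() / total);
        }
        return out;
    }

    public static void main(String[] args) {
        // Placeholder numbers for illustration only.
        Map<String, Long> cycles = new LinkedHashMap<>();
        cycles.put("application", 750L);
        cycles.put("gc", 200L);
        cycles.put("other VM", 50L);
        for (Map.Entry<String, Double> e : percentages(cycles).entrySet()) {
            System.out.printf("%s: %.1f%%%n", e.getKey(), e.getValue());
        }
    }
}
```
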
22
Investigating Load Latency
  • Aggregate load latency is 22 cycles
  • p690 Power4 Machine

23
Average Estimated Load Latency
24
Average Load Latency Cycles
25
Avg Load Latency Over Time For 1 Warehouse
26
Avg Load Latency Over Time For 1 Warehouse:
Records > 1M cycles (1 msec)
27
Avg Load Latency Over Time For 1 Warehouse:
1 thread on 1 proc
Poor avg latency is mostly due to
synchronization-induced thread yields
28
Applicability
  • Other platforms?
  • Jikes RVM also runs on Linux/IA32, Linux/PPC
  • Straightforward to port code to access IA32 HPMs
  • Does Linux provide OS thread distinction?
  • Other JVMs without M:N scheduling?
  • Aggregate information
  • Yes, VM extension probably not required
  • PM API already provides OS thread distinction
  • Temporal information
  • Requires access to thread switch points
  • Need to modify OS?

29
Related Work
  • Hardware counter tools
  • DCPI (Alpha), VTune (Intel), SpeedShop (SGI)
  • Ammons, Ball, Larus
  • Profiling Java workloads
  • Cain, Rajwar, Marden, Lipasti
  • TPC-W, SPECweb99, SPECjbb2000 on 8-way RS64-III
    simulation
  • Luo and John, Seshadri, John, Mericas
  • SPECjbb2000, VolanoMark vs SPECint2000 on PIII
    and 2 PowerPC archs
  • Karlsson, Moore, Hagersten, Wood
  • SPECjbb2000, ECPerf on 16-way Sun Enterprise 6000
    server
  • HPM library packages
  • HPM toolkit, PAPI, PCL
  • JVMPI
  • . . .

30
Future Work
  • Short term
  • Investigate components of load latency
  • Refine intervals when scheduler is disabled (GC)
  • Track other hardware events
  • Graphical tool to navigate trace files
  • Exploit other HPM mechanisms
  • Correlate interval back to source
  • Marker trace records
  • Call stack
  • Adaptive optimization system
  • Exploit information in adaptive system
  • . . .

31
Conclusions
  • Understanding performance of MRTE is difficult,
    but important
  • Thanks to Carole, Lizy, and Chris for organizing!
  • Extended Jikes RVM to access HPMs
  • Low overhead (< 2%)
  • Distinguishing application and VM threads
  • SMP support
  • Provides temporal information
  • First usage
  • Load latency in SPECjbb2000