1
Cross-Architectural Performance Portability of a
Java Virtual Machine Implementation

Matthias Jacob, Princeton University

Keith Randall, Google, Inc.
2
JVM architecture
Java Bytecode
Interpreter
JIT
Native Code
JVM
CPU
4
Compaq FastVM

State-of-the-art implementation of JVM on Alpha
Real 64-bit implementation
Efficient optimization mechanisms
Not feedback-based (unlike HotSpot)
Can we port the code generator to x86 and preserve the performance?

5
Differences between Alpha and x86

Reduced number of registers
8 registers on x86 versus 31 on Alpha
Instructions contain multiple operations
A single x86 instruction can do the work of several Alpha instructions
Different addressing modes
Arithmetic x86 instructions operate on memory directly
Non-orthogonality of instruction set
Different registers require different instructions
Source registers get overwritten
Operand registers are used to store results on x86
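
The point about overwritten source registers can be illustrated with a small sketch. It lowers a three-operand operation of the form dst = src1 op src2, as a RISC instruction set such as Alpha expresses it, into two-operand x86-style pseudo-instructions in which the destination is also the first source. The function name, the set of commutative operations, and the scratch register called tmp are assumptions made for illustration; this is not FastVM's code generator.

```python
def lower_to_two_address(dst, op, src1, src2):
    """Lower `dst = src1 op src2` to two-operand pseudo-instructions.

    Output uses Intel-style operand order (destination first).
    `tmp` is a hypothetical scratch register used only for illustration.
    """
    commutative = {"add", "and", "or", "xor"}
    if dst == src1:
        # Destination already holds the first operand: one instruction.
        return [f"{op} {dst}, {src2}"]
    if dst == src2 and op in commutative:
        # Operands can be swapped, so no copy is needed.
        return [f"{op} {dst}, {src1}"]
    if dst == src2:
        # Copying src1 into dst would destroy src2, so go through a scratch.
        return [f"mov tmp, {src1}", f"{op} tmp, {src2}", f"mov {dst}, tmp"]
    # General case: copy the first operand, then operate in place.
    return [f"mov {dst}, {src1}", f"{op} {dst}, {src2}"]
```

For example, lower_to_two_address("eax", "add", "ebx", "ecx") yields a mov followed by an add, whereas the three-operand form needs only one instruction.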

6
Outline

Modified Optimizations for x86
Register Allocation
Instruction Selection
Instruction Patching
Method Inlining
New Optimizations for x86
Calling Convention
Floating-Point Modes
Results
Conclusion

7
Register Allocation for JIT

Traditional optimal register allocation too expensive
Graph coloring
Use heuristics
LMAP structure

8
Register Allocation

Java entities: local variables L(x) and Java stack locations S(y)
Assign every Java entity home location H
Temporary location T for intermediate results
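
As a rough illustration of assigning home locations with a cheap heuristic, the sketch below gives the most frequently used Java entities a register as their home H and places the rest in stack slots. The use-count ranking, the function name, and the slot naming are assumptions chosen for illustration; the slides do not describe FastVM's actual heuristic or the LMAP structure in enough detail to reproduce them.

```python
from collections import Counter

def assign_homes(uses, registers):
    """Assign each Java entity a home location H.

    uses:      sequence of entity names (e.g. "L0", "S1"), one per use.
    registers: registers available as home locations, in preference order.
    Entities that do not get a register receive a hypothetical 4-byte
    frame slot written as "[frame+N]".
    """
    counts = Counter(uses)
    # Most used first; ties broken by name so the result is deterministic.
    ranked = sorted(counts, key=lambda e: (-counts[e], e))
    homes = {}
    for i, entity in enumerate(ranked):
        if i < len(registers):
            homes[entity] = registers[i]
        else:
            homes[entity] = f"[frame+{4 * (i - len(registers))}]"
    return homes
```

This runs in near-linear time in the number of uses, which is the kind of cost a JIT can afford, in contrast to graph coloring.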

9
Register Allocation

Limited number of registers
Flexible partitioning H- / T-registers
No dedicated registers
Thread-local pointer in segment register

10
Register Allocation

Instructions limited to certain registers
Allocate only subset of registers

11
Register Allocation

Memory locations as arguments
Pick different addressing mode instead of allocating register
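
A minimal sketch of this idea: when an operand's home is a memory location, the x86 add can reference it directly, whereas a load/store style needs an extra register and an extra instruction. The function name, the bracketed operand notation, and the scratch register tmp are assumptions for illustration only.

```python
def emit_add(dst, src_home, allow_memory_operand=True):
    """Emit pseudo-instructions for `dst += value stored at src_home`.

    src_home is either a register name or a memory operand such as "[esp+8]".
    With allow_memory_operand=False the sketch mimics load/store code,
    which must first load the value into a hypothetical scratch register.
    """
    in_memory = src_home.startswith("[")
    if not in_memory or allow_memory_operand:
        # x86 style: the arithmetic instruction reads memory directly.
        return [f"add {dst}, {src_home}"]
    # Load/store style: explicit load, then register-to-register add.
    return [f"mov tmp, {src_home}", f"add {dst}, tmp"]
```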

12
Register Allocation Speedup
13
Instruction Selection

Alpha/RISC
ALU operations
Memory operations
Control operations
x86/CISC
Instructions can combine ALU, memory, and control operations
Different addressing modes
Limited set of registers per instruction
Emulate 64-bit operations
Floating-point stack
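
To make "emulate 64-bit operations" concrete, here is a sketch of a 64-bit addition built from two 32-bit additions with a carry, which is the pattern that the x86 add and adc instruction pair implements on 32-bit registers. The function name is hypothetical and the code models the arithmetic only; it says nothing about how FastVM selects the instructions.

```python
MASK32 = 0xFFFFFFFF

def add64_via_32(a, b):
    """Add two unsigned 64-bit values using only 32-bit halves."""
    a_lo, a_hi = a & MASK32, (a >> 32) & MASK32
    b_lo, b_hi = b & MASK32, (b >> 32) & MASK32
    lo = a_lo + b_lo                     # corresponds to: add lo, lo
    carry = lo >> 32                     # carry flag out of the low half
    hi = (a_hi + b_hi + carry) & MASK32  # corresponds to: adc hi, hi
    return (hi << 32) | (lo & MASK32)    # result wraps modulo 2**64
```

On Alpha the same addition is a single 64-bit instruction, so every long operation in the bytecode costs at least two instructions and two registers per operand on 32-bit x86.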

14
Instruction Patching

Patching instructions
Class initializers
Fix up branches
Copying registers
Method Inlining
Needs to be atomic because of concurrency
Alpha: every instruction is 4 bytes, so a single write instruction is sufficient

15
Instruction Patching on x86

Different instruction lengths
Patch instructions atomically using Compare-and-Exchange
Pad with NOPs
Difficult to walk back in code for renaming registers (as on Alpha)
Input registers are often output registers
Renaming output registers alone is not sufficient
Retargeting by forward-looking heuristic
Look for nearest future use of a preferred register
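
The padding and atomic replacement steps can be sketched as follows. Variable-length instructions are padded with single-byte NOPs (opcode 0x90) to a fixed-size window, and the window is swapped only if it still contains the expected old bytes. The 8-byte window size, the class and function names, and the use of a lock to stand in for the hardware compare-and-exchange are all assumptions for illustration; real patching operates on executable memory, not on a Python bytearray.

```python
import threading

NOP = 0x90   # single-byte x86 NOP opcode
WINDOW = 8   # assumed size of the patch window in bytes

def pad_with_nops(instr, size=WINDOW):
    """Pad an encoded instruction with NOPs so it fills the patch window."""
    if len(instr) > size:
        raise ValueError("instruction does not fit in the patch window")
    return bytes(instr) + bytes([NOP]) * (size - len(instr))

class CodeBuffer:
    """Toy code buffer; the lock models the atomicity of compare-and-exchange."""

    def __init__(self, code):
        self._code = bytearray(code)
        self._lock = threading.Lock()

    def compare_and_exchange(self, offset, expected, new):
        """Replace `expected` with `new` at `offset` if the bytes still match."""
        if len(expected) != len(new):
            raise ValueError("old and new windows must have the same length")
        with self._lock:
            end = offset + len(expected)
            if bytes(self._code[offset:end]) != expected:
                return False  # another thread already patched this site
            self._code[offset:end] = new
            return True

    def snapshot(self):
        return bytes(self._code)
```

A patch site would be created with something like old = pad_with_nops(b"\xe8\x00\x00\x00\x00") for a 5-byte call, and later replaced with buf.compare_and_exchange(0, old, pad_with_nops(b"\x31\xc0")). If two threads race, only the first exchange succeeds and the second sees a mismatch.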

16
Method Inlining Speedup
17
Outline

Modified Optimizations for x86
Register Allocation
Instruction Selection
Instruction Patching
Method inlining
New Optimizations for x86
Calling Convention
Floating-Point Modes
Results
Conclusion

18
Optimizations for x86

Calling Convention on x86
Argument passing on stack instead of registers
Allocate registers for argument passing
Two registers for stack management: frame pointer and stack pointer
Constant stack frame size
Detection of stack overflow is difficult
Check at bottom of stack frame in method prolog
8-byte stack operations may be unaligned
Align stack frames to 8 byte boundaries

19
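Optimized stack frame layout

Figure: stack frame diagram. Regions in the order labeled on the slide: input arguments, return address, callee-save space, local variables, output stack arguments, callee-save space (4 bytes). The esp register marks the stack pointer; method prolog and method epilog are annotated.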