Title: The Potential of Trace-Level Parallelism in Java Programs
1 The Potential of Trace-Level Parallelism in Java Programs
- Borys J. Bradel
- Tarek S. Abdelrahman
- University of Toronto
- Principles and Practice of Programming in Java
- September 7th, 2007
2 Motivation
- Gap exists between hardware and software
- Hardware
- Majority of computer chips contain multiple cores
- Athlon X2, Core 2 Duo, Power5, Cell, Niagara
- Software
- Writing parallel software is difficult
- Bridging the gap may lead to better utilization of hardware and therefore improved performance
3 Automatic Parallelization
- Traditional compile time
- Perform analysis at compile time
- Divide program based on analysis
- Limited success
- Runtime
- New approach to automatic parallelization is needed
- Combine analysis with runtime information
- What information to use?
- Trace-Based
- Our solution is to use traces
4 How successful can using traces be?
- We answer this question by simulating trace execution
- Monitor a program's execution
- Simulate the execution of traces in parallel
- Measure a practical upper bound on parallelism
- Not an accurate measurement of performance
5 Outline
- Traces
- Execution Model
- Simulation Platform
- Experimental Evaluation
- Conclusion
6 Trace Definition
- A trace is a frequently executed sequence of unique basic blocks or instructions
- Identified by a trace collection system at runtime
public static int foo() {
    int a = 0;
    for (int i = 0; i < n; i++) a += i;
    return a;
}
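To make the definition concrete, the sketch below forms a trace from a stream of executed basic-block labels: it counts block entries and, once a block becomes hot, records the unique blocks that follow until one repeats. The class name, the threshold, and the labels B0, B1, B2 are hypothetical, and the policy is a deliberately simple assumption rather than the trace collection system used in the talk.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified trace collector; not the collector used in the talk.
class TraceSketch {
    // Returns the first trace found in a stream of executed basic-block labels,
    // or an empty list if no block reaches the hotness threshold.
    static List<String> collect(List<String> executed, int hotThreshold) {
        Map<String, Integer> counts = new HashMap<>();
        List<String> trace = new ArrayList<>();
        boolean recording = false;
        for (String block : executed) {
            if (recording) {
                // Stop when a block repeats: a trace holds unique blocks only.
                if (trace.contains(block)) {
                    return trace;
                }
                trace.add(block);
            } else if (counts.merge(block, 1, Integer::sum) >= hotThreshold) {
                // The block is hot: start recording a trace at this block.
                recording = true;
                trace.add(block);
            }
        }
        // The stream ended; return whatever was recorded (possibly nothing).
        return recording ? trace : new ArrayList<>();
    }
}
```

For the loop in foo(), a stream such as B0, B2, B1, B2, B1, ... with a threshold of 2 yields a trace covering the loop body and the loop test. Which block a real collector picks as the head of the trace depends on its own policy.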
7 Benefits
- Source code is not required
- Granularity of parallelism can vary
- Traces simplify control flow and analysis
- Traces are simple to identify
8 Execution Model
[Figure: execution model; labels: Method, CFG, sequential, parallel]
9 Dependence Communication
Dependences limit parallelism
[Figure: method diagram; label: a+=i]
10 Dependence Communication
Different types of communication
Instruction-Instruction
Trace-Trace
Trace-Instruction
[Figure: dependence diagrams for each communication type, with a Communication Delay label]
11 Requirements
- Java Virtual Machine
- Execute bytecode
- Interpreted or compiled
- Trace Collection System (TCS)
- Monitor control flow
- Create traces
[Figure labels: JVM, Code Execution, control flow, TCS]
12 Parallel Identification Engine (PIE)
- Records memory information
- Keeps track of dependences
- Ignores instructions that read and write to the same variable, e.g. the dependence between i and itself is ignored
- Schedules instructions
- Instruction Window
- Communication
- Processor Count
[Figure labels: JVM, Code Execution, control flow, instruction info, traces]
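The dependence bookkeeping listed above can be sketched as follows. Each executed instruction reports the locations it reads and writes; a trace depends on the earlier trace that last wrote a location it reads. The class and method names are hypothetical, and treating "reads and writes the same variable" as an in-place update such as i++ is an assumed reading of the slide, not a description of the actual engine.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical dependence tracker; names and structure are illustrative only.
class DependenceSketch {
    // Maps each memory location to the id of the trace that last wrote it.
    private final Map<String, Integer> lastWriter = new HashMap<>();
    // Maps each trace id to the ids of earlier traces it depends on.
    private final Map<Integer, Set<Integer>> deps = new HashMap<>();

    // Records one executed instruction belonging to the given trace.
    void record(int traceId, Set<String> reads, Set<String> writes) {
        deps.computeIfAbsent(traceId, k -> new HashSet<Integer>());
        // Assumed rule: an instruction that only updates a variable in place
        // (for example i++) is ignored and creates no dependence.
        if (reads.equals(writes)) {
            return;
        }
        for (String loc : reads) {
            Integer writer = lastWriter.get(loc);
            if (writer != null && writer != traceId) {
                deps.get(traceId).add(writer);
            }
        }
        for (String loc : writes) {
            lastWriter.put(loc, traceId);
        }
    }

    // Returns the traces that the given trace depends on.
    Set<Integer> dependencesOf(int traceId) {
        return deps.getOrDefault(traceId, new HashSet<Integer>());
    }
}
```

With two consecutive loop iterations recorded as traces 1 and 2, the a+=i instruction makes trace 2 depend on trace 1 through a, while the i++ instructions contribute nothing.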
13 Scheduling
Record trace information when traces execute sequentially
Schedule when instruction window is full
[Figure: recorded traces grouped into windows, each followed by a Schedule step]
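The record-then-schedule cycle described on this slide might look like the buffer below. It is a sketch under assumptions: the window is counted in traces, trace ids are plain strings, and "scheduling" is reduced to handing a full window to a list of batches. None of these names come from the talk.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical window buffer; the window is counted in traces here.
class WindowSketch {
    private final int capacity;
    private final List<String> window = new ArrayList<>();
    private final List<List<String>> batches = new ArrayList<>();

    WindowSketch(int capacity) {
        this.capacity = capacity;
    }

    // Called after a trace has executed sequentially and been recorded.
    void onTraceExecuted(String traceId) {
        window.add(traceId);
        if (window.size() == capacity) {
            // Window is full: hand the recorded traces to the scheduler.
            batches.add(new ArrayList<>(window));
            window.clear();
        }
    }

    // Windows that have been handed off for scheduling so far.
    List<List<String>> scheduledBatches() {
        return batches;
    }

    // Number of recorded traces still waiting for the window to fill.
    int pending() {
        return window.size();
    }
}
```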
14 Schedule around Dependences
4 processors, 12 traces per window
- Dependent traces are scheduled far enough apart to have correct execution
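One way to picture "far enough apart" is a greedy list scheduler: each trace goes on the processor where it can start earliest, and it may not start before every trace it depends on has finished, plus a communication delay when the two run on different processors. This is a minimal sketch with hypothetical names and a simplified cost model; it is not the Modified Critical-Path scheduler used in the evaluation.

```java
import java.util.List;

// Hypothetical greedy list scheduler; the talk uses a Modified Critical-Path
// scheduler, which this sketch does not reproduce.
class ScheduleSketch {
    // cycles[t]   : length of trace t in cycles
    // deps.get(t) : earlier traces (smaller indices) that trace t depends on
    // Returns the cycle at which the last trace in the window finishes.
    static int makespan(int[] cycles, List<List<Integer>> deps,
                        int processors, int commDelay) {
        int n = cycles.length;
        int[] finish = new int[n];
        int[] placedOn = new int[n];
        int[] freeAt = new int[processors];
        int end = 0;
        for (int t = 0; t < n; t++) {
            int bestProc = 0;
            int bestStart = Integer.MAX_VALUE;
            for (int p = 0; p < processors; p++) {
                int start = freeAt[p];
                for (int d : deps.get(t)) {
                    // Values from another processor arrive after commDelay cycles.
                    int ready = finish[d] + (placedOn[d] == p ? 0 : commDelay);
                    start = Math.max(start, ready);
                }
                if (start < bestStart) {
                    bestStart = start;
                    bestProc = p;
                }
            }
            placedOn[t] = bestProc;
            finish[t] = bestStart + cycles[t];
            freeAt[bestProc] = finish[t];
            end = Math.max(end, finish[t]);
        }
        return end;
    }
}
```

In this model, independent traces spread across the processors, while a chain of dependent traces stays on one processor because moving to another would only add the communication delay.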
15 Speedup
- Ratio
- Cycles aggregated over all scheduled traces on parallel system
- Cycles over all scheduled traces on one processor system
- Each trace executes sequentially on one processor
- A cycle represents the write of one memory location
B1: a+=i; i++ (2 cycles)
B2: if (i<n) goto B1
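The ratio can be written out as a small calculation. The sketch assumes the usual reading of speedup, one-processor cycles divided by parallel-system cycles (the slide does not spell out the order), and counts one cycle per memory write as stated above. The class and method names are hypothetical.

```java
// Hypothetical helper illustrating the speedup ratio; not code from the talk.
class SpeedupSketch {
    // One cycle per memory write: sums the writes of every scheduled trace,
    // which is the cycle count when all traces run on one processor.
    static long sequentialCycles(int[] writesPerTrace) {
        long total = 0;
        for (int w : writesPerTrace) {
            total += w;
        }
        return total;
    }

    // Assumed definition: cycles on one processor / cycles on the parallel system.
    static double speedup(int[] writesPerTrace, long parallelCycles) {
        return (double) sequentialCycles(writesPerTrace) / parallelCycles;
    }
}
```

For example, four executions of a trace with two writes each (a+=i and i++) take 8 cycles on one processor; if a parallel schedule finishes them in 4 cycles, the speedup under this definition is 2.0.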
16 Experimental Evaluation
- Jupiter (Patrick Doyle)
- RedSpot (Borys Bradel)
- Modified Critical-Path scheduler (Min-You Wu)
- Benchmarks
- Java Grande Section 3
- SPECjvm98
17 Effect of Window Size
18 Effect of Communication Cost
19 Effect of Communication Type
20 Effect of Processor Count
21 Conclusion
- How successful can using traces be?
- Built simulator to measure parallel execution of traces
- Traces have the potential to be used to parallelize programs
- Some benchmarks do not scale well
- Some benchmarks scale very well
- Most benchmarks have at least 2x speedup on four processors
- Future work: create a system that performs trace-based parallelization
22 Jupiter and RedSpot
Interpreter
emulate a=0
emulate i=0
emulate goto B2
call RedSpot
emulate if (i<n) goto B1
call RedSpot
emulate a+=i
emulate i++
emulate if (i<n) goto B1
call RedSpot
Trace 1
emulate a+=i
emulate i++
emulate if (i<n) goto B1
call RedSpot
23 Parallel Identification Engine
Interpreter
emulate if (i<n) goto B1
call RedSpot
call PIE
emulate a+=i
call PIE
emulate i++
call PIE
emulate if (i<n) goto B1
call RedSpot
call PIE
emulate a+=i
call PIE
emulate i++
call PIE
emulate if (i<n) goto B1
call RedSpot
call PIE
Call PIE for each instruction and each memory access
24 Processor Count
Maximum number of processors limits performance
2 processors
25 Scheduling Window
Can only schedule a limited number of traces at a time
4 traces per window