Title: Cluster Computing with Java Threads
1Cluster Computing with Java Threads
- Philip J. Hatcher
- University of New Hampshire
- Philip.Hatcher_at_unh.edu
2Collaborators
- UNH/Hyperion
- Mark MacBeth and Keith McGuigan
- ENS-Lyon/DSM-PM2
- Gabriel Antoniu, Luc Bougé and Raymond Namyst
3Focus
- Use Java as is for high-performance computing
- support computationally intensive applications
- utilize parallel computing hardware
4Outline
- Our Vision
- Java Threads
- The PM2 Run-time Environment
- Hyperion Java Threads on Clusters
- Evaluation
- Related Work
- Conclusions
5Why Java?
- Soon to be ubiquitous!
- use of Java is growing very rapidly
- Designed for portability
- develop programs on your desktop
- run programs on a distant cluster
6Why Java?
- Explicitly parallel!
- includes a threaded programming model
- Relaxed memory model
- consistency model aids an implementation on
distributed-memory parallel computers
7Unique Opportunity
- Use Java to bring parallelism to the masses
- Let's not miss it!
- But, programmers will not accept syntax or model
changes
8Open Question
- Parallelism via Java access to distributed-computing
techniques?
- e.g. RMI (remote method invocation)
- Or, parallelism via Java threads?
9That is, ...
- Does a user prefer to view a cluster as a
collection of distinct machines?
- Or, does a user prefer to view a cluster as a
black box that will simply run Java code faster?
10Are you in a box?
11Or, are you thinking outside of the box?
12Climb out of the box!
- Use Java threads as is to program clusters of
computers.
- Program for the threaded Java virtual machine.
- Allow the implementation to handle the details of
executing in a cluster.
13Java Threads
- Threads are objects.
- The class java/lang/Thread contains all of the
methods for initializing, running, suspending,
querying and destroying threads.
14java/lang/Thread methods
- Thread() - constructor for thread object.
- start() - start the thread executing.
- run() - method invoked by start.
- stop(), suspend(), resume(), join(), yield().
- setPriority().
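A minimal sketch of the thread lifecycle listed above: a thread is an object, start() begins execution, run() is the method that start() causes to be invoked, and join() waits for completion. The class names PartialSum and ThreadSketch and the helper parallelSum are hypothetical, chosen only for illustration.

```java
// Each worker is a Thread object; its run() method is invoked as a result of start().
class PartialSum extends Thread {
    private final int lo, hi;   // half-open range [lo, hi)
    long result;                // read by the parent only after join()

    PartialSum(int lo, int hi) {
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    public void run() {
        long s = 0;
        for (int i = lo; i < hi; i++) {
            s += i;
        }
        result = s;
    }
}

class ThreadSketch {
    // Sums the integers 0..n-1 using nThreads worker threads (hypothetical helper).
    static long parallelSum(int n, int nThreads) {
        PartialSum[] workers = new PartialSum[nThreads];
        int chunk = (n + nThreads - 1) / nThreads;
        for (int t = 0; t < nThreads; t++) {
            int lo = Math.min(n, t * chunk);
            int hi = Math.min(n, lo + chunk);
            workers[t] = new PartialSum(lo, hi);
            workers[t].start();             // start the thread executing
        }
        long total = 0;
        try {
            for (PartialSum w : workers) {
                w.join();                   // wait for the worker to finish
                total += w.result;
            }
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(parallelSum(1000, 4));
    }
}
```

Nothing in the sketch says where a thread runs, which is the point of the approach: the same code could be executed on one machine or spread across a cluster by the implementation.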
15Java Synchronization
- Java uses monitors, which protect a region of
code by allowing only one thread at a time to
execute it.
- Monitors utilize locks.
- There is a lock associated with each object.
16synchronized keyword
- synchronized ( Exp ) Block
- public class Q { synchronized void put() { ... } }
17java/lang/Object methods
- wait() - the calling thread, which must hold the
lock for the object, is placed in a wait set
associated with the object. The lock is then
released.
- notify() - an arbitrary thread in the wait set of
this object is awakened and then competes again
to get the lock for the object.
- notifyAll() - all waiting threads are awakened.
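The monitor methods above can be sketched with a one-slot queue. This is an illustrative example only: the class name Q echoes the earlier slide, while the get() method, the QDemo class and its transfer helper are assumptions added here.

```java
// One-slot queue guarded by the monitor of the Q object itself.
class Q {
    private int value;
    private boolean full = false;

    synchronized void put(int v) throws InterruptedException {
        while (full) {
            wait();          // releases the lock and joins this object's wait set
        }
        value = v;
        full = true;
        notifyAll();         // wake waiting threads; they compete again for the lock
    }

    synchronized int get() throws InterruptedException {
        while (!full) {
            wait();
        }
        full = false;
        notifyAll();
        return value;
    }
}

class QDemo {
    // A producer thread puts 1..count; the calling thread sums what it gets.
    static int transfer(int count) {
        Q q = new Q();
        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= count; i++) {
                    q.put(i);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        int sum = 0;
        try {
            for (int i = 0; i < count; i++) {
                sum += q.get();
            }
            producer.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(transfer(100));
    }
}
```

The wait() calls sit inside while loops because an awakened thread must recheck its condition after it reacquires the lock.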
18Shared-Memory Model
- Java threads execute in a virtual shared memory.
- All threads are able to access all objects.
- But threads may not access each other's stacks.
19Java Memory Consistency
- A variant of release consistency.
- Threads can keep locally cached copies of
objects.
- Consistency is provided by requiring that
- a thread's object cache be flushed upon entry to
a monitor.
- local modifications made to cached objects be
transmitted to the central memory when a thread
exits a monitor.
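From the programmer's side, the rule above means that shared data accessed only inside monitors is always seen consistently. A minimal sketch, with the hypothetical class SharedCounter and helper run:

```java
class SharedCounter {
    private final Object lock = new Object();
    private int count = 0;

    void increment() {
        synchronized (lock) {   // monitor entry: stale cached copies may not be used
            count++;
        }                       // monitor exit: the modification is published
    }

    int read() {
        synchronized (lock) {
            return count;
        }
    }

    // Starts nThreads threads that each increment the counter perThread times.
    static int run(int nThreads, int perThread) {
        SharedCounter c = new SharedCounter();
        Thread[] ts = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) {
            ts[i] = new Thread(() -> {
                for (int k = 0; k < perThread; k++) {
                    c.increment();
                }
            });
            ts[i].start();
        }
        try {
            for (Thread t : ts) {
                t.join();
            }
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return c.read();
    }

    public static void main(String[] args) {
        System.out.println(run(4, 10000));
    }
}
```

Without the synchronized blocks, increments could be lost and a thread would have no guarantee of seeing another thread's updates.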
20PM2: A Distributed, Multithreaded Runtime
Environment
- Thread library: Marcel
- User-level
- Supports SMP
- POSIX-like
- Preemptive thread migration
- Communication library: Madeleine
- Portable: BIP, SISCI/SCI, MPI, TCP, PVM
- Efficient
21DSM-PM2 Architecture
- DSM comm
- send page request
- send page
- send invalidate request
- DSM page manager
- set/get page owner
- set/get page access
- add/remove to/from copyset
- ...
[Figure: DSM-PM2 architecture diagram; layer labels DSM-PM2 and PM2]
22DSM-PM2 Performance
- SCI cluster has 450 MHz Pentium II nodes
- Myrinet cluster has 200 MHz Pentium Pro nodes
23Hyperion
- Executes threaded Java programs on clusters.
- Built on top of PM2 and DSM-PM2.
- Provides both portability and efficiency
24Reversing the Bytecode Stream
- Conventionally, users pull bytecode to their
machines for local execution.
- Our vision:
- users develop their high-performance Java
programs using the Java toolset on their desktop.
- they then push the resulting bytecode to a
Hyperion server for high-performance cycles.
25Supporting High Performance
- Utilizes a bytecode-to-C translator.
- Parallel execution via spreading of Java threads
across nodes of the cluster.
- Java threads implemented as lightweight threads
using PM2 library.
26Compiling Java
- Hyperion is designed for computationally intensive
applications, so the small overhead of translating
bytecode is not important.
- Translating to C allows us to leverage the native
C compiler and optimizer.
27General Hyperion Overview
Runtime libraries
28The Hyperion Run-Time System
- Collection of modules to allow plug-and-play
implementations
- inter-node communication
- threads
- memory and synchronization
- etc.
29Hyperion Internal Structure
30Thread and Object Allocation
- Currently, threads are allocated to processors in
round-robin fashion.
- Currently, an object is allocated to the
processor that holds the thread that is creating
the object.
- Currently, DSM-PM2 is used to implement the Java
memory model.
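A rough sketch of round-robin thread placement, assuming it means that the i-th created thread goes to node i modulo the number of nodes. The class RoundRobinPlacement and its methods are hypothetical and are not Hyperion code.

```java
class RoundRobinPlacement {
    // Returns, for each thread index, the node it would be placed on.
    static int[] place(int nThreads, int nNodes) {
        int[] nodeOf = new int[nThreads];
        for (int t = 0; t < nThreads; t++) {
            nodeOf[t] = t % nNodes;
        }
        return nodeOf;
    }

    // Counts how many threads land on each node.
    static int[] load(int[] nodeOf, int nNodes) {
        int[] counts = new int[nNodes];
        for (int node : nodeOf) {
            counts[node]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = load(place(64, 8), 8);
        for (int c : counts) {
            System.out.println(c);
        }
    }
}
```

Under this assumption, objects created by a thread would then live on that thread's node, so placement of threads also determines placement of data.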
31Hyperion's DSM API
- loadIntoCache
- invalidateCache
- updateMainMemory
- get
- put
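To make the roles of the five primitives concrete, here is a toy single-process model. Only the primitive names come from the slide; the signatures, the ToyNode and ToyDsmDemo classes, and the use of an int array as main memory are assumptions for illustration and do not reflect Hyperion's actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of one node's view of shared memory.
class ToyNode {
    private final int[] home;                                     // stand-in for main memory
    private final Map<Integer, Integer> cache = new HashMap<>();  // locally cached values
    private final Map<Integer, Integer> log = new HashMap<>();    // logged modifications

    ToyNode(int[] home) {
        this.home = home;
    }

    void loadIntoCache(int addr) {
        cache.putIfAbsent(addr, home[addr]);   // fetch only if not already cached
    }

    void invalidateCache() {
        cache.clear();                         // e.g. on entry to a monitor
    }

    void updateMainMemory() {
        for (Map.Entry<Integer, Integer> e : log.entrySet()) {
            home[e.getKey()] = e.getValue();   // e.g. on exit from a monitor
        }
        log.clear();
    }

    int get(int addr) {
        loadIntoCache(addr);
        return cache.get(addr);
    }

    void put(int addr, int value) {
        loadIntoCache(addr);
        cache.put(addr, value);
        log.put(addr, value);
    }
}

class ToyDsmDemo {
    // Returns what node b reads before the write, while stale, and after invalidation.
    static int[] demo() {
        int[] home = new int[8];
        ToyNode a = new ToyNode(home);
        ToyNode b = new ToyNode(home);
        int before = b.get(3);      // b caches the initial value
        a.put(3, 42);               // a modifies its cached copy and logs the change
        a.updateMainMemory();       // a publishes the logged change
        int stale = b.get(3);       // b still reads its cached copy
        b.invalidateCache();
        int fresh = b.get(3);       // b reloads from main memory
        return new int[] { before, stale, fresh };
    }

    public static void main(String[] args) {
        for (int v : demo()) {
            System.out.println(v);
        }
    }
}
```

In this model, a write by one node becomes visible to another only after the writer calls updateMainMemory and the reader calls invalidateCache, which mirrors the monitor exit and entry rules of the Java memory model described earlier.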
32DSM Implementation
- Node-level caches.
- Page-based and home-based protocol.
- Log mods made to remote objects.
- Use explicit in-line checks in get/put.
- Each node allocates objects from a different
range of the virtual address space.
33Details
- Objects are aligned on 64-byte boundaries.
- An object reference is the address of the base of
the object.
- The bottom 6 bits of the ref can be used to store
the node number of the object's home.
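The arithmetic behind this encoding can be sketched as follows: 64-byte alignment leaves the low 6 bits of every base address zero, so they can hold a node number from 0 to 63. The class RefBits and its method names are hypothetical, and a Java long stands in for a native pointer.

```java
class RefBits {
    static final int ALIGN = 64;               // objects are aligned on 64-byte boundaries
    static final long NODE_MASK = ALIGN - 1;   // 0x3F: the bottom 6 bits

    // Packs a home-node number into the low bits of an aligned base address.
    static long makeRef(long base, int node) {
        if ((base & NODE_MASK) != 0) {
            throw new IllegalArgumentException("base is not 64-byte aligned");
        }
        if (node < 0 || node >= ALIGN) {
            throw new IllegalArgumentException("node number must fit in 6 bits");
        }
        return base | node;
    }

    static int homeNode(long ref) {
        return (int) (ref & NODE_MASK);
    }

    static long baseAddress(long ref) {
        return ref & ~NODE_MASK;
    }

    // The kind of test an inline check could perform before an access.
    static boolean isRemote(long ref, int myNode) {
        return homeNode(ref) != myNode;
    }

    public static void main(String[] args) {
        long ref = makeRef(0x1000L, 5);
        System.out.println(Long.toHexString(ref));
        System.out.println(homeNode(ref));
        System.out.println(isRemote(ref, 5));
    }
}
```

A 6-bit field limits this scheme to 64 nodes, and the node bits must be masked off before the reference is used as an address.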
34More details
- loadIntoCache checks the 6 bits to see if an
object is remote.
- If so, and if not already locally cached, DSM-PM2
is used to load the page(s) containing the
object.
- When a remote object is cached, a bit is turned
on in its header.
35Yet more details
- The put primitive checks the header bit to see if
a modification should be logged.
- updateMainMemory sends the logged changes to the
home node.
36Evaluation
- Minimal-cost map-coloring application.
- Branch-and-bound algorithm.
- 64 threads, each with its own priority queue.
- Current best solution is shared.
- Problem size: 29 eastern-most states of the USA with
4 colors of differing costs.
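A much-simplified sketch of the idea, not the benchmark itself: it uses depth-first branch-and-bound instead of per-thread priority queues, one thread per choice of color for the first region, and a tiny hypothetical four-region map. The class MapColoring and all its members are illustrative assumptions. What it keeps from the slide is the shared current best solution used as a pruning bound.

```java
import java.util.Arrays;

class MapColoring {
    private final int[][] adj;        // adj[r] lists the neighbors of region r
    private final int[] colorCost;    // cost of using each color once
    private int best = Integer.MAX_VALUE;   // shared bound, guarded by this object's lock

    MapColoring(int[][] adj, int[] colorCost) {
        this.adj = adj;
        this.colorCost = colorCost;
    }

    private synchronized int best() {
        return best;
    }

    private synchronized void offer(int cost) {
        if (cost < best) {
            best = cost;
        }
    }

    private boolean conflicts(int[] colors, int region, int c) {
        for (int n : adj[region]) {
            if (colors[n] == c) {
                return true;
            }
        }
        return false;
    }

    // Depth-first search that prunes any branch already as costly as the shared best.
    private void search(int[] colors, int region, int costSoFar) {
        if (costSoFar >= best()) {
            return;
        }
        if (region == adj.length) {
            offer(costSoFar);
            return;
        }
        for (int c = 0; c < colorCost.length; c++) {
            if (!conflicts(colors, region, c)) {
                colors[region] = c;
                search(colors, region + 1, costSoFar + colorCost[c]);
            }
        }
        colors[region] = -1;   // undo before returning to the caller
    }

    int solve() {
        Thread[] workers = new Thread[colorCost.length];
        for (int c = 0; c < colorCost.length; c++) {
            final int first = c;
            workers[c] = new Thread(() -> {
                int[] colors = new int[adj.length];
                Arrays.fill(colors, -1);      // -1 means "not yet colored"
                colors[0] = first;
                search(colors, 1, colorCost[first]);
            });
            workers[c].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return best();
    }

    // Hypothetical map: regions 0, 1, 2 border each other; region 3 borders region 2 only.
    static int demo() {
        int[][] adj = { { 1, 2 }, { 0, 2 }, { 0, 1, 3 }, { 2 } };
        int[] cost = { 1, 2, 3, 4 };
        return new MapColoring(adj, cost).solve();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

For this map the minimum cost is 7: the three mutually adjacent regions need three distinct colors (1 + 2 + 3), and the fourth region can take the cheapest color as long as its neighbor does not.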
37Experimental Setting
- Two Linux 2.2 clusters:
- eight 200 MHz Pentium Pro processors connected by
a Myrinet switch and using MPI over BIP.
- four 450 MHz Pentium II processors connected by a
SCI network and using SISCI.
- gcc 2.7.2.3 with -O6
38Performance Results
39Parallelizability
40Baseline Performance
- Compared serial Java to serial C for map-coloring
application.
- Each program has single queue, single thread.
41Serial Java versus Serial C
- Java v2: DSM checks disabled
- Java v3: DSM and array-bound checks disabled
- Executing on a single 450 MHz Pentium II
42Inline checks are expensive!
- Genericity of DSM-PM2 allows an alternative
implementation.
- Use page-fault detection rather than inline check
to detect non-local object.
43Using Page Faults: details
- An object reference is the address of the base of
the object.
- loadIntoCache does nothing.
- DSM-PM2 is used to handle page faults generated
by the get/put primitives.
44More details
- When an object is allocated, its address is
appended to a list attached to the page that
contains its header.
- When a page is loaded on a remote node, the list
is used to turn on the header bit for all object
headers on the page.
- The put primitive uses the header bit in the same
manner as the inline-check version.
45Inline Check versus Page Fault
- IC has higher overhead for accessing objects
(either local or locally cached).
- PF has higher overhead (signal handling and
memory protection) for loading a page into the
cache.
46IC versus PF: serial map-coloring
- Java XX v2: DSM checks disabled
- Java XX v3: DSM and array-bound checks disabled
- Executing on a single 450 MHz Pentium II
47IC versus PF: parallel map-coloring
- Executing on 450MHz/SCI cluster.
48Related Work
- Java/MPI: cluster nodes are explicit
- Java/RMI: ditto
- Remote objects via RMI: nearly transparent
- e.g. JavaParty, Do!
- Distributed interpreters
- e.g. Java/DSM, MultiJav, cJVM
49Conclusions
- Approach is clean: Java as is
- Approach is promising
- good parallelizability for map-coloring
- need better scalar compilation
- e.g. array bound-check removal
- need further parallel application studies
- are thread/object placement heuristics sufficient
for programmers to write efficient programs?