Title: SSI Team Progress
1SSI Team Progress Status
- 2002. 7. 9
- NRL SSI TEAM
- ???, ???, ??
2Contents
- Introduction
- Load Balancing Design
- Opensource Functionality and Status
- Placement System on Opensource SSI Cluster
- Socket Migration Integration within Opensource
3Introduction
- Load balancing
- even distribution of workloads among nodes
- improve system performance and throughput through full utilization of diverse resources
- minimize applications' average completion time
- Classification of dynamic load balancing
- remote execution (non-preemptive migration)
- some new processes are (possibly automatically) invoked on remote nodes
- only newborn processes are migrated
- process migration (preemptive migration)
- running processes may be suspended, moved to a remote node, and restarted
4Load Balancing Design (1)
- Load balancing system over single system image
- multi-user environment
- interactive (or sequential) and parallel jobs coexist
- Dynamic load balancing system must support both placement and process migration mechanisms
- preemptive process migration alone is not efficient
- especially for short-lived jobs
[Figure: placement layer]
5Load Balancing Design (2)
- Dynamic load balancing system must consider the characteristics of parallel workloads
- minimize communication cost
- process selection
- comm. to comp. ratio
- Dynamic load balancing system must consider the communication pattern of parallel workloads
- avoid communication delay
[Figure: host A and host B, with a process waiting]
6Load Balancing Design (3)
7Load Balancing Design (4)
- CCR (Communication-to-Computation Ratio) Recorder
- assists the process selection decision in MOSIX LL
- heuristic process selection: lowest communication overhead
- modification of kernel process descriptor structure
- accumulated read/write bytes during a time quantum
- the more a process communicates, the closer it should be to its peer
- for realization: real socket migration
- process migration with shipback mechanism
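The CCR-based selection above can be sketched as follows. This is a minimal user-level illustration, not the kernel modification itself: the slides only state that read/write bytes are accumulated per time quantum, so the `ProcSample` fields, the use of CPU milliseconds as the computation measure, and the function names are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ProcSample:
    """Counters for one process over one time quantum (hypothetical fields)."""
    pid: int
    bytes_read: int      # bytes read during the quantum
    bytes_written: int   # bytes written during the quantum
    cpu_ms: float        # CPU time used during the quantum (assumed measure)


def ccr(sample: ProcSample) -> float:
    """Communication-to-computation ratio: bytes moved per ms of CPU time."""
    if sample.cpu_ms <= 0:
        # A process that only communicates is the worst migration candidate.
        return float("inf")
    return (sample.bytes_read + sample.bytes_written) / sample.cpu_ms


def pick_migration_candidate(samples):
    """Heuristic: select the process with the lowest communication overhead."""
    return min(samples, key=ccr)


samples = [
    ProcSample(pid=101, bytes_read=4096, bytes_written=4096, cpu_ms=80.0),
    ProcSample(pid=102, bytes_read=0, bytes_written=512, cpu_ms=95.0),
    ProcSample(pid=103, bytes_read=65536, bytes_written=65536, cpu_ms=10.0),
]
candidate = pick_migration_candidate(samples)
print(candidate.pid, round(ccr(candidate), 2))
```

With these made-up samples, process 102 is selected because it moves the fewest bytes per unit of computation, so migrating it away from its peer costs the least.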
8Opensource Functionality and Status
- SSI patch V0.5.2
- GFS as root filesystem (using GNBD)
- clusterwide device space (/devfs/)
- clusterwide process management
- clusterwide PID
- clusterwide IPC (signal, fifo, pipe)
- process migration with shipped-back sockets
- SSI patch V0.6.0
- patch V0.5.2 + MOSIX LL integration
9Placement System on Opensource SSI Cluster
10Overview
- Combined with kernel-level migration system
- Components
- Interpreter / List manager / Load manager
[Figure: placement system and MOSIX Load Leveler]
11Component - Interpreter
- Interpreter
- read and parse user commands
- implemented by modifying bash-2.05
- request eligibility check from list manager
- execute task locally or remotely
- measure execution time of first executed job
12Component - List Manager
- List manager
- determine eligibility of remote execution
- maintain local and remote list per user
- add long job to remote list
- receive eligibility check request from interpreter
- check that user command is eligible to be run remotely
- reply to interpreter with local or remote execution
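The eligibility check described above can be sketched as below. This is an illustration under stated assumptions: the class name `ListManager`, the method names, and the `LONG_JOB_SECONDS` threshold are hypothetical (the slides do not say what counts as a long job), and the three replies mirror the "local / remote / new" labels of the overall-operations figure.

```python
from collections import defaultdict

# Assumed threshold for a "long job"; the slides do not give a value.
LONG_JOB_SECONDS = 1.0


class ListManager:
    """Keeps one local list and one remote list of commands per user."""

    def __init__(self):
        self.local = defaultdict(set)
        self.remote = defaultdict(set)

    def local_or_remote(self, user, command):
        """Answer an eligibility request from the interpreter."""
        if command in self.remote[user]:
            return "remote"
        if command in self.local[user]:
            return "local"
        return "new"  # never seen: run locally once and time it

    def record_local_run(self, user, command, seconds):
        """Store the measured execution time of a locally executed job."""
        if seconds >= LONG_JOB_SECONDS:
            self.local[user].discard(command)
            self.remote[user].add(command)  # long job: eligible for remote execution
        else:
            self.local[user].add(command)


manager = ListManager()
first = manager.local_or_remote("alice", "pi_solver")
manager.record_local_run("alice", "pi_solver", 28.1)
second = manager.local_or_remote("alice", "pi_solver")
manager.record_local_run("alice", "ls", 0.01)
third = manager.local_or_remote("alice", "ls")
print(first, second, third)
```

Because the lists are kept per user, another user running the same command for the first time would still get "new" and be timed locally.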
13Component - Load Manager
- Load Manager
- runs as a daemon process
- maintains load info. of the kernel-level MOSIX Load Leveler (LL)
- invokes a system call to get load info. from MOSIX LL when MOSIX updates load info.
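A user-level sketch of the load manager's role follows. The new system call is not specified in the slides, so it is replaced here by a plain Python callable, and the shared-memory segment by a dictionary; `LoadManager`, `refresh`, and `lightest_loaded_node` are names invented for the example.

```python
def lightest_loaded_node(load_by_node):
    """Return the node id with the smallest load; ties go to the lowest id."""
    if not load_by_node:
        raise ValueError("no load information available")
    return min(sorted(load_by_node), key=lambda node: load_by_node[node])


class LoadManager:
    """Daemon-side state: caches the lightest node for the interpreter."""

    def __init__(self, read_load_info):
        # read_load_info stands in for the new system call into MOSIX LL.
        self._read_load_info = read_load_info
        # The dict stands in for the shared-memory segment read by the interpreter.
        self.shared = {"lightest": None}

    def refresh(self):
        """Called whenever the kernel reports updated load information."""
        self.shared["lightest"] = lightest_loaded_node(self._read_load_info())
        return self.shared["lightest"]


def fake_kernel_loads():
    # Made-up load values for three nodes.
    return {1: 0.9, 2: 0.2, 3: 0.2}


load_manager = LoadManager(fake_kernel_loads)
print(load_manager.refresh())
```

Publishing only the result of the selection keeps the interpreter's lookup cheap: it reads one cached value instead of querying the kernel for every command.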
14Overall Operations
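[Figure: overall operations, split into user level and kernel level]
- User level: Interpreter, calling localOrRemote (task) and lightLodedNode(), then local exe or remote exe
- User level: List Manager daemon, answering local / remote / new; local exe time check
- User level: Load Manager, publishing the lightest loaded node through SHM
- Kernel level: MOSIX LL, providing load info. through a new syscall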
15Performance Evaluation - Overhead
- Program (pi value solver)
- exe-time 4.1612 sec
- Exe-time comparison
- Eligibility check time increases as the length of the list grows
16Performance Evaluation - Speedup
- Environment
- P-III 700MHz, 256MB, 5 nodes
- 100Mbps Ethernet
- Linux 2.4.16 with Opensource SSI patch V0.6
- Program (pi value solver)
- exe-time 28.1 sec
- Test
- Invoke jobs on one node randomly
- Measure speedup as nodes are added
17Performance Evaluation - Speedup (cont'd)
- 28.1 sec / 50 jobs / random arrival (210 sec)
[Chart: results, axis unit sec]
18Limitation & Future Work
- Limitation
- jobs that are invoked at the same time may all go to the node that is lightest at that time
- coarse-grained placement
- unit of placement is a job
- no consideration of the placement of processes that are created by one job
- Future work
- more detailed evaluation
- job characteristics / different jobs
- fine-grained placement
- jobs generating several processes
- adaptive load index depending on application characteristics
19Socket Migration Integration within Opensource
20Opensource Shipback Socket
- Socket migration by file op. function shipping
- migrated process ships its file op. functions to the original node
- Real socket migration
[Figure: file op. function shipping. Process A and Process B (shown before and after migration) on Node 1, Node 2, and Node 3, with VPROC / PPROC on the nodes; op. functions are shipped back and the result is returned]
[Figure: real socket migration. Same processes and nodes; the connection is closed and then reopened after migration]
21CRAK 2001
[Figure: CRAK checkpoint and restart steps, split into user level and kernel level]
checkpoint
1. Stop the peer process to be checkpointed using rsh
2. ioctl
3. Save address space, register set, open files/pipes/socket, ...
restart
4. Recover socket and file descriptor
5. Try to bind to the same port
6. Use rsh to change socket info of remote process
7. ioctl
8. Load the checkpointed file and copy it into memory
9. Set information of socket and file structures constructed above
10. Let the stopped process continue to run
22Our Socket Migration Flow (1)
[Figure: timeline (TIME) across Node A, Node B, and Node C, each split into task and kernel; messages: SIGMIGRATE, SIGSTOP, ICS Communication, RPC / RPC Response]
1. Migration ??? ?????? ?? Checkpoint ??
2. Process descriptor
3. Exporting process's root, current working directory
4. ?? ????? Virtual Process? ??, ?? ??? ??
5. Process Context ??
6. file reopen? ??? ? descriptor ? ??? export file descriptor? socket descriptor?? Socket Migration ??
7. ? descriptor ?? export? path ? ???? file? reopen
8. ?? open?? ?? socket descriptor ? ??? file? ?? checkpoint
23Our Socket Migration Flow (2)
[Figure: timeline (TIME) across Node A, Node B, and Node C, each split into task and kernel; messages: RPC Response, ICS Communication, SIGCONT (twice)]
9. TCP_TIME_WAIT
10. Socket Structure ??
11. ??? dest. port? ???? bind
12. (saddr, sport, daddr, dport)? ?? Socket? ??? process? ??
13. saddr, sport? NodeC? ??
14. Destination socket information ??? ??
16. SS_CONNECTING
17. daddr, dport? Node C? ??
18. Hash Table Update
19. SS_CONNECTED
20. TCP_ESTABLISHED
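The address rewrite and hash table update in the later steps of this flow can be modeled with a toy connection table keyed by the (saddr, sport, daddr, dport) tuple. This is only a sketch: part of the step text is unreadable, so it assumes that the peer's daddr and dport are repointed to Node C and the entry is rehashed under the new key. The class `ConnTable`, the node addresses, the port numbers, and the choice of Node A as the peer are all made up for illustration.

```python
from typing import Dict, NamedTuple, Optional


class FourTuple(NamedTuple):
    saddr: str
    sport: int
    daddr: str
    dport: int


class ConnTable:
    """Toy stand-in for a kernel connection hash table, keyed by 4-tuple."""

    def __init__(self) -> None:
        self._entries: Dict[FourTuple, Dict[str, str]] = {}

    def insert(self, key: FourTuple) -> None:
        self._entries[key] = {"ss": "SS_CONNECTED", "tcp": "TCP_ESTABLISHED"}

    def lookup(self, key: FourTuple) -> Optional[Dict[str, str]]:
        return self._entries.get(key)

    def retarget_peer(self, old: FourTuple, new_daddr: str, new_dport: int) -> FourTuple:
        """Peer side: repoint the destination to the new node and rehash."""
        entry = self._entries.pop(old)       # remove the entry under the old key
        entry["ss"] = "SS_CONNECTING"        # connection is in transition
        new = old._replace(daddr=new_daddr, dport=new_dport)
        self._entries[new] = entry           # hash table update under the new key
        entry["ss"] = "SS_CONNECTED"
        entry["tcp"] = "TCP_ESTABLISHED"
        return new


# Made-up addresses: the peer stays on Node A, the other end moves B -> C.
NODE_A, NODE_B, NODE_C = "10.0.0.1", "10.0.0.2", "10.0.0.3"
table = ConnTable()
before = FourTuple(saddr=NODE_A, sport=40000, daddr=NODE_B, dport=5000)
table.insert(before)
after = table.retarget_peer(before, NODE_C, 5000)
print(after)
```

The entry has to be removed and reinserted rather than edited in place because the tuple is the lookup key: packets arriving from Node C would never match an entry still hashed under Node B's address.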
24Measurement (1)
- Environment
- Pentium III 850MHz, 512MB, 3 nodes
- 100Mbps Ethernet
- Linux 2.4.10-ac4 with Opensource SSI patch V0.5.2
- Test Process
- consists of a sender and a receiver
- single socket connection, 64K buffer
- a small message is sent to the receiver 10000 times
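A test process of this shape might look like the sketch below. It is not the program used for the measurements: a local `socket.socketpair()` stands in for the connection between cluster nodes, the 4-byte payload is an assumed "small message", and the function name `run_transfer` is invented for the example. Timings from it say nothing about the cluster results.

```python
import socket
import threading
import time

MESSAGE = b"ping"        # assumed small message
COUNT = 10_000           # number of sends, as in the test description
BUF_SIZE = 64 * 1024     # 64K socket buffer


def run_transfer(count=COUNT, message=MESSAGE):
    """Send `message` `count` times over one connection; return (bytes, seconds)."""
    sender, receiver = socket.socketpair()
    for sock in (sender, receiver):
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUF_SIZE)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUF_SIZE)

    expected = count * len(message)
    received = 0

    def drain():
        # Receiver side: read until every byte has arrived.
        nonlocal received
        while received < expected:
            chunk = receiver.recv(BUF_SIZE)
            if not chunk:
                break
            received += len(chunk)

    worker = threading.Thread(target=drain)
    start = time.perf_counter()
    worker.start()
    for _ in range(count):
        sender.sendall(message)
    worker.join()
    elapsed = time.perf_counter() - start

    sender.close()
    receiver.close()
    return received, elapsed


received, elapsed = run_transfer()
print(received, "bytes in", round(elapsed, 4), "sec")
```

The receiver runs in its own thread so the sender never blocks on a full buffer, which keeps the loop a measure of one-way transfer rather than of buffer stalls.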
25Measurement (2)
- Total migration cost
- Opensource SSI vs. Opensource SSI real socket migration
- per one socket file descriptor, ? 8ms overhead
26Measurement (3)
[Chart label: ICS_CHANNEL]
- Communication cost
- one way message transfer, 10000 loop
- data copy cost in original node
27Limitation & Future Work
- Limitation
- socket migration within 2 nodes is not yet supported
- ICS channel vs. IPC
- some bugs
- pts or tty devices are not quickly activated on a certain node (receiver)
- Future work
- debugging
- more tests and analysis under various communication conditions
- 2-node socket migration support