Title: Porting NANOS on SDSM
1 Porting NANOS on SDSM
GOAL: porting a shared memory environment to distributed memory. What is missing from current SDSMs?
Christian Perez
2 Who am I?
- December 1999: PhD at LIP, ENS Lyon, France
- Data parallel languages, distributed memory, load balancing, preemptive thread migration
- Winter 1999/2000: TMR at UPC
- OpenMP, Nanos, SDSM
- October 2000: INRIA researcher
- Distributed programs, code coupling
3 Contents
- Motivation
- Related work
- Nanos execution model (NthLib)
- Nanos on top of two SDSMs (JIAJIA, DSM-PM2)
- Missing SDSM functionalities
- Conclusion
4 Motivation
- OpenMP: emerging standard
- simplicity (no data distribution)
- Clusters of machines (mono- or multiprocessor)
- excellent performance/price ratio
- OpenMP on top of a cluster!
5 OpenMP / Cluster: how?
- OpenMP paradigm: shared memory
- Cluster paradigm: message passing
- Use a software DSM system!
- Hardware DSM system: SCI (write: 2 µs)
- specific hardware
- not yet stable
6 Related work
- Several OpenMP/DSM implementations
- OpenMP NOW!, Omni
- But,
- Modification of OpenMP semantics
- One level of parallelism
- Do not exploit high performance networks
7 OpenMP on a classical DSM
- Compiler extracts shared data from the stack
- Expensive local variable creation
- shared memory allocation (see the sketch below)
- Modification of the OpenMP standard
- default should be private instead of shared
- New synchronization primitives
- condition variables, semaphores
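Below is a minimal C sketch of this hoisting of shared locals off the stack; dsm_malloc/dsm_free are hypothetical names (here backed by malloc/free so the sketch compiles), not the API of any particular SDSM.

  /* Sketch: why a classical SDSM forces shared locals off the stack.
   * dsm_malloc/dsm_free are stand-ins backed by malloc/free; a real SDSM
   * would return pages visible to every node.                            */
  #include <stdlib.h>

  static void *dsm_malloc(size_t size) { return malloc(size); }
  static void  dsm_free(void *p)       { free(p); }

  void compute(void)
  {
      /* The OpenMP source would simply write:  double sum = 0.0;
       * but a per-node stack is private, so the compiler has to hoist the
       * shared variable into DSM-allocated memory instead.                */
      double *sum = dsm_malloc(sizeof *sum);
      *sum = 0.0;

      /* ... parallel region: every node reaches the value through *sum ... */

      dsm_free(sum);
  }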
8 OpenMP on a classical DSM
- One level of parallelism (SPMD)

  !$omp parallel do
  do i = 1, 4
     x(i) = x(i) + x(i+1)
  end do

  is translated into SPMD code such as:

  call schedule(lb, ub, ...)
  do i = lb, ub
     x(i) = x(i) + x(i+1)
  end do
  call dsm_barrier()   ! implicit barrier at the end of the parallel loop
9 Omni compilation approach
(Figure taken from pdplab.trc.rwcp.or.jp/pdperf/Omni/wgcc2k/)
10 Our goals
- Support OpenMP standard
- High performance
- Allow exploitation of
- multithreading (SMP)
- high performance networks
11 Nanos OpenMP compiler
- Converts an OpenMP program to a task graph
- Communications via shared memory

  !$omp parallel do
  do i = 1, 4
     x(i) = x(i) + x(i+1)
  end do

[task graph: one task for i = 1,2 and one for i = 3,4]
12 NthLib runtime support
- The Nanos compiler generates intermediate code
- Communications still via shared memory

  call nthf_depadd(...)
  do nth_p = 1, proc
     nth = nthf_create_1s(..., f, ...)
  end do
  call nth_block()

  subroutine f(...)
     x(i) = x(i) + x(i+1)
13 NthLib details
- Assumes it runs on top of kernel threads
- Provides user-level threads (QT)
- Stack management (allocation)
- Stack initialization (arguments)
- Explicit context switches (mechanism sketched below)
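The following sketch illustrates the general user-level thread mechanism (stack allocation, argument initialization, explicit context switch) with POSIX ucontext; it is an illustration of the technique, not NthLib's actual code, which relies on the QT package.

  /* User-level thread mechanism illustrated with POSIX ucontext. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <ucontext.h>

  #define STACK_SIZE (64 * 1024)

  static ucontext_t main_ctx, work_ctx;

  static void work(int arg)
  {
      printf("nano-thread running with argument %d\n", arg);
      /* returning resumes main_ctx through uc_link */
  }

  int main(void)
  {
      getcontext(&work_ctx);
      work_ctx.uc_stack.ss_sp   = malloc(STACK_SIZE);        /* stack allocation      */
      work_ctx.uc_stack.ss_size = STACK_SIZE;
      work_ctx.uc_link          = &main_ctx;                  /* where to return       */
      makecontext(&work_ctx, (void (*)(void))work, 1, 42);    /* stack init (argument) */

      swapcontext(&main_ctx, &work_ctx);                       /* explicit switch       */
      free(work_ctx.uc_stack.ss_sp);
      return 0;
  }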
14 NthLib queues
- Global/Local
- Thread descriptor
- Rich functionalities
- Work descriptor
- High performance
15 NthLib memory management
[figure: a stack slot containing the nano-thread descriptor, its successors, the stack and a guard zone]
- Mutual exclusion, mmap allocation
- SLOT_SIZE stack alignment (see the sketch below)
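A minimal sketch of SLOT_SIZE-aligned slots, assuming the descriptor is stored at the slot base; the SLOT_SIZE value and helper names are illustrative, not NthLib's real code.

  /* SLOT_SIZE-aligned stack slots: any address inside a slot identifies the
   * slot, hence the descriptor stored at its base, in O(1).               */
  #define _DEFAULT_SOURCE
  #include <stdint.h>
  #include <sys/mman.h>

  #define SLOT_SIZE (64 * 1024)                        /* power of two (assumed) */
  #define SLOT_MASK (~((uintptr_t)SLOT_SIZE - 1))

  /* mmap only guarantees page alignment, so map twice the slot size and keep
   * the aligned half (a real allocator would unmap the rest; error handling
   * omitted).                                                               */
  static void *alloc_slot(void)
  {
      char *raw = mmap(NULL, 2 * SLOT_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      return (void *)(((uintptr_t)raw + SLOT_SIZE - 1) & SLOT_MASK);
  }

  /* Recover the slot (and descriptor) from the current stack pointer. */
  static void *slot_of(void *sp)
  {
      return (void *)((uintptr_t)sp & SLOT_MASK);
  }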
16 Porting NthLib to SDSM
- Data consistency
- Shared memory management
- Nano-threads
- JIAJIA implementation
- DSM-PM2 implementation
- Summary of DSM requirements
17 Data consistency
- Mutual exclusion for defined data structures
- → Acquire/Release
- User-level shared memory data
- → Barrier
18 Data consistency (cont.)
[figure: user-level shared data kept consistent by barriers between successive parallel phases]
19 Shared memory management
- Asynchronous shared memory allocation
- Alignment parameter (> PAGE_SIZE)
- Global variables / common declarations
- → Not yet supported
20 Nano-threads
- Run-to-block execution model
- Shared stacks (father/son relationships)
- Implicit thread migration (scheduler)
21 JIAJIA
- Developed in China by W. Hu, W. Shi and Z. Tang
- Public-domain DSM
- User-level DSM
- DSM: lock/unlock, barrier, condition variables (usage sketch below)
- MP: send/receive, broadcast, reduce
- Solaris, AIX, Irix, Linux, NT (not distributed)
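A minimal usage sketch of JIAJIA's shared-memory primitives; the names (jia_init, jia_alloc, jia_lock/jia_unlock, jia_barrier, jia_exit, jiapid) follow the JIAJIA documentation, but exact signatures should be checked against the distribution.

  /* JIAJIA-style usage sketch: shared allocation, acquire/release via a
   * lock, and barriers for user-level data consistency.                  */
  #include <jia.h>

  int main(int argc, char **argv)
  {
      jia_init(argc, argv);                           /* join the DSM              */

      int *counter = (int *)jia_alloc(sizeof(int));   /* shared allocation         */
      if (jiapid == 0)                                /* jiapid: this node's rank   */
          *counter = 0;
      jia_barrier();                                  /* make the value visible     */

      jia_lock(0);                                    /* acquire                    */
      (*counter)++;                                   /* protected update           */
      jia_unlock(0);                                  /* release                    */

      jia_barrier();                                  /* synchronize user data      */
      jia_exit();
      return 0;
  }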
22 JIAJIA memory allocation
- No control of memory alignment (x2)
- Synchronous memory allocation primitive
- → Development of an RPC version
- Based on the send/receive primitives
- Addition of a user-level message handler
- → Problems
- Global lock
- Interference with JIAJIA blocking functions
23 JIAJIA discussion
- Global barrier for data synchronization
- → No multiple levels of parallelism
- Not thread-aware
- → No efficient use of SMP nodes
24 DSM-PM2
- Developed at LIP by G. Antoniu (PhD student)
- Public domain
- User level, module of PM2
- Generic and multi-protocol DSM
- DSM: lock/unlock
- MP: LRPC
- Linux, Solaris, Irix (32 bits)
25 PM2 organization
[diagram of the PM2 software stack: DSM; MAD1 (TCP, PVM, MPI, SCI, VIA, SBP); MAD2 (TCP, MPI, SCI, VIA, BIP); MARCEL (MONO, SMP, ACTIVATION); PM2; TBX; NTBX]
http://www.pm2.org
26 DSM-PM2 memory allocation
- Only static memory allocation
- → Build a dynamic memory allocation primitive (sketch below)
- Centralized memory allocation
- LRPC to node 0
- → Integration of an alignment parameter
- Summer 2000: dynamic memory allocation ready!
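A schematic sketch of such a centralized, aligned allocator, assuming node 0 owns a bump pointer over a segment mapped at the same virtual address everywhere; all names are illustrative and the LRPC transport is replaced by a direct call, so this is not the DSM-PM2 API.

  /* Centralized, aligned DSM allocation: node 0 hands out offsets that are
   * valid on every node because the segment has the same base everywhere. */
  #include <stddef.h>
  #include <stdint.h>

  static char      dsm_segment[1 << 20];   /* stand-in for the shared segment */
  static uintptr_t next_free;              /* bump pointer, owned by node 0   */

  /* Executed on node 0 (called directly here; other nodes would reach it
   * through an LRPC).                                                       */
  static uintptr_t alloc_on_node0(size_t size, size_t align)
  {
      uintptr_t off = (next_free + align - 1) & ~(uintptr_t)(align - 1);
      next_free = off + size;
      return off;
  }

  void *dsm_malloc_aligned(size_t size, size_t align)     /* align >= page size */
  {
      return dsm_segment + alloc_on_node0(size, align);
  }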
27 DSM-PM2 marcel descriptor
[figure: the marcel_t descriptor sits at a page boundary and is located by masking the stack pointer with MASK and offsetting by SLOT_SIZE]
- NthLib requirement: one kernel thread → many nano-threads
28 DSM-PM2 marcel descriptor (cont.)
[figure: with several nano-threads per kernel thread, each slot still starts on a page boundary and its marcel_t descriptor is found by the same stack-pointer masking with MASK and SLOT_SIZE]
29 DSM-PM2 discussion
- Uses page-level sequential consistency
- no need for barriers (multiple levels of parallelism)
- False sharing
- → Dedicated stack layout
[figure: stack slot laid out so that marcel_t and the padded areas fall on page boundaries]
30 DSM-PM2 discussion (cont.)
- No alternate stack for the signal handler
- → Prefetch pages before the context switch: O(n) (see the sketch below)
- → Pad to the next page before opening parallelism
[figure: shared data padded up to the next page boundary]
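A minimal sketch of the prefetch workaround, assuming the target stack's bounds are known; names are illustrative.

  /* O(n) prefetch workaround: touch every DSM page the target stack spans
   * before switching to it, so no SIGSEGV has to be handled while already
   * running on that stack.                                                */
  #include <stdint.h>

  #define PAGE_SIZE 4096

  static void prefetch_stack(char *stack_low, char *stack_high)
  {
      for (char *p = stack_low; p < stack_high; p += PAGE_SIZE)
          (void)*(volatile char *)p;      /* fault the page in through the DSM */
  }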
31 DSM-PM2 improvements
- Availability of an asynchronous DSM malloc
- Lazy data consistency protocols under evaluation
- eager consistency, multiple writers
- scope consistency
- Support for stacks in shared memory (Linux)
32 DSM-PM2 shared stack support
[figure sequence: each stack slot holds the marcel_t descriptor, located by masking the stack pointer with MASK and offsetting by SLOT_SIZE, plus a dedicated SEGV stack; pages of the nano-thread stacks are created in shared memory step by step as SEGV faults are handled]
38 DSM requirements
- Support for static global shared variables
- Efficient code
- remove one level of indirection
- Enable use of classical compilers
- Support for common blocks
- → "Sharedization" of already allocated memory
- dsm_to_shared(void *p, size_t size) (usage sketch below)
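A usage sketch of the proposed primitive; dsm_to_shared does not exist in current SDSMs, and its prototype (with an assumed void return type) is the one suggested on this slide.

  /* Usage sketch of the proposed "sharedization" primitive.               */
  #include <stddef.h>

  extern void dsm_to_shared(void *p, size_t size);   /* proposed primitive */

  /* Static/global data, e.g. what a Fortran COMMON block compiles to. */
  static double field[1024][1024];

  void make_globals_shared(void)
  {
      /* Turn the already-allocated region into DSM-managed pages, so the
       * unmodified compiler-generated accesses to it keep working.        */
      dsm_to_shared(field, sizeof field);
  }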
39 DSM requirements
- Support for multiple levels of parallelism
- Partial barrier
- group management
- Dependency support
- like acquire/release
- but without a lock
40 DSM requirements (cont.)
[figures: a partial barrier synchronizes only the threads of one group instead of all nodes; dependencies are expressed with primitives of the form start(1), start(2), stop(1), stop(2), update(1,2) instead of a lock or a global barrier]
43 Summary of DSM requirements
- Support for static global shared variables
- → "Sharedization" of already allocated memory
- Acquire/release primitives
- Partial barrier
- → group management
- Asynchronous shared memory allocation
- Alignment parameter for memory allocation
- Threads (SMP nodes)
- Optimized stack management
44 Conclusion
- Successfully ported Nanos to two DSMs
- → JIAJIA, DSM-PM2
- DSM requirements to obtain performance
- → Support for the MIMD model
- → Automatic thread migration
- Performance?
45 Optimized stack management
- Virtual address range memory reservation
- Page creation (mmap) on demand
- Alternate stack for the handler (sketch below)
- → Minimize the number of created pages
- → Reduce message size on thread migration
- → Allow potentially huge stacks
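A Linux-oriented sketch of this scheme, assuming a SIGSEGV handler running on an alternate stack (sigaltstack) commits pages on demand with mprotect; error handling and per-slot bookkeeping are omitted.

  /* Reserve the whole stack slot with PROT_NONE; pages are created lazily
   * by the SIGSEGV handler, which runs on its own alternate stack.        */
  #define _DEFAULT_SOURCE
  #include <signal.h>
  #include <stdint.h>
  #include <string.h>
  #include <sys/mman.h>

  #define SLOT_SIZE (1024 * 1024)
  #define PAGE_SIZE 4096

  static char altstack[64 * 1024];      /* handler cannot run on the faulting,
                                           not-yet-mapped stack                */

  static void on_segv(int sig, siginfo_t *si, void *ctx)
  {
      (void)sig; (void)ctx;
      uintptr_t page = (uintptr_t)si->si_addr & ~(uintptr_t)(PAGE_SIZE - 1);
      mprotect((void *)page, PAGE_SIZE, PROT_READ | PROT_WRITE);  /* create page */
  }

  void *reserve_stack_slot(void)
  {
      stack_t ss;
      struct sigaction sa;

      ss.ss_sp = altstack; ss.ss_size = sizeof altstack; ss.ss_flags = 0;
      sigaltstack(&ss, NULL);                /* alternate stack for the handler */

      memset(&sa, 0, sizeof sa);
      sa.sa_sigaction = on_segv;
      sa.sa_flags = SA_SIGINFO | SA_ONSTACK;
      sigaction(SIGSEGV, &sa, NULL);

      /* Reserve the address range only; physical pages appear on first touch. */
      return mmap(NULL, SLOT_SIZE, PROT_NONE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  }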