Title: Parallel programming paradigms
1. Parallel programming paradigms
- Express coarse-grain parallelism in applications in order to utilize multiple processors/cores in a parallel and distributed system.
- Task-level/thread-level parallelism versus instruction-level parallelism.
- Mechanism for expressing the parallelism (execution model).
- Shared address space / distributed address space (memory model).
- Implicit communication / explicit communication.
- General purpose / special purpose.
2. Parallel programming models
- What is a programming model?
- An abstract virtual machine.
- A view of data and execution.
- The logical interface between architecture and applications.
3. Parallel programming models
- Why a programming model?
- Decouple applications and architectures.
- Write applications that run effectively across architectures.
- Design new architectures that effectively support legacy applications.
- Programming model design considerations:
- Expose architecture features.
- Maintain ease of use.
4. Language features for supporting parallel programming models
- Explicitly concurrent languages: languages with parallel/concurrent constructs.
- UPC, Java, Ada
- Compiler-supported extensions: annotations are added to indicate the parallelism in a sequential program; the compiler uses the annotations to automatically generate parallel code.
- HPF, Cilk, OpenMP
- Library packages outside the language.
- Pthreads, MPI
5. Common parallel programming models
- Shared memory (Pthreads)
- Multiple threads work on shared memory.
- Message passing (MPI)
- Multiple processes work on independent memories and use explicit communication to obtain information from other processes.
- Data parallel (HPF)
- Data is distributed across different nodes; each node works on its own data.
- The opposite of task parallelism.
- Shared memory data parallel (OpenMP)
- Partitioned shared memory (UPC)
- Hybrid OpenMP/MPI
- Remote procedure call
6. Programming models
7. Pthreads: a shared memory programming model
- POSIX standard shared memory multithreading interface.
- Not just for parallel programming, but for general multithreaded programming.
- Provides primitives for thread management and synchronization.
- Threads are commonly associated with shared memory architectures and operating systems.
- Necessary for unleashing the computing power of SMT and CMP processors.
- Making it easy and efficient is very important at this time.
8. Pthreads execution model
- A single process can have multiple, concurrent execution paths.
- a.out creates a number of threads that can be scheduled and run concurrently.
- Each thread has local data, but also shares the entire resources (global data) of a.out.
- Any thread can execute any subroutine at the same time as other threads.
- Threads communicate through global memory.
9. Fork-join model for executing threads in an application
[Diagram: the master thread forks worker threads at the start of a parallel region; at the end of the region the threads join back into the master thread.]
10. What does the developer have to do?
- Decide how to decompose the computation into parallel parts.
- Create and destroy threads to support the decomposition.
- Add synchronization to make sure dependences are covered.
11. Creation
- Thread equivalent of fork():

    int pthread_create(
        pthread_t *thread,
        const pthread_attr_t *attr,
        void *(*start_routine)(void *),
        void *arg);

- Returns 0 if OK, and a non-zero error number (> 0) on error.
12. Termination
- Thread termination
- Return from the initial function.
- void pthread_exit(void *status)
- Process termination
- exit() called by any thread
- main() returns
13. Waiting for a child thread
- int pthread_join(pthread_t tid, void **status)
- Equivalent of waitpid() for processes.
14. Detaching a thread
- A detached thread can act as a daemon thread.
- The parent thread doesn't need to wait for it.
- int pthread_detach(pthread_t tid)
- Detaching self:
- pthread_detach(pthread_self())
15. Example of thread creation
16. General pthread structure
- A thread is a concurrent execution of a function.
- The threaded version of the program must be restructured such that the parallel part forms a separate function.
17. Matrix multiply

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++) {
            c[i][j] = 0;
            for (k = 0; k < n; k++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }
18. Parallel matrix multiply
- All i-iterations (or all j-iterations) can be run in parallel.
- If we have p processors, assign n/p rows to each processor.
- This corresponds to partitioning the i-loop.
19. Matrix multiply: parallel part

    void *mmult(void *s)
    {
        int slice = *(int *)s;
        int from = slice * n / p;
        int to = (slice + 1) * n / p;
        for (int i = from; i < to; i++)
            for (int j = 0; j < n; j++) {
                c[i][j] = 0;
                for (int k = 0; k < n; k++)
                    c[i][j] += a[i][k] * b[k][j];
            }
        return NULL;
    }
20. Matrix multiply: main

    int main()
    {
        pthread_t thrd[p];
        int para[p];

        for (int i = 0; i < p; i++) {
            para[i] = i;
            pthread_create(&thrd[i], NULL, mmult, (void *)&para[i]);
        }
        for (int i = 0; i < p; i++)
            pthread_join(thrd[i], NULL);
        return 0;
    }
21. General program structure
- Encapsulate parallel parts in functions.
- Use function arguments to parametrize what a particular thread does.
- Call pthread_create() with the function and arguments, saving the thread identifier returned.
- Call pthread_join() with that thread identifier.
22. Pthreads synchronization
- Create/exit/join
- Provides coarse-grain synchronization.
- Requires thread creation/destruction.
- Need for finer-grain synchronization:
- Mutex locks, condition variables, semaphores.
23. Mutex lock for mutual exclusion

    int counter = 0;

    void *thread_func(void *arg)
    {
        int val;
        /* unprotected code: why? */
        val = counter;
        counter = val + 1;
        return NULL;
    }
24. Mutex locks: lock
- int pthread_mutex_lock(pthread_mutex_t *mutex)
- Tries to acquire the lock specified by mutex.
- If mutex is already locked, the calling thread blocks until mutex is unlocked.
25. Mutex locks: unlock
- int pthread_mutex_unlock(pthread_mutex_t *mutex)
- If the calling thread has mutex currently locked, this unlocks the mutex.
- If other threads are blocked waiting on this mutex, one will unblock and acquire mutex.
- Which one is determined by the scheduler.
26. Mutex example

    int counter = 0;
    pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

    void *thread_func(void *arg)
    {
        int val;
        /* protected by mutex */
        pthread_mutex_lock(&mutex);
        val = counter;
        counter = val + 1;
        pthread_mutex_unlock(&mutex);
        return NULL;
    }
27. Condition variables for signaling
- Think of the producer-consumer problem:
- Producers and consumers run in separate threads.
- The producer produces data and the consumer consumes data.
- The producer has to inform the consumer when data is available.
- The consumer has to inform the producer when buffer space is available.
28. Condition variables: wait
- int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex)
- Blocks the calling thread, waiting on cond.
- Unlocks the mutex while waiting.
- Re-acquires the mutex when unblocked.
29. Condition variables: signal
- int pthread_cond_signal(pthread_cond_t *cond)
- Unblocks one thread waiting on cond.
- The scheduler determines which thread to unblock.
- If no thread is waiting, the signal is a no-op.
30. Producer-consumer program without condition variables
31.

    /* Globals */
    int data_avail = 0;
    pthread_mutex_t data_mutex = PTHREAD_MUTEX_INITIALIZER;

    void *producer(void *arg)
    {
        pthread_mutex_lock(&data_mutex);
        /* produce data */
        /* insert data into queue */
        data_avail = 1;
        pthread_mutex_unlock(&data_mutex);
    }
32.

    void *consumer(void *arg)
    {
        while (!data_avail)
            ; /* do nothing: keep looping!! */

        pthread_mutex_lock(&data_mutex);
        /* extract data from queue */
        if (queue is empty)
            data_avail = 0;
        pthread_mutex_unlock(&data_mutex);
        consume_data();
    }
33. Producer-consumer with condition variables
34.

    int data_avail = 0;
    pthread_mutex_t data_mutex = PTHREAD_MUTEX_INITIALIZER;
    pthread_cond_t data_cond = PTHREAD_COND_INITIALIZER;

    void *producer(void *arg)
    {
        pthread_mutex_lock(&data_mutex);
        /* produce data */
        /* insert data into queue */
        data_avail = 1;
        pthread_cond_signal(&data_cond);
        pthread_mutex_unlock(&data_mutex);
    }
35.

    void *consumer(void *arg)
    {
        pthread_mutex_lock(&data_mutex);
        while (!data_avail)
            /* sleep on condition variable */
            pthread_cond_wait(&data_cond, &data_mutex);

        /* woken up */
        /* extract data from queue */
        if (queue is empty)
            data_avail = 0;
        pthread_mutex_unlock(&data_mutex);
        consume_data();
    }
36. A note on condition variables
- A signal is forgotten if no corresponding wait has already occurred; condition variables do not remember signals.
- If you want the signal to be remembered, use semaphores.
37. Semaphores
- Counters for resources shared between threads.
- int sem_wait(sem_t *sem)
- Blocks until the semaphore value is non-zero.
- Decrements the semaphore value on return.
- int sem_post(sem_t *sem)
- Unblocks one waiting thread, if any.
- Otherwise, increments the semaphore value.
38. Pipelined task parallelism with semaphores

    /* P1: */
    for (i = 0; i < num_pics; read(in_pic), i++) {
        int_pic_1[i] = trans1(in_pic);
        sem_post(&event_1_2[i]);
    }

    /* P2: */
    for (i = 0; i < num_pics; i++) {
        sem_wait(&event_1_2[i]);
        int_pic_2[i] = trans2(int_pic_1[i]);
        sem_post(&event_2_3[i]);
    }