Title: Design and Implementation of the CCC Parallel Programming Language
1. Design and Implementation of the CCC Parallel Programming Language
- Nai-Wei Lin
- Department of Computer Science and Information Engineering, National Chung Cheng University
2. Outline
- Introduction
- The CCC programming language
- The CCC compiler
- Performance evaluation
- Conclusions
3. Motivations
- Parallelism is the future trend
- Programming in parallel is much more difficult than programming in serial
- Parallel architectures are very diverse
- Parallel programming models are very diverse
4. Motivations
- Design a parallel programming language that uniformly integrates various parallel programming models
- Implement a retargetable compiler for this parallel programming language on various parallel architectures
5. Approaches to Parallelism
- Library approach
  - MPI (Message Passing Interface), pthread
- Compiler approach
  - HPF (High Performance Fortran), HPC
- Language approach
  - Occam, Linda, CCC (Chung Cheng C)
6. Models of Parallel Architectures
- Control model
  - SIMD: Single Instruction, Multiple Data
  - MIMD: Multiple Instruction, Multiple Data
- Data model
  - Shared memory
  - Distributed memory
7. Models of Parallel Programming
- Concurrency
  - Control parallelism: simultaneously execute multiple threads of control
  - Data parallelism: simultaneously execute the same operations on multiple data
- Synchronization and communication
  - Shared variables
  - Message passing
8. Granularity of Parallelism
- Procedure-level parallelism
  - Concurrent execution of procedures on multiple processors
- Loop-level parallelism
  - Concurrent execution of iterations of loops on multiple processors
- Instruction-level parallelism
  - Concurrent execution of instructions on a single processor with multiple functional units
9. The CCC Programming Language
- CCC is a simple extension of C and supports both control and data parallelism
- A CCC program consists of a set of concurrent and cooperative tasks
- Control parallelism runs in MIMD mode and communicates via shared variables and/or message passing
- Data parallelism runs in SIMD mode and communicates via shared variables
10. Tasks in CCC Programs
11. Control Parallelism
- Concurrency
  - task
  - par and parfor
- Synchronization and communication
  - Shared variables: monitors
  - Message passing: channels
12. Monitors
- The monitor construct is a modular and efficient construct for synchronizing access to shared variables among concurrent tasks
- It provides data abstraction, mutual exclusion, and conditional synchronization
13. An Example - Barber Shop
[Figure: a barber shop with a barber and a waiting chair]
14. An Example - Barber Shop

task main( )
{
    monitor Barber_Shop bs;
    int i;

    par {
        barber( bs );
        parfor (i = 0; i < 10; i++)
            customer( bs );
    }
}
15. An Example - Barber Shop

task barber( monitor Barber_Shop in bs )
{
    while ( 1 ) {
        bs.get_next_customer( );
        bs.finished_cut( );
    }
}

task customer( monitor Barber_Shop in bs )
{
    bs.get_haircut( );
}
16. An Example - Barber Shop

monitor Barber_Shop
{
    int barber, chair, open;
    cond barber_available, chair_occupied;
    cond door_open, customer_left;

    Barber_Shop( );
    void get_haircut( );
    void get_next_customer( );
    void finished_cut( );
};
17. An Example - Barber Shop

Barber_Shop( )
{
    barber = 0; chair = 0; open = 0;
}

void get_haircut( )
{
    while (barber == 0) wait(barber_available);
    barber -= 1;
    chair += 1;
    signal(chair_occupied);
    while (open == 0) wait(door_open);
    open -= 1;
    signal(customer_left);
}
18. An Example - Barber Shop

void get_next_customer( )
{
    barber += 1;
    signal(barber_available);
    while (chair == 0) wait(chair_occupied);
    chair -= 1;
}

void finished_cut( )
{
    open += 1;
    signal(door_open);
    while (open > 0) wait(customer_left);
}
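At run time the compiler turns par and parfor into fork/join thread management (slide 32 summarizes this mapping: tasks become threads). Below is a minimal pthread sketch of how the par block in main could be lowered; par_block, the *_arm wrappers, and the Barber_Shop struct shape are illustrative assumptions, not actual compiler output.

#include <pthread.h>

typedef struct Barber_Shop Barber_Shop;  /* stands in for the compiled monitor */
void barber(Barber_Shop *bs);
void customer(Barber_Shop *bs);

static void *barber_arm(void *arg)   { barber(arg);   return 0; }
static void *customer_arm(void *arg) { customer(arg); return 0; }

/* Assumed lowering of:
 *   par { barber( bs ); parfor (i = 0; i < 10; i++) customer( bs ); }
 */
void par_block(Barber_Shop *bs)
{
    pthread_t t[11];
    int i;

    pthread_create(&t[0], 0, barber_arm, bs);          /* one thread for the par arm  */
    for (i = 0; i < 10; i++)                           /* one thread per parfor index */
        pthread_create(&t[1 + i], 0, customer_arm, bs);
    for (i = 0; i < 11; i++)                           /* the block completes only    */
        pthread_join(t[i], 0);                         /* when every arm completes    */
}

(Since barber loops forever, this particular par never terminates; the sketch only shows the fork/join shape of par and parfor.)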
19. Channels
- The channel construct is a modular and efficient construct for message passing among concurrent tasks
  - Pipe: one to one
  - Merger: many to one
  - Spliter: one to many
  - Multiplexer: many to many
20. Channels
- Communication structures among parallel tasks are more comprehensive
- The specification of communication structures is easier
- The implementation of communication structures is more efficient
- The static analysis of communication structures is more effective
21. An Example - Consumer-Producer
[Figure: a producer feeding a spliter channel that distributes items to three consumers]
22. An Example - Consumer-Producer

task main( )
{
    spliter int chan;
    int i;

    par {
        producer( chan );
        parfor (i = 0; i < 10; i++)
            consumer( chan );
    }
}
23. An Example - Consumer-Producer

task producer( spliter in int chan )
{
    int i;

    for (i = 0; i < 100; i++)
        put(chan, i);
    for (i = 0; i < 10; i++)
        put(chan, END);
}
24. An Example - Consumer-Producer

task consumer( spliter in int chan )
{
    int data;

    while ((data = get(chan)) != END)
        process(data);
}
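Slide 32's mapping turns channels into mutex locks and condition variables. As a minimal sketch of how the put and get used above could be realized on pthreads (the spliter_int type, the function names, and the fixed-capacity ring buffer are assumptions for illustration, not the actual CCC runtime):

#include <pthread.h>

#define CAP 64                              /* assumed channel capacity */

typedef struct {                            /* hypothetical spliter of ints */
    int buf[CAP];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_full, not_empty;
} spliter_int;

#define SPLITER_INIT { {0}, 0, 0, 0, PTHREAD_MUTEX_INITIALIZER, \
                       PTHREAD_COND_INITIALIZER, PTHREAD_COND_INITIALIZER }

void spliter_put(spliter_int *ch, int v)
{
    pthread_mutex_lock(&ch->lock);
    while (ch->count == CAP)                /* block while the buffer is full */
        pthread_cond_wait(&ch->not_full, &ch->lock);
    ch->buf[ch->tail] = v;
    ch->tail = (ch->tail + 1) % CAP;
    ch->count++;
    pthread_cond_signal(&ch->not_empty);    /* wake a waiting consumer */
    pthread_mutex_unlock(&ch->lock);
}

int spliter_get(spliter_int *ch)
{
    int v;

    pthread_mutex_lock(&ch->lock);
    while (ch->count == 0)                  /* block while the buffer is empty */
        pthread_cond_wait(&ch->not_empty, &ch->lock);
    v = ch->buf[ch->head];
    ch->head = (ch->head + 1) % CAP;
    ch->count--;
    pthread_cond_signal(&ch->not_full);
    pthread_mutex_unlock(&ch->lock);
    return v;
}

Because all consumers block on the same not_empty condition, each item is handed to a single consumer, which is exactly the one-to-many discipline a spliter needs; it is also why the producer on slide 23 sends one END per consumer.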
25. Data Parallelism
- Concurrency
  - domain: an aggregate of synchronous tasks
- Synchronization and communication
  - domain variables in a global name space
26. An Example - Matrix Multiplication
[Figure: matrix multiplication, C = A × B]
27. An Example - Matrix Multiplication

domain matrix_op[16]
{
    int a[16], b[16], c[16];
    multiply( distribute in int [16 block][16],
              distribute in int [16][16 block],
              distribute out int [16 block][16] );
};
28. An Example - Matrix Multiplication

task main( )
{
    int A[16][16], B[16][16], C[16][16];
    domain matrix_op m;

    read_array(A);
    read_array(B);
    m.multiply(A, B, C);
    print_array(C);
}
29. An Example - Matrix Multiplication

matrix_op::multiply(A, B, C)
    distribute in int [16 block][16] A;
    distribute in int [16][16 block] B;
    distribute out int [16 block][16] C;
{
    int i, j;

    a = A;
    b = B;
    for (i = 0; i < 16; i++)
        for (c[i] = 0, j = 0; j < 16; j++)
            c[i] += a[j] * matrix_op[i].b[j];
    C = c;
}
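To make the distribution concrete: member k of the domain holds row k of A in a and column k of B in b, and computes row k of C, reading the other members' b vectors through the global name space (matrix_op[i].b). A plain-C rendering of the same computation, with illustrative names and no claim to match the compiler's output:

#define N 16

/* Sequentialized view of the data-parallel multiply: the k loop
 * enumerates the work that domain member k performs in parallel. */
void multiply(int A[N][N], int B[N][N], int C[N][N])
{
    int k, i, j, sum;

    for (k = 0; k < N; k++)                /* the work of domain member k */
        for (i = 0; i < N; i++) {
            sum = 0;
            for (j = 0; j < N; j++)        /* c[i] += a[j] * matrix_op[i].b[j] */
                sum += A[k][j] * B[j][i];  /* member i's b is column i of B */
            C[k][i] = sum;
        }
}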
30. Platforms for the CCC Compiler
- PCs and SMPs
  - Pthread: shared memory, dynamic thread creation
- PC clusters and SMP clusters
  - Millipede: distributed shared memory, dynamic remote thread creation
- The similarities between these two classes of machines enable a retargetable compiler implementation for CCC
31. Organization of the CCC Programming System
[Figure: system layers, from top to bottom]
- CCC applications
- CCC compiler
- CCC runtime library
- Virtual shared memory machine interface
- Pthread (on SMPs) / Millipede (on SMP clusters)
32. The CCC Compiler
- Tasks → threads
- Monitors → mutex locks, read-write locks, and condition variables (see the sketch below)
- Channels → mutex locks and condition variables
- Domains → sets of synchronous threads
- Synchronous execution → barriers
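As a minimal sketch of the monitor mapping above (assumed struct and function shapes, not the compiler's actual output), every entry is bracketed by one mutex and each cond becomes a pthread condition variable tied to it; read-write locks, also mentioned above, would serve read-only entries and are omitted here:

#include <pthread.h>

typedef struct {                            /* hypothetical lowering of Barber_Shop */
    int barber, chair, open;
    pthread_mutex_t entry;                  /* one lock guards every monitor entry */
    pthread_cond_t barber_available, chair_occupied;
    pthread_cond_t door_open, customer_left;
} Barber_Shop;

void get_next_customer(Barber_Shop *bs)
{
    pthread_mutex_lock(&bs->entry);         /* enter the monitor */
    bs->barber += 1;
    pthread_cond_signal(&bs->barber_available);
    while (bs->chair == 0)                  /* wait atomically releases the entry lock */
        pthread_cond_wait(&bs->chair_occupied, &bs->entry);
    bs->chair -= 1;
    pthread_mutex_unlock(&bs->entry);       /* leave the monitor */
}

The while loop around pthread_cond_wait re-tests the condition after each wakeup, which tolerates spurious wakeups and matches the signal-and-continue semantics the CCC monitor code on slides 17 and 18 already assumes.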
33. Virtual Shared Memory Machine Interface
- Processor management
- Thread management
- Shared memory allocation
- Mutex locks
- Read-write locks
- Condition variables
- Barriers
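A plausible C rendering of this interface (the names and signatures are assumptions for illustration, not the actual CCC headers) pairs each item above with a small set of functions:

/* vsmm.h, a hypothetical virtual shared memory machine interface */

typedef struct vsmm_thread  vsmm_thread_t;
typedef struct vsmm_mutex   vsmm_mutex_t;
typedef struct vsmm_rwlock  vsmm_rwlock_t;
typedef struct vsmm_cond    vsmm_cond_t;
typedef struct vsmm_barrier vsmm_barrier_t;

int   vsmm_num_processors(void);                       /* processor management */
int   vsmm_thread_create(vsmm_thread_t **t,            /* thread management */
                         void (*fn)(void *), void *arg, int processor);
int   vsmm_thread_join(vsmm_thread_t *t);
void *vsmm_shmalloc(unsigned long size);               /* shared memory allocation */
void  vsmm_shfree(void *p);
int   vsmm_mutex_lock(vsmm_mutex_t *m);                /* mutex locks */
int   vsmm_mutex_unlock(vsmm_mutex_t *m);
int   vsmm_rwlock_rdlock(vsmm_rwlock_t *l);            /* read-write locks */
int   vsmm_rwlock_wrlock(vsmm_rwlock_t *l);
int   vsmm_rwlock_unlock(vsmm_rwlock_t *l);
int   vsmm_cond_wait(vsmm_cond_t *c, vsmm_mutex_t *m); /* condition variables */
int   vsmm_cond_signal(vsmm_cond_t *c);
int   vsmm_barrier_wait(vsmm_barrier_t *b);            /* barriers */

On SMPs these calls would forward almost directly to pthread; on SMP clusters, to Millipede's distributed shared memory and remote thread creation. Keeping the compiler's output above this narrow interface is what makes it retargetable across the two platform classes.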
34. The CCC Runtime Library
- The CCC runtime library contains a collection of functions that implements the salient abstractions of CCC on top of the virtual shared memory machine interface
35. Performance Evaluation
- SMPs
  - Hardware: an SMP machine with four CPUs; each CPU is an Intel Pentium II Xeon at 450 MHz with 512 KB cache
  - Software: the OS is Solaris 5.7 and the thread library is pthread 1.26
- SMP clusters
  - Hardware: four SMP machines, each with two CPUs; each CPU is an Intel Pentium III at 500 MHz with 512 KB cache
  - Software: the OS is Windows 2000 and the library is Millipede 4.0
  - Network: Fast Ethernet, 100 Mbps
36. Benchmarks
- Matrix multiplication (1024 x 1024)
- Warshall's transitive closure (1024 x 1024)
- Airshed simulation (5)
37. Matrix Multiplication (SMPs)
[Table of execution times in seconds, with (speedup, efficiency) per entry; recovered entries: 64.44 (4.46, 1.11) and 59.44 (4.83, 1.20)]
38. Matrix Multiplication (SMP clusters)
[Table of execution times, in seconds]
39. Warshall's Transitive Closure (SMPs)
[Table of execution times, in seconds]
40. Warshall's Transitive Closure (SMP clusters)
[Table of execution times, in seconds]
41. Airshed Simulation (SMPs)
Entries are execution times in seconds, with (speedup, efficiency); columns are thread configurations.

                | Seq  | 5\5\5          | 1\5\5          | 5\1\5           | 5\5\1           | 1\1\5           | 1\5\1           | 5\1\1
CCC (2 CPU)     | 14.2 | 8.68 (1.6,0.8) | 8.84 (1.6,0.8) | 10.52 (1.3,0.6) | 12.87 (1.1,0.5) | 10.75 (1.3,0.6) | 13.2 (1.1,0.5)  | 14.85 (0.9,0.4)
Pthread (2 CPU) | 14.2 | 8.63 (1.6,0.8) | 8.82 (1.6,0.8) | 10.42 (1.3,0.6) | 12.84 (1.1,0.5) | 10.72 (1.3,0.6) | 13.19 (1.1,0.5) | 14.82 (0.9,0.4)
CCC (4 CPU)     | 14.2 | 6.49 (2.1,0.5) | 6.84 (2.1,0.5) | 9.03 (1.5,0.3)  | 12.08 (1.1,0.2) | 9.41 (1.5,0.3)  | 12.46 (1.1,0.2) | 14.66 (0.9,0.2)
Pthread (4 CPU) | 14.2 | 6.37 (2.2,0.5) | 6.81 (2.1,0.5) | 9.02 (1.5,0.3)  | 12.07 (1.1,0.2) | 9.38 (1.5,0.3)  | 12.44 (1.1,0.2) | 14.62 (0.9,0.2)
42. Airshed Simulation (SMP clusters)
Entries are execution times in seconds, with (speedup, efficiency); columns are thread configurations; m x p means machines x processors per machine.

                    | Seq  | 5\5\5           | 1\5\5           | 5\1\5           | 5\5\1           | 1\1\5           | 1\5\1           | 5\1\1
CCC (1m x 2p)       | 49.7 | 26.13 (1.9,0.9) | 26.75 (1.8,0.9) | 30.37 (1.6,0.8) | 44.25 (1.1,0.5) | 31.97 (1.5,0.7) | 45.25 (1.1,0.5) | 48.51 (1.1,0.5)
Millipede (1m x 2p) | 49.9 | 20.02 (2.4,1.2) | 20.87 (2.3,1.1) | 26.05 (1.9,0.9) | 30.41 (1.6,0.8) | 26.42 (1.8,0.9) | 31.13 (1.5,0.7) | 35.89 (1.3,0.6)
CCC (2m x 2p)       | 49.9 | 26.41 (1.8,0.4) | 27.51 (1.8,0.4) | 50.42 (0.9,0.2) | 56.68 (0.8,0.2) | 54.76 (0.9,0.2) | 58.25 (0.8,0.2) | 91.17 (0.5,0.1)
Millipede (2m x 2p) | 49.9 | 19.98 (2.4,0.6) | 21.84 (2.2,0.5) | 31.33 (1.5,0.4) | 39.31 (1.2,0.3) | 30.85 (1.6,0.4) | 42.13 (1.1,0.2) | 36.38 (1.3,0.3)
CCC (4m x 2p)       | 49.9 | 23.09 (2.1,0.2) | 25.59 (1.9,0.2) | 48.97 (1.0,0.1) | 58.31 (0.8,0.1) | 53.33 (0.9,0.1) | 61.96 (0.8,0.1) | 89.61 (0.5,0.1)
Millipede (4m x 2p) | 49.9 | 16.72 (2.9,0.3) | 17.61 (2.8,0.3) | 35.11 (1.4,0.2) | 41.03 (1.2,0.1) | 33.95 (1.4,0.2) | 40.88 (1.2,0.1) | 36.07 (1.3,0.1)
43. Conclusions
- A high-level parallel programming language that uniformly integrates
  - both control and data parallelism
  - both shared variables and message passing
- A modular parallel programming language
- A retargetable compiler