Title: Day 2
1Day 2
2Agenda
- Parallelism basics
- Parallel machines
- Parallelism again
- High Throughput Computing
- Finding the right grain size
3One thing to remember
Easy
Hard
4Seeking Concurrency
- Data dependence graphs
- Data parallelism
- Functional parallelism
- Pipelining
5Data Dependence Graph
- Directed graph
- Vertices = tasks
- Edges = dependences
6Data Parallelism
- Independent tasks apply the same operation to different elements of a data set
- Okay to perform operations concurrently (see the sketch below)
for i ← 0 to 99 do a[i] ← b[i] + c[i] endfor
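As a concrete illustration, here is a minimal C sketch of that loop using OpenMP (mentioned later under "Current Status"). The array names and length come from the slide; the input values, the printout, and the use of a parallel-for pragma are assumptions made for the example.

/* Data-parallel element-wise add: every iteration is independent,
   so the loop can be split across threads without changing the result.
   Compile with an OpenMP-capable compiler, e.g. cc -fopenmp add.c */
#include <stdio.h>

int main(void)
{
    int a[100], b[100], c[100];

    for (int i = 0; i < 100; i++) {   /* illustrative input data */
        b[i] = i;
        c[i] = 2 * i;
    }

    #pragma omp parallel for
    for (int i = 0; i < 100; i++)
        a[i] = b[i] + c[i];

    printf("a[99] = %d\n", a[99]);
    return 0;
}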
7Functional Parallelism
- Independent tasks apply different operations to different data elements
- First and second statements are independent
- Third and fourth statements are independent (see the sketch below)
a ← 2
b ← 3
m ← (a + b) / 2
s ← (a^2 + b^2) / 2
v ← s - m^2
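A hedged C sketch of the same computation using OpenMP sections: the two independent pairs of statements run concurrently. Variable names follow the slide; the choice of OpenMP sections is illustrative, not something the slide prescribes.

/* Functional parallelism sketch: independent statements run in
   separate OpenMP sections. Compile with cc -fopenmp func.c */
#include <stdio.h>

int main(void)
{
    double a, b, m, s, v;

    /* a and b do not depend on each other */
    #pragma omp parallel sections
    {
        #pragma omp section
        a = 2.0;
        #pragma omp section
        b = 3.0;
    }

    /* m and s both need a and b, but not each other */
    #pragma omp parallel sections
    {
        #pragma omp section
        m = (a + b) / 2.0;
        #pragma omp section
        s = (a * a + b * b) / 2.0;
    }

    v = s - m * m;   /* depends on both m and s */
    printf("v = %f\n", v);
    return 0;
}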
8Pipelining
- Divide a process into stages
- Produce several items simultaneously
9Data Clustering
- Data mining: looking for meaningful patterns in large data sets
- Data clustering: organizing a data set into clusters of similar items
- Data clustering can speed retrieval of related items
10Document Vectors
[Figure: document vectors plotted on axes labeled "Moon" and "Rocket"; documents shown include "The Geology of Moon Rocks", "The Story of Apollo 11", "A Biography of Jules Verne", and "Alice in Wonderland"]
11Document Clustering
12Clustering Algorithm
- Compute document vectors
- Choose initial cluster centers
- Repeat
- Compute performance function
- Adjust centers
- Until the function value converges or max iterations have elapsed
- Output cluster centers (a serial sketch of this loop appears below)
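A rough serial sketch of this loop in C, in the style of k-means. The slide names no particular algorithm, so the one-dimensional data, the two clusters, and the sum-of-squared-distances performance function are all assumptions made for illustration.

/* Clustering loop sketch: assign points to the nearest center,
   measure total squared distance, move centers to cluster means. */
#include <stdio.h>
#include <math.h>

#define N 8          /* number of items (illustrative) */
#define K 2          /* number of clusters (illustrative) */
#define MAX_ITER 100

int main(void)
{
    double x[N] = {1.0, 1.2, 0.8, 1.1, 5.0, 5.3, 4.8, 5.1};
    double center[K] = {0.0, 6.0};   /* initial cluster centers */
    double prev = 1e30;

    for (int iter = 0; iter < MAX_ITER; iter++) {
        double sum[K] = {0.0}, perf = 0.0;
        int count[K] = {0};

        /* assign each item to its closest center; accumulate performance */
        for (int i = 0; i < N; i++) {
            int best = 0;
            for (int k = 1; k < K; k++)
                if (fabs(x[i] - center[k]) < fabs(x[i] - center[best]))
                    best = k;
            perf += (x[i] - center[best]) * (x[i] - center[best]);
            sum[best] += x[i];
            count[best]++;
        }

        /* adjust centers */
        for (int k = 0; k < K; k++)
            if (count[k] > 0)
                center[k] = sum[k] / count[k];

        if (fabs(prev - perf) < 1e-9)   /* function value converged */
            break;
        prev = perf;
    }

    printf("cluster centers: %f %f\n", center[0], center[1]);
    return 0;
}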
13Data Parallelism Opportunities
- Operation being applied to a data set
- Examples
- Generating document vectors
- Finding closest center to each vector
- Picking initial values of cluster centers
14Functional Parallelism Opportunities
- Draw data dependence diagram
- Look for sets of nodes such that there are no
paths from one node to another
15Data Dependence Diagram
Build document vectors
Choose cluster centers
Compute function value
Adjust cluster centers
Output cluster centers
16Programming Parallel Computers
- Extend compilers: translate sequential programs into parallel programs
- Extend languages: add parallel operations
- Add parallel language layer on top of sequential language
- Define totally new parallel language and compiler system
17Strategy 1 Extend Compilers
- Parallelizing compiler
- Detect parallelism in sequential program
- Produce parallel executable program
- Focus on making Fortran programs parallel
18Extend Compilers (cont.)
- Advantages
- Can leverage millions of lines of existing serial programs
- Saves time and labor
- Requires no retraining of programmers
- Sequential programming easier than parallel programming
19Extend Compilers (cont.)
- Disadvantages
- Parallelism may be irretrievably lost when programs are written in sequential languages
- Performance of parallelizing compilers on a broad range of applications is still an open question
20Extend Language
- Add functions to a sequential language
- Create and terminate processes
- Synchronize processes
- Allow processes to communicate
21Extend Language (cont.)
- Advantages
- Easiest, quickest, and least expensive
- Allows existing compiler technology to be leveraged
- New libraries can be ready soon after new parallel computers are available
22Extend Language (cont.)
- Disadvantages
- Lack of compiler support to catch errors
- Easy to write programs that are difficult to debug
23Add a Parallel Programming Layer
- Lower layer
- Core of computation
- Process manipulates its portion of data to produce its portion of result
- Upper layer
- Creation and synchronization of processes
- Partitioning of data among processes
- A few research prototypes have been built based
on these principles
24Create a Parallel Language
- Develop a parallel language from scratch
- occam is an example
- Add parallel constructs to an existing language
- Fortran 90
- High Performance Fortran
- C
25New Parallel Languages (cont.)
- Advantages
- Allows programmer to communicate parallelism to compiler
- Improves probability that executable will achieve high performance
- Disadvantages
- Requires development of new compilers
- New languages may not become standards
- Programmer resistance
26Current Status
- Low-level approach is most popular
- Augment existing language with low-level parallel constructs
- MPI and OpenMP are examples
- Advantages of low-level approach
- Efficiency
- Portability
- Disadvantage: more difficult to program and debug
27Architectures
- Interconnection networks
- Processor arrays (SIMD/data parallel)
- Multiprocessors (shared memory)
- Multicomputers (distributed memory)
- Flynn's taxonomy
28Interconnection Networks
- Uses of interconnection networks
- Connect processors to shared memory
- Connect processors to each other
- Interconnection media types
- Shared medium
- Switched medium
29Shared versus Switched Media
30Shared Medium
- Allows only one message at a time
- Messages are broadcast
- Each processor listens to every message
- Arbitration is decentralized
- Collisions require resending of messages
- Ethernet is an example
31Switched Medium
- Supports point-to-point messages between pairs of processors
- Each processor has its own path to switch
- Advantages over shared media
- Allows multiple messages to be sent simultaneously
- Allows scaling of network to accommodate increase in processors
32Switch Network Topologies
- View switched network as a graph
- Vertices = processors or switches
- Edges = communication paths
- Two kinds of topologies
- Direct
- Indirect
33Direct Topology
- Ratio of switch nodes to processor nodes is 1:1
- Every switch node is connected to
- 1 processor node
- At least 1 other switch node
34Indirect Topology
- Ratio of switch nodes to processor nodes is greater than 1:1
- Some switches simply connect other switches
35Evaluating Switch Topologies
- Diameter
- Bisection width
- Number of edges / node
- Constant edge length? (yes/no)
36 2-D Mesh Network
- Direct topology
- Switches arranged into a 2-D lattice
- Communication allowed only between neighboring switches
- Variants allow wraparound connections between switches on edge of mesh
37 2-D Meshes
38Vector Computers
- Vector computer instruction set includes operations on vectors as well as scalars
- Two ways to implement vector computers
- Pipelined vector processor: streams data through pipelined arithmetic units
- Processor array: many identical, synchronized arithmetic processing elements
39Why Processor Arrays?
- Historically, high cost of a control unit
- Scientific applications have data parallelism
40Processor Array
41Data/instruction Storage
- Front end computer
- Program
- Data manipulated sequentially
- Processor array
- Data manipulated in parallel
42Processor Array Performance
- Performance = work done per time unit
- Performance of processor array
- Speed of processing elements
- Utilization of processing elements
43Performance Example 1
- 1024 processors
- Each adds a pair of integers in 1 μsec
- What is the performance when adding two 1024-element vectors (one element per processor)? (worked out below)
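A worked answer, assuming every processor completes its single addition in the same 1 μsec:

Performance = 1024 additions / 1 μsec = 1024 / 10^-6 sec ≈ 1.02 × 10^9 operations per second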
44Performance Example 2
- 512 processors
- Each adds two integers in 1 μsec
- Performance adding two vectors of length 600? (worked out below)
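A worked answer under the same assumption: with only 512 processors, 600 element-wise additions need two passes (512 elements in the first, the remaining 88 in the second), so the vector add takes 2 μsec even though most processors sit idle during the second pass.

Performance = 600 additions / 2 μsec = 3 × 10^8 operations per second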
45 2-D Processor Interconnection Network
Each VLSI chip has 16 processing elements
46 if (COND) then A else B
47 if (COND) then A else B
48 if (COND) then A else B
49Processor Array Shortcomings
- Not all problems are data-parallel
- Speed drops for conditionally executed code
- Don't adapt to multiple users well
- Do not scale down well to starter systems
- Rely on custom VLSI for processors
- Expense of control units has dropped
50Multicomputer, aka Distributed Memory Machines
- Distributed memory multiple-CPU computer
- Same address on different processors refers to different physical memory locations
- Processors interact through message passing
- Commercial multicomputers
- Commodity clusters
51Asymmetrical Multicomputer
52Asymmetrical MC Advantages
- Back-end processors dedicated to parallel computations → easier to understand, model, tune performance
- Only a simple back-end operating system needed → easy for a vendor to create
53Asymmetrical MC Disadvantages
- Front-end computer is a single point of failure
- Single front-end computer limits scalability of system
- Primitive operating system in back-end processors makes debugging difficult
- Every application requires development of both front-end and back-end program
54Symmetrical Multicomputer
55Symmetrical MC Advantages
- Alleviate performance bottleneck caused by single front-end computer
- Better support for debugging
- Every processor executes same program
56Symmetrical MC Disadvantages
- More difficult to maintain illusion of single parallel computer
- No simple way to balance program development workload among processors
- More difficult to achieve high performance when multiple processes on each processor
57Commodity Cluster
- Co-located computers
- Dedicated to running parallel jobs
- No keyboards or displays
- Identical operating system
- Identical local disk images
- Administered as an entity
58Network of Workstations
- Dispersed computers
- First priority: person at keyboard
- Parallel jobs run in background
- Different operating systems
- Different local images
- Checkpointing and restarting important
59DM programming model
- Communicating sequential programs
- Disjoint address spaces
- Communicate by sending messages
- A message is an array of bytes
- send(dest, char *buf, int len)
- receive(source, char *buf, int len)
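A minimal sketch of this send/receive model using MPI (named later as the common low-level approach). The message tag, buffer size, and two-process layout are assumptions for illustration; compile with mpicc and run with mpirun -np 2.

/* Two-process message passing: rank 0 sends a byte buffer, rank 1 receives it. */
#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    char buf[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        strcpy(buf, "hello from rank 0");
        MPI_Send(buf, (int)strlen(buf) + 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received: %s\n", buf);
    }

    MPI_Finalize();
    return 0;
}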
60Multiprocessors
- Multiprocessor: multiple-CPU computer with a shared memory
- Same address on two different CPUs refers to the same memory location
- Avoid three problems of processor arrays
- Can be built from commodity CPUs
- Naturally support multiple users
- Maintain efficiency in conditional code
61Centralized Multiprocessor
- Straightforward extension of uniprocessor
- Add CPUs to bus
- All processors share same primary memory
- Memory access time same for all CPUs
- Uniform memory access (UMA) multiprocessor
- Symmetrical multiprocessor (SMP)
62Centralized Multiprocessor
63Private and Shared Data
- Private data: items used only by a single processor
- Shared data: values used by multiple processors
- In a multiprocessor, processors communicate via
shared data values
64Problems Associated with Shared Data
- Cache coherence
- Replicating data across multiple caches reduces contention
- How to ensure different processors have the same value for the same address?
- Synchronization (see the sketch below)
- Mutual exclusion
- Barrier
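A small C sketch of the two synchronization needs named above, using OpenMP's critical section (mutual exclusion) and barrier on a shared counter; the counter and the four threads are illustrative.

/* Mutual exclusion and barrier on shared data, sketched with OpenMP.
   Compile with cc -fopenmp sync.c */
#include <stdio.h>

int main(void)
{
    int counter = 0;                    /* shared data value */

    #pragma omp parallel num_threads(4)
    {
        /* Mutual exclusion: only one thread updates the counter at a time. */
        #pragma omp critical
        counter++;

        /* Barrier: no thread proceeds until all have updated the counter. */
        #pragma omp barrier

        #pragma omp single
        printf("counter = %d\n", counter);   /* all updates are visible here */
    }
    return 0;
}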
65Cache-coherence Problem
[Figure sequence, slides 65–68: memory holds X = 7; one CPU reads X and caches 7; a second CPU reads X and also caches 7; the first CPU then writes X = 2, so memory and its cache hold 2 while the second CPU's cache still holds the stale value 7]
69Write Invalidate Protocol
[Figure sequence, slides 69–72: both caches hold X = 7 while a cache-control monitor snoops the bus; before writing, one CPU broadcasts "Intent to write X"; the other cache invalidates its copy; the write then completes, leaving X = 2 in the writer's cache]
73Distributed Multiprocessor
- Distribute primary memory among processors
- Increase aggregate memory bandwidth and lower average memory access time
- Allow greater number of processors
- Also called non-uniform memory access (NUMA)
multiprocessor
74Distributed Multiprocessor
75Cache Coherence
- Some NUMA multiprocessors do not support it in hardware
- Only instructions, private data in cache
- Large memory access time variance
- Implementation more difficult
- No shared memory bus to snoop
- Directory-based protocol needed
76Flynn's Taxonomy
- Instruction stream
- Data stream
- Single vs. multiple
- Four combinations
- SISD
- SIMD
- MISD
- MIMD
77SISD
- Single Instruction, Single Data
- Single-CPU systems
- Note: co-processors don't count
- Functional
- I/O
- Example: PCs
78SIMD
- Single Instruction, Multiple Data
- Two architectures fit this category
- Pipelined vector processor (e.g., Cray-1)
- Processor array (e.g., Connection Machine)
79MISD
- Multiple Instruction, Single Data
- Example: systolic array
80MIMD
- Multiple Instruction, Multiple Data
- Multiple-CPU computers
- Multiprocessors
- Multicomputers
81Summary
- Commercial parallel computers appeared in the 1980s
- Multiple-CPU computers now dominate
- Small-scale: centralized multiprocessors
- Large-scale: distributed memory architectures (multiprocessors or multicomputers)
82Programming the Beast
- Task/channel model
- Algorithm design methodology
- Case studies
83Task/Channel Model
- Parallel computation = set of tasks
- Task
- Program
- Local memory
- Collection of I/O ports
- Tasks interact by sending messages through
channels
84Task/Channel Model
85Foster's Design Methodology
- Partitioning
- Communication
- Agglomeration
- Mapping
86Foster's Methodology
87Partitioning
- Dividing computation and data into pieces
- Domain decomposition
- Divide data into pieces
- Determine how to associate computations with the data
- Functional decomposition
- Divide computation into pieces
- Determine how to associate data with the
computations
88Example Domain Decompositions
89Example Functional Decomposition
90Partitioning Checklist
- At least 10x more primitive tasks than processors in target computer
- Minimize redundant computations and redundant data storage
- Primitive tasks roughly the same size
- Number of tasks an increasing function of problem size
91Communication
- Determine values passed among tasks
- Local communication
- Task needs values from a small number of other tasks
- Create channels illustrating data flow
- Global communication
- Significant number of tasks contribute data to perform a computation
- Don't create channels for them early in design
92Communication Checklist
- Communication operations balanced among tasks
- Each task communicates with only a small group of neighbors
- Tasks can perform communications concurrently
- Tasks can perform computations concurrently
93Agglomeration
- Grouping tasks into larger tasks
- Goals
- Improve performance
- Maintain scalability of program
- Simplify programming
- In MPI programming, goal often to create one
agglomerated task per processor
94Agglomeration Can Improve Performance
- Eliminate communication between primitive tasks agglomerated into consolidated task
- Combine groups of sending and receiving tasks
95Agglomeration Checklist
- Locality of parallel algorithm has increased
- Replicated computations take less time than communications they replace
- Data replication doesn't affect scalability
- Agglomerated tasks have similar computational and communication costs
- Number of tasks increases with problem size
- Number of tasks suitable for likely target systems
- Tradeoff between agglomeration and code modification costs is reasonable
96Mapping
- Process of assigning tasks to processors
- Centralized multiprocessor: mapping done by operating system
- Distributed memory system: mapping done by user
- Conflicting goals of mapping
- Maximize processor utilization
- Minimize interprocessor communication
97Mapping Example
98Optimal Mapping
- Finding optimal mapping is NP-hard
- Must rely on heuristics
99Mapping Decision Tree
- Static number of tasks
  - Structured communication
    - Constant computation time per task
      - Agglomerate tasks to minimize comm
      - Create one task per processor
    - Variable computation time per task
      - Cyclically map tasks to processors
  - Unstructured communication
    - Use a static load balancing algorithm
- Dynamic number of tasks
100Mapping Strategy
- Static number of tasks
- Dynamic number of tasks
  - Frequent communications between tasks
    - Use a dynamic load balancing algorithm
  - Many short-lived tasks
    - Use a run-time task-scheduling algorithm
101Mapping Checklist
- Considered designs based on one task per processor and multiple tasks per processor
- Evaluated static and dynamic task allocation
- If dynamic task allocation chosen, task allocator is not a bottleneck to performance
- If static task allocation chosen, ratio of tasks to processors is at least 10:1
102Case Studies
- Boundary value problem
- Finding the maximum
- The n-body problem
- Adding data input
103Boundary Value Problem
[Figure: a thin rod surrounded by insulation, with its ends in ice water]
104Rod Cools as Time Progresses
105Finite Difference Approximation
106Partitioning
- One data item per grid point
- Associate one primitive task with each grid point
- Two-dimensional domain decomposition
107Communication
- Identify communication pattern between primitive tasks
- Each interior primitive task has three incoming and three outgoing channels
108Agglomeration and Mapping
109Sequential execution time
- χ = time to update an element
- n = number of elements
- m = number of iterations
- Sequential execution time: m(n-1)χ
110Parallel Execution Time
- p = number of processors
- λ = message latency
- Parallel execution time: m(χ⌈(n-1)/p⌉ + 2λ)
111Reduction
- Given associative operator ⊕
- a0 ⊕ a1 ⊕ a2 ⊕ … ⊕ an-1 (see the MPI sketch below)
- Examples
- Add
- Multiply
- And, Or
- Maximum, Minimum
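A minimal MPI sketch of a reduction: each process contributes one integer and rank 0 receives the sum. The per-process values and the choice of MPI_SUM (addition) are illustrative; any of the associative operators listed above could be used instead.

/* Global sum by reduction: each rank contributes its own value,
   rank 0 receives the total. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, local, total;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local = rank + 1;   /* each process's contribution (illustrative) */

    /* Combine all local values with the associative operator MPI_SUM. */
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %d\n", total);

    MPI_Finalize();
    return 0;
}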
112Parallel Reduction Evolution
113Parallel Reduction Evolution
114Parallel Reduction Evolution
115Binomial Trees
116Finding Global Sum
[Figure sequence, slides 116–120: sixteen values (4, 2, 0, 7, -3, 5, -6, -3, 8, 1, 2, 3, -4, 4, 6, -1) are combined pairwise in four steps, giving eight partial sums (1, 7, -6, 4, 4, 5, 8, 2), then four (8, -2, 9, 10), then two (17, 8), then the global sum 25]
121Agglomeration
122Agglomeration
123The n-body Problem
124The n-body Problem
125Partitioning
- Domain partitioning
- Assume one task per particle
- Task has the particle's position and velocity vector
- Iteration
- Get positions of all other particles (see the all-gather sketch below)
- Compute new position and velocity
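A hedged C sketch of the position exchange using MPI_Allgather, matching the all-gather slides that follow. One particle per process, two-dimensional positions, and the trivial update rule are assumptions; a real n-body step would compute forces here.

/* Each process owns one particle and all-gathers every particle's
   position before computing its own update. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, p;
    double mypos[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double *allpos = malloc(2 * p * sizeof(double));

    mypos[0] = rank;          /* illustrative starting position */
    mypos[1] = -rank;

    /* Every task obtains every other task's position (the all-gather step). */
    MPI_Allgather(mypos, 2, MPI_DOUBLE, allpos, 2, MPI_DOUBLE, MPI_COMM_WORLD);

    /* A real n-body step would compute forces from allpos here;
       this placeholder just nudges the particle toward the centroid. */
    double cx = 0.0, cy = 0.0;
    for (int i = 0; i < p; i++) {
        cx += allpos[2 * i];
        cy += allpos[2 * i + 1];
    }
    mypos[0] += 0.01 * (cx / p - mypos[0]);
    mypos[1] += 0.01 * (cy / p - mypos[1]);

    printf("rank %d new position (%.3f, %.3f)\n", rank, mypos[0], mypos[1]);

    free(allpos);
    MPI_Finalize();
    return 0;
}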
126Gather
127All-gather
128Complete Graph for All-gather
129Hypercube for All-gather
130Communication Time
131Adding Data Input
132Scatter
133Scatter in log p Steps
[Figure: data items 1–8 distributed among tasks in log p steps]
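A minimal MPI sketch matching these scatter slides: rank 0 holds eight values and hands one to each of eight processes. The array contents are illustrative, and MPI_Scatter's internal algorithm (possibly the log p tree shown on the slide) is left to the library.

/* Scatter: rank 0 distributes one integer to each process.
   Run with mpirun -np 8 to match the slide's eight items. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, mine;
    int items[8] = {1, 2, 3, 4, 5, 6, 7, 8};   /* data held by rank 0 */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each process receives one element of rank 0's array. */
    MPI_Scatter(items, 1, MPI_INT, &mine, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d received item %d\n", rank, mine);

    MPI_Finalize();
    return 0;
}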
134Summary Task/channel Model
- Parallel computation
- Set of tasks
- Interactions through channels
- Good designs
- Maximize local computations
- Minimize communications
- Scale up
135Summary Design Steps
- Partition computation
- Agglomerate tasks
- Map tasks to processors
- Goals
- Maximize processor utilization
- Minimize inter-processor communication
136Summary Fundamental Algorithms
- Reduction
- Gather and scatter
- All-gather
137High Throughput Computing
- Easy problems, formerly known as "embarrassingly parallel," are now known as "pleasingly parallel"
- Basic idea: "Gee, I have a whole bunch of jobs (single runs of a program) that I need to do; why not run them concurrently rather than sequentially?"
- Sometimes called "bag of tasks" or "parameter sweep" problems
138Bag-of-tasks
139Examples
- A large number of proteins, each represented by a different file, to dock with a target protein
- For all files x, execute f(x, y)
- Exploring a parameter space in n dimensions
- Uniform
- Non-uniform
- Monte Carlo
140Tools
- Most common tool is a queuing system, sometimes called a load management system or a local resource manager
- PBS, LSF, and SGE are the three most common; Condor is also often used
- They all have the same basic functions; we'll use PBS as an exemplar
- Script languages (bash, Perl, etc.)
141PBS
- qsub options script-file
- Submit the script to run
- Options can specify number of processors and other required resources (memory, etc.)
- Returns the job ID (a string)
144Other PBS
- qstat: give the status of jobs submitted to the queue
- qdel: delete a job from the queue
145Blasting a set of jobs
146Issues
- Overhead per job is substantial
- Don't want to run millisecond jobs
- May need to bundle them up
- May not be enough jobs to saturate resources
- May need to break up jobs
- I/O system may become saturated
- Copy large files to /tmp, check for existence in your shell script, copy if not there
- May be more jobs than the queuing system can handle (many start to break down at several thousand jobs)
- Jobs may fail for no good reason
- Develop scripts to check for output and re-submit up to k jobs
147Homework
- Submit a simple job to the queue that echoes the host name; redirect output to a file of your choice.
- Via a script, submit 100 hostname jobs to the queue. Output should be output.X, where X is the output number.
- For each file in a rooted directory tree, run wc to count the words. Maintain the results in a shadow directory tree. Your script should be able to detect results that have already been computed.