Title: Day 2
1Day 2
2Agenda
- Parallelism basics
- Parallel machines
- Parallelism again
- High Throughput Computing
- Finding the right grain size
3One thing to remember
Easy
Hard
4Seeking Concurrency
- Data dependence graphs
- Data parallelism
- Functional parallelism
- Pipelining
5Data Dependence Graph
- Directed graph
- Vertices = tasks
- Edges = dependences
6Data Parallelism
- Independent tasks apply the same operation to different elements of a data set
- Okay to perform operations concurrently (see the sketch below)
for i ← 0 to 99 do a[i] ← b[i] + c[i] endfor
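As a concrete illustration, here is a minimal C sketch of that loop using OpenMP (mentioned later under "Current Status"). The array names and length come from the slide; the input values, the printout, and the use of a parallel-for pragma are assumptions made for the example.

/* Data-parallel element-wise add: every iteration is independent,
   so the loop can be split across threads without changing the result.
   Compile with an OpenMP-capable compiler, e.g. cc -fopenmp add.c */
#include <stdio.h>

int main(void)
{
    int a[100], b[100], c[100];

    for (int i = 0; i < 100; i++) {   /* illustrative input data */
        b[i] = i;
        c[i] = 2 * i;
    }

    #pragma omp parallel for
    for (int i = 0; i < 100; i++)
        a[i] = b[i] + c[i];

    printf("a[99] = %d\n", a[99]);
    return 0;
}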
7Functional Parallelism
- Independent tasks apply different operations to different data elements
- First and second statements are independent
- Third and fourth statements are independent (see the sketch below)
a ← 2
b ← 3
m ← (a + b) / 2
s ← (a^2 + b^2) / 2
v ← s - m^2
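A hedged C sketch of the same computation using OpenMP sections: the two independent pairs of statements run concurrently. Variable names follow the slide; the choice of OpenMP sections is illustrative, not something the slide prescribes.

/* Functional parallelism sketch: independent statements run in
   separate OpenMP sections. Compile with cc -fopenmp func.c */
#include <stdio.h>

int main(void)
{
    double a, b, m, s, v;

    /* a and b do not depend on each other */
    #pragma omp parallel sections
    {
        #pragma omp section
        a = 2.0;
        #pragma omp section
        b = 3.0;
    }

    /* m and s both need a and b, but not each other */
    #pragma omp parallel sections
    {
        #pragma omp section
        m = (a + b) / 2.0;
        #pragma omp section
        s = (a * a + b * b) / 2.0;
    }

    v = s - m * m;   /* depends on both m and s */
    printf("v = %f\n", v);
    return 0;
}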
8Pipelining
- Divide a process into stages
- Produce several items simultaneously
9Data Clustering
- Data mining: looking for meaningful patterns in large data sets
- Data clustering: organizing a data set into clusters of similar items
- Data clustering can speed retrieval of related items
10Document Vectors
[Figure: document vectors plotted on axes labeled "Moon" and "Rocket"; documents shown include "The Geology of Moon Rocks", "The Story of Apollo 11", "A Biography of Jules Verne", and "Alice in Wonderland"]
11Document Clustering
12Clustering Algorithm
- Compute document vectors
- Choose initial cluster centers
- Repeat
- Compute performance function
- Adjust centers
- Until the function value converges or max iterations have elapsed
- Output cluster centers (a serial sketch of this loop appears below)
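A rough serial sketch of this loop in C, in the style of k-means. The slide names no particular algorithm, so the one-dimensional data, the two clusters, and the sum-of-squared-distances performance function are all assumptions made for illustration.

/* Clustering loop sketch: assign points to the nearest center,
   measure total squared distance, move centers to cluster means. */
#include <stdio.h>
#include <math.h>

#define N 8          /* number of items (illustrative) */
#define K 2          /* number of clusters (illustrative) */
#define MAX_ITER 100

int main(void)
{
    double x[N] = {1.0, 1.2, 0.8, 1.1, 5.0, 5.3, 4.8, 5.1};
    double center[K] = {0.0, 6.0};   /* initial cluster centers */
    double prev = 1e30;

    for (int iter = 0; iter < MAX_ITER; iter++) {
        double sum[K] = {0.0}, perf = 0.0;
        int count[K] = {0};

        /* assign each item to its closest center; accumulate performance */
        for (int i = 0; i < N; i++) {
            int best = 0;
            for (int k = 1; k < K; k++)
                if (fabs(x[i] - center[k]) < fabs(x[i] - center[best]))
                    best = k;
            perf += (x[i] - center[best]) * (x[i] - center[best]);
            sum[best] += x[i];
            count[best]++;
        }

        /* adjust centers */
        for (int k = 0; k < K; k++)
            if (count[k] > 0)
                center[k] = sum[k] / count[k];

        if (fabs(prev - perf) < 1e-9)   /* function value converged */
            break;
        prev = perf;
    }

    printf("cluster centers: %f %f\n", center[0], center[1]);
    return 0;
}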
13Data Parallelism Opportunities
- Operation being applied to a data set
- Examples
- Generating document vectors
- Finding closest center to each vector
- Picking initial values of cluster centers
14Functional Parallelism Opportunities
- Draw data dependence diagram
- Look for sets of nodes such that there are no
paths from one node to another
15Data Dependence Diagram
Build document vectors
Choose cluster centers
Compute function value
Adjust cluster centers
Output cluster centers
16Programming Parallel Computers
- Extend compilers: translate sequential programs into parallel programs
- Extend languages: add parallel operations
- Add parallel language layer on top of sequential language
- Define totally new parallel language and compiler system
17Strategy 1 Extend Compilers
- Parallelizing compiler
- Detect parallelism in sequential program
- Produce parallel executable program
- Focus on making Fortran programs parallel
18Extend Compilers (cont.)
- Advantages
- Can leverage millions of lines of existing serial programs
- Saves time and labor
- Requires no retraining of programmers
- Sequential programming easier than parallel programming
19Extend Compilers (cont.)
- Disadvantages
- Parallelism may be irretrievably lost when programs are written in sequential languages
- Performance of parallelizing compilers on a broad range of applications is still an open question
20Extend Language
- Add functions to a sequential language
- Create and terminate processes
- Synchronize processes
- Allow processes to communicate
21Extend Language (cont.)
- Advantages
- Easiest, quickest, and least expensive
- Allows existing compiler technology to be leveraged
- New libraries can be ready soon after new parallel computers are available
22Extend Language (cont.)
- Disadvantages
- Lack of compiler support to catch errors
- Easy to write programs that are difficult to debug
23Add a Parallel Programming Layer
- Lower layer
- Core of computation
- Process manipulates its portion of data to produce its portion of result
- Upper layer
- Creation and synchronization of processes
- Partitioning of data among processes
- A few research prototypes have been built based
on these principles
24Create a Parallel Language
- Develop a parallel language from scratch
- occam is an example
- Add parallel constructs to an existing language
- Fortran 90
- High Performance Fortran
- C
25New Parallel Languages (cont.)
- Advantages
- Allows programmer to communicate parallelism to compiler
- Improves probability that executable will achieve high performance
- Disadvantages
- Requires development of new compilers
- New languages may not become standards
- Programmer resistance
26Current Status
- Low-level approach is most popular
- Augment existing language with low-level parallel constructs
- MPI and OpenMP are examples
- Advantages of low-level approach
- Efficiency
- Portability
- Disadvantage: more difficult to program and debug
27Architectures
- Interconnection networks
- Processor arrays (SIMD/data parallel)
- Multiprocessors (shared memory)
- Multicomputers (distributed memory)
- Flynn's taxonomy
28Interconnection Networks
- Uses of interconnection networks
- Connect processors to shared memory
- Connect processors to each other
- Interconnection media types
- Shared medium
- Switched medium
29Shared versus Switched Media
30Shared Medium
- Allows only one message at a time
- Messages are broadcast
- Each processor listens to every message
- Arbitration is decentralized
- Collisions require resending of messages
- Ethernet is an example
31Switched Medium
- Supports point-to-point messages between pairs of processors
- Each processor has its own path to switch
- Advantages over shared media
- Allows multiple messages to be sent simultaneously
- Allows scaling of network to accommodate increase in processors
32Switch Network Topologies
- View switched network as a graph
- Vertices = processors or switches
- Edges = communication paths
- Two kinds of topologies
- Direct
- Indirect
33Direct Topology
- Ratio of switch nodes to processor nodes is 1:1
- Every switch node is connected to
- 1 processor node
- At least 1 other switch node
34Indirect Topology
- Ratio of switch nodes to processor nodes is greater than 1:1
- Some switches simply connect other switches
35Evaluating Switch Topologies
- Diameter
- Bisection width
- Number of edges / node
- Constant edge length? (yes/no)
36 2-D Mesh Network
- Direct topology
- Switches arranged into a 2-D lattice
- Communication allowed only between neighboring switches
- Variants allow wraparound connections between switches on edge of mesh
37 2-D Meshes
38Vector Computers
- Vector computer instruction set includes operations on vectors as well as scalars
- Two ways to implement vector computers
- Pipelined vector processor: streams data through pipelined arithmetic units
- Processor array: many identical, synchronized arithmetic processing elements
39Why Processor Arrays?
- Historically, high cost of a control unit
- Scientific applications have data parallelism
40Processor Array
41Data/instruction Storage
- Front end computer
- Program
- Data manipulated sequentially
- Processor array
- Data manipulated in parallel
42Processor Array Performance
- Performance = work done per time unit
- Performance of processor array
- Speed of processing elements
- Utilization of processing elements
43Performance Example 1
- 1024 processors
- Each adds a pair of integers in 1 μsec
- What is the performance when adding two 1024-element vectors (one element per processor)? (worked out below)
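A worked answer, assuming every processor completes its single addition in the same 1 μsec:

Performance = 1024 additions / 1 μsec = 1024 / 10^-6 sec ≈ 1.02 × 10^9 operations per second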
44Performance Example 2
- 512 processors
- Each adds two integers in 1 μsec
- Performance adding two vectors of length 600? (worked out below)
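A worked answer under the same assumption: with only 512 processors, 600 element-wise additions need two passes (512 elements in the first, the remaining 88 in the second), so the vector add takes 2 μsec even though most processors sit idle during the second pass.

Performance = 600 additions / 2 μsec = 3 × 10^8 operations per second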
45 2-D Processor Interconnection Network
Each VLSI chip has 16 processing elements
46 if (COND) then A else B
47 if (COND) then A else B
48 if (COND) then A else B
49Processor Array Shortcomings
- Not all problems are data-parallel
- Speed drops for conditionally executed code
- Don't adapt to multiple users well
- Do not scale down well to starter systems
- Rely on custom VLSI for processors
- Expense of control units has dropped
50Multicomputer, aka Distributed Memory Machines
- Distributed memory multiple-CPU computer
- Same address on different processors refers to different physical memory locations
- Processors interact through message passing
- Commercial multicomputers
- Commodity clusters
51Asymmetrical Multicomputer
52Asymmetrical MC Advantages
- Back-end processors dedicated to parallel computations → easier to understand, model, tune performance
- Only a simple back-end operating system needed → easy for a vendor to create
53Asymmetrical MC Disadvantages
- Front-end computer is a single point of failure
- Single front-end computer limits scalability of system
- Primitive operating system in back-end processors makes debugging difficult
- Every application requires development of both front-end and back-end program
54Symmetrical Multicomputer
55Symmetrical MC Advantages
- Alleviate performance bottleneck caused by single front-end computer
- Better support for debugging
- Every processor executes same program
56Symmetrical MC Disadvantages
- More difficult to maintain illusion of single parallel computer
- No simple way to balance program development workload among processors
- More difficult to achieve high performance when multiple processes on each processor
57Commodity Cluster
- Co-located computers
- Dedicated to running parallel jobs
- No keyboards or displays
- Identical operating system
- Identical local disk images
- Administered as an entity
58Network of Workstations
- Dispersed computers
- First priority: person at keyboard
- Parallel jobs run in background
- Different operating systems
- Different local images
- Checkpointing and restarting important
59DM programming model
- Communicating sequential programs
- Disjoint address spaces
- Communicate by sending messages
- A message is an array of bytes
- send(dest, char *buf, int len)
- receive(source, char *buf, int len)
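A minimal sketch of this send/receive model using MPI (named later as the common low-level approach). The message tag, buffer size, and two-process layout are assumptions for illustration; compile with mpicc and run with mpirun -np 2.

/* Two-process message passing: rank 0 sends a byte buffer, rank 1 receives it. */
#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    char buf[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        strcpy(buf, "hello from rank 0");
        MPI_Send(buf, (int)strlen(buf) + 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received: %s\n", buf);
    }

    MPI_Finalize();
    return 0;
}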
60Multiprocessors
- Multiprocessor: multiple-CPU computer with a shared memory
- Same address on two different CPUs refers to the same memory location
- Avoid three problems of processor arrays
- Can be built from commodity CPUs
- Naturally support multiple users
- Maintain efficiency in conditional code
61Centralized Multiprocessor
- Straightforward extension of uniprocessor
- Add CPUs to bus
- All processors share same primary memory
- Memory access time same for all CPUs
- Uniform memory access (UMA) multiprocessor
- Symmetrical multiprocessor (SMP)
62Centralized Multiprocessor
63Private and Shared Data
- Private data: items used only by a single processor
- Shared data: values used by multiple processors
- In a multiprocessor, processors communicate via
shared data values
64Problems Associated with Shared Data
- Cache coherence
- Replicating data across multiple caches reduces contention
- How to ensure different processors have the same value for the same address?
- Synchronization (see the sketch below)
- Mutual exclusion
- Barrier
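A small C sketch of the two synchronization needs named above, using OpenMP's critical section (mutual exclusion) and barrier on a shared counter; the counter and the four threads are illustrative.

/* Mutual exclusion and barrier on shared data, sketched with OpenMP.
   Compile with cc -fopenmp sync.c */
#include <stdio.h>

int main(void)
{
    int counter = 0;                    /* shared data value */

    #pragma omp parallel num_threads(4)
    {
        /* Mutual exclusion: only one thread updates the counter at a time. */
        #pragma omp critical
        counter++;

        /* Barrier: no thread proceeds until all have updated the counter. */
        #pragma omp barrier

        #pragma omp single
        printf("counter = %d\n", counter);   /* all updates are visible here */
    }
    return 0;
}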
65Cache-coherence Problem
[Figure sequence, slides 65–68: memory holds X = 7; one CPU reads X and caches 7; a second CPU reads X and also caches 7; the first CPU then writes X = 2, so memory and its cache hold 2 while the second CPU's cache still holds the stale value 7]
69Write Invalidate Protocol
[Figure sequence, slides 69–72: both caches hold X = 7 while a cache-control monitor snoops the bus; before writing, one CPU broadcasts "Intent to write X"; the other cache invalidates its copy; the write then completes, leaving X = 2 in the writer's cache]
73Distributed Multiprocessor
- Distribute primary memory among processors
- Increase aggregate memory bandwidth and lower average memory access time
- Allow greater number of processors
- Also called non-uniform memory access (NUMA)
multiprocessor
74Distributed Multiprocessor
75Cache Coherence
- Some NUMA multiprocessors do not support it in hardware
- Only instructions, private data in cache
- Large memory access time variance
- Implementation more difficult
- No shared memory bus to snoop
- Directory-based protocol needed
76Flynn's Taxonomy
- Instruction stream
- Data stream
- Single vs. multiple
- Four combinations
- SISD
- SIMD
- MISD
- MIMD
77SISD
- Single Instruction, Single Data
- Single-CPU systems
- Note: co-processors don't count
- Functional
- I/O
- Example: PCs
78SIMD
- Single Instruction, Multiple Data
- Two architectures fit this category
- Pipelined vector processor (e.g., Cray-1)
- Processor array (e.g., Connection Machine)
79MISD
- Multiple Instruction, Single Data
- Example: systolic array
80MIMD
- Multiple Instruction, Multiple Data
- Multiple-CPU computers
- Multiprocessors
- Multicomputers
81Summary
- Commercial parallel computers appeared in the 1980s
- Multiple-CPU computers now dominate
- Small-scale: centralized multiprocessors
- Large-scale: distributed memory architectures (multiprocessors or multicomputers)
82Programming the Beast
- Task/channel model
- Algorithm design methodology
- Case studies
83Task/Channel Model
- Parallel computation = set of tasks
- Task
- Program
- Local memory
- Collection of I/O ports
- Tasks interact by sending messages through
channels
84Task/Channel Model
85Foster's Design Methodology
- Partitioning
- Communication
- Agglomeration
- Mapping
86Foster's Methodology
87Partitioning
- Dividing computation and data into pieces
- Domain decomposition
- Divide data into pieces
- Determine how to associate computations with the data
- Functional decomposition
- Divide computation into pieces
- Determine how to associate data with the
computations
88Example Domain Decompositions
89Example Functional Decomposition
90Partitioning Checklist
- At least 10x more primitive tasks than processors in target computer
- Minimize redundant computations and redundant data storage
- Primitive tasks roughly the same size
- Number of tasks an increasing function of problem size
91Communication
- Determine values passed among tasks
- Local communication
- Task needs values from a small number of other tasks
- Create channels illustrating data flow
- Global communication
- Significant number of tasks contribute data to perform a computation
- Don't create channels for them early in design
92Communication Checklist
- Communication operations balanced among tasks
- Each task communicates with only a small group of neighbors
- Tasks can perform communications concurrently
- Tasks can perform computations concurrently
93Agglomeration
- Grouping tasks into larger tasks
- Goals
- Improve performance
- Maintain scalability of program
- Simplify programming
- In MPI programming, goal often to create one
agglomerated task per processor
94Agglomeration Can Improve Performance
- Eliminate communication between primitive tasks agglomerated into consolidated task
- Combine groups of sending and receiving tasks
95Agglomeration Checklist
- Locality of parallel algorithm has increased
- Replicated computations take less time than communications they replace
- Data replication doesn't affect scalability
- Agglomerated tasks have similar computational and communication costs
- Number of tasks increases with problem size
- Number of tasks suitable for likely target systems
- Tradeoff between agglomeration and code modification costs is reasonable
96Mapping
- Process of assigning tasks to processors
- Centralized multiprocessor: mapping done by operating system
- Distributed memory system: mapping done by user
- Conflicting goals of mapping
- Maximize processor utilization
- Minimize interprocessor communication
97Mapping Example
98Optimal Mapping
- Finding optimal mapping is NP-hard
- Must rely on heuristics
99Mapping Decision Tree
- Static number of tasks
  - Structured communication
    - Constant computation time per task
      - Agglomerate tasks to minimize comm
      - Create one task per processor
    - Variable computation time per task
      - Cyclically map tasks to processors
  - Unstructured communication
    - Use a static load balancing algorithm
- Dynamic number of tasks
100Mapping Strategy
- Static number of tasks
- Dynamic number of tasks
  - Frequent communications between tasks
    - Use a dynamic load balancing algorithm
  - Many short-lived tasks
    - Use a run-time task-scheduling algorithm
101Mapping Checklist
- Considered designs based on one task per processor and multiple tasks per processor
- Evaluated static and dynamic task allocation
- If dynamic task allocation chosen, task allocator is not a bottleneck to performance
- If static task allocation chosen, ratio of tasks to processors is at least 10:1
102Case Studies
- Boundary value problem
- Finding the maximum
- The n-body problem
- Adding data input
103Boundary Value Problem
[Figure: a thin rod surrounded by insulation, with its ends in ice water]
104Rod Cools as Time Progresses
105Finite Difference Approximation
106Partitioning
- One data item per grid point
- Associate one primitive task with each grid point
- Two-dimensional domain decomposition
107Communication
- Identify communication pattern between primitive tasks
- Each interior primitive task has three incoming and three outgoing channels
108Agglomeration and Mapping
109Sequential execution time
- χ = time to update an element
- n = number of elements
- m = number of iterations
- Sequential execution time: m(n-1)χ
110Parallel Execution Time
- p = number of processors
- λ = message latency
- Parallel execution time: m(χ⌈(n-1)/p⌉ + 2λ)
111Reduction
- Given associative operator ⊕
- a0 ⊕ a1 ⊕ a2 ⊕ … ⊕ an-1 (see the MPI sketch below)
- Examples
- Add
- Multiply
- And, Or
- Maximum, Minimum
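A minimal MPI sketch of a reduction: each process contributes one integer and rank 0 receives the sum. The per-process values and the choice of MPI_SUM (addition) are illustrative; any of the associative operators listed above could be used instead.

/* Global sum by reduction: each rank contributes its own value,
   rank 0 receives the total. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, local, total;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local = rank + 1;   /* each process's contribution (illustrative) */

    /* Combine all local values with the associative operator MPI_SUM. */
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %d\n", total);

    MPI_Finalize();
    return 0;
}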
112Parallel Reduction Evolution
113Parallel Reduction Evolution
114Parallel Reduction Evolution
115Binomial Trees
116Finding Global Sum
[Figure sequence, slides 116–120: sixteen values (4, 2, 0, 7, -3, 5, -6, -3, 8, 1, 2, 3, -4, 4, 6, -1) are combined pairwise in four steps, giving eight partial sums (1, 7, -6, 4, 4, 5, 8, 2), then four (8, -2, 9, 10), then two (17, 8), then the global sum 25]
121Agglomeration
122Agglomeration
123The n-body Problem
124The n-body Problem
125Partitioning
- Domain partitioning
- Assume one task per particle
- Task has the particle's position and velocity vector
- Iteration
- Get positions of all other particles (see the all-gather sketch below)
- Compute new position and velocity
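A hedged C sketch of the position exchange using MPI_Allgather, matching the all-gather slides that follow. One particle per process, two-dimensional positions, and the trivial update rule are assumptions; a real n-body step would compute forces here.

/* Each process owns one particle and all-gathers every particle's
   position before computing its own update. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, p;
    double mypos[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double *allpos = malloc(2 * p * sizeof(double));

    mypos[0] = rank;          /* illustrative starting position */
    mypos[1] = -rank;

    /* Every task obtains every other task's position (the all-gather step). */
    MPI_Allgather(mypos, 2, MPI_DOUBLE, allpos, 2, MPI_DOUBLE, MPI_COMM_WORLD);

    /* A real n-body step would compute forces from allpos here;
       this placeholder just nudges the particle toward the centroid. */
    double cx = 0.0, cy = 0.0;
    for (int i = 0; i < p; i++) {
        cx += allpos[2 * i];
        cy += allpos[2 * i + 1];
    }
    mypos[0] += 0.01 * (cx / p - mypos[0]);
    mypos[1] += 0.01 * (cy / p - mypos[1]);

    printf("rank %d new position (%.3f, %.3f)\n", rank, mypos[0], mypos[1]);

    free(allpos);
    MPI_Finalize();
    return 0;
}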
126Gather
127All-gather
128Complete Graph for All-gather
129Hypercube for All-gather
130Communication Time
131Adding Data Input
132Scatter
133Scatter in log p Steps
[Figure: data items 1–8 distributed among tasks in log p steps]
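A minimal MPI sketch matching these scatter slides: rank 0 holds eight values and hands one to each of eight processes. The array contents are illustrative, and MPI_Scatter's internal algorithm (possibly the log p tree shown on the slide) is left to the library.

/* Scatter: rank 0 distributes one integer to each process.
   Run with mpirun -np 8 to match the slide's eight items. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, mine;
    int items[8] = {1, 2, 3, 4, 5, 6, 7, 8};   /* data held by rank 0 */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each process receives one element of rank 0's array. */
    MPI_Scatter(items, 1, MPI_INT, &mine, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d received item %d\n", rank, mine);

    MPI_Finalize();
    return 0;
}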
134Summary Task/channel Model
- Parallel computation
- Set of tasks
- Interactions through channels
- Good designs
- Maximize local computations
- Minimize communications
- Scale up
135Summary Design Steps
- Partition computation
- Agglomerate tasks
- Map tasks to processors
- Goals
- Maximize processor utilization
- Minimize inter-processor communication
136Summary Fundamental Algorithms
- Reduction
- Gather and scatter
- All-gather
137High Throughput Computing
- Easy problems, formerly known as "embarrassingly parallel," are now known as "pleasingly parallel"
- Basic idea: "Gee, I have a whole bunch of jobs (single runs of a program) that I need to do; why not run them concurrently rather than sequentially?"
- Sometimes called "bag of tasks" or "parameter sweep" problems
138Bag-of-tasks
139Examples
- A large number of proteins, each represented by a different file, to dock with a target protein
- For all files x, execute f(x, y)
- Exploring a parameter space in n dimensions
- Uniform
- Non-uniform
- Monte Carlo
140Tools
- Most common tool is a queuing system, sometimes called a load management system or a local resource manager
- PBS, LSF, and SGE are the three most common; Condor is also often used
- They all have the same basic functions; we'll use PBS as an exemplar
- Script languages (bash, Perl, etc.)
141PBS
- qsub options script-file
- Submit the script to run
- Options can specify number of processors and other required resources (memory, etc.)
- Returns the job ID (a string)
144Other PBS
- qstat: give the status of jobs submitted to the queue
- qdel: delete a job from the queue
145Blasting a set of jobs
146Issues
- Overhead per job is substantial
- Don't want to run millisecond jobs
- May need to bundle them up
- May not be enough jobs to saturate resources
- May need to break up jobs
- I/O system may become saturated
- Copy large files to /tmp, check for existence in your shell script, copy if not there
- May be more jobs than the queuing system can handle (many start to break down at several thousand jobs)
- Jobs may fail for no good reason
- Develop scripts to check for output and re-submit up to k jobs
147Homework
- Submit a simple job to the queue that echoes the host name; redirect output to a file of your choice.
- Via a script, submit 100 hostname jobs to the queue. Output should be output.X, where X is the output number.
- For each file in a rooted directory tree, run wc to count the words. Maintain the results in a shadow directory tree. Your script should be able to detect results that have already been computed.