1
Day 2
2
Agenda
  • Parallelism basics
  • Parallel machines
  • Parallelism again
  • High Throughput Computing
  • Finding the right grain size

3
One thing to remember
Easy
Hard
4
Seeking Concurrency
  • Data dependence graphs
  • Data parallelism
  • Functional parallelism
  • Pipelining

5
Data Dependence Graph
  • Directed graph
  • Vertices = tasks
  • Edges = dependences

6
Data Parallelism
  • Independent tasks apply same operation to
    different elements of a data set
  • Okay to perform operations concurrently

for i ← 0 to 99 do a[i] ← b[i] + c[i] endfor
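
For illustration (not part of the original slides), a minimal sketch of the
loop above as a data-parallel loop in C with OpenMP; the array size and
initial values are assumptions.

/* Minimal sketch of the data-parallel loop above in C with OpenMP.
   Array size and initial values are illustrative assumptions. */
#include <stdio.h>
#include <omp.h>

#define N 100

int main(void) {
    double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }

    /* Every iteration is independent, so iterations may run concurrently. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];

    printf("a[99] = %g\n", a[99]);
    return 0;
}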
7
Functional Parallelism
  • Independent tasks apply different operations to
    different data elements
  • First and second statements can execute concurrently
  • Third and fourth statements can execute concurrently

a ← 2
b ← 3
m ← (a + b) / 2
s ← (a² + b²) / 2
v ← s - m²
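
As a sketch (an assumption, not from the slides), the independent statement
pairs above could be expressed with OpenMP sections in C:

/* Sketch: the independent statement pairs above run as OpenMP sections.
   Statements 1-2 (a, b) are independent; statements 3-4 (m, s) are
   independent once a and b exist; v depends on both. */
#include <stdio.h>
#include <omp.h>

int main(void) {
    double a, b, m, s, v;

    #pragma omp parallel sections
    {
        #pragma omp section
        a = 2;
        #pragma omp section
        b = 3;
    }   /* implicit barrier */

    #pragma omp parallel sections
    {
        #pragma omp section
        m = (a + b) / 2;
        #pragma omp section
        s = (a * a + b * b) / 2;
    }   /* implicit barrier */

    v = s - m * m;
    printf("v = %g\n", v);
    return 0;
}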
8
Pipelining
  • Divide a process into stages
  • Produce several items simultaneously

9
Data Clustering
  • Data mining: looking for meaningful patterns in
    large data sets
  • Data clustering: organizing a data set into
    clusters of similar items
  • Data clustering can speed retrieval of related
    items

10
Document Vectors
(figure: documents plotted as vectors in a space with axes "Moon" and
"Rocket"; example documents include The Geology of Moon Rocks, The Story of
Apollo 11, A Biography of Jules Verne, and Alice in Wonderland)
11
Document Clustering
12
Clustering Algorithm
  • Compute document vectors
  • Choose initial cluster centers
  • Repeat
    • Compute performance function
    • Adjust centers
  • Until function value converges or max iterations
    have elapsed
  • Output cluster centers
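
A hedged skeleton of the loop above in C follows; the helper functions are
hypothetical placeholders standing in for the real document-clustering code.

/* Skeleton of the clustering loop above. The helpers below are
   hypothetical placeholders, assumed to be implemented elsewhere. */
#include <math.h>

#define MAX_ITER 100
#define EPSILON  1e-6

double compute_performance(double **vectors, int n, double **centers, int k);
void   adjust_centers(double **vectors, int n, double **centers, int k);
void   output_centers(double **centers, int k);

void cluster(double **vectors, int n, double **centers, int k) {
    double prev = INFINITY;
    for (int iter = 0; iter < MAX_ITER; iter++) {
        double perf = compute_performance(vectors, n, centers, k);
        if (fabs(prev - perf) < EPSILON)   /* converged */
            break;
        adjust_centers(vectors, n, centers, k);
        prev = perf;
    }
    output_centers(centers, k);
}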

13
Data Parallelism Opportunities
  • Operation being applied to a data set
  • Examples
  • Generating document vectors
  • Finding closest center to each vector
  • Picking initial values of cluster centers

14
Functional Parallelism Opportunities
  • Draw data dependence diagram
  • Look for sets of nodes such that there are no
    paths from one node to another

15
Data Dependence Diagram
Build document vectors
Choose cluster centers
Compute function value
Adjust cluster centers
Output cluster centers
16
Programming Parallel Computers
  • Extend compilers: translate sequential programs
    into parallel programs
  • Extend languages: add parallel operations
  • Add parallel language layer on top of sequential
    language
  • Define totally new parallel language and compiler
    system

17
Strategy 1: Extend Compilers
  • Parallelizing compiler
  • Detect parallelism in sequential program
  • Produce parallel executable program
  • Focus on making Fortran programs parallel

18
Extend Compilers (cont.)
  • Advantages
  • Can leverage millions of lines of existing serial
    programs
  • Saves time and labor
  • Requires no retraining of programmers
  • Sequential programming easier than parallel
    programming

19
Extend Compilers (cont.)
  • Disadvantages
  • Parallelism may be irretrievably lost when
    programs are written in sequential languages
  • Performance of parallelizing compilers on a broad
    range of applications is still up in the air

20
Extend Language
  • Add functions to a sequential language
  • Create and terminate processes
  • Synchronize processes
  • Allow processes to communicate
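
As a sketch of this approach (assumed, not from the slides), plain C extended
with the POSIX threads library gains creation, termination, and
synchronization through ordinary function calls:

/* Sketch of the "extend with a library" approach using POSIX threads:
   creation, termination, and synchronization are library calls added
   to plain C. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    pthread_mutex_lock(&lock);     /* synchronize access to shared data */
    counter += (long)arg;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (long i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, (void *)(i + 1));  /* create */
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);                              /* wait/terminate */
    printf("counter = %ld\n", counter);
    return 0;
}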

21
Extend Language (cont.)
  • Advantages
  • Easiest, quickest, and least expensive
  • Allows existing compiler technology to be
    leveraged
  • New libraries can be ready soon after new
    parallel computers are available

22
Extend Language (cont.)
  • Disadvantages
  • Lack of compiler support to catch errors
  • Easy to write programs that are difficult to debug

23
Add a Parallel Programming Layer
  • Lower layer
  • Core of computation
  • Process manipulates its portion of data to
    produce its portion of result
  • Upper layer
  • Creation and synchronization of processes
  • Partitioning of data among processes
  • A few research prototypes have been built based
    on these principles

24
Create a Parallel Language
  • Develop a parallel language from scratch
  • occam is an example
  • Add parallel constructs to an existing language
  • Fortran 90
  • High Performance Fortran
  • C*

25
New Parallel Languages (cont.)
  • Advantages
  • Allows programmer to communicate parallelism to
    compiler
  • Improves probability that executable will achieve
    high performance
  • Disadvantages
  • Requires development of new compilers
  • New languages may not become standards
  • Programmer resistance

26
Current Status
  • Low-level approach is most popular
  • Augment existing language with low-level parallel
    constructs
  • MPI and OpenMP are examples
  • Advantages of low-level approach
  • Efficiency
  • Portability
  • Disadvantage: more difficult to program and debug

27
Architectures
  • Interconnection networks
  • Processor arrays (SIMD/data parallel)
  • Multiprocessors (shared memory)
  • Multicomputers (distributed memory)
  • Flynn's taxonomy

28
Interconnection Networks
  • Uses of interconnection networks
  • Connect processors to shared memory
  • Connect processors to each other
  • Interconnection media types
  • Shared medium
  • Switched medium

29
Shared versus Switched Media
30
Shared Medium
  • Allows only one message at a time
  • Messages are broadcast
  • Each processor listens to every message
  • Arbitration is decentralized
  • Collisions require resending of messages
  • Ethernet is an example

31
Switched Medium
  • Supports point-to-point messages between pairs of
    processors
  • Each processor has its own path to switch
  • Advantages over shared media
  • Allows multiple messages to be sent
    simultaneously
  • Allows scaling of network to accommodate increase
    in processors

32
Switch Network Topologies
  • View switched network as a graph
  • Vertices = processors or switches
  • Edges = communication paths
  • Two kinds of topologies
  • Direct
  • Indirect

33
Direct Topology
  • Ratio of switch nodes to processor nodes is 1:1
  • Every switch node is connected to
  • 1 processor node
  • At least 1 other switch node

34
Indirect Topology
  • Ratio of switch nodes to processor nodes is
    greater than 1:1
  • Some switches simply connect other switches

35
Evaluating Switch Topologies
  • Diameter
  • Bisection width
  • Number of edges / node
  • Constant edge length? (yes/no)

36
2-D Mesh Network
  • Direct topology
  • Switches arranged into a 2-D lattice
  • Communication allowed only between neighboring
    switches
  • Variants allow wraparound connections between
    switches on edge of mesh

37
2-D Meshes
38
Vector Computers
  • Vector computer instruction set includes
    operations on vectors as well as scalars
  • Two ways to implement vector computers
  • Pipelined vector processor: streams data through
    pipelined arithmetic units
  • Processor array: many identical, synchronized
    arithmetic processing elements

39
Why Processor Arrays?
  • Historically, high cost of a control unit
  • Scientific applications have data parallelism

40
Processor Array
41
Data/instruction Storage
  • Front end computer
  • Program
  • Data manipulated sequentially
  • Processor array
  • Data manipulated in parallel

42
Processor Array Performance
  • Performance = work done per time unit
  • Performance of processor array
  • Speed of processing elements
  • Utilization of processing elements

43
Performance Example 1
  • 1024 processors
  • Each adds a pair of integers in 1 µsec
  • What is performance when adding two 1024-element
    vectors (one per processor)?
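
A hedged worked answer: if all 1024 processors are busy, 1024 additions
finish in 1 µsec, so performance is 1024 operations / 10^-6 sec =
1.024 × 10^9 operations per second.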

44
Performance Example 2
  • 512 processors
  • Each adds two integers in 1 µsec
  • Performance adding two vectors of length 600?
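
A hedged worked answer: the first 1 µsec step adds 512 element pairs and a
second step adds the remaining 88, so 600 additions take 2 µsec, giving
3 × 10^8 operations per second; processor utilization is only about
600 / (2 × 512) ≈ 59%.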

45
2-D Processor Interconnection Network
Each VLSI chip has 16 processing elements
46
if (COND) then A else B
47
if (COND) then A else B
48
if (COND) then A else B
49
Processor Array Shortcomings
  • Not all problems are data-parallel
  • Speed drops for conditionally executed code
  • Don't adapt to multiple users well
  • Do not scale down well to starter systems
  • Rely on custom VLSI for processors
  • Expense of control units has dropped

50
Multicomputer, aka Distributed Memory Machines
  • Distributed memory multiple-CPU computer
  • Same address on different processors refers to
    different physical memory locations
  • Processors interact through message passing
  • Commercial multicomputers
  • Commodity clusters

51
Asymmetrical Multicomputer
52
Asymmetrical MC Advantages
  • Back-end processors dedicated to parallel
    computations → easier to understand, model, and
    tune performance
  • Only a simple back-end operating system needed →
    easy for a vendor to create

53
Asymmetrical MC Disadvantages
  • Front-end computer is a single point of failure
  • Single front-end computer limits scalability of
    system
  • Primitive operating system in back-end processors
    makes debugging difficult
  • Every application requires development of both
    front-end and back-end program

54
Symmetrical Multicomputer
55
Symmetrical MC Advantages
  • Alleviate performance bottleneck caused by single
    front-end computer
  • Better support for debugging
  • Every processor executes same program

56
Symmetrical MC Disadvantages
  • More difficult to maintain illusion of single
    parallel computer
  • No simple way to balance program development
    workload among processors
  • More difficult to achieve high performance when
    multiple processes on each processor

57
Commodity Cluster
  • Co-located computers
  • Dedicated to running parallel jobs
  • No keyboards or displays
  • Identical operating system
  • Identical local disk images
  • Administered as an entity

58
Network of Workstations
  • Dispersed computers
  • First priority: person at keyboard
  • Parallel jobs run in background
  • Different operating systems
  • Different local images
  • Checkpointing and restarting important

59
DM programming model
  • Communicating sequential programs
  • Disjoint address spaces
  • Communicate by sending messages
  • A message is an array of bytes
  • send(dest, char *buf, int len)
  • receive(source, char *buf, int len)
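
A minimal sketch of this model in C with MPI (named earlier in the deck as an
example library); the message contents are illustrative, and the program
assumes at least two processes.

/* Minimal MPI sketch of the message-passing model: two processes with
   disjoint address spaces exchange an array of bytes. Run with 2+ ranks. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[]) {
    int rank;
    char buf[32];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        strcpy(buf, "hello from rank 0");
        MPI_Send(buf, (int)strlen(buf) + 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received: %s\n", buf);
    }

    MPI_Finalize();
    return 0;
}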

60
Multiprocessors
  • Multiprocessor: multiple-CPU computer with a
    shared memory
  • Same address on two different CPUs refers to the
    same memory location
  • Avoid three problems of processor arrays
  • Can be built from commodity CPUs
  • Naturally support multiple users
  • Maintain efficiency in conditional code

61
Centralized Multiprocessor
  • Straightforward extension of uniprocessor
  • Add CPUs to bus
  • All processors share same primary memory
  • Memory access time same for all CPUs
  • Uniform memory access (UMA) multiprocessor
  • Symmetrical multiprocessor (SMP)

62
Centralized Multiprocessor
63
Private and Shared Data
  • Private data: items used only by a single
    processor
  • Shared data: values used by multiple processors
  • In a multiprocessor, processors communicate via
    shared data values

64
Problems Associated with Shared Data
  • Cache coherence
  • Replicating data across multiple caches reduces
    contention
  • How to ensure different processors have same
    value for same address?
  • Synchronization
  • Mutual exclusion
  • Barrier
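
For illustration (assumed, not from the slides), OpenMP expresses both
synchronization needs directly in C:

/* Sketch of the two synchronization needs named above, using OpenMP:
   a critical section for mutual exclusion and an explicit barrier. */
#include <stdio.h>
#include <omp.h>

int main(void) {
    int sum = 0;

    #pragma omp parallel
    {
        #pragma omp critical      /* mutual exclusion on shared sum */
        sum += omp_get_thread_num();

        #pragma omp barrier       /* no thread proceeds until all arrive */

        #pragma omp single
        printf("sum of thread ids = %d\n", sum);
    }
    return 0;
}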

65
Cache-coherence Problem
(figure: memory holds X = 7; no CPU has X cached)
66
Cache-coherence Problem
(figure: CPU A reads X and caches the value 7)
67
Cache-coherence Problem
(figure: CPU B also reads X and caches the value 7)
68
Cache-coherence Problem
(figure: CPU B writes X = 2; memory and CPU B's cache hold 2, but CPU A's
cache still holds the stale value 7)
69
Write Invalidate Protocol
(figure: both CPUs have X = 7 cached; a cache control monitor snoops the bus)
70
Write Invalidate Protocol
(figure: one CPU broadcasts "intent to write X"; both caches still hold 7)
71
Write Invalidate Protocol
(figure: the other CPU's copy of X is invalidated)
72
Write Invalidate Protocol
(figure: the write completes; the writing CPU's cache holds X = 2)
73
Distributed Multiprocessor
  • Distribute primary memory among processors
  • Increase aggregate memory bandwidth and lower
    average memory access time
  • Allow greater number of processors
  • Also called non-uniform memory access (NUMA)
    multiprocessor

74
Distributed Multiprocessor
75
Cache Coherence
  • Some NUMA multiprocessors do not support it in
    hardware
  • Only instructions, private data in cache
  • Large memory access time variance
  • Implementation more difficult
  • No shared memory bus to snoop
  • Directory-based protocol needed

76
Flynn's Taxonomy
  • Instruction stream
  • Data stream
  • Single vs. multiple
  • Four combinations
  • SISD
  • SIMD
  • MISD
  • MIMD

77
SISD
  • Single Instruction, Single Data
  • Single-CPU systems
  • Note: co-processors don't count
  • Functional
  • I/O
  • Example: PCs

78
SIMD
  • Single Instruction, Multiple Data
  • Two architectures fit this category
  • Pipelined vector processor (e.g., Cray-1)
  • Processor array (e.g., Connection Machine)

79
MISD
  • Multiple Instruction, Single Data
  • Example: systolic array

80
MIMD
  • Multiple Instruction, Multiple Data
  • Multiple-CPU computers
  • Multiprocessors
  • Multicomputers

81
Summary
  • Commercial parallel computers appeared in the 1980s
  • Multiple-CPU computers now dominate
  • Small-scale: centralized multiprocessors
  • Large-scale: distributed-memory architectures
    (multiprocessors or multicomputers)

82
Programming the Beast
  • Task/channel model
  • Algorithm design methodology
  • Case studies

83
Task/Channel Model
  • Parallel computation = set of tasks
  • Task
  • Program
  • Local memory
  • Collection of I/O ports
  • Tasks interact by sending messages through
    channels

84
Task/Channel Model
85
Foster's Design Methodology
  • Partitioning
  • Communication
  • Agglomeration
  • Mapping

86
Foster's Methodology
87
Partitioning
  • Dividing computation and data into pieces
  • Domain decomposition
  • Divide data into pieces
  • Determine how to associate computations with the
    data
  • Functional decomposition
  • Divide computation into pieces
  • Determine how to associate data with the
    computations

88
Example Domain Decompositions
89
Example Functional Decomposition
90
Partitioning Checklist
  • At least 10x more primitive tasks than processors
    in target computer
  • Minimize redundant computations and redundant
    data storage
  • Primitive tasks roughly the same size
  • Number of tasks an increasing function of problem
    size

91
Communication
  • Determine values passed among tasks
  • Local communication
  • Task needs values from a small number of other
    tasks
  • Create channels illustrating data flow
  • Global communication
  • Significant number of tasks contribute data to
    perform a computation
  • Don't create channels for them early in design

92
Communication Checklist
  • Communication operations balanced among tasks
  • Each task communicates with only small group of
    neighbors
  • Tasks can perform communications concurrently
  • Tasks can perform computations concurrently

93
Agglomeration
  • Grouping tasks into larger tasks
  • Goals
  • Improve performance
  • Maintain scalability of program
  • Simplify programming
  • In MPI programming, goal often to create one
    agglomerated task per processor

94
Agglomeration Can Improve Performance
  • Eliminate communication between primitive tasks
    agglomerated into consolidated task
  • Combine groups of sending and receiving tasks

95
Agglomeration Checklist
  • Locality of parallel algorithm has increased
  • Replicated computations take less time than
    communications they replace
  • Data replication doesn't affect scalability
  • Agglomerated tasks have similar computational and
    communications costs
  • Number of tasks increases with problem size
  • Number of tasks suitable for likely target
    systems
  • Tradeoff between agglomeration and code
    modifications costs is reasonable

96
Mapping
  • Process of assigning tasks to processors
  • Centralized multiprocessor: mapping done by
    operating system
  • Distributed memory system: mapping done by user
  • Conflicting goals of mapping
  • Maximize processor utilization
  • Minimize interprocessor communication

97
Mapping Example
98
Optimal Mapping
  • Finding optimal mapping is NP-hard
  • Must rely on heuristics

99
Mapping Decision Tree
  • Static number of tasks
    • Structured communication
      • Constant computation time per task
        • Agglomerate tasks to minimize comm
        • Create one task per processor
      • Variable computation time per task
        • Cyclically map tasks to processors
    • Unstructured communication
      • Use a static load balancing algorithm
  • Dynamic number of tasks

100
Mapping Strategy
  • Static number of tasks
  • Dynamic number of tasks
    • Frequent communications between tasks
      • Use a dynamic load balancing algorithm
    • Many short-lived tasks
      • Use a run-time task-scheduling algorithm

101
Mapping Checklist
  • Considered designs based on one task per
    processor and multiple tasks per processor
  • Evaluated static and dynamic task allocation
  • If dynamic task allocation chosen, task allocator
    is not a bottleneck to performance
  • If static task allocation chosen, ratio of tasks
    to processors is at least 10:1

102
Case Studies
  • Boundary value problem
  • Finding the maximum
  • The n-body problem
  • Adding data input

103
Boundary Value Problem
(figure: a rod, insulated along its length, with its ends in ice water)
104
Rod Cools as Time Progresses
105
Finite Difference Approximation
106
Partitioning
  • One data item per grid point
  • Associate one primitive task with each grid point
  • Two-dimensional domain decomposition

107
Communication
  • Identify communication pattern between primitive
    tasks
  • Each interior primitive task has three incoming
    and three outgoing channels

108
Agglomeration and Mapping
109
Sequential execution time
  • χ = time to update element
  • n = number of elements
  • m = number of iterations
  • Sequential execution time: m(n-1)χ

110
Parallel Execution Time
  • p = number of processors
  • λ = message latency
  • Parallel execution time: m(χ⌈(n-1)/p⌉ + 2λ)
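
A hedged numerical example with assumed values: for χ = 1 µsec, λ = 10 µsec,
n = 1000, m = 100, and p = 8, the parallel time is
100(1·⌈999/8⌉ + 2·10) = 100(125 + 20) = 14,500 µsec, versus a sequential time
of 100 · 999 · 1 = 99,900 µsec, a speedup of about 6.9.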

111
Reduction
  • Given associative operator ⊕
  • a0 ⊕ a1 ⊕ a2 ⊕ … ⊕ an-1
  • Examples
  • Add
  • Multiply
  • And, Or
  • Maximum, Minimum
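
For illustration (assumed, not from the slides), the usual way to express a
reduction in C with MPI is MPI_Reduce, here with addition as the associative
operator:

/* Sketch of a reduction with MPI: each process contributes one value and
   MPI_Reduce combines them with the associative operator (here, sum). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, local, global;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local = rank + 1;                       /* illustrative local value */
    MPI_Reduce(&local, &global, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %d\n", global);

    MPI_Finalize();
    return 0;
}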

112
Parallel Reduction Evolution
113
Parallel Reduction Evolution
114
Parallel Reduction Evolution
115
Binomial Trees
116
Finding Global Sum
(figure: 16 initial values 4, 2, 0, 7, -3, 5, -6, -3, 8, 1, 2, 3, -4, 4, 6, -1)
117
Finding Global Sum
(figure: 8 partial sums after the first combining step: 1, 7, -6, 4, 4, 5, 8, 2)
118
Finding Global Sum
(figure: 4 partial sums after the second step: 8, -2, 9, 10)
119
Finding Global Sum
(figure: 2 partial sums after the third step: 17, 8)
120
Finding Global Sum
(figure: final global sum 25)
121
Agglomeration
122
Agglomeration
123
The n-body Problem
124
The n-body Problem
125
Partitioning
  • Domain partitioning
  • Assume one task per particle
  • Task has particle's position and velocity vector
  • Iteration
  • Get positions of all other particles
  • Compute new position, velocity
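
A sketch (assumed, not from the slides) of the "get positions of all other
particles" step using MPI_Allgather in C; the particle data is placeholder.

/* Sketch of the all-gather step: every task contributes its particle's
   position and receives all positions. Data values are placeholders. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int rank, size;
    double my_pos[3];                 /* this task's particle (x, y, z) */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    my_pos[0] = rank; my_pos[1] = 0.0; my_pos[2] = 0.0;

    double *all_pos = malloc(3 * size * sizeof(double));
    MPI_Allgather(my_pos, 3, MPI_DOUBLE,
                  all_pos, 3, MPI_DOUBLE, MPI_COMM_WORLD);

    /* ... compute this particle's new position and velocity from all_pos ... */

    free(all_pos);
    MPI_Finalize();
    return 0;
}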

126
Gather
127
All-gather
128
Complete Graph for All-gather
129
Hypercube for All-gather
130
Communication Time
131
Adding Data Input
132
Scatter
133
Scatter in log p Steps
(figure: 8 values, 1 through 8, distributed among processes in log p steps)
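
A sketch (assumed) of distributing input with MPI_Scatter in C; the chunk
size and values are illustrative.

/* Sketch of scattering input data: the root holds the full array and each
   process receives its own chunk. Sizes and values are illustrative. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int rank, size;
    const int per_proc = 2;                /* elements per process (assumed) */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *full = NULL;
    if (rank == 0) {                       /* only the root reads the input */
        full = malloc(per_proc * size * sizeof(int));
        for (int i = 0; i < per_proc * size; i++) full[i] = i + 1;
    }

    int local[2];
    MPI_Scatter(full, per_proc, MPI_INT,
                local, per_proc, MPI_INT, 0, MPI_COMM_WORLD);

    /* ... each process now works only on local[0 .. per_proc-1] ... */

    if (rank == 0) free(full);
    MPI_Finalize();
    return 0;
}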
134
Summary Task/channel Model
  • Parallel computation
  • Set of tasks
  • Interactions through channels
  • Good designs
  • Maximize local computations
  • Minimize communications
  • Scale up

135
Summary Design Steps
  • Partition computation
  • Agglomerate tasks
  • Map tasks to processors
  • Goals
  • Maximize processor utilization
  • Minimize inter-processor communication

136
Summary Fundamental Algorithms
  • Reduction
  • Gather and scatter
  • All-gather

137
High Throughput Computing
  • Easy problems, formerly known as "embarrassingly
    parallel", are now known as "pleasingly parallel"
  • Basic idea: "I have a whole bunch of jobs (each a
    single run of a program) that I need to do, so why
    not run them concurrently rather than
    sequentially?"
  • Sometimes called "bag of tasks" or "parameter
    sweep" problems

138
Bag-of-tasks
139
Examples
  • A large number of proteins, each represented by
    a different file, to dock with a target
    protein
  • For all files x, execute f(x, y)
  • Exploring a parameter space in n dimensions
  • Uniform
  • Non-uniform
  • Monte Carlo sampling

140
Tools
  • Most common tool is a queuing system, sometimes
    called a load management system or a local
    resource manager
  • PBS, LSF, and SGE are the three most common;
    Condor is also often used
  • They all have the same basic functions; we'll use
    PBS as an exemplar
  • Script languages (bash, Perl, etc.)

141
PBS
  • qsub [options] script-file
  • Submits the script to run
  • Options can specify number of processors, other
    required resources (memory, etc.)
  • Returns the job ID (a string)

144
Other PBS
  • qstat: gives the status of jobs submitted to the
    queue
  • qdel: deletes a job from the queue

145
Blasting a set of jobs
146
Issues
  • Overhead per job is substantial
  • Don't want to run millisecond jobs
  • May need to bundle them up
  • May not be enough jobs to saturate resources
  • May need to break up jobs
  • I/O system may become saturated
  • Copy large files to /tmp, check for existence in
    your shell script, and copy only if not there
  • May be more jobs than the queuing system can
    handle (many start to break down at several
    thousand jobs)
  • Jobs may fail for no good reason
  • Develop scripts to check for output and re-submit
    up to k jobs

147
Homework
  1. Submit a simple job to the queue that echoes the
    host name; redirect output to a file of your
    choice.
  2. Via a script, submit 100 hostname jobs to the
    queue. Output should go to output.X, where X is
    the job number.
  3. For each file in a rooted directory tree, run
    wc to count the words. Maintain the results in
    a shadow directory tree. Your script should be
    able to detect results that have already been
    computed.