Title: Introduction to Parallel Processing
1???? ?????? ?????? ??' ??? ??-???
- ???? ????? ??' 1
- ????? ???? ?', 22/10/2001
2Introduction to Parallel Processing
- Course Number 36113621
- ??? ?????
- http://www.bgu.ac.il/tel-zur/pp.html
321.10.01 ?' ???? ???"? Â Â ?? ???????? ??????
????? ??? ??' ???? ?????? Â Â Â Â ?????? ?????
???? ???? ?????? ?????? ( 36113621) Â ?. ?. ???
????? ???????? ??? ????????, ??? ?????? ???? ??
????????? ????? ???? ?????? ??????. ??? ?????
?? ??????? ?? ?? ??????? ?????? ??? ?????? ??????
???. ????? ????? ?? ????? ???????? ??????
?????? ????? . Â ?????,
??' ???? ??????
4???? ??? ???, 800-1100
5???? ??' ??? ??-???
- ??"?
- tel-zur_at_ee.bgu.ac.il
6?????? ??' ????? ????
- ???' ?????? ???? ???????
- npanov_at_ee.bgu.ac.il
7????? ???
- ???? ????
- ??? ??-??? ???? ?? ??? ??????, ??? 1100 ?-
1200. ????? ??? ????. - ????? ???? ??? ?', ??? 1400 ?- 1800. ??? 318
?????? ????? ???? ???????. - Email pp_at_ee.bgu.ac.il
- Newsgroup pp_news_at_ee.bgu.ac.il
8Course Objectives
- The goal of this course is to provide an in-depth
understanding of modern parallel processing. The
course will cover both theoretical and practical
aspects of parallel processing.
9Task 1
- Please send an email containing the following
data:
- Your first and last name
- Your Email at BGU
- Phone Number
- Year
- Course of Study
- to pp_at_ee.bgu.ac.il
- PLEASE WRITE EMAILS ONLY IN ENGLISH
10???? ?????
- ????
- ??????? ??????
- ?????????
- ???????
- ?????? ?????
11????? ?' ???"?
- ???? ?????? ???????? 21.10.01
- ????? ????? ??? ?' 16.12.01
- ???? ?????? ?????? ?????? 25.1.02
- ???? ???? ??????? ??? 18.2.02
- ???? ?????? ?????? ???? 24.2.02
12????? ?????? ?? ????? 1/3????? ???????!
13????? ?????? ?? ????? 2/3
14????? ?????? ?? ????? 3/3
15????? ??????? 1/2
- ????? ??' 1. ??? ?' ????? ???. ???? ????????
????-??? (????? ??' 4) ??? ???? ?? ???? ??????. - ????? ??' 2. ???? ?????? ??' 4. ?? ?????? ??????
??' 6. 15 ?????? ?????.
16????? ??????? 2/2
- ?????? ???? 5 ?????? ????? ?????????
- ???? ?????? ?????? ?????? ???? 7. ????? 20
?????? ?????. - ????? ??' 3 ???? ?????? 8. ?? ?????? ?????? 10.
????? 15 ?????? ????? - ?????? ??? ???? ?? ?- 18.2.02 ?????? ?- 50
?????? ?????.
17?????? ???
- ????? ????? ???? C ?? FORTRAN
- ???? ??? ?? ?????? ????? ????????? ???????
- ????? ????? ??? ?? ????? ??????, ??? ??????
??-??? ???????????
18??????
- ?? ????? ?? ???????? ??????? ????? ????????? ??
????? - ?? ???????? ????? ??? ???? (????????)
- ?? ???? ?????? ???? ?????? ???? ???? ?????? ?????
WORD - ?? ?????? ????? ?? ?? ???? ???? ???????
- ?? ????? ???? ?????? ?????? ?????? ???? ?? ?????
19?????? - ????
- ?? ????? ?????? ??????
- ????? ????? ????? ????? ?????? ??"? ?????? ??
?????? ??? n ????!!!
?? ?????? ?? ?? ???? ????? ????? ??????
???????????
20References
- ??? ????? ??? ????? ??? ????? ???? ????
- ???? ????? ????? ??? ??????
- ???? ????? ?????? ??????? ?????? ?????? ??? ?????
- ???? ???? ?? ????????
- ?? ???? ???? ?????? ?????? ??? ???????!!!
21Parallel Computer Architecture
David E. Culler et al
22Introduction to Parallel Computing
Vipin Kumar et al
23Using MPI
William Gropp et al
24Parallel Programming With MPI
Peter Pacheco
25Parallel Programming
Barry Wilkinson and Michael Allen
26????? ?????? ???????
- ???? ?"???? ?????? ??????"
- ???? ??? ?? ????? ??????? ???? ????? ??????
27???????
28??? ?????, ????? ???????
- Parallel Computing
- Parallel Processing
- Cluster Computing
- Beowulf Clusters
- HPC: High Performance Computing
29Oxford Dictionary of Science
- A technique that allows more than one process
(stream of activity) to be running at any given
moment in a computer system; hence processes can
be executed in parallel. This means that two or
more processors are active among a group of
processes at any instant.
30??? ???? ?????? ??? ????? ????-???
31A Supercomputer
- An extremely high power computer that has a large
amount of main memory and very fast processors.
Often the processors run in parallel.
http://www.netlib.org/benchmark/top500/top500.list.html
32Why Study Parallel Architecture?
- Parallelism:
- Provides an alternative to a faster clock for
performance
- Applies at all levels of system design (H/W - S/W
integration)
- Is a fascinating topic
- Is increasingly central in information
processing, science and engineering
33The Demand for Computational Speed
- There is a continual demand for greater
computational speed than computer systems can
currently provide. Areas requiring great
computational speed include numerical modeling
and simulation of scientific and engineering
problems. Computations must be completed within a
reasonable time period.
34Large Memory Requirements
- Use parallel computing for executing larger
problems which require more memory than exists on
a single computer.
35Grand Challenge Problems
- A grand challenge problem is one that cannot be
solved in a reasonable amount of time with
today's computers. Obviously, an execution time of
10 years is always unreasonable.
- Examples: modeling large DNA structures, global
weather forecasting, modeling the motion of
astronomical bodies.
36Scientific Computing Demand
37?????
- ???? ???????? ?? 1011 ??????. ???? ?? ???? ?????
?????? 100 ???????? ?? ???? ????? ?? O(N2) ?????
??? ??-????? ?? 1GFLOPS?
38?????
- ???? 1011 ?????? ????? 1022 ???????????.
- ??"? ?????? ???? 100 ???????? 1024
- ??? ??? ?????? ????
39????? - ????
????? ????? ????????? ???? ??? ???? ???? ??????
??????!
40Technology Trends
41Clock Frequency Growth Rate
42?????? ??? ??? ??? ?? ?? ????!
- ?? ?? ???? ????? ???????
- ?????? ????? ???? ??? ??
- ?????? ??????
- ??? ?????? ??? ?????????? ????? (?????????
??????) - ????
43Parallel Architecture Considerations
- Resource Allocation
- how large a collection?
- how powerful are the elements?
- how much memory?
- Data access, Communication and Synchronization
- how do the elements cooperate and communicate?
- how are data transmitted between processors?
- what are the abstractions and primitives for
cooperation?
- Performance and Scalability
- how does it all translate into performance?
- how does it scale?
44Conventional Computer
45Shared Memory System
46Message-Passing Multi-computer
47?????
- ?? ???? ?? ????? ?????? ??????? ????? ??????
- ?? ??? ?????? ??? ????? ??? ???? ?? ???? ???
- ??? ????? ???????/??????? ??? ??????? ?? ????
?????? ?????? Message Passing - ??? ??????? (?????? ?? ????? ?????)
48Distributed Shared Memory
49Flynn (1966) Taxonomy
- SISD - a single instruction stream, single data
stream computer.
- SIMD - a single instruction stream, multiple data
stream computer.
- MIMD - a multiple instruction stream, multiple
data stream computer.
50Multiple Program Multiple Data (MPMD)
51Single Program Multiple Data (SPMD)
- A Single source program
- Each processor will execute its personal copy of
this program
- Independently and not in synchronism
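The SPMD idea can be sketched in Python. This is only an illustration under stated assumptions: threads stand in for processors, and the names `program`, `run_spmd`, `rank` and `size` are hypothetical (borrowed from common message-passing vocabulary, not taken from the slides). Every copy runs the same function and uses its rank to pick its own share of the data.

```python
import threading


def program(rank, size, data, partial):
    """The single source program: every copy runs this same function
    and uses its rank to decide which part of the data it owns."""
    chunk = data[rank::size]        # cyclic slice owned by this rank
    partial[rank] = sum(chunk)      # each rank writes only its own slot


def run_spmd(size, data):
    """Start `size` copies of the program and combine their results."""
    partial = [0] * size
    workers = [
        threading.Thread(target=program, args=(r, size, data, partial))
        for r in range(size)
    ]
    for w in workers:
        w.start()                   # copies run independently, not in lockstep
    for w in workers:
        w.join()                    # wait until every copy has finished
    return sum(partial)


if __name__ == "__main__":
    print(run_spmd(4, list(range(100))))
```

On a real message-passing machine each copy would be a separate process on its own node, and the final combination step would be done by sending messages rather than by sharing a list.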
52Message-Passing Multi-computers
53???? ??????? ????? ??????? ????? ?????? ??????!
- ?????? ????? ????? ??????? ???????? ?? ??? ???????
54Network Criteria 1/6
- Bandwidth
- Network Latency
- Communication Latency (H/W + S/W)
- Message Latency (see next slide)
55Network Criteria 2/6
- Bandwidth is the inverse of the slope of the line
of transfer time versus message size:
- $\text{time} = \text{latency} + (1/\text{rate}) \times \text{size\_of\_message}$
- Latency is sometimes described as the time to send
a message of zero bytes. This is true only for
the simple model. The number quoted is sometimes
misleading.
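The simple linear model from this slide can be tried out numerically. The sketch below is illustrative only: the function names and the sample values (100 microseconds latency, a 100 Mbps link) are assumptions, not measurements from the course cluster.

```python
def message_time(size_bytes, latency_s, rate_bytes_per_s):
    """Simple linear model: time = latency + (1/rate) * size_of_message."""
    return latency_s + size_bytes / rate_bytes_per_s


def bandwidth_from_two_points(size1, t1, size2, t2):
    """Bandwidth is the inverse of the slope of the time-versus-size line."""
    slope = (t2 - t1) / (size2 - size1)   # seconds per byte
    return 1.0 / slope                    # bytes per second


if __name__ == "__main__":
    latency = 100e-6        # assumed value: 100 microseconds
    rate = 100e6 / 8        # assumed value: 100 Mbps expressed in bytes/s
    for size in (0, 1_000, 1_000_000):
        print(size, "bytes ->", message_time(size, latency, rate), "s")
```

In this model a zero-byte message costs exactly the latency, while for large messages the `size / rate` term dominates.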
56Network Criteria 3/6
- Bisection Width: the number of links to be cut in
order to divide the network into two equal parts
(example network on the slide: 2)
57Network Criteria 4/6
- Diameter: the maximum distance between any two
nodes (example network on the slide: P/2)
58Network Criteria 5/6
- Connectivity: the multiplicity of paths between
any two nodes (example network on the slide: 2)
59Network Criteria 6/6
- Cost: the number of links in the network
(example network on the slide: P)
60 ????? ??? ?? ?????? ??? ???? P ?????? ????
Fully Connected
61?????
- Diameter = 1
- Bisection = $p^2/4$
- Connectivity = $p-1$
- Cost = $p(p-1)/2$
62????? ???? ?-Bisection - ????
- Number of links: $p(p-1)/2$
- Internal links in each half: $(p/2)(p/2-1)/2$
- Internal links in both halves: $(p/2)(p/2-1)$
- Number of links being cut:
- $p(p-1)/2 - (p/2)(p/2-1) = p^2/4$
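The counting argument for the fully connected network can be checked by brute force. This is a small sketch, assuming an even number of nodes p; the function name `fully_connected_metrics` is made up for the illustration.

```python
from itertools import combinations


def fully_connected_metrics(p):
    """For a fully connected network of p nodes (p even), return
    (total number of links, links crossing a cut into two equal halves)."""
    links = list(combinations(range(p), 2))     # every pair of nodes is linked
    half = set(range(p // 2))                   # nodes in the first half
    cut = [(a, b) for a, b in links if (a in half) != (b in half)]
    return len(links), len(cut)


if __name__ == "__main__":
    for p in (4, 8, 16):
        total, cut = fully_connected_metrics(p)
        print(p, total, p * (p - 1) // 2, cut, p * p // 4)
```

For each p the enumerated counts agree with the closed forms p(p-1)/2 and p^2/4.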
63 2D Mesh
64Example Intel Paragon
65A Binary Tree 1/2
66A Binary Tree 2/2
Fat tree: Thinking Machines CM-5, 1993
67 3D Hypercube Network
68 4D Hypercube Network
69Embedding 1/2
70Embedding 2/2
71Deadlock
72Ethernet
73Ethernet Frame Format
74Point-to-Point Communication
75Performance
- Computation/Communication ratio
- Speedup Factor
- Overhead
- Efficiency
- Cost
- Scalability
- Gustafson's Law
76Computation/Communication Ratio
77Speedup Factor
The maximum speedup with n processors is n (linear speedup)
78Speedup and Comp/Comm Ratio
79Overhead
- Things that limit the speedup
- Serial parts of the computation
- Some processors compute while others are idle
- Communication time for sending messages
- Extra computation in the parallel version not
appearing in the serial version
80Amdahl's Law (1967)
81Amdahl's Law - continued
With only 5% of the computation being serial,
the maximum speedup is 20.
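The bound of 20 can be seen numerically. The sketch below uses the standard form of Amdahl's Law, S(n) = n / (1 + (n - 1) f), where f is the serial fraction; the function names are chosen for this illustration.

```python
def amdahl_speedup(n, f):
    """Amdahl's Law: speedup on n processors when a fraction f of the
    computation is serial: S(n) = n / (1 + (n - 1) * f)."""
    return n / (1 + (n - 1) * f)


def amdahl_limit(f):
    """Upper bound on the speedup as n grows without limit: 1 / f."""
    return 1.0 / f


if __name__ == "__main__":
    f = 0.05
    for n in (1, 10, 100, 1000, 10**6):
        print(n, round(amdahl_speedup(n, f), 2))
    print("limit:", amdahl_limit(f))
```

However many processors are added, the speedup for f = 0.05 approaches but never reaches 1/f = 20.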
82Speedup
83Efficiency
E is the fraction of time that the processors are
being used. If $E = 100\%$ then $S(n) = n$.
84Cost
An algorithm is cost-optimal when its cost is
proportional to the single-processor cost (i.e.
execution time).
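Speedup, efficiency and cost can be related in a few lines. The slide formulas did not survive extraction, so this sketch assumes the usual textbook definitions: S(n) = t_s / t_p, E = S(n) / n, and cost = n × t_p. The timings in the example are invented.

```python
def speedup(t_serial, t_parallel):
    """S(n) = t_s / t_p (assumed textbook definition)."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n):
    """E = S(n) / n: the fraction of time the n processors are being used."""
    return speedup(t_serial, t_parallel) / n


def cost(t_parallel, n):
    """Cost = parallel execution time multiplied by the number of processors."""
    return t_parallel * n


if __name__ == "__main__":
    t_s, t_p, n = 100.0, 12.5, 10      # invented timings, in seconds
    print("speedup   :", speedup(t_s, t_p))
    print("efficiency:", efficiency(t_s, t_p, n))
    print("cost      :", cost(t_p, n), "vs single-processor cost", t_s)
```

When the efficiency is 1, the cost equals the single-processor execution time; lower efficiency shows up as a cost above it.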
85Scalability
- An imprecise term
- Reflects H/W and S/W scalability
- How to get increased performance when the H/W
is increased?
- What H/W is needed when the problem size (e.g.
number of cells) is increased?
- Problem dependent!
86Gustafson's Law (1988) 1/3
Gives an argument against the pessimistic
conclusion of Amdahl's Law. Rather than assume that
the problem size is fixed, we should assume that
the parallel execution time is fixed. Define a
Scaled Speedup for the case of increasing the
number of processors as well as the problem size.
87Gustafson's Law 2/3
88Gustafson's Law 3/3
An example: assume we have $n = 20$ and a serial
fraction of $s = 0.05$. Then
$S_{scaled} = 0.05 + 0.95 \times 20 = 19.05$,
while the speedup according to Amdahl's Law
is $S = 20/(0.05(20-1)+1) = 10.26$.
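The two numbers in this example can be reproduced directly. A minimal sketch, using S_scaled(n) = s + (1 - s) n for Gustafson's Law and S(n) = n / (s (n - 1) + 1) for Amdahl's Law; the function names are chosen for this illustration.

```python
def scaled_speedup(n, s):
    """Gustafson's scaled speedup: S_scaled(n) = s + (1 - s) * n."""
    return s + (1 - s) * n


def amdahl_speedup(n, s):
    """Amdahl's Law for a fixed problem size: S(n) = n / (s * (n - 1) + 1)."""
    return n / (s * (n - 1) + 1)


if __name__ == "__main__":
    n, s = 20, 0.05
    print("Gustafson:", round(scaled_speedup(n, s), 2))
    print("Amdahl   :", round(amdahl_speedup(n, s), 2))
```

The gap between the two values comes only from the assumption: fixed parallel execution time with a growing problem (Gustafson) versus fixed problem size (Amdahl).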
89?????
- ???? ?????? ???? 10 ??????, ??"? ??? ????? ??
200MFLOPS. ??? ?????? ????? ??????? ?? MFLOPS
???? 10 ????? ??? ???? ?- 90 ????? ??? ???????
90?????
- ???? ?? ???? ??? ??????, ??? ?????? ???
- 10 × 200 = 2000 MFLOPS
- ????? ???? 10 ????? ???? ???? ???? ????? 90
????? ????? 10 ??????, ???
91Domain Decomposition
- ????? ????? ?????? ?? ????????? ????? ???????
- ????? ????? ??????? ????? ?????? ????? ????????
- Load Balance
- Granularity
92Load Balance 1/2
- All processors must be kept busy!
- The parallel cluster may not be homogeneous
- (CPUs, memory, users/jobs, network)
93Load Balance 2/2
- Static versus Dynamic techniques
- Static
- Algorithmic assignment based on input; won't
change
- Low runtime overhead
- Computation must be predictable
- Preferable when applicable (except in
multiprogrammed/heterogeneous environment)
- Dynamic
- Adapt at runtime to balance load
- Can increase communication and reduce locality
- Can increase task management overheads
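The difference between the two techniques can be illustrated with a toy simulation. This is a sketch under simplifying assumptions: task durations are known numbers, the static scheme is a plain round-robin assignment, the dynamic scheme hands each task to whichever worker becomes free first, and the communication and task-management overheads listed above are not modeled. All names and task times are invented.

```python
import heapq


def static_makespan(task_times, n_workers):
    """Static: task i is assigned up front to worker i % n_workers."""
    loads = [0.0] * n_workers
    for i, t in enumerate(task_times):
        loads[i % n_workers] += t
    return max(loads)                 # finish time of the busiest worker


def dynamic_makespan(task_times, n_workers):
    """Dynamic: each task goes to whichever worker becomes free first."""
    free_at = [0.0] * n_workers       # min-heap of times when workers are free
    heapq.heapify(free_at)
    for t in task_times:
        start = heapq.heappop(free_at)        # earliest available worker
        heapq.heappush(free_at, start + t)    # busy until start + t
    return max(free_at)


if __name__ == "__main__":
    tasks = [8, 1, 8, 1, 8, 1, 8, 1]  # invented, deliberately uneven work
    print("static :", static_makespan(tasks, 2))
    print("dynamic:", dynamic_makespan(tasks, 2))
```

With these task times the round-robin assignment gives every long task to the same worker, while the dynamic scheme evens the load out at runtime. When the work is predictable, a good static assignment reaches the same balance without the runtime bookkeeping.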
94Determining Task Granularity
- Task granularity: amount of work associated with
a task
- General rule:
- Coarse-grained => often less load balance
- Fine-grained => more overhead; often more
communication and contention
95Algorithms Adding 8 Numbers
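The slide's figure did not survive extraction. One common way to add 8 numbers in parallel is a pairwise (tree) reduction, sketched below on the assumption that this is the kind of algorithm meant; `tree_sum` is a made-up name and the function expects a non-empty list. The additions inside one level do not depend on each other, so they could all run at the same time on separate processors.

```python
def tree_sum(values):
    """Pairwise (tree) reduction of a non-empty sequence.
    Returns (total, number of levels, i.e. parallel steps)."""
    level = list(values)
    steps = 0
    while len(level) > 1:
        # Additions within one level are independent of each other,
        # so a parallel machine could perform them simultaneously.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:            # an odd element is carried up unchanged
            nxt.append(level[-1])
        level = nxt
        steps += 1
    return level[0], steps


if __name__ == "__main__":
    print(tree_sum([1, 2, 3, 4, 5, 6, 7, 8]))
```

For 8 numbers the tree has 3 levels, compared with 7 sequential additions on a single processor.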
96Summary Terms Defined - 1
- Flynn Taxonomy
- Message Passing
- Shared Memory
- Bandwidth
- Latency
- Bisection Width
- Diameter
- Connectivity
- Cost
- Meshes, Trees, Hypercubes
- Deadlock
97Summary Terms Defined - 2
- Embedding
- Process
- Amdahl's Law
- Speedup Factor
- Efficiency
- Cost
- Scalability
- Gustafson's Law
- Load Balance
98Next Week's Class
- ?????? ??? ?????? ?????? ???????, ???? ?' ??????
????? ???? ??????? - ?? ????? ????? ????? ?? ????? ??????? ??? ?????
???? ?????? (Email ?????)!!! ????? ??? ????
????? ????? ?? ???? ???? ??????!!!
99Task 2
- Go to http://www.lam-mpi.org/tutorials/
- Download and print the file
- MPI quick reference sheet
- Linux Tutorial
- Go to http://www.ctssn.com/ and learn at least
lessons 1, 2 and 3.
100Cluster Computing
- COTS: Commodities Off The Shelf
- Free O/S, e.g. Linux
- LOBOS: Lots Of Boxes On the Shelf
- PCs connected by a fast network
101??? ???? ???????????
- Cray-J932
- 16 Processors
- 200 MFLOPS per CPU
- 3.2 GFLOPS
102The Dwarves 1/5
- 12 PCs of several types
- Red Hat Linux 6.0-6.2
- Fast Ethernet 100Mbps
- Myrinet Network
- 1.28+1.28 Gbps, SAN
103The Dwarves 2/5
There are 12 computers running the Linux operating
system, named dwarf1-dwarf12 (or dwarf1m-dwarf12m):
- dwarf1m, dwarf3m-dwarf7m: Pentium II 300 MHz
- dwarf9m-dwarf12m: Pentium III 450 MHz (dual CPU)
- dwarf2m, dwarf8m: Pentium III 733 MHz (dual CPU)
104The Dwarves 3/5
- 6 PII at 300MHz processors
- 8 PIII at 450MHz processors
- 4 PIII at 733MHz processors
- Total: 18 processors, 8 GFLOPS
105The Dwarves 4/5
- dwarf1 .. dwarf12: node names for the Fast
Ethernet link
- dwarf1m .. dwarf12m: node names for the Myrinet
network
106The Dwarves 5/5
- GNU FORTRAN / C Compilers
- PVM / MPI
107Cluster Computing - 1
108Cluster Computing - 2
109Cluster Computing - 3
110Cluster Computing - 4
111Linux
http://www.ee.bgu.ac.il/tel-zur/linux.html
112Linux
Google hit counts: Linux 38,600,000; Microsoft
21,500,000; Bible 7,590,000