Title: Overview of Parallel Architecture and Programming Models
1. Overview of Parallel Architecture and Programming Models

2. What is a Parallel Computer?
- A collection of processing elements that cooperate to solve large problems fast
- Some broad issues that distinguish parallel computers:
  - Resource allocation
    - How large a collection?
    - How powerful are the elements?
    - How much memory?
  - Data access, communication, and synchronization
    - How do the elements cooperate and communicate?
    - How are data transmitted between processors?
    - What are the abstractions and primitives for cooperation?
  - Performance and scalability
    - How does it all translate into performance?
    - How does it scale?
3. Why Parallelism?
- Provides an alternative to a faster clock for performance
  - Assuming effective per-node performance doubles every 2 years, a 1024-CPU system can deliver today the performance that a single-CPU system would take 20 years to reach (see the arithmetic below)
- Applies at all levels of system design
- Is increasingly central in information processing
  - Scientific computing: simulation, data analysis, data storage and management, etc.
  - Commercial computing: transaction processing, databases
  - Internet applications: search. Google operates at least 50,000 CPUs, many as part of large parallel systems
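A quick sanity check of the 20-year claim (my arithmetic, assuming ideal parallel speedup; not on the original slide):

\[
\left(2\right)^{20\,\text{yr}\,/\,2\,\text{yr}} = 2^{10} = 1024,
\]

so 1024 CPUs of today's speed match one CPU after 20 years of doubling every 2 years, assuming perfect parallel efficiency.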
4. How to Study Parallel Systems
- History: diverse and innovative organizational structures, often tied to novel programming models
- Rapidly matured under strong technological constraints
  - The microprocessor is ubiquitous
  - Laptops and supercomputers are fundamentally similar!
  - Technological trends cause diverse approaches to converge
- Technological trends make parallel computing inevitable
  - In the mainstream
- Need to understand fundamental principles and design tradeoffs, not just taxonomies
  - Naming, ordering, replication, communication performance
5. Outline
- Drivers of Parallel Computing
- Trends in Supercomputers for Scientific Computing
- Evolution and Convergence of Parallel Architectures
- Fundamental Issues in Programming Models and Architecture
6. Drivers of Parallel Computing
- Application needs: our insatiable need for computing cycles
  - Scientific computing: CFD, biology, chemistry, physics, ...
  - General-purpose computing: video, graphics, CAD, databases, TP, ...
  - Internet applications: search, e-commerce, clustering, ...
- Technology trends
- Architecture trends
- Economics
- Current trends
  - All microprocessors have support for external multiprocessing
  - Servers and workstations are MP: Sun, SGI, Dell, COMPAQ, ...
  - Microprocessors are multiprocessors: multicore, SMP on a chip
7. Application Trends
- Demand for cycles fuels advances in hardware, and vice versa
  - This cycle drives the exponential increase in microprocessor performance
  - The most demanding applications drive parallel architecture hardest
- Range of performance demands
  - Need a range of system performance with progressively increasing cost
  - Platform pyramid
- Goal of applications in using parallel machines: speedup
  - Speedup(p processors)
  - For a fixed problem size (input data set), performance = 1/time
  - Speedup_fixed problem(p processors) = Time(1 processor) / Time(p processors), as written out below
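Written out, the fixed-problem speedup on this slide is

\[
\text{Speedup}_{\text{fixed problem}}(p) \;=\; \frac{\text{Time}(1\ \text{processor})}{\text{Time}(p\ \text{processors})},
\]

so, for a purely illustrative example (numbers not from the slides), Time(1) = 100 s and Time(32) = 5 s give Speedup(32) = 20.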
8. Scientific Computing Demand
- Ever-increasing demand due to the need for more accuracy, higher-level modeling and knowledge, and analysis of exploding amounts of data
- Example area 1: climate and ecological modeling goals
  - By 2010 or so:
    - Simply improving resolution, simulated time, and physics increases the requirement by factors of 10⁴ to 10⁷; then
    - Reliable global warming, natural disaster, and weather prediction
  - By 2015 or so:
    - Predictive models of rainforest destruction, forest sustainability, effects of climate change on ecosystems and food webs, global health trends
  - By 2020 or so:
    - Verifiable global ecosystem and epidemic models
    - Integration of macro-effects with localized and then micro-effects
    - Predictive effects of human activities on Earth's life-support systems
    - Understanding Earth's life-support systems
9. Scientific Computing Demand
- Example area 2: biology goals
  - By 2010 or so:
    - Ex vivo and then in vivo molecular-computer diagnosis
  - By 2015 or so:
    - Modeling-based vaccines
    - Individualized medicine
    - Comprehensive biological data integration (most data co-analyzable)
    - Full model of a single cell
  - By 2020 or so:
    - Full model of a multi-cellular tissue/organism
    - Purely in-silico developed drugs; personalized smart drugs
    - Understanding complex biological systems: cells and organisms to ecosystems
    - Verifiable predictive models of biological systems
10. Engineering Computing Demand
- Large parallel machines are a mainstay in many industries:
  - Petroleum (reservoir analysis)
  - Automotive (crash simulation, drag analysis, combustion efficiency)
  - Aeronautics (airflow analysis, engine efficiency, structural mechanics, electromagnetism)
  - Computer-aided design
  - Pharmaceuticals (molecular modeling)
  - Visualization
    - in all of the above
    - entertainment (movies), architecture (walk-throughs, rendering)
  - Financial modeling (yield and derivative analysis)
  - etc.
11. Learning Curve for Parallel Applications
- AMBER molecular dynamics simulation program
- Starting point was vector code for the Cray-1
- 145 MFLOPS on the Cray C90; 406 MFLOPS for the final version on a 128-processor Paragon; 891 MFLOPS on a 128-processor Cray T3D
12. Commercial Computing
- Also relies on parallelism for the high end
  - Scale not as large, but use much more widespread
  - Computational power determines the scale of business that can be handled
- Databases, online transaction processing, decision support, data mining, data warehousing, ...
- TPC benchmarks (TPC-C order entry, TPC-D decision support)
  - Explicit scaling criteria provided: the size of the enterprise scales with the system
  - Problem size is no longer fixed as p increases, so throughput is used as the performance measure (transactions per minute, or tpm)
- E-commerce, search, and other scalable internet services
  - Parallel applications running on clusters
  - Developing new parallel software models and primitives
  - Insight from automated analysis of large, disparate data
13. TPC-C Results for Wintel Systems
- 6-way Unisys AQ HS6, Pentium Pro 200 MHz: 12,026 tpmC, $39.38/tpmC, avail. 11-30-97, TPC-C v3.3 (withdrawn)
- 4-way Cpq PL 5000, Pentium Pro 200 MHz: 6,751 tpmC, $89.62/tpmC, avail. 12-1-96, TPC-C v3.2 (withdrawn)
- 4-way IBM NF 7000, PII Xeon 400 MHz: 18,893 tpmC, $29.09/tpmC, avail. 12-29-98, TPC-C v3.3 (withdrawn)
- 8-way Cpq PL 8500, PIII Xeon 550 MHz: 40,369 tpmC, $18.46/tpmC, avail. 12-31-99, TPC-C v3.5 (withdrawn)
- 8-way Dell PE 8450, PIII Xeon 700 MHz: 57,015 tpmC, $14.99/tpmC, avail. 1-15-01, TPC-C v3.5 (withdrawn)
- 32-way Unisys ES7000, PIII Xeon 900 MHz: 165,218 tpmC, $21.33/tpmC, avail. 3-10-02, TPC-C v5.0
- 32-way NEC Express5800, Itanium2 1 GHz: 342,746 tpmC, $12.86/tpmC, avail. 3-31-03, TPC-C v5.0
- 32-way Unisys ES7000, Xeon MP 2 GHz: 234,325 tpmC, $11.59/tpmC, avail. 3-31-03, TPC-C v5.0
- Parallelism is pervasive
- Small- to moderate-scale parallelism is very important
- Difficult to obtain a snapshot to compare across vendor platforms
14. Summary of Application Trends
- The transition to parallel computing has occurred for scientific and engineering computing
- It has also occurred in commercial computing
  - Databases and transactions, as well as financial applications
  - Scalable internet services (at least coarse-grained parallelism)
- Desktops also run multithreaded programs, which are a lot like parallel programs
- Demand for improving throughput on sequential workloads
  - Greatest use of small-scale multiprocessors
- Solid application demand, which keeps increasing with time
  - The key challenge throughout is making parallel programming easier
  - Taking advantage of pervasive parallelism with multi-core systems
15. Drivers of Parallel Computing
- Application Needs
- Technology Trends
- Architecture Trends
- Economics
16. Technology Trends: Rise of the Micro
The natural building block for multiprocessors is now also about the fastest!
17. General Technology Trends
- Microprocessor performance increases 50-100% per year
- Clock frequency doubles every 3 years
- Transistor count quadruples every 3 years
- Moore's law: transistors per chip ≈ 1.59^(year-1959) (originally 2^(year-1959))
- The huge investment per generation is carried by a huge commodity market
- With every feature-size scaling of n:
  - we get O(n²) transistors
  - we get an O(n) increase in possible clock frequency
  - so we should get an O(n³) increase in processor performance (spelled out below)
- Do we?
  - See architecture trends
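Spelled out, the scaling argument on this slide is (a restatement, with n the linear feature-size scaling factor):

\[
\text{transistors} = O(n^{2}), \qquad f_{\text{clock}} = O(n)
\;\;\Longrightarrow\;\;
\text{potential performance} = O(n^{2}) \cdot O(n) = O(n^{3}).
\]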
18. Die and Feature Size Scaling
- Die size growing at about 7% per year; feature size shrinking 25-30%

19. Clock Frequency Growth Rate (Intel family)

20. Transistor Count Growth Rate (Intel family)
- Transistor count grows much faster than clock rate
  - about 40% per year; an order of magnitude more contribution over two decades
- Width/space has greater potential than per-unit speed
21. How to Use More Transistors
- Improve single-threaded performance via architecture
  - Not keeping up with the potential offered by technology (next)
- Use transistors for memory structures to improve data locality
  - Doesn't give as high returns (2x for 4x cache size, up to a point)
- Use parallelism
  - Instruction-level
  - Thread-level
- Bottom line: not that single-threaded performance has plateaued, but that parallelism is the natural way to stay on a better curve
22. Microprocessor Performance

23. Similar Story for Storage (Transistor Count)

24. Similar Story for Storage (DRAM Capacity)

25. Similar Story for Storage
- The divergence between memory capacity and speed is even more pronounced
  - Capacity increased by 1000x from 1980-95, and increases about 50% per year
  - Latency decreases only about 3% per year (only 2x from 1980-95)
  - Bandwidth per memory chip increases about twice as fast as latency decreases
- Larger memories are slower, while processors get faster
  - Need to transfer more data in parallel
  - Need deeper cache hierarchies
  - How to organize caches?
26. Similar Story for Storage
- Parallelism increases the effective size of each level of the hierarchy without increasing access time
- Parallelism and locality within memory systems too
  - New designs fetch many bits within the memory chip, then follow with a fast pipelined transfer across a narrower interface
  - A buffer caches the most recently accessed data
- Disks too: parallel disks plus caching
- Overall, the dramatic growth of processor speed, storage capacity, and bandwidths relative to latency (especially) and clock speed points toward parallelism as the desirable architectural direction
27. Drivers of Parallel Computing
- Application Needs
- Technology Trends
- Architecture Trends
- Economics
28. Architectural Trends
- Architecture translates technology's gifts into performance and capability
- Resolves the tradeoff between parallelism and locality
  - Recent microprocessors: 1/3 compute, 1/3 cache, 1/3 off-chip connect
  - Tradeoffs may change with scale and technology advances
- Four generations of architectural history: tube, transistor, IC, VLSI
  - Here we focus only on the VLSI generation
- The greatest delineation within VLSI has been in the scale and type of parallelism exploited
29. Architectural Trends in Parallelism
- Up to 1985: bit-level parallelism: 4-bit -> 8-bit -> 16-bit
  - slows after 32 bits
  - adoption of 64-bit is well under way, 128-bit is far off (not a performance issue)
  - great inflection point when a 32-bit micro and caches fit on a chip
- Basic pipelining and hardware support for complex operations like FP multiply etc. led to O(n³) growth in performance
  - Intel 4004 to 386
30. Architectural Trends in Parallelism
- Mid-80s to mid-90s: instruction-level parallelism
  - Pipelining and simple instruction sets, plus compiler advances (RISC)
  - Larger on-chip caches
    - But these only halve the miss rate on quadrupling the cache size
  - More functional units -> superscalar execution
    - But limited performance scaling
- O(n²) growth in performance
  - Intel 486 to Pentium III/IV
31. Architectural Trends in Parallelism
- After the mid-90s:
  - Greater sophistication: out-of-order execution, speculation, prediction
    - to deal with control transfer and latency problems
  - Very wide issue processors
    - Don't help many applications very much
    - Need multiple threads (SMT) to exploit them
  - Increased complexity and size lead to slowdown
    - Long global wires
    - Increased access times to data
    - Time to market
- Next step: thread-level parallelism
32. Can Instruction-Level Parallelism Get Us There?
- Reported speedups for superscalar processors:
  - Horst, Harris, and Jardine [1990]: 1.37
  - Wang and Wu [1988]: 1.70
  - Smith, Johnson, and Horowitz [1989]: 2.30
  - Murakami et al. [1989]: 2.55
  - Chang et al. [1991]: 2.90
  - Jouppi and Wall [1989]: 3.20
  - Lee, Kwok, and Briggs [1991]: 3.50
  - Wall [1991]: 5
  - Melvin and Patt [1991]: 8
  - Butler et al. [1991]: 17
- The large variance is due to differences in
  - the application domain investigated (numerical versus non-numerical)
  - the capabilities of the processor modeled
33. ILP Ideal Potential
- Infinite resources and fetch bandwidth, perfect branch prediction and renaming
- Real caches and non-zero miss latencies

34. Results of ILP Studies
- Concentrate on parallelism for 4-issue machines
- Realistic studies show only about 2-fold speedup
- More recent work examines ILP that looks across threads for parallelism
35. Architectural Trends: Bus-Based MPs
- A micro on a chip makes it natural to connect many to shared memory
  - dominates the server and enterprise market, moving down to the desktop
- Faster processors began to saturate the bus, then bus technology advanced
  - today there is a range of sizes for bus-based systems, from desktop to large servers
- (Figure: number of processors in fully configured commercial shared-memory systems)

36. Bus Bandwidth

37. Bus Bandwidth: Intel Systems

38. Do Buses Scale?
- Buses are a convenient way to extend the architecture to parallelism, but they do not scale
  - bandwidth doesn't grow as CPUs are added
- Scalable systems use physically distributed memory
39. Drivers of Parallel Computing
- Application Needs
- Technology Trends
- Architecture Trends
- Economics
40. Finally, Economics
- Fabrication cost is roughly O(1/feature-size)
  - 90 nm fabs cost about $1-2 billion
  - So fabrication of processors is expensive
- The number of designers is also O(1/feature-size)
  - The 10-micron 4004 processor had 3 designers
  - Recent 90 nm processors had about 300
  - New designs are very expensive
  - Push toward consolidation of processor types
- Processor complexity is increasingly expensive
  - Cores are reused, but tweaks are expensive too

41. Design Complexity and Productivity
- Design complexity outstrips human productivity
- Commodity microprocessors not only fast but CHEAP
- Development cost is tens of millions of dollars
- BUT, many more are sold compared to
supercomputers - Crucial to take advantage of the investment, and
use the commodity building block - Exotic parallel architectures no more than
special-purpose - Multiprocessors being pushed by software vendors
(e.g. database) as well as hardware vendors - Standardization by Intel makes small, bus-based
SMPs commodity - What about on-chip processor design?
43Whats on a processing chip?
- Recap
- Number of transistors growing fast
- Methods to use for single-thread performance
running out of steam - Memory issues argue for parallelism too
- Instruction-level parallelism limited, need
thread-level - Consolidation is a powerful force
- All seems to point to many simpler cores rather
than single bigger complex core - Additional key arguments wires, power, cost
44. Wire Delay
- Gate delay shrinks while global interconnect delay grows, which favors short local wires

45. Power
- Power dissipation in Intel processors over time

46. Power and Performance
47. Power
- Power grows with the number of transistors and with clock frequency
- Power grows with voltage: P = CV²f
  - Going from 12 V to 1.1 V reduced power consumption by about 120x over 20 years (see the arithmetic below)
  - Voltage is projected to go down to 0.7 V in 2018, so only another ~2.5x
- Power per chip is peaking in designs
  - Itanium 2 was 130 W, Montecito 100 W
  - Power is a first-class design constraint
- Circuit-level power techniques are quite far along
  - clock gating, multiple thresholds, sleeper transistors
48. Power versus Clock Frequency
- Two processor generations, two feature sizes

49. Architectural Implication of Power
- Fewer transistors per core is a lot more power efficient
  - Narrower issue, shorter pipelines, smaller OOO window
- Gets per-processor performance back on the O(n³) curve
  - But lower single-thread performance
- What complexity to eliminate?
  - Speculation, multithreading, ...?
  - All good for some things, but need to be careful about power/benefit
50. ITRS Projections

51. ITRS Projections (contd.)
- The number of processors on a chip will outstrip individual processor performance

52. Cost of Chip Development
- Non-recurring engineering costs are increasing greatly as complexity outstrips productivity

53. Recurring Costs Per Die (1994)
54. Summary: What's on a Chip
- Beyond the arguments for parallelism based on commodity processors in general:
  - Wire delay, power, and economics all argue for multiple simpler cores on a chip rather than increasingly complex single cores
- Challenge: SOFTWARE. How to program parallel machines?

55. Summary: Why Parallel Architecture?
- Increasingly attractive
  - Economics, technology, architecture, application demand
- Increasingly central and mainstream
- Parallelism exploited at many levels
  - Instruction-level parallelism
  - Thread-level parallelism and on-chip multiprocessing
  - Multiprocessor servers
  - Large-scale multiprocessors (MPPs)
- Focus of this class: the multiprocessor level of parallelism
- Same story from the memory (and storage) system perspective
  - Increase bandwidth, reduce average latency with many local memories
- A wide range of parallel architectures makes sense
  - Different cost, performance, and scalability
56. Outline
- Drivers of Parallel Computing
- Trends in Supercomputers for Scientific Computing
- Evolution and Convergence of Parallel Architectures
- Fundamental Issues in Programming Models and Architecture
57. Scientific Supercomputing
- Proving ground and driver for innovative architecture and techniques
  - The market is smaller relative to commercial computing as MPs become mainstream
  - Dominated by vector machines starting in the 70s
  - Microprocessors have made huge gains in floating-point performance
    - high clock rates
    - pipelined floating-point units (e.g. multiply-add)
    - instruction-level parallelism
    - effective use of caches
  - Plus economics
- Large-scale multiprocessors replace vector supercomputers

58. Raw Uniprocessor Performance: LINPACK

59. Raw Parallel Performance: LINPACK
- Even vector Crays became parallel: X-MP (2-4), Y-MP (8), C-90 (16), T94 (32)
- Since 1993, Cray has produced MPPs too (T3D, T3E)
60. Another View

61. Top 10 Fastest Computers (Linpack)
- Rank, site, computer, processors, year, Rmax (GFlops):
  1. DOE/NNSA/LLNL, USA: IBM BlueGene, 131072, 2005, 280600
  2. NNSA/Sandia Labs, USA: Cray Red Storm (Opteron), 26544, 2006, 101400
  3. IBM Research, USA: IBM Blue Gene Solution, 40960, 2005, 91290
  4. DOE/NNSA/LLNL, USA: ASCI Purple, IBM eServer p5, 12208, 2006, 75760
  5. Barcelona Center, Spain: IBM JS21 Cluster (PPC 970), 10240, 2006, 62630
  6. NNSA/Sandia Labs, USA: Dell Thunderbird Cluster, 9024, 2006, 53000
  7. CEA, France: Bull Tera-10 Itanium2 Cluster, 9968, 2006, 52840
  8. NASA/Ames, USA: SGI Altix 1.5 GHz, Infiniband, 10160, 2004, 51870
  9. GSIC Center, Japan: NEC/Sun Grid Cluster (Opteron), 11088, 2006, 47380
- The NEC Earth Simulator (top of the list for 5 editions) moves down to #14
- The #10 system has doubled in performance since last year
62. Top 500: Architectural Styles

63. Top 500: Processor Type

64. Top 500: Installation Type

65. Top 500 as of Nov 2006: Highlights
- The NEC Earth Simulator (top of the list for 5 editions) moves down to #14
- The #10 system has doubled in performance since last year
- The system ranked #359 six months ago would be #500 on this list
- Total performance of the top 500 is up from 2.3 Pflops a year ago to 3.5 Pflops
- Clusters are dominant at this scale: 359 of the top 500 are labeled as clusters
- Dual-core processors are growing in popularity: 75 systems use dual-core Opterons and 31 use Intel Woodcrest
- IBM is the top vendor with almost 50% of systems; HP is second
- IBM and HP have 237 out of the 244 commercial and industrial installations
- The US has 360 of the top 500 installations, the UK 32, Japan 30, Germany 19, China 18
66. Top 500 Linpack Performance over Time

67. Another View of Performance Growth

68. Another View of Performance Growth

69. Another View of Performance Growth

70. Another View of Performance Growth

71. Processor Types in Top 500 (2002)

72. Parallel and Distributed Systems
73. Outline
- Drivers of Parallel Computing
- Trends in Supercomputers for Scientific Computing
- Evolution and Convergence of Parallel Architectures
- Fundamental Issues in Programming Models and Architecture
74. History
- Historically, parallel architectures were tied to programming models
- Divergent architectures, with no predictable pattern of growth
- (Figure: application software and system software atop divergent architectures: systolic arrays, SIMD, message passing, dataflow, shared memory)
- Uncertainty of direction paralyzed parallel software development!
75. Today
- Extension of computer architecture to support communication and cooperation
  - OLD: Instruction Set Architecture
  - NEW: Communication Architecture
- Defines
  - Critical abstractions, boundaries, and primitives (interfaces)
  - Organizational structures that implement the interfaces (hw or sw)
- Compilers, libraries, and the OS are important bridges between application and architecture today
76. Modern Layered Framework

77. Parallel Programming Model
- What the programmer uses in writing applications
- Specifies communication and synchronization
- Examples
  - Multiprogramming: no communication or synchronization at program level
  - Shared address space: like a bulletin board
  - Message passing: like letters or phone calls, explicit point-to-point
  - Data parallel: more regimented, global actions on data
    - Implemented with a shared address space or message passing
78. Communication Abstraction
- User-level communication primitives provided by the system
  - Realizes the programming model
  - A mapping exists between the language primitives of the programming model and these primitives
- Supported directly by hw, or via OS, or via user sw
- Lots of debate about what to support in sw and the gap between layers
- Today
  - The hw/sw interface tends to be flat, i.e. complexity is roughly uniform
  - Compilers and software play important roles as bridges
  - Technology trends exert a strong influence
- The result is convergence in organizational structure
  - Relatively simple, general-purpose communication primitives
79. Communication Architecture
= User/System Interface + Implementation
- User/System Interface
  - Communication primitives exposed to user level by hw and system-level sw
  - (There may be additional user-level software between this and the programming model)
- Implementation
  - Organizational structures that implement the primitives: hw or OS
  - How optimized are they? How integrated into the processing node?
  - Structure of the network
- Goals
  - Performance
  - Broad applicability
  - Programmability
  - Scalability
  - Low cost
80. Evolution of Architectural Models
- Historically, machines were tailored to programming models
  - Programming model, communication abstraction, and machine organization were lumped together as "the architecture"
- Understanding their evolution helps understand the convergence
  - Identify core concepts
- Evolution of architectural models:
  - Shared Address Space (SAS)
  - Message Passing
  - Data Parallel
  - Others (won't discuss): Dataflow, Systolic Arrays
- Examine the programming model, motivation, and convergence
81. Shared Address Space Architectures
- Any processor can directly reference any memory location
  - Communication occurs implicitly as a result of loads and stores
- Convenient
  - Location transparency
  - Similar programming model to time-sharing on uniprocessors
    - Except processes run on different processors
    - Good throughput on multiprogrammed workloads
- Naturally provided on a wide range of platforms
  - History dates at least to precursors of mainframes in the early 60s
  - Wide range of scale: few to hundreds of processors
- Popularly known as the shared memory machine or model
  - Ambiguous: memory may be physically distributed among processors
82. Shared Address Space Model
- Process: virtual address space plus one or more threads of control
- Portions of the address spaces of processes are shared
  - Writes to a shared address are visible to other threads (in other processes too)
- Natural extension of the uniprocessor model: conventional memory operations for communication; special atomic operations for synchronization (see the sketch below)
- The OS itself uses shared memory to coordinate processes
83. Communication Hardware for SAS
- Also a natural extension of the uniprocessor
- We already have a processor, one or more memory modules, and I/O controllers connected by a hardware interconnect of some sort
  - Memory capacity is increased by adding modules, I/O by adding controllers
  - Add processors for processing!
84. History of SAS Architecture
- Mainframe approach
  - Motivated by multiprogramming
  - Extends the crossbar used for memory bandwidth and I/O
  - Originally, processor cost limited systems to small scale
    - later, the cost of the crossbar did
  - Bandwidth scales with p
  - High incremental cost: use a multistage network instead
- Minicomputer approach
  - Almost all microprocessor systems have a bus
  - Motivated by multiprogramming, TP
  - Used heavily for parallel computing
  - Called a symmetric multiprocessor (SMP)
  - Latency larger than for a uniprocessor
  - Bus is the bandwidth bottleneck
    - caching is key; coherence problem
  - Low incremental cost
85. Example: Intel Pentium Pro Quad
- All coherence and multiprocessing glue integrated in the processor module
- Highly integrated, targeted at high volume
- Low latency and bandwidth

86. Example: Sun Enterprise
- Memory is on the processor cards themselves
  - 16 cards of either type: processors + memory, or I/O
  - But all memory is accessed over the bus, so the machine is symmetric
  - Higher bandwidth, higher latency bus
87. Scaling Up
- The problem is the interconnect: cost (crossbar) or bandwidth (bus)
- Dance-hall: bandwidth still scalable, but at lower cost than a crossbar
  - latencies to memory are uniform, but uniformly large
- Distributed memory or non-uniform memory access (NUMA)
  - Construct a shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response)
- Caching of shared (particularly nonlocal) data?

88. Example: Cray T3E
- Scales up to 1024 processors, 480 MB/s links
- The memory controller generates a communication request for nonlocal references
- The communication architecture is tightly integrated into the node
- No hardware mechanism for coherence (SGI Origin etc. provide this)
89. Caches and Cache Coherence
- Caches play a key role in all cases
  - Reduce average data access time
  - Reduce bandwidth demands placed on the shared interconnect
- But private processor caches create a problem
  - Copies of a variable can be present in multiple caches
  - A write by one processor may not become visible to others
    - They'll keep accessing the stale value in their caches
  - Cache coherence problem
  - Need to take actions to ensure visibility
90. Example: Cache Coherence Problem
- Processors see different values for u after event 3 (a code rendering of the scenario appears below)
- With write-back caches, the value written back to memory depends on the happenstance of which cache flushes or writes back its value, and when
- Processes accessing main memory may see a very stale value
- Unacceptable to programs, and frequent!
91. Cache Coherence
- Reading a location should return the latest value written (by any process)
- Easy in uniprocessors
  - Except for I/O: coherence between I/O devices and processors
  - But infrequent, so software solutions work
- We would like the same to hold when processes run on different processors
  - E.g. as if the processes were interleaved on a uniprocessor
- But the coherence problem is much more critical in multiprocessors
  - Pervasive and performance-critical
  - A very basic design issue in supporting the programming model effectively
- It's worse than that: what is the "latest value" with independent processes?
  - Memory consistency models

92. SGI Origin2000
- The Hub chip provides memory control, communication, and cache coherence support
  - Plus I/O communication etc.
93. Shared Address Space Machines Today
- Bus-based, cache-coherent at small scale
- Distributed-memory, cache-coherent at larger scale
  - Without cache coherence, these are essentially (fast) message passing systems
- Clusters of these at even larger scale
94. Message-Passing Programming Model
- Send specifies the data buffer to be transmitted and the receiving process
- Recv specifies the sending process and the application storage to receive into
  - Optional tag on the send and matching rule on the receive
- Memory-to-memory copy, but need to name processes
- A user process names only local data and entities in process/tag space
- In the simplest form, the send/recv match achieves a pairwise synchronization event
  - Other variants too
- Many overheads: copying, buffer management, protection
- (An MPI-flavored sketch of this model follows below)
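A minimal sketch of the send/recv model in MPI terms. MPI is not named on this slide; it is used here only as a familiar concrete instance of the model, with buffer size and tag chosen arbitrarily.

```c
/* mp_sketch.c -- compile with: mpicc mp_sketch.c ; run with: mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[4] = {0, 0, 0, 0};
    const int tag = 99;                       /* optional tag used for matching */

    if (rank == 0) {
        for (int i = 0; i < 4; i++) buf[i] = i + 1.0;
        /* send names the local buffer and the receiving process (rank 1) */
        MPI_Send(buf, 4, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* recv names the sending process (rank 0) and local storage to receive into */
        MPI_Recv(buf, 4, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %g %g %g %g\n", buf[0], buf[1], buf[2], buf[3]);
    }

    MPI_Finalize();
    return 0;
}
```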
95. Message Passing Architectures
- Complete computer as the building block, including I/O
  - Communication via explicit I/O operations
- Programming model: directly access only the private address space (local memory); communicate via explicit messages (send/receive)
- High-level block diagram similar to distributed-memory SAS
  - But communication needn't be integrated into the memory system, only into I/O
  - History of tighter integration, evolving to a spectrum including clusters
- Easier to build than scalable SAS
  - Can use clusters of PCs or SMPs on a LAN
- Programming model further removed from basic hardware operations
  - Library or OS intervention
96. Evolution of Message-Passing Machines
- Early machines: FIFO on each link
  - Hardware close to the programming model: synchronous ops
  - Replaced by DMA, enabling non-blocking ops
    - Buffered by the system at the destination until recv
- Diminishing role of topology
  - With store-and-forward routing, topology was important
  - The introduction of pipelined routing made it less so
  - Cost is in the node-network interface
  - Simplifies programming

97. Example: IBM SP-2
- Made out of essentially complete RS/6000 workstations
- Network interface integrated into the I/O bus (bandwidth limited by the I/O bus)
  - Doesn't need to see memory references

98. Example: Intel Paragon
- Network interface integrated into the memory bus, for performance
99. Toward Architectural Convergence
- Evolution and the role of software have blurred the boundary
  - Send/recv supported on SAS machines via buffers
  - Can construct a global address space on MP machines using hashing
  - Software shared memory (e.g. using pages as the unit of communication)
- Hardware organization converging too
  - Tighter NI integration even for MP (low latency, high bandwidth)
  - At a lower level, even hardware SAS passes hardware messages
  - Hardware support for fine-grained communication makes software MP faster as well
- Even clusters of workstations/SMPs are parallel systems
  - Fast system area networks (SANs)
- Programming models remain distinct, but organizations have converged
  - Nodes connected by a general network and communication assists
  - Assists range in degree of integration, all the way to clusters
100. Data Parallel Systems
- Programming model
  - Operations performed in parallel on each element of a data structure
  - Logically a single thread of control, performing sequential or parallel steps
  - Conceptually, a processor is associated with each data element
- Architectural model
  - Array of many simple, cheap processors, each with little memory
    - Processors don't sequence through instructions
  - Attached to a control processor that issues instructions
  - Specialized and general communication, cheap global synchronization
- Original motivations
  - Matches simple differential equation solvers
  - Centralize the high cost of instruction fetch/sequencing
101. Application of Data Parallelism
- Each PE contains an employee record with his/her salary
- If salary > 100K then
    salary = salary * 1.05
  else
    salary = salary * 1.10
- Logically, the whole operation is a single step
- Some processors are enabled for the arithmetic operation, others disabled
- Other examples
  - Finite differences, linear algebra, ...
  - Document searching, graphics, image processing, ...
- Some recent machines
  - Thinking Machines CM-1, CM-2 (and CM-5)
  - Maspar MP-1 and MP-2
- (A data-parallel rendering of the salary example follows below)
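The salary update written in the data-parallel style the slide describes: one logical step applied across all elements, with the conditional playing the role of enabling or disabling PEs. OpenMP, the array contents, and the array size are illustrative choices on my part, not from the slides.

```c
/* dp_sketch.c -- compile with: cc -fopenmp dp_sketch.c */
#include <stdio.h>

#define N_EMPLOYEES 8

int main(void) {
    double salary[N_EMPLOYEES] = {50e3, 120e3, 95e3, 200e3, 80e3, 150e3, 99e3, 101e3};

    /* Logically a single step: every element is updated "at once".
     * The if/else mimics enabling some PEs and disabling the others. */
    #pragma omp parallel for
    for (int i = 0; i < N_EMPLOYEES; i++) {
        if (salary[i] > 100e3)
            salary[i] *= 1.05;   /* PEs with salary > 100K enabled for this op */
        else
            salary[i] *= 1.10;   /* the remaining PEs enabled for this op */
    }

    for (int i = 0; i < N_EMPLOYEES; i++)
        printf("employee %d: %.2f\n", i, salary[i]);
    return 0;
}
```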
102. Evolution and Convergence
- Rigid control structure (SIMD in Flynn's taxonomy)
  - SISD = uniprocessor, MIMD = multiprocessor
- Popular when the cost savings of a centralized sequencer were high
  - 60s, when the CPU was a cabinet
  - Replaced by vectors in the mid-70s
    - More flexible w.r.t. memory layout and easier to manage
  - Revived in the mid-80s when 32-bit datapath slices just fit on a chip
  - No longer true with modern microprocessors
- Other reasons for demise
  - Simple, regular applications have good locality and can do well anyway
  - Loss of applicability due to hardwiring data parallelism
  - MIMD machines are as effective for data parallelism, and more general
- The programming model converges with SPMD (single program, multiple data)
  - Contributes the need for fast global synchronization
  - Structured global address space, implemented with either SAS or MP
103. Convergence: Generic Parallel Architecture
- A generic modern multiprocessor
- Node: processor(s), memory system, plus communication assist
  - Network interface and communication controller
- Scalable network
- The communication assist provides primitives with a performance profile
  - Build your programming model on this
- Convergence allows lots of innovation, now within a framework
  - Integration of the assist with the node, what operations are supported, how efficiently...
104. Outline
- Drivers of Parallel Computing
- Trends in Supercomputers for Scientific Computing
- Evolution and Convergence of Parallel Architectures
- Fundamental Issues in Programming Models and Architecture
105. The Model/System Contract
- The model specifies an interface (contract) to the programmer
  - Naming: how are logically shared data and/or processes referenced?
  - Operations: what operations are provided on these data?
  - Ordering: how are accesses to data ordered and coordinated?
  - Replication: how are data replicated to reduce communication?
- The underlying implementation addresses the performance issues
  - Communication cost: latency, bandwidth, overhead, occupancy
- We'll look at the aspects of the contract through examples
106. Supporting the Contract
- A given programming model can be supported in various ways at various layers
- In fact, each layer takes a position on all issues (naming, operations, performance, etc.), and any set of positions can be mapped to another by software
- Key issues for supporting programming models are:
  - What primitives are provided at the communication abstraction layer
  - How efficiently they are supported (hw/sw)
  - How programming models are mapped to them
107. Recap of Parallel Architecture
- Parallel architecture is an important thread in the evolution of architecture
  - At all levels
  - The multiple-processor level is now in the mainstream of computing
- Exotic designs have contributed much, but have given way to convergence
  - Push of technology, cost, and application performance
  - Basic processor-memory architecture is the same
  - The key architectural issue is in the communication architecture
    - How communication is integrated into the memory and I/O systems on the node
- Fundamental design issues
  - Functional: naming, operations, ordering
  - Performance: organization, replication, performance characteristics
- Design decisions driven by workload-driven evaluation
  - Integral part of the engineering focus