Transcript and Presenter's Notes

Title: CS 213: Parallel Processing Architectures


1
CS 213: Parallel Processing Architectures
  • Laxmi Narayan Bhuyan
  • http://www.cs.ucr.edu/~bhuyan

2
  • PARALLEL PROCESSING ARCHITECTURES
  • CS213 SYLLABUS
  • Winter 2008
  • INSTRUCTOR: L.N. Bhuyan (http://www.engr.ucr.edu/
    ~bhuyan/)
  • PHONE: (951) 827-2347, E-mail: bhuyan@cs.ucr.edu
  • LECTURE TIME: TR 12:40pm-2pm
  • PLACE: HMNSS 1502
  • OFFICE HOURS: W 2:00-4:00 or by appointment

3
  • References
  • John Hennessy and David Patterson, Computer
    Architecture: A Quantitative Approach, Morgan
    Kaufmann Publishers.
  • Research papers, to be made available in class
  • COURSE OUTLINE
  • Introduction to Parallel Processing: Flynn's
    classification, SIMD and MIMD operations, Shared
    Memory vs. message passing multiprocessors,
    Distributed shared memory
  • Shared Memory Multiprocessors: SMP and CC-NUMA
    architectures, Cache coherence protocols,
    Consistency protocols, Data pre-fetching, CC-NUMA
    memory management, SGI 4700 multiprocessor, Chip
    Multiprocessors, Network Processors (IXP and
    Cavium)
  • Interconnection Networks: Static and dynamic
    networks, switching techniques, Internet
    techniques
  • Message Passing Architectures: Message passing
    paradigms, Grid architecture, Workstation
    clusters, User-level software
  • Multiprocessor Scheduling: Scheduling and
    mapping, Internet web servers, P2P, Content-aware
    load balancing
  • PREREQUISITE: CS 203A
  • GRADING
  • Project I - 20 points, Project II - 30 points,
    Test 1 - 20 points, Test 2 - 30 points

4
Possible Projects
  • Experiments with SGI Altix 4700 Supercomputer:
    Algorithm design and FPGA offloading
  • I/O Scheduling on SGI
  • Chip Multiprocessor (CMP): Design, analysis and
    simulation
  • P2P using PlanetLab
  • Note: 2 students/group. Expect submission of a
    paper to a conference.

5
Useful Web Addresses
  • http://www.sgi.com/products/servers/altix/4000/
    and http://www.sgi.com/products/rasc/
  • Wisconsin Computer Architecture Page - Simulators:
    http://www.cs.wisc.edu/arch/www/tools.html
  • SimpleScalar: www.simplescalar.com - look for
    multiprocessor extensions
  • NepSim: http://www.cs.ucr.edu/~yluo/nepsim/
  • Working in a cluster environment
  • Beowulf Cluster: www.beowulf.org
  • MPI: www-unix.mcs.anl.gov/mpi
  • Application Benchmarks
  • http://www-flash.stanford.edu/apps/SPLASH/

6
Parallel Computers
  • Definition: "A parallel computer is a collection
    of processing elements that cooperate and
    communicate to solve large problems fast."
  • - Almasi and Gottlieb, Highly Parallel Computing,
    1989
  • Questions about parallel computers:
  • How large a collection?
  • How powerful are processing elements?
  • How do they cooperate and communicate?
  • How are data transmitted?
  • What type of interconnection?
  • What are HW and SW primitives for programmer?
  • Does it translate into performance?

7
Parallel Processors Myth
  • The dream of computer architects since the 1950s:
    replicate processors to add performance vs.
    design a faster processor
  • Led to innovative organization tied to particular
    programming models, since uniprocessors can't
    keep going
  • e.g., uniprocessors must stop getting faster due
    to the limit of the speed of light. Has it happened?
  • Killer Micros! Parallelism moved to the instruction
    level. Microprocessor performance doubles every
    1.5 years!
  • In the 1990s companies went out of business: Thinking
    Machines, Kendall Square, ...

8
What Level of Parallelism?
  • Bit-level parallelism: 1970 to 1985
  • 4-bit, 8-bit, 16-bit, 32-bit microprocessors
  • Instruction-level parallelism (ILP): 1985
    through today
  • Pipelining
  • Superscalar
  • VLIW
  • Out-of-Order execution
  • Limits to benefits of ILP?
  • Process-level or thread-level parallelism:
    mainstream for general-purpose computing?
  • Servers are parallel
  • High-end desktop: dual-processor PC soon?? (or
    just sell the socket?)

9
Why Multiprocessors?
  • Microprocessors as the fastest CPUs
  • Collecting several is much easier than redesigning one
  • Complexity of current microprocessors
  • Do we have enough ideas to sustain 2X/1.5yr?
  • Can we deliver such complexity on schedule?
  • Slow (but steady) improvement in parallel
    software (scientific apps, databases, OS)
  • Emergence of embedded and server markets driving
    microprocessors in addition to desktops
  • Embedded: functional parallelism
  • Network processors exploiting packet-level
    parallelism
  • SMP servers and clusters of workstations for
    multiple users - less demand for parallel
    computing

10
Amdahl's Law and Parallel Computers
  • Amdahl's Law (f = original fraction sequential):
    Speedup = 1 / (f + (1-f)/n) = n / (1 + (n-1)f),
    where n = No. of processors
  • A portion f is sequential => limits parallel
    speedup
  • Speedup < 1/f
  • Ex. What fraction sequential to get 80X speedup
    from 100 processors? Assume either 1 processor or
    100 fully used
  • 80 = 1 / (f + (1-f)/100) => f = 0.0025
  • Only 0.25% sequential! => Must be a highly
    parallel program (see the sketch below)
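
A minimal C sketch (not part of the original slides) that evaluates the
speedup formula above for the 80X / 100-processor example; the function
name and the printed value are illustrative assumptions:

  #include <stdio.h>

  /* Amdahl's Law: speedup = 1 / (f + (1 - f)/n),
     where f is the sequential fraction and n the number of processors. */
  static double amdahl_speedup(double f, int n)
  {
      return 1.0 / (f + (1.0 - f) / (double)n);
  }

  int main(void)
  {
      /* The slide's example: f = 0.0025 sequential, n = 100 processors. */
      printf("speedup = %.1f\n", amdahl_speedup(0.0025, 100));  /* ~80.2 */
      return 0;
  }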

11
(No Transcript)
12
Popular Flynn Categories
  • SISD (Single Instruction Single Data)
  • Uniprocessors
  • MISD (Multiple Instruction Single Data)
  • ??? multiple processors on a single data stream
  • SIMD (Single Instruction Multiple Data)
  • Examples: Illiac-IV, CM-2
  • Simple programming model
  • Low overhead
  • Flexibility
  • All custom integrated circuits
  • (Phrase reused by Intel marketing for media
    instructions ~ vector)
  • MIMD (Multiple Instruction Multiple Data)
  • Examples: Sun Enterprise 5000, Cray T3D, SGI
    Origin
  • Flexible
  • Use off-the-shelf micros
  • MIMD current winner: Concentrate on major design
    emphasis - < 128 processor MIMD machines

13
Classification of Parallel Processors
  • SIMD - EX: Illiac IV and MasPar
  • MIMD - True Multiprocessors
  • 1. Message Passing Multiprocessor -
    Interprocessor communication through explicit
    message passing using send and receive
    operations.
  • EX: IBM SP2, Cray XD1, and Clusters
  • 2. Shared Memory Multiprocessor - All
    processors share the same address space.
    Interprocessor communication through load/store
    operations to a shared memory.
  • EX: SMP Servers, SGI Origin, HP
    V-Class, Cray T3E
  • Their advantages and disadvantages? (A
    message-passing sketch follows below.)
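
To make the send/receive style concrete, here is a hedged two-process
sketch in C using standard MPI point-to-point calls (the message value
and tag are made-up examples, not from the slides):

  #include <mpi.h>
  #include <stdio.h>

  /* Message passing multiprocessor model: rank 0 sends an integer to
     rank 1 through explicit send and receive operations. */
  int main(int argc, char **argv)
  {
      int rank, value;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          value = 42;
          MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* explicit send */
      } else if (rank == 1) {
          MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                   MPI_STATUS_IGNORE);                          /* explicit receive */
          printf("rank 1 received %d\n", value);
      }

      MPI_Finalize();
      return 0;
  }

Run with, e.g., "mpirun -np 2 ./a.out"; in the shared memory model the
same exchange would simply be a store by one processor and a load by
another.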

14
More Message Passing Computers
  • Cluster: Computers connected over a high-bandwidth
    local area network (Ethernet or Myrinet), used as
    a parallel computer
  • Network of Workstations (NOW): Homogeneous
    cluster - same type of computers
  • Grid: Computers connected over a wide area network

15
Another Classification for MIMD Computers
  • Centralized Memory: Shared memory located at a
    centralized location; may consist of several
    interleaved modules, the same distance from any
    processor - Symmetric Multiprocessor (SMP) or
    Uniform Memory Access (UMA)
  • Distributed Memory: Memory is distributed to each
    processor - improves scalability
  • (a) Message passing architectures: No
    processor can directly access another processor's
    memory
  • (b) Hardware Distributed Shared Memory (DSM)
    Multiprocessor: Memory is distributed, but the
    address space is shared - Non-Uniform Memory
    Access (NUMA)
  • (c) Software DSM: A level of OS built on
    top of a message passing multiprocessor to give a
    shared memory view to the programmer.

16
(No Transcript)
17
Data Parallel Model
  • Operations can be performed in parallel on each
    element of a large regular data structure, such
    as an array
  • 1 Control Processor (CP) broadcasts to many PEs.
    The CP reads an instruction from the control
    memory, decodes the instruction, and broadcasts
    control signals to all PEs.
  • Condition flag per PE so that it can skip
  • Data distributed in each memory
  • Early 1980s VLSI => SIMD rebirth: 32 1-bit PEs +
    memory on a chip was the PE
  • Data parallel programming languages lay out data
    to processors

18
Data Parallel Model
  • Vector processors have similar ISAs, but no data
    placement restriction
  • SIMD led to Data Parallel Programming languages
  • Advancing VLSI led to single chip FPUs and whole
    fast µProcs (SIMD less attractive)
  • SIMD programming model led to Single Program
    Multiple Data (SPMD) model
  • All processors execute identical program
  • Data parallel programming languages still useful,
    do communication all at once: "Bulk Synchronous"
    phases in which all communicate after a global
    barrier

19
SIMD Programming: High-Performance Fortran (HPF)
  • Single Program Multiple Data (SPMD) (see the
    C/MPI sketch after this list)
  • FORALL Construct, similar to Fork:
  • FORALL (I=1:N), A(I) = B(I) + C(I), END
    FORALL
  • Data Mapping in HPF
  • 1. To reduce interprocessor communication
  • 2. Load balancing among processors
  • http://www.npac.syr.edu/hpfa/
  • http://www.crpc.rice.edu/HPFF/
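
For contrast with the HPF FORALL above, a hedged SPMD sketch in C with
MPI, where every process runs the same program on its own block of the
arrays (the array length N, the block distribution, and the initial
values are assumptions for illustration):

  #include <mpi.h>
  #include <stdlib.h>

  #define N 1000   /* assumed global array length */

  int main(int argc, char **argv)
  {
      int rank, nprocs;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      int local_n = N / nprocs;                 /* assumes N divisible by nprocs */
      double *a = malloc(local_n * sizeof *a);
      double *b = malloc(local_n * sizeof *b);
      double *c = malloc(local_n * sizeof *c);
      for (int i = 0; i < local_n; i++) { b[i] = 1.0; c[i] = 2.0; }

      /* Local equivalent of FORALL (I=1:N) A(I) = B(I) + C(I) */
      for (int i = 0; i < local_n; i++)
          a[i] = b[i] + c[i];

      MPI_Barrier(MPI_COMM_WORLD);              /* bulk-synchronous global barrier */

      free(a); free(b); free(c);
      MPI_Finalize();
      return 0;
  }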

20
Major MIMD Styles
  • Centralized shared memory ("Uniform Memory
    Access" time or "Shared Memory Processor")
  • Decentralized memory (memory module with CPU)
  • Advantages: Scalability, get more memory
    bandwidth, lower local memory latency
  • Drawbacks: Longer remote communication latency,
    software model more complex
  • Two types: Shared Memory and Message Passing

21
Symmetric Multiprocessor (SMP)
  • Memory centralized with uniform memory access time
    (UMA) and bus interconnect
  • Examples: Sun Enterprise 5000, SGI Challenge,
    Intel SystemPro

22
Decentralized Memory versions
  • Shared Memory with "Non Uniform Memory Access"
    time (NUMA)
  • Message passing "multicomputer" with separate
    address space per processor
  • Can invoke software with Remote Procedure Call
    (RPC)
  • Often via a library, such as MPI: Message Passing
    Interface
  • Also called "synchronous communication" since
    communication causes synchronization between 2
    processes

23
Distributed Directory MPs
24
Communication Models
  • Shared Memory
  • Processors communicate with shared address space
  • Easy on small-scale machines
  • Advantages
  • Model of choice for uniprocessors, small-scale
    MPs
  • Ease of programming
  • Lower latency
  • Easier to use hardware controlled caching
  • Message passing
  • Processors have private memories, communicate
    via messages
  • Advantages
  • Less hardware, easier to design
  • Good scalability
  • Focuses attention on costly non-local operations
  • Virtual Shared Memory (VSM)

25
Shared Address/Memory Multiprocessor Model
  • Communicate via Load and Store
  • Oldest and most popular model
  • Based on timesharing processes on multiple
    processors vs. sharing single processor
  • process = a virtual address space and 1 thread
    of control
  • Multiple processes can overlap (share), but ALL
    threads share a process address space
  • Writes to shared address space by one thread are
    visible to reads of other threads
  • Usual model: share code, private stack, some
    shared heap, some private heap (see the pthreads
    sketch below)
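
A small POSIX threads sketch of this model (the variable and function
names are illustrative assumptions): all threads share the process
address space, so a store by one thread is visible to loads by another.

  #include <pthread.h>
  #include <stdio.h>

  int shared_flag = 0;   /* lives in the shared address space */
  pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

  /* Each thread gets its own private stack, but all see shared_flag. */
  static void *writer(void *arg)
  {
      (void)arg;
      pthread_mutex_lock(&lock);
      shared_flag = 1;              /* write by one thread ... */
      pthread_mutex_unlock(&lock);
      return NULL;
  }

  int main(void)
  {
      pthread_t t;
      pthread_create(&t, NULL, writer, NULL);
      pthread_join(t, NULL);

      printf("main sees shared_flag = %d\n", shared_flag);  /* ... visible here */
      return 0;
  }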
