Title: Tuesday, September 12, 2006
- "Nothing is impossible for people who don't have to do it themselves." -- Weiler
Shared Memory
- Adding more CPUs increases traffic on the shared memory-CPU path.
- Cache coherent systems increase traffic associated with cache/memory management.
Today
- Classification of parallel computers.
- Programming Models
von Neumann Architecture
- A common machine model known as the von Neumann computer.
- Uses the stored-program concept: the CPU executes a stored program that specifies a sequence of read and write operations on the memory.
How to classify parallel computers?
- Flynn's Classical Taxonomy (1966)
  - SISD: Single Instruction, Single Data
  - SIMD: Single Instruction, Multiple Data
  - MISD: Multiple Instruction, Single Data
  - MIMD: Multiple Instruction, Multiple Data
Single Instruction, Single Data (SISD)
- A serial (non-parallel) computer.
- Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle.
- Single data: only one data stream is being used as input during any one clock cycle.
- This is the oldest and, until recently, the most prevalent form of computer.
- Examples: most PCs, single-CPU workstations and mainframes.
Single Instruction, Multiple Data (SIMD)
- A type of parallel computer.
- This type of machine typically has an instruction dispatcher, a very high-bandwidth internal network, and a very large array of very small-capacity instruction units.
- Applies the same operation to different data values.
- Picture analogy.
Single Instruction, Multiple Data (SIMD)
- Single instruction: all processing units execute the same instruction at any given clock cycle.
- Multiple data: each processing unit can operate on a different data element.
- Best suited for specialized problems characterized by a high degree of regularity, such as image processing.
Single Instruction, Multiple Data (SIMD)
- SIMD systems have fallen out of favor as general-purpose computers.
- Still important in fields like signal processing.
- Examples:
  - Thinking Machines Corporation Connection Machine (CM-1 and CM-2), MasPar.
- The SIMD approach is used in some processors for special operations (see the sketch below).
  - Intel Pentium processors with MMX include a small set of SIMD-style instructions designed for use in graphics transformations that involve matrix-vector multiplication.
  - AMD K6/Athlon processors: 3DNow!
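
As a concrete illustration, here is a minimal C sketch using x86 SSE intrinsics (a later descendant of the MMX-style instruction sets named above); the four-lane float addition is an assumption chosen for brevity, not a reconstruction of any particular MMX routine.

    #include <stdio.h>
    #include <xmmintrin.h>  /* x86 SSE intrinsics */

    int main(void) {
        float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
        float c[4];

        /* A single SIMD instruction adds all four float lanes at once:
           same operation, different data values. */
        __m128 va = _mm_loadu_ps(a);
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(va, vb);
        _mm_storeu_ps(c, vc);

        for (int i = 0; i < 4; i++)
            printf("c[%d] = %f\n", i, c[i]);
        return 0;
    }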
Multiple Instruction, Single Data (MISD)
- A single data stream is fed into multiple processing units.
- Each processing unit operates on the data independently via independent instruction streams.
- Does not exist in practice.
- Analogy.
Multiple Instruction, Single Data (MISD)
- Some conceivable uses might be:
  - Multiple frequency filters operating on a single signal stream.
  - Multiple cryptography algorithms attempting to crack a single coded message.
Multiple Instruction, Multiple Data (MIMD)
- Currently, the most common type of parallel computer.
- Most modern computers fall into this category.
Multiple Instruction, Multiple Data (MIMD)
- Multiple instruction: every processor may be executing a different instruction stream.
- Multiple data: every processor may be working with a different data stream.
- Examples:
  - Most current supercomputers, networked parallel computers and multi-processor SMP computers.
Single Program Multiple Data (SPMD)
- SPMD is a subset of MIMD.
- A simplification for software: one program runs everywhere, and each copy selects its own work.
- Most parallel programs in technical and scientific computing are SPMD (see the sketch below).
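
A minimal SPMD sketch in C, assuming MPI (which these slides introduce later) as the vehicle: every process executes the same program, and behavior branches on the process rank. The coordinator/worker split is illustrative only.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Same program on every process; the rank picks the role. */
        if (rank == 0)
            printf("coordinator among %d processes\n", size);
        else
            printf("worker %d handling its own partition\n", rank);

        MPI_Finalize();
        return 0;
    }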
Other approaches to parallelism
- Vector computing (parallelism?)
- Operations are performed on vectors, often groups of 64 floating-point numbers.
- A single instruction may cause 64 results to be computed using vectors stored in vector registers.
- Memory bandwidth is an order of magnitude greater than in non-vector computers (see the loop sketch below).
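
A plain-C sketch of what a vector instruction accomplishes: the loop is strip-mined into 64-element chunks, and on a vector machine each inner chunk would compile to a single vector add. The 64-element register length comes from the slide; the function name is illustrative.

    #define VLEN 64  /* vector register length, per the slide */

    /* On a vector machine, each inner chunk below would be a single
       vector instruction producing up to 64 results at once. */
    void vadd(const double *a, const double *b, double *c, int n) {
        for (int i = 0; i < n; i += VLEN) {
            int limit = (i + VLEN < n) ? i + VLEN : n;
            for (int j = i; j < limit; j++)
                c[j] = a[j] + b[j];
        }
    }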
Parallel Programming Models
- Abstraction above hardware and memory architectures.
- Inspired by parallel architecture.
Shared Memory Model
- Tasks share a common address space, which they read and write asynchronously.
- Various mechanisms such as locks / semaphores may be used to control access to the shared memory (see the sketch below).
- An advantage of this model from the programmer's point of view is that the notion of data "ownership" is lacking, so there is no need to specify explicitly the communication of data between tasks.
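
A minimal sketch of the locks / semaphores idea, assuming POSIX threads and semaphores as the concrete mechanism; the counter and iteration count are illustrative.

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    long shared_counter = 0;   /* lives in the common address space */
    sem_t lock;                /* binary semaphore guarding the counter */

    void *work(void *arg) {
        for (int i = 0; i < 100000; i++) {
            sem_wait(&lock);   /* acquire before touching shared data */
            shared_counter++;
            sem_post(&lock);   /* release */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        sem_init(&lock, 0, 1); /* initial value 1: one holder at a time */
        pthread_create(&t1, NULL, work, NULL);
        pthread_create(&t2, NULL, work, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", shared_counter);  /* always 200000 */
        return 0;
    }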
Threads Model
- A single process can have multiple, concurrent execution paths.
Threads Model
- Each thread has local data but also shares the entire resources of the program.
- Threads communicate with each other through global memory (updating address locations).
- This requires synchronization constructs to ensure that no two threads update the same global address at the same time.
- Threads are commonly associated with shared memory architectures.
Threads Model
- Vendor proprietary versions
- Problem? Portability.
- Standardization efforts.
- From a programming point of view, threads implementations comprise:
  - A library of subroutines (POSIX threads)
    - Very explicit parallelism.
    - Requires significant programmer attention to detail.
    - APIs such as Pthreads are considered low-level primitives (see the sketch below).
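
A minimal Pthreads sketch illustrating the explicit, subroutine-level style: the programmer creates and joins every thread by hand. The worker function and array sizes are illustrative.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    double results[NTHREADS];      /* global memory shared by all threads */

    void *worker(void *arg) {
        int id = *(int *)arg;      /* local data: this thread's own id */
        results[id] = id * id;     /* disjoint slots, so no lock needed */
        return NULL;
    }

    int main(void) {
        pthread_t threads[NTHREADS];
        int ids[NTHREADS];

        /* Explicit parallelism: every create and join is spelled out. */
        for (int i = 0; i < NTHREADS; i++) {
            ids[i] = i;
            pthread_create(&threads[i], NULL, worker, &ids[i]);
        }
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(threads[i], NULL);

        for (int i = 0; i < NTHREADS; i++)
            printf("results[%d] = %f\n", i, results[i]);
        return 0;
    }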
- From a programming point of view, threads implementations comprise:
  - A set of compiler directives (OpenMP)
    - A higher-level construct that relieves the programmer from manipulating threads (see the sketch below).
    - Used with Fortran, C, and C++ for programming shared address space machines.
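
The same flavor of loop as a hedged OpenMP sketch: a single directive asks the compiler to parallelize, and no thread creation or joining appears in the source. Assumes an OpenMP-aware compiler (a -fopenmp style flag); the array and loop bound are illustrative.

    #include <stdio.h>

    int main(void) {
        static double a[1000];

        /* The directive does the thread management; contrast with the
           explicit Pthreads version above. */
        #pragma omp parallel for
        for (int i = 0; i < 1000; i++)
            a[i] = i * 2.0;

        printf("a[999] = %f\n", a[999]);
        return 0;
    }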
- In both cases, the programmer is responsible for determining all parallelism.
Message Passing Model
- A set of tasks that use their own local memory during computation.
- Multiple tasks can reside on the same physical machine as well as across an arbitrary number of machines.
- Tasks exchange data through communications, by sending and receiving messages.
Message Passing Model
- Data transfer usually requires cooperative operations to be performed by each process.
- A send operation must have a matching receive operation (see the sketch below).
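
A minimal MPI sketch of the matching requirement, assuming at least two processes (launched with mpirun -np 2 or similar); the message tag and payload are illustrative.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            /* The send on rank 0 ... */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* ... must be matched by a receive on rank 1. */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }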
Message Passing Model
- From a programming point of view:
  - Message passing implementations commonly comprise a library of subroutines.
  - The programmer is responsible for determining all parallelism.
Message Passing Model
- A variety of message passing libraries have been available since the 1980s.
- Problem? Portability.
- In 1992, the MPI Forum was formed with the primary goal of establishing a standard interface for message passing implementations.
- MPI is now the "de facto" industry standard for message passing.
- For shared memory architectures, MPI implementations usually don't use a network for task communications.
Data Parallel Model
- Most of the parallel work focuses on performing operations on a data set.
- A set of tasks works collectively on the same data structure; however, each task works on a different partition of that data structure.
- Tasks perform the same operation on their partition of the work, for example, "add 4 to every array element" (see the sketch below).
Data Parallel Model
- Maps naturally onto SIMD machines.
- On shared memory architectures, all tasks may have access to the data structure through global memory.
- On distributed memory architectures, the data structure is split up and resides as "chunks" in the local memory of each task.
Data Parallel Model
- Programming with the data parallel model is usually accomplished by writing a program with data parallel constructs.
- Compiler directives: allow the programmer to specify the distribution and alignment of data. Fortran implementations are available for most common parallel platforms.
- Distributed memory implementations of this model usually have the compiler convert the program into standard code with calls to a message passing library (usually MPI) to distribute the data to all the processes. All message passing is done invisibly to the programmer.
- High Performance Fortran (HPF): extensions to Fortran 90 to support data parallel programming.
Hybrid Model
- Environment of networked SMP machines.
- Combination of the message passing model (MPI) with either the threads model (POSIX threads) or the shared memory model (OpenMP); see the sketch below.
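
A minimal hybrid sketch, assuming MPI across SMP nodes and OpenMP within each node; MPI_Init_thread requests thread support, and the printed payload is illustrative.

    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv) {
        int rank, provided;
        /* Message passing between SMP nodes ... */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* ... and shared memory threads within each node. */
        #pragma omp parallel
        printf("process %d, thread %d\n", rank, omp_get_thread_num());

        MPI_Finalize();
        return 0;
    }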
- Historically, architectures were often tied to programming models.
Message passing model on a shared memory machine
- MPI on the SGI Origin.
- The SGI Origin employed the CC-NUMA type of shared memory architecture, where every task has direct access to global memory.
- Ability to send and receive messages with MPI.
Shared memory model on a distributed memory machine
- Kendall Square Research (KSR) ALLCACHE approach.
- Machine memory was physically distributed, but appeared to the user as a single shared memory (global address space).
- This approach is referred to as "virtual shared memory".
- The KSR approach is no longer used; no common distributed memory platform implementations currently exist.
- There certainly are better implementations of some models than of others.
Networks for Connecting Parallel Systems
- Simple buses, 2-D and 3-D meshes, hypercube network topologies.
- In the past, understanding the details of topologies was important for programmers.
CPU Parallelism
- The amount of parallelism achievable by superscalar processors is limited by instruction look-ahead.
- Hardware logic for dependency analysis accounts for 5-10% of the total logic on conventional microprocessors.
CPU Parallelism
- Explicitly parallel instructions.
  - Each instruction contains explicit sub-instructions for each of the different functional units in the CPU.
  - Very long instruction word (VLIW) ISAs.
  - Relies on compilers to resolve dependencies: instructions that can be executed concurrently are packed into groups and sent to the processor as a single long instruction word.
  - Example: Intel Itanium