PARALLEL PROCESSOR ORGANIZATIONS

Transcript and Presenter's Notes


1
PARALLEL PROCESSOR ORGANIZATIONS
  • Jehan-François Pâris
  • jfparis_at_uh.edu

2
Chapter Organization
  • Overview
  • Writing parallel programs
  • Multiprocessor Organizations
  • Hardware multithreading
  • Alphabet soup (SISD, SIMD, MIMD, …)
  • Roofline performance model

3
OVERVIEW
4
The hardware side
  • Many parallel processing solutions
  • Multiprocessor architectures
  • Two or more microprocessor chips
  • Multiple architectures
  • Multicore architectures
  • Several processors on a single chip

5
The software side
  • Two ways for software to exploit parallel
    processing capabilities of hardware
  • Job-level parallelism
  • Several sequential processes run in parallel
  • Easy to implement (OS does the job!)
  • Process-level parallelism
  • A single program runs on several processors at
    the same time
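
As an illustration of process-level parallelism, here is a minimal sketch using
OpenMP (the slides do not name a specific API, so this choice is an assumption):
a single program whose loop is spread across several processors.

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    /* compile with OpenMP enabled, e.g. cc -fopenmp */
    int main(void) {
        static double a[N];
        double sum = 0.0;
        for (int i = 0; i < N; i++) a[i] = 1.0;   /* fill the array */

        /* One program, several processors: each thread adds its share */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %f\n", sum);
        return 0;
    }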

6
WRITING PARALLEL PROGRAMS
7
Overview
  • Some problems are embarrassingly parallel
  • Many computer graphics tasks
  • Brute force searches in cryptography or password
    guessing
  • Much more difficult for other applications
  • Communication overhead among sub-tasks
  • Amdahl's law
  • Balancing the load

8
Amdahl's Law
  • Assume a sequential process takes
  • tp seconds to perform operations that could be
    performed in parallel
  • ts seconds to perform purely sequential
    operations
  • The maximum speedup will be
  • (tp + ts) / ts
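
A minimal sketch of this bound (the function and variable names are assumptions,
not from the slides): with n processors the parallel part shrinks to tp/n, so the
speedup is (tp + ts) / (tp/n + ts); letting n grow without bound gives the
ceiling (tp + ts) / ts.

    #include <stdio.h>

    /* Amdahl's law: speedup with n processors, given the parallelizable
       time tp and the purely sequential time ts of the original run */
    double speedup(double tp, double ts, int n) {
        return (tp + ts) / (tp / n + ts);
    }

    int main(void) {
        double tp = 90.0, ts = 10.0;              /* example timings, in seconds */
        printf("n = 16: %.2f\n", speedup(tp, ts, 16));
        printf("limit:  %.2f\n", (tp + ts) / ts); /* maximum possible speedup */
        return 0;
    }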

9
Balancing the load
  • Must ensure that workload is equally divided
    among all the processors
  • Worst case is when one of the processors does
    much more work than all others

10
Example (I)
  • Computation partitioned among n processors
  • One of them does 1/m of the work, with m < n
  • That processor becomes a bottleneck
  • Maximum expected speedup: n
  • Actual maximum speedup: m

11
Example (II)
  • Computation partitioned among 64 processors
  • One of them does 1/8 of the work
  • Maximum expected speedup: 64
  • Actual maximum speedup: 8
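
A quick check of this bound in code (the normalized timing variables are
assumptions used only for illustration): the bottleneck processor alone needs
1/8 of the original time, so the speedup cannot exceed 8 no matter how many
processors are used.

    #include <stdio.h>

    int main(void) {
        double T = 1.0;                               /* original sequential time (normalized) */
        int n = 64;                                   /* processors */
        double bottleneck = T / 8;                    /* one processor does 1/8 of the work */
        double others = (T - bottleneck) / (n - 1);   /* rest spread over the other 63 */
        double parallel_time = bottleneck > others ? bottleneck : others;
        printf("speedup = %.1f\n", T / parallel_time);   /* prints 8.0 */
        return 0;
    }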

12
A last issue
  • Humans like to address issues one after the
    other
  • We have meeting agendas
  • We do not like to be interrupted
  • We write sequential programs

13
René Descartes
  • Seventeenth-century French philosopher
  • Invented
  • Cartesian coordinates
  • Methodical doubt
  • Never to accept anything for true which I
    did not clearly know to be such
  • Proposed a scientific method based on four
    precepts

14
Method's third rule
  • The third, to conduct my thoughts in such order
    that, by commencing with objects the simplest and
    easiest to know, I might ascend by little and
    little, and, as it were, step by step, to the
    knowledge of the more complex; assigning in
    thought a certain order even to those objects
    which in their own nature do not stand in a
    relation of antecedence and sequence.

15
MULTIPROCESSOR ORGANIZATIONS
16
Shared memory multiprocessors

[Diagram: several processors connected through an interconnection network to shared RAM and I/O]
17
Shared memory multiprocessor
  • Can offer
  • Uniform memory access to all processors (UMA)
  • Easiest to program
  • Non-uniform memory access to all
    processors (NUMA)
  • Can scale up to larger sizes
  • Offers faster access to nearby memory

18
Computer clusters

[Diagram: several complete computers linked by an interconnection network]
19
Computer clusters
  • Very easy to assemble
  • Can take advantage of high-speed LANs
  • Gigabit Ethernet, Myrinet, …
  • Data exchanges must be done through message
    passing

20
Message passing (I)
  • If processor P wants to access data in the main
    memory of processor Q, it must
  • Send a request to Q
  • Wait for a reply
  • For this to work, processor Q must have a thread
  • Waiting for messages from other processors
  • Sending them replies
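
A minimal sketch of this request/reply exchange, assuming MPI as the
message-passing library (the slides do not name one); the ranks, tag, and data
array are illustrative only.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, index;
        double data[100], reply;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (int i = 0; i < 100; i++) data[i] = (double)i;   /* Q's local memory (illustrative) */

        if (rank == 0) {                 /* processor P: wants data[42] held by Q */
            index = 42;
            MPI_Send(&index, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);           /* send request to Q */
            MPI_Recv(&reply, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                                   /* wait for the reply */
            printf("P got %f\n", reply);
        } else if (rank == 1) {          /* processor Q: thread waiting for requests */
            MPI_Recv(&index, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&data[index], 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);  /* send the reply */
        }
        MPI_Finalize();
        return 0;
    }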

21
Message passing (II)
  • In a shared memory architecture, each processor
    can directly access all data
  • A proposed solution
  • Distributed shared memory offers to the users of
    a cluster the illusion of a single address space
    for their shared data
  • Still has performance issues

22
When things do not add up
  • Memory capacity is very important for big
    computing applications
  • If the data can fit into main memory, the
    computation will run much faster

23
A problem
  • A company replaced
  • A single shared-memory computer with 32 GB of RAM
  • With four clustered computers with 8 GB each
  • The result: more I/O than ever
  • What happened?

24
The explanation
  • Assume the OS occupies one GB of RAM
  • The old shared-memory computer still had 31 GB of
    free RAM
  • Each of the clustered computers has 7 GB of free
    RAM
  • The total RAM available to the program went down
    from 31 GB to 4 × 7 = 28 GB!

25
Grid computing
  • The computers are distributed over a very large
    network
  • Sometimes computer time is donated
  • Volunteer computing
  • SETI@home
  • Works well with embarrassingly parallel workloads
  • Searches in an n-dimensional space

26
HARDWARE MULTITHREADING
27
General idea
  • Let the processor switch to another thread of
    computation while the current one is stalled
  • Motivation
  • Increased cost of cache misses

28
Implementation
  • Entirely controlled by the hardware
  • Unlike multiprogramming
  • Requires a processor capable of
  • Keeping track of the state of each thread
  • One set of registers, including the PC, for each
    concurrent thread
  • Quickly switching among concurrent threads

29
Approaches
  • Fine-grained multithreading
  • Switches between threads for each instruction
  • Provides the highest throughput
  • Slows down execution of individual threads

30
Approaches
  • Coarse-grained multithreading
  • Switches between threads whenever a long stall is
    detected
  • Easier to implement
  • Cannot eliminate all stalls

31
Approaches
  • Simultaneous multithreading
  • Takes advantage of the ability of modern
    hardware to execute instructions from different
    threads in parallel
  • Best solution

32
ALPHABET SOUP
33
Overview
  • Used to describe processor organizations where
  • The same instructions can be applied to
  • Multiple data instances
  • Encountered in
  • Vector processors in the past
  • Graphic processing units (GPU)
  • x86 multimedia extension

34
Classification
  • SISD
  • Single instruction, single data
  • Conventional uniprocessor architecture
  • MIMD
  • Multiple instructions, multiple data
  • Conventional multiprocessor architecture

35
Classification
  • SIMD
  • Single instruction, multiple data
  • Perform same operations on a set of similar data
  • Think of adding two vectors
  • for (i = 0; i < VECSIZE; i++)
        sum[i] = a[i] + b[i];
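
As a hedged illustration of how such a loop maps onto SIMD hardware, the sketch
below uses x86 SSE intrinsics, one concrete multimedia extension (the array
names and VECSIZE are assumptions carried over from the loop above):

    #include <xmmintrin.h>               /* x86 SSE intrinsics */

    #define VECSIZE 1024

    void vec_add(float *sum, const float *a, const float *b) {
        /* One SIMD instruction adds four floats at a time */
        for (int i = 0; i < VECSIZE; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);
            __m128 vb = _mm_loadu_ps(&b[i]);
            _mm_storeu_ps(&sum[i], _mm_add_ps(va, vb));
        }
    }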

36
Vector computing
  • A kind of SIMD architecture
  • Used by Cray computers
  • Pipelines multiple executions of a single
    instruction with different data (vectors)
    through the ALU
  • Requires
  • Vector registers able to store multiple values
  • Special vector instructions, e.g., lv, addv, …

37
Benchmarking
  • Two factors to consider
  • Memory bandwidth
  • Depends on interconnection network
  • Floating-point performance
  • The best-known benchmark is LINPACK

38
Roofline model
  • Takes into account
  • Memory bandwidth
  • Floating-point performance
  • Introduces arithmetic intensity
  • Total number of floating point operations in a
    program divided by total number of bytes
    transferred to main memory
  • Measured in FLOPs per byte

39
Roofline model
  • Attainable GFLOP/s = Min(Peak Memory BW × Arithmetic
    Intensity, Peak Floating-Point Performance)
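
A minimal sketch of this formula (the peak bandwidth and peak floating-point
numbers below are made-up placeholders, not measurements):

    #include <stdio.h>

    /* Roofline bound: the lesser of the memory roof and the compute roof */
    double attainable_gflops(double peak_bw_gbs, double peak_gflops,
                             double arithmetic_intensity) {
        double memory_roof = peak_bw_gbs * arithmetic_intensity;
        return memory_roof < peak_gflops ? memory_roof : peak_gflops;
    }

    int main(void) {
        double bw = 25.0, peak = 100.0;            /* placeholder GB/s and GFLOP/s */
        for (double ai = 0.5; ai <= 8.0; ai *= 2)  /* arithmetic intensity, FLOPs per byte */
            printf("AI %.1f -> %.1f GFLOP/s\n", ai, attainable_gflops(bw, peak, ai));
        return 0;
    }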

40
Roofline model
[Figure: roofline plot. The horizontal roof is the peak floating-point performance; to its left, floating-point performance is limited by memory bandwidth.]