1
CIS669 Distributed and Parallel Processing
  • Lecture 2: Parallel System Architectures and
    Performance Evaluation
  • Yuan Shi
  • Spring 2002

2
Parallel System Architectures
  • "Lacking the dignity of a proper discipline,
    it was an orphan in the world of knowledge. The
    subject became a rag-bag filled with odds and
    ends of knowledge and pseudo-knowledge, of
    Biblical dogmas, traveler's tales, and mythical
    imaginings." [Boo83, p. 100; textbook p. 15]

3
Where are the flies?
  • Computers can be built in many different ways for
    many different applications. Finding a common
    criterion for comparing architectures is VERY
    difficult.
  • Once built, computational performance (speed)
    varies greatly from application to application.
    It is equally difficult to find a common criterion
    for measuring the quality of a given
    architecture.
  • Finally, programming difficulty varies from
    architecture to architecture. Each parallel
    programming environment dictates a specific
    programming style that is typically more complex
    than the serial programming interface.

4
Most Recent Examples
  • Cilk (http://supertech.lcs.mit.edu/cilk/) MIT
  • Passages (http://www.cis.udel.edu/hiper/hiperspace/projects/gary.htm) Udel
  • EARTH (http://www.capsl.udel.edu/CURRENTPROJ/EARTH/) Udel

5
First Battle: Fine vs. Coarse Grain Parallelism
  • Fine grain pros:
  • Large degree of parallelism (do many things at
    one time)
  • Fine grain cons:
  • Large communication overhead
  • Difficult programming model
  • Less reliable

6
Coarse Grain Parallelism
  • Coarse grain pros:
  • Ease of programming
  • More reliable
  • Less communication overhead
  • Coarse grain cons:
  • Lower degree of parallelism (do fewer things at
    the same time)

7
Question: How to determine the best degree of
parallelism?
  • Timing models
  • A timing model is a method for calculating the
    performance of programs running on a given
    hardware architecture.
  • A timing model can also be used to calculate the
    scalability of the running programs and the
    architecture, because scalability is both hardware
    and application dependent.

8
Introduction to Timing Models
  • Time complexity: T(n) = O(f(n)) => the time to
    run a program on an n-sized input is bounded above
    by f(n). It really means that it will take no more
    than (a constant multiple of) f(n) steps to
    process n inputs.
  • Timing models:
  • Single processor: Ts(n) = f(n)/W => the time to
    run a program is approximately the estimated
    number of algorithmic steps f(n) divided by the
    single-processor power W (algorithmic steps per
    second).
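
As a quick illustration, here is a minimal sketch of the
single-processor estimate. The values f(n) = n^3 and W = 10^9
steps/second are assumptions for illustration only.

    # Ts(n) = f(n)/W with ASSUMED f(n) = n**3 and W = 1e9 steps/s.
    def Ts(n, W=1e9):
        return n**3 / W      # estimated sequential running time (seconds)

    print(Ts(1000))          # about 1 second for n = 1000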

9
Timing Model for Multiprocessors
  • T(n,p) = TCompute + TCommunication + TIO
           = f(n)/(pW) + g(n,p)/μ + k(n,p)/B
  • T(n,p): estimated running time for size n and p
    processors
  • g(n,p): estimated communication volume (bytes)
  • k(n,p): estimated I/O volume (bytes)
  • W: single-processor power in algorithmic
    steps/second
  • μ: interconnection network speed in bytes/second
  • B: I/O speed in bytes/second
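
A minimal sketch of this model in Python follows; the default
values for W, μ and B are illustrative assumptions, not calibrated
measurements.

    # Multiprocessor timing model:
    #   T(n,p) = f(n)/(p*W) + g(n,p)/mu + k(n,p)/B
    # ASSUMPTION: the default W, mu, B values are illustrative only.
    def T(n, p, f, g, k, W=1e9, mu=1e8, B=1e7):
        t_compute = f(n) / (p * W)   # steps spread over p processors
        t_comm    = g(n, p) / mu     # bytes moved over the interconnect
        t_io      = k(n, p) / B      # bytes moved to/from storage
        return t_compute + t_comm + t_io

    # Example: an O(n^3) compute load, 16n bytes per processor, no I/O.
    print(T(1000, 8,
            f=lambda n: n**3,
            g=lambda n, p: 16 * n * p,
            k=lambda n, p: 0))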

10
How to obtain values for W, μ and B?
  • Each parameter represents a RANGE of values.
  • Each parameter can be calibrated using
    computational experiments.
  • Ts(n) = f(n)/W can be used to derive W = f(n)/Ts(n).
  • Setting p = 2, T(n,p) can be used to derive
    μ = g(n,p)/(T(n,p) - f(n)/(pW)) after removing the
    I/O part (easily done).
  • Instrumenting the sequential source code can
    derive k(n,p) and B easily.
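
A minimal calibration sketch for W follows. The benchmark kernel
(a dense n x n NumPy matrix multiply, counted as f(n) = n^3
algorithmic steps) is an illustrative choice, not part of the
original slides.

    import time
    import numpy as np

    # Calibrate W from Ts(n) = f(n)/W  =>  W = f(n)/Ts(n).
    # ASSUMPTION: the kernel is a dense n x n matrix multiply, counted
    # as f(n) = n**3 algorithmic steps (one step per multiply-add).
    def calibrate_W(n=512):
        A = np.random.rand(n, n)
        B = np.random.rand(n, n)
        start = time.perf_counter()
        A @ B                     # run the sequential kernel once
        Ts = time.perf_counter() - start
        return n**3 / Ts          # algorithmic steps per second

    print("W =", calibrate_W())   # repeat runs to see the RANGE of values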

11
Practical Example
  • Matrix multiplication: A x B => C.
  • Assumptions:
  • Each matrix is of n x n elements.
  • Each element is double precision (8 bytes).
  • Timing model (using the O(n³) algorithm):
  • T(n,p) = n³/(pW) + g(n,p)/μ + k(n,p)/B
  • Ignoring I/O to simplify: T(n,p) = n³/(pW) + g(n,p)/μ
  • Observation: if p = n², the system could be VERY
    FAST since each dot product is computed on an
    independent processor in parallel with the others
    (TCompute = n³/(n²W) = n/W).
    Degree of parallelism: n² (fine grain).

12
Quantitative Arguments for Coarse Grain
Parallelism
  • What about g(n,p)? A dot product requires one row
    of A and one column of B => a minimum of 2 x 8 x n
    = 16n bytes per processor, or 16n³ bytes
    transmitted across the network in total.
  • Comparing g(n,p)/μ with f(n)/W: f(n)/W should
    be smaller (faster), since W (in GHz) is typically
    >> μ (in MBps).

[Diagram: processors P1 ... Px connected by an interconnection
network]
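
To make the comparison concrete, here is a quick sketch that plugs
illustrative values (W = 10^9 steps/s, μ = 10^7 bytes/s, both
assumptions) into the simplified model for the fine-grain case:

    # Fine-grain matrix multiply: p = n**2, one dot product per
    # processor, g = 16n bytes per processor = 16n**3 bytes in total.
    # ASSUMPTION: W and mu below are illustrative values only.
    n, W, mu = 1000, 1e9, 1e7

    t_seq          = n**3 / W            # one processor, no communication
    t_fine_compute = n**3 / (n**2 * W)   # = n/W, nearly free
    t_fine_comm    = 16 * n**3 / mu      # the communication term

    print(t_seq, t_fine_compute, t_fine_comm)
    # ~1 s, ~1e-6 s, ~1600 s: the communication term alone dwarfs the
    # sequential compute time, so fine grain loses despite p = n**2.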
13
Example II: Massively Parallel Potentials
  • Fractal calculation involves solving massively
    many equations in the complex plane in order to
    produce the color indices (the number of
    iterations before a value diverges outside of a
    pre-defined box; see
    http://aleph0.clarku.edu/~djoyce/julia/explorer.html)
    to make a striking-looking image.
  • Ref: http://www.cis.temple.edu/~shi.
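
A minimal sketch of the color-index computation for a Julia set
follows. The constant c, the iteration cap, and the grid size are
illustrative assumptions; each point is independent, which is what
makes the problem massively parallel.

    from multiprocessing import Pool

    # Color index of a point z under z -> z**2 + c: the number of
    # iterations before |z| escapes radius 2 (the "pre-defined box").
    # ASSUMPTION: c, max_iter, and the grid are illustrative choices.
    C = complex(-0.8, 0.156)

    def color_index(z, max_iter=256):
        for i in range(max_iter):
            if abs(z) > 2.0:
                return i
            z = z * z + C
        return max_iter

    def grid(n=200):
        step = 4.0 / n
        return [complex(-2 + x * step, -2 + y * step)
                for y in range(n) for x in range(n)]

    if __name__ == "__main__":
        # Every point is independent, so chunks of the grid can be
        # farmed out to processors with no inter-task communication.
        with Pool() as pool:
            indices = pool.map(color_index, grid(), chunksize=1000)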

14
Conclusions
  • We need to calculate the proper degree of
    parallelism BEFORE implementing a software/hardware
    solution.
  • Hardware technologies are advancing rapidly, so we
    need a generic architecture platform that will
    leverage hardware advances without sacrificing
    programmability.

15
Idea: Stateless Parallel Processors (SPP)
16
A Few Finer Points
  • The ring must be slotted and unidirectional. This
    allows multiple stations to transmit at the same
    time (see the sketch below).
  • The ring must be redundant in order to prevent
    breakage by the loss of a single processor.
  • The result: the revised SPP architecture on the
    next slide.
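
A toy simulation sketch of why a slotted, unidirectional ring
permits simultaneous transmissions follows. The station count, slot
count, and traffic pattern are all assumptions for illustration.

    # Toy slotted ring: N_SLOTS slots circulate past N_STATIONS
    # stations; a station with a pending frame seizes any empty slot
    # passing its position, so several stations transmit at once.
    # ASSUMPTION: sizes and the pending traffic are illustrative.
    N_STATIONS, N_SLOTS = 8, 8
    slots   = [None] * N_SLOTS          # None = empty, else sender id
    pending = {0: 2, 3: 1, 5: 3}        # frames waiting at stations

    for tick in range(N_SLOTS):
        for k in range(N_SLOTS):
            # Slot k passes station (k + tick) % N_STATIONS this tick.
            station = (k + tick) % N_STATIONS
            if slots[k] is None and pending.get(station, 0) > 0:
                slots[k] = station      # station fills the empty slot
                pending[station] -= 1
        busy = sum(s is not None for s in slots)
        print(f"tick {tick}: {busy} simultaneous transmissions")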

17
Revised SPP Architecture