Extending the Unified Parallel Processing Speedup Model - PowerPoint PPT Presentation

Title: RISC Processor Architecture Author: asdf Last modified by: College of Science and Mathem Created Date: 4/14/2000 2:12:41 PM Document presentation format – PowerPoint PPT presentation

1
Extending the Unified Parallel Processing Speedup
Model
  • Computer architectures take advantage of
    low-level parallelism: multiple pipelines.
  • The next generations of integrated circuits will
    continue to support increasing numbers of
    transistors.
  • How to make efficient use of the additional
    transistors?
  • Answer: parallelism beyond multiple pipelines,
    adding multiple processors or processing
    components in a single chip or single package.
  • The performance of each level of parallelism
    suffers from the law of diminishing returns
    outlined by Amdahl.
  • Incorporating multiple levels of parallelism
    results in higher overall performance and
    efficiency.

2
Presentation Content
  • A discussion of practical and theoretical
    alternative methods of parallel speedup, and the
    efficient use of hardware/processing resources in
    capturing speedup.
  • Parallel Speedup/Amdahl's Law, Scaled Speedup
  • Pipelined Processors
  • Multiprocessors and Multicomputers
  • Multiple concurrent threads
  • Multiple concurrent processes
  • Multiple levels of parallelism with integrated
    chips/packages that combine microcontrollers with
    Digital Signal Processing chips

3
Presentation Summary
  • Architects/Chip-Manufacturers are integrating
    additional levels of parallelism.
  • Multiple levels of speedup achieve higher
    speedups and greater efficiencies than increasing
    hardware at a single parallel level.
  • A balanced approach would achieve about the same
    efficiency, measured as parallel speedup
    delivered per cost of hardware resources
    allocated, at each level of parallelism.
  • Numerous architectural approaches are possible,
    each with different trade-offs and performance
    returns.
  • Current technology is integrating DSP processing
    with microcontroller functionality - achieving up
    to three levels of parallelism.

4
Classic Model of Parallel Processing
  • Multiple Processors available (4)
  • A Process can be divided into serial and parallel
    portions
  • The parallel parts are executed concurrently
  • Serial time: 10 time units
  • Parallel time: 4 time units

[Figure: an example parallel process of time 10, executed on a single processor and executed in parallel on 4 processors]
S - Serial or non-parallel portion
A - All A parts can be executed concurrently
B - All B parts can be executed concurrently
All A parts must be completed prior to executing the B parts.
5
Amdahl's Law (Analytical Model)
  • Analytical model of parallel speedup from the 1960s
  • Parallel fraction (α) is run over n processors,
    taking α/n time
  • The part that must be executed in serial (1 - α)
    gets no speedup
  • Overall performance is limited by the fraction of
    the work that cannot be done in parallel (1 - α)
  • Diminishing returns with increasing processors (n)
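
The bullets above describe the standard form of Amdahl's Law: speedup = 1 / ((1 - parallel fraction) + parallel fraction / n). A minimal Python sketch follows. The function and parameter names are illustrative, not taken from the slides. A parallel fraction of 0.8 is chosen because, under this model, it reproduces the earlier example of 10 time units reduced to 4 on 4 processors.

```python
def amdahl_speedup(parallel_fraction: float, n: int) -> float:
    """Speedup of one workload on n processors under Amdahl's Law."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part gets no speedup; the parallel part is divided over n.
    return 1.0 / (serial_fraction + parallel_fraction / n)


if __name__ == "__main__":
    # Efficiency (speedup per processor) falls as n grows: diminishing returns.
    for n in (1, 2, 4, 8, 16, 64):
        s = amdahl_speedup(0.8, n)
        print(f"n={n:3d}  speedup={s:5.2f}  efficiency={s / n:4.2f}")
```

With a parallel fraction of 0.8 the speedup can never reach 5, however many processors are added, because the serial 20% of the work remains.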

6
Pipelined Processing
  • Single Processor enhanced with discrete stages
  • Instructions flow through pipeline stages
  • Parallel Speedup with multiple instructions being
    executed (by parts) simultaneously
  • Realized speedup is partly determined by the
    number of stages: 5 stages, at most 5 times faster

[Figure: pipeline timing diagram, cycles 1 to 5]
F - Instruction Fetch
D - Instruction Decode
OF - Operand Fetch
EX - Execute
WB - Write Back or Result Store
The processor clock cycle is divided into sub-cycles;
each stage takes one sub-cycle.
7
Pipeline Performance
  • Speedup is serial time (nS) over parallel time
  • Performance is limited by the number of pipeline
    flushes (n) due to jumps
  • Speculative execution and branch prediction can
    minimize pipeline flushes
  • Performance is also reduced by pipeline stalls
    (s), due to conflicts with bus access, data not
    ready delays, and other sources
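
The slide's own pipeline equation did not survive in this transcript, so the following Python sketch uses a common textbook-style model instead. It is an assumption, not the presenter's formula: serial time is instructions × stages, the pipelined time is one pipeline fill plus one sub-cycle per remaining instruction, each flush costs a refill of (stages - 1) sub-cycles, and each stall costs one sub-cycle. All names are hypothetical.

```python
def pipeline_speedup(instructions: int, stages: int,
                     flushes: int = 0, stalls: int = 0) -> float:
    """Speedup of a pipelined processor over unpipelined execution.

    Assumed cost model (not from the slides):
      - unpipelined: every instruction takes `stages` sub-cycles
      - pipelined:   `stages` sub-cycles to fill, then one per instruction
      - each flush refills the pipe: (stages - 1) extra sub-cycles
      - each stall inserts one extra sub-cycle
    """
    if instructions < 1 or stages < 1:
        raise ValueError("instructions and stages must be at least 1")
    serial_time = instructions * stages
    pipelined_time = (stages + (instructions - 1)
                      + flushes * (stages - 1) + stalls)
    return serial_time / pipelined_time


if __name__ == "__main__":
    print(pipeline_speedup(1000, 5))               # close to, but below, 5
    print(pipeline_speedup(1000, 5, flushes=100))  # flushes reduce the speedup
    print(pipeline_speedup(1000, 5, flushes=100, stalls=200))
```

Under this model the speedup approaches the number of stages only for long instruction streams with no flushes or stalls, which matches the "at most 5 times faster" bound for a 5-stage pipeline.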

8
Super-Scalar Multiple Pipelines
  • Concurrent Execution of Multiple sets of
    instructions
  • Example: simultaneous execution of instructions
    through an integer pipeline while processing
    instructions through a floating point pipeline
  • Compiler identifies and specifies separate
    sets of instructions for concurrent execution
    through different pipes

9
Algorithm/Thread Level Parallelism
  • Example: algorithms to compute the Fast Fourier
    Transform (FFT) used in Digital Signal Processing
    (DSP)
  • Many separate computations in parallel (High
    Degree Of Parallelism)
  • Large exchange of data - much communication
    between processors
  • Fine-Grained Parallelism
  • Communication time (latency) may be a
    consideration if multiple processors are combined
    on a board or motherboard
  • Large communication load (fine-grained
    parallelism) can force the algorithm to become
    bandwidth-bound rather than computation-bound.

10
Simple Algorithm/Thread Parallelism Model
[Figure: two programs, P1 and P2, executing concurrently]
  • Parallel threads of execution
  • Each could be a separate process
  • Each could be a thread of a multi-threaded process
  • Each thread of execution obeys Amdahl's parallel
    speedup model
  • Multiple concurrently executing processes result
    in multiple serial components executing
    concurrently: another level of parallelism

Observe that the serial parts of Program 1 and
Program 2 are now running in parallel with each
other. Each program would take 6 time units on a
uniprocessor, or a total workload serial time of
12. Each has a speedup of 1.5. The total speedup
is 12/4 = 3, which is also the sum of the program
speedups.
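
The arithmetic in this example can be checked with a short Python sketch. It uses only the times stated above (6 units per program on a uniprocessor, 4 units when run in parallel); the function name is illustrative.

```python
def workload_speedup(uniprocessor_times, concurrent_times):
    """Total serial work divided by the time the last program finishes."""
    return sum(uniprocessor_times) / max(concurrent_times)


uniprocessor_times = [6, 6]  # each program alone on a uniprocessor
concurrent_times = [4, 4]    # each program's finish time when run in parallel

per_program = [u / c for u, c in zip(uniprocessor_times, concurrent_times)]
total = workload_speedup(uniprocessor_times, concurrent_times)

print(per_program)  # individual speedups
print(total)        # workload speedup
```

The total equals the sum of the individual speedups here because both programs finish at the same time. If one program finished later than the other, the total would be governed by the slower one and the simple sum would overstate it.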
11
Multiprocess Speedup
  • Concurrent Execution of Multiple Processes
  • Each process is limited by Amdahl's parallel
    speedup
  • Multiple concurrently executing processes result
    in multiple serial components executing
    concurrently: another level of parallelism
  • Avoids Degree of Parallelism (DOP) speedup
    limitations
  • Linear scaling up to machine limits of processors
    and memory: n × single-process speedup

No speedup (uniprocessor): 12 t
Single process: 8 t, speedup 1.5
Multi-process: 4 t, speedup 3
12
Algorithm/Thread Parallelism - Analytical Model
Multi-Process/Thread Speedup (similar processes)
α = fraction of work that can be done in parallel
n = number of processors
N = number of concurrent (assumed similar) processes or threads

Multi-Process/Thread Speedup (dissimilar processes)
α = fraction of work that can be done in parallel
n = number of processors
N = number of concurrent (assumed dissimilar) processes or threads
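
The equations for this slide are not present in the transcript. The Python sketch below shows one way to compute a multi-process speedup from the definition of speedup (total uniprocessor time divided by the time at which the last process finishes), with each process following Amdahl's Law on its own processors. It is an assumption-based illustration, not necessarily the presenter's model, and every name in it is hypothetical.

```python
def amdahl_time(uniprocessor_time: float, alpha: float, n: int) -> float:
    """Run time of one process on n processors under Amdahl's Law."""
    return uniprocessor_time * ((1.0 - alpha) + alpha / n)


def multiprocess_speedup(processes) -> float:
    """Speedup of N concurrent processes.

    `processes` is a list of (uniprocessor_time, alpha, n) tuples, one per
    process. Assumes every process has its own n processors and that all
    processes start together.
    """
    total_serial = sum(t for t, _, _ in processes)
    finish = max(amdahl_time(t, a, n) for t, a, n in processes)
    return total_serial / finish


if __name__ == "__main__":
    # Similar processes: two copies with 6 time units each and an individual
    # speedup of 1.5 (alpha = 2/3 on 2 processors is one combination that
    # gives 1.5).
    similar = [(6.0, 2 / 3, 2)] * 2
    print(multiprocess_speedup(similar))

    # Dissimilar processes: the slowest process sets the finish time.
    dissimilar = [(6.0, 2 / 3, 2), (3.0, 0.0, 1)]
    print(multiprocess_speedup(dissimilar))
```

For N similar processes this reduces to N times the single-process Amdahl speedup, which agrees with the two-program example where the total speedup was 3.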
13
(Simple) Unified Model with Scaled Speedup
Adds a scaling factor on parallel work, while holding serial work constant.
k1 = scaling factor on parallel portion
α = fraction of work that can be done in parallel
n = number of processors
N = number of concurrent (assumed dissimilar) processes or threads
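
The unified equation itself is missing from the transcript. As a hedged illustration of the scaling idea only, the Python sketch below applies a scaling factor to the parallel work of a single process while holding the serial work constant. The formula is a standard scaled-speedup form chosen for this sketch, and the parameter `k` stands in for the slide's scaling factor; neither is confirmed as the presenter's exact model.

```python
def scaled_speedup(alpha: float, n: int, k: float) -> float:
    """Speedup when parallel work is scaled by k and serial work is fixed.

    Assumed model (not from the slides):
      uniprocessor time = (1 - alpha) + k * alpha
      n-processor time  = (1 - alpha) + k * alpha / n
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if n < 1 or k <= 0:
        raise ValueError("n must be >= 1 and k must be positive")
    serial = 1.0 - alpha
    scaled_parallel = k * alpha
    return (serial + scaled_parallel) / (serial + scaled_parallel / n)


if __name__ == "__main__":
    for k in (1.0, 2.0, 4.0, 16.0):
        print(f"k={k:5.1f}  speedup={scaled_speedup(0.8, 4, k):.3f}")
```

With k = 1 the expression reduces to Amdahl's Law. As k grows the serial part becomes a smaller share of the total work, so the speedup moves toward n.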
14
Capturing Multiple Levels of Parallelism
  • Most parallelism suffers from diminishing returns
    - resulting in limited scalability.
  • Allocating hardware resources to capture multiple
    levels of parallelism lets each level operate at
    the efficient end of its speedup curve.
  • Manufacturers of microcontrollers are integrating
    multiple levels of parallelism on a single chip.
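
A small Python sketch can illustrate why spreading hardware across levels may beat concentrating it at one level. It rests on strong simplifying assumptions that are not from the slides: each level follows Amdahl's Law with the same parallel fraction (0.9), speedups of independent levels multiply, and hardware cost is the product of the replication factors. All names are hypothetical.

```python
from math import prod


def amdahl_speedup(alpha: float, n: int) -> float:
    """Amdahl's Law for one level of parallelism."""
    return 1.0 / ((1.0 - alpha) + alpha / n)


def multilevel_speedup(levels) -> float:
    """Combined speedup of several levels, given as (alpha, n) pairs.

    Assumes the levels are independent, so their speedups multiply.
    """
    return prod(amdahl_speedup(alpha, n) for alpha, n in levels)


# Same hardware budget of 16 units, allocated two ways.
single_level = multilevel_speedup([(0.9, 16)])
two_levels = multilevel_speedup([(0.9, 4), (0.9, 4)])

print(f"one level,  n=16:     speedup={single_level:.2f}  "
      f"efficiency={single_level / 16:.2f}")
print(f"two levels, n=4 each: speedup={two_levels:.2f}  "
      f"efficiency={two_levels / 16:.2f}")
```

Under these assumptions the two-level allocation comes out ahead because each level stays on the steep, efficient part of its own speedup curve. Real levels differ in cost and parallel fraction, so the numbers are illustrative only.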

15
Trend in Microprocessor Architectures
  • Architectural Variations
  • DSP and microcontroller cores on same chip
  • DSP that also performs microprocessor functions
  • Microprocessor that also performs DSP functions
  • Multiprocessor
  • Each variation captures some speedup from all
    three levels
  • Varying amounts of speedup from each level
  • Each parallel level operates at a more efficient
    level than if all hardware resources were
    allocated to a single parallel level
  • 1. Intra-Instruction Parallelism: Pipelines
  • 2. Instruction-Level Parallelism: Super-Scalar -
    Multiple Pipelines
  • 3. Algorithm/Thread Parallelism
  • Multiple processing elements
  • Integrated DSP with microcontroller
  • Enhanced microcontroller to do DSP
  • Enhanced DSP processor that also functions as a
    microcontroller

16
More Levels of Parallelism Outside the Chip
  • Multiple Processors in a box
  • on a motherboard
  • on back-plane with daughter-boards
  • Shared-Memory Multiprocessors
  • communication is through shared memory
  • Clustered Multiprocessors
  • another hierarchical level
  • processors are grouped into clusters
  • intra-cluster can be bus or network
  • inter-cluster can be bus or network
  • Distributed Multicomputers
  • multiple computers loosely coupled through a
    network
  • n-tiered Architectures
  • modern client/server architectures

17
Speedup of Client-Server, 2-Tier Systems
  • β - workload balance, % of workload on client
  • β = 1 (100%), completely distributed
  • β = 0 (100%), completely centralized
  • n clients, m servers

[Figure: n clients on a LAN, connected through the Internet to m servers on a LAN]
18
Speedup of Client-Server, n-Tier Systems
  • m1 = number of level 1 machines (clients)
  • m2 = number of server2 machines, m3 = server3,
    m4 = server4, etc.
  • β1 - workload balance, % of workload on client
  • β2 - % of workload on server2, β3 - % of workload
    on server3, etc.
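
The client-server speedup equations are not present in the transcript. The Python sketch below is one plausible reading, stated as an assumption: each tier handles a fixed fraction of the workload, that fraction is divided evenly over the machines in the tier, and the tiers' work does not overlap in time. The function and parameter names are hypothetical. The 2-tier case is simply a list with two entries (clients, then servers).

```python
from math import isclose


def tiered_speedup(workload_fractions, machine_counts) -> float:
    """Speedup of an n-tier system relative to a single machine.

    workload_fractions[i] is the share of the workload handled by tier i;
    machine_counts[i] is the number of machines in tier i.
    Assumed model: time = sum(fraction_i / machines_i), with no overlap
    between tiers and no communication cost.
    """
    if len(workload_fractions) != len(machine_counts):
        raise ValueError("one machine count is needed per tier")
    if not isclose(sum(workload_fractions), 1.0):
        raise ValueError("workload fractions must sum to 1")
    if any(m < 1 for m in machine_counts):
        raise ValueError("every tier needs at least one machine")
    time = sum(f / m for f, m in zip(workload_fractions, machine_counts))
    return 1.0 / time


if __name__ == "__main__":
    # 2-tier: 10 clients, 2 servers, varying share of work on the clients.
    for client_share in (0.0, 0.5, 0.9, 1.0):
        s = tiered_speedup([client_share, 1.0 - client_share], [10, 2])
        print(f"client share={client_share:.1f}  speedup={s:.2f}")

    # 3-tier: clients, application servers, database servers.
    print(tiered_speedup([0.5, 0.3, 0.2], [20, 4, 2]))
```

In this model a completely distributed workload scales with the number of clients, while a completely centralized one is limited by the number of servers.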

[Figure: m1 clients and server tiers m2, m3, m4, connected by Internet, LAN, and SAN links]
19
Hierarchy of Embedded Parallelism
  • 1. N-tiered Client-Server Distributed Systems
  • 2. Clustered Multi-computers
  • 3. Clustered-Multiprocessor
  • 4. Multiple Processors on a Chip
  • 5. Multiple Processing Elements
  • 6. Multiple Pipelines
  • 7. Multiple Stages per Pipeline
  • Goals
  • Single analytical model that captures
    parallelism from all levels
  • Simulator that allows exploration

20
References
  • K. Hoganson, "Alternative Mechanisms to Achieve
    Parallel Speedup", First IEEE Online Symposium
    for Electronics Engineers, IEEE Society, August
    2000.
  • K. Hoganson, "Mapping Parallel Application
    Communication Topology to Rhombic
    Overlapping-Cluster Multiprocessors", accepted
    for publication, to appear in The Journal of
    Supercomputing, 8/2000, Vol. 17, No. 1.
  • K. Hoganson, "Workload Execution Strategies and
    Parallel Speedup on Clustered Computers",
    accepted for publication, IEEE Transactions on
    Computers, Vol. 48, No. 11, November 1999.
  • Undergraduate Research Project
  • Unified Parallel System Modeling project,
    Directed Study, Summer-Fall 2000