Title: Lecture 18: Introduction to Multiprocessors
1. Lecture 18: Introduction to Multiprocessors
- Prepared and presented by
- Kurt Keutzer
- with thanks for materials from
- Kunle Olukotun, Stanford
- David Patterson, UC Berkeley
2. Why Multiprocessors?
- Needs
- Relentless demand for higher performance
- Servers
- Networks
- Commercial desire for product differentiation
- Opportunities
- Silicon capability
- Ubiquitous computers
3. Exploiting (Program) Parallelism
4. Exploiting (Program) Parallelism - 2
[Figure: levels of program parallelism - bit, instruction, loop, thread, process - plotted against grain size, from 1 to 1M instructions.]
5. Need for Parallel Computing
- Diminishing returns from ILP
  - Limited ILP in programs
  - ILP increasingly expensive to exploit
- Peak performance increases linearly with more processors
  - Amdahl's law applies (see the worked example below)
- Adding processors is inexpensive
  - But most people add memory also
[Figure: two performance vs. die area charts comparing the processor/memory configurations PM, 2PM, and 2P2M.]
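For reference (the slide only names the law), Amdahl's law caps the speedup from N processors when a fraction f of the work is parallelizable:

    speedup(N) = 1 / ((1 - f) + f / N)

For example, f = 0.9 and N = 16 gives 1 / (0.1 + 0.9/16) ≈ 6.4, far short of the 16x peak.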
6. What to do with a billion transistors?
- Technology changes the cost and performance of computer elements in a non-uniform manner
  - logic and arithmetic are becoming plentiful and cheap
  - wires are becoming slow and scarce
- This changes the tradeoffs between alternative architectures
  - superscalar doesn't scale well
  - global control and data
- So what will the architectures of the future be?
[Figure: die scaling from 1998 to 2007 - 64x the area, 4x the speed, slower wires; cross-chip signal latency grows from 1 clk to 3 (10, 16, 20?) clks.]
7. Elements of a multiprocessing system
- General purpose/special purpose
- Granularity - capability of a basic module
- Topology - interconnection/communication geometry
- Nature of coupling - loose to tight
- Control-data mechanisms
- Task allocation and routing methodology
- Reconfigurability
  - Computation
  - Interconnect
- Programmer's model/language support/models of computation
- Implementation - IC, board, multiboard, networked
- Performance measures and objectives
After E. V. Krishnamurty - Chapter 5
8. Use, Granularity
- General purpose
  - attempting to improve general-purpose computation (e.g. SPEC benchmarks) by means of multiprocessing
- Special purpose
  - attempting to improve a specific application or class of applications by means of multiprocessing
- Granularity - scope and capability of a processing element (PE)
  - NAND gate
  - ALU with registers
  - Execution unit with local memory
  - RISC R1000 processor
9. Topology
- Topology - method of interconnection of processors
  - Bus
  - Full-crossbar switch
  - Mesh
  - N-cube
  - Torus
  - Perfect shuffle, m-shuffle
  - Cube-connected components
  - Fat-trees
10. Coupling
- Relationship of communication among processors
- Shared clock (Pipelined)
- Shared registers (VLIW)
- Shared memory (SMM)
- Shared network
11. Control/Data
- Way in which data and control are organized
  - Control - how the instruction stream is managed (e.g. sequential instruction fetch)
  - Data - how the data is accessed (e.g. numbered memory addresses)
- Multithreaded control flow - explicit constructs (fork, join, wait) control program flow; central controller
- Dataflow model - instructions execute as soon as operands are ready; the program structures the flow of data; decentralized control
12. Task allocation and routing
- Way in which tasks are scheduled and managed
  - Static - allocation of tasks onto processing elements predetermined before runtime
  - Dynamic - hardware/software supports allocation of tasks to processors at runtime
13. Reconfiguration
- Computational
  - restructuring of computational elements
  - reconfigurable - reconfiguration at compile time
  - dynamically reconfigurable - restructuring of computational elements at runtime
- Interconnection scheme
  - switching network - software controlled
  - reconfigurable fabric
14. Programmer's model
- How is parallelism expressed by the user?
- Expressive power
  - Process-level parallelism
    - Shared-memory
    - Message-passing
  - Operator-level parallelism
  - Bit-level parallelism
- Formal guarantees
  - Deadlock-free
  - Livelock-free
- Support for other real-time notions
- Exception handling
15. Parallel Programming Models
- Message Passing
  - Fork threads, typically one per node
  - Explicit communication by sending messages
    - send(tid, tag, message)
    - receive(tid, tag, message)
  - Synchronization
    - Block on messages (implicit sync)
    - Barriers
- Shared Memory (address space; see the sketch below)
  - Fork threads, typically one per node
  - Implicit communication through the shared address space, using loads and stores
  - Synchronization
    - Atomic memory operations
    - Barriers
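To make the shared-memory column concrete, here is a minimal pthreads sketch (an illustration, not from the lecture; names and sizes are arbitrary): threads are forked, communicate implicitly through shared arrays with ordinary loads and stores, and synchronize with a barrier. A message-passing version would replace the shared arrays with explicit send/receive calls (e.g. MPI_Send/MPI_Recv).

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define N 1024

    static double data[N];            /* shared address space: implicit communication */
    static double partial[NTHREADS];  /* one slot per thread */
    static pthread_barrier_t bar;

    static void *worker(void *arg) {
        long id = (long)arg;
        long chunk = N / NTHREADS;
        double sum = 0.0;

        /* Each thread reads its slice with plain loads - no explicit messages. */
        for (long i = id * chunk; i < (id + 1) * chunk; i++)
            sum += data[i];
        partial[id] = sum;            /* plain store to shared memory */

        pthread_barrier_wait(&bar);   /* synchronization: barrier */

        if (id == 0) {                /* thread 0 combines the partial sums */
            double total = 0.0;
            for (int t = 0; t < NTHREADS; t++) total += partial[t];
            printf("total = %f\n", total);
        }
        return NULL;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        for (int i = 0; i < N; i++) data[i] = 1.0;
        pthread_barrier_init(&bar, NULL, NTHREADS);
        for (long t = 0; t < NTHREADS; t++)       /* "fork thread" */
            pthread_create(&tid[t], NULL, worker, (void *)t);
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);
        pthread_barrier_destroy(&bar);
        return 0;
    }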
16. Message Passing Multicomputers
- Computers (nodes) connected by a network
  - Fast network interface
    - Send, receive, barrier
  - Nodes no different from a regular PC or workstation
- Cluster of conventional workstations or PCs with a fast network - cluster computing
  - Berkeley NOW
  - IBM SP2
17. Shared-Memory Multiprocessors
- Several processors share one address space
  - conceptually a shared memory
  - often implemented just like a multicomputer: address space distributed over private memories
- Communication is implicit: read and write accesses to shared memory locations
- Synchronization via shared memory locations
  - spin waiting for non-zero (see the sketch below)
  - barriers
[Figure: conceptual model - several processors (P) sharing one memory (M) through a network; actual implementation - each processor with its own memory, connected by a network.]
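A minimal sketch of the "spin waiting for non-zero" idiom above, written with C11 atomics (illustrative; the lecture predates C11, so treat the memory-order details as an assumption):

    #include <stdatomic.h>

    atomic_int flag = 0;   /* shared memory location, initially zero */
    int shared_value;

    /* Producer: publish the data, then set the flag. */
    void produce(int v) {
        shared_value = v;
        atomic_store_explicit(&flag, 1, memory_order_release);
    }

    /* Consumer: spin until the flag becomes non-zero, then read. */
    int consume(void) {
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
            ;  /* spin */
        return shared_value;
    }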
18. Cache Coherence - A Quick Overview
- With caches, action is required to prevent access to stale data
  - Processor 1 may read old data from its cache instead of new data in memory, or
  - Processor 3 may read old data from memory rather than new data in Processor 2's cache
- Solutions
  - no caching of shared data
    - Cray T3D, T3E, IBM RP3, BBN Butterfly
  - cache coherence protocol
    - keep track of copies
    - notify (update or invalidate) on writes
[Figure: processors P1..PN with private caches on a network to memory M; example sequence - P1 Rd(A), P1 Rd(A), P2 Wr(A,5), P3 Rd(A).]
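As a toy illustration of "keep track of copies, notify on writes" (a sketch only, not a real protocol; real systems use per-line states such as MSI/MESI with a snooping or directory mechanism):

    #define NPROC 3

    /* One-location "caches": a valid bit plus a cached copy of A. */
    struct cache { int valid; int value; };
    struct cache c[NPROC];
    int mem_A = 3;                       /* memory holds A = 3 */

    int read_A(int p) {
        if (!c[p].valid) {               /* miss: fetch from memory */
            c[p].value = mem_A;
            c[p].valid = 1;
        }
        return c[p].value;
    }

    void write_A(int p, int v) {
        for (int q = 0; q < NPROC; q++)  /* notify: invalidate other copies */
            if (q != p) c[q].valid = 0;
        c[p].value = v; c[p].valid = 1;
        mem_A = v;                       /* write-through, for simplicity */
    }

Running the slide's sequence through this bookkeeping - P1 Rd(A), P2 Wr(A,5), P3 Rd(A) - invalidates P1's copy at the write, so P1's next read and P3's read both fetch the new value 5 instead of stale data.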
19. Implementation issues
- Underlying hardware implementation
  - Bit-slice
  - Board assembly
  - Integration in an integrated circuit
- Exploitation of new technologies
  - DRAM integration on IC
  - Low-swing chip-level interconnect
20. Performance objectives
- Objectives
  - Speed
  - Power
  - Cost
  - Ease of programming/time to market/time to money
  - In-field flexibility
- Methods of measurement
  - Modeling
  - Emulation
  - Simulation
    - Transaction
    - Instruction-set
  - Hardware
21. Flynn's Taxonomy of Multiprocessing
- Single-instruction single-datastream (SISD) machines
- Single-instruction multiple-datastream (SIMD) machines
- Multiple-instruction single-datastream (MISD) machines
- Multiple-instruction multiple-datastream (MIMD) machines
- Examples?
22. Examples
- Single-instruction single-datastream (SISD) machines
  - Non-pipelined uniprocessors
- Single-instruction multiple-datastream (SIMD) machines
  - Vector processors (VIRAM)
- Multiple-instruction single-datastream (MISD) machines
  - Network processors (Intel IXP1200)
- Multiple-instruction multiple-datastream (MIMD) machines
  - Network of workstations (NOW)
23. Predominant Approaches
- Pipelining - ubiquitous
- Much academic research focused on performance improvements of dusty decks
  - Illiac 4 - speed-up of Fortran
  - SUIF, Flash - speed-up of C
- Niche market in high-performance computing
  - Cray
- Commercial support for high-end servers
  - Shared-memory multiprocessors for the server market
- Commercial exploitation of silicon capability
  - General purpose: superscalar, VLIW
  - Special purpose: VLIW for DSP, media processors, network processors
  - Reconfigurable computing
24. C62x Pipeline Operation: Pipeline Phases
[Figure: pipeline phases - Fetch (PG, PS, PW, PR), Decode (DP, DC), Execute (E1-E5).]
- Fetch
  - PG: Program Address Generate
  - PS: Program Address Send
  - PW: Program Access Ready Wait
  - PR: Program Fetch Packet Receive
- Decode
  - DP: Instruction Dispatch
  - DC: Instruction Decode
- Execute
  - E1-E5: Execute 1 through Execute 5
- Single-cycle throughput; packets operate in lock step
[Figure: pipeline chart - execute packets 1 through 7 each advance through PG, PS, PW, PR, DP, DC, E1-E5, offset by one cycle, so a new packet enters the pipeline every cycle.]
25. Superscalar PowerPC 604 and Pentium Pro
- Both: in-order issue, out-of-order execution, in-order commit
26. IA-64 aka EPIC aka VLIW
- Compiler schedules instructions
  - encodes dependencies explicitly
  - saves having the hardware repeatedly rediscover them
- Supports speculation
  - speculative loads
  - branch prediction
- Really need to make communication explicit too
  - still has global registers and global instruction issue
[Figure: instruction cache feeding global instruction issue and a global register file.]
27. Philips Trimedia Processor
28. TMS320C6201 Revision 2
29. TMS320C6701 DSP Block Diagram
[Block diagram: C67x floating-point CPU core - program fetch, instruction dispatch, instruction decode, control registers and control logic, two data paths (A and B register files with units L1, S1, M1, D1 and L2, S2, M2, D2), test/emulation, interrupts; on-chip program cache/program memory (32-bit address, 256-bit data, 512K bits RAM) and data memory (32-bit address, 8-/16-/32-bit data, 512K bits RAM); peripherals - power down logic, host port interface, 4-channel DMA, external memory interface, 2 timers, 2 multi-channel buffered serial ports (T1/E1).]
30. TMS320C67x CPU Core
[Block diagram: C67x floating-point CPU core detail - program fetch, instruction dispatch, instruction decode, control registers and control logic, test/emulation, interrupts; data paths 1 and 2 with A and B register files and units D1, M1, S1, L1 / L2, S2, M2, D2; floating-point capabilities in the arithmetic logic, auxiliary logic, and multiplier units.]
31. Single-Chip Multiprocessors (CMP)
- Build a multiprocessor on a single chip
  - linear increase in peak performance
  - advantage of fast interaction between processors
- Fine-grain threads
  - make communication and synchronization very fast (1 cycle)
  - break the problem into smaller pieces
- Memory bandwidth
  - makes more effective use of limited memory bandwidth
- Programming model
  - need parallel programs
[Figure: four processors (P) and a memory (M) on one chip.]
32. Intel IXP1200 Network Processor
- 6 micro-engines
  - RISC engines
  - 4 contexts/engine
  - 24 threads total
- IX Bus Interface
  - packet I/O
  - connects IXPs
  - scalable
- StrongARM
  - less critical tasks
- Hash engine
  - level 2 lookups
- PCI interface
33. IXP1200 MicroEngine
- 32-bit RISC instruction set
- Multithreading support for 4 threads
  - Maximum switching overhead of 1 cycle
- 128 32-bit GPRs in two banks of 64
- Programmable 1KB instruction store (not shown in diagram)
- 128 32-bit transfer registers
- Command bus arbiter and FIFO (not shown in diagram)
34. IXP1200 Instruction Set
- Powerful ALU instructions
  - can manipulate words and parts of words quite effectively
- Swap-thread on memory reference
  - hides memory latency
  - sram[read, r0, base1, offset, 1], ctx_swap
- Can use an intelligent DMA-like controller to copy packets to/from memory
  - sdram[t_fifo_wr, --, pkt_bffr, offset, 8]
- Exposed branch behavior
  - can fill variable branch slots
  - can select a static prediction on a per-branch basis

ARM:
    mov r1, r0, lsl #16
    mov r1, r1, asr #16
    add r0, r1, r0, asr #16

IXP1200:
    ld_field_w_clr[temp, 1100, accum]
    alu_shf[accum, temp, +, accum, <<16]
35. UCB Processor with DRAM (PIM): IRAM, VIRAM
- Put the processor and the main memory on a single chip
  - much lower memory latency
  - much higher memory bandwidth
- But
  - need to build systems with more than one chip
[Figure: memory (M), processor (P), and vector unit (V) on one die.]
64Mb SDRAM chip: internal - 128 512K subarrays, 4 bits per subarray, each 10ns = 51.2 Gb/s; external - 8 bits at 10ns = 800Mb/s. 1 integer processor ≈ 100KBytes DRAM; 1 FP processor ≈ 500KBytes DRAM; 1 vector unit ≈ 1 MByte DRAM.
36. IRAM Vision Statement
- Microprocessor and DRAM on a single chip
  - on-chip memory latency 5-10X, bandwidth 50-100X
  - improve energy efficiency 2X-4X (no off-chip bus)
  - serial I/O 5-10X vs. buses
  - smaller board area/volume
  - adjustable memory size/width
[Figure: conventional system - processor with L2 cache connected over buses to off-chip DRAM - vs. IRAM, with processor and DRAM merged on one chip in a logic fab.]
37. Potential Multimedia Architecture
- New model: VSIW (Very Short Instruction Word!)
  - Compact: describe N operations with 1 short instruction (see the sketch below)
  - Predictable (real-time) performance vs. statistical performance (cache)
  - Multimedia ready: choose N x 64b, 2N x 32b, or 4N x 16b operations
  - Easy to get high performance
  - Compiler technology already developed, for sale!
  - Don't have to write all programs in assembly language
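A hedged sketch of the "N operations in one short instruction" point (the vector form is pseudo-code, not VIRAM's actual ISA):

    /* Scalar version: the hardware fetches, decodes, and issues one add
     * per element - 64 trips through the pipeline front end. */
    void add64(double *c, const double *a, const double *b) {
        for (int i = 0; i < 64; i++)
            c[i] = a[i] + b[i];
    }

    /* VSIW/vector version (pseudo-intrinsics, illustrative only): one
     * short instruction describes all 64 operations, and the same vector
     * registers can hold N=64 64-bit, 2N=128 32-bit, or 4N=256 16-bit
     * elements:
     *
     *     set_vector_length(64);
     *     vadd(c, a, b);          // 64 adds from one instruction
     */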
38. Revive Vector (= VSIW) Architecture!
- Cost? $1M each? -> Single-chip CMOS MPU/IRAM
- Low latency, high BW memory system? -> IRAM
- Code density? -> Much smaller than VLIW
- Compilers? -> For sale, mature (>20 years)
- Performance? -> Easy to scale speed with technology
- Power/Energy? -> Parallel to save energy, keep performance
- Limited to scientific applications? -> Multimedia apps vectorizable too: N x 64b, 2N x 32b, 4N x 16b
39. V-IRAM1: 0.18 µm, Fast Logic, 200 MHz; 1.6 GFLOPS (64b) / 6.4 GOPS (16b) / 16MB
[Block diagram: 2-way superscalar processor (16K I cache, 16K D cache) with a vector instruction queue; vector unit with vector registers and load/store unit operating as 4 x 64, 8 x 32, or 16 x 16; serial I/O; memory crossbar switch with 4 x 64 links to a large array of DRAM banks (M).]
40. Tentative VIRAM-1 Floorplan
- 0.18 µm DRAM: 16-32 MB in 16 banks x 256b
- 0.18 µm, 5-metal logic
- 200 MHz MIPS IV, 16K I cache, 16K D cache
- 4 x 200 MHz FP/int. vector units
- die: 20x20 mm
- xtors: 130-250M
- power: 2 Watts
[Floorplan: two memory blocks (128 Mbits / 16 MBytes each) flanking a ring-based switch and I/O.]
41. Tentative VIRAM-0.25 Floorplan
- Demonstrate scalability via a 2nd layout (automatic from the 1st)
- 8 MB in 2 banks x 256b, 32 subbanks
- 200 MHz CPU, 8K I cache, 8K D cache
- 1 x 200 MHz FP/int. vector unit
- die: 5 x 20 mm
- xtors: 70M
- power: 0.5 Watts

Kernel       GOPS V-1   GOPS V-0.25
Comp.        6.40       1.6
iDCT         3.10       0.8
Clr. Conv.   2.95       0.8
Convol.      3.16       0.8
FP Matrix    3.19       0.8
42. Stanford Hydra Design
- Single-chip multiprocessor
- Four processors
- Separate primary caches
- Write-through data caches to maintain coherence
- Shared 2nd-level cache
- Separate read and write busses
- Data Speculation Support
43. Mescal Architecture
- Scott Weber
- University of California at Berkeley
44. Outline
- Architecture rationale and motivation
- Architecture goals
- Architecture template
- Processing elements
- Multiprocessor architecture
- Communication architecture
45. Architectural Rationale and Motivation
- Configurable processors have shown orders-of-magnitude performance improvements
  - Tensilica has shown 2x to 50x performance improvements
  - Specialized functional units
  - Memory configurations
- Tensilica matches the architecture with software development tools
46. Architectural Rationale and Motivation
- In order to continue this performance-improvement trend:
  - Architectural features which exploit more concurrency are required
  - Heterogeneous configurations need to be made possible
  - Software development tools must support new configuration options
[Figure callouts: "...concurrent processes are required in order to continue performance improvement trend...", "...begins to look like a VLIW...", "...generic mesh may not suit the application's topology..."]
47. Architecture Goals
- Provide a template for the exploration of a range of architectures
  - Retarget compiler and simulator to the architecture
  - Enable the compiler to exploit the architecture
- Concurrency
  - Multiple instructions per processing element
  - Multiple threads per and across processing elements
  - Multiple processes per and across processing elements
- Support for efficient computation
  - Special-purpose functional units, intelligent memory, processing elements
- Support for efficient communication
  - Configurable network topology
  - Combined shared memory and message passing
48. Architecture Template
- Prototyping template for an array of processing elements
  - Configure processing elements for efficient computation
  - Configure memory elements for efficient retiming
  - Configure the network topology for efficient communication
[Figure callouts: "...configure memory elements...", "...configure PE...", "...configure PEs and network to match the application..."]
49. Range of Architectures
- Scalar configuration
- EPIC configuration
- EPIC with special FUs
- Mesh of HPL-PD PEs
- Customized PEs, network
- Supports a family of architectures
- Plan to extend the family with the micro-architectural features presented
(Slides 50-53 repeat this list as the build progresses from the scalar configuration to customized PEs and network.)
54. Range of Architectures (Future)
- Template support for such an architecture
- Prototype architecture
- Software development tools generated
  - Generate compiler
  - Generate simulator
[Figure: IXP1200 Network Processor (Intel)]
55. The RAW Architecture
- Slides prepared by Manish Vachhrajani
56. Outline
- RAW architecture
- Overview
- Features
- Benefits and Disadvantages
- Compiling for RAW
- Overview
- Structure of the compiler
- Basic block compilation
- Other techniques
57. RAW Machine Overview
- Scalable architecture without global interconnect
- Constructed from replicated tiles
- Each tile has a µP and a switch
- Interconnect via a static and a dynamic network
58. RAW Tiles
- Simple 5-stage pipelined µP with local PC (MIMD)
- Can contain configurable logic
- Per-tile IMEM and DMEM, unlike other modern architectures
- µP contains instructions to send and receive data
[Tile diagram: µP with PC, REGS, IMEM, and DMEM, plus configurable logic (CL); switch with its own PC and SMEM.]
59. RAW Tiles (cont.)
- Tiles have local switches
  - Implemented with a stripped-down µP
- Static network
  - Fast, easy to implement
  - Need to know data transfers, source, and destination at compile time
- Dynamic network
  - Much slower and more complex
  - Allows for messages whose route is not known at compile time
60. Configurable Hardware in RAW
- Each tile contains its own configurable hardware
- Each tile has several ALUs and logic gates that can operate at bit/byte/word levels
- Configurable interconnect to wire components together
- Coarser than FPGA-based implementations
61. Benefits of RAW
- Scalable
  - Each tile is simple and replicated
  - No global wiring, so it will scale even if wire delay doesn't
  - Short wires and simple tiles allow higher clock rates
- Can target many forms of parallelism
- Ease of design
  - Replication reduces design overhead
  - Tiles are relatively simple designs
  - Simplicity makes verification easier
62. Disadvantages of RAW
- Complex compilation
  - Full space-time compilation
- Distributed memory system
  - Need sophisticated memory analysis to resolve static references
- Software complexity
  - Low-level code is complex and difficult to examine and write by hand
- Code size?
63. Traditional Operations on RAW
- How does one exploit the RAW architecture across function calls, especially in libraries?
- Can we easily maintain portability with different tile counts?
- Memory protection and OS services
  - Context switch overhead
  - Load on the dynamic network for memory protection and virtual memory?
64. Compiling for RAW Machines
- Determine available parallelism
- Determine placement of memory items
- Discover memory constraints
  - Dependencies between parallel threads
- Disambiguate memory references to allow for static access to data elements
- Trade off memory dependence and parallelism
65. Compiling for RAW (cont.)
- Generate route instructions for switches
  - static network only
- Generate message handlers for dynamic events
  - Speculative execution
  - Unpredictable memory references
- Optimal partitioning algorithm is NP-complete
66. Structure of RAWCC
- Partition data to increase static accesses
- Partition instructions to allow parallel execution
- Allocate data to tiles to minimize communication overhead
[Compiler flow: source language -> build CFG -> traditional dataflow optimizations -> MAPS system -> space-time scheduler -> RAW executable.]
67. The MAPS System
- Manages memory to generate static promotions of data structures
- For loop accesses to arrays, uses modulo unrolling (see the sketch below)
- For data structures, uses the SPAN analysis package to identify potential references and partition memory
  - structures can be split across processing elements
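A sketch of what modulo unrolling buys (an invented example; it assumes 4 memory banks with low-order interleaving so that a[i] lives on bank i mod 4, a[0] resides on bank 0, and n is a multiple of 4):

    /* Original loop: the bank holding a[i] is unknown at compile time,
     * so every access would need the dynamic network. */
    void scale(int *a, int n) {
        for (int i = 0; i < n; i++)
            a[i] *= 2;
    }

    /* After modulo unrolling by the bank count (4): each reference in
     * the body touches one fixed bank, so its access is static and can
     * be scheduled on the static network. */
    void scale_unrolled(int *a, int n) {
        for (int i = 0; i < n; i += 4) {
            a[i]     *= 2;   /* always bank 0 */
            a[i + 1] *= 2;   /* always bank 1 */
            a[i + 2] *= 2;   /* always bank 2 */
            a[i + 3] *= 2;   /* always bank 3 */
        }
    }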
68. Space-Time Scheduler
- For basic blocks
  - Maps instructions to processors
  - Maps scalar data to processors
  - Generates communication instructions
  - Schedules computation and communication
- For the overall CFG, performs control localization
69. Basic Block Orchestrator
- All values are copied from their home tiles to the tiles that work on the data
- Within a block, all accesses are local
- At the end of a block, values are copied back to home tiles
[Flow: initial code transformation -> instruction partitioner -> global data partitioner -> data & instruction placer -> event scheduler -> communication code generator.]
70. Initial Code Transformation
- Convert block to static single assignment (SSA) form
  - removes false dependencies
  - analogous to register renaming (see the sketch below)
- Live-on-entry and live-on-exit variables are marked with dummy instructions
  - allows overlap of stitch code with useful work
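A small illustration of why SSA removes false dependencies (an invented example, analogous to the register-renaming point above):

    /* Before SSA: x is redefined, creating false (output/anti)
     * dependencies that serialize two independent computations. */
    void before_ssa(int a, int b, int c, int d, int *y, int *z) {
        int x;
        x = a + b;      /* first definition of x */
        *y = x * 2;
        x = c + d;      /* redefinition must appear to wait */
        *z = x * 3;
    }

    /* After SSA renaming: each definition gets a unique name, so the
     * (x1, y) and (x2, z) chains are visibly independent and can be
     * placed on different tiles or scheduled in parallel. */
    void after_ssa(int a, int b, int c, int d, int *y, int *z) {
        int x1 = a + b;
        int x2 = c + d;
        *y = x1 * 2;
        *z = x2 * 3;
    }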
71. Instruction Partitioner
- Partitions the stream into multiple streams, one for each tile
- Clustering
  - Partitions instructions to minimize runtime, considering only communication
- Merging
  - Reduces cluster count to match tile count
  - Uses a heuristic-based algorithm to achieve good balance and low communication overhead
72. Global Data Partitioner
- Partitions global data for assignment to home locations
  - Local data is copied at the start of a basic block
- Summarizes the instruction stream's data access pattern with affinities
- Maps instructions and data to virtual processors
  - Maps instructions, optimally places data based on affinity
  - Remaps instructions with data placement knowledge
  - Repeats until a local minimum is reached
- Only real data are mapped, not dummies formed in the ICT
73. Data and Instruction Placer
- Places data items onto physical tiles
  - driven by static data items
- Places instructions onto tiles
  - Uses data information to determine cost
  - Takes into account an actual model of the communications network
- Uses a swap-based greedy allocation
74. Event Scheduler
- Schedules routing instructions as well as computation instructions in a basic block
- Schedules instructions using a greedy list-based scheduler
- Switch schedule is ensured to be deadlock-free
  - Allows tolerance of dynamic events
75. Control Flow
- Control localization
  - Certain branches are enveloped in macro instructions, and the surrounding blocks merged
  - Allows a branch to occur on only one tile
- Global branching
  - Done through target broadcast and local branching
76. Performance
- RAW achieves anywhere from 1.5x to 9x speedup depending on application and tile count
- Applications tested were particularly well suited to RAW
  - Heavily dependent integer programs may do poorly (encryption, etc.)
- Depends on its ability to statically schedule and localize memory accesses
77. Future Work
- Use multisequential execution to run multiple applications simultaneously
  - Allow static communication between threads known at compile time
  - Minimize dynamic overhead otherwise
- Target ILP across branches more aggressively
- Explore configurability vs. parallelism in RAW
78. Reconfigurable Processors
- Adapt the processor to the application
  - special function units
  - special wiring between function units
- Builds on FPGA technology
- FPGAs are inefficient
  - a multiplier built from an FPGA is about 100x larger and 10x slower than a custom multiplier
- Need to raise the granularity
  - configure ALUs, or whole processors
- Memory and communication are usually the bottleneck
  - not addressed by configuring a lot of ALUs
- Programming model
  - Difficult to program
  - Verilog
79. SCORE: Stream Computation Organized for Reconfigurable Execution
Eylon Caspi, Michael Chu, André DeHon, Randy Huang, Joseph Yeh, John Wawrzynek, Nicholas Weaver
80. Opportunity
- High-throughput, regular operations
  - can be mapped spatially onto FPGA-like substrates (programmable, spatial compute substrates)
  - achieving higher performance (throughput per unit area)
  - than conventional, programmable devices (e.g. processors)
81. Problem
- Only have raw devices
  - Solutions non-portable
  - Solutions do not scale to new hardware
- Device resources exposed to the developer
  - Little or no abstraction of implementations
- Composition of subcomponents hard/ad hoc
- No unifying computational model or run-time environment
82. Introduce SCORE
- Compute Model
- virtualizes RC hardware resources
- supports automatic scaling
- supports dynamic program requirements efficiently
- provides compositional semantics
- defines runtime environment for programs
83. Viewpoint
- SCORE (or something like it) is a necessary condition for enabling automatic exploitation of new RC hardware as it becomes available.
- Automatic exploitation is essential to making RC a long-term viable computing solution.
84. Outline
- Opportunity
- Problem
- Review
- related work
- enabling hardware
- Model
- execution
- programmer
- Preliminary Results
- Challenges and Questions ahead
85. Borrows heavily from...
- RC, RTR
- PFPGA
- Dataflow
- Streaming dataflow
- Multiprocessors
- Operating systems
- (see working paper)
- Tried to steal all the good ideas ;-)
  - build a coherent model
  - exploit strengths of RC
86. Enabling Hardware
- High-speed computational arrays
  - 250MHz, HSRA, FPGA'99
- Large, on-chip memories
  - 2Mbit, VLSI Symp. '99
  - allow microsecond reconfiguration
- Processor and FPGA hybrids
  - GARP, NAPA, Triscend, etc.
87. BRASS Architecture
88. Array Model
89. Platform Vision
- Hardware capacity scales up with each generation
  - Faster devices
  - More computation
  - More memory
- With SCORE, old programs should run on new hardware
  - and exploit the additional capacity automatically
90. Example SCORE Execution
91. Spatial Implementation
92. Serial Implementation
93. Summary: Elements of a Multiprocessing System
- General purpose/special purpose
- Granularity - capability of a basic module
- Topology - interconnection/communication geometry
- Nature of coupling - loose to tight
- Control-data mechanisms
- Task allocation and routing methodology
- Reconfigurability
  - Computation
  - Interconnect
- Programmer's model/language support/models of computation
- Implementation - IC, board, multiboard, networked
- Performance measures and objectives
After E. V. Krishnamurty - Chapter 5
94. Conclusions
- Portions of multi/parallel processing have become successful
  - Pipelining: ubiquitous
  - Superscalar: ubiquitous
  - VLIW: successful in DSP, multimedia - GPP?
- Silicon capability is re-invigorating multiprocessor research
  - GPP: Flash, Hydra, RAW
  - SPP: Intel IXP1200, IRAM/VIRAM, Mescal
- Reconfigurable computing has found a niche in wireless communications
- Problem of programming models, languages, computational models, etc. for multiprocessors still largely unsolved