Title: Lecture 18: Introduction to Multiprocessors
1. Lecture 18: Introduction to Multiprocessors
- Prepared and presented by
- Kurt Keutzer
- with thanks for materials from
- Kunle Olukotun, Stanford
- David Patterson, UC Berkeley
2. Why Multiprocessors?
- Needs
- Relentless demand for higher performance
- Servers
- Networks
- Commercial desire for product differentiation
- Opportunities
- Silicon capability
- Ubiquitous computers
3. Exploiting (Program) Parallelism
4. Exploiting (Program) Parallelism - 2
[Figure: levels of program parallelism - bit, instruction, loop, thread, process - plotted against grain size, from 1 to 1M instructions.]
5. Need for Parallel Computing
- Diminishing returns from ILP
  - Limited ILP in programs
  - ILP increasingly expensive to exploit
- Peak performance increases linearly with more processors
  - Amdahl's law applies (see the worked example below)
- Adding processors is inexpensive
  - But most people add memory also
[Figure: two performance vs. die area charts comparing the processor/memory configurations PM, 2PM, and 2P2M.]
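For reference (the slide only names the law), Amdahl's law caps the speedup from N processors when a fraction f of the work is parallelizable:

    speedup(N) = 1 / ((1 - f) + f / N)

For example, f = 0.9 and N = 16 gives 1 / (0.1 + 0.9/16) ≈ 6.4, far short of the 16x peak.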
6. What to do with a billion transistors?
- Technology changes the cost and performance of computer elements in a non-uniform manner
  - logic and arithmetic are becoming plentiful and cheap
  - wires are becoming slow and scarce
- This changes the tradeoffs between alternative architectures
  - superscalar doesn't scale well
  - global control and data
- So what will the architectures of the future be?
[Figure: die scaling from 1998 to 2007 - 64x the area, 4x the speed, slower wires; cross-chip signal latency grows from 1 clk to 3 (10, 16, 20?) clks.]
7. Elements of a multiprocessing system
- General purpose/special purpose
- Granularity - capability of a basic module
- Topology - interconnection/communication geometry
- Nature of coupling - loose to tight
- Control-data mechanisms
- Task allocation and routing methodology
- Reconfigurability
  - Computation
  - Interconnect
- Programmer's model/language support/models of computation
- Implementation - IC, board, multiboard, networked
- Performance measures and objectives
After E. V. Krishnamurty - Chapter 5
8. Use, Granularity
- General purpose
  - attempting to improve general-purpose computation (e.g. SPEC benchmarks) by means of multiprocessing
- Special purpose
  - attempting to improve a specific application or class of applications by means of multiprocessing
- Granularity - scope and capability of a processing element (PE)
  - NAND gate
  - ALU with registers
  - Execution unit with local memory
  - RISC R1000 processor
9. Topology
- Topology - method of interconnection of processors
  - Bus
  - Full-crossbar switch
  - Mesh
  - N-cube
  - Torus
  - Perfect shuffle, m-shuffle
  - Cube-connected components
  - Fat-trees
10. Coupling
- Relationship of communication among processors
- Shared clock (Pipelined)
- Shared registers (VLIW)
- Shared memory (SMM)
- Shared network
11. Control/Data
- Way in which data and control are organized
  - Control - how the instruction stream is managed (e.g. sequential instruction fetch)
  - Data - how the data is accessed (e.g. numbered memory addresses)
- Multithreaded control flow - explicit constructs (fork, join, wait) control program flow; central controller
- Dataflow model - instructions execute as soon as operands are ready; the program structures the flow of data; decentralized control
12. Task allocation and routing
- Way in which tasks are scheduled and managed
  - Static - allocation of tasks onto processing elements predetermined before runtime
  - Dynamic - hardware/software supports allocation of tasks to processors at runtime
13. Reconfiguration
- Computational
  - restructuring of computational elements
  - reconfigurable - reconfiguration at compile time
  - dynamically reconfigurable - restructuring of computational elements at runtime
- Interconnection scheme
  - switching network - software controlled
  - reconfigurable fabric
14. Programmer's model
- How is parallelism expressed by the user?
- Expressive power
  - Process-level parallelism
    - Shared-memory
    - Message-passing
  - Operator-level parallelism
  - Bit-level parallelism
- Formal guarantees
  - Deadlock-free
  - Livelock-free
- Support for other real-time notions
- Exception handling
15. Parallel Programming Models
- Message Passing
  - Fork threads, typically one per node
  - Explicit communication by sending messages
    - send(tid, tag, message)
    - receive(tid, tag, message)
  - Synchronization
    - Block on messages (implicit sync)
    - Barriers
- Shared Memory (address space; see the sketch below)
  - Fork threads, typically one per node
  - Implicit communication through the shared address space, using loads and stores
  - Synchronization
    - Atomic memory operations
    - Barriers
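To make the shared-memory column concrete, here is a minimal pthreads sketch (an illustration, not from the lecture; names and sizes are arbitrary): threads are forked, communicate implicitly through shared arrays with ordinary loads and stores, and synchronize with a barrier. A message-passing version would replace the shared arrays with explicit send/receive calls (e.g. MPI_Send/MPI_Recv).

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define N 1024

    static double data[N];            /* shared address space: implicit communication */
    static double partial[NTHREADS];  /* one slot per thread */
    static pthread_barrier_t bar;

    static void *worker(void *arg) {
        long id = (long)arg;
        long chunk = N / NTHREADS;
        double sum = 0.0;

        /* Each thread reads its slice with plain loads - no explicit messages. */
        for (long i = id * chunk; i < (id + 1) * chunk; i++)
            sum += data[i];
        partial[id] = sum;            /* plain store to shared memory */

        pthread_barrier_wait(&bar);   /* synchronization: barrier */

        if (id == 0) {                /* thread 0 combines the partial sums */
            double total = 0.0;
            for (int t = 0; t < NTHREADS; t++) total += partial[t];
            printf("total = %f\n", total);
        }
        return NULL;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        for (int i = 0; i < N; i++) data[i] = 1.0;
        pthread_barrier_init(&bar, NULL, NTHREADS);
        for (long t = 0; t < NTHREADS; t++)       /* "fork thread" */
            pthread_create(&tid[t], NULL, worker, (void *)t);
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);
        pthread_barrier_destroy(&bar);
        return 0;
    }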
16. Message Passing Multicomputers
- Computers (nodes) connected by a network
  - Fast network interface
    - Send, receive, barrier
  - Nodes no different from a regular PC or workstation
- Cluster of conventional workstations or PCs with a fast network - cluster computing
  - Berkeley NOW
  - IBM SP2
17. Shared-Memory Multiprocessors
- Several processors share one address space
  - conceptually a shared memory
  - often implemented just like a multicomputer: address space distributed over private memories
- Communication is implicit: read and write accesses to shared memory locations
- Synchronization via shared memory locations
  - spin waiting for non-zero (see the sketch below)
  - barriers
[Figure: conceptual model - several processors (P) sharing one memory (M) through a network; actual implementation - each processor with its own memory, connected by a network.]
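A minimal sketch of the "spin waiting for non-zero" idiom above, written with C11 atomics (illustrative; the lecture predates C11, so treat the memory-order details as an assumption):

    #include <stdatomic.h>

    atomic_int flag = 0;   /* shared memory location, initially zero */
    int shared_value;

    /* Producer: publish the data, then set the flag. */
    void produce(int v) {
        shared_value = v;
        atomic_store_explicit(&flag, 1, memory_order_release);
    }

    /* Consumer: spin until the flag becomes non-zero, then read. */
    int consume(void) {
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
            ;  /* spin */
        return shared_value;
    }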
18. Cache Coherence - A Quick Overview
- With caches, action is required to prevent access to stale data
  - Processor 1 may read old data from its cache instead of new data in memory, or
  - Processor 3 may read old data from memory rather than new data in Processor 2's cache
- Solutions
  - no caching of shared data
    - Cray T3D, T3E, IBM RP3, BBN Butterfly
  - cache coherence protocol
    - keep track of copies
    - notify (update or invalidate) on writes
[Figure: processors P1..PN with private caches on a network to memory M; example sequence - P1 Rd(A), P1 Rd(A), P2 Wr(A,5), P3 Rd(A).]
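As a toy illustration of "keep track of copies, notify on writes" (a sketch only, not a real protocol; real systems use per-line states such as MSI/MESI with a snooping or directory mechanism):

    #define NPROC 3

    /* One-location "caches": a valid bit plus a cached copy of A. */
    struct cache { int valid; int value; };
    struct cache c[NPROC];
    int mem_A = 3;                       /* memory holds A = 3 */

    int read_A(int p) {
        if (!c[p].valid) {               /* miss: fetch from memory */
            c[p].value = mem_A;
            c[p].valid = 1;
        }
        return c[p].value;
    }

    void write_A(int p, int v) {
        for (int q = 0; q < NPROC; q++)  /* notify: invalidate other copies */
            if (q != p) c[q].valid = 0;
        c[p].value = v; c[p].valid = 1;
        mem_A = v;                       /* write-through, for simplicity */
    }

Running the slide's sequence through this bookkeeping - P1 Rd(A), P2 Wr(A,5), P3 Rd(A) - invalidates P1's copy at the write, so P1's next read and P3's read both fetch the new value 5 instead of stale data.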
19. Implementation issues
- Underlying hardware implementation
  - Bit-slice
  - Board assembly
  - Integration in an integrated circuit
- Exploitation of new technologies
  - DRAM integration on IC
  - Low-swing chip-level interconnect
20. Performance objectives
- Objectives
  - Speed
  - Power
  - Cost
  - Ease of programming/time to market/time to money
  - In-field flexibility
- Methods of measurement
  - Modeling
  - Emulation
  - Simulation
    - Transaction
    - Instruction-set
  - Hardware
21. Flynn's Taxonomy of Multiprocessing
- Single-instruction single-datastream (SISD) machines
- Single-instruction multiple-datastream (SIMD) machines
- Multiple-instruction single-datastream (MISD) machines
- Multiple-instruction multiple-datastream (MIMD) machines
- Examples?
22. Examples
- Single-instruction single-datastream (SISD) machines
  - Non-pipelined uniprocessors
- Single-instruction multiple-datastream (SIMD) machines
  - Vector processors (VIRAM)
- Multiple-instruction single-datastream (MISD) machines
  - Network processors (Intel IXP1200)
- Multiple-instruction multiple-datastream (MIMD) machines
  - Network of workstations (NOW)
23. Predominant Approaches
- Pipelining - ubiquitous
- Much academic research focused on performance improvements of dusty decks
  - Illiac 4 - speed-up of Fortran
  - SUIF, Flash - speed-up of C
- Niche market in high-performance computing
  - Cray
- Commercial support for high-end servers
  - Shared-memory multiprocessors for the server market
- Commercial exploitation of silicon capability
  - General purpose: superscalar, VLIW
  - Special purpose: VLIW for DSP, media processors, network processors
  - Reconfigurable computing
24. C62x Pipeline Operation: Pipeline Phases
[Figure: pipeline phases - Fetch (PG, PS, PW, PR), Decode (DP, DC), Execute (E1-E5).]
- Fetch
  - PG: Program Address Generate
  - PS: Program Address Send
  - PW: Program Access Ready Wait
  - PR: Program Fetch Packet Receive
- Decode
  - DP: Instruction Dispatch
  - DC: Instruction Decode
- Execute
  - E1-E5: Execute 1 through Execute 5
- Single-cycle throughput; packets operate in lock step
[Figure: pipeline chart - execute packets 1 through 7 each advance through PG, PS, PW, PR, DP, DC, E1-E5, offset by one cycle, so a new packet enters the pipeline every cycle.]
25. Superscalar PowerPC 604 and Pentium Pro
- Both: in-order issue, out-of-order execution, in-order commit
26. IA-64 aka EPIC aka VLIW
- Compiler schedules instructions
  - encodes dependencies explicitly
  - saves having the hardware repeatedly rediscover them
- Supports speculation
  - speculative loads
  - branch prediction
- Really need to make communication explicit too
  - still has global registers and global instruction issue
[Figure: instruction cache feeding global instruction issue and a global register file.]
27. Philips Trimedia Processor
28. TMS320C6201 Revision 2
29. TMS320C6701 DSP Block Diagram
[Block diagram: C67x floating-point CPU core - program fetch, instruction dispatch, instruction decode, control registers and control logic, two data paths (A and B register files with units L1, S1, M1, D1 and L2, S2, M2, D2), test/emulation, interrupts; on-chip program cache/program memory (32-bit address, 256-bit data, 512K bits RAM) and data memory (32-bit address, 8-/16-/32-bit data, 512K bits RAM); peripherals - power down logic, host port interface, 4-channel DMA, external memory interface, 2 timers, 2 multi-channel buffered serial ports (T1/E1).]
30. TMS320C67x CPU Core
[Block diagram: C67x floating-point CPU core detail - program fetch, instruction dispatch, instruction decode, control registers and control logic, test/emulation, interrupts; data paths 1 and 2 with A and B register files and units D1, M1, S1, L1 / L2, S2, M2, D2; floating-point capabilities in the arithmetic logic, auxiliary logic, and multiplier units.]
31. Single-Chip Multiprocessors (CMP)
- Build a multiprocessor on a single chip
  - linear increase in peak performance
  - advantage of fast interaction between processors
- Fine-grain threads
  - make communication and synchronization very fast (1 cycle)
  - break the problem into smaller pieces
- Memory bandwidth
  - makes more effective use of limited memory bandwidth
- Programming model
  - need parallel programs
[Figure: four processors (P) and a memory (M) on one chip.]
32. Intel IXP1200 Network Processor
- 6 micro-engines
  - RISC engines
  - 4 contexts/engine
  - 24 threads total
- IX Bus Interface
  - packet I/O
  - connects IXPs
  - scalable
- StrongARM
  - less critical tasks
- Hash engine
  - level 2 lookups
- PCI interface
33. IXP1200 MicroEngine
- 32-bit RISC instruction set
- Multithreading support for 4 threads
  - Maximum switching overhead of 1 cycle
- 128 32-bit GPRs in two banks of 64
- Programmable 1KB instruction store (not shown in diagram)
- 128 32-bit transfer registers
- Command bus arbiter and FIFO (not shown in diagram)
34. IXP1200 Instruction Set
- Powerful ALU instructions
  - can manipulate words and parts of words quite effectively
- Swap-thread on memory reference
  - hides memory latency
  - sram[read, r0, base1, offset, 1], ctx_swap
- Can use an intelligent DMA-like controller to copy packets to/from memory
  - sdram[t_fifo_wr, --, pkt_bffr, offset, 8]
- Exposed branch behavior
  - can fill variable branch slots
  - can select a static prediction on a per-branch basis

ARM:
    mov r1, r0, lsl #16
    mov r1, r1, asr #16
    add r0, r1, r0, asr #16

IXP1200:
    ld_field_w_clr[temp, 1100, accum]
    alu_shf[accum, temp, +, accum, <<16]
35. UCB Processor with DRAM (PIM): IRAM, VIRAM
- Put the processor and the main memory on a single chip
  - much lower memory latency
  - much higher memory bandwidth
- But
  - need to build systems with more than one chip
[Figure: memory (M), processor (P), and vector unit (V) on one die.]
64Mb SDRAM chip: internal - 128 512K subarrays, 4 bits per subarray, each 10ns = 51.2 Gb/s; external - 8 bits at 10ns = 800Mb/s. 1 integer processor ≈ 100KBytes DRAM; 1 FP processor ≈ 500KBytes DRAM; 1 vector unit ≈ 1 MByte DRAM.
36. IRAM Vision Statement
- Microprocessor and DRAM on a single chip
  - on-chip memory latency 5-10X, bandwidth 50-100X
  - improve energy efficiency 2X-4X (no off-chip bus)
  - serial I/O 5-10X vs. buses
  - smaller board area/volume
  - adjustable memory size/width
[Figure: conventional system - processor with L2 cache connected over buses to off-chip DRAM - vs. IRAM, with processor and DRAM merged on one chip in a logic fab.]
37. Potential Multimedia Architecture
- New model: VSIW (Very Short Instruction Word!)
  - Compact: describe N operations with 1 short instruction (see the sketch below)
  - Predictable (real-time) performance vs. statistical performance (cache)
  - Multimedia ready: choose N x 64b, 2N x 32b, or 4N x 16b operations
  - Easy to get high performance
  - Compiler technology already developed, for sale!
  - Don't have to write all programs in assembly language
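A hedged sketch of the "N operations in one short instruction" point (the vector form is pseudo-code, not VIRAM's actual ISA):

    /* Scalar version: the hardware fetches, decodes, and issues one add
     * per element - 64 trips through the pipeline front end. */
    void add64(double *c, const double *a, const double *b) {
        for (int i = 0; i < 64; i++)
            c[i] = a[i] + b[i];
    }

    /* VSIW/vector version (pseudo-intrinsics, illustrative only): one
     * short instruction describes all 64 operations, and the same vector
     * registers can hold N=64 64-bit, 2N=128 32-bit, or 4N=256 16-bit
     * elements:
     *
     *     set_vector_length(64);
     *     vadd(c, a, b);          // 64 adds from one instruction
     */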
38. Revive Vector (= VSIW) Architecture!
- Cost? $1M each? -> Single-chip CMOS MPU/IRAM
- Low latency, high BW memory system? -> IRAM
- Code density? -> Much smaller than VLIW
- Compilers? -> For sale, mature (>20 years)
- Performance? -> Easy to scale speed with technology
- Power/Energy? -> Parallel to save energy, keep performance
- Limited to scientific applications? -> Multimedia apps vectorizable too: N x 64b, 2N x 32b, 4N x 16b
39. V-IRAM1: 0.18 µm, Fast Logic, 200 MHz; 1.6 GFLOPS (64b) / 6.4 GOPS (16b) / 16MB
[Block diagram: 2-way superscalar processor (16K I cache, 16K D cache) with a vector instruction queue; vector unit with vector registers and load/store unit operating as 4 x 64, 8 x 32, or 16 x 16; serial I/O; memory crossbar switch with 4 x 64 links to a large array of DRAM banks (M).]
40. Tentative VIRAM-1 Floorplan
- 0.18 µm DRAM: 16-32 MB in 16 banks x 256b
- 0.18 µm, 5-metal logic
- 200 MHz MIPS IV, 16K I cache, 16K D cache
- 4 x 200 MHz FP/int. vector units
- die: 20x20 mm
- xtors: 130-250M
- power: 2 Watts
[Floorplan: two memory blocks (128 Mbits / 16 MBytes each) flanking a ring-based switch and I/O.]
41. Tentative VIRAM-0.25 Floorplan
- Demonstrate scalability via a 2nd layout (automatic from the 1st)
- 8 MB in 2 banks x 256b, 32 subbanks
- 200 MHz CPU, 8K I cache, 8K D cache
- 1 x 200 MHz FP/int. vector unit
- die: 5 x 20 mm
- xtors: 70M
- power: 0.5 Watts

Kernel       GOPS V-1   GOPS V-0.25
Comp.        6.40       1.6
iDCT         3.10       0.8
Clr. Conv.   2.95       0.8
Convol.      3.16       0.8
FP Matrix    3.19       0.8
42. Stanford Hydra Design
- Single-chip multiprocessor
- Four processors
- Separate primary caches
- Write-through data caches to maintain coherence
- Shared 2nd-level cache
- Separate read and write busses
- Data Speculation Support
43. Mescal Architecture
- Scott Weber
- University of California at Berkeley
44. Outline
- Architecture rationale and motivation
- Architecture goals
- Architecture template
- Processing elements
- Multiprocessor architecture
- Communication architecture
45. Architectural Rationale and Motivation
- Configurable processors have shown orders-of-magnitude performance improvements
  - Tensilica has shown 2x to 50x performance improvements
  - Specialized functional units
  - Memory configurations
- Tensilica matches the architecture with software development tools
46. Architectural Rationale and Motivation
- In order to continue this performance-improvement trend:
  - Architectural features which exploit more concurrency are required
  - Heterogeneous configurations need to be made possible
  - Software development tools must support new configuration options
[Figure callouts: "...concurrent processes are required in order to continue performance improvement trend...", "...begins to look like a VLIW...", "...generic mesh may not suit the application's topology..."]
47. Architecture Goals
- Provide a template for the exploration of a range of architectures
  - Retarget compiler and simulator to the architecture
  - Enable the compiler to exploit the architecture
- Concurrency
  - Multiple instructions per processing element
  - Multiple threads per and across processing elements
  - Multiple processes per and across processing elements
- Support for efficient computation
  - Special-purpose functional units, intelligent memory, processing elements
- Support for efficient communication
  - Configurable network topology
  - Combined shared memory and message passing
48. Architecture Template
- Prototyping template for an array of processing elements
  - Configure processing elements for efficient computation
  - Configure memory elements for efficient retiming
  - Configure the network topology for efficient communication
[Figure callouts: "...configure memory elements...", "...configure PE...", "...configure PEs and network to match the application..."]
49. Range of Architectures
- Scalar configuration
- EPIC configuration
- EPIC with special FUs
- Mesh of HPL-PD PEs
- Customized PEs, network
- Supports a family of architectures
- Plan to extend the family with the micro-architectural features presented
(Slides 50-53 repeat this list as the build progresses from the scalar configuration to customized PEs and network.)
54. Range of Architectures (Future)
- Template support for such an architecture
- Prototype architecture
- Software development tools generated
  - Generate compiler
  - Generate simulator
[Figure: IXP1200 Network Processor (Intel)]
55. The RAW Architecture
- Slides prepared by Manish Vachhrajani
56. Outline
- RAW architecture
- Overview
- Features
- Benefits and Disadvantages
- Compiling for RAW
- Overview
- Structure of the compiler
- Basic block compilation
- Other techniques
57. RAW Machine Overview
- Scalable architecture without global interconnect
- Constructed from replicated tiles
- Each tile has a µP and a switch
- Interconnect via a static and a dynamic network
58. RAW Tiles
- Simple 5-stage pipelined µP with local PC (MIMD)
- Can contain configurable logic
- Per-tile IMEM and DMEM, unlike other modern architectures
- µP contains instructions to send and receive data
[Tile diagram: µP with PC, REGS, IMEM, and DMEM, plus configurable logic (CL); switch with its own PC and SMEM.]
59. RAW Tiles (cont.)
- Tiles have local switches
  - Implemented with a stripped-down µP
- Static network
  - Fast, easy to implement
  - Need to know data transfers, source, and destination at compile time
- Dynamic network
  - Much slower and more complex
  - Allows for messages whose route is not known at compile time
60. Configurable Hardware in RAW
- Each tile contains its own configurable hardware
- Each tile has several ALUs and logic gates that can operate at bit/byte/word levels
- Configurable interconnect to wire components together
- Coarser than FPGA-based implementations
61. Benefits of RAW
- Scalable
  - Each tile is simple and replicated
  - No global wiring, so it will scale even if wire delay doesn't
  - Short wires and simple tiles allow higher clock rates
- Can target many forms of parallelism
- Ease of design
  - Replication reduces design overhead
  - Tiles are relatively simple designs
  - Simplicity makes verification easier
62. Disadvantages of RAW
- Complex compilation
  - Full space-time compilation
- Distributed memory system
  - Need sophisticated memory analysis to resolve static references
- Software complexity
  - Low-level code is complex and difficult to examine and write by hand
- Code size?
63. Traditional Operations on RAW
- How does one exploit the RAW architecture across function calls, especially in libraries?
- Can we easily maintain portability with different tile counts?
- Memory protection and OS services
  - Context switch overhead
  - Load on the dynamic network for memory protection and virtual memory?
64. Compiling for RAW Machines
- Determine available parallelism
- Determine placement of memory items
- Discover memory constraints
  - Dependencies between parallel threads
- Disambiguate memory references to allow for static access to data elements
- Trade off memory dependence and parallelism
65. Compiling for RAW (cont.)
- Generate route instructions for switches
  - static network only
- Generate message handlers for dynamic events
  - Speculative execution
  - Unpredictable memory references
- Optimal partitioning algorithm is NP-complete
66. Structure of RAWCC
- Partition data to increase static accesses
- Partition instructions to allow parallel execution
- Allocate data to tiles to minimize communication overhead
[Compiler flow: source language -> build CFG -> traditional dataflow optimizations -> MAPS system -> space-time scheduler -> RAW executable.]
67. The MAPS System
- Manages memory to generate static promotions of data structures
- For loop accesses to arrays, uses modulo unrolling (see the sketch below)
- For data structures, uses the SPAN analysis package to identify potential references and partition memory
  - structures can be split across processing elements
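A sketch of what modulo unrolling buys (an invented example; it assumes 4 memory banks with low-order interleaving so that a[i] lives on bank i mod 4, a[0] resides on bank 0, and n is a multiple of 4):

    /* Original loop: the bank holding a[i] is unknown at compile time,
     * so every access would need the dynamic network. */
    void scale(int *a, int n) {
        for (int i = 0; i < n; i++)
            a[i] *= 2;
    }

    /* After modulo unrolling by the bank count (4): each reference in
     * the body touches one fixed bank, so its access is static and can
     * be scheduled on the static network. */
    void scale_unrolled(int *a, int n) {
        for (int i = 0; i < n; i += 4) {
            a[i]     *= 2;   /* always bank 0 */
            a[i + 1] *= 2;   /* always bank 1 */
            a[i + 2] *= 2;   /* always bank 2 */
            a[i + 3] *= 2;   /* always bank 3 */
        }
    }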
68. Space-Time Scheduler
- For basic blocks
  - Maps instructions to processors
  - Maps scalar data to processors
  - Generates communication instructions
  - Schedules computation and communication
- For the overall CFG, performs control localization
69. Basic Block Orchestrator
- All values are copied from their home tiles to the tiles that work on the data
- Within a block, all accesses are local
- At the end of a block, values are copied back to home tiles
[Flow: initial code transformation -> instruction partitioner -> global data partitioner -> data & instruction placer -> event scheduler -> communication code generator.]
70. Initial Code Transformation
- Convert block to static single assignment (SSA) form
  - removes false dependencies
  - analogous to register renaming (see the sketch below)
- Live-on-entry and live-on-exit variables are marked with dummy instructions
  - allows overlap of stitch code with useful work
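A small illustration of why SSA removes false dependencies (an invented example, analogous to the register-renaming point above):

    /* Before SSA: x is redefined, creating false (output/anti)
     * dependencies that serialize two independent computations. */
    void before_ssa(int a, int b, int c, int d, int *y, int *z) {
        int x;
        x = a + b;      /* first definition of x */
        *y = x * 2;
        x = c + d;      /* redefinition must appear to wait */
        *z = x * 3;
    }

    /* After SSA renaming: each definition gets a unique name, so the
     * (x1, y) and (x2, z) chains are visibly independent and can be
     * placed on different tiles or scheduled in parallel. */
    void after_ssa(int a, int b, int c, int d, int *y, int *z) {
        int x1 = a + b;
        int x2 = c + d;
        *y = x1 * 2;
        *z = x2 * 3;
    }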
71. Instruction Partitioner
- Partitions the stream into multiple streams, one for each tile
- Clustering
  - Partitions instructions to minimize runtime, considering only communication
- Merging
  - Reduces cluster count to match tile count
  - Uses a heuristic-based algorithm to achieve good balance and low communication overhead
72. Global Data Partitioner
- Partitions global data for assignment to home locations
  - Local data is copied at the start of a basic block
- Summarizes the instruction stream's data access pattern with affinities
- Maps instructions and data to virtual processors
  - Maps instructions, optimally places data based on affinity
  - Remaps instructions with data placement knowledge
  - Repeats until a local minimum is reached
- Only real data are mapped, not dummies formed in the ICT
73. Data and Instruction Placer
- Places data items onto physical tiles
  - driven by static data items
- Places instructions onto tiles
  - Uses data information to determine cost
  - Takes into account an actual model of the communications network
- Uses a swap-based greedy allocation
74. Event Scheduler
- Schedules routing instructions as well as computation instructions in a basic block
- Schedules instructions using a greedy list-based scheduler
- Switch schedule is ensured to be deadlock-free
  - Allows tolerance of dynamic events
75. Control Flow
- Control localization
  - Certain branches are enveloped in macro instructions, and the surrounding blocks merged
  - Allows a branch to occur on only one tile
- Global branching
  - Done through target broadcast and local branching
76. Performance
- RAW achieves anywhere from 1.5x to 9x speedup depending on application and tile count
- Applications tested were particularly well suited to RAW
  - Heavily dependent integer programs may do poorly (encryption, etc.)
- Depends on its ability to statically schedule and localize memory accesses
77. Future Work
- Use multisequential execution to run multiple applications simultaneously
  - Allow static communication between threads known at compile time
  - Minimize dynamic overhead otherwise
- Target ILP across branches more aggressively
- Explore configurability vs. parallelism in RAW
78. Reconfigurable Processors
- Adapt the processor to the application
  - special function units
  - special wiring between function units
- Builds on FPGA technology
- FPGAs are inefficient
  - a multiplier built from an FPGA is about 100x larger and 10x slower than a custom multiplier
- Need to raise the granularity
  - configure ALUs, or whole processors
- Memory and communication are usually the bottleneck
  - not addressed by configuring a lot of ALUs
- Programming model
  - Difficult to program
  - Verilog
79. SCORE: Stream Computation Organized for Reconfigurable Execution
Eylon Caspi, Michael Chu, André DeHon, Randy Huang, Joseph Yeh, John Wawrzynek, Nicholas Weaver
80. Opportunity
- High-throughput, regular operations
  - can be mapped spatially onto FPGA-like substrates (programmable, spatial compute substrates)
  - achieving higher performance (throughput per unit area)
  - than conventional, programmable devices (e.g. processors)
81. Problem
- Only have raw devices
  - Solutions non-portable
  - Solutions do not scale to new hardware
- Device resources exposed to the developer
  - Little or no abstraction of implementations
- Composition of subcomponents hard/ad hoc
- No unifying computational model or run-time environment
82. Introduce SCORE
- Compute Model
- virtualizes RC hardware resources
- supports automatic scaling
- supports dynamic program requirements efficiently
- provides compositional semantics
- defines runtime environment for programs
83. Viewpoint
- SCORE (or something like it) is a necessary condition for enabling automatic exploitation of new RC hardware as it becomes available.
- Automatic exploitation is essential to making RC a long-term viable computing solution.
84. Outline
- Opportunity
- Problem
- Review
- related work
- enabling hardware
- Model
- execution
- programmer
- Preliminary Results
- Challenges and Questions ahead
85. Borrows heavily from...
- RC, RTR
- PFPGA
- Dataflow
- Streaming dataflow
- Multiprocessors
- Operating systems
- (see working paper)
- Tried to steal all the good ideas ;-)
  - build a coherent model
  - exploit strengths of RC
86. Enabling Hardware
- High-speed computational arrays
  - 250MHz, HSRA, FPGA'99
- Large, on-chip memories
  - 2Mbit, VLSI Symp. '99
  - allow microsecond reconfiguration
- Processor and FPGA hybrids
  - GARP, NAPA, Triscend, etc.
87. BRASS Architecture
88. Array Model
89. Platform Vision
- Hardware capacity scales up with each generation
  - Faster devices
  - More computation
  - More memory
- With SCORE, old programs should run on new hardware
  - and exploit the additional capacity automatically
90. Example SCORE Execution
91. Spatial Implementation
92. Serial Implementation
93. Summary: Elements of a Multiprocessing System
- General purpose/special purpose
- Granularity - capability of a basic module
- Topology - interconnection/communication geometry
- Nature of coupling - loose to tight
- Control-data mechanisms
- Task allocation and routing methodology
- Reconfigurability
  - Computation
  - Interconnect
- Programmer's model/language support/models of computation
- Implementation - IC, board, multiboard, networked
- Performance measures and objectives
After E. V. Krishnamurty - Chapter 5
94. Conclusions
- Portions of multi/parallel processing have become successful
  - Pipelining: ubiquitous
  - Superscalar: ubiquitous
  - VLIW: successful in DSP, multimedia - GPP?
- Silicon capability is re-invigorating multiprocessor research
  - GPP: Flash, Hydra, RAW
  - SPP: Intel IXP1200, IRAM/VIRAM, Mescal
- Reconfigurable computing has found a niche in wireless communications
- Problem of programming models, languages, computational models, etc. for multiprocessors still largely unsolved