1
Alternate Architectures
2
Architectures
  • Extensive coverage of RISC architectures
  • RISC vs. CISC debate
  • Complex Instruction Set Computer (CISC)
  • 1980s debate: why do we need so many instructions
    when only about 20 are used frequently?
  • Reduced Instruction Set Computer (RISC)
  • Today, it is difficult to categorize a machine as
    either RISC or CISC
  • Blurry lines: many architectures use both approaches
  • Various complex instructions, and sometimes more
    instructions overall, appear in RISC
  • Register usage and load/store designs are more
    prominent (since transistors are cheap)

Innovation does not necessarily mean inventing a
new wheel; it may be a simple case of figuring
out the best way to use a wheel that already
exists.
3
Flynn's Taxonomy
  • In 1972, Michael Flynn proposed a way to categorize
    computer architectures
  • It considers 2 factors
  • Number of instruction streams
  • Number of data streams that flow into the
    processor
  • Four main categories

                                Single Data Stream    Multiple Data Streams
  Single Instruction Stream            SISD                    SIMD
  Multiple Instruction Streams         MISD                    MIMD
4
Examples of Four Types
  • SISD: Single Instruction, Single Data
  • Single point of control
  • Uniprocessor machines
  • SIMD: Single Instruction, Multiple Data
  • Single point of control; executes the same
    instruction simultaneously on multiple data
    values
  • Array processors, vector processors
  • MISD: Multiple Instruction, Single Data
  • Multiple instruction streams operating on the
    same data stream
  • MIMD: Multiple Instruction, Multiple Data
  • Multiple control points, independent instruction
    and data streams
  • Multiprocessors, parallel systems
  • SIMD machines are simpler to design than MIMD, but
    less flexible
  • SIMD must execute the SAME instruction simultaneously
    on every data element (what about a conditional
    branch? see the sketch below)
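A minimal sketch of the SIMD idea, assuming NumPy is available: one
"instruction" is applied to many data values at once, and a conditional
branch has to become a per-element mask/select (predication), because
every lane must execute the same instruction.

    import numpy as np

    a = np.array([1.0, 2.0, 3.0, 4.0])
    b = np.array([10.0, 20.0, 30.0, 40.0])

    # SISD style: one instruction stream handles one data element at a time.
    sisd = [a[i] + b[i] for i in range(len(a))]

    # SIMD style: a single add is applied to every element in the same step.
    simd = a + b

    # Conditional branch problem: "if x > 2 then x*2 else x+1" is evaluated
    # on ALL lanes, and the result is selected per lane with a mask.
    mask = a > 2.0
    branched = np.where(mask, a * 2.0, a + 1.0)
    print(simd, branched)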

5
Flynn's Taxonomy: Issues
  • Very few, if any, applications of MISD machines
  • Assumes parallelism is homogeneous
  • All processors are identical
  • A machine with 4 FP adders, 2 multipliers, and 1
    integer unit can perform 7 simultaneous operations in
    parallel. Where does it fit?
  • MIMD
  • Any multiprocessor system falls here, but there is no
    consideration of how processors are connected or how
    they view memory
  • Proposed sub-classification mechanisms
  • Subdividing by shared memory or not
  • Shared memory: global memory and shared variables,
    just as in a uniprocessor system
  • Non-shared memory: each processor has its own separate
    memory bank/portion. Processors communicate by
    message passing (expensive/slow); see the sketch below
  • This is not a hardware classification; it is a memory
    programming model (system software)
  • Bus-based or switched processors
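A minimal sketch of the two programming models, using Python's
multiprocessing module as a stand-in for real parallel hardware: a shared
variable that several workers update directly, versus explicit message
passing between processes with separate memories.

    from multiprocessing import Process, Queue, Value

    def shared_worker(counter):
        # Shared-memory model: update a global, shared variable.
        with counter.get_lock():
            counter.value += 1

    def message_worker(inbox, outbox):
        # Message-passing model: no shared state; receive, compute, send back.
        item = inbox.get()
        outbox.put(item * 2)

    if __name__ == "__main__":
        counter = Value("i", 0)
        workers = [Process(target=shared_worker, args=(counter,)) for _ in range(4)]
        for w in workers: w.start()
        for w in workers: w.join()
        print("shared counter:", counter.value)    # 4

        inbox, outbox = Queue(), Queue()
        w = Process(target=message_worker, args=(inbox, outbox))
        w.start()
        inbox.put(21)                              # send a message
        print("message result:", outbox.get())     # 42, received as a message
        w.join()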

6
MIMD Example
  • Two major parallel architectural paradigms
  • Symmetric Multiprocessors (SMPs)
  • Dual-processor Intel PC; processors share memory
  • Massively Parallel Processors (MPPs)
  • Cray T3E, non-shared memory
  • MPPs house thousands of CPUs with hundreds of GB of
    memory (and cost millions of dollars)
  • To differentiate:
  • MPP: many processors, distributed memory,
    communication via network
  • SMP: few processors, shared memory,
    communication via memory
  • Issues
  • MPP: harder to program (processors must communicate
    to work on their pieces of the problem)
  • SMP: bottleneck when all processors attempt to
    access the same memory at the same time

Can the program be partitioned easily? If so,
use MPP. The application dictates the choice.
7
MIMD Example: Distributed/Cluster Computing
  • Networked computers that work collaboratively to
    solve a problem
  • NOW: Network of Workstations
  • Heterogeneous workstations; used only when they have
    idle cycles
  • Communication via the Internet
  • i.e., an intranet can be used for control
  • COW: Cluster of Workstations
  • Similar, but a single entity is in charge
  • Common software
  • Access to one node gives access to all nodes

8
MIMD Example Distributed/Cluster Computing
  • DCPC: Dedicated Cluster Parallel Computer
  • A collection of workstations specifically gathered
    to work on a given parallel computation
  • Common software and file systems
  • Managed by a single entity; communication via the
    Internet
  • NOT used as everyday workstations
  • PoPC: Pile of PCs
  • A cluster of dedicated heterogeneous hardware used
    to build a parallel system out of off-the-shelf
    components
  • Large number of slow, cheap nodes vs. DCPC
    (a few expensive computers)
  • Other examples: grid computing, ubiquitous
    computing

9
Extension of Flynn's Taxonomy
  • Expanded to include SPMD (single program,
    multiple data)
  • Each processor has its own data set and program
    memory
  • The same program is loaded and executed, with sync
    points
  • Different nodes execute separate instructions of
    the same program
  • If myNodeNum == 1 do this, else do that (see the
    sketch below)
  • Actually a programming paradigm used on MIMD machines
  • MIMD differs from SPMD
  • Processors can do different things at the same
    time
  • Supercomputers use MIMD
  • Data driven
  • The von Neumann machine is instruction driven
  • Here, the characteristics of the data, not the
    instructions, determine the sequence of processor
    events (more later)
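A minimal SPMD sketch, assuming mpi4py and an MPI runtime are installed
(the file name spmd_demo.py is just an example): every node runs this same
program but branches on its own node number, with a broadcast acting as a
sync point.

    # Run with:  mpiexec -n 4 python spmd_demo.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()            # this node's number
    size = comm.Get_size()            # total number of nodes

    if rank == 0:
        data = list(range(size))      # "if myNodeNum == 0 do this..."
    else:
        data = None                   # "...else do that"

    data = comm.bcast(data, root=0)   # sync point: everyone waits here
    partial = data[rank] * data[rank] # each node works on its own piece
    total = comm.reduce(partial, op=MPI.SUM, root=0)

    if rank == 0:
        print("sum of squares:", total)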

10
New Taxonomy
11
Parallel Multiprocessor Architectures
  • Superscalar and VLIW
  • Vector Processors
  • Interconnection Networks
  • Shared Memory Processors
  • Distributed Computing

12
Superscalar and VLIW
  • Superpipelining
  • A pipeline stage requires < ½ a clock cycle to
    execute
  • Therefore two tasks can be executed per external
    clock cycle (the internal clock is 2x as fast)
  • Superscalar
  • Allows multiple instructions to execute
    simultaneously in each cycle
  • Analogy: adding a lane to a busy highway (i.e., add
    extra hardware)
  • Instruction fetch: fetches multiple instructions at once
  • Decoding unit: determines whether two instructions
    are independent, i.e., whether any dependency exists
    (see the sketch below)
  • Ex: IBM RS/6000
  • Instruction fetch unit plus 2 processors (a 6-stage FP
    pipeline and a 4-stage integer pipeline)
  • IF is a 2-stage pipeline:
  • 1st stage fetches packets of 4 instructions each
  • 2nd stage delivers instructions to the
    appropriate processing unit
  • Parallelism through pipelining and replication
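A minimal sketch of the decoding unit's independence check: two
instructions can issue in the same cycle only if neither one reads or
writes a register that the other writes (no RAW, WAR, or WAW hazard). The
(dest, src1, src2) instruction format here is a made-up illustration, not
any real machine's encoding.

    def independent(i1, i2):
        d1, *srcs1 = i1
        d2, *srcs2 = i2
        raw = d1 in srcs2          # second reads what first writes
        war = d2 in srcs1          # second writes what first reads
        waw = d1 == d2             # both write the same register
        return not (raw or war or waw)

    print(independent(("r1", "r2", "r3"), ("r4", "r5", "r6")))  # True: can dual-issue
    print(independent(("r1", "r2", "r3"), ("r4", "r1", "r6")))  # False: RAW on r1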

13
Superscalar and VLIW
  • VLIW
  • All optimization is done by the compiler
  • No additional hardware is needed
  • Independent instructions are packed into one
    long instruction word
  • The word tells the execution units what to do
  • The compiler arbitrates all dependencies
  • Words typically contain 4-8 instructions, fixed at
    compile time (see the sketch below)
  • Ex: Intel Itanium
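A minimal sketch of the compile-time idea behind VLIW: greedily pack
consecutive, mutually independent instructions into one long instruction
word (here up to 4 slots), using the same hypothetical (dest, src1, src2)
format as above.

    def pack_vliw(instructions, slots=4):
        def conflicts(a, b):
            # dependent if either writes a register the other reads or writes
            return a[0] == b[0] or a[0] in b[1:] or b[0] in a[1:]
        words, current = [], []
        for ins in instructions:
            if len(current) < slots and not any(conflicts(ins, c) for c in current):
                current.append(ins)        # fits into the current long word
            else:
                words.append(current)      # start a new long word
                current = [ins]
        if current:
            words.append(current)
        return words

    prog = [("r1", "r2", "r3"), ("r4", "r5", "r6"),
            ("r7", "r1", "r4"), ("r8", "r9", "r9")]
    print(pack_vliw(prog))   # first two share a word; the third depends on r1 and r4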

14
Vector Processors
  • Specialized, heavily pipelined processors
  • Perform efficient operation on entire vectors or
    matrices
  • High degree of parallelism
  • Applications: weather forecasting, image
    processing
  • Utilize vector registers that hold several elements
    at a time
  • FIFO queues
  • Two categories
  • Register-register
  • Operations use registers as source and destination
  • Con: long vectors must be split (strip-mined) to
    fit in registers; see the sketch below
  • Memory-memory
  • Operands go from memory directly to the ALU
  • Con: large startup time (memory latency)
  • Efficient
  • Fetch significantly fewer instructions (less
    decode, less control overhead, less memory
    bandwidth)
  • Continuous source of data, and values can be
    prefetched
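A minimal strip-mining sketch, assuming NumPy and a made-up vector register
length of 64 elements: a long vector is processed in register-sized chunks,
so each chunk needs only one "vector add" rather than 64 scalar
fetch/decode/execute iterations.

    import numpy as np

    VLEN = 64                                   # assumed vector register length

    def vector_add(a, b):
        out = np.empty_like(a)
        for start in range(0, len(a), VLEN):    # strip-mining loop
            end = min(start + VLEN, len(a))
            # one register-register vector add on a chunk
            out[start:end] = a[start:end] + b[start:end]
        return out

    a = np.arange(1000.0)
    b = np.arange(1000.0)
    print(vector_add(a, b)[:5])                 # [0. 2. 4. 6. 8.]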

15
Interconnection Networks
  • Each processor has its own memory but is allowed to
    access other processors' memory via the network
  • Categorized by topology, routing strategy, and
    switching technique
  • Message-passing efficiency is limited by
  • Bandwidth
  • Message latency
  • Transport latency
  • Overhead
  • Optimize the number of messages to send and the
    distance they travel (see the latency sketch below)
  • Static or dynamic: whether the path between a pair
    can change
  • Blocking or non-blocking: whether new connections can
    be made in the presence of other existing connections
  • Processors are typically connected by static networks
  • Processor-memory connections use dynamic networks
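A rough, illustrative cost model for a single message; the constants are
made-up assumptions, not measurements. Total time grows with the software
overhead per message, with the number of hops it travels, and with message
size divided by bandwidth, which is why fewer messages over shorter
distances are preferred.

    def message_time(size_bytes, hops,
                     overhead_s=5e-6,        # software send/receive overhead
                     per_hop_s=1e-6,         # transport latency per hop
                     bandwidth_Bps=1e9):     # link bandwidth in bytes/second
        return overhead_s + hops * per_hop_s + size_bytes / bandwidth_Bps

    print(message_time(64 * 1024, hops=2))      # one 64 KB message over 2 hops
    print(8 * message_time(8 * 1024, hops=6))   # eight 8 KB messages over 6 hops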

16
Static Interconnection Networks
(Figures: star, completely connected, linear array, ring,
mesh, mesh ring, tree, and 3D hypercube topologies)
17
Dynamic Interconnection Networks
  • Bus-based network
  • Bandwidth bottleneck
  • Switching networks
  • Crossbar network
  • Fully connected, Non-blocking
  • Simultaneous connections
  • Difficult to manage

18
Dynamic Interconnection Networks
  • 2x2 switches
  • 4 states: through, cross,
    upper broadcast, or lower broadcast
  • Advanced multistage interconnection networks
  • Built with 2x2 switches
  • Switches dynamically configure themselves to connect
    any CPU to any memory module
  • The switches and the number of stages dictate the
    length of the communication channel
  • Useful in loosely coupled distributed systems or
    in tightly coupled systems (to control processor-to-
    memory communication)
  • Blocking can occur, since a switch can only be in
    one state at a time

19
Multistage Networks: the Omega Network
  • Interchange switches arranged in multiple stages
  • Can CPU 00 communicate with Mem 00? How?
  • Switches 1A and 2A set to through
  • Can CPU 10 communicate with Mem 01? How?
  • Switches 1A and 2A set to cross
  • Can these two happen simultaneously?
  • No; both paths need switches 1A and 2A, so this
    omega network is blocking
  • A non-blocking network can be made by adding more
    switches and more stages
  • Generally, n nodes require log2 n stages with
    n/2 switches per stage (see the routing sketch below)
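A minimal sketch of destination-tag routing through an omega network,
assuming the usual perfect-shuffle wiring and numbering the switches of
each stage 0, 1, ... (so switch 0 of the first stage plays the role of the
slide's 1A): at stage i the route leaves its switch through the upper
output if bit i of the destination (most significant first) is 0, and the
lower output if it is 1.

    def route(src, dst, n):
        k = n.bit_length() - 1                   # log2 n stages
        settings, line = [], src
        for stage in range(k):
            line = ((line << 1) | (line >> (k - 1))) & (n - 1)   # perfect shuffle
            switch = line // 2
            bit = (dst >> (k - 1 - stage)) & 1                   # destination tag bit
            settings.append((stage, switch,
                             "through" if (line & 1) == bit else "cross"))
            line = 2 * switch + bit              # leave on the chosen output line
        return settings

    # The slide's 4-node example: CPU 00 -> Mem 00 and CPU 10 -> Mem 01.
    r1 = route(0b00, 0b00, 4)    # [(0, 0, 'through'), (1, 0, 'through')]
    r2 = route(0b10, 0b01, 4)    # [(0, 0, 'cross'),   (1, 0, 'cross')]
    # Both paths need the same switches in different states -> blocking.
    blocked = any(s1[:2] == s2[:2] and s1[2] != s2[2] for s1, s2 in zip(r1, r2))
    print(r1, r2, "blocked:", blocked)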

(Figure: the four 2x2 switch states - through, cross,
upper broadcast, lower broadcast)
20
Comparison of Interconnection Networks
  • Buses are the simplest and most efficient when the
    number of processors is moderate
  • The bus becomes a bottleneck if many processors make
    memory requests simultaneously.

21
Shared Memory Processors
  • Doesn't have to mean one continuous large memory
  • Each processor can have local memory, but must share
    it
  • Local caches could be used with a single global
    memory
  • Memory synchronization
  • Uniform Memory Access (UMA)
  • All accesses take the same time
  • One pool of shared memory connected through a bus or
    switched network
  • Non-Uniform Memory Access (NUMA)
  • Each processor has its own piece of memory
  • Nearby memory takes less time to access than memory
    farther away, i.e., access time varies
  • Cache coherency problems
  • Private caches

22
Dataflow Computing
  • An instruction executes as soon as the data it
    needs become available
  • The actual order of instructions has no effect on
    the order in which they are executed
  • Flow is determined entirely by data dependencies
  • No program counter, no shared data storage
  • Data are passed from one instruction to the next
  • Data flow graph
  • Static dataflow architecture
  • Data flow through a staged pipeline

23
Dataflow Computing
  • Dynamic dataflow architecture
  • Data are tagged with context information (instruction
    tag)
  • Each clock cycle, memory is searched
  • If a complete operand set is found for an instruction,
    it executes

(initial j <- n; k <- 1
 while j > 1 do
   new k <- k * j
   new j <- j - 1
 return k)            (this loop computes k = n!)
24
Dataflow Computing: Processing Units
  • Enabling unit
  • Composed of two units
  • Matching unit
  • Accepts data tokens and stores them in memory
  • Fetching unit
  • If the node (task) with a particular token is
    activated, the token is extracted from memory and
    combined to make an executable packet
  • Functional unit
  • Computes the necessary output values and combines
    each result with destination addresses to form more
    data tokens
  • These are sent back to the enabling unit
  • No contention or cache coherency problems (see the
    sketch below)
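A minimal sketch of this enabling/functional loop: tokens carry the tag of
their destination node, the matching unit stores them until a node has a
complete operand set, and the functional unit then fires the node and sends
the results back as new tokens. The tiny graph below, which computes
(a + b) * (c - d), is an invented example.

    import operator

    graph = {                  # node: (operation, number of inputs, destination)
        "add": (operator.add, 2, "mul"),
        "sub": (operator.sub, 2, "mul"),
        "mul": (operator.mul, 2, None),    # None marks the final result
    }

    def run(tokens):
        waiting = {}                                 # the matching unit's memory
        while tokens:
            node, value = tokens.pop(0)              # accept a data token
            waiting.setdefault(node, []).append(value)
            op, arity, dest = graph[node]
            if len(waiting[node]) == arity:          # complete set found: fire
                result = op(*waiting.pop(node))
                if dest is None:
                    return result
                tokens.append((dest, result))        # new token to the enabling unit

    # Initial tokens: a=2 and b=3 feed "add"; c=10 and d=4 feed "sub".
    print(run([("add", 2), ("add", 3), ("sub", 10), ("sub", 4)]))   # (2+3)*(10-4) = 30

Note that nothing here consults a program counter; the firing order depends
only on which tokens have arrived.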

25
Neural Networks
  • Based on the parallel architecture of the human brain
  • A network of processing elements (PEs), each handling
    one piece of a larger problem
  • Difficulty lies in choosing which PEs to connect
    together, the weights for the connections, and the
    weight thresholds
  • Neural networks are based on learning methods
  • Therefore they can make mistakes
  • Supervised or unsupervised learning (see the sketch
    below)
  • Supervised: prior knowledge of the correct result
    (training phase)
  • Unsupervised: no prior knowledge during training
  • Ex: image classification, facial recognition,
    risk management, sales forecasting, customer
    research
  • When large, it is impossible to understand how the
    network gets its result
  • Do you trust something a human can't understand?
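A minimal sketch of a single processing element (a perceptron) trained with
supervised learning: the connection weights and threshold (here a bias) are
adjusted from examples whose correct answers are known in advance. The AND
function is used as an invented toy data set.

    def train(samples, epochs=10, lr=0.1):
        w = [0.0, 0.0]
        bias = 0.0                                 # learned threshold
        for _ in range(epochs):
            for (x1, x2), target in samples:
                out = 1 if w[0] * x1 + w[1] * x2 + bias > 0 else 0
                err = target - out                 # supervised: correct answer known
                w[0] += lr * err * x1
                w[1] += lr * err * x2
                bias += lr * err
        return w, bias

    and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
    w, b = train(and_samples)
    for (x1, x2), target in and_samples:
        print(x1, x2, "->", 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0)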

26
Future of Computing
  • Boolean logic is the basis of standard computing
  • Moore's Law can't last forever
  • New technologies are needed
  • Optical or photonic computing: light beams
  • Biological computing: living organisms
  • DNA computing: DNA as software, enzymes as hardware
  • Quantum computing
  • Quantum Computing
  • Quantum bits (qubits) instead of Boolean bits
  • Qubits can be in multiple states simultaneously
  • Think multiple possible values at one time, i.e., a
    3-qubit register can hold the values 0-7 simultaneously
    (see the sketch below)
  • Processing can then perform computations on all
    possible values simultaneously -> quantum
    parallelism
  • 600 qubits mean 2^600 states (can't be done in a
    standard architecture)
  • Can perform everyday tasks, but shows superiority
    in applications that exploit quantum
    parallelism
  • Security applications
  • Truly random numbers
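A minimal classical sketch, assuming NumPy is available, of why a 3-qubit
register "holds" the values 0-7 at once: its state is a vector of 2^3 = 8
amplitudes, and the squared magnitudes give the measurement probabilities.
Simulating 600 qubits this way would need 2^600 amplitudes, which is why
quantum parallelism cannot be reproduced on a standard architecture.

    import numpy as np

    n = 3
    state = np.ones(2 ** n) / np.sqrt(2 ** n)   # equal superposition of 0..7
    probabilities = np.abs(state) ** 2          # each value measured with prob 1/8
    print(probabilities)                        # [0.125 0.125 ... 0.125]
    print("600 qubits would need", float(2 ** 600), "amplitudes")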