Title: Alternate Architectures
1. Alternate Architectures
2. Architectures
- Extensive coverage of RISC architectures
- RISC vs. CISC debate
- Complex Instruction Set Computer (CISC)
- 1980s debate: why are so many instructions needed when only about 20 are used frequently?
- Reduced Instruction Set Computer (RISC)
- Today it is difficult to categorize an architecture as either RISC or CISC
- The lines are blurry: many architectures use both approaches
- RISC designs now include various complex, and sometimes more, instructions
- Register usage and load/store operations are more prominent (since transistors are cheap)
Innovation does not necessarily mean inventing a new wheel; it may be a simple case of figuring out the best way to use a wheel that already exists.
3. Flynn's Taxonomy
- In 1972, Michael Flynn proposed a way to categorize computer architectures
- Considers two factors:
- Number of instruction streams
- Number of data streams that flow into the processor
- Four main categories:

                               Single Data Stream   Multiple Data Streams
  Single Instruction Stream          SISD                  SIMD
  Multiple Instruction Streams       MISD                  MIMD
4. Examples of the Four Types
- SISD: Single Instruction, Single Data
- Single point of control
- Uniprocessor machines
- SIMD: Single Instruction, Multiple Data
- Single point of control; executes the same instruction simultaneously on multiple data values
- Array processors, vector processors
- MISD: Multiple Instruction, Single Data
- Multiple instruction streams operating on the same data stream
- MIMD: Multiple Instruction, Multiple Data
- Multiple control points; independent instruction and data streams
- Multiprocessors, parallel systems
- SIMD machines are simpler to design than MIMD, but less flexible
- SIMD must execute the SAME instruction simultaneously on every element (what about a conditional branch?)
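The conditional-branch question above is usually answered with predication: every SIMD lane executes both branch paths, and a mask selects each lane's result. A minimal pure-Python sketch (`simd_where` and `simd_abs` are illustrative names, not real SIMD intrinsics):

```python
def simd_where(mask, a, b):
    # Per-lane select: lane i takes a[i] where mask[i] is true, else b[i].
    return [x if m else y for m, x, y in zip(mask, a, b)]

def simd_abs(xs):
    # No per-lane branching: every lane runs the SAME instruction sequence.
    mask = [x < 0 for x in xs]   # one compare instruction across all lanes
    neg = [-x for x in xs]       # "then" path, computed by every lane
    pos = xs[:]                  # "else" path, also computed by every lane
    return simd_where(mask, neg, pos)
```

Both paths are always executed, which is exactly the flexibility cost relative to MIMD mentioned above.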
5. Flynn's Taxonomy Issues
- Very few, if any, applications of MISD machines
- Assumes parallelism is homogeneous
- All processors are identical
- A machine with 4 FP adders, 2 multipliers, and 1 integer unit can perform 7 simultaneous operations in parallel. Where does it fit?
- MIMD
- Any multiprocessor system falls here, but with no consideration for how processors are connected or how they view memory
- Proposed sub-classification mechanisms:
- Subdividing by shared memory or not
- Shared memory: just like in a uniprocessor system; global memory, shared variables
- Non-shared memory: each processor has its own separate memory bank/portion; processors communicate by message passing (expensive/slow)
- This is not a hardware classification; it is a memory programming model (system software)
- Bus-based or switched processors
6. MIMD Example
- Two major parallel architectural paradigms
- Symmetric Multiprocessors (SMPs)
- E.g., a dual-processor Intel PC; processors share memory
- Massively Parallel Processors (MPPs)
- E.g., the Cray T3E; non-shared memory
- MPPs house thousands of CPUs with hundreds of GB of memory (costing millions)
- To differentiate:
- MPP: many processors, distributed memory, communication via network
- SMP: few processors, shared memory, communication via memory
- Issues
- MPP: harder to program (communication between processors working on pieces of the problem)
- SMP: bottleneck when all processors attempt to access the same memory at the same time
Can the program be partitioned easily? If so, use an MPP. The application dictates the choice.
7. MIMD Example: Distributed/Cluster Computing
- Networked computers that work collaboratively to solve a problem
- NOW: Network of Workstations
- Heterogeneous workstations, used only when they have idle cycles
- Communication over the Internet
- E.g., an intranet can be used for control
- COW: Cluster of Workstations
- Similar, but a single entity is in charge
- Common software
- Access to one node gives access to all nodes
8. MIMD Example: Distributed/Cluster Computing (continued)
- DCPC: Dedicated Cluster Parallel Computer
- A collection of workstations specifically assembled to work on a given parallel computation
- Common software and file systems
- Managed by a single entity; communication over the Internet
- NOT used as ordinary workstations
- PoPC: Pile of PCs
- A cluster of dedicated heterogeneous hardware used to build a parallel system out of off-the-shelf components
- Large number of slow, cheap nodes vs. a DCPC (few expensive computers)
- Other examples: grid computing, ubiquitous computing
9. Extension of Flynn's Taxonomy
- Expanded to include SPMD (Single Program, Multiple Data)
- Multiple processors, each with its own data set and program memory
- The same program is loaded and executed on every processor, with synchronization points
- Different nodes execute separate instructions of the same program
- E.g., if myNodeNum == 1 do this, else do that
- SPMD is really a programming paradigm used on MIMD machines
- MIMD differs from SPMD:
- Processors can do entirely different things at the same time
- Used by supercomputers
- Data driven
- A von Neumann machine is instruction driven
- Characteristics of the data determine the sequence of processor events, not the instructions (more later)
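The myNodeNum idea above can be sketched as one function that every node runs; behavior diverges only on the node number. The names `spmd_program` and `my_node_num` are hypothetical, and a real system (e.g., MPI) would launch one copy per processor rather than a loop:

```python
def spmd_program(my_node_num, total_nodes, data):
    # SPMD: the SAME program on every node; the node number decides
    # which instructions of the program this node actually executes.
    if my_node_num == 0:
        return ("coordinator", None)        # node 0 does this...
    # ...every other node does that: sum its own strided slice of the data
    chunk = data[my_node_num - 1 :: total_nodes - 1]
    return ("worker", sum(chunk))

# Simulate 4 nodes all running the one program on a shared-out data set
results = [spmd_program(i, 4, list(range(12))) for i in range(4)]
```

The worker partial sums together cover the whole data set, which is where the synchronization points mentioned above would combine them.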
10. New Taxonomy
11. Parallel Multiprocessor Architectures
- Superscalar and VLIW
- Vector Processors
- Interconnection Networks
- Shared Memory Processors
- Distributed Computing
12. Superscalar and VLIW
- Superpipelining
- A pipeline stage requires less than half a clock cycle to execute
- Therefore two tasks can execute per external clock cycle (the internal clock runs twice as fast)
- Superscalar
- Allows multiple instructions to execute simultaneously in each cycle
- Analogy: adding a lane to a busy highway (i.e., adding extra hardware)
- Instruction fetch: multiple instructions are fetched at once
- Decoding unit: determines whether instructions are independent or whether a dependency exists
- Example: IBM RS/6000
- Instruction fetch unit plus 2 processors (a 6-stage FP pipeline and a 4-stage integer pipeline)
- IF is a 2-stage pipeline:
- 1st stage fetches packets of 4 instructions each
- 2nd stage delivers instructions to the appropriate processing unit
- Parallelism through both pipelining and replication
13. Superscalar and VLIW (continued)
- VLIW (Very Long Instruction Word)
- All optimization is done by the compiler
- No additional hardware is needed
- The compiler packs independent instructions into one long instruction word
- The word tells the execution units what to do
- The compiler arbitrates all dependencies
- Words typically contain 4-8 instructions, fixed at compile time
- Example: Intel Itanium
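One way to picture the compiler's packing job is a register-hazard check plus greedy bundling. This is an illustrative sketch, not the actual Itanium algorithm; instructions are modeled as hypothetical `(destination, sources)` register tuples:

```python
def independent(a, b):
    # Two instructions are independent when neither reads or writes a
    # register the other writes (no RAW, WAR, or WAW hazard).
    dest_a, srcs_a = a
    dest_b, srcs_b = b
    return dest_a != dest_b and dest_a not in srcs_b and dest_b not in srcs_a

def pack_vliw(instrs, width=4):
    # Greedy compile-time bundling: fill each long instruction word with
    # up to `width` mutually independent operations, in program order.
    bundles = []
    for ins in instrs:
        if bundles and len(bundles[-1]) < width and all(
                independent(ins, other) for other in bundles[-1]):
            bundles[-1].append(ins)      # fits in the current word
        else:
            bundles.append([ins])        # dependency or full word: start new
    return bundles
```

For example, `r6 <- f(r1)` cannot share a word with `r1 <- f(r2, r3)`, so it starts a second bundle.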
14. Vector Processors
- Specialized, heavily pipelined processors
- Perform efficient operations on entire vectors or matrices
- High degree of parallelism
- Applications: weather forecasting, image processing
- Utilize vector registers that hold several elements at a time
- Essentially FIFO queues
- Two categories:
- Register-register
- Operations use registers as source and destination
- Con: long vectors must be split to fit in the registers
- Memory-memory
- Operands move from memory directly to the ALU
- Con: large startup time (memory latency)
- Efficiency
- Fetch significantly fewer instructions (less decode, less control overhead, less memory bandwidth)
- Continuous source of data, enabling prefetching of values
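The register-register drawback above (long vectors split to fit the registers) is commonly called strip-mining; a sketch assuming a hypothetical 64-element vector register:

```python
VLEN = 64  # assumed vector-register length, in elements

def vector_add(a, b):
    # Register-register style: the long vectors are "strip-mined" into
    # VLEN-sized chunks that fit the vector registers; each chunk is
    # then handled by a single vector add instruction.
    out = []
    for i in range(0, len(a), VLEN):
        va = a[i:i + VLEN]                           # vector load
        vb = b[i:i + VLEN]                           # vector load
        out.extend(x + y for x, y in zip(va, vb))    # one VADD per chunk
    return out
```

A 1000-element add thus costs 16 vector instructions instead of 1000 scalar ones, which is the instruction-fetch saving the slide mentions.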
15. Interconnection Networks
- Each processor has its own memory but is allowed to access other processors' memories via the network
- Categorized by topology, routing strategy, and switching technique
- Message-passing efficiency is limited by:
- Bandwidth
- Message latency
- Transport latency
- Overhead
- Optimize the number of messages sent and the distance they travel
- Static or dynamic: whether the path between a pair of nodes can change
- Blocking or non-blocking: whether new connections can be made in the presence of other existing connections
- Processors are typically connected by static networks
- Processor-memory connections typically use dynamic networks
16Static Interconnection Networks
Star
Completely Connected
Linear Ring
Mesh Mesh Ring
Tree
3D HyperCube
17. Dynamic Interconnection Networks
- Bus-based networks
- Bandwidth bottleneck
- Switching networks
- Crossbar network
- Fully connected, non-blocking
- Allows simultaneous connections
- Difficult to manage at scale
18. Dynamic Interconnection Networks (continued)
- 2x2 switches
- 4 states: through, cross, upper broadcast, lower broadcast
- Advanced multistage interconnection networks
- Built with 2x2 switches
- Switches dynamically configure themselves to connect any CPU to any memory
- The number of switches and stages dictates the length of the communication channel
- Useful in loosely coupled distributed systems or in tightly coupled systems (controlling processor-to-memory communication)
- Blocking can occur, since each switch can only be in one state at a time
19. Multistage Networks: the Omega Network
- Interchange switches arranged in multiple stages
- Can CPU 00 communicate with Mem 00? How?
- Switches 1A and 2A set to through
- Can CPU 10 communicate with Mem 01? How?
- Switches 1A and 2A set to cross
- Can these two connections happen simultaneously?
- No: this Omega network is blocking
- A non-blocking network can be made by adding more switches and more stages
- Generally, n nodes require log2(n) stages with n/2 switches per stage
(Figure: the four 2x2 switch states - through, cross, upper broadcast, lower broadcast)
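The two routing questions above can be answered mechanically with destination-tag routing: before each stage the line address is perfect-shuffled (rotated left), and one destination bit, MSB first, picks the upper (0) or lower (1) switch output. A sketch for the 4-CPU, 2-stage network of the example (`omega_route` is an illustrative name):

```python
def omega_route(src, dst, n_bits=2):
    # Returns the switch setting used at each stage of an Omega network
    # of 2x2 switches, routing from CPU `src` to memory `dst`.
    mask = (1 << n_bits) - 1
    settings = []
    addr = src
    for k in range(n_bits):
        addr = ((addr << 1) | (addr >> (n_bits - 1))) & mask  # perfect shuffle
        want = (dst >> (n_bits - 1 - k)) & 1   # k-th destination bit, MSB first
        have = addr & 1                        # switch port we arrived on
        settings.append("through" if want == have else "cross")
        addr = (addr & ~1) | want              # leave on the chosen port
    return settings
```

This reproduces the slide: CPU 00 to Mem 00 needs both stages through, CPU 10 to Mem 01 needs both stages cross, and since both routes pass through the same first-stage switch, they cannot be satisfied simultaneously - the network blocks.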
20. Comparison of Interconnection Networks
- Buses are the simplest and most efficient when a moderate number of processors is involved
- A bus becomes a bottleneck if many processors make memory requests simultaneously
21. Shared Memory Processors
- Shared memory doesn't have to mean one continuous large memory
- There can be local memory for each processor, which it must share with the others
- Alternatively, local caches can be used with one single global memory
- Memory synchronization
- Uniform Memory Access (UMA)
- All memory accesses take the same time
- One pool of shared memory, connected through a bus or a switching network
- Non-Uniform Memory Access (NUMA)
- Each processor owns a piece of memory
- Nearby memory takes less time to access than memory further away, i.e., access time varies
- Cache coherency problems
- Private caches
22. Dataflow Computing
- An instruction is executed when the data it needs becomes available
- The textual order of the instructions has no effect on the order in which they are executed
- Flow is determined entirely by data dependencies
- No program counter, no shared data storage
- Data is passed from one instruction to the next
- Data flow graph
- Static dataflow architecture
- Data flows through a staged pipeline
23. Dataflow Computing (continued)
- Dynamic dataflow architecture
- Data is tagged with context information (an instruction tag)
- Each clock cycle, memory is searched
- If a complete set of tokens is found for an instruction, it executes
- Example dataflow program (computes k = n!):
  initial j <- n, k <- 1
  while j > 1 do
    new k <- k * j
    new j <- j - 1
  return k
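The firing rule above (execute as soon as a complete token set is matched) can be sketched as a toy scheduler; `dataflow_run` and the node encoding are illustrative, not a real machine's packet format:

```python
def dataflow_run(nodes, tokens):
    # nodes: list of (input_names, op, output_name). A node "fires" as
    # soon as ALL of its input tokens are present (the matching unit's
    # job), regardless of the order the nodes are listed in.
    pending = list(nodes)
    while pending:
        for node in pending:
            inputs, op, out = node
            if all(name in tokens for name in inputs):
                tokens[out] = op(*(tokens[n] for n in inputs))
                pending.remove(node)
                break
        else:
            raise RuntimeError("deadlock: no node can fire")
    return tokens

# Data dependencies alone order the computation of (a+b) * (a-b):
nodes = [
    (("s", "d"), lambda s, d: s * d, "result"),   # listed first, fires last
    (("a", "b"), lambda a, b: a + b, "s"),
    (("a", "b"), lambda a, b: a - b, "d"),
]
out = dataflow_run(nodes, {"a": 5, "b": 3})
```

The multiply is written first but fires last, since its input tokens `s` and `d` only appear after the add and subtract nodes fire.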
24. Dataflow Computing: Processing Units
- Enabling unit
- Consists of two units:
- Matching unit
- Accepts data tokens and stores them in memory
- Fetching unit
- If a node (task) with a particular token is activated, the token is extracted from memory and combined with others to make an executable packet
- Functional unit
- Computes the necessary output values and combines each result with destination addresses to form more data tokens
- These are sent back to the enabling unit
- No contention or cache coherency problems
25. Neural Networks
- Based on the parallel architecture of the human brain
- A network of processing elements (PEs), each handling one piece of a larger problem
- The difficulty lies in deciding which PEs to connect together, the weights for the connections, and the weight thresholds
- Neural networks are based on learning methods
- Therefore they can make mistakes
- Supervised or unsupervised learning
- Supervised: prior knowledge of the correct result (training phase)
- Unsupervised: no prior knowledge during training
- Examples: image classification, facial recognition, risk management, sales forecasting, customer research
- When large, it is impossible to understand how the network gets its result
- Do you trust something a human can't understand?
26. Future of Computing
- Boolean logic is the basis of standard computing
- Moore's Law can't last forever
- New technologies are needed:
- Optical or photonic computing: light beams
- Biological computing: living organisms
- DNA computing: DNA as software, enzymes as hardware
- Quantum computing
- Quantum Computing
- Quantum bits (qubits) instead of Boolean bits
- Qubits can be in multiple states simultaneously
- Think multiple possible values at one time, i.e., a 3-qubit register can hold the values 0-7 simultaneously
- Processing can then perform computations on all possible values simultaneously -> quantum parallelism
- 600 qubits mean 2^600 states (impossible on a standard architecture)
- Can perform everyday tasks, but show superiority in applications that exploit quantum parallelism
- Security applications
- Truly random numbers
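The 3-qubit register claim above can be made concrete: an n-qubit register is described by 2^n amplitudes, one per classical value, and an equal superposition holds all of them at once. A minimal sketch of that bookkeeping (not a quantum simulator):

```python
import math

# A 3-qubit register has 2**3 = 8 amplitudes, one for each classical
# value 0-7. In an equal superposition every value is present at once,
# each observed with probability |amplitude|**2 = 1/8.
n_qubits = 3
amp = 1 / math.sqrt(2 ** n_qubits)
state = [amp] * (2 ** n_qubits)          # amplitudes for values 0..7
probabilities = [a * a for a in state]   # must sum to 1
```

The same bookkeeping for 600 qubits would need 2^600 amplitudes, which is why the slide says a standard architecture cannot even represent such a state.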