1
Alternate Architectures
2
Architectures
  • Extensive coverage of RISC architectures
  • RISC vs. CISC debate
  • Complex Instruction Set Computer (CISC)
  • 1980s debate: why do we need so many instructions
    when only about 20 are used frequently?
  • Reduced Instruction Set Computer (RISC)
  • Today, it is difficult to categorize a machine as
    either RISC or CISC
  • Blurry lines: many architectures use both approaches
  • Various complex instructions, and sometimes more
    instructions overall, appear in RISC
  • Register usage and load/store designs are more
    prominent (since transistors are cheap)

Innovation does not necessarily mean inventing a
new wheel; it may be a simple case of figuring
out the best way to use a wheel that already
exists.
3
Flynn's Taxonomy
  • In 1972, Michael Flynn proposed a way to categorize
    computer architectures
  • It considers 2 factors
  • Number of instruction streams
  • Number of data streams that flow into the
    processor
  • Four main categories

                                Single Data Stream    Multiple Data Streams
  Single Instruction Stream            SISD                    SIMD
  Multiple Instruction Streams         MISD                    MIMD
4
Examples of Four Types
  • SISD: Single Instruction, Single Data
  • Single point of control
  • Uniprocessor machines
  • SIMD: Single Instruction, Multiple Data
  • Single point of control; executes the same
    instruction simultaneously on multiple data
    values
  • Array processors, vector processors
  • MISD: Multiple Instruction, Single Data
  • Multiple instruction streams operating on the
    same data stream
  • MIMD: Multiple Instruction, Multiple Data
  • Multiple control points, independent instruction
    and data streams
  • Multiprocessors, parallel systems
  • SIMD machines are simpler to design than MIMD, but
    less flexible
  • SIMD must execute the SAME instruction simultaneously
    on every data element (what about a conditional
    branch? see the sketch below)
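A minimal sketch of the SIMD idea, assuming NumPy is available: one
"instruction" is applied to many data values at once, and a conditional
branch has to become a per-element mask/select (predication), because
every lane must execute the same instruction.

    import numpy as np

    a = np.array([1.0, 2.0, 3.0, 4.0])
    b = np.array([10.0, 20.0, 30.0, 40.0])

    # SISD style: one instruction stream handles one data element at a time.
    sisd = [a[i] + b[i] for i in range(len(a))]

    # SIMD style: a single add is applied to every element in the same step.
    simd = a + b

    # Conditional branch problem: "if x > 2 then x*2 else x+1" is evaluated
    # on ALL lanes, and the result is selected per lane with a mask.
    mask = a > 2.0
    branched = np.where(mask, a * 2.0, a + 1.0)
    print(simd, branched)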

5
Flynn's Taxonomy: Issues
  • Very few, if any, applications of MISD machines
  • Assumes parallelism is homogeneous
  • All processors are identical
  • A machine with 4 FP adders, 2 multipliers, and 1
    integer unit can perform 7 simultaneous operations in
    parallel. Where does it fit?
  • MIMD
  • Any multiprocessor system falls here, but there is no
    consideration of how processors are connected or how
    they view memory
  • Proposed sub-classification mechanisms
  • Subdividing by shared memory or not
  • Shared memory: global memory and shared variables,
    just as in a uniprocessor system
  • Non-shared memory: each processor has its own separate
    memory bank/portion. Processors communicate by
    message passing (expensive/slow); see the sketch below
  • This is not a hardware classification; it is a memory
    programming model (system software)
  • Bus-based or switched processors
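A minimal sketch of the two programming models, using Python's
multiprocessing module as a stand-in for real parallel hardware: a shared
variable that several workers update directly, versus explicit message
passing between processes with separate memories.

    from multiprocessing import Process, Queue, Value

    def shared_worker(counter):
        # Shared-memory model: update a global, shared variable.
        with counter.get_lock():
            counter.value += 1

    def message_worker(inbox, outbox):
        # Message-passing model: no shared state; receive, compute, send back.
        item = inbox.get()
        outbox.put(item * 2)

    if __name__ == "__main__":
        counter = Value("i", 0)
        workers = [Process(target=shared_worker, args=(counter,)) for _ in range(4)]
        for w in workers: w.start()
        for w in workers: w.join()
        print("shared counter:", counter.value)    # 4

        inbox, outbox = Queue(), Queue()
        w = Process(target=message_worker, args=(inbox, outbox))
        w.start()
        inbox.put(21)                              # send a message
        print("message result:", outbox.get())     # 42, received as a message
        w.join()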

6
MIMD Example
  • Two major parallel architectural paradigms
  • Symmetric Multiprocessors (SMPs)
  • Dual-processor Intel PC; processors share memory
  • Massively Parallel Processors (MPPs)
  • Cray T3E, non-shared memory
  • MPPs house thousands of CPUs with hundreds of GB of
    memory (and cost millions of dollars)
  • To differentiate:
  • MPP: many processors, distributed memory,
    communication via network
  • SMP: few processors, shared memory,
    communication via memory
  • Issues
  • MPP: harder to program (processors must communicate
    to work on their pieces of the problem)
  • SMP: bottleneck when all processors attempt to
    access the same memory at the same time

Can the program be partitioned easily? If so,
use MPP. The application dictates the choice.
7
MIMD Example: Distributed/Cluster Computing
  • Networked computers that work collaboratively to
    solve a problem
  • NOW: Network of Workstations
  • Heterogeneous workstations; used only when they have
    idle cycles
  • Communication via the Internet
  • i.e., an intranet can be used for control
  • COW: Cluster of Workstations
  • Similar, but a single entity is in charge
  • Common software
  • Access to one node gives access to all nodes

8
MIMD Example Distributed/Cluster Computing
  • DCPC: Dedicated Cluster Parallel Computer
  • A collection of workstations specifically gathered
    to work on a given parallel computation
  • Common software and file systems
  • Managed by a single entity; communication via the
    Internet
  • NOT used as everyday workstations
  • PoPC: Pile of PCs
  • A cluster of dedicated heterogeneous hardware used
    to build a parallel system out of off-the-shelf
    components
  • Large number of slow, cheap nodes vs. DCPC
    (a few expensive computers)
  • Other examples: grid computing, ubiquitous
    computing

9
Extension of Flynn's Taxonomy
  • Expanded to include SPMD (single program,
    multiple data)
  • Each processor has its own data set and program
    memory
  • The same program is loaded and executed, with sync
    points
  • Different nodes execute separate instructions of
    the same program
  • If myNodeNum == 1 do this, else do that (see the
    sketch below)
  • Actually a programming paradigm used on MIMD machines
  • MIMD differs from SPMD
  • Processors can do different things at the same
    time
  • Supercomputers use MIMD
  • Data driven
  • The von Neumann machine is instruction driven
  • Here, the characteristics of the data, not the
    instructions, determine the sequence of processor
    events (more later)
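A minimal SPMD sketch, assuming mpi4py and an MPI runtime are installed
(the file name spmd_demo.py is just an example): every node runs this same
program but branches on its own node number, with a broadcast acting as a
sync point.

    # Run with:  mpiexec -n 4 python spmd_demo.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()            # this node's number
    size = comm.Get_size()            # total number of nodes

    if rank == 0:
        data = list(range(size))      # "if myNodeNum == 0 do this..."
    else:
        data = None                   # "...else do that"

    data = comm.bcast(data, root=0)   # sync point: everyone waits here
    partial = data[rank] * data[rank] # each node works on its own piece
    total = comm.reduce(partial, op=MPI.SUM, root=0)

    if rank == 0:
        print("sum of squares:", total)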

10
New Taxonomy
11
Parallel Multiprocessor Architectures
  • Superscalar and VLIW
  • Vector Processors
  • Interconnection Networks
  • Shared Memory Processors
  • Distributed Computing

12
Superscalar and VLIW
  • Superpipelining
  • A pipeline stage requires < ½ a clock cycle to
    execute
  • Therefore two tasks can be executed per external
    clock cycle (the internal clock is 2x as fast)
  • Superscalar
  • Allows multiple instructions to execute
    simultaneously in each cycle
  • Analogy: adding a lane to a busy highway (i.e., add
    extra hardware)
  • Instruction fetch: fetches multiple instructions at once
  • Decoding unit: determines whether two instructions
    are independent, i.e., whether any dependency exists
    (see the sketch below)
  • Ex: IBM RS/6000
  • Instruction fetch unit plus 2 processors (a 6-stage FP
    pipeline and a 4-stage integer pipeline)
  • IF is a 2-stage pipeline:
  • 1st stage fetches packets of 4 instructions each
  • 2nd stage delivers instructions to the
    appropriate processing unit
  • Parallelism through pipelining and replication
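A minimal sketch of the decoding unit's independence check: two
instructions can issue in the same cycle only if neither one reads or
writes a register that the other writes (no RAW, WAR, or WAW hazard). The
(dest, src1, src2) instruction format here is a made-up illustration, not
any real machine's encoding.

    def independent(i1, i2):
        d1, *srcs1 = i1
        d2, *srcs2 = i2
        raw = d1 in srcs2          # second reads what first writes
        war = d2 in srcs1          # second writes what first reads
        waw = d1 == d2             # both write the same register
        return not (raw or war or waw)

    print(independent(("r1", "r2", "r3"), ("r4", "r5", "r6")))  # True: can dual-issue
    print(independent(("r1", "r2", "r3"), ("r4", "r1", "r6")))  # False: RAW on r1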

13
Superscalar and VLIW
  • VLIW
  • All optimization is done by the compiler
  • No additional hardware is needed
  • Independent instructions are packed into one
    long instruction word
  • The word tells the execution units what to do
  • The compiler arbitrates all dependencies
  • Words typically contain 4-8 instructions, fixed at
    compile time (see the sketch below)
  • Ex: Intel Itanium
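A minimal sketch of the compile-time idea behind VLIW: greedily pack
consecutive, mutually independent instructions into one long instruction
word (here up to 4 slots), using the same hypothetical (dest, src1, src2)
format as above.

    def pack_vliw(instructions, slots=4):
        def conflicts(a, b):
            # dependent if either writes a register the other reads or writes
            return a[0] == b[0] or a[0] in b[1:] or b[0] in a[1:]
        words, current = [], []
        for ins in instructions:
            if len(current) < slots and not any(conflicts(ins, c) for c in current):
                current.append(ins)        # fits into the current long word
            else:
                words.append(current)      # start a new long word
                current = [ins]
        if current:
            words.append(current)
        return words

    prog = [("r1", "r2", "r3"), ("r4", "r5", "r6"),
            ("r7", "r1", "r4"), ("r8", "r9", "r9")]
    print(pack_vliw(prog))   # first two share a word; the third depends on r1 and r4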

14
Vector Processors
  • Specialized, heavily pipelined processors
  • Perform efficient operation on entire vectors or
    matrices
  • High degree of parallelism
  • Applications: weather forecasting, image
    processing
  • Utilize vector registers that hold several elements
    at a time
  • FIFO queues
  • Two categories
  • Register-register
  • Operations use registers as source and destination
  • Con: long vectors must be split (strip-mined) to
    fit in registers; see the sketch below
  • Memory-memory
  • Operands go from memory directly to the ALU
  • Con: large startup time (memory latency)
  • Efficient
  • Fetch significantly fewer instructions (less
    decode, less control overhead, less memory
    bandwidth)
  • Continuous source of data, and values can be
    prefetched
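A minimal strip-mining sketch, assuming NumPy and a made-up vector register
length of 64 elements: a long vector is processed in register-sized chunks,
so each chunk needs only one "vector add" rather than 64 scalar
fetch/decode/execute iterations.

    import numpy as np

    VLEN = 64                                   # assumed vector register length

    def vector_add(a, b):
        out = np.empty_like(a)
        for start in range(0, len(a), VLEN):    # strip-mining loop
            end = min(start + VLEN, len(a))
            # one register-register vector add on a chunk
            out[start:end] = a[start:end] + b[start:end]
        return out

    a = np.arange(1000.0)
    b = np.arange(1000.0)
    print(vector_add(a, b)[:5])                 # [0. 2. 4. 6. 8.]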

15
Interconnection Networks
  • Each processor has its own memory but is allowed to
    access other processors' memory via the network
  • Categorized by topology, routing strategy, and
    switching technique
  • Message-passing efficiency is limited by
  • Bandwidth
  • Message latency
  • Transport latency
  • Overhead
  • Optimize the number of messages to send and the
    distance they travel (see the latency sketch below)
  • Static or dynamic: whether the path between a pair
    can change
  • Blocking or non-blocking: whether new connections can
    be made in the presence of other existing connections
  • Processors are typically connected by static networks
  • Processor-memory connections use dynamic networks
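A rough, illustrative cost model for a single message; the constants are
made-up assumptions, not measurements. Total time grows with the software
overhead per message, with the number of hops it travels, and with message
size divided by bandwidth, which is why fewer messages over shorter
distances are preferred.

    def message_time(size_bytes, hops,
                     overhead_s=5e-6,        # software send/receive overhead
                     per_hop_s=1e-6,         # transport latency per hop
                     bandwidth_Bps=1e9):     # link bandwidth in bytes/second
        return overhead_s + hops * per_hop_s + size_bytes / bandwidth_Bps

    print(message_time(64 * 1024, hops=2))      # one 64 KB message over 2 hops
    print(8 * message_time(8 * 1024, hops=6))   # eight 8 KB messages over 6 hops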

16
Static Interconnection Networks
(Figures: star, completely connected, linear array, ring,
mesh, mesh ring, tree, and 3D hypercube topologies)
17
Dynamic Interconnection Networks
  • Bus-based network
  • Bandwidth bottleneck
  • Switching networks
  • Crossbar network
  • Fully connected, Non-blocking
  • Simultaneous connections
  • Difficult to manage

18
Dynamic Interconnection Networks
  • 2x2 switches
  • 4 states: through, cross,
    upper broadcast, or lower broadcast
  • Advanced multistage interconnection networks
  • Built with 2x2 switches
  • Switches dynamically configure themselves to connect
    any CPU to any memory module
  • The switches and the number of stages dictate the
    length of the communication channel
  • Useful in loosely coupled distributed systems or
    in tightly coupled systems (to control processor-to-
    memory communication)
  • Blocking can occur, since a switch can only be in
    one state at a time

19
Multistage Networks: the Omega Network
  • Interchange switches arranged in multiple stages
  • Can CPU 00 communicate with Mem 00? How?
  • Switches 1A and 2A set to through
  • Can CPU 10 communicate with Mem 01? How?
  • Switches 1A and 2A set to cross
  • Can these two happen simultaneously?
  • No; both paths need switches 1A and 2A, so this
    omega network is blocking
  • A non-blocking network can be made by adding more
    switches and more stages
  • Generally, n nodes require log2 n stages with
    n/2 switches per stage (see the routing sketch below)
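A minimal sketch of destination-tag routing through an omega network,
assuming the usual perfect-shuffle wiring and numbering the switches of
each stage 0, 1, ... (so switch 0 of the first stage plays the role of the
slide's 1A): at stage i the route leaves its switch through the upper
output if bit i of the destination (most significant first) is 0, and the
lower output if it is 1.

    def route(src, dst, n):
        k = n.bit_length() - 1                   # log2 n stages
        settings, line = [], src
        for stage in range(k):
            line = ((line << 1) | (line >> (k - 1))) & (n - 1)   # perfect shuffle
            switch = line // 2
            bit = (dst >> (k - 1 - stage)) & 1                   # destination tag bit
            settings.append((stage, switch,
                             "through" if (line & 1) == bit else "cross"))
            line = 2 * switch + bit              # leave on the chosen output line
        return settings

    # The slide's 4-node example: CPU 00 -> Mem 00 and CPU 10 -> Mem 01.
    r1 = route(0b00, 0b00, 4)    # [(0, 0, 'through'), (1, 0, 'through')]
    r2 = route(0b10, 0b01, 4)    # [(0, 0, 'cross'),   (1, 0, 'cross')]
    # Both paths need the same switches in different states -> blocking.
    blocked = any(s1[:2] == s2[:2] and s1[2] != s2[2] for s1, s2 in zip(r1, r2))
    print(r1, r2, "blocked:", blocked)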

(Figure: the four 2x2 switch states - through, cross,
upper broadcast, lower broadcast)
20
Comparison of Interconnection Networks
  • Buses are the simplest and most efficient when the
    number of processors is moderate
  • The bus becomes a bottleneck if many processors make
    memory requests simultaneously.

21
Shared Memory Processors
  • Doesn't have to mean one continuous large memory
  • Each processor can have local memory, but must share
    it
  • Local caches could be used with a single global
    memory
  • Memory synchronization
  • Uniform Memory Access (UMA)
  • All accesses take the same time
  • One pool of shared memory connected through a bus or
    switched network
  • Non-Uniform Memory Access (NUMA)
  • Each processor has its own piece of memory
  • Nearby memory takes less time to access than memory
    farther away, i.e., access time varies
  • Cache coherency problems
  • Private caches

22
Dataflow Computing
  • An instruction executes as soon as the data it
    needs become available
  • The actual order of instructions has no effect on
    the order in which they are executed
  • Flow is determined entirely by data dependencies
  • No program counter, no shared data storage
  • Data are passed from one instruction to the next
  • Data flow graph
  • Static dataflow architecture
  • Data flow through a staged pipeline

23
Dataflow Computing
  • Dynamic dataflow architecture
  • Data are tagged with context information (instruction
    tag)
  • Each clock cycle, memory is searched
  • If a complete operand set is found for an instruction,
    it executes

(initial j <- n; k <- 1
 while j > 1 do
   new k <- k * j
   new j <- j - 1
 return k)            (this loop computes k = n!)
24
Dataflow Computing: Processing Units
  • Enabling unit
  • Composed of two units
  • Matching unit
  • Accepts data tokens and stores them in memory
  • Fetching unit
  • If the node (task) with a particular token is
    activated, the token is extracted from memory and
    combined to make an executable packet
  • Functional unit
  • Computes the necessary output values and combines
    each result with destination addresses to form more
    data tokens
  • These are sent back to the enabling unit
  • No contention or cache coherency problems (see the
    sketch below)
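A minimal sketch of this enabling/functional loop: tokens carry the tag of
their destination node, the matching unit stores them until a node has a
complete operand set, and the functional unit then fires the node and sends
the results back as new tokens. The tiny graph below, which computes
(a + b) * (c - d), is an invented example.

    import operator

    graph = {                  # node: (operation, number of inputs, destination)
        "add": (operator.add, 2, "mul"),
        "sub": (operator.sub, 2, "mul"),
        "mul": (operator.mul, 2, None),    # None marks the final result
    }

    def run(tokens):
        waiting = {}                                 # the matching unit's memory
        while tokens:
            node, value = tokens.pop(0)              # accept a data token
            waiting.setdefault(node, []).append(value)
            op, arity, dest = graph[node]
            if len(waiting[node]) == arity:          # complete set found: fire
                result = op(*waiting.pop(node))
                if dest is None:
                    return result
                tokens.append((dest, result))        # new token to the enabling unit

    # Initial tokens: a=2 and b=3 feed "add"; c=10 and d=4 feed "sub".
    print(run([("add", 2), ("add", 3), ("sub", 10), ("sub", 4)]))   # (2+3)*(10-4) = 30

Note that nothing here consults a program counter; the firing order depends
only on which tokens have arrived.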

25
Neural Networks
  • Based on the parallel architecture of the human brain
  • A network of processing elements (PEs), each handling
    one piece of a larger problem
  • Difficulty lies in choosing which PEs to connect
    together, the weights for the connections, and the
    weight thresholds
  • Neural networks are based on learning methods
  • Therefore they can make mistakes
  • Supervised or unsupervised learning (see the sketch
    below)
  • Supervised: prior knowledge of the correct result
    (training phase)
  • Unsupervised: no prior knowledge during training
  • Ex: image classification, facial recognition,
    risk management, sales forecasting, customer
    research
  • When large, it is impossible to understand how the
    network gets its result
  • Do you trust something a human can't understand?
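A minimal sketch of a single processing element (a perceptron) trained with
supervised learning: the connection weights and threshold (here a bias) are
adjusted from examples whose correct answers are known in advance. The AND
function is used as an invented toy data set.

    def train(samples, epochs=10, lr=0.1):
        w = [0.0, 0.0]
        bias = 0.0                                 # learned threshold
        for _ in range(epochs):
            for (x1, x2), target in samples:
                out = 1 if w[0] * x1 + w[1] * x2 + bias > 0 else 0
                err = target - out                 # supervised: correct answer known
                w[0] += lr * err * x1
                w[1] += lr * err * x2
                bias += lr * err
        return w, bias

    and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
    w, b = train(and_samples)
    for (x1, x2), target in and_samples:
        print(x1, x2, "->", 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0)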

26
Future of Computing
  • Boolean logic is the basis of standard computing
  • Moore's Law can't last forever
  • New technologies are needed
  • Optical or photonic computing: light beams
  • Biological computing: living organisms
  • DNA computing: DNA as software, enzymes as hardware
  • Quantum computing
  • Quantum Computing
  • Quantum bits (qubits) instead of Boolean bits
  • Qubits can be in multiple states simultaneously
  • Think multiple possible values at one time, i.e., a
    3-qubit register can hold the values 0-7 simultaneously
    (see the sketch below)
  • Processing can then perform computations on all
    possible values simultaneously -> quantum
    parallelism
  • 600 qubits mean 2^600 states (can't be done in a
    standard architecture)
  • Can perform everyday tasks, but shows superiority
    in applications that exploit quantum
    parallelism
  • Security applications
  • Truly random numbers
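A minimal classical sketch, assuming NumPy is available, of why a 3-qubit
register "holds" the values 0-7 at once: its state is a vector of 2^3 = 8
amplitudes, and the squared magnitudes give the measurement probabilities.
Simulating 600 qubits this way would need 2^600 amplitudes, which is why
quantum parallelism cannot be reproduced on a standard architecture.

    import numpy as np

    n = 3
    state = np.ones(2 ** n) / np.sqrt(2 ** n)   # equal superposition of 0..7
    probabilities = np.abs(state) ** 2          # each value measured with prob 1/8
    print(probabilities)                        # [0.125 0.125 ... 0.125]
    print("600 qubits would need", float(2 ** 600), "amplitudes")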