1
Alternative Architectures
  • Christopher Trinh
  • CIS Fall 2009
  • Chapter 9 (pp. 461-486)

2
What are they?
  • Architectures that transcend the classic von
    Neumann approach.
  • Instruction level parallelism
  • Multiprocessing architectures
  • Parallel processing
  • Dataflow computing
  • Neural networks
  • Systolic array
  • Quantum computing
  • Optical computing
  • Biological computing

3
Trade-offs
  • Defined as a situation that involves losing one
    quality or aspect of something in return for
    gaining another quality or aspect.
  • Important concept in the computer field.
  • Speed vs. Money
  • Speed vs. Power consumption/Heat

4
It's All about the Benjamins
  • In trade-offs, money in most cases takes
    precedence.
  • Moore's Law vs. Rock's Law
  • Rock's Law is the economic flip side of Moore's
    Law.
  • The cost of a semiconductor chip fabrication
    plant doubles every four years. As of 2003, the
    price had already reached about 3 billion US
    dollars.
  • The consumer market is now dominated by parallel
    computing, in the form of multiprocessor systems.
  • Exception: research.

5
Back in the day
  • CISC vs. RISC (complex vs. reduced instruction
    sets).
  • CISC was largely motivated by the high cost of
    memory (and of registers).
  • Analogous to text messages (SMS):
  • lol, u, brb, omg, and gr8
  • Same motivation, same benefit:
  • More information per unit of memory / per SMS.

6
What's the Difference?
  • RISC
  • Minimizes the number of cycles per instruction;
    most instructions execute in one clock cycle.
  • Uses hardwired control, making instruction
    pipelining easier.
  • Complexity is pushed up into the domain of the
    compiler.
  • More instructions.
  • CISC
  • Increases performance by reducing the number of
    instructions per program.

7
Between RISC and CISC
  • Cheaper and more plentiful memory became
    available, so money became less of a trade-off
    factor.
  • "The Case for the Reduced Instruction Set
    Computer," David Patterson and David Ditzel.
  • 45% data movement instructions
  • 25% ALU instructions
  • 30% flow control instructions
  • Overall, complex instructions were used only
    about 20% of the time.

8
Performance Formula
time/program = (time/cycle) x (cycles/instruction) x (instructions/program)
For the same program, the clock period (time/cycle) is roughly constant;
CISC reduces the instructions/program term, while RISC reduces the
cycles/instruction term.
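The trade-off in the formula can be made concrete with a small Python
sketch; the cycle counts and instruction counts below are illustrative
assumptions, not measurements from the textbook.

# Minimal sketch of the performance equation on this slide.
def cpu_time(seconds_per_cycle, cycles_per_instruction, instructions_per_program):
    """time/program = time/cycle * cycles/instruction * instructions/program"""
    return seconds_per_cycle * cycles_per_instruction * instructions_per_program

clock = 1e-9  # same 1 GHz clock for both machines (the constant term)

# CISC: fewer instructions, but each takes several cycles on average.
cisc = cpu_time(clock, cycles_per_instruction=5, instructions_per_program=2_000_000)

# RISC: more instructions, but most complete in a single cycle.
risc = cpu_time(clock, cycles_per_instruction=1, instructions_per_program=6_000_000)

print(f"CISC: {cisc * 1e3:.1f} ms, RISC: {risc * 1e3:.1f} ms")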
9
Microcode
  • CISC relies on microcode for instruction
    complexity.
  • Efficiency is limited by variable-length
    instructions, which slow down the decoding
    process.
  • This leads to a varying number of clock cycles
    per instruction, making pipelines difficult to
    implement.
  • Microcode interprets each instruction as it is
    fetched from memory: an additional translation
    process.
  • The more complex the instruction set, the more
    time it takes to look up an instruction and
    execute it.
  • Back to text messages: IYKWIM and (_8()

10
Comparison chart of RISC vs. CISC on page 468.
RISC is a misnomer: presently, there are more
instructions in RISC machines than in CISC machines.
Most architectures today are based on RISC.
11
Register window sets
  • Registers offer the greatest potential for
    performance improvement.
  • Recall that, on average, 45% of instructions in
    programs involve the movement of data.
  • Saving registers, passing parameters, and
    restoring registers involve considerable effort
    and resources.
  • High-level languages depend on modularization for
    efficiency, so procedure calls and parameter
    passing are natural side effects.

12
  • Imagine all registers divided into sets (or
    windows). Each set has a specific number of
    registers.
  • Only one set (window) is visible to the
    processor at a time.
  • Similar in concept to variable scope:
  • Global registers - common to all windows.
  • Local registers - local to the current window.
  • Input registers - overlap with the preceding
    window's output registers.
  • Output registers - overlap with the next
    window's input registers.
  • The current window pointer (CWP) points to the
    register window set in use at any given time
    (see the sketch below).
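A minimal Python sketch of the overlapping-window idea follows. It models
no particular ISA; the window count, group size, and names are assumptions
made for illustration, and global registers are omitted.

# Windows are arranged in a circle. Each window has local registers plus
# "out" registers that physically overlap the next window's "in" registers,
# so parameters are passed without copying.
NUM_WINDOWS = 4
REGS_PER_GROUP = 4

phys = [0] * (NUM_WINDOWS * 2 * REGS_PER_GROUP)  # in + local slots per window
cwp = 0                                          # current window pointer

def window_view(cwp):
    """Return (in, local, out) physical-register index lists for this window."""
    base = cwp * 2 * REGS_PER_GROUP
    nxt = ((cwp + 1) % NUM_WINDOWS) * 2 * REGS_PER_GROUP
    ins = list(range(base, base + REGS_PER_GROUP))
    locs = list(range(base + REGS_PER_GROUP, base + 2 * REGS_PER_GROUP))
    outs = list(range(nxt, nxt + REGS_PER_GROUP))  # alias the next window's ins
    return ins, locs, outs

# The caller writes a parameter into its out registers, then "calls".
_, _, caller_out = window_view(cwp)
phys[caller_out[0]] = 42            # pass argument in out register 0

cwp = (cwp + 1) % NUM_WINDOWS       # procedure call: advance the CWP
callee_in, _, _ = window_view(cwp)
print(phys[callee_in[0]])           # 42 -- same physical register, no copy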

13
Register windows have a circular organization. When
a procedure ends, its window is marked as reusable.
Recursion and deeply nested functions spill to main
memory when all register windows are full.
14
Flynn's Taxonomy
Considers two factors: the number of instruction
streams and the number of data streams that flow
into the processor. Pages 469-471.
PU = Processing Unit
15
Single Instruction, Single Data stream (SISD)
Single Instruction, Multiple Data streams (SIMD)
Multiple Instruction, Single Data stream (MISD)
Multiple Instruction, Multiple Data streams (MIMD)
Single Program, Multiple Data streams (SPMD)
16
SPMD
  • Single Program, Multiple Data streams.
  • Consists of multiprocessors, each with its own
    data set and program memory.
  • The same program is executed on each processor.
  • Each node can do different things at the same
    time:
  • If myNode == 1, do this; else do that (see the
    sketch below).
  • Synchronization at various global control points.
  • Often used in supercomputers.
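A rough Python sketch of the SPMD pattern, using the standard library's
multiprocessing module as a stand-in for a real message-passing system (an
assumption made for illustration; production SPMD codes typically use MPI).
Every worker runs the same program and branches on its own node id.

from multiprocessing import Pool

def program(my_node):
    # Same program on every node; behavior depends on the node id.
    if my_node == 1:
        return f"node {my_node}: coordinating"
    return f"node {my_node}: crunching my share of the data"

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        # Gathering all results here acts as a global synchronization point.
        print(pool.map(program, range(4)))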

17
Vector processors (SIMD)
  • Often referred to as supercomputers.
  • The most famous are the Cray series, with little
    change to their basic architecture in the past
    25 years.
  • Vector processors are specialized, heavily
    pipelined processors that perform efficient
    operations on entire vectors and matrices at
    once.
  • Suited for applications that benefit from a high
    degree of parallelism (e.g., weather forecasting,
    medical diagnosis, and image processing).
  • Efficient for two reasons: the machine fetches
    significantly fewer instructions, leading to less
    decoding, control unit overhead, and memory
    bandwidth usage; and the processor knows it will
    have a continuous source of data, so it can begin
    prefetching corresponding pairs of values (see
    the sketch below).
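A rough illustration of the vector idea in Python with NumPy (an
approximation: NumPy runs on a general-purpose CPU rather than a Cray-style
vector unit, but the contrast between per-element and whole-vector
operations is the same).

import numpy as np

a = np.arange(1_000_000, dtype=np.float64)
b = np.arange(1_000_000, dtype=np.float64)

# Scalar style: one add per loop iteration, with per-element overhead.
c_scalar = [a[i] + b[i] for i in range(len(a))]

# Vector style: a single expression operates on the entire vectors at once.
c_vector = a + b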

18
  • Vector registers are specialized registers that
    can hold several vector elements at one time.
  • Two types of vector processors:
    register-register vector processors and
    memory-memory vector processors.
  • Register-register vector processors
  • Require that all operations use registers as
    source and destination operands.
  • Disadvantage: long vectors must be broken into
    fixed-length segments that are small enough to
    fit into registers.
  • Memory-memory vector processors
  • Allow operands from memory to be routed directly
    to the ALU; results are streamed back to memory.
  • Disadvantage: a long startup time due to memory
    latency; once the pipeline is full, the
    disadvantage disappears.

19
Parallel and multiprocessor architectures
  • Two major parallel architectural paradigms. Both
    fall under MIMD architectures, but they differ in
    how they use memory.
  • Symmetric multiprocessors (SMPs)
  • Massively parallel processors (MPPs)

MPP: many processors, distributed memory,
communication via a network.
SMP: few processors, shared memory, communication
via memory.
MPP
  • Harder to program: the pieces of the program
    running on separate CPUs must be made to
    communicate with each other.
  • Used when the program is easily partitioned.
  • Large companies (data warehousing) frequently
    use this kind of system.
SMP
  • Easier to program.
  • Suffers from a bottleneck when all processors
    attempt to access the same memory at the same
    time.

20
  • A multiprocessing parallel architecture is
    analogous to adding horses to help out with the
    work (horsepower).
  • We improve processor performance by distributing
    the computational load among several processors.
  • Parallelism results in higher throughput
    (data/sec), better fault tolerance, and a more
    attractive price/performance ratio.
  • Amdahl's Law states that if two processing
    components run at two different speeds, the
    slower speed will dominate; perfect speedup is
    not possible (see the sketch below).
  • You are only as fast as your slowest part.
  • Every algorithm eventually has a sequential part.
    Additional processors have to wait until the
    serial processing is complete.
  • Parallelism is not a magic solution for improving
    speed. Some algorithms/programs have more
    sequential processing, and for them it is less
    cost effective to employ a multiprocessing
    parallel architecture (e.g., an individual bank
    transaction gains little, whereas processing the
    transactions of all bank customers in parallel
    may have added benefit).
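Amdahl's Law is usually written as speedup = 1 / ((1 - f) + f/n), where f
is the fraction of the work that can be parallelized and n is the number of
processors. A minimal sketch (function and variable names are my own):

def amdahl_speedup(f, n):
    """Upper bound on speedup with parallel fraction f on n processors."""
    return 1.0 / ((1.0 - f) + f / n)

# Even with 1000 processors, a 10% sequential part caps speedup near 10x.
print(f"{amdahl_speedup(0.90, 1000):.1f}")   # ~9.9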

21
Instruction-level parallelism (ILP)
  • Superscalar vs. very long instruction word
    (VLIW).
  • Superscalar: a design methodology that allows
    multiple instructions to be executed
    simultaneously in each cycle.
  • Achieves speedup in a way similar to adding
    another lane to a busy single-lane highway.
  • Exhibits parallelism through pipelining and
    replication.
  • The added highway lanes are called execution
    units.
  • Execution units consist of floating-point
    adders, multipliers, and other specialized
    components.
  • It is not uncommon for these units to be
    duplicated.
  • The units are pipelined.
  • Pipelining divides the fetch-decode-execute
    cycle into stages, so that a set of instructions
    is in different stages at the same time (see the
    sketch below).
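A toy Python sketch of that overlap: in each clock cycle a different
instruction occupies each stage, so several instructions are in flight at
once. The three-stage pipeline and instruction names are illustrative
assumptions.

STAGES = ["fetch", "decode", "execute"]
instructions = ["i1", "i2", "i3", "i4", "i5"]

for cycle in range(len(instructions) + len(STAGES) - 1):
    in_flight = []
    for s, stage in enumerate(STAGES):
        idx = cycle - s                      # instruction currently in this stage
        if 0 <= idx < len(instructions):
            in_flight.append(f"{stage}:{instructions[idx]}")
    print(f"cycle {cycle}: " + ", ".join(in_flight))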

22
  • Superpipelining is when a pipeline has stages
    that require less than half a clock cycle to
    execute.
  • Accomplished using an internal clock that runs at
    double the speed of the external clock, allowing
    completion of two tasks per external clock cycle.
  • An instruction fetch component retrieves
    multiple instructions simultaneously from memory.
  • A decoding unit determines whether the
    instructions are independent (and thus can be
    executed simultaneously).
  • Superscalar processors rely on both the hardware
    and the compiler to generate approximate
    schedules that make the best use of the machine's
    resources.

23
VLIW
  • Relies entirely on the compiler for scheduling of
    operations.
  • Packs independent instructions into one long
    instruction.
  • Because the instructions are fixed at compile
    time, changes such as memory latency require you
    to recompile the code.
  • Can also lead to significant increases in the
    amount of code generated.
  • Intel's Itanium (IA-64) is an example of a VLIW
    processor.
  • It uses an EPIC style of VLIW.
  • Difference: it bundles its instructions in
    various lengths, using a special delimiter to
    indicate where one bundle ends and another
    begins.
  • Instruction words are prefetched by hardware;
    instructions within bundles are executed in
    parallel with no concern for ordering.

24
Interconnection Networks
  • Each processor has its own memory, but processors
    are allowed to access each other's memories via
    the network.
  • Network topology is a factor in the overhead and
    cost of message passing. Factors in message
    passing efficiency:
  • Bandwidth
  • Message latency
  • Transport latency
  • Overhead
  • Static networks vs. dynamic networks:
  • Dynamic networks allow the path between two
    entities to change from one communication to the
    next; static networks do not.

25
(No Transcript)
26
Dynamic networks allow for dynamic configuration:
bus-based or switch-based. Bus-based networks are
the simplest and most cost efficient when the
number of entities is moderate. The main
disadvantage is that bottlenecks can occur.
Parallel buses can remove this issue, but at
considerable cost.
Crossbar switch
2 x 2 switch
27
Omega Network
An example of a multistage network, built using
2 x 2 switches.
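A minimal Python sketch of destination-tag routing through an 8 x 8 Omega
network (the sizes and function names are assumptions for illustration):
before each stage the message position is perfect-shuffled, then the 2 x 2
switch sends it to its upper or lower output according to the next bit of
the destination address, most significant bit first.

N = 8
STAGES = 3  # log2(N) stages of 2 x 2 switches

def perfect_shuffle(pos, n=N):
    """Rotate the position's address bits left by one (the perfect shuffle)."""
    bits = n.bit_length() - 1
    return ((pos << 1) | (pos >> (bits - 1))) & (n - 1)

def route(src, dst):
    pos = src
    path = [pos]
    for stage in range(STAGES):
        pos = perfect_shuffle(pos)
        # 0 = upper switch output, 1 = lower switch output.
        bit = (dst >> (STAGES - 1 - stage)) & 1
        pos = (pos & ~1) | bit
        path.append(pos)
    return path

print(route(2, 6))  # positions after each stage; the message ends at port 6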
Trade-off chart of various networks.
28
Shared memory processors
This doesn't mean all processors must share one
large memory; each processor can have a local
memory, but it must be shared with the other
processors.
29
  • Shared memory MIMD machines fall into two
    categories based on how they synchronize their
    memory operations:
  • Uniform Memory Access (UMA) - all memory accesses
    take the same amount of time; a pool of shared
    memory is connected to a group of processors
    through a bus or switched network.
  • Nonuniform Memory Access (NUMA) - memory access
    time is inconsistent across the address space of
    the machine.
  • Shared data leads to cache coherence problems
    (race conditions).
  • Snoopy cache controllers can be used to monitor
    the caches on all systems; this is called cache
    coherent NUMA (CC-NUMA).
  • Various cache update protocols can be used (a
    write-through-with-invalidation sketch follows
    the list):
  • write-through
  • write-through with update
  • write-through with invalidation
  • write-back
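A minimal Python sketch of the write-through-with-invalidation idea (not a
full MESI protocol; the class and variable names are assumptions): when one
cache writes a block, the copies held by other caches are invalidated, and
memory is updated so later reads see the new value.

memory = {"x": 0}   # shared memory
caches = []         # all caches snoop on each other via this list

class Cache:
    def __init__(self, name):
        self.name, self.lines = name, {}
        caches.append(self)

    def read(self, addr):
        if addr not in self.lines:        # miss: fill from shared memory
            self.lines[addr] = memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        memory[addr] = value              # write-through to memory
        for other in caches:              # snoop: invalidate other copies
            if other is not self:
                other.lines.pop(addr, None)

p0, p1 = Cache("P0"), Cache("P1")
p0.read("x"); p1.read("x")   # both caches now hold x == 0
p0.write("x", 7)             # P1's stale copy is invalidated
print(p1.read("x"))          # 7, re-fetched from memory (no stale read)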

30
Distributed Systems
  • Loosely coupled distributed computers depend on
    a network for communication among processors to
    solve a problem.
  • Cluster computing (NOWs, COWs, DCPCs, and PoPCs):
    all resources are within the same administrative
    domain, working on group tasks.
  • You can build your own cluster by downloading the
    open-source Beowulf project.
  • Public-resource computing (or global computing):
    grid computing where computing power is supplied
    by volunteers through the Internet; a very cheap
    source of computing power.

31
The SETI@Home project analyzes radio data to
determine whether there is intelligent life out
there (think of the movie Contact).
The Folding@Home project is designed to perform
computationally intensive simulations of protein
folding and other molecular dynamics (MD), and to
improve on the methods available to do so.
At 7.87 PFLOPS, it was the first computing project
of any kind to cross the four-petaFLOPS milestone.
This level of performance is primarily enabled by
the cumulative effort of a vast array of
PlayStation 3 consoles and powerful GPUs.