A New Direction for Computer Architecture Research - PowerPoint PPT Presentation

COMP4211 Advanced Architectures & Algorithms Week 11 Seminar A New Direction for Computer Architecture Research Lih Wen Koh 19 May 2004 Outline Overview of Current ... – PowerPoint PPT presentation

1
A New Direction for Computer Architecture Research
COMP4211 Advanced Architectures & Algorithms, Week 11 Seminar
  • Lih Wen Koh
  • 19 May 2004

2
Outline
  • Overview of Current Computer Architecture
    Research
  • The Desktop/Server Domain
  • Evaluation of current processors in the
    desktop/server domain
  • Benchmark performance
  • Software effort
  • Design complexity
  • A New Target Domain: Personal Mobile Computing
  • Major requirements of personal mobile computing
    applications
  • Evaluation of current processors in the personal
    mobile computing domain
  • Vector IRAM by UC Berkeley

4
Overview of Current Computer Architecture Research
  • Current computer architecture research has a
    bias toward the past: desktop and server
    applications
  • Next decade's technology domain: personal mobile
    computing
  • Question: What are these future applications?
  • Question: What is the set of requirements for
    this domain?
  • Question: Do current microprocessors meet these
    requirements?

5
Billion-transistor microprocessors
6
Billion-transistor microprocessors
  • The share of transistors used for caches and
    main memory in billion-transistor processors
    varies from 50% to 90% of the transistor budget.
  • These transistors mostly store redundant, local
    copies of data normally found elsewhere in the
    system
  • Question: Is this the best utilization of half a
    billion transistors for future applications?

8
The Desktop/Server Domain
  • Evaluation of billion-transistor processors
    (grading system: + for strength, 0 for
    neutrality, - for weakness)

9
The Desktop/Server Domain
  • Desktop
  • Wide superscalar, trace and simultaneous
    multithreading processors should deliver the
    highest performance on SPECint04
  • Use out-of-order and advanced prediction
    techniques to exploit ILP
  • IA-64 will perform slightly worse because of
    immature VLIW compilers
  • CMP and Raw will have inferior performance in
    integer applications, which are not highly
    parallelizable
  • CMP and Raw will perform better in FP
    applications, where parallelism and high memory
    bandwidth are more important than out-of-order
    execution

10
The Desktop/Server Domain
  • Server
  • CMP and SMT will provide the best performance due
    to their ability to use coarse-grained
    parallelism even with a single chip
  • Wide superscalar, trace and IA-64 will perform
    worse because out-of-order execution provides
    only a small benefit to online transaction
    processing (OLTP) applications
  • Raw: it is difficult to predict how well its
    software can map the parallelism of databases
    onto reconfigurable logic and
    software-controlled caches

11
The Desktop/Server Domain
  • Software Effort
  • Wide superscalar, trace and SMT processors can
    run existing executables
  • CMP can run existing executables, but
    applications need to be rewritten in a
    multithreaded or parallel fashion to benefit
    from it, which is neither easy nor automated.
  • IA-64 will supposedly run existing executables,
    but significant performance increases will
    require enhanced VLIW compilers.
  • Raw relies on the most challenging software
    development for sophisticated routing, mapping
    and runtime-scheduling tools, compilers and
    reusable libraries.

12
The Desktop/Server Domain
  • Physical Design Complexity
  • Includes effort for design, verification and
    testing of an IC
  • Wide superscalar and multithreading processors
    use complex techniques e.g. aggressive
    data/control prediction, out-of-order execution,
    multithreading and non-modular designs
    (individually designed multiple blocks)
  • IA-64: the basic challenge is the design and
    verification of forwarding logic among the
    multiple functional units on the chip
  • CMP, trace and Raw: modular design, but still
    complex (out-of-order execution, cache coherency,
    multiprocessor communication, register remapping
    etc.)
  • Raw: requires design and replication of a single
    processing tile and network switch. Verification
    is trivial in terms of the circuits, but
    verification of the mapping software is also
    required, which is often not trivial.

13
The Desktop/Server Domain
  • Conclusion
  • Current billion-transistor processors are
    optimized for desktop/server computing and
    promise impressive performance.
  • The main concern is the design complexity of
    these architectures.

15
A New Target Domain: Personal Mobile Computing
  • Convergent devices
  • Goal: a single, portable, personal computing and
    communication device that incorporates the
    necessary functions of a PDA, laptop computer,
    cellular phone etc.
  • Greater demand for computing power, but at the
    same time, the size, weight and power consumption
    of these devices must remain constant
  • Key features
  • Most important feature: the interface and
    interaction with the user
  • Voice and image I/O
  • Applications like speech and pattern recognition
  • Wireless infrastructure
  • Networking, telephony, GPS information
  • Trend 1: Multimedia applications (video, speech,
    animation, music)
  • Trend 2: Popularity of portable electronics
  • PDAs, digital cameras, cellular phones, video game
    consoles
  • Together, these trends lead to Personal Mobile
    Computing

17
Major microprocessor requirements
  • Requirement 1: High performance for multimedia
    functions
  • Requirement 2: Energy and power efficiency
  • Design for portable, battery-operated devices
  • Power budget < 2 Watts
  • Processor design power target < 1 Watt
  • Power budget of current high-performance
    microprocessors (tens of Watts) is unacceptable
  • Requirement 3: Small size
  • Code size
  • Integrated solutions (external cache and main
    memory not feasible)
  • Requirement 4: Low design complexity
  • Scalability in terms of both performance and
    physical design

18
Characteristics of Multimedia Applications
  • Real-time response
  • Worst-case guaranteed performance, sufficient for
    real-time qualitative perception, matters more
    than maximum peak performance
  • Continuous-media data types
  • Continuous stream of input and output
  • Temporal locality in data memory accesses is low!
    Data caches may well be an obstacle to high
    performance for continuous-media data types
  • Typically narrow data: 8-16 bits for image pixels
    and sound samples
  • SIMD-type operations desirable

19
Characteristics of Multimedia Applications
  • Fine-grained parallelism
  • Same operation is performed across sequences of
    data in vector or SIMD fashion
  • Coarse-grained parallelism
  • A pipeline of functions processes a single stream
    of data to produce the end results.
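
To make the two kinds of parallelism on this slide concrete, here is a minimal Python sketch. The stage names (scale, clip), the gain and the clipping range are illustrative assumptions, not taken from the presentation. Each stage applies one operation to every element (fine-grained), and the stages are chained over one stream (coarse-grained).

```python
def scale(samples, gain):
    # Fine-grained parallelism: the same multiply is applied to every element,
    # so a vector or SIMD unit could process many elements at once.
    return [s * gain for s in samples]


def clip(samples, lo, hi):
    # Another element-wise stage: clamp each sample into [lo, hi].
    return [max(lo, min(hi, s)) for s in samples]


def run_pipeline(stream, stages):
    # Coarse-grained parallelism: a pipeline of functions over a single stream.
    # On a multi-core or multithreaded machine each stage could run concurrently
    # on different blocks of the stream; here they simply run in sequence.
    for stage in stages:
        stream = stage(stream)
    return stream


# Hypothetical two-stage pipeline: amplify by 2, then clip to [-3, 3].
stages = [lambda s: scale(s, 2), lambda s: clip(s, -3, 3)]
result = run_pipeline([1, -2, 3], stages)
print(result)
```

The sketch only shows where the parallelism lies; it does not itself run anything in parallel.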

20
Characteristics of Multimedia Applications
  • High instruction reference locality
  • Typically small kernels/loops that dominate the
    processing time
  • High temporal and spatial locality for
    instructions
  • Example: Convolution equation for signal
    filtering
  • for n = 0 to N
  •   y[n] = 0
  •   for k = 0 to n
  •     y[n] = y[n] + x[k] * h[n-k]
  •   end for
  • end for
  • High memory bandwidth
  • For applications such as 3D graphics
  • High network bandwidth
  • Data (e.g. video) streaming from external sources
    requires high network and I/O bandwidth.
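
The convolution kernel on this slide can be written as a short runnable sketch. This is plain Python for illustration, assuming a causal filter (h is indexed from 0) and a finite signal; the 2-tap filter and the input values are made up for the example.

```python
def convolve(x, h):
    """Discrete convolution: y[n] = sum over k of x[k] * h[n - k]."""
    n_out = len(x) + len(h) - 1
    y = [0] * n_out
    for n in range(n_out):
        # Restrict k so that both x[k] and h[n - k] are valid indices.
        k_lo = max(0, n - len(h) + 1)
        k_hi = min(n, len(x) - 1)
        for k in range(k_lo, k_hi + 1):
            y[n] += x[k] * h[n - k]
    return y


signal = [1, 2, 3, 4]
taps = [1, 1]  # hypothetical 2-tap filter
filtered = convolve(signal, taps)
print(filtered)  # [1, 3, 5, 7, 4]
```

The inner loop is a small multiply-accumulate kernel that dominates the run time, which is the instruction-locality property the slide describes.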

22
Processor Evaluation
  • Real-time response
  • Out-of-order techniques and caches → unpredictable
    performance, hence difficult to guarantee
    real-time response
  • Continuous-media data types
  • Question: Does temporal locality in data memory
    accesses still hold?
  • Claim: Data caches may well be an obstacle to
    high performance for continuous-media data types
  • Parallelism
  • MMX-like multimedia extensions for exploiting
    fine-grained parallelism
  • but this exposes data alignment issues,
    restriction on number of vector or SIMD elements
    operated on by each instruction
  • Coarse-grained parallelism is best on SMT, CMP
    and Raw architectures.

23
Processor Evaluation
  • Memory bandwidth
  • Cache-based architectures have limited memory
    bandwidth
  • Could potentially use streaming buffers and cache
    bypassing to help sequential bandwidth, but this
    does not address bandwidth requirements of
    indexed or random accesses
  • Recall: 50-90% of the transistor budget is
    dedicated to caches!
  • Code size
  • Code size is a weakness (especially for IA-64)
    because loop unrolling and software pipelining
    are heavily relied upon to gain performance
  • Code size is also a problem for Raw architecture
    as programmers must program the reconfigurable
    portion of each datapath

24
Processor Evaluation
  • Energy/power efficiency
  • Redundant computation for out-of-order models
  • Complex issue-logic
  • Forwarding across long wires
  • Power-hungry reconfigurable logic
  • Design scalability
  • The main problem is the forwarding of results
    across large chips or communication among
    multiple cores/tiles.
  • Simple pipelining of long interconnects is not a
    sufficient solution: it exposes the timing of
    forwarding or communication to the scheduling
    logic or software, which increases complexity

25
Processor Evaluation
  • Conclusion
  • Current processors fail to meet many of the
    requirements of the new computing model.
  • Question
  • What design will?

27
Vector IRAM processor
  • Targeted at matching the requirements of the
    personal mobile computing environment
  • Two main ideas:
  • Vector processing: addresses the demands of
    multimedia processing
  • Embedded DRAM: addresses the energy efficiency,
    size and weight demands of portable devices

28
VIRAM Prototype Architecture
29
VIRAM Prototype Architecture
  • Uses in-order, scalar processor with L1 caches,
    tightly integrated with a vector execution unit
    (with 8 lanes)
  • 16MB of embedded DRAM as main memory
  • connected to the scalar and vector unit through a
    crossbar
  • Organized in 8 independent banks, each with a
    256-bit synchronous interface
  • → provides sufficient sequential and random
    bandwidth even for demanding applications
  • → reduces the penalty of high energy consumption
    by avoiding the memory bus bottlenecks of
    conventional multi-chip systems
  • DMA engine for off-chip access

30
Modular Vector Unit Design
  • Vector unit is managed as a co-processor
  • Single-issue, in-order pipeline for predictable
    performance
  • Efficient for short vectors
  • Pipelined instruction start-up
  • Full support for instruction chaining
  • 256-bit datapath can be configured as 4 64-bit
    operations, 8 32-bit operations or 16 16-bit
    operations (SIMD)
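
The datapath configurations listed above follow from dividing the datapath width by the element width. A small sketch of that arithmetic (the function name is an illustrative choice, not part of VIRAM):

```python
DATAPATH_BITS = 256  # datapath width quoted on the slide


def ops_per_cycle(element_bits, datapath_bits=DATAPATH_BITS):
    """Number of element operations that fit side by side in the datapath."""
    if element_bits <= 0 or datapath_bits % element_bits != 0:
        raise ValueError("element width must evenly divide the datapath width")
    return datapath_bits // element_bits


configs = {bits: ops_per_cycle(bits) for bits in (64, 32, 16)}
print(configs)  # {64: 4, 32: 8, 16: 16}
```

Narrower data types therefore get proportionally more operations per cycle from the same hardware, which suits the 8-16 bit media data discussed earlier.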

31
Embedded DRAM in VIRAM
  • DRAM (Dynamic RAM): information must be
    periodically refreshed to mimic the behaviour
    of static storage
  • On-chip DRAM connected to vector execution lanes
    via memory crossbars
  • cf. most SRAM cache-based machines: SRAM is
    more expensive and less dense
  • In conventional architectures
  • most of the instructions and data are fetched
    from the two levels of the memory hierarchy
    closest to the CPU, the L1 and L2 caches, which
    use small SRAM-based memory structures.
  • Most of the reads from the DRAM are not directly
    from the CPU, but are (burst) reads initiated to
    bring data and instructions into these caches.
  • Each DRAM macro is 1.5MB in size
  • DRAM latency is included in the vector execution
    pipeline

32
Non-Delayed Pipeline
  • Random access latency could lead to stalls while
    dependent instructions wait on long loads

33
Delayed Vector Pipeline
  • Solution: include random access latency in the
    vector unit pipeline
  • Delay arithmetic operations and stores to shorten
    RAW hazards

34
Vector Instruction Set
  • Complete load-store vector instruction set
  • extends the MIPS64 ISA with vector instructions
  • Data types supported: 64, 32, 16 and 8 bit
  • 32 general purpose vector registers, 32 vector
    flag registers, 16 scalar registers
  • 91 instructions: arithmetic, logical, vector
    processing, sequential/strided/indexed loads and
    stores
  • ISA does not include
  • Maximum vector register length
  • Functional unit datapath width
  • DSP support
  • Fixed-point arithmetic, saturating arithmetic
  • Intra-register permutations for butterfly
    operations
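
As an illustration of why saturating arithmetic is listed under DSP support, the sketch below contrasts saturating and wrapping addition on unsigned 8-bit pixel values. It is plain Python modelling the arithmetic only; the function names and pixel values are assumptions for the example and do not correspond to VIRAM instructions.

```python
U8_MAX = 255  # largest unsigned 8-bit value


def saturating_add_u8(a, b):
    # Clamp at the maximum instead of overflowing.
    return min(a + b, U8_MAX)


def wrapping_add_u8(a, b):
    # Ordinary modulo-256 addition, as in non-saturating 8-bit hardware.
    return (a + b) & 0xFF


pixels = [100, 200, 250]
brightened_sat = [saturating_add_u8(p, 60) for p in pixels]
brightened_wrap = [wrapping_add_u8(p, 60) for p in pixels]
print(brightened_sat)   # [160, 255, 255]
print(brightened_wrap)  # [160, 4, 54]
```

With wrapping arithmetic, brightening a nearly white pixel turns it nearly black; saturation keeps it white, which is the behaviour image and audio code usually wants.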

35
Vector Instruction Set
  • Compiler and OS support
  • Conditional execution of vector operations
  • Support for software speculation of load
    operations
  • MMU-based virtual memory
  • Restartable arithmetic exceptions
  • Valid and dirty bits for vector registers

36
Vector IRAM for Desktop/Server Applications?
  • Desktop domain
  • - Do not expect vector processing to benefit
    integer applications.
  • + Floating point applications are highly
    vectorizable.
  • + All applications should benefit from low memory
    latency and high memory bandwidth of vector IRAM.
  • Server domain
  • - Expect to perform poorly due to limited on-chip
    memory.
  • + Should perform better on decision support
    than on online transaction processing.

37
Vector IRAM for Desktop/Server Applications?
  • Software effort
  • Vectorizing compilers have been developed and
    used in commercial environments for decades
  • - But additional work is required to tune
    compilers for multimedia workloads and to make
    DSP features and data types accessible through
    high-level languages
  • Design complexity
  • Vector IRAM is highly modular

38
Vector IRAM for Personal Mobile Computing?
  • Real-time response
  • In-order, does not rely on data caches → highly
    predictable
  • Continuous data types
  • Vector model is superior to MMX-like SIMD
    extensions
  • Provides explicit control of the number of
    elements each instruction operates on
  • Allows scaling of the number of elements each
    instruction operates on without changing the ISA
  • Does not expose data packing and alignment to
    software
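
The point about explicit control of the number of elements can be sketched as a strip-mined loop: the same code runs unchanged whatever the hardware's maximum vector length is. This is a plain-Python model, not VIRAM code; max_vl stands in for the implementation-defined maximum vector length and is an assumed parameter.

```python
def vector_add(a, b, max_vl):
    """Add two equal-length lists in strips of at most max_vl elements.

    Returns the result and the number of 'vector instructions' issued.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    if max_vl <= 0:
        raise ValueError("max_vl must be positive")
    out = []
    instructions = 0
    i = 0
    while i < len(a):
        vl = min(max_vl, len(a) - i)  # set the vector length for this strip
        out.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        instructions += 1             # one vector add covers vl elements
        i += vl
    return out, instructions


a = list(range(10))
b = [1] * 10
small_hw = vector_add(a, b, max_vl=4)  # 3 strips: 4 + 4 + 2 elements
large_hw = vector_add(a, b, max_vl=8)  # 2 strips: 8 + 2 elements
print(small_hw[1], large_hw[1])  # 3 2
```

Both runs produce the same result. Only the instruction count changes with the hardware vector length, and the last, shorter strip needs no special packing or alignment code.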

39
Vector IRAM for Personal Mobile Computing?
  • Fine-grained parallelism
  • Vector processing
  • Coarse-grained parallelism
  • High-speed multiply-accumulate achieved through
    instruction chaining
  • Allow programming in high-level language, unlike
    most DSP architectures.
  • Code size
  • Compactness is possible because a single vector
    instruction specifies a whole loop
  • Code size is smaller than VLIW and comparable to
    x86 CISC code
  • Memory bandwidth
  • Available from on-chip hierarchical DRAM

40
Performance Evaluation of VIRAM
41
Performance Evaluation of VIRAM
  • Performance is reported in iterations per cycle
    and is normalized to the x86 processor.
  • With unoptimized code, VIRAM outperforms the x86,
    MIPS and VLIW processors running unoptimized
    code. It is 30% and 45% slower than the 1 GHz
    PowerPC and the VLIW processor, respectively,
    when those run optimized code.
  • With optimized/scheduled code, VIRAM is 1.6 to
    18.5 times faster than all others.
  • Note
  • VIRAM is the only single-issue design in the
    processor set
  • VIRAM is the only one not using SRAM caches
  • VIRAM's clock frequency is the second slowest.

42
Vector IRAM for Personal Mobile Computing?
43
Vector IRAM for Personal Mobile Computing?
  • Energy/power efficiency
  • A vector instruction specifies a large number of
    independent operations → no energy wasted on
    fetching and decoding instructions, checking
    dependencies and making predictions
  • Execution model is in-order
  • → limited forwarding is needed and control logic
    is simple, and thus power efficient
  • Typical power consumption
  • MIPS core 0.5W
  • Vector unit 1.0 W
  • DRAM 0.2 W
  • Misc 0.3 W
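
For comparison with the power budget stated earlier in the deck, the component figures above can be totalled with a short calculation. The values are the ones on the slide, expressed in milliwatts so the arithmetic stays in integers.

```python
# Typical power figures from the slide, in milliwatts.
power_mw = {
    "MIPS core": 500,
    "Vector unit": 1000,
    "DRAM": 200,
    "Misc": 300,
}

total_mw = sum(power_mw.values())
share = {name: mw / total_mw for name, mw in power_mw.items()}

print(f"total: {total_mw / 1000:.1f} W")
for name, fraction in share.items():
    print(f"{name}: {fraction:.0%}")
```

The listed components sum to 2.0 W, with the vector unit accounting for half of the total.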

44
Vector IRAM for Personal Mobile Computing?
  • Design scalability
  • The processor-memory crossbar is the only place
    where vector IRAM uses long wires
  • Deep pipelining is a viable solution without any
    h/w or s/w complications

45
Vector IRAM for Personal Mobile Computing?
  • Performance scales well with the number of vector
    lanes.
  • Compared to the single-lane case, two, four and
    eight lanes lead to 1.5x, 2.5x, 3.5x
    performance improvement respectively.
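
One way to read the scaling figures above is as speedup per lane (parallel efficiency). The sketch below uses only the numbers quoted on the slide. The efficiency metric is an added interpretation, not something reported in the presentation.

```python
# Speedup relative to the single-lane case, as quoted on the slide.
speedup = {1: 1.0, 2: 1.5, 4: 2.5, 8: 3.5}

# Parallel efficiency: speedup divided by the number of lanes.
efficiency = {lanes: s / lanes for lanes, s in speedup.items()}

for lanes, eff in efficiency.items():
    print(f"{lanes} lane(s): speedup {speedup[lanes]}x, efficiency {eff:.1%}")
```

By this measure each doubling of lanes still adds performance, while the per-lane return falls from 75% at two lanes to about 44% at eight.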

46
Conclusion
  • Modern architectures are designed and optimized
    for desktop and server applications.
  • The newly emerging domain of personal mobile
    computing poses a different set of architectural
    requirements.
  • We have seen that modern architectures do not
    meet many of the requirements of applications in
    the personal mobile computing domain.
  • VIRAM: an effort by UC Berkeley to develop a new
    architecture targeted at applications in the
    personal mobile computing domain.
  • Early results show a promising improvement in
    performance without compromising the requirements
    of low power.

47
References
  • A New Direction for Computer Architecture
    Research
  • Christoforos E. Kozyrakis, David A. Patterson, UC
    Berkeley, IEEE Computer, November 1998.
  • Vector IRAM: A Microprocessor Architecture for
    Media Processing
  • Christoforos E. Kozyrakis, UC Berkeley, CS252
    Graduate Computer Architecture, 2000.
  • Vector IRAM: A Media-Oriented Vector Processor
    with Embedded DRAM
  • C. Kozyrakis, J. Gebis, D. Martin, S. Williams,
    I. Mavroidis, S. Pope, D. Jones, D. Patterson, K.
    Yelick. 12th Hot Chips Conference, Palo Alto, CA,
    August 2000
  • Exploiting On-Chip Memory Bandwidth in the VIRAM
    Compiler
  • D. Judd, K. Yelick, C. Kozyrakis, D. Martin, and
    D. Patterson, Second Workshop on Intelligent
    Memory Systems, Cambridge, November 2000
  • Vector vs. Superscalar and VLIW Architectures
    for Embedded Multimedia Benchmarks
  • C. Kozyrakis, D. Patterson. 35th International
    Symposium on Microarchitecture, Istanbul, Turkey,
    November 2002
  • Memory-Intensive Benchmarks: IRAM vs. Cache-Based
    Machines
  • Brian R. Gaeke, Parry Husbands, Xiaoye S. Li,
    Leonid Oliker, Katherine A. Yelick, and Rupak
    Biswas. Proceedings of the International Parallel
    and Distributed Processing Symposium (IPDPS). Ft.
    Lauderdale, FL. April, 2002
  • Logic and Computer Design Fundamentals