A New Direction for Computer Architecture Research

COMP4211 Advanced Architectures & Algorithms, Week 11 Seminar: A New Direction for Computer Architecture Research. Lih Wen Koh, 19 May 2004.

1
A New Direction for Computer Architecture Research
COMP4211 Advanced Architectures & Algorithms, Week 11 Seminar

Lih Wen Koh
19 May 2004

2
Outline

Overview of Current Computer Architecture Research
The Desktop/Server Domain
  Evaluation of current processors in the desktop/server domain
    Benchmark performance
    Software effort
    Design complexity
A New Target Domain: Personal Mobile Computing
  Major requirements of personal mobile computing applications
  Evaluation of current processors in the personal mobile computing domain
Vector IRAM by UC Berkeley

4
Overview of Current Computer Architecture Research

Current computer architecture research is biased toward the desktop and server applications of the past
The next decade's technology domain: personal mobile computing
Question: What are these future applications?
Question: What is the set of requirements for this domain?
Question: Do current microprocessors meet these requirements?

5
Billion-transistor microprocessors
6
Billion-transistor microprocessors

The share of transistors used for caches and main memory in billion-transistor processors varies from 50% to 90% of the transistor budget.
Most of the budget goes to caches and main memory, which store redundant, local copies of data normally found elsewhere in the system.
Question: Is this the best utilization of half a billion transistors for future applications?

8
The Desktop/Server Domain

Evaluation of billion-transistor processors (grading system: + for strength, 0 for neutrality, - for weakness)

9
The Desktop/Server Domain

Desktop
Wide superscalar, trace and simultaneous multithreading (SMT) processors should deliver the highest performance on SPECint04
  They use out-of-order execution and advanced prediction techniques to exploit ILP
IA-64 will perform slightly worse because of immature VLIW compilers
CMP and Raw
  will have inferior performance on integer applications, which are not highly parallelizable
  will perform better on FP applications, where parallelism and high memory bandwidth are more important than out-of-order execution

10
The Desktop/Server Domain

Server
CMP and SMT will provide the best performance because they can exploit coarse-grained parallelism even within a single chip
Wide superscalar, trace and IA-64 will perform worse because out-of-order execution provides only a small benefit to online transaction processing (OLTP) applications
Raw: it is difficult to predict how successfully its software can map the parallelism of databases onto reconfigurable logic and software-controlled caches

11
The Desktop/Server Domain

Software Effort
Wide superscalar, trace and SMT processors can run existing executables.
CMP can run existing executables, but to benefit fully, applications need to be rewritten in a multithreaded or parallel fashion, which is neither easy nor automated.
IA-64 will supposedly run existing executables, but significant performance increases will require enhanced VLIW compilers.
Raw relies on the most challenging software development: sophisticated routing, mapping and runtime-scheduling tools, compilers and reusable libraries.

12
The Desktop/Server Domain

Physical Design Complexity
Includes the effort for design, verification and testing of an IC
Wide superscalar and multithreading processors use complex techniques, e.g. aggressive data/control prediction, out-of-order execution and multithreading, and non-modular designs (multiple individually designed blocks)
IA-64: the basic challenge is the design and verification of the forwarding logic among the multiple functional units on the chip
CMP, trace and Raw: modular design, but still complex (out-of-order execution, cache coherency, multiprocessor communication, register remapping, etc.)
Raw: requires the design and replication of a single processing tile and network switch; verification is trivial in terms of the circuits, but the mapping software must also be verified, which is often not trivial

13
The Desktop/Server Domain

Conclusion
Current billion-transistor processors are
optimized for desktop/server computing and
promise impressive performance.
The main concern is the design complexity of
these architectures.

15
A New Target Domain Personal Mobile Computing

Convergent devices
  Goal: a single, portable, personal computing and communication device that incorporates the necessary functions of a PDA, laptop computer, cellular phone, etc.
  Greater demand for computing power, but at the same time the size, weight and power consumption of these devices must remain constant
Key features
  Most important feature: the interface and interaction with the user
  Voice and image I/O
  Applications like speech and pattern recognition
  Wireless infrastructure: networking, telephony, GPS information

Trend 1: Multimedia applications (video, speech, animation, music)
Trend 2: Popularity of portable electronics (PDAs, digital cameras, cellular phones, video game consoles)
Together, these trends point to personal mobile computing

17
Major microprocessor requirements

Requirement 1: High performance for multimedia functions
Requirement 2: Energy and power efficiency
  Design for portable, battery-operated devices
  Power budget < 2 W
  Processor design with a power target < 1 W
  The power budget of current high-performance microprocessors (tens of watts) is unacceptable
Requirement 3: Small size
  Code size
  Integrated solutions (external cache and main memory are not feasible)
Requirement 4: Low design complexity
  Scalability in terms of both performance and physical design

18
Characteristics of Multimedia Applications

Real-time response
  Worst-case guaranteed performance sufficient for real-time qualitative perception, rather than maximum peak performance
Continuous-media data types
  Continuous stream of input and output
  Temporal locality in data memory accesses is low!
  Data caches may well be an obstacle to high performance for continuous-media data types
  Typically narrow data: 8 to 16 bits for image pixels and sound samples
  SIMD-type operations are desirable
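
The low-temporal-locality point above can be illustrated with a toy model. This is only a sketch under stated assumptions, not a measurement: the `lru_hit_rate` helper, the 64-entry capacity and both access traces are hypothetical, the cache is fully associative with LRU replacement, and spatial locality from multi-word cache lines is ignored.

```python
from collections import OrderedDict

def lru_hit_rate(trace, capacity):
    """Fraction of accesses that hit in a fully associative LRU cache."""
    cache = OrderedDict()  # addresses ordered from least to most recently used
    hits = 0
    for addr in trace:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used entry
            cache[addr] = True
    return hits / len(trace)

# Streaming trace (hypothetical): every sample is read exactly once.
streaming = list(range(10_000))
# Reuse trace (hypothetical): a 32-entry working set is read over and over.
reuse = [i % 32 for i in range(10_000)]

streaming_rate = lru_hit_rate(streaming, capacity=64)
reuse_rate = lru_hit_rate(reuse, capacity=64)
print(f"streaming hit rate: {streaming_rate:.4f}")
print(f"reuse hit rate:     {reuse_rate:.4f}")
```

In this model the streaming trace never hits, because no address is ever touched twice, while the reuse trace misses only on its first pass over the working set.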

19
Characteristics of Multimedia Applications

Fine-grained parallelism
Same operation is performed across sequences of
data in vector or SIMD fashion
Coarse-grained parallelism
A pipeline of functions processes a single stream of data to produce the end results.

20
Characteristics of Multimedia Applications

High instruction reference locality
Typically small kernels/loops that dominate the
processing time
High temporal and spatial locality for
instructions
Example: convolution equation for signal filtering

for n = 0 to N
  y[n] = 0
  for k = 0 to n
    y[n] = y[n] + x[k] * h[n-k]
  end for
end for
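
The convolution loop above can be written as a short runnable sketch. The function name `convolve`, the finite tap list `h` and the choice to produce as many outputs as there are inputs are assumptions made for illustration; the point is how small the kernel is that dominates the processing time.

```python
def convolve(x, h):
    """Causal convolution: y[n] = sum over k of x[k] * h[n - k]."""
    y = [0] * len(x)
    for n in range(len(x)):
        # Only taps with 0 <= n - k < len(h) contribute to y[n].
        for k in range(max(0, n - len(h) + 1), n + 1):
            y[n] += x[k] * h[n - k]
    return y

signal = [1, 2, 3, 4, 5]  # hypothetical input samples
taps = [1, 1, 1]          # hypothetical 3-tap filter
print(convolve(signal, taps))  # prints [1, 3, 6, 9, 12]
```

The inner loop is a multiply-accumulate over consecutive elements, which is the kind of operation the vector and SIMD approaches discussed in these slides target.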

High memory bandwidth
For applications such as 3D graphics
High network bandwidth
Streaming data (e.g. video) from external sources requires high network and I/O bandwidth.

22
Processor Evaluation

Real-time response
  Out-of-order techniques and caches -> unpredictable performance, hence it is difficult to guarantee real-time response
Continuous-media data types
  Question: Does temporal locality in data memory accesses still hold?
  Claim: Data caches may well be an obstacle to high performance for continuous-media data types
Parallelism
  MMX-like multimedia extensions exploit fine-grained parallelism, but they expose data alignment issues and restrict the number of vector or SIMD elements operated on by each instruction
  Coarse-grained parallelism is best exploited on SMT, CMP and Raw architectures

23
Processor Evaluation

Memory bandwidth
  Cache-based architectures have limited memory bandwidth
  Streaming buffers and cache bypassing could help sequential bandwidth, but they do not address the bandwidth requirements of indexed or random accesses
  Recall: 50-90% of the transistor budget is dedicated to caches!
Code size
  Code size is a weakness (especially for IA-64) because loop unrolling and software pipelining are heavily relied upon to gain performance
  Code size is also a problem for the Raw architecture, as programmers must program the reconfigurable portion of each datapath

24
Processor Evaluation

Energy/power efficiency
Redundant computation for out-of-order models
Complex issue-logic
Forwarding across long wires
Power-hungry reconfigurable logic
Design scalability
The main problem is the forwarding of results
across large chips or communication among
multiple cores/tiles.
Simple pipelining of long interconnects is not a sufficient solution
  It exposes the timing of forwarding or communication to the scheduling logic or software
  It increases complexity

25
Processor Evaluation

Conclusion
Current processors fail to meet many of the
requirements of the new computing model.
Question
What design will?

27
Vector IRAM processor

Targeted at matching the requirements of the
personal mobile computing environment
Two main ideas:
  Vector processing: addresses the demands of multimedia processing
  Embedded DRAM: addresses the energy efficiency, size and weight demands of portable devices

28
VIRAM Prototype Architecture
29
VIRAM Prototype Architecture

Uses an in-order scalar processor with L1 caches, tightly integrated with a vector execution unit (with 8 lanes)
16 MB of embedded DRAM as main memory
  Connected to the scalar and vector units through a crossbar
  Organized in 8 independent banks, each with a 256-bit synchronous interface
  -> provides sufficient sequential and random bandwidth even for demanding applications
  -> reduces the energy penalty by avoiding the memory bus bottlenecks of conventional multi-chip systems
DMA engine for off-chip access
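
A rough sketch of why independent banks help both sequential and strided access. The bank count and interface width come from the slide; the word-interleaved address mapping in `bank_of`, the helper `banks_touched` and the idea that every bank can deliver one word per cycle are assumptions for illustration, not the documented VIRAM memory organization.

```python
NUM_BANKS = 8          # from the slide: 8 independent banks
WORD_BYTES = 256 // 8  # from the slide: 256-bit interface per bank

def bank_of(addr):
    """Bank index under an ASSUMED word-interleaved mapping."""
    return (addr // WORD_BYTES) % NUM_BANKS

def banks_touched(start, stride_words, count):
    """Number of distinct banks hit by `count` accesses at a fixed word stride."""
    return len({bank_of(start + i * stride_words * WORD_BYTES) for i in range(count)})

# Peak figure, ASSUMING every bank delivers one full word each cycle.
peak_bytes_per_cycle = NUM_BANKS * WORD_BYTES
print(peak_bytes_per_cycle)    # 256
print(banks_touched(0, 1, 8))  # unit stride spreads over all 8 banks
print(banks_touched(0, 8, 8))  # a stride of 8 words keeps hitting one bank
print(banks_touched(0, 3, 8))  # an odd stride again spreads over all 8 banks
```

Under this assumed mapping, only strides that share a factor with the bank count concentrate traffic on a subset of the banks.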

30
Modular Vector Unit Design

Vector unit is managed as a co-processor
Single-issue, in-order pipeline for predictable
performance
Efficient for short vectors
Pipelined instruction start-up
Full support for instruction chaining

The 256-bit datapath can be configured as four 64-bit operations, eight 32-bit operations or sixteen 16-bit operations (SIMD)
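
The subword configurations above can be mimicked in software. The sketch below is an illustration of the general "SIMD within a register" idea, not of VIRAM's hardware: `packed_add`, `pack` and `unpack` are hypothetical helpers that treat a Python integer as a 256-bit word split into equal lanes and add lane by lane with wraparound.

```python
def packed_add(a, b, lane_bits, total_bits=256):
    """Add two packed words lane by lane, wrapping around within each lane."""
    lanes = total_bits // lane_bits
    high = 0
    for i in range(lanes):
        high |= 1 << (i * lane_bits + lane_bits - 1)  # top bit of every lane
    low = ((1 << total_bits) - 1) ^ high              # all remaining bits
    # Add the low bits (carries cannot cross a lane), then fix up the top bits.
    return ((a & low) + (b & low)) ^ ((a ^ b) & high)

def pack(values, lane_bits):
    """Pack a list of integers into one word, lane 0 in the lowest bits."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & ((1 << lane_bits) - 1)) << (i * lane_bits)
    return word

def unpack(word, lane_bits, total_bits=256):
    """Split a packed word back into its lanes."""
    mask = (1 << lane_bits) - 1
    return [(word >> (i * lane_bits)) & mask for i in range(total_bits // lane_bits)]

for lane_bits in (64, 32, 16):
    print(f"{lane_bits}-bit lanes: {256 // lane_bits} operations per 256-bit word")

a = pack(list(range(16)), 16)
b = pack([0xFFFF] * 16, 16)
result = unpack(packed_add(a, b, 16), 16)
print(result)
```

Adding 0xFFFF to a 16-bit lane is the same as subtracting 1 modulo 2^16, so lane 0 wraps to 65535 without disturbing its neighbours.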

31
Embedded DRAM in VIRAM

DRAM: dynamic RAM; information must be periodically refreshed to mimic the behaviour of static storage
On-chip DRAM is connected to the vector execution lanes via memory crossbars
cf. most cache-based machines, which use SRAM: SRAM is more expensive and less dense
In conventional architectures
  Most of the instructions and data are fetched from two lower levels of the memory hierarchy, the L1 and L2 caches, which use small SRAM-based memory structures
  Most reads from the DRAM do not come directly from the CPU, but are (burst) reads initiated to bring data and instructions into these caches
Each DRAM macro is 1.5 MB in size
DRAM latency is included in the vector execution pipeline

32
Non-Delayed Pipeline

Random access latency could lead to stalls due to long-latency loads

33
Delayed Vector Pipeline

Solution: include the random access latency in the vector unit pipeline
Delay arithmetic operations and stores to shorten RAW hazards

34
Vector Instruction Set

Complete load-store vector instruction set
  Extends the MIPS64 ISA with vector instructions
  Data types supported: 64, 32, 16 and 8 bit
  32 general-purpose vector registers, 32 vector flag registers, 16 scalar registers
  91 instructions: arithmetic, logical, vector processing, sequential/strided/indexed loads and stores
The ISA does not include:
  Maximum vector register length
  Functional unit datapath width
DSP support
  Fixed-point arithmetic, saturating arithmetic
  Intra-register permutations for butterfly operations
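
Saturating arithmetic, listed under DSP support above, clamps a result to the representable range instead of letting it wrap around. A minimal sketch for signed 8-bit values follows; `saturating_add` and `wrapping_add` are hypothetical helper names, and this is the generic definition rather than the behaviour of any specific VIRAM instruction.

```python
def saturating_add(a, b, bits=8):
    """Signed addition that clamps to the representable range instead of wrapping."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, a + b))

def wrapping_add(a, b, bits=8):
    """Signed two's-complement addition with wraparound, for comparison."""
    mask = (1 << bits) - 1
    s = (a + b) & mask
    return s - (1 << bits) if s >> (bits - 1) else s

print(saturating_add(100, 100))  # 127: clamped at the largest 8-bit value
print(wrapping_add(100, 100))    # -56: the sum wraps into negative values
```

For pixels and audio samples, clamping is usually the less harmful failure mode: an overflowing bright pixel stays bright instead of turning dark.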

35
Vector Instruction Set

Compiler and OS support
Conditional execution of vector operations
Support for software speculation of load
operations
MMU-based virtual memory
Restartable arithmetic exceptions
Valid and dirty bits for vector registers
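
Conditional execution of vector operations can be pictured as a per-element flag deciding whether each result is written. The sketch below is a software analogy only: `masked_op` is a hypothetical helper and the Python list of booleans stands in for a vector flag register.

```python
def masked_op(op, dst, a, b, flags):
    """Apply op element-wise where the flag is set; keep dst unchanged elsewhere."""
    return [op(x, y) if f else d for d, x, y, f in zip(dst, a, b, flags)]

# Scalar loop being vectorized:  if a[i] > 0: c[i] = a[i] * b[i]
a = [3, -1, 4, 0, -5]
b = [2, 2, 2, 2, 2]
c = [0, 0, 0, 0, 0]
flags = [x > 0 for x in a]  # a vector compare would set the flag register
c = masked_op(lambda x, y: x * y, c, a, b, flags)
print(c)  # prints [6, 0, 8, 0, 0]
```

This is what lets a compiler vectorize loops that contain an `if` in the body without branching per element.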

36
Vector IRAM for Desktop/Server Applications?

Desktop domain
- Do not expect vector processing to benefit integer applications.
+ Floating-point applications are highly vectorizable.
+ All applications should benefit from the low memory latency and high memory bandwidth of vector IRAM.
Server domain
- Expect to perform poorly due to limited on-chip memory.
+ Should perform better on decision support than on online transaction processing.

37
Vector IRAM for Desktop/Server Applications?

Software effort
+ Vectorizing compilers have been developed and used in commercial environments for decades
- But additional work is required to tune compilers for multimedia workloads and to make DSP features and data types accessible through high-level languages
Design complexity
+ Vector IRAM is highly modular

38
Vector IRAM for Personal Mobile Computing?

Real-time response
  In-order execution that does not rely on data caches -> highly predictable
Continuous data types
  The vector model is superior to MMX-like SIMD extensions
    Provides explicit control of the number of elements each instruction operates on
    Allows scaling of the number of elements each instruction operates on without changing the ISA
    Does not expose data packing and alignment to software
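
The point about scaling the number of elements per instruction without changing the ISA corresponds to the classic strip-mining pattern, sketched below. The function `vector_add` and the `max_vector_length` values are hypothetical; each pass of the `while` loop stands in for setting the vector length and issuing one vector instruction.

```python
def vector_add(a, b, max_vector_length):
    """Add two arrays in chunks no longer than the hardware vector length."""
    out = []
    i = 0
    while i < len(a):
        vl = min(max_vector_length, len(a) - i)  # vector length for this chunk
        out.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))  # one vector op
        i += vl
    return out

a = list(range(10))
b = [10] * 10
# The same code runs on (hypothetical) implementations with different register lengths.
for mvl in (4, 8, 32):
    assert vector_add(a, b, mvl) == [x + 10 for x in a]
print(vector_add(a, b, 4))
```

Because the loop reads the vector length at run time, a wider implementation simply takes fewer iterations, whereas fixed-width SIMD extensions bake the element count into each instruction.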

39
Vector IRAM for Personal Mobile Computing?

Fine-grained parallelism
Vector processing
Coarse-grained parallelism
High-speed multiply-accumulate achieved through instruction chaining
Allows programming in a high-level language, unlike most DSP architectures
Code size
Compactness is possible because a single vector instruction can specify a whole loop
Code size is smaller than VLIW and comparable to x86 CISC code
Memory bandwidth
Available from on-chip hierarchical DRAM

40
Performance Evaluation of VIRAM
41
Performance Evaluation of VIRAM

Performance is reported in iterations per cycle and is normalized to the x86 processor.
With unoptimized code, VIRAM outperforms the x86, MIPS and VLIW processors running unoptimized code; it is 30% and 45% slower than the 1 GHz PowerPC and VLIW processors running optimized code, respectively.
With optimized/scheduled code, VIRAM is 1.6 to 18.5 times faster than all others.
Note:
  VIRAM is the only single-issue design in the processor set
  VIRAM is the only one not using SRAM caches
  VIRAM's clock frequency is the second slowest

42
Vector IRAM for Personal Mobile Computing?
43
Vector IRAM for Personal Mobile Computing?

Energy/power efficiency
  A vector instruction specifies a large number of independent operations -> no energy is wasted on fetching and decoding instructions, checking dependencies or making predictions
  The execution model is in-order -> only limited forwarding is needed and the control logic is simple, hence power-efficient
Typical power consumption
  MIPS core: 0.5 W
  Vector unit: 1.0 W
  DRAM: 0.2 W
  Misc: 0.3 W
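
As a quick arithmetic check, the typical power figures above can be summed and broken down by share. The dictionary name `typical_power_mw` is just for illustration; the numbers are the slide's, converted to milliwatts so the arithmetic stays exact.

```python
# Typical power figures from the slide, in milliwatts.
typical_power_mw = {"MIPS core": 500, "Vector unit": 1000, "DRAM": 200, "Misc": 300}

total_mw = sum(typical_power_mw.values())
print(f"total: {total_mw / 1000:.1f} W")
for block, mw in typical_power_mw.items():
    print(f"{block}: {100 * mw // total_mw}% of total")
```

The components add up to 2.0 W, with the vector unit accounting for half of the total.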

44
Vector IRAM for Personal Mobile Computing?

Design scalability
The processor-memory crossbar is the only place
where vector IRAM uses long wires
Deep pipelining is a viable solution without any
h/w or s/w complications

45
Vector IRAM for Personal Mobile Computing?

Performance scales well with the number of vector lanes.
Compared to the single-lane case, two, four and eight lanes lead to 1.5x, 2.5x and 3.5x performance improvements, respectively.
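
One way to read the scaling figures above is as a fraction of ideal linear scaling. The sketch below only rearranges the slide's numbers; the names `speedup` and `efficiency` are illustrative, and the single-lane entry is 1.0 by definition.

```python
# Speedups over the single-lane case, as quoted on the slide.
speedup = {1: 1.0, 2: 1.5, 4: 2.5, 8: 3.5}

# Fraction of ideal linear scaling achieved at each lane count.
efficiency = {lanes: s / lanes for lanes, s in speedup.items()}
for lanes, e in efficiency.items():
    print(f"{lanes} lane(s): {speedup[lanes]:.1f}x speedup, {e:.4f} of linear")
```

Each doubling of the lane count still adds performance, but the return per lane falls as lanes are added.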

46
Conclusion

Modern architectures are designed and optimized for desktop and server applications.
The newly emerging domain of personal mobile computing poses a different set of architectural requirements.
We have seen that modern architectures do not meet many of the requirements of applications in the personal mobile computing domain.
VIRAM: an effort by UC Berkeley to develop a new architecture targeted at applications in the personal mobile computing domain.
Early results show a promising improvement in performance without compromising the requirement of low power.

47
References

A New Direction for Computer Architecture Research
  Christoforos E. Kozyrakis, David A. Patterson, UC Berkeley. IEEE Computer, November 1998.
Vector IRAM: A Microprocessor Architecture for Media Processing
  Christoforos E. Kozyrakis, UC Berkeley. CS252 Graduate Computer Architecture, 2000.
Vector IRAM: A Media-Oriented Vector Processor with Embedded DRAM
  C. Kozyrakis, J. Gebis, D. Martin, S. Williams, I. Mavroidis, S. Pope, D. Jones, D. Patterson, K. Yelick. 12th Hot Chips Conference, Palo Alto, CA, August 2000.
Exploiting On-Chip Memory Bandwidth in the VIRAM Compiler
  D. Judd, K. Yelick, C. Kozyrakis, D. Martin, D. Patterson. Second Workshop on Intelligent Memory Systems, Cambridge, November 2000.
Vector vs. Superscalar and VLIW Architectures for Embedded Multimedia Benchmarks
  C. Kozyrakis, D. Patterson. 35th International Symposium on Microarchitecture, Istanbul, Turkey, November 2002.
Memory-Intensive Benchmarks: IRAM vs. Cache-Based Machines
  Brian R. Gaeke, Parry Husbands, Xiaoye S. Li, Leonid Oliker, Katherine A. Yelick, Rupak Biswas. Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), Ft. Lauderdale, FL, April 2002.
Logic and Computer Design Fundamentals