Title: Advanced Computer Architecture CSE 8383
1Advanced Computer Architecture, CSE 8383
April 10, 2008 Session 10
2Contents
- Parallel Programming
- Multithreading
- Multi-Core
- Why now?
- A Paradigm Shift
- Multi-Core Architecture
- Case Studies
- IBM Cell
- Intel Core 2 Duo
- AMD
4More Work in Parallel Programming
- Multiple threads of control
- Partitioning for concurrent execution
- Task Scheduling/resource allocation
- Communication and Sharing
- Synchronization
- Debugging
5Explicit versus Implicit Parallel Programming
(Figure: in explicit parallel programming, the programmer writes parallel code directly for the parallel architecture; in implicit parallel programming, the programmer writes sequential code and the compiler maps it onto the parallel architecture.)
6Parallel Programming
7Programmer's Responsibilities
Class  Programmer Responsibility
1      Implicit parallelism (nothing much)
2      Identification of parallelism potential
3      Decomposition (potential), placement
4      Decomposition, high-level coordination
5      Decomposition, high-level coordination, placement
6      Decomposition, low-level coordination
7      Decomposition, low-level coordination, placement
8Programming Languages
- Conventional Languages with extensions
- Libraries
- Compiler directives
- Language constructs
- New Languages
- Conventional Languages with Tools (implicit
parallelism)
9Types of Parallelism
- Data Parallelism
- Single Instruction Multiple Data (SIMD)
- Single Program Multiple Data (SPMD)
- Function (Control) Parallelism
- Perform different functions in parallel
- Pipeline
- Execution overlap
- Instruction Level Parallelism
- Superscalar
- Dataflow
- VLIW
10Supervisor Workers Model (Simple)
11Data Parallel Image Filtering
The Laplace operator is one possible operator for
emphasizing edges in a gray-scale image (edge
detection). The operator carries out a simple
local difference pattern and is therefore well
suited to parallel execution. The Laplace
operator is applied in parallel to each pixel
with its four neighbors
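A minimal data-parallel sketch of this filter in C, assuming the usual 5-point Laplace stencil (4 times the center pixel minus its four neighbors); the image size and the use of OpenMP to express the per-pixel parallelism are illustrative choices, not part of the slide:

/* Data-parallel Laplace edge filter: every output pixel depends only on
 * the input image, so all pixels can be computed in parallel. */
#include <omp.h>

#define H 512                     /* image height (illustrative) */
#define W 512                     /* image width  (illustrative) */

void laplace(const float in[H][W], float out[H][W])
{
    #pragma omp parallel for collapse(2)      /* distribute pixels over cores */
    for (int y = 1; y < H - 1; y++)
        for (int x = 1; x < W - 1; x++)
            out[y][x] = 4.0f * in[y][x]
                      - in[y - 1][x] - in[y + 1][x]    /* north, south */
                      - in[y][x - 1] - in[y][x + 1];   /* west, east   */
}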
12Approximation to π
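This slide carried only a figure; one common concrete example is approximating π by integrating 4/(1+x²) over [0,1] with the midpoint rule, a loop whose iterations are independent and can therefore be split among cores exactly like the loop on the next slide (the interval count below is arbitrary):

/* Approximate pi: the integral of 4/(1+x^2) over [0,1] equals pi. */
#include <stdio.h>

int main(void)
{
    const int n = 1000000;            /* number of sub-intervals (arbitrary) */
    const double h = 1.0 / n;
    double sum = 0.0;

    for (int i = 0; i < n; i++) {     /* iterations are independent */
        double x = (i + 0.5) * h;     /* midpoint of sub-interval i */
        sum += 4.0 / (1.0 + x * x);
    }
    printf("pi is approximately %.10f\n", sum * h);
    return 0;
}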
13Parallelism in Loops
6 processors (cores)
15 loop iterations
for (i = get_myid(); i < 15; i += n_procs)
    x[i] = i;
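A runnable pthreads version of this cyclic distribution; get_myid() and n_procs from the slide are modeled by the thread argument and N_PROCS, both of which are illustrative names:

/* 15 iterations distributed cyclically over 6 worker threads ("cores"). */
#include <pthread.h>
#include <stdio.h>

#define N_PROCS 6                 /* 6 processors (cores) */
#define N_ITERS 15                /* 15 loop iterations   */

static int x[N_ITERS];

static void *worker(void *arg)
{
    int myid = (int)(long)arg;                 /* plays the role of get_myid() */
    for (int i = myid; i < N_ITERS; i += N_PROCS)
        x[i] = i;                              /* iteration i handled by core i % N_PROCS */
    return NULL;
}

int main(void)
{
    pthread_t t[N_PROCS];
    for (long p = 0; p < N_PROCS; p++)
        pthread_create(&t[p], NULL, worker, (void *)p);
    for (int p = 0; p < N_PROCS; p++)
        pthread_join(t[p], NULL);
    for (int i = 0; i < N_ITERS; i++)
        printf("x[%d] = %d\n", i, x[i]);
    return 0;
}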
14Function Parallelism
Determine which process does what
if (get_myid() == x) { ... /* do this */ }
if (get_myid() == y) { ... /* do that */ }
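A minimal pthreads sketch of the same idea: each thread inspects its id and runs a different function; the two ids and the functions themselves are only illustrative:

/* Function (control) parallelism: thread 0 "does this", thread 1 "does that". */
#include <pthread.h>
#include <stdio.h>

static void do_this(void) { printf("thread 0: doing this\n"); }
static void do_that(void) { printf("thread 1: doing that\n"); }

static void *worker(void *arg)
{
    int myid = (int)(long)arg;        /* stands in for get_myid() on the slide */
    if (myid == 0) do_this();
    if (myid == 1) do_that();
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (long i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return 0;
}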
15Task Granularity
- Fine grain
- Operation / instruction level (appropriate for SIMD)
- Medium grain
- Chunk of code / function
- Large grain
- Large function / program
Overhead vs. parallelism tradeoff
16Granularity -- Matrix Multiplication
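A sketch of how granularity shows up in matrix multiplication: a fine-grain task computes a single element of C (one dot product), while a coarse-grain task computes a whole block of rows; the matrix size and the row-block partitioning are illustrative:

/* C = A * B, partitioned two ways to illustrate task granularity. */
#define N 64

/* Fine grain: one task = one element C[i][j]. */
void mm_element(const double A[N][N], const double B[N][N],
                double C[N][N], int i, int j)
{
    double s = 0.0;
    for (int k = 0; k < N; k++)
        s += A[i][k] * B[k][j];
    C[i][j] = s;
}

/* Coarse grain: one task = rows [row0, row1) of C.
 * With P cores, core p could take rows p*N/P .. (p+1)*N/P - 1,
 * trading fewer, larger tasks for lower scheduling overhead. */
void mm_row_block(const double A[N][N], const double B[N][N],
                  double C[N][N], int row0, int row1)
{
    for (int i = row0; i < row1; i++)
        for (int j = 0; j < N; j++)
            mm_element(A, B, C, i, j);
}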
17Serial vs. Parallel Process
18Communication via Shared data
19Synchronization
20Barriers
(Figure: threads T0, T1, and T2 each reach the barrier and wait; once all three have arrived at the synchronization point, they all proceed.)
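A minimal barrier sketch with POSIX threads, assuming a system that provides the (optional) pthread_barrier_t facility; the thread count mirrors the three threads in the figure:

/* Three threads wait at a barrier and proceed together. */
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 3
static pthread_barrier_t barrier;

static void *worker(void *arg)
{
    int id = (int)(long)arg;
    printf("T%d: working before the barrier\n", id);
    pthread_barrier_wait(&barrier);   /* wait until all threads arrive */
    printf("T%d: past the synchronization point\n", id);
    return NULL;
}

int main(void)
{
    pthread_t t[NUM_THREADS];
    pthread_barrier_init(&barrier, NULL, NUM_THREADS);
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}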
21Distributed Memory Parallel Application
- A number of sequential programs, each of which will correspond to one or more processes in a parallel program
- Communication among processes
- Send / receive
- Structure
- Star graph
- Tree
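A minimal send/receive sketch using MPI as one concrete message-passing interface (the slide does not name a specific library); run with two processes, e.g. mpirun -np 2:

/* Process 0 sends one integer to process 1. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                        /* sender */
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {                 /* receiver */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("process 1 received %d from process 0\n", value);
    }
    MPI_Finalize();
    return 0;
}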
22Sorting
24Types of Communication
(Figure: three timelines. Blocking recv(): the caller waits until the message arrives, then resumes execution. Non-blocking nrecv(): the call returns and execution continues immediately. Timeout trecv(): the caller waits only until the time expires, then resumes execution.)
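In MPI terms, the first two variants correspond to MPI_Recv and MPI_Irecv; a minimal sketch contrasting them follows (MPI has no built-in timed receive, so the timeout variant would have to be built from MPI_Test plus a clock):

/* Blocking vs. non-blocking receive on process 1. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, a = 0, b = 0;
    MPI_Request req;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        a = 1; b = 2;
        MPI_Send(&a, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Send(&b, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Blocking: returns only after the message has arrived. */
        MPI_Recv(&a, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Non-blocking: returns immediately; computation can overlap
         * with the transfer until MPI_Wait. */
        MPI_Irecv(&b, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &req);
        /* ... useful work could go here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        printf("received a=%d b=%d\n", a, b);
    }
    MPI_Finalize();
    return 0;
}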
25Multithreading
26Multithreaded Processors
- Several register sets
- Fast Context Switching
(Figure: the processor holds four register sets, one per thread; threads 1-4 each keep their state in a separate register set, enabling fast context switching.)
27Execution in Multithreaded Processors
- Cycle-by-cycle interleaving
- Block interleaving
- Simultaneous multithreading
28Multithreading Techniques
Multithreading
- Cycle-by-cycle interleaving
- Block interleaving
  - Static
    - Explicit switch
    - Implicit switch (switch-on-load, switch-on-store, switch-on-branch, ...)
  - Dynamic
    - Switch-on-cache-miss
    - Switch-on-signal
    - Switch-on-use
    - Conditional switch
Source: Jurij Silc
29Multithreading on Scalar
(Figure: pipeline timelines comparing single-threaded execution, cycle-by-cycle interleaving, and block interleaving, with the context-switching points marked.)
30Single Threaded CPU
- The different colored boxes in RAM represent instructions for four different running programs
- Only the instructions for the red program are actually being executed right now
- This CPU can issue up to four instructions per clock cycle to the execution core, but as you can see it never actually reaches this four-instruction limit.
31Single Threaded SMP
The red program and the yellow process both
happen to be executing simultaneously, one on
each processor. Once their respective time slices
are up, their contexts will be saved, their code
and data will be flushed from the CPU, and two
new processes will be prepared for execution.
32Multithreaded Processors
If the red thread requests data from main memory
and this data isn't present in the cache, then
this thread could stall for many CPU cycles while
waiting for the data to arrive. In the meantime,
however, the processor could execute the yellow
thread while the red one is stalled, thereby
keeping the pipeline full and getting useful work
out of what would otherwise be dead cycles
33Simultaneous Multithreading (SMT)
SMT is simply Multithreading without the
restriction that all the instructions issued by
the front end on each clock be from the same
thread
34The Path to Multi-Core
35Background
- Wafer
- Thin slice of semiconducting material, such as a silicon crystal, upon which microcircuits are constructed
- Die Size
- The die size of the processor refers to its physical surface area on the wafer. It is typically measured in square millimeters (mm2). In essence, a "die" is really a chip; the smaller the chip, the more of them can be made from a single wafer.
- Circuit Size
- The level of miniaturization of the processor. In order to pack more transistors into the same space, they must be continually made smaller and smaller. Measured in microns (µm) or nanometers (nm)
36Examples
- 386C
- Die Size: 42 mm2
- 1.0 µm technology
- 275,000 transistors
- 486C
- Die Size: 90 mm2
- 0.7 µm technology
- 1.2 million transistors
- Pentium III
- Die Size: 106 mm2
- 0.18 µm technology
- 28 million transistors
- Pentium
- Die Size: 148 mm2
- 0.5 µm technology
- 3.2 million transistors
37Pentium III (0.18 µm process technology)
Source: Fred Pollack, Intel, "New Micro-architecture Challenges in the Coming Generations of CMOS Process Technologies," Micro32
39nm Process Technology
Technology (nm)            90   65   45   32   22
Integration capacity (BT)   2    4    8   16   32
40Increasing Die Size
- Using the same technology
- Increasing the die size 2-3X → 1.5-1.7X in performance
- Power is proportional to die area × frequency
- We cannot produce microprocessors with ever-increasing die size; the constraint is POWER
41Reducing circuit Size
- Reducing circuit size in particular is key to reducing the size of the chip.
- The first-generation Pentium used a 0.8 micron circuit size, and required 296 square millimeters per chip.
- The second-generation chip had the circuit size reduced to 0.6 microns, and the die size dropped by a full 50% to 148 square millimeters.
42 Shrink transistors by 30% every generation → transistor density doubles, oxide thickness shrinks, frequency increases, and threshold voltage decreases. Gate thickness cannot keep on shrinking → slowing frequency increase, less threshold voltage reduction.
43Processor Evolution
Generation i (0.5 µm, for example) → Generation i+1 (0.35 µm, for example)
- Gate delay reduces by 1/√2 (frequency up by √2)
- Number of transistors in a constant area goes up by 2x (deeper pipelines, ILP, more caches)
- Additional transistors enable an additional increase in performance
- Result: 2x performance at roughly equal cost
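The same scaling argument written out as equations; the √2 factors follow from the assumed 0.7x (about 1/√2) linear shrink per generation, and the last step uses the common rule of thumb that microarchitectural performance grows roughly with the square root of the transistor count:

L' = L / \sqrt{2} \approx 0.7\,L                         % linear feature size
t'_{gate} = t_{gate} / \sqrt{2} \;\Rightarrow\; f' = \sqrt{2}\,f \approx 1.4\,f
N' = 2\,N                                                % transistors in the same area
P' \approx \sqrt{2} \times \sqrt{2}\,P = 2\,P            % frequency gain times ILP gain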
44What happens to power if we hold die size constant at each generation?
Allows 100% growth in transistors each generation
Source: Fred Pollack, Intel, "New Micro-architecture Challenges in the Coming Generations of CMOS Process Technologies," Micro32
45What happens to die size if we hold power constant at each generation?
Die size has to be reduced by 25% in area each generation → only 50% growth in transistors, which limits PERFORMANCE; power density is still a problem
Source: Fred Pollack, Intel, "New Micro-architecture Challenges in the Coming Generations of CMOS Process Technologies," Micro32
46Power Density continues to soar
Source: Intel Developer Forum, Spring 2004, Pat Gelsinger (Pentium at 90 W)
47Business as Usual won't work: Power is a Major Barrier
- As processors continue to improve in performance and speed, power consumption and heat dissipation have become major challenges
- Higher costs:
- Thermal Packaging
- Fans
- Electricity
- Air conditioning
48A new Paradigm Shift
- Old Paradigm
- Performance improved through frequency, unconstrained power, voltage scaling
- New Paradigm
- Performance improved through IPC, multi-core, and power-efficient microarchitecture advancement
49Multiple CPUs on a Single Chip
- An attractive option for chip designers because
of the availability of cores from earlier
processor generations, which, when shrunk down to
present-day process technology, are small enough
for aggregation into a single die
50Multi-core
Technology generation i → technology generation i+1
(Figure: several generation-i cores fit on a single die in generation i+1.)
- Gate delay does not reduce much
- The frequency and performance of each core is the same as, or a little less than, in the previous generation
51From HT to Many-Core
Intel predicts 100s of cores on a chip in 2015
52Multi-cores are Reality
(Figure: number of cores per chip over time.)
Source: Saman Amarasinghe, MIT (6.189, 2007, Lecture 1)
53Multi-Core Architecture
54Multi-core Architecture
- Multiple cores are being integrated on a single chip and made available for general-purpose computing
- Higher levels of integration:
- multiple processing cores
- caches
- memory controllers
- some I/O processing
- Network on Chip (NoC)
55- Shared memory
- One copy of data shared among multiple cores
- Synchronization via locking
- Intel
- Distributed memory
- Cores access local data
- Cores exchange data
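A minimal pthreads sketch of the shared-memory style above, with one copy of the data shared by the threads and synchronization via locking; the counter and iteration count are only illustrative:

/* Two threads share one counter; a mutex serializes the updates. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                    /* one copy of data, shared by all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);          /* synchronization via locking */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);     /* 2000000 with the lock in place */
    return 0;
}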
56Memory Access Alternatives
                     Shared address space                Distributed address space
Global memory        SMP (Symmetric Multiprocessors)
Distributed memory   DSM (Distributed Shared Memory)     MP (Message Passing)
- Symmetric Multiprocessors (SMP)
- Message Passing (MP)
- Distributed Shared Memory (DSM)
57Network on Chip (NoC)
(Figure: cores connected through a switch network carrying control, data, and I/O traffic, in place of a traditional bus.)
58Shared Memory
Shared Secondary Cache
Shared Primary Cache
Shared Global Memory
59General Architecture
(Figure: a conventional microprocessor, i.e. one CPU core with its registers, L1 instruction and data caches, L2 cache, main memory, and I/O, shown next to a multi-core chip in which several such cores share the path to main memory and I/O.)
60General Architecture (cont)
(Figure: shared-cache and multithreaded shared-cache organizations.)
61 Case Studies
62Case Study 1: IBM's Cell Processor
63Cell Highlights
- Supercomputer on a chip
- Multi-core microprocessor (9 cores)
- >4 GHz clock frequency
- 10X performance for many applications
64Key Attributes
- Cell is Multi-core
- Contains 64-bit Power architecture
- Contains 8 synergistic processor elements
- Cell is a Broadband Architecture
- SPE is a RISC architecture with SIMD organization and local store
- 128 concurrent transactions to memory per processor
- Cell is a Real-Time Architecture
- Resource allocation (for bandwidth measurement)
- Locking caches (via replacement management table)
- Cell is a Security-Enabled Architecture
- Isolate SPE for flexible security programming
65Cell Processor Components
66Cell BE Processor Block Diagram
67POWER Processing Element (PPE)
- POWER Processing Unit (PPU) connected to a 512KB L2 cache
- Responsible for running the OS and coordinating the SPEs
- Key design goals: maximize the performance/power ratio as well as the performance/area ratio
- Dual-issue, in-order processor with dual-thread support
- Utilizes delayed-execution pipelines and allows limited out-of-order execution of load instructions
68Synergistic Processing Elements (SPE)
- Dual-issue, in-order machine with a large 128-entry, 128-bit register file used for both floating-point and integer operations
- Modular design consisting of a Synergistic Processing Unit (SPU) and a Memory Flow Controller (MFC)
- Compute engine with SIMD support and 256KB of dedicated local storage
- The MFC contains a DMA controller with an associated MMU and an Atomic Unit to handle synch operations with other SPUs and the PPU
69SPE (cont.)
- Each SPE operates directly on instructions and data from its dedicated local store
- It relies on a channel interface to access main memory and the other local stores
- The channel interface, which is in the MFC, runs independently of the SPU and is capable of translating addresses and doing DMA transfers while the SPU continues with program execution
- SIMD support can perform operations on 16 8-bit, 8 16-bit, or 4 32-bit integers, or 4 single-precision floating-point numbers per cycle
- At 3.2GHz, each SPU is capable of performing up to 51.2 billion 8-bit integer operations or 25.6 GFLOPS in single precision
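As a quick check of those peak figures (the factor of 2 below assumes a fused multiply-add counted as two floating-point operations, the usual convention):

3.2\,\text{GHz} \times 16 \text{ (8-bit lanes)} = 51.2 \times 10^{9} \text{ 8-bit integer ops/s}
3.2\,\text{GHz} \times 4 \text{ (single-precision lanes)} \times 2 \text{ (multiply-add)} = 25.6\ \text{GFLOPS}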
70Four levels of Parallelism
- Blade level → 2 Cell processors per blade
- Chip level → 9 cores
- Instruction level → dual-issue pipelines on each SPE
- Register level → native SIMD on SPE and PPE VMX
71Cell Chip Floor plan
72Element Interconnect Bus (EIB)
- Implemented as a ring
- Interconnects 12 elements:
- 1 PPE with 51.2GB/s aggregate bandwidth
- 8 SPEs, each with 51.2GB/s aggregate bandwidth
- 1 MIC with 25.6GB/s of memory bandwidth
- 2 IOIFs with 35GB/s (out) and 25GB/s (in) of I/O bandwidth
- Supports two transfer modes:
- DMA between SPEs
- MMIO/DMA between PPE and system memory
Source: Ainsworth and Pinkston, "On Characterizing Performance of the Cell Broadband Engine Element Interconnect Bus," 1st International Symposium on NOCS, 2007
73Element Interconnect Bus (EIB)
- An EIB consists of the following:
- Four 16-byte-wide rings (two in each direction)
  - Each ring is capable of handling up to 3 concurrent non-overlapping transfers
  - Supports up to 12 data transfers at a time
- A shared command bus
  - Distributes commands
  - Sets up end-to-end transactions
  - Handles coherency
- A central data arbiter to connect the 12 Cell elements
  - Implemented in a star-like structure
  - Controls access to the EIB data rings on a per-transaction basis
Source: Ainsworth and Pinkston, "On Characterizing Performance of the Cell Broadband Engine Element Interconnect Bus," 1st International Symposium on NOCS, 2007
74Element Interconnect Bus (EIB)
75Cell Manufacturing Parameters
- About 234 million transistors (compared with 125 million for the Pentium 4), running at more than 4.0 GHz
- Compared to conventional processors, Cell is fairly large, with a die size of 221 square millimeters
- The introductory design is fabricated using a 90 nm silicon-on-insulator (SOI) process
- In March 2007, IBM announced that the 65 nm version of the Cell BE (Broadband Engine) is in production
76Cell Power Consumption
- Each SPE consumes about 1 W when clocked at 2 GHz, 2 W at 3 GHz, and 4 W at 4 GHz
- Including the eight SPEs, the PPE, and other logic, the Cell processor dissipates close to 15 W at 2 GHz, 30 W at 3 GHz, and approximately 60 W at 4 GHz
77Cell Power Management
- Dynamic Power Management (DPM)
- Five Power Management States
- One linear sensor
- Ten digital thermal sensors
78 Case Study 2: Intel's Core 2 Duo
79Intel Core 2 Duo Highlights
- Multi-core microprocessor (2 cores)
- Clock frequencies range from 1.5 to 3 GHz
- 2X performance for many applications
- Dedicated level-1 caches and a shared level-2 cache
- The shared L2 cache comes in two flavors, 2MB and 4MB, depending on the model
- It supports 64-bit architecture
80Intel Core 2 Duo Block Diagram
Dedicated L1
Shared L2
The two cores exchange data implicitly through
the shared level 2 cache
81Intel Core 2 Duo Architecture
Reduced front-side bus traffic: effective data sharing between cores allows data requests to be resolved at the shared cache level instead of going all the way to the system memory
(Figure: with the shared L2, only one copy of the data needs to be retrieved; without it, Core 1 had to retrieve the data from Core 2 by going all the way through the FSB and main memory.)
82Intel's Core 2 Duo Manufacturing Parameters
- About 291 million transistors
- Compared to Cell's 221 square millimeters, Core 2 Duo has a smaller die size, between 143 and 107 square millimeters depending on the model
- The current Intel process technology for the dual core ranges between 65 nm and 45 nm (2007), with an estimate of 155 million transistors
83Intel Core 2 Duo Power Consumption
- Power consumption in Core 2 Duo ranges from 65 W to 130 W depending on the model
- Assuming you have a 75 W processor model (Conroe is 65 W), it will cost you about $4 to keep your computer up for the whole month
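The monthly figure is simple arithmetic; the electricity rate used below (roughly $0.08 per kWh) is an assumption, since the slide does not state one:

75\,\text{W} \times 24\,\text{h/day} \times 30\,\text{days} = 54\,\text{kWh}
54\,\text{kWh} \times \$0.08/\text{kWh} \approx \$4.3 \text{ per month}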
84Intel Core 2 Duo Power Management
- It uses 65 nm technology instead of the previous 90 nm technology (lower voltage requirements)
- Enhanced Speed-Step
- Low VCC Arrays
- Blocks controlled via sleep transistors
- Low leakage transistors
85Case Study 3: AMD's Quad-Core Processor (Barcelona)
86AMD Quad-Core Highlights
- Designed to enable simultaneous 32- and 64-bit computing
- Minimizes the cost of transition and maximizes current investments
- Integrated DDR2 memory controller
- Increases application performance by dramatically reducing memory latency
- Scales memory bandwidth and performance to match compute needs
- HyperTransport Technology provides up to 24.0GB/s peak bandwidth per processor, reducing I/O bottlenecks
87AMD Quad-Core Block Diagram
88AMD Quad-Core Architecture
- It has a crossbar switch instead of the usual bus used in dual-core processors
- This lowers the probability of memory access collisions
- An L3 cache alleviates memory access latency, since the higher number of cores makes memory accesses more frequent
89AMD Quad-Core Architecture (cont)
- Cache hierarchy
- Dedicated L1 cache
  - 2-way associative
  - 8 banks (each 16B wide)
- Dedicated L2 cache
  - 16-way associative
  - Victim cache, exclusive w.r.t. L1
- Shared L3 cache
  - 32-way associative
  - Fills from L3 leave likely-shared lines in L3
  - Victim cache, partially exclusive w.r.t. L2
  - Sharing-aware replacement policy
Replacement policies: L1, L2 pseudo-LRU; L3 sharing-aware pseudo-LRU
90AMD Quad-Core Manufacturing Parameters
- The current AMD process technology for the Quad-Core is 65 nm
- It comprises approximately 463M transistors (about 119M less than Intel's quad-core Kentsfield)
- It has a die size of 285 square millimeters (compared to Cell's 221 square millimeters)
91AMD Quad-Core Power Consumption
- Power consumption in the AMD Quad-Core ranges from 68 W to 95 W (compared to 65 W-130 W for Intel's Core 2 Duo) depending on the model
- AMD CoolCore Technology
- Reduces processor energy consumption by turning off unused parts of the processor. For example, the memory controller can turn off the write logic when reading from memory, helping reduce system power
- Power can be switched on or off within a single clock cycle, saving energy with no impact on performance
92AMD Quad-Core Power Management
Native quad-core technology enables enhanced
power management across all four cores