Title: Advanced Computer Architecture CSE 8383
1Advanced Computer Architecture, CSE 8383
April 10, 2008 Session 10
2Contents
- Parallel Programming
- Multithreading
- Multi-Core
- Why now?
- A Paradigm Shift
- Multi-Core Architecture
- Case Studies
- IBM Cell
- Intel Core 2 Duo
- AMD
4More Work in Parallel Programming
- Multiple threads of control
- Partitioning for concurrent execution
- Task Scheduling/resource allocation
- Communication and Sharing
- Synchronization
- Debugging
5Explicit versus Implicit Parallel Programming
(Figure: in explicit parallel programming, the programmer writes parallel code directly for the parallel architecture; in implicit parallel programming, the programmer writes sequential code and the compiler maps it onto the parallel architecture.)
6Parallel Programming
7Programmer's Responsibilities
Class  Programmer Responsibility
1      Implicit parallelism (nothing much)
2      Identification of parallelism potential
3      Decomposition (potential), placement
4      Decomposition, high-level coordination
5      Decomposition, high-level coordination, placement
6      Decomposition, low-level coordination
7      Decomposition, low-level coordination, placement
8Programming Languages
- Conventional Languages with extensions
- Libraries
- Compiler directives
- Language constructs
- New Languages
- Conventional Languages with Tools (implicit
parallelism)
9Types of Parallelism
- Data Parallelism
- Single Instruction Multiple Data (SIMD)
- Single Program Multiple Data (SPMD)
- Function (Control) Parallelism
- Perform different functions in parallel
- Pipeline
- Execution overlap
- Instruction Level Parallelism
- Superscalar
- Dataflow
- VLIW
10Supervisor Workers Model (Simple)
11Data Parallel Image Filtering
The Laplace operator is one possible operator for
emphasizing edges in a gray-scale image (edge
detection). The operator carries out a simple
local difference pattern and is therefore well
suited to parallel execution. The Laplace
operator is applied in parallel to each pixel
with its four neighbors
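A minimal data-parallel sketch of this filter in C, assuming the usual 5-point Laplace stencil (4 times the center pixel minus its four neighbors); the image size and the use of OpenMP to express the per-pixel parallelism are illustrative choices, not part of the slide:

/* Data-parallel Laplace edge filter: every output pixel depends only on
 * the input image, so all pixels can be computed in parallel. */
#include <omp.h>

#define H 512                     /* image height (illustrative) */
#define W 512                     /* image width  (illustrative) */

void laplace(const float in[H][W], float out[H][W])
{
    #pragma omp parallel for collapse(2)      /* distribute pixels over cores */
    for (int y = 1; y < H - 1; y++)
        for (int x = 1; x < W - 1; x++)
            out[y][x] = 4.0f * in[y][x]
                      - in[y - 1][x] - in[y + 1][x]    /* north, south */
                      - in[y][x - 1] - in[y][x + 1];   /* west, east   */
}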
12Approximation to π
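This slide carried only a figure; one common concrete example is approximating π by integrating 4/(1+x²) over [0,1] with the midpoint rule, a loop whose iterations are independent and can therefore be split among cores exactly like the loop on the next slide (the interval count below is arbitrary):

/* Approximate pi: the integral of 4/(1+x^2) over [0,1] equals pi. */
#include <stdio.h>

int main(void)
{
    const int n = 1000000;            /* number of sub-intervals (arbitrary) */
    const double h = 1.0 / n;
    double sum = 0.0;

    for (int i = 0; i < n; i++) {     /* iterations are independent */
        double x = (i + 0.5) * h;     /* midpoint of sub-interval i */
        sum += 4.0 / (1.0 + x * x);
    }
    printf("pi is approximately %.10f\n", sum * h);
    return 0;
}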
13Parallelism in Loops
6 processors (cores)
15 loop iterations
for (i = get_myid(); i < 15; i += n_procs)
    x[i] = i;
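A runnable pthreads version of this cyclic distribution; get_myid() and n_procs from the slide are modeled by the thread argument and N_PROCS, both of which are illustrative names:

/* 15 iterations distributed cyclically over 6 worker threads ("cores"). */
#include <pthread.h>
#include <stdio.h>

#define N_PROCS 6                 /* 6 processors (cores) */
#define N_ITERS 15                /* 15 loop iterations   */

static int x[N_ITERS];

static void *worker(void *arg)
{
    int myid = (int)(long)arg;                 /* plays the role of get_myid() */
    for (int i = myid; i < N_ITERS; i += N_PROCS)
        x[i] = i;                              /* iteration i handled by core i % N_PROCS */
    return NULL;
}

int main(void)
{
    pthread_t t[N_PROCS];
    for (long p = 0; p < N_PROCS; p++)
        pthread_create(&t[p], NULL, worker, (void *)p);
    for (int p = 0; p < N_PROCS; p++)
        pthread_join(t[p], NULL);
    for (int i = 0; i < N_ITERS; i++)
        printf("x[%d] = %d\n", i, x[i]);
    return 0;
}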
14Function Parallelism
Determine which process does what
if (get_myid() == x) { ... /* do this */ }
if (get_myid() == y) { ... /* do that */ }
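A minimal pthreads sketch of the same idea: each thread inspects its id and runs a different function; the two ids and the functions themselves are only illustrative:

/* Function (control) parallelism: thread 0 "does this", thread 1 "does that". */
#include <pthread.h>
#include <stdio.h>

static void do_this(void) { printf("thread 0: doing this\n"); }
static void do_that(void) { printf("thread 1: doing that\n"); }

static void *worker(void *arg)
{
    int myid = (int)(long)arg;        /* stands in for get_myid() on the slide */
    if (myid == 0) do_this();
    if (myid == 1) do_that();
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (long i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return 0;
}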
15Task Granularity
- Fine grain
- Operation / instruction level (appropriate for SIMD)
- Medium grain
- Chunk of code / function
- Large grain
- Large function / program
Overhead vs. parallelism tradeoff
16Granularity -- Matrix Multiplication
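A sketch of how granularity shows up in matrix multiplication: a fine-grain task computes a single element of C (one dot product), while a coarse-grain task computes a whole block of rows; the matrix size and the row-block partitioning are illustrative:

/* C = A * B, partitioned two ways to illustrate task granularity. */
#define N 64

/* Fine grain: one task = one element C[i][j]. */
void mm_element(const double A[N][N], const double B[N][N],
                double C[N][N], int i, int j)
{
    double s = 0.0;
    for (int k = 0; k < N; k++)
        s += A[i][k] * B[k][j];
    C[i][j] = s;
}

/* Coarse grain: one task = rows [row0, row1) of C.
 * With P cores, core p could take rows p*N/P .. (p+1)*N/P - 1,
 * trading fewer, larger tasks for lower scheduling overhead. */
void mm_row_block(const double A[N][N], const double B[N][N],
                  double C[N][N], int row0, int row1)
{
    for (int i = row0; i < row1; i++)
        for (int j = 0; j < N; j++)
            mm_element(A, B, C, i, j);
}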
17Serial vs. Parallel Process
18Communication via Shared data
19Synchronization
20Barriers
(Figure: threads T0, T1, and T2 each reach the barrier and wait; once all three have arrived at the synchronization point, they all proceed.)
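A minimal barrier sketch with POSIX threads, assuming a system that provides the (optional) pthread_barrier_t facility; the thread count mirrors the three threads in the figure:

/* Three threads wait at a barrier and proceed together. */
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 3
static pthread_barrier_t barrier;

static void *worker(void *arg)
{
    int id = (int)(long)arg;
    printf("T%d: working before the barrier\n", id);
    pthread_barrier_wait(&barrier);   /* wait until all threads arrive */
    printf("T%d: past the synchronization point\n", id);
    return NULL;
}

int main(void)
{
    pthread_t t[NUM_THREADS];
    pthread_barrier_init(&barrier, NULL, NUM_THREADS);
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}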
21Distributed Memory Parallel Application
- A number of sequential programs, each of which will correspond to one or more processes in a parallel program
- Communication among processes
- Send / receive
- Structure
- Star graph
- Tree
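A minimal send/receive sketch using MPI as one concrete message-passing interface (the slide does not name a specific library); run with two processes, e.g. mpirun -np 2:

/* Process 0 sends one integer to process 1. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                        /* sender */
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {                 /* receiver */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("process 1 received %d from process 0\n", value);
    }
    MPI_Finalize();
    return 0;
}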
22Sorting
24Types of Communication
(Figure: three timelines. Blocking recv(): the caller waits until the message arrives, then resumes execution. Non-blocking nrecv(): the call returns and execution continues immediately. Timeout trecv(): the caller waits only until the time expires, then resumes execution.)
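In MPI terms, the first two variants correspond to MPI_Recv and MPI_Irecv; a minimal sketch contrasting them follows (MPI has no built-in timed receive, so the timeout variant would have to be built from MPI_Test plus a clock):

/* Blocking vs. non-blocking receive on process 1. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, a = 0, b = 0;
    MPI_Request req;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        a = 1; b = 2;
        MPI_Send(&a, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Send(&b, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Blocking: returns only after the message has arrived. */
        MPI_Recv(&a, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Non-blocking: returns immediately; computation can overlap
         * with the transfer until MPI_Wait. */
        MPI_Irecv(&b, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &req);
        /* ... useful work could go here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        printf("received a=%d b=%d\n", a, b);
    }
    MPI_Finalize();
    return 0;
}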
25Multithreading
26Multithreaded Processors
- Several register sets
- Fast Context Switching
(Figure: the processor holds four register sets, one per thread; threads 1-4 each keep their state in a separate register set, enabling fast context switching.)
27Execution in Multithreaded Processors
- Cycle-by-cycle interleaving
- Block interleaving
- Simultaneous multithreading
28Multithreading Techniques
Multithreading
- Cycle-by-cycle interleaving
- Block interleaving
  - Static
    - Explicit switch
    - Implicit switch (switch-on-load, switch-on-store, switch-on-branch, ...)
  - Dynamic
    - Switch-on-cache-miss
    - Switch-on-signal
    - Switch-on-use
    - Conditional switch
Source: Jurij Silc
29Multithreading on Scalar
(Figure: pipeline timelines comparing single-threaded execution, cycle-by-cycle interleaving, and block interleaving, with the context-switching points marked.)
30Single Threaded CPU
- The different colored boxes in RAM represent instructions for four different running programs
- Only the instructions for the red program are actually being executed right now
- This CPU can issue up to four instructions per clock cycle to the execution core, but as you can see it never actually reaches this four-instruction limit.
31Single Threaded SMP
The red program and the yellow process both
happen to be executing simultaneously, one on
each processor. Once their respective time slices
are up, their contexts will be saved, their code
and data will be flushed from the CPU, and two
new processes will be prepared for execution.
32Multithreaded Processors
If the red thread requests data from main memory
and this data isn't present in the cache, then
this thread could stall for many CPU cycles while
waiting for the data to arrive. In the meantime,
however, the processor could execute the yellow
thread while the red one is stalled, thereby
keeping the pipeline full and getting useful work
out of what would otherwise be dead cycles
33Simultaneous Multithreading (SMT)
SMT is simply Multithreading without the
restriction that all the instructions issued by
the front end on each clock be from the same
thread
34The Path to Multi-Core
35Background
- Wafer
- Thin slice of semiconducting material, such as a silicon crystal, upon which microcircuits are constructed
- Die Size
- The die size of the processor refers to its physical surface area on the wafer. It is typically measured in square millimeters (mm2). In essence, a "die" is really a chip; the smaller the chip, the more of them can be made from a single wafer.
- Circuit Size
- The level of miniaturization of the processor. In order to pack more transistors into the same space, they must be continually made smaller and smaller. Measured in microns (µm) or nanometers (nm)
36Examples
- 386C
- Die Size: 42 mm2
- 1.0 µm technology
- 275,000 transistors
- 486C
- Die Size: 90 mm2
- 0.7 µm technology
- 1.2 million transistors
- Pentium III
- Die Size: 106 mm2
- 0.18 µm technology
- 28 million transistors
- Pentium
- Die Size: 148 mm2
- 0.5 µm technology
- 3.2 million transistors
37Pentium III (0.18 µm process technology)
Source: Fred Pollack, Intel, "New Micro-architecture Challenges in the Coming Generations of CMOS Process Technologies," Micro32
39nm Process Technology
Technology (nm)            90   65   45   32   22
Integration capacity (BT)   2    4    8   16   32
40Increasing Die Size
- Using the same technology
- Increasing the die size 2-3X → 1.5-1.7X in performance
- Power is proportional to die area × frequency
- We cannot produce microprocessors with ever-increasing die size; the constraint is POWER
41Reducing circuit Size
- Reducing circuit size in particular is key to reducing the size of the chip.
- The first-generation Pentium used a 0.8 micron circuit size, and required 296 square millimeters per chip.
- The second-generation chip had the circuit size reduced to 0.6 microns, and the die size dropped by a full 50% to 148 square millimeters.
42 Shrink transistors by 30% every generation → transistor density doubles, oxide thickness shrinks, frequency increases, and threshold voltage decreases. Gate thickness cannot keep on shrinking → slowing frequency increase, less threshold voltage reduction.
43Processor Evolution
Generation i (0.5 µm, for example) → Generation i+1 (0.35 µm, for example)
- Gate delay reduces by 1/√2 (frequency up by √2)
- Number of transistors in a constant area goes up by 2x (deeper pipelines, ILP, more caches)
- Additional transistors enable an additional increase in performance
- Result: 2x performance at roughly equal cost
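The same scaling argument written out as equations; the √2 factors follow from the assumed 0.7x (about 1/√2) linear shrink per generation, and the last step uses the common rule of thumb that microarchitectural performance grows roughly with the square root of the transistor count:

L' = L / \sqrt{2} \approx 0.7\,L                         % linear feature size
t'_{gate} = t_{gate} / \sqrt{2} \;\Rightarrow\; f' = \sqrt{2}\,f \approx 1.4\,f
N' = 2\,N                                                % transistors in the same area
P' \approx \sqrt{2} \times \sqrt{2}\,P = 2\,P            % frequency gain times ILP gain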
44What happens to power if we hold die size constant at each generation?
Allows 100% growth in transistors each generation
Source: Fred Pollack, Intel, "New Micro-architecture Challenges in the Coming Generations of CMOS Process Technologies," Micro32
45What happens to die size if we hold power constant at each generation?
Die size has to be reduced by 25% in area each generation → only 50% growth in transistors, which limits PERFORMANCE; power density is still a problem
Source: Fred Pollack, Intel, "New Micro-architecture Challenges in the Coming Generations of CMOS Process Technologies," Micro32
46Power Density continues to soar
Source: Intel Developer Forum, Spring 2004, Pat Gelsinger (Pentium at 90 W)
47Business as Usual won't work: Power is a Major Barrier
- As processors continue to improve in performance and speed, power consumption and heat dissipation have become major challenges
- Higher costs:
- Thermal Packaging
- Fans
- Electricity
- Air conditioning
48A new Paradigm Shift
- Old Paradigm
- Performance improved through frequency, unconstrained power, voltage scaling
- New Paradigm
- Performance improved through IPC, multi-core, and power-efficient microarchitecture advancement
49Multiple CPUs on a Single Chip
- An attractive option for chip designers because
of the availability of cores from earlier
processor generations, which, when shrunk down to
present-day process technology, are small enough
for aggregation into a single die
50Multi-core
Technology generation i → technology generation i+1
(Figure: several generation-i cores fit on a single die in generation i+1.)
- Gate delay does not reduce much
- The frequency and performance of each core is the same as, or a little less than, in the previous generation
51From HT to Many-Core
Intel predicts 100s of cores on a chip in 2015
52Multi-cores are Reality
(Figure: number of cores per chip over time.)
Source: Saman Amarasinghe, MIT (6.189, 2007, Lecture 1)
53Multi-Core Architecture
54Multi-core Architecture
- Multiple cores are being integrated on a single chip and made available for general-purpose computing
- Higher levels of integration:
- multiple processing cores
- caches
- memory controllers
- some I/O processing
- Network on Chip (NoC)
55- Shared memory
- One copy of data shared among multiple cores
- Synchronization via locking
- Intel
- Distributed memory
- Cores access local data
- Cores exchange data
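A minimal pthreads sketch of the shared-memory style above, with one copy of the data shared by the threads and synchronization via locking; the counter and iteration count are only illustrative:

/* Two threads share one counter; a mutex serializes the updates. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                    /* one copy of data, shared by all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);          /* synchronization via locking */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);     /* 2000000 with the lock in place */
    return 0;
}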
56Memory Access Alternatives
                     Shared address space                Distributed address space
Global memory        SMP (Symmetric Multiprocessors)
Distributed memory   DSM (Distributed Shared Memory)     MP (Message Passing)
- Symmetric Multiprocessors (SMP)
- Message Passing (MP)
- Distributed Shared Memory (DSM)
57Network on Chip (NoC)
(Figure: cores connected through a switch network carrying control, data, and I/O traffic, in place of a traditional bus.)
58Shared Memory
Shared Secondary Cache
Shared Primary Cache
Shared Global Memory
59General Architecture
(Figure: a conventional microprocessor, i.e. one CPU core with its registers, L1 instruction and data caches, L2 cache, main memory, and I/O, shown next to a multi-core chip in which several such cores share the path to main memory and I/O.)
60General Architecture (cont)
(Figure: shared-cache and multithreaded shared-cache organizations.)
61 Case Studies
62Case Study 1: IBM's Cell Processor
63Cell Highlights
- Supercomputer on a chip
- Multi-core microprocessor (9 cores)
- >4 GHz clock frequency
- 10X performance for many applications
64Key Attributes
- Cell is Multi-core
- Contains 64-bit Power architecture
- Contains 8 synergistic processor elements
- Cell is a Broadband Architecture
- SPE is a RISC architecture with SIMD organization and local store
- 128 concurrent transactions to memory per processor
- Cell is a Real-Time Architecture
- Resource allocation (for bandwidth measurement)
- Locking caches (via replacement management table)
- Cell is a Security-Enabled Architecture
- Isolate SPE for flexible security programming
65Cell Processor Components
66Cell BE Processor Block Diagram
67POWER Processing Element (PPE)
- POWER Processing Unit (PPU) connected to a 512KB L2 cache
- Responsible for running the OS and coordinating the SPEs
- Key design goals: maximize the performance/power ratio as well as the performance/area ratio
- Dual-issue, in-order processor with dual-thread support
- Utilizes delayed-execution pipelines and allows limited out-of-order execution of load instructions
68Synergistic Processing Elements (SPE)
- Dual-issue, in-order machine with a large 128-entry, 128-bit register file used for both floating-point and integer operations
- Modular design consisting of a Synergistic Processing Unit (SPU) and a Memory Flow Controller (MFC)
- Compute engine with SIMD support and 256KB of dedicated local storage
- The MFC contains a DMA controller with an associated MMU and an Atomic Unit to handle synch operations with other SPUs and the PPU
69SPE (cont.)
- Each SPE operates directly on instructions and data from its dedicated local store
- It relies on a channel interface to access main memory and the other local stores
- The channel interface, which is in the MFC, runs independently of the SPU and is capable of translating addresses and doing DMA transfers while the SPU continues with program execution
- SIMD support can perform operations on 16 8-bit, 8 16-bit, or 4 32-bit integers, or 4 single-precision floating-point numbers per cycle
- At 3.2GHz, each SPU is capable of performing up to 51.2 billion 8-bit integer operations or 25.6 GFLOPS in single precision
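As a quick check of those peak figures (the factor of 2 below assumes a fused multiply-add counted as two floating-point operations, the usual convention):

3.2\,\text{GHz} \times 16 \text{ (8-bit lanes)} = 51.2 \times 10^{9} \text{ 8-bit integer ops/s}
3.2\,\text{GHz} \times 4 \text{ (single-precision lanes)} \times 2 \text{ (multiply-add)} = 25.6\ \text{GFLOPS}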
70Four levels of Parallelism
- Blade level → 2 Cell processors per blade
- Chip level → 9 cores
- Instruction level → dual-issue pipelines on each SPE
- Register level → native SIMD on SPE and PPE VMX
71Cell Chip Floor plan
72Element Interconnect Bus (EIB)
- Implemented as a ring
- Interconnects 12 elements:
- 1 PPE with 51.2GB/s aggregate bandwidth
- 8 SPEs, each with 51.2GB/s aggregate bandwidth
- 1 MIC with 25.6GB/s of memory bandwidth
- 2 IOIFs with 35GB/s (out) and 25GB/s (in) of I/O bandwidth
- Supports two transfer modes:
- DMA between SPEs
- MMIO/DMA between PPE and system memory
Source: Ainsworth and Pinkston, "On Characterizing Performance of the Cell Broadband Engine Element Interconnect Bus," 1st International Symposium on NOCS, 2007
73Element Interconnect Bus (EIB)
- An EIB consists of the following:
- Four 16-byte-wide rings (two in each direction)
  - Each ring is capable of handling up to 3 concurrent non-overlapping transfers
  - Supports up to 12 data transfers at a time
- A shared command bus
  - Distributes commands
  - Sets up end-to-end transactions
  - Handles coherency
- A central data arbiter to connect the 12 Cell elements
  - Implemented in a star-like structure
  - Controls access to the EIB data rings on a per-transaction basis
Source: Ainsworth and Pinkston, "On Characterizing Performance of the Cell Broadband Engine Element Interconnect Bus," 1st International Symposium on NOCS, 2007
74Element Interconnect Bus (EIB)
75Cell Manufacturing Parameters
- About 234 million transistors (compared with 125 million for the Pentium 4), running at more than 4.0 GHz
- Compared to conventional processors, Cell is fairly large, with a die size of 221 square millimeters
- The introductory design is fabricated using a 90 nm silicon-on-insulator (SOI) process
- In March 2007, IBM announced that the 65 nm version of the Cell BE (Broadband Engine) is in production
76Cell Power Consumption
- Each SPE consumes about 1 W when clocked at 2 GHz, 2 W at 3 GHz, and 4 W at 4 GHz
- Including the eight SPEs, the PPE, and other logic, the Cell processor dissipates close to 15 W at 2 GHz, 30 W at 3 GHz, and approximately 60 W at 4 GHz
77Cell Power Management
- Dynamic Power Management (DPM)
- Five Power Management States
- One linear sensor
- Ten digital thermal sensors
78 Case Study 2: Intel's Core 2 Duo
79Intel Core 2 Duo Highlights
- Multi-core microprocessor (2 cores)
- Clock frequencies range from 1.5 to 3 GHz
- 2X performance for many applications
- Dedicated level-1 caches and a shared level-2 cache
- The shared L2 cache comes in two flavors, 2MB and 4MB, depending on the model
- It supports 64-bit architecture
80Intel Core 2 Duo Block Diagram
Dedicated L1
Shared L2
The two cores exchange data implicitly through
the shared level 2 cache
81Intel Core 2 Duo Architecture
Reduced front-side bus traffic: effective data sharing between cores allows data requests to be resolved at the shared cache level instead of going all the way to the system memory
(Figure: with the shared L2, only one copy of the data needs to be retrieved; without it, Core 1 had to retrieve the data from Core 2 by going all the way through the FSB and main memory.)
82Intel's Core 2 Duo Manufacturing Parameters
- About 291 million transistors
- Compared to Cell's 221 square millimeters, Core 2 Duo has a smaller die size, between 143 and 107 square millimeters depending on the model
- The current Intel process technology for the dual core ranges between 65 nm and 45 nm (2007), with an estimate of 155 million transistors
83Intel Core 2 Duo Power Consumption
- Power consumption in Core 2 Duo ranges from 65 W to 130 W depending on the model
- Assuming you have a 75 W processor model (Conroe is 65 W), it will cost you about $4 to keep your computer up for the whole month
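The monthly figure is simple arithmetic; the electricity rate used below (roughly $0.08 per kWh) is an assumption, since the slide does not state one:

75\,\text{W} \times 24\,\text{h/day} \times 30\,\text{days} = 54\,\text{kWh}
54\,\text{kWh} \times \$0.08/\text{kWh} \approx \$4.3 \text{ per month}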
84Intel Core 2 Duo Power Management
- It uses 65 nm technology instead of the previous 90 nm technology (lower voltage requirements)
- Enhanced Speed-Step
- Low VCC Arrays
- Blocks controlled via sleep transistors
- Low leakage transistors
85Case Study 3: AMD's Quad-Core Processor (Barcelona)
86AMD Quad-Core Highlights
- Designed to enable simultaneous 32- and 64-bit computing
- Minimizes the cost of transition and maximizes current investments
- Integrated DDR2 memory controller
- Increases application performance by dramatically reducing memory latency
- Scales memory bandwidth and performance to match compute needs
- HyperTransport Technology provides up to 24.0GB/s peak bandwidth per processor, reducing I/O bottlenecks
87AMD Quad-Core Block Diagram
88AMD Quad-Core Architecture
- It has a crossbar switch instead of the usual bus used in dual-core processors
- This lowers the probability of memory access collisions
- An L3 cache alleviates memory access latency, since the higher number of cores makes memory accesses more frequent
89AMD Quad-Core Architecture (cont)
- Cache hierarchy
- Dedicated L1 cache
  - 2-way associative
  - 8 banks (each 16B wide)
- Dedicated L2 cache
  - 16-way associative
  - Victim cache, exclusive w.r.t. L1
- Shared L3 cache
  - 32-way associative
  - Fills from L3 leave likely-shared lines in L3
  - Victim cache, partially exclusive w.r.t. L2
  - Sharing-aware replacement policy
Replacement policies: L1, L2 pseudo-LRU; L3 sharing-aware pseudo-LRU
90AMD Quad-Core Manufacturing Parameters
- The current AMD process technology for the Quad-Core is 65 nm
- It comprises approximately 463M transistors (about 119M less than Intel's quad-core Kentsfield)
- It has a die size of 285 square millimeters (compared to Cell's 221 square millimeters)
91AMD Quad-Core Power Consumption
- Power consumption in the AMD Quad-Core ranges from 68 W to 95 W (compared to 65 W-130 W for Intel's Core 2 Duo) depending on the model
- AMD CoolCore Technology
- Reduces processor energy consumption by turning off unused parts of the processor. For example, the memory controller can turn off the write logic when reading from memory, helping reduce system power
- Power can be switched on or off within a single clock cycle, saving energy with no impact on performance
92AMD Quad-Core Power Management
Native quad-core technology enables enhanced
power management across all four cores