Advanced Computer Architecture CSE 8383 (presentation transcript)


1
Advanced Computer Architecture CSE 8383
April 10, 2008, Session 10
2
Contents
  • Parallel Programming
  • Multithreading
  • Multi-Core
  • Why now?
  • A Paradigm Shift
  • Multi-Core Architecture
  • Case Studies
  • IBM Cell
  • Intel Core 2 Duo
  • AMD

3
  • Parallel Programming

4
More Work in Parallel Programming
  • Multiple threads of control
  • Partitioning for concurrent execution
  • Task Scheduling/resource allocation
  • Communication and Sharing
  • Synchronization
  • Debugging

5
Explicit versus Implicit Parallel Programming
(Diagram) Explicit: the programmer expresses the parallelism and maps it onto the parallel architecture. Implicit: the programmer writes sequential code and the compiler extracts the parallelism for the parallel architecture.
6
Parallel Programming
7
Programmer's Responsibilities
Class   Programmer Responsibility
1       Implicit parallelism (nothing much)
2       Identification of parallelism potential
3       Decomposition (potential), placement
4       Decomposition, high-level coordination
5       Decomposition, high-level coordination, placement
6       Decomposition, low-level coordination
7       Decomposition, low-level coordination, placement
8
Programming Languages
  • Conventional Languages with extensions
  • Libraries
  • Compiler directives
  • Language constructs
  • New Languages
  • Conventional Languages with Tools (implicit
    parallelism)

9
Types of Parallelism
  • Data Parallelism
  • Same Instruction Multiple Data (SIMD)
  • Same Program Multiple Data (SPMD)
  • Function (Control) Parallelism
  • Perform different functions in parallel
  • Pipeline
  • Execution overlap
  • Instruction Level Parallelism
  • Superscalar
  • Dataflow
  • VLIW

10
Supervisor Workers Model (Simple)
11
Data Parallel Image Filtering
The Laplace operator is one possible operator for
emphasizing edges in a gray-scale image (edge
detection). The operator carries out a simple
local difference pattern and is therefore well
suited to parallel execution. The Laplace
operator is applied in parallel to each pixel,
using its four neighbors (see the sketch below).
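As an illustration of this data-parallel pattern, here is a minimal C sketch (not from the original slides): each output pixel is computed only from its four neighbors, so any partition of the rows can be processed by different cores independently. The image layout, bounds, and function name are assumptions.

/* Hypothetical sketch of a 5-point Laplace filter for edge detection.
 * Each output pixel depends only on its four neighbors, so the rows
 * assigned to different cores can be processed independently. */
void laplace_filter(const float *in, float *out, int w, int h,
                    int row_begin, int row_end)    /* this core's share of rows */
{
    for (int y = row_begin; y < row_end; y++) {
        if (y == 0 || y == h - 1) continue;        /* skip border rows */
        for (int x = 1; x < w - 1; x++) {
            out[y * w + x] = 4.0f * in[y * w + x]
                           - in[(y - 1) * w + x] - in[(y + 1) * w + x]
                           - in[y * w + x - 1]   - in[y * w + x + 1];
        }
    }
}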
12
Approximation to π
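The slide carries only the title, so as a hedged illustration: the classic version of this example approximates π by numerically integrating 4/(1+x²) over [0,1], with each process summing a strided subset of the intervals. The parameters myid and n_procs play the role of the get_myid()/n_procs convention used in the loop example on a later slide.

/* Illustrative sketch (not from the slides): approximate pi with the
 * midpoint rule applied to 4/(1+x^2) on [0,1].  Each process sums a
 * strided subset of the n intervals; the partial sums must then be
 * combined by a reduction (not shown). */
double partial_pi(int n, int myid, int n_procs)
{
    double h = 1.0 / n;
    double sum = 0.0;
    for (int i = myid; i < n; i += n_procs) {
        double x = (i + 0.5) * h;      /* midpoint of interval i */
        sum += 4.0 / (1.0 + x * x);
    }
    return h * sum;                    /* summed over all processes, this gives pi */
}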
13
Parallelism in Loops
6 processors (cores), 15 loop iterations:

for (i = get_myid(); i < 15; i += n_procs)
    x[i] = i;
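The slides treat get_myid() and n_procs as given. As a hedged sketch of how this cyclic (strided) distribution might actually be run with POSIX threads, each thread receives its id as an argument and strides through the iteration space; the constants and names below are illustrative.

/* Illustrative POSIX-threads realization of the strided loop above
 * (assumption: get_myid()/n_procs correspond to a thread id and the
 * thread count). */
#include <pthread.h>
#include <stdio.h>

#define N_PROCS 6      /* 6 cores in the slide's example */
#define N_ITERS 15     /* 15 loop iterations             */

static int x[N_ITERS];

static void *worker(void *arg)
{
    int myid = (int)(long)arg;                /* plays the role of get_myid() */
    for (int i = myid; i < N_ITERS; i += N_PROCS)
        x[i] = i;                             /* iteration i handled by core i % N_PROCS */
    return NULL;
}

int main(void)
{
    pthread_t t[N_PROCS];
    for (long id = 0; id < N_PROCS; id++)
        pthread_create(&t[id], NULL, worker, (void *)id);
    for (int id = 0; id < N_PROCS; id++)
        pthread_join(t[id], NULL);
    for (int i = 0; i < N_ITERS; i++)
        printf("x[%d] = %d\n", i, x[i]);
    return 0;
}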
14
Function Parallelism
Determine which process does what:

if (get_myid() == x)
    ... /* do this */
if (get_myid() == y)
    ... /* do that */
15
Task Granularity
  • Fine grain
  • Operation / instruction level (appropriate for
    SIMD)
  • Medium grain
  • Chunk of code / function
  • Large grain
  • Large function / program

Overhead vs. Parallelism Tradeoff
16
Granularity -- Matrix Multiplication
(Diagram) Matrix multiplication partitioned at different task granularities.
17
Serial vs. Parallel Process
18
Communication via Shared data
19
Synchronization
20
Barriers
(Diagram) Threads T0, T1, and T2 reach the barrier at different times; each early arrival waits at the synchronization point until the last thread arrives, and then all of them proceed.
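A hedged C sketch of the same synchronization point using a POSIX barrier (one possible implementation; the thread count and messages are illustrative):

/* Illustrative barrier synchronization with POSIX threads. */
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 3

static pthread_barrier_t barrier;

static void *run(void *arg)
{
    int id = (int)(long)arg;
    printf("T%d: work before the synchronization point\n", id);
    pthread_barrier_wait(&barrier);   /* early arrivals wait here          */
    printf("T%d: proceed\n", id);     /* all threads continue together     */
    return NULL;
}

int main(void)
{
    pthread_t t[NUM_THREADS];
    pthread_barrier_init(&barrier, NULL, NUM_THREADS);
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&t[i], NULL, run, (void *)i);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}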
21
Distributed Memory Parallel Application
  • A number of sequential programs, each of which
    will correspond to one or more processes in a
    parallel program
  • Communication among processes
  • Send / receive (see the sketch below)
  • Structure
  • Star graph
  • Tree
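As a hedged illustration of the send/receive style with a star (supervisor/workers) structure, here is a minimal MPI sketch in C; rank 0 plays the supervisor and collects one value from each worker. This is an example pattern, not code from the course.

/* Illustrative MPI send/receive with a star structure:
 * rank 0 (supervisor) receives one partial result from each worker. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                          /* supervisor */
        double total = 0.0, part;
        for (int src = 1; src < size; src++) {
            MPI_Recv(&part, 1, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            total += part;
        }
        printf("total = %f\n", total);
    } else {                                  /* worker */
        double part = (double)rank;           /* stand-in for real work */
        MPI_Send(&part, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}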

22
Sorting
23
24
Types of Communication
(Diagram) Three variants of a receive call along a time axis:
  • Blocking recv(): the call waits until the message arrives, then execution resumes
  • Non-blocking nrecv(): the call returns and execution continues immediately; the message is picked up later
  • Timeout trecv(): the call waits until either the message arrives or the timeout expires, then execution resumes
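A hedged sketch of the blocking and non-blocking variants using MPI (the slide's recv()/nrecv()/trecv() names are generic; MPI_Recv and MPI_Irecv are one concrete counterpart, and the polling loop stands in for the timeout behavior):

/* Illustrative blocking vs. non-blocking receive in MPI. */
#include <mpi.h>

void receive_examples(double *buf)
{
    MPI_Status  status;
    MPI_Request request;
    int         done = 0;

    /* Blocking: the call does not return until the message has arrived. */
    MPI_Recv(buf, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status);

    /* Non-blocking: the call returns immediately; do other work, then test. */
    MPI_Irecv(buf, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &request);
    while (!done) {
        /* ... useful computation here ... */
        MPI_Test(&request, &done, &status);   /* checking a clock in this loop
                                                 would give timeout behavior   */
    }
}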
25
Multithreading
26
Multithreaded Processors
  • Several register sets
  • Fast Context Switching

(Diagram) Four hardware register sets, each holding the context of one thread (Threads 1-4), allowing the processor to switch among them quickly.
27
Execution in Multithreaded Processors
  • Cycle-by-cycle interleaving
  • Block interleaving
  • Simultaneous multithreading (SMT)

28
Multithreading Techniques
  • Cycle-by-cycle interleaving
  • Block interleaving
    • Static
      • Explicit switch
      • Implicit switch (switch-on-load,
        switch-on-store, switch-on-branch, ...)
    • Dynamic
      • Switch-on-cache-miss
      • Switch-on-signal
      • Switch-on-use
      • Conditional switch
Source: Jurij Silc
29
Multithreading on Scalar
(Diagram) Pipeline occupancy over time for a single-threaded processor (with context-switching overhead), cycle-by-cycle interleaving, and block interleaving.
30
Single Threaded CPU
  • The different colored boxes in RAM represent
    instructions for four different running programs
  • Only the instructions for the red program are
    actually being executed right now
  • This CPU can issue up to four instructions per
    clock cycle to the execution core, but as you can
    see it never actually reaches this
    four-instruction limit.

31
Single Threaded SMP
The red process and the yellow process both
happen to be executing simultaneously, one on
each processor. Once their respective time slices
are up, their contexts will be saved, their code
and data will be flushed from the CPU, and two
new processes will be prepared for execution.
32
Multithreaded Processors
If the red thread requests data from main memory
and this data isn't present in the cache, then
this thread could stall for many CPU cycles while
waiting for the data to arrive. In the meantime,
however, the processor could execute the yellow
thread while the red one is stalled, thereby
keeping the pipeline full and getting useful work
out of what would otherwise be dead cycles
33
Simultaneous Multithreading (SMT)
SMT is simply Multithreading without the
restriction that all the instructions issued by
the front end on each clock be from the same
thread
34
The Path to Multi-Core
35
Background
  • Wafer
  • Thin slice of semiconducting material, such as a
    silicon crystal, upon which microcircuits are
    constructed
  • Die Size
  • The die size of the processor refers to its
    physical surface area size on the wafer. It is
    typically measured in square millimeters (mm2).
    In essence, a "die" is really a chip; the smaller
    the chip, the more of them can be made from a
    single wafer.
  • Circuit Size
  • The level of miniaturization of the processor. In
    order to pack more transistors into the same
    space, they must be continually made smaller and
    smaller. Measured in microns (µm) or nanometers
    (nm).

36
Examples
  • 386
  • Die Size 42 mm2
  • 1.0 µm technology
  • 275,000 transistors
  • 486
  • Die Size 90 mm2
  • 0.7 µm technology
  • 1.2 million transistors
  • Pentium
  • Die Size 148 mm2
  • 0.5 µm technology
  • 3.2 million transistors
  • Pentium III
  • Die Size 106 mm2
  • 0.18 µm technology
  • 28 million transistors

37
Pentium III (0.18 µm process technology)
Source: Fred Pollack, Intel. New
Micro-architecture Challenges in the coming
Generations of CMOS Process Technologies. Micro32
38
39
nm Process Technology
Technology (nm):                                90   65   45   32   22
Integration capacity (BT, billion transistors):  2    4    8   16   32
40
Increasing Die Size
  • Using the same technology
  • Increasing the die size 2-3X → only 1.5-1.7X gain in
    performance
  • Power is proportional to Die-area × Frequency
  • We cannot produce microprocessors with ever-increasing
    die size: the constraint is POWER
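A hedged back-of-the-envelope reading of those numbers, taking an illustrative 2.5X area increase at fixed frequency and using the slide's proportionality Power ∝ Die-area × Frequency:

P_{new}/P_{old} \approx 2.5, \qquad Perf_{new}/Perf_{old} \approx 1.6
\;\Rightarrow\; \frac{(Perf/W)_{new}}{(Perf/W)_{old}} \approx \frac{1.6}{2.5} \approx 0.64

So performance per watt drops by roughly a third, which is why simply growing the die is not a sustainable path.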

41
Reducing circuit Size
  • Reducing circuit size in particular is key to
    reducing the size of the chip.
  • The first generation Pentium used a 0.8 micron
    circuit size, and required 296 square millimeters
    per chip.
  • The second generation chip had the circuit size
    reduced to 0.6 microns, and the die size dropped
    by a full 50% to 148 square millimeters.

42
Shrink transistors by 30% every generation →
transistor density doubles, oxide thickness
shrinks, frequency increases, and threshold
voltage decreases. Gate thickness cannot keep on
shrinking → slowing frequency increase, less
threshold-voltage reduction.
43
Processor Evolution
Generation i (0.5 µm, for example) → Generation i+1 (0.35 µm, for example)
  • Gate delay reduces by 1/√2 (frequency up by
    √2)
  • Number of transistors in a constant area goes up
    by 2x (deeper pipelines, more ILP, more caches)
  • Additional transistors enable an additional
    increase in performance
  • Result: 2x performance at roughly equal cost
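A hedged sketch of the arithmetic behind these bullets, assuming the classical ~0.7x (≈ 1/√2) linear shrink per generation, as in the 0.5 µm → 0.35 µm example:

\frac{0.35}{0.5} = 0.7 \approx \frac{1}{\sqrt{2}}, \qquad
f_{new} \approx \frac{f_{old}}{0.7} \approx 1.4\, f_{old}, \qquad
\text{density}_{new} \approx \left(\frac{1}{0.7}\right)^{2} \text{density}_{old} \approx 2 \times \text{density}_{old}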

44
What happens to power if we hold die size
constant at each generation?
Allows 100% growth in transistors each
generation
Source: Fred Pollack, Intel. New
Micro-architecture Challenges in the coming
Generations of CMOS Process Technologies. Micro32
45
What happens to die size if we hold power
constant at each generation?
Die size has to be reduced by 25% in area each
generation → only 50% growth in transistors, which
limits PERFORMANCE; power density is still a
problem
Source: Fred Pollack, Intel. New
Micro-architecture Challenges in the coming
Generations of CMOS Process Technologies. Micro32
46
Power Density continues to soar
Source: Intel Developer Forum, Spring 2004, Pat
Gelsinger (Pentium at 90 W)
47
Business as Usual Won't Work: Power is a Major
Barrier
  • As processors continue to improve in performance
    and speed, power consumption and heat dissipation
    have become major challenges
  • Higher costs
  • Thermal Packaging
  • Fans
  • Electricity
  • Air conditioning

48
A new Paradigm Shift
  • Old Paradigm
  • Performance improved via frequency, unconstrained
    power, voltage scaling
  • New Paradigm
  • Performance improved via IPC, multi-core, power-
    efficient microarchitecture advancement

49
Multiple CPUs on a Single Chip
  • An attractive option for chip designers because
    of the availability of cores from earlier
    processor generations, which, when shrunk down to
    present-day process technology, are small enough
    for aggregation into a single die

50
Multi-core
(Diagram) Technology generation i: one generation-i core;
technology generation i+1: several generation-i cores on a single die
  • Gate delay does not reduce much
  • The frequency and performance of each core is the
    same as or a little less than in the previous generation

51
From HT to Many-Core
Intel predicts 100s of cores on a chip in 2015
52
Multi-cores are Reality
(Chart: number of cores per chip over time)
Source: Saman Amarasinghe, MIT (6.189, 2007, Lecture 1)
53
Multi-Core Architecture
54
Multi-core Architecture
  • Multiple cores are being integrated on a single
    chip and made available for general purpose
    computing
  • Higher levels of integration
  • multiple processing cores
  • Caches
  • memory controllers
  • some I/O processing
  • Network on Chip (NoC)

55
  • Shared memory
  • One copy of data shared among multiple cores
  • Synchronization via locking (see the sketch below)
  • e.g., Intel
  • Distributed memory
  • Cores access local data
  • Cores exchange data via messages
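A hedged C sketch of shared-memory synchronization via locking: several threads (cores) update one shared counter protected by a POSIX mutex. The counts and names are illustrative, not from the slides.

/* Illustrative shared-memory synchronization via locking. */
#include <pthread.h>
#include <stdio.h>

static long            shared_counter = 0;                 /* one copy, shared by all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *increment(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);     /* only one thread updates at a time */
        shared_counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, increment, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", shared_counter);   /* 400000 with locking */
    return 0;
}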

56
Memory Access Alternatives
                     Shared address space               Distributed address space
Global memory        SMP (Symmetric Multiprocessors)    -
Distributed memory   DSM (Distributed Shared Memory)    MP (Message Passing)
  • Symmetric Multiprocessors (SMP)
  • Message Passing (MP)
  • Distributed Shared Memory (DSM)

57
Network on Chip (NoC)
(Diagram) Control, data, and I/O blocks connected through an on-chip switch network rather than a traditional shared bus.
58
Shared Memory
Shared Secondary Cache
Shared Primary Cache
Shared Global Memory
59
General Architecture
(Diagram) A conventional microprocessor: one CPU core with registers, split L1 instruction and data caches, an L2 cache, main memory, and I/O. The multi-core version replicates the core (registers plus L1 caches) on a single die in front of the L2 cache, main memory, and I/O.
60
General Architecture (cont)
(Diagram) Two further organizations: multiple cores with a shared cache, and multithreaded cores with a shared cache.
61
Case Studies
62
Case Study 1: IBM's Cell Processor
63
Cell Highlights
  • Supercomputer on a chip
  • Multi-core microprocessor (9 cores)
  • >4 GHz clock frequency
  • 10X performance for many applications

64
Key Attributes
  • Cell is Multi-core
  • - Contains a 64-bit POWER architecture core
  • - Contains 8 Synergistic Processor Elements
  • Cell is a Broadband Architecture
  • - The SPE is a RISC architecture with SIMD organization
    and a local store
  • - 128 concurrent transactions to memory per
    processor
  • Cell is a Real-Time Architecture
  • - Resource allocation (for bandwidth measurement)
  • - Locking caches (via the replacement management
    table)
  • Cell is a Security-Enabled Architecture
  • - Isolate SPE for flexible security programming

65
Cell Processor Components
66
Cell BE Processor Block Diagram
67
POWER Processing Element (PPE)
  • POWER Processing Unit (PPU) connected to a 512KB
    L2 cache.
  • Responsible for running the OS and coordinating
    the SPEs.
  • Key design goals: maximize the performance/power
    ratio as well as the performance/area ratio.
  • Dual-issue, in-order processor with dual-thread
    support
  • Utilizes delayed-execution pipelines and allows
    limited out-of-order execution of load
    instructions.

68
Synergistic Processing Elements (SPE)
  • Dual-issue, in-order machine with a large
    128-entry, 128-bit register file used for both
    floating-point and integer operations
  • Modular design consisting of a Synergistic
    Processing Unit (SPU) and a Memory Flow
    Controller (MFC).
  • Compute engine with SIMD support and 256KB of
    dedicated local storage.
  • The MFC contains a DMA controller with an
    associated MMU and an Atomic Unit to handle synch
    operations with other SPUs and the PPU.

69
SPE (cont.)
  • Each SPE operates directly on instructions and data
    from its dedicated local store.
  • It relies on a channel interface to access
    main memory and the other local stores.
  • The channel interface, which is in the MFC, runs
    independently of the SPU and is capable of
    translating addresses and doing DMA transfers
    while the SPU continues with the program
    execution.
  • SIMD support can perform operations on 16 8-bit,
    8 16-bit, 4 32-bit integers, or 4
    single-precision floating-point numbers per
    cycle.
  • At 3.2GHz, each SPU is capable of performing up
    to 51.2 billion 8-bit integer operations or
    25.6GFLOPs in single precision.
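A hedged check of those peak figures, assuming one 128-bit SIMD operation per cycle and counting a fused multiply-add as two floating-point operations per lane:

16 \times 3.2\,\text{GHz} = 51.2\ \text{billion 8-bit integer ops/s}, \qquad
4 \times 2 \times 3.2\,\text{GHz} = 25.6\ \text{GFLOPS (single precision)}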

70
Four levels of Parallelism
  • Blade level → 2 Cell processors per blade
  • Chip level → 9 cores
  • Instruction level → dual-issue pipelines on each
    SPE
  • Register level → native SIMD on SPE and PPE VMX

71
Cell Chip Floor plan
72
Element Interconnect Bus (EIB)
  • Implemented as a ring
  • Interconnects 12 elements
  • 1 PPE with 51.2GB/s aggregate bandwidth
  • 8 SPEs, each with 51.2GB/s aggregate bandwidth
  • MIC with 25.6GB/s of memory bandwidth
  • 2 IOIFs with 35GB/s (out) and 25GB/s (in) of I/O bandwidth
  • Supports two transfer modes
  • DMA between SPEs
  • MMIO/DMA between PPE and system memory

Source: Ainsworth & Pinkston, "On Characterizing
Performance of the Cell Broadband Engine
Element Interconnect Bus," 1st International Symp.
on NOCS 2007
73
Element Interconnect Bus (EIB)
  • An EIB consists of the following
  • Four 16 byte-wide rings (two in each direction)
  • 1.1 Each ring capable of handling up to 3
    concurrent non-overlapping transfers
  • 1.2 Supports up to 12 data transfers at a time
  • A shared command bus
  • 2.1 Distributes commands
  • 2.2 Sets up end to end transactions
  • 2.3 Handles coherency
  • A central data arbiter to connect the 12 Cell
    elements
  • 3.1 Implemented in a star-like structure
  • 3.2 It controls access to the EIB data rings on a
    per transaction basis

Source: Ainsworth & Pinkston, "On Characterizing
Performance of the Cell Broadband Engine
Element Interconnect Bus," 1st International Symp.
on NOCS 2007
74
Element Interconnect Bus (EIB)
75
Cell Manufacturing Parameters
  • About 234 million transistors (compared with 125
    million for the Pentium 4), running at more than 4.0
    GHz
  • As compared to conventional processors, Cell is
    fairly large, with a die size of 221 square
    millimeters
  • The introductory design is fabricated using a 90
    nm silicon-on-insulator (SOI) process
  • In March 2007 IBM announced that the 65 nm
    version of Cell BE (Broadband Engine) is in
    production

76
Cell Power Consumption
  • Each SPE consumes about 1 W when clocked at 2
    GHz, 2 W at 3 GHz, and 4 W at 4 GHz
  • Including the eight SPEs, the PPE, and other
    logic, the Cell processor dissipates close to
    15 W at 2 GHz, 30 W at 3 GHz, and approximately 60 W
    at 4 GHz

77
Cell Power Management
  • Dynamic Power Management (DPM)
  • Five Power Management States
  • One linear sensor
  • Ten digital thermal sensors

78
Case Study 2: Intel's Core 2 Duo
79
Intel Core 2 Duo Highlights
  • Multi-core microprocessor (2 cores)
  • Clock frequencies range from 1.5 to 3 GHz
  • 2X performance for many applications
  • Dedicated level 1 caches and a shared level 2 cache
  • Its shared L2 cache comes in two flavors, 2MB and
    4MB, depending on the model
  • It supports a 64-bit architecture

80
Intel Core 2 Duo Block Diagram
Dedicated L1
Shared L2
The two cores exchange data implicitly through
the shared level 2 cache
81
Intel Core 2 Duo Architecture
Reduced front-side bus traffic: effective data
sharing between cores allows data requests to be
resolved at the shared-cache level instead of
going all the way to the system memory
(Diagram) With the shared L2, only one copy needs to be retrieved;
previously, Core 1 had to retrieve the data from Core 2 by
going all the way through the FSB and main memory
82
Intel's Core 2 Duo Manufacturing Parameters
  • About 291 million transistors
  • Compared to Cell's 221 square millimeters, Core 2
    Duo has a smaller die size, between 143 and 107
    square millimeters, depending on the model.
  • The current Intel process technology for the dual
    core ranges between 65 nm and 45 nm (2007), with an
    estimate of 155 million transistors.

83
Intel Core 2 Duo Power Consumption
  • Power consumption in Core 2 Duo ranges from 65 W to 130 W
    depending on the model.
  • Assuming you have a 75 W processor model (Conroe is
    65 W), it will cost you roughly $4 to keep your computer up
    for the whole month
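A hedged version of that arithmetic (the electricity rate is an assumption for illustration; roughly $0.075 per kWh was a typical US figure around 2008):

75\,\text{W} \times 24\,\text{h/day} \times 30\,\text{days} = 54\,\text{kWh}, \qquad
54\,\text{kWh} \times \$0.075/\text{kWh} \approx \$4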

84
Intel Core 2 Duo Power Management
  • It uses 65 nm technology instead of the previous
    90 nm technology
  • (lower voltage requirements)
  • Aggressive clock gating
  • Enhanced Intel SpeedStep
  • Low VCC Arrays
  • Blocks controlled via sleep transistors
  • Low leakage transistors

85
Case Study 3: AMD's Quad-Core Processor
(Barcelona)
86
AMD Quad-Core Highlights
  • Designed to enable simultaneous 32- and 64-bit
    computing
  • Minimizes the cost of transition and maximizes
    current investments
  • Integrated DDR2 Memory Controller
  • Increases application performance by
    dramatically reducing memory latency
  • Scales memory bandwidth and performance to match
    compute needs
  • HyperTransport Technology provides up to 24.0GB/s
    peak bandwidth per processor, reducing I/O
    bottlenecks

87
AMD Quad-Core Block Diagram
  • Dedicated L1 and L2
  • Shared L3

88
AMD Quad-Core Architecture
  • It has a crossbar switch instead of the usual bus
    used in dual-core processors
  • This lowers the probability of memory
    access collisions
  • A shared L3 cache alleviates memory access latency,
    since the higher core count makes memory accesses
    more frequent

89
AMD Quad-Core Architecture (cont)
  • Cache Hierarchy
  • Dedicated L1 cache
  • 2-way associative
  • 8 banks (each 16B wide)
  • Dedicated L2 cache
  • 16-way associative
  • Victim cache, exclusive w.r.t. L1
  • Shared L3 cache
  • 32-way associative
  • Fills from L3 leave likely-shared lines in L3
  • Victim cache, partially exclusive w.r.t. L2
  • Sharing-aware replacement policy

Replacement policies: L1, L2 pseudo-LRU;
L3 sharing-aware pseudo-LRU
90
AMD Quad-Core Manufacturing Parameters
  • The current AMD process technology for the Quad-Core
    is 65 nm
  • It comprises approximately 463M transistors
    (about 119M fewer than Intel's quad-core
    Kentsfield)
  • It has a die size of 285 square millimeters
    (compared to Cell's 221 square millimeters)

91
AMD Quad-Core Power Consumption
  • Power consumption in the AMD Quad-Core ranges from 68 W
    to 95 W (compared to 65 W-130 W for Intel's Core 2 Duo),
    depending on the model.
  • AMD CoolCore Technology
  • Reduces processor energy consumption by turning
    off unused parts of the processor. For example,
    the memory controller can turn off the write
    logic when reading from memory, helping reduce
    system power
  • Power can be switched on or off within a single
    clock cycle, saving energy with no impact to
    performance

92
AMD Quad-Core Power Management
Native quad-core technology enables enhanced
power management across all four cores