Title: Multi-Processing Hardware
1. Multi-Processing Hardware
2. Agenda
- Multi-Core Background and Software Limitations
- Multi-Processing Methods
- - Scale Up
- - Scale Out
- Process vs. Thread
- Multi-Processing Architecture
- - SMP vs. NUMA
- - Cache Coherency Protocols
- - Programming Discussion
- Hyper-Threading Technology
- High-Level Atomic Operations in Hardware
3. Why Is This Important?
- Computer hardware has changed A LOT in the past 3 years
- The programmer should know the hardware - it makes a difference!
- Sadly, the current (TAU) computer science material barely covers this issue at all
- Not only our degree - most software companies are not fully aware of this either (this is my impression)
- It's much easier to buy faster and more powerful hardware than to optimize the code. But there is a limit to that, and computing demands are constantly increasing
- Interesting, I hope..
4. Multi-Core Background and Software Limitations
5. History - Frequency Gains by Making Transistors Smaller
[Chart: frequency vs. time - shrinking transistors has historically enabled higher frequency; this was the case until the end of 2005]
6. Multi-Core Background
- Transistor sizes have become so small..
- Electrons leak through the transistor dielectric wall which regulates current flow (a VERY general and partial explanation)
- For the CPU to run faster the transistors need to be smaller - a shorter source-to-drain path
- The smaller the transistors are, the larger the leakage
- The larger the leakage, the larger the voltage threshold
- A larger voltage threshold means larger CPU power consumption and more heat produced, requiring more cooling capacity - which costs even more power
- All of the above results in an exponential watt-per-performance ratio
7. Multi-Core Background
- Classical frequency scaling appeared to be at a crossroads
- - Frequency improvements have hit a physics limit
- - Power costs money
- - Going green
- But manufacturing processes continue to enable higher per-die transistor counts
- Processor vendors now deliver greater numbers of cores per die to displace frequency increases
8. Current x86 Processors
- Dual-, quad- and six-core CPUs (Hyper-Threading on some models)
- The multi-core trend is here to stay - 8 cores next year and so on..
- Different-architecture CPUs (RISC, special-purpose CPUs) already have up to 32 cores per socket
- Hardware is evolving MUCH faster than software
- - Most of our home PCs' subsystems are much faster than the average servers from only a year ago!
- We get much more for much less. So, what's the problem?
9. Know Your Software Limitations
- Software does not scale if it hasn't been written to do so!
- A single-threaded application running on a 24-core server will use only one core - the OS can do nothing about it.
- A single-threaded application running on a 24-core server will actually run slower than it would on your home PC (or any other single-core machine).
- - Explanation coming later. Any ideas?
10. Know Your Software Limitations
- Single- or Dual-Threaded - typically developed before 2000
- - Legacy server software and most desktop software
- - Most server software developed by application developers new to multi-threading
- N-Threaded (typically N = 4 to 8) - the majority of business applications
- - The vast majority of server software developed in the past 5-8 years
- - Exchange, Notes, Citrix, Terminal Services, most Java, File/Print/Web serving, etc.
- - Almost all customer-developed applications
- - Some workstation software
- Fully Threaded Software (N = 8 to 64) - enterprise-enabled applications
- - SQL Server, Oracle, DB/2, SAP, PeopleSoft, VMware, 64-bit Terminal Services, etc.
11. So, What Can Be Done?
- Obviously, write N-threaded or fully threaded applications
- But this is not so simple
- - Writing a fully threaded application is hard.
- - Rewriting all of the existing applications is not possible.
- Virtualization
12. Virtualization Technology
- Separation of OS and hardware
- Encapsulation of OS and application into VMs
- Isolation
- Hardware independence
- Flexibility
[Diagram: VMs running on a virtualization layer above the hardware]
- Works with what you have today
13. Multi-Core Methods
14. Market View
- Every server is a multi-core server
- The multi-core trend will continue
- Two major software trends
- - Virtualization
- - Cloud computing
- Power consumption has become a real issue
[Image: a one-socket (quad-core, HT) server]
15. Multi-Processing Methods
- Two main methods
- - Scale Up
- - Scale Out
16. Multi-Processing Methods - Scale Up
- One large multi-core server
- Single OS
- It is not uncommon to see a server with 24 cores and 256GB of RAM
- Common applications
- - Databases
- - ERP
- - Virtualization
- - Mathematical applications
[Image: one server with 96 cores and 1TB of RAM]
17. Multi-Processing Methods - Scale Out
- Many smaller and faster servers, each one doing a fraction of the work
- One OS per server
- Dense, with low power consumption
- This is the future
- Common applications
- - HPC
- - Cloud
- - Web
- - Rendering
- - VLSI (chip design)
18. Process vs. Thread
19. What Is a Process?
- Execution context
- - Program counter (PC)
- - Stack pointer (SP)
- - Data registers
- Code
- Data
- Stack
20. What Is a Thread?
- Execution context
- - Program counter (PC)
- - Stack pointer (SP)
- - Data registers
21. Process vs. Thread
- Processes are typically independent; threads exist as subsets of a process. Each process has one or more threads
- Processes carry considerable state information; multiple threads within a process share state as well as memory and other resources
- Processes have separate address spaces; threads of the same process share their address space
- Inter-process communication is expensive - it needs a context switch
- Inter-thread communication is cheap - it can use process memory and may not need a context switch
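As a hedged illustration (not on the slides), this C sketch contrasts the two: fork() gives the child its own copy of the address space, while pthread_create() starts a thread that shares the parent's memory.

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int shared = 0;               /* shared by threads, copied across fork */

    void *thread_fn(void *arg)
    {
        shared++;                 /* same address space: main() sees this */
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, thread_fn, NULL);
        pthread_join(t, NULL);
        printf("after thread: shared = %d\n", shared);   /* prints 1 */

        if (fork() == 0) {        /* child works on a private copy */
            shared++;             /* invisible to the parent */
            _exit(0);
        }
        wait(NULL);
        printf("after fork: shared = %d\n", shared);     /* still 1 */
        return 0;
    }

Build with -pthread; the two printf lines show the thread's write is visible while the forked child's is not.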
22. Multi-Processing Architecture
23. Multi-Processing Architecture (Scale-Up Servers)
- SMP - Symmetric Multi-Processing
- - Implemented by Intel until the new generation, Nehalem
- - Also known as UMA
- NUMA - Non-Uniform Memory Access
- - AMD design
24. Multi-Processing Architecture - SMP
- All CPUs have symmetric access to all hardware resources such as memory, the I/O bus, and interrupts - thus the name symmetric multiprocessing.
- Advantage - simplifies the development of the operating system and applications
- Disadvantage - limited scalability of this architecture
- - As processors are added to the system, the shared resources are accessed by an increasingly greater number of processors. More processors using the same resources creates queuing delays
25. Multi-Processing Architecture - NUMA
- The memory controller is embedded in the processor
- Each processor has local, fast memory and remote, slower memory
- Is it better?
- - It depends on whether the OS and the application are NUMA-aware
- Think of the supermarket lines
26. Multi-Processing Architecture - NUMA Awareness
- Modern operating systems are NUMA-aware
- The OS gets the NUMA information from the SRAT table, which is created by the system BIOS
- - SRAT - Static Resource Affinity Table
- - It includes topology information for all the processors and memory in a system
- - The topology information includes the number of nodes in the system and which memory is local to each processor
- The OS can read this info the same way it can read the CPU type, machine serial number, etc. (using the ACPI specification)
27. Multi-Processing Architecture - NUMA Awareness: SRAT (Static Resource Affinity Table)
- Implements the concept of proximity domains in a system
- - Resources in a system, including processors, memory, and PCI adapters, are grouped by how tightly coupled they are
- The OS can use this information to determine the best resource allocation and the scheduling of threads throughout the system
- Recent ACPI implementations include the proximity method (_PXM), so the OS can be NUMA-aware even without SRAT, and this introduces support for hot-plug devices
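As a hedged user-space illustration (beyond the slides), Linux exposes this SRAT-derived topology through libnuma; allocating memory on the node a thread runs on avoids the remote-access penalty. A minimal sketch, assuming libnuma is installed (link with -lnuma):

    #include <numa.h>      /* Linux-specific NUMA policy API */
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }
        int node = 0;
        numa_run_on_node(node);                        /* pin this thread to node 0 */
        void *buf = numa_alloc_onnode(1 << 20, node);  /* 1 MB of node-local memory */
        printf("%d NUMA nodes; buffer %p allocated on node %d\n",
               numa_max_node() + 1, buf, node);
        numa_free(buf, 1 << 20);
        return 0;
    }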
28. Multi-Processing Architecture - NUMA Awareness: SRAT (Static Resource Affinity Table) - Example
29. Multi-Processing Architecture - Problems
Or, why does a 4-processor machine not work 4 times faster than a 1-processor machine?
- Shared resources - buses, memory controller, I/O hub, etc.
- Electrical load on the buses - speed must be slowed down to ensure valid data
- Data coherency, especially cache coherency
30. Multi-Processing Architecture - Cache Coherency Protocols: the MESI Protocol
- Intel implementation
- MESI stands for Modified, Exclusive, Shared, and Invalid
- One of these four states is assigned to every data element stored in each CPU cache, using two additional bits per cache line
- On each processor data load into cache, the processor must broadcast to all other processors in the system to check their caches to see if they have the requested data
- These broadcasts are called snoop cycles, and they must occur during every memory read or write operation
31. Multi-Processing Architecture - the MESI Protocol
- The Exclusive state is used to indicate that data is stored in only one cache.
- Data that is marked Exclusive can be updated in the cache without a snoop broadcast to the other CPUs.
[Diagram: a CPU announces "I need X" on the front-side bus; memory holds X = 4, and the requesting CPU caches X = 4 in the Exclusive (E) state]
32. Multi-Processing Architecture - the MESI Protocol
- The Shared state is used to indicate that data is held by more than one processor
- If the front-side bus request is a read operation, the data is marked as Shared, indicating that each copy is read-only and cannot be modified without notifying the other CPUs.
[Diagram: a second CPU asks "I want to know what X equals"; the owning CPU's line drops from Exclusive (E) to Shared (S), and both caches now hold X = 4 in the S state]
33. Multi-Processing Architecture - the MESI Protocol
- The Modified state is used to indicate that data has been modified while stale copies might still be stored in other CPUs
- If the operation is a write request, any CPU possessing the unmodified data must mark its copy as Invalid, indicating that the data can no longer be used.
- Main memory is updated
[Diagram: one CPU writes X = 5; its cache line goes from Shared (S) to Modified (M), the other CPU's X = 4 copy goes from S to Invalid (I), and main memory is updated to X = 5]
34. Multi-Processing Architecture - the MESI Protocol
- The Invalid state is used to indicate that data has been modified by another CPU and is no longer relevant
[Diagram: another CPU now writes X = 6; its line becomes Modified (M), the previous writer's X = 5 copy becomes Invalid (I), and main memory is updated to X = 6]
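This write-invalidate traffic is what makes false sharing expensive: two unrelated counters that merely share a cache line ping-pong between the M and I states on every write. A hedged C sketch (not from the slides) of the problem and the usual padding fix:

    #include <pthread.h>

    /* Both counters live in one cache line: each increment by one thread
       invalidates the line in the other thread's cache (M -> I ping-pong). */
    struct { long a, b; } hot;

    /* Padding gives each counter its own 64-byte line, removing the traffic. */
    struct { long v; char pad[64 - sizeof(long)]; } cool[2];

    void *bump_a(void *arg) { for (long i = 0; i < 100000000; i++) hot.a++; return NULL; }
    void *bump_b(void *arg) { for (long i = 0; i < 100000000; i++) hot.b++; return NULL; }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump_a, NULL);
        pthread_create(&t2, NULL, bump_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }

Switching the loops from hot.a/hot.b to cool[0].v/cool[1].v typically makes the run several times faster on a multi-core machine.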
35. Multi-Processing Architecture - Cache Coherency Protocols: the MOESI Protocol
- AMD implementation - better for NUMA
- The MOESI protocol expands the MESI protocol with yet another cache-line status flag, namely the Owner status (thus the O in MOESI).
- After the update of a cache line, the cache line is not written back to system memory but is flagged as owned.
- When another CPU issues a read, it gets its data from the owner's cache rather than from slower memory
36. Multi-Processing Architecture - Cache Coherency Protocols: the MOESI Protocol
- A cache line still isn't written back to memory until a cache flush
- The owner CPU doesn't need to send snoop messages in case it modified the data
- This offloads the buses and the memory controller
37. Multi-Processing Architecture - SMP FSB Load
- The cache coherency protocol generates a great load on the front-side bus, and many times this bus becomes the performance bottleneck of these systems
- How can we offload this traffic?
[Diagram: multiple CPUs with cache lines in M/S/I states sharing one front-side bus to memory; all snoop traffic crosses this single bus]
38. Multi-Processing Architecture - Snoop Filter
[Diagram: a scalability controller containing a snoop filter, connected to the processor buses, the memory controller with its memory interfaces, and two I/O bridges]
- One processor requests a memory address not residing in its L2/L3 cache (a processor cache miss)
39. Multi-Processing Architecture - Snoop Filter
[Diagram: the same system; the snoop filter is a directory of address tags and MESI states held in embedded DRAM (eDRAM), organized 9-way set associative]
- On the cache miss, the scalability controller looks up the tag in the directory
40. Multi-Processing Architecture - Snoop Filter
[Diagram: the same system; no snoop is reflected onto the other processor bus]
- No address-tag match in the directory means no other cache holds the line
- The request goes to main memory without waiting for a front-side bus snoop
(The directory is built from IBM embedded DRAM, 9-way set associative)
41. Multi-Processing Architecture - Programming Discussion
Generally, the SMP architecture lets you use the advantages of threads (faster intercommunication and context switching) without the remote-memory-access penalty; for example, Microsoft SQL Server runs as one process with many threads. This architecture is also simpler. NUMA, on the other hand, can potentially give you better performance, assuming the application is optimized for it. The programmer needs to fine-tune the distribution of processes and threads; an ideal case might be one process with 4 (or more) threads per socket, to minimize remote memory access (one way to implement such pinning is sketched after the questions below).
- Threads vs. Processes
- How would you optimize for this?
- And assuming the architecture is NUMA? SMP?
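On Linux, a hedged sketch of such pinning (the helper name is illustrative): each thread sets its own CPU affinity with pthread_setaffinity_np, so its work and, under a first-touch allocation policy, its memory stay on one socket.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Pin the calling thread to a single CPU. On a NUMA machine, keeping
       a thread on one socket keeps its first-touch allocations local. */
    static int pin_to_cpu(int cpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

For example, threads 0-3 could call pin_to_cpu() for the CPUs of socket 0 and threads 4-7 for those of socket 1, matching the "4 threads per socket" case above.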
42. Hyper-Threading Technology
43. Intel Hyper-Threading Technology
[Diagram: execution-unit occupancy over time (processor cycles), with and without SMT; each box represents one processor execution unit]
- Also known as Simultaneous Multi-Threading (SMT)
- Run 2 threads at the same time per core
- - Take advantage of the 4-wide execution engine
- - Keep it fed with multiple threads
- - Hide the latency of a single thread
- Hyper-Threading Technology makes a single physical processor appear as two logical processors
- Intel Xeon 5500 microarchitecture advantages
- - Larger caches
- - Massive memory BW
Why is that important?
44. Hyper-Threading Technology - Why?
- We want to max out the utilization of the CPU's execution units
- The CPU may stall due to a cache miss, branch misprediction, or data dependency
- - Data dependency example - RAW (Read After Write):
- - i1. R2 <- R1 + R3
- - i2. R4 <- R2 + R3
- We need an efficient solution
- - Very low die-area cost
- - Can provide a significant performance benefit depending on the application
- - Much more efficient than adding an entire core
45. Hyper-Threading Technology - How Does It Work?
- Each logical processor has its own architecture state
- Architecture state
- - The part of the CPU that holds the state of the process
- - Control registers
- - General-purpose registers
- - Instruction streaming buffers and trace-cache fill buffers
- - Instruction Translation Look-aside Buffer
- There is only one set of execution units
46. Hyper-Threading Technology - How Does It Work?
47. Hyper-Threading Technology - How Does It Work?
[Diagram: wasted execution-unit slots]
48. Hyper-Threading Technology - OS Awareness
For best performance, the operating system should implement a few optimizations:
- Hyper-Threading-aware thread scheduling
- - The operating system should schedule threads to logical processors on different physical processors before scheduling two threads to the same physical processor.
- Aggressive HALT of processors in the idle loop
- Using the YIELD (PAUSE) instruction to avoid spinlock contention
- - The PAUSE instruction causes the logical processor to pause for a short period of time (approximately 50 clock cycles) and allows the other logical processor access to the shared resources on the physical HT processor.
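A hedged C sketch (not from the slides) of a PAUSE-friendly spinlock; _mm_pause() is the compiler intrinsic that emits the PAUSE instruction:

    #include <stdatomic.h>
    #include <immintrin.h>    /* _mm_pause() emits x86 PAUSE */

    static atomic_flag lock = ATOMIC_FLAG_INIT;

    void spin_lock(void)
    {
        /* test-and-set is a read-modify-write; on x86 it compiles
           to a LOCK-prefixed instruction */
        while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
            _mm_pause();      /* yield shared HT resources to the sibling */
    }

    void spin_unlock(void)
    {
        atomic_flag_clear_explicit(&lock, memory_order_release);
    }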
49. Hyper-Threading Technology - Application Awareness
- Operating systems expose a Hyper-Threading API
- - For example, GetLogicalProcessorInformation on Windows
- To optimize application performance on HT-enabled systems, the application should ensure that the threads executing on the two logical processors have minimal dependencies on the shared resources of the same physical processor.
- Bad HT thread-affinity example
- - Threads that perform similar actions and stall for the same reasons should not be scheduled on the same physical processor. TooMuchMilk?
- The benefit of HT is that shared processor resources can be used by one logical processor while the other logical processor is stalled. This does not work when both logical processors are stalled for the same reason.
50. Hyper-Threading Technology - Some Programming Considerations
- Alice:
- - Leave note A
- - While (note B)
- - - go to top_of_loop
- - Go buy milk
- - Remove note A
- Bob:
- - Leave note B
- - If !(note A)
- - - Go buy milk
- - Remove note B
Spin-wait - it can starve the other thread
51. Hyper-Threading Technology - Some Programming Considerations
- Alice:
- - Leave note A
- - While (note B)
- - - wait (ideal time)
- - - go to top_of_loop
- - Go buy milk
- - Remove note A
- Bob:
- - Leave note B
- - If !(note A)
- - - Go buy milk
- - Remove note B
- Ideal time
- - It is clear that the loop variable cannot change faster than the memory bus can update it.
- - Hence, there is no benefit in re-reading the loop variable faster than the time needed for a memory update.
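A hedged sketch of such a throttled wait in C (the flag name is illustrative), polling the note no faster than it can plausibly change and backing off while it doesn't:

    #include <stdatomic.h>
    #include <time.h>

    extern atomic_int note_B;          /* hypothetical shared flag */

    void wait_until_note_B_removed(void)
    {
        struct timespec ts = { 0, 100 };       /* start around 100 ns */
        while (atomic_load(&note_B)) {
            nanosleep(&ts, NULL);              /* don't spin at full speed */
            if (ts.tv_nsec < 100000)
                ts.tv_nsec *= 2;               /* simple exponential backoff */
        }
    }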
52. Hyper-Threading Technology - Hardware Assists
- Two new (2004) instructions aimed directly at the spin-wait issue
- MONITOR - watches a specified area of memory for write activity
- Its companion instruction, MWAIT, associates writes to this memory block with waking up a specific processor
- Since updating a variable is a write activity, by specifying its memory address and using MWAIT, a processor can simply be suspended until the variable is updated.
- Effectively, this enables waiting on a variable without spinning in a loop.
53. High-Level Atomic Operations in Hardware
54. Atomic Operations in Hardware - the LOCK Prefix
- The x86 architecture implements the LOCK prefix
- It causes the processor's LOCK signal to be asserted during the execution of the accompanying instruction (turning it into an atomic instruction)
- The LOCK signal ensures that the processor has exclusive use of any shared memory while the signal is asserted
Atomic instructions (on Intel, at least):
- Increment (lock inc r/m)
- Decrement (lock dec r/m)
- Exchange (xchg r/m, r) - always executed with LOCK semantics
- Fetch-and-add (lock xadd r/m, r)
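You rarely write the LOCK prefix by hand; a minimal C11 sketch whose atomic_fetch_add the compiler typically lowers to lock xadd on x86:

    #include <stdatomic.h>
    #include <stdio.h>

    atomic_int counter = 0;

    int main(void)
    {
        /* an atomic read-modify-write: usually compiles to `lock xadd` */
        int old = atomic_fetch_add(&counter, 1);
        printf("old=%d new=%d\n", old, atomic_load(&counter));
        return 0;
    }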
55. Atomic Operations in Hardware - Read-Modify-Write
- Read-modify-write is a class of high-level atomic operations which both read a memory location and write a new value into it as a single atomic step
- Typically they are used to implement mutexes or semaphores
- Examples
- - Compare-and-Swap
- - Fetch-and-Add
- - Test-and-Set
- - Load-Link / Store-Conditional
56. Atomic Operations in Hardware - Compare and Swap / Exchange
- CMPXCHG r/m, r
- - IF accumulator == DEST
- - THEN
- - - ZF = 1
- - - DEST = SRC
- - ELSE
- - - ZF = 0
- - - accumulator = DEST
- - FI
- ZF indicates whether the swap has occurred
- The destination operand receives a write cycle without regard to the result of the comparison
- Needs the LOCK prefix to ensure atomicity
57. Atomic Operations in Hardware - Compare and Swap / Exchange
- OS system calls and programming languages wrap the LOCK prefix automatically
- #include <sys/atomic_op.h> (AIX)
- - boolean_t compare_and_swap(word_addr, old_val_addr, new_val)
- - atomic_p word_addr; int *old_val_addr; int new_val
- Available in Java
- - AtomicInteger.compareAndSet(int, int) -> boolean
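In portable C11 (a hedged sketch, not the AIX call above), atomic_compare_exchange_strong mirrors the CMPXCHG pseudocode: on failure it copies the current value back into expected, just as CMPXCHG writes DEST into the accumulator:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    int main(void)
    {
        atomic_int x = 4;
        int expected = 4;
        /* succeeds: x == expected, so x becomes 5 (like ZF = 1) */
        bool ok = atomic_compare_exchange_strong(&x, &expected, 5);
        printf("ok=%d x=%d\n", ok, atomic_load(&x));        /* ok=1 x=5 */

        expected = 4;
        /* fails: x is 5; expected is updated to 5 (like ZF = 0) */
        ok = atomic_compare_exchange_strong(&x, &expected, 6);
        printf("ok=%d expected=%d\n", ok, expected);        /* ok=0 expected=5 */
        return 0;
    }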
58. Atomic Operations in Hardware - Compare and Swap / Exchange
Now this can work (I hope, it was late..):
- Alice:
- - If (noMilk)
- - - if (noNote)      <- this test and "leave Note" run as one atomic operation
- - - - leave Note
- - - - buy milk
- - - - remove Note
- Bob:
- - If (noMilk)
- - - if (noNote)      <- atomic, as above
- - - - leave Note
- - - - buy milk
- - - - remove Note
(A C sketch of the atomic note follows.)
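A hedged C sketch of that atomic note using compare-and-swap (no_milk and buy_milk are hypothetical stand-ins for the slide's actions):

    #include <stdatomic.h>
    #include <stdbool.h>

    atomic_int note = 0;              /* 0 = no note, 1 = note left */

    bool no_milk(void);               /* hypothetical helpers */
    void buy_milk(void);

    void buyer(void)                  /* run by both Alice and Bob */
    {
        int expected = 0;
        /* "if (noNote) leave Note" collapses into one CAS, so at most
           one of the two threads ever holds the note */
        if (atomic_compare_exchange_strong(&note, &expected, 1)) {
            if (no_milk())
                buy_milk();
            atomic_store(&note, 0);   /* remove note */
        }
    }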
59. Atomic Operations in Hardware - Load-Link/Store-Conditional
- RISC / MIPS implementation
- The LL and SC instructions are primitive instructions used to perform a read-modify-write operation on storage
- The use of the LL and SC instructions ensures that no other processor or mechanism has modified the target memory location between the time the LL instruction is executed and the time the SC instruction completes
- The LL instruction, in addition to doing a simple load, has the side effect of setting a user-transparent bit called the load-link bit (LLbit), in a way similar to the cache coherency protocols
- The LLbit forms a breakable link between the LL instruction and a subsequent SC instruction
60. Atomic Operations in Hardware - Load-Link/Store-Conditional
- The SC performs a simple store if and only if the LLbit is set when the store is executed. If the LLbit is not set, the store will fail to execute
- The LLbit is reset upon the occurrence of any event that even has the potential to modify the lock variable while the sequence of code between LL and SC is being executed
- The most obvious case where the link is broken is when an invalidate occurs to the cache line that was the subject of the load
61. Atomic Operations in Hardware - Load-Link/Store-Conditional
- IBM Power Architecture implementation

Assume that GPR 4 contains the new value to be stored, and that GPR 3 contains the address of the word to be loaded and replaced.

    loop:
        lwarx  r5,0,r3    # Load and reserve
        stwcx. r4,0,r3    # Store new value if still reserved
        bne-   loop       # Loop if reservation was lost
        mr     r4,r5      # Return the old value in GPR 4

The new value is now in storage; the old value is returned in GPR 4.
62. Atomic Operations in Hardware - Load-Link/Store-Conditional - Usage
- The ABA problem
- - Suppose that the value of V is A.
- - Try a CAS to change A to X.
- - Another thread can change A to B and back to A.
- - The compare-and-swap won't see it and will succeed.
- The obvious case is working with lists, or any other data structure that uses pointers
- Note: LL/SC does not suffer from ABA, since any intervening write breaks the reservation; CAS does have a solution for this, using a sort of counter attached to the changes (sketched below).
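A hedged C11 sketch of that counter trick: pack a version tag next to the value so an A -> B -> A excursion still changes the compared word (real lock-free stacks pack a tag next to the head pointer the same way):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* value in the low 32 bits, a monotonically increasing tag on top */
    _Atomic uint64_t v;

    bool cas_with_tag(uint32_t old_val, uint32_t new_val)
    {
        uint64_t cur = atomic_load(&v);
        if ((uint32_t)cur != old_val)
            return false;
        uint64_t next = (((cur >> 32) + 1) << 32) | new_val;
        /* even if another thread restored old_val meanwhile, its tag
           differs, so this CAS fails instead of silently succeeding */
        return atomic_compare_exchange_strong(&v, &cur, next);
    }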
63. Thank You