Title: Multi-Processing Hardware
1. Multi-Processing Hardware
2. Agenda
- Multi-Core Background and Software Limitations
- Multi-Processing Methods
- - Scale Up
- - Scale Out
- Process vs. Thread
- Multi-Processing Architecture
- - SMP vs. NUMA
- - Cache Coherency Protocols
- - Programming Discussion
- Hyper-Threading Technology
- High-Level Atomic Operations in Hardware
3. Why Is This Important?
- Computer hardware has changed A LOT in the past 3 years
- The programmer should know the hardware - it makes a difference!
- Sadly, the current (TAU) computer science material barely covers this issue at all
- Not only our degree - most software companies are not fully aware of this either (this is my impression)
- It's much easier to buy faster and more powerful hardware than to optimize the code. But there is a limit to that, and computing demands are constantly increasing
- Interesting, I hope..
4. Multi-Core Background and Software Limitations
5. History - Frequency Gains by Making Transistors Smaller
[Chart: frequency vs. time - shrinking transistors has historically enabled higher frequency; this was the case until the end of 2005]
6. Multi-Core Background
- Transistor sizes have become so small..
- Electrons leak through the transistor dielectric wall which regulates current flow (a VERY general and partial explanation)
- For the CPU to run faster the transistors need to be smaller - a shorter source-to-drain path
- The smaller the transistors are, the larger the leakage
- The larger the leakage, the larger the voltage threshold
- A larger voltage threshold means larger CPU power consumption and more heat produced, requiring more cooling capacity - which costs even more power
- All of the above results in an exponential watt-per-performance ratio
7. Multi-Core Background
- Classical frequency scaling appeared to be at a crossroads
- - Frequency improvements have hit a physics limit
- - Power costs money
- - Going green
- But manufacturing processes continue to enable higher per-die transistor counts
- Processor vendors now deliver greater numbers of cores per die to displace frequency increases
8. Current x86 Processors
- Dual-, quad- and six-core CPUs (Hyper-Threading on some models)
- The multi-core trend is here to stay - 8 cores next year and so on..
- Different-architecture CPUs (RISC, special-purpose CPUs) already have up to 32 cores per socket
- Hardware is evolving MUCH faster than software
- - Most of our home PCs' subsystems are much faster than the average servers from only a year ago!
- We get much more for much less. So, what's the problem?
9. Know Your Software Limitations
- Software does not scale if it hasn't been written to do so!
- A single-threaded application running on a 24-core server will use only one core - the OS can do nothing about it.
- A single-threaded application running on a 24-core server will actually run slower than it would on your home PC (or any other single-core machine).
- - Explanation coming later. Any ideas?
10. Know Your Software Limitations
- Single- or Dual-Threaded - typically developed before 2000
- - Legacy server software and most desktop software
- - Most server software developed by application developers new to multi-threading
- N-Threaded (typically N = 4 to 8) - the majority of business applications
- - The vast majority of server software developed in the past 5-8 years
- - Exchange, Notes, Citrix, Terminal Services, most Java, File/Print/Web serving, etc.
- - Almost all customer-developed applications
- - Some workstation software
- Fully Threaded Software (N = 8 to 64) - enterprise-enabled applications
- - SQL Server, Oracle, DB/2, SAP, PeopleSoft, VMware, 64-bit Terminal Services, etc.
11. So, What Can Be Done?
- Obviously, write N-threaded or fully threaded applications
- But this is not so simple
- - Writing a fully threaded application is hard.
- - Rewriting all of the existing applications is not possible.
- Virtualization
12. Virtualization Technology
- Separation of OS and hardware
- Encapsulation of OS and application into VMs
- Isolation
- Hardware independence
- Flexibility
[Diagram: VMs running on a virtualization layer above the hardware]
- Works with what you have today
13. Multi-Core Methods
14. Market View
- Every server is a multi-core server
- The multi-core trend will continue
- Two major software trends
- - Virtualization
- - Cloud computing
- Power consumption has become a real issue
[Image: a one-socket (quad-core, HT) server]
15. Multi-Processing Methods
- Two main methods
- - Scale Up
- - Scale Out
16. Multi-Processing Methods - Scale Up
- One large multi-core server
- Single OS
- It is not uncommon to see a server with 24 cores and 256GB of RAM
- Common applications
- - Databases
- - ERP
- - Virtualization
- - Mathematical applications
[Image: one server with 96 cores and 1TB of RAM]
17. Multi-Processing Methods - Scale Out
- Many smaller and faster servers, each one doing a fraction of the work
- One OS per server
- Dense, with low power consumption
- This is the future
- Common applications
- - HPC
- - Cloud
- - Web
- - Rendering
- - VLSI (chip design)
18. Process vs. Thread
19. What Is a Process?
- Execution context
- - Program counter (PC)
- - Stack pointer (SP)
- - Data registers
- Code
- Data
- Stack
20. What Is a Thread?
- Execution context
- - Program counter (PC)
- - Stack pointer (SP)
- - Data registers
21. Process vs. Thread
- Processes are typically independent; threads exist as subsets of a process. Each process has one or more threads
- Processes carry considerable state information; multiple threads within a process share state as well as memory and other resources
- Processes have separate address spaces; threads of the same process share their address space
- Inter-process communication is expensive - it needs a context switch
- Inter-thread communication is cheap - it can use process memory and may not need a context switch
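As a hedged illustration (not on the slides), this C sketch contrasts the two: fork() gives the child its own copy of the address space, while pthread_create() starts a thread that shares the parent's memory.

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int shared = 0;               /* shared by threads, copied across fork */

    void *thread_fn(void *arg)
    {
        shared++;                 /* same address space: main() sees this */
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, thread_fn, NULL);
        pthread_join(t, NULL);
        printf("after thread: shared = %d\n", shared);   /* prints 1 */

        if (fork() == 0) {        /* child works on a private copy */
            shared++;             /* invisible to the parent */
            _exit(0);
        }
        wait(NULL);
        printf("after fork: shared = %d\n", shared);     /* still 1 */
        return 0;
    }

Build with -pthread; the two printf lines show the thread's write is visible while the forked child's is not.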
22. Multi-Processing Architecture
23. Multi-Processing Architecture (Scale-Up Servers)
- SMP - Symmetric Multi-Processing
- - Implemented by Intel until the new generation, Nehalem
- - Also known as UMA
- NUMA - Non-Uniform Memory Access
- - AMD design
24. Multi-Processing Architecture - SMP
- All CPUs have symmetric access to all hardware resources such as memory, the I/O bus, and interrupts - thus the name symmetric multiprocessing.
- Advantage - simplifies the development of the operating system and applications
- Disadvantage - limited scalability of this architecture
- - As processors are added to the system, the shared resources are accessed by an increasingly greater number of processors. More processors using the same resources creates queuing delays
25. Multi-Processing Architecture - NUMA
- The memory controller is embedded in the processor
- Each processor has local, fast memory and remote, slower memory
- Is it better?
- - It depends on whether the OS and the application are NUMA-aware
- Think of the supermarket lines
26. Multi-Processing Architecture - NUMA Awareness
- Modern operating systems are NUMA-aware
- The OS gets the NUMA information from the SRAT table, which is created by the system BIOS
- - SRAT - Static Resource Affinity Table
- - It includes topology information for all the processors and memory in a system
- - The topology information includes the number of nodes in the system and which memory is local to each processor
- The OS can read this info the same way it can read the CPU type, machine serial number, etc. (using the ACPI specification)
27. Multi-Processing Architecture - NUMA Awareness: SRAT (Static Resource Affinity Table)
- Implements the concept of proximity domains in a system
- - Resources in a system, including processors, memory, and PCI adapters, are grouped by how tightly coupled they are
- The OS can use this information to determine the best resource allocation and the scheduling of threads throughout the system
- Recent ACPI implementations include the proximity method (_PXM), so the OS can be NUMA-aware even without SRAT, and this introduces support for hot-plug devices
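As a hedged user-space illustration (beyond the slides), Linux exposes this SRAT-derived topology through libnuma; allocating memory on the node a thread runs on avoids the remote-access penalty. A minimal sketch, assuming libnuma is installed (link with -lnuma):

    #include <numa.h>      /* Linux-specific NUMA policy API */
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }
        int node = 0;
        numa_run_on_node(node);                        /* pin this thread to node 0 */
        void *buf = numa_alloc_onnode(1 << 20, node);  /* 1 MB of node-local memory */
        printf("%d NUMA nodes; buffer %p allocated on node %d\n",
               numa_max_node() + 1, buf, node);
        numa_free(buf, 1 << 20);
        return 0;
    }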
28. Multi-Processing Architecture - NUMA Awareness: SRAT (Static Resource Affinity Table) - Example
29. Multi-Processing Architecture - Problems
Or, why does a 4-processor machine not work 4 times faster than a 1-processor machine?
- Shared resources - buses, memory controller, I/O hub, etc.
- Electrical load on the buses - speed must be slowed down to ensure valid data
- Data coherency, especially cache coherency
30. Multi-Processing Architecture - Cache Coherency Protocols: the MESI Protocol
- Intel implementation
- MESI stands for Modified, Exclusive, Shared, and Invalid
- One of these four states is assigned to every data element stored in each CPU cache, using two additional bits per cache line
- On each processor data load into cache, the processor must broadcast to all other processors in the system to check their caches to see if they have the requested data
- These broadcasts are called snoop cycles, and they must occur during every memory read or write operation
31. Multi-Processing Architecture - the MESI Protocol
- The Exclusive state is used to indicate that data is stored in only one cache.
- Data that is marked Exclusive can be updated in the cache without a snoop broadcast to the other CPUs.
[Diagram: a CPU announces "I need X" on the front-side bus; memory holds X = 4, and the requesting CPU caches X = 4 in the Exclusive (E) state]
32. Multi-Processing Architecture - the MESI Protocol
- The Shared state is used to indicate that data is held by more than one processor
- If the front-side bus request is a read operation, the data is marked as Shared, indicating that each copy is read-only and cannot be modified without notifying the other CPUs.
[Diagram: a second CPU asks "I want to know what X equals"; the owning CPU's line drops from Exclusive (E) to Shared (S), and both caches now hold X = 4 in the S state]
33. Multi-Processing Architecture - the MESI Protocol
- The Modified state is used to indicate that data has been modified while stale copies might still be stored in other CPUs
- If the operation is a write request, any CPU possessing the unmodified data must mark its copy as Invalid, indicating that the data can no longer be used.
- Main memory is updated
[Diagram: one CPU writes X = 5; its cache line goes from Shared (S) to Modified (M), the other CPU's X = 4 copy goes from S to Invalid (I), and main memory is updated to X = 5]
34. Multi-Processing Architecture - the MESI Protocol
- The Invalid state is used to indicate that data has been modified by another CPU and is no longer relevant
[Diagram: another CPU now writes X = 6; its line becomes Modified (M), the previous writer's X = 5 copy becomes Invalid (I), and main memory is updated to X = 6]
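This write-invalidate traffic is what makes false sharing expensive: two unrelated counters that merely share a cache line ping-pong between the M and I states on every write. A hedged C sketch (not from the slides) of the problem and the usual padding fix:

    #include <pthread.h>

    /* Both counters live in one cache line: each increment by one thread
       invalidates the line in the other thread's cache (M -> I ping-pong). */
    struct { long a, b; } hot;

    /* Padding gives each counter its own 64-byte line, removing the traffic. */
    struct { long v; char pad[64 - sizeof(long)]; } cool[2];

    void *bump_a(void *arg) { for (long i = 0; i < 100000000; i++) hot.a++; return NULL; }
    void *bump_b(void *arg) { for (long i = 0; i < 100000000; i++) hot.b++; return NULL; }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump_a, NULL);
        pthread_create(&t2, NULL, bump_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }

Switching the loops from hot.a/hot.b to cool[0].v/cool[1].v typically makes the run several times faster on a multi-core machine.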
35. Multi-Processing Architecture - Cache Coherency Protocols: the MOESI Protocol
- AMD implementation - better for NUMA
- The MOESI protocol expands the MESI protocol with yet another cache-line status flag, namely the Owner status (thus the O in MOESI).
- After the update of a cache line, the cache line is not written back to system memory but is flagged as owned.
- When another CPU issues a read, it gets its data from the owner's cache rather than from slower memory
36. Multi-Processing Architecture - Cache Coherency Protocols: the MOESI Protocol
- A cache line still isn't written back to memory until a cache flush
- The owner CPU doesn't need to send snoop messages in case it modified the data
- This offloads the buses and the memory controller
37. Multi-Processing Architecture - SMP FSB Load
- The cache coherency protocol generates a great load on the front-side bus, and many times this bus becomes the performance bottleneck of these systems
- How can we offload this traffic?
[Diagram: multiple CPUs with cache lines in M/S/I states sharing one front-side bus to memory; all snoop traffic crosses this single bus]
38. Multi-Processing Architecture - Snoop Filter
[Diagram: a scalability controller containing a snoop filter, connected to the processor buses, the memory controller with its memory interfaces, and two I/O bridges]
- One processor requests a memory address not residing in its L2/L3 cache (a processor cache miss)
39. Multi-Processing Architecture - Snoop Filter
[Diagram: the same system; the snoop filter is a directory of address tags and MESI states held in embedded DRAM (eDRAM), organized 9-way set associative]
- On the cache miss, the scalability controller looks up the tag in the directory
40. Multi-Processing Architecture - Snoop Filter
[Diagram: the same system; no snoop is reflected onto the other processor bus]
- No address-tag match in the directory means no other cache holds the line
- The request goes to main memory without waiting for a front-side bus snoop
(The directory is built from IBM embedded DRAM, 9-way set associative)
41. Multi-Processing Architecture - Programming Discussion
Generally, the SMP architecture lets you use the advantages of threads (faster intercommunication and context switching) without the remote-memory-access penalty; for example, Microsoft SQL Server runs as one process with many threads. This architecture is also simpler. NUMA, on the other hand, can potentially give you better performance, assuming the application is optimized for it. The programmer needs to fine-tune the distribution of processes and threads; an ideal case might be one process with 4 (or more) threads per socket, to minimize remote memory access (one way to implement such pinning is sketched after the questions below).
- Threads vs. Processes
- How would you optimize for this?
- And assuming the architecture is NUMA? SMP?
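On Linux, a hedged sketch of such pinning (the helper name is illustrative): each thread sets its own CPU affinity with pthread_setaffinity_np, so its work and, under a first-touch allocation policy, its memory stay on one socket.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Pin the calling thread to a single CPU. On a NUMA machine, keeping
       a thread on one socket keeps its first-touch allocations local. */
    static int pin_to_cpu(int cpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

For example, threads 0-3 could call pin_to_cpu() for the CPUs of socket 0 and threads 4-7 for those of socket 1, matching the "4 threads per socket" case above.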
42. Hyper-Threading Technology
43. Intel Hyper-Threading Technology
[Diagram: execution-unit occupancy over time (processor cycles), with and without SMT; each box represents one processor execution unit]
- Also known as Simultaneous Multi-Threading (SMT)
- Run 2 threads at the same time per core
- - Take advantage of the 4-wide execution engine
- - Keep it fed with multiple threads
- - Hide the latency of a single thread
- Hyper-Threading Technology makes a single physical processor appear as two logical processors
- Intel Xeon 5500 microarchitecture advantages
- - Larger caches
- - Massive memory BW
Why is that important?
44. Hyper-Threading Technology - Why?
- We want to max out the utilization of the CPU's execution units
- The CPU may stall due to a cache miss, branch misprediction, or data dependency
- - Data dependency example - RAW (Read After Write):
- - i1. R2 <- R1 + R3
- - i2. R4 <- R2 + R3
- We need an efficient solution
- - Very low die-area cost
- - Can provide a significant performance benefit depending on the application
- - Much more efficient than adding an entire core
45. Hyper-Threading Technology - How Does It Work?
- Each logical processor has its own architecture state
- Architecture state
- - The part of the CPU that holds the state of the process
- - Control registers
- - General-purpose registers
- - Instruction streaming buffers and trace-cache fill buffers
- - Instruction Translation Look-aside Buffer
- There is only one set of execution units
46. Hyper-Threading Technology - How Does It Work?
47. Hyper-Threading Technology - How Does It Work?
[Diagram: wasted execution-unit slots]
48. Hyper-Threading Technology - OS Awareness
For best performance, the operating system should implement a few optimizations:
- Hyper-Threading-aware thread scheduling
- - The operating system should schedule threads to logical processors on different physical processors before scheduling two threads to the same physical processor.
- Aggressive HALT of processors in the idle loop
- Using the YIELD (PAUSE) instruction to avoid spinlock contention
- - The PAUSE instruction causes the logical processor to pause for a short period of time (approximately 50 clock cycles) and allows the other logical processor access to the shared resources on the physical HT processor.
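A hedged C sketch (not from the slides) of a PAUSE-friendly spinlock; _mm_pause() is the compiler intrinsic that emits the PAUSE instruction:

    #include <stdatomic.h>
    #include <immintrin.h>    /* _mm_pause() emits x86 PAUSE */

    static atomic_flag lock = ATOMIC_FLAG_INIT;

    void spin_lock(void)
    {
        /* test-and-set is a read-modify-write; on x86 it compiles
           to a LOCK-prefixed instruction */
        while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
            _mm_pause();      /* yield shared HT resources to the sibling */
    }

    void spin_unlock(void)
    {
        atomic_flag_clear_explicit(&lock, memory_order_release);
    }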
49. Hyper-Threading Technology - Application Awareness
- Operating systems expose a Hyper-Threading API
- - For example, GetLogicalProcessorInformation on Windows
- To optimize application performance on HT-enabled systems, the application should ensure that the threads executing on the two logical processors have minimal dependencies on the shared resources of the same physical processor.
- Bad HT thread-affinity example
- - Threads that perform similar actions and stall for the same reasons should not be scheduled on the same physical processor. TooMuchMilk?
- The benefit of HT is that shared processor resources can be used by one logical processor while the other logical processor is stalled. This does not work when both logical processors are stalled for the same reason.
50. Hyper-Threading Technology - Some Programming Considerations
- Alice:
- - Leave note A
- - While (note B)
- - - go to top_of_loop
- - Go buy milk
- - Remove note A
- Bob:
- - Leave note B
- - If !(note A)
- - - Go buy milk
- - Remove note B
Spin-wait - it can starve the other thread
51. Hyper-Threading Technology - Some Programming Considerations
- Alice:
- - Leave note A
- - While (note B)
- - - wait (ideal time)
- - - go to top_of_loop
- - Go buy milk
- - Remove note A
- Bob:
- - Leave note B
- - If !(note A)
- - - Go buy milk
- - Remove note B
- Ideal time
- - It is clear that the loop variable cannot change faster than the memory bus can update it.
- - Hence, there is no benefit in re-reading the loop variable faster than the time needed for a memory update.
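A hedged sketch of such a throttled wait in C (the flag name is illustrative), polling the note no faster than it can plausibly change and backing off while it doesn't:

    #include <stdatomic.h>
    #include <time.h>

    extern atomic_int note_B;          /* hypothetical shared flag */

    void wait_until_note_B_removed(void)
    {
        struct timespec ts = { 0, 100 };       /* start around 100 ns */
        while (atomic_load(&note_B)) {
            nanosleep(&ts, NULL);              /* don't spin at full speed */
            if (ts.tv_nsec < 100000)
                ts.tv_nsec *= 2;               /* simple exponential backoff */
        }
    }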
52. Hyper-Threading Technology - Hardware Assists
- Two new (2004) instructions aimed directly at the spin-wait issue
- MONITOR - watches a specified area of memory for write activity
- Its companion instruction, MWAIT, associates writes to this memory block with waking up a specific processor
- Since updating a variable is a write activity, by specifying its memory address and using MWAIT, a processor can simply be suspended until the variable is updated.
- Effectively, this enables waiting on a variable without spinning in a loop.
53. High-Level Atomic Operations in Hardware
54. Atomic Operations in Hardware - the LOCK Prefix
- The x86 architecture implements the LOCK prefix
- It causes the processor's LOCK signal to be asserted during the execution of the accompanying instruction (turning it into an atomic instruction)
- The LOCK signal ensures that the processor has exclusive use of any shared memory while the signal is asserted
Atomic instructions (on Intel, at least):
- Increment (lock inc r/m)
- Decrement (lock dec r/m)
- Exchange (xchg r/m, r) - always executed with LOCK semantics
- Fetch-and-add (lock xadd r/m, r)
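You rarely write the LOCK prefix by hand; a minimal C11 sketch whose atomic_fetch_add the compiler typically lowers to lock xadd on x86:

    #include <stdatomic.h>
    #include <stdio.h>

    atomic_int counter = 0;

    int main(void)
    {
        /* an atomic read-modify-write: usually compiles to `lock xadd` */
        int old = atomic_fetch_add(&counter, 1);
        printf("old=%d new=%d\n", old, atomic_load(&counter));
        return 0;
    }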
55. Atomic Operations in Hardware - Read-Modify-Write
- Read-modify-write is a class of high-level atomic operations which both read a memory location and write a new value into it as a single atomic step
- Typically they are used to implement mutexes or semaphores
- Examples
- - Compare-and-Swap
- - Fetch-and-Add
- - Test-and-Set
- - Load-Link / Store-Conditional
56. Atomic Operations in Hardware - Compare and Swap / Exchange
- CMPXCHG r/m, r
- - IF accumulator == DEST
- - THEN
- - - ZF = 1
- - - DEST = SRC
- - ELSE
- - - ZF = 0
- - - accumulator = DEST
- - FI
- ZF indicates whether the swap has occurred
- The destination operand receives a write cycle without regard to the result of the comparison
- Needs the LOCK prefix to ensure atomicity
57. Atomic Operations in Hardware - Compare and Swap / Exchange
- OS system calls and programming languages wrap the LOCK prefix automatically
- #include <sys/atomic_op.h> (AIX)
- - boolean_t compare_and_swap(word_addr, old_val_addr, new_val)
- - atomic_p word_addr; int *old_val_addr; int new_val
- Available in Java
- - AtomicInteger.compareAndSet(int, int) -> boolean
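In portable C11 (a hedged sketch, not the AIX call above), atomic_compare_exchange_strong mirrors the CMPXCHG pseudocode: on failure it copies the current value back into expected, just as CMPXCHG writes DEST into the accumulator:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    int main(void)
    {
        atomic_int x = 4;
        int expected = 4;
        /* succeeds: x == expected, so x becomes 5 (like ZF = 1) */
        bool ok = atomic_compare_exchange_strong(&x, &expected, 5);
        printf("ok=%d x=%d\n", ok, atomic_load(&x));        /* ok=1 x=5 */

        expected = 4;
        /* fails: x is 5; expected is updated to 5 (like ZF = 0) */
        ok = atomic_compare_exchange_strong(&x, &expected, 6);
        printf("ok=%d expected=%d\n", ok, expected);        /* ok=0 expected=5 */
        return 0;
    }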
58. Atomic Operations in Hardware - Compare and Swap / Exchange
Now this can work (I hope, it was late..):
- Alice:
- - If (noMilk)
- - - if (noNote)      <- this test and "leave Note" run as one atomic operation
- - - - leave Note
- - - - buy milk
- - - - remove Note
- Bob:
- - If (noMilk)
- - - if (noNote)      <- atomic, as above
- - - - leave Note
- - - - buy milk
- - - - remove Note
(A C sketch of the atomic note follows.)
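A hedged C sketch of that atomic note using compare-and-swap (no_milk and buy_milk are hypothetical stand-ins for the slide's actions):

    #include <stdatomic.h>
    #include <stdbool.h>

    atomic_int note = 0;              /* 0 = no note, 1 = note left */

    bool no_milk(void);               /* hypothetical helpers */
    void buy_milk(void);

    void buyer(void)                  /* run by both Alice and Bob */
    {
        int expected = 0;
        /* "if (noNote) leave Note" collapses into one CAS, so at most
           one of the two threads ever holds the note */
        if (atomic_compare_exchange_strong(&note, &expected, 1)) {
            if (no_milk())
                buy_milk();
            atomic_store(&note, 0);   /* remove note */
        }
    }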
59. Atomic Operations in Hardware - Load-Link/Store-Conditional
- RISC / MIPS implementation
- The LL and SC instructions are primitive instructions used to perform a read-modify-write operation on storage
- The use of the LL and SC instructions ensures that no other processor or mechanism has modified the target memory location between the time the LL instruction is executed and the time the SC instruction completes
- The LL instruction, in addition to doing a simple load, has the side effect of setting a user-transparent bit called the load-link bit (LLbit), in a way similar to the cache coherency protocols
- The LLbit forms a breakable link between the LL instruction and a subsequent SC instruction
60. Atomic Operations in Hardware - Load-Link/Store-Conditional
- The SC performs a simple store if and only if the LLbit is set when the store is executed. If the LLbit is not set, the store will fail to execute
- The LLbit is reset upon the occurrence of any event that even has the potential to modify the lock variable while the sequence of code between LL and SC is being executed
- The most obvious case where the link is broken is when an invalidate occurs to the cache line that was the subject of the load
61. Atomic Operations in Hardware - Load-Link/Store-Conditional
- IBM Power Architecture implementation

Assume that GPR 4 contains the new value to be stored, and that GPR 3 contains the address of the word to be loaded and replaced.

    loop:
        lwarx  r5,0,r3    # Load and reserve
        stwcx. r4,0,r3    # Store new value if still reserved
        bne-   loop       # Loop if reservation was lost
        mr     r4,r5      # Return the old value in GPR 4

The new value is now in storage; the old value is returned in GPR 4.
62. Atomic Operations in Hardware - Load-Link/Store-Conditional - Usage
- The ABA problem
- - Suppose that the value of V is A.
- - Try a CAS to change A to X.
- - Another thread can change A to B and back to A.
- - The compare-and-swap won't see it and will succeed.
- The obvious case is working with lists, or any other data structure that uses pointers
- Note: LL/SC does not suffer from ABA, since any intervening write breaks the reservation; CAS does have a solution for this, using a sort of counter attached to the changes (sketched below).
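A hedged C11 sketch of that counter trick: pack a version tag next to the value so an A -> B -> A excursion still changes the compared word (real lock-free stacks pack a tag next to the head pointer the same way):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* value in the low 32 bits, a monotonically increasing tag on top */
    _Atomic uint64_t v;

    bool cas_with_tag(uint32_t old_val, uint32_t new_val)
    {
        uint64_t cur = atomic_load(&v);
        if ((uint32_t)cur != old_val)
            return false;
        uint64_t next = (((cur >> 32) + 1) << 32) | new_val;
        /* even if another thread restored old_val meanwhile, its tag
           differs, so this CAS fails instead of silently succeeding */
        return atomic_compare_exchange_strong(&v, &cur, next);
    }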
63. Thank You