1
Multi Processing Hardware
  • Gilad Berman

2
Agenda
  • Multi Core Background and software limitations
  • Multi Processing Methods
  • - Scale Up
  • - Scale Out
  • Process vs. Thread
  • Multi Processing Architecture
  • - SMP vs. NUMA
  • - cache coherency protocols
  • - Programming discussion
  • Hyper-Threading Technology
  • High-Level Atomic Operation in Hardware

3
Why Is This Important?
  • Computer hardware has changed A LOT over the past 3 years
  • The programmer should know the hardware; it makes a difference!
  • Sadly, the current (TAU) computer science material barely covers this issue at all
  • It is not just our degree; most software companies are not fully aware of this either (this is my impression)
  • It is much easier to buy faster and more powerful hardware than to optimize the code. But there is a limit to that, and computing demands are constantly increasing
  • Interesting, I hope..

4
Multi Core Background and software limitations
5
History - Frequency Gains by Making Transistors Smaller
[Chart: frequency vs. time. Shrinking transistors has historically enabled higher frequency; this was the case until the end of 2005.]
6
Multi Core Background
  • Transistor sizes have become so small..
  • Electrons leak through the transistor dielectric wall, which regulates current flow
  • A VERY general and partial explanation:
  • For a CPU to run faster, the transistors need to be smaller (a shorter Source-to-Drain path).
  • The smaller the transistors are, the larger the leakage.
  • The larger the leakage, the larger the voltage threshold.
  • A larger voltage threshold means larger CPU power consumption and more heat produced, requiring more cooling capacity and even more power.
  • All of the above results in an exponentially growing Watt-per-Performance ratio.

7
Multi Core Background
  • Classical frequency scaling appeared to be at a crossroads
  • Frequency improvements have hit a physics limit
  • Power costs money
  • Going green
  • But manufacturing processes continue to enable higher per-die transistor counts
  • Processor vendors now deliver greater numbers of processor cores on the die in place of frequency increases

8
Current x86 Processors
  • Dual-, Quad- and Six-core CPUs (Hyper-Threading on some models)
  • The multi-core trend is here to stay: 8 cores next year, and so on..
  • CPUs with different architectures (RISC, special-purpose CPUs) already have up to 32 cores per socket
  • Hardware is evolving MUCH faster than software
  • - most of our home PCs' subsystems are much faster than the average server from only a year ago!
  • We get much more for much less. So, what's the problem?

9
Know Your Software Limitations
  • Software does not scale if it hasn't been written to do so!
  • A single-threaded application running on a 24-core server will use only one core; the OS can do nothing about it.
  • A single-threaded application running on a 24-core server will actually run slower than it would on your home PC (or any other single-core machine).
  • Explanation coming later. Any ideas?

10
Know Your Software Limitations
  • Single- or Dual-Threaded - Typically Developed Before 2000
  • Legacy server software and most desktop software
  • Most server software developed by application developers new to multi-threading
  • N-Threaded (typically N = 4 to 8) - the Majority of Business Applications
  • The vast majority of server software developed in the past 5 - 8 years
  • Exchange, Notes, Citrix, Terminal Services, most Java, File/Print/Web serving, etc.
  • Almost all customer-developed applications
  • Some workstation software
  • Fully Threaded Software (N = 8 to 64) - Enterprise-Enabled Applications
  • SQL Server, Oracle, DB/2, SAP, PeopleSoft, VMWare, 64-bit Terminal Services, etc.

11
So, What can be done?
  • Obviously, write N-threaded or fully threaded applications
  • But this is not so simple
  • - writing a fully threaded application is hard.
  • - rewriting all of the existing applications is not possible.
  • Virtualization

12
Virtualization Technology
  • Separation of OS and hardware
  • Encapsulation of OS and application into VMs
  • Isolation
  • Hardware independence
  • Flexibility

[Diagram: a virtualization layer between the hardware and the VMs]
Works with what you have today
13
Multi Core Methods
14
Market View
  • Every server is a multi-core server
  • The multi-core trend will continue
  • Two major software trends
  • - Virtualization
  • - Cloud computing
  • Power consumption has become a real issue

[Image: a one-socket (quad-core, HT) server]
15
Multi Processing Methods
  • Two Main methods
  • Scale Up
  • Scale Out

16
Multi Processing Methods - Scale Up
  • One large multi-core server
  • Single OS
  • It is not uncommon to see a server with 24 cores and 256 GB of RAM
  • Common Applications
  • - Databases
  • - ERP
  • - Virtualization
  • - Mathematical applications

One server with 96 cores and 1TB of RAM
17
Multi Processing Methods - Scale Out
  • Many smaller and faster servers, each one doing a
    fraction of the work
  • One OS per server
  • Dense, with low power consumption
  • This is the future
  • Common Applications
  • - HPC
  • - Cloud
  • - Web
  • - Rendering
  • - VLSI (Chip design)

18
Process vs. Thread
19
What is a Process?
  • Execution context
  • - Program counter (PC)
  • - Stack pointer (SP)
  • - Data registers
  • Code
  • Data
  • Stack

20
What is a Thread?
  • Execution context
  • - Program counter (PC)
  • - Stack pointer (SP)
  • - Data registers

21
Process vs. Thread
  • Processes are typically independent;
  • threads exist as subsets of a process. Each process has one or more threads
  • Processes carry considerable state information;
  • multiple threads within a process share state as well as memory and other resources
  • Processes have separate address spaces;
  • threads of the same process share their address space
  • Inter-process communication is expensive: it needs a context switch
  • Inter-thread communication is cheap: it can use process memory and may not need a context switch (see the sketch below)
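The difference is easy to see in code. A minimal sketch (an illustration, not from the slides): a fork()ed child gets a private copy of the address space, while a pthread shares it with its creator.

  /* process_vs_thread.c - build with: cc process_vs_thread.c -lpthread */
  #include <pthread.h>
  #include <stdio.h>
  #include <sys/wait.h>
  #include <unistd.h>

  static int shared = 0;

  static void *thread_fn(void *arg) { shared = 1; return NULL; }

  int main(void) {
      pthread_t t;
      pthread_create(&t, NULL, thread_fn, NULL);
      pthread_join(t, NULL);
      printf("after thread: %d\n", shared);  /* 1 - same address space    */

      shared = 0;
      if (fork() == 0) { shared = 1; _exit(0); }  /* child writes its copy */
      wait(NULL);
      printf("after fork:   %d\n", shared);  /* 0 - parent copy unchanged */
      return 0;
  }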

22
Multi Processing Architecture
23
Multi Processing Architecture (Scale Up Servers)
  • SMP - Symmetric Multi-Processing
  • Implemented by Intel until the new generation, Nehalem
  • Also known as UMA (Uniform Memory Access)
  • NUMA - Non-Uniform Memory Access
  • AMD's design

24
Multi Processing Architecture - SMP
  • All CPUs have symmetric access to all hardware resources such as memory, the I/O bus, and interrupts; thus the name symmetric multiprocessing.
  • Advantage - simplifies the development of the operating system and applications
  • Disadvantage - limited scalability of this architecture. As processors are added to the system, the shared resources are frequently accessed by an increasingly greater number of processors. More processors using the same resources creates queuing delays.

25
Multi Processing Architecture - NUMA
  • The memory controller is embedded in the processor
  • Each processor has local, fast memory and remote, slower memory
  • Is it better?
  • It depends on whether the OS and the application are NUMA-aware
  • Think of the supermarket checkout lines

26
Multi Processing Architecture - NUMA Awareness
  • Modern operating systems are NUMA-aware
  • The OS gets the NUMA information from the SRAT table, which is created by the system BIOS
  • SRAT - Static Resource Affinity Table
  • It includes topology information for all the processors and memory in a system.
  • The topology information includes the number of nodes in the system and which memory is local to each processor
  • The OS can read this info the same way it can read the CPU type, machine serial number, etc. (using the ACPI specifications); a query sketch follows.
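On Linux, a program can query this SRAT-derived topology through libnuma. A minimal sketch, assuming libnuma is installed (link with -lnuma):

  /* numa_probe.c - build with: cc numa_probe.c -lnuma */
  #include <numa.h>
  #include <stdio.h>

  int main(void) {
      if (numa_available() < 0) {      /* kernel/BIOS exposed no NUMA info */
          puts("no NUMA support");
          return 1;
      }
      printf("NUMA nodes: %d\n", numa_max_node() + 1);

      /* Allocate 1 MB backed by node 0's local memory. */
      void *p = numa_alloc_onnode(1 << 20, 0);
      numa_free(p, 1 << 20);
      return 0;
  }

From the shell, numactl --hardware shows the same node/memory layout.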

27
Multi Processing Architecture - NUMA Awareness: SRAT - Static Resource Affinity Table
  • Implements the concept of proximity domains in a system
  • Resources in a system, including processors, memory, and PCI adapters, are grouped by how tightly coupled they are
  • This way the OS can use this information to determine the best resource allocation and the scheduling of threads throughout the system
  • Recent ACPI implementations include the Proximity method (_PXM), so the OS can be NUMA-aware even without the SRAT, and introduce support for hot-plug devices

28
Multi Processing Architecture - NUMA Awareness: SRAT - Example
[Slide shows an example SRAT listing]
29
Multi Processing Architecture - Problems
Or: why does a 4-processor machine not run 4 times faster than a 1-processor machine?
  • Shared resources: buses, memory controller, I/O hub, etc.
  • Electrical load on the buses: speed must be lowered to ensure valid data
  • Data coherency, especially cache coherency

30
Multi Processing Architecture - Cache Coherency Protocols: the MESI Protocol
  • Intel's implementation
  • MESI stands for Modified, Exclusive, Shared, and Invalid
  • One of these four states is assigned to every data element stored in each CPU cache, using two additional bits per cache line
  • On each data load into a processor's cache, the processor must broadcast to all other processors in the system to check their caches to see if they have the requested data
  • These broadcasts are called snoop cycles, and they must occur during every memory read or write operation (a toy model of the transitions follows)
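To make the four states concrete, here is a toy model in C (an illustration only, simplified to two CPUs, a single cache line, and no write-back of Modified data):

  /* mesi_toy.c - deliberately simplified two-CPU MESI model */
  typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

  static mesi_t line[2] = { INVALID, INVALID };  /* one line, two caches */

  void load(int cpu)                 /* CPU `cpu` reads the line  */
  {
      int other = 1 - cpu;
      if (line[cpu] != INVALID) return;      /* hit: no snoop needed      */
      if (line[other] == INVALID)
          line[cpu] = EXCLUSIVE;             /* only cached copy anywhere */
      else {
          line[other] = SHARED;              /* snoop demotes the owner   */
          line[cpu]   = SHARED;              /* read-only copy            */
      }
  }

  void store(int cpu)                /* CPU `cpu` writes the line */
  {
      line[1 - cpu] = INVALID;               /* snoop invalidates copies  */
      line[cpu]     = MODIFIED;              /* memory is now stale       */
  }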

31
Multi Processing Architecture - the MESI Protocol
  • The Exclusive state is used to indicate that data is stored in only one cache.
  • Data that is marked Exclusive can be updated in the cache without a snoop broadcast to the other CPUs.

[Diagram: a CPU announces "I need X" on the Front Side Bus; memory holds X = 4, and the requesting CPU caches X = 4 in the E state]
32
Multi Processing Architecture - the MESI Protocol
  • The Shared state is used to indicate that data is shared by more than one processor
  • If the front-side bus request is a read operation, the data is marked Shared, indicating that each copy is read-only and cannot be modified without notifying the other CPUs.

[Diagram: a second CPU asks "I want to know what X equals"; the owner's line changes from X = 4 (E) to X = 4 (S), and both caches now hold X = 4 in the S state]
33
Multi Processing Architecture - the MESI Protocol
  • The Modified state is used to indicate that the data has been modified; stale copies might still be stored in other CPUs
  • If the operation is a write request, the CPUs possessing the unmodified data must mark their copies as Invalid, indicating that the data can no longer be used.
  • Main memory is updated

[Diagram: a CPU writes X = 5; its line changes from X = 4 (S) to X = 5 (M), the other copy changes from X = 4 (S) to X = 4 (I), and main memory is updated to X = 5]
34
Multi Processing Architecture - the MESI Protocol
  • The Invalid state is used to indicate that the data has been modified by another CPU and this copy is no longer relevant

[Diagram: another CPU now writes X = 6; its line changes to X = 6 (M), the previous owner's X = 5 (M) copy becomes Invalid, and main memory is updated to X = 6]
35
Multi Processing Architecture - Cache Coherency Protocols: the MOESI Protocol
  • AMD's implementation; better for NUMA
  • The MOESI protocol expands the MESI protocol with yet another cache-line status flag, namely the owner status (thus the O in MOESI).
  • After the update of a cache line, the cache line is not written back to system memory but is flagged as Owned.
  • When another CPU issues a read, it gets its data from the owner's cache rather than from slower memory

36
Multi Processing Architecture - Cache Coherency Protocols: the MOESI Protocol
  • A cache line still isn't written back to memory until a cache flush
  • The owner CPU doesn't need to send snoop messages when it modifies the data
  • This offloads the buses and the memory controller

37
Multi Processing Architecture - SMP FSB Load
  • The cache coherency protocol generates a great load on the front-side bus, and many times this bus becomes the performance bottleneck of these systems
  • How can we offload this traffic?

[Diagram: several CPUs with cache lines in the S/M/I states, all snooping over the shared Front Side Bus to memory]
38
Multi Processing Architecture - Snoop Filter
[Diagram: a snoop filter in the scalability controller / memory controller, between the processor front-side buses, the I/O bridges and the memory interfaces]
One processor requests a memory address not residing in its L2/L3 cache.
39
Multi Processing Architecture - Snoop Filter
[Diagram: the snoop filter directory, held in embedded DRAM (eDRAM) and 9-way set associative, stores tag + MESI entries; on the processor cache miss, the scalability controller looks up the tag in the directory]
40
Multi Processing Architecture - Snoop Filter
[Diagram: no address-tag match is found in the directory, so the request goes to main memory without waiting for a front-side bus snoop, and no snoop is reflected onto the other bus; the directory is IBM embedded DRAM, 9-way set associative]
41
Multi Processing Architecture - Programming Discussion
Generally, the SMP architecture lets you use the advantages of threads (faster intercommunication and context switching) without the remote-memory-access penalty; for example, Microsoft SQL Server has only one process with many threads. And this architecture is simpler. NUMA, on the other hand, can potentially give you better performance, assuming the application is optimized: the programmer needs to fine-tune the distribution of processes and threads. An ideal case might be one process and 4 (or more) threads per socket, to minimize remote memory access (see the affinity sketch after this list).
  • Threads vs. Processes
  • How would you optimize for this?
  • And assuming the architecture is NUMA? SMP?
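A sketch of that fine-tuning on Linux (an illustration; the core numbering is an assumption - real code should first query which cores belong to which socket):

  /* pin_socket.c - pin a thread to one socket's cores (GNU extension) */
  #define _GNU_SOURCE
  #include <pthread.h>
  #include <sched.h>

  void pin_to_socket0(pthread_t t)
  {
      cpu_set_t set;
      CPU_ZERO(&set);
      for (int c = 0; c < 4; c++)      /* assumed: cores 0-3 = socket 0 */
          CPU_SET(c, &set);
      /* The thread now runs only on socket 0, so its memory stays local. */
      pthread_setaffinity_np(t, sizeof(set), &set);
  }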

42
Hyper-Threading Technology
43
Intel Hyper-Threading Technology
[Diagram: execution-unit occupancy over time (in processor cycles), with SMT and without SMT; each box represents a processor execution unit]
  • Also known as Simultaneous Multi-Threading (SMT)
  • Run 2 threads at the same time per core
  • Take advantage of 4-wide execution engine
  • Keep it fed with multiple threads
  • Hide latency of a single thread
  • Hyper-Threading Technology makes a single
    physical processor appear as two logical
    processors
  • Intel Xeon 5500 Microarchitecture advantages
  • Larger caches
  • Massive memory BW

Why is that important?
44
Hyper-Threading Technology - Why?
  • We want to max out the utilization of the CPU's execution units
  • The CPU may stall due to a cache miss, branch misprediction or data dependency
  • - data dependency example:
  •   RAW (Read After Write)
  •   i1. R2 ← R1 + R3
  •   i2. R4 ← R2 + R3
  • We need an efficient solution
  • Very low die-area cost
  • Can provide a significant performance benefit, depending on the application
  • Much more efficient than adding an entire core

45
Hyper-Threading Technology - How does it work?
  • Each logical processor has its own architecture state
  • Architecture state
  • The part of the CPU that holds the state of the process -
  • Control registers
  • General-purpose registers
  • Instruction Streaming Buffers and Trace Cache Fill Buffers
  • Instruction Translation Look-aside Buffer
  • Only one execution unit

46
Hyper-Threading Technology - How does it work?
47
Hyper-Threading Technology - How does it work?
[Diagram: wasted execution-unit slots]
48
Hyper-Threading Technology - OS Awareness
For best performance, the operating system should implement a few optimizations:
  • Hyper-Threading-aware thread scheduling
  • The operating system should schedule threads to logical processors on different physical processors before scheduling two threads to the same physical processor.
  • Aggressive HALT of processors in the idle loop
  • Using the YIELD (PAUSE) instruction to avoid spinlock contention
  • The YIELD instruction causes the logical processor to pause for a short period of time (approximately 50 clock cycles), allowing the other logical processor access to the shared resources of the physical HT processor (a spin-lock sketch follows).
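What that looks like in code: a minimal C11 spin lock that issues PAUSE while it waits (a sketch; _mm_pause() is the compiler intrinsic that emits the PAUSE instruction):

  /* spin.c - spin lock with PAUSE in the wait loop */
  #include <immintrin.h>    /* _mm_pause() */
  #include <stdatomic.h>

  static atomic_flag lock = ATOMIC_FLAG_INIT;

  void spin_lock(void)
  {
      while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
          _mm_pause();      /* hint: spin-wait; yield shared HT resources */
  }

  void spin_unlock(void)
  {
      atomic_flag_clear_explicit(&lock, memory_order_release);
  }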

49
Hyper-Threading Technology - Application Awareness
  • Operating systems expose a Hyper-Threading API
  • - for example, GetLogicalProcessorInformation on Windows (a query sketch follows this list)
  • To optimize the application's performance benefit on HT-enabled systems, the application should ensure that the threads executing on the two logical processors have minimal dependencies on the same shared resources of the same physical processor.
  • Bad HT thread-affinity example:
  • Threads that perform similar actions and stall for the same reasons should not be scheduled on the same physical processor. TooMuchMilk?
  • The benefit of HT is that shared processor resources can be used by one logical processor while the other logical processor is stalled. This does not work when both logical processors are stalled for the same reason.
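A sketch of querying that API (Win32 C; the function is called twice, first to learn the required buffer size):

  /* cores.c - list physical cores and their logical-processor masks */
  #include <windows.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(void)
  {
      DWORD len = 0;
      GetLogicalProcessorInformation(NULL, &len);        /* query size */
      SYSTEM_LOGICAL_PROCESSOR_INFORMATION *buf = malloc(len);
      if (!buf || !GetLogicalProcessorInformation(buf, &len))
          return 1;

      for (DWORD i = 0; i < len / sizeof(*buf); i++)
          if (buf[i].Relationship == RelationProcessorCore)
              /* more than one bit set => two logical processors (HT) */
              printf("core mask: 0x%llx\n",
                     (unsigned long long)buf[i].ProcessorMask);
      free(buf);
      return 0;
  }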

50
Hyper-Threading Technology - Some Programming Considerations
  • Alice
  • Leave note A
  • While (note B)
  • go to top_of_loop
  • Go buy milk
  • Remove note A

Bob
Leave note B
If !(note A)
  Go buy milk
Remove note B

Spin-wait: can starve the other thread
51
Hyper-Threading Technology - Some Programming Considerations
  • Alice
  • Leave note A
  • While (note B)
  •   wait (ideal time)
  •   go to top_of_loop
  • Go buy milk
  • Remove note A

Bob
Leave note B
If !(note A)
  Go buy milk
Remove note B

  • Ideal Time
  • It is clear that the loop variable cannot change faster than the memory bus can update it.
  • Hence, there is no benefit in re-executing the loop faster than the time needed for a memory refresh.

52
Hyper-Threading Technology - Hardware Assists
  • Two new (2004) instructions aimed directly at the spin-wait issue
  • MONITOR - watches a specified area of memory for write activity
  • Its companion instruction, MWAIT, associates writes to this memory block with waking up a specific processor
  • Since updating a variable is a write activity, by specifying its memory address and using MWAIT, a processor can simply be suspended until the variable is updated.
  • Effectively, this enables waiting on a variable without spinning in a loop (a sketch follows).
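A heavily hedged sketch of the mechanism (MONITOR/MWAIT are privileged, ring-0 instructions on most x86 CPUs, so this illustrates the idea rather than portable user code; GCC inline assembly assumed):

  /* mwait_sketch.c - kernel-mode style illustration only */
  #include <stdint.h>

  static volatile uint32_t flag;   /* hypothetical variable we wait on */

  void wait_for_flag(void)
  {
      while (flag == 0) {
          /* MONITOR: EAX = address to watch, ECX = extensions, EDX = hints */
          __asm__ volatile("monitor" :: "a"(&flag), "c"(0), "d"(0));
          if (flag != 0)           /* the write may already have happened */
              break;
          /* MWAIT: suspend until the watched line is written (or an
             interrupt arrives) - no spin loop burns execution slots */
          __asm__ volatile("mwait" :: "a"(0), "c"(0));
      }
  }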

53
High-Level Atomic Operation in Hardware
54
Atomic Operation in Hardware - LOCK Instruction
  • The x86 architecture implements the LOCK prefix
  • It causes the processor's LOCK signal to be asserted during the execution of the accompanying instruction (turning it into an atomic instruction)
  • The LOCK signal ensures that the processor has exclusive use of any shared memory while the signal is asserted

Atomic instructions (Intel, at least):
  Increment (lock inc r/m)
  Decrement (lock dec r/m)
  Exchange (xchg r/m, r) - always executed with the LOCK prefix
  Fetch and add (lock xadd r/m, r)
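In C you rarely write the prefix by hand; compiler builtins emit it. A small sketch (GCC/Clang builtins; on x86 the fetch-add below compiles to a LOCK-prefixed xadd):

  #include <stdint.h>

  int64_t counter;

  void hit(void)
  {
      /* atomic read-modify-write: emits `lock xadd` on x86 */
      __atomic_fetch_add(&counter, 1, __ATOMIC_SEQ_CST);
  }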
55
Atomic Operation in Hardware - Read-Modify-Write
  • Read-modify-write is a class of high-level atomic operations that both read a memory location and write a new value into it simultaneously
  • Typically they are used to implement mutexes or semaphores
  • Examples
  • - Compare and Swap
  • - Fetch and Add
  • - Test and Set
  • - Load Link / Store Conditional

56
Atomic Operation in Hardware - Compare and Swap / Exchange
  • CMPXCHG r/m, r
  •   IF accumulator == DEST
  •   THEN
  •     ZF := 1
  •     DEST := SRC
  •   ELSE
  •     ZF := 0
  •     accumulator := DEST
  •   FI
  • ZF indicates whether the swap has occurred
  • The destination operand receives a write cycle without regard to the result of the comparison
  • Needs the LOCK prefix to ensure atomicity
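The same operation through a compiler builtin (a sketch; on x86 this compiles to lock cmpxchg, and on failure `expected` is updated with the value actually found, just like the accumulator above):

  #include <stdbool.h>

  bool cas_int(int *p, int expected, int desired)
  {
      /* atomically: if (*p == expected) { *p = desired; return true; } */
      return __atomic_compare_exchange_n(p, &expected, desired,
                                         false, __ATOMIC_SEQ_CST,
                                         __ATOMIC_SEQ_CST);
  }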

57
Atomic Operation in Hardware - Compare and Swap / Exchange
  • OS system calls and programming languages implement the LOCK prefix automatically
  • #include <sys/atomic_op.h>  (AIX)
  • boolean_t compare_and_swap (word_addr, old_val_addr, new_val)
      atomic_p word_addr;
      int *old_val_addr;
      int new_val;
  • Available in Java:
  • AtomicInteger.compareAndSet(int, int) -> boolean

58
Atomic Operation in Hardware - Compare and Swap / Exchange
Now this can work (I hope, it was late..)
  • Alice
  • If (noMilk)
  • if (noNote)
  • leave Note
  • buy milk
  • remove Note
  • Bob
  • If (noMilk)
  • if (noNote)
  • leave Note
  • buy milk
  • remove Note

Atomic: the note check and the note update execute as one indivisible operation
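A sketch of the milk problem with a real CAS (C11 atomics; checking for the note and leaving it collapse into one atomic step, so Alice and Bob cannot both pass):

  #include <stdatomic.h>

  static atomic_int note = 0;     /* 0 = no note, 1 = note left */

  void maybe_buy_milk(void)       /* run concurrently by Alice and Bob */
  {
      int expected = 0;
      /* atomically: if no note, leave one; fails if the other won */
      if (atomic_compare_exchange_strong(&note, &expected, 1)) {
          /* ... buy milk ... */
          atomic_store(&note, 0); /* remove the note */
      }
  }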
59
Atomic Operation in Hardware - Load-Link/Store-Conditional
  • RISC / MIPS implementation
  • The LL and SC instructions are primitive instructions used to perform a read-modify-write operation on storage
  • The use of the LL and SC instructions ensures that no other processor or mechanism has modified the target memory location between the time the LL instruction is executed and the time the SC instruction completes
  • The LL instruction, in addition to doing a simple load, has the side effect of setting a user-transparent bit called the load-link bit (LLbit), similar to the cache coherency protocols
  • The LLbit forms a breakable link between the LL instruction and a subsequent SC instruction

60
Atomic Operation in Hardware - Load-Link/Store-Conditional
  • The SC performs a simple store if and only if the LLbit is set when the store is executed. If the LLbit is not set, the store fails to execute
  • The LLbit is reset upon the occurrence of any event that even has the potential to modify the lock variable while the sequence of code between LL and SC is being executed
  • The most obvious case where the link is broken is when an invalidate occurs to the cache line that was the subject of the load

61
Atomic Operation in Hardware - Load-Link/Store-Conditional
  • IBM Power Architecture Implementation

Assume that GPR 4 contains the new value to be stored, and that GPR 3 contains the address of the word to be loaded and replaced.

loop:
  lwarx  r5,0,r3     # Load and reserve
  stwcx. r4,0,r3     # Store new value if still reserved
  bne-   loop        # Loop if reservation was lost
  mr     r4,r5       # Copy the old value to GPR 4 (implied by the text below)

The new value is now in storage. The old value is returned to GPR 4.
62
Atomic Operation in Hardware - Load-Link/Store-Conditional - Usage
  • The ABA problem
  • - Suppose that the value of V is A.
  • - Try a CAS to change A to X.
  • - Another thread can change A to B and back to A.
  • - The compare-and-swap won't see it and will succeed
  • The obvious case is working with lists, or any other data structure that uses pointers
  • Note: CAS does have a solution for this, using a sort of change counter (sketched below).
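One common form of that counter fix, sketched in C (an illustration: pack the value with a version number so the CAS fails even if the value went A -> B -> A):

  #include <stdbool.h>
  #include <stdint.h>

  typedef union {
      uint64_t word;
      struct { uint32_t value; uint32_t version; };  /* C11 anonymous struct */
  } tagged_t;

  bool tagged_cas(uint64_t *p, tagged_t expected, uint32_t new_value)
  {
      tagged_t desired = { .value   = new_value,
                           .version = expected.version + 1 };  /* bump */
      uint64_t exp = expected.word;
      /* An A -> B -> A change bumps the version twice, so this CAS fails. */
      return __atomic_compare_exchange_n(p, &exp, desired.word,
                                         false, __ATOMIC_SEQ_CST,
                                         __ATOMIC_SEQ_CST);
  }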

63
Thank You