Title: Cache Coherent Nonuniform Memory Access
1. ccNUMA
Cache Coherent Non-Uniform Memory Access
Chris Coughlin, MSCS521, Prof. Ten Eyck, Spring 2004
2. Let's First Talk About Computer Architectures
In 1966, Michael Flynn proposed a classification for computer architectures based on the number of instruction streams and data streams (Flynn's Taxonomy).
- SISD (Single Instruction Stream, Single Data Stream)
  - A single-processor computer (uniprocessor) in which a single stream of instructions is generated from the program.
- SIMD (Single Instruction Stream, Multiple Data Stream)
  - Each instruction is executed on a different set of data by different processors. (Used for vector and array processing.)
- MISD (Multiple Instruction Stream, Single Data Stream)
  - Each processor executes a different sequence of instructions on the same data stream.
  - It has never been commercially implemented.
- MIMD (Multiple Instruction Stream, Multiple Data Stream)
  - Each processor has a separate program.
  - An instruction stream is generated from each program.
  - Each instruction operates on different data.
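The four categories depend only on whether there is one stream or more than one of each kind. As a minimal sketch (the function name `flynn_class` is an assumption made for illustration, not part of Flynn's work), the mapping can be written as:

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Return the Flynn category for the given stream counts (each >= 1)."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("stream counts must be at least 1")
    # "S" for a single stream, "M" for multiple streams.
    i = "S" if instruction_streams == 1 else "M"
    d = "S" if data_streams == 1 else "M"
    return f"{i}I{d}D"
```

For example, `flynn_class(1, 8)` describes one instruction stream driving eight data streams, which is the SIMD case.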
3. Multiprocessors
- The idea behind multiprocessors is to create powerful computers by connecting many smaller ones.
- Computational speed is increased by using multiple processors operating together on a single problem.
- A parallel processing program is a single program that runs on multiple processors simultaneously.
- The overall problem is split into parts, each of which is performed by a separate processor in parallel.
- In addition to a faster solution, a parallel machine may also produce a more precise solution in the same amount of time.
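The idea of splitting one problem into parts can be sketched in a few lines of Python. This is only an illustration of the decomposition: the helper names `split` and `parallel_sum` are hypothetical, and in CPython threads show the structure of the computation rather than a real speedup for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor

def split(data, parts):
    """Split data into `parts` nearly equal contiguous chunks."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks

def parallel_sum(data, workers=4):
    """Sum each chunk in a separate worker, then combine the partial sums."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum, split(data, workers)))
    return sum(partials)
```

Each worker handles one part of the overall problem, and the final `sum(partials)` is the step that combines the separate results into a single answer.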
4. MIMD Systems
- Shared Memory Multiprocessor System
  - Multiple processors are connected to multiple memory modules such that each processor can access any other processor's memory module. This multiprocessor employs a shared address space (also known as a single address space).
  - Communication is implicit, through loads and stores; there is no explicit recipient of a shared memory access.
  - Processors may communicate without necessarily being aware of one another.
  - A single image of the operating system runs across all the processors.
5. MIMD Systems (cont.)
- Multicomputer
  - A term for parallel processors with separate, private address spaces (not accessible by the other processors in the system).
  - Processors communicate by message passing: the messages carry data from one processor to another as dictated by the program.
  - Each node is a complete computer, consisting of a processor and local memory, connected to the others through an interconnection network (e.g. a LAN).
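Message passing can be contrasted with implicit shared-memory communication using a small sketch. Here two `queue.Queue` objects stand in for the interconnection network, and a thread stands in for a separate node; both are assumptions for illustration, since real multicomputer nodes do not share a process. The names `worker` and `run_demo` are hypothetical.

```python
import queue
import threading

def worker(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """A 'node' that only sees data arriving as an explicit message."""
    numbers = inbox.get()        # explicit receive
    outbox.put(sum(numbers))     # explicit send of the result

def run_demo():
    to_worker, from_worker = queue.Queue(), queue.Queue()
    node = threading.Thread(target=worker, args=(to_worker, from_worker))
    node.start()
    to_worker.put([1, 2, 3, 4])  # the sender chooses the recipient's channel
    result = from_worker.get()   # wait for the reply message
    node.join()
    return result
```

Unlike a load or store to shared memory, every transfer here has an explicit sender and an explicit recipient.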
6. Computer Architecture Classifications
Processor Organizations
- Single Instruction, Single Data Stream (SISD)
  - Uniprocessor
- Single Instruction, Multiple Data Stream (SIMD)
  - Vector Processor
  - Array Processor
- Multiple Instruction, Single Data Stream (MISD)
- Multiple Instruction, Multiple Data Stream (MIMD)
  - Shared Memory (tightly coupled)
  - Multicomputer (loosely coupled)
Note: We will expand on this later.
7. Back to Shared Memory Multiprocessors
- Two styles: UMA and NUMA
- UMA (Uniform Memory Access)
  - The time to access main memory is the same for all processors since they are equally close to all memory locations.
  - Machines that use UMA are called Symmetric Multiprocessors (SMPs).
  - In a typical SMP architecture, all memory accesses are posted to the same shared memory bus.
  - Contention: as more CPUs are added, competition for access to the bus leads to a decline in performance.
  - Thus, scalability is limited to about 32 processors.
8. Shared Memory Multiprocessors (cont.)
- NUMA (Non-Uniform Memory Access)
  - Since memory is physically distributed, it is faster for a processor to access its own local memory than non-local memory (memory local to another processor or shared between processors).
  - Unlike in SMPs, not all processors are equally close to all memory locations.
  - A processor's own internal computations can be done in its local memory, leading to reduced memory contention.
  - NUMA was designed to surpass the scalability limits of SMPs.
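The effect of non-uniform access can be illustrated with a two-level latency model. The latency values below (100 ns local, 300 ns remote) and the function names are purely hypothetical numbers chosen for the sketch, not measurements of any machine.

```python
def access_time_ns(cpu_node, memory_node, local_ns=100, remote_ns=300):
    """Latency of one access under a two-level NUMA model (hypothetical values)."""
    return local_ns if cpu_node == memory_node else remote_ns

def average_access_time_ns(local_fraction, local_ns=100, remote_ns=300):
    """Mean latency when `local_fraction` of accesses go to the local node."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns
```

Under these assumed numbers, a program that keeps half of its accesses local averages 200 ns per access, which is why placing data near the processor that uses it matters on NUMA machines.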
9. Communication and Connection Options for Multiprocessors
Multiprocessors come in two main configurations: a single bus connection and a network connection. The choice of the communication model and the physical connection depends largely on the number of processors in the organization. Notice that the scalability of NUMA makes it well suited to a network configuration. UMA, however, is best suited to a bus connection.
10. A Multiprocessor Bus Configuration
The single bus design is limited in terms of
scalability. The largest number of processors in
a commercial product using this configuration is
36 (SGI Power Challenge).
11. A Multiprocessor Network Configuration
The network-connected processor design is very
scalable. Since each processor has its own
memory, the network connection is only used for
communication between processors.
12. A Quick Look at Cache
- Modern processors use a faster, smaller cache memory to act as a buffer for slower, larger memory.
- Caches exploit the principle of locality in memory accesses.
  - Temporal locality: the concept that if data is referenced, it will tend to be referenced again soon after.
  - Spatial locality: the concept that data is more likely to be referenced soon if data near it was just referenced.
- Caches hold recently referenced data, as well as data near the recently referenced data.
- This can lead to performance increases by reducing the need to access main memory on every reference.
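How locality turns into cache hits can be seen with a toy direct-mapped cache. This is a sketch under simplifying assumptions: the class `DirectMappedCache`, its default sizes (4 lines of 8 addresses each), and the helper `hit_rate` are all hypothetical, and the model tracks only which block each line holds, not the data itself.

```python
class DirectMappedCache:
    """Toy direct-mapped cache that tracks which block each line holds."""

    def __init__(self, num_lines=4, block_size=8):
        self.num_lines = num_lines
        self.block_size = block_size
        self.lines = [None] * num_lines  # block number stored in each line
        self.hits = 0
        self.misses = 0

    def access(self, address):
        block = address // self.block_size
        index = block % self.num_lines
        if self.lines[index] == block:
            self.hits += 1
            return True
        self.lines[index] = block        # a miss fetches the whole block
        self.misses += 1
        return False

def hit_rate(addresses, **cache_args):
    """Fraction of accesses that hit, for a non-empty address sequence."""
    cache = DirectMappedCache(**cache_args)
    for address in addresses:
        cache.access(address)
    return cache.hits / (cache.hits + cache.misses)
```

Walking through consecutive addresses benefits from spatial locality (one miss per block, then hits on its neighbours), repeating one address benefits from temporal locality, and jumping a whole block at a time gets no benefit from either.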
13. What is ccNUMA?
- The "cc" in ccNUMA stands for cache coherent.
- The use of cache memory in modern computer architectures leads to the cache coherence problem.
  - It is a situation that can occur when two or more processors reference the same shared data. If one processor modifies its copy of the data, the other processors will have stale copies of the data in their caches.
- Machines that are cache coherent ensure that a processor accessing a memory location receives the most up-to-date version of the data.
- Cache coherence is maintained by software, special-purpose hardware, or both.
- NUMA systems that maintain cache coherence are referred to as ccNUMA machines.
- Since few applications exist for non-cache-coherent NUMA machines, the terms NUMA and ccNUMA are often used interchangeably.
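The stale-copy situation described above can be reproduced with two private caches that have no coherence mechanism at all. The class `NaiveCache` and the function `stale_read_demo` are hypothetical names used only for this sketch.

```python
class NaiveCache:
    """Private cache with no coherence mechanism (illustration only)."""

    def __init__(self, memory):
        self.memory = memory
        self.data = {}

    def read(self, addr):
        if addr not in self.data:          # miss: copy the value from memory
            self.data[addr] = self.memory[addr]
        return self.data[addr]

    def write(self, addr, value):
        self.data[addr] = value
        self.memory[addr] = value          # written through, but nobody is told

def stale_read_demo():
    memory = {"x": 0}
    a, b = NaiveCache(memory), NaiveCache(memory)
    a.read("x")
    b.read("x")                            # both caches now hold x = 0
    a.write("x", 42)
    return a.read("x"), b.read("x"), memory["x"]
```

After the write, the second cache still returns its old copy even though main memory holds the new value, which is exactly the problem a coherence protocol has to solve.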
14. Computer Architecture Classifications (revisited)
Processor Organizations
- Single Instruction, Single Data Stream (SISD)
  - Uniprocessor
- Single Instruction, Multiple Data Stream (SIMD)
  - Vector Processor
  - Array Processor
- Multiple Instruction, Single Data Stream (MISD)
- Multiple Instruction, Multiple Data Stream (MIMD)
  - Shared Memory (tightly coupled)
    - UMA (SMP)
    - NUMA
      - ccNUMA
  - Multicomputer (loosely coupled)
15. Cache Coherency Protocols
- Snooping protocol
  - A bus-based method in which cache controllers monitor the bus for activity and update or invalidate cache entries as necessary.
  - Two types:
    - Write-invalidate: the writing processor sends an invalidation signal over the bus. All other caches check to see if they have a copy of the cache block. If they do, the block containing the data gets invalidated. The writing processor then changes its local copy.
    - Write-update: the writing processor broadcasts the new data over the bus and all copies are updated with the new value.
  - Commercial machines typically use write-invalidate to preserve bandwidth.
  - Write-update has the advantage of making the new values appear in the caches sooner.
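The two snooping variants can be compared in a small simulation. This is a sketch under simplifying assumptions: the classes `Bus` and `SnoopingCache` and the function `demo` are hypothetical, coherence is tracked per address rather than per cache block, and memory is written through on every write to keep the model short.

```python
class Bus:
    """Shared bus: every attached cache sees every write announcement."""

    def __init__(self, policy):
        if policy not in ("invalidate", "update"):
            raise ValueError("policy must be 'invalidate' or 'update'")
        self.policy = policy
        self.memory = {}
        self.caches = []

    def broadcast_write(self, writer, addr, value):
        self.memory[addr] = value
        for cache in self.caches:
            if cache is not writer:
                cache.snoop(addr, value, self.policy)

class SnoopingCache:
    def __init__(self, bus):
        self.bus = bus
        self.data = {}
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.data:          # miss: fetch from memory
            self.data[addr] = self.bus.memory.get(addr, 0)
        return self.data[addr]

    def write(self, addr, value):
        self.bus.broadcast_write(self, addr, value)
        self.data[addr] = value            # then change the local copy

    def snoop(self, addr, value, policy):
        if addr in self.data:
            if policy == "invalidate":
                del self.data[addr]        # drop the stale copy
            else:
                self.data[addr] = value    # refresh the copy in place

def demo(policy):
    bus = Bus(policy)
    a, b = SnoopingCache(bus), SnoopingCache(bus)
    a.read("x")
    b.read("x")
    a.write("x", 7)
    still_cached = "x" in b.data           # did b keep a copy after the write?
    return still_cached, b.read("x")
```

With write-invalidate, the other cache loses its copy and must fetch the new value on its next read; with write-update, the new value is already in its cache, at the cost of sending the data over the bus on every write.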
16. Cache Coherency Protocols (cont.)
- Directory-based protocol
  - A central directory maintains the information about which memory locations are being shared in multiple caches and which are contained in just one processor's cache.
  - On any memory access, the directory knows which caches need to be updated or invalidated.
  - It is used by all software-based implementations of shared memory.
  - It is a scalable scheme that is suitable for a network configuration.
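The bookkeeping a directory performs can be sketched as a map from each block to the set of caches that hold it. The class `Directory` below is a hypothetical toy: it models a single central directory with invalidation on write and counts the point-to-point messages it would send, but omits the states and acknowledgements a real protocol needs.

```python
class Directory:
    """Tracks, per block, which caches hold a copy (a toy full-map directory)."""

    def __init__(self):
        self.memory = {}
        self.sharers = {}                  # block -> set of cache ids
        self.messages = 0                  # invalidations sent so far

    def read(self, cache_id, block):
        self.sharers.setdefault(block, set()).add(cache_id)
        return self.memory.get(block, 0)

    def write(self, cache_id, block, value):
        others = self.sharers.get(block, set()) - {cache_id}
        self.messages += len(others)       # invalidate only the actual sharers
        self.sharers[block] = {cache_id}   # the writer becomes the sole holder
        self.memory[block] = value
        return sorted(others)              # ids of the caches invalidated
```

Because invalidations go only to the caches listed for that block, no broadcast is needed, which is what makes the scheme a better fit for a network connection than a snooping protocol.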
17. A Side-Effect of Cache Coherency
- False sharing
  - Caches are organized into blocks of contiguous memory locations, mainly because programs tend to exhibit spatial locality of reference.
  - It is therefore possible for two processors to share the same cache block without sharing the same memory location within the block.
  - If one processor writes to its own part of the block, the other processor's copy of the entire block, including the memory location it was accessing, gets updated or invalidated.
  - Unnecessary invalidations can affect performance.
  - It is up to the programmer to detect and avoid it.
  - Compiler-based solutions are being researched.
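Whether two variables can suffer from false sharing depends only on whether their addresses fall in the same cache block. The sketch below assumes a 64-byte block and 8-byte items; both sizes and all three function names are hypothetical choices for illustration. It also shows padding, one common way a programmer can avoid the problem.

```python
def cache_block(address, block_size=64):
    """Index of the cache block that contains a byte address."""
    return address // block_size

def falsely_shared(addr_a, addr_b, block_size=64):
    """True if two different addresses fall in the same cache block."""
    return (addr_a != addr_b
            and cache_block(addr_a, block_size) == cache_block(addr_b, block_size))

def padded_offsets(count, item_size=8, block_size=64):
    """Byte offsets that give each item its own block (padding sketch)."""
    # Round the item size up to a whole number of blocks.
    stride = -(-item_size // block_size) * block_size
    return [i * stride for i in range(count)]
```

Two adjacent 8-byte counters at offsets 0 and 8 share a block, so a write to one disturbs the other processor's copy; placing them at the offsets returned by `padded_offsets` puts each in its own block at the cost of unused space.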
18. ccNUMA Implementations
- Stanford Dash
  - Dash stands for Directory Architecture for Shared Memory.
  - It was the first operational machine to use a scalable, directory-based cache coherence mechanism.
- SGI Origin 2000 (Silicon Graphics Inc.)
  - Can support up to 1024 processors.
  - SGI claims it accounts for over 95% of worldwide shipments of ccNUMA-based systems.
- IBM's LA (Local Access) ccNUMA