Title: Non-blocking Data Structures for High-Performance Computing
1. Non-blocking Data Structures for High-Performance Computing
2. Outline
- Shared Memory
- Synchronization Methods
- Memory Management
- Shared Data Structures
- Dictionary
- Performance
- Conclusions
3. Shared Memory
- Uniform Memory Access (UMA)
[Figure: several CPUs, each with a cache, attached to one shared memory]
- Non-Uniform Memory Access (NUMA)
[Figure: groups of CPUs, each group on a cache bus with its own memory]
4. Synchronization
- Shared data structures need synchronization!
- Accesses and updates must be coordinated to establish consistency.
[Figure: processes P1, P2 and P3]
5. Hardware Synchronization Primitives
- Consensus 1
  - Atomic Read/Write
- Consensus 2
  - Atomic Test-And-Set (TAS), Fetch-And-Add (FAA), Swap
- Consensus Infinite
  - Atomic Compare-And-Swap (CAS)
  - Atomic Load-Linked/Store-Conditionally
[Figure: Read, Write and read-modify-write M = f(M, ...)]
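The primitives on this slide map roughly onto C++ `std::atomic` operations. The sketch below is only an illustration under that assumption; the wrapper names (`test_and_set`, `fetch_and_add`, `swap_word`, `compare_and_swap`) are hypothetical and not from the talk. Load-Linked/Store-Conditionally has no direct C++ counterpart, although `compare_exchange_weak` is typically compiled to it on architectures that provide it.

```cpp
#include <atomic>

// Test-And-Set: sets the flag and returns its previous value.
bool test_and_set(std::atomic_flag& flag) {
    return flag.test_and_set();
}

// Fetch-And-Add: adds delta and returns the value held before the addition.
int fetch_and_add(std::atomic<int>& word, int delta) {
    return word.fetch_add(delta);
}

// Swap: stores new_value and returns the value it replaced.
int swap_word(std::atomic<int>& word, int new_value) {
    return word.exchange(new_value);
}

// Compare-And-Swap: writes new_value only if the word still holds old_value.
bool compare_and_swap(std::atomic<int>& word, int old_value, int new_value) {
    return word.compare_exchange_strong(old_value, new_value);
}
```

All four use the default sequentially consistent ordering, which is the simplest choice for a sketch.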
6. Mutual Exclusion
- Access to shared data will be atomic because of the lock
- Reduced parallelism by definition
- Blocking; danger of priority inversion and deadlocks
- Solutions exist, but with high overhead, especially for multi-processor systems
[Figure: processes P1, P2 and P3]
7. Non-blocking Synchronization
- Perform operations/changes using atomic primitives
- Lock-Free Synchronization
  - Optimistic approach
  - Retries until succeeding
  - Guarantees progress of at least one operation
- Wait-Free Synchronization
  - Always finishes in a finite number of its own steps
  - Coordination with all participants
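The optimistic retry pattern of lock-free synchronization can be sketched with a CAS loop. The example below (an assumed illustration, not from the talk) raises a shared value to a candidate maximum; the function name `fetch_max` is hypothetical. A failed CAS means some other thread changed the word in between, so at least one operation always makes progress.

```cpp
#include <atomic>

// Raises shared to candidate if candidate is larger.
// Returns the value that was observed before any update.
int fetch_max(std::atomic<int>& shared, int candidate) {
    int observed = shared.load();
    // Optimistic attempt: on failure, observed is refreshed with the
    // current value and the loop decides again whether to retry.
    while (observed < candidate &&
           !shared.compare_exchange_weak(observed, candidate)) {
    }
    return observed;
}
```

The loop is lock-free but not wait-free: an individual thread may retry many times while others succeed.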
8. Memory Management
- Dynamic data structures need dynamic memory management
- Concurrent data structures need concurrent memory management!
9. Concurrent Memory Management
- Concurrent Memory Allocation
  - i.e. malloc/free functionality
- Concurrent Garbage Collection
- Questions (among many)
  - When to re-use memory?
  - How to de-reference pointers safely?
[Figure: processes P1, P2 and P3]
10. Lock-Free Memory Management
- Memory Allocation
  - Valois 1995: fixed block size, fixed purpose
  - Michael 2004, Gidenstam et al. 2004: any size, any purpose
- Garbage Collection
  - Valois 1995, Detlefs et al. 2001: reference counting
  - Michael 2002, Herlihy et al. 2002: hazard pointers
  - Gidenstam, Papatriantafilou, Sundell and Tsigas 2005: hazard pointers with reference counting
11. Lock-Free Reference Counting
- De-referencing links
  - 1. Read the link contents, i.e. a pointer.
  - 2. Increment (FAA) the reference count on the corresponding object.
- What if the link is changed between step 1 and 2?
- Solution by Detlefs et al.
  - Use DCAS in step 2, which operates on two arbitrary memory words. Retries if the link has changed since step 1.
- Solution by Valois et al.
  - The reference count field is present indefinitely. If the link is found changed after step 2, decrements the reference count and retries.
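A minimal sketch of the Valois-style de-reference described on this slide, under stated assumptions: node memory is never handed back to the system (so the count field stays valid), only the reader's own reference is counted, and reclamation is left out. The names `safe_read` and `release` are hypothetical.

```cpp
#include <atomic>

struct Node {
    std::atomic<int> refcount{0};  // assumed to stay valid for the node's memory forever
    int value{0};
};

// Steps 1 and 2 plus the re-check: returns a pointer whose count was incremented.
Node* safe_read(std::atomic<Node*>& link) {
    for (;;) {
        Node* p = link.load();                // step 1: read the link
        if (p == nullptr) return nullptr;
        p->refcount.fetch_add(1);             // step 2: FAA on the reference count
        if (link.load() == p) return p;       // link unchanged: reference is safe
        p->refcount.fetch_sub(1);             // link changed: undo and retry
    }
}

// Drops one reference; returns true if this was the last one.
bool release(Node* p) {
    return p->refcount.fetch_sub(1) == 1;
}
```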
12. Lock-Free Hazard Pointers (Michael 2002)
- De-referencing links
  - 1. Read the link contents, i.e. a pointer.
  - 2. Set a hazard pointer to the read pointer value.
  - 3. Read the link contents again; if not the same as in step 1, restart from step 1.
- Deletion
  - After a node is deleted from the data structure, put it on a local list.
  - When the local list reaches a certain size, scan all hazard pointers globally and reclaim the memory of all nodes whose addresses do not match any pointer found in the scan.
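The two halves of the hazard pointer scheme can be sketched as follows. This is a simplified illustration, not Michael's full algorithm: it assumes one hazard pointer per thread, a fixed thread count, and caller-supplied thread ids; `kMaxThreads`, `kScanThreshold`, `protect`, `retire` and `scan` are hypothetical names.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <vector>

constexpr std::size_t kMaxThreads = 4;     // assumed fixed number of threads
constexpr std::size_t kScanThreshold = 2;  // assumed local-list size that triggers a scan

struct Node { int value; };

std::atomic<Node*> hazard[kMaxThreads] = {};  // one hazard pointer per thread

// De-reference steps 1 to 3: publish the pointer, then re-check the link.
Node* protect(std::atomic<Node*>& link, std::size_t tid) {
    for (;;) {
        Node* p = link.load();            // step 1
        hazard[tid].store(p);             // step 2
        if (link.load() == p) return p;   // step 3: unchanged, safe to use
    }
}

void clear(std::size_t tid) { hazard[tid].store(nullptr); }

// Frees every retired node that no hazard pointer refers to; returns the count freed.
std::size_t scan(std::vector<Node*>& retired) {
    std::vector<Node*> in_use;
    for (auto& h : hazard)
        if (Node* p = h.load()) in_use.push_back(p);
    std::vector<Node*> keep;
    std::size_t freed = 0;
    for (Node* n : retired) {
        if (std::find(in_use.begin(), in_use.end(), n) != in_use.end()) {
            keep.push_back(n);            // still protected by some thread
        } else {
            delete n;
            ++freed;
        }
    }
    retired.swap(keep);
    return freed;
}

// Called after a node has been unlinked; retired is the thread-local list.
std::size_t retire(std::vector<Node*>& retired, Node* n) {
    retired.push_back(n);
    return retired.size() >= kScanThreshold ? scan(retired) : 0;
}
```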
13. Lock-Free Memory Allocation
- Solution (lock-free): IBM freelists
  - Create a linked list of the free nodes; allocate/reclaim using CAS
  - Needs some mechanism to avoid the ABA problem.
[Figure: free list Head, Mem 1, Mem 2, ..., Mem n; Allocate takes from the head, Reclaim returns Used 1]
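A freelist of this kind might look like the sketch below. It assumes a fixed pool addressed by index and uses a version tag packed next to the head index as the ABA countermeasure; that is one common mechanism, not necessarily the one used in the talk. All names (`pool`, `allocate`, `reclaim`, `kPoolSize`) are hypothetical.

```cpp
#include <atomic>
#include <cstdint>

constexpr std::uint32_t kPoolSize = 8;        // assumed pool size
constexpr std::uint32_t kNil = 0xFFFFFFFFu;   // "no block" marker

struct Block {
    std::atomic<std::uint32_t> next{kNil};    // index of the next free block
    int payload{0};
};

// The head word holds the index in the low 32 bits and a version tag in the high 32 bits.
constexpr std::uint64_t pack(std::uint32_t index, std::uint32_t tag) {
    return (static_cast<std::uint64_t>(tag) << 32) | index;
}
constexpr std::uint32_t index_of(std::uint64_t h) { return static_cast<std::uint32_t>(h); }
constexpr std::uint32_t tag_of(std::uint64_t h) { return static_cast<std::uint32_t>(h >> 32); }

Block pool[kPoolSize];
std::atomic<std::uint64_t> head{pack(kNil, 0)};

// Links all blocks into the free list (call once, before any concurrent use).
void init_pool() {
    for (std::uint32_t i = 0; i < kPoolSize; ++i)
        pool[i].next.store(i + 1 < kPoolSize ? i + 1 : kNil);
    head.store(pack(0, 0));
}

// Pops the first free block; returns kNil when the pool is exhausted.
std::uint32_t allocate() {
    std::uint64_t old = head.load();
    for (;;) {
        std::uint32_t idx = index_of(old);
        if (idx == kNil) return kNil;
        std::uint32_t next = pool[idx].next.load();
        // The tag changes on every update, so a stale head never matches.
        if (head.compare_exchange_weak(old, pack(next, tag_of(old) + 1))) return idx;
    }
}

// Pushes a block back onto the free list.
void reclaim(std::uint32_t idx) {
    std::uint64_t old = head.load();
    for (;;) {
        pool[idx].next.store(index_of(old));
        if (head.compare_exchange_weak(old, pack(idx, tag_of(old) + 1))) return;
    }
}
```

Using indices instead of raw pointers keeps the head in a single 64-bit word, so an ordinary single-word CAS suffices.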
14. Shared Data Structure: Dictionaries (Sets)
- Fundamental data structure
- Works on a set of <key,value> pairs
- Three basic operations
  - Insert(k,v): Adds a new item
  - v=FindKey(k): Finds the item <k,v>
  - v=DeleteKey(k): Finds and removes the item <k,v>
15. Randomized Algorithm: Skip Lists
- William Pugh: Skip Lists: A Probabilistic Alternative to Balanced Trees, 1990
- Layers of ordered lists with different densities, achieves a tree-like behavior
- Time complexity O(log_2 N), probabilistic!
[Figure: skip list between Head and Tail with keys 1 to 7 and level labels 25 and 50]
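The "different densities" come from choosing each node's height at random. A minimal sketch, assuming the usual promotion probability of 1/2 and a maximum level of 10 (the value used in the experiments later in the talk); `random_level` and `kMaxLevel` are hypothetical names.

```cpp
#include <random>

constexpr int kMaxLevel = 10;  // assumed cap on node height

// Returns a height in [1, kMaxLevel]; each extra level is taken with probability 1/2,
// so about half the nodes reach level 2, a quarter reach level 3, and so on.
template <class Rng>
int random_level(Rng& rng) {
    std::bernoulli_distribution coin(0.5);
    int level = 1;
    while (level < kMaxLevel && coin(rng)) ++level;
    return level;
}
```

It can be called with any standard engine, for example `std::mt19937 rng(seed); int h = random_level(rng);`.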
16. New Lock-Free Concurrent Skip List
- Define node state to depend on the insertion status at lowest level as well as a deletion flag
- Insert from lowest level going upwards
- Set deletion flag. Delete from highest level going downwards
[Figure: skip list nodes 1 to 7, each with a deletion flag D; a node with levels 3, 2, 1 and pointer p, shown without and with the flag D set]
17. Overlapping operations on shared data
- Example: Insert operation. Which of 2 or 3 gets inserted?
- Solution: Compare-And-Swap atomic primitive
    CAS(p: pointer to word, old: word, new: word): boolean
      atomic do
        if *p = old then
          *p := new
          return true
        else return false
[Figure: list 1, 4 with concurrent Insert 2 and Insert 3 between them]
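The Insert 2 versus Insert 3 race can be sketched with a CAS on the predecessor's next pointer. The code below is an assumed illustration for an insert-only sorted list (no deletions, so no memory reclamation issues arise); `try_insert_between` and `insert` are hypothetical names.

```cpp
#include <atomic>

struct Node {
    int key;
    std::atomic<Node*> next;
    Node(int k, Node* n) : key(k), next(n) {}
};

// One optimistic attempt: succeeds only if pred still points to succ.
bool try_insert_between(Node* pred, Node* succ, Node* fresh) {
    fresh->next.store(succ);
    return pred->next.compare_exchange_strong(succ, fresh);
}

// Full insert into a sorted list with a sentinel head; retries after a lost race.
// Returns false if the key is already present.
bool insert(Node* head, int key) {
    Node* fresh = new Node(key, nullptr);
    for (;;) {
        Node* pred = head;
        Node* succ = pred->next.load();
        while (succ != nullptr && succ->key < key) {
            pred = succ;
            succ = succ->next.load();
        }
        if (succ != nullptr && succ->key == key) {
            delete fresh;
            return false;
        }
        if (try_insert_between(pred, succ, fresh)) return true;
    }
}
```

If two threads both saw 1 followed by 4, only one CAS on node 1 succeeds; the loser searches again and then inserts next to the winner's node.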
18. Concurrent Insert vs. Delete operations
- Problem: both nodes are deleted!
- Solution (Harris et al.): Use bit 0 of pointer to mark deletion status
[Figure: list 1, 2, 4 with a concurrent Delete and Insert of 3 in steps a) and b); the same scenario with marking in steps a), b) and c)]
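Marking bit 0 of the next pointer can be sketched as below. The example assumes nodes are at least 2-byte aligned, so bit 0 of a node address is always zero and free to carry the mark; the helper names (`pack`, `ptr_of`, `is_marked`, `mark_for_deletion`, `try_insert_after`) are hypothetical.

```cpp
#include <atomic>
#include <cstdint>

struct Node;

// Combines a node address with the deletion mark in bit 0.
inline std::uintptr_t pack(Node* p, bool marked) {
    return reinterpret_cast<std::uintptr_t>(p) | (marked ? 1u : 0u);
}

struct Node {
    int key;
    std::atomic<std::uintptr_t> next;  // successor address plus mark bit
    Node(int k, Node* succ) : key(k), next(pack(succ, false)) {}
};

inline Node* ptr_of(std::uintptr_t w) {
    return reinterpret_cast<Node*>(w & ~std::uintptr_t{1});
}
inline bool is_marked(std::uintptr_t w) { return (w & 1) != 0; }

// Logical deletion: sets the mark on the node's own next pointer.
// Returns false if another thread marked it first.
bool mark_for_deletion(Node* n) {
    std::uintptr_t w = n->next.load();
    while (!is_marked(w)) {
        if (n->next.compare_exchange_weak(w, w | 1)) return true;
    }
    return false;
}

// Insert attempt: the CAS expects an unmarked pointer to succ,
// so it fails if pred has been marked for deletion in the meantime.
bool try_insert_after(Node* pred, Node* succ, Node* fresh) {
    fresh->next.store(pack(succ, false));
    std::uintptr_t expected = pack(succ, false);
    return pred->next.compare_exchange_strong(expected, pack(fresh, false));
}
```

Because the mark lives in the same word as the pointer, a single CAS checks both at once, which is what prevents the inserted node from being lost together with the deleted one.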
19. Helping Scheme
- Threads need to traverse safely
- Need to remove marked-to-be-deleted nodes while traversing: Help!
- Finds previous node, finishes the deletion and continues traversing from the previous node
[Figure: three views of the list 1, 2, 4 during a traversal that meets a node marked for deletion]
20. Lock-Free Skip List - Techniques Summary
- The Skip List is treated as layers of ordered lists
- Uses CAS atomic primitive
- Lock-Free memory management
  - IBM Freelists
  - Reference counting (Valois, Michael and Scott)
- Helping scheme
- Back-Off strategy
- All together proved to be linearizable
21. Lock-Free Skip List publications
- First publications in literature
  - H. Sundell and P. Tsigas, Fast and Lock-Free Concurrent Priority Queues for Multi-thread Systems, IPDPS 2003
  - H. Sundell and P. Tsigas, Scalable and Lock-Free Concurrent Dictionaries, SAC 2004
- Later publications
  - M. Fomitchev and E. Ruppert, Lock-free linked lists and skip lists, PODC 2004
  - K. Fraser, Practical lock-freedom, PhD thesis, 2004
22. New Lock-Free Skip List!
- The thread that fulfils the deletion of a node removes the next pointer when finished.
- Allows other threads to traverse through even marked next pointers.
- If not possible to traverse forward, go back to the remembered position on previous (upper) levels.
- Helps deletions-in-progress only when absolutely necessary.
- Works with a modified version of Michael's Hazard Pointer memory management!
23. Correctness
- Linearizability (Herlihy 1991)
  - In order for an implementation to be linearizable, for every concurrent execution there should exist an equivalent sequential execution that respects the partial order of the operations in the concurrent execution
24. Correctness
- Define precise sequential semantics
- Define abstract state and its interpretation
- Show that state is atomically updated
- Define linearizability points
- Show that operations take effect atomically at these points with respect to sequential semantics
- Creates a total order using the linearizability points that respects the partial order
- The algorithm is linearizable
25. Memory Consistency and Out-Of-Order execution
- Models on actual multiprocessor architectures: Relaxed Memory Order etc.
- Must insert special machine instructions (memory barriers) to enforce stronger memory consistency models!
[Figure: timeline t of threads Ti, Tj and Tk performing writes W(x,v), W(y,v) and reads R(x), R(y) that return differing values]
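In C++ the memory barriers mentioned on this slide are available as `std::atomic_thread_fence`. The sketch below is an assumed illustration of the classic publish/consume pattern, intended for one writer thread and one reader thread; `publish`, `try_consume`, `payload` and `ready` are hypothetical names.

```cpp
#include <atomic>

int payload = 0;                  // ordinary, non-atomic data
std::atomic<bool> ready{false};   // flag that announces the data

// Writer: the release fence keeps the payload write from being
// reordered after the store to the flag.
void publish(int value) {
    payload = value;
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);
}

// Reader: the acquire fence keeps the payload read from being
// reordered before the load of the flag.
bool try_consume(int& out) {
    if (!ready.load(std::memory_order_relaxed)) return false;
    std::atomic_thread_fence(std::memory_order_acquire);
    out = payload;
    return true;
}
```

Without the two fences, a relaxed memory model would allow the reader to see the flag set and still read a stale payload.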
26. Experiments
- Experiment with 1-32 threads performed on Sun Fire 15K with 48 CPUs.
- Each thread performs 20000 operations, of which the first 50-10000 in total are Inserts; the remaining are equally randomly distributed over Insert, FindKey and DeleteKey.
- Fixed Skiplist maximum level of 10.
- Compare with implementations of other skip list-based dictionaries and a singly linked list by Michael, using same scenarios.
- Averaged execution time of 10 experiments.
35. Multi-Word Compare-And-Swap
- Operations
  - bool CASN(int *p1, int o1, int n1, ...)
  - int Read(int *p)
- Standard algorithmic approach
  - 1. Try to acquire a lock on all positions of interest.
  - 2. If already taken, help corresponding operation
  - 3. If all taken and all match, change status of operation
  - 4. Remove locks and possibly write new values
- My approach
  - Wait-free memory management (IPDPS 2005)
  - Lock stealing and lock hand-over
  - Allow un-sorted pointers
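The intended behavior of CASN and Read can be written down as a reference specification. The sketch below is deliberately NOT the lock-free algorithm from the talk: it uses a single mutex as a stand-in for atomicity, purely to show what a correct implementation must appear to do. `CasnEntry`, `casn` and `casn_read` are hypothetical names.

```cpp
#include <mutex>
#include <vector>

// One word taking part in the operation: address, expected old value, new value.
struct CasnEntry {
    int* address;
    int expected;
    int desired;
};

std::mutex casn_mutex;  // stand-in for atomicity in this specification only

// Updates all words if every one holds its expected value; otherwise changes nothing.
bool casn(const std::vector<CasnEntry>& entries) {
    std::lock_guard<std::mutex> guard(casn_mutex);
    for (const CasnEntry& e : entries)
        if (*e.address != e.expected) return false;
    for (const CasnEntry& e : entries)
        *e.address = e.desired;
    return true;
}

// Reads a word that may be the target of a concurrent casn.
int casn_read(const int* p) {
    std::lock_guard<std::mutex> guard(casn_mutex);
    return *p;
}
```

A non-blocking CASN replaces the mutex with per-word "locks" that other threads can help complete, as in steps 1 to 4 above.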
48. Lock-Free Deque
- Practical algorithms in literature
  - Michael 2003, CAS-based lock-free algorithm for shared deques, Euro-Par 2003
  - Sundell and Tsigas, Lock-Free and Practical Doubly Linked List-Based Deques using Single-Word Compare-And-Swap, OPODIS 2004
- Approach
  - Apply new memory management on lock-free deque
51. Conclusions
- Work performed at EPCC
  - Improved algorithm of lock-free skip list
  - Improved Michael's hazard pointer algorithm
  - Experiments comparing with other recent dictionary algorithms
  - New implementation of CASN.
  - Experiments comparing with other recent CASN algorithms.
  - Experiments comparing a lock-free deque algorithm using different memory management techniques.
- Future work
  - Implement new lock-free/wait-free dynamic data structures. More experiments.
52. Questions?
- Contact Information
  - Address: Håkan Sundell, Computing Science, Chalmers University of Technology
  - Email: phs_at_cs.chalmers.se
  - Web: http://www.cs.chalmers.se/phs