Title: Programming, Debugging, Profiling and Optimizing Transactional Memory Applications
1. Programming, Debugging, Profiling and Optimizing Transactional Memory Applications
PhD Thesis Proposal
Ferad Zyulkyarov
- Department of Computer Architecture
- Universitat Politècnica de Catalunya, BarcelonaTech
- Barcelona Supercomputing Center
01 July 2010
2. Publications
- Ferad Zyulkyarov, Srdjan Stipic, Tim Harris, Osman Unsal, Adrian Cristal, Ibrahim Hur, Mateo Valero, Discovering and Understanding Performance Bottlenecks in Transactional Applications, PACT'10
- Ferad Zyulkyarov, Tim Harris, Osman Unsal, Adrian Cristal, Mateo Valero, Debugging Programs that use Atomic Blocks and Transactional Memory, PPoPP'10
- Vladimir Gajinov, Ferad Zyulkyarov, Osman Unsal, Adrian Cristal, Eduard Ayguade, Tim Harris, Mateo Valero, QuakeTM: Parallelizing a Complex Serial Application Using Transactional Memory, ICS'09
- Ferad Zyulkyarov, Vladimir Gajinov, Osman Unsal, Adrian Cristal, Eduard Ayguade, Tim Harris, Mateo Valero, Atomic Quake: Using Transactional Memory in an Interactive Multiplayer Game Server, PPoPP'09
- Ferad Zyulkyarov, Sanja Cvijic, Osman Unsal, Adrian Cristal, Eduard Ayguade, Tim Harris, Mateo Valero, WormBench - A Configurable Workload for Evaluating Transactional Memory Systems, MEDEA'09
- Ferad Zyulkyarov, Milos Milovanovic, Osman Unsal, Adrian Cristal, Eduard Ayguade, Tim Harris, Mateo Valero, Memory Management for Transaction Processing Core in Heterogeneous Chip-Multiprocessors, OSHMA'09
- Milos Milovanovic, Osman Unsal, Adrian Cristal, Ferad Zyulkyarov, Mateo Valero, Compiler Support for Using Transactional Memory in C/C++ Applications, INTERACT'07
3. Work Plan
[Figure: work plan timeline. Task durations shown: 12m, 11m, 21m, 10m, 15m, 9.5m, 7m, 2m. Date marker: 01/10/2010.]
4. Transactional Memory
    atomic {
        statement1;
        statement2;
        statement3;
        statement4;
        ...
    }
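The execute, validate, commit-or-retry cycle behind an atomic block can be illustrated on a single word. The sketch below is not a transactional memory and is not taken from the talk: it only shows the optimistic pattern using C11 atomics. The names `shared_counter` and `atomic_add` are hypothetical.

```c
#include <stdatomic.h>

/* Hypothetical shared word standing in for data touched by an atomic block. */
static _Atomic long shared_counter = 0;

/* Optimistically apply "counter += delta":
 *   1. read a snapshot,
 *   2. compute the new value privately,
 *   3. commit only if nobody changed the counter in the meantime,
 *   4. otherwise count an abort and retry.
 * Returns the number of aborted attempts. */
static int atomic_add(long delta)
{
    int aborts = 0;
    long snapshot = atomic_load(&shared_counter);

    /* On failure the call stores the current value into `snapshot`,
     * so the next attempt starts from fresh data. */
    while (!atomic_compare_exchange_strong(&shared_counter, &snapshot,
                                           snapshot + delta)) {
        aborts++;
    }
    return aborts;
}
```

A real TM system tracks whole read and write sets rather than one word, which is where the overheads discussed later in the talk come from.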
5. The Big Questions
- Is programming with TM easy?
- Is TM competitive with locks?
- Are existing development tools sufficient?
6. Atomic Quake
- Parallel Quake game server
- All locks are replaced with atomic blocks
- 27,400 LOC of C code in 56 files
- Rich transactional application
- 63 atomic blocks
- Rich uses of atomic blocks: library calls, I/O, error handling, memory allocation, failure atomicity
- Various transactional characteristics
- A workload to drive research in TM
7. Is programming with TM easy?
- Yes.
- In large applications that have many shared objects and need efficient fine-grained synchronization.
- Example: region-based locking in tree data structures and graphs.
8. Where Do Transactions Fit?
Guarding different types of objects with separate locks.

    switch (object->type) {   /* Lock phase */
    case KEY:    lock(key_mutex);    break;
    case LIFE:   lock(life_mutex);   break;
    case WEAPON: lock(weapon_mutex); break;
    case ARMOR:  lock(armor_mutex);  break;
    }

    pick_up_object(object);

    switch (object->type) {   /* Unlock phase */
    case KEY:    unlock(key_mutex);    break;
    case LIFE:   unlock(life_mutex);   break;
    case WEAPON: unlock(weapon_mutex); break;
    case ARMOR:  unlock(armor_mutex);  break;
    }

The same code with an atomic block; the lock phase and the unlock phase disappear:

    atomic {
        pick_up_object(object);
    }
9. Is TM Competitive with Locks?
- No.
- 4-5x slowdown on the single-threaded version.
- But it promises to become competitive because it scales well.

Threads | Transactions | Aborts (num) | Aborts (%) | Irrevocable
1 | 36,667 | 0 | 0.00 | 17
2 | 75,824 | 241 | 0.42 | 31
4 | 166,000 | 2,612 | 1.58 | 85
8 | 477,519 | 76,771 | 25.50 | 237

Scales OK up to 4 threads. Sudden increase in aborts at 8 threads.
10. Are Existing Tools Sufficient?
- No.
- We need:
- Richer language level primitives and integration.
- Mechanisms to handle I/O.
- Dynamic error handling.
- Debuggers.
- Profilers.
11. Unstructured Use of Locks
Atomic Block

    bool first_if = false;
    bool second_if = false;
    for (i = 0; i < sv_tot_num_players / sv_nproc; i++) {
        <statements1>
        atomic {
            <statements2>
            if (!c->send_message) {
                <statements3>
                first_if = true;
            } else {
                <statements5>
                if (!sv.paused && !Netchan_CanPacket(&c->netchan)) {
                    <statements6>
                    second_if = true;
                } else {
                    <statements8>
                    if (c->state == cs_spawned) {
                        if (frame_threads_num > 1) {
                            atomic {
                                <statements9>
                            }
                        } else {
                            <statements9>
                        }
                    }
                }
            }
        }
        if (first_if) {
            <statements4>
            first_if = false;
            continue;
        }
        if (second_if) {
            <statements7>
            second_if = false;
            continue;
        }
        <statements10>
    }

Locks

    for (i = 0; i < sv_tot_num_players / sv_nproc; i++) {
        <statements1>
        LOCK(cl_msg_lock[c - svs.clients]);
        <statements2>
        if (!c->send_message) {
            <statements3>
            UNLOCK(cl_msg_lock[c - svs.clients]);
            <statements4>
            continue;
        }
        <statements5>
        if (!sv.paused && !Netchan_CanPacket(&c->netchan)) {
            <statements6>
            UNLOCK(cl_msg_lock[c - svs.clients]);
            <statements7>
            continue;
        }
        <statements8>
        if (c->state == cs_spawned) {
            if (frame_threads_num > 1) LOCK(par_runcmd_lock);
            <statements9>
            if (frame_threads_num > 1) UNLOCK(par_runcmd_lock);
        }
        UNLOCK(cl_msg_lock[c - svs.clients]);
        <statements10>
    }

Problems in the atomic block version: extra variables and code, complicated conditional logic.
Solution: explicit commit.
12. Various Transactional Characteristics
Per-atomic block runtime statistics from Atomic Quake. Dynamic length is in CPU cycles; read set and write set sizes are in bytes.

ID | TX | Length Total | Length Min | Length Max | Length Avg | Read Total | Read Min | Read Max | Read Avg | Write Total | Write Min | Write Max | Write Avg
56 | 26,962 | 172,872,572 | 288 | 112,832 | 6,412 | 1,328,536 | 20 | 104 | 49 | 0 | 0 | 0 | 0
60 | 5,931 | 5,810,152 | 224 | 41,552 | 980 | 76,212 | 12 | 640 | 13 | 928 | 0 | 116 | 0
61 | 1,095 | 20,573,540 | 4,560 | 49,984 | 19,208 | 723,474 | 88 | 776 | 661 | 90 | 84 | 84 | 84
59 | 1,042 | 3,117,844 | 1,520 | 39,344 | 2,999 | 29,176 | 5 | 28 | 28 | 16,672 | 16 | 16 | 16
57 | 1,038 | 401,502,152 | 288,704 | 522,528 | 387,552 | 10,963,719 | 7,614 | 15,490 | 10,562 | 2,592,367 | 1,680 | 3,656 | 2,497
58 | 1,002 | 134,949,344 | 87,056 | 1,341,504 | 134,949 | 5,054,282 | 3,028 | 53,566 | 5,044 | 931,445 | 548 | 11,161 | 930
15 | 3 | 67,660 | 720 | 48,176 | 1,735 | 96 | 32 | 32 | 32 | 18 | 6 | 6 | 6
5 | 2 | 99,988 | 592 | 36,384 | 1,923 | 64 | 32 | 32 | 32 | 10 | 5 | 5 | 5
22 | 2 | 43,632 | 12,176 | 35,504 | 21,816 | 72 | 36 | 36 | 36 | 128 | 64 | 64 | 64
36 | 2 | 40,476 | 6,800 | 44,880 | 20,238 | 249 | 108 | 141 | 125 | 55 | 22 | 33 | 28
38 | 2 | 71,368 | 2,144 | 31,504 | 4,461 | 90 | 44 | 46 | 45 | 26 | 12 | 14 | 13

Observations:
- Very small transactions.
- Very large transactions.
- Different execution frequency -> phased behavior.
- Most frequent atomic block is read-only.
- Control flow does not reach all atomic blocks.
13. Debugging Transactional Applications
- Existing debuggers are not aware of atomic blocks and transactional memory
- New principles and approaches
- Debugging atomic blocks atomically
- Debugging at the level of transactions
- Managing transactions at debug-time
- Extension for WinDbg to debug programs with atomic blocks
14. Atomicity in Debugging
- Step over atomic blocks as if they were a single instruction.
- Abstracts whether atomic blocks are implemented with TM or lock inference.
- Good for debugging synchronization errors at the granularity of atomic blocks vs. individual statements inside the atomic blocks.

Non-TM Aware Debugger vs. TM Aware Debugger (both panels show the same code):
    <statement 1>
    <statement 2>
    atomic
    <statement 3>
    <statement 4>
    <statement 5>
    <statement 6>
    <statement 7>
    <statement 8>

Debugging becomes frustrating when a transaction aborts.
15. Isolation in Debugging
- What if we want to debug wrong code within an atomic block?
- Put a breakpoint inside the atomic block.
- Validate the transaction.
- Step within the transaction.
- The user does not observe intermediate results of concurrently running transactions.
- Switch the transaction to irrevocable mode after validation.

    atomic {
        <statement 1>
        <statement 2>
        <statement 3>
        <statement 4>
    }
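The breakpoint handling described on this slide can be sketched as follows. This is a minimal illustration, not the WinDbg extension itself: it assumes a version-based read set, and the names `tx`, `read_entry`, `tx_validate` and `on_breakpoint` are hypothetical. What happens when validation fails is not stated on the slide; the sketch assumes the transaction is marked aborted so it can re-execute.

```c
#include <stdbool.h>
#include <stddef.h>

enum tx_status { TX_ACTIVE, TX_IRREVOCABLE, TX_ABORTED };

/* One read-set entry: the version word of a location and the version
 * the transaction observed when it read that location. */
struct read_entry {
    const unsigned *version;
    unsigned seen;
};

struct tx {
    enum tx_status status;
    const struct read_entry *read_set;
    size_t read_count;
};

/* A transaction is valid if nothing it read has changed since. */
static bool tx_validate(const struct tx *t)
{
    for (size_t i = 0; i < t->read_count; i++) {
        if (*t->read_set[i].version != t->read_set[i].seen)
            return false;
    }
    return true;
}

/* Called when a breakpoint inside an atomic block is hit.
 * Returns true if the debugger should stop and let the user step. */
static bool on_breakpoint(struct tx *t)
{
    if (!tx_validate(t)) {
        t->status = TX_ABORTED;      /* assumption: doomed, re-execute */
        return false;
    }
    t->status = TX_IRREVOCABLE;      /* no abort can occur while stepping */
    return true;
}
```

Switching to irrevocable mode only after validation means the user steps through a transaction that has seen a consistent state and will not be rolled back mid-session.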
16. Debugging at the Level of Transactions
- Assumes that atomic blocks are implemented with transactional memory.
- Examine the internal state of the TM
- Read/write set, re-executions, status
- TM specific watchpoints
- Break when conflict happens
- Filters
- Concurrent work with Herlihy and Lev, PACT'09.
17. TM Specific Watchpoints
Break when conflict happens
AND
Filter: break if Address = reservation_at_04, Thread = T2

    atomic {
        <statement 1>
        <statement 2>
        <statement 3>
        <statement 4>
    }
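The "break on conflict AND filter" rule above can be read as a predicate over conflict events. The following is a minimal sketch under that reading; the types `conflict_event` and `conflict_filter` and the function `should_break` are hypothetical and do not reflect the debugger's actual interface.

```c
#include <stdbool.h>
#include <stdint.h>

#define ANY_THREAD (-1)

/* A conflict reported by the TM runtime (hypothetical shape). */
struct conflict_event {
    uintptr_t address;   /* memory location involved in the conflict */
    int thread_id;       /* thread whose transaction conflicted */
};

/* A user-supplied filter; unset fields match everything. */
struct conflict_filter {
    bool match_address;
    uintptr_t address;
    int thread_id;       /* ANY_THREAD disables the thread check */
};

/* Break only if the conflict satisfies every enabled filter condition. */
static bool should_break(const struct conflict_filter *f,
                         const struct conflict_event *e)
{
    if (f->match_address && f->address != e->address)
        return false;
    if (f->thread_id != ANY_THREAD && f->thread_id != e->thread_id)
        return false;
    return true;
}
```

With the slide's example, the filter would carry the address of `reservation_at_04` and thread T2, so conflicts on other addresses or in other threads would not stop the program.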
18. Managing Transactions at Debug-Time
- At the level of atomic blocks
- Debug-time atomic blocks
- Splitting atomic blocks
- At the level of transactions
- Changing the state of the TM system (i.e. adding and removing entries from the read/write set, changing the status, aborting)
- Analogous to the functionality of existing debuggers to change the CPU state
19. Example: Debug-Time Atomic Blocks
    <statement 1>
    <statement 2>
    <statement 3>
    <statement 4>
    <statement 5>
    <statement 6>
    <statement 7>
    <statement 8>
    <statement 9>
    <statement 10>
    <statement 11>
    <statement 12>
    <statement 13>
    <statement 14>
20. Example: Debug-Time Atomic Blocks
    <statement 1>
    <statement 2>
    <statement 3>
    StartDebugAtomic
    <statement 4>
    <statement 5>
    <statement 6>
    <statement 7>
    <statement 8>
    <statement 9>
    EndDebugAtomic
    <statement 10>
    <statement 11>
    <statement 12>
    <statement 13>
    <statement 14>
The user marks the start and the end of the transaction.
21. Issues of Profiling TM Programs
- TM applications have unanticipated overheads
- Problem raised by Pankratius' talk at ICSE'09 and Rossbach et al., PPoPP'10
- Difficult to profile TM applications without profiling tools and without knowing the implementation of the TM system
- Experience of optimizing QuakeTM, Gajinov et al., ICS'09
22. Profiling TM Programs
- Design principles
- Report results at source language constructs
- Abstract the underlying TM system
- Low probe effect and overhead
- Profiling techniques
- Conflict point discovery
- Identifying conflicting data structures
- Visualizing transactions
23. Conflict Point Discovery
- Identifies the statements involved in conflicts
- Provides contextual information
- Finds the critical path

File:Line | Conf. | Method | Line
Hashtable.cs:51 | 152 | Add | if (_container[hashCode]
Hashtable.cs:48 | 62 | Add | uint hashCode = HashSdbm(
Hashtable.cs:53 | 5 | Add | _container[hashCode] = n
Hashtable.cs:83 | 5 | Add | while (entry != null)
ArrayList.cs:79 | 3 | Contains | for (int i = 0; i < count; i++)
ArrayList.cs:52 | 1 | Add | if (count capacity 1)
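A table like the one above can be produced by counting conflicts per source location and sorting by count. The sketch below assumes the TM runtime can report a file and line for each conflict; the names `record_conflict`, `sort_conflict_points` and the fixed-size table are hypothetical simplifications, not the profiler's implementation.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_POINTS 64

struct conflict_point {
    const char *file;
    int line;
    unsigned conflicts;
};

static struct conflict_point points[MAX_POINTS];
static size_t point_count = 0;

/* Record one conflict at file:line.
 * Returns 0 on success, -1 if the table is full. */
static int record_conflict(const char *file, int line)
{
    for (size_t i = 0; i < point_count; i++) {
        if (points[i].line == line && strcmp(points[i].file, file) == 0) {
            points[i].conflicts++;
            return 0;
        }
    }
    if (point_count == MAX_POINTS)
        return -1;
    points[point_count++] = (struct conflict_point){ file, line, 1 };
    return 0;
}

/* Comparator for qsort: most conflicts first. */
static int by_conflicts_desc(const void *a, const void *b)
{
    const struct conflict_point *x = a;
    const struct conflict_point *y = b;
    return (x->conflicts < y->conflicts) - (x->conflicts > y->conflicts);
}

/* Order the table so the hottest conflict points come first. */
static void sort_conflict_points(void)
{
    qsort(points, point_count, sizeof points[0], by_conflicts_desc);
}
```

Reporting by file and line rather than by raw address follows the design principle of presenting results at the level of source language constructs.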
24. Call Context
    increment() {
        counter++;
    }

    probability80() {
        probability = random() % 100;
        if (probability < 80)
            atomic { increment(); }
    }

    probability20() {
        probability = random() % 100;
        if (probability > 80)
            atomic { increment(); }
    }

Thread 1 and Thread 2 each run:
    for (int i = 0; i < 100; i++) {
        probability80();
        probability20();
    }

Bottom-up view:
    increment (100)
      <- probability80 (80)
      <- probability20 (20)

Top-down view:
    main (100)
      -> probability80 (80) -> increment (80)
      -> probability20 (20) -> increment (20)
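The bottom-up view can be derived by counting events per caller/callee pair and then summing over callers. The sketch below is an assumption about how such counts might be kept; `record_call`, `edge_total` and `bottom_up_total` are hypothetical names, and the caller and callee are passed in explicitly rather than recovered from a real call stack.

```c
#include <string.h>

#define MAX_EDGES 16

/* One caller -> callee edge with the number of events seen on it. */
struct call_edge {
    const char *caller;
    const char *callee;
    unsigned count;
};

static struct call_edge edges[MAX_EDGES];
static int edge_count = 0;

/* Count one event in the context "caller -> callee".
 * Returns 0 on success, -1 if the table is full. */
static int record_call(const char *caller, const char *callee)
{
    for (int i = 0; i < edge_count; i++) {
        if (strcmp(edges[i].caller, caller) == 0 &&
            strcmp(edges[i].callee, callee) == 0) {
            edges[i].count++;
            return 0;
        }
    }
    if (edge_count == MAX_EDGES)
        return -1;
    edges[edge_count++] = (struct call_edge){ caller, callee, 1 };
    return 0;
}

/* Top-down detail: events for one specific caller -> callee edge. */
static unsigned edge_total(const char *caller, const char *callee)
{
    for (int i = 0; i < edge_count; i++) {
        if (strcmp(edges[i].caller, caller) == 0 &&
            strcmp(edges[i].callee, callee) == 0)
            return edges[i].count;
    }
    return 0;
}

/* Bottom-up total: events in callee summed over all of its callers. */
static unsigned bottom_up_total(const char *callee)
{
    unsigned total = 0;
    for (int i = 0; i < edge_count; i++) {
        if (strcmp(edges[i].callee, callee) == 0)
            total += edges[i].count;
    }
    return total;
}
```

Without the per-caller breakdown, all 100 events would be attributed to `increment` alone, hiding the fact that most of them arrive through `probability80`.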
25. Aborts Graph (Bayes)
There are 15 atomic blocks and only one of them accounts for most aborts. Which atomic blocks cause AB3 to abort?
[Figure: aborts graph with nodes AB1, AB2 and AB3. Edge labels: "Conf 73%, Wasted 63%" and "Conf 20%, Wasted 29%". AB3: 72% of wasted work.]
26. Identifying Conflicting Objects
    List list = new List();
    list.Add(1);
    list.Add(2);
    list.Add(3);
    ...
    atomic {
        list.Replace(2, 33);
    }

Per-Object View:
    List.cs:1 "list" (42)
      -> ChangeNode (20)
      -> Replace (12)
      -> Add (8)

[Figure: List with nodes 1, 2, 3 at addresses 0x08, 0x10, 0x18, 0x20. Labels: GC Root 0x08, Object Addr 0x20, Instr Addr 0x446290. Components: GC, Memory Allocator, DbgEng. Source location: List.cs:1.]
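One way to read the figure is that a conflicting address is first mapped to the heap object containing it and then followed back to the root object the programmer declared. The sketch below illustrates that reading only. It assumes a flat table of objects with a single parent link each; `heap_object`, `find_object` and `conflict_site` are hypothetical names, and object sizes and node allocation sites are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define NO_PARENT (-1)

/* Hypothetical view of one heap object, as a GC or allocator might expose it. */
struct heap_object {
    uintptr_t base;           /* start address */
    size_t size;              /* size in bytes */
    int parent;               /* index of the referencing object, or NO_PARENT */
    const char *alloc_site;   /* source location where it was allocated */
};

/* Find the object whose address range contains addr, or -1 if none does. */
static int find_object(const struct heap_object *heap, int count, uintptr_t addr)
{
    for (int i = 0; i < count; i++) {
        if (addr >= heap[i].base && addr - heap[i].base < heap[i].size)
            return i;
    }
    return -1;
}

/* Map a conflicting address to the allocation site of its root object.
 * Assumes parent links form a tree (no cycles). Returns NULL if unknown. */
static const char *conflict_site(const struct heap_object *heap, int count,
                                 uintptr_t addr)
{
    int i = find_object(heap, count, addr);
    if (i < 0)
        return NULL;
    while (heap[i].parent != NO_PARENT)
        i = heap[i].parent;
    return heap[i].alloc_site;
}
```

With the addresses from the figure, a conflict at 0x20 (the third list node) would resolve through the node chain to the root at 0x08 and be reported against `List.cs:1`.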
27. Transaction Visualizer (Genome)
[Figure: transaction timeline with annotated regions "Garbage Collection" and "Wait on barrier".]
Aborts occur at the first and last atomic blocks in program order.
28. Overhead and Probe Effect
Process data offline or during GC.
Columns without a suffix: profiling enabled. Columns with a "-" suffix: profiling disabled.

Normalized Execution Time
Threads | Bayes | Bayes- | Gen | Gen- | Intrd | Intrd- | Labr | Labr- | Vac | Vac- | WB | WB-
1 | 1.59 | 1.00 | 1.27 | 1.00 | 1.29 | 1.00 | 1.07 | 1.00 | 1.26 | 1.00 | 0.71 | 1.00
2 | 1.00 | 0.56 | 0.97 | 0.67 | 0.97 | 0.58 | 0.64 | 0.61 | 0.83 | 0.59 | 0.60 | 0.55
4 | 0.23 | 0.23 | 0.73 | 0.52 | 0.91 | 0.36 | 0.45 | 0.46 | 0.58 | 0.40 | 0.41 | 0.33
8 | 0.21 | 0.20 | 0.73 | 0.55 | 1.57 | 0.38 | 0.72 | 0.56 | 0.53 | 0.34 | 0.33 | 0.22
Standard deviation for the difference: 27%

Abort Rate in %
Threads | Bayes | Bayes- | Gen | Gen- | Intrd | Intrd- | Labr | Labr- | Vac | Vac- | WB | WB-
1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
2 | 4.39 | 4.69 | 0.07 | 0.07 | 3.69 | 3.51 | 0.19 | 0.15 | 0.80 | 0.80 | 0.00 | 0.00
4 | 16.29 | 27.31 | 0.26 | 0.36 | 14.90 | 13.65 | 0.35 | 0.36 | 2.30 | 2.45 | 0.00 | 0.00
8 | 53.74 | 66.08 | 0.53 | 0.80 | 39.64 | 37.41 | 0.40 | 0.47 | 4.91 | 5.30 | 0.02 | 0.03
Standard deviation for the difference: 3.88
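The overhead figures can be read off the table by comparing each profiled column with its unprofiled counterpart. The helpers below are a small sketch of that arithmetic; the function names are hypothetical and the choice of metrics (relative slowdown and mean difference) is an assumption, not necessarily the exact statistic used in the talk.

```c
#include <stddef.h>

/* Relative slowdown of a profiled run over an unprofiled run, in percent.
 * Both arguments are normalized execution times from the same row. */
static double profiling_overhead_pct(double enabled, double disabled)
{
    return (enabled / disabled - 1.0) * 100.0;
}

/* Mean of enabled[i] - disabled[i] over n thread counts. */
static double mean_difference(const double *enabled, const double *disabled,
                              size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += enabled[i] - disabled[i];
    return n ? sum / (double)n : 0.0;
}
```

For example, Bayes at 1 thread gives 1.59 against 1.00, a 59% slowdown, while at 4 threads both columns read 0.23 and the measured overhead vanishes.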
29. Optimization Techniques
- Moving statements
- Atomic block scheduling
- Checkpoints and nested atomic blocks
- Pessimistic reads
- Early release
30. Moving Statements
Will this code execute the same?

    atomic {
        counter++;
        <statement1>
        <statement2>
        <statement3>
    }

    atomic {
        <statement1>
        <statement2>
        <statement3>
        counter++;
    }

No!
31. Checkpoints
    atomic {
        <statement1>
        <statement2>
        <statement3>
        <statement4>
        <statement5>
        <statement6>
        <statement7>
    }
Insert Checkpoint
32. Checkpoints
    atomic {
        <statement1>
        <statement2>
        <statement3>
        <statement4>
        <statement5>
        <statement6>
        <checkpoint>
        <statement7>
    }
Reduced wasted work for the atomic block by 40%.
Insert Checkpoint
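The effect of a checkpoint can be sketched with an undo log: on a conflict after the checkpoint, only the writes made since the checkpoint are rolled back, so the work before it is not wasted. This is a toy illustration under that assumption, not the TM system used in the talk; `tx_log`, `tx_write`, `tx_checkpoint` and `tx_rollback_to_checkpoint` are hypothetical names.

```c
#include <stddef.h>

#define LOG_CAPACITY 32

/* One undo record: where a write went and what was there before. */
struct undo_entry {
    int *addr;
    int old_value;
};

struct tx_log {
    struct undo_entry entries[LOG_CAPACITY];
    size_t length;       /* number of logged writes */
    size_t checkpoint;   /* log position of the last checkpoint (0 = block start) */
};

/* Transactional write: log the old value, then update in place.
 * Returns 0 on success, -1 if the log is full. */
static int tx_write(struct tx_log *log, int *addr, int value)
{
    if (log->length == LOG_CAPACITY)
        return -1;
    log->entries[log->length++] = (struct undo_entry){ addr, *addr };
    *addr = value;
    return 0;
}

/* Mark the current point as the place to resume from after a conflict. */
static void tx_checkpoint(struct tx_log *log)
{
    log->checkpoint = log->length;
}

/* Undo only the writes made after the last checkpoint, newest first. */
static void tx_rollback_to_checkpoint(struct tx_log *log)
{
    while (log->length > log->checkpoint) {
        struct undo_entry *e = &log->entries[--log->length];
        *e->addr = e->old_value;
    }
}
```

A partial rollback is only safe if everything read before the checkpoint is still valid; a real system would have to check that before resuming, which this sketch omits.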
33. Conclusion
- Study the programmability aspects of TM
- New debugging principles and approaches for TM applications
- New profiling techniques for TM applications
- Profile-guided optimization approaches for TM applications