Transcript and Presenter's Notes

Title: Optimistic Intra-Transaction Parallelism using Thread Level Speculation


1
Optimistic Intra-Transaction Parallelism using
Thread Level Speculation
  • Chris Colohan (1), Anastassia Ailamaki (1), J. Gregory Steffan (2), and Todd C. Mowry (1,3)
  • (1) Carnegie Mellon University
  • (2) University of Toronto
  • (3) Intel Research Pittsburgh

2
Chip Multiprocessors are Here!
AMD Opteron
IBM Power 5
Intel Yonah
  • 2 cores now, soon will have 4, 8, 16, or 32
  • Multiple threads per core
  • How do we best use them?

3
Multi-Core Enhances Throughput
[Figure: many users issuing transactions to a database server running on a chip multiprocessor]
Cores can run concurrent transactions and improve throughput.
4
Using Multiple Cores
[Figure: the same users and database server, now considering a single transaction]
Can multiple cores improve transaction latency?
5
Parallelizing transactions
DBMS
SELECT cust_info FROM customer UPDATE district
WITH order_id INSERT order_id INTO
new_order foreach(item) GET quantity FROM
stock quantity-- UPDATE stock WITH
quantity INSERT item INTO order_line
  • Intra-query parallelism
  • Used for long-running queries (decision support)
  • Does not work for short queries
  • Short queries dominate in commercial workloads

6
Parallelizing transactions
DBMS
SELECT cust_info FROM customer UPDATE district
WITH order_id INSERT order_id INTO
new_order foreach(item) GET quantity FROM
stock quantity-- UPDATE stock WITH
quantity INSERT item INTO order_line
  • Intra-transaction parallelism
  • Each thread spans multiple queries
  • Hard to add to existing systems!
  • Need to change interface, add latches and locks,
    worry about correctness of parallel execution

7
Parallelizing transactions
DBMS
SELECT cust_info FROM customer UPDATE district
WITH order_id INSERT order_id INTO
new_order foreach(item) GET quantity FROM
stock quantity-- UPDATE stock WITH
quantity INSERT item INTO order_line
  • Intra-transaction parallelism
  • Breaks transaction into threads
  • Hard to add to existing systems!
  • Need to change interface, add latches and locks,
    worry about correctness of parallel execution

Thread Level Speculation (TLS) makes
parallelization easier.
8
Thread Level Speculation (TLS)
[Figure: loads and stores to p and q, executed sequentially vs. in parallel with TLS]
9
Thread Level Speculation (TLS)
  • Use epochs
  • Detect violations
  • Restart to recover
  • Buffer state
  • Oldest epoch: never restarts, no buffering
  • Worst case: sequential
  • Best case: fully parallel

[Figure: epochs 1 and 2 run in parallel; a read-after-write conflict on p causes a violation and epoch 2 restarts]
Data dependences limit performance.
10
A Coordinated Effort
  • Transaction Programmer: choose epoch boundaries
  • DBMS Programmer: remove performance bottlenecks
  • Hardware Developer: add TLS support to the architecture
11
So what's new?
  • Intra-transaction parallelism
  • Without changing the transactions
  • With minor changes to the DBMS
  • Without having to worry about locking
  • Without introducing concurrency bugs
  • With good performance
  • Halve transaction latency on four cores

12
Related Work
  • Optimistic Concurrency Control (Kung82)
  • Sagas (MolinaSalem87)
  • Transaction chopping (Shasha95)

13
Outline
  • Introduction
  • Related work
  • Dividing transactions into epochs
  • Removing bottlenecks in the DBMS
  • Results
  • Conclusions

14
Case Study New Order (TPC-C)
GET cust_info FROM customer UPDATE district WITH
order_id INSERT order_id INTO
new_order foreach(item) GET quantity FROM
stock WHERE i_iditem UPDATE stock
WITH quantity-1 WHERE i_iditem
INSERT item INTO order_line
  • Only dependence is the quantity field
  • Very unlikely to occur (1/100,000)

15
Case Study: New Order (TPC-C)

Original:
    GET cust_info FROM customer
    UPDATE district WITH order_id
    INSERT order_id INTO new_order
    foreach(item)
        GET quantity FROM stock WHERE i_id=item
        UPDATE stock WITH quantity-1 WHERE i_id=item
        INSERT item INTO order_line

With TLS:
    GET cust_info FROM customer
    UPDATE district WITH order_id
    INSERT order_id INTO new_order
    TLS_foreach(item)
        GET quantity FROM stock WHERE i_id=item
        UPDATE stock WITH quantity-1 WHERE i_id=item
        INSERT item INTO order_line
16
Outline
  • Introduction
  • Related work
  • Dividing transactions into epochs
  • Removing bottlenecks in the DBMS
  • Results
  • Conclusions

17
Dependences in DBMS
18
Dependences in DBMS
  • Dependences serialize execution!
  • Example: statistics gathering (pages_pinned++)
  • TLS maintains serial ordering of the increments
  • To remove, use per-CPU counters (see the sketch below)
  • Performance tuning: profile execution, remove the bottleneck dependence, repeat

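The per-CPU counter fix can be pictured with a small sketch in C. This is illustrative only: NUM_CPUS, the cpu_id() helper, and the padding are assumptions, not BerkeleyDB code. Each CPU increments its own cache-line-padded slot, so epochs on different CPUs never write the same location, and the total is computed only when the statistic is actually read.

    /* Hypothetical per-CPU statistics counter; NUM_CPUS and cpu_id() are
       assumed, and the padding keeps slots on separate cache lines. */
    #define NUM_CPUS   4
    #define CACHE_LINE 64

    struct padded_count {
        unsigned long count;
        char pad[CACHE_LINE - sizeof(unsigned long)];
    };

    static struct padded_count pages_pinned[NUM_CPUS];

    extern int cpu_id(void);                 /* assumed: returns 0..NUM_CPUS-1 */

    void inc_pages_pinned(void) {
        pages_pinned[cpu_id()].count++;      /* each epoch writes only its own slot */
    }

    unsigned long read_pages_pinned(void) {
        unsigned long sum = 0;
        for (int i = 0; i < NUM_CPUS; i++)
            sum += pages_pinned[i].count;    /* aggregate only when the value is needed */
        return sum;
    }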
19
Buffer Pool Management
[Figure: a CPU calls get_page(5) and then put_page(5); the page's buffer pool reference count goes from 1 back to 0]
20
Buffer Pool Management
[Figure: two epochs each call get_page(5) and put_page(5); TLS serializes their reference-count updates]
TLS ensures the first epoch gets the page first. Who cares?
  • TLS maintains original load/store order
  • Sometimes this is not needed

21
Buffer Pool Management
  • Escape speculation
  • Invoke operation
  • Store undo function
  • Resume speculation

[Figure: get_page(5) runs while escaped from speculation; a put_page(5) undo operation is recorded in case of a violation]
Isolated: undoing get_page will not affect other transactions.
Undoable: there is an operation (put_page) which returns the system to its initial state.
22
Buffer Pool Management
[Figure: an epoch that calls put_page(5) while speculative; put_page is not undoable]
23
Buffer Pool Management
[Figure: the epoch's put_page(5) is deferred until the end of the epoch]
  • Delay put_page until end of epoch
  • Avoid dependence

24
Removing Bottleneck Dependences
  • We introduce three techniques:
  • Delay operations until non-speculative (a sketch follows this list)
    • Mutex and lock acquire and release
    • Buffer pool, memory, and cursor release
    • Log sequence number assignment
  • Escape speculation
    • Buffer pool, memory, and cursor allocation
  • Traditional parallelization
    • Memory allocation, cursor pool, error checks, false sharing

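As a rough illustration of the first technique (delaying operations until the epoch is non-speculative), the sketch below queues side-effecting calls, such as a buffer pool release or a latch release, and replays them at commit time. The queue and its entry points are assumptions made for illustration; they are not the paper's actual TLS runtime API.

    /* Hypothetical per-epoch queue of deferred operations, replayed in FIFO
       order once the epoch is oldest (non-speculative). __thread gives one
       list per epoch's thread (GCC extension). */
    #include <stdlib.h>

    typedef void (*deferred_fn)(void *arg);

    typedef struct deferred_op {
        deferred_fn         fn;
        void               *arg;
        struct deferred_op *next;
    } deferred_op_t;

    static __thread deferred_op_t *head, *tail;

    int defer_until_commit(deferred_fn fn, void *arg) {
        deferred_op_t *op = malloc(sizeof(*op));
        if (op == NULL)
            return -1;
        op->fn = fn; op->arg = arg; op->next = NULL;
        if (tail) tail->next = op; else head = op;
        tail = op;
        return 0;
    }

    /* Called by the (assumed) TLS runtime when the epoch becomes non-speculative. */
    void run_deferred_ops(void) {
        for (deferred_op_t *op = head; op != NULL; ) {
            deferred_op_t *next = op->next;
            op->fn(op->arg);       /* e.g. put_page(page) or a latch release */
            free(op);
            op = next;
        }
        head = tail = NULL;
    }

With something like this, the speculative part of the epoch never touches the shared structure at all; the deferred releases run exactly once, in program order, after the epoch is homefree.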
25
Outline
  • Introduction
  • Related work
  • Dividing transactions into epochs
  • Removing bottlenecks in the DBMS
  • Results
  • Conclusions

26
Experimental Setup
  • Detailed simulation
  • Superscalar, out-of-order, 128-entry reorder buffer
  • Memory hierarchy modeled in detail
  • TPC-C transactions on BerkeleyDB
  • In-core database
  • Single user
  • Single warehouse
  • Measure an interval of 100 transactions
  • Measuring latency, not throughput

27
Optimizing the DBMS New Order
1.25
26 improvement
1
0.75
Time (normalized)
Other CPUs not helping
0.5
Cant optimize much more
Cache misses increase
0.25
0
Sequential
28
Optimizing the DBMS: New Order
[Chart: normalized execution time relative to the Sequential baseline after optimization]
This process took me 30 days and <1200 lines of code.
29
Other TPC-C Transactions
[Chart: normalized execution time for New Order, Delivery, Stock Level, Payment, and Order Status, broken down into Busy, Cache Miss, Idle CPU, and Failed speculation]
30
Conclusions
  • TLS makes intra-transaction parallelism practical
  • Reasonable changes to transaction, DBMS, and
    hardware
  • Halve transaction latency

31
Needed backup slides (not done yet)
  • 2-processor results
  • Shared caches may change how you want to extract parallelism!
  • Just have lots of transactions: no sharing
  • TLS may have more sharing

32
Any questions?
  • For more information, see
  • www.colohan.com

33
Backup Slides Follow
34
LATCHES
35
Latches
  • Mutual exclusion between transactions
  • Cause violations between epochs
  • Read-test-write cycle → RAW
  • Not needed between epochs
  • TLS already provides mutual exclusion!

36
Latches: Aggressive Acquire
    Acquire; latch_cnt++; work; latch_cnt--
    latch_cnt++; work; (enqueue release)
    latch_cnt++; work; (enqueue release)
    Commit; work; latch_cnt--
    Commit; work; latch_cnt--; Release
37
Latches: Lazy Acquire
    Acquire; work; Release
    (enqueue acquire); work; (enqueue release)
    (enqueue acquire); work; (enqueue release)
    Acquire; Commit; work; Release
    Acquire; Commit; work; Release
38
HARDWARE
39
TLS in Database Systems
  • Large epochs:
  • More dependences → must tolerate them
  • More state → bigger buffers
[Figure: non-database TLS vs. TLS in database systems]
40
Feedback Loop
    for (...) { do_work(); }
41
Violations Feedback
[Figure: a violation on p forces epoch 2 to restart; sequential vs. parallel execution]
42
Eliminating Violations
[Figure: conflicting accesses at addresses 0x0FD8/0xFD20 and 0x0FC0/0xFC18]
43
Tolerating Violations: Sub-epochs
[Figure: a violation only rewinds the epoch to the most recent sub-epoch boundary]
44
Sub-epochs
  • Started periodically by hardware
  • How many?
  • When to start?
  • Hardware implementation
  • Just like epochs
  • Use more epoch contexts
  • No need to check violations between sub-epochs
    within an epoch

[Figure: sub-epoch checkpoints within an epoch limit how far a violation rewinds]
45
Old TLS Design
[Figure: four CPUs, each with a private L1 cache, above a shared L2 and the rest of the memory system]
  • Buffer speculative state in the write-back L1 cache
  • Restart by invalidating speculative lines
  • Detect violations through invalidations
  • Rest of the system only sees committed data
  • Problems:
  • L1 cache not large enough
  • Later epochs only get values on commit
46
New Cache Design
[Figure: four CPUs with L1 caches above two L2 caches and the rest of the memory system]
  • Speculative writes immediately visible to L2 (and later epochs)
  • Restart by invalidating speculative lines
  • Buffer speculative and non-speculative state for all epochs in L2
  • Detect violations at lookup time
  • Invalidation coherence between the L2 caches
47
New Features
[Figure: the same cache hierarchy, highlighting what is new]
  • Speculative state in L1 and L2 cache
  • Cache line replication (versions)
  • Data dependence tracking within the cache
  • Speculative victim cache
48
Scaling
[Chart: normalized execution time as the number of CPUs increases]
49
Evaluating a 4-CPU system
[Chart: normalized execution time when evaluating a 4-CPU system. Bars: Baseline, Sequential, No Sub-epoch, TLS, No Speculation. Annotations: original benchmark run on 1 CPU, parallelized benchmark run on 1 CPU, without sub-epoch support, parallel execution, ignore violations (Amdahl's Law limit)]
50
Sub-epochs How many/How big?
  • Supporting more sub-epochs is better
  • Spacing depends on location of violations
  • Even spacing is good enough

51
Query Execution
  • Actions taken by a query
  • Bring pages into buffer pool
  • Acquire and release latches and locks
  • Allocate/free memory
  • Allocate/free and use cursors
  • Use B-trees
  • Generate log entries

These generate violations.
52
Applying TLS
  1. Parallelize loop
  2. Run benchmark
  3. Remove bottleneck
  4. Go to 2

53
Outline
Transaction Programmer
DBMS Programmer
Hardware Developer
54
Violation Prediction
55
Violation Prediction
  • Predictor problems:
  • Large epochs → many predictions
  • Failed prediction → violation
  • Incorrect prediction → large stall
  • Two predictors required:
  • Last store
  • Dependent load
[Figure: the predictor synchronizes a dependent load with the last store of the earlier epoch]
56
TLS Execution
[Figure: four CPUs with private L1 caches over a shared L2; epochs access p and q, and a violation on p restarts epoch 2]
57
TLS Execution
[Figure: TLS execution, animation step]
58
TLS Execution
[Figure: TLS execution, animation step]
59
TLS Execution
[Figure: TLS execution, animation step]
60
TLS Execution
[Figure: TLS execution, animation step]
61
TLS Execution
[Figure: TLS execution, animation step]
62
Replication
[Figure: a cache line holding changes from two different epochs]
Can't invalidate a line if it contains two epochs' changes.
63
Replication
[Figure: the conflicting line is replicated so each epoch gets its own version]
64
Replication
[Figure: replicated cache lines, one version per epoch]
  • Makes epochs independent
  • Enables sub-epochs

65
Sub-epochs
[Figure: epoch 1 divided into sub-epochs 1a through 1d, each buffering its own speculative version]
  • Uses more epoch contexts
  • Detection/buffering/rewind is free
  • More replication
  • Speculative victim cache

66
get_page() wrapper
    page_t *get_page_wrapper(pageid_t id) {
        static tls_mutex mut;
        page_t *ret;

        tls_escape_speculation();
        check_get_arguments(id);
        tls_acquire_mutex(&mut);
        ret = get_page(id);
        tls_release_mutex(&mut);
        tls_on_violation(put, ret);
        tls_resume_speculation();
        return ret;
    }

← Wraps get_page()
67
get_page() wrapper
    page_t *get_page_wrapper(pageid_t id) {
        static tls_mutex mut;
        page_t *ret;

        tls_escape_speculation();
        check_get_arguments(id);
        tls_acquire_mutex(&mut);
        ret = get_page(id);
        tls_release_mutex(&mut);
        tls_on_violation(put, ret);
        tls_resume_speculation();
        return ret;
    }

← No violations while calling get_page()
68
get_page() wrapper
    page_t *get_page_wrapper(pageid_t id) {
        static tls_mutex mut;
        page_t *ret;

        tls_escape_speculation();
        check_get_arguments(id);
        tls_acquire_mutex(&mut);
        ret = get_page(id);
        tls_release_mutex(&mut);
        tls_on_violation(put, ret);
        tls_resume_speculation();
        return ret;
    }

← May get bad input data from a speculative thread!
69
get_page() wrapper
    page_t *get_page_wrapper(pageid_t id) {
        static tls_mutex mut;
        page_t *ret;

        tls_escape_speculation();
        check_get_arguments(id);
        tls_acquire_mutex(&mut);
        ret = get_page(id);
        tls_release_mutex(&mut);
        tls_on_violation(put, ret);
        tls_resume_speculation();
        return ret;
    }

← Only one epoch per transaction at a time
70
get_page() wrapper
    page_t *get_page_wrapper(pageid_t id) {
        static tls_mutex mut;
        page_t *ret;

        tls_escape_speculation();
        check_get_arguments(id);
        tls_acquire_mutex(&mut);
        ret = get_page(id);
        tls_release_mutex(&mut);
        tls_on_violation(put, ret);
        tls_resume_speculation();
        return ret;
    }

← How to undo get_page()
71
get_page() wrapper
  • Isolated
  • Undoing this operation does not cause cascading
    aborts
  • Undoable
  • Easy way to return system to initial state
  • Can also be used for
  • Cursor management
  • malloc()
    page_t *get_page_wrapper(pageid_t id) {
        static tls_mutex mut;
        page_t *ret;

        tls_escape_speculation();
        check_get_arguments(id);
        tls_acquire_mutex(&mut);
        ret = get_page(id);
        tls_release_mutex(&mut);
        tls_on_violation(put, ret);
        tls_resume_speculation();
        return ret;
    }

72
TPC-C Benchmark
[Figure: TPC-C hierarchy: the Company has W warehouses, each warehouse has 10 districts, and each district has 3k customers]
73
TPC-C Benchmark
[Figure: TPC-C schema with table cardinalities: Warehouse (W), District (W*10), History (W*30k), Customer (W*30k), Stock (W*100k), Item (100k), New Order (W*9k), Order (W*30k), Order Line (W*300k); each order has 5-15 order lines]
74
What is TLS?
    while (cond) {
        x = hash[i];
        ...
        hash[j] = y;
        ...
    }
75
What is TLS?
    while (cond) { x = hash[i]; ... hash[j] = y; ... }

Thread 1: hash[3]  ... hash[10] ...
Thread 2: hash[19] ... hash[21] ...
Thread 3: hash[33] ... hash[30] ...
Thread 4: hash[10] ... hash[25] ...
76
What is TLS?
    while (cond) { x = hash[i]; ... hash[j] = y; ... }

Thread 1: hash[3]  ... hash[10] ...
Thread 2: hash[19] ... hash[21] ...
Thread 3: hash[33] ... hash[30] ...
Thread 4: hash[10] ... hash[25] ...   Violation!
77
What is TLS?
    while (cond) { x = hash[i]; ... hash[j] = y; ... }

Thread 1: hash[3]  ... hash[10] ... attempt_commit()   succeeds
Thread 2: hash[19] ... hash[21] ... attempt_commit()   succeeds
Thread 3: hash[33] ... hash[30] ... attempt_commit()   succeeds
Thread 4: hash[10] ... hash[25] ... attempt_commit()   Violation!
78
What is TLS?
    while (cond) { x = hash[i]; ... hash[j] = y; ... }

Thread 1: hash[3]  ... hash[10] ... attempt_commit()   succeeds
Thread 2: hash[19] ... hash[21] ... attempt_commit()   succeeds
Thread 3: hash[33] ... hash[30] ... attempt_commit()   succeeds
Thread 4: hash[10] ... hash[25] ... attempt_commit()   Violation!
Redo:
Thread 4: hash[10] ... hash[25] ... attempt_commit()   succeeds
79
TLS Hardware Design
  • What's new?
  • Large threads
  • Epochs will communicate
  • Complex control flow
  • Huge legacy code base
  • How does hardware change?
  • Store state in L2 instead of L1
  • Reversible atomic operations
  • Tolerate dependences
  • Aggressive update propagation (implicit
    forwarding)
  • Sub-epochs

80
L1 Cache Line
[Cache line fields: Valid | Tag | LRU | SL | SM | Data]
  • SL bit: the L2 cache knows this line has been speculatively loaded; clear on violation or commit
  • SM bit: this line contains speculative changes; clear on commit; on violation, SM → Invalid
  • Otherwise, just like a normal cache

81
Escaping Speculation
[Cache line fields: Valid | Tag | LRU | SL | SM | Stale | Data]
  • An escaped speculative epoch wants to make a system-visible change!
  • Ignore SM lines while escaped
  • Stale bit: this line may be outdated by speculative work; clear on violation or commit

82
L1 to L2 communication
  • L2 sees all stores (write through)
  • L2 sees first load of an epoch
  • NotifySL message
  • → L2 can track data dependences!

83
L1 Changes Summary
[Cache line fields: Valid | Tag | LRU | SL | SM | Stale | Data]
  • Add three bits to each line: SL, SM, and Stale (see the sketch below)
  • Modify tag match to recognize the bits
  • Add a queue of NotifySL requests

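A small C model of the per-line metadata summarized above may help; the field names and widths are illustrative rather than actual hardware, and only the behavior stated on these slides is captured: SL and Stale are cleared on commit or violation, SM lines are invalidated on violation, and SM lines are ignored while execution has escaped speculation.

    /* Illustrative model of an L1 line with the added SL, SM, and Stale bits. */
    #include <stdbool.h>
    #include <stdint.h>

    #define LINE_SIZE 64

    typedef struct {
        uint64_t tag;
        uint8_t  lru;
        bool     valid;
        bool     sl;      /* speculatively loaded */
        bool     sm;      /* speculatively modified */
        bool     stale;   /* may be outdated by speculative work */
        uint8_t  data[LINE_SIZE];
    } l1_line_t;

    /* Tag match: while escaped from speculation, SM lines are ignored. */
    bool l1_hit(const l1_line_t *line, uint64_t tag, bool escaped) {
        if (!line->valid || line->tag != tag)
            return false;
        return !(escaped && line->sm);
    }

    void l1_on_commit(l1_line_t *line) {
        line->sl = line->sm = line->stale = false;
    }

    void l1_on_violation(l1_line_t *line) {
        if (line->sm)
            line->valid = false;          /* SM -> Invalid */
        line->sl = line->sm = line->stale = false;
    }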
84
L2 Cache Line
[Cache line fields: Valid | Exclusive | Dirty | Tag | LRU | per-CPU SL and SM bits (fine-grained SM) | Data]
  • Cache line can be
  • Modified by one CPU
  • Loaded by multiple CPUs

85
Cache Line Conflicts
  • Three classes of conflict
  • Epoch 2 stores, epoch 1 loads
  • Need old version to load
  • Epoch 1 stores, epoch 2 stores
  • Need to keep changes separate
  • Epoch 1 loads, epoch 2 stores
  • Need to be able to discard line on violation
  • → Need a way of storing multiple conflicting versions in the cache

86
Cache line replication
  • On conflict, replicate line
  • Split line into two copies
  • Divide SM and SL bits at split point
  • Divide directory bits at split point

87
Replication Problems
  • Complicates line lookup
  • Need to find all replicas and select best
  • Best = most recent replica
  • Change management
  • On write, update all later copies
  • Also need to find all more speculative replicas
    to check for violations
  • On commit must get rid of stale lines
  • Invalidation Required Buffer (IRB)

88
Victim Cache
  • How do you deal with a full cache set?
  • Use a victim cache
  • Holds evicted lines without losing SM and SL bits
  • Must be fast → every cache lookup needs to know:
  • Do I have the best replica of this line?
  • Critical path
  • Do I cause a violation?
  • Not on critical path

89
Summary of Hardware Support
  • Sub-epochs
  • Violations hurt less!
  • Shared cache TLS support
  • Faster communication
  • More room to store state
  • RAOs
  • Don't speculate on known operations
  • Reduces amount of speculative state

90
Summary of Hardware Changes
  • Sub-epochs
  • Checkpoint register state
  • Needs replicas in cache
  • Shared cache TLS support
  • Speculative L1
  • Replication in L1 and L2
  • Speculative victim cache
  • Invalidation Required Buffer
  • RAOs
  • Suspend/resume speculation
  • Mutexes
  • Undo list

91
TLS Execution
[Figure: four CPUs with L1 caches over a shared L2 holding speculative versions of p and q; an invalidation reveals the conflict on p and epoch 2 is violated]
92
(No Transcript)
93
(No Transcript)
94
(No Transcript)
95
Problems with Old Cache Design
  • Database epochs are large
  • L1 cache not large enough
  • Sub-epochs add more state
  • L1 cache not associative enough
  • Database epochs communicate
  • L1 cache only communicates committed data

96
Intro Summary
  • TLS makes intra-transaction parallelism easy
  • Divide transaction into epochs
  • Hardware support
  • Detect violations
  • Restart to recover
  • Sub-epochs mitigate penalty
  • Buffer state
  • New process
  • Modify software → avoid violations → improve performance

97
The Many Faces of Ogg
98
The Many Faces of Ogg
99
Removing Bottlenecks
  • Three general techniques
  • Partition data structures
  • malloc
  • Postpone operations until non-speculative
  • Latches and locks, log entries
  • Handle speculation manually
  • Buffer pool

100
Bottlenecks Encountered
  • Buffer pool
  • Latches and locks
  • Malloc/free
  • Cursor queues
  • Error checks
  • False sharing
  • B-tree performance optimization
  • Log entries

101
The Many Faces of Ogg
102
Performance on 4 CPUs
[Chart: execution time on 4 CPUs for the unmodified vs. modified benchmark]
103
Incremental Parallelization
[Chart: performance as bottlenecks are incrementally removed, on 4 CPUs]
104
Scaling
[Chart: scaling of the unmodified vs. modified benchmark with CPU count]
105
Parallelization is Hard
[Figure: performance improvement vs. programmer effort; hand parallelization requires repeated tuning, while a parallelizing compiler requires little effort]
106
Case Study New Order (TPC-C)
  • Begin transaction
  • End transaction

107
Case Study New Order (TPC-C)
  • Begin transaction
  • Read customer info
  • Read & increment order
  • Create new order
  • End transaction

108
Case Study New Order (TPC-C)
  • Begin transaction
  • Read customer info
  • Read & increment order
  • Create new order
  • For each item in order
  • Get item info
  • Decrement count in stock
  • Record order info
  • End transaction

109
Case Study New Order (TPC-C)
  • Begin transaction
  • Read customer info
  • Read & increment order
  • Create new order
  • For each item in order
  • Get item info
  • Decrement count in stock
  • Record order info
  • End transaction

80% of transaction execution time
110
Case Study New Order (TPC-C)
  • Begin transaction
  • Read customer info
  • Read & increment order
  • Create new order
  • For each item in order
  • Get item info
  • Decrement count in stock
  • Record order info
  • End transaction

80% of transaction execution time
111
The Many Faces of Ogg
112
Step 2: Changing the Software
113
No problem!
  • Loop is easy to parallelize using TLS!
  • Not really
  • Calls into DBMS invoke complex operations
  • Ogg needs to do some work
  • Many operations in DBMS are parallel
  • Not written with TLS in mind!

114
Resource Management
  • Mutexes
  • acquired and released
  • Locks
  • locked and unlocked
  • Cursors
  • pushed and popped from free stack
  • Memory
  • allocated and freed
  • Buffer pool entries
  • Acquired and released

115
Mutexes: Deadlock?
  • Problem:
  • Re-ordered acquire/release operations!
  • Possibly introduced deadlock?
  • Solutions:
  • Avoidance: static acquire order (see the sketch below)
  • Recovery: detect deadlock and violate

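For the avoidance option, one standard way to impose a static acquire order is to order mutexes by address; the generic sketch below is not taken from the paper or from BerkeleyDB.

    /* Acquire two mutexes in a fixed global order (by address) so that
       re-ordered critical sections cannot deadlock against each other. */
    #include <pthread.h>
    #include <stdint.h>

    void acquire_pair(pthread_mutex_t *m1, pthread_mutex_t *m2) {
        pthread_mutex_t *lo = ((uintptr_t)m1 < (uintptr_t)m2) ? m1 : m2;
        pthread_mutex_t *hi = ((uintptr_t)m1 < (uintptr_t)m2) ? m2 : m1;
        pthread_mutex_lock(lo);
        if (hi != lo)
            pthread_mutex_lock(hi);
    }

    void release_pair(pthread_mutex_t *m1, pthread_mutex_t *m2) {
        pthread_mutex_t *lo = ((uintptr_t)m1 < (uintptr_t)m2) ? m1 : m2;
        pthread_mutex_t *hi = ((uintptr_t)m1 < (uintptr_t)m2) ? m2 : m1;
        if (hi != lo)
            pthread_mutex_unlock(hi);
        pthread_mutex_unlock(lo);
    }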
116
Locks
  • Like mutexes, but
  • Allows multiple readers
  • No memory overhead when not held
  • Often held for much longer
  • Treat similarly to mutexes

117
Cursors
  • Used for traversing B-trees
  • Pre-allocated, kept in pools

118
Maintaining Cursor Pool
head
Get
Use
Release
119
Maintaining Cursor Pool
head
Get
Use
Release
120
Maintaining Cursor Pool
head
Get
Use
Release
121
Maintaining Cursor Pool
Violation!
head
Get
Get
Use
Use
Release
Release
122
Parallelizing Cursor Pool
  • Use per-CPU pools
  • Modify code each CPU gets its own pool
  • No sharing → no violations! (see the sketch below)
  • Requires cpuid() instruction

[Figure: per-CPU cursor pools; each epoch gets, uses, and releases cursors from its own pool]
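A minimal sketch of the per-CPU pool idea follows, assuming a cpu_id() helper (standing in for the cpuid() instruction mentioned above) and a simplified cursor type; BerkeleyDB's real cursor pooling is more involved.

    /* Hypothetical per-CPU cursor free lists: no shared list head, so no
       inter-epoch dependence on it. */
    #include <stddef.h>

    #define NUM_CPUS 4

    typedef struct cursor {
        struct cursor *next;
        /* ... b-tree traversal state ... */
    } cursor_t;

    static cursor_t *pool_head[NUM_CPUS];   /* one free list per CPU */

    extern int cpu_id(void);                /* assumed: returns 0..NUM_CPUS-1 */

    cursor_t *cursor_get(void) {
        cursor_t **head = &pool_head[cpu_id()];
        cursor_t *c = *head;
        if (c != NULL)
            *head = c->next;                /* pop from this CPU's own pool */
        return c;                           /* NULL means the caller must allocate */
    }

    void cursor_release(cursor_t *c) {
        cursor_t **head = &pool_head[cpu_id()];
        c->next = *head;                    /* push back onto this CPU's pool */
        *head = c;
    }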
123
Memory Allocation
  • Problem
  • malloc() metadata causes dependences
  • Solutions
  • Per-CPU memory pools (see the sketch below)
  • Parallelized free list

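In the same spirit, a per-CPU pool for fixed-size allocations might look like the sketch below. This is an assumption-laden illustration (fixed block size, the same cpu_id() helper, fallback to malloc), not the parallelized allocator that was actually used.

    /* Hypothetical per-CPU free lists for fixed-size blocks; the shared
       malloc metadata is only touched when a CPU's list runs dry. */
    #include <stdlib.h>

    #define NUM_CPUS   4
    #define BLOCK_SIZE 128

    typedef union block {
        union block *next;
        char         bytes[BLOCK_SIZE];
    } block_t;

    static block_t *free_list[NUM_CPUS];

    extern int cpu_id(void);                /* assumed: returns 0..NUM_CPUS-1 */

    void *pool_alloc(void) {
        block_t **head = &free_list[cpu_id()];
        block_t *b = *head;
        if (b == NULL)
            return malloc(BLOCK_SIZE);      /* fall back to the real allocator */
        *head = b->next;
        return b;
    }

    void pool_free(void *p) {
        block_t *b = p;
        block_t **head = &free_list[cpu_id()];
        b->next = *head;                    /* freed blocks return to this CPU's list */
        *head = b;
    }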
124
The Log
  • Append records to the global log
  • Appending causes a dependence: the global log sequence number (LSN)
  • Can't parallelize
  • Instead: generate log records in buffers
  • Assign LSNs when homefree (see the sketch below)

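The buffered-log idea can be sketched as follows, under the assumptions that LSNs are byte offsets into the global log and that a log_write() routine exists to publish a buffer; none of this is BerkeleyDB's actual logging code.

    /* Hypothetical epoch-local log buffer: speculative code appends here,
       and LSNs are only taken from the global counter once the epoch is
       homefree (oldest, non-speculative). */
    #include <stdint.h>
    #include <string.h>

    #define EPOCH_LOG_BYTES 4096

    typedef struct {
        uint8_t buf[EPOCH_LOG_BYTES];
        size_t  used;
    } epoch_log_t;

    static uint64_t global_lsn;            /* the single, serializing counter */

    /* Speculative path: no global state touched, so no dependence. */
    int epoch_log_append(epoch_log_t *log, const void *rec, size_t len) {
        if (log->used + len > EPOCH_LOG_BYTES)
            return -1;                     /* buffer full; a real system would grow it */
        memcpy(log->buf + log->used, rec, len);
        log->used += len;
        return 0;
    }

    extern void log_write(uint64_t lsn, const void *buf, size_t len);  /* assumed */

    /* Called once the epoch is homefree: assign LSNs and publish the records. */
    void epoch_log_publish(epoch_log_t *log) {
        uint64_t lsn = global_lsn;         /* safe: only the oldest epoch runs this */
        global_lsn += log->used;
        log_write(lsn, log->buf, log->used);
        log->used = 0;
    }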
125
B-Trees
  • Leaf pages contain free space counts
  • Inserts of random records are o.k.
  • Inserting adjacent records: dependence on decrementing the count (see the sketch below)
  • Page splits: infrequent

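A toy illustration of why adjacent inserts conflict: every insert into the same leaf decrements one shared free-space count. The page layout below is invented for illustration and is not BerkeleyDB's.

    /* Toy leaf page: sequential keys land on the same page, so every
       epoch's insert stores to the same free_count field. */
    #include <stdbool.h>

    #define LEAF_CAPACITY 64

    typedef struct {
        int free_count;                /* every insert into this leaf writes this */
        int keys[LEAF_CAPACITY];
        int nkeys;
    } leaf_page_t;

    bool leaf_insert(leaf_page_t *p, int key) {
        if (p->free_count == 0)
            return false;              /* would trigger an (infrequent) page split */
        p->keys[p->nkeys++] = key;     /* adjacent keys hit the same page ... */
        p->free_count--;               /* ... so this store collides across epochs */
        return true;
    }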
126
Other Dependences
  • Statistics gathering
  • Error checks
  • False sharing

127
Related Work
  • Lots of work in TLS
  • Multiscalar (Wisconsin)
  • Hydra (Stanford)
  • IACOMA (Illinois)
  • RAW (MIT)
  • Hand parallelizing using TLS
  • Manohar Prabhu and Kunle Olukotun (PPoPP03)

128
Any questions?
129
Why is this a problem?
  • B-tree insertion into ORDERLINE table
  • Key is ol_n
  • DBMS does not know that keys will be sequential
  • Each insert usually updates the same btree page

130
Sequential Btree Inserts
[Figure: sequential inserts 1-4 all land in the same B-tree leaf page, each consuming one of its free slots]
131
Improvement: SM Versioning
  • Blow away all SM lines on a violation?
  • May be bad!
  • Instead
  • On primary violation
  • Only invalidate locally modified SM lines
  • On secondary violation
  • Invalidate all SM lines
  • Needs one more bit: LocalSM
  • May decrease number of misses to L2 on violation

132
Outline
  • Store state in L2 instead of L1
  • Reversible atomic operations
  • Tolerate dependences
  • Aggressive update propagation (implicit
    forwarding)
  • Sub-epochs
  • Results and analysis

133
Outline
  • Store state in L2 instead of L1
  • Reversible atomic operations
  • Tolerate dependences
  • Aggressive update propagation (implicit
    forwarding)
  • Sub-epochs
  • Results and analysis

134
Tolerating dependences
  • Aggressive update propagation
  • Get for free!
  • Sub-epochs
  • Periodically checkpoint epochs
  • Every N instructions?
  • Picking N may be interesting
  • Perhaps checkpoints could be set before the
    location of previous violations?

135
Outline
  • Store state in L2 instead of L1
  • Reversible atomic operations
  • Tolerate dependences
  • Aggressive update propagation (implicit
    forwarding)
  • Sub-epochs
  • Results and analysis

136
Why not faster?
  • Possible reasons
  • Idle cpus
  • RAO mutexes
  • Violations
  • Cache effects
  • Data dependences

137
Why not faster?
  • Possible reasons
  • Idle cpus
  • 9 epochs/region average
  • Two bundles of four and one of one
  • ¼ of cpu cycles wasted!
  • RAO mutexes
  • Violations
  • Cache effects
  • Data dependences

138
Why not faster?
  • Possible reasons
  • Idle cpus
  • RAO mutexes
  • Not implemented yet
  • Ooops!
  • Violations
  • Cache effects
  • Data dependences

139
Why not faster?
  • Possible reasons
  • Idle cpus
  • RAO mutexes
  • Violations
  • 21/969 epochs violated
  • Distance 1 magic synchronized
  • 2.2Mcycles (over 4 cpus)
  • About 1.5%
  • Cache effects
  • Data dependences

140
Why not faster?
  • Possible reasons
  • Idle cpus
  • RAO mutexes
  • Violations
  • Cache effects
  • Deserves its own slide.
  • Data dependences

141
Cache effects of speculation
  • Only 20% of references are speculative!
  • Speculative references have a small impact on the non-speculative hit rate (<1%)
  • Speculative refs miss a lot in L1
  • 9-15% for reads, 2-6% for writes
  • L2 saw a HUGE increase in traffic
  • 152k refs to 3474k refs
  • Speculative and non-speculative lines are thrashing from the L1s

142
Why not faster?
  • Possible reasons
  • Idle cpus
  • RAO mutexes
  • Violations
  • Cache effects
  • Data dependences
  • Oh yeah!
  • Btree item count
  • Split up btree insert?
  • alloc and write
  • Do alloc as RAO
  • Needs more thought

143
L2 Cache Line
[Figure: L2 cache line fields (Valid, Exclusive, Dirty, Tag, LRU, Data) with fine-grained per-CPU SL and SM bits, shown for two cache sets]
144
Why are you here?
  • Want faster database systems
  • Have funky new hardware: Thread Level Speculation (TLS)
  • How can we apply TLS to database systems?
  • Side question
  • Is this a VLDB or an ASPLOS talk?

145
How?
  • Divide transaction into TLS-threads
  • Run TLS-threads in parallel while maintaining sequential semantics
  • Profit!

146
Why parallelize transactions?
  • Decrease transaction latency
  • Increase concurrency while avoiding concurrency
    control bottleneck
  • A.k.a. use more CPUs, same # of xactions
  • The obvious
  • Database performance matters

147
Shopping List
  • What do we need? (research scope)
  • Cheap hardware
  • Thread Level Speculation (TLS)
  • Minor changes allowed.
  • Important database application
  • TPC-C
  • Almost no changes allowed!
  • Modular database system
  • BerkeleyDB
  • Some changes allowed.

148
Outline
  • TLS Hardware
  • The Benchmark (TPC-C)
  • Changing the database system
  • Results
  • Conclusions

149
Outline
  • TLS Hardware
  • The Benchmark (TPC-C)
  • Changing the database system
  • Results
  • Conclusions

150
What's new?
  • Database operations are
  • Large
  • Complex
  • Large TLS-threads
  • Lots of dependences
  • Difficult to analyze
  • Want: programmer optimization effort → faster program

151
Hardware changes summary
  • Must tolerate dependences
  • Prediction?
  • Implicit forwarding?
  • May need larger caches
  • May need larger associativity

152
Outline
  • TLS Hardware
  • The Benchmark (TPC-C)
  • Changing the database system
  • Results
  • Conclusions

153
Parallelization Strategy
  1. Pick a benchmark
  2. Parallelize a loop
  3. Analyze dependences
  4. Optimize away dependences
  5. Evaluate performance
  6. If not satisfied, goto 3

154
Outline
  • TLS Hardware
  • The Benchmark (TPC-C)
  • Changing the database system
  • Resource management
  • The log
  • B-trees
  • False sharing
  • Results
  • Conclusions

155
Outline
  • TLS Hardware
  • The Benchmark (TPC-C)
  • Changing the database system
  • Results
  • Conclusions

156
Results
  • Viola simulator
  • Single CPI
  • Perfect violation prediction
  • No memory system
  • 4 cpus
  • Exhaustive dependence tracking
  • Currently working on an out-of-order superscalar
    simulation (cello)
  • 10 transaction warm-up
  • Measure 100 transactions

157
Outline
  • TLS Hardware
  • The Benchmark (TPC-C)
  • Changing the database system
  • Results
  • Conclusions

158
Conclusions
  • TLS can improve transaction latency
  • Violation predictors important
  • Iff dependences must be tolerated
  • TLS makes hand parallelizing easier

159
Improving Database Performance
  • How to improve performance
  • Parallelize transaction
  • Increase number of concurrent transactions
  • Both of these require independence of database
    operations!

160
Case Study New Order (TPC-C)
  • Begin transaction
  • End transaction

161
Case Study New Order (TPC-C)
  • Begin transaction
  • Read customer info (customer, warehouse)
  • Read & increment order (district)
  • Create new order (orders, neworder)
  • End transaction

162
Case Study New Order (TPC-C)
  • Begin transaction
  • Read customer info (customer, warehouse)
  • Read & increment order (district)
  • Create new order (orders, neworder)
  • For each item in order
  • Get item info (item)
  • Decrement count in stock (stock)
  • Record order info (orderline)
  • End transaction

163
Case Study New Order (TPC-C)
  • Begin transaction
  • Read customer info (customer, warehouse)
  • Read & increment order (district)
  • Create new order (orders, neworder)
  • For each item in order
  • Get item info (item)
  • Decrement count in stock (stock)
  • Record order info (orderline)
  • End transaction

Parallelize this loop
164
Case Study New Order (TPC-C)
  • Begin transaction
  • Read customer info (customer, warehouse)
  • Read & increment order (district)
  • Create new order (orders, neworder)
  • For each item in order
  • Get item info (item)
  • Decrement count in stock (stock)
  • Record order info (orderline)
  • End transaction

Parallelize this loop
165
Implementing on a Real DB
  • Using BerkeleyDB
  • Table = Database
  • Give the database any arbitrary key
  • → it will return arbitrary data (bytes)
  • Use structs for keys and rows (see the sketch below)
  • Database provides ACID through
  • Transactions
  • Locking (page level)
  • Storage management
  • Provides indexing using b-trees

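A rough sketch of that access pattern using the classic BerkeleyDB C API is shown below: fixed-size structs are passed as raw bytes in DBTs. Error handling, transactions, and locking are omitted, the struct fields are invented, and the DB->open call shown is the later signature that takes a transaction argument.

    /* Illustrative use of BerkeleyDB's C API: structs as keys and rows. */
    #include <string.h>
    #include <db.h>

    struct stock_key { int s_i_id; int s_w_id; };           /* invented fields */
    struct stock_row { int s_quantity; /* ... other columns ... */ };

    int stock_open(DB **dbpp) {
        int ret = db_create(dbpp, NULL, 0);
        if (ret != 0)
            return ret;
        /* A b-tree keyed on the raw struct bytes provides the indexing. */
        return (*dbpp)->open(*dbpp, NULL, "stock.db", NULL, DB_BTREE, DB_CREATE, 0664);
    }

    int stock_put(DB *dbp, struct stock_key *k, struct stock_row *r) {
        DBT key, data;
        memset(&key, 0, sizeof(key));
        memset(&data, 0, sizeof(data));
        key.data  = k;  key.size  = sizeof(*k);    /* arbitrary bytes in ... */
        data.data = r;  data.size = sizeof(*r);    /* ... arbitrary bytes stored */
        return dbp->put(dbp, NULL, &key, &data, 0);
    }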
166
Parallelizing a Transaction
  • For each item in order
  • Get item info (item)
  • Decrement count in stock (stock)
  • Record order info (order line)

167
Parallelizing a Transaction
  • For each item in order
  • Get item info (item)
  • Decrement count in stock (stock)
  • Record order info (order line)
  • Get cursor from pool
  • Use cursor to traverse b-tree
  • Find row, lock page for row
  • Release cursor to pool

168
Maintaining Cursor Pool
[Figure: cursor free list; an epoch gets a cursor from the head, uses it, and releases it]
169
Maintaining Cursor Pool
[Figure: cursor pool operations, animation step]
170
Maintaining Cursor Pool
[Figure: cursor pool operations, animation step]
171
Maintaining Cursor Pool
[Figure: two epochs get and release cursors from the shared pool; the conflicting updates to the list head cause a violation]
172
Parallelizing Cursor Pool 1
  • Use per-CPU pools
  • Modify code each CPU gets its own pool
  • No sharing → no violations!
  • Requires cpuid() instruction

173
Parallelizing Cursor Pool 2
  • Dequeue and enqueue are atomic and unordered
  • Delay the enqueue until the end of the thread
  • Forces separate pools
  • Avoids modification of the data structure

[Figure: each epoch dequeues a cursor, uses it, and defers the enqueue until the end of the thread]
174
Parallelizing Cursor Pool 3
  • Atomic, unordered dequeue and enqueue
  • Cursor struct is TLS unordered
  • Struct defined as a byte range in memory

[Figure: all epochs get, use, and release cursors through TLS-unordered accesses to the shared pool]
175
Parallelizing Cursor Pool 4
  • Mutex-protect the dequeue and enqueue; declare the pointer to the cursor struct to be TLS unordered
  • Any access through the pointer does not have TLS applied
  • The pointer is tainted; any copies of it keep this property

176
Problems with 3 & 4
  • What exactly is the boundary of a structure?
  • How do you express the concept of object in a
    loosely-typed language like C?
  • A byte range or a pointer is only an
    approximation.
  • Dynamically allocated sub-components?

177
Mutexes in a TLS world
  • Two types of threads
  • real threads
  • TLS threads
  • Two types of mutexes
  • Inter-real-thread
  • Inter-TLS-thread

178
Inter-real-thread Mutexes
  • Acquire: get the mutex on behalf of all TLS threads
  • Release: release for the current TLS thread only
  • May still be held by another TLS thread! (see the sketch below)

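One way to picture this (purely illustrative; the paper does not give this code) is a per-real-thread wrapper that counts how many of its TLS threads currently hold the mutex, and only touches the underlying shared lock word on the first acquire and the last release.

    /* Sketch: per-real-thread wrapper around a lock word shared between
       real threads. TLS itself orders the wrapper's own updates, so the
       'holders' count needs no extra protection here. */
    typedef struct {
        volatile int *shared_lock;   /* lock word shared between real threads */
        int           holders;       /* this real thread's TLS threads holding it */
    } rt_mutex_t;

    void rt_mutex_acquire(rt_mutex_t *m) {
        if (m->holders == 0)
            while (__sync_lock_test_and_set(m->shared_lock, 1))
                ;                    /* spin until the other real thread releases */
        m->holders++;
    }

    void rt_mutex_release(rt_mutex_t *m) {
        if (--m->holders == 0)
            __sync_lock_release(m->shared_lock);
        /* otherwise the mutex is still held by another TLS thread */
    }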
179
Inter-TLS-thread Mutexes
  • Should never interact between two real threads
  • Implies no TLS ordering between TLS threads while
    mutex is held
  • But what to do on a violation?
  • Can't just throw away changes to memory
  • Must undo operations performed in critical section

180
Parallelizing Databases using TLS
  • Split transactions into threads
  • Threads created are large
  • 60k instructions
  • 16kB of speculative state
  • More dependences between threads
  • How do we design a machine which can handle these
    large threads?

181
The Old Way
[Figure: eight processors, each with a private L1 cache holding speculative state; the shared L2/L3 and memory system hold only committed state]
182
The Old Way
  • Advantages
  • Each epoch has its own L1 cache
  • Epoch state does not intermix
  • Disadvantages
  • L1 cache is too small!
  • Full cache → dead meat
  • No shared speculative memory

183
The New Way
  • L2 cache is huge!
  • State of the art in caches, Power5
  • 1.92MB 10-way L2
  • 32kB 4-way L1
  • Shared speculative memory for free
  • Keeps TLS logic off of the critical path

184
TLS Shared L2 Design
  • L1: write-through, write-no-allocate [CullerSinghGupta99]
  • Easy to understand and reason about
  • Writes visible to L2 → simplifies shared speculative memory
  • L2 cache: shared cache architecture with replication
  • Rest of memory: distributed TLS coherence

185
TLS Shared L2 Design
  • Explain from the top down

[Figure: eight processors with L1 caches (cached speculative state), two L2 caches (real speculative state), an L3, and the memory system]
186
TLS Shared L2 Design
[Figure: eight processors with L1 caches (cached speculative state), two L2 caches (real speculative state), an L3, and the memory system]
187
Part II: Dealing with dependences
188
Predictor Design
  • How do you design a predictor that
  • Identifies violating loads
  • Identifies the last store that causes them
  • Only triggers when they cause a problem
  • Has very very high accuracy
  • ???

189
Sub-epoch design
  • Like checkpointing
  • Leave holes in epoch space
  • Every 5k instructions start a new epoch
  • Uses more cache to buffer changes
  • More strain on associativity/victim cache
  • Uses more epoch contexts

190
Summary
  • Supporting large epochs needs
  • Buffer state in L2 instead of L1
  • Shared speculative memory
  • Replication
  • Victim cache
  • Sub-epochs

191
Any questions?