Title: Building a Database on S3
- Brantner, Florescu, Graf, Kossmann, Kraska
- SIGMOD 2008
Introduction
- The next wave on the Web is to provide services
- Make it easy for everyone to provide services, not just the Googles and Amazons
- Technical difficulties:
- 24x7 service
- Need data centers around the world
- Must administer servers and any DBs
- Success can kill (a sudden surge in load can overwhelm the service)
- These are the reasons for utility computing (the cloud)
- Goals of utility computing:
- Storage, CPU, and network bandwidth as a commodity at low unit cost
- Scalability is not a problem
- Full availability at any time; clients are never blocked
- Clients can fail at any time
- Constant response times for reads and writes (R/W)
- Pay by use
- The most prominent utility service is S3
- Part of Amazon Web Services (AWS)
- S3, SQS, EC2, SimpleDB
- S3 is Amazon's Simple Storage Service
- Infinite scalability, high availability, low cost
- Currently used for multi-media documents
- i.e., for data that is rarely updated
- SmugMug is implemented on top of S3
- S3 popular as a backup
- Products to backup data from MySQL to S3
- But will S3 work for other kinds of data?
- Disadvantages of S3
- Slow compared to a local disk drive
- Sacrifices consistency: it takes an undetermined amount of time for an update to an object to become visible
- Updates are not necessarily applied in the same order as they were initiated
- Eventual consistency is the only guarantee
- Can Web-based DB applications be implemented on top of utility services (S3)?
- What if S3 is used for a general-purpose DB?
- Small objects, frequent updates
- Present R/W commit protocols
- Study cost, performance, consistency
- Goal: preserve the scalability and availability of distributed systems, plus some ACID properties
- Can only maximize the level of consistency
- Will not try to support full ACID properties
- Shows how small, frequently updated objects can be implemented
- Shows how a B-tree can be implemented
- Protocols for different levels of consistency
- Performance results with TPC-W benchmarks
S3
- S3: Simple Storage Service
- Conceptually, an infinite store for objects from 1 B to 5 GB in size
- An object is a byte container identified by a URI
- Objects can be read/updated with a SOAP or REST-based interface
- Get(uri) returns the object
- Put(uri, bytestream) writes a new version
- Get-if-modified-since(uri, TS) gets the new version only if the object has changed since timestamp TS
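A minimal sketch of this interface, simulated in memory for illustration; the class and method names are assumptions, not Amazon's actual SDK:

```python
import time

class S3Sim:
    """In-memory stand-in for the three S3 calls listed above (illustrative only)."""
    def __init__(self):
        self._store = {}          # uri -> (bytes, last_modified)

    def put(self, uri, bytestream):
        # Writes a new version of the object identified by the URI.
        self._store[uri] = (bytes(bytestream), time.time())

    def get(self, uri):
        # Returns the byte container identified by the URI.
        return self._store[uri][0]

    def get_if_modified_since(self, uri, ts):
        # Returns the object only if it changed after timestamp ts.
        data, last_modified = self._store[uri]
        return data if last_modified > ts else None

s3 = S3Sim()
s3.put("bucket/page-1", b"hello")
assert s3.get("bucket/page-1") == b"hello"
assert s3.get_if_modified_since("bucket/page-1", time.time()) is None
```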
- In S3, each object is associated with a bucket
- The user specifies the bucket for a new object and can scan through the objects in a bucket
- Buckets or individual objects can be used as the unit of security
- S3 is not free
- $0.15 per GB per month to store data
- A 160 GB disk drive costs about $70
- Amortized over 2 years, that is about $0.02 per GB and month (power not included)
- So S3 is in the same ballpark as disk drives
- Using S3 as a backup is a good deal
- But:
- There is a cost for R/W access
- $0.01 per 10K get requests and per 1K put requests
- Many users operate their own servers to cache data
- Latency is also a problem
- Reading takes about 100 ms
- 2 to 3x longer than from a local disk
- Writing takes about 3x as long as reading
- Throughput is superior
- Bandwidth is acceptable only if data is read in large chunks (~100 KB)
- So small objects must be clustered into pages
- Implementation details of S3 are not published
- S3 seems to replicate all data at several data centers
- Replicas can be read and written at any time
- Updates are propagated asynchronously
- If a data center fails, another center is used
- Last update wins
- This guarantees full R/W availability, which is crucial to Web applications
SQS
- SQS: Simple Queue Service
- Allows users to manage an infinite number of queues with infinite capacity
- Each queue is referenced by a URI and supports sending/receiving messages via an HTTP or REST-based interface
- Message size is limited to 8 KB for HTTP
- Supported operations:
- createQueue, send a message to a queue, receive a number of messages from the top of a queue, delete a message from a queue, grant another user permission to send/receive messages to/from a queue
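An in-memory sketch of these operations; names are illustrative, and the real service is accessed over HTTP/REST and returns messages only on a best-effort basis, as the next slide notes:

```python
from collections import deque
import itertools

class SQSSim:
    """In-memory stand-in for the SQS operations listed above (illustrative only)."""
    def __init__(self):
        self._queues = {}
        self._ids = itertools.count()

    def create_queue(self, uri):
        self._queues.setdefault(uri, deque())

    def send(self, uri, body):
        msg_id = next(self._ids)
        self._queues[uri].append((msg_id, body))
        return msg_id

    def receive(self, uri, n):
        # Returns up to n messages from the top of the queue (best-effort FIFO);
        # the real service may return far fewer, as noted on the next slide.
        return list(itertools.islice(self._queues[uri], n))

    def delete(self, uri, msg_id):
        self._queues[uri] = deque(m for m in self._queues[uri] if m[0] != msg_id)

sqs = SQSSim()
sqs.create_queue("queue/pu-42")
mid = sqs.send("queue/pu-42", b"log-record")
print(sqs.receive("queue/pu-42", 10))
sqs.delete("queue/pu-42", mid)
```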
- Cost of SQS: $0.01 to send 1K messages
- Round-trip times matter:
- Each call to SQS returns a result or an ACK
- Round-trip time is measured as wallclock time
- Implementation details not published
- Messages of a queue seem to be stored in a distributed and replicated way
- Clients can initiate requests at any time and are never blocked
- Messages are returned in FIFO order only on a best-effort basis
- SQS returns only about every 10th relevant message
- E.g., if a queue has 200 messages and you ask for the top 100, you get about 20
EC2
- EC2: Elastic Compute Cloud
- Allows renting machines (CPU + disk) for a specified period of time
- The client gets a virtual machine hosted on an Amazon server
- $0.10 per hour regardless of how heavily the machine is used
- All requests from EC2 to S3 and SQS are free
- If all of these services are used together, the computation is moved to the data
Using S3 as a disk
- Client-server architecture
- Similar to distributed shared-disk DB systems
- Clients retrieve pages from S3 based on their URIs, buffer them locally, update them, and write them back
- A record is a bytestream of variable size (constrained by the page size)
- Records can be relational tuples, XML elements/documents, or blobs
- Focus on:
- Page manager: coordinates R/W and buffers pages
- Record manager: record-oriented interface, organizes records on pages, free-space management
- The page manager, record manager, etc. could be executed on EC2
- Or the whole client stack could be installed on laptops or mobile phones to implement a Web 2.0 application (this is assumed here)
Record Manager
- The record manager manages records (tuples)
- A record is associated with a collection (table)
- A record is composed of a key and data
- A record is stored in one page; pages are stored as single S3 objects
- A table is implemented as a bucket
- A table is identified by a URI
- Operations: create a new record, read a record by key, update by key, delete by key, and scan a collection by its URI (sketched below)
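The record-manager interface sketched as an abstract class; the method names are assumptions based on the operations listed above, not the paper's actual API:

```python
from abc import ABC, abstractmethod

class RecordManager(ABC):
    """Sketch of the record-manager interface described above (names illustrative)."""

    @abstractmethod
    def create(self, table_uri, key, data):
        """Insert a new (key, data) record into the collection identified by table_uri."""

    @abstractmethod
    def read(self, table_uri, key):
        """Return the data of the record with the given key."""

    @abstractmethod
    def update(self, table_uri, key, data):
        """Replace the data of the record with the given key."""

    @abstractmethod
    def delete(self, table_uri, key):
        """Remove the record with the given key."""

    @abstractmethod
    def scan(self, table_uri):
        """Iterate over all records of the collection (bucket)."""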
Page manager
- Implements a buffer pool for S3 pages
- Supports reading pages, updating them, marking them as updated, and creating new pages
- Implements commit and abort
- Assumes the write set fits into the client's main memory or secondary storage
- Commit must propagate the changes to S3
- On abort, the client's buffer pool is simply discarded
- No pages are evicted from the buffer pool as part of a commit; an up-to-date version is fetched later if necessary
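A minimal sketch of the page manager's buffer pool, assuming a store object with get(uri)/put(uri, data) calls like the S3 sketch above. It shows the naive variant in which commit writes dirty pages straight back to S3; the refined protocols below replace that step with log records sent to SQS, and TTL-based cache refreshing is omitted:

```python
class PageManager:
    """Sketch of the client-side buffer pool (names and store interface assumed)."""

    def __init__(self, store):
        self.store = store
        self.pool = {}      # uri -> page bytes cached on the client
        self.dirty = set()  # URIs of pages updated since the last commit

    def read(self, uri):
        if uri not in self.pool:                  # fetch from S3 on a miss
            self.pool[uri] = self.store.get(uri)
        return self.pool[uri]

    def update(self, uri, data):
        self.pool[uri] = data
        self.dirty.add(uri)                       # mark as updated, do not write yet

    def commit(self):
        # Naive commit: propagate all updated pages; nothing is evicted from the pool.
        for uri in self.dirty:
            self.store.put(uri, self.pool[uri])
        self.dirty.clear()

    def abort(self):
        # Discard the client's buffered changes.
        for uri in self.dirty:
            self.pool.pop(uri, None)
        self.dirty.clear()
```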
B-tree indexes
- Adopt existing DB technology where possible
- Root and intermediate nodes are stored as pages holding (key, URI of next-level node) entries
- Leaf pages of the primary index hold (key, payload data)
- So the data is stored in the leaves of the B-tree (index-organized table, IOT)
- Leaf pages of a secondary index hold (search key, record key)
- A secondary-index lookup retrieves the keys of matching records, then goes to the primary index to retrieve the records with their payload data
- Nodes at each level are chained
- The root is always at the same URI (even when the node is split)
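A sketch of a point lookup over B-tree pages stored as S3 objects. The node layout (a dict with a leaf flag and sorted (key, value) entries), the store object, and the omitted serialization are all assumptions made for illustration:

```python
def btree_lookup(store, root_uri, search_key):
    """Follow (key, child URI) entries from the root to a leaf, then scan the leaf.
    store.get(uri) is assumed to return the deserialized node dict."""
    node = store.get(root_uri)              # the root is always at the same URI
    while not node["leaf"]:
        child_uri = None
        for key, uri in node["entries"]:    # entries sorted by key; key is the
            if search_key <= key:           # largest key reachable via that child
                child_uri = uri
                break
        if child_uri is None:               # larger than all separators: rightmost child
            child_uri = node["entries"][-1][1]
        node = store.get(child_uri)
    for key, payload in node["entries"]:    # leaf of a primary index: (key, payload)
        if key == search_key:
            return payload
    return None
```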
Logging
- Traditional strategies are used where possible
- Insert, delete, and update log records are associated with a data page
- Redo logging: log records are idempotent, so they can be applied more than once with the same result
- For undo logging, before and after images are kept in update log records
- The last version of a record is kept in delete log records
Security
- Everybody has access to S3
- S3 gives clients control of the data
- The client who owns a collection can give other clients R/W privileges to the collection (bucket) or to individual pages
- Views are not supported, but could be implemented on top of S3
- If the provider is not trusted, data can be encrypted
- A curator can be assigned for a collection to approve all updates
Basic Commit Protocols
- Updates by one client can be overwritten by another, even if the two are updating different tuples
- Because the unit of transfer is a page rather than a tuple
- Several small objects must be clustered together (which is not the case in typical S3 usage)
- Assume all features of utility computing must be preserved
- Protocol (step 1 is sketched below):
- Step 1: the client generates log records for all committed updates and sends them to SQS
- Step 2: the log records are applied to the pages on S3; this is called checkpointing
- The first step is carried out in constant time
- The second step is asynchronous, so users are never blocked; if any part fails, the log records are simply resent (they are idempotent)
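A sketch of step 1 of this protocol: the client pushes its log records to the pending-update (PU) queue and returns immediately. The sqs object and the encode function are assumptions (e.g., the SQS sketch above):

```python
def commit(sqs, pu_queue_uri, log_records, encode=repr):
    """Step 1 of the basic commit protocol: push the transaction's log records
    to the PU queue; the client is never blocked.  'encode' serializes a log
    record (repr is just a placeholder)."""
    for rec in log_records:
        sqs.send(pu_queue_uri, encode(rec))
    # Step 2 (checkpointing) runs later and asynchronously; if any send fails,
    # the records can simply be resent because they are idempotent.
```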
- This preserves the features of utility computing
- But it is not atomic (only part of a transaction's updates may be applied)
- And it is not consistent: the only guarantee is that updates will eventually be written
PU Queues
- PU = Pending Update queues
- Clients propagate log records to PU Qs
- Each B-tree has one PU Q
- One PU Q is associated with each leaf node of the primary B-tree of a table
Checkpoint Protocol for Data Pages
- The input of a checkpoint is a PU Q
- Must make sure no other client is carrying out a checkpoint concurrently
- A Lock Q is associated with each PU Q
- A client may carry out the checkpoint only if it receives a token from the Lock Q
- A timeout is set; the checkpoint must be completed by then
- This is the protocol to update data pages, but in practice the B-tree is updated, so see the next slide
Checkpoint Protocol for B-trees
- More complicated than checkpointing a data page because several tree pages are involved
- Obtain a token from the Lock Q
- Receive the log records from the PU Q
- Sort the log records by key
- Find the leaf node for the first log record
- Apply all log records that belong to that leaf node
- Put the new version to S3
- Delete the applied log records
- Continue with the next leaf if there is still time (sketched below)
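A sketch of the B-tree checkpoint steps listed above. The helpers (sqs.receive/delete, s3.get/put, find_leaf_uri, apply_fn) are assumptions carried over from the earlier sketches; node splits, retries, and token renewal are omitted:

```python
import time
from itertools import groupby

def checkpoint_btree(s3, sqs, lock_queue, pu_queue, find_leaf_uri, apply_fn,
                     timeout_s=15):
    """Checkpoint pending log records into B-tree leaf pages (illustrative only).
    Messages are assumed to be (msg_id, log_record) pairs with a .key attribute
    on the log record, as in the logging sketch."""
    if not sqs.receive(lock_queue, 1):                 # obtain the token from the Lock Q
        return
    deadline = time.time() + timeout_s                 # must finish before the lock times out
    records = sqs.receive(pu_queue, 100)               # receive log records from the PU Q
    records.sort(key=lambda m: m[1].key)               # sort the log records by key
    for leaf_uri, group in groupby(records, key=lambda m: find_leaf_uri(m[1].key)):
        if time.time() >= deadline:                    # continue only if there is still time
            break
        group = list(group)
        page = s3.get(leaf_uri)                        # find the leaf node ...
        for _, rec in group:
            page = apply_fn(page, rec)                 # ... and apply its log records
        s3.put(leaf_uri, page)                         # put the new version to S3 first
        for msg_id, _ in group:
            sqs.delete(pu_queue, msg_id)               # then delete the applied log records
```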
Checkpoint Strategies
- A checkpoint on a page can be carried out by a reader, a writer, a watchdog (additional infrastructure), or the owner (who may be offline)
- Here, assume a writer initiates the checkpoint
- Each data page carries the timestamp (TS) of its last checkpoint
- The TS is taken from the machine that does the checkpoint
- The client computes the difference between its wallclock time and the TS
- If the difference is bigger than the checkpoint interval (10-15 s), the writer carries out the checkpoint (see the sketch after this list)
- If a page is updated only once, it would never be checkpointed, so checkpoints are occasionally forced at random
- Queries can have phantoms
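A sketch of the writer's checkpoint decision described above; force_prob is an illustrative parameter, not a value from the paper:

```python
import random
import time

def should_checkpoint(page_last_checkpoint_ts, interval_s=15, force_prob=0.01):
    """Checkpoint when the page's last-checkpoint timestamp is older than the
    interval, and occasionally at random so that pages updated only once still
    get checkpointed eventually (assumed mechanism, parameters illustrative)."""
    overdue = time.time() - page_last_checkpoint_ts > interval_s
    return overdue or random.random() < force_prob
```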
Transactional Properties
- Durability is achieved with SQS
- Atomicity can be achieved as follows:
- Use additional Atomic queues, one associated with each client
- Log records are committed to the Atomic Qs first, rather than directly to the PU Qs
- Every log record carries the id of the client's commit
- The client sends a special commit record to the Atomic Q
- Then it sends all log records to the PU Qs
- Finally, the commit record is deleted from the Atomic Q
- Recovery via logging works as follows
- If a client fails, it restarts
- Log records in the Atomic Q whose id matches no commit record are deleted
- Those with a matching id are propagated to the PU Qs and then deleted
- A commit record is deleted from the Atomic Q only after all its log records have been propagated to the PU Qs
- Log records propagated twice are no problem because they are idempotent (sketched below)
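A sketch of the atomicity protocol and the corresponding restart logic, using the in-memory SQS stand-in from above; message serialization and the receive batch limit are ignored:

```python
def atomic_commit(sqs, atomic_queue, pu_queue_for, commit_id, log_records):
    """Stage log records in the client's Atomic Q, mark the commit, propagate,
    then clean up (names illustrative).  pu_queue_for(rec) returns the PU queue
    a record belongs to."""
    for rec in log_records:
        sqs.send(atomic_queue, ("LOG", commit_id, rec))   # stage in the Atomic Q
    sqs.send(atomic_queue, ("COMMIT", commit_id))          # special commit record
    for rec in log_records:
        sqs.send(pu_queue_for(rec), rec)                   # propagate to the PU queues
    # Clean up: drop the staged log records first, the commit record last.
    for msg_id, body in sqs.receive(atomic_queue, 100):
        if body[0] == "LOG" and body[1] == commit_id:
            sqs.delete(atomic_queue, msg_id)
    for msg_id, body in sqs.receive(atomic_queue, 100):
        if body == ("COMMIT", commit_id):
            sqs.delete(atomic_queue, msg_id)

def recover(sqs, atomic_queue, pu_queue_for):
    """Restart after a failure: log records without a matching commit record are
    dropped; the rest are (re-)propagated, which is safe because they are
    idempotent.  Commit records are deleted last."""
    messages = sqs.receive(atomic_queue, 100)
    committed = {body[1] for _, body in messages if body[0] == "COMMIT"}
    for msg_id, body in messages:
        if body[0] == "LOG":
            _, cid, rec = body
            if cid in committed:
                sqs.send(pu_queue_for(rec), rec)           # re-propagation is harmless
            sqs.delete(atomic_queue, msg_id)
    for msg_id, body in messages:
        if body[0] == "COMMIT":
            sqs.delete(atomic_queue, msg_id)
```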
Consistency Levels
- Consistency for the Web
- Not strict consistency (where every read sees the most recent write)
- Monotonic reads: if a client has read a value of x, any successive read by that client returns that value or a more recent one
- Implemented by keeping track of the highest commit TS of the pages cached by the client
- Monotonic writes: a write to x is completed before any successive write to x by the same client
- Implemented with a counter for each page that is incremented on every update; the client keeps track of the counter value, which orders the log records
- Read your writes: the effect of a write on x is always seen by a successive read of x by the same client
- This holds if monotonic reads are implemented
- Write follows read: a write on x following a read of x by the same client takes place on the same or a more recent value of x than was read
- This holds because writes are not applied directly to the data items
Isolation
- Not implemented, but if it were:
- Multiversion optimistic concurrency control could be used to implement isolation on S3
- Multiversion: retrieve the version of an object as of the moment the transaction started
- At commit, the write set is compared to the write sets of transactions that committed in the meantime; if the intersection is empty, the transaction can commit
- A 2PL protocol is applied on the PU Qs in the commit phase
- This needs a global counter, which can be implemented on top of S3 but becomes a bottleneck
- Also not implemented, but if it were:
- BOCC (backward-oriented optimistic concurrency control)
- Also involves a global counter for the beginning/end of transactions
- Requires that only one transaction commits at a time, which is a big problem
Experiments and Results
- Increasing levels of consistency
- Basic: eventual consistency
- Monotonicity: monotonic reads and writes, read your writes, and write follows read
- Atomicity: in addition to the above two
- Baseline: the naive approach of writing all dirty pages back to S3
- The baseline does not even give eventual consistency; updates can be lost
- Read, write, create, index probe, and abort are the same for all variants; they differ in commit and checkpointing
- Mac with a 2.15 GHz Intel processor
- Data page size: 100 KB
- B-tree node size: 57 KB
- TTL of the client's cache: 100 s
- Cache size: 5 MB
- Checkpoint interval: 15 s
- 1 GB of network traffic: $0.18
TPC-W Benchmark
- Models an online bookstore with queries asking for the availability of products and placing orders
- Each transaction: retrieve a customer record, search for 6 products, place orders for 3 products (chosen at random)
Running time
- Average and maximum execution times per transaction (in seconds) are high
- The results are still believed to be acceptable in an interactive environment
- Transactions simulate 12 clicks of about 1 s each, except for the commit
- Higher consistency levels are actually faster, because commit only propagates log records to SQS
- Atomicity batches the log records
Cost ($)
- Overall cost per 1000 transactions: several thousand transactions were run and the total cost divided by the number of transactions
- Cost increases with the highest consistency level: interaction with SQS is expensive, in particular the checkpoints to the Atomic Q (can be reduced by changing the checkpoint interval)
- About 0.3 cents per transaction, which is acceptable for some applications
- The more orders placed, the more it costs
Varying Checkpoint Interval
- Increasing the interval decreases the cost
- An interval of less than 10 seconds means a checkpoint for every update
- An infinite interval costs $7 per 1000 transactions
Related Work
- Utility computing's biggest success so far is grid computing
- But its specific purpose is to analyze large scientific data sets
- Even S3 has so far served specific purposes: multi-media documents and backups
- The goal here is to broaden the scope of utility computing to general-purpose Web-based applications
- S3 resembles a distributed P2P system, yet without the technical drawbacks of a centralized component (?)
- The paper proposes to overlay data management (DM) on top of S3
- Similarly, the P2P community proposes to create network overlays on top of the Internet
Conclusions
- Utility computing is not yet attractive for high-performance transaction processing
- The paper is a first step in that direction
- Strict consistency and DB-style transactions were abandoned (controversial)
- Some applications may need ACID properties more than scalability and availability
- Future work: new algorithms for joins and query optimization
- Need ways to scan through several pages (chained I/O)
- Need the right security structure