Title: Building a Database on S3
- Brantner, Florescu, Graf, Kossmann, Kraska
- SIGMOD 2008
Introduction
- The next wave on the Web is to provide services
- Make it easy for everyone to provide services, not just the Googles and Amazons
- Technical difficulties:
- 24x7 service
- Need data centers around the world
- Must administer servers and any DBs
- Success can kill (a sudden surge in load can overwhelm the service)
- These are the reasons for utility computing (the cloud)
- Goals of utility computing:
- Storage, CPU, and network bandwidth as a commodity at low unit cost
- Scalability is not a problem
- Full availability at any time; clients are never blocked
- Clients can fail at any time
- Constant response times for reads and writes (R/W)
- Pay by use
- The most prominent utility service is S3
- Part of Amazon Web Services (AWS)
- S3, SQS, EC2, SimpleDB
- S3 is Amazon's Simple Storage Service
- Infinite scalability, high availability, low cost
- Currently used for multi-media documents
- i.e., for data that is rarely updated
- SmugMug is implemented on top of S3
- S3 popular as a backup
- Products to backup data from MySQL to S3
- But will S3 work for other kinds of data?
- Disadvantages of S3
- Slow compared to a local disk drive
- Sacrifices consistency: it takes an undetermined amount of time for an update to an object to become visible
- Updates are not necessarily applied in the same order as they were initiated
- Eventual consistency is the only guarantee
- Can Web-based DB applications be implemented on top of utility services (S3)?
- What if S3 is used for a general-purpose DB?
- Small objects, frequent updates
- Present R/W commit protocols
- Study cost, performance, consistency
- Goal: preserve the scalability and availability of distributed systems, plus some ACID properties
- Can only maximize the level of consistency
- Will not try to support full ACID properties
- Shows how small, frequently updated objects can be implemented
- Shows how a B-tree can be implemented
- Protocols for different levels of consistency
- Performance results with TPC-W benchmarks
S3
- S3: Simple Storage Service
- Conceptually, an infinite store for objects from 1 B to 5 GB in size
- An object is a byte container identified by a URI
- Objects can be read/updated with a SOAP or REST-based interface
- Get(uri) returns the object
- Put(uri, bytestream) writes a new version
- Get-if-modified-since(uri, TS) gets the new version only if the object has changed since timestamp TS
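A minimal sketch of this interface, simulated in memory for illustration; the class and method names are assumptions, not Amazon's actual SDK:

```python
import time

class S3Sim:
    """In-memory stand-in for the three S3 calls listed above (illustrative only)."""
    def __init__(self):
        self._store = {}          # uri -> (bytes, last_modified)

    def put(self, uri, bytestream):
        # Writes a new version of the object identified by the URI.
        self._store[uri] = (bytes(bytestream), time.time())

    def get(self, uri):
        # Returns the byte container identified by the URI.
        return self._store[uri][0]

    def get_if_modified_since(self, uri, ts):
        # Returns the object only if it changed after timestamp ts.
        data, last_modified = self._store[uri]
        return data if last_modified > ts else None

s3 = S3Sim()
s3.put("bucket/page-1", b"hello")
assert s3.get("bucket/page-1") == b"hello"
assert s3.get_if_modified_since("bucket/page-1", time.time()) is None
```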
- In S3, each object is associated with a bucket
- The user specifies the bucket for a new object and can scan through the objects in a bucket
- Buckets or individual objects can be used as the unit of security
- S3 is not free
- $0.15 per GB per month to store data
- A 160 GB disk drive costs about $70
- Amortized over 2 years, that is about $0.02 per GB and month (power not included)
- So S3 is in the same ballpark as disk drives
- Using S3 as a backup is a good deal
- But:
- There is a cost for R/W access
- $0.01 per 10K get requests and per 1K put requests
- Many users operate their own servers to cache data
- Latency is also a problem
- Reading takes about 100 ms
- 2 to 3x longer than from a local disk
- Writing takes about 3x as long as reading
- Throughput is superior
- Bandwidth is acceptable only if data is read in large chunks (~100 KB)
- So small objects must be clustered into pages
- Implementation details of S3 are not published
- S3 seems to replicate all data at several data centers
- Replicas can be read and written at any time
- Updates are propagated asynchronously
- If a data center fails, another center is used
- Last update wins
- This guarantees full R/W availability, which is crucial to Web applications
SQS
- SQS: Simple Queue Service
- Allows users to manage an infinite number of queues with infinite capacity
- Each queue is referenced by a URI and supports sending/receiving messages via an HTTP or REST-based interface
- Message size is limited to 8 KB for HTTP
- Supported operations:
- createQueue, send a message to a queue, receive a number of messages from the top of a queue, delete a message from a queue, grant another user permission to send/receive messages to/from a queue
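An in-memory sketch of these operations; names are illustrative, and the real service is accessed over HTTP/REST and returns messages only on a best-effort basis, as the next slide notes:

```python
from collections import deque
import itertools

class SQSSim:
    """In-memory stand-in for the SQS operations listed above (illustrative only)."""
    def __init__(self):
        self._queues = {}
        self._ids = itertools.count()

    def create_queue(self, uri):
        self._queues.setdefault(uri, deque())

    def send(self, uri, body):
        msg_id = next(self._ids)
        self._queues[uri].append((msg_id, body))
        return msg_id

    def receive(self, uri, n):
        # Returns up to n messages from the top of the queue (best-effort FIFO);
        # the real service may return far fewer, as noted on the next slide.
        return list(itertools.islice(self._queues[uri], n))

    def delete(self, uri, msg_id):
        self._queues[uri] = deque(m for m in self._queues[uri] if m[0] != msg_id)

sqs = SQSSim()
sqs.create_queue("queue/pu-42")
mid = sqs.send("queue/pu-42", b"log-record")
print(sqs.receive("queue/pu-42", 10))
sqs.delete("queue/pu-42", mid)
```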
- Cost of SQS: $0.01 to send 1K messages
- Round-trip times matter:
- Each call to SQS returns a result or an ACK
- Round-trip time is measured as wallclock time
- Implementation details not published
- Messages of a queue seem to be stored in a distributed and replicated way
- Clients can initiate requests at any time and are never blocked
- Messages are returned in FIFO order only on a best-effort basis
- SQS returns only about every 10th relevant message
- E.g., if a queue has 200 messages and you ask for the top 100, you get about 20
EC2
- EC2: Elastic Compute Cloud
- Allows renting machines (CPU + disk) for a specified period of time
- The client gets a virtual machine hosted on an Amazon server
- $0.10 per hour regardless of how heavily the machine is used
- All requests from EC2 to S3 and SQS are free
- If all of these services are used together, the computation is moved to the data
Using S3 as a disk
- Client-server architecture
- Similar to distributed shared-disk DB systems
- Clients retrieve pages from S3 based on their URIs, buffer them locally, update them, and write them back
- A record is a bytestream of variable size (constrained by the page size)
- Records can be relational tuples, XML elements/documents, or blobs
- Focus on:
- Page manager: coordinates R/W and buffers pages
- Record manager: record-oriented interface, organizes records on pages, free-space management
- The page manager, record manager, etc. could be executed on EC2
- Or the whole client stack could be installed on laptops or mobile phones to implement a Web 2.0 application (this is assumed here)
Record Manager
- The record manager manages records (tuples)
- A record is associated with a collection (table)
- A record is composed of a key and data
- A record is stored in one page; pages are stored as single S3 objects
- A table is implemented as a bucket
- A table is identified by a URI
- Operations: create a new record, read a record by key, update by key, delete by key, and scan a collection by its URI (sketched below)
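The record-manager interface sketched as an abstract class; the method names are assumptions based on the operations listed above, not the paper's actual API:

```python
from abc import ABC, abstractmethod

class RecordManager(ABC):
    """Sketch of the record-manager interface described above (names illustrative)."""

    @abstractmethod
    def create(self, table_uri, key, data):
        """Insert a new (key, data) record into the collection identified by table_uri."""

    @abstractmethod
    def read(self, table_uri, key):
        """Return the data of the record with the given key."""

    @abstractmethod
    def update(self, table_uri, key, data):
        """Replace the data of the record with the given key."""

    @abstractmethod
    def delete(self, table_uri, key):
        """Remove the record with the given key."""

    @abstractmethod
    def scan(self, table_uri):
        """Iterate over all records of the collection (bucket)."""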
Page manager
- Implements a buffer pool for S3 pages
- Supports reading pages, updating them, marking them as updated, and creating new pages
- Implements commit and abort
- Assumes the write set fits into the client's main memory or secondary storage
- Commit must propagate the changes to S3
- On abort, the client's buffer pool is simply discarded
- No pages are evicted from the buffer pool as part of a commit; an up-to-date version is fetched later if necessary
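A minimal sketch of the page manager's buffer pool, assuming a store object with get(uri)/put(uri, data) calls like the S3 sketch above. It shows the naive variant in which commit writes dirty pages straight back to S3; the refined protocols below replace that step with log records sent to SQS, and TTL-based cache refreshing is omitted:

```python
class PageManager:
    """Sketch of the client-side buffer pool (names and store interface assumed)."""

    def __init__(self, store):
        self.store = store
        self.pool = {}      # uri -> page bytes cached on the client
        self.dirty = set()  # URIs of pages updated since the last commit

    def read(self, uri):
        if uri not in self.pool:                  # fetch from S3 on a miss
            self.pool[uri] = self.store.get(uri)
        return self.pool[uri]

    def update(self, uri, data):
        self.pool[uri] = data
        self.dirty.add(uri)                       # mark as updated, do not write yet

    def commit(self):
        # Naive commit: propagate all updated pages; nothing is evicted from the pool.
        for uri in self.dirty:
            self.store.put(uri, self.pool[uri])
        self.dirty.clear()

    def abort(self):
        # Discard the client's buffered changes.
        for uri in self.dirty:
            self.pool.pop(uri, None)
        self.dirty.clear()
```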
B-tree indexes
- Adopt existing DB technology where possible
- Root and intermediate nodes are stored as pages holding (key, URI of next-level node) entries
- Leaf pages of the primary index hold (key, payload data)
- So the data is stored in the leaves of the B-tree (index-organized table, IOT)
- Leaf pages of a secondary index hold (search key, record key)
- A secondary-index lookup retrieves the keys of matching records, then goes to the primary index to retrieve the records with their payload data
- Nodes at each level are chained
- The root is always at the same URI (even when the node is split)
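A sketch of a point lookup over B-tree pages stored as S3 objects. The node layout (a dict with a leaf flag and sorted (key, value) entries), the store object, and the omitted serialization are all assumptions made for illustration:

```python
def btree_lookup(store, root_uri, search_key):
    """Follow (key, child URI) entries from the root to a leaf, then scan the leaf.
    store.get(uri) is assumed to return the deserialized node dict."""
    node = store.get(root_uri)              # the root is always at the same URI
    while not node["leaf"]:
        child_uri = None
        for key, uri in node["entries"]:    # entries sorted by key; key is the
            if search_key <= key:           # largest key reachable via that child
                child_uri = uri
                break
        if child_uri is None:               # larger than all separators: rightmost child
            child_uri = node["entries"][-1][1]
        node = store.get(child_uri)
    for key, payload in node["entries"]:    # leaf of a primary index: (key, payload)
        if key == search_key:
            return payload
    return None
```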
Logging
- Traditional strategies are used where possible
- Insert, delete, and update log records are associated with a data page
- Redo logging: log records are idempotent, so they can be applied more than once with the same result
- For undo logging, before and after images are kept in update log records
- The last version of a record is kept in delete log records
Security
- Everybody has access to S3
- S3 gives clients control of the data
- The client who owns a collection can give other clients R/W privileges to the collection (bucket) or to individual pages
- Views are not supported, but could be implemented on top of S3
- If the provider is not trusted, data can be encrypted
- A curator can be assigned for a collection to approve all updates
Basic Commit Protocols
- Updates by one client can be overwritten by another, even if the two are updating different tuples
- Because the unit of transfer is a page rather than a tuple
- Several small objects must be clustered together (which is not the case in typical S3 usage)
- Assume all features of utility computing must be preserved
- Protocol (step 1 is sketched below):
- Step 1: the client generates log records for all committed updates and sends them to SQS
- Step 2: the log records are applied to the pages on S3; this is called checkpointing
- The first step is carried out in constant time
- The second step is asynchronous, so users are never blocked; if any part fails, the log records are simply resent (they are idempotent)
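A sketch of step 1 of this protocol: the client pushes its log records to the pending-update (PU) queue and returns immediately. The sqs object and the encode function are assumptions (e.g., the SQS sketch above):

```python
def commit(sqs, pu_queue_uri, log_records, encode=repr):
    """Step 1 of the basic commit protocol: push the transaction's log records
    to the PU queue; the client is never blocked.  'encode' serializes a log
    record (repr is just a placeholder)."""
    for rec in log_records:
        sqs.send(pu_queue_uri, encode(rec))
    # Step 2 (checkpointing) runs later and asynchronously; if any send fails,
    # the records can simply be resent because they are idempotent.
```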
- This preserves the features of utility computing
- But it is not atomic (only part of a transaction's updates may be applied)
- And it is not consistent: the only guarantee is that updates will eventually be written
PU Queues
- PU = Pending Update queues
- Clients propagate log records to PU Qs
- Each B-tree has one PU Q
- One PU Q is associated with each leaf node of the primary B-tree of a table
Checkpoint Protocol for Data Pages
- The input of a checkpoint is a PU Q
- Must make sure no other client is carrying out a checkpoint concurrently
- A Lock Q is associated with each PU Q
- A client may carry out the checkpoint only if it receives a token from the Lock Q
- A timeout is set; the checkpoint must be completed by then
- This is the protocol to update data pages, but in practice the B-tree is updated, so see the next slide
Checkpoint Protocol for B-trees
- More complicated than checkpointing a data page because several tree pages are involved
- Obtain a token from the Lock Q
- Receive the log records from the PU Q
- Sort the log records by key
- Find the leaf node for the first log record
- Apply all log records that belong to that leaf node
- Put the new version to S3
- Delete the applied log records
- Continue with the next leaf if there is still time (sketched below)
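A sketch of the B-tree checkpoint steps listed above. The helpers (sqs.receive/delete, s3.get/put, find_leaf_uri, apply_fn) are assumptions carried over from the earlier sketches; node splits, retries, and token renewal are omitted:

```python
import time
from itertools import groupby

def checkpoint_btree(s3, sqs, lock_queue, pu_queue, find_leaf_uri, apply_fn,
                     timeout_s=15):
    """Checkpoint pending log records into B-tree leaf pages (illustrative only).
    Messages are assumed to be (msg_id, log_record) pairs with a .key attribute
    on the log record, as in the logging sketch."""
    if not sqs.receive(lock_queue, 1):                 # obtain the token from the Lock Q
        return
    deadline = time.time() + timeout_s                 # must finish before the lock times out
    records = sqs.receive(pu_queue, 100)               # receive log records from the PU Q
    records.sort(key=lambda m: m[1].key)               # sort the log records by key
    for leaf_uri, group in groupby(records, key=lambda m: find_leaf_uri(m[1].key)):
        if time.time() >= deadline:                    # continue only if there is still time
            break
        group = list(group)
        page = s3.get(leaf_uri)                        # find the leaf node ...
        for _, rec in group:
            page = apply_fn(page, rec)                 # ... and apply its log records
        s3.put(leaf_uri, page)                         # put the new version to S3 first
        for msg_id, _ in group:
            sqs.delete(pu_queue, msg_id)               # then delete the applied log records
```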
Checkpoint Strategies
- A checkpoint on a page can be carried out by a reader, a writer, a watchdog (additional infrastructure), or the owner (who may be offline)
- Here, assume a writer initiates the checkpoint
- Each data page carries the timestamp (TS) of its last checkpoint
- The TS is taken from the machine that does the checkpoint
- The client computes the difference between its wallclock time and the TS
- If the difference is bigger than the checkpoint interval (10-15 s), the writer carries out the checkpoint (see the sketch after this list)
- If a page is updated only once, it would never be checkpointed, so checkpoints are occasionally forced at random
- Queries can have phantoms
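A sketch of the writer's checkpoint decision described above; force_prob is an illustrative parameter, not a value from the paper:

```python
import random
import time

def should_checkpoint(page_last_checkpoint_ts, interval_s=15, force_prob=0.01):
    """Checkpoint when the page's last-checkpoint timestamp is older than the
    interval, and occasionally at random so that pages updated only once still
    get checkpointed eventually (assumed mechanism, parameters illustrative)."""
    overdue = time.time() - page_last_checkpoint_ts > interval_s
    return overdue or random.random() < force_prob
```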
Transactional Properties
- Durability is achieved with SQS
- Atomicity can be achieved as follows:
- Use additional Atomic queues, one associated with each client
- Log records are committed to the Atomic Qs first, rather than directly to the PU Qs
- Every log record carries the id of the client's commit
- The client sends a special commit record to the Atomic Q
- Then it sends all log records to the PU Qs
- Finally, the commit record is deleted from the Atomic Q
- Recovery via logging works as follows
- If a client fails, it restarts
- Log records in the Atomic Q whose id matches no commit record are deleted
- Those with a matching id are propagated to the PU Qs and then deleted
- A commit record is deleted from the Atomic Q only after all its log records have been propagated to the PU Qs
- Log records propagated twice are no problem because they are idempotent (sketched below)
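A sketch of the atomicity protocol and the corresponding restart logic, using the in-memory SQS stand-in from above; message serialization and the receive batch limit are ignored:

```python
def atomic_commit(sqs, atomic_queue, pu_queue_for, commit_id, log_records):
    """Stage log records in the client's Atomic Q, mark the commit, propagate,
    then clean up (names illustrative).  pu_queue_for(rec) returns the PU queue
    a record belongs to."""
    for rec in log_records:
        sqs.send(atomic_queue, ("LOG", commit_id, rec))   # stage in the Atomic Q
    sqs.send(atomic_queue, ("COMMIT", commit_id))          # special commit record
    for rec in log_records:
        sqs.send(pu_queue_for(rec), rec)                   # propagate to the PU queues
    # Clean up: drop the staged log records first, the commit record last.
    for msg_id, body in sqs.receive(atomic_queue, 100):
        if body[0] == "LOG" and body[1] == commit_id:
            sqs.delete(atomic_queue, msg_id)
    for msg_id, body in sqs.receive(atomic_queue, 100):
        if body == ("COMMIT", commit_id):
            sqs.delete(atomic_queue, msg_id)

def recover(sqs, atomic_queue, pu_queue_for):
    """Restart after a failure: log records without a matching commit record are
    dropped; the rest are (re-)propagated, which is safe because they are
    idempotent.  Commit records are deleted last."""
    messages = sqs.receive(atomic_queue, 100)
    committed = {body[1] for _, body in messages if body[0] == "COMMIT"}
    for msg_id, body in messages:
        if body[0] == "LOG":
            _, cid, rec = body
            if cid in committed:
                sqs.send(pu_queue_for(rec), rec)           # re-propagation is harmless
            sqs.delete(atomic_queue, msg_id)
    for msg_id, body in messages:
        if body[0] == "COMMIT":
            sqs.delete(atomic_queue, msg_id)
```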
Consistency Levels
- Consistency for the Web
- Not strict consistency (where every read sees the most recent write)
- Monotonic reads: if a client has read a value of x, any successive read by that client returns that value or a more recent one
- Implemented by keeping track of the highest commit TS of the pages cached by the client
- Monotonic writes: a write to x is completed before any successive write to x by the same client
- Implemented with a counter for each page that is incremented on every update; the client keeps track of the counter value, which orders the log records
- Read your writes: the effect of a write on x is always seen by a successive read of x by the same client
- This holds if monotonic reads are implemented
- Write follows read: a write on x following a read of x by the same client takes place on the same or a more recent value of x than was read
- This holds because writes are not applied directly to the data items
Isolation
- Not implemented, but if it were:
- Multiversion optimistic concurrency control could be used to implement isolation on S3
- Multiversion: retrieve the version of an object as of the moment the transaction started
- At commit, the write set is compared to the write sets of transactions that committed in the meantime; if the intersection is empty, the transaction can commit
- A 2PL protocol is applied on the PU Qs in the commit phase
- This needs a global counter, which can be implemented on top of S3 but becomes a bottleneck
- Also not implemented, but if it were:
- BOCC (backward-oriented optimistic concurrency control)
- Also involves a global counter for the beginning/end of transactions
- Requires that only one transaction commits at a time, which is a big problem
Experiments and Results
- Increasing levels of consistency
- Basic: eventual consistency
- Monotonicity: monotonic reads and writes, read your writes, and write follows read
- Atomicity: in addition to the above two
- Baseline: the naive approach of writing all dirty pages back to S3
- The baseline does not even give eventual consistency; updates can be lost
- Read, write, create, index probe, and abort are the same for all variants; they differ in commit and checkpointing
- Mac with a 2.15 GHz Intel processor
- Data page size: 100 KB
- B-tree node size: 57 KB
- TTL of the client's cache: 100 s
- Cache size: 5 MB
- Checkpoint interval: 15 s
- 1 GB of network traffic: $0.18
TPC-W Benchmark
- Models an online bookstore with queries asking for the availability of products and placing orders
- Each transaction: retrieve a customer record, search for 6 products, place orders for 3 products (chosen at random)
Running time
- Average and maximum execution times per transaction (in seconds) are high
- The results are still believed to be acceptable in an interactive environment
- Transactions simulate 12 clicks of about 1 s each, except for the commit
- Higher consistency levels are actually faster, because commit only propagates log records to SQS
- Atomicity batches the log records
Cost ($)
- Overall cost per 1000 transactions: several thousand transactions were run and the total cost divided by the number of transactions
- Cost increases with the highest consistency level: interaction with SQS is expensive, in particular the checkpoints to the Atomic Q (can be reduced by changing the checkpoint interval)
- About 0.3 cents per transaction, which is acceptable for some applications
- The more orders placed, the more it costs
Varying Checkpoint Interval
- Increasing the interval decreases the cost
- An interval of less than 10 seconds means a checkpoint for every update
- An infinite interval costs $7 per 1000 transactions
Related Work
- Utility computing's biggest success so far is grid computing
- But its specific purpose is to analyze large scientific data sets
- Even S3 has so far served specific purposes: multi-media documents and backups
- The goal here is to broaden the scope of utility computing to general-purpose Web-based applications
- S3 resembles a distributed P2P system, yet without the technical drawbacks of a centralized component (?)
- The paper proposes to overlay data management (DM) on top of S3
- Similarly, the P2P community proposes to create network overlays on top of the Internet
Conclusions
- Utility computing is not yet attractive for high-performance transaction processing
- The paper is a first step in that direction
- Strict consistency and DB-style transactions were abandoned (controversial)
- Some applications may need ACID properties more than scalability and availability
- Future work: new algorithms for joins and query optimization
- Need ways to scan through several pages (chained I/O)
- Need the right security structure