Title: High-Availability in Scalable Distributed Data Structures
1. High-Availability in Scalable Distributed Data Structures
W. Litwin
- Witold.Litwin_at_dauphine.fr
2. Plan
- What are SDDSs?
- High-Availability SDDSs
- LH* with scalable availability
- Conclusion
3. Multicomputers
- A collection of loosely coupled computers
  - common and/or preexisting hardware
  - share-nothing architecture
  - message passing through a high-speed net (100 Mb/s or more)
- Network multicomputers
  - use general-purpose nets
  - LANs: Fast Ethernet, Token Ring, SCI, FDDI, Myrinet, ATM
  - NCSA cluster: 512 NTs on Myrinet by the end of 1998
- Switched multicomputers
  - use a bus, or a switch
  - IBM-SP2, Parsytec...
4. Network multicomputer
(Diagram: servers and clients connected by the network)
5. Why multicomputers?
- Unbeatable price-performance ratio
  - much cheaper and more powerful than supercomputers
  - especially the network multicomputers
  - 1500 WSs at HPL with 500 GB of RAM and TBs of disks
- Computing power
  - file size, access and processing times, throughput...
- For more pros and cons
  - IBM SP2 and GPFS literature
  - Tanenbaum, "Distributed Operating Systems", Prentice Hall, 1995
  - NOW project (UC Berkeley)
  - Bill Gates at Microsoft Scalability Day, May 1997
  - www.microsoft.com White Papers from the Business Syst. Div.
  - Report to the President, President's Inf. Techn. Adv. Comm., Aug 98
6. Why SDDSs?
- Multicomputers need data structures and file systems
- Trivial extensions of traditional structures are not the best
  - hot-spots
  - scalability
  - parallel queries
  - distributed and autonomous clients
  - distributed RAM and distance to data
7. What is an SDDS?
- Data are structured
  - records with keys; objects with an OID
  - more semantics than in the Unix flat-file model
  - abstraction popular with applications
  - allows for parallel scans and function shipping
- Data are on servers
  - always available for access
- Overflowing servers split into new servers
  - appended to the file without informing the clients
- Queries come from multiple autonomous clients
  - available for access only on their initiative
  - no synchronous updates on the clients
- There is no centralized directory for access computations
8. What is an SDDS?
- Clients can make addressing errors
  - clients have a more or less adequate image of the actual file structure
- Servers are able to forward the queries to the correct address
  - perhaps in several messages
- Servers may send Image Adjustment Messages (IAMs)
  - clients do not make the same error twice
- See the SDDS talk for more on it
  - 192.134.119.81/witold.html
- Or the LH* ACM-TODS paper (Dec. 96)
9-13. An SDDS: growth through splits under inserts
(Animation frames: servers and clients as the file grows through splits)
14-18. An SDDS
(Animation frames: client access and forwarding; slide 16 shows an IAM sent back to the client)
19-24. Known SDDSs
(Classification diagram, built up over slides 19-24)
- DS
  - Classics
  - SDDS (1993)
    - Hash: LH*, DDH, Breitbart et al.
    - 1-d tree
    - m-d trees
    - High-availability: LH*m, LH*g; s-availability: LH*SA, LH*RS
    - Security: LH*s
25. LH* (a classic)
- Allows for primary key (OID) based hash files
  - generalizes the LH addressing schema
  - variants used in Netscape products, LH-Server, Unify, Frontpage, IIS, MsExchange...
- Typical load factor: 70 - 90 %
- In practice, at most 2 forwarding messages
  - regardless of the size of the file
- In general, 1 message per insert and 2 messages per search on average
  - 4 messages in the worst case
- Search time of 1 ms (10 Mb/s net), 150 µs (100 Mb/s net) and 30 µs (Gb/s net)
26. High-availability LH* schemes
- In a large multicomputer, it is unlikely that all servers are up
- Assume the probability that a bucket is up is 99 %
  - i.e., the bucket is unavailable 3 days per year
- If one stores every key in only 1 bucket
  - the case of typical SDDSs, LH* included
- Then the file reliability, i.e., the probability that an n-bucket file is entirely up, is
  - 37 % for n = 100
  - 0 % for n = 1000
- Acceptable for you? (a quick check of these figures follows below)
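These figures are easy to verify; a minimal Python sketch, assuming independent bucket failures and per-bucket availability p = 0.99:

```python
# Reliability of an n-bucket file when every record lives in exactly one bucket,
# assuming independent bucket failures and per-bucket availability p.
def file_reliability(n: int, p: float = 0.99) -> float:
    return p ** n

for n in (100, 1000):
    print(f"n = {n}: P(all buckets up) = {file_reliability(n):.2f}")
# n = 100: P(all buckets up) = 0.37
# n = 1000: P(all buckets up) = 0.00
```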
27. High-availability LH* schemes
- Using 2 buckets to store a key, one may expect a reliability of
  - 99 % for n = 100
  - 91 % for n = 1000
- High-availability files
  - make data available despite the unavailability of some servers
  - RAIDx, LSA, EvenOdd, DATUM...
- High-availability SDDSs
  - make sense
  - are the only way to make large SDDS files reliable
28. Known high-availability LH* schemes
- Known high-availability LH* schemes keep data available under
  - any single server failure (1-availability)
  - any n-server failure, with n fixed or scalable (n-availability or scalable availability)
  - some n'-server failures with n' > n
- Three principles for high-availability SDDS schemes are known
  - mirroring (LH*m): storage doubles; 1-availability
  - striping (LH*s): affects parallel queries; 1-availability
  - grouping (LH*g, LH*SA, LH*RS)
29. Scalable Availability
- n-availability
  - availability of all data despite the simultaneous unavailability of up to any n buckets
  - e.g., RAIDx, EvenOdd, RAV, DATUM...
- Reliability
  - the probability P that all the records are available
- Problem
  - for every fixed choice of n, P → 0 when the file scales
- Solution
  - scalable availability: n grows with the file size, to regulate P
- Constraint
  - the growth has to be incremental
30. LH*SA file (Litwin, Menon, Risch)
- An LH* file with data buckets for data records
  - provided in addition with the availability level i in each bucket
- One or more parity files with parity buckets for parity records
  - added when the file scales
  - with every bucket 0 mirroring the LH*SA file state data (i, n)
- A family of grouping functions groups data buckets into groups of size k > 1 (one possible construction is sketched after this list), such that
  - every two buckets in the same level-i group are in different level-i′ groups, for every i′ ≠ i
- There is one parity bucket per data bucket group
- Within a parity bucket, a parity record is maintained for up to k data records with the same rank
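One construction that satisfies the grouping property above (shown only as an illustration; the LH*SA paper defines its own grouping functions) writes a bucket number in base k and forms its level-i group by dropping digit i; two buckets sharing a level-i group then differ only in digit i, so they fall into different groups at every other level:

```python
# A hypothetical family of grouping functions over bucket numbers, shown only to
# illustrate the stated property; the LH*SA paper defines its own functions.
def group(bucket: int, level: int, k: int = 4, digits: int = 4) -> tuple:
    """Level-`level` group of `bucket`: its base-k digits with digit `level` removed."""
    d = [(bucket // k**j) % k for j in range(digits)]   # base-k digits, least significant first
    return tuple(d[:level] + d[level + 1:])

# Two buckets in the same level-0 group (they differ only in digit 0) ...
a, b = 5, 6          # base-4: 11 and 12
assert group(a, 0) == group(b, 0)
# ... are in different groups at every other level.
assert all(group(a, i) != group(b, i) for i in range(1, 4))
```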
31. LH*SA File Expansion (k = 2)
32. Scalable Availability (basic schema)
- Up to k data buckets, use 1st-level grouping
  - so there will be only one parity bucket
- Then, start also 2nd-level grouping
- When the file exceeds k² buckets, start 3rd-level grouping
- When the file exceeds k³ buckets, start 4th-level grouping
- Etc. (the resulting schedule is sketched below)
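The resulting schedule makes the number of grouping levels grow roughly logarithmically in the number of data buckets; a minimal sketch, assuming exactly the thresholds k, k², k³, ... above:

```python
# Number of grouping levels in use for a file of n data buckets, following the
# schedule above: 1 level up to k buckets, 2 levels up to k^2, and so on.
def availability_levels(n: int, k: int = 4) -> int:
    level, threshold = 1, k
    while n > threshold:        # the file exceeded k^level buckets: one more grouping level
        level += 1
        threshold *= k
    return level

print([availability_levels(n) for n in (2, 4, 5, 16, 17, 64, 65)])
# [1, 1, 2, 2, 3, 3, 4]
```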
33. LH*SA groups
34. LH*SA File Expansion (k = 2)
35. LH*SA Recovery
- Bucket or record unavailability is detected
  - by the client during a search or update
  - or by a forwarding server
- The coordinator is alerted to perform the recovery
  - to bypass the unavailable bucket
  - or to restore the record on the fly
  - or to restore the bucket in a spare
- The recovered record is delivered to the client
36. LH*SA Bucket / Record Recovery
- Try the 1st-level group of the unavailable bucket m
- If other buckets are found unavailable in this group
  - try to recover each of them using their 2nd-level groups
- And so on, recursively
- Come back to finally recover bucket m
- See the paper for the full algorithms (a toy sketch of the cascade follows below)
- For an I-available file, it is sometimes possible to recover a record even when more than I buckets in a group are unavailable
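A toy sketch of this recovery cascade, under simplifying assumptions: XOR parity per group, the hypothetical digit-dropping grouping from the earlier sketch, and recursion to the next level whenever a group member is itself unavailable (the full algorithms are in the paper):

```python
# Toy model of the multi-level recovery cascade; not the paper's algorithm.
K, LEVELS, DIGITS = 2, 2, 2            # 4 data buckets, 2 grouping levels
BUCKETS = range(K ** DIGITS)

def group(bucket, level):
    d = [(bucket // K**j) % K for j in range(DIGITS)]
    return tuple(d[:level] + d[level + 1:])

values = {b: (b + 1) * 11 for b in BUCKETS}              # toy bucket contents
parity = {(lvl, group(b, lvl)): 0 for lvl in range(LEVELS) for b in BUCKETS}
for lvl in range(LEVELS):
    for b in BUCKETS:
        parity[(lvl, group(b, lvl))] ^= values[b]        # XOR parity per group

def recover(m, level, available, in_progress=frozenset()):
    """Rebuild bucket m from its level-`level` group, recursing to the next level
    for group members that are themselves unavailable."""
    if m in in_progress or level >= LEVELS:
        raise RuntimeError(f"bucket {m} cannot be recovered")
    acc = parity[(level, group(m, level))]
    for b in BUCKETS:
        if b == m or group(b, level) != group(m, level):
            continue
        acc ^= values[b] if b in available else recover(b, level + 1, available, in_progress | {m})
    return acc

# Buckets 0 and 1 share a level-0 group; with both down, bucket 1 is first
# rebuilt through its level-1 group, then bucket 0 through its level-0 group.
available = set(BUCKETS) - {0, 1}
assert recover(0, 0, available) == values[0]
```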
37. LH*SA normal recovery
38. Good-case recovery (I = 3)
39. Scalability Analysis: Search Performance
- Search
  - usually the same cost as for LH*
  - including the parallel queries
  - 2 messages per key search
- In degraded mode
  - usually O(k)
    - the record reconstruction cost using 1st-level parity
  - worst case O((I + 1) k)
40. Insert Performance
- Usually (I + 1) or (I + 2) messages
  - (I + 1) is the best possible value for any I-availability schema
- In degraded mode
  - about the same, if the unavailable bucket can be bypassed
  - add the bucket recovery cost otherwise
  - the client cost is only a few messages, to deliver the record to the coordinator
41. Split Performance
- The LH* split cost, which is O(b/2)
  - b is the bucket capacity
  - one message per record
- Plus usually O(I·b) messages to the parity buckets
  - to recompute the (XOR) parity bits, since usually all records get new ranks
- Plus O(b) messages when a new bucket is created
42. Storage Performance
- Storage overhead cost Cs = S' / S
  - S' - storage for the parity files
  - S - storage for the data buckets
    - practically, the LH* file storage cost
- Cs depends on the file availability level I reached
- While building the new level I + 1 (see the check below):
  - Cs starts from the lower bound L_I = I / k
    - for file size M = k^I
    - the best possible value for any I-availability schema
  - increases towards an upper bound U_{I+1} = O(I / k)
    - as long as new splits add parity buckets
  - decreases towards L_{I+1} afterwards
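The lower bound L_I = I / k follows directly from slides 30 and 32: a file of M = k^I data buckets has M/k groups per level and I levels, hence I·M/k parity buckets. A small sanity check of that arithmetic, assuming parity buckets are roughly the size of data buckets:

```python
# Sanity check of the lower bound L_I = I / k, assuming one parity bucket per
# group of k data buckets at each of the I grouping levels.
def storage_overhead(I: int, k: int) -> float:
    data_buckets = k ** I                    # file size M = k^I
    parity_buckets = I * data_buckets // k   # M/k groups per level, I levels
    return parity_buckets / data_buckets

for k in (4, 8):
    for I in (1, 2, 3):
        assert storage_overhead(I, k) == I / k
print("L_I = I/k confirmed for the sampled (I, k)")
```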
43. Example
44. Reliability
- Probability P that all records are available to the application
  - all the data buckets are available
  - or every record can be recovered
    - i.e., at most I buckets have failed in any LH*SA group
- Depends on
  - the failure probability p of each site
  - the group size k
  - the file size M
- The reliability of the basic LH*SA schema is termed uncontrolled
45. Uncontrolled Reliability
(Plot of the uncontrolled reliability of LH*SA)
46. Controlled Reliability
- Keeps the reliability above, or close to, a given threshold through
  - delaying or accelerating the availability level growth
  - or gracefully changing the group size k
- Necessary for higher values of p
  - the case of less reliable sites
  - a frequent situation on network multicomputers
- May improve performance for small p's
- Several schemes are possible
47. Controlled Reliability with Fixed Group Size
(Plot: k = 4, T = 0.8, p = 0.2)
48. Controlled Reliability with Variable Group Size
(Plot: p = 0.01, T = 0.95)
49. LH*RS (Litwin & Schwarz)
- A single grouping function
  - groups of consecutive buckets: 1, 2, 3, 4; 5, 6, 7, 8; ...
- Multiple parity buckets per group
- Scalable availability
  - 1 parity bucket per group until the file reaches 2^i1 buckets
  - then, at each split, add a 2nd parity bucket to each existing group, or create 2 parity buckets for new groups, until 2^i2 buckets
  - etc.
50-52. LH*RS File Expansion
(Diagrams over three slides: the data and parity buckets through successive splits)
53. LH*RS Parity Calculus
- Choose GF(2^l)
  - typically GF(16) or GF(256)
- Create the k x n generator matrix G
  - using elementary transformations of an extended Vandermonde matrix of GF elements
  - k is the record group size
  - n = 2^l is the max segment size (data and parity records)
  - G = [I | P], where I denotes the identity matrix
- Each record is a sequence of symbols from GF(2^l)
- The k symbols with the same offset in the records of a group become the (horizontal) information vector U
- The matrix multiplication U · G provides the (n - k) parity symbols, i.e., the rest of the codeword vector
54. LH*RS Parity Calculus
- The parity calculus is distributed to the parity buckets
  - each column is at one bucket
- Parity is calculated only for the existing data and parity buckets
  - at each insert, delete and update
- Adding new parity buckets does not change the existing parity records
55. Example: GF(4)
- Addition: XOR
- Multiplication: direct table, or log / antilog tables (coded below)
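A minimal coded version of this arithmetic, using the standard GF(4) tables for the polynomial x² + x + 1 (the slide's own tables are the authoritative ones):

```python
# GF(4) = {0, 1, 2, 3}, with 2 and 3 standing for the polynomials x and x + 1.
def gf4_add(a: int, b: int) -> int:
    return a ^ b                    # addition (and subtraction) is bitwise XOR

GF4_MUL = [
    [0, 0, 0, 0],
    [0, 1, 2, 3],
    [0, 2, 3, 1],                   # x * x = x + 1, x * (x + 1) = 1
    [0, 3, 1, 2],
]
def gf4_mul(a: int, b: int) -> int:
    return GF4_MUL[a][b]

GF4_INV = [None, 1, 3, 2]           # multiplicative inverses of 1, 2, 3

assert gf4_add(2, 3) == 1 and gf4_mul(2, 3) == 1 and gf4_mul(3, 3) == 2
```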
56-58. Encoding
(Worked example over three slides: the records of a group and the codewords computed from them; a code sketch of the encoding follows)
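A sketch of the encoding step, under illustrative assumptions: k = 2, n = 4, GF(4) symbols, and a made-up systematic generator G = [I | P] (the real LH*RS generator is derived from an extended Vandermonde matrix, per slide 53):

```python
# Systematic encoding of a group of k = 2 records over GF(4); the matrix G below
# is a hypothetical MDS generator chosen for illustration, not the paper's.
GF4_MUL = [[0, 0, 0, 0], [0, 1, 2, 3], [0, 2, 3, 1], [0, 3, 1, 2]]
gf4_add = lambda a, b: a ^ b
gf4_mul = lambda a, b: GF4_MUL[a][b]

G = [                      # k x n = 2 x 4, G = [I | P]
    [1, 0, 1, 1],
    [0, 1, 1, 2],
]

def encode(u):             # u: information vector of k data symbols
    return [
        # codeword symbol j = sum_i u[i] * G[i][j]   (sum in GF(4), i.e. XOR)
        gf4_add(gf4_mul(u[0], G[0][j]), gf4_mul(u[1], G[1][j]))
        for j in range(4)
    ]

# Symbols at the same offset in the two data records of the group:
u = [1, 3]
print(encode(u))           # [1, 3, 2, 0] -> first k symbols are the data, last 2 are parity
```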
59. LH*RS Recovery Calculus
- Performed when at most n - k buckets are unavailable, among the data and the parity buckets of a group
- Choose k available buckets
- Form the submatrix H of G from the corresponding columns
- Invert this matrix into the matrix H^-1
- Multiply the horizontal vector S of the available symbols with the same offset by H^-1
- The result contains the recovered data and/or parity symbols (a code sketch follows the example slides below)
60-63. Example: Recovery
(Worked example over four slides: the available buckets and the recovered symbols / buckets)
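A toy version of the recovery calculus, continuing the encoding sketch above; the generator matrix, the chosen columns and the 2×2 inversion formula are illustrative assumptions, not the paper's exact example:

```python
# Toy recovery over GF(4) for k = 2, n = 4, using the hypothetical G = [I | P]
# from the encoding sketch; bucket 0 is assumed unavailable.
GF4_MUL = [[0, 0, 0, 0], [0, 1, 2, 3], [0, 2, 3, 1], [0, 3, 1, 2]]
GF4_INV = [None, 1, 3, 2]
mul, add = (lambda a, b: GF4_MUL[a][b]), (lambda a, b: a ^ b)

G = [[1, 0, 1, 1],
     [0, 1, 1, 2]]
codeword = [1, 3, 2, 0]          # from the encoding sketch

cols = [1, 2]                    # choose k = 2 available buckets (data bucket 1, 1st parity)
H = [[G[i][j] for j in cols] for i in range(2)]
S = [codeword[j] for j in cols]  # their symbols at the common offset

# Invert the 2x2 matrix H over GF(4): H^-1 = det^-1 * [[d, b], [c, a]]  (char 2, so -x = x)
(a, b), (c, d) = H
det_inv = GF4_INV[add(mul(a, d), mul(b, c))]
H_inv = [[mul(det_inv, d), mul(det_inv, b)],
         [mul(det_inv, c), mul(det_inv, a)]]

# U = S * H^-1 recovers the information vector, hence the unavailable data symbol.
U = [add(mul(S[0], H_inv[0][j]), mul(S[1], H_inv[1][j])) for j in range(2)]
print(U)                         # [1, 3] -> the symbol of the unavailable bucket 0 is 1
```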
64. Conclusion
- High availability is an important property of an SDDS
- Its design should preserve scalability, parallelism and reliability
- Schemes using record grouping seem the most appropriate
65. Future Work
- Performance analysis of LH*RS
- Implementation of any of the high-availability SDDSs
  - LH*RS is now being implemented at CERIA by Mattias Ljungström
- High-availability variants of other known SDDSs
66. End
Thank you for your attention
Witold Litwin - witold.litwin_at_dauphine.fr