Title: High-Availability in Scalable Distributed Data Structures
1. High-Availability in Scalable Distributed Data Structures
W. Litwin
- Witold.Litwin_at_dauphine.fr
2. Plan
- What are SDDSs?
- High-Availability SDDSs
- LH* with scalable availability
- Conclusion
3. Multicomputers
- A collection of loosely coupled computers
  - common and/or preexisting hardware
  - share-nothing architecture
  - message passing through a high-speed net (100 Mb/s or more)
- Network multicomputers
  - use general-purpose nets
  - LANs: Fast Ethernet, Token Ring, SCI, FDDI, Myrinet, ATM
  - NCSA cluster: 512 NTs on Myrinet by the end of 1998
- Switched multicomputers
  - use a bus, or a switch
  - IBM-SP2, Parsytec...
4. Network multicomputer
(Diagram: servers and clients connected by the network)
5. Why multicomputers?
- Unbeatable price-performance ratio
  - much cheaper and more powerful than supercomputers
  - especially the network multicomputers
  - 1500 WSs at HPL with 500 GB of RAM and TBs of disks
- Computing power
  - file size, access and processing times, throughput...
- For more pros and cons
  - IBM SP2 and GPFS literature
  - Tanenbaum, "Distributed Operating Systems", Prentice Hall, 1995
  - NOW project (UC Berkeley)
  - Bill Gates at Microsoft Scalability Day, May 1997
  - www.microsoft.com White Papers from the Business Syst. Div.
  - Report to the President, President's Inf. Techn. Adv. Comm., Aug 98
6. Why SDDSs?
- Multicomputers need data structures and file systems
- Trivial extensions of traditional structures are not the best
  - hot-spots
  - scalability
  - parallel queries
  - distributed and autonomous clients
  - distributed RAM and distance to data
7. What is an SDDS?
- Data are structured
  - records with keys; objects with an OID
  - more semantics than in the Unix flat-file model
  - abstraction popular with applications
  - allows for parallel scans and function shipping
- Data are on servers
  - always available for access
- Overflowing servers split into new servers
  - appended to the file without informing the clients
- Queries come from multiple autonomous clients
  - available for access only on their initiative
  - no synchronous updates on the clients
- There is no centralized directory for access computations
8. What is an SDDS?
- Clients can make addressing errors
  - clients have a more or less adequate image of the actual file structure
- Servers are able to forward the queries to the correct address
  - perhaps in several messages
- Servers may send Image Adjustment Messages (IAMs)
  - clients do not make the same error twice
- See the SDDS talk for more on it
  - 192.134.119.81/witold.html
- Or the LH* ACM-TODS paper (Dec. 96)
9-13. An SDDS: growth through splits under inserts
(Animation frames: servers and clients as the file grows through splits)
14-18. An SDDS
(Animation frames: client access and forwarding; slide 16 shows an IAM sent back to the client)
19-24. Known SDDSs
(Classification diagram, built up over slides 19-24)
- DS
  - Classics
  - SDDS (1993)
    - Hash: LH*, DDH, Breitbart et al.
    - 1-d tree
    - m-d trees
    - High-availability: LH*m, LH*g; s-availability: LH*SA, LH*RS
    - Security: LH*s
25. LH* (a classic)
- Allows for primary key (OID) based hash files
  - generalizes the LH addressing schema
  - variants used in Netscape products, LH-Server, Unify, Frontpage, IIS, MsExchange...
- Typical load factor: 70 - 90 %
- In practice, at most 2 forwarding messages
  - regardless of the size of the file
- In general, 1 message per insert and 2 messages per search on average
  - 4 messages in the worst case
- Search time of 1 ms (10 Mb/s net), 150 µs (100 Mb/s net) and 30 µs (Gb/s net)
26. High-availability LH* schemes
- In a large multicomputer, it is unlikely that all servers are up
- Assume the probability that a bucket is up is 99 %
  - i.e., the bucket is unavailable 3 days per year
- If one stores every key in only 1 bucket
  - the case of typical SDDSs, LH* included
- Then the file reliability, i.e., the probability that an n-bucket file is entirely up, is
  - 37 % for n = 100
  - 0 % for n = 1000
- Acceptable for you? (a quick check of these figures follows below)
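These figures are easy to verify; a minimal Python sketch, assuming independent bucket failures and per-bucket availability p = 0.99:

```python
# Reliability of an n-bucket file when every record lives in exactly one bucket,
# assuming independent bucket failures and per-bucket availability p.
def file_reliability(n: int, p: float = 0.99) -> float:
    return p ** n

for n in (100, 1000):
    print(f"n = {n}: P(all buckets up) = {file_reliability(n):.2f}")
# n = 100: P(all buckets up) = 0.37
# n = 1000: P(all buckets up) = 0.00
```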
27. High-availability LH* schemes
- Using 2 buckets to store a key, one may expect a reliability of
  - 99 % for n = 100
  - 91 % for n = 1000
- High-availability files
  - make data available despite the unavailability of some servers
  - RAIDx, LSA, EvenOdd, DATUM...
- High-availability SDDSs
  - make sense
  - are the only way to make large SDDS files reliable
28. Known high-availability LH* schemes
- Known high-availability LH* schemes keep data available under
  - any single server failure (1-availability)
  - any n-server failure, with n fixed or scalable (n-availability or scalable availability)
  - some n'-server failures with n' > n
- Three principles for high-availability SDDS schemes are known
  - mirroring (LH*m): storage doubles; 1-availability
  - striping (LH*s): affects parallel queries; 1-availability
  - grouping (LH*g, LH*SA, LH*RS)
29. Scalable Availability
- n-availability
  - availability of all data despite the simultaneous unavailability of up to any n buckets
  - e.g., RAIDx, EvenOdd, RAV, DATUM...
- Reliability
  - the probability P that all the records are available
- Problem
  - for every fixed choice of n, P → 0 when the file scales
- Solution
  - scalable availability: n grows with the file size, to regulate P
- Constraint
  - the growth has to be incremental
30. LH*SA file (Litwin, Menon, Risch)
- An LH* file with data buckets for data records
  - provided in addition with the availability level i in each bucket
- One or more parity files with parity buckets for parity records
  - added when the file scales
  - with every bucket 0 mirroring the LH*SA file state data (i, n)
- A family of grouping functions groups data buckets into groups of size k > 1 (one possible construction is sketched after this list), such that
  - every two buckets in the same level-i group are in different level-i′ groups, for every i′ ≠ i
- There is one parity bucket per data bucket group
- Within a parity bucket, a parity record is maintained for up to k data records with the same rank
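One construction that satisfies the grouping property above (shown only as an illustration; the LH*SA paper defines its own grouping functions) writes a bucket number in base k and forms its level-i group by dropping digit i; two buckets sharing a level-i group then differ only in digit i, so they fall into different groups at every other level:

```python
# A hypothetical family of grouping functions over bucket numbers, shown only to
# illustrate the stated property; the LH*SA paper defines its own functions.
def group(bucket: int, level: int, k: int = 4, digits: int = 4) -> tuple:
    """Level-`level` group of `bucket`: its base-k digits with digit `level` removed."""
    d = [(bucket // k**j) % k for j in range(digits)]   # base-k digits, least significant first
    return tuple(d[:level] + d[level + 1:])

# Two buckets in the same level-0 group (they differ only in digit 0) ...
a, b = 5, 6          # base-4: 11 and 12
assert group(a, 0) == group(b, 0)
# ... are in different groups at every other level.
assert all(group(a, i) != group(b, i) for i in range(1, 4))
```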
31. LH*SA File Expansion (k = 2)
32. Scalable Availability (basic schema)
- Up to k data buckets, use 1st-level grouping
  - so there will be only one parity bucket
- Then, start also 2nd-level grouping
- When the file exceeds k² buckets, start 3rd-level grouping
- When the file exceeds k³ buckets, start 4th-level grouping
- Etc. (the resulting schedule is sketched below)
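The resulting schedule makes the number of grouping levels grow roughly logarithmically in the number of data buckets; a minimal sketch, assuming exactly the thresholds k, k², k³, ... above:

```python
# Number of grouping levels in use for a file of n data buckets, following the
# schedule above: 1 level up to k buckets, 2 levels up to k^2, and so on.
def availability_levels(n: int, k: int = 4) -> int:
    level, threshold = 1, k
    while n > threshold:        # the file exceeded k^level buckets: one more grouping level
        level += 1
        threshold *= k
    return level

print([availability_levels(n) for n in (2, 4, 5, 16, 17, 64, 65)])
# [1, 1, 2, 2, 3, 3, 4]
```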
33. LH*SA groups
34. LH*SA File Expansion (k = 2)
35. LH*SA Recovery
- Bucket or record unavailability is detected
  - by the client during a search or update
  - or by a forwarding server
- The coordinator is alerted to perform the recovery
  - to bypass the unavailable bucket
  - or to restore the record on the fly
  - or to restore the bucket in a spare
- The recovered record is delivered to the client
36. LH*SA Bucket / Record Recovery
- Try the 1st-level group of the unavailable bucket m
- If other buckets are found unavailable in this group
  - try to recover each of them using their 2nd-level groups
- And so on, recursively
- Come back to finally recover bucket m
- See the paper for the full algorithms (a toy sketch of the cascade follows below)
- For an I-available file, it is sometimes possible to recover a record even when more than I buckets in a group are unavailable
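A toy sketch of this recovery cascade, under simplifying assumptions: XOR parity per group, the hypothetical digit-dropping grouping from the earlier sketch, and recursion to the next level whenever a group member is itself unavailable (the full algorithms are in the paper):

```python
# Toy model of the multi-level recovery cascade; not the paper's algorithm.
K, LEVELS, DIGITS = 2, 2, 2            # 4 data buckets, 2 grouping levels
BUCKETS = range(K ** DIGITS)

def group(bucket, level):
    d = [(bucket // K**j) % K for j in range(DIGITS)]
    return tuple(d[:level] + d[level + 1:])

values = {b: (b + 1) * 11 for b in BUCKETS}              # toy bucket contents
parity = {(lvl, group(b, lvl)): 0 for lvl in range(LEVELS) for b in BUCKETS}
for lvl in range(LEVELS):
    for b in BUCKETS:
        parity[(lvl, group(b, lvl))] ^= values[b]        # XOR parity per group

def recover(m, level, available, in_progress=frozenset()):
    """Rebuild bucket m from its level-`level` group, recursing to the next level
    for group members that are themselves unavailable."""
    if m in in_progress or level >= LEVELS:
        raise RuntimeError(f"bucket {m} cannot be recovered")
    acc = parity[(level, group(m, level))]
    for b in BUCKETS:
        if b == m or group(b, level) != group(m, level):
            continue
        acc ^= values[b] if b in available else recover(b, level + 1, available, in_progress | {m})
    return acc

# Buckets 0 and 1 share a level-0 group; with both down, bucket 1 is first
# rebuilt through its level-1 group, then bucket 0 through its level-0 group.
available = set(BUCKETS) - {0, 1}
assert recover(0, 0, available) == values[0]
```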
37. LH*SA normal recovery
38. Good-case recovery (I = 3)
39. Scalability Analysis: Search Performance
- Search
  - usually the same cost as for LH*
  - including the parallel queries
  - 2 messages per key search
- In degraded mode
  - usually O(k)
    - the record reconstruction cost using 1st-level parity
  - worst case O((I + 1) k)
40. Insert Performance
- Usually (I + 1) or (I + 2) messages
  - (I + 1) is the best possible value for any I-availability schema
- In degraded mode
  - about the same, if the unavailable bucket can be bypassed
  - add the bucket recovery cost otherwise
  - the client cost is only a few messages, to deliver the record to the coordinator
41. Split Performance
- The LH* split cost, which is O(b/2)
  - b is the bucket capacity
  - one message per record
- Plus usually O(I·b) messages to the parity buckets
  - to recompute the (XOR) parity bits, since usually all records get new ranks
- Plus O(b) messages when a new bucket is created
42. Storage Performance
- Storage overhead cost Cs = S' / S
  - S' - storage for the parity files
  - S - storage for the data buckets
    - practically, the LH* file storage cost
- Cs depends on the file availability level I reached
- While building the new level I + 1 (see the check below):
  - Cs starts from the lower bound L_I = I / k
    - for file size M = k^I
    - the best possible value for any I-availability schema
  - increases towards an upper bound U_{I+1} = O(I / k)
    - as long as new splits add parity buckets
  - decreases towards L_{I+1} afterwards
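The lower bound L_I = I / k follows directly from slides 30 and 32: a file of M = k^I data buckets has M/k groups per level and I levels, hence I·M/k parity buckets. A small sanity check of that arithmetic, assuming parity buckets are roughly the size of data buckets:

```python
# Sanity check of the lower bound L_I = I / k, assuming one parity bucket per
# group of k data buckets at each of the I grouping levels.
def storage_overhead(I: int, k: int) -> float:
    data_buckets = k ** I                    # file size M = k^I
    parity_buckets = I * data_buckets // k   # M/k groups per level, I levels
    return parity_buckets / data_buckets

for k in (4, 8):
    for I in (1, 2, 3):
        assert storage_overhead(I, k) == I / k
print("L_I = I/k confirmed for the sampled (I, k)")
```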
43. Example
44. Reliability
- Probability P that all records are available to the application
  - all the data buckets are available
  - or every record can be recovered
    - i.e., at most I buckets have failed in any LH*SA group
- Depends on
  - the failure probability p of each site
  - the group size k
  - the file size M
- The reliability of the basic LH*SA schema is termed uncontrolled
45. Uncontrolled Reliability
(Plot of the uncontrolled reliability of LH*SA)
46. Controlled Reliability
- Keeps the reliability above, or close to, a given threshold through
  - delaying or accelerating the availability level growth
  - or gracefully changing the group size k
- Necessary for higher values of p
  - the case of less reliable sites
  - a frequent situation on network multicomputers
- May improve performance for small p's
- Several schemes are possible
47. Controlled Reliability with Fixed Group Size
(Plot: k = 4, T = 0.8, p = 0.2)
48. Controlled Reliability with Variable Group Size
(Plot: p = 0.01, T = 0.95)
49. LH*RS (Litwin & Schwarz)
- A single grouping function
  - groups of consecutive buckets: 1, 2, 3, 4; 5, 6, 7, 8; ...
- Multiple parity buckets per group
- Scalable availability
  - 1 parity bucket per group until the file reaches 2^i1 buckets
  - then, at each split, add a 2nd parity bucket to each existing group, or create 2 parity buckets for new groups, until 2^i2 buckets
  - etc.
50-52. LH*RS File Expansion
(Diagrams over three slides: the data and parity buckets through successive splits)
53. LH*RS Parity Calculus
- Choose GF(2^l)
  - typically GF(16) or GF(256)
- Create the k x n generator matrix G
  - using elementary transformations of an extended Vandermonde matrix of GF elements
  - k is the record group size
  - n = 2^l is the max segment size (data and parity records)
  - G = [I | P], where I denotes the identity matrix
- Each record is a sequence of symbols from GF(2^l)
- The k symbols with the same offset in the records of a group become the (horizontal) information vector U
- The matrix multiplication U · G provides the (n - k) parity symbols, i.e., the rest of the codeword vector
54. LH*RS Parity Calculus
- The parity calculus is distributed to the parity buckets
  - each column is at one bucket
- Parity is calculated only for the existing data and parity buckets
  - at each insert, delete and update
- Adding new parity buckets does not change the existing parity records
55. Example: GF(4)
- Addition: XOR
- Multiplication: direct table, or log / antilog tables (coded below)
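A minimal coded version of this arithmetic, using the standard GF(4) tables for the polynomial x² + x + 1 (the slide's own tables are the authoritative ones):

```python
# GF(4) = {0, 1, 2, 3}, with 2 and 3 standing for the polynomials x and x + 1.
def gf4_add(a: int, b: int) -> int:
    return a ^ b                    # addition (and subtraction) is bitwise XOR

GF4_MUL = [
    [0, 0, 0, 0],
    [0, 1, 2, 3],
    [0, 2, 3, 1],                   # x * x = x + 1, x * (x + 1) = 1
    [0, 3, 1, 2],
]
def gf4_mul(a: int, b: int) -> int:
    return GF4_MUL[a][b]

GF4_INV = [None, 1, 3, 2]           # multiplicative inverses of 1, 2, 3

assert gf4_add(2, 3) == 1 and gf4_mul(2, 3) == 1 and gf4_mul(3, 3) == 2
```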
56-58. Encoding
(Worked example over three slides: the records of a group and the codewords computed from them; a code sketch of the encoding follows)
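A sketch of the encoding step, under illustrative assumptions: k = 2, n = 4, GF(4) symbols, and a made-up systematic generator G = [I | P] (the real LH*RS generator is derived from an extended Vandermonde matrix, per slide 53):

```python
# Systematic encoding of a group of k = 2 records over GF(4); the matrix G below
# is a hypothetical MDS generator chosen for illustration, not the paper's.
GF4_MUL = [[0, 0, 0, 0], [0, 1, 2, 3], [0, 2, 3, 1], [0, 3, 1, 2]]
gf4_add = lambda a, b: a ^ b
gf4_mul = lambda a, b: GF4_MUL[a][b]

G = [                      # k x n = 2 x 4, G = [I | P]
    [1, 0, 1, 1],
    [0, 1, 1, 2],
]

def encode(u):             # u: information vector of k data symbols
    return [
        # codeword symbol j = sum_i u[i] * G[i][j]   (sum in GF(4), i.e. XOR)
        gf4_add(gf4_mul(u[0], G[0][j]), gf4_mul(u[1], G[1][j]))
        for j in range(4)
    ]

# Symbols at the same offset in the two data records of the group:
u = [1, 3]
print(encode(u))           # [1, 3, 2, 0] -> first k symbols are the data, last 2 are parity
```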
59. LH*RS Recovery Calculus
- Performed when at most n - k buckets are unavailable, among the data and the parity buckets of a group
- Choose k available buckets
- Form the submatrix H of G from the corresponding columns
- Invert this matrix into the matrix H^-1
- Multiply the horizontal vector S of the available symbols with the same offset by H^-1
- The result contains the recovered data and/or parity symbols (a code sketch follows the example slides below)
60-63. Example: Recovery
(Worked example over four slides: the available buckets and the recovered symbols / buckets)
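A toy version of the recovery calculus, continuing the encoding sketch above; the generator matrix, the chosen columns and the 2×2 inversion formula are illustrative assumptions, not the paper's exact example:

```python
# Toy recovery over GF(4) for k = 2, n = 4, using the hypothetical G = [I | P]
# from the encoding sketch; bucket 0 is assumed unavailable.
GF4_MUL = [[0, 0, 0, 0], [0, 1, 2, 3], [0, 2, 3, 1], [0, 3, 1, 2]]
GF4_INV = [None, 1, 3, 2]
mul, add = (lambda a, b: GF4_MUL[a][b]), (lambda a, b: a ^ b)

G = [[1, 0, 1, 1],
     [0, 1, 1, 2]]
codeword = [1, 3, 2, 0]          # from the encoding sketch

cols = [1, 2]                    # choose k = 2 available buckets (data bucket 1, 1st parity)
H = [[G[i][j] for j in cols] for i in range(2)]
S = [codeword[j] for j in cols]  # their symbols at the common offset

# Invert the 2x2 matrix H over GF(4): H^-1 = det^-1 * [[d, b], [c, a]]  (char 2, so -x = x)
(a, b), (c, d) = H
det_inv = GF4_INV[add(mul(a, d), mul(b, c))]
H_inv = [[mul(det_inv, d), mul(det_inv, b)],
         [mul(det_inv, c), mul(det_inv, a)]]

# U = S * H^-1 recovers the information vector, hence the unavailable data symbol.
U = [add(mul(S[0], H_inv[0][j]), mul(S[1], H_inv[1][j])) for j in range(2)]
print(U)                         # [1, 3] -> the symbol of the unavailable bucket 0 is 1
```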
64. Conclusion
- High availability is an important property of an SDDS
- Its design should preserve scalability, parallelism and reliability
- Schemes using record grouping seem the most appropriate
65. Future Work
- Performance analysis of LH*RS
- Implementation of any of the high-availability SDDSs
  - LH*RS is now being implemented at CERIA by Mattias Ljungström
- High-availability variants of other known SDDSs
66. End
Thank you for your attention
Witold Litwin - witold.litwin_at_dauphine.fr