High-Availability in Scalable Distributed Data Structures (presentation transcript)
1
High-Availability in Scalable Distributed Data
Structures
W. Litwin
  • Witold.Litwin@dauphine.fr

2
Plan
  • What are SDDSs ?
  • High-Availability SDDSs
  • LH with scalable availability
  • Conclusion

3
Multicomputers
  • A collection of loosely coupled computers
  • common and/or preexisting hardware
  • share nothing architecture
  • message passing through a high-speed net
    (100 Mb/s or more)
  • Network multicomputers
  • use general purpose nets
  • LANs: Fast Ethernet, Token Ring, SCI, FDDI,
    Myrinet, ATM
  • NCSA cluster: 512 NTs on Myrinet by the end of
    1998
  • Switched multicomputers
  • use a bus, or a switch
  • IBM-SP2, Parsytec...

4
Network multicomputer
(Diagram: client and server machines connected by the network)
5
Why multicomputers ?
  • Unbeatable price-performance ratio
  • Much cheaper and more powerful than
    supercomputers
  • especially the network multicomputers
  • 1500 WSs at HPL with 500 GB of RAM and TBs of
    disks
  • Computing power
  • file size, access and processing times,
    throughput...
  • For more pros and cons
  • IBM SP2 and GPFS literature
  • Tanenbaum "Distributed Operating Systems",
    Prentice Hall, 1995
  • NOW project (UC Berkeley)
  • Bill Gates at Microsoft Scalability Day, May 1997
  • www.microsoft.com White Papers from Business
    Syst. Div.
  • Report to the President, President's Inf. Techn.
    Adv. Comm., Aug. 98

6
Why SDDSs
  • Multicomputers need data structures and file
    systems
  • Trivial extensions of traditional structures are
    not the best
  • hot-spots
  • scalability
  • parallel queries
  • distributed and autonomous clients
  • distributed RAM and distance to data

7
What is an SDDS ?
  • Data are structured
  • records with keys or objects with an OID
  • more semantics than in Unix flat-file model
  • abstraction popular with applications
  • allows for parallel scans
  • function shipping
  • Data are on servers
  • always available for access
  • Overflowing servers split into new servers
  • appended to the file without informing the
    clients
  • Queries come from multiple autonomous clients
  • available for access only on their initiative
  • no synchronous updates on the clients
  • There is no centralized directory for access
    computations

8
What is an SDDS ?
  • Clients can make addressing errors
  • Clients have a more or less adequate image of the
    actual file structure
  • Servers are able to forward the queries to the
    correct address
  • perhaps in several messages
  • Servers may send Image Adjustment Messages
  • Clients do not make the same error twice
  • See the SDDS talk for more on it
  • 192.134.119.81/witold.html
  • Or the LH ACM-TODS paper (Dec. 96)

9 - 18
An SDDS
growth through splits under inserts
(Animated diagrams: servers and clients; the file grows through splits, and an IAM is sent to a client after an addressing error)
19 - 24
Known SDDSs
(Taxonomy diagram, built up step by step: classic data structures vs. SDDSs (1993); hash-based schemes (LH, DDH, Breitbart et al.), 1-d trees, m-d trees; high-availability schemes (LHm, LHg, LHSA with scalable availability, LHRS); security (LHs))
25
LH (A classic)
  • Allows for primary key (OID) based hash files
  • generalizes the LH addressing schema
  • variants used in Netscape products, LH-Server,
    Unify, Frontpage, IIS, MsExchange...
  • Typical load factor 70 - 90 %
  • In practice, at most 2 forwarding messages
  • regardless of the size of the file
  • In general, 1 m/insert and 2 m/search on the
    average
  • 4 messages in the worst case
  • Search time of 1 ms (10 Mb/s net), of 150 µs
    (100 Mb/s net) and of 30 µs (Gb/s net)

26
High-availability LH schemes
  • In a large multicomputer, it is unlikely that all
    servers are up
  • Consider that the probability that a bucket is up is
    99 %
  • the bucket is then unavailable about 3 days per year
  • If one stores every key in only 1 bucket
  • case of typical SDDSs, LH included
  • Then the file reliability, i.e., the probability that the
    n-bucket file is entirely up, is
  • 37 % for n = 100
  • about 0 % for n = 1000
  • Acceptable for yourself ?

27
High-availability LH schemes
  • Using 2 buckets to store a key, one may expect
    a reliability of
  • 99 % for n = 100
  • 91 % for n = 1000
  • High-availability files
  • make data available despite unavailability of
    some servers
  • RAIDx, LSA, EvenOdd, DATUM...
  • High-availability SDDS
  • make sense
  • are the only way to reliable large SDDS files
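A quick check of these figures, as a minimal Python sketch (independent bucket failures assumed; 0.99 is the per-bucket availability from the two slides above):

  # Probability that an n-bucket file is entirely up when each
  # bucket is up independently with probability p.
  def single_copy_reliability(p, n):
      return p ** n

  # With every key stored in 2 buckets, the file is up unless both
  # copies of some bucket are down at once.
  def two_copy_reliability(p, n):
      return (1 - (1 - p) ** 2) ** n

  p = 0.99
  for n in (100, 1000):
      print(n, round(single_copy_reliability(p, n), 2),
            round(two_copy_reliability(p, n), 2))
  # prints roughly: 100 0.37 0.99   and   1000 0.0 0.9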

28
Known High-availability LH schemes
  • Known high-availability LH schemes keep data
    available under
  • any single server failure (1-availability)
  • any n-server failure
  • n fixed or scalable (n-availability or scalable
    availability)
  • some n'-server failures, n' > n
  • Three principles for high-availability SDDS
    schemes are known
  • mirroring (LHm)
  • storage doubles, 1-availability
  • striping (LHs)
  • affects parallel queries, 1-availability
  • grouping (LHg, LHSA, LHRS)

29
Scalable Availability
  • n-availability
  • availability of all data despite simultaneous
    unavailability of up to any n buckets
  • E.g., RAIDx, EvenOdd, RAV, DATUM...
  • Reliability
  • probability P that all the records are available
  • Problem
  • For every choice of n, P → 0 when the file
    scales.
  • Solution
  • Scalable availability
  • n grows with the file size, to regulate P
  • Constraint
  • the growth has to be incremental

30
LHsa file (Litwin, Menon, Risch)
  • An LH file with data buckets for data records
  • provided in addition with the availability level
    i in each bucket
  • One or more parity files with parity buckets for
    parity records
  • added when the file scales
  • with every bucket 0 mirroring the LHsa file
    state data (i, n)
  • A family of grouping functions groups data buckets
    into groups of size k > 1 such that
  • every two buckets in the same level-i group are in
    different groups at any other level i' ≠ i (see the sketch below)
  • There is one parity bucket per data bucket group
  • Within a parity bucket, a parity record is
    maintained for up to k data records with the same
    rank
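One plausible way to build such a family (a minimal sketch under the assumption that bucket numbers are written in base k and that the level-i group is determined by all base-k digits except digit i; this illustrates the stated property, it is not necessarily the exact construction of the LHsa paper):

  # Level-i grouping key of bucket m for group size k: the tuple of
  # m's base-k digits with the digit owned by level i removed.
  def group_key(m, k, level, ndigits=4):
      digits = [(m // k**j) % k for j in range(ndigits)]
      del digits[level - 1]
      return tuple(digits)

  # Check the property on a small file: two buckets sharing a level-i
  # group never share a group at another level j != i.
  k, ndigits = 2, 4
  buckets = range(k ** ndigits)
  for i in range(1, ndigits + 1):
      for j in range(1, ndigits + 1):
          if i == j:
              continue
          for a in buckets:
              for b in buckets:
                  if a != b and group_key(a, k, i) == group_key(b, k, i):
                      assert group_key(a, k, j) != group_key(b, k, j)
  print("property holds for k =", k)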

31
LHsa File Expansion (k = 2)
32
Scalable Availability(basic schema)
  • Up to k data buckets, use 1-st level grouping
  • so there will be only one parity bucket
  • Then, start also 2-nd level grouping
  • When the file exceeds k^2 buckets, start 3-rd level
    grouping
  • When the file exceeds k^3 buckets, start 4-th level
    grouping
  • Etc.
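A minimal sketch of this rule (assuming the availability level simply increments each time the file exceeds the next power of k):

  # Availability level of an LHsa file with M data buckets and
  # group size k: level 1 up to k buckets, level 2 up to k^2, etc.
  def availability_level(M, k):
      level, limit = 1, k
      while M > limit:
          level += 1
          limit *= k
      return level

  for M in (2, 3, 4, 5, 16, 17):
      print(M, availability_level(M, 2))
  # prints levels 1, 2, 2, 3, 4, 5 for k = 2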

33
LHsa groups
34
LHsa File Expansion (k = 2)
35
LHsa Recovery
  • Bucket or record unavailability is detected
  • by the client during the search or update
  • by forwarding server
  • Coordinator is alerted to perform the recovery
  • to bypass the unavailable bucket
  • or to restore the record on the fly
  • or to restore the bucket in a spare
  • The recovered record is delivered to the client

36
LHsa Bucket and Record Recovery
  • Try the 1-st level group for the unavailable
    bucket m
  • If other buckets are found unavailable in this
    group
  • try to recover each of them using 2nd level
    groups
  • And so on
  • Come back to finally recover bucket m
  • See the paper for the full algorithms
  • For an I-available file, it is sometimes possible to
    recover a record even when more than I
    buckets in a group are unavailable

37
LHsa normal recovery
38
Good case recovery (I = 3)
39
Scalability Analysis: Search Performance
  • Search
  • usually the same cost as for LH
  • including parallel queries
  • 2 messages per key search
  • in degraded mode
  • usually O (k)
  • record reconstruction cost using 1-st level
    parity
  • worst case O ((I+1) k)

40
Insert Performance
  • Usually (I+1) or (I+2)
  • (I+1) is the best possible value for every
    I-availability schema
  • In degraded mode
  • about the same if the unavailable bucket can be
    bypassed
  • add the bucket recovery cost otherwise
  • the client cost is only a few messages to deliver
    the record to the coordinator

41
Split Performance
  • LH split cost that is O (b/2)
  • b is bucket capacity
  • one message per record
  • Plus usually O (Ib) messages to parity buckets
  • to recompute (XOR) parity bits since usually all
    records get new ranks
  • Plus O (b) messages when new bucket is created
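For instance, with bucket capacity b = 1000 records and availability level I = 2, a split costs roughly 500 messages to move the records, about 2000 messages to the parity buckets, and about 1000 messages for the new bucket, i.e., on the order of 3500 messages in total.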

42
Storage Performance
  • Storage overhead cost Cs = S' / S
  • S' - storage for the parity files
  • S - storage for the data buckets
  • practically, the LH file storage cost
  • Cs depends on the file availability level I
    reached
  • While building up the new level I + 1
  • Cs starts from the lower bound L_I = I / k
  • for file size M = k^I
  • the best possible value for any I-availability
    schema
  • Increases towards an upper bound
  • U_I ≈ 1 + O( ½ I / k )
  • as long as new splits add parity buckets
  • Decreases towards L_(I+1) afterwards
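For instance, with k = 4 the lower bound is L_1 = 1/4 = 25 % at M = 4 buckets and L_2 = 2/4 = 50 % at M = 16 buckets.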

43
Example
44
Reliability
  • Probability P that all records are available to
    the application
  • all the data buckets are available
  • or every record can be recovered
  • i.e., when at most I buckets in an LHsa group
    have failed
  • Depends on
  • failure probability p of each site
  • group size k
  • file size M
  • Reliability of basic LHsa schema is termed
    uncontrolled
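A minimal sketch of the kind of computation behind the following curves (it assumes independent failures and treats each group as k data buckets plus I parity buckets that tolerate up to I unavailable members; the exact LHsa analysis is more involved):

  from math import comb

  # Probability that one group of g = k + I buckets stays recoverable,
  # i.e., at most I of its members are unavailable (failure prob. p each).
  def group_reliability(p, k, I):
      g = k + I
      return sum(comb(g, j) * p**j * (1 - p)**(g - j) for j in range(I + 1))

  # Rough file reliability for M data buckets: M / k such groups.
  def file_reliability(p, k, I, M):
      return group_reliability(p, k, I) ** (M // k)

  print(round(file_reliability(p=0.01, k=4, I=1, M=100), 3))   # about 0.976
  print(round(file_reliability(p=0.01, k=4, I=2, M=1000), 3))  # about 0.995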

45
Uncontrolled Reliability
LHsa
46
Controlled Reliability
  • To keep the reliability above or close to a given
    threshold T, through
  • delaying or accelerating the availability level
    growth
  • or gracefully changing group size k
  • Necessary for higher values of p
  • case of less reliable sites
  • frequent situation on network multicomputers
  • May improve performance for small values of p.
  • Several schemes are possible

47
Controlled Reliability with Fixed Group Size
(Plot: k = 4, T = 0.8, p = 0.2)
48
Controlled Reliability with Variable Group Size
(Plot: p = 0.01, T = 0.95)
49
LHRS (Litwin & Schwarz)
  • Single grouping function
  • e.g., groups (1, 2, 3, 4), (5, 6, 7, 8)
  • Multiple parity buckets per group
  • Scalable availability
  • 1 parity bucket per group until 2^(i1) buckets
  • Then, at each split, add a 2nd parity bucket to
    each existing group or create 2 parity buckets
    for new groups, until 2^(i2) buckets
  • etc.

50 - 52
LHRS File Expansion (diagrams)
53
LHRS Parity Calculus
  • Choose GF(2^l)
  • Typically GF(16) or GF(256)
  • Create the k x n generator matrix G
  • using elementary transformations of an extended
    Vandermonde matrix of GF elements
  • k is the record group size
  • n = 2^l is the max segment size (data and parity
    records)
  • G = [ I | P ]
  • I denotes the identity matrix
  • I denotes the identity matrix
  • Each record is a sequence of symbols from GF(2^l)
  • The k symbols with the same offset in the records
    of a group become the (horizontal) information
    vector U
  • The matrix multiplication U · G provides the
    codeword vector, whose last (n - k) symbols are the parity symbols

54
LHRS Parity Calculus
  • Parity calculus is distributed to parity buckets
  • each column is at one bucket
  • Parity is calculated only for existing data and
    parity buckets
  • At each insert, delete and update
  • Adding new parity buckets does not change
    existing parity records

55
Example: GF(4)
Addition: XOR. Multiplication: direct table or
log / antilog tables
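A minimal sketch of this arithmetic and of the encoding step in Python (the log / antilog tables are for GF(4) generated by x^2 + x + 1; the 2 x 4 generator matrix G = [I | P] used on a toy group of k = 2 symbols is only an illustrative MDS matrix, not necessarily the one LHRS derives from the Vandermonde matrix):

  # GF(4) = {0, 1, 2, 3}; addition is bitwise XOR.
  # Multiplication via log / antilog tables, base alpha = 2.
  LOG = {1: 0, 2: 1, 3: 2}      # discrete logs of the non-zero elements
  ANTILOG = [1, 2, 3]           # alpha^0, alpha^1, alpha^2

  def gf_add(a, b):
      return a ^ b

  def gf_mul(a, b):
      if a == 0 or b == 0:
          return 0
      return ANTILOG[(LOG[a] + LOG[b]) % 3]

  def gf_sum(values):
      out = 0
      for v in values:
          out = gf_add(out, v)
      return out

  # Illustrative generator matrix G = [I | P] for k = 2 data symbols
  # and 2 parity symbols; any 2 of its 4 columns are independent.
  G = [[1, 0, 1, 1],
       [0, 1, 1, 2]]

  def encode(u):
      # Codeword = u . G over GF(4); the last 2 symbols are parity.
      return [gf_sum(gf_mul(u[i], G[i][j]) for i in range(2))
              for j in range(4)]

  print(encode([1, 2]))         # -> [1, 2, 3, 2]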
56 - 58
Encoding
(Diagrams: the records of a group and the resulting codewords, e.g., symbols 01 01 01 01 00 ...)
59
LHRS Recovery Calculus
  • Performed when at most n - k buckets are
    unavailable, among the data and the parity
    buckets of a group
  • Choose k available buckets
  • Form the submatrix H of G from the corresponding
    columns
  • Invert this matrix into matrix H^-1
  • Multiply the horizontal vector S of available
    symbols with the same offset by H^-1
  • The result contains the recovered data and/or
    parity symbols
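A minimal, self-contained sketch of this recovery step, continuing the GF(4) toy example (direct multiplication table this time; the generator matrix is the same illustrative [I | P], and we assume both data buckets of the group are unavailable, so the two surviving parity columns are used):

  # GF(4) via a direct multiplication table; addition is XOR.
  MUL = [[0, 0, 0, 0],
         [0, 1, 2, 3],
         [0, 2, 3, 1],
         [0, 3, 1, 2]]
  INV = {1: 1, 2: 3, 3: 2}      # multiplicative inverses

  def mat2_inv(h):
      # Invert a 2 x 2 matrix over GF(4); characteristic 2, so no signs.
      (a, b), (c, d) = h
      det_inv = INV[MUL[a][d] ^ MUL[b][c]]
      return [[MUL[det_inv][d], MUL[det_inv][b]],
              [MUL[det_inv][c], MUL[det_inv][a]]]

  # Illustrative G = [I | P]; columns 0, 1 (the data buckets) are
  # unavailable, columns 2, 3 (the parity buckets) survive.
  G = [[1, 0, 1, 1],
       [0, 1, 1, 2]]
  available_cols = [2, 3]
  S = [3, 2]                    # surviving symbols of codeword [1, 2, 3, 2]

  H = [[G[i][j] for j in available_cols] for i in range(2)]
  H_inv = mat2_inv(H)

  # Recovered information vector U = S . H^-1.
  U = [MUL[S[0]][H_inv[0][j]] ^ MUL[S[1]][H_inv[1][j]] for j in range(2)]
  print(U)                      # -> [1, 2], the original data symbols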

60 - 63
Recovery Example
(Diagrams: the available buckets and the recovered symbols / buckets, e.g., 01 01 01 00 00 00 ...)
64
Conclusion
  • High-availability is an important property of an
    SDDS
  • Its design should preserve scalability,
    parallelism and reliability
  • Schemes using record grouping seem most
    appropriate

65
Future Work
  • Performance analysis of LHRS
  • Implementation of any of the high-availability SDDSs
  • LHRS is now implemented at CERIA by Mattias
    Ljungström
  • High-availability variants of other known SDDSs

66
End
Thank you for your attention
Witold Litwin, witold.litwin@dauphine.fr