Title: Scalable Distributed Data Structures
1. Scalable Distributed Data Structures
High-Performance Computing
Witold Litwin, Fethi Bennour
CERIA, University Paris 9 Dauphine
http://ceria.dauphine.fr/
2. Plan
- Multicomputers for HPC
- What are SDDSs?
- Overview of LH*
- Implementation under SDDS-2000
- Conclusion
3. Multicomputers
- A collection of loosely coupled computers
  - mass-produced and/or preexisting hardware
  - share-nothing architecture
  - best for HPC because of scalability
  - message passing through a high-speed net (100 Mb/s or more)
- Network multicomputers
  - use general-purpose nets and PCs
  - LANs: Fast Ethernet, Token Ring, SCI, FDDI, Myrinet, ATM
  - NCSA cluster: 1024 NTs on Myrinet by the end of 1999
- Switched multicomputers
  - use a bus or a switch
  - IBM SP2, Parsytec...
4. Why Multicomputers?
- Unbeatable price/performance ratio for HPC
- Cheaper and more powerful than supercomputers
  - especially the network multicomputers
- Available everywhere
- Computing power
  - file size, access and processing times, throughput...
- For more pros and cons:
  - IBM SP2 and GPFS literature
  - Tanenbaum, "Distributed Operating Systems", Prentice Hall, 1995
  - NOW project (UC Berkeley)
  - Bill Gates at Microsoft Scalability Day, May 1997
  - www.microsoft.com, White Papers from the Business Systems Div.
  - Report to the President, President's Inf. Techn. Adv. Comm., Aug 98
5. Typical Network Multicomputer
(diagram: client and server machines on a general-purpose network)
6. Why SDDSs?
- Multicomputers need data structures and file systems
- Trivial extensions of traditional structures are not best:
  - hot-spots
  - scalability
  - parallel queries
  - distributed and autonomous clients
  - distributed RAM and distance to data
- For a CPU, data on a disk are as far away as the Moon is for a human (J. Gray, ACM Turing Award 1999)
7. What is an SDDS?
- Data are structured
  - records with keys, objects with OIDs
  - more semantics than in the Unix flat-file model
  - the abstraction most popular with applications
  - parallel scans and function shipping
- Data are on servers
  - waiting for access
- Overflowing servers split into new servers
  - appended to the file without informing the clients
- Queries come from multiple autonomous clients
  - access initiators
  - not supporting synchronous updates
  - not using any centralized directory for access computations
8. What is an SDDS?
- Clients can make addressing errors
  - clients have a more or less adequate image of the actual file structure
- Servers are able to forward the queries to the correct address
  - perhaps in several messages
- Servers may send Image Adjustment Messages (IAMs)
  - clients do not make the same error twice
- Servers support parallel scans
  - sent out by multicast or unicast
  - with deterministic or probabilistic termination
- See the SDDS talks and papers for more
  - ceria.dauphine.fr/witold.html
  - or the LH* ACM TODS paper (Dec. 96)
9. High-Availability SDDS
- A server can be unavailable for access without service interruption
- Data are reconstructed from the other servers
  - data and parity servers
- Up to k ≥ 1 servers can fail
  - at a parity overhead cost of about 1/k
- The factor k can itself scale with the file
  - scalable-availability SDDSs
10-13. An SDDS: growth through splits under inserts
(diagram, built up over four slides: clients insert into the servers; overflowing buckets split onto new servers appended to the file)
14-18. An SDDS: Client Access
(diagram, built up over five slides: a client request reaching a wrong server is forwarded to the correct one, which sends the client an IAM)
19-24. Known SDDSs
(taxonomy diagram, built up over six slides)
- Classic data structures vs. SDDSs (since 1993)
- Hash: LH*, DDH, Breitbart & al.
- 1-d tree schemes
- m-d tree schemes
- High-availability (s-availability): LH*m, LH*g, LH*SA, LH*RS
- Security: LH*s
- Disk: SD-LSA
http://192.134.119.81/SDDS-bibliograhie.html
25. LH* (a classic)
- Scalable distributed hash partitioning
  - generalizes the LH addressing schema
  - variants used in Netscape products, LH-Server, Unify, FrontPage, IIS, MS Exchange...
- Typical load factor: 70 - 90 %
- In practice, at most 2 forwarding messages
  - regardless of the size of the file
- In general, 1 message per insert and 2 messages per search on average
- 4 messages in the worst case
26. LH* bucket servers
- For every record c, its correct address a results from the LH addressing rule:
  - a ← h_i(c)
  - if n = 0 then exit
  - else if a < n then a ← h_{i+1}(c)
- (i, n): the file state, known only to the LH*-coordinator
- Each server a keeps track only of the function h_j used to access it
  - j = i or j = i + 1
(a sketch of this rule follows below)
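As a concrete illustration, here is a minimal Python sketch of the addressing rule above. It is not SDDS-2000 code; it assumes the usual linear-hashing function family h_j(c) = c mod 2^j, and the function names are invented.

    def h(j, c):
        """Linear-hashing function family: h_j(c) = c mod 2^j."""
        return c % (2 ** j)

    def correct_address(c, i, n):
        """LH* addressing rule for key c, given the true file state (i, n)
        known only to the coordinator."""
        a = h(i, c)
        if n != 0 and a < n:      # bucket a has already split at level i
            a = h(i + 1, c)
        return a

    # Example with the file state of the later slides, i = 3, n = 2:
    assert correct_address(15, 3, 2) == 7   # 15 mod 8 = 7, not below n
    assert correct_address(9, 3, 2) == 9    # 9 mod 8 = 1 < n, so h_4(9) = 9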
27. LH* clients
- Each client uses the LH* rule for address computation, but with the client's own image (i', n') of the file state
- Initially, for a new client, (i', n') = (0, 0)
(sketch below)
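The same rule applied on the client side, to the client image instead of the true file state; again an illustrative sketch with invented names, not SDDS-2000 code.

    def h(j, c):                        # h_j(c) = c mod 2^j, as before
        return c % (2 ** j)

    def client_address(c, i_img, n_img):
        """Address computation on the client, using its possibly outdated
        image (i', n') of the file state."""
        a = h(i_img, c)
        if a < n_img:
            a = h(i_img + 1, c)
        return a

    # A new client starts with (i', n') = (0, 0) and therefore sends every
    # key to bucket 0 until IAMs adjust its image.
    assert client_address(15, 0, 0) == 0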
28. LH* Server Address Verification and Forwarding
- Server a getting key c (a = m in particular) computes:
  - a' ← h_j(c)
  - if a' = a then accept c
  - else a'' ← h_{j-1}(c)
    - if a'' > a and a'' < a' then a' ← a''
    - send c to bucket a'
(sketch below)
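A minimal Python sketch of this verification-and-forwarding rule (illustrative names, not SDDS-2000 code). The asserts trace the two-forwarding worst case mentioned on slide 25.

    def h(j, c):                        # h_j(c) = c mod 2^j, as before
        return c % (2 ** j)

    def server_resolve(c, a, j):
        """LH* address verification on server a (bucket level j) for key c.
        Returns None to accept the key locally, otherwise the bucket
        address to forward it to."""
        a1 = h(j, c)                    # a' of the slide
        if a1 == a:
            return None                 # correct bucket: accept c
        a2 = h(j - 1, c)                # a'' of the slide
        if a < a2 < a1:
            return a2                   # tighter guess
        return a1

    # With the file state of the later slides (i = 3, n = 3): key 9 sent to
    # bucket 0 is forwarded to bucket 1, then to bucket 9, which accepts it.
    assert server_resolve(9, 0, 4) == 1
    assert server_resolve(9, 1, 4) == 9
    assert server_resolve(9, 9, 4) is None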
29. Client Image Adjustment
- The IAM consists of the address a where the client's key c ended up and of j(a)
- if j > i' then i' ← j - 1, n' ← a + 1
- if n' ≥ 2^i' then n' ← 0, i' ← i' + 1
- The rule guarantees that the client image stays within the file
  - provided there are no file contractions (merges)
(sketch below)
30-31. LH* file structure
(diagram, two identical frames) The servers hold buckets 0 and 1 (j = 4), 2 ... 7 (j = 3), 8 and 9 (j = 4). The coordinator holds the file state n = 2, i = 3. Two clients hold the images (n' = 0, i' = 0) and (n' = 3, i' = 2).
32-34. LH* split
(diagram, three frames) The coordinator triggers the split of bucket n = 2. A new bucket 10 (j = 4) is appended and bucket 2 is rehashed to level j = 4; the file state becomes n = 3, i = 3. The client images, (n' = 0, i' = 0) and (n' = 3, i' = 2), are unchanged.
35-37. LH* addressing
(diagram, three frames) The client with image (n' = 0, i' = 0) sends key 15 to bucket 0. Bucket 0 (j = 4) forwards it to the correct bucket 7 (j = 3), which keeps the key and returns an IAM with (a = 7, j = 3); the client image becomes (n' = 0, i' = 3). The other client's image stays (n' = 3, i' = 2).
38-41. LH* addressing
(diagram, four frames) A client with image (n' = 0, i' = 0) sends key 9 to bucket 0. The key is forwarded, via bucket 1, to the correct bucket 9 (j = 4), which keeps it and returns an IAM with (a = 9, j = 4); the client image becomes (n' = 1, i' = 3). The other client's image stays (n' = 3, i' = 2).
42. Result
- The distributed file can grow to even the whole Internet, such that:
  - every insert and search is done in at most four messages (IAM included)
  - in general, an insert is done in one message and a search in two messages
43. SDDS-2000: Prototype Implementation of LH* and of RP* on a Wintel multicomputer
- Client/Server architecture
- TCP/IP communication (UDP and TCP) with Windows Sockets
- Multiple-thread control
- Process synchronization (mutex, critical section, event, time-out, etc.)
- Queuing system
- Optional flow control for UDP messaging
44. SDDS-2000 Client Architecture
- Send request
- Receive response
- Return response
- Client image processing
(a sketch of this loop follows below)
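A minimal sketch of this client loop in Python: one UDP request, a response wait with a timeout, and the slide-29 image adjustment when the response carries an IAM. The JSON wire format and field names are invented for illustration; they are not the SDDS-2000 protocol.

    import json
    import socket

    def request(server_addr, op, key, i_img, n_img, timeout=2.0):
        """Send one request, wait for the response, and adjust the client
        image (i', n') if the response carries an IAM."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.settimeout(timeout)               # a real client would retry or apply flow control
        try:
            sock.sendto(json.dumps({"op": op, "key": key}).encode(), server_addr)
            data, _ = sock.recvfrom(65535)
            reply = json.loads(data.decode())
            if "iam" in reply:                 # image adjustment (slide-29 rule)
                a, j = reply["iam"]["a"], reply["iam"]["j"]
                if j > i_img:
                    i_img, n_img = j - 1, a + 1
                if n_img >= 2 ** i_img:
                    i_img, n_img = i_img + 1, 0
            return reply, (i_img, n_img)
        finally:
            sock.close()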
45. SDDS-2000 Server Architecture
- Listen thread
- Queuing system
- Work threads (W.Thread 1 ... W.Thread 4)
- Local processing on the SDDS bucket: request analysis, insertion, search, update, delete
- Forward
- Response
(diagram: client requests arrive over the network on a socket, the listen thread queues them, and the work threads process and respond; a sketch follows below)
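A minimal sketch of the listen-thread / queuing-system / work-thread pattern of this slide, using plain Python threads, a Queue and one UDP socket. The toy wire format and the direct local insert (no address verification, forwarding or splits) are simplifications, not the SDDS-2000 server code.

    import queue
    import socket
    import threading

    requests = queue.Queue()          # queuing system between the threads
    bucket = {}                       # the SDDS bucket, here a RAM hash table

    def listen_thread(port=40000):
        """Receive client requests and queue them for the work threads."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind(("", port))
        while True:
            data, client = sock.recvfrom(65535)
            requests.put((data, client, sock))

    def work_thread():
        """Dequeue a request, process it locally, send the response."""
        while True:
            data, client, sock = requests.get()
            key, value = data.decode().split(":", 1)   # toy wire format "key:value"
            bucket[key] = value         # local processing: insertion
            sock.sendto(b"ok", client)  # response to the client

    threading.Thread(target=listen_thread, daemon=True).start()
    for _ in range(4):                 # four work threads, as in the figure
        threading.Thread(target=work_thread, daemon=True).start()
    # A real server keeps the main thread alive and also handles search,
    # update, delete, LH* forwarding, and bucket splits over TCP.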
46. LH*LH: RAM buckets
47. Measuring conditions
- LAN of 4 computers interconnected by a 100 Mb/s Ethernet
- F.S (Fast Server): Pentium II 350 MHz, 128 MB RAM
- F.C (Fast Client): Pentium II 350 MHz, 128 MB RAM
- S.C (Slow Client): Pentium 90 MHz, 48 MB RAM
- S.S (Slow Server): Pentium 90 MHz, 48 MB RAM
- The measurements result from 10,000 records or more
- UDP protocol for insertions and searches
- TCP protocol for splitting
48. Best performance: an F.S configuration
(diagram: three slow clients S.C (1), S.C (2), S.C (3) access bucket 0 on one fast server F.S (j = 0) over 100 Mb/s Ethernet; UDP communication)
49. Fast Server: Average Insert Time
- Inserts without ack
- 3 clients create lost messages
- Best time: 0.44 ms
50. Fast Server: Average Search Time
- The time measured includes the search processing and the response return
- With more than 3 clients, there are many lost messages
- Whatever the bucket capacity (1,000, 5,000, ..., 20,000 records), 0.66 ms is the best time
51. Performance of a Slow Server Configuration
(diagram: one slow client S.C, with wait, accesses bucket 0 on a slow server S.S (j = 0) over 100 Mb/s Ethernet; UDP communication)
52. Slow Server: Average Insert Time
- Measurements on the server, without ack
- S.C to S.S (with wait)
- A 2nd client is not needed
- 2.3 ms is the best constant time
53. Slow Server: Average Search Time
- Measurements on the server
- S.C to S.S (with wait)
- A 2nd client is not needed
- 3.3 ms is the best time
54. Insert time into up to 3 buckets: configuration
(diagram: a slow client S.C sends batches 1, 2, 3, ... over 100 Mb/s Ethernet to bucket 0 on a fast server F.S (j = 2), bucket 1 on a slow server S.S (j = 1) and bucket 2 on a slow server S.S (j = 2); UDP communication)
55. Average insert time, no ack
- File creation includes 2 splits, forwards and updates of IAMs
- Compared with insertion into buckets that already exist (no splits)
- Conditions: S.C, F.S, 2 S.S
- Time measured on the server of bucket 0, which is informed of the end of the insertions by each server
- The split is not penalizing: about 0.8 ms/insert in both cases
56. Average search time in 3 Slow Servers: configuration
(diagram: a fast client F.C sends batches 1, 2, 3, ... over 100 Mb/s Ethernet to bucket 0 on S.S (j = 2), bucket 1 on S.S (j = 1) and bucket 2 on S.S (j = 2); UDP communication)
57. The average key search time: Fast Client, Slow Servers
- Records are sent in batches: 1, 2, 3, ..., 10,000
- Balanced load: the 3 buckets receive the same number of records
- Non-balanced load: bucket 1 receives more than the others
- Conclusion: the curve is linear, hence good parallelism
58-61. Extrapolation: single 700 MHz P3 server
(table, built up over four slides)
Processor:      Pentium II 350 MHz (F.S) | Pentium 90 MHz (S.S), about 1/4 the clock | Pentium III 700 MHz, about 2x the 350 MHz clock
Search time:    0.66 ms                  | 3.3 ms (about 5x)                         | < 0.33 ms (about / 2)
Insertion time: 0.44 ms                  | 2.37 ms (about 5x)                        | < 0.22 ms (about / 2)
62. Extrapolation: search time on fast P3 servers
- The client is F.C
- 3 servers at 350 MHz: search time 0.216 ms/key
- 3 servers at 700 MHz: search time 0.106 ms/key
63. Extrapolation: search time in a file scaling to 100 servers
64. RP* schemes
- Produce 1-d ordered files
  - for range search
- Use m-ary trees
  - like a B-tree
- Efficiently support range queries
  - LH* also supports range queries, but less efficiently
- Consist of a family of three schemes
  - RP*N, RP*C and RP*S
(a simplified sketch of range partitioning follows below)
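The RP* protocols themselves (RP*N, RP*C, RP*S) differ in how clients and servers locate buckets; the sketch below only illustrates the common idea of a 1-d ordered, range-partitioned file: each bucket holds an interval of keys, so a key is routed by interval and a range query visits only the buckets whose intervals intersect it. The partition table and names are invented for illustration.

    import bisect

    # Illustrative image of a 1-d ordered file: bucket b holds the keys in
    # [lower_bounds[b], lower_bounds[b+1]); not an actual RP* structure.
    lower_bounds = [0, 100, 250, 600]      # buckets 0, 1, 2, 3

    def bucket_for(key):
        """Route a key to the bucket whose interval contains it."""
        return bisect.bisect_right(lower_bounds, key) - 1

    def buckets_for_range(lo, hi):
        """Buckets to visit for the range query [lo, hi]."""
        return list(range(bucket_for(lo), bucket_for(hi) + 1))

    assert bucket_for(120) == 1
    assert buckets_for_range(90, 300) == [0, 1, 2]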
65. RP* schemes
67. Comparison between LH*LH and RP*N
68. Scalable Distributed Log Structured Array (SD-LSA)
- Intended for high-capacity SANs of IBM RAMAC Virtual Arrays (RVAs) or Enterprise Storage Servers (ESSs)
  - one RVA contains up to 0.8 TB of data
  - one ESS contains up to 13 TB of data
- Reuse of current capabilities
  - transparent access to the entire SAN, as if it were one RVA or ESS
  - preservation of current functions:
    - Log Structured Arrays, for high availability without the small-write RAID penalty
    - snapshots
- New capabilities
  - scalable TB databases
  - PB databases for an ESS SAN
  - parallel / distributed processing
  - high availability surviving the unavailability of an entire server node
69. Gross Architecture
(architecture diagram showing the RVAs)
70. Scalable Availability SDDS
- Supports unavailability of k ≥ 1 server sites
- The factor k increases automatically with the file
  - necessary to prevent the decrease of reliability
- Moderate overhead for parity data
  - storage overhead of O(1/k)
  - access overhead of k messages per data-record insert or update
- Does not impair searches and parallel scans
  - unlike trivial adaptations of RAID-like schemes
- Several schemes were proposed around LH*
  - different properties, to best suit various applications
  - see http://ceria.dauphine.fr/witold.html
71. SD-LSA: main features
- LH* used as the global addressing schema
  - RAM buckets split atomically
  - disk buckets split in a lazy way
- A record (logical track) moves only when
  - the client accesses it (update or read), or
  - it is garbage collected
- An atomic split of a TB disk bucket would take hours
- The LH*RS schema is used for the high availability
- Litwin, W., Menon, J. Scalable Distributed Log Structured Arrays. CERIA Res. Rep. 12, 1999. http://ceria.dauphine.fr/witold.html
(a sketch of the lazy move-on-access idea follows below)
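A minimal sketch of the "move on access" part of the lazy split, under invented data structures (one dict per disk bucket); the real SD-LSA records are logical tracks in an LSA, and garbage-collection-driven moves would use the same address test.

    def lh_address(key, i, n):
        """LH* global addressing rule, as on slide 26."""
        a = key % (2 ** i)
        return a if a >= n else key % (2 ** (i + 1))

    def access_record(key, here, i, n, buckets):
        """Read a record held by disk bucket `here`; if a split has already
        made another bucket its correct home, migrate it lazily on access."""
        target = lh_address(key, i, n)
        if target != here and key in buckets[here]:
            buckets.setdefault(target, {})[key] = buckets[here].pop(key)   # lazy move
        return buckets.get(target, {}).get(key)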
72. Conclusion
- SDDSs should be highly useful for HPC
  - scalability
  - fast access performance
  - parallel scans and function shipping
  - high availability
- SDDSs are available on network multicomputers
  - SDDS-2000
- Access performance proves at least an order of magnitude faster than for traditional files
  - should reach two orders of magnitude (a 100-times improvement) for 700 MHz P3
  - a combination of a fast net and distributed RAM
73. Future work
- Experiments
  - faster net: we do not have any volunteer to help so far
  - more Wintel computers: we are adding two 700 MHz P3s; volunteers with funding for more, or with their own configurations?
- Experiments on switched multicomputers
  - LH*LH runs on Parsytec (J. Karlsson) and SGs (Math. Cntr. of U. Amsterdam)
  - volunteers with an SP2?
- Generally, we welcome every cooperation
74. THE END
- Thank you for your attention
Witold Litwin, Fethi Bennour
Sponsored by HP Laboratories, IBM Almaden Research and Microsoft Research