Title: Scalable Distributed Data Structures
1. Scalable Distributed Data Structures
High-Performance Computing
Witold Litwin, Fethi Bennour
CERIA, University Paris 9 Dauphine
http://ceria.dauphine.fr/
2. Plan
- Multicomputers for HPC
- What are SDDSs?
- Overview of LH*
- Implementation under SDDS-2000
- Conclusion
3. Multicomputers
- A collection of loosely coupled computers
  - mass-produced and/or preexisting hardware
  - share-nothing architecture
  - best for HPC because of scalability
  - message passing through a high-speed net (100 Mb/s or more)
- Network multicomputers
  - use general-purpose nets and PCs
  - LANs: Fast Ethernet, Token Ring, SCI, FDDI, Myrinet, ATM
  - NCSA cluster: 1024 NTs on Myrinet by the end of 1999
- Switched multicomputers
  - use a bus or a switch
  - IBM SP2, Parsytec...
4. Why Multicomputers?
- Unbeatable price/performance ratio for HPC
- Cheaper and more powerful than supercomputers
  - especially the network multicomputers
- Available everywhere
- Computing power
  - file size, access and processing times, throughput...
- For more pros and cons:
  - IBM SP2 and GPFS literature
  - Tanenbaum, "Distributed Operating Systems", Prentice Hall, 1995
  - NOW project (UC Berkeley)
  - Bill Gates at Microsoft Scalability Day, May 1997
  - www.microsoft.com, White Papers from the Business Systems Div.
  - Report to the President, President's Inf. Techn. Adv. Comm., Aug 98
5. Typical Network Multicomputer
(diagram: client and server machines on a general-purpose network)
6. Why SDDSs?
- Multicomputers need data structures and file systems
- Trivial extensions of traditional structures are not best:
  - hot-spots
  - scalability
  - parallel queries
  - distributed and autonomous clients
  - distributed RAM and distance to data
- For a CPU, data on a disk are as far away as the Moon is for a human (J. Gray, ACM Turing Award 1999)
7. What is an SDDS?
- Data are structured
  - records with keys, objects with OIDs
  - more semantics than in the Unix flat-file model
  - the abstraction most popular with applications
  - parallel scans and function shipping
- Data are on servers
  - waiting for access
- Overflowing servers split into new servers
  - appended to the file without informing the clients
- Queries come from multiple autonomous clients
  - access initiators
  - not supporting synchronous updates
  - not using any centralized directory for access computations
8. What is an SDDS?
- Clients can make addressing errors
  - clients have a more or less adequate image of the actual file structure
- Servers are able to forward the queries to the correct address
  - perhaps in several messages
- Servers may send Image Adjustment Messages (IAMs)
  - clients do not make the same error twice
- Servers support parallel scans
  - sent out by multicast or unicast
  - with deterministic or probabilistic termination
- See the SDDS talks and papers for more
  - ceria.dauphine.fr/witold.html
  - or the LH* ACM TODS paper (Dec. 96)
9. High-Availability SDDS
- A server can be unavailable for access without service interruption
- Data are reconstructed from the other servers
  - data and parity servers
- Up to k ≥ 1 servers can fail
  - at a parity overhead cost of about 1/k
- The factor k can itself scale with the file
  - scalable-availability SDDSs
10-13. An SDDS: growth through splits under inserts
(diagram, built up over four slides: clients insert into the servers; overflowing buckets split onto new servers appended to the file)
14-18. An SDDS: Client Access
(diagram, built up over five slides: a client request reaching a wrong server is forwarded to the correct one, which sends the client an IAM)
19-24. Known SDDSs
(taxonomy diagram, built up over six slides)
- Classic data structures vs. SDDSs (since 1993)
- Hash: LH*, DDH, Breitbart & al.
- 1-d tree schemes
- m-d tree schemes
- High-availability (s-availability): LH*m, LH*g, LH*SA, LH*RS
- Security: LH*s
- Disk: SD-LSA
http://192.134.119.81/SDDS-bibliograhie.html
25. LH* (a classic)
- Scalable distributed hash partitioning
  - generalizes the LH addressing schema
  - variants used in Netscape products, LH-Server, Unify, FrontPage, IIS, MS Exchange...
- Typical load factor: 70 - 90 %
- In practice, at most 2 forwarding messages
  - regardless of the size of the file
- In general, 1 message per insert and 2 messages per search on average
- 4 messages in the worst case
26. LH* bucket servers
- For every record c, its correct address a results from the LH addressing rule:
  - a ← h_i(c)
  - if n = 0 then exit
  - else if a < n then a ← h_{i+1}(c)
- (i, n): the file state, known only to the LH*-coordinator
- Each server a keeps track only of the function h_j used to access it
  - j = i or j = i + 1
(a sketch of this rule follows below)
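As a concrete illustration, here is a minimal Python sketch of the addressing rule above. It is not SDDS-2000 code; it assumes the usual linear-hashing function family h_j(c) = c mod 2^j, and the function names are invented.

    def h(j, c):
        """Linear-hashing function family: h_j(c) = c mod 2^j."""
        return c % (2 ** j)

    def correct_address(c, i, n):
        """LH* addressing rule for key c, given the true file state (i, n)
        known only to the coordinator."""
        a = h(i, c)
        if n != 0 and a < n:      # bucket a has already split at level i
            a = h(i + 1, c)
        return a

    # Example with the file state of the later slides, i = 3, n = 2:
    assert correct_address(15, 3, 2) == 7   # 15 mod 8 = 7, not below n
    assert correct_address(9, 3, 2) == 9    # 9 mod 8 = 1 < n, so h_4(9) = 9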
27. LH* clients
- Each client uses the LH* rule for address computation, but with the client's own image (i', n') of the file state
- Initially, for a new client, (i', n') = (0, 0)
(sketch below)
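The same rule applied on the client side, to the client image instead of the true file state; again an illustrative sketch with invented names, not SDDS-2000 code.

    def h(j, c):                        # h_j(c) = c mod 2^j, as before
        return c % (2 ** j)

    def client_address(c, i_img, n_img):
        """Address computation on the client, using its possibly outdated
        image (i', n') of the file state."""
        a = h(i_img, c)
        if a < n_img:
            a = h(i_img + 1, c)
        return a

    # A new client starts with (i', n') = (0, 0) and therefore sends every
    # key to bucket 0 until IAMs adjust its image.
    assert client_address(15, 0, 0) == 0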
28. LH* Server Address Verification and Forwarding
- Server a getting key c (a = m in particular) computes:
  - a' ← h_j(c)
  - if a' = a then accept c
  - else a'' ← h_{j-1}(c)
    - if a'' > a and a'' < a' then a' ← a''
    - send c to bucket a'
(sketch below)
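A minimal Python sketch of this verification-and-forwarding rule (illustrative names, not SDDS-2000 code). The asserts trace the two-forwarding worst case mentioned on slide 25.

    def h(j, c):                        # h_j(c) = c mod 2^j, as before
        return c % (2 ** j)

    def server_resolve(c, a, j):
        """LH* address verification on server a (bucket level j) for key c.
        Returns None to accept the key locally, otherwise the bucket
        address to forward it to."""
        a1 = h(j, c)                    # a' of the slide
        if a1 == a:
            return None                 # correct bucket: accept c
        a2 = h(j - 1, c)                # a'' of the slide
        if a < a2 < a1:
            return a2                   # tighter guess
        return a1

    # With the file state of the later slides (i = 3, n = 3): key 9 sent to
    # bucket 0 is forwarded to bucket 1, then to bucket 9, which accepts it.
    assert server_resolve(9, 0, 4) == 1
    assert server_resolve(9, 1, 4) == 9
    assert server_resolve(9, 9, 4) is None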
29. Client Image Adjustment
- The IAM consists of the address a where the client's key c ended up and of j(a)
- if j > i' then i' ← j - 1, n' ← a + 1
- if n' ≥ 2^i' then n' ← 0, i' ← i' + 1
- The rule guarantees that the client image stays within the file
  - provided there are no file contractions (merges)
(sketch below)
30-31. LH* file structure
(diagram, two identical frames) The servers hold buckets 0 and 1 (j = 4), 2 ... 7 (j = 3), 8 and 9 (j = 4). The coordinator holds the file state n = 2, i = 3. Two clients hold the images (n' = 0, i' = 0) and (n' = 3, i' = 2).
32-34. LH* split
(diagram, three frames) The coordinator triggers the split of bucket n = 2. A new bucket 10 (j = 4) is appended and bucket 2 is rehashed to level j = 4; the file state becomes n = 3, i = 3. The client images, (n' = 0, i' = 0) and (n' = 3, i' = 2), are unchanged.
35-37. LH* addressing
(diagram, three frames) The client with image (n' = 0, i' = 0) sends key 15 to bucket 0. Bucket 0 (j = 4) forwards it to the correct bucket 7 (j = 3), which keeps the key and returns an IAM with (a = 7, j = 3); the client image becomes (n' = 0, i' = 3). The other client's image stays (n' = 3, i' = 2).
38-41. LH* addressing
(diagram, four frames) A client with image (n' = 0, i' = 0) sends key 9 to bucket 0. The key is forwarded, via bucket 1, to the correct bucket 9 (j = 4), which keeps it and returns an IAM with (a = 9, j = 4); the client image becomes (n' = 1, i' = 3). The other client's image stays (n' = 3, i' = 2).
42. Result
- The distributed file can grow to even the whole Internet, such that:
  - every insert and search is done in at most four messages (IAM included)
  - in general, an insert is done in one message and a search in two messages
43. SDDS-2000: Prototype Implementation of LH* and of RP* on a Wintel multicomputer
- Client/Server architecture
- TCP/IP communication (UDP and TCP) with Windows Sockets
- Multiple-thread control
- Process synchronization (mutex, critical section, event, time-out, etc.)
- Queuing system
- Optional flow control for UDP messaging
44. SDDS-2000 Client Architecture
- Send request
- Receive response
- Return response
- Client image processing
(a sketch of this loop follows below)
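A minimal sketch of this client loop in Python: one UDP request, a response wait with a timeout, and the slide-29 image adjustment when the response carries an IAM. The JSON wire format and field names are invented for illustration; they are not the SDDS-2000 protocol.

    import json
    import socket

    def request(server_addr, op, key, i_img, n_img, timeout=2.0):
        """Send one request, wait for the response, and adjust the client
        image (i', n') if the response carries an IAM."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.settimeout(timeout)               # a real client would retry or apply flow control
        try:
            sock.sendto(json.dumps({"op": op, "key": key}).encode(), server_addr)
            data, _ = sock.recvfrom(65535)
            reply = json.loads(data.decode())
            if "iam" in reply:                 # image adjustment (slide-29 rule)
                a, j = reply["iam"]["a"], reply["iam"]["j"]
                if j > i_img:
                    i_img, n_img = j - 1, a + 1
                if n_img >= 2 ** i_img:
                    i_img, n_img = i_img + 1, 0
            return reply, (i_img, n_img)
        finally:
            sock.close()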
45. SDDS-2000 Server Architecture
- Listen thread
- Queuing system
- Work threads (W.Thread 1 ... W.Thread 4)
- Local processing on the SDDS bucket: request analysis, insertion, search, update, delete
- Forward
- Response
(diagram: client requests arrive over the network on a socket, the listen thread queues them, and the work threads process and respond; a sketch follows below)
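A minimal sketch of the listen-thread / queuing-system / work-thread pattern of this slide, using plain Python threads, a Queue and one UDP socket. The toy wire format and the direct local insert (no address verification, forwarding or splits) are simplifications, not the SDDS-2000 server code.

    import queue
    import socket
    import threading

    requests = queue.Queue()          # queuing system between the threads
    bucket = {}                       # the SDDS bucket, here a RAM hash table

    def listen_thread(port=40000):
        """Receive client requests and queue them for the work threads."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind(("", port))
        while True:
            data, client = sock.recvfrom(65535)
            requests.put((data, client, sock))

    def work_thread():
        """Dequeue a request, process it locally, send the response."""
        while True:
            data, client, sock = requests.get()
            key, value = data.decode().split(":", 1)   # toy wire format "key:value"
            bucket[key] = value         # local processing: insertion
            sock.sendto(b"ok", client)  # response to the client

    threading.Thread(target=listen_thread, daemon=True).start()
    for _ in range(4):                 # four work threads, as in the figure
        threading.Thread(target=work_thread, daemon=True).start()
    # A real server keeps the main thread alive and also handles search,
    # update, delete, LH* forwarding, and bucket splits over TCP.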
46. LH*LH: RAM buckets
47. Measuring conditions
- LAN of 4 computers interconnected by a 100 Mb/s Ethernet
- F.S (Fast Server): Pentium II 350 MHz, 128 MB RAM
- F.C (Fast Client): Pentium II 350 MHz, 128 MB RAM
- S.C (Slow Client): Pentium 90 MHz, 48 MB RAM
- S.S (Slow Server): Pentium 90 MHz, 48 MB RAM
- The measurements result from 10,000 records or more
- UDP protocol for insertions and searches
- TCP protocol for splitting
48. Best performance: an F.S configuration
(diagram: three slow clients S.C (1), S.C (2), S.C (3) access bucket 0 on one fast server F.S (j = 0) over 100 Mb/s Ethernet; UDP communication)
49. Fast Server: Average Insert Time
- Inserts without ack
- 3 clients create lost messages
- Best time: 0.44 ms
50. Fast Server: Average Search Time
- The time measured includes the search processing and the response return
- With more than 3 clients, there are many lost messages
- Whatever the bucket capacity (1,000, 5,000, ..., 20,000 records), 0.66 ms is the best time
51. Performance of a Slow Server Configuration
(diagram: one slow client S.C, with wait, accesses bucket 0 on a slow server S.S (j = 0) over 100 Mb/s Ethernet; UDP communication)
52. Slow Server: Average Insert Time
- Measurements on the server, without ack
- S.C to S.S (with wait)
- A 2nd client is not needed
- 2.3 ms is the best constant time
53. Slow Server: Average Search Time
- Measurements on the server
- S.C to S.S (with wait)
- A 2nd client is not needed
- 3.3 ms is the best time
54. Insert time into up to 3 buckets: configuration
(diagram: a slow client S.C sends batches 1, 2, 3, ... over 100 Mb/s Ethernet to bucket 0 on a fast server F.S (j = 2), bucket 1 on a slow server S.S (j = 1) and bucket 2 on a slow server S.S (j = 2); UDP communication)
55. Average insert time, no ack
- File creation includes 2 splits, forwards and updates of IAMs
- Compared with insertion into buckets that already exist (no splits)
- Conditions: S.C, F.S, 2 S.S
- Time measured on the server of bucket 0, which is informed of the end of the insertions by each server
- The split is not penalizing: about 0.8 ms/insert in both cases
56. Average search time in 3 Slow Servers: configuration
(diagram: a fast client F.C sends batches 1, 2, 3, ... over 100 Mb/s Ethernet to bucket 0 on S.S (j = 2), bucket 1 on S.S (j = 1) and bucket 2 on S.S (j = 2); UDP communication)
57. The average key search time: Fast Client, Slow Servers
- Records are sent in batches: 1, 2, 3, ..., 10,000
- Balanced load: the 3 buckets receive the same number of records
- Non-balanced load: bucket 1 receives more than the others
- Conclusion: the curve is linear, hence good parallelism
58-61. Extrapolation: single 700 MHz P3 server
(table, built up over four slides)
Processor:      Pentium II 350 MHz (F.S) | Pentium 90 MHz (S.S), about 1/4 the clock | Pentium III 700 MHz, about 2x the 350 MHz clock
Search time:    0.66 ms                  | 3.3 ms (about 5x)                         | < 0.33 ms (about / 2)
Insertion time: 0.44 ms                  | 2.37 ms (about 5x)                        | < 0.22 ms (about / 2)
62. Extrapolation: search time on fast P3 servers
- The client is F.C
- 3 servers at 350 MHz: search time 0.216 ms/key
- 3 servers at 700 MHz: search time 0.106 ms/key
63. Extrapolation: search time in a file scaling to 100 servers
64. RP* schemes
- Produce 1-d ordered files
  - for range search
- Use m-ary trees
  - like a B-tree
- Efficiently support range queries
  - LH* also supports range queries, but less efficiently
- Consist of a family of three schemes
  - RP*N, RP*C and RP*S
(a simplified sketch of range partitioning follows below)
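The RP* protocols themselves (RP*N, RP*C, RP*S) differ in how clients and servers locate buckets; the sketch below only illustrates the common idea of a 1-d ordered, range-partitioned file: each bucket holds an interval of keys, so a key is routed by interval and a range query visits only the buckets whose intervals intersect it. The partition table and names are invented for illustration.

    import bisect

    # Illustrative image of a 1-d ordered file: bucket b holds the keys in
    # [lower_bounds[b], lower_bounds[b+1]); not an actual RP* structure.
    lower_bounds = [0, 100, 250, 600]      # buckets 0, 1, 2, 3

    def bucket_for(key):
        """Route a key to the bucket whose interval contains it."""
        return bisect.bisect_right(lower_bounds, key) - 1

    def buckets_for_range(lo, hi):
        """Buckets to visit for the range query [lo, hi]."""
        return list(range(bucket_for(lo), bucket_for(hi) + 1))

    assert bucket_for(120) == 1
    assert buckets_for_range(90, 300) == [0, 1, 2]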
65. RP* schemes
67. Comparison between LH*LH and RP*N
68. Scalable Distributed Log Structured Array (SD-LSA)
- Intended for high-capacity SANs of IBM RAMAC Virtual Arrays (RVAs) or Enterprise Storage Servers (ESSs)
  - one RVA contains up to 0.8 TB of data
  - one ESS contains up to 13 TB of data
- Reuse of current capabilities
  - transparent access to the entire SAN, as if it were one RVA or ESS
  - preservation of current functions:
    - Log Structured Arrays, for high availability without the small-write RAID penalty
    - snapshots
- New capabilities
  - scalable TB databases
  - PB databases for an ESS SAN
  - parallel / distributed processing
  - high availability surviving the unavailability of an entire server node
69. Gross Architecture
(architecture diagram showing the RVAs)
70. Scalable Availability SDDS
- Supports unavailability of k ≥ 1 server sites
- The factor k increases automatically with the file
  - necessary to prevent the decrease of reliability
- Moderate overhead for parity data
  - storage overhead of O(1/k)
  - access overhead of k messages per data-record insert or update
- Does not impair searches and parallel scans
  - unlike trivial adaptations of RAID-like schemes
- Several schemes were proposed around LH*
  - different properties, to best suit various applications
  - see http://ceria.dauphine.fr/witold.html
71. SD-LSA: main features
- LH* used as the global addressing schema
  - RAM buckets split atomically
  - disk buckets split in a lazy way
- A record (logical track) moves only when
  - the client accesses it (update or read), or
  - it is garbage collected
- An atomic split of a TB disk bucket would take hours
- The LH*RS schema is used for the high availability
- Litwin, W., Menon, J. Scalable Distributed Log Structured Arrays. CERIA Res. Rep. 12, 1999. http://ceria.dauphine.fr/witold.html
(a sketch of the lazy move-on-access idea follows below)
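A minimal sketch of the "move on access" part of the lazy split, under invented data structures (one dict per disk bucket); the real SD-LSA records are logical tracks in an LSA, and garbage-collection-driven moves would use the same address test.

    def lh_address(key, i, n):
        """LH* global addressing rule, as on slide 26."""
        a = key % (2 ** i)
        return a if a >= n else key % (2 ** (i + 1))

    def access_record(key, here, i, n, buckets):
        """Read a record held by disk bucket `here`; if a split has already
        made another bucket its correct home, migrate it lazily on access."""
        target = lh_address(key, i, n)
        if target != here and key in buckets[here]:
            buckets.setdefault(target, {})[key] = buckets[here].pop(key)   # lazy move
        return buckets.get(target, {}).get(key)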
72. Conclusion
- SDDSs should be highly useful for HPC
  - scalability
  - fast access performance
  - parallel scans and function shipping
  - high availability
- SDDSs are available on network multicomputers
  - SDDS-2000
- Access performance proves at least an order of magnitude faster than for traditional files
  - should reach two orders of magnitude (a 100-times improvement) for 700 MHz P3
  - a combination of a fast net and distributed RAM
73. Future work
- Experiments
  - faster net: we do not have any volunteer to help so far
  - more Wintel computers: we are adding two 700 MHz P3s; volunteers with funding for more, or with their own configurations?
- Experiments on switched multicomputers
  - LH*LH runs on Parsytec (J. Karlsson) and SGs (Math. Cntr. of U. Amsterdam)
  - volunteers with an SP2?
- Generally, we welcome every cooperation
74. THE END
- Thank you for your attention
Witold Litwin, Fethi Bennour
Sponsored by HP Laboratories, IBM Almaden Research and Microsoft Research