1
Distributed Metadata with the AMGA Metadata
Catalog
  • Nuno Santos, Birger Koblitz
  • 20 June 2006
  • Workshop on Next-Generation Distributed Data
    Management

2
Abstract
  • Metadata Catalogs on Data Grids: the case for
    replication
  • The AMGA Metadata Catalog
  • Metadata Replication with AMGA
  • Benchmark Results
  • Future Work/Open Challenges

3
Metadata Catalogs
  • Metadata on the Grid
  • File Metadata - describe files with
    application-specific information
  • Purpose: file discovery based on their contents
  • Simplified Database Service - store generic
    structured data on the Grid
  • Not as powerful as a DB, but easier to use and
    with better Grid integration (security, hiding DB
    heterogeneity)
  • Metadata Services are essential for many Grid
    applications
  • Must be accessible Grid-wide
  • But Data Grids can be large

4
An Example - The LCG Sites
  • LCG: LHC Computing Grid
  • Distribute and process the data generated by the
    LHC (Large Hadron Collider) at CERN
  • 200 sites and 5,000 users worldwide

Taken from http://goc03.grid-support.ac.uk/googlemaps/lcg.html
5
Challenges for Catalog Services
  • Scalability
  • Hundreds of grid sites
  • Thousands of users
  • Geographical Distribution
  • Network latency
  • Dependability
  • In a large and heterogeneous system, failures
    will be common
  • A centralized system does not meet the
    requirements
  • Distribution and replication required

6
Off-the-shelf DB Replication?
  • Most DB systems have DB replication mechanisms
  • Oracle Streams, Slony for PostgreSQL, MySQL
    replication
  • Example: the 3D Project at CERN
  • (Distributed Deployment of Databases)
  • Uses Oracle Streams for replication
  • Being deployed only at a few LCG sites (10
    sites, Tier-0 and Tier-1s)
  • Requires Oracle and expert on-site DBAs
  • Most sites don't have these resources
  • Off-the-shelf replication is vendor-specific
  • But Grids are heterogeneous by nature
  • Sites have different DB systems available

Only a partial solution to the problem of metadata
replication
7
Replication in the Catalog
  • Alternative we are exploring
  • Replication in the Metadata Catalog
  • Advantages
  • Database independent
  • Metadata-aware replication
  • More efficient: replicate metadata commands
  • Better functionality: partial replication,
    federation
  • Ease of deployment and administration
  • Built into the Metadata Catalog
  • No need for dedicated DB admin
  • The AMGA Metadata Catalogue is the basis for our
    work on replication

8
The AMGA Metadata Catalog
  • Metadata Catalog of the gLite Middleware (EGEE)
  • Several groups of users among the EGEE community
  • High Energy Physics
  • Biomed
  • Main features
  • Dynamic schemas
  • Hierarchical organization
  • Security
  • Authentication: user/pass, X.509 certificates, GSI
  • Authorization: VOMS, ACLs

9
AMGA Implementation
  • C++ implementation
  • Back-ends
  • Oracle, MySQL, PostgreSQL, SQLite
  • Front-end - TCP Streaming
  • Text-based protocol like TELNET, SMTP, POP
  • Examples
  • Adding data:

addentry /DLAudio/song.mp3 /DLAudio:Author 'John Smith' /DLAudio:Album 'Latest Hits'

  • Retrieving data:

selectattr /DLAudio:FILE /DLAudio:Author /DLAudio:Album 'like(/DLAudio:FILE, "%.mp3")'
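
Because the front-end is a plain text protocol streamed over TCP, a client
only has to write command lines to a socket and read back the reply. The
Python sketch below illustrates that idea only; the host name, port and
end-of-reply handling are assumptions made for the example and do not
reproduce the actual AMGA wire protocol or its client APIs.

# Minimal sketch of a text-streaming metadata client (illustration only).
# Host, port and response framing are assumptions, not the AMGA protocol.
import socket

HOST, PORT = "amga.example.org", 8822    # hypothetical endpoint

def send_command(command: str) -> str:
    """Send one text command over TCP and return the raw reply."""
    with socket.create_connection((HOST, PORT), timeout=10) as sock:
        sock.sendall(command.encode("utf-8") + b"\n")
        sock.shutdown(socket.SHUT_WR)        # signal end of request
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
        return b"".join(chunks).decode("utf-8")

print(send_command('selectattr /DLAudio:FILE /DLAudio:Author '
                   '/DLAudio:Album \'like(/DLAudio:FILE, "%.mp3")\''))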
10
Standalone Performance
  • A single server scales well up to 100 concurrent
    clients
  • Could not go past 100 - limited by the database
  • WAN access is one to two orders of magnitude
    slower than LAN

Replication can solve both bottlenecks
11
  • Metadata Replication with AMGA

12
Requirements of EGEE Communities
  • Motivation: requirements of EGEE's user
    communities
  • Mainly HEP and Biomed
  • High Energy Physics (HEP)
  • Millions of files, 5,000 users distributed
    across 200 computing centres
  • Mainly (read-only) file metadata
  • Main concerns: scalability, performance and
    fault tolerance
  • Biomed
  • Manage medical images on the Grid
  • Data produced in a distributed fashion by
    laboratories and hospitals
  • Highly sensitive data: patient details
  • Smaller scale than HEP
  • Main concern: security

13
Metadata Replication
  • Some replication models:
  • Partial replication
  • Full replication
  • Federation
  • Proxy
14
Architecture
  • Main design decisions (see the sketch below)
  • Asynchronous replication: tolerates high
    latencies and provides fault tolerance
  • Partial replication: replicate only what is of
    interest to the remote users
  • Master-slave: writes are only allowed on the
    master
  • But mastership is granted per metadata
    collection, not per node
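
A minimal sketch of these decisions, using hypothetical names (Master,
Slave, Update) rather than AMGA's actual code: writes are accepted only
for collections this node owns, appended to a log, and shipped
asynchronously; each slave receives only the collections it subscribed to.

# Illustrative sketch of asynchronous, partial, master-slave replication.
# All names are hypothetical and do not mirror the AMGA implementation.
from dataclasses import dataclass
from typing import Callable, List, Set

@dataclass
class Update:
    collection: str          # e.g. "/DLAudio"
    command: str             # metadata command to replay on the slave

@dataclass
class Slave:
    name: str
    subscriptions: Set[str]            # collections this slave replicates
    ship: Callable[[Update], None]     # network send in a real system

class Master:
    def __init__(self, owned_collections: Set[str]):
        self.owned = owned_collections       # mastership is per collection
        self.log: List[Update] = []
        self.slaves: List[Slave] = []

    def write(self, collection: str, command: str) -> None:
        if collection not in self.owned:
            raise PermissionError(f"not master for {collection}")
        # Execute locally (omitted here), then record for async shipping.
        self.log.append(Update(collection, command))

    def ship_updates(self) -> None:
        # Runs asynchronously, so WAN latency never blocks the writers.
        for update in self.log:
            for slave in self.slaves:
                if update.collection in slave.subscriptions:
                    slave.ship(update)       # partial replication
        self.log.clear()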

15
Status
  • Initial implementation completed
  • Available functionality
  • Full and partial replication
  • Chained replication (master → slave1 → slave2)
  • Federation - basic support
  • Data is always copied to the slave
  • Cross-DB replication: PostgreSQL → MySQL tested
  • Other combinations should work (give or take some
    debugging)
  • Available as part of AMGA

16
  • Benchmark Results

17
Benchmark Study
  • Investigate the following
  • Overhead of replication and scalability of the
    master
  • Behaviour of the system under faults

18
Scalability
  • Setup
  • Insertion rate at master: 90 entries/s
  • Total: 10,000 entries
  • 0 slaves - saving replication updates, but not
    shipping them (slaves disconnected)
  • Small increase in CPU usage as the number of
    slaves increases
  • With 10 slaves, a 20% increase over standalone
    operation
  • Number of update logs sent scales almost linearly

19
Fault Tolerance
  • The next test illustrates the fault-tolerance
    mechanisms
  • Slave fails
  • Master keeps the updates for the slave
  • Replication log grows
  • Slave reconnects
  • Master sends pending updates
  • Eventually the system recovers to a steady state
    with the slave up to date
  • Setup / test conditions
  • Insertion rate at master: 50 entries/s
  • Total: 20,000 entries
  • Two slaves, both start connected
  • Slave1 disconnects temporarily

20
Fault Tolerance and Recovery
  • While slave1 is disconnected, the replication log
    grows in size
  • The log is limited in size; the slave is
    unsubscribed if it does not reconnect in time
    (see the sketch below)
  • After the slave reconnects, the system recovers in
    around 60 seconds
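
A hedged sketch of this recovery behaviour, with hypothetical names and a
made-up log limit: updates queue up for a disconnected slave, the slave is
unsubscribed if the bounded log overflows, and a reconnecting slave drains
the backlog before the system returns to a steady state.

# Sketch of a bounded per-slave replication log; the names and the limit
# are assumptions for illustration, not the actual AMGA implementation.
from collections import deque
from typing import Callable, Deque

MAX_PENDING = 100_000    # hypothetical limit on the replication log

class SlaveChannel:
    def __init__(self, name: str):
        self.name = name
        self.subscribed = True
        self.pending: Deque[str] = deque()

    def enqueue(self, update: str) -> None:
        if not self.subscribed:
            return                      # slave must resubscribe and resync
        self.pending.append(update)
        if len(self.pending) > MAX_PENDING:
            # Log limit reached: drop the slave rather than grow unbounded.
            self.subscribed = False
            self.pending.clear()

    def catch_up(self, send: Callable[[str], None]) -> None:
        """On reconnection, ship the backlog until the slave is up to date."""
        while self.pending:
            send(self.pending.popleft())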

21
  • Future Work/Open Challenges

22
Scalability
  • Support hundreds of replicas
  • HEP use case. Extreme case: one replica catalog
    per site
  • Challenges
  • Scalability
  • Fault tolerance: tolerate failures of slaves and
    of the master
  • Current method of shipping updates (direct
    streaming) might not scale
  • Chained replication (divide and conquer) - see the
    sketch after this list
  • Already possible with AMGA, but its performance
    needs to be studied
  • Group communication
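
Chained replication keeps the master's fan-out small by letting each
first-level slave relay the update stream to its own subscribers. The
sketch below illustrates only that fan-out idea; the RelayNode class is
hypothetical and not part of AMGA.

# Sketch of chained replication (master -> slave1 -> slave2).
from typing import List

class RelayNode:
    def __init__(self, name: str):
        self.name = name
        self.downstream: List["RelayNode"] = []

    def subscribe(self, node: "RelayNode") -> None:
        self.downstream.append(node)

    def receive(self, update: str) -> None:
        self.apply_locally(update)
        # Forward to the next level, keeping this node's fan-out small.
        for node in self.downstream:
            node.receive(update)

    def apply_locally(self, update: str) -> None:
        print(f"{self.name}: applied {update}")

# master -> slave1 -> slave2
master, slave1, slave2 = RelayNode("master"), RelayNode("slave1"), RelayNode("slave2")
master.subscribe(slave1)
slave1.subscribe(slave2)
master.receive("addentry /DLAudio/song.mp3 ...")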

23
Federation
  • Federation of independent catalogs
  • Biomed use case
  • Challenges
  • Provide a consistent view over the federated
    catalogs
  • Shared namespace
  • Security - Trust management, access control and
    user management
  • Ideas

24
Conclusion
  • Replication of Metadata Catalogues is necessary
    for Data Grids
  • We are exploring replication at the Catalogue
    level, using AMGA
  • Initial implementation completed
  • First results are promising
  • Currently working on improving scalability and on
    federation
  • More information about our current work at
  • http://project-arda-dev.web.cern.ch/project-arda-dev/metadata/