Title: OAI-based Harvesting
1. OAI-based Harvesting
IVOA Registry Working Group
2. Harvesting in the Registry Framework
[Diagram. Nodes: VO Projects; Data Centers; Specialized Portals & Services; Full Searchable Registry; Local Searchable Registry. Arrow label: harvest (pull).]
Harvesting is about publishing
3. Harvesting in the Registry Framework
[Diagram. Nodes: VO Projects; Data Centers; Specialized Portals & Services; Full Searchable Registry; Local Searchable Registry. Arrow labels: harvest (pull); replicate.]
Harvesting is about publishing
4. Harvesting in the Registry Framework
[Diagram. Nodes: VO Projects; Data Centers; Specialized Portals & Services; Full Searchable Registry; Local Searchable Registry. Arrow labels: harvest (pull); replicate; selective harvesting.]
Harvesting is about publishing
5. Searching in the Registry Framework
[Diagram. Nodes: VO Projects; Data Centers; Specialized Portals & Services; Client Applications; Full Searchable Registry; Local Searchable Registry. Arrow label: search queries.]
Harvesting is not about searching
6. Open Archives Initiative (OAI) Protocol for Metadata Harvesting
- Existing standard for harvesting resource descriptions, widely supported in the digital library community
  - http://www.openarchives.org/
- Supports aggregation of resource descriptions
- Deployed successfully as part of NVO registry prototype
  - Currently part of framework supporting the NVO Data Inventory Service (DIS)
7. OAI-PMH Features
- Defines 6 operations
  - Identify, GetRecord
  - ListIdentifiers, ListMetadataFormats
  - ListRecords, ListSets
- Features
  - Support for multiple description formats (metadataPrefix)
  - Harvesting by date (from, until)
  - Harvesting by category (set)
  - Marking records as deleted
  - Support for resumption tokens
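As a minimal sketch of how the features above map onto request arguments, the following Python builds OAI-PMH request URLs. The base URL and the helper name `build_request` are hypothetical, chosen only for illustration; the argument names (verb, metadataPrefix, from, until, set, resumptionToken) are those of OAI-PMH v2.0.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
BASE_URL = "http://registry.example.org/oai"


def build_request(base_url, verb, metadata_prefix=None, from_date=None,
                  until=None, set_spec=None, resumption_token=None):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    params = {"verb": verb}
    if resumption_token is not None:
        # In OAI-PMH v2.0 resumptionToken is an exclusive argument:
        # it is sent with the verb and nothing else.
        params["resumptionToken"] = resumption_token
    else:
        optional = {
            "metadataPrefix": metadata_prefix,  # description format
            "from": from_date,                  # harvesting by date
            "until": until,
            "set": set_spec,                    # harvesting by category
        }
        params.update({k: v for k, v in optional.items() if v is not None})
    return f"{base_url}?{urlencode(params)}"


# Selective harvest: records in one set changed since a given date.
url = build_request(BASE_URL, "ListRecords", metadata_prefix="oai_dc",
                    from_date="2004-01-01", set_spec="ivo_Registry")

# Follow-up request when the previous response was incomplete.
next_url = build_request(BASE_URL, "ListRecords", resumption_token="abc")
```

Leaving out the set argument in the first call would request every record, which is how a full harvest differs from a selective one.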
8. OAI as a Web Service
- IVOA presumed preference for Web Services
  - OAI-PMH is defined as a set of HTTP GET services
- Broad interest in seeing a Web Service version
  - Evolving the standard toward WS
- NVO has prototyped WS versions
  - Gretchen Greene & Wil O'Mullane (STScI)
  - Charlie Cowert (SDSC)
  - Ray Plante (NCSA)
- In contact with OAI community
  - Opportunity to present a proposal for a standard WS version to the OAI community
9. Reasons to adopt OAI
- Existing, well-tested standard we don't have to reinvent
- Easy to implement
  - Demonstrated by NVO
- Lots of existing OAI software tools
  - For clients and servers
  - Lowers cost of implementation
- Interoperability with larger digital library community
- Do these hold in the web service context?
  - Yes, if the WS version leverages the original OAI schema
10. A Harvesting Standard based on OAI
- Spelled out in alternate section 4 to the Registry Interface working draft
- The OAI standard defined by
  - The (existing) OAI-PMH v2.0 specification
    - Operations, behavior, message schema
    - OAI-PMH schema defines envelope for resource description
  - The (proposed) OAI WSDL interface
    - Mapping to WS interface
    - Imports the OAI-PMH schema
- Standard IVOA use of OAI
  - OAI spec provides hooks for community-specific semantics
    - Resource description format
    - Sets
11. IVOA Metadata Format
- Define metadata format: ivo_vor
  - Description using the VOResource schema
  - Restricted to a resource sub-type defined in a standard extension schema
    - Today: one of the working draft extensions
    - Non-standard extensions should be accessible via a non-standard metadata format name
  - Harvester can expect to fully comprehend the resource description
- Dublin Core format, oai_dc, required
  - For cross-disciplinary interoperability
  - Support is trivial via standard XSL stylesheet
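A harvester could confirm that a registry offers both formats by inspecting its ListMetadataFormats response. The sketch below is illustrative only: the sample response is hand-written, the ivo_vor schema and namespace URLs are placeholders rather than real IVOA locations, and `supported_prefixes` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace of the OAI-PMH v2.0 response schema.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hand-written sample response; the ivo_vor schema and namespace
# URLs are placeholders, not real IVOA schema locations.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2004-05-24T12:00:00Z</responseDate>
  <request verb="ListMetadataFormats">http://registry.example.org/oai</request>
  <ListMetadataFormats>
    <metadataFormat>
      <metadataPrefix>oai_dc</metadataPrefix>
      <schema>http://www.openarchives.org/OAI/2.0/oai_dc.xsd</schema>
      <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
    </metadataFormat>
    <metadataFormat>
      <metadataPrefix>ivo_vor</metadataPrefix>
      <schema>http://registry.example.org/schema/placeholder.xsd</schema>
      <metadataNamespace>http://registry.example.org/schema/placeholder</metadataNamespace>
    </metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>
"""


def supported_prefixes(xml_text):
    """Return the set of metadataPrefix values a registry advertises."""
    root = ET.fromstring(xml_text)
    return {el.text.strip()
            for el in root.findall(".//oai:metadataPrefix", OAI_NS)}


# Formats this proposal expects every registry to offer.
REQUIRED = {"oai_dc", "ivo_vor"}
missing = REQUIRED - supported_prefixes(SAMPLE_RESPONSE)
print("missing formats:", sorted(missing))
```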
12. Sets: Named Categories of Records
- The OAI notion of sets
  - Each record may belong to zero or more named categories called sets
  - Sets may be defined by a community or the individual provider
  - Enables selective harvesting by category
- Proposed use of sets for IVOA
  - Implicit definition of a set for each standard resource sub-type
    - E.g. Organisation, Registry, SimpleImageAccess, etc.
    - Set name of the form ivo_type (where type is the resource sub-type name), e.g. ivo_Registry
    - Allows harvesting of specific types
  - Explicit definition of special sets
    - ivo_managed: those records with an authority ID that originates with that registry
    - ivo_standard: any record of a standard resource sub-type
  - Full registry replication by omitting the set argument
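The proposed set rules can be sketched as a small function that assigns set names to a record. This is a sketch under assumptions: `STANDARD_TYPES` lists only the three sub-types named on the slide, and the function name, argument names, and example authority IDs are hypothetical.

```python
# Illustrative subset of standard resource sub-types (the three
# named on the slide); the real list would come from the standard
# extension schemas.
STANDARD_TYPES = {"Organisation", "Registry", "SimpleImageAccess"}


def sets_for_record(resource_type, authority_id, managed_authority_ids):
    """Return the set names a record belongs to under the proposed rules."""
    sets = []
    if resource_type in STANDARD_TYPES:
        # Implicit per-type set, named ivo_<type>.
        sets.append(f"ivo_{resource_type}")
        # Explicit set covering every standard sub-type.
        sets.append("ivo_standard")
    if authority_id in managed_authority_ids:
        # Record whose authority ID originates with this registry.
        sets.append("ivo_managed")
    return sets


# Hypothetical authority IDs managed by the local registry.
managed = {"nvo.example"}

local_registry = sets_for_record("Registry", "nvo.example", managed)
foreign_custom = sets_for_record("CustomType", "other.example", managed)
```

A record of a non-standard type harvested from elsewhere ends up in no set, consistent with the OAI rule that a record may belong to zero sets.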
13. Other miscellaneous specifications
- Required resource records
  - One Registry record describing the registry itself
  - One Authority record for each AuthorityID it manages
  - One Organisation record for each publisher that registers an AuthorityID
- The Identify operation response must include the registry record for the registry
  - e.g. <resource xsi:type="Registry">
14. Related Harvester Interface
- Additional standard operation to be supported by searchable registry (i.e. the harvester)
- Harvest Me: a mechanism to tell a harvester that an update is available
- Inputs
  - ivo-id: the ID of the harvestable registry
  - harvestingType: HTTP Get or WS version
    - Should we allow either?
  - baseURL: endpoint for harvesting interface
  - lastUpdate: date of most recent update to registry contents
- Harvester may choose when/if to harvest
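One way to picture the Harvest Me inputs is as a simple message type plus a harvester-side policy. Everything below is a hypothetical sketch: the class name, field names, harvesting-type strings, and the decision rule are assumptions for illustration, not part of any draft.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class HarvestMeRequest:
    """The four Harvest Me inputs (field names are hypothetical)."""
    ivo_id: str            # ID of the harvestable registry
    harvesting_type: str   # "HTTP Get" or "WS"
    base_url: str          # endpoint of the harvesting interface
    last_update: datetime  # most recent update to registry contents


def should_harvest(request: HarvestMeRequest,
                   last_harvested: Optional[datetime],
                   supported_types=("HTTP Get", "WS")) -> bool:
    """One possible policy: harvest only if something new is on offer."""
    if request.harvesting_type not in supported_types:
        return False  # this harvester cannot use the offered interface
    if last_harvested is None:
        return True   # registry never harvested before
    return request.last_update > last_harvested


# Hypothetical notification from a harvestable registry.
notice = HarvestMeRequest(
    ivo_id="ivo://example.org/registry",
    harvesting_type="HTTP Get",
    base_url="http://registry.example.org/oai",
    last_update=datetime(2004, 5, 20, tzinfo=timezone.utc),
)

stale = datetime(2004, 5, 1, tzinfo=timezone.utc)
fresh = datetime(2004, 5, 21, tzinfo=timezone.utc)
```

Because the harvester may choose when or whether to harvest, the policy function is deliberately separate from the message itself.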
15. Conclusions
- Reasons to adopt OAI-based harvesting
  - Existing, well-tested standard we don't have to reinvent
  - Easy to implement
  - Lots of existing OAI software tools
  - Interoperability with larger digital library community
- OAI fulfills all of the harvesting functionality set proposed in the current WD
- Proposal presumes a WS interface is desired
  - Opportunity to contribute to DL community
- Action
  - Endorse preference to support the standard if it
    - Meets needs
    - Has favorable cost/benefit ratio
  - Continue to develop within context of Registry Interface WD
  - Projects should study the OAI spec carefully
    - If it doesn't meet the above criteria, enumerate how
16. Myths about OAI
- OAI does more than we need, too heavy
  - If you do this right, you will reinvent most/all of the required functionality
- It's cheaper to do something simpler
  - It's cheaper to adopt a standard if you can leverage existing software
- OAI + WS: aren't these envelopes redundant?
  - The two envelopes serve different functions
  - OAI envelope: managing a set of related operations in a format-independent way
    - Response management: responseDate, request inputs
    - Record management: identifier, datestamp, set membership
  - Envelopes are easily skipped
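To illustrate the split between response management and record management, and how a client can skip straight to the payload, here is a minimal sketch that parses a hand-written GetRecord response. The sample identifiers, dates, and the placeholder payload namespace are invented for illustration; the element names follow the OAI-PMH v2.0 schema.

```python
import xml.etree.ElementTree as ET

NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hand-written sample; IDs, dates and the payload are illustrative.
SAMPLE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2004-05-24T12:00:00Z</responseDate>
  <request verb="GetRecord" identifier="ivo://example.org/registry"
           metadataPrefix="ivo_vor">http://registry.example.org/oai</request>
  <GetRecord>
    <record>
      <header>
        <identifier>ivo://example.org/registry</identifier>
        <datestamp>2004-05-20</datestamp>
        <setSpec>ivo_Registry</setSpec>
        <setSpec>ivo_managed</setSpec>
      </header>
      <metadata>
        <resource xmlns="http://example.org/placeholder">
          <title>Example Registry</title>
        </resource>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
"""


def split_envelope(xml_text):
    """Separate response info, record info, and the metadata payload."""
    root = ET.fromstring(xml_text)
    request = root.find("oai:request", NS)
    record = root.find(".//oai:record", NS)
    header = record.find("oai:header", NS)

    # Response management: when the reply was made and what was asked.
    response_info = {
        "responseDate": root.findtext("oai:responseDate", namespaces=NS),
        "request": dict(request.attrib),
    }
    # Record management: identity, freshness, and set membership.
    record_info = {
        "identifier": header.findtext("oai:identifier", namespaces=NS),
        "datestamp": header.findtext("oai:datestamp", namespaces=NS),
        "sets": [s.text for s in header.findall("oai:setSpec", NS)],
    }
    # Skipping the envelope: the first child of <metadata> is the payload.
    payload = record.find("oai:metadata", NS)[0]
    return response_info, record_info, payload


response_info, record_info, payload = split_envelope(SAMPLE)
```

A client interested only in the resource description can ignore the first two return values and work with `payload` directly.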