Title: An Overview of OAI
1An Overview of OAI and OAI-PMH
- by Filbert Minj and Francis Jayakanth
- NCSI, IISc
2Agenda
- Part I: Overview of the OAI
- Part II: Overview of the OAI-PMH
- High-Tea Break
- Part III: Demo of harvesting of metadata and
searching
- Discussion?
3Part I: Overview of the OAI
- General Information
- The Journal System
- Growth of ePrint Archives
- The ePrint System
- The UPS Prototype
- The Dawn of OAI
- Important Resources
4Most Relevant Resource
- Open Archives Forum
- http://www.oaforum.org/tutorial/index.php
- This presentation is based to a great extent
on the tutorial available at the above URL.
- Several slides from that site have been
incorporated into this presentation.
5General Information
- ePrints
- ePrints are commonly defined as research articles
in electronic form (with an underlying assumption
that they are available online)
- Preprints (before peer review)
- Postprints (final, revised, refereed, and
accepted draft)
6General Information
- Repository
- a repository is a network-accessible server that
holds ePrints
- Archive
- is generally accepted as a synonym for repository
7General Information
- ePrint Archive: an established medium for
communicating non-peer-reviewed scholarly
literature (preprints)
8General Information
- Metadata
- structured information about resources
- descriptive information about an object or
a resource, whether it is in physical or
electronic form
9General Information
- DC (Dublin Core)
- is a metadata format defined on the basis of
international consensus. The DC Metadata Element
Set defines fifteen elements for simple resource
description and discovery
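The fifteen elements can be listed explicitly. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree`; the `record` wrapper element and the helper name `make_dc_record` are illustrative assumptions, not part of the Dublin Core specification.

```python
import xml.etree.ElementTree as ET

# The fifteen elements of the (unqualified) Dublin Core Metadata Element Set.
DC_ELEMENTS = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]

DC_NS = "http://purl.org/dc/elements/1.1/"


def make_dc_record(fields):
    """Build a flat DC description from a dict of element name -> list of values."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("record")  # hypothetical wrapper element, for illustration only
    for name, values in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        for value in values:  # every DC element is optional and repeatable
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = value
    return root


record = make_dc_record({
    "title": ["An Overview of OAI"],
    "creator": ["Filbert Minj", "Francis Jayakanth"],
    "type": ["Presentation"],
})
print(ET.tostring(record, encoding="unicode"))
```

Because every element is optional and repeatable, two `creator` values simply become two `dc:creator` children.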
10General Information
- OAI (Open Archives Initiative)
- OAI is an initiative to develop and promote
interoperability standards that aim to facilitate
the efficient dissemination of content.
11General Information
- Protocol
- a protocol is a set of rules defining
communication between systems. FTP (File Transfer
Protocol) and HTTP (Hypertext Transfer Protocol)
are examples of protocols used for communication
between systems across the Internet.
12General Information
- OAI-PMH (OAI Protocol for Metadata Harvesting)
- OAI-PMH is a lightweight harvesting protocol for
sharing metadata between services.
13General Information
- Data Provider
- a Data Provider maintains one or more
repositories (web servers) that support the
OAI-PMH as a means of exposing metadata.
- Service Provider
- a Service Provider issues OAI-PMH requests to
data providers and uses the metadata as a basis
for building value-added services.
14General Information
- Harvesting
- in the OAI context, harvesting refers
specifically to the gathering together of
metadata from a number of distributed
repositories into a combined data store
15General Information
- Interoperability
- is the ability of systems, services and
organizations to work together seamlessly toward
common or diverse goals. In the technical arena
it is supported by open standards for
communication between systems and for description
of resources and collections, among others.
Interoperability is considered here primarily in
the context of resource discovery and access.
16General Information
- XML (Extensible Markup Language)
- it defines a means of describing data. XML can be
validated against a DTD or schema setting out the
elements of the language created
- DTD (Document Type Definition)
- a DTD is a formal specification of the structure
of a document
17The Journal System
- Significant challenges to the journal system
- Explosive growth of the Internet
- Publication delay
- Full transfer of rights by authors to publishers
- The implementation of peer review
- Skyrocketing subscription prices
- These challenges have led to the exploration of
alternative models for scholarly communication
18Growth of ePrint Archives
- The roots of OAI lie in the growing number of
ePrint archives. Several of these began as
- Informal vehicles for the dissemination of
- preliminary research results and
- gray literature
- A number of them have evolved into an essential
medium for sharing research results among
colleagues in a field
19Growth of ePrint Archives
- arXiv (xxx): 1991; Physics; Los Alamos
(Cornell?); 2.5 lakh preprints; OAI-PMH
- CogPrints: Cognitive Science; Univ. of
Southampton; OAI-PMH
- RePEc (NetEc): 1993; Economics; Univ. of
Surrey; Guildford Protocol
- NCSTRL: Computer Science; Dienst to OAI; ODU,
VT and others
- NDLTD: Theses and Dissertations; Virginia Tech.
20Growth of ePrint Archives
- The growth of ePrint archives exemplifies a more
equitable and efficient model for disseminating
research results
- An important challenge is to increase the impact
of the ePrint archives.
- The growth of ePrint archives demonstrates a
shift in the traditional scholarly communication
model (the journal system)
21Growth of ePrint Archives
- There are indications that a growing number of
disciplines, organizations and even commercial
publishers are inspired by this pioneering work
and are investigating alternative models for
scholarly communication
22Open Access Journals
- BMC (BioMed Central): open access publisher
- PLoS (Public Library of Science) will launch
peer-reviewed open access journals
- PLoS Biology already launched and
- PLoS Medicine will follow
- DOAJ: Directory of Open Access Journals
- http://www.doaj.org/
23ePrint Archives
- Basic aims of the ePrint archives initiative
- create a more effective scholarly communication
mechanism and
- thereby provide an alternative to the existing
scholarly communication model
24ePrint Archives
- Approaches taken by individual archives differ in
a number of ways
- Centralized model
- arXiv
- Distributed departmental/institutional model
- RePEc
- Some deal with gray literature
25ePrint Archives
- Approaches taken by individual archives differ in
a number of ways
- Some incorporate metadata of peer-reviewed papers
- Some deal with metadata only, others metadata and
full text
- Different protocols
- Dienst, Guildford
26ePrint Archives
- The different approaches and protocols used meant
- Discovery was not facilitated
- Different search interfaces
- No provision to share metadata (interoperability)
27ePrint Archives
- Key players recognised the need for a single
search interface to all the archives through
interoperability
- Two key interoperability problems impairing the
impact of ePrint archives were identified
- Multiple search interfaces
- No machine-based way of sharing the metadata
28ePrint Archives
- Solutions explored included
- Cross searching of archives
- Harvesting metadata from various archives to
build a central index
- In July 1999, Ginsparg, Luce and Van de Sompel
issued a call for technical experts to attend a
meeting in Santa Fe, NM in Oct '99
29Creation of UPS
- Creation of UPS (Universal Preprint Service) for
author self-archived scholarly literature was
proposed
- UPS would be the fundamental and free layer of
scholarly information, above which both free and
commercial services could flourish
30Creation of UPS
- The first step towards establishing UPS was
identification/creation of interoperable
technologies and frameworks for the
dissemination of ePrints
31Luce, Van de Sompel, Ginsparg
32UPS Prototype
- Architectural framework for UPS?
- Cross searching (Z39.50) vs. harvesting of
metadata
33UPS Prototype
- Searching vs. harvesting
- US digital library experience in this area (e.g.
NCSTRL) indicated that cross-searching is not the
preferred approach
- Distributed searching of N nodes is viable, but
only for small values of N
- NCSTRL: N > 100, not satisfactory
34UPS Prototype
- The UPS Prototype at Santa Fe, Oct '99
- Services based on a collection of harvested
metadata
- SFX/OpenURL linking
- Based on NCSTRL Dienst protocol
- Insights regarding lack of interoperability
- Recommendation: metadata harvesting
35UPS Prototype
- UPS architecture identified two logical roles
- Data Provider (deposit, publish, expose metadata)
- Service Provider (harvest, provide service)
36The Dawn of OAI
- The name UPS was quickly changed
- to avoid a clash with the already established
commercial parcel service and
- because not all e-print archives contained
preprints
- The framework within which this universal service
would be developed was now designated the Open
Archives initiative (OAi, and later OAI)
37Requirements for Metadata Harvesting
- For the harvesting method to work, there must be
agreements on
- Transport protocol (HTTP)
- Metadata formats (DC, MARC..)
- Quality assurance (mandatory fields)
- IP and usage rights (who can do what with the
records)
38The Dawn of a Protocol
- An initial agreement in key areas made it
possible to develop a protocol for metadata
harvesting, named the Santa Fe Convention in
honour of the meeting where the agreement was
reached.
39Benefits of Interoperability
- Facilitates information discovery, linking and
peer reviewing
- Increases visibility (impact)
- Single search interface
40What's in the Name
- Open Archives Initiative
- Open: the protocol is openly documented, and is
compliant with open standards (HTTP, DC and XML)
- Archive/Repository: contains a collection of
document-like objects
- Initiative: OAI is happening at break-neck speed
41Questions?
42Part II
- An Overview of the OAI-PMH
43OAI-PMH Version History
- The Santa Fe Convention (02/2000) was the first
incarnation of the OAI-PMH
- Goal: optimise discovery of e-prints
- Inputs
- UPS prototype
- RePEc/SODA data/service provider model
- Dienst protocol
- Deliberations at the Santa Fe Meeting 10/99
44OAI-PMH Version History
- OAI-PMH v.1.0 01/2001
- Goal: optimise discovery of document-like objects
- Inputs
- Santa Fe Convention
- various DLF meetings on metadata harvesting
- deliberations at Cornell
- alpha-testers of OAI-PMH v.1.0
- recognition of DC as the best core metadata
format for interoperability across multiple
archives
45OAI-PMH v 1.0 01/2001
- Low-barrier interoperability specification
- Metadata harvesting model: data provider /
service provider
- Focus on document-like objects
- HTTP based
- XML responses
- Unqualified Dublin Core
- Experimental: 12-18 months
46OAI-PMH Version History
- OAI-PMH v.2.0
- Goal: recurrent exchange of metadata about
resources between systems
- Inputs ...
- OAI-PMH v.1.0
- feedback on OAI-implementers
- deliberations by OAI-tech 09/01 - 06/02
- alpha test group of OAI-PMH v.2.0 03/02 - 06/02
- officially released June 14, 2002
47OAI-PMH v.2.0 06/2002
- Low-barrier interoperability specification
- Metadata harvesting model: data provider /
service provider
- Metadata about resources
- HTTP based
- XML responses
- Unqualified Dublin Core
- Stable; no backward compatibility with v.1.0
- Future releases will be backward compatible
48What OAI-PMH is not
- Not a search system on its own
- Not a database management system
- Not a single metadata schema
- Not an OAIS
49Basic Functioning of OAI-PMH
50OAI General Assumption
- Two groups of participants
- Data Providers (Open Archives, Repositories)
- free access to metadata
- not necessarily free access to full texts /
resources
- easy to implement, low-barrier solution
51OAI General Assumption
- Two groups of participants
- Service Providers
- use OAI interfaces of the Data Providers
- harvest and store metadata (no live requests!)
- may select certain subsets from Data
Providers (set hierarchy, date stamp)
- offer (value-added) service on the basis of the
metadata
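The "harvest and store, no live requests" idea can be sketched with a local store that is searched without contacting any data provider. This is a minimal sketch, assuming an in-memory SQLite database from Python's standard library; the table layout, the helper names and the sample records are all hypothetical.

```python
import sqlite3

# Hypothetical harvested records: (identifier, source repository, title).
HARVESTED = [
    ("oai:archive-a.example:1", "archive-a", "Quantum gravity notes"),
    ("oai:archive-b.example:7", "archive-b", "Models of memory"),
    ("oai:archive-c.example:3", "archive-c", "A quantum of trade"),
]


def build_store(records):
    """Load harvested metadata into a combined local store."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE records (identifier TEXT PRIMARY KEY, source TEXT, title TEXT)"
    )
    # INSERT OR REPLACE lets a re-harvested record overwrite its older copy.
    db.executemany("INSERT OR REPLACE INTO records VALUES (?, ?, ?)", records)
    return db


def search(db, term):
    """Search titles in the local store; no data provider is contacted."""
    rows = db.execute(
        "SELECT identifier FROM records WHERE title LIKE ? ORDER BY identifier",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]


store = build_store(HARVESTED)
print(search(store, "quantum"))
```

The search service here works only on the stored copy, which is what distinguishes harvesting from cross-searching the archives at query time.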
52Multiple data and service providers
[Figure: multiple data providers and service providers; harvesting based on OAI-PMH]
53Aggregators
[Figure: an aggregator placed between data providers and service providers]
54OAI-PMH Structure Model
[Figure: a Service Provider (harvester) exchanging requests and responses with several Data Providers, each exposing a repository (e-prints, images, OPAC, museum, archive)]
- Requests: Identify, ListMetadataFormats,
ListSets, ListIdentifiers, ListRecords, GetRecord
- Responses: general information, metadata
formats, set structure, record identifier,
metadata
55OAI-PMH Protocol Overview
- Protocol is based on HTTP
- Requests are issued using the HTTP GET or POST
method
- Responses are encoded in XML syntax
- Supports any metadata format (at least Dublin
Core)
56OAI-PMH Protocol Overview
- Data providers may support granularity for
selective harvesting by service providers
- Define a logical set hierarchy
- Date stamps (last change of the metadata)
- Error messages: HTTP status codes in v.1.x; in
v.2.0, protocol errors are returned as error
elements in the XML response
- Supports flow control
- Supports six request types (known as verbs)
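Selective harvesting by set hierarchy and date stamp can be illustrated from the data provider's side. This is a minimal sketch under stated assumptions: the repository contents, identifiers and helper names are hypothetical, set names use the colon-separated form of OAI-PMH `setSpec` values, and the `from`/`until` bounds are treated as inclusive.

```python
from datetime import date

# Hypothetical repository contents: (identifier, datestamp, setSpecs).
RECORDS = [
    ("oai:example.org:1", date(2001, 3, 1), ["physics:hep"]),
    ("oai:example.org:2", date(2002, 6, 15), ["physics:astro"]),
    ("oai:example.org:3", date(2002, 7, 1), ["math"]),
]


def in_set(record_sets, wanted):
    """A record in set 'physics:hep' also belongs to the parent set 'physics'."""
    return any(s == wanted or s.startswith(wanted + ":") for s in record_sets)


def select(records, from_=None, until=None, set_=None):
    """Return identifiers whose datestamp and set membership match the request."""
    selected = []
    for identifier, stamp, sets in records:
        if from_ is not None and stamp < from_:
            continue  # changed before the requested window
        if until is not None and stamp > until:
            continue  # changed after the requested window
        if set_ is not None and not in_set(sets, set_):
            continue  # not in the requested set or any of its subsets
        selected.append(identifier)
    return selected


print(select(RECORDS, set_="physics"))
print(select(RECORDS, from_=date(2002, 1, 1)))
```

A harvester that remembers the date of its last run can pass it as `from_` and receive only records changed since then.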
57OAI Verbs
- Identify
- ListSets
- ListMetadataFormats
- ListIdentifiers
- GetRecord
- ListRecords
58OAI Verbs - Identify
- Purpose
- Return general information about the archive and
its policies (e.g., date stamp granularity)
- Parameters
- None
- Sample URL
- http://eprints.iisc.ernet.in/perl/oai2?verb=Identify
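A request is the repository's base URL followed by `verb=...` and any further arguments as ordinary query parameters. The sketch below assumes Python's standard `urllib.parse`; the helper name `build_request_url` is an illustrative assumption, and the base URL is the one shown on the slide.

```python
from urllib.parse import urlencode


def build_request_url(base_url, verb, **arguments):
    """Return an OAI-PMH GET URL: base URL, then verb=... and any further arguments."""
    params = [("verb", verb)]
    # Skip arguments left as None so optional ones can simply be omitted.
    params += [(k, v) for k, v in arguments.items() if v is not None]
    return base_url + "?" + urlencode(params)


BASE_URL = "http://eprints.iisc.ernet.in/perl/oai2"  # base URL from the slide
print(build_request_url(BASE_URL, "Identify"))
print(build_request_url(BASE_URL, "ListIdentifiers", metadataPrefix="oai_dc"))
```

Since `from` is a reserved word in Python, that argument has to be passed as `**{"from": "2001-01-01"}`.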
59OAI Verbs - ListSets
- Purpose
- Provide a listing of sets in which records may be
organized
- Parameters
- None (apart from resumptionToken for flow
control)
- Sample URL
- http://eprints.iisc.ernet.in/perl/oai2?verb=ListSets
60OAI Verbs - ListMetadataFormats
- Purpose
- List metadata formats supported by the archive as
well as their schema locations and namespaces
- Parameters
- identifier: for a specific record (O, optional)
- Sample URL
- http://eprints.iisc.ernet.in/perl/oai2?verb=ListMetadataFormats
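A ListMetadataFormats response can be read with any XML parser. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` and the OAI-PMH 2.0 namespace; the sample response is hand-written for illustration and is not output from a real repository.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hand-written sample response (an assumption for illustration, not real server output).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2003-01-01T00:00:00Z</responseDate>
  <request verb="ListMetadataFormats">http://example.org/oai2</request>
  <ListMetadataFormats>
    <metadataFormat>
      <metadataPrefix>oai_dc</metadataPrefix>
      <schema>http://www.openarchives.org/OAI/2.0/oai_dc.xsd</schema>
      <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
    </metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>"""


def list_formats(xml_text):
    """Return one dict per metadata format: prefix, schema location, namespace."""
    root = ET.fromstring(xml_text)
    formats = []
    for fmt in root.findall("oai:ListMetadataFormats/oai:metadataFormat", OAI_NS):
        formats.append({
            "prefix": fmt.findtext("oai:metadataPrefix", namespaces=OAI_NS),
            "schema": fmt.findtext("oai:schema", namespaces=OAI_NS),
            "namespace": fmt.findtext("oai:metadataNamespace", namespaces=OAI_NS),
        })
    return formats


for fmt in list_formats(SAMPLE_RESPONSE):
    print(fmt["prefix"], fmt["schema"])
```

The `prefix` values returned here are what a harvester later passes as `metadataPrefix` to the other verbs.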
61OAI Verbs - ListIdentifiers
- Purpose
- List headers for all items corresponding to the
specified parameters
- Parameters
- from: start date (O, optional)
- until: end date (O)
- set: set to harvest from (O)
- metadataPrefix: metadata format to list
identifiers for (R, required)
- resumptionToken: flow control mechanism
(X, exclusive)
- Sample URL
- http://eprints.iisc.ernet.in/perl/oai2?verb=ListIdentifiers&metadataPrefix=oai_dc
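The (R), (O) and (X) markers on the verb slides (required, optional, exclusive) can be written down as a small rule table and checked before a request is sent. This is a minimal sketch; the table and the helper name `check_arguments` are illustrative assumptions, and the `resumptionToken` entry for ListSets follows the OAI-PMH 2.0 specification.

```python
# Argument rules per verb, using the slides' legend:
# R = required, O = optional, X = exclusive (the only argument besides the verb).
VERB_ARGUMENTS = {
    "Identify": {},
    "ListSets": {"resumptionToken": "X"},  # per OAI-PMH 2.0
    "ListMetadataFormats": {"identifier": "O"},
    "ListIdentifiers": {"from": "O", "until": "O", "set": "O",
                        "metadataPrefix": "R", "resumptionToken": "X"},
    "GetRecord": {"identifier": "R", "metadataPrefix": "R"},
    "ListRecords": {"from": "O", "until": "O", "set": "O",
                    "metadataPrefix": "R", "resumptionToken": "X"},
}


def check_arguments(verb, arguments):
    """Return a list of problems; an empty list means the request looks well-formed."""
    if verb not in VERB_ARGUMENTS:
        return [f"unknown verb: {verb}"]
    rules = VERB_ARGUMENTS[verb]
    problems = [f"illegal argument: {name}" for name in arguments if name not in rules]
    exclusive = [name for name in arguments if rules.get(name) == "X"]
    if exclusive:
        # An exclusive argument replaces all others, including required ones.
        if len(arguments) > 1:
            problems.append(f"{exclusive[0]} must be the only argument")
        return problems
    problems += [f"missing required argument: {name}"
                 for name, rule in rules.items()
                 if rule == "R" and name not in arguments]
    return problems


print(check_arguments("ListIdentifiers", {"metadataPrefix": "oai_dc"}))
print(check_arguments("GetRecord", {"identifier": "oai:example.org:10"}))
```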
62OAI Verbs - GetRecord
- Purpose
- Returns the metadata for a single item in the
form of an OAI record
- Parameters
- identifier: unique id for the item (R)
- metadataPrefix: metadata format for the record
(R)
- Sample URL
- http://eprints.iisc.ernet.in/perl/oai2?verb=GetRecord&identifier=oai:iiscePrints.OAI2:10&metadataPrefix=oai_dc
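An OAI record consists of a header (identifier, date stamp, optional set membership) and the metadata itself. The sketch below assumes Python's standard `xml.etree.ElementTree` and the `oai_dc` format; the sample response, including its identifier and field values, is hand-written and hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample response (hypothetical record, not real server output).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord>
    <record>
      <header>
        <identifier>oai:example.org:10</identifier>
        <datestamp>2003-01-01</datestamp>
        <setSpec>physics</setSpec>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A sample e-print</dc:title>
          <dc:creator>Author, A.</dc:creator>
          <dc:creator>Author, B.</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>"""


def parse_get_record(xml_text):
    """Return the record's header fields and its DC elements as a dict of lists."""
    record = ET.fromstring(xml_text).find("oai:GetRecord/oai:record", NS)
    fields = {}
    for element in record.findall("oai:metadata/oai_dc:dc/*", NS):
        name = element.tag.split("}", 1)[1]  # strip the "{namespace}" part of the tag
        fields.setdefault(name, []).append(element.text)
    return {
        "identifier": record.findtext("oai:header/oai:identifier", namespaces=NS),
        "datestamp": record.findtext("oai:header/oai:datestamp", namespaces=NS),
        "metadata": fields,
    }


print(parse_get_record(SAMPLE_RESPONSE))
```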
63OAI Verbs - ListRecords
- Purpose
- Retrieves metadata records for multiple items
- Parameters
- from: start date (O)
- until: end date (O)
- set: set to harvest from (O)
- resumptionToken: flow control mechanism (X)
- metadataPrefix: metadata format (R)
- Sample URL
- http://www.anarchive.org/cgi-bin/OAI?verb=ListRecords&metadataPrefix=oai_dc&from=2001-01-01
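One page of a ListRecords response carries several records and, when the list is incomplete, a resumptionToken. The sketch below assumes Python's standard `xml.etree.ElementTree` and `oai_dc` metadata; the sample page, its identifiers and its token are hand-written and hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample page (hypothetical records and token, not real server output).
SAMPLE_PAGE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
        <datestamp>2001-02-03</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>First sample title</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header>
        <identifier>oai:example.org:2</identifier>
        <datestamp>2001-04-05</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Second sample title</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""


def parse_page(xml_text):
    """Return ([(identifier, datestamp, title), ...], resumption token or None)."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.findall("oai:ListRecords/oai:record", NS):
        rows.append((
            record.findtext("oai:header/oai:identifier", namespaces=NS),
            record.findtext("oai:header/oai:datestamp", namespaces=NS),
            record.findtext("oai:metadata/oai_dc:dc/dc:title", namespaces=NS),
        ))
    token = root.findtext("oai:ListRecords/oai:resumptionToken",
                          default="", namespaces=NS)
    return rows, (token.strip() or None)  # empty or missing token: list is complete


rows, token = parse_page(SAMPLE_PAGE)
print(rows, token)
```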
64Protocol Details Flow Control
[Figure: flow control between the Service Provider (harvester) and the Data Provider (repository)]
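Flow control with resumptionToken amounts to a loop: request a page, process it, and repeat with the returned token until the repository sends no token or an empty one. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree`; `fake_fetch` is a hypothetical stand-in for the HTTP call, and the identifiers, page size and token format are invented for illustration.

```python
import xml.etree.ElementTree as ET

NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hypothetical repository contents and page size used by the stand-in below.
IDENTIFIERS = [f"oai:example.org:{n}" for n in range(1, 6)]
PAGE_SIZE = 2


def fake_fetch(arguments):
    """Stand-in for an HTTP request: returns one page of a ListIdentifiers response."""
    start = int(arguments.get("resumptionToken", 0))
    page = IDENTIFIERS[start:start + PAGE_SIZE]
    headers = "".join(
        f"<header><identifier>{i}</identifier>"
        f"<datestamp>2003-01-01</datestamp></header>"
        for i in page
    )
    next_start = start + PAGE_SIZE
    # An empty token on the last page signals that the list is complete.
    token = str(next_start) if next_start < len(IDENTIFIERS) else ""
    return (
        '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListIdentifiers>'
        f"{headers}<resumptionToken>{token}</resumptionToken>"
        "</ListIdentifiers></OAI-PMH>"
    )


def harvest(fetch, metadata_prefix="oai_dc"):
    """Yield every identifier, following resumptionTokens until the list is complete."""
    arguments = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(arguments))
        for header in root.findall("oai:ListIdentifiers/oai:header", NS):
            yield header.findtext("oai:identifier", namespaces=NS)
        token = root.findtext("oai:ListIdentifiers/oai:resumptionToken",
                              default="", namespaces=NS).strip()
        if not token:
            break
        # resumptionToken is exclusive: it is sent without any other argument.
        arguments = {"verb": "ListIdentifiers", "resumptionToken": token}


print(list(harvest(fake_fetch)))
```

Passing the fetch function in as a parameter keeps the loop testable offline; a real harvester would supply a function that performs the HTTP request.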
65OAI Compliant Tools
- eprints.org (http://www.eprints.org)
- DSpace (http://dspace.org)
- CDSware (http://cdsware.cern.ch)
- Kepler (http://kepler.cs.odu.edu/)
66OAI-PMH Based Services
- Repository Explorer
- http://oai.dlib.vt.edu/cgi-bin/Explorer/oai2.0/testoai/
- Search engines
- Arc: http://arc.cs.odu.edu/
- MyOAI: http://www.myoai.org/
- Physnet (subset of arXiv, IOP)
- http://physnet.uni-oldenburg.de/oai/query.php
- OAIster: http://oaister.umdl.umich.edu/o/oaister/
67Summary
- Low-cost mechanism for harvesting metadata
records from one system to another
- Based on HTTP and XML: Web-friendly
- Development over the last 2-3 years has seen a
move from the specific (discovery of e-prints) to
the generic (sharing descriptions of any resource)
68Summary
- Recommends simple DC as record format but
extensible to any format encoded in XML
- OAI-PMH is not a search protocol
- Metadata and full-text typically made freely
available but not a requirement
- OAI-PMH can be used between closed groups
69Other Important Resources
- OAI Web site
- http://www.openarchives.org/
- Open Archives Forum
- http://www.oaforum.org/tutorial/index.php
- The Santa Fe Convention of the Open Archives
Initiative, by Herbert Van de Sompel and Carl
Lagoze, D-Lib Magazine, Vol. 6 No. 2, Feb 2000
70Questions?
71Thank you for your PresencePatience