Title: OAI Overview
1OAI Overview
- Michael L. Nelson
- Old Dominion University
- Norfolk, Virginia, USA
- mln_at_cs.odu.edu
- http://www.cs.odu.edu/mln/
Bioinformatics Seminar, ODU CS 791/891, Feb 3, 2003
2The Rise and Fall of Distributed Searching
- wholesale distributed searching, popular at the
time, is attractive in theory but troublesome in
practice
- Davis & Lagoze, JASIS 51(3), pp. 273-80
- Powell & French, Proc 5th ACM DL, pp. 264-265
- distributed searching of N nodes still viable,
but only for small values of N
- NCSTRL: N > 100, bad
- NTRS/NIX: N < 20, ok (but could be better)
3The Rise and Fall of Distributed Searching
- Other problems of distributed searching (from
STARTS)
- source-metadata problem
- how do you know which nodes to search?
- query-language problem
- syntax varies and drifts over time between the
various nodes
- rank-merging problem
- how do you meaningfully merge multiple result
sets?
- Temptations
- centralize all functions
- "everything will be done at X"
- standardize on a single product
- "everyone will use system Y"
4Universal Preprint Service
- A cross-archive DL that provides services on
a collection of metadata harvested from multiple
archives
- based on NCSTRL, a modified version of Dienst
- support for clustering
- support for buckets
- Demonstrated at Santa Fe, NM, October 21-22, 1999
- http://ups.cs.odu.edu/
- D-Lib Magazine, 6(2) 2000 (2 articles)
- http://www.dlib.org/dlib/february00/02contents.html
- UPS was soon renamed the Open Archives Initiative
(OAI): http://www.openarchives.org/
5Data and Service Providers
- Data Providers
- publishing into an archive
- providing methods for metadata harvesting
- also provide non-technical context for sharing
information
- Service Providers
- harvest metadata from providers
- implement user interface to data
- Self-describing archives
- Much of the learning about the constituent UPS
archives occurred out of band
Even if these are done by the same DL, these are
distinct roles
6Metadata Harvesting
- Move away from distributed searching
- Extract metadata from various sources
- Build services on local copies of metadata
- data remains at remote repositories
[Figure: a user query (e.g., search for cfd applications) runs against a local copy of metadata that was harvested offline from several nodes. All searching, browsing, etc. is performed on the local metadata; each node is independently maintained, and individual nodes can still support direct user interaction.]
7Result: OAI
- http://www.openarchives.org/
- The OAI was the result of the demonstration and
discussion during the Santa Fe meeting
- Initial focus was on federating collections of
scholarly e-print materials
- however, interest grew and the scope and
application of OAI expanded to become a generic
bulk metadata transport protocol
- Note:
- OAI is only about metadata -- not full text!
- OAI is neutral with respect to the nature of the
metadata or the resources the metadata describes
- read: commercial publishers have an interest in
OAI too...
8Santa Fe convention
OAI-PMH v.1.0/1.1
OAI-PMH v.2.0
9Dublin Core
- Dublin Core Metadata Initiative
- http://www.dublincore.org/
- from 1994-1995, recognizing the need for simple,
interoperable metadata for resource discovery
- good overview of metadata & DC:
http://www.dlib.org/dlib/january01/lagoze/01lagoze.html
- 15 elements (qualifiers possible)
10Overview of OAI Verbs
archival metadata
harvesting verbs
most verbs take arguments: dates, sets, ids,
metadata formats and resumption token (for flow
control)
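As a rough illustration of the verbs and their arguments, the sketch below builds request URLs for the six verbs defined in OAI-PMH 2.0. The helper name `build_request`, the `from_` spelling and the base URL in the usage note are assumptions made for this example; only the verb and argument names come from the protocol.

```python
from urllib.parse import urlencode

# The six OAI-PMH 2.0 verbs and the arguments each one accepts.
VERB_ARGS = {
    "Identify": set(),
    "ListMetadataFormats": {"identifier"},
    "ListSets": {"resumptionToken"},
    "ListIdentifiers": {"from", "until", "set", "metadataPrefix", "resumptionToken"},
    "ListRecords": {"from", "until", "set", "metadataPrefix", "resumptionToken"},
    "GetRecord": {"identifier", "metadataPrefix"},
}

def build_request(base_url, verb, **args):
    """Return the GET URL for one OAI-PMH request (hypothetical helper)."""
    if verb not in VERB_ARGS:
        raise ValueError(f"unknown verb: {verb}")
    # "from" is a Python keyword, so callers pass it as from_.
    params = {name.rstrip("_"): value for name, value in args.items()}
    unexpected = set(params) - VERB_ARGS[verb]
    if unexpected:
        raise ValueError(f"{verb} does not take {sorted(unexpected)}")
    return base_url + "?" + urlencode({"verb": verb, **params})
```

For example, `build_request("http://example.org/oai", "ListRecords", metadataPrefix="oai_dc", from_="2003-01-01")` asks for all Dublin Core records changed since that date; the host is a placeholder.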
11Argument Summary
12Error Summary
Generate badVerb on any input not matching the 6
defined verbs
this is an inversion of the table in section 3.6 of
the OAI-PMH specification
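A minimal sketch of how an archive might decide between badVerb and badArgument, assuming the request arrives as a plain dict of argument names to values (repeated arguments are not modeled). The function name `check_request` is made up for this example, and the other OAI-PMH error codes (idDoesNotExist, noRecordsMatch, etc.) depend on repository contents and are not covered.

```python
REQUIRED = {
    "Identify": set(),
    "ListMetadataFormats": set(),
    "ListSets": set(),
    "ListIdentifiers": {"metadataPrefix"},
    "ListRecords": {"metadataPrefix"},
    "GetRecord": {"identifier", "metadataPrefix"},
}
OPTIONAL = {
    "Identify": set(),
    "ListMetadataFormats": {"identifier"},
    "ListSets": set(),
    "ListIdentifiers": {"from", "until", "set"},
    "ListRecords": {"from", "until", "set"},
    "GetRecord": set(),
}
# Verbs that accept resumptionToken as an exclusive argument.
RESUMABLE = {"ListSets", "ListIdentifiers", "ListRecords"}

def check_request(params):
    """Return an OAI-PMH error code, or None if the request is well formed."""
    verb = params.get("verb")
    if verb not in REQUIRED:
        return "badVerb"  # missing verb, or anything but the six defined verbs
    args = set(params) - {"verb"}
    if "resumptionToken" in args:
        # resumptionToken must be the only argument besides the verb
        ok = verb in RESUMABLE and args == {"resumptionToken"}
        return None if ok else "badArgument"
    if not REQUIRED[verb] <= args:
        return "badArgument"  # a required argument is missing
    if args - REQUIRED[verb] - OPTIONAL[verb]:
        return "badArgument"  # an argument this verb does not define
    return None
```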
13Flow Control
- ListSets, ListIdentifiers, ListRecords are all
allowed to return partial responses, via a
combination of:
- resumptionToken: an opaque, archive-defined data
string that, when passed back to the archive,
allows the response to begin where it left off
- each archive defines its own resumptionToken
syntax; it may have visible semantics or not
- HTTP status code 503 with Retry-After
- up to the harvester to understand this code and
respect it, and up to the archive to enforce it
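One way a harvester could respect a 503 is sketched below. `fetch` is a stand-in for whatever HTTP client the harvester uses and is assumed to return `(status, headers, body)` with headers as a plain dict; that interface, the function name and the 60-second fallback are assumptions for this example.

```python
import time

def fetch_with_retry(fetch, url, max_attempts=5, sleep=time.sleep):
    """Call fetch(url) until it stops answering 503, honoring Retry-After."""
    for _ in range(max_attempts):
        status, headers, body = fetch(url)
        if status != 503:
            return status, body
        # Retry-After is read as a number of seconds here; HTTP also allows
        # an HTTP-date value, which this sketch does not parse.
        sleep(int(headers.get("Retry-After", "60")))
    raise RuntimeError(f"still 503 after {max_attempts} attempts: {url}")
```

Passing `sleep` as a parameter keeps the waiting policy testable without real delays.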
14resumptionToken
scenario: harvesting 277 records in 3
separate chunks of up to 100 records
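The 277-record scenario can be simulated with a toy archive. Everything here is illustrative: the identifiers are invented, the token is simply the next offset (a real archive may use any syntax), and the end of the list is modeled as `None` rather than the empty resumptionToken element the protocol uses.

```python
CHUNK = 100
RECORDS = [f"oai:example.org:{n}" for n in range(277)]  # hypothetical identifiers

def list_identifiers(resumption_token=None):
    """Toy archive: return (chunk, next_token); next_token is None at the end."""
    # This archive's token is the next offset as a string. A harvester must
    # treat it as opaque and never parse or construct one itself.
    start = int(resumption_token) if resumption_token else 0
    end = min(start + CHUNK, len(RECORDS))
    next_token = str(end) if end < len(RECORDS) else None
    return RECORDS[start:end], next_token

def harvest():
    """Harvester: repeat the request with the returned token until none comes back."""
    harvested, chunk_sizes, token = [], [], None
    while True:
        chunk, token = list_identifiers(token)
        harvested.extend(chunk)
        chunk_sizes.append(len(chunk))
        if token is None:
            return harvested, chunk_sizes
```

The harvester only echoes tokens back; all knowledge of offsets stays inside the archive.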
15OAI Links & Demos
- Data providers
- not really meant for end-user interaction, but
Suleman's Repository Explorer is an excellent
tool
- http://purl.org/net/oai_explorer
- 100 registered data providers
- http://oaisrv.nsdl.cornell.edu/Register/BrowseSites.pl
- many being used for internal purposes are not
registered
- Service providers
- Arc, the first known SP harvesting from OAI data
providers
- http://arc.cs.odu.edu/
- 20 registered service providers
- http://www.openarchives.org/service_provider/oai_sp.htm
- several more known to be in testing or creation
16Field of Dreams
- It should be easy to be a data provider, even if
it makes more work for the service provider.
- if enough data providers exist, the service
providers will come (DPs >> SPs)
- Open-source / freely available tools
- drop-in data providers
- industrial strength: http://www.eprints.org/
- personal size: http://kepler.cs.odu.edu/
- tools to make your existing DL a data provider
- http://www.openarchives.org/tools/tools.htm
- also OAI-implementers mailing list / mail
archive!
- service providers
- only bits and pieces currently publicly
available...
17OAI Observation: Front-End Only
- No input/registry mechanism
- OAI harvesting protocol is always a front-end for
something else
- filesystem, Dienst, RDBMS, LDAP, etc.
- convenient for pre-existing DLs, but does not
address new DLs
- e.g., "we want to do OAI"
- Bounds the scope of OAI
- responsibilities and domain of OAI are still being
discussed
- tension between functionality and simplicity
18OAI Observation: No T&C
- No terms & conditions provisions in protocol
- assumes all metadata has uniform access rights
- how to restrict metadata to certain hosts?
- introducing T&C would increase the scope of
application, but at the expense of simplicity
- how expensive do we want to make a
"just-a-front-end" protocol?
- maybe T&C is a good application for sets?
19OAI Observation: No T&C
- Possible to use multiple OAI servers in a
DMZ-like configuration
[Figure: a Public OAI Server and a Private OAI Server in front of the source database, with OAI requests arriving from arbitrary hosts and from trusted hosts; could even use a separate copy of the database.]
20OAI Observation: No T&C
- Possible to use OAI harvesting protocol in
closed, restricted systems
[Figure: four DLs labeled OAI 1, OAI 2, OAI 3 and OAI 4; all OAI requests originate from these 4 DLs.]
21OAI Observation: Monolithic
- An OAI server has no protocol-defined concept of
other OAI servers
- backups, mirrors, etc. have to be resolved
outside of the scope of OAI
- scope vs. complexity again
- fully connected graph of DLs harvesting from each
other is unnecessary
- cf. web crawlers vs. gatherers in U of Colorado's
Harvest System
- 3rd party harvesting interfaces raise more T&C
and data coherency issues
22302 Load Balancing
- Interactive users on main DL machine should not
be impacted by metadata harvesting
- "don't take deliveries through the front door"
- not part of the protocol; defined outside the
protocol
[Figure labels: harvester, naca.larc.nasa.gov/oai/, OAI Server]
23OAI Observation: Data Coherency
- In the interest of OAI implementer simplicity,
several issues are left for the service provider
to interpret
- what is an update vs. addition?
- in the NACA OAI interface, they are reported as
the same and it's up to the harvesting system to
figure it out
- deletions?
- it is currently optional for OAI systems to mark
records as deleted or not
- still left to the harvester to interpret
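A sketch of how a harvesting system might sort this out on its side, assuming each harvested record is reduced to a dict with `identifier`, `datestamp` and either `metadata` or `deleted: True`. That record shape and the name `apply_harvest` are assumptions for illustration, not something the protocol or the NACA interface prescribes.

```python
def apply_harvest(store, records):
    """Merge harvested records into a local store keyed by identifier.

    Returns counts of what the harvester inferred from its own state.
    """
    counts = {"added": 0, "updated": 0, "deleted": 0}
    for rec in records:
        ident = rec["identifier"]
        if rec.get("deleted"):
            # Only archives that choose to track deletions send these.
            if store.pop(ident, None) is not None:
                counts["deleted"] += 1
        elif ident in store:
            store[ident] = rec  # seen before: treat as an update
            counts["updated"] += 1
        else:
            store[ident] = rec  # never seen: treat as an addition
            counts["added"] += 1
    return counts
```

If an archive never reports deletions, the only way for this harvester to notice them is an occasional all-at-once harvest compared against the local store.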
24OAI Observation: Harvest Model
- Frequency of harvests
- all-at-once harvests?
- initial harvest
- resolving data coherency
- frequent incremental harvests?
- far more efficient for both service and data
providers
- Webcrawling vs. digital library models
- webcrawlers: little to no a priori information
about target
- DLs: frequent harvesting of a small number of
known targets
- Realization: we know very little about
harvesting behavior
- are we optimizing for all-at-once, when
incremental will be more common?
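The difference between the two harvest styles comes down to whether a `from` date is sent. The sketch below uses a toy in-memory archive; `ARCHIVE`, `toy_list_records` and `incremental_harvest` are hypothetical names, and a real harvester would issue ListRecords requests instead.

```python
ARCHIVE = [
    {"identifier": "a", "datestamp": "2003-01-10"},
    {"identifier": "b", "datestamp": "2003-01-20"},
    {"identifier": "c", "datestamp": "2003-02-01"},
]

def toy_list_records(from_=None):
    # 'from' is inclusive in OAI-PMH, so >= is used here.
    return [r for r in ARCHIVE if from_ is None or r["datestamp"] >= from_]

def incremental_harvest(list_records, last_harvest=None):
    """Ask only for records changed since the previous harvest.

    Passing last_harvest=None sends no 'from' argument, i.e. an
    all-at-once harvest. Returns (records, value to use as 'from' next time).
    """
    records = list_records(last_harvest)
    if not records:
        return [], last_harvest
    # ISO 8601 datestamps of the same granularity sort correctly as strings.
    newest = max(r["datestamp"] for r in records)
    return records, newest
```

Because `from` is inclusive, records carrying the boundary datestamp come back again on the next run, so the harvester has to tolerate duplicates.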
25Other Uses For the OAI-PMH
- Assumptions
- Traditional DLs / SPs will continue on their
present path of increasing sophistication
- citation indexing, search results viz,
personalization, recommendations, subject-based
filtering, etc.
- growth rates remain the same (5x as many DPs as SPs)
- Premise: OAI-PMH is applicable to any scenario
that needs to update / synchronize distributed
state
- Future opportunities are possible by creatively
interpreting the OAI-PMH data model
26OAI-PMH Data Model
item: identifier
record: identifier + metadata format + datestamp
27Typical Values
- repository: collection of publications
- resource: scholarly publication
- item: all metadata (DC, MARC)
- record: a single metadata format
- datestamp: last update / addition of a record
- metadata format: bibliographic metadata format
- set: originating institution or subject categories
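The relationship between item and record can be written down as two small classes. This is only an illustrative sketch: the class and field names are chosen for this example, and the protocol itself defines no such API.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Record:
    identifier: str        # same identifier as the item it came from
    metadata_prefix: str   # e.g. "oai_dc" or "oai_marc"
    datestamp: str         # last update / addition of this record
    metadata: str          # serialized metadata in that one format

@dataclass
class Item:
    identifier: str
    sets: set = field(default_factory=set)
    records: dict = field(default_factory=dict)  # metadata_prefix -> Record

    def add(self, metadata_prefix, datestamp, metadata):
        rec = Record(self.identifier, metadata_prefix, datestamp, metadata)
        self.records[metadata_prefix] = rec
        return rec

    def get_record(self, metadata_prefix):
        """Mimic GetRecord: one item, one metadata format, one record."""
        return self.records.get(metadata_prefix)
```

An item is thus the container for all metadata about a resource, and a record is what a harvester actually receives for one chosen metadata format.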
28Repositories
- Stretching the idea of a repository a bit
- contextually sensitive repositories
- personalization for harvesters
- communication between strangers, or communication
between friends?
- OAI-PMH for individual complex objects?
- OAI-PMH without MySQL?!
- Fedora, Multi-valent documents, buckets
- tar, jar, zip, etc. files
29Resource
- What if resource were
- computer system status
- uptime, who, w, df, ps, etc.
- or generalized system status
- e.g., sports league standings
- people
- personnel databases
- authority files for authors
30Item
- What if item were
- software
- union of versions & formats
- all forms of metadata
- administrative & structural
- citations, annotations, reviews, etc.
- data
- e.g., newsfeeds and other XML expressible content
- metadataPrefixes or sets could be defined to be
different versions
31Record
- What if record were
- specific software instantiations / updates
- access / retrieval logs for DLs (or computer
systems)
- push / pull model inversion
- put a harvester on the client behind a firewall,
the client contacts a DP and receives
instructions on how to submit the desired
document (e.g., send email to a specified address)
32Datestamp
- semantics of datestamp are strongly influenced by
the choice of resource / item / record /
metadataPrefix, but it could be used to:
- signify change of set membership (e.g., workflow
item moves from "submitted" to "approved")
- change datestamp to reflect access to the DP
- e.g., in conjunction with metadataPrefixes of
"accessed" or "mirrored"
33metadataPrefix
- what if metadataPrefix were
- instructions for extracting / archiving /
scraping the resource
- verb=ListRecords&metadataPrefix=extract_TIFFs
- code fragments to run locally
- (harvested from a trusted source!)
- XSLT for other metadataPrefixes
- branding container is at the repository-level,
this could be record- or item-level
34Set
- sets are already used for tunneling OAI-PMH
extensions (see Suleman & Fox, D-Lib 7(12))
- other uses
- in aggregators, automatically create 1 set per
baseURL
- have hidden sets (or metadataPrefix) that have
administrative or community-specific values (or
triggers)
- set=accessed>1000&from=2001-01-01
- set=harvestMeWithTheseARGS&until=2002-05-05&metadataPrefix=oai_marc
35Interesting Services
- DP9
- gateway to expose repository contents in HTML
suitable for web crawlers
- Celestial
- OAI cache, also 1.1 -> 2.0 converter
- Static (mini-) repositories
- XML files, based on OLAC work
- OpenURL metadata format registries
- record metadata format