1
OAI Overview
  • Michael L. Nelson
  • Old Dominion University
  • Norfolk, Virginia, USA
  • mln@cs.odu.edu
  • http://www.cs.odu.edu/~mln/

Bioinformatics Seminar, ODU CS 791/891, Feb 3, 2003
2
The Rise and Fall of Distributed Searching
  • wholesale distributed searching, popular at the
    time, is attractive in theory but troublesome in
    practice
  • Davis & Lagoze, JASIS 51(3), pp. 273-80
  • Powell & French, Proc 5th ACM DL, pp. 264-265
  • distributed searching of N nodes still viable,
    but only for small values of N
  • NCSTRL: N > 100, bad
  • NTRS/NIX: N < 20, ok (but could be better)

3
The Rise and Fall of Distributed Searching
  • Other problems of distributed searching (from
    STARTS)
  • source-metadata problem
  • how do you know which nodes to search?
  • query-language problem
  • syntax varies and drifts over time between the
    various nodes
  • rank-merging problem
  • how do you meaningfully merge multiple result
    sets?
  • Temptations
  • centralize all functions
  • everything will be done at X
  • standardize on a single product
  • everyone will use system Y

4
Universal Preprint Service
  • A cross-archive DL that provides services on
    a collection of metadata harvested from multiple
    archives
  • based on NCSTRL+, a modified version of Dienst
  • support for clustering
  • support for buckets
  • Demonstrated at Santa Fe, NM, October 21-22, 1999
  • http://ups.cs.odu.edu/
  • D-Lib Magazine, 6(2) 2000 (2 articles)
  • http://www.dlib.org/dlib/february00/02contents.html
  • UPS was soon renamed the Open Archives Initiative
    (OAI): http://www.openarchives.org/

5
Data and Service Providers
  • Data Providers
  • publishing into an archive
  • providing methods for metadata harvesting
  • also provide non-technical context for sharing
    information
  • Service Providers
  • harvest metadata from providers
  • implement user interface to data
  • Self-describing archives
  • Much of the learning about the constituent UPS
    archives occurred out of band

Even if both are done by the same DL, these are
distinct roles.
6
Metadata Harvesting
  • Move away from distributed searching
  • Extract metadata from various sources
  • Build services on local copies of metadata
  • data remains at remote repositories

[Figure: a user query (e.g., "search for cfd applications") runs against a
local copy of metadata, where all searching, browsing, etc. is performed.
The metadata is harvested offline from a number of nodes. Each node is
independently maintained and can still support direct user interaction.]
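
As a rough illustration of building a service on local copies of metadata, the sketch below answers a query from a harvested store without contacting any remote node. The store layout, the identifiers, the titles, and the `search` helper are all assumptions made for this example; they do not come from any OAI software.

```python
# Hypothetical local store: identifier -> harvested metadata fields.
local_metadata = {
    "oai:nodeA:1": {"title": "CFD Applications in Wing Design"},
    "oai:nodeB:7": {"title": "Protein Folding Survey"},
    "oai:nodeC:3": {"title": "Grid Generation for CFD applications"},
}


def search(store, phrase):
    """Return identifiers whose title contains the phrase (case-insensitive)."""
    phrase = phrase.lower()
    return sorted(
        ident
        for ident, fields in store.items()
        if phrase in fields.get("title", "").lower()
    )


# The query never leaves the service provider; the data stays remote.
hits = search(local_metadata, "cfd applications")
```

The full resources remain at the remote repositories; only the metadata needed for searching is held locally.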
7
Result: OAI
  • http://www.openarchives.org/
  • The OAI was the result of the demonstration and
    discussion during the Santa Fe meeting
  • Initial focus was on federating collections of
    scholarly e-print materials
  • however, interest grew and the scope and
    application of OAI expanded to become a generic
    bulk metadata transport protocol
  • Note
  • OAI is only about metadata -- not full text!
  • OAI is neutral with respect to the nature of the
    metadata or the resources the metadata describes
  • read: commercial publishers have an interest in
    OAI too...

8
Santa Fe convention
OAI-PMH v.1.0/1.1
OAI-PMH v.2.0
9
Dublin Core
  • Dublin Core Metadata Initiative
  • http://www.dublincore.org/
  • from 1994-1995, recognizing the need for simple,
    interoperable metadata for resource discovery
  • good overview of metadata & DC:
    http://www.dlib.org/dlib/january01/lagoze/01lagoze.html
  • 15 elements (qualifiers possible)

10
Overview of OAI Verbs
[Table: the OAI verbs, grouped into archival metadata verbs and harvesting verbs]
Most verbs take arguments: dates, sets, ids, metadata formats, and a
resumption token (for flow control).
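
A minimal sketch of how a verb and its arguments turn into a request URL, assuming the six verb names and the argument names from OAI-PMH 2.0. The `build_request` helper and the base URL `http://example.org/oai` are hypothetical, chosen only for illustration.

```python
from urllib.parse import urlencode

# The six verbs defined by OAI-PMH 2.0.
OAI_VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",   # archival metadata
    "GetRecord", "ListIdentifiers", "ListRecords",   # harvesting
}


def build_request(base_url, verb, **args):
    """Return the GET URL for one OAI-PMH request."""
    if verb not in OAI_VERBS:
        raise ValueError(f"not an OAI-PMH verb: {verb}")
    params = {"verb": verb}
    for key, value in args.items():
        # 'from' is a Python keyword, so callers pass from_ instead.
        params[key.rstrip("_")] = value
    return base_url + "?" + urlencode(params)


url = build_request(
    "http://example.org/oai",  # hypothetical baseURL
    "ListRecords",
    metadataPrefix="oai_dc",
    from_="2003-01-01",
)
```

Here `url` carries a date and a metadata format; a set, an identifier, or a resumption token would be passed the same way.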
11
Argument Summary
12
Error Summary
Generate badVerb on any input not matching the 6 defined verbs.
This is an inversion of the table in section 3.6 of the OAI-PMH
specification.
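
The badVerb rule can be sketched as a single check on the query string. This assumes the OAI-PMH 2.0 definition, in which badVerb covers a missing, repeated, or unrecognized verb argument; the `check_verb` function is a hypothetical helper, not taken from any particular implementation.

```python
from urllib.parse import parse_qs

VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "GetRecord", "ListIdentifiers", "ListRecords",
}


def check_verb(query_string):
    """Return 'badVerb' if the verb is missing, repeated, or unknown; else None."""
    verbs = parse_qs(query_string).get("verb", [])
    if len(verbs) != 1 or verbs[0] not in VERBS:
        return "badVerb"
    return None
```

Verb names are case-sensitive in this sketch, so `verb=listrecords` is rejected. The other error conditions (bad arguments, unknown identifiers, and so on) would need their own checks.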
13
Flow Control
  • ListSets, ListIdentifiers, ListRecords are all
    allowed to return partial responses, via a
    combination of:
  • resumptionToken: an opaque, archive-defined data
    string that, when passed back to the archive,
    allows the response to resume where it left off
  • each archive defines its own resumptionToken
    syntax; it may or may not have visible semantics
  • HTTP status code 503 with Retry-After
  • it is up to the harvester to understand this code and
    respect it, and up to the archive to enforce it
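
A harvester-side sketch of the two flow-control mechanisms, under stated assumptions: `fetch` is a caller-supplied, hypothetical function that performs one request and returns `(status, headers, body)`; the response uses the OAI-PMH 2.0 XML namespace; and Retry-After is given in seconds (the HTTP-date form is not handled here).

```python
import time
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def harvest(fetch, first_args, sleep=time.sleep):
    """Yield response bodies until the archive stops issuing resumptionTokens."""
    args = dict(first_args)
    while True:
        status, headers, body = fetch(args)
        if status == 503:
            # The archive asked us to back off; wait, then retry the same request.
            sleep(int(headers.get("Retry-After", "60")))
            continue
        yield body
        token = ET.fromstring(body).find(f".//{OAI_NS}resumptionToken")
        if token is None or not (token.text or "").strip():
            return  # no token, or an empty token: the list is complete
        # The token is opaque: pass it back unchanged, with only the verb.
        args = {"verb": args["verb"], "resumptionToken": token.text.strip()}
```

The harvester never inspects the token, which leaves each archive free to choose its own syntax.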

14
resumptionToken
Scenario: harvesting 277 records in 3 separate
chunks of up to 100 records each
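
The scenario can be simulated with a toy archive. In this sketch the resumptionToken is simply the next offset, which is one choice an archive could make since the syntax is archive-defined; the record names and the `list_records` function are made up for the example.

```python
RECORDS = [f"record-{i}" for i in range(277)]  # made-up identifiers
CHUNK = 100


def list_records(resumption_token=None):
    """Return (chunk, next_token); next_token is None on the last chunk."""
    start = int(resumption_token) if resumption_token else 0
    chunk = RECORDS[start:start + CHUNK]
    end = start + len(chunk)
    return chunk, (str(end) if end < len(RECORDS) else None)


sizes, token = [], None
while True:
    chunk, token = list_records(token)
    sizes.append(len(chunk))
    if token is None:
        break

print(sizes)  # [100, 100, 77]
```

Three requests are needed: two full chunks, then a final partial chunk with no further token.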
15
OAI Links & Demos
  • Data providers
  • not really meant for end-user interaction, but
    Suleman's Repository Explorer is an excellent
    tool
  • http://purl.org/net/oai_explorer
  • 100 registered data providers
  • http://oaisrv.nsdl.cornell.edu/Register/BrowseSites.pl
  • many are used for internal purposes and are not
    registered
  • Service providers
  • Arc, the first known SP harvesting from OAI data
    providers
  • http://arc.cs.odu.edu/
  • 20 registered service providers
  • http://www.openarchives.org/service_provider/oai_sp.htm
  • several more known to be in testing or creation

16
Field of Dreams
  • It should be easy to be a data provider, even if
    it makes more work for the service provider.
  • if enough data providers exist, the service
    providers will come (DPs >> SPs)
  • Open-source / freely available tools
  • drop-in data providers
  • industrial strength: http://www.eprints.org/
  • personal size: http://kepler.cs.odu.edu/
  • tools to make your existing DL a data provider
  • http://www.openarchives.org/tools/tools.htm
  • also OAI-implementers mailing list / mail
    archive!
  • service providers
  • only bits and pieces currently publicly
    available...

17
OAI Observation: Front-End Only
  • No input/registry mechanism
  • OAI harvesting protocol is always a front-end for
    something else
  • filesystem, Dienst, RDBMS, LDAP, etc.
  • convenient for pre-existing DLs, but does not
    address new DLs
  • e.g., "we want to do OAI"
  • Bounds the scope of OAI
  • responsibilities and domain of OAI are still being
    discussed
  • tension between functionality and simplicity

18
OAI Observation: No T&C
  • No terms & conditions provisions in protocol
  • assumes all metadata has uniform access rights
  • how to restrict metadata to certain hosts?
  • introducing T&C would increase the scope of
    application, but at the expense of simplicity
  • how expensive do we want to make a
    just-a-front-end protocol?
  • maybe T&C is a good application for sets?

19
OAI Observation: No T&C
  • Possible to use multiple OAI servers in a
    DMZ-like configuration

[Figure: a Public OAI Server receives OAI requests from arbitrary hosts and
a Private OAI Server receives OAI requests from trusted hosts; both sit in
front of the source database. Could even use a separate copy of the database.]
20
OAI Observation: No T&C
  • Possible to use OAI harvesting protocol in
    closed, restricted systems

[Figure: four DLs (OAI 1, OAI 2, OAI 3, OAI 4); all OAI requests originate
from these 4 DLs.]
21
OAI Observation: Monolithic
  • An OAI server has no protocol-defined concept of
    other OAI servers
  • backups, mirrors, etc. have to be resolved
    outside of the scope of OAI
  • scope vs. complexity again
  • fully connected graph of DLs harvesting from each
    other is unnecessary
  • cf. web crawlers vs. gatherers in the U. of Colorado's
    Harvest System
  • 3rd party harvesting interfaces raise more T&C
    and data coherency issues

22
302 Load Balancing
  • Interactive users on main DL machine should not
    be impacted by metadata harvesting
  • don't take deliveries through the front door
  • not part of the protocol; defined outside the
    protocol

[Figure: a harvester sends its request to naca.larc.nasa.gov/oai/ and is
redirected (302) to a separate OAI Server.]
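
One way to picture 302 load balancing is a front door that answers every harvesting request with a redirect to a separate host. This is only a sketch under assumptions: the host names are placeholders, the round-robin choice is arbitrary, and `make_redirector` is a hypothetical helper rather than how the NACA server is actually built.

```python
import itertools


def make_redirector(harvest_hosts):
    """Return a responder that answers every request with a 302 redirect."""
    pool = itertools.cycle(harvest_hosts)

    def respond(path_and_query):
        # Send the harvester to the next back-end host, keeping its request intact.
        return 302, {"Location": f"http://{next(pool)}{path_and_query}"}

    return respond


# Placeholder back-end hosts that serve OAI requests away from interactive users.
respond = make_redirector(["oai1.example.org", "oai2.example.org"])
```

Because redirection happens at the HTTP level, the harvester needs nothing beyond an HTTP client that follows 302 responses.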
23
OAI Observation: Data Coherency
  • In the interest of OAI implementer simplicity,
    several issues are left for the service provider
    to interpret
  • what is an update vs. addition?
  • in the NACA OAI interface, they are reported as
    the same and it's up to the harvesting system to
    figure it out
  • deletions?
  • it is currently optional for OAI systems to mark
    records as deleted or not
  • still left to the harvester to interpret
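
A sketch of how a harvester might resolve these questions itself: an identifier already held locally is treated as an update, an unknown one as an addition, and a record marked deleted (where the archive reports it) is dropped. The record dictionaries and the `apply_harvest` function are assumptions for illustration.

```python
def apply_harvest(local, harvested):
    """Merge harvested records into a local store keyed by identifier.

    Returns counts of what the harvester inferred from the incoming records.
    """
    counts = {"added": 0, "updated": 0, "deleted": 0}
    for rec in harvested:
        ident = rec["identifier"]
        if rec.get("status") == "deleted":
            # Only count a deletion if we actually held the record.
            if local.pop(ident, None) is not None:
                counts["deleted"] += 1
        elif ident in local:
            local[ident] = rec
            counts["updated"] += 1
        else:
            local[ident] = rec
            counts["added"] += 1
    return counts
```

If an archive does not report deletions at all, this approach cannot detect them; the harvester would have to compare against a full re-harvest.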

24
OAI Observation: Harvest Model
  • Frequency of harvests
  • all-at-once harvests?
  • initial harvest
  • resolving data coherency
  • frequent incremental harvests?
  • far more efficient for both service and data
    providers
  • Webcrawling vs. digital library models
  • webcrawlers: little to no a priori information
    about target
  • DLs: frequent harvesting of a small number of
    known targets
  • Realization: we know very little about
    harvesting behavior
  • are we optimizing for all-at-once, when
    incremental will be more common?
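
The difference between the two harvest styles can be sketched as a choice of request arguments: an initial harvest sends no `from`, while an incremental one sends the newest datestamp seen so far. Both helper functions are hypothetical, and the sketch assumes all datestamps share one granularity so that string comparison orders them correctly.

```python
def next_harvest_args(last_datestamp=None, metadata_prefix="oai_dc"):
    """Arguments for the next ListRecords request."""
    args = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if last_datestamp is not None:
        # Incremental harvest: only records changed since the last run.
        args["from"] = last_datestamp
    return args


def newest_datestamp(records, previous=None):
    """Newest datestamp among the harvested records and the previous value."""
    stamps = [rec["datestamp"] for rec in records]
    if previous is not None:
        stamps.append(previous)
    return max(stamps) if stamps else None
```

Since the `from` bound is inclusive in OAI-PMH 2.0, records carrying exactly that datestamp are delivered again, so the harvester's merge step should tolerate repeats.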

25
Other Uses For the OAI-PMH
  • Assumptions
  • Traditional DLs / SPs will continue on their
    present path of increasing sophistication
  • citation indexing, search results viz,
    personalization, recommendations, subject-based
    filtering, etc.
  • growth rates remain the same (5x DPs as SPs)
  • Premise: OAI-PMH is applicable to any scenario
    that needs to update / synchronize distributed
    state
  • Future opportunities are possible by creatively
    interpreting the OAI-PMH data model

26
OAI-PMH Data Model
item = identifier
record = identifier + metadata format + datestamp
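
The data model can be written down as two small types: an item is known by its identifier, and each record is that identifier combined with a metadata format and a datestamp. The class and field names below are illustrative assumptions, not protocol-defined names.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    identifier: str        # shared with the item
    metadata_prefix: str   # which metadata format this record is in
    datestamp: str         # last update / addition of this record
    metadata: str = ""     # the serialized metadata itself


@dataclass
class Item:
    identifier: str
    records: dict = field(default_factory=dict)  # metadata_prefix -> Record
    sets: set = field(default_factory=set)       # optional set membership

    def add(self, metadata_prefix, datestamp, metadata=""):
        """Attach (or replace) the record for one metadata format."""
        rec = Record(self.identifier, metadata_prefix, datestamp, metadata)
        self.records[metadata_prefix] = rec
        return rec


item = Item("oai:example.org:42")  # hypothetical identifier
item.add("oai_dc", "2003-02-03")
item.add("oai_marc", "2003-01-15")
```

One item can therefore yield several records, one per metadata format, each with its own datestamp.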
27
Typical Values
  • repository: collection of publications
  • resource: scholarly publication
  • item: all metadata (DC + MARC)
  • record: a single metadata format
  • datestamp: last update / addition of a record
  • metadata format: bibliographic metadata format
  • set: originating institution or subject categories

28
Repositories
  • Stretching the idea of a repository a bit
  • contextually sensitive repositories
  • personalization for harvesters
  • communication between strangers, or communication
    between friends?
  • OAI-PMH for individual complex objects?
  • OAI-PMH without MySQL?!
  • Fedora, Multi-valent documents, buckets
  • tar, jar, zip, etc. files

29
Resource
  • What if resource were
  • computer system status
  • uptime, who, w, df, ps, etc.
  • or generalized system status
  • e.g., sports league standings
  • people
  • personnel databases
  • authority files for authors

30
Item
  • What if item were
  • software
  • union of versions & formats
  • all forms of metadata
  • administrative & structural
  • citations, annotations, reviews, etc.
  • data
  • e.g., newsfeeds and other XML expressible content
  • metadataPrefixes or sets could be defined to be
    different versions

31
Record
  • What if record were
  • specific software instantiations / updates
  • access / retrieval logs for DLs (or computer
    systems)
  • push / pull model inversion
  • put a harvester on the client behind a firewall,
    the client contacts a DP and receives
    instructions on how to submit the desired
    document (e.g., send email to a specified address)

32
Datestamp
  • semantics of datestamp are strongly influenced by
    the choice of resource / item / record /
    metadataPrefix, but it could be used to
  • signify change of set membership (e.g., workflow
    item moves from submitted to approved)
  • change datestamp to reflect access to the DP
  • e.g., in conjunction with metadataPrefixes of
    "accessed" or "mirrored"

33
metadataPrefix
  • what if metadataPrefix were
  • instructions for extracting / archiving /
    scraping the resource
  • verb=ListRecords&metadataPrefix=extract_TIFFs
  • code fragments to run locally
  • (harvested from a trusted source!)
  • XSLT for other metadataPrefixes
  • the branding container is at the repository level;
    this could be record- or item-level

34
Set
  • sets are already used for tunneling OAI-PMH
    extensions (see Suleman & Fox, D-Lib 7(12))
  • other uses
  • in aggregators, automatically create 1 set per
    baseURL
  • have hidden sets (or metadataPrefix) that have
    administrative or community-specific values (or
    triggers)
  • set=accessed>1000&from=2001-01-01
  • set=harvestMeWithTheseARGS&until=2002-05-05&metadataPrefix=oai_marc
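
A small sketch of what a request with such a hidden set could look like on the wire, assuming the set name `accessed>1000` is sent as an ordinary query argument. The `query` helper is hypothetical. The `>` has to be percent-encoded in a URL, and the data provider sees the original set name again after decoding.

```python
from urllib.parse import parse_qs, urlencode


def query(**args):
    """Encode OAI-PMH arguments as a query string (from_ stands in for 'from')."""
    return urlencode({key.rstrip("_"): value for key, value in args.items()})


q = query(verb="ListRecords", set="accessed>1000", from_="2001-01-01")
decoded = parse_qs(q)  # what the data provider would see after decoding
```

Whether a repository accepts set names with such characters is up to that repository; this only illustrates the encoding step.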

35
Interesting Services
  • DP9
  • gateway to expose repository contents in HTML
    suitable for web crawlers
  • Celestial
  • OAI cache, also 1.1 -> 2.0 converter
  • Static (mini-) repositories
  • XML files, based on OLAC work
  • OpenURL metadata format registries
  • record metadata format