Advanced OAI-PMH

1
Advanced OAI-PMH
  • Michael L. Nelson
  • mln_at_cs.odu.edu
  • http://www.cs.odu.edu/~mln/
  • Several slides from Herbert Van de Sompel,
    Simeon Warner and Xiaoming Liu
  • University of Southern California
  • 6/15/04

2
Outline
  • Guidelines, recommendations, best practices for
    2.0 implementations
  • harvesters, repositories, aggregators, optional
    containers
  • Novel applications of OAI-PMH
  • New developments in the OAI-PMH community
  • resource harvesting
  • mod_oai
  • static repositories
  • oai-rights
  • ERRoLs

3
Repository Implementation
(see also Repository Implementation Guidelines:
http://www.openarchives.org/OAI/2.0/guidelines-repository.htm)
4
Minimal Repository
  • 2.0 provides many expressive, but optional
    features
  • but still low barrier!
  • if you are writing your own repository software,
    the quickest path to implementation can involve
    initially
  • only supporting DC
  • skipping <about>, sets, compression
  • skipping flow control (resumptionTokens) if < 1000
    items
  • add optional features as requirements and
    familiarity allow

5
Be Honest with datestamp!
  • a change in the process of dynamic generation of
    a metadata format really does mean all records
    have been updated!
  • harvester caveat: an incremental harvest could
    yield an entire repository dump if all the
    datestamps change (for example, if the metadata
    mapping rules change)

if (internalItemDatestamp > disseminationInterfaceDatestamp)
    datestamp = internalItemDatestamp
else
    datestamp = disseminationInterfaceDatestamp
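The datestamp rule above amounts to taking the later of the two timestamps. A minimal Python sketch, assuming both values are available as timezone-aware datetime objects (the function name and the sample dates are illustrative, not from the slides):

```python
from datetime import datetime, timezone


def record_datestamp(internal_item_datestamp, dissemination_interface_datestamp):
    """Return the datestamp to expose for a record.

    The later of the two wins, so a change to the metadata mapping
    (the dissemination interface) re-dates every record it affects.
    """
    return max(internal_item_datestamp, dissemination_interface_datestamp)


# Hypothetical example: the item last changed in 2003, the mapping rules in 2004.
item_changed = datetime(2003, 5, 1, tzinfo=timezone.utc)
mapping_changed = datetime(2004, 6, 1, tzinfo=timezone.utc)
print(record_datestamp(item_changed, mapping_changed).isoformat())
```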
6
Not Hiding Updates
  • OAI-PMH is designed to allow incremental
    harvesting
  • Updates must be available by the end of the
    period of the datestamp assigned, i.e.
  • Day granularity => during same day
  • Seconds granularity => during same second
  • Reason: harvesters then need to overlap requests
    by just one datestamp interval (one day or one
    second)
  • in 1.x, 2 intervals were required (in many
    circumstances)
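A harvester can apply the one-interval overlap when it computes the from argument of its next incremental harvest. A minimal Python sketch, assuming the harvester remembers when its previous harvest started and has read the granularity value from the Identify response (the function name is hypothetical):

```python
from datetime import datetime, timedelta, timezone

DAY = "YYYY-MM-DD"                # granularity values as reported by Identify
SECONDS = "YYYY-MM-DDThh:mm:ssZ"


def next_from(previous_harvest_start, granularity):
    """Return the `from` value for the next harvest, overlapping by one interval."""
    if granularity == DAY:
        overlapped = previous_harvest_start - timedelta(days=1)
        return overlapped.strftime("%Y-%m-%d")
    if granularity == SECONDS:
        overlapped = previous_harvest_start - timedelta(seconds=1)
        return overlapped.strftime("%Y-%m-%dT%H:%M:%SZ")
    raise ValueError("unknown granularity: " + granularity)


previous = datetime(2004, 6, 15, 12, 0, 0, tzinfo=timezone.utc)
print(next_from(previous, DAY))
print(next_from(previous, SECONDS))
```

Records that fall inside the overlap are returned twice, so the harvester should treat re-delivered records as updates rather than as errors.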

7
State in resumptionTokens
  • HTTP is stateless
  • resumptionTokens allow state information to be
    passed back to the repository to create a
    complete list from sequence of incomplete lists
  • EITHER all state in resumptionToken
  • OR cache result set in repository

8
Caching the Result Set
  • Repository caches results of initial request,
    returns only incomplete list
  • resumptionToken does not contain all state
    information, it includes
  • a session id
  • offset information, necessary for idempotency
  • resumptionToken allows repository to return next
    incomplete list
  • increased complexity due to cache management
  • but a potential performance win

9
All State in the resumptionToken
  • Arrange that remaining items/headers in complete
    list response can be specified with a new query
    and encode that in resumptionToken
  • One simple approach is to return items/headers in
    id order and make the new query specify the same
    parameters and the last id returned (or by date)
  • simple to implement, but possibly longer
    execution times
  • Can encode parameters very simply
  • <resumptionToken>metadataPrefix=oai_dc
    from=1999-02-03 until=2002-04-01
    lastid=fghy45123</resumptionToken>
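As an illustration of keeping all state in the token, the sketch below packs the original query parameters plus the last identifier returned into one string and unpacks it again. It is only one possible encoding: the query-string syntax, the function names, and the lastid key layout are assumptions made for this example.

```python
from urllib.parse import parse_qsl, urlencode


def encode_token(metadata_prefix, last_id, from_=None, until=None):
    """Pack the state needed to resume a list request into a token string."""
    state = {"metadataPrefix": metadata_prefix}
    if from_:
        state["from"] = from_
    if until:
        state["until"] = until
    state["lastid"] = last_id
    return urlencode(state)


def decode_token(token):
    """Unpack a token; raise ValueError where a repository would report badResumptionToken."""
    state = dict(parse_qsl(token, strict_parsing=True))
    if "metadataPrefix" not in state or "lastid" not in state:
        raise ValueError("badResumptionToken")
    return state


token = encode_token("oai_dc", "fghy45123", from_="1999-02-03", until="2002-04-01")
print(token)
print(decode_token(token))
```

A token built this way contains "&", so it has to be XML-escaped when written into the resumptionToken element, and the harvester has to URL-encode it when sending it back.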

10
resumptionToken attributes (1)
  • expirationDate likely to be useful when cache
    clean-up schedule is known
  • Do not specify expirationDate if all state in
    resumptionToken
  • badResumptionToken error to be used if
    resumptionToken expired
  • May also be used if request cannot be completed
    for some other reason
  • e.g. if repository changes cause the incomplete
    list to have no records
  • issue badResumptionToken errors judiciously: one
    can invalidate a lot of effort by a lot of
    harvesters

11
resumptionToken attributes (2)
  • completeListSize and cursor optionally provide
    information about size of complete list and
    number of records so far disseminated
  • not (currently) widely used
  • use consistently if used
  • designed for status monitoring
  • harvester caveat: completeListSize may be
    approximate and may be revised

12
resumptionToken
  • The only defined use of resumptionToken is as
    follows
  • a repository must include a resumptionToken
    element as part of each response that includes an
    incomplete list
  • in order to retrieve the next portion of the
    complete list,  the next request must use the
    value of that resumptionToken element as the
    value of the resumptionToken argument of the
    request
  • the response containing the incomplete list that
    completes the list must include an empty
    resumptionToken element

13
Flow Control Load Balancing
  • How to respond to a bad harvester
  • HTTP status code 200 response to OAI-PMH request
    with a resumptionToken.
  • HTTP status code 503 with the Retry-After header
    set to an appropriate value if subsequent request
    follows too quickly or if the server is heavily
    loaded.
  • HTTP status code 403 with an appropriate reason
    specified if subsequent requests do not adhere to
    Retry-After delays.
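The three responses above can be read as a small decision procedure on the repository side. A minimal Python sketch follows; the Throttle class, the fixed minimum interval between requests, and the limit of three early requests are all assumptions made for illustration, not part of the protocol or of any repository software.

```python
import math


class Throttle:
    """Hypothetical per-harvester flow control (names and limits are assumptions)."""

    def __init__(self, min_interval=10, max_violations=3):
        self.min_interval = min_interval      # seconds required between requests
        self.max_violations = max_violations  # early requests tolerated before 403
        self.next_allowed = {}                # harvester -> earliest acceptable time
        self.violations = {}                  # harvester -> consecutive early requests

    def check(self, harvester, now):
        """Return (HTTP status, headers, reason) for a request arriving at `now`."""
        earliest = self.next_allowed.get(harvester, 0)
        if now >= earliest:
            # Well-behaved request: answer it and schedule the next slot.
            self.next_allowed[harvester] = now + self.min_interval
            self.violations[harvester] = 0
            return 200, {}, "OK"
        count = self.violations.get(harvester, 0) + 1
        self.violations[harvester] = count
        if count >= self.max_violations:
            return 403, {}, "Retry-After delays were repeatedly ignored"
        wait = math.ceil(earliest - now)
        return 503, {"Retry-After": str(wait)}, "Request arrived too soon"


throttle = Throttle(min_interval=10, max_violations=3)
for arrival in (0, 1, 2, 3, 20):
    print(arrival, throttle.check("harvester.example.org", arrival))
```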

14
302 Load Balancing
  • Interactive users on main DL machine should not
    be impacted by metadata harvesting
  • don't take deliveries through the front door
  • not part of the protocol; defined outside the
    protocol

[Figure labels: OAI Server; harvester; naca.larc.nasa.gov/oai/]
15
DNS Load Balancing
  • using a DNS rotor, establish
  • a.foo.org, b.foo.org, c.foo.org
  • each with a synchronized copy of the repository
  • let DNS chance distribute the load
  • implication: if resumptionTokens could be issued
    to loosely synchronized servers, it is likely
    that the rTs will be stateful

16
Load Balancing Caveats
  • Copies of the repository must be synchronized
  • (cf. Pande, et al. JCDL 02)
  • depending on synchronization method, be careful
    not to hide updates
  • see Aggregator Guidelines
  • Complex hierarchies are possible
  • programmer must ensure there are no cycles in
    redirection graphs!
  • The baseURL in the reply must always point to the
    original repository, not the repository that
    eventually answered the request

17
Error Handling Verbosity
  • More is better
  • <error code="badArgument">Illegal argument
    foo</error>
  • <error code="badArgument">Illegal argument
    bar</error>
  • is preferred over
  • <error code="badArgument">Illegal arguments
    foo, bar</error>
  • which is preferred over
  • <error code="badArgument">Illegal
    arguments</error>
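The preferred style, one error element per offending argument, can be sketched in Python as below. The table of legal arguments covers only three verbs and follows the OAI-PMH 2.0 argument lists; the function names are hypothetical.

```python
from xml.sax.saxutils import escape

# Arguments OAI-PMH 2.0 defines for a few verbs ("verb" itself is excluded).
LEGAL_ARGUMENTS = {
    "Identify": set(),
    "GetRecord": {"identifier", "metadataPrefix"},
    "ListRecords": {"from", "until", "set", "metadataPrefix", "resumptionToken"},
}


def illegal_arguments(verb, arguments):
    """Return the names of request arguments that the verb does not define."""
    legal = LEGAL_ARGUMENTS[verb]
    return sorted(name for name in arguments if name not in legal)


def bad_argument_errors(names):
    """Build one <error> element per illegal argument (the verbose style)."""
    return [
        '<error code="badArgument">Illegal argument {}</error>'.format(escape(name))
        for name in names
    ]


request = {"metadataPrefix": "oai_dc", "foo": "1", "bar": "2"}
for element in bad_argument_errors(illegal_arguments("ListRecords", request)):
    print(element)
```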

18
Error Handling Levels
  • the OAI-PMH error / exception conditions are for
    OAI-PMH semantic events
  • they are not for situations when
  • the database is down
  • a record is malformed
  • remember: record = id + datestamp +
    metadataPrefix
  • if you're missing one of those, you don't have an
    OAI record!
  • and other conditions that occur outside the OAI
    scope
  • use HTTP status codes 500, 503 or other
    appropriate values to indicate non-OAI problems

19
Error Handling Extensions
  • Arguments that are not 'required', 'optional' or
    'exclusive' are 'illegal' and should generate
    badArgument errors.
  • If you want to extend the OAI-PMH
  • stop and consider: do you really need to?
  • maybe you should have different OAI-PMH
    interfaces, or creative metadata formats
  • if you really, really want to, tunnel your
    extensions through the set feature
  • see http://www.dlib.org/dlib/december01/suleman/12suleman.html
    for examples

20
Idempotency of List Requests (1)
  • Purpose is to allow harvesters to recover from
    lost responses or crashes without starting a
    large harvest from scratch
  • Recover by re-issuing request using
    resumptionToken from previous request
  • Implication: the repository must accept both the
    most recent resumptionToken issued and the
    previous one

21
Idempotency of List Requests (2)
  • response to a re-issued request must contain all
    unchanged records
  • any changed records will get new datestamps after
    time of initial request
  • changes will be picked up by subsequent harvest
    if not included
  • no experience yet with incomplete responses to
    ListSets or ListMetadataFormats requests

22
Case Study bucket based repositories
  • Buckets: see Nelson & Maly, CACM 44(5)
  • 2.0
  • NTRS - ntrs.nasa.gov/ (MySQL, DC)
  • LTRS - techreports.larc.nasa.gov/ltrs/oai2.0/
    (file system, refer)
  • NACA - naca.larc.nasa.gov/oai2.0/ (file system,
    refer)
  • 1.1
  • LTRS - techreports.larc.nasa.gov/ltrs/oai/
  • NACA - naca.larc.nasa.gov/ltrs/oai/
  • Open Video - www.open-video.org/oai/ (MySQL,
    local)
  • JTRS - ston.jsc.nasa.gov/collections/TRS/oai (MS
    Access dump, local)
  • GLTRS (filesystem, HTML scraping)
  • Characteristics
  • resumptionToken support initially skipped, added
    later (all)
  • highly encoded rTs: 2001-01-01!!!!301!600
  • sets initially skipped, added later (LTRS)
  • initially had load balancing with 2 NACA
    repositories

23
Case Study bucket based repositories
  • in bucket terminology
  • 6 OAI verbs (methods) added to the existing list
    of methods
  • http://ntrs.nasa.gov/?method=list_methods
  • http://ntrs.nasa.gov/?method=list_source&target=ListIdentifiers
  • a data element is added to the bucket that
    contains the specifics of the particular
    repository and its metadata format
  • http://ntrs.nasa.gov/?method=display&pkg_name=oai&element_name=oai.pl

24
Harvester Implementation and Use
(see also Harvester Implementation Guidelines:
http://www.openarchives.org/OAI/2.0/guidelines-harvester.htm)
25
Be a Polite OAI Neighbor
  • Re-use existing free harvester software/libraries:
    http://www.openarchives.org/tools/index.html
  • If you insist on writing your own harvester, read
    http://www.robotstxt.org/wc/robots.html
  • Provide meaningful User-Agent and From headers
  • Should be present in HTTP headers of all robot
    requests
  • Should be configured even if using someone else's
    harvester

26
Harvesting Sequence
  • Issue Identify request
  • Check OAI-PMH version
  • Check baseURL, granularity, compression
  • Issue ListMetadataFormats request
  • Get information regarding selected metadataPrefix
  • Issue ListSets request if using sets
  • Check set structure matches expectation
  • Issue ListIdentifiers or ListRecords request
  • Continue until end of complete list
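The sequence above maps onto a handful of request URLs. The Python sketch below only builds those URLs; it does not fetch or parse anything. The base URL is a placeholder and the function names are hypothetical.

```python
from urllib.parse import urlencode


def request_url(base_url, verb, **arguments):
    """Build an OAI-PMH GET request URL for a verb and its arguments."""
    query = [("verb", verb)] + sorted(arguments.items())
    return base_url + "?" + urlencode(query)


def harvesting_sequence(base_url, metadata_prefix, use_sets=False):
    """Return the initial requests of a harvest, in the order they are issued."""
    urls = [
        request_url(base_url, "Identify"),
        request_url(base_url, "ListMetadataFormats"),
    ]
    if use_sets:
        urls.append(request_url(base_url, "ListSets"))
    urls.append(request_url(base_url, "ListRecords", metadataPrefix=metadata_prefix))
    return urls


for url in harvesting_sequence("http://repository.example.org/oai", "oai_dc", use_sets=True):
    print(url)
```

Follow-up requests carry only the token, for example request_url(base_url, "ListRecords", resumptionToken=token), because resumptionToken is an exclusive argument.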

27
Listen to the Repository
  • Check Identify's <granularity> element if you
    wish to use finer than YYYY-MM-DD
  • If you harvest with sets, remember that ":"
    indicates hierarchy
  • harvesting a will recursively harvest a:b,
    a:b:c, and a:d
  • Check for and handle non-200 HTTP status codes,
    503, 302 and 4xx in particular
  • Empty resumptionToken => end of complete list
  • Ask for compressed responses if the repository
    supports them
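In OAI-PMH setSpecs a colon separates hierarchy levels, so harvesting a set also returns records from its descendants. A minimal Python sketch of that membership test (the function name is hypothetical):

```python
def in_harvested_set(record_set_spec, harvested_set_spec):
    """True if a record's setSpec is the harvested set or one of its descendants."""
    return (
        record_set_spec == harvested_set_spec
        or record_set_spec.startswith(harvested_set_spec + ":")
    )


for spec in ("a", "a:b", "a:b:c", "a:d", "ab", "b:a"):
    print(spec, in_harvested_set(spec, "a"))
```

Comparing against the prefix plus ":" rather than the bare prefix keeps an unrelated set such as ab from being mistaken for a child of a.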

28
Harvesting Everything
  • Issue an Identify request to find the protocol
    version, the finest datestamp granularity
    supported, and whether compression is supported
  • Issue a ListMetadataFormats request to obtain a
    list of all metadataPrefixes supported.
  • Harvest using a ListRecords request for each
    metadataPrefix supported. Knowledge of the
    datestamp granularity allows for less overlap in
    incremental harvesting if granularities finer
    than a day are supported.
  • Set structure can be inferred from the setSpec
    elements in the header blocks of each record
    returned (consistency checks are possible).
  • Items may be reconstructed from the constituent
    records.
  • Provenance and other information in <about>
    blocks may be re-assembled at the item level if
    it is the same for all metadata formats
    harvested. However, this information may be
    supplied differently for different metadata
    formats and may thus need to be stored separately
    for each metadata format.

29
Harvesting v1.1 and v2.0
  • Not difficult to handle both cases, test Identify
    response
  • v1.1: <Identify> <protocolVersion>
  • v2.0: <OAI-PMH> <Identify> <protocolVersion>
  • Different error and exception handling
  • Many similarities, harvesters can share lots of
    code

30
Harvesting logs
  • Alan Kent's v2.0 harvester logs:
    http://www.inquirion.com:8123/public/collListcollListCmdlist
  • Alan Kent's summary of v1.1 harvesting results:
    http://www.mds.rmit.edu.au/ajk/oai/interop/summary.htm
  • Celestial harvesting logs:
    http://celestial.eprints.org/cgi-bin/status
  • DP9 gateway using Arc harvested information
  • http://arc.cs.odu.edu:8080/dp9/index.jsp

31
<friends> example (1)
  • A light-weight, data-provider driven way to
    communicate the existence of others, e.g.
  • http://ntrs.nasa.gov/?verb=Identify
    <description>
      <friends ...namespace stuff...>
        <baseURL>http://naca.larc.nasa.gov/oai2.0</baseURL>
        <baseURL>http://ntrs.nasa.gov/oai2.0</baseURL>
        <baseURL>http://eprints.riacs.edu/perl/oai/</baseURL>
        <baseURL>http://ston.jsc.nasa.gov/collections/TRS/oai/</baseURL>
      </friends>
    </description>

32
<friends> example (2)
[Figure labels: harvester; Identify; <friends>]
http://techreports.larc.nasa.gov/ltrs/oai2.0/
http://naca.larc.nasa.gov/oai2.0/
http://ston.jsc.nasa.gov/collections/TRS/oai/
http://ntrs.nasa.gov/oai2.0/
http://eprints.riacs.edu/perl/oai/
33
Use of <friends>
34
Aggregator / Cache / Proxy Implementation
(see also Aggregator Implementation Guidelines:
http://www.openarchives.org/OAI/2.0/guidelines-aggregator.htm)
35
<provenance> and datestamps
  • Reminder: datestamps are local to the repository;
    a re-exporting service must use new local
    datestamps
  • Such services should use the <provenance>
    container to preserve the original datestamps and
    other information

36
Identifiers are Local
  • Identifiers are local to the repository
  • Unless you absolutely did not change the
    metadata and the identifier corresponds to a
    recognized URI scheme, use a new identifier upon
    re-exporting
  • use the <provenance> container to preserve the
    harvesting history

37
oai-identifier
  • Just one option for identifiers in OAI-PMH
  • The v2.0 oai-identifier scheme is not compatible
    with v1.1
  • repositoryName now domain name based
  • not reliant upon OAI centralized registration
  • One-to-one mapping for escaping characters: %3F
    allowed, %3f not
  • allows simple comparison

38
Derived from the same item?
  • 3 different ways to determine if records share
    provenance from the same item
  • both records have the same identifier and the
    baseURL in the request elements of the OAI-PMH
    responses which include the record are the same
  • both records have the same identifier and that
    identifier belongs to some recognized URI scheme
  • the provenance containers of both records have
    the same entries for both the identifier and
    baseURL
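The three tests above can be combined into one predicate. A minimal Python sketch, under stated assumptions: the Record tuple is hypothetical, each record carries at most one provenance entry, and the set of "recognized" URI schemes is a local policy choice (the default used here is only an example).

```python
from collections import namedtuple

# provenance is either None or an (origin baseURL, origin identifier) pair.
Record = namedtuple("Record", "identifier base_url provenance")


def same_item(a, b, recognized_schemes=("urn", "info")):
    """True if any of the three tests says both records derive from the same item."""
    if a.identifier == b.identifier:
        if a.base_url == b.base_url:
            return True                      # test 1: same identifier, same baseURL
        scheme = a.identifier.split(":", 1)[0].lower()
        if scheme in recognized_schemes:
            return True                      # test 2: identifier in a recognized URI scheme
    if a.provenance is not None and a.provenance == b.provenance:
        return True                          # test 3: matching provenance entries
    return False


origin = ("http://odd.oa.org", "oai:odd.oa.org:z1x2y3")
first = Record("oai:cw.oa.org:z1x2y3", "http://crosswalker.oa.org", origin)
second = Record("oai:other.example.org:9", "http://other.example.org", origin)
print(same_item(first, second))
```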

39
<provenance> example (1)
Consider a request from crosswalker.oa.org:
http://odd.oa.org?verb=GetRecord&identifier=oai:odd.oa.org:z1x2y3&metadataPrefix=odd_fmt
and the following response from odd.oa.org:
    <responseDate>2002-02-08T08:55:46.1</responseDate>
    <request verb="GetRecord" metadataPrefix="odd_fmt"
             identifier="oai:odd.oa.org:z1x2y3">http://odd.oa.org</request>
    <GetRecord ...namespace stuff...>
      <record>
        <header>
          <identifier>oai:odd.oa.org:z1x2y3</identifier>
          <datestamp>1999-08-07T06:05:04Z</datestamp>
        </header>
        <metadata> ...metadata record in odd_fmt... </metadata>
      </record>
    </GetRecord>


40
<provenance> example (2)
Imagine that crosswalker.oa.org cross-walks
harvested metadata from odd_fmt into oai_marc and
then re-exposes the metadata with new
identifiers. A request from getmarc.oa.org:
http://crosswalker.oa.org?verb=GetRecord&identifier=oai:cw.oa.org:z1x2y3&metadataPrefix=oai_marc
might then yield the following response from
crosswalker.oa.org
41
<provenance> example (3)
    <record>
      <header>
        <identifier>oai:cw.oa.org:z1x2y3</identifier>
        <datestamp>2002-02-09T01:15:43Z</datestamp>
      </header>
      <metadata> ...metadata record in oai_marc... </metadata>
      <about>
        <provenance ...namespace stuff...>
          <originDescription harvestDate="2002-02-08T08:55:46Z"
                             altered="true">
            <baseURL>http://odd.oa.org</baseURL>
            <identifier>oai:odd.oa.org:z1x2y3</identifier>
            <datestamp>1999-08-07T06:05:04Z</datestamp>
            <metadataNamespace>http://odd.oa.org/odd_fmt</..>
          </originDescription>
        </provenance>
      </about>
    </record>

42
<provenance> example (4)
This oai_marc record is then re-exposed by
getmarc.oa.org with the same identifier
oai:cw.oa.org:z1x2y3 (because the record has not
been altered). The associated <provenance>
container might be
43
<provenance> example (5)
    <record>
      <header>
        <identifier>oai:cw.oa.org:z1x2y3</identifier>
        <datestamp>2002-03-01T01:46:11Z</datestamp>
      </header>
      <metadata> ...metadata record in oai_marc... </metadata>
      <about>
        <provenance ...namespace stuff...>
          <originDescription harvestDate="2002-03-01T01:23:45Z"
                             altered="false">
            <baseURL>http://crosswalker.oa.org</baseURL>
            <identifier>oai:cw.oa.org:z1x2y3</identifier>
            <datestamp>2002-02-09T01:15:43Z</datestamp>
            <metadataNamespace>http://../oai_marc</metadataNamespace>
            <originDescription harvestDate="2002-02-08T08:55:46Z"
                               altered="true">
              <baseURL>http://odd.oa.org</baseURL>
              <identifier>oai:odd.oa.org:z1x2y3</identifier>
              <datestamp>1999-08-07T06:05:04Z</datestamp>
              <metadataNamespace>http://odd.oa.org/odd_fmt</metadataNamespace>
            </originDescription>
          </originDescription>
        </provenance>
      </about>
    </record>
44
A Deduping Experiment
  • Initial Experiences Re-Exporting Duplicate and
    Similarity Computation with an OAI-PMH
    aggregator
  • http://www.arxiv.org/abs/cs.DL/0401001
  • Summary
  • harvest favorite repositories
  • compute similarities with VSM
  • re-export metadata with top 10 similarities for a
    record in an ltaboutgt container
  • dups, recommendations

45
looking ahead: novel uses of OAI-PMH
46
  • Using OAI-PMH Differently: Young, Van de Sompel,
    Hickey, D-Lib Magazine, 9(7/8), 2003
  • http://www.dlib.org/dlib/july03/young/07young.html
  • DL usage logs (LANL)
  • Registry of metadata formats for OpenURL (OCLC,
    LANL)
  • http://www.openurl.info/registry/
  • http://lib-www.lanl.gov/herbertv/papers/icpp02-draft.pdf
  • GSFAD Thesaurus (OCLC)
  • The multi-faceted use of the OAI-PMH in the LANL
    Repository
  • http://lib-www.lanl.gov/herbertv/papers/jcdl2004-submitted-draft.pdf

48
OAI-PMH access to DL usage logs
  • usage logs filtered and stored in MySQL db
  • accessible as 2 OAI-PMH repositories
  • document oriented
  • agent oriented (user-proxy)
  • interlinked
  • recommender system
  • harvests logs
  • interprets logs
  • exposes relationships (OpenURL access)

49
Phase 1: creating recommender system
[Figure labels: local; local; About local and remote data; Document logs; Agent logs]
50
Phase 2: requesting recommendations
[Figure labels: Log based recom. system; PubMed bibliographic; biblio or citation]
see http://www.dlib.org/dlib/june02/bollen/06bollen.html
for a possible methodology for computing
log-based recommendations
51
Repository 1
[Figure labels: agent; alogIP128.1.22.13; alog]
52
Repository 2
[Figure labels: document; dlogoripmid258471; dlog]
53
  • default links
  • restricted in nature
  • action-radius restricted by business agreements
  • not context-sensitive

[Figure labels: metadata plane; resource1; resource2; resource3. Credit: herbert van de sompel]
54

[Figure labels: extended services plane; service component1; service component2; metadata plane; resource1; resource2; resource3. Credit: herbert van de sompel]
55
OAI-PMH-conformant OpenURL Registry
  • NISO OpenURL Framework builds on Registry
  • Registry entry
  • unique identifier
  • always DC record
  • sometimes XHTML or XML Schema definition

56
OAI-PMH-conformant OpenURL Registry
  • Collaboration between LANL and OCLC Office of
    Research
  • Registry is OAI-PMH harvestable
  • Registry is browseable through overlaying of
    PURL and XSLT

57
OpenURL Registry
[Figure labels: registered item; ori:enc:UTF-8]
58
OpenURL Registry
[Figure labels: registered item; ori:fmt:xml:xsd:book; xsd]
59
OAI-PMH-conformant OpenURL Registry
  • Insert XSLT stylesheet reference in OAI-PMH
    response => make repository browseable

<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="gui.xsl" ?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
         http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2002-02-08T12:00:01Z</responseDate>
  ...
</OAI-PMH>
60
OAI-PMH-conformant OpenURL Registry
  • Use PURL partial redirects to obtain publishable
    URLs

baseURL?verb=GetRecord&metadataPrefix=xsd&identifier=ori:fmt:xml:xsd:book
http://www.openurl.info/registry/xsd/ori:fmt:xml:xsd:book
61
OAI-PMH-conformant GSFAD Thesaurus
  • OCLC Office of Research
  • GSFAD Thesaurus is OAI-PMH harvestable
  • Thesaurus is user-browseable through overlaying
    of PURL and XSLT
  • Thesaurus is accessible by machines via
    OAI-PMH-based web services

62
GSFAD Thesaurus
[Figure labels: concept; Adventurefilms; MARCXML]
63
Other Uses For the OAI-PMH
  • Assumptions
  • Traditional DLs / SPs will continue on their
    present path of increasing sophistication
  • citation indexing, search results viz,
    personalization, recommendations, subject-based
    filtering, etc.
  • growth rates remain the same (5x DPs as SPs)
  • Premise: OAI-PMH is applicable to any scenario
    that needs to update / synchronize distributed
    state
  • Future opportunities are possible by creatively
    interpreting the OAI-PMH data model

64
Typical Values
  • repository: collection of publications
  • resource: scholarly publication
  • item: all metadata (DC and MARC)
  • record: a single metadata format
  • datestamp: last update / addition of a record
  • metadata format: bibliographic metadata format
  • set: originating institution or subject categories

65
Repositories
  • Stretching the idea of a repository a bit
  • contextually sensitive repositories
  • personalization for harvesters
  • communication between strangers, or communication
    between friends?
  • OAI-PMH for individual complex objects?
  • OAI-PMH without MySQL?!
  • Fedora, Multi-valent documents, buckets
  • tar, jar, zip, etc. files

66
Resource
  • What if resource were
  • computer system status
  • uptime, who, w, df, ps, etc.
  • or generalized system status
  • e.g., sports league standings
  • people
  • personnel databases
  • authority files for authors

67
Item
  • What if item were
  • software
  • union of versions and formats
  • all forms of metadata
  • administrative and structural
  • citations, annotations, reviews, etc.
  • data
  • e.g., newsfeeds and other XML expressible content
  • metadataPrefixes or sets could be defined to be
    different versions

68
Record
  • What if record were
  • specific software instantiations / updates
  • access / retrieval logs for DLs (or computer
    systems)
  • push / pull model inversion
  • put a harvester on the client behind a firewall,
    the client contacts a DP and receives
    instructions on how to submit the desired
    document (e.g., send email to a specified address)

69
Datestamp
  • semantics of datestamp are strongly influenced by
    the choice of resource / item / record /
    metadataPrefix, but it could be used to
  • signify change of set membership (e.g., workflow
    item moves from submitted to approved)
  • change datestamp to reflect access to the DP
  • e.g., in conjunction with metadataPrefixes of
    accessed or mirrored

70
metadataPrefix
  • what if metadataPrefix were
  • instructions for extracting / archiving /
    scraping the resource
  • verb=ListRecords&metadataPrefix=extract_TIFFs
  • code fragments to run locally
  • (harvested from a trusted source!)
  • XSLT for other metadataPrefixes
  • branding container is at the repository-level,
    this could be record- or item-level

71
Set
  • sets are already used for tunneling OAI-PMH
    extensions (see Suleman & Fox, D-Lib 7(12))
  • other uses
  • in aggregators, automatically create 1 set per
    baseURL
  • have hidden sets (or metadataPrefix) that have
    administrative or community-specific values (or
    triggers)
  • set=accessed>1000&from=2001-01-01
  • set=harvestMeWithTheseARGS&until=2002-05-05&metadataPrefix=oai_marc

72
OAI-PMH The Deep Web
73
Race for This New Market
  • Yahoo! and University of Michigan
  • http://www.umich.edu/news/index.html?Releases/2004/Mar04/r031004
  • Google and CrossRef
  • http://www.nature.com/nature/focus/accessdebate/17.html

74
Exposing Repository Contents
  • DP9: web crawler access to OAI-PMH repositories
  • http://dlib.cs.odu.edu/dp9/
  • JCDL 02: http://www.cs.odu.edu/liu_x/dp9/dp9.pdf
  • An Apache module for OAI-PMH
  • http://www.modoai.org/
  • Extensible Repository Resource Locators (ERRoLs)
    for OAI Identifiers
  • http://www.oclc.org/research/projects/oairesolver/default.htm

75
DP9 Indexing Repositories with Web Crawlers
  • Convert repository to a series of hyperlinks.
  • Each record has a persistent URL.
  • Begin by dynamically creating a starting HTML
    page for an OAI-PMH repository
  • Starting page for a data provider would be
    constructed by issuing a ListIdentifiers request
    and translating the response into an HTML format
    containing a series of links to the records

76
Index OAI Repository by Crawler
  • A link on this HTML page, when invoked, would
    result in another GetRecord request for a
    specific identifier.
  • HTTP 503 and OAI-PMH resumptionToken are also
    observed.

77
Architecture
78
Better Fit for Crawlers
  • Format of URLs
  • http://arc.cs.odu.edu:8080/dp9/getrecord.jsp?identifier=oai:NACA:1917:naca-report-10&prefix=oai_dc
  • http://arc.cs.odu.edu:8080/dp9/getrecord/oai_dc/oai:NACA:1917:naca-report-10
  • HTML Meta tags
  • Some crawlers (such as Inktomi) use the HTML meta
    tags to index a Web page; DP9 also maps Dublin
    Core metadata to corresponding HTML meta tags.
  • For pages that are designed exclusively for
    robot navigation, a noindex robots meta tag is
    used
  • X-FORWARDED-FOR header to distinguish
  • between different users coming in via the proxy

79
Results
  • 101 repositories with millions of records are
    open to web crawlers.
  • Tens of thousands of documents have been indexed
    by web crawlers through DP9.
  • Web logs show that more than 1000 queries are
    issued from popular web search engines each day.

80
Problems
  • DP9 does not cache the records and only forwards
    requests to corresponding data providers.
  • DP9's records are always up-to-date.
  • Quality of service is highly dependent on the
    availability of data providers.

81
Problems
  • Aggressive crawling.
  • An aggressive crawler using DP9 can rapidly send
    requests without regard for the load they are
    placing on the data providers.
  • Solution.
  • An OAI mirror/caching mechanism, such as the OAI
    aggregator (http://celestial.eprints.org/) at
    Southampton University.
  • HTTP throttle software to relieve the overhead on
    data providers.

82
Problems
  • PageRank?
  • resumptionToken may create very deep links that
    web crawlers don't follow.
  • resumptionToken may be invalid after a period of
    time.
  • Possible solution
  • Create many small bins based on the selective
    harvesting features of OAI-PMH

83
Problems
  • Long harvesting time.
  • If a crawler harvests one record per second, it
    will take at least 277 hours to harvest a
    1M-record database.
  • Solution.
  • DP9 is distributed as an open source tool so any
    OAI-PMH compliant repository can install it
  • http://dlib.cs.odu.edu/dp9/
  • Current installation.
  • Southampton University.
  • University of Pennsylvania.
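The 277-hour estimate in the long-harvesting-time problem above is plain arithmetic: one million records at one record per second is 1,000,000 seconds, and 1,000,000 / 3600 is roughly 277.8 hours. A small Python sketch of the same calculation (the function name is illustrative):

```python
def harvest_hours(record_count, records_per_second=1.0):
    """Hours needed to fetch every record at a fixed crawl rate."""
    return record_count / records_per_second / 3600


print(round(harvest_hours(1_000_000), 1))
print(round(harvest_hours(1_000_000, records_per_second=10), 1))
```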

84
mod_oai
  • Ongoing project
  • funded by the Mellon Foundation
  • www.modoai.org
  • An Apache module (written in C) to automatically
    answer OAI-PMH requests for an http server

85
Inefficient Web Crawlers
86
A More Efficient Way
[Figure: web site with 100 pages, 10 pages per
resumptionToken, 5 page updates a week]
87
ERRoLs
  • OCLC project to assign "Cool URLs" to OAI-PMH
    records
  • cf. http://www.w3.org/Provider/Style/URI.html
  • a thin layer of a service provider to bring
    repositories closer to human interaction
  • cf. the Repository Explorer

88
Steps
  • repository registers with ERRoLs
  • access OAI-PMH records with
  • http://errol.oclc.org/<oai-identifier>
  • ERRoLs also available for
  • sites that don't use <oai-identifier>
  • repository overview pages

89
ERRoLs Examples
  • http://errol.oclc.org/oai:xmlregistry.oclc.org:demo/ISBN/0521555132
  • http://errol.oclc.org/oai:xmlregistry.oclc.org:demo/ISBN/0521555132.html
  • http://errol.oclc.org/oai:xmlregistry.oclc.org:demo/ISBN/0521555132.ListMetadataFormats
  • http://errol.oclc.org/oai:xmlregistry.oclc.org:demo/ISBN/0521555132.resource
  • http://errol.oclc.org/oai:xmlregistry.oclc.org:demo/ISBN/0521555132.ListERRoLs

90
Inverted Archives
  • Unit of discourse is no longer an archive or
    service, but a DOI which has services linked from
    it
  • cf.
  • UPS demonstration prototype
  • Smart Objects, Dumb Archives (SODA) model

91
Example
http://dx.doi.org/10.1145/374308.374342
92
Naming
  • Fundamental to other technologies (OAI-PMH,
    OpenURL, etc.)
  • Options
  • URNs
  • Persistent URLs (PURLs)
  • http://purl.org/
  • Handles
  • http://www.handle.net/
  • Digital Object Identifiers
  • http://www.doi.org/
  • ARK
  • http://www.cdlib.org/inside/diglib/ark/

93
Object Models
94
Popular Object Models
  • METS
  • used in DSpace, Fedora
  • http://www.loc.gov/standards/mets/
  • MPEG-21 DIDL
  • http://xml.coverpages.org/mpeg21-didl.html
  • used in LANL DLs
  • http://www.dlib.org/dlib/november03/bekaert/11bekaert.html
  • http://www.dlib.org/dlib/february04/bekaert/02bekaert.html
  • http://lib-www.lanl.gov/herbertv/papers/jcdl2004-submitted-draft.pdf

95
Object Models OAI-PMH
[Figure labels: resource; item; oai:foo.edu:1234; records; METS]
Move from simple metadata files pointing to
resources to records as modeled representations
of resources
96
OAI-PRH?
  • using OAI-PMH for resource extraction / exchange
  • yes, OAI-PMH is for metadata, not resources, but
    it's going to happen anyway
  • mirroring
  • preservation (archive zipping)
  • convergence with OAIS
  • assumptions
  • a digital resource
  • rsync et al. neither appropriate nor possible
  • defer metadata vs. data discussion

97
Possible Approaches
  1. Exploit knowledge outside the scope of the
    OAI-PMH to extract the resource
  2. Base64 encode the resource and transmit via
    OAI-PMH as a separate metadata prefix?
  3. Separate metadata prefix with instructions on how
    to extract / scrape the resource
  4. Separate metadata format with XML encoded
    metadata, along with XSLT to decode it

98
Out of Band Knowledge
  1. take url in dc:identifier
  2. parse report number
  3. append reportnumber.pdf to url

[Figure label: direct pdf]
99
Out of Band Knowledge
  • pros tailored, no accidental harvesting
  • cons not scalable wrt of repositories
    harvesters, false negatives

no metadata change metadata change
no data change ok unnecessary download
data change missed update! ok
assumption: a change in metadata means a change in
data -- not always true!
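The four cases on this slide can be written out as a small function. This is only an illustration of the reasoning, assuming a harvester that re-downloads the resource exactly when the metadata datestamp changes; the function name is made up for the example.

```python
def harvest_outcome(metadata_changed, data_changed):
    """Outcome when the resource is re-fetched only if the metadata changed."""
    if metadata_changed and not data_changed:
        return "unnecessary download"
    if data_changed and not metadata_changed:
        return "missed update"
    return "ok"


# Walk the full 2x2 matrix from the slide.
for data_changed in (False, True):
    for metadata_changed in (False, True):
        print(metadata_changed, data_changed,
              harvest_outcome(metadata_changed, data_changed))
```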
100
Base64 Encoding
  • define separate metadata formats
  • base64application/pdf
  • base64application/powerpoint
  • pros: describable with OAI-PMH semantics,
    accomplished with standard OAI-PMH tools
  • cons: heavyweight (could use compression),
    suitable for simple objects only, accidental
    harvesting would produce high loads for
    repositories and harvesters
  • use complex objects, modeled representation,
    instead!
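A rough sketch of the base64 approach, to show why it is heavyweight. The element and attribute names are invented for illustration; no such metadata format is defined by OAI-PMH.

```python
import base64
import xml.etree.ElementTree as ET


def wrap_resource(payload, mime_type):
    """Encode raw bytes as the XML body of a (hypothetical) metadata format."""
    # "resource", "mimeType" and "encoding" are made-up names for this sketch.
    root = ET.Element("resource", {"mimeType": mime_type, "encoding": "base64"})
    root.text = base64.b64encode(payload).decode("ascii")
    return ET.tostring(root, encoding="unicode")


def unwrap_resource(xml_text):
    """Harvester side: recover the original bytes from the record body."""
    root = ET.fromstring(xml_text)
    return base64.b64decode(root.text or "")


pdf_bytes = b"%PDF-1.4 placeholder bytes standing in for a real file"
record_body = wrap_resource(pdf_bytes, "application/pdf")

# Base64 turns every 3 input bytes into 4 output characters,
# so the payload grows by about a third before any compression.
print(len(pdf_bytes), len(base64.b64encode(pdf_bytes)))
```

Every harvester that requests this metadata prefix receives the full encoded resource, which is where the accidental-harvesting load comes from.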

101
Metadata as Instructions
cf. http://genomebiology.com/2003/4/6/R40
102
Metadata as Instructions
  • the resource described in <dc:identifier> could
    be a complex object
  • may not be appropriate to
  • tar the object into a single file
  • expose all constituent objects through OAI-PMH
  • define a metadata prefix that provides machine
    readable instructions on how to extract the
    complex object
  • METS
  • MPEG-21 DIDL
  • SCORM
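As a sketch of what "instructions" could look like with METS: the harvester reads the file section of a harvested METS record and collects the location of each constituent file. The METS and XLink namespaces are the standard ones; the sample document, its URLs, and the function name are made up for illustration.

```python
import xml.etree.ElementTree as ET

NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

# Invented sample record; foo.example.org is a placeholder host.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                      xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp>
      <mets:file ID="f1" MIMETYPE="application/pdf">
        <mets:FLocat LOCTYPE="URL"
                     xlink:href="http://foo.example.org/1234/report.pdf"/>
      </mets:file>
      <mets:file ID="f2" MIMETYPE="image/png">
        <mets:FLocat LOCTYPE="URL"
                     xlink:href="http://foo.example.org/1234/figure1.png"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""


def constituent_urls(mets_xml):
    """Return the xlink:href of every FLocat element in a METS document."""
    root = ET.fromstring(mets_xml)
    href = "{%s}href" % NS["xlink"]
    return [loc.get(href)
            for loc in root.iterfind(".//mets:FLocat", NS)
            if loc.get(href)]


for url in constituent_urls(SAMPLE):
    print(url)
```

The harvester then fetches each listed URL itself, so the complex object is neither tarred into one file nor exposed piece by piece as separate OAI-PMH items.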

103
Metadata as Instructions
104
XSLT
  • if the resource is already XML encoded, include
    an XSLT to transform into the desired format
  • use separate metadata formats or even sets for
    harvesters to express their transformation
    preferences?
  • pros: elegant, limited work for repository
  • cons: assumes client-side transformation
    capability, applicable only for XML-encodable
    resources

105
Static Repositories
106
Why a Static Repository?
  • Not all sites can install CGI or servlets
  • they may have web access, but either use a
    restrictive ISP or are firewalled off
  • Not all sites need heavyweight OAI-PMH
    software
  • static repos are not limited by size, but are
    generally intended for smaller repositories
    (<5000 records)
  • http://www.openarchives.org/OAI/2.0/guidelines-static-repository.htm

107
Static Repository Architecture
gateways can describe themselves with the new
gateway container:
http://www.openarchives.org/OAI/2.0/guidelines-gateway.htm
108
OAI-Rights
109
Not Just for Eprints Anymore
  • OAI-Rights is an ongoing effort
  • http://www.openarchives.org/documents/OAIRightsWhitePaper.html
  • http://www.openarchives.org/news/oairightspress030929.html
  • Issues
  • entity association
  • metadata vs. resource
  • aggregation association
  • records, repository, sets
  • binding
  • Record's <about>, Identify's <description>

110
Download and Go!
111
Where Do You Want to Build?
[Figure: architecture layers (user, service provider, data providers, local context-sensitive services) with example software: CDSware, EPrints.org]
112
Fedora
  • joint project between Cornell & UVa
  • funded by the Mellon Foundation
  • a repository management system
  • focuses on complex digital objects and their
    behaviors
  • more info
  • http://www.fedora.info/
  • D-Lib Magazine, 9(4)
  • http://www.dlib.org/dlib/april03/staples/04staples.html

113
DSpace
  • MIT & HP Labs
  • constructed to capture all the output of MIT's
    faculty
  • now generalized to the DSpace Federation
  • 8 top universities in the US & Canada
  • More info
  • http://www.dspace.org/
  • http://sourceforge.net/projects/dspace/
  • D-Lib Magazine 9(1)
  • http://www.dlib.org/dlib/january03/smith/01smith.html

114
EPrints.org
  • developed at Southampton University
  • part of larger suite of institutional/author
    self-archiving tools and services
  • e.g. citebase & paracite
  • widely adopted -- 100 sites
  • http://software.eprints.org/ep2
  • more info
  • http://www.eprints.org/
  • http://www.arl.org/sparc/core/index.asp?pageg206

115
CDSware
  • developed at CERN
  • data provider & service provider
  • large-scale use at CERN (> 600k records)
  • in use at a few non-CERN sites
  • free & paid support models
  • more info
  • http://cdsware.cern.ch/

116
Kepler
  • P2P publishing for academia
  • community servers for coordination, management
  • archivelets for individual laptops, PCs
  • more info
  • http://kepler.cs.odu.edu/
  • D-Lib Magazine 7(4)
  • http://www.dlib.org/dlib/april01/maly/04maly.html

117
OpenResolver
  • developed by UKOLN
  • open source
  • OpenURL 0.1 format resolver
  • NISO 1.0 format???
  • more info
  • Ariadne, 28
  • http://www.ariadne.ac.uk/issue28/resolver/
  • ftp://ftp.ukoln.ac.uk/metadata/tools/openresolver/
  • http://www.ukoln.ac.uk/distributed-systems/openurl/

118
The Future: Community Building
  • Ultimately, protocols and metadata formats are
    not what make a difference
  • Rather, it is the critical mass afforded by a
    common set of utilities (cf. HTTP, Dublin Core, XML)
  • The best current example: the Open Language
    Archives Community
  • http://www.language-archives.org/
  • OAI-PMH provides the basis for communication
    between strangers, but allows even richer
    communication between friends