Harvesting Metadata Using OAI-PMH

1
Harvesting Metadata Using OAI-PMH
  • Roy Tennant
  • California Digital Library

2
Outline
  • The Open Archives Initiative
  • OAI-PMH
  • The Harvesting Process
  • Harvesting Problems
  • Steps to a Fruitful Harvest
  • A Harvesting Service Model
  • The OAI Future

3
Open Archives Initiative
  • Aimed at making the large and growing number of
    repositories of freely available digital content
    interoperable
  • Only five years old, but already essential
  • Protocol for Metadata Harvesting (OAI-PMH)
    specifies how repositories can expose their
    metadata for others to harvest
  • Well over 500 repositories world-wide support the
    protocol
  • OAIster.org has indexed 3.5 million items from
    those repositories

4
www.oaforum.org/tutorial/
5
OAI-PMH
  • Data providers (DP): those with the stuff
  • Service providers (SP): those who harvest
    metadata and provide aggregation and search
    services
  • Software for both DPs and SPs is readily available
  • OAI-PMH verbs:
    • Identify
    • ListIdentifiers
    • ListMetadataFormats
    • ListSets
    • ListRecords
    • GetRecord
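
Each verb is sent as an ordinary HTTP request with the verb and its arguments as query parameters. A minimal sketch of building such requests follows; the base URL and the identifier are made-up placeholders, and `build_request` is a hypothetical helper, not part of any harvester named in these slides.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; substitute the repository's real OAI-PMH base URL.
BASE_URL = "https://repository.example.org/oai"

def build_request(verb, **arguments):
    """Return the GET URL for one OAI-PMH request."""
    params = {"verb": verb}
    # Skip arguments passed as None so optional ones (e.g. set) can be omitted.
    params.update({k: v for k, v in arguments.items() if v is not None})
    return BASE_URL + "?" + urlencode(params)

print(build_request("Identify"))
print(build_request("ListSets"))
print(build_request("ListRecords", metadataPrefix="oai_dc", set=None))
print(build_request("GetRecord", metadataPrefix="oai_dc",
                    identifier="oai:repository.example.org:1234"))
```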

6
OAI Architecture
Source: Open Archives Forum Tutorial
8
Identify
  • Provides basic information about a repository

9
ListMetadataFormats
  • Lists available metadata formats

10
ListIdentifiers
  • Lists the identifiers of all records (or only
    those in the optionally specified set)
  • Must include the metadataPrefix argument

11
ListSets
  • Lists available sets
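
As a rough illustration of reviewing a ListSets response, the sketch below pulls out each set's setSpec and setName with the Python standard library. The sample XML is invented and trimmed to the elements used here; it is not the Library of Congress response.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Made-up sample response, reduced to the parts this sketch reads.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set><setSpec>maps</setSpec><setName>Historic Maps</setName></set>
    <set><setSpec>photos</setSpec><setName>Photographs</setName></set>
  </ListSets>
</OAI-PMH>"""

def parse_sets(xml_text):
    """Return a list of (setSpec, setName) pairs from a ListSets response."""
    root = ET.fromstring(xml_text)
    return [(s.findtext("oai:setSpec", namespaces=OAI_NS),
             s.findtext("oai:setName", namespaces=OAI_NS))
            for s in root.iterfind(".//oai:set", OAI_NS)]

for spec, name in parse_sets(SAMPLE):
    print(spec, "-", name)
```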

12
Library of Congress ListSets response
13
ListRecords
  • Lists all records (or only those in the
    optionally specified set)
  • Must include the metadataPrefix argument
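
Large ListRecords responses are usually delivered in pieces: the protocol lets a repository return a resumptionToken that the harvester sends back to get the next piece. The slides do not cover this, so treat the following as a sketch under that assumption. `harvest` and `http_fetch` are hypothetical names, and error handling (OAI error responses, retries) is left out.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI = "{http://www.openarchives.org/OAI/2.0/}"

def http_fetch(base_url):
    """Return a fetch(params) callable that issues HTTP GET requests."""
    def fetch(params):
        with urlopen(base_url + "?" + urlencode(params)) as response:
            return response.read()
    return fetch

def harvest(fetch, metadata_prefix="oai_dc", set_spec=None):
    """Yield <record> elements, following resumption tokens until done.

    `fetch` is any callable that takes a dict of request arguments and
    returns the response XML, which keeps this loop testable offline.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    while True:
        root = ET.fromstring(fetch(params))
        yield from root.iter(OAI + "record")
        token = root.find(f".//{OAI}resumptionToken")
        if token is None or not (token.text or "").strip():
            break  # no token, or an empty one: the list is complete
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListRecords",
                  "resumptionToken": token.text.strip()}
```

With a real endpoint this would be called as `harvest(http_fetch(base_url), set_spec="maps")`, where the base URL and set name are whatever the chosen repository advertises.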

14
GetRecord
  • Retrieves a specific record
  • Must include the metadataPrefix and identifier
    arguments
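
A retrieved record in simple Dublin Core can be reduced to a plain dictionary of element names and values, which is the shape the later steps work on. This is a minimal sketch with an invented sample record; `dc_fields` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

DC = "{http://purl.org/dc/elements/1.1/}"

# Made-up GetRecord response in simple Dublin Core (oai_dc).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord><record>
    <header><identifier>oai:repository.example.org:1234</identifier></header>
    <metadata>
      <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                 xmlns:dc="http://purl.org/dc/elements/1.1/">
        <dc:title>View of the harbor</dc:title>
        <dc:subject>Harbors</dc:subject>
        <dc:subject>Ships</dc:subject>
        <dc:date>ca. 1900</dc:date>
      </oai_dc:dc>
    </metadata>
  </record></GetRecord>
</OAI-PMH>"""

def dc_fields(xml_text):
    """Map each Dublin Core element name to its list of non-empty values."""
    fields = defaultdict(list)
    for el in ET.fromstring(xml_text).iter():
        if el.tag.startswith(DC) and el.text and el.text.strip():
            fields[el.tag[len(DC):]].append(el.text.strip())
    return dict(fields)

print(dc_fields(SAMPLE))
```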

15
The Harvesting Process
  • Identifying Sources
  • Selecting Sets
  • Harvesting
  • Indexing
  • Interface

16
gita.grainger.uiuc.edu/registry/
17
errol.oclc.org
18
Selecting Sets
  • Review the response to the ListSets verb
  • May be instructive to search the collection in
    the native interface, if possible
  • Look for descriptive pages on the site being
    harvested

21
Harvesting
  • Many harvesting applications are available; I
    will focus on:
    • Public Knowledge Project (PKP) Harvester
    • Virginia Tech Perl Harvester
  • Library software vendors increasingly offer
    harvesting products (e.g., ExLibris MetaIndex)

23
Virginia Tech Perl Harvester
-----------------------------------------
Harvester Sample Configurator
-----------------------------------------
Version 1.1 July 2002
Hussein Suleman <hussein_at_vt.edu>
Digital Library Research Laboratory
www.dlib.vt.edu Virginia Tech
------------------------------------------

Defaults/previous values are in brackets - press <enter> to accept those
enter "delete" to erase a default value
enter "continue" to skip further questions and use all defaults
press <ctrl>-c to escape at any time (new values will be lost)

Press <enter> to continue

ARCHIVES

Add all the archives that should be harvested

Current list of archives:
No archives currently defined!

Select from: Add, Done
Enter your choice [D]: a <return>

ARCHIVE IDENTIFIER

You need a unique name by which to refer to the archive you will harvest metadata from
Examples: nsdl-380602, VTETD

Archive identifier: nsdl-380602 <return>
24
Let's Harvest!
25
Indexing
  • Pick your favorite database/indexing software:
    • MySQL
    • SWISH-E
    • Whatever is lying around
  • May need to specifically set up a method to
    search across the entire record
  • May need different fields for indexing than for
    display
  • Will need to deal with element collision
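
One way to get a search across the entire record is to store a combined field next to the fields kept for display. The sketch below uses SQLite from the Python standard library purely as a stand-in for "whatever is lying around"; the table layout and the names `build_index` and `search` are assumptions made for illustration.

```python
import sqlite3

def build_index(records):
    """Index (identifier, {element: [values]}) pairs in an in-memory database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE record "
               "(id TEXT PRIMARY KEY, title TEXT, anywhere TEXT)")
    for identifier, fields in records:
        # Display field: repeated titles are joined rather than dropped.
        title = "; ".join(fields.get("title", []))
        # Search field: every value of every element, lowercased.
        anywhere = " ".join(v for values in fields.values()
                            for v in values).lower()
        db.execute("INSERT INTO record VALUES (?, ?, ?)",
                   (identifier, title, anywhere))
    return db

def search(db, term):
    """Return (id, title) rows whose combined field contains the term."""
    rows = db.execute(
        "SELECT id, title FROM record WHERE anywhere LIKE ? ORDER BY id",
        (f"%{term.lower()}%",))
    return rows.fetchall()

db = build_index([
    ("oai:a:1", {"title": ["Harbor view"], "subject": ["Ships", "Harbors"]}),
    ("oai:a:2", {"title": ["City map"], "subject": ["Maps"]}),
])
print(search(db, "ships"))
```

A plain substring match like this is only a starting point; a real service would more likely use the full-text features of its chosen indexing software.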

26
Interface
  • Software interface (API) for other applications
    • SRU/SRW?
    • Arbitrary Web Services schema?
  • User interface
    • What functions do you want your users to be able
      to perform?
    • What kinds of displays do you want?

27
Harvesting Problems
  • Sets
  • Metadata Formats
  • Metadata Artifacts
  • Granularity
  • Metadata Variances

28
Sets
  • Records are harvested in clumps, called sets,
    created by DPs
  • No guidelines exist for defining sets
  • Examples:
    • Collection
    • Organizational structure
    • Format (but is a page image an image? See example)

29
Metadata Formats
  • Only required format is simple Dublin Core,
    although any format can be made available in
    addition
  • Few DPs surface richer metadata
  • Simple DC is simply too simple!
  • Example (artifact vs. surrogate dates)

30
Metadata Artifacts
  • Unintended, unwanted aberrations
  • Sample causes:
    • Idiosyncratic local practices
    • Anachronisms
    • HTML code
  • Examples:
    • Circa string of dates for searching purposes
    • electronic resource
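
HTML code left in a field is one artifact that can be cleaned up mechanically. A crude sketch, assuming the markup is simple enough for a regular expression (a real cleanup step might prefer a proper HTML parser); `strip_html` is a hypothetical helper.

```python
import re
from html import unescape

TAG = re.compile(r"<[^>]+>")

def strip_html(value):
    """Remove tags, decode character entities, and collapse whitespace."""
    text = unescape(TAG.sub(" ", value))
    return " ".join(text.split())

print(strip_html("<p>View of the <i>harbor</i> &amp; docks</p>"))
```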

31
Granularity
  • Record granularity: what is an object?
    • A book, or each individual page?
    • Examples: CDL, Univ. of Michigan
  • Metadata granularity
    • Multiple values in one field
    • Example: Univ. of Washington
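
When several values are packed into one field, they can be split back out before indexing. This sketch assumes a semicolon delimiter, which will not hold for every data provider; check the harvested metadata first.

```python
def split_values(values, delimiter=";"):
    """Split packed field values into separate, trimmed, non-empty values."""
    result = []
    for value in values:
        result.extend(part.strip() for part in value.split(delimiter)
                      if part.strip())
    return result

print(split_values(["Harbors; Ships ;", "Maps"]))
```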

32
Metadata Variances
  • Subject terminology differences
  • Disparities in recording the same metadata
    • Example: date variances
  • Mapping oddities or mistakes
    • Examples: 1) format into description, 2)
      description into subject

33
Steps to a Fruitful Harvest
  • Needs Assessment (it's the user, stupid)
  • DP Identification and Communication
  • Metadata Capture
  • Metadata Analysis
  • Metadata Subsetting
  • Metadata Normalization
  • Metadata Enrichment
  • Indexing & Display
  • Interface (it's still the user, stupid)

34
Needs Assessment
  • What are you trying to accomplish?
  • What will your users want to be able to do?
  • What metadata will you need, and what procedures
    will you need to set up to enable these
    activities?
  • Which repositories have what you want?
  • Is what they have (e.g., sets, metadata) usable
    as is, or ?

35
DP Identification & Communication
  • Identification
    • Use the UIUC directory of DPs to identify potential
      sources
  • Communication
    • Not required to tell them you are harvesting, but
      it may help establish a good relationship
    • May want to request that they surface a richer
      metadata format and/or provide a different set

36
Metadata Capture
  • Sample questions to answer:
    • Individual sets, or all?
    • Richer metadata formats available?
    • How frequently to reharvest?
    • Start from scratch each time or update?
  • Many software options
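
The "scratch or update" question maps onto the protocol's optional `from` argument, which limits a ListRecords request to records changed since a given datestamp. A small sketch of building the arguments for either choice; `list_records_params` is a hypothetical helper and day-level datestamps are assumed.

```python
from datetime import date

def list_records_params(metadata_prefix="oai_dc", set_spec=None,
                        last_harvest=None):
    """Build ListRecords arguments for a full or an incremental harvest."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec          # one set rather than everything
    if last_harvest is not None:
        # Incremental update: only records changed since the last run.
        params["from"] = last_harvest.isoformat()
    return params

print(list_records_params())                                # from scratch
print(list_records_params(last_harvest=date(2005, 3, 1)))   # update
```

Updating is cheaper, but how a repository reports deleted records varies, so an occasional harvest from scratch may still be worth scheduling.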

37
Metadata Analysis
  • Finding out what you have (and don't have)
    • Encoding practices
    • Gap analysis (e.g., missing fields)
    • Mistakes (e.g., mapping errors)
  • Software can help
    • Commercial software like Spotfire
    • In-house or open source software tools
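
A simple in-house gap analysis can be as small as counting how many harvested records carry each expected field. The sketch below assumes records are dictionaries of element name to value list; the default list of expected fields is illustrative only.

```python
from collections import Counter

def field_coverage(records, expected=("title", "creator", "date", "subject")):
    """Return the fraction of records with a non-blank value per field."""
    counts = Counter()
    for fields in records:
        for name in expected:
            if any(v.strip() for v in fields.get(name, [])):
                counts[name] += 1
    total = len(records)
    return {name: (counts[name] / total if total else 0.0)
            for name in expected}

sample = [
    {"title": ["A"], "date": ["1900"]},
    {"title": ["B"], "date": [""]},      # blank value counts as missing
    {"title": ["C"]},
    {"title": [" "], "date": ["1950"]},
]
print(field_coverage(sample, expected=("title", "date", "creator")))
```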

38
Source: 2002 Master's Thesis, Jewel Hope Ward,
UNC Chapel Hill
45
Metadata Subsetting
  • DP sets are unlikely to serve all SP uses well
  • SPs will need the ability to subset harvested
    metadata
  • Example: prototype subsetting tool

47
Metadata Normalization
  • Normalizing: to reduce to a standard or normal
    state
  • Prototype date normalization service screen
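
As an illustration of date normalization (not the prototype service shown on the slide), the sketch below reduces a free-text date to a first and last year. It only recognizes four-digit years from 1000 to 2099 and returns nothing for anything else, which is a deliberate simplification.

```python
import re

# Four-digit years 1000-2099 that are not part of a longer digit run.
YEAR = re.compile(r"(?<!\d)(1[0-9]{3}|20[0-9]{2})(?!\d)")

def normalize_date(raw):
    """Return (first_year, last_year) found in a date string, or None."""
    years = [int(y) for y in YEAR.findall(raw)]
    if not years:
        return None
    return (min(years), max(years))

for raw in ["ca. 1900", "1890-1895", "[between 1910 and 1920?]",
            "2004-03-15", "n.d."]:
    print(repr(raw), "->", normalize_date(raw))
```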

48
Metadata Enrichment
  • Adding fields and/or qualifiers may be useful or
    required, for example:
    • Metadata provider information
    • Geographic coverage
    • Subject terms mapped to a different thesaurus
    • Authority control record
  • The enrichment process may be the same tool as
    the subsetting tool (i.e., find a cluster of
    records and perform an action)
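
The "find a cluster of records and perform an action" idea can be sketched as one function that takes a matching rule and an action. The same shape serves subsetting (the rule alone) and enrichment (rule plus action). The names and the sample rule are hypothetical.

```python
def enrich(records, matches, action):
    """Apply `action` to each record where `matches` is true; return the count."""
    changed = 0
    for fields in records:
        if matches(fields):
            action(fields)
            changed += 1
    return changed

records = [
    {"subject": ["San Francisco (Calif.)"]},
    {"subject": ["Maps"]},
]

# Cluster: records whose subject mentions San Francisco.
def about_san_francisco(fields):
    return any("San Francisco" in s for s in fields.get("subject", []))

# Action: add a geographic coverage value.
def add_california(fields):
    fields.setdefault("coverage", []).append("California")

print(enrich(records, about_san_francisco, add_california), "record(s) enriched")
```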

49
Indexing & Display
  • Selected fields may need to be mapped to specific
    indexing and display elements
  • Particularly required if harvesting different
    metadata formats
  • But also needs to be done with multiple,
    conflicting fields
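
A mapping table per metadata format is one way to route source elements to indexing and display elements. In this sketch every value is indexed but only the first is displayed, which is just one possible policy for multiple, conflicting fields; the map contents and names are assumptions.

```python
# Hypothetical map: metadata format -> target element -> source elements.
FIELD_MAP = {
    "oai_dc": {
        "title": ["title"],
        "name": ["creator", "contributor"],
        "date": ["date"],
    },
}

def map_record(fmt, fields, field_map=FIELD_MAP):
    """Return (index, display) dictionaries for one harvested record."""
    index, display = {}, {}
    for target, sources in field_map[fmt].items():
        values = [v for s in sources for v in fields.get(s, [])]
        if values:
            index[target] = values        # search on every value
            display[target] = values[0]   # show only the first
    return index, display

index, display = map_record("oai_dc", {
    "title": ["Harbor view"],
    "creator": ["Smith, J."],
    "contributor": ["Jones, A."],
    "date": ["ca. 1900", "2003-05-01"],   # artifact date vs. surrogate date
})
print(index)
print(display)
```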

50
A Harvesting Service Model
51
The OAI Future
  • Further protocol development
  • Services layered on top of OAI-PMH
  • Shared software tools
  • Best practices for both DPs and SPs

52
oai-best.comm.nsdl.org