Title: Harvesting Metadata Using OAI-PMH
1Harvesting Metadata Using OAI-PMH
- Roy Tennant
- California Digital Library
2Outline
- The Open Archives Initiative
- OAI-PMH
- The Harvesting Process
- Harvesting Problems
- Steps to a Fruitful Harvest
- A Harvesting Service Model
- The OAI Future
3Open Archives Initiative
- Aimed at making the large and growing number of repositories of freely available digital content interoperable
- Only five years old, but already essential
- Protocol for Metadata Harvesting (OAI-PMH) specifies how repositories can expose their metadata for others to harvest
- Well over 500 repositories world-wide support the protocol
- OAIster.org has indexed 3.5 million items from those repositories
4www.oaforum.org/tutorial/
5OAI-PMH
- Data providers (DP): those with the stuff
- Service providers (SP): those who harvest metadata and provide aggregation and search services
- Software for both DPs and SPs readily available
- OAI-PMH verbs
- Identify
- ListIdentifiers
- ListMetadataFormats
- ListSets
- ListRecords
- GetRecord
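Each verb is sent as an ordinary HTTP request with a `verb` parameter. The sketch below shows one way to build such requests; the base URL and the helper name `build_request` are hypothetical and stand in for a real repository endpoint.

```python
from urllib.parse import urlencode

# Hypothetical repository endpoint, used only for illustration.
BASE_URL = "https://example.org/oai"

def build_request(verb, **arguments):
    """Return the GET URL for one OAI-PMH request."""
    params = {"verb": verb}
    params.update(arguments)
    return BASE_URL + "?" + urlencode(params)

print(build_request("Identify"))
print(build_request("ListRecords", metadataPrefix="oai_dc"))
```

Arguments such as `metadataPrefix` are passed as keyword arguments, so the same helper covers all six verbs.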
6OAI Architecture
Source: Open Archives Forum Tutorial
8Identify
- Provides basic information about a repository
9ListMetadataFormats
- Lists available metadata formats
10ListIdentifiers
- Lists all identifiers (or only those of the optionally specified set)
- Must include metadataPrefix argument
11ListSets
12Library of Congress ListSets response
13ListRecords
- Lists all records (or only those of the optionally specified set)
- Must include metadataPrefix argument
14GetRecord
- Retrieves a specific record
- Must include metadataPrefix and identifier arguments
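Responses to these verbs are XML in the OAI-PMH namespace. As a rough sketch of reading one, the code below pulls the identifier and Dublin Core titles out of a GetRecord response. The sample response is trimmed and its identifier and title are made up; `parse_record` is an illustrative name.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Illustrative response, trimmed to the parts used below; values are invented.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
        <datestamp>2004-01-15</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample title</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>"""

def parse_record(xml_text):
    """Extract the identifier and all dc:title values from one record."""
    root = ET.fromstring(xml_text)
    record = root.find(".//oai:record", NS)
    identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
    titles = [t.text for t in record.findall(".//dc:title", NS)]
    return {"identifier": identifier, "titles": titles}

print(parse_record(SAMPLE))
```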
15The Harvesting Process
- Identifying Sources
- Selecting Sets
- Harvesting
- Indexing
- Interface
16gita.grainger.uiuc.edu/registry/
17errol.oclc.org
18Selecting Sets
- Review the response to the ListSets verb
- May be instructive to search the collection in the native interface, if possible
- Look for descriptive pages on the site being harvested
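Reviewing a ListSets response can be partly automated. A minimal sketch, assuming a trimmed response with invented set names; `list_sets` is an illustrative helper, not part of any harvester mentioned here.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Illustrative ListSets response; the sets are invented.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set><setSpec>maps</setSpec><setName>Map collection</setName></set>
    <set><setSpec>photos</setSpec><setName>Photograph collection</setName></set>
  </ListSets>
</OAI-PMH>"""

def list_sets(xml_text):
    """Return (setSpec, setName) pairs from a ListSets response."""
    root = ET.fromstring(xml_text)
    return [(s.findtext(OAI + "setSpec"), s.findtext(OAI + "setName"))
            for s in root.iter(OAI + "set")]

for spec, name in list_sets(SAMPLE):
    print(spec, "-", name)
```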
21Harvesting
- Many harvesting applications are available; I will focus on:
- Public Knowledge Project (PKP) Harvester
- Virginia Tech Perl Harvester
- Library software vendors increasingly offer harvesting products (e.g., ExLibris MetaIndex)
23Virginia Tech Perl Harvester
-----------------------------------------
Harvester Sample Configurator
-----------------------------------------
Version 1.1 July 2002
Hussein Suleman <hussein_at_vt.edu>
Digital Library Research Laboratory
www.dlib.vt.edu Virginia Tech
------------------------------------------

Defaults/previous values are in brackets - press <enter> to accept those
enter "delete" to erase a default value
enter "continue" to skip further questions and use all defaults
press <ctrl>-c to escape at any time (new values will be lost)

Press <enter> to continue

ARCHIVES

Add all the archives that should be harvested

Current list of archives:
No archives currently defined !

Select from: Add Done
Enter your choice [D]: a<return>

ARCHIVE IDENTIFIER

You need a unique name by which to refer to the archive you will harvest metadata from
Examples: nsdl-380602, VTETD

Archive identifier: nsdl-380602<return>
24Let's Harvest!
25Indexing
- Pick your favorite database/indexing software
- MySQL
- SWISH-E
- Whatever is lying around
- May need to specifically set up a method to search across the entire record
- May need different fields for indexing than for display
- Will need to deal with element collision
26Interface
- Software interface (API) for other applications
- SRU/SRW?
- Arbitrary Web Services schema?
- User interface
- What functions do you want your users to be able to perform?
- What kinds of displays do you want?
27Harvesting Problems
- Sets
- Metadata Formats
- Metadata Artifacts
- Granularity
- Metadata Variances
28Sets
- Records are harvested in clumps, called sets, created by DPs
- No guidelines exist for defining sets
- Examples:
- Collection
- Organizational structure
- Format (but is a page image an image? See example)
29Metadata Formats
- Only required format is simple Dublin Core, although any format can be made available in addition
- Few DPs surface richer metadata
- Simple DC is simply too simple!
- Example (artifact vs. surrogate dates)
30Metadata Artifacts
- Unintended, unwanted aberrations
- Sample causes
- Idiosyncratic local practices
- Anachronisms
- HTML code
- Examples
- Circa string of dates for searching purposes
- electronic resource
31Granularity
- Record granularity: what is an object?
- A book, or each individual page?
- Examples: CDL, Univ. of Michigan
- Metadata granularity
- Multiple values in one field
- Example: Univ. of Washington
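When several values arrive packed into one field, they usually have to be split before indexing. A minimal sketch, assuming the values are separated by semicolons or vertical bars; the actual delimiter varies by repository and is an assumption here, as is the sample string.

```python
import re

def split_values(value, separators=";|"):
    """Split one field holding several values into a list of trimmed values."""
    pattern = "[" + re.escape(separators) + "]"
    return [part.strip() for part in re.split(pattern, value) if part.strip()]

# Invented example of a subject field carrying three terms.
print(split_values("Fishing; Boats ; Puget Sound"))
```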
32Metadata Variances
- Subject terminology differences
- Disparities in recording the same metadata
- Example: date variances
- Mapping oddities or mistakes
- Examples: 1) format into description, 2) description into subject
33Steps to a Fruitful Harvest
- Needs Assessment (it's the user, stupid)
- DP Identification and Communication
- Metadata Capture
- Metadata Analysis
- Metadata Subsetting
- Metadata Normalization
- Metadata Enrichment
- Indexing & Display
- Interface (it's still the user, stupid)
34Needs Assessment
- What are you trying to accomplish?
- What will your users want to be able to do?
- What metadata will you need, and what procedures will you need to set up to enable these activities?
- Which repositories have what you want?
- Is what they have (e.g., sets, metadata) usable as is, or ?
35DP Identification & Communication
- Identification
- Use UIUC directory of DPs to identify potential sources
- Communication
- Not required to tell them you are harvesting, but may help establish a good relationship
- May want to request that they surface a richer metadata format and/or provide a different set
36Metadata Capture
- Sample questions to answer
- Individual sets, or all?
- Richer metadata formats available?
- How frequently to reharvest?
- Start from scratch each time or update?
- Many software options
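The "update" option relies on the protocol's optional `from` argument, which limits ListRecords to records changed since a given date. A minimal sketch; the base URL, set name, and the helper name `list_records_url` are hypothetical.

```python
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, last_harvest=None):
    """Build a ListRecords URL, optionally limited to one set and to changes since last_harvest."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    if last_harvest:
        # Only records added or changed on or after this date are returned.
        params["from"] = last_harvest
    return base_url + "?" + urlencode(params)

# Full harvest of everything, then an update of one set since the last run.
print(list_records_url("https://example.org/oai"))
print(list_records_url("https://example.org/oai", set_spec="maps", last_harvest="2004-01-15"))
```

An update pass only learns about deletions if the repository tracks deleted records, which is one reason to start from scratch occasionally.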
37Metadata Analysis
- Finding out what you have (and don't have)
- Encoding practices
- Gap analysis (e.g., missing fields, etc.)
- Mistakes (e.g., mapping errors)
- Software can help
- Commercial software like Spotfire
- In-house or open source software tools
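An in-house gap analysis can start very simply, by measuring how often each field is actually filled in. A minimal sketch, assuming harvested records have already been parsed into dictionaries of lists; the field names and sample records are invented.

```python
from collections import Counter

def field_coverage(records, fields):
    """Return, for each field, the fraction of records with at least one value."""
    counts = Counter()
    for record in records:
        for field in fields:
            if record.get(field):
                counts[field] += 1
    total = len(records)
    return {field: counts[field] / total for field in fields} if total else {}

records = [
    {"title": ["A"], "date": ["1900"]},
    {"title": ["B"], "date": []},
]
print(field_coverage(records, ["title", "date", "subject"]))
```

Fields with low coverage are candidates for enrichment, or for a request to the DP.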
38Source: 2002 Master's Thesis, Jewel Hope Ward, UNC Chapel Hill
45Metadata Subsetting
- DP sets are unlikely to serve all SP uses well
- SPs will need the ability to subset harvested metadata
- Example: prototype subsetting tool
47Metadata Normalization
- Normalizing: to reduce to a standard or normal state
- Prototype date normalization service screen
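Date normalization can be illustrated with a deliberately crude rule: reduce every date string to its first four-digit year. This is only a sketch under that assumption, not the prototype service shown on the slide, and the sample strings are invented.

```python
import re

def normalize_year(raw):
    """Return the first four-digit year (1000-2099) found in a date string, or None."""
    match = re.search(r"(?<!\d)(1[0-9]{3}|20[0-9]{2})(?!\d)", raw)
    return match.group(1) if match else None

for raw in ["c. 1895", "1901-05-03", "May 3, 1901", "undated"]:
    print(repr(raw), "->", normalize_year(raw))
```

A production service would also need to handle ranges and uncertain dates, and should keep the original string for display.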
48Metadata Enrichment
- Adding fields and/or qualifiers may be useful or required, for example:
- Metadata provider information
- Geographic coverage
- Subject terms mapped to a different thesaurus
- Authority control record
- The enrichment process may be the same tool as
the subsetting tool (i.e., find a cluster of
records and perform an action)
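The "find a cluster of records and perform an action" idea can be sketched as a selection test plus an action. The record structure, the set names, and the provider string below are all invented for illustration.

```python
def apply_to_cluster(records, matches, action):
    """Run action on every record for which matches(record) is true; return the count."""
    changed = 0
    for record in records:
        if matches(record):
            action(record)
            changed += 1
    return changed

records = [
    {"set": "maps", "title": ["Harbor chart"]},
    {"set": "photos", "title": ["Ferry landing"]},
]

# Enrich every record from the "maps" set with provider information.
count = apply_to_cluster(
    records,
    matches=lambda r: r["set"] == "maps",
    action=lambda r: r.setdefault("provider", []).append("Example University Library"),
)
print(count, records[0])
```

Subsetting uses the same selection step with an action that copies or tags the record instead of changing it.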
49Indexing & Display
- Selected fields may need to be mapped to specific indexing and display elements
- Particularly required if harvesting different metadata formats
- But also needs to be done with multiple, conflicting fields
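One way to handle this mapping is a lookup table keyed by metadata format and source element. The sketch below is an assumption about how such a table might look; the format names, element names, and index fields are hypothetical.

```python
# Hypothetical mapping from (metadata format, source element) to an index field.
FIELD_MAP = {
    ("oai_dc", "title"): "title",
    ("oai_dc", "creator"): "name",
    ("oai_dc", "contributor"): "name",
    ("marc21", "245"): "title",
}

def to_index_doc(metadata_format, record):
    """Map one harvested record (a dict of lists) to index fields."""
    doc = {}
    for element, values in record.items():
        target = FIELD_MAP.get((metadata_format, element))
        if target:
            # Elements that collide on one index field are merged, not overwritten.
            doc.setdefault(target, []).extend(values)
    # A catch-all field supports searching across the entire record.
    doc["fulltext"] = " ".join(v for values in record.values() for v in values)
    return doc

print(to_index_doc("oai_dc", {"title": ["A"], "creator": ["B"], "contributor": ["C"], "rights": ["D"]}))
```

Keeping the original record alongside the index document lets display use different fields than search does.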
50A Harvesting Service Model
51The OAI Future
- Further protocol development
- Services layered on top of OAI-PMH
- Shared software tools
- Best practices for both DPs and SPs
52oai-best.comm.nsdl.org