Title: Metadata
1Metadata
- Andy Powell
- Technical Development and Research
- UKOLN
- University of Bath
- http://www.ukoln.ac.uk/
- a.powell@ukoln.ac.uk
2Metadata
- What is metadata?
- an introduction
- The Dublin Core
- metadata for the Web
- Metadata management
- Models for dealing with Web-site metadata
- UKOLN metadata projects
- overviews (and problems)
3What is metadata?
- by definition
- ..data about data..
- ..data which provides information
about a resource..
- by example
- title, author, subject classification, shelf mark
- digital format, terms and conditions, location (URL)
4What is metadata? (2)
- by usage
- Resource discovery
- Searching, location
- Authentication
- Quality/rating
- Semantic interoperability
- Resource management
- User interface
- Grouping resources for printing
- 3-D visualisations
5Range of formats
[Figure: range of metadata formats on a scale from Simple to Rich, and from robot generated to hand crafted. Formats shown: Dublin Core, IAFA, SOIF, MARC, TEI headers, CIMI, Alta Vista, NetFirst, Lycos]
6Where is metadata?
- Embedded within resource
- HTML <META> tags
- Linked to resource
- Remote database
- distributed
- union (centralised)
7Who creates metadata?
- Publisher side
- author
- webmaster
- institution
- Service side
- search service
- third party creators
robot generated
hand crafted
8Dublin Core
- 15 element core metadata set
- Primarily intended to aid resource discovery on
the Web
- Main usage currently embedded into HTML META tags
- All elements optional and repeatable
- Status?
- Agreed syntax for embedding in HTML
- Still discussion about the use of some of the
elements
http://www.ukoln.ac.uk/metadata/resources/dc.html
9Dublin Core History
- 4 DC meetings
- Dublin, Warwick, Dublin, Canberra
- (DC-5 - Helsinki coming soon)
- Mailing list discussions
- meta2@lut.ac.uk
- W3C interest
- RDF (PICS-NG), MCF
- Various projects
- Still no significant interest yet from the big
search engines :-(
10DC Elements - 1
- Title
- Subject
- intended to promote use of controlled
vocabularies but in practice likely to be used
for uncontrolled list of keywords
- Description
- abstract
- Creator
- Publisher
11DC Elements - 2
- Contributor
- Date
- the date the resource was made available in its
present form. Agreed default format uses subset
of ISO 8601, e.g. 1997-09-15
- Type
- category of resource - document, image, sound,
home page, novel, poem, etc. Still much
discussion about the content of this element
- Format
- MIME type
- Identifier
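The default Date format mentioned on this slide can be produced directly from a date value. The following is a minimal sketch in Python, assuming the YYYY-MM-DD form shown in the example (1997-09-15) is the wanted subset of ISO 8601. The helper names `dc_date` and `dc_date_tag` are hypothetical.

```python
from datetime import date


def dc_date(d: date) -> str:
    """Return a date in the YYYY-MM-DD form (a subset of ISO 8601)."""
    return d.isoformat()


def dc_date_tag(d: date) -> str:
    """Wrap the formatted date in a DC.date META tag."""
    return f'<META NAME="DC.date" CONTENT="{dc_date(d)}">'


print(dc_date_tag(date(1997, 9, 15)))
```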
12DC Elements - 3
- Source
- Language
- language of the resource - NOT the metadata
- Relation
- no guidelines for usage currently
- Coverage
- separate working party looking at usage
- Rights
- rights management seen as too complex for DC.
This will give a URL to some external information
13Simple Example
- <HTML><HEAD>
- <TITLE>UKOLN Home Page</TITLE>
- <META NAME="DC.title" CONTENT="UKOLN: UK Office for Library and Information Networking">
- <META NAME="DC.subject" CONTENT="national centre, network information support, library community, awareness, research, information services, public library networking, bibliographic management, distributed library systems, metadata, resource discovery, conferences, lectures, workshops">
- <META NAME="DC.description" CONTENT="UKOLN is a national centre for support in network information management in the library and information communities. It provides awareness, research and information services">
- <META NAME="DC.creator" CONTENT="Stark, Isobel">
- </HEAD>
- ...
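Tags like the ones on this slide are easy to get wrong by hand, so generating them from a record can help. The following is a rough sketch in Python, not a UKOLN tool: the function name `dc_meta_tags` and the list-of-pairs record layout are assumptions made for illustration. Pairs are used rather than a dictionary because all Dublin Core elements are repeatable.

```python
from html import escape


def dc_meta_tags(record):
    """Build one META tag per (element, value) pair.

    `record` is a list of pairs so that an element can appear more
    than once.
    """
    tags = []
    for element, value in record:
        name = escape("DC." + element, quote=True)
        # Escape quotes and angle brackets so the value cannot break the tag.
        content = escape(value, quote=True)
        tags.append(f'<META NAME="{name}" CONTENT="{content}">')
    return tags


record = [
    ("title", "UKOLN Home Page"),
    ("creator", "Stark, Isobel"),
]
for tag in dc_meta_tags(record):
    print(tag)
```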
14Element qualifiers
- Need to refine meaning in some cases
- TYPE
- Refines meaning of element - sub-divides element
namespace
- SCHEME
- Element value taken from external schema, e.g.
LCSH for DC.subject, Z39.53 for DC.language
- LANGUAGE
- Language of element value (not of the resource
being described!)
15Examples - TYPE
- Original DC.creator tag
- <META NAME="DC.creator" CONTENT="Stark, Isobel">
- Non-personal author
- <META NAME="DC.creator.corporate" CONTENT="UKOLN Information Services Group">
- Author's email address
- <META NAME="DC.creator.email" CONTENT="isg@ukoln.ac.uk">
16Examples - SCHEME
- Library of Congress Subject Heading
- <META NAME="DC.subject" CONTENT="(SCHEME=LCSH) Library information networks -- Great Britain">
- <META NAME="DC.subject" CONTENT="(SCHEME=LCSH) Information technology -- higher education">
- or
- <META NAME="DC.subject" SCHEME="LCSH" CONTENT="Library information networks -- Great Britain">
- <META NAME="DC.subject" SCHEME="LCSH" CONTENT="Information technology -- higher education">
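A program reading these tags has to cope with both ways of stating the scheme, and with TYPE qualifiers carried in the NAME. The following is a minimal sketch in Python, assuming the embedded form is written `(SCHEME=LCSH)` at the start of CONTENT and that the TYPE is whatever follows the element name after a dot. The function `parse_dc_meta` and its return layout are hypothetical.

```python
import re

# Matches a leading "(SCHEME=xxx)" prefix inside a CONTENT value.
SCHEME_PREFIX = re.compile(r"^\(SCHEME=([^)]+)\)\s*(.*)$", re.DOTALL)


def parse_dc_meta(name, content, scheme=None):
    """Split a DC META tag into element, TYPE, SCHEME and value.

    `scheme` is the value of a separate SCHEME attribute, if the tag
    has one. A "(SCHEME=...)" prefix in `content` takes precedence.
    """
    parts = name.split(".")
    if len(parts) < 2 or parts[0].lower() != "dc":
        raise ValueError(f"not a Dublin Core name: {name!r}")
    element = parts[1].lower()
    dc_type = ".".join(parts[2:]) or None
    match = SCHEME_PREFIX.match(content)
    if match:
        scheme, content = match.group(1), match.group(2)
    return {"element": element, "type": dc_type, "scheme": scheme, "value": content}


print(parse_dc_meta("DC.creator.email", "isg@ukoln.ac.uk"))
```

Both SCHEME styles produce the same result, so later code does not need to know which one the page author used.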
17Metadata Management
- Practical issues of using Dublin Core for
Internet resource description...
- UKOLN metadata system
- Requirements
- 3 models for metadata management
- Implementation at UKOLN
18UKOLN metadata system requirements
- Easy to use
- Work with a variety of methods of creating HTML
- Simple migration to future metadata formats
- Separate metadata from resource
19Managing Dublin Core (1): HTML Authoring tool
Embed by hand using HTML or text editor
- Pros
- Simple
- May be useful for training and familiarisation
- Cons
- May not be possible with all editors
- Maintenance problems
- Easy to make errors
20DC-dot
- A Web based tool for creating Dublin Core <meta>
tags
- Automatic generation of some tags based on
content of the resource
- Forms based editing of tags
- Cut-and-paste output into HTML
- Conversion to other formats
- SOIF, ROADS/WHOIS, USMARC, GILS...
http://www.ukoln.ac.uk/metadata/dcdot/
21Managing Dublin Core (2): Web-site management tool
Use Web-site management tool, for example
NetObjects Fusion
- Pros
- Use of Web-site management tools likely to
increase
- Object-oriented database approach
- Cons
- Proprietary formats
- Early days - too early to evaluate use for
metadata yet?
22Managing Dublin Core (3): On the fly generation
Hold Dublin Core separately and embed on-the-fly
using server-side include (SSI)
- Pros
- Separates metadata from resource
- Future migration fairly simple
- Cons
- Performance
- Lack of integration with HTML tools
- Server specific
23UKOLN metadata system (1)
- Embed on-the-fly
- Apache SSI script
- Store metadata using SOIF records
- Use MS-Access as tool to create the records
- Associate metadata with resource by co-locating
them in the Web server filestore
24UKOLN metadata system (2)
intro.html (created with an HTML editor)
<html> <head> <title></title> <!--#exec cmd="getmeta" --> </head> ...
Apache syntax for calling server-side script: <!--#exec cmd="getmeta" -->
intro.html.soif (created from the MS-Access Database)
@FILE { http://www.ukoln.ac. ...
keywords{13}: xxx, yyy, zzz
description{14}: blah blah b
author{13}: Stark, Isobel
...
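The job of a script such as getmeta can be sketched as: read the SOIF record stored next to the page, then print META tags into the page head. The Python below is an illustration under stated assumptions, not the UKOLN script. It assumes each SOIF attribute has the form `name{size}:` followed by a tab or space and then exactly `size` characters of value (real SOIF counts bytes, which is the same thing for ASCII data). The mapping `SOIF_TO_DC` and the sample URL are hypothetical, and locating the .soif file for the requested page is left out.

```python
import re
from html import escape

# An attribute header such as "keywords{13}:" followed by one tab or space.
ATTR = re.compile(r"([\w.-]+)\{(\d+)\}:[\t ]")

# Hypothetical mapping from SOIF attribute names to Dublin Core names.
SOIF_TO_DC = {
    "keywords": "DC.subject",
    "description": "DC.description",
    "author": "DC.creator",
}


def parse_soif(text):
    """Parse one SOIF record into (template, url, [(name, value), ...])."""
    head = re.match(r"@(\w+)\s*\{\s*(\S+)\s*\n", text)
    if head is None:
        raise ValueError("not a SOIF record")
    pos, attrs = head.end(), []
    while True:
        m = ATTR.search(text, pos)
        if m is None:
            break
        size = int(m.group(2))
        # The declared size says how many characters belong to the value.
        attrs.append((m.group(1), text[m.end():m.end() + size]))
        pos = m.end() + size
    return head.group(1), head.group(2), attrs


def getmeta(soif_text):
    """Return the META tags that an SSI script could print into the page."""
    _, _, attrs = parse_soif(soif_text)
    lines = []
    for name, value in attrs:
        dc_name = SOIF_TO_DC.get(name.lower())
        if dc_name:
            lines.append(f'<META NAME="{dc_name}" CONTENT="{escape(value, quote=True)}">')
    return "\n".join(lines)


sample = (
    "@FILE { http://www.example.org/intro.html\n"
    "keywords{13}:\txxx, yyy, zzz\n"
    "author{13}:\tStark, Isobel\n"
    "}\n"
)
print(getmeta(sample))
```

Keeping the mapping in one table is what makes later migration simple: moving to another metadata format means changing the output step, not the stored records.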
25UKOLN metadata system (3)
MS-Access front end...
- Filename browser
- Text boxes
- Name choosers
- UKOLN specific metadata
26UKOLN metadata system (4)
[Figure: request flow in six numbered steps between a Web robot, the UKOLN Web server, intro.html, the SSI script and intro.html.soif. intro.html calls the SSI script through <!--#exec cmd="getmeta" -->, and intro.html.soif holds the SOIF record (keywords, description, author).]
27Issues
- Performance
- Interaction with Web caches
- Dublin Core vs Alta Vista style metadata
- <META NAME="Description" CONTENT="blah, blah">
- <META NAME="Keywords" CONTENT="xxx, yyy, zzz">
- Granularity
- Which pages should have metadata?
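One possible response to the Dublin Core vs Alta Vista question, not stated on the slide, is to emit both styles from the same record. The Python sketch below assumes DC.description corresponds to Description and DC.subject to Keywords. The table `DC_TO_PLAIN` and the function `both_styles` are hypothetical names.

```python
from html import escape

# Assumed pairing of Dublin Core names with search-engine style names.
DC_TO_PLAIN = {"DC.description": "Description", "DC.subject": "Keywords"}


def both_styles(dc_pairs):
    """Return META tags in Dublin Core form, then the plain equivalents."""
    tags = []
    for name, value in dc_pairs:
        tags.append(f'<META NAME="{name}" CONTENT="{escape(value, quote=True)}">')
    for name, value in dc_pairs:
        plain = DC_TO_PLAIN.get(name)
        if plain:
            tags.append(f'<META NAME="{plain}" CONTENT="{escape(value, quote=True)}">')
    return tags


for tag in both_styles([("DC.subject", "xxx, yyy, zzz"), ("DC.creator", "Stark, Isobel")]):
    print(tag)
```

The cost is larger pages and duplicated values, which matters if the metadata is embedded on the fly for every request.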
28What's the point...
- of embedding DC <meta> tags?
- Alta Vista isn't going to look for them
- But, worth doing...
- within individual projects
- within specific communities (e.g. eLib)
- Improve local search facilities
- e.g. load SOIF records into a Netscape Catalogue
Server
- Web-site management benefits
29UKOLN Metadata projects
- ROADS
- Software for Subject Service
- DESIRE
- European Web indexing
- NewsAgent
- Current awareness service for Library and
Information Staff
- BIBLINK
- Information flow from publishers to National
Bibliographic Agencies
30ROADS
- Resource Organisation and Discovery in
Subject-based Services
- Web based tools for Subject Services
- SOSIG, ADAM, OMNI,
- Manage and search Internet resource descriptions
- ROADS templates (based on IAFA templates)
- WHOIS
http://www.ukoln.ac.uk/roads/
31ROADS - WHOIS (1)
- Simple client-server search and retrieve protocol
- Developed originally for white pages
applications
- Offer search facilities across several Subject
Services
- Distribute a Subject Service across several
physical servers
- Query routing - centroids and CIP
32ROADS - WHOIS (2)
- Centroid generated by ADAM contains: "you'll find
the string 'mona' in the title attribute of at
least one record in the ADAM database."
[Figure: query flow in six numbered steps between a Web browser, a CGI-based WHOIS client and the SOSIG, OMNI and ADAM servers, with CIP sharing of centroids.]
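The centroid idea on this slide can be sketched as a per-attribute set of the words that occur in a server's records. A client, or another server holding the centroid, can then tell which servers are worth querying. The Python below is a simplified illustration, not the ROADS implementation: word splitting is an assumption, the sample records are invented, and the CIP exchange itself is not shown.

```python
import re
from collections import defaultdict


def build_centroid(records):
    """Collect the unique words seen in each attribute across all records."""
    centroid = defaultdict(set)
    for record in records:
        for attribute, value in record.items():
            centroid[attribute].update(re.findall(r"\w+", value.lower()))
    return dict(centroid)


def route_query(centroids, attribute, word):
    """Return the names of servers whose centroid contains the word."""
    word = word.lower()
    return sorted(
        server
        for server, centroid in centroids.items()
        if word in centroid.get(attribute, set())
    )


# Hypothetical sample records; the real databases are not reproduced here.
centroids = {
    "ADAM": build_centroid([{"title": "Mona Lisa", "description": "A painting"}]),
    "OMNI": build_centroid([{"title": "Medical resources"}]),
}
print(route_query(centroids, "title", "mona"))
```

A centroid only says that a word occurs in at least one record, so a multi-word query may still return nothing from a server that was selected.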
33DESIRE
- European Web cataloguing
- Subject Services
- EuroSOSIG (Bristol), EELS (Lund), Arts
(Koninklijke Bibliotheek)
- Manually created ROADS templates
- European Web Index
- based on Nordic Web Index (NWI)
- Robot generated, all resources
- Multiple servers linked with Z39.50
- GILS
http://www.nic.surfnet.nl/surfnet/projects/desire/desire.html
34DESIRE - current work (1)
- Internationalisation of ROADS
- Use of robots to
- aid manual cataloguing of resources
- build indexes based on list of URLs in a ROADS
database
- Robot will use embedded Dublin Core if available
35DESIRE - current work (2)
- Re-design of EWI robot - including
- support for Dublin Core
- EWI records GILS-II compatible
- Allow users to search across subject services and
the EWI using Z39.50
- by converting ROADS records into GILS records
- by building a WHOIS to Z39.50 gateway
http://roads.ukoln.ac.uk/cgi-bin/egwcgi/egwirtcl/targets.egw
36NewsAgent
- Current awareness service for LIS...
- Distributed database
- servers at LITC, FD, UKOLN - Z39.50
- metadata (and some full-text)
- based on DALI
- Mixture of content streams
- Variety of access methods
- Web, e-mail and Z39.50 clients
- user-configurable profiles
http://www.ukoln.ac.uk/metadata/NewsAgent/
37NewsAgent - Content
- Journals
- Program, VINE, Journal of Librarianship and
Information Science
- News and briefing material
- LA, IIS, UKOLN (Ariadne), BL, LITC
- Web pages
- E-mail lists and USENET news
38NewsAgent - Harvesting
- Web crawler
- looking for embedded Dublin Core
- Limiting the harvest
- simple heuristics
- use of Dublin Core Relation element
- E-mail parser
http://www.ukoln.ac.uk/metadata/NewsAgent/dcusage.html
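The part of a harvester that looks for embedded Dublin Core can be sketched with Python's standard HTML parser. This is an illustration only, not the NewsAgent crawler: fetching pages, the harvest-limiting heuristics and the use of the Relation element are left out, and the class name `DublinCoreExtractor` is hypothetical.

```python
from html.parser import HTMLParser


class DublinCoreExtractor(HTMLParser):
    """Collect (name, content) pairs from META tags whose NAME starts with "DC."."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        content = attributes.get("content")
        if name.lower().startswith("dc.") and content is not None:
            self.elements.append((name, content))


def extract_dublin_core(html_text):
    """Return the Dublin Core pairs embedded in a page, in document order."""
    parser = DublinCoreExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.elements


page = """<HTML><HEAD>
<TITLE>UKOLN Home Page</TITLE>
<META NAME="DC.creator" CONTENT="Stark, Isobel">
<META NAME="Keywords" CONTENT="xxx, yyy, zzz">
</HEAD></HTML>"""
print(extract_dublin_core(page))
```

Pages without any DC tags return an empty list, which gives the crawler a simple test for whether to fall back to other ways of describing the page.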
39BIBLINK
- Information flow between publishers
- traditional
- new - CD-ROM or Web (new to publishing)
- and National Bibliographic Agencies
- British Library, UK
- Biblioteca Nacional, Madrid, Spain
- Bibliothèque Nationale de France, Paris
- Koninklijke Bibliotheek, Den Haag, Netherlands
- Nasjonalbiblioteket, Rana, Norway
- Universitat Oberta de Catalunya, Barcelona, Spain
http://www.ukoln.ac.uk/metadata/BIBLINK/
40BIBLINK - research
- Scope
- Electronic publications suitable for inclusion in
National Bibliographies
- Metadata
- Dublin Core (with extensions!), SGML DTD
- Identifiers
- ISBN, ISSN, SICI, DOI, URN
- Transmission
- Simple e-mail or Web crawler
- Authentication
- MD5 hash assigned to each resource
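The MD5 hash used for authentication can be computed with Python's standard library. This is a minimal sketch, assuming the hash is taken over the bytes of the resource and exchanged as a hexadecimal string. How BIBLINK records or transmits the value is not described on the slide, and the function names are hypothetical.

```python
import hashlib


def md5_checksum(data: bytes) -> str:
    """Return the MD5 digest of a resource as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()


def matches_checksum(data: bytes, expected: str) -> bool:
    """Check that a received resource still matches its recorded hash."""
    return md5_checksum(data) == expected.strip().lower()


resource = b"abc"
checksum = md5_checksum(resource)
print(checksum)
print(matches_checksum(resource, checksum))
```

A matching hash shows that the resource has not changed since the hash was assigned. It does not by itself show who assigned it.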
41BIBLINK - data set
- Minimum data set
- Author, Title, Publisher, Place of Publication,
Price, Extent (size), Keywords, Description,
Edition/Version, Date of Publication, System
Requirements, Format, Language, Terms and
Conditions, Frequency, Identifier, Contributor,
Checksum
- Similar to DC but some don't fit
- <META NAME="BIBLINK.placePublication" CONTENT="Bath, UK">
- <META NAME="BIBLINK.frequency" CONTENT="monthly">
- Issues over conversion to MARC
42BIBLINK - demonstrator
[Figure: record flow between Publishers and NBAs/National Libraries. Labels in the diagram: Dublin Core, E-mail, Dublin Core, UNIMARC, ??MARC]
- Cataloguing in Publication (CIP) level records
- Enhanced records optionally returned to publishers
- Conversion on to local MARC format using USEMARCON
43Conclusions
- Think about metadata as a process
- Dublin Core syntax now stable enough to use
- Use within projects initially
- Choose metadata management model appropriate to
your site
- Consider long term maintenance and transition to
other formats