Title: Quality Metrics for Metadata
1Quality Metrics for Metadata
- Xavier Ochoa
- Escuela Superior Politécnica del Litoral (ESPOL)
2Agenda
- Concept Definition
- Quality of Metadata
- Big Question
- Solution Proposal Metrics
- Example of Quality Metrics for Metadata
- Metrics for Metadata Framework
- Future Work
3Metadata
Book
Metadata about the book
4Repositories
Metadata Repository
Physical Storage
5Repositories
6Repositories
7Standards
Library A
Title: Name, The
Writers: LastName, Firstname
Publisher: Name, Country
Date: Month/Year
Genre: (from the genre list)
Sub-genre: (from the sub-genre list)
Location: shelf code
Library B
Title: Name, The
Author: Firstname LastName
Publisher: Name, City, Country
Year: xxxx
Subject: (from the category list)
Location: Library, Dewey code, shelf
8Standards
9Standards
- The good thing about standards is that there are
so many to choose from
10Standards
- MARC21: http://www.loc.gov/marc/
- EXIF: http://www.exif.org/
- ID3: http://www.id3.org/
- MPEG-7: http://www.chiariglione.org/mpeg/standards/mpeg-7/mpeg-7.htm
- Dublin Core: http://dublincore.org/
- LOM: http://ltsc.ieee.org/wg12/
11Quality of Metadata
12Quality of Metadata
13Quality of Metadata
14Quality of Metadata
- Quality is the measure of fitness for a task.
- "high quality metadata supports the functional requirements of the system it is designed to support" (Guy et al., 2004)
15Quality of Metadata
- There could be multiple tasks that the metadata should fit:
  - Searching / Retrieving
  - Evaluation
  - Assembly
  - Re-use
16Quality of Metadata
- The functional requirements of the systems could conflict:
  - The system will be able to search in federated repositories
  - Each repository will use a different language and vocabulary
17Quality of Metadata
- The quality of the metadata could mean the success of an application
  - Google, CDDB, Amazon
- Bad metadata is worse than no metadata at all
- Metadata is trusted
18Big Question
- How do we assess the quality of a metadata instance?
19Literature
- Some studies (Barton et al. 2003, Dushay and Hillman 2003, Guy et al. 2004) approach the problem from a librarian's point of view
  - Manually check a statistical sample of records to evaluate their quality.
  - Dushay and Hillman propose the use of tools to help metadata experts in the task, but it is still a manual activity.
20Literature
- Others (Najjar et al. 2003, Zeng et al. 2004, Moen 2005) try to collect statistical information from the metadata instances in the repositories to evaluate the usage of the metadata standard.
  - Najjar et al. (2004) try to compare the statistics of the fields present in the repository with the fields that are used in real searches.
- Liddy et al. (2003) measure the fitness of metadata through a series of usability studies.
21Metrics
- A good system needs both characteristics
  - Being mostly automated
  - Predicting with a certain amount of precision the fitness of the metadata instance for its task
- Other fields have attacked similar problems through the use of metrics
  - Software Engineering
  - Bibliographical Studies (Informetrics)
  - Search engines (e.g. PageRank)
22Quality Metrics
Metadata Set 1
Title: The time machine
Description: Science fiction book about a scientist that travels to the past and the future. Takes place in the late 19th century.
Author: (empty)
Number of pages: (empty)
Publisher: (empty)
Type of cover: (empty)
Metadata Set 2
Title: The time machine
Description: (empty)
Author: Herbert George Wells
Number of pages: 145
Publisher: LN Asoc.
Type of cover: Hard
23Quality Metrics (Searching)
- Weighted measurement of the completeness of a metadata record
  - Not all the fields are equally important for the end user.
  - The importance is measured by the usage of the field for searching
  - The importance (0 < x < 1) is multiplied by the presence or absence (1 or -1) of a value in the field
24Quality Metrics (Searching)
Percentage of Usage of the Fields in Search
Title: 90%
Description: 80%
Author: 20%
Number of pages: 5%
Publisher: 10%
Type of cover: 10%
Metric Calculation
Q = Exist(Title) × 0.9 + Exist(Description) × 0.8 + Exist(Author) × 0.2 + Exist(Npages) × 0.05 + Exist(Publisher) × 0.1 + Exist(Tcover) × 0.1
25Quality Metrics
Metadata Set 1
Title: The time machine
Description: Science fiction book about a scientist that travels to the past and the future. Takes place in the late 19th century.
Author: (empty)
Number of pages: (empty)
Publisher: (empty)
Type of cover: (empty)
Q = 0.9 + 0.8 - 0.2 - 0.05 - 0.1 - 0.1 = 1.25
Metadata Set 2
Title: The time machine
Description: (empty)
Author: Herbert George Wells
Number of pages: 145
Publisher: LN Asoc.
Type of cover: Hard
Q = 0.9 - 0.8 + 0.2 + 0.05 + 0.1 + 0.1 = 0.55
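The weighted completeness calculation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field keys, the `exist` and `completeness` function names, and the dictionary layout are hypothetical, while the weights are the search-usage percentages from the slide expressed as fractions.

```python
# Weights: share of searches that use each field (taken from the slide).
# The dictionary keys are hypothetical names chosen for this sketch.
FIELD_WEIGHTS = {
    "title": 0.9,
    "description": 0.8,
    "author": 0.2,
    "pages": 0.05,
    "publisher": 0.1,
    "cover": 0.1,
}


def exist(record, field):
    """Return 1 if the field holds a non-empty value, otherwise -1."""
    value = record.get(field)
    return 1 if value not in (None, "") else -1


def completeness(record, weights=FIELD_WEIGHTS):
    """Weighted completeness: sum of weight * Exist(field) over all fields."""
    return sum(w * exist(record, f) for f, w in weights.items())


set_1 = {
    "title": "The time machine",
    "description": "Science fiction book about a scientist that travels "
                   "to the past and the future.",
}
set_2 = {
    "title": "The time machine",
    "author": "Herbert George Wells",
    "pages": 145,
    "publisher": "LN Asoc.",
    "cover": "Hard",
}

print(round(completeness(set_1), 2))  # 1.25
print(round(completeness(set_2), 2))  # 0.55
```

Under this metric the record with only a title and a description scores higher than the record with every field except the description, because the description is used in far more searches than the remaining fields combined.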
26Quality Metrics
Metadata Set 1
Title: Parallel Systems
Description: Course about the distribution or parallelization of algorithms through multiple processors.
Metadata Set 2
Title: Parallel Systems
Description: Course on parallel computing. This course covers the material related to the use and workings of parallel computers.
27Quality Metrics (Searching)
- Amount of information present in the main title and description (or other free-text fields)
  - Tries to measure the relevance and uniqueness of the terms present in the text.
  - It uses the TF-IDF metric to calculate how good a word is at discriminating the element.
  - The 3 highest results are added
28Quality Metrics (Searching)
- Let's suppose
- For Metadata Set 1
  - TF-IDF(distribution) = 2.50
  - TF-IDF(parallel) = 1.25
  - TF-IDF(processors) = 1.02
- For Metadata Set 2
  - TF-IDF(parallel) = 1.75
  - TF-IDF(computing) = 0.50
  - TF-IDF(course) = 0.20
29Quality Metrics
Metadata Set 1
Title: Parallel Systems
Description: Course about the distribution or parallelization of algorithms through multiple processors.
Q = 1.25 + 2.50 + 1.02 = 4.77
Metadata Set 2
Title: Parallel Systems
Description: Course on parallel computing. This course covers the material related to the use and workings of parallel computers.
Q = 1.75 + 0.50 + 0.20 = 2.45
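A minimal Python sketch of this text metric follows. The slides do not say which TF-IDF variant is intended, so the formula used here (term frequency divided by document length, times the natural log of corpus size over document frequency) is an assumption, as are the function names `tf_idf`, `top_k_sum` and `text_quality`. The scores passed to `top_k_sum` at the end are the supposed values from the slides, not values computed from a real repository.

```python
import math
from collections import Counter


def tf_idf(term, doc_tokens, corpus):
    """TF-IDF of `term` in one document, given a corpus of token lists."""
    if not doc_tokens:
        return 0.0
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    df = sum(1 for tokens in corpus if term in tokens)
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)


def top_k_sum(scores, k=3):
    """Add the k highest scores from a {term: score} mapping."""
    return sum(sorted(scores.values(), reverse=True)[:k])


def text_quality(doc_tokens, corpus, k=3):
    """Score a free-text field as the sum of its k best TF-IDF terms."""
    scores = {t: tf_idf(t, doc_tokens, corpus) for t in set(doc_tokens)}
    return top_k_sum(scores, k)


# Supposed scores from the slides.
set_1_scores = {"distribution": 2.50, "parallel": 1.25, "processors": 1.02}
set_2_scores = {"parallel": 1.75, "computing": 0.50, "course": 0.20}

print(round(top_k_sum(set_1_scores), 2))  # 4.77
print(round(top_k_sum(set_2_scores), 2))  # 2.45
```

With this variant, a term that appears in every record of the repository gets a score of zero, which matches the intent of rewarding terms that discriminate one element from the rest.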
30Quality Metrics
Metadata Set 1
Title: Parallel Systems
Description: Course about the distribution or parallelization of algorithms through multiple processors.
Pedagogical Context: Take before HPCS; Take after Computer Networks
Metadata Set 2
Title: Parallel Systems
Description: Course on parallel computing. This course covers the material related to the use and workings of parallel computers.
Pedagogical Context: University
31Quality Metrics (Searching)
- Linkage
  - This measurement will try to assess how many paths exist to reach a metadata instance from a similar one.
  - What will be measured is the amount of formal linkage between objects.
  - Each link pointing to another object adds 1 to the quality.
32Quality Metrics
Metadata Set 1
Title: Parallel Systems
Description: Course about the distribution or parallelization of algorithms through multiple processors.
Pedagogical Context: Take before HPCS; Take after Computer Networks
Q = 1 + 1 = 2
Metadata Set 2
Title: Parallel Systems
Description: Course on parallel computing. This course covers the material related to the use and workings of parallel computers.
Pedagogical Context: University
Q = 0
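The linkage count can be sketched as follows. The record layout and the field names `take_before` and `take_after` are hypothetical, chosen only to mirror the "Take before" and "Take after" relations in the example; a real metadata standard would define its own relation fields.

```python
def linkage(record, link_fields=("take_before", "take_after")):
    """Count formal links: every target listed in a link field adds 1."""
    return sum(len(record.get(field, [])) for field in link_fields)


set_1 = {
    "title": "Parallel Systems",
    "take_before": ["HPCS"],
    "take_after": ["Computer Networks"],
}
set_2 = {
    "title": "Parallel Systems",
    "pedagogical_context": "University",
}

print(linkage(set_1))  # 2
print(linkage(set_2))  # 0
```

A plain context label such as "University" contributes nothing here, since it does not point to another object.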
33Quality Metrics
- Other Metrics
  - Distribution of Range fields (Evaluation)
  - Amount of information about previous usage (Evaluation)
  - Coherence (Searching - Evaluation)
- Some could be good, some could be bad.
- A new (small?) field to research
- Metametrics
34M4M Framework
- Metrics for Metadata Framework (M4M)
  - The framework will enable the fast and easy creation of metrics and studies over digital repositories.
  - The prototype and first implementation will be based on the Java programming language.
35Design Goals
- XML-based
  - XML technologies such as XQuery and XPath will be used to standardize the language used to access the index and create new metrics.
- Metadata standard agnostic
  - Any metadata standard could be used as long as an XML binding is provided.
- Useful out-of-the-box
  - There will be a set of default standards and metrics that could be easily used without any extension to the code.
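To illustrate how a path expression over an XML binding could feed a metric, here is a rough Python sketch. It is not the M4M design, which the slides describe as a Java prototype; the `<record>` layout, the element names and the `field_present` helper are all hypothetical. It relies on the limited XPath subset supported by the standard library's `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML binding of one metadata record.
RECORD_XML = """
<record>
  <title>Parallel Systems</title>
  <description>Course on parallel computing.</description>
  <author></author>
</record>
"""


def field_present(root, path):
    """True if `path` matches at least one element with non-blank text."""
    return any((el.text or "").strip() for el in root.findall(path))


root = ET.fromstring(RECORD_XML)
print(field_present(root, "title"))      # True
print(field_present(root, "author"))     # False (element exists but is empty)
print(field_present(root, "publisher"))  # False (element is missing)
```

Because the metric only sees path expressions, swapping in a different metadata standard would mean changing the paths rather than the metric code, which is the point of the standard-agnostic goal above.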
40Future Work
- Implement the M4M Framework
- Apply the M4M Framework to ARIADNE
- Assess the effectiveness of the metrics by comparing their results with the human review of the metadata.
- Apply the metrics to determine the quality of automatically generated metadata
41Thanks. Questions, Comments, Suggestions?
- Xavier Ochoa
- xavier_at_cti.espol.edu.ec