Quality Metrics for Metadata - PowerPoint PPT Presentation


Title: Quality Metrics for Metadata


1
Quality Metrics for Metadata
  • Xavier Ochoa
  • Escuela Superior Politécnica del Litoral (ESPOL)

2
Agenda
  • Concept Definition
  • Quality of Metadata
  • Big Question
  • Solution Proposal: Metrics
  • Example of Quality Metrics for Metadata
  • Metrics for Metadata Framework
  • Future Work

3
Metadata
Book
Metadata about the book
4
Repositories
Metadata Repository
Physical Storage
5
Repositories
6
Repositories
7
Standards
Library A
  • Title: Name, The
  • Writers: LastName, Firstname
  • Publisher: Name, Country
  • Date: Month/Year
  • Genre: (from the genre list)
  • Sub-genre: (from the sub-genre list)
  • Location: shelf code
Library B
  • Title: Name, The
  • Author: Firstname LastName
  • Publisher: Name, City, Country
  • Year: xxxx
  • Subject: (from the category list)
  • Location: Library, Dewey code, shelf
8
Standards
9
Standards
  • The good thing about standards is that there are
    so many to choose from

10
Standards
  • MARC21: http://www.loc.gov/marc/
  • EXIF: http://www.exif.org/
  • ID3: http://www.id3.org/
  • MPEG-7: http://www.chiariglione.org/mpeg/standards/mpeg-7/mpeg-7.htm
  • Dublin Core: http://dublincore.org/
  • LOM: http://ltsc.ieee.org/wg12/

11
Quality of Metadata
12
Quality of Metadata
13
Quality of Metadata
14
Quality of Metadata
  • Quality is the measure of fitness for a task.
  • "High quality metadata supports the functional
    requirements of the system it is designed to
    support" (Guy et al., 2004)

15
Quality of Metadata
  • There could be multiple tasks that the metadata
    should fit
  • Searching / Retrieving
  • Evaluation
  • Assembly
  • Re-use

16
Quality of Metadata
  • The functional requirements of the systems could
    conflict
  • The system will be able to search in federated
    repositories
  • Each repository will use a different language and
    vocabulary

17
Quality of Metadata
  • The quality of the metadata could mean the
    success of an application
  • Google, CDDB, Amazon
  • Bad metadata is worse than no metadata at all
  • Metadata is trusted

18
Big Question
  • How do we assess the quality of a metadata
    instance?

19
Literature
  • Some studies (Barton et al. 2003, Dushay and
    Hillmann 2003, Guy et al. 2004) approach the
    problem from a librarian's point of view
  • They manually check a statistical sample of
    records to evaluate their quality.
  • Dushay and Hillmann propose the use of tools to
    help metadata experts in the task, but it is
    still a manual activity.

20
Literature
  • Others (Najjar et al. 2003, Zeng et al. 2004, Moen
    2005) collect statistical information from
    the metadata instances in the repositories to
    evaluate the usage of the metadata standard.
  • Najjar et al. (2004) compare the statistics of
    the fields present in the repository with the
    fields that are used in real searches.
  • Liddy et al. (2003) measure the fitness of
    metadata through a series of usability studies.

21
Metrics
  • A good system needs two characteristics
  • Being mostly automated
  • Predicting, with a certain amount of precision,
    the fitness of the metadata instance for its task
  • Other fields have tackled similar problems
    through the use of metrics
  • Software Engineering
  • Bibliographical Studies (Informetrics)
  • Search engines (e.g. PageRank)

22
Quality Metrics
Metadata Set 1
  • Title: The time machine
  • Description: Science fiction book about a
    scientist that travels to the past and the
    future. Takes place in the late 19th century.
  • Author: (empty)
  • Number of pages: (empty)
  • Publisher: (empty)
  • Type of cover: (empty)
Metadata Set 2
  • Title: The time machine
  • Description: (empty)
  • Author: Herbert George Wells
  • Number of pages: 145
  • Publisher: LN Asoc.
  • Type of cover: Hard
23
Quality Metrics (Searching)
  • Weighted measurement of the completeness of a
    metadata record
  • Not all the fields are equally important for the
    end user.
  • The importance is measured by the usage of the
    field for searching
  • The importance (0 < x < 1) is multiplied by the
    presence or absence (1 or -1) of a value in the
    field
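
One way to derive the importance of each field from search usage can be sketched as follows. This is a minimal illustration, not the author's implementation: the log format (one set of field names per query), the function name field_importance and the sample data are all assumptions made for the example.

```python
from collections import Counter


def field_importance(search_log, fields):
    """Return, for each field, the fraction of searches that used it."""
    usage = Counter()
    for fields_used in search_log:
        # Count a field at most once per search.
        usage.update(set(fields_used))
    total = len(search_log)
    return {f: (usage[f] / total if total else 0.0) for f in fields}


# Hypothetical search log: each entry lists the fields one query searched on.
log = [
    {"title"},
    {"title", "description"},
    {"title", "description", "author"},
    {"description"},
    {"title"},
]

weights = field_importance(log, ["title", "description", "author", "publisher"])
print(weights)
```

Because a single search can use several fields, the resulting fractions do not have to add up to 1.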

24
Quality Metrics (Searching)
Percentage of Usage of the Fields in Search
  • Title: 90%
  • Description: 80%
  • Author: 20%
  • Number of pages: 5%
  • Publisher: 10%
  • Type of cover: 10%
Metric Calculation
Q = Exist(Title) × 0.9 + Exist(Description) × 0.8
    + Exist(Author) × 0.2 + Exist(Npages) × 0.05
    + Exist(Publisher) × 0.1 + Exist(Tcover) × 0.1
25
Quality Metrics
Metadata Set 1
  • Title: The time machine
  • Description: Science fiction book about a
    scientist that travels to the past and the
    future. Takes place in the late 19th century.
  • Author: (empty)
  • Number of pages: (empty)
  • Publisher: (empty)
  • Type of cover: (empty)
  • Q = 0.9 + 0.8 - 0.2 - 0.05 - 0.1 - 0.1 = 1.25
Metadata Set 2
  • Title: The time machine
  • Description: (empty)
  • Author: Herbert George Wells
  • Number of pages: 145
  • Publisher: LN Asoc.
  • Type of cover: Hard
  • Q = 0.9 - 0.8 + 0.2 + 0.05 + 0.1 + 0.1 = 0.55
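
The weighted completeness calculation for the two example records can be sketched in a few lines. The weights and field values come from the slides; the dictionary layout, the key names and the function names exist and completeness_quality are assumptions made for this illustration.

```python
# Importance of each field, taken from the usage percentages on the slide.
WEIGHTS = {
    "title": 0.9,
    "description": 0.8,
    "author": 0.2,
    "pages": 0.05,
    "publisher": 0.1,
    "cover": 0.1,
}


def exist(value):
    """Return 1 if the field has a value, -1 if it is empty or missing."""
    return 1 if value not in (None, "") else -1


def completeness_quality(record, weights):
    """Sum of importance times presence (+1) or absence (-1) per field."""
    return sum(w * exist(record.get(field)) for field, w in weights.items())


set1 = {
    "title": "The time machine",
    "description": "Science fiction book about a scientist that travels "
                   "to the past and the future.",
}
set2 = {
    "title": "The time machine",
    "author": "Herbert George Wells",
    "pages": 145,
    "publisher": "LN Asoc.",
    "cover": "Hard",
}

print(round(completeness_quality(set1, WEIGHTS), 2))  # 1.25
print(round(completeness_quality(set2, WEIGHTS), 2))  # 0.55
```

Set 1 scores higher even though it fills fewer fields, because the two fields it does fill are the ones most used for searching.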
26
Quality Metrics
Metadata Set 1
Metadata Set 2
  • Title: Parallel Systems
  • Description: Course about the distribution or
    parallelization of algorithms through multiple
    processors.
Metadata Set 2
  • Title: Parallel Systems
  • Description: Course on parallel computing. This
    course covers the material related to the use and
    workings of parallel computers.
27
Quality Metrics (Searching)
  • Amount of information present on main title and
    description (or other free text fields)
  • Tries to measure the relevance and uniqueness of
    the terms present in the text.
  • It uses the TF-IDF metric to calculate how well
    a word discriminates the element.
  • The 3 highest results are added
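
A rough sketch of this metric is shown below. It assumes one common TF-IDF variant (raw term count times log(N / document frequency)) and a naive tokenizer without stop-word removal; the slides do not say which variant or preprocessing is intended, and the function names and sample corpus are made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_scores(text, corpus):
    """TF-IDF of each term of `text` against `corpus` (a list of texts)."""
    counts = Counter(tokenize(text))
    docs = [set(tokenize(t)) for t in corpus]
    n = len(docs)
    scores = {}
    for term, tf in counts.items():
        df = sum(1 for d in docs if term in d)
        # Terms that occur in every document get a score of 0.
        scores[term] = tf * math.log(n / df) if df else 0.0
    return scores


def text_information(text, corpus, top=3):
    """Add the `top` highest TF-IDF values of the terms in `text`."""
    scores = tfidf_scores(text, corpus)
    return sum(sorted(scores.values(), reverse=True)[:top])


# Hypothetical repository of free-text descriptions.
corpus = [
    "course about parallel algorithms",
    "course on parallel computing",
    "course on databases",
]
for description in corpus:
    print(description, "->", round(text_information(description, corpus), 3))
```

In this toy corpus "course" appears in every description and contributes nothing, while terms that appear in a single description contribute the most.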

28
Quality Metrics (Searching)
  • Let's suppose
  • For Metadata Set 1
  • TF-IDF(distribution) = 2.50
  • TF-IDF(parallel) = 1.25
  • TF-IDF(processors) = 1.02
  • For Metadata Set 2
  • TF-IDF(parallel) = 1.75
  • TF-IDF(computing) = 0.50
  • TF-IDF(course) = 0.20

29
Quality Metrics
Metadata Set 1
Metadata Set 2
  • Title: Parallel Systems
  • Description: Course about the distribution or
    parallelization of algorithms through multiple
    processors.
  • Q = 1.25 + 2.50 + 1.02 = 4.77
Metadata Set 2
  • Title: Parallel Systems
  • Description: Course on parallel computing. This
    course covers the material related to the use and
    workings of parallel computers.
  • Q = 1.75 + 0.50 + 0.20 = 2.45
30
Quality Metrics
Metadata Set 1
Metadata Set 2
  • Title: Parallel Systems
  • Description: Course about the distribution or
    parallelization of algorithms through multiple
    processors.
  • Pedagogical Context: Take before HPCS; Take after
    Computer Networks
Metadata Set 2
  • Title: Parallel Systems
  • Description: Course on parallel computing. This
    course covers the material related to the use and
    workings of parallel computers.
  • Pedagogical Context: University
31
Quality Metrics (Searching)
  • Linkage
  • This measurement tries to assess how many paths
    exist to reach a metadata instance from a
    similar one.
  • What is measured is the amount of formal
    linkage between objects.
  • Each link pointing to another object adds 1 to
    the quality.
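
The linkage count can be sketched as below. The record layout, in particular the "links" list of (relation, target) pairs, is an assumption made for this example; a real metadata standard would express relations in its own fields.

```python
def linkage_quality(record):
    """Add 1 to the quality for each formal link to another object."""
    return len(record.get("links", []))


# Records modeled on the Parallel Systems example in the slides.
set1 = {
    "title": "Parallel Systems",
    "links": [
        ("take before", "HPCS"),
        ("take after", "Computer Networks"),
    ],
}
set2 = {
    "title": "Parallel Systems",
    "pedagogical_context": "University",  # free text, not a link
    "links": [],
}

print(linkage_quality(set1))  # 2
print(linkage_quality(set2))  # 0
```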

32
Quality Metrics
Metadata Set 1
Metadata Set 2
  • Title: Parallel Systems
  • Description: Course about the distribution or
    parallelization of algorithms through multiple
    processors.
  • Pedagogical Context: Take before HPCS; Take after
    Computer Networks
  • Q = 1 + 1 = 2
Metadata Set 2
  • Title: Parallel Systems
  • Description: Course on parallel computing. This
    course covers the material related to the use and
    workings of parallel computers.
  • Pedagogical Context: University
  • Q = 0
33
Quality Metrics
  • Other Metrics
  • Distribution of Range fields (Evaluation)
  • Amount of information about previous usage
    (Evaluation)
  • Coherence (Searching - Evaluation)
  • Some could be good, some could be bad.
  • A new (small?) field to research
  • Metametrics

34
M4M Framework
  • Metrics for Metadata Framework (M4M)
  • The framework will enable the fast and easy
    creation of metrics and studies over digital
    repositories.
  • The prototype and first implementation will be
    based on the Java programming language.

35
Design Goals
  • XML-based
  • XML technologies such as XQuery and XPath will be
    used to standardize the language used to access
    the index and create new metrics.
  • Metadata standard agnostic
  • Any metadata standard could be used as long as an
    XML binding is provided.
  • Useful out-of-the-box
  • There will be a set of default standards and
    metrics that could be easily used without any
    extension to the code.
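
The idea of addressing metadata fields through XPath expressions can be illustrated with a small sketch. The planned prototype is Java-based; this example uses Python's standard xml.etree.ElementTree module, which supports only a limited XPath subset, purely to show the principle. The XML layout and element names are hypothetical and do not follow any particular metadata standard binding.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML binding of a metadata record.
RECORD_XML = """
<record>
  <title>The time machine</title>
  <description></description>
  <author>Herbert George Wells</author>
</record>
"""


def field_present(root, xpath):
    """True if any element matched by the expression has non-blank text."""
    return any((el.text or "").strip() for el in root.findall(xpath))


root = ET.fromstring(RECORD_XML)
paths = ("./title", "./description", "./publisher")
present = {path: field_present(root, path) for path in paths}
print(present)
```

Because a metric only sees the expressions, switching to another metadata standard would mean changing the paths rather than the metric code.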

40
Future Work
  • Implement the M4M Framework
  • Apply the M4M Framework to ARIADNE
  • Assess the effectiveness of the metrics by
    comparing their results with human reviews of the
    metadata.
  • Apply the metrics to determine the quality of
    automatically generated metadata
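
One possible way to compare metric results with human reviews is a correlation coefficient. The slides do not name a method, so the use of Pearson correlation here is an assumption, and the scores below are invented sample data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2)")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / (spread_x * spread_y)


# Invented sample: one human rating and one metric value per record.
human_ratings = [1, 2, 3, 4, 5]
metric_values = [0.2, 0.5, 0.4, 0.9, 1.0]
print(round(pearson(human_ratings, metric_values), 3))
```

A value close to 1 would suggest that the metric ranks records much as the reviewers do; a value near 0 would suggest it captures something else.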

41
Thanks. Questions, Comments, Suggestions?
  • Xavier Ochoa
  • xavier_at_cti.espol.edu.ec