Title: Metadata Quality for Federated Collections
Besiki Stvilia, Les Gasser, Mike Twidale, Sarah Shreeves, Tim Cole
- GSLIS, UIUC
- November 2004
1. Abstract
Centralized metadata repositories attempt to provide integrated access across multiple digital collections from libraries, archives, and museums. Metadata quality in these repositories heavily influences the collections' usability: high quality can raise satisfaction and use, while low quality can render collections unusable. Variances in individual metadata type, origin, and quality are compounded into complex quality challenges when collections are aggregated. Current metadata quality assurance is generally piecemeal, reactive, ad hoc, and a-theoretical; formal compatibility and interoperability standards often prove unenforceable given metadata providers' dynamic and conflicting organizational priorities. We are empirically examining large bodies of harvested metadata to develop systematic techniques for metadata quality assessment and assurance. We study metadata quality, value, and cost models; algorithms for connecting metadata component variations to (aggregate) metadata record quality; and prototype metadata quality assurance tools that help providers, aggregators, and users reason about metadata quality, enabling more intelligent selection, aggregation, and maintenance of metadata.
2. Approach
The model has been developed using a number of techniques, including literature analysis, case studies, statistical analysis, strategic experimentation, and multi-agent modeling. The model, along with the concepts and metrics used, can serve as a foundation for developing effective, organization-specific quality assurance methodologies. Our
model of metadata quality ties together findings
from existing and new research in information
quality, along with well-developed work in
information seeking/use behavior, and the
techniques of strategic experimentation from
manufacturing. It presents a holistic approach to
determining the quality of a metadata object,
identifying quality requirements based on
typified contexts of metadata use (such as
specific information seeking/use activities) and
expressing interactions between metadata quality
and metadata value.
3. Measuring Metadata Quality
3.1 Metadata Quality Problem
- Actual quality not matching the required/needed level of quality
- May arise at different levels:
- Element Level
- Schema Level
- Quality Dimensions
3.2 Information Quality Dimensions
- Relational / Contextual
- Accuracy
- Completeness
- Complexity
- Latency
- Naturalness
- Informativeness
- Relevance (aboutness)
- Precision
- Security
- Verifiability
- Volatility
- Intrinsic
- Accuracy
- Cohesiveness
- Complexity
- Semantic-consistency
- Structural-consistency
- Currency
- Informativeness
- Naturalness
- Precision
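These dimensions can be operationalized as simple record-level metrics. A minimal sketch of two of them, completeness and redundancy, over a Dublin Core record modeled as a dict; the metric definitions are illustrative simplifications, not the project's formal ones, and the sample record is invented:

```python
# The 15 simple Dublin Core elements
DC_SCHEMA = ["title", "creator", "subject", "description", "publisher",
             "contributor", "date", "type", "format", "identifier",
             "source", "language", "relation", "coverage", "rights"]

def completeness(record, schema=DC_SCHEMA):
    """Fraction of schema elements populated with a non-empty value."""
    filled = sum(1 for el in schema if record.get(el, "").strip())
    return filled / len(schema)

def redundancy(record):
    """Fraction of values that duplicate another value in the record."""
    values = [v.strip() for v in record.values() if v.strip()]
    return 1 - len(set(values)) / len(values) if values else 0.0

# Invented example record
rec = {"title": "Letter, 1846", "identifier": "oai:x:1",
       "description": "Letter, 1846", "date": "1846"}
c = completeness(rec)   # 4 of 15 elements filled
r = redundancy(rec)     # title and description duplicate each other
```

Each dimension gets its own metric; a record's aggregate quality can then be reasoned about as a vector of such scores.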
3.3 MQ Dimensions May Trade Off
- Completeness vs. simplicity
- Robustness vs. simplicity
- Volatility vs. simplicity
- Robustness vs. redundancy
- Accessibility vs. certainty
- Taguchi curves help to model and reason about these tradeoffs.
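A Taguchi-style loss curve prices deviation from a target level of a quality dimension quadratically, which lets competing dimensions be summed into one cost. A minimal sketch, assuming illustrative targets and weights (none of the constants come from the project):

```python
def taguchi_loss(value: float, target: float, k: float = 1.0) -> float:
    """Taguchi quadratic loss: cost grows with the squared deviation
    of a quality dimension from its target level."""
    return k * (value - target) ** 2

def tradeoff_cost(completeness, simplicity,
                  target_completeness=0.9, target_simplicity=0.8,
                  k_c=1.0, k_s=0.5):
    """Two dimensions pulling against each other (filling more elements
    raises completeness but lowers simplicity): price both deviations
    and sum them, so the tradeoff becomes a single number to minimize."""
    return (taguchi_loss(completeness, target_completeness, k_c)
            + taguchi_loss(simplicity, target_simplicity, k_s))

# A fuller-but-complex record vs. a sparser-but-simple one
full = tradeoff_cost(completeness=0.95, simplicity=0.5)
sparse = tradeoff_cost(completeness=0.6, simplicity=0.9)
```

With these (invented) weights the fuller record comes out cheaper, illustrating how the curves support reasoning about where on the tradeoff to sit.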
3.4 Genre Captures Context
4. Measuring Value
4.1 What's the Value of Quality?
4.2 Value as Amount of Use
Dublin Core element   % of total records containing element
Identifier            99.6
Title                 80.3
Type                  76.5
Subject               72.9
Format                69.4
Publisher             61.2
Language              55.0
Creator               50.7
Description           47.4
Date                  43.0
Rights                41.0
Relation              31.2
Source                14.9
Contributor            6.6
Coverage               5.9
- The value of metadata can be a function of the probability distribution of the operations/transactions using the metadata.
- Human factors experiments can be used to assess the effectiveness of creating and using the metadata.
- Metadata is often an organizational asset, especially in organizations like libraries, and one can calculate its dollar cost based on the average time a cataloger spends creating a record or an element of a record.
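One way "value as amount of use" could be operationalized, a hypothetical sketch rather than the project's actual model: weight each element by the fraction of records that populate it across the federation, then score a record by the usage weights of the elements it fills. The sample records are invented:

```python
from collections import Counter

def element_frequencies(records):
    """Fraction of records containing each DC element
    (the empirical analogue of the percentage table above)."""
    counts = Counter()
    for rec in records:
        counts.update(set(rec))          # count each element once per record
    return {el: counts[el] / len(records) for el in counts}

def record_value(record, freqs):
    """Usage-weighted value: elements populated widely across the
    federation (and hence presumably used) contribute more."""
    return sum(freqs.get(el, 0.0) for el in set(record))

# Invented mini-federation
records = [
    {"title": "A", "identifier": "1", "subject": "x"},
    {"title": "B", "identifier": "2"},
    {"identifier": "3", "creator": "C"},
]
freqs = element_frequencies(records)     # identifier -> 1.0, title -> 2/3, ...
```

Transaction logs could replace presence counts to weight elements by actual search/display use instead.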
5. IMLS Digital Collections and Content Project
- Promote centralized search, interoperability and reusability of metadata collections
- Harvested metadata from >20 data providers, >150,000 Dublin Core records (and growing)
- Data providers: small public libraries and historical societies, large academic libraries, museums, research centers
- Records provided: from dozens to tens of thousands
- Interoperability and reusability require negotiation of global quality
- http://imlsdcc.grainger.uiuc.edu
5.1 Examples of Quality Problems
- Ptolemaios son of Diodoros
- Dioskoros Ptolemaios
- Dioscorus. Ptolemaios
- (variant transliteration)
- <date>2000</date>
- <date>1998-03-26</date>
- (ambiguous and structurally inconsistent)
- <publisher>New York: Robert Carter, 1846</publisher> (schema limitation led to workaround)
- . . .
- Activity: Find / Collocate
- Actions: Find, Identify, Select, Obtain
- Across Federated Collections
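Structural inconsistencies like the date examples above can be caught mechanically. A small sketch, with illustrative patterns only, that flags values not matching a W3CDTF-style profile and detects mixed formats within one record:

```python
import re

# W3CDTF-style profiles: YYYY, YYYY-MM, YYYY-MM-DD (illustrative subset)
W3CDTF = re.compile(r"\d{4}(-\d{2}(-\d{2})?)?")

def date_problems(values):
    """Return the date values that do not conform to the profile."""
    return [v for v in values if not W3CDTF.fullmatch(v)]

def structurally_consistent(values):
    """True when all values share one shape (all bare years, or all
    full dates, ...). The example above mixes '2000' with
    '1998-03-26': each is valid alone, yet together inconsistent."""
    def shape(v):
        return re.sub(r"\d", "9", v)   # '1998-03-26' -> '9999-99-99'
    return len({shape(v) for v in values}) <= 1

dates = ["2000", "1998-03-26"]   # valid individually, inconsistent jointly
```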
5.2 Findings
- MQ dimensions with major quality problems
- completeness
- redundancy
- clarity
- semantic inconsistency (incorrect element use)
- structural inconsistency
- inaccurate representation
Problem type                 Count
Incomplete                   100
Redundant                     94
Unclear                       78
Incorrect use of elements     73
Inconsistent                  47
Inaccurate                    24
5.3 Findings
- Correlation between consistency of element use and type of metadata objects and type of data providers (sample size: 2,000).
- Grouping by type of objects made the standard deviation of the total number of elements used drop significantly (from 5.73 to 3.6).
- Clustering by use of distinct DC elements (K-means, with 2 clusters) suggested that different types of institutions may use different numbers of distinct DC elements:
- Academic libraries: 13
- Public libraries: 8
- Museums: divided
DC element    A (academic)   P (public)
Title         1              1
Creator       1              1
Subject       1              1
Description   1              1
Publisher     1              1
Contributor   0              0
Date          1              1
Type          1              0
Format        1              0
Identifier    1              1
Source        1              0
Language      1              0
Relation      1              0
Coverage      0              0
Rights        1              1
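The clustering step above can be sketched with a minimal 2-means over binary element-presence vectors (one 0/1 vector per provider). This is a self-contained illustration, not the project's code, and the sample providers are invented:

```python
def two_means(vectors, iters=10):
    """Minimal k-means with k=2 on 0/1 element-presence vectors,
    seeding the centers with the first and last vectors."""
    centers = [list(vectors[0]), list(vectors[-1])]
    assign = [0] * len(vectors)
    for _ in range(iters):
        # assignment step: nearest center by squared Euclidean distance
        for i, v in enumerate(vectors):
            dists = [sum((a - b) ** 2 for a, b in zip(v, c))
                     for c in centers]
            assign[i] = dists.index(min(dists))
        # update step: center = mean of its members
        for c in range(2):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members)
                              for col in zip(*members)]
    return assign

# Invented providers: which of 15 DC elements each one uses
academic = [1] * 13 + [0] * 2   # 13 distinct elements
public = [1] * 8 + [0] * 7      # 8 distinct elements
providers = [academic, list(academic), public, list(public)]
labels = two_means(providers)
```

Providers with the academic-style profile land in one cluster and the public-style profile in the other, mirroring the 13-vs-8 split in the table.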
5.3 Findings (continued)
- High complexity of metadata content related to quality problems.
- Strong correlation found between content simplicity/complexity rate and quality problem rate (-.434, p < .01).
- However, no significant correlation found between quality problem rate and length of metadata object (.043).
- Differences in how well standard schemas handle different types of original objects; the lowest quality problem rate was found for print materials:
Genre/Type    Mean error rate   Median error rate
species       .00099            .00064
manuscript    .00063            .00058
photograph    .00025            .00018
art           .00016            .00015
print         .00010            .00010
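A correlation like the -.434 reported above is a plain Pearson coefficient over per-record simplicity scores and problem rates. A self-contained sketch with invented sample data:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented illustration: higher content simplicity, fewer quality problems
simplicity = [0.9, 0.8, 0.6, 0.4, 0.2]
problem_rate = [0.01, 0.02, 0.05, 0.07, 0.09]
r = pearson(simplicity, problem_rate)   # negative, as in the finding
```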
6. Conclusions and Lessons Learned
- Communities of practice may use their own implicit or explicit schema when sharing metadata, even through a standardized schema such as DC.
- Some schema elements can be more ambiguous than others and require qualification (Date vs. Creator).
- Ambiguity of schema elements can be a major source of quality problems, leading to context loss and element misuse.
- Inferring the native schema and comparing it to the destination schema can point to possible sources of quality problems.
- Analysis of activities can help in evaluating the robustness and clarity of a schema.
- Mining regularities between metadata characteristics and quality problems can help in constructing robust and inexpensive metrics.
- Some metrics used in information retrieval (Infonoise, Kullback-Leibler divergence, average IDF) can be effective and scalable in assessing quality at the content level.
- A general-purpose dictionary-based metric was found robust for assessing the cognitive complexity of metadata content.
- Structure profiles can be an effective source for measuring quality and predicting quality problems at the schema level.
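One of the IR metrics named above, Kullback-Leibler divergence, could be applied as in this sketch: compare a provider's element-usage distribution against the federation-wide one, so outlier profiles stand out. The distributions are invented and the smoothing constant is an assumption:

```python
from math import log2

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) in bits; eps smooths zero probabilities in Q."""
    return sum(pi * log2((pi + eps) / (q.get(el, 0.0) + eps))
               for el, pi in p.items() if pi > 0)

# Invented element-usage distributions (each sums to 1)
federation = {"title": 0.4, "date": 0.3, "subject": 0.3}
provider_a = {"title": 0.4, "date": 0.3, "subject": 0.3}  # typical
provider_b = {"title": 0.9, "date": 0.1}                  # outlier

d_a = kl_divergence(provider_a, federation)  # ~0: matches the norm
d_b = kl_divergence(provider_b, federation)  # large: flag for review
```

Ranking providers by this divergence gives a cheap, scalable triage list for quality review, in the spirit of the "robust and inexpensive metrics" conclusion.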
Acknowledgements and Contact Information
The research was made possible by the generous
support from the Institute of Museums and Library
Services (IMLS) and the UIUC Campus Research
Board.
How to contact: email Besiki Stvilia at stvilia_at_uiuc.edu