Title: Digital Library Content Model
1Digital Library Content Model
- Dagobert Soergel
- College of Information Studies, University of Maryland
- Department of Library and Information Studies, University at Buffalo
2The Problem
- Digital libraries must:
- Store a wide variety of often complex information objects and display these objects on different platforms. This requires modeling information objects, their internal structure, and the relationships among them.
- Provide data that support discovery, interpretation, use, and management of information objects. This requires a good metadata model.
- Support annotation of information objects. Annotations turn out to be surprisingly diverse. An annotation may refer to only a part of an information object. This requires an elegant model that can deal with many cases.
3Purpose of the talk
- To reexamine a number of basic notions regarding the content of a digital library (or, more generally, any information system) to achieve sound definitions
- Developed in the framework of the DELOS Digital Library Reference Model
- a framework for describing digital libraries, their content, users, and functions and, for each, their qualities and associated policies
4Premises
- Modeling the content domain is complex, and much thinking is muddled
- Need to be able to handle both data and documents
- Any reference model
- needs to be abstract and must not commit to any particular standard or design decision
- rather, it must provide a framework for specifying the commitments of any particular DL (or information system)
5Issues
- 0 Scope of this talk and modeling constructs
- 1a Content in the overall context of a DL reference model
- 1b Modeling information objects
- 1c Levels, versions, and relationships
- 1d Composite information objects / resources
- 1e Resource identifiers
- 2 Metadata, including provenance, context, usage
- 3 Annotation
6Scope of this talk
- A reference model for a broadly conceived digital library will be able to model almost any information system and will thus be broadly useful.
- The focus on digital libraries lies in the application, especially the type of collection, to which the model is applied.
7Scope: level of abstraction
- The reference model should stay on an abstract level. It should not require specific standards but rather allow for plugging in any standard, such as RDA or DC.
- A DL should indicate to its users what standards it uses for things like time, place, type of relationship, and type of resource.
- The reference model should not require design choices but rather provide a framework for specifying design choices, such as the selectivity of the collection. A DL will then indicate whether its collection is selective or fully inclusive.
8Modeling constructs
- The reference model should be based on an entity-relationship model (E-R model).
- Second-order logic: relationship instances are resources that can in turn be related to anything. Apply this pragmatically for useful navigation and common-sense inferences; stay away from types of reasoning that run into problems with second-order logic.
- Must add mechanisms for indicating the degree of precision or the degree of certainty of statements.
10Content in the overall context of a DL reference model
- Resources
- Structured data
- Unstructured data, text
- Uses of data
11Everything is a resource
- W3C definition
- A resource is anything that can be identified or named. Any resource is represented by a resource identifier.
- Resource includes both external (non-digital) objects or events and digital objects or events, wherever that digital object or event may reside or occur.
- Same as topic in topic maps
- In an E-R model, entity types, entity instances (entity values), relationship types, and relationship instances are all resources.
- In RDA, Resource is restricted to information objects. The advantages of the broader definition will become clear.
12Structured data statements
- Resource 1 <relationship> Resource 2
- SoftwareModule <createdBy> LegalEntity
- SoftwareModule <annotatedBy> InformationObject
- Event <happenedIn> (Date1, Date2)
- Multi-way relationships, frames
- Statements are information objects, that is, they are resources that can in turn be related to anything
- Statements are also called propositions or assertions (or facts)
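The idea that a statement is itself a resource can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `Resource` and `Statement`, the field names, and the identifiers `S1` and `S2` are hypothetical and not part of the reference model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Anything that can be identified or named."""
    identifier: str


@dataclass(frozen=True)
class Statement(Resource):
    """A relationship instance; because it subclasses Resource, it is a resource too."""
    subject: Resource
    relationship: str
    obj: object  # a Resource, or a tuple such as (Date1, Date2)


# Hypothetical identifiers, chosen only for this sketch.
module = Resource("SoftwareModule-1")
creator = Resource("LegalEntity-1")
review = Resource("InformationObject-1")

created = Statement("S1", module, "createdBy", creator)
# Because a statement is a resource, it can be the subject of another statement.
annotated = Statement("S2", created, "annotatedBy", review)
```

Making `Statement` a subclass of `Resource` is one possible design choice; it is what lets the second statement annotate the first one.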
13More on structured data
- Data consist of statements about resources.
- Such statements can be conceived as relationship instances in which the resource in focus occupies one argument slot. A simple statement uses a binary relationship; a more complex one uses a multi-way relationship (a frame instance with slots filled, comparable to an object in an object-oriented database).
14More on structured data
- Slot fillers are also known as data values.
- A data value makes sense only when it is seen in relation to one or more resources, for example as a slot filler in a frame.
- Examples
- The value 55 makes sense only in the right context, such as in the success slot of a drug treatment frame
- The value 185 cm makes sense only if we know it is the height of a person or the length of a pair of skis.
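As a rough illustration of the point about data values, the following sketch puts the same value into two different frames. The `Frame` class and the slot names `height` and `length` are assumptions made for this example only.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """A frame instance: a frame type plus slot-name -> data-value pairs."""
    frame_type: str
    slots: dict = field(default_factory=dict)

    def describe(self, slot: str) -> str:
        # The frame type and slot name supply the context the bare value lacks.
        return f"{self.frame_type}.{slot} = {self.slots[slot]}"


person = Frame("Person", {"height": "185 cm"})
skis = Frame("PairOfSkis", {"length": "185 cm"})

print(person.describe("height"))  # Person.height = 185 cm
print(skis.describe("length"))    # PairOfSkis.length = 185 cm
```

The two slot fillers are identical as values; only the frame and slot tell a reader what each one means.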
15- There are two ways to communicate such statements.
- 1. Structured data: One learns what one wants to know about the resource in focus immediately from a relationship instance.
- Hamlet <authoredBy> Shakespeare
- The drug treatment frame on Taxoteer
- The actual data of interest are represented in a database
16- There are two ways to communicate such statements.
- 2. Unstructured data: One needs to extract what one wants to know from a text or image that is related to the resource in focus.
- Shakespeare wrote Hamlet in the year 1625
- Hamlet was written by Shakespeare
- Taxoteer is effective in the treatment of cancers that have no receptors for estrogen. In older persons the success rate is 50.
- The data of interest are stored in what is commonly known as a document.
17Functions of data
- Data about a resource may serve any of the following functions:
- learn about the resource and its various characteristics
- learn about the history and context of the resource
- learn how to use the resource
- manage the resource
- preserve the resource
- The sections about metadata (roughly, data about an information object) will specialize this list
18Relationship as the basic modeling construct
- Important principle
- Many concepts in a DL reference model are best modeled based on relationships rather than based on entities
- For example, annotation-hood resides not in an information object but in the relationship
- InformationObjectA <annotates> InformationObjectB
- InformationObjectB <annotatedBy> InformationObjectA
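A minimal sketch of this principle, assuming statements are stored as plain (subject, relationship, object) triples: whether an object is an annotation is answered by looking at the statements, not at the object. The function names `invert` and `annotations_of` are hypothetical.

```python
# The two directions of the same relationship, as named on the slide.
INVERSE = {"annotates": "annotatedBy", "annotatedBy": "annotates"}


def invert(statement):
    """Return the same fact expressed with the inverse relationship."""
    subject, relationship, obj = statement
    return (obj, INVERSE[relationship], subject)


def annotations_of(target, statements):
    """Objects that act as annotations of `target` in the given statements."""
    found = set()
    for subject, relationship, obj in statements:
        if relationship == "annotates" and obj == target:
            found.add(subject)
        elif relationship == "annotatedBy" and subject == target:
            found.add(obj)
    return found


statements = [("InformationObjectA", "annotates", "InformationObjectB")]
```

Here `InformationObjectA` is an annotation only with respect to `InformationObjectB`; nothing in the object itself marks it as one.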
19Resource type examples
- Information objects, incl. documents, data streams, databases, queries and their results (virtual information objects, such as database reports, virtual collections)
- Actors that can search for, create, and manage resources
- Functions and services
- Software modules
- Policies
- Languages
- Ideas, concepts
20Inheritance
- Many reference model constructs are specified at the level of resource.
- They inherit down to the different resource types, especially information objects
- For example, the following statement types are valid for Resource
- Resource <identifiedBy> Identifier
- Resource <characterizedBy> QualityParameter
- Resource <regulatedBy> Policy
- Therefore, they are also valid for InformationObject or Actor or Policy
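The inheritance of statement types down the resource type hierarchy can be sketched with ordinary Python subclassing. This is an illustration only: the table `STATEMENT_TYPES` and the function `is_valid` are assumed names, and treating Identifier and QualityParameter as resource types follows from "everything is a resource" rather than from an explicit statement on this slide.

```python
class Resource: pass
class InformationObject(Resource): pass
class Actor(Resource): pass
class Policy(Resource): pass
class Identifier(Resource): pass
class QualityParameter(Resource): pass


# Statement types declared once, at the level of Resource:
# relationship -> (type allowed as subject, type allowed as object)
STATEMENT_TYPES = {
    "identifiedBy": (Resource, Identifier),
    "characterizedBy": (Resource, QualityParameter),
    "regulatedBy": (Resource, Policy),
}


def is_valid(subject_type, relationship, object_type):
    """A statement type is valid for any subtype of its declared subject and object types."""
    allowed_subject, allowed_object = STATEMENT_TYPES[relationship]
    return issubclass(subject_type, allowed_subject) and issubclass(object_type, allowed_object)
```

For example, `is_valid(InformationObject, "regulatedBy", Policy)` holds without any declaration specific to information objects, while `is_valid(Actor, "regulatedBy", Actor)` does not.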
22Information objects 1
- A formal relationship instance (such as a row in a table or a structured data record)
- A document (written or spoken text, image, sound) from which a human reader can learn about the resource in focus or about the relationships among several resources.
- Information extraction: document → formal relationship instances.
- A collection of information objects is in turn an information object
- a table in a relational database: a collection of rows, each representing a relationship instance or a collection of relationship instances
- a collection of documents
23Information objects 2
- An information object may be a close representation of an external object or event, for example
- An image (photograph or painting) of a building. There may be many such images taken from different angles etc.
- A video recording of a soccer game. There may be several such video recordings, each capturing different scenes, or capturing the same scene from different angles, or following different players, etc. These are different information objects representing the same external event.
24Real world objects, concepts, ideas
- To provide full access to the information objects it contains, a digital library must manage data about any kind of object (real world objects, concepts, ideas) in its subject domain.
- Why?
- The DL may represent data in the form of a database
- Users look for information objects that deal with or are digital representations of any kind of object.
- This idea underlies Topic Maps, which were originally designed to improve access to documents by relating the topics discussed in these documents.
25Real world objects, concepts, ideas
- Examples (these are all resources)
- People (focus of biographical reference tools)
- Organizations (focus of organization directories)
- Events (focus of developing "event gazetteers")
- Places (focus of gazetteers)
- Dates
- Mathematical theorems (focus of mathematical encyclopedias)
- Concepts, ideas
- Problems and proposed solutions
- Computer programs (focus of software directories or libraries)
- The reference model should have a more complete list and indicate sources dealing with these
27Levels, versions, and relationships
- Work, manifestation, item (individual copy)
- Linked through relationships
28Work
- Intellectual or artistic entity, as the abstract essence or as a text, image, or piece of music.
- Range
- A basic story or theme
- the story of Faust
- the myth of the Great Flood
- A text telling the story, such as
- Goethe's Faust
- the account of the Great Flood in the Bible (original Hebrew)
- the account of the same myth in another culture
- A specific version of the account in the Hebrew Bible
- a Latin translation of the account in the Hebrew Bible
29Manifestation
- A specific rendering of a work by means of a graphical image or sound, taken in the abstract: the idea of such a rendering.
- Examples
- The text of Goethe's Faust printed in a particular typeface and layout. A performance at which the text is recited also renders the text but is more properly considered a separate, but related, work.
- A specific score of a given version of Schubert's Fifth. A performance of that version of Schubert's Fifth also renders the piece of music but is considered a separate, but related, work.
- Also the rendering of a work in the form of digital storage that can be transformed to a graphical image or sound, again taken as the abstract pattern of digital signals.
30Item, individual copy
- The embodiment of a manifestation in a physical object
- We can perceive the content of a manifestation only through an individual copy of it (unless we have memorized the visual expression manifest in a manifestation and can conjure it up from memory).
- There are works that have only one manifestation of which there is only one copy.
31Relationships among information objects
- The story of Faust <dealsWith> Pact with the devil
- The story of Faust <isToldIn> Marlowe's Faust
- The story of Faust <isToldIn> Goethe's Faust
- Goethe's Faust <authoredBy> Goethe, Johann Wolfgang von
- Goethe's Faust <hasManifestation> R1231
- R1231 <publishedBy> Cotta
- R1231 <hasDate> 1871
- R1232 <isCopyOf> R1231
- R1232 <ownedBy> (HRieth, 1896, 1956)
- R1232 <ownedBy> (DSoergel, 1956, )
32Hierarchical inheritance
- Data about a work inherit to all works below it along <isToldIn>, <hasVersion> etc. Therefore
- Goethe's Faust <dealsWith> Pact with the devil
- Data about a work inherit to all its manifestations. Therefore
- R1231 <authoredBy> Goethe, Johann Wolfgang von
- Data about a manifestation inherit to all its items
- Hierarchical inheritance increases efficiency
- More efficient catalog input
- More efficient catalog storage
- More efficient representation and reading of search results
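Hierarchical inheritance can be sketched as a walk up the work / manifestation / item chain that collects the data stored at each level. The triples below restate the Faust example; the split into `PARENT_TO_CHILD` and `CHILD_TO_PARENT` relationships and the function names are assumptions of this sketch, and a real system would probably need rules for data that should not inherit (ownership of an item, for instance).

```python
STATEMENTS = [
    ("The story of Faust", "dealsWith", "Pact with the devil"),
    ("The story of Faust", "isToldIn", "Marlowe's Faust"),
    ("The story of Faust", "isToldIn", "Goethe's Faust"),
    ("Goethe's Faust", "authoredBy", "Goethe, Johann Wolfgang von"),
    ("Goethe's Faust", "hasManifestation", "R1231"),
    ("R1231", "publishedBy", "Cotta"),
    ("R1232", "isCopyOf", "R1231"),
]

# Relationships that form the hierarchy (assumed direction of each).
PARENT_TO_CHILD = {"isToldIn", "hasVersion", "hasManifestation"}
CHILD_TO_PARENT = {"isCopyOf"}
STRUCTURAL = PARENT_TO_CHILD | CHILD_TO_PARENT


def parents(resource):
    """Resources directly above `resource` in the hierarchy."""
    result = []
    for subject, relationship, obj in STATEMENTS:
        if relationship in PARENT_TO_CHILD and obj == resource:
            result.append(subject)
        elif relationship in CHILD_TO_PARENT and subject == resource:
            result.append(obj)
    return result


def data_about(resource):
    """Own data plus data inherited from every level above."""
    collected, seen, pending = [], set(), [resource]
    while pending:
        current = pending.pop()
        if current in seen:
            continue
        seen.add(current)
        collected.extend(
            (relationship, obj)
            for subject, relationship, obj in STATEMENTS
            if subject == current and relationship not in STRUCTURAL
        )
        pending.extend(parents(current))
    return collected
```

With these triples, `data_about("R1232")` yields the publisher recorded for the manifestation, the author recorded for the work, and the theme recorded for the story, although nothing is stored on the copy itself. That is the catalog input and storage saving the slide refers to.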
33More relationships
- R271: The man I killed, by Michael Halliday
- R519: The man I killed, play by Christopher Wern
- R519 <isBasedOn> R271
- R315: Handbook of commercial geography, by Robert Chisholm
- R783: Chisholm's handbook of commercial geography, entirely rewritten by L. Dudley Stamp and S. Carter Gilmour.
- R783 <entirelyRewrittenFrom> R315
34Relationship to FRBR: Notes on Terminology
- The FRBR distinction between work and expression should be rethought. It is unclear and consequently poorly understood, and it may not be necessary. Just have work. The intuition FRBR tries to capture in this distinction is better handled through relationships among works as defined here.
- Following FRBR, I use the term manifestation. Another term is edition (in the sense of German Ausgabe), but edition also means German Auflage, so use of the term edition can be confusing.
- It would be nice to be able to use graphic expression as a synonym for rendering, but to avoid any further confusion with FRBR it is best not to use the term expression at all.
35Version control
- Important, but not elaborated here
37Composite information objects / resources
- Examples
- Book divided into chapters, sections, paragraphs, words (XML Document Object Model, DOM, or TEI). Each part can be seen as a separate information object
- Movie with images, soundtrack, closed captions, script, all coordinated (MPEG-7)
- A medical record with patient data, test data, images, live monitoring data streams, diagnoses, drugs prescribed, etc.
38Composite information objects / resources
- Abstractly: Each component is a separate information object, composition expressed through relationships
- In practice
- Many document models for composite (or compound) documents supporting presentation
- DL needs to allow specification, for each document, of the particular document model used
40Identifying information objects
- 1 Initial definition upon entry into the digital library.
- 2 Definition on the spot
- Examples: Annotate a specific segment of a text document or a region of an image or sound document, or anchor an annotation to a specific location in a document.
- The segment or anchor is a new information object that is included in the original information object, and this new information object is linked with any of several annotation relationships to a new information object created by the user.
- Related to composite objects. More on this under annotation
42Data about information objects
- Metadata: data about information objects if used for discovering, interpreting, and using information objects
- Relate information objects to other types of resources. Examples
- InformationObject <hasCreator> Actor
- InformationObject <dealsWith> Actor
- InformationObject <containsText> Text (or, more specifically, Word)
- Relate a word in a text to the concept that is the meaning in which the word is used in this particular position.
- InformationObjectA <hasAbstract> InformationObjectB
- InformationObjectA <hasCriticalCommentary> InformationObjectC
- InformationObjectD <hasSupportiveCommentary> InformationObjectC
43More on defining metadata
- The metadata-hood of an information object does not reside in the information object, but in its relationship to another information object and, more specifically, in its use
- A piece of data
- is used as metadata
- if it is used for the purpose of discovering, interpreting, and using information objects, which then give the ultimate data wanted.
- The same piece of data may fill the ultimate need of the user in one situation and be used as metadata in another situation.
44Not metadata
- Data about resources that are not information objects are not metadata even if they are similar in form.
- Data about information objects are not always used as metadata. For example, using author data to count a faculty member's publications or citation data to compute impact
- Extensive discussion of what exactly is the definition of metadata is not a good use of resources. A system should provide the data that are useful to a user for whatever purpose; what each piece of data is called is less important.
45Metadata typologies
- Metadata (and data in general) can be divided into categories from several perspectives, and within each perspective there exist several approaches. Some examples of how to categorize metadata:
- by purpose or use. Since the same unit of metadata can be used for several purposes, the resulting categories overlap.
- by source, for example, extracted, assigned by cataloger, assigned by user (social tagging), from usage tracking
- by intrinsic characteristics, for example data about provenance or about the format of the information object
46Some metadata uses
- A Learn about information objects and interpret them; this includes
- A1 Learn about the identity and characteristics of information objects (descriptive metadata)
- A2 Learn about the history and other features of the context of the information object (contextual metadata)
- B Learn how to use an information object, including
- B1 Learn how to gain legal access (access and rights metadata)
- B2 Learn how to gain technical access to the information object (what machinery and software is needed to access the information object for a given purpose, such as assimilation by a person or processing by a computer program)
- C Manage information objects (administrative metadata), in particular
- C1 Manage the preservation of information objects (preservation metadata).
47Usage data
- Data on usage of resources and on usage rights, usage history, and future use / preservation are important for discovering, interpreting, and using resources as well as managing resources
- Some of these data can be collected automatically
- If the resource in question is an information object, this kind of data is often used as metadata
49Annotation
- InformationObjectA <annotatedBy> InformationObjectB
- InformationObjectB may be created on the spot in order to annotate A (InformationObjectB and the annotation relationship have the same author) or B may preexist (the annotation relationship between A and B is introduced by a third party)
- Specific type of annotation expressed by specializing the annotatedBy relationship, for example
- InformationObjectA <criticizedBy> InformationObjectB
- InformationObjectA <hasCriticalCommentary> InformationObjectC
- InformationObjectD <hasSupportiveCommentary> InformationObjectC
- InformationObjectE <isPartOfSpeech> PartOfSpeech
- Annotation-hood is in the relationship, not in the information object
50Annotation
- Annotation-hood is in the relationship, not in the information object
- There is a wide range of relationship types that are called annotations. Linguists think of annotations differently than scholars making comments on a text.
- Rather than trying to define exactly what annotation means, the reference model should include a comprehensive list of relationship types that might be considered annotation by somebody so that anybody can define their meaning of annotation by giving the appropriate subset of annotation relationship types.
- The same thought applies to metadata, discussed on a later slide.
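One way to picture this proposal is an inventory of relationship types, each specializing annotatedBy, from which each community picks the subset it calls annotation. The relationship names come from the preceding slide; the community names and their subsets are invented for illustration and are not claims about what any community actually uses.

```python
# Each relationship type names the broader type it specializes (None for the root).
BROADER = {
    "annotatedBy": None,
    "criticizedBy": "annotatedBy",
    "hasCriticalCommentary": "annotatedBy",
    "hasSupportiveCommentary": "annotatedBy",
    "isPartOfSpeech": "annotatedBy",
}


def is_specialization_of(relationship, ancestor):
    """True if `relationship` equals `ancestor` or specializes it, directly or indirectly."""
    current = relationship
    while current is not None:
        if current == ancestor:
            return True
        current = BROADER.get(current)
    return False


# Hypothetical community definitions: each is just a subset of the inventory.
ANNOTATION_TYPES = {
    "linguists": {"isPartOfSpeech"},
    "text scholars": {"criticizedBy", "hasCriticalCommentary", "hasSupportiveCommentary"},
}


def counts_as_annotation(relationship, community):
    return relationship in ANNOTATION_TYPES[community]
```

The inventory stays neutral: the same relationship type can count as annotation for one community and not for another, and no single definition of annotation has to be agreed on.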
51Special resource types for annotations
- Some annotations require special types of resources.
- Examples
- Annotate a text with part-of-speech indications. Annotated resource: a one-word fragment of the text. Annotating resource: a value from a list of parts of speech
- Annotate a text with meaning for word sense disambiguation. Annotated resource: a word or phrase in the text. Annotating resource: a value from a list of meanings defined in some way
- Annotation through underlining or other marks. Annotated resource: a fragment of text or other information object. Annotating resource: a pair (sign, meaning), e.g. (underline, important) or (?, check this out) or (X, nonsense)
- The annotated resource and the annotating resource may be very short
52Annotation and metadata
- Metadata and annotation data overlap, and different communities and individuals have different definitions of what is included in metadata and what is included in annotations.
- The precise nature of a unit of data about an information object is determined by the relationship type and the resource that is linked to. The interpretation of each type of data is in the eye of the beholder.
- Need an inventory of relationship types (a type of ontology). For example, the CIDOC Conceptual Reference Model (CIDOC CRM) is an inventory of broad relationship types.
- In such an inventory, one could indicate who considers a given relationship type as usable as metadata and/or as belonging to annotation.
53Take-home message 1
- The entity-relationship model (E-R model) provides the unifying principle for a digital library content model
- The E-R model allows representation of structured data of any complexity on a conceptual level.
- Defining relationships between information objects handles
- Modeling information objects
- Levels, versions, and relationships
- Composite information objects / resources
- Metadata
- Annotation
- Many notions are captured better through relationships than through fine distinctions of entity types
54Take-home message 2
- Any reference model
- needs to be abstract and must not commit to any particular standard or design decision
- rather, it must provide a framework for specifying the commitments of any particular DL (or information system)
- A reference model provides a systematic framework for description and analysis, not a prescription
55- Dagobert Soergel
- dsoergel at umd.edu
- www.dsoergel.com
56Omitted slides
57Construction process
- Need to be sure all applicable concepts from various sources such as the 5S model and FRBR/CRM are included, either in the skeleton model or in a list of values / choices, as appropriate
- There is still work to be done to pull reference model subject matter out of the reference architecture document, and vice versa.
58Construction process
- We should have an online version of the reference model document with the following properties
- Links to discussion of issues and underlying rationale, capturing some of the discussion in the group.
- Links from the reference model to the appropriate section of the reference architecture
- The Wiki page may not quite do it.
59- There are two ways to communicate such statements.
- One learns what one wants to know about the resource in focus immediately from a relationship instance.
- Hamlet <authoredBy> Shakespeare
- The drug treatment frame on Taxoteer
- The actual data of interest are represented in a database that captures these statements (relationship instances), such as
- a collection of Prolog statements
- a relational database
- an object-oriented database
- One needs to consult an information object that is related to the resource in focus.
- Shakespeare wrote Hamlet in the year 1625
- Hamlet was written by Shakespeare
- Taxoteer is effective in the treatment of cancers that have no receptors for estrogen. In older persons the success rate is 50
60- The DL designer must decide how to identify the
new resource that is a part of an existing
resource and the new text object created by the
annotator and how to store the link between
these two information objects
61Identifying information objects: Architecture issues
- Definition on the spot, options
- (1) use completely independent identifiers and store the relationship explicitly
- (2) use dependent identifiers
- The part of a document can be identified by the document identifier followed by information that uniquely identifies the part. The part relation is implied by the structure of the identifier.
- The annotation information object could be identified by the identifier of the resource being annotated followed by a short string that identifies the nth annotation of this resource (like a footnote). The relationship between the resource and the resource annotating it would be implied by the identifier (however, the specific type of the annotation relationship would not be captured this way). The resource that annotates still can be referenced from any other context.
- Implicit representation: Embedded annotations. The annotation is embedded in the document, linked to a point in a text that is identified only by the place of the annotation. This could be converted to an explicit representation.
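Option (2), dependent identifiers, might look as follows. The separators `#` and `~a` and the locator `ch2.p5` are arbitrary choices made for this sketch; the slides do not prescribe any identifier syntax.

```python
PART_SEPARATOR = "#"          # assumed syntax: document id, then part locator
ANNOTATION_SEPARATOR = "~a"   # assumed syntax: resource id, then annotation number


def part_identifier(document_id, part_locator):
    """Identifier of a part: the document identifier followed by a locator."""
    return f"{document_id}{PART_SEPARATOR}{part_locator}"


def annotation_identifier(resource_id, n):
    """Identifier of the nth annotation of a resource (like a footnote number)."""
    return f"{resource_id}{ANNOTATION_SEPARATOR}{n}"


def containing_document(part_id):
    """The part relation is implied by the structure of the identifier."""
    return part_id.split(PART_SEPARATOR, 1)[0]


def annotated_resource(annotation_id):
    """The annotated resource is implied too, but not the type of annotation."""
    return annotation_id.rsplit(ANNOTATION_SEPARATOR, 1)[0]


segment = part_identifier("R1231", "ch2.p5")
note = annotation_identifier(segment, 1)
```

As the slide notes, this scheme recovers which resource is annotated but not the specific annotation relationship type, which would still have to be stored explicitly if it matters.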
62Some metadata uses
- This is a specialization of the functions of data given above
- A learn about other data, that is, information objects, and understand them; this includes
- A1 learn about the identity and characteristics of information objects (descriptive metadata)
- A2 learn about the history and other features of the context of the information object (contextual metadata)
- B learn how to use an information object (source of data), including
- B1 learn how to gain legal access to the information object (access and use rights metadata)
- B2 learn how to gain technical access to the information object (what machinery and software is needed to access the information object for a given purpose, such as assimilation by a person or processing by a computer program)
- C manage information objects (administrative metadata), in particular
- C1 manage the preservation of information objects (preservation metadata).
63Metadata in the reference model
- When describing a DL using the reference model,
need to be able to indicate any typology of
metadata used in the DL