Title: Defining Collections in Distributed Digital Libraries
1. Defining Collections in Distributed Digital Libraries
- Carl Lagoze, David Fielding
- D-Lib Magazine, November 1998
2. Introduction
3. Order and Chaos in Global Information Space (I)
- Characteristics of the Web
  - Universality: quantity without quality
  - Uniformity: specialized and domain-specific tools, technologies, and guidance are difficult or impossible to find
  - Decentralization: it is difficult to impose the organizational structures necessary for ensuring information integrity (reliability and accessibility), security and privacy for content and users, and survivability (preservation) of information
4. Order and Chaos in Global Information Space (II)
- Selection, organization, and specialization should be permitted without being imposed
- Mechanisms for selection, organization, and specialization should be flexible, extensible, and independent of other characteristics of the DL, such as how content and services are physically distributed or how and by whom the components of the DL are managed
5. Role of Collection Development in the Traditional Library
- Selection: defining the resources that belong to a collection
  - Physical containment or demarcation in a traditional library
- Specialization: resource discovery aids or cataloging techniques tailored to the collection or the audience
- Administration: management and preservation policies that conform to the collection characteristics
6. Collection Service Architecture
- Defines collection membership through criteria
  - Subject classification, language, or genre
  - Allows automatic and/or dynamic selection of resources from a set of distributed information sources, based on either metadata about those resources or the content within the resources
- Provides query routing and query pre-processing and post-processing facilities to facilitate resource discovery
  - Tailored to the characteristics of the collection
- Acts as a distributed metadata repository, storing, disseminating, and processing data relevant to the management and administration of objects in the collection
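The pre-processing and post-processing role above can be pictured with a minimal Python sketch. Everything in it is assumed for illustration: the stop-word list, the thesaurus entry, the `fake_index` stand-in for an index service, and the record layout are hypothetical and do not come from the systems discussed in these slides.

```python
from typing import Callable, Dict, List, Set

# Hypothetical collection-specific aids (assumed for illustration only).
STOP_WORDS: Set[str] = {"the", "of", "a", "in"}
THESAURUS: Dict[str, List[str]] = {"dl": ["digital library"]}


def preprocess(query: str) -> List[str]:
    """Lower-case the query, drop stop words, and expand thesaurus terms."""
    terms: List[str] = []
    for word in query.lower().split():
        if word in STOP_WORDS:
            continue
        terms.append(word)
        terms.extend(THESAURUS.get(word, []))
    return terms


def postprocess(hits: List[dict], language: str) -> List[dict]:
    """Keep only hits whose metadata matches the collection's language."""
    return [h for h in hits if h.get("language") == language]


def collection_search(query: str,
                      search: Callable[[List[str]], List[dict]],
                      language: str = "en") -> List[dict]:
    """Pre-process the query, call the index search, post-process the hits."""
    return postprocess(search(preprocess(query)), language)


def fake_index(terms: List[str]) -> List[dict]:
    """Stand-in for an index service; ignores the terms, returns canned records."""
    return [
        {"urn": "urn:example:1", "language": "en"},
        {"urn": "urn:example:2", "language": "de"},
    ]
```

Calling `collection_search("DL", fake_index)` would pass the expanded terms to the index stand-in and then discard the non-English record, which is one way a collection could tailor discovery without touching the index service itself.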
7. Relationship with Overall DL Systems
- The collection service is one of several services in the component-based digital library architecture
  - A repository service for storing digital content
  - A naming service for registering and resolving unique names for objects
  - An index service that processes queries for content discovery
8. Why Separate the Collection Service from the Other Services?
- A single digital object (DO) may be a member of multiple collections
  - Defined by multiple collection services under separate administration
- Collection membership and administration are distinct from definitions of query capabilities (defined by index services)
- Collection definition is relatively lightweight
  - The logically distinct collection service defined in this paper is fundamentally a simple query routing mechanism that requires no access to the content of individual digital objects
9. Component-Based Distributed Digital Libraries
10. Cornell Digital Library Research Group (CDLRG)
- Deployment of distributed digital libraries
  - Open architecture: the functionality of a DL system is available in the form of distinct functional units (services), each of which has operational semantics exposed through an open protocol.
  - Federation: DLs are managed aggregations of these functional units (or services) and the resources to which they provide access. New functionality can be added to these systems through the implementation of value-added services, which interact with existing services using established protocols.
  - Distribution: the components (and content) of a digital library may be spread over the global Internet, but are presented to the user as a single uniform system.
- NCSTRL, Dienst, CRADDL
11. Interaction of Core Digital Library Services
12. Core Services of DL (I)
- Content: digital objects (DOs)
  - Byte streams, content-specific behaviors, secure access (rights management mechanisms)
- Repository service: deposit, storage, and access to DOs
  - A DO is considered contained within a repository if the URN of that object resolves to the respective repository
- DOs are identified by globally unique names (URNs) registered with the naming service. The naming service is able to resolve a URN to one or more physical locations.
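The containment rule above (a DO belongs to a repository if its URN resolves there) can be sketched in a few lines of Python. This is a toy model under stated assumptions: the `NamingService` class, its method names, and the example URN and host names are hypothetical and are not the API of any real naming service.

```python
from typing import Dict, List


class NamingService:
    """Toy naming service: registers URNs and resolves them to locations."""

    def __init__(self) -> None:
        self._locations: Dict[str, List[str]] = {}

    def register(self, urn: str, repository: str) -> None:
        """Record that `repository` holds the object named `urn`."""
        locations = self._locations.setdefault(urn, [])
        if repository not in locations:
            locations.append(repository)

    def resolve(self, urn: str) -> List[str]:
        """Return every known location for `urn` (empty if unregistered)."""
        return list(self._locations.get(urn, []))


def is_contained(naming: NamingService, urn: str, repository: str) -> bool:
    """A DO is contained in a repository if its URN resolves to it."""
    return repository in naming.resolve(urn)


# Hypothetical example: one object held by two repositories.
naming = NamingService()
naming.register("urn:example:report-42", "repo-a.example.org")
naming.register("urn:example:report-42", "repo-b.example.org")
```

Because resolution returns a list, the same object can be contained in more than one repository, matching the "one or more physical locations" wording of the slide.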
13. Core Services of DL (II)
- Index service: discovery of DOs via query
  - DOs indexed by an index service may be distributed
  - Queries return result sets that contain the URNs of digital objects that match the query (and possibly other meta-information)
- Collection service: aggregation of access to sets of DOs and services into meaningful collections
- User interface services or gateways: human-centered entry points to the functionality of the digital library
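To make the index service's contract concrete, here is a minimal sketch in which a query returns a result set of URNs plus one piece of meta-information. The `IndexService` class, the field names, and the sample records are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class IndexService:
    """Toy index: maps each URN to a dict of metadata fields."""
    records: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def add(self, urn: str, metadata: Dict[str, str]) -> None:
        """Index one digital object by its URN."""
        self.records[urn] = metadata

    def query(self, field_name: str, term: str) -> List[Dict[str, str]]:
        """Return a result set: the URN and title of each matching object."""
        term = term.lower()
        return [
            {"urn": urn, "title": meta.get("title", "")}
            for urn, meta in sorted(self.records.items())
            if term in meta.get(field_name, "").lower()
        ]


# Hypothetical sample data.
index = IndexService()
index.add("urn:example:1", {"title": "Query Routing", "subject": "computer science"})
index.add("urn:example:2", {"title": "Medieval Maps", "subject": "history"})
```

The result set carries only names and light metadata; fetching the object itself would be a separate step through the naming and repository services.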
14. Modular Design -- Customized DLs
- Selection takes place independently at each level:
  - Creators of digital resources
  - Repository managers may adopt policies that implicitly select the digital objects that can be deposited into the repository.
  - Administrators of index servers select the digital objects that are indexed in that server.
  - Collection services apply broader (not digital-object-specific) selection mechanisms against the query interfaces of one or more index services.
  - User interface gateways select one or more collections that users can search over and access objects within.
15. Defining a Collection in a Distributed Digital Library
16. Collection Definition (I)
- A collection is logically defined as a set of criteria for selecting resources from the broader information space.
  - Static criteria
    - URNs or ISBNs
    - The set of resources that are stored in a specific repository
  - Dynamic criteria
    - Dublin Core subject element with the value "computer science"
    - Advanced natural language techniques
17. Collection Definition (II)
- A collection is operationally defined in terms of resource discovery
  - The resources in the collection are those that can be directly found using the collection's resource discovery tools.
- Collection-specific resource discovery tools have the following characteristics:
  - Direct queries only to those index servers that can return objects in the collection (query routing)
  - Employ filtering techniques to select only those objects in the respective index servers that fit the collection criteria
  - Employ resource discovery aids specialized for the collection
    - Domain-specific stop-word lists, stemming algorithms, thesauri
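The logical and operational definitions can be combined in one small sketch: a collection is a criterion plus a list of index servers to route to, and searching it means querying only those servers and filtering the hits. The class and function names, the record layout, and the sample servers are hypothetical. The subject match imitates a Dublin Core style subject value rather than parsing real Dublin Core records.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Criterion = Callable[[Record], bool]


def static_criterion(urns: List[str]) -> Criterion:
    """Static membership: an explicit list of URNs."""
    allowed = set(urns)
    return lambda rec: rec["urn"] in allowed


def subject_criterion(subject: str) -> Criterion:
    """Dynamic membership: match on a Dublin Core style subject value."""
    return lambda rec: rec.get("subject", "").lower() == subject.lower()


class Collection:
    def __init__(self, criterion: Criterion, routes: List[str]) -> None:
        self.criterion = criterion
        self.routes = routes  # names of the index servers worth querying

    def search(self, servers: Dict[str, List[Record]], term: str) -> List[str]:
        """Route the query to the selected servers, then filter the hits."""
        hits: List[str] = []
        for name in self.routes:  # query routing: skip all other servers
            for rec in servers.get(name, []):
                matches = term.lower() in rec.get("title", "").lower()
                # filtering: keep only matches that fit the collection criteria
                if matches and self.criterion(rec) and rec["urn"] not in hits:
                    hits.append(rec["urn"])
        return hits


# Hypothetical index servers and their records.
servers = {
    "index-a": [
        {"urn": "urn:x:1", "title": "Routing Queries", "subject": "computer science"},
        {"urn": "urn:x:2", "title": "Routing Rivers", "subject": "geography"},
    ],
    "index-b": [
        {"urn": "urn:x:3", "title": "Routing Mail", "subject": "computer science"},
    ],
}
cs = Collection(subject_criterion("computer science"), routes=["index-a"])
```

Note that no object is moved or modified: the collection exists only as a criterion and a routing list, so a second collection with a static URN list can be defined over the same servers at no cost to the first.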
18. Defining Collections -- Resource Discovery
19. Advantages
- Location and administrative independence
  - No linkage between the membership of a resource in a collection and its location in a repository or its collocation with other member objects.
  - Collections can be created, and subsequently shut down, on demand: resources do not need to be moved to physical locations; in fact, no changes need to be made to the objects themselves.
- Dynamic membership
- Extensibility: offers an opportunity to employ more dynamic and contextual criteria as they are developed
  - Ex.: criteria based on analysis of link topology
20. Collection Service Implementations
21. Dienst Collection Service
- Information to be accessed in the Dienst collection service
  - The list of publishing authorities that are part of the collection
  - The network location, e.g., foo.ncstrl.org, port 80
  - Meta-information about each of the index servers
    - Primary or secondary, last update of the index, performance information, etc.
  - Correspondence of index servers to repository servers
    - Provides the index servers with information on the repository servers from which they should download meta-information for indexing
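One way to picture the information listed above is as a small record structure. This is a sketch only, not the Dienst data model: the class names, the grouping of index servers by publishing authority, the date format, and all hosts other than foo.ncstrl.org are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class IndexServer:
    host: str          # network location, e.g. "foo.ncstrl.org"
    port: int
    primary: bool      # primary or secondary server
    last_update: str   # date of the last index update (format assumed)


@dataclass
class CollectionInfo:
    authorities: List[str]                        # publishing authorities
    index_servers: Dict[str, List[IndexServer]]   # authority -> index servers (assumed grouping)
    repositories: Dict[str, List[str]]            # index host -> repository hosts

    def servers_for(self, authority: str) -> List[IndexServer]:
        """Index servers for an authority, primary servers first."""
        servers = self.index_servers.get(authority, [])
        return sorted(servers, key=lambda s: not s.primary)


# Hypothetical collection description.
info = CollectionInfo(
    authorities=["example.cs"],
    index_servers={"example.cs": [
        IndexServer("backup.example.org", 80, False, "1998-10-01"),
        IndexServer("foo.ncstrl.org", 80, True, "1998-11-01"),
    ]},
    repositories={"foo.ncstrl.org": ["repo.example.org"]},
)
```

The `repositories` mapping plays the role of the index-to-repository correspondence: an index server would look up its own host to learn which repositories to harvest meta-information from.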
22. (Slide text unrecoverable; surviving terms: CS, Query, IS)
23. Limitations of Dienst
- Collection criteria are hard-wired
- The Dienst protocol and server implementation limit the ability of user interface servers to interact with more than one collection (and its associated set of sub-collections)
- The Dienst architecture incorrectly conflates the functions of the user interface service with query routing.
  - This limits our capacity to perform query routing that is highly collection-specific
24. CRADDL Collection Service
- Implemented as a set of distributed servers that act as a metadata repository for collection-specific information and that perform collection-specific query routing
- Each collection service maps to a single collection; in effect, a collection exists and is accessible in the digital library infrastructure if there is a collection service for it.
  - Collection service : collection = 1 : 1
25. Collection Service and User Interface Servers
- Main consumers of the CS: user interface gateways (UI)
  - A UI gives human-friendly access to one or more collections
- Interaction between UI and CS (through protocol requests)
  - The exchange of metadata about the collection
    - A collection description: name and free-text description
      - Assists UIs and users in choosing collections for queries
    - The elements of the collection hierarchy
    - Query capability and customization information
      - Facilitates the creation of collection-specialized query forms by a user interface
  - Submission of query requests and the return of corresponding result sets
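The metadata exchange above is what lets a UI build a collection-specialized query form. The sketch below shows the idea with a plain-text form. The `CollectionMetadata` fields and the sample values are hypothetical, since the slides do not give the protocol's actual message format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CollectionMetadata:
    """Hypothetical metadata a collection service returns to a UI."""
    name: str
    description: str
    sub_collections: List[str]   # elements of the collection hierarchy
    query_fields: List[str]      # query capability information


def build_query_form(meta: CollectionMetadata) -> List[str]:
    """Render a plain-text query form specialized for one collection."""
    lines = [f"Search: {meta.name}", meta.description]
    lines += [f"  {name}: ____" for name in meta.query_fields]
    if meta.sub_collections:
        lines.append("  restrict to: " + ", ".join(meta.sub_collections))
    return lines


# Hypothetical collection description.
meta = CollectionMetadata(
    name="Computer Science Reports",
    description="Technical reports in computer science.",
    sub_collections=["Theory", "Systems"],
    query_fields=["author", "title", "abstract"],
)
form = build_query_form(meta)
```

The UI holds no collection-specific knowledge of its own here: a different collection's metadata would yield a different form from the same function.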
26. Collection Service -- Customized UI
27. Components of CS
- Central collection server (CCS): central point of management of a collection
  - Responsible for the creation and modification of
    - Collection criteria
    - Index server tables: the set of index servers used for searches
    - Collection metadata: collection description, query capabilities
- One or more collection query routers (CQR)
  - Local, replicated access to collection metadata, for reliability
  - Query routing tailored for local conditions
    - Localized query routing and connectivity regions
  - UIs contact a CQR for collection metadata
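The division of labor between the CCS and the CQRs can be sketched as a master copy plus replicas that route by connectivity region. The version counter used for replication and the fallback to all servers are assumptions of this sketch, as are the class names and region labels. The slides do not say how CRADDL replicates or routes.

```python
from typing import Dict, List


class CentralCollectionServer:
    """Single point of management for a collection's index server table."""

    def __init__(self) -> None:
        self.version = 0
        self.index_table: Dict[str, List[str]] = {}   # region -> index servers

    def set_index_table(self, table: Dict[str, List[str]]) -> None:
        """Replace the table and bump the version so replicas can detect it."""
        self.index_table = {region: list(s) for region, s in table.items()}
        self.version += 1


class CollectionQueryRouter:
    """Holds a replicated copy and routes queries for one region."""

    def __init__(self, region: str) -> None:
        self.region = region
        self.version = 0
        self.index_table: Dict[str, List[str]] = {}

    def sync(self, ccs: CentralCollectionServer) -> bool:
        """Copy the CCS state if the versions differ; return True if updated."""
        if ccs.version == self.version:
            return False
        self.index_table = {r: list(s) for r, s in ccs.index_table.items()}
        self.version = ccs.version
        return True

    def route(self) -> List[str]:
        """Prefer index servers in the local region, else use all servers."""
        local = self.index_table.get(self.region)
        if local:
            return list(local)
        return sorted(s for servers in self.index_table.values() for s in servers)


# Hypothetical deployment: one CCS, one router in the "europe" region.
ccs = CentralCollectionServer()
ccs.set_index_table({"europe": ["idx-eu"], "us": ["idx-us"]})
cqr = CollectionQueryRouter("europe")
cqr.sync(ccs)
```

After the sync, the router answers from its local copy, so a UI in the same region keeps working even if the CCS is temporarily unreachable.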
28. Distributed collection service and connectivity regions