Title: MARIAN: Searching and Querying Across Heterogeneous Federated Digital Libraries
1MARIAN: Searching and Querying Across
Heterogeneous Federated Digital Libraries
- Marcos André Gonçalves
- Robert K. France
- Edward A. Fox
- Tamas E. Doszkocs
- Work performed at Virginia Tech, Blacksburg, VA, USA
- Support provided in part by NSF and the National Library of Medicine.
2JCDL 2001
- First Joint ACM/IEEE Conference on Digital Libraries (plus NSF DLI-2 PI meeting)
- http://www.jcdl.org
- June 24-28, 2001 in Roanoke, VA
- Conference Committee
- General Chair: Edward A. Fox, Virginia Tech
- Program Chair: Christine Borgman, UCLA
- Treasurer: Neil Rowe, Naval Postgraduate School
- Posters Chair: Craig Nevill-Manning, Rutgers U.
3Outline
- NDLTD
- Harvesting Strategies and the OAI
- MARIAN Middleware
- Generating Digital Libraries with 5SL
- Future Directions
4NDLTD (1 of 3)
- Context: Networked Digital Library of Theses and Dissertations, www.ndltd.org, www.theses.org
- Please join! Submit your (students') works!
- International federation of universities, libraries, supporting institutions (e.g., VTLS union catalog)
- Extremely heterogeneous
- Autonomy of management and decentralization
- Disparate protocols, metadata, repositories (e.g., UMI, OCLC's WorldCat), languages, encodings, user characteristics and preferences
5NDLTD (2 of 3)
- Worldwide organization: educational/social context
- National/regional projects in Australia, Catalunya, Germany, India, Latin America (UNESCO/OAS/ISTEC), South Africa (Mellon), USA (including OhioLINK)
- International conference (225 in March 2000, more expected for next, at Caltech)
- Steering committee representing supporting groups as well as the hundreds of universities
6NDLTD (3 of 3)
- Unique collection: discipline/document context
- Multilingual and multimedia content
- Large book-size documents
- Full-content in several formats (XML, PDF, etc.)
- Large number of bibliographic references
- Several sets of metadata of varying quality, which can fit with the Open Archives Initiative (www.openarchives.org)
7Harvesting Strategies
- Harvesting vs. Federated Search
- Harvesting plus Federated Search
- Plus local collections
- The NDLTD Union Collection
- Multiple Harvesting Protocols
- Harvest System
- Z39.50
- Dienst
- OAI
8Union Collection Architecture
9Open Archives Initiative (OAI)
- Interoperability standards released Jan/Feb
- Data / Service Providers
- Metadata Harvesting Protocol
- Unique identifiers (URNs) for each record
- Date-stamp for each record when last modified/created/deleted
- HTTP server with scripting capabilities
- 6 service requests (verbs)
- Identify, ListMetadataFormats, ListSets
- ListIdentifiers, GetRecord, ListRecords
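Since each verb is served by a plain HTTP server, a request is a URL with a verb and a few arguments. A minimal sketch of building such URLs follows. The base URL and record identifier are hypothetical, and the argument names (identifier, metadataPrefix) follow the published OAI protocol documents rather than anything shown on the slide, so treat them as assumptions.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Sketch only: builds HTTP GET URLs for OAI verbs.
// BASE_URL and the record identifier are hypothetical placeholders.
class OaiRequestSketch {

    static final String BASE_URL = "http://example.org/oai";

    /** Returns baseUrl?verb=...&name=value..., URL-encoding every value. */
    static String buildRequest(String baseUrl, String verb, Map<String, String> args) {
        StringJoiner query = new StringJoiner("&");
        query.add("verb=" + encode(verb));
        for (Map.Entry<String, String> arg : args.entrySet()) {
            query.add(encode(arg.getKey()) + "=" + encode(arg.getValue()));
        }
        return baseUrl + "?" + query;
    }

    static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] unused) {
        // A supporting request: no arguments beyond the verb.
        System.out.println(buildRequest(BASE_URL, "Identify", new LinkedHashMap<>()));

        // A harvesting request for one record (argument names are assumptions).
        Map<String, String> args = new LinkedHashMap<>();
        args.put("identifier", "oai:example.org:etd-1");
        args.put("metadataPrefix", "oai_dc");
        System.out.println(buildRequest(BASE_URL, "GetRecord", args));
    }
}
```

The same helper covers all six verbs; only the verb name and argument map change.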
10low-barrier interop umbrella
metadata
Herbert Van de Sompel
11OAI harvesting tools
- Service provider: harvester
- Data provider: repository
- Records: datestamp, identifier, set
Herbert Van de Sompel
12OAI harvesting tools
- Service provider: harvester
- Data provider: repository
- Supporting protocol requests
- Identify
- ListMetadataFormats
- ListSets
- Harvesting protocol requests
- ListRecords
- ListIdentifiers
- GetRecord
Herbert Van de Sompel
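The harvester/repository split can be sketched as an incremental harvest driven by record datestamps. This is an illustration under stated assumptions, not the NDLTD harvester: the Repository class is an in-memory stand-in for a data provider, no network calls are made, and all class names, identifiers, and dates are hypothetical.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Sketch only: a harvester that asks for records changed since its last run.
class IncrementalHarvestSketch {

    /** Minimal record header: identifier, datestamp, and set membership. */
    static final class RecordHeader {
        final String identifier;
        final LocalDate datestamp;
        final String set;

        RecordHeader(String identifier, LocalDate datestamp, String set) {
            this.identifier = identifier;
            this.datestamp = datestamp;
            this.set = set;
        }
    }

    /** Hypothetical in-memory data provider answering a ListIdentifiers-style request. */
    static final class Repository {
        private final List<RecordHeader> headers = new ArrayList<>();

        void add(RecordHeader header) {
            headers.add(header);
        }

        /** Returns headers stamped on or after 'from'; null means everything. */
        List<RecordHeader> listIdentifiers(LocalDate from) {
            List<RecordHeader> result = new ArrayList<>();
            for (RecordHeader h : headers) {
                if (from == null || !h.datestamp.isBefore(from)) {
                    result.add(h);
                }
            }
            return result;
        }
    }

    /** Service-provider side: remembers the date of the previous harvest. */
    static final class Harvester {
        private LocalDate lastHarvest; // null until the first run

        List<RecordHeader> harvest(Repository repository, LocalDate today) {
            List<RecordHeader> changed = repository.listIdentifiers(lastHarvest);
            lastHarvest = today;
            return changed;
        }
    }

    public static void main(String[] args) {
        Repository repo = new Repository();
        repo.add(new RecordHeader("oai:example.org:etd-1", LocalDate.of(2001, 1, 10), "theses"));
        repo.add(new RecordHeader("oai:example.org:etd-2", LocalDate.of(2001, 3, 5), "theses"));

        Harvester harvester = new Harvester();
        // First run fetches everything; the second only records stamped since the first run.
        System.out.println(harvester.harvest(repo, LocalDate.of(2001, 2, 1)).size());
        System.out.println(harvester.harvest(repo, LocalDate.of(2001, 4, 1)).size());
    }
}
```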
13Design Features
- Combined Harvesting, Federated Search, and Local Collections
- Object-Oriented Information Graph Representation
- 5S Model and 5SL Specification Language
14MARIAN Middleware
- Flexible Representation Model
- Information Graph
- Class Hierarchies
- Weights and Weighted Sets (with lazy evaluation)
- Class-Based Search
- Unified Searcher API
- Combining Heterogeneous Information
- Structural Matching
- Synthetic Superclasses
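One way to picture a weighted set with lazy evaluation is a set whose members are computed only when a caller first asks for them. The sketch below assumes that reading. The class names and the memoizing strategy are illustrative and are not taken from the MARIAN code base.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Supplier;

// Sketch only: a weighted set that defers computing its members until needed.
class LazyWeightedSetSketch {

    /** An object identifier paired with a weight in [0, 1]. */
    static final class Weighted {
        final String id;
        final double weight;

        Weighted(String id, double weight) {
            this.id = id;
            this.weight = weight;
        }
    }

    static final class LazyWeightedSet {
        private final Supplier<List<Weighted>> source; // the deferred search
        private List<Weighted> cache;                  // null until evaluated

        LazyWeightedSet(Supplier<List<Weighted>> source) {
            this.source = source;
        }

        boolean isEvaluated() {
            return cache != null;
        }

        /** Returns the k highest-weighted members, running the source on first use only. */
        List<Weighted> top(int k) {
            if (cache == null) {
                cache = new ArrayList<>(source.get());
                cache.sort(Comparator.comparingDouble((Weighted w) -> w.weight).reversed());
            }
            return new ArrayList<>(cache.subList(0, Math.min(k, cache.size())));
        }
    }

    public static void main(String[] args) {
        LazyWeightedSet results = new LazyWeightedSet(() -> {
            List<Weighted> found = new ArrayList<>();
            found.add(new Weighted("doc-1", 0.4));
            found.add(new Weighted("doc-2", 0.9));
            return found;
        });
        System.out.println(results.isEvaluated());    // nothing computed yet
        System.out.println(results.top(1).get(0).id); // forces evaluation
    }
}
```

Deferring evaluation means a searcher can hand back a result set cheaply and pay for scoring only if the caller actually reads it.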
15Information Graph Model (1/2)
- Each Information Object is a Node.
- Structure exposed through Links
- Features of interest can become Nodes, or can remain hidden within Node Class search methods.
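The node-and-link idea can be sketched in a few lines. In this illustration the author is promoted to a node of its own, reachable through a HasAuthor link, while the title stays hidden inside the thesis node as a plain label. The Node and Link classes are hypothetical, and the example author and title are made up.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: information objects as nodes, structure exposed through typed links.
class InfoGraphSketch {

    /** An information object: a class name, a label, and outgoing links. */
    static final class Node {
        final String nodeClass;
        final String label;
        final List<Link> links = new ArrayList<>();

        Node(String nodeClass, String label) {
            this.nodeClass = nodeClass;
            this.label = label;
        }
    }

    /** A typed, weighted link from one node to another. */
    static final class Link {
        final String linkClass;
        final Node target;
        final double weight;

        Link(String linkClass, Node target, double weight) {
            this.linkClass = linkClass;
            this.target = target;
            this.weight = weight;
        }
    }

    static void connect(Node from, String linkClass, Node to, double weight) {
        from.links.add(new Link(linkClass, to, weight));
    }

    /** Returns the targets of all links of the given class leaving 'from'. */
    static List<Node> follow(Node from, String linkClass) {
        List<Node> targets = new ArrayList<>();
        for (Link link : from.links) {
            if (link.linkClass.equals(linkClass)) {
                targets.add(link.target);
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        Node thesis = new Node("ThesisDissertation", "An Example Thesis Title");
        Node author = new Node("Individual", "Example Author");
        connect(thesis, "HasAuthor", author, 1.0);
        System.out.println(follow(thesis, "HasAuthor").get(0).label);
    }
}
```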
16Information Graph Model (2/2)
17Class-Based Search
- Common Search Methods
- Text
- Link / Weighted Link
- Node in Context
- Common Searcher Operations
- Match Best (weighted maximum)
- Match Most (summative union)
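The two searcher operations can be read as two ways of merging weighted result sets: Match Best keeps the highest weight seen for each object, and Match Most adds the weights up. The sketch below assumes that reading and represents a weighted set as a map from object id to weight. Any normalization the real operations apply is not shown on the slide and is left out here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: combining weighted result sets by maximum or by sum.
class MatchCombinersSketch {

    /** Match Best: each object keeps its highest weight across all sets. */
    static Map<String, Double> matchBest(List<Map<String, Double>> sets) {
        Map<String, Double> combined = new HashMap<>();
        for (Map<String, Double> set : sets) {
            for (Map.Entry<String, Double> e : set.entrySet()) {
                combined.merge(e.getKey(), e.getValue(), Math::max);
            }
        }
        return combined;
    }

    /** Match Most: each object's weights are summed across all sets. */
    static Map<String, Double> matchMost(List<Map<String, Double>> sets) {
        Map<String, Double> combined = new HashMap<>();
        for (Map<String, Double> set : sets) {
            for (Map.Entry<String, Double> e : set.entrySet()) {
                combined.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        return combined;
    }

    public static void main(String[] args) {
        Map<String, Double> titleHits = new HashMap<>();
        titleHits.put("doc-1", 0.5);
        titleHits.put("doc-2", 0.25);
        Map<String, Double> abstractHits = new HashMap<>();
        abstractHits.put("doc-1", 0.25);

        List<Map<String, Double>> sets = new ArrayList<>();
        sets.add(titleHits);
        sets.add(abstractHits);

        System.out.println(matchBest(sets).get("doc-1")); // 0.5
        System.out.println(matchMost(sets).get("doc-1")); // 0.75
    }
}
```

Match Best rewards the single strongest piece of evidence, while Match Most rewards objects that match in many places.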
18Class-Based Search
public interface ClassManager {
    public WtdObjSet match(InfoDesc description);
    public boolean isInClass(FullID id);
    public Object idToObject(FullID id);
    public Vector idsToObjects(Vector ids);
}
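To make the interface concrete, here is a toy class manager for a class of title strings. Only the four method names come from the slide. FullID, InfoDesc, and WtdObjSet are hypothetical stand-ins whose real definitions are not shown, the generic type parameters on Vector are an addition, and the word-overlap scoring is an assumption chosen for brevity.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Vector;

// Sketch only: a toy implementation of the ClassManager interface.
class ClassManagerSketch {

    // Hypothetical stand-ins for types named on the slide.
    static final class FullID {
        final int classId;
        final int instanceId;
        FullID(int classId, int instanceId) { this.classId = classId; this.instanceId = instanceId; }
        @Override public boolean equals(Object o) {
            return o instanceof FullID && ((FullID) o).classId == classId
                    && ((FullID) o).instanceId == instanceId;
        }
        @Override public int hashCode() { return 31 * classId + instanceId; }
    }

    static final class InfoDesc {
        final String text;
        InfoDesc(String text) { this.text = text; }
    }

    static final class WtdObjSet {
        final Map<FullID, Double> weights = new LinkedHashMap<>();
    }

    interface ClassManager {
        WtdObjSet match(InfoDesc description);
        boolean isInClass(FullID id);
        Object idToObject(FullID id);
        Vector<Object> idsToObjects(Vector<FullID> ids);
    }

    /** Manages a class of titles; weight = fraction of query words found in the title. */
    static final class TitleClassManager implements ClassManager {
        static final int CLASS_ID = 1; // assumed class number
        private final Map<FullID, String> titles = new LinkedHashMap<>();

        FullID add(String title) {
            FullID id = new FullID(CLASS_ID, titles.size() + 1);
            titles.put(id, title);
            return id;
        }

        public WtdObjSet match(InfoDesc description) {
            String[] queryWords = description.text.toLowerCase().split("\\s+");
            WtdObjSet result = new WtdObjSet();
            for (Map.Entry<FullID, String> e : titles.entrySet()) {
                String padded = " " + e.getValue().toLowerCase() + " ";
                int hits = 0;
                for (String word : queryWords) {
                    if (padded.contains(" " + word + " ")) hits++;
                }
                if (hits > 0) result.weights.put(e.getKey(), (double) hits / queryWords.length);
            }
            return result;
        }

        public boolean isInClass(FullID id) { return titles.containsKey(id); }

        public Object idToObject(FullID id) { return titles.get(id); }

        public Vector<Object> idsToObjects(Vector<FullID> ids) {
            Vector<Object> objects = new Vector<>();
            for (FullID id : ids) objects.add(idToObject(id));
            return objects;
        }
    }

    public static void main(String[] args) {
        TitleClassManager manager = new TitleClassManager();
        FullID first = manager.add("federated digital libraries");
        manager.add("belief networks");
        WtdObjSet hits = manager.match(new InfoDesc("digital libraries"));
        System.out.println(hits.weights.get(first)); // both query words match
    }
}
```

Because every class answers the same match call with a weighted set, a caller can search titles, links, or whole documents without knowing how each class scores its members.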
19Class-Based Search
20Combining Sources of Information
- Structural Matching
- Extends Weighted Retrieval to include Best Match to Document Structure
- Recursive, Extensible
- Collection Views
- Simple Interface to Complex Collections
- Common Interface to Diverse Collections
- Weighted Interface to Collections of Varying
Quality
21NDLTD Collection View (part)
- Figure: part of the NDLTD collection view (diagram; labels only)
- Superclass ThesisDissertation: title, description, HasAuthor (Individual), HasSubject (Subject)
- Weighted SubClasses links (weights 1.0, 0.9, 0.8) connect superclass elements to source-class elements
- Source class PhysDis-ETD (SOIF): dc.creator (HasDcCreator), dc.title, dc.description, dc.subject (HasDcSubject), crawlerTitle, crawlerDescription, HasCrawlerAuthor (Individual), Headings (HasHeadings), Keywords (HasKeywords), body
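A weighted collection view can be sketched as a table that maps one attribute of the common superclass to several source fields, each with a link weight reflecting how much that source is trusted. The pairing of weights with fields below (dc.title at 1.0, crawlerTitle at 0.8) is an assumption, since the diagram does not preserve which weight belongs to which link. Taking the maximum of weight times field score is likewise one plausible combination rule, not a confirmed one.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: scoring a view attribute through weighted links to source fields.
class CollectionViewSketch {

    /**
     * Returns the best weighted score for one view attribute:
     * the maximum over linked source fields of linkWeight * fieldScore.
     * Source fields with no score contribute nothing.
     */
    static double viewScore(Map<String, Double> linkWeights, Map<String, Double> fieldScores) {
        double best = 0.0;
        for (Map.Entry<String, Double> link : linkWeights.entrySet()) {
            Double score = fieldScores.get(link.getKey());
            if (score != null) {
                best = Math.max(best, link.getValue() * score);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // View attribute "title" of ThesisDissertation; weights are assumed.
        Map<String, Double> titleLinks = new LinkedHashMap<>();
        titleLinks.put("dc.title", 1.0);
        titleLinks.put("crawlerTitle", 0.8);

        // Hypothetical per-field match scores for one PhysDis-ETD record.
        Map<String, Double> fieldScores = new LinkedHashMap<>();
        fieldScores.put("crawlerTitle", 0.5);

        System.out.println(viewScore(titleLinks, fieldScores));
    }
}
```

Down-weighting crawler-derived fields lets a query against the view prefer curated metadata while still finding records that only a crawler has described.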
22 5S Model for Digital Libraries (1/2)
- Formal Model
- Streams
- Structures
- Spaces
- Services
- Societies
23 5S Model for Digital Libraries (2/2)
- NDLTD / MARIAN Example
- Document (presentable, indexable information object)
- Weighted Set (e.g., of results to a match operation)
- Collection Graph, Inheritance Lattice, Measure Space
- Adaptive Search, Query History Maintenance
- Library End-Users, DL Builders
- Formal Model
- Streams
- Structures
- Spaces
- Services
- Societies
24 5SL
- Generates Digital Library (Components)
25Generating Digital Libraries XML
26Interoperability with 5S and 5SL
- Reductionist / Constructivist Approach
- Compositional mappings between DLs
- Composition of S-based constructs
- Mapping language
27Student Projects to Integrate
- Schedule-driven Harvester
- SDI / Filtering for NDLTD
- MARIAN-Phronesis (Spanish, Monterrey) and work with German (Oldenburg / DFG), Portuguese, Chinese, Japanese, Korean
- TREC data formatted for loading
28Future Work
- Fusion on hybrid architecture
- Incorporation of belief networks
- Using 5SL to generate wrappers
- New services/functionalities
- Personalization (e.g., history, folders)
- Visualization (e.g., Envision applet)
- Integration with PetaPlex (100 nodes, 2.5 Tbytes disk capacity, > 300 Mbps to campus backbone, Sornil inversion)
29Conclusions
- NDLTD provides a real, fertile DL testbed.
- Harvesting strategies and the OAI
- MARIAN middleware: graphs, classes, views
- Generating Digital Libraries with 5SL
- Future: high-performance services, experimental comparisons