Title: Introduction to Integrated Search
1. Introduction to Integrated Search
- Thomas Place
- Presentation for Digital Libraries à la Carte 2009
- Friday, 31 July 2009
2. Overview
- A Personal History
- Users
- Common Architecture
- Concluding remarks
3. A Personal History
4. History of search @ Tilburg University
- 1988 Searching only in the library
- 1992 Searching moves to the desktop
- 1995-1997 Homogeneous search interface
- 2001 Metasearch plus dynamic linking
- 2009 Integrated search
- 201? Searching completely in the cloud
5. 1988: Searching only in the library
- Psychological Abstracts: print index (appeared 1927-2006)
- Social Sciences Citation Index: print index (appeared 1973/4 - ????)
- OPAC (Online Public Access Catalogue) terminals in the library
- Stand-alone PC in the library with CD-ROMs: PsycLit, SSCI
- (In 1985 my first PC)
- Each week: Current Contents, a print journal with the tables of contents of journals
- Each month: exhibition of newly acquired books
- Browsing the shelves
6. 1992: Searching moves to the desktop
- New library building
- LBS3 of Pica (now OCLC)
- New-generation ILS; the core is still operational in 2009
- Building of the database from the union catalogue took weeks; transfer by tapes
- Updates online
- OPAC accessible via the Internet (telnet)
- Tilburg was the first Dutch university with a Campus Wide Information System (1991), with entry points for the local bibliographical databases:
- Catalogue
- Excerpta Informatica
- Online Contents: journal articles
- Student theses
- Attent: reports in economics
- Brabant database
- and for external databases on the Internet.
- CD-ROMs available via the campus network
7. 1995-1997: Homogeneous search interface
- All local databases (Trip) have a Z39.50 interface; the exception is the catalogue
- Z39.50 MS Windows client (Kwik)
- Soon replaced by a Web application (Trix)
- Homogeneous access to internal and external Z39.50 databases via a Web browser (Netscape)
- Each database was, however, searched separately, as in 1988 with the print indexes.
- Users didn't understand that Catalogue is for books and journals, Online Contents is for articles, etc. The default selection is the first database in the list.
8. One Interface
[Diagram: Homogeneous user interface. Labels: Z39.50]
9. One Interface
[Diagram: Metasearch. Labels: XML, Federator, Z39.50, SRU]
10. 2001: Metasearch plus dynamic linking
- European project Decomate II
- Commercialization by OCLC PICA
- Software development by Tilburg University
- No longer on the market; other products are
- First Dutch implementation of metasearch; still running.
- Database lists, homogeneous user interface for SRU/Z39.50 databases, metasearch, de-duplication, dynamic linking to full text (OpenURL resolving), book shelves, current awareness services
- Local databases only available via the user interface of iPort
- User interface conforms to house style (demo)
11. Problems with metasearching
- The performance is sometimes disappointing (no Google-like performance).
- The presentation of the information is not optimal (merging, sorting).
- Users find it difficult to select the right databases for a federated search. As a workaround they select all databases, which slows the search down and increases the noise in the search results.
- Users don't know how to formulate the best queries for the databases they have chosen. In many cases this is not even possible, because a query that is optimal for one database is not optimal for another database the user also wants to search (indexes differ across databases).
12. One Interface
[Diagram: Homogeneous user interface. Labels: Z39.50]
13. One Interface
[Diagram: Metasearch. Labels: XML, Federator, Z39.50, SRU]
14. One Interface
[Diagram: Integrated search. Labels: SRU, XML, OAI-PMH]
15. 2009: Integrated search
- The starting point is no longer the page with databases, but the search box.
- No database selection, just search (demo)
- Technical solution: Meresco of CQ2
- Open source
- We work together with TU Delft, which also implements Meresco (Discover)
- The Meresco infrastructure is also used for special services, e.g., Economists Online
16. What are our goals?
- To be THE one and only search engine of Tilburg University
- Searching scientific information (library) AND non-scientific information (website, learning material)
- A query leads (in the future) to:
- Relevant documents and web pages (Meresco)
- Experts (expert finding system developed by a master's student)
- Specialised databases (Purple search, metasearch application of the University of Groningen)
- Finding of documents: no more clicking through to the full record display; the most important information is presented directly in the result list
- Informing the user about the search results: facets, clusters
- Added value: add-ons / mash-ups, integration in the workflow
18. Components
- Information resources
- Ingest
- Search engine
- Presentation and integration of external services
19. [Architecture diagram. Labels: NEEO institutional repositories, other economics repositories, RePEc, logs, metadata, objects, OAI-PMH, HTTP, crawler, harvester, metadata enrichment server, gateway, SRU, search engine, RSS/Atom, portlet, portal, publication list generator, Ajax server. Legend: service component, data, subcomponent, protocol]
20. Information resources
- Repositories with OAI-PMH interface
- Local databases (IR, Student theses, Online Contents, ...)
- SHARED repositories with metadata of publishers
- Elsevier repository @ UvT
- External repositories
- RePEc
- IRs (e.g., NEEO)
- ...
- GGC: Dutch Shared Cataloguing System with OUF/SRU interface (catalogue records)
- ...
21. Ingest
- Meresco harvester
- OAI-PMH: from repositories to harvester
- SRU Update: from harvester to search engine
- Inbox
- Pica records from GGC go into the inbox
- Records are fetched from the inbox by the search engine
- Records are stored in their original format in the database of the search engine
- If the original is not MODS, a MODS conversion is stored alongside the original record, so no dynamic conversion is needed for indexing and presentation
- Parts (e.g., ratings, annotations or full text) can be added to the record
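The OAI-PMH step of the ingest can be illustrated with a minimal sketch. It is not the Meresco harvester: the function names, the base URL and the sample response are assumptions made for illustration. Only the request arguments (`verb`, `metadataPrefix`, `resumptionToken`) and the XML namespace come from the OAI-PMH protocol itself.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH repository."""
    if resumption_token:
        # resumptionToken is an exclusive argument: it replaces metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def next_token(response_xml):
    """Return the resumptionToken of a response, or None when the list is complete."""
    root = ET.fromstring(response_xml)
    token = root.find(f".//{OAI_NS}resumptionToken")
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()


# Hypothetical, heavily shortened response used only to exercise next_token().
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

first_url = list_records_url("https://repo.example.org/oai")
follow_up_url = list_records_url("https://repo.example.org/oai",
                                 resumption_token=next_token(SAMPLE))
```

A harvester would fetch `first_url`, pass the records on, and keep requesting follow-up URLs until `next_token` returns `None`.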
22. Meresco search engine
- Lucene
- XML-based: all paths in the tree can be indexed
- Powerful faceting engine (not Lucene)
- Search term suggestions
- Clustering (sort of)
- Indexing of full text
- Has its own GUI
- But integration via SRU with other front ends (e.g., Economists Online) is possible
- Flexible: write your own pluggable components in Python
- UvT develops tools for configuration by Functional Application Managers (information specialists)
23. Integration of external services (UvT)
- Place locator
- OpenURL resolver
- No menus: OpenURL in, XML out
- Info about location as specific as possible
- Connection with ILS for availability info (need for standards: DLF)
- Is called from the results list (Ajax)
- Journal covers (local server)
- Book covers (Syndetics)
- More to come
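As an illustration of "OpenURL in", the sketch below builds an OpenURL 1.0 request in key/encoded-value form for a journal article. The resolver address, the function name and the sample values are hypothetical; the keys (`url_ver`, `rft_val_fmt`, `rft.issn`, ...) are those of the Z39.88-2004 journal format. How the place locator at UvT actually accepts requests is not described in the slides.

```python
from urllib.parse import urlencode, parse_qs, urlsplit


def journal_article_openurl(resolver_base, issn, volume, start_page, year):
    """Build an OpenURL 1.0 (KEV) link for a journal article."""
    params = {
        "url_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
        "rft.genre": "article",
        "rft.issn": issn,
        "rft.volume": volume,
        "rft.spage": start_page,
        "rft.date": year,
    }
    return resolver_base + "?" + urlencode(params)


# Hypothetical resolver and placeholder citation data.
url = journal_article_openurl("https://resolver.example.org/openurl",
                              issn="0000-0000", volume="12",
                              start_page="345", year="2009")

# Decoding the query string again shows what the resolver would receive.
fields = parse_qs(urlsplit(url).query)
```

A results list could attach such a link to every record and let the browser call the resolver through Ajax.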
24. What is now (June 2009) in the search engine?
25. What will be added?
26. Users (Delft)
- Students lack an overview of the domain in which they search. They are inexperienced searchers and don't know the terminology of the disciplines in which they search. The challenge for students is to find structure in the chaos of information.
- Students search without a clear plan. They want to be able to revisit earlier search paths. This is not well supported by present systems.
- When students start searching they have no clear idea of what they are searching for. During the search process their information need gradually becomes clearer and they discover the relevant search terms.
- For students it is difficult to verify the trustworthiness of the information that they find while searching.
- Students don't know RSS.
- The way students search is not very well organised. They change strategies and goals. They are very receptive to unexpected results (serendipity), which give them new leads for finding more information.
27. Metalib statistics of the University of Groningen
50 zero or false results
- Misspellings and typos in search terms
- Picking databases at random
- Unable to understand QuickSearch, MetaSearch, Find Database
- Using the wrong search keys
- Using search keys wrong
- Using Dutch search terms in English language databases
- Using non-specific terms, phrases that are too broad
- Lack of understanding of Boolean logic or database peculiarities
28. Common architecture
- Data layer
- Search layer: in most cases Lucene as core (Omega of Utrecht University: Autonomy)
- Presentation layer
30. [Diagram: Primo. Labels: Search Engine, Import/Data APIs, data, Publishing Platform, Harvesting (OAI-PMH, ..), Source Repository]
32. Data layer
- Collection of metadata and documents from external sources by:
- OAI-PMH harvesting
- downloading from CDs or DVDs
- FTP get
- SRU/Z39.50 requests
- Cleaning of the metadata (e.g., repairing invalid XML)
- Adding metadata elements: local data, subject info. Also availability info?
- Merging metadata (e-holdings and print holdings of the same journals; expressions and manifestations of the same work: FRBR)
- Conversion to a standard XML format (PNX, MODS, MARC21): proprietary vs. standardized formats. What is stored? Original and/or converted records? Or nothing, or only the external record location?
- Adding admin info: source, ingest date, access rights
- Fetching documents and adding (ASCII or XML version of) full text to the records
- Processing of data generated by users: tags, annotations, ratings. User-generated data: external (shareable) or internal (non-shareable) data
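The merging step can be sketched for the simplest case: holdings records of the same journal that carry the same ISSN in different spellings. The record layout (`issn`, `title`, `holding`), the function names and the sample data are assumptions for illustration. A real data layer would also have to link print and electronic ISSNs of one journal, for which a service such as xISSN could be used.

```python
def normalize_issn(issn):
    """Reduce an ISSN to a comparison key: no hyphen, upper-case check digit."""
    return issn.replace("-", "").strip().upper()


def merge_holdings(records):
    """Group holdings statements under one entry per (normalized) ISSN."""
    merged = {}
    for record in records:
        key = normalize_issn(record["issn"])
        entry = merged.setdefault(
            key, {"issn": key, "title": record["title"], "holdings": []}
        )
        entry["holdings"].append(record["holding"])
    return list(merged.values())


# Hypothetical holdings records with placeholder ISSNs.
records = [
    {"issn": "0000-000X", "title": "Example Journal", "holding": "print 1990-2005"},
    {"issn": "0000000x", "title": "Example Journal", "holding": "online 1998-"},
    {"issn": "1111-1111", "title": "Another Journal", "holding": "online 2001-"},
]
merged = merge_holdings(records)
```

After merging, the first entry carries both the print and the online holding, so the result list can show one journal instead of two.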
33. Data layer
- Sharing of data
- Processing of publisher data at one place, indexing at many places
- Sharing of annotations, tags and ratings (?)
- Issues
- What is stored?
- Pre-processing in the data layer versus post-processing in the presentation layer; static data versus dynamic data; data generated during post-processing can't be indexed
34. Search layer
- Indexing of records from the data layer
- Loading: SRU Update or batch mechanism
- Filters, analyzers
- Index definitions (Lucene document format)
- Separate indexes for faceting, search suggestions and/or clustering? Or use Solr?
- Searching in the index(es)
- Search results, including facets and clusters, in XML
- SRU
- RSS
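A front end talks to such a search layer with plain HTTP requests. The sketch below builds an SRU searchRetrieve URL; the endpoint, the function name and the example query are hypothetical, while the parameter names (`operation`, `version`, `query`, `startRecord`, `maximumRecords`, `recordSchema`) are those of SRU 1.1. Which version and record schemas a given engine supports is an assumption to check against that engine.

```python
from urllib.parse import urlencode, parse_qs, urlsplit


def sru_search_url(base_url, cql_query, start_record=1, maximum_records=10,
                   record_schema=None):
    """Build an SRU searchRetrieve request URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,
        "startRecord": start_record,
        "maximumRecords": maximum_records,
    }
    if record_schema:
        params["recordSchema"] = record_schema
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint; the second page of ten results for a two-term query.
url = sru_search_url("https://search.example.org/sru",
                     'integrated and "search engine"',
                     start_record=11, record_schema="mods")

# Decoding the query string again shows the request as the engine receives it.
fields = parse_qs(urlsplit(url).query)
```

Paging is done by the client: `startRecord` is 1-based, so the second page of ten starts at record 11.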
35. Search layer
- Sharing of indexes
- (1) One central index with subcollections
- (2) Distributed index: standardization of index definitions
- (3) Exchanging of indexes
- 1 is possible but requires organisation
- 2 and 3 are probably technically possible, but I don't know of successful examples
- Issues
- Standard search interfaces: SRU, ...
36. Presentation layer
- Web application that sends the (converted) user query (HTTP request) to the search engine and receives search results in XML
- Processing of XML and returning an HTTP response to the browser
- For dynamic content, the browser is responsible (Ajax). E.g., availability info
- Possible modules
- Query parser: Google-like queries to CQL
- OpenURL generator
- Tag cloud builder
- Authorisation module with access rules; authentication is external: support of SAML (A-Select, Shibboleth)
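The query parser module can be sketched as follows. This is a deliberately small illustration, not the parser used at Tilburg: it assumes that a Google-like query means "all words must match", that quoted phrases stay together, and that a leading minus excludes a term. The function name and the token pattern are assumptions. CQL's `and` and `not` operators, and quoting with double quotes and backslash escapes, come from the CQL specification.

```python
import re

CQL_RESERVED = {"and", "or", "not", "prox"}
# A token is an optionally negated quoted phrase, or a run of non-space characters.
TOKEN = re.compile(r'-?"[^"]*"|\S+')


def to_cql(user_query):
    """Translate a Google-like query into a CQL query string."""
    required, excluded = [], []
    for token in TOKEN.findall(user_query):
        negated = token.startswith("-") and len(token) > 1
        term = (token[1:] if negated else token).strip('"')
        if not term:
            continue
        # Quote anything that is not a plain word, and words reserved in CQL.
        if not re.fullmatch(r"\w+", term) or term.lower() in CQL_RESERVED:
            escaped = term.replace("\\", "\\\\").replace('"', '\\"')
            term = f'"{escaped}"'
        (excluded if negated else required).append(term)
    if not required:
        raise ValueError("at least one positive search term is needed")
    cql = " and ".join(required)
    for term in excluded:
        cql += f" not {term}"  # CQL 'not' is binary: "a not b" means a and-not b
    return cql


example = to_cql('digital libraries "open access" -patents')
```

Here `example` is `digital and libraries and "open access" not patents`. Field searches, `OR` and parentheses are left out of the sketch.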
37. Presentation layer
- Integration of external services. The application must allow for easy integration of external web services
- Recommender systems like Purple search, metasearch application of the University of Groningen
- Personalised services, e.g.,
- Current awareness service: storage of profiles (or is RSS sufficient?)
- E-shelves, shopping cart: permanent storage? Sharing of e-shelves.
- Tagging, annotations, ratings. Sharing
- Location services: integrating OpenURL resolver and Circulation Control of ILS. Issue: standardized access to availability information of ILS.
- Federated search server
- Amazon (book covers, book reviews) / Syndetics (book covers, reviews, tables of contents)
- Google Books
- Web of Science: impact factor (or new service of Ex Libris)
- Export services
- xISSN (OCLC): get all related ISSNs. Can also be used during preprocessing in the data layer
- TicTOCs: Journal Tables of Contents Service
38. Concluding remarks
- Just search, no database selection
- Integrated search systems must give guidance to the user: facets, clusters, suggestions, recommendations, ...
- Sharing of resources requires a common architecture, common APIs, common standards