Title: Development of the Fedora Generic Search Service
1. Development of the Fedora Generic Search Service
- by Gert Schmeltz Pedersen
- gsp_at_dtv.dk
Fedora Users Conference, June 19-20, 2006
2. Development of the Fedora Generic Search Service
- Contents
- Background
- The DEF-XWS project
- Zebra at work
- Lucene in action
- Approach and current requirements
- Current prototype (fedoragsearch)
- Architectural snapshots
- Configuration and customization
- Further work
3. Background - DEF-XWS Eprints
4. Background - DEF-XWS Eprints
ZebraForFedora, a Zebra-based module for Fedora (http://www.indexdata.dk/zebra).
Purpose: to obtain powerful text index and search functionality and performance.
The original text index and search functionality in Fedora is simple SQL on a table,
where DC element texts are stored in fields.
ZebraForFedora is a set of Java classes that is deployed over existing Fedora and
Zebra installations by running an Ant target.
In the Fedora configuration file:
    <module role="fedora.server.search.FieldSearch"
            class="dk.defxws.eprints.fedora.server.search.FieldSearchZebraModule">
      <comment>Instead of fedora.server.search.FieldSearchSQLModule</comment>
    </module>
    <datastore id="zebra">
      <comment>Zebra server</comment>
      <param name="host" value="defxws.cvt.dk"/>
      <param name="port" value="9395"/>
    </datastore>
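A client of this configuration needs only the host and port of the Zebra server. The following is a minimal sketch of reading those two parameters, written in Python rather than the Java used by the module itself. The helper name `read_datastore` is hypothetical, and the fragment simply repeats the values shown on the slide.

```python
import xml.etree.ElementTree as ET

# Datastore fragment repeated from the configuration excerpt on the slide.
DATASTORE_XML = """
<datastore id="zebra">
  <comment>Zebra server</comment>
  <param name="host" value="defxws.cvt.dk"/>
  <param name="port" value="9395"/>
</datastore>
"""


def read_datastore(xml_text):
    """Return (id, params) for a <datastore> element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    # Collect every <param name="..." value="..."/> child into a dict.
    params = {p.get("name"): p.get("value") for p in root.findall("param")}
    return root.get("id"), params


datastore_id, params = read_datastore(DATASTORE_XML)
zebra_address = (params["host"], int(params["port"]))
```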
5. Background - DEF-XWS Eprints
6. Background - DEF-XWS Eprints
- Purpose achieved
- Fedora hands-on and experience
- web services hands-on and experience
- DEF-XWS Eprints available from web services
- http://defxws.cvt.dk:8082/fedora/access/soap?wsdl
- http://defxws.cvt.dk:8082/fedora/accessDEF-XWS/soap?wsdl
- ready for 3-layered system architecture
- applications combining many web services
- Lesson
- Do not override field search,
- provide generic search service instead ...
7. Background - DEF-XWS Eprints
[Diagram labels: Generic, Zebra, ..., Lucene]
- Core Fedora Repository Service
- new services are deployed as web applications (.war files), with a configuration file.
- The Generic Search Service shall be a webapp, configurable to use an existing Fedora
  repository and an existing installation of an indexing and searching engine, like
  Zebra, Lucene, and others.
- Functionality to be decided by a working group of Fedora users and developers.
8. Zebra at work
- Features
- Zebra is written in portable C, so it runs on most Unix-like systems as well as Windows NT.
- Z39.50 protocol support, recently also SRW/SRU and CQL
- Modules zebraidx and zebrasrv
- Searching supports a powerful combination of boolean queries as well as
  relevance-ranking (free-text) queries. Truncation, masking, full regular expression
  matching and "approximate matching" (e.g. spelling mistakes) are all handled.
- Configurable to understand many input formats: SGML, XML, ISO2709 (MARC), raw text.
- Arbitrarily complex records.
- Robust updating: records can be added and deleted on the fly.
- Very large databases: logical files can be automatically partitioned over multiple disks.
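Since Zebra answers SRU requests with CQL queries, a search can be expressed as a plain HTTP URL. The sketch below only builds such a URL. The parameter names (`version`, `operation`, `query`, `maximumRecords`) come from the SRU 1.1 specification, while the base URL, the index name `dc.title`, and the function name are assumptions for illustration; which CQL indexes a given Zebra server accepts depends on its configuration.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql_query, maximum_records=10):
    """Build an SRU 1.1 searchRetrieve URL (hypothetical helper)."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": cql_query,  # CQL text, URL-encoded by urlencode below
        "maximumRecords": str(maximum_records),
    }
    return base_url + "?" + urlencode(params)


# Hypothetical server address and index name.
url = sru_search_url("http://localhost:9999/", 'dc.title = "information retrieval"')
```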
9. Zebra at work
- The following is an excerpt from the abstract syntax file for the GILS profile.

    name gils
    reference GILS-schema
    attset gils.att
    tagset gils.tag
    varset var1.var
    maptab gils-usmarc.map

    # Element set names
    esetname VARIANT gils-variant.est  # for WAIS-compliance
    esetname B gils-b.est
    esetname G gils-g.est
    esetname F @

    elm (1,10) rank                   -
    elm (1,12) url                    -
    elm (1,14) localControlNumber     Local-number
    elm (1,16) dateOfLastModification Date/time-last-modified
    elm (2,1)  title                  w:!,p:!
    elm (4,1)  controlIdentifier      Identifier-standard
    elm (2,6)  abstract               Abstract

Z39.50 configuration
- This is an excerpt from the GILS attribute set definition. Notice how the file
  describing the bib-1 attribute set is referenced.

    name gils
    reference GILS-attset
    include bib1.att

    att 2001 distributorName
    att 2002 indextermsControlled
    att 2003 purpose
    att 2004 accessConstraints
    att 2005 useConstraints
10. Lucene in action
- How to integrate Lucene into your applications
- Ready-to-use framework for rich document handling
- Case studies including Nutch, TheServerSide, jGuru, etc.
- Lucene ports to Perl, Python, C#/.Net, and C++
- Sorting, filtering, term vectors, multiple, and remote index searching
- The new SpanQuery family, extending query parser, hit collecting
- Performance testing and tuning
- Lucene add-ons (hit highlighting, synonym lookup, and others)
- Foreword by Doug Cutting, the inventor of Lucene
11. Lucene in action
Fedora AND title:"Information retrieval" AND creator:Staples
http://lucene.apache.org/java/docs/queryparsersyntax.html
[Diagram labels: Document, Field]
- Figure 1.5 A typical application integration with Lucene
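The query on this slide combines a free-text term with two fielded clauses. As a small sketch of how a client might assemble such a string, the helpers below (hypothetical names, not part of Lucene) quote multi-word values as phrases and join clauses with AND. They do not escape the other special characters of the query parser syntax.

```python
def field_clause(field, value):
    """Render field:value, quoting values that contain whitespace as a phrase."""
    if any(ch.isspace() for ch in value):
        # Escape embedded double quotes, then wrap the phrase in quotes.
        value = '"' + value.replace('"', r'\"') + '"'
    return f"{field}:{value}"


def and_query(*clauses):
    """Join clauses with the AND operator of the Lucene query syntax."""
    return " AND ".join(clauses)


query = and_query(
    "Fedora",
    field_clause("title", "Information retrieval"),
    field_clause("creator", "Staples"),
)
```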
12. Approach and Current Requirements
- Do iterations of requirements analysis and prototype development
- Distinguish between requirements and prototype properties
- Current requirements
- allow various indexing-and-search engines to be configured or plugged in, initially
  Lucene and Zebra
- implement as a webapp within the Fedora Service Framework
- make it installable independently from Fedora, so no editing of Fedora code or configs
  is involved, and no mix of files in the same directories occurs (however, a
  notification mechanism is needed in Fedora 2.2)
- allow indexing of, and search in, all information in FOXML records for FedoraObjects,
  including full texts in datastreams and disseminator results
- define interface for a set of operations, provide REST and SOAP access
- basic operations
- updateIndex - indexing the contents of the Fedora repository
- gfindObjects - search similar to Fedora findObjects
- secondary operations
- browseIndex - browsing terms in a given index
- getRepositoryInfo - describing the properties of a repository
- getIndexInfo - describing the properties of an index
- allow multiple repositories to be indexed in one and the same index
- allow multiple indexes to be generated from one repository
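The requirements above amount to one interface with interchangeable engines behind it. The following sketch illustrates that idea with a toy in-memory engine standing in for Lucene or Zebra. The operation names follow the slide; the class names, method signatures, and the `field:term` query form are assumptions, and getRepositoryInfo and getIndexInfo are left out for brevity.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class SearchEngine(ABC):
    """Operations a plugged-in engine would implement (names from the slide)."""

    @abstractmethod
    def update_index(self, repository, records): ...

    @abstractmethod
    def gfind_objects(self, query): ...

    @abstractmethod
    def browse_index(self, field): ...


class InMemoryEngine(SearchEngine):
    """Toy stand-in for Lucene or Zebra: case-insensitive whole-word match."""

    def __init__(self):
        # field name -> term -> set of (repository, pid)
        self._postings = defaultdict(lambda: defaultdict(set))

    def update_index(self, repository, records):
        """Index {pid: {field: text}} records from one repository."""
        for pid, fields in records.items():
            for field, text in fields.items():
                for term in text.lower().split():
                    self._postings[field][term].add((repository, pid))
        return len(records)

    def gfind_objects(self, query):
        """Answer a single 'field:term' query with sorted (repository, pid) hits."""
        field, _, term = query.partition(":")
        return sorted(self._postings.get(field, {}).get(term.lower(), set()))

    def browse_index(self, field):
        """List the indexed terms of one field in sorted order."""
        return sorted(self._postings.get(field, {}))


# Two repositories indexed into one and the same index.
engine = InMemoryEngine()
engine.update_index("repoA", {"demo:21": {"dc.title": "Advanced FO Sample"}})
engine.update_index("repoB", {"test:1": {"dc.title": "Sample record"}})
hits = engine.gfind_objects("dc.title:sample")
terms = engine.browse_index("dc.title")
```

Keeping the repository name in each hit is one way to let several repositories share an index while results stay traceable to their source.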
13. Current prototype - updateIndex

    <foxml:digitalObject PID="demo:21">
      <foxml:objectProperties>
        <foxml:property NAME="http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
                        VALUE="FedoraObject"/>
        <foxml:property NAME="info:fedora/fedora-system:def/model#state"
                        VALUE="Active"/>
        <foxml:property NAME="info:fedora/fedora-system:def/model#label"
                        VALUE="Sample Document Object (FO to PDF)"/>
        <foxml:property NAME="info:fedora/fedora-system:def/model#contentModel"
                        VALUE="FO_TO_PDFDOC"/>
      </foxml:objectProperties>
      <foxml:datastream ID="DC" STATE="A" CONTROL_GROUP="X" VERSIONABLE="true">
        <foxml:datastreamVersion ID="DC1.0"
                                 LABEL="Dublin Core for the Document object"
                                 CREATED="2006-05-16T10:23:48.376Z"
                                 MIMETYPE="text/xml" SIZE="606">
          <foxml:xmlContent>
            <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                       xmlns:dc="http://purl.org/dc/elements/1.1/">
              <dc:title>Advanced FO Sample from Apache FOP Distribution</dc:title>
              <dc:creator>Apache Group</dc:creator>
            </oai_dc:dc>
          </foxml:xmlContent>
        </foxml:datastreamVersion>
      </foxml:datastream>
    </foxml:digitalObject>

transformation

    <IndexDocument>
      <IndexField IFname="PID">demo:21</IndexField>
      <IndexField IFname="property.type">FedoraObject</IndexField>
      <IndexField IFname="property.state">Active</IndexField>
      <IndexField IFname="property.contentModel">FO_TO_PDFDOC</IndexField>
      <IndexField IFname="dc.title">Advanced FO Sample</IndexField>
      <IndexField IFname="dc.creator">Apache Group</IndexField>
      <IndexField index="TOKENIZED" dsId="DS1" IFname="DS1.text"/>
    </IndexDocument>
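The transformation step maps FOXML properties and Dublin Core elements to named index fields. The prototype does this with a stylesheet (demoFoxmlToLucene.xslt is listed under customization); the sketch below swaps in Python's ElementTree to show the same mapping idea on a shortened record. The function name is hypothetical, the `xmlns:foxml` declaration is not shown on the slide and is added here so the sample parses, and datastream text fields such as DS1.text are not handled.

```python
import xml.etree.ElementTree as ET

FOXML_NS = "info:fedora/fedora-system:def/foxml#"  # not shown on the slide
DC_NS = "http://purl.org/dc/elements/1.1/"

# Shortened version of the slide's demo:21 record.
FOXML = """
<foxml:digitalObject PID="demo:21"
    xmlns:foxml="info:fedora/fedora-system:def/foxml#">
  <foxml:objectProperties>
    <foxml:property NAME="info:fedora/fedora-system:def/model#state"
                    VALUE="Active"/>
  </foxml:objectProperties>
  <foxml:datastream ID="DC" STATE="A" CONTROL_GROUP="X">
    <foxml:datastreamVersion ID="DC1.0" MIMETYPE="text/xml">
      <foxml:xmlContent>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Advanced FO Sample from Apache FOP Distribution</dc:title>
          <dc:creator>Apache Group</dc:creator>
        </oai_dc:dc>
      </foxml:xmlContent>
    </foxml:datastreamVersion>
  </foxml:datastream>
</foxml:digitalObject>
"""


def foxml_to_index_document(foxml_text):
    """Build an <IndexDocument> element from a FOXML record (hypothetical)."""
    obj = ET.fromstring(foxml_text)
    doc = ET.Element("IndexDocument")

    def add_field(name, text):
        field = ET.SubElement(doc, "IndexField", IFname=name)
        field.text = text

    add_field("PID", obj.get("PID"))
    for prop in obj.iter(f"{{{FOXML_NS}}}property"):
        # ".../model#state" becomes the field name "property.state".
        local_name = prop.get("NAME").rsplit("#", 1)[-1]
        add_field("property." + local_name, prop.get("VALUE"))
    for elem in obj.iter():
        if elem.tag.startswith(f"{{{DC_NS}}}"):
            # "{namespace}title" becomes the field name "dc.title".
            add_field("dc." + elem.tag.split("}", 1)[1], (elem.text or "").strip())
    return doc


index_doc = foxml_to_index_document(FOXML)
fields = [(f.get("IFname"), f.text) for f in index_doc]
```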
14. Current prototype - gfindObjects
15. Current prototype - gfindObjects
16. Current prototype - getRepositoryInfo
17. Current prototype - getIndexInfo
18. Architectural snapshots - basic - fedoragsearch
- Contents
- Lucene
- Zebra
- fedoragsearch
- REST demo
- architecture
- installation and configuration
- further customizations
19. Architectural snapshots - indexing - many-to-many
20. Configuration and customization
- Configuration examples
- fedoragsearch.properties
    soapBase = http://HOST:PORT/fedoragsearch/services
    repositoryNames = REPOSNAMES
    indexNames = INDEXNAMES
    mimeTypes = MIMETYPES
- INDEXNAME/index.properties
    operationsImpl = dk.defxws.fgslucene.OperationsImpl
    defaultQueryFields = dc.description dc.title
- REPOSNAME/repository.properties
    soapBase = http://FEDORAHOST:PORT/fedora/services
    fedoraObjectDir = FEDORAOBJECTDIR
- Customization examples
- demoFoxmlToLucene.xslt
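The configuration is layered: the main properties file names the indexes and repositories, and each name has its own directory with a properties file. The sketch below shows how that layout could be resolved. It assumes `key = value` lines and whitespace-separated name lists (as in defaultQueryFields); the function names and all sample values are hypothetical.

```python
def parse_properties(text):
    """Parse 'key = value' lines; blank lines and '#' comments are skipped."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def config_files(main_props):
    """List the per-index and per-repository files named by the main file."""
    index_files = [
        name + "/index.properties" for name in main_props["indexNames"].split()
    ]
    repository_files = [
        name + "/repository.properties"
        for name in main_props["repositoryNames"].split()
    ]
    return index_files, repository_files


# Stand-in for fedoragsearch.properties; every value here is made up.
MAIN_PROPERTIES = """\
# main configuration
soapBase = http://localhost:8080/fedoragsearch/services
repositoryNames = DemoRepository
indexNames = DemoOnLucene DemoOnZebra
"""

main_props = parse_properties(MAIN_PROPERTIES)
index_files, repository_files = config_files(main_props)
```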
21. Further work
- Prototype implementation
- Zebra updateIndex
- browseIndex
- Final decisions by core development team
- Requirements? From prototype to production version? Responsibility?
- From prototype to production version
- Clean up
- Give access
- Make better Exceptions and error messages
- Handle XACML
- Notification mechanism
- javaDoc
- Junit test cases
- Test on various platforms
- Documentation
- Contributions from Fedora community and core
development team