Development of the Fedora Generic Search Service

1
Development of the Fedora Generic Search Service
  • by Gert Schmeltz Pedersen
  • gsp@dtv.dk

Fedora Users Conference, June 19-20, 2006
2
Development of the Fedora Generic Search Service
  • Contents
  • Background
  • The DEF-XWS project
  • Zebra at work
  • Lucene in action
  • Approach and current requirements
  • Current prototype (fedoragsearch)
  • Architectural snapshots
  • Configuration and customization
  • Further work

3
Background - DEF-XWS Eprints
4
Background - DEF-XWS Eprints
ZebraForFedora, a module for Fedora
(http://www.indexdata.dk/zebra). Purpose: to
obtain powerful text indexing and search
functionality and performance. The original text
index and search functionality in Fedora is
simple SQL on a table, where DC element texts are
stored in fields. ZebraForFedora is a set of
Java classes that deploys over existing Fedora
and Zebra installations by running an Ant
target. In the Fedora configuration file:
  <module role="fedora.server.search.FieldSearch"
          class="dk.defxws.eprints.fedora.server.search.FieldSearchZebraModule">
    <comment>Instead of fedora.server.search.FieldSearchSQLModule</comment>
  <datastore id="zebra">
    <comment>Zebra server</comment>
    <param name="host" value="defxws.cvt.dk"/>
    <param name="port" value="9395"/>
  </datastore>
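As a rough illustration of how the datastore parameters in such a configuration fragment could be read, here is a minimal Python sketch. The wrapping `<server>` root element and the helper name `read_datastore` are assumptions made only for this example; Fedora itself reads its configuration in Java, and this is not its actual loader.

```python
import xml.etree.ElementTree as ET

# Hypothetical, self-contained fragment: the <server> wrapper is an assumption
# added only so that the snippet is a well-formed XML document.
CONFIG = """
<server>
  <datastore id="zebra">
    <comment>Zebra server</comment>
    <param name="host" value="defxws.cvt.dk"/>
    <param name="port" value="9395"/>
  </datastore>
</server>
"""


def read_datastore(xml_text, datastore_id):
    """Return the <param> name/value pairs of one datastore as a dict."""
    root = ET.fromstring(xml_text)
    for datastore in root.iter("datastore"):
        if datastore.get("id") == datastore_id:
            return {p.get("name"): p.get("value") for p in datastore.findall("param")}
    raise KeyError(datastore_id)


params = read_datastore(CONFIG, "zebra")
print(params["host"], int(params["port"]))
```

A search module could use the resulting host and port to open its connection to the Zebra server.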
5
Background - DEF-XWS Eprints
6
Background - DEF-XWS Eprints
  • Purpose achieved
  • Fedora hands-on and experience
  • web services hands-on and experience
  • DEF-XWS Eprints available from web services
  • http://defxws.cvt.dk:8082/fedora/access/soap?wsdl
  • http://defxws.cvt.dk:8082/fedora/accessDEF-XWS/soap?wsdl
  • ready for 3-layered system architecture
  • applications combining many web services
  • Lesson
  • Do not override field search,
  • provide generic search service instead ...

7
Background - DEF-XWS Eprints
[Diagram labels: Generic, Zebra, ..., Lucene]
  • Core Fedora Repository Service
  • new services are deployed as web applications
    (.war files), with a configuration file.
  • The Generic Search Service shall be a webapp,
    configurable to use an existing Fedora repository
    and an existing installation of an indexing and
    searching engine, like Zebra, Lucene, and others.
  • Functionality to be decided by a working group of
    Fedora users and developers.

8
Zebra at work
  • Features
  • Zebra is written in portable C, so it runs on
    most Unix-like systems as well as Windows NT.
  • Z39.50 protocol support, recently also SRW/SRU
    and CQL
  • Modules: zebraidx and zebrasrv
  • Searching supports a powerful combination of
    boolean queries as well as relevance-ranking
    (free-text) queries. Truncation, masking, full
    regular expression matching and "approximate
    matching" (e.g. spelling mistakes) are all
    handled.
  • Configurable to understand many input formats:
    SGML, XML, ISO2709 (MARC), raw text.
  • Arbitrarily complex records.
  • Robust updating - records can be added and
    deleted on the fly.
  • Very large databases: logical files can be
    automatically partitioned over multiple disks.

9
Zebra at work
  • The following is an excerpt from the abstract
    syntax file for the GILS profile.
    name gils
    reference GILS-schema
    attset gils.att
    tagset gils.tag
    varset var1.var
    maptab gils-usmarc.map
    # Element set names
    esetname VARIANT gils-variant.est  # for WAIS-compliance
    esetname B gils-b.est
    esetname G gils-g.est
    esetname F @
    elm (1,10) rank                   -
    elm (1,12) url                    -
    elm (1,14) localControlNumber     Local-number
    elm (1,16) dateOfLastModification Date/time-last-modified
    elm (2,1)  title                  w:!,p:!
    elm (4,1)  controlIdentifier      Identifier-standard
    elm (2,6)  abstract               Abstract

Z39.50 configuration
  • This is an excerpt from the GILS attribute set
    definition. Notice how the file describing the
    bib-1 attribute set is referenced.
    name gils
    reference GILS-attset
    include bib1.att
    att 2001 distributorName
    att 2002 indextermsControlled
    att 2003 purpose
    att 2004 accessConstraints
    att 2005 useConstraints
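To make the structure of the attribute set excerpt concrete, the following Python sketch reads its directives into a dictionary. It is only an illustration of the line-oriented format shown on the slide, not Zebra's own parser; the function name `parse_attset` and the result layout are assumptions.

```python
ATTSET = """\
name gils
reference GILS-attset
include bib1.att

att 2001 distributorName
att 2002 indextermsControlled
att 2003 purpose
att 2004 accessConstraints
att 2005 useConstraints
"""


def parse_attset(text):
    """Collect name, reference, includes and numbered attributes."""
    result = {"name": None, "reference": None, "includes": [], "attributes": {}}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        directive, *args = line.split()
        if directive == "att":
            result["attributes"][int(args[0])] = args[1]
        elif directive == "include":
            result["includes"].append(args[0])
        elif directive in ("name", "reference"):
            result[directive] = args[0]
    return result


attset = parse_attset(ATTSET)
print(attset["includes"], attset["attributes"][2003])
```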

10
Lucene in action
  • How to integrate Lucene into your applications
  • Ready-to-use framework for rich document handling
  • Case studies including Nutch, TheServerSide,
    jGuru, etc.
  • Lucene ports to Perl, Python, C#/.Net, and C++
  • Sorting, filtering, term vectors, multiple, and
    remote index searching
  • The new SpanQuery family, extending query parser,
    hit collecting
  • Performance testing and tuning
  • Lucene add-ons (hit highlighting, synonym lookup,
    and others)
  • Foreword by Doug Cutting, the inventor of Lucene

11
Lucene in action
Fedora AND title:"Information retrieval" AND creator:Staples
Document
http://lucene.apache.org/java/docs/queryparsersyntax.html
Field
  • Figure 1.5 A typical application integration with
    Lucene
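The query on this slide combines a bare term with two fielded terms, and the Document and Field labels refer to the units that such a query is matched against. The Python sketch below is a toy matcher for AND-only queries of that shape. It is not Lucene and does not use Lucene's query parser; the default field name `text`, the case-insensitive substring matching, and the sample document are all assumptions made for the illustration.

```python
import re

DEFAULT_FIELD = "text"  # assumption: the field searched by terms without a prefix


def parse_and_query(query):
    """Split 'a AND field:"b c" AND field:d' into (field, text) pairs."""
    clauses = []
    for part in re.split(r"\s+AND\s+", query.strip()):
        m = re.fullmatch(r'(?:(\w[\w.]*):)?(?:"([^"]*)"|(\S+))', part)
        if m is None:
            raise ValueError("unsupported clause: " + part)
        field = m.group(1) or DEFAULT_FIELD
        text = m.group(2) if m.group(2) is not None else m.group(3)
        clauses.append((field, text.lower()))
    return clauses


def matches(document, clauses):
    """A document is a dict of field name to text; every clause must match."""
    return all(text in document.get(field, "").lower() for field, text in clauses)


# Hypothetical document, made up for this example.
doc = {
    "text": "Notes on the Fedora repository architecture",
    "title": "Information retrieval in digital libraries",
    "creator": "Staples",
}
query = 'Fedora AND title:"Information retrieval" AND creator:Staples'
print(matches(doc, parse_and_query(query)))
```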

12
Approach and Current Requirements
  • Do iterations of requirements analysis and
    prototype development
  • Distinguish between requirements and prototype
    properties
  • Current requirements
  • allow various indexing-and-search engines to be
    configured or plugged in, initially Lucene and
    Zebra
  • implement as a webapp within the Fedora Service
    Framework
  • make it installable independently from Fedora, so
    no editing of Fedora code or configs is involved,
    and no mix of files in the same directories
    occurs (however, a notification mechanism is
    needed in Fedora 2.2)
  • allow indexing of, and search in, all information
    in FOXML records for FedoraObjects, including
    full texts in datastreams and disseminator
    results
  • define interface for a set of operations, provide
    REST and SOAP access
  • basic operations
  • updateIndex - indexing the contents of the Fedora
    repository
  • gfindObjects - search similar to Fedora
    findObjects
  • secondary operations
  • browseIndex - browsing terms in a given index.
  • getRepositoryInfo - describing the properties of
    a repository
  • getIndexInfo - describing the properties of an
    index
  • allow multiple repositories to be indexed in one
    and the same index
  • allow multiple indexes to be generated from one
    repository
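The operations listed above can be pictured as one interface with interchangeable engine back ends. The Python sketch below shows that shape with a toy in-memory engine standing in for Lucene or Zebra. The operation names come from the slide; every signature, the `InMemoryOperations` class, the `field:term` query form, and the sample records are assumptions for illustration, and the prototype itself is a Java web application.

```python
from abc import ABC, abstractmethod


class Operations(ABC):
    """Operation names from the requirements; the signatures are assumptions."""

    @abstractmethod
    def updateIndex(self, repository_name, records): ...

    @abstractmethod
    def gfindObjects(self, query): ...

    @abstractmethod
    def browseIndex(self, field_name, start_term=""): ...

    @abstractmethod
    def getRepositoryInfo(self, repository_name): ...

    @abstractmethod
    def getIndexInfo(self): ...


class InMemoryOperations(Operations):
    """Toy engine: one index that may hold records from several repositories."""

    def __init__(self, index_name):
        self.index_name = index_name
        self.docs = {}  # (repository_name, pid) -> {field name: text}

    def updateIndex(self, repository_name, records):
        for pid, fields in records.items():
            self.docs[(repository_name, pid)] = dict(fields)
        return len(records)

    def gfindObjects(self, query):
        # Only the simple form "field:term" is handled in this sketch.
        field, _, term = query.partition(":")
        return sorted(key for key, fields in self.docs.items()
                      if term.lower() in fields.get(field, "").lower())

    def browseIndex(self, field_name, start_term=""):
        terms = set()
        for fields in self.docs.values():
            terms.update(fields.get(field_name, "").lower().split())
        return sorted(t for t in terms if t >= start_term.lower())

    def getRepositoryInfo(self, repository_name):
        count = sum(1 for repo, _ in self.docs if repo == repository_name)
        return {"repositoryName": repository_name, "indexedObjects": count}

    def getIndexInfo(self):
        return {"indexName": self.index_name, "documentCount": len(self.docs)}


engine = InMemoryOperations("DemoIndex")
engine.updateIndex("RepoA", {"demo:21": {"dc.title": "Advanced FO Sample"}})
engine.updateIndex("RepoB", {"test:1": {"dc.title": "Another sample"}})
print(engine.gfindObjects("dc.title:sample"))
print(engine.browseIndex("dc.title", "b"))
```

Keeping the repository name in the document key is one simple way to let several repositories share an index; creating several engine instances fed from the same repository would cover the opposite case.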

13
Current prototype - updateIndex
<foxml:digitalObject PID="demo:21">
  <foxml:objectProperties>
    <foxml:property NAME="http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
                    VALUE="FedoraObject"/>
    <foxml:property NAME="info:fedora/fedora-system:def/model#state"
                    VALUE="Active"/>
    <foxml:property NAME="info:fedora/fedora-system:def/model#label"
                    VALUE="Sample Document Object (FO to PDF)"/>
    <foxml:property NAME="info:fedora/fedora-system:def/model#contentModel"
                    VALUE="FO_TO_PDFDOC"/>
  </foxml:objectProperties>
  <foxml:datastream ID="DC" STATE="A" CONTROL_GROUP="X" VERSIONABLE="true">
    <foxml:datastreamVersion ID="DC1.0"
                             LABEL="Dublin Core for the Document object"
                             CREATED="2006-05-16T10:23:48.376Z"
                             MIMETYPE="text/xml" SIZE="606">
      <foxml:xmlContent>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Advanced FO Sample from Apache FOP Distribution</dc:title>
          <dc:creator>Apache Group</dc:creator>
        </oai_dc:dc>
      </foxml:xmlContent>
    </foxml:datastreamVersion>
  </foxml:datastream>
</foxml:digitalObject>
transformation
<IndexDocument>
  <IndexField IFname="PID">demo:21</IndexField>
  <IndexField IFname="property.type">FedoraObject</IndexField>
  <IndexField IFname="property.state">Active</IndexField>
  <IndexField IFname="property.contentModel">FO_TO_PDFDOC</IndexField>
  <IndexField IFname="dc.title">Advanced FO Sample</IndexField>
  <IndexField IFname="dc.creator">Apache Group</IndexField>
  <IndexField index="TOKENIZED" dsId="DS1" IFname="DS1.text"/>
</IndexDocument>
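The transformation from a FOXML record to an index document can be sketched in a few lines. The prototype performs this step with a stylesheet; the version below swaps in Python's `xml.etree.ElementTree` purely for illustration. The function name `foxml_to_index_document` and the field-naming rules are assumptions inferred from the slide, the input is a trimmed copy of the record (without the label property), and the full-text field taken from a datastream is left out.

```python
import xml.etree.ElementTree as ET

NS = {"foxml": "info:fedora/fedora-system:def/foxml#"}
MODEL = "info:fedora/fedora-system:def/model#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DC = "{http://purl.org/dc/elements/1.1/}"

# Trimmed copy of the record on the slide, with the foxml namespace declared.
FOXML = """<foxml:digitalObject PID="demo:21"
    xmlns:foxml="info:fedora/fedora-system:def/foxml#">
  <foxml:objectProperties>
    <foxml:property NAME="http://www.w3.org/1999/02/22-rdf-syntax-ns#type" VALUE="FedoraObject"/>
    <foxml:property NAME="info:fedora/fedora-system:def/model#state" VALUE="Active"/>
    <foxml:property NAME="info:fedora/fedora-system:def/model#contentModel" VALUE="FO_TO_PDFDOC"/>
  </foxml:objectProperties>
  <foxml:datastream ID="DC" STATE="A" CONTROL_GROUP="X" VERSIONABLE="true">
    <foxml:datastreamVersion ID="DC1.0" MIMETYPE="text/xml">
      <foxml:xmlContent>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Advanced FO Sample from Apache FOP Distribution</dc:title>
          <dc:creator>Apache Group</dc:creator>
        </oai_dc:dc>
      </foxml:xmlContent>
    </foxml:datastreamVersion>
  </foxml:datastream>
</foxml:digitalObject>"""


def foxml_to_index_document(foxml_text):
    """Build an <IndexDocument> element from the PID, properties and DC fields."""
    obj = ET.fromstring(foxml_text)
    doc = ET.Element("IndexDocument")

    def add(name, text):
        field = ET.SubElement(doc, "IndexField", {"IFname": name})
        field.text = text

    add("PID", obj.get("PID"))
    for prop in obj.findall("foxml:objectProperties/foxml:property", NS):
        name = prop.get("NAME")
        if name == RDF_TYPE:
            add("property.type", prop.get("VALUE"))
        elif name.startswith(MODEL):
            add("property." + name[len(MODEL):], prop.get("VALUE"))
    for el in obj.iter():
        if el.tag.startswith(DC):  # Dublin Core elements inside the DC datastream
            add("dc." + el.tag[len(DC):], (el.text or "").strip())
    return doc


print(ET.tostring(foxml_to_index_document(FOXML), encoding="unicode"))
```

Unlike the slide, this sketch copies the complete title text rather than a shortened one.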
14
Current prototype - gfindObjects
15
Current prototype - gfindObjects
16
Current prototype - getRepositoryInfo
17
Current prototype - getIndexInfo
18
Architectural snapshots - basic - fedoragsearch
  • Contents
  • Lucene
  • Zebra
  • fedoragsearch
  • REST demo
  • architecture
  • installation and configuration
  • further customizations

19
Architectural snapshots - indexing - many-to-many
20
Configuration and customization
  • Configuration examples
  • fedoragsearch.properties
  • - soapBase http://HOST:PORT/fedoragsearch/services
  • - repositoryNames REPOSNAMES
  • - indexNames INDEXNAMES
  • - mimeTypes MIMETYPES
  • INDEXNAME/index.properties
  • - operationsImpl dk.defxws.fgslucene.OperationsImpl
  • - defaultQueryFields dc.description dc.title
  • REPOSNAME/repository.properties
  • - soapBase http://FEDORAHOST:PORT/fedora/services
  • - fedoraObjectDir FEDORAOBJECTDIR
  • Customization examples
  • demoFoxmlToLucene.xslt
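The layered configuration above (one service-level file, plus one file per index and per repository) can be illustrated with a small Python sketch that reads such files and looks up the implementation class for each configured index. The `key = value` separator, the helper names, and the sample names `DemoIndex` and `DemoRepos` are assumptions; only the property keys and the class name come from the slide, and the placeholder host and port are left unresolved.

```python
def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blanks and comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


# Hypothetical file contents; HOST:PORT is left as a placeholder on purpose.
FILES = {
    "fedoragsearch.properties": """
        soapBase = http://HOST:PORT/fedoragsearch/services
        repositoryNames = DemoRepos
        indexNames = DemoIndex
    """,
    "DemoIndex/index.properties": """
        operationsImpl = dk.defxws.fgslucene.OperationsImpl
        defaultQueryFields = dc.description dc.title
    """,
}


def implementations(files):
    """Map each configured index name to its operationsImpl class name."""
    service = parse_properties(files["fedoragsearch.properties"])
    result = {}
    for index_name in service["indexNames"].split():
        index = parse_properties(files[index_name + "/index.properties"])
        result[index_name] = index["operationsImpl"]
    return result


print(implementations(FILES))
```

Selecting the engine through a class name in the index's own properties file is what would allow a Zebra-backed index to sit next to a Lucene-backed one without changes to the service-level file.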

21
Further work
  • Prototype implementation
  • Zebra updateIndex
  • browseIndex
  • Final decisions by core development team
  • Requirements? From prototype to production
    version? Responsibility?
  • From prototype to production version
  • Clean up
  • Give access
  • Make better Exceptions and error messages
  • Handle XACML
  • Notification mechanism
  • Javadoc
  • JUnit test cases
  • Test on various platforms
  • Documentation
  • Contributions from Fedora community and core
    development team