XTF in Depth

1
XTF in Depth

Powerful Search and Display for Electronic Text

Martin Haye, California Digital Library
January 2009 presentation at University of Sydney
2
XTF in Depth

Part 1
What is XTF and how does it compare?
Who is using it?
What needs does it address?
New features in 2.1
Design and data flow
Adapting Lucene and Saxon
Planned improvements
Part 2
Interactive demos

3
XTF in 5 minutes

eXtensible Text Framework
Search and display technology from CDL
Open-source Java framework
Powerful and highly configurable
All about rapid prototyping, fast deployment,
and incremental improvement
XML Full text search
Also indexes PDF, HTML, Word
Excel and Powerpoint coming soon

4
XTF in 5 minutes

Search: query power/speed of Lucene, plus
search results shown in context
keyword search, facets, spelling, lots more
View: processing power of Saxon, plus
large file optimizations, hit markup
Configure and customize exclusively in XSLT
Flexible, overlapping collections
Mature, tightly integrated, well documented
In use at CDL and many other places

5
What XTF is not

It is not a content management system
Creation (conversion, scanning, manual)
Ingest / administration
Editing
Preservation
Not built for remote administration
Not a true XML database
but close
Not Google
Google: one interface to a vast grab-bag of data
XTF: crafted interfaces to high-quality data sets

6
How does XTF compare?
[Chart: Greenstone, Solr, XTF 2.0, and XTF 2.1 positioned along two axes, "Turn-key / easy" and "Customizable / powerful"]
Caveat: based on my limited experience with Greenstone and Solr
7
Online Archive of California
8
eScholarship Editions
9
calisphere
10
Mark Twain Project Online
11
UC Berkeley
12
University of Sydney
13
Encyclopedia of Chicago
14
Indiana University Newton
15
Indiana University Swinburne
16
Sweden
17
Brazil
18
Italy
19
Needs

Let's look at four needs that XTF was created to
address
Diverse data
Open software
Rapid deployment
Community involvement

20
Needs 1. Diverse data

Our collections: many and diverse
eScholarship (TEI, PDF)
UC Press monographs (a text may be > 10 MB)
25,000 scholarly articles in PDF
Mark Twain
Hand-crafted critical edition (TEI + MODS)
OAC finding aids, images, books, manuscripts
Japanese American Relocation Digital Archives
TEI, EAD, MODS
Book scanning projects (Google, Internet Archive)
Thousands of scanned books (PDF + DC)
Millions of Melvyl catalog records (MARC)

21
Needs 2. Open software

Digital Publishing Products
Black box (no control over fixes and features)
Often not standards-based
Tech companies have short lifespans
Support often spotty
Data can be held hostage, or even lost

22
Needs 3. Rapid deployment

New collections arriving
Users don't want to wait a year for access
Many "what if" and "wouldn't it be cool" requests
from our staff
Java programmers are expensive
Look and feel goes stale quickly
Barrage of feature requests

23
Needs 4. Community involvement

We want to share the load
For XTF 2.1, we asked the XTF community to vote
for features they wanted
At CDL we try to align our development to needs
of the community
Result: everybody benefits

24
New and improved in 2.1

Faceted browse
Search flexibility
Bookbag
Spelling correction
Similar items
OAI-PMH

25
Faceted browse

Previously, implementing faceted browse required
lots of XSLT programming.
Hierarchical facets were even harder.
Supporting them required us to deeply refactor the stylesheets,
but now it's simple to add new facets.

26
Faceted browse
27
Faceted browse
28
Hierarchical facets
29
Hierarchical facets
30
Search flexibility

Keyword search: single box (now default).
Internally, searches multiple fields.
Advanced search: explicitly fill in constraints
for various fields
Freeform search (new): text-based field
specifiers, AND, OR, parentheses, etc.

31
Keyword search
32
Advanced search
33
Freeform search
34
OAI-PMH

This fit nicely into XTF's architecture
Simple but conforming implementation

35
Bookbag

Refactored the AJAX to use YUI (Yahoo User
Interface widgets)
Still session based
Now supports emailing the bookbag

36
Bookbag
37
Bookbag
38
Bookbag
39
Spelling correction

Unicode bug fixes
On by default and fully integrated

40
Spelling correction
41
Spelling correction
42
Similar items

Allows user to see "more like this"
Improved AJAX integration
On by default - no configuration needed

43
Similar items
44
Similar items
45
Other changes in XTF 2.1

Built-in NLM Blue, TEI P5, MS Word support
(still support TEI P4, EAD, PDF, HTML, text)
Valid XHTML output
RawQuery servlet to provide a query back-end to a
(e.g. Ruby) front-end or mash-up.
Bug fixes and minor changes (many
reported/requested by users)

46
Wiki documentation
47
Wiki documentation
48
Design philosophy

Adaptation through programming
XTF is still about building what you want using a
set of powerful tools
But now
Stylesheets are more modular
Build interfaces faster using honed widgets
Prettier UI to start with

49
XTF is open, standards based

Based on free, open-source tools
Java SDK 1.5
Lucene 2.1 full-text search toolkit
Saxon 8.9 XSLT processor
Unicode support throughout
XTF itself is open-source (BSD license)
No native code: pure Java and XSLT 2.0
Runs on Windows, Solaris, Linux, MacOS
Drops right into Tomcat or Resin
Lots of user-fixable documentation

50
Modular

Use crossQuery servlet to search, dynaXML to
display and navigate. Deploy one or both.
Stylesheets govern flow of data: no Java
programming required
Easy to add features incrementally
100% configurable look and feel
Skin and slice: one system can have several
interfaces and multiple brands
Collection subsetting driven by meta-data

51
Why XSLT?

XSLT is a natural fit for XML
Powerful, dynamic language
Incredibly high-quality, free processor (Saxon)
Why not Java/Struts?
Poor for rapid prototyping, steep learning curve
Why not Ruby?
Not necessarily a good match for XML data
Can be too clever by half
But a smart mash-up might be cool...

52
Indexing Process
53
Indexing

Input filters adapt to many doc types
Any XML doc type
PDF, MS Word, plain text, untidy HTML
XTF is agnostic regarding
Document identifiers
Filesystem organization
Uses document selector stylesheet to identify and
classify documents in filesystem
Meta-data storage
Incremental indexing
Simply update filesystem then run indexer.
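
The "update filesystem, then run indexer" workflow can be pictured with a small sketch. The slides do not say how XTF detects changed files, so the modification-time comparison below is an assumption, and `plan_incremental_index` is a hypothetical name, not part of XTF.

```python
import os


def plan_incremental_index(root_dir, indexed_mtimes):
    """Compare the filesystem with the previous indexing run.

    indexed_mtimes maps path -> modification time recorded last time
    (an assumed bookkeeping scheme, not XTF's actual mechanism).
    Returns (changed_or_new_paths, removed_paths, current_snapshot).
    """
    current = {}
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            current[path] = os.path.getmtime(path)

    # Files that are new, or whose timestamp differs, need (re)indexing.
    changed = sorted(p for p, m in current.items() if indexed_mtimes.get(p) != m)
    # Files that were indexed before but no longer exist must be dropped.
    removed = sorted(p for p in indexed_mtimes if p not in current)
    return changed, removed, current
```

Only the paths in `changed` would be passed to the indexer, and those in `removed` would be deleted from the index, so an unchanged collection costs one directory walk.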

54
crossQuery servlet
55
Flexible Search/Display

One query, many collections
XTF enables "virtual collections"
Output filters for various result views
e.g. simple vs. advanced search form, results in
brief vs. long format, etc.
Query parsers for different search interfaces
Interface to other query protocols
SRU and OAI-PMH already implemented
Should be easy to adapt other queries
Very extensive set of query operators
Flexible query composition
Faceted browse

56
Query Power

Many operators
AND, OR, NEAR, NOT, phrase, range, wildcard
Or-Near, multi-field AND, more like this
Arbitrarily complex queries
Combine full-text search with meta-data
Unusual queries like "dynamic duo" near "red
phone"
Structure-aware searching
e.g. search only headings, or only bibliographies
But must pre-define which structures to search

57
More Power

Fixed-length snippets
Highlight the hit and just the hit
Sort by relevance, or any meta-data fields
Spelling correction
No penalty for huge documents
XTF lazily pulls in only those parts used by a
particular request (e.g. show just Chapter 1)
Scalable
Proven with 10 million records / 14 gigs data
but beyond that, Solr looks better
Authentication: IP lists, LDAP, or external

58
dynaXML servlet
59
Adapting Lucene and Saxon

Adapting Lucene
Chunking, flattening, hit marking, stop-words,
setting limits, insensitivity, special queries,
faceted browsing, spelling correction
Adapting Saxon
Lazy trees, misc. extensions

60
Adapting Lucene: Chunking

Why
Lucene's proximity searches perform best on small
documents
Small chunks enable efficient generation of
80-character snippet surrounding each hit
How
XTF breaks text blocks into 200-word chunks
Chunks overlap to detect a hit starting in one
and ending in the next.
Each chunk carries structural info, plus pointer
to location in XML doc.
Only first chunk carries meta-data for doc
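
A minimal sketch of overlapping chunking, for illustration only. The 200-word chunk size comes from the slide; the 20-word overlap and the name `chunk_words` are assumptions, and XTF's real chunks also carry structural info and pointers that are omitted here.

```python
def chunk_words(words, chunk_size=200, overlap=20):
    """Split a list of words into overlapping chunks.

    Returns a list of (start_offset, chunk_words) pairs. The start offset
    stands in for the pointer back into the source document.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if not words:
        return []

    step = chunk_size - overlap  # each new chunk starts this many words later
    chunks = []
    start = 0
    while True:
        chunks.append((start, words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks
```

With these settings, a phrase that begins near the end of one chunk falls entirely inside the next one, as long as the phrase is no longer than the overlap.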

61
Adapting Lucene: Flattening XML

XSLT prefilter flattens XML structure
Series of text blocks
Block tagged with structural info for search
Prefilter can boost or suppress sections
Fine control over proximity matching
Prefilter gathers/marks meta-data
Can come from within the document, from an XML
doc in filesystem, or fetched from a URL.
Synthesize meta-data (e.g. sort fields, facets)
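
In XTF this flattening is done by an XSLT prefilter. The Python sketch below only illustrates the idea of turning a tree into a series of text blocks tagged with structural info; the tag names `head` and `p` and the function name `flatten` are assumptions, and it assumes block elements do not nest inside each other.

```python
import xml.etree.ElementTree as ET


def flatten(xml_text, block_tags=("head", "p")):
    """Flatten an XML document into a list of tagged text blocks."""
    root = ET.fromstring(xml_text)
    blocks = []
    for elem in root.iter():  # document order
        if elem.tag in block_tags:
            # Gather all descendant text and collapse runs of whitespace.
            text = " ".join("".join(elem.itertext()).split())
            if text:
                blocks.append({"section": elem.tag, "text": text})
    return blocks
```

A search restricted to headings would then only consider blocks whose `section` is `head`.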

62
Adapting Lucene: Hit Marking

Marking search hits in context
Lucene doesn't pinpoint location of hits, only
gives a score per-document
Custom enhancements to Lucene's span logic
score and locate each hit.
dynaXML dynamically adds ranked hits to original
XML doc, then sends to XSLT formatter.
crossQuery forms a snippet around and highlights
each hit.
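
Once a hit's character offsets are known, forming a snippet around it is straightforward. The sketch below is an illustration, not XTF's code: `make_snippet` and the bracket markers are hypothetical, and the default width of 80 echoes the snippet length mentioned on the chunking slide.

```python
def make_snippet(text, hit_start, hit_end, width=80, open_mark="[", close_mark="]"):
    """Return roughly `width` characters of context with the hit marked.

    hit_start and hit_end are character offsets of the hit (end exclusive).
    """
    hit_len = hit_end - hit_start
    pad = max(width - hit_len, 0) // 2  # context kept on each side of the hit
    start = max(hit_start - pad, 0)
    end = min(hit_end + pad, len(text))
    return (
        text[start:hit_start]
        + open_mark
        + text[hit_start:hit_end]
        + close_mark
        + text[hit_end:end]
    )
```

A real implementation would likely trim to word boundaries and handle several hits per snippet; this version cuts at raw character offsets.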

63
Adapting Lucene: Stop-words

Robust, efficient stop-word handling
the, a, an, it, on...
People do use them, and expect corresponding
results.
Lucene normally ignores stop-words, for speed.
XTF quietly joins stop-words to adjacent words,
forming n-grams
Example: "man on the moon" -> man-on on-the
the-moon
Queries are internally rewritten to search for
n-grams automatically.
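
The n-gram idea can be sketched as follows. This reproduces the slide's example but is not XTF's actual tokenizer: the stop-word set is the short list from the slide, the function names are hypothetical, and the choice to also keep non-stop-words as ordinary terms is an assumption.

```python
STOP_WORDS = {"the", "a", "an", "it", "on"}


def stopword_bigrams(tokens, stop_words=STOP_WORDS):
    """Join each stop-word to its neighbours, yielding hyphenated bigrams."""
    bigrams = []
    for left, right in zip(tokens, tokens[1:]):
        # Emit a bigram for every adjacent pair containing a stop-word.
        if left in stop_words or right in stop_words:
            bigrams.append(left + "-" + right)
    return bigrams


def terms_for(text, stop_words=STOP_WORDS):
    """Terms for indexing or querying: plain content words plus bigrams."""
    tokens = text.lower().split()
    content_words = [t for t in tokens if t not in stop_words]
    return content_words + stopword_bigrams(tokens, stop_words)
```

Because the same function is applied to documents and to queries, a query containing stop-words is rewritten into bigram terms that match what was indexed, and the bare stop-words never need their own (very long) posting lists.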

64
Adapting Lucene: Setting Limits

Limits on aberrant queries
Adjustable limits on number of terms matched by
range or wildcard queries
N-grams naturally make most queries efficient
Configurable limits on amount of work performed
by a single query.
Numeric range query
Avoids term expansion
Efficiently filters very granular data, e.g.
timestamps 2006-11-14124603.77
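
A term-expansion limit for wildcard queries might look like the sketch below. The names `expand_wildcard` and `TooManyTermsError` and the default limit of 50 are hypothetical; the slides only say that the limits are adjustable.

```python
import fnmatch


class TooManyTermsError(Exception):
    """Raised when a wildcard matches more index terms than allowed."""


def expand_wildcard(pattern, index_terms, max_terms=50):
    """Expand a wildcard pattern against the index's term list.

    Stops with an error as soon as the limit is exceeded, so an
    aberrant query such as "*" cannot consume unbounded work.
    """
    matches = []
    for term in index_terms:
        if fnmatch.fnmatchcase(term, pattern):
            matches.append(term)
            if len(matches) > max_terms:
                raise TooManyTermsError(
                    "pattern %r matches more than %d terms" % (pattern, max_terms)
                )
    return matches
```

Failing early is the point: the query is rejected after `max_terms + 1` matches instead of after scanning every term.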

65
Adapting Lucene: Insensitivity

Accent/diacritic marks
Many users can't or don't know how to type them
XTF indexer uses configurable map to remove
accents
crossQuery maps query terms
Plural
Convenient for "cat" to match "cats" also
Configurable map of plural to singular used at
index and query time
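
A sketch of applying the same normalization at index time and query time. XTF uses configurable maps for both accents and plurals; here Unicode decomposition stands in for the accent map, and `PLURAL_MAP` is a tiny hypothetical example, so treat this as an illustration rather than XTF's behaviour.

```python
import unicodedata

# Hypothetical plural-to-singular map; XTF's real map is configurable.
PLURAL_MAP = {"cats": "cat", "mice": "mouse"}


def strip_accents(term):
    """Remove combining marks, e.g. an acute accent on 'e'."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_term(term, plural_map=PLURAL_MAP):
    """Lower-case, remove accents, then map plural to singular."""
    folded = strip_accents(term.lower())
    return plural_map.get(folded, folded)
```

Since documents and queries pass through the same `normalize_term`, a user who types an unaccented or plural form still reaches terms indexed from the accented or singular form.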

66
Adapting Lucene: Special Queries

OR-NEAR
Standard OR query doesn't use proximity
OR-NEAR: if words are nearby, score is boosted
Multi-field AND
All terms must be present, in any field.
Essential for certain keyword searches, e.g.
"against all enemies clarke" (matches against title and
author)
More like this
Auto-calculates interesting terms in meta-data
Creates OR-NEAR query to find similar docs
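
The difference between a multi-field AND and an ordinary per-field AND can be shown with a small sketch. The function names and the sample record are illustrative assumptions, and matching here is plain whole-word lookup rather than Lucene scoring.

```python
def multi_field_and(terms, doc_fields):
    """True if every term occurs in at least one field of the document."""
    field_words = [set(text.lower().split()) for text in doc_fields.values()]
    return all(
        any(term.lower() in words for words in field_words)
        for term in terms
    )


def single_field_and(terms, doc_fields):
    """True only if one single field contains all of the terms."""
    for text in doc_fields.values():
        words = set(text.lower().split())
        if all(term.lower() in words for term in terms):
            return True
    return False
```

For the slide's example, a record with title "Against All Enemies" and an author field containing "Clarke" satisfies `multi_field_and` for the terms `against all enemies clarke`, while `single_field_and` rejects it because no one field holds all four words.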

67
Adapting Lucene: Faceted Browsing