Title: XTF in Depth
1. XTF in Depth
- Powerful Search and Display for Electronic Text
Martin Haye, California Digital Library
January 2009 presentation at University of Sydney
2. XTF in Depth
- Part 1
- What is XTF and how does it compare?
- Who is using it?
- What needs does it address?
- New features in 2.1
- Design and data flow
- Adapting Lucene and Saxon
- Planned improvements
- Part 2
- Interactive demos
3. XTF in 5 minutes
- eXtensible Text Framework
- Search and display technology from CDL
- Open-source Java framework
- Powerful and highly configurable
- All about rapid prototyping, fast deployment, and incremental improvement
- XML full-text search
- Also indexes PDF, HTML, Word
- Excel and PowerPoint coming soon
4. XTF in 5 minutes
- Search: query power and speed of Lucene, plus
- search results shown in context
- keyword search, facets, spelling, lots more
- View: processing power of Saxon, plus
- large file optimizations, hit markup
- Configure and customize exclusively in XSLT
- Flexible, overlapping collections
- Mature, tightly integrated, well documented
- In use at CDL and many other places
5. What XTF is not
- It is not a content management system
- Creation (conversion, scanning, manual)
- Ingest / administration
- Editing
- Preservation
- Not built for remote administration
- Not a true XML database
- but close
- Not Google
- Google: one interface to a vast grab-bag of data
- XTF: crafted interfaces to high-quality data sets
6. How does XTF compare?
[Chart: Greenstone, Solr, XTF 2.0, and XTF 2.1 positioned along two axes, "Turn-key / easy" and "Customizable / powerful"]
- Caveat: based on my limited experience with Greenstone and Solr
7. Online Archive of California
8. eScholarship Editions
9. calisphere
10. Mark Twain Project Online
11. UC Berkeley
12. University of Sydney
13. Encyclopedia of Chicago
14. Indiana University Newton
15. Indiana University Swinburne
16. Sweden
17. Brazil
18. Italy
19. Needs
- Let's look at four needs that XTF was created to address:
- Diverse data
- Open software
- Rapid deployment
- Community involvement
20. Needs 1: Diverse data
- Our collections: many and diverse
- eScholarship (TEI, PDF)
- UC Press monographs (a text may be > 10 MB)
- 25,000 scholarly articles in PDF
- Mark Twain
- Hand-crafted critical edition (TEI and MODS)
- OAC finding aids, images, books, manuscripts
- Japanese American Relocation Digital Archives
- TEI, EAD, MODS
- Book scanning projects (Google, Internet Archive)
- Thousands of scanned books (PDF and DC)
- Millions of Melvyl catalog records (MARC)
21. Needs 2: Open software
- Digital Publishing Products
- Black box (no control over fixes and features)
- Often not standards-based
- Tech companies have short lifespans
- Support often spotty
- Data can be held hostage, or even lost
22. Needs 3: Rapid deployment
- New collections arriving
- Users don't want to wait a year for access
- Many "what if" and "wouldn't it be cool" requests from our staff
- Java programmers are expensive
- Look and feel goes stale quickly
- Barrage of feature requests
23. Needs 4: Community involvement
- We want to share the load
- For XTF 2.1, we asked the XTF community to vote for features they wanted
- At CDL we try to align our development to the needs of the community
- Result: everybody benefits
24. New and improved in 2.1
- Faceted browse
- Search flexibility
- Bookbag
- Spelling correction
- Similar items
- OAI-PMH
25. Faceted browse
- Previously, implementing faceted browse required lots of XSLT programming
- Hierarchical facets were even harder
- Required us to deeply refactor the stylesheets, but now it's simple to add new facets
26. Faceted browse
27. Faceted browse
28. Hierarchical facets
29. Hierarchical facets
30. Search flexibility
- Keyword search: single box (now default). Internally, searches multiple fields.
- Advanced search: explicitly fill in constraints for various fields
- Freeform search (new): text-based field specifiers, AND, OR, parentheses, etc.
31. Keyword search
32. Advanced search
33. Freeform search
34. OAI-PMH
- This fit nicely into XTF's architecture
- Simple but conforming implementation
35. Bookbag
- Refactored the AJAX to use YUI (Yahoo User Interface widgets)
- Still session based
- Now supports emailing the bookbag
36. Bookbag
37. Bookbag
38. Bookbag
39. Spelling correction
- Unicode bug fixes
- On by default and fully integrated
40. Spelling correction
41. Spelling correction
42. Similar items
- Allows user to see "more like this"
- Improved AJAX integration
- On by default - no configuration needed
43. Similar items
44. Similar items
45. Other changes in XTF 2.1
- Built-in NLM Blue, TEI P5, MS Word support (still support TEI P4, EAD, PDF, HTML, text)
- Valid XHTML output
- RawQuery servlet to provide a query back-end to a (e.g. Ruby) front-end or mash-up
- Bug fixes and minor changes (many reported/requested by users)
46. Wiki documentation
47. Wiki documentation
48. Design philosophy
- Adaptation through programming
- XTF is still about building what you want using a set of powerful tools
- But now:
- Stylesheets are more modular
- Build interfaces faster using honed widgets
- Prettier UI to start with
49. XTF is open, standards based
- Based on free, open-source tools
- Java SDK 1.5
- Lucene 2.1 full-text search toolkit
- Saxon 8.9 XSLT processor
- Unicode support throughout
- XTF itself is open-source (BSD license)
- No native code: pure Java and XSLT 2.0
- Runs on Windows, Solaris, Linux, MacOS
- Drops right into Tomcat or Resin
- Lots of user-fixable documentation
50. Modular
- Use crossQuery servlet to search, dynaXML to display and navigate. Deploy one or both.
- Stylesheets govern flow of data: no Java programming required
- Easy to add features incrementally
- 100% configurable look and feel
- Skin and slice: one system can have several interfaces and multiple brands
- Collection subsetting driven by meta-data
51. Why XSLT?
- XSLT is a natural fit for XML
- Powerful, dynamic language
- Incredibly high-quality, free processor (Saxon)
- Why not Java/Struts?
- Poor for rapid prototyping, steep learning curve
- Why not Ruby?
- Not necessarily a good match for XML data
- Can be too clever by half
- But a smart mash-up might be cool...
52. Indexing Process
53. Indexing
- Input filters adapt to many doc types
- Any XML doc type
- PDF, MS Word, plain text, untidy HTML
- XTF is agnostic regarding:
- Document identifiers
- Filesystem organization
- Uses document selector stylesheet to identify and classify documents in filesystem
- Meta-data storage
- Incremental indexing
- Simply update filesystem, then run indexer.
54. crossQuery servlet
55. Flexible Search/Display
- One query, many collections
- XTF enables "virtual collections"
- Output filters for various result views
- e.g. simple vs. advanced search form, results in brief vs. long format, etc.
- Query parsers for different search interfaces
- Interface to other query protocols
- SRU and OAI-PMH already implemented
- Should be easy to adapt other queries
- Very extensive set of query operators
- Flexible query composition
- Faceted browse
56. Query Power
- Many operators
- AND, OR, NEAR, NOT, phrase, range, wildcard
- OR-NEAR, multi-field AND, "more like this"
- Arbitrarily complex queries
- Combine full-text search with meta-data
- Unusual queries like "dynamic duo" near "red phone"
- Structure-aware searching
- e.g. search only headings, or only bibliographies
- But must pre-define which structures to search
57. More Power
- Fixed-length snippets
- Highlight the hit and just the hit
- Sort by relevance, or any meta-data fields
- Spelling correction
- No penalty for huge documents
- XTF lazily pulls in only those parts used by a particular request (e.g. show just Chapter 1)
- Scalable
- Proven with 10 million records / 14 GB of data
- but beyond that, Solr looks better
- Authentication: IP lists, LDAP, or external
58. dynaXML servlet
59. Adapting Lucene and Saxon
- Adapting Lucene
- Chunking, flattening, hit marking, stop-words, setting limits, insensitivity, special queries, faceted browsing, spelling correction
- Adapting Saxon
- Lazy trees, misc. extensions
60. Adapting Lucene: Chunking
- Why
- Lucene's proximity searches perform best on small documents
- Small chunks enable efficient generation of an 80-character snippet surrounding each hit
- How
- XTF breaks text blocks into 200-word chunks
- Chunks overlap to detect a hit starting in one and ending in the next.
- Each chunk carries structural info, plus pointer to location in XML doc.
- Only first chunk carries meta-data for doc
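The overlapping-chunk idea can be sketched in a few lines of Python. This is an illustration only, not XTF's Java code: the name `chunk_words` and the 50-word overlap are assumptions, since the slides state only the 200-word chunk size.

```python
def chunk_words(words, chunk_size=200, overlap=50):
    """Split a word list into overlapping chunks.

    Returns (start_offset, chunk) pairs. The overlap default is an
    assumption; the slides give only the 200-word chunk size.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if not words:
        return []
    step = chunk_size - overlap  # how far each new chunk advances
    chunks = []
    start = 0
    while True:
        chunks.append((start, words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


# A 500-word text yields chunks starting at words 0, 150, and 300.
words = [f"w{i}" for i in range(500)]
chunks = chunk_words(words)
starts = [start for start, _ in chunks]
```

With these settings, a phrase no longer than the overlap always fits entirely inside at least one chunk, so a proximity match that straddles a chunk boundary is still found.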
61. Adapting Lucene: Flattening XML
- XSLT prefilter flattens XML structure
- Series of text blocks
- Block tagged with structural info for search
- Prefilter can boost or suppress sections
- Fine control over proximity matching
- Prefilter gathers/marks meta-data
- Can come from within the document, from an XML doc in filesystem, or fetched from a URL.
- Synthesize meta-data (e.g. sort fields, facets)
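To make the flattening step concrete, here is a rough Python sketch. XTF does this with an XSLT prefilter; the Python below is a stand-in for illustration, and the function name `flatten`, the path-style tags, and the suppressed `note` element are all assumptions.

```python
import xml.etree.ElementTree as ET


def flatten(xml_text, suppress=("note",)):
    """Turn an XML document into a flat list of (structure_path, text) blocks.

    Elements named in `suppress` are skipped entirely, mimicking a
    prefilter that keeps some sections out of the index.
    """
    root = ET.fromstring(xml_text)
    blocks = []

    def walk(elem, path):
        if elem.tag in suppress:
            return  # suppressed section: none of its text is emitted
        path = path + [elem.tag]
        text = (elem.text or "").strip()
        if text:
            blocks.append(("/".join(path), text))
        for child in elem:
            walk(child, path)
            # Text following a child still belongs to the parent element.
            tail = (child.tail or "").strip()
            if tail:
                blocks.append(("/".join(path), tail))

    walk(root, [])
    return blocks


sample = "<doc><head>Title</head><p>Body text</p><note>skip me</note></doc>"
blocks = flatten(sample)
```

Each block keeps the path of the element it came from, which is the kind of structural tag that would allow a search restricted to headings only.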
62. Adapting Lucene: Hit Marking
- Marking search hits in context
- Lucene doesn't pinpoint location of hits, only gives a score per document
- Custom enhancements to Lucene's span logic score and locate each hit.
- dynaXML dynamically adds ranked hits to original XML doc, then sends to XSLT formatter.
- crossQuery forms a snippet around and highlights each hit.
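Once a hit's position is known, forming a fixed-length snippet is straightforward. The sketch below assumes character offsets for the hit and uses square brackets as the highlight marker; both choices, and the name `snippet`, are hypothetical rather than XTF's actual output format.

```python
def snippet(text, hit_start, hit_end, width=80):
    """Return roughly `width` characters of context with the hit bracketed."""
    hit_len = hit_end - hit_start
    window = max(width, hit_len)       # never cut the hit itself
    pad = max(width - hit_len, 0)      # context characters to distribute
    start = max(hit_start - pad // 2, 0)
    end = min(start + window, len(text))
    start = max(end - window, 0)       # reuse unused room near the text's end
    return (text[start:hit_start]
            + "[" + text[hit_start:hit_end] + "]"
            + text[hit_end:end])


long_text = "x" * 100 + "moon" + "y" * 100
long_result = snippet(long_text, 100, 104)
short_result = snippet("the moon", 4, 8)
```

Only the hit itself is bracketed, which matches the "highlight the hit and just the hit" goal stated earlier in the talk.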
63. Adapting Lucene: Stop-words
- Robust, efficient stop-word handling
- the, a, an, it, on...
- People do use them, and expect corresponding results.
- Lucene normally ignores stop-words, for speed.
- XTF quietly joins stop-words to adjacent words, forming n-grams
- Example: man on the moon -> man-on on-the the-moon
- Queries are internally rewritten to search for n-grams automatically.
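The n-gram example above can be reproduced with a short Python sketch. The joining rule used here (pair up any two adjacent words when at least one is a stop-word) is inferred from that single example, and the function name is hypothetical; XTF's real rules may differ in detail.

```python
# Stop-word list taken from the slide; a real list would be longer.
STOP_WORDS = {"the", "a", "an", "it", "on"}


def stop_word_bigrams(text, stop_words=STOP_WORDS):
    """Join each stop-word to its neighbours, producing hyphenated bigrams."""
    words = text.lower().split()
    return [
        f"{left}-{right}"
        for left, right in zip(words, words[1:])
        if left in stop_words or right in stop_words
    ]


bigrams = stop_word_bigrams("man on the moon")
```

Applying the same function to the query text gives the terms to look up, so index and query stay consistent.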
64. Adapting Lucene: Setting Limits
- Limits on aberrant queries
- Adjustable limits on number of terms matched by range or wildcard queries
- N-grams naturally make most queries efficient
- Configurable limits on amount of work performed by a single query.
- Numeric range query
- Avoids term expansion
- Efficiently filters very granular data, e.g. timestamps 2006-11-14 12:46:03.77
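A term limit on wildcard expansion might look like the following Python sketch. The names `expand_wildcard` and `TooManyTermsError`, and the default limit of 50, are assumptions for illustration; XTF's actual limits are configurable and implemented in Java.

```python
import fnmatch


class TooManyTermsError(Exception):
    """Raised when a wildcard matches more index terms than allowed."""


def expand_wildcard(pattern, index_terms, max_terms=50):
    """Expand a wildcard pattern against the index, giving up past the limit."""
    matches = []
    for term in index_terms:
        if fnmatch.fnmatchcase(term, pattern):
            matches.append(term)
            if len(matches) > max_terms:
                # Stop early instead of building a huge OR query.
                raise TooManyTermsError(
                    f"{pattern!r} matches more than {max_terms} terms"
                )
    return matches


terms = ["cat", "cats", "catalog", "dog"]
expanded = expand_wildcard("cat*", terms)
```

Failing fast lets the interface ask the user to narrow the query rather than letting one request consume the server.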
65. Adapting Lucene: Insensitivity
- Accent/diacritic marks
- Many users can't or don't know how to type them
- XTF indexer uses configurable map to remove accents
- crossQuery maps query terms
- Plural
- Convenient for "cat" to match "cats" also
- Configurable map of plural to singular used at index and query time
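As a sketch of applying the same normalization at index and query time: the Python below strips accents using Unicode decomposition, which is a substitute for XTF's configurable accent map, and the plural map entries and function names are hypothetical.

```python
import unicodedata

# Hypothetical plural-to-singular entries; XTF reads its map from configuration.
PLURAL_MAP = {"cats": "cat", "mice": "mouse"}


def strip_accents(word):
    """Remove combining marks after Unicode decomposition (e.g. é -> e)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_term(word, plural_map=PLURAL_MAP):
    """Lower-case, strip accents, then map plural forms to singular."""
    folded = strip_accents(word.lower())
    return plural_map.get(folded, folded)


indexed = normalize_term("Café")
queried = normalize_term("cafe")
```

Because the indexer and the query path call the same function, a query for "cafe" and a document containing "Café" end up with the same term.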
66. Adapting Lucene: Special Queries
- OR-NEAR
- Standard OR query doesn't use proximity
- OR-NEAR: if words are nearby, score is boosted
- Multi-field AND
- All terms must be present, in any field.
- Essential for certain keyword searches: "against all enemies clarke" (matches against title and author)
- More like this
- Auto-calculates "interesting" terms in meta-data
- Creates OR-NEAR query to find similar docs
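The multi-field AND rule is easy to state in code. This Python sketch shows only the matching logic, on a sample record made up for illustration; the function name is hypothetical and nothing here reflects how XTF executes the query inside Lucene.

```python
import re


def multi_field_and(query, record):
    """True if every query term occurs in at least one field of the record."""
    terms = re.findall(r"\w+", query.lower())
    words = set()
    for value in record.values():
        words.update(re.findall(r"\w+", value.lower()))
    return all(term in words for term in terms)


# Sample record: no single field contains all four query terms.
record = {"title": "Against All Enemies", "author": "Clarke, Richard A."}
matched = multi_field_and("against all enemies clarke", record)
```

A per-field AND would reject this record, because "clarke" is missing from the title and "enemies" is missing from the author.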
67. Adapting Lucene: Faceted Browsing
- Draws facet term list from Lucene index
- Each facet cached in-memory
- Counts per group created dynamically
- Special mini-language to sort/select (esp. useful for hierarchical facets)
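Dynamic per-group counts, including hierarchical groups, can be sketched as follows. The `::` separator, the record layout, and the function name are assumptions for this example; the sketch counts each document once for every level of the hierarchy it falls under.

```python
from collections import Counter


def facet_counts(hits, field, sep="::"):
    """Count matching documents per facet value, rolling up parent levels."""
    counts = Counter()
    for doc in hits:
        groups = set()  # a document counts once per group, even if repeated
        for value in doc.get(field, []):
            parts = value.split(sep)
            for depth in range(1, len(parts) + 1):
                groups.add(sep.join(parts[:depth]))
        counts.update(groups)
    return counts


hits = [
    {"place": ["USA::California"]},
    {"place": ["USA::Illinois"]},
    {"place": ["Sweden"]},
]
counts = facet_counts(hits, "place")
```

The counts are computed from the current hit list, so they change with every query while the list of facet terms itself can stay cached.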
68. Adapting Lucene: Spelling Correction
- Any standard dictionary won't match place and proper names
- Idea: use the index as source of suggestions
- XTF searches words within edit distance 2
- Candidates ranked by weighted score
- Edit distance (transpositions discounted)
- Frequency of use in the index
- Double-metaphone match
- Multi-word correction uses pair frequencies
- On test data, 80% right suggestion
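A simplified Python sketch of index-driven suggestions follows. It ranks candidates by edit distance (counting a transposition as a single edit) and then by index frequency; the double-metaphone and pair-frequency parts are omitted, and the function names, ranking order, and sample frequencies are assumptions, not XTF's actual weighting.

```python
def osa_distance(a, b):
    """Edit distance where swapping two adjacent letters costs one edit."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def suggest(word, index_freq, max_distance=2):
    """Rank index terms near `word`: closest first, then most frequent."""
    candidates = []
    for term, freq in index_freq.items():
        dist = osa_distance(word, term)
        if 0 < dist <= max_distance:
            candidates.append((dist, -freq, term))
    candidates.sort()
    return [term for _, _, term in candidates]


# Hypothetical term frequencies standing in for a real index.
index_freq = {"berkeley": 120, "barkley": 3, "sydney": 40}
suggestions = suggest("berkley", index_freq)
```

Both "berkeley" and "barkley" are one edit away from the misspelling, so frequency in the index decides which is offered first.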
69. Adapting Saxon: Lazy Trees
- The need: display small parts of large (> 10 MB) XML documents
- Solution: create a binary, random-access version of each document
- XSL keys calculated once and stored
- Only elements accessed by a given request are loaded from disk
- Care must be taken in stylesheets
- Profile mode is useful for optimization
70. Adapting Saxon: Extensions
- More complete SQL database connection
- Ability to call external tools
- Automatic XML conversion in/out
- Timeout enforcement
- File utilities
- Check file existence
- Get file length and timestamp
- Session data
- Key/value pairs
- Value can be XML or plain string
71. The future
- XTF 2.2
- Better out-of-box for large EADs
- Fixes for incremental indexing, other bug fixes
- Specify any number of sub-dirs to index
- Possible TEI P5 refactoring
- Background auto-warming of new index
- Support for indexing PowerPoint and Excel files
- Further out
- A page-turner for scanned texts and converted PDFs
- Pop-up image/PDF page snippets
- And of course, features suggested by users
72. Demos
- I'll demonstrate the features we talked about on several different XTF sites out in the wild.
73. Fin
- Project: xtf.sourceforge.net
- Docs: xtf.wiki.sourceforge.net
- Discuss: groups.google.com/group/xtf-user
- This talk: xtf.sourceforge.net/talks/2009-01-23.ppt
- Me: martin.haye_at_ucop.edu