Lessons Learned from Information Retrieval
1
Lessons Learned from Information Retrieval
  • Chris Buckley
  • Sabir Research
  • chrisb_at_sabir.com

2
Legal E-Discovery
  • Important, growing problem
  • Current solutions not fully understood by people
    using them
  • Imperative to find better solutions that scale
  • Evaluation required
  • How do we know we are doing better?
  • Can we prove a level of performance?

3
Lack of Shared Context
  • The basic problem of both search and e-discovery
  • The searcher does not necessarily know beforehand the vocabulary or
    background of either the author or the intended audience of the
    documents to be searched

4
Relevance Feedback
  • A human judges some documents as relevant; the system finds others
    based on those judgements
  • The only general technique proven successful at improving a system's
    knowledge of context
  • Works from the small collections of the 1970s to the large
    collections of the present (TREC HARD track)
  • Difficult to apply to discovery
  • Need to change the entire discovery process
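
The feedback loop described on this slide is classically implemented as Rocchio's algorithm; a minimal sketch, assuming sparse term-weight vectors and using the commonly cited constants (the weights here are illustrative, not from the presentation):

```python
from collections import Counter

def rocchio(query, relevant_docs, nonrelevant_docs,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio relevance feedback: move the query vector toward the
    centroid of judged-relevant documents and away from the centroid of
    judged-nonrelevant ones. Vectors are sparse term->weight Counters."""
    new_q = Counter({t: alpha * w for t, w in query.items()})
    for doc in relevant_docs:
        for t, w in doc.items():
            new_q[t] += beta * w / len(relevant_docs)
    for doc in nonrelevant_docs:
        for t, w in doc.items():
            new_q[t] -= gamma * w / len(nonrelevant_docs)
    # Negative weights are conventionally clipped to zero.
    return Counter({t: w for t, w in new_q.items() if w > 0})
```

The expanded query picks up vocabulary from the judged documents, which is exactly the shared-context knowledge the searcher lacked beforehand.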

5
Toolbox of other techniques
  • Many other aids to search
  • Ontologies, linguistic analysis, semantic
    analysis, data mining, term relationships
  • Good IR techniques uniformly give big wins for some searches and
    mild losses for others
  • Need a set of techniques, a toolbox
  • In practice for IR research, the issue is not finding big wins but
    avoiding the losses

6
Implications of toolbox
  • No expected silver bullet AI solution
  • Boolean search will not expand to accommodate
    combinations of solutions
  • Test collections are critical

7
Test Collection Importance
  • Needed to develop tools
  • Needed to develop decision procedures of when to
    use tools
  • The toolbox requirement means we need to distinguish a good overall
    system from one with a single good tool
  • All systems can show searches on which individual tools work well
  • A good system shows a performance gain over the entire set of
    searches
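
Measuring the gain "over the entire set of searches" is usually done with mean average precision (MAP), which averages a per-search score across all topics; a minimal sketch, with hypothetical document and topic identifiers:

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision for one search: the mean of the precision
    values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs, judgements):
    """MAP: the usual single-number summary over the whole topic set.
    `runs` maps topic -> ranked doc ids; `judgements` maps topic ->
    set of relevant doc ids."""
    return sum(average_precision(runs[t], judgements[t])
               for t in runs) / len(runs)
```

Because MAP averages over every topic, a tool that wins big on a few searches but loses mildly on the rest can still lower the overall score, which is the distinction the slide is drawing.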

8
Test Collection Composition
  • Large set of realistic documents
  • Set (at least 30) of topics or information needs
  • Set of judgements of which documents are responsive (or
    non-responsive) to each topic
  • Judgements are expensive and limit how test
    collection results can be interpreted

9
Incomplete Judgements
  • Judgements are too time-consuming and expensive to be complete
    (judging every document)
  • Pool retrieved documents from a variety of
    systems
  • Feasible, but
  • Known incomplete
  • We can't even accurately estimate how incomplete
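
The pooling procedure on this slide can be sketched in a few lines; depth-100 is a common TREC choice, used here only as an illustrative default:

```python
def pool(runs, depth=100):
    """Depth-k pooling: merge the top `depth` documents from each
    system's ranked list into one deduplicated set for human judging.
    Documents outside every system's pool are never judged, which is
    why the resulting judgements are known to be incomplete."""
    pooled = set()
    for ranked_ids in runs:
        pooled.update(ranked_ids[:depth])
    return pooled
```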

10
Inexact Judgements
  • Humans differ substantially on judgements
  • Standard TREC collections
  • Topics include 1-3 paragraphs describing what
    makes a document relevant
  • Given same pool of documents, 2 humans overlap on
    70% of their relevant sets
  • 76% agreement on small TREC legal test
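
Overlap figures like the 70% above are conventionally computed as the size of the intersection of the two judges' relevant sets over the size of their union; a minimal sketch with hypothetical document ids:

```python
def overlap(judge_a, judge_b):
    """Overlap between two judges' relevant sets:
    |A intersect B| / |A union B|."""
    a, b = set(judge_a), set(judge_b)
    return len(a & b) / len(a | b) if a | b else 1.0
```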

11
Implications of Judgements
  • No gold standard of perfect performance is even
    possible
  • Any system claiming better than 70% precision at
    70% recall is working on a problem other than
    general search
  • Almost impossible to get useful absolute measures
    of performance

12
Comparative Evaluation
  • Comparisons between systems on moderate size
    collections (several GBytes) are solid.
  • Comparative results on larger collections (500
    GBytes) are showing strains
  • Believable but larger error margin
  • Active area of research
  • Overall goal for e-discovery has to be
    comparative evaluation
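
Comparative evaluation rests on pairing the two systems topic by topic on the same collection; one simple form of that pairing is a win/loss/tie count over per-topic scores (the topic names and scores below are hypothetical):

```python
def compare_systems(scores_a, scores_b):
    """Paired per-topic comparison of two systems. Both arguments map
    topic -> score (e.g. average precision) on the same topic set.
    Returns (wins, losses, ties) for system A, the raw counts a
    significance test such as the sign test would be run on."""
    wins = sum(1 for t in scores_a if scores_a[t] > scores_b[t])
    losses = sum(1 for t in scores_a if scores_a[t] < scores_b[t])
    ties = len(scores_a) - wins - losses
    return wins, losses, ties
```

Because both systems face identical topics and judgements, the per-topic differences are meaningful even when the absolute scores, for the reasons on the previous slides, are not.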

13
Sabir TREC Legal Results
  • Submitted 7 runs
  • Very basic approach (1995 technology)
  • 3 tools from my toolbox
  • 3 query variations
  • One of the top systems
  • All results basically the same
  • Tools did not help on average