Classification Technology at LexisNexis - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Classification Technology at LexisNexis


1
  • Classification Technology at LexisNexis
  • SIGIR 2001 Workshop on Operational Text Classification
  • Mark Wasson
  • LexisNexis
  • mark.wasson_at_lexisnexis.com
  • September 13, 2001

2
Our Boolean Origins

3
Our Boolean Origins

4
The Topic Identification System
  • The Topic Identification System Model
  • Term-based Topic Identification (TTI)
  • Term Mapping System
  • Company Concept Indexing
  • Named Entity Indexing (Companies, People,
    Organizations, Places)
  • Subject Indexing Prototype (not released)
  • NEXIS Topical Indexing

5
Psycholinguistics Features
  • Propositional Language Model Underlies Surface
    Forms
  • Word Concepts
  • Semantic Priming, Additive up to a Point
  • Spreading Activation

6
Terms and Word Concepts
  • All words and phrases are searchable; no stop
    words
  • No automatic morphological or thesaurus expansion
  • Exception: name variant generation, but subject
    to human verification
  • Word Concept: a set of functionally equivalent
    terms with respect to a given topic; 1 to 100s of
    terms in a single word concept
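
A word concept can be pictured as a named set of terms whose counts are pooled. The sketch below is only an illustration: the topic, concept names, and terms are hypothetical and do not come from an actual LexisNexis topic definition.

```python
from collections import Counter

# Hypothetical word concepts for an illustrative "airline industry" topic.
# Each concept groups terms treated as functionally equivalent for that topic.
WORD_CONCEPTS = {
    "airline": {"airline", "airlines", "air carrier", "air carriers"},
    "flight": {"flight", "flights", "nonstop service"},
}

def concept_counts(term_counts, concepts):
    """Roll per-term counts up to word-concept level."""
    totals = Counter()
    for name, terms in concepts.items():
        totals[name] = sum(term_counts.get(term, 0) for term in terms)
    return totals

# Per-term counts as they might come out of a document lookup step.
term_counts = {"airline": 2, "air carrier": 1, "flights": 3}
totals = concept_counts(term_counts, WORD_CONCEPTS)
print(dict(totals))
```

Pooling at the concept level means a document that says "air carrier" once and "airline" twice contributes three occurrences to a single concept, not separate evidence for three unrelated terms.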

7
Frequency Weighting
  • Frequency weighting at word concept level
    rather than at individual term level
  • TTI used chi-square to compare individual word
    concepts to supervised training set
  • TTI used stepwise linear regression to test in
    combination and suggest weights
  • Allow both positive and negative weights in
    addition to absolute yes/no Boolean functionality
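
The chi-square comparison can be illustrated with a 2x2 contingency table for one word concept against a labeled training set. This is a minimal sketch of the standard Pearson statistic; the slides do not say how TTI built its table, so the table layout and the counts are assumptions, and the stepwise regression step is not shown.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 table.

    a = relevant docs containing the concept
    b = irrelevant docs containing the concept
    c = relevant docs lacking the concept
    d = irrelevant docs lacking the concept
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A row or column is empty, so the concept cannot discriminate.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

# Hypothetical counts: the concept appears in 40 of 50 relevant
# documents and in 10 of 50 irrelevant documents.
strong = chi_square_2x2(40, 10, 10, 40)
# A concept spread evenly across both classes carries no signal.
flat = chi_square_2x2(25, 25, 25, 25)
print(strong, flat)
```

A large statistic marks a word concept whose presence differs sharply between relevant and irrelevant training documents, which makes it a candidate for the regression step that tests concepts in combination.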

8
Problem Word Concepts
  • 5 documents: 3 relevant (G), 2 irrelevant (B)
  • W1 in G1, G2, B1
  • W2 in G2, G3, B2
  • W3 in G1, G3, B1
  • Each W by itself produces 67% recall, 67%
    precision
  • W1 + W2 -> 100% recall, 60% precision
  • W1 + W3 -> 100% recall, 75% precision
  • W2 + W3 -> 100% recall, 60% precision
  • W1 + W2 + W3 -> 100% recall, 60% precision
  • Also, fewer terms -> faster processing
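
The figures on this slide can be reproduced with a few lines of set arithmetic. The sketch assumes a document is retrieved when it contains any word concept in the combination, which is the reading that matches the numbers above.

```python
from itertools import combinations

RELEVANT = {"G1", "G2", "G3"}
POSTINGS = {
    "W1": {"G1", "G2", "B1"},
    "W2": {"G2", "G3", "B2"},
    "W3": {"G1", "G3", "B1"},
}

def recall_precision(concepts):
    """Recall and precision when any listed concept retrieves a document."""
    retrieved = set().union(*(POSTINGS[c] for c in concepts))
    hits = retrieved & RELEVANT
    return len(hits) / len(RELEVANT), len(hits) / len(retrieved)

for size in (1, 2, 3):
    for combo in combinations(sorted(POSTINGS), size):
        recall, precision = recall_precision(combo)
        print(" + ".join(combo), f"recall={recall:.0%} precision={precision:.0%}")
```

W1 and W3 share the same irrelevant document (B1), so combining them adds a relevant document without adding noise. Any combination that includes W2 also pulls in B2, which is why adding a third concept lowers precision.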

9
Looking Up Terms in Documents
  • Count a term extra in key document parts
      • Headlines
      • Leading text
      • Captions
  • Count all potential matches
      • "American" gets counted for 100s of companies
  • Don't count a term when part of another
      • "Mead" in "Mead Corp."
      • "French" in "French Fry"
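
Two of the lookup rules above, extra credit for key document parts and not counting a term nested inside a longer one, can be sketched as follows. The part names, the part weights, and the longest-match-first strategy are assumptions made for illustration; the slides do not give the actual multipliers or matching algorithm.

```python
import re

# Hypothetical multipliers for key document parts.
PART_WEIGHTS = {"headline": 3, "lead": 2, "caption": 2, "body": 1}

def count_terms(text, terms):
    """Count terms, skipping a term that falls inside a longer matched term."""
    counts = {term: 0 for term in terms}
    claimed = []  # (start, end) spans already taken by a match
    # Longer terms go first so "Mead Corp." claims its span before "Mead".
    for term in sorted(terms, key=len, reverse=True):
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            inside = any(s <= match.start() and match.end() <= e
                         for s, e in claimed)
            if not inside:
                claimed.append((match.start(), match.end()))
                counts[term] += 1
    return counts

def weighted_counts(doc, terms):
    """Sum term counts over document parts, scaled by each part's weight."""
    totals = {term: 0 for term in terms}
    for part, text in doc.items():
        weight = PART_WEIGHTS.get(part, 1)
        for term, n in count_terms(text, terms).items():
            totals[term] += weight * n
    return totals

doc = {
    "headline": "Mead Corp. earnings rise",
    "body": "Mead Corp. said Mead paper sales rose.",
}
result = weighted_counts(doc, ["Mead", "Mead Corp."])
print(result)
```

The standalone "Mead" in the body is still counted, while the two occurrences inside "Mead Corp." are credited only to the longer term.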

10
Calculating Topic Scores
  • Summation of frequency weight across all word
    concepts
  • Normalize score
  • Compare to threshold
  • Verification range in TTI
  • Major references, strong passing references, weak
    passing references in indexing tools
  • Add controlled vocabulary term or marker to
    document if score > threshold
  • Add score, any associated secondary CVTs

11
Source-dependent, -independent
  • Similar field functions, different field names
    and locations
  • Database and file information to guide production
    processes
  • The source specification file allows us to reuse
    a single topic definition across a wide variety
    of sources and source types
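
One way to picture a source specification is as a mapping from source-independent field functions to each source's own field names. The source names, field names, and dictionary layout below are hypothetical; the actual format of the specification file is not described in the slides.

```python
# Hypothetical source specifications: generic field function -> the
# field name that this particular source uses for it.
SOURCE_SPECS = {
    "newswire_a": {"headline": "HEADLINE", "lead": "LEAD", "body": "TEXT"},
    "journal_b": {"headline": "TITLE", "lead": "ABSTRACT", "body": "BODY"},
}

def to_generic_parts(record, source):
    """Re-key a source-specific record by source-independent field function."""
    spec = SOURCE_SPECS[source]
    return {function: record.get(field, "") for function, field in spec.items()}

record = {"TITLE": "Paper prices climb", "BODY": "Full text of the article."}
parts = to_generic_parts(record, "journal_b")
print(parts)
```

Once every record is re-keyed this way, a single topic definition can refer to "headline" or "lead" without knowing what any given source calls those fields.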

12
Manual vs. Automatic
  • Build each definition using iterative manual
    process
  • Use supervised learning?
  • TTI's chi-square and regression
  • Cost of creating training samples
  • Automate repetitive, labor-intensive tasks
  • Generate name variants
  • Cheap labor cost: a few minutes to 8 hours

13
Test, Test, Test
  • Business unit benchmarks prior to adoption
  • Development process test cases
  • Internal benchmarks with 3rd party technologies
  • Sorry, not TREC
  • Most tests, topics, sources: recall and
    precision both in the 90-95% range

14
The End?
  • TIS Model? 16 years old
  • TTI? In production for 11 years
  • Term Mapping? 9 years old
  • Entity Indexing? 6-7 years old
  • Topical Indexing? 3 years old
  • Complemented by SRA NetOwl-based indexing 2 years
    ago
  • No movement afoot to replace any of them

15
Related Papers
  • TTI
      • Leigh, S. (1991). The Use of Natural Language
        Processing in the Development of Topic Specific
        Databases. Proceedings of the 12th National
        Online Meeting.
  • Company Concept Indexing
      • Wasson, M. (2000). Large-scale Controlled
        Vocabulary Indexing for Named Entities.
        Proceedings of the ANLP-NAACL 2000 Conference.