Active Taxonomies and their role in the Enterprise

1
Active Taxonomies and their role in the
Enterprise
  • Jeremy Ellman
  • School of Informatics,
  • Northumbria University, UK
  • Jeremy.Ellman _at_northumbria.ac.uk

2
Presentation Overview
  • What is a Taxonomy?
  • What are Enterprise Taxonomy Applications?
  • How Taxonomies may be made Active using Lexical
    Chains
  • The Case for Taxonomy
  • Information Management and Metadata

3
What is a Taxonomy?
  • Minimally composed of a mostly hierarchical
    classification plus metadata
  • Also called an Ontology
  • Examples include:
      • Dewey Decimal System
      • Library of Congress Subject Headings (LCSH)
      • (Local) Government Category List
      • WordNet
      • Roget's Thesaurus
  • Entries may include synonyms, other language
    equivalents, author, date, related terms, etc.
  • A working component of the information
    architecture
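As an illustration of "hierarchical classification plus metadata", here is a minimal Python sketch of one way a taxonomy entry could be represented. The class name `TaxonomyNode`, its fields, and the sample categories are assumptions made for this example; they are not taken from the presentation.

```python
from dataclasses import dataclass, field


@dataclass
class TaxonomyNode:
    """One taxonomy entry: a label, optional metadata, and child categories."""
    label: str
    synonyms: list[str] = field(default_factory=list)
    related: list[str] = field(default_factory=list)
    children: list["TaxonomyNode"] = field(default_factory=list)

    def add(self, child: "TaxonomyNode") -> "TaxonomyNode":
        """Attach a child category and return it so calls can be chained."""
        self.children.append(child)
        return child

    def path_to(self, label: str, trail=()):
        """Return the labels from this node down to `label`, or None if absent."""
        trail = trail + (self.label,)
        if self.label == label:
            return list(trail)
        for child in self.children:
            found = child.path_to(label, trail)
            if found:
                return found
        return None


# Hypothetical three-level hierarchy, for illustration only.
root = TaxonomyNode("Health")
cardiology = root.add(TaxonomyNode("Cardiology"))
cardiology.add(TaxonomyNode("Myocardial infarction", synonyms=["heart attack"]))

print(root.path_to("Myocardial infarction"))
```

The path from the root to an entry is the kind of information a structured browsing interface would display as a trail of categories.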

4
An Example Taxonomy

5
The Case for Taxonomy
  • Taxonomies improve information access through
    structured browsing or search improvements
  • Search process is limited by user behaviour
  • Queries limited to 2.35 words on average
  • Professionals achieve better results using more
    appropriate vocabulary
  • Taxonomies allow synonyms to be added to user
    queries
  • Heart attack vs. myocardial infarction
  • Consistent use of metadata labels
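Adding synonyms to user queries, as described on this slide, might look roughly like the following Python sketch. The `SYNONYMS` table and the `expand_query` function are hypothetical names for this example. A real system would draw the synonyms from the taxonomy entries rather than a hard-coded dictionary.

```python
# Toy synonym table; in practice this would come from the taxonomy.
SYNONYMS = {
    "heart attack": ["myocardial infarction"],
}


def expand_query(query: str, synonyms: dict[str, list[str]]) -> list[str]:
    """Return the normalized query plus one variant per known synonym."""
    normalized = query.lower().strip()
    variants = [normalized]
    for term, alternatives in synonyms.items():
        if term in normalized:
            for alternative in alternatives:
                variants.append(normalized.replace(term, alternative))
    return variants


expanded = expand_query("Heart attack treatment", SYNONYMS)
print(expanded)
```

Each variant can then be sent to the search engine, so that a short lay query also retrieves documents written in professional vocabulary.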

6
Enterprise Taxonomy Applications and Function
  • Applications include:
      • Content Management Systems
      • Document Management Systems
      • Portals and Intranets
      • e-Commerce Applications
  • Functions include:
      • Supporting information retrieval
      • Text classification
      • Message routing
      • Topic spotting
      • Providing the conceptual basis for
        knowledge-based systems

7
Text Classification Methods
  • Identify appropriate categories for a text
  • Text is treated as Bag of Words
  • Generally create a statistical model of each
    category derived from a training set
  • Accuracy is domain dependent
  • At best 90%
  • Generally around 70%
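To make the bag-of-words approach concrete, the sketch below trains a small multinomial Naive Bayes model in plain Python. Naive Bayes is used here only as a familiar example of a statistical model derived from a training set. The slides do not say which model was used, and the two training sentences are invented.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Bag of words: lowercase and split on whitespace, ignoring word order."""
    return text.lower().split()


class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word frequencies
        self.doc_counts = Counter()              # category -> number of documents
        self.vocab = set()

    def train(self, text, category):
        words = tokenize(text)
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocab.update(words)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for category, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # Log prior plus log likelihood of each word, with add-one smoothing.
            score = math.log(self.doc_counts[category] / total_docs)
            for word in tokenize(text):
                score += math.log((counts[word] + 1) / (total_words + len(self.vocab)))
            scores[category] = score
        return max(scores, key=scores.get)


# Invented toy training set, for illustration only.
model = NaiveBayes()
model.train("the train ran along the rails", "transport")
model.train("the patient had a heart attack", "medicine")
print(model.classify("rails and train"))
```

Because the model only counts words, it knows nothing about text structure or what the categories mean.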

8
Key issues
  • Current approaches to text analysis ignore:
      • Text structure
      • Document meaning
      • Semantics of categories

9
How Taxonomies may be made Active using Lexical
Chains
  • An Active taxonomy would use the category
    information to assist processing
  • Easy generation of navigation interfaces
  • Rapid development of new content interfaces
  • Reduced cost of localization
  • Easy accommodation of name changes

10
What is a Lexical Chain?
  • A set of words in a text that are related both by
    proximity and by lexical linking relations
    between the words derived from an external
    knowledge source (Morris and Hirst 1991)
  • Uses Roget's Thesaurus for performance
  • Simpler, more balanced structure than WordNet (4
    ply deep)
  • 1000 main headings, divided into Noun, Verb, Adj,
    Adv, and Phrase
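The definition above combines two conditions: the words must be close together, and they must be linked through an external knowledge source. The Python sketch below is a much simplified, greedy illustration of that idea. It is not the Morris and Hirst algorithm, and the `TOY_THESAURUS` headings are invented stand-ins for real thesaurus headings.

```python
# Invented mapping from words to thesaurus headings (not real Roget's entries).
TOY_THESAURUS = {
    "train": {"railway"},
    "rails": {"railway"},
    "line": {"railway"},
    "embankment": {"railway"},
    "travelling": {"motion"},
    "velocity": {"motion"},
}


def build_chains(words, thesaurus, max_distance=30):
    """Greedily group words that share a heading and occur close together."""
    chains = []  # each chain: {"headings": set, "members": [(position, word)]}
    for position, word in enumerate(words):
        headings = thesaurus.get(word)
        if not headings:
            continue  # word is unknown to the knowledge source
        for chain in chains:
            last_position = chain["members"][-1][0]
            shares_heading = bool(chain["headings"] & headings)
            if shares_heading and position - last_position <= max_distance:
                chain["members"].append((position, word))
                break
        else:
            # No existing chain accepted the word, so it starts a new one.
            chains.append({"headings": set(headings), "members": [(position, word)]})
    return chains


text = "a very long train travelling along the rails with a constant velocity"
chains = build_chains(text.split(), TOY_THESAURUS)
chain_words = [[word for _, word in chain["members"]] for chain in chains]
print(chain_words)
```

Lowering `max_distance` makes the proximity condition stricter, so related words that are far apart end up in separate chains.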

11
Lexical Chains in text
Quotation from Einstein 1939: We suppose[4] a very
long[6] train[0] travelling[3] along the rails with a
constant velocity v and in the direction[1]
indicated in figure[7]. People travelling in this
direction will with advantage[5] use the train as a
rigid reference[2]-body; they regard all events
in reference-to the train. Then every event
which takes place along the line also takes place
at a particular point[10] of the train. Also, the
definition[8] of simultaneity[9] can be given
relative-to the train in exactly the same way
as with respect to embankment.

12
Actual Chains
  • 0. train, rails, train, train, line, train,
    train, embankment
  • 1. direction, people, direction
  • 2. reference, regard, relative-to, respect
  • 3. travelling, velocity, travelling, rigid
  • 4. suppose, reference-to, place, place
  • 5. advantage, events, event
  • 6. long, constant
  • 7. figure, body

13
The Generic Document Profile
  • Objective is to represent a document as taxonomy
    categories
  • Document strength in these categories is derived
    from the lexical chains extracted
  • Sum of link strengths weighted by link type and
    inter-word distance
  • Gives an attribute-value vector to represent the
    document
  • Can be compared to give a measure of
    inter-document similarity
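A minimal Python sketch of such a profile follows. The slides do not give the weighting formula, so the link-type weights, the 1 / (1 + distance) decay, and the use of cosine similarity for comparison are all assumptions chosen for illustration.

```python
import math

# Assumed weights per link type; the presentation does not state the real values.
LINK_TYPE_WEIGHT = {"identical": 1.0, "same_heading": 0.5}


def link_strength(link_type, distance):
    """Stronger for closer words; the decay function is an assumption."""
    return LINK_TYPE_WEIGHT[link_type] / (1 + distance)


def document_profile(links):
    """Sum link strengths per category. links: (category, link_type, distance)."""
    profile = {}
    for category, link_type, distance in links:
        profile[category] = profile.get(category, 0.0) + link_strength(link_type, distance)
    return profile


def cosine_similarity(a, b):
    """Compare two profiles treated as sparse attribute-value vectors."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical chain links extracted from two documents.
doc_a = document_profile([
    ("railway", "same_heading", 4),
    ("railway", "identical", 9),
    ("motion", "same_heading", 7),
])
doc_b = document_profile([("railway", "identical", 1)])
similarity = cosine_similarity(doc_a, doc_b)
print(doc_a, doc_b, round(similarity, 3))
```

Sorting a profile by value gives the ordered list of categories that describes the document.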

14
GDP as Classification Method
  • Advantages:
      • Gives list of categories ordered by value
      • No training required
  • Disadvantages:
      • Categories are those used in thesaurus or
        taxonomy
      • Not all taxonomies have adequate vocabulary
15
Using the GDP As an Arbitrary Classifier
  • Use machine learning training approach
  • Trained documents produce profiles that can be
    merged, normalized, and associated with any
    category
  • Classification process then compares against
    category models
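The merge, normalize, and compare steps above can be sketched in Python as follows. The function names, the unit-length normalization, and the dot-product comparison are assumptions made for this example, and the training profiles are invented.

```python
import math
from collections import defaultdict


def merge_profiles(profiles):
    """Add up the heading strengths of several document profiles."""
    merged = defaultdict(float)
    for profile in profiles:
        for heading, strength in profile.items():
            merged[heading] += strength
    return dict(merged)


def normalize(profile):
    """Scale a profile to unit length so large categories do not dominate."""
    length = math.sqrt(sum(value * value for value in profile.values()))
    if length == 0:
        return dict(profile)
    return {heading: value / length for heading, value in profile.items()}


def train(labelled_profiles):
    """Build one model per category from (category, profile) pairs."""
    grouped = defaultdict(list)
    for category, profile in labelled_profiles:
        grouped[category].append(profile)
    return {category: normalize(merge_profiles(group)) for category, group in grouped.items()}


def classify(profile, models):
    """Return the categories ordered from most to least similar."""
    unit = normalize(profile)

    def score(category):
        model = models[category]
        return sum(value * model.get(heading, 0.0) for heading, value in unit.items())

    return sorted(models, key=score, reverse=True)


# Invented training profiles keyed by thesaurus heading.
models = train([
    ("transport", {"railway": 0.4, "motion": 0.1}),
    ("transport", {"railway": 0.2}),
    ("physics", {"motion": 0.3, "time": 0.3}),
])
ranking = classify({"railway": 0.5, "motion": 0.1}, models)
print(ranking)
```

In this sketch the category labels are free text, so they can be any categories at all, not only the headings of the underlying thesaurus.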

16
Evaluation
  • Compare against statistical approaches using
    reported corpora
  • Initially using 20-newsgroups from BoW (McCallum
    1999), a 20 MB dataset
  • Obtained WIPO Patent Dataset
      • 450 MB XML zipped test data
      • 500 MB XML zipped training data
  • Also based on published Taxonomy

17
Research Questions
  • Can the Bag of Words model be bettered with
    taxonomies?
  • Can a subject specific taxonomy beat a generic
    one?
  • Can subject specific taxonomies be extended
    accurately using generic ones?

18
Conclusion
  • Taxonomies have a place in Enterprise Information
    Applications
  • Text coherence can be used to characterise that
    text using external Active taxonomies
  • This may be a practical text classification method

19
Questions?