Title: Active Taxonomies and their role in the Enterprise
1 Active Taxonomies and their role in the Enterprise
- Jeremy Ellman
- School of Informatics,
- Northumbria University, UK
- Jeremy.Ellman _at_northumbria.ac.uk
2 Presentation Overview
- What is a Taxonomy?
- What are Enterprise Taxonomy Applications?
- How Taxonomies may be made Active using Lexical Chains
- The Case for Taxonomy
- Information Management and Metadata
- Information Management and Metadata
3 What is a Taxonomy?
- Minimally composed of a mostly hierarchical classification plus metadata
- Also called an Ontology
- Examples include
- Dewey Decimal System
- Library of Congress Subject Headings (LCSH)
- (Local) Government Category List
- WordNet
- Roget's Thesaurus
- Entries may include synonyms, other language equivalents, author, date, related terms etc.
- A working component of the information architecture
4 An Example Taxonomy
5 The Case for Taxonomy
- Taxonomies improve information access through structured browsing or search improvements
- Search process is limited by user behaviour
- Queries limited to 2.35 words on average
- Professionals achieve better results using more appropriate vocabulary
- Taxonomies allow synonyms to be added to user queries
- Heart attack vs. myocardial infarction
- Consistent use of metadata labels
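The synonym point above can be illustrated with a minimal query-expansion sketch. The `SYNONYMS` table and the function names are hypothetical and built only from the slide's own example; a real system would read synonyms from the taxonomy's entries.

```python
# Hypothetical mini-taxonomy: each preferred term maps to its synonyms.
SYNONYMS = {
    "myocardial infarction": ["heart attack"],
}

def build_lookup(synonyms):
    """Map every term (preferred or synonym) to its full synonym group."""
    lookup = {}
    for preferred, alternatives in synonyms.items():
        group = {preferred, *alternatives}
        for term in group:
            lookup[term.lower()] = group
    return lookup

def expand_query(query, lookup):
    """Return the query plus any taxonomy synonyms, sorted for stable output."""
    normalized = query.lower().strip()
    return sorted(lookup.get(normalized, {normalized}))

LOOKUP = build_lookup(SYNONYMS)
print(expand_query("Heart attack", LOOKUP))
# ['heart attack', 'myocardial infarction']
```

A query with no taxonomy entry is passed through unchanged, so the expansion can only add vocabulary to a search, never remove it.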
6 Enterprise Taxonomy Applications and Function
- Applications include
- Content Management Systems
- Document Management Systems
- Portals and Intranets
- e-Commerce Applications
- Functions include
- Supporting information retrieval
- Text classification
- Message Routing
- Topic Spotting
- Providing the conceptual basis for knowledge-based systems
7 Text Classification Methods
- Identify appropriate categories for a text
- Text is treated as a Bag of Words
- Generally create a statistical model of each category derived from a training set
- Accuracy is domain dependent
- At best 90%
- Generally around 70%
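As one illustration of the Bag of Words approach described above, here is a small multinomial Naive Bayes classifier with add-one smoothing. The slides do not name a specific statistical model, so Naive Bayes is an assumption chosen for brevity, and the training texts and labels are invented.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Bag of Words: lower-case and split on whitespace, ignoring word order."""
    return text.lower().split()

def train(labelled_docs):
    """Build per-category word counts from (category, text) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for category, text in labelled_docs:
        word_counts[category].update(tokenize(text))
        doc_counts[category] += 1
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocab

def classify(text, model):
    """Return the category with the highest smoothed log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best, best_score = None, -math.inf
    for category in doc_counts:
        counts = word_counts[category]
        denominator = sum(counts.values()) + len(vocab)
        score = math.log(doc_counts[category] / total_docs)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are skipped
                score += math.log((counts[word] + 1) / denominator)
        if score > best_score:
            best, best_score = category, score
    return best

# Invented training set, for illustration only.
TRAINING = [
    ("rail", "train rails line embankment"),
    ("rail", "train velocity direction"),
    ("medicine", "heart attack patient"),
    ("medicine", "patient heart treatment"),
]
model = train(TRAINING)
print(classify("the train left the line", model))  # rail
```

The model sees only word counts, which is why text structure and meaning play no part in the decision.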
8 Key issues
- Current approaches to text analysis ignore
- Text structure
- Document meaning
- Semantics of categories
9 How Taxonomies may be made Active using Lexical Chains
- An Active taxonomy would use the category information to assist processing
- Easy generation of navigation interfaces
- Rapid development of new content interfaces
- Reduced cost of localization
- Easy accommodation of name changes
10 What is a Lexical Chain?
- A set of words in a text that are related both by proximity and by lexical linking relations between the words that are derived from an external knowledge source (Morris and Hirst 1991)
- Uses Roget's Thesaurus for performance
- Simpler, more balanced structure than WordNet (4 ply deep)
- 1000 main headings, divided into Noun, Verb, Adj, Adv, and Phrase
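The definition above can be sketched as a greedy chainer: a word joins an existing chain when it shares a thesaurus heading with that chain and lies close enough to the chain's last member. This is a simplified illustration, not the Morris and Hirst algorithm; the `THESAURUS` fragment, its heading labels, and the `max_gap` parameter are all invented.

```python
# Hypothetical thesaurus fragment: word -> set of heading labels.
# Real headings would come from Roget's Thesaurus; these labels are invented.
THESAURUS = {
    "train": {"vehicle", "railway"},
    "rails": {"railway"},
    "line": {"railway"},
    "embankment": {"railway"},
    "travelling": {"motion"},
    "velocity": {"motion"},
}

def build_chains(words, thesaurus, max_gap=30):
    """Greedily group words that share a heading and lie within max_gap words."""
    chains = []  # each chain: {"headings": set, "members": [(position, word)]}
    for position, word in enumerate(words):
        headings = thesaurus.get(word)
        if not headings:
            continue  # word is not in the knowledge source
        for chain in chains:
            last_position = chain["members"][-1][0]
            shares_heading = bool(chain["headings"] & headings)
            if shares_heading and position - last_position <= max_gap:
                chain["members"].append((position, word))
                break
        else:
            # No suitable chain: start one that keeps this word's headings.
            chains.append({"headings": set(headings),
                           "members": [(position, word)]})
    return [[word for _, word in chain["members"]] for chain in chains]

text = "a very long train travelling along the rails with a constant velocity"
print(build_chains(text.split(), THESAURUS))
# [['train', 'rails'], ['travelling', 'velocity']]
```

Shrinking `max_gap` tightens the proximity requirement, so related words that sit far apart start separate chains.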
11 Lexical Chains in text
Quotation from Einstein 1939: "We suppose[4] a very long[6] train[0] travelling[3] along the rails with a constant velocity v and in the direction[1] indicated in figure[7]. People travelling in this direction will with advantage[5] use the train as a rigid reference[2]-body; they regard all events in reference-to the train. Then every event which takes place along the line also takes place at a particular point[10] of the train. Also, the definition[8] of simultaneity[9] can be given relative-to the train in exactly the same way as with respect to the embankment."
12 Actual Chains
- 0. train, rails, train, train, line, train, train, embankment
- 1. direction, people, direction
- 2. reference, regard, relative-to, respect
- 3. travelling, velocity, travelling, rigid
- 4. suppose, reference-to, place, place
- 5. advantage, events, event
- 6. long, constant
- 7. figure, body
13 The Generic Document Profile
- Objective is to represent a document as taxonomy categories
- Document strength in these categories is derived from the lexical chains extracted
- Sum of Link Strengths weighted by link type and inter-word distance
- Gives attribute-value vector to represent document
- Can be compared to give measure of inter-document similarity
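A minimal sketch of how such a profile might be computed and compared. The slides do not give the weighting scheme, so the link-type weights, the distance decay, and the use of cosine similarity are all assumptions made for illustration.

```python
import math
from collections import defaultdict

# Assumed weights per link type; the actual values are not given in the slides.
LINK_WEIGHTS = {"repetition": 1.0, "same_heading": 0.5}

def document_profile(links):
    """Sum weighted link strengths per taxonomy category.

    links: iterable of (category, link_type, distance_in_words) tuples,
    one per link found in the document's lexical chains.
    """
    profile = defaultdict(float)
    for category, link_type, distance in links:
        # Assumed decay: closer word pairs contribute more strength.
        profile[category] += LINK_WEIGHTS[link_type] / (1 + distance)
    return dict(profile)

def cosine_similarity(p, q):
    """Compare two attribute-value vectors; 1.0 means identical direction."""
    dot = sum(value * q.get(category, 0.0) for category, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

# Invented links for two short documents.
doc_a = document_profile([
    ("railway", "repetition", 3),
    ("railway", "same_heading", 1),
    ("motion", "same_heading", 1),
])
doc_b = document_profile([
    ("railway", "repetition", 1),
    ("motion", "same_heading", 3),
])
print(doc_a)  # {'railway': 0.5, 'motion': 0.25}
print(round(cosine_similarity(doc_a, doc_b), 3))
```

Sorting a profile by value gives the ranked category list, and the similarity score gives the inter-document comparison.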
14 GDP as Classification Method
- Advantages
- Gives list of categories ordered by value
- No training required
- Disadvantages
- Categories are those used in thesaurus or taxonomy
- Not all taxonomies have adequate vocabulary
15 Using the GDP As an Arbitrary Classifier
- Use machine learning training approach
- Trained documents produce profiles that can be merged, normalized, and associated with any category
- Classification process then compares against category models
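The merge, normalize, and compare steps above might look like the following sketch. Profiles are assumed to be dictionaries mapping taxonomy categories to strengths; the labels, the example values, and the choice of cosine similarity for the comparison are assumptions, not details taken from the slides.

```python
import math
from collections import defaultdict

def normalize(profile):
    """Scale a profile to unit length so document size does not dominate."""
    norm = math.sqrt(sum(v * v for v in profile.values()))
    return {k: v / norm for k, v in profile.items()} if norm else {}

def train_category_models(labelled_profiles):
    """Merge the profiles of each label's training documents, then normalize."""
    merged = defaultdict(lambda: defaultdict(float))
    for label, profile in labelled_profiles:
        for category, value in profile.items():
            merged[label][category] += value
    return {label: normalize(total) for label, total in merged.items()}

def classify(profile, models):
    """Return the label whose model is most similar (cosine) to the profile."""
    unit = normalize(profile)

    def score(label):
        model = models[label]
        return sum(value * model.get(category, 0.0)
                   for category, value in unit.items())

    return max(models, key=score)

# Invented training profiles: (arbitrary label, document profile).
TRAINING = [
    ("physics", {"railway": 0.5, "motion": 0.4}),
    ("physics", {"motion": 0.6, "time": 0.3}),
    ("medicine", {"disease": 0.7, "body": 0.2}),
]
models = train_category_models(TRAINING)
print(classify({"motion": 0.3, "railway": 0.1}, models))  # physics
```

Because the labels are attached to merged profiles rather than to thesaurus headings, the target categories need not appear in the taxonomy itself.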
16 Evaluation
- Compare against statistical approaches using reported corpora
- Initially using 20-newsgroups from BoW (McCallum 1999), 20 MB dataset
- Obtained WIPO Patent Dataset
- 450 MB XML zipped test data
- 500 MB XML zipped training data
- Also based on published Taxonomy
17 Research Questions
- Can the Bag of Words model be bettered with taxonomies?
- Can a subject specific taxonomy beat a generic one?
- Can subject specific taxonomies be extended accurately using generic ones?
18 Conclusion
- Taxonomies have a place in Enterprise Information Applications
- Text coherence can be used to characterise that text using external Active taxonomies
- This may be a practical text classification method
19 Questions?