Title: Controlled vocabularies: Thesauri and information retrieval
1Controlled vocabularies: Thesauri and information retrieval
- Michael Middleton
- QUT School of Information Systems, Brisbane, Australia
- m.middleton_at_qut.edu.au
- for
- STIMULATE 5
- Vrije Universiteit Brussel
- Brussels, Belgium
- July, 2005
2Introduction
- Context & History
- Vocabulary principles
- Thesaurus software
- Thesaurus building & application
- Thesaurus evaluation
- The future
3Context: Information life cycle
- Life-cycle stages (from the slide diagram): create, distribute, dispose, store, use, reuse, maintain, recall
4Context: Information management
- Domains
- Operational
- Analytical
- Strategic
5Context: indexing
- Producing representations of records or documents that constitute a finding aid to the records in a database or to part of a document
- Assigned indexing
- Derived indexing
6Indexer qualities
- The Art of assigned indexing
- Empathy
- Meticulousness
- Consistency
- General knowledge
- Patience
7Indexing guidelines
- Conceptual analysis and assigning
- Aboutness
- Elements of the document to consider
- Exhaustivity
- Specificity
- Index what is in the item
- Co-ordination
8Assigned index representations
- Alphabetical Subject
- Classified
- Alphabetical
- Notation
- Chain
9Indexing exercise
- How consistent is database indexing?
- Example: the same paper in multiple databases
- Middleton, M. Skills expectations of library graduates. http://eprints.qut.edu.au/archive/00000094/
- Index it yourself
- Compare your indexing with others
- Compare the indexing in ERIC and INSPEC
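One way to put a number on the comparison in this exercise is a set-overlap ratio between two indexers' term lists. The sketch below assumes a Jaccard-style measure (shared terms divided by all distinct terms used). The function name and the sample descriptors are hypothetical and are not taken from ERIC or INSPEC.

```python
def indexing_consistency(terms_a, terms_b):
    """Jaccard-style overlap between two sets of index terms (0.0 to 1.0)."""
    a = {t.strip().lower() for t in terms_a}
    b = {t.strip().lower() for t in terms_b}
    if not a and not b:
        return 1.0  # two empty indexings agree trivially
    return len(a & b) / len(a | b)


# Hypothetical descriptors assigned to the same paper by two indexers.
indexer_1 = ["Library education", "Job skills", "Graduates", "Employment"]
indexer_2 = ["Library education", "Graduates", "Curriculum"]

# 2 shared terms out of 5 distinct terms.
print(round(indexing_consistency(indexer_1, indexer_2), 2))
```

A value near 1.0 means the two indexers chose almost the same terms; low values are common when the vocabularies differ, which is the point of the exercise.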
10Context: metadata
- Agent
- Document description
- Responsibility
- Administrative
- Provenance
- Connections
- Conditions of use
11Context: metadata
- Content
- Topic (application of vocabulary control)
- Coverage
- Role
12Controlled vocabulary
- Thesaurus
- A controlled vocabulary of terms in natural language that are designed for post-coordination
- Classification scheme
- A scheme for organisation by categories in a systematic manner; this may involve grouping by subject, function or other criteria, or determining document naming conventions
- Often involves notation
13Purpose
- Indexing by translating diverse natural language to consistent terminology
- Establishing relationships among terms
- Information retrieval: improving precision and recall
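Precision and recall can be computed directly from a set of retrieved documents and a set of documents judged relevant. The following is a minimal sketch; the document identifiers are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search.

    precision = relevant items retrieved / all items retrieved
    recall    = relevant items retrieved / all relevant items
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search: 4 documents retrieved, 3 relevant in the database.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d5"}

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```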
14History
- Bibliographic databases
- Many applications; list of online associated thesauri and classification schemes at http://sky.fit.qut.edu.au/middletm/cont_voc.html
- Standards
- ISO 2788, ISO 5964
- ANSI Z39.19
15Thesaurus principles
- Term relationships
- Continuing evolution
- Internally consistent hierarchies to support
database searching
16The Thesaurus
- The vocabulary of a controlled indexing language, formally organised so that the a priori relationships between concepts are made explicit.
- A thesaurus is an example of metadata
17Thesaurus extract (ISO sample)
- 35 mm CAMERAS
  - BT MINIATURE CAMERAS
- CAMERAS
  - BT OPTICAL EQUIPMENT
  - NT MOVING PICTURE CAMERAS; STEREO CAMERAS; STILL CAMERAS; UNDERWATER CAMERAS
  - RT PHOTOGRAPHY
- CINE CAMERAS
  - BT MOVING PICTURE CAMERAS
  - NT UNDERWATER CINE CAMERAS
  - RT CINEMA
- CINEMA
  - RT CINE CAMERAS
- INSTANT PICTURE CAMERAS
  - SN Cameras which produce a finished print directly
  - BT STILL CAMERAS
- Land cameras USE VIEW CAMERAS
- MICROSCOPES
  - BT OPTICAL EQUIPMENT
- MINIATURE CAMERAS
  - BT STILL CAMERAS
  - NT 35 mm CAMERAS
- MOVING PICTURE CAMERAS
  - BT CAMERAS
  - NT CINE CAMERAS
- TELEVISION CAMERAS
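Part of the extract above can be held as a simple data structure and checked for reciprocity: every BT should be mirrored by an NT in the other entry, and every RT should point both ways. The sketch below is one possible representation, not the layout prescribed by the standard. The dictionary layout and the function name are assumptions, and targets that have no entry of their own in the extract are skipped.

```python
import copy

# Each relationship and the relationship expected in the opposite direction.
RECIPROCAL = {"BT": "NT", "NT": "BT", "RT": "RT"}

# Entries transcribed from the extract (scope notes omitted).
THESAURUS = {
    "35 mm CAMERAS": {"BT": ["MINIATURE CAMERAS"]},
    "CAMERAS": {
        "BT": ["OPTICAL EQUIPMENT"],
        "NT": ["MOVING PICTURE CAMERAS", "STEREO CAMERAS",
               "STILL CAMERAS", "UNDERWATER CAMERAS"],
        "RT": ["PHOTOGRAPHY"],
    },
    "CINE CAMERAS": {
        "BT": ["MOVING PICTURE CAMERAS"],
        "NT": ["UNDERWATER CINE CAMERAS"],
        "RT": ["CINEMA"],
    },
    "CINEMA": {"RT": ["CINE CAMERAS"]},
    "INSTANT PICTURE CAMERAS": {"BT": ["STILL CAMERAS"]},
    "MICROSCOPES": {"BT": ["OPTICAL EQUIPMENT"]},
    "MINIATURE CAMERAS": {"BT": ["STILL CAMERAS"], "NT": ["35 mm CAMERAS"]},
    "MOVING PICTURE CAMERAS": {"BT": ["CAMERAS"], "NT": ["CINE CAMERAS"]},
}


def missing_reciprocals(thesaurus):
    """List (entry, relation, term) links that should exist but do not."""
    problems = []
    for term, rels in thesaurus.items():
        for rel, back in RECIPROCAL.items():
            for target in rels.get(rel, []):
                if target not in thesaurus:
                    continue  # target has no entry in this partial extract
                if term not in thesaurus[target].get(back, []):
                    problems.append((target, back, term))
    return problems


print(missing_reciprocals(THESAURUS))  # the extract is consistent: []

# Deliberately break one link to show what the check reports.
broken = copy.deepcopy(THESAURUS)
broken["MINIATURE CAMERAS"]["NT"].remove("35 mm CAMERAS")
print(missing_reciprocals(broken))
```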
19Standardising the Vocabulary
- Types of entities; forms of terms
- Singular vs plural
- Homonyms
- Choice of terms
- Scope notes and history notes
20Compound terms
- Terms should be factored into simpler elements to improve users' understanding.
- Semantic factoring
- Syntactic factoring
21Semantic Relationships
- Equivalence
  - Establishing relationships between preferred (postable) and non-preferred (non-postable) terms
- Hierarchical
  - Establishing relationships between subordinate and superordinate terms. These may be distinguished as:
    - Generic
    - Whole-part
    - Instance
- Associative
  - Establishing relationships between terms that are mentally associated, but not equivalent or hierarchical
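The equivalence and hierarchical relationships can be sketched as two lookups used at search time: a non-preferred entry term is switched to its preferred term, and a preferred term is expanded to include its narrower terms. The names `USE`, `NT`, `preferred` and `expand` are illustrative assumptions. The sample terms come from the ISO camera extract, and the narrower terms listed for STILL CAMERAS are inferred from the BT statements there.

```python
# Equivalence: non-preferred (non-postable) term -> preferred term.
USE = {"land cameras": "VIEW CAMERAS"}

# Hierarchy: preferred term -> narrower terms.
NT = {
    "CAMERAS": ["MOVING PICTURE CAMERAS", "STEREO CAMERAS",
                "STILL CAMERAS", "UNDERWATER CAMERAS"],
    "MOVING PICTURE CAMERAS": ["CINE CAMERAS"],
    "CINE CAMERAS": ["UNDERWATER CINE CAMERAS"],
    "STILL CAMERAS": ["INSTANT PICTURE CAMERAS", "MINIATURE CAMERAS"],
    "MINIATURE CAMERAS": ["35 mm CAMERAS"],
}


def preferred(term):
    """Follow a USE reference if one exists; otherwise keep the term."""
    return USE.get(term.strip().lower(), term.strip())


def expand(term):
    """Return the term followed by all of its narrower terms, breadth first."""
    result, queue = [], [term]
    while queue:
        current = queue.pop(0)
        if current in result:
            continue  # guards against repeats or accidental cycles
        result.append(current)
        queue.extend(NT.get(current, []))
    return result


print(preferred("Land cameras"))
print(expand("MOVING PICTURE CAMERAS"))
```

Searching on the expanded list rather than the single term is one way a hierarchy can raise recall without the searcher knowing every narrower term.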
22... but, the Functions thesaurus
- Whereas
- agenda papers might have
- broader term: documents
- In a functions thesaurus
- agenda papers might have
- broader term: meetings
23Applying a functional thesaurus
- Top Term
  - PERSONNEL
- Scope Notes
  - The function of managing all employees
- Related Terms
  - COMPENSATION
  - ESTABLISHMENT
  - INDUSTRIAL RELATIONS etc, etc
- Narrower Terms
  - ALLOWANCES
  - APPEALS (Decisions)
  - APPOINTMENT
  - ARRANGEMENTS
  - AUTHORISATION
  - COMMITTEES
  - COMPLIANCE etc, etc
- Use For Terms
25Thesaurus Display
- Alphabetical hierarchies
- One level above and below entry term
- Complete hierarchy for each term or separate TT display
- Permuted term lists
- Combination with classification notation
- Graphic Displays
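A permuted term list gives access to a compound term under each of its words. A minimal sketch, assuming a simple word split with no stop-word handling; the function name is hypothetical and the sample terms are taken from the ISO camera extract.

```python
def permuted_index(terms):
    """Return sorted (access word, full term) pairs, one per word per term."""
    entries = []
    for term in terms:
        for word in term.split():
            entries.append((word, term))
    return sorted(entries)


terms = ["MOVING PICTURE CAMERAS", "STILL CAMERAS", "UNDERWATER CINE CAMERAS"]

for word, term in permuted_index(terms):
    print(f"{word:<12}{term}")
```

All three terms file together under CAMERAS, and UNDERWATER CINE CAMERAS can also be found under CINE and UNDERWATER.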
26Applying a thesaurus
- Download Term Tree from http://www.termtree.com.au
- Free trial download from
27Thesaurus software
- Assigned
- Integrated database
- Deriving terminology
28Thesaurus software - assigned
- Terms are assigned by vocabulary specialists in an independent database
- a.k.a.
  - Synercon Management Consulting
- MultiTes
- OpenCyc
- SuperTHES
  - from THESmain/THESshow for mono-/multilingual thesauri
- Term Tree 2000
- WebChoir
- Wordmap
29Thesaurus software - integrated database
- Terms are assigned by specialists; the thesaurus works like an active data dictionary to control the database
- BASIS
- InMagic Bibliotech PRO
- BRS/Search
- STAR
30Thesaurus software for deriving terminology
- Terms are created automatically from text
- Entrieva
  - SemioTagger, SemioMap and SemioSkyline for viewing
- Intology
  - taxonomy builder
- Verity
  - Thematic Mapping
- Autonomy
  - taxonomy generation and categorization
31Thesaurus Building - 1
- Users
- Define
- Identify needs
- Define thesaurus range and depth
- Raw vocabulary building
- Identify sources
- Collect and record terms
32Thesaurus Building - 2
- Vocabulary organisation
- Cluster terms
- Establish relationships using symbols
- Maintenance
33Business application
- Not long-term collaborative efforts of classification specialists
- Instead, adapt to business changes
- Not just descriptions of present business processes
- Instead, reflect strategic planning, competitors
- Not necessarily a single taxonomy
- Instead, multiple overlapping taxonomies
34Content management
- Describe content as it's being created rather than classify after creation
- User-needs orientation
35Integrating taxonomies
- Accurate reporting
- Exchange of data
- Assist resource discovery
- Information retrieval
36Thesaurus evaluation
- Qualities
- Information retrieval evaluation
37Thesaurus Qualities
- Scope and features description
- Display forms
- Correctness of hierarchies
- Use of scope, history and qualification
- Adherence to standards
- Syndetic measures
- Connectedness
- Accessibility
38Thesauri: Retrieval evaluation
- Cranfield experiments since
- Recall and precision
- Influence on indexing
- Conceptual analysis
- Translation failure
- Omissions
- Exhaustivity/Specificity
- Syntax and false drops
- Maintenance costs
39Post-controlled vocabularies
- Use of a hedge of terms to represent a broad concept, e.g.
- psychological aspects of ...
- ... in Australia
- ... review items on ...
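A hedge can be pictured as a stored list of natural-language terms that is ORed together and combined with the searcher's topic. The sketch below only builds a query string. The hedge contents, the function names and the generic AND/OR syntax are all assumptions for illustration, not the syntax of any particular retrieval system.

```python
# Hypothetical stored hedge: one broad concept, many natural-language terms.
HEDGES = {
    "psychological aspects": [
        "psychology", "psychological", "attitudes", "behaviour",
    ],
}


def quote(term):
    """Put multi-word terms in double quotes so they are kept as phrases."""
    return f'"{term}"' if " " in term else term


def build_query(topic, hedge_name):
    """Combine a topic with every term of a stored hedge."""
    hedge = " OR ".join(quote(t) for t in HEDGES[hedge_name])
    return f"{quote(topic)} AND ({hedge})"


print(build_query("information retrieval", "psychological aspects"))
```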
40Still to come
- Research areas
- Metathesauri
- Super interlinked vocabularies (e.g. NLM)
- Semantic Web
- Enhancing word association with usage statistics
like links (e.g. THESUS)
41Review
- Controlled vocabulary types
- Software support
- Business processes
- Website
- http://sky.fit.qut.edu.au/middletm/cont_voc.html
- (about to move to a database-driven site; redirection will be applied)
42Questions?