Title: The Use of Classification in Information Retrieval
1The Use of Classification in Information Retrieval
- Barbara H. Kwasnik
- School of Information Studies
- Syracuse University
- ASIST Annual Conference
- Charlotte, NC
- November 2, 2005
2The Process of Classification
- Classification is the partitioning of experience
into meaningful clusters.
- Two necessary processes work in parallel:
- Clustering: finding similar attributes along some
meaningful dimensions in order to group things
together into classes, and
- Discrimination: determining rules for
distinctions among things, so that we can create
boundaries for classes.
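The two processes above can be illustrated with a minimal Python sketch. The plant data, the attribute names, and the functions `cluster` and `is_hardy_perennial` are hypothetical, chosen only for illustration: clustering groups items that share a value along one chosen dimension, and discrimination is written as an explicit rule that decides whether an item falls inside a class boundary.

```python
from collections import defaultdict

# Hypothetical items described by a few attributes (illustrative data only).
ITEMS = [
    {"name": "lavender", "life_cycle": "perennial", "frost_hardy": True},
    {"name": "marigold", "life_cycle": "annual", "frost_hardy": False},
    {"name": "peony", "life_cycle": "perennial", "frost_hardy": True},
    {"name": "dahlia", "life_cycle": "perennial", "frost_hardy": False},
]


def cluster(items, dimension):
    """Clustering: group items that share a value along one dimension."""
    classes = defaultdict(list)
    for item in items:
        classes[item[dimension]].append(item["name"])
    return dict(classes)


def is_hardy_perennial(item):
    """Discrimination: an explicit rule that draws a class boundary."""
    return item["life_cycle"] == "perennial" and item["frost_hardy"]


by_life_cycle = cluster(ITEMS, "life_cycle")
hardy_perennials = [i["name"] for i in ITEMS if is_hardy_perennial(i)]
```

Choosing a different dimension (for example `frost_hardy`) would yield a different set of classes from the same items, which is one way to see that the clusters depend on which dimensions are treated as meaningful.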
3What's the Point of Classifying?
- Retrieval and Re-Finding.
- If we're trying to retrieve something we know is
there, classification provides a shortcut. By
clustering like things together, it helps us find
things again that were stored there in the past.
- Browsing and Exploration.
- By the same token, a classified collection can be
searched/browsed for something even if we only
have reason to suspect it's there but don't know
for sure, such as in stores or in libraries.
4What's the Point of Classifying?
- Communication.
- Classification creates labels and definitions for
class inclusion. This enables communication about
disparate phenomena by establishing a common
ground.
- Knowledge representation.
- Classifications are knowledge structures and thus
visualize and reflect what we know about things.
Such representations can help us to understand
things better, identify gaps, recognize patterns,
predict future trends, etc.
5The Information-Retrieval Problem
- Borrowing the notion from Bob Oddy, an
information seeking and retrieval event can be
construed as a dialogue in which the user
reveals him or herself to the system, and the
system, in turn, reveals itself to the user.
6The Information-Retrieval Problem Space
[Diagram: the user and the user's context on one side,
and the collection/documents/information on the other.
The user articulates and represents a request using some
strategy; the system holds a representation of the
collection/documents/information. The two sides meet in
matching/intermediation. Labeled flows: Query, Output.]
7Challenges on the Query-Formulation Side
- It may be difficult to articulate a request
- Request may be vague, unformed
- Request may be difficult to translate/express
- User may not know what is available, or what
there is to choose from
- Users like to ask for what they think they might
reasonably find; they often don't like to shoot blind
- Strategy for level of precision may not be
obvious
- User may not be aware of the precision the
system can accommodate
- Query may thus be too broad or too specific
8Challenges on the System Side
- Expressiveness available to the system may be
inadequate and thus the resulting representation
may be incomplete
- In a plant catalog, no vocabulary for expressing
the hardiness of the plant
- The available descriptive dimensions
may not be the most useful ones
- Original MARC record for archival collections
- First author only in Web of Science
9Challenges on the System Side
- Inadequate or inconsistent levels of precision or
granularity
- In Dewey Decimal, all other religions share
one-tenth of the space.
- LCSH policy for indexing only the main topic of a
book.
- Inter-indexer and intra-indexer inconsistency or
system error might affect reliability of the
representations.
- Indexers' tendency to choose a limited range of
terms, even when others are more appropriate.
10Challenges in the Matching
- User's and system's knowledge structures may not
be the same
- User's and system's vocabulary may differ
- The mechanism by which the user is revealed to
the system and vice versa may be inadequate for
achieving a good match
- E.g., in many OPACs, the search option
"subject" is ambiguous and does not indicate
clearly that it means LCSH subject headings, and
not keyword.
- The navigational affordance and control of the
system may cause confusion, inconsistencies, and
miscommunication between user and system.
11Challenges in the Output
- Undifferentiated output is difficult to navigate
and interpret
12The Role of Classification in IR
- Classification can play a role in remedying many
of the challenges in an IR event by the use of:
- Classified collection representations
- Classified output: clusters, classes
- Thesauri and taxonomies for aid in question
formulation and document representation
- Classification for navigation and browsing
13Possible Scenarios
14Examples
- A classified cookbook index yields a term that
points to a classified section of the cookbook.
- Cookies > nut cookies > macaroons points to the
cookie section of the cookbook.
- A keyword entered in a search box yields a
classified menu of options leading to an
unclassified list of products.
- Orvis clothing online
15How Classification Helps
- What Threats to Successful IR can be helped by
classification?
16At the Query-Formulation End
- Ability to choose from a classified display or to
use a classification to support query-formulation
strategies
- Reduces the need to come up with a term on your
own (recognition is easier than production)
- Supports browsing and precise known-item
searching.
- Allows you to learn something about the domain
(i.e., learn the knowledge structure of the
system)
17More at the Query-Formulation End
- Orients you in the information space. Gives you a
sense of the whole: what's available, what's
included
- Terms are disambiguated because they are in
context
- Use of a taxonomy allows you to expand or narrow
the query or suggests other related and possibly
useful terms
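As a rough sketch of how a taxonomy can support expanding, narrowing, and suggesting terms, consider the following Python fragment. The `TAXONOMY` data and the function names `narrower`, `broader`, `related`, and `expand` are assumptions made for illustration; a real thesaurus would carry far more terms and relationship types.

```python
# Hypothetical taxonomy fragment: each term records its broader term,
# its narrower terms, and related terms (illustrative data only).
TAXONOMY = {
    "cookies": {
        "broader": "baked goods",
        "narrower": ["nut cookies", "bar cookies"],
        "related": ["biscuits"],
    },
    "nut cookies": {"broader": "cookies", "narrower": ["macaroons"], "related": []},
    "macaroons": {"broader": "nut cookies", "narrower": [], "related": []},
}


def narrower(term):
    """Terms one level more specific than `term` (narrowing the query)."""
    return list(TAXONOMY.get(term, {}).get("narrower", []))


def broader(term):
    """The term one level more general than `term`, or None if unknown."""
    entry = TAXONOMY.get(term)
    return entry["broader"] if entry else None


def related(term):
    """Other possibly useful terms suggested by the taxonomy."""
    return list(TAXONOMY.get(term, {}).get("related", []))


def expand(term):
    """Expand a query term to itself plus all of its descendants."""
    terms = [term]
    for child in narrower(term):
        terms.extend(expand(child))
    return terms


expanded = expand("cookies")
```

A search for `cookies` could then be run against every term in `expanded`, while `broader("macaroons")` offers a fallback when a very specific query is too narrow.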
18At the System End
- Taxonomies, thesauri, and ontologies help ensure
consistent representation at all stages
- Help identify gaps and inconsistencies in the
coverage
- Provide guidance for depth and exhaustivity of
representation
- Help provide guidance in navigation
- E.g., navigational index and content indexes on a
website
19At the Matching and Output Stages
- The matching in a classified collection allows
for adjustments to granularity to help minimize
the probability of no hits.
- Classified output is browsable.
- Classified output eases the navigation and
cognitive processing of large result sets.
- Can help users orient themselves in the
information space.
- May provide suggestions for the next round of
querying.
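The points above about adjusting granularity and presenting classified output can be sketched as follows. This is a minimal illustration under stated assumptions: the `PARENT` hierarchy, the `DOCUMENTS` list, and the functions `match` and `group_by_class` are hypothetical, and the back-off strategy shown (retrying with the next broader class when a class has no hits) is just one simple way to adjust granularity.

```python
from collections import defaultdict

# Hypothetical data: each class has a broader parent class, and each
# document is assigned to exactly one class (illustrative only).
PARENT = {"macaroons": "nut cookies", "nut cookies": "cookies", "cookies": None}
DOCUMENTS = [
    ("Almond crescents", "nut cookies"),
    ("Pecan sandies", "nut cookies"),
    ("Sugar cookies", "cookies"),
]


def match(query_class):
    """Match on a class; if there are no hits, back off to the broader class."""
    current = query_class
    while current is not None:
        hits = [(title, cls) for title, cls in DOCUMENTS if cls == current]
        if hits:
            return current, hits
        current = PARENT.get(current)
    return None, []


def group_by_class(hits):
    """Classified output: present results grouped under their classes."""
    grouped = defaultdict(list)
    for title, cls in hits:
        grouped[cls].append(title)
    return dict(grouped)


matched_class, hits = match("macaroons")
all_grouped = group_by_class(DOCUMENTS)
```

Here a query on `macaroons` finds nothing at that level and is answered at the broader `nut cookies` level instead of returning an empty result; reporting `matched_class` back to the user also hints at a term for the next round of querying.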
20The Power of Classification for IR
- When a good classification is deployed in a
retrieval system it has the ability to add a big
jolt of information power.
- A robust and valid classification expresses a
great deal about the nature of the entities being
classified, and also the relationship of these
entities to each other.
- On the other hand, we need much less knowledge of
a domain in order to read a classification that
is presented to us.
21But, There Are Constraints
- No classification is ever a complete
representation.
- All classifications are created in a cultural
context.
- Some classifications are good at description,
some good at visualization, some at reflecting a
body of knowledge.
- With very few exceptions, most can't do all at
once.
22Says a Lot, but not Everything
- Early stages of breast cancer
- Stage 0: Cancer cells are present but they
haven't spread
- Stage I: Cancer has spread 2 cm or less
- Stage II: Cancer has spread <5 cm; sometimes
the lymph nodes may be involved.
- Advanced stages of breast cancer
- Stage III: >5 cm; spread to lymph nodes
- Stage IV: metastatic; spread to other parts of
the body, such as bone, liver, lung, or brain.
23Selective Representation
- No classification can represent all perspectives
at the same time. While appealing as a
visualization, some things get privileged, and
some get masked.
24All Depends on What Theory You Adopt
- Genealogical linguistic classification
- For any two languages: do they have a common
origin?
- Typological linguistic classification
- For any two languages: what similarities of form
do they have?
25What a Good Classification Should Be
- Elegant and parsimonious: expresses the domain
being classified as succinctly and efficiently as
possible.
- Expressive and complete: sufficient to
accommodate all entities within it with just
enough specificity to be useful.
- Memorable and usable: affords devices to help
the user learn it and relearn it.
- Flexible and hospitable: allows new entities to
be added gracefully and coherently.
- There are very few classifications
that can be all of these.
26Conclusion
- Classification has an important role to play in
information retrieval
- For enhanced communication between system and
user
- For precision, flexibility, completeness, and
consistency of document and query representation
- For expressiveness at the knowledge-structure
level but also the navigational level
- For browsing to reduce cognitive load, increase
fun.
- But, care must be taken to ensure that the
classification is appropriate to the various
functions of information seeking and retrieval.