The Use of Classification in Information Retrieval

1
The Use of Classification in Information Retrieval
  • Barbara H. Kwasnik
  • School of Information Studies
  • Syracuse University
  • ASIST Annual Conference
  • Charlotte, NC
  • November 2, 2005

2
The Process of Classification
  • Classification is the partitioning of experience
    into meaningful clusters.
  • Two necessary processes work in parallel:
  • Clustering: finding similar attributes along some
    meaningful dimensions in order to group things
    together into classes, and
  • Discrimination: determining rules for
    distinctions among things, so that we can create
    boundaries for classes.
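
The two processes can be illustrated with a minimal sketch. The items, the "kind" and "bake_minutes" dimensions, and the 30-minute boundary are all hypothetical, chosen only to show grouping by a shared attribute (clustering) next to an explicit boundary rule (discrimination).

```python
from collections import defaultdict

# Hypothetical items described along two dimensions (illustrative only).
items = [
    {"name": "macaroon", "kind": "cookie", "bake_minutes": 20},
    {"name": "shortbread", "kind": "cookie", "bake_minutes": 35},
    {"name": "baguette", "kind": "bread", "bake_minutes": 25},
    {"name": "rye loaf", "kind": "bread", "bake_minutes": 50},
]

def cluster(items, dimension):
    """Clustering: group items that share a value along one dimension."""
    groups = defaultdict(list)
    for item in items:
        groups[item[dimension]].append(item["name"])
    return dict(groups)

def discriminate(item, boundary=30):
    """Discrimination: an explicit rule that draws the class boundary."""
    return "quick bake" if item["bake_minutes"] < boundary else "slow bake"

by_kind = cluster(items, "kind")
by_rule = {item["name"]: discriminate(item) for item in items}
```

Choosing a different dimension or a different boundary yields a different set of classes from the same items.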

3
What's the Point of Classifying?
  • Retrieval and Re-Finding.
  • If we're trying to retrieve something we know is
    there, classification provides a shortcut. By
    clustering like things together, it helps us find
    things again that were stored there in the past.
  • Browsing and Exploration.
  • By the same token, a classified collection can be
    searched/browsed for something even if we only
    have reason to suspect it's there but don't know
    for sure, such as in stores, or in libraries.

4
What's the Point of Classifying?
  • Communication.
  • Classification creates labels and definitions for
    class inclusion. This enables communication about
    disparate phenomena by establishing a common
    ground.
  • Knowledge representation.
  • Classifications are knowledge structures and thus
    visualize and reflect what we know about things.
    Such representations can help us to understand
    things better, identify gaps, recognize patterns,
    predict future trends, etc.

5
The Information-Retrieval Problem
  • Borrowing the notion from Bob Oddy, an
    information seeking and retrieval event can be
    construed as a dialogue in which the user
    reveals himself or herself to the system, and the
    system, in turn, reveals itself to the user.

6
The Information-Retrieval Problem Space
  • [Diagram] The user, in the user's context,
    articulates and represents a request using some
    strategy (the query).
  • [Diagram] The system holds a representation of the
    collection, documents, and information.
  • [Diagram] Matching/intermediation connects the
    query with that representation and produces the
    output.

7
Challenges on the Query-Formulation Side
  • It may be difficult to articulate a request
  • Request may be vague, unformed
  • Request may be difficult to translate/express
  • User may not know what is available, or what
    there is to choose from
  • Users like to ask for what they think they might
    reasonably find; they often don't like to shoot blind
  • Strategy for level of precision may not be
    obvious
  • User may not be aware of the precision the
    system can accommodate
  • Query may thus be too broad or too specific

8
Challenges on the System Side
  • Expressiveness available to the system may be
    inadequate and thus the resulting representation
    may be incomplete
  • In a plant catalog, no vocabulary for expressing
    the hardiness of the plant
  • The available descriptive dimensions may not be
    the most useful ones
  • Original MARC record for archival collections
  • First author only in Web of Science

9
Challenges on the System Side
  • Inadequate or inconsistent levels of precision or
    granularity
  • In Dewey Decimal, "all other religions" share
    one-tenth of the space.
  • LCSH policy of indexing only the main topic of a
    book.
  • Inter-indexer and intra-indexer inconsistency or
    system error might affect reliability of the
    representations.
  • Indexers' tendency to choose a limited range of
    terms, even when others are more appropriate.

10
Challenges in the Matching
  • User's and system's knowledge structures may not
    be the same
  • User's and system's vocabulary may differ
  • The mechanism by which the user is revealed to
    the system and vice versa may be inadequate for
    achieving a good match
  • E.g., in many OPACs, the search option "subject"
    is ambiguous: it does not clearly indicate that
    it means LCSH subject headings and not
    keywords.
  • The navigational affordance and control of the
    system may cause confusion, inconsistencies, and
    miscommunication between user and system.

11
Challenges in the Output
  • Undifferentiated output is difficult to navigate
    and interpret

12
The Role of Classification in IR
  • Classification can play a role in remedying many
    of the challenges in an IR event through the use of:
  • Classified collection representations
  • Classified output: clusters, classes
  • Thesauri and taxonomies for aid in question
    formulation and document representation
  • Classification for navigation and browsing

13
Possible Scenarios

14
Examples
  • A classified cookbook index yields a term that
    points to a classified section of the cookbook.
  • "Cookies > nut cookies > macaroons" points to the
    cookie section of the cookbook.
  • A keyword entered in a search box yields a
    classified menu of options leading to an
    unclassified list of products.
  • Orvis clothing online

15
How Classification Helps
  • Which threats to successful IR can be mitigated by
    classification?

16
At the Query-Formulation End
  • Ability to choose from a classified display or to
    use a classification to support query-formulation
    strategies
  • Reduces the need to come up with a term on your
    own (recognition is easier than production)
  • Supports browsing, and precise known-item
    searching.
  • Allows you to learn something about the domain
    (i.e., learn the knowledge structure of the
    system)

17
More at the Query-Formulation End
  • Orients you in the information space. Gives you a
    sense of the whole: what's available, what's
    included
  • Terms are disambiguated because they are in
    context
  • Use of a taxonomy allows you to expand or narrow
    the query or suggests other related and possibly
    useful terms
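
Expanding, narrowing, and suggesting related terms can be sketched over a toy taxonomy. The BROADER table and the function names broaden, narrow, and related are hypothetical and serve only as an illustration; a real thesaurus would carry richer relationships.

```python
# Hypothetical toy taxonomy: each term maps to its broader (parent) term.
BROADER = {
    "macaroons": "nut cookies",
    "nut cookies": "cookies",
    "sugar cookies": "cookies",
    "cookies": "desserts",
    "pies": "desserts",
}

def broaden(term):
    """Expand the query: return the broader term, or None at the top."""
    return BROADER.get(term)

def narrow(term):
    """Narrow the query: return the terms directly below `term`, sorted."""
    return sorted(child for child, parent in BROADER.items() if parent == term)

def related(term):
    """Suggest related terms: siblings that share the same broader term."""
    parent = broaden(term)
    if parent is None:
        return []
    return [t for t in narrow(parent) if t != term]
```

For example, a user who starts at "cookies" could be offered "desserts" as a broader query, "nut cookies" and "sugar cookies" as narrower ones, and "pies" as a related term.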

18
At the System End
  • Taxonomies, thesauri, and ontologies help ensure
    consistent representation at all stages
  • Help identify gaps and inconsistencies in the
    coverage
  • Provide guidance for depth and exhaustivity of
    representation
  • Help provide guidance in navigation
  • E.g., navigational index and content indexes on a
    website

19
At the Matching and Output Stages
  • The matching in a classified collection allows
    for adjustments to granularity to help minimize
    the probability of no hits.
  • Classified output is browsable.
  • Classified output eases the navigation and
    cognitive processing of large result sets.
  • Can help users orient themselves in the
    information space.
  • May provide suggestions for the next round of
    querying.
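
A minimal sketch of two of these points, under stated assumptions: matching backs off to a broader class when the requested class yields no hits, and the output is grouped by class so it can be browsed. The PARENT hierarchy, the COLLECTION, and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical class hierarchy (child -> parent) and a tiny classified collection.
PARENT = {"macaroons": "nut cookies", "nut cookies": "cookies", "cookies": "desserts"}
COLLECTION = [
    ("Almond crescents", "nut cookies"),
    ("Pecan sandies", "nut cookies"),
    ("Gingersnaps", "cookies"),
]

def descendants_or_self(cls):
    """Return all classes at or below `cls` in the hierarchy."""
    found = {cls}
    changed = True
    while changed:
        changed = False
        for child, parent in PARENT.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found

def search(cls):
    """Match at the requested class; on zero hits, back off to the broader class."""
    while cls is not None:
        scope = descendants_or_self(cls)
        hits = [(title, c) for title, c in COLLECTION if c in scope]
        if hits:
            return cls, hits
        cls = PARENT.get(cls)
    return None, []

def group_by_class(hits):
    """Classified output: group hit titles under the class they belong to."""
    grouped = defaultdict(list)
    for title, c in hits:
        grouped[c].append(title)
    return dict(grouped)

matched_class, hits = search("macaroons")
output = group_by_class(hits)
```

Here a query for "macaroons" finds nothing at that level, so the match is made one level up at "nut cookies". The class labels in the grouped output also hint at where the next round of querying could go.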

20
The Power of Classification for IR
  • When a good classification is deployed in a
    retrieval system it has the ability to add a big
    jolt of information power.
  • A robust and valid classification expresses a
    great deal about the nature of the entities being
    classified, and also the relationship of these
    entities to each other.
  • On the other hand, we need much less knowledge of
    a domain in order to read a classification that
    is presented to us.

21
But, There Are Constraints
  • No classification is ever a complete
    representation.
  • All classifications are created in a cultural
    context.
  • Some classifications are good at description,
    some good at visualization, some at reflecting a
    body of knowledge.
  • With very few exceptions, classifications can't
    do all of these at once.

22
Says a Lot, but not Everything
  • Early stages of breast cancer
  • Stage 0: Cancer cells are present but they
    haven't spread
  • Stage I: Cancer has spread 2 cm or less
  • Stage II: Cancer has spread <5 cm; sometimes
    the lymph nodes may be involved.
  • Advanced stages of breast cancer
  • Stage III: >5 cm; spread to lymph nodes
  • Stage IV: Metastatic; spread to other parts of
    the body, such as bone, liver, lung, or brain.

23
Selective Representation
  • No classification can represent all perspectives
    at the same time. While appealing as a
    visualization, some things get privileged, and
    some get masked.

24
All Depends on What Theory You Adopt
  • Genealogical linguistic classification
  • For any two languages: do they have a common
    origin?
  • Typological linguistic classification
  • For any two languages: what similarities of form
    do they have?

25
What a Good Classification Should Be
  • Elegant and parsimonious: expresses the domain
    being classified as succinctly and efficiently as
    possible.
  • Expressive and complete: sufficient to
    accommodate all entities within it with just
    enough specificity to be useful.
  • Memorable and usable: affords devices to help
    the user learn it and relearn it.
  • Flexible and hospitable: allows new entities to
    be added gracefully and coherently.
  • There are very few classifications that can be
    all of these.

26
Conclusion
  • Classification has an important role to play in
    information retrieval:
  • For enhanced communication between system and
    user
  • For precision, flexibility, completeness, and
    consistency of document and query representation
  • For expressiveness at the knowledge-structure
    level but also the navigational level
  • For browsing, to reduce cognitive load and
    increase fun.
  • But, care must be taken to ensure that the
    classification is appropriate to the various
    functions of information seeking and retrieval.