Title: Indexing and information retrieval
1 Indexing and information retrieval
- INF 384C
- UT Austin School of Information
2 Introduction to Indexing
- "The main purpose of indexing and abstracting is
to construct representations of published items
in a form suitable for inclusion in some type of
database." (Lancaster, 1)
- "The basic assumption is that indexers are able
to state what a document is about by
formulating an expression which summarizes the
content of the document." (Hutchins, 92)
3 Introduction to Indexing
- Though often quite complex, the basic goal of
most indexing is to represent the meaning of a
document (its aboutness) in a compressed form
that will be amenable to search and retrieval.
- Traditionally, the notion of a document's
aboutness is closely related to its topical
subject matter.
4 Indexing and the IR Process
[Diagram: the indexer interprets a document (doc1) and, using the indexing
vocabulary, creates a surrogate (s1), which is stored in the database.]
5 Terms Associated with Indexing
6 The Task of the Indexer
- Indexing involves two principal steps:
- Conceptual analysis: interpreting the document to
ascertain its meaning
- Translation: converting the document into a set of descriptors
that the system recognizes
7 Conceptual Analysis
- In the indexing literature, we see a lot of
tension about conceptual analysis.
- But in many cases, people are quite good at
picking out important ideas from a text, at
analyzing a document's aboutness.
8 Translation: Assigning Descriptors
- If we were using a free-text or natural
language indexing system, then we could just
write down our descriptors for each document
after conceptual analysis.
- But many systems impose more order on the process
by using a controlled vocabulary.
- Why do they add this step, this translation?
9 Vocabulary Control: Important Ideas
- Types of Controlled Vocabularies (not mutually
exclusive):
- Subject Headings
- Thesauri
- Name Authority Files
10 Authority Control
11 Subject Heading Lists
- Controlled vocabularies are usually expressed as
a list of vetted, legitimate terms, frequently
referred to as subject headings.
- These may be organized in various ways:
- Alphabetically
- Topically
- In a Thesaurus (via lexical-semantic
relationships)
12 Thesauri: LCSH Example
13 Thesauri: LCSH Example
In this case, Cookery is our descriptor.
14 Thesauri: LCSH Example
Cookery is a preferred term (PT) for these.
15 Thesauri: LCSH Example
Broader terms (BTs) expand the search vertically.
16 Thesauri: LCSH Example
Narrower terms (NTs) focus the scope of the
search.
17 Thesauri: LCSH Example
Related terms (RTs) expand a search horizontally.
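To make the PT/BT/NT/RT vocabulary concrete, here is a minimal sketch of how such a thesaurus record might be represented as a simple data structure. The specific broader, narrower, and related terms below are invented for illustration and are not actual LCSH data.

```python
# A hypothetical thesaurus record for the descriptor discussed above.
# The relationship values are invented for illustration, not taken from LCSH.
cookery = {
    "descriptor": "Cookery",                    # the preferred term (PT)
    "used_for": ["Cooking", "Cuisine"],         # variant entry terms that point here
    "broader": ["Home economics"],              # BTs: expand a search vertically
    "narrower": ["Baking", "Outdoor cookery"],  # NTs: focus the scope of a search
    "related": ["Gastronomy", "Diet"],          # RTs: expand a search horizontally
}

# A searcher's query on the descriptor can be broadened using the record:
expanded_query = [cookery["descriptor"]] + cookery["broader"] + cookery["related"]
print(expanded_query)  # ['Cookery', 'Home economics', 'Gastronomy', 'Diet']
```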
18 Thesauri: Art and Architecture Thesaurus
- Most thesauri are domain-specific. That is, they
control the vocabulary of a particular domain of
knowledge.
- N.B. We can consider these to be ontologies,
representing linguistic knowledge about a
particular field.
- Consider the Art and Architecture Thesaurus
(maintained by the Getty).
19 Some vocabulary review
People are a type of Agent; thus People is an NT
for Agent.
20 Some vocabulary review
There are many types of person within the AAT. Since
they appear at the same level in the hierarchy,
terms such as antiquaries and athletes are
coordinate terms. Likewise, acrobats and fencers
are coordinate with each other.
21 Controlled Vocabulary
- Costs of disambiguation via controlled vocabulary
22 Manual Indexing Overview
- Two phases: analysis, then translation
- Vocabulary control vs. free text
- Sources of evidence?
- Inter-indexer agreement?
23 Manual Indexing Overview
- Two phases: analysis, then translation
- Vocabulary control vs. free text
- Sources of evidence?
- Inter-indexer agreement?
Computational approaches to indexing probably
don't solve these problems, but they do change
the equation (maybe for the better).
24 Automatic Indexing
- Our goal: to derive useful document surrogates
without undue human intervention.
- Human attention is expensive and slow.
- People's time is best spent on doing things
machines can't do.
- The leap of faith: we must let the data speak for
themselves.
- In place of human intuition, we'll rely on
rigorous analysis of the document itself (heavily
empirical).
- We will import intuition in the form of
strategies for treating language empirically.
25 The leap of faith
- (Textual) documents are composed of words, and
words convey meaning.
- Thus we can glean something about meaning by
analyzing document text.
- The question becomes: what do we mean by
"analyzing"?
- Shallow: efficient, tractable (good enough?)
- Deep: intensive, feasible computationally (any
better than shallow analysis?)
26 Textual Analysis for IR
- Most IR is based on lexical analysis.
- i.e., analyzing which words appear in documents
and which words are most likely to convey topical
meaning.
- This is highly language-dependent. We'll assume
primarily English text.
- Why are other languages different with respect to
lexical analysis?
- Why is it harder (or is it harder) to analyze
non-textual information (e.g. images)?
27 Inverted index construction
Documents to be indexed:
Friends, Romans, countrymen.
28 Indexer steps
- Sequence of (Modified token, Document ID) pairs.
Doc 1: I did enact Julius Caesar I was killed i' the
Capitol Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath
told you Caesar was ambitious
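As an illustration of this first step (a sketch of mine, not code from the slides), the two documents above can be tokenized into exactly such a sequence of pairs; the tokenizer here just lowercases and splits on non-letters, standing in for whatever token modification a real indexer would apply.

```python
import re

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def tokenize(text):
    # Lowercase and split on anything that is not a letter or apostrophe
    # (a stand-in for real token modification such as stemming).
    return [t for t in re.split(r"[^a-z']+", text.lower()) if t]

# Step 1: emit the sequence of (modified token, document ID) pairs.
pairs = [(tok, doc_id) for doc_id, text in docs.items() for tok in tokenize(text)]
print(pairs[:5])  # [('i', 1), ('did', 1), ('enact', 1), ('julius', 1), ('caesar', 1)]
```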
29 The Inverted Index
Doc 1: I did enact Julius Caesar I was killed i' the
Capitol Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath
told you Caesar was ambitious
30 The Inverted Index
Query: julius AND caesar. Query: caesar AND NOT julius.
Doc 1: I did enact Julius Caesar I was killed i' the
Capitol Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath
told you Caesar was ambitious
31 The Inverted Index
Query: julius AND caesar. Query: caesar AND NOT julius.
Why is it preferable to use the inverted index?
Why not just search the documents directly?
Doc 1: I did enact Julius Caesar I was killed i' the
Capitol Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath
told you Caesar was ambitious
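A minimal, self-contained sketch of the whole pipeline (again mine, not the slides' own code): the pairs are grouped into postings lists keyed by term, and the two example queries become set operations over those lists. Searching the index touches only the postings for the query terms, which is why it beats scanning every document directly.

```python
import re
from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def tokenize(text):
    # Lowercase and split on non-letters (stand-in for real token modification).
    return [t for t in re.split(r"[^a-z']+", text.lower()) if t]

# Build the inverted index: each term maps to the set of documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for tok in tokenize(text):
        index[tok].add(doc_id)

# Query: julius AND caesar -> intersect the two postings lists.
print(index["julius"] & index["caesar"])  # {1}
# Query: caesar AND NOT julius -> subtract one postings list from the other.
print(index["caesar"] - index["julius"])  # {2}
```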
32 Ranking Documents for Retrieval
- Given a database of indexed documents, what
should we do when a searcher issues a query?
- Most modern IR systems rank documents in
decreasing order of estimated relevance to the
query.
- How might we rank documents against a query?
33 Naïve document ranking
- Given a query Q containing words q1, q2, ..., qm, we
could simply count how many of the query words
are also in the document.
34 Naïve document ranking
r1(q, d1) = 2    r1(q, d2) = 1
- Given a query Q containing words q1, q2, ..., qm, we
could simply count how many of the query words
are also in the document.
35 Can we be less naïve?
- Instead of counting how many query words a
document contains, we could sum over the
frequency of each query word in each document.
36 Can we be less naïve?
r2(q, d1) = 1 + 1 = 2    r2(q, d2) = 1    r2(q, d3) = 3 + 1 = 4
- Given a query Q containing words q1, q2, ..., qm, we
now sum over the frequency of each query word in
the document.
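A small Python sketch of these two naïve ranking functions, using the Shakespeare documents from the earlier slides and a made-up query (the function names mirror the slides' r1 and r2; everything else is my own illustration):

```python
from collections import Counter

def r1(query_terms, doc_tokens):
    # Count how many distinct query words appear in the document at all.
    vocab = set(doc_tokens)
    return sum(1 for q in query_terms if q in vocab)

def r2(query_terms, doc_tokens):
    # Sum the frequency of each query word in the document.
    tf = Counter(doc_tokens)
    return sum(tf[q] for q in query_terms)

query = ["caesar", "brutus"]
d1 = "i did enact julius caesar i was killed i' the capitol brutus killed me".split()
d2 = "so let it be with caesar the noble brutus hath told you caesar was ambitious".split()

print(r1(query, d1), r1(query, d2))  # 2 2  (both docs contain both query words)
print(r2(query, d1), r2(query, d2))  # 2 3  (d2 mentions "caesar" twice)
```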
37 Can we be still less naïve?
- This function assumes that all words are equally
important. Why is this unrealistic?
- What might we do to mitigate this error?
38 Term weighting by TF-IDF (term freq. - inverse
doc. freq.)
- Intuition: when assessing the relevance of a
document, we should take into account how many
times it contains each term (term frequency), but
also how common that term is in general.
- Reward terms that occur often in the document but
rarely in the collection.
39 Inverse document frequency (IDF)
- Intuition: we should give more weight to terms
that are rare in the corpus.
- If we have N documents, and term j occurs in nj
of them, we are concerned with N / nj.
- If nj is small, the quantity is large. Vice
versa for common terms.
40 Term Weighting
- Consider the following terms. If these terms
occurred in a document, how indicative of that
doc's aboutness do you suspect each term would
be?
41 Term Weighting
- Consider the following terms. If these terms
occurred in a document, how indicative of that
doc's aboutness do you suspect each term would
be?
N = 1000
42 Term Weighting
- Consider the following terms. If these terms
occurred in a document, how indicative of that
doc's aboutness do you suspect each term would
be?
[Table: the example terms with their document frequencies and pseudo-IDF values, N = 1000]
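The table of example terms on this slide did not survive conversion, but the arithmetic it illustrates is simple. With N = 1000 and two invented document frequencies (500 and 10; the numbers are mine, not the slide's), the pseudo-IDF N / d_w works out to

\[
\frac{N}{d_w} = \frac{1000}{500} = 2
\qquad \text{vs.} \qquad
\frac{N}{d_w} = \frac{1000}{10} = 100,
\]

so the rarer term receives a weight fifty times larger.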
43 Inverse document frequency (IDF)
- Given that a word w occurs in dw documents, and
that our collection consists of N documents total...
44 Inverse document frequency (IDF)
45 Inverse document frequency (IDF)
N.B. stopwords: if dw is close to N, the IDF is close to 0.
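The formula itself appears to have been an image on slides 44-45; the standard formulation consistent with the surrounding slides (the logarithm base is a matter of convention) is

\[
\mathrm{IDF}(w) = \log \frac{N}{d_w},
\qquad \text{so that } d_w \approx N \;\Rightarrow\; \mathrm{IDF}(w) \approx \log 1 = 0 .
\]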
46 Document ranking by TF-IDF
- Sum over the frequency of each query word in the
document times the word's inverse document
frequency.
47 Document ranking by TF-IDF
48 Document ranking by TF-IDF
r3(q, d1) = 1 × 4.2 + 1 × 4.8 = 9
49 Document ranking by TF-IDF
r3(q, d1) = 1 × 4.2 + 1 × 4.8 = 9
r3(q, d2) = 1 × 4.2 + 2 × 2.2 = 8.6
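A sketch of the r3 scoring rule in Python (mine, not the slides'): sum, over the query words, the term frequency in the document times the word's IDF. The toy two-document collection and the base-10 logarithm are my own choices, so the scores will not reproduce the 4.2/4.8/2.2 weights above, which come from a larger collection.

```python
import math
from collections import Counter

docs = {
    "d1": "i did enact julius caesar i was killed i' the capitol brutus killed me".split(),
    "d2": "so let it be with caesar the noble brutus hath told you caesar was ambitious".split(),
}
N = len(docs)

def idf(word):
    # log(N / d_w), where d_w is the number of documents containing the word.
    d_w = sum(1 for tokens in docs.values() if word in tokens)
    return math.log10(N / d_w) if d_w else 0.0

def r3(query_terms, tokens):
    # TF-IDF score: term frequency in the document times the word's IDF, summed.
    tf = Counter(tokens)
    return sum(tf[q] * idf(q) for q in query_terms)

query = ["julius", "caesar"]
scores = {d: round(r3(query, toks), 3) for d, toks in docs.items()}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
# d1 ranks first: only it contains the rare term "julius".
```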
50 IR Evaluation: Experimentation and Research
Methods
51 Evaluating IR Systems
- Assuming we've made some design decisions and
selected one or more IR models, we would like to
know how well our decisions are serving us.
- IR evaluation is concerned with pursuing the
question: given a particular retrieval situation,
how well is my IR system working?
52 Evaluating IR Systems
- In IR research, evaluation usually analyzes at
least one of these facets of a retrieval system:
- Effectiveness
- Efficiency
- Cost
- Speed
53 IR Effectiveness
- Information retrieval systems attempt to discover
documents in a collection that are relevant to a
user's stated information need.
54 IR Effectiveness
- Information retrieval systems attempt to discover
documents in a collection that are relevant to a
user's stated information need.
To say that one IR system is more effective than
another, maybe we're saying that that system does
a better job of discriminating between relevant
and non-relevant documents? Regardless of the
thorns that lie in this definition, this is
commonly the assumption we work under.
55 IR Effectiveness
- Information retrieval systems attempt to discover
documents in a collection that are relevant to a
user's stated information need.
Perhaps relevance is the most troublesome notion in
the world of information retrieval research. It
is fundamental to IR evaluation, but it is nearly
impossible to define (or operationalize) in a
way that many researchers can agree upon.
56 The Cranfield Paradigm
- A drastically simplified notion of relevance has
its roots in a series of experiments undertaken
in the 1950s.
- Cyril Cleverdon (a researcher in Cranfield,
England) wanted to test the effectiveness of
various methods of indexing
- e.g. manual vs. automatic
- To perform these tests, he constructed an
elaborate (and still widely used) apparatus for
experimentation.
57 The Cranfield Paradigm
- The idea behind Cleverdon's design:
- Compile a corpus of documents
- Ask potential readers of these documents to
supply queries that they would like answered by
consulting the documents
- In a laborious (and well-documented) process,
have subject experts judge each document with
respect to its relevance to each query
58 IR Test Collections
- These components comprise the basic elements of a
so-called test collection for IR
experimentation:
- A corpus of documents
- A set of queries
- A set of qrels: lists of all documents that are
relevant to each query
59 Measuring Retrieval Effectiveness
For a particular query q:
[Contingency table: A = relevant docs retrieved, B = non-relevant docs retrieved,
C = relevant docs not retrieved, D = non-relevant docs not retrieved.
N = number of docs in the collection; R = total number of docs in the
collection relevant to query q.]
60 Measuring Retrieval Effectiveness
For a particular query q:
precision = A / (A + B)    recall = A / (A + C)
61 Measuring Retrieval Effectiveness
For a particular query q:
precision = A / (A + B)    recall = A / (A + C)
In other words: precision is the percentage of
retrieved docs that are relevant. Recall is the
percentage of relevant docs that have been retrieved.
62 Measuring Retrieval Effectiveness
Model 1
precision = ??    recall = ??
63 Measuring Retrieval Effectiveness
Model 1
precision = 10/15 = 2/3    recall = 10/20 = 1/2
What does this mean? Is it good performance?
64 Measuring Retrieval Effectiveness
Model 1: precision = 10/15 = 2/3    recall = 10/20 = 1/2
Model 2: precision = 5/6    recall = 5/15 = 1/3
65 Measuring Retrieval Effectiveness
Model 1 has better recall than Model 2.
Model 2 has better precision than Model 1.
Model 1: precision = 10/15 = 2/3    recall = 10/20 = 1/2
Model 2: precision = 5/6    recall = 5/15 = 1/3
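A tiny helper (mine, not the slides') that reproduces these figures from the counts in the contingency table; per the slide, Model 1 retrieves 10 relevant docs out of 15 retrieved with 20 relevant in total, and Model 2 retrieves 5 relevant out of 6 retrieved with 15 relevant in total.

```python
def precision_recall(a, b, c):
    # a = relevant retrieved, b = non-relevant retrieved, c = relevant not retrieved.
    return a / (a + b), a / (a + c)

print(precision_recall(10, 5, 10))  # Model 1: (0.666..., 0.5)
print(precision_recall(5, 1, 10))   # Model 2: (0.833..., 0.333...)
```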
66 Precision and Recall
For a particular query q and model M1, ranking 10
docs: R N N N R N N R R R
What is our precision at recall = 20%?
67 Precision and Recall
For a particular query q and model M1, ranking 10
docs: R N N N R N N R R R
What is our precision at recall = 20%? |Rel| = 5.
20% of 5 = 1. Thus we want to compute precision
after retrieving 1 relevant document: 1/1 = 1.
68 Precision and Recall
For a particular query q and model M1, ranking 10
docs: R N N N R N N R R R
What is our precision at recall = 40%? |Rel| = 5.
40% of 5 = 2. Thus we want to compute precision
after retrieving 2 relevant documents: 2/5 = 0.4.
69 Precision and Recall
For a particular query q and model M1, ranking 10
docs: R N N N R N N R R R
What is our precision at recall = 60%? |Rel| = 5.
60% of 5 = 3. Thus we want to compute precision
after retrieving 3 relevant documents: 3/8 = 0.375.
70 Precision and Recall
For a particular query q and model M1, ranking 10
docs: R N N N R N N R R R
What's the trend here? As our recall rate goes
up, our precision tends to decline. This is true
in general, though oddities relating to small
numbers of relevant documents can alter this.
But most frequently, we find relationships like...
What is our precision at recall = 60%? |Rel| = 5.
60% of 5 = 3. Thus we want to compute precision
after retrieving 3 relevant documents: 3/8 = 0.375.
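A short sketch (mine) of the precision-at-recall-level computation for the ranking above; the helper reproduces the slide's worked answers of 1.0, 0.4, and 0.375.

```python
import math

def precision_at_recall(ranking, recall_level, total_relevant):
    # Precision at the point where the ranking first reaches the given recall level.
    needed = math.ceil(recall_level * total_relevant - 1e-9)  # relevant docs required
    seen = 0
    for rank, label in enumerate(ranking, start=1):
        if label == "R":
            seen += 1
        if seen >= needed:
            return seen / rank
    return 0.0

m1 = ["R", "N", "N", "N", "R", "N", "N", "R", "R", "R"]
for level in (0.2, 0.4, 0.6):
    print(level, precision_at_recall(m1, level, total_relevant=5))
# 0.2 -> 1.0, 0.4 -> 0.4, 0.6 -> 0.375
```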
71 (No Transcript)
72 Precision and Recall
For a particular query q, M1 ranks 10 docs: R N
N N R N N R R R
while M2 ranks them: R R N N R N R N N R
Which model is better???
73 (No Transcript)
74 Deriving a single effectiveness measure: mean
average precision
For a particular query q, M1 ranks 10 docs: R N
N N R N N R R R
while M2 ranks them: R R N N R N R N N R
Which model is better???
75 Deriving a single effectiveness measure: mean
average precision
For a particular query q, M1 ranks 10 docs: R N
N N R N N R R R
while M2 ranks them: R R N N R N R N N R
Which model is better???
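One way to collapse each ranking into a single number is average precision: the mean of the precision values observed at each relevant document (averaging this over many queries gives mean average precision). A sketch of mine, assuming all five relevant documents appear in the top ten as the slides do:

```python
def average_precision(ranking):
    # Mean of the precision values measured at the rank of each relevant document.
    hits, precisions = 0, []
    for rank, label in enumerate(ranking, start=1):
        if label == "R":
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

m1 = ["R", "N", "N", "N", "R", "N", "N", "R", "R", "R"]
m2 = ["R", "R", "N", "N", "R", "N", "R", "N", "N", "R"]
print(round(average_precision(m1), 3))  # 0.544
print(round(average_precision(m2), 3))  # 0.734
```

Under this measure M2 comes out ahead, reflecting its better precision early in the ranking.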
76 IR in our culture: John Battelle's The Search
- IR began as a quiet branch of library science.
- In John Battelle's argument, what has search
become?
77 The database of intentions
78 The database of intentions: start your own hedge
fund
Is one company (Morgan Chase or Merrill) less
correlated with these doomsday words? We have
access to data like nobody imagined. Now what to
do with it?
79 Our challenge: coda
- "My problem is not finding something. My
problem is understanding something." -- Danny
Hillis (Battelle 16)
- "The goal of Google and other search companies is
to provide people with information and make it
useful to them." -- Craig Silverstein (Battelle 17)
- In this ambitious arena, what is our role? What
will it be? What should it be?