Title: Indexing and information retrieval
1 Indexing and information retrieval
- INF 384C
- UT Austin School of Information
2 Introduction to Indexing
- "The main purpose of indexing and abstracting is
to construct representations of published items
in a form suitable for inclusion in some type of
database." (Lancaster, 1)
- "The basic assumption is that indexers are able
to state what a document is about by
formulating an expression which summarizes the
content of the document." (Hutchins, 92)
3 Introduction to Indexing
- Though often quite complex, the basic goal of
most indexing is to represent the meaning of a
document (its aboutness) in a compressed form
that will be amenable to search and retrieval.
- Traditionally, the notion of a document's
aboutness is closely related to its topical
subject matter.
4 Indexing and the IR Process
[Diagram: the indexer interprets a document (doc1) and, using the indexing
vocabulary, creates a surrogate (s1), which is stored in the database.]
5 Terms Associated with Indexing
6 The Task of the Indexer
- Indexing involves two principal steps:
- Conceptual analysis: interpreting the document to
ascertain its meaning
- Translation: converting the document into a set of descriptors
that the system recognizes
7 Conceptual Analysis
- In the indexing literature, we see a lot of
tension about conceptual analysis.
- But in many cases, people are quite good at
picking out important ideas from a text, at
analyzing a document's aboutness.
8 Translation: Assigning Descriptors
- If we were using a free-text or natural
language indexing system, then we could just
write down our descriptors for each document
after conceptual analysis.
- But many systems impose more order on the process
by using a controlled vocabulary.
- Why do they add this step, this translation?
9 Vocabulary Control: Important Ideas
- Types of Controlled Vocabularies (not mutually
exclusive):
- Subject Headings
- Thesauri
- Name Authority Files
10 Authority Control
11 Subject Heading Lists
- Controlled vocabularies are usually expressed as
a list of vetted, legitimate terms, frequently
referred to as subject headings.
- These may be organized in various ways:
- Alphabetically
- Topically
- In a Thesaurus (via lexical-semantic
relationships)
12 Thesauri: LCSH Example
13 Thesauri: LCSH Example
In this case, Cookery is our descriptor.
14 Thesauri: LCSH Example
Cookery is a preferred term (PT) for these.
15 Thesauri: LCSH Example
Broader terms (BTs) expand the search vertically.
16 Thesauri: LCSH Example
Narrower terms (NTs) focus the scope of the
search.
17 Thesauri: LCSH Example
Related terms (RTs) expand a search horizontally.
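To make the PT/BT/NT/RT vocabulary concrete, here is a minimal sketch of how such a thesaurus record might be represented as a simple data structure. The specific broader, narrower, and related terms below are invented for illustration and are not actual LCSH data.

```python
# A hypothetical thesaurus record for the descriptor discussed above.
# The relationship values are invented for illustration, not taken from LCSH.
cookery = {
    "descriptor": "Cookery",                    # the preferred term (PT)
    "used_for": ["Cooking", "Cuisine"],         # variant entry terms that point here
    "broader": ["Home economics"],              # BTs: expand a search vertically
    "narrower": ["Baking", "Outdoor cookery"],  # NTs: focus the scope of a search
    "related": ["Gastronomy", "Diet"],          # RTs: expand a search horizontally
}

# A searcher's query on the descriptor can be broadened using the record:
expanded_query = [cookery["descriptor"]] + cookery["broader"] + cookery["related"]
print(expanded_query)  # ['Cookery', 'Home economics', 'Gastronomy', 'Diet']
```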
18 Thesauri: Art and Architecture Thesaurus
- Most thesauri are domain-specific. That is, they
control the vocabulary of a particular domain of
knowledge.
- N.B. We can consider these to be ontologies,
representing linguistic knowledge about a
particular field.
- Consider the Art and Architecture Thesaurus
(maintained by the Getty).
19 Some vocabulary review
People are a type of Agent; thus People is an NT
for Agent.
20 Some vocabulary review
There are many types of person within the AAT. Since
they appear at the same level in the hierarchy,
terms such as antiquaries and athletes are
coordinate terms. Likewise, acrobats and fencers
are coordinate with each other.
21 Controlled Vocabulary
- Costs of disambiguation via controlled vocabulary
22 Manual Indexing Overview
- Two phases: analysis, then translation
- Vocabulary control vs. free text
- Sources of evidence?
- Inter-indexer agreement?
23 Manual Indexing Overview
- Two phases: analysis, then translation
- Vocabulary control vs. free text
- Sources of evidence?
- Inter-indexer agreement?
Computational approaches to indexing probably
don't solve these problems, but they do change
the equation (maybe for the better).
24 Automatic Indexing
- Our goal: to derive useful document surrogates
without undue human intervention.
- Human attention is expensive and slow.
- People's time is best spent on doing things
machines can't do.
- The leap of faith: we must let the data speak for
themselves.
- In place of human intuition, we'll rely on
rigorous analysis of the document itself (heavily
empirical).
- We will import intuition in the form of
strategies for treating language empirically.
25 The leap of faith
- (Textual) documents are composed of words, and
words convey meaning.
- Thus we can glean something about meaning by
analyzing document text.
- The question becomes: what do we mean by
"analyzing"?
- Shallow: efficient, tractable (good enough?)
- Deep: intensive, feasible computationally (any
better than shallow analysis?)
26 Textual Analysis for IR
- Most IR is based on lexical analysis.
- i.e., analyzing which words appear in documents
and which words are most likely to convey topical
meaning.
- This is highly language-dependent. We'll assume
primarily English text.
- Why are other languages different with respect to
lexical analysis?
- Why is it harder (or is it harder) to analyze
non-textual information (e.g. images)?
27 Inverted index construction
Documents to be indexed:
Friends, Romans, countrymen.
28 Indexer steps
- Sequence of (Modified token, Document ID) pairs.
Doc 1: I did enact Julius Caesar I was killed i' the
Capitol Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath
told you Caesar was ambitious
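As an illustration of this first step (a sketch of mine, not code from the slides), the two documents above can be tokenized into exactly such a sequence of pairs; the tokenizer here just lowercases and splits on non-letters, standing in for whatever token modification a real indexer would apply.

```python
import re

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def tokenize(text):
    # Lowercase and split on anything that is not a letter or apostrophe
    # (a stand-in for real token modification such as stemming).
    return [t for t in re.split(r"[^a-z']+", text.lower()) if t]

# Step 1: emit the sequence of (modified token, document ID) pairs.
pairs = [(tok, doc_id) for doc_id, text in docs.items() for tok in tokenize(text)]
print(pairs[:5])  # [('i', 1), ('did', 1), ('enact', 1), ('julius', 1), ('caesar', 1)]
```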
29 The Inverted Index
Doc 1: I did enact Julius Caesar I was killed i' the
Capitol Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath
told you Caesar was ambitious
30 The Inverted Index
Query: julius AND caesar. Query: caesar AND NOT julius.
Doc 1: I did enact Julius Caesar I was killed i' the
Capitol Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath
told you Caesar was ambitious
31 The Inverted Index
Query: julius AND caesar. Query: caesar AND NOT julius.
Why is it preferable to use the inverted index?
Why not just search the documents directly?
Doc 1: I did enact Julius Caesar I was killed i' the
Capitol Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath
told you Caesar was ambitious
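A minimal, self-contained sketch of the whole pipeline (again mine, not the slides' own code): the pairs are grouped into postings lists keyed by term, and the two example queries become set operations over those lists. Searching the index touches only the postings for the query terms, which is why it beats scanning every document directly.

```python
import re
from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def tokenize(text):
    # Lowercase and split on non-letters (stand-in for real token modification).
    return [t for t in re.split(r"[^a-z']+", text.lower()) if t]

# Build the inverted index: each term maps to the set of documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for tok in tokenize(text):
        index[tok].add(doc_id)

# Query: julius AND caesar -> intersect the two postings lists.
print(index["julius"] & index["caesar"])  # {1}
# Query: caesar AND NOT julius -> subtract one postings list from the other.
print(index["caesar"] - index["julius"])  # {2}
```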
32 Ranking Documents for Retrieval
- Given a database of indexed documents, what
should we do when a searcher issues a query?
- Most modern IR systems rank documents in
decreasing order of estimated relevance to the
query.
- How might we rank documents against a query?
33 Naïve document ranking
- Given a query Q containing words q1, q2, ..., qm, we
could simply count how many of the query words
are also in the document.
34 Naïve document ranking
r1(q, d1) = 2    r1(q, d2) = 1
- Given a query Q containing words q1, q2, ..., qm, we
could simply count how many of the query words
are also in the document.
35 Can we be less naïve?
- Instead of counting how many query words a
document contains, we could sum over the
frequency of each query word in each document.
36 Can we be less naïve?
r2(q, d1) = 1 + 1 = 2    r2(q, d2) = 1    r2(q, d3) = 3 + 1 = 4
- Given a query Q containing words q1, q2, ..., qm, we
now sum over the frequency of each query word in
the document.
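A small Python sketch of these two naïve ranking functions, using the Shakespeare documents from the earlier slides and a made-up query (the function names mirror the slides' r1 and r2; everything else is my own illustration):

```python
from collections import Counter

def r1(query_terms, doc_tokens):
    # Count how many distinct query words appear in the document at all.
    vocab = set(doc_tokens)
    return sum(1 for q in query_terms if q in vocab)

def r2(query_terms, doc_tokens):
    # Sum the frequency of each query word in the document.
    tf = Counter(doc_tokens)
    return sum(tf[q] for q in query_terms)

query = ["caesar", "brutus"]
d1 = "i did enact julius caesar i was killed i' the capitol brutus killed me".split()
d2 = "so let it be with caesar the noble brutus hath told you caesar was ambitious".split()

print(r1(query, d1), r1(query, d2))  # 2 2  (both docs contain both query words)
print(r2(query, d1), r2(query, d2))  # 2 3  (d2 mentions "caesar" twice)
```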
37 Can we be still less naïve?
- This function assumes that all words are equally
important. Why is this unrealistic?
- What might we do to mitigate this error?
38 Term weighting by TF-IDF (term freq. - inverse
doc. freq.)
- Intuition: when assessing the relevance of a
document, we should take into account how many
times it contains each term (term frequency), but
also how common that term is in general.
- Reward terms that occur often in the document but
rarely in the collection.
39 Inverse document frequency (IDF)
- Intuition: we should give more weight to terms
that are rare in the corpus.
- If we have N documents, and term j occurs in nj
of them, we are concerned with N / nj.
- If nj is small, the quantity is large. Vice
versa for common terms.
40 Term Weighting
- Consider the following terms. If these terms
occurred in a document, how indicative of that
doc's aboutness do you suspect each term would
be?
41 Term Weighting
- Consider the following terms. If these terms
occurred in a document, how indicative of that
doc's aboutness do you suspect each term would
be?
N = 1000
42 Term Weighting
- Consider the following terms. If these terms
occurred in a document, how indicative of that
doc's aboutness do you suspect each term would
be?
[Table: the example terms with their document frequencies and pseudo-IDF values, N = 1000]
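The table of example terms on this slide did not survive conversion, but the arithmetic it illustrates is simple. With N = 1000 and two invented document frequencies (500 and 10; the numbers are mine, not the slide's), the pseudo-IDF N / d_w works out to

\[
\frac{N}{d_w} = \frac{1000}{500} = 2
\qquad \text{vs.} \qquad
\frac{N}{d_w} = \frac{1000}{10} = 100,
\]

so the rarer term receives a weight fifty times larger.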
43 Inverse document frequency (IDF)
- Given that a word w occurs in dw documents, and
that our collection consists of N documents total...
44 Inverse document frequency (IDF)
45 Inverse document frequency (IDF)
N.B. stopwords: if dw is close to N, the IDF is close to 0.
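The formula itself appears to have been an image on slides 44-45; the standard formulation consistent with the surrounding slides (the logarithm base is a matter of convention) is

\[
\mathrm{IDF}(w) = \log \frac{N}{d_w},
\qquad \text{so that } d_w \approx N \;\Rightarrow\; \mathrm{IDF}(w) \approx \log 1 = 0 .
\]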
46 Document ranking by TF-IDF
- Sum over the frequency of each query word in the
document times the word's inverse document
frequency.
47 Document ranking by TF-IDF
48 Document ranking by TF-IDF
r3(q, d1) = 1 × 4.2 + 1 × 4.8 = 9
49 Document ranking by TF-IDF
r3(q, d1) = 1 × 4.2 + 1 × 4.8 = 9
r3(q, d2) = 1 × 4.2 + 2 × 2.2 = 8.6
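A sketch of the r3 scoring rule in Python (mine, not the slides'): sum, over the query words, the term frequency in the document times the word's IDF. The toy two-document collection and the base-10 logarithm are my own choices, so the scores will not reproduce the 4.2/4.8/2.2 weights above, which come from a larger collection.

```python
import math
from collections import Counter

docs = {
    "d1": "i did enact julius caesar i was killed i' the capitol brutus killed me".split(),
    "d2": "so let it be with caesar the noble brutus hath told you caesar was ambitious".split(),
}
N = len(docs)

def idf(word):
    # log(N / d_w), where d_w is the number of documents containing the word.
    d_w = sum(1 for tokens in docs.values() if word in tokens)
    return math.log10(N / d_w) if d_w else 0.0

def r3(query_terms, tokens):
    # TF-IDF score: term frequency in the document times the word's IDF, summed.
    tf = Counter(tokens)
    return sum(tf[q] * idf(q) for q in query_terms)

query = ["julius", "caesar"]
scores = {d: round(r3(query, toks), 3) for d, toks in docs.items()}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
# d1 ranks first: only it contains the rare term "julius".
```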
50 IR Evaluation: Experimentation and Research
Methods
51 Evaluating IR Systems
- Assuming we've made some design decisions and
selected one or more IR models, we would like to
know how well our decisions are serving us.
- IR evaluation is concerned with pursuing the
question: given a particular retrieval situation,
how well is my IR system working?
52 Evaluating IR Systems
- In IR research, evaluation usually analyzes at
least one of these facets of a retrieval system:
- Effectiveness
- Efficiency
- Cost
- Speed
53 IR Effectiveness
- Information retrieval systems attempt to discover
documents in a collection that are relevant to a
user's stated information need.
54 IR Effectiveness
- Information retrieval systems attempt to discover
documents in a collection that are relevant to a
user's stated information need.
To say that one IR system is more effective than
another, maybe we're saying that that system does
a better job of discriminating between relevant
and non-relevant documents? Regardless of the
thorns that lie in this definition, this is
commonly the assumption we work under.
55 IR Effectiveness
- Information retrieval systems attempt to discover
documents in a collection that are relevant to a
user's stated information need.
Perhaps relevance is the most troublesome notion in
the world of information retrieval research. It
is fundamental to IR evaluation, but it is nearly
impossible to define (or operationalize) in a
way that many researchers can agree upon.
56 The Cranfield Paradigm
- A drastically simplified notion of relevance has
its roots in a series of experiments undertaken
in the 1950s.
- Cyril Cleverdon (a researcher in Cranfield,
England) wanted to test the effectiveness of
various methods of indexing
- e.g. manual vs. automatic
- To perform these tests, he constructed an
elaborate (and still widely used) apparatus for
experimentation.
57 The Cranfield Paradigm
- The idea behind Cleverdon's design:
- Compile a corpus of documents
- Ask potential readers of these documents to
supply queries that they would like answered by
consulting the documents
- In a laborious (and well-documented) process,
have subject experts judge each document with
respect to its relevance to each query
58 IR Test Collections
- These components comprise the basic elements of a
so-called test collection for IR
experimentation:
- A corpus of documents
- A set of queries
- A set of qrels: lists of all documents that are
relevant to each query
59 Measuring Retrieval Effectiveness
For a particular query q:
[Contingency table: A = relevant docs retrieved, B = non-relevant docs retrieved,
C = relevant docs not retrieved, D = non-relevant docs not retrieved.
N = number of docs in the collection; R = total number of docs in the
collection relevant to query q.]
60 Measuring Retrieval Effectiveness
For a particular query q:
precision = A / (A + B)    recall = A / (A + C)
61 Measuring Retrieval Effectiveness
For a particular query q:
precision = A / (A + B)    recall = A / (A + C)
In other words: precision is the percentage of
retrieved docs that are relevant. Recall is the
percentage of relevant docs that have been retrieved.
62 Measuring Retrieval Effectiveness
Model 1
precision = ??    recall = ??
63 Measuring Retrieval Effectiveness
Model 1
precision = 10/15 = 2/3    recall = 10/20 = 1/2
What does this mean? Is it good performance?
64 Measuring Retrieval Effectiveness
Model 1: precision = 10/15 = 2/3    recall = 10/20 = 1/2
Model 2: precision = 5/6    recall = 5/15 = 1/3
65 Measuring Retrieval Effectiveness
Model 1 has better recall than Model 2.
Model 2 has better precision than Model 1.
Model 1: precision = 10/15 = 2/3    recall = 10/20 = 1/2
Model 2: precision = 5/6    recall = 5/15 = 1/3
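A tiny helper (mine, not the slides') that reproduces these figures from the counts in the contingency table; per the slide, Model 1 retrieves 10 relevant docs out of 15 retrieved with 20 relevant in total, and Model 2 retrieves 5 relevant out of 6 retrieved with 15 relevant in total.

```python
def precision_recall(a, b, c):
    # a = relevant retrieved, b = non-relevant retrieved, c = relevant not retrieved.
    return a / (a + b), a / (a + c)

print(precision_recall(10, 5, 10))  # Model 1: (0.666..., 0.5)
print(precision_recall(5, 1, 10))   # Model 2: (0.833..., 0.333...)
```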
66 Precision and Recall
For a particular query q and model M1, ranking 10
docs: R N N N R N N R R R
What is our precision at recall = 20%?
67 Precision and Recall
For a particular query q and model M1, ranking 10
docs: R N N N R N N R R R
What is our precision at recall = 20%? |Rel| = 5.
20% of 5 = 1. Thus we want to compute precision
after retrieving 1 relevant document: 1/1 = 1.
68 Precision and Recall
For a particular query q and model M1, ranking 10
docs: R N N N R N N R R R
What is our precision at recall = 40%? |Rel| = 5.
40% of 5 = 2. Thus we want to compute precision
after retrieving 2 relevant documents: 2/5 = 0.4.
69 Precision and Recall
For a particular query q and model M1, ranking 10
docs: R N N N R N N R R R
What is our precision at recall = 60%? |Rel| = 5.
60% of 5 = 3. Thus we want to compute precision
after retrieving 3 relevant documents: 3/8 = 0.375.
70 Precision and Recall
For a particular query q and model M1, ranking 10
docs: R N N N R N N R R R
What's the trend here? As our recall rate goes
up, our precision tends to decline. This is true
in general, though oddities relating to small
numbers of relevant documents can alter this.
But most frequently, we find relationships like...
What is our precision at recall = 60%? |Rel| = 5.
60% of 5 = 3. Thus we want to compute precision
after retrieving 3 relevant documents: 3/8 = 0.375.
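A short sketch (mine) of the precision-at-recall-level computation for the ranking above; the helper reproduces the slide's worked answers of 1.0, 0.4, and 0.375.

```python
import math

def precision_at_recall(ranking, recall_level, total_relevant):
    # Precision at the point where the ranking first reaches the given recall level.
    needed = math.ceil(recall_level * total_relevant - 1e-9)  # relevant docs required
    seen = 0
    for rank, label in enumerate(ranking, start=1):
        if label == "R":
            seen += 1
        if seen >= needed:
            return seen / rank
    return 0.0

m1 = ["R", "N", "N", "N", "R", "N", "N", "R", "R", "R"]
for level in (0.2, 0.4, 0.6):
    print(level, precision_at_recall(m1, level, total_relevant=5))
# 0.2 -> 1.0, 0.4 -> 0.4, 0.6 -> 0.375
```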
71 (No Transcript)
72 Precision and Recall
For a particular query q, M1 ranks 10 docs: R N
N N R N N R R R
while M2 ranks them: R R N N R N R N N R
Which model is better???
73 (No Transcript)
74 Deriving a single effectiveness measure: mean
average precision
For a particular query q, M1 ranks 10 docs: R N
N N R N N R R R
while M2 ranks them: R R N N R N R N N R
Which model is better???
75 Deriving a single effectiveness measure: mean
average precision
For a particular query q, M1 ranks 10 docs: R N
N N R N N R R R
while M2 ranks them: R R N N R N R N N R
Which model is better???
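One way to collapse each ranking into a single number is average precision: the mean of the precision values observed at each relevant document (averaging this over many queries gives mean average precision). A sketch of mine, assuming all five relevant documents appear in the top ten as the slides do:

```python
def average_precision(ranking):
    # Mean of the precision values measured at the rank of each relevant document.
    hits, precisions = 0, []
    for rank, label in enumerate(ranking, start=1):
        if label == "R":
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

m1 = ["R", "N", "N", "N", "R", "N", "N", "R", "R", "R"]
m2 = ["R", "R", "N", "N", "R", "N", "R", "N", "N", "R"]
print(round(average_precision(m1), 3))  # 0.544
print(round(average_precision(m2), 3))  # 0.734
```

Under this measure M2 comes out ahead, reflecting its better precision early in the ranking.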
76 IR in our culture: John Battelle's The Search
- IR began as a quiet branch of library science.
- In John Battelle's argument, what has search
become?
77 The database of intentions
78 The database of intentions: start your own hedge
fund
Is one company (Morgan Chase or Merrill) less
correlated with these doomsday words? We have
access to data like nobody imagined. Now what to
do with it?
79 Our challenge: coda
- "My problem is not finding something. My
problem is understanding something." -- Danny
Hillis (Battelle 16)
- "The goal of Google and other search companies is
to provide people with information and make it
useful to them." -- Craig Silverstein (Battelle 17)
- In this ambitious arena, what is our role? What
will it be? What should it be?