Information Retrieval: Indexing - PowerPoint PPT Presentation

Provided by: Carole153

Transcript and Presenter's Notes

Title: Information Retrieval: Indexing

1
Information Retrieval: Indexing

Acknowledgements
Dr Mounia Lalmas (QMW)
Dr Joemon Jose (Glasgow)

2

Roadmap

What is a document?
Representing the content of documents
Luhn's analysis
Generation of document representatives
Weighting
Inverted files

3
Indexing Language

Language used to describe documents and queries
index terms: a selected subset of words
derived from the text or arrived at independently
Keyword searching
Statistical analysis of documents based on word occurrence frequency
Automated, efficient and potentially inaccurate
Searching using controlled vocabularies
More accurate results, but time consuming if documents are manually indexed

4
Luhn's analysis

Resolving power of significant words
the ability of words to discriminate document content
peaks at the rank-order position halfway between the two cut-offs

5
Generating document representatives
6
Generating document representatives

Input text: full text, abstract, title
Document representative: list of (weighted) class names, each name representing a class of concepts (words) occurring in the input text
Document indexed by a class name if one of its significant words occurs as a member of that class
Phases:
identify words - Lexical Analysis (Tokenising)
removal of high frequency words
suffix stripping (stemming)
detecting equivalent stems
thesauri
others (noun-phrase, noun group, logical formula, structure)
Index structure creation

7
Process View
8
Lexical Analysis

The process of converting a stream of characters (the text of the documents) into a stream of words (the candidate words to be adopted as index terms)
This involves deciding how to treat digits, hyphens, punctuation marks, and the case of the letters.
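
A minimal Python sketch of such a lexical analyser. The treatment chosen here is an assumption, not something the slides prescribe: case is folded to lower case, digits are kept, and hyphens and punctuation act as separators. The name `tokenise` is hypothetical.

```python
import re

def tokenise(text):
    """Convert a character stream into a list of candidate index words."""
    # Assumption: fold case, then keep maximal runs of letters and digits,
    # so hyphens and punctuation marks simply separate words.
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenise("The destruction of the Amazon rain-forests, 1990."))
```

Other policies (keeping hyphenated words whole, dropping digits) only change the regular expression.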

9
Stopword Removal

Removal of high frequency words
list of stop words (implements Luhn's upper cut-off)
filtering out words with very low discrimination values for retrieval purposes
example: "been", "a", "about", "otherwise"
compare input text with stop list
reduction in text size of between 30 and 50 per cent
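
Comparing the input with a stop list can be sketched in a few lines of Python. The stop list below is a tiny illustrative assumption (real lists are much longer), and `remove_stop_words` is a hypothetical name.

```python
# Assumption: a tiny illustrative stop list, not the one used in the slides.
STOP_WORDS = {"a", "about", "been", "of", "otherwise", "the"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop list, preserving order."""
    return [t for t in tokens if t not in stop_words]

tokens = ["the", "destruction", "of", "the", "amazon", "rain", "forests"]
print(remove_stop_words(tokens))
```

Holding the list in a set keeps each lookup constant-time, which matters when every token of every document is checked.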

10
Conflation

Conflation reduces word variants into a single
form
similar words generally have similar meaning
retrieval effectiveness increased if the query is
expanded with those which are similar in meaning
to those originally contained within it.
Stemming algorithm is a conflation procedure
reduces all words with same root into a single
root

11
Different forms - stemming

Stemming
Matching the query term "forests" to "forest" and "forested"
"choke", "choking", "choked"
Suffix removal
removal of suffixes - "worker"
Porter algorithm: remove longest suffix
error: "equal" -> "eq" (heuristic rules)
more effective than ordinary word forms
Detecting equivalent stems
example: ABSORB- and ABSORPT-
Stemmers remove affixes
prefixes? - "megavolt"

12
Plural stemmer

Plurals in English
If word ends in "ies" but not "eies" or "aies"
"ies" -> "y"
If word ends in "es" but not "aes", "ees" or "oes"
"es" -> "e"
If word ends in "s" but not "us" or "ss"
"s" -> "" (the "s" is removed)
First applicable rule is the one used
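
The three rules translate almost directly into Python. This sketch assumes that a rule whose exception matches counts as not applicable, so the next rule is tried; the slide does not say so explicitly. `plural_stem` is a hypothetical name.

```python
def plural_stem(word):
    """Apply the first applicable plural rule; return the word unchanged if none applies."""
    w = word.lower()
    # Rule 1: "ies" -> "y", unless the word ends in "eies" or "aies".
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    # Rule 2: "es" -> "e", unless the word ends in "aes", "ees" or "oes".
    if w.endswith("es") and not w.endswith(("aes", "ees", "oes")):
        return w[:-2] + "e"
    # Rule 3: drop a final "s", unless the word ends in "us" or "ss".
    if w.endswith("s") and not w.endswith(("us", "ss")):
        return w[:-1]
    return w

for word in ["ponies", "forests", "trees", "glass", "corpus"]:
    print(word, "->", plural_stem(word))
```

The stems need not be dictionary words: under rule 2, "churches" becomes "churche". That is acceptable as long as the query is stemmed the same way.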

13
Processing

The destruction of the amazon rain forests
Case normalisation
Stop word removal.
From fixed list
destruction amazon rain forests
Suffix removal (stemming).
destruct amazon rain forest

14
Thesauri

A collection of terms along with some structure or relationships between them (scope notes, etc.)
provides a standard vocabulary for indexing and searching
assists the user in locating terms for proper query formulation
provides a classification hierarchy for broadening and narrowing the current query according to user need
Equivalence: synonyms, preferred terms
Hierarchical: broader/narrower terms (BT/NT)
Association: related terms across the hierarchy (RT)
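
As a rough illustration of broadening and narrowing a query, a thesaurus can be held as a dictionary keyed by term. All entries below are invented for the example, the "EQ" key for equivalence is an assumed label, and `expand` is a hypothetical helper.

```python
# Hypothetical miniature thesaurus; the entries are illustrative only.
# EQ = equivalent terms, BT = broader, NT = narrower, RT = related.
THESAURUS = {
    "car": {"EQ": ["automobile"], "BT": ["vehicle"], "NT": ["taxi"], "RT": ["road"]},
    "vehicle": {"EQ": [], "BT": [], "NT": ["car", "truck"], "RT": []},
}

def expand(term, relation, thesaurus=THESAURUS):
    """Return the query term followed by the terms linked to it by one relation."""
    entry = thesaurus.get(term, {})
    return [term] + entry.get(relation, [])

print(expand("car", "BT"))  # broaden the query
print(expand("car", "NT"))  # narrow the query
```

A term missing from the thesaurus is returned on its own, so the query is never emptied by the expansion step.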

15
Thesauri Examples WordNet
17
Faceted Classification
18
Thesauri Examples AAT

Art and Architecture Thesaurus

19
Hierarchical Classifications

Alphanumeric coding schemes
Subject classifications
A taxonomy that represents a classification or kind-of hierarchy.
Examples: Dewey Decimal, AAT, SHIC, ICONCLASS

41A32 Door
41A322 Closing the Door
41A323 Monumental Door
41A324 Metalwork of a Door
41A3241 Door-Knocker
41A325 Threshold
41A327 Door-keeper, houseguard
20
Terminology/Controlled vocabulary

The descriptors from a thesaurus form a controlled vocabulary
Normalise indexing concepts
Identification of indexing concepts with clear
semantics
Retrieval based on concepts rather than terms
Good for specific domains (e.g., medical)
Problematic for general domains (large, new,
dynamic)

21
No One Classification
22
No One Classification
23
Generating document representatives - Outcome

Class
words with the same stem
Class name
stem
Document representative
list of class names (index terms or keywords)
Same process applied to query

24
Precision and Recall

Precision
Ratio of the number of relevant documents retrieved to the total number of documents retrieved.
The proportion of hits that are relevant
Recall
Ratio of the number of relevant documents retrieved to the total number of relevant documents
The proportion of relevant documents that are hits
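
Both ratios can be computed from two sets of document identifiers. The sketch below uses invented identifiers, and `precision_recall` is a hypothetical name; returning 0.0 for an empty set is an assumed convention.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    # Assumption: define the ratio as 0.0 when its denominator is empty.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d7", "d8", "d9"],
)
print(p, r)  # 2 of 4 hits are relevant; 2 of 5 relevant documents were found
```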

25
Precision and Recall
High Precision Low Recall
26
Precision and Recall
RA

The user isn't usually given the answer set RA at once
The documents in A are sorted by degree of relevance (ranking), and the user examines this ranked list.
Recall and precision vary as the user proceeds
with their examination of the answer set A

27
Precision and Recall Trade Off

Increase number of documents retrieved
Likely to retrieve more of the relevant documents
and thus increase the recall
But typically retrieve more inappropriate
documents and thus decrease precision

28
Index term weighting

Effectiveness of an indexing language
Exhaustivity
number of different topics indexed
high exhaustivity: high recall and low precision
Specificity
ability of the indexing language to describe topics precisely
high specificity: high precision and low recall

29
Index term weighting

Exhaustivity
related to the number of index terms assigned to
a given document
Specificity
number of documents to which a term is assigned
in a collection
related to the distribution of index terms in
collection
Index term weighting
index term frequency: occurrence frequency of a term in a document
document frequency: number of documents in which a term occurs

30
IR as Clustering

A query is a vague spec of a set of objects, A
IR is reduced to the problem of determining which
documents are in set A and which ones are not
Intra clustering similarity
What are the features that better describe the
objects in A
Inter clustering dissimilarity
What are the features that better distinguish the
objects A from the remaining objects in C

[Figure: the set A of retrieved documents (marked x) inside the document collection C]
31
Index term weighting
32
Index term weighting

Intra-clustering similarity
The raw frequency of a term t inside a document
d.
A measure of how well the document term describes
the document contents
Inter-cluster dissimilarity
Inverse document frequency
Inverse of the frequency of a term t among the
documents in the collection.
Terms which appear in many documents are not
useful for distinguishing a relevant document
from a non-relevant one.

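Normalised frequency of term t in document d: $tf(t,d) = \frac{freq(t,d)}{\max_{k} freq(k,d)}$
Inverse document frequency: $idf(t) = \log \frac{N}{n(t)}$, where N is the number of documents in the collection and n(t) the number of documents containing t
$\mathrm{Weight}(t,d) = tf(t,d) \times idf(t)$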
33
Term weighting schemes

Best known
Variation for query term weights

34
Example

Nuclear 7 Computer 9
Poverty 5 Unemployment 1
Luddites 3 Machines 19
People 25 And 49
Weight(machine) = 19/25 x log(100/50) = 0.76 x 0.30103 = 0.228783
Weight(luddite) = 3/25 x log(100/2) = 0.12 x 1.69897 = 0.2038764
Weight(poverty) = 5/25 x log(100/2) = 0.2 x 1.69897 = 0.339794
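
The arithmetic can be checked with a short Python sketch. It assumes base-10 logarithms (which the value 1.69897 for log(100/2) implies), normalisation by the largest non-stopword frequency (25, for "People"), and a collection of 100 documents. The document frequencies 50, 2 and 2 are read off the log terms in the example, and `tf_idf` is a hypothetical name.

```python
import math

def tf_idf(freq, max_freq, n_docs, doc_freq):
    """Normalised term frequency times base-10 inverse document frequency."""
    tf = freq / max_freq
    idf = math.log10(n_docs / doc_freq)
    return tf * idf

# (term, frequency in the document, number of documents containing the term)
for term, freq, df in [("machine", 19, 50), ("luddite", 3, 2), ("poverty", 5, 2)]:
    print(term, round(tf_idf(freq, max_freq=25, n_docs=100, doc_freq=df), 4))
```

Although "machine" occurs far more often in the document than "poverty", its weight is lower because it appears in half of the collection.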

35
Inverted Files

Word-oriented mechanism for indexing text collections to speed up searching
Searching
vocabulary search (query terms)
retrieval of occurrence
manipulation of occurrence

36
Original Document view
     cosmonaut  astronaut  moon  car  truck
D1       1          0        1    1     1
D2       0          1        1    0     0
D3       0          0        0    1     1
37
Inverted view
           D1  D2  D3
cosmonaut   1   0   0
astronaut   0   1   0
moon        1   1   0
car         1   0   1
truck       1   0   1
38
Inverted index
cosmonaut: D1
astronaut: D2
moon: D1, D2
car: D1, D3
truck: D1, D3
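
Going from the document view to the inverted index is a single pass over the documents. In this Python sketch the three documents are written as term lists matching the matrix, and `build_inverted_index` is a hypothetical name.

```python
from collections import defaultdict

# The three example documents, written as lists of their index terms.
DOCS = {
    "D1": ["cosmonaut", "moon", "car", "truck"],
    "D2": ["astronaut", "moon"],
    "D3": ["car", "truck"],
}

def build_inverted_index(docs):
    """Map each term to the sorted list of documents that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):          # visit documents in order
        for term in set(docs[doc_id]):   # count each term once per document
            index[term].append(doc_id)
    return dict(index)

index = build_inverted_index(DOCS)
print(index["moon"])
print(index["car"])
```

Because documents are visited in sorted order, every postings list comes out sorted, which is what makes merging lists at query time cheap.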
39
Inverted File

The speed of retrieval is maximised by
considering only those terms that have been
specified in the query
This speed is achieved only at the cost of very
substantial storage and processing overheads

40
Components of an inverted file
[Figure: components of an inverted file; labels: header information, term, frequency, pointer, postings file, document number, frequency, field type]
41
Producing an Inverted file
Term   Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8  Inverted File  Postings
aid      0      0      0      1      0      0      0      1    AI             4, 8
all      0      1      0      1      0      1      0      0    AL             2, 4, 6
back     1      0      1      0      0      0      1      0    BA             1, 3, 7
brown    1      0      1      0      1      0      1      0    BR             1, 3, 5, 7
come     0      1      0      1      0      1      0      1    C              2, 4, 6, 8
dog      0      0      1      0      1      0      0      0    D              3, 5
fox      0      0      1      0      1      0      1      0    F              3, 5, 7
good     0      1      0      1      0      1      0      1    G              2, 4, 6, 8
jump     0      0      1      0      0      0      0      0    J              3
lazy     1      0      1      0      1      0      1      0    L              1, 3, 5, 7
men      0      1      0      1      0      0      0      1    M              2, 4, 8
now      0      1      0      0      0      1      0      1    N              2, 6, 8
over     1      0      1      0      1      0      1      1    O              1, 3, 5, 7, 8
party    0      0      0      0      0      1      0      1    P              6, 8
quick    1      0      1      0      0      0      0      0    Q              1, 3
their    1      0      0      0      1      0      1      0    TH             1, 5, 7
time     0      1      0      1      0      1      0      0    TI             2, 4, 6
Index keys A, B and T are split into the two-letter keys AI/AL, BA/BR and TH/TI.
42
An Inverted file
Term   Inverted File  Postings
aid    AI             4, 8
all    AL             2, 4, 6
back   BA             1, 3, 7
brown  BR             1, 3, 5, 7
come   C              2, 4, 6, 8
dog    D              3, 5
fox    F              3, 5, 7
good   G              2, 4, 6, 8
jump   J              3
lazy   L              1, 3, 5, 7
men    M              2, 4, 8
now    N              2, 6, 8
over   O              1, 3, 5, 7, 8
party  P              6, 8
quick  Q              1, 3
their  TH             1, 5, 7
time   TI             2, 4, 6
Index keys A, B and T are split into the two-letter keys AI/AL, BA/BR and TH/TI.
43
Searching Algorithm

For each document D, Score(D) = 0
For each query term
  Search the vocabulary list
  Pull out the postings list
  For each document J in the list,
    Score(J) = Score(J) + 1
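
A minimal Python sketch of this algorithm, using a few postings lists from the eight-document inverted file example. The name `search` is hypothetical. A `Counter` stands in for the explicit "Score(D) = 0" initialisation, so documents that match no query term are simply absent from the result; that is a design choice of this sketch.

```python
from collections import Counter

# A subset of the postings lists from the eight-document example.
POSTINGS = {
    "quick": [1, 3],
    "brown": [1, 3, 5, 7],
    "fox": [3, 5, 7],
    "dog": [3, 5],
    "lazy": [1, 3, 5, 7],
}

def search(query_terms, postings):
    """Score each document by the number of query terms it contains."""
    scores = Counter()                         # missing documents score 0
    for term in query_terms:
        for doc in postings.get(term, []):     # vocabulary lookup, then postings
            scores[doc] += 1
    # Highest score first; ties broken by document number.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

print(search(["quick", "brown", "fox"], POSTINGS))
```

Only the postings lists of the query terms are touched, so the cost depends on the query and not on the size of the vocabulary.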

44
What Goes in a Postings File?

Boolean retrieval
Just the document number
Ranked Retrieval
Document number and term weight (TFIDF, ...)
Proximity operators
Word offsets for each occurrence of the term
Example: Doc 3 (t17, t36), Doc 13 (t3, t45)
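
To show why word offsets are needed, here is a Python sketch of a simple proximity test over positional postings. The offsets for the first term reuse the example above (Doc 3 at t17 and t36, Doc 13 at t3 and t45). The term names, the second term's offsets and the function `within` are invented for illustration.

```python
# Positional postings: term -> {document number: [word offsets]}.
# "quick" reuses the offsets from the example; "fox" is invented.
POSITIONS = {
    "quick": {3: [17, 36], 13: [3, 45]},
    "fox": {3: [19, 80], 13: [60]},
}

def within(term_a, term_b, k, positions=POSITIONS):
    """Return the documents where the two terms occur at most k words apart."""
    docs_a = positions.get(term_a, {})
    docs_b = positions.get(term_b, {})
    result = []
    for doc in sorted(docs_a.keys() & docs_b.keys()):  # documents with both terms
        if any(abs(a - b) <= k for a in docs_a[doc] for b in docs_b[doc]):
            result.append(doc)
    return result

print(within("quick", "fox", 3))
print(within("quick", "fox", 20))
```

A Boolean or ranked index could not answer this query, because it records that a term occurs in a document but not where.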

45
How Big Is the Postings File?

Very compact for Boolean retrieval
About 10% of the size of the documents
If an aggressive stopword list is used
Not much larger for ranked retrieval
Perhaps 20%
Enormous for proximity operators
Sometimes larger than the documents
But access is fast - you know where to look

46
[Figure: retrieval process. The query and the documents are each indexed (stop word removal, stemming) into query features and indexing features, which are then matched; storage: inverted index (Term 1 -> di, dj, dk)]
47
Similarity Matching

The process in which we compute the relevance of
a document for a query
A similarity measure comprises:
a term weighting scheme, which allocates numerical values to each of the index terms in a query or document, reflecting their relative importance
a similarity coefficient, which uses the term weights to compute the overall degree of similarity between a query and a document
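
The slides do not name a particular similarity coefficient. As one common choice, the Python sketch below uses the cosine of the angle between the query and document weight vectors. The weights are illustrative and `cosine` is a hypothetical name.

```python
import math

def cosine(query_weights, doc_weights):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    shared = query_weights.keys() & doc_weights.keys()
    dot = sum(query_weights[t] * doc_weights[t] for t in shared)
    norm_q = math.sqrt(sum(w * w for w in query_weights.values()))
    norm_d = math.sqrt(sum(w * w for w in doc_weights.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # assumption: an empty vector is similar to nothing
    return dot / (norm_q * norm_d)

# Illustrative weights: a two-term query against a three-term document.
query = {"luddite": 1.0, "machine": 1.0}
doc = {"machine": 0.2288, "luddite": 0.2039, "poverty": 0.3398}
print(round(cosine(query, doc), 3))
```

Dividing by the vector lengths keeps long documents from scoring highly merely because they contain more terms.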
