Title: Information Retrieval: Indexing
1 Information Retrieval: Indexing
- Acknowledgements
- Dr Mounia Lalmas (QMW)
- Dr Joemon Jose (Glasgow)
2 Roadmap
- What is a document?
- Representing the content of documents
- Luhn's analysis
- Generation of document representatives
- Weighting
- Inverted files
3 Indexing Language
- Language used to describe documents and queries
- index terms: selected subset of words
- derived from the text or arrived at independently
- Keyword searching
- Statistical analysis of documents based on word occurrence frequency
- Automated, efficient and potentially inaccurate
- Searching using controlled vocabularies
- More accurate results but time consuming if documents are manually indexed
4 Luhn's analysis
- Resolving power of significant words
- ability of words to discriminate document content
- peaks at the rank order position halfway between the two cut-offs
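The two cut-offs can be sketched as a simple frequency filter: words occurring more often than the upper cut-off are treated as too common to discriminate, and words occurring less often than the lower cut-off as too rare. The function name and the cut-off values below are illustrative assumptions, not values from the slides.

```python
from collections import Counter

def significant_words(tokens, lower_cutoff, upper_cutoff):
    """Return the words whose frequency lies between the two cut-offs (inclusive)."""
    counts = Counter(tokens)
    return sorted(w for w, c in counts.items() if lower_cutoff <= c <= upper_cutoff)

tokens = "the cat the dog the cat sat".split()
# "the" occurs 3 times (too frequent), "dog" and "sat" once (too rare).
print(significant_words(tokens, lower_cutoff=2, upper_cutoff=2))  # ['cat']
```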
5 Generating document representatives
6 Generating document representatives
- Input text: full text, abstract, title
- Document representative: list of (weighted) class names, each name representing a class of concepts (words) occurring in input text
- Document indexed by a class name if one of its significant words occurs as a member of that class
- Phases
- identify words - Lexical Analysis (Tokenising)
- removal of high frequency words
- suffix stripping (stemming)
- detecting equivalent stems
- thesauri
- others (noun-phrase, noun group, logical formula, structure)
- Index structure creation
7 Process View
8 Lexical Analysis
- The process of converting a stream of characters (the text of the documents) into a stream of words (the candidate words to be adopted as index terms)
- requires deciding how to treat digits, hyphens, punctuation marks, and the case of the letters
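A minimal tokeniser sketch follows. The policy it encodes is only one possible choice and is an assumption, not something the slides prescribe: letters are lowercased, hyphens and punctuation act as separators, and digits are kept.

```python
import re

def tokenize(text):
    """Convert a character stream into a list of candidate index terms.

    Assumed policy: lowercase everything, split on any character that is
    not a letter or digit (so hyphens and punctuation separate tokens).
    """
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("The destruction of the Amazon rain-forests, 1990-2000."))
# ['the', 'destruction', 'of', 'the', 'amazon', 'rain', 'forests', '1990', '2000']
```

A different policy (for example keeping hyphenated words whole, or dropping numbers) would change which index terms a query can match.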
9 Stopword Removal
- Removal of high frequency words
- list of stop words (implements Luhn's upper cut-off)
- filtering out words with very low discrimination values for retrieval purposes
- example: "been", "a", "about", "otherwise"
- compare input text with stop list
- reduction in text size of between 30 and 50 per cent
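Comparing the input text against a stop list can be sketched as a set lookup. The stop list here is a tiny illustrative one (an assumption); real lists contain a few hundred words.

```python
# Tiny illustrative stop list; not a real production list.
STOP_WORDS = {"a", "about", "been", "of", "otherwise", "the"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t not in stop_words]

tokens = "the destruction of the amazon rain forests".split()
print(remove_stop_words(tokens))  # ['destruction', 'amazon', 'rain', 'forests']
```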
10 Conflation
- Conflation reduces word variants into a single form
- similar words generally have similar meaning
- retrieval effectiveness increased if the query is expanded with those which are similar in meaning to those originally contained within it.
- Stemming algorithm is a conflation procedure
- reduces all words with same root into a single root
11 Different forms - stemming
- Stemming
- Matching the query term "forests" to "forest" and "forested"
- "choke", "choking", "choked"
- Suffix removal
- removal of suffixes - worker
- Porter algorithm: remove longest suffix
- error: "equal" -> "eq" (heuristic rules)
- more effective than ordinary word forms
- Detecting equivalent stems
- example: ABSORB- and ABSORPT-
- Stemmers remove affixes
- prefixes? - megavolt
12 Plural stemmer
- Plurals in English
- If word ends in "ies" but not "eies" or "aies"
- "ies" -> "y"
- If word ends in "es" but not "aes", "ees" or "oes"
- "es" -> "e"
- If word ends in "s" but not "us" or "ss"
- "s" -> (nothing; the "s" is dropped)
- First applicable rule is the one used
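The three rules translate almost directly into code. One interpretation is assumed here: a rule whose exception matches is treated as not applicable, so the word falls through to the next rule.

```python
def plural_stem(word):
    """Apply the first applicable plural rule; return the word unchanged if none applies."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"      # ies -> y
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-2] + "e"      # es -> e
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]            # s -> (dropped)
    return word

for w in ["ponies", "plates", "forests", "trees", "corpus", "glass"]:
    print(w, "->", plural_stem(w))
# ponies -> pony, plates -> plate, forests -> forest,
# trees -> tree, corpus -> corpus, glass -> glass
```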
13 Processing
- The destruction of the amazon rain forests
- Case normalisation
- Stop word removal.
- From fixed list
- destruction amazon rain forests
- Suffix removal (stemming).
- destruct amazon rain forest
14 Thesauri
- A collection of terms along with some structure or relationships between them. Scope notes etc.
- provide standard vocabulary for indexing and searching
- assist user locating terms for proper query formulation
- provide classification hierarchy for broadening and narrowing current query according to user need
- Equivalence: synonyms, preferred terms
- Hierarchical: broader/narrower terms (BT/NT)
- Association: related terms across the hierarchy (RT)
15 Thesauri Examples: WordNet
17 Faceted Classification
18 Thesauri Examples: AAT
- Art and Architecture Thesaurus
19 Hierarchical Classifications
- Alphanumeric coding schemes
- Subject classifications
- A taxonomy that represents a classification or kind-of hierarchy.
- Examples: Dewey Decimal, AAT, SHIC, ICONCLASS
41A32 Door
41A322 Closing the Door
41A323 Monumental Door
41A324 Metalwork of a Door
41A3241 Door-Knocker
41A325 Threshold
41A327 Door-keeper, houseguard
20 Terminology/Controlled vocabulary
- The descriptors from a thesaurus form a controlled vocabulary
- Normalise indexing concepts
- Identification of indexing concepts with clear semantics
- Retrieval based on concepts rather than terms
- Good for specific domains (e.g., medical)
- Problematic for general domains (large, new, dynamic)
21 No One Classification
22 No One Classification
23 Generating document representatives - Outcome
- Class
- words with the same stem
- Class name
- stem
- Document representative
- list of class names (index terms or keywords)
- Same process applied to query
24 Precision and Recall
- Precision
- Ratio of the number of relevant documents retrieved to the total number of documents retrieved.
- The proportion of hits that are relevant
- Recall
- Ratio of the number of relevant documents retrieved to the total number of relevant documents
- The proportion of relevant documents that are hits
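Both ratios can be computed from two sets of document identifiers. The identifiers below are made up for illustration, and returning 0.0 for an empty set is an assumed convention.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved and a set of relevant documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant                  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# 4 documents retrieved, 5 relevant overall, 2 in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7", "d8", "d9"])
print(p, r)  # 0.5 0.4
```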
25 Precision and Recall
High Precision, Low Recall
26 Precision and Recall
RA
- The user isn't usually given the answer set RA at once
- The documents in A are sorted by degree of relevance (ranking), and the user examines this ranked list. Recall and precision vary as the user proceeds with their examination of the answer set A
27 Precision and Recall Trade Off
- Increase number of documents retrieved
- Likely to retrieve more of the relevant documents and thus increase the recall
- But typically retrieve more inappropriate documents and thus decrease precision
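The trade-off can be observed by recomputing both measures after each document of a ranked list. This is a sketch with a made-up ranking; it assumes at least one relevant document exists.

```python
def precision_recall_at_each_rank(ranking, relevant):
    """Return a list of (rank, precision, recall) as more of the ranking is examined."""
    relevant = set(relevant)          # assumed non-empty
    hits = 0
    points = []
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((k, hits / k, hits / len(relevant)))
    return points

for rank, p, r in precision_recall_at_each_rank(["d3", "d1", "d5", "d2"], {"d3", "d5", "d9"}):
    print(rank, round(p, 2), round(r, 2))
# 1 1.0 0.33
# 2 0.5 0.33
# 3 0.67 0.67
# 4 0.5 0.67
```

Recall never decreases as more documents are retrieved, while precision tends to fall whenever a non-relevant document is added.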
28 Index term weighting
- Effectiveness of an indexing language
- Exhaustivity
- number of different topics indexed
- high exhaustivity: high recall and low precision
- Specificity
- ability of the indexing language to describe topics precisely
- high specificity: high precision and low recall
29 Index term weighting
- Exhaustivity
- related to the number of index terms assigned to a given document
- Specificity
- number of documents to which a term is assigned in a collection
- related to the distribution of index terms in collection
- Index term weighting
- index term frequency: occurrence frequency of a term in document
- document frequency: number of documents in which a term occurs
30 IR as Clustering
- A query is a vague spec of a set of objects, A
- IR is reduced to the problem of determining which documents are in set A and which ones are not
- Intra clustering similarity
- What are the features that better describe the objects in A
- Inter clustering dissimilarity
- What are the features that better distinguish the objects in A from the remaining objects in C
[Figure: A = retrieved documents, C = document collection]
31 Index term weighting
32 Index term weighting
- Intra-clustering similarity
- The raw frequency of a term t inside a document d.
- A measure of how well the document term describes the document contents
- Inter-cluster dissimilarity
- Inverse document frequency
- Inverse of the frequency of a term t among the documents in the collection.
- Terms which appear in many documents are not useful for distinguishing a relevant document from a non-relevant one.
- tf(t,d): normalised frequency of term t in document d
- idf(t): inverse document frequency
- Weight(t,d) = tf(t,d) x idf(t)
33 Term weighting schemes
- Best known
- Variation for query term weights
34 Example
- Nuclear 7, Computer 9
- Poverty 5, Unemployment 1
- Luddites 3, Machines 19
- People 25, And 49
- Weight(machine)
- = 19/25 x log(100/50)
- = 0.76 x 0.30103 = 0.2287828
- Weight(luddite)
- = 3/25 x log(100/2)
- = 0.12 x 1.69897 = 0.2038764
- Weight(poverty) = 5/25 x log(100/2) = 0.2 x 1.69897 = 0.339794
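The arithmetic can be checked with a short sketch. It assumes what the numbers in the example imply rather than what the slides state explicitly: tf is the term's raw frequency divided by 25 (the frequency of "People", apparently the most frequent term once the stop word "And" is excluded), idf is a base-10 logarithm of N divided by the document frequency, N is 100, and the document frequencies are 50, 2 and 2. The function and parameter names are illustrative.

```python
import math

def tf_idf(freq, max_freq, n_docs, doc_freq):
    """Weight(t,d) = tf(t,d) x idf(t), with tf normalised by the largest term frequency."""
    tf = freq / max_freq
    idf = math.log10(n_docs / doc_freq)
    return tf * idf

print(round(tf_idf(19, 25, 100, 50), 6))  # machine: 0.228783
print(round(tf_idf(3, 25, 100, 2), 6))    # luddite: 0.203876
print(round(tf_idf(5, 25, 100, 2), 6))    # poverty: 0.339794
```

Although "machine" is by far the most frequent of the three terms in the document, it receives a lower weight than "poverty" because it occurs in half of the collection.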
35 Inverted Files
- Word-oriented mechanism for indexing text collections to speed up searching
- Searching
- vocabulary search (query terms)
- retrieval of occurrences
- manipulation of occurrences
36 Original Document view
     cosmonaut  astronaut  moon  car  truck
D1   1          0          1     1    1
D2   0          1          1     0    0
D3   0          0          0     1    1
37 Inverted view
            D1  D2  D3
cosmonaut   1   0   0
astronaut   0   1   0
moon        1   1   0
car         1   0   1
truck       1   0   1
38 Inverted index
cosmonaut -> D1
astronaut -> D2
moon -> D1, D2
car -> D1, D3
truck -> D1, D3
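Going from the document view to the inverted index amounts to regrouping by term instead of by document. The sketch below uses the three documents from the slides; the function name is illustrative.

```python
from collections import defaultdict

# Document view: each document lists the terms it contains.
docs = {
    "D1": ["cosmonaut", "moon", "car", "truck"],
    "D2": ["astronaut", "moon"],
    "D3": ["car", "truck"],
}

def build_inverted_index(docs):
    """Map each term to the sorted list of documents that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):              # visit documents in order so postings stay sorted
        for term in set(docs[doc_id]):       # set(): record each document once per term
            index[term].append(doc_id)
    return dict(index)

index = build_inverted_index(docs)
print(index["moon"])   # ['D1', 'D2']
print(index["car"])    # ['D1', 'D3']
```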
39 Inverted File
- The speed of retrieval is maximised by considering only those terms that have been specified in the query
- This speed is achieved only at the cost of very substantial storage and processing overheads
40 Components of an inverted file
[Figure: components of an inverted file. Labels: header information, term, frequency, pointer, postings file, document number, frequency, field type]
41 Producing an Inverted file
Term   Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8  Inverted File  Postings
aid    0      0      0      1      0      0      0      1      AI             4, 8
all    0      1      0      1      0      1      0      0      AL             2, 4, 6
back   1      0      1      0      0      0      1      0      BA             1, 3, 7
brown  1      0      1      0      1      0      1      0      BR             1, 3, 5, 7
come   0      1      0      1      0      1      0      1      C              2, 4, 6, 8
dog    0      0      1      0      1      0      0      0      D              3, 5
fox    0      0      1      0      1      0      1      0      F              3, 5, 7
good   0      1      0      1      0      1      0      1      G              2, 4, 6, 8
jump   0      0      1      0      0      0      0      0      J              3
lazy   1      0      1      0      1      0      1      0      L              1, 3, 5, 7
men    0      1      0      1      0      0      0      1      M              2, 4, 8
now    0      1      0      0      0      1      0      1      N              2, 6, 8
over   1      0      1      0      1      0      1      1      O              1, 3, 5, 7, 8
party  0      0      0      0      0      1      0      1      P              6, 8
quick  1      0      1      0      0      0      0      0      Q              1, 3
their  1      0      0      0      1      0      1      0      TH             1, 5, 7
time   0      1      0      1      0      1      0      0      TI             2, 4, 6
(The single letters A, B and T also appear in the figure, next to AI/AL, BA/BR and TH/TI.)
42 An Inverted file
Term   Inverted File  Postings
aid    AI             4, 8
all    AL             2, 4, 6
back   BA             1, 3, 7
brown  BR             1, 3, 5, 7
come   C              2, 4, 6, 8
dog    D              3, 5
fox    F              3, 5, 7
good   G              2, 4, 6, 8
jump   J              3
lazy   L              1, 3, 5, 7
men    M              2, 4, 8
now    N              2, 6, 8
over   O              1, 3, 5, 7, 8
party  P              6, 8
quick  Q              1, 3
their  TH             1, 5, 7
time   TI             2, 4, 6
(The single letters A, B and T also appear in the figure, next to AI/AL, BA/BR and TH/TI.)
43 Searching Algorithm
- For each document D, Score(D) = 0
- For each query term
- Search the vocabulary list
- Pull out the postings list
- For each document J in the list,
- Score(J) = Score(J) + 1
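The scoring loop can be sketched as follows, using a few postings lists taken from the inverted file slides. Holding the index in a dictionary and breaking ties by document number are assumptions made for the example.

```python
from collections import defaultdict

# A few postings lists from the inverted file example (term -> document numbers).
index = {
    "quick": [1, 3],
    "brown": [1, 3, 5, 7],
    "fox": [3, 5, 7],
    "dog": [3, 5],
}

def search(query_terms, index):
    """Score each document by the number of query terms it contains."""
    scores = defaultdict(int)                    # Score(D) = 0 for every document
    for term in query_terms:
        for doc_id in index.get(term, []):       # vocabulary search + postings list
            scores[doc_id] += 1                  # Score(J) = Score(J) + 1
    # Highest score first; ties broken by document number.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

print(search(["quick", "brown", "fox"], index))
# [(3, 3), (1, 2), (5, 2), (7, 2)]
```

Only documents that appear in at least one postings list are ever touched, which is why the inverted file makes this loop fast.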
44 What Goes in a Postings File?
- Boolean retrieval
- Just the document number
- Ranked Retrieval
- Document number and term weight (TFIDF, ...)
- Proximity operators
- Word offsets for each occurrence of the term
- Example: Doc 3 (t17, t36), Doc 13 (t3, t45)
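Postings that store word offsets can be sketched as a nested mapping from term to document to positions. The document content, the 1-based offsets and the `near` helper below are illustrative assumptions.

```python
from collections import defaultdict

def build_positional_postings(docs):
    """Map term -> {document number -> list of 1-based word offsets}."""
    postings = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for offset, term in enumerate(tokens, start=1):
            postings[term].setdefault(doc_id, []).append(offset)
    return dict(postings)

def near(postings, term_a, term_b, doc_id, max_gap):
    """True if the two terms occur within max_gap words of each other in the document."""
    pos_a = postings.get(term_a, {}).get(doc_id, [])
    pos_b = postings.get(term_b, {}).get(doc_id, [])
    return any(abs(a - b) <= max_gap for a in pos_a for b in pos_b)

docs = {3: "quick brown fox jump over lazy dog".split()}
postings = build_positional_postings(docs)
print(postings["fox"])                          # {3: [3]}
print(near(postings, "quick", "fox", 3, 2))     # True
print(near(postings, "quick", "dog", 3, 2))     # False
```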
45 How Big Is the Postings File?
- Very compact for Boolean retrieval
- About 10% of the size of the documents
- If an aggressive stopword list is used
- Not much larger for ranked retrieval
- Perhaps 20%
- Enormous for proximity operators
- Sometimes larger than the documents
- But access is fast - you know where to look
46
[Figure: retrieval process. Labels: Query, Documents, indexing (stop word removal, stemming), Query features, Indexing features, Matching, Storage: inverted index (Term 1 -> di, dj, dk)]
47 Similarity Matching
- The process in which we compute the relevance of a document for a query
- A similarity measure comprises
- term weighting scheme, which allocates numerical values to each of the index terms in a query or document reflecting their relative importance
- similarity coefficient, which uses the term weights to compute the overall degree of similarity between a query and a document
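As one possible similarity coefficient, the sketch below uses the cosine of the angle between the query and document weight vectors. The slides do not name a particular coefficient, so this choice, the function name and the example weights are all assumptions.

```python
import math

def cosine_similarity(query_weights, doc_weights):
    """Cosine between two sparse weight vectors given as {term: weight} dictionaries."""
    shared = set(query_weights) & set(doc_weights)
    dot = sum(query_weights[t] * doc_weights[t] for t in shared)
    q_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_weights.values()))
    if q_norm == 0 or d_norm == 0:
        return 0.0                      # an empty vector is similar to nothing
    return dot / (q_norm * d_norm)

query = {"luddite": 1.0, "machine": 1.0}             # hypothetical query weights
doc = {"luddite": 0.2, "machine": 0.23, "poverty": 0.34}  # hypothetical document weights
print(cosine_similarity(query, doc))
```

Documents would then be ranked by this value, highest first.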