Information Retrieval: Indexing - PowerPoint PPT Presentation

Title: Information Retrieval: Indexing
Slides: 48
Provided by: Carole153

Transcript and Presenter's Notes
1
Information Retrieval: Indexing
  • Acknowledgements
  • Dr Mounia Lalmas (QMW)
  • Dr Joemon Jose (Glasgow)

2

Roadmap
  • What is a document?
  • Representing the content of documents
  • Luhn's analysis
  • Generation of document representatives
  • Weighting
  • Inverted files

3
Indexing Language
  • Language used to describe documents and queries
  • index terms: a selected subset of words
  • derived from the text or arrived at independently
  • Keyword searching
  • statistical analysis of documents based on word
    occurrence frequency
  • automated, efficient, and potentially inaccurate
  • Searching using controlled vocabularies
  • more accurate results, but time-consuming if
    documents are manually indexed

4
Luhn's analysis
  • Resolving power of significant words
  • the ability of words to discriminate document content
  • peaks at the rank-order position halfway between the
    two frequency cut-offs

5
Generating document representatives
6
Generating document representatives
  • Input text: full text, abstract, or title
  • Document representative: a list of (weighted) class
    names, each name representing a class of concepts
    (words) occurring in the input text
  • A document is indexed by a class name if one of its
    significant words occurs as a member of that
    class
  • Phases
  • identify words - Lexical Analysis (Tokenising)
  • removal of high-frequency words
  • suffix stripping (stemming)
  • detecting equivalent stems
  • thesauri
  • others (noun phrases, noun groups, logical formulae,
    structure)
  • Index structure creation

7
Process View
8
Lexical Analysis
  • The process of converting a stream of characters
    (the text of the documents) into a stream of
    words (the candidate words to be adopted as index
    terms)
  • handles digits, hyphens, punctuation marks, and
    the case of the letters
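The character-to-word conversion described above can be sketched in Python. This is an illustrative tokeniser only; the policy choices (lowercase everything, break on hyphens, drop digits and punctuation) are assumptions, not the slides' prescription:

```python
import re

def tokenise(text):
    """Convert a character stream into candidate index terms.

    Illustrative policy: lowercase everything, treat hyphens and
    digits as word breaks, and discard punctuation.
    """
    text = text.lower()
    # Keep only maximal runs of letters; everything else is a separator
    return re.findall(r"[a-z]+", text)

print(tokenise("Door-Knocker, houseguard."))
```

Real systems differ mainly in how they treat exactly the cases listed on the slide (digits, hyphens, case), so each of those decisions would be a configurable rule rather than hard-coded.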

9
Stopword Removal
  • Removal of high-frequency words
  • a list of stop words (implements Luhn's upper
    cut-off)
  • filters out words with very low discrimination
    value for retrieval purposes
  • examples: "been", "a", "about", "otherwise"
  • compare input text with the stop list
  • reduces text size by between 30 and 50 per cent

10
Conflation
  • Conflation reduces word variants to a single
    form
  • similar words generally have similar meanings
  • retrieval effectiveness is increased if the query is
    expanded with words similar in meaning
    to those originally contained within it
  • A stemming algorithm is a conflation procedure
  • reduces all words with the same root to a single
    root

11
Different forms - stemming
  • Stemming
  • matching the query term "forests" to "forest" and
    "forested"
  • "choke", "choking", "choked"
  • Suffix removal
  • removal of suffixes - "worker"
  • Porter algorithm: remove the longest suffix
  • errors such as "equal" -> "eq" are patched with
    heuristic rules
  • more effective than ordinary word forms
  • Detecting equivalent stems
  • example: ABSORB- and ABSORPT-
  • Stemmers remove affixes
  • prefixes too? - "megavolt"

12
Plural stemmer
  • Plurals in English
  • if the word ends in "ies" but not "eies" or "aies":
    "ies" -> "y"
  • if the word ends in "es" but not "aes", "ees", or "oes":
    "es" -> "e"
  • if the word ends in "s" but not "us" or "ss":
    "s" -> ""
  • The first applicable rule is the one used
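The three rules translate directly into code. A minimal sketch, assuming the usual reading of "first applicable rule": the first rule whose ending matches fires, and its exception list blocks any change (so "toes" stays "toes" rather than falling through to the bare-"s" rule):

```python
def plural_stem(word):
    """Strip English plural suffixes using the three ordered rules.

    The first rule whose ending matches is used; if the word hits
    that rule's exception list, it is left unchanged.
    """
    if word.endswith("ies"):
        if not word.endswith(("eies", "aies")):
            return word[:-3] + "y"      # ponies -> pony
        return word
    if word.endswith("es"):
        if not word.endswith(("aes", "ees", "oes")):
            return word[:-1]            # "es" -> "e": drop the final s
        return word
    if word.endswith("s"):
        if not word.endswith(("us", "ss")):
            return word[:-1]            # dogs -> dog
        return word
    return word

print(plural_stem("ponies"), plural_stem("horses"), plural_stem("glass"))
```

The exception lists are what keep words like "glass", "corpus", and "toes" intact.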

13
Processing
  • Input: "The destruction of the amazon rain forests"
  • Case normalisation
  • Stop word removal (from a fixed list)
  • destruction amazon rain forests
  • Suffix removal (stemming)
  • destruct amazon rain forest
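The worked example above can be reproduced with a toy pipeline. The stop list and suffix list below are deliberately tiny illustrations (real systems use a full stop list and a proper stemmer such as Porter's):

```python
STOPWORDS = {"the", "of", "a", "about", "been", "otherwise"}  # toy stop list
SUFFIXES = ["ion", "s"]  # toy suffix list, for this example only

def represent(text):
    """Toy document-representative pipeline: case normalisation,
    stop-word removal, then longest-suffix stripping."""
    terms = []
    for word in text.lower().split():
        if word in STOPWORDS:
            continue
        # Try suffixes longest-first, Porter-style
        for suffix in sorted(SUFFIXES, key=len, reverse=True):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break
        terms.append(word)
    return terms

print(represent("The destruction of the amazon rain forests"))
```

This yields exactly the slide's result: destruct amazon rain forest.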

14
Thesauri
  • A collection of terms along with some structure
    or relationships between them (scope notes, etc.)
  • provides a standard vocabulary for indexing and
    searching
  • assists the user in locating terms for proper query
    formulation
  • provides a classification hierarchy for broadening
    and narrowing the current query according to user
    need
  • Equivalence: synonyms, preferred terms
  • Hierarchical: broader/narrower terms (BT/NT)
  • Association: related terms across the hierarchy
    (RT)
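One minimal way to hold the three relationship types in memory is a dictionary per preferred term. This is an illustrative data structure, not a standard thesaurus format, and the entries below are invented for the example:

```python
# Each entry maps a preferred term to its thesaurus relationships.
thesaurus = {
    "door": {
        "equivalent": ["doorway"],       # synonyms / preferred terms
        "broader": ["architecture"],     # BT
        "narrower": ["door-knocker"],    # NT
        "related": ["threshold"],        # RT
    },
}

def broaden(term):
    """Expand a query term with its broader terms (BT),
    trading precision for recall."""
    entry = thesaurus.get(term, {})
    return [term] + entry.get("broader", [])

print(broaden("door"))
```

Narrowing a query would use the `narrower` list symmetrically, and equivalence classes collapse synonyms onto the preferred term at indexing time.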

15
Thesauri Examples WordNet
16
(No Transcript)
17
Faceted Classification
18
Thesauri Examples AAT
  • Art and Architecture Thesaurus

19
Hierarchical Classifications
  • Alphanumeric coding schemes
  • Subject classifications
  • A taxonomy that represents a classification or
    kind-of hierarchy.
  • Examples: Dewey Decimal, AAT, SHIC, ICONCLASS

41A32 Door
41A322 Closing the Door
41A323 Monumental Door
41A324 Metalwork of a Door
41A3241 Door-Knocker
41A325 Threshold
41A327 Door-keeper, houseguard
20
Terminology/Controlled vocabulary
  • The descriptors from a thesaurus form a controlled
    vocabulary
  • Normalises indexing concepts
  • Identification of indexing concepts with clear
    semantics
  • Retrieval based on concepts rather than terms
  • Good for specific domains (e.g., medical)
  • Problematic for general domains (large, new,
    dynamic)

21
No One Classification
22
No One Classification
23
Generating document representatives - Outcome
  • Class
  • words with the same stem
  • Class name
  • stem
  • Document representative
  • list of class names (index terms or keywords)
  • Same process applied to query

24
Precision and Recall
  • Precision
  • Ratio of the number of relevant documents
    retrieved to the total number of documents
    retrieved.
  • The number of hits that are relevant
  • Recall
  • Ratio of number of relevant documents retrieved
    to the total number of relevant documents
  • The number of relevant documents that are hits
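Both ratios fall straight out of set arithmetic over document IDs. A minimal sketch (the example sets are invented):

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall from sets of document IDs."""
    hits = retrieved & relevant                 # relevant documents retrieved
    precision = len(hits) / len(retrieved)      # fraction of retrieved that are relevant
    recall = len(hits) / len(relevant)          # fraction of relevant that were retrieved
    return precision, recall

# 4 documents retrieved, 2 of which are among the 5 relevant ones
p, r = precision_recall({1, 2, 3, 4}, {2, 4, 6, 8, 10})
print(p, r)  # 0.5 0.4
```

Retrieving more documents can only grow `hits` and `retrieved`, which is the recall/precision trade-off discussed on the following slides.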

25
Precision and Recall
High Precision Low Recall
26
Precision and Recall
  • The user isn't usually given the answer set RA at
    once
  • The documents in A are sorted by degree of
    relevance (ranking), which the user examines.
    Recall and precision vary as the user proceeds
    with their examination of the answer set A

27
Precision and Recall Trade Off
  • Increase number of documents retrieved
  • Likely to retrieve more of the relevant documents
    and thus increase the recall
  • But typically retrieve more inappropriate
    documents and thus decrease precision

28
Index term weighting
  • Effectiveness of an indexing language
  • Exhaustivity
  • the number of different topics indexed
  • high exhaustivity: high recall and low precision
  • Specificity
  • the ability of the indexing language to describe
    topics precisely
  • high specificity: high precision and low recall

29
Index term weighting
  • Exhaustivity
  • related to the number of index terms assigned to
    a given document
  • Specificity
  • the number of documents to which a term is assigned
    in a collection
  • related to the distribution of index terms in the
    collection
  • Index term weighting
  • index term frequency: occurrence frequency of a
    term in a document
  • document frequency: number of documents in which
    a term occurs

30
IR as Clustering
  • A query is a vague specification of a set of objects, A
  • IR is reduced to the problem of determining which
    documents are in set A and which are not
  • Intra-clustering similarity
  • what features best describe the
    objects in A?
  • Inter-clustering dissimilarity
  • what features best distinguish the
    objects in A from the remaining objects in C?

[Diagram: the retrieved set A (documents marked x) shown as a subset of the document collection C]
31
Index term weighting
32
Index term weighting
  • Intra-clustering similarity
  • the raw frequency of a term t inside a document
    d
  • a measure of how well the term describes
    the document contents
  • Inter-cluster dissimilarity
  • inverse document frequency
  • the inverse of the frequency of a term t among the
    documents in the collection
  • terms which appear in many documents are not
    useful for distinguishing a relevant document
    from a non-relevant one

Normalised frequency of term t in document d: tf(t,d) = freq(t,d) / max_freq(d)
Inverse document frequency: idf(t) = log(N / n(t)), where N is the number of documents in the collection and n(t) the number containing t
Weight(t,d) = tf(t,d) x idf(t)
33
Term weighting schemes
  • Best known
  • Variation for query term weights

34
Example
  • Term frequencies in the document: Nuclear 7,
    Computer 9, Poverty 5, Unemployment 1,
    Luddites 3, Machines 19, People 25, And 49
  • Weight(machine)
  • = 19/25 x log(100/50)
  • = 0.76 x 0.30103 = 0.2288
  • Weight(luddite)
  • = 3/25 x log(100/2)
  • = 0.12 x 1.69897 = 0.2038764
  • Weight(poverty)
  • = 5/25 x log(100/2) = 0.2 x 1.69897 = 0.339794
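The arithmetic above can be checked directly. The sketch below assumes, as the slide's numbers imply, a base-10 logarithm, a 100-document collection, and normalisation by the highest content-word frequency (25):

```python
import math

def weight(tf, max_tf, N, df):
    """tf-idf weight: normalised term frequency times log inverse
    document frequency (base-10 log, matching the slide)."""
    return (tf / max_tf) * math.log10(N / df)

print(round(weight(19, 25, 100, 50), 4))  # machine:  0.2288
print(round(weight(3, 25, 100, 2), 4))    # luddite:  0.2039
print(round(weight(5, 25, 100, 2), 4))    # poverty:  0.3398
```

Note how "luddite", though five times rarer in the document than "machines", scores almost as highly because it occurs in only 2 of the 100 documents.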

35
Inverted Files
  • Word-oriented mechanism for indexing text
    collections to speed up searching
  • Searching
  • vocabulary search (query terms)
  • retrieval of occurrences
  • manipulation of occurrences

36
Original Document view
      cosmonaut  astronaut  moon  car  truck
D1        1          0       1     1     1
D2        0          1       1     0     0
D3        0          0       0     1     1
37
Inverted view
           D1  D2  D3
cosmonaut   1   0   0
astronaut   0   1   0
moon        1   1   0
car         1   0   1
truck       1   0   1
38
Inverted index
cosmonaut -> D1
astronaut -> D2
moon      -> D1, D2
car       -> D1, D3
truck     -> D1, D3
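Inverting the document view from the previous slides is a single pass over the collection. A minimal sketch using the same toy documents:

```python
# Term-document data from the slides, as documents -> term lists.
docs = {
    "D1": ["cosmonaut", "moon", "car", "truck"],
    "D2": ["astronaut", "moon"],
    "D3": ["car", "truck"],
}

def build_inverted_index(docs):
    """Invert the document view: map each term to the list of
    documents that contain it, in document order."""
    index = {}
    for doc_id, terms in docs.items():
        for term in terms:
            postings = index.setdefault(term, [])
            if doc_id not in postings:
                postings.append(doc_id)
    return index

index = build_inverted_index(docs)
print(index["moon"])  # ['D1', 'D2']
```

In a production index the postings would live on disk and carry frequencies and offsets, as the following slides describe, but the inversion step is the same.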
39
Inverted File
  • The speed of retrieval is maximised by
    considering only those terms that have been
    specified in the query
  • This speed is achieved only at the cost of very
    substantial storage and processing overheads

40
Components of an inverted file
[Diagram: the inverted file's dictionary holds header information and one entry per term (term, frequency, pointer); each pointer leads into the postings file, whose records hold document number, frequency, and field type]
41
Producing an Inverted file

Term    Doc1 Doc2 Doc3 Doc4 Doc5 Doc6 Doc7 Doc8  Prefix  Postings
aid       0    0    0    1    0    0    0    1   A/AI    4, 8
all       0    1    0    1    0    1    0    0   AL      2, 4, 6
back      1    0    1    0    0    0    1    0   B/BA    1, 3, 7
brown     1    0    1    0    1    0    1    0   BR      1, 3, 5, 7
come      0    1    0    1    0    1    0    1   C       2, 4, 6, 8
dog       0    0    1    0    1    0    0    0   D       3, 5
fox       0    0    1    0    1    0    1    0   F       3, 5, 7
good      0    1    0    1    0    1    0    1   G       2, 4, 6, 8
jump      0    0    1    0    0    0    0    0   J       3
lazy      1    0    1    0    1    0    1    0   L       1, 3, 5, 7
men       0    1    0    1    0    0    0    1   M       2, 4, 8
now       0    1    0    0    0    1    0    1   N       2, 6, 8
over      1    0    1    0    1    0    1    1   O       1, 3, 5, 7, 8
party     0    0    0    0    0    1    0    1   P       6, 8
quick     1    0    1    0    0    0    0    0   Q       1, 3
their     1    0    0    0    1    0    1    0   T/TH    1, 5, 7
time      0    1    0    1    0    1    0    0   TI      2, 4, 6

(The prefix column shows the one- and two-letter dictionary blocks under which each term is filed in the inverted file.)
42
An Inverted file

Term    Prefix  Postings
aid     A/AI    4, 8
all     AL      2, 4, 6
back    B/BA    1, 3, 7
brown   BR      1, 3, 5, 7
come    C       2, 4, 6, 8
dog     D       3, 5
fox     F       3, 5, 7
good    G       2, 4, 6, 8
jump    J       3
lazy    L       1, 3, 5, 7
men     M       2, 4, 8
now     N       2, 6, 8
over    O       1, 3, 5, 7, 8
party   P       6, 8
quick   Q       1, 3
their   T/TH    1, 5, 7
time    TI      2, 4, 6
43
Searching Algorithm
  • For each document D, Score(D) = 0
  • For each query term
  • search the vocabulary list
  • pull out the postings list
  • for each document J in the list,
    Score(J) = Score(J) + 1
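The algorithm above is coordination-level matching: each document scores one point per query term it contains. A minimal sketch, using postings lists taken from the inverted-file example:

```python
# Postings lists from the inverted-file slides.
postings = {
    "quick": [1, 3],
    "brown": [1, 3, 5, 7],
    "fox": [3, 5, 7],
}

def score(query_terms, postings):
    """Coordination-level matching over an inverted file: only the
    postings of the query terms are ever touched."""
    scores = {}                              # Score(D) defaults to 0
    for term in query_terms:
        for doc in postings.get(term, []):   # vocabulary search + postings
            scores[doc] = scores.get(doc, 0) + 1
    return scores

print(score(["quick", "brown", "fox"], postings))
```

Document 3 contains all three query terms and so wins with a score of 3; documents absent from every postings list are never examined, which is the speed advantage the next slide describes.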

44
What Goes in a Postings File?
  • Boolean retrieval
  • Just the document number
  • Ranked Retrieval
  • Document number and term weight (TFIDF, ...)
  • Proximity operators
  • Word offsets for each occurrence of the term
  • Example: Doc 3 (t17, t36), Doc 13 (t3, t45)

45
How Big Is the Postings File?
  • Very compact for Boolean retrieval
  • about 10% of the size of the documents
  • if an aggressive stopword list is used
  • Not much larger for ranked retrieval
  • perhaps 20%
  • Enormous for proximity operators
  • sometimes larger than the documents
  • but access is fast - you know where to look

46
[Diagram: both the query and the documents pass through the same indexing steps (stop-word removal, stemming); the resulting query features are matched against the indexing features, which are held in storage as an inverted index (term i -> di, dj, dk)]
47
Similarity Matching
  • The process by which we compute the relevance of
    a document to a query
  • A similarity measure comprises
  • a term weighting scheme, which allocates numerical
    values to each of the index terms in a query or
    document, reflecting their relative importance
  • a similarity coefficient, which uses the term weights to
    compute the overall degree of similarity between
    a query and a document
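As an illustration of a similarity coefficient, the sketch below computes cosine similarity over sparse term-weight vectors; the choice of cosine and the example weights are assumptions for illustration, not the slides' prescribed measure:

```python
import math

def cosine(q, d):
    """Cosine similarity between two sparse term-weight vectors,
    each a dict mapping term -> weight."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

query = {"forest": 1.0, "amazon": 1.0}          # weighted query terms
doc = {"forest": 0.5, "amazon": 0.5, "rain": 0.5}  # weighted document terms
print(round(cosine(query, doc), 3))
```

Any term weighting scheme (such as the tf-idf weights from the earlier slides) can feed this coefficient; the two components are independent, which is why the slide lists them separately.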
