Title: Lecture 3: Document Models for IR
1. Lecture 3: Document Models for IR
- Prof. Xiaotie Deng
- Department of Computer Science
2. Outline
- Background
- Classical Models
- Latent Semantic Indexing Model
- Graph Model
3. Background: Document Logical View
- A text document may be represented for computer analysis in different formats:
- Full text
- Index terms
- Structure
4. Background: Document Indexer
- The huge size of the Internet makes it unrealistic to use the full text for information retrieval that requires quick response
- The indexer simplifies the logical view of a document
- The indexing method dictates the document storage and retrieval algorithms
- Automation of indexing methods is necessary for information retrieval over the Internet
5. Background: Document Indexer: Possible Drawbacks
- Summarizing a document by a set of index terms may lead to poor performance:
- many unrelated documents may be included in the answer set for a query
- relevant documents that are not indexed by any of the query keywords cannot be retrieved
6. Background: IR Models: A Formal Description
- A quadruple [D, Q, F, R(q, d)]:
- D (documents): a set composed of logical views (or representations) of the documents in the collection
- Q (queries): a set composed of logical views (or representations) of user information needs
- F (framework): a framework for modeling document representations, queries, and their relationships
- R(q, d): a ranking function that associates a real number with a query q and a document representation d. This ranking defines an ordering of the documents with regard to the query q.
7. Classic Models
- Boolean Model
- Vector Space Model
- Probabilistic Model
8. Classic Models: Boolean Model
- Document representation: full text, or a set of keywords (contained in the text or not)
- Query representation: logic operators, query terms, query expressions
- Searching: use the inverted file and set operations to construct the result set
9. Classic Models: Boolean Model: Search
- Queries
- e.g., A AND B AND (C OR D)
- Break the collection into two unordered sets:
- documents that match the query
- documents that don't
- Return all documents that match the query
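The steps above can be sketched in a few lines. This is a minimal illustration, not the lecture's implementation; the four toy documents and the query A AND B AND (C OR D) are made up for the example.

```python
# Toy collection: document id -> set of index terms (illustrative data).
docs = {
    1: {"A", "B", "C"},
    2: {"A", "B", "D"},
    3: {"A", "C"},
    4: {"B", "C", "D"},
}

# Build the inverted file: term -> set of document ids containing it.
inverted = {}
for doc_id, terms in docs.items():
    for t in terms:
        inverted.setdefault(t, set()).add(doc_id)

def postings(term):
    """Posting set of a term; empty if the term is unknown."""
    return inverted.get(term, set())

# Evaluate A AND B AND (C OR D) with set intersection and union.
result = postings("A") & postings("B") & (postings("C") | postings("D"))
print(sorted(result))  # [1, 2]: document 3 lacks B, document 4 lacks A
```

Because the model only partitions the collection into matches and non-matches, the result set here carries no ranking, which is exactly the disadvantage discussed on slide 13.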
10. Classic Models: Boolean Model: Example 1
[Venn diagram over the index terms ka, kb, kc, highlighting the regions (1,1,1), (1,1,0), and (1,0,0)]
The three conjunctive components for the query q = ka ∧ (kb ∨ ¬kc)
11Classic Models Boolean Model Example 2
Consider three document about cityu _at_
http//www.cityu.edu.hk/cityu/about/index.htm abo
ut FSE _at_ http//www.cityu.edu.hk/cityu/dpt-acad/f
se.htm about CS _at_ http//www.cs.cityu.edu.hk/c
ontent/about/ Query degree aim returns
about cityU Query degree aim returns all
three
12Classic Models Boolean Model Advantages
- Simple and clean formalism
- The answer set is exactly what the users look
for. - Therefore, users can have complete control if
they know how to write a Boolean formula of terms
for the document(s) they want to find out. - Easy to be implemented on computers
- Popular (most of the search engines support this
model)
13. Classic Models: Boolean Model: Disadvantages
- All results are treated as equals; there is no ranking of the documents
- The set of all documents satisfying a query may still be too large for the users to browse through, or too small
- Users may only know what they are looking for in a vague way and may not be able to formulate it as a Boolean expression
- Users need training
14. Classic Models: Boolean Model: Improvements
- Expand and refine queries through interactive protocols
- Automate query formula generation
- Assign weights to query terms and rank the results accordingly
15. Classic Models: Vector Space Model
- Representation
- Similarity Measure
- Advantages
- Limitations
16. Classic Models: Vector Space Model: Representation: Introduction
- Represent stored text, as well as information queries, by vectors of terms
- A term is typically a word, a word stem, or a phrase associated with the text under consideration, possibly carrying a weight
- Generate terms with a term-weighting system:
- terms are not equally useful for content representation
- assign high weights to terms deemed important, and low weights to the less important terms
17. Classic Models: Vector Space Model: Representation: Illustration
- Every document in the collection is represented by a vector
- The distinct terms in the collection are called index terms, or the vocabulary
[Illustration: a page collection about computers, with index terms such as Computer, XML, Operating System, Microsoft Office, Unix, Search Engines]
18. Classic Models: Vector Space Model: Representation: Term Relationships
- Each term i is identified as Ti
- In the vector space there is no relationship between terms; they are orthogonal
- In reality, however, in a collection about computers, terms like "computer" and "OS" are correlated with each other
19. Classic Models: Vector Space Model: Representation: Vectors
- A vocabulary of 2 terms forms a 2D space; each document may contain 0, 1, or 2 of the terms. We may see vectors such as:
- D1 = <0, 0>
- D2 = <0, 0.3>
- D3 = <2, 3>
20. Classic Models: Vector Space Model: Representation: Matrix
- t terms form a t-dimensional space
- Documents and queries can be represented as t-dimensional vectors
- A document can be viewed as a point in the t-dimensional space
- We may form an n × t matrix for n documents indexed by t terms
- This matrix is called the document/term matrix
21. Classic Models: Vector Space Model: Representation: Matrix Illustration
[Illustration: the document/term matrix, with documents as rows, terms as columns, and each entry holding the weight of a term in a document]
22. Classic Models: Vector Space Model: Representation: Matrix Weights
- Combine two factors in the document-term weight:
- tf_ij = frequency of term j in document i
- df_j = document frequency of term j = number of documents containing term j
- idf_j = inverse document frequency of term j = log2(N / df_j), where N is the number of documents in the collection
- Inverse document frequency is an indication of a term's value as a document discriminator
23. Classic Models: Vector Space Model: Representation: Matrix Weights: tf-idf Indicator
- A term that occurs frequently in the document but rarely in the rest of the collection receives a high weight
- A typical combined term-importance indicator:
- w_ij = tf_ij × idf_j = tf_ij × log2(N / df_j)
- Many other ways of determining the document-term weight have been proposed
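As an illustrative sketch, the weight w_ij = tf_ij × log2(N / df_j) can be computed for the five-document baking example that follows on the next slides (all term frequencies there are 0 or 1):

```python
from math import log2

# Term frequencies for the five baking documents (slides 24-28):
# rows = documents D1..D5, columns = terms T1..T6
# (T1 bake, T2 recipes, T3 bread, T4 cake, T5 pastry, T6 pie)
tf = [
    [1, 1, 1, 0, 0, 0],  # D1
    [0, 0, 0, 0, 1, 0],  # D2
    [0, 1, 0, 0, 0, 0],  # D3
    [1, 1, 1, 1, 1, 1],  # D4
    [0, 1, 0, 0, 1, 0],  # D5
]
N = len(tf)        # number of documents
t = len(tf[0])     # number of terms

# df_j: number of documents containing term j
df = [sum(1 for i in range(N) if tf[i][j] > 0) for j in range(t)]

# w_ij = tf_ij * log2(N / df_j); guard against terms appearing nowhere
w = [[tf[i][j] * log2(N / df[j]) if df[j] else 0.0 for j in range(t)]
     for i in range(N)]

print(df)  # [2, 4, 2, 1, 3, 1]
```

The resulting matrix w matches the tf-idf weight matrix on slide 29, e.g. w[0][0] = log2(5/2) for the (D1, bake) entry.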
24. Classic Models: Vector Space Model: Representation: Matrix Example, Page 1
- 5 documents:
- D1: How to Bake Bread without Recipes
- D2: The Classic Art of Viennese Pastry
- D3: Numerical Recipes: The Art of Scientific Computing
- D4: Breads, Pastries, Pies and Cakes: Quantity Baking Recipes
- D5: Pastry: A Book of Best French Recipe
25. Classic Models: Vector Space Model: Representation: Matrix Example, Page 2
- 6 index terms:
- T1: bak(e, ing)
- T2: recipes
- T3: bread
- T4: cake
- T5: pastr(y, ies)
- T6: pie
26. Classic Models: Vector Space Model: Representation: Matrix Example, Page 3
- D1: How to Bake Bread without Recipes
- D2: The Classic Art of Viennese Pastry
- D3: Numerical Recipes: The Art of Scientific Computing
- D4: Breads, Pastries, Pies and Cakes: Quantity Baking Recipes
- D5: Pastry: A Book of Best French Recipe
27. Classic Models: Vector Space Model: Representation: Matrix Example, Page 4: Term Frequency in Documents
Entry (i, j) = 1 if document i contains term j once:

      T1  T2  T3  T4  T5  T6
  D1   1   1   1   0   0   0
  D2   0   0   0   0   1   0
  D3   0   1   0   0   0   0
  D4   1   1   1   1   1   1
  D5   0   1   0   0   1   0
28. Classic Models: Vector Space Model: Representation: Matrix Example, Page 5: Document Frequency of Term j

df_j = number of documents containing term j:
  T1: 2, T2: 4, T3: 2, T4: 1, T5: 3, T6: 1
29. Classic Models: Vector Space Model: Representation: Matrix Example, Page 6: tf-idf Weight Matrix

          T1        T2        T3       T4        T5       T6
  D1  log(5/2)  log(5/4)  log(5/2)     0         0        0
  D2      0         0         0        0     log(5/3)     0
  D3      0     log(5/4)      0        0         0        0
  D4  log(5/2)  log(5/4)  log(5/2)  log(5)   log(5/3)  log(5)
  D5      0     log(5/4)      0        0     log(5/3)     0
30. Classic Models: Vector Space Model: Representation: Matrix Exercise
- Write a program that uses tf-idf term weights to form the term/document matrix. Test it on the following three documents:
- http://www.cityu.edu.hk/cityu/about/index.htm
- http://www.cityu.edu.hk/cityu/dpt-acad/fse.htm
- http://www.cs.cityu.edu.hk/content/about/
31. Classic Models: Vector Space Model: Similarity Measure
- Determine the similarity between a document D and a query Q
- Many methods can be used to calculate the similarity
- Cosine similarity measure
32. Classic Models: Vector Space Model: Similarity Measure: Cosine
[Figure: the document vector dj and the query vector Q, separated by an angle θ]
Cosine similarity measures the cosine of the angle between the two vectors:
sim(dj, Q) = (dj · Q) / (|dj| |Q|)
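A minimal sketch of the cosine measure, with illustrative vectors (the 3-term vectors below are made up for the example):

```python
from math import sqrt

def cosine(d, q):
    """Cosine of the angle between document vector d and query vector q."""
    dot = sum(a * b for a, b in zip(d, q))
    nd = sqrt(sum(a * a for a in d))
    nq = sqrt(sum(b * b for b in q))
    if nd == 0 or nq == 0:
        return 0.0  # a zero vector matches nothing
    return dot / (nd * nq)

d = [2.0, 3.0, 0.0]   # a document over a 3-term vocabulary
q = [1.0, 1.0, 0.0]   # a query over the same vocabulary
print(round(cosine(d, q), 3))  # 5 / (sqrt(13) * sqrt(2)) ≈ 0.981
```

Because the cosine is a real number in [0, 1] for non-negative weights, sorting documents by it yields the ranking that the Boolean model lacks.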
33. Classic Models: Vector Space Model: Advantages
- The term-weighting scheme improves retrieval performance
- The partial-matching strategy allows retrieval of documents that approximate the query conditions
- The cosine ranking formula sorts the documents according to their degree of similarity to the query
34. Classic Models: Vector Space Model: Limitations
- The underlying assumption that the terms in the vector are orthogonal
- The need for several query terms if a discriminating ranking is to be achieved, whereas only two or three ANDed terms may suffice in a Boolean environment to obtain high-quality output
- Difficulty in explicitly specifying synonymous and phrasal relationships, whereas these can easily be handled in a Boolean environment by means of the OR and AND operators, or by an extended Boolean model
35. Latent Semantic Indexing Model
- Maps document and query vectors into a lower-dimensional space associated with concepts
- "Information retrieval using a singular value decomposition model of latent semantic structure." 11th ACM SIGIR Conference, pp. 465-480, 1988
- by G.W. Furnas, S. Deerwester, S.T. Dumais, T.K. Landauer, R.A. Harshman, L.A. Streeter, and K.E. Lochbaum
- http://www.cs.utk.edu/lsi/
- A tutorial: http://www.cs.utk.edu/berry/lsi/node5.html
36. LSI: General Approach
- Based on the Vector Space Model
- In the Vector Space Model, terms are treated independently
- Here, some relationships among the terms are obtained implicitly, almost magically, through matrix analysis
- This allows reduction of some unnecessary information in the document representation
37. LSI: Background Knowledge: Term-Document Association Matrix
- Let t be the number of terms and N be the number of documents
- Let M = (Mij) be the term-document association matrix
- Mij may be considered the weight associated with the term-document pair (ti, dj)
38. LSI: Background Knowledge: Eigenvalues and Eigenvectors
- We have:
- A: an n × n matrix
- v: an n-dimensional vector
- c: a scalar
- If Av = cv, then:
- c is called an eigenvalue of A
- v is called an eigenvector of A
39. LSI: Background Knowledge: Eigenvalues and Eigenvectors: Example, Page 1
- Let A = [2 1; 1 2] and x = (1, 1)^t
- Then Ax = 3x
- 3 is an eigenvalue, and x is an eigenvector
- Question: find another eigenvalue
40. LSI: Background Knowledge: Eigenvalues and Eigenvectors: Example, Page 2
- Let y^t = (1, -1). Then Ay = (1, -1)^t = y.
- Therefore, another eigenvalue is 1, and its associated eigenvector is y
- Now let S = [3 0; 0 1]
- Then A(x, y) = (x, y)S
- Moreover, x^t y = 0
41. LSI: Background Knowledge: Eigenvalues and Eigenvectors: Example, Page 3
- Let K = (x, y)/sqrt(2)
- Then:
- K^t K = I
- and A = K S K^t
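The example can be verified numerically. Here A = [2 1; 1 2] is the symmetric matrix whose eigenpairs are (3, x) with x = (1, 1)^t and (1, y) with y = (1, -1)^t, consistent with the values in the example; the check below is a plain-Python sketch.

```python
from math import sqrt

A = [[2.0, 1.0], [1.0, 2.0]]
x = [1.0, 1.0]
y = [1.0, -1.0]

def matvec(M, v):
    """Multiply matrix M by vector v."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def matmul(P, Q):
    """Multiply matrix P by matrix Q."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

assert matvec(A, x) == [3.0, 3.0]    # Ax = 3x
assert matvec(A, y) == [1.0, -1.0]   # Ay = 1*y
assert x[0] * y[0] + x[1] * y[1] == 0.0  # x^t y = 0: eigenvectors are orthogonal

# K = (x, y)/sqrt(2) is orthonormal, and A = K S K^t with S = diag(3, 1)
K = [[x[0] / sqrt(2), y[0] / sqrt(2)],
     [x[1] / sqrt(2), y[1] / sqrt(2)]]
S = [[3.0, 0.0], [0.0, 1.0]]
Kt = [[K[j][i] for j in range(2)] for i in range(2)]
KSKt = matmul(matmul(K, S), Kt)

print(all(abs(KSKt[i][j] - A[i][j]) < 1e-12
          for i in range(2) for j in range(2)))  # True: A = K S K^t
```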
42. LSI: Background Knowledge: A Corollary of the Eigen Decomposition Theorem
- If A is a symmetric matrix, then there exist a matrix K (with K^t K = I) and a diagonal matrix S such that A = K S K^t
- See http://mathworld.wolfram.com/EigenDecompositionTheorem.html
- Application to our setting:
- Both M M^t and M^t M are symmetric
- In addition, their eigenvalues are the same, except that the larger matrix has extra zeros
43. LSI: Background Knowledge: Matrix Decomposition
- Decompose M = K S D^t
- K: the matrix of eigenvectors derived from the term-to-term correlation matrix M M^t
- D: the matrix of eigenvectors derived from M^t M
- S: an r × r diagonal matrix of singular values, where r is the rank of M
44. LSI: Reduced Concept Space
- Let Ss be the diagonal matrix of the s largest singular values in S
- Let Ks and Ds^t consist of the corresponding columns of K and rows of D^t
- The matrix Ms = Ks Ss Ds^t
- is the rank-s matrix closest to M in the least-squares sense
- NOTE: Ms has the same number of rows (terms) and columns (documents) as M, but its entries may be quite different from those of M
- A numerical example:
- http://www.cse.ogi.edu/class/cse580ir/handouts/10%20October/Models%20in%20Information%20Retrieval%20II%20Motivating%20Engineering%20Decisions/sld041.htm
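A sketch of the reduced concept space using numpy's SVD. The 6×5 term-document matrix below is illustrative, not from the lecture; numpy returns the factors of M = K S D^t as U, the singular values, and V^t.

```python
import numpy as np

# Toy term-document matrix M (rows = terms, columns = documents).
M = np.array([
    [1.0, 0.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0, 1.0, 1.0],
    [1.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
])

# Full decomposition M = K S D^t.
K, sing, Dt = np.linalg.svd(M, full_matrices=False)

# Keep the s = 2 largest singular values, with the matching
# columns of K and rows of D^t.
s = 2
Ks, Ss, Dst = K[:, :s], np.diag(sing[:s]), Dt[:s, :]
Ms = Ks @ Ss @ Dst   # rank-s matrix closest to M in the least-squares sense

print(Ms.shape)      # (6, 5): same shape as M, entries generally differ
```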
45. LSI: The Relationship Between Two Documents di and dj
- Ms^t Ms = (Ks Ss Ds^t)^t (Ks Ss Ds^t)
- = Ds Ss Ks^t Ks Ss Ds^t
- = Ds Ss Ss Ds^t
- = (Ds Ss)(Ds Ss)^t
- The (i, j) element quantifies the relationship between documents i and j
46. LSI: The Choice of s
- It should be large enough to fit all the structure in the original data
- It should be small enough to filter out noise caused by variation in the choice of terms
47. LSI: Ranking of Query Results
- Model the query Q as a pseudo-document in the original term-document matrix M
- The vector Ms^t Q provides the ranks of all documents with respect to the query Q
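A sketch of this ranking step, again with an illustrative term-document matrix (4 terms × 5 documents, made up for the example) and numpy's SVD:

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents).
M = np.array([
    [1.0, 0.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0, 1.0, 1.0],
    [0.0, 1.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
])
K, sing, Dt = np.linalg.svd(M, full_matrices=False)
s = 2
Ms = K[:, :s] @ np.diag(sing[:s]) @ Dt[:s, :]   # reduced concept space

# Model the query as a pseudo-document in term space:
# here, a query containing terms 1 and 2.
q = np.array([1.0, 1.0, 0.0, 0.0])
scores = Ms.T @ q              # Ms^t Q: one score per document
ranking = np.argsort(-scores)  # document indices, best first
print(ranking)
```

Because the scores come from the rank-s approximation rather than M itself, a document can score well even when it shares no literal term with the query, which is the point of the latent space.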
48. LSI: Advantages
- When s is small with respect to t and N, it provides an efficient indexing model
- It provides for elimination of noise and removal of redundancy
- It introduces conceptualization based on the theory of singular value decomposition
49. Graph Model
- "Improving Effectiveness and Efficiency of Web Search by Graph-based Text Representation"
- Junji Tomita and Yoshihiko Hayashi
- http://www9.org/final-posters/13/poster13.html
- "Interactive Web Search by Graphical Query Refinement"
- Junji Tomita and Genichiro Kikui
- http://www10.org/cdrom/posters/1078.pdf
50. Graph Model: Subject Graph: Basics
- A node represents a term in the text
- A link denotes an association between the linked terms
- The significance of terms and of term-term associations is represented by weights assigned to them
51. Graph Model: Subject Graph: Weight Assignment
- Term-statistics-based weighting schemes:
- frequencies of terms
- frequencies of term-term associations
- multiplied by inverse document frequency
52. Graph Model: Subject Graph: Similarity of Documents
- Subject graph matching:
- Weight terms and term-term associations by µ and 1-µ, for an adequately chosen µ
- Then calculate the cosine similarity of the two documents, treating the weighted terms and term-term associations as elements of the vector space model
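A sketch of this matching scheme under stated assumptions: the graph encoding (dicts of node weights and edge weights), the example documents, and µ = 0.7 are all illustrative, not taken from the cited papers.

```python
from math import sqrt

def graph_similarity(g1, g2, mu=0.7):
    """Cosine similarity of two subject graphs.

    Each graph g = (term_weights, edge_weights): term weights keyed by
    term, edge weights keyed by frozenset({term_a, term_b}). Terms are
    scaled by mu, term-term associations by 1 - mu.
    """
    features = list(set(g1[0]) | set(g2[0]) | set(g1[1]) | set(g2[1]))
    def vec(g):
        out = []
        for f in features:
            if isinstance(f, frozenset):          # an association (edge)
                out.append((1 - mu) * g[1].get(f, 0.0))
            else:                                 # a term (node)
                out.append(mu * g[0].get(f, 0.0))
        return out
    v1, v2 = vec(g1), vec(g2)
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = sqrt(sum(a * a for a in v1))
    n2 = sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Two toy subject graphs sharing the term "travel" but not its neighbors.
d1 = ({"travel": 2.0, "japan": 1.0}, {frozenset({"travel", "japan"}): 1.0})
d2 = ({"travel": 1.0, "asia": 1.0},  {frozenset({"travel", "asia"}): 1.0})
print(graph_similarity(d1, d2))
```

Setting µ close to 1 reduces the measure to plain term-vector cosine; smaller µ lets shared associations, not just shared terms, drive the score.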
53. Graph Model: Query as a Graph
- Sometimes the user's query is vague
- The system represents the user's query as a query graph
- The user can interactively and explicitly clarify his/her query by inspecting and editing the query graph
- The system implicitly edits the query graph according to the user's choice of documents
54. Graph Model: Query as a Graph: A Sample System User Interface
55. Graph Model: Query as a Graph: Illustration
[Illustration: a query graph with nodes such as guide, transport, travel, train, Asia, and Japan, connected by weighted links]
56. Graph Model: Query as a Graph: Interactive Graph Query Refinement
1. The user inputs sentences as a query; the system displays the initial query graph made from the inputs
2. The user edits the query graph by removing and/or adding nodes and/or links
3. The system measures the relevance score of each document against the modified query graph
4. The system ranks the search results in descending score order and displays their titles in the user interface
5. The user selects documents relevant to his/her needs
6. The system refines the query graph based on the selected documents and the old query graph
7. The system displays the new query graph to the user
8. The previous steps repeat until the user is satisfied with the search results
57. Graph Model: Query as a Graph: Interactive Graph Query Refinement: Details of Step 6 (Making a New Query Graph)
58. Graph Model: Query as a Graph: Digest Graph
- The output of the search engine is presented via a graphical representation:
- a sub-graph of the subject graph for the entire document
- The sub-graph is generated on the fly in response to the current query
- The user can intuitively understand the subject of each document from the terms and term-term associations in the graph
59. Summary
- Background
- Classical Models
- Latent Semantic Indexing Model
- Graph Model