Title: Text-retrieval Systems
1 Text-retrieval Systems
- NDBI010 Lecture Slides
- KSI MFF UK
- http://www.ms.mff.cuni.cz/kopecky/teaching/ndbi010/
- Version 10.05.12.13.30.en
2 Literature (textbooks)
- Introduction to Information Retrieval
- Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze - Cambridge University Press, 2008
- http://informationretrieval.org/
- Dokumentografické informační systémy
- Pokorný J., Snášel V., Kopecký M.
- Nakladatelství Karolinum, UK Praha, 2005
- Pokorný J., Snášel V., Húsek D.
- Nakladatelství Karolinum, UK Praha, 1998
- Textové informační systémy
- Melichar B.
- Vydavatelství ČVUT, Praha, 1997
3 Further links (books)
- Computer Algorithms: String Pattern Matching Strategies - Jun-ichi Aoe,
- IEEE Computer Society Press 1994
- Concept Decomposition for Large Sparse Text Data
using Clustering - Inderjit S. Dhillon, Dharmendra S. Modha
- IBM Almaden Research Center, 1999
4 Further links (articles)
- The IGrid Index: Reversing the Dimensionality Curse for Similarity Indexing in High Dimensional Space - Charu C. Aggarwal, Philip S. Yu
- IBM T. J. Watson Research Center
- The Pyramid-Technique: Towards Breaking the Curse of Dimensionality - S. Berchtold, C. Böhm, H.-P. Kriegel
- ACM SIGMOD Conference Proceedings, 1998
5 Further links (articles)
- Affinity Rank: A New Scheme for Efficient Web Search - Yi Liu, Benyu Zhang, Zheng Chen, Michael R. Lyu, Wei-Ying Ma - 2004
- Improving Web Search Results Using Affinity Graph - Benyu Zhang, Hua Li, Yi Liu, Lei Ji, Wensi Xi, Weiguo Fan, Zheng Chen, Wei-Ying Ma
- Efficient Computation of PageRank - T. H. Haveliwala
- Technical report, Stanford University, 1999
6 Further links (older)
- Introduction to Modern Information Retrieval
- Salton G., McGill M. J.
- McGraw-Hill, 1981
- Výběr informací v textových bázích dat
- Pokorný J.
- OVC ČVUT Praha 1989
7 Introduction
- Overview of the problem: informativeness measurement
8 Retrieval system origin
- 1950s
- Gradual automation of the procedures used in libraries
- Now a separate branch of information systems
- Factual IS
- Processing of information having a defined internal structure (usually in the form of tables)
- Bibliographic IS
- Processing of information in the form of text written in natural language, without a strict internal structure.
9 Interaction with TRS
- Query formulation
- Comparison
- Hit-list obtaining
- Query tuning/reformulation
- Document request
- Document obtaining
10 TRS Structure
- Document disclosure system
- Returns secondary information
- Author
- Title
- ...
- Document delivery system
- Need not be supported by the SW
(Diagram: interaction steps 1-4 from the previous slide form the disclosure part (I), steps 5-6 the delivery part (II).)
11 Query Evaluation
- Direct comparison is time-consuming
12 Query Evaluation
- A document model is used for the comparison
- Lossy process, usually based on the presence of words in documents
- Produces structured data suitable for efficient comparison
13 Query Evaluation
- The query is processed to obtain the needed form
- The processed query is compared against the index
14 Text preprocessing
- Searching is more efficient using a created (structured) model of documents, but it can use only information stored in the model, not in the documents.
- The goal is to create a model preserving as much information from the original documents as possible.
- Problem: a lot of ambiguity in text.
- Many unresolved tasks concerning document understanding still exist.
15 Text understanding
- Writer
- Text = a sequence of words in natural language.
- Each word stands for some idea/imagination of the writer.
- Ideas represent real subjects, activities, etc.
- The reader follows (not necessarily exactly the same) mappings from left to right
...
16 Text understanding
- Synonymy of words
- Several words can have the same meaning for the writer
- car = automobile
- sick = ill
...
17 Text understanding
- Homonymy of words
- One word can have several different meanings
- fluke = fish, anchor, ...
- crown = currency, treetop, jewel, ...
- class = year of studies, category in set theory,
...
18 Text understanding
- Word meanings need not be exactly the same.
- Hierarchical overlapping
- animal > horse > stallion
- Associativity among meanings
- calculator - computer - processor
...
19 Text understanding
- The mapping between subjects, ideas and words can depend on the individual persons, readers and writers.
- Two people can assign partly or completely different meanings to a given term.
- Two people can imagine different things for the same word.
- mother, room, ...
- As a result, two different readers can obtain different information by reading the same text
- Each differing from the other
- In comparison with the author's intention
20 Text understanding
- Homonymy and ambiguity grow with the transition from words/terms to sentences and bigger parts of the text.
- Example of an English sentence with several grammatically correct meanings (in this case a human reader probably eliminates the nonsense meaning)
- See Podivné fungování gramatiky, http://www.scienceworld.cz/sw.nsf/lingvistika
- In the sentence "Time flies like an arrow" either "flies" (fly) or "like" can be chosen as the predicate, which produces two significantly different meanings.
21 Text preprocessing
- Inclusion of linguistic analysis in the text processing can partially solve the problem
- Disambiguation
- Selection of the correct meaning of a term in the sentence
- According to grammar (Verb versus Noun, etc.)
- According to context (more complicated; can distinguish between two Verbs, two Nouns, etc.)
22 Text preprocessing
- Inclusion of linguistic analysis in the text processing can partially solve the problem
- Lemmatization
- For each term/word in the text, after its proper meaning is found, assigns
- Type of word, plural vs. singular, present tense vs. preterite, etc.
- Base form (singular for Nouns, infinitive for Verbs, ...)
- Information obtained by sentence analysis (subject, predicate, object, ...)
23 Text preprocessing
- Other tasks that can be more or less solved:
- Identification of collocations
- World War Two, ...
- Assigning Nouns to Pronouns used in the text (very complex and hard to solve, sometimes even for a human reader)
24 Precision and Recall
- As a result of the ambiguities, there exists no optimal text retrieval system
- After the answer to the query is obtained, the following values can be evaluated
- Number of returned documents in the list: Nv
- The system supposes them to be relevant (useful) according to their match with the query
- Number of returned relevant documents: Nvr
- The questioner finds them to be really relevant, as they fulfill his/her information needs
- Number of all relevant documents in the system: Nr
- Very hard to guess for large and unknown collections
25 Precision and Recall
- Two TRSs can (and do) return two different results for the same query, which can be partly or completely unique.
- How to compare the quality of those systems?
26 Precision and Recall
- Two questioners can consider different documents to be relevant for the same formulated query
- How to meet both subjective expectations of the questioners?
27 Precision and Recall
- The quality of the result set of documents is usually evaluated according to the numbers Nv, Nr, Nvr
- Precision
- P = Nvr / Nv
- Probability that a returned document is relevant to the user
- Recall
- R = Nvr / Nr
- Probability that a relevant document is returned to the user
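For illustration, a minimal Python sketch of both measures (the helper name and the set representation are mine, not the lecture's):

    def precision_recall(returned, relevant):
        """P = Nvr / Nv and R = Nvr / Nr for a returned set and the set of
        all relevant documents (both as sets of document ids)."""
        nvr = len(returned & relevant)                # returned and really relevant
        p = nvr / len(returned) if returned else 0.0
        r = nvr / len(relevant) if relevant else 0.0
        return p, r

    # 3 returned documents, 2 of them among the 4 relevant ones
    print(precision_recall({1, 2, 3}, {2, 3, 5, 8}))  # -> (0.666..., 0.5)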
28 Precision and Recall
- Both coefficients depend on the feelings of the questioner
- The same document can fulfill the information needs of the first questioner while at the same time failing to meet them for the second one.
- Each user determines different values of the Nr and Nvr coefficients
- Both measures P and R depend on them
29 Precision and Recall
- In the optimal case
- P = R = 1
- All and only relevant documents are in the response of the system
- Usually
- The answer in the first iteration is neither precise nor complete
(Diagram: P-R plane with the optimal answer at P = R = 1 and a typical initial answer near the origin.)
30 Precision and Recall
- Query tuning
- Iterative modification of the query, targeted at increasing the quality of the response
- Theoretically it is possible to reach the optimum sooner or later
(Diagram: P-R plane; a sequence of tuned queries approaches the optimum at P = R = 1.)
31 Precision and Recall
- Due to (not only) ambiguities, both measures indirectly depend on each other, i.e. P·R ≈ const < 1
- In order to increase P, the absolute number of relevant documents in the response is decreased.
- In order to increase R, the number of irrelevant documents grows rapidly.
- The probability of reaching a quality above the limit is low.
(Diagram: P-R plane; answers lie below the hyperbola P·R = const, the optimum beyond it.)
32 Prediction Criterion
- At the time of query formulation the questioner has to guess the correct terms (words) the author used to express a given idea
- Problems are caused e.g. by
- Synonyms (the author could have used a different synonym, not remembered by the user)
- Overlapping meanings of terms
- Colorful poetical hyperboles
33 Prediction Criterion
- The problem can be partly suppressed by the inclusion of a thesaurus, containing
- Hierarchies of terms and their meanings
- Sets of synonyms
- Definitions of associations between terms
- The questioner can use it during query formulation
- The system can use it during query evaluation
34 Prediction Criterion
- The user often tends to tune his/her own query in a conservative way
- He/she tends to fix the terms used in the first iteration ("they must be the best because I remembered them immediately") and vary only additional terms at the end of the query
- It is useful to support the user in (semi)automatically eliminating wrong terms and replacing them with useful ones that describe really relevant documents
35 Maximal Criterion
- The questioner is usually not able or willing to go through an exhaustive number of hits in the response to find the relevant ones
- Usually at most 20-50 documents, depending on their length
- The need not only to sort out documents not matching the query, but to order the answer list by supposed relevancy in descending order: the supposedly best documents at the beginning
36 Maximal Criterion
- Due to the maximal criterion, the user usually tries to increase the Precision of the answer
- A small number of resulting documents in the answer, containing as high a ratio of relevant documents as possible
- Some problematic domains require both high precision and recall
- Lawyers, especially in territories having case law based on precedents (the need to find as many similar cases as possible)
37 Exact pattern matching
38 Why Search for Patterns in Text
- In order to index documents or queries
- To involve only a given set of terms (lemmas)
- To omit a given set of meaningless terms (lemmas) such as conjunctions, numerals, pronouns, ...
- To highlight given terms in documents presented to users
39 Algorithm classification by preprocessing
- I - Brute-force algorithm
- II - Others (suitable for TRS)
- Further divided according to
- Number of simultaneously matched patterns
- 1, N, ∞
- Direction of comparison
- Left to right
- Right to left
40 Class II Algorithms
41 Exact Pattern Matching
- Searching for One Pattern Within Text
42 Brute-force Algorithm
- Let m denote the length of the text t, let n denote the length of the pattern p.
- If the i-th position in the text doesn't match the j-th position in the pattern
- Shift the pattern one position to the right, restart the comparison at the first (leftmost) position of the pattern
- Worst-case time complexity O(m·n), e.g. in the search for a^(n-1)b in a^(m-1)b
- For natural-language text/pattern m·const operations, i.e. O(m); const is a small number (< 10), dependent on the language
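A minimal Python sketch of the brute-force search under these definitions (0-based indices instead of the slides' 1-based ones):

    def brute_force(t, p):
        """Try every alignment of p in t; restart from the pattern start
        after each mismatch. Worst case O(m*n) comparisons."""
        hits = []
        for i in range(len(t) - len(p) + 1):
            j = 0
            while j < len(p) and t[i + j] == p[j]:
                j += 1
            if j == len(p):
                hits.append(i)
        return hits

    print(brute_force("aaab" * 3, "aab"))   # -> [1, 5, 9]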
43 Knuth-Morris-Pratt Algorithm
- Left-to-right searching for one pattern
- In comparison with the brute-force algorithm, KMP eliminates the repeated comparison of already successfully compared characters of the text
- The pattern is shifted as little as possible, to align a proper prefix of the examined part of the pattern below the equal fragment of the text
44 KMP Algorithm
45 KMP Algorithm
- In front of the mismatch position lies a proper prefix of the already examined part of the pattern
- It has to be equal to a suffix of the already examined part of the pattern
- The longest such prefix determines the smallest shift
46 KMP Algorithm
47 KMP Algorithm
- If
- the j-th position of the pattern p doesn't match the i-th position of the text t,
- the longest proper prefix of the already examined part of the pattern that equals a suffix of the already examined part has length k,
- then
- After the shift, k characters remain before the mismatch position
- The comparison restarts from the (k+1)-st position of the pattern
- Restart positions are pre-computed and stored in an auxiliary array A
- In this case A[j] = k+1
48 KMP Algorithm
begin {KMP}
  m := length(t); n := length(p); i := 1; j := 1;
  while (i <= m) and (j <= n) do begin
    while (j > 0) and (p[j] <> t[i]) do j := A[j];
    inc(i); inc(j);
  end; {while}
  if (j > n) then {pattern found at position i-j+1}
  else {pattern not found}
end; {KMP}
49 Obtaining the array A for KMP search
- A[1] = 0
- If all values are known for positions 1 .. j-1, it is easy to compute the correct value for the j-th position
- Let A[j-1] contain the correction for the (j-1)-st position, i.e., A[j-1]-1 characters at the beginning of the pattern are the same as the equivalent number of characters before the (j-1)-st position
50 Obtaining the array A for KMP search
51 Obtaining the array A for KMP search
- If the (j-1)-st position of the pattern matches the A[j-1]-th position, the prefix can be prolonged, and so the correct value of A[j] is one higher than the previous value.
52 Obtaining the array A for KMP search
53 Obtaining the array A for KMP search
- If the (j-1)-st and the A[j-1]-th positions in the pattern don't match, the correction A[j-1]+1 would cause a mismatch at the previous position in the text
- The correction for such a mismatch is already known (the numbers A[1] .. A[j-1] are already computed)
54 Obtaining the array A for KMP search
- It is necessary to follow the corrections starting at the (j-1)-st position until the (j-1)-st position in the pattern matches the position found in the target, or the correction reaches 0 (out of the pattern)
55 Obtaining the array A for KMP search
56 Obtaining the array A for KMP search
57 Obtaining the array A for KMP search - algorithm
begin
  A[1] := 0; n := length(p); j := 2;
  while (j <= n) do begin
    k := j-1; l := k;
    repeat l := A[l] until (l = 0) or (p[l] = p[k]);
    A[j] := l+1;
    inc(j);
  end;
end;
58 KMP Algorithm
- The time complexity of KMP is O(m+n).
- Already successfully compared positions in the text are never checked again
- After each shift of the pattern the given mismatch position can be checked again, but there are at most O(m) shifts of the pattern.
- Similarly, the time complexity of the preprocessing is O(n).
59 KMP Optimization
- It is possible to further optimize the auxiliary array A
- If the character p[j] equals p[A[j]], the same character as the one that caused the mismatch would be aligned to the mismatch position.
- In this case the optimization can be computed in advance in another auxiliary array A', where A'[j] =def A'[A[j]]
- Else A'[j] =def A[j]
- The array A' can be used instead of A during the search phase
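A minimal Python sketch of both the plain and the optimized auxiliary array and of the search phase (0-based indices; fail[j] corresponds to the slides' A[j+1]-1, with -1 playing the role of the slides' 0):

    def build_fail(p):
        """fail[j] = pattern index to retry after a mismatch at index j;
        fail[0] = -1 means 'advance in the text'."""
        fail = [-1] + [0] * len(p)
        k = -1
        for j in range(len(p)):
            while k >= 0 and p[k] != p[j]:
                k = fail[k]
            k += 1
            fail[j + 1] = k
        return fail

    def optimize_fail(p, fail):
        """Slide 59: skip retries that would compare the same character again."""
        opt = fail[:]
        for j in range(len(p)):
            if fail[j] >= 0 and p[j] == p[fail[j]]:
                opt[j] = opt[fail[j]]
        return opt

    def kmp_search(t, p):
        fail = optimize_fail(p, build_fail(p))
        hits, j = [], 0
        for i, c in enumerate(t):
            while j >= 0 and p[j] != c:
                j = fail[j]
            j += 1
            if j == len(p):
                hits.append(i - j + 1)   # occurrence found
                j = fail[j]
        return hits

    print(kmp_search("ababab", "abab"))   # -> [0, 2]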
60 Boyer-Moore Algorithm
- Right-to-left search of one pattern using pattern preprocessing
- The pattern is shifted left to right
- The characters of the pattern are compared from right to left
61 Boyer-Moore Algorithm
- If a mismatch of the (n-j)-th position of the pattern against the (i-j)-th position of the text occurs, where t[i-j] = x and
- n denotes the length of the pattern,
- i denotes the position of the end of the pattern in the text,
- j = 0..n-1,
- x ∈ X, X is the alphabet,
- the pattern is moved by SHIFT[n-j, x] characters to the right
- The comparison restarts at the end of the pattern, i.e. for j = 0
62 Boyer-Moore Algorithm
- There exist several different definitions of SHIFT[n-j, x]
- Variant 1: The auxiliary array SHIFT[0..n-1, X] is defined for each position in the pattern and for each character of the alphabet X as follows
- The smallest possible shift aligning the character x in the text at the mismatch position with the same character in the pattern
- If there is no such character x in the pattern to the left of the mismatch position, shift the pattern so that it starts immediately after the mismatch position.
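A Python sketch of Variant 1 (the names and the dictionary representation are mine; the restart-by-one after a full match is a simplification):

    def build_shift(p, alphabet):
        """SHIFT[(j, x)]: how far to move the pattern right when the text
        character x mismatches p[j] (0-based, comparison runs right to left)."""
        shift = {}
        for j in range(len(p)):
            for x in alphabet:
                # rightmost occurrence of x strictly left of position j
                k = max((i for i in range(j) if p[i] == x), default=-1)
                shift[(j, x)] = j - k    # k = -1 jumps past the mismatch position
        return shift

    def bm_search(t, p, alphabet):
        shift = build_shift(p, alphabet)
        n, hits = len(p), []
        i = n - 1                        # index of the pattern end in the text
        while i < len(t):
            j = 0                        # characters already matched from the right
            while j < n and p[n - 1 - j] == t[i - j]:
                j += 1
            if j == n:
                hits.append(i - n + 1)
                i += 1                   # simplest restart after a full match
            else:
                i += shift.get((n - 1 - j, t[i - j]), n - j)
        return hits

    print(bm_search("ANANASANANAS", "ANANAS", "ANSX"))   # -> [0, 6]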
63 Boyer-Moore Algorithm (1)
- Worst-case time complexity is O(m·n), e.g. for searching ba^(n-1) in a^(m-n)ba^(n-1)
- For huge alphabets and patterns with a small number of distinct characters (especially for words searched in natural-language texts) the average time complexity is O(m/n)
- i.e. the longer the pattern, the more efficient the search
64 Boyer-Moore Algorithm (1)
65 Boyer-Moore Algorithm (1)
- Representation of the SHIFT array for the pattern ANANAS
- Full arrows depict a successful comparison of one character
- Other arrows stand for the shift of the target character to the position of the starting character
- Missing arrows mean a shift past the mismatch position
66 Boyer-Moore Algorithm (1)
- Another representation. To save space, x ∈ {A, N, S, X}
- X stands for any character not appearing in the pattern
- Values marked as shifts represent the length of the shift
- The remaining values represent the new value of j
67 Benchmark on Artificial Text
68 Benchmark on Artificial Text
69 Benchmark on Artificial Text
70 Benchmark on English Text
- Note
- A unique pattern is found at its original position
71 Benchmark on English Text
72 Benchmark on English Text
73 Benchmark on English Text
74 Benchmark on English Text
75 Review of Algorithms
76 Exact pattern matching
- Searching for a finite set of patterns
77 Aho-Corasick Algorithm
- Left-to-right searching for several patterns simultaneously
- Extension of the KMP algorithm
- Preprocessing of the patterns
- Linear reading of the text
- Average time complexity O(m + Σ ni), where m denotes the length of the text and ni denotes the length of the i-th pattern
78 A-C Algorithm
- Text T
- Set of patterns P = {P1, P2, ..., Pk}
- Search engine S = (Q, X, q0, g, f, F)
- Q finite set of states
- X alphabet
- q0 ∈ Q initial state
- g: Q × X → Q (go) forward function
- f: Q → Q (fail) backward function
- F ⊆ Q set of final states
79 A-C Algorithm
- The states in the set Q correspond to all prefixes of all patterns
- State q0 represents the empty prefix ε
- g(q,x) = qx, iff qx ∈ Q
- Else g(q0,x) = q0
- Else g(q,x) is undefined
- f(q) for q <> q0 is equal to the longest proper suffix of q present in the set Q
- |f(q)| < |q|
- Final states correspond to all complete patterns, i.e. F = P
80 A-C Algorithm
- The search is based on the total (fully defined) transition function δ(q,x): Q × X → Q
- δ(q,x) = g(q,x), iff g(q,x) is defined
- δ(q,x) = δ(f(q),x) otherwise
- This definition is correct, because |f(q)| - the distance of f(q) from the initial state - is less than |q|, and g(q0,x) is completely defined.
81 A-C Algorithm
- f is constructed in order of increasing |q|, i.e. according to the distance of the state from the beginning
- It is not necessary to define f(q0)
- If |q| = 1, the longest proper suffix is empty, i.e. f(q) = q0
- f(qx) = f(g(q,x)) = δ(f(q),x)
- To determine the value of the fail function for the state qx, accessible from state q using character x, it is necessary to start in q, follow the fail function to f(q) and then go forward using the character x
82 A-C Algorithm
- Example: P = {he, her, she}, function g
83 A-C Algorithm
- Example: P = {he, her, she}, function f
84 A-C Algorithm
- Detection of all occurrences of patterns, even of patterns hidden inside other ones
- Either collect all patterns detected in a given state by going through all states accessible from it using the fail function, i.e. the final states among f^i(q), i > 0
- Or, after the transition to state q, go through all states linked together by the fail function and report all final states
85 A-C Algorithm delta function
function delta(q: state; x: alphabet): state;
begin
  while g[q, x] = fail do q := f[q];
  delta := g[q, x];
end; {delta}
begin {A-C}
  q := q0;
  for i := 1 to length(t) do begin
    q := delta(q, t[i]);
    report(q); {report all patterns ending at t[i]}
  end;
end; {A-C}
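A compact Python sketch of the whole automaton; dictionaries replace the slides' arrays, and the output sets implement the reporting of slide 84:

    from collections import deque

    def build_ac(patterns):
        """g: goto function as list of dicts, f: fail links, out: patterns per state."""
        g, out = [{}], [set()]
        for p in patterns:                       # goto function = prefix trie
            q = 0
            for x in p:
                if x not in g[q]:
                    g.append({}); out.append(set())
                    g[q][x] = len(g) - 1
                q = g[q][x]
            out[q].add(p)
        f = [0] * len(g)
        queue = deque(g[0].values())             # states in order of increasing |q|
        while queue:
            q = queue.popleft()
            for x, r in g[q].items():
                queue.append(r)
                s = f[q]                         # follow fails, then step forward on x
                while s and x not in g[s]:
                    s = f[s]
                f[r] = g[s][x] if x in g[s] else 0
                out[r] |= out[f[r]]              # patterns hidden inside longer ones
        return g, f, out

    def ac_search(t, patterns):
        g, f, out = build_ac(patterns)
        q = 0
        for i, x in enumerate(t):
            while q and x not in g[q]:
                q = f[q]
            q = g[q].get(x, 0)
            for p in out[q]:
                yield (i - len(p) + 1, p)        # (start position, pattern)

    print(sorted(ac_search("ushers", ["he", "her", "she"])))
    # -> [(1, 'she'), (2, 'he'), (2, 'her')]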
86 KMP vs. A-C for 1 pattern
- Equal algorithms, different formulations
- KMP: j (= compared position) corresponds to A-C: state q_(j-1) (= number of compared positions)
- KMP: A[1] = 0 corresponds to A-C: g(q0,x) = q0
- KMP: A[j] = k corresponds to A-C: f(q_(j-1)) = q_(k-1)
87 Commentz-Walter Algorithm
- Right-to-left search for several patterns simultaneously
- Combination of the B-M and A-C algorithms
- Average time complexity (for natural languages) O(m / min(ni)), where m denotes the length of the text and ni denotes the length of the i-th pattern
88 C-W Algorithm
- Text T
- Set of patterns P = {P1, P2, ..., Pk}
- Search engine S = (Q, X, q0, g, f, F)
- Q finite set of states
- X alphabet
- q0 ∈ Q initial state
- g: Q × X → Q (go) forward function
- f: Q → Q (fail) backward function
- F ⊆ Q set of final states
89 C-W Algorithm
- The states in the set Q represent all suffixes of all patterns
- State q0 represents the empty suffix ε
- g(q,x) = xq, iff xq ∈ Q
- f(q), where q <> q0, is equal to the longest proper prefix of q present in the set Q
- |f(q)| < |q|
- Final states correspond to all complete patterns, i.e. F = P
90 C-W Algorithm
91 C-W Algorithm
- Backward function (arrows going to q0 are not shown)
(Diagram: the suffix trie for P = {he, her, she} with states e, he, she, r, er, her; backward arrows her → he and er → e.)
92 C-W Algorithm
- LMIN = min(ni), the length of the shortest pattern
- h(q) = |q|, the distance of the state q from the initial state
- char(x) = the minimal distance of a state reachable via the character x
- pred(q) = the predecessor of state q, i.e. the state representing the one-character-shorter suffix
- If g(q,x) is not defined, the patterns (search engine) are shifted by shift(q,x) positions to the right and the search restarts in the state q0 again
- shift(q,x) = min( max( shift1(q,x), shift2(q) ), shift3(q) )
93 C-W Algorithm
- shift1(q,x) = char(x) - h(q) - 1, if > 0
- shift2(q) = min( {LMIN} ∪ { h(q') - h(q) : f(q') = q } )
- shift3(q0) = LMIN
- shift3(q) = min( {shift3(pred(q))} ∪ { h(q') - h(q) : f^k(q') = q ∧ q' ∈ F } )
94 C-W Algorithm
- shift1(q,x): aligning the collision character; char(y) - h(kolo) - 1 = 8 - 4 - 1 = 3
95 C-W Algorithm
- shift2(q): aligning the checked part of the text; states whose fail function points to q must be taken into account (in the example, shift2 = 1)
96 C-W Algorithm
- shift3(q): aligning (any) suffix of the checked text; the collision character need not be used again to find a match
97 Exact Pattern Matching
- Searching for a (Regular) Infinite Set of Patterns in Text
98 Regular expressions and languages
- Regular expression R
- Atomic expressions
- ∅
- ε
- a, a ∈ X
- Operations
- U.V concatenation
- U+V union
- V^k = V.V. ... .V (k times)
- V* = V^0 + V^1 + V^2 + ...
- V+ = V^1 + V^2 + V^3 + ...
- Value of the expression h(R)
- h(∅) = ∅, the empty language
- h(ε) = {ε}, the empty word only
- h(a) = {a}, a ∈ X
- h(U.V) = { u.v | u ∈ h(U) ∧ v ∈ h(V) }
- h(U+V) = h(U) ∪ h(V)
99 Regular Expression Features
- 1) U+(V+W) = (U+V)+W
- 2) U.(V.W) = (U.V).W
- 3) U+V = V+U
- 4) (U+V).W = (U.W)+(V.W)
- 5) U.(V+W) = (U.V)+(U.W)
- 6) U+U = U
- 7) ε.U = U
- 8) ∅.U = ∅
- 9) U+∅ = U
- 10) U* = ε + U.U* = (ε + U)*
100 (Deterministic) Finite Automaton
- K = (Q, X, q0, δ, F)
- Q is a finite set of states
- X is an alphabet
- q0 ∈ Q is the initial state
- δ: Q × X → Q is a totally defined transition function
- F ⊆ Q is a set of final states
101 (Deterministic) Finite Automaton
- Configuration of the FA
- (q,w) ∈ Q × X*
- Transition of the FA
- relation ⊢ ⊆ (Q × X*) × (Q × X*)
- (q,aw) ⊢ (q',w) ⟺ δ(q,a) = q'
- The automaton accepts the word w iff (q0, w) ⊢* (q,ε), q ∈ F
102 Non-deterministic Finite Automaton
- a) default def. K = (Q, X, q0, δ, F); b) extended def. K = (Q, X, S, δ, F)
- Q is a finite set of internal states
- X is an alphabet
- q0 ∈ Q is the initial state / S ⊆ Q is (alternatively) a set of initial states
- δ: Q × X → P(Q) is a transition function
- F ⊆ Q is a set of final states
103 Non-deterministic Finite Automaton
- NFA for P = {he, her, she}
- S = {1, 4, 8}
- F = {3, 7, 11}
104NDFA?DFA Conversion
- K(Q, X, S, ?, F)
- K(Q, X, q0, ?, F)
- Q P(Q)
- X
- q0 S
- ?( q, x) ??(q, x), q?q
- F q?Q?q?F??
105 NFA→DFA Conversion: Set of Initial States Allowed
- By table, only reachable states (transitions to state 1 not shown)
106 NFA→DFA Conversion: Only One Initial State Allowed
- By table, only reachable states (transitions to state 1 not shown)
107 Derivation of regular expression
- The derivative of a regular expression V by a character x, denoted dV/dx, defines the language of all words w such that xw ∈ h(V)
- I.e., if h(V) = L, then h(dV/dx) = { w | xw ∈ L }
108 Derivation of regular expression
109 Construction of a DFA Using Derivations of RE
- Derivation of regular expressions allows one to directly and algorithmically build a DFA for any regular expression
- Let V be a given regular expression over the alphabet X
- Each state of the DFA defines a set of words that move the DFA from this state to any of the final states. So, every state can be associated with a regular expression defining this set of words
- q0 = V
- δ(q,x) = dq/dx
- F = { q ∈ Q | ε ∈ h(q) }
110 Construction of a DFA Using Derivations of RE
- V = (0+1)*.01 in the alphabet X = {0,1}
- q0 = (0+1)*.01
- d q0 / d0 = (0+1)*.01 + 1
- d q0 / d1 = (0+1)*.01
111 Construction of a DFA Using Derivations of RE
- V = (0+1)*.01 in the alphabet X = {0,1}
- q0 = (0+1)*.01
- F = { (0+1)*.01 + ε }
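A small Python sketch of derivative-based matching (Brzozowski derivatives; the minimal AST below is my own and needs Python 3.10+ for match statements). Acceptance tests whether ε belongs to the derived expression - exactly the condition that defines F above:

    from dataclasses import dataclass

    class RE: pass

    @dataclass(frozen=True)
    class Empty(RE): pass                     # the empty language

    @dataclass(frozen=True)
    class Eps(RE): pass                       # the empty word only

    @dataclass(frozen=True)
    class Chr(RE):
        c: str

    @dataclass(frozen=True)
    class Cat(RE):                            # U.V
        l: RE
        r: RE

    @dataclass(frozen=True)
    class Alt(RE):                            # U+V
        l: RE
        r: RE

    @dataclass(frozen=True)
    class Star(RE):                           # V*
        e: RE

    def nullable(e):                          # does epsilon belong to h(e)?
        match e:
            case Eps() | Star(): return True
            case Empty() | Chr(): return False
            case Cat(l, r): return nullable(l) and nullable(r)
            case Alt(l, r): return nullable(l) or nullable(r)

    def deriv(e, x):                          # de/dx
        match e:
            case Empty() | Eps(): return Empty()
            case Chr(c): return Eps() if c == x else Empty()
            case Alt(l, r): return Alt(deriv(l, x), deriv(r, x))
            case Star(f): return Cat(deriv(f, x), e)
            case Cat(l, r):
                d = Cat(deriv(l, x), r)
                return Alt(d, deriv(r, x)) if nullable(l) else d

    def accepts(e, w):                        # repeated derivation + final-state test
        for x in w:
            e = deriv(e, x)
        return nullable(e)

    # V = (0+1)*.01
    V = Cat(Star(Alt(Chr('0'), Chr('1'))), Cat(Chr('0'), Chr('1')))
    print(accepts(V, "1101"), accepts(V, "110"))   # -> True False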
112 Document Models
- Different variants of models
- Take the (non)existence of terms in documents into account, or not
- Take the frequencies of terms in documents into account, or not
- Take the positions of terms in documents into account, or not
113 Document Models in TRSs
114 Boolean Model of TRS
- Mid-20th century
- Adoption of procedures used in librarianship and their gradual implementation
115 Boolean Model of TRS
- Database (collection) D containing n documents
- D = {d1, d2, ..., dn}
- Documents are described using m terms
- T = {t1, t2, ..., tm}
- term tj: a descriptor, usually a word or collocation
- Each document is represented as a subset of the available terms
- Contained in the document
- Describing the content of the document as well as possible
- di ⊆ T
116 Boolean Model of TRS
- Assigning a set of terms to a document can be achieved by different approaches
- Subdivision according to the author
- Manual
- Done by a human indexer who understands the content of the document
- Non-consistent. Several indexers need not produce the same set of terms. One indexer might later produce a different set of terms than before.
- Automatic
- Done algorithmically
- Consistent, but without text understanding
- Subdivision according to the freedom in selecting descriptors
- Controlled
- The set of terms is defined in advance and the indexer cannot change it; he/she can only select the terms describing the given document as well as possible.
- Non-controlled
- The set of terms can be extended whenever a new document is inserted into the collection.
117 Indexation
- Thesaurus
- Internally structured set of terms
- Synonyms with a defined preferred term
- Hierarchies of semantically narrower/broader terms
- Similar terms
- ...
- Stop-list
- Set of non-significant terms that are meaningless for indexation
- Pronouns, interjections, ...
118 Indexation
- Common words are not suitable for document identification
- Too specific words as well: a lot of different terms appear in a very small number of docs
- Their elimination decreases the size of the index significantly, and its quality only slightly
119 Boolean Model of TRS
- A query is represented by a Boolean expression
- ta AND tb: the document has to contain/be described by both terms
- ta OR tb: the document has to contain/be described by at least one term
- NOT t: the document must not contain/be described by the given term
120 Boolean Model of TRS
- Query examples
- searching AND information
- encoding OR decoding
- processing AND (document OR text)
- computer AND NOT personal
121 Boolean Model of TRS Extensions
- Collocations in queries
- "searching for information"
- "data encoding" OR "data decoding"
- "text processing"
- computer AND NOT "personal computer"
122 Boolean Model of TRS Extensions
- Use of factual meta-data (attribute values)
- database AND (author = Salton)
- text processing AND (year_of_publishing > 1990)
123 Boolean Model of TRS Extensions
- Wildcards in terms
- datab* AND system*
- stands for the terms database, databases, system, systems, etc.
- portabl? AND computer*
- stands for the terms portable, computer, computers, computerized, etc.
124 Boolean Index Structure
- Inverted file
- It holds a list of identified documents for each term (instead of a set of terms for each document)
- t1: d1,1, d1,2, ..., d1,k1
- t2: d2,1, d2,2, ..., d2,k2
- ...
- tm: dm,1, dm,2, ..., dm,km
125 Boolean Index Structure
- One-by-one processing of the inserted documents produces a sequence of couples <doc_id, term_id>, sorted by the first component, i.e. by doc_id
- Next, the sequence is reordered lexicographically by (term_id, doc_id) and duplicates are removed
- The result can be further optimized by adding a directory pointing to the sections corresponding to individual terms, and removing the term_ids from the sequence
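A small Python sketch of this construction (a plain dictionary replaces the directory-plus-sequence layout):

    def build_inverted_index(docs):
        """docs: {doc_id: iterable of term_ids}. Returns term -> sorted doc list."""
        # sequence of couples <doc_id, term_id>, in insertion order
        pairs = [(doc, term) for doc, terms in docs.items() for term in terms]
        # reorder lexicographically by (term_id, doc_id), dropping duplicates
        postings = sorted(set((term, doc) for doc, term in pairs))
        index = {}
        for term, doc in postings:
            index.setdefault(term, []).append(doc)
        return index

    print(build_inverted_index({1: ["text", "index"], 2: ["index", "query", "index"]}))
    # -> {'index': [1, 2], 'query': [2], 'text': [1]}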
126 Lemmatization and Disambiguation of the Czech Language (ÚFAL)
- Odpovědným zástupcem nemůže být každý. (Not everyone can be a responsible representative.)
- Zákon by měl zajistit individualizaci odpovědnosti a zajištění odbornosti. (The law should ensure the individualization of responsibility and a guarantee of expertise.)
- <p n=1><s id="docID001-p1s1"><f cap>Odpovědným<MDl>odpovědný_(kdo_za_něco_odpovídá)<MDt>AAIS7----1A----<f>zástupcem<MDl>zástupce<MDt>NNMS7-----A----<f>nemůže<MDl>moci_(mít_možnost_něco_dělat)<MDt>VB-S---3P-NA---<f>být<MDl>být<MDt>Vf--------A----<f>každý<MDl>každý<MDt>AAIS1----1A----
- <p n=2>
Legend: paragraph Nr., sentence Nr., word in the document, lemma including its meaning, morphological tag (part of speech, etc.)
127 Proximity Constraints
- t1 (m,n) t2
- the most general form
- term t2 can appear at most m words after t1, or term t1 can appear at most n words after t2
- t1 sentence t2
- the terms have to appear in the same sentence
- t1 paragraph t2
- the terms have to appear in the same paragraph
128 Proximity Constraints Evaluation
- a) Using the same index structure
- Operators replaced by conjunctions
- Query evaluation to find candidates
- Check for the co-occurrences in the primary texts
- Small index
- Longer time needed for evaluation
- Necessity of storing the primary documents
- b) Extension of the index by the positions of term occurrences in documents
- Large index
129 Extended Index Structure
- During indexation a sequence of 5-tuples <doc_id, term_id, para_nr, sent_nr, word_nr> is built, ordered by (doc_id, para_nr, sent_nr, word_nr)
- The sequence is reordered by (term_id, doc_id, para_nr, sent_nr, word_nr)
- No duplicates are removed
130 Thesaurus Utilization
- BT(x) - Broader Term of term x
- NT(x) - Narrower Terms
- PT(x) - Preferred Term
- SYN(x) - SYNonyms of term x
- RT(x) - Related Terms
- TT(x) - Top Term
131 Disadvantages of Boolean Model
- Salton:
- Query formulation is more an art than a science.
- Hits cannot be rated by their quality.
- All terms in the query are taken as equally important.
- The output size cannot be controlled. The system frequently produces empty or very large answers.
- Some results don't correspond to intuitive understanding.
- Documents in the answer to a disjunctive query can contain only one of the mentioned terms as well as all of them.
- Documents eliminated from the answer to a conjunctive query can miss one of the mentioned terms as well as all of them.
132 Partial Answer Ordering
- Q = (t1 OR t2) AND (t2 OR t3) AND t4
- conversion to an equivalent DNF
- Q = (t1 AND t2 AND t3 AND t4)
- OR (t1 AND t2 AND NOT t3 AND t4)
- OR (t1 AND NOT t2 AND t3 AND t4)
- OR (NOT t1 AND t2 AND t3 AND t4)
- OR (NOT t1 AND t2 AND NOT t3 AND t4)
133 Partial Answer Ordering
- Each elementary conjunction (EC) contains all terms used in the original query and is rated by the number of terms used in a positive way (without NOT)
- All ECs differ from one another in at least one term (one contains tj, another contains NOT tj)
- Every document corresponds to at most one EC
- A document is then rated by the number assigned to the given EC.
134 Partial Answer Ordering
- There exist up to 2^k ECs in the case of a query using k terms
- There exist only k+1 different ratings
- Several ECs can have the same rating
- (ta OR tb) = (ta AND tb) [rating 2] OR (ta AND NOT tb) [rating 1] OR (NOT ta AND tb) [rating 1]
135 Vector Space Model of TRS
- 1970s
- approx. 20 years younger than the Boolean model of TRS
- Tries to minimize and/or eliminate the disadvantages of the Boolean model
136 Vector Space Model of TRS
- Database D containing n documents
- D = {d1, d2, ..., dn}
- Documents are described by m terms
- T = {t1, t2, ..., tm}
- term tj: a word or collocation
- Document representation using an m-dimensional vector of term weights
- di = (wi,1, wi,2, ..., wi,m)
137 Vector Space Model of TRS
- Document model
- di = (wi,1, wi,2, ..., wi,m)
- wi,j: the level of importance of the j-th term for identifying/describing the i-th document
- Query
- q = (q1, q2, ..., qm)
- qj: the level of importance of the j-th term for the user
138 Vector Space Model Index
139 Vector Space Model of TRS
- The similarity between the vectors representing a document and a query is in general defined by a similarity function Sim(di, q)
140 Similarity Functions
- The factor wi,j·qj is proportional both to the level of importance in the document and for the user
- Orthogonal vectors have zero similarity
- The base vectors of the vector space (individual terms) are mutually orthogonal and so have zero similarity
141 Vector Space Model of TRS
- Not only the angle, but also the sizes of the vectors influence the similarity
- Longer vectors, which tend to be assigned to longer texts, have an advantage over shorter ones
- It is desirable to normalize all vectors to unit size
142 Vector normalization
- Elimination of the influence of the vector length: w'i,j = wi,j / sqrt(Σk wi,k²)
143 Vector normalization
- At the time of indexation
- No overhead at the time of searching
- Sometimes it is necessary to re-compute all vectors, in case the vectors also reflect aspects dependent on the complete collection
- At the time of search
- Part of the similarity function definition
- Slows down the response of the system
144 Output Size Control
- Documents in the output list are ordered by descending similarity to the given query
- Most similar documents at the beginning of the list
- The list size can be easily restricted with respect to the maximal criterion
- The maximal number of documents in the list can be restricted
- Only documents reaching a threshold similarity are shown in the result
145 Negation in Vector Space Model
- It is possible to extend the query space to allow negative weights qj < 0
- Then the contribution qj·wi,j of the j-th dimension can be negative
- Documents that contain the j-th term are suppressed in comparison with the others
146 Scalar product
147 Cosine Measure (Salton)
148 Jaccard Measure
149 Dice Measure
150 Overlap Measure
151 Asymmetric Measure
152 Pseudo-Cosine Measure
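The formulas of slides 146-152 are images in the original slides. The measures are standard, so the sketch below shows their usual definitions (assumed to match the lecture's):

    import math

    def dot(d, q):
        return sum(x * y for x, y in zip(d, q))

    def cosine(d, q):                  # Salton's cosine measure
        return dot(d, q) / (math.sqrt(dot(d, d)) * math.sqrt(dot(q, q)))

    def jaccard(d, q):
        return dot(d, q) / (dot(d, d) + dot(q, q) - dot(d, q))

    def dice(d, q):
        return 2 * dot(d, q) / (dot(d, d) + dot(q, q))

    def overlap(d, q):
        return dot(d, q) / min(dot(d, d), dot(q, q))

    def asymmetric(d, q):
        return dot(d, q) / dot(d, d)   # normalized only by the document

    def pseudo_cosine(d, q):
        return dot(d, q) / (sum(map(abs, d)) * sum(map(abs, q)))

    d, q = [1.0, 0.5, 0.0], [0.5, 1.0, 0.0]
    print(cosine(d, q), jaccard(d, q), dice(d, q))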
153 Vector Space Model Indexation
- Based on the number of occurrences of a given term in a given document
- The more often a given word occurs in a given document, the more important it is for its identification
- Term Frequency: TFi,j = term_occurrences / all_occurrences
154 Vector Space Model Indexation
- Without a stop-list, the result contains almost only meaningless words at the beginning
155 Vector Space Model Indexation
- Term frequencies are very small even for the most frequent terms
- Normalized term frequency: NTFi,j = (1 + TFi,j / maxk TFi,k) / 2 if the term occurs in the document, 0 otherwise
156 Vector Space Model Indexation
- Differentiation of important terms from non-important ones
157 Vector Space Model Indexation
- IDF (Inverted Document Frequency) reflects the importance of a given term in the index for the complete collection
- IDFj = log2(n / DFj), where DFj is the number of documents containing the j-th term
- Entropy of the probability that the term occurs in a randomly chosen document
158 Vector Space Model Indexation
- (Optional) normalization of the document vectors to unit size
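A minimal sketch combining slides 153-158; the exact normalized-TF and IDF formulas are images in the original, so the standard forms given above are assumed:

    import math

    def tfidf_index(docs):
        """docs: list of token lists. Returns one weight dict per document:
        weight = normalized TF (slide 155) * IDF = log2(n/DF) (slide 157),
        finally normalized to unit length (slide 158)."""
        n = len(docs)
        df = {}
        for terms in docs:
            for t in set(terms):
                df[t] = df.get(t, 0) + 1
        vectors = []
        for terms in docs:
            tf = {}
            for t in terms:
                tf[t] = tf.get(t, 0) + 1
            max_tf = max(tf.values())
            w = {t: (0.5 + 0.5 * c / max_tf) * math.log2(n / df[t])
                 for t, c in tf.items()}
            norm = math.sqrt(sum(x * x for x in w.values())) or 1.0
            vectors.append({t: x / norm for t, x in w.items()})
        return vectors

    docs = [["text", "retrieval", "text"], ["boolean", "retrieval"], ["vector", "model"]]
    print(tfidf_index(docs))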
159 Querying in Vector Space Model
- The equal representation of documents and queries brings many advantages over the Boolean Model
- A query can be defined
- Directly, by its hand-made definition
- By a reference to a known indexed document
- By a reference to a non-indexed document; the indexer creates an ad-hoc vector from its primary text
- By a text fragment (using copy-paste etc.)
- By a combination of the above-mentioned ways
160 Feedback
- Query building/tuning based on the user's feedback to previous answers
- Adding terms identifying relevant documents
- Elimination of terms unimportant for relevant document identification and important for irrelevant ones
- Prediction criterion improvement
161 Feedback
- The answer to the previous query is classified by the user, who can mark relevant and/or irrelevant documents
162 Positive Feedback
- Relevant documents attract the query towards them
163 Negative Feedback
- Irrelevant documents push the query away from them
- Less effective than positive feedback
- Less used
164 Feedback
- The query iteratively migrates towards the center of the relevant documents
165 Feedback
- General form
- One of the used special forms: the centroid (centre of gravity) of the marked documents
166 Feedback
- General form
- Another used (weighted) form, weighting the documents in the ratio (1-β) / β
- Weighted centroid (centre of gravity)
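The formulas of slides 165-166 are images in the original; the standard Rocchio form they usually take is sketched below (the coefficient names alpha, beta, gamma are assumptions, not the lecture's notation):

    def rocchio(q, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
        """q' = alpha*q + beta*centroid(relevant) - gamma*centroid(irrelevant).
        All vectors are equal-length lists of term weights."""
        def centroid(docs):
            if not docs:
                return [0.0] * len(q)
            return [sum(col) / len(docs) for col in zip(*docs)]
        cr, cn = centroid(relevant), centroid(irrelevant)
        return [alpha * qi + beta * ri - gamma * ni
                for qi, ri, ni in zip(q, cr, cn)]

    q1 = rocchio([1.0, 0.0, 0.0], relevant=[[0.5, 0.5, 0.0]], irrelevant=[[0.0, 0.0, 1.0]])
    print(q1)   # -> [1.375, 0.375, -0.15]; negative weights suppress unwanted terms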
167 Term Equivalence in VS Model
- Individual terms (dimensions of the space) are supposedly, but not really, mutually independent
- Problem with the prediction criterion: inappropriately chosen synonyms
168 Term Equivalence in VS Model
169 Term Similarity in VS Model
- Generalised equivalence
- Similarity matrix
- All computations used in the VS model can also be evaluated on the transposed index. Here the mutual similarity of terms can be evaluated (vector dimension n, not m)
- Really similar terms often co-occur together
- Common terms often co-occur together as well
170 Term Hierarchies in VS Model
- Similarly to the Boolean Model
(Diagram: term hierarchy with Publication above Print and Book, and Print above Papers and Magazine.)
171 Term Hierarchies in VS Model
- Similarly to the Boolean Model
- Edges can have assigned weights
- User weights then can be easily propagated
(Diagram: a user weight 0.8 on Publication propagates along edge weights 0.6 and 0.4 to Book = 0.48 and Print = 0.32, and further along 0.7 and 0.3 to Magazine = 0.224 and Papers = 0.096.)
172 Citations and VS Model
- Scientific publications cite their sources
- Assumption
- Cited documents are semantically similar
- Citing documents are semantically similar
173 Citations and VS Model
- Direct reference between documents A and B
- Document A cites document B
- Denoted A→B
- Indirect reference between A and B
- There exist C1, ..., Ck such that A→C1→...→Ck→B
- Link between documents A and B
- A→B or B→A
174 Citations and VS Model
- A and B are bibliographically paired if and only if they cite the same source C: A→C ∧ B→C
- A and B are co-cited if and only if they are both cited in some document C: C→A ∧ C→B
175 Citations and VS Model
- Acyclic oriented citation graph
- Adjacency matrix of the citation graph
- C = (cij) ∈ {0,1}^(n×n); cij = 1, iff i→j; cij = 0 otherwise
176 Citations and VS Model
- BP: matrix of bibliographic pairing
- bpij = the number of documents cited in both documents i and j
- It follows that bpii = the number of documents cited in i
- BP = C·C^T
177 Citations and VS Model
- CP: matrix of co-citation pairing
- cpij = the number of documents citing both i and j
- It follows that cpii = the number of documents citing i
- CP = C^T·C
178 Citations and VS Model
- DL: matrix of document links
- dlij = 1 ⟺ (cij = 1 ∨ cji = 1)
- It is possible to modify the resulting similarities between the documents and a given query using some of the matrices BP, CP, DL
- Modification of the index matrix D
- D' = BP·D, resp. D' = CP·D, resp. D' = DL·D
- D' = BP·CP·DL·D
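A small numpy sketch of slides 175-178 on a hypothetical three-document citation graph:

    import numpy as np

    # hypothetical citation graph: document 0 cites 1 and 2, document 1 cites 2
    C = np.array([[0, 1, 1],
                  [0, 0, 1],
                  [0, 0, 0]])

    BP = C @ C.T                        # bp[i,j] = sources cited by both i and j
    CP = C.T @ C                        # cp[i,j] = documents citing both i and j
    DL = ((C + C.T) > 0).astype(int)    # dl[i,j] = 1 iff i cites j or j cites i

    # modifying a (documents x terms) index matrix D, as on slide 178
    D = np.array([[1.0, 0.0],
                  [0.5, 0.5],
                  [0.0, 1.0]])
    D_bp = BP @ D    # each document inherits weights from bibliographically paired ones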
179 Using mutual document similarities in VS Model
- DS: matrix of mutual document similarities
- dsij = Sim(di, dj)
- The same idea as in the case of BP, CP, DL
- Modification of the index matrix D
- D' = DS·D
180 Term Discrimination Values
- The discrimination value defines the importance of the term (the dimension of the vector space) for distinguishing the individual documents stored in the collection
- By removal of the term from the index, i.e. by reduction of the index dimensionality, it can happen that
- The overall distance between documents decreases (the average similarity of document pairs increases)
- The overall distance between documents increases (the average similarity of document pairs decreases)
- In the latter case the presence of the dimension in the space is not needed (it is counter-productive)
181 Term Discrimination Values
(Diagram: example angles between document vectors: 45.0°, 35.3°, 0.0°, 45.0°.)
182 Term Discrimination Values
- Computation based on the average document similarity
- A more efficient variant uses the central document (centroid)
183 Term Discrimination Values
- The same value is computed for the space reduced by the k-th dimension
184 Term Discrimination Values
- The discrimination value is defined as the difference of both average values: DVk = AvgSimk - AvgSim
- Can be used instead of IDFk
- DVk > 0
- Important term, discriminating documents
- DVk defines the measure of importance
- DVk ≤ 0
- Unimportant term
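A minimal Python sketch of the centroid variant (slides 182-184); the exact formulas are images in the original, so the cosine-to-centroid form is assumed:

    import math

    def discrimination_value(vectors, k):
        """DV_k = average centroid similarity without dimension k minus the
        average with all dimensions; DV_k > 0 marks a discriminating term."""
        def avg_centroid_sim(vs):
            c = [sum(col) / len(vs) for col in zip(*vs)]
            def cos(a, b):
                num = sum(x * y for x, y in zip(a, b))
                den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
                return num / den if den else 0.0
            return sum(cos(c, v) for v in vs) / len(vs)
        reduced = [v[:k] + v[k + 1:] for v in vectors]
        return avg_centroid_sim(reduced) - avg_centroid_sim(vectors)

    docs = [[1.0, 0.0, 0.2], [0.0, 1.0, 0.2], [0.5, 0.5, 0.2]]
    print(discrimination_value(docs, 2))   # < 0: the shared dimension 2 does not discriminate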
185 Term Discrimination Values
(Plot: DV of terms depending on the number of documents where the term is present; the collection contains 7777 documents.)
186 Document clustering
- Kohonen maps
- C3M algorithm
- K-means algorithm
187 Document Clustering
- The response time of a VS-based TRS is directly proportional to the number of documents in the collection that must be compared with the query
- Clustering allows skipping a major part of the index during the search and comparing only the closest documents
188 Document Clustering
- Without clusters it is necessary to compare all documents, even if the minimal needed similarity is defined
189 Document Clustering
- Each cluster represents an m-dimensional sphere, defined by its center and radius
- If not, it is possible to approximate it this way during the computations
190 Document Clustering
- Having clusters, the query evaluation need not compare documents in clusters outside the area of the user's interest
191 Cluster types
- Clusters having the same volume
- Easy to create
- Some clusters can be almost empty, while others can contain a huge amount of documents
192 Cluster types
- Clusters having (approximately) the same number of documents
- Hard to create
- More effective in the case of non-uniformly distributed docs.
193 Cluster types
- Non-disjunctive clusters
- One document can belong to more than one cluster
- Sometimes weighted membership in fuzzy clusters.
194 Cluster types
- Disjunctive clusters
- A document can belong to exactly one cluster
195 Cluster types
- It is not possible to completely and disjointly cover the space using spheres
- It is possible to use convex polyhedra, where each document belongs to the closest center
196 Cluster types
- The clusters can then be approximated by a non-disjoint set of spheres, each defined by the center and the most distant belonging document
197 Query Evaluation With Clusters (I)
- Let a query q and a minimal required similarity s be given
- Note: similarity is computed by the scalar product; the vectors are normalized
- The index is split into k clusters (c1,r1), ..., (ck,rk)
- Note: the radii are angular
- Query radius: r = arccos(s), i.e. s = cos(r)
198 Query Evaluation With Clusters (I)
- The emptiness of the cluster's intersection with the query area is determined from the value arccos(Sim(q,ci)) - r - ri
- If this value ≤ 0, the documents in the cluster are compared
- If this value > 0, the documents cannot be in the result
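A minimal Python sketch of this pruning test (the names and the example geometry are mine):

    import math

    def candidate_clusters(sims, radii, r):
        """Slide 198: keep cluster i only if the angular gap
        arccos(Sim(q, c_i)) - r - r_i is <= 0, i.e. the query sphere of
        angular radius r = arccos(s) may intersect the cluster sphere."""
        keep = []
        for i, (sim, ri) in enumerate(zip(sims, radii)):
            if math.acos(sim) - r - ri <= 0:
                keep.append(i)
        return keep

    # cluster centers at 10, 50, 80 degrees from q; radii 20, 25, 5 degrees
    sims = [math.cos(math.radians(a)) for a in (10, 50, 80)]
    radii = [math.radians(a) for a in (20, 25, 5)]
    print(candidate_clusters(sims, radii, math.radians(30)))   # -> [0, 1]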
199 Query Evaluation With Clusters (II)
- Let a query q and a maximal number of required documents x be given.
- Again, the index is split into k clusters (c1,r1), ..., (ck,rk)
- No radius of the query is available
- Clusters are sorted in ascended order by
increasing distance of their center from the
query, i.e. according to arccos(Sim(q,ci)) - Better sorted by increasing distance of cluster
boundary from the query, i.e. according to
arccos(Sim(q,ci))-ri
201Query Evaluation With Clusters (II)
- Clusters are sorted in ascended order by
arccos(Sim(q,ci))-ri i.e. by increasing distance
of cluster boundary from the query q