Text-retrieval Systems

Transcript and Presenter's Notes

1
Text-retrieval Systems
  • NDBI010 Lecture Slides
  • KSI MFF UK
  • http://www.ms.mff.cuni.cz/kopecky/teaching/ndbi010/
  • Version 10.05.12.13.30.en

2
Literature (textbooks)
  • Introduction to Information Retrieval
  • Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze
  • Cambridge University Press, 2008
  • http://informationretrieval.org/
  • Dokumentografické informační systémy
  • Pokorný J., Snášel V., Kopecký M.
  • Nakladatelství Karolinum, UK Praha, 2005
  • Pokorný J., Snášel V., Húsek D.
  • Nakladatelství Karolinum, UK Praha, 1998
  • Textové informační systémy
  • Melichar B.
  • Vydavatelství ČVUT, Praha, 1997

3
Further links (books)
  • Computer Algorithms - String Pattern Matching
    Strategies,
  • Jun Ichi Aoe,
  • IEEE Computer Society Press 1994
  • Concept Decomposition for Large Sparse Text Data
    using Clustering
  • Inderjit S. Dhillon, Dharmendra S. Modha
  • IBM Almaden Research Center, 1999

4
Further links (articles)
  • The IGrid Index: Reversing the Dimensionality Curse For Similarity Indexing in High Dimensional Space
  • Charu C. Aggarwal, Philip S. Yu
  • IBM T. J. Watson Research Center
  • The Pyramid-Technique: Towards Breaking the Curse of Dimensionality
  • S. Berchtold, C. Böhm, H.-P. Kriegel
  • ACM SIGMOD Conference Proceedings, 1998

5
Further links (articles)
  • Affinity Rank: A New Scheme for Efficient Web Search
  • Yi Liu, Benyu Zhang, Zheng Chen, Michael R. Lyu, Wei-Ying Ma
  • 2004
  • Improving Web Search Results Using Affinity Graph
  • Benyu Zhang, Hua Li, Yi Liu, Lei Ji, Wensi Xi, Weiguo Fan, Zheng Chen, Wei-Ying Ma
  • Efficient Computation of PageRank
  • T. H. Haveliwala
  • Technical report, Stanford University, 1999

6
Further links (older)
  • Introduction to Modern Information Retrieval
  • Salton G., McGill M. J.
  • McGraw-Hill, 1981
  • Výběr informací v textových bázích dat
  • Pokorný J.
  • OVC ČVUT Praha, 1989

7
Introduction
  • Overview of the problem: informativeness measurement

8
Retrieval system origin
  • The 1950s
  • The gradual automation of the procedures used in libraries
  • Now a separate subfield of ISs
  • Factual IS
  • Processing of information having a defined internal structure (usually in the form of tables)
  • Bibliographic IS
  • Processing of information in the form of text written in natural language, without a strict internal structure.

9
Interaction with TRS
  1. Query formulation
  2. Comparison
  3. Hit-list obtaining
  4. Query tuning/reformulation
  5. Document request
  6. Document obtaining

10
TRS Structure
  • Document disclosure system
  • Returns secondary information
  • Author
  • Title
  • ...
  • Document delivery system
  • Need not be supported by the SW

(Diagram: I) the document disclosure system covers interaction steps 1-4, II) the document delivery system covers steps 5-6)
11
Query Evaluation
  • Direct comparison is time-consuming

12
Query Evaluation
  • A document model is used for the comparison
  • A lossy process, usually based on the presence of words in documents
  • Produces structured data suitable for efficient comparison

13
Query Evaluation
  • The query is processed to obtain the needed form
  • The processed query is compared against the index

14
Text preprocessing
  • Searching is more effective using a created (structured) model of the documents, but it can use only the information stored in the model, not in the documents.
  • The goal is to create a model preserving as much information from the original documents as possible.
  • Problem: a lot of ambiguity in text.
  • There still exist many unresolved tasks concerning document understanding.

15
Text understanding
  • Writer
  • Text = a sequence of words in natural language.
  • Each word stands for some idea/imagination of the writer.
  • Ideas represent real subjects, activities, etc.
  • The reader follows the text from left to right (not necessarily with exactly the same mappings)

...
16
Text understanding
  • Synonymy of words
  • Several words can have the same meaning for the writer
  • car = automobile
  • sick = ill

...
17
Text understanding
  • Homonymy of words
  • One word can have several different meanings
  • fluke = fish, anchor, ...
  • crown = currency, treetop, jewel, ...
  • class = year of studies, category in set theory, ...

...
18
Text understanding
  • Word meanings need not be exactly the same.
  • Hierarchical overlapping
  • animal > horse > stallion
  • Associativity among meanings
  • calculator – computer – processor

...
19
Text understanding
  • The mapping between subjects, ideas and words can depend on individual persons, both readers and writers.
  • Two people can assign a partly or completely different meaning to a given term.
  • Two people can imagine different things for the same word.
  • mother, room, ...
  • As a result, by reading the same text two different readers can obtain different information
  • Different from each other
  • Different from the author's intention

20
Text understanding
  • Homonymy and ambiguity grow with the transition from words/terms to sentences and bigger parts of the text.
  • Example of an English sentence with several grammatically correct meanings (in this case a human reader probably eliminates the nonsense meaning)
  • See Podivné fungování gramatiky, http://www.scienceworld.cz/sw.nsf/lingvistika
  • In the sentence "Time flies like an arrow" either "flies" (fly) or "like" can be chosen as the predicate, which produces two significantly different meanings.

21
Text preprocessing
  • Inclusion of linguistic analysis into the text
    processing can partially solve the problem
  • Disambiguation
  • Selection of correct meaning of the term in the
    sentence
  • According to grammar (Verb versus Noun etc.)
  • According to context (more complicated, can
    distinguish between two Verbs, two Nouns, etc).

22
Text preprocessing
  • Inclusion of linguistic analysis into the text
    processing can partially solve the problem
  • Lemmatization
  • After its proper meaning is found, assigns to each term/word in the text
  • The type of word, plural vs. singular, present tense vs. preterite, etc.
  • The base form (singular for Nouns, infinitive for Verbs, ...)
  • Information obtained by sentence analysis
    (subject, predicate, object, ...)

23
Text preprocessing
  • Other tasks that can be more or less solved are
  • Identification of collocations
  • World War Two, ...
  • Assigning Nouns to Pronouns used in the text (very complex and hard to solve, sometimes even for a human reader)

24
Precision and Recall
  • As a result of ambiguities there exists no
    optimal text retrieval system
  • After the answer to the query is obtained, the following values can be evaluated
  • Number of returned documents in the list, Nv
  • The system supposes them to be relevant/useful according to their match with the query
  • Number of returned relevant documents, Nvr
  • The questioner finds them to be really relevant, as they fulfill his/her information needs
  • Number of all relevant documents in the system, Nr
  • Very hard to guess for large and unknown collections

25
Precision and Recall
  • Two TRSs can (and do) return two different results for the same query, which can be partly or completely disjoint.
  • How to compare the quality of those systems?

26
Precision and Recall
  • Two questioners can consider different documents to be relevant for identically formulated queries
  • How to meet the subjective expectations of both questioners?

27
Precision and Recall
  • The quality of the result set of documents is usually evaluated using the numbers Nv, Nr, Nvr
  • Precision
  • P = Nvr / Nv
  • The probability that a returned document is relevant to the user
  • Recall
  • R = Nvr / Nr
  • The probability that a relevant document is returned to the user

28
Precision and Recall
  • Both coefficients depend on the feeling of the
    questioner
  • The same document can fulfill information needs
    of first questioner while at the same time fail
    to meet them for the second one.
  • Each user determines different values of the Nr and Nvr coefficients
  • Both measures P and R depend on them

29
Precision and Recall
  • In the optimal case
  • P = R = 1
  • The response of the system contains all relevant documents and only them
  • Usually
  • The answer in the first iteration is neither precise nor complete

(P–R diagram: the optimal answer lies at P = R = 1; a typical initial answer lies near the origin)
30
Precision and Recall
  • Query tuning
  • Iterative modification of the query targeted to
    increase the quality of the response
  • Theoretically it is possible to reach the optimum
    sooner or later

(P–R diagram: iterative query tuning moves the answer towards the optimum at P = R = 1)
31
Precision and Recall
  • Due to (not only) ambiguities, both measures indirectly depend on each other, i.e. P·R ≈ const. < 1
  • In order to increase P, the absolute number of relevant documents in the response is decreased.
  • In order to increase R, the number of irrelevant documents grows rapidly.
  • The probability of reaching a quality above the limit is low.

(P–R diagram: the reachable combinations of P and R stay below the optimum)
32
Prediction Criterion
  • At query formulation time the questioner has to guess the terms (words) the author used to express a given idea
  • Problems are caused e.g. by
  • Synonyms (the author could have used a different synonym, not recalled by the user)
  • Overlapping meanings of terms
  • Colorful poetical hyperboles

33
Prediction Criterion
  • The problem can be partly suppressed by the inclusion of a thesaurus, containing
  • Hierarchies of terms and their meanings
  • Sets of synonyms
  • Definitions of associations between terms
  • Questioner can use it during query formulation
  • System can use it during query evaluation

34
Prediction Criterion
  • The user often tends to tune his/her own query in a conservative way
  • He/she tends to keep the terms used in the first iteration ("they must be the best because I remembered them immediately") and vary only additional terms at the end of the query
  • It is useful to support the user in (semi)automatically eliminating wrong terms and replacing them with useful ones that describe the really relevant documents

35
Maximal Criterion
  • The questioner is usually not able or willing to go through an exhaustive number of hits in the response to find the relevant ones
  • Usually at most 20-50 documents, depending on their length
  • It is necessary not only to sort out documents not matching the query, but to order the answer list by supposed relevancy in descending order: the supposedly best documents at the beginning

36
Maximal Criterion
  • Due to the maximal criterion, the user usually tries to increase the Precision of the answer
  • A small number of resulting documents in the answer, containing as high a ratio of relevant documents as possible
  • Some problematic domains require both high precision and high recall
  • Lawyers, especially in territories having case law based on precedents (the need to find as many similar cases as possible)

37
Exact pattern matching
38
Why Search for Patterns in Text
  • To index documents or queries
  • To involve only a given set of terms (lemmas)
  • To omit a given set of meaningless terms (lemmas) such as conjunctions, numerals, pronouns, ...
  • To highlight given terms in documents presented to users

39
Algorithms classification by preprocessing
  • I - Brute-force algorithm
  • II - Others (suitable for TRS)
  • Further divided according to
  • Number of simultaneously matched patterns
  • 1, N, ∞
  • Direction of comparison
  • Left to right
  • Right to left

40
Class II Algorithms
41
Exact Pattern Matching
  • Searching of One Pattern
  • Within Text

42
Brute-force Algorithm
  • Let m denote the length of the text t and n the length of the pattern p.
  • If the i-th position in the text doesn't match the j-th position in the pattern
  • Shift the pattern one position to the right and restart the comparison at the first (leftmost) position of the pattern
  • Worst-case time complexity O(m·n), e.g. when searching for a^(n-1)b in a^(m-1)b
  • For natural-language text/pattern: m·const operations, i.e. O(m); const is a small number (< 10), dependent on the language
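A minimal 0-indexed Python sketch of the brute-force matcher described above (the later pseudocode on these slides is 1-indexed):

    def brute_force_search(t: str, p: str) -> int:
        m, n = len(t), len(p)
        for i in range(m - n + 1):          # candidate alignment of the pattern
            j = 0
            while j < n and t[i + j] == p[j]:
                j += 1
            if j == n:                      # the whole pattern matched
                return i
        return -1                           # not found

    print(brute_force_search("text-retrieval systems", "retrieval"))   # 5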

43
Knuth-Morris-Pratt Algorithm
  • Left-to-right searching for one pattern
  • In comparison with the brute-force algorithm, KMP eliminates repeated comparison of already successfully compared characters of the text
  • The pattern is shifted as little as possible, so that a proper prefix of the already examined part of the pattern is aligned below the equal fragment of the text

44
KMP Algorithm
45
KMP Algorithm
  • In front of the mismatch position, a proper prefix of the already examined part of the pattern remains
  • It has to be equal to a postfix of the already examined part of the pattern
  • The longest such prefix determines the smallest shift

46
KMP Algorithm
47
KMP Algorithm
  • If
  • the j-th position of pattern p doesn't match the i-th position of text t,
  • the longest proper prefix of the already examined part of the pattern that is equal to a postfix of the already examined part has length k,
  • then
  • after the shift, k characters remain before the mismatch position
  • the comparison restarts from the (k+1)-st position of the pattern
  • Restart positions are pre-computed and stored in an auxiliary array A
  • In this case A[j] = k+1

48
KMP Algorithm
  • begin {KMP}
      m := length(t); n := length(p); i := 1; j := 1;
      while (i <= m) and (j <= n) do
      begin
        while (j > 0) and (p[j] <> t[i]) do j := A[j];
        inc(i); inc(j);
      end; {while}
      if (j > n) then {pattern found at position i-j+1}
                 else {pattern not found}
    end {KMP}

49
Obtaining of array A for KMP search
  • A[1] = 0
  • If all values are known for positions 1 .. j-1, it is easy to compute the correct value for the j-th position
  • Let A[j-1] contain the correction for the (j-1)-st position, i.e., A[j-1]-1 characters at the beginning of the pattern are the same as the equivalent number of characters before the (j-1)-st position

50
Obtaining of array A for KMP search
51
Obtaining of array A for KMP search
  • If the (j-1)-st position of the pattern matches the A[j-1]-th position, the prefix can be prolonged, and so the correct value of A[j] is higher by one than the previous value.

52
Obtaining of array A for KMP search
53
Obtaining of array A for KMP search
  • If the (j-1)-st and the A[j-1]-th positions in the pattern don't match, the correction A[j-1]+1 would cause a mismatch at the previous position in the text
  • The correction for such a mismatch is already known (the numbers A[1] .. A[j-1] are already computed)

54
Obtaining of array A for KMP search
  • It is necessary to follow the corrections starting at the (j-1)-st position until the (j-1)-st position in the pattern matches the found target position, or the correction reaches 0 (falls out of the pattern)

55
Obtaining of array A for KMP search
56
Obtaining of array A for KMP search
57
Obtaining of array A for KMP search - algorithm
  • begin
      A[1] := 0; n := length(p); j := 2;
      while (j <= n) do
      begin
        k := j-1; l := k;
        repeat l := A[l] until (l = 0) or (p[l] = p[k]);
        A[j] := l+1; inc(j);
      end;
    end

58
KMP algorithm
  • The time complexity of KMP is O(m+n).
  • Already successfully compared positions in the text are never checked again
  • After each shift of the pattern the given mismatch position can be checked again, but there are at most O(m) shifts of the pattern.
  • Similarly, the time complexity of the preprocessing is O(n).

59
KMP Optimization
  • It is possible to further optimize the auxiliary array A
  • If the character p[j] equals p[A[j]], the character aligned to the mismatch position after the shift would be the same one that caused the mismatch.
  • In this case the optimization can be computed in advance in another auxiliary array A', where A'[j] =def A'[A[j]]
  • Else A'[j] =def A[j]
  • The array A'[j] can be used during the search phase
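A Python sketch of the whole KMP scheme described above, 0-indexed: the arrays fail and opt play the roles of A and A' from the slides, shifted to 0-based indexing.

    def build_fail(p):
        fail = [-1] * len(p)                 # fail[0] = -1 corresponds to A[1] = 0
        k = -1
        for j in range(1, len(p)):
            while k >= 0 and p[k] != p[j - 1]:
                k = fail[k]
            k += 1
            fail[j] = k                      # longest proper border of p[:j]
        return fail

    def optimize_fail(p, fail):
        # if p[j] == p[fail[j]], falling back would compare the same character
        # again; jump directly to its (already optimized) correction instead
        opt = fail[:]
        for j in range(1, len(p)):
            if fail[j] >= 0 and p[j] == p[fail[j]]:
                opt[j] = opt[fail[j]]
        return opt

    def kmp_search(t, p):
        fail = optimize_fail(p, build_fail(p))
        j = 0
        for i, c in enumerate(t):
            while j >= 0 and p[j] != c:
                j = fail[j]
            j += 1
            if j == len(p):
                return i - j + 1             # first occurrence
        return -1

    print(kmp_search("anabanana", "banana"))   # 3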

60
Boyer-Moore Algorithm
  • Right to left search of one pattern using pattern
    preprocessing
  • Pattern is shifted left to right
  • Characters of pattern are compared from right to
    left

61
Boyer-Moore Algorithm
  • If a mismatch of the (n-j)-th position of the pattern against the (i-j)-th position of the text occurs, where T[i-j] = x and
  • n denotes the length of the pattern,
  • i denotes the position of the end of the pattern in the text,
  • j = 0..n-1,
  • x ∈ X, X is the alphabet,
  • the pattern is moved by SHIFT[n-j, x] characters to the right
  • The comparison restarts at the end of the pattern, i.e. for j = 0

62
Boyer-Moore Algorithm
  • There exist several different definitions of SHIFT[n-j, x]
  • Variant 1: The auxiliary array SHIFT[0..n-1, X] is defined for each position in the pattern and for each character of the alphabet X as follows
  • The smallest possible shift aligning the character x in the text at the mismatch position with the same character in the pattern.
  • If no such character x exists in the pattern to the left of the mismatch position, shift the pattern so that it starts immediately after the mismatch position.
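A small Python sketch of the Variant-1 SHIFT table (0-indexed; the table layout is an assumption, the slides show it only graphically):

    def build_shift(p, alphabet):
        n = len(p)
        shift = [dict() for _ in range(n)]
        for j in range(n):                   # mismatch position in the pattern
            for x in alphabet:
                # nearest occurrence of x strictly left of position j
                k = max((i for i in range(j) if p[i] == x), default=-1)
                shift[j][x] = j - k          # aligns x, or moves past position j
        return shift

    shift = build_shift("ananas", "ans")
    print(shift[5]["n"])   # mismatch at the last position on text char 'n' -> 2
    print(shift[5]["s"])   # 's' does not occur left of the mismatch -> 6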

63
Boyer-Moore Algorithm (1)
  • Worst-case time complexity is O(m·n), e.g. for searching ba^(n-1) in a^(m-n)ba^(n-1)
  • For huge alphabets and patterns with a small number of distinct characters (especially for words searched in natural-language texts) the average time complexity is O(m/n)
  • i.e. the longer the pattern, the more efficient the search

64
Boyer-Moore Algorithm (1)
  • Example

65
Boyer-Moore Algorithm (1)
  • Representation of the SHIFT array for the pattern ANANAS
  • Full arrows depict a successful comparison of one character
  • Other arrows stand for the shift of the target character to the position of the starting character
  • Missing arrows mean a shift past the mismatch position

66
Boyer-Moore Algorithm (1)
  • Another representation. To save space, x ∈ {A, N, S, X}
  • X stands for any character not appearing in the pattern
  • Values marked as shifts represent the length of the shift
  • The remaining values represent the new value of j

67
Benchmark on Artificial Text
68
Benchmark on Artificial Text
69
Benchmark on Artificial Text
70
Benchmark on English Text
  • Note
  • Unique pattern ⇒ found at its original position

71
Benchmark on English Text
72
Benchmark on English Text
73
Benchmark on English Text
74
Benchmark on English Text
75
Review of Algorithms
76
Exact pattern matching
  • Searching for finite set of patterns

77
Aho-Corasick Algorithm
  • Left-to-right searching for several patterns simultaneously
  • Extension of the KMP algorithm
  • Preprocessing of the patterns
  • Linear reading of the text
  • Average time complexity O(m + Σni), where m denotes the length of the text and ni denotes the length of the i-th pattern

78
A-C Algorithm
  • Text T
  • Set of patterns P = {P1, P2, ..., Pk}
  • Search engine S = (Q, X, q0, g, f, F)
  • Q – finite set of states
  • X – alphabet
  • q0 ∈ Q – initial state
  • g: Q × X → Q – (go) forward function
  • f: Q → Q – (fail) backward function
  • F ⊆ Q – set of final states

79
A-C Algorithm
  • States in the set Q correspond to all prefixes of all patterns
  • State q0 represents the empty prefix ε
  • g(q, x) = qx, iff qx ∈ Q
  • Else g(q0, x) = q0
  • Else g(q, x) is undefined
  • f(q) for q <> q0 is equal to the longest proper postfix of q present in the set Q
  • |f(q)| < |q|
  • Final states correspond to all complete patterns, i.e. F = P

80
A-C Algorithm
  • Search is based on the total (fully defined) transition function δ(q, x): Q × X → Q
  • δ(q, x) = g(q, x), iff g(q, x) is defined
  • δ(q, x) = δ(f(q), x) otherwise
  • Correct definition, because |f(q)| – the distance of f(q) from the initial state – is less than |q|, and g(q0, x) is completely defined.

81
A-C Algorithm
  • f is constructed in order of increasing |q|, i.e. according to the distance of the state from the beginning
  • It is not necessary to define f(q0)
  • If |q| = 1, the longest proper postfix is empty, i.e. f(q) = q0
  • f(qx) = f(g(q, x)) = δ(f(q), x)
  • To determine the value of the fail function for the state qx, accessible from state q using character x, it is necessary to start in q, follow the fail function to f(q) and then go forward using the character x

82
A-C Algorithm
  • Example: P = {he, her, she}, function g

83
A-C Algorithm
  • Example: P = {he, her, she}, function f

84
A-C Algorithm
  • Detection of all occurrences of patterns, even of patterns hidden inside other ones
  • Either collect all patterns detected in a given state by going through all states accessible from it using the fail function, i.e. the final states among f^i(q), i > 0
  • Or, after the transition to state q, go through all states linked together by the fail function and report all final states
85
A-C Algorithm delta function
  • function delta(q: states; x: alphabet): states;
    begin {delta}
      while g[q, x] = fail do q := f[q];
      delta := g[q, x];
    end; {delta}
  • begin {A-C}
      q := 0;
      for i := 1 to length(t) do
      begin
        q := delta(q, t[i]);
        report(q); {report all patterns ending at t[i]}
      end;
    end {A-C}
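A compact Python sketch built from the definitions above (goto g, fail f and per-state output sets); dictionary-based and not performance-tuned:

    from collections import deque

    def build_ac(patterns):
        goto, fail, out = [{}], [0], [set()]
        for pat in patterns:                 # trie of all pattern prefixes
            q = 0
            for ch in pat:
                if ch not in goto[q]:
                    goto.append({}); fail.append(0); out.append(set())
                    goto[q][ch] = len(goto) - 1
                q = goto[q][ch]
            out[q].add(pat)
        queue = deque(goto[0].values())      # depth-1 states fail to the root
        while queue:                         # BFS: states by increasing depth
            q = queue.popleft()
            for ch, r in goto[q].items():
                queue.append(r)
                s = fail[q]
                while s and ch not in goto[s]:
                    s = fail[s]
                fail[r] = goto[s].get(ch, 0)
                out[r] |= out[fail[r]]       # patterns reachable via fail links
        return goto, fail, out

    def ac_search(text, patterns):
        goto, fail, out = build_ac(patterns)
        q, hits = 0, []
        for i, ch in enumerate(text):
            while q and ch not in goto[q]:
                q = fail[q]
            q = goto[q].get(ch, 0)
            for pat in out[q]:
                hits.append((i - len(pat) + 1, pat))
        return sorted(hits)

    print(ac_search("ushers", ["he", "her", "she"]))
    # [(1, 'she'), (2, 'he'), (2, 'her')]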

86
KMP vs. A-C for 1 pattern
  • Equal algorithms, different formulations
  • KMP: j (= compared position)
  • A[1] = 0
  • A[j] = k
  • A-C: q = j-1 (= number of compared positions)
  • g(q0, ·) = q0
  • f(qj-1) = qk-1

87
Commentz-Walter Algorithm
  • Right-to-left search for several patterns simultaneously
  • Combination of the B-M and A-C algorithms
  • Average time complexity (for natural languages) O(m / min(ni)), where m denotes the length of the text and ni denotes the length of the i-th pattern

88
C-W Algorithm
  • Text T
  • Set of patterns P = {P1, P2, ..., Pk}
  • Search engine S = (Q, X, q0, g, f, F)
  • Q – finite set of states
  • X – alphabet
  • q0 ∈ Q – initial state
  • g: Q × X → Q – (go) forward function
  • f: Q → Q – (fail) backward function
  • F ⊆ Q – set of final states

89
C-W Algorithm
  • States in the set Q represent all postfixes of all patterns
  • State q0 represents the empty postfix ε
  • g(q, x) = xq, iff xq ∈ Q
  • f(q), where q <> q0, is equal to the longest proper prefix of q present in the set Q
  • |f(q)| < |q|
  • Final states correspond to all complete patterns, i.e. F = P

90
C-W Algorithm
  • Forward function

91
C-W Algorithm
  • Backward function (arrows going to q0 are not
    shown)

(Diagram: backward function over the states e, he, she, r, er, her for P = {he, her, she})
92
C-W Algorithm
  • LMIN = min(ni) – the length of the shortest pattern
  • h(q) = |q| – the distance of state q from the initial state
  • char(x) – the minimal distance of a state reachable via character x
  • pred(q) – the predecessor of state q, i.e. the state representing the postfix shorter by one character
  • If g(q, x) is not defined, the patterns (search engine) are shifted by shift(q, x) positions to the right and the search restarts from state q0 again
  • shift(q, x) = min( max( shift1(q, x), shift2(q) ), shift3(q) )

93
C-W Algorithm
  • shift1(q, x) = char(x) - h(q) - 1, if > 0
  • shift2(q) = min( {LMIN} ∪ { h(q') - h(q) : f(q') = q } )
  • shift3(q0) = LMIN
  • shift3(q) = min( {shift3(pred(q))} ∪ { h(q') - h(q) : ∃k: f^k(q') = q ∧ q' ∈ F } )

94
C-W Algorithm
  • shift1(q, x) – aligning of the collision character: char(y) - h(kolo) - 1 = 8 - 4 - 1 = 3
95
C-W Algorithm
  • shift2(q) – aligning of the already checked part of the text; states whose fail function goes to q must be taken into account
96
C-W Algorithm
  • shift3(q) – aligning of (any) postfix of the checked text; the collision character need not be used again to find a match

97
Exact Pattern Matching
  • Searching for (Regular) Infinite Set of Patterns
    in Text

98
Regular expressions and languages
  • Regular expression R
  • Atomic expressions
  • ∅
  • ε
  • a, a ∈ X
  • Operations
  • U.V – concatenation
  • U+V – union
  • V^k = V.V. ... .V (k times)
  • V* = V^0 + V^1 + V^2 + ...
  • V+ = V^1 + V^2 + V^3 + ...
  • Value of an expression, h(R)
  • ∅ – the empty language
  • {ε} – the empty word only
  • {a}, a ∈ X
  • h(U.V) = { u.v | u ∈ h(U) ∧ v ∈ h(V) }
  • h(U+V) = h(U) ∪ h(V)

99
Regular Expression Properties
  • 1)  U+(V+W) = (U+V)+W
  • 2)  U.(V.W) = (U.V).W
  • 3)  U+V = V+U
  • 4)  (U+V).W = (U.W)+(V.W)
  • 5)  U.(V+W) = (U.V)+(U.W)
  • 6)  U+U = U
  • 7)  ε.U = U
  • 8)  ∅.U = ∅
  • 9)  U+∅ = U
  • 10) U* = ε + U.U* = (ε+U)*

100
(Deterministic) Finite Automaton
  • K = ( Q, X, q0, δ, F )
  • Q is a finite set of states
  • X is an alphabet
  • q0 ∈ Q is the initial state
  • δ: Q × X → Q is a totally defined transition function
  • F ⊆ Q is a set of final states

101
(Deterministic) Finite Automaton
  • Configuration of the FA
  • (q, w) ∈ Q × X*
  • Transition of the FA
  • relation ⊢ ⊆ (Q × X*) × (Q × X*)
  • (q, aw) ⊢ (q', w) ⟺ δ(q, a) = q'
  • The automaton accepts a word w iff (q0, w) ⊢* (q, ε), q ∈ F

102
Non-deterministic Finite Automaton
  • a) default definition K = ( Q, X, q0, δ, F ); b) extended definition K = ( Q, X, S, δ, F )
  • Q is a finite set of internal states
  • X is an alphabet
  • q0 ∈ Q is the initial state; S ⊆ Q is (alternatively) a set of initial states
  • δ: Q × X → P(Q) is a transition function
  • F ⊆ Q is a set of final states

103
Non-deterministic Finite Automaton
  • NFA for P = {he, her, she}
  • S = {1, 4, 8}
  • F = {3, 7, 11}
  • S = {1}
  • F = {3, 4, 7}

104
NDFA→DFA Conversion
  • K = (Q, X, S, δ, F)
  • K' = (Q', X, q0', δ', F')
  • Q' = P(Q)
  • X
  • q0' = S
  • δ'(q', x) = ∪ δ(q, x), q ∈ q'
  • F' = { q' ∈ Q' | q' ∩ F ≠ ∅ }
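A Python sketch of the subset construction described above, for a small illustrative NFA (not the one from the previous slide): DFA states are frozensets of NFA states.

    def nfa_to_dfa(nfa_delta, start_states, nfa_final):
        start = frozenset(start_states)
        dfa_delta, final, stack = {}, set(), [start]
        while stack:
            q = stack.pop()
            if q in dfa_delta:
                continue
            dfa_delta[q] = {}
            if q & nfa_final:                      # F' = {q' | q' ∩ F ≠ ∅}
                final.add(q)
            symbols = {x for s in q for x in nfa_delta.get(s, {})}
            for x in symbols:
                # δ'(q', x) = union of δ(s, x) over s ∈ q'
                target = frozenset().union(*(nfa_delta.get(s, {}).get(x, set()) for s in q))
                dfa_delta[q][x] = target
                stack.append(target)
        return start, dfa_delta, final

    # NFA over {0, 1} accepting words ending in "01"
    delta = {0: {"0": {0, 1}, "1": {0}}, 1: {"1": {2}}}
    start, d, final = nfa_to_dfa(delta, {0}, {2})
    print(len(d), [sorted(q) for q in final])      # 3 [[0, 2]]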

105
NDFA→DFA Conversion: Set of Initial States Allowed
  • By table, only reachable states (transitions to state 1 are not shown)
106
NDFA→DFA Conversion: Only One Initial State Allowed
  • By table, only reachable states (transitions to state 1 are not shown)

107
Derivative of a Regular Expression
  • If x.w ∈ h(V), then w ∈ h(dV/dx)
  • I.e., if V matches the word x.w, then the derivative dV/dx matches w

108
Derivative of a Regular Expression

109
Construction of a DFA Using Derivatives of RE
  • The derivative of regular expressions allows one to directly and algorithmically build a DFA for any regular expression
  • Let V be a given regular expression over the alphabet X
  • Each state of the DFA defines the set of words that move the DFA from this state to any of the final states. So, every state can be associated with a regular expression defining this set of words
  • q0 = V
  • δ(q, x) = dq/dx
  • F = { q ∈ Q | ε ∈ h(q) }

110
Construction of a DFA Using Derivatives of RE
  • V = (0+1)*.0.1 over the alphabet X = {0, 1}
  • q0 = (0+1)*.0.1

111
Construction of a DFA Using Derivatives of RE
  • V = (0+1)*.0.1 over the alphabet X = {0, 1}
  • q0 = (0+1)*.0.1
  • F = { (0+1)*.0.1 + ε }

112
Document Models
  • Different variants of models
  • Take the (non-)existence of terms in documents into account, or not
  • Take the frequencies of terms in documents into account, or not
  • Take the positions of terms in documents into account, or not

113
Document Models in TRSs
  • Boolean Model

114
Boolean Model of TRS
  • Mid-20th century
  • Adoption of procedures used in librarianship and their gradual implementation

115
Boolean Model of TRS
  • Database (collection) D containing n documents
  • D = {d1, d2, ..., dn}
  • Documents are described using m terms
  • T = {t1, t2, ..., tm}
  • term tj – a descriptor, usually a word or collocation
  • Each document is represented as a subset of the available terms
  • Contained in the document
  • Best describing the content of the document
  • di ⊆ T

116
Boolean Model of TRS
  • Assigning a set of terms to a document can be achieved by different approaches
  • Subdivision according to who performs the indexing
  • Manual
  • Done by a human indexer who understands the content of the document
  • Non-consistent. Several indexers need not produce the same set of terms. One indexer might later produce a different set of terms than before.
  • Automatic
  • Done algorithmically
  • Consistent, but without text understanding
  • Subdivision according to the freedom in selecting descriptors
  • Controlled
  • The set of terms is defined in advance and the indexer cannot change it. He/she can only select the terms describing the given document as well as possible.
  • Non-controlled
  • The set of terms can be extended whenever a new document is inserted into the collection.

117
Indexation
  • Thesaurus
  • Internally structured set of terms
  • Synonyms with defined preferred term
  • Hierarchies of semantically narrower/broader
    terms
  • Similar terms
  • ...
  • Stop-list
  • Set of non-significant terms that are meaningless
    for indexation
  • Pronouns, interjections,

118
Indexation
  • Common words are not suitable for document identification
  • Too specific words as well. A lot of different terms appear in a very small number of docs
  • Their elimination significantly decreases the size of the index, and only slightly its quality

119
Boolean Model of TRS
  • A query is represented by a Boolean expression
  • ta AND tb – the document has to contain/be described by both terms
  • ta OR tb – the document has to contain/be described by at least one of the terms
  • NOT t – the document must not contain/be described by the given term

120
Boolean Model of TRS
  • Query examples
  • searching AND information
  • encoding OR decoding
  • processing AND (document OR text)
  • computer AND NOT personal

121
Boolean Model of TRS Extensions
  • Collocations in queries
  • "searching for information"
  • "data encoding" OR "data decoding"
  • "text processing"
  • computer AND NOT "personal computer"

122
Boolean Model of TRS Extensions
  • Use of factual meta-data (attribute values)
  • database AND (author = Salton)
  • "text processing" AND (year_of_publishing > 1990)

123
Boolean Model of TRS Extensions
  • Wildcards in terms
  • datab* AND system*
  • stands for the terms database, databases, system, systems, etc.
  • portabl? AND computer*
  • stands for the terms portable, computer, computers, computerized, etc.

124
Boolean Index Structure
  • Inverted file
  • It holds a list of identified documents for each
    term (instead of a set of terms for each
    document)
  • t1: d1,1, d1,2, ..., d1,k1
  • t2: d2,1, d2,2, ..., d2,k2
  • tm: dm,1, dm,2, ..., dm,km

125
Boolean Index Structure
  • One-by-one processing of the inserted documents produces a sequence of couples <doc_id, term_id> sorted by the first component, i.e. by doc_id
  • Next, the sequence is reordered lexicographically by term_id, doc_id and duplicates are removed
  • The result can be further optimized by adding a directory pointing to the sections corresponding to individual terms, and removing term_ids from the sequence
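A minimal Python sketch of this construction and of a simple conjunctive query over the resulting posting lists (document texts are illustrative):

    docs = {1: "text retrieval systems", 2: "database systems", 3: "text databases"}

    # couples <term_id, doc_id>, reordered lexicographically, duplicates removed
    pairs = sorted({(term, doc_id)
                    for doc_id, text in docs.items()
                    for term in text.split()})

    inverted = {}                               # term -> sorted list of doc_ids
    for term, doc_id in pairs:
        inverted.setdefault(term, []).append(doc_id)

    def boolean_and(t1, t2):
        # intersection of two posting lists answers the query "t1 AND t2"
        return sorted(set(inverted.get(t1, [])) & set(inverted.get(t2, [])))

    print(inverted["systems"])                  # [1, 2]
    print(boolean_and("text", "systems"))       # [1]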

126
Lemmatization and Disambiguation of the Czech Language (ÚFAL)
  • Odpovědným zástupcem nemůže být každý. (Not everyone can be a responsible representative.)
  • Zákon by měl zajistit individualizaci odpovědnosti a zajištění odbornosti. (The law should ensure individualization of responsibility and assurance of expertise.)
  • <p n=1><s id="docID001-p1s1"><f cap>Odpovědným<MDl>odpovědný_(kdo_za_něco_odpovídá)<MDt>AAIS7----1A----<f>zástupcem<MDl>zástupce<MDt>NNMS7-----A----<f>nemůže<MDl>moci_(mít_možnost_něco_dělat)<MDt>VB-S---3P-NA---<f>být<MDl>být<MDt>Vf--------A----<f>každý<MDl>každý<MDt>AAIS1----1A----
  • <p n=2>

Legend: <p n=...> – paragraph Nr.; <s id=...> – sentence Nr.; <f> – word in the document; <MDl> – lemma including meaning; <MDt> – morphological tag (type of word, case, number, ...)
127
Proximity Constraints
  • t1 (m, n) t2
  • the most general form
  • term t2 can appear at most m words after t1, or term t1 can appear at most n words after t2
  • t1 sentence t2
  • terms have to appear in the same sentence
  • t1 paragraph t2
  • terms have to appear in the same paragraph

128
Proximity Constraints Evaluation
  • Using the same index structure
  • Operators replaced by conjunctions
  • Query evaluation to find candidates
  • Check for co-occurrences in primary texts
  • Small index
  • Longer time needed for evaluation
  • Necessity of storing primary documents
  • Extension of index by positions of term
    occurrences in documents
  • Large index

129
Extended Index Structure
  • During indexation a sequence of 5-tuples <doc_id, term_id, para_nr, sent_nr, word_nr> is built, ordered by doc_id, para_nr, sent_nr, word_nr
  • The sequence is then reordered by <term_id, doc_id, para_nr, sent_nr, word_nr>
  • No duplicates are removed
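A sketch of this positional (extended) index and of an evaluation of the constraint "t1 sentence t2"; the tuple layout follows the slide, the sample occurrences are illustrative:

    from collections import defaultdict

    # occurrences: (doc_id, term, para_nr, sent_nr, word_nr)
    occurrences = [(1, "text", 1, 1, 1), (1, "retrieval", 1, 1, 2),
                   (1, "boolean", 1, 2, 1), (2, "retrieval", 2, 1, 3)]

    index = defaultdict(list)       # term -> [(doc_id, para_nr, sent_nr, word_nr)]
    for doc, term, para, sent, word in sorted(occurrences,
                                              key=lambda o: (o[1], o[0], o[2], o[3], o[4])):
        index[term].append((doc, para, sent, word))

    def same_sentence(t1, t2):
        # documents where t1 and t2 occur in the same paragraph and sentence
        s1 = {(d, p, s) for d, p, s, _ in index[t1]}
        s2 = {(d, p, s) for d, p, s, _ in index[t2]}
        return sorted({d for d, _, _ in s1 & s2})

    print(same_sentence("text", "retrieval"))   # [1]
    print(same_sentence("text", "boolean"))     # []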

130
Thesaurus Utilization
  • BT(x) - Broader Term to term x
  • NT(x) - Narrower Terms
  • PT(x) - Preferred Term
  • SYN(x) - SYNonyms to term x
  • RT(x) - Related Terms
  • TT(x) - Top Term

131
Disadvantages of Boolean Model
  • Salton
  • Query formulation is more an art than a science.
  • Hits cannot be rated by their quality.
  • All terms in the query are taken as equally important.
  • The output size cannot be controlled. The system frequently produces empty or very large answers.
  • Some results don't correspond to intuitive understanding.
  • Documents in the answer to a disjunctive query can contain only one of the mentioned terms as well as all of them.
  • Documents eliminated from the answer to a conjunctive query can miss one of the mentioned terms as well as all of them.

132
Partial Answer Ordering
  • Q = (t1 OR t2) AND (t2 OR t3) AND t4
  • conversion to an equivalent DNF
  • Q = (t1 AND t2 AND t3 AND t4)
  • OR (t1 AND t2 AND NOT t3 AND t4)
  • OR (t1 AND NOT t2 AND t3 AND t4)
  • OR (NOT t1 AND t2 AND t3 AND t4)
  • OR (NOT t1 AND t2 AND NOT t3 AND t4)

133
Partial Answer Ordering
  • Each elementary conjunction (further EC) contains all terms used in the original query and is rated by the number of terms used in a positive way (without NOT)
  • All ECs differ from one another in at least one term (one contains tj, the other contains NOT tj)
  • Every document corresponds to at most one EC
  • A document is then rated by the number assigned to the given EC.
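A small Python sketch of this rating for the query from slide 132 (document term sets are illustrative):

    def query(t1, t2, t3, t4):                  # Q = (t1 OR t2) AND (t2 OR t3) AND t4
        return (t1 or t2) and (t2 or t3) and t4

    terms = ["t1", "t2", "t3", "t4"]

    def rate(doc_terms):
        values = [t in doc_terms for t in terms]        # the EC the document falls into
        return sum(values) if query(*values) else None  # None = not in the answer

    print(rate({"t1", "t2", "t3", "t4"}))   # 4  (the best-rated EC)
    print(rate({"t2", "t4"}))               # 2
    print(rate({"t1", "t4"}))               # None (fails t2 OR t3)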

134
Partial Answer Ordering
  • There exist 2^k ECs in the case of a query using k terms
  • There exist only k different ratings
  • Several ECs can have the same rating
  • (ta OR tb) = (ta AND tb) ... rating 2
    OR (ta AND NOT tb) ... rating 1
    OR (NOT ta AND tb) ... rating 1

135
Vector Space Model of TRS
  • The 1970s
  • circa 20 years younger than the Boolean model of TRS
  • Tries to minimize and/or eliminate the disadvantages of the Boolean model

136
Vector Space Model of TRS
  • Database D containing n documents
  • D = {d1, d2, ..., dn}
  • Documents are described by m terms
  • T = {t1, t2, ..., tm}
  • term tj – a word or collocation
  • Document representation using an m-dimensional vector of term weights

137
Vector Space Model of TRS
  • Document model
  • wi,j – level of importance of the j-th term for identifying/describing the i-th document
  • Query
  • qj – level of importance of the j-th term for the user

138
Vector Space Model Index
139
Vector Space Model of TRS
  • The similarity between the vectors representing a document and a query is in general defined by a similarity function

140
Similarity Functions
  • Each factor of the sum is proportional both to the level of importance of the term in the document and to its importance for the user
  • Orthogonal vectors have zero similarity
  • Base vectors of the vector space (individual terms) are mutually orthogonal and so have zero similarity

141
Vector Space Model of TRS
  • Not only the angle, but also the sizes of the vectors influence the similarity
  • Longer vectors, which tend to be assigned to longer texts, have an advantage over shorter ones
  • It's desirable to normalize all vectors to unit size

142
Vector normalization
  • Vector length influence elimination

143
Vector normalization
  • At indexing time
  • No overhead at search time
  • Sometimes it is necessary to re-compute all vectors, in case the vectors also reflect aspects dependent on the complete collection
  • At search time
  • Part of the similarity function definition
  • Slows down the response of the system

144
Output Size Control
  • Documents in the output list are ordered by
    descending similarity to the given query
  • Most similar documents at the beginning of the
    list
  • The list size can be easily restricted with
    respect to maximal criterion
  • The maximal number of documents in the list can
    be restricted
  • Only documents reaching threshold similarity can
    be shown in the result

145
Negation in Vector Space Model
  • It is possible to extend the query space to allow negative term weights
  • Then the contribution of the j-th dimension can be negative
  • Documents that contain the j-th term are suppressed in comparison with the others

146
Scalar product
147
Cosine Measure (Salton)
148
Jaccard Measure
149
Dice Measure
150
Overlap Measure
151
Asymmetric Measure
152
Pseudo-Cosine Measure
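The formulas of these measures appear on the slides only as graphics; the sketch below uses the commonly cited textbook definitions (an assumption, not a copy of the slides):

    import math

    def dot(d, q):      return sum(x * y for x, y in zip(d, q))
    def norm(v):        return math.sqrt(sum(x * x for x in v))

    def cosine(d, q):   return dot(d, q) / (norm(d) * norm(q))
    def jaccard(d, q):  return dot(d, q) / (dot(d, d) + dot(q, q) - dot(d, q))
    def dice(d, q):     return 2 * dot(d, q) / (dot(d, d) + dot(q, q))
    def overlap(d, q):  return dot(d, q) / min(dot(d, d), dot(q, q))

    d, q = [1.0, 2.0, 0.0], [1.0, 1.0, 1.0]
    print(round(cosine(d, q), 3), round(jaccard(d, q), 3), round(dice(d, q), 3))
    # 0.775 0.6 0.75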
153
Vector Space Model Indexation
  • Based on the number of occurrences of a given term in a given document
  • The more often a given word occurs in a given document, the more important it is for its identification
  • Term Frequency: TFi,j = term_occurrences / all_occurrences

154
Vector Space Model Indexation
  • Without a stop-list the result contains almost only meaningless words at the beginning

155
Vector Space Model Indexation
  • Term frequencies are very small even for the most frequent terms
  • Normalized term frequency: NTFi,j = 0.5 + 0.5·TFi,j / maxk(TFi,k) if the term occurs in the document, 0 otherwise

156
Vector Space Model Indexation
Differentiation of important terms from non-important ones
157
Vector Space Model Indexation
  • IDF (Inverse Document Frequency) reflects the importance of a given term in the index for the complete collection

Entropy of the probability that the term occurs in a randomly chosen document
158
Vector Space Model Indexation
  • (Optional) normalization of document vectors to unit size
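A sketch of these indexing steps under common assumptions (tf as relative frequency, idf = log2(n/df), weight = tf·idf, optional normalization to unit size); the exact formulas on the slides are graphics:

    import math

    docs = {1: "text retrieval text", 2: "database retrieval", 3: "text mining"}
    n = len(docs)

    df = {}                                   # document frequency of each term
    for text in docs.values():
        for term in set(text.split()):
            df[term] = df.get(term, 0) + 1

    def weights(text, normalize=True):
        words = text.split()
        tf = {t: words.count(t) / len(words) for t in set(words)}
        w = {t: tf[t] * math.log2(n / df[t]) for t in tf}     # tf-idf weight
        if normalize:                                         # unit-size vector
            size = math.sqrt(sum(x * x for x in w.values())) or 1.0
            w = {t: round(x / size, 3) for t, x in w.items()}
        return w

    print(weights(docs[1]))    # {'text': 0.894, 'retrieval': 0.447} (any key order)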

159
Querying in Vector Space Model
  • The equal representation of documents and queries brings many advantages over the Boolean Model
  • A query can be defined
  • Directly, by a hand-made definition
  • By a reference to a known indexed document
  • By a reference to a non-indexed document – the indexer creates an ad-hoc vector from its primary text
  • By a text fragment (using copy-paste etc.)
  • By a combination of the above-mentioned ways

160
Feedback
  • Query building/tuning based on user feedback to
    previous answers
  • Adding terms identifying relevant documents
  • Elimination of terms unimportant for relevant
    document identification and important for
    irrelevant ones
  • Prediction criterion improvement

161
Feedback
  • The answer to the previous query is classified by the user, who can mark relevant and/or irrelevant documents

162
Positive Feedback
  • Relevant documents attract the query towards them

163
Negative Feedback
  • Irrelevant documents push query away from them
  • Less effective than positive feedback
  • Less used

164
Feedback
  • The query iteratively migrates towards the center
    of relevant documents

165
Feedback
  • General form
  • One of the used special forms

Centroid (centre of gravity)
166
Feedback
  • General form
  • Another used (weighted) form

(1-?) / ?
Weighted centroid (centre of gravity)
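The feedback formulas appear on the slides only as graphics; the sketch below uses a standard Rocchio-style update (an assumed form): move the query towards the centroid of relevant documents and away from the centroid of irrelevant ones.

    def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.25):
        m = len(query)
        def centroid(docs):
            if not docs:
                return [0.0] * m
            return [sum(d[j] for d in docs) / len(docs) for j in range(m)]
        cr, ci = centroid(relevant), centroid(irrelevant)
        return [alpha * query[j] + beta * cr[j] - gamma * ci[j] for j in range(m)]

    q = [1.0, 0.0, 0.0]
    rel = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
    irr = [[0.0, 0.0, 1.0]]
    print(rocchio(q, rel, irr))     # [1.0, 0.75, -0.25]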
167
Term Equivalence in VS Model
  • Individual terms (dimensions of the space) are supposedly, but not really, mutually independent
  • Problem with the prediction criterion: inappropriately chosen synonyms

168
Term Equivalence in VS Model
  • Equivalency matrix

169
Term Similarity in VS Model
  • Generalised equivalence
  • Similarity matrix
  • All computations used in the VS model can also be evaluated on the transposed index. Here the mutual similarity of terms can be evaluated (vector dimension n, not m)
  • Really similar terms often co-occur
  • But common terms often co-occur as well

170
Term Hierarchies in VS Model
  • Similarly to the Boolean Model

(Diagram: term hierarchy – Publication above Print and Book; Print above Papers and Magazine)
171
Term Hierarchies in VS Model
  • Similarly to the Boolean Model
  • Edges can have assigned weights
  • User weights can then be easily propagated

(Diagram: the hierarchy with edge weights 0.4 and 0.6 from Publication to Print and Book, and 0.3 and 0.7 from Print to Papers and Magazine; a user weight of 0.8 on Publication propagates to 0.32 on Print, 0.48 on Book, 0.096 on Papers and 0.224 on Magazine)
172
Citations and VS Model
  • Scientific publications cite their sources
  • Assumption
  • Cited documents are semantically similar
  • Citing documents are semantically similar

173
Citations and VS Model
  • Direct reference between documents A and B
  • Document A cites document B
  • Denoted A→B
  • Indirect reference between A and B
  • ∃ C1, ..., Ck such that A→C1→...→Ck→B
  • Link between documents A and B
  • A→B or B→A

174
Citations and VS Model
  • A and B are bibliographically paired if and only if they cite the same source C: A→C ∧ B→C
  • A and B are co-cited if and only if they are both cited in some document C: C→A ∧ C→B

175
Citations and VS Model
  • Acyclic oriented citation graph
  • Adjacency matrix C of the citation graph
  • C = (cij) ∈ {0,1}^(n×n); cij = 1 iff i→j, cij = 0 otherwise

176
Citations and VS Model
  • BP – matrix of bibliographic pairing
  • bpij = the number of documents cited by both documents i and j
  • It follows that bpii = the number of documents cited by i

177
Citations and VS Model
  • CP – matrix of co-citation pairing
  • cpij = the number of documents citing both i and j
  • It follows that cpii = the number of documents citing i

178
Citations and VS Model
  • DL – matrix of document links
  • dlij = 1 ⟺ (cij = 1 ∨ cji = 1)
  • It is possible to modify the resulting similarities between documents and a given query using some of the matrices BP, CP, DL
  • Modification of the index matrix D
  • D' = BP·D, resp. D' = CP·D, resp. D' = DL·D
  • D' = BP·CP·DL·D
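BP, CP and DL can be derived from the citation matrix C; the sketch below assumes the standard identities BP = C·Cᵀ and CP = Cᵀ·C, which match the verbal definitions on the previous slides:

    import numpy as np

    # 3 documents: both 0 and 1 cite document 2
    C = np.array([[0, 0, 1],
                  [0, 0, 1],
                  [0, 0, 0]])

    BP = C @ C.T                        # bp[i, j] = documents cited by both i and j
    CP = C.T @ C                        # cp[i, j] = documents citing both i and j
    DL = ((C + C.T) > 0).astype(int)    # dl[i, j] = 1 iff i cites j or j cites i

    print(BP[0, 1], CP[2, 2], DL[0, 2])   # 1 2 1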

179
Using mutual document similarities in VS Model
  • DS – matrix of mutual document similarities
  • dsij = Sim(di, dj)
  • The same idea as in the case of BP, CP, DL
  • Modification of the index matrix D
  • D' = DS·D

180
Term Discrimination Values
  • The discrimination value defines the importance of a term in the vector space for distinguishing the individual documents stored in the collection
  • By removal of the term from the index, i.e. by reduction of the index dimensionality, it can happen that
  • The overall distance between documents decreases (the average similarity of document pairs increases)
  • The overall distance between documents increases (the average similarity of document pairs decreases)
  • In this case the presence of the dimension in the space is not needed (it is counter-productive)

181
(Diagram: example angles between document vectors – 45.0°, 35.3°, 0.0°, 45.0°)
182
Term Discrimination Values
  • Computation based on the average document similarity
  • A more efficient variant uses a central document (centroid)

183
Term Discrimination Values
  • The same value is computed for the space reduced
    by k-th dimension

184
Term Discrimination Values
  • The discrimination value is defined as the difference of both average values
  • It can be used instead of IDFk
  • > 0
  • Important term, discriminating documents
  • DVk defines the measure of importance
  • ≤ 0
  • Unimportant term
185
Term Discrimination Values (value DV of terms depending on the number of documents in which the term is present)
(Plot: x-axis – number of documents in which the term is present; the collection contains 7777 documents)
186
Document clustering
  • Kohonen maps
  • C3M algorithm
  • K-means algorithm

187
Document Clustering
  • The response time of a VS-based TRS is directly proportional to the number of documents in the collection that must be compared with the query
  • Clustering allows skipping a major part of the index during the search and comparing only the closest documents

188
Document Clustering
  • Without clusters, it is necessary to compare all
    documents, even if the minimal needed similarity
    is defined

189
Document Clustering
  • Each cluster represents an m-dimensional sphere, defined by its center and radius
  • If not, it is possible to approximate it this way during the computations

190
Document Clustering
  • Having clusters, the query evaluation need not compare documents in clusters outside the area of user interest

191
Cluster types
  • Clusters having the same volume
  • Easy to create
  • Some clusters can be almost empty, while others
    can contain huge amount of documents

192
Cluster types
  • Clusters having (approximately) the same number
    of documents
  • Hard to create
  • More effective in case of non-uniformly
    distributed docs.

193
Cluster types
  • Non-disjunctive clusters
  • One document can belong to more than one cluster
  • Sometimes weighted belonging in fuzzy clusters.

194
Cluster types
  • Disjunctive clusters
  • Document can belong to exactly one cluster

195
Cluster types
  • It is not possible to completely and disjointly
    cover space using spheres
  • It is possible to use convex polyhedra, where
    each document belongs to closest center

196
Cluster types
  • Then clusters can be approximated by non-disjoint
    set of spheres, defined by the center and the
    most distant belonging document

197
Query Evaluation With Clusters (I)
  • Let a query q and a minimal required similarity s be given
  • Note: similarity is computed by the scalar product, vectors are normalized
  • The index is split into k clusters (c1, r1), ..., (ck, rk)
  • Note: the radii are angular
  • Query radius r = arccos(s), i.e. s = cos(r)

198
Query Evaluation With Clusters (I)
  • Emptiness of the intersection of a cluster with the query area is determined from the value arccos(Sim(q, ci)) - r - ri
  • If this value ≤ 0, the documents in the cluster are compared
  • If this value > 0, the documents of the cluster cannot be in the result
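A Python sketch of this pruning test (unit vectors, angular radii; the sample clusters are illustrative):

    import math

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def clusters_to_search(q, clusters, s):
        r = math.acos(s)                         # angular radius of the query
        selected = []
        for center, radius, docs in clusters:
            if math.acos(dot(q, center)) - r - radius <= 0:
                selected.append(docs)            # the cluster may contain answers
        return selected

    q = [1.0, 0.0]                               # 2-D unit vectors for illustration
    clusters = [([math.cos(0.2), math.sin(0.2)], 0.3, ["d1", "d2"]),
                ([math.cos(2.0), math.sin(2.0)], 0.3, ["d3"])]
    print(clusters_to_search(q, clusters, s=0.9))   # [['d1', 'd2']]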

199
Query Evaluation With Clusters (II)
  • Let a query q and a maximal number of required documents x be given.
  • Again, the index is split into k clusters (c1, r1), ..., (ck, rk)
  • No radius of the query is available

200
Query Evaluation With Clusters (II)
  • Clusters are sorted in ascending order by the distance of their center from the query, i.e. according to arccos(Sim(q, ci))
  • Better: sorted by the increasing distance of the cluster boundary from the query, i.e. according to arccos(Sim(q, ci)) - ri

201
Query Evaluation With Clusters (II)
  • Clusters are sorted in ascending order by arccos(Sim(q, ci)) - ri, i.e. by the increasing distance of the cluster boundary from the query q
  • x = 7
