Title: Information Retrieval (IR)
1. Information Retrieval (IR)
- Information retrieval generally refers to the process of extracting relevant documents according to the information specified in the query.
- IRS vs. DBMS
  - DBMS
    - structured data
    - formal query formulation
    - deterministic
    - returns all relevant data
    - one-off query
  - IRS
    - unstructured text
    - casual query style
    - non-deterministic
    - returns the most relevant data
    - relevance feedback
2. Basic Components of an IRS
[Diagram: input documents (from text editors, file systems, internet files) and a knowledge base of linguistic information feed the IRS core, which returns relevant documents to users in response to a query.]
3. Information Retrieval
- The IR technology
  - Knowledge base: dictionary and rules
  - Basic information representation model
  - Indexing of documents for retrieval
  - Relevance calculation
- Oriental languages vs. English in IR
  - The main difference is in what is considered useful information in each language.
  - Different NLP knowledge and variants of the common methods need to be used.
4. Vector Space Model for Document Representation
- Document D: an article in text form.
- Terms Ti: basic language units, such as words and phrases; D = (T1, T2, ..., Ti, ..., Tn).
  - Ti and Tj may refer to the same word appearing in different places, and the order in which a term appears is also relevant.
- Term weight: each Ti has an associated weight Wi to indicate the importance of Ti to D: D = D(T1 W1, T2 W2, ..., Tn Wn).
- For a given D, if we do not consider word repetition and order, and terms are drawn from a known set T = (t1, t2, ..., tK), where K is the number of words in the vocabulary, then D(T1 W1, T2 W2, ..., Tn Wn) can be represented in the Vector Space Model as D = D(W1, W2, ..., WK).
5. Vector Space Model for Document Representation (cont.)
- (W1, W2, ..., WK) can be considered as a vector.
- (t1, t2, ..., tK) defines the K-dimensional coordinate system, where K is a fixed number.
- Each coordinate indicates the weight of the corresponding term ti.
- Different documents are then considered as different vectors in the VSM.
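As a sketch of this representation (the vocabulary and documents are invented for illustration):

```python
# Sketch: representing documents as weight vectors over a fixed vocabulary.
# The vocabulary and document below are invented examples, not from the slides.
vocabulary = ["information", "retrieval", "database", "query", "index"]

def to_vector(document_terms):
    """Map a document's terms to a weight vector over the fixed vocabulary.

    Here the weight is simply the raw term frequency; repetition and
    order of terms inside the document are otherwise ignored.
    """
    return [document_terms.count(t) for t in vocabulary]

d1 = ["information", "retrieval", "query", "information"]
print(to_vector(d1))  # -> [2, 1, 0, 1, 0]
```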
6. Similarity
- Similarity: the degree of relevance of two documents.
- The degree of relevance of D1 and D2 can be described by a so-called similarity function, Sim(D1, D2), which describes their distance.
- There are many different definitions of Sim(D1, D2).
- One simple definition (inner product):
  - Sim(D1, D2) = Σ(k=1..n) w1k × w2k
- Example: D1 = (1, 0, 0, 1, 1, 1), D2 = (0, 1, 1, 0, 1, 0)
  - Sim(D1, D2) = 1×0 + 0×1 + 0×1 + 1×0 + 1×1 + 1×0 = 1
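The inner-product definition and the example above can be checked with a short sketch:

```python
# Inner-product similarity between two weight vectors, as defined above.
def sim_inner(d1, d2):
    return sum(w1 * w2 for w1, w2 in zip(d1, d2))

# The example from the slide:
D1 = [1, 0, 0, 1, 1, 1]
D2 = [0, 1, 1, 0, 1, 0]
print(sim_inner(D1, D2))  # -> 1
```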
7.
- Another definition (cosine):
  - Sim(D1, D2) = cos θ = (Σ(k=1..n) w1k × w2k) / sqrt((Σ(k=1..n) w1k^2) × (Σ(k=1..n) w2k^2))
- For information retrieval, D2 can be a query Q.
- Suppose there are I documents Di, where i = 1 to I.
- Rank by Sim(Di, Q): the higher the value, the more relevant Di is to Q.
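A minimal sketch of cosine ranking against a query (D1 and D2 are the vectors from the earlier example; D3 and Q are invented):

```python
import math

def sim_cos(d1, d2):
    """Cosine of the angle between two weight vectors (0.0 if either is all-zero)."""
    dot = sum(w1 * w2 for w1, w2 in zip(d1, d2))
    norm = math.sqrt(sum(w * w for w in d1)) * math.sqrt(sum(w * w for w in d2))
    return dot / norm if norm else 0.0

# D1, D2 from the earlier example; D3 and the query Q are invented.
docs = {"D1": [1, 0, 0, 1, 1, 1], "D2": [0, 1, 1, 0, 1, 0], "D3": [1, 1, 0, 0, 0, 1]}
Q = [1, 0, 0, 0, 1, 1]

# Rank documents by decreasing similarity to the query.
ranking = sorted(docs, key=lambda name: sim_cos(docs[name], Q), reverse=True)
print(ranking)  # -> ['D1', 'D3', 'D2']
```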
8. Term Selection for Indexing
- T can be considered as all the terms that can be indexed.
- Approach 1: simply use a word dictionary.
- Approach 2: terms in a dictionary plus terms segmented dynamically -> T is not necessarily static.
  - Every document Di then needs to be segmented.
- The vocabulary for indexing is normally much smaller than the vocabulary of the documents.
- Not every word Tk in Di that is in T will be indexed.
  - A Tk in Di that is in T but is not indexed is considered to have weight wik = 0.
  - In other words, all indexed terms for Di are considered to have weight greater than zero.
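A sketch of this zero-weight convention: only indexed terms are stored, and any other term of T implicitly gets wik = 0 (the terms and weights are invented):

```python
# Sketch: indexed terms of one document stored sparsely; any term of T
# that was not selected for indexing implicitly has weight 0.
indexed_weights = {"retrieval": 2.0, "index": 1.0}   # invented weights

def weight_of(term):
    # Unindexed terms fall back to w_ik = 0.
    return indexed_weights.get(term, 0.0)

print(weight_of("retrieval"), weight_of("database"))  # -> 2.0 0.0
```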
9.
- The process of selecting terms in a Di for indexing is called term selection.
- Word frequency in documents is related to the information the articles intend to convey. Thus word frequency is often used in earlier term-selection and weight-assignment algorithms.
- The Zipf law in information theory:
  - For a given document set, rank the terms according to their frequency. Then (Zipf law): Freq(ti) × rank(ti) ≈ constant
10. The H.P. Luhn Method (Term Selection)
- Suppose N documents form a document set: Dset = {Di}, 1 ≤ i ≤ N.
- (1) Freqik: the frequency of Tk in Di; TotFreqk: the frequency of Tk in Dset.
- (2) Then: TotFreqk = Σ(i=1..N) Freqik
- (3) Sort the TotFreqk in descending order. Select an upper bound and a lower bound, Cu-b and Cl-b, respectively.
- Index only the terms whose TotFreqk lies between Cu-b and Cl-b.
- This uses absolute frequency; the result is sensitive to the choice of Cu-b and Cl-b.
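A sketch of this selection procedure on an invented three-document set, with invented cutoffs for Cu-b and Cl-b:

```python
# Sketch of Luhn-style term selection: keep only terms whose total
# frequency in the document set falls between a lower and an upper cutoff.
# The document set and the cutoffs are invented for illustration.
from collections import Counter

docs = [
    ["the", "retrieval", "of", "information", "the"],
    ["the", "database", "stores", "information"],
    ["the", "index", "supports", "retrieval"],
]

tot_freq = Counter()          # TotFreq_k = sum over documents of Freq_ik
for d in docs:
    tot_freq.update(d)

C_ub, C_lb = 3, 2             # invented upper (Cu-b) and lower (Cl-b) cutoffs
indexed = sorted(t for t, f in tot_freq.items() if C_lb <= f <= C_ub)
print(indexed)  # -> ['information', 'retrieval']  ("the" is too frequent)
```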
11.
- H.P. Luhn's method is a very rough association of term frequency with the information to be conveyed in a document.
- Some low-frequency terms may be very important to a particular article (document), and that may be exactly why they do not appear as often in other articles.
- Keys in term selection for indexing: completeness and accuracy.
  - Related to the article so that it can be indexed for retrieval (completeness).
  - Distinguishes one article from other articles (accuracy and representativeness).
- Example: the term "??" is not an important term in a document set about computers; however, it is probably important in a set about hardware devices.
- -> Use relative frequency.
12. Weight Assignment Algorithm
- Assumptions:
  - As Freqik (the frequency of tk in Di) increases, the importance of tk in Di increases -> factor Freqik
  - As TotFreqk (the frequency of tk in Dset) increases, the importance of tk in Di decreases -> factor log2(N / TotFreqk)
- The weight is assigned based on these assumptions:
  - Wik = Freqik × (log2 N - log2 TotFreqk) = Freqik × log2(N / TotFreqk)
13. More Considerations in Relevance
- Long articles are more likely to be retrieved -> discount the length factor.
- Document feature terms:
  - The most frequently used terms tend to appear in many articles, so they do not serve to distinguish one article from others.
  - Expressiveness of terms: high-frequency, low-frequency, normal-frequency. Terms of normal and low frequency convey more about an article's features/theme.
  - Word class: nouns convey more information (?? vs. ??).
  - Use PoS tags and also a stoplist (Slist).
14.
- Syntactic word-class information: nouns are more related to concepts.
  - [Chinese word examples and their classes; characters lost in extraction]
- Slist: elimination of words that cannot be identified by class.
- Only terms not on the stoplist are used in the frequency calculation.
- Semantic word classification: extracting concepts.
  - [Chinese examples; characters lost in extraction]
- Thesaurus: co-occurrence of related terms.
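A sketch of stoplist filtering before frequency calculation, using invented English stand-ins for the slide's Chinese examples:

```python
# Sketch: only terms not on the stoplist enter the frequency calculation.
# The stoplist and document are invented English stand-ins.
from collections import Counter

stoplist = {"the", "of", "a", "and"}
doc = ["the", "index", "of", "the", "document", "and", "a", "query"]

freq = Counter(t for t in doc if t not in stoplist)
print(freq.most_common())  # -> [('index', 1), ('document', 1), ('query', 1)]
```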
15. Indexing of Phrases (Grammatical Analysis)
- Example: artificial intelligence (人工 智能, segmented as two terms). It is more relevant to index "artificial intelligence" (人工智能) as one unit rather than as two independent terms.
16. Information Retrieval System Architecture
[Diagram: document sources feed an indexing engine that builds the indexed document repository; customer inquiries are converted into an indexed query expression; relevance calculation compares query and documents; results are evaluated and returned by ranking, with query optimization via relevance feedback.]
17. Bilingual (Multilingual) IR
- Retargetable IR approach:
  - Monolingual, but the IR system can be configured for different languages.
- Why bi- or multilingual IR?
  - Bi- and multilingual communities that read text in more than one language want to find text in more than one language!
  - Retrieval of legal information in Chinese and English.
  - Retrieval of reports about a person in newspapers written in different languages.
18.
- Dictionary approach:
  - Normalize indexing and searching into one language (saves storage).
  - Translation equivalences (among multiple possible translations) must be determined during indexing and during extraction of the query's term vector (increases indexing time).
  - It is not easy to obtain good translation equivalences.
  - Many proper nouns are not in the translation dictionary and need to be found and mapped to their corresponding target translations.
  - Inflexible: cannot use user-specified translation equivalences.
19- Multi-lingual indexing approach
- indexing for all different languages
- higher storage cost
- different indexing techniques for different
languages (e.g. English and Chinese) - flexible (can use system supplied or user
supplied translation equivalence) - support exact match in different languages