Title: Information Retrieval (IR)
1. Information Retrieval (IR)
- Information retrieval generally refers to the process of extracting relevant documents according to the information specified in the query.
- IRS vs. DBMS
  - DBMS
    - structured data
    - formal query formulation
    - deterministic
    - returns all relevant data
    - one-off query
  - IRS
    - unstructured text
    - casual query style
    - non-deterministic
    - returns the most relevant data
    - relevance feedback
2. Basic Components of an IRS
[Diagram: input documents (from text editors, file systems, internet files) and a knowledge base of linguistic information feed the IRS core, which returns relevant documents to users in response to a query.]
3. Information Retrieval
- The IR technology
  - Knowledge base: dictionary and rules
  - Basic information representation model
  - Indexing of documents for retrieval
  - Relevance calculation
- Oriental languages vs. English in IR
  - The main difference is in what is considered useful information in each language.
  - Different NLP knowledge and variants of the common methods need to be used.
4. Vector Space Model for Document Representation
- Document D: an article in text form.
- Terms Ti: basic language units, such as words and phrases; D = (T1, T2, ..., Ti, ..., Tn).
  - Ti and Tj may refer to the same word appearing in different places, and the order in which a term appears is also relevant.
- Term weight: each Ti has an associated weight Wi to indicate the importance of Ti to D: D = D(T1 W1, T2 W2, ..., Tn Wn).
- For a given D, if we do not consider word repetition and order, and terms are drawn from a known set T = (t1, t2, ..., tK), where K is the number of words in the vocabulary, then D(T1 W1, T2 W2, ..., Tn Wn) can be represented in the Vector Space Model as D = D(W1, W2, ..., WK).
5. Vector Space Model for Document Representation (cont.)
- (W1, W2, ..., WK) can be considered as a vector.
- (t1, t2, ..., tK) defines the K-dimensional coordinate system, where K is a fixed number.
- Each coordinate indicates the weight of the corresponding term ti.
- Different documents are then considered as different vectors in the VSM.
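As a sketch of this representation (the vocabulary and documents are invented for illustration):

```python
# Sketch: representing documents as weight vectors over a fixed vocabulary.
# The vocabulary and document below are invented examples, not from the slides.
vocabulary = ["information", "retrieval", "database", "query", "index"]

def to_vector(document_terms):
    """Map a document's terms to a weight vector over the fixed vocabulary.

    Here the weight is simply the raw term frequency; repetition and
    order of terms inside the document are otherwise ignored.
    """
    return [document_terms.count(t) for t in vocabulary]

d1 = ["information", "retrieval", "query", "information"]
print(to_vector(d1))  # -> [2, 1, 0, 1, 0]
```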
6. Similarity
- Similarity: the degree of relevance of two documents.
- The degree of relevance of D1 and D2 can be described by a so-called similarity function, Sim(D1, D2), which describes their distance.
- There are many different definitions of Sim(D1, D2).
- One simple definition (inner product):
  - Sim(D1, D2) = Σ(k=1..n) w1k × w2k
- Example: D1 = (1, 0, 0, 1, 1, 1), D2 = (0, 1, 1, 0, 1, 0)
  - Sim(D1, D2) = 1×0 + 0×1 + 0×1 + 1×0 + 1×1 + 1×0 = 1
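The inner-product definition and the example above can be checked with a short sketch:

```python
# Inner-product similarity between two weight vectors, as defined above.
def sim_inner(d1, d2):
    return sum(w1 * w2 for w1, w2 in zip(d1, d2))

# The example from the slide:
D1 = [1, 0, 0, 1, 1, 1]
D2 = [0, 1, 1, 0, 1, 0]
print(sim_inner(D1, D2))  # -> 1
```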
7.
- Another definition (cosine):
  - Sim(D1, D2) = cos θ = (Σ(k=1..n) w1k × w2k) / sqrt((Σ(k=1..n) w1k^2) × (Σ(k=1..n) w2k^2))
- For information retrieval, D2 can be a query Q.
- Suppose there are I documents Di, where i = 1 to I.
- Rank by Sim(Di, Q): the higher the value, the more relevant Di is to Q.
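A minimal sketch of cosine ranking against a query (D1 and D2 are the vectors from the earlier example; D3 and Q are invented):

```python
import math

def sim_cos(d1, d2):
    """Cosine of the angle between two weight vectors (0.0 if either is all-zero)."""
    dot = sum(w1 * w2 for w1, w2 in zip(d1, d2))
    norm = math.sqrt(sum(w * w for w in d1)) * math.sqrt(sum(w * w for w in d2))
    return dot / norm if norm else 0.0

# D1, D2 from the earlier example; D3 and the query Q are invented.
docs = {"D1": [1, 0, 0, 1, 1, 1], "D2": [0, 1, 1, 0, 1, 0], "D3": [1, 1, 0, 0, 0, 1]}
Q = [1, 0, 0, 0, 1, 1]

# Rank documents by decreasing similarity to the query.
ranking = sorted(docs, key=lambda name: sim_cos(docs[name], Q), reverse=True)
print(ranking)  # -> ['D1', 'D3', 'D2']
```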
8. Term Selection for Indexing
- T can be considered as all the terms that can be indexed.
- Approach 1: simply use a word dictionary.
- Approach 2: terms in a dictionary plus terms segmented dynamically -> T is not necessarily static.
  - Every document Di then needs to be segmented.
- The vocabulary for indexing is normally much smaller than the vocabulary of the documents.
- Not every word Tk in Di that is in T will be indexed.
  - A Tk in Di that is in T but is not indexed is considered to have weight wik = 0.
  - In other words, all indexed terms for Di are considered to have weight greater than zero.
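A sketch of this zero-weight convention: only indexed terms are stored, and any other term of T implicitly gets wik = 0 (the terms and weights are invented):

```python
# Sketch: indexed terms of one document stored sparsely; any term of T
# that was not selected for indexing implicitly has weight 0.
indexed_weights = {"retrieval": 2.0, "index": 1.0}   # invented weights

def weight_of(term):
    # Unindexed terms fall back to w_ik = 0.
    return indexed_weights.get(term, 0.0)

print(weight_of("retrieval"), weight_of("database"))  # -> 2.0 0.0
```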
9.
- The process of selecting terms in a Di for indexing is called term selection.
- Word frequency in documents is related to the information the articles intend to convey. Thus word frequency is often used in earlier term-selection and weight-assignment algorithms.
- The Zipf law in information theory:
  - For a given document set, rank the terms according to their frequency. Then (Zipf law): Freq(ti) × rank(ti) ≈ constant
10. The H.P. Luhn Method (Term Selection)
- Suppose N documents form a document set: Dset = {Di}, 1 ≤ i ≤ N.
- (1) Freqik: the frequency of Tk in Di; TotFreqk: the frequency of Tk in Dset.
- (2) Then: TotFreqk = Σ(i=1..N) Freqik
- (3) Sort the TotFreqk in descending order. Select an upper bound and a lower bound, Cu-b and Cl-b, respectively.
- Index only the terms whose TotFreqk lies between Cu-b and Cl-b.
- This uses absolute frequency; the result is sensitive to the choice of Cu-b and Cl-b.
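A sketch of this selection procedure on an invented three-document set, with invented cutoffs for Cu-b and Cl-b:

```python
# Sketch of Luhn-style term selection: keep only terms whose total
# frequency in the document set falls between a lower and an upper cutoff.
# The document set and the cutoffs are invented for illustration.
from collections import Counter

docs = [
    ["the", "retrieval", "of", "information", "the"],
    ["the", "database", "stores", "information"],
    ["the", "index", "supports", "retrieval"],
]

tot_freq = Counter()          # TotFreq_k = sum over documents of Freq_ik
for d in docs:
    tot_freq.update(d)

C_ub, C_lb = 3, 2             # invented upper (Cu-b) and lower (Cl-b) cutoffs
indexed = sorted(t for t, f in tot_freq.items() if C_lb <= f <= C_ub)
print(indexed)  # -> ['information', 'retrieval']  ("the" is too frequent)
```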
11.
- H.P. Luhn's method is a very rough association of term frequency with the information to be conveyed in a document.
- Some low-frequency terms may be very important to a particular article (document), and that may be exactly why they do not appear as often in other articles.
- Keys in term selection for indexing: completeness and accuracy.
  - Related to the article so that it can be indexed for retrieval (completeness).
  - Distinguishes one article from other articles (accuracy and representativeness).
- Example: the term "??" is not an important term in a document set about computers; however, it is probably important in a set about hardware devices.
- -> Use relative frequency.
12. Weight Assignment Algorithm
- Assumptions:
  - As Freqik (the frequency of tk in Di) increases, the importance of tk in Di increases -> factor Freqik
  - As TotFreqk (the frequency of tk in Dset) increases, the importance of tk in Di decreases -> factor log2(N / TotFreqk)
- The weight is assigned based on these assumptions:
  - Wik = Freqik × (log2 N - log2 TotFreqk) = Freqik × log2(N / TotFreqk)
13. More Considerations in Relevance
- Long articles are more likely to be retrieved -> discount the length factor.
- Document feature terms:
  - The most frequently used terms tend to appear in many articles, so they do not serve to distinguish one article from others.
  - Expressiveness of terms: high-frequency, low-frequency, normal-frequency. Terms of normal and low frequency convey more about an article's features/theme.
  - Word class: nouns convey more information (?? vs. ??).
  - Use PoS tags and also a stoplist (Slist).
14.
- Syntactic word-class information: nouns are more related to concepts.
  - [Chinese word examples and their classes; characters lost in extraction]
- Slist: elimination of words that cannot be identified by class.
- Only terms not on the stoplist are used in the frequency calculation.
- Semantic word classification: extracting concepts.
  - [Chinese examples; characters lost in extraction]
- Thesaurus: co-occurrence of related terms.
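A sketch of stoplist filtering before frequency calculation, using invented English stand-ins for the slide's Chinese examples:

```python
# Sketch: only terms not on the stoplist enter the frequency calculation.
# The stoplist and document are invented English stand-ins.
from collections import Counter

stoplist = {"the", "of", "a", "and"}
doc = ["the", "index", "of", "the", "document", "and", "a", "query"]

freq = Counter(t for t in doc if t not in stoplist)
print(freq.most_common())  # -> [('index', 1), ('document', 1), ('query', 1)]
```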
15. Indexing of Phrases (Grammatical Analysis)
- Example: artificial intelligence (人工 智能, segmented as two terms). It is more relevant to index "artificial intelligence" (人工智能) as one unit rather than as two independent terms.
16. Information Retrieval System Architecture
[Diagram: document sources feed an indexing engine that builds the indexed document repository; customer inquiries are converted into an indexed query expression; relevance calculation compares query and documents; results are evaluated and returned by ranking, with query optimization via relevance feedback.]
17. Bilingual (Multilingual) IR
- Retargetable IR approach:
  - Monolingual, but the IR system can be configured for different languages.
- Why bi- or multilingual IR?
  - Bi- and multilingual communities that read text in more than one language want to find text in more than one language!
  - Retrieval of legal information in Chinese and English.
  - Retrieval of reports about a person in newspapers written in different languages.
18.
- Dictionary approach:
  - Normalize indexing and searching into one language (saves storage).
  - Translation equivalences (among multiple possible translations) must be determined during indexing and during extraction of the query's term vector (increases indexing time).
  - It is not easy to obtain good translation equivalences.
  - Many proper nouns are not in the translation dictionary and need to be found and mapped to their corresponding target translations.
  - Inflexible: cannot use user-specified translation equivalences.
19- Multi-lingual indexing approach
- indexing for all different languages
- higher storage cost
- different indexing techniques for different
languages (e.g. English and Chinese) - flexible (can use system supplied or user
supplied translation equivalence) - support exact match in different languages