Information Retrieval (IR)

1
Information Retrieval (IR)
  • Information retrieval generally refers to the
    process of extracting relevant documents
    according to the information specified in the
    query.
  • IRS vs. DBMS
  • DBMS: structured data; formal query formulation;
    deterministic; returns all relevant data;
    one-off queries
  • IRS: unstructured data; casual query style;
    non-deterministic; returns the most relevant
    data; relevance feedback

2
Basic Components of IRS
(flattened block diagram) Input documents come from text
editors, file systems, and internet files; a knowledge base
supplies linguistics information to the IRS core; users submit
a query, and the IRS core returns the relevant documents.
3
Information Retrieval
  • The IR technology
  • Knowledge base: dictionary and rules
  • Basic information representation model
  • Indexing of documents for retrieval
  • Relevance calculation
  • Oriental languages vs. English in IR
  • The main difference is in what is considered
    useful information in each language
  • Different NLP knowledge and variants of common
    methods need to be used

4
Vector Space Model for document representation
  • Document D: an article in text form
  • Terms T: basic language units, such as words and
    phrases
  • D = (T1, T2, …, Ti, …, Tn)
  • Ti and Tj may refer to the same word appearing in
    different places, and the order in which it
    appears is also relevant.
  • Term weight: each Ti has an associated weight Wi
    indicating the importance of Ti to D
  • D = D(T1, W1; T2, W2; …; Tn, Wn)
  • For a given D, if we do not consider word
    repetition and order, and terms are taken
    against a known set T = (t1, t2, …, tK), where K
    is the number of words in the vocabulary, then
    D(T1, W1; T2, W2; …; Tn, Wn) can be represented
    in the Vector Space Model as D = D(W1, W2, …, WK)
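As a sketch of the representation above (the function name `to_vector` and the toy vocabulary are illustrative, not from the slides), a document can be mapped onto a fixed K-dimensional weight vector, here using raw term frequency as the weight:

```python
# Sketch: representing a document in the Vector Space Model.
# The vocabulary T = (t1, ..., tK) is fixed; a document becomes a
# K-dimensional weight vector (raw term frequencies as weights).
def to_vector(doc_terms, vocabulary):
    freq = {}
    for term in doc_terms:
        freq[term] = freq.get(term, 0) + 1
    # Vocabulary terms absent from the document get weight 0;
    # terms outside the vocabulary are ignored.
    return [freq.get(t, 0) for t in vocabulary]

vocab = ["information", "retrieval", "query", "database"]
doc = ["information", "retrieval", "of", "information"]
print(to_vector(doc, vocab))  # [2, 1, 0, 0]
```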

5
  • Vector Space Model for Document Representation
  • (W1, W2, …, Wk) can be considered as a vector
  • (t1, t2, …, tk) defines a k-dimensional
    coordinate system, where k is a fixed number.
  • Each coordinate indicates the weight of term ti.
  • Different documents are then considered as
    different vectors in the VSM.

6
  • Similarity: the degree of relevance of two
    documents
  • The degree of relevance of D1 and D2 can be
    described by a so-called similarity function,
    Sim(D1, D2), which describes their distance.
  • There are many different definitions of
    Sim(D1, D2)
  • One simple definition (inner product):
  • Sim(D1, D2) = Σ(k=1..n) w1k × w2k
  • Example: D1 = (1,0,0,1,1,1), D2 = (0,1,1,0,1,0)
  • Sim(D1, D2) = 1×0 + 0×1 + 0×1 + 1×0 + 1×1 + 1×0
    = 1
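The inner-product similarity above is one line of code (the function name `sim_inner` is illustrative):

```python
# Inner-product similarity: Sim(D1, D2) = sum over k of w1k * w2k
def sim_inner(d1, d2):
    return sum(w1 * w2 for w1, w2 in zip(d1, d2))

# The worked example from the slide:
D1 = [1, 0, 0, 1, 1, 1]
D2 = [0, 1, 1, 0, 1, 0]
print(sim_inner(D1, D2))  # 1
```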

7
  • Another definition (cosine):
  • Sim(D1, D2) = cos θ
    = (Σ(k=1..n) w1k w2k) /
      sqrt((Σ(k=1..n) w1k²)(Σ(k=1..n) w2k²))
  • For information retrieval, D2 can be a query Q.
  • Suppose there are I documents Di, where i = 1
    to I
  • Rank by Sim(Di, Q): the higher the value, the
    more relevant Di is to Q
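A minimal sketch of cosine similarity and the ranking step described above (function names `sim_cosine` and `rank_docs`, and the toy vectors, are illustrative assumptions):

```python
import math

# Cosine similarity: dot product normalized by the vector lengths.
def sim_cosine(d1, d2):
    dot = sum(a * b for a, b in zip(d1, d2))
    norm = (math.sqrt(sum(a * a for a in d1))
            * math.sqrt(sum(b * b for b in d2)))
    return dot / norm if norm else 0.0

# Rank document indices by descending Sim(Di, Q).
def rank_docs(documents, query):
    return sorted(range(len(documents)),
                  key=lambda i: sim_cosine(documents[i], query),
                  reverse=True)

docs = [[1, 0, 0, 1, 1, 1], [0, 1, 1, 0, 1, 0], [1, 1, 0, 0, 1, 1]]
q = [1, 0, 0, 0, 1, 1]
print(rank_docs(docs, q))  # doc 1 shares only one term with q and ranks last
```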

8
Term Selection for Indexing
  • T can be considered as all the terms that can be
    indexed
  • Approach 1: simply use a word dictionary
  • Approach 2: terms in a dictionary + terms
    segmented dynamically => T is not necessarily
    static
  • Every document Di needs to be segmented
  • The vocabulary for indexing is normally much
    smaller than the vocabulary of the documents.
  • Not every word Tk in Di which is in T will be
    indexed
  • A Tk in Di which is in T but not indexed is
    considered to have weight wik = 0
  • In other words, all indexed terms for Di are
    considered to have weight greater than zero

9
  • The process of selecting the terms in a Di for
    indexing is called term selection
  • Word frequency in documents is related to the
    information the articles intend to convey. Thus
    word frequency is often used in earlier term
    selection and weight assignment algorithms
  • Zipf's law in information theory:
  • For a given document set, rank the terms
    according to their frequency => Zipf's law
  • Freq(ti) × rank(ti) ≈ constant
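The rank-frequency computation behind Zipf's law can be sketched as follows; the toy sentence is far too small to actually exhibit the law, so this only shows how Freq(ti) × rank(ti) would be tabulated on a real corpus:

```python
from collections import Counter

# Tabulate Freq(ti) * rank(ti); Zipf's law predicts this product
# is roughly constant on a large corpus.
text = "the cat sat on the mat and the dog sat on the log".split()
freqs = Counter(text).most_common()  # (term, freq), descending frequency
for rank, (term, freq) in enumerate(freqs, start=1):
    print(term, freq, rank, freq * rank)
```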

10
The H.P. Luhn Method (term selection)
  • Suppose N documents form a document set
  • Dset = {Di}, 1 ≤ i ≤ N
  • (1) Freqik: the frequency of Tk in Di.
  • TotFreqk: the frequency of Tk in Dset.
  • (2) Then,
  • TotFreqk = Σ(i=1..N) Freqik
  • (3) Sort the TotFreqk in descending order. Select
    an upper bound and a lower bound, Cu-b and Cl-b,
    respectively.
  • Index only the terms between Cu-b and Cl-b
  • Absolute frequency; choice of Cu-b and Cl-b
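Steps (1)-(3) can be sketched as below; the function name `luhn_select` and the toy document set are illustrative, and the bounds `c_lb`/`c_ub` stand for Cl-b and Cu-b:

```python
from collections import Counter

# Sketch of Luhn's frequency-threshold term selection: keep only
# terms whose total frequency over the document set lies between
# a lower bound c_lb and an upper bound c_ub (chosen by the indexer).
def luhn_select(doc_set, c_lb, c_ub):
    tot_freq = Counter()
    for doc in doc_set:          # TotFreq_k = sum over i of Freq_ik
        tot_freq.update(doc)
    return {t for t, f in tot_freq.items() if c_lb <= f <= c_ub}

docs = [["the", "index", "term"],
        ["the", "term", "weight"],
        ["the", "zipf"]]
print(luhn_select(docs, c_lb=2, c_ub=2))  # {'term'}: 'the' (3) is too frequent
```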

11
  • H.P. Luhn's method is a very rough association
    of term frequency with the information to be
    conveyed in a document.
  • Some low-frequency terms may be very important
    to a particular article (document), and that may
    be exactly the reason they do not appear so
    often in other articles
  • Keys in term selection for indexing:
    completeness and accuracy
  • Related to the article so that it can be indexed
    for retrieval (completeness)
  • Distinguishes one article from other articles
    (accuracy and representativeness)
  • Example: the term ?? is not an important term in
    a "computer" document set; however, it is
    probably important in a "hardware devices" set
  • Relative frequency

12
Weight Assignment Algorithm
  • Assuming
  • Freqik ? in Di, importance of tk in Di ?
  • Freqik
  • TotFreqk (Frequency of tk) in Dset ?, importance
    of tk in Di ?
  • gt log2(N/TotFreqk)
  • The weight should be assigned based on these
    assumptions
  • Wik Freqik Freqik (log2 N - log2 TotFreqk)
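The weight formula above can be sketched as follows (function name `assign_weights` and the toy documents are illustrative). Note that the slides define TotFreqk as total occurrences over the set, not document frequency, so a term occurring more than N times would get a negative weight:

```python
import math
from collections import Counter

# Sketch of the slide's weight assignment:
# W_ik = Freq_ik * log2(N / TotFreq_k)
#      = Freq_ik * (log2 N - log2 TotFreq_k)
def assign_weights(doc_set):
    n = len(doc_set)
    tot_freq = Counter()
    for doc in doc_set:
        tot_freq.update(doc)  # total occurrences, as on the slide
    weights = []
    for doc in doc_set:
        freq = Counter(doc)
        weights.append({t: f * math.log2(n / tot_freq[t])
                        for t, f in freq.items()})
    return weights

docs = [["ir", "model", "ir"], ["ir", "index"], ["query", "model"]]
w = assign_weights(docs)
print(w[0])  # 'ir' occurs 3 times in 3 docs, so its weight is 0
```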

13
More Considerations in Relevance
  • Long articles are more likely to be retrieved
  • => discount the length factor
  • Document feature terms
  • The most frequently used terms tend to appear in
    more articles, so they do not serve to
    distinguish one article from others.
  • Expressiveness of terms: high-frequency,
    low-frequency, normal-frequency. Terms with
    normal frequency and low-frequency terms convey
    more about the article's features/theme.
  • Word class: nouns convey more information (?? vs.
    ??)
  • Use of PoS tags and also a stoplist (Slist)

14
  • Syntactic word-class information: nouns are more
    related to concepts
  • [Chinese example sentence with its word classes;
    the characters were not preserved in this
    transcript]
  • Slist: elimination of words that cannot be
    identified by class, such as ?
  • Only the terms not on the stoplist will be used
    in the frequency calculation
  • Semantic word classification: extracting concepts
  • [Chinese example; characters not preserved]
  • Thesaurus: co-occurrence of related terms
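The stoplist rule above, only counting terms that are not on the stoplist, can be sketched as follows (the tiny English stoplist is illustrative; a real system would use a language-specific list):

```python
from collections import Counter

# Illustrative stoplist; real systems use larger, language-specific lists.
STOPLIST = {"the", "of", "a", "is"}

# Only terms not on the stoplist enter the frequency counts.
def count_terms(tokens, stoplist=STOPLIST):
    return Counter(t for t in tokens if t not in stoplist)

counts = count_terms("the index of the retrieval model is a model".split())
print(counts)  # stopwords excluded; 'model' counted twice
```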

15
  • Indexing of phrases (grammatical analysis)
  • Example: "artificial intelligence" (????): it is
    more relevant to index "artificial intelligence"
    as one unit rather than as two independent
    terms.

16
Information Retrieval System Architecture
(flattened block diagram) Document sources feed the indexing
engine, which builds the indexed document repository; customer
inquiries are turned into indexed query expressions; relevance
calculation matches queries against the repository; results
are evaluated, ranked, and returned; query optimization uses
relevance feedback, and the retrieved documents are returned
to the user.
17
Bilingual (Multi-lingual) IR
  • Retargetable IR approach:
  • monolingual, but the IR system can be configured
    for different languages
  • Is bi- or multi-lingual IR needed?
  • Bi- and multi-lingual communities that read text
    in more than one language want to find text in
    more than one language!
  • Retrieval of legal information in Chinese and
    English.
  • Retrieval of reports on a person in different
    newspapers written in different languages

18
  • Dictionary approach:
  • normalize indexing and searching into one
    language (saves storage)
  • determination of translation equivalence
    (multiple translations) during indexing and
    during extraction of the term vector of the
    query (increases indexing time)
  • not easy to obtain good translation equivalences
  • many proper nouns not in the translation
    dictionary need to be found and mapped to their
    corresponding target translations
  • inflexible: cannot use user-specified
    translation equivalences

19
  • Multi-lingual indexing approach:
  • indexing for all the different languages
  • higher storage cost
  • different indexing techniques for different
    languages (e.g. English and Chinese)
  • flexible (can use system-supplied or
    user-supplied translation equivalences)
  • supports exact match in different languages