1
Introduction: Information Retrieval (Recherche d'information)
  • Jian-Yun Nie
  • (Based on van Rijsbergen's introduction)

2
Plan
  • Definition
  • History
  • Experimental tradition
  • Methods
  • Query
  • Indexing
  • Matching
  • Results
  • Evaluation
  • Current situation
  • Research

3
Definition
4
Important concepts
  • Document: an entity that contains some
    description of information; it may be in the form
    of text, image, graphic, video, speech, etc.
  • Document collection: a set of documents (may be
    static or dynamic)
  • User
  • Information need: the user's requirement for
    information
  • Query (request): a description of the information
    need, usually in natural language
  • Relevance (relevant document): a document that
    contains the required information (pertinence)
  • Correspondence, degree of relevance, relevance
    score: the degree of belief (by the system)
    that a document is relevant
  • Judge: user/system
  • Indexing: a process that transforms a document
    into a form of internal representation
  • Retrieval: an operation that determines the
    documents to be retrieved
  • Response (answer): the documents returned by the
    system (usually a ranked list)

5
What is IR?
  • (Salton, 1968) Information retrieval is a field
    concerned with the structure, analysis,
    organization, storage, searching, and retrieval
    of information.
  • (Lancaster, 1968) An information retrieval system
    does not inform (i.e. change the knowledge of)
    the user on the subject of his inquiry. It merely
    informs on the existence (or non-existence) and
    whereabouts of documents relating to his request.
  • (Needham, 1977) ...the complexity arises from the
    impossibility of describing the content of a
    document, or the intent of a request, precisely or
    unambiguously

6
Data retrieval vs. IR (VR 79)
7
History
8
Important events in IR (1)
  • 1952 Mooers coins IR
  • 1958 International Conference on Scientific
    Information
  • 1960 Cranfield I
  • 1960 Maron and Kuhns paper
  • 1961 (-1965) Smart built
  • 1964 Washington conference on Association Methods
  • 1966 Cranfield II
  • 1968 Salton's first book
  • 197- Cranfield conferences
  • 1975 CvR's book
  • 1975 Ideal test collection
  • 1976 KSJ/SER JASIS paper

9
Important events in IR (2)
  • 1978 1st SIGIR
  • 1979 1st BCSIRSG
  • 1980 1st joint ACM/BCS conference on IR
  • 1981 KSJ book on IR Experiments
  • 1982 Belkin et al. ASK hypothesis
  • 1983 Okapi started
  • 1985 RIAO-1
  • 1986 CvR logic model
  • 1990 Deerwester et al., LSI paper
  • 1991 InQuery started
  • 1992 TREC-1
  • 1998 Ponte & Croft paper on language models

10
Best known researchers (Salton award)
11
Best known researchers (Salton award)
12
Best known researchers (Salton award)
13
Relevant journals and conferences
  • Journals
  • ACM Transactions on Information Systems (TOIS)
  • Information Processing and Management (IPM)
  • Journal of the American Society for Information
    Science and Technology (JASIST)
  • Information Retrieval
  • Conferences
  • ACM SIGIR
  • CIKM
  • TREC
  • ECIR
  • ACL
  • RIAO
  • CORIA

14
Experimental tradition
15
Tradition of Experiments
  • Strong experimental tradition (from Cranfield)
  • To prove that an IR technique or IR system is
    better, the effectiveness should be measured on
    test data
  • This tradition has had a strong influence on other
    areas (computational linguistics, AI, machine
    translation, etc.)
  • Pros
  • Develop practically effective approaches
  • Experimental evidence to prove a technique
  • Avoid nice, but useless theories
  • Cons
  • Neglect theoretical development
  • Difficult to develop new theories and new
    techniques to compete against established methods
  • Wide use of heuristics, intuitions, manual
    tuning, or tricks

16
Experimental Methodology
  • Cleverdon: Cranfield
  • Developed the Cranfield experiments (funded by
    the National Science Foundation) and introduced
    the concepts of recall and precision to study the
    performance of information retrieval systems.
  • Lancaster: MEDLARS
  • Report on the evaluation of its operating
    efficiency, American Documentation, 20(2),
    119-142, April 1969. Lancaster refined a
    technique of failure analysis for this
    evaluation, seeking to investigate reasons why
    relevant documents were not retrieved.
  • Saracevic: CWRU
  • Theories and experiments related to human
    information behavior, human-computer interaction
    from the human viewpoint, and modeling
    interaction processes in information retrieval.
    Notion of relevance in relation to information
    and information systems. Theoretical and
    pragmatic study of the value of information and
    library services. Nature of information science
    as a field.
  • Salton: SMART
  • "Salton's Magical Automatic Retriever of Text"
  • Vector space model, relevance feedback, tf-idf, ...
  • Sparck Jones: Ideal Test Collection
  • Big document collection, large set of varied
    queries, exhaustive relevance judgments
  • Blair & Maron: STAIRS
  • Law documents, result analysis
  • Harman: TREC
  • Annual experimental contest
  • Large document collections, more realistic
    queries, partial relevance judgments
  • Not an ideal test collection, but more realistic

17
Some References on the Web
  • Cyril W. Cleverdon, The significance of the
    Cranfield tests on index languages, ACM SIGIR,
    1991, pp. 3-12
    (http://portal.acm.org/citation.cfm?id=122861)
  • David C. Blair, M. E. Maron, An evaluation of
    retrieval effectiveness for a full-text
    document-retrieval system, Communications of the
    ACM, v.28 n.3, p.289-299, March 1985
    (http://portal.acm.org/citation.cfm?id=3197)
  • G. Salton, M. E. Lesk, Computer evaluation of
    indexing and text processing, Journal of the ACM
    (JACM), v.15 n.1, p.8-36, Jan. 1968
    (http://portal.acm.org/citation.cfm?id=321441)
  • TREC: http://trec.nist.gov

18
Evaluation
  • (Diagram) A query is evaluated against the document
    collection; the system's answers are compared with
    the desired answers in an evaluation step.
19
Test collection
  • Document collection: a large set of documents
  • Query set: a set of queries (usually 50 or more)
  • Relevance judgments: for each query, the relevant
    documents in the collection are determined
    manually
  • In TREC the judgments are not known prior to
    the experiments

20
Methods
21
Indexing-based IR
  • (Diagram) The document and the query are each
    indexed (query analysis on the query side) into
    internal representations (keywords), which are
    then compared during query evaluation.

22
Query
23
Query Language
  • Artificial/Natural (web)
  • multilingual/cross-lingual
  • images
  • none at all!

24
Query Definition
  • Complete/Incomplete
  • Independence/Dependence
  • Weighted/Unweighted (tf-idf)
  • Query expansion/one shot (feedback, web)
  • Sense disambiguation
  • Cross-lingual

25
Indexing
26
Maron's theory of indexing
  • ...in the case where the query consists of a single
    term, call it B, the probability that a given
    document will be judged relevant by a patron
    submitting B is simply the ratio of the number of
    patrons who submit B as their query and judge
    that document as relevant, to the number of
    patrons who submit B as their search query
  • P(D|B) = P(D,B) / P(B)  (see the sketch below)
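  • A minimal illustration of this estimate in Python,
    using hypothetical patron counts (not from the slides):

    # Maron's estimate: P(D|B) = P(D,B) / P(B), computed from patron counts
    patrons_submitting_B = 200        # patrons who submitted the single-term query B
    patrons_judging_D_relevant = 40   # of those, patrons who judged document D relevant

    p_D_given_B = patrons_judging_D_relevant / patrons_submitting_B
    print(p_D_given_B)                # 0.2: D would be indexed under B with weight 0.2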

27
Representation of Information
  • Discrimination without Representation
    (specificity)
  • Representation with Discrimination (exhaustivity)
  • TF-IDF (see the sketch below)
  • TF: importance of a term for a document
  • IDF: specificity of a term in the document
    collection
  • ...defining a concept of information, ... that
    once this notion is properly explicated a
    document can be represented by the information
    it contains (CvR, 1979)
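  • A minimal tf-idf sketch in Python over a toy
    collection (the raw-count tf and logarithmic idf used
    here are one common variant, not necessarily the exact
    weighting assumed in the slides):

    import math
    from collections import Counter

    docs = [
        "information retrieval is the retrieval of information",
        "a query describes an information need",
        "documents are indexed by keywords",
    ]
    tokenized = [d.split() for d in docs]
    N = len(tokenized)

    # document frequency: in how many documents each term occurs
    df = Counter(t for doc in tokenized for t in set(doc))

    def tf_idf(term, doc_tokens):
        tf = doc_tokens.count(term)      # importance of the term in this document
        idf = math.log(N / df[term])     # specificity of the term in the collection
        return tf * idf

    print(tf_idf("retrieval", tokenized[0]))   # ~2.20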

28
Matching (query evaluation)
29
Matching (query evaluation)
  • Exact/partial match, e.g. SQL / Dice coefficient
  • Boolean matching (Fairthorne, 50)
  • Co-ordination level matching (Cleverdon, 60)
  • Cosine correlation (Salton, 70) VS (see the sketch
    below)
  • Probabilistic (ranking principle) (SER, 80) PRP
  • Logical uncertainty principle (CvR, 90) LUP
  • Bayesian inference (Croft, 90) NET
  • Language modeling (Ponte & Croft, 98)
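  • A minimal sketch of partial matching by cosine
    correlation between term-count vectors, with
    hypothetical toy data (real systems would typically
    use tf-idf weights rather than raw counts):

    import math
    from collections import Counter

    def cosine(query, document):
        q, d = Counter(query.split()), Counter(document.split())
        dot = sum(q[t] * d[t] for t in q)                 # overlap of term weights
        norm_q = math.sqrt(sum(v * v for v in q.values()))
        norm_d = math.sqrt(sum(v * v for v in d.values()))
        return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

    # rank documents by decreasing similarity to the query
    docs = ["information retrieval systems",
            "database query language",
            "retrieval of information"]
    print(sorted(docs, key=lambda d: cosine("information retrieval", d), reverse=True))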

30
Inference
  • Deduction/Induction: from A and A → B, infer B
  • Cluster Hypothesis
  • Association Hypothesis
  • P(term1 | term2)  (estimated from co-occurrence;
    see the sketch below)
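  • A minimal sketch of estimating the association
    P(term1 | term2) from document co-occurrence counts,
    with hypothetical figures:

    # P(term1 | term2) estimated as co-occurrence count / occurrence count
    docs_with_term2 = 500       # documents containing term2
    docs_with_both = 120        # documents containing both term1 and term2

    p_term1_given_term2 = docs_with_both / docs_with_term2
    print(p_term1_given_term2)  # 0.24: candidate association for query expansion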

31
Logic
  • It is a common fallacy, underwritten at this date
    by the investment of several million dollars in a
    variety of retrieval hardware, that the algebra
    of Boole (1847) is the appropriate formalism for
    retrieval design... The logic of Brouwer, as
    invoked by Fairthorne, is one such weakening of
    the postulate system, (Mooers, 1961)
  • Another one
  • Logical Uncertainty Principle (CvR, 1986)

32
Logic
  • If Mark were to lose his job, he would work less
  • If Mark were to work less, he would be less tense
  • If Mark were to lose his job, he would be less
    tense

33
Cluster Hypothesis
  • If document X is closely associated with Y, then
    over the population of potential queries the
    probability of relevance for X will be
    approximately the same as the probability of
    relevance for Y, or in symbols
  • P(relevance | X) ≈ P(relevance | Y)
  • Document clustering

34
Association Hypothesis
  • If one index term X is good at discriminating
    relevant from non-relevant documents, then any
    closely associated index term Y is also likely to
    be good at this.
  • P(relevance | X) ≈ P(relevance | Y)
  • Query expansion

35
Models
  • Boolean
  • Vector Space (metrics) - mixture of things
  • Probabilistic (3 models)
  • Logical (implication) - what kind of logic
  • Language models
  • Cognitive (users)

36
Retrieval Result
37
Items Wanted
  • Matching/Relevant or Correct/Useful
  • The function of a document retrieval system
    cannot be to retrieve all and only the relevant
    documents... but to guide the patron in his
    search for information (Maron)
  • Topical/tasks
  • Meaning/content

38
Some difficulties with relevance
  • Goffman, 1969
  • ..that the relevance of the information from one
    document depends upon what is already known about
    the subject, and in turn affects the relevance of
    other documents subsequently examined.
  • Maron:
  • Just because a document is about the subject
    sought by a patron, that fact does not imply that
    he would judge it relevant.

39
Relevance (Borlund, 2000)
  • That is, the relevance or irrelevance of a given
    retrieved document may affect the user's current
    state of knowledge, resulting in a change of the
    user's information need, which may lead to a
    change of the user's perception/interpretation
    of the subsequently retrieved documents.

40
Evaluation
41
Error Response
  • Precision error: an irrelevant document is retrieved
  • P = |relevant ∩ retrieved| / |retrieved|
  • Recall error: a relevant document is not
    retrieved
  • R = |relevant ∩ retrieved| / |relevant|
  • Trade-off
  • F-measure: F = 2PR / (P + R)  (see the sketch below)
  • How to cope with lack of recall?
  • Cranfield → Ideal test collection → TREC
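  • A minimal sketch of these measures for one query in
    Python, with hypothetical judgments (3 of the 5
    retrieved documents are relevant, out of 6 relevant
    documents overall):

    def precision_recall_f(retrieved, relevant):
        retrieved, relevant = set(retrieved), set(relevant)
        hits = len(retrieved & relevant)
        p = hits / len(retrieved) if retrieved else 0.0  # fraction of retrieved docs that are relevant
        r = hits / len(relevant) if relevant else 0.0    # fraction of relevant docs that are retrieved
        f = 2 * p * r / (p + r) if p + r else 0.0        # harmonic mean of P and R
        return p, r, f

    print(precision_recall_f(["d1", "d2", "d3", "d4", "d5"],
                             ["d1", "d3", "d5", "d7", "d8", "d9"]))
    # (0.6, 0.5, 0.545...)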

42
What is a relevant document?
  • Relevance is:
  • The correspondence between a document and a
    query; a measure of how informative the document
    is for the query
  • A degree of relation (overlap, relatedness, ...)
    between document and query
  • A measure of utility of the document to the user
  • Judged by user / system
  • User relevance / system relevance

43
How should relevance be judged?
  • Relevance is dependent on
  • Document contents
  • Information need (query)
  • Time constraint
  • Purpose of retrieval
  • Retrieval environment
  • Computer/connection speed
  • User interface
  • Domain of application (newspaper articles, law,
    patent, medicine, ...)
  • User's knowledge
  • About the domain of application
  • About the system

44
How is relevance judged? (TREC)
  • Candidate answers are obtained by merging the
    answers from different participating systems
  • Several human assessors judge for the same query
  • Agreement/disagreement
  • Binary value (rel. / irrel.) / multi-valued
  • Workable strategy but potential problems
  • Some relevant documents may not be found by any
    system
  • Subjective judgments
  • Disagreement between assessors and participants
    (but participants usually respect the judgments
    of assessors)

45
Practice vs. Experiments
  • Practice
  • Web
  • Electronic Publishing
  • Task-oriented IR
  • Data Mining
  • Knowledge Discovery
  • Distance learning
  • Video/film asset management
  • Experiments / TREC
  • HCI
  • Visualisation
  • Work in Context, Cognitive approaches
  • Cross-lingual
  • Cross-media
  • Corpus-based IR (incl. WordNet, etc.)
  • Digital Libraries
  • CBIR (Content-Based Image R)
  • TDT (Topic Detection and Tracking)

46
Research themes
  • Discrimination/Representation
  • Data fusion
  • Authority/importance models (e.g. PageRank)
  • Logic Uncertainty models
  • Filtering/Routing
  • Language models
  • Summarisation
  • IR + DBMS (incl. XML, etc.)
  • Clustering the web
  • Visualising the web
  • Living with single term queries
  • Living with no queries
  • Scale free networks
  • Trading media (text helps images!)
  • Temporal dimensions (topics,events)
  • Evaluation (Time to dump P and R?)
  • NLP in IR

47
Current situation
48
Where are we now in IR?
  • Landmarks
  • Hypotheses/Principles
  • Postulates of Impotence
  • Long-term challenges
  • Areas of research

49
Landmarks
  • Luhn's tf weighting
  • Architecture
  • Relevance Feedback
  • Stemming
  • Poisson Model → BM25
  • Statistical weighting: tf-idf
  • Various models

50
Hypotheses/Principles
  • Items may be associated without apparent meaning
    but exploiting their association may help
    retrieval
  • P/R trade-off (ABNO/OBNA)
  • Exhaustivity/Specificity
  • Cluster Hypothesis
  • Association Hypothesis
  • Probability Ranking Principle
  • Logical Uncertainty Principle
  • ASK
  • Polyrepresentation

51
Postulates of Impotence (according to Swanson, 1988)
  • An information need cannot be expressed
    independently of context
  • It is impossible to instruct a machine to
    translate a request into adequate search terms
  • A document's relevance depends on other documents
    already seen
  • It is never possible to verify whether all
    relevant documents have been found
  • Machines cannot recognize meaning → can't beat
    human indexing, etc.

52
... more postulates
  • Word-occurrence statistics can neither represent
    meaning nor substitute for it
  • The ability of an IR system to support an
    iterative process cannot be evaluated in terms of
    single-iteration human relevance judgment
  • You can have either subtle relevance judgments or
    highly effective mechanised procedures, but not
    both
  • Thus, consistently effective fully automatic
    indexing and retrieval is not possible

53
Long-term Challenges workshop, UMass, 9/2002
  • Global information access: satisfy human
    information needs through natural, efficient
    interaction with an automated system that
    leverages world-wide structured and unstructured
    data in any language.
  • Contextual retrieval: combine search technologies
    and knowledge about query and user context into a
    single framework in order to provide the most
    appropriate answer for a user's information
    need.

54
Areas of Research
  • How does the brain do it? (neuroscience)
  • How do we see to retrieve? (computer vision)
  • How do we reduce dimensionality in dynamic
    fashion? (Statistics)
  • What is a good logic for IR? (mathematical logic)
  • What is a good theory of uncertainty?
    (frequency/geometry)
  • How do we model context? (HCI)
  • How do we formally capture interaction?
  • How do we capture implicit/tacit information?
  • Is there a theory of information for IR?

55
Images, not Text: how might that make a difference?
  • no visual keywords (yet)
  • tf/idf issue
  • aboutness revisable (e.g. Maron)
  • relevance revisable (e.g. Goffman)
  • feedback requires salience
  • aboutness → relevance → aboutness

56
Text vs. image
  • Keywords vs. visual features
  • Frequency vs. ?
  • Meaning vs. object/form/color
  • Grammar vs. geometry
  • Salience vs. salience
  • Relevance vs. path dependent
  • Query expansion vs. example image

57
References
  • Van Rijsbergen's talks
  • http://ir.dcs.gla.ac.uk/oldseminars/Keith2.ppt
  • http://www-clips.imag.fr/mrim/essir03/PDF/4.Rijsbergen.pdf
  • S. E. Robertson, Computer Retrieval
    (http://www.soi.city.ac.uk/ser/papers/j_doc_history/npap.html)