1
Search Engines e Question Answering
  • Giuseppe Attardi
  • Università di Pisa
  • (some slides borrowed from C. Manning, H. Schütze)

2
Overview
  • Information Retrieval Models
  • Boolean and vector-space retrieval models; ranked
    retrieval; text-similarity metrics; TF-IDF (term
    frequency/inverse document frequency) weighting;
    cosine similarity; performance metrics:
    precision, recall, F-measure
  • Indexing and Search
  • Indexing and inverted files; compression;
    postings lists; query languages
  • Web Search
  • Search engine architecture; crawling
    (parallel/distributed, focused); link analysis
    (Google PageRank); scaling
  • Text Categorization and Clustering
  • Question Answering
  • Information extraction; Named Entity Recognition;
    Natural Language Processing; Part-of-Speech
    tagging; question analysis and semantic matching

3
References
  1. Modern Information Retrieval, R. Baeza-Yates, B.
    Ribeiro-Neto, Addison-Wesley, 1999.
  2. Managing Gigabytes, 2nd Edition, I.H. Witten, A.
    Moffat, T.C. Bell, Morgan Kaufmann, 1999.
  3. Foundations of Statistical Natural Language
    Processing, C. Manning and H. Schütze, MIT Press,
    1999.

4
Motivation
5
Adaptive Computing
  • Desktop Metaphor highly successful in making
    computers popular
  • See Alan Kay's 1975 presentation in Pisa
  • Limitations
  • Point and click involves very elementary actions
  • People are required to perform more and more
    clerical tasks
  • We have become bank clerks, typographers,
    illustrators, librarians

6
Illustrative problem
  • Add a table to a document with results from the
    latest benchmarks and send it to my colleague
    Antonio
  • 7-8 point & click actions just to get to the document
  • 7-8 point & click actions to get to the data
  • Lengthy fiddling with table layout
  • 3-4 point & click actions to retrieve the mail address
  • Etc.

7
Success story
  • Do I care where a document is stored?
  • Will I need a secretary for filing my documents?
  • Search Engines prove that you don't

8
Overcoming Desktop Metaphor
  • Could think of just one possibility:
  • Raise the level of interaction with computers
  • How?
  • Could think of just one possibility:
  • Use natural language

9
Adaptiveness
  • My language is different from yours
  • Should be learned from user interaction
  • See Steels' Talking Heads language games
  • Through implicit interactions
  • Many potential sources (e.g. filing a message in a
    folder → classification)

10
Research Goal
  • Question Answering
  • Techniques
  • Traditional IR tools
  • NLP tools (POS tagging, parser)
  • Complement Knowledge Bases with massive data sets
    of usage (Web)
  • Knowledge extraction tools (NE tagging)
  • Continuous learning

11
IXE Framework
[Architecture diagram: layered components including an OS
abstraction layer (files, memory mapping, threads,
synchronization), Unicode, RegExp, tokenizer, suffix trees,
object store, readers, text, indexer, search, crawler, passage
index, sentence splitter, POS tagger, NE tagger, clustering,
maximum entropy (EventStream, ContextStream, GIS), web service
wrappers, and Python/Perl/Java bindings]
12
Information Retrieval Models
13
Information Retrieval Models
  • A model is an embodiment of the theory in which
    we define a set of objects about which assertions
    can be made and restrict the ways in which
    classes of objects can interact
  • A retrieval model specifies the representations
    used for documents and information needs, and how
    they are compared
  • (Turtle & Croft, 1992)

14
Information Retrieval Model
  • Provides an abstract description of the
    representation used for documents, the
    representation of queries, the indexing process,
    the matching process between a query and the
    documents and the ranking criteria

15
Formal Characterization
  • An Information Retrieval model is a quadruple
    ⟨D, Q, F, R⟩ where
  • D is a set of representations for the documents
    in the collection
  • Q is a set of representations for the user
    information needs (queries)
  • F is a framework for modelling document
    representations, queries, and their relationships
  • R : Q × D → ℝ is a ranking function which
    associates a real number with a query qi ∈ Q and
    a document representation dj ∈ D
  • (Baeza-Yates & Ribeiro-Neto, 1999)

16
Information Retrieval Models
  • Three classic models
  • Boolean Model
  • Vector Space Model
  • Probabilistic Model
  • Additional models
  • Extended Boolean
  • Fuzzy matching
  • Cluster-based retrieval
  • Language models

17
Collections
[Diagram: text input is pre-processed and indexed; an information
need is parsed into a query and matched against the index]
18
Boolean Model
[Venn diagram: documents D1-D11 placed in the regions defined by
terms t1, t2, t3; queries q1-q8 correspond to the eight Boolean
combinations of t1, t2 and t3 (e.g. q1 = t1 AND t2 AND t3), each
retrieving the documents lying in the corresponding region]
19
Boolean Searching
Information need: Measurement of the width of cracks in
prestressed concrete beams
Concepts: cracks, beams, width measurement, prestressed concrete
Formal Query: cracks AND beams AND
width_measurement AND prestressed_concrete
Relaxed Query: (C AND B AND P) OR (C AND B AND W)
OR (C AND W AND P) OR (B AND W AND P)
20
Boolean Problems
  • Disjunctive (OR) queries lead to information
    overload
  • Conjunctive (AND) queries lead to reduced, and
    commonly zero, results
  • Conjunctive queries imply reduction in Recall

21
Boolean Model Assessment
Advantages
  • Complete expressiveness for any identifiable
    subset of the collection
  • Exact and simple to program
  • The whole panoply of Boolean Algebra available
Disadvantages
  • Complex query syntax is often misunderstood (if
    understood at all)
  • Problems of null output and information overload
  • Output is not ordered in any useful fashion

22
Boolean Extensions
  • Fuzzy Logic
  • Adds weights to each term/concept
  • ta AND tb is interpreted as MIN(w(ta),w(tb))
  • ta OR tb is interpreted as MAX (w(ta),w(tb))
  • Proximity/Adjacency operators
  • Interpreted as additional constraints on Boolean
    AND
  • Verity TOPIC system
  • Uses various weighted forms of Boolean logic and
    proximity information in calculating Robertson
    Selection Values (RSV)

23
Vector Space Model
  • Documents are represented as vectors in term
    space
  • Terms are usually stems
  • Documents represented by binary vectors of terms
  • Queries represented the same as documents
  • Query and Document weights are based on length
    and direction of their vector
  • A vector distance measure between the query and
    documents is used to rank retrieved documents

24
Documents in Vector Space
[3-D diagram: documents D1-D11 plotted as points in the space
spanned by term axes t1, t2, t3]
25
Vector Space Documents and Queries
docs   t1  t2  t3   RSV = Q·Di
D1      1   0   1   4
D2      1   0   0   1
D3      0   1   1   5
D4      1   0   0   1
D5      1   1   1   6
D6      1   1   0   3
D7      0   1   0   2
D8      0   1   0   2
D9      0   0   1   3
D10     0   1   1   5
D11     1   0   1   3
Q       1   2   3   (query term weights q1, q2, q3)
[Diagram: the same documents and the query plotted in the 3-D
space of terms t1, t2, t3]
26
Similarity Measures
  • Simple matching (coordination level match)
  • Dice's Coefficient
  • Jaccard's Coefficient
  • Cosine Coefficient
  • Overlap Coefficient
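The accompanying formulas were images in the original slides; the
standard set-based definitions, for a query Q and a document D
viewed as sets of terms, are:

  \text{Simple matching} = |Q \cap D|
  \text{Dice} = \frac{2|Q \cap D|}{|Q| + |D|}
  \text{Jaccard} = \frac{|Q \cap D|}{|Q \cup D|}
  \text{Cosine} = \frac{|Q \cap D|}{\sqrt{|Q|}\sqrt{|D|}}
  \text{Overlap} = \frac{|Q \cap D|}{\min(|Q|, |D|)}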
27
Vector Space with Term Weights
Di = (wdi1, wdi2, ..., wdit)    Q = (wq1, wq2, ..., wqt)
[2-D example plotted on axes Term A and Term B:
Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7)]
28
Problems with Vector Space
  • There is no real theoretical basis for the
    assumption of a term space
  • it is more for visualization than having any real
    basis
  • most similarity measures work about the same
    regardless of model
  • Terms are not really orthogonal dimensions
  • Terms are not independent of all other terms

29
Probabilistic Retrieval
  • Goes back to the 1960s (Maron and Kuhns)
  • Robertson's Probabilistic Ranking Principle
  • Retrieved documents should be ranked in
    decreasing probability that they are relevant to
    the user's query
  • How to estimate these probabilities?
  • Several methods (Model 1, Model 2, Model 3) with
    different emphasis on how estimates are done

30
Probabilistic Models Notation
  • D = all present and future documents
  • Q = all present and future queries
  • (di, qj) = a document-query pair
  • x ⊆ D = a class of similar documents
  • y ⊆ Q = a class of similar queries
  • Relevance is a relation
  • R = {(di, qj) | di ∈ D, qj ∈ Q, di is judged
    relevant by the user submitting qj}

31
Probabilistic model
  • Given D, estimate P(R|D) and P(NR|D)
  • P(R|D) = P(D|R)P(R)/P(D)    (P(D), P(R) constant)
  • ∝ P(D|R)
  • D = {t1 = x1, t2 = x2, ...}

32
Prob. model (cont'd)
For document ranking:
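The ranking formula itself was an image on the original slide;
under the usual binary independence assumptions, with pi = P(ti
present | R) and qi = P(ti present | NR) as estimated on the next
slides, document ranking reduces to

  g(D) = \log \frac{P(D|R)}{P(D|NR)}
       = \sum_{t_i \in D} \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)} + \text{const}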
33
Prob. model (cont'd)

               Relevant docs   Irrelevant docs     All docs
with ti        ri              ni - ri             ni
without ti     Ri - ri         N - Ri - ni + ri    N - ni
Total          Ri              N - Ri              N

  • How to estimate pi and qi?
  • A set of N relevant and irrelevant samples

34
Prob. model (cont'd)
  • Smoothing (Robertson-Sparck Jones formula)
  • When no sample is available:
  • pi = 0.5,
  • qi = (ni + 0.5)/(N + 0.5) ≈ ni/N
  • May be implemented as a VSM
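The smoothed term weight (an equation image on the slide) is
usually given by the Robertson-Sparck Jones formula, obtained by
adding 0.5 to each cell of the table on the previous slide:

  w_i = \log \frac{(r_i + 0.5)(N - R_i - n_i + r_i + 0.5)}
                  {(R_i - r_i + 0.5)(n_i - r_i + 0.5)}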

35
Probabilistic Models
  • Model 1: Probabilistic Indexing, P(R | y, di)
  • Model 2: Probabilistic Querying, P(R | qj, x)
  • Model 3: Merged Model, P(R | qj, di)
  • Model 0: P(R | y, x)
  • Probabilities are estimated based on prior usage
    or relevance estimation

36
Probabilistic Models
  • Rigorous formal model attempts to predict the
    probability that a given document will be
    relevant to a given query
  • Ranks retrieved documents according to this
    probability of relevance (Probability Ranking
    Principle)
  • Relies on accurate estimates of probabilities for
    accurate results

37
Vector and Probabilistic Models
  • Support natural language queries
  • Treat documents and queries the same
  • Support relevance feedback searching
  • Support ranked retrieval
  • Differ primarily in theoretical basis and in how
    the ranking is calculated
  • Vector assumes relevance
  • Probabilistic relies on relevance judgments or
    estimates

38
IR Ranking
39
Ranking models in IR
  • Key idea
  • We wish to return in order the documents most
    likely to be useful to the searcher
  • To do this, we want to know which documents best
    satisfy a query
  • An obvious idea is that if a document talks about
    a topic more, then it is a better match
  • A query should then just specify terms that are
    relevant to the information need, without
    requiring that all of them must be present
  • Document relevant if it has a lot of the terms

40
Binary term presence matrices
  • Record whether a document contains a word:
    document is a binary vector in {0,1}^V
  • What we have mainly assumed so far
  • Idea Query satisfaction overlap measure

  Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth
Antony 1 1 0 0 0 0
Brutus 1 1 0 1 0 0
Caesar 1 1 0 1 1 1
Calpurnia 0 1 0 0 0 0
Cleopatra 1 0 0 0 0 0
mercy 1 0 1 1 1 1
Worser 1 0 1 1 1 0
41
Overlap matching
  • What are the problems with the overlap measure?
  • It doesn't consider
  • Term frequency in document
  • Term scarcity in collection (document mention
    frequency)
  • Length of documents
  • (AND queries score not normalized)

42
Overlap matching
  • One can normalize in various ways
  • Jaccard coefficient
  • Cosine measure
  • What documents would score best using Jaccard
    against a typical query?
  • Does the cosine measure fix this problem?

43
Count term-document matrices
  • We havent considered frequency of a word
  • Count of a word in a document
  • Bag of words model
  • Document is a vector in ℕ^V

Normalization: Calpurnia vs. Calphurnia
  Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth
Antony 157 73 0 0 0 0
Brutus 4 157 0 1 0 0
Caesar 232 227 0 2 1 1
Calpurnia 0 10 0 0 0 0
Cleopatra 57 0 0 0 0 0
mercy 2 0 3 5 5 1
worser 2 0 1 1 1 0
44
Weighting term frequency tf
  • What is the relative importance of
  • 0 vs. 1 occurrence of a term in a doc
  • 1 vs. 2 occurrences
  • 2 vs. 3 occurrences
  • Unclear, but it seems that more is better, though
    a lot isn't necessarily better than a few
  • Can just use the raw score
  • Another option commonly used in practice
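The option referred to above appeared as a formula image; a common
choice (and presumably the one intended here) is the log-scaled
term frequency:

  wf_{t,d} = 1 + \log tf_{t,d}  if tf_{t,d} > 0,  and 0 otherwise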

45
Dot product matching
  • Match is dot product of query and document
  • Note 0 if orthogonal (no words in common)
  • Rank by match
  • It still doesn't consider
  • Term scarcity in collection (document mention
    frequency)
  • Length of documents and queries
  • Not normalized

46
Weighting should depend on the term overall
  • Which of these tells you more about a doc?
  • 10 occurrences of hernia?
  • 10 occurrences of the?
  • Suggest looking at collection frequency (cf)
  • But document frequency (df) may be better
  • Word cf df
  • try 10422 8760
  • insurance 10440 3997
  • Document frequency weighting is only possible in a
    known (static) collection

47
tf x idf term weights
  • tf x idf measure combines
  • term frequency (tf)
  • measure of term density in a doc
  • inverse document frequency (idf)
  • measure of informativeness of the term: its rarity
    across the whole corpus
  • could just be the raw count of the number of
    documents the term occurs in (idfi = 1/dfi)
  • but by far the most commonly used version is shown
    below
  • See Kishore Papineni, NAACL 2, 2002 for
    theoretical justification
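The formula for the commonly used version was an image on the
slide; it is almost certainly the log-scaled inverse document
frequency, with n the total number of documents and dfi the number
of documents containing term i (as defined on the next slide):

  idf_i = \log \frac{n}{df_i}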

48
Summary tf x idf
  • Assign a tf.idf weight to each term i in each
    document d
  • tfi,d = frequency of term i in document d
  • n = total number of documents
  • dfi = number of documents that contain term i
  • Increases with the number of occurrences within a
    doc
  • Increases with the rarity of the term across the
    whole corpus

What is the wt of a term that occurs in all of
the docs?
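The tf.idf weight itself appeared as an equation image; combining
the two factors, it is presumably the standard form

  w_{i,d} = tf_{i,d} \cdot \log \frac{n}{df_i}

which answers the question above: a term occurring in all n docs
gets weight log(n/n) = 0.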
49
Real-valued term-document matrices
  • Function (scaling) of count of a word in a
    document
  • Bag of words model
  • Each is a vector in ℝ^V
  • Here log scaled tf.idf

50
Documents as vectors
  • Each doc j can now be viewed as a vector of
    tf×idf values, one component for each term
  • So we have a vector space
  • terms are axes
  • docs live in this space
  • even with stemming, may have 20,000 dimensions
  • (The corpus of documents gives us a matrix, which
    we could also view as a vector space in which
    words live: transposable data)

51
Why turn docs into vectors?
  • First application Query-by-example
  • Given a doc d, find others like it
  • Now that d is a vector, find vectors (docs)
    near it

52
Intuition
[Diagram: documents d1-d5 drawn as vectors in the space of terms
t1, t2, t3, with the angle φ between two of them highlighted]
Postulate: Documents that are close together
in vector space talk about the same things
53
The vector space model
  • Query as vector
  • We regard the query as a short document
  • We return the documents ranked by the closeness
    of their vectors to the query, also represented
    as a vector
  • Developed in the SMART system (Salton, c. 1970)
    and standardly used by TREC participants and web
    IR systems

54
Desiderata for proximity
  • If d1 is near d2, then d2 is near d1
  • If d1 near d2, and d2 near d3, then d1 is not far
    from d3
  • No doc is closer to d than d itself

55
First cut
  • Distance between vectors d1 and d2 is the length
    of the vector d1 - d2
  • Euclidean distance
  • Why is this not a great idea?
  • We still haven't dealt with the issue of length
    normalization
  • Long documents would be more similar to each
    other by virtue of length, not topic
  • However, we can implicitly normalize by looking
    at angles instead

56
Cosine similarity
  • Distance between vectors d1 and d2 captured by
    the cosine of the angle x between them.
  • Note this is similarity, not distance

57
Cosine similarity
  • Cosine of angle between two vectors
  • The denominator involves the lengths of the
    vectors
  • So the cosine measure is also known as the
    normalized inner product
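The cosine formula (an image on the original slide) is the
normalized inner product:

  sim(d_j, d_k) = \frac{\sum_{i=1}^{n} w_{i,j} w_{i,k}}
                       {\sqrt{\sum_{i=1}^{n} w_{i,j}^2} \sqrt{\sum_{i=1}^{n} w_{i,k}^2}}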

58
Normalized vectors
  • A vector can be normalized (given a length of 1)
    by dividing each of its components by the
    vector's length
  • This maps vectors onto the unit circle
  • Then,
  • Longer documents don't get more weight
  • For normalized vectors, the cosine is simply the
    dot product
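As a concrete illustration of the last point (not from the
original slides), here is a minimal Python sketch showing that
once vectors are length-normalized, cosine similarity reduces to a
plain dot product; the example vectors are those of the earlier
2-D term-weight slide:

import math

def normalize(v):
    # divide each component by the vector's Euclidean length
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v] if length > 0 else v

def cosine(d1, d2):
    # cosine similarity = dot product of the normalized vectors
    n1, n2 = normalize(d1), normalize(d2)
    return sum(a * b for a, b in zip(n1, n2))

q  = [0.4, 0.8]   # query Q = (0.4, 0.8)
d1 = [0.8, 0.3]   # document D1
d2 = [0.2, 0.7]   # document D2
print(cosine(q, d1))   # ~0.73
print(cosine(q, d2))   # ~0.98: D2 points in nearly the same direction as Q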

59
Okapi BM25
  • where
  • and
  • Wd = document length; WAL = average document
    length
  • k1, k3, b = parameters; N = number of docs in the
    collection
  • tfq,t = query-term frequency; tfd,t =
    within-document frequency
  • dft = collection frequency (# of docs that t
    occurs in)
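The BM25 scoring formulas referenced by "where" and "and" were
images on the slide; the standard Okapi form, in the notation
defined above, is

  score(d, q) = \sum_{t \in q} w_t \cdot
      \frac{(k_1 + 1) tf_{d,t}}{K_d + tf_{d,t}} \cdot
      \frac{(k_3 + 1) tf_{q,t}}{k_3 + tf_{q,t}}

  where  K_d = k_1 ((1 - b) + b \frac{W_d}{W_{AL}})
  and    w_t = \log \frac{N - df_t + 0.5}{df_t + 0.5}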

60
Evaluating an IR system
61
Evaluating an IR system
  • What are some measures for evaluating an IR
    system's performance?
  • Speed of indexing
  • Index/corpus size ratio
  • Speed of query processing
  • Relevance of results
  • Note: the information need is translated into a
    Boolean query
  • Relevance is assessed relative to the information
    need, not the query

62
Standard relevance benchmarks
  • TREC - the National Institute of Standards and
    Technology (NIST) has run a large IR testbed for
    many years
  • Reuters and other benchmark sets used
  • Retrieval tasks specified
  • sometimes as queries
  • Human experts mark, for each query and for each
    doc, Relevant or Not relevant
  • or at least for subset that some system returned

63
The TREC experiments
  • Once per year
  • A set of documents and queries is distributed
    to the participants (the standard answers are
    unknown) (April)
  • Participants work (very hard) to construct,
    fine-tune their systems, and submit the answers
    (1000/query) at the deadline (July)
  • NIST people manually evaluate the answers and
    provide correct answers (and a classification of
    IR systems) (July-August)
  • TREC conference (November)

64
TREC evaluation methodology
  • Known document collection (>100K documents) and
    query set (50 queries)
  • Submission of 1000 documents for each query by
    each participant
  • Merge the first 100 documents from each participant
    into a global pool
  • Human relevance judgment of the global pool
  • The other documents are assumed to be irrelevant
  • Evaluation of each system (with 1000 answers)
  • Partial relevance judgments
  • But stable for system ranking

65
Tracks (tasks)
  • Ad Hoc track: given document collection,
    different topics
  • Routing (filtering): stable interests (user
    profile), incoming document flow
  • CLIR: Ad Hoc, but with queries in a different
    language
  • Web: a large set of Web pages
  • Question-Answering: "When did Nixon visit China?"
  • Interactive: put users into action with the system
  • Spoken document retrieval
  • Image and video retrieval
  • Information tracking: new topic / follow up

66
Precision and recall
  • Precision: fraction of retrieved docs that are
    relevant = P(relevant|retrieved)
  • Recall: fraction of relevant docs that are
    retrieved = P(retrieved|relevant)
  • Precision P = tp/(tp + fp)
  • Recall R = tp/(tp + fn)

Relevant Not Relevant
Retrieved tp fp
Not Retrieved fn tn
67
Other measures
  • Precision at a particular cutoff
  • p_at_10
  • Uninterpolated average precision
  • Interpolated average precision
  • Accuracy
  • Error

68
Other measures (cont.)
  • Noise = retrieved irrelevant docs / retrieved
    docs
  • Silence = non-retrieved relevant docs / relevant
    docs
  • Noise = 1 - Precision; Silence = 1 - Recall
  • Fallout = retrieved irrel. docs / irrel. docs
  • Single value measures
  • Average precision: average at 11 points of
    recall
  • Expected search length (no. of irrelevant
    documents to read before obtaining n relevant
    docs)

69
Why not just use accuracy?
  • How to build a 99.9999% accurate search engine on
    a low budget...
  • People doing information retrieval want to find
    something quickly and have a certain tolerance
    for junk

[Mock screenshot: "Snoogle.com" search box]
70
Precision/Recall
  • Can get high recall (but low precision) by
    retrieving all docs for all queries!
  • Recall is a non-decreasing function of the number
    of docs retrieved
  • Precision usually decreases (in a good system)
  • Difficulties in using precision/recall
  • Should average over large corpus/query ensembles
  • Need human relevance judgments
  • Heavily skewed by corpus/authorship

71
General form of precision/recall
  • Precision changes w.r.t. Recall (it is not a
    fixed point)
  • Systems cannot be compared at a single
    Precision/Recall point
  • Average precision (over 11 points of recall: 0.0,
    0.1, ..., 1.0)

72
A combined measure F
  • The combined measure that assesses this tradeoff
    is the F measure (weighted harmonic mean)
  • People usually use the balanced F1 measure
  • i.e., with β = 1 (or α = ½)
  • The harmonic mean is a conservative average
  • See C.J. van Rijsbergen, Information Retrieval
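The F measure formula was an image on the slide; in the usual van
Rijsbergen form it is

  F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}
    = \frac{(\beta^2 + 1) P R}{\beta^2 P + R},
  \qquad F_1 = \frac{2 P R}{P + R}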

73
F1 and other averages
74
MAP (Mean Average Precision)
  • rij = rank of the j-th relevant document for Qi
  • |Ri| = number of relevant documents for Qi
  • n = number of test queries
  • E.g.  Rank 1, 4:   1st rel. doc.
          Rank 5, 8:   2nd rel. doc.
          Rank 10:     3rd rel. doc.
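The MAP formula itself was an image; with the notation above it is
usually written as

  MAP = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{|R_i|}
        \sum_{j=1}^{|R_i|} \frac{j}{r_{ij}}

(j / r_ij is the precision at the rank of the j-th relevant
document).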