Title: Search Engines and Question Answering
1 Search Engines and Question Answering
- Giuseppe Attardi
- Università di Pisa
- (some slides borrowed from C. Manning, H. Schütze)
2 Overview
- Information Retrieval Models
  - Boolean and vector-space retrieval models, ranked retrieval, text-similarity metrics, TF-IDF (term frequency / inverse document frequency) weighting, cosine similarity, performance metrics: precision, recall, F-measure
- Indexing and Search
  - Indexing and inverted files, compression, postings lists, query languages
- Web Search
  - Search engine architecture, crawling (parallel/distributed, focused), link analysis (Google PageRank), scaling
- Text Categorization and Clustering
- Question Answering
  - Information extraction, Named Entity Recognition, Natural Language Processing, Part-of-Speech tagging, question analysis and semantic matching
3 References
- Modern Information Retrieval, R. Baeza-Yates, B. Ribeiro-Neto, Addison Wesley
- Managing Gigabytes, 2nd Edition, I.H. Witten, A. Moffat, T.C. Bell, Morgan Kaufmann, 1999
- Foundations of Statistical Natural Language Processing, C. Manning and H. Schütze, MIT Press, 1999
4 Motivation
5 Adaptive Computing
- The Desktop Metaphor has been highly successful in making computers popular
  - See Alan Kay's 1975 presentation in Pisa
- Limitations
  - Point and click involves very elementary actions
  - People are required to perform more and more clerical tasks
  - We have become bank clerks, typographers, illustrators, librarians
6 Illustrative problem
- Add a table with results from the latest benchmarks to a document and send it to my colleague Antonio
  - 7-8 point&clicks just to get to the document
  - 7-8 point&clicks to get to the data
  - Lengthy fiddling with table layout
  - 3-4 point&clicks to retrieve the mail address
  - Etc.
7 Success story
- Do I care where a document is stored?
- Do I need a secretary for filing my documents?
- Search Engines prove that you don't
8 Overcoming the Desktop Metaphor
- Could think of just one possibility:
  - Raise the level of interaction with computers
- How?
- Could think of just one possibility:
  - Use natural language
9 Adaptiveness
- My language is different from yours
- It should be learned from user interaction
  - See Steels' Talking Heads language games
- Through implicit interactions
  - Many potential sources (e.g. filing a message in a folder → classification)
10 Research Goal
- Question Answering
- Techniques
  - Traditional IR tools
  - NLP tools (POS tagging, parser)
  - Complement Knowledge Bases with massive data sets of usage (the Web)
  - Knowledge extraction tools (NE tagging)
  - Continuous learning
11 IXE Framework
(Architecture diagram: components of the IXE framework)
- Wrappers: Python, Perl, Java; Web Service
- Passage Index, Indexer, Search, Crawler, Readers
- Sentence Splitter, POS Tagger, NE Tagger, Clustering
- MaxEntropy: EventStream, ContextStream, GIS
- Unicode, RegExp, Tokenizer, Suffix Trees
- Text, Object Store
- OS Abstraction: Files, Memory Mapping, Threads, Synchronization
12 Information Retrieval Models
13 Information Retrieval Models
- A model is an embodiment of the theory in which we define a set of objects about which assertions can be made and restrict the ways in which classes of objects can interact
- A retrieval model specifies the representations used for documents and information needs, and how they are compared
- (Turtle & Croft, 1992)
14 Information Retrieval Model
- Provides an abstract description of the representation used for documents, the representation of queries, the indexing process, the matching process between a query and the documents, and the ranking criteria
15 Formal Characterization
- An Information Retrieval model is a quadruple ⟨D, Q, F, R⟩ where
  - D is a set of representations for the documents in the collection
  - Q is a set of representations for the user information needs (queries)
  - F is a framework for modelling document representations, queries, and their relationships
  - R : Q × D → ℝ is a ranking function which associates a real number with a query qi ∈ Q and a document representation dj ∈ D
- (Baeza-Yates & Ribeiro-Neto, 1999)
16 Information Retrieval Models
- Three classic models
  - Boolean Model
  - Vector Space Model
  - Probabilistic Model
- Additional models
  - Extended Boolean
  - Fuzzy matching
  - Cluster-based retrieval
  - Language models
17 Collections
(Diagram: indexing and retrieval pipeline: text input is parsed, pre-processed, and indexed; the user's information need is matched against the index)
18 Boolean Model
(Figure: Venn diagram of documents D1-D11 and the Boolean queries q1-q8 expressible over terms t1, t2, t3)
19 Boolean Searching
- Information need: "Measurement of the width of cracks in prestressed concrete beams"
- Formal query: cracks AND beams AND width_measurement AND prestressed_concrete
- Relaxed query: (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)
(Figure: Venn diagram of the concepts Cracks, Beams, Width measurement, Prestressed concrete)
20 Boolean Problems
- Disjunctive (OR) queries lead to information overload
- Conjunctive (AND) queries lead to reduced, and commonly zero, results
- Conjunctive queries imply a reduction in Recall
21 Boolean Model Assessment
- Advantages
  - Complete expressiveness for any identifiable subset of the collection
  - Exact and simple to program
  - The whole panoply of Boolean Algebra is available
- Disadvantages
  - Complex query syntax is often misunderstood (if understood at all)
  - Problems of null output and information overload
  - Output is not ordered in any useful fashion
22 Boolean Extensions
- Fuzzy Logic (sketched below)
  - Adds weights to each term/concept
  - ta AND tb is interpreted as MIN(w(ta), w(tb))
  - ta OR tb is interpreted as MAX(w(ta), w(tb))
- Proximity/Adjacency operators
  - Interpreted as additional constraints on Boolean AND
- Verity TOPIC system
  - Uses various weighted forms of Boolean logic and proximity information in calculating Robertson Selection Values (RSV)
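A minimal sketch of this fuzzy-Boolean interpretation (the term weights and the helper names fuzzy_and/fuzzy_or are illustrative, not taken from the Verity system):

    # Fuzzy-Boolean scoring: AND -> minimum of the term weights, OR -> maximum.
    # Term weights w(t, d) are assumed precomputed and normalized to [0, 1].
    def fuzzy_and(*weights):
        return min(weights)

    def fuzzy_or(*weights):
        return max(weights)

    # Example: score of (cracks AND beams) OR prestressed for one document,
    # with hypothetical term weights.
    w = {"cracks": 0.7, "beams": 0.4, "prestressed": 0.9}
    score = fuzzy_or(fuzzy_and(w["cracks"], w["beams"]), w["prestressed"])
    print(score)  # 0.9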
23 Vector Space Model
- Documents are represented as vectors in term space
  - Terms are usually stems
  - Documents represented by binary vectors of terms
- Queries are represented the same way as documents
- Query and document weights are based on the length and direction of their vectors
- A vector distance measure between the query and documents is used to rank retrieved documents
24 Documents in Vector Space
(Figure: documents D1-D11 plotted in the three-dimensional term space t1, t2, t3)
25 Vector Space Documents and Queries

  docs  t1  t2  t3   RSV = Q·Di
  D1     1   0   1   4
  D2     1   0   0   1
  D3     0   1   1   5
  D4     1   0   0   1
  D5     1   1   1   6
  D6     1   1   0   3
  D7     0   1   0   2
  D8     0   1   0   2
  D9     0   0   1   3
  D10    0   1   1   5
  D11    1   0   1   3
  Q      1   2   3   (query weights q1, q2, q3)

(Figure: the same documents and the query plotted in the t1, t2, t3 term space)
26 Similarity Measures
- Simple matching (coordination level match)
- Dice's Coefficient
- Jaccard's Coefficient
- Cosine Coefficient
- Overlap Coefficient
- (The formulas are sketched below)
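The coefficient formulas appeared as figures on the original slide; the sketch below gives the standard set-based definitions over binary term representations (the example document and query sets are illustrative):

    # Standard similarity coefficients over binary (set-of-terms) representations.
    def simple_match(x, y):            # coordination level match
        return len(x & y)

    def dice(x, y):
        return 2 * len(x & y) / (len(x) + len(y))

    def jaccard(x, y):
        return len(x & y) / len(x | y)

    def cosine(x, y):
        return len(x & y) / ((len(x) * len(y)) ** 0.5)

    def overlap(x, y):
        return len(x & y) / min(len(x), len(y))

    d = {"cracks", "beams", "concrete"}
    q = {"cracks", "width", "concrete"}
    print(simple_match(d, q), dice(d, q), jaccard(d, q), overlap(d, q))  # 2, 0.667, 0.5, 0.667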
27 Vector Space with Term Weights
- Di = (wdi1, wdi2, ..., wdit), Q = (wqi1, wqi2, ..., wqit)
- Example in a two-term space (Term A, Term B):
  Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7)
(Figure: Q, D1 and D2 plotted in the Term A / Term B plane)
28 Problems with Vector Space
- There is no real theoretical basis for the assumption of a term space
  - It is more for visualization than having any real basis
  - Most similarity measures work about the same regardless of model
- Terms are not really orthogonal dimensions
  - Terms are not independent of all other terms
29 Probabilistic Retrieval
- Goes back to the 1960s (Maron and Kuhns)
- Robertson's Probabilistic Ranking Principle
  - Retrieved documents should be ranked in decreasing probability that they are relevant to the user's query
- How to estimate these probabilities?
  - Several methods (Model 1, Model 2, Model 3) with different emphasis on how estimates are done
30 Probabilistic Models: Notation
- D: all present and future documents
- Q: all present and future queries
- (di, qj): a document-query pair
- x ⊆ D: class of similar documents
- y ⊆ Q: class of similar queries
- Relevance is a relation:
  R = {(di, qj) | di ∈ D, qj ∈ Q, di is judged relevant by the user submitting qj}
31 Probabilistic model
- Given D, estimate P(R|D) and P(NR|D)
- P(R|D) = P(D|R)P(R)/P(D)   (P(D), P(R) constant)
  ∝ P(D|R)
- D = (t1 = x1, t2 = x2, ...), where xi = 1 if ti occurs in D and 0 otherwise
- Assuming term independence:
  P(D|R) = Πi pi^xi (1 − pi)^(1−xi), with pi = P(ti = 1|R)
  P(D|NR) = Πi qi^xi (1 − qi)^(1−xi), with qi = P(ti = 1|NR)
32 Prob. model (cont'd)
- For document ranking:
  sim(D, Q) = log [P(D|R) / P(D|NR)] ∝ Σ over terms ti present in D and Q of log [pi (1 − qi)] / [qi (1 − pi)]
33 Prob. model (cont'd)

                Relevant docs    Irrelevant docs       All docs
  with ti       ri               ni − ri               ni
  without ti    Ri − ri          N − Ri − ni + ri      N − ni
  total         Ri               N − Ri                N

- How to estimate pi and qi?
- Use a set of N relevant and irrelevant samples (table above):
  pi = ri / Ri,  qi = (ni − ri) / (N − Ri)
34 Prob. model (cont'd)
- Smoothing (Robertson-Sparck Jones formula, sketched below):
  wi = log [ (ri + 0.5)(N − Ri − ni + ri + 0.5) / ((Ri − ri + 0.5)(ni − ri + 0.5)) ]
- When no sample is available:
  - pi = 0.5
  - qi = (ni + 0.5) / (N + 0.5) ≈ ni / N
- May be implemented as a VSM
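A minimal sketch of the smoothed estimates above (the function name and the no-sample example are illustrative; counts follow the notation of slide 33):

    import math

    # Robertson-Sparck Jones relevance weight for term ti, with 0.5 smoothing.
    # ri = relevant docs containing ti, Ri = relevant docs, ni = docs containing ti,
    # N = docs in the sample (notation of slide 33).
    def rsj_weight(ri, Ri, ni, N):
        pi = (ri + 0.5) / (Ri + 1.0)              # smoothed estimate of P(ti | R)
        qi = (ni - ri + 0.5) / (N - Ri + 1.0)     # smoothed estimate of P(ti | NR)
        return math.log(pi * (1 - qi) / (qi * (1 - pi)))

    # With no relevance sample (ri = Ri = 0) this reduces to pi = 0.5 and
    # qi ≈ ni / N, i.e. an idf-like weight, as stated on the slide.
    print(rsj_weight(0, 0, 100, 10000))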
35 Probabilistic Models
- Model 1: Probabilistic Indexing, P(R | y, di)
- Model 2: Probabilistic Querying, P(R | qj, x)
- Model 3: Merged Model, P(R | qj, di)
- Model 0: P(R | y, x)
- Probabilities are estimated based on prior usage or relevance estimation
36 Probabilistic Models
- Rigorous formal model: attempts to predict the probability that a given document will be relevant to a given query
- Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle)
- Relies on accurate estimates of probabilities for accurate results
37 Vector and Probabilistic Models
- Support natural language queries
- Treat documents and queries the same
- Support relevance feedback searching
- Support ranked retrieval
- Differ primarily in their theoretical basis and in how the ranking is calculated
  - The vector model assumes relevance
  - The probabilistic model relies on relevance judgments or estimates
38 IR Ranking
39 Ranking models in IR
- Key idea
  - We wish to return, in order, the documents most likely to be useful to the searcher
  - To do this, we want to know which documents best satisfy a query
- An obvious idea: if a document talks about a topic more, then it is a better match
- A query should then just specify terms that are relevant to the information need, without requiring that all of them be present
  - A document is relevant if it has a lot of the query terms
40 Binary term presence matrices
- Record whether a document contains a word: each document is a binary vector in {0,1}^V
  - What we have mainly assumed so far
- Idea: query satisfaction = overlap measure

  Term        Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
  Antony      1                     1              0            0       0        0
  Brutus      1                     1              0            1       0        0
  Caesar      1                     1              0            1       1        1
  Calpurnia   0                     1              0            0       0        0
  Cleopatra   1                     0              0            0       0        0
  mercy       1                     0              1            1       1        1
  worser      1                     0              1            1       1        0
41 Overlap matching
- What are the problems with the overlap measure?
- It doesn't consider:
  - Term frequency in the document
  - Term scarcity in the collection (document mention frequency)
  - Length of documents
  - (AND queries: score not normalized)
42 Overlap matching
- One can normalize in various ways:
  - Jaccard coefficient
  - Cosine measure
- What documents would score best using Jaccard against a typical query?
- Does the cosine measure fix this problem?
43 Count term-document matrices
- We haven't considered the frequency of a word
- Count of a word in a document
  - Bag of words model
  - Document is a vector in ℕ^V
- Normalization: Calpurnia vs. Calphurnia

  Term        Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
  Antony      157                   73             0            0       0        0
  Brutus      4                     157            0            1       0        0
  Caesar      232                   227            0            2       1        1
  Calpurnia   0                     10             0            0       0        0
  Cleopatra   57                    0              0            0       0        0
  mercy       2                     0              3            5       5        1
  worser      2                     0              1            1       1        0
44 Weighting term frequency: tf
- What is the relative importance of
  - 0 vs. 1 occurrence of a term in a doc?
  - 1 vs. 2 occurrences?
  - 2 vs. 3 occurrences?
- Unclear, but it seems that more is better, yet a lot isn't necessarily better than a few
- Can just use the raw count
- Another option commonly used in practice: a sublinear scaling of tf, sketched below
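The slide showed this other option as a formula; a commonly used choice, assumed here since the original formula was lost, is the log-scaled tf:

    import math

    # Log-scaled term frequency: more occurrences help, with diminishing returns.
    def wf(tf):
        return 1.0 + math.log(tf) if tf > 0 else 0.0

    for tf in (0, 1, 2, 3, 10, 100):
        print(tf, round(wf(tf), 2))   # 0.0, 1.0, 1.69, 2.1, 3.3, 5.61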
45 Dot product matching
- Match score is the dot product of query and document vectors (see the sketch below)
  - Note: 0 if orthogonal (no words in common)
- Rank by match score
- It still doesn't consider:
  - Term scarcity in the collection (document mention frequency)
  - Length of documents and queries
  - Not normalized
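A minimal sketch of dot-product matching over sparse term vectors (the document names, terms, and raw-count weights are illustrative):

    # Rank documents by the dot product of their term-weight vectors with the query.
    # Vectors are sparse dicts {term: weight}; raw counts are used for brevity.
    def dot(q, d):
        return sum(w * d.get(t, 0.0) for t, w in q.items())

    docs = {
        "D1": {"caesar": 2, "brutus": 1},
        "D2": {"caesar": 1, "calpurnia": 3},
        "D3": {"mercy": 4},
    }
    query = {"caesar": 1, "brutus": 1}
    ranking = sorted(docs, key=lambda name: dot(query, docs[name]), reverse=True)
    print([(name, dot(query, docs[name])) for name in ranking])
    # [('D1', 3.0), ('D2', 1.0), ('D3', 0.0)] -- 0 when no words in common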
46 Weighting should depend on the term overall
- Which of these tells you more about a doc?
  - 10 occurrences of hernia?
  - 10 occurrences of the?
- Suggests looking at collection frequency (cf)
- But document frequency (df) may be better:

  Word        cf      df
  try         10422   8760
  insurance   10440   3997

- Document frequency weighting is only possible in a known (static) collection
47 tf x idf term weights
- The tf x idf measure combines
  - term frequency (tf)
    - a measure of term density in a doc
  - inverse document frequency (idf)
    - a measure of the informativeness of a term: its rarity across the whole corpus
    - could just be the raw count of the number of documents the term occurs in (idf_i = 1/df_i)
    - but by far the most commonly used version is idf_i = log(n / df_i)
- See Kishore Papineni, NAACL 2, 2002 for theoretical justification
48 Summary: tf x idf
- Assign a tf.idf weight to each term i in each document d:
  w(i,d) = tf(i,d) × log(n / df_i)
  - tf(i,d) = frequency of term i in document d
  - n = total number of documents
  - df_i = number of documents that contain term i
- Increases with the number of occurrences within a doc
- Increases with the rarity of the term across the whole corpus
- What is the weight of a term that occurs in all of the docs? log(n/n) = 0, so its weight is 0 (see the sketch below)
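A minimal sketch of the weight just defined (the helper name and the toy numbers are illustrative):

    import math

    # tf.idf weight of term i in document d:  w(i, d) = tf(i, d) * log(n / df_i)
    def tf_idf(tf_id, df_i, n_docs):
        return tf_id * math.log(n_docs / df_i)

    n = 1000
    print(tf_idf(3, 10, n))    # rare term: high weight (~13.8)
    print(tf_idf(3, 1000, n))  # term occurring in every doc: log(1) = 0, so weight 0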
49 Real-valued term-document matrices
- Function (scaling) of the count of a word in a document
  - Bag of words model
  - Each document is a vector in ℝ^V
- Here: log-scaled tf.idf
50 Documents as vectors
- Each doc j can now be viewed as a vector of tf×idf values, one component for each term
- So we have a vector space
  - terms are axes
  - docs live in this space
  - even with stemming, we may have 20,000 dimensions
- (The corpus of documents gives us a matrix, which we could also view as a vector space in which words live: transposable data)
51 Why turn docs into vectors?
- First application: Query-by-example
  - Given a doc d, find others like it
- Now that d is a vector, find vectors (docs) "near" it
52 Intuition
(Figure: documents d1-d5 and a query plotted in the term space t1, t2, t3)
- Postulate: documents that are "close together" in vector space talk about the same things
53 The vector space model
- Query as vector
  - We regard the query as a short document
  - We return the documents ranked by the closeness of their vectors to the query, also represented as a vector
- Developed in the SMART system (Salton, c. 1970) and standardly used by TREC participants and web IR systems
54 Desiderata for proximity
- If d1 is near d2, then d2 is near d1
- If d1 is near d2, and d2 is near d3, then d1 is not far from d3
- No doc is closer to d than d itself
55 First cut
- Distance between vectors d1 and d2 is the length of the vector d1 − d2
  - Euclidean distance
- Why is this not a great idea?
  - We still haven't dealt with the issue of length normalization
  - Long documents would be more similar to each other by virtue of length, not topic
- However, we can implicitly normalize by looking at angles instead
56 Cosine similarity
- The distance between vectors d1 and d2 is captured by the cosine of the angle between them
- Note: this is similarity, not distance
57 Cosine similarity
- Cosine of the angle between two vectors:
  cos(d1, d2) = (d1 · d2) / (|d1| |d2|)
- The denominator involves the lengths of the vectors
- So the cosine measure is also known as the normalized inner product
58 Normalized vectors
- A vector can be normalized (given a length of 1) by dividing each of its components by the vector's length
  - This maps vectors onto the unit circle
- Then longer documents don't get more weight
- For normalized vectors, the cosine is simply the dot product: cos(d1, d2) = d1 · d2 (see the sketch below)
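A minimal sketch of the cosine measure (the example documents and query are illustrative); note that concatenating a document with itself leaves its cosine score unchanged, which is the length normalization discussed above:

    import math

    # Cosine similarity = normalized inner product of sparse term-weight vectors.
    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv)

    d = {"crack": 2.0, "beam": 1.0}
    d_twice = {t: 2 * w for t, w in d.items()}   # the same doc concatenated with itself
    q = {"crack": 1.0}
    print(cosine(q, d), cosine(q, d_twice))      # identical scores: length is factored out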
59 Okapi BM25
- BM25 ranking function (the formula is sketched below), where:
  - Wd = document length, WAL = average document length
  - k1, k3, b = parameters; N = number of docs in the collection
  - tf(q,t) = query-term frequency; tf(d,t) = within-document frequency
  - df(t) = number of docs that t occurs in
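The formula itself was a figure on the slide; below is a sketch of one standard Okapi BM25 formulation that matches the variables listed above (the default parameter values k1 = 1.2, b = 0.75, k3 = 1000 are common choices, not values from the slide):

    import math

    # Okapi BM25 score of a document for a query, using the slide's symbols:
    # Wd = doc length, WAL = average doc length, N = number of docs,
    # df[t] = number of docs containing t, tf_q / tf_d = query / document term frequencies.
    def bm25(tf_q, tf_d, Wd, WAL, N, df, k1=1.2, k3=1000.0, b=0.75):
        score = 0.0
        for t, qtf in tf_q.items():
            dtf = tf_d.get(t, 0)
            if dtf == 0 or t not in df:
                continue
            wt = math.log((N - df[t] + 0.5) / (df[t] + 0.5))      # RSJ-style idf weight
            K = k1 * ((1 - b) + b * Wd / WAL)                     # document length normalization
            score += wt * ((k1 + 1) * dtf / (K + dtf)) * ((k3 + 1) * qtf / (k3 + qtf))
        return score

    df = {"crack": 30, "beam": 120}
    print(bm25({"crack": 1, "beam": 1}, {"crack": 3, "beam": 1}, 250, 300, 10000, df))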
60 Evaluating an IR system
61 Evaluating an IR system
- What are some measures for evaluating an IR system's performance?
  - Speed of indexing
  - Index/corpus size ratio
  - Speed of query processing
  - Relevance of results
- Note: the information need is translated into a boolean query
  - Relevance is assessed relative to the information need, not the query
62 Standard relevance benchmarks
- TREC: the National Institute of Standards and Technology (NIST) has run a large IR testbed for many years
- Reuters and other benchmark sets are also used
- Retrieval tasks are specified, sometimes as queries
- Human experts mark, for each query and for each doc, "Relevant" or "Not relevant"
  - or at least for the subset of docs that some system returned
63 The TREC experiments
- Once per year
- A set of documents and queries is distributed to the participants; the standard answers are unknown (April)
- Participants work (very hard) to construct and fine-tune their systems, and submit their answers (1000/query) at the deadline (July)
- NIST people manually evaluate the answers and provide the correct answers (and a classification of IR systems) (July - August)
- TREC conference (November)
64 TREC evaluation methodology
- Known document collection (>100K) and query set (50)
- Submission of 1000 documents for each query by each participant
- The first 100 documents from each participant are merged into a global pool
- Human relevance judgment of the global pool
- The other documents are assumed to be irrelevant
- Evaluation of each system (with 1000 answers)
- Partial relevance judgments, but stable for system ranking
65 Tracks (tasks)
- Ad Hoc track: given a document collection, different topics
- Routing (filtering): stable interests (user profile), incoming document flow
- CLIR: Ad Hoc, but with queries in a different language
- Web: a large set of Web pages
- Question Answering: "When did Nixon visit China?"
- Interactive: put users into action with the system
- Spoken document retrieval
- Image and video retrieval
- Information tracking: new topic / follow up
66 Precision and recall
- Precision: fraction of retrieved docs that are relevant = P(relevant | retrieved)
- Recall: fraction of relevant docs that are retrieved = P(retrieved | relevant)
- Precision P = tp / (tp + fp)
- Recall R = tp / (tp + fn) (see the sketch below)

                   Relevant   Not Relevant
  Retrieved        tp         fp
  Not Retrieved    fn         tn
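A minimal sketch computing both measures from the contingency counts above (the counts are illustrative):

    # Precision and recall from retrieved/relevant contingency counts.
    def precision(tp, fp):
        return tp / (tp + fp)

    def recall(tp, fn):
        return tp / (tp + fn)

    tp, fp, fn = 30, 10, 20        # hypothetical counts
    print(precision(tp, fp))       # 0.75
    print(recall(tp, fn))          # 0.6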
67 Other measures
- Precision at a particular cutoff, e.g. p_at_10
- Uninterpolated average precision
- Interpolated average precision
- Accuracy
- Error
68 Other measures (cont.)
- Noise = retrieved irrelevant docs / retrieved docs
- Silence = non-retrieved relevant docs / relevant docs
  - Noise = 1 − Precision; Silence = 1 − Recall
- Fallout = retrieved irrelevant docs / irrelevant docs
- Single-value measures
  - Average precision: average over 11 points of recall
  - Expected search length (number of irrelevant documents to read before obtaining n relevant docs)
69 Why not just use accuracy?
- How to build a 99.9999% accurate search engine on a low budget: return nothing, since nearly every document is irrelevant to nearly every query
- People doing information retrieval want to find something quickly and have a certain tolerance for junk
(Mock screenshot: the "Snoogle.com" search box)
70 Precision/Recall
- Can get high recall (but low precision) by retrieving all docs for all queries!
- Recall is a non-decreasing function of the number of docs retrieved
  - Precision usually decreases (in a good system)
- Difficulties in using precision/recall
  - Should average over large corpus/query ensembles
  - Need human relevance judgments
  - Heavily skewed by corpus/authorship
71 General form of precision/recall
- Precision changes with Recall (there is no single fixed point)
- Systems cannot be compared at a single Precision/Recall point
- Average precision (over the 11 recall points 0.0, 0.1, ..., 1.0), as sketched below
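A minimal sketch of 11-point average precision for one query, assuming the usual interpolation (precision at recall level r is the maximum precision at any recall >= r); the ranking and relevance set are illustrative:

    # 11-point interpolated average precision for one ranked result list.
    def eleven_point_ap(ranking, relevant):
        hits, points = 0, []
        for i, doc in enumerate(ranking, start=1):
            if doc in relevant:
                hits += 1
                points.append((hits / len(relevant), hits / i))   # (recall, precision)
        levels = [r / 10 for r in range(11)]
        interp = [max((p for rec, p in points if rec >= level), default=0.0)
                  for level in levels]
        return sum(interp) / len(levels)

    print(eleven_point_ap(["d3", "d1", "d7", "d2"], {"d1", "d2"}))   # 0.5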
72 A combined measure: F
- The combined measure that assesses this tradeoff is the F measure, a weighted harmonic mean of precision and recall (sketched below)
- People usually use the balanced F1 measure
  - i.e., with β = 1 (equivalently α = 1/2)
- The harmonic mean is a conservative average
- See C.J. van Rijsbergen, Information Retrieval
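A minimal sketch of the weighted F measure (beta = 1 gives the balanced F1 of the slide); the values continue the precision/recall example above:

    # Weighted harmonic mean of precision and recall; beta = 1 is the balanced F1.
    def f_measure(p, r, beta=1.0):
        b2 = beta * beta
        return (b2 + 1) * p * r / (b2 * p + r)

    print(f_measure(0.75, 0.6))    # F1 = 2PR / (P + R) ≈ 0.667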
73 F1 and other averages
74 MAP (Mean Average Precision)
- rij = rank of the j-th relevant document for query Qi
- |Ri| = number of relevant documents for Qi
- n = number of test queries
- MAP = (1/n) Σi (1/|Ri|) Σj (j / rij)
- E.g. relevant documents retrieved at ranks 1, 5, 10 for the first query and 4, 8 for the second (see the sketch below)
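A minimal sketch of MAP as defined above, with the slide's example read as two test queries whose relevant documents appear at ranks 1, 5, 10 and 4, 8 (that reading, and the function names, are assumptions):

    # MAP: for query i with relevant docs at ranks r_i1 < r_i2 < ...,
    # AP_i = (1/|Ri|) * sum_j (j / r_ij); MAP is the mean of AP_i over the n queries.
    def average_precision(ranks):
        ranks = sorted(ranks)
        return sum((j + 1) / r for j, r in enumerate(ranks)) / len(ranks)

    def mean_average_precision(per_query_ranks):
        return sum(average_precision(r) for r in per_query_ranks) / len(per_query_ranks)

    print(mean_average_precision([[1, 5, 10], [4, 8]]))   # (0.5667 + 0.25) / 2 ≈ 0.41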