Title: Search Engines and Question Answering
1 Search Engines and Question Answering
- Giuseppe Attardi
- Università di Pisa
- (some slides borrowed from C. Manning, H. Schütze)
2 Overview
- Information Retrieval Models
  - Boolean and vector-space retrieval models, ranked retrieval, text-similarity metrics, TF-IDF (term frequency / inverse document frequency) weighting, cosine similarity, performance metrics: precision, recall, F-measure
- Indexing and Search
  - Indexing and inverted files, compression, postings lists, query languages
- Web Search
  - Search engine architecture, crawling (parallel/distributed, focused), link analysis (Google PageRank), scaling
- Text Categorization and Clustering
- Question Answering
  - Information extraction, Named Entity Recognition, Natural Language Processing, Part-of-Speech tagging, question analysis and semantic matching
3 References
- Modern Information Retrieval, R. Baeza-Yates, B. Ribeiro-Neto, Addison Wesley
- Managing Gigabytes, 2nd Edition, I.H. Witten, A. Moffat, T.C. Bell, Morgan Kaufmann, 1999
- Foundations of Statistical Natural Language Processing, C. Manning and H. Schütze, MIT Press, 1999
4 Motivation
5 Adaptive Computing
- The Desktop Metaphor has been highly successful in making computers popular
  - See Alan Kay's 1975 presentation in Pisa
- Limitations
  - Point and click involves very elementary actions
  - People are required to perform more and more clerical tasks
  - We have become bank clerks, typographers, illustrators, librarians
6 Illustrative problem
- Add a table with results from the latest benchmarks to a document and send it to my colleague Antonio
  - 7-8 point&clicks just to get to the document
  - 7-8 point&clicks to get to the data
  - Lengthy fiddling with table layout
  - 3-4 point&clicks to retrieve the mail address
  - Etc.
7 Success story
- Do I care where a document is stored?
- Do I need a secretary for filing my documents?
- Search Engines prove that you don't
8 Overcoming the Desktop Metaphor
- Could think of just one possibility:
  - Raise the level of interaction with computers
- How?
- Could think of just one possibility:
  - Use natural language
9 Adaptiveness
- My language is different from yours
- It should be learned from user interaction
  - See Steels' Talking Heads language games
- Through implicit interactions
  - Many potential sources (e.g. filing a message in a folder → classification)
10 Research Goal
- Question Answering
- Techniques
  - Traditional IR tools
  - NLP tools (POS tagging, parser)
  - Complement Knowledge Bases with massive data sets of usage (the Web)
  - Knowledge extraction tools (NE tagging)
  - Continuous learning
11 IXE Framework
(Architecture diagram: components of the IXE framework)
- Wrappers: Python, Perl, Java; Web Service
- Passage Index, Indexer, Search, Crawler, Readers
- Sentence Splitter, POS Tagger, NE Tagger, Clustering
- MaxEntropy: EventStream, ContextStream, GIS
- Unicode, RegExp, Tokenizer, Suffix Trees
- Text, Object Store
- OS Abstraction: Files, Memory Mapping, Threads, Synchronization
12 Information Retrieval Models
13 Information Retrieval Models
- A model is an embodiment of the theory in which we define a set of objects about which assertions can be made and restrict the ways in which classes of objects can interact
- A retrieval model specifies the representations used for documents and information needs, and how they are compared
- (Turtle & Croft, 1992)
14 Information Retrieval Model
- Provides an abstract description of the representation used for documents, the representation of queries, the indexing process, the matching process between a query and the documents, and the ranking criteria
15 Formal Characterization
- An Information Retrieval model is a quadruple ⟨D, Q, F, R⟩ where
  - D is a set of representations for the documents in the collection
  - Q is a set of representations for the user information needs (queries)
  - F is a framework for modelling document representations, queries, and their relationships
  - R : Q × D → ℝ is a ranking function which associates a real number with a query qi ∈ Q and a document representation dj ∈ D
- (Baeza-Yates & Ribeiro-Neto, 1999)
16 Information Retrieval Models
- Three classic models
  - Boolean Model
  - Vector Space Model
  - Probabilistic Model
- Additional models
  - Extended Boolean
  - Fuzzy matching
  - Cluster-based retrieval
  - Language models
17 Collections
(Diagram: indexing and retrieval pipeline: text input is parsed, pre-processed, and indexed; the user's information need is matched against the index)
18 Boolean Model
(Figure: Venn diagram of documents D1-D11 and the Boolean queries q1-q8 expressible over terms t1, t2, t3)
19 Boolean Searching
- Information need: "Measurement of the width of cracks in prestressed concrete beams"
- Formal query: cracks AND beams AND width_measurement AND prestressed_concrete
- Relaxed query: (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)
(Figure: Venn diagram of the concepts Cracks, Beams, Width measurement, Prestressed concrete)
20 Boolean Problems
- Disjunctive (OR) queries lead to information overload
- Conjunctive (AND) queries lead to reduced, and commonly zero, results
- Conjunctive queries imply a reduction in Recall
21 Boolean Model Assessment
- Advantages
  - Complete expressiveness for any identifiable subset of the collection
  - Exact and simple to program
  - The whole panoply of Boolean Algebra is available
- Disadvantages
  - Complex query syntax is often misunderstood (if understood at all)
  - Problems of null output and information overload
  - Output is not ordered in any useful fashion
22 Boolean Extensions
- Fuzzy Logic (sketched below)
  - Adds weights to each term/concept
  - ta AND tb is interpreted as MIN(w(ta), w(tb))
  - ta OR tb is interpreted as MAX(w(ta), w(tb))
- Proximity/Adjacency operators
  - Interpreted as additional constraints on Boolean AND
- Verity TOPIC system
  - Uses various weighted forms of Boolean logic and proximity information in calculating Robertson Selection Values (RSV)
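A minimal sketch of this fuzzy-Boolean interpretation (the term weights and the helper names fuzzy_and/fuzzy_or are illustrative, not taken from the Verity system):

    # Fuzzy-Boolean scoring: AND -> minimum of the term weights, OR -> maximum.
    # Term weights w(t, d) are assumed precomputed and normalized to [0, 1].
    def fuzzy_and(*weights):
        return min(weights)

    def fuzzy_or(*weights):
        return max(weights)

    # Example: score of (cracks AND beams) OR prestressed for one document,
    # with hypothetical term weights.
    w = {"cracks": 0.7, "beams": 0.4, "prestressed": 0.9}
    score = fuzzy_or(fuzzy_and(w["cracks"], w["beams"]), w["prestressed"])
    print(score)  # 0.9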
23 Vector Space Model
- Documents are represented as vectors in term space
  - Terms are usually stems
  - Documents represented by binary vectors of terms
- Queries are represented the same way as documents
- Query and document weights are based on the length and direction of their vectors
- A vector distance measure between the query and documents is used to rank retrieved documents
24 Documents in Vector Space
(Figure: documents D1-D11 plotted in the three-dimensional term space t1, t2, t3)
25 Vector Space Documents and Queries

  docs  t1  t2  t3   RSV = Q·Di
  D1     1   0   1   4
  D2     1   0   0   1
  D3     0   1   1   5
  D4     1   0   0   1
  D5     1   1   1   6
  D6     1   1   0   3
  D7     0   1   0   2
  D8     0   1   0   2
  D9     0   0   1   3
  D10    0   1   1   5
  D11    1   0   1   3
  Q      1   2   3   (query weights q1, q2, q3)

(Figure: the same documents and the query plotted in the t1, t2, t3 term space)
26 Similarity Measures
- Simple matching (coordination level match)
- Dice's Coefficient
- Jaccard's Coefficient
- Cosine Coefficient
- Overlap Coefficient
- (The formulas are sketched below)
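The coefficient formulas appeared as figures on the original slide; the sketch below gives the standard set-based definitions over binary term representations (the example document and query sets are illustrative):

    # Standard similarity coefficients over binary (set-of-terms) representations.
    def simple_match(x, y):            # coordination level match
        return len(x & y)

    def dice(x, y):
        return 2 * len(x & y) / (len(x) + len(y))

    def jaccard(x, y):
        return len(x & y) / len(x | y)

    def cosine(x, y):
        return len(x & y) / ((len(x) * len(y)) ** 0.5)

    def overlap(x, y):
        return len(x & y) / min(len(x), len(y))

    d = {"cracks", "beams", "concrete"}
    q = {"cracks", "width", "concrete"}
    print(simple_match(d, q), dice(d, q), jaccard(d, q), overlap(d, q))  # 2, 0.667, 0.5, 0.667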
27 Vector Space with Term Weights
- Di = (wdi1, wdi2, ..., wdit), Q = (wqi1, wqi2, ..., wqit)
- Example in a two-term space (Term A, Term B):
  Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7)
(Figure: Q, D1 and D2 plotted in the Term A / Term B plane)
28 Problems with Vector Space
- There is no real theoretical basis for the assumption of a term space
  - It is more for visualization than having any real basis
  - Most similarity measures work about the same regardless of model
- Terms are not really orthogonal dimensions
  - Terms are not independent of all other terms
29 Probabilistic Retrieval
- Goes back to the 1960s (Maron and Kuhns)
- Robertson's Probabilistic Ranking Principle
  - Retrieved documents should be ranked in decreasing probability that they are relevant to the user's query
- How to estimate these probabilities?
  - Several methods (Model 1, Model 2, Model 3) with different emphasis on how estimates are done
30 Probabilistic Models: Notation
- D: all present and future documents
- Q: all present and future queries
- (di, qj): a document-query pair
- x ⊆ D: class of similar documents
- y ⊆ Q: class of similar queries
- Relevance is a relation:
  R = {(di, qj) | di ∈ D, qj ∈ Q, di is judged relevant by the user submitting qj}
31 Probabilistic model
- Given D, estimate P(R|D) and P(NR|D)
- P(R|D) = P(D|R)P(R)/P(D)   (P(D), P(R) constant)
  ∝ P(D|R)
- D = (t1 = x1, t2 = x2, ...), where xi = 1 if ti occurs in D and 0 otherwise
- Assuming term independence:
  P(D|R) = Πi pi^xi (1 − pi)^(1−xi), with pi = P(ti = 1|R)
  P(D|NR) = Πi qi^xi (1 − qi)^(1−xi), with qi = P(ti = 1|NR)
32 Prob. model (cont'd)
- For document ranking:
  sim(D, Q) = log [P(D|R) / P(D|NR)] ∝ Σ over terms ti present in D and Q of log [pi (1 − qi)] / [qi (1 − pi)]
33 Prob. model (cont'd)

                Relevant docs    Irrelevant docs       All docs
  with ti       ri               ni − ri               ni
  without ti    Ri − ri          N − Ri − ni + ri      N − ni
  total         Ri               N − Ri                N

- How to estimate pi and qi?
- Use a set of N relevant and irrelevant samples (table above):
  pi = ri / Ri,  qi = (ni − ri) / (N − Ri)
34 Prob. model (cont'd)
- Smoothing (Robertson-Sparck Jones formula, sketched below):
  wi = log [ (ri + 0.5)(N − Ri − ni + ri + 0.5) / ((Ri − ri + 0.5)(ni − ri + 0.5)) ]
- When no sample is available:
  - pi = 0.5
  - qi = (ni + 0.5) / (N + 0.5) ≈ ni / N
- May be implemented as a VSM
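A minimal sketch of the smoothed estimates above (the function name and the no-sample example are illustrative; counts follow the notation of slide 33):

    import math

    # Robertson-Sparck Jones relevance weight for term ti, with 0.5 smoothing.
    # ri = relevant docs containing ti, Ri = relevant docs, ni = docs containing ti,
    # N = docs in the sample (notation of slide 33).
    def rsj_weight(ri, Ri, ni, N):
        pi = (ri + 0.5) / (Ri + 1.0)              # smoothed estimate of P(ti | R)
        qi = (ni - ri + 0.5) / (N - Ri + 1.0)     # smoothed estimate of P(ti | NR)
        return math.log(pi * (1 - qi) / (qi * (1 - pi)))

    # With no relevance sample (ri = Ri = 0) this reduces to pi = 0.5 and
    # qi ≈ ni / N, i.e. an idf-like weight, as stated on the slide.
    print(rsj_weight(0, 0, 100, 10000))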
35 Probabilistic Models
- Model 1: Probabilistic Indexing, P(R | y, di)
- Model 2: Probabilistic Querying, P(R | qj, x)
- Model 3: Merged Model, P(R | qj, di)
- Model 0: P(R | y, x)
- Probabilities are estimated based on prior usage or relevance estimation
36 Probabilistic Models
- Rigorous formal model: attempts to predict the probability that a given document will be relevant to a given query
- Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle)
- Relies on accurate estimates of probabilities for accurate results
37 Vector and Probabilistic Models
- Support natural language queries
- Treat documents and queries the same
- Support relevance feedback searching
- Support ranked retrieval
- Differ primarily in their theoretical basis and in how the ranking is calculated
  - The vector model assumes relevance
  - The probabilistic model relies on relevance judgments or estimates
38 IR Ranking
39 Ranking models in IR
- Key idea
  - We wish to return, in order, the documents most likely to be useful to the searcher
  - To do this, we want to know which documents best satisfy a query
- An obvious idea: if a document talks about a topic more, then it is a better match
- A query should then just specify terms that are relevant to the information need, without requiring that all of them be present
  - A document is relevant if it has a lot of the query terms
40 Binary term presence matrices
- Record whether a document contains a word: each document is a binary vector in {0,1}^V
  - What we have mainly assumed so far
- Idea: query satisfaction = overlap measure

  Term        Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
  Antony      1                     1              0            0       0        0
  Brutus      1                     1              0            1       0        0
  Caesar      1                     1              0            1       1        1
  Calpurnia   0                     1              0            0       0        0
  Cleopatra   1                     0              0            0       0        0
  mercy       1                     0              1            1       1        1
  worser      1                     0              1            1       1        0
41 Overlap matching
- What are the problems with the overlap measure?
- It doesn't consider:
  - Term frequency in the document
  - Term scarcity in the collection (document mention frequency)
  - Length of documents
  - (AND queries: score not normalized)
42 Overlap matching
- One can normalize in various ways:
  - Jaccard coefficient
  - Cosine measure
- What documents would score best using Jaccard against a typical query?
- Does the cosine measure fix this problem?
43 Count term-document matrices
- We haven't considered the frequency of a word
- Count of a word in a document
  - Bag of words model
  - Document is a vector in ℕ^V
- Normalization: Calpurnia vs. Calphurnia

  Term        Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
  Antony      157                   73             0            0       0        0
  Brutus      4                     157            0            1       0        0
  Caesar      232                   227            0            2       1        1
  Calpurnia   0                     10             0            0       0        0
  Cleopatra   57                    0              0            0       0        0
  mercy       2                     0              3            5       5        1
  worser      2                     0              1            1       1        0
44 Weighting term frequency: tf
- What is the relative importance of
  - 0 vs. 1 occurrence of a term in a doc?
  - 1 vs. 2 occurrences?
  - 2 vs. 3 occurrences?
- Unclear, but it seems that more is better, yet a lot isn't necessarily better than a few
- Can just use the raw count
- Another option commonly used in practice: a sublinear scaling of tf, sketched below
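The slide showed this other option as a formula; a commonly used choice, assumed here since the original formula was lost, is the log-scaled tf:

    import math

    # Log-scaled term frequency: more occurrences help, with diminishing returns.
    def wf(tf):
        return 1.0 + math.log(tf) if tf > 0 else 0.0

    for tf in (0, 1, 2, 3, 10, 100):
        print(tf, round(wf(tf), 2))   # 0.0, 1.0, 1.69, 2.1, 3.3, 5.61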
45 Dot product matching
- Match score is the dot product of query and document vectors (see the sketch below)
  - Note: 0 if orthogonal (no words in common)
- Rank by match score
- It still doesn't consider:
  - Term scarcity in the collection (document mention frequency)
  - Length of documents and queries
  - Not normalized
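A minimal sketch of dot-product matching over sparse term vectors (the document names, terms, and raw-count weights are illustrative):

    # Rank documents by the dot product of their term-weight vectors with the query.
    # Vectors are sparse dicts {term: weight}; raw counts are used for brevity.
    def dot(q, d):
        return sum(w * d.get(t, 0.0) for t, w in q.items())

    docs = {
        "D1": {"caesar": 2, "brutus": 1},
        "D2": {"caesar": 1, "calpurnia": 3},
        "D3": {"mercy": 4},
    }
    query = {"caesar": 1, "brutus": 1}
    ranking = sorted(docs, key=lambda name: dot(query, docs[name]), reverse=True)
    print([(name, dot(query, docs[name])) for name in ranking])
    # [('D1', 3.0), ('D2', 1.0), ('D3', 0.0)] -- 0 when no words in common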
46 Weighting should depend on the term overall
- Which of these tells you more about a doc?
  - 10 occurrences of hernia?
  - 10 occurrences of the?
- Suggests looking at collection frequency (cf)
- But document frequency (df) may be better:

  Word        cf      df
  try         10422   8760
  insurance   10440   3997

- Document frequency weighting is only possible in a known (static) collection
47 tf x idf term weights
- The tf x idf measure combines
  - term frequency (tf)
    - a measure of term density in a doc
  - inverse document frequency (idf)
    - a measure of the informativeness of a term: its rarity across the whole corpus
    - could just be the raw count of the number of documents the term occurs in (idf_i = 1/df_i)
    - but by far the most commonly used version is idf_i = log(n / df_i)
- See Kishore Papineni, NAACL 2, 2002 for theoretical justification
48 Summary: tf x idf
- Assign a tf.idf weight to each term i in each document d:
  w(i,d) = tf(i,d) × log(n / df_i)
  - tf(i,d) = frequency of term i in document d
  - n = total number of documents
  - df_i = number of documents that contain term i
- Increases with the number of occurrences within a doc
- Increases with the rarity of the term across the whole corpus
- What is the weight of a term that occurs in all of the docs? log(n/n) = 0, so its weight is 0 (see the sketch below)
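A minimal sketch of the weight just defined (the helper name and the toy numbers are illustrative):

    import math

    # tf.idf weight of term i in document d:  w(i, d) = tf(i, d) * log(n / df_i)
    def tf_idf(tf_id, df_i, n_docs):
        return tf_id * math.log(n_docs / df_i)

    n = 1000
    print(tf_idf(3, 10, n))    # rare term: high weight (~13.8)
    print(tf_idf(3, 1000, n))  # term occurring in every doc: log(1) = 0, so weight 0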
49 Real-valued term-document matrices
- Function (scaling) of the count of a word in a document
  - Bag of words model
  - Each document is a vector in ℝ^V
- Here: log-scaled tf.idf
50 Documents as vectors
- Each doc j can now be viewed as a vector of tf×idf values, one component for each term
- So we have a vector space
  - terms are axes
  - docs live in this space
  - even with stemming, we may have 20,000 dimensions
- (The corpus of documents gives us a matrix, which we could also view as a vector space in which words live: transposable data)
51 Why turn docs into vectors?
- First application: Query-by-example
  - Given a doc d, find others like it
- Now that d is a vector, find vectors (docs) "near" it
52 Intuition
(Figure: documents d1-d5 and a query plotted in the term space t1, t2, t3)
- Postulate: documents that are "close together" in vector space talk about the same things
53 The vector space model
- Query as vector
  - We regard the query as a short document
  - We return the documents ranked by the closeness of their vectors to the query, also represented as a vector
- Developed in the SMART system (Salton, c. 1970) and standardly used by TREC participants and web IR systems
54 Desiderata for proximity
- If d1 is near d2, then d2 is near d1
- If d1 is near d2, and d2 is near d3, then d1 is not far from d3
- No doc is closer to d than d itself
55 First cut
- Distance between vectors d1 and d2 is the length of the vector d1 − d2
  - Euclidean distance
- Why is this not a great idea?
  - We still haven't dealt with the issue of length normalization
  - Long documents would be more similar to each other by virtue of length, not topic
- However, we can implicitly normalize by looking at angles instead
56 Cosine similarity
- The distance between vectors d1 and d2 is captured by the cosine of the angle between them
- Note: this is similarity, not distance
57 Cosine similarity
- Cosine of the angle between two vectors:
  cos(d1, d2) = (d1 · d2) / (|d1| |d2|)
- The denominator involves the lengths of the vectors
- So the cosine measure is also known as the normalized inner product
58 Normalized vectors
- A vector can be normalized (given a length of 1) by dividing each of its components by the vector's length
  - This maps vectors onto the unit circle
- Then longer documents don't get more weight
- For normalized vectors, the cosine is simply the dot product: cos(d1, d2) = d1 · d2 (see the sketch below)
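A minimal sketch of the cosine measure (the example documents and query are illustrative); note that concatenating a document with itself leaves its cosine score unchanged, which is the length normalization discussed above:

    import math

    # Cosine similarity = normalized inner product of sparse term-weight vectors.
    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv)

    d = {"crack": 2.0, "beam": 1.0}
    d_twice = {t: 2 * w for t, w in d.items()}   # the same doc concatenated with itself
    q = {"crack": 1.0}
    print(cosine(q, d), cosine(q, d_twice))      # identical scores: length is factored out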
59 Okapi BM25
- BM25 ranking function (the formula is sketched below), where:
  - Wd = document length, WAL = average document length
  - k1, k3, b = parameters; N = number of docs in the collection
  - tf(q,t) = query-term frequency; tf(d,t) = within-document frequency
  - df(t) = number of docs that t occurs in
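The formula itself was a figure on the slide; below is a sketch of one standard Okapi BM25 formulation that matches the variables listed above (the default parameter values k1 = 1.2, b = 0.75, k3 = 1000 are common choices, not values from the slide):

    import math

    # Okapi BM25 score of a document for a query, using the slide's symbols:
    # Wd = doc length, WAL = average doc length, N = number of docs,
    # df[t] = number of docs containing t, tf_q / tf_d = query / document term frequencies.
    def bm25(tf_q, tf_d, Wd, WAL, N, df, k1=1.2, k3=1000.0, b=0.75):
        score = 0.0
        for t, qtf in tf_q.items():
            dtf = tf_d.get(t, 0)
            if dtf == 0 or t not in df:
                continue
            wt = math.log((N - df[t] + 0.5) / (df[t] + 0.5))      # RSJ-style idf weight
            K = k1 * ((1 - b) + b * Wd / WAL)                     # document length normalization
            score += wt * ((k1 + 1) * dtf / (K + dtf)) * ((k3 + 1) * qtf / (k3 + qtf))
        return score

    df = {"crack": 30, "beam": 120}
    print(bm25({"crack": 1, "beam": 1}, {"crack": 3, "beam": 1}, 250, 300, 10000, df))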
60 Evaluating an IR system
61 Evaluating an IR system
- What are some measures for evaluating an IR system's performance?
  - Speed of indexing
  - Index/corpus size ratio
  - Speed of query processing
  - Relevance of results
- Note: the information need is translated into a boolean query
  - Relevance is assessed relative to the information need, not the query
62 Standard relevance benchmarks
- TREC: the National Institute of Standards and Technology (NIST) has run a large IR testbed for many years
- Reuters and other benchmark sets are also used
- Retrieval tasks are specified, sometimes as queries
- Human experts mark, for each query and for each doc, "Relevant" or "Not relevant"
  - or at least for the subset of docs that some system returned
63 The TREC experiments
- Once per year
- A set of documents and queries is distributed to the participants; the standard answers are unknown (April)
- Participants work (very hard) to construct and fine-tune their systems, and submit their answers (1000/query) at the deadline (July)
- NIST people manually evaluate the answers and provide the correct answers (and a classification of IR systems) (July - August)
- TREC conference (November)
64 TREC evaluation methodology
- Known document collection (>100K) and query set (50)
- Submission of 1000 documents for each query by each participant
- The first 100 documents from each participant are merged into a global pool
- Human relevance judgment of the global pool
- The other documents are assumed to be irrelevant
- Evaluation of each system (with 1000 answers)
- Partial relevance judgments, but stable for system ranking
65 Tracks (tasks)
- Ad Hoc track: given a document collection, different topics
- Routing (filtering): stable interests (user profile), incoming document flow
- CLIR: Ad Hoc, but with queries in a different language
- Web: a large set of Web pages
- Question Answering: "When did Nixon visit China?"
- Interactive: put users into action with the system
- Spoken document retrieval
- Image and video retrieval
- Information tracking: new topic / follow up
66 Precision and recall
- Precision: fraction of retrieved docs that are relevant = P(relevant | retrieved)
- Recall: fraction of relevant docs that are retrieved = P(retrieved | relevant)
- Precision P = tp / (tp + fp)
- Recall R = tp / (tp + fn) (see the sketch below)

                   Relevant   Not Relevant
  Retrieved        tp         fp
  Not Retrieved    fn         tn
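A minimal sketch computing both measures from the contingency counts above (the counts are illustrative):

    # Precision and recall from retrieved/relevant contingency counts.
    def precision(tp, fp):
        return tp / (tp + fp)

    def recall(tp, fn):
        return tp / (tp + fn)

    tp, fp, fn = 30, 10, 20        # hypothetical counts
    print(precision(tp, fp))       # 0.75
    print(recall(tp, fn))          # 0.6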
67 Other measures
- Precision at a particular cutoff, e.g. p_at_10
- Uninterpolated average precision
- Interpolated average precision
- Accuracy
- Error
68 Other measures (cont.)
- Noise = retrieved irrelevant docs / retrieved docs
- Silence = non-retrieved relevant docs / relevant docs
  - Noise = 1 − Precision; Silence = 1 − Recall
- Fallout = retrieved irrelevant docs / irrelevant docs
- Single-value measures
  - Average precision: average over 11 points of recall
  - Expected search length (number of irrelevant documents to read before obtaining n relevant docs)
69 Why not just use accuracy?
- How to build a 99.9999% accurate search engine on a low budget: return nothing, since nearly every document is irrelevant to nearly every query
- People doing information retrieval want to find something quickly and have a certain tolerance for junk
(Mock screenshot: the "Snoogle.com" search box)
70 Precision/Recall
- Can get high recall (but low precision) by retrieving all docs for all queries!
- Recall is a non-decreasing function of the number of docs retrieved
  - Precision usually decreases (in a good system)
- Difficulties in using precision/recall
  - Should average over large corpus/query ensembles
  - Need human relevance judgments
  - Heavily skewed by corpus/authorship
71 General form of precision/recall
- Precision changes with Recall (there is no single fixed point)
- Systems cannot be compared at a single Precision/Recall point
- Average precision (over the 11 recall points 0.0, 0.1, ..., 1.0), as sketched below
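A minimal sketch of 11-point average precision for one query, assuming the usual interpolation (precision at recall level r is the maximum precision at any recall >= r); the ranking and relevance set are illustrative:

    # 11-point interpolated average precision for one ranked result list.
    def eleven_point_ap(ranking, relevant):
        hits, points = 0, []
        for i, doc in enumerate(ranking, start=1):
            if doc in relevant:
                hits += 1
                points.append((hits / len(relevant), hits / i))   # (recall, precision)
        levels = [r / 10 for r in range(11)]
        interp = [max((p for rec, p in points if rec >= level), default=0.0)
                  for level in levels]
        return sum(interp) / len(levels)

    print(eleven_point_ap(["d3", "d1", "d7", "d2"], {"d1", "d2"}))   # 0.5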
72 A combined measure: F
- The combined measure that assesses this tradeoff is the F measure, a weighted harmonic mean of precision and recall (sketched below)
- People usually use the balanced F1 measure
  - i.e., with β = 1 (equivalently α = 1/2)
- The harmonic mean is a conservative average
- See C.J. van Rijsbergen, Information Retrieval
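A minimal sketch of the weighted F measure (beta = 1 gives the balanced F1 of the slide); the values continue the precision/recall example above:

    # Weighted harmonic mean of precision and recall; beta = 1 is the balanced F1.
    def f_measure(p, r, beta=1.0):
        b2 = beta * beta
        return (b2 + 1) * p * r / (b2 * p + r)

    print(f_measure(0.75, 0.6))    # F1 = 2PR / (P + R) ≈ 0.667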
73 F1 and other averages
74 MAP (Mean Average Precision)
- rij = rank of the j-th relevant document for query Qi
- |Ri| = number of relevant documents for Qi
- n = number of test queries
- MAP = (1/n) Σi (1/|Ri|) Σj (j / rij)
- E.g. relevant documents retrieved at ranks 1, 5, 10 for the first query and 4, 8 for the second (see the sketch below)
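A minimal sketch of MAP as defined above, with the slide's example read as two test queries whose relevant documents appear at ranks 1, 5, 10 and 4, 8 (that reading, and the function names, are assumptions):

    # MAP: for query i with relevant docs at ranks r_i1 < r_i2 < ...,
    # AP_i = (1/|Ri|) * sum_j (j / r_ij); MAP is the mean of AP_i over the n queries.
    def average_precision(ranks):
        ranks = sorted(ranks)
        return sum((j + 1) / r for j, r in enumerate(ranks)) / len(ranks)

    def mean_average_precision(per_query_ranks):
        return sum(average_precision(r) for r in per_query_ranks) / len(per_query_ranks)

    print(mean_average_precision([[1, 5, 10], [4, 8]]))   # (0.5667 + 0.25) / 2 ≈ 0.41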