LIS618 lecture 2 - PowerPoint PPT Presentation

Slides: 48

Provided by: kric

Learn more at: https://openlib.org

1
LIS618 lecture 2

Thomas Krichel
2002-09-22

2
Structure of talk

General round trip on theoretical matters, part
Information retrieval models
vector model
Probabilistic model
Retrieval performance evaluation
Query languages
Introduction to online searching
Introduction to DIALOG
Overview
bluesheets

3
vector model

associates weights with each index term appearing
in the query and in each database document.
relevance can be calculated as the cosine between
the two vectors, i.e. their dot product divided
by the product of their lengths (the square root
of the sum of squares of each vector). With
non-negative weights, this measure varies between
0 and 1.
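
The cosine measure for the vector model can be sketched in a few lines of Python. This is only an illustration: the function name and the sample weights are made up, and both vectors are assumed to list weights for the same index terms in the same order.

```python
import math

def cosine_similarity(query_weights, doc_weights):
    """Cosine of the angle between two term-weight vectors of equal length."""
    dot = sum(q * d for q, d in zip(query_weights, doc_weights))
    query_norm = math.sqrt(sum(q * q for q in query_weights))
    doc_norm = math.sqrt(sum(d * d for d in doc_weights))
    if query_norm == 0 or doc_norm == 0:
        return 0.0  # an all-zero vector shares no weighted terms with anything
    return dot / (query_norm * doc_norm)

# Hypothetical weights for three index terms.
query = [1.0, 0.0, 1.0]
document = [0.5, 0.8, 0.0]
similarity = cosine_similarity(query, document)
print(round(similarity, 3))
```

Documents can then be sorted by this value to rank them against the query.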

4
tf/idf weighting technique

Let n_i be the number of documents where the term
i appears. Let F_i_j be the number of times
term i appears in the document j.
The normalized frequency is f_i_j, given by
f_i_j = F_i_j / max_l(F_l_j)
that is the raw frequency divided by the
maximum raw frequency achieved by any term in
the document j.

5
tf/idf weighting technique

Let N be the number of documents.
Then the most frequently used weighting scheme is
w_i_j = f_i_j * log(N/n_i)
There are other methods, but these are variations
on this one.
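
A small sketch of the tf/idf weighting, w_i_j = f_i_j * log(N/n_i), with f_i_j the raw frequency divided by the maximum raw frequency in document j. The function name and the toy documents are hypothetical, and the natural logarithm is an assumption, since the slides do not name a base.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document (each document is a list of terms)."""
    n_docs = len(documents)                      # N
    raw = [Counter(doc) for doc in documents]    # F_i_j for each document j
    doc_freq = Counter()                         # n_i
    for counts in raw:
        doc_freq.update(counts.keys())
    weights = []
    for counts in raw:
        max_freq = max(counts.values()) if counts else 1
        weights.append({
            term: (freq / max_freq) * math.log(n_docs / doc_freq[term])
            for term, freq in counts.items()
        })
    return weights

# Toy collection of three already-tokenized documents.
docs = [
    ["boolean", "search", "search"],
    ["vector", "search"],
    ["vector", "model", "model"],
]
weights = tf_idf(docs)
for doc_weights in weights:
    print(doc_weights)
```

A term that appears in every document gets weight 0, because log(N/n_i) is then log(1).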

6
advantages of vector model

term weighting improves performance
sorting is possible
easy to compute, therefore fast
results are difficult to improve without
query expansion
user feedback cycle

7
probabilistic model (outline only)

starts with the assumption that there is a subset
of documents that form the ideal answer set
query process specifies properties of the answer
set
query terms can be used to form a probability
that a document is part of the answer
then we start an iterative process with the user
to gain more characteristics about the answer set

8
recursive method

If we assume that the probability that a document
among a set of initially retrieved documents is
relevant is proportional to the appearance of
index terms that are part of the query, the
probability can be refined further.

9
probabilistic model

For any user requirement, we assume that there is
an answer set and that the probability that the
user finds a document interesting only depends on
the document and the query
Then the similarity of the document to the query
can be expressed as
s = (probability that the document is part of the
answer set) / (probability that it is not part of
the answer set).
There are ways to calculate this (with some more
assumptions).

10
retrieval performance evaluation

There are two classic measures. Both assume that
there is an answer set.
Recall is the fraction of the relevant documents
that the query result has captured.
Precision is the fraction of the retrieved
documents that is relevant.
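
The two measures can be written down directly from their definitions. The sketch below uses invented document identifiers and assumes the answer set is known in advance.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that the query result has captured."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved)) / len(relevant)

def precision(retrieved, relevant):
    """Fraction of the retrieved documents that is relevant."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0
    return len(retrieved & set(relevant)) / len(retrieved)

# Hypothetical answer set and query result.
relevant_docs = {"d1", "d2", "d3", "d4"}
retrieved_docs = ["d1", "d2", "x1", "x2", "x3"]
print(recall(retrieved_docs, relevant_docs), precision(retrieved_docs, relevant_docs))
```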

11
recall and precision curves

assume that all the retrieved documents arrive at
once and are being examined.
during that process, the user discovers more and
more relevant documents. Recall increases.
during the same process, at least eventually,
there will be fewer and fewer useful documents.
Precision declines (usually).

12
Example

let the answer set be 4,7,5,3,6,1,0,8,9,2 and
non-relevant documents represented by letters.
A query reveals the following result
7,a,3,b,c,9,n,j,l,5,r,o,s,e,4.
for the first document, (recall, precision) in
percent is (10,100), for the third, (20,67), then
follow (30,50), (40,40), (50,33)
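
The (recall, precision) pairs for a ranked result can be computed by walking down the ranking. The sketch below uses the ranking from this slide and assumes the answer set consists of the ten documents numbered 0 to 9, which is what the 10-percent recall steps imply; the function name is made up.

```python
def recall_precision_points(ranking, relevant):
    """Return (recall, precision) each time a new relevant document is seen."""
    relevant = set(relevant)
    points = []
    found = 0
    for position, doc in enumerate(ranking, start=1):
        if doc in relevant:
            found += 1
            points.append((found / len(relevant), found / position))
    return points

ranking = "7,a,3,b,c,9,n,j,l,5,r,o,s,e,4".split(",")
relevant = set("0123456789")  # assumption: ten relevant documents, 0 to 9
points = recall_precision_points(ranking, relevant)
percentages = [(round(100 * r), round(100 * p)) for r, p in points]
print(percentages)
```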

13
recall/precision curves

Such curves can be formed for each query.
An average curve, for each recall level, can be
calculated for several queries.
Recall and precision levels can also be used to
calculate two single-valued summaries.

14
average precision at seen document

sum the precision levels at each new relevant
document and divide by the number of relevant
documents the query has retrieved.
In our example, it is (1 + 0.67 + 0.5 + 0.4 + 0.33)/5 = 0.58
This measure favors retrieval methods that get
the relevant documents to the top.
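
A sketch of this summary measure, again on the example ranking. It divides by the number of relevant documents that actually appear in the ranking, which is an assumption about the intended divisor; the answer set is assumed to be the ten documents 0 to 9.

```python
def average_precision_at_seen(ranking, relevant):
    """Mean of the precision values observed at each relevant document in the ranking."""
    relevant = set(relevant)
    precisions = []
    found = 0
    for position, doc in enumerate(ranking, start=1):
        if doc in relevant:
            found += 1
            precisions.append(found / position)
    return sum(precisions) / len(precisions) if precisions else 0.0

ranking = "7,a,3,b,c,9,n,j,l,5,r,o,s,e,4".split(",")
relevant = set("0123456789")  # assumption: ten relevant documents, 0 to 9
value = average_precision_at_seen(ranking, relevant)
print(round(value, 2))
```

Moving a relevant document closer to the top raises the precision recorded at that point, which is why the measure rewards good ranking.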

15
R-precision

a more ad-hoc measure.
Let R be the size of the answer set.
Take the first R results of the query.
Find the number of relevant documents
Divide by R.
In our example, the R-precision is .4.
An average can be calculated for a number of
queries.
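
R-precision follows the steps above almost literally. The sketch assumes the answer set of the running example holds the ten documents 0 to 9; the function name is hypothetical.

```python
def r_precision(ranking, relevant):
    """Share of relevant documents among the first R results, R = size of the answer set."""
    relevant = set(relevant)
    r = len(relevant)
    if r == 0:
        return 0.0
    top = ranking[:r]
    return sum(1 for doc in top if doc in relevant) / r

ranking = "7,a,3,b,c,9,n,j,l,5,r,o,s,e,4".split(",")
relevant = set("0123456789")  # assumption: ten relevant documents, 0 to 9
print(r_precision(ranking, relevant))
```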

16
critique of recall precision

recall has to be estimated by an expert
recall is very difficult to estimate in a large
collection
the measures are most appropriate to a situation
where queries are run in batch mode; they are
difficult to reconcile with the idea of
interactive use.
there are some other measures.

17
simple queries

single-word queries
one word only
Hopefully some word combinations are understood
as one word, e.g. on-line
Context queries
phrase queries (be aware of stop words)
proximity queries, generalize phrase queries
Boolean queries

18
simple pattern queries

prefix queries (e.g. "anal" for analogy)
suffix queries (e.g. "oral" for choral)
substring (e.g. "al" for talk)
ranges (e.g. from "held" to "hero")
within a distance, usually Levenshtein distance
(i.e. the minimum number of insertions,
deletions, and replacements) of query term
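
The Levenshtein distance can be computed with the usual dynamic-programming table. The sketch below keeps only two rows of the table; the helper that filters a vocabulary and the sample words are illustrative only.

```python
def levenshtein(source, target):
    """Minimum number of insertions, deletions and replacements turning source into target."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # replacement, or a free match
            ))
        previous = current
    return previous[-1]

def within_distance(query_term, vocabulary, max_distance):
    """Vocabulary words whose distance to the query term is at most max_distance."""
    return [w for w in vocabulary if levenshtein(query_term, w) <= max_distance]

print(within_distance("talk", ["talk", "walk", "tall", "choral"], 1))
```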

19
regular expressions

come from UNIX computing
built from strings where certain characters are
metacharacters.
example "pro(blem|tein)s?" matches problem,
problems, protein and proteins.
example "New .*y" matches "New Jersey" and "New
York City", and "New Delhy".
great variety of dialects, usually very powerful.
Extremely important in digital libraries.
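
The two examples can be tried with Python's re module. The patterns used here, pro(blem|tein)s? and New .*y, are assumptions about what the slide intends (an alternation bar and a star), and Python's dialect is only one of many.

```python
import re

# Alternation (blem|tein) followed by an optional "s".
protein_or_problem = re.compile(r"pro(blem|tein)s?")
words = ["problem", "problems", "protein", "proteins", "process"]
matched_words = [w for w in words if protein_or_problem.fullmatch(w)]

# "New ", then any run of characters, then a final "y".
new_place = re.compile(r"New .*y")
places = ["New Jersey", "New York City", "New Delhy", "New Orleans"]
matched_places = [p for p in places if new_place.fullmatch(p)]

print(matched_words)
print(matched_places)
```

fullmatch requires the whole string to match; a search for the pattern anywhere inside a longer text would use search instead.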

20
structured queries

make use of document structures
simplest example: when the documents are
database records, we can search for terms in a
certain field only.
if there is sufficient structure to field
contents, the field can be interpreted as meaning
something different than the word it contains.
example: dates
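
A rough sketch of both ideas: a term search restricted to one field, and a date field that is compared as a date rather than as a word. The records, field names and function names are all invented for illustration.

```python
from datetime import date

# Hypothetical database records with three fields each.
records = [
    {"title": "Boolean searching", "author": "Smith", "published": date(2001, 5, 14)},
    {"title": "Vector models", "author": "Jones", "published": date(2002, 9, 1)},
    {"title": "Smith on indexing", "author": "Brown", "published": date(1999, 1, 20)},
]

def search_field(records, field, term):
    """Records whose given field contains the term (case-insensitive)."""
    term = term.lower()
    return [r for r in records if term in str(r[field]).lower()]

def published_between(records, start, end):
    """Records whose date field falls in the closed interval [start, end]."""
    return [r for r in records if start <= r["published"] <= end]

by_author = search_field(records, "author", "smith")
recent = published_between(records, date(2001, 1, 1), date(2002, 12, 31))
print([r["title"] for r in by_author])
print([r["title"] for r in recent])
```

Searching the author field for "smith" skips the record that only mentions Smith in its title, which a plain word search would also return.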

21
query protocols

There are some standard languages
Z39.50 queries
CCL, "common command language" is a development
of Z39.50
CD-RDx "compact disk read only data exchange" is
supported by US government agencies such as CIA
and NASA
SFQL "structured full-text query language" built
on SQL

22
document preprocessing

operations done on the documents before indexing
lexical analysis
elimination of stop words
stemming of words
selection of index terms
construction of term categorization structures
is receiving less attention than it used to

23
lexical analysis

divides a stream of characters into a stream of
words
seems easy enough, but:
should we keep numbers?
hyphens. compare "state-of-the-art" with "b-52"
removal of punctuation, but "333B.C."
casing. compare "bank" and "Bank"
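
The choices listed above can be made explicit as switches of a tokenizer. This is a minimal sketch for plain ASCII text; the function and its parameters are hypothetical, and real lexical analysers make many more decisions.

```python
import re

def tokenize(text, keep_numbers=True, split_hyphens=False, lowercase=True):
    """Split text into words; the flags mirror the questions on the slide."""
    if lowercase:
        text = text.lower()
    if split_hyphens:
        pattern = r"[A-Za-z0-9]+"
    else:
        # Keep runs joined by single hyphens, e.g. state-of-the-art, together.
        pattern = r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*"
    tokens = re.findall(pattern, text)
    if not keep_numbers:
        tokens = [t for t in tokens if not t.isdigit()]
    return tokens

sample = "State-of-the-art B-52 Bank, 333 B.C."
print(tokenize(sample))
print(tokenize(sample, split_hyphens=True))
print(tokenize(sample, keep_numbers=False, lowercase=False))
```

Note how removing punctuation breaks "B.C." into two one-letter tokens, which is the kind of loss the slide warns about.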

24
elimination of stop words

some words carry no meaning and should be
eliminated
in fact any word that appears in 80% of all
documents is pretty much useless, but
consider a search for "to be or not to be".
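
A sketch of stop-word selection by document frequency, assuming the threshold meant here is 80 percent of all documents. The function names and the toy collection are invented; the last line shows what happens to the Hamlet query.

```python
def frequent_terms(documents, threshold=0.8):
    """Terms that occur in at least the given share of the documents."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {t for t, n in doc_freq.items() if n / n_docs >= threshold}

def remove_stop_words(tokens, stop_words):
    """Drop every token that is in the stop-word set."""
    return [t for t in tokens if t not in stop_words]

# Toy collection of five tokenized documents.
docs = [
    "to be or not to be".split(),
    "the art of search is to be learned".split(),
    "to index is to be found".split(),
    "not all words need to be indexed".split(),
    "the query is short".split(),
]
stop_words = frequent_terms(docs)
print(sorted(stop_words))
print(remove_stop_words("to be or not to be".split(), stop_words))
```

After elimination, almost nothing of the phrase is left to search on, which is the difficulty the slide points at.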

25
stemming

in general, users search for the occurrence of a
term irrespective of grammar
plural, gerund forms, past tense can be subject
to stemming
important algorithm by Porter
evidence on effect on retrieval is mixed

26
index term selection

some engines try to capture nouns only
some nouns that appear heavily together can be
considered to be one index term, such as
"computer science"
Most web engines, however, index all words, why?

27
thesauri

a list of words and for each word, a list of
related words
synonyms
broader terms
narrower terms
used
to provide a consistent vocabulary for indexing
and searching
to assist users with locating terms for query
formulation
to allow users to broaden or narrow a query

28
use of thesauri

most users want to get a quick response
often the selection of terms is erroneous
frequently the relationship between terms in the
query is badly served by the relationships in the
thesaurus. Thus thesaurus expansion of an initial
query (if performed automatically) can lead to
bad results.

29
Online database searching

30
before a search I

What is the purpose?
brief overview
comprehensive search
What perspective on the topic?
scholarly
technical
business
popular

31
before search II

What type of information
Fulltext
Bibliographic
Directory
Numeric
Are there any known sources?
Authors
Journals
Papers
Conferences

32
before search III

What are the language restrictions?
What, if any, are the cost restrictions?
How current do the data need to be?
How much of each record is required?

33
DIALOG

34
Literature

http://training.dialog.com/sem_info/courses/pdf_sem/dlg1.pdf
http://training.dialog.com/sem_info/courses/pdf_sem/dlg2.pdf
http://training.dialog.com/sem_info/courses/pdf_sem/dlg3.pdf
http://training.dialog.com/sem_info/courses/pdf_sem/dlg4.pdf

35
databank

over 500 different databases
references and abstracts for published
literature,
business information and financial data
complete text of articles and news stories
statistical tables
Directories
Two interfaces to all this stuff.
Guided search (for neophytes)
Command search (for Masters) -> for us!!

36
Four steps in a search

Use the Databases Selection Tool to select
databases
Identify search terms
Use Dialog basic commands to conduct a search
View records online

37
B E S T strategy I

begin
b 630,636
b papersmj, not 630
expand
e co=long island university
e au=krichel, t

38
B E S T II

select
s (mate?(N)drink?) or (lex(N)para?)
s s1 and s2
type
type s1/3/1,6

39
Command search

The first thing to be done is to select a
database.
8 categories
Government
Medicine and Pharmaceuticals
News
Science and Technology
Business
Intellectual property
Reference
Social Sciences and Humanities
from there we go to command search

40
databases menus

databases are ordered in hierarchical fashion
at each level a Boolean search can be executed
on all of Dialog
on the databases in the current hierarchical
level

41
searching

result may be just a blank screen
otherwise, a table with the file number, the
database name and the number of hits appears
wait until the display is complete.
sorting of databases is possible by the number of
hits for the current query

42
blue sheet

each database name is linked to a blueish pop-up
window called the blue sheet for the database
Contents of the bluesheet are covered later
at this stage we choose a database and hit
"begin". We see that there is a command selected
"be numbers" where numbers are the ones for the
databases selected, separated by commas.

43
finding a database

file 411 contains the database of databases
'sf category' selects files belonging to the
category named category
categories are listed at
http://library.dialog.com/bluesheets
'rank files' will rank the results
'b ref,ref' will select databases using rank
references.

44
closer look at the bluesheet

file description
subject coverage (free vocabulary)
format options, lists all formats
by number (internal)
by dialog web format (external, i.e.
cross-database)
search options
basic index, i.e. subject contents
additional index, i.e. non-subject

45
search options basic index

select without qualifiers searches in all fields
in the basic index
bluesheet lists field indicators available for a
database
also note if field is indexed by word or phrase.
proximity searching only works with word indices.

46
other search options

additional indices lists those terms that can
lead a query. Often, these are phrase indexed.
special features will list other features that
the database has and that can be used in queries

47
http://openlib.org/home/krichel