LIS618 lecture 3


1
LIS618 lecture 3
  • Thomas Krichel
  • 2004-02-14

2
Structure
  • Happy Valentine's Day!
  • Theory: discussion of the Boolean model
  • Theory: the vector model
  • Practice: introducing Nexis
  • More Nexis next week

3
advantages of Boolean model
  • supposedly easy for the user to grasp
  • precise semantics of queries
  • implemented in the majority of commercial systems

4
problems of Boolean model
  • sharp distinction between relevant and irrelevant
    documents
  • no ranking possible
  • users find it difficult to formulate Boolean
    queries
  • users find it difficult to resolve Boolean queries

5
vector model
  • associates weights with each index term appearing
    in the query and in each database document.
  • relevance can be calculated as the cosine between
    the two vectors, i.e. their dot product divided
    by the product of their lengths (the square root
    of the sum of the squared weights of each vector).
    This measure varies between 0 and 1.
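
A minimal Python sketch of the cosine measure (the dictionary
representation of term weights and the example numbers below are
illustrative assumptions, not taken from the lecture):

  import math

  def cosine(query, doc):
      # Dot product of the two weight vectors divided by the product of
      # their lengths; with non-negative weights the result lies in [0, 1].
      dot = sum(w * doc.get(t, 0.0) for t, w in query.items())
      len_q = math.sqrt(sum(w * w for w in query.values()))
      len_d = math.sqrt(sum(w * w for w in doc.values()))
      if len_q == 0 or len_d == 0:
          return 0.0
      return dot / (len_q * len_d)

  # made-up weights, only to show the call
  print(cosine({"cat": 1.0, "dog": 0.5}, {"cat": 0.2, "fish": 0.9}))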

6
tf/idf
  • stands for term frequency / inverse document
    frequency
  • This refers to a technique that gives a term a
    high weight in a document if
  • the term appears frequently in the document
  • the term does not appear frequently in other
    documents
  • We will look at each component one at a time.

7
absolute maximum term frequency
  • Let F_t_d be the number of times term t appears
    in the document d. This is its absolute term
    frequency in the document.
  • Let m_d be the maximum absolute term frequency
    achieved by any term in document d. Examples:
  • Document 1: a b a a b c c d
  • m_1 = 3, because "a" appears 3 times
  • Document 2: a b a f f f e d f a a
  • m_2 = 4, because "a" and "f" each appear 4 times

8
relative document term frequency
  • The relative term frequency f_t_d is given by
  • f_t_d = F_t_d / m_d
  • that is the absolute term frequency of term t
    in document d divided by the maximum absolute
    term frequency of document d.
  • This completes the "term frequency" part of the
    tf/idf formula.
  • Let us look at this part through an example.
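
A short Python sketch of these definitions, with a document represented
simply as a list of terms (that representation is an assumption made for
illustration):

  from collections import Counter

  def relative_term_frequency(term, document):
      # f_t_d = F_t_d / m_d
      counts = Counter(document)       # F_t_d for every term t
      m_d = max(counts.values())       # maximum absolute term frequency
      return counts[term] / m_d

  # document 1 from the earlier slide: a b a a b c c d
  doc1 = ["a", "b", "a", "a", "b", "c", "c", "d"]
  print(relative_term_frequency("a", doc1))   # 3 / 3 = 1.0
  print(relative_term_frequency("c", doc1))   # 2 / 3 = 0.67, roughly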

9
main example, part I
  • Consider three documents
  • 1: a b c a f o n l p o f t y x
  • 2: a m o e e e n n n a n p l
  • 3: r a e e f n l i f f f f x l
  • First, look at the maximum frequency achieved by
    any term in a given document.
  • m_1 = 2 ("a", "f" and "o" are there twice)
  • m_2 = 4 ("n" is there four times)
  • m_3 = 5 ("f" is there five times)

10
main example part II
  • Now look at some examples of absolute term
    frequency
  • F_a_1 = 2, F_e_2 = 3, F_x_3 = 1
  • and some examples of relative term frequency
  • f_a_1 = F_a_1 / m_1 = 2 / 2 = 1
  • f_e_2 = F_e_2 / m_2 = 3 / 4 = 0.75
  • f_x_3 = F_x_3 / m_3 = 1 / 5 = 0.2
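
These numbers can be checked with a few lines of Python (again treating
each document as a list of terms):

  from collections import Counter

  docs = {
      1: "a b c a f o n l p o f t y x".split(),
      2: "a m o e e e n n n a n p l".split(),
      3: "r a e e f n l i f f f f x l".split(),
  }

  for term, d in [("a", 1), ("e", 2), ("x", 3)]:
      counts = Counter(docs[d])
      F = counts[term]              # absolute term frequency F_t_d
      m = max(counts.values())      # maximum frequency m_d
      print(f"f_{term}_{d} = {F}/{m} = {F / m}")   # 1.0, 0.75, 0.2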

11
inverse document frequency
  • Let N be the number of documents in the database.
    N = 3 in our example.
  • Let n_t be the number of documents in which the
    term t appears. In our example:
  • n_a = 3, n_e = 2, n_x = 2
  • N/n_t is an indication of the inverse document
    frequency of a term. It is larger the less often
    a term appears across documents in the database.
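
A sketch of n_t and N/n_t for the same three example documents:

  def document_frequency(term, documents):
      # n_t: the number of documents in which the term appears
      return sum(1 for doc in documents if term in doc)

  docs = [
      "a b c a f o n l p o f t y x".split(),   # document 1
      "a m o e e e n n n a n p l".split(),     # document 2
      "r a e e f n l i f f f f x l".split(),   # document 3
  ]
  N = len(docs)                                 # N = 3
  for term in ("a", "e", "x"):
      n_t = document_frequency(term, docs)
      print(term, n_t, N / n_t)   # a: 3, 1.0   e: 2, 1.5   x: 2, 1.5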

12
intermezzo the logarithm
  • The logarithm, written log(), is a mathematical
    function. You should know that
  • log() is an increasing function, i.e. the bigger
    x is, the bigger log(x) is.
  • log(1) = 0
  • log(x) > 0 if x > 1
  • Your calculator will tell you what the logarithm
    of a number is.
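
In Python the logarithm lives in the math module. Judging from the value
log(3/2) ≈ 0.176 used on the following slides, the lecture means the
base-10 logarithm (math.log10) rather than the natural one; that reading
is an inference, not stated on the slide:

  import math

  print(math.log10(1))      # 0.0, since log(1) = 0
  print(math.log10(1.5))    # about 0.176, the value used in the example
  print(math.log10(2) > math.log10(1.5))   # True: log() is increasing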

13
tf/idf formula
  • Term frequency and inverse document frequency
    have to be combined.
  • The final formula for the weight combines the
    terms as follows
  • w_t_d = f_t_d * log( N / n_t )

14
main example part III
  • N = 3
  • w_a_1 = 1 * log(3/3) = log(1) = 0 !
  • w_e_2 = 0.75 * log(3/2) ≈ 0.132
  • w_x_3 = 0.2 * log(3/2) ≈ 0.035
  • where log(3/2) is approximately 0.176
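
A minimal Python sketch that puts both parts together and reproduces
these numbers (documents as lists of terms, base-10 logarithm as
inferred above):

  import math
  from collections import Counter

  def tf_idf_weight(term, doc, documents):
      # w_t_d = f_t_d * log(N / n_t)
      counts = Counter(doc)
      f_t_d = counts[term] / max(counts.values())   # relative term frequency
      n_t = sum(1 for d in documents if term in d)  # documents containing the term
      if n_t == 0:
          return 0.0                                # term absent from the collection
      return f_t_d * math.log10(len(documents) / n_t)

  docs = [
      "a b c a f o n l p o f t y x".split(),   # document 1
      "a m o e e e n n n a n p l".split(),     # document 2
      "r a e e f n l i f f f f x l".split(),   # document 3
  ]
  print(tf_idf_weight("a", docs[0], docs))   # 0.0
  print(tf_idf_weight("e", docs[1], docs))   # about 0.132
  print(tf_idf_weight("x", docs[2], docs))   # about 0.035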

15
practical operation
  • The computer will search the documents for the
    query term and return the documents where the
    weight of the term in the index for that document is
    strictly positive, by order of weights, highest
    to lowest.
  • If there are several query terms the computer
    will perform a more complicated operation that we
    will not further study here, so we limit
    ourselves to the case of one query term.
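
For a single query term this amounts to the following sketch: compute
the weight of the term in every document, keep the documents where it
is strictly positive, and sort by weight from highest to lowest (the
weight function repeats the sketch given after the example above):

  import math
  from collections import Counter

  def tf_idf_weight(term, doc, documents):
      counts = Counter(doc)
      f_t_d = counts[term] / max(counts.values())
      n_t = sum(1 for d in documents if term in d)
      return f_t_d * math.log10(len(documents) / n_t) if n_t else 0.0

  def rank(term, documents):
      # (document number, weight) pairs with weight > 0, highest weight first
      scores = [(i + 1, tf_idf_weight(term, doc, documents))
                for i, doc in enumerate(documents)]
      return sorted([s for s in scores if s[1] > 0],
                    key=lambda s: s[1], reverse=True)

  docs = [
      "a b c a f o n l p o f t y x".split(),
      "a m o e e e n n n a n p l".split(),
      "r a e e f n l i f f f f x l".split(),
  ]
  print(rank("e", docs))   # documents 2 and 3, ordered by their weight for "e"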

16
practical tests
  • You ask the computer to query the term "a" in our
    example. What documents are being returned?
  • Compare with the result of the Boolean model.
  • You ask the computer to query the term "e". What
    documents are being returned, and in what order?

17
advantages of vector model
  • term weighting improves performance
  • sorting is possible
  • easy to compute, therefore fast
  • results are difficult to improve without
  • query expansion
  • a user feedback cycle

18
Lexis/Nexis
  • Lexis is a specialized legal research service
  • Nexis is primarily a news service
  • adds an important temporal component to all its
    contents
  • restricts contents as compared to Dialog
  • potentially bad competition from Google
  • lives at http://www.nexis.com

19
compilation of Nexis
  • Uses a number of news sources such as newspapers.
  • Uses company reports databases
  • Uses web sites, the URLs of which are found in
    the news sources. Some of the material there can
    be of low value (remember the comments in the
    first lecture)

20
SmartIndexing
  • There is a controlled vocabulary of indexing
    terms
  • A document is indexed
  • In full text view (except web sites)
  • With automatic addition of index terms that
    correspond to the document.
  • Index terms are added
  • Weight of index terms is calculated
  • http://www.lexis-nexis.com/infopro/products/index/
    has more on it.

21
equivalents
  • Nexis has a number of "equivalents" where,
    depending on sources, it replaces one with the
    other. Contrary to their claims they also work in
    quick search
  • First (second, third, etc.) = 1st (2nd, 3rd,
    etc.)
  • Monday (all days except Sunday) = Mon (Tues,
    Weds, etc.)
  • January = Jan (abbreviations work: Feb, Mar,
    etc.)
  • One (all numbers < 20) = 1 (2, 3, etc.)
  • and
  • company = co
  • corporation = corp
  • incorporated = inc

22
Six interfaces to Nexis
  • Quick search
  • Subject directory
  • Power search
  • Personal news
  • Search forms
  • Real time news
  • In the remainder of the lecture I will go through
    some of these

23
Quick search
  • Implicit OR between terms
  • Use quotes to require adjacency of terms
  • You can select from a drop-down box of sources
  • You can set the date range, though it is unclear
    what it means
  • It seems to OR a plural to your search term.
  • Sometimes returns documents with none of the
    search terms, e.g. for the query "she is the one"

24
Quick search
  • It is not clear what parts of documents are being
    searched
  • Apparently it does not search the full text.
  • But it seems to prioritize
  • TERM, i.e. smart keywords extracted,
  • HLEAD for news
  • TITLE for legal documents
  • WEB-SEARCH-TEXT for web pages

25
relevance ranking concerns
  • where terms appear within the document
  • how many occurrences of the terms appear in the
    document
  • how often those search terms appear throughout
    the document
  • apparently not how often they occur; for example,
    compare a search for "the" with "the the"
  • it seems they keep the algorithm a secret

26
Subject directory
  • you can follow the subject tree but
  • there seems to be only a tiny number of documents
  • categories are not particularly deep or developed
  • there is a "more like this" feature of limited
    use, Thomas finds

27
Power search
  • You can first create a customized set of sources
    to search
  • Do this at the start: you browse a menu, then
    click "done, search now"
  • This is a lot more efficient than trying to build
    a search strategy on a large set.

28
power search truncation
  • * represents a single character, present or
    absent
  • wom*n
  • labo*r
  • ! truncates to the end of the word
  • bookk!

29
Power search connectors
  • OR
  • AND
  • AND NOT
  • PRE/n, n is a number, ordered proximity
  • W/n, n is a number, unordered proximity
  • W/S words in the same sentence
  • W/P words in the same paragraph
  • Use parentheses!
  • There is no implicit OR as in the simple search,
    so forget about the double quotes.

30
Power search expressions
  • Parentheses group terms together
  • * for one or no letter
  • ! for any number of letters
  • ATLEAST n (term), where n is a minimum number of
    occurrences
  • PLURAL (term) only the plural of term
  • SINGULAR (term) only the singular of term
  • ALLCAPS (term) only capitals of term
  • NOCAPS (term) no capitals of term
  • CAPS (term) capitalized term only

31
http://openlib.org/home/krichel
  • Thank you for your attention!