LIS618 lecture 3


1
LIS618 lecture 3
  • Thomas Krichel
  • 2004-02-14

2
Structure
  • Happy Valentine's Day!
  • Theory: discussion of the Boolean model
  • Theory: the vector model
  • Practice: introducing Nexis
  • More Nexis next week

3
advantages of Boolean model
  • supposedly easy for the user to grasp
  • precise semantics of queries
  • implemented in the majority of commercial systems

4
problems of Boolean model
  • sharp distinction between relevant and irrelevant
    documents
  • no ranking possible
  • users find it difficult to formulate Boolean
    queries
  • users find it difficult to resolve Boolean queries

5
vector model
  • associates weights with each index term appearing
    in the query and in each database document.
  • relevance can be calculated as the cosine between
    the two vectors, i.e. their dot product divided
    by the product of their lengths (the square root
    of the sum of the squared weights of each vector).
    This measure varies between 0 and 1.
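
A minimal Python sketch of the cosine measure (the dictionary
representation of term weights and the example numbers below are
illustrative assumptions, not taken from the lecture):

  import math

  def cosine(query, doc):
      # Dot product of the two weight vectors divided by the product of
      # their lengths; with non-negative weights the result lies in [0, 1].
      dot = sum(w * doc.get(t, 0.0) for t, w in query.items())
      len_q = math.sqrt(sum(w * w for w in query.values()))
      len_d = math.sqrt(sum(w * w for w in doc.values()))
      if len_q == 0 or len_d == 0:
          return 0.0
      return dot / (len_q * len_d)

  # made-up weights, only to show the call
  print(cosine({"cat": 1.0, "dog": 0.5}, {"cat": 0.2, "fish": 0.9}))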

6
tf/idf
  • stands for term frequency / inverse document
    frequency
  • This refers to a technique that gives a term a
    high weight in a document if
  • the term appears frequently in the document
  • the term does not appear frequently in other
    documents
  • We will look at each component one at a time.

7
absolute maximum term frequency
  • Let F_t_d be the number of times term t appears
    in the document d. This is its absolute term
    frequency in the document.
  • Let m_d be the maximum absolute term frequency
    achieved by any term in document d. Examples:
  • Document 1: a b a a b c c d
  • m_1 = 3, because "a" appears 3 times
  • Document 2: a b a f f f e d f a a
  • m_2 = 4, because "a" and "f" each appear 4 times

8
relative document term frequency
  • The relative term frequency f_t_d is given by
  • f_t_d = F_t_d / m_d
  • that is the absolute term frequency of term t
    in document d divided by the maximum absolute
    term frequency of document d.
  • This completes the "term frequency" part of the
    tf/idf formula.
  • Let us look at this part through an example.
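
A short Python sketch of these definitions, with a document represented
simply as a list of terms (that representation is an assumption made for
illustration):

  from collections import Counter

  def relative_term_frequency(term, document):
      # f_t_d = F_t_d / m_d
      counts = Counter(document)       # F_t_d for every term t
      m_d = max(counts.values())       # maximum absolute term frequency
      return counts[term] / m_d

  # document 1 from the earlier slide: a b a a b c c d
  doc1 = ["a", "b", "a", "a", "b", "c", "c", "d"]
  print(relative_term_frequency("a", doc1))   # 3 / 3 = 1.0
  print(relative_term_frequency("c", doc1))   # 2 / 3 = 0.67, roughly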

9
main example, part I
  • Consider three documents
  • 1: a b c a f o n l p o f t y x
  • 2: a m o e e e n n n a n p l
  • 3: r a e e f n l i f f f f x l
  • First, look at the maximum frequency achieved by
    any term in a given document.
  • m_1 = 2 ("a", "f" and "o" are there twice)
  • m_2 = 4 ("n" is there four times)
  • m_3 = 5 ("f" is there five times)

10
main example part II
  • Now look at some examples of absolute term
    frequency
  • F_a_1 = 2, F_e_2 = 3, F_x_3 = 1
  • and some examples of relative term frequency
  • f_a_1 = F_a_1 / m_1 = 2 / 2 = 1
  • f_e_2 = F_e_2 / m_2 = 3 / 4 = 0.75
  • f_x_3 = F_x_3 / m_3 = 1 / 5 = 0.2
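
These numbers can be checked with a few lines of Python (again treating
each document as a list of terms):

  from collections import Counter

  docs = {
      1: "a b c a f o n l p o f t y x".split(),
      2: "a m o e e e n n n a n p l".split(),
      3: "r a e e f n l i f f f f x l".split(),
  }

  for term, d in [("a", 1), ("e", 2), ("x", 3)]:
      counts = Counter(docs[d])
      F = counts[term]              # absolute term frequency F_t_d
      m = max(counts.values())      # maximum frequency m_d
      print(f"f_{term}_{d} = {F}/{m} = {F / m}")   # 1.0, 0.75, 0.2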

11
inverse document frequency
  • Let N be the number of documents in the database.
    N = 3 in our example.
  • Let n_t be the number of documents in which the
    term t appears. In our example:
  • n_a = 3, n_e = 2, n_x = 2
  • N/n_t is an indication of the inverse document
    frequency of a term. It is larger the less often
    a term appears across documents in the database.
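
A sketch of n_t and N/n_t for the same three example documents:

  def document_frequency(term, documents):
      # n_t: the number of documents in which the term appears
      return sum(1 for doc in documents if term in doc)

  docs = [
      "a b c a f o n l p o f t y x".split(),   # document 1
      "a m o e e e n n n a n p l".split(),     # document 2
      "r a e e f n l i f f f f x l".split(),   # document 3
  ]
  N = len(docs)                                 # N = 3
  for term in ("a", "e", "x"):
      n_t = document_frequency(term, docs)
      print(term, n_t, N / n_t)   # a: 3, 1.0   e: 2, 1.5   x: 2, 1.5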

12
intermezzo the logarithm
  • The logarithm, written log(), is a mathematical
    function. You should know that
  • log() is an increasing function, i.e. the bigger
    x is, the bigger log(x) is.
  • log(1) = 0
  • log(x) > 0 if x > 1
  • Your calculator will tell you what the logarithm
    of a number is.
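
In Python the logarithm lives in the math module. Judging from the value
log(3/2) ≈ 0.176 used on the following slides, the lecture means the
base-10 logarithm (math.log10) rather than the natural one; that reading
is an inference, not stated on the slide:

  import math

  print(math.log10(1))      # 0.0, since log(1) = 0
  print(math.log10(1.5))    # about 0.176, the value used in the example
  print(math.log10(2) > math.log10(1.5))   # True: log() is increasing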

13
tf/idf formula
  • Term frequency and inverse document frequency
    have to be combined.
  • The final formula for the weight combines the
    terms as follows
  • w_t_d = f_t_d * log( N / n_t )

14
main example part III
  • N = 3
  • w_a_1 = 1 * log(3/3) = log(1) = 0 !
  • w_e_2 = 0.75 * log(3/2) ≈ 0.132
  • w_x_3 = 0.2 * log(3/2) ≈ 0.035
  • where log(3/2) is approximately 0.176
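
A minimal Python sketch that puts both parts together and reproduces
these numbers (documents as lists of terms, base-10 logarithm as
inferred above):

  import math
  from collections import Counter

  def tf_idf_weight(term, doc, documents):
      # w_t_d = f_t_d * log(N / n_t)
      counts = Counter(doc)
      f_t_d = counts[term] / max(counts.values())   # relative term frequency
      n_t = sum(1 for d in documents if term in d)  # documents containing the term
      if n_t == 0:
          return 0.0                                # term absent from the collection
      return f_t_d * math.log10(len(documents) / n_t)

  docs = [
      "a b c a f o n l p o f t y x".split(),   # document 1
      "a m o e e e n n n a n p l".split(),     # document 2
      "r a e e f n l i f f f f x l".split(),   # document 3
  ]
  print(tf_idf_weight("a", docs[0], docs))   # 0.0
  print(tf_idf_weight("e", docs[1], docs))   # about 0.132
  print(tf_idf_weight("x", docs[2], docs))   # about 0.035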

15
practical operation
  • The computer will search the documents for the
    query term and return the documents where the
    weight of the term in the index for that document is
    strictly positive, by order of weights, highest
    to lowest.
  • If there are several query terms the computer
    will perform a more complicated operation that we
    will not further study here, so we limit
    ourselves to the case of one query term.
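
For a single query term this amounts to the following sketch: compute
the weight of the term in every document, keep the documents where it
is strictly positive, and sort by weight from highest to lowest (the
weight function repeats the sketch given after the example above):

  import math
  from collections import Counter

  def tf_idf_weight(term, doc, documents):
      counts = Counter(doc)
      f_t_d = counts[term] / max(counts.values())
      n_t = sum(1 for d in documents if term in d)
      return f_t_d * math.log10(len(documents) / n_t) if n_t else 0.0

  def rank(term, documents):
      # (document number, weight) pairs with weight > 0, highest weight first
      scores = [(i + 1, tf_idf_weight(term, doc, documents))
                for i, doc in enumerate(documents)]
      return sorted([s for s in scores if s[1] > 0],
                    key=lambda s: s[1], reverse=True)

  docs = [
      "a b c a f o n l p o f t y x".split(),
      "a m o e e e n n n a n p l".split(),
      "r a e e f n l i f f f f x l".split(),
  ]
  print(rank("e", docs))   # documents 2 and 3, ordered by their weight for "e"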

16
practical tests
  • You ask the computer to query the term "a" in our
    example. What documents are being returned?
  • Compare with the result of the Boolean model.
  • You ask the computer to query the term "e". What
    documents are being returned, and in what order?

17
advantages of vector model
  • term weighting improves performance
  • sorting is possible
  • easy to compute, therefore fast
  • results are difficult to improve without
  • query expansion
  • a user feedback cycle

18
Lexis/Nexis
  • Lexis is a specialized legal research service
  • Nexis is primarily a news service
  • adds an important temporal component to all its
    contents
  • restricts contents as compared to Dialog
  • potentially bad competition from Google
  • lives at http://www.nexis.com

19
compilation of Nexis
  • Uses a number of news sources such as newspapers.
  • Uses company reports databases
  • Uses web sites, the URLs of which are found in
    the news sources. Some of the material there can
    be of low value (remember the comments in the
    first lecture)

20
SmartIndexing
  • There is a controlled vocabulary of indexing
    terms
  • A document is indexed
  • In full text view (except web sites)
  • With automatic addition of index terms that
    correspond to the document.
  • Index terms are added
  • Weight of index terms is calculated
  • http://www.lexis-nexis.com/infopro/products/index/
    has more on it.

21
equivalents
  • Nexis has a number of "equivalents" where,
    depending on sources, it replaces one with the
    other. Contrary to their claims they also work in
    quick search
  • First (second, third, etc.) = 1st (2nd, 3rd,
    etc.)
  • Monday (all days except Sunday) = Mon (Tues,
    Weds, etc.)
  • January = Jan (abbreviations work: Feb, Mar,
    etc.)
  • One (all numbers < 20) = 1 (2, 3, etc.)
  • and
  • company = co
  • corporation = corp
  • incorporated = inc

22
Six interfaces to Nexis
  • Quick search
  • Subject directory
  • Power search
  • Personal news
  • Search forms
  • Real time news
  • In the remainder of the lecture I will go through
    some of these

23
Quick search
  • Implicit OR between terms
  • Use quotes to require adjacency of terms
  • You can select from a drop-down box of sources
  • You can set the date range, though it is unclear
    what it means
  • It seems to OR a plural to your search term.
  • Sometimes returns documents with none of the
    search terms, e.g. for the query "she is the one"

24
Quick search
  • It is not clear what parts of documents are being
    searched
  • Apparently it does not search the full text.
  • But it seems to prioritize
  • TERM, i.e. smart keywords extracted,
  • HLEAD for news
  • TITLE for legal documents
  • WEB-SEARCH-TEXT for web pages

25
relevance ranking concerns
  • where terms appear within the document
  • how many occurrences of the terms appear in the
    document
  • how often those search terms appear throughout
    the document
  • apparently not how often they occur; for example,
    compare a search for "the" with "the the"
  • it seems they keep the algorithm a secret

26
Subject directory
  • you can follow the subject tree but
  • there seems to be only a tiny number of documents
  • categories are not particularly deep or developed
  • there is a "more like this" feature of limited
    use, Thomas finds

27
Power search
  • You can first create a customized set of sources
    to search
  • Do this at the start: you browse a menu, then
    click "done, search now"
  • This is a lot more efficient than trying to build
    a search strategy on a large set.

28
power search truncation
  • * represents a single character, present or
    absent
  • wom*n
  • labo*r
  • ! truncates to the end of the word
  • bookk!

29
Power search connectors
  • OR
  • AND
  • AND NOT
  • PRE/n, n is a number, ordered proximity
  • W/n, n is a number, unordered proximity
  • W/S words in the same sentence
  • W/P words in the same paragraph
  • Use parentheses!
  • There is no implicit OR as in the simple search,
    so forget about the double quotes.

30
Power search expressions
  • Parentheses group terms together
  • * for one or no letter
  • ! for any number of letters
  • ATLEAST n (term), where n is a minimum number of
    occurrences
  • PLURAL (term) only the plural of term
  • SINGULAR (term) only the singular of term
  • ALLCAPS (term) only capitals of term
  • NOCAPS (term) no capitals of term
  • CAPS (term) capitalized term only

31
http://openlib.org/home/krichel
  • Thank you for your attention!