1
LIS618 lecture 2
  • Thomas Krichel
  • 2002-09-22

2
Structure of talk
  • General tour of theoretical matters (continued)
  • Information retrieval models
  • vector model
  • Probabilistic model
  • Retrieval performance evaluation
  • Query languages
  • Introduction to online searching
  • Introduction to DIALOG
  • Overview
  • bluesheets

3
vector model
  • associates a weight with each index term appearing
    in the query and in each database document.
  • relevance can be calculated as the cosine between
    the two vectors, i.e. their inner product divided
    by the product of their lengths (the square roots
    of the sums of the squared weights). This measure
    varies between 0 and 1.
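The cosine measure just described can be sketched in a few lines of Python; the term-weight vectors are stored as dicts, and the query and document below are made-up examples:

```python
import math

def cosine(q, d):
    """Cosine of the angle between query and document weight vectors,
    each stored as a {term: weight} dict."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())   # inner product
    norm_q = math.sqrt(sum(w * w for w in q.values()))   # vector lengths
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_q * norm_d)

query = {"boolean": 1.0, "search": 1.0}
doc = {"boolean": 2.0, "search": 1.0, "model": 1.0}
print(round(cosine(query, doc), 3))    # some value between 0 and 1
print(cosine(query, {"dialog": 1.0}))  # 0.0: no terms in common
```

With non-negative weights the result always falls in [0, 1], matching the slide's claim.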

4
tf/idf weighting technique
  • Let n_i be the number of documents in which the
    term k_i appears. Let F_i_j be the number of
    times term i appears in document j.
  • The normalized frequency f_i_j is given by
  • f_i_j = F_i_j / max_l(F_l_j)
  • that is, the raw frequency divided by the maximum
    raw frequency achieved by any term in document j.

5
tf/idf weighting technique
  • Let N be the number of documents.
  • Then the most frequently used weighting scheme is
  • w_i_j = f_i_j log(N/n_i)
  • There are other methods, but these are variations
    on this one.
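This weighting scheme can be sketched directly from the two formulas above; the toy documents are invented for illustration:

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """w_ij = f_ij * log(N / n_i), where f_ij is the max-normalized
    term frequency and n_i the document frequency of term i."""
    N = len(docs)
    # n_i: number of documents in which term i appears
    df = Counter(t for doc in docs for t in set(doc))
    weights = []
    for doc in docs:
        raw = Counter(doc)         # F_ij, raw frequencies
        max_f = max(raw.values())  # max_l F_lj
        weights.append({t: (F / max_f) * math.log(N / df[t])
                        for t, F in raw.items()})
    return weights

docs = [["vector", "model", "vector"], ["boolean", "model"]]
w = tfidf_weights(docs)
# "vector" occurs in one of two documents, so its idf is log(2) > 0;
# "model" occurs in both, so its idf (and weight) is 0.
```

Note how a term that appears in every document gets weight zero, which is exactly the behavior the log(N/n_i) factor is meant to produce.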

6
advantages of vector model
  • term weighting improves performance
  • sorting is possible
  • easy to compute, therefore fast
  • results are difficult to improve upon without
  • query expansion
  • user relevance feedback cycles

7
probabilistic model (outline only)
  • starts with the assumption that there is a subset
    of documents that form the ideal answer set
  • query process specifies properties of the answer
    set
  • query terms can be used to form a probability
    that a document is part of the answer
  • then we start an iterative process with the user
    to gain more characteristics about the answer set

8
recursive method
  • If we assume that the probability that a document
    among a set of initially retrieved documents is
    relevant is proportional to the appearance of
    index terms that are part of the query, the
    probability estimate can be further refined.

9
probabilistic model
  • For any user requirement, we assume that there is
    an answer set and that the probability that the
    user finds a document interesting depends only on
    the document and the query
  • Then the similarity of the document to the query
    can be expressed as
  • s = (probability that the document is part of the
    answer set) / (probability that it is not part of
    the answer set)
  • There are ways to calculate this (with some more
    assumptions).

10
retrieval performance evaluation
  • There are two classic measures. Both assume that
    there is an answer set.
  • Recall is the fraction of the relevant documents
    that the query result has captured.
  • Precision is the fraction of the retrieved
    documents that is relevant.

11
recall and precision curves
  • assume that all the retrieved documents arrive at
    once and are examined in order.
  • during that process, the user discovers more and
    more relevant documents. Recall increases.
  • during the same process, at least eventually,
    there will be fewer and fewer useful documents.
    Precision declines (usually).

12
Example
  • let the answer set be 4,7,5,3,6,1,0,8,9 and let
    non-relevant documents be represented by letters.
  • A query returns the following result
  • 7,a,3,b,c,9,n,j,l,5,r,o,s,e,4.
  • for the first document, (recall, precision) is
    (10,100); for the third, (20,67); then follow
    (30,50), (40,40), (50,33)
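The (recall, precision) pairs can be recomputed from the ranking in the example. One assumption is made explicit here: the answer set is taken to contain ten relevant documents in total, which the slide's recall steps of 10% imply:

```python
# Relevant documents and ranked query result from the example.
relevant = {"4", "7", "5", "3", "6", "1", "0", "8", "9"}
ranking = ["7", "a", "3", "b", "c", "9", "n", "j", "l",
           "5", "r", "o", "s", "e", "4"]
TOTAL_RELEVANT = 10  # assumed: matches the 10% recall increments

pairs = []
found = 0
for rank, doc in enumerate(ranking, start=1):
    if doc in relevant:
        found += 1
        pairs.append((round(100 * found / TOTAL_RELEVANT),  # recall %
                      round(100 * found / rank)))           # precision %

print(pairs)  # [(10, 100), (20, 67), (30, 50), (40, 40), (50, 33)]
```

Precision at each step depends only on the ranking, so the values 67 and 33 (2/3 and 5/15) hold regardless of the answer-set size.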

13
recall/precision curves
  • Such curves can be formed for each query.
  • An average curve, for each recall level, can be
    calculated for several queries.
  • Recall and precision levels can also be used to
    calculate two single-valued summaries.

14
average precision at seen document
  • sum the precision levels at each new relevant
    document and divide by the number of relevant
    documents seen.
  • In our example, this gives 0.58
  • This measure favors retrieval methods that get
    the relevant documents to the top.
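Using the same example, this single-valued summary can be computed as a sketch:

```python
# Average precision at seen relevant documents for the example ranking.
relevant = {"4", "7", "5", "3", "6", "1", "0", "8", "9"}
ranking = ["7", "a", "3", "b", "c", "9", "n", "j", "l",
           "5", "r", "o", "s", "e", "4"]

precisions = []  # precision recorded each time a new relevant doc appears
found = 0
for rank, doc in enumerate(ranking, start=1):
    if doc in relevant:
        found += 1
        precisions.append(found / rank)

# divide by the number of relevant documents seen (five here)
avg_precision = sum(precisions) / len(precisions)
print(round(avg_precision, 2))  # 0.58
```

The exact value is (1 + 2/3 + 1/2 + 2/5 + 1/3) / 5 = 2.9 / 5 = 0.58.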

15
R-precision
  • a more ad-hoc measure.
  • Let R be the size of the answer set.
  • Take the first R results of the query.
  • Find the number of relevant documents
  • Divide by R.
  • In our example, the R-precision is .4.
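A sketch of the R-precision computation for the same example; R, the size of the answer set, is assumed to be ten to match the recall levels used earlier:

```python
# R-precision: fraction of the first R results that are relevant.
relevant = {"4", "7", "5", "3", "6", "1", "0", "8", "9"}
ranking = ["7", "a", "3", "b", "c", "9", "n", "j", "l",
           "5", "r", "o", "s", "e", "4"]
R = 10  # assumed answer-set size

r_precision = sum(1 for doc in ranking[:R] if doc in relevant) / R
print(r_precision)  # 0.4
```

Four of the first ten results (7, 3, 9, 5) are relevant, giving 4/10 = 0.4.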
  • An average can be calculated for a number of
    queries.

16
critique of recall precision
  • recall has to be estimated by an expert
  • recall is very difficult to estimate in a large
    collection
  • these measures are most appropriate when queries
    are run in batch mode; they are difficult to
    reconcile with the idea of interactive use.
  • there are some other measures.

17
simple queries
  • single-word queries
  • one word only
  • Hopefully some word combinations are understood
    as one word, e.g. on-line
  • Context queries
  • phrase queries (be aware of stop words)
  • proximity queries, generalize phrase queries
  • Boolean queries

18
simple pattern queries
  • prefix queries (e.g. "anal" for analogy)
  • suffix queries (e.g. "oral" for choral)
  • substring (e.g. "al" for talk)
  • ranges (e.g. from "held" to "hero")
  • within a distance, usually Levenshtein distance
    (i.e. the minimum number of insertions,
    deletions, and replacements), of the query term
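The Levenshtein distance mentioned above has a standard dynamic-programming implementation, sketched here row by row:

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and replacements
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # replacement
        prev = cur
    return prev[-1]

print(levenshtein("color", "colour"))    # 1 (one insertion)
print(levenshtein("survey", "surgery"))  # 2 (one replacement, one insertion)
```

A distance query then retrieves every indexed term within some small distance of the query term.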

19
regular expressions
  • come from UNIX computing
  • built from strings in which certain characters
    are metacharacters.
  • example: "pro(blem|tein)s?" matches problem,
    problems, protein and proteins.
  • example: "New .*y" matches "New Jersey", "New
    York City", and "New Delhy".
  • great variety of dialects, usually very powerful.
  • Extremely important in digital libraries.
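The two example patterns, with their metacharacters restored as "pro(blem|tein)s?" and "New .*y", can be tried with Python's re module:

```python
import re

words = re.compile(r"pro(blem|tein)s?")
print(bool(words.fullmatch("problem")))   # True
print(bool(words.fullmatch("proteins")))  # True
print(bool(words.fullmatch("process")))   # False

# ".*" matches any run of characters, so the string just has to
# start with "New " and end in "y".
places = re.compile(r"New .*y")
print(bool(places.fullmatch("New Jersey")))     # True
print(bool(places.fullmatch("New York City")))  # True
```

Here "?" makes the preceding "s" optional and "|" expresses the alternation between the two suffixes.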

20
structured queries
  • make use of document structures
  • the simplest example is when the documents are
    database records: we can search for terms in a
    certain field only.
  • if there is sufficient structure to field
    contents, a field can be interpreted as meaning
    something different from the words it contains,
    for example dates.

21
query protocols
  • There are some standard languages
  • Z39.50 queries
  • CCL, "common command language" is a development
    of Z39.50
  • CD-RDx "compact disk read only data exchange" is
    supported by US government agencies such as CIA
    and NASA
  • SFQL "structured full-text query language" is
    built on SQL

22
document preprocessing
  • operations done on the documents before indexing
  • lexical analysis
  • elimination of stop words
  • stemming of words
  • selection of index term
  • construction of term categorization structures
  • has been receiving declining attention

23
lexical analysis
  • divides a stream of characters into a stream of
    words
  • seems easy enough, but
  • should we keep numbers?
  • hyphens: compare "state-of-the-art" with "b-52"
  • removal of punctuation, but consider "333 B.C."
  • casing: compare "bank" and "Bank"
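A minimal tokenizer illustrating some of these decisions; the regular expression here is one possible choice among many, not a standard:

```python
import re

def tokenize(text, keep_numbers=True):
    """Split a character stream into lowercased word tokens.
    Hyphenated forms are kept whole; other punctuation is dropped."""
    if keep_numbers:
        pattern = r"[a-z0-9]+(?:-[a-z0-9]+)*"
    else:
        pattern = r"[a-z]+(?:-[a-z]+)*"
    return re.findall(pattern, text.lower())

print(tokenize("State-of-the-art B-52 flew in 333 B.C."))
# ['state-of-the-art', 'b-52', 'flew', 'in', '333', 'b', 'c']
```

Note how "B.C." splits into two one-letter tokens: exactly the punctuation problem the slide warns about, and a reason real lexical analyzers carry special cases.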

24
elimination of stop words
  • some words carry no meaning and should be
    eliminated
  • in fact any word that appears in 80% of all
    documents is pretty much useless, but
  • consider a search for "to be or not to be".
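The 80% rule can be sketched as a document-frequency filter; the threshold and the toy documents below are illustrative, not prescribed:

```python
from collections import Counter

def high_df_terms(docs, threshold=0.8):
    """Terms appearing in more than `threshold` of all documents,
    i.e. candidates for stop-word elimination."""
    N = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return {t for t, n in df.items() if n / N > threshold}

docs = [["to", "be", "or", "not", "to", "be"],
        ["to", "search", "is", "to", "find"],
        ["to", "index", "and", "to", "rank"]]
print(high_df_terms(docs))  # {'to'}
```

Dropping 'to' here would of course make the query "to be or not to be" unanswerable, which is the caveat the slide raises.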

25
stemming
  • in general, users search for the occurrence of a
    term irrespective of grammar
  • plural, gerund forms, past tense can be subject
    to stemming
  • important algorithm by Porter
  • evidence on effect on retrieval is mixed
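A tiny suffix-stripping sketch in the spirit of the first step of Porter's algorithm; the full algorithm has several more steps and conditions, so this is only a flavor of the idea:

```python
def stem_plural(word):
    """Strip simple plural suffixes, as in step 1a of Porter's
    algorithm. Rules are tried longest-suffix first."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word

print(stem_plural("caresses"))  # caress
print(stem_plural("cats"))      # cat
```

The later Porter steps handle gerunds, past tenses and longer derivational suffixes with conditions on the remaining stem.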

26
index term selection
  • some engines try to capture nouns only
  • some nouns that appear heavily together can be
    considered to be one index term, such as
    "computer science"
  • Most web engines, however, index all words, why?

27
thesauri
  • a list of words and for each word, a list of
    related words
  • synonyms
  • broader terms
  • narrower terms
  • used
  • to provide a consistent vocabulary for indexing
    and searching
  • to assist users with locating terms for query
    formulation
  • to allow users to broaden or narrow a query

28
use of thesauri
  • most users want to get a quick response
  • often the selection of terms is erroneous
  • frequently the relationship between terms in the
    query is badly served by the relationships in the
    thesaurus. Thus thesaurus expansion of an initial
    query (if performed automatically) can lead to
    bad results.

29
Online database searching
30
before a search I
  • what is the purpose
  • brief overview
  • comprehensive search
  • What perspective on the topic
  • scholarly
  • technical
  • business
  • popular

31
before search II
  • What type of information
  • Fulltext
  • Bibliographic
  • Directory
  • Numeric
  • Are there any known sources?
  • Authors
  • Journals
  • Papers
  • Conferences

32
before search III
  • What are the language restrictions?
  • What, if any, are the cost restrictions?
  • How current does the data need to be?
  • How much of each record is required?

33
DIALOG
34
Literature
  • http://training.dialog.com/sem_info/courses/pdf_sem/dlg1.pdf
  • http://training.dialog.com/sem_info/courses/pdf_sem/dlg2.pdf
  • http://training.dialog.com/sem_info/courses/pdf_sem/dlg3.pdf
  • http://training.dialog.com/sem_info/courses/pdf_sem/dlg4.pdf

35
databank
  • over 500 different databases
  • references and abstracts for published
    literature,
  • business information and financial data
  • complete text of articles and news stories
  • statistical tables
  • Directories
  • Two interfaces to all this stuff.
  • Guided search (for neophytes)
  • Command search (for masters): for us!

36
Four steps in a search
  • Use the Databases Selection Tool to select
    databases
  • Identify search terms
  • Use Dialog basic commands to conduct a search
  • View records online

37
B E S T strategy I
  • begin
  • b 630,636
  • b papersmj, not 630
  • expand
  • e co=long island university
  • e au=krichel, t

38
B E S T II
  • select
  • s (mate?(N)drink?) or (lex(N)para?)
  • s s1 and s2
  • type
  • type s1/3/1,6

39
Command search
  • The first thing to be done is to select a
    database.
  • 8 categories
  • Government
  • Medicine & Pharmaceuticals
  • News
  • Science & Technology
  • Business
  • Intellectual Property
  • Reference
  • Social Sciences & Humanities
  • there we go to command search

40
databases menus
  • databases are ordered in hierarchical fashion
  • at each level a Boolean search can be executed
  • on all of Dialog
  • on the databases in the current hierarchical
    level

41
searching
  • result may be just a blank screen
  • otherwise, a table with the file number, the
    database name and the number of hits appears
  • wait until the display is complete.
  • sorting of databases is possible by the number of
    hits for the current query

42
blue sheet
  • each database name is linked to a blueish pop-up
    window called the blue sheet for the database
  • the contents of the bluesheet are covered later
  • at this stage we choose a database and hit
    "begin". We see that a command "b numbers" is
    issued, where the numbers are those of the
    selected databases, separated by commas.

43
finding a database
  • file 411 contains the database of databases
  • 'sf category' selects files belonging to a
    category
  • categories are listed at http//library.dialog.com
    /bluesheets
  • 'rank files' will rank the results
  • 'b ref,ref' will select databases using rank
    references.

44
closer look at the bluesheet
  • file description
  • subject coverage (free vocabulary)
  • format options, lists all formats
  • by number (internal)
  • by dialog web format (external, i.e.
    cross-database)
  • search options
  • basic index, i.e. subject contents
  • additional index, i.e. non-subject

45
search options basic index
  • select without qualifiers searches in all fields
    in the basic index
  • bluesheet lists field indicators available for a
    database
  • also note if field is indexed by word or phrase.
    proximity searching only works with word indices.

46
other search options
  • the additional indices list those terms that can
    lead a query. Often, these are phrase indexed.
  • special features will list other features that
    the database has and that can be used in queries

47
http://openlib.org/home/krichel
  • Thank you for your attention!