IST 2140 Information Storage and Retrieval - Transcript and Presenter's Notes


1
IST 2140 Information Storage and Retrieval
  • Week 3
  • Information Retrieval Models

2
Retrieval Tasks
  • Ad hoc retrieval
  • Static documents, new queries
  • Formerly called retrospective retrieval
  • Filtering
  • Static queries, new documents
  • Based on user profile
  • Formerly called SDI (selective dissemination of
    information) or current awareness

3
Information Retrieval
[Diagram: a query and the document collection feed into the matching process]
4
Basic concepts
  • Document is described by a set of representative
    keywords (index terms)
  • Keywords may have binary weights or weights
    calculated from statistics of their frequency in
    text
  • Retrieval is a matching process between
    document keywords and words in queries

5
Information Retrieval Models
  • A model is an embodiment of the theory in which
    we define a set of objects about which assertions
    can be made and restrict the ways in which
    classes of objects can interact
  • A retrieval model specifies the representations
    used for documents and information needs, and how
    they are compared.
  • (Turtle & Croft, 1992)

6
A formal characterization of Information
Retrieval Models
  • An information retrieval model is a quadruple
    [D, Q, F, R(qi, dj)] where
  • D is a set of representations for the documents
    in the collection
  • Q is a set of representations for the user
    information needs (queries)
  • F is a framework for modelling document
    representations, queries, and their relationships
  • R(qi, dj) is a ranking function which associates a
    real number with a query qi (qi ∈ Q) and document
    representation dj (dj ∈ D) (Baeza-Yates &
    Ribeiro-Neto, 1999)
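To make the quadruple concrete, here is a minimal Python sketch (purely illustrative; the class and method names are assumptions, not part of the slides) that treats a retrieval model as document representations D, query representations Q, and a ranking function R(q, d) returning a real number.

```python
from abc import ABC, abstractmethod
from typing import Any, List, Tuple


class RetrievalModel(ABC):
    """Illustrative skeleton of the (D, Q, F, R) quadruple:
    represent_* produce elements of D and Q, the subclass supplies
    the framework F, and score() plays the role of R(q, d)."""

    @abstractmethod
    def represent_document(self, text: str) -> Any:
        """Map raw document text to its representation dj in D."""

    @abstractmethod
    def represent_query(self, text: str) -> Any:
        """Map an information need to its representation qi in Q."""

    @abstractmethod
    def score(self, q: Any, d: Any) -> float:
        """R(qi, dj): associate a real number with a query/document pair."""

    def rank(self, q: Any, docs: List[Any]) -> List[Tuple[int, float]]:
        """Order documents by descending R(q, d)."""
        scored = [(i, self.score(q, d)) for i, d in enumerate(docs)]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)
```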

7
Information Retrieval Models
  • Three classic models
  • Boolean Model
  • Vector Space Model
  • Probabilistic Model
  • Additional models
  • Extended Boolean
  • Fuzzy matching
  • Cluster-based retrieval
  • Language models

8
Implementation vs. Models
  • An IR model is a formalization of the way of
    thinking about information retrieval
  • Compare to implementation: how to operationalize
    the model in a given environment (e.g. file
    structures)

9
Classic Retrieval Models
  • Boolean
  • Documents and queries are sets of index terms
  • set theoretic
  • Vector
  • Documents and queries are vectors in
    L-dimensional space
  • algebraic
  • Probabilistic
  • Based on probability theory

10
Boolean Model
  • first online systems in the 60s and 70s
  • most widely used in commercial IR
  • AND, OR, NOT operators
  • usually supplemented with proximity operators
  • requires an exact match
  • based on inverted file

11
Boolean Model
  • Based on set theory and Boolean algebra
  • Queries are specified as Boolean expressions
  • Widely used in commercial IR systems (Dialog,
    Lexis/Nexis)
  • Interpreted using Venn Diagrams

12
Boolean Operators
  • AND
  • OR
  • NOT

13
Boolean AND
  • Information AND Retrieval

[Venn diagram: overlapping circles labelled Information and Retrieval; the AND query matches their intersection]
14
Boolean OR
  • Cats OR Felines

[Venn diagram: overlapping circles labelled Cats and Felines; the OR query matches their union]
15
Example
  • Draw a Venn diagram for
  • Care and feeding and (cats or dogs)
  • What is the meaning of
  • Information and retrieval and performance or
    evaluation

16
Boolean-based Matching
  • Exact match systems separate the documents
    containing a given term from those that do not.
  • No similarity between document and query
    structure
  • Proximity judgment: gradations of the retrieved
    set

[Figure: Boolean queries evaluated against a binary term-document incidence matrix. Terms: Mediterranean, scholarships, horticulture, agriculture, cathedrals, adventure, disasters, leprosy, recipes, bridge, tennis, Venus, flags. Example queries: flags AND tennis, leprosy AND tennis, Venus OR (tennis AND flags), (bridge OR flags) AND tennis]
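To connect the figure to the inverted-file idea mentioned earlier, here is a small illustrative sketch (the four documents and their terms are made up) in which Boolean AND, OR and NOT map directly onto set operations over posting lists.

```python
# Minimal sketch of Boolean retrieval over an inverted file.
# The tiny corpus and its terms are made up for illustration.

docs = {
    "d1": {"flags", "tennis", "venus"},
    "d2": {"tennis", "bridge"},
    "d3": {"leprosy", "recipes"},
    "d4": {"flags", "bridge", "tennis"},
}

# Inverted file: term -> set of documents containing it.
index = {}
for doc_id, terms in docs.items():
    for term in terms:
        index.setdefault(term, set()).add(doc_id)

def postings(term):
    """Posting list for a term (empty set if the term never occurs)."""
    return index.get(term, set())

# Boolean operators map directly onto set operations:
# AND -> intersection, OR -> union, NOT -> set difference.
print(postings("flags") & postings("tennis"))                         # flags AND tennis
print(postings("venus") | (postings("tennis") & postings("flags")))   # Venus OR (tennis AND flags)
print((postings("bridge") | postings("flags")) & postings("tennis"))  # (bridge OR flags) AND tennis
print(postings("tennis") - postings("flags"))                         # tennis AND NOT flags
```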
17
Sample Boolean Query
  • On DIALOG
  • SELECT nutritional()value? and (junk()food? or
    fast()food? or pizza? or hamburger? or
    french()(fry or fries))
  • (note use of positional operators)

18
Boolean model
  • Terms are present or absent
  • Index term weights are binary (0 or 1)
  • Exact match system: documents are predicted to be
    relevant or non-relevant

19
Other Boolean features
  • Order dependency of operators
  • ( ), NOT, AND, OR (DIALOG)
  • May differ on different systems
  • Nesting of search terms
  • Nutrition and (fast or junk) and food

20
Other Boolean features
  • Extensions to AND operator
  • Positional operators
  • adjacency A ADJ B, A(W)B
  • Field limitations
  • A/TI, B/TI,DE
  • Extensions to OR operator
  • Truncation
  • A?, A?? ?
  • Browsing indexes

21
Pros and Cons
  • + simple formalism
  • - hard to control (too many, too few documents
    retrieved)
  • - requires formally constructed queries
  • - performance issues??

22
Extended Boolean model
  • Add partial matching and term weighting
  • E.g.
  • take the retrieval set and weight documents by
    binary term matches
  • take the retrieval set and calculate weights and
    ranks within the set

23
Vector Model
  • Non-binary weights
  • Calculate degree of similarity between document
    and query
  • Ranked output by sorting similarity values
  • Also called vector space model

24
Vector Space Model
  • based on idea of n-dimensional document space
  • query is also located in document space
  • documents are ranked in order of their
    closeness to the query
  • many possible matching functions

25
Vector Space Model
  • Documents and queries are points in L-dimensional
    space (where L is number of unique index terms in
    the data collection)

[Figure: a query Q and a document D plotted as points in the document space]
26
Document and Query Vectors
  • Documents and Queries are vectors of terms
  • Actual vectors have many terms (thousands)
  • Vectors can use binary keyword weights or
    real-valued weights between 0 and 1 (e.g. term
    frequencies)
  • Example terms: dog, cat, house, sink, road, car
  • Binary: (1,1,0,0,0,0), (0,0,1,1,0,0)
  • Weighted: (0.01, 0.01, 0.002, 0.0, 0.0, 0.0)
  • Queries can be weighted also (see the sketch below)
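The sketch below (example texts are made up; the six-term vocabulary comes from the slide) shows one way to turn documents and queries into binary or term-frequency vectors over a fixed vocabulary.

```python
# Sketch: representing documents and queries as binary or
# term-frequency vectors over a fixed vocabulary.

vocabulary = ["dog", "cat", "house", "sink", "road", "car"]

def binary_vector(text):
    """1 if the vocabulary term occurs in the text, else 0."""
    tokens = set(text.lower().split())
    return [1 if term in tokens else 0 for term in vocabulary]

def tf_vector(text):
    """Term-frequency weights, normalized by document length."""
    tokens = text.lower().split()
    return [tokens.count(term) / len(tokens) for term in vocabulary]

doc = "the dog chased the cat around the house"
query = "cat dog"

print(binary_vector(doc))    # [1, 1, 1, 0, 0, 0]
print(binary_vector(query))  # [1, 1, 0, 0, 0, 0]
print(tf_vector(doc))        # [0.125, 0.125, 0.125, 0.0, 0.0, 0.0]
```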

27
Vector Space Model with Term Weights
  • assume document terms have different values for
    retrieval
  • therefore assign weights to each term in each
    document
  • example: TFxIDF weighting (see the sketch below)
  • proportional to frequency of term in document
  • inversely proportional to frequency of term in
    collection
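A minimal sketch of one common TFxIDF variant, assuming tf = term frequency normalized by document length and idf = log(N / document frequency); actual systems differ in the exact formula, so treat this as an illustration rather than the weighting the slides have in mind.

```python
import math

# Toy collection; each document is a list of tokens (made up for illustration).
collection = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "cars on the road".split(),
]

N = len(collection)  # number of documents in the collection

# Document frequency: in how many documents does each term appear?
df = {}
for doc in collection:
    for term in set(doc):
        df[term] = df.get(term, 0) + 1

def tfidf(doc):
    """Weight of each term in a document: tf * log(N / df)."""
    weights = {}
    for term in set(doc):
        tf = doc.count(term) / len(doc)   # proportional to in-document frequency
        idf = math.log(N / df[term])      # inversely proportional to collection frequency
        weights[term] = tf * idf
    return weights

print(tfidf(collection[0]))
# Terms common to every document (e.g. "the") get idf = log(1) = 0.
```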

28
A 2-D Document Space
  • Consider a document collection with only two
    words: cat, dog
  • Three document vectors
  • D1
  • D2
  • D3

[Figure: the three document vectors D1, D2, D3 plotted in the 2-D cat/dog space]
29
A 3-D Document Space
[Figure: a document Di = 3t1 + 4t2 + 2t3 and a query vector plotted in the 3-D space spanned by t1, t2, t3]
30
Term document matrix
  • A document collection can be represented by a
    matrix in which rows are documents and columns
    are terms; entries are 0 (if the term is absent from
    a document) or a value giving the term's weight
  • Note that this is not effective as an
    implementation; other data structures, such as the
    inverted file, are used in practice (see the sketch
    below)
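As a rough illustration (toy collection; raw term counts stand in for weights), the sketch below builds both the dense term-document matrix and the equivalent inverted file, which stores only the non-zero entries and is what implementations actually use.

```python
# Toy collection (made up); weights here are raw term counts for simplicity.
collection = {
    "d1": "cat cat dog".split(),
    "d2": "dog house".split(),
    "d3": "road car car".split(),
}

terms = sorted({t for doc in collection.values() for t in doc})

# Dense term-document matrix: one row per document, one column per term.
matrix = {
    doc_id: [tokens.count(t) for t in terms]
    for doc_id, tokens in collection.items()
}

# Inverted file: term -> list of (doc_id, weight) postings, non-zero entries only.
inverted = {t: [] for t in terms}
for doc_id, tokens in collection.items():
    for t in set(tokens):
        inverted[t].append((doc_id, tokens.count(t)))

print(terms)           # ['car', 'cat', 'dog', 'house', 'road']
print(matrix["d1"])    # [0, 2, 1, 0, 0]
print(inverted["dog"]) # [('d1', 1), ('d2', 1)]
```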

31
Vector-based Matching Metrics
  • Metric or distance measure: documents close
    together in the vector space are likely to be
    highly similar

32
Measuring similarity
  • Need a means of computing similarity between
    documents and queries (or pairs of documents)
  • Many measures proposed; none is ideal
  • Allows
  • Ranking
  • Thresholding
  • Query reformulation

33
Some conventions
  • N = number of documents in the collection
  • L = number of unique terms in the collection
  • Di = the ith document in the collection
  • Tj = the jth term in the collection
  • Q = a query
  • Wij = the weight of term i in document j
  • WiQ = the weight of term i in the query Q

34
Inner product
  • Many similarity functions are based on the inner
    product
  • SIM(Dj, Q) = Σi (Wij × WiQ)
  • For binary vectors, reduces to number of terms in
    common in D and Q
  • For weighted terms it is the sum of the products
    of the term weights in documents and query
  • Favours longer documents
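A small sketch of the inner product with made-up weight vectors; for binary vectors it counts the terms shared by document and query, and for weighted vectors longer documents tend to accumulate larger sums.

```python
def inner_product(d, q):
    """Sum of products of corresponding term weights."""
    return sum(wd * wq for wd, wq in zip(d, q))

# Binary vectors: the result is the number of terms in common.
print(inner_product([1, 1, 0, 1], [1, 0, 0, 1]))                  # 2

# Weighted vectors: the result is the sum of weight products.
print(inner_product([0.5, 0.2, 0.0, 1.0], [0.5, 0.0, 0.0, 1.0]))  # 1.25
```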

35
Calculating Degree of Similarity
  • Many candidate measures, e.g.
  • Dice Coefficient
  • Jaccard Coefficient
  • Cosine Coefficient
  • Based on contribution to similarity of
    co-occurring terms
  • Generally normalized by length of document

36
Dice Coefficient
  • SIM(D, Q) =
          2 Σ (wiD × wiQ)
      -------------------------
        Σ wiD² + Σ wiQ²
  • Reduces to 2A/(B + C) for binary documents
    (A = terms in common, B = terms in D, C = terms in Q)
  • Sum is taken over all i from 1 to L (dimensionality
    of vector space)

37
Jaccard Coefficient
  • SIM(D, Q) =
              Σ (wiD × wiQ)
      --------------------------------------------------
      Σ wiD² + Σ wiQ² - Σ (wiD × wiQ)
  • Reduces to A/(B + C - A) for binary documents
  • Sum is taken over all i from 1 to L (dimensionality
    of vector space)

38
Cosine Measure
  • Very commonly used
  • Measures the cosine of the angle between
    document-query (or document-document) vectors

[Figure: the angle between document vector Di and query vector Q]
39
Cosine coefficient
  • SIM(Dj, Q) =
              Σ (wi,Dj × wi,Q)
      ----------------------------------
      SQRT(Σ wi,Dj²) × SQRT(Σ wi,Q²)
  • Sum is taken over all i from 1 to L (dimensionality
    of vector space)
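The sketch below implements the three coefficients exactly as written on the preceding slides, using made-up weight vectors; each measure normalizes the same inner product differently.

```python
import math

def inner(d, q):
    return sum(wd * wq for wd, wq in zip(d, q))

def dice(d, q):
    """2 * inner product / (sum of squared weights in D and in Q)."""
    return 2 * inner(d, q) / (inner(d, d) + inner(q, q))

def jaccard(d, q):
    """Inner product / (squared weights minus the inner product)."""
    return inner(d, q) / (inner(d, d) + inner(q, q) - inner(d, q))

def cosine(d, q):
    """Inner product normalized by the two vector lengths."""
    return inner(d, q) / (math.sqrt(inner(d, d)) * math.sqrt(inner(q, q)))

doc   = [0.5, 0.8, 0.3, 0.0]   # term weights for a document (made up)
query = [1.0, 0.0, 0.5, 0.0]   # term weights for a query (made up)

for name, sim in [("Dice", dice), ("Jaccard", jaccard), ("Cosine", cosine)]:
    print(name, round(sim(doc, query), 3))
```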

40
What Similarity Function to Use?
  • Some studies have compared the performance of
    measures (e.g. McGill compared 67 measures)
  • Significant differences in performance; measures may
    perform differently with high- or low-frequency
    terms
  • Similar measures may be monotonically related
  • No consensus over which types of measure are best
    for which applications
  • Cosine measure is widely used

41
Pros and Cons
  • + Term weights improve retrieval performance
  • + Partial matching means the query need not match
    exactly
  • + Ranking provides guidance to the searcher
  • + Relevance feedback is feasible
  • - assumes orthogonality of index terms, which is
    obviously false
  • - high degree of empiricism
  • - less control by the searcher?