Title: Information Retrieval and Recommendation Techniques
1Information Retrieval and Recommendation
- Reality (the real world) cannot be known in its entirety.
- Reality is represented by a collection of data abstracted from observation of the real world.
- Information need drives the storage and retrieval of information.
- Relationships among reality, information need, data, and query (see Figure 1.1).
3Information Systems
- Two portions: endosystem and ectosystem.
- The ectosystem has three human components:
  - User
  - Funder
  - Server: the information professional who operates the system and provides service to the user.
- The endosystem has four components:
  - Media
  - Devices
  - Algorithms
  - Data structures
- Performance is dictated by the endosystem but judged by the ectosystem.
- The user is mainly concerned with effectiveness.
- The server is more aware of efficiency.
- The funder is more concerned with the economy of the system.
- This course concentrates primarily on effectiveness measures.
- So-called user satisfaction has many meanings, and different users may use different criteria.
- A fixed set of criteria must be established for fair comparison.
5From Signal to Wisdom
- Five stepping stones:
  - Signal: bit stream, wave, etc.
  - Data: impersonal, available to any user.
  - Information: a set of data matched to a particular information need.
  - Knowledge: coherence of data, concepts, and rules.
  - Wisdom: a balanced judgment in the light of certain value criteria.
6Chapter 2 Document and Query Forms
7What is a document?
- A paper or a book? A section or a chapter?
- There is no strict definition of the scope and format of a document.
- The document concept can be extended to include programs, files, email messages, images, voices, and videos.
- However, most commercial IR systems handle multimedia documents through their textual representations.
- The focus of this course is on text retrieval.
8Data Structures of Documents
- Fully formatted documents: typically, these are entities stored in DBMSs.
- Fully unformatted documents: typically, these are data collected via sensors (e.g., medical monitoring, sound and image data) or text from a text editor.
- Most textual documents, however, are semi-structured, including title, author, source, abstract, and other structural information.
9Document Surrogates
- A document surrogate is a limited representation of a full document. It is the main focus of storing and querying for many IR systems.
- How to generate and evaluate document surrogates in response to users' information needs is an important topic.
10Ingredients of document surrogates
- Document identifier: could be a meaningless one such as a record id, or a more elaborate identifier such as a Library of Congress classification number for books (e.g., T210 C37 1982).
- Title
- Names: author, corporate, publisher
- Dates: for timeliness and appropriateness
- Unit descriptor: Introduction, Conclusion,
11Ingredients of document surrogates
- Keywords
- Abstract: a brief one- or two-paragraph description of the contents of a paper.
- Extract: similar to an abstract but created by someone other than the authors.
- Review: similar to an extract but meant to be critical. The review itself is a separate document that is worth retrieving.
12Vocabulary Control
- It specifies a finite vocabulary to be used for specifying keywords.
- Advantages
  - Uniformity throughout the retrieval system
  - More efficient
- Disadvantages
  - Authors/users cannot give/retrieve more detailed information.
- Most IR systems nowadays opt for an uncontrolled vocabulary and rely on a sound internal thesaurus to bring together related terms.
13Encoding Standards
- ASCII: a standard for English text encoding. However, it does not cover characters of different fonts, mathematical symbols, etc.
- Big-5: traditional Chinese character set with 2 bytes.
- GB: simplified Chinese character set with XX bytes.
- CCCII: a full traditional Chinese character set with at most 6 bytes.
- Unicode: a unified encoding that tries to cover characters from multiple nations.
14Markup languages
- Initially used by word processors (.doc, .tex) and printers (.ps, .pdf).
- More recently used for representing documents with hypertext information (HTML, SGML) on the WWW.
- A document written in a markup language can be segmented into several portions that better represent that document for searching.
15Query Structures
- Two types of matches
- Exact match (equality match and range match)
- Approximate match
16Boolean Queries
- Based on Boolean algebra
- Common connectives: AND, OR, NOT
- E.g., A AND (B OR C) AND D
- Each term could be expanded by stemming or by a list of related terms from a thesaurus.
- E.g., inf -> information, vegetarian -> Mideastern countries
- A XOR B ≡ (A AND NOT B) OR (NOT A AND B)
- By far the most popular retrieval approach.
17Boolean Queries (Contd)
- Additional operators
  - Proximity (e.g., icing within 3 words of chocolate)
  - K out of N terms (e.g., 3 OF (A, B, C))
- Problems
  - No good way to weight terms
    - E.g., music by Beethoven, preferably sonata: (Beethoven AND sonata) OR (Beethoven)
  - Easy to misuse (e.g., people who like to have dinner with sports or symphony may specify dinner AND sports AND symphony).
18Boolean Queries (Contd)
- Order of precedence may not be natural to users (e.g., A OR B AND C). People tend to interpret requests depending on the semantics.
  - E.g., coffee AND croissant OR muffin
  - Raincoat AND umbrella OR sunglasses
- Users may construct a highly complex query.
- There are techniques for simplifying a given query into disjunctive normal form (DNF) or conjunctive normal form (CNF).
- It has been shown that every Boolean expression can be converted to an equivalent DNF or CNF.
19Boolean Queries (Contd)
- DNF: a disjunction of several conjuncts, each of which consists of terms (possibly negated) connected by AND.
- E.g., (A AND B) OR (A AND NOT C), which is equivalent to A AND (B OR NOT C).
- CNF: a conjunction of several disjuncts, each of which consists of terms (possibly negated) connected by OR.
- Normalization to DNF can be done by looking at the TRUE rows of the truth table, while normalization to CNF can be done by looking at the FALSE rows.
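The truth-table normalization can be sketched in a few lines of Python. This is a minimal illustration, not a query optimizer: the function names `truth_rows`, `to_dnf`, and `to_cnf` are hypothetical, and the output is the full (unsimplified) normal form with one clause per row.

```python
from itertools import product

def truth_rows(expr, names):
    """Yield (assignment, value) for every row of the truth table."""
    for values in product([False, True], repeat=len(names)):
        row = dict(zip(names, values))
        yield row, bool(expr(**row))

def to_dnf(expr, names):
    """Build one conjunct per TRUE row and join the conjuncts with OR."""
    terms = []
    for row, value in truth_rows(expr, names):
        if value:
            literals = [n if row[n] else f"NOT {n}" for n in names]
            terms.append("(" + " AND ".join(literals) + ")")
    return " OR ".join(terms)

def to_cnf(expr, names):
    """Build one disjunct per FALSE row (literals negated) and join with AND."""
    clauses = []
    for row, value in truth_rows(expr, names):
        if not value:
            literals = [f"NOT {n}" if row[n] else n for n in names]
            clauses.append("(" + " OR ".join(literals) + ")")
    return " AND ".join(clauses)

# Illustrative input: exclusive or of two terms.
xor = lambda A, B: (A and not B) or (not A and B)
dnf = to_dnf(xor, ["A", "B"])
cnf = to_cnf(xor, ["A", "B"])
print(dnf)
print(cnf)
```

The table has 2^n rows for n terms, so this approach is only practical for the short queries that users typically write.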
20Boolean Queries (Contd)
- The size of the returned set could be explosively large. Solution: return only a limited number of records.
- Though there are many problems with Boolean queries, they are still popular because people tend to use only two or three terms at a time.
21Vector Queries
- Each document is represented as a vector, or a list of terms.
- The similarity between a document and a query is based on the presence of terms in both the query and the document.
- The simplest model is the 0-1 vector. A more general model is the weighted vector.
- Assigning weights to a document or a query is a complex process.
- It is reasonable to assume that more frequent terms are more important.
22Vector Queries (Contd)
- It is better to give the user the freedom to assign weights. In this case, a conversion between user weights and system weights must be done. (Show the conversion equation.)
- There are two types of vector queries (for similarity search):
  - Top-N queries
  - Threshold-based queries
23Extended Boolean Queries
- This approach incorporates weights into Boolean queries. A general form is A_w1 * B_w2, where * is a Boolean connective (e.g., A_0.2 AND B_0.6).
- A OR B_0.2 retrieves all documents that contain A, plus those documents in B that are within the top 20% closest to the documents in A.
- A OR B_1 ≡ A OR B
- A OR B_0 ≡ A
- See Figure 3.1 for a diagrammatic illustration.
24Extended Boolean Queries (Contd)
- A AND B_0.2
- A AND B_0 ≡ A
- See Figure 3.2 for a graphical illustration.
- A AND NOT B_0.2
- See Figure 3.3 for a graphical illustration.
- A_0.2 OR B_0.6 returns the 20% of the documents in A-B that are closest to B and the 60% of the documents in B-A that are closest to A.
25Extended Boolean Queries (Contd)
- See Example 3.1.
- One needs to define the distance between a document and a set of documents (those that contain A).
- The computation of an extended Boolean query could be time-consuming.
- This model has not become popular.
26Fuzzy Queries
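- It is based on fuzzy sets.
- In a fuzzy set S, each element of S is associated with a membership grade.
- Formally, $S = \{\langle x, \mu_S(x) \rangle \mid \mu_S(x) > 0\}$.
- $A \cap B = \{x \mid x \in A \text{ and } x \in B\}$, $\mu(x) = \min(\mu_A(x), \mu_B(x))$.
- $A \cup B = \{x \mid x \in A \text{ or } x \in B\}$, $\mu(x) = \max(\mu_A(x), \mu_B(x))$.
- $\text{NOT } A = \{x \mid x \notin A\}$, $\mu(x) = 1 - \mu_A(x)$.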
27Fuzzy Queries (Contd)
- To use fuzzy queries, documents must be fuzzy too.
- The documents are returned to the user in decreasing order of their fuzzy values associated with the fuzzy query.
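A small sketch of fuzzy query evaluation, assuming each term is stored as a dictionary from document id to membership grade (that representation, the function names, and the sample grades are all illustrative assumptions). AND takes the minimum grade, OR the maximum, NOT the complement, and results are ranked by decreasing grade.

```python
def fuzzy_and(a, b):
    """Intersection: the grade is the minimum of the two membership grades."""
    grades = {x: min(a.get(x, 0.0), b.get(x, 0.0)) for x in a.keys() | b.keys()}
    return {x: g for x, g in grades.items() if g > 0}

def fuzzy_or(a, b):
    """Union: the grade is the maximum of the two membership grades."""
    grades = {x: max(a.get(x, 0.0), b.get(x, 0.0)) for x in a.keys() | b.keys()}
    return {x: g for x, g in grades.items() if g > 0}

def fuzzy_not(a, universe):
    """Complement over an explicit universe of document ids."""
    grades = {x: 1.0 - a.get(x, 0.0) for x in universe}
    return {x: g for x, g in grades.items() if g > 0}

def ranked(fuzzy_set):
    """Return (document, grade) pairs in decreasing order of grade."""
    return sorted(fuzzy_set.items(), key=lambda item: item[1], reverse=True)

# Hypothetical membership grades of documents in two term sets.
A = {"d1": 0.9, "d2": 0.4, "d3": 0.1}
B = {"d1": 0.3, "d2": 0.8}
universe = ["d1", "d2", "d3", "d4"]

print(ranked(fuzzy_and(A, B)))
print(ranked(fuzzy_or(A, B)))
print(ranked(fuzzy_not(A, universe)))
```

Documents whose grade drops to 0 are removed, matching the convention that a fuzzy set only lists elements with positive membership.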
28Probabilistic Queries
- Similar to fuzzy queries, but now the membership function is a probability.
- The probability of a document being associated with a query (or term) can be calculated through probability theory (e.g., Bayes' theorem) after some observation.
29Natural Language Queries
- Convenient
- Imprecise, inaccurate, and frequently ungrammatical.
- The difficulties lie in obtaining an accurate interpretation of a longer text, which may rely on common sense.
- A successful system must be restricted to a narrowly defined domain (e.g., medicine vs. diagnosis of
30Information Retrieval and Database Systems
- Should one use a database system to handle information retrieval requests?
- DBMS is a mature and successful technology for handling precise queries.
- It is not appropriate for handling imprecise textual elements.
- OODBs provide augmented functions for textual or image elements and are considered a good
31The Matching Process
32Boolean based matching
- It divides the document space into two: those satisfying the query and those that do not.
- A finer grading of the set of retrieved documents can be defined on the number of terms satisfied (e.g., A OR B OR C).
33Vector-based matching
- Measures
  - Based on the idea of distance
    - Minkowski metric ($L_q$): $L_q = (|X_{i1}-X_{j1}|^q + |X_{i2}-X_{j2}|^q + |X_{i3}-X_{j3}|^q + \dots + |X_{ip}-X_{jp}|^q)^{1/q}$
    - Special cases: Manhattan distance ($q = 1$), Euclidean distance ($q = 2$), and maximum direction distance ($q = \infty$).
    - See the example on p. 133.
  - Based on the idea of angle
    - Cosine function: $(Q \cdot D)/(|Q|\,|D|)$.
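Both families of measures are easy to try on small vectors. The sketch below is an illustration with hypothetical function names; the three two-term vectors are sample inputs chosen so that two of them point in the same direction but differ greatly in length.

```python
import math

def minkowski(x, y, q):
    """Minkowski distance L_q; q = math.inf gives the maximum direction distance."""
    if math.isinf(q):
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def cosine(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

d1, d2, d3 = (1, 3), (100, 300), (3, 1)

print(minkowski(d1, d3, 1))         # Manhattan distance
print(minkowski(d1, d3, 2))         # Euclidean distance
print(minkowski(d1, d3, math.inf))  # maximum direction distance
print(minkowski(d1, d2, 2))         # far apart by distance ...
print(cosine(d1, d2))               # ... but the same direction by angle
print(cosine(d1, d3))
```

By distance, `d1` is much closer to `d3` than to `d2`; by angle, `d1` and `d2` are indistinguishable. Which behavior is preferable depends on whether vector length (for example, document length) should matter.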
34Mapping distance to similarity
- It is better to map distance (or dissimilarity) into some range, e.g., [0, 1].
- A simple inversion function is $\sigma = b^{-u}$.
- A more general inversion function is $\sigma = b^{-p(u)}$, where $p(u)$ is a monotone nondecreasing function such that $p(0) = 0$.
- See Fig. 4.1 for a graphical illustration.
35Distance or cosine?
- <1, 3>, <100, 300>, <3, 1>: which pair is similar?
- In practice, distance and angular measures seem to give results of similar quality because the cluster of documents all roughly lie in the same
36Missing terms and term relationships
- The conventional value 0 means
  - Truly missing
  - No information
- However, if 0 is regarded as undefined, it becomes impossible to measure the distance between two documents (e.g., <3, -> and <-, 4>).
- Terms used to define the vector model are clearly not independent, e.g., digital and computer have a strong relationship.
- However, the effect of dependent terms is hardly
37Probability matching
- For a given query, we can define the probability that a document is relevant as $P(rel) = n/N$.
- The discriminant function on the selected set is $dis(selected) = P(rel \mid selected) / P(\neg rel \mid selected)$.
- The desirable discriminant function value of a set is at least 1.
- Let a document be represented by terms $t_1, \dots, t_n$, and assume they are statistically independent. Then $P(selected \mid rel) = P(t_1 \mid rel)\, P(t_2 \mid rel) \cdots P(t_n \mid rel)$.
- We can use Bayes' theorem to calculate the probability that a document should be selected.
- See Example 4.1.
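As a rough illustration of the discriminant idea, the sketch below computes the odds P(rel | doc) / P(not rel | doc) with Bayes' theorem under the term-independence assumption. The function name, the probability tables, and the prior are all made-up values for illustration, and only the terms present in the document are considered, which is a simplification.

```python
def discriminant(doc_terms, p_term_rel, p_term_nonrel, p_rel):
    """Odds of relevance for a document, assuming independent terms.

    By Bayes' theorem the shared denominator P(doc) cancels, leaving
    P(doc | rel) P(rel) / (P(doc | not rel) P(not rel)).
    """
    likelihood_rel = 1.0
    likelihood_nonrel = 1.0
    for term in doc_terms:
        likelihood_rel *= p_term_rel[term]
        likelihood_nonrel *= p_term_nonrel[term]
    return (likelihood_rel * p_rel) / (likelihood_nonrel * (1.0 - p_rel))

# Hypothetical estimates from observed relevance judgments.
p_rel = 0.2  # P(rel) = n/N, e.g. 20 relevant documents out of 100
p_term_rel = {"retrieval": 0.8, "index": 0.5}
p_term_nonrel = {"retrieval": 0.1, "index": 0.2}

odds_both = discriminant(["retrieval", "index"], p_term_rel, p_term_nonrel, p_rel)
odds_index = discriminant(["index"], p_term_rel, p_term_nonrel, p_rel)
print(odds_both, odds_index)
```

With these numbers, a document containing both terms has odds of about 5 and would be selected, while a document containing only "index" has odds below 1 and would not.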
38Fuzzy matching
- The issue is how to define the fuzzy grade of documents w.r.t. a query.
- One can define the fuzzy grade based on the closeness to a query. For example, ??? vs. ?? vs. ????
39Proximity matching
- The proximity criteria can be used independently of any other criteria.
- A modification is to use phrases rather than words. But this causes problems in some cases (e.g., information retrieval vs. the retrieval of information).
- Another modification is to use the order of words (e.g., junior college vs. college junior). However, this still causes the same problem as before.
- Many systems introduce a measure of proximity.
40Effects of weighting
- Weights can be given to sets of words, rather than individual words.
- E.g., (beef and broccoli): 5; (beef but not broccoli): 2; (broccoli but not beef): 2; noodles: 1; snow peas: 1; water chestnuts: 1.
41Effects of scaling
- An extensive collection is likely to contain fewer additional relevant documents.
- Information filtering aims at producing a relatively small set.
- Another possibility is to use several models together, leading to so-called data fusion.
42A user-centered view
- Each user has an individual vocabulary that may not match that of the author, editor, or indexer.
- Many times, the user does not know how to specify his/her information need: "I'll know it when I see it." Therefore, it is important to allow users direct access to the data (browsing).
43Text Analysis
- Indexing is the act of assigning index terms to a document.
- Many nonfiction books have indexes created by their authors.
- The indexing language may be controlled or uncontrolled.
- For manual indexing, an uncontrolled indexing language is generally used.
  - Lack of consistency (the agreement in index term assignment may be as little as 20%).
  - Difficult for fast-evolving fields.
45Indexing (Contd)
- Characteristics of an indexing language
  - Exhaustivity (the breadth) and specificity (the depth)
- The ingredients of indexes
  - Links (terms that occur together)
  - Roles
  - Cross referencing
    - See: coal, see fuel
    - Related terms: microcomputer, see also personal computer
    - Broader term (BT): poodle, BT dog
    - Narrower term (NT): dog, NT poodle, cocker spaniel, pointer.
46Index (Contd)
- Automatic indexing will play an ever-increasing role.
- Approaches for automatic indexing
  - Word counting
  - Based on deeper linguistic knowledge
  - Based on semantics and concepts within a document collection.
- Often an inverted file is used to store the indexes of documents in a document collection.
47Matrix Representations
- Term-document matrix A
  - $A_{ij}$ indicates the occurrence or the count of term i in document j.
- Term-term matrix T
  - $T_{ij}$ indicates the co-occurrence, or the co-occurrence count, of term i and term j.
- Document-document matrix D
  - $D_{ij}$ indicates the degree of term overlap between document i and document j.
- These matrices are usually sparse and are better stored as lists.
48Term Extraction and Analysis
- It has been observed that the frequencies of words in a document follow the so-called Zipf's law ($f = k r^{-1}$): 1, 1/2, 1/3, 1/4, ...
- Many similar observations have been made:
  - Half of a document is made up of 250 distinct words.
  - 20% of the text words account for 70% of term usage.
- None of the observations are supported by Zipf's law.
- High-frequency terms are not desirable because they are so common.
- Rare words are not desirable because very few documents will be retrieved.
49Term Association
- Term association is expanded with the concept of word proximity.
- The proximity measure depends on
  - The number of intervening words
  - The number of words appearing in the same sentence
  - Word order
  - Punctuation
- However, there are risks: The felon's information assured the retrieval of the money, and the retrieval of information, and information
50Term significance
- Frequent words in a document collection may not be significant (e.g., digital computer in a computer science collection).
- Absolute term frequency ignores the size of a document.
- Relative term frequency is often used.
  - Absolute term frequency / length of the document.
- Term frequency of a document collection
  - Total frequency count of a term / total words in the documents of the collection
  - Number of documents containing the term / total number of documents.
51How to adjust the frequency weight of a term
- Inverse document frequency weight
  - $N$: total number of documents.
  - $d_k$: number of documents containing term k.
  - $f_{ik}$: absolute frequency of term k in document i.
  - $w_{ik}$: the weight of term k in document i.
  - $idf_k = \log_2(N/d_k) + 1$
  - $w_{ik} = f_{ik} \cdot idf_k$
- This weight assignment is called TF-IDF.
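The weighting can be sketched directly from these definitions, with idf taken as log2(N/d_k) + 1 and the weight as the absolute frequency times idf. The function name `tf_idf` and the tiny four-document collection are illustrative assumptions; documents are assumed to be already tokenized.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    idf = {term: math.log2(n_docs / d) + 1 for term, d in doc_freq.items()}
    return [
        {term: freq * idf[term] for term, freq in Counter(tokens).items()}
        for tokens in docs
    ]

docs = [
    ["digital", "computer", "computer"],
    ["computer", "science"],
    ["digital", "library"],
    ["computer", "index"],
]
weights = tf_idf(docs)
for w in weights:
    print(w)
```

Here "computer" appears in three of the four documents, so each occurrence counts for less than an occurrence of "science", which appears in only one.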
52How to adjust the frequency weight of a term
- Signal-to-noise
- $H(p_1, p_2, \dots, p_n)$: information content of a document, with $p_i$ being the probability of word i.
- Requirements
  - H is a continuous function of $p_i$.
  - If $p_i = 1/n$, H is a monotone increasing function of n.
  - H preserves the partitioning property:
    - $H(1/2, 1/3, 1/6) = H(1/2, 1/2) + \tfrac{1}{2} H(2/3, 1/3) = H(2/3, 1/3) + \tfrac{2}{3} H(3/4, 1/4)$
- The entropy function satisfies all three requirements:
  - $H = -\sum_{i=1}^{n} p_i \log_2 p_i$
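The partitioning property can be checked numerically. The sketch below assumes the Shannon entropy with base-2 logarithms as the function H and evaluates the two groupings of the distribution (1/2, 1/3, 1/6); the function name `entropy` is an illustrative choice.

```python
import math

def entropy(*probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

direct = entropy(1/2, 1/3, 1/6)

# Group the last two outcomes (total probability 1/2), then split that group.
via_half = entropy(1/2, 1/2) + (1/2) * entropy(2/3, 1/3)

# Group the first and last outcomes (total probability 2/3), then split that group.
via_two_thirds = entropy(2/3, 1/3) + (2/3) * entropy(3/4, 1/4)

print(direct, via_half, via_two_thirds)

# Monotonicity for uniform distributions: H(1/n, ..., 1/n) = log2(n).
uniform = [entropy(*([1 / n] * n)) for n in range(1, 6)]
print(uniform)
```

All three values agree up to floating-point rounding, and the uniform-distribution entropies increase with n.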
53How to adjust the frequency weight of a term
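- The more frequent a word is, the less information it carries.
- The noise $n_k$ of index term k is defined as $n_k = \sum_i \frac{f_{ik}}{t_k} \log_2 \frac{t_k}{f_{ik}}$, where $t_k$ is the total frequency of term k in the collection.
- The signal $s_k$ of index term k is defined as $s_k = \log_2 t_k - n_k$.
- The weight $w_{ik}$ of term k in document i is $w_{ik} = f_{ik} \cdot s_k$.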
54How to adjust the frequency weight of a term
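- Term discrimination value
- The average similarity
- A centroid document D, where $f_k = t_k/N$.
- $\delta_k = \sigma_k - \sigma$, where $\sigma$ is the average similarity and $\sigma_k$ is the average similarity with term k removed.
- $w_{ik} = f_{ik} \cdot \delta_k$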
55Phrases and Proximity
- Weighting schemes discriminate against phrases.
- How to compensate?
  - Count both the individual words and the phrase.
  - Count the number of words in a phrase.
  - 1 + log (number of words in a phrase)
- How to handle proximity queries?
  - Documents with the involved words are identified, followed by the judgment of the proximity criteria.
- Direct analysis of a document collection can be done by using standard vocabulary analysis (e.g., the Brown corpus).
56Pragmatic Factors
- Identifying trigger phrases
  - Words such as conclusion and finding identify key points and ideas in a document.
- Weighting authors
- Weighting journals
- Users' pragmatic factors
  - Education level
  - Novice or expert in an area
57Document Similarity
- Similarity metrics for 0-1 vectors.
- Contingency table for document-to-document match:

              D2 = 1    D2 = 0
    D1 = 1    w         x          n1
    D1 = 0    y         z          N - n1
              n2        N - n2     N
58Document similarity
- If D1 and D2 are independent, $w/N = (n_1/N)(n_2/N)$.
- We can define the basic comparison between D1 and D2 as $\delta(D_1, D_2) = w - (n_1 n_2 / N)$.
- In general, the similarity between D1 and D2 can be defined as follows
59Various ways for defining coefficient of
- Separation coefficient: $N/2$.
- Rectangular distance: $\max(n_1, n_2)$.
- Conditional probability: $\min(n_1, n_2)$.
- Vector angle: $(n_1 n_2)^{1/2}$.
- Arithmetic mean: $(n_1 + n_2)/2$.
- For more, see p. 128.
- For the relationship, see Table 5.2.
60Other close similarity metrics
- Use only $w$ instead of $w - (n_1 n_2 / N)$.
  - Dice's coefficient: $2w/(n_1 + n_2)$.
  - Cosine coefficient: $w/(n_1 n_2)^{1/2}$.
  - Overlap coefficient: $w/\min(n_1, n_2)$.
  - Jaccard's coefficient: $w/(N - z)$.
- Distance measures: requirements
  - Non-negative
  - Symmetric
  - Triangle inequality: $Dist(A, C) \le Dist(A, B) + Dist(B, C)$
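The coefficients based on w alone can be computed straight from the contingency counts. The following sketch uses hypothetical function names and two made-up 0-1 document vectors over six terms; it assumes both documents contain at least one term so that no denominator is zero.

```python
def contingency(d1, d2):
    """Return (w, x, y, z) for two equal-length 0-1 vectors."""
    w = sum(1 for a, b in zip(d1, d2) if a and b)          # term in both
    x = sum(1 for a, b in zip(d1, d2) if a and not b)      # term in d1 only
    y = sum(1 for a, b in zip(d1, d2) if not a and b)      # term in d2 only
    z = sum(1 for a, b in zip(d1, d2) if not a and not b)  # term in neither
    return w, x, y, z

def coefficients(d1, d2):
    """Dice, cosine, overlap, and Jaccard coefficients from the counts."""
    w, x, y, z = contingency(d1, d2)
    n1, n2, n = w + x, w + y, w + x + y + z
    return {
        "dice": 2 * w / (n1 + n2),
        "cosine": w / (n1 * n2) ** 0.5,
        "overlap": w / min(n1, n2),
        "jaccard": w / (n - z),
    }

doc1 = [1, 1, 0, 1, 0, 0]
doc2 = [1, 0, 0, 1, 1, 0]
counts = contingency(doc1, doc2)
scores = coefficients(doc1, doc2)
print(counts)
print(scores)
```

For these two vectors w = 2, x = 1, y = 1, z = 2, so the Jaccard coefficient is 2/4 and the other three coefficients all equal 2/3 because n1 = n2.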
61Stop lists
- A stop list, or negative dictionary, consists of very high-frequency words.
- A typical stop list contains 150-500 words.
- Any well-defined field may have its own jargon.
- Words in the stop list should be excluded from later processing.
- Queries should also be processed against the stop list.
- However, phrases that contain words in the stop list should not always be eliminated (e.g., to be or not to be).
62Stemming
- Computer, computers, computing, compute, computes, computed, computational, computationally, computable all deal with closely related concepts.
- Use a stemming algorithm to strip off word endings (e.g., comput).
- Watch out for false stripping
  - bed -> b, breed -> bre
- Keep a minimum acceptable stem length, have a small list of exceptional words, and keep various word forms.
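A toy suffix stripper shows how a minimum stem length and an exception list guard against false stripping. This is only a sketch under stated assumptions: the suffix list, the minimum length of 3, and the exception set are illustrative choices, not a standard stemming algorithm such as Porter's.

```python
# Suffixes are tried longest first so that "ationally" wins over "s" or "e".
SUFFIXES = ("ationally", "ational", "able", "ing", "ers", "ed", "es", "er", "e", "s")
MIN_STEM_LENGTH = 3     # assumed minimum acceptable stem length
EXCEPTIONS = {"breed"}  # assumed list of words that must never be stripped

def stem(word):
    """Strip the first matching suffix unless a safeguard forbids it."""
    word = word.lower()
    if word in EXCEPTIONS:
        return word
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word

family = ["computer", "computers", "computing", "compute", "computes",
          "computed", "computational", "computationally", "computable"]
stems = {stem(w) for w in family}
print(stems)
print(stem("bed"), stem("breed"))
```

The minimum stem length keeps "bed" intact (stripping "ed" would leave a one-letter stem), while "breed" needs the exception list because "bre" is long enough to pass the length check.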
63Stemming (contd)
- Stemming may not save much space (5%).
- One can also stem only the queries and then use wild cards in matching.
- Watch the various word forms. E.g., knife should be expanded to knif and kniv.
64Thesaurus
- A thesaurus contains
  - Synonyms
  - Antonyms
  - Broader terms
  - Narrower terms
  - Closely related terms
- A thesaurus can be used during query processing to broaden a query.
- A similar problem arises w.r.t. homonyms.
65Mid-term project
- Lexical analysis and stoplist (Ch7)
- Stemming algorithms (Ch8)
- Thesaurus construction (Ch9)
- String searching algorithms (Ch10)
- Relevance feedback and other query modification techniques (Ch11)
- Hashing algorithms (Ch13)
- Ranking algorithms (Ch14)
- Chinese text segmentation (to be provided)
66File Structures
67Inverted File
- Structures for an inverted file
  - Sorted array (Figure 3.1 in the supplement)
  - B-tree (Figure 3.2 in the supplement)
  - Trie
- A straightforward approach
  - Parse the text to get a list of (word, location) pairs.
  - Sort the list in ascending order of word.
  - Weight each word.
  - See Figures 3.3 and 3.4 in the supplement.
  - Hard to evolve.
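The straightforward approach (parse, sort, group) can be sketched as a sorted array of (word, postings) entries searched by binary search. The function names and the two sample documents are illustrative assumptions; tokenization is plain whitespace splitting and the weighting step is omitted.

```python
from bisect import bisect_left
from itertools import groupby

def build_inverted_file(docs):
    """docs: {doc_id: text}. Returns a sorted list of (word, [(doc_id, position), ...])."""
    pairs = []
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            pairs.append((word, doc_id, position))
    pairs.sort()  # ascending order of word, then document, then position
    return [
        (word, [(doc_id, position) for _, doc_id, position in group])
        for word, group in groupby(pairs, key=lambda entry: entry[0])
    ]

def lookup(index, term):
    """Binary search in the sorted array; returns the postings or an empty list."""
    words = [word for word, _ in index]
    i = bisect_left(words, term)
    if i < len(words) and words[i] == term:
        return index[i][1]
    return []

docs = {1: "information retrieval", 2: "retrieval of information"}
index = build_inverted_file(docs)
print(index)
print(lookup(index, "retrieval"))
```

Because the whole list is rebuilt and re-sorted, adding a document means redoing the work, which is one way to see why this structure is hard to evolve.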
68Inverted File (Contd)
- The data structure can be improved for faster searching (Figure 3.5 in the supplement).
- A dictionary, including
  - Term and number of postings
- A posting file, including
  - A set of lists, one for each term
    - Document number
    - Number of postings in the document
- See Figure 3.5.
69Inverted File (Contd)
- The dictionary can be implemented as a B-tree.
- When a term in a new document is identified,
  - a new tree node is created, or
  - the related data of an existing node is modified.
- The posting file can be implemented as a set of linked lists.
- See Table 3.1 for some statistics.
70Signature File
- A document is partitioned into a set of blocks, each of which has D keywords.
- Each keyword is represented by a bit pattern (signature) of size F, with m bits set to 1.
- The block signature is formed by superimposing (OR) the constituent word signatures.
- Sig(Q) AND Sig(B) = Sig(Q) if B contains the words in Q.
- See Figure 4.1 in the supplement.
71Signature File (Contd)
- Which m bits should be set for a given word?
  - For each triplet (three consecutive characters) of the word W, a hashing function maps it to a position in [0, F-1].
  - If the number of 1s is less than m, randomly set additional bits.
- How to set m?
  - It has been shown that when $m = F \ln 2 / D$, the false drop probability is minimized.
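Superimposed coding can be sketched with Python integers as bit vectors. Everything specific here is an assumption for illustration: the width F = 64, the target m = 4, the use of `zlib.crc32` as the hashing function, and the deterministic (rather than random) filling of extra bits so that the same word always gets the same signature.

```python
import zlib

F = 64  # signature width in bits (assumed)
M = 4   # minimum number of 1 bits per word signature (assumed)

def word_signature(word):
    """Hash each 3-character triplet of the word to a bit position in [0, F-1]."""
    signature = 0
    triplets = [word[i:i + 3] for i in range(max(1, len(word) - 2))]
    for triplet in triplets:
        signature |= 1 << (zlib.crc32(triplet.encode()) % F)
    # Short words may set fewer than M bits; add bits from a salted hash.
    salt = 0
    while bin(signature).count("1") < M:
        signature |= 1 << (zlib.crc32(f"{word}#{salt}".encode()) % F)
        salt += 1
    return signature

def block_signature(words):
    """Superimpose (OR) the signatures of all words in the block."""
    signature = 0
    for word in words:
        signature |= word_signature(word)
    return signature

def may_contain(block_sig, query_words):
    """True if every 1 bit of the query signature is also set in the block."""
    query_sig = block_signature(query_words)
    return (query_sig & block_sig) == query_sig

block = block_signature(["information", "retrieval", "system"])
print(may_contain(block, ["retrieval"]))
print(may_contain(block, ["information", "system"]))
```

A positive answer only means the block may contain the words: unrelated words can collide on the same bits (a false drop), so qualifying blocks still have to be checked against the text. A block that does contain the query words is never missed.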
72Signature File (Contd)
- The signature file could be huge. Sequential search takes time.
- The signature file is often sparse.
- Three approaches to reduce query time
  - Compression
  - Vertical partitioning
  - Horizontal partitioning
73Signature File (Contd)
- Vertical partitioning
  - Use F different files, one per bit position.
  - For a query with k bits set, we need to examine k files, then AND these files.
  - The qualifying blocks will have 1s in the resultant vector.
  - Inserting a block requires writing to F files.
74Signature File (Contd)
- Horizontal partitioning
  - Two-level signatures
  - The first level has N document signatures.
  - Several signatures with a common prefix are grouped into a group.
  - The second level has group signatures, which are created by superimposing the constituent document signatures.
  - This approach can be generalized to a B-tree-like structure (called an S-tree).
75User Profiles and Their Use
76Simple Profiles
- A simple profile consists of a set of key terms with given weights, much like a query.
- Such profiles were originally developed for current awareness (CA) or selective dissemination of information (SDI).
- The purpose of CA (SDI) is to help researchers keep up with the latest developments in their areas.
- In a CA system, users are asked to file an interest profile, which must be updated periodically.
- In fact, the interest profile acts as a routing
77Extended Profiles
- Extended profiles record background information about a person that might help in determining the document types of interest.
- Education level, familiarity with an area, language fluency, journal subscriptions, reading habits, specific preferences.
- This type of information cannot be used directly in the retrieval process but must be applied to the retrieved set to organize it.
78Current Awareness Systems
- It assumes that the user is adequately aware of past work and needs only to keep abreast of current developments.
- It operates only on current literature and actively, without user intervention.
- The user may redefine a profile at any time, and many systems will periodically remind users to review their profiles.
- Most CA systems make use of only the simple user profile.
- Current awareness systems are suitable for a dynamic environment.
79Retrospective Search Systems
- The effectiveness of a CA system is difficult to measure because users often process the presented documents off-line.
- Unlike a CA system, a retrospective search system has a relatively large and stable database and handles ad hoc queries.
- Virtually no existing retrospective search system differentiates among users.
80Modifying the Query By the Profile
- A reference librarian may help a person with a request by learning more about this person's background and level of knowledge. E.g., theory of groups.
- A given query may be modified according to the person's profile.
- Three ways to modify a query
  - Post-filter: the effort to retrieve documents is substantial.
  - Pre-filter: a food query <calories = 3, spiciness = 7> may be modified for a user with profile <2, 2> to <2.8, 6>.
81Modifying the Query By the Profile
- Suppose $Q = \langle q_1, q_2, \dots, q_n \rangle$ and $P = \langle p_1, p_2, \dots, p_n \rangle$.
- Simple linear transformation: $q_i' = k p_i + (1-k) q_i$.
- Piecewise linear transformation
  - Case 1. $p_i \ne 0$ and $q_i \ne 0$: ordinary k value.
  - Case 2. $p_i = 0$ and $q_i \ne 0$: k is very small (5%).
  - Case 3. $p_i \ne 0$ and $q_i = 0$: k is smaller (50%).
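The simple linear transformation is a one-liner. In the sketch below, the function name `modify_query` is hypothetical, and the inputs reuse the food query with calories 3 and spiciness 7 and the user profile (2, 2); k = 0.2 is the value that reproduces the modified query (2.8, 6) and is inferred here rather than stated in the source.

```python
def modify_query(query, profile, k):
    """Move each query component toward the profile: q_i' = k * p_i + (1 - k) * q_i."""
    return [k * p + (1 - k) * q for q, p in zip(query, profile)]

query = [3, 7]    # calories, spiciness
profile = [2, 2]  # the user's profile on the same two dimensions
modified = modify_query(query, profile, k=0.2)
print(modified)
```

With k = 0 the query is left unchanged and with k = 1 it is replaced by the profile, so k controls how strongly the profile pulls the query.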
82Query and Profile as Separate Reference Points
- Query and profile are treated as co-filters.
- Four approaches
  - Disjunctive model: $|D, Q| \le d$ or $|D, P| \le d$.
  - Conjunctive model: $|D, Q| \le d$ and $|D, P| \le d$.
  - Ellipsoidal model: $|D, Q| + |D, P| \le d$; see Figures 6.2 and 6.3.
  - Cassini oval model: $|D, Q| \times |D, P| \le d$; see Figure 6.4.
- All the above models can be weighted.
- Empirical experiments showed that query-profile combinations do provide better performance than the query alone.
83Multiple Reference Point Systems
- A reference point is a defined point or concept against which a document can be judged.
- Queries, user profiles, and known papers or books are reference points.
- A reference point is sometimes called a point of interest (POI).
- Weights and metrics can be applied to general reference points as before.
84Documents and Document Clusters
- Each favored document can be treated as a reference point.
- Favored documents can also be clustered. Each document cluster may be represented as a cluster point.
- Many statistical techniques can be used to cluster documents.
- The centroid or medoid of a document cluster is then used as the reference point.
85The Mathematical Basis
- Graphical User Interface for Document Organization (GUIDO): rather than using terms as vector dimensions, GUIDO uses each reference point as a dimension, resulting in a low-dimensional space.
- In a 2-D GUIDO, a document is represented as an ordered pair (x, y), where x is the distance from Q and y is the distance from P. Note that $|P, Q| = \delta$.
- $P = (\delta, 0)$, $Q = (0, \delta)$.
- Consider the line between P and Q. Three cases:
  - $|D, P| + |D, Q| = \delta$
  - $|D, P| - |D, Q| = \delta$
  - $|D, P| - |D, Q| = -\delta$
- For any point not on the line between P and Q:
  - $|D, P| + |D, Q| > \delta$
  - $|D, P| + \delta > |D, Q|$
  - $|D, Q| + \delta > |D, P|$
- Observation 1: multiple document points are mapped into the same point in the distance space.
- Observation 2: complex boundary contours are mapped into simpler contours.
- In the ellipsoidal model, the contour becomes a straight line parallel to the P-Q line.
- In the weighted ellipsoidal model, the contour is still a straight line but at an angle.
- If we are looking for a document D where the ratio of the distance $|D, P|$ to $|D, Q|$ is a constant, we have $|D, Q| < d/fr$ (see the general model).
- Therefore, the contour is a circle in the general model.
- The contour is a straight line crossing the origin in the GUIDO model because $|D, P| = k\,|D, Q|$. See Figure 7.5.
- With different metrics, the size of the distance space and the locations of documents may change, but the basic shape in the distance space remains.
- Visual Information Browsing Environment (VIBE): a user chooses the positions of reference points arbitrarily on the screen.
- The location of a document is determined by the ratios of its similarities to the reference points.
- Each document is represented as a rectangle whose size reflects its importance (sum of similarities?) to the reference points.
- In a 2-POI VIBE, documents are displayed on the line connecting the two POIs.
- In an n-POI VIBE, let $p_1, p_2, \dots, p_n$ be the coordinates of the POIs and $s_1, s_2, \dots, s_n$ be the similarities of a document D to these POIs. The coordinate of D, $p_d$, is (see Example 7.2)
- While GUIDO is based on distance metrics, VIBE is based on similarity metrics.
- Consider a 2-POI VIBE: a document is located at a position determined by the fixed ratio $c = s_1/s_2$.
- If $s_i = 1/d_i$, then $c = d_2/d_1$. Thus, a straight line in GUIDO is a point in VIBE.
- If $s = k^{-d}$, then $c = k^{d_2 - d_1}$. Further compressed.
92Boolean VIBE
- One can think of n+1 POIs as vertices in n dimensions that form a polyhedron.
- Three POIs A, B, and C form a triangle in a 2-D space, as shown in Figure 7.10.
- Documents containing all terms of A and B appear on the line A-B. Documents containing all terms of A, B, and C appear inside the triangle.
- Four POIs form a polyhedron in a 3-D space.
93Boolean VIBE
- To render n POIs on a 2-D display, the resulting display consists of $2^n - 1$ Boolean points, representing all Boolean combinations except the one that is completely negated; see Figure 7.10.
- A threshold on the similarity between points needs to be specified for determining document positions; see Table 7.1.
94Retrieval Effectiveness Measures
95Goodness of an IR System
- Judged by the user for appropriateness to her information need, which is vague.
- Determine the level of judgment
  - Question that meets the information need
  - Query that corresponds to the question
- Determine the measure
  - Binary: accepted or rejected
  - N-ary: 4 = definitely relevant, 3 = probably relevant, 2 = neutral, 1 = probably not relevant, 0 = definitely not relevant.
96Goodness of an IR System (Contd)
- Relevance of a document: how well this document responds to the query.
- Pertinence of a document: how well this document satisfies the information need.
- Usefulness of a document
  - The document is not relevant or pertinent to my present need, but it is useful in a different context.
  - The document is relevant, but it is not useful because I already know it.
97Precision and Recall
                    Retrieved     Not retrieved
    Relevant        w             x                n1 = w + x
    Not relevant    y             z
                    n2 = w + y                     N = w + x + y + z

- Precision = w/n2.
- Recall = w/n1.
- The number of documents returned in response to a query (n2) may be controlled by either a first-K cutoff or a similarity threshold.
- If very few documents are returned, precision could be high while recall is very low.
- If all documents are returned, recall = 1 while precision is very low.
98Precision and Recall (contd)
- One can plot a precision-recall graph to compare the performance of different IR systems. See Figure 8.1.
- Two related measures
  - Fallout: the proportion of nonrelevant documents that are retrieved, F = y / (N - n1)
  - Generality: the proportion of relevant documents within the entire collection, G = n1/N
- Precision (P), recall (R), fallout (F), and generality (G) are related
99Precision and Recall (contd)
- P/(1-P) is the ratio of relevant retrieved documents to nonrelevant retrieved documents.
- G/(1-G) is the ratio of relevant documents to nonrelevant documents in the collection.
- R/F > 1 if the IR system does better in locating relevant documents.
- R/F < 1 if the IR system does better in rejecting nonrelevant documents.
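The four measures follow directly from the contingency counts w, x, y, z. The sketch below uses made-up counts and a hypothetical function name, and also checks one identity that follows from the definitions, P/(1-P) = (R/F) * G/(1-G); that identity is derived here, not quoted from the source.

```python
def measures(w, x, y, z):
    """w: relevant retrieved, x: relevant not retrieved,
    y: nonrelevant retrieved, z: nonrelevant not retrieved."""
    n1 = w + x             # relevant documents
    n2 = w + y             # retrieved documents
    n = w + x + y + z      # whole collection
    return {
        "precision": w / n2,
        "recall": w / n1,
        "fallout": y / (n - n1),
        "generality": n1 / n,
    }

m = measures(w=20, x=30, y=10, z=940)
P, R, F, G = m["precision"], m["recall"], m["fallout"], m["generality"]

precision_odds = P / (1 - P)         # relevant vs. nonrelevant among retrieved
collection_odds = G / (1 - G)        # relevant vs. nonrelevant in the collection
print(m)
print(precision_odds, (R / F) * collection_odds)
```

With these counts, precision is 2/3, recall is 0.4, and R/F is 38, meaning the retrieved set is far richer in relevant documents than the collection as a whole.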
100Precision and Recall (contd)
- Weaknesses of precision/recall measures
  - It is generally difficult to get an exact value for recall because one has to examine the entire collection.
  - It is not clear that recall and precision are significant to the user. Some argue that precision is more important than recall.
  - Either one represents an incomplete picture of the IR system's performance.
101User-oriented measures
- The above measures attempt to measure the performance of the entire IR system, regardless of the differences among users.
- From a user's point of view, her interpretation of the retrieved set of documents could be
- Let V = number of relevant documents known to the user, Vn = number of relevant, retrieved documents known to the user, and N = number of relevant, retrieved documents.
- Coverage ratio: Vn/V
- Novelty ratio: (N - Vn)/N
102User-oriented measures (Contd)
- Relative recall: number of relevant, retrieved documents / number of desired documents.
- Recall effort: number of desired documents / number of documents examined.
103Average precision and recall
- Fix recall at several points (say, 0.25, 0.5, and 0.75) and compute the average precision at each recall level.
- If the exact recall is difficult to compute, one can compute the average precision for each fixed number of relevant documents. See Table 8.2.
- If the exact recall can be computed, a more comprehensive precision/recall table can be obtained. See Table 8.3.
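The averaging procedure can be sketched as follows, assuming each query's result is a ranked list of 0/1 relevance judgments and the total number of relevant documents is known. Precision is read off at the first rank where each recall level is reached, which is one possible convention and not necessarily the one behind Tables 8.2 and 8.3. The sample rankings are made up.

```python
def precision_at_recall_levels(ranked_relevance, total_relevant,
                               levels=(0.25, 0.5, 0.75)):
    """Precision at the first rank where each recall level is reached."""
    result = {}
    pending = sorted(levels)
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        hits += 1 if is_relevant else 0
        while pending and hits / total_relevant >= pending[0]:
            result[pending.pop(0)] = hits / rank
    return result


def average_over_queries(per_query):
    """Average the precision values of several queries, level by level."""
    shared = set.intersection(*(set(q) for q in per_query))
    return {lv: sum(q[lv] for q in per_query) / len(per_query)
            for lv in sorted(shared)}


q1 = precision_at_recall_levels([1, 0, 1, 0, 0, 1, 0, 1], total_relevant=4)
q2 = precision_at_recall_levels([0, 1, 1, 0, 1, 0, 0, 1], total_relevant=4)
averaged = average_over_queries([q1, q2])
```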
104Operating Curves
- Let C be a measurable characteristic, and let P1 and P2 be the sets of relevant and irrelevant documents, respectively.
- If C distinguishes P1 and P2 well, the curve will have a higher slope.
- It has been shown that the operating curve of a given IR system is usually a straight line.
- The distance from <50, 50> to the operating curve along the line from <0, 100> to <50, 50> can be used to measure the performance of an IR system; this is called Swets' E measure. See Figure 8.3.
105Expected search length
- None of the above measures considers the order of the returned documents.
- Suppose the set of retrieved documents can be divided into subsets S1, S2, ..., Sk with decreasing priority, and Si has ni relevant documents.
- Given a desired number N of relevant documents, one can compute the expected search length. See Example 8.2.
- By varying N, one can plot the expected search length as a performance curve, as shown in Figure 8.4.
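One way to compute the expected search length is Cooper's classic formulation, sketched below. It counts the nonrelevant documents a user is expected to examine and assumes documents within a subset appear in random order; the source's Example 8.2 may count documents differently. The (relevant, nonrelevant) pair representation and the sample counts are assumptions for illustration.

```python
def expected_search_length(levels, wanted):
    """Expected number of nonrelevant documents examined before finding
    `wanted` relevant ones.

    levels: list of (relevant, nonrelevant) counts per subset, best first.
    """
    skipped_nonrelevant = 0    # nonrelevant documents in fully examined subsets
    still_needed = wanted
    for relevant, nonrelevant in levels:
        if relevant >= still_needed:
            # Final subset: with random order inside the subset, the expected
            # number of nonrelevant documents seen is i * s / (r + 1).
            return skipped_nonrelevant + nonrelevant * still_needed / (relevant + 1)
        still_needed -= relevant
        skipped_nonrelevant += nonrelevant
    raise ValueError("fewer relevant documents retrieved than requested")


# Two subsets: S1 has 1 relevant and 2 nonrelevant documents, S2 has 2 and 3.
levels = [(1, 2), (2, 3)]
curve = {n: expected_search_length(levels, n) for n in (1, 2, 3)}
```

Varying the wanted number, as in `curve`, gives the points of a performance plot like the one the slide describes.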
106Expected search length (Contd)
- An aggregate number can be computed as the average number of documents searched per relevant document. Let the number be ei.
- If searching for 1, 2, ..., 7 documents is equally likely, one can compute the overall expected search length by the formula
107Normalized recall
- A typical IR system presents results to the user in a linear list.
- If a user sees many relevant documents first, she may be more satisfied with the system performance.
- Rocchio's normalized recall is defined on a step function F, where F(k) = F(k-1) + 1 if the kth document is relevant and F(k) = F(k-1) otherwise. See Figure 8.5.
- A step function F is defined as
- F(0) = 0,
- F(k+1) = F(k) or F(k) + 1.
108Normalized recall (Contd)
- Let A be the area between the actual and ideal graphs, n1 be the number of relevant documents, and N be the number of documents examined.
- Normalized recall = 1 - A / (n1(N - n1)).
- However, if two systems behave the same except for the position of the last document, the normalized recall values may differ a lot.
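The sketch below computes normalized recall as 1 - A / (n1(N - n1)), taking A to be the sum of the gaps between the ideal step function (all relevant documents ranked first) and the actual one. The 0/1 list representation of the ranking is an assumption made for illustration.

```python
def normalized_recall(ranked_relevance):
    """Normalized recall of a ranked list of 0/1 relevance judgments."""
    N = len(ranked_relevance)
    n1 = sum(1 for r in ranked_relevance if r)
    if n1 == 0 or n1 == N:
        raise ValueError("needs at least one relevant and one nonrelevant document")

    # Actual step function: F(k) = number of relevant documents in the top k.
    actual, seen = [], 0
    for r in ranked_relevance:
        seen += 1 if r else 0
        actual.append(seen)

    # Ideal step function: every relevant document is ranked first.
    ideal = [min(k, n1) for k in range(1, N + 1)]

    area = sum(i - a for i, a in zip(ideal, actual))
    return 1 - area / (n1 * (N - n1))


best = normalized_recall([1, 1, 0, 0])     # ideal ordering
worst = normalized_recall([0, 0, 1, 1])    # relevant documents ranked last
mixed = normalized_recall([1, 0, 1, 0])
```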
109Sliding ratio
- Rather than judging a document as either relevant or irrelevant, the sliding ratio assigns a weighted relevance to each document.
- Let the weight list of the retrieved documents be w1, w2, ..., wN, and let W1, W2, ..., WN be the same weights sorted in decreasing order. The sliding ratio SR(n) is defined as SR(n) = (w1 + ... + wn) / (W1 + ... + Wn).
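A short sketch of the sliding ratio, assuming the usual definition: the sum of the first n weights in the order returned, divided by the sum of the first n weights in the ideal (sorted) order. The sample weights are made up.

```python
def sliding_ratio(weights, n):
    """SR(n) for a list of relevance weights given in retrieval order."""
    ideal = sorted(weights, reverse=True)    # W1 >= W2 >= ... >= WN
    return sum(weights[:n]) / sum(ideal[:n])


weights = [1, 3, 2]    # relevance weights in the order the system returned them
ratios = [sliding_ratio(weights, n) for n in (1, 2, 3)]
```

SR(n) always reaches 1 at n = N, since both sums then cover every weight; how quickly it approaches 1 reflects how well the system orders its output.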
110Satisfaction and frustration
- Myaeng divides the measure into satisfaction and frustration.
- Satisfaction is the cumulative sum of satisfaction weights.
- Frustration is the cumulative sum of (2 - satisfaction weight). See Example 8.4.
- Total = Satisfaction - Frustration.
111Content-based Recommendation
112NewsWeeder: Learning to Filter Netnews
- Ken Lang
- Proceedings of the Conference on Machine
Learning, 1995
- NewsWeeder is a netnews-filtering system.
- It allows users to read regular newsgroups.
- It also creates some personal, virtual newsgroups, such as nw.top50.bob for Bob.
- These contain a list of article summaries sorted by predicted rating.
- After reading an article, the reader clicks on a rating from one to five.
- This way of collecting users' ratings is called active feedback, in contrast to passive feedback, such as time spent reading.
- The drawback of active feedback is the extra effort required for explicit rating.
- Each night, the system uses the collected rating information to learn a new model of each user's interests.
- How to learn a new model is the subject of this paper.
- Raw text is parsed into tokens.
- A vector of token counts is created for each document (article).
- Tokens are not stemmed.
- The vector is on the order of 20,000 to 100,000 tokens long.
- No explicit dimension reduction techniques are used to reduce the size of vectors.
116TF-IDF weighting
- Motivation
- The more times a token t appears in a document d (term frequency, tf(t,d)),
- and the fewer times t occurs throughout all documents (document frequency, df(t)),
- the better t represents the subject of document d.
- Throw out tokens occurring fewer than 3 times in total.
- Throw out the M most frequent tokens.
- The weight of t w.r.t. d, w(t, d), is
- w(t, d) = tf(t,d) × log2(N / df(t)),
- where N is the total number of documents.
117TF-IDF weighting
- Each document is represented by a tf-idf vector normalized to unit length.
- Use the cosine function to determine the similarity between two documents.
- Given a category (1..5), a prototype vector is computed by averaging the normalized tf-idf vectors in the category.
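The weighting, normalization, cosine similarity, and prototype averaging described above can be sketched as follows. This is a minimal illustration, not NewsWeeder's implementation: the sparse dict representation, the function names, and the toy documents are assumptions, and the token-pruning steps are omitted.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """docs: list of token lists. Returns unit-length tf-idf vectors as dicts."""
    n_docs = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))    # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)                                     # term frequency
        vec = {t: tf[t] * math.log2(n_docs / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def prototype(vectors):
    """Average of the normalized vectors belonging to one category."""
    total = defaultdict(float)
    for vec in vectors:
        for t, w in vec.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


docs = [["goat", "milk", "goat"], ["goat", "cheese"], ["car", "engine"]]
v0, v1, v2 = tfidf_vectors(docs)
proto = prototype([v0, v1])    # prototype of a hypothetical category holding docs 0 and 1
```

A token that occurs in every document gets weight 0, because log2(N/df) is 0 in that case.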
118TF-IDF weighting
- Let vp1, vp2, vp3, vp4, vp5 be the prototype vectors of the five categories.
- A learning model is derived as follows:
- Predicted-rate(d) = c1·sim(d, vp1) + c2·sim(d, vp2) + c3·sim(d, vp3) + c4·sim(d, vp4) + c5·sim(d, vp5).
- The coefficients c1, ..., c5 are determined by linear regression on documents rated by the user.
119Minimum Description Length (MDL)
- A kind of Bayesian classifier, but based on the entropy measure.
- In information theory, the minimum average length to encode messages with probabilities p1, p2, ..., pk is -Σi pi log pi. That is, the number of bits needed to represent message i is -log pi.
- Let H be a category and D a document. We seek the H that maximizes p(H | D) = p(D | H) p(H) / p(D).
- Equivalently, we can minimize -log(p(D | H)) - log(p(H)).
- The above total encoding length includes
- the number of bits to encode the hypothesis, and
- the number of bits required to encode the data given the hypothesis.
- That is, the goal is to find a balance between simpler models and models that produce smaller error when explaining the observed data.
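The code-length view can be illustrated with a small sketch: a message of probability p costs -log2(p) bits, and the chosen category is the one that minimizes the bits for the hypothesis plus the bits for the data given the hypothesis. The per-token presence model and all probabilities below are made-up assumptions, not values from the paper.

```python
import math


def code_length_bits(p):
    """Optimal code length, in bits, of a message with probability p (0 < p <= 1)."""
    return -math.log2(p)


def entropy_bits(probs):
    """Minimum average code length for messages with the given probabilities."""
    return sum(p * code_length_bits(p) for p in probs if p > 0)


def description_length(words, token_probs, prior):
    """-log p(H) - log p(D | H) under an independent token-presence model."""
    bits = code_length_bits(prior)                        # encode the hypothesis
    for token, p in token_probs.items():                  # encode the data given it
        bits += code_length_bits(p if token in words else 1.0 - p)
    return bits


# Hypothetical categories: (probability that each token is present, prior).
models = {
    "interesting": ({"goat": 0.8, "car": 0.1}, 0.5),
    "boring": ({"goat": 0.2, "car": 0.6}, 0.5),
}
doc = {"goat"}
lengths = {c: description_length(doc, probs, prior)
           for c, (probs, prior) in models.items()}
best = min(lengths, key=lengths.get)    # category with the shortest encoding
```

Minimizing the total code length selects the same category as maximizing p(D | H) p(H), since the logarithm is monotonic.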
121MDL applied to Newsweeder
- Problem description
- We are given a document d with token vector Td and number of non-zero entries ld, and a set of previous rating information Dtrain.
- We would like to find a category ci that maximizes p(ci | Td, ld, Dtrain), or equivalently, minimizes -log(p(Td | ci, ld, Dtrain)) - log(p(ci | ld, Dtrain)).
122MDL applied to Newsweeder
- Assuming that words in a document are independent, we have
- p(Td | ci, ld, Dtrain) = Πj p(tj,d | ci, ld, Dtrain),
- where ti,d (0 or 1) represents whether token i appears in document d.
- Notations
- ti = Σj∈N ti,j, the number of documents in which token i appears
- ri,l = a correlation estimate in [0, 1] between ti,d and ld.
- The above measures can be computed for the entire document collection or for a particular category, denoted by ck.
123MDL applied to Newsweeder
- When ti,d is not related to the length of the document (i.e., ri,l = 0), we have
- When ti,d is strongly related to the length of the document (i.e., ri,l = 1), we have
124MDL applied to Newsweeder
- In general, it can be modeled as
- Hypothesis: for a given token, either it is special w.r.t. a category or it is unrelated to any category.
125MDL applied to Newsweeder
- A token is related to some category if the following value is greater than a small constant (0.1)
- The intuition is that if the encoding bits can be reduced by considering category information, the token plays an important role in deciding the category of a document.
- Divide the set of articles into a training set and a test set.
- Parse the training articles, throwing out tokens occurring fewer than 3 times in total.
- Compute ti and ri,l for each token.
- For each token t and category c, decide whether to use the category-independent or the category-dependent model.
127Summary (contd)
- Compute the similarity of each training article to each rating category by taking the inverse of the number of bits required to encode Td under the category's probabilistic model.
- Compute a linear regression model from the training articles.
- The performance metric is precision.
- Retrieve the top 10% of articles with the highest predicted ratings.
- Data
- See Table 1 for the meaning of the 5 categories.
- Articles rated as 1 or 2 are considered interesting.
- Users: only two provided a sufficient number of ratings; see Table 2.
129TF-IDF performance
- Do not use a fixed stop-list because it may not suit a dynamic environment.
- The top N most frequent words are removed.
- Experiments with different training/test partitions show that removing the 100-400 most frequent words seems to give the best performance. See Graph 1.
- TF-IDF has about three times improvement over
130MDL Experiments
- See Graph 2 for a comparison between TF-IDF and MDL.
- MDL consistently outperforms TF-IDF.
- Table 3 shows the predicted ratings and actual ratings of a test article.
- The correct prediction rate is 65% (see the diagonal line).
- In general, the performance after the regression step tends to meet or exceed the precision obtained by choosing only the category with maximum probability.
131Learning and Revising User Profiles: The
Identification of Interesting Web Sites
- M. Pazzani and D. Billsus
- Machine Learning 27, 1997
- The goal is to find information that satisfies long-term, recurring interests.
- Feedback on the interestingness of a set of previously visited sites is used to predict the interestingness of unseen sites.
- The recommender system is called Syskill & Webert.
133Syskill & Webert
- A different profile is learned for each topic.
- Each user has a set of profiles, one for each topic.
- Each web page is augmented with special controls for entering user ratings. See Figure 1.
- Each page is rated as either hot or cold. See Figure 2 for the notations used for recommendations.
134Learning user profiles
- Use supervised learning with a set of positive examples and negative examples.
- Each rated web page is converted into a Boolean feature vector.
- The information gain of a word is used to determine how informative the word is.
135Learning user profiles
- The k most informative words are used as the feature set (k = 128).
- In addition, HTML tags and words in a stop list of approximately 600 words are excluded.
- See Table 1 for the feature words on goats.
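A minimal sketch of selecting the k most informative words by information gain over Boolean word-presence features. The page representation (a set of words plus a hot/cold flag), the toy data, and the alphabetical tie-break are assumptions made for illustration.

```python
import math


def entropy(pos, neg):
    """Entropy, in bits, of a hot/cold split with the given counts."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(pages, word):
    """pages: list of (set_of_words, is_hot) pairs."""
    hot = sum(1 for _, is_hot in pages if is_hot)
    base = entropy(hot, len(pages) - hot)
    remainder = 0.0
    for present in (True, False):
        labels = [is_hot for words, is_hot in pages if (word in words) == present]
        if labels:
            hot_here = sum(1 for is_hot in labels if is_hot)
            remainder += len(labels) / len(pages) * entropy(hot_here, len(labels) - hot_here)
    return base - remainder


def top_k_words(pages, k):
    """The k words with the highest information gain (ties broken alphabetically)."""
    vocab = set().union(*(words for words, _ in pages))
    return sorted(vocab, key=lambda w: (-information_gain(pages, w), w))[:k]


pages = [
    ({"goat", "milk"}, True),
    ({"goat", "farm"}, True),
    ({"car", "farm"}, False),
    ({"car", "milk"}, False),
]
features = top_k_words(pages, k=2)
```

Here "goat" and "car" split the pages perfectly (gain of 1 bit), while "milk" and "farm" carry no information about the rating.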
136Naïve Bayesian classifier
- Assumes that features are independent.
- A given example is assigned to the class (hot or cold) with the higher probability.
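The classifier can be sketched over Boolean word features as below. Laplace smoothing is added here as an assumption to keep estimates away from 0 and 1; the slides do not say how the probabilities were estimated. The data and function names are illustrative.

```python
import math


def train_naive_bayes(pages, features):
    """pages: list of (set_of_words, is_hot). Returns {label: (prior, probs)}."""
    model = {}
    for label in (True, False):
        docs = [words for words, is_hot in pages if is_hot == label]
        prior = len(docs) / len(pages)
        # Laplace smoothing (an assumption, not stated in the slides).
        probs = {f: (sum(1 for d in docs if f in d) + 1) / (len(docs) + 2)
                 for f in features}
        model[label] = (prior, probs)
    return model


def classify(model, words):
    """Return True (hot) or False (cold), whichever class is more probable."""
    scores = {}
    for label, (prior, probs) in model.items():
        log_p = math.log(prior)
        for f, p in probs.items():
            # Independence assumption: per-feature log-probabilities add up.
            log_p += math.log(p if f in words else 1 - p)
        scores[label] = log_p
    return max(scores, key=scores.get)


pages = [
    ({"goat", "milk"}, True),
    ({"goat", "farm"}, True),
    ({"car", "farm"}, False),
    ({"car", "milk"}, False),
]
model = train_naive_bayes(pages, features=["goat", "car"])
hot_guess = classify(model, {"goat", "hay"})
cold_guess = classify(model, {"car"})
```

Working in log space avoids numeric underflow when the feature set is large, such as the 96 or 128 words used in the experiments.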
137Initial experiments
- See Table 2 for four users on 9 topics.
- Again, the partition into training set and test set is varied.
- Accuracy is the primary performance metric.
- Figure 3 displays the average accuracy, which is substantially better than the probability of cold pages.
- In the biomedical domain, all the top 10 pages were actually interesting, and all the bottom 10 pages were actually uninteresting.
138Initial experiments
- Among the 21 pages with probabilities above 0.9, 19 were rated interesting.
- Among the 64 pages with probabilities below 0.1, only one was rated interesting.
- Table 3 shows how the number of feature words impacts accuracy with 20 training examples.
- An intermediate number (96) of features performs the best.
- A comprehensive approach to feature selection is not feasible because it increases the complexity.
139Alternative machine learning alg.
- Nearest neighbor: assign the class of the most similar example.
- PEBLS: the distance between two examples is the sum of the value differences of all attributes. The difference between Vjx and Vjy is
140Machine Learning (Contd)
- Decision trees: ID3, which recursively selects the feature with the highest information gain.
- Rocchio's algorithm
- Use TF-IDF as feature weights (with normalization to unit length).
- Build the prototype vector of the interesting class by subtracting 0.25 of the average vector of the uninteresting pages from the average vector of the interesting pages.
- The purpose is to prevent infrequently occurring terms from overly affecting the classification.
- Pages within a certain distance from the prototype (determined by cosine) are considered interesting.
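The prototype construction can be sketched as follows, assuming each page is already a unit-length TF-IDF vector stored as a dict. The sample vectors are made up; only the 0.25 factor comes from the slides.

```python
import math


def average(vectors):
    """Component-wise average of sparse vectors stored as dicts."""
    terms = set().union(*vectors)    # iterating a dict yields its keys
    return {t: sum(v.get(t, 0.0) for v in vectors) / len(vectors) for t in terms}


def rocchio_prototype(interesting, uninteresting, beta=0.25):
    """Average interesting vector minus beta times the average uninteresting one."""
    pos, neg = average(interesting), average(uninteresting)
    return {t: pos.get(t, 0.0) - beta * neg.get(t, 0.0)
            for t in set(pos) | set(neg)}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


interesting = [{"goat": 1.0}, {"goat": 0.6, "milk": 0.8}]
uninteresting = [{"car": 1.0}]
proto = rocchio_prototype(interesting, uninteresting)

sim_goat_page = cosine(proto, {"goat": 1.0})
sim_car_page = cosine(proto, {"car": 1.0})
```

A page would then be called interesting when its cosine similarity to `proto` exceeds some threshold; the slides do not give the threshold value.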
- 20 examples were chosen as the training set because the increase in accuracy beyond 20 is mild.
- See Table 4. In each domain, the highest accuracy as well as those with slightly lower accuracies were marked.
- ID3 (or C4.5) is not well suited.
- Nearest neighbor performs worse (even for k-NN).
- Backpropagation, the Bayesian classifier, and Rocchio's algorithm are among the best.
- The Bayesian classifier is chosen because it is fast and adapts well to attribute dependencies.
142Using predefined user profiles
- Some users are unwilling to rate many pages before the system gives reliable predictions.
- An initial profile is solicited as follows:
- Provide a set of words that indicate interesting pages.
- Provide another set of words that indicate uninteresting pages. This set is more difficult to get.
- Four probabilities are given for each word: p(wordi present | hot), p(wordi absent | hot), p(wordi present | cold), p(wordi absent | cold). The default for p(wordi present | hot) is 0.7 and that for p(wordi present | cold) is 0.3.
143Using predefined user profiles (Contd)
- As more training data becomes available, more belief should be placed on the probability estimates.
- Conjugate priors are used to update probabilities from data.
- The initial probability is assumed to be equivalent to 50 pages.
- Example: if p(wordi present | hot) = 0.8 and, among 25 hot pages seen, 10 contain wordi,
- the probability becomes (40 + 10)/(50 + 25).
- Three alternatives
- Data: use only data for estimation. 96 features are obtained purely from data.
- Revision: use both data and initial profile for estimation. All words in the profile are used as features, supplemented with the most informative words for a total of 96 features.
- Fixed: use only the words provided by the user as features and only the initial profiles.
- See Tables 5, 6, and 7 for probabilities in initial profiles.
- Figures 4, 5, and 6 show that the revision strategy performs the best. The performance of fixed is surprisingly good.
- If we use only the words in the initial user profile and calculate the probabilities from data, it still performs well. See Figure 7.
146Using lexical knowledge
- Use WordNet as a thesaurus.
- When there is no relationship between a word and the words in a topic, the word is eliminated. The relationships considered include hypernym, antonym, member-holonym, part-holonym, similar-to, pertainym, and derived-from.
- Table 8 shows the eliminated words that are unrelated to goat.
- Figure 8 shows that when the number of examples is small, applying lexical knowledge does help.
147Comparing Feature-based and Clique-based User
Models for Movie Selection
- J. Alspector, A. Kolcz, and N. Karunanithi
- Conf. of Digital Libraries, 1998
- Compare content-based and collaborative approaches for making movie recommendations.
- Users must provide explicit ratings on some movies.
- Data set: 7389 movies.
- Volunteers rating movies: 242.
149Clique-based approach
- A set of users forms a clique if their movie ratings are closely related.
- The similarity between two users' ratings is defined by the Pearson correlation coefficient (i.e., the cosine of the mean-centered rating vectors) as follows
150Clique-based approach
- How to decide the clique of a given user U?
- Smin: minimum number of common ratings with U.
- Cmin: minimum correlation threshold.
- In the experiments, Smin is set to a constant 10, and Cmin is varied so that the size of the clique is 40.
- Once a clique is identified,
- For a given unseen movie m, let N be the number of clique members that rated m.
- ci(m) is the rating of movie m given by user i.
- r(m) is the estimated rating of movie m for the user U.
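The clique-based steps can be sketched as below, under two stated simplifications: Cmin is a fixed threshold (the experiments instead vary it until the clique has 40 members), and r(m) is taken to be the plain average of the clique members' ratings, since the exact formula is not reproduced in the text. The ratings data and function names are hypothetical.

```python
import math


def pearson(ratings_a, ratings_b):
    """Pearson correlation over the movies both users rated.

    Returns (correlation, number_of_common_ratings).
    """
    common = sorted(set(ratings_a) & set(ratings_b))
    n = len(common)
    if n < 2:
        return 0.0, n
    a = [ratings_a[m] for m in common]
    b = [ratings_b[m] for m in common]
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0, n
    return cov / math.sqrt(var_a * var_b), n


def find_clique(user, all_ratings, s_min=10, c_min=0.5):
    """Users sharing at least s_min ratings with `user` and correlating >= c_min."""
    clique = []
    for other, ratings in all_ratings.items():
        if other == user:
            continue
        corr, overlap = pearson(all_ratings[user], ratings)
        if overlap >= s_min and corr >= c_min:
            clique.append(other)
    return clique


def estimate_rating(movie, clique, all_ratings):
    """Plain average of the clique members' ratings of `movie` (an assumption)."""
    votes = [all_ratings[u][movie] for u in clique if movie in all_ratings[u]]
    return sum(votes) / len(votes) if votes else None


ratings = {
    "U": {"m1": 5, "m2": 1, "m3": 4},
    "A": {"m1": 4, "m2": 2, "m3": 5, "m4": 4},
    "B": {"m1": 1, "m2": 5, "m3": 2, "m4": 1},
    "C": {"m1": 5, "m2": 1, "m3": 5, "m4": 2},
}
clique = find_clique("U", ratings, s_min=3, c_min=0.5)   # small s_min for the toy data
r_m4 = estimate_rating("m4", clique, ratings)
```

User B's ratings are the mirror image of U's (correlation of -1), so B is excluded from the clique even though the overlap is sufficient.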
151Clique-based approach
152Feature-based approach
- Extract relevant features from the movies that the user has rated.
- Build a model for a user by associating selected features with the ratings.
- Estimate the rating of an unseen movie for a user by consulting the model.
153Relevant features
- Seven features are used
- 25 categories (0, 1)
- 6 MPAA ratings (0, 1)
- Maltin rating (0..4)
- Academy award: won = 1, nominated = 0.5, not considered = 0.
- Origin: USA = 0, USA with foreign collaboration = 0.5, foreign made = 0.
- Director: each director is represented as a numerical value that is the average rating given by the user to the movies directed by that director.
- Each feature is normalized to [0, 1].
154Linear model