Title: Word Weighting based on Users' Browsing History
1. Word Weighting based on Users' Browsing History
- Yutaka Matsuo
- National Institute of Advanced Industrial Science and Technology (JPN)
- Presenter: Junichiro Mori
- University of Tokyo (JPN)
2. Outline of the talk
- Introduction
- Context-based word weighting
- Proposed measure
- System architecture
- Evaluation
- Conclusion
3. Introduction
- Many information support systems with NLP use tfidf to measure the weight of words.
- Tfidf is based on statistics of word occurrence in a target document and a corpus.
- It is effective in many practical systems, including summarization systems and retrieval systems.
- However, a word that is important to one user is sometimes not important to others.
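The tfidf baseline mentioned above can be sketched in a few lines. This is a minimal illustration, not the systems cited by the slide; the `1 + n` smoothing inside the idf term is one common variant, chosen here as an assumption.

```python
import math
from collections import Counter

def tfidf(term, doc_tokens, corpus):
    """tfidf as described on the slide: term frequency in the target
    document, scaled by inverse document frequency over a corpus.
    corpus is a list of tokenized documents."""
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    n_docs_with_term = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / (1 + n_docs_with_term))  # smoothed idf
    return tf * idf

# Toy corpus echoing the baseball example from the next slide.
docs = [["suzuki", "hitting", "streak", "ends"],
        ["game", "ends", "today"],
        ["suzuki", "plays", "game"]]
print(tfidf("hitting", docs[0], docs))  # 'hitting' is distinctive here
```

Note that tfidf assigns the same weight to a word for every reader, which is exactly the limitation the talk goes on to address.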
4. Example
- "Suzuki hitting streak ends at 23 games"
- Ichiro Suzuki is a Japanese MLB player, MVP in 2001.
- Those who are greatly interested in MLB would think "hitting streak ends" is important,
- while a user who has no interest in MLB would note words such as "game" or "Seattle Mariners" as informative, because those words indicate that the subject of the article is baseball.
- If a user is not familiar with the topic, he/she may think general words related to the topic are important.
- On the other hand, if a user is familiar with the topic, he/she may think more detailed words are important.
Our main hypothesis
5. Goal of this research
- This research addresses context-based word weighting, focusing on the statistical feature of word co-occurrence.
- In order to measure the weight of words more correctly, contextual information about a user (what we call familiar words) is used.
6. Outline of the talk
- Introduction
- Context-based word weighting
- Proposed measure
- Previous work
- IRM (Interest Relevance Measure)
- System architecture
- Evaluation
- Conclusion
7. IRM
- A new measure, IRM, is based on a word-weighting algorithm applied to a single document.
- Matsuo '03: "Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information", FLAIRS 2003
8. We take a paper for example.
Previous work (Matsuo '03)
COMPUTING MACHINERY AND INTELLIGENCE
A. M. TURING
1. The Imitation Game
"I PROPOSE to consider the question, 'Can machines think?' This should begin with definitions of the meaning of the terms 'machine' and 'think'. The definitions might be framed so as to reflect so far as possible the normal use of the words, but this attitude is dangerous. If the meaning of the words 'machine' and 'think' are to be found by examining how they are commonly used it is difficult to escape the conclusion that the meaning and the answer to the question, 'Can machines think?' is to be sought in a statistical survey such as a Gallup poll. But this is absurd. Instead of attempting such a definition I shall replace the question by another, which is closely related to it and is expressed in relatively unambiguous words. The new form of the problem can be described in terms of a game which we call the 'imitation game'. It is played with three people, a man (A), a woman (B), and an interrogator (C) who may be of either ..."
9. Distribution of frequent terms
10. Next, count co-occurrences
- "The new form of the problem can be described in terms of a game which we call the 'imitation game'."
- After stemming, stop-word elimination, and phrase extraction:
- 'new' and 'form' co-occur once.
- 'new' and 'problem' co-occur once.
- ...
- 'call' and 'imitation game' co-occur once.
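The counting step above can be sketched as sentence-level pair counting. The exact preprocessed term list for the example sentence is an assumption; the slide only names a few of the resulting pairs.

```python
from collections import Counter
from itertools import combinations

def cooccurrences(sentences):
    """Count how often two terms appear in the same sentence.
    Each sentence is a list of already-preprocessed terms
    (stemmed, stop words removed, phrases extracted)."""
    counts = Counter()
    for terms in sentences:
        # Sorting makes the pair key order-independent.
        for a, b in combinations(sorted(set(terms)), 2):
            counts[(a, b)] += 1
    return counts

# The slide's example sentence after preprocessing (assumed term list):
sent = ["new", "form", "problem", "describe", "term", "game",
        "call", "imitation game"]
co = cooccurrences([sent])
print(co[("form", "new")])             # 1: 'new' and 'form' co-occur once
print(co[("call", "imitation game")])  # 1
```

Accumulating these counts over every sentence of the document yields the co-occurrence matrix shown on the next slide.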
11. Co-occurrence matrix
12. Co-occurrences of 'kind' with frequent terms, and of 'make' with frequent terms
- A general term such as 'kind' or 'make' is used relatively impartially with each frequent term, but
13. Co-occurrence matrix
[Matrix figure: rows and columns are frequent terms]
14. Co-occurrences of 'imitation' with frequent terms, and of 'digital computer' with frequent terms
- while a term such as 'imitation' or 'digital computer' shows co-occurrence especially with particular terms.
15. Biases of co-occurrence
- A general term such as 'kind' or 'make' is used relatively impartially with each frequent term, while a term such as 'imitation' or 'digital computer' shows co-occurrence especially with particular terms.
- Therefore, the degree of bias of co-occurrence can be used as a surrogate of term importance.
16. χ²-measure
- We use the χ²-test, which is very common for evaluating biases between expected and observed frequencies.
- G: the frequent terms
- freq(w, g): frequency of co-occurrence of term w and term g
- p_g: unconditional probability (the expected probability) of g
- f(w): the total number of co-occurrences of term w and the frequent terms G
- A large bias of co-occurrence means importance of a word.
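A sketch of the χ² scoring using the definitions on this slide. One simplification is assumed: p_g is approximated as g's share of all co-occurrence counts with G, whereas the original paper derives it from sentence statistics.

```python
def chi2_scores(cooc, G):
    """chi-squared bias of each term against the frequent terms G.
    cooc[w][g] = freq(w, g), the co-occurrence count of w and g.
    chi2(w) = sum over g of (freq(w,g) - f(w)*p_g)^2 / (f(w)*p_g)."""
    col_total = {g: sum(cooc[w].get(g, 0) for w in cooc) for g in G}
    grand = sum(col_total.values())
    p = {g: col_total[g] / grand for g in G}   # approximated p_g
    scores = {}
    for w in cooc:
        f_w = sum(cooc[w].get(g, 0) for g in G)  # f(w)
        if f_w == 0:
            continue
        scores[w] = sum((cooc[w].get(g, 0) - f_w * p[g]) ** 2 / (f_w * p[g])
                        for g in G if p[g] > 0)
    return scores

# Toy counts: 'imitation' co-occurs only with 'game', while 'kind' and
# 'make' spread impartially, so 'imitation' gets the larger bias.
G = ["machine", "game"]
cooc = {"imitation": {"game": 4},
        "kind":      {"machine": 2, "game": 2},
        "make":      {"machine": 2, "game": 2}}
ranked = sorted(chi2_scores(cooc, G).items(), key=lambda kv: -kv[1])
print(ranked)  # 'imitation' ranks first
```

Sorting terms by this score is exactly the step shown on the next slide.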
17. Sort by χ²-value
We can get important words based on co-occurrence information in a document.
18. Outline of the talk
- Introduction
- Context-based word weighting
- Proposed measure
- Previous work
- IRM (Interest Relevance Measure)
- System architecture
- Evaluation
- Conclusion
19. Personalize the calculation of word importance
IRM, proposed measure
- The previous method is useful for extracting reader-independent important words from a document.
- However, the importance of words depends not only on the document itself but also on the reader.
20. If we change the columns to pick up
a: machine, b: computer, c: question, d: digital, e: answer, f: game, g: argument, h: make, i: state, j: number; u: imitation, v: digital computer, w: kind, x: make
21. If we change the columns to pick up
[Matrix figures: columns are the frequent words; the frequent terms plus 'logic'; the frequent terms plus 'God']
Words relevant to the selected words have high χ² values, because they co-occur often.
22. Familiarity instead of frequency
- We focus on words familiar to the user, instead of words frequent in the document.
- Definition: Familiar words are the words which a user has frequently seen in the past.
23. Interest Relevance Measure (IRM)
- where H_k is the set of familiar words for user k
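The formula itself did not survive the transcript. Reading this slide against the χ²-measure of the previous method, with the frequent terms G replaced by the user's familiar words H_k, suggests a form like the following (a reconstruction under that assumption, not a verbatim copy of the slide):

```latex
\mathrm{IRM}_k(w_{ij}) = \sum_{g \in H_k} \frac{\bigl(\mathrm{freq}(w_{ij}, g) - f(w_{ij})\, p_g\bigr)^2}{f(w_{ij})\, p_g}
```

where freq, f, and p_g are as defined for the χ²-measure, now taken over the familiar words H_k rather than the frequent terms G.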
24. IRM
- If the value of IRM is large, word w_ij is relevant to the user's familiar words.
- The word is relevant to the user's interests, so it is a keyword for the user.
- Conversely, if the value of IRM is small, word w_ij is not specifically relevant to any of the familiar words.
25. Outline of the talk
- Introduction
- Context-based word weighting
- Proposed measure
- Previous work
- IRM (Interest Relevance Measure)
- System architecture
- Evaluation
- Conclusion
26. Browsing support system
- It is difficult to evaluate IRM objectively because the weight of words depends on a user's familiar words, and therefore varies among users.
- Therefore, we evaluate IRM by constructing a Web browsing support system.
- Web pages accessed by a user are monitored by a proxy server.
- The count of each word is stored in a database.
27. System architecture of browsing support system
[Architecture diagram: Browser → proxy server → word-count database]
28. Sample screenshot
30. Outline of the talk
- Introduction
- Context-based word weighting
- Proposed measure
- Previous work
- IRM (Interest Relevance Measure)
- System architecture
- Evaluation
- Conclusion
31. Evaluation
- For evaluation, ten people tried this system for more than one hour.
- Three methods are implemented for comparison:
- (I) word frequency
- (II) tfidf
- (III) IRM
32. Evaluation Result (1)
- After using each system (blind), we ask the following questions on a 5-point Likert scale from 1 (not at all) to 5 (very much).
- Q1: Does this system help you browse the Web?
- (I) 2.8 (II) 3.2 (III) 3.2
- Q2: Are the red-colored words (high-IRM words) interesting to you?
- (I) 3.2 (II) 4.0 (III) 4.1
- Q3: Are the interesting words colored red?
- (I) 2.9 (II) 3.3 (III) 3.8
- Q4: Are the blue-colored words (familiar words) interesting to you?
- (I) 2.7 (II) 2.5 (III) 2.0
- Q5: Are the interesting words colored blue?
- (I) 2.7 (II) 2.5 (III) 2.4
(I) word frequency, (II) tfidf, (III) IRM
33. Evaluation Result (2)
- After evaluating all three systems, we ask the following two questions.
- Q6: Which one helps your browsing the most?
- (I) 1 person (II) 3 (III) 6
- Q7: Which one detects your interests the most?
- (I) 0 people (II) 2 (III) 8
- Overall, IRM detects words of the user's interests the most.
(I) word frequency, (II) tfidf, (III) IRM
34. Outline of the talk
- Introduction
- Context-based word weighting
- Proposed measure
- Previous work
- IRM (Interest Relevance Measure)
- System architecture
- Evaluation
- Conclusion
35. Conclusion
- We develop a context-based word-weighting measure (IRM) based on the relevance (i.e., the co-occurrence) to a user's familiar words.
- If a user is not familiar with the topic, he/she may think general words related to the topic are important.
- On the other hand, if a user is familiar with the topic, he/she may think more detailed words are important.
- We implemented IRM in a browsing support system, and showed its effect.