Title: APPLICATIONS OF DATA MINING IN INFORMATION RETRIEVAL
Nearest Neighbor Classifiers
- Basic intuition: similar documents should have the same class label
- We can use the vector space model and the cosine similarity
- Training is simple
  - Remember the class values of the training documents
  - Index them using an inverted index structure
- Testing is also simple
  - Use each test document dt as a query
  - Fetch the k training documents most similar to it
  - Use majority voting to determine the class of dt
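The training and testing steps above can be sketched as follows. This is a minimal Python sketch under stated assumptions: documents are term-to-weight dictionaries, and the k most similar training documents are found by a linear scan rather than through a real inverted index; all function names are illustrative.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity of two term -> weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = sqrt(sum(w * w for w in a.values()))
    nb = sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(test_doc, training, k=3):
    """training: list of (vector, class label) pairs.
    Rank training documents by similarity to the test document
    (used as a query) and take a majority vote over the top k."""
    ranked = sorted(training, key=lambda dc: cosine(test_doc, dc[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```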
Nearest Neighbor Classifiers
- Instead of pure counts of classes, we can use weights with respect to the similarity
  - If training document d has label cd, then cd accumulates a score of s(dq, d)
  - The class with the maximum score is selected
- Per-class offsets bc could be used and tuned later on
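The similarity-weighted variant can be sketched like this (again an illustrative Python sketch over dictionary vectors; `offsets` plays the role of the tunable per-class offsets bc):

```python
from collections import defaultdict
from math import sqrt

def cosine(a, b):
    """Cosine similarity of two term -> weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = sqrt(sum(w * w for w in a.values()))
    nb = sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify_weighted(dq, training, k=3, offsets=None):
    """Each of the k nearest training documents d with label cd adds
    s(dq, d) to the score of cd; an optional per-class offset bc is
    added before picking the class with the maximum score."""
    top = sorted(((cosine(dq, d), c) for d, c in training),
                 reverse=True)[:k]
    score = defaultdict(float)
    for s, c in top:
        score[c] += s
    for c, bc in (offsets or {}).items():
        score[c] += bc
    return max(score, key=score.get)
```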
Nearest Neighbor Classifiers
- Choosing the value of k
  - Try various values of k and use a portion of the documents for validation
  - Cluster the documents and choose a value of k proportional to the size of the small clusters
Nearest Neighbor Classifiers
- kNN is a lazy strategy compared to decision trees
- Advantages
  - No training is needed to construct a model
  - When properly tuned for k and bc, they are comparable in accuracy to other classifiers
- Disadvantages
  - Classification may involve many inverted index lookups; scoring, sorting, and picking the best k results takes time (since k is small compared to the number of retrieved documents, such queries are called iceberg queries)
Nearest Neighbor Classifiers
- For better performance, some effort is spent during training
  - Documents are clustered, and only a few statistical parameters are stored per cluster
  - A test document is first compared with the cluster representatives, then with the individual documents from the appropriate clusters
Measures of accuracy
- We may have one of the following settings
  - Each document is associated with exactly one class
  - Each document is associated with a subset of classes
- The ideas of precision and recall can be used to measure the accuracy of the classifier
  - Calculate the average precision and recall over all the classes
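The class-averaged measures can be computed as below; a minimal Python sketch, assuming each document may carry a set of labels (which covers both the single-class and subset-of-classes settings):

```python
def macro_precision_recall(predicted, actual, classes):
    """predicted / actual: doc id -> set of class labels.
    Compute per-class precision and recall, then average the
    values over all classes (macro-averaging)."""
    precisions, recalls = [], []
    for c in classes:
        tp = sum(1 for d in actual if c in predicted[d] and c in actual[d])
        fp = sum(1 for d in actual if c in predicted[d] and c not in actual[d])
        fn = sum(1 for d in actual if c not in predicted[d] and c in actual[d])
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return sum(precisions) / len(classes), sum(recalls) / len(classes)
```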
Hypertext Classification
- An HTML document can be thought of as a hierarchy of regions represented by a tree-structured Document Object Model (DOM), see www.w3.org/DOM
- A DOM tree consists of
  - Internal nodes
  - Leaf nodes: segments of text
  - Hyperlinks to other nodes
Hypertext Classification
- An example DOM in XML format
- It is important to distinguish the two occurrences of the term "surfing", which can be achieved by prefixing the term with the sequence of tags in the DOM tree
  - resume.publication.title.surfing
  - resume.hobbies.item.surfing

<resume>
  <publication>
    <title> Statistical models for web-surfing </title>
  </publication>
  <hobbies>
    <item> Wind-surfing </item>
  </hobbies>
</resume>
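The tag-path prefixing above can be sketched with the standard library's XML parser. This is an illustrative sketch: it uses a naive whitespace tokenizer, so a real tokenizer that also splits compounds like "web-surfing" would be needed to obtain the bare term "surfing".

```python
import xml.etree.ElementTree as ET

def path_prefixed_terms(xml_text):
    """Emit each text term prefixed by its path of tags in the DOM
    tree, so the same term under different elements yields distinct
    features."""
    root = ET.fromstring(xml_text)
    features = []
    def walk(node, path):
        path = path + [node.tag]
        if node.text and node.text.strip():
            for term in node.text.split():
                features.append(".".join(path) + "." + term.lower())
        for child in node:
            walk(child, path)
    walk(root, [])
    return features

doc = """<resume>
  <publication><title>web-surfing</title></publication>
  <hobbies><item>Wind-surfing</item></hobbies>
</resume>"""
```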
Hypertext Classification
- Use relations to give meaning to textual features, such as
  - contains-text(domNode, term)
  - part-of(domNode1, domNode2)
  - tagged(domNode, tagName)
  - links-to(srcDomNode, dstDomNode)
  - contains-anchor-text(srcDomNode, dstDomNode, term)
  - classified(domNode, label)
- Discover rules from the collection of relations, such as
  - classified(A, facultyPage) :- contains-text(A, professor), contains-text(A, phd), links-to(B, A), contains-text(B, faculty)
  - where ":-" means "if" and a comma stands for conjunction
Hypertext Classification
- Rule induction in a two-class setting
- FOIL (First Order Inductive Learner, Quinlan 1993)
  - A greedy algorithm that learns rules to distinguish positive examples from negative ones
  - Repeatedly searches for the current best rule and removes all the positive examples covered by that rule, until all the positive examples in the data set are covered
- Tries to maximize the gain of adding a literal p to a rule r
  - P is the set of positive and N the set of negative examples satisfying r; when p is added to r, let P' and N' be the sets of positive and negative examples satisfying the new rule
  - gain(p, r) = t * ( log2(|P'| / (|P'| + |N'|)) - log2(|P| / (|P| + |N|)) ), where t is the number of positive examples satisfying both r and the new rule
Hypertext Classification

Let R be the set of rules learned, initially empty
While D+ != EmptySet do            // learn a new rule
    Let r be "true", the new rule
    While some d in D- satisfies r do
        // add a new, possibly negated, literal to r to specialize it
        Add the best possible literal p as a conjunct to r
    endwhile
    R <- R U {r}
    Remove from D+ all instances for which r evaluates to true
endwhile
Return R
13Hypertext Classification
- Types of literals explored
  - Xi = Xj, Xi = c, Xi > Xj, etc., where Xi and Xj are variables and c is a constant
  - Q(X1, X2, ..., Xk), where Q is a relation and the Xi are variables
  - not(L), where L is a literal of one of the above forms
Hypertext Classification
- With relational learning, we can learn class labels for individual pages, as well as relationships between them
  - member(homePage, department)
  - teaches(homePage, coursePage)
  - advises(homePage, homePage)
- We can also incorporate other classifiers, such as naïve Bayes, for rule learning
RETRIEVAL UTILITIES
Retrieval Utilities
- Relevance feedback
- Clustering
- Passage-based Retrieval
- Parsing
- N-grams
- Thesauri
- Semantic Networks
- Regression Analysis
Relevance Feedback
- Do the retrieval in multiple steps
- The user refines the query at each step with respect to the results of the previous queries
  - The user tells the IR system which documents are relevant
  - New terms are added to the query based on the feedback
  - Term weights may be updated based on the user feedback
Relevance Feedback
- Bypass the user by
  - Assuming the top-k results in the ranked list are relevant (pseudo-relevance feedback)
  - Modifying the original query as before
Relevance Feedback
- Example: find information surrounding the various conspiracy theories about the assassination of John F. Kennedy (example from your textbook)
- If a highly ranked document contains the term "Oswald", then this term should be added to the initial query
- If the term "assassination" appears in the top-ranked documents, then its weight should be increased
Relevance Feedback in Vector Space Model
- Q is the original query
- R is the set of relevant and S the set of irrelevant documents selected by the user
- |R| = n1, |S| = n2
- Q' = Q + (1/n1) * sum_{d in R} d - (1/n2) * sum_{d in S} d
- In general
  - Q' = alpha * Q + (beta/n1) * sum_{d in R} d - (gamma/n2) * sum_{d in S} d
  - The weights alpha, beta, and gamma are referred to as Rocchio weights
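The general update can be sketched as follows; a minimal Python sketch over term -> weight dictionaries, with commonly used (but here merely illustrative) default values for the Rocchio weights:

```python
def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Q' = alpha*Q + (beta/n1) * sum(d in R) - (gamma/n2) * sum(d in S).
    Vectors are dicts mapping term -> weight."""
    new_q = {t: alpha * w for t, w in query.items()}
    for docs, coef in ((relevant, beta), (irrelevant, -gamma)):
        for d in docs:
            for t, w in d.items():
                new_q[t] = new_q.get(t, 0.0) + coef * w / len(docs)
    # terms driven negative by the irrelevant set are usually dropped
    return {t: w for t, w in new_q.items() if w > 0}
```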
Relevance Feedback in Vector Space Model
- What if the original query retrieves only non-relevant documents (as determined by the user)?
- Then increase the weight of the most frequently occurring term in the document collection
Relevance Feedback in Vector Space Model
- Result-set clustering can be used as a utility for relevance feedback
- Hierarchical clustering can be used for that purpose, where the distance is defined by the cosine similarity
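A hierarchical (agglomerative) clustering of the result set under cosine similarity can be sketched like this; the single-link merge criterion and the similarity threshold are illustrative choices, not prescribed by the slides:

```python
from math import sqrt

def cos_sim(a, b):
    """Cosine similarity of two term -> weight vectors."""
    dot = sum(a.get(t, 0.0) * w for t, w in b.items())
    na = sqrt(sum(w * w for w in a.values()))
    nb = sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def agglomerate(docs, threshold=0.5):
    """Single-link agglomerative clustering: repeatedly merge the two
    most similar clusters until the best pair falls below the
    threshold. Returns clusters as lists of document indices."""
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = max(cos_sim(docs[a], docs[b])
                        for a in clusters[i] for b in clusters[j])
                if s > best:
                    best, pair = s, (i, j)
        if best < threshold:
            break
        i, j = pair
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters
```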