APPLICATIONS OF DATA MINING IN INFORMATION RETRIEVAL - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
APPLICATIONS OF DATA MINING IN INFORMATION
RETRIEVAL
2
Nearest Neighbor Classifiers
  • Basic intuition: similar documents should have the same class label.
  • We can use the vector space model and cosine similarity.
  • Training is simple:
    • Remember the class value of the training documents.
    • Index them using an inverted index structure.
  • Testing is also simple:
    • Use each test document dt as a query.
    • Fetch the k training documents most similar to it.
    • Use majority voting to determine the class of dt.
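The testing step above can be sketched as follows. This is a minimal, hypothetical example: the toy term-frequency vectors and labels are invented, and a real system would fetch candidates through the inverted index rather than scoring every training document.

```python
import math
from collections import Counter

def cosine(a, b):
    # cosine similarity between two sparse term-frequency vectors (dicts)
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(dt, training, k=3):
    # use the test document dt as a query, fetch the k most similar
    # training documents, and take a majority vote over their labels
    top = sorted(training, key=lambda dc: cosine(dt, dc[0]), reverse=True)[:k]
    return Counter(label for _, label in top).most_common(1)[0][0]

train = [({"ball": 2, "goal": 1}, "sports"),
         ({"ball": 1, "match": 2}, "sports"),
         ({"vote": 2, "law": 1}, "politics")]
label = knn_classify({"ball": 1, "goal": 2}, train, k=3)
```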

3
Nearest Neighbor Classifiers
  • Instead of pure counts of classes, we can weight the votes by similarity.
  • If training document d has label cd, then cd accumulates a score of
    s(dq, d), the similarity between d and the query document dq.
  • The class with the maximum score is selected.
  • Per-class offsets can be added and tuned later on.
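Similarity-weighted voting can be sketched like this (the documents are hypothetical). Note that with raw counts the two "politics" neighbors would win here, while the similarity-weighted score picks "sports":

```python
import math

def cosine(a, b):
    # cosine similarity between sparse term-frequency vectors (dicts)
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def weighted_knn(dq, training, k=3, offsets=None):
    # each of the k nearest training documents d with label cd adds
    # s(dq, d) to cd's score; optional per-class offsets are added on top
    top = sorted(training, key=lambda dc: cosine(dq, dc[0]), reverse=True)[:k]
    scores = dict(offsets or {})
    for d, label in top:
        scores[label] = scores.get(label, 0.0) + cosine(dq, d)
    return max(scores, key=scores.get)

train = [({"ball": 2, "goal": 1}, "sports"),
         ({"vote": 2, "law": 1}, "politics"),
         ({"vote": 1, "poll": 1}, "politics")]
```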

4
Nearest Neighbor Classifiers
  • Choosing the value of k:
  • Try various values of k and use a portion of the documents for
    validation.
  • Cluster the documents and choose a value of k proportional to the
    size of the small clusters.
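The validation heuristic can be sketched as a small grid search; the candidate values and toy data below are assumptions:

```python
import math
from collections import Counter

def cosine(a, b):
    # cosine similarity between sparse term-frequency vectors (dicts)
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn(dq, training, k):
    top = sorted(training, key=lambda dc: cosine(dq, dc[0]), reverse=True)[:k]
    return Counter(label for _, label in top).most_common(1)[0][0]

def choose_k(training, validation, candidates=(1, 3, 5)):
    # pick the k with the best accuracy on a held-out validation split
    def accuracy(k):
        return sum(knn(d, training, k) == c for d, c in validation) / len(validation)
    return max(candidates, key=accuracy)
```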

5
Nearest Neighbor Classifiers
  • kNN is a lazy strategy compared to eager learners such as decision
    trees.
  • Advantages:
    • No training is needed to construct a model.
    • When properly tuned for k and the per-class offsets bc, kNN
      classifiers are comparable in accuracy to other classifiers.
  • Disadvantages:
    • Classification may involve many inverted index lookups, and
      scoring, sorting, and picking the best k results takes time.
      (Since k is small compared to the number of retrieved documents,
      such queries are called iceberg queries.)

6
Nearest Neighbor Classifiers
  • For better performance, some effort is spent
    during training
  • Documents are clustered, and only a few
    statistical parameters are stored per-cluster
  • A test document is first compared with the
    cluster representatives, then with the individual
    documents from appropriate clusters

7
Measures of accuracy
  • We may have one of the following settings:
    • Each document is associated with exactly one class.
    • Each document is associated with a subset of classes.
  • The ideas of precision and recall can be used to measure the
    accuracy of the classifier:
    • Calculate the average precision and recall over all the classes.
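Macro-averaged precision and recall over classes can be sketched as follows; the label lists are hypothetical, and this assumes the single-label setting:

```python
def macro_precision_recall(gold, pred, classes):
    # macro-average: compute precision/recall per class, then average
    precisions, recalls = [], []
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)
```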

8
Hypertext Classification
  • An HTML document can be thought of as a hierarchy of regions
    represented by a tree-structured Document Object Model (DOM).
    See www.w3.org/DOM
  • A DOM tree consists of:
    • Internal nodes
    • Leaf nodes: segments of text
    • Hyperlinks to other nodes

9
Hypertext Classification
  • An example DOM in XML format
  • It is important to distinguish the two occurrences of the term
    surfing, which can be achieved by prefixing the term with the
    sequence of tags on its path in the DOM tree:
  • resume.publication.title.surfing
  • resume.hobbies.item.surfing

<resume>
  <publication>
    <title> Statistical models for web-surfing </title>
  </publication>
  <hobbies>
    <item> Wind-surfing </item>
  </hobbies>
</resume>
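Tag-path prefixing can be sketched with Python's standard xml.etree; tokenizing the text by whitespace is an assumption:

```python
import xml.etree.ElementTree as ET

doc = """<resume>
  <publication><title>Statistical models for web-surfing</title></publication>
  <hobbies><item>Wind-surfing</item></hobbies>
</resume>"""

def prefixed_terms(node, path=()):
    # emit every text term prefixed by the sequence of tags above it
    path = path + (node.tag,)
    if node.text and node.text.strip():
        for term in node.text.strip().lower().split():
            yield ".".join(path + (term,))
    for child in node:
        yield from prefixed_terms(child, path)

terms = list(prefixed_terms(ET.fromstring(doc)))
# the two occurrences of "surfing" now carry distinct prefixes:
# resume.publication.title.web-surfing vs. resume.hobbies.item.wind-surfing
```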
10
Hypertext Classification
  • Use relations to give meaning to textual features, such as:
  • contains-text(domNode, term)
  • part-of(domNode1, domNode2)
  • tagged(domNode, tagName)
  • links-to(srcDomNode, dstDomNode)
  • contains-anchor-text(srcDomNode, dstDomNode, term)
  • classified(domNode, label)
  • Discover rules from the collection of relations, such as:
  • classified(A, facultyPage) :- contains-text(A, professor),
    contains-text(A, phd), links-to(B, A), contains-text(B, faculty)
  • where :- means "if" and the comma stands for conjunction.
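The example rule can be checked against a toy fact base of relations; the facts below are hypothetical:

```python
# hypothetical facts, stored as sets of tuples
contains_text = {("A", "professor"), ("A", "phd"), ("B", "faculty")}
links_to = {("B", "A")}

def is_faculty_page(a):
    # classified(A, facultyPage) :- contains-text(A, professor),
    #     contains-text(A, phd), links-to(B, A), contains-text(B, faculty)
    return ((a, "professor") in contains_text
            and (a, "phd") in contains_text
            and any(dst == a and (src, "faculty") in contains_text
                    for src, dst in links_to))
```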

11
Hypertext Classification
  • Rule induction in the two-class setting:
  • FOIL (First Order Inductive Learner; Quinlan, 1993):
    • A greedy algorithm that learns rules to distinguish positive
      examples from negative ones.
    • Repeatedly searches for the current best rule and removes all the
      positive examples covered by that rule, until all the positive
      examples in the data set are covered.
    • Tries to maximize the gain of adding a literal p to rule r.
    • P is the set of positive and N the set of negative examples
      covered by rule r; when p is added to r, P' positive and N'
      negative examples satisfy the new rule.
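The gain of adding literal p is commonly written, in the FOIL formulation, as P' · (log2(P'/(P'+N')) − log2(P/(P+N))); a minimal sketch, where the example counts are passed in directly:

```python
import math

def foil_gain(P, N, P1, N1):
    # P, N:   positive/negative examples covered by rule r
    # P1, N1: positive/negative examples covered after adding literal p
    if P1 == 0:
        return 0.0  # a literal that keeps no positives has no gain
    return P1 * (math.log2(P1 / (P1 + N1)) - math.log2(P / (P + N)))
```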

12
Hypertext Classification
Let R be the set of rules learned, initially empty
while D ≠ ∅ do                      // learn a new rule
    let r = true be the new rule
    while some d in D- satisfies r do
        // add a new, possibly negated, literal to r to specialize it
        add the best possible literal p as a conjunct to r
    endwhile
    R <- R U {r}
    remove from D all instances for which r evaluates to true
endwhile
return R
13
Hypertext Classification
  • Types of literals explored:
  • Xi = Xj, Xi = c, Xi > Xj, etc., where Xi and Xj are variables and c
    is a constant
  • Q(X1, X2, ..., Xk), where Q is a relation and the Xi are variables
  • not(L), where L is a literal of one of the above forms
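The covering loop above can be turned into a runnable propositional sketch. The page-feature sets and literals are invented, and termination assumes every learned rule covers at least one remaining positive example:

```python
def covers(rule, doc):
    # a rule is a conjunction of literals; it covers doc if all of them hold
    return all(lit(doc) for lit in rule)

def learn_rules(pos, neg, literals):
    # sequential covering: grow one rule until it excludes all negatives,
    # keep it, drop the positives it covers, and repeat
    rules, pos = [], list(pos)
    while pos:
        rule, neg_left, cands = [], list(neg), list(literals)
        while neg_left and cands:
            # greedy choice: keep many positives, exclude many negatives
            best = max(cands, key=lambda lit:
                       sum(lit(d) for d in pos) - sum(lit(d) for d in neg_left))
            rule.append(best)
            cands.remove(best)
            neg_left = [d for d in neg_left if covers(rule, d)]
        rules.append(rule)
        pos = [d for d in pos if not covers(rule, d)]
    return rules

def has_professor(d): return "professor" in d
def has_phd(d): return "phd" in d

rules = learn_rules(pos=[{"professor", "phd"}, {"professor", "faculty"}],
                    neg=[{"student"}, {"phd", "student"}],
                    literals=[has_professor, has_phd])
```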

14
Hypertext Classification
  • With relational learning, we can learn class labels for individual
    pages as well as relationships between them:
  • member(homePage, department)
  • teaches(homePage, coursePage)
  • advises(homePage, homePage)
  • We can also incorporate other classifiers, such as naïve Bayes,
    into rule learning.

15
RETRIEVAL UTILITIES
16
Retrieval Utilities
  • Relevance feedback
  • Clustering
  • Passage-based Retrieval
  • Parsing
  • N-grams
  • Thesauri
  • Semantic Networks
  • Regression Analysis

17
Relevance Feedback
  • Do the retrieval in multiple steps.
  • The user refines the query at each step based on the results of the
    previous queries:
    • The user tells the IR system which documents are relevant.
    • New terms are added to the query based on the feedback.
    • Term weights may be updated based on the user feedback.

18
Relevance Feedback
  • We can bypass the user (pseudo-relevance feedback) by:
    • Assuming the top-k results in the ranked list are relevant
    • Modifying the original query as before

19
Relevance Feedback
  • Example: find information surrounding the various conspiracy
    theories about the assassination of John F. Kennedy (example from
    the textbook).
  • If a highly ranked document contains the term Oswald, then this
    term should be added to the initial query.
  • If the term assassination appears in a top-ranked document, then
    its weight should be increased.

20
Relevance Feedback in Vector Space Model
  • Q is the original query vector.
  • R is the set of relevant and S the set of non-relevant documents
    selected by the user, with |R| = n1 and |S| = n2.
  • The updated query is Q' = Q + (1/n1) Σ_{d∈R} d − (1/n2) Σ_{d∈S} d

21
Relevance Feedback in Vector Space Model
  • Q is the original query vector; R and S are the relevant and
    non-relevant sets as before, with |R| = n1 and |S| = n2.
  • In general, Q' = α Q + (β/n1) Σ_{d∈R} d − (γ/n2) Σ_{d∈S} d

The weights α, β, γ are referred to as Rocchio weights.
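A minimal sketch of the update with sparse vectors; the default weights α=1.0, β=0.75, γ=0.15 are conventional choices, not values from the slides:

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    # Q' = alpha*Q + (beta/n1) * sum(d in R) - (gamma/n2) * sum(d in S)
    terms = (set(query) | {t for d in relevant for t in d}
             | {t for d in nonrelevant for t in d})
    q_new = {}
    for t in terms:
        w = alpha * query.get(t, 0.0)
        if relevant:
            w += beta * sum(d.get(t, 0.0) for d in relevant) / len(relevant)
        if nonrelevant:
            w -= gamma * sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant)
        q_new[t] = max(w, 0.0)  # negative weights are usually clipped to zero
    return q_new
```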
22
Relevance Feedback in Vector Space Model
  • What if the original query retrieves only
    non-relevant documents (determined by the user)?
  • Then increase the weight of the most frequently
    occurring term in the document collection.

23
Relevance Feedback in Vector Space Model
  • Result-set clustering can be used as a utility for relevance
    feedback.
  • Hierarchical clustering can be used for that purpose, where the
    distance is defined via the cosine similarity.
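A naive single-link agglomerative sketch over cosine similarity; the similarity threshold and toy documents are assumptions, and a real system would build a proper dendrogram rather than this O(n³) loop:

```python
import math

def cosine(a, b):
    # cosine similarity between sparse term-frequency vectors (dicts)
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster(docs, threshold=0.5):
    # repeatedly merge the first pair of clusters whose closest members
    # (single link) are at least `threshold` similar
    clusters = [[d] for d in docs]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = max(cosine(a, b) for a in clusters[i] for b in clusters[j])
                if sim >= threshold:
                    clusters[i] += clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters

groups = cluster([{"ball": 1}, {"ball": 1, "goal": 1}, {"law": 1}])
```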