Topic Distillation and Web Page Categorization

1
Topic Distillation and Web Page Categorization
  • Prasanna K. Desikan
  • (05/29/2002)

2
Motivation
  • The web is a huge repository of information.
  • Categorizing web documents facilitates the
    search and retrieval of pages.
  • Topic distillation is the process of finding
    authoritative Web pages and comprehensive hubs
    which reciprocally endorse each other and are
    relevant to a given query.

3
Approaches for Categorization
  • Text based Categorization
  • Structure or link based Categorization
  • Combination of link and text information

4
Web Page Categorization Algorithms
  • Manual categorization by domain-specific experts.
  • A number of domain experts analyze the contents
    of the web page and classify it based on its
    textual content, as done by Yahoo.
  • Content-based categorization: based solely on
    document content, or on a combination of document
    content and META tags.
  • To classify a document, all the stop words are
    removed and the remaining keywords/phrases are
    represented in the form of a feature vector.
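
The stop-word removal and feature-vector step can be sketched as follows. This is a minimal illustration, not the method of any specific system cited in the slides: the stop-word list, the vocabulary, and the function name `feature_vector` are all assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not from the slides).
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "in", "to", "for", "on"}


def feature_vector(text, vocabulary):
    """Return term counts for `vocabulary` after removing stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    # One dimension per vocabulary term, in vocabulary order.
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and document text.
vocabulary = ["cheese", "french", "ski", "online"]
vec = feature_vector("The French cheese shop: buy cheese online", vocabulary)
```

A classifier would then be trained on such vectors; the choice of classifier is left open here.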

5
Web Page Categorization Algorithms
  • Link and Content Analysis.
  • Based on the premise that a web page that refers
    to a document must contain enough hints about its
    content to induce someone to read it. Such hints
    can be used to classify the document being
    referred to.

6
Topic Distillation in Hyperlinked Environment [1]
  • Aim: To find quality documents related to a query
    topic.
  • Problems encountered with the HITS approach:
  • Mutually reinforcing relationships between hosts.
  • Automatically generated links.
  • Non-relevant nodes (documents not relevant to the
    query topic).

7
Topic Distillation in Hyperlinked Environment [1]
  • Represent the Web as a graph in which each node
    is a web page and each edge is a link.
  • Approaches
  • If there are k edges (links) from documents on a
    first host to a single document on a second host,
    each edge is given an authority weight of 1/k.
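
The 1/k rule above can be sketched in a few lines. This is only an illustration under stated assumptions: edges are given as (source URL, target URL) pairs, the host is taken to be the URL's network location, and the URLs and the name `authority_edge_weights` are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def host(url):
    """Host part of a URL, used to group documents by site."""
    return urlparse(url).netloc


def authority_edge_weights(edges):
    """Give each edge weight 1/k, where k is the number of edges from
    the source's host to the same target document."""
    edges = set(edges)  # ignore duplicate links
    k = Counter((host(src), dst) for src, dst in edges)
    return {(src, dst): 1.0 / k[(host(src), dst)] for src, dst in edges}


# Hypothetical link graph: two pages on one host and one page on
# another host all point at the same document.
edges = [
    ("http://a.example/1", "http://b.example/x"),
    ("http://a.example/2", "http://b.example/x"),
    ("http://c.example/1", "http://b.example/x"),
]
weights = authority_edge_weights(edges)
```

With these weights, the two links from `a.example` together contribute as much authority as the single link from `c.example`, which is the intent of damping mutually reinforcing hosts.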

8
Topic Distillation in Hyperlinked Environment [1]
  • Approaches (contd.)
  • Compute the relevance weight for each node.
  • Eliminate non-relevant nodes from the graph by
    setting a threshold on the relevance weight.
  • Regulate the influence of a node based on its
    relevance.
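
A minimal sketch of relevance-based pruning follows. The slides do not say how the relevance weight is computed; the sketch assumes cosine similarity between term-count vectors of the query and of each document, and the threshold value, document texts, and function names are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count mappings."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prune_by_relevance(docs, query, threshold):
    """Keep only nodes whose relevance weight reaches the threshold."""
    q = Counter(query.lower().split())
    relevance = {
        node: cosine(Counter(text.lower().split()), q)
        for node, text in docs.items()
    }
    return {node: r for node, r in relevance.items() if r >= threshold}


# Hypothetical documents keyed by node id.
docs = {"a": "cheese shop french cheese", "b": "ski resort"}
kept = prune_by_relevance(docs, "french cheese", threshold=0.1)
```

The surviving relevance weights could also be used to scale each node's hub and authority contributions, which corresponds to the "regulate the influence" bullet.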

9
Topic Distillation in Hyperlinked Environment [1]
  • Approaches (contd.)
  • Partial content analysis.
  • Content pruning: analyze only a part of the
    graph, i.e. the nodes that are most influential
    in the outcome.

10
Automatic Resource Compilation [2]
  • Goal: Automatically compile a resource list on
    any topic that is broad and well-represented on
    the Web.
  • Approach
  • A search-and-growth phase.
  • A weighting phase.
  • w(p,q) = 1 + n(t)
  • w(p,q): a measure of the authority on the topic
    invested by page p in page q.
  • n(t): the number of terms from the topic
    description that match in the anchor window of
    width B.
  • An iteration-and-reporting phase.
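
The weighting phase can be illustrated with a short sketch, assuming the rule w(p,q) = 1 + n(t). The assumptions are labeled in the code: the window is measured in characters on each side of the anchor, matching is by whole lowercase word, and the default B, the sample text, and the name `anchor_weight` are hypothetical.

```python
import re


def anchor_weight(page_text, anchor_start, anchor_end, topic, B=50):
    """Return 1 + n(t), where n(t) counts topic terms that occur within
    B characters on either side of the anchor (window size and
    word-level matching are assumptions of this sketch)."""
    window = page_text[max(0, anchor_start - B): anchor_end + B].lower()
    words = re.findall(r"[a-z0-9]+", window)
    topic_terms = set(topic.lower().split())
    n_t = sum(1 for w in words if w in topic_terms)
    return 1 + n_t


# Hypothetical text of page p around a link to page q.
text = "Best French cheese: Fromages.com sells cheese online"
start = text.index("Fromages.com")
end = start + len("Fromages.com")

w_wide = anchor_weight(text, start, end, "cheese", B=50)
w_none = anchor_weight(text, start, end, "cheese", B=0)
```

A link with no topic terms nearby still gets weight 1, so plain link structure is never discarded; nearby topic terms only increase the weight.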

11
Relaxation Labeling Technique [3]
  • First, classify the unclassified documents in
    the neighborhood using a terms-only classifier,
    i.e. using only the text of the neighboring
    documents.
  • Iterate until convergence:
  • Recompute the class for each document using both
    the local text and the class information of the
    neighbors.
  • The relaxation is guaranteed to converge to a
    consistent state.
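
The iteration above can be sketched as follows. This is a simplified illustration, not the exact update of [3]: it assumes each document already has class probabilities from a terms-only classifier, and it combines them with the average class probability of the neighbors by simple multiplication. The function name, the combination rule, and the sample data are assumptions.

```python
def relaxation_labeling(text_probs, neighbors, max_iters=50, tol=1e-6):
    """text_probs: {doc: {class: P(class | local text)}}
    neighbors:  {doc: [neighboring docs]} (all must appear in text_probs)
    Returns the most likely class per document."""
    # Step 1: start from the terms-only estimates.
    probs = {d: dict(p) for d, p in text_probs.items()}
    for _ in range(max_iters):
        updated = {}
        for d, local in text_probs.items():
            nbrs = neighbors.get(d, [])
            scores = {}
            for c, p_text in local.items():
                # Neighbor support: mean current probability of class c.
                support = (
                    sum(probs[n][c] for n in nbrs) / len(nbrs) if nbrs else 1.0
                )
                scores[c] = p_text * support
            total = sum(scores.values())
            updated[d] = (
                {c: s / total for c, s in scores.items()} if total else dict(local)
            )
        delta = max(
            abs(updated[d][c] - probs[d][c]) for d in updated for c in updated[d]
        )
        probs = updated
        if delta < tol:  # stop once the estimates no longer change
            break
    return {d: max(p, key=p.get) for d, p in probs.items()}


# Hypothetical data: "b" is ambiguous by text alone but links to "a".
text_probs = {
    "a": {"cheese": 0.9, "ski": 0.1},
    "b": {"cheese": 0.5, "ski": 0.5},
    "c": {"cheese": 0.2, "ski": 0.8},
}
neighbors = {"a": ["b"], "b": ["a"], "c": []}
labels = relaxation_labeling(text_probs, neighbors)
```

In this toy run the ambiguous document takes on the class of its confidently labeled neighbor, which is the effect the technique relies on.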

12
Probabilistic Relational Model [4]
  • Web pages and links are modeled as entities and
    relationships respectively, and each of them is
    represented as a class.
  • A Bayesian network is created from the attributes
    of the entity-relationship model in order to model
    uncertainty and make inferences.

13
Probabilistic Relational Model
  • Belief propagation, an approximate inference
    approach, lets us use prior knowledge to infer
    the unobserved variables.
  • Given new data with some unobserved variables,
    first assign the most likely values to them.
  • Based on the estimates of those marginal
    probabilities, predict the classification.

14
Probabilistic Relational Model
  • This approach proved effective when applied to
    the hypertext classification problem: by using
    information from both the content and the link
    structure, it provides more accurate
    classification and the ability to do
    probabilistic reasoning.

15
Integrating the DOM With Hyperlinks for Enhanced
Topic Distillation [6]
  • A uniform fine-grained model.
  • Web pages are represented by their tag trees
    (also called their Document Object Models, or
    DOMs).
  • DOM trees are interconnected by ordinary
    hyperlinks.
  • Disaggregates mixed hubs.

16
A new fine-grained model [7]
<html><body>
<table>
<tr><td>
<table>
<tr><td><a href="http://art.qaz.com">art</a></td></tr>
<tr><td><a href="http://ski.qaz.com">ski</a></td></tr>
</table>
</td></tr>
<tr><td>
<ul>
<li><a href="http://www.fromages.com">Fromages.com</a> French cheese</li>
<li><a href="http://www.teddingtoncheese.co.uk">Teddington</a> Buy online</li>
</ul>
</td></tr>
</table>
</body></html>
17
Integrating the DOM With Hyperlinks for Enhanced
Topic Distillation
  • Figure 6: The fine-grained model of Web linkage,
    which unifies hyperlinks and DOM structure.
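
To make the idea of unifying DOM structure and hyperlinks concrete, the sketch below records the tag path leading to each link in a small page and groups links by that path. It uses Python's standard `html.parser`; the class name `LinkPathParser`, the grouping by path, and the sample markup are illustrative assumptions and not the model of [6] or [7].

```python
from collections import defaultdict
from html.parser import HTMLParser


class LinkPathParser(HTMLParser):
    """Collect (DOM path, href) for every <a> element."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, root first
        self.links = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(("/".join(self.stack), href))

    def handle_endtag(self, tag):
        # Pop up to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


html_doc = """<html><body><table>
<tr><td><table>
<tr><td><a href="http://art.qaz.com">art</a></td></tr>
<tr><td><a href="http://ski.qaz.com">ski</a></td></tr>
</table></td></tr>
<tr><td><ul>
<li><a href="http://www.fromages.com">Fromages.com</a> French cheese</li>
<li><a href="http://www.teddingtoncheese.co.uk">Teddington</a> Buy online</li>
</ul></td></tr>
</table></body></html>"""

parser = LinkPathParser()
parser.feed(html_doc)

# Links that share a DOM path sit in the same kind of subtree.
groups = defaultdict(list)
for path, href in parser.links:
    groups[path].append(href)
```

Here the navigation links and the cheese links end up in different groups, which hints at how a mixed hub could be split into subtrees that are scored separately.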

18
Integrating the DOM With Hyperlinks for Enhanced
Topic Distillation
  • Benefits
  • Reduces topic drift.
  • Identifies and extracts the regions (DOM subtrees)
    relevant to the query out of:
  • A broader hub.
  • A hub with additional less-relevant content and
    links.

19
Web Page Classification Based on Document
Structure
  • Web pages that belong to a particular category
    have some similarity in their structure.
  • Information Pages.
  • Research Pages.
  • Personal Home Pages.

The general structural information of any page
can be deduced from the placement of links, text
and images including equations and graphs.
20
Web Page Categories Based on Structural
Similarities
  • Information Pages
  • A logo at the top, followed by a navigation bar
    linking the page to other important pages.
  • The ratio of link text (amount of text with
    links) to normal text tends to be relatively
    high.
  • Research Pages
  • Contain large amounts of text, as well as
    equations and graphs in the form of images.
  • The number of distinct gray levels/color shades
    in the images also provides a cue.

21
Web Page Categories Based on Structural
Similarities
  • Personal Pages.
  • The name and address of the person appear
    prominently at the top of the page.
  • A photograph of the person concerned.
  • Towards the bottom of the page, the person
    provides links to their publications, if any,
    and other useful references or links to their
    favorite destinations on the web.

22
Feature Extraction
  • Textual Information.
  • The number and placement of links in a page
    provide valuable information about the broad
    category the page belongs to.
  • The ratio of number of characters in links to the
    total number of characters in the page.
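
The link-character ratio can be computed with a short sketch using Python's standard `html.parser`. The choices to ignore whitespace and to count only visible text (not markup) are assumptions, as are the names `LinkTextCounter` and `link_text_ratio`.

```python
from html.parser import HTMLParser


class LinkTextCounter(HTMLParser):
    """Count text characters inside <a> elements and in the whole page."""

    def __init__(self):
        super().__init__()
        self.in_link = 0  # nesting depth of open <a> tags
        self.link_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1

    def handle_data(self, data):
        n = len("".join(data.split()))  # ignore whitespace (assumption)
        self.total_chars += n
        if self.in_link:
            self.link_chars += n


def link_text_ratio(html):
    counter = LinkTextCounter()
    counter.feed(html)
    counter.close()
    if not counter.total_chars:
        return 0.0
    return counter.link_chars / counter.total_chars


# Hypothetical page fragment: 4 link characters out of 12 in total.
ratio = link_text_ratio('<p><a href="x">abcd</a> efgh ijkl</p>')
```

A high ratio would point towards an information page and a low one towards a research page, per the structural cues described earlier.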

23
Feature Extraction
  • Image Information
  • Information pages have more colors than personal
    homepages, which in turn have more colors than
    research pages
  • The histogram of synthetic images generally tends
    to concentrate at a few bands of color shades. In
    contrast, the histogram of natural images is
    spread over a larger area
  • Information pages usually contain many natural
    images, while research pages contain a number of
    synthetic images
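
The two image cues (number of colors and how concentrated the histogram is) can be sketched without any imaging library by working on a flat list of pixel values. How the slides' authors measured concentration is not stated; the share of pixels in the `top_k` most frequent values is an assumption, as are the function name and the sample data.

```python
from collections import Counter


def color_features(pixels, top_k=4):
    """Return (number of distinct values, share of pixels that fall in
    the top_k most frequent values) for a flat list of pixel values."""
    if not pixels:
        return 0, 0.0
    hist = Counter(pixels)
    top_mass = sum(count for _, count in hist.most_common(top_k)) / len(pixels)
    return len(hist), top_mass


# Hypothetical gray-level data: a two-tone diagram versus an image
# whose values are spread evenly over 100 levels.
synthetic = [0] * 90 + [255] * 10
natural = list(range(100))

synthetic_features = color_features(synthetic)
natural_features = color_features(natural)
```

A histogram concentrated in a few values suggests a synthetic image, while a low concentration with many distinct values suggests a natural one.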

24
Feature Extraction
  • Other Information
  • Classification based on video and other
    multimedia content is not presently implemented.

25
Results
26
Web Page Categories Based on Structural
Similarities
  • Conclusions and Future work for the approach
  • This approach augmented with traditional text
    based approaches could be used for effective
    categorization of web pages.
  • Improvement in feature selection.
  • Automate the training process.
  • Needs to be tested on more data sets.

27
References
  • [1] K. Bharat and M. Henzinger. Improved
    Algorithms for Topic Distillation in a Hyperlinked
    Environment. In 21st International ACM SIGIR
    Conference on Research and Development in
    Information Retrieval.
  • [2] S. Chakrabarti, B. Dom, D. Gibson, J.
    Kleinberg, P. Raghavan, and S. Rajagopalan.
    Automatic Resource Compilation by Analyzing
    Hyperlink Structure and Associated Text.
    Proceedings of the 7th World-Wide Web Conference,
    1998.
  • [3] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced
    Hypertext Categorization Using Hyperlinks.
    Proceedings of ACM SIGMOD 1998.

28
References
  • [4] L. Getoor, E. Segal, B. Taskar, and D. Koller.
    Probabilistic Models of Text and Link Structure
    for Hypertext Classification. IJCAI Workshop on
    "Text Learning: Beyond Supervision", Seattle, WA,
    August 2001.
  • [5] Arul Prakash Asirvatham, Kranthi Kumar Ravi,
    and C. V. Jawahar. Web Page Classification Based
    on Document Structure.
  • [6] Soumen Chakrabarti. Integrating the Document
    Object Model with Hyperlinks for Enhanced Topic
    Distillation and Information Extraction. 10th
    International World Wide Web Conference, Hong
    Kong, May 2001.
  • [7] Soumen Chakrabarti, Mukul M. Joshi, and Vivek
    B. Tawde. Enhanced Topic Distillation Using Text,
    Markup Tags, and Hyperlinks. SIGIR 2001, New
    Orleans, LA, Sep 2001.