Automatic Ontology Discovery from News Web Pages


1
Automatic Ontology Discovery from News Web Pages
  • Srinivas Vadrevu
  • CSE 591/CSE 575 Data Mining
  • Instructor: Dr. Huan Liu
  • Date: 04/24/2003
  • Link: http://www.public.asu.edu/svadrevu/DM/DMTalk.ppt

2
Problem Statement
  • Automatically discover the ontology underlying
    Web news documents
  • Motivation:
    • HearSay Project (Intelligent Voice-Enabled
      Browser)
    • Manual compilation of the ontology is tedious
    • The Web is ever-changing
    • Need for automated ontology construction
    • Category-based Information Retrieval

3
Tasks Involved
  • Structural and Semantic Analysis
    • Semantic Partitioning of Web Documents
    • HTML = Few Informational Nodes + Lots of Junk
      Tags (tr, td, p, b, i, br, etc.)
    • Discover the hierarchy among the informational
      nodes in the web page
  • Ontology Engineering (My Part)
    • Classify templates from instances
    • Identify equivalencies among the templates
    • Discover the hierarchy among the templates
    • Discover attributes and concepts from templates
      and the conceptual relationships among them
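
The "few informational nodes, lots of junk tags" observation can be illustrated with a small sketch. This is not the semantic partitioning used in the project, only a naive stand-in: it assumes the junk set is exactly the tags named on the slide and that the input HTML is well formed. The names JUNK_TAGS, InformationalTree and informational_tree are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: the junk set is just the tags listed on the slide.
JUNK_TAGS = {"tr", "td", "p", "b", "i", "br"}


class InformationalTree(HTMLParser):
    """Builds a nested [tag, children] tree while skipping junk tags."""

    def __init__(self):
        super().__init__()
        self.root = ["root", []]
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag in JUNK_TAGS:
            return  # junk tags contribute no node of their own
        node = [tag, []]
        self.stack[-1][1].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in JUNK_TAGS:
            return
        # Pop back to the nearest matching open tag, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Text is attached to the nearest non-junk ancestor.
            self.stack[-1][1].append(text)


def informational_tree(html):
    parser = InformationalTree()
    parser.feed(html)
    parser.close()
    return parser.root


tree = informational_tree(
    "<table><tr><td><b>World</b></td><td><i>Sports</i></td></tr></table>"
)
print(tree)
```

With the junk tags dropped, the two labels hang directly under the table node, which is the kind of compact tree the later ontology steps would consume.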

4
Proposed Approach
  • Input: Semantically Partitioned News Web Pages
    (XML Trees)
  • Output: Ontology contained within the web pages
  • Algorithm:
    • Classify frequent nodes as template nodes
      (frequent nodes are those that satisfy the
      minimum support)
    • Group the frequent nodes according to their
      similarity (using Jaccard's Coefficient,
      |A ∩ B| / |A ∪ B|)
    • Find all parent-child relationships among these
      groups
    • For every pair of non-frequent paths a-b and
      b-c, increment the count for a-c and recompute
      the frequent paths
    • Construct a tree from these frequent
      parent-child paths
    • Follow the links to the frequent nodes in the
      resulting ontology and repeat the procedure
      until there are no frequent nodes left
  • P.S. Other approaches considered include
    Self-Organizing Maps, Bayesian Networks, DTD
    Mining, and Context-Free Grammars
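
The first two algorithm steps (frequent-node classification and Jaccard grouping) can be sketched as follows. The slides do not say which sets A and B are compared, so this sketch assumes each node is represented by the set of lowercase words in its label. The support is taken as an absolute page count, and the grouping is a simple greedy pass. All function names, thresholds and sample labels are hypothetical.

```python
from collections import Counter


def frequent_nodes(pages, min_count):
    """Return labels that occur on at least min_count pages."""
    counts = Counter()
    for labels in pages:
        counts.update(set(labels))  # count each label once per page
    return {label for label, n in counts.items() if n >= min_count}


def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_by_similarity(features, min_similarity):
    """Greedy grouping: each node joins the first group whose
    first member is similar enough, else it starts a new group."""
    groups = []
    for node in sorted(features):
        for group in groups:
            if jaccard(features[node], features[group[0]]) >= min_similarity:
                group.append(node)
                break
        else:
            groups.append([node])
    return groups


# Hypothetical node labels seen on four page snapshots.
pages = [
    ["World", "Sports", "Win a car"],
    ["World News", "Sports"],
    ["World", "Sports"],
    ["World News", "Business"],
]
templates = frequent_nodes(pages, min_count=2)
# Assumption: a node's feature set is the set of words in its label.
features = {label: set(label.lower().split()) for label in templates}
groups = group_by_similarity(features, min_similarity=0.5)
print(groups)
```

Labels seen on only one page are treated as instances and dropped, while "World" and "World News" end up in one group because their word sets overlap by half.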

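The path steps of the algorithm (counting parent-child paths, bridging non-frequent paths, building the tree) can be sketched as below. The slides leave the increment rule open, so this sketch assumes the count for a-c grows by one for each pair of non-frequent paths a-b and b-c. The result is a parent-to-children map and nothing checks that it is a proper tree. All names and sample data are hypothetical.

```python
from collections import Counter, defaultdict


def count_paths(page_edges):
    """Count on how many pages each (parent, child) path occurs."""
    counts = Counter()
    for edges in page_edges:
        counts.update(set(edges))
    return counts


def bridge_infrequent(counts, min_count):
    """For non-frequent a-b and b-c, add one to the count for a-c."""
    infrequent = [path for path, n in counts.items() if n < min_count]
    bridged = Counter(counts)
    for a, b in infrequent:
        for b2, c in infrequent:
            if b == b2 and a != c:
                bridged[(a, c)] += 1
    return bridged


def build_tree(counts, min_count):
    """Map each parent to its children, using frequent paths only."""
    tree = defaultdict(list)
    for (parent, child), n in sorted(counts.items()):
        if n >= min_count:
            tree[parent].append(child)
    return dict(tree)


# Hypothetical parent-child paths from four page snapshots.
page_edges = [
    [("News", "Top Stories"), ("Top Stories", "World")],
    [("News", "Headlines"), ("Headlines", "World")],
    [("News", "Sports")],
    [("News", "Sports")],
]
counts = bridge_infrequent(count_paths(page_edges), min_count=2)
ontology = build_tree(counts, min_count=2)
print(ontology)
```

Here "Top Stories" and "Headlines" each appear once and are skipped over, so "World" is attached directly to "News" once the two bridged paths are counted.
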
5
Initial Results (with only the front pages)
  • Ontology built automatically from 3 snapshots of
    18 news web pages, including cnn, nytimes, bbc,
    etc.

6
More Results (going one level deep)
  • After following links to all frequent nodes under
    News

7
Future Work
  • Mapping and Merging
  • Classification with instances
  • Refine the Semantic Partitioning and thus the
    ontology
    • Go back to HTML and use the ontology obtained
      as the model in semantic partitioning
    • Use the resulting new trees to refine the
      ontology (Expectation-Maximization)
  • Use of Context-Free Grammars
    • Build an approximate ontology by hand. Represent
      each tree as a CFG. The problem then reduces to
      finding the minimal grammar that contains all
      the semantically partitioned grammars and the
      ideal grammar
  • Use of Clustering (Self-Organizing Maps)
    • Use SOMs to group the frequent nodes according
      to their similarity (with Jaccard's Coefficient
      as the distance measure)
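
The idea of representing each tree as a CFG can be illustrated with a short sketch. It assumes a tree is a (label, children) pair and turns every internal node into one production. Taking the union of productions over all trees gives a grammar that covers every input tree. That union is only an upper bound, not the minimal grammar the slide refers to. All names and sample trees are hypothetical.

```python
def tree_to_productions(tree):
    """Turn a (label, children) tree into a set of CFG productions,
    one per internal node: label -> sequence of child labels."""
    label, children = tree
    productions = set()
    if children:
        productions.add((label, tuple(child[0] for child in children)))
        for child in children:
            productions |= tree_to_productions(child)
    return productions


def union_grammar(trees):
    """Grammar containing the productions of every input tree."""
    grammar = set()
    for tree in trees:
        grammar |= tree_to_productions(tree)
    return grammar


# Hypothetical partitioned trees from two news front pages.
page_a = ("News", [("World", []), ("Sports", [])])
page_b = ("News", [("World", []), ("Business", [])])
grammar = union_grammar([page_a, page_b])
for head, body in sorted(grammar):
    print(head, "->", " ".join(body))
```

Finding a smaller grammar would mean merging or generalizing productions such as the two alternatives for "News" here, which this sketch does not attempt.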

8
Conclusions
  • Classifying template nodes from instances is not
    difficult, given enough instances of the domain
    • Look for frequent nodes
  • A taxonomy (and thus the ontology) can be built
    for a domain without any prior knowledge (an
    already developed ontology) of the domain
    • Look at instances of the domain instead
  • Particularly useful for the Web, where the
    ontologies for many domains are unknown