Automatic Ontology Discovery from News Web Pages

1
Automatic Ontology Discovery from News Web Pages

Srinivas Vadrevu
CSE 591/CSE 575 Data Mining
Instructor: Dr. Huan Liu
Date: 04/24/2003
Link: http://www.public.asu.edu/svadrevu/DM/DMTalk.ppt

2
Problem Statement

Automatically discover the ontology underlying Web news documents
Motivation:
HearSay Project (Intelligent Voice-Enabled Browser)
Manual compilation of the ontology is tedious
Ever-changing Web
Need for automated ontology construction
Category-based information retrieval

3
Tasks Involved

Structural and Semantic Analysis:
Semantic partitioning of Web documents
HTML = few informational nodes + lots of junk tags (tr, td, p, b, i, br, etc.)
Discover the hierarchy among the informational nodes in the web page
Ontology Engineering (my part):
Classify templates from instances
Identify equivalencies among the templates
Discover the hierarchy among the templates
Discover attributes and concepts from templates and the conceptual relationships among them
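
The "few informational nodes, lots of junk tags" observation can be illustrated with a minimal sketch. It is not the semantic partitioning used in the project, only a rough stand-in built on Python's standard `html.parser`. The junk-tag set is taken from the slide; the `VOID_TAGS` set, the class and function names, and the sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Junk tags as listed on the slide; a real page would need a longer list.
JUNK_TAGS = {"tr", "td", "p", "b", "i", "br"}
# Assumption: tags that never have a closing tag and so must not be stacked.
VOID_TAGS = {"img", "hr", "input", "meta", "link"}


class InformationalNodes(HTMLParser):
    """Collect text nodes together with their path of non-junk ancestor tags."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open non-junk tags
        self.nodes = []   # (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in JUNK_TAGS and tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching tag (tolerates unclosed tags).
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.nodes.append(("/".join(self.stack), text))


def informational_nodes(html):
    parser = InformationalNodes()
    parser.feed(html)
    parser.close()
    return parser.nodes


# Hypothetical markup, not taken from any real news page.
sample = ("<html><body><table><tr><td><b>World</b><br></td>"
          "<td><a href='/s'>Story</a></td></tr></table></body></html>")
nodes = informational_nodes(sample)
print(nodes)
```

Dropping the junk tags from each path leaves a much shallower tree, which is the kind of input the ontology engineering step is described as working on.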

4
Proposed Approach

Input: Semantically partitioned news Web pages (XML trees)
Output: Ontology contained within the Web pages
Algorithm:
Classify frequent nodes as template nodes (frequent nodes are those that satisfy the minimum support)
Group the frequent nodes according to their similarity (using Jaccard's coefficient, $|A \cap B| / |A \cup B|$)
Find all (parent-child) relationships among these groups
For all non-frequent paths a-b and b-c, increment the count for a-c and populate the frequent paths again
Construct a tree from these frequent (parent-child) paths
Follow the links to the frequent nodes in the ontology obtained and repeat the procedure until there are no frequent nodes left
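
The support counting, path promotion, and tree construction steps above can be sketched as follows. This is one possible reading of the slide, not the author's implementation. It assumes that each page is a nested `(label, children)` tuple, that support is the fraction of pages containing a node or path, and that a promoted path a-c is counted at most once per page. The Jaccard grouping and link-following steps are left out. All function names, the 0.6 threshold, and the sample pages are hypothetical.

```python
from collections import Counter


def page_labels(tree):
    """Return the set of node labels occurring in one page tree."""
    label, children = tree
    found = {label}
    for child in children:
        found |= page_labels(child)
    return found


def page_edges(tree, parent=None, edges=None):
    """Return the set of (parent, child) label pairs in one page tree."""
    if edges is None:
        edges = set()
    label, children = tree
    if parent is not None:
        edges.add((parent, label))
    for child in children:
        page_edges(child, label, edges)
    return edges


def frequent_nodes(pages, min_support):
    """Labels that occur in at least min_support of the pages."""
    counts = Counter()
    for page in pages:
        counts.update(page_labels(page))
    return {l for l, n in counts.items() if n >= min_support * len(pages)}


def frequent_edges(pages, min_support):
    """Frequent parent-child paths, after promoting a-b, b-c to a-c."""
    per_page = [page_edges(p) for p in pages]
    counts = Counter(e for edges in per_page for e in edges)
    threshold = min_support * len(pages)
    frequent = {e for e, n in counts.items() if n >= threshold}
    for edges in per_page:
        rare = edges - frequent
        promoted = {(a, c) for (a, b) in rare for (b2, c) in rare
                    if b == b2 and a != c}
        for edge in promoted - edges:   # do not double count existing edges
            counts[edge] += 1
    # Populate the frequent paths again with the updated counts.
    return {e for e, n in counts.items() if n >= threshold}


def build_taxonomy(edges):
    """Turn frequent (parent, child) paths into a parent -> children map."""
    children = {}
    for parent, child in sorted(edges):
        children.setdefault(parent, []).append(child)
    return children


# Hypothetical pages: the wrapper around "Weather" differs on every page.
pages = [
    ("News", [("World", []), ("Sports", []), ("sidebar-17", [("Weather", [])])]),
    ("News", [("World", []), ("Sports", []), ("promo-3", [("Weather", [])])]),
    ("News", [("World", []), ("box-9", [("Weather", [])])]),
]
print(frequent_nodes(pages, 0.6))
print(build_taxonomy(frequent_edges(pages, 0.6)))
```

In this toy input the wrapper nodes are each seen on a single page, so the paths through them are non-frequent. Promotion links News directly to Weather, which then passes the support threshold.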

PS: Other approaches considered include Self-Organizing Maps, Bayesian Networks, DTD Mining, and Context-Free Grammars

5
Initial Results (with only the front pages)

Ontology built automatically from 3 snapshots of 18 news web pages, including CNN, NYTimes, BBC, etc.

6
More Results (going one level deep)

After following links to all frequent nodes under News

7
Future Work

Mapping and merging
Classification with instances
Refine the semantic partitioning and thus the ontology:
Go back to HTML and use the ontology obtained as the model in semantic partitioning
Use the resulting new trees to refine the ontology (Expectation-Maximization)
Use of Context-Free Grammars:
Build an approximate ontology by hand. Represent each tree as a CFG. The problem then reduces to finding the minimal grammar that contains all the semantically partitioned grammars and the ideal grammar
Use of Clustering (Self-Organizing Maps):
Use SOMs to group the frequent nodes according to their similarity (with Jaccard's coefficient as the distance measure)
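
The similarity-based grouping can be sketched with Jaccard's coefficient turned into a distance (one minus the coefficient). The sketch below does not implement a Self-Organizing Map. It substitutes a much simpler greedy threshold grouping, purely to show how the measure would be used. It assumes each frequent node is described by a set of features such as the labels seen beneath it; the slides do not specify this. The function names, threshold, and sample data are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| (taken as 1.0 for two empty sets)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def jaccard_distance(a, b):
    """Distance form of the coefficient: 0.0 for identical sets, 1.0 for disjoint."""
    return 1.0 - jaccard(a, b)


def group_by_similarity(features, max_distance):
    """Greedy grouping: each node joins the first group whose seed is close enough.

    This is a simple stand-in for clustering, not a Self-Organizing Map.
    """
    groups = []  # list of (seed feature set, member names)
    for name, feats in features.items():
        for seed, members in groups:
            if jaccard_distance(seed, feats) <= max_distance:
                members.append(name)
                break
        else:
            groups.append((set(feats), [name]))
    return [members for _, members in groups]


# Hypothetical frequent nodes and the labels observed beneath them.
features = {
    "World": {"europe", "asia", "africa"},
    "International": {"europe", "asia", "americas"},
    "Sports": {"football", "tennis"},
}
print(group_by_similarity(features, max_distance=0.5))
```

The result depends on the order in which nodes are visited and on the chosen threshold, which is one reason a proper clustering method may be preferable.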

8
Conclusions

Classifying template nodes from instances is not a difficult task, given enough instances of the domain:
Look for frequent nodes
A taxonomy (and thus the ontology) can be built for a domain without any prior knowledge (an already developed ontology) about the domain:
Look at instances of the domain instead
Particularly useful on the Web, where the ontologies for many domains are unknown