Enhanced hypertext categorization using hyperlinks

1
Enhanced hypertext categorization using hyperlinks

Soumen Chakrabarti (IBM Almaden)
Byron Dom (IBM Almaden)
Piotr Indyk (Stanford)

2
Hypertext categorization

Automatic topic identification
Also called supervised learning
Given
Hypertext document corpus
A small set of classified documents
Goal
Construct a classifier
Apply to new documents

3
Example from the web
4
Applications and benefits

Retrieval
Browsing (Yahoo!)
Searching (socks and NOT apparel)
Adopted by most search companies
Profile based filtering and routing
Email, news, push services
Collaborative filtering
Automatically categorize click trails
Cluster users based on frequently visited topics

5
Click-trail and bookmark organizer
[Screenshot: integrated browser showing a view of the topic hierarchy alongside the web page]
6
The limitation of text-only classifiers

Text-only classifiers are well-researched
Rule induction
Bayesian learning
87% accurate on news
Lower accuracy on hyperlinked corpora
Heterogeneous
Information in links not utilized

7
Our contributions

A novel approach to hypertext classification
Combine text and link information
Framework for link modeling in hypertext graphs
Markov random field (limited sphere of influence)
Techniques for feature extraction
Use of domain knowledge to limit complexity
Techniques to handle incomplete information
Iterative labeling algorithm

8
Is this a new problem?

Reduction to text classification
Include (tagged) text from neighbors
Classify the result
Does not increase accuracy
Big neighbor pages
Lack of semantic correlation
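
The reduction described on this slide can be sketched as follows. This is a minimal illustration, not the presenters' code: the function name `augment_with_neighbor_text`, the `NBR:` tag prefix, and the toy pages are all assumptions. It shows how a large neighbor page can contribute more (tagged) terms than the page being classified has of its own.

```python
def augment_with_neighbor_text(doc_id, text, links, tag="NBR:"):
    """Return the page's own terms followed by its neighbors' terms, each tagged.

    text  : {doc_id: raw text}        (hypothetical toy corpus)
    links : {doc_id: [neighbor ids]}  (hypothetical link table)
    """
    own_terms = text[doc_id].split()
    neighbor_terms = [
        tag + term
        for nbr in links.get(doc_id, [])
        for term in text[nbr].split()
    ]
    return own_terms + neighbor_terms


text = {
    "a": "jazz guitar",
    "b": "browser download free software",  # a "big neighbor" on another topic
}
links = {"a": ["b"]}

features = augment_with_neighbor_text("a", text, links)
own = [t for t in features if not t.startswith("NBR:")]
borrowed = [t for t in features if t.startswith("NBR:")]
print(len(own), len(borrowed))  # 2 4
```

In this toy case the borrowed terms outnumber the page's own terms two to one, which is one way to picture the "big neighbor pages" problem named above.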

9
Big neighbor
10
More of big neighbor
11
Coherent pages linking to incoherent pages
12
Model specification

A hypertext graph
Nodes: documents
Edges: hyperlinks
Document: sequence or set of terms and links
Each document has a class label
Some labels are known
Most are unknown
Labels are drawn from some distribution
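
One way to hold the model above in memory is sketched below. The class and field names (`Document`, `out_links`, `label`) are illustrative assumptions; `label=None` stands for an unknown class.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Document:
    doc_id: str
    terms: List[str]                                     # sequence (or bag) of terms
    out_links: List[str] = field(default_factory=list)   # ids this page links to
    label: Optional[str] = None                          # None means the class is unknown


def in_links(graph: Dict[str, Document], doc_id: str) -> List[str]:
    """Ids of documents that link to doc_id."""
    return [d.doc_id for d in graph.values() if doc_id in d.out_links]


def unlabeled(graph: Dict[str, Document]) -> List[str]:
    """Ids of documents whose class is still unknown."""
    return [d.doc_id for d in graph.values() if d.label is None]


graph = {
    "p1": Document("p1", ["jazz", "guitar"], out_links=["p2"], label="Music"),
    "p2": Document("p2", ["piano", "concert"]),
}
print(in_links(graph, "p2"), unlabeled(graph))  # ['p1'] ['p2']
```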

13
Assumptions used in probability model

No indirect coupling between the text and the neighbors' classes
The probability of a node's class depends only on neighbors within a limited radius
Independence among the neighbor class probabilities
Can assume higher-order dependence (neighborhood radius greater than 1)

14
Probability estimation
Posterior probability of class given text and neighborhood
Prior class probability
Class-conditional neighbor class distribution (independence between neighbors)
Class-conditional term distribution
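
The four labels above annotate an equation that is not present in this transcript. A form consistent with those labels and with the independence assumptions on the previous slide follows from Bayes' rule. The notation is assumed here, not taken from the slides: $c_i$ is the class of document $i$, $\tau_i$ its text, and $N_i$ the set of its neighbors with classes $c_j$.

```latex
\Pr(c_i \mid \tau_i, N_i)
  = \frac{\Pr(c_i)\,\Pr(\tau_i \mid c_i)\,\prod_{j \in N_i} \Pr(c_j \mid c_i)}
         {\sum_{c'} \Pr(c')\,\Pr(\tau_i \mid c')\,\prod_{j \in N_i} \Pr(c_j \mid c')}
```

Here $\Pr(c_i)$ is the prior class probability, $\Pr(\tau_i \mid c_i)$ the class-conditional term distribution, and the product over $j$ the class-conditional neighbor class distribution under independence between neighbors. The denominator only normalizes, so it can be ignored when picking the most probable class.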
15
Bayesian classification algorithm

Learning phase (parameter estimation)
Distribution of a text within a class
Interclass linkage probabilities
Prior probability of a class
Classification phase
Compute class probabilities
Choose the class with the highest posterior probability
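
The two phases can be sketched with a small naive-Bayes-style classifier. This is an illustration under stated assumptions, not the authors' implementation: add-one smoothing, treating links as undirected when counting interclass linkage, and the names `learn`, `classify`, and `smoothed_log` are all choices made here.

```python
import math
from collections import Counter, defaultdict


def learn(labeled, edges):
    """labeled: {doc_id: (terms, cls)}; edges: (src, dst) pairs between labeled docs."""
    class_count = Counter(cls for _, cls in labeled.values())
    term_count = defaultdict(Counter)   # term_count[c][t]: occurrences of t in class c
    for terms, cls in labeled.values():
        term_count[cls].update(terms)
    link_count = defaultdict(Counter)   # link_count[c][k]: class-k neighbors of class-c docs
    for src, dst in edges:
        c_src, c_dst = labeled[src][1], labeled[dst][1]
        link_count[c_src][c_dst] += 1   # assumption: links counted in both directions
        link_count[c_dst][c_src] += 1
    return {
        "classes": sorted(class_count),
        "vocab": {t for terms, _ in labeled.values() for t in terms},
        "class_count": class_count,
        "term_count": term_count,
        "link_count": link_count,
    }


def smoothed_log(count, total, n_outcomes):
    """Log of an add-one smoothed relative frequency (smoothing is an assumption)."""
    return math.log((count + 1) / (total + n_outcomes))


def classify(model, terms, neighbor_classes):
    """Return the class with the highest (unnormalized) log posterior."""
    n_docs = sum(model["class_count"].values())

    def score(c):
        tc, lc = model["term_count"][c], model["link_count"][c]
        s = math.log(model["class_count"][c] / n_docs)                     # prior
        s += sum(smoothed_log(tc[t], sum(tc.values()), len(model["vocab"]))
                 for t in terms)                                           # text
        s += sum(smoothed_log(lc[k], sum(lc.values()), len(model["classes"]))
                 for k in neighbor_classes)                                # links
        return s

    return max(model["classes"], key=score)


labeled = {
    "d1": (["jazz", "guitar"], "Music"),
    "d2": (["jazz", "piano"], "Music"),
    "d3": (["oil", "canvas"], "Art"),
    "d4": (["canvas", "brush"], "Art"),
}
model = learn(labeled, edges=[("d1", "d2"), ("d3", "d4")])
print(classify(model, ["jazz"], []))              # Music
print(classify(model, ["jazz"], ["Art", "Art"]))  # Art
```

In the second call the neighbor evidence outweighs the single term, which is the kind of effect the link term is meant to capture.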

16
Partial neighborhood knowledge

Problem
Class of test page depends on neighbors' classes
Must know neighbors' classes to use interclass probabilities → circularity!
Solution
Iterative labeling
Initially classify neighboring nodes using text
Repeatedly reclassify until consistent
Text, link, or joint model
Will this stabilize?
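
A minimal sketch of the iterative labeling loop follows. The data layout (`text_score`, `link_logp`), the in-place update order, and the `max_rounds` cap are assumptions made for illustration; the sketch does not settle the stabilization question, it only stops when a full pass changes nothing or the cap is reached.

```python
import math


def iterative_labeling(nodes, neighbors, known, text_score, link_logp, max_rounds=20):
    """
    nodes      : list of node ids
    neighbors  : {id: [neighbor ids]}
    known      : {id: cls} labels that are given and never changed
    text_score : {id: {cls: log prior + log P(text | cls)}}
    link_logp  : {(neighbor_cls, cls): log P(neighbor_cls | cls)}
    Returns (labels, converged).
    """
    classes = sorted({c for scores in text_score.values() for c in scores})

    # Bootstrap: classify every unknown node from its text alone.
    labels = dict(known)
    for n in nodes:
        if n not in known:
            labels[n] = max(classes, key=lambda c: text_score[n][c])

    # Reclassify with text plus current neighbor labels until a pass changes nothing.
    for _ in range(max_rounds):
        changed = False
        for n in nodes:
            if n in known:
                continue
            best = max(
                classes,
                key=lambda c: text_score[n][c]
                + sum(link_logp[(labels[m], c)] for m in neighbors.get(n, [])),
            )
            if best != labels[n]:
                labels[n] = best   # in-place update: later nodes see the new label
                changed = True
        if not changed:
            return labels, True
    return labels, False


log = math.log
nodes = ["a", "b", "c"]
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
known = {"a": "Music"}
text_score = {
    "b": {"Music": log(0.45), "Art": log(0.55)},  # text alone slightly favors Art
    "c": {"Music": log(0.55), "Art": log(0.45)},
}
link_logp = {
    ("Music", "Music"): log(0.8), ("Art", "Art"): log(0.8),
    ("Music", "Art"): log(0.2), ("Art", "Music"): log(0.2),
}
labels, converged = iterative_labeling(nodes, neighbors, known, text_score, link_logp)
print(labels, converged)  # {'a': 'Music', 'b': 'Music', 'c': 'Music'} True
```

Node `b` starts as Art from text alone and flips to Music once its neighbors' labels are taken into account. Updating all nodes simultaneously instead of in place can oscillate on this same toy graph, which is one concrete face of the stabilization question.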

17
Data set 1: US patent database

Local text information
Title
Abstract
Citation links
Related patents cite each other
Complete knowledge of the neighbors' classes

18
Complete knowledge of neighborhood

Features used
Local text
Class tags from neighbor links
Large gain from tags
Gains sensitive to tag representation
/Arts
/Arts/Painting
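
The slide lists two tag forms, /Arts and /Arts/Painting, without saying how they were combined. One plausible reading, shown here purely as an assumption, is that a neighbor's class can be represented either by its full path or by every prefix of that path. The helper names and the `NBRCLASS:` prefix are hypothetical.

```python
def path_prefixes(class_path):
    """'/Arts/Painting' -> ['/Arts', '/Arts/Painting']"""
    parts = [p for p in class_path.split("/") if p]
    return ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]


def neighbor_tag_features(neighbor_classes, use_prefixes=True):
    """Turn neighbors' class paths into tagged features for the classifier."""
    features = []
    for path in neighbor_classes:
        tags = path_prefixes(path) if use_prefixes else [path]
        features.extend("NBRCLASS:" + t for t in tags)
    return features


print(neighbor_tag_features(["/Arts/Painting"]))
# ['NBRCLASS:/Arts', 'NBRCLASS:/Arts/Painting']
print(neighbor_tag_features(["/Arts/Painting"], use_prefixes=False))
# ['NBRCLASS:/Arts/Painting']
```

With prefixes, two neighbors in /Arts/Painting and /Arts/Music share the coarse tag /Arts; with full paths only, they share nothing.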

19
Partial knowledge of neighborhood

Algorithm
Grow radius-two neighborhood
Delete labels from a fraction of nodes
Do iterative labeling
Observations
Benefit from links
Text+Link most robust
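
The first two steps of the algorithm above (grow a radius-two neighborhood, then delete labels from a fraction of nodes) can be sketched as below. The function names, the use of breadth-first search, and the seeded random sampling are assumptions for illustration only.

```python
import random
from collections import deque


def neighborhood(adj, start, radius=2):
    """All nodes within `radius` link hops of `start` (breadth-first search)."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth[node] == radius:
            continue
        for nbr in adj.get(node, ()):
            if nbr not in depth:
                depth[nbr] = depth[node] + 1
                queue.append(nbr)
    return set(depth)


def hide_labels(labels, nodes, fraction, seed=0):
    """Return a copy of the labels with a random `fraction` of nodes set to None."""
    rng = random.Random(seed)
    ordered = sorted(nodes)
    hidden = set(rng.sample(ordered, round(fraction * len(ordered))))
    return {n: (None if n in hidden else labels[n]) for n in ordered}


# Toy undirected chain p0 - p1 - p2 - p3
adj = {"p0": ["p1"], "p1": ["p0", "p2"], "p2": ["p1", "p3"], "p3": ["p2"]}
labels = {"p0": "A", "p1": "A", "p2": "B", "p3": "B"}

nearby = neighborhood(adj, "p0", radius=2)
partial = hide_labels(labels, labels, fraction=0.5)
print(sorted(nearby))  # ['p0', 'p1', 'p2']
```

The partially labeled graph would then be handed to an iterative labeling routine, and accuracy measured on the nodes whose labels were hidden.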

20
Data set 2: Yahoo!

Few links point to classified documents
19% of docs have any classified out-link
28% have any classified in-link
40% have either one
→ Need to find a new source of information and extend the algorithm

21
Radius-2 information: co-citations

An IO-bridge connects to many pages of similar topics
OI tends to be noisy (many topics point to Netscape and Free Speech Online)
II and OO lead to topic divergence

[Figure: a bridge page joined by I-links and O-links to classified documents, unclassified documents, and the document to be classified; the radius-2 path types shown are IO, OI, and II/OO]
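
Reading an IO path as "follow an in-link back to a bridge page, then an out-link forward from that bridge", the co-cited pages of a document can be collected as sketched below. The function name `co_cited` and the toy link table are assumptions.

```python
def co_cited(out_links, doc):
    """Map each page co-cited with `doc` to the bridge pages that cite both.

    out_links : {page: [pages it links to]}
    """
    bridges = [b for b, outs in out_links.items() if doc in outs and b != doc]
    result = {}
    for bridge in bridges:
        for other in out_links[bridge]:
            if other != doc:
                result.setdefault(other, []).append(bridge)
    return result


out_links = {
    "bridge": ["x", "doc", "y"],  # a hub page citing doc together with x and y
    "other": ["z"],
    "doc": ["netscape"],          # doc's own out-link is not part of any IO path
}
print(co_cited(out_links, "doc"))  # {'x': ['bridge'], 'y': ['bridge']}
```

Pages reached this way that already have a class label are the extra evidence the slide refers to. The page `netscape` is deliberately left out, since it is reached by an out-link from `doc` rather than through a bridge.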
22
Link proximity

Are out-links that are close together more likely to point to related topics than out-links that are far apart?

23
Bridges are locally coherent

Link proximity → semantic proximity
Exploit this source of information
Huge attribute space
Simple classification
Check coherence
Voting
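
A simple way to exploit locality inside a bridge is to let only the classified links near the target link vote. The sketch below is an assumption-laden illustration: the window size, the function name `vote_from_bridge`, and the use of the winner's vote share as a coherence check are choices made here, not details given on the slide.

```python
from collections import Counter


def vote_from_bridge(bridge_links, labels, target, window=2):
    """
    bridge_links : out-links of the bridge page, in page order
    labels       : {url: cls} for the links whose class is known
    target       : the link to classify (must appear in bridge_links)
    Returns (winning class, share of votes), or None if no nearby link is classified.
    """
    pos = bridge_links.index(target)
    lo, hi = max(0, pos - window), pos + window + 1
    nearby = [u for u in bridge_links[lo:hi] if u != target and labels.get(u)]
    votes = Counter(labels[u] for u in nearby)
    if not votes:
        return None
    cls, count = votes.most_common(1)[0]
    return cls, count / sum(votes.values())


bridge_links = ["m1", "m2", "t", "m3", "s1", "s2"]
labels = {"m1": "Music", "m2": "Music", "m3": "Music", "s1": "Sports", "s2": "Sports"}
print(vote_from_bridge(bridge_links, labels, "t"))  # ('Music', 0.75)
```

A caller could accept the vote only when the share exceeds some threshold, treating a low share as a sign that the bridge mixes topics around the target link.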

24
Effect of exploiting bridges and locality
25
Conclusions

New model for citation among hyperlinked documents belonging to various topics
New categorization algorithm
Complexity controlled using domain knowledge about citations
Significant increase in accuracy

26
Future work

Better models for the joint distribution between terms and links
Semantic page segmentation to distill pure bridges from ones having a mixture of topics
Higher complexity
Potentially better results
More clever use of neighbors' text
Investigation of the relationship between spatial and semantic proximity

27
Related work
