Title: Enhanced hypertext categorization using hyperlinks
1Enhanced hypertext categorization using hyperlinks
- Soumen Chakrabarti (IBM Almaden)
- Byron Dom (IBM Almaden)
- Piotr Indyk (Stanford)
2Hypertext categorization
- Automatic topic identification
- Also called supervised learning
- Given
- Hypertext document corpus
- A small set of classified documents
- Goal
- Construct a classifier
- Apply to new documents
3Example from the web
4Applications and benefits
- Retrieval
- Browsing (Yahoo!)
- Searching (socks and NOT apparel)
- Adopted by most search companies
- Profile based filtering and routing
- Email, news, push services
- Collaborative filtering
- Automatically categorize click trails
- Cluster users based on frequently visited topics
5Click-trail and bookmark organizer
Integrated browser
View of topic Hierarchy
Web Page
6The limitation of text-only classifiers
- Text-only classifiers are well-researched
- Rule induction
- Bayesian learning
- 87% accurate on news
- Lower accuracy on hyperlinked corpora
- Heterogeneous
- Information in links not utilized
7Our contributions
- A novel approach to hypertext classification
- Combine text and link information
- Framework for link modeling in hypertext graphs
- Markov random field (limited sphere of influence)
- Techniques for feature extraction
- Use of domain knowledge to limit complexity
- Techniques to handle incomplete information
- Iterative labeling algorithm
8Is this a new problem?
- Reduction to text classification
- Include (tagged) text from neighbors
- Classify the result
- Does not increase accuracy
- Big neighbor pages
- Lack of semantic correlation
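The reduction described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `augment_with_neighbor_text`, the `OUT:`/`IN:` tag prefixes, and the toy corpus are all assumptions made for the example.

```python
def augment_with_neighbor_text(doc_id, text, out_links, in_links):
    """Return the document's own tokens plus tagged tokens from its neighbors."""
    tokens = text[doc_id].split()
    # Tag neighbor tokens so they stay distinguishable from local text.
    for nbr in out_links.get(doc_id, []):
        tokens += ["OUT:" + t for t in text[nbr].split()]
    for nbr in in_links.get(doc_id, []):
        tokens += ["IN:" + t for t in text[nbr].split()]
    return tokens

# Hypothetical two-page corpus: an arts page linking to a large, unrelated page.
text = {"a": "oil painting", "b": "free browser download"}
out_links = {"a": ["b"]}
in_links = {"b": ["a"]}
tokens = augment_with_neighbor_text("a", text, out_links, in_links)
```

In this toy case the neighbor contributes more tokens than the page itself, which is the "big neighbor" effect the slide names.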
9Big neighbor
10More of big neighbor
11Coherent pages linking to incoherent pages
12Model specification
- A hypertext graph
- Nodes: documents
- Edges: hyperlinks
- Document: sequence or set of terms and links
- Each document has a class label
- Some labels are known
- Most are unknown
- Labels are drawn from some distribution
13Assumptions used in probability model
- No indirect coupling between the text and the neighbors' classes
- The probability of a node's class depends only on neighbors within a limited radius
- Independence among the neighbor class probabilities
- Can assume higher-order dependence
- (neighborhood radius greater than 1)
14Probability estimation
Posterior probability of class given text and neighborhood
Prior class probability
Class-conditional neighbor class distribution (independence between neighbors)
Class-conditional term distribution
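The four labels above appear to annotate a single product formula. One way to write it, under the stated independence assumptions, is sketched below. The notation is an assumption made here, not taken from the slides: $c_i$ is the class of document $i$, $\tau_i$ its terms, and $N_i$ its neighbors. The per-term factorization is the usual naive Bayes simplification and may differ from the authors' exact term model.

```latex
\Pr(c_i \mid \tau_i, N_i) \;\propto\;
  \underbrace{\Pr(c_i)}_{\text{prior}}
  \;\underbrace{\prod_{j \in N_i} \Pr(c_j \mid c_i)}_{\text{neighbor classes}}
  \;\underbrace{\prod_{t \in \tau_i} \Pr(t \mid c_i)}_{\text{terms}}
```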
15Bayesian classification algorithm
- Learning phase (parameter estimation)
- Distribution of a text within a class
- Interclass linkage probabilities
- Prior probability of a class
- Classification phase
- Compute class probabilities
- Choose the class with highest posterior probability
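The two phases can be illustrated with a small sketch. It assumes a naive Bayes style model with add-one smoothing, which the slides do not specify. The function names and the toy training set are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """Learning phase. docs: list of (class_label, terms, neighbor_classes)."""
    prior = Counter()                   # class frequencies
    term_counts = defaultdict(Counter)  # term counts per class
    link_counts = defaultdict(Counter)  # neighbor-class counts per class
    vocab, classes = set(), set()
    for label, terms, nbr_classes in docs:
        prior[label] += 1
        term_counts[label].update(terms)
        link_counts[label].update(nbr_classes)
        vocab.update(terms)
        classes.add(label)
    return prior, term_counts, link_counts, vocab, classes

def log_posterior(model, label, terms, nbr_classes):
    """Unnormalized log posterior of `label` given text and neighbor classes."""
    prior, term_counts, link_counts, vocab, classes = model
    score = math.log(prior[label] / sum(prior.values()))
    t_total = sum(term_counts[label].values())
    for t in terms:  # add-one smoothing is an assumption of this sketch
        score += math.log((term_counts[label][t] + 1) / (t_total + len(vocab)))
    l_total = sum(link_counts[label].values())
    for c in nbr_classes:
        score += math.log((link_counts[label][c] + 1) / (l_total + len(classes)))
    return score

def classify(model, terms, nbr_classes):
    """Classification phase: pick the class with the highest posterior."""
    return max(sorted(model[4]),
               key=lambda c: log_posterior(model, c, terms, nbr_classes))

# Hypothetical training data: (class, terms, classes of linked documents).
docs = [
    ("arts", ["painting", "gallery"], ["arts"]),
    ("arts", ["sculpture", "gallery"], ["arts", "arts"]),
    ("sports", ["football", "league"], ["sports"]),
    ("sports", ["tennis", "league"], ["sports", "arts"]),
]
model = train(docs)
predicted = classify(model, ["gallery"], ["arts"])
```

Passing an empty term list to `classify` gives a link-only decision, and passing an empty neighbor list gives a text-only one.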
16Partial neighborhood knowledge
- Problem
- Class of test page depends on neighbors' classes
- Must know neighbors' classes to use interclass probabilities → circularity!
- Solution
- Iterative labeling
- Initially classify neighboring nodes using text
- Repeatedly reclassify until consistent
- Text, link, or joint model
- Will this stabilize?
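A minimal sketch of the iterative labeling loop follows. It assumes that text-only class probabilities (`text_prob`) and interclass link probabilities (`link_prob`) are already available and non-zero. Those names, the synchronous update order, and the `max_iters` cap are choices made for this illustration. The cap is there because the sketch does not guarantee that the labels stabilize.

```python
import math

def iterative_labeling(neighbors, known, text_prob, link_prob, max_iters=20):
    """Label unknown nodes from text, then reclassify using neighbor labels."""
    classes = sorted(link_prob)
    labels = dict(known)
    unknown = [n for n in neighbors if n not in known]
    # Step 1: initial guess for unknown nodes from text alone.
    for n in unknown:
        labels[n] = max(classes, key=lambda c: text_prob[n][c])
    # Step 2: reclassify with text plus current neighbor labels until stable.
    for _ in range(max_iters):
        new_labels = dict(labels)
        for n in unknown:
            def score(c):
                s = math.log(text_prob[n][c])
                for m in neighbors[n]:
                    s += math.log(link_prob[c][labels[m]])
                return s
            new_labels[n] = max(classes, key=score)
        if new_labels == labels:
            break  # consistent labeling reached
        labels = new_labels
    return labels

# Hypothetical example: page "p" looks like sports by text, but both of its
# neighbors are known arts pages.
neighbors = {"p": ["q", "r"], "q": ["p"], "r": ["p"]}
known = {"q": "arts", "r": "arts"}
text_prob = {"p": {"arts": 0.4, "sports": 0.6}}
link_prob = {"arts": {"arts": 0.9, "sports": 0.1},
             "sports": {"arts": 0.2, "sports": 0.8}}
result = iterative_labeling(neighbors, known, text_prob, link_prob)
```

Here the text-only guess for "p" is overturned once the neighbor labels are taken into account.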
17Data set 1: US patent database
- Local text information
- Title
- Abstract
- Citation links
- Related patents cite each other
- Complete knowledge of the neighbors' classes
18Complete knowledge of neighborhood
- Features used
- Local text
- Class tags from neighbor links
- Large gain from tags
- Gains sensitive to tag representation
- /Arts
- /Arts/Painting
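One plausible reading of the tag representation issue is that a neighbor's class can be encoded at several levels of the topic hierarchy. The sketch below expands a class path into all of its prefixes. The helper name `path_prefixes` is hypothetical, and the slides do not say which encoding worked best.

```python
def path_prefixes(class_path):
    """Expand a topic path into one tag per level of the hierarchy."""
    parts = [p for p in class_path.split("/") if p]
    return ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]

tags = path_prefixes("/Arts/Painting")
```

Using every prefix lets a classifier pick up evidence at the coarse level (/Arts) even when the fine-grained tag (/Arts/Painting) is rare.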
19Partial knowledge of neighborhood
- Algorithm
- Grow radius-two neighborhood
- Delete labels from a fraction of nodes
- Do iterative labeling
- Observations
- Benefit from links
- Text+Link most robust
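The experimental setup could be sketched along these lines. The sketch assumes links are treated as undirected when growing the neighborhood and that labels are hidden uniformly at random. Both are assumptions, and the function names are hypothetical.

```python
import random
from collections import deque

def radius_neighborhood(neighbors, start, radius=2):
    """Breadth-first search: all nodes within `radius` links of `start`."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == radius:
            continue  # do not expand beyond the radius
        for nbr in neighbors.get(node, []):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return set(dist)

def hide_labels(labels, nodes, fraction, seed=0):
    """Delete the labels of a random fraction of the labeled nodes."""
    rng = random.Random(seed)
    candidates = sorted(n for n in nodes if n in labels)
    hidden = set(rng.sample(candidates, int(fraction * len(candidates))))
    visible = {n: c for n, c in labels.items() if n not in hidden}
    return visible, hidden

# Hypothetical chain of four documents: a - b - c - d.
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
labels = {"a": "arts", "b": "arts", "c": "sports", "d": "sports"}
region = radius_neighborhood(neighbors, "a", radius=2)
visible, hidden = hide_labels(labels, region, fraction=0.5)
```

The `visible` labels would then play the role of the known labels for iterative labeling, and `hidden` holds the nodes on which accuracy can be measured.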
20Data set 2: Yahoo!
- Few links point to classified documents
- 19% of docs have any classified out-link
- 28% have any classified in-link
- 40% have either one
- → Need to find a new source of information and extend the algorithm
21Radius-2 information: co-citations
- An IO-bridge connects to many pages of similar topics
- OI tends to be noisy (many topics point to Netscape and Free Speech Online)
- II and OO lead to topic divergence
[Figure: diagram of the document to be classified, a bridge, and classified and unclassified documents, joined by I-links and O-links; path types IO, OI, II/OO]
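Reading "IO" as "follow an in-link back to a bridge, then an out-link forward", the co-cited pages of a document can be collected as in the sketch below. This interpretation of the path labels, the function name `io_cocited`, and the toy link table are assumptions.

```python
def io_cocited(doc, out_links):
    """Map each page co-cited with `doc` to the set of bridges citing both."""
    # Bridges are pages with an out-link to doc (reached via doc's in-links).
    bridges = [p for p, targets in out_links.items() if doc in targets]
    cocited = {}
    for bridge in bridges:
        for target in out_links[bridge]:
            if target != doc:
                cocited.setdefault(target, set()).add(bridge)
    return cocited

# Hypothetical link table: page -> pages it links to.
out_links = {"hub": ["d", "x", "y"], "other": ["d", "x"], "z": ["y"]}
cocited = io_cocited("d", out_links)
```

Pages reached through several bridges (here "x") are arguably stronger evidence than those reached through only one.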
22Link proximity
- Are out-links that are close together more likely to point to related topics than out-links that are far apart?
23Bridges are locally coherent
- Link proximity → semantic proximity
- Exploit this source of information
- Huge attribute space
- Simple classification
- Check coherence
- Voting
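The "check coherence, then vote" idea could look roughly like the following. It assumes a bridge is represented as the ordered list of classes of its out-links, with `None` for unclassified targets. The `window` and `min_share` parameters are illustrative and not taken from the slides.

```python
from collections import Counter

def vote_from_bridge(link_classes, index, window=2, min_share=0.5):
    """Guess the class of the link at `index` from nearby classified links."""
    lo = max(0, index - window)
    hi = min(len(link_classes), index + window + 1)
    nearby = [c for k, c in enumerate(link_classes[lo:hi], start=lo)
              if k != index and c is not None]
    if not nearby:
        return None  # no classified links close by
    label, count = Counter(nearby).most_common(1)[0]
    # Coherence check: only trust the vote if one class clearly dominates.
    return label if count / len(nearby) > min_share else None

# Hypothetical bridge: out-link classes in page order, None = unclassified.
bridge = ["arts", "arts", None, "arts", "sports", "sports"]
guess = vote_from_bridge(bridge, index=2)
```

A bridge whose nearby links split evenly between topics yields no vote, which is one simple way to discard incoherent regions.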
24Effect of exploiting bridges and locality
25Conclusions
- New model for citation among hyperlinked documents belonging to various topics
- New categorization algorithm
- Complexity controlled using domain knowledge about citations
- Significant increase in accuracy
26Future work
- Better models for joint distribution between terms and links
- Semantic page segmentation to distill pure bridges from ones having a mixture of topics
- Higher complexity
- Potentially better results
- More clever use of neighbors' text
- Investigation of the relationship between spatial and semantic proximity
27Related work