Web Page Categorization without the Web Page





1
Web Page Categorization without the Web Page
  • Author: Min-Yen Kan
  • Venue: WWW 2004

2
Basic Idea
  • Web Page Categorization ≈ Text Categorization
  • Some approaches retrieve the whole document
  • Retrieval yields URLs of additional documents
  • This could result in cyclic or non-terminating
    crawling
  • Instead, glean information from intuitive URLs
  • This avoids the retrieval bottleneck

3
An Example
  • http://cs.cornell.edu/Info/Courses/Current/CS415/CS415.html
  • Classify the above web page into one of the
    following categories:
    • Course
    • Faculty
    • Project
    • Student

4
Approach
  • Two-phase URL segmentation
  • First phase
    • Baseline
      • scheme://host/path-elements/document.extension
      • Segment further where possible, e.g. faculty-info → faculty info
    • Refined
      • Break the URL wherever a transition between uppercase, lowercase, and digits is observed
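
The two first-phase segmenters can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: the function names are hypothetical, splitting at every non-alphanumeric character is an assumed reading of the "faculty-info" example, and keeping a leading capital attached to its lowercase word (so "Info" stays whole) is a design choice the slides do not specify.

```python
import re
from urllib.parse import urlsplit


def baseline_segments(url):
    """Split a URL along its structure (scheme, host, path elements,
    document, extension), then at any non-alphanumeric character.
    Splitting at all punctuation is an assumption."""
    parts = urlsplit(url)
    tokens = []
    for piece in (parts.scheme, parts.netloc, parts.path):
        tokens.extend(t for t in re.split(r"[^A-Za-z0-9]+", piece) if t)
    return tokens


def refined_segments(url):
    """Further break each baseline token at transitions between
    uppercase letters, lowercase letters, and digits.
    Assumption: a capital followed by lowercase letters stays with them."""
    pattern = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")
    tokens = []
    for token in baseline_segments(url):
        tokens.extend(pattern.findall(token))
    return tokens


url = "http://cs.cornell.edu/Info/Courses/Current/CS415/CS415.html"
print(baseline_segments(url))
print(refined_segments(url))
```

On the example URL from the earlier slide, the refined pass separates "CS415" into "CS" and "415", which the baseline pass leaves as one token.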

5
Approach
  • Second phase
    • Information content reduction
      • Examines all possible partitions of a segment
      • Sums the information content (IC) of the parts in each partition
      • Picks the partition with the lowest total IC
    • Title-token-based finite-state transducer
      • What about acronyms?
      • A non-deterministic weighted finite-state transducer splits and expands segments based on previously seen web page titles
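
The information content reduction step can be illustrated with a brute-force search over partitions. This is a sketch under stated assumptions: IC is taken to be the negative log probability of a token, the token counts are a made-up toy table standing in for whatever reference corpus is used, add-one smoothing is an arbitrary choice for unseen tokens, and all function names are hypothetical.

```python
import math
from itertools import combinations


def partitions(segment):
    """Yield every way to cut `segment` into contiguous non-empty parts."""
    n = len(segment)
    for k in range(n):  # k = number of cut points
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [segment[i:j] for i, j in zip(bounds, bounds[1:])]


def information_content(token, counts, total):
    """IC = -log P(token), with add-one smoothing (an assumption) so that
    unseen tokens get a finite but high IC."""
    return -math.log((counts.get(token, 0) + 1) / (total + len(counts) + 1))


def best_partition(segment, counts):
    """Return the partition whose parts have the lowest summed IC."""
    total = sum(counts.values())
    return min(
        partitions(segment),
        key=lambda parts: sum(information_content(t, counts, total) for t in parts),
    )


# Hypothetical toy counts; real counts would come from a reference corpus.
toy_counts = {"cs": 50, "courses": 40}
print(best_partition("cscourses", toy_counts))
```

A segment of length n has 2^(n-1) partitions, so exhaustive enumeration like this is only practical for short URL segments.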

6
An Example
  FST rule                                                   Score  Output
  1. Match the initial letter in the subsequent token          2      l
  2. Match the initial letter in the non-subsequent token      1      l
  3. Match a subsequent letter in the current token            1      l
  4. Match the final letter in the current token               3      l
  5. Skip a character in the candidate expansion               0      ε

  • nytimes → New York Times
  • Rule applied to each letter of the candidate expansion:

    N  e  w  Y  o  r  k  T  i  m  e  s
    R1 R5 R5 R1 R5 R5 R5 R1 R3 R3 R3 R4

  • Score of 12; outputs nytimes
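
The score of 12 can be checked by summing the rule scores along the rule sequence shown above. The sketch below only scores one given alignment; it does not implement the transducer's search over candidate expansions. The function name is hypothetical, rule 5 is read as emitting nothing, and lowercasing the output is an assumption made so the result matches the URL token.

```python
# Scores taken from the rule table on this slide.
RULE_SCORES = {"R1": 2, "R2": 1, "R3": 1, "R4": 3, "R5": 0}


def score_alignment(expansion, rules):
    """Score one alignment of a candidate expansion: one rule per letter.
    Letters tagged R5 are skipped; all others are emitted (lowercased)."""
    letters = [c for c in expansion if not c.isspace()]
    if len(letters) != len(rules):
        raise ValueError("need exactly one rule per letter of the expansion")
    score = sum(RULE_SCORES[r] for r in rules)
    output = "".join(c.lower() for c, r in zip(letters, rules) if r != "R5")
    return score, output


rules = "R1 R5 R5 R1 R5 R5 R5 R1 R3 R3 R3 R4".split()
print(score_alignment("New York Times", rules))  # (12, 'nytimes')
```

The three initial letters contribute 2 each, the letters i, m, e contribute 1 each, and the final s contributes 3, for a total of 12.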
7
Experiments
  • Dataset: WebKB (4,167 pages)
  • Pages classified as student, faculty, course, or
    project
  • Classifier: SVM
  • Compared with FOIL-PILFS (based on inductive
    logic programming)
  • Features evaluated: (U)RL segmentations Ub, Ur, Ui, Uf
    (baseline, refined, IC reduction, FST),
    (A)nchor text, (T)itle text, and page te(X)t

8
Experiments
9
Conclusion
  • URLs contain tokens that are effective for classification
  • It is faster, since the page need not be fetched
  • Careful URL segmentation boosts classification
  • URL segmentation is more powerful than expansion
  • Can assist source-based classification to a
    limited extent
  • The FST cannot expand what it has not seen
  • Cryptic URLs are hard to tackle