Transcript and Presenter's Notes

Title: Hierarchical Classification of Documents with Error Control


1
Hierarchical Classification of Documents with
Error Control
  • Chun-Hung Cheng, Jian Tang, Ada Wai-chee Fu,
    Irwin King

2
Overview
  • Abstract
  • Problem Description
  • Document Classification Model
  • Error Control Schemes
  • Recovery oriented scheme
  • Error masking scheme
  • Experiments
  • Conclusion

3
Abstract
  • Traditional document classification (flat
    classification) involves only a single classifier
  • A single classifier takes care of everything
  • Slow, with high overhead

4
Abstract
  • Hierarchical document classification
  • Class hierarchy
  • Use one classifier at each internal node

5
Abstract
  • Advantage
  • Better performance
  • Disadvantage
  • A misclassification at any node yields a wrong
    final result

6
Abstract
  • Introduce error control mechanism
  • Approach 1 (recovery oriented)
  • Detect and correct misclassification
  • Approach 2 (error masking)
  • Mask errors by using multiple versions of
    classifiers

7
Problem Description
[Diagram: training system. Inputs: class taxonomy, training documents, and the class-doc relation (class, doc_id). Outputs: statistics and feature terms.]
8
Problem Description
[Diagram: classification system. Inputs: incoming documents, plus the statistics and feature terms produced by training. Output: the target class.]
9
Problem Description
  • Objective: achieve
  • Higher accuracy
  • Fast performance
  • Our proposed algorithms provide a good trade-off
    between accuracy and performance

10
Document Classification Model
  • Formally, we use the model of Chakrabarti et al.
    (1997)
  • Based on a naive Bayesian network
  • For simplicity, we study a single-node classifier

11
  • Probability that an incoming document d belongs
    to class c is
    Pr(c | d) ∝ Pr(c) · ∏_t P_{t,c}^{z_{t,d}}
  • z_{t,d}: number of occurrences of term t in the
    incoming document d
  • P_{t,c}: probability that a word in class c is
    term t (estimated using the training data)
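A minimal sketch of this scoring rule in Python (the names are hypothetical; the log-space arithmetic and the small floor for unseen terms are implementation details, not from the slides):

```python
import math

def log_score(doc_term_counts, prior, term_probs):
    """Log of Pr(c) * prod_t P_{t,c}^{z_{t,d}} for one class c.

    doc_term_counts: {term: z_{t,d}}, occurrences of each term in document d
    prior:           Pr(c), estimated from the training data
    term_probs:      {term: P_{t,c}}, per-class term probabilities
    """
    score = math.log(prior)
    for term, z in doc_term_counts.items():
        # Each of the z occurrences multiplies in P_{t,c}; in log space it adds.
        score += z * math.log(term_probs.get(term, 1e-9))  # floor for unseen terms
    return score

def classify(doc_term_counts, models):
    """models: {class_name: (prior, term_probs)}; return the most probable class."""
    return max(models, key=lambda c: log_score(doc_term_counts, *models[c]))
```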

12
Feature Selection
  • Previous formula involves all the terms
  • Feature selection reduces cost by using only the
    terms with good discriminating power
  • Use the training sets to identify the feature
    terms

13
Fisher's Index
  • Fisher's Index indicates the discriminating power
    of a term
  • Good discriminating power = large interclass
    distance, small intraclass distance

[Figure: two classes c1 and c2 along the term weight axis w(t), with a large interclass distance between the class means and a small intraclass distance within each class]
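A minimal sketch, assuming one common formulation of Fisher's index (pairwise squared distances between class means over the within-class spread); the paper's exact normalization may differ:

```python
from itertools import combinations

def fisher_index(per_class_values):
    """Fisher's index of one term: interclass distance over intraclass spread.

    per_class_values: {class_name: [frequency of the term in each training
                                    document of that class]}
    """
    means = {c: sum(vals) / len(vals) for c, vals in per_class_values.items()}
    # Interclass distance: squared differences between class means.
    inter = sum((means[a] - means[b]) ** 2 for a, b in combinations(means, 2))
    # Intraclass distance: mean squared deviation within each class.
    intra = sum(sum((x - means[c]) ** 2 for x in vals) / len(vals)
                for c, vals in per_class_values.items())
    return inter / intra if intra > 0 else float("inf")

def select_feature_terms(term_stats, k):
    """Keep the k terms with the largest Fisher index as the feature terms."""
    return sorted(term_stats, key=lambda t: fisher_index(term_stats[t]),
                  reverse=True)[:k]
```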
14
Document Classification Model
  • Consider only feature terms in the classification
    function Pr(ci | c, d)
  • Pick the child ci with the largest probability
  • Use one classifier in each internal node
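A minimal sketch of this top-down routing, assuming a hypothetical node interface with a children list and a per-node classifier:

```python
def hierarchical_classify(doc, root):
    """Route a document down the taxonomy, one classifier per internal node.

    Each node is assumed to expose `children` (empty for a leaf class) and,
    for internal nodes, a `classifier` whose pick_child(doc, children)
    returns the child with the largest Pr(ci | c, d).
    """
    node = root
    while node.children:  # descend until a leaf class is reached
        node = node.classifier.pick_child(doc, node.children)
    return node
```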

15
Recovery Oriented Scheme
  • Database system
  • Failure in DBMS
  • Restart from a consistent state
  • Document classification
  • Error detected
  • Restart from a correct class (High Confidence
    Ancestor, or HCA)

16
Recovery Oriented Scheme
  • In practice,
  • Rollback is slow
  • Identify wrong paths and avoid them
  • To identify wrong paths,
  • Define a closeness indicator (CI)
  • A node is judged to be on a wrong path when its CI
    falls below a threshold (see the sketch below)
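A minimal sketch of the recovery-oriented descent, with a caller-supplied closeness(doc, node) standing in for the CI (its exact definition is not given in this transcript) and the same hypothetical node interface as above:

```python
def classify_with_recovery(doc, root, closeness, ci_threshold, hca_threshold):
    """Descend the taxonomy; on a suspected wrong path, restart from the HCA.

    closeness:     stand-in for the closeness indicator CI
    ci_threshold:  below this, the current path is judged wrong
    hca_threshold: at or above this, a node becomes the new HCA
    """
    node, hca = root, root
    blocked = set()  # subtrees already identified as wrong paths
    while node.children:
        candidates = [c for c in node.children if c not in blocked]
        if not candidates:
            break                         # nothing left to try under this node
        child = node.classifier.pick_child(doc, candidates)
        ci = closeness(doc, child)
        if ci < ci_threshold:
            blocked.add(child)            # wrong path: avoid it from now on...
            node = hca                    # ...and restart from the HCA
            continue
        if ci >= hca_threshold:
            hca = child                   # high confidence: child is the new HCA
        node = child
    return node
```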

17
Recovery Oriented Scheme
[Figure: taxonomy tree in which the distance between the HCA and the current node is defined as 2; the branch taken below the HCA is marked as a wrong path]
18
Recovery Oriented Scheme
[Figure (continued): the same example, with the distance between the HCA and the current node equal to 2 and the wrong path marked]
19
Error Masking Scheme
  • Software Fault Tolerance
  • Run multiple versions of software
  • Majority voting
  • Document Classification
  • Run classifiers of different designs
  • Majority voting
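A minimal sketch of the voting step (the tie-breaking rule, falling back to the first version's answer, is an assumption):

```python
from collections import Counter

def masked_classify(doc, classifiers):
    """Error masking: run independently designed classifiers on the same
    document and return the majority answer."""
    answers = [clf(doc) for clf in classifiers]
    winner, count = Counter(answers).most_common(1)[0]
    return winner if count > 1 else answers[0]  # no majority: trust version 1
```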

20
O-Classifier
  • Traditional classifier

21
N-Classifier
  • Skips some intermediate levels of the taxonomy
    (see the sketch below)
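A minimal sketch of how an N-classifier's target classes could be derived (hypothetical node interface; the paper's actual level-skipping rule may differ):

```python
def n_classifier_targets(node, depth=2):
    """Classes an N-classifier at `node` predicts directly: instead of the
    direct children (depth 1), descendants `depth` levels down.  Leaves
    reached earlier are kept as-is."""
    frontier = [node]
    for _ in range(depth):
        frontier = [d for n in frontier for d in (n.children or [n])]
    return frontier
```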

22
Error Masking Scheme
  • Run three classifiers in parallel
  • O-classifier
  • N-classifier
  • O-classifier using a different feature length
  • This selection minimizes the time wasted waiting
    for the slowest classifier (see the sketch below)
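A minimal sketch of running the three versions concurrently with Python threads (the concurrency mechanism is an assumption; the slides do not specify one):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def masked_classify_parallel(doc, classifiers):
    """Run all classifier versions at once and majority-vote the answers.
    Overall latency is that of the slowest version, which is why versions
    with similar running times are chosen."""
    with ThreadPoolExecutor(max_workers=len(classifiers)) as pool:
        answers = list(pool.map(lambda clf: clf(doc), classifiers))
    winner, count = Counter(answers).most_common(1)[0]
    return winner if count > 1 else answers[0]
```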

23
Experiments
  • Data Sets
  • US Patents
  • Preclassified
  • Rich text content
  • Highly hierarchical
  • 3 sets collected
  • 3 levels / large number of documents
  • 4 levels / large number of documents
  • 7 levels / small number of documents

24
Experiments
  • Algorithms compared
  • Simple hierarchical
  • TAPER
  • Flat
  • Recovery oriented
  • Error masking
  • Generally,
  • flat is the slowest and the most accurate
  • simple hierarchical is the fastest and the least
    accurate

25
Accuracy 3 levels/large
26
Accuracy 4 levels/large
27
Accuracy 7 levels/small
28
Performance 3 levels/large
29
Performance 4 levels/large
30
Performance 7 levels/small
31
Conclusion
  • Real-life application
  • Large taxonomy
  • Flat classification is too slow
  • Our algorithm is faster than flat classification
    with as few as 4 levels
  • Performance gain widens as the number of levels
    increases
  • A good trade-off between accuracy and performance
    for most applications

32
Thank You
  • The End