Automatic Categorization of Figures in Scientific Documents

1
Automatic Categorization of Figures in Scientific
Documents

Xiaonan Lu (1), Prasenjit Mitra (2, 1), James Z. Wang (2, 1),
and C. Lee Giles (2, 1)
(1) Department of Computer Science and Engineering
(2) College of Information Sciences and Technology
The Pennsylvania State University, University Park,
Pennsylvania, USA

Funded by NSF Chemistry Cyberinfrastructure and
Research Facilities: Developing Collaboratory
Tools to Facilitate Multi-Disciplinary,
Multi-Scale Research in Environmental Molecular
Sciences
2
Main idea

Goal: location and extraction of non-textual
information from scientific/academic documents
Focus on figures
Problems
Identification
Categorization
Data extraction
Indexing
Retrieval

3
Outline

Introduction
Motivation and problem statement
Overview of our work
Related prior work
Categories of figures
The method
Feature extraction
Classification
Experimental results
Summary and future work

4
Examples of figures in scientific documents

Graph, diagram, flowchart, photograph, etc.
Information not used by digital library search
engines

5
Data within figures

Data from figures in published documents
Automatic data extraction
Curves
Data points
Coordinate values
Crucial for established fields such as chemistry,
physics, etc. (now done by hand)

6
A full-text and figure-based retrieval system

Example queries
A document containing gardening pictures?
A paper reporting experiments on human-computer
interfaces?

7
Overview of this Work

Semantics-sensitive
Different data extraction, indexing, and
retrieval techniques for different categories
Content-based feature extraction
Complete information
Challenging
Machine-learning-based classification

8
Related prior work

Document retrieval
Metadata extraction: [Giuffrida, 2000], [Han, 2003],
[Liu, 2004], [Yilmazel, 2004], etc.
Name disambiguation: [Mann, 2003], [Han, 2005],
[Marlin, 2005], [On, 2005], etc.
Document image understanding
Image representation to semantics: [Niyogi, 1995],
etc.
Structure analysis: [Haralick, 1994], [Blostein,
2000], [Mao, 2003], [Zheng, 2005], etc.
Multimedia retrieval in digital libraries
Text-based, content-based: [Christel, 2005], [Tsai,
2005], [Smeulders, 2000], [Wang, 2001], etc.

9
Categorization of figures

Goal
Represent high-level semantics
Facilitate further processing
Procedure
Specify figure types based on functionality
Visual similarity within each type
Data
Use 5000 figures from CiteSeer
Annotate types

10
Categories (ontology) of figures

Photograph
2-D plot
3-D plot
Diagram
Others

11
Classification of figure types

[Diagram: classification flow from document to figure category]

12
Figure Extraction

Adobe Acrobat image extraction
Extract image based on file format
Does not work for scanned documents

13
Texture-based block classification

Non-overlapping blocks
Predefined pixel partition
Each block labeled picture, text, or background

Wavelet coefficient distribution
Continuous-tone image
Non-continuous-tone image
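
The block-level idea above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses a one-level 2-D Haar transform and reduces the "wavelet coefficient distribution" to a single energy statistic, and the function names and both thresholds are hypothetical. The slides do not say which wavelet or decision rule the authors used.

```python
def split_blocks(image, size):
    """Yield (row, col, block) for non-overlapping size x size blocks.

    `image` is a grayscale image given as a list of rows.
    """
    rows, cols = len(image), len(image[0])
    for r in range(0, rows - size + 1, size):
        for c in range(0, cols - size + 1, size):
            yield r, c, [row[c:c + size] for row in image[r:r + size]]


def haar_detail_energy(block):
    """Mean squared one-level Haar detail coefficient of a block.

    The block must have even side lengths.
    """
    total = 0.0
    count = 0
    for r in range(0, len(block), 2):
        for c in range(0, len(block[0]), 2):
            a, b = block[r][c], block[r][c + 1]
            d, e = block[r + 1][c], block[r + 1][c + 1]
            # The three detail sub-bands of the orthonormal 2-D Haar transform.
            lh = (a - b + d - e) / 2.0
            hl = (a + b - d - e) / 2.0
            hh = (a - b - d + e) / 2.0
            total += lh * lh + hl * hl + hh * hh
            count += 3
    return total / count


def label_block(block, flat_threshold=1.0, sharp_threshold=2000.0):
    """Crude three-way label; both thresholds are illustrative assumptions."""
    energy = haar_detail_energy(block)
    if energy < flat_threshold:
        return "background"   # almost no detail at all
    if energy > sharp_threshold:
        return "text"         # abrupt transitions, non-continuous-tone
    return "picture"          # moderate detail, continuous-tone
```

A uniform block has zero detail energy and is labeled background, while a block of alternating black and white columns produces very large detail coefficients and is labeled text. A real system would look at the shape of the coefficient distribution rather than one number.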
14
Global texture features

15
Edge detection

Reduces the amount of data
Preserves structural features
Binary output for basic object detection
Use Canny edge detection [Canny, 1986]
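
To make the "binary output" point concrete, here is a minimal sketch of gradient-based edge detection. It is deliberately not Canny: it applies only a Sobel gradient and a single threshold, whereas Canny adds Gaussian smoothing, non-maximum suppression, and hysteresis thresholding. The function name and the threshold value are assumptions for illustration.

```python
def sobel_edges(image, threshold=100.0):
    """Return a binary edge map (0 or 1) for a grayscale image.

    `image` is a list of rows; border pixels are left as 0.
    """
    rows, cols = len(image), len(image[0])
    edges = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Horizontal intensity change (right column minus left column).
            gx = (image[r - 1][c + 1] + 2 * image[r][c + 1] + image[r + 1][c + 1]
                  - image[r - 1][c - 1] - 2 * image[r][c - 1] - image[r + 1][c - 1])
            # Vertical intensity change (bottom row minus top row).
            gy = (image[r + 1][c - 1] + 2 * image[r + 1][c] + image[r + 1][c + 1]
                  - image[r - 1][c - 1] - 2 * image[r - 1][c] - image[r - 1][c + 1])
            if (gx * gx + gy * gy) ** 0.5 >= threshold:
                edges[r][c] = 1
    return edges
```

On an image whose left half is black and right half is white, only the pixels on either side of the boundary are marked, which shows how the edge map keeps structure while discarding most of the data.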

16
Line detection

Basic and common visual component
Hough transform [Hough, 1959]
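
As an illustration of how a Hough transform finds lines in a binary edge map, here is a small sketch using the common (rho, theta) parameterization, rho = x cos(theta) + y sin(theta). The function names, the one-degree angular resolution, and the rounding of rho to whole pixels are assumptions, not details taken from the slides.

```python
import math


def hough_accumulator(edges, n_theta=180):
    """Vote in (rho, theta) space for every edge pixel.

    Returns the accumulator, the list of angles, and the rho offset
    used so that negative rho values map to valid row indices.
    """
    rows, cols = len(edges), len(edges[0])
    max_rho = int(math.ceil(math.hypot(rows, cols)))
    thetas = [math.pi * t / n_theta for t in range(n_theta)]
    acc = [[0] * n_theta for _ in range(2 * max_rho + 1)]
    for y in range(rows):
        for x in range(cols):
            if edges[y][x]:
                for t, theta in enumerate(thetas):
                    rho = int(round(x * math.cos(theta) + y * math.sin(theta)))
                    acc[rho + max_rho][t] += 1
    return acc, thetas, max_rho


def strongest_line(edges, n_theta=180):
    """Return (rho, theta, votes) for the accumulator cell with most votes."""
    acc, thetas, max_rho = hough_accumulator(edges, n_theta)
    votes, r, t = max((acc[r][t], r, t)
                      for r in range(len(acc)) for t in range(n_theta))
    return r - max_rho, thetas[t], votes
```

Because rho is quantized, several neighbouring angles can tie for the top vote count on short lines, so the returned angle is only accurate to within a few degrees. Counting accumulator peaks above a vote threshold would give a simple line-count feature.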

17
Classification

Support Vector Machine (SVMlight)
Generalized technique
Clear connection to underlying statistical
learning theory
Good classifier for high-dimensional feature
spaces
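
For readers unfamiliar with SVMs, the following is a minimal sketch of training a linear SVM by sub-gradient descent on the regularized hinge loss (a Pegasos-style update). It is not SVMlight and makes no claim about the settings the authors used; the function names, the regularization constant, and the two mean-centered toy features are all hypothetical.

```python
def with_bias(x):
    """Append a constant feature so the bias is learned as a weight."""
    return list(x) + [1.0]


def train_linear_svm(samples, labels, lam=0.01, epochs=50):
    """Train a linear SVM; labels must be +1 or -1.

    Minimizes lam/2 * |w|^2 + mean hinge loss with step size 1 / (lam * t).
    """
    w = [0.0] * (len(samples[0]) + 1)
    step = 0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            step += 1
            xb = with_bias(x)
            margin = y * sum(wi * xi for wi, xi in zip(w, xb))
            # Regularization shrinks the weights: (1 - eta * lam) = 1 - 1/t.
            shrink = 1.0 - 1.0 / step
            w = [wi * shrink for wi in w]
            if margin < 1.0:
                # Sample is inside the margin: move the weights toward it.
                eta = 1.0 / (lam * step)
                w = [wi + eta * y * xi for wi, xi in zip(w, xb)]
    return w


def predict(w, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    score = sum(wi * xi for wi, xi in zip(w, with_bias(x)))
    return 1 if score >= 0 else -1


# Toy data: +1 = photograph, -1 = non-photograph (features are made up).
samples = [[1.0, 2.0], [1.5, 1.0], [-1.0, -1.5], [-2.0, -1.0]]
labels = [1, 1, -1, -1]
weights = train_linear_svm(samples, labels)
```

A multi-class setup such as the five figure types could be built from several such binary classifiers, for example one per class against the rest; whether the authors did this is not stated in the slides.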

18
Experimental setup

Experimental system
C, MySQL, Linux
Data set
2000 PDF files from CiteSeer
Adobe Acrobat image extraction tool
Manual annotation
Tests
Photograph vs. non-photograph
Multi-class classification: photograph, 2-D plot,
3-D plot, diagram, others

19
Figure type dataset
20
Performance measure

Classification error rate
Precision
Recall
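
The three measures can be computed as in the sketch below, which treats one class as "positive" and uses the standard definitions: error rate is the fraction of misclassified figures, precision is TP / (TP + FP), and recall is TP / (TP + FN). The function name and the example labels are illustrative only, not results from the experiments.

```python
def evaluate(true_labels, predicted_labels, positive):
    """Return (error_rate, precision, recall) for one class of interest."""
    tp = fp = fn = errors = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if truth != guess:
            errors += 1
        if guess == positive and truth == positive:
            tp += 1   # correctly predicted as the positive class
        elif guess == positive:
            fp += 1   # predicted positive, actually another class
        elif truth == positive:
            fn += 1   # positive figure that was missed
    error_rate = errors / len(true_labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return error_rate, precision, recall


# Made-up labels, only to show the call.
truth = ["photograph", "photograph", "diagram", "2-D plot", "photograph"]
guess = ["photograph", "diagram", "photograph", "2-D plot", "photograph"]
error_rate, precision, recall = evaluate(truth, guess, "photograph")
```

For the multi-class test, the same function can be called once per figure type to obtain per-class precision and recall.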

21
Photograph vs. Non-photograph
22
Multi-class Classification
23
Summary

Specification of the problem of locating and
extracting non-textual information (figures) in
scientific/academic documents
Ontology of figure types for categorization
Machine-learning classification
Promising experimental results

24
Future Work