Title: Automatic Categorization of Figures in Scientific Documents
1Automatic Categorization of Figures in Scientific
Documents
- Xiaonan Lu (1), Prasenjit Mitra (2,1), James Z. Wang (2,1), and C. Lee Giles (2,1)
- (1) Department of Computer Science and Engineering
- (2) College of Information Sciences and Technology
- The Pennsylvania State University, University Park, Pennsylvania, USA
Funded by NSF Chemistry Cyberinfrastructure and Research Facilities:
"Developing Collaboratory Tools to Facilitate Multi-Disciplinary, Multi-Scale Research in Environmental Molecular Sciences"
2Main idea
- Goal: location and extraction of non-textual information from scientific/academic documents
- Focus on figures
- Problems
- Identification
- Categorization
- Data extraction
- Indexing
- Retrieval
3Outline
- Introduction
- Motivation and problem statement
- Overview of our work
- Related prior work
- Categories of figures
- The method
- Feature extraction
- Classification
- Experimental results
- Summary and future work
4Examples of figures in scientific documents
- Graph, diagram, flowchart, photograph, etc.
- Information not used by digital library search
engines
5Data within figures
- Data from figures in published documents
- Automatic data extraction
- Curves
- Data points
- Coordinate values
- Crucial for established fields such as chemistry,
physics, etc. (now done by hand)
6A full-text and figure-based retrieval system
- Query
- A document containing gardening pictures?
- A paper reporting experiments on human-computer interfaces?
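As a rough illustration of how such queries could combine full text with figure categories, here is a minimal sketch. The record layout (`id`, `text`, `figure_categories`) and the `search` function are hypothetical and only illustrate the idea; they are not the system described in these slides.

```python
def search(documents, keywords, figure_category=None):
    """Return ids of documents matching all keywords and, optionally,
    containing at least one figure of the requested category.

    `documents` is a list of dicts with hypothetical keys:
    'id', 'text', and 'figure_categories' (a set of category names).
    """
    results = []
    for doc in documents:
        text = doc["text"].lower()
        # Full-text part: every keyword must occur in the document text.
        if not all(word.lower() in text for word in keywords):
            continue
        # Figure part: keep only documents with a figure of that category.
        if figure_category is not None and figure_category not in doc["figure_categories"]:
            continue
        results.append(doc["id"])
    return results


docs = [
    {"id": "d1", "text": "Gardening tips for roses", "figure_categories": {"photograph"}},
    {"id": "d2", "text": "Gardening yield statistics", "figure_categories": {"2-D plot"}},
]
print(search(docs, ["gardening"], figure_category="photograph"))
```

Under these assumptions, the "gardening pictures" query becomes a keyword match restricted to documents whose figures were categorized as photographs.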
7Overview of this Work
- Semantics-sensitive
- Different data extraction, indexing, and retrieval techniques for different categories
- Content-based feature extraction
- Complete information
- Challenging
- Machine-learning based classification
8Related prior work
- Document retrieval
- Metadata extraction [Giuffrida, 2000; Han, 2003; Liu, 2004; Yilmazel, 2004; etc.]
- Name disambiguation [Mann, 2003; Han, 2005; Marlin, 2005; On, 2005; etc.]
- Document image understanding
- Image representation to semantics [Niyogi, 1995; etc.]
- Structure analysis [Haralick, 1994; Blostein, 2000; Mao, 2003; Zheng, 2005; etc.]
- Multimedia retrieval in digital libraries
- Text-based, content-based [Christel, 2005; Tsai, 2005; Smeulders, 2000; Wang, 2001; etc.]
9Categorization of figures
- Goal
- Represent high level semantics
- Facilitate further processing
- Procedure
- Specify figure type based on functionality
- Visual similarity within type
- Data
- Use 5000 figures from CiteSeer
- Annotate types
10Categories (ontology) of figures
- Photograph
- 2-D plot
- 3-D plot
- Diagram
- Others
11Classification of figure types
(Diagram: from document to figure category)
12Figure Extraction
- Adobe Acrobat image extraction
- Extract image based on file format
- Does not work for scanned documents
13Texture-based block classification
- Non-overlapping blocks
- Pixel partition predefined
- Picture, text, or background
(Figure: wavelet coefficient distributions of a continuous-tone image and a non-continuous-tone image)
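The block-level idea can be sketched as follows. This is a minimal illustration, assuming a one-level horizontal Haar transform and two hypothetical thresholds (`zero_eps`, `sparse_fraction`); the slides do not state which wavelet or decision rule was used.

```python
def split_into_blocks(image, size):
    """Partition a grayscale image (list of rows) into non-overlapping size x size blocks."""
    blocks = []
    for top in range(0, len(image) - size + 1, size):
        for left in range(0, len(image[0]) - size + 1, size):
            blocks.append([row[left:left + size] for row in image[top:top + size]])
    return blocks


def haar_detail_coefficients(block):
    """One-level horizontal Haar details: half-differences of adjacent pixel pairs."""
    coeffs = []
    for row in block:
        for x in range(0, len(row) - 1, 2):
            coeffs.append((row[x] - row[x + 1]) / 2.0)
    return coeffs


def classify_block(block, zero_eps=1.0, sparse_fraction=0.5):
    """Label a block 'background', 'text', or 'picture'.

    zero_eps and sparse_fraction are hypothetical thresholds chosen for illustration.
    """
    coeffs = haar_detail_coefficients(block)
    fraction_zero = sum(1 for c in coeffs if abs(c) <= zero_eps) / len(coeffs)
    if fraction_zero == 1.0:
        return "background"   # no detail energy at all
    if fraction_zero >= sparse_fraction:
        return "text"         # mostly zeros plus a few sharp jumps: non-continuous-tone
    return "picture"          # detail spread over many coefficients: continuous-tone
```

The design choice here is that non-continuous-tone content (text, line art) concentrates its detail coefficients at zero with a few large values, while continuous-tone content spreads them more evenly.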
14Global texture features
15Edge detection
- Edge detection
- Reduces the amount of data
- Preserves structural features
- Binary output for basic object detection
- Use Canny edge detection [Canny, 1986]
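To make the "binary output" point concrete, here is a simplified stand-in: Sobel gradient magnitude followed by a single threshold. It is not the Canny detector; Canny additionally applies Gaussian smoothing, non-maximum suppression, and hysteresis thresholding. The function name and the `threshold` value are assumptions for illustration.

```python
def sobel_edge_map(image, threshold=100.0):
    """Return a binary edge map (1 = edge pixel) for a grayscale image given as a list of rows.

    Border pixels are left as 0 because the 3x3 Sobel kernels need a full neighbourhood.
    """
    height, width = len(image), len(image[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Horizontal gradient: right column minus left column.
            gx = (image[y - 1][x + 1] + 2 * image[y][x + 1] + image[y + 1][x + 1]
                  - image[y - 1][x - 1] - 2 * image[y][x - 1] - image[y + 1][x - 1])
            # Vertical gradient: bottom row minus top row.
            gy = (image[y + 1][x - 1] + 2 * image[y + 1][x] + image[y + 1][x + 1]
                  - image[y - 1][x - 1] - 2 * image[y - 1][x] - image[y - 1][x + 1])
            if (gx * gx + gy * gy) ** 0.5 >= threshold:
                edges[y][x] = 1
    return edges
```

The output keeps only the structural outline of the figure, which is the reduced representation later steps such as line detection can work on.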
16Line detection
- Line detection
- Basic and common visual component
- Hough transform [Hough, 1959]
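A minimal sketch of the Hough transform on a binary edge map, using the common (rho, theta) line parameterization. The function name, the one-degree angle step, and the `min_votes` threshold are assumptions for illustration; the slides do not give the parameters actually used.

```python
import math


def hough_lines(edge_map, num_angles=180, min_votes=5):
    """Vote for lines rho = x*cos(theta) + y*sin(theta) through edge pixels.

    Returns (votes, rho, theta_in_degrees) tuples with at least min_votes votes,
    sorted so the strongest candidates come first.
    """
    height, width = len(edge_map), len(edge_map[0])
    max_rho = int(math.ceil(math.hypot(height, width)))
    thetas = [math.pi * i / num_angles for i in range(num_angles)]
    # accumulator[angle index][rho + max_rho]; the offset keeps negative rho in range.
    accumulator = [[0] * (2 * max_rho + 1) for _ in range(num_angles)]
    for y in range(height):
        for x in range(width):
            if edge_map[y][x]:
                for i, theta in enumerate(thetas):
                    rho = int(round(x * math.cos(theta) + y * math.sin(theta)))
                    accumulator[i][rho + max_rho] += 1
    lines = []
    for i in range(num_angles):
        for r in range(2 * max_rho + 1):
            votes = accumulator[i][r]
            if votes >= min_votes:
                lines.append((votes, r - max_rho, round(math.degrees(thetas[i]))))
    lines.sort(reverse=True)
    return lines
```

Because rho is rounded to whole pixels, a short line usually produces a small cluster of near-identical peaks at neighbouring angles, so a practical system would merge nearby peaks before counting lines as a feature.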
17Classification
- Support Vector Machine (SVMlight)
- Generalized technique
- Clear connection to underlying statistical learning theory
- Good classifier for high-dimensional feature spaces
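For readers unfamiliar with SVMs, the following is a small stand-in: a linear SVM trained by subgradient descent on the regularized hinge loss. It is not SVMlight and does not reproduce its solver or kernels; the function names and the hyperparameters (`epochs`, `learning_rate`, `reg`) are assumptions for illustration. Labels are +1 and -1.

```python
def train_linear_svm(features, labels, epochs=200, learning_rate=0.1, reg=0.01):
    """Fit weights and bias by per-sample subgradient steps on the L2-regularized hinge loss."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            margin = y * (sum(w * v for w, v in zip(weights, x)) + bias)
            # Regularization shrinks the weights on every step.
            weights = [w * (1.0 - learning_rate * reg) for w in weights]
            # Samples inside the margin (or misclassified) pull the boundary away.
            if margin < 1.0:
                weights = [w + learning_rate * y * v for w, v in zip(weights, x)]
                bias += learning_rate * y
    return weights, bias


def predict(weights, bias, x):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    score = sum(w * v for w, v in zip(weights, x)) + bias
    return 1 if score >= 0 else -1
```

In the setting of these slides, each feature vector would hold the texture, edge, and line features of one figure; a multi-class decision can be built from several such binary classifiers, for example one per category against the rest.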
18Experimental setup
- Experimental system
- C, MySQL, Linux
- Data set
- 2000 PDF files from CiteSeer
- Adobe Acrobat image extraction tool
- Manual annotation
- Tests
- Photograph vs. non-photograph
- Multi-class classification: photograph, 2-D plot, 3-D plot, diagram, others
19Figure type dataset
20Performance measure
- Classification error rate
- Precision
- Recall
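These three measures can be computed from true and predicted labels as in the sketch below. The function name and the choice of "photograph" as the positive class are assumptions for illustration; precision and recall are taken with respect to that one class, while the error rate counts every misclassified figure.

```python
def binary_metrics(true_labels, predicted_labels, positive="photograph"):
    """Return classification error rate, precision, and recall for one positive class."""
    tp = fp = fn = errors = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if guess != truth:
            errors += 1
        if guess == positive and truth == positive:
            tp += 1          # correctly retrieved positive
        elif guess == positive:
            fp += 1          # predicted positive, actually something else
        elif truth == positive:
            fn += 1          # missed positive
    return {
        "error_rate": errors / len(true_labels),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
```

For the multi-class test, the same function can be called once per category with a different `positive` value to obtain per-class precision and recall.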
21Photograph vs. Non-photograph
22Multi-class Classification
23Summary
- Specification of the problem of locating and extracting non-textual information (figures) in scientific/academic documents
- Categorization ontology of types of figures
- Machine learning classification
- Promising experimental results
24Future Work
- Improved identification and extraction performance
- More reliable techniques
- Extract data from figures
- Shapes and content from figures
- Data points, curves, legend, coordinates
- Paper in review
- Combine text and content-based approach
- Index figures for search
25The wealth of information has created a poverty
of attention.
Herbert Simon