Automatic Categorization of Figures in Scientific Documents

1
Automatic Categorization of Figures in Scientific
Documents
  • Xiaonan Lu (1), Prasenjit Mitra (2,1), James Z.
    Wang (2,1), and C. Lee Giles (2,1)
  • (1) Department of Computer Science and Engineering
  • (2) College of Information Sciences and Technology
  • The Pennsylvania State University, University
    Park, Pennsylvania, USA

Funded by NSF Chemistry Cyberinfrastructure and
Research Facilities: Developing Collaboratory
Tools to Facilitate Multi-Disciplinary,
Multi-Scale Research in Environmental Molecular
Sciences
2
Main idea
  • Goal: locate and extract non-textual
    information from scientific/academic documents
  • Focus on figures
  • Problems
      • Identification
      • Categorization
      • Data extraction
      • Indexing
      • Retrieval

3
Outline
  • Introduction
      • Motivation and problem statement
      • Overview of our work
      • Related prior work
  • Categories of figures
  • The method
      • Feature extraction
      • Classification
  • Experimental results
  • Summary and future work

4
Examples of figures in scientific documents
  • Graph, diagram, flowchart, photograph, etc.
  • Information not used by digital library search
    engines

5
Data within figures
  • Data from figures in published documents
  • Automatic data extraction
      • Curves
      • Data points
      • Coordinate values
  • Crucial for established fields such as chemistry,
    physics, etc. (now done by hand)

6
A full-text and figure-based retrieval system
  • Query
      • A document containing gardening pictures?
      • A paper reporting experiments on human-computer
        interfaces?

7
Overview of this Work
  • Semantics-sensitive
      • Different data extraction, indexing, and retrieval
        techniques for different categories
  • Content-based feature extraction
      • Complete information
      • Challenging
  • Machine-learning based classification

8
Related prior work
  • Document retrieval
      • Metadata extraction [Giuffrida, 2000; Han, 2003;
        Liu, 2004; Yilmazel, 2004; etc.]
      • Name disambiguation [Mann, 2003; Han, 2005;
        Marlin, 2005; On, 2005; etc.]
  • Document image understanding
      • Image representation to semantics [Niyogi, 1995;
        etc.]
      • Structure analysis [Haralick, 1994; Blostein,
        2000; Mao, 2003; Zheng, 2005; etc.]
  • Multimedia retrieval in digital libraries
      • Text-based, content-based [Christel, 2005; Tsai,
        2005; Smeulders, 2000; Wang, 2001; etc.]

9
Categorization of figures
  • Goal
      • Represent high-level semantics
      • Facilitate further processing
  • Procedure
      • Specify figure types based on functionality
      • Visual similarity within each type
  • Data
      • Use 5000 figures from CiteSeer
      • Annotate types

10
Categories (ontology) of figures
[Figure: ontology of figure types, with nodes photograph, 2-D plot, 3-D plot, diagram, and others]
11
Classification of figure types
[Figure: diagram with the labels "document" and "category"]
12
Figure Extraction
  • Adobe Acrobat image extraction
      • Extracts images based on file format
      • Does not work for scanned documents

13
Texture-based block classification
  • Non-overlapping blocks
  • Pixel partition predefined
  • Picture, text, or background

[Figure: wavelet coefficient distributions for a continuous-tone image and a non-continuous-tone image]
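
The block partition and wavelet analysis above can be sketched as follows. This is a minimal illustration, assuming a grayscale image stored as a list of rows and a one-level, averaging (unnormalized) Haar transform as the wavelet. The slide does not state the block size, the wavelet, or the decision rule, so the function names, the block size, and the "fraction of near-zero coefficients" feature are hypothetical choices made only for this example.

```python
def split_blocks(image, size):
    """Yield (row, col, block) for non-overlapping size x size blocks."""
    height, width = len(image), len(image[0])
    for r in range(0, height - size + 1, size):
        for c in range(0, width - size + 1, size):
            yield r, c, [row[c:c + size] for row in image[r:r + size]]


def haar_detail(block):
    """Return one-level 2-D Haar detail coefficients of a block.

    Uses the averaging (unnormalized) form: each 2x2 cell contributes
    one horizontal, one vertical, and one diagonal difference.
    """
    details = []
    for r in range(0, len(block) - 1, 2):
        for c in range(0, len(block[0]) - 1, 2):
            a, b = block[r][c], block[r][c + 1]
            d, e = block[r + 1][c], block[r + 1][c + 1]
            details.append((a - b + d - e) / 4.0)  # left vs. right columns
            details.append((a + b - d - e) / 4.0)  # top vs. bottom rows
            details.append((a - b - d + e) / 4.0)  # diagonal difference
    return details


def near_zero_fraction(block, eps=1.0):
    """Fraction of detail coefficients with magnitude below eps (assumed feature)."""
    coeffs = haar_detail(block)
    return sum(abs(x) < eps for x in coeffs) / len(coeffs)
```

A per-block summary like this could then be passed to whatever rule or classifier assigns the picture, text, or background label; that mapping is not specified in the slides.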
14
Global texture features
15
Edge detection
  • Reduces the amount of data
  • Preserves structural features
  • Binary output for basic object detection
  • Uses Canny edge detection [Canny, 1986]
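
To illustrate how edge detection turns a grayscale image into a binary map, here is a simplified stand-in rather than the Canny detector itself. Canny's method also includes smoothing, non-maximum suppression, and hysteresis thresholding; the sketch below shows only a Sobel gradient followed by a single threshold. The function name and the threshold value are assumptions made for this example.

```python
import math


def sobel_edges(image, threshold=100.0):
    """Return a binary edge map (1 = edge) from the Sobel gradient magnitude.

    Border pixels stay 0 because the 3x3 kernel does not fit there.
    """
    height, width = len(image), len(image[0])
    edges = [[0] * width for _ in range(height)]
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            # Horizontal gradient: right column minus left column.
            gx = (image[r - 1][c + 1] + 2 * image[r][c + 1] + image[r + 1][c + 1]
                  - image[r - 1][c - 1] - 2 * image[r][c - 1] - image[r + 1][c - 1])
            # Vertical gradient: bottom row minus top row.
            gy = (image[r + 1][c - 1] + 2 * image[r + 1][c] + image[r + 1][c + 1]
                  - image[r - 1][c - 1] - 2 * image[r - 1][c] - image[r - 1][c + 1])
            if math.hypot(gx, gy) >= threshold:
                edges[r][c] = 1
    return edges
```

On an image whose left half is black and right half is white, the interior pixels on either side of the boundary are marked as edges and everything else stays 0.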

16
Line detection
  • Basic and common visual component
  • Hough transform [Hough, 1959]
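
The voting idea behind the Hough transform can be sketched as below, assuming a binary edge map as input and the common (rho, theta) line parameterization. The function names and the one-degree angular resolution are assumptions for illustration; the slides do not describe the parameterization or the peak selection that was actually used.

```python
import math


def hough_lines(edges, n_theta=180):
    """Vote in (rho, theta) space for every edge pixel of a binary edge map.

    Returns a dict mapping (rho, theta_index) to a vote count, where
    rho = x*cos(theta) + y*sin(theta) is rounded to the nearest integer
    and theta = pi * theta_index / n_theta.
    """
    votes = {}
    for y, row in enumerate(edges):
        for x, value in enumerate(row):
            if not value:
                continue
            for t in range(n_theta):
                theta = math.pi * t / n_theta
                rho = round(x * math.cos(theta) + y * math.sin(theta))
                votes[(rho, t)] = votes.get((rho, t), 0) + 1
    return votes


def strongest_line(edges, n_theta=180):
    """Return ((rho, theta_in_degrees), votes) for a top-scoring accumulator cell."""
    votes = hough_lines(edges, n_theta)
    (rho, t), count = max(votes.items(), key=lambda item: item[1])
    return (rho, 180.0 * t / n_theta), count
```

For an edge map containing a single vertical line at x = 3, the cell for rho = 3 and theta = 0 collects one vote per edge pixel. On small images, neighbouring angles can tie with the true one, so a practical detector would usually add peak detection over the accumulator.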

17
Classification
  • Support Vector Machine (SVMlight)
      • Generalized technique
      • Clear connection to underlying statistical
        learning theory
      • Good classifier for high-dimensional feature
        spaces
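
As a rough illustration of the classifier, the sketch below trains a linear SVM by sub-gradient descent on the regularized hinge loss. It is not SVMlight and does not reproduce its optimizer or kernels; the function names, learning rate, regularization strength, and epoch count are all assumptions chosen for this toy example.

```python
def train_linear_svm(samples, labels, lam=0.01, epochs=200, lr=0.1):
    """Fit (w, b) by sub-gradient descent on the L2-regularized hinge loss.

    samples: list of feature vectors; labels: +1 or -1 for each sample.
    """
    dim = len(samples[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The regularization term shrinks w on every step.
            w = [wi * (1.0 - lr * lam) for wi in w]
            if margin < 1.0:
                # Hinge loss is active: push the boundary away from x.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 or -1 for feature vector x."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1
```

With two-dimensional toy features, positives clustered around (2, 2) and negatives around the origin, the trained model separates the two groups. In the setting of this talk, the feature vectors would instead hold the texture, edge, and line features, with one label per figure.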

18
Experimental setup
  • Experimental system
      • C, MySQL, Linux
  • Data set
      • 2000 PDF files from CiteSeer
      • Adobe Acrobat image extraction tool
      • Manual annotation
  • Tests
      • Photograph vs. non-photograph
      • Multi-class classification: photograph, 2-D plot,
        3-D plot, diagram, others

19
Figure type dataset
20
Performance measure
  • Classification error rate
  • Precision
  • Recall
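
The three measures above can be computed as in the following sketch, which treats one class as the positive class. The function name and the label strings are hypothetical, and the labels in the usage note are made-up toy values, not results from the experiments.

```python
def evaluate(true_labels, predicted_labels, positive):
    """Return (error_rate, precision, recall) with `positive` as the target class."""
    pairs = list(zip(true_labels, predicted_labels))
    errors = sum(t != p for t, p in pairs)
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return errors / len(pairs), precision, recall
```

For example, with true labels photo, photo, plot, plot, photo and predictions photo, plot, plot, photo, photo, calling `evaluate(..., positive="photo")` gives an error rate of 0.4 and precision and recall of 2/3 each.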

21
Photograph vs. Non-photograph
22
Multi-class Classification
23
Summary
  • Specification of the problem of locating and
    extracting non-textual information (figures) in
    scientific/academic documents
  • Categorization (ontology) of figure types
  • Machine-learning classification
  • Promising experimental results

24
Future Work
  • Improved identification and extraction
    performance
      • More reliable techniques
  • Extract data from figures
      • Shapes and content from figures
      • Data points, curves, legend, coordinates
      • Paper in review
  • Combine text-based and content-based approaches
      • Index figures for search

25
"The wealth of information has created a poverty of attention."
Herbert Simon