Extracting Math from PostScript Documents

1
Extracting Math from PostScript Documents
  • Michael Yang
  • Univ. Calif., Irvine
  • Richard Fateman
  • Univ. Calif, Berkeley

2
Why Extract Math from Documents?
  • The current and recent past publications of
    scholarly journals in mathematics are not
    adequately indexed.
  • Imagine a query: "Find papers that involve this
    differential equation:"
  • $x^2 y'' + x y' + (x^2 - m^2) y = 0$
  • Or: "Is there a common name for this equation?"
    Answer: yes, Bessel's equation.

3
Why Extract Math from Documents?
  • Find papers that may be relevant to a formula or
    a proof of a related theorem.
  • Find out if a discovery is actually novel or a
    rediscovery of a previous result.
  • Even: "Is this formula true?"

4
How can we search, anyway?
  • Search in integral tables using hashing, flexible
    pattern matching.
  • Example: TILU (Fateman, Einwohner)
  • The general problem looks like a huge challenge
    of unification with simplifications of analytic
    functions. Is $a = f(b)$ the same as $f^{-1}(a) = b$?
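
The hashing idea in the first bullet can be illustrated with a minimal Python sketch. It is not TILU's actual method; the tuple encoding of expressions, the `_x` placeholder, and the names `canonical` and `lookup` are all assumptions made for illustration. Each integrand is reduced to a canonical form (integration variable renamed, commutative arguments sorted) so that trivially different spellings land on the same table key.

```python
# Expressions are atoms (str or int) or tuples: (operator, arg1, arg2, ...).
COMMUTATIVE = {"+", "*"}


def canonical(expr, var):
    """Rename the integration variable to '_x' and sort commutative arguments."""
    if isinstance(expr, tuple):
        op, *args = expr
        args = [canonical(a, var) for a in args]
        if op in COMMUTATIVE:
            args.sort(key=repr)  # any fixed total order will do
        return (op, *args)
    return "_x" if expr == var else expr


# A one-entry "integral table" (hypothetical), keyed by canonical integrand.
# The answer is stored as text in terms of the placeholder variable.
TABLE = {
    canonical(("*", "x", ("sin", "x")), "x"): "sin(_x) - _x*cos(_x)",
}


def lookup(integrand, var):
    """Return the stored antiderivative, or None if there is no exact hit."""
    return TABLE.get(canonical(integrand, var))


# The same integrand written with another variable and argument order.
print(lookup(("*", ("sin", "t"), "t"), "t"))
```

Exact hashing of this kind only catches formulas that differ by renaming and reordering; the harder cases named in the last bullet (inverse functions, simplification) need real pattern matching or unification.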

5
These are obviously hard questions
  • But we are much better off if we can start with
    a few decades of the most recent math papers and
    their formulas to search.
  • Prerequisite: encoding of formulas with semantic
    markup, the point of this paper.

6
Why start with PostScript or PDF?
  • We have many papers, including math journals,
    online, some of them free, with essentially all
    markup removed, stored for printing as PS or PDF.
  • Automation of inserting the markup, even if only
    partly successful, can help enable further work
    to make it possible to index and search for math.

7
Is this easier or harder than OCR?
  • It should be easier, because all the characters
    are known as error-free glyphs.
  • OCR tends to make erroneous symbol
    identifications if there is inadequate word-based
    context.
  • For example: o0Oº, 1lI!i, Illinois (!),
    -_
  • Well-known sources of PS provide stereotypes for
    the font/glyph/location mapping.
  • But it could be harder if the PostScript is truly
    obscure (PS is Turing equivalent, after all)

8
An Example
From a paper by Cyril Banderier et al., "Random
Maps, Coalescing Saddles, Singularity Analysis,
and Airy Phenomena," Random Structures and
Algorithms, 19 (3-4), 194--246 (2001). The text
below is the raw extraction, only slightly edited
by inserting newlines (explain origin):
.... 0.002 0.0025 200 400 600 800 1000 k
Figure 3. Left The standard Airy distribution.
Right Observed frequencies of core sizes k 2
20 1000 in 50,000 random maps of size 2,000,
showing the bimodal character of the
distribution. variety of integral or power
series representations including (see 1, 45)
1) Ai(z) 1 2 Z 1 1 e i(zt t 3 3) dt 1 3 23 1
X n0 3 13 z n ( n 1) 3) n sin 2(n 1) 3
Equipped with this de nition, we present the
main character of the paper, a probability
distribution closely related to the Airy
function. De nition 1. The standard ....
9
What is this really?
In this particular case, extraction of the
document image shows two formulas in the middle
of the citation
10
How could we encode this image?
Recognize the characters on the page as
equivalent to a TeX expression, for example

  {\mbox{Ai}}(z) = {1 \over 2\pi} \int_{-\infty}^{\infty} e^{i(zt + t^3/3)}\,dt
    = {1 \over \pi 3^{2/3}} \sum_{n=0}^{\infty} (3^{1/3} z)^n
      {\Gamma((n+1)/3) \over n!} \sin{2(n+1)\pi \over 3}.

or some alternative in MathML or OpenMath. What
are the barriers to getting to this point?
11
Detecting Math in the first place
  • Look for changes in font, italics, font size,
    and altered baselines.
  • Consider the density of text (formulas are low
    density).
  • Notice the presence of special characters that
    are common in math but unusual in text
    (also +, -, parens).
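
As a rough illustration of the cues above, here is a minimal Python sketch that scores one line of glyphs. The `Glyph` record, the `MATH_CHARS` set, and the equal weighting of the two cues are assumptions for illustration, not the authors' detector; the text-density cue is left out.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    font: str        # e.g. "roman" or "italic" (hypothetical labels)
    size: float      # font size in points
    baseline: float  # vertical position of the baseline


# Characters treated as math cues (an assumed, deliberately small set).
MATH_CHARS = set("+-=()<>/^_")


def math_score(line):
    """Share of adjacent glyph pairs with a font/size/baseline change,
    plus the share of glyphs that are special characters."""
    if len(line) < 2:
        return 0.0
    changes = sum(
        a.font != b.font or a.size != b.size or a.baseline != b.baseline
        for a, b in zip(line, line[1:])
    )
    special = sum(g.char in MATH_CHARS for g in line)
    return changes / (len(line) - 1) + special / len(line)


text = [Glyph(c, "roman", 10, 0) for c in "cat"]
math = [
    Glyph("x", "italic", 10, 0),
    Glyph("2", "roman", 7, 4),   # raised and smaller: a superscript
    Glyph("=", "roman", 10, 0),
    Glyph("1", "roman", 10, 0),
]
print(math_score(text), math_score(math))
```

A line would be flagged as math when its score passes some threshold; choosing that threshold would need tuning on real documents.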

12
Implementation
  • Run PostScript through a modified Ghostscript
    (PS interpreter) to output text file information
    suitable for geometric/math processing.
  • Run this file through previously developed
    OCR-based technology (in Lisp) for using
    bounding-boxes, contents, positions, to create a
    geometric 2-D relative position tree. Process
    further to identify semantic relationships if
    possible and output a hierarchical
    tree-representation of math formulas.
  • Convert this to TeX (could be MathML equally
    well).
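
The geometric step can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions (one text line, no fractions or radicals, hypothetical `Box` fields and shift thresholds); it is not the authors' Lisp code. Glyphs are read left to right, and a smaller glyph whose baseline is shifted relative to the last full-size glyph becomes a superscript or subscript.

```python
from dataclasses import dataclass


@dataclass
class Box:
    text: str        # glyph or token
    x: float         # left edge of the bounding box
    baseline: float  # PostScript y coordinate (grows upward)
    size: float      # font size in points


def to_tex(boxes):
    """Turn positioned glyphs on one line into a TeX string."""
    out = []
    base = None  # last glyph set at normal size on the main baseline
    mode = ""    # "", "^" or "_": the script group currently open
    for b in sorted(boxes, key=lambda box: box.x):
        new_mode = ""
        if base is not None and b.size < base.size:
            shift = b.baseline - base.baseline
            if shift > 0.2 * base.size:      # assumed threshold
                new_mode = "^"
            elif shift < -0.1 * base.size:   # assumed threshold
                new_mode = "_"
        if new_mode != mode:
            if mode:
                out.append("}")
            if new_mode:
                out.append(new_mode + "{")
            mode = new_mode
        out.append(b.text)
        if not new_mode:
            base = b
    if mode:
        out.append("}")
    return "".join(out)


boxes = [
    Box("x", 0, 0, 10), Box("2", 5, 4, 7), Box("+", 10, 0, 10),
    Box("y", 15, 0, 10), Box("n", 20, -2, 7),
]
print(to_tex(boxes))
```

A fuller version would build the 2-D relative position tree first and handle fraction bars, integral limits, and nested scripts before emitting TeX or MathML.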

13
Possible Future Work
  • Better font tools
  • Look at more producers of PS (not just TeX and
    dvips), e.g. Acrobat Distiller.
  • Run some tests (NEC) to see if we can extract
    sufficient formulas to add to the indexing
    information.
  • Examine the issue of formula similarity, e.g.
    parameter substitution, simplification,
    rearrangement. (This is relatively easy in the
    context of integration because there is a
    designated variable of integration.)

14
Conclusions
  • It's possible to automatically revisit previously
    typeset documents and invent plausible versions
    of TeX source-code for some, perhaps much, of
    published TeX.
  • This provides an additional link to a chain which
    may eventually lead to more widespread semantic
    encoding of math for index and retrieval.
  • Given the difficulties, a better route for the
    future is to have authors or editors use semantic
    markup for born-digital mathematical documents.
    Publishers should encourage this kind of work,
    although standards are currently disappointing.

15
Another paper, not included
  • Submitted to ISSAC-2004
  • Author: R. Fateman

16
Rational Function Computing with Poles and
Residues
  • Here's the idea: consider two forms for the same
    rational expression.
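
The two forms shown on the original slide did not survive in this transcript. As an illustrative example (not the one from the talk), the same rational function can be written as a ratio of polynomials or as a sum of residues over poles:

```latex
\frac{2x}{x^2 - 1} \;=\; \frac{1}{x - 1} + \frac{1}{x + 1}
```

The left side is the standard representation; the right side keeps constant numerators over linear denominators.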

17
Which form is better?
  • Generality of representation
  • Complexity (Cost) of operations
  • Arithmetic (+, *, /)
  • Integration, derivatives, limits, series, ...
  • Numerical evaluation
  • Display for human viewing

18
Keep constant numerators over (powers of) linear
denominators (+ polynomial)
  • Works for encoding arbitrary rational functions
    (over complex numbers) in one variable.
  • Plausibly requires high-precision floats if you
    start with ratio of polynomials where the roots
    of the denominator cannot be expressed as exact
    rational numbers.

19
PRO: Once you have this representation
  • Addition of rational functions is essentially
    free, compared to standard representation since
    no polynomial GCD is required.
  • a/b + c/d is already simplified, except for
    sorting and the possibility that b = d.
  • Multiplication of rational functions is
    inexpensive also, again no GCD needed.
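
To make the claim concrete, here is a minimal Python sketch of addition and multiplication in pole/residue form. It assumes exact rational poles, omits the polynomial part, and uses a hypothetical dictionary layout `{(pole, power): coefficient}`; it is not the implementation benchmarked in the talk. Addition merges terms; multiplication expands each pair of terms with the closed-form partial fraction identity, so neither needs a polynomial GCD.

```python
from fractions import Fraction
from math import comb

# A proper rational function is {(pole, power): coefficient}, meaning the
# sum of coefficient / (x - pole)**power over all entries.


def add(f, g):
    """Add two functions by merging terms with the same pole and power."""
    out = dict(f)
    for key, c in g.items():
        out[key] = out.get(key, 0) + c
    return {k: c for k, c in out.items() if c != 0}


def mul_terms(p, m, q, n):
    """Partial fractions of 1 / ((x - p)**m * (x - q)**n)."""
    if p == q:
        return {(p, m + n): Fraction(1)}
    out = {}
    for i in range(m):
        out[(p, m - i)] = Fraction((-1) ** i * comb(n + i - 1, i)) / (p - q) ** (n + i)
    for i in range(n):
        out[(q, n - i)] = Fraction((-1) ** i * comb(m + i - 1, i)) / (q - p) ** (m + i)
    return out


def mul(f, g):
    """Multiply term by term, re-expanding each product into partial fractions."""
    out = {}
    for (p, m), a in f.items():
        for (q, n), b in g.items():
            scaled = {k: a * b * c for k, c in mul_terms(p, m, q, n).items()}
            out = add(out, scaled)
    return out


def evaluate(f, x):
    """Evaluate the function at x (x must not be a pole)."""
    return sum(c / (x - p) ** k for (p, k), c in f.items())


f = {(1, 1): 1}    # 1/(x - 1)
g = {(-1, 1): 1}   # 1/(x + 1)
print(add(f, g))   # 1/(x - 1) + 1/(x + 1), nothing to simplify
print(mul(f, g))   # 1/(x^2 - 1) expanded over the same two poles
```

With irrational or complex poles the coefficients would be floating-point approximations, which is where the precision concerns from the previous slide come in.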

20
CON: Do you want to use this representation?
  • Division is not fast, so it is more appropriate
    if division is infrequent.
  • If the input is not already in residue/pole form,
    or if you have to do division, finding zeros
    introduces approximations maybe for the first
    time in a problem.
  • Output forms may look longer.

21
Examples
  • Ordinary addition: orders of magnitude faster,
    e.g. 45,000 times faster.
  • Ordinary multiplication: maybe 2X faster.
  • What about mixtures of + and * together? What
    important algorithms are there?
  • Sparse determinant calculation.

22
A determinant benchmark
  • Consider matrices with entries of this form
  • Determinant of an 8x8 matrix in Macsyma 2.4, on a
    2.6 GHz Pentium 4 computer.
  • Using Gaussian elimination: 112 sec
  • Using minor expansion: 109 sec
  • Using residues/poles (75 in bignum
    arithmetic): 41 sec
  • Using residues/poles and double-floats:
    1.6 sec

23
Conclusions
  • No surprise that avoiding GCDs is a winner.
  • Using approximate calculations can provide huge
    speedups. Do we really need exact computation
    everywhere we provide it?
  • We have a potential application for
    high-precision zero-finding, as well as
    non-overflowing software floats (GMP, ARPREC)