Extracting Math from PostScript Documents

About This Presentation

Slides: 24

Provided by: richard489

Learn more at: https://people.eecs.berkeley.edu

Transcript and Presenter's Notes

1
Extracting Math from PostScript Documents

Michael Yang
Univ. Calif., Irvine
Richard Fateman
Univ. Calif, Berkeley

2
Why Extract Math from Documents?

The current and recent past publications of
scholarly journals in mathematics are not
adequately indexed.
Imagine a query: "Find papers that involve this
differential equation":
$x^2 y'' + x y' + (x^2 - m^2) y = 0$
Or: "Is there a common name for this equation?"
Answer: yes, Bessel's equation.

3
Why Extract Math from Documents?

Find papers that may be relevant to a formula or
a proof of a related theorem.
Find out if a discovery is actually novel or a
rediscovery of a previous result.
Even: "Is this formula true?"

4
How can we search, anyway?

Search in integral tables using hashing, flexible
pattern matching.
Example: TILU (Fateman, Einwohner).
The general problem looks like a huge challenge
of unification with simplifications of analytic
functions. Is $a = f(b)$ the same as $f^{-1}(a) = b$?

5
These are obviously hard questions

But we are much better off if we can start with
a few decades of the most recent math papers and
their formulas to search.
Prerequisite: encoding of formulas with semantic
markup, the point of this paper.

6
Why start with PostScript or PDF?

We have many papers, including math journals,
online, some of them free, with essentially all
markup removed, stored for printing as PS or PDF.
Automation of inserting the markup, even if only
partly successful, can help enable further work
to make it possible to index and search for math.

7
Is this easier or harder than OCR?

It should be easier, because all the characters
are known as error-free glyphs.
OCR tends to make erroneous symbol
identifications if there is inadequate word-based
context.
For example: o 0 O º; 1 l I ! i; Illinois (!);
- _
Well-known sources of PS provide stereotypes for
the font/glyph/location mapping.
But it could be harder if the PostScript is truly
obscure (PS is Turing equivalent, after all)

8
An Example

From a paper by Cyril Banderier et al., "Random
Maps, Coalescing Saddles, Singularity Analysis,
and Airy Phenomena," Random Structures and
Algorithms 19 (3-4), 194-246 (2001), only
slightly edited by inserting newlines (explain
origin):
....0.002 0.0025 200 400 600 800 1000 k
Figure 3. Left The standard Airy distribution.
Right Observed frequencies of core sizes k 2
20 1000 in 50,000 random maps of size 2,000,
showing the bimodal character of the
distribution. variety of integral or power
series representations including (see 1, 45)
1) Ai(z) 1 2 Z 1 1 e i(zt t 3 3) dt 1 3 23 1
X n0 3 13 z n ( n 1) 3) n sin 2(n 1) 3
Equipped with this de nition, we present the
main character of the paper, a probability
distribution closely related to the Airy
function. De nition 1. The standard ....

9
What is this really?

In this particular case, extraction of the
document image shows two formulas in the middle
of the citation.

10
How could we encode this image?

Recognize the characters on the page as
equivalent to a TeX expression, for example
\mbox{Ai}(z) = {1 \over 2\pi} \int_{-\infty}^{\infty} e^{i(zt + t^3/3)} \, dt
= {1 \over \pi 3^{2/3}} \sum_{n=0}^{\infty} (3^{1/3} z)^n {\Gamma((n+1)/3) \over n!} \sin {2(n+1)\pi \over 3}
or some alternative in MathML or OpenMath. What
are the barriers to getting to this point?

11
Detecting Math in the first place

Look for changes in font, italics, font size
changes, altered baselines.
Consider the density of text (formulas are low
density).
Notice the presence of special characters that
are common in math but unusual in text
(also +, -, parentheses).
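
The cues above can be combined into a rough per-line score. The following is only a sketch under stated assumptions: the `Glyph` record, the `MATH_CHARS` set, and the equal weighting are hypothetical and are not taken from the authors' system. Text density is not modelled here.

```python
from dataclasses import dataclass

# Hypothetical set of characters treated as "math-like" for this sketch.
MATH_CHARS = set("+-=<>()^_/|")

@dataclass
class Glyph:
    char: str        # the character shown
    font: str        # font name as reported by the interpreter
    size: float      # font size in points
    baseline: float  # vertical position of the baseline

def math_score(glyphs):
    """Return a rough score in [0, 1); higher suggests the line contains math."""
    if not glyphs:
        return 0.0
    pairs = list(zip(glyphs, glyphs[1:]))
    # Cue 1: changes of font or font size between neighbouring glyphs.
    font_changes = sum(a.font != b.font or a.size != b.size for a, b in pairs)
    # Cue 2: altered baselines (superscripts, subscripts).
    baseline_shifts = sum(a.baseline != b.baseline for a, b in pairs)
    # Cue 3: special characters that are rare in running text.
    special = sum(g.char in MATH_CHARS for g in glyphs)
    return (font_changes + baseline_shifts + special) / (3 * len(glyphs))
```

A line set entirely in one font on one baseline scores 0, while a short formula with a raised exponent scores noticeably higher. A real detector would need tuned weights and a threshold.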

12
Implementation

Run PostScript through a modified Ghostscript
(PS interpreter) to output text file information
suitable for geometric/math processing.
Run this file through previously developed
OCR-based technology (in Lisp) that uses
bounding boxes, contents, and positions to create a
geometric 2-D relative position tree.
Process further to identify semantic relationships if
possible and output a hierarchical
tree representation of math formulas.
Convert this to TeX (could be MathML equally
well).
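
As an illustration of the last two steps, here is a much-simplified sketch in Python (the authors' code is in Lisp and is not shown in these slides). It assumes a hypothetical `Box` record per glyph and handles only one level of superscripts and subscripts on a single line, using the baseline of the largest glyph as the main baseline.

```python
from dataclasses import dataclass

@dataclass
class Box:
    char: str
    x: float         # left edge of the glyph
    baseline: float  # baseline height; larger means higher on the page
    size: float      # font size in points

def to_tex(boxes):
    """Convert one line of glyph boxes to TeX, using baseline shifts for scripts."""
    boxes = sorted(boxes, key=lambda b: b.x)   # left-to-right reading order
    if not boxes:
        return ""
    # Assumption: the largest glyph sits on the main baseline.
    main = max(boxes, key=lambda b: b.size).baseline
    out = []
    mode = ""  # "" = main line, "^" = superscript, "_" = subscript
    for b in boxes:
        if b.baseline > main:
            new = "^"
        elif b.baseline < main:
            new = "_"
        else:
            new = ""
        if new != mode:
            if mode:
                out.append("}")        # close the script group we were in
            if new:
                out.append(new + "{")  # open a new script group
            mode = new
        out.append(b.char)
    if mode:
        out.append("}")
    return "".join(out)
```

For glyphs x, a raised 2, +, y, and a lowered i, this yields `x^{2}+y_{i}`. Fractions, radicals, and nested scripts need the full 2-D relative position tree rather than this linear scan.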

13
Possible Future Work

Better font tools
Look at more producers of PS (not just TeX and
dvips), e.g. Acrobat Distiller.
Run some tests (NEC) to see if we can extract
sufficient formulas to add to the indexing
information.
Examine the issue of formula similarity, e.g.
parameter substitution, simplification,
rearrangement. (This is relatively easy in the context
of integration because there is a designated
variable of integration.)
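
To make the parameter-substitution point concrete, here is a small sketch, assuming formulas are held as nested tuples `(operator, arg, ...)`; that encoding and the name `canonical_form` are hypothetical. Every symbol other than the designated variable is renamed in order of first appearance, so two formulas that differ only in parameter names compare equal. Simplification and rearrangement are not handled.

```python
def canonical_form(expr, var):
    """Rename every symbol except `var` to p0, p1, ... in order of first appearance."""
    names = {}

    def walk(node):
        if isinstance(node, tuple):
            # node[0] is the operator; only the arguments are rewritten.
            return (node[0],) + tuple(walk(arg) for arg in node[1:])
        if isinstance(node, str) and node != var:
            # A parameter: reuse its canonical name or assign the next one.
            return names.setdefault(node, f"p{len(names)}")
        return node  # numbers and the designated variable stay as they are

    return walk(expr)
```

With `x` as the variable of integration, `a*x^2 + b` and `c*x^2 + d` both become `p0*x^2 + p1`, so a table lookup keyed on the canonical form would match them.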

14
Conclusions

It's possible to automatically revisit previously
typeset documents and invent plausible versions
of TeX source-code for some, perhaps much, of
published TeX.
This provides an additional link to a chain which
may eventually lead to more widespread semantic
encoding of math for index and retrieval.
Given the difficulties, a better route for the
future is to have authors or editors use semantic
mark-up for born-digital mathematical documents.
Publishers should
encourage this kind of work, although standards
are currently disappointing.

15
Another paper, not included

Submitted to ISSAC-2004
Author: R. Fateman

16
Rational Function Computing with Poles and
Residues

Here's the idea: consider two forms for the same
rational expression.

17
Which form is better?

Generality of representation
Complexity (Cost) of operations
Arithmetic (+, ×, /)
Integration, derivatives, limits, series,
Numerical evaluation
Display for human viewing

18
Keep constant numerators over (powers of) linear
denominators (+ polynomial)

Works for encoding arbitrary rational functions
(over complex numbers) in one variable.
Plausibly requires high-precision floats if you
start with ratio of polynomials where the roots
of the denominator cannot be expressed as exact
rational numbers.
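
Written out, the representation described above has the following general shape. The symbol names ($p$, $a_j$, $e_j$, $c_{j,k}$) are chosen here for illustration and do not come from the slides; the second line is a standard small example, not the one shown in the talk.

```latex
r(x) = p(x) + \sum_{j=1}^{m} \sum_{k=1}^{e_j} \frac{c_{j,k}}{(x - a_j)^{k}}
\qquad\text{for example}\qquad
\frac{1}{x^2 - 1} = \frac{1/2}{x - 1} - \frac{1/2}{x + 1}
```

Here $p(x)$ is the polynomial part, the $a_j$ are the poles (roots of the denominator), and the $c_{j,k}$ are constants.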

19
PRO Once you have this representation

Addition of rational functions is essentially
free, compared to standard representation since
no polynomial GCD is required.
a/b + c/d is already simplified, except for
sorting and the possibility that b = d.
Multiplication of rational functions is
inexpensive also, again no GCD needed.
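
A minimal sketch of why addition is cheap in this form, assuming a function is stored as a dictionary from `(pole, power)` to a constant coefficient. This storage scheme and the names `add_pf` and `evaluate` are hypothetical, and the polynomial part is omitted for brevity. Addition only merges terms with the same denominator; no GCD is computed.

```python
from fractions import Fraction

def add_pf(f, g):
    """Add two functions stored as {(pole, power): coefficient} dictionaries."""
    total = dict(f)
    for key, coeff in g.items():
        s = total.get(key, 0) + coeff   # same denominator: just add numerators
        if s == 0:
            total.pop(key, None)        # the term cancelled completely
        else:
            total[key] = s
    return total

def evaluate(f, x):
    """Evaluate the sum of coeff / (x - pole)**power at the point x."""
    return sum(c / (x - pole) ** power for (pole, power), c in f.items())
```

For example, adding 2/(x-1) + 1/(x-3)^2 to -2/(x-1) + 5/(x-2) leaves 1/(x-3)^2 + 5/(x-2): the terms in 1/(x-1) cancel with a single dictionary lookup.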

20
CON Do you want to use this representation?

Division is not fast, so it is more appropriate
if division is infrequent.
If the input is not already in residue/pole form,
or if you have to do division, finding zeros
introduces approximations, maybe for the first
time in a problem.
Output forms may look longer.

21
Examples

Ordinary addition: orders of magnitude faster,
e.g. 45,000 times faster.
Ordinary multiplication: maybe 2× faster.
What about mixtures of addition and multiplication
together? What important algorithms are there?
Sparse determinant calculation.

22
A determinant benchmark

Consider matrices with entries of this form
Determinant of an 8×8 matrix in Macsyma 2.4, on a
2.6 GHz Pentium 4 computer:
Using Gaussian elimination: 112 sec
Using minor expansion: 109 sec
Using residues/poles (75 in bignum arithmetic): 41 sec
Using residues/poles and double-floats: 1.6 sec

23
Conclusions

No surprise that avoiding GCDs is a winner.
Using approximate calculations can provide huge
speedups. Do we really need exact computation
everywhere we provide it?
We have a potential application for
high-precision zero-finding, as well as
non-overflowing software floats (GMP, ARPREC)
