Title: Extracting Math from PostScript Documents
1 Extracting Math from PostScript Documents
- Michael Yang
- Univ. Calif., Irvine
- Richard Fateman
- Univ. Calif., Berkeley
2 Why Extract Math from Documents?
- The current and recent past publications of
scholarly journals in mathematics are not
adequately indexed.
- Imagine a query: "Find papers that involve this
differential equation":
- x^2 y'' + x y' + (x^2 - m^2) y = 0
- Or: "Is there a common name for this equation?"
Ans: yes, Bessel's.
3 Why Extract Math from Documents?
- Find papers that may be relevant to a formula or
a proof of a related theorem.
- Find out if a discovery is actually novel or a
rediscovery of a previous result.
- Even: "Is this formula true?"
4 How can we search, anyway?
- Search in integral tables using hashing, flexible
pattern matching.
- Example: TILU (Fateman, Einwohner)
- The general problem looks like a huge challenge
of unification with simplifications of analytic
functions. Is a = f(b) the same as f^{-1}(a) = b?
5 These are obviously hard questions
- But we are much better off if we can start with
a few decades of the most recent math papers and
their formulas to search.
- Prerequisite: encoding of formulas with semantic
markup, the point of this paper.
6 Why start with PostScript or PDF?
- We have many papers, including math journals,
online, some of them free, with essentially all
markup removed, stored for printing as PS or PDF.
- Automation of inserting the markup, even if only
partly successful, can help enable further work
to make it possible to index and search for math.
7 Is this easier or harder than OCR?
- It should be easier, because all the characters
are known as error-free glyphs.
- OCR tends to make erroneous symbol
identifications if there is inadequate word-based
context.
- For example: o0Oº, 1lI!i, Illinois (!), -_
- Well-known sources of PS provide stereotypes for
the font/glyph/location mapping.
- But it could be harder if the PostScript is truly
obscure (PS is Turing equivalent, after all).
8 An Example
From a paper by Cyril Banderier et al., "Random
Maps, Coalescing Saddles, Singularity Analysis,
and Airy Phenomena," Random Structures and
Algorithms, 19 (3-4), 194-246 (2001), only
slightly edited by inserting newlines. explain
origin ....0.002 0.0025 200 400 600 800 1000 k
Figure 3. Left The standard Airy distribution.
Right Observed frequencies of core sizes k 2
20 1000 in 50,000 random maps of size 2,000,
showing the bimodal character of the
distribution. variety of integral or power
series representations including (see 1, 45)
1) Ai(z) 1 2 Z 1 1 e i(zt t 3 3) dt 1 3 23 1
X n0 3 13 z n ( n 1) 3) n sin 2(n 1) 3
Equipped with this de nition, we present the
main character of the paper, a probability
distribution closely related to the Airy
function. De nition 1. The standard ....
9 What is this really?
In this particular case, extraction of the
document image shows two formulas in the middle
of the quoted text.
10 How could we encode this image?
Recognize the characters on the page as
equivalent to a TeX expression, for example
{\mbox{Ai}}(z) = {1 \over 2\pi} \int_{-\infty}^{\infty} e^{i(zt + t^3/3)}\, dt
= {1 \over \pi 3^{2/3}} \sum_{n=0}^{\infty} (3^{1/3} z)^n
{\Gamma((n+1)/3) \over n!} \sin {2(n+1)\pi \over 3}.
or some alternative in MathML or OpenMath. What
are the barriers to getting to this point?
11 Detecting Math in the first place
- Look for changes in font, italics, font size
changes, altered baselines.
- Consider the density of text (formulas are low
density).
- Notice the presence of special characters that
are common in math but unusual in text (also
operators such as -, and parentheses).
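The cues listed on this slide can be combined into a rough per-line score. The following is a minimal sketch in Python, not the authors' implementation: the `Glyph` record, its field names, the `SPECIAL` character set, the default body font "CMR10", and the equal weighting of cues are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical record for one glyph recovered from the page."""
    char: str
    font: str        # font name, e.g. "CMR10" for body text
    size: float      # point size
    baseline: float  # y coordinate of the baseline
    x: float         # left edge
    width: float


# Assumed set of characters that are rarer in prose than in formulas.
SPECIAL = set("=+-()[]{}<>^_|/*")


def math_cues(line, body_font, body_size):
    """Return the fraction of glyphs (or of space) that triggers each cue."""
    n = len(line)
    common_baseline = Counter(g.baseline for g in line).most_common(1)[0][0]
    left = min(g.x for g in line)
    right = max(g.x + g.width for g in line)
    span = right - left
    ink = sum(g.width for g in line)
    return {
        "font_change": sum(g.font != body_font for g in line) / n,
        "size_change": sum(abs(g.size - body_size) > 0.5 for g in line) / n,
        "baseline_shift": sum(g.baseline != common_baseline for g in line) / n,
        "special_chars": sum(g.char in SPECIAL for g in line) / n,
        # Low text density: share of the line's extent not covered by glyphs.
        "sparseness": 1.0 - min(1.0, ink / span) if span > 0 else 0.0,
    }


def math_score(line, body_font="CMR10", body_size=10.0):
    """Unweighted mean of the cues; 0.0 means nothing suggests math."""
    if not line:
        return 0.0
    cues = math_cues(line, body_font, body_size)
    return sum(cues.values()) / len(cues)
```

A line would then be flagged as a formula candidate when its score exceeds some threshold; choosing that threshold (and the weights) would need tuning against real documents.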
12 Implementation
- Run PostScript through a modified Ghostscript
(PS interpreter) to output text file information
suitable for geometric/math processing.
- Run this file through previously developed
OCR-based technology (in Lisp) that uses
bounding boxes, contents, and positions to create
a geometric 2-D relative position tree. Process
further to identify semantic relationships if
possible and output a hierarchical
tree representation of math formulas.
- Convert this to TeX (could be MathML equally
well).
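To make the middle step concrete, here is a small sketch of turning bounding boxes into a relative-position structure and then into TeX. It is written in Python for illustration only; the pipeline described above is in Lisp, and the `Box` record, the midline rule in `relation`, and the function names are assumptions, not the authors' code. It handles only one level of superscripts and subscripts.

```python
from dataclasses import dataclass, field


@dataclass
class Box:
    """Hypothetical glyph record: text plus a bounding box (y axis points up)."""
    text: str
    x0: float
    y0: float  # bottom
    x1: float
    y1: float  # top
    sup: list = field(default_factory=list)
    sub: list = field(default_factory=list)


def relation(base, other):
    """Classify `other` relative to `base` using the base's vertical midline."""
    mid = (base.y0 + base.y1) / 2
    if other.y0 >= mid:
        return "sup"     # sits entirely above the midline
    if other.y1 <= mid:
        return "sub"     # sits entirely below the midline
    return "inline"


def build_tree(boxes):
    """Scan left to right, attaching raised/lowered boxes to the last inline box."""
    row = []
    for b in sorted(boxes, key=lambda b: b.x0):
        if row:
            rel = relation(row[-1], b)
            if rel == "sup":
                row[-1].sup.append(b)
                continue
            if rel == "sub":
                row[-1].sub.append(b)
                continue
        row.append(b)
    return row


def to_tex(row):
    """Emit TeX for a row of boxes with attached scripts."""
    parts = []
    for b in row:
        s = b.text
        if b.sub:
            s += "_{" + to_tex(b.sub) + "}"
        if b.sup:
            s += "^{" + to_tex(b.sup) + "}"
        parts.append(s)
    return " ".join(parts)
```

For example, boxes for `x`, a raised `2`, `+`, `y`, and a lowered `i` would come out as `x^{2} + y_{i}`. Fractions, radicals, and large operators with limits need further spatial rules that this sketch omits.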
13 Possible Future Work
- Better font tools
- Look at more producers of PS (not just TeX and
dvips), e.g. Acrobat Distiller.
- Run some tests (NEC) to see if we can extract
sufficient formulas to add to the indexing
information.
- Examine the issue of formula similarity, e.g.
parameter substitution, simplification,
rearrangement. (Relatively easy in the context
of integration because there is a designated
variable of integration.)
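The remark about a designated variable can be illustrated with a small sketch of parameter substitution: once the variable of integration is known, every other symbol can be renamed in order of first appearance, so two integrands that differ only in parameter names get the same key. This is an illustrative Python sketch under assumptions, not a description of any existing system: expressions are assumed to be nested tuples `(operator, arg, ...)`, and `canonical` is a hypothetical helper. It does not attempt simplification or rearrangement.

```python
def canonical(expr, var):
    """Rename the designated variable to "v" and all other symbols to p0, p1, ...

    `expr` is a nested tuple (operator, arg1, arg2, ...); leaves are symbol
    names (str) or numbers. Operators are left untouched.
    """
    names = {}

    def walk(e):
        if isinstance(e, tuple):
            return (e[0],) + tuple(walk(a) for a in e[1:])
        if isinstance(e, str):
            if e == var:
                return "v"
            # Same parameter always maps to the same placeholder.
            return names.setdefault(e, f"p{len(names)}")
        return e  # numeric constant

    return walk(expr)


# 1/(a + b*x^2) with variable x, and 1/(c + k*t^2) with variable t
f1 = ("/", 1, ("+", "a", ("*", "b", ("^", "x", 2))))
f2 = ("/", 1, ("+", "c", ("*", "k", ("^", "t", 2))))

# The canonical form is hashable, so it can serve as a table key.
table = {canonical(f1, "x"): "entry for 1/(a + b x^2)"}
match = table.get(canonical(f2, "t"))
```

Here `match` finds the stored entry because both integrands reduce to the same tuple. Handling commutative reordering would require sorting operands by a parameter-independent key before renaming.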
14 Conclusions
- It's possible to automatically revisit previously
typeset documents and invent plausible versions
of TeX source code for some, perhaps much, of
published TeX.
- This provides an additional link to a chain which
may eventually lead to more widespread semantic
encoding of math for indexing and retrieval.
- Given the difficulties, a better route for the
future is to have authors or editors use semantic
mark-up for born-digital mathematical documents.
Publishers should encourage this kind of work,
although standards are currently disappointing.
15 Another paper, not included
- Submitted to ISSAC-2004
- Author: R. Fateman
16 Rational Function Computing with Poles and
Residues
- Here's the idea: consider two forms for the same
rational expression.
17 Which form is better?
- Generality of representation
- Complexity (Cost) of operations
- Arithmetic (+, ×, /)
- Integration, derivatives, limits, series,
- Numerical evaluation
- Display for human viewing
18 Keep constant numerators over (powers of) linear
denominators (+ polynomial)
- Works for encoding arbitrary rational functions
(over complex numbers) in one variable.
- Plausibly requires high-precision floats if you
start with a ratio of polynomials where the roots
of the denominator cannot be expressed as exact
rational numbers.
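Written out, the representation named in this slide's title is the familiar partial-fraction form. The symbols below (p for the polynomial part, a_i for the poles, m_i for their multiplicities, c_{ik} for the constant numerators) are notation assumed here for illustration; the slide itself does not name them.

```latex
\[
  f(x) \;=\; p(x) \;+\; \sum_{i=1}^{r} \sum_{k=1}^{m_i}
  \frac{c_{ik}}{(x - a_i)^{k}},
  \qquad c_{ik},\ a_i \in \mathbb{C}.
\]
% Rational poles: exact coefficients suffice.
\[
  \frac{1}{x^{2} - 1} \;=\; \frac{1/2}{x - 1} \;-\; \frac{1/2}{x + 1}
\]
% Irrational poles: the poles and numerators involve \sqrt{2}.
\[
  \frac{1}{x^{2} - 2} \;=\; \frac{1/(2\sqrt{2})}{x - \sqrt{2}}
  \;-\; \frac{1/(2\sqrt{2})}{x + \sqrt{2}}
\]
```

The second example shows where the second bullet bites: the poles are no longer rational, so a numeric version of this form has to store approximations of them.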
19 PRO: Once you have this representation
- Addition of rational functions is essentially
free, compared to standard representation, since
no polynomial GCD is required.
- a/b + c/d is already simplified except for
sorting and the possibility that b = d.
- Multiplication of rational functions is
inexpensive also, again no GCD needed.
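A minimal sketch of why these operations are cheap, assuming a function is stored as a dictionary from `(pole, power)` to a constant numerator. The dictionary layout and the names `add`, `mul_simple`, and `evaluate` are illustrative assumptions, not the paper's data structure; the polynomial part is left out, and multiplication is shown only for simple poles, using 1/((x-a)(x-b)) = (1/(a-b)) (1/(x-a) - 1/(x-b)) for a ≠ b. Inputs are assumed to be `Fraction` values so the arithmetic stays exact.

```python
from collections import defaultdict
from fractions import Fraction


def add(f, g):
    """Add two pole/residue sums: merge terms, combining equal denominators."""
    out = defaultdict(Fraction)
    for terms in (f, g):
        for key, c in terms.items():
            out[key] += c
    return {k: c for k, c in out.items() if c != 0}


def mul_simple(f, g):
    """Multiply two sums of simple-pole terms c/(x - a). No GCD is involved."""
    out = defaultdict(Fraction)
    for (a, j), c in f.items():
        for (b, k), d in g.items():
            if j != 1 or k != 1:
                raise ValueError("this sketch handles simple poles only")
            if a == b:
                out[(a, 2)] += c * d          # same pole: powers add
            else:
                w = c * d / (a - b)           # split the product of two poles
                out[(a, 1)] += w
                out[(b, 1)] -= w
    return {k: c for k, c in out.items() if c != 0}


def evaluate(f, x):
    """Numerically evaluate the sum at x (x must not be a pole)."""
    return sum(c / (x - a) ** k for (a, k), c in f.items())


f = {(Fraction(1), 1): Fraction(1)}   # 1/(x - 1)
g = {(Fraction(2), 1): Fraction(3)}   # 3/(x - 2)
s = add(f, g)                         # 1/(x - 1) + 3/(x - 2)
p = mul_simple(f, g)                  # -3/(x - 1) + 3/(x - 2)
```

Addition touches each term once, and the only simplification is the case the slide mentions, two terms sharing a denominator. Higher-order poles in a product need the general partial-fraction expansion, which this sketch does not attempt.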
20 CON: Do you want to use this representation?
- Division is not fast, so it is more appropriate
if division is infrequent.
- If the input is not already in residue/pole form,
or if you have to do division, finding zeros
introduces approximations, maybe for the first
time in a problem.
- Output forms may look longer.
21 Examples
- Ordinary addition: orders of magnitude faster,
e.g. 45,000 times faster.
- Ordinary multiplication: maybe 2x faster.
- What about mixtures of + and × together? What
important algorithms are there?
- Sparse determinant calculation.
22 A determinant benchmark
- Consider matrices with entries of this form
- Determinant of 8x8 matrix in Macsyma 2.4, on a
2.6 GHz Pentium 4 computer.
- Using Gaussian Elimination: 112 sec
- Using Minor Expansion: 109 sec
- Using Residues/Poles (75 in bignum
arithmetic): 41 sec
- Using Residues/Poles and double-floats:
1.6 sec
23 Conclusions
- No surprise that avoiding GCDs is a winner.
- Using approximate calculations can provide huge
speedups. Do we really need exact computation
everywhere we provide it?
- We have a potential application for
high-precision zero-finding, as well as
non-overflowing software floats (GMP, ARPREC).