Title: Unicode Support for Mathematics
1 Unicode Support for Mathematics
- Murray Sargent III
- Microsoft
2 Overview
- Unicode math characters
- Semantics of math characters
- Unicode and markup
- Multiple ways of encoding math characters
- Not yet standardized math characters
- Inputting math symbols
3 Unicode Math Characters
- 340 math chars exist in ASCII, U+2200-U+22FF, arrows, and combining marks of Unicode 3.0
- 996 math alphanumeric characters are in Unicode 3.1's Plane 1
- 591 new math symbols and operators are in Unicode 3.2's BMP
- One math variant selector
- One new combining character (reverse solidus).
4 Basic Set of Alphanumeric Characters
- Latin digits (0 - 9)
- Upper- and lowercase Latin letters (a - z, A - Z)
- Uppercase Greek letters Α - Ω plus the nabla ∇ and the variant of theta ϴ given by U+03F4
- Lowercase Greek letters α - ω plus the partial differential sign ∂ and glyph variants of ε, θ, κ, φ, ρ, and π
- Only unaccented forms of letters are used
5 Math Alphanumeric Characters
- Math needs various Latin and Greek alphabets like normal, bold, italic, script, Fraktur, and open-face
- May appear to be font variations, but have distinct semantics
- Without these distinctions, you get gibberish, violating the Unicode rule that plain text must contain enough info to permit the text to be rendered legibly, and nothing more
- Plain-text searches should distinguish between alphabets, e.g., a search for script H shouldn't match H, etc.
- Reduces markup verbosity
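The search point above can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the helper names `exact_match` and `folded_match` are made up here, and NFKC folding stands in for any process that discards the math-alphabet distinction.

```python
import unicodedata

SCRIPT_H = "\u210B"  # SCRIPT CAPITAL H, as used for the Hamiltonian


def exact_match(needle: str, haystack: str) -> bool:
    """Code-point search: script H and plain H stay distinct."""
    return needle in haystack


def folded_match(needle: str, haystack: str) -> bool:
    """Search after NFKC folding, which maps script H to plain H."""
    def fold(s: str) -> str:
        return unicodedata.normalize("NFKC", s)
    return fold(needle) in fold(haystack)


text = "H = 2"
print(exact_match(SCRIPT_H, text))   # script H is not found in plain text
print(folded_match(SCRIPT_H, text))  # folding erases the distinction
```

A search that keeps the alphabets apart should therefore compare code points directly rather than compatibility-folded text.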
6 Legibility Loss
- Without math alphabets, the Hamiltonian formula
- ℋ = ∫ dτ (εE² + μH²)
- becomes an integral equation
- H = ∫ dτ (εE² + μH²)
7 Math Alphanumeric Chars (cont)
- Plain a-z, A-Z, 0-9, α-ω, Α-Ω
- Bold a-z, A-Z, 0-9, α-ω, Α-Ω
- Italic a-z, A-Z, α-ω, Α-Ω
- Bold italic a-z, A-Z, α-ω, Α-Ω
- Script a-z, A-Z
- Bold script a-z, A-Z
- Fraktur a-z, A-Z
- Bold Fraktur a-z, A-Z
- Double struck a-z, A-Z, 0-9
- Sans-serif a-z, A-Z, 0-9
- Sans-serif bold a-z, A-Z, 0-9, α-ω, Α-Ω
- Sans-serif italic a-z, A-Z
- Sans-serif bold italic a-z, A-Z, α-ω, Α-Ω
- Monospace a-z, A-Z, 0-9
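Because each styled alphabet is laid out in the same order as the plain one, a plain letter can be mapped to its styled counterpart by adding an offset. The sketch below does this for the bold alphabet only, which has no reserved positions. The function name `to_math_bold` is hypothetical.

```python
import unicodedata

# First code points of the bold math alphabet in Plane 1.
BOLD_UPPER = 0x1D400  # MATHEMATICAL BOLD CAPITAL A
BOLD_LOWER = 0x1D41A  # MATHEMATICAL BOLD SMALL A
BOLD_DIGIT = 0x1D7CE  # MATHEMATICAL BOLD DIGIT ZERO


def to_math_bold(text: str) -> str:
    """Map ASCII letters and digits to math bold; leave everything else alone."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(BOLD_UPPER + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(BOLD_LOWER + ord(ch) - ord("a")))
        elif "0" <= ch <= "9":
            out.append(chr(BOLD_DIGIT + ord(ch) - ord("0")))
        else:
            out.append(ch)
    return "".join(out)


for ch in to_math_bold("Az9"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Alphabets such as italic and script would need an extra lookup for the positions that are reserved because the character already exists among the Letterlike Symbols.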
8 How to Display Math Alphabets?
- Can use the Unicode surrogate pair mechanisms available on the OS
- Alternatively, bind to standard fonts and use the corresponding BMP characters
- The second approach is probably faster, and to display Unicode one needs font binding in any event. But most traditional fonts are not suited to math alphabetic characters
- A single math font may look more consistent
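For reference, the surrogate pair for a Plane 1 math alphanumeric follows from the standard UTF-16 arithmetic. The sketch below is illustrative; the function name `to_surrogate_pair` is an assumption, and the result is cross-checked against Python's own UTF-16 encoder.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary-plane code point into UTF-16 high/low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1D400)  # MATHEMATICAL BOLD CAPITAL A
print(f"{high:04X} {low:04X}")
print("\U0001D400".encode("utf-16-be").hex().upper())
```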
9 Math Alphabetics via Glyph Variants
- One approach to the math alphanumerics would be to use a set of math glyph variant selectors
- Such a tag would follow a base character, imparting a math style
- Approach was dropped since it seemed likely to be abused
- One math variant selector does exist to offer a different line slant for some composite symbols
- Other variant selectors are being defined for nonmath purposes, e.g., Han variants
10 Multiple Character Encodings
- As with nonmath characters, math symbols can often be encoded in multiple ways, composed and decomposed
- E.g., ≠ can be U+003D, U+0338 or U+2260
- Recommendation: use the fully composed symbol, e.g., U+2260 for ≠
- For alphabetic characters, use combining-mark sequences to get consistent typography
- Some representations use markup for the alphabetic cases. This allows multicharacter combining marks.
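One way to get the fully composed symbol is canonical composition (NFC). The sketch below assumes NFC is an acceptable stand-in for the recommendation above; the helper name `compose` is made up. It also shows that a letter with a combining mark stays a sequence when no precomposed character exists.

```python
import unicodedata


def compose(text: str) -> str:
    """Apply canonical composition (NFC)."""
    return unicodedata.normalize("NFC", text)


not_equal = compose("=\u0338")  # EQUALS SIGN + COMBINING LONG SOLIDUS OVERLAY
x_hat = compose("x\u0302")      # x + COMBINING CIRCUMFLEX ACCENT

print([f"U+{ord(c):04X}" for c in not_equal])
print([f"U+{ord(c):04X}" for c in x_hat])
```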
11 Compatibility Holes
- Compatibility holes (reserved positions) exist in some Unicode sequences to avoid duplicate encodings (ugh!)
- E.g., U+2071-U+2073 are holes for ¹²³, which are U+00B9, U+00B2, and U+00B3, respectively
- Math alphanumerics have holes corresponding to Letterlike symbols.
- Recommendation: you can use the hole codes internally, but must import and export the standard codes.
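The import/export recommendation amounts to a small translation table applied at the boundary. The table below is deliberately partial (four of the reserved positions) and the function names are hypothetical. A real implementation would list every hole.

```python
# Partial map from reserved "hole" positions in the math alphanumerics
# to the Letterlike Symbols that are the standard codes for the same characters.
HOLE_TO_STANDARD = {
    0x1D455: 0x210E,  # italic small h   -> PLANCK CONSTANT
    0x1D49D: 0x212C,  # script capital B -> SCRIPT CAPITAL B
    0x1D4A3: 0x210B,  # script capital H -> SCRIPT CAPITAL H
    0x1D4A4: 0x2110,  # script capital I -> SCRIPT CAPITAL I
}
STANDARD_TO_HOLE = {std: hole for hole, std in HOLE_TO_STANDARD.items()}


def export_text(internal: str) -> str:
    """Replace internal hole codes with the standard codes before export."""
    return internal.translate(HOLE_TO_STANDARD)


def import_text(external: str) -> str:
    """Map standard codes onto hole codes so each alphabet is contiguous internally."""
    return external.translate(STANDARD_TO_HOLE)


internal = chr(0x1D4A3) + " = 0"
print([f"U+{ord(c):04X}" for c in export_text(internal)])
```

Using the hole codes internally keeps each alphabet contiguous, so offset arithmetic works without special cases, while exported text carries only the standard codes.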
12 Nonstandard Characters
- People will always invent new math characters that aren't yet standardized.
- Use the private use area for these with a higher-level marking that these are for math.
- This approach can lead to collisions in the math community (unless a standard is maintained)
- Cut/copy in plain text can have collisions with other uses of the private use area
13 Unicode and Markup
- Unicode was never intended to represent all aspects of text
- Language attribute: sort order, word breaks
- Rich (fancy) text formatting: built-up fractions
- Content tags: headings, abstract, author, figure
- Glyph variants: Poetica font (58 ampersands), Mantinia font (novel ligatures: TT, TE, etc.)
- MathML adds XML tags for math constructs, but seems awfully wordy
14 Unicode Plain Text
- Can do a lot with plain text, e.g., BiDi
- Grey zone: use of embedded codes
- Unicode ascribes semantics to characters, e.g., paragraph mark, right-to-left mark
- Lots of interesting punctuation characters in range U+2000 to U+204F
- Extensive character semantics/properties tables, including mathematical, numerical
15 Unicode Character Semantics
- Math characters have the math property
- Math characters are numeric, variable, or operator, but not a combination
- Properties are useful in parsing math plain text
- MathML doesn't use these properties: every quantity is explicitly tagged
- Properties still can be useful for inputting text for MathML (no one wants to type all those tags!)
- Sometimes default properties need to be overruled
- Would be useful to have more math properties
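As a rough illustration of property-driven parsing, the sketch below sorts characters into numeric, variable, or operator. It approximates the math property with the Unicode general category, which is an assumption made here for brevity. The function name `math_role` and the extra ASCII operator list are likewise made up.

```python
import unicodedata


def math_role(ch: str) -> str:
    """Classify one character as numeric, variable, operator, or other."""
    category = unicodedata.category(ch)
    if category == "Nd":               # decimal digits, including math digits
        return "numeric"
    if category.startswith("L"):       # letters, including math alphanumerics
        return "variable"
    if category == "Sm" or ch in "-/*":  # math symbols plus a few ASCII operators
        return "operator"
    return "other"


for ch in "x+7\u2211":
    print(f"U+{ord(ch):04X}", math_role(ch))
```

A parser built on such roles would still need a way to overrule the default, for example when a letter is used as an operator.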
16 Plain Text Encoding
- TeX fraction numerator is what follows a { up to the keyword \over
- Denominator is what follows the \over up to the matching }
- { } are not printed
- Simple rules give unambiguous plain text, but results don't look like math
- How to make a plain text that looks like math?
17 Simple plain text encoding
- Simple operand is a span of alphanumeric characters
- E.g., a simple numerator or denominator is terminated by any operator
- Operators include arithmetic operators, most whitespace characters, all U+22xx, an argument break operator (displayed as small raised dot), and sub/superscript operators
- Fraction operator is given by the Unicode fraction slash operator U+2044
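The simple-operand rule can be sketched as a scan outward from the fraction slash, stopping at the first non-alphanumeric character on each side. This is a minimal illustration with a hypothetical function name. It treats every non-alphanumeric character as an operator and does not handle parenthesized operands.

```python
from typing import Optional, Tuple

FRACTION_SLASH = "\u2044"


def simple_fraction(text: str) -> Optional[Tuple[str, str]]:
    """Return (numerator, denominator) around the first fraction slash, or None."""
    slash = text.find(FRACTION_SLASH)
    if slash < 0:
        return None
    start = slash
    while start > 0 and text[start - 1].isalnum():
        start -= 1            # extend the numerator leftward
    end = slash + 1
    while end < len(text) and text[end].isalnum():
        end += 1              # extend the denominator rightward
    return text[start:slash], text[slash + 1:end]


print(simple_fraction("x+abc\u2044d-1"))
```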
18 Fractions
- abc/d gives the built-up fraction of abc over d
- More complicated operands use parentheses ( ), brackets [ ], or braces { }
- Outermost parens aren't displayed in built-up form
- E.g., plain text (a + c)/d displays as the built-up fraction of a + c over d
- Easier to read than TeX's, e.g., {a + c \over d}
- MathML: <mfrac><mrow><mi>a</mi><mo>+</mo><mi>c</mi></mrow><mrow><mi>d</mi></mrow></mfrac>
- Neat feature: plain text looks like math
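To make the verbosity comparison concrete, the sketch below turns a plain-text fraction into MathML presentation markup and compares lengths. It is a toy under stated assumptions: a single top-level slash, single-character tokens, no XML escaping, and hypothetical function names.

```python
def strip_outer_parens(operand: str) -> str:
    """Drop one pair of parentheses if it encloses the whole operand."""
    if operand.startswith("(") and operand.endswith(")"):
        depth = 0
        for i, ch in enumerate(operand):
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth == 0 and i < len(operand) - 1:
                    return operand  # first "(" closes early, e.g. "(a)(b)"
        return operand[1:-1]
    return operand


def operand_to_mathml(operand: str) -> str:
    """Tag each character as a number, identifier, or operator."""
    parts = []
    for ch in operand:
        if ch.isspace():
            continue
        tag = "mn" if ch.isdigit() else "mi" if ch.isalpha() else "mo"
        parts.append(f"<{tag}>{ch}</{tag}>")
    return "<mrow>" + "".join(parts) + "</mrow>"


def fraction_to_mathml(plain: str) -> str:
    numerator, denominator = plain.split("/", 1)
    return ("<mfrac>"
            + operand_to_mathml(strip_outer_parens(numerator.strip()))
            + operand_to_mathml(strip_outer_parens(denominator.strip()))
            + "</mfrac>")


plain = "(a + c)/d"
markup = fraction_to_mathml(plain)
print(markup)
print(len(plain), "characters of plain text versus", len(markup), "of MathML")
```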
19 Subscripts and Superscripts
- Unicode has numeric subscripts and superscripts along with some operators (U+2070-U+208E)
- Others need some kind of markup like <msup></msup>
- With special subscript and superscript operators (not yet in Unicode), these scripts can be nested
- Use parentheses as for fractions to overrule built-in precedence order
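The existing script characters can be reached with a simple translation table. The sketch below covers digits and the operators + - = ( ), and the function names are hypothetical. Note that superscript 1, 2, and 3 come from U+00B9, U+00B2, and U+00B3 rather than from the U+2070 block.

```python
PLAIN = "0123456789+-=()"

# Superscripts: U+2070, then U+00B9/U+00B2/U+00B3, then U+2074-U+207E.
SUPER_CODES = [0x2070, 0x00B9, 0x00B2, 0x00B3] + list(range(0x2074, 0x207F))
# Subscripts are contiguous: U+2080-U+208E.
SUB_CODES = list(range(0x2080, 0x208F))

TO_SUPER = str.maketrans(PLAIN, "".join(chr(c) for c in SUPER_CODES))
TO_SUB = str.maketrans(PLAIN, "".join(chr(c) for c in SUB_CODES))


def superscript(text: str) -> str:
    """Map digits and + - = ( ) to superscript characters; leave the rest."""
    return text.translate(TO_SUPER)


def subscript(text: str) -> str:
    """Map digits and + - = ( ) to subscript characters; leave the rest."""
    return text.translate(TO_SUB)


print("E = mc" + superscript("2"))
print("x" + subscript("12"))
```

Anything outside this small set, such as a letter in a subscript, is left unchanged here, which is where markup or dedicated script operators would be needed.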
20 Presentation markup
- Presentation markup directs how the math should be rendered.
<mrow> <mi>E</mi> <mo>=</mo> <mrow>
<mi>m</mi> <mo>&InvisibleTimes;</mo>
<msup> <mi>c</mi> <mn>2</mn>
</msup> </mrow> </mrow>
21 Content markup
- Content markup describes the meaning of the expression, not the format.
<reln> <eq/> <ci>E</ci> <apply>
<times/> <ci>m</ci> <apply>
<power/> <ci>c</ci>
<cn>2</cn> </apply>
</apply> </reln>
23 Unicode TeX Example
24 Symbol Entry
- GUI PCs can display myriad glyphs, mathematics symbols, and international characters
- Hard to input special symbols. Menu methods are slow. Hot keys are great but hard to learn
- Reexamine and improve symbol-input and storage methods
- With left/right Ctrl/Alt keys, PC keyboard gives direct access to 600 symbols. Maximum possible: 2^100 ≈ 10^30
- Use on-screen, customizable keyboards and symbol boxes
- Drag and drop any symbol into apps or onto keyboards
25 Hex to Unicode Input Method
- Type Unicode character hexadecimal code
- Make corrections as need be
- Type Alt+x to convert to character
- Type Alt+x to convert back to hex (useful especially for missing glyph character)
- Resolve ambiguities by selection
- Input higher-plane chars using 5 or 6-digit code
- New MS Word standard
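The two directions of the conversion are simple to sketch. The functions below only illustrate the mapping itself, not how any editor implements the Alt+x command. The names are hypothetical.

```python
def hex_to_char(hex_code: str) -> str:
    """Convert a 4- to 6-digit hexadecimal code to its Unicode character."""
    value = int(hex_code, 16)
    if value > 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return chr(value)


def char_to_hex(ch: str) -> str:
    """Convert a character back to its hexadecimal code (at least 4 digits)."""
    return format(ord(ch), "04X")


print(char_to_hex(hex_to_char("2260")))   # BMP character, 4 digits
print(char_to_hex(hex_to_char("1D400")))  # higher-plane character, 5 digits
```

The ambiguity mentioned above arises because letters such as A to F are themselves hex digits, so an editor needs a selection to know where the code starts.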
26 Built-Up Formula Heuristics
- Math characters identify themselves and neighbors as math
- E.g., fraction (U+2044), ASCII operators, U+2200-U+22FF, and U+20D0-U+20FF identify neighbors as mathematical
- Math characters include various English and Greek alphabets
- When heuristics fail, user can select math mode: WYSIWYG instead of visible math on/off codes
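A first cut at such a heuristic is a range test on each character. The sketch below uses the ranges listed above. The set of ASCII operators is an assumption, since the slide does not enumerate them, and the function names are made up.

```python
# Assumed set of ASCII operators; the slide does not list them.
ASCII_MATH_OPERATORS = set("+*/=<>^")


def is_math_trigger(ch: str) -> bool:
    """True if the character marks itself and its neighbors as mathematical."""
    cp = ord(ch)
    return (
        ch == "\u2044"                 # fraction slash
        or ch in ASCII_MATH_OPERATORS
        or 0x2200 <= cp <= 0x22FF      # mathematical operators
        or 0x20D0 <= cp <= 0x20FF      # combining marks for symbols
    )


def looks_like_math(text: str) -> bool:
    return any(is_math_trigger(ch) for ch in text)


print(looks_like_math("a\u2044b"))
print(looks_like_math("hello world"))
```

The hyphen-minus is left out of the assumed operator set because it is so common in ordinary text, which is the kind of case where a heuristic fails and an explicit math mode helps.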
27 Operator Precedence
- Everyone knows that multiply takes precedence over add, e.g., 3+5×3 = 18, not 24
- C-language precedence is too intricate for most programmers to use extensively
- TeX doesn't use precedence: relies on { } to define operator scope
- In general, ( ) can be used to clarify or overrule precedence
- Precedence reduces clutter, so some precedence is desirable (else things look like LISP!)
- But keep it simple enough to remember easily
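A small precedence table is enough to reproduce the 18-versus-24 example. The evaluator below is a minimal precedence-climbing sketch for integers with + - × / and parentheses. It is an illustration, not the scheme used for math layout, and it does no error handling.

```python
import re

PRECEDENCE = {"+": 1, "-": 1, "×": 2, "/": 2}
APPLY = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "×": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}


def evaluate(text: str):
    tokens = re.findall(r"\d+|[+\-×/()]", text)
    pos = 0

    def primary():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            value = expression(1)
            pos += 1  # skip the closing ")"
            return value
        return int(token)

    def expression(min_precedence):
        nonlocal pos
        left = primary()
        while (pos < len(tokens) and tokens[pos] in PRECEDENCE
               and PRECEDENCE[tokens[pos]] >= min_precedence):
            op = tokens[pos]
            pos += 1
            # Operands of a tighter-binding operator are grouped first.
            right = expression(PRECEDENCE[op] + 1)
            left = APPLY[op](left, right)
        return left

    return expression(1)


print(evaluate("3+5×3"))    # multiply binds tighter than add
print(evaluate("(3+5)×3"))  # parentheses overrule precedence
```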
28 Layout Operator Precedence
- Subscript, superscript
- Integral, sum: ∫ ∑ ∏
- Functions: √
- Times, divide: /
- Other operators: Space " . , - Tab
- Right brackets: )
- Left brackets: (
- End of paragraph: FF EOP
29 Mathematics as a Programming Language
- Fortran made great steps in getting computers to understand mathematics
- Java and C accept Unicode variable names
- C++ has preprocessor and operator overloading, but needs extensions to be really powerful
- Use Unicode characters including math alphanumerics
- Use plain-text encoding of mathematical expressions
- Can't use all mathematical expressions as code, but can go much further than current languages go
- When to multiply? In abstract, multiplication is infinitely fast and precise, but not on a computer
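As a small illustration of Unicode variable names, Python (chosen here only as an example language) accepts Greek letters in identifiers. The function below writes the integrand of the Hamiltonian formula from the Legibility Loss slide. The function name and the sample values are made up.

```python
def energy_density(ε: float, E: float, μ: float, H: float) -> float:
    """Integrand of the Hamiltonian formula: εE² + μH²."""
    return ε * E**2 + μ * H**2


print(energy_density(ε=2.0, E=3.0, μ=0.5, H=4.0))
```

Python normalizes identifiers with NFKC, so a math italic H and a plain H would name the same variable. A language that wanted to keep math alphabets distinct would have to skip that folding.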
30
void IHBMWM(void)
{
    gammap = gamma*sqrt(1 + I2);
    upsilon = cmplx(gamma+gamma1, Delta);
    alphainc = alpha0*(1-(gamma*gamma*I2/gammap)/(gammap + upsilon));
    if (!gamma1 && fabs(Delta*T1) < 0.01)
        alphacoh = -half*alpha0*I2*pow(gamma/gammap, 3);
    else
    {
        Gamma = 1/T1 + gamma1;
        I2sF = (I2/T1)/cmplx(Gamma, Delta);
        betap2 = upsilon*(upsilon + gamma*I2sF);
        beta = sqrt(betap2);
        alphacoh = 0.5*gamma*alpha0*(I2sF*(gamma + upsilon)
            /(gammap*gammap - betap2))
            *((1+gamma/beta)*(beta - upsilon)/(beta + upsilon)
            - (1+gamma/gammap)*(gammap - upsilon)/(gammap + upsilon));
    }
    alpha1 = alphainc + alphacoh;
}
33 Conclusions
- Unicode provides great support for math in both marked up and plain text
- Unicode character properties facilitate plain-text encoding of mathematics but aren't used in MathML
- Heuristics allow plain text to be built up
- Need two more Unicode assignments: subscript and superscript operators
- On-screen keyboards and symbol boxes aid formula entry
- Unicode math characters could be useful for programming languages