Globalization Gotchas - PowerPoint PPT Presentation

1
Globalization Gotchas

Mark Davis

2
Unicode Basics

Unicode encodes characters, not glyphs.
U+0067 → g g g g g g g g g g g g g ... (one code point, many glyph shapes)
Unicode does not encode characters by language.
French, German, and English j have the same code point even though all have different pronunciations.
Chinese 大 (da) has the same code point as Japanese 大 (dai).
UTF-8, UTF-16, and UTF-32 are all Unicode.
The word "character" means different things to different people: make clear which one you mean.
glyphs, code points, bytes, code units, user-perceived characters (grapheme clusters), ...

3
Unicode in APIs

U+0000 to U+10FFFF: be prepared to handle (or at least not corrupt!) any incoming code points.
A back-level system may get unassigned code points from later versions.
Watch for "UCS-2" implementations. They use UTF-16 text, but don't support characters above U+FFFF; they may also accidentally produce isolated surrogates.
Some APIs/protocols count lengths in code points, and others in bytes (or other code units). Make sure you don't mix them up.
Don't limit API parameters to a single character (and definitely not to a single code unit!).
What users think of as a single character (e.g. an x with a combining mark, or ch) may be a sequence in Unicode.
Use the latest version of Unicode: it supports new characters, corrections, and more stability guarantees.
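
The length mismatch between code points, bytes, and code units can be seen in a minimal sketch. The sample string is a hypothetical choice for illustration; the encodings are the standard Python codec names.

```python
# Hypothetical sample: "a" + COMBINING DIAERESIS + MUSICAL SYMBOL EIGHTH NOTE.
text = "a\u0308\U0001D160"

code_points = len(text)                           # Python str length counts code points
utf8_bytes = len(text.encode("utf-8"))            # length in UTF-8 bytes
utf16_units = len(text.encode("utf-16-le")) // 2  # length in 16-bit UTF-16 code units

print(code_points, utf8_bytes, utf16_units)       # prints: 3 7 4
```

A user would probably count two "characters" here, which is a fourth answer again.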

4
Choice of Characters

Character and block names may be misleading, e.g., U+034F COMBINING GRAPHEME JOINER doesn't join graphemes. → http://www.unicode.org/faq/
Use U+2060 (word joiner) instead of U+FEFF (zero-width no-break space) for everything but the BOM function.
Never use unassigned code points: those will be used in future versions of Unicode.
Only use private use (PUA) or non-characters (and only if necessary).
If you do, minimize the opportunity for collision by picking an unusual range.

5
Character Conversion

Always use "shortest form" UTF-8.
It's the Law.
And if that isn't enough, consider security attacks.
If a protocol allows a choice of charsets, always tag correctly.
Not all text is correctly tagged: character detection may be necessary. But remember, it's always a guess!
Converting a database of mixed, untagged data is extremely painful.
Bad assumptions:
Length in bytes = N × length in code points
1 character in charset X = 1 character in Unicode
The ordering may also be different.
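
As a small illustration of the shortest-form rule, the sketch below checks that a conforming decoder (here Python's built-in UTF-8 codec) rejects an overlong encoding of "/". The helper name `decodes_as_utf8` is made up for this example.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the bytes are well-formed (shortest-form) UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

shortest_slash = b"/"          # 0x2F, the only legal encoding of U+002F
overlong_slash = b"\xc0\xaf"   # non-shortest two-byte form of the same code point

print(decodes_as_utf8(shortest_slash))  # True
print(decodes_as_utf8(overlong_slash))  # False

# The byte length also depends on the charset: one character, two lengths.
e_acute = "\u00e9"
print(len(e_acute.encode("latin-1")), len(e_acute.encode("utf-8")))  # 1 2
```

A decoder that accepted the overlong form would let "/" slip past a filter that only looks for the byte 0x2F, which is the kind of attack the slide alludes to.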

6
Character Conversion II

IANA / MIME charset names are ill-defined: vendors often convert the same charset in different ways.
Shift-JIS 0x5C → U+005C (\) or U+00A5 (¥)
Don't simply omit unconvertible data. To reduce security problems, at least substitute
U+FFFD (when converting to Unicode) or
0x1A (when converting to bytes).
→ http://www.w3.org/TR/japanese-xml/
→ http://icu.sourceforge.net/charts/charset/
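
The substitution advice can be sketched with Python's codec error handlers. The built-in "replace" handler already yields U+FFFD when decoding; for encoding it yields "?", so this sketch registers a custom handler. The handler name "sub1a" and the sample strings are assumptions made for illustration.

```python
import codecs

def substitute_sub(error):
    """Replace each unencodable character with 0x1A (SUB) instead of dropping it."""
    if isinstance(error, UnicodeEncodeError):
        return ("\x1a" * (error.end - error.start), error.end)
    raise error

# "sub1a" is a hypothetical handler name chosen for this sketch.
codecs.register_error("sub1a", substitute_sub)

# Bytes -> Unicode: the invalid byte 0xFF becomes U+FFFD.
decoded = b"abc\xffdef".decode("utf-8", errors="replace")

# Unicode -> bytes: characters outside ASCII become 0x1A.
encoded = "caf\u00e9 \u4e2d".encode("ascii", errors="sub1a")

print(repr(decoded))  # 'abc\ufffddef'
print(encoded)        # b'caf\x1a \x1a'
```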

7
Properties

Use properties such as Alphabetic, not hard-coded lists:
isAlphabetic(x), or the regex \p{Alphabetic} or [:Alphabetic:]
Not (A ≤ x ≤ Z OR a ≤ x ≤ z)
Some properties aren't what you think. Use:
White_Space, not General_Category=Zs
Alphabetic, not General_Category=L
Lowercase, not General_Category=Ll
Script=Greek, not Block=Greek
Characters may change property values between versions of Unicode.
→ http://unicode.org/standard/stability_policy.html
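
A minimal sketch of why hard-coded ranges and General_Category are poor stand-ins for properties. It uses only the Python standard library, which does not expose Alphabetic or White_Space directly; `str.isalpha()` and `str.isspace()` are used here as approximations, and `is_ascii_letter` is a hypothetical helper showing the pattern to avoid.

```python
import unicodedata

def is_ascii_letter(ch: str) -> bool:
    """The hard-coded test the slide warns against."""
    return "A" <= ch <= "Z" or "a" <= ch <= "z"

e_acute = "\u00e9"
print(is_ascii_letter(e_acute))    # False: the hard-coded list misses it
print(e_acute.isalpha())           # True: a property-based test finds it

tab = "\t"
print(unicodedata.category(tab))   # Cc: not Zs, so a General_Category=Zs test misses it
print(tab.isspace())               # True: treated as white space
```

For the real Alphabetic, White_Space, and Script properties, a library with full property support (such as ICU, mentioned later in this deck) is the safer choice.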

8
Identifiers & Tokens

When designing syntax, use as a base:
Pattern_Syntax for operators / relations
Pattern_Whitespace for gaps
XID_Start and XID_Continue for identifiers.
All are backwards compatible across versions.
Profiles may expand or narrow from the base.
Watch out for security attacks:
paypal.com with a Cyrillic a
→ See Unicode Security at this conference
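
Both points can be sketched in a few lines. Python's own identifier syntax is based on XID_Start and XID_Continue, so `str.isidentifier()` serves as a ready-made example. For the spoofing case, the standard library has no Script property, so the hypothetical helper `scripts_in` uses the first word of each character name as a crude stand-in.

```python
import unicodedata

# Identifier syntax built on XID_Start / XID_Continue.
print("caf\u00e9".isidentifier())  # True
print("2x".isidentifier())         # False: a digit is not XID_Start

def scripts_in(label: str) -> set:
    """Crude script guess: first word of each letter's character name (an assumption)."""
    return {unicodedata.name(ch).split()[0] for ch in label if ch.isalpha()}

genuine = "paypal"
spoofed = "p\u0430ypal"            # U+0430 CYRILLIC SMALL LETTER A

print(genuine == spoofed)          # False, although they look alike
print(scripts_in(genuine))         # {'LATIN'}
print(sorted(scripts_in(spoofed))) # ['CYRILLIC', 'LATIN']
```

A mixed-script label is only a hint, not proof of an attack; real checks belong in a dedicated security profile.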

9
Comparison (Collation): Searching, Sorting, Matching

There are two binary orders:
code point order = UTF-8 order = UTF-32 order
≠ UTF-16 order
Don't present users with binary order!
No users expect A < Z < a < z < Ç < ä.
Apply normalization to get a unique form, so that the precomposed and decomposed forms of Å compare equal.
Security issues: protocols must precisely define the comparison operations.
E.g., LDAP doesn't, so lookup may fail (or falsely succeed!).
Aside from wrong results, this is an opening for security attacks.
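
The two binary orders, and the effect of normalization on equality, can be checked with a short sketch. The two sample characters are chosen for illustration: one sits just below the surrogate-encoded range, the other is the first supplementary code point.

```python
import unicodedata

bmp_char = "\uff5e"        # U+FF5E, encoded as one UTF-16 code unit
supp_char = "\U00010000"   # U+10000, encoded as the surrogate pair D800 DC00
items = [supp_char, bmp_char]

by_code_point = sorted(items)
by_utf8 = sorted(items, key=lambda s: s.encode("utf-8"))
by_utf16 = sorted(items, key=lambda s: s.encode("utf-16-be"))

print(by_code_point == by_utf8)   # True: same binary order
print(by_code_point == by_utf16)  # False: surrogates sort below U+FF5E

# Normalization gives a unique form, so both spellings of Å match.
precomposed = "\u00c5"
decomposed = "A\u030a"
print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```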

10
Language-Sensitive Comparison

Use UCA order as a base to meet user expectations:
a < A < ä < Ç C? < z < Z
Real language-sensitive order requires tailoring on top of UCA: ordering depends on context and language.
china < China < chinas < danish
ae < æ < af
z < æ (Danish)
c < d < ... h < ch < i (Slovak)
Follow UCA for substring match offsets: there are some gotchas here.
Don't mix up "stable" and "deterministic" sorting: they are very different.
→ http://unicode.org/reports/tr10/
→ http://unicode.org/cldr
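
The difference between "stable" and "deterministic" can be sketched as follows. The case-folded key is only a stand-in for a real UCA collation key (an assumption of this example), chosen because it makes "a" and "A" compare as equal.

```python
def stable_sort(words):
    """Stable: items with equal keys keep their input order."""
    return sorted(words, key=str.casefold)

def deterministic_sort(words):
    """Deterministic: a tie-breaker makes the result independent of input order."""
    return sorted(words, key=lambda w: (w.casefold(), w))

first = ["a", "A", "b"]
second = ["A", "a", "b"]

print(stable_sort(first), stable_sort(second))
# ['a', 'A', 'b'] ['A', 'a', 'b']   (differs with the input order)

print(deterministic_sort(first), deterministic_sort(second))
# ['A', 'a', 'b'] ['A', 'a', 'b']   (always the same)
```

Stability is a property of the sorting algorithm; determinism is a property of the comparison. Either can be present without the other.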

11
Normalization (NFC, ...)

Standardized normalized forms are defined by Unicode.
The ordering of accents in a normalization form may not be the typical type-in order.
Fonts should handle both orders.
Normalization is context independent.
Don't assume NFC(x + y) = NFC(x) + NFC(y).
People assume that NFC always composes, but some characters decompose in NFC.
Trivia: in Unicode 4.1 there are exactly 3 characters that are different in all 4 normalization forms: ϓ (U+03D3), ϔ (U+03D4), ẛ (U+1E9B)
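
These normalization gotchas can be reproduced with the standard `unicodedata` module. The sample characters below are picked for illustration; the helper `nfc` is just shorthand.

```python
import unicodedata

def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)

# NFC is not closed under concatenation.
x, y = "a", "\u0301"                 # "a" and COMBINING ACUTE ACCENT
joined = nfc(x + y)                  # composes to U+00E1
pieces = nfc(x) + nfc(y)             # stays as two code points
print(joined == pieces)              # False

# NFC can decompose: U+0958 DEVANAGARI LETTER QA is excluded from composition.
print(len(nfc("\u0958")))            # 2

# U+1E9B differs in all four normalization forms.
forms = {unicodedata.normalize(f, "\u1e9b") for f in ("NFC", "NFD", "NFKC", "NFKD")}
print(len(forms))                    # 4
```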

12
Maximum Expansion (U4.1)
Operation      UTF      Factor   Sample
NFC            8        3X       U+1D160
NFC            16, 32   3X       U+FB2C
NFD            8        3X       U+0390
NFD            16, 32   4X       U+1F82
NFKC / NFKD    8        11X      U+FDFA
NFKC / NFKD    16, 32   18X      U+FDFA
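
The factors in the table can be recomputed for the listed samples. This is a small sketch; `expansion` is a hypothetical helper that compares encoded lengths before and after normalization.

```python
import unicodedata

def expansion(ch: str, form: str, encoding: str) -> float:
    """Ratio of encoded length after normalization to encoded length before."""
    before = len(ch.encode(encoding))
    after = len(unicodedata.normalize(form, ch).encode(encoding))
    return after / before

print(expansion("\U0001d160", "NFC", "utf-8"))   # 3.0
print(expansion("\ufb2c", "NFC", "utf-16-le"))   # 3.0
print(expansion("\u0390", "NFD", "utf-8"))       # 3.0
print(expansion("\u1f82", "NFD", "utf-32-le"))   # 4.0
print(expansion("\ufdfa", "NFKD", "utf-8"))      # 11.0
print(expansion("\ufdfa", "NFKD", "utf-16-le"))  # 18.0
```

These worst cases are what a buffer sized for normalized output has to allow for.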
13
Case Conversion

Not a simple 1:1 mapping:
Title case: ? ? ? ? ?
Expansion: heiß → HEISS → heiss
Context-dependent: ΌΣΟΣ → όσος
Language-dependent: istanbul → İSTANBUL (Turkish)
Warning: never use language-dependent casing for language-independent structures, like file-system B-Trees.
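
Most of these cases can be observed with Python's built-in string methods, which apply the language-independent full case mappings. The sample strings are chosen for illustration; the Greek word is written with escapes so the code points are unambiguous.

```python
# Expansion: one character becomes two, and the round trip loses the original.
upper = "hei\u00df".upper()
print(upper, upper.lower())             # HEISS heiss

# Context-dependent: capital sigma lowercases to final sigma at the end of a word.
greek = "\u038c\u03a3\u039f\u03a3"
print(greek.lower() == "\u03cc\u03c3\u03bf\u03c2")  # True

# Title case is a third case: U+01C6 (dž) titlecases to U+01C5, not U+01C4.
print("\u01c6".title() == "\u01c5")     # True

# Language-independent mappings: "i" uppercases to "I", never to U+0130,
# and U+0130 lowercases to "i" plus COMBINING DOT ABOVE.
print("i".upper())                      # I
print("\u0130".lower() == "i\u0307")    # True
```

Because these methods ignore locale, they suit language-independent structures; Turkish or Lithuanian casing needs a locale-aware library.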

14
Casing Maximum Expansion
Operation              UTF          Factor   Sample
Lower                  8            1.5X     U+023A
Lower                  16, 32       1X       U+0041
Upper / Title / Fold   8, 16, 32    3X       U+0390
15
Case Conversion II

Case folding was not stable.
Different results from toCaseFold(S) between two versions.
Stability is now guaranteed in Unicode 5.0.
Don't use the Lowercase_Letter (Ll) or Uppercase_Letter (Lu) values of General_Category.
These were constrained to be in a partition.
Use the separate binary properties Lowercase and Uppercase instead.

16
Lowercase / Uppercase: Form vs Function

Lowercase, the binary property:
The character is lowercase in form, but not necessarily in function.
Functionally lowercase:
isCased(x) and isLowercase(x).
See Section 3.13 of TUS.
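
A rough sketch of the functional test, modeled on the Section 3.13 definitions: a string is lowercase if lowercasing its NFD form changes nothing, and cased if at least one of the lowercase, uppercase, or titlecase tests fails. The function names are illustrative, and Python's `str` methods stand in for the standard's toLowercase, toUppercase, and toTitlecase.

```python
import unicodedata

def _nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)

def is_lowercase(s: str) -> bool:
    y = _nfd(s)
    return y.lower() == y

def is_uppercase(s: str) -> bool:
    y = _nfd(s)
    return y.upper() == y

def is_titlecase(s: str) -> bool:
    y = _nfd(s)
    return y.title() == y

def is_cased(s: str) -> bool:
    return not (is_lowercase(s) and is_uppercase(s) and is_titlecase(s))

def functionally_lowercase(s: str) -> bool:
    return is_cased(s) and is_lowercase(s)

print(functionally_lowercase("a"))        # True
print(functionally_lowercase("A"))        # False
print(functionally_lowercase("1"))        # False: not cased at all
print(functionally_lowercase("\u2170"))   # True, although its General_Category is Nl
print(functionally_lowercase("\u02e0"))   # False: lowercase in form only
```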

17
Lowercase: Form vs Function
(LC = Lowercase property, F. LC = functionally lowercase, Ll = General_Category Ll)
LC   F. LC   Ll   Count (U4.1)   Example
Y    N       N    114            ˠ U+02E0 MODIFIER LETTER SMALL GAMMA
Y    N       Y    705            ª U+00AA FEMININE ORDINAL INDICATOR
Y    Y       N    43             ⅰ U+2170 SMALL ROMAN NUMERAL ONE
Y    Y       Y    903            a U+0061 LATIN SMALL LETTER A
18
Segmentation

What a user thinks of as a character is often a sequence.
Words are not just sequences of letters.
Lines don't just break at spaces.
All may be language-dependent.
→ http://www.unicode.org/reports/tr14/
→ http://www.unicode.org/reports/tr29/
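
To make the first point concrete, here is a deliberately crude approximation that attaches combining marks to the preceding base character. It is not the grapheme cluster algorithm from the report linked above: it ignores Hangul syllable sequences, joiners, and marks with combining class 0. The function name `crude_clusters` is made up for this sketch.

```python
import unicodedata

def crude_clusters(text: str) -> list:
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch      # a mark joins the previous cluster
        else:
            clusters.append(ch)     # anything else starts a new cluster
    return clusters

sample = "e\u0301x"                 # "e" + COMBINING ACUTE ACCENT + "x"
print(len(sample))                  # 3 code points
print(len(crude_clusters(sample)))  # 2 user-perceived characters
```

Production code should use a library that implements the full segmentation rules rather than this shortcut.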

19
Transliteration

Transliteration: Ελληνικά → Elleniká
Translation: Ελληνικά → Greek
Transliteration may vary by language:
Путин → Putin, Poutine, ...
Горбачёв → Gorbachev, Gorbacev, Gorbatchev, Gorbacëv, Gorbachov, Gorbatsov, Gorbatschow, ...
Watch for terminology: lossy vs lossless.
Lossy transliteration: Ελληνικά → Ellinika → Ελλινικα
In ISO terms: transliteration = lossless transliteration; transcription = lossy transliteration.
→ http://unicode.org/draft/reports/tr35/tr35.html

20
Rendering is Contextual
Processing character-by-character gives the wrong
results!

Glyphs may change shape
Multiple characters → 1 glyph
One character → multiple glyphs

21
Rendering II

Good rendering systems will handle customary type-in order for text plus canonical order.
Excellent ones will do any canonically-equivalent order, but those are rare.
There may be differences in the customary glyphs for different languages: specify the font or the language where they have to be distinguished.
Security issues:
Never render a missing glyph as "?".
Don't simply overlay diacritics: it can cause security problems.
→ http://www.unicode.org/notes/tn2/
→ http://unicode.org/reports/tr14/

22
Globalization

Unicode ≠ Globalization (aka Internationalization, Localizability)
Unicode provides the basis for software globalization, but there's more work to be done...
Use globalization APIs: formatting and parsing of dates, times, numbers, currencies; comparison of text; calendar systems; ... are locale-dependent.
Where OS facilities are not adequate or cross-platform solutions are needed, use ICU (C, C++, Java).
Don't put any translatable strings into your code: separate them into resource files.
Provide context to translators: is "Mark" a noun, a verb, or a name?
Don't use the same string in different contexts unless the meaning is identical (including references).
Note: User-interface language (menus, dialogs, help system, ...) ≠ data language (body text, spreadsheet cells).
Programs need to handle, as data, more languages than the localized UI supports.

23
Common Globalization Mistakes

Never compile Windows apps as ANSI (the default!).
Don't simply concatenate strings to make messages.
Order of components differs by language: use Java MessageFormat, or structure the UI as separate fields.
Don't assume icons and symbols mean the same around the world. Don't assume everyone can read the Latin alphabet.
Allocate space flexibly: OK in English → Aceptar in Spanish.
English is a relatively compact language; others may require more characters (e.g. in database fields) and more screen real estate (in UIs).
Beware of discrepancies in fallback behavior: Java ResourceBundle (J2SE), Java Standard Tag Library (JSTL), Java Server Faces (JSF), Apache HTTP, ...
→ http://unicode.org/cldr/
→ http://ibm.com/software/globalization/icu/
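
The concatenation problem can be sketched with named placeholders. The slide recommends Java MessageFormat; the Python `str.format` templates below are only a stand-in for the same idea, and the two-entry `TEMPLATES` catalog is a hypothetical example (real catalogs live in resource files and also need plural handling).

```python
# Hypothetical message catalog: each language orders the components its own way.
TEMPLATES = {
    "en": "{user} deleted {count} files.",
    "de": "{count} Dateien wurden von {user} gelöscht.",
}

def render(locale: str, **values) -> str:
    """Fill the whole-sentence template for a locale; never glue fragments together."""
    return TEMPLATES[locale].format(**values)

print(render("en", user="Mark", count=3))  # Mark deleted 3 files.
print(render("de", user="Mark", count=3))  # 3 Dateien wurden von Mark gelöscht.
```

Concatenating user + " deleted " + count + " files." would hard-code the English word order and leave the translator nothing workable.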

24
Neutral Formats

Store and transmit neutral-format data wherever
possible. Convert that data to the user's
preferred formats as "close" to the user as
possible.
Type              Example               Rec. Standard
Language/Locale   en-US (en_US)         RFC 3066 bis / CLDR
Territory         AU                    RFC 3066 bis
Currency          EUR                   ISO 4217
Timezone          Australia/Melbourne   TZDB
Calendar          islamic-civil         CLDR Calendar ID
Custom Date       yyyy-mmm-dd           CLDR Pattern Format
Binary Time       8C80E9E3967A4B0       Windows File Time

25
Identification

Locale IDs are extensions of language IDs: use CLDR. → http://unicode.org/cldr/
Don't assume that everyone in a country always uses that country's currency. Always use an explicit currency ID (ISO 4217).
<RUR, 1.23457×10³> → 1 234,57р. in Russian, but Rub 1,234.57 in English.
Don't assume the timezone ID is implied by the user's locale. For the best timezone information, use the TZ database; use CLDR for timezone names. → http://www.twinsun.com/tz/tz-link.htm
If you heuristically compute territory IDs, timezone IDs, currency IDs, etc. (e.g., from browser settings), make sure the user can override that and pick an explicit value.

26
Unicode Guide

Authoritative but lightweight
Introduction, overview, and quick reference
Main principles of the Unicode Standard
Best practices in Software Globalization

27
Other Resources

Unicode Site
http://unicode.org
An Overview of ICU
http://icu.sourceforge.net/docs/papers/icu_overview_latest.ppt
Globalizing Software
http://icu.sourceforge.net/docs/papers/globalizing_software.ppt
W3C Internationalization
http://www.w3.org/International/
Microsoft Global Software Development
http://www.microsoft.com/globaldev/default.asp

28
Q&A
29
Backup Slides
30
User Input

If you develop your own text editor, use the OS
APIs to handle IMEs (Input Method Engines) for
Chinese, Japanese, Korean,...
If you are using "type-ahead" to get to a
position in a list (eg typing "Jo" gets to the
first element starting with those characters),
allow arbitrary input. This is often easiest with
visible fields.
If your password field can contain characters
that require an IME, a screen pop-up box may
reveal the password to onlookers.

31
Dotted and Dotless I
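Source           Normal uppercase   Normal lowercase   Turkic uppercase   Turkic lowercase
i  U+0069        I  U+0049          i  U+0069          İ  U+0130          i  U+0069
ı  U+0131        I  U+0049          ı  U+0131          I  U+0049          ı  U+0131
I  U+0049        I  U+0049          i  U+0069          I  U+0049          ı  U+0131
İ  U+0130        İ  U+0130          U+0069 U+0307      İ  U+0130          i  U+0069
U+0069 U+0307    U+0049 U+0307      U+0069 U+0307      U+0130 U+0307      U+0069 U+0307
U+0049 U+0307    U+0049 U+0307      U+0069 U+0307      U+0049 U+0307      i  U+0069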
32
Java

In MessageFormat, watch for words like can't, since ASCII ' has syntactic meaning. Use a real apostrophe (U+2019) where possible: can’t.
In Date and Calendar, the months are numbered from 0 (February is month number 1!). However, weeks and days are numbered from 1.
Java serialized text isn't UTF-8, though it's close: U+0000 and supplementary code points are encoded differently.
Java globalization support is pretty outdated: use ICU to supplement it.
Java ResourceBundle (J2SE), Java Standard Tag Library (JSTL), Java Server Faces (JSF), Apache HTTP server, etc. all provide some locale determination mechanism and facility, but they all differ in details.

33
JavaScript

Always encode characters above U+007F with escapes (\uxxxx).
There is an HTML mechanism to specify the charset of the JavaScript source, but it is not widely implemented.
The JDK tool native2ascii can be used to convert the files to use escapes.
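
For readers without the JDK at hand, the escaping itself is simple enough to sketch. The function below is an illustration of the \uxxxx convention, not a reimplementation of native2ascii; the name `js_escape` is hypothetical. Supplementary code points are written as a surrogate pair, since a \uxxxx escape holds one UTF-16 code unit.

```python
def js_escape(text: str) -> str:
    """Replace every character above U+007F with \\uxxxx escapes."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp <= 0x7F:
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append("\\u%04x" % cp)
        else:
            # Split a supplementary code point into a UTF-16 surrogate pair.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            out.append("\\u%04x\\u%04x" % (high, low))
    return "".join(out)

print(js_escape("caf\u00e9"))      # caf\u00e9
print(js_escape("\U0001d160"))     # \ud834\udd60
```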
