Title: Globalization Gotchas
1. Globalization Gotchas
2. Unicode Basics
- Unicode encodes characters, not glyphs.
  - U+0067 → g, which different fonts draw with many different glyphs.
- Unicode does not encode characters by language.
  - French, German, and English j have the same code point even though all have different pronunciations.
  - Chinese 大 (da) has the same code point as Japanese 大 (dai).
- UTF-8, UTF-16, and UTF-32 are all Unicode.
- The word "character" means different things to different people; make clear which one you mean:
  - glyphs, code points, bytes, code units, user-perceived characters (grapheme clusters)
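The different meanings of "character" can be made concrete by counting one string in several units. This is a minimal sketch; the helper name `length_report` and the sample string are illustrative assumptions, not part of the original slides.

```python
def length_report(text: str) -> dict:
    """Count one string in several of the units that get called "characters"."""
    return {
        "code_points": len(text),                              # Python str is indexed by code point
        "utf8_bytes": len(text.encode("utf-8")),               # UTF-8 code units are bytes
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        "utf32_code_units": len(text.encode("utf-32-le")) // 4,
    }

# "g", then "e" + U+0301 COMBINING ACUTE ACCENT, then U+1D11E MUSICAL SYMBOL G CLEF.
# A user would perceive three characters here.
sample = "g" + "e\u0301" + "\U0001D11E"
report = length_report(sample)
print(report)
```

None of the four counts matches the three user-perceived characters, which is why an API has to say which unit it means.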
3. Unicode in APIs
- U+0000 to U+10FFFF: be prepared to handle (or at least not corrupt!) any incoming code point.
  - A back-level system may receive code points that were still unassigned in its version of Unicode but are assigned in a later version.
- Watch for "UCS-2" implementations. They use UTF-16 text, but don't support characters above U+FFFF; they also may accidentally produce isolated surrogates.
- Some APIs/protocols count lengths in code points, and others in bytes (or other code units).
  - Make sure you don't mix them up.
- Don't limit API parameters to a single character (and definitely not to a single code unit!).
  - What users think of as a single character (e.g. x with a combining accent, or ch) may be a sequence in Unicode.
- Use the latest version of Unicode: it supports new characters, corrections, and more stability guarantees.
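The isolated-surrogate problem is easy to reproduce. The sketch below imitates an API that truncates by UTF-16 code unit; both function names are hypothetical and only serve the illustration.

```python
def truncate_utf16_units(text: str, max_units: int) -> bytes:
    """Naively cut UTF-16 text after max_units code units,
    as a UCS-2 style API with a fixed-size buffer might do."""
    return text.encode("utf-16-le")[: max_units * 2]

def is_well_formed_utf16(data: bytes) -> bool:
    """True if the bytes decode as UTF-16LE without isolated surrogates."""
    try:
        data.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False

# 'a' followed by U+10400, which needs a surrogate pair in UTF-16.
text = "a\U00010400"
whole = truncate_utf16_units(text, 3)    # 'a' plus both surrogates
broken = truncate_utf16_units(text, 2)   # 'a' plus the high surrogate only

print(is_well_formed_utf16(whole), is_well_formed_utf16(broken))
```

Truncating on a code point boundary (or better, a grapheme cluster boundary) avoids producing the broken form.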
4. Choice of Characters
- Character and block names may be misleading, e.g.:
  - U+034F COMBINING GRAPHEME JOINER doesn't join graphemes.
  - See http://www.unicode.org/faq/
- Use U+2060 (word joiner) instead of U+FEFF (zero-width no-break space) for everything but the BOM function.
- Never use unassigned code points; those will be used in future versions of Unicode.
- For private purposes, use only private-use (PUA) code points or noncharacters (and only if necessary).
  - If you do, minimize the opportunity for collision by picking an unusual range.
5. Character Conversion
- Always use "shortest form" UTF-8.
  - It's the Law.
  - And if that isn't enough, consider security attacks.
- If a protocol allows a choice of charsets, always tag correctly.
  - Not all text is correctly tagged; character detection may be necessary. But remember, it's always a guess!
  - Converting a database of mixed, untagged data is extremely painful.
- Bad assumptions:
  - Length in bytes = N × length in code points
  - 1 character in charset X = 1 character in Unicode
  - The ordering may also be different.
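A strict decoder rejects non-shortest ("overlong") UTF-8, and the bytes-per-character ratio varies by charset. The following sketch shows both, using Python's built-in codecs; the helper name `decode_strict_utf8` is an assumption made for the example.

```python
from typing import Optional

def decode_strict_utf8(data: bytes) -> Optional[str]:
    """Return the decoded text, or None if the bytes are not well-formed UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None

shortest = b"/"           # U+002F in its shortest (one-byte) form
overlong = b"\xc0\xaf"    # the same code point padded to two bytes: ill-formed

print(decode_strict_utf8(shortest))   # the slash decodes normally
print(decode_strict_utf8(overlong))   # rejected, so filters cannot be bypassed

# Byte length is not a fixed multiple of the code point count.
word = "na\u00efve"                   # 5 code points, one of them non-ASCII
lengths = {cs: len(word.encode(cs)) for cs in ("latin-1", "utf-8", "utf-16-le")}
print(lengths)
```

A lenient decoder that accepted the overlong form would let "/" slip past checks that only look for the byte 0x2F.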
6. Character Conversion II
- IANA / MIME charset names are ill-defined: vendors often convert the same charset in different ways.
  - Shift-JIS 0x5C → U+005C (\) or U+00A5 (¥)
- Don't simply omit unconvertible data; to reduce security problems, at least substitute
  - U+FFFD (when converting to Unicode) or
  - 0x1A (when converting to bytes).
- See http://www.w3.org/TR/japanese-xml/
- See http://icu.sourceforge.net/charts/charset/
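The substitution rule can be sketched with Python's codec error handlers. The built-in `replace` handler inserts U+FFFD when decoding; for encoding it would insert "?", so this sketch registers its own handler. The handler name `sub_0x1a` and the wrapper functions are assumptions for illustration.

```python
import codecs

def _substitute_sub(error):
    """Encoding error handler: emit one 0x1A (SUB) per unconvertible character."""
    if isinstance(error, UnicodeEncodeError):
        return ("\x1a" * (error.end - error.start), error.end)
    raise error

# The handler name is our own choice, not a standard one.
codecs.register_error("sub_0x1a", _substitute_sub)

def to_unicode(data: bytes, charset: str) -> str:
    """Convert bytes to text, marking undecodable input with U+FFFD."""
    return data.decode(charset, errors="replace")

def to_bytes(text: str, charset: str) -> bytes:
    """Convert text to bytes, marking unencodable characters with 0x1A."""
    return text.encode(charset, errors="sub_0x1a")

print(repr(to_unicode(b"caf\xe9", "utf-8")))   # Latin-1 byte in UTF-8 input
print(to_bytes("caf\u00e9", "ascii"))
```

In both directions the damaged position stays visible instead of silently disappearing.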
7. Properties
- Use properties such as Alphabetic, not hard-coded lists.
  - isAlphabetic(x), or in a regex \p{Alphabetic} or [:Alphabetic:]
  - Not (A ≤ x ≤ Z OR a ≤ x ≤ z)
- Some properties aren't what you think; use
  - White_Space, not General_Category=Zs
  - Alphabetic, not General_Category=L
  - Lowercase, not General_Category=Ll
  - Script=Greek, not Block=Greek
- Characters may change property values between versions of Unicode.
  - See http://unicode.org/standard/stability_policy.html
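A short comparison of a hard-coded range test with property-driven tests, using only the Python standard library. Note the assumption stated in the comments: `str.isalpha()` is documented as using the General_Category letter values, so it is close to, but not identical with, the Alphabetic property the slide recommends.

```python
import unicodedata

def is_ascii_letter(ch: str) -> bool:
    """The hard-coded range test the slide warns against."""
    return "A" <= ch <= "Z" or "a" <= ch <= "z"

def is_letter(ch: str) -> bool:
    """Property-driven test. str.isalpha() uses General_Category L*,
    which only approximates the Alphabetic property."""
    return ch.isalpha()

for ch in ("g", "\u00e9", "\u044f", "\u4e2d"):   # g, é, я, 中
    print(ch, is_ascii_letter(ch), is_letter(ch))

# White space is a similar case: TAB counts as white space for str.isspace(),
# but its General_Category is Cc, not Zs.
tab_is_space = "\t".isspace()
tab_category = unicodedata.category("\t")
print(tab_is_space, tab_category)
```

For the exact Alphabetic, White_Space, or Script properties, a library that exposes the Unicode Character Database directly (such as ICU) is the safer choice.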
8. Identifiers and Tokens
- When designing syntax, use as a base:
  - Pattern_Syntax for operators / relations
  - Pattern_White_Space for gaps
  - XID_Start and XID_Continue for identifiers.
  - All are backwards compatible across versions.
- Profiles may expand or narrow from the base.
- Watch out for security attacks:
  - paypal.com written with a Cyrillic а (U+0430)
  - See Unicode Security at this conference
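The spoofing risk can be demonstrated in a few lines. This is only a heuristic sketch: the Python standard library does not expose the Script property, so the hypothetical helper `scripts_hint` guesses the script from the first word of each character name.

```python
import unicodedata

def scripts_hint(text: str) -> set:
    """Rough script guess from the first word of each letter's character name.
    Heuristic only; a real check should use the Script property (e.g. via ICU)."""
    return {unicodedata.name(ch).split()[0] for ch in text if ch.isalpha()}

genuine = "paypal"
spoofed = "p\u0430ypal"   # second letter is U+0430 CYRILLIC SMALL LETTER A

print(genuine == spoofed)          # the strings look alike but differ
print(scripts_hint(genuine))
print(scripts_hint(spoofed))       # mixed scripts are a warning sign

# Both are valid identifiers: Python identifier syntax is based on XID_Start / XID_Continue.
print(genuine.isidentifier(), spoofed.isidentifier())
```

Accepting XID-based identifiers is fine; the point is that a profile may still need to restrict mixed-script labels.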
9. Comparison (Collation): Searching, Sorting, Matching
- There are two binary orders:
  - code point order = UTF-8 order = UTF-32 order
  - ≠ UTF-16 order
- Don't present users with binary order!
  - No users expect A < Z < a < z < Ç < ä.
- Apply normalization to get a unique form, so that Å (U+00C5) = Å (U+0041 U+030A).
- Security issues: protocols must precisely define the comparison operations.
  - E.g., LDAP doesn't, so lookup may fail (or falsely succeed!).
  - Aside from wrong results, this is an opening for security attacks.
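The two binary orders and the effect of normalization on matching can be checked directly. A minimal sketch; the function names are illustrative assumptions.

```python
import unicodedata

def code_point_order(strings):
    """Python compares str values by code point."""
    return sorted(strings)

def utf16_order(strings):
    """Binary order of the UTF-16 (big-endian) code units."""
    return sorted(strings, key=lambda s: s.encode("utf-16-be"))

def equivalent(a: str, b: str) -> bool:
    """Compare after normalizing both sides to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

bmp_high = "\uFF5E"            # FULLWIDTH TILDE, near the top of the BMP
supplementary = "\U00010000"   # first supplementary code point

print(code_point_order([supplementary, bmp_high]) == [bmp_high, supplementary])
print(utf16_order([supplementary, bmp_high]) == [supplementary, bmp_high])

# Binary order is not what users expect.
print(code_point_order(["a", "Z", "\u00e4", "A"]))

# Precomposed and decomposed A-ring only match after normalization.
print("\u00c5" == "A\u030a", equivalent("\u00c5", "A\u030a"))
```

The supplementary character sorts first in UTF-16 order because its surrogate code units (D800..DFFF) are numerically below U+FF5E.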
10. Language-Sensitive Comparison
- Use UCA order as a base to meet user expectations:
  - a < A < ä < Ç = Ç (precomposed = decomposed) < z < Z
- Real language-sensitive order requires tailoring on top of UCA; ordering depends on context and language:
  - china < China < chinas < danish
  - ae < æ < af
  - z < æ (Danish)
  - c < d < ... h < ch < i (Slovak)
- Follow UCA for substring match offsets; there are some gotchas here.
- Don't mix up "stable" and "deterministic" sorting; they are very different.
- See http://unicode.org/reports/tr10/
- See http://unicode.org/cldr
11. Normalization (NFC, ...)
- Standardized normalization forms are defined by Unicode.
- The ordering of accents in a normalization form may not be the typical type-in order.
  - Fonts should handle both orders.
- Normalization depends on context.
  - Don't assume NFC(x + y) = NFC(x) + NFC(y).
- People assume that NFC always composes, but some characters decompose in NFC.
- Trivia: in Unicode 4.1 there are exactly 3 characters that are different in all 4 normalization forms: ϓ (U+03D3), ϔ (U+03D4), ẛ (U+1E9B).
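These normalization gotchas can be verified with the standard `unicodedata` module. The sample characters below are chosen for this sketch and are not taken from the slides, apart from U+1E9B.

```python
import unicodedata

def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)

# 1. Concatenation: normalizing the parts is not the same as normalizing the whole.
x, y = "a", "\u0301"                  # 'a' and COMBINING ACUTE ACCENT
whole = nfc(x + y)                    # composes to U+00E1
parts = nfc(x) + nfc(y)               # stays as two code points
print(whole == parts)

# 2. NFC can decompose: U+0958 DEVANAGARI LETTER QA is excluded from composition.
qa_nfc = nfc("\u0958")
print([f"U+{ord(c):04X}" for c in qa_nfc])

# 3. U+1E9B has a different result in each of the four forms.
forms = {f: unicodedata.normalize(f, "\u1e9b") for f in ("NFC", "NFD", "NFKC", "NFKD")}
print({f: [f"U+{ord(c):04X}" for c in v] for f, v in forms.items()})
```

When strings are built by concatenation, the safe pattern is to normalize the result, not the pieces.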
12. Maximum Expansion (U4.1)
Operation     UTF      Factor   Sample   Code point
NFC           8        3X       𝅘𝅥𝅮        U+1D160
NFC           16, 32   3X       שּׁ        U+FB2C
NFD           8        3X       ΐ        U+0390
NFD           16, 32   4X       ᾂ        U+1F82
NFKC / NFKD   8        11X      ﷺ        U+FDFA
NFKC / NFKD   16, 32   18X      ﷺ        U+FDFA
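The factors in the table can be recomputed for the sample characters. A small sketch, assuming the hypothetical helper `expansion_factor`; it measures encoded length before and after normalization.

```python
import unicodedata

def expansion_factor(ch: str, form: str, encoding: str) -> float:
    """Encoded size after normalization divided by encoded size before."""
    before = len(ch.encode(encoding))
    after = len(unicodedata.normalize(form, ch).encode(encoding))
    return after / before

# (sample, normalization form, encoding) for each table row
rows = [
    ("\U0001D160", "NFC", "utf-8"),
    ("\uFB2C", "NFC", "utf-16-le"),
    ("\u0390", "NFD", "utf-8"),
    ("\u1F82", "NFD", "utf-32-le"),
    ("\uFDFA", "NFKD", "utf-8"),
    ("\uFDFA", "NFKC", "utf-16-le"),
]
factors = [expansion_factor(ch, form, enc) for ch, form, enc in rows]
print(factors)
```

These worst cases are useful for sizing buffers: a buffer for normalized output has to allow for the largest factor of the form and encoding in use.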
13. Case Conversion
- Not a simple 1:1 mapping:
  - Title case: ǆ ↔ ǅ ↔ Ǆ
  - Expansion: heiß → HEISS → heiss
  - Context-dependent: ΌΣΟΣ → όσος
  - Language-dependent: istanbul ↔ İSTANBUL
- Warning: never use language-dependent casing for language-independent structures, like file-system B-trees.
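Python's string methods implement the language-independent full case mappings, which is enough to show the first three effects. Python has no locale-aware casing, so the Turkic rule is applied by hand in a hypothetical helper, `turkic_upper`, that exists only for this sketch.

```python
def turkic_upper(text: str) -> str:
    """Hypothetical helper: uppercase with the Turkish/Azerbaijani rule i -> U+0130.
    str.upper() is not locale-aware, so the rule is applied before uppercasing."""
    return text.replace("i", "\u0130").upper()

examples = {
    # Expansion: one character becomes two, and the round trip loses the sharp s.
    "expansion": ("hei\u00df".upper(), "HEISS".lower()),
    # Context: capital sigma lowercases to a final sigma at the end of a word.
    "final_sigma": "\u038c\u03a3\u039f\u03a3".lower(),
    # Title case: U+01C6 has distinct titlecase (U+01C5) and uppercase (U+01C4) forms.
    "titlecase": ("\u01c6".title(), "\u01c6".upper()),
    # Language: the default mapping versus the Turkic one.
    "default_upper": "istanbul".upper(),
    "turkic_upper": turkic_upper("istanbul"),
}
for name, value in examples.items():
    print(name, value)
```

Because the default and Turkic results differ, keys in language-independent structures should use the default mappings (or case folding), never the user's locale.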
14. Casing: Maximum Expansion
Operation              UTF         Factor   Sample   Code point
Lower                  8           1.5X     Ⱥ        U+023A
Lower                  16, 32      1X       A        U+0041
Upper / Title / Fold   8, 16, 32   3X       ΐ        U+0390
15. Case Conversion II
- Case folding was not stable.
  - Different results from toCaseFold(S) between two versions
  - Stability is now guaranteed in Unicode 5.0.
- Don't use the Lowercase_Letter (Ll) or Uppercase_Letter (Lu) values of General_Category.
  - These were constrained to be in a partition.
  - Use the separate binary properties Lowercase and Uppercase instead.
16. Lowercase / Uppercase: Form vs Function
- Lowercase, the binary property
  - The character is lowercase in form, but not necessarily in function.
- Functionally Lowercase
  - isCased(x) and isLowercase(x).
  - See Section 3.13 of TUS.
17. Lowercase: Form vs Function
(LC = Lowercase property, F. LC = functionally lowercase, Ll = General_Category Ll)
LC   F. LC   Ll   Count (U4.1)   Example
Y    N       N    114            ˠ U+02E0 MODIFIER LETTER SMALL GAMMA
Y    N       Y    705            ª U+00AA FEMININE ORDINAL INDICATOR
Y    Y       N    43             ⅰ U+2170 SMALL ROMAN NUMERAL ONE
Y    Y       Y    903            a U+0061 LATIN SMALL LETTER A
18. Segmentation
- What a user thinks of as a character is often a sequence.
- Words are not just sequences of letters.
- Lines don't just break at spaces.
- All may be language-dependent.
- See http://www.unicode.org/reports/tr14/
- See http://www.unicode.org/reports/tr29/
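To show why a user-perceived character is often a sequence, here is a deliberately simplified segmentation sketch. It only attaches combining marks to the preceding base character; the real grapheme cluster rules in UAX #29 also cover Hangul syllables, emoji sequences, CR LF, and more, so treat `rough_clusters` as an illustration, not an implementation of the standard.

```python
import unicodedata

def rough_clusters(text: str) -> list:
    """Group each base character with the combining marks that follow it.
    Simplification of UAX #29: only categories Mn, Me, and Mc are attached."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ("Mn", "Me", "Mc"):
            clusters[-1] += ch      # mark joins the previous cluster
        else:
            clusters.append(ch)     # anything else starts a new cluster
    return clusters

text = "e\u0301x\u0323o"            # e + acute, x + dot below, o
clusters = rough_clusters(text)
print(len(text), len(clusters), clusters)
```

Cursor movement, deletion, and truncation should operate on clusters like these, using a full UAX #29 implementation such as ICU's break iterators.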
19. Transliteration
- Transliteration: Ελληνικά → Elleniká; translation: Ελληνικά → Greek
- Transliteration may vary by language:
  - Путин → Putin, Poutine, ...
  - Горбачёв → Gorbachev, Gorbacev, Gorbatchev, Gorbacëv, Gorbachov, Gorbatsov, Gorbatschow, ...
- Watch for terminology: lossy vs lossless
  - Lossy transliteration: Ελληνικά → Ellinika → Ελλινικα
  - In ISO terms: transliteration = lossless transliteration; transcription = lossy transliteration.
- See http://unicode.org/draft/reports/tr35/tr35.html
20. Rendering is Contextual
Processing character-by-character gives the wrong results!
- Glyphs may change shape
- Multiple characters → 1 glyph
- One character → multiple glyphs
21. Rendering II
- Good rendering systems will handle customary type-in order for text plus canonical order.
  - Excellent ones will do any canonically-equivalent order, but those are rare.
- There may be differences in the customary glyphs for different languages; specify the font or the language where they have to be distinguished.
- Security issues:
  - Never render a missing glyph as "?".
  - Don't simply overlay diacritics; it can cause security problems.
- See http://www.unicode.org/notes/tn2/
- See http://unicode.org/reports/tr14/
22. Globalization
- Unicode ≠ Globalization (aka Internationalization, Localizability)
  - Unicode provides the basis for software globalization, but there's more work to be done...
- Use globalization APIs: formatting and parsing of dates, times, numbers, and currencies; comparison of text; calendar systems; ... are all locale-dependent.
  - Where OS facilities are not adequate or cross-platform solutions are needed, use ICU (C, C++, Java).
- Don't put any translatable strings into your code; separate them into resource files.
  - Provide context to translators: is "Mark" a noun, a verb, or a name?
  - Don't use the same string in different contexts unless the meaning is identical (including references).
- Note: user-interface language (menus, dialogs, help system, ...) ≠ data language (body text, spreadsheet cells).
  - Programs need to handle, as data, more languages than the localized UI supports.
23. Common Globalization Mistakes
- Never compile Windows apps as ANSI (the default!).
- Don't simply concatenate strings to make messages.
  - Order of components differs by language; use Java MessageFormat, or structure the UI as separate fields.
- Don't assume icons and symbols mean the same around the world. Don't assume everyone can read the Latin alphabet.
- Allocate space flexibly: "OK" in English → "Aceptar" in Spanish.
  - English is a relatively compact language; others may require more characters (e.g. in database fields) and more screen real estate (in UIs).
- Beware of discrepancies in fallback behavior:
  - Java ResourceBundle (J2SE), JSP Standard Tag Library (JSTL), JavaServer Faces (JSF), Apache HTTP, ...
- See http://unicode.org/cldr/
- See http://ibm.com/software/globalization/icu/
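The concatenation problem is easiest to see with whole-sentence templates and named placeholders. This sketch uses Python's `str.format` in place of Java MessageFormat; the catalog, its translations, and the `render` function are hypothetical, and plural handling is deliberately left out.

```python
# Hypothetical message catalog; a real application would load it from resource files.
MESSAGES = {
    "en": "{user} deleted {count} files from {folder}.",
    "de": "{user} hat {count} Dateien aus {folder} gel\u00f6scht.",
    "ja": "{user}\u304c{folder}\u304b\u3089{count}\u500b\u306e\u30d5\u30a1\u30a4\u30eb\u3092\u524a\u9664\u3057\u307e\u3057\u305f\u3002",
}

def render(locale: str, **params) -> str:
    """Fill a whole-sentence template; the translator controls the word order."""
    return MESSAGES[locale].format(**params)

for locale in MESSAGES:
    print(render(locale, user="Ana", count=3, folder="Docs"))

# In the Japanese template the folder comes before the count,
# which fixed concatenation order in code could not express.
ja = render("ja", user="Ana", count=3, folder="Docs")
print(ja.index("Docs") < ja.index("3"))
```

Named placeholders also give translators some of the context the previous slide asks for.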
24. Neutral Formats
- Store and transmit neutral-format data wherever possible. Convert that data to the user's preferred formats as "close" to the user as possible.
Type              Example               Rec. Standard
Language/Locale   en-US (en_US)         RFC 3066 bis / CLDR
Territory         AU                    RFC 3066 bis
Currency          EUR                   ISO 4217
Timezone          Australia/Melbourne   TZDB
Calendar          islamic-civil         CLDR Calendar ID
Custom Date       yyyy-MMM-dd           CLDR Pattern Format
Binary Time       8C80E9E3967A4B0       Windows File Time
25. Identification
- Locale IDs are extensions of language IDs; use CLDR.
  - See http://unicode.org/cldr/
- Don't assume that everyone in a country always uses that country's currency. Always use an explicit currency ID (ISO 4217).
  - <RUR, 1.23457×10³> → 1 234,57р. in Russian,
  - but Rub 1,234.57 in English.
- Don't assume the timezone ID is implied by the user's locale. For the best timezone information, use the TZ database; use CLDR for timezone names.
  - See http://www.twinsun.com/tz/tz-link.htm
- If you heuristically compute territory IDs, timezone IDs, currency IDs, etc. (e.g., from browser settings), make sure the user can override that and pick an explicit value.
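The RUR example can be sketched as neutral data (an amount plus an explicit ISO 4217 code) that is only formatted at display time. The display rules below are hand-written assumptions covering just this example; real code should take them from CLDR, for instance through ICU.

```python
from decimal import Decimal

# Hypothetical display rules for two locales and one currency (not CLDR data).
DISPLAY_RULES = {
    "ru": {"group": "\u00a0", "decimal": ",", "pattern": "{number}{symbol}",
           "symbols": {"RUR": "\u0440."}},
    "en": {"group": ",", "decimal": ".", "pattern": "{symbol} {number}",
           "symbols": {"RUR": "Rub"}},
}

def format_money(amount: Decimal, currency: str, locale: str) -> str:
    """Format a neutral (amount, currency code) pair for one locale."""
    rules = DISPLAY_RULES[locale]
    # Render with fixed separators first, then swap in the locale's separators.
    whole, fraction = f"{amount.quantize(Decimal('0.01')):,.2f}".split(".")
    number = whole.replace(",", rules["group"]) + rules["decimal"] + fraction
    return rules["pattern"].format(number=number, symbol=rules["symbols"][currency])

stored = (Decimal("1234.57"), "RUR")      # what is stored and transmitted
print(format_money(*stored, locale="ru"))
print(format_money(*stored, locale="en"))
```

The currency code travels with the amount; only the separators, symbol, and placement come from the user's locale.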
26. Unicode Guide
- Authoritative but lightweight
- Introduction, overview, and quick reference
- Main principles of the Unicode Standard
- Best practices in Software Globalization
27. Other Resources
- Unicode Site
  - http://unicode.org
- An Overview of ICU
  - http://icu.sourceforge.net/docs/papers/icu_overview_latest.ppt
- Globalizing Software
  - http://icu.sourceforge.net/docs/papers/globalizing_software.ppt
- W3C Internationalization
  - http://www.w3.org/International/
- Microsoft Global Software Development
  - http://www.microsoft.com/globaldev/default.asp
28. Q&A
29. Backup Slides
30. User Input
- If you develop your own text editor, use the OS APIs to handle IMEs (Input Method Editors) for Chinese, Japanese, Korean, ...
- If you are using "type-ahead" to get to a position in a list (e.g. typing "Jo" gets to the first element starting with those characters), allow arbitrary input. This is often easiest with visible fields.
- If your password field can contain characters that require an IME, a screen pop-up box may reveal the password to onlookers.
31. Dotted and Dotless I
Uppercase            Normal   Lowercase            Turkic   Uppercase
I  U+0049              ↔      i  U+0069              ↔      İ  U+0130
I  U+0049              ←      ı  U+0131              ↔      I  U+0049
İ  U+0130              →      i̇  U+0069 U+0307       ↔      İ̇  U+0130 U+0307
İ  U+0049 U+0307       ↔      i̇  U+0069 U+0307
32. Java
- In MessageFormat, watch for words like can't, since the ASCII ' has syntactic meaning. Use a real apostrophe (U+2019) where possible: can’t.
- In Date and Calendar, the months are numbered from 0 (February is month number 1!). However, weeks and days are numbered from 1.
- Java serialized text isn't UTF-8, though it's close. U+0000 and supplementary code points are encoded differently.
- Java globalization support is pretty outdated; use ICU to supplement it.
- Java ResourceBundle (J2SE), JSP Standard Tag Library (JSTL), JavaServer Faces (JSF), Apache HTTP server, etc. all provide some locale determination mechanism and facility, but they all differ in details.
33. JavaScript
- Always encode characters above U+007F with escapes (\uxxxx).
  - There is an HTML mechanism to specify the charset of the JavaScript source, but it is not widely implemented.
  - The JDK tool native2ascii can be used to convert the files to use escapes.
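The escaping step can be sketched in a few lines. This is not native2ascii itself, only an assumed equivalent for illustration: the hypothetical function `escape_non_ascii` rewrites every character above U+007F as backslash-u escapes of its UTF-16 code units, so supplementary characters become a surrogate pair of escapes.

```python
def escape_non_ascii(source: str) -> str:
    """Rewrite characters above U+007F as backslash-u escapes, one per UTF-16 code unit."""
    out = []
    for ch in source:
        if ord(ch) <= 0x7F:
            out.append(ch)
            continue
        # surrogatepass keeps the function total even for stray surrogates in the input.
        units = ch.encode("utf-16-be", errors="surrogatepass")
        for i in range(0, len(units), 2):
            out.append("\\u{:04x}".format(int.from_bytes(units[i:i + 2], "big")))
    return "".join(out)

escaped_bmp = escape_non_ascii('var s = "caf\u00e9";')
escaped_supplementary = escape_non_ascii("\U0001D11E")
print(escaped_bmp)
print(escaped_supplementary)
```

The output is pure ASCII, so it no longer matters which charset the server or browser assumes for the script file.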