Title: What's New in Globalization
1. What's New in Globalization?
- Mark Davis
- President and Cofounder, The Unicode Consortium
2. The Unicode Standard, Version 5.0
- "Hard copy versions of the Unicode Standard have been among the most crucial and most heavily used reference books in my personal library for years." Donald E. Knuth
- "For more than a decade, Unicode has been a foundation for many Microsoft products and technologies; Unicode Standard Version 5.0 will help us deliver important new benefits to users." Bill Gates
- "The path W3C follows to making text on the Web truly global is Unicode." Sir Tim Berners-Lee, KBE
- "Without Unicode, Java wouldn't be Java, and the Internet would have a harder time connecting the people of the world." James Gosling
3. The Unicode Standard, Version 5.0
- Obsoletes previous versions
- Basis for Microsoft's Vista; in upgrade plans for Google, Yahoo!, and ICU, to name but a few
- Hundreds of pages of new information; thousands of revised pages; all Unicode Standard Annexes
- Systematic framework for improved text processing
- Improvements to the Unicode Encoding Model for UTF-8, ...
- Rigorous stability of case folding and identifiers
- Improved interoperability and backward compatibility
- Enabling additional new ways to optimize code
4. U5.0: Unicode Character Database
- Unicode: far more than a list of characters
- Properties: key to how characters function
- Changes in 5.0
  - Scripts: Unassigned code points → Zzzz
  - Casing Stability: Upper → folded
  - BIDI: Consistent Bidi_Mirrored
  - Now Normative: kIICore
  - Line Break: SE Asian → Complex_Context
  - New Properties: Normative_Name_Alias, Deprecated, 3 Unihan provisional properties
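To make "properties are key to how characters function" concrete, here is a minimal sketch that looks up a few Unicode Character Database properties with Python's standard `unicodedata` module. The helper name `describe` is hypothetical, and the data reflects whatever Unicode version the running Python ships with, not necessarily 5.0. The module does not expose every property mentioned above (the Script property, for example, is not available there).

```python
import unicodedata

def describe(ch):
    """Return a few UCD properties for one character (illustrative helper)."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),          # General_Category
        "bidi_class": unicodedata.bidirectional(ch),   # Bidi_Class
        "mirrored": bool(unicodedata.mirrored(ch)),    # Bidi_Mirrored
    }

# Latin capital A, left parenthesis, Hebrew letter alef
for ch in "A(\u05D0":
    print(f"U+{ord(ch):04X}", describe(ch))

# Version of the Unicode data bundled with this interpreter
print("UCD version:", unicodedata.unidata_version)
```

The point of the sketch is that behavior such as mirroring in bidirectional text is driven by per-character data rather than by hard-coded lists in application code.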
5. U5.0: Conformance
- Stable: Case-Folded
  - Upper → Lower
- Much clearer encoding / property model
- Stable: Approved Named Character Sequences
- Bengali, Gurmukhi, Tamil changes
- Combining grapheme joiner clarified
- Disunification of Diacritics
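As a small illustration of why case folding (rather than simple lowercasing) is the operation that gets a stability guarantee, the sketch below compares the two using Python's built-in `str.casefold`. The helper `caseless_equal` is a hypothetical name, and the sketch ignores normalization, which a full caseless match would also need.

```python
def caseless_equal(a, b):
    """Compare two strings using Unicode case folding (no normalization)."""
    return a.casefold() == b.casefold()

word = "Stra\u00DFe"  # "Straße"

# Lowercasing keeps the sharp s; case folding maps it to "ss".
print(word.lower())     # straße
print(word.casefold())  # strasse

print(caseless_equal("STRASSE", word))  # True
```

Folded forms are what identifiers and lookup keys are usually compared on, which is why changes to folding between versions would be disruptive.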
6. 5.0 Annexes: Core
- UAX #9: Bidirectional Algorithm
  - Tightened conformance requirements
- UAX #15: Unicode Normalization Forms
  - New Stream-Safe Text Format
  - Appendix of characters requiring special handling
  - Expanded info on stability guarantees
  - Additional detailed figures, guidelines
- UAX #31: Identifier and Pattern Syntax
  - Added profiles, information on usage
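The normalization forms of UAX #15 can be tried out directly with Python's standard `unicodedata.normalize`. This is only a sketch of the four forms on two sample strings; it does not cover the Stream-Safe Text Format mentioned above, which the standard library does not implement.

```python
import unicodedata

composed = "\u00E9"     # é as one precomposed code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT

# Canonically equivalent strings differ until they are normalized.
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True

# Compatibility forms additionally fold characters such as ligatures.
ligature = "\uFB01"  # LATIN SMALL LIGATURE FI
print(unicodedata.normalize("NFKC", ligature))  # fi
print(unicodedata.normalize("NFC", ligature) == ligature)    # True
```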
7. U5.0 Annexes: Boundaries
- UAX #14: Line Breaking Properties
  - Rules modified to improve behavior
  - Now Normative (conformance clauses reorganized)
- UAX #29: Text Boundaries
  - Edge cases improved
  - Tailorings for text boundaries now in Unicode CLDR
  - Format of the rules changed to ease implementation
  - Additional guidelines on regex, identifiers, ...
8. U5.0: Characters by Script
9. Unicode Character Timeline
10. Unicode Guide for Programmers
- Adjunct to Standard
- Concise Guide for Software Globalization
- Crucial Concepts
- Key Gotchas
  - Recognize and Avoid
- Details on
  - Encoding conversions
  - UTF-8, 16, 32, BOM
  - Using character properties
  - Text Operations
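One of the gotchas around encoding conversions is the byte order mark. The sketch below, using only Python's standard codecs, shows which encodings emit a BOM and a simple way to sniff one. The function `sniff_bom` is a hypothetical helper, not part of any library, and it deliberately returns nothing when no BOM is present rather than guessing.

```python
import codecs

text = "Unicode 5.0"

utf8_sig = text.encode("utf-8-sig")  # UTF-8 with a leading EF BB BF
utf16 = text.encode("utf-16")        # BOM + platform byte order
utf16_be = text.encode("utf-16-be")  # explicit byte order, no BOM
utf32 = text.encode("utf-32")        # BOM + platform byte order

def sniff_bom(data):
    """Return a codec name if data starts with a known BOM, else None."""
    # UTF-32 must be checked before UTF-16: the UTF-32-LE BOM (FF FE 00 00)
    # begins with the UTF-16-LE BOM (FF FE).
    candidates = [
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ]
    for bom, name in candidates:
        if data.startswith(bom):
            return name
    return None

for data in (utf8_sig, utf16, utf32):
    codec = sniff_bom(data)
    print(codec, data.decode(codec) == text)

print(sniff_bom(utf16_be))  # None: byte order must be known out of band
```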
11. Unicode Common Locale Data Repository (CLDR)
- Key locale data for world languages
- Most extensive standard repository of locale data
- XML format
Δευτέρα, 05 Σεπτεμβρίου 2005
Montag, 5. September 2005
1 234,57???.
?1,234.57
Arabic: arabski, Bulgarian: bułgarski, Czech: czeski
Africa ??, Central America ???, Eastern Africa ??, Northern Africa ??
AED ?.?.?, BHD ?.?., DZD ?.?.?, EGP ?.?.?, EUR
Z < Å
12. Unicode CLDR 1.4
- 121 languages and 142 territories: 360 locales in all
- 25% more locale data: over 17,000 new/modified items
- Repository separated into language vs locale data
- Language-specific segmentation (word/line breaks)
- Transliterations (e.g. Ελληνικά → Elleniká)
- Data for lenient date/time formatting and parsing
- Programmer asks for numeric day, abbreviated month
- Best format pattern returned, e.g. dd.MMM
- Quarters in dates (e.g. 2006Q1)
- BCP 47 compatibility extensions
13. BCP 47: Language Tags
- Usage: HTTP, HTML, XML, CLDR Locale IDs
- RFC 4646: Obsoletes RFCs 1766, 3066
- Addresses problems in RFC 3066
  - ISO standards: stability / accessibility / ambiguity
  - Parseability, Extensibility, Registration speed
  - Identification of script (where necessary)
  - Traditional Chinese (zh-Hant), Serbian in Latin (sr-Latn), Azerbaijani in Cyrillic (az-Cyrl), etc.
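The parseability point can be illustrated with a short sketch: because script subtags are always four letters and region subtags are two letters or three digits, a tag can be split by shape alone, without a registry lookup. The regular expression below is an assumption that covers only a simplified `language[-script][-region]` subset of the RFC 4646 grammar (no variants, extensions, private use, or grandfathered tags), and `parse_tag` is a hypothetical helper name.

```python
import re

# Simplified subset of the RFC 4646 grammar (assumption for illustration):
# 2-3 letter language, optional 4-letter script, optional region.
TAG = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
)

def parse_tag(tag):
    """Split a language tag into subtags, using conventional casing."""
    m = TAG.fullmatch(tag)
    if m is None:
        raise ValueError(f"not in the supported subset: {tag!r}")
    parts = m.groupdict()
    return {
        "language": parts["language"].lower(),
        "script": parts["script"].title() if parts["script"] else None,
        "region": parts["region"].upper() if parts["region"] else None,
    }

for tag in ("zh-Hant", "sr-Latn", "az-Cyrl", "en-US", "sr-latn-rs"):
    print(tag, parse_tag(tag))
```

Tags are case-insensitive, so the casing applied here is only the customary presentation form.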
14. Unicode Security
- Examples
  - Visual Confusables: paypal.com with Cyrillic a
  - Non-visual problems: buffer overflows, non-shortest form, ...
- UTR #36: Unicode Security Considerations
  - Guidelines and Recommendations
- UTS #39: Unicode Security Mechanisms
  - Algorithms and Data
  - Limitations on Repertoire
  - Testing for Confusables
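Both example categories can be demonstrated in a few lines. The sketch below is not the UTS #39 confusable-detection algorithm: `scripts_used` is a hypothetical helper that guesses a script from the first word of each character's name, which is a rough heuristic chosen only because the standard library does not expose the Script property. The second part relies on Python's UTF-8 decoder rejecting non-shortest forms.

```python
import unicodedata

def scripts_used(label):
    """Rough script guess from character names (heuristic, not UTS #39)."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in label
        if ch.isalpha()
    }

genuine = "paypal"
spoof = "p\u0430ypal"  # second letter is CYRILLIC SMALL LETTER A

print(genuine == spoof)             # False, although they look alike
print(sorted(scripts_used(spoof)))  # ['CYRILLIC', 'LATIN']

def is_valid_utf8(data):
    """True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# C0 AF is a non-shortest (overlong) encoding of "/" and must be rejected.
print(is_valid_utf8(b"\xc0\xaf"))  # False
print(is_valid_utf8(b"/"))         # True
```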
15. Internationalized Domain Names
- One instance of broad problem
  - Many RFCs use Nameprep, limited to Unicode 3.2
- Unicode recommendations
  - Narrow the repertoire: exclude symbols, punctuation
  - Expand the coverage: currently only Unicode 3.2
- IETF idn-nextsteps published
  - Some positive developments, but misreads Unicode, needs more work
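For a feel of what Nameprep-based IDN processing does, Python's built-in `idna` codec can be used as a sketch. That codec implements the original IDNA specification (RFC 3490) on top of the standard library's `stringprep` tables, which are themselves based on Unicode 3.2, so it also illustrates the version limit noted above. The domain name is an illustrative example, not taken from the slides.

```python
# Convert an internationalized name to its ASCII-compatible form and back.
name = "b\u00FCcher.example"  # "bücher.example"

ace = name.encode("idna")
print(ace)  # b'xn--bcher-kva.example'

round_trip = ace.decode("idna")
print(round_trip == name)  # True
```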
16. URL → IRI
- International Resource Identifier (IRI)
- UTF-8, %-escaped
- Example
  - http://w3.org/International/articles/idn-and-iri/JP??/??????.html
  - http://w3.org/International/articles/idn-and-iri/JP%E7%B4%8D...%E8%B1%86.html
- See http://ietf.org/rfc/rfc3987.txt
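The "UTF-8, %-escaped" mapping from an IRI path to a URI path can be sketched with the standard `urllib.parse.quote`. The helper name `iri_path_to_uri_path` and the sample path are assumptions for illustration; the sketch handles only the path component and skips the other steps RFC 3987 describes, such as normalization and host name conversion.

```python
from urllib.parse import quote, unquote

def iri_path_to_uri_path(path):
    """Percent-escape the UTF-8 bytes of non-ASCII characters in a path."""
    return quote(path, safe="/", encoding="utf-8")

iri_path = "/JP\u7D0D\u8C46/index.html"  # hypothetical path containing 納豆
uri_path = iri_path_to_uri_path(iri_path)

print(uri_path)                       # /JP%E7%B4%8D%E8%B1%86/index.html
print(unquote(uri_path) == iri_path)  # True
```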
17. Ideographic Variation Database
- U+82A6 (ashi): multiple forms
- The first occurrence: any glyph
- Second occurrence is in the name of the town Ashiya, customarily displayed with form 4
- Registration for variants
18. Ideographic Variation Database
- Variation Selector
  - Identifies a restriction on the appearance of a character
  - Character + Variation Selector = Variation Sequence
- Han ideographs
  - Impossible to build a single collection for everyone: requirements from scholars, governments and publishers
  - Instead, registration of multiple independent collections
- Unicode Ideographic Variation Database
  - A given variation sequence is used in at most one collection
  - Makes interchange of variation sequences reliable
  - Registration, not Assessment
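In plain text, a variation sequence is simply a base character followed by a variation selector. The sketch below builds one for U+82A6 using U+E0100 (VARIATION SELECTOR-17); that selector is chosen purely for illustration, and the code does not check whether the resulting sequence is actually registered in the Ideographic Variation Database. The helper names are hypothetical.

```python
import unicodedata

ASHI = "\u82A6"       # base ideograph from the previous slide
VS17 = "\U000E0100"   # first selector in the supplementary range

sequence = ASHI + VS17  # two code points, displayed as one glyph

def is_variation_selector(ch):
    """True for VS1-VS16 (U+FE00..FE0F) and VS17-VS256 (U+E0100..E01EF)."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF

def strip_variation_selectors(text):
    """Fall back to the unrestricted base characters."""
    return "".join(ch for ch in text if not is_variation_selector(ch))

print(len(sequence))                               # 2
print(unicodedata.category(VS17))                  # Mn
print(strip_variation_selectors(sequence) == ASHI) # True
```

A process that does not understand a given sequence can ignore the selector and still show some acceptable form of the base character, which is what keeps interchange safe.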
19. ICU 3.6
- Mature, portable C/C++/Java intl libraries
- Unicode 5.0, UCA 5.0, CLDR 1.4
- ICU4C
  - Charset Detection
  - Improved Time Zones, Thai word break, UText (64 bit), Performance, Data Management, ...
- ICU4J
  - Globalization Preferences
  - Flexible date/time formats, Charset conversion
20. Near-Term Issues
- Unicode 5.0.1, Unicode 5.1
- CLDR / BCP 47bis
- LDAP
- Collation Registry
- IANA Charset Registry
21. Unicode 5.1: possibilities
- Characters
  - CJK Unified Ideographs Extension C
  - Minority Scripts: Cham and Lanna
  - Malayalam chillu
- Properties/Behavior
  - Normalization process for stable strings
22. CLDR 1.5 / BCP 47bis
- CLDR 1.5
  - Data Submission: Starting November
  - New structures / data
- BCP 47
  - Adding 7,000 (!) new language subtags
  - Possibly other changes
23. LDAP
- Now has definitive comparison (good)
- Stuck at Unicode 3.2 (bad)
- http://www.ietf.org/rfc/rfc4518.txt
24. Collation Registry
- Nearing approval
- Adds ability to register comparisons
- Workable for basic cases
- http://www.ietf.org/internet-drafts/draft-newman-i18n-comparator-14.txt
25. IANA Charset Registry
- Currently limited usefulness
  - Ill-defined
  - Missing mapping tables
  - Incomplete
  - Inaccurate
- Regime Change
  - Hope for future improvements!