What's New in Globalization - PowerPoint PPT Presentation

About This Presentation

Slides: 27

Provided by: GoogleE2

Transcript and Presenter's Notes

Title: What's New in Globalization

1
What's New in Globalization?

Mark Davis
President and Cofounder, The Unicode Consortium

2
The Unicode Standard, Version 5.0

"Hard copy versions of the Unicode Standard have been among the most crucial and most heavily used reference books in my personal library for years." (Donald E. Knuth)
"For more than a decade, Unicode has been a foundation for many Microsoft products and technologies; Unicode Standard Version 5.0 will help us deliver important new benefits to users." (Bill Gates)
"The path W3C follows to making text on the Web truly global is Unicode." (Sir Tim Berners-Lee, KBE)
"Without Unicode, Java wouldn't be Java, and the Internet would have a harder time connecting the people of the world." (James Gosling)

3
The Unicode Standard, Version 5.0

Obsoletes previous versions
Basis for Microsoft's Vista; in upgrade plans for Google, Yahoo!, and ICU, to name but a few.
Hundreds of pages of new information; thousands of revised pages; all Unicode Standard Annexes
Systematic framework for improved text processing
Improvements to the Unicode Encoding Model for UTF-8,
Rigorous stability of case folding and identifiers
Improved interoperability and backward compatibility
Enabling additional new ways to optimize code

4
U5.0 Unicode Character Database

Unicode: far more than a list of characters
Properties: key to how characters function
Changes in 5.0
Scripts: Unassigned code points → Zzzz
Casing: Stability, Upper ? folded
BIDI: Consistent Bidi_Mirrored
Now Normative: kIICore
Line Break: SE Asian → Complex_Context
New Properties: Normative_Name_Alias, Deprecated, 3 Unihan provisional properties
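
As a rough illustration of "properties are key to how characters function", here is a minimal sketch using Python's standard unicodedata module. The helper name describe is made up for this note. The module reflects whatever Unicode version ships with the interpreter (not necessarily 5.0) and does not expose every property named above (for example Script).

```python
import unicodedata

def describe(ch: str) -> dict:
    """Return a few Unicode Character Database properties for one character.

    'describe' is a hypothetical helper for illustration only.
    """
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),          # General_Category, e.g. "Lu"
        "bidi_class": unicodedata.bidirectional(ch),   # e.g. "L", "R", "ON"
        "bidi_mirrored": bool(unicodedata.mirrored(ch)),
    }

if __name__ == "__main__":
    for ch in ("A", "(", "\u05D0"):   # Latin letter, parenthesis, Hebrew alef
        print(repr(ch), describe(ch))
```

The Bidi_Mirrored flag is what lets a renderer flip a parenthesis in right-to-left text, which is why its consistency matters.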

5
U5.0 Conformance

Stable Case-Folded
Upper ? Lower
Much clearer encoding / property model
Stable Approved Named Character Sequences
Bengali, Gurmukhi, Tamil changes
Combining grapheme joiner clarified
Disunification of Diacritics
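
Case folding is easiest to see with a caseless comparison. The sketch below uses Python's built-in str.casefold; the function names are illustrative. It only shows what folding does and cannot demonstrate the cross-version stability guarantee itself. Full caseless matching would normally also normalize the text.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings after Unicode case folding (no normalization)."""
    return a.casefold() == b.casefold()

def is_case_folded(s: str) -> bool:
    """True if folding the string again leaves it unchanged."""
    return s.casefold() == s

if __name__ == "__main__":
    print(caseless_equal("Straße", "STRASSE"))
    print(is_case_folded("strasse"), is_case_folded("Straße"))
```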

6
5.0 Annexes Core

UAX #9: Bidirectional Algorithm
Tightened conformance requirements
UAX #15: Unicode Normalization Forms
New Stream-Safe Text Format
Appendix of characters requiring special handling
Expanded info on stability guarantees
Additional detailed figures, guidelines
UAX #31: Identifier and Pattern Syntax
Added profiles, information on usage
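
A small sketch of what the normalization forms and the Stream-Safe limit mean in practice, using Python's unicodedata. The function names are invented for this note, and the stream-safe check is a simplification (it only measures the longest run of non-starters after NFKD decomposition against the limit of 30); it is not the full algorithm from the annex.

```python
import unicodedata

MAX_NONSTARTERS = 30  # limit on consecutive non-starters in Stream-Safe text

def canonically_equal(a: str, b: str) -> bool:
    """True if the two strings are identical after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

def longest_nonstarter_run(text: str) -> int:
    """Length of the longest run of characters with non-zero combining class."""
    longest = current = 0
    for ch in unicodedata.normalize("NFKD", text):
        if unicodedata.combining(ch):
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest

def looks_stream_safe(text: str) -> bool:
    """Simplified check: no run of non-starters longer than the limit."""
    return longest_nonstarter_run(text) <= MAX_NONSTARTERS

if __name__ == "__main__":
    print(canonically_equal("\u00e9", "e\u0301"))    # precomposed vs. decomposed e-acute
    print(looks_stream_safe("a" + "\u0301" * 31))    # 31 combining accents in a row
```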

7
U5.0 Annexes Boundaries

UAX #14: Line Breaking Properties
Rules modified to improve behavior
Now Normative (conformance clauses reorganized)
UAX #29: Text Boundaries
Edge cases improved
Tailorings for text boundaries now in Unicode CLDR
Format of the rules changed to ease implementation
Additional guidelines on regex, identifiers,
8
U5.0 Characters by Script

9
Unicode Character Timeline

10
Unicode Guide for Programmers

Adjunct to Standard
Concise Guide for Software Globalization
Crucial Concepts
Key Gotchas
Recognize and Avoid
Details on:
Encoding conversions
UTF-8, 16, 32; BOM
Using character properties
Text Operations
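
To make the encoding and BOM items concrete, here is a minimal Python sketch. The helper names encoded_sizes and sniff_bom are hypothetical; the BOM constants come from the standard codecs module. BOM sniffing is only a heuristic: text without a BOM still needs an encoding declared some other way.

```python
import codecs

def encoded_sizes(text: str) -> dict:
    """Byte length of the same text in the three Unicode encoding forms."""
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}

# UTF-32 is checked before UTF-16 because the UTF-32 LE BOM (FF FE 00 00)
# starts with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def sniff_bom(data: bytes):
    """Return a Python codec name if the data starts with a BOM, else None."""
    for bom, codec_name in _BOMS:
        if data.startswith(bom):
            return codec_name
    return None

if __name__ == "__main__":
    sample = "A\u00e9\u20ac\U0001D11E"   # 1-, 2-, 3- and 4-byte characters in UTF-8
    print(encoded_sizes(sample))
    data = "hi".encode("utf-16")         # Python prepends a BOM for plain "utf-16"
    print(sniff_bom(data), data.decode(sniff_bom(data)))
```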

11
Unicode Common Locale Data Repository (CLDR)

Key locale data for world languages
Most extensive standard repository of locale data
XML format

Δευτέρα, 05 Σεπτεμβρίου 2005
Montag, 5. September 2005
1 234,57???.
?1,234.57
Arabic: arabski; Bulgarian: bułgarski; Czech: czeski
Africa ??; Central America ???; Eastern Africa ??; Northern Africa ??
AED ?.?.?; BHD ?.?.; DZD ?.?.?; EGP ?.?.?; EUR
Z < Å

12
Unicode CLDR 1.4

121 languages and 142 territories: 360 locales in all
25% more locale data: over 17,000 new/modified items
Repository separated into language vs locale data
Language-specific segmentation (word/line breaks)
Transliterations (e.g. Ελληνικά → Elleniká)
Data for lenient date/time formatting and parsing
Programmer asks for numeric day and abbreviated month
Best format pattern returned, e.g. dd.MMM
Quarters in dates (e.g. 2006Q1)
BCP 47 compatibility extensions

13
BCP 47 Language Tags

Usage: HTTP, HTML, XML, CLDR Locale IDs
RFC 4646: Obsoletes RFCs 1766, 3066
Addresses problems in RFC 3066
ISO standards: stability / accessibility / ambiguity
Parseability, Extensibility, Registration speed
Identification of script (where necessary)
Traditional Chinese (zh-Hant), Serbian in Latin (sr-Latn), Azerbaijani in Cyrillic (az-Cyrl), etc.
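
The tags above follow a language-script-region shape, which can be sketched with a small parser. This is a deliberately reduced illustration: it handles only language[-script][-region] and ignores variants, extensions, private use and grandfathered tags, and it does not consult the subtag registry. The function name parse_simple_tag is made up for this note.

```python
import re

# Only language[-script][-region]; a small subset of the full grammar.
_SIMPLE_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)

def parse_simple_tag(tag: str):
    """Split a simple language tag into (language, script, region).

    Missing parts are None. Output uses the conventional casing:
    lowercase language, Titlecase script, UPPERCASE region.
    """
    match = _SIMPLE_TAG.match(tag)
    if match is None:
        raise ValueError(f"not a simple language tag: {tag!r}")
    language = match["language"].lower()
    script = match["script"].title() if match["script"] else None
    region = match["region"].upper() if match["region"] else None
    return (language, script, region)

if __name__ == "__main__":
    for tag in ("zh-Hant", "sr-Latn", "az-Cyrl", "en-US"):
        print(tag, parse_simple_tag(tag))
```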

14
Unicode Security

Examples
Visual Confusables: paypal.com with Cyrillic а (U+0430)
Non-visual problems: buffer overflows, non-shortest form,
UTR #36: Unicode Security Considerations
Guidelines, Recommendations
UTS #39: Unicode Security Mechanisms
Algorithms, Data
Limitations on Repertoire
Testing for Confusables
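
Both kinds of problem can be illustrated in a few lines of Python. This is a rough sketch, not the UTS #39 mechanism: it does not use the confusables data, and it approximates a character's script by the first word of its Unicode name because the standard library does not expose the Script property. The function names are hypothetical.

```python
import unicodedata

def rejects_non_shortest_form() -> bool:
    """Check that the UTF-8 decoder refuses an overlong encoding of '/'.

    b'\\xc0\\xaf' is a two-byte, non-shortest form of U+002F.
    """
    try:
        b"\xc0\xaf".decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False

def rough_scripts(label: str) -> set:
    """Approximate set of scripts used by the letters in a label.

    Assumption: the first word of the character name (LATIN, CYRILLIC, ...)
    stands in for the real Script property.
    """
    scripts = set()
    for ch in label:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts

def is_mixed_script(label: str) -> bool:
    """True if the letters of the label appear to come from several scripts."""
    return len(rough_scripts(label)) > 1

if __name__ == "__main__":
    spoof = "p\u0430ypal"   # second letter is CYRILLIC SMALL LETTER A
    print(rough_scripts("paypal"), rough_scripts(spoof))
    print(is_mixed_script(spoof), rejects_non_shortest_form())
```

A mixed-script test like this is one simple way to limit the repertoire; it catches the paypal example but not whole-script spoofs.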

15
Internationalized Domain Names

One instance of broad problem
Many RFCs use Nameprep: limited to Unicode 3.2
Unicode recommendations
Narrow the repertoire: exclude symbols, punctuation
Expand the coverage: currently only Unicode 3.2.
IETF idn-nextsteps published
Some positive developments, but misreads Unicode, needs more work
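
For a feel of what the existing IDN processing does, Python's built-in "idna" codec can be used as a stand-in; it is documented as implementing RFC 3490 (IDNA 2003), which relies on Nameprep. The domain name below is an illustrative example, not one from the talk, and the helper names are made up.

```python
def to_ascii(domain: str) -> str:
    """Convert an internationalized domain name to its ASCII (xn--) form."""
    return domain.encode("idna").decode("ascii")

def to_unicode(ascii_domain: str) -> str:
    """Convert an ASCII (xn--) domain name back to Unicode."""
    return ascii_domain.encode("ascii").decode("idna")

if __name__ == "__main__":
    name = "b\u00fccher.example"   # illustrative name with u-umlaut
    encoded = to_ascii(name)
    print(encoded, to_unicode(encoded) == name)
```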

16
URL ? IRI

International Resource Identifier (IRI)
UTF-8, -escaped
Example
http//w3.org/International/articles/idn-and-iri/
JP??/??????.html
http//w3.org/International/articles/idn-and-iri/
JPE7B48D... E8B186.html
See http//ietf.org/rfc/rfc3987.txt
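
The path part of the mapping (encode as UTF-8, then %-escape each non-ASCII byte) can be sketched with Python's urllib.parse. The path used here is a shortened, illustrative one; the host part of an IRI is handled separately (via IDN) and is not covered. The function names are hypothetical.

```python
from urllib.parse import quote, unquote

def iri_path_to_uri_path(path: str) -> str:
    """Percent-escape the UTF-8 bytes of non-ASCII characters in a path.

    '/' is kept as a separator; existing '%' signs are not treated specially
    in this simplified sketch.
    """
    return quote(path, safe="/", encoding="utf-8")

def uri_path_to_iri_path(path: str) -> str:
    """Reverse the escaping, interpreting the bytes as UTF-8."""
    return unquote(path, encoding="utf-8")

if __name__ == "__main__":
    iri_path = "/idn-and-iri/JP\u7d0d\u8c46/index.html"   # illustrative path
    uri_path = iri_path_to_uri_path(iri_path)
    print(uri_path)
    print(uri_path_to_iri_path(uri_path) == iri_path)
```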

17
Ideographic Variation Database

U+82A6 (ashi): multiple forms
The first occurrence: any glyph
Second occurrence is in the name of the town Ashiya: customarily displayed with form 4
Registration for variants

18
Ideographic Variation Database

Variation Selector
Identifies a restriction on the appearance of a character
Character + Variation Selector = Variation Sequence
Han ideographs
Impossible to build a single collection for everyone: requirements from scholars, governments and publishers
Instead, registration of multiple independent collections
Unicode Ideographic Variation Database
A given variation sequence is used in at most one collection
Makes interchange of variation sequences reliable.
Registration, not Assessment
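
In text, a variation sequence is simply a base character followed by a variation selector. The sketch below pairs ideographs with a following selector from the Variation Selectors Supplement range (U+E0100 to U+E01EF). Which selector a registered collection assigns to a particular glyph of U+82A6 is not stated here, so the selector in the example is a placeholder; the function names are illustrative.

```python
def is_supplementary_variation_selector(ch: str) -> bool:
    """True for VS17..VS256 (U+E0100..U+E01EF)."""
    return 0xE0100 <= ord(ch) <= 0xE01EF

def split_variation_sequences(text: str) -> list:
    """Return (base, selector) pairs; selector is None when absent."""
    result = []
    for ch in text:
        if (is_supplementary_variation_selector(ch)
                and result and result[-1][1] is None):
            base, _ = result[-1]
            result[-1] = (base, ch)      # attach selector to preceding base
        else:
            result.append((ch, None))
    return result

if __name__ == "__main__":
    ashi = "\u82A6"
    placeholder_vs = "\U000E0100"        # VS17, chosen only for illustration
    text = ashi + ashi + placeholder_vs  # unrestricted form, then a restricted one
    for base, selector in split_variation_sequences(text):
        print(f"U+{ord(base):04X}", f"U+{ord(selector):05X}" if selector else None)
```

A renderer that does not know the sequence can ignore the selector and show the default glyph, so the text stays readable either way.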

19
ICU 3.6

Mature, portable C/C++/Java intl libraries
Unicode 5.0, UCA 5.0, CLDR 1.4
ICU4C
Charset Detection
Improved Time Zones, Thai word break, UText (64 bit), Performance, Data Management,
ICU4J
Globalization Preferences
Flexible date/time formats, Charset conversion

20
Near-Term Issues