What's New in Globalization

1
What's New in Globalization?
  • Mark Davis
  • President & Cofounder, The Unicode Consortium

2
The Unicode Standard, Version 5.0
  • "Hard copy versions of the Unicode Standard have
    been among the most crucial and most heavily used
    reference books in my personal library for
    years." (Donald E. Knuth)
  • "For more than a decade, Unicode has been a
    foundation for many Microsoft products and
    technologies; Unicode Standard Version 5.0 will
    help us deliver important new benefits to
    users." (Bill Gates)
  • "The path W3C follows to making text on the Web
    truly global is Unicode." (Sir Tim Berners-Lee,
    KBE)
  • "Without Unicode, Java wouldn't be Java, and the
    Internet would have a harder time connecting the
    people of the world." (James Gosling)

3
The Unicode Standard, Version 5.0
  • Obsoletes previous versions
  • Basis for Microsoft's Vista; in upgrade plans for
    Google, Yahoo!, and ICU, to name but a few
  • Hundreds of pages of new information; thousands
    of revised pages; all Unicode Standard Annexes
  • Systematic framework for improved text processing
  • Improvements to the Unicode Encoding Model for
    UTF-8,
  • Rigorous stability of case folding and
    identifiers
  • Improved interoperability and backward
    compatibility
  • Enabling additional new ways to optimize code

4
U5.0 Unicode Character Database
  • Unicode: far more than a list of characters
  • Properties: key to how characters function
  • Changes in 5.0
    • Scripts: unassigned code points → Zzzz
    • Casing stability: Upper ? folded
    • BIDI: consistent Bidi_Mirrored
    • Now normative: kIICore
    • Line Break: SE Asian → Complex_Context
    • New properties: Normative_Name_Alias, Deprecated,
      3 Unihan provisional properties
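
To make "properties: key to how characters function" concrete, here is a minimal Python sketch that reads a few Unicode Character Database properties through the standard `unicodedata` module. The helper name `describe` is an assumption made for illustration, and the values come from whichever Unicode version the Python build ships, not necessarily 5.0.

```python
import unicodedata

def describe(ch: str) -> dict:
    """Return a handful of UCD properties for a single character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "general_category": unicodedata.category(ch),
        "bidi_class": unicodedata.bidirectional(ch),
        "bidi_mirrored": bool(unicodedata.mirrored(ch)),
    }

paren = describe("(")        # punctuation that mirrors in right-to-left text
alef = describe("\u05D0")    # a strongly right-to-left letter

print("UCD version:", unicodedata.unidata_version)
print(paren)
print(alef)
```

Script and line-break properties are not exposed by `unicodedata`; a library such as ICU would be needed for those.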

5
U5.0 Conformance
  • Stable Case-Folded
  • Upper ? Lower
  • Much clearer encoding / property model
  • Stable Approved Named Character Sequences
  • Bengali, Gurmukhi, Tamil changes
  • Combining grapheme joiner clarified
  • Disunification of Diacritics

6
U5.0 Annexes: Core
  • UAX #9: Bidirectional Algorithm
    • Tightened conformance requirements
  • UAX #15: Unicode Normalization Forms
    • New Stream-Safe Text Format
    • Appendix of characters requiring special handling
    • Expanded info on stability guarantees
    • Additional detailed figures, guidelines
  • UAX #31: Identifier and Pattern Syntax
    • Added profiles, information on usage
7
U5.0 Annexes: Boundaries
  • UAX #14: Line Breaking Properties
    • Rules modified to improve behavior
    • Now normative (conformance clauses reorganized)
  • UAX #29: Text Boundaries
    • Edge cases improved
    • Tailorings for text boundaries now in Unicode
      CLDR
    • Format of the rules changed to ease
      implementation
    • Additional guidelines on regex, identifiers,

8
U5.0 Characters by Script

9
Unicode Character Timeline

10
Unicode Guide for Programmers
  • Adjunct to Standard
  • Concise Guide for Software Globalization
  • Crucial Concepts
  • Key Gotchas
    • Recognize and Avoid
  • Details on
    • Encoding conversions
    • UTF-8, -16, -32; BOM
    • Using character properties
    • Text Operations
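
The encoding-conversion and BOM topics listed above can be sketched in Python as follows. The helper names `sniff_bom` and `decode_with_bom` are assumptions made for this example, and the fallback to UTF-8 when no BOM is present is a design choice of the sketch, not a rule from the Standard.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32-LE BOM
# (FF FE 00 00) begins with the UTF-16-LE BOM (FF FE).
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)

def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Strip a leading BOM if present and decode with the matching codec."""
    name, skip = sniff_bom(data)
    return data[skip:].decode(name or default)

text = "Unicode \u2713"
for enc in ("utf-8", "utf-16-le", "utf-32-be"):
    print(enc, len(text.encode(enc)), "bytes")
```

A UTF-16-LE stream whose first character after the BOM is U+0000 would be misread as UTF-32-LE by this check; real detectors need more context than the first four bytes.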

11
Unicode Common Locale Data Repository (CLDR)
  • Key locale data for world languages
  • Most extensive standard repository of locale data
  • XML format

[Slide graphic: sample locale data, including dates (Δευτέρα, 05 Σεπτεμβρίου 2005; Montag, 5. September 2005), number and currency formats, language names in Polish (Arabic: arabski; Bulgarian: bułgarski; Czech: czeski), territory names (Africa, Central America, Eastern Africa, Northern Africa), and currency symbols (AED, BHD, DZD, EGP, EUR)]

12
Unicode CLDR 1.4
  • 121 languages and 142 territories: 360 locales
    in all
  • 25% more locale data: over 17,000 new/modified
    items
  • Repository separated into language vs. locale data
  • Language-specific segmentation (word/line
    breaks)
  • Transliterations (e.g., Ελληνικά → Elleniká)
  • Data for lenient date/time formatting and parsing
    • Programmer asks for numeric day, abbreviated
      month
    • Best format pattern returned, e.g., dd.MMM
  • Quarters in dates (e.g., 2006Q1)
  • BCP 47 compatibility extensions

13
BCP 47 Language Tags
  • Usage: HTTP, HTML, XML; CLDR Locale IDs
  • RFC 4646: obsoletes RFCs 1766, 3066
  • Addresses problems in RFC 3066
    • ISO standards: stability / accessibility /
      ambiguity
    • Parseability, extensibility, registration speed
    • Identification of script (where necessary)
      • Traditional Chinese (zh-Hant), Serbian in Latin
        (sr-Latn), Azerbaijani in Cyrillic (az-Cyrl), etc.
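
The language-script-region structure of tags such as zh-Hant can be sketched with a short Python parser. This is a deliberately simplified subset of the RFC 4646 grammar: variants, extensions, private-use and grandfathered tags are not handled, no registry lookup is done, and the name `parse_tag` is an assumption made for illustration.

```python
import re

# language[-script][-region] only; a simplified subset of RFC 4646.
_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)

def parse_tag(tag: str) -> dict:
    """Split a simple language tag and apply the conventional casing."""
    m = _TAG.match(tag)
    if m is None:
        raise ValueError(f"unsupported or malformed tag: {tag!r}")
    parts = m.groupdict()
    return {
        "language": parts["language"].lower(),
        "script": parts["script"].title() if parts["script"] else None,
        "region": parts["region"].upper() if parts["region"] else None,
    }

for tag in ("zh-Hant", "sr-Latn", "az-Cyrl", "en-US"):
    print(tag, parse_tag(tag))
```

Because the script subtag is always four letters and the region is two letters or three digits, the subtags can be told apart by length alone, which is the parseability gain the slide refers to.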

14
Unicode Security
  • Examples
    • Visual confusables: paypal.com with Cyrillic
      "а" (U+0430)
    • Non-visual problems: buffer overflows,
      non-shortest form,
  • UTR #36: Unicode Security Considerations
    • Guidelines & Recommendations
  • UTS #39: Unicode Security Mechanisms
    • Algorithms & Data
    • Limitations on Repertoire
    • Testing for Confusables
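
Both example problems above can be demonstrated in a few lines of Python. This is a rough sketch, not the UTS #39 algorithm: the standard library has no Script property, so `scripts_used` (a hypothetical helper) guesses the script from the first word of each character name, and there is no confusables data table behind it.

```python
import unicodedata

def scripts_used(label: str) -> set:
    """Approximate the scripts in a label from character names (heuristic)."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in label
        if ch.isalpha()
    }

def is_mixed_script(label: str) -> bool:
    return len(scripts_used(label)) > 1

def is_well_formed_utf8(data: bytes) -> bool:
    """Reject ill-formed input, including non-shortest (overlong) forms."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

genuine = "paypal"
spoofed = "p\u0430ypal"   # second letter is CYRILLIC SMALL LETTER A

print(scripts_used(genuine), scripts_used(spoofed))
print(is_well_formed_utf8(b"\xc0\xaf"))   # overlong encoding of "/"
```

The two labels render almost identically yet compare unequal, and the overlong byte pair is rejected by a conforming decoder instead of slipping past a filter that looks for "/".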

15
Internationalized Domain Names
  • One instance of a broad problem
    • Many RFCs use Nameprep: limited to Unicode 3.2
  • Unicode recommendations
    • Narrow the repertoire: exclude symbols,
      punctuation
    • Expand the coverage: currently only Unicode 3.2
  • IETF idn-nextsteps published
    • Some positive developments, but misreads Unicode,
      needs more work
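
The Unicode 3.2 limitation can be observed directly in Python, whose built-in `idna` codec follows the original IDNA RFCs and whose `unicodedata` module keeps a frozen 3.2 database for that purpose. The sample label and the helper name `assigned_in_3_2` are assumptions chosen for illustration.

```python
import unicodedata

# Convert an internationalized label to its ASCII-compatible form and back.
label = "b\u00FCcher.example"
ace = label.encode("idna")
round_trip = ace.decode("idna")

# The frozen Unicode 3.2 database that Nameprep processing relies on.
ucd32 = unicodedata.ucd_3_2_0

def assigned_in_3_2(ch: str) -> bool:
    """True if the character was already assigned in Unicode 3.2."""
    return ucd32.category(ch) != "Cn"

print(ace, round_trip)
print(ucd32.unidata_version)
print(assigned_in_3_2("a"), assigned_in_3_2("\u0242"))
```

U+0242, a Latin letter added in Unicode 5.0, is reported as unassigned by the 3.2 database, which shows how later additions fall outside what Nameprep defines.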

16
URL → IRI
  • Internationalized Resource Identifier (IRI)
  • UTF-8, %-escaped
  • Example
    • http://w3.org/International/articles/idn-and-iri/
      JP??/??????.html
    • http://w3.org/International/articles/idn-and-iri/
      JP%E7%B4%8D...%E8%B1%86.html
  • See http://ietf.org/rfc/rfc3987.txt
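
The "UTF-8, then percent-escape" step can be sketched with Python's standard `urllib.parse`. Only the path mapping is shown: a full RFC 3987 conversion also has to deal with the host name and with escapes already present in the input. The function name and the sample path are assumptions, with the two ideographs chosen to match the escaped bytes shown in the example.

```python
from urllib.parse import quote, unquote

def iri_path_to_uri_path(path: str) -> str:
    """Encode non-ASCII characters as UTF-8 and percent-escape the bytes."""
    return quote(path, safe="/", encoding="utf-8")

iri_path = "/International/articles/idn-and-iri/JP\u7D0D\u8C46/"
uri_path = iri_path_to_uri_path(iri_path)

print(uri_path)
print(unquote(uri_path) == iri_path)
```

Each ideograph becomes three escaped bytes, so the mapping is reversible and the resulting URI is plain ASCII.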

17
Ideographic Variation Database
  • U+82A6 (ashi): multiple forms
  • The first occurrence: any glyph
  • Second occurrence is in the name of the town
    Ashiya, customarily displayed with form 4
  • Registration for variants

18
Ideographic Variation Database
  • Variation Selector
    • Identifies a restriction on the appearance of a
      character
    • Character + Variation Selector = Variation
      Sequence
  • Han ideographs
    • Impossible to build a single collection for
      everyone: requirements from scholars, governments
      and publishers
    • Instead, registration of multiple independent
      collections
  • Unicode Ideographic Variation Database
    • A given variation sequence is used in at most one
      collection
    • Makes interchange of variation sequences
      reliable
    • Registration, not assessment
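
The "character + variation selector" structure can be sketched in Python as follows. The pairing of U+82A6 with VARIATION SELECTOR-17 is purely illustrative: the code does not consult the Ideographic Variation Database, so nothing here implies that this sequence is registered, and the helper names are assumptions.

```python
import unicodedata

BASE = "\u82A6"         # the ideograph used as the example above
VS17 = "\U000E0100"     # VARIATION SELECTOR-17

def is_variation_selector(ch: str) -> bool:
    """True for VS1..VS16 and VS17..VS256 (Mongolian selectors not covered)."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF

def split_variation_sequences(text: str) -> list:
    """Return (base, selector-or-None) pairs for each character in text."""
    pairs = []
    for ch in text:
        if is_variation_selector(ch) and pairs and pairs[-1][1] is None:
            pairs[-1] = (pairs[-1][0], ch)   # attach selector to its base
        else:
            pairs.append((ch, None))
    return pairs

def strip_selectors(text: str) -> str:
    """Drop selectors, leaving the plain characters for matching or search."""
    return "".join(ch for ch in text if not is_variation_selector(ch))

sample = BASE + VS17 + BASE
print(unicodedata.name(VS17))
print(len(sample), split_variation_sequences(sample))
```

Because the selector is a separate code point, software that does not know a given sequence can still fall back to the base character.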

19
ICU 3.6
  • Mature, portable C/C++/Java intl libraries
  • Unicode 5.0, UCA 5.0, CLDR 1.4
  • ICU4C
    • Charset Detection
    • Improved Time Zones, Thai word break, UText (64
      bit), Performance, Data Management,
  • ICU4J
    • Globalization Preferences
    • Flexible date/time formats, Charset conversion

20
Near-Term Issues
  • Unicode 5.0.1, Unicode 5.1
  • CLDR / BCP 47bis
  • LDAP
  • Collation Registry
  • IANA Charset Registry

21
Unicode 5.1 - possibilities
  • Characters
    • CJK Unified Ideographs Extension C
    • Minority scripts: Cham and Lanna
    • Malayalam chillu
  • Properties/Behavior
    • Normalization process for stable strings

22
CLDR 1.5 / BCP 47bis
  • CLDR 1.5
    • Data submission: starting November
    • New structures / data
  • BCP 47
    • Adding 7,000 (!) new language subtags
    • Possibly other changes

23
LDAP
  • Now has definitive comparison (good)
  • Stuck at Unicode 3.2 (bad)
  • http://www.ietf.org/rfc/rfc4518.txt

24
Collation Registry
  • Nearing approval
  • Adds ability to register comparisons
  • Workable for basic cases
  • http://www.ietf.org/internet-drafts/draft-newman-i18n-comparator-14.txt

25
IANA Charset registry
  • Currently limited usefulness
    • Ill-defined
    • Missing mapping tables
    • Incomplete
    • Inaccurate
  • Regime Change
    • Hope for future improvements!

26
What's New in Globalization?
  • Mark Davis
  • President & Cofounder, The Unicode Consortium