Unicode 4.0

1
Unicode 4.0
  • Mark Davis
  • President, The Unicode Consortium
  • Note: slides differ from proceedings

2
Overview
  • New Characters
  • Conformance
  • UAX: Unicode Standard Annexes
  • UCD: Unicode Character Database
  • UTS: Unicode Technical Standards
  • Not part of the Standard, but can claim
    conformance

3
Properties and Behavior
  • Unicode is not just a list of characters
  • Properties and behavior are crucial
  • With them, new characters can work out of the
    box
  • Some are part of the standard (BIDI,
    Normalization), others are associated (Collation,
    Regular Expressions)

4
New Characters: 1,228
  • Modern Scripts
  • (additions to) Indic, Khmer, Latin, Greek,
    Arabic, Syriac
  • (minority scripts) Limbu, Tai Le, Osmanya
  • Historic Scripts
  • Linear B, Cypriot, Ugaritic, Shavian, Aegean
    Numbers
  • Symbols
  • Monograms, digrams, tetragrams, other symbols
  • modifier combining characters

5
New Characters (cont.)
  • Special Characters
  • additional variation selectors (for future CJK
    variants), double-diacritics for dictionary use
  • For a detailed list, see Derived Age in the UCD
    4.0, and the beta Charts.
  • Character repertoire corresponds to ISO/IEC
    10646:2003.

6
Conformance
  • Substantially improved specification of
    conformance requirements
  • Incorporated UTR #17: Character Encoding Model,
    clearly separating encoding forms and encoding
    schemes
  • Tightened definitions of UTF-8, UTF-16, UTF-32
  • Separate definition of Unicode String
  • Clarified conformance status of Unicode Standard
    Annexes
  • Formal definitions of properties and algorithms
  • Provisional properties

7
UTF vs. Unicode String
  • Important Distinction
  • UTF
  • Unique representation for each code point
  • All else illegal, for example
  • UTF-8: C0 80
  • UTF-16: D800 0061
  • Unicode String
  • Sequence of code units
  • Internal processing, not interchange
  • Not necessarily valid UTF, for example
  • 8-bit: C0 A0
  • 16-bit: D800 0061
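
The distinction on this slide can be tried out with Python's built-in codecs. This is only an analogy: a Python `str` is a sequence of code points rather than code units, and the helper name `is_valid_utf` is made up for this sketch. It shows that strict UTF decoders reject the two byte sequences from the slide, while an in-memory string can still carry a lone surrogate that cannot be interchanged.

```python
def is_valid_utf(data: bytes, encoding: str) -> bool:
    """Return True if `data` decodes cleanly with the strict codec."""
    try:
        data.decode(encoding)  # errors="strict" is the default
        return True
    except UnicodeDecodeError:
        return False

# C0 80 is a non-shortest-form encoding of U+0000: illegal in UTF-8.
overlong_nul = bytes([0xC0, 0x80])
# D800 0061 is an unpaired high surrogate followed by "a": illegal in UTF-16.
lone_surrogate_utf16 = bytes([0xD8, 0x00, 0x00, 0x61])

print(is_valid_utf(overlong_nul, "utf-8"))
print(is_valid_utf(lone_surrogate_utf16, "utf-16-be"))

# An in-memory string may hold the lone surrogate for internal processing,
# but encoding it for interchange fails.
internal = "\ud800a"
try:
    internal.encode("utf-8")
    interchangeable = True
except UnicodeEncodeError:
    interchangeable = False
print(len(internal), interchangeable)
```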

8
Conformance (cont.)
  • Formalized policies for stability of the standard
  • Clarification of semantics of important
    characters, including BOM
  • Revised scope of enclosing combining marks
  • Revised semantics of ZWJ for cursive scripts
  • Normalization Corrections
  • U+2F868, U+2F874, U+2F91F, U+2F95F, U+2F9BF
  • All corrections subject to strict stability
    constraints
  • For 3.2 repertoire, NFC3.2(X) = NFC4.0(X)

9
Textual Clarifications
  • Major changes to Chapters 2, 3, 6, 14 and 15
  • Definitive terminology for code points
  • graphic, format, control, private-use:
    assigned characters
  • surrogate, noncharacter, reserved:
    not characters
  • Substantial improvements to many character block
    descriptions, especially Indic

10
Programming language identifiers
  • Now backwards-compatible
  • Once a Unicode identifier, always a Unicode
    identifier
  • Alternate definition for complete stability
  • Fix set of allowed characters
  • Allow all reserved code points
  • (+) Complete stability
  • (-) Odd characters
  • Also see new UTR on Syntax Characters
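
As one concrete illustration of Unicode-based identifiers, Python's `str.isidentifier()` follows a rule based on the XID_Start and XID_Continue properties (plus the underscore). The candidate names below are invented for this sketch.

```python
# Candidate names, chosen only for illustration.
candidates = ["na\u00efve", "\u5909\u6570", "x1", "1x", "a-b"]

# Map each candidate to whether Python accepts it as an identifier.
results = {name: name.isidentifier() for name in candidates}

for name, ok in results.items():
    print(f"{name!r}: {ok}")
```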

11
Case mappings now normative (but tailorable)
  • Clearer definition of string functions
  • isUpper(), isLower(), isTitle(), isFold()
  • toUpper(), toLower(), toTitle(), toFold()
  • Definition of titlecase uses word boundaries
  • Note that the Turkic mappings do not maintain
    canonical equivalence, without additional
    processing.
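
A small sketch of these string functions using Python's built-in default (untailored) case operations. Note that Python's `title()` uses a simple cased-letter heuristic rather than the word-boundary definition mentioned above, and that no Turkic tailoring is applied.

```python
word = "Stra\u00dfe"  # "Strasse" spelled with U+00DF LATIN SMALL LETTER SHARP S

upper = word.upper()      # full case mapping: sharp s expands to "SS"
folded = word.casefold()  # case folding for caseless matching
titled = "hello wORLD".title()

# U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE lowercases, by default,
# to "i" followed by U+0307 COMBINING DOT ABOVE.
dotted_capital_i = "\u0130"
lowered = dotted_capital_i.lower()

# Untailored uppercase of "i" is "I", not the Turkic dotted capital.
default_upper_i = "i".upper()

print(upper, folded, titled)
print([hex(ord(c)) for c in lowered], default_upper_i)
```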

12
UAX #9: BIDI
  • BIDI: Arabic/Hebrew display
  • HTML, all modern word processors, OSs, ...
  • New
  • canonical equivalence now preserved
  • data change, not algorithm
  • shaping is done after reordering
  • but not across directional boundaries
  • clarifications of
  • ZWJ, ZWNJ
  • intermediate level processing
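
The algorithm is driven by each character's bidirectional class, which is character data rather than code. The sketch below only looks up those classes with Python's `unicodedata`; it does not perform reordering or shaping. The sample characters are chosen for illustration.

```python
import unicodedata

samples = {
    "latin_a": "a",              # Latin letter
    "hebrew_alef": "\u05d0",     # Hebrew letter
    "arabic_alef": "\u0627",     # Arabic letter
    "european_digit": "1",
    "arabic_indic_digit": "\u0661",
    "space": " ",
}

# Bidirectional class for each sample character.
bidi_classes = {name: unicodedata.bidirectional(ch) for name, ch in samples.items()}

for name, cls in bidi_classes.items():
    print(f"{name}: {cls}")
```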

13
UAX #15: Normalization
  • Unique form for text comparison
  • W3C Character Model, International Domain Names,
    Network File System, ...
  • New
  • Description of Stable Code Points.
  • Notation NFC(x) and isNFC(x), in the Notation
    section.
  • Added pointer to UTN #5: Canonical Equivalences
    in Applications
  • Rewrote Annex 12: Corrigenda for clarity, and to
    describe the use of Normalization Corrections.
  • Added Annex 13: Canonical Equivalence.
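
The NFC(x) and isNFC(x) notation maps directly onto a few lines of Python. The function names `nfc` and `is_nfc` are chosen here to mirror the notation; they are a sketch built on `unicodedata.normalize`, not part of any standard API.

```python
import unicodedata

def nfc(x: str) -> str:
    """NFC(x): the canonical composed form of x."""
    return unicodedata.normalize("NFC", x)

def is_nfc(x: str) -> bool:
    """isNFC(x): true when x is already in NFC."""
    return nfc(x) == x

decomposed = "e\u0301"   # "e" + COMBINING ACUTE ACCENT
composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
angstrom_sign = "\u212b" # ANGSTROM SIGN, canonically equivalent to U+00C5

print(nfc(decomposed) == composed)
print(is_nfc(composed), is_nfc(decomposed))
print(hex(ord(nfc(angstrom_sign))))
```

Comparing `nfc(a) == nfc(b)` rather than `a == b` is what gives the "unique form for text comparison" described above.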

14
UAX #14: Line Breaking
  • Line-Break (word-wrap) all Unicode text
  • Customizable for different languages
  • New
  • Negative numbers and dates with hyphens will not
    break across lines
  • Word-Joiner will link any characters (except hard
    line breaks)
  • Behavior of soft hyphen clarified
  • marks opportunity for breaking, not specific
    graphic appearance.
  • Rules for GL relaxed: SP and ZW override
  • New property values: NL, WJ

15
UAX #29: Text Boundaries
  • Default user character, word, and sentence
    boundaries
  • Customizable for different languages
  • Word, sentence tailoring expected
  • New
  • Extracted from 3.0, but significantly revised
  • Grapheme cluster (user character)
  • Hangul Syllable or other Base
  • plus (optionally) any number of NSMs
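
The "base plus any number of NSMs" idea can be sketched as below. This is a deliberate simplification, not the UAX #29 algorithm: it treats every character with general category Mn or Me as a nonspacing mark, and it ignores Hangul jamo sequences, CR LF, and every other rule. The function name `simple_clusters` is hypothetical.

```python
import unicodedata

def simple_clusters(text: str) -> list[str]:
    """Group each base character with the nonspacing marks that follow it."""
    clusters: list[str] = []
    for ch in text:
        is_mark = unicodedata.category(ch) in ("Mn", "Me")
        if is_mark and clusters:
            clusters[-1] += ch   # attach the mark to the preceding base
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

# "g" with two combining marks, followed by a plain "x".
sample = "g\u0308\u0323x"
print(len(sample), len(simple_clusters(sample)))
```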

16
No Substantive Changes
  • UAX #11: East Asian Width
  • Guidelines for choosing character width
  • UAX #24: Script Names
  • Default script assignment
  • Used in regular expressions
  • Now a UAX
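
For East Asian Width, Python's `unicodedata.east_asian_width` exposes the property value per character. The `display_width` helper below is a hypothetical, simplified use of it (two columns for Wide and Fullwidth, one otherwise) and ignores combining marks and ambiguous-width characters.

```python
import unicodedata

def display_width(text: str) -> int:
    """Rough column count: Wide (W) and Fullwidth (F) count as two columns."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in text)

widths = {
    "ascii_A": unicodedata.east_asian_width("A"),
    "cjk_ideograph": unicodedata.east_asian_width("\u4e2d"),
    "fullwidth_A": unicodedata.east_asian_width("\uff21"),
    "halfwidth_katakana_a": unicodedata.east_asian_width("\uff71"),
}

print(widths)
print(display_width("abc\u4e2d\u6587"))
```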

17
Superseded UAXes
  • Incorporated into and thus superseded by Unicode
    Version 4.0
  • UAX #13: Unicode Newline Guidelines
  • UAX #19: UTF-32
  • UAX #21: Case Mappings
  • UAX #27: Unicode 3.1
  • UAX #28: Unicode 3.2

18
Unicode Character Database
  • Crucial Component of Unicode
  • Documentation coalesced into UCD.html.
  • New properties and values
  • Hangul_Syllable_Type, Unicode_Radical_Stroke
  • CJK numeric values added.
  • PropertyValueAliases adds block names
  • UCD fallback props more precisely defined.
  • for code points not explicitly in data files
  • New Characters
  • Appropriate properties assigned

19
UCD4.0 (cont.)
  • Modifier letters
  • The general category of 02B9..02BA, 02C6..02CF
    changed to general category Lm.
  • Khmer
  • Two Khmer characters are deprecated; four others
    strongly discouraged.
  • Decimal Digits
  • Numeric_Type=decimal digit now aligned with
    General_Category=Nd
  • Braille
  • Added script value

20
UCD4.0 (cont. 2)
  • Case Mapping
  • Fixed for Turkish, Lithuanian
  • Default Ignorables
  • Hangul Filler characters
  • Soft-Hyphen, CGJ, ZWS
  • Arabic End of Ayah and Syriac Abbreviation Mark
    no longer DI, shaping classes fixed.
  • Grapheme_Extend
  • removes halfwidth katakana marks, most Mc (except
    as needed for canonical equivalence)

21
Unicode Technical Standard
  • UTS: separate standard
  • independent conformance requirements
  • UTR: information and guidelines
  • Documents may move from UTR status to UTS

22
UTS #10: Unicode Collation
  • Significance
  • String comparison, matching, searching
  • Compares all Unicode characters
  • Handles linguistic features
  • Accents, Case, Punctuation,
  • Contextual weighting,
  • Tailor for different languages
  • Version 4.0.0 due Sept. 2003
  • From now on, to be sync'ed in repertoire and
    version with the Unicode Standard.
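
To make the idea of linguistic, multi-level comparison concrete, here is a toy three-level sort key. It is not the Unicode Collation Algorithm and uses no collation weight table: the levels (base letters, then accents, then case) are approximated from NFD decomposition, and the name `toy_sort_key` is invented for this sketch.

```python
import unicodedata

def toy_sort_key(s: str):
    """Three-level key: base letters, then accents, then case."""
    nfd = unicodedata.normalize("NFD", s)
    base = "".join(c for c in nfd if not unicodedata.combining(c))
    marks = tuple(c for c in nfd if unicodedata.combining(c))
    primary = base.casefold()                    # ignore accents and case
    tertiary = tuple(c.isupper() for c in base)  # lowercase before uppercase
    return (primary, marks, tertiary)

words = ["rose", "r\u00e9sum\u00e9", "Resume", "resume"]

by_code_point = sorted(words)
by_toy_key = sorted(words, key=toy_sort_key)

print(by_code_point)
print(by_toy_key)
```

Plain code point order puts the accented word after "rose"; the toy key keeps all three spellings of "resume" together and separates them only at the accent and case levels.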

23
UTS #18: Regular Expressions
  • Significance
  • Crucial to many applications: web, XML, ...
  • Unicode adds significant requirements
  • Level 1: Basic Support
  • Perl
  • Level 2: Extended Support
  • Level 3: Tailored Support
  • New
  • Recently approved as UTS (was UTR)
  • Adds clearer conformance requirements
  • Flexible list of features
  • Partial conformance claims
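
A brief sketch of what Unicode-aware matching means in practice, using Python's standard `re` module. This module does not support `\p{...}` property syntax, so the example only illustrates Unicode-aware `\w` and `\d`; it is not a demonstration of any particular conformance level.

```python
import re

# "naive" with diaeresis, "cafe" with acute, Greek-initial word,
# and two Arabic-Indic digits.
text = "na\u00efve caf\u00e9 \u03a9mega \u0663\u0664"

unicode_words = re.findall(r"\w+", text)           # Unicode-aware by default
unicode_digits = re.findall(r"\d+", text)          # matches any decimal digit
ascii_words = re.findall(r"\w+", text, re.ASCII)   # ASCII-only for contrast

print(unicode_words)
print(unicode_digits)
print(ascii_words)
```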

24
UTS #6: SCSU
  • Simple Unicode Compression
  • Added suitability for XML
  • See also Technical Note on BOCU
  • Main difference: preserves binary order
  • x < y => BOCU(x) < BOCU(y)
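
The order-preservation property can be checked for any encoder. Python has no built-in BOCU codec, so the sketch below uses UTF-8 and UTF-16 purely as stand-ins: UTF-8 preserves code point order in its bytes, while UTF-16 does not once supplementary characters are involved. The helper name `preserves_order` is hypothetical.

```python
def preserves_order(encode, x: str, y: str) -> bool:
    """True if comparing encoded bytes agrees with comparing code points."""
    return (x < y) == (encode(x) < encode(y))

def utf8(s: str) -> bytes:
    return s.encode("utf-8")

def utf16be(s: str) -> bytes:
    return s.encode("utf-16-be")

x = "\uff61"       # a BMP character above the surrogate range
y = "\U00010000"   # the first supplementary code point

print(x < y)
print(preserves_order(utf8, x, y))
print(preserves_order(utf16be, x, y))
```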

25
New UTRs
  • Draft UTR #23: Character Properties
  • Draft Character Property Model
  • Character Folding
  • Hiragana-Katakana, Case,
  • Programming Language IDs, Syntax characters

26
Q & A
  • Other talks here
  • Common Locale Data
  • interchange of language-specific data for
    sorting, dates, times, currencies
  • ICU
  • premier Unicode enablement library
  • full-featured, x-platform
  • C, C++, Java

27
Background Slides

28
Unicode 3.2 (March, 2002)
  • New Characters: 1,016
  • Symbols
  • Large collection of mathematical symbols,
    especially targeted at MathML, recycling symbols,
    ornamental brackets.
  • Special Characters
  • combining grapheme joiner, word joiner, invisible
    operators for math, variation selectors
  • Modern Scripts
  • minority scripts of the Philippines

29
Conformance
  • Eliminates irregular UTF-8
  • Defines variation sequences
  • Replaces ZWNBSP with Word Joiner
  • Clarifies scope of combining marks (further
    revised in 4.0)
  • Clarifications of conjoining jamo behavior,
    hangul syllable structure, decomposables,

30
Textual Clarifications
  • Combined vowels in Khmer, characters discouraged
    in Khmer
  • Use of dingbats

31
Unicode Standard Annexes
  • UAX #21: Case Mappings (was UTR)

32
Unicode Character Database
  • New properties
  • IDS_Binary_Operator, IDS_Trinary_Operator,
    Radical, Unified_Ideograph,
  • Default_Ignorable_Code_Point, Deprecated,
    Soft_Dotted, Logical_Order_Exception
  • Grapheme_Base, Grapheme_Extend, Grapheme_Link
  • DerivedAge
  • Normalization Corrections
  • Added Property and Property Value Aliases
  • Adds StandardizedVariants.html

33
Related Items
  • UTS #10: Unicode Collation Algorithm
  • Ignorable character handling, dual versioning,
    more conditions on well-formed weights, separate
    weights for CJK and unassigned characters,
    non-characters
  • Note: base version still U3.1
  • UTR #26: CESU-8
  • Unicode Technical Notes
  • Updated Character Encoding Stability Policy
  • Added Public Review process
  • Updated Glossary

34
Unicode 3.1 (March, 2001)
  • New Characters: 44,946
  • First supplementaries encoded!
  • Modern scripts
  • CJK Ideographs (now totaling 71,039)
  • Historic scripts
  • Old Italic, Gothic, Deseret, Byzantine Musical
    Symbols
  • Symbols
  • Mathematical Alphanumeric Symbols, (Western)
    Musical Symbols
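
Supplementary characters sit above U+FFFF, so UTF-16 represents each one with a surrogate pair. The sketch below computes the pair by hand for U+10300 (the first code point of the Old Italic block) and compares it with Python's codec; the function name `to_surrogates` is chosen for this example.

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low

old_italic_a = "\U00010300"
high, low = to_surrogates(ord(old_italic_a))

utf16_hex = old_italic_a.encode("utf-16-be").hex()
utf8_hex = old_italic_a.encode("utf-8").hex()

print(f"{high:04X} {low:04X}", utf16_hex, utf8_hex)
```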

35
Conformance
  • Non-shortest-form UTF-8 excluded
  • Clarification of the stability of the standard,
  • code units vs. code points, non-characters,
    normative properties, informative properties,
    normative references
  • Revisions of guidelines
  • wchar_t, unassigned code points, identifiers
  • Major revision of Georgian
  • Use of ZWNJ and ZWJ for ligatures
  • Language tag characters encoded
  • but discouraged

36
Unicode Standard Annexes
  • UAX #19: UTF-32

37
Unicode Character Database
  • Major revision of PropList properties
  • White_Space, Bidi_Control, Join_Control,
    Hex_Digit
  • Alphabetic, Ideographic, Lowercase, Uppercase,
    ID_Start, ID_Continue, XID_Start, XID_Continue,
    Noncharacter_Code_Point
  • Quotation_Mark, Terminal_Punctuation, Math, Dash,
    Hyphen, Diacritic, Extender
  • New properties: Case folding, Scripts
  • Added DerivedProperties, NormalizationTest

38
Related Items
  • Documented Character Encoding Stability Policy
  • UTS #10: Unicode Collation Algorithm
  • Merged data files updated to base version 3.1
  • UTR #18: Unicode Regular Expression Guidelines
  • UTR #20: Unicode in XML and other Markup
    Languages
  • UTR #22: Character Mapping Tables
  • UTR #24: Script Names

39
Schedule
  • 2003, April: UCD/UAXes
  • Final data files available
  • Implementation can proceed
  • 2003, September
  • Book Available