Status of Proposed Unicode Changes to the SQL Standard

Status of Proposed Unicode Changes to the SQL Standard, by Michael G. McKenna / Sybase, Inc. and Stefan Buchta, Hirotaka Yoshioka / Oracle Corporation

Transcript and Presenter's Notes

1
Status of Proposed Unicode Changes to the SQL
Standard
  • by
  • Michael G. McKenna / Sybase, Inc.
  • Stefan Buchta, Hirotaka Yoshioka / Oracle
    Corporation
  • v.1.3
  • March 1999

2
Introduction
  • SQL Character Set Internationalization is
    out-of-date
  • New accepted standards
  • Unicode
  • Posix
  • WG20
  • Java
  • Some changes made to I18N for SQL3
  • Still under review by ISO and ANSI for future
  • WWW
  • XML
  • Windows/NT
  • Oracle/Sybase

3
Scope
  • Concerned Parties
  • Implementer vendor
  • Designer / Admin customer
  • User interface
  • This is a tool
  • Discussion catalyst
  • Proposal for changes to SQL
  • Suggestion for implementation

4
Implementers
  • Major Database Companies
  • Oracle, Sybase, Informix, Ask/Ingres, Adabas,
    DB2, Borland, ..., Microsoft
  • Concerns
  • Feasibility to implement
  • Relevance to customer
  • Migration costs/issues
  • Future markets
  • Maintainability
  • Backward compatibility
  • Installed customer base
  • Competitive position

5
Issues with SQL92 and SQL3
  • Old I18N (pre standards)
  • non conformant
  • awkward
  • not implemented
  • Character Data/Character Columns
  • Multiple Character Sets
  • CREATE CHARACTER SET
  • Character set introducer

8
Issues, continued (2)
  • SQL Names and Literals
  • Example of SQL92/SQL3 literal
  • SELECT * FROM employee WHERE name =
    _iso88591'Müller'
  • Now uses Unicode lexical types for identifiers
  • SQL_TEXT
  • Superset of all installed character sets
  • Ideally, should explicitly be Unicode
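
A minimal Python sketch of why a literal needed a character set introducer at all. Python is used here only for illustration and is not part of the slides; the variable names are hypothetical. The same name has different byte sequences under ISO 8859-1 and UTF-8, and reading one as the other corrupts the text.

```python
# The same literal, 'Müller', stored under two character sets.
name = "M\u00fcller"  # ü written as an escape to keep the source ASCII-safe

latin1_bytes = name.encode("iso8859_1")  # one byte per character
utf8_bytes = name.encode("utf-8")        # ü takes two bytes

# Decoding bytes with the wrong character set silently corrupts the text,
# which is the ambiguity an introducer such as _iso88591 resolves.
misread = utf8_bytes.decode("iso8859_1")

print(latin1_bytes, utf8_bytes, misread)
```

If SQL_TEXT is explicitly Unicode, as the slide suggests, identifiers no longer need this per-literal tagging.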

10
Issues, continued (3)
  • Collation Handling
  • SQL92 contains features that (almost) allow the
    definition of collations
  • Example
  • CREATE COLLATION german_dictionary FOR
    iso8859_1 FROM (USING(german_default), MODIFY
    (A < Ä, a < ä, O < Ö, o < ö, U < Ü, u < ü,
    ß = ss), WHEN NOT FOUND MAX)
  • No multi-pass ordering like ISO 14651
  • Drastically changed for SQL3
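
To illustrate what "multi-pass ordering" means, here is a simplified two-level sort key in Python. It is a sketch only, not ISO 14651 or any SQL feature: the first level compares base letters with accents and case ignored, and the second level is consulted only on ties. The function name `two_level_key` is an assumption made for this example.

```python
import unicodedata

def two_level_key(word):
    """Sort key with two passes: base letters first, accents as tiebreaker."""
    decomposed = unicodedata.normalize("NFD", word)
    # Level 1: drop combining marks and ignore case.
    base = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    ).casefold()
    # Level 2: the decomposed form, consulted only when level 1 ties.
    return (base, decomposed)

words = ["Zebra", "\u00c4pfel", "Apfel"]  # \u00c4 is Ä
by_code_point = sorted(words)             # Ä sorts after Z
by_two_levels = sorted(words, key=two_level_key)

print(by_code_point)
print(by_two_levels)
```

A single-pass ordering has to place Ä at one fixed position relative to every other letter, whereas the two-level key keeps Ä next to A and uses the accent only to break ties.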

13
Issues, continued (4)
  • Text element versus Unicode character (10646
    levels, applies to collations)
  • How long is a character?
  • Composite Characters/Canonical Equivalence
  • ñ (U+00F1) ≡ n + combining tilde (U+0303)

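The "How long is a character?" question can be made concrete with Python's standard `unicodedata` module. This is an illustrative sketch, not something from the slides: the two spellings of ñ differ in length and compare unequal until they are normalized.

```python
import unicodedata

precomposed = "\u00f1"   # ñ as one code point
decomposed = "n\u0303"   # n followed by COMBINING TILDE

# Different code point sequences, so a naive comparison fails...
naive_equal = precomposed == decomposed
# ...but they are canonically equivalent: normalizing makes them match.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed

lengths = (len(precomposed), len(decomposed))
print(naive_equal, nfc_equal, nfd_equal, lengths)
```
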
14
Issues, continued (5)
  • Upper-/Lowercase Translations (FOLD)
  • Example
  • German Ü <-> ü
  • But German ß has the uppercase equivalent
    SS, and not every SS sequence corresponds to
    ß when returned to lowercase.
  • FOLD function to use Unicode case-mapping, as of
    January 1999
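
The round-trip problem can be checked with Python's built-in string methods, which apply the Unicode case mappings. This is a sketch for illustration only; it says nothing about how a SQL FOLD function would be implemented.

```python
word = "Stra\u00dfe"          # Straße
upper = word.upper()          # ß expands to SS
round_trip = upper.lower()    # SS comes back as ss, not ß

# Ü and ü map to each other cleanly in both directions.
umlaut_ok = "\u00fc".upper() == "\u00dc" and "\u00dc".lower() == "\u00fc"

print(upper, round_trip, umlaut_ok)
```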

15
Issues, continued (6)
  • Client Character Encoding through CLI (Locale
    negotiation)
  • MESSAGE TEXT
  • User Defined Characters (UDC)

16
Proposed Changes to SQL
  • Synchronize with present standards
  • Character Handling
  • Collations
  • Locales

17
Synchronize with present standards
  • JTC1/SC22/WG20
  • Programming Languages and I18N
  • JTC1/SC2/WG2 ISO 10646-1
  • Character Set handling
  • Unicode concepts
  • JTC1/SC2/WG3
  • Single byte character sets
  • ISO 14651/14652
  • Standardized collations
  • Unicode Technical Report

18
Synchronize with present de facto standards
  • Java
  • Unicode String type
  • RFC 2277
  • UTF-8 as default internet encoding
  • XML
  • Potential Universal data stream
  • Default encoding is Unicode
  • ODBC 3.5
  • Mapping with SQL_WCHAR

19
Gratuitous Animated Graphics ...
20
Character Sets
  • SQL_TEXT ≡ Unicode
  • Eliminate introducer for identifiers
  • \Uxxxx
  • \\ escape
  • Keep schema default character set
  • Add UNICHAR datatype
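
One way to read the `\Uxxxx` and `\\` bullets is as an escape syntax for writing Unicode characters in identifiers without an introducer. The Python sketch below decodes such escapes. The exact rules (four hex digits, uppercase `U`) are an assumption taken from the slide's notation, and `decode_identifier` is a hypothetical helper.

```python
import re

# A backslash followed by either U + four hex digits, or another backslash.
_ESCAPE = re.compile(r"\\(U[0-9A-Fa-f]{4}|\\)")

def decode_identifier(text):
    # Hypothetical helper: expand \Uxxxx to the code point it names,
    # and collapse an escaped backslash to a single backslash.
    def expand(match):
        token = match.group(1)
        if token == "\\":
            return "\\"
        return chr(int(token[1:], 16))
    return _ESCAPE.sub(expand, text)

print(decode_identifier(r"M\U00FCller"))   # escape expanded
print(decode_identifier(r"a\\U00FCb"))     # escaped backslash kept literal
```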

22
Character Sets (2)
  • Surrogate characters
  • User-defined character mechanism
  • CREATE UDC <char value>
  • FOR <charset name> AS <unicode binary value>
  • WHERE LEXICAL PROPERTY LIKE <unicode binary
    value>
  • WITH UPPER LOWER <unicode binary
    value>
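
On the surrogate characters bullet: a code point outside the Basic Multilingual Plane is stored in UTF-16 as two 16-bit code units. The Python sketch below shows this for an arbitrarily chosen code point, U+10400. It is illustrative only and not part of the proposal.

```python
ch = "\U00010400"  # a code point outside the BMP, chosen arbitrarily

# Encode to UTF-16 (big-endian, no BOM) and split into 16-bit code units.
utf16 = ch.encode("utf-16-be")
code_units = [
    int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)
]

# Recompute the surrogate pair by hand from the code point.
offset = ord(ch) - 0x10000
high = 0xD800 + (offset >> 10)
low = 0xDC00 + (offset & 0x3FF)

print([hex(u) for u in code_units], hex(high), hex(low))
```

A column length counted in 16-bit units and one counted in characters therefore disagree for such data.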

24
Character Sets (3)
  • Canonical Equivalence for Identifiers
  • Entry (Level 1): Â ≠ A Â ≠ Â
  • Intermediate (Level 2): Â Â
    (Vietnamese, Indic, Arabic)
  • Full (Level 3): Â A Â A
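
A minimal sketch of identifier lookup under canonical equivalence, assuming identifiers are normalized to NFC before being stored or compared. The choice of NFC, the `catalog` dictionary, and the helper names are assumptions for illustration; the slides do not specify a normalization form.

```python
import unicodedata

catalog = {}  # hypothetical schema catalog: normalized name -> object

def register(name, obj):
    # Store under the NFC form so equivalent spellings share one entry.
    catalog[unicodedata.normalize("NFC", name)] = obj

def lookup(name):
    return catalog.get(unicodedata.normalize("NFC", name))

register("t\u00e2ble", "table object")   # â as one code point
found = lookup("ta\u0302ble")            # a + COMBINING CIRCUMFLEX ACCENT
missing = lookup("table")                # plain a is a different identifier

print(found, missing)
```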

25
Collations
  • Unicode Consortium Technical Report 10 (Unicode
    Collation Algorithm) for universal sorting
  • Has mechanism for cultural differences,
    overlays
  • Proven in actual implementation (Java, Sybase
    internal testing)
  • Issue: No standard cultural variations yet
  • Being developed by National Bodies, de facto,
    TC304/Europe, ISO 14652 Cultural Registries
  • Map all data to Unicode for collation results

26
Collations (2)
  • Handle Text Elements in comparison
  • example: Marillo (ll)
    like mari_o
    like maril_o
    marillo, marilo like mari_lo
  • Handle Text Elements in string functions
  • example (Vietnamese)
  • SUBSTRING ( <string>
    COLLATE viet_te FROM 2 FOR 5 )
  • should be (five text elements)
  • not (five Unicode characters)
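
The difference between counting text elements and counting Unicode characters can be sketched in Python. The approximation here treats a text element as a base character plus the combining marks that follow it; a real collation such as the slide's viet_te could define elements differently. The helper names are hypothetical.

```python
import unicodedata

def text_elements(s):
    """Group each base character with the combining marks that follow it."""
    elements = []
    for ch in s:
        if elements and unicodedata.combining(ch):
            elements[-1] += ch
        else:
            elements.append(ch)
    return elements

def substring_te(s, start, length):
    """SUBSTRING-like slice counted in text elements; start is 1-based."""
    return "".join(text_elements(s)[start - 1:start - 1 + length])

# Decomposed "Việt Nam": ệ becomes e plus two combining marks.
decomposed = unicodedata.normalize("NFD", "Vi\u1ec7t Nam")
by_elements = substring_te(decomposed, 2, 5)   # five text elements
by_code_points = decomposed[1:6]               # five code points

print(unicodedata.normalize("NFC", by_elements))
print(unicodedata.normalize("NFC", by_code_points))
```

Counting code points returns a shorter visible string, because two of the five positions are spent on combining marks.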

27
Locale Architecture
  • Linguistic profile preferences on connection
  • Like HTTP language preferences protocol
  • Consistent SQL locales independent of O/S
  • Communicate language to called external functions
  • to handle MESSAGE_TEXT in diagnostic functions
  • Locale-sensitive FOLD
  • e.g. Capitalized accents with fr_FR versus fr_CA
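
A sketch of connection-time negotiation modeled on an HTTP-style language preference list. The header format, the helper names, and the set of available locales are all assumptions for illustration; the slides do not define a wire format.

```python
def parse_preferences(header):
    """Parse 'fr-CA, fr;q=0.8, en;q=0.5' into tags, highest weight first."""
    weighted = []
    for position, item in enumerate(header.split(",")):
        parts = item.strip().split(";")
        tag = parts[0].strip()
        quality = 1.0  # default weight when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name == "q":
                quality = float(value)
        # Sort by descending weight, then by original position.
        weighted.append((-quality, position, tag))
    return [tag for _, _, tag in sorted(weighted)]

def negotiate(header, available, default):
    """Return the first preferred tag the server supports, else default."""
    for tag in parse_preferences(header):
        if tag in available:
            return tag
    return default

print(negotiate("fr-CA, fr;q=0.8, en;q=0.5", {"fr", "en"}, "en"))
```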

28
Locale Architecture (2)
  • Hierarchical Retrieval Methods
  • Application development
  • Language/message retrieval
  • Preferred choice retrieval
  • SELECT * FROM table_foo WHERE
  • HIERARCHY(foo_lang, SELECT lang FROM lang_list
    ORDER BY japan_nec)
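
The slides do not define the semantics of HIERARCHY. One plausible reading is "return the rows for the first language in the preference list that has any rows", and the Python sketch below implements that reading only. The function name and the sample data are hypothetical.

```python
def hierarchy_select(rows, lang_column, preferred_langs):
    """Return the rows for the first preferred language that has any rows."""
    for lang in preferred_langs:
        matches = [row for row in rows if row[lang_column] == lang]
        if matches:
            return matches
    return []

# Hypothetical message table and an ordered preference list.
table_foo = [
    {"foo_lang": "en", "message": "File not found"},
    {"foo_lang": "de", "message": "Datei nicht gefunden"},
]
lang_list = ["ja", "de", "en"]

result = hierarchy_select(table_foo, "foo_lang", lang_list)
print(result)
```

No Japanese row exists, so the lookup falls through to German, the next preference.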

29
Open Issues
  • Meta tagging of languages
  • columns
  • in-line
  • Lexical equivalence
  • not canonically equivalent
  • compatibility zone
  • keywords?
  • identifiers
  • non-Latin digits
  • Current I18N proposals for SQL

31
Current SQL Proposal vs. This Paper
  • Agree
  • Unicode as SQL_TEXT
  • Character set introducers
  • Canonical equivalence
  • Surrogate characters
  • Use of standard character sets and collations
  • Unicode identifiers
  • FOLD
  • Schema default character set
  • Disagree
  • Simultaneous multiple character set support
  • UNICHAR
  • Locales
  • Language for MESSAGE_TEXT
  • Text elements
  • User-defined characters
  • Hierarchical retrieval methods

32
Conclusion
  • Unicode + Collations + Locale negotiation
  • opens the door for other cultural processing
    of data
  • date/time
  • numbers
  • conversion to/from strings
  • Fosters better SQL integration with
  • Java
  • XML
  • the Internet
  • ODBC
  • Feedback? Please visit
  • www.g11n.org/SQL
