Unicode (and Java)

About This Presentation

Title:

Unicode (and Java)

Description:

Unicode (and Java), by Brice Giesbrecht. Objective of presentation: the need for Unicode; how it works; differentiate between encodings; how to get your browser to work.

Slides: 35

Provided by: Brice2

Transcript and Presenter's Notes

Title: Unicode (and Java)

1
Unicode (and Java)

Brice Giesbrecht

2
Objective of Presentation

The need for Unicode
How it works
Differentiate between encodings
How to get your browser to work
See how Java consumes and produces data

3
Overview of Presentation

Character Sets
Unicode
Encodings
Unicode Support in Java
Unicode Support in Databases (?)
Demonstration (web app)
Resources
Door Prizes (for those still awake)

4
Character Sets

What is a character set?
Code page: a mapping in which a sequence of bits,
usually a single octet representing integer
values 0 through 255, is associated with a
specific character (Wikipedia)
Most character sets are a direct mapping of a
character to a number (7-bit / 8-bit)
Character sets are NOT fonts!
Encoding is usually a lookup in a table
Most IBM and Microsoft code pages use ASCII as
their base set of characters
The English bias (compare to Indic languages)

5
Character Sets

Issues Within a single Language
Selectors to overcome 8 bit limitations
(especially for CJK sets)
Historical importance of platforms and hardware
Compatibility (or more likely, lack thereof)
ISCII as an example
Issues outside a single Language
How do you produce content using multiple
languages? (Or the characters from those
languages?)
http://en.wikipedia.org/wiki/Code_page_437

6
Character Sets

Enter the standards
ISO-646 (ASCII, still 7 bit)
12 whole code points to play with!
C0 Control Set (0x00-0x1F)
ISO-8859-n
0x00-0x7F: ISO-646 IRV
0x80-0xFF: different for each set (or part)
ISO 8859-1 (Latin1)
C1 Control Set (0x80-0x9F)
ISO-2022
Designed for transmission
Non-Latin-based multi-byte sets

7
Character Sets

Enter Microsoft!
Windows code pages
http://www.microsoft.com/globaldev/reference/wincp.mspx
Cp1252
Based on ISO 8859-1
C1 code points used for printable characters
Often mislabeled as ISO-8859-1 due to their
similarities
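
The practical effect of the Cp1252 / ISO-8859-1 mix-up can be seen by decoding one byte from the C1 range under both labels. This is only a minimal sketch; the class and method names are made up for illustration, and it assumes the JVM ships the windows-1252 charset (it appears in the charset list later in this deck).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: decodes the same byte under two charset labels.
final class Cp1252VersusLatin1 {

    // Decode a single byte with the given charset and return the resulting code point.
    static int decode(byte b, Charset cs) {
        return new String(new byte[] { b }, cs).codePointAt(0);
    }

    public static void main(String[] args) {
        byte b = (byte) 0x80; // inside the C1 range 0x80-0x9F
        Charset cp1252 = Charset.forName("windows-1252");
        // ISO-8859-1 maps the byte to a C1 control character.
        System.out.printf("ISO-8859-1:   U+%04X%n", decode(b, StandardCharsets.ISO_8859_1));
        // windows-1252 maps the same byte to a printable character (the euro sign).
        System.out.printf("windows-1252: U+%04X%n", decode(b, cp1252));
    }
}
```

Bytes outside 0x80-0x9F decode identically under both charsets, which is why the mislabeling often goes unnoticed until a euro sign or curly quote shows up.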

8
Unicode

What is Unicode?
Unicode provides a unique number for every
character,
no matter what the platform,
no matter what the program,
no matter what the language.

9
Unicode

ISO 10646 (1990)
Merged with the Unicode Consortium
Ties a character, name, and a code point together
BMP: Basic Multilingual Plane (the first 65,536
code points)
ISO and UC character repertoires are synchronized
UCS (Universal Character Set)

10
Unicode

Q: So are they the same thing?
A: No. Although
the character codes and encoding forms are
synchronized between Unicode and ISO/IEC 10646,
the Unicode Standard imposes additional
constraints on implementations to ensure that
they treat characters uniformly across platforms
and applications. To this end, it supplies an
extensive set of functional character
specifications, character data, algorithms and substantial
background material that is not in ISO/IEC
10646. (http://unicode.org/faq/unicode_iso.html)

11
Unicode

The Unicode Standard includes a set of
characters, names, and coded representations that
are identical with those in ISO/IEC 10646:2003.
It additionally provides details of
character properties, processing algorithms, and
definitions that are useful to implementers. It
strengthens Unicode support for worldwide
communication, software availability, and
publishing. (http://www.iso.org)

12
Unicode

UCS code space (0x0-0x7FFFFFFF)
128 x 256 x 256 x 256 (group x plane x row x cell)
2,147,483,648 possible code points
The Unicode Character Database
http://unicode.org/Public/UNIDATA/UCD.html
Main definition (UnicodeData.txt)
Available online:
http://www.unicode.org/Public/UNIDATA/
Unicode code space (0x0-0x10FFFF)
17 x 256 x 256 = 1,114,112 code points
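
The two code space sizes are plain arithmetic and can be cross-checked against the constant Java exposes for the top of the Unicode range. A small sketch, with a made-up class name:

```java
// Hypothetical helper: reproduces the code space arithmetic from the slide.
final class CodeSpace {

    // Unicode code space: 17 planes of 256 x 256 code points each.
    static int unicodeCodeSpaceSize() {
        return 17 * 256 * 256;
    }

    // Original UCS code space: 128 groups x 256 planes x 256 rows x 256 cells.
    // Computed as a long because the result does not fit in a signed int.
    static long ucsCodeSpaceSize() {
        return 128L * 256 * 256 * 256;
    }

    public static void main(String[] args) {
        System.out.println(unicodeCodeSpaceSize());      // 1114112
        System.out.println(ucsCodeSpaceSize());          // 2147483648
        // Character.MAX_CODE_POINT is 0x10FFFF, the last Unicode code point.
        System.out.println(Character.MAX_CODE_POINT + 1 == unicodeCodeSpaceSize());
    }
}
```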

13
Unicode

As of Unicode 5.0.0, 101,063 (9.1%) of these
code points are assigned, with another 137,468
(12.3%) reserved for private use, leaving 875,441
(78.6%) unassigned. The number of assigned code
points is made up as follows:
98,884 graphemes
140 formatting characters
65 control characters
2,048 surrogate characters

14
Unicode

Plane 0 (0000-FFFF)
Basic Multilingual Plane (BMP)
Used for most of the alphabets
Not all code points are used
Allocated in areas/blocks

15
Unicode

Plane 1 (10000-1FFFF)
Supplementary Multilingual Plane (SMP)
Historic scripts such as Linear B, but is also
used for musical and mathematical symbols.

16
Unicode

Plane 2 (20000-2FFFF)
Supplementary Ideographic Plane (SIP)
Used for about 40,000 rare Chinese characters
that are mostly historic

17
Unicode

Planes 3 to 13 (30000-DFFFF)
Unassigned

18
Unicode

Plane 14 (E0000-EFFFF)
Supplementary Special-purpose Plane (SSP)
glyph (font) selection
code point + variation selector = variation
sequence
http://www.unicode.org/reports/tr37/tr37-3.html
(Ideographic Variation Database)

19
Unicode

Plane 15 (F0000-FFFFF)
Plane 16 (100000-10FFFF)
Plane 0 (E000-F8FF)
Private Use Area (PUA)
The use of the PUA was a concept inherited from
certain Asian encoding systems. These systems had
private use areas to encode Japanese Gaiji (rare
personal name characters) in application-specific
ways.
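
Since every plane holds 0x10000 code points, the plane number is simply the code point shifted right by 16 bits, and Java's character database can confirm which code points are private use. A rough sketch; the class and method names are illustrative only:

```java
// Hypothetical helper: plane lookup and private-use check for a code point.
final class Planes {

    // Each plane spans 0x10000 code points, so the plane index is cp / 0x10000.
    static int plane(int cp) {
        return cp >>> 16;
    }

    // Character.getType reports the general category; PRIVATE_USE is category Co.
    static boolean isPrivateUse(int cp) {
        return Character.getType(cp) == Character.PRIVATE_USE;
    }

    public static void main(String[] args) {
        int[] samples = { 0x4E8C, 0xE000, 0x10302, 0xF0000, 0x100000 };
        for (int cp : samples) {
            System.out.printf("U+%06X plane %d privateUse=%b%n", cp, plane(cp), isPrivateUse(cp));
        }
    }
}
```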

20
Unicode

ConScript Unicode Registry
The purpose of the ConScript Unicode Registry
(CSUR) is to coordinate the assignment of blocks
out of the Unicode Private Use Area (E000-F8FF
and 000F0000-0010FFFF) to constructed/artificial
scripts, including scripts for
constructed/artificial languages.
Cirth, Klingon, Tengwar, etc.

21
Encodings

The purpose of the following encodings is to get the
Unicode value to you. Depending on the storage or
transmission protocols, different encodings will
need to be used. These are not different
character sets; they are ways of representing the
characters in Unicode.

22
Encodings

Endianness
0x1234
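LE: 34 12
BE: 12 34
Byte Order Mark (BOM): 0xFEFF
Helps determine endianness
Unicode 3.2 added 0x2060 (WORD JOINER) to take over
its other role as a zero-width no-break space
0xFFFE reserved (a noncharacter, so a byte-swapped
BOM is detectable)
0xFEFF set aside for BOM
Also used as an encoding signature (UTF-8)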
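
The byte orders above can be observed directly from Java's UTF-16 charsets. A minimal sketch with an invented class name; it relies on the documented behavior that Java's plain "UTF-16" charset writes a big-endian BOM when encoding, while the BE and LE variants write none.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows byte order and BOM for one code unit, 0x1234.
final class EndiannessDemo {

    // Render bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encode the text with the given charset and return the hex dump.
    static String encodeHex(String text, Charset cs) {
        return hex(text.getBytes(cs));
    }

    public static void main(String[] args) {
        String s = String.valueOf((char) 0x1234);
        System.out.println(encodeHex(s, StandardCharsets.UTF_16BE)); // 12 34
        System.out.println(encodeHex(s, StandardCharsets.UTF_16LE)); // 34 12
        System.out.println(encodeHex(s, StandardCharsets.UTF_16));   // FE FF 12 34
    }
}
```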

23
Encodings

UTF-8
Variable-length character encoding
Can address all characters in the UCS but was
limited by RFC 3629 to just address the Unicode
code space.
BOM: EF BB BF
Format
000000-00007F 0zzzzzzz
000080-0007FF 110yyyyy 10zzzzzz
000800-00FFFF 1110xxxx 10yyyyyy 10zzzzzz
010000-10FFFF 11110www 10xxxxxx 10yyyyyy 10zzzzzz
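
The format table translates almost line for line into bit operations. The following is a hand-rolled sketch for illustration only (real code would normally use Java's built-in UTF-8 charset); the class name is made up, and for brevity it does not reject surrogate code points, which a strict encoder would.

```java
// Hypothetical encoder: follows the four rows of the UTF-8 format table.
final class Utf8Sketch {

    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("outside the Unicode code space: " + cp);
        }
        if (cp <= 0x7F) {
            // 0zzzzzzz
            return new byte[] { (byte) cp };
        }
        if (cp <= 0x7FF) {
            // 110yyyyy 10zzzzzz
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        if (cp <= 0xFFFF) {
            // 1110xxxx 10yyyyyy 10zzzzzz
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        // 11110www 10xxxxxx 10yyyyyy 10zzzzzz
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F)) };
    }

    public static void main(String[] args) {
        for (byte b : encode(0x4E8C)) {
            System.out.printf("%02X ", b & 0xFF); // E4 BA 8C
        }
        System.out.println();
    }
}
```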

24
Encodings

UTF-32/UCS-4
Fixed-length character encoding
Uses 31 bits
UCS-4 capable of addressing entire UCS, but was
restricted to only cover the Unicode code space
UTF-32 only covers the Unicode code space
4E8C, 10302 -> 00004E8C, 00010302
BE BOM: 00 00 FE FF
LE BOM: FF FE 00 00
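
Because UTF-32 is fixed length, encoding a code point amounts to writing it as a 4-byte integer in the chosen byte order. A small sketch using java.nio.ByteBuffer; the class name is invented, and no range validation is done.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical encoder: writes one code point as four bytes in the given byte order.
final class Utf32Sketch {

    static byte[] encode(int cp, ByteOrder order) {
        return ByteBuffer.allocate(4).order(order).putInt(cp).array();
    }

    public static void main(String[] args) {
        // The BOM (0xFEFF) itself shows the two layouts from the slide.
        for (byte b : encode(0xFEFF, ByteOrder.BIG_ENDIAN)) {
            System.out.printf("%02X ", b & 0xFF); // 00 00 FE FF
        }
        System.out.println();
        for (byte b : encode(0xFEFF, ByteOrder.LITTLE_ENDIAN)) {
            System.out.printf("%02X ", b & 0xFF); // FF FE 00 00
        }
        System.out.println();
    }
}
```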

25
Encodings

UCS-2
Fixed-length encoding
Two-octet
It is NOT UTF-16!
Only addresses BMP
UCS-2BE, UCS-2LE
Obsoleted by UTF-16

26
Encodings

UTF-16
Variable-length encoding
UTF-16BE, UTF-16LE
BE BOM: FE FF
LE BOM: FF FE
Surrogates are used to address code points
outside the BMP. (We will cover this later)

27
Encodings

UTF-16 Surrogate Pairs
Needed for code points > 0xFFFF
High surrogate: 0xD800-0xDBFF (first surrogate)
Low surrogate: 0xDC00-0xDFFF (second surrogate)
Algorithm
high = ((cp - 0x10000) >> 10) + 0xD800 (high 10 bits)
low = ((cp - 0x10000) & 0x3FF) + 0xDC00 (low 10 bits)
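
The surrogate algorithm can be written out by hand and checked against the JDK's own Character.toChars. This is only a sketch with an invented class name, not how Java implements it internally.

```java
// Hypothetical helper: splits a supplementary code point into a surrogate pair.
final class SurrogateSketch {

    static char[] toSurrogates(int cp) {
        if (cp < 0x10000 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point: " + cp);
        }
        int v = cp - 0x10000;                      // 20-bit value
        char high = (char) (0xD800 + (v >> 10));   // high 10 bits
        char low = (char) (0xDC00 + (v & 0x3FF));  // low 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x10302);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D800 DF02
        // Cross-check against the JDK's own conversion.
        System.out.println(java.util.Arrays.equals(pair, Character.toChars(0x10302)));
    }
}
```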

28
Encodings

Which Encoding should you use?
If dealing with CJK or Hindi (>= 0x0800), UTF-8
requires 3 bytes whereas UTF-16 needs only 2
UTF-8 is great for ASCII (1 byte) whereas UTF-16
needs 2 bytes for it
Java uses UTF-16
Windows uses UTF-16LE internally
UTF-32 not really used that much
UTF-8 and UTF-16 are the most common
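
The size trade-off is easy to measure by encoding the same text both ways. A minimal sketch; the class name and the sample strings are chosen for illustration, and UTF-16BE is used so that no BOM inflates the count.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: compares encoded sizes of a string in UTF-8 and UTF-16.
final class EncodingSizes {

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // UTF-16BE is used so the byte count excludes a byte order mark.
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String ascii = "Java";
        String cjk = String.valueOf((char) 0x4E8C);   // a CJK ideograph
        String hindi = String.valueOf((char) 0x0939); // a Devanagari letter
        System.out.println(utf8Length(ascii) + " vs " + utf16Length(ascii)); // 4 vs 8
        System.out.println(utf8Length(cjk) + " vs " + utf16Length(cjk));     // 3 vs 2
        System.out.println(utf8Length(hindi) + " vs " + utf16Length(hindi)); // 3 vs 2
    }
}
```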

29
Java

J2SE 1.5: Unicode version 4.0
J2SE 1.4: Unicode version 3.0
J2SE 1.3: Unicode version 2.1
Supplementary characters were part of Unicode 3.1
Addressed in JSR 204 (http://jcp.org/en/jsr/detail?id=204)
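
One visible consequence of the supplementary character support added in J2SE 1.5 is that a String's length in char units can differ from its number of code points. A short sketch using the code point methods on String and Character; the class name is invented.

```java
// Hypothetical demo class: char count versus code point count.
final class SupplementaryDemo {

    // Build a String holding exactly one code point.
    static String of(int cp) {
        return new String(Character.toChars(cp));
    }

    // Number of code points, as opposed to the number of UTF-16 code units.
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = of(0x10302); // a Plane 1 character
        System.out.println(s.length());                                     // 2
        System.out.println(countCodePoints(s));                             // 1
        System.out.println(Integer.toHexString(s.codePointAt(0)));          // 10302
        System.out.println(Character.isSupplementaryCodePoint(0x10302));    // true
    }
}
```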

30
Java

Unicode characters are specified using \u such as
\u0039
Unicode can be used in source files
file.encoding=Cp1252 on my machine
You can change this, but beware
Java reads and writes using this encoding by
default
You can specify the character set to use for
reading or writing
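
Passing a charset explicitly to the reader and writer classes avoids depending on the platform default. The sketch below round-trips text through in-memory streams; the class and method names are hypothetical, and a real application would wrap file or socket streams the same way.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: writes and reads text with an explicitly chosen charset.
final class ExplicitCharsetIo {

    static byte[] write(String text, Charset cs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, cs)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    static String read(byte[] data, Charset cs) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(data), cs)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + Charset.defaultCharset());
        String text = "caf" + (char) 0xE9; // "cafe" with an acute accent on the e
        byte[] utf8 = write(text, StandardCharsets.UTF_8);
        // Reading with the matching charset restores the text.
        System.out.println(read(utf8, StandardCharsets.UTF_8).equals(text));
        // Reading with the wrong charset silently produces different characters.
        System.out.println(read(utf8, StandardCharsets.ISO_8859_1).equals(text));
    }
}
```

Decoding with a mismatched charset does not raise an error here; it yields wrong characters, which is the usual symptom of relying on the default encoding.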

31
Java

Character sets available to the JVM (varies by installation):
Big5 Big5-HKSCS EUC-JP EUC-KR GB18030 GB2312 GBK IBM-Thai IBM00858 IBM01140 IBM01141 IBM01142 IBM01143 IBM01144 IBM01145 IBM01146 IBM01147 IBM01148 IBM01149 IBM037 IBM1026 IBM1047 IBM273 IBM277 IBM278 IBM280 IBM284 IBM285 IBM297 IBM420 IBM424 IBM437 IBM500 IBM775 IBM850 IBM852 IBM855 IBM857 IBM860 IBM861 IBM862 IBM863 IBM864 IBM865 IBM866 IBM868 IBM869 IBM870 IBM871 IBM918 ISO-2022-CN ISO-2022-JP ISO-2022-KR ISO-8859-1 ISO-8859-13 ISO-8859-15 ISO-8859-2 ISO-8859-3 ISO-8859-4 ISO-8859-5 ISO-8859-6 ISO-8859-7 ISO-8859-8 ISO-8859-9 JIS_X0201 JIS_X0212-1990 KOI8-R Shift_JIS TIS-620 US-ASCII UTF-16 UTF-16BE UTF-16LE UTF-8 windows-1250 windows-1251 windows-1252 windows-1253 windows-1254 windows-1255 windows-1256 windows-1257 windows-1258 windows-31j x-Big5-Solaris x-euc-jp-linux x-EUC-TW x-eucJP-Open x-IBM1006 x-IBM1025 x-IBM1046 x-IBM1097 x-IBM1098 x-IBM1112 x-IBM1122 x-IBM1123 x-IBM1124 x-IBM1381 x-IBM1383 x-IBM33722 x-IBM737 x-IBM856 x-IBM874 x-IBM875 x-IBM921 x-IBM922 x-IBM930 x-IBM933 x-IBM935 x-IBM937 x-IBM939 x-IBM942 x-IBM942C x-IBM943 x-IBM943C x-IBM948 x-IBM949 x-IBM949C x-IBM950 x-IBM964 x-IBM970 x-ISCII91 x-ISO-2022-CN-CNS x-ISO-2022-CN-GB x-iso-8859-11 x-JIS0208 x-JISAutoDetect x-Johab x-MacArabic x-MacCentralEurope x-MacCroatian x-MacCyrillic x-MacDingbat x-MacGreek x-MacHebrew x-MacIceland x-MacRoman x-MacRomania x-MacSymbol x-MacThai x-MacTurkish x-MacUkraine x-MS950-HKSCS x-mswin-936 x-PCK x-windows-874 x-windows-949 x-windows-950
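
A list like this one can be produced on any JVM with Charset.availableCharsets(); the exact contents depend on the Java version and installation, so the output will likely differ from the names shown here. A minimal sketch with an invented class name:

```java
import java.nio.charset.Charset;
import java.util.Set;

// Hypothetical helper: lists the canonical names of all charsets the JVM supports.
final class ListCharsets {

    // availableCharsets() returns a sorted map keyed by canonical charset name.
    static Set<String> names() {
        return Charset.availableCharsets().keySet();
    }

    public static void main(String[] args) {
        for (String name : names()) {
            System.out.println(name);
        }
    }
}
```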

32
Databases (Maybe)

SQL 92 NATIONAL CHARACTER
The <key word>s NATIONAL CHARACTER are used to
specify a character string data type with a
particular implementation-defined character
repertoire. Special syntax (N'string') is
provided for representing literals in that
character repertoire.
Collation
Database Support
MySQL
Oracle
SQL Server
Postgres

33
Demonstration