Introduction to Character Encodings, Java and You

Provided by: karen115

1
Introduction to Character Encodings, Java and You
2
Agenda

Defining the problem
Where webMethods products encounter character set
problems.
What the symptoms look like.
Understand core concepts
What is a character set? What's an encoding?
What is Unicode, really?
Code Examples to avoid problems

3
Confusion Reigns

Generally, the most confusing aspect of
internationalization.
Many, many standards to choose from.
Arcane terminology
American programmers rarely (seem to) encounter
it head-on.
We're presenting this because many of our
products are encountering this problem now.

4
Problem Domain

webMethods products interface with
non-Java systems (for example, in the adapters)
non-Java environments (file systems, databases,
libraries, email, ftp, http, etc.).

5
Java's Text Representation

Java provides a convenient text processing
architecture centered on the Java String object.
A Java String is basically an array of Java
Character Objects.

6
Java Characters

Each Java Character object represents a Unicode
character.
(Currently) a 16-bit unsigned integer value
between 0 and 65,535.
Character class provides access to character
properties.
UPPER, lower, and Titlecase mapping
Comparison
Directionality
Compatibility
C-TYPE values such as alpha-ness, digit-ness,
alphanumeric-ness

7
Non-Java Text

Non-Java files, applications, file systems,
databases, etc. typically do not use Unicode.
Java sees them as an array of bytes (byte[]).

8
Three Problems
9
Bad Conversion

The target character set doesn't have this character
in it. Java replaces each such character with a ?
Input String: ???
Output String: ???
Typically:
Using the default encoding when we meant to
specify one.
Writing to a device (such as System.out) whose
legacy encoding doesn't support the characters.
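
The replacement behaviour described above can be reproduced in a few lines. This is a minimal sketch, not code from the presentation: the class name BadConversionDemo and the sample text are assumptions made for illustration. It relies on String.getBytes(Charset), which substitutes the charset's replacement byte (a question mark for US-ASCII) for every character it cannot map.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BadConversionDemo {
    // Encode the text into the given charset, then decode it again.
    // Characters the charset cannot represent come back as '?'.
    static String roundTrip(String text, Charset cs) {
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        // Three Japanese characters (sample text chosen for this sketch).
        String input = "\u65E5\u672C\u8A9E";
        // US-ASCII has none of these characters, so each one is replaced.
        System.out.println(roundTrip(input, StandardCharsets.US_ASCII));
        // UTF-8 can represent them, so the round trip is lossless.
        System.out.println(roundTrip(input, StandardCharsets.UTF_8).equals(input));
    }
}
```

Once the question marks are in the data the original characters are gone for good, which is why the target encoding should be chosen deliberately.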

10
No Glyph

Java knows what the character is and is handling
it properly, but doesn't have a picture of it to
show you (in the currently selected Font).
Input String: ???
Output String: ???
Typically:
Nothing is wrong, just using the wrong Font.

11
Random Trash

A byte was converted using the wrong character
encoding. Bytes were mapped to the wrong
characters.
Input String: ???
Output String: ?ú??ê
Typically:
Using the wrong encoding, the underlying bytes
are mapped to different, random-seeming
characters.

12
Examples

Same byte sequence, different results:
Shift-JIS bytes 0xE0, 0x41, 0x83, 0x70 = 漓パ
Latin-1 bytes 0xE0, 0x41, 0x83, 0x70 = àA?p (0x83 is a C1 control with no printable form)
Java String decoded from the Shift-JIS bytes: 漓パ = U+6F13 U+30D1
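
The example above can be checked by decoding the same four bytes twice. This is a sketch under assumptions: the class and method names are made up for illustration, and Shift_JIS is an optional charset that may be missing from a particular Java runtime, so the code tests for it first.

```java
import java.nio.charset.Charset;

class SameBytesDemo {
    // The four bytes from the slide.
    static final byte[] BYTES = {(byte) 0xE0, 0x41, (byte) 0x83, 0x70};

    // Decode the bytes using the named charset.
    static String decode(String charsetName) {
        return new String(BYTES, Charset.forName(charsetName));
    }

    // Render each UTF-16 code unit of the string as U+XXXX.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("ISO-8859-1: " + codeUnits(decode("ISO-8859-1")));
        // Shift_JIS is not guaranteed to be present in every runtime.
        if (Charset.isSupported("Shift_JIS")) {
            System.out.println("Shift_JIS:  " + codeUnits(decode("Shift_JIS")));
        }
    }
}
```

Latin-1 yields four characters, one per byte, while Shift-JIS pairs the bytes up and yields two.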

13
Character Set Terminology
14
What is a Character?

A character is a single, atomic unit of text.
The definition has a different meaning according
to the writing system and context.

15
Abstract characters

Some abstract characters include
A: Roman letter capital A
U+0300: combining grave accent
に: Hiragana character ni
?: CJK ideograph
?: Arabic letter
?: Hangul syllable
Ａ: fullwidth compatibility letter A

16
What is a Character Set?

A character set is a set: a collection of
characters, usually organized in some fashion.
You're probably most familiar with ASCII:
0x41 = A
0x42 = B
etc.

17
What is a Character Encoding?

Character set: a collection of characters;
basically, a bucket.
Character encoding: the specific ones and zeroes
assigned to the characters in a character set.
Character set: A = 0x41
Character encoding: A = 0x41

18
Eight Bit Encodings

8-bit encodings allow for 256 characters.

128 ASCII
32 C1 controls
96 extended
19
Latin-1

The standard for Western Europe is generally
ISO-8859-1
AKA Latin-1
Used by UNIX systems and the Web.
Extended version used by Microsoft for Windows.

20
Let a Thousand Encodings Bloom

Each language has its own character set:
Everywhere: ASCII
Western European (like German or French): Latin-1
Eastern European (like Polish or Slovak): Latin-2
Simplified Chinese: GB2312

21
Actually, many for each language
22
Other Writing Systems

Writing systems vary around the world (in order
of increasing complexity, more or less)
Latin-based alphabets
(ABCDEFG) English
Cyrillic and Greek-based alphabets
(????????...) Russian
Ideographic writing systems have thousands of
characters
(??????...) Japanese
Bi-directional (RTL) languages go right to left
(...???????) Hebrew
Complex scripts (everything else)
(????) Devanagari

23
Expanded Character Sets

Most languages have alphabetic or phonetic
writing systems
Russian, Greek, Slavic, (many) Native American,
Bahasa, Hebrew, Arabic, Semitic, etc.: alphabetic
Indian (subcontinent), Thai, Japanese kana,
Korean: phonetic writing systems
8 bits is enough for all of the above (with some
tricks)
Some languages use scripts based on Chinese
ideographic writing (Han or Hanja)
Chinese
Korean
Vietnamese (traditional)
Japanese Kanji

24
Double-Byte

8-bit character encodings use eight bits per
character.
2^8 = 256 characters
Double-byte character sets must be 2 bytes per
character?
2^16 = 65,536 characters
Should actually be called multi-byte (MBCS).
Each character can be ONE, TWO, THREE and
sometimes FOUR bytes in length.
MAY involve shift states.

25
Multibyte Encodings

A typical Japanese character set:
JIS X 0208 (漢字)
Character encodings of JIS X 0208:
Shift-JIS (CP932): 0x8A 0xBF 0x8E 0x9A
EUC-JP: 0xB4 0xC1 0xBB 0xFA
ISO-2022-JP: 0x1B 0x24 0x42 0x34 0x41 0x3B
0x7A 0x1B 0x28 0x4A
Non-legacy:
UTF-16: 0x6F22 0x5B57
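
A byte table like the one above can be generated from Java itself. The sketch below is illustrative only: the class and helper names are assumptions, and the Japanese legacy charsets are optional in a Java runtime, so each one is checked before use. The exact escape sequences that a given runtime emits for ISO-2022-JP may differ slightly from the slide.

```java
import java.nio.charset.Charset;

class KanjiBytesDemo {
    // Format a byte array as space-separated 0xNN values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encode the text in the named charset and return the bytes as hex.
    static String encode(String text, String charsetName) {
        return hex(text.getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        String kanji = "\u6F22\u5B57"; // the two characters shown on the slide
        String[] names = {"Shift_JIS", "EUC-JP", "ISO-2022-JP", "UTF-16BE"};
        for (String name : names) {
            // Skip charsets this runtime does not provide.
            if (Charset.isSupported(name)) {
                System.out.println(name + ": " + encode(kanji, name));
            }
        }
    }
}
```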

26
An MBCS Example: Shift-JIS

Character set used by DOS, Windows, Macs, and a
few UNIX-like systems for Japanese.
Code Page 932
JIS X 0208:1997

27
Shift-JIS

In order to reach more characters, double-byte
values start with a limited range of lead bytes.
These can be followed by any byte value >= 0x40
(the trail byte).

28
Shift-JIS

Each lead byte provides a window onto
additional characters.

29
Shift-JIS

Problems
Lead byte values are also valid as trail bytes.
Common special characters (such as \ and |) are
valid trail bytes.
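
The trail-byte problem is easiest to see when searching raw bytes for a backslash. The following is a minimal sketch with hypothetical names, assuming the usual Shift-JIS lead byte ranges 0x81-0x9F and 0xE0-0xFC. The sample data is katakana SO (0x83 0x5C) followed by a real backslash (0x5C).

```java
class ShiftJisScanDemo {
    // True if the byte starts a double-byte Shift-JIS character.
    static boolean isLeadByte(byte b) {
        int v = b & 0xFF;
        return (v >= 0x81 && v <= 0x9F) || (v >= 0xE0 && v <= 0xFC);
    }

    // Naive search: treats every byte as a separate character.
    static int naiveIndexOf(byte[] text, byte target) {
        for (int i = 0; i < text.length; i++) {
            if (text[i] == target) return i;
        }
        return -1;
    }

    // Lead-byte-aware search: skips the trail byte after each lead byte.
    static int indexOf(byte[] text, byte target) {
        for (int i = 0; i < text.length; i++) {
            if (isLeadByte(text[i])) {
                i++; // step over the trail byte; the loop increment moves past it
                continue;
            }
            if (text[i] == target) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] text = {(byte) 0x83, 0x5C, 0x5C};
        System.out.println("naive: " + naiveIndexOf(text, (byte) 0x5C));
        System.out.println("aware: " + indexOf(text, (byte) 0x5C));
    }
}
```

The naive search reports index 1, which is the second half of a katakana character; the lead-byte-aware search reports index 2, the real backslash.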

30
Han

CJK scripts require up to 100,000 unique
characters for complete representation.
Four major variants
Traditional Chinese
Simplified Chinese
Japanese Kanji
Korean (non-Hangul)

31
Kanji

Sometimes you hear Japanese called "kanji".
Kanji is actually one of four writing systems
used in Japan.
"Kanji" should be avoided as a generic term for
DBCS.
Kanji (Han or Chinese writing): 日本語
Hiragana (phonetic for Japanese words): にほんご
Katakana (phonetic for foreign words): ニホンゴ
Romaji (Roman script): nihongo

32
Chinese

Upper two are Traditional.
Lower character is the Simplified variant.

33
Hangul

Korean Hangul is a syllabic phonetic system,
which has thousands of combinations.
Hangul is not related to Han ideographic writing.

34
Code Page Hell

With hundreds of encodings and character sets to
choose from, making internationalized code work
in the late 1980s and early 1990s was
hellish.
Internationalization folks referred to this as
"code page hell".

35
Unicode and Java

To the Rescue

36
Unicode (ISO/IEC 10646)

Unicode is a character set that supports all of
the world's languages and writing systems.
Originally designed as a wide character
set: every character was represented by 16 bits.
This allowed for 65,536 potential characters.
Extended to allow about 1.1 million characters.
Unicode is maintained by an industry consortium.
ISO/IEC 10646 is maintained by WG2. The two are
kept synchronized, character for character.

37
It's a character set?

Unicode is a character set. It has these
encodings:
UTF-32 (BE/LE)
A 32-bit encoding. All characters are 32 bits.
UTF-16 (BE/LE)
A 16-bit encoding. Characters in the Basic
Multilingual Plane (up to 0xFFFF) are 16 bits.
Characters above 0xFFFF require a pair of
special surrogate code units.
UTF-8
An 8-bit variable-width encoding. Characters are
1, 2, 3 or 4 bytes long. Always non-endian.
ASCII = ASCII
All other characters have a special bit pattern.
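
The variable widths listed above can be measured directly. This sketch uses only standard charsets; the class name and sample characters are assumptions chosen for illustration (U+1F600 stands in for any character above 0xFFFF).

```java
import java.nio.charset.StandardCharsets;

class EncodedLengthDemo {
    // Number of bytes the string occupies in UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the string occupies in UTF-16 (big-endian, no byte order mark).
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",            // ASCII
            "\u00C0",       // Latin capital A with grave
            "\u6F22",       // a CJK ideograph
            "\uD83D\uDE00"  // U+1F600, stored in a Java String as a surrogate pair
        };
        for (String s : samples) {
            System.out.println("U+" + Integer.toHexString(s.codePointAt(0)).toUpperCase()
                    + " UTF-8 bytes: " + utf8Length(s)
                    + ", UTF-16 bytes: " + utf16Length(s)
                    + ", Java chars: " + s.length());
        }
    }
}
```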

38
UTF-8 Bit Pattern

ASCII = ASCII
0x41 = A
All other characters are multibyte:
110xxxxx: lead byte of a two-byte sequence
1110xxxx: lead byte of a three-byte sequence
11110xxx: lead byte of a four-byte sequence
10xxxxxx: trail byte
U+00C0 À = 0xC3 0x80 (11000011 10000000)
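
The bit patterns above translate into a handful of shifts and masks. The encoder below is a teaching sketch with a hypothetical class name, not something to use in place of the JDK converters: it does not reject surrogate code points or values above 0x10FFFF.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8EncodeDemo {
    // Encode one Unicode code point by hand using the UTF-8 bit patterns.
    // No validation is done (surrogates and out-of-range values are not rejected).
    static byte[] encode(int cp) {
        if (cp < 0x80) {
            return new byte[] {(byte) cp};                        // 0xxxxxxx
        }
        if (cp < 0x800) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),                        // 110xxxxx
                (byte) (0x80 | (cp & 0x3F))                       // 10xxxxxx
            };
        }
        if (cp < 0x10000) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),                       // 1110xxxx
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),                           // 11110xxx
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))
        };
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xC0, 0x6F22, 0x1F600};
        for (int cp : samples) {
            byte[] jdk = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.println(Integer.toHexString(cp) + " matches JDK: "
                    + Arrays.equals(encode(cp), jdk));
        }
    }
}
```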

39
Convenience Method for UTF-8

Almost true: readUTF and writeUTF allow direct
access to UTF-8 through DataInput/DataOutput streams.
This is not really UTF-8, but a Sun-specific
variant (modified UTF-8).
Use InputStreamReader/OutputStreamWriter to do
proper conversions.
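
The difference between writeUTF output and real UTF-8 shows up with even a tiny string. This is a sketch with hypothetical names: it writes a capital A followed by U+0000 both ways. writeUTF adds a two-byte length prefix and encodes U+0000 as the two bytes 0xC0 0x80, neither of which a standard UTF-8 reader expects.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ModifiedUtf8Demo {
    // Bytes produced by DataOutputStream.writeUTF (length prefix + modified UTF-8).
    static byte[] viaWriteUTF(String s) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bos)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Bytes produced by a standard UTF-8 conversion.
    static byte[] viaUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "A\u0000";
        System.out.println("writeUTF: " + Arrays.toString(viaWriteUTF(s)));
        System.out.println("UTF-8:    " + Arrays.toString(viaUtf8(s)));
    }
}
```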

40
Java Uses Unicode

Every character in every Java String object is
encoded as UTF-16 Unicode.
Every string is converted from a legacy encoding,
either by the compiler or by the String class.
This is the reason for native2ascii and encoding
switches.
Once you have a String object, everything is
Unicode UTF-16.

41
Special encodings

There are two encodings that the system treats as
special:
file.encoding
ISO-8859-1
All basic conversion functions use your system
default encoding.
Most servlet conversion functions use ISO-8859-1
as the default.
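
The effect of the default encoding can be seen by decoding the same bytes with and without an explicit charset. This is a sketch with hypothetical names; what the no-argument conversion prints depends on the platform and JDK version, which is exactly the problem.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetDemo {
    // Decode with a charset chosen by the caller: same result on every machine.
    static String decodeExplicit(byte[] bytes, Charset cs) {
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());

        byte[] bytes = {(byte) 0xC3, (byte) 0x80}; // U+00C0 encoded as UTF-8

        // Platform-dependent: uses whatever the default charset happens to be.
        System.out.println("default:    " + new String(bytes).length() + " char(s)");
        // Explicit: one character when read as UTF-8, two when read as Latin-1.
        System.out.println("UTF-8:      " + decodeExplicit(bytes, StandardCharsets.UTF_8).length() + " char(s)");
        System.out.println("ISO-8859-1: " + decodeExplicit(bytes, StandardCharsets.ISO_8859_1).length() + " char(s)");
    }
}
```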

42
Two File Encodings

Windows systems generally have two different file
encodings
ANSI encoding is the Windows default code page
for GUI applications.
OEM encoding is the code page used by the cmd
or command interpreter shells.

43
Stream Readers and Writers

InputStreamReader and OutputStreamWriter classes
perform controlled conversion between bytes and
Strings.
Always pass the encoding explicitly, as a variable.
Use the IANA preferred name for the encoding, if
possible (see ftp://ftp.isi.edu/in-notes/iana/assignments/)
Prefer UTF-8 for on-the-wire transport.

44
Code Sample
// Works with any type of InputStream class.
// Needs: import java.io.*;
InputStream is = new FileInputStream(file);
InputStreamReader isr = new InputStreamReader(is, encoding);
// Use a BufferedReader for efficiency.
BufferedReader br = new BufferedReader(isr);
StringBuffer sb = new StringBuffer();
int chr;
while ((chr = br.read()) > -1) {
    sb.append((char) chr); // cast so the character, not its number, is appended
}
Note: Try blocks eliminated for clarity.
45
OutputStreamWriter Code Sample
// Works with any type of OutputStream class.
// Needs: import java.io.*;
OutputStream os = new FileOutputStream(file);
OutputStreamWriter osw = new OutputStreamWriter(os, encoding);
osw.write(myString, 0, myString.length());
osw.flush();
Note: Try blocks eliminated for clarity.
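
The two samples can be combined into a round trip that runs without any files. This is a sketch, not the presenter's demo code: the class and method names are assumptions, and in-memory byte array streams stand in for the file streams.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

class StreamRoundTripDemo {
    // Convert a String to bytes in the named encoding via OutputStreamWriter.
    static byte[] write(String text, String encoding) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try (Writer osw = new OutputStreamWriter(os, encoding)) {
            osw.write(text, 0, text.length());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return os.toByteArray(); // closing the writer has flushed everything
    }

    // Convert bytes in the named encoding back to a String via InputStreamReader.
    static String read(byte[] data, String encoding) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), encoding))) {
            int chr;
            while ((chr = br.read()) > -1) {
                sb.append((char) chr);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "\u00C0 \u6F22\u5B57";
        byte[] wire = write(original, "UTF-8");
        System.out.println("bytes on the wire: " + wire.length);
        System.out.println("round trip ok: " + read(wire, "UTF-8").equals(original));
    }
}
```

Reading the bytes back with a different encoding name than the one used for writing reproduces the "random trash" symptom described earlier.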
46
Character Class

Provides access to Unicode character properties.
UnicodeBlock inner class
Character.getType (defined types)
isDigit
isLetter
isLetterOrDigit
isUpperCase/isLowerCase/isTitleCase
toUpperCase/toLowerCase/toTitleCase
isSpace/isWhitespace
isISOControl/isJavaIdentifierStart/isJavaIdentifierPart
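
A few of these properties in action. The class name, helper methods and sample characters below are assumptions picked for illustration; only the java.lang.Character calls come from the JDK.

```java
class CharacterPropsDemo {
    // Summarize a few Unicode properties of one character.
    static String describe(char c) {
        return String.format("U+%04X digit=%b letter=%b block=%s",
                (int) c,
                Character.isDigit(c),
                Character.isLetter(c),
                Character.UnicodeBlock.of(c));
    }

    // Titlecase mapping, which can differ from the uppercase mapping.
    static char titleCase(char c) {
        return Character.toTitleCase(c);
    }

    public static void main(String[] args) {
        System.out.println(describe('7'));      // ASCII digit
        System.out.println(describe('\u0663')); // Arabic-Indic digit three
        System.out.println(describe('\u3042')); // Hiragana letter A
        // U+01C6 (small dz with caron) has separate uppercase and titlecase forms.
        System.out.println(String.format("upper=U+%04X title=U+%04X",
                (int) Character.toUpperCase('\u01C6'), (int) titleCase('\u01C6')));
    }
}
```

Note that isDigit is true for digits in any script, so code that assumes "digit" means 0 through 9 needs an explicit range check instead.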

47
Normalization

Many characters have two (or more)
representations in Unicode.
Normalization makes the sequences the same.
Simplifies user input parsing and validation.

48
ICU4J Normalizer Class

Four forms of Normalization:
Form C (canonical composed)
Form D (canonical decomposed)
Form KC (compatibility composed)
Form KD (compatibility decomposed)
Special handling for Hangul characters!
Note that the JDK's own Normalizer was originally
a private class; java.text.Normalizer has been
public since Java 6.
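
The four forms can be tried out with the JDK alone. Note the substitution: this sketch uses java.text.Normalizer (public since Java 6) rather than the ICU4J Normalizer named on the slide, and the class and method names are assumptions for illustration.

```java
import java.text.Normalizer;

class NormalizationDemo {
    static String nfc(String s)  { return Normalizer.normalize(s, Normalizer.Form.NFC); }
    static String nfd(String s)  { return Normalizer.normalize(s, Normalizer.Form.NFD); }
    static String nfkc(String s) { return Normalizer.normalize(s, Normalizer.Form.NFKC); }

    // Two strings are canonically equivalent if their NFC forms match.
    static boolean sameText(String a, String b) {
        return nfc(a).equals(nfc(b));
    }

    public static void main(String[] args) {
        String composed = "\u00E9";    // e with acute as one character
        String decomposed = "e\u0301"; // e followed by combining acute accent
        System.out.println("equal as-is:      " + composed.equals(decomposed));
        System.out.println("equal normalized: " + sameText(composed, decomposed));
        // Compatibility form folds the fullwidth letter A to plain A.
        System.out.println("NFKC of fullwidth A: " + nfkc("\uFF21"));
        // Hangul syllables decompose into their conjoining jamo.
        System.out.println("NFD length of U+D55C: " + nfd("\uD55C").length());
    }
}
```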

49
Demo Programs

UnicodeDemo: a Java program that demonstrates
the byte sequences of different encodings and
also provides some code that shows ISR and OSW in
action.
Charsets: a Windows program by my buddy Bill
Hall for playing with encodings.
http://www.inter-locale.com: my personal
website, with examples and demos of certain Java
I18n things.

50
Questions?