Title: Introduction to Character Encodings, Java and You
1. Introduction to Character Encodings, Java and You
2. Agenda
- Defining the problem
  - Where webMethods products encounter character set problems.
  - What the symptoms look like.
- Understand core concepts
  - What is a character set? What's an encoding?
  - What is Unicode, really?
- Code examples to avoid problems
3. Confusion Reigns
- Generally, the most confusing aspect of internationalization.
- Many, many standards to choose from.
- Arcane terminology.
- American programmers rarely (seem to) encounter it head-on.
- We're presenting this because many of our products are encountering this problem now.
4. Problem Domain
- webMethods products interface with:
  - non-Java systems (for example, in the adapters)
  - non-Java environments (file systems, databases, libraries, email, FTP, HTTP, etc.)
5. Java's Text Representation
- Java provides a convenient text processing architecture centered on the Java String object.
- A Java String is basically an array of Java characters (char values, wrapped by the Character class when an object is needed).
6. Java Characters
- Each Java Character object represents a Unicode character.
  - (Currently) a 16-bit unsigned integer value between 0 and 65,535.
- The Character class provides access to character properties:
  - UPPER, lower, and Titlecase mapping
  - Comparison
  - Directionality
  - Compatibility
  - C-TYPE values such as alpha-ness, digit-ness, alphanumeric-ness
7. Non-Java Text
- Non-Java files, applications, filesystems, databases, et al. typically do not use Unicode. Java sees them as an array of bytes (byte[]).
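The boundary between byte[] and String can be sketched as follows. This is a minimal illustration, not code from the presentation: the class and method names are hypothetical, and it assumes Java 7 or later for java.nio.charset.StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

class BytesAndStrings {
    // Decode bytes from a non-Java source, naming the charset explicitly.
    static String decodeLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Encode a String back into bytes, again naming the charset explicitly.
    static byte[] encodeUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "cafe" with an accented e (0xE9), as stored in a Latin-1 file
        byte[] raw = { 0x63, 0x61, 0x66, (byte) 0xE9 };
        String text = decodeLatin1(raw);
        System.out.println(text.length());           // 4 characters
        System.out.println(encodeUtf8(text).length); // 5 bytes: the accented e becomes 0xC3 0xA9
    }
}
```

The same four characters occupy four bytes in Latin-1 and five in UTF-8, which is why the byte form is meaningless without knowing its encoding.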
8. Three Problems
9. Bad Conversion
- The target character set doesn't have this character in it. Java replaces each such character with a "?".
  - Input String ???
  - Output String ???
- Typically:
  - Using the default encoding when we meant to specify one.
  - Writing on a device (such as System.out) whose legacy encoding doesn't support the characters.
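A small sketch of the bad-conversion symptom, assuming Java 7 or later. The class and method names are made up for illustration. US-ASCII stands in for any target character set that lacks the characters being written.

```java
import java.nio.charset.StandardCharsets;

class BadConversionDemo {
    // Encode to US-ASCII and decode again; characters outside ASCII are lost.
    static String roundTripThroughAscii(String input) {
        byte[] ascii = input.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // \u6F22\u5B57 are two Japanese characters with no ASCII equivalent.
        System.out.println(roundTripThroughAscii("abc \u6F22\u5B57")); // prints "abc ??"
    }
}
```

Once the question marks have been written, the original characters cannot be recovered from the output.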
10. No Glyph
- Java knows what the character is and is handling it properly, but doesn't have a picture of it to show you (in the currently selected Font).
  - Input String ???
  - Output String ???
- Typically:
  - Nothing is wrong, just using the wrong Font.
11. Random Trash
- A byte sequence was converted using the wrong character encoding. Bytes were mapped to the wrong characters.
  - Input String ???
  - Output String ?ú??ê
- Typically:
  - Using the wrong encoding, the underlying bytes are mapped to different, random-seeming characters.
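The "random trash" symptom can be reproduced with a short sketch, assuming Java 7 or later. The names are hypothetical. Here UTF-8 bytes are deliberately decoded as ISO-8859-1. Because ISO-8859-1 maps every byte value to a character, this particular mistake happens to be reversible; many others are not.

```java
import java.nio.charset.StandardCharsets;

class WrongEncodingDemo {
    // Decode UTF-8 bytes with the wrong charset (ISO-8859-1).
    static String decodeUtf8BytesAsLatin1(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Undo the mistake: recover the original bytes, then decode them correctly.
    static String repair(String garbled) {
        byte[] bytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String garbled = decodeUtf8BytesAsLatin1("\u00E9"); // e with acute accent
        System.out.println(garbled.length());               // 2: one character became two
        System.out.println(repair(garbled).equals("\u00E9")); // true
    }
}
```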
12. Examples
- Same byte sequences, different results
  - Shift-JIS bytes 0xE0, 0x41, 0x83, 0x70 = 漓パ
  - Latin-1 bytes 0xE0, 0x41, 0x83, 0x70 = à A (C1 control) p
  - Java String 0xE0, 0x41, 0x83, 0x70 ??
  - Java String 漓パ = U+6F13 U+30D1
13. Character Set Terminology
14. What is a Character?
- A character is a single, atomic unit of text.
- The definition has a different meaning according to the writing system and context.
15. Abstract characters
- Some abstract characters include:
  - A Roman Letter Capital A
  - Combining Accent Grave
  - に Hiragana character ni
  - ? CJK Ideograph
  - ? Arabic letter
  - ? Hangul syllable
  - Ａ Fullwidth compatibility letter A
16. What is a Character Set?
- A character set is a set: a collection of characters, usually organized in some fashion.
- You're probably most familiar with ASCII:
  - 0x41 = A
  - 0x42 = B
  - Etc.
17. What is a Character Encoding?
- Character set: a collection of characters; basically, a bucket.
- Character encoding: the specific ones and zeroes assigned to the characters in a character set.
  - Character Set: A = 0x41
  - Character Encoding: A = 0x41
18. Eight Bit Encodings
- 8-bit encodings allow for 256 characters:
  - 128 ASCII
  - 32 C1 controls
  - 96 extended
19. Latin-1
- The standard for Western Europe is generally ISO-8859-1.
- AKA Latin-1.
- Used by UNIX systems and the Web.
- Extended version used by Microsoft for Windows.
20. Let a Thousand Encodings Bloom
- Each language has its own character set:
  - Everywhere: ASCII
  - Western European (like German or French): Latin-1
  - Eastern European (like Polish or Slovak): Latin-2
  - Simplified Chinese: GB2312
21. Actually, many for each language
22. Other Writing Systems
- Writing systems vary around the world (in order of increasing complexity, more or less):
  - Latin-based alphabets: (ABCDEFG) English
  - Cyrillic and Greek-based alphabets: (????????...) Russian
  - Ideographic writing systems have thousands of characters: (??????...) Japanese
  - Bi-directional (RTL) languages go right to left: (...???????) Hebrew
  - Complex scripts (everything else): (????) Devanagari
23. Expanded Character Sets
- Most languages have alphabetic or phonetic writing systems.
  - Alphabetic: Russian, Greek, Slavic, (many) Native American, Bahasa, Hebrew, Arabic, Semitic, etc.
  - Phonetic: Indian (subcontinent), Thai, Japanese kana, Korean
  - 8 bits is enough for all of the above (with some tricks).
- Some languages use scripts based on Chinese ideographic writing (Han or Hanja):
  - Chinese
  - Korean
  - Vietnamese (traditional)
  - Japanese Kanji
24. Double-Byte
- 8-bit character encodings use eight bits per character.
  - 2^8 = 256 characters
- Double-byte character sets must be 2 bytes per character?
  - 2^16 = 65,536 characters
- Should actually be called multi-byte (MBCS).
  - Each character can be ONE, TWO, THREE and sometimes FOUR bytes in length.
  - MAY involve shift states.
25. Multibyte Encodings
- A typical Japanese character set:
  - JIS X 0208 (漢字)
- Character encodings of JIS X 0208:
  - Shift-JIS (CP932): 0x8A 0xBF 0x8E 0x9A
  - EUC-JP: 0xB4 0xC1 0xBB 0xFA
  - ISO-2022-JP: 0x1B 0x24 0x42 0x34 0x41 0x3B 0x7A 0x1B 0x28 0x4A
- Non-legacy:
  - UTF-16 (0x6F22 0x5B57)
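The byte sequences on this slide can be inspected with a sketch like the one below. It is hypothetical code, assuming Java 7 or later. It sticks to UTF-16BE and UTF-8 because those charsets are guaranteed on every Java platform, whereas the Japanese legacy charsets may not be installed in every runtime.

```java
import java.nio.charset.StandardCharsets;

class EncodingBytesDemo {
    // Format a byte array as space-separated hex values, e.g. "0x6F 0x22".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String kanji = "\u6F22\u5B57"; // the two characters shown on the slide
        System.out.println(hex(kanji.getBytes(StandardCharsets.UTF_16BE))); // 0x6F 0x22 0x5B 0x57
        System.out.println(hex(kanji.getBytes(StandardCharsets.UTF_8)));    // 0xE6 0xBC 0xA2 0xE5 0xAD 0x97
    }
}
```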
26. An MBCS Example: Shift-JIS
- Character set used by DOS, Windows, Macs, and a few UNIX-like systems for Japanese.
- Code Page 932
- JIS X 0208:1997
27. Shift-JIS
- In order to reach more characters, double-byte values start with a limited range of lead bytes.
- These can be followed by any byte value >= 0x40 (trail byte).
28. Shift-JIS
- Each lead byte provides a window onto additional characters.
29. Shift-JIS
- Problems:
  - Lead byte values are also valid as trail bytes.
  - Common special characters (such as the backslash, 0x5C) are valid trail bytes.
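The trail-byte problem can be illustrated with a rough sketch that works on raw bytes. The names are hypothetical. The lead-byte ranges used (0x81-0x9F and 0xE0-0xFC) are an assumption based on Code Page 932, and single-byte katakana is simply treated as ordinary single bytes.

```java
class ShiftJisScanDemo {
    // Assumed CP932 lead-byte ranges for double-byte characters.
    static boolean isLeadByte(int b) {
        return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
    }

    // Naive search: looks at every byte independently.
    static int naiveIndexOf(byte[] text, int target) {
        for (int i = 0; i < text.length; i++) {
            if ((text[i] & 0xFF) == target) return i;
        }
        return -1;
    }

    // Lead-byte-aware search: skips the trail byte of each double-byte character.
    static int awareIndexOf(byte[] text, int target) {
        for (int i = 0; i < text.length; i++) {
            int b = text[i] & 0xFF;
            if (isLeadByte(b)) {
                i++; // the next byte is a trail byte, not a character of its own
                continue;
            }
            if (b == target) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        // 0x95 0x5C is a single double-byte character; the final 0x5C is a real backslash.
        byte[] text = { (byte) 0x95, 0x5C, 0x41, 0x5C };
        System.out.println(naiveIndexOf(text, 0x5C)); // 1 (false match inside a character)
        System.out.println(awareIndexOf(text, 0x5C)); // 3
    }
}
```

Code that scans for path separators or escape characters byte by byte runs into exactly this false match.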
30. Han
- CJK scripts require up to 100,000 unique characters for complete representation.
- Four major variants:
  - Traditional Chinese
  - Simplified Chinese
  - Japanese Kanji
  - Korean (non-Hangul)
31. Kanji
- Sometimes you hear Japanese called "kanji".
- Kanji is actually one of four writing systems used in Japan.
- "Kanji" should be avoided as a generic term for DBCS.
  - Kanji (Han or Chinese writing) ???
  - Hiragana (phonetic for Japanese words) ????
  - Katakana (phonetic for foreign words) ????
  - Romaji (Roman script) nihongo
32. Chinese
- Upper two are Traditional.
- Lower character is the Simplified variant.
33. Hangul
- Korean Hangul is a syllabic phonetic system, which has thousands of combinations.
- Hangul is not related to Han ideographic writing.
34. Code Page Hell
- With hundreds of encodings and character sets to choose from, making internationalized code work in the late 1980s and early 1990s was hellish.
- Internationalization folks referred to this as "code page hell".
35. Unicode and Java
36. Unicode (ISO/IEC 10646)
- Unicode is a character set that supports all of the world's languages and writing systems.
- Originally designed as a "wide" character set: every character was represented by 16 bits. This allowed for 65,536 potential characters.
- Extended to allow 1.1 million characters.
- Unicode is maintained by an industry consortium. ISO/IEC 10646 is maintained by WG2. The two are kept synchronized, with identical character repertoires and code points.
37. It's a character set?
- Unicode is a character set. It has these encodings:
  - UTF-32 (BE/LE)
    - A 32-bit encoding. All characters are 32 bits.
  - UTF-16 (BE/LE)
    - A 16-bit encoding. Characters in the Basic Multilingual Plane (up to 0xFFFF) are 16 bits.
    - Characters above 0xFFFF require two special "surrogate" code units (32 bits in total).
  - UTF-8
    - An 8-bit variable-width encoding. Characters are 1, 2, 3 or 4 bytes long. Always non-endian.
    - ASCII = ASCII
    - All other characters have a special bit pattern.
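The surrogate mechanism in UTF-16 can be seen in a short sketch. The class name is hypothetical, and it assumes Java 5 or later, where Character.toChars and String.codePointCount are available. U+10400 is just an arbitrary character above 0xFFFF.

```java
class SurrogateDemo {
    // Convert a Unicode code point into its UTF-16 code units.
    static char[] toUtf16(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        char[] units = toUtf16(0x10400);
        System.out.println(Integer.toHexString(units[0])); // d801 (high surrogate)
        System.out.println(Integer.toHexString(units[1])); // dc00 (low surrogate)

        String s = new String(units);
        System.out.println(s.length());                      // 2 code units
        System.out.println(s.codePointCount(0, s.length())); // 1 character
    }
}
```

String.length() counts 16-bit code units, so it overstates the number of characters whenever surrogate pairs are present.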
38. UTF-8 Bit Pattern
- ASCII = ASCII
  - 0x41 = A
- All other characters are multibyte.
  - 110xxxxx: lead byte of a two-byte sequence
  - 1110xxxx: lead byte of a three-byte sequence
  - 11110xxx: lead byte of a four-byte sequence
  - 10xxxxxx: trail byte
- U+00C0 À = 0xC3 0x80 (11000011 10000000)
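The bit patterns above can be turned into a hand-written encoder. This is a teaching sketch only, with a hypothetical class name. It does not reject surrogate code points or values above 0x10FFFF, and real code should use the platform's UTF-8 converter instead.

```java
class Utf8BitPattern {
    // Encode one Unicode code point using the lead-byte / trail-byte patterns.
    static byte[] encode(int cp) {
        if (cp < 0x80) {
            return new byte[] { (byte) cp };                 // 0xxxxxxx
        }
        if (cp < 0x800) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),                   // 110xxxxx
                (byte) (0x80 | (cp & 0x3F)) };               // 10xxxxxx
        }
        if (cp < 0x10000) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),                  // 1110xxxx
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),                      // 11110xxx
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F)) };
    }

    public static void main(String[] args) {
        byte[] bytes = encode(0x00C0); // the slide's example character
        System.out.println(Integer.toHexString(bytes[0] & 0xFF)); // c3
        System.out.println(Integer.toHexString(bytes[1] & 0xFF)); // 80
    }
}
```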
39. Convenience Method for UTF-8
- Almost true: readUTF and writeUTF allow direct access to UTF-8 in DataInputStream/DataOutputStream.
- This is not really UTF-8, but a Sun-specialized version ("modified UTF-8").
- Use InputStreamReader/OutputStreamWriter to do proper conversions.
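The difference between writeUTF and standard UTF-8 shows up in a small comparison. This is a sketch with hypothetical names, assuming Java 8 or later for UncheckedIOException. It uses the character U+0000, which writeUTF encodes as two bytes, and it also shows the two-byte length prefix that writeUTF puts in front of the data.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class ModifiedUtf8Demo {
    // Bytes produced by DataOutputStream.writeUTF (length prefix + modified UTF-8).
    static byte[] viaWriteUTF(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Bytes produced by an OutputStreamWriter configured for standard UTF-8.
    static byte[] viaWriter(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(bytes, StandardCharsets.UTF_8)) {
            out.write(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        String s = "A\u0000";
        System.out.println(viaWriteUTF(s).length); // 5: 00 03 (length), 41, C0 80
        System.out.println(viaWriter(s).length);   // 2: 41, 00
    }
}
```

Data written with writeUTF should therefore only be read back with readUTF, not handed to another system as UTF-8.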
40. Java Uses Unicode
- Every character in every Java String object is encoded as UTF-16 Unicode.
- Every string is converted from a legacy encoding, either by the compiler or by the String class.
  - This is the reason for native2ascii and encoding switches (such as javac -encoding).
- Once you have a String object, everything is Unicode UTF-16.
41. Special encodings
- There are two encodings that the system treats as special:
  - file.encoding (the system default)
  - ISO-8859-1
- All basic conversion functions use your system default encoding.
- Most servlet conversion functions use ISO-8859-1 as the default.
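To see which default encoding a given JVM is actually using, something like the following sketch can help. The class name is hypothetical. The printed values depend entirely on the platform and JVM settings; on recent JDKs (18 and later) the default charset is UTF-8 unless overridden.

```java
import java.nio.charset.Charset;

class DefaultEncodingDemo {
    // Name of the charset used when no encoding is specified.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // The system property the slide refers to (may be null in restricted environments).
        System.out.println(System.getProperty("file.encoding"));
        System.out.println(defaultCharsetName());
    }
}
```

Because this value changes from machine to machine, any conversion that relies on it will behave differently in development and production.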
42. Two File Encodings
- Windows systems generally have two different file encodings:
  - The ANSI encoding is the Windows default code page for GUI applications.
  - The OEM encoding is the code page used by the cmd or command interpreter shells.
43. Stream Readers and Writers
- InputStreamReader and OutputStreamWriter classes perform controlled conversion between byte[] and String.
- Always pass the encoding as a variable.
- Use the IANA preferred name for the encoding, if possible (see ftp://ftp.isi.edu/in-notes/iana/assignments/).
- Prefer UTF-8 for on-the-wire transport.
44. Code Sample
// use with any type of InputStream class
// (needs: import java.io.*;)
InputStream is = new FileInputStream(file);
InputStreamReader isr = new InputStreamReader(is, encoding);
// use BufferedReader for efficiency
BufferedReader br = new BufferedReader(isr);
StringBuffer sb = new StringBuffer();
int chr;
// read() returns an int; cast to char so the character, not its number, is appended
while ((chr = br.read()) > -1)
    sb.append((char) chr);
Note: Try blocks eliminated for clarity.
45. OutputStreamWriter Code Sample
// use with any type of OutputStream class
// (needs: import java.io.*;)
OutputStream os = new FileOutputStream(file);
OutputStreamWriter osw = new OutputStreamWriter(os, encoding);
osw.write(myString, 0, myString.length());
// flush so buffered characters are converted and pushed to the stream
osw.flush();
Note: Try blocks eliminated for clarity.
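The two samples above can be combined into a round trip with the try blocks put back. This is one possible sketch, not the presenter's code. The class and method names are hypothetical, in-memory streams replace the files so it runs anywhere, and Java 8 or later is assumed for try-with-resources and UncheckedIOException.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

class StreamRoundTrip {
    // Convert a String to bytes in the named encoding.
    static byte[] write(String text, String encoding) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try (Writer osw = new OutputStreamWriter(os, encoding)) {
            osw.write(text, 0, text.length());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return os.toByteArray();
    }

    // Convert bytes in the named encoding back to a String.
    static String read(byte[] data, String encoding) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), encoding))) {
            int chr;
            while ((chr = br.read()) > -1) {
                sb.append((char) chr);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "na\u00EFve \u6F22\u5B57";
        byte[] bytes = write(original, "UTF-8");
        System.out.println(bytes.length);                          // 13
        System.out.println(read(bytes, "UTF-8").equals(original)); // true
    }
}
```

Passing the encoding as a parameter, as the earlier slide recommends, means the same two methods work for any encoding the runtime supports.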
46. Character Class
- Provides access to Unicode character properties.
- UnicodeBlock inner class
- Character.getType (defined types)
- isDigit
- isLetter
- isLetterOrDigit
- isUpperCase/isLowerCase/isTitleCase
- toUpperCase/toLowerCase/toTitleCase
- isSpace (deprecated)/isWhitespace
- isISOControl/isJavaIdentifierStart/isJavaIdentifierPart
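A few of these Character methods in action, as a brief sketch. The class and method names are hypothetical. The sample characters are a Latin letter, an Arabic-Indic digit and a CJK ideograph, chosen only to show that the properties are not limited to ASCII.

```java
class CharacterPropertiesDemo {
    // Summarize a handful of Unicode properties for one character.
    static String describe(char c) {
        return "letter=" + Character.isLetter(c)
             + " digit=" + Character.isDigit(c)
             + " upper=" + Character.isUpperCase(c)
             + " whitespace=" + Character.isWhitespace(c);
    }

    public static void main(String[] args) {
        System.out.println(describe('A'));      // letter=true digit=false upper=true whitespace=false
        System.out.println(describe('\u0663')); // Arabic-Indic digit three: digit=true
        System.out.println(describe('\u6F22')); // CJK ideograph: letter=true

        // Numeric value and block lookup also work beyond ASCII.
        System.out.println(Character.digit('\u0663', 10)); // 3
        System.out.println(Character.UnicodeBlock.of('\u6F22')
                == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS); // true
    }
}
```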
47. Normalization
- Many characters have two (or more) representations in Unicode.
- Normalization makes the sequences the same.
- Simplifies user input parsing and validation.
48. ICU4J Normalizer Class
- Four forms of Normalization:
  - Form C (canonical composed)
  - Form D (canonical decomposed)
  - Form KC (compatibility composed)
  - Form KD (compatibility decomposed)
- Special handling for Hangul characters!
- Note that the JDK's own Normalizer class was private in early releases; java.text.Normalizer has been a public API since Java 6.
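The effect of the normalization forms can be sketched with the JDK's java.text.Normalizer (public since Java 6), used here in place of the ICU4J class named on the slide so that the example has no external dependency. The class and method names are hypothetical.

```java
import java.text.Normalizer;

class NormalizationDemo {
    static String nfc(String s)  { return Normalizer.normalize(s, Normalizer.Form.NFC); }
    static String nfd(String s)  { return Normalizer.normalize(s, Normalizer.Form.NFD); }
    static String nfkc(String s) { return Normalizer.normalize(s, Normalizer.Form.NFKC); }

    public static void main(String[] args) {
        String precomposed = "\u00C0"; // A with grave accent as one character
        String combining = "A\u0300";  // A followed by a combining grave accent

        System.out.println(precomposed.equals(combining));      // false: different sequences
        System.out.println(nfc(combining).equals(precomposed)); // true: Form C composes
        System.out.println(nfd(precomposed).equals(combining)); // true: Form D decomposes
        System.out.println(nfkc("\uFF21").equals("A"));         // true: fullwidth A folds to A
    }
}
```

Normalizing user input to one form before comparing or storing it avoids treating visually identical strings as different.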
49. Demo Programs
- UnicodeDemo: a Java program that demonstrates the byte sequences of different encodings and also provides some code that shows ISR (InputStreamReader) and OSW (OutputStreamWriter) in action.
- Charsets: a Windows program by my buddy Bill Hall for playing with encodings.
- http://www.inter-locale.com: my personal website, with examples and demos of certain Java I18n things.
50. Questions?
- Addison Phillips
- aphillips_at_webmethods.com