Introduction to Character Encodings, Java and You - PowerPoint PPT Presentation

Slides: 51
Provided by: karen115

Transcript and Presenter's Notes

1
Introduction to Character Encodings, Java and You
2
Agenda
  • Defining the problem
  • Where webMethods products encounter character set
    problems.
  • What the symptoms look like.
  • Understand core concepts
  • What is a character set? What's an encoding?
  • What is Unicode, really?
  • Code Examples to avoid problems

3
Confusion Reigns
  • Generally, the most confusing aspect of
    internationalization.
  • Many, many standards to choose from.
  • Arcane terminology
  • American programmers (seem to) rarely encounter
    it head-on.
  • We're presenting this because many of our
    products are encountering this problem now.

4
Problem Domain
  • webMethods products interface with
  • non-Java systems (for example, in the adapters)
  • non-Java environments (file systems, databases,
    libraries, email, ftp, http, etc.).

5
Java's Text Representation
  • Java provides a convenient text processing
    architecture centered on the Java String object.
  • A Java String is basically a sequence of Java
    char values (the primitive that the Character
    class wraps).

6
Java Characters
  • Each Java Character object represents a Unicode
    character.
  • (Currently) a 16-bit unsigned integer value
    between 0 and 65,535.
  • Character class provides access to character
    properties.
  • UPPER, lower, and Titlecase mapping
  • Comparison
  • Directionality
  • Compatibility
  • C-TYPE values such as alpha-ness, digit-ness,
    alphanumeric-ness
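The property lookups listed above can be tried with a few static calls on java.lang.Character. This is a minimal sketch; the class name, the describe helper, and the three sample characters are chosen here for illustration and are not from the slides.

```java
final class CharacterPropertiesSketch {
    /** Describes one UTF-16 code unit using java.lang.Character lookups. */
    static String describe(char c) {
        return String.format("U+%04X letter=%b digit=%b upper=%c rtl=%b",
                (int) c,
                Character.isLetter(c),
                Character.isDigit(c),
                Character.toUpperCase(c),
                Character.getDirectionality(c)
                        == Character.DIRECTIONALITY_RIGHT_TO_LEFT);
    }

    public static void main(String[] args) {
        // Latin small a, Arabic-Indic digit three, Hebrew letter alef
        for (char c : new char[] {'a', '\u0663', '\u05D0'}) {
            System.out.println(describe(c));
        }
    }
}
```

Note that isDigit is true for digits in any script, not only 0 through 9.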

7
Non-Java Text
  • Non-Java files, applications, filesystems,
    databases, et al. typically do not use Unicode.
    Java sees them as an array of bytes (byte[]).

8
Three Problems
9
Bad Conversion
  • Target character set doesn't have this character
    in it. Java replaces each character with a ?
  • Input String: ???
  • Output String: ???
  • Typically:
  • Using the default encoding when we meant to
    specify one.
  • Writing on a device (such as System.out) whose
    legacy encoding doesn't support the characters.
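A bad conversion is easy to reproduce: push text through a character set that lacks the characters. The sketch below uses US-ASCII as the stand-in legacy encoding and three kanji as sample text; both are assumptions made for illustration. CharsetEncoder.canEncode gives a way to notice the problem before data is lost.

```java
import java.nio.charset.StandardCharsets;

final class BadConversionSketch {
    /** Encodes to US-ASCII and decodes again, as a lossy legacy device would. */
    static String throughAscii(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    /** True if every character of text exists in US-ASCII. */
    static boolean fitsAscii(String text) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String sample = "\u65E5\u672C\u8A9E"; // three kanji, written as escapes
        System.out.println(fitsAscii(sample));    // false
        System.out.println(throughAscii(sample)); // ???
    }
}
```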

10
No Glyph
  • Java knows what the character is and is handling
    it properly, but doesn't have a picture of it to
    show you (in the current Font selected).
  • Input String ???
  • Output String ???
  • Typically
  • Nothing is wrong, just using the wrong Font.

11
Random Trash
  • A byte was converted using the wrong character
    encoding. Bytes were mapped to the wrong
    characters.
  • Input String ???
  • Output String ?ú??ê
  • Typically
  • Using the wrong encoding, the underlying bytes
    are mapped to different, random-seeming
    characters.
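"Random trash" can be produced on purpose by decoding bytes with the wrong encoding. This sketch assumes the bytes are really UTF-8 and are misread as ISO-8859-1; the class and method names are illustrative. Because ISO-8859-1 maps every byte to a character, this particular mistake can be undone, which is not true of wrong encodings in general.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeSketch {
    /** Decodes UTF-8 bytes with the wrong charset (ISO-8859-1). */
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    /** Reverses misread(): recovers the original bytes, decodes them as UTF-8. */
    static String repair(String trash) {
        byte[] original = trash.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9";             // one character
        String trash = misread(eAcute);       // two characters: U+00C3 U+00A9
        System.out.println(trash.length());   // 2
        System.out.println(repair(trash).equals(eAcute)); // true
    }
}
```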

12
Examples
  • Same byte sequences, different results
  • Shift JIS bytes: 0xE0, 0x41, 0x83, 0x70 = 漓パ
  • Latin-1 bytes: 0xE0, 0x41, 0x83, 0x70 = àA?p
  • Java String: 0xE0, 0x41, 0x83, 0x70 = ??
  • Java String: 漓パ = U+6F13 U+30D1
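The same four bytes can be decoded both ways in a few lines. A sketch, assuming the JVM ships the optional Shift_JIS charset (hence the isSupported check); the class and helper names are illustrative. The slide gives U+6F13 U+30D1 as the Shift JIS reading.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class SameBytesSketch {
    static final byte[] BYTES = {(byte) 0xE0, 0x41, (byte) 0x83, 0x70};

    /** Lists the UTF-16 code units of a string as U+XXXX values. */
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String latin1 = new String(BYTES, StandardCharsets.ISO_8859_1);
        System.out.println("Latin-1:   " + codeUnits(latin1));
        // Shift_JIS is not one of the guaranteed charsets, so check first.
        if (Charset.isSupported("Shift_JIS")) {
            String sjis = new String(BYTES, Charset.forName("Shift_JIS"));
            System.out.println("Shift_JIS: " + codeUnits(sjis));
        }
    }
}
```

Four bytes become four characters under Latin-1 but only two under Shift JIS.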

13
Character Set Terminology
14
What is a Character?
  • A character is a single, atomic unit of text.
  • The definition has a different meaning according
    to the writing system and context.

15
Abstract characters
  • Some abstract characters include
  • A Roman Letter Capital A
  • Combining Accent Grave
  • に Hiragana character ni
  • ? CJK Ideograph
  • ? Arabic letter
  • ? Hangul syllable
  • Ａ Fullwidth compatibility letter A

16
What is a Character Set?
  • A character set is a set: a collection of
    characters, usually organized in some fashion.
  • You're probably most familiar with ASCII
  • 0x41 = A
  • 0x42 = B
  • Etc.

17
What is a Character Encoding?
  • Character set: a collection of characters,
    basically, a bucket.
  • Character encoding: the specific ones and zeroes
    assigned to each character in a character set.
  • Character Set A 0x41
  • Character Encoding A 0x41
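One way to see the difference between the set and the encoding is to take a single character and ask for its bytes in several encodings. A minimal sketch; the helper names and the choice of À (U+00C0) as the sample are assumptions made for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingVsCharsetSketch {
    /** Formats bytes as space-separated hex, for example "C3 80". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Encodes text in the given charset and returns the bytes as hex. */
    static String encode(String text, Charset cs) {
        return hex(text.getBytes(cs));
    }

    public static void main(String[] args) {
        String aGrave = "\u00C0"; // the same character every time
        System.out.println(encode(aGrave, StandardCharsets.ISO_8859_1)); // C0
        System.out.println(encode(aGrave, StandardCharsets.UTF_8));      // C3 80
        System.out.println(encode(aGrave, StandardCharsets.UTF_16BE));   // 00 C0
    }
}
```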

18
Eight Bit Encodings
  • 8-bit encodings allow for 256 characters.
  • 128 ASCII
  • 32 C1 controls
  • 96 extended
19
Latin-1
  • The standard for Western Europe is generally
    ISO-8859-1
  • AKA Latin-1
  • Used by UNIX systems and the Web.
  • Extended version used by Microsoft for Windows.

20
Let a Thousand Encodings Bloom
  • Each language has its own character set
  • Everywhere: ASCII
  • Western European (like German or French): Latin-1
  • Eastern European (like Polish or Slovak): Latin-2
  • Simplified Chinese: GB2312

21
Actually, many for each language
22
Other Writing Systems
  • Writing systems vary around the world (in order
    of increasing complexity, more or less)
  • Latin-based alphabets
  • (ABCDEFG) English
  • Cyrillic and Greek-based alphabets
  • (????????...) Russian
  • Ideographic writing systems have thousands of
    characters
  • (??????...) Japanese
  • Bi-directional (RTL) languages go right to left
  • (...???????) Hebrew
  • Complex scripts (everything else)
  • (???? )Devanagari

23
Expanded Character Sets
  • Most languages have alphabetic or phonetic
    writing systems
  • Russian, Greek, Slavic, (many) Native American,
    Bahasa, Hebrew, Arabic, Semitic, etc.: alphabetic
  • Indian (subcontinent), Thai, Japanese kana,
    Korean: phonetic writing systems
  • 8 bits is enough for all of the above (with some
    tricks)
  • Some languages use scripts based on Chinese
    ideographic writing (Han or Hanja)
  • Chinese
  • Korean
  • Vietnamese (traditional)
  • Japanese Kanji

24
Double-Byte
  • 8-bit character encodings use eight bits per
    character.
  • 2^8 = 256 characters
  • Double-byte character sets must be 2 bytes per
    character?
  • 2^16 = 65,536 characters
  • Should actually be called multi-byte (MBCS).
  • Each character can be ONE, TWO, THREE and
    sometimes FOUR bytes in length.
  • MAY involve shift states.

25
Multibyte Encodings
  • A typical Japanese Character Set
  • JIS X 0208 (漢字)
  • Character Encodings of JIS X 0208
  • Shift-JIS (CP932): 0x8A 0xBF 0x8E 0x9A
  • EUC-JP: 0xB4 0xC1 0xBB 0xFA
  • ISO-2022-JP: 0x1B 0x24 0x42 0x34 0x41 0x3B
    0x7A 0x1B 0x28 0x4A
  • Non-Legacy
  • UTF-16 (0x6F22 0x5B57)
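The byte sequences above can be compared against what a JVM produces. This is a sketch: the Japanese charsets are optional in Java, so the hypothetical hexOf helper returns null when one is missing, and the escape sequences a particular ISO-2022-JP encoder emits at the end of the text may differ from the slide.

```java
import java.nio.charset.Charset;

final class MultibyteSketch {
    /** Formats bytes as space-separated hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Hex bytes of text in the named charset, or null if this JVM lacks it. */
    static String hexOf(String text, String charsetName) {
        if (!Charset.isSupported(charsetName)) {
            return null;
        }
        return hex(text.getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        String kanji = "\u6F22\u5B57"; // the two characters U+6F22 U+5B57
        String[] names = {"Shift_JIS", "EUC-JP", "ISO-2022-JP", "UTF-16BE"};
        for (String name : names) {
            System.out.println(name + ": " + hexOf(kanji, name));
        }
    }
}
```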

26
An MBCS Example Shift-JIS
  • Character set used by DOS, Windows, Macs, and a
    few UNIX-like systems for Japanese.
  • Code Page 932
  • JIS X 0208:1997

27
Shift-JIS
  • In order to reach more characters, double byte
    values start with a limited range of lead bytes
  • These can be followed by any character value >=
    0x40 (trail byte)

28
Shift-JIS
  • Each lead byte provides a window onto
    additional characters.

29
Shift-JIS
  • Problems
  • Lead byte values are also valid as trail bytes.
  • Common special characters (such as \ and |) are
    valid trail bytes.
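The trail-byte problem shows up as soon as code scans Shift-JIS bytes for an ASCII character. The sketch below counts 0x5C bytes (the backslash position in ASCII and code page 932) two ways. The lead-byte ranges 0x81-0x9F and 0xE0-0xFC are the code page 932 ranges and are an assumption about the data; the byte pair 0x83 0x5C is the katakana letter so.

```java
final class ShiftJisScanSketch {
    /** True if b is a lead byte (0x81-0x9F or 0xE0-0xFC, assumed CP932 ranges). */
    static boolean isLeadByte(byte b) {
        int v = b & 0xFF;
        return (v >= 0x81 && v <= 0x9F) || (v >= 0xE0 && v <= 0xFC);
    }

    /** Naive scan: treats every 0x5C byte as a backslash. */
    static int countNaive(byte[] sjis) {
        int n = 0;
        for (byte b : sjis) {
            if (b == 0x5C) n++;
        }
        return n;
    }

    /** Lead-byte-aware scan: skips the trail byte that follows a lead byte. */
    static int countAware(byte[] sjis) {
        int n = 0;
        for (int i = 0; i < sjis.length; i++) {
            if (isLeadByte(sjis[i])) {
                i++; // the next byte belongs to this character
            } else if (sjis[i] == 0x5C) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // One double-byte character (0x83 0x5C) followed by a real backslash.
        byte[] data = {(byte) 0x83, 0x5C, 0x5C};
        System.out.println(countNaive(data)); // 2
        System.out.println(countAware(data)); // 1
    }
}
```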

30
Han
  • CJK scripts require up to 100,000 unique
    characters for complete representation.
  • Four major variants
  • Traditional Chinese
  • Simplified Chinese
  • Japanese Kanji
  • Korean (non-Hangul)

31
Kanji
  • Sometimes you hear Japanese called "kanji".
  • Kanji is actually one of four writing systems
    used in Japan.
  • "Kanji" should be avoided as a generic term for
    DBCS.
  • Kanji (Han or Chinese writing): 日本語
  • Hiragana (phonetic for Japanese words): にほんご
  • Katakana (phonetic for foreign words): ニホンゴ
  • Romaji (Roman script): nihongo

32
Chinese
  • Upper two are Traditional.
  • Lower character is the Simplified variant.

33
Hangul
  • Korean Hangul is a syllabic phonetic system,
    which has thousands of combinations.
  • Hangul is not related to Han ideographic writing.

34
Code Page Hell
  • With hundreds of encodings and character sets to
    choose from, making internationalized code work
    in the late 1980s and early 1990s was
    hellish.
  • Internationalization folks referred to this as
    "code page hell".

35
Unicode and Java
  • To the Rescue

36
Unicode (ISO/IEC 10646)
  • Unicode is a character set that supports all of
    the world's languages and writing systems.
  • Originally designed as a wide character set:
    every character was represented by 16 bits.
    This allowed for 65,536 potential characters.
  • Extended to allow about 1.1 million characters
    (code points up to U+10FFFF).
  • Unicode is maintained by an industry consortium.
    ISO/IEC 10646 is maintained by WG2. The two
    character repertoires are kept identical.

37
It's a character set?
  • Unicode is a character set. It has these
    encodings:
  • UTF-32 (BE/LE)
  • A 32-bit encoding. All characters are 32 bits.
  • UTF-16 (BE/LE)
  • A 16-bit encoding. Characters in the Basic
    Multilingual Plane (up to 0xFFFF) are 16 bits.
  • Characters above 0xFFFF (outside the Basic
    Multilingual Plane) require a pair of special
    surrogate code units, 32 bits in all.
  • UTF-8
  • An 8-bit variable-width encoding. Characters are
    1, 2, 3 or 4 bytes long. Byte order is never an
    issue.
  • ASCII = ASCII
  • All other characters have a special bit pattern
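The size of one character in each encoding can be measured directly. A sketch with a hypothetical sizes helper; U+10400 is used only as a convenient example of a character outside the Basic Multilingual Plane.

```java
import java.nio.charset.StandardCharsets;

final class EncodingSizesSketch {
    /**
     * Returns {UTF-16 code units, UTF-8 bytes, UTF-16 bytes}
     * for a single Unicode code point.
     */
    static int[] sizes(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return new int[] {
            s.length(),
            s.getBytes(StandardCharsets.UTF_8).length,
            s.getBytes(StandardCharsets.UTF_16BE).length
        };
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x10400));
        System.out.println(s.length());                           // 2
        System.out.println(s.codePointCount(0, s.length()));      // 1
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
    }
}
```

String.length() counts UTF-16 code units, so a surrogate pair counts as two; codePointCount gives the number of characters.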

38
UTF-8 Bit Pattern
  • ASCII = ASCII
  • 0x41 = A
  • All other characters are multibyte.
  • 110xxxxx: lead byte of a two-byte sequence
  • 1110xxxx: lead byte of a three-byte sequence
  • 11110xxx: lead byte of a four-byte sequence
  • 10xxxxxx: trail byte
  • U+00C0 À = 0xC3 0x80 (11000011 10000000)
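The bit patterns can be applied by hand. The following is an illustrative encoder for a single code point, not a replacement for the library converters; it assumes the input is a valid code point and not a surrogate.

```java
final class Utf8BitsSketch {
    /** Encodes one code point (U+0000..U+10FFFF, not a surrogate) as UTF-8. */
    static byte[] encode(int cp) {
        if (cp < 0x80) {                       // 0xxxxxxx
            return new byte[] {(byte) cp};
        } else if (cp < 0x800) {               // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        } else if (cp < 0x10000) {             // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        } else {                               // 11110xxx and three trail bytes
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
    }

    public static void main(String[] args) {
        for (byte b : encode(0x00C0)) {
            System.out.printf("%02X ", b & 0xFF); // C3 80
        }
        System.out.println();
    }
}
```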

39
Convenience Method for UTF8
  • Almost true: readUTF and writeUTF allow direct
    access to UTF-8 DataInput/DataOutputStreams.
  • This is not really UTF-8, but a Sun specialized
    version ("modified UTF-8").
  • Use InputStreamReader/OutputStreamWriter to do
    proper conversions.
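The difference is visible in the bytes. In this sketch (class and method names are illustrative), writeUTF prefixes a two-byte length and writes U+0000 as 0xC0 0x80, while real UTF-8 writes a single zero byte and no length.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class ModifiedUtf8Sketch {
    /** Bytes produced by DataOutputStream.writeUTF (length prefix included). */
    static byte[] viaWriteUTF(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DataOutputStream data = new DataOutputStream(out)) {
            data.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Bytes produced by a standard UTF-8 conversion. */
    static byte[] viaUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "A\0"; // letter A followed by U+0000
        System.out.println(viaWriteUTF(text).length); // 5: 00 03 41 C0 80
        System.out.println(viaUtf8(text).length);     // 2: 41 00
    }
}
```

Characters above 0xFFFF also differ: writeUTF writes each surrogate as its own three-byte sequence instead of one four-byte sequence.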

40
Java Uses Unicode
  • Every character in every Java String object is
    encoded as UTF-16 Unicode.
  • Every string is converted from a legacy encoding,
    either by the compiler or by the String class.
  • This is the reason for native2ascii and encoding
    switches.
  • Once you have a String object, everything is
    Unicode UTF-16.

41
Special encodings
  • There are two encodings that the system treats as
    special
  • file.encoding
  • ISO-8859-1
  • All basic conversion functions use your system
    default encoding.
  • Most servlet conversion functions use ISO-8859-1
    as the default.
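To see which default a given JVM uses, and to avoid depending on it, something like the following sketch can help. The class name and sample bytes are illustrative. On Java 18 and later the default charset is UTF-8 unless overridden; older releases take it from the platform.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DefaultEncodingSketch {
    /** Decodes bytes as UTF-8 regardless of the platform default. */
    static String decodeExplicit(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());

        byte[] utf8 = {(byte) 0xC3, (byte) 0x80}; // U+00C0 in UTF-8
        // new String(utf8) would use the default charset, so its result
        // depends on the machine; the explicit form does not.
        System.out.println(decodeExplicit(utf8).length()); // 1
    }
}
```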

42
Two File Encodings
  • Windows systems generally have two different file
    encodings
  • ANSI encoding is the Windows default code page
    for GUI applications.
  • OEM encoding is the code page used by the cmd
    or command interpreter shells.

43
Stream Readers and Writers
  • InputStreamReader and OutputStreamWriter classes
    perform controlled conversion between byte[] and
    String.
  • Always pass the encoding as a variable.
  • Use the IANA preferred name for the encoding, if
    possible (see
    ftp://ftp.isi.edu/in-notes/iana/assignments/)
  • Prefer UTF-8 for on-the-wire transport.

44
Code Sample
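import java.io.*;

// use with any type of InputStream
// (file and encoding are supplied by the caller)
InputStream is = new FileInputStream(file);
InputStreamReader isr = new InputStreamReader(is, encoding);
// use BufferedReader for efficiency
BufferedReader br = new BufferedReader(isr);
StringBuffer sb = new StringBuffer();
int chr;
// read() returns -1 at end of stream; cast to char so the
// character is appended rather than its numeric value
while ((chr = br.read()) > -1) {
    sb.append((char) chr);
}
Note: try blocks eliminated for clarity.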
45
OutputStreamWriter Code Sample
import java.io.*;

// use with any type of OutputStream
// (ByteArrayOutputStream takes no file argument)
OutputStream os = new ByteArrayOutputStream();
OutputStreamWriter osw = new OutputStreamWriter(os, encoding);
osw.write(myString, 0, myString.length());
// flush so buffered characters are encoded and reach the stream
osw.flush();
Note: try blocks eliminated for clarity.
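The two samples can be combined into a round trip with the try blocks restored. This is a sketch using in-memory streams and try-with-resources; the class and method names are illustrative, and IOExceptions are wrapped as unchecked to keep it short.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class StreamRoundTripSketch {
    /** Writes text through an OutputStreamWriter using the given charset. */
    static byte[] write(String text, Charset cs) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(os, cs)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return os.toByteArray();
    }

    /** Reads bytes back through an InputStreamReader using the given charset. */
    static String read(byte[] bytes, Charset cs) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), cs))) {
            int chr;
            while ((chr = r.read()) != -1) {
                sb.append((char) chr);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u6F22\u5B57";
        byte[] wire = write(text, StandardCharsets.UTF_8);
        System.out.println(wire.length);                                     // 6
        System.out.println(read(wire, StandardCharsets.UTF_8).equals(text)); // true
    }
}
```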
46
Character Class
  • Provides access to Unicode character properties.
  • UnicodeBlock nested class
  • Character.getType (defined types)
  • isDigit
  • isLetter
  • isLetterOrDigit
  • isUpperCase/isLowerCase/isTitleCase
  • toUpperCase/toLowerCase/toTitleCase
  • isSpace/isWhitespace
  • isISOControl/isJavaIdentifierStart/
    isJavaIdentifierPart
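A short sketch of the block and type lookups, using the code-point overloads of the Character methods. The helper names and sample characters are chosen here for illustration.

```java
final class UnicodeBlockSketch {
    /** Unicode block of a code point, or null if it is in no defined block. */
    static Character.UnicodeBlock blockOf(int codePoint) {
        return Character.UnicodeBlock.of(codePoint);
    }

    /** True if the code point's general category is "uppercase letter". */
    static boolean isUppercaseLetter(int codePoint) {
        return Character.getType(codePoint) == Character.UPPERCASE_LETTER;
    }

    public static void main(String[] args) {
        System.out.println(blockOf(0x306B) == Character.UnicodeBlock.HIRAGANA); // true
        System.out.println(isUppercaseLetter('A'));                             // true
        // A no-break space is a space character but not Java whitespace.
        System.out.println(Character.isWhitespace('\u00A0'));                   // false
        System.out.println(Character.isSpaceChar('\u00A0'));                    // true
    }
}
```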

47
Normalization
  • Many characters have two (or more)
    representations in Unicode.
  • Normalization makes the sequences the same.
  • Simplifies user input parsing and validation.

48
ICU4J Normalizer Class
  • Four forms of Normalization
  • Form C (canonical composed)
  • Form D (canonical decomposed)
  • Form KC (compatibility composed)
  • Form KD (compatibility decomposed)
  • Special handling for Hangul characters!
  • Note that older JDKs kept their Normalizer class
    private; java.text.Normalizer has been public
    since Java 6.
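The forms can be tried without ICU4J by using java.text.Normalizer, which is public in Java 6 and later. This sketch swaps in that JDK class for the ICU4J Normalizer named on the slide; the wrapper method names are illustrative.

```java
import java.text.Normalizer;

final class NormalizationSketch {
    static String nfc(String s)  { return Normalizer.normalize(s, Normalizer.Form.NFC); }
    static String nfd(String s)  { return Normalizer.normalize(s, Normalizer.Form.NFD); }
    static String nfkc(String s) { return Normalizer.normalize(s, Normalizer.Form.NFKC); }

    public static void main(String[] args) {
        // e + combining acute accent composes to one character under Form C.
        System.out.println(nfc("e\u0301").equals("\u00E9"));  // true
        // Form D splits the precomposed character again.
        System.out.println(nfd("\u00E9").length());           // 2
        // Form KC also folds compatibility characters such as fullwidth A.
        System.out.println(nfkc("\uFF21"));                   // A
        // A Hangul syllable decomposes into its jamo under Form D.
        System.out.println(nfd("\uAC00").equals("\u1100\u1161")); // true
    }
}
```

Normalizing both sides to the same form before comparing is what makes the two representations of a character compare equal.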

49
Demo Programs
  • UnicodeDemo: a Java program that demonstrates
    the byte sequences of different encodings and
    also provides some code that shows ISR and OSW in
    action.
  • Charsets: a Windows program by my buddy Bill
    Hall for playing with encodings.
  • http://www.inter-locale.com -- my personal
    website, with examples and demos of certain Java
    I18n things.

50
Questions?
  • Addison Phillips
  • aphillips_at_webmethods.com