Title: Ian Little
1 Pluggable Charset Support in the Java Platform
Ian Little
Software Engineer, Java Software Core Tools and Libraries
Sun Microsystems Ireland
2 Outline of Presentation
- Support for charsets within the Java platform: old limitations, what's new and exciting!
- Primer on buffer classes
- A developer's sightseeing tour of the new charset API
- Overview of how to write a custom installable charset implementation
- Deploy and enjoy!
3 Pluggable Charsets in Java
- Prior to J2SE release 1.4: no public API
- 1.4 introduces the java.nio.charset package
- Small but important part of New I/O (JSR-51)
- SPI allows support for new charsets to be plugged in.
4 JSR-51: The Bigger Picture
- java.nio -- Buffers
- java.nio.channels
- Non-blocking network I/O
- Fast file I/O (memory-mapped files, etc.)
- java.util.regex -- Regular expressions
- java.nio.charset -- Character sets
5 Java Community Process
- java.nio and the charset API/SPI were conceived and nurtured through the Java Community Process (JCP).
- New major features and API additions in J2SE 1.4 were introduced and reviewed via the JCP.
- JSR expert groups are composed of a wide diversity of domain experts.
- For JSR-51 (finalised May 2002) the expert group included IBM, Oracle, BEA, OKI, NTT, Sun.
6 Terminology
- Charset: defined in RFC 2278
- The combination of a coded character set and an encoding scheme.
- Coders
- Either a decoder (java.nio.charset.CharsetDecoder) or an encoder (java.nio.charset.CharsetEncoder).
7 Core NIO Charset Coders
- java.nio charset API introduced in J2SE 1.4
- You can start writing to the API and deploying custom-developed charsets now!
- All charsets supported in prior J2SE releases are still supported.
- Only some are accessible via the java.nio.charset API
- More will be added in future releases.
US-ASCII, ISO-8859-1, ISO-8859-15, UTF-8, UTF-16,
UTF-16BE, UTF-16LE, windows-1252, Big5,
Big5-HKSCS, EUC-CN, EUC-JP, EUC-KR, EUC-TW, GBK,
GB18030, ISO-2022-KR, Johab, windows-936/949/950,
ISO-8859-X, JIS-X-0201, JIS-X-0212,
JIS-X-0208, TIS-620, ISCII91
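Which of these charsets a given JVM actually exposes through java.nio.charset can be checked at runtime. The following is a minimal sketch, written against a current JDK (generics postdate J2SE 1.4); the class name ListCharsets is hypothetical, and the printed list will vary between JDK builds.

```java
import java.nio.charset.Charset;
import java.util.SortedMap;

public class ListCharsets {
    /** Canonical name -> Charset for every charset visible through java.nio.charset. */
    static SortedMap<String, Charset> available() {
        return Charset.availableCharsets();
    }

    public static void main(String[] args) {
        // Prints one canonical charset name per line; the set depends on the JDK in use.
        for (String name : available().keySet()) {
            System.out.println(name);
        }
    }
}
```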
8 Introducing the API/SPI
9 java.nio.charset package
- java.nio
- Core buffer classes: ByteBuffer, CharBuffer
- Used as input/output for coders
- java.nio.charset
- Core charset classes
- Charset
- CharsetDecoder, CharsetEncoder
- CoderResult, CodingErrorAction
- java.nio.charset.spi
- Service-Provider Interface for building user-installable charset coders
10 API Overview
- java.nio.charset.Charset
- A named mapping between sequences of sixteen-bit Unicode characters and sequences of bytes.
- Encapsulates the immutable properties of charsets
- Concrete instances obtained via the static forName() method
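As a small illustration of obtaining a Charset via forName(), the sketch below looks a charset up by name and uses the convenience encode()/decode() methods on Charset itself. The class and method names (ForNameDemo, roundTrip) are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

public class ForNameDemo {
    /** Encodes text to bytes with the named charset, then decodes it back. */
    static String roundTrip(String charsetName, String text) {
        Charset cs = Charset.forName(charsetName); // throws if the name is unsupported
        ByteBuffer bytes = cs.encode(text);        // convenience encoder; unmappable chars are replaced
        return cs.decode(bytes).toString();        // convenience decoder
    }

    public static void main(String[] args) {
        Charset cs = Charset.forName("UTF-8");
        System.out.println(cs.name() + " aliases: " + cs.aliases());
        System.out.println(roundTrip("UTF-8", "caf\u00e9"));
        // US-ASCII cannot represent U+00E9, so the convenience encoder substitutes '?'
        System.out.println(roundTrip("US-ASCII", "caf\u00e9"));
    }
}
```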
11 API Overview (continued)
- java.nio.charset.CharsetDecoder
- Engine which takes a sequence of bytes encoded using a native character encoding scheme or mapping and produces the equivalent decoded sequence of Java 16-bit Unicode characters.
- java.nio.charset.CharsetEncoder
- Engine which takes a sequence of Java characters, stored internally as 16-bit Unicode values, and produces the equivalent natively encoded byte sequence.
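The two engines can be exercised directly rather than through the Charset convenience methods. This is a hedged sketch: CoderRoundTrip and its helper methods are hypothetical names, and checked coding exceptions are wrapped only to keep the helpers short.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;

public class CoderRoundTrip {
    /** Java chars -> natively encoded bytes, using a fresh CharsetEncoder. */
    static byte[] encode(String text, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        try {
            ByteBuffer bb = encoder.encode(CharBuffer.wrap(text));
            byte[] out = new byte[bb.remaining()];
            bb.get(out); // bulk relative get copies the encoded bytes out
            return out;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Natively encoded bytes -> Java chars, using a fresh CharsetDecoder. */
    static String decode(byte[] bytes, String charsetName) {
        CharsetDecoder decoder = Charset.forName(charsetName).newDecoder();
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        byte[] euro = encode("\u20ac", "UTF-8");
        System.out.println(euro.length + " bytes");
        System.out.println(decode(euro, "UTF-8"));
    }
}
```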
12 API Overview (continued)
- java.nio.charset.spi.CharsetProvider
- An object which facilitates the installation of
the Charset implementation into the running JVM.
13 java.nio core classes
- Buffer: a core java.nio abstraction
- java.nio.Buffer is the abstract superclass.
- Buffer subclasses encapsulate a linear sequence of values of a primitive data type
- The key Buffer properties are position, limit and capacity.
- Specifically relevant when dealing with charsets are the ByteBuffer and CharBuffer subclasses
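How position, limit and capacity move as a buffer is filled and then read can be seen in a few lines. This is an illustrative sketch only; BufferState and describe() are hypothetical helper names.

```java
import java.nio.Buffer;
import java.nio.CharBuffer;

public class BufferState {
    /** Renders the three key properties of any Buffer. */
    static String describe(Buffer b) {
        return "position=" + b.position() + " limit=" + b.limit() + " capacity=" + b.capacity();
    }

    public static void main(String[] args) {
        CharBuffer cb = CharBuffer.allocate(8);
        System.out.println(describe(cb)); // position=0 limit=8 capacity=8
        cb.put("abc");                    // writing advances the position
        System.out.println(describe(cb)); // position=3 limit=8 capacity=8
        cb.flip();                        // switch from filling to draining
        System.out.println(describe(cb)); // position=0 limit=3 capacity=8
        cb.get();                         // reading advances the position
        System.out.println(describe(cb)); // position=1 limit=3 capacity=8
        cb.clear();                       // make the whole buffer writable again
        System.out.println(describe(cb)); // position=0 limit=8 capacity=8
    }
}
```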
14 java.nio Buffer classes overview
- Buffer read operations achieved via get() methods
- Buffer write operations achieved via put() methods
- Overloaded put(..) and get(..) method definitions depending on whether you require
- Per-byte or per-char reads from an input buffer
- Bulk byte or char reads or writes
- Buffer reads and writes addressed absolutely or relatively
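The difference between relative, absolute and bulk access can be sketched as follows. GetPutDemo and demo() are hypothetical names, and the byte values are arbitrary.

```java
import java.nio.ByteBuffer;

public class GetPutDemo {
    /** Exercises relative, absolute and bulk get/put on one small buffer. */
    static byte[] demo() {
        ByteBuffer bb = ByteBuffer.allocate(8);
        bb.put((byte) 1);                  // relative put: writes at position 0, position becomes 1
        bb.put(new byte[] {2, 3, 4});      // bulk relative put: position becomes 4
        bb.put(0, (byte) 9);               // absolute put: overwrites index 0, position unchanged
        bb.flip();                         // limit=4, position=0: ready for reading
        byte first = bb.get();             // relative get: reads index 0, position becomes 1
        byte third = bb.get(2);            // absolute get: reads index 2, position unchanged
        byte[] rest = new byte[bb.remaining()];
        bb.get(rest);                      // bulk relative get: reads indices 1..3
        return new byte[] {first, third, rest[0], rest[1], rest[2]};
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(demo())); // [9, 3, 2, 3, 4]
    }
}
```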
15 NIO Buffer classes overview
- To create a CharBuffer or ByteBuffer instance
- allocate(int capacity) or
- allocateDirect(int capacity) (ByteBuffer only)
- Interoperability with pre-java.nio code
- CharBuffer provides
- CharBuffer wrap(char[])
- ByteBuffer provides
- ByteBuffer wrap(byte[])
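A short sketch of the creation paths listed above. CreateBuffers and wrapSharesArray() are hypothetical names; the point illustrated is that wrap() shares storage with the existing array rather than copying it, which is what makes it useful for interoperating with array-based code.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;

public class CreateBuffers {
    /** Returns true if a write through a wrapped buffer is visible in the backing array. */
    static boolean wrapSharesArray() {
        byte[] legacy = {72, 105};
        ByteBuffer wrapped = ByteBuffer.wrap(legacy); // no copy: the array backs the buffer
        wrapped.put(0, (byte) 74);                    // absolute put through the buffer
        return legacy[0] == 74;
    }

    public static void main(String[] args) {
        ByteBuffer heap = ByteBuffer.allocate(1024);         // heap-backed buffer
        ByteBuffer direct = ByteBuffer.allocateDirect(1024); // direct buffer (ByteBuffer only)
        CharBuffer chars = CharBuffer.allocate(512);
        CharBuffer wrappedChars = CharBuffer.wrap(new char[] {'o', 'k'});
        System.out.println(heap.isDirect() + " " + direct.isDirect());
        System.out.println(chars.capacity() + " " + wrappedChars);
        System.out.println(wrapSharesArray());
    }
}
```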
16 Charset Class
- Anchor class within the java.nio.charset package
- boolean isSupported(String charsetName): tests whether the supplied charset name/alias is supported via the java.nio API
- Charset naming consistency is a goal of the API.
- Canonical names of charsets mirror those within the IANA registry.
17 java.nio.charset.Charset Class
- Use the MIME-preferred name where multiple choices exist.
- boolean isRegistered()
- Prefix the canonical name with x- or X- where the charset is not IANA-registered.
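The naming rules can be probed with isSupported(), name() and isRegistered(). The sketch below is illustrative: CharsetNames and its methods are hypothetical, and "X-fooCS" stands in for a charset that is not installed.

```java
import java.nio.charset.Charset;

public class CharsetNames {
    /** Resolves a name or alias to the canonical charset name, or null if unsupported. */
    static String canonicalName(String nameOrAlias) {
        return Charset.isSupported(nameOrAlias) ? Charset.forName(nameOrAlias).name() : null;
    }

    /** True if the named charset is registered in the IANA registry. */
    static boolean isIanaRegistered(String nameOrAlias) {
        return Charset.forName(nameOrAlias).isRegistered();
    }

    public static void main(String[] args) {
        System.out.println(canonicalName("UTF8"));     // alias resolves to the canonical name
        System.out.println(canonicalName("latin1"));
        System.out.println(canonicalName("X-fooCS"));  // null unless such a charset is installed
        System.out.println(isIanaRegistered("UTF-8"));
    }
}
```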
18 API Usage Idiom: Decoding
- Decoding bytes from a file (setup)....
    // Get Charset instance
    Charset cs = Charset.forName("X-fooCS");
    // Get decoder instance
    CharsetDecoder decoder = cs.newDecoder();
    // Decode ByteBuffer to CharBuffer (quick)
    CharBuffer cb = decoder.decode(bb);
    // Alternatively: quick decode of ByteBuffer to String
    String s = decoder.decode(bb).toString();
19 API Usage Idiom: Decoding
    ByteBuffer bb = ...
    CharBuffer cb = ...
    boolean eof = false;
    CoderResult result = CoderResult.UNDERFLOW;
    while (!eof) {
        if (result == CoderResult.UNDERFLOW) {
            bb.clear(); // note: clear() drops undecoded bytes; compact() keeps a partial sequence
            eof = (inChannel.read(bb) == -1);
            bb.flip();
        }
        result = decoder.decode(bb, cb, eof);
        if (result == CoderResult.OVERFLOW)
            drainBuf(cb); // defined elsewhere
    }
    decoder.flush(cb); // check overflow here too!
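A fuller, runnable variant of this decoding loop might look like the sketch below. It is one possible arrangement and not the presenter's code: the buffer sizes are deliberately tiny to force both underflow and overflow, compact() is used so that a multi-byte sequence split across reads is preserved, and StreamingDecode, decodeAll(), decodeBytes() and this drainBuf() are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

public class StreamingDecode {
    /** Moves the decoded chars into the sink and empties the buffer for reuse. */
    static void drainBuf(CharBuffer cb, StringBuilder sink) {
        cb.flip();
        sink.append(cb);
        cb.clear();
    }

    static String decodeAll(ReadableByteChannel inChannel, CharsetDecoder decoder) throws IOException {
        ByteBuffer bb = ByteBuffer.allocate(8);
        CharBuffer cb = CharBuffer.allocate(4);
        StringBuilder sink = new StringBuilder();
        bb.flip();                       // start with an empty buffer in "read" mode
        boolean eof = false;
        decoder.reset();
        while (true) {
            CoderResult result = decoder.decode(bb, cb, eof);
            if (result.isError()) result.throwException();  // malformed or unmappable input
            if (result.isOverflow()) {   // output full: drain it and decode again
                drainBuf(cb, sink);
                continue;
            }
            if (eof) break;              // underflow after end of input: decoding is done
            bb.compact();                // keep any undecoded trailing bytes
            eof = (inChannel.read(bb) == -1);
            bb.flip();
        }
        while (decoder.flush(cb).isOverflow()) drainBuf(cb, sink); // flush may overflow too
        drainBuf(cb, sink);
        return sink.toString();
    }

    /** Convenience wrapper for in-memory data; wraps the checked IOException. */
    static String decodeBytes(byte[] data, String charsetName) {
        ReadableByteChannel ch = Channels.newChannel(new ByteArrayInputStream(data));
        try {
            return decodeAll(ch, Charset.forName(charsetName).newDecoder());
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "na\u00efve caf\u00e9".getBytes(Charset.forName("UTF-8"));
        System.out.println(decodeBytes(data, "UTF-8"));
    }
}
```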
20 Autodetecting Charsets
- Special case of decoder implementation
- Inspects encoded text and determines the encoding employed.
- Typically such autodetecting charsets will be asymmetric and will require a decoder with no associated encoder implementation.
- java.nio.charset.CharsetDecoder provides
- boolean isAutoDetecting()
- boolean isCharsetDetected()
- Charset detectedCharset()
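A caller can guard use of the detection methods as sketched below. In the JDK the third method is named detectedCharset() and returns a Charset; both it and isCharsetDetected() throw UnsupportedOperationException on decoders that are not auto-detecting, hence the check. AutoDetectProbe and describe() are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

public class AutoDetectProbe {
    /** Summarises what a decoder can report about charset detection. */
    static String describe(CharsetDecoder decoder) {
        if (!decoder.isAutoDetecting()) {
            // Asking an ordinary decoder for a detected charset would throw.
            return decoder.charset().name() + ": not auto-detecting";
        }
        if (decoder.isCharsetDetected()) {
            return "detected " + decoder.detectedCharset().name();
        }
        return decoder.charset().name() + ": auto-detecting, nothing detected yet";
    }

    public static void main(String[] args) {
        System.out.println(describe(Charset.forName("UTF-8").newDecoder()));
    }
}
```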
21 Error Handling: Exceptions
- The propagation of exceptions by coders when they encounter unexpected input is under the programmer's control
- Achievable via default and overridable actions defined within the class java.nio.charset.CodingErrorAction
- Overridable action directives are encapsulated within
- CodingErrorAction.REPLACE
- CodingErrorAction.IGNORE
- CodingErrorAction.REPORT
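The effect of the three actions can be compared on the same malformed input. In this sketch the byte 0xFF (never valid in UTF-8) is the bad input; ErrorActions and decodeWith() are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class ErrorActions {
    /** Decodes UTF-8 bytes with the given action applied to bad input. */
    static String decodeWith(byte[] bytes, CodingErrorAction action) {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Only REPORT leads here: the coder raises an exception instead of fixing up the output.
            return "error: " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        byte[] bad = {'a', (byte) 0xFF, 'b'};
        System.out.println(decodeWith(bad, CodingErrorAction.REPLACE)); // a, U+FFFD, b
        System.out.println(decodeWith(bad, CodingErrorAction.IGNORE));  // ab
        System.out.println(decodeWith(bad, CodingErrorAction.REPORT));  // error: MalformedInputException
    }
}
```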
22 Internals of a Decoder
- Subclass java.nio.charset.CharsetDecoder directly if the decoder has a decoding algorithm or properties perceived to be generally unique.
- Employ OOA best practices to determine inheritance / re-use opportunities.
- Repeated decoding logic (similarly for encoders) can often be usefully refactored into a common abstract base decoder class.
- Bulk of decoder implementation resides within the method CoderResult decodeLoop(ByteBuffer src, CharBuffer dest)
23 Internals of a decoder (continued)
- The decodeLoop() method inspects each byte within the input ByteBuffer, calculates the appropriate output char (or chars) and prepares to place them in the receiving CharBuffer.
- When the input buffer is depleted (no more bytes), the decodeLoop() method should ordinarily return CoderResult.UNDERFLOW
24 Internals of a decoder (continued)
- Illegal input bytes or sequences need to be flagged by returning the result of invoking CoderResult.malformedForLength(n)
- Malformed input can be dealt with by performing silent substitution of replacement chars in the decoded output buffer. It is possible to override the action via the constants defined in java.nio.charset.CodingErrorAction
25 Internals of a decoder (continued)
- The receiving CharBuffer should be checked prior to each put() to determine whether there is sufficient space.
- If the decoder requires a larger output buffer, return CoderResult.OVERFLOW
26 Internals of a decoder (continued)
- The decodeLoop() method can flag overflow by returning java.nio.charset.CoderResult.OVERFLOW
- Overflow is handled by the calling code.
- Drain the output CharBuffer and reset its position before re-invoking the decoder
- implFlush(CharBuffer out): a provided API hook which permits decoders (especially those which maintain state) to flush pending char output.
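The decoder rules from the preceding slides (UNDERFLOW when input is exhausted, OVERFLOW when the output is full, malformedForLength(n) for illegal bytes) can be sketched with a toy decoder. Everything here is hypothetical: the charset name X-TOY-7BIT, the class ToySevenBit and its mapping (bytes 0x00 to 0x7F map to the same char, anything else is malformed). It is decode-only, so it keeps no state and needs no implFlush() override.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

public class ToySevenBit extends Charset {
    public ToySevenBit() { super("X-TOY-7BIT", null); } // x- prefix: not IANA-registered

    @Override public boolean contains(Charset cs) { return cs instanceof ToySevenBit; }
    @Override public boolean canEncode() { return false; } // asymmetric: decoder only
    @Override public CharsetEncoder newEncoder() {
        throw new UnsupportedOperationException("decode-only sketch");
    }
    @Override public CharsetDecoder newDecoder() { return new Decoder(this); }

    static final class Decoder extends CharsetDecoder {
        Decoder(Charset cs) { super(cs, 1.0f, 1.0f); } // one char per byte, average and max

        @Override
        protected CoderResult decodeLoop(ByteBuffer src, CharBuffer dest) {
            while (src.hasRemaining()) {
                byte b = src.get(src.position());  // absolute peek: do not consume yet
                if (b < 0) {
                    // High bit set: leave the position at the bad byte and report it.
                    return CoderResult.malformedForLength(1);
                }
                if (!dest.hasRemaining()) {
                    return CoderResult.OVERFLOW;   // caller must drain dest and call again
                }
                src.get();                         // consume the byte
                dest.put((char) b);
            }
            return CoderResult.UNDERFLOW;          // input depleted
        }
    }

    /** Decodes bytes with the toy decoder, applying the given action to malformed input. */
    static String decode(byte[] bytes, CodingErrorAction onMalformed) {
        CharsetDecoder d = new ToySevenBit().newDecoder().onMalformedInput(onMalformed);
        try {
            return d.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            return "error: " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(decode(new byte[] {72, 105}, CodingErrorAction.REPORT));
        System.out.println(decode(new byte[] {72, (byte) 0x80, 105}, CodingErrorAction.REPORT));
    }
}
```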
28 CharsetEncoder convenience methods
- java.nio.charset.CharsetEncoder methods
- boolean canEncode(char c)
- boolean canEncode(CharSequence cseq): tests encodability of a char or CharSequence.
- float maxBytesPerChar(): used primarily by users/clients of the API to adequately size output buffers
- float averageBytesPerChar(): can be used by coder API library clients to perform economical buffer sizing.
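These convenience methods might be used as sketched below to test encodability and to size an output buffer. EncoderSizing and its methods are hypothetical names; the worst-case size is the character count multiplied by maxBytesPerChar(), rounded up.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class EncoderSizing {
    /** True if every char of the text can be encoded in the named charset. */
    static boolean encodable(String charsetName, CharSequence text) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    /** Output buffer size guaranteed to hold charCount encoded chars. */
    static int worstCaseBytes(String charsetName, int charCount) {
        CharsetEncoder enc = Charset.forName(charsetName).newEncoder();
        return (int) Math.ceil(charCount * (double) enc.maxBytesPerChar());
    }

    /** Cheaper initial size estimate; the buffer may need to grow on overflow. */
    static int typicalBytes(String charsetName, int charCount) {
        CharsetEncoder enc = Charset.forName(charsetName).newEncoder();
        return (int) Math.ceil(charCount * (double) enc.averageBytesPerChar());
    }

    public static void main(String[] args) {
        System.out.println(encodable("US-ASCII", "plain"));
        System.out.println(encodable("US-ASCII", "caf\u00e9"));
        System.out.println(worstCaseBytes("UTF-8", 100) + " vs " + typicalBytes("UTF-8", 100));
    }
}
```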
29 Internals of an encoder
- Principal encoder entry point is CoderResult encodeLoop(CharBuffer src, ByteBuffer dest)
- Read and inspect each input character using relative CharBuffer get() method calls.
- Example: for single-char-at-a-time reads, char c = src.get()
- Use src.hasRemaining() or src.remaining() to determine the encoder loop termination condition, i.e. no more input chars within the current encode buffer.
30 Internals of an encoder (continued)
- A well-behaved coder implementation will always terminate its encodeLoop() method, once its input is exhausted, by returning the class constant java.nio.charset.CoderResult.UNDERFLOW
- Code which invokes the encoder will then arrange to refill the input buffer and supply the encoder with the remaining chars to encode within the next input buffer's payload.
31 Internals of an encoder
- Determine whether the input character is mappable to the repertoire of the target encoding or not
- Unmappable chars found in the input are flagged by returning CoderResult.unmappableForLength(int n) (the value n can conceivably exceed 1)
- dest.hasRemaining() or dest.remaining() should be queried before each put() of output bytes to guard against output buffer overflow.
- encodeLoop(src, dest) will need to return CoderResult.OVERFLOW when the output buffer is undersized to receive the encoded output bytes.
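The encoder rules above can be sketched with a toy encoder. All names are hypothetical (charset X-TOY-ENC, class ToyAsciiEncoderCharset): chars up to 0x7F map to one byte, everything else is unmappable, and a surrogate pair is reported with length 2 to show n exceeding 1. Because this sketch has no decoder, isLegalReplacement() is overridden; the default implementation would call newDecoder(). Lone surrogates are simply treated as unmappable here, which a production encoder would handle more carefully.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.util.Arrays;

public class ToyAsciiEncoderCharset extends Charset {
    public ToyAsciiEncoderCharset() { super("X-TOY-ENC", null); }

    @Override public boolean contains(Charset cs) { return cs instanceof ToyAsciiEncoderCharset; }
    @Override public CharsetDecoder newDecoder() {
        throw new UnsupportedOperationException("encode-only sketch");
    }
    @Override public CharsetEncoder newEncoder() { return new Encoder(this); }

    static final class Encoder extends CharsetEncoder {
        Encoder(Charset cs) { super(cs, 1.0f, 1.0f); } // one byte per char; replacement is '?'

        // Assumption of this sketch: any single 7-bit byte is an acceptable replacement.
        @Override public boolean isLegalReplacement(byte[] repl) {
            return repl.length == 1 && repl[0] >= 0;
        }

        @Override
        protected CoderResult encodeLoop(CharBuffer src, ByteBuffer dest) {
            while (src.hasRemaining()) {
                char c = src.get(src.position());   // absolute peek: do not consume yet
                if (c > 0x7F) {
                    // A surrogate pair is one character occupying two chars of input.
                    boolean pair = Character.isHighSurrogate(c) && src.remaining() >= 2
                            && Character.isLowSurrogate(src.get(src.position() + 1));
                    return CoderResult.unmappableForLength(pair ? 2 : 1);
                }
                if (!dest.hasRemaining()) {
                    return CoderResult.OVERFLOW;    // output buffer too small
                }
                src.get();                          // consume the char
                dest.put((byte) c);
            }
            return CoderResult.UNDERFLOW;           // no more input chars
        }
    }

    /** Encodes text and renders the bytes, applying the given action to unmappable chars. */
    static String encode(String text, CodingErrorAction onUnmappable) {
        CharsetEncoder e = new ToyAsciiEncoderCharset().newEncoder().onUnmappableCharacter(onUnmappable);
        try {
            ByteBuffer bb = e.encode(CharBuffer.wrap(text));
            byte[] out = new byte[bb.remaining()];
            bb.get(out);
            return Arrays.toString(out);
        } catch (CharacterCodingException ex) {
            return "error: " + ex.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(encode("Hi", CodingErrorAction.REPORT));
        System.out.println(encode("a\ud83d\ude00b", CodingErrorAction.REPLACE));
    }
}
```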
32 Error Conditions: Exceptions
- The propagation of exceptions by coders when they encounter unexpected input is under the programmer's control via default and overridable actions defined within the class java.nio.charset.CodingErrorAction
- Overridable action directives are encapsulated within
- CodingErrorAction.REPLACE
- CodingErrorAction.IGNORE
- CodingErrorAction.REPORT
33 Writing a user installed provider
- Writing your own provider from scratch
- Subclass java.nio.charset.spi.CharsetProvider
- java.nio.charset.spi.CharsetProvider methods which you will need to override
- Iterator charsets()
- Charset charsetForName(String charsetName)
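A provider overriding these two methods could be sketched as follows. All names are hypothetical (ToyCharsetProvider, the charset X-TOY and its alias toy), and the toy charset simply delegates its coders to US-ASCII as a placeholder for real encoder and decoder classes.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.spi.CharsetProvider;
import java.util.Collections;
import java.util.Iterator;

public class ToyCharsetProvider extends CharsetProvider {

    /** Placeholder charset: a real one would supply its own coder classes. */
    static final class ToyCharset extends Charset {
        private static final Charset DELEGATE = Charset.forName("US-ASCII");
        ToyCharset() { super("X-TOY", new String[] {"toy"}); }
        @Override public boolean contains(Charset cs) { return cs instanceof ToyCharset; }
        @Override public CharsetDecoder newDecoder() { return DELEGATE.newDecoder(); }
        @Override public CharsetEncoder newEncoder() { return DELEGATE.newEncoder(); }
    }

    private final Charset toy = new ToyCharset();

    // The service lookup instantiates the provider through a public no-argument constructor.
    public ToyCharsetProvider() { }

    /** Iterates over every charset this provider supports. */
    @Override
    public Iterator<Charset> charsets() {
        return Collections.singletonList(toy).iterator();
    }

    /** Returns the charset for a canonical name or alias, or null if it is not ours. */
    @Override
    public Charset charsetForName(String charsetName) {
        if (toy.name().equalsIgnoreCase(charsetName)) return toy;
        for (String alias : toy.aliases()) {
            if (alias.equalsIgnoreCase(charsetName)) return toy;
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(new ToyCharsetProvider().charsetForName("toy"));
    }
}
```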
34 Writing a user installed provider
- The provider architecture requires that you place a specifically named file within the classpath, under the META-INF directory of the jar file containing the provider and compiled charset Java classes.
- The file needs to reside within the (relative) directory META-INF/services/
- The required filename is the name of the provider class: java.nio.charset.spi.CharsetProvider
- The contents of the file are the fully qualified class names of each provider bundled within the provider jar file.
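As an illustration of the layout described above, a provider jar might be arranged as follows. The jar name, the package com.example.charsets and the class ToyCharsetProvider are hypothetical; only the META-INF/services path and the configuration filename come from the provider architecture.

```text
toy-charsets.jar
    META-INF/services/java.nio.charset.spi.CharsetProvider
    com/example/charsets/ToyCharsetProvider.class
    com/example/charsets/ToyCharset.class

Contents of META-INF/services/java.nio.charset.spi.CharsetProvider:
    # one fully qualified provider class name per line
    com.example.charsets.ToyCharsetProvider
```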
35 Writing a user installed provider
- Provider lookup occurs via the current thread's context classloader
- You may put the provider jar within the applet/application classpath or within the J2SE extensions directory
36 Finally ....
- Programmers now have access to a rich and well-integrated (via java.nio) API with which to write and access charsets within J2SE.
- The SPI features provide an extremely useful way to extend the J2SE charset support at runtime.
- NB: Read the specification within the Javadocs to understand the full nuances of the API! http://java.sun.com/j2se/1.4/docs/guide/nio/index.html
- Going forward, J2SE charset support will come in the form of a core set of exclusively New I/O supported charsets and optionally downloadable New I/O capable charsets from Sun and 3rd parties.
37 Thanks!
- Many thanks for your kind patience today!
- A big thank you to my fellow team members within the Java Software Core Tools and Libraries team!
- Andrew Bennett, my manager.
- Mark Reinhold, JSR-51 spec lead, powerhouse behind New I/O.
- And to my fellow team colleagues Josh Bloch, Mike McCloskey, Iris Garcia, Neal Gafter, Joe Darcy.