Title: Internationalization i18n Localization l10n
1Internationalization (i18n) and Localization (l10n)
2Objectives
- The need to do it.
- The difficulties involved.
- How to do it with JSPs and Java.
- Database issues.
3Think Globally
- 92% of the world speaks little or no English
- 20 main Asian languages
- According to ethnologue.com
- 6809 living languages
- "Living" means a child is learning it as their only language
4Primary Languages Spoken
Rank  Language    Script      Speakers (millions)
1     Mandarin    Chinese     885
2     Hindi       Devanagari  375
3     Spanish     Latin       358
4     English     Latin       347
5     Arabic      Arabic      211
6     Bengali     Bengali     210
7     Portuguese  Latin       178
8     Russian     Cyrillic    165
9     Japanese    Japanese    125
10    German      Latin       100
5English usage is shrinking
Source: http://global-reach.biz/globstats/evol.htm
- English-speaking Internet users: 58% in 2000, < 35% in 2005
- Non-English online traffic: 40% in 2000, 70% in 2005
6Why i18n is hard
- Lots of different character sets
- A character set maps characters to numbers
- ASCII, first published in 1963 and revised in 1968, uses 7 bits
- That gives you 128 possible characters
- 26 lower case
- 26 upper case
- 10 numbers
- 10 symbols above the numbers
- 20 commonly used punctuation marks
- That leaves space for 36 more characters (128 - 92), which ASCII uses mostly for control codes
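
The idea that a character set maps characters to numbers can be checked directly in Java. The following is a minimal sketch; the class and method names (AsciiDemo, capacity, codeOf) are made up for illustration and are not part of the original slides.

```java
class AsciiDemo {
    /** Number of distinct values that fit in the given number of bits. */
    static int capacity(int bits) {
        return 1 << bits;
    }

    /** The number a character is mapped to (its code). */
    static int codeOf(char c) {
        return (int) c;
    }

    public static void main(String[] args) {
        System.out.println("7 bits hold " + capacity(7) + " characters");
        System.out.println("'A' maps to " + codeOf('A'));
        System.out.println("'a' maps to " + codeOf('a'));
        System.out.println("'0' maps to " + codeOf('0'));

        // The slide's tally: letters, digits, symbols, punctuation.
        int counted = 26 + 26 + 10 + 10 + 20;
        System.out.println("Left over: " + (capacity(7) - counted));
    }
}
```

For the characters shown, the ASCII codes and the Unicode codes Java reports are the same, because Unicode kept ASCII as its first 128 values.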
7Why i18n is hard
- Only Latin, English, Hawaiian, and Swahili fit into 7 bits
- The 8th bit was used for i18n, but no standards were set.
- Even for Western European Latin alphabets, there are many, many mappings
- ISO-8859-1 (based on old DEC VT220 terminals)
- IBM 850
- cp1252 (Microsoft's contribution)
- http://czyborra.com/charsets/ is a HUGE list of character sets
8Characters Need More Space
- Eight bits gives you room for 256 characters
- Still not enough.
- Unicode 3.2 has OVER 95,000 characters!!!
- !!! THAT'S A LOT OF CHARACTERS !!!
- Since 8 bits can only hold 256 characters, that led to... A LOT OF STANDARDS!!!!!
- There are so many standards, nobody can keep them straight
- IANA maintains the list of standard names
- http://www.iana.org/assignments/character-sets
9Unicode: The Solution (sort of)
- A consortium started in 1991 to come up with a standard character set
- Characters are stored in more than one byte.
- Two bytes give you 2^16 (65,536) characters, which covers most characters of the languages in current use; the rest need more than two bytes.
- Java stores characters in Unicode, which makes it very good for internationalization.
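
A short sketch of what "Java stores characters in Unicode" means in practice. It assumes nothing beyond the standard library; the class name JavaCharDemo and its helper methods are invented for this example. It also shows that a few characters added to Unicode later do not fit in a single two-byte char.

```java
class JavaCharDemo {
    /** Number of 16-bit char values Java uses to store the string. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Number of actual Unicode characters (code points) in the string. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println("Bits per Java char: " + Character.SIZE);
        System.out.println("Values in two bytes: " + (1 << 16));

        // Three Japanese characters, written as escapes so the source file stays ASCII.
        String nihongo = "\u65E5\u672C\u8A9E";
        System.out.println(codeUnits(nihongo) + " chars, " + codePoints(nihongo) + " code points");

        // U+1D11E (musical G clef) lies beyond the two-byte range and needs two chars.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(codeUnits(clef) + " chars, " + codePoints(clef) + " code point");
    }
}
```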
10Problems with Unicode
- Unicode is a character set. It simply attaches numbers to characters; it doesn't dictate how they're translated into bits.
- The simplest character encoding for Unicode is called UCS-2: just use two bytes for each character, so A is 0x00 0x41
- If you're not careful, you'll end up sending bytes that look like \0 (the string terminator) to UNIX, which messes it up.
- UTF-8 is an encoding of Unicode which works across all known platforms.
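
The difference between the two encodings can be seen by dumping the bytes. This is a sketch only: the class name EncodingDemo and its methods are hypothetical, and Java's UTF-16BE charset is used as a stand-in for UCS-2 (the two produce the same bytes for the characters shown here).

```java
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    /** Render bytes as space-separated hex, e.g. "0x00 0x41". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Two bytes per character (UTF-16BE standing in for UCS-2). */
    static String twoByte(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_16BE));
    }

    /** UTF-8: one byte for ASCII characters, more for everything else. */
    static String utf8(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println("A  two-byte: " + twoByte("A"));
        System.out.println("A  UTF-8:    " + utf8("A"));
        // U+00E9 is e with an acute accent.
        System.out.println("e' two-byte: " + twoByte("\u00E9"));
        System.out.println("e' UTF-8:    " + utf8("\u00E9"));
    }
}
```

Note that the UTF-8 form of A contains no zero byte, which is why it passes safely through software that treats \0 as the end of a string.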
11Where i18n happens
- In a database-driven website, you have to worry about i18n in 3 places:
- Your database has to store some kind of Unicode
- The web browser the person uses has to understand Unicode
- Your application has to be internationalized:
- It has to read Unicode from the database
- It has to know how to write it to the browser
12I18n in databases
- Unicode support in databases is fairly new.
- Recent versions of all major databases support Unicode in some way.
- Some, however, require you to use special data formats.
- Oracle, for example, will let you either declare your entire database as Unicode encoded, or add Unicode to non-Unicode encoded database tables using the NCHAR and NVARCHAR2 data types.
13Database Details
Database       UTF-8  Other Unicode?  Special treatment?
Oracle         Yes    Yes             Depends
Sybase         Yes    No              No
Postgres       Yes    No              No
MS SQL Server  No     Yes             Yes
MySQL          Yes    Sort of         No
14How Browsers Deal with Different Encodings
When a browser sends a request for a web page, it tells the web server what kind of encoding it understands:

    GET / HTTP/1.1
    HTTP_ACCEPT_LANGUAGE: en-us, hr
    HTTP_ACCEPT_CHARSET: ISO-8859-1, UTF-8

(These are the CGI-style names; on the wire the headers are called Accept-Language and Accept-Charset.)
This says that the browser prefers documents to be sent to it in the ISO-8859-1 (Latin) encoding, but will also take UTF-8.
Sadly, Internet Explorer doesn't send ACCEPT_CHARSET. As of MSIE 6, it understands UTF-8.
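
On the server side, a language preference list like the one above can be matched against the languages an application actually supports. The sketch below uses the standard java.util.Locale matching API; the class name AcceptLanguageDemo, the choose method, and the lists of supported languages are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class AcceptLanguageDemo {
    /**
     * Pick the best supported locale for an Accept-Language value,
     * or return the fallback when nothing matches.
     */
    static Locale choose(String acceptLanguage, List<Locale> supported, Locale fallback) {
        List<Locale.LanguageRange> ranges = Locale.LanguageRange.parse(acceptLanguage);
        Locale match = Locale.lookup(ranges, supported);
        return match != null ? match : fallback;
    }

    public static void main(String[] args) {
        Locale croatian = Locale.forLanguageTag("hr");

        // The site has English and Croatian text: "en-us" falls back to plain "en".
        System.out.println(choose("en-us, hr",
                Arrays.asList(Locale.ENGLISH, croatian), Locale.ENGLISH).toLanguageTag());

        // The site has only French and Croatian text: the second preference wins.
        System.out.println(choose("en-us, hr",
                Arrays.asList(Locale.FRENCH, croatian), Locale.FRENCH).toLanguageTag());
    }
}
```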
15How Browsers Deal with Different Encodings
When a server gets a request, it retrieves pages, runs programs, then responds. Unless otherwise specified, it uses the same encoding as the request. If no charset was sent with the request, it uses a default. Here's a typical HTTPD response:

    HTTP/1.1 200 OK
    Content-Type: text/html; charset=ISO-8859-1

    <html><head><title>hi!</title></head> ...
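
Why the charset in the response matters can be shown by encoding text one way and decoding it another. This is an illustrative sketch, not server code: the class CharsetMismatchDemo and the method sendAndRead are hypothetical and simply simulate a server writing bytes and a browser reading them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetMismatchDemo {
    /**
     * Encode text the way the server would, then decode the bytes
     * the way the browser believes they were encoded.
     */
    static String sendAndRead(String text, Charset serverCharset, Charset browserCharset) {
        byte[] wire = text.getBytes(serverCharset);
        return new String(wire, browserCharset);
    }

    public static void main(String[] args) {
        String cafe = "caf\u00E9"; // "cafe" with an acute accent on the e

        // Both sides agree on UTF-8: the text survives.
        String ok = sendAndRead(cafe, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println("Round trip intact: " + ok.equals(cafe));

        // Server sends UTF-8 but the browser assumes ISO-8859-1:
        // the two bytes of the accented e are read as two separate characters.
        String garbled = sendAndRead(cafe, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println("Length sent: " + cafe.length() + ", length read: " + garbled.length());
    }
}
```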
16Setting the Character Set of a JSP Response
    <%@ page contentType="text/html;charset=UTF-8" %>
sets the character set to UTF-8. The default is ISO-8859-1.
Some HTML pages have this sort of thing:
    <META http-equiv="Content-type" content="text/html;charset=UTF-8">
It gets overridden by the contentType set by the page directive.
    <?xml version="1.0" encoding="UTF-8"?>
also gets overridden.
17Internationalization and Localization in your Application
In Java, different languages are handled by different resource files. For each language (locale) you have, a separate resource file contains the text.

Greetings.properties:
    greetings=Hello
    farewell=Goodbye
    inquiry=How are you?

Greetings_fr.properties:
    greetings=Bonjour
    farewell=Au revoir
    inquiry=Comment allez-vous?

These files go in your classes directory (under WEB-INF). Your code will determine the locale and use the appropriate file for the text. Each file is called a resource bundle. Files which contain the same messages share a common basename.
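
The lookup that Java performs over these files can be sketched with java.util.ResourceBundle. In a real application the .properties files sit on the classpath and a plain ResourceBundle.getBundle("Greetings", locale) is enough. To keep this example self-contained, it assumes an in-memory copy of the two files and a custom ResourceBundle.Control (IN_MEMORY) that reads from it; those names, and the class GreetingsDemo, are inventions for this sketch.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Locale;
import java.util.Map;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

class GreetingsDemo {
    // In-memory stand-ins for Greetings.properties and Greetings_fr.properties.
    static final Map<String, String> FILES = Map.of(
            "Greetings",
            "greetings=Hello\nfarewell=Goodbye\ninquiry=How are you?\n",
            "Greetings_fr",
            "greetings=Bonjour\nfarewell=Au revoir\ninquiry=Comment allez-vous?\n");

    // Loads bundles from FILES instead of the classpath (assumption for this sketch).
    static final ResourceBundle.Control IN_MEMORY = new ResourceBundle.Control() {
        @Override
        public ResourceBundle newBundle(String baseName, Locale locale, String format,
                                        ClassLoader loader, boolean reload) throws IOException {
            // toBundleName turns ("Greetings", fr) into "Greetings_fr".
            String text = FILES.get(toBundleName(baseName, locale));
            return text == null ? null : new PropertyResourceBundle(new StringReader(text));
        }

        @Override
        public Locale getFallbackLocale(String baseName, Locale locale) {
            // Skip the JVM default locale so results do not depend on the machine.
            return null;
        }
    };

    /** Look up a message for the given locale, falling back to the base file. */
    static String message(String key, Locale locale) {
        return ResourceBundle.getBundle("Greetings", locale, IN_MEMORY).getString(key);
    }

    public static void main(String[] args) {
        System.out.println(message("greetings", Locale.FRENCH));        // from Greetings_fr
        System.out.println(message("farewell", Locale.CANADA_FRENCH));  // fr_CA falls back to fr
        System.out.println(message("greetings", Locale.GERMAN));        // no German file: base file
    }
}
```

The fallback order (fr_CA, then fr, then the base file) is what lets files with the same basename cover locales that have no file of their own.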
18Defining the Locale
A locale defines a country and language. In JSTL, the locale can be set in two ways:
1. By hand:
    <fmt:setLocale value="fr_CA, fr_FR" />
This says you'd prefer Canadian French, but will accept French French if there's no resource for Canadian French. Also good, if you have a bean called myBean with a property locale:
    <fmt:setLocale value="${myBean.locale}" />
2. You can let JSTL set the locale based on the browser's HTTP_ACCEPT_LANGUAGE.
Don't forget to include the fmt tags:
    <%@ taglib prefix="fmt" uri="http://java.sun.com/jstl/fmt" %>
19Using the Locale and Resources
Once the locale is set (either automatically or by hand) use the <fmt:bundle> and <fmt:message> tags to get the internationalized message:
    <fmt:bundle basename="greetings">
    Hello <fmt:message key="hello" />
    Goodbye <fmt:message key="goodbye" />
    </fmt:bundle>
The fmt tags will internationalize dates too:
    <fmt:formatDate value="${myBean.date}" dateStyle="long" />
If the locale is en_US, the date will be month/day/year. If it's en_GB, the date will be day/month/year.
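
The same locale-dependent ordering can be reproduced in plain Java. This sketch uses java.time rather than the JSTL tag; the class name DateFormatDemo and the sample date are assumptions, and the exact wording of the output depends on the locale data shipped with the JDK.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.FormatStyle;
import java.util.Locale;

class DateFormatDemo {
    /** Format a date in the "long" style for the given locale. */
    static String longDate(LocalDate date, Locale locale) {
        return DateTimeFormatter.ofLocalizedDate(FormatStyle.LONG)
                .withLocale(locale)
                .format(date);
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2004, 1, 5);
        // en_US puts the month before the day; en_GB puts the day first.
        System.out.println("en_US: " + longDate(date, Locale.US));
        System.out.println("en_GB: " + longDate(date, Locale.UK));
    }
}
```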
20Setting a Default Locale for Your Application
In the web.xml file of your application:
    <context-param>
      <param-name>javax.servlet.jsp.jstl.fmt.fallbackLocale</param-name>
      <param-value>en</param-value>
    </context-param>
21Resources
All the language codes: http://ftp.ics.uci.edu/pub/ietf/http/related/iso639.txt
All the country codes: http://www.davros.org/misc/iso3166.html
The Unicode site: http://www.unicode.org/