Title: An Introduction to W3C
1 Web Internationalization Initiative
Manoj Jain
Department of Information Technology
Ministry of Communication and IT
Government of India
August 3, 2006
2 W3C and DIT
The Department of Information Technology (DIT) became a member of the World Wide Web Consortium (W3C) to ensure adequate representation of Indian languages/scripts in the various web technology standards being evolved by W3C.
3 The project: Web Internationalization Initiative
With the above objective, DIT initiated the project "Web Internationalization Initiative" for Indian languages. The C-DAC regional units (Pune, Noida, Kolkata and Trivandrum) and the Industry Consortium for Language Technologies (CoILTech-MAIT) participate in various activity working groups to evolve specifications, guidelines and test suites, and to develop translations and interoperable technologies for their assigned cluster of languages. They also organize sensitization workshops in their regions and promote the participation of local industries.
4 Indian Languages/Scripts
- There are 22 constitutionally recognized languages in India. Apart from these 22 languages, many more dialects are spoken in various regions of the country.
- These 22 languages use 12 different scripts. Several languages share a single script, e.g. Hindi, Sanskrit, Marathi, Konkani, Sindhi, Maithili, Nepali and Dogri are written in the Devanagari script.
- Some languages are written in more than one script, such as Urdu, Sindhi, Manipuri and Santhali.
5 WII Project: Implementing agencies and their assigned languages
6 Web Internationalization Initiative
- These centres are to participate in various W3C activities.
- The present focus of the project is participation in Internationalization/Localization related activities.
- Encoding issues with respect to these languages are being addressed in the Unicode forum.
7 Web Internationalization Initiative
- CDAC, Pune is working on the Tag Set.
- CDAC, Noida has initiated a web-based Discussion Board to build consensus among the experts.
- CDAC, Kolkata has proposed a three-tier linguistic markup, which will help in the translation of tags. It is under discussion at the W3C forum.
- CDAC, Trivandrum is participating in Device Independence and XML related activities.
8 Web Internationalization Initiative
- MAIT-CoILTech has been assigned the responsibility of interacting with the Indian IT industry to get feedback on various issues and sample implementations.
- This also includes interaction with the makers of browsers and other web tools to ensure adequate support for Indian languages in these tools and applications.
9 Localization Standards
- Encoding Standards
- Input Standards/Keyboard Managers
- Fonts and Rendering
- Locale Data
- Database Storage and Retrieval
10 Internationalization/Localization: Some important issues
- Content language: It is very important to declare the language of the content so that it can easily be searched, rendered and displayed.
- Presentation of the content: The content should be presented in a way that reflects the cultural and traditional values of the region.
- Images, animation and examples in the content: Using regional images, animations and examples makes the content more viewer/user friendly. An internationalized product should be able to handle this aspect.
11 ... Internationalization/Localization: Some important issues
- Forms: Databases and scripts that receive data from FORMs on pages in multiple languages must be able to support the characters of all those languages simultaneously.
- This is very relevant to the e-Gov applications being developed for Indian languages.
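As a rough illustration of the Forms point, the sketch below round-trips form fields in three Indian scripts through URL-encoded form data. It assumes the page and the receiving script both use UTF-8 (the slides do not name an encoding), and the field names and sample values are hypothetical.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical form fields holding text in three different Indian scripts.
form = {
    "name_hi": "भारत",      # Devanagari
    "name_ta": "இந்தியா",   # Tamil
    "name_bn": "ভারত",      # Bengali
}

# Encode as application/x-www-form-urlencoded, using UTF-8 for every field.
body = urlencode(form, encoding="utf-8")

# The receiving script decodes with the same encoding and recovers all fields.
decoded = {key: values[0] for key, values in parse_qs(body, encoding="utf-8").items()}
```

Because a single Unicode encoding covers every script, one script and one database column type can accept all of these fields at the same time.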
12 WII Project: Tasks undertaken
- Character Encoding Issues
- Locale Specific Data
- Text Formatting Issues
- Font Rendering Issues
- Indian Language Tag Set
- Inputs for Mobile Web Initiative
- RFC-3066 for Identification of Languages
- Feedback on RFC-3490 (Internationalizing Domain Names in Applications (IDNA))
- RFC-3491, RFC-3492 and RFC-3987 (Stringprep profile, Punycode, and handling of paths for Internationalized Domain Names (IDN))
- Reference Implementations of the draft standard
- Speech Synthesis Grammar
13 General Formatting Issues
- Absolute/relative positioning, Layering, and Transparency
- Copyfitting
- Cropping and Scaling of Images
- Hyphenation
- Non-rectangular Areas
14 Text Formatting: Indic specific issues
- Alignment of scripts and baseline shifts
- Support for automatic alignment of text from multiple scripts with different alignment rules. Ability to handle subscripts and superscripts.
- Justification/Word and Letter Spacing
- Justification/spacing policy controls.
- Sorting/Collating/Data processing
- Support for sorting and collating data (for example in index entries, but more generally wherever it is required for proper presentation). Support for other sorts of data-processing functions may be required as well.
15 ... Text Formatting: Indic specific issues
- Fonts
- Indic languages are script-based languages. Some other issues in formatting a document in these languages are:
- Prefix, suffix, and stand-alone glyph variants
- No hyphenation (?)
- Justification (how it is accomplished through the stretching of letters or syllables)
- Vowel relocation and/or resequencing
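The vowel relocation issue can be seen in how a single Devanagari syllable is stored. The sketch below, using only Python's standard `unicodedata` module, lists the stored (logical) order of the syllable "ki"; the choice of syllable is just an illustrative assumption.

```python
import unicodedata

# The syllable "ki" in Devanagari: consonant KA followed by vowel sign I.
syllable = "\u0915\u093f"

# Logical (stored) order: the consonant comes first, the vowel sign second.
names = [unicodedata.name(ch) for ch in syllable]

for ch, name in zip(syllable, names):
    print(f"U+{ord(ch):04X} {name}")
```

Although the vowel sign is stored after the consonant, a rendering engine draws it to the left of the consonant, which is why fonts and layout software need script-specific reordering rules.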
16 ISO 639-1 and ISO 639-2: Codes for the Representation of Names of Languages
- ISO 639-1 and ISO 639-2 are two-letter and three-letter codes for the representation of names of languages.
- ISO 639-1 is a two-letter code
- For example, hi for Hindi and kn for Kannada
- ISO 639-2 is a three-letter code
- For example, mar for Marathi and san for Sanskrit
- A few more Indian languages still need to be assigned codes, such as Bodo, Apbhransh and Bundelkhandi.
17 Language Tags: RFC 3066bis
- Language tags are used to help identify languages, whether spoken, written, signed or otherwise signaled, for the purpose of communication.
- Applications, protocols or specifications that use language tags are often faced with the problem of identifying sets of content that share certain language attributes.
- A language tag consists of a primary language subtag and a series of subsequent subtags, each of which refines or narrows the range of language identified by the overall tag.
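The subtag structure can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the full RFC 3066bis grammar: the hypothetical helper `parse_language_tag` only recognizes a primary language, an optional four-letter script and an optional region, and leaves anything else unparsed.

```python
def parse_language_tag(tag: str) -> dict:
    """Split a simple language[-script][-region] tag into named subtags."""
    subtags = tag.split("-")
    parsed = {"language": subtags[0].lower(), "script": None, "region": None, "other": []}
    rest = subtags[1:]
    # A 4-letter subtag right after the language is treated as a script code.
    if rest and len(rest[0]) == 4 and rest[0].isalpha():
        parsed["script"] = rest.pop(0).title()
    # A 2-letter or 3-digit subtag that comes next is treated as a region code.
    if rest and ((len(rest[0]) == 2 and rest[0].isalpha())
                 or (len(rest[0]) == 3 and rest[0].isdigit())):
        parsed["region"] = rest.pop(0).upper()
    # Anything left (variants, extensions, private use) is kept unparsed.
    parsed["other"] = rest
    return parsed


# Sindhi written in Devanagari, as used in India.
sindhi = parse_language_tag("sd-deva-in")
```

A script subtag is what lets a tag distinguish the different written forms of a language that uses more than one script.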
18 Internationalization Tag Set (ITS)
- This is a set of elements and attributes that can be used with Document Type Definitions (DTDs)/Schemas to support internationalization/localization.
19 IRI and URI
- IRI and URI are important activities towards internationalization/localization of the web. The e-infrastructure division of DIT is working towards the internationalization of domain names.
- CDAC centres under the WII project are to provide feedback on various RFCs and related specifications (IETF, IDNA, IANA, etc.), so that these recommendations adequately support Indian languages.
20 RFC 3987: Internationalized Resource Identifiers
- A Uniform Resource Identifier (URI) is a sequence of characters chosen from a limited subset of the repertoire of US-ASCII characters.
- RFC 3987 defines a new protocol element called the Internationalized Resource Identifier (IRI) by extending the syntax of URIs to a much wider repertoire of characters, covering all the written scripts of the world.
- Indian scripts are complex in nature. IRIs may be studied from the perspective of Indian languages.
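To show how an IRI relates to a URI, the sketch below converts a non-ASCII path segment into its URI form by encoding it as UTF-8 and percent-encoding the bytes, which is the core of the IRI-to-URI mapping. The helper name `iri_path_to_uri_path` is hypothetical, and the sketch covers only the path; host names are handled separately through IDNA.

```python
from urllib.parse import quote, unquote


def iri_path_to_uri_path(path: str) -> str:
    # Encode as UTF-8, then percent-encode everything except ASCII letters,
    # digits, "_.-~" and the "/" separator.
    return quote(path, safe="/", encoding="utf-8")


# A one-letter Devanagari path segment (DEVANAGARI LETTER KA, U+0915).
uri_path = iri_path_to_uri_path("/क")      # "/%E0%A4%95"
round_trip = unquote(uri_path)              # back to "/क"
```

The URI form contains only US-ASCII characters, so it can pass through software that does not understand IRIs.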
21 RFC 3491
- RFC 3491 (Nameprep) specifies processing rules for preparing internationalized domain name (IDN) labels, so that users can enter IDNs into applications.
22 RFC 3454
- RFC 3454 (Stringprep) specifies a framework of processing rules for Unicode text. This RFC mainly relates to Internationalized Domain Names.
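Python's standard `stringprep` module exposes the RFC 3454 tables, which allows a small sketch of this kind of processing. The hypothetical helper `nameprep_map` below performs only the mapping and normalization steps that Nameprep (RFC 3491) builds on these tables; the prohibited-character and bidirectional checks are left out, so this is not a complete implementation.

```python
import stringprep
import unicodedata


def nameprep_map(label: str) -> str:
    """Mapping and normalization steps only (no prohibition or bidi checks)."""
    mapped = []
    for ch in label:
        # Table B.1: characters that are "mapped to nothing" (simply dropped).
        if stringprep.in_table_b1(ch):
            continue
        # Table B.2: case folding for use with NFKC normalization.
        mapped.append(stringprep.map_table_b2(ch))
    return unicodedata.normalize("NFKC", "".join(mapped))


# KA + VIRAMA + ZERO WIDTH JOINER + SSA: the joiner is dropped by table B.1.
prepared = nameprep_map("\u0915\u094d\u200d\u0937")
```

The zero-width joiner and non-joiner are used in Indic scripts to control how conjuncts are displayed, so their removal by table B.1 is one reason these rules deserve study from an Indian languages perspective.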
23 RFC 3492: Punycode Encoding of Unicode for IDNA
- Punycode is a transfer encoding syntax designed for use with Internationalized Domain Names in Applications (IDNA). It uniquely and reversibly transforms a Unicode string into an ASCII string.
- This is important for the implementation of IDN in non-Latin scripts/languages such as Indian languages.
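The reversible transformation can be tried with Python's built-in `punycode` codec. The sketch below encodes a sample Devanagari label (the label itself is an illustrative choice) and decodes it back; a full IDNA conversion would also apply Nameprep first, which is omitted here.

```python
# A sample Devanagari label, chosen only for illustration.
label = "भारत"

# Unicode string -> ASCII-only bytes, using the built-in Punycode codec.
ace = label.encode("punycode")

# The transformation is reversible: decoding restores the original string.
restored = ace.decode("punycode")

# In a domain name, IDNA (RFC 3490) marks such a label with the "xn--" prefix.
ace_label = "xn--" + ace.decode("ascii")
```

Because the encoded label uses only ASCII letters, digits and hyphens, it can be carried by the existing DNS infrastructure unchanged.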
24 Numeric Character References (NCRs)
- Escapes such as NCRs and entities are ways of representing any Unicode character in markup using only ASCII characters.
- For example, the character á can be written in X/HTML as &#xE1; or &#225; or &aacute;.
- These are useful for clearly representing ambiguous or invisible characters, and for preventing problems with syntax characters such as ampersands and angle brackets. NCRs can also be used for characters that the document's encoding does not support.
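The three escape forms can be produced and decoded with Python's standard library, as in the sketch below. The variable names are illustrative only.

```python
import html

char = "\u00e1"  # LATIN SMALL LETTER A WITH ACUTE

# Decimal NCR: let the codec replace anything outside ASCII with &#NNN;.
decimal_ncr = char.encode("ascii", "xmlcharrefreplace").decode("ascii")

# Hexadecimal NCR, built from the code point.
hex_ncr = "&#x{:X};".format(ord(char))

# All three escape forms decode back to the same character.
decoded = [html.unescape(s) for s in ("&#xE1;", "&#225;", "&aacute;")]
```

The same `xmlcharrefreplace` approach works for Indian-language text when a page has to be served in an encoding that cannot represent it directly.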
25 Mobile Web Initiative
- W3C Group on Mobile Web Initiative
- In India, many people have started using mobile devices to access the web.
- Standard keyboard layouts for inputting content in various Indian languages on mobile devices are being evolved.
26 Voice Browser and SSML
- Voice Browser for Indian languages.
- Speech Synthesis Markup Language (SSML) to ensure representation of Indian languages.
27 Reference Implementations of the draft standard
- The project "Web Internationalization Initiative" envisages implementation of the draft W3C standards for Indian languages/scripts.
28 Other issues
- Display and Font Rendering Issues
- Keyboard Issues
- Transliteration Issues
29
- Thank you
30 Upcoming event
- Bangalore being the major IT hub in India, a workshop on Internationalization/Localization is also planned for August 24-25, 2006 in Bangalore. More details at www.mait.com