Title: Character Matters
1WELCOME
2Character Matters
- XML 2004, Marriott Wardman Park Hotel, Washington
DC
Diederik Gerth van Wijk, Content Office
2004-11-16
3Overview
- What is a character
- XML versus SGML, Unicode versus System Data entities
- Character overload: the need for restriction
- Character underload: the need for extension
- How to validate the restriction with DSDL
- How to validate an extension
- The need for Bottom Up Constraint Languages
- PCDATA considered harmful
4What is a character
- XML 1.0, 3rd ed.: [Definition: A character is an atomic unit of text as specified by ISO/IEC 10646:2000 [ISO/IEC 10646]]
- XML processors MUST accept any character in the range specified for Char
- [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
- That's more than a million characters
- That's more than I can process
- But not all that I want to process
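The Char production above can be written down as a small range check. This is only a sketch: the ranges are copied from production [2] of XML 1.0, while the names `XML_CHAR_RANGES`, `is_xml_char` and `count_xml_chars` are made up for this illustration.

```python
# Ranges copied from production [2] Char of XML 1.0 (3rd ed.).
XML_CHAR_RANGES = [
    (0x9, 0xA),          # tab and line feed
    (0xD, 0xD),          # carriage return
    (0x20, 0xD7FF),      # everything up to the surrogate blocks
    (0xE000, 0xFFFD),    # includes the Private Use Area, excludes FFFE and FFFF
    (0x10000, 0x10FFFF), # the supplementary planes
]


def is_xml_char(code_point: int) -> bool:
    """Return True if the code point matches the XML 1.0 Char production."""
    return any(lo <= code_point <= hi for lo, hi in XML_CHAR_RANGES)


def count_xml_chars() -> int:
    """Count the code points an XML 1.0 processor must accept."""
    return sum(hi - lo + 1 for lo, hi in XML_CHAR_RANGES)
```

Summing the ranges confirms the "more than a million" claim, and a surrogate such as D800 is rejected.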
5What is a character (2)
- ISO/IEC 10646: Universal Multiple-Octet Coded Character Set (UCS)
- character: A member of a set of elements used for the organisation, control, or representation of data
- collection: A set of coded characters which is numbered and named and which consists of those coded characters whose code positions lie within one or more identified ranges
- ISO/IEC TR 15285: An operational model for characters and glyphs
- Information technology uses the term character (or coded character) for the information content, and the term glyph for the presentation image.
- Since the standard does not specify which information the character represents, a user of the standard is free to choose.
6What is a character (3)
- Are these the same character?
7XML versus SGML, Unicode versus SDATA entities
- In SGML, the SGML declaration declares which character repertoire to use. By default:
CHARSET -- DOCUMENT CHARACTER SET -- BASESET "ISO 646-1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0"
- If you want anything more, use SDATA entities:
<!ENTITY eacute SDATA "[eacute]" -- e-accent aigue -->
- In XML, the only repertoire is Unicode (AKA ISO 10646)
- And your entities must be general:
<!ENTITY eacute "&#x00E9;"> <!-- e-accent aigue -->
- The good news is that the eacute is now well defined
- The bad news is that you lost control
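A short sketch of what "well defined" means in practice: an XML parser replaces the general entity with the Unicode character before the application ever sees it. The sample document and the word "café" are assumptions for illustration; the parser is Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Illustrative document with the general entity declared in the internal subset.
DOC = """<!DOCTYPE para [
  <!ENTITY eacute "&#x00E9;">
]>
<para>caf&eacute;</para>"""


def expand(document: str) -> str:
    """Parse the document and return the root element's text after entity expansion."""
    return ET.fromstring(document).text


text = expand(DOC)
# The application receives U+00E9; the entity name is gone.
code_points = ["U+%04X" % ord(ch) for ch in text]
```

The loss of control shows here too: after parsing, nothing records that the character came from an entity called eacute.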
8Character overload: the need for restriction
- It takes a big font to support all Unicode glyphs:
19-06-2003 05:05 367.112 arial.ttf
12-01-1999 10:59 24.131.012 ARIALUNI.TTF
- Are you sure you know how to process all characters?
- how to sort?
- how to hyphenate?
- how to pronounce?
- how to render?
- how to search?
- In XML, do you need ¹ 1 ? ? ? ? ? ? What's wrong with <myElement>1</myElement> and letting <myElement>'s style decide?
- Would this be correct? <para xml:lang="en">Enchanté, M?!</para>
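To illustrate the question about the many forms of "1": Unicode itself treats several of them as presentation variants of the plain digit, which compatibility normalization (NFKC) makes visible. The three variants below are chosen for this sketch and are not necessarily the ones shown on the slide.

```python
import unicodedata

# Assumed examples of "1" look-alikes, chosen for this sketch.
VARIANTS = {
    "SUPERSCRIPT ONE": "\u00b9",
    "CIRCLED DIGIT ONE": "\u2460",
    "FULLWIDTH DIGIT ONE": "\uff11",
}


def plain_form(ch: str) -> str:
    """Fold a presentation variant to its compatibility (NFKC) form."""
    return unicodedata.normalize("NFKC", ch)


folded = {name: plain_form(ch) for name, ch in VARIANTS.items()}
```

All three fold to the plain digit, which suggests the difference is styling rather than content.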
9Character underload: the need for extension
- Potentially 1,000,000 characters, and still not satisfied?
- If the replacement text is context or style sheet sensitive
- questionmark: in Greek ;, in Latin ?
- some-dash: sometimes mdash, sometimes ndash
- Topographical registration marks
- Combinations
- j-acute: j + &#x0301;
- min-2: -/- &#x207B;/&#x208B;
- Chinese / Japanese / Korean characters for names
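The j-acute case can be checked with canonical composition (NFC). A minimal sketch using Python's `unicodedata`: an e followed by a combining acute composes to one code point, but j followed by the same accent does not, because Unicode has no precomposed j-acute.

```python
import unicodedata


def compose(text: str) -> str:
    """Apply canonical composition (NFC)."""
    return unicodedata.normalize("NFC", text)


e_acute = compose("e\u0301")  # composes to the single code point U+00E9
j_acute = compose("j\u0301")  # stays two code points: j plus combining acute
```

Anything that assumes one code point per character (counting, truncating, indexing) therefore treats the two cases differently.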
10How to specify a restriction
- Unicode characters have three characteristics
- Codepoint (the number) and name
- Block (range of codepoints, like Latin-1 Supplement)
- General category (like Lu, Letter, uppercase)
- XML Schema datatypes allow restrictions to block or category
- But only of data content, not of mixed content
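The three characteristics can be looked up programmatically. In this sketch the name and general category come from Python's `unicodedata`. The block comes from a small hand-written table, because `unicodedata` does not expose blocks. The table holds only two blocks, and `describe` is a name invented for this example.

```python
import unicodedata

# Only two blocks are listed; a full table would come from Unicode's Blocks.txt.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
]


def describe(ch: str) -> dict:
    """Return codepoint, name, block and general category for one character."""
    cp = ord(ch)
    block = next((name for lo, hi, name in BLOCKS if lo <= cp <= hi), "(not in table)")
    return {
        "codepoint": "U+%04X" % cp,
        "name": unicodedata.name(ch, "(unnamed)"),
        "block": block,
        "category": unicodedata.category(ch),
    }


info = describe("\u00c9")
```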
11Document Schema Definition Languages (DSDL)
- New International Standard IS 19757
- Part 3: Rule-based validation - Schematron
<sch:rule context="/*[@xml:lang='nl']">
<sch:assert test="\p{IsBasicLatin}\p{IsLatin-1Supplement}">Dutch text should only contain basic latin characters, some of which may have accents</sch:assert>
</sch:rule>
- "\p{IsBasicLatin}\p{IsLatin-1Supplement}" is hard to read and to reuse
- Part 7: Character Repertoire Description Language - CRDL. Defines reusable named collections of characters
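The intent of the rule above can be approximated outside Schematron. This is a sketch only, not a DSDL implementation: it takes every element that carries the given xml:lang value and reports characters outside Basic Latin and Latin-1 Supplement, including text inside child elements (mixed content). The function name and sample documents are assumptions.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
ALLOWED = [(0x0000, 0x007F), (0x0080, 0x00FF)]  # Basic Latin, Latin-1 Supplement


def disallowed_chars(document: str, lang: str = "nl") -> list:
    """List characters outside ALLOWED in elements that carry xml:lang=lang."""
    bad = []
    for element in ET.fromstring(document).iter():
        if element.get(XML_LANG) != lang:
            continue
        # itertext() walks mixed content: text inside child elements is included.
        for ch in "".join(element.itertext()):
            if not any(lo <= ord(ch) <= hi for lo, hi in ALLOWED):
                bad.append(ch)
    return bad


ok = disallowed_chars('<doc><para xml:lang="nl">Ge\u00ebmailleerd <b>caf\u00e9</b></para></doc>')
not_ok = disallowed_chars('<doc><para xml:lang="nl">Prijs: <b>\u20ac</b> 5</para></doc>')
```

Inheritance of xml:lang by descendants is deliberately left out, so nested elements that repeat the attribute would be reported twice.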
12What a CRDL definition might look like
- "\p{IsBasicLatin}\p{IsLatin-1Supplement}"
- <collection name="dutch-chars">
- <union>
- <ref href="www.unicode.org/gencat/BasicLatin/"/>
- <ref href="www.unicode.org/gencat/Latin-Supplement/"/>
- </union>
- </collection>
13Unicode's solution for extension: Private Use Areas
- E000..F8FF Private Use Area
- F0000..FFFFF Supplementary Private Use Area-A
- 100000..10FFFF Supplementary Private Use Area-B
- General category for characters in these areas is Co (Other, Private Use)
- In the past, my eacute was probably the same character as your eacute
- ISO lists of frequently used character entities
- But my Private Use character U+E000 is probably not your character U+E000
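A quick check of what Unicode itself says about these code points, using Python's `unicodedata`: the general category is Co and there is no name, which is exactly why two parties cannot know that they mean the same thing. The helper name `is_private_use` is invented for this sketch.

```python
import unicodedata


def is_private_use(ch: str) -> bool:
    """True if the character's general category is Co (Other, Private Use)."""
    return unicodedata.category(ch) == "Co"


# One sample from each Private Use Area, plus an ordinary character for contrast.
samples = ["\ue000", "\uf8ff", "\U000f0000", "\U00100000", "\u00e9"]
flags = [is_private_use(ch) for ch in samples]

# Private use characters carry no standard name; the default is returned instead.
pu_name = unicodedata.name("\ue000", None)
```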
14SGML and XML roundtrips using PU characters
- We still use SGML
- But sometimes XML tools are nice
- So we do roundtrips
- Then one-to-one mapping is handy
- But our entities are many-to-many
- Unless we use the private use area: from
<!ENTITY o-umlaut "&#x00F6;"> <!-- oe -->
<!ENTITY o-trema "&#x00F6;"> <!-- -o -->
to
<!ENTITY o-umlaut "&#xE000;"> <!-- oe -->
<!ENTITY o-trema "&#x00F6;"> <!-- -o -->
- But then, how do we define the processing of private use characters?
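Why the private use code point helps the roundtrip can be shown with two small mappings. This is a sketch: the entity names and code points are the ones on this slide, while `invert` and `is_roundtrip_safe` are names made up for the illustration.

```python
def invert(entities: dict) -> dict:
    """Map each replacement character back to the entity names that produce it."""
    reverse = {}
    for name, char in entities.items():
        reverse.setdefault(char, []).append(name)
    return reverse


def is_roundtrip_safe(entities: dict) -> bool:
    """A roundtrip is lossless only if every character maps back to one entity."""
    return all(len(names) == 1 for names in invert(entities).values())


BEFORE = {"o-umlaut": "\u00f6", "o-trema": "\u00f6"}  # both collapse to U+00F6
AFTER = {"o-umlaut": "\ue000", "o-trema": "\u00f6"}   # o-umlaut moved to U+E000
```

With `BEFORE`, going back from XML to SGML cannot tell which entity was meant; with `AFTER` it can.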
15How to validate an extension
- If I define private use area characters, how do I define their behaviour?
- Or characteristics, like
- allow my PU char U+E000 wherever Latin-1 Supplement is allowed, or
- treat my PU char U+E000 as if it were an uppercase letter
- Processing is not part of DSDL, that's only validation
- If I add a character, I only want to define it once
- But I want to reuse public character collections
- My DocBook DTD might specify public collection restrictions
- So if DSDL 7 doesn't allow me to define characteristics
- My DSDL 7 schemas will have to be generated
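One way to picture "define it once" is a private registry that is consulted before the standard Unicode data. This is a hypothetical sketch, not part of DSDL or any standard: the registry, its single entry for U+E000, and the function names are all assumptions.

```python
import unicodedata

# Hypothetical private registry: each private use character is defined once.
PRIVATE_CATEGORIES = {0xE000: "Lu"}


def general_category(ch: str) -> str:
    """Return the private definition if there is one, else Unicode's category."""
    return PRIVATE_CATEGORIES.get(ord(ch), unicodedata.category(ch))


def is_uppercase_letter(ch: str) -> bool:
    """Any rule written against Lu now covers the private character as well."""
    return general_category(ch) == "Lu"
```

Every check phrased in terms of a category then picks up the new character without being edited.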
16Bottom Up Constraint Language
- SGML and XML are hierarchical, top down
- An element type definition defines its content, not where it may be used
- A CSS might be used to validate a document
- In a Bottom Up Constraint Language an element defines
- which type its content is, and
- which type itself is, and thereby a new element adds itself to every element that contains its type
- its processing characteristics (CSS etc)
- its documentation
- The same goes for characters: if a PU character says it is like a Latin-1 uppercase letter, that should be enough to allow it wherever Latin-1 uppercase letters are allowed
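The bottom-up idea can be sketched in a few lines: each element states only its own type and the type of its content, and the usual top-down content models are derived from those statements. The element names, the two types "block" and "inline", and the function `content_models` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical bottom-up declarations: each element states its own type
# and the type of content it accepts.
DECLARATIONS = {
    "para":     {"is_a": "block",  "contains": "inline"},
    "emphasis": {"is_a": "inline", "contains": "inline"},
    "section":  {"is_a": "block",  "contains": "block"},
}


def content_models(declarations: dict) -> dict:
    """Derive the top-down view: which children each element allows."""
    by_type = defaultdict(set)
    for name, decl in declarations.items():
        by_type[decl["is_a"]].add(name)
    return {name: set(by_type[decl["contains"]]) for name, decl in declarations.items()}


models = content_models(DECLARATIONS)

# A new inline element adds itself to every element that contains inline content.
extended = dict(DECLARATIONS, keycap={"is_a": "inline", "contains": "inline"})
extended_models = content_models(extended)
```

Adding `keycap` required no change to `para` or `emphasis`; their derived models simply grew.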
17What a BUCL statement might look like
- <CharProp CodePoint="E000" PU_Name="COMBINING UMLAUT" PU_General_Category="Mn" PU_Canonical_Combining_Class="230" PU_Bidi_Class="NSM" PU_Bidi_Mirrored="N" Render_As="0308" />
- <CharProp CodePoint="E001" PU_Name="LATIN SMALL LETTER O WITH UMLAUT" PU_Decomposition_Mapping="006F E000" PU_General_Category="Ll" PU_Simple_Uppercase_Mapping="E002" Char_Entity="oumlaut" Order_As="6F 65" />
- <CharProp CodePoint="E002" PU_Name="LATIN CAPITAL LETTER O WITH UMLAUT" PU_Decomposition_Mapping="004F E000" PU_General_Category="Lu" PU_Simple_Lowercase_Mapping="E001" Char_Entity="Oumlaut" Order_As="4F 45" />
18What a BUCL system might do
- BUCL might create a DTD for every purpose
- For rendering or editing purposes:
<!ENTITY oumlaut "o&#x0308;">
<!ENTITY Oumlaut "O&#x0308;">
- For sorting purposes:
<!ENTITY oumlaut "oe">
<!ENTITY Oumlaut "OE">
- For roundtrip purposes:
<!ENTITY oumlaut "&#xE001;">
<!ENTITY Oumlaut "&#xE002;">
- And CSS, FOSIs, Relax NG schemas, documentation, ...
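A sketch of how such a generator might derive the three entity sets from one set of character properties. The property values mirror the CharProp attributes from the previous slide (decomposition, Order_As, Render_As, Char_Entity). The Python names, and the rule that ASCII letters and digits are written literally while everything else becomes a character reference, are assumptions made for this illustration.

```python
from typing import NamedTuple, Optional, Tuple


class CharProp(NamedTuple):
    code_point: int
    entity: Optional[str] = None
    decomposition: Tuple[int, ...] = ()
    order_as: Tuple[int, ...] = ()
    render_as: Optional[int] = None


PROPS = {
    0xE000: CharProp(0xE000, render_as=0x0308),
    0xE001: CharProp(0xE001, "oumlaut", (0x006F, 0xE000), (0x6F, 0x65)),
    0xE002: CharProp(0xE002, "Oumlaut", (0x004F, 0xE000), (0x4F, 0x45)),
}


def literal(cp: int) -> str:
    """ASCII letters and digits as themselves, anything else as a character reference."""
    ch = chr(cp)
    return ch if ch.isascii() and ch.isalnum() else "&#x%04X;" % cp


def replacement(prop: CharProp, purpose: str) -> str:
    """Build the entity replacement text for one purpose."""
    if purpose == "roundtrip":
        cps = [prop.code_point]
    elif purpose == "sorting":
        cps = list(prop.order_as)
    elif purpose == "rendering":
        cps = []
        for cp in prop.decomposition:
            part = PROPS.get(cp)
            # Swap a private use component for its public rendering equivalent.
            cps.append(part.render_as if part and part.render_as is not None else cp)
    else:
        raise ValueError("unknown purpose: %s" % purpose)
    return "".join(literal(cp) for cp in cps)


def entity_declarations(purpose: str) -> list:
    """Generate one ENTITY declaration per character that has an entity name."""
    return ['<!ENTITY %s "%s">' % (p.entity, replacement(p, purpose))
            for p in PROPS.values() if p.entity]
```

Calling `entity_declarations` with "rendering", "sorting" or "roundtrip" yields the three declaration sets listed on this slide, all from a single definition per character.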
19Bottom Up Constraint Language (2)
20PCDATA considered harmful
- A bulleted list should not specify to use ? as bullet, the style sheet should
- In a <para xml:lang="en"> no Chinese characters should be allowed
- But wouldn't spell checking do?
- But if you're using a word list, why allow (English) characters?
- Why not encode a paragraph as a sequence of sentences, and a sentence as a grammatical tree, and use references to a dictionary?
- And let the value of xml:lang decide which dictionary and which transformation rules to apply!
- Only dictionaries, <name> elements and stylesheets should use PCDATA
21Conclusions
- For quality control, you need restriction and validation
- The world changes, new characters occur
- Private Use characters can replace SDATA entities
- You need to be able to specify their behavior
- And characteristics
- From which the validation rules should follow
- BUCL UP!
- If you do your data analysis well, the usage of PCDATA should be far more restricted than it is now
22Questions
- Do you really mean it?
- Do you sell dictionaries?
- I want my logo to be a character, how do I do
that?