Hon Wah Chan - PowerPoint PPT Presentation

Slides: 29
Provided by: Howa74
Learn more at: http://www.unicode.org

Transcript and Presenter's Notes


1
Multilingual Editing using RichEdit 4
  • Hon Wah Chan
  • Murray Sargent III
  • Microsoft Corporation
  • Text Services Group, Word

2
Introduction
  • RichEdit is a text engine with a hierarchy of
    presentation formats
  • Features such as automatic choice of fonts, rich
    text, 2D text objects
  • Handling non-Unicode documents in Unicode text
    engines
  • Describe interfaces and component usage
  • Ways to input Unicode text using IMEs, speech
  • Demo

3
What's RichEdit?
  • RichEdit 4.x is a set of plain/rich-text,
    single/multiline Unicode/ANSI edit controls and
    combo/listboxes in a single worldwide binary
  • Multilevel undo, message and COM interfaces, Word
    compatibility, pretty rich text
  • Outline view, zoom, font binding, latest in IME
    support, and rich complex script support (BiDi,
    Indic, and Thai)

4
Clients include
  • Handheld PC PocketWord
  • eBooks
  • OE (for mail header)
  • Borland's Delphi
  • SQL server dev tools, RAID
  • MSN Companion chat
  • Via Win2k wrapper: ccmail, WebEditPro, Eudora,
    Encarta, Money(US), Sibelius, Borland TRichedit
    class, apps created with VB, MFC
  • Outlook mail note, post-it
  • Most Office dialogs
  • All OSes since Win98
  • Wordpad, Charmap
  • Darwin installer
  • WebCalc
  • Project
  • Visual Studio, DaVinci
  • Publisher
  • Front Page

5
Some Fancier Features
  • Features added for eBooks: pagination,
    hyphenation, kerning, ClearType support, text
    wrap around embedded objects
  • Multilevel tables
  • Autocorrect
  • AutoURL detection (improved from 3.0)

6
2D Text Objects
  • RichEdit 4.5 (in development) supports WYSIWYG
    editing of many 2D objects
  • Ruby, Tatenakayoko, Warichu, Kumimoji
  • Math fractions, autosizing brackets, boxes,
    matrices, integrals
  • Demo will show some of these features

7
Backward Compatibility
  • Unicode text engines need to import/export text
    in other character sets
  • Given non-Unicode plain text, which code page
    should one use to convert to/from Unicode?
  • On localized systems, system code page is a good
    bet
  • In multilingual text, you can enter text using
    keyboards in a variety of languages that need
    either Unicode or multiple code pages
  • For searching text, best choice seems to be to
    use the current keyboard code page
  • If text begins with a BOM, it's Unicode
  • If text begins with a rich-text header, e.g.,
    \rtf or <html>, use the appropriate conversion
    routine
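
The checks on this slide can be sketched as a small format sniffer. This is a minimal illustration, not RichEdit's actual logic: the function name `sniff_text_format`, the returned labels, and the `cp1252` fallback (standing in for the system code page) are assumptions made for the example, and UTF-32 BOMs are ignored.

```python
import codecs

def sniff_text_format(data: bytes, fallback_codepage: str = "cp1252") -> str:
    """Guess how to convert raw bytes to Unicode (hypothetical helper)."""
    # A byte-order mark means the text is already Unicode.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"  # Python's utf-16 codec reads the BOM itself
    # A rich-text header selects a dedicated converter instead.
    head = data[:64].lstrip().lower()
    if head.startswith(b"{\\rtf"):
        return "rtf"
    if head.startswith(b"<html"):
        return "html"
    # Otherwise fall back to a code page (the system one is "a good bet").
    return fallback_codepage

print(sniff_text_format(codecs.BOM_UTF8 + b"abc"))   # utf-8-sig
print(sniff_text_format(b"{\\rtf1\\ansi hello}"))    # rtf
print(sniff_text_format(b"plain old text"))          # cp1252
```

For the BOM cases the returned label can be passed straight to `bytes.decode`; the "rtf" and "html" labels only say which converter a caller would need to invoke.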

8
Backward Compatibility (cont.)
  • Need a little rich-text functionality to display
    Unicode plain text unambiguously in some CJK
    scenarios
  • This functionality handles font choices and
    language-dependent glyph variants
  • When a user types in text using a keyboard
    charset, edit engine knows charset and therefore
    can insert accurate Unicode text including which
    CJK glyph variant to use
  • Client gets text as pure ANSI (or Unicode) text
    without script clues
  • Would be handy to have script tags

9
Complex Scripts
  • Unicode covers many complex scripts, e.g.,
    Arabic, Indic, Thai, ancient Korean
  • Complex scripts require a layout engine that
    translates character codes to glyph indices
    (often referencing ligatures)
  • RichEdit uses Uniscribe and the MS line-layout
    component for complex scripts

10
Font Binding
  • Most Unicode characters belong to scripts
  • Associate with each position in a document a
    font bundle
  • When inserting characters, assign each one to a
    script
  • For CJK, check surrounding characters for Kana
    and Hangul as clues to use Japanese or Korean
    fonts instead of Chinese
  • Assign scripts to neutrals and digits
  • Keyboard language, especially with IMEs, provides
    strong binding clues
  • Format inserted characters with fonts assigned to
    scripts. Check current font to see if it supports
    required script
  • RichEdit 4.0 has 50 scripts for Unicode 3.1.
    Client can specify what default font to use for a
    given script.
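
The binding steps above can be illustrated with a toy script resolver. This is a simplified sketch under loud assumptions, not RichEdit's algorithm: the code-point ranges cover only a few blocks, Kana or Hangul anywhere in the string (rather than only nearby) decides how Han characters are bound, and the names `char_script` and `bind_scripts` are invented for the example.

```python
from itertools import groupby

def char_script(ch):
    """Classify one character using a few hard-coded ranges (illustrative only)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x30FF:                      # Hiragana, Katakana
        return "kana"
    if 0xAC00 <= cp <= 0xD7A3 or 0x1100 <= cp <= 0x11FF:
        return "hangul"
    if 0x4E00 <= cp <= 0x9FFF:                      # CJK Unified Ideographs
        return "han"
    if cp < 0x0250 and ch.isalpha():
        return "latin"
    return "neutral"                                # digits, spaces, punctuation

def bind_scripts(text):
    """Split text into (script, run) pairs that could each get their own font."""
    raw = [char_script(ch) for ch in text]
    # Kana or Hangul are clues that Han characters are Japanese or Korean.
    if "kana" in raw:
        han_as = "japanese"
    elif "hangul" in raw:
        han_as = "korean"
    else:
        han_as = "chinese"
    named = {"kana": "japanese", "hangul": "korean", "han": han_as, "latin": "latin"}
    resolved, previous = [], "latin"
    for script in raw:
        current = named.get(script, previous)       # neutrals inherit the previous script
        resolved.append(current)
        previous = current
    runs, start = [], 0
    for script, group in groupby(resolved):
        length = len(list(group))
        runs.append((script, text[start:start + length]))
        start += length
    return runs

print(bind_scripts("abc 漢字"))     # [('latin', 'abc '), ('chinese', '漢字')]
print(bind_scripts("漢字とかな"))   # [('japanese', '漢字とかな')]
```

A real engine would then map each script to the client's default font for that script and check whether the current font already covers it.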

11
Language Detection and Font Binding
  • Korean and Japanese are often easy to spot
    because of Hangul and Kana characters,
    respectively
  • For CJK, can convert back to a code page and see
    if errors occur (Ken Lunde's suggestion)
  • For proofing purposes, accurate language
    identification is needed. For font binding,
    script identification is usually sufficient
  • Typically more than one language corresponds to a
    script, e.g., Latin script. Essentially only one
    language uses the Korean script
  • Natural language processing techniques allow good
    language identification if more than a few words
    are involved, e.g., a sentence
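
The code-page round-trip test mentioned above can be sketched with Python's built-in codecs. The table `CJK_CODEPAGES` and the function name `codepages_that_fit` are assumptions for illustration; the codec names are the Windows code pages as Python names them. The test is only a hint: many characters (ASCII, and often Kana) convert cleanly in several CJK code pages.

```python
# Windows code pages for the four CJK locales, as named by Python's codecs.
CJK_CODEPAGES = {
    "japanese": "cp932",
    "korean": "cp949",
    "simplified chinese": "cp936",
    "traditional chinese": "cp950",
}

def codepages_that_fit(text):
    """Return the languages whose code page can encode the text without errors."""
    fits = []
    for language, codepage in CJK_CODEPAGES.items():
        try:
            text.encode(codepage)
        except UnicodeEncodeError:
            continue            # a conversion error rules this code page out
        fits.append(language)
    return fits

print(codepages_that_fit("한국어"))   # ['korean']
print(codepages_that_fit("abc"))      # all four: ASCII is not a useful clue
```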

12
Font Sizing
  • In dialogs, 8-pt Latin characters are commonly
    used
  • 8-pt Chinese characters are hard to read, so
    better to use 9 points in combination with 8-pt
    Latin characters
  • Latin characters have bigger descenders than
    Chinese characters, since latter only need room
    for underline
  • Combining 8-pt Latin characters with 9-point
    Chinese characters and keeping same baseline
    increases line height to 9 pts plus extra height
    for Latin descender
  • Result is more like 10 points, which shifts text
    too high in a dialog box originally designed to
    handle one language
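
The arithmetic behind this slide can be shown with a worked example. The ascent and descent figures below are hypothetical numbers chosen only to illustrate the effect; they are not measurements of any real font.

```python
def line_height(runs):
    """Height of a line whose runs share one baseline:
    tallest ascent plus deepest descent."""
    max_ascent = max(ascent for ascent, _ in runs)
    max_descent = max(descent for _, descent in runs)
    return max_ascent + max_descent

# Hypothetical (ascent, descent) metrics in points, for illustration only.
LATIN_8PT = (6.0, 2.0)     # Latin needs a relatively deep descender
CHINESE_9PT = (8.0, 1.0)   # Chinese mostly needs room for an underline

print(line_height([LATIN_8PT]))                # 8.0
print(line_height([LATIN_8PT, CHINESE_9PT]))   # 10.0
```

With these assumed metrics the mixed line takes its ascent from the 9-point Chinese run and its descent from the 8-point Latin run, so it ends up taller than either run alone.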

13
Unicode Surrogate Pairs
  • Using 2 16-bit surrogates to represent a single
    character complicates more than measurement and
    display of characters
  • Arrow-key handlers and other methods that change
    character position must avoid ending up in
    between lead and trail surrogates
  • Input methods need to map to surrogate pair
  • Case changes, line-breaking rules, sorting, file
    formats, and backing-store manipulations in
    general have to recognize and deal with pairs
  • Surrogate code ranges make them easy to work with
    relative to multibyte encoding systems
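
The last point can be made concrete with the standard UTF-16 surrogate arithmetic. The helper names are made up for this sketch; the ranges and formulas are those defined by Unicode. Because lead and trail surrogates occupy disjoint ranges, a single 16-bit value is enough to tell where you are in a pair, which is not true of most multibyte code pages.

```python
def is_lead(unit):
    return 0xD800 <= unit <= 0xDBFF

def is_trail(unit):
    return 0xDC00 <= unit <= 0xDFFF

def to_surrogates(code_point):
    """Split a supplementary-plane code point into (lead, trail)."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def from_surrogates(lead, trail):
    """Recombine a (lead, trail) pair into one code point."""
    if not (is_lead(lead) and is_trail(trail)):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)

lead, trail = to_surrogates(0x1D11E)          # MUSICAL SYMBOL G CLEF
print(hex(lead), hex(trail))                  # 0xd834 0xdd1e
print(hex(from_surrogates(lead, trail)))      # 0x1d11e
```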

14
Nonspacing Combining Marks
  • Multicode characters (surrogate pairs, CRLFs,
    combining-mark and variant-tag sequences) require
    special display/navigation handling
  • Render combining-mark sequences by standard
    system calls and fonts that support combining
    marks. Better display needs a layout engine that
    talks to OpenType
  • Simple caret movement across combining-mark
    sequences prevents stopping inside a sequence.
    Backspace key deletes one mark at a time
  • Mouse-cursor hit testing leaves selection at
    beginning/end of combining-mark sequence (more
    elegant model allows selection and editing of
    individual marks)
  • Cool thing: if you can navigate past CRLF
    combinations, you can modify the corresponding
    code to handle surrogate pairs and combining-mark
    sequences quite easily
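
The point about reusing the CRLF logic can be illustrated with a right-arrow handler over UTF-16 code units. This is a minimal sketch, not RichEdit's code: the names `utf16_units` and `next_caret` are invented, and "combining mark" is approximated as any character whose Unicode general category starts with "M".

```python
import unicodedata

def utf16_units(text):
    """Return the text as a list of 16-bit code units (the backing store)."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

def _char_end(units, pos):
    """Index just past the character at pos, keeping surrogate pairs whole."""
    if (0xD800 <= units[pos] <= 0xDBFF and pos + 1 < len(units)
            and 0xDC00 <= units[pos + 1] <= 0xDFFF):
        return pos + 2
    return pos + 1

def _is_mark(units, pos):
    """True if the character starting at pos is a combining mark."""
    end = _char_end(units, pos)
    if end - pos == 2:
        cp = 0x10000 + ((units[pos] - 0xD800) << 10) + (units[pos + 1] - 0xDC00)
    else:
        cp = units[pos]
    return unicodedata.category(chr(cp)).startswith("M")

def next_caret(units, pos):
    """Move the caret right by one multicode character."""
    n = len(units)
    if pos >= n:
        return n
    # CRLF is treated as a single unit of navigation.
    if units[pos] == 0x0D and pos + 1 < n and units[pos + 1] == 0x0A:
        return pos + 2
    pos = _char_end(units, pos)
    # Absorb any combining marks that follow the base character.
    while pos < n and _is_mark(units, pos):
        pos = _char_end(units, pos)
    return pos

print(next_caret(utf16_units("a\r\nb"), 1))        # 3
print(next_caret(utf16_units("e\u0301x"), 0))      # 2
print(next_caret(utf16_units("\U0001D11Ez"), 0))   # 2
```

The CRLF check, the surrogate check, and the mark loop all have the same shape: look at the next code unit and decide whether it belongs with the current one.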

15
Interfaces
  • Messages and keyboard
  • File read/write (plain text or RTF)
  • TOM (Text Object Model)
  • ITextServices/ITextHost interfaces

16
RichEdit Message Interface
  • System messages
  • keyboard messages
  • mouse messages
  • clipboard messages
  • Edit messages: RichEdit supports all but four of
    the system edit messages
  • RichEdit messages
  • Character/paragraph formatting
  • Text input/query
  • Notification

17
File Formats
  • Plain text can be saved/read encoded in any
    codepage, including Unicode and UTF-8
  • RTF is the principal rich-text format
  • UTF-8 RTF is used preferentially for
    cut/copy/paste. Can be used in stream operations
  • Copying text to/from Word can be a handy way to
    get desired formatting into a RichEdit instance
  • HTML is available via system converters

18
TOM (Text Object Model)
  • A set of COM dual interfaces that allow Unicode
    rich/plain text to be manipulated by VB, C/C++,
    and Java clients.
  • Access for spelling/grammar checkers
  • Accessibility
  • Powerful and efficient text processing
    primitives. Embedded scripts

19
TOM (cont.)
  • ITextDocument: Top-level editing object
  • ITextStoryRanges: Enumerator for stories in
    document
  • ITextRange: Primary text interface (range of text)
  • ITextFont: Character-attribute interface
  • ITextPara: Paragraph-attribute interface
  • ITextTag: HTML tag interface
  • ITextAttributes: Tag-attribute enumerator
  • ITextSelection: Screen highlighted text range
  • ITextSelection inherits all ITextRange methods

20
ITextServices/ITextHost Interfaces
  • Windowless interfaces that go beyond message
    interface
  • In the in-place active state, use the window of
    the container
  • Fewer system resources
  • Faster activation and deactivation

21
Other Components used
  • Uniscribe
  • MS line-layout component
  • Windows Text Services Framework
  • Callbacks for access to word-break, autocorrect,
    hyphenation, and ClearType libraries

22
Input methods
  • Support for the latest IMEs
  • Speech and handwriting input (Windows Text
    Services Framework)
  • Alt+x Unicode input method
  • Standard hot keys

23
IMEs
  • Support Level 2 and Level 3 IMEs
  • Support Active Input Method Manager (AIMM)
  • Reconversion - user can convert final string back
    to composition mode, allowing easy selection of a
    different candidate string.
  • Document feed - provides IME with text for
    current paragraph to increase conversion accuracy
    during typing.
  • Mouse Operation - gives user better control over
    candidate and UI windows
  • Caret position - gets current caret and line
    info, which IME98 uses to position UI windows
    (e.g., candidate list).

24
Windows Text Services Framework
  • Provide support for Far East input across Win32
    platforms of different languages to
    framework-aware applications.
  • Provide consistent UI for different input
    methods: speech, handwriting, IME
  • Coordinated input
  • Data persistence for dynamic text editing
  • RichEdit supports both the native mode and Active
    Input Method Manager (AIMM) mode

25
Hex to Unicode Input Method
  • Type Unicode character hexadecimal code
  • Make corrections as needed
  • Type Alt+x to convert to character
  • Type Alt+x to convert back to hex (useful
    especially for missing glyph character)
  • Resolve ambiguities by selection
  • Input higher-plane chars using 5- or 6-digit code
  • MS Word 2002 standard
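
The toggle described on this slide can be sketched as a function on the text before the caret. This is an illustrative approximation, not the algorithm used by RichEdit or Word: the name `alt_x`, the rule of taking the last two to six hex digits, and the shrink-from-the-left fallback are all assumptions. It also shows why ambiguities arise: in "cab263A" the letters before the digits are themselves hex digits, which a real editor lets the user resolve by selecting the intended digits.

```python
import re

_HEX_TAIL = re.compile(r"[0-9A-Fa-f]{2,6}$")

def alt_x(text):
    """Toggle the end of text between a hex code and the character it names."""
    match = _HEX_TAIL.search(text)
    if match:
        digits = match.group()
        # Drop leading digits until the value is a usable code point.
        while digits:
            cp = int(digits, 16)
            if cp <= 0x10FFFF and not 0xD800 <= cp <= 0xDFFF:
                return text[:len(text) - len(digits)] + chr(cp)
            digits = digits[1:]
    if text:
        # No hex code before the caret: expand the last character to hex.
        return text[:-1] + format(ord(text[-1]), "X").zfill(4)
    return text

print(alt_x("x263A"))                  # x☺
print(alt_x("x\u263A"))                # x263A
print(alt_x("1D11E") == "\U0001D11E")  # True (a higher-plane character)
```

Converting back to hex is what makes the method useful for a missing-glyph box: the code reveals which character is actually stored.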

26
Unicode combobox/listbox
  • Emulate the system combobox and listbox
  • Unicode support on all Win32 platforms
  • Allow mixed languages between items
  • Modified EM_SETTEXTEX for inserting items
  • Used in Office applications

27
Demo

28
Conclusions
  • Have described RichEdit, an engine for text
    display and editing with a hierarchy of
    presentation formats
  • Automatic choice of fonts for Unicode plain text
    including surrogate-pair characters and
    combining-mark sequences
  • Handling non-Unicode documents in Unicode text
    engines
  • Described interfaces and component usage
  • Ways to input Unicode text using IMEs, speech
  • Clients include many Office and Windows apps
  • Able to display 2D Text Objects such as Ruby and
    Warichu