Folded Trie: Efficient Data Structure for All of Unicode

Dublin, Ireland, May 2002

1
Folded Trie: Efficient Data Structure for All of Unicode
  • Vladimir Weinstein
  • vweinste_at_us.ibm.com

Globalization Center of Competency, San Jose, CA
2
Introduction
  • A lot of data for each code point
  • Need appropriate data structures
  • Unicode version 3.1 introduced code points in the
    supplementary space; the addressable range grew to
    more than a million code points
  • Repetitive data
  • Sparsely populated range, especially the
    supplementary space

3
Data Structures
  • Arrays
      • Advantages: very fast access time, fast write
        time
      • Disadvantage: unacceptable memory consumption
  • Hash tables
      • Advantages: easy to use, reasonably fast, general
      • Disadvantages: high overhead, complicated
        sequential access, slower than array lookup, data
        within ranges is not shared

4
Data Structures (continued)
  • Inversion Maps
      • Advantages: simple, very compact, fast boolean
        operations
      • Disadvantages: worse access time than arrays and
        possibly hash tables
  • For more details see Bits of Unicode at
    http://www.macchiato.com/slides/Bits_of_Unicode.ppt
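The inversion map idea can be sketched in a few lines. This is only an illustration: the table contents and the names STARTS, VALUES and inversion_map_lookup are made up for the example and do not come from the slides. Each entry marks the first code point of a range, and a binary search finds the range that contains a given code point, which is why access is slower than a plain array index.

```python
from bisect import bisect_right

# Hypothetical inversion map: STARTS[i] is the first code point of range i,
# VALUES[i] is the value shared by every code point in that range.
STARTS = [0x0000, 0x0041, 0x005B, 0x0061, 0x007B]
VALUES = ["other", "upper", "other", "lower", "other"]

def inversion_map_lookup(cp: int) -> str:
    """Binary-search for the range containing cp and return its value."""
    i = bisect_right(STARTS, cp) - 1
    return VALUES[i]
```

Five entries cover the whole code point range here, but every lookup costs a search over the range starts instead of a single array access.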

5
Tries
  • A trie is a structure with one or more indexes
    and one data storage.
  • Name comes from Information Retrieval
  • Shares repetitive data
  • Good compaction
  • Not appropriate for frequently changing data

6
Single-Index Trie
  • A trie structure with an index array and a data
    array.
  • Advantages:
      • Excellent size
      • Very good access performance (two array accesses,
        shift, mask and addition)
  • Disadvantages:
      • Not appropriate for frequently changing data
      • Index array gets too big when dealing with
        supplementary code points
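A minimal sketch of a single-index trie, assuming a block size of 32 code points (the slides do not state one). The function names and the simple builder are illustrative and are not ICU code. The builder stores each distinct block of values once; the lookup is the "two array accesses, shift, mask and addition" mentioned above.

```python
SHIFT = 5              # assumed block size: 2**5 = 32 code points per block
BLOCK = 1 << SHIFT
MASK = BLOCK - 1

def build_single_index_trie(values):
    """Split values into blocks and store each distinct block only once.

    len(values) must be a multiple of BLOCK.
    """
    index, data, seen = [], [], {}
    for start in range(0, len(values), BLOCK):
        block = tuple(values[start:start + BLOCK])
        if block not in seen:
            seen[block] = len(data)    # offset of this block in the data array
            data.extend(block)
        index.append(seen[block])
    return index, data

def lookup(index, data, cp):
    # Two array accesses, a shift, a mask and an addition.
    return data[index[cp >> SHIFT] + (cp & MASK)]
```

With this block size the index needs 0x10000 >> 5 = 2048 entries for the BMP, but 0x110000 >> 5 = 34816 entries for the full code point range, which illustrates why the index array grows too big once supplementary code points are included.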

7
Single-Index Trie Diagram
8
Double-Index Trie
  • Two index arrays and a data block
  • Compared to single-index trie:
      • Provides better compression of the index array
      • Worse performance, but still very fast
      • Feasible for supplementary code points
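The double-index variant can be sketched the same way. The shift widths (32 values per data block, 64 entries per second-level index block) and all names below are assumptions made for the example, not values taken from the slides. The same block-sharing step is applied twice: once to the data, then to the index that the first pass produced.

```python
DATA_SHIFT = 5     # assumed: 32 values per data block
INDEX_SHIFT = 6    # assumed: 64 entries per second-level index block
DATA_MASK = (1 << DATA_SHIFT) - 1
INDEX_MASK = (1 << INDEX_SHIFT) - 1

def compact(items, size):
    """Split items into blocks of `size`, storing each distinct block once.

    Returns (offsets, storage); offsets[i] is where block i starts in storage.
    """
    offsets, storage, seen = [], [], {}
    for start in range(0, len(items), size):
        block = tuple(items[start:start + size])
        if block not in seen:
            seen[block] = len(storage)
            storage.extend(block)
        offsets.append(seen[block])
    return offsets, storage

def build_double_index_trie(values):
    # First pass shares data blocks, second pass shares blocks of the index.
    flat_index, data = compact(values, 1 << DATA_SHIFT)
    index1, index2 = compact(flat_index, 1 << INDEX_SHIFT)
    return index1, index2, data

def lookup(index1, index2, data, cp):
    # Three array accesses instead of two.
    i2 = index1[cp >> (DATA_SHIFT + INDEX_SHIFT)] + ((cp >> DATA_SHIFT) & INDEX_MASK)
    return data[index2[i2] + (cp & DATA_MASK)]
```

The extra array access is the price for the smaller index: the first-level index for all of Unicode has only 0x110000 >> 11 = 544 entries under these assumptions.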

9
Double-Index Trie Diagram
10
Folded Trie
  • Fast access for BMP code points
  • Slower access for supplementary code points, but
    those are far less frequent
  • Compacts supplementary index
  • Needs additional build time processing
  • Fast addressing with UTF-16 code units; no need
    to construct the code point

11
Folded Trie Supplementary Access Diagram
(Diagram label: Final Data)
  • BMP code point access is the same as with a
    single-index trie
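The following is a simplified sketch of the folding idea, not the ICU implementation. It assumes 32-value blocks, assumes that surrogate code points carry no data of their own, and uses the data slot of each lead surrogate to hold a "folding offset" that points at the index entries for that lead's 1024 supplementary code points. A stored 0 means nothing is stored for that lead. All names are hypothetical.

```python
SHIFT = 5                                 # assumed: 32 values per block
BLOCK = 1 << SHIFT
MASK = BLOCK - 1
BMP_INDEX_LENGTH = 0x10000 >> SHIFT       # 2048 index entries for the BMP

def to_surrogates(cp):
    """UTF-16 lead and trail code units for a supplementary code point."""
    return 0xD7C0 + (cp >> 10), 0xDC00 | (cp & 0x3FF)

def build_folded_trie(values, default=0):
    """values has 0x110000 entries; returns (index, data)."""
    data, seen_blocks = [], {}

    def store_block(block):
        block = tuple(block)
        if block not in seen_blocks:
            seen_blocks[block] = len(data)
            data.extend(block)
        return seen_blocks[block]

    bmp = list(values[:0x10000])
    folded_index, seen_groups = [], {}
    for lead in range(0xD800, 0xDC00):
        start = 0x10000 + ((lead - 0xD800) << 10)
        chunk = values[start:start + 0x400]
        if all(v == default for v in chunk):
            bmp[lead] = 0                 # 0 = nothing stored for this lead
            continue
        group = tuple(store_block(chunk[i:i + BLOCK])
                      for i in range(0, 0x400, BLOCK))
        if group not in seen_groups:      # identical index groups are shared
            seen_groups[group] = BMP_INDEX_LENGTH + len(folded_index)
            folded_index.extend(group)
        bmp[lead] = seen_groups[group]    # folding offset, always >= 2048
    bmp_index = [store_block(bmp[i:i + BLOCK]) for i in range(0, 0x10000, BLOCK)]
    # The folded supplementary index sits right above the BMP index.
    return bmp_index + folded_index, data

def lookup_bmp(index, data, unit):
    # Same access as a single-index trie.
    return data[index[unit >> SHIFT] + (unit & MASK)]

def lookup_pair(index, data, lead, trail, default=0):
    # Works directly on the two code units; the code point is never assembled.
    offset = lookup_bmp(index, data, lead)
    if offset == 0:
        return default
    return data[index[offset + ((trail & 0x3FF) >> SHIFT)] + (trail & MASK)]
```

The supplementary lookup costs one extra BMP-style lookup for the lead unit, which matches the trade-off described above: slower than BMP access, but only for code points that occur far less often.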

12
ICU Implementation: UTrie
  • ICU implementation is called UTrie
  • Stores either 16-bit or 32-bit wide data
    (extensible in the future)
  • Up to 256K different data elements
  • Can be frozen and reused as a memory-mapped image
    for fast startup
  • Using UTrie requires custom code
  • More about ICU at the end of the presentation

13
Range Enumeration
  • Allows enumerating the contiguous maximal ranges
    that share the same data element
  • Elements can be preprocessed by an additional
    callback
  • Saves time when processing the whole Unicode
    range by efficiently walking the trie structure
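A minimal sketch of range enumeration over a single-index trie, assuming 32-value blocks. The function enum_ranges and its transform parameter are hypothetical stand-ins for the enumeration and the preprocessing callback and are not the ICU API. The time saving comes from skipping a whole block when the index shows it is the same uniform block that was just processed.

```python
SHIFT = 5              # assumed: 32 values per block
BLOCK = 1 << SHIFT

def enum_ranges(index, data, transform=lambda v: v):
    """Yield (start, limit, value) for each maximal run of equal values.

    `transform` plays the role of the optional preprocessing callback;
    ranges are maximal with respect to the transformed values.
    """
    start, cp = 0, 0
    current = transform(data[index[0]])
    prev_offset, prev_uniform = None, False
    for offset in index:
        if offset == prev_offset and prev_uniform:
            # Same uniform block as the previous one: the current run
            # simply continues, so skip 32 code points without reading data.
            cp += BLOCK
            continue
        block = [transform(v) for v in data[offset:offset + BLOCK]]
        for i, v in enumerate(block):
            if v != current:
                yield (start, cp + i, current)
                start, current = cp + i, v
        prev_offset = offset
        prev_uniform = all(v == block[0] for v in block)
        cp += BLOCK
    yield (start, cp, current)
```

For example, with a transform that maps several raw values to the same result, neighbouring ranges merge and the caller sees fewer, larger ranges.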

14
Latin-1 Fast Path
  • Build time option
  • Allows direct array access for the Latin-1 range
    (0x00-0xFF)
  • Latin-1 range is not compressed if this option is
    used
  • Appropriate when access to the Latin-1 range is
    critical, for example in collation
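One way to picture the Latin-1 fast path, as a sketch under assumptions: the builder below keeps the first 256 values uncompressed at the start of the data array, so they can be read with a single array access. The 32-value block size, the placement at offset 0 and the function names are choices made for this example and may differ from what ICU does.

```python
SHIFT = 5              # assumed: 32 values per block
BLOCK = 1 << SHIFT
MASK = BLOCK - 1
LATIN1_LIMIT = 0x100

def build_trie_latin1_linear(values):
    """Single-index trie whose first 256 data slots are the Latin-1 values."""
    data = list(values[:LATIN1_LIMIT])             # stored linearly, never shared
    index = list(range(0, LATIN1_LIMIT, BLOCK))    # each Latin-1 block maps to itself
    seen = {}
    for start in range(LATIN1_LIMIT, len(values), BLOCK):
        block = tuple(values[start:start + BLOCK])
        if block not in seen:
            seen[block] = len(data)
            data.extend(block)
        index.append(seen[block])
    return index, data

def lookup(index, data, cp):
    # General path: works for every code point covered by the trie.
    return data[index[cp >> SHIFT] + (cp & MASK)]

def lookup_latin1(data, cp):
    # Fast path: one array access, valid only for cp <= 0xFF.
    return data[cp]
```

The cost is that the eight Latin-1 blocks are always stored in full, even when some of them are identical.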

15
Example: Normalization Data
  • Normalization data is stored using UTries
  • For example, main data has the following format:

    Bits 31..16   Extra data index
    Bits 15..8    Combining class
    Bit  7        BCK (combines back)
    Bit  6        FWD (combines forward)
    Bits 5..4     QC_MAYBE (values for normalization quick check)
    Bits 3..0     QC_NO (values for normalization quick check)

  • The extra data index can be either
      • an index to variable-length data
      • the first part of a supplementary lookup value
      • a special handling indicator (Hangul, Jamo)
  • Variable-length data contains composition and
    decomposition info
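Unpacking such a 32-bit word is a matter of shifts and masks. The sketch below reads the field boundaries off the bit labels in the slide (31, 15, 7, 6, 5, 3, 0); the function name and the dictionary keys are made up for the example and are not ICU identifiers.

```python
def unpack_norm32(word):
    """Split a 32-bit normalization data word into its fields.

    Assumed layout, inferred from the bit labels on the slide:
    31..16 extra data index, 15..8 combining class, 7 combines back,
    6 combines forward, 5..4 QC_MAYBE flags, 3..0 QC_NO flags.
    """
    return {
        "extra_index": (word >> 16) & 0xFFFF,
        "combining_class": (word >> 8) & 0xFF,
        "combines_back": bool(word & 0x80),
        "combines_forward": bool(word & 0x40),
        "qc_maybe": (word >> 4) & 0x3,
        "qc_no": word & 0xF,
    }
```

For instance, a word built as (0x1234 << 16) | (230 << 8) | 0x80 | 0x05 unpacks to extra index 0x1234, combining class 230, "combines back" set, and the QC_NO bits 0x5.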

16
Example: Character Properties Data
  • The result of UTrie lookup is an index
  • Double indexing allows for even better
    compression, since many code points have the same
    property value
  • UTrie data width is 16 bits (thousands of data
    entries), while the property data width is 32
    bits (a few hundred unique data words).

(Diagram: Folded Trie, 16 bits, pointing into Property data, 32 bits)
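The double indexing for properties can be sketched as follows. The names split_properties and get_properties are hypothetical, and a plain callable stands in for the trie lookup. The wide property words are stored once in a small table, and the trie only has to hold narrow indexes into that table.

```python
def split_properties(words):
    """Replace each wide property word by a small index into a table of
    unique words. Returns (indexes, table)."""
    table, positions, indexes = [], {}, []
    for w in words:
        if w not in positions:
            positions[w] = len(table)
            table.append(w)
        indexes.append(positions[w])
    return indexes, table

def get_properties(trie_lookup, table, cp):
    # trie_lookup(cp) returns the narrow index stored in the trie;
    # the wide property word is then read from the shared table.
    return table[trie_lookup(cp)]
```

Because only a few hundred distinct property words exist, the indexes fit comfortably in 16 bits, and runs of equal indexes compress well inside the trie.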
17
International Components for Unicode
  • International Components for Unicode (ICU) is a
    library that provides robust and full-featured
    Unicode support
  • Several library services use the common UTrie
    implementation
  • Wide variety of supported platforms
  • Open source (X license, non-viral)
  • C/C++ and Java versions
  • http://oss.software.ibm.com/icu/

18
Conclusion
  • UTrie data structure provides good compression
    with fast access
  • The main constraint for usage is the nature of
    the data that needs to be stored
  • Designed for repetitive and sparse data

19
Q&A

20
Folding and Surrogate Access
  • Folding process compacts the index for
    supplementaries and moves it right above the BMP
    index
  • Access in ICU4C:
      • Define a C callback, invoked when a special lead
        surrogate is detected
      • Manually detect special lead surrogates
  • In ICU4J, provide a subclass with a method that
    detects special lead surrogates
21
Summary
  • Introduction: Storing Unicode data
  • Types of data structures
  • Tries
  • Single-index trie
  • Double-index trie
  • Folded trie
  • Usage of folded trie in normalization
  • Usage of folded trie for character properties