Folded Trie: Efficient Data Structure for All of Unicode

Dublin, Ireland, May 2002

Learn more at: https://icu-project.org

1
Folded Trie: Efficient Data Structure for All of Unicode

Vladimir Weinstein
vweinste_at_us.ibm.com

Globalization Center of Competency, San Jose, CA
2
Introduction

A lot of data for each code point
Need appropriate data structures
Unicode version 3.1 introduced code points into the supplementary space; the addressable range grew to more than a million
Repetitive data
Sparsely populated range, especially the supplementary space

3
Data Structures

Arrays
Advantages: very fast access time, fast write time
Disadvantage: unacceptable memory consumption
Hash tables
Advantages: easy to use, reasonably fast, general
Disadvantages: high overhead, complicated sequential access, slower than array lookup, data within ranges is not shared
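
To put the array's memory cost in numbers, here is a small calculation of the size of a flat table with one entry per code point. The entry widths are assumptions chosen for illustration; the slides do not state a specific width for the array case.

```python
# Size of a flat lookup table covering every Unicode code point.
CODE_POINT_COUNT = 0x110000  # U+0000..U+10FFFF


def flat_array_bytes(bytes_per_entry: int) -> int:
    """Bytes needed for one array entry per code point."""
    return CODE_POINT_COUNT * bytes_per_entry


for width in (1, 2, 4):  # assumed entry widths in bytes
    print(f"{width} byte(s) per entry: {flat_array_bytes(width):,} bytes")
```

With 32-bit entries this comes to more than 4 MB per property table, most of it repeating the same value.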

4
Data Structures (continued)

Inversion Maps
Advantages: simple, very compact, fast boolean operations
Disadvantages: worse access time than arrays and possibly hash tables
For more details see Bits of Unicode at http://www.macchiato.com/slides/Bits_of_Unicode.ppt
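
A minimal sketch of the inversion map idea, assuming the common formulation of a sorted list of range starts with one value per range. The class name and the sample data are hypothetical and are not taken from the Bits of Unicode slides.

```python
from bisect import bisect_right


class InversionMap:
    """Maps code point ranges to values via a sorted list of range starts."""

    def __init__(self, starts, values):
        # starts[i] is the first code point of range i; range i ends just
        # before starts[i + 1]. starts[0] must be 0 so every code point is covered.
        assert starts[0] == 0 and len(starts) == len(values)
        assert all(a < b for a, b in zip(starts, starts[1:]))
        self.starts = starts
        self.values = values

    def get(self, cp):
        # Binary search: O(log n) per lookup, versus O(1) for a flat array.
        return self.values[bisect_right(self.starts, cp) - 1]


# Hypothetical sample data: a toy "is ASCII letter" property.
letters = InversionMap(
    [0x00, 0x41, 0x5B, 0x61, 0x7B],
    [False, True, False, True, False],
)
print(letters.get(ord("A")), letters.get(ord("1")))
```

Five integers describe the whole Unicode range here, which is where the compactness comes from; the binary search is the price in access time.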

5
Tries

A trie is a structure with one or more indexes and one data storage.
The name comes from "information retrieval"
Shares repetitive data
Good compaction
Not appropriate for frequently changing data

6
Single-Index Trie

A trie structure with an index array and a data array.
Advantages:
Excellent size
Very good access performance (two array accesses, shift, mask and addition)
Disadvantages:
Not appropriate for frequently changing data
Index array gets too big when dealing with supplementary code points
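
The lookup described above (two array accesses, shift, mask and addition) can be sketched as follows. The block size of 32 and the sample property are assumptions for illustration, and the builder is a simplified stand-in, not ICU's.

```python
SHIFT = 5                 # assumed block size: 2**5 = 32 entries
BLOCK = 1 << SHIFT
MASK = BLOCK - 1


def build_single_index(values):
    """Split `values` into blocks and store each distinct block only once."""
    assert len(values) % BLOCK == 0
    index, data, seen = [], [], {}
    for start in range(0, len(values), BLOCK):
        block = tuple(values[start:start + BLOCK])
        if block not in seen:
            seen[block] = len(data)   # offset of this block in the data array
            data.extend(block)
        index.append(seen[block])
    return index, data


def lookup(index, data, c):
    # Two array accesses, a shift, a mask and an addition.
    return data[index[c >> SHIFT] + (c & MASK)]


# Hypothetical sample property over the BMP: 1 for ASCII digits, else 0.
values = [1 if 0x30 <= c <= 0x39 else 0 for c in range(0x10000)]
index, data = build_single_index(values)
print(len(index), len(data), lookup(index, data, ord("7")))
```

For this sample the 65,536 values shrink to 64 data entries, but the index still needs one entry per block, which is what grows too large once supplementary code points are included.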

7
Single-Index Trie Diagram
8
Double-Index Trie

Two index arrays and a data block
Compared to a single-index trie:
Provides better compression of the index array
Worse performance, but still very fast
Feasible for supplementary code points
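
A sketch of the double-index idea, assuming the second index is compacted with the same block-sharing technique as the data. The shift widths and the sample property are illustrative assumptions.

```python
DATA_SHIFT = 5                      # assumed: 32 values per data block
INDEX_SHIFT = 6                     # assumed: 64 entries per second-level index block
DATA_BLOCK, INDEX_BLOCK = 1 << DATA_SHIFT, 1 << INDEX_SHIFT


def compact(items, block_len):
    """Store each distinct block of `items` once; return (offsets, storage)."""
    offsets, storage, seen = [], [], {}
    for start in range(0, len(items), block_len):
        block = tuple(items[start:start + block_len])
        if block not in seen:
            seen[block] = len(storage)
            storage.extend(block)
        offsets.append(seen[block])
    return offsets, storage


def build_double_index(values):
    index2_full, data = compact(values, DATA_BLOCK)      # one entry per data block
    index1, index2 = compact(index2_full, INDEX_BLOCK)   # compress the index itself
    return index1, index2, data


def lookup(index1, index2, data, c):
    # Three array accesses instead of two.
    i2 = index1[c >> (DATA_SHIFT + INDEX_SHIFT)] + ((c >> DATA_SHIFT) & (INDEX_BLOCK - 1))
    return data[index2[i2] + (c & (DATA_BLOCK - 1))]


# Hypothetical sample property over all of Unicode: 1 for ASCII digits, else 0.
values = [1 if 0x30 <= c <= 0x39 else 0 for c in range(0x110000)]
index1, index2, data = build_double_index(values)
print(len(index1), len(index2), len(data))
```

For this sample the two index arrays together hold 672 entries, where a single index with the same block size would need 34,816.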

9
Double-Index Trie Diagram
10
Folded Trie

Fast access for BMP code points
Slower access for supplementary code points, but these are far less frequent
Compacts the supplementary index
Needs additional build-time processing
Fast addressing with UTF-16 code units: no need to construct the code point

11
Folded Trie Supplementary Access Diagram
Diagram label: Final Data

BMP code point access is the same as with a single-index trie
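
The following is a simplified sketch of folding and of lookup by UTF-16 code units. It is not ICU's UTrie layout: all function names are hypothetical, the block size is assumed, and the sketch stores the folding offset as a negative number in the lead surrogate's own BMP slot, which a real implementation may handle differently.

```python
SHIFT = 5                            # assumed: 32 values per data block
BLOCK = 1 << SHIFT
MASK = BLOCK - 1
BMP_INDEX_LEN = 0x10000 >> SHIFT     # index entries for the BMP
LEAD_SPAN = 0x400                    # code points covered by one lead surrogate


def build_folded(value_of, default=0):
    """Return (index, data). Values must be non-negative in this sketch,
    because a negative entry under a lead surrogate marks a folding offset."""
    data, data_seen, folded, fold_seen = [], {}, [], {}

    def add_block(block):
        block = tuple(block)
        if block not in data_seen:
            data_seen[block] = len(data)
            data.extend(block)
        return data_seen[block]

    lead_entry = {}
    for lead in range(0xD800, 0xDC00):
        base = 0x10000 + ((lead - 0xD800) << 10)
        vals = [value_of(c) for c in range(base, base + LEAD_SPAN)]
        if all(v == default for v in vals):
            lead_entry[lead] = 0              # nothing stored for this lead
            continue
        idx = tuple(add_block(vals[i:i + BLOCK]) for i in range(0, LEAD_SPAN, BLOCK))
        if idx not in fold_seen:              # identical index blocks are shared
            fold_seen[idx] = BMP_INDEX_LEN + len(folded)
            folded.extend(idx)
        lead_entry[lead] = -fold_seen[idx]    # offset of the folded index block

    bmp = [lead_entry[c] if 0xD800 <= c < 0xDC00 else value_of(c) for c in range(0x10000)]
    bmp_index = [add_block(bmp[i:i + BLOCK]) for i in range(0, 0x10000, BLOCK)]
    return bmp_index + folded, data           # folded index sits right above the BMP index


def get_bmp(index, data, unit):
    # Same access as a single-index trie.
    return data[index[unit >> SHIFT] + (unit & MASK)]


def get_pair(index, data, lead, trail, default=0):
    offset = -get_bmp(index, data, lead)      # folding offset stored under the lead unit
    if offset <= 0:
        return default
    return data[index[offset + ((trail & 0x3FF) >> SHIFT)] + (trail & MASK)]


def to_utf16(c):
    """Split a supplementary code point into (lead, trail); used only for checking."""
    c -= 0x10000
    return 0xD800 + (c >> 10), 0xDC00 + (c & 0x3FF)


def sample_value(c):
    # Hypothetical property: 1 for U+1D400..U+1D7FF, 2 for A-Z, else 0.
    if 0x1D400 <= c <= 0x1D7FF:
        return 1
    if 0x41 <= c <= 0x5A:
        return 2
    return 0


index, data = build_folded(sample_value)
print(len(index), get_pair(index, data, 0xD835, 0xDC00))
```

The supplementary lookup works directly on the lead and trail code units and never assembles the code point. Only leads with non-default data add entries to the index, so in this sample the whole supplementary range costs 32 extra index entries.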

12
ICU Implementation: UTrie

ICU implementation is called UTrie
Stores either 16-bit or 32-bit wide data (extensible in the future)
Up to 256K different data elements
Can be frozen and reused as a memory-mapped image for fast startup
Using UTrie requires custom code
More about ICU at the end of the presentation

13
Range Enumeration

Allows enumeration of maximal contiguous ranges that share the same data element
Elements can be preprocessed by an additional callback
Saves time when processing the whole Unicode range by efficiently walking the trie structure
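
A sketch of range enumeration over a single-index trie, under the assumption that the time saving comes from skipping blocks that are shared and uniform. The function names, block size and sample property are hypothetical; this is not ICU's enumeration API.

```python
SHIFT = 5                 # assumed block size: 32 entries
BLOCK = 1 << SHIFT
_UNSET = object()         # sentinel: no value seen yet


def build(values):
    """Simplified single-index trie builder with block sharing."""
    index, data, seen = [], [], {}
    for start in range(0, len(values), BLOCK):
        block = tuple(values[start:start + BLOCK])
        if block not in seen:
            seen[block] = len(data)
            data.extend(block)
        index.append(seen[block])
    return index, data


def enum_ranges(index, data, map_value=lambda v: v):
    """Yield (start, limit, value) for maximal runs of equal mapped values.
    `map_value` plays the role of the preprocessing callback and must be pure."""
    start, current, uniform_block = 0, _UNSET, None
    for i, offset in enumerate(index):
        if offset == uniform_block:
            continue                      # shared block known to hold only `current`
        uniform = True
        for j in range(BLOCK):
            value = map_value(data[offset + j])
            if value != current:
                c = (i << SHIFT) + j
                if c > start:
                    yield start, c, current
                start, current = c, value
                uniform = (j == 0)        # a change inside the block breaks uniformity
        uniform_block = offset if uniform else None
    yield start, len(index) << SHIFT, current


# Hypothetical sample property over all of Unicode: 1 for ASCII digits, else 0.
values = [1 if 0x30 <= c <= 0x39 else 0 for c in range(0x110000)]
index, data = build(values)
for start, limit, value in enum_ranges(index, data):
    print(f"{start:06X}..{limit - 1:06X}: {value}")
```

After the first few blocks, every remaining index entry points at the same all-zero block and is skipped without reading the data array.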

14
Latin-1 Fast Path

Build-time option
Allows direct array access for the Latin-1 range (0x00-0xFF)
Latin-1 range is not compressed if this option is used
Appropriate when access for the Latin-1 range is critical, for example in collation
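
One way to picture the Latin-1 fast path is sketched below, assuming the option simply keeps the first 256 values in order and unshared at the start of the data array. The `latin1_linear` flag and function names are hypothetical and are not ICU's option names.

```python
SHIFT = 5                 # assumed block size: 32 entries
BLOCK = 1 << SHIFT
MASK = BLOCK - 1


def build(values, latin1_linear=False):
    index, data, seen = [], [], {}
    for start in range(0, len(values), BLOCK):
        block = tuple(values[start:start + BLOCK])
        if latin1_linear and start < 0x100:
            # Keep Latin-1 blocks in order and unshared, so data[c] == values[c].
            offset = len(data)
            data.extend(block)
            seen.setdefault(block, offset)    # later blocks may still share these
        elif block in seen:
            offset = seen[block]
        else:
            offset = seen[block] = len(data)
            data.extend(block)
        index.append(offset)
    return index, data


def lookup(index, data, c):
    # Regular trie access, valid for every code point in the trie.
    return data[index[c >> SHIFT] + (c & MASK)]


def lookup_latin1(data, c):
    # Direct access; valid only for c <= 0xFF in a trie built with latin1_linear=True.
    return data[c]


# Hypothetical sample property over the BMP: 1 for ASCII digits, else 0.
values = [1 if 0x30 <= c <= 0x39 else 0 for c in range(0x10000)]
index_small, data_small = build(values)
index_fast, data_fast = build(values, latin1_linear=True)
print(len(data_small), len(data_fast), lookup_latin1(data_fast, ord("7")))
```

The trade-off is visible in the sizes: the uncompressed Latin-1 range costs 256 data entries instead of 64 in this sample, in exchange for a single array access.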

15
Example: Normalization Data

Normalization data is stored using UTries
For example, main data has the following format:

Bits    Field             Meaning
31..16  Extra data index  One of: index to variable-length data; first part of supplementary lookup value; special handling indicator (Hangul, Jamo)
15..8   Combining class
7       BCK               Combines back
6       FWD               Combines forward
5..4    QC_MAYBE          Values for normalization quick check
3..0    QC_NO             Values for normalization quick check

Variable-length data contains composition and
decomposition info
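
A sketch of unpacking the 32-bit main data word. It assumes that the bit boundaries shown on the slide (31, 15, 7, 6, 5, 3, 0) delimit the fields in the listed order; the type name, field names and the sample word are made up for illustration.

```python
from typing import NamedTuple


class NormWord(NamedTuple):
    extra_index: int      # bits 31..16 (assumed boundary)
    combining_class: int  # bits 15..8
    combines_back: bool   # bit 7 (BCK)
    combines_fwd: bool    # bit 6 (FWD)
    qc_maybe: int         # bits 5..4 (assumed boundary)
    qc_no: int            # bits 3..0


def unpack(word: int) -> NormWord:
    """Split a 32-bit word into its fields with shifts and masks."""
    return NormWord(
        extra_index=(word >> 16) & 0xFFFF,
        combining_class=(word >> 8) & 0xFF,
        combines_back=bool(word & 0x80),
        combines_fwd=bool(word & 0x40),
        qc_maybe=(word >> 4) & 0x3,
        qc_no=word & 0xF,
    )


def pack(w: NormWord) -> int:
    """Inverse of unpack."""
    return (
        (w.extra_index << 16)
        | (w.combining_class << 8)
        | (0x80 if w.combines_back else 0)
        | (0x40 if w.combines_fwd else 0)
        | (w.qc_maybe << 4)
        | w.qc_no
    )


sample = 0x0123E6C5   # made-up word, not real normalization data
print(unpack(sample))
```

Packing several small fields into one word is what lets a single trie lookup answer the quick check, combining class and combining behavior questions at once.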

16
Example: Character Properties Data

The result of the UTrie lookup is an index into the property data
This double indexing allows for even better compression, since many code points have the same property value
UTrie data width is 16 bits (thousands of data entries), while the property data width is 32 bits (a few hundred unique data words).
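
The double indexing can be sketched as interning the 32-bit property words and keeping only a small index per code point. In this sketch a plain list stands in for the trie, and the property word constants are invented for illustration.

```python
def intern_properties(prop_words):
    """Replace each property word by an index into a table of unique words."""
    table, position, indexes = [], {}, []
    for word in prop_words:
        if word not in position:
            position[word] = len(table)
            table.append(word)
        indexes.append(position[word])
    return indexes, table


def get_properties(indexes, table, c):
    # First lookup yields a small index (16 bits would suffice),
    # second lookup yields the full 32-bit property word.
    return table[indexes[c]]


# Hypothetical 32-bit property words for a toy ASCII-only data set.
OTHER, DIGIT, LETTER = 0x00000000, 0x00020009, 0x00010005


def toy_word(c):
    if 0x30 <= c <= 0x39:
        return DIGIT
    if 0x41 <= c <= 0x5A or 0x61 <= c <= 0x7A:
        return LETTER
    return OTHER


words = [toy_word(c) for c in range(0x80)]
indexes, table = intern_properties(words)
print(len(table), hex(get_properties(indexes, table, ord("7"))))
```

Because the per-code-point entries are now narrow and highly repetitive, they compress well in the trie, while each wide property word is stored only once.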

Diagram: Folded Trie (16 bits) indexing into Property data (32 bits)
17
International Components for Unicode

International Components for Unicode (ICU) is a library that provides robust and full-featured Unicode support
Several library services use the common UTrie implementation
Wide variety of supported platforms
Open source (X license, non-viral)
C/C++ and Java versions
http://oss.software.ibm.com/icu/

18
Conclusion

UTrie data structure provides good compression
with fast access
The main constraint for usage is the nature of
the data that needs to be stored
Designed for repetitive and sparse data

19
Q & A

20
Folding and Surrogate Access

The folding process compacts the index for supplementary code points and moves it right above the BMP index
Access in ICU4C:
Define a C callback, invoked when a special lead surrogate is detected
Manually detect special lead surrogates
Access in ICU4J:
Provide a subclass with a method that detects special lead surrogates
21
Summary