Introduction to PAT-Tree and its variations

1
Introduction to PAT-Tree and its variations
  • Kenny Kwok
  • clkwok_at_cse.cuhk.edu.hk
  • Department of Computer Science and Engineering
  • The Chinese University of Hong Kong
  • Shatin, N.T., Hong Kong SAR

2
Outline
  • Definition of PAT Tree
  • PAT Tree on Chinese Document
  • Modified structure of PAT Tree
  • Application examples
  • Conclusion

3
PAT tree
  • Definition: a PATRICIA tree that stores every
    semi-infinite string (sistring) of a document
  • Two concepts we need to know first:
  • PATRICIA TREE
  • SISTRING

4
PATRICIA TREE
  • A particular type of trie: a binary trie in which
    nodes with only one child are skipped
  • Example: a trie and a PATRICIA tree holding the keys
    010, 011, and 101.

5
PATRICIA TREE
  • Therefore, each internal node of a PATRICIA tree
    has the following attributes:
  • Index bit (check bit)
  • Child pointers (each internal node has exactly 2
    children)
  • Leaf nodes, on the other hand, must store the
    actual content for the final comparison
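
The node layout above can be sketched in a few lines. This is a minimal illustration, assuming distinct keys of equal length, bit positions counted from 0 at the left, and a build step that partitions the whole key set at once instead of inserting keys one by one; the names `Leaf`, `Internal`, `build`, and `search` are made up for this sketch.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    key: str  # full key, kept for the final comparison


@dataclass
class Internal:
    check_bit: int  # index of the bit tested at this node (0 = leftmost)
    zero: "Node"    # subtree whose keys have 0 at check_bit
    one: "Node"     # subtree whose keys have 1 at check_bit


Node = Union[Leaf, Internal]


def build(keys: list[str], start: int = 0) -> Node:
    """Build a PATRICIA tree from distinct, equal-length bit strings."""
    if len(keys) == 1:
        return Leaf(keys[0])
    # Skip every bit position on which all remaining keys agree.
    bit = start
    while len({k[bit] for k in keys}) == 1:
        bit += 1
    zeros = [k for k in keys if k[bit] == "0"]
    ones = [k for k in keys if k[bit] == "1"]
    return Internal(bit, build(zeros, bit + 1), build(ones, bit + 1))


def search(node: Node, key: str) -> bool:
    """Follow check bits down to a leaf, then compare the whole key."""
    while isinstance(node, Internal):
        node = node.one if key[node.check_bit] == "1" else node.zero
    return node.key == key


root = build(["010", "011", "101"])
```

With these keys the root tests bit 0 and its zero-side child tests bit 2, since 010 and 011 agree on bit 1. Searching for 000 reaches the leaf holding 010 and only fails at the final comparison, which is why leaves must store the actual content.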

6
SISTRING
  • Sistring is short for Semi-Infinite String
  • Any string, whatever it actually represents, is
    a binary bit pattern (e.g. 11001)
  • One sistring of the above example is 11001000
    (the string itself, padded with zeros)
  • There are 5 sistrings in total in this example,
    one starting at each bit position

7
SISTRING
  • Sistrings are theoretically of infinite length
  • 110010000
  • 10010000
  • 0010000
  • 010000
  • 10000
  • In practice, we cannot store infinitely long
    strings. For the above example, we only need to
    store each sistring up to 5 bits long; that is
    enough to distinguish each one from the others.
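
The list of sistrings and the 5-bit claim can be checked mechanically. A small sketch, assuming zero padding of 4 bits as in the list above; `sistrings` and `prefixes_distinct` are illustrative names.

```python
def sistrings(bits: str, pad: int) -> list[str]:
    """Return every suffix of `bits`, each followed by `pad` zero bits."""
    return [bits[i:] + "0" * pad for i in range(len(bits))]


def prefixes_distinct(strings: list[str], length: int) -> bool:
    """True if truncating every string to `length` keeps them all distinct."""
    return len({s[:length] for s in strings}) == len(strings)


sis = sistrings("11001", pad=4)
enough_at_5 = prefixes_distinct(sis, 5)
enough_at_3 = prefixes_distinct(sis, 3)
```

Here `enough_at_5` is true and `enough_at_3` is false. For this particular input the check also passes at 4 bits, so 5 bits is sufficient but not the minimum.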

8
SISTRING
  • The bit level is too abstract; depending on the
    application, we rarely work at the bit level.
    The character level is a better idea!
  • e.g. CUHK
  • The corresponding sistrings would be
  • CUHK000
  • UHK000
  • HK000
  • K000
  • We require each to be at least 4 characters
    long.
  • (Why do we pad 0/NULL at the end of a sistring?)

9
SISTRING (USAGE)
  • Sistrings are efficient for storing substring
    information.
  • A string with n characters has n(n+1)/2
    substrings. Since the longest one has length n,
    the storage requirement for all substrings is
    O(n^3).
  • e.g. CUHK is 4 characters long and consists of
    4(5)/2 = 10 different substrings: C, U, ..., CU,
    UH, ..., CUH, UHK, CUHK.
  • Storage requirement is O(n^2) x max(length) -> O(n^3)
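
The counting argument can be verified by enumerating the substrings directly. A short sketch; `substrings` is an illustrative helper that counts substrings by position, so repeated text would be counted once per occurrence.

```python
def substrings(text: str) -> list[str]:
    """Every non-empty contiguous substring, one per (start, end) pair."""
    n = len(text)
    return [text[i:j] for i in range(n) for j in range(i + 1, n + 1)]


subs = substrings("CUHK")
count = len(subs)                         # n(n+1)/2 substrings
total_chars = sum(len(s) for s in subs)   # characters needed to store them all
```

For CUHK this gives 10 substrings and 20 stored characters. The exact total is n(n+1)(n+2)/6, which grows as O(n^3), in line with the bound above.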

10
SISTRING (USAGE)
  • We may instead store the sistrings of CUHK,
    which requires O(n^2) storage.
  • CUHK <- represents C, CU, CUH, CUHK at the same time
  • UHK0 <- represents U, UH, UHK at the same time
  • HK00 <- represents H, HK at the same time
  • K000 <- represents K only
  • Prefix matching on sistrings is equivalent to
    exact matching on the substrings.
  • Conclusion: sistrings are a better representation
    for storing substring information.
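
The equivalence between prefix matching and substring matching can be sketched as follows. The character "0" stands in for the NULL padding, so this sketch assumes the text itself contains no "0"; `char_sistrings` and `contains` are illustrative names.

```python
def char_sistrings(text: str, pad: str = "0") -> list[str]:
    """Suffixes of `text`, right-padded with `pad` to the full length."""
    n = len(text)
    return [text[i:].ljust(n, pad) for i in range(n)]


def contains(sistrings: list[str], query: str) -> bool:
    """A query is a substring exactly when it is a prefix of some sistring."""
    return any(s.startswith(query) for s in sistrings)


sis = char_sistrings("CUHK")
stored_chars = sum(len(s) for s in sis)   # n sistrings of length n
```

The four sistrings occupy 16 characters in total (n^2), yet `contains` answers every substring query: UH and HK match, while UK does not.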

11
PAT Tree
  • Now it is time to return to the PAT tree
  • A PAT tree is a PATRICIA tree that stores every
    sistring of a document
  • What if the document contains simply
    CUHK?
  • We prefer characters at this point, but PATRICIA
    works on bits; therefore, we have to know the
    bit pattern of each sistring in order to see
    what the resulting PAT tree actually looks like
  • This looks tedious even for a small example, but
    it is how the PAT tree works!

12
PAT Tree (Example)
  • By converting the string to bits, we can work out
    by hand what the PAT tree looks like.
  • The following is the actual bit pattern of the four
    sistrings
  • Once we understand how the PAT tree works, we
    won't go into this detail in later examples.
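
The bit patterns can be regenerated with a few lines of code. This sketch assumes 8-bit ASCII codes and NUL padding, which may differ from the encoding used on the original slide, and it counts bit positions from 0 at the left; `to_bits` and `first_difference` are illustrative names.

```python
def to_bits(text: str) -> str:
    """Encode text as ASCII and render each byte as 8 bits."""
    return "".join(f"{byte:08b}" for byte in text.encode("ascii"))


def first_difference(a: str, b: str) -> int:
    """Index (0 = leftmost) of the first bit where a and b differ."""
    return next(i for i, (x, y) in enumerate(zip(a, b)) if x != y)


text = "CUHK"
# NUL-padded character sistrings: CUHK, UHK + 1 NUL, HK + 2 NULs, K + 3 NULs
sis = [text[i:].ljust(len(text), "\0") for i in range(len(text))]
patterns = {s.rstrip("\0"): to_bits(s) for s in sis}

for name, bits in patterns.items():
    print(f"{name:<4} {bits}")
```

The first differing bit between two sistrings is where a PATRICIA internal node would place its check bit. Under these assumptions, CUHK and UHK first differ at bit 3, and HK and K at bit 6.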

13
PAT Tree
  • We do not view a document as a packed
    string of characters. A document consists of
    words, e.g. "Hello. This is a simple document."
  • In this case, sistrings can be applied at the
    document level: the document is treated as one
    big string, and we may tokenize it word by word
    instead of character by character.

14
PAT Tree (Example)
  • This works! BUT
  • We still need O(n^2) memory for storing those
    sistrings
  • We may reduce the memory to O(n) by making use
    of pointers.

15
PAT Tree (Actual Structure)
  • We need to maintain only the document itself
  • The PAT tree acts as an index structure
  • Memory requirement
  • Document, O(n)
  • PAT tree index, O(n)
  • Leaf pointers, O(n)
  • Therefore, the PAT tree is a linear-size data
    structure that carries O(n^3) worth of substring
    information
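
The pointer idea can be illustrated without building the tree itself. In this sketch each word-level sistring is represented by a single integer, the index of its first word, and is read from the document only when needed. The linear scan in `find` is a stand-in for walking the PAT tree, and all names are illustrative.

```python
document = "Hello. This is a simple document."
words = document.split()

# One integer per sistring: the index of the word where it starts.
pointers = list(range(len(words)))


def sistring_at(pointer: int) -> list[str]:
    """Materialise a word-level sistring only when it is needed."""
    return words[pointer:]


def find(phrase: str) -> list[int]:
    """Pointers whose sistring starts with the given phrase."""
    target = phrase.split()
    return [p for p in pointers if sistring_at(p)[: len(target)] == target]


hits = find("is a")
```

The index holds 6 integers for 6 words, so its size is linear in the document, while any phrase such as "is a" can still be located through its pointer.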

16
The Chinese PAT tree
  • We can build a PAT tree for English easily.
    Sistrings are decomposed word by word.
  • In a Chinese document, the layout gives
    no indication of word boundaries; the characters
    are packed together.
  • e.g. ?????
  • We know there are 5 characters, but what else?
  • In fact, there are 2 words, ?? and ???, but we
    have no way to know this just by reading
    the text without any other supporting knowledge.

17
Semi-Infinite String (Sistring)
  • Sistrings are null-padded strings
  • The sistrings become
  • ?????
  • ????00
  • ???0000
  • ??000000
  • ?00000000
  • This makes sistrings comparable to one another
  • We can examine any particular bit of a sistring,
    and no sistring will be missing that bit

18
The Chinese PAT tree
  • In research on Chinese information
    processing, researchers suggest building sistrings
    for Chinese documents at the sentence level
  • i.e. each document is decomposed into sentences
    at its punctuation marks.
  • ????,??? will be viewed as 2 sentences, ????
    and ??
  • For each sentence, the sistrings are
    obtained as before, e.g. ????, ???, ??, etc.

19
The Chinese PAT tree
  • In this way, the Chinese PAT tree is built. Since
    every Chinese word must be a substring of the
    document, all Chinese words can still be found in
    the Chinese PAT tree efficiently.
  • Therefore, Chinese word segmentation is one of
    the most important applications of the PAT tree.

20
The Chinese PAT Tree Structure
  • In the Chinese PAT tree, a document is decomposed
    into sentences. It is possible that a sistring of
    one sentence also occurs as a sistring of another
    sentence.
  • e.g. ????,????. The sistring ?? appears twice.
    One of them will be absorbed by the other.
  • Therefore, we usually attach a frequency count
    to each leaf node of the tree.

21
The Chinese PAT Tree Structure
  • Internal nodes remain the same: they hold the
    check-bit information
  • Leaf nodes now have a frequency count
    attribute
  • The document is decomposed into a number of
    sentences.
  • Storage complexity remains O(n).
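
Sentence-level sistrings with frequency counts can be sketched as follows. The sample text is a made-up illustration, the punctuation set is an assumption, and a `Counter` stands in for the leaf nodes of the tree: duplicate sistrings share one entry whose count is incremented.

```python
import re
from collections import Counter


def sentences(document: str) -> list[str]:
    """Split on common Chinese and ASCII punctuation marks."""
    return [s for s in re.split(r"[,。!?;,.!?;]", document) if s]


def sistring_counts(document: str) -> Counter:
    """Count every sistring (suffix) of every sentence."""
    counts = Counter()
    for sentence in sentences(document):
        for i in range(len(sentence)):
            counts[sentence[i:]] += 1
    return counts


# Hypothetical two-sentence document sharing the ending 愛香港.
counts = sistring_counts("我愛香港,他愛香港。")
```

Both sentences end in the same three characters, so the sistrings 愛香港, 香港, and 港 each get a count of 2, while the two full sentences keep a count of 1.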

22
Structure modification
  • We can see that the node structures for internal
    nodes and leaf nodes are not the same
  • The tree will be more flexible if its nodes are
    generic (share a universal node structure)
  • Trade-off: a generic node structure enlarges
    the individual node size
  • But...
  • Memory is cheap now
  • Even a low-end computer can support hundreds of MB
    of RAM
  • The modified tree is still an O(n) structure

23
Structure of the modified node
  1. Check Bit
  2. Frequency Count
  3. Link to a sistring
  4. Pointers to the child nodes
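
The four fields map naturally onto a single record type. A minimal sketch, assuming the link to a sistring is stored as an integer offset into the document and that there are two child pointers; the class and field names are illustrative, not taken from the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PatNode:
    check_bit: int                      # 1. bit tested at this node
    frequency: int                      # 2. occurrences of the sistring
    sistring_start: int                 # 3. link: offset into the document
    left: Optional["PatNode"] = None    # 4. child followed on bit 0
    right: Optional["PatNode"] = None   # 4. child followed on bit 1

    def is_leaf(self) -> bool:
        """A node without children plays the role of a leaf."""
        return self.left is None and self.right is None


leaf_a = PatNode(check_bit=0, frequency=2, sistring_start=4)
leaf_b = PatNode(check_bit=0, frequency=1, sistring_start=0)
parent = PatNode(check_bit=16, frequency=3, sistring_start=0,
                 left=leaf_a, right=leaf_b)
```

Because every node has the same shape, internal nodes can carry a frequency count and a sistring link too, at the cost of a few unused fields in the leaves.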

24
Example of our Modified Version

25
Essential Length
  • Essential length is the length, in bits, of the
    whole Chinese characters a tree node can represent
  • In general, a Chinese character is a double-byte
    character (16 bits)
  • The essential length equals the check bit,
    rounded down to a whole number of Chinese characters
  • e.g. a node with check bit 53
  • It can represent only 3 Chinese characters (48
    bits) but not 4 Chinese characters (64 bits)
  • Its essential length is 48
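
The truncation is a single integer division. A small sketch, assuming 16-bit characters as stated above; `essential_length` and `CHAR_BITS` are illustrative names.

```python
CHAR_BITS = 16  # double-byte Chinese character


def essential_length(check_bit: int, char_bits: int = CHAR_BITS) -> int:
    """Round the check bit down to a whole number of characters, in bits."""
    return (check_bit // char_bits) * char_bits


length = essential_length(53)
chars = length // CHAR_BITS
```

For check bit 53 this yields an essential length of 48 bits, that is, 3 whole characters.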

26
Essential Node
  • We call a node an Essential Node (EN) if and only
    if its
  • Essential length > 32
  • Essential length is at least 16 more than that of
    the nearest ancestor EN
  • Each Essential Node uniquely represents a
    substring (phrase).
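
The two conditions can be read as a simple predicate. This is a sketch under several assumptions: lengths are in bits with 16-bit characters, the first threshold is taken as strictly greater than 32, and a node with no ancestor EN is compared against 0. The function and parameter names are illustrative.

```python
def essential_length(check_bit: int, char_bits: int = 16) -> int:
    """Round the check bit down to a whole number of characters, in bits."""
    return (check_bit // char_bits) * char_bits


def is_essential(check_bit: int, ancestor_en_length: int = 0,
                 min_length: int = 32, char_bits: int = 16) -> bool:
    """Apply both Essential Node conditions to one node."""
    length = essential_length(check_bit, char_bits)
    long_enough = length > min_length            # assumed strict comparison
    grows = length - ancestor_en_length >= char_bits
    return long_enough and grows


first = is_essential(53)                           # length 48, no ancestor EN
same_phrase = is_essential(60, ancestor_en_length=48)
longer = is_essential(80, ancestor_en_length=48)
```

A node with check bit 60 below an EN of length 48 still covers only 3 characters, so it adds no new phrase and is not essential; a node with check bit 80 covers 5 characters and is.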

27
Essential Node
  • With the definition of Essential Node (EN)
  • Each essential node represents a possible
    Chinese substring, e.g. ?????, ???
  • With the generalized structure, each EN also
    has a frequency count, which reflects the
    number of occurrences of its associated
    substring.

28
Essential Node
  • The essential node with
  • Check bit 80
  • Essential length is 80
  • Representing the phrase ?????
  • Check bit 48
  • Essential length is 48
  • Representing the phrase ???

29
Applications
  • A PAT tree may embed more information, depending on
    the application
  • Well-known Chinese information processing
    applications include
  • Keyword extraction
  • Sentence segmentation
  • Document classification
  • These show the importance of the PAT tree structure
    to those applications

30
Conclusion
  • The PAT tree is an O(n) data structure for document
    indexing
  • The PAT tree is well suited to substring matching
    problems
  • The Chinese PAT tree builds sistrings at the sentence
    level. A frequency count is introduced to handle
    duplicate sistrings
  • By generalizing the node structure, the modified
    version increases the PAT tree's capability for
    various applications