Title: Phrase Hierarchy Inference
1Phrase Hierarchy Inference
- Gordon Paynter, UC Riverside
- Craig Nevill-Manning, Google
- Ian Witten, University of Waikato
2Outline
- Overlapping vs non-overlapping phrases
- Memory-based algorithm
- Suffix trees
- Suffix arrays
- Multipass algorithm
3Non-overlapping phrases
- Given a text, parse it into a tree of repeated phrases
- Advantage
- Based on existing data compression algorithms
- Disadvantage
- Sometimes arbitrary association of words
In the beginning, God created the heaven and the
earth
4Overlapping Phrases
- Instead, we count all repeating phrases, even if two phrases overlap
- Limit phrase length to, say, ten words
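As a rough illustration of counting overlapping phrases, the sketch below counts every contiguous word sequence up to a length limit and keeps those that repeat. The function names and the `max_len` parameter are assumptions made for this example, not part of the original system.

```python
from collections import Counter

def count_phrases(words, max_len=10):
    """Count every contiguous word sequence of 1..max_len words."""
    counts = Counter()
    for start in range(len(words)):
        for length in range(1, max_len + 1):
            if start + length > len(words):
                break
            counts[tuple(words[start:start + length])] += 1
    return counts

def repeated_phrases(words, max_len=10):
    """Keep only the phrases that occur at least twice (overlaps allowed)."""
    return {p: c for p, c in count_phrases(words, max_len).items() if c >= 2}

words = "the old man and the sea and the old man".split()
repeats = repeated_phrases(words, max_len=3)
```

Here "the old", "old man", and "the old man" are all counted, even though they overlap in the text.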
5Memory-based Algorithm
- For each word w
- Everywhere that word occurs, consider the phrase formed by the word plus the word to the left (aw)
- Similarly for words to the right (wa)
- If the phrase is always preceded or followed by the same word, extend the phrase
- If the phrase begins or ends with a stopword, extend the phrase
- Add all the extended phrases to the list of expansions for w
- For each phrase p
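The "always followed by the same word" rule can be sketched naively as follows. This covers only extension to the right (the left case is symmetric) and ignores the stopword rule. The helper names are hypothetical, and the requirement that a phrase occur at least twice before it is extended is an assumption of this sketch.

```python
def occurrences(words, phrase):
    """Start positions of every occurrence of phrase (a tuple of words)."""
    n = len(phrase)
    return [i for i in range(len(words) - n + 1)
            if tuple(words[i:i + n]) == phrase]

def extend_right(words, phrase):
    """Grow phrase while every occurrence is followed by the same word."""
    phrase = tuple(phrase)
    while True:
        starts = occurrences(words, phrase)
        if len(starts) < 2:  # assumption: only repeated phrases are extended
            return phrase
        followers = set()
        for i in starts:
            end = i + len(phrase)
            # None stands for "end of text": no word follows this occurrence.
            followers.add(words[end] if end < len(words) else None)
        if len(followers) != 1 or None in followers:
            return phrase
        phrase += (followers.pop(),)

words = "gone with the wind and gone with the wind again".split()
extended = extend_right(words, ("gone",))
```

Each call to `occurrences` scans the whole text, which is exactly the cost an index of the text is meant to avoid.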
6Memory-based Algorithm
- Problem
- How to efficiently find words to the right and left for every occurrence of a word or a phrase?
- Solution
- Suffix trees
7Suffix Tree
- A compacted trie of suffixes
- Trie: a tree containing a set of strings
she sells sea shells on the sea shore
[Figure: trie of the words she, sells, sea, shells, on, the, shore, one character per node, with an end-of-word marker on each word]
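A trie can be sketched with nested dictionaries, one level per character. The `END` marker and the function names below are choices made for this example only.

```python
END = "$"  # end-of-string marker (an assumption of this sketch)

def build_trie(strings):
    """Insert each string character by character into nested dicts."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node[END] = {}  # mark that a complete string ends here
    return root

def contains(trie, s):
    """True if s was inserted as a whole string (not just as a prefix)."""
    node = trie
    for ch in s:
        if ch not in node:
            return False
        node = node[ch]
    return END in node

trie = build_trie("she sells sea shells on the sea shore".split())
```

Words that share a prefix, such as "she", "shells", and "shore", share the path for that prefix.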
8Suffix Tree
- Compacted trie: no nodes with only one child
[Figure: the same trie before and after compaction; chains of single-child nodes collapse into single edges such as "lls", "ore", "on", and "the"]
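Compaction can be sketched on a nested-dictionary trie by merging every chain of single-child nodes into one edge labelled with a string. The representation and the `$` end marker are assumptions of this example.

```python
def build_trie(strings, end="$"):
    """Character trie as nested dicts; `end` marks the end of a string."""
    root = {}
    for s in strings:
        node = root
        for ch in s + end:
            node = node.setdefault(ch, {})
    return root

def compact(trie):
    """Merge each chain of single-child nodes into one string-labelled edge."""
    out = {}
    for label, child in trie.items():
        # Follow the chain while the child has exactly one outgoing edge.
        while len(child) == 1:
            (next_label, next_child), = child.items()
            label += next_label
            child = next_child
        out[label] = compact(child)
    return out

compacted = compact(build_trie("she sells sea shells on the sea shore".split()))
```

After compaction every internal node below the root has at least two children, so "on" and "the" each become a single edge.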
9Suffix Tree
- Compacted trie of all suffixes
she sells sea shells on the sea shore
he sells sea shells on the sea shore
e sells sea shells on the sea shore
 sells sea shells on the sea shore
sells sea shells on the sea shore
ells sea shells on the sea shore
lls sea shells on the sea shore
ls sea shells on the sea shore
s sea shells on the sea shore
 sea shells on the sea shore
sea shells on the sea shore
10Two Surprising Facts
- Even though there are O(n^2) characters in all the suffixes,
- Suffix trees consume O(n) space
- Suffix trees take O(n) time to compute
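The quadratic character count is easy to check on the example sentence: a text of n characters has suffixes of lengths n, n-1, ..., 1, which sum to n(n+1)/2. The variable names below are chosen for this illustration.

```python
text = "she sells sea shells on the sea shore"
n = len(text)

# One suffix starts at every position of the text.
suffixes = [text[i:] for i in range(n)]

# Their lengths are n, n-1, ..., 1, which sum to n(n+1)/2.
total_chars = sum(len(s) for s in suffixes)
```

For this 37-character sentence the suffixes hold 703 characters in total, yet a suffix tree stores them in space proportional to 37.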
11Suffix Tree
- How does the suffix tree help us?
- Build a suffix tree of words (instead of single letters)
- For any word, words to the right are children in the tree
- Compaction means that the longest unique sequence is already computed
- For words to the left, build a suffix tree for the reverse sequence
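To illustrate how children in a word-level tree give the words to the right, here is a simplified stand-in: an uncompacted, depth-limited trie of word suffixes. It is not a linear-time suffix tree, and the names and the `max_depth` parameter are assumptions of this sketch.

```python
def word_suffix_trie(words, max_depth=3):
    """Insert the first max_depth words of every suffix into nested dicts."""
    root = {}
    for start in range(len(words)):
        node = root
        for w in words[start:start + max_depth]:
            node = node.setdefault(w, {})
    return root

def followers(trie, phrase):
    """Words that follow phrase somewhere in the text (children of its node)."""
    node = trie
    for w in phrase:
        node = node.get(w)
        if node is None:
            return []
    return sorted(node)

words = "the old man and the sea".split()
right = word_suffix_trie(words)        # children are words to the right
left = word_suffix_trie(words[::-1])   # reversed text: children are words to the left
```

Looking up "the" in `right` gives the words that follow it; looking up "man" in `left` gives the words that precede it.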
12Suffix Array
 sea shells on the sea shore
 sells sea shells on the sea shore
e sells sea shells on the sea shore
ells sea shells on the sea shore
he sells sea shells on the sea shore
lls sea shells on the sea shore
ls sea shells on the sea shore
s sea shells on the sea shore
sea shells on the sea shore
sells sea shells on the sea shore
she sells sea shells on the sea shore
13Suffix Array
- Advantages
- Simple: 10 lines of code
- Space efficient: one array of pointers
- Disadvantages
- More expensive to create: O(n log n)
- More expensive to operate on (linear scans
instead of following an edge)
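A minimal character-level sketch of a suffix array shows both the simplicity and the linear scan. It assumes the whole text fits in memory and uses a naive sort of the suffixes, which is simpler but slower in the worst case than a dedicated O(n log n) construction. The function names are hypothetical.

```python
def build_suffix_array(text):
    """Suffix start positions, ordered by the suffix each one begins."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """Start positions of pattern, found by binary search plus a linear scan."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # find the first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    # Matching suffixes are adjacent in the array, so scan forward linearly.
    while lo < len(sa) and text[sa[lo]:sa[lo] + m] == pattern:
        hits.append(sa[lo])
        lo += 1
    return sorted(hits)

text = "she sells sea shells on the sea shore"
sa = build_suffix_array(text)
```

The array holds one integer per character of the text, which is the "one array of pointers" space cost.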
14Multi-pass Algorithm
- Disk seeks dominate
- minimize disk seeks
- fit within available memory
- Disk reads are cheap, seeks are expensive
- Make multiple passes over the data, using as
little memory as possible
15Three Phases
- Phase 1: count all single words, two-word phrases, three-word phrases
- Phase 2: make expansion lists for each phrase
- Phase 3: delete uninteresting phrases
16Phase 1 Count Phrases
- Make one pass over the data, counting individual words
- Write out all words that appear more than once
- Make a second pass over the data, counting pairs of words, where both words appear more than once
- Write out all pairs that appear more than once
- Make a third pass over the data, counting triples of words, where both overlapping pairs appear more than once
- Write out all triples that appear more than once
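The three passes can be sketched as below. The sketch keeps the counts in memory rather than writing them out between passes, and it assumes a caller-supplied `read_words` function (hypothetical) that re-reads the data from the start each time it is called.

```python
from collections import Counter, deque

def frequent(counts):
    """Keep items that appear more than once."""
    return {item: c for item, c in counts.items() if c > 1}

def windows(words, n):
    """Yield each run of n consecutive words from a stream."""
    window = deque(maxlen=n)
    for w in words:
        window.append(w)
        if len(window) == n:
            yield tuple(window)

def phase1(read_words):
    # Pass 1: count individual words.
    words = frequent(Counter(read_words()))
    # Pass 2: count pairs in which both words appear more than once.
    pairs = frequent(Counter(p for p in windows(read_words(), 2)
                             if p[0] in words and p[1] in words))
    # Pass 3: count triples in which both overlapping pairs appear more than once.
    triples = frequent(Counter(t for t in windows(read_words(), 3)
                               if t[:2] in pairs and t[1:] in pairs))
    return words, pairs, triples

text = "the old man and the sea and the old man".split()
words, pairs, triples = phase1(lambda: iter(text))
```

Filtering candidates against the previous pass keeps each counting table small, at the cost of one sequential read of the data per pass.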
17Phase 1 Output
words        pairs of words     triples of words
and 31       and the 25         and the sea 3
Gone 2       Gone with 2        Gone with the 2
man 4        man and 3          man and the 2
old 12       old man 2          old man and 2
sea 8        The old 5          The old man 2
the 57       the sea 3          with the Wind 2
Wind 3       the Wind 2
with 17      with the 13
18Phase 2 Make Expansion Lists
- Read all pairs of words that appear more than once (from phase 1)
- Insert each pair in the list for each word
- Read all frequent triples
- Insert each triple in the list for each overlapping pair
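The expansion lists can be sketched as a dictionary from each phrase to the longer phrases that contain it. Representing every phrase as a tuple of words, including single words, is an assumption of this example, as is the function name.

```python
from collections import defaultdict

def phase2(pairs, triples):
    """Map each word to its pairs, and each pair to its triples."""
    expansions = defaultdict(list)
    for pair in pairs:
        # dict.fromkeys drops the duplicate when both words are the same.
        for word in dict.fromkeys(pair):
            expansions[(word,)].append(pair)
    for triple in triples:
        # The two overlapping pairs are the first two and the last two words.
        for pair in dict.fromkeys((triple[:2], triple[1:])):
            expansions[pair].append(triple)
    return expansions

pairs = [("Gone", "with"), ("with", "the"), ("the", "Wind")]
triples = [("Gone", "with", "the"), ("with", "the", "Wind")]
expansions = phase2(pairs, triples)
```

With these pairs and triples, the word "with" expands to "Gone with" and "with the", and the pair "with the" expands to both triples.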
19Phase 2 Output
words        pairs of words     triples of words
and 31       and the 25         and the sea 3
Gone 2       Gone with 2        Gone with the 2
man 4        man and 3          man and the 2
old 12       old man 2          old man and 2
sea 8        The old 5          The old man 2
the 57       the sea 3          with the Wind 2
Wind 3       the Wind 2
with 17      with the 13
20Phase 3
- Delete each phrase in the hierarchy if
- it begins or ends in a stopword (man and)
- it occurs in a particular longer phrase more than 75% of the time (theoretical computer)
- Pointers to that phrase now point to that phrase's expansions
- Process is recursive
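One possible reading of the deletion rules is sketched below. The stopword list, the counts for "theoretical computer", and the function names are all hypothetical, and the sketch assumes `counts` and `expansions` are dictionaries keyed by tuples of words.

```python
STOPWORDS = {"and", "the", "with", "of", "a"}  # hypothetical stopword list

def should_delete(phrase, counts, expansions, threshold=0.75):
    """Apply the two deletion rules to one phrase."""
    if phrase[0].lower() in STOPWORDS or phrase[-1].lower() in STOPWORDS:
        return True
    # Deleted if one longer phrase accounts for more than 75% of its occurrences.
    return any(counts[q] / counts[phrase] > threshold
               for q in expansions.get(phrase, []))

def resolve(phrase, counts, expansions):
    """Replace a deleted phrase by its surviving expansions, recursively."""
    if not should_delete(phrase, counts, expansions):
        return [phrase]
    survivors = []
    for q in expansions.get(phrase, []):
        for r in resolve(q, counts, expansions):
            if r not in survivors:
                survivors.append(r)
    return survivors

# Illustrative counts, not taken from real data.
counts = {
    ("theoretical", "computer"): 10,
    ("theoretical", "computer", "science"): 9,
    ("man", "and"): 3,
    ("man", "and", "the"): 2,
}
expansions = {
    ("theoretical", "computer"): [("theoretical", "computer", "science")],
    ("man", "and"): [("man", "and", "the")],
}
```

A pointer to "theoretical computer" resolves to "theoretical computer science", while "man and" resolves to nothing because its only expansion also ends in a stopword. The recursion terminates because every expansion is longer than the phrase it expands.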
21Phase 3 Output
words        pairs of words     triples of words
and 31       and the 25         and the sea 3
Gone 2       Gone with 2        Gone with the 2
man 4        man and 3          man and the 2
old 12       old man 2          old man and 2
sea 8        The old 5          The old man 2
the 57       the sea 3          with the Wind 2
Wind 3       the Wind 2
with 17      with the 13