1. Intelligent Text Processing, lecture 3: Word distribution laws. Word-based indexing
Szymon Grabowski, sgrabow_at_kis.p.lodz.pl
http://szgrabowski.kis.p.lodz.pl/IPT08/
Łódź, 2008
2. Zipf's law (Zipf, 1935, 1949)
http://en.wikipedia.org/wiki/Zipf's_law
http://ciir.cs.umass.edu/cmpsci646/Slides/ir0820compression.pdf

word-rank × word-freq ≈ constant. That is, a few most frequent words cover a relatively large part of the text, while the majority of words in the given text's vocabulary occur only once or twice. More formally, the frequency of any word is (approximately) inversely proportional to its rank in the frequency table. Zipf's law is empirical!

Example from the Brown Corpus (slightly over 10^6 words): "the" is the most frequent word, with 7% (69971) of all word occurrences. The next word, "of", has 3.5% of the occurrences (36411), followed by "and" with less than 3% (28852). Only 135 items are needed to account for half the Brown Corpus.
3. Does Wikipedia confirm Zipf's law?
http://en.wikipedia.org/wiki/Zipf's_law
Word frequency in Wikipedia, Nov 2006, log-log plot. x is a word's rank, y is the total number of its occurrences. Zipf's law roughly corresponds to the green (1/x) line.
4. Let's check it in Python...
Dickens collection: http://sun.aei.polsl.pl/sdeor/stat.php?sCorpus_dickensucorpus/dickens.bz2
distinct words: 283475
top freq words: ('the', 80103), ('and', 63971), ('to', 47413), ('of', 46825), ('a', 37182)
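A minimal Python 3 sketch of how such counts can be obtained (the file name dickens.txt, the encoding and the regex-based tokenization are assumptions; the slide does not show the original script):

import re
from collections import Counter

# Count word frequencies in a plain-text corpus.
# 'dickens.txt' is a placeholder file name; tokenization is a simple regex.
with open('dickens.txt', encoding='latin-1') as f:
    words = re.findall(r"[a-z']+", f.read().lower())

freq = Counter(words)
print("distinct words:", len(freq))
print("top freq words:", freq.most_common(5))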
5. Lots of words with only a few occurrences (Dickens example continued)
there are 9423 words with freq 1
there are 3735 words with freq 2
there are 2309 words with freq 3
there are 1518 words with freq 4
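Continuing the sketch above, such counts of rare words can be read off a frequency-of-frequencies table (again only a sketch; 'freq' is the Counter from the previous block):

from collections import Counter

# How many distinct words occur exactly 1, 2, 3 and 4 times?
freq_of_freq = Counter(freq.values())
for k in (1, 2, 3, 4):
    print("there are", freq_of_freq[k], "words with freq", k)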
6. Brown corpus statistics
http://ciir.cs.umass.edu/cmpsci646/Slides/ir0820compression.pdf
7. Heaps' law (Heaps, 1978)
http://en.wikipedia.org/wiki/Heaps'_law

Another empirical law. It tells how the vocabulary size V grows with growing text size n (expressed in words): V(n) = K · n^β, where K is typically around 10..100 (for English) and β is between 0.4 and 0.6.

Roughly speaking, the vocabulary (number of distinct words) grows proportionally to the square root of the text length.
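A small sketch of how one could check Heaps' law empirically: measure the vocabulary size at increasing prefixes of a token list and fit K and β by least squares in log-log space (the list 'words' is assumed to come from a tokenized corpus, e.g. the Dickens sketch above; the step size is arbitrary):

import math

def heaps_points(words, step=10000):
    # Vocabulary size V(n) measured every 'step' words.
    seen, points = set(), []
    for n, w in enumerate(words, 1):
        seen.add(w)
        if n % step == 0:
            points.append((n, len(seen)))
    return points

def fit_heaps(points):
    # Least-squares fit of log V = log K + beta * log n.
    xs = [math.log(n) for n, v in points]
    ys = [math.log(v) for n, v in points]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    beta = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    K = math.exp(my - beta * mx)
    return K, beta

# K, beta = fit_heaps(heaps_points(words))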
8. Musings on Heaps' law

The number of distinct words grows without a limit... How is it possible?
Because new words tend to occur in new documents, e.g. (human, geographical etc.) names. But also typos!

On the other hand, the dictionary size grows significantly slower than the text itself. I.e., it doesn't pay much to represent the dictionary succinctly (with compression); dictionary compression ideas which slow down the access should be avoided.
9. Inverted indexes

Almost all real-world text indexes are inverted indexes (in one form or another). An inverted index (Salton and McGill, 1983) stores words and their occurrence lists in the text database/collection.

The occurrence lists may store exact word positions (inverted list) or just a block or a document (inverted file).

Storing exact word positions (inverted list) enables faster search and facilitates some kinds of queries (compared to the inverted file), but requires (much) more storage.
10. Inverted file, example
http://en.wikipedia.org/wiki/Inverted_index

We have three texts (documents):
T0 = "it is what it is"
T1 = "what is it"
T2 = "it is a banana"
We build the vocabulary and the associated document index lists.

Let's search for "what", "is" and "it". That is, we want to obtain the references to all the documents which contain those three words (at least once each, and in arbitrary positions!).
The answer: documents T0 and T1 (T2 does not contain "what").
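A minimal sketch of this example (not the original slide's code): build the document index lists and intersect them for the query words.

# Wikipedia's inverted-file example: map each word to the set of
# documents it occurs in, then intersect the sets for a query.
docs = ["it is what it is", "what is it", "it is a banana"]

index = {}
for doc_id, text in enumerate(docs):
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

query = ["what", "is", "it"]
result = set.intersection(*(index[w] for w in query))
print(sorted(result))   # -> [0, 1]: only T0 and T1 contain all three words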
11. How to build an inverted index
http://nlp.stanford.edu/IR-book/pdf/02voc.pdf

Step 1: obvious. Step 2: split the text into tokens (roughly: words). Step 3: reduce the number of distinct tokens (e.g. normalize capitalized words and plural nouns). Step 4: build the dictionary structure and occurrence lists.
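A rough sketch of steps 2-4, assuming a simple regex tokenizer and a naive lowercase-and-strip-plural normalization (real normalization rules are more involved):

import re

# Step 2: split into tokens.
def tokenize(text):
    return re.findall(r"\w+", text)

# Step 3: reduce the number of distinct tokens (case folding, naive plural stripping).
def normalize(token):
    token = token.lower()
    return token[:-1] if token.endswith('s') and len(token) > 3 else token

# Step 4: build the dictionary and the occurrence lists (exact word positions).
def build_inverted_list(text):
    index = {}
    for pos, tok in enumerate(tokenize(text)):
        index.setdefault(normalize(tok), []).append(pos)
    return index

print(build_inverted_list("The cats sat on the mat; the cat likes mats."))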
12. Tokenization
http://nlp.stanford.edu/IR-book/pdf/02voc.pdf

A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.

How to tokenize? Not as easy as it first seems... See the following sentence and possible tokenizations of some excerpts from it.
13. Tokenization, cont'd
http://nlp.stanford.edu/IR-book/pdf/02voc.pdf

Should we accept tokens like C++, C#, .NET, DC-9, M*A*S*H? And if M*A*S*H is legal (a single token), then why not 2*3*5?

Is a dot (.) a punctuation mark (sentence terminator)? Usually yes (of course), but think about e-mails, URLs and IPs...
14. Tokenization, cont'd
http://nlp.stanford.edu/IR-book/pdf/02voc.pdf

Is a space always a reasonable token delimiter? Is it better to perceive New York (or San Francisco, Tomaszów Mazowiecki...) as one token or two? Similarly with foreign phrases (cafe latte, en face).

Splitting on spaces may bring bad retrieval results: a search for York University will mainly fetch documents related to New York University.
15. Stop words
http://nlp.stanford.edu/IR-book/pdf/02voc.pdf

Stop words are very frequent words which carry little information (e.g. pronouns, articles). For example:

Consider an inverted list (i.e. exact positions for all word occurrences are kept in the index). If we have a stop list (i.e. we discard stop words during indexing), the index will get smaller, say by 20-30% (note this has to do with Zipf's law).

Danger of using stop words: some meaningful queries may consist of stop words exclusively: The Who, to be or not to be... The modern trend is not to use a stop list at all.
16. Glimpse: a classic inverted file index (Manber & Wu, 1993)

Assumption: the text to index is a single unstructured file (perhaps a concatenation of documents). It is divided into 256 blocks of approximately the same size. Each entry in the index is a word and the list of blocks in which it occurs, in ascending order. Each block number takes 1 byte (why?).
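A toy sketch of such a block-addressing index (splitting at fixed character positions and the regex tokenizer are simplifying assumptions):

import re

# Split the text into 256 roughly equal blocks and record, for each word,
# the ascending list of block numbers it occurs in.
# With only 256 blocks, each block number fits in a single byte.
def build_block_index(text, num_blocks=256):
    block_size = len(text) // num_blocks + 1
    index = {}
    for b in range(num_blocks):
        block = text[b * block_size:(b + 1) * block_size]
        for word in set(re.findall(r"\w+", block.lower())):
            index.setdefault(word, []).append(b)   # blocks visited in order -> ascending lists
    return index

Note that cutting the text at arbitrary positions may split a word or a phrase across two blocks; this is exactly the block boundaries problem discussed on slide 24.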
17. Block-addressing (e.g. Glimpse) index, general scheme
(Fig. 3 from Navarro et al., Adding compression..., IR, 2000)
18. How to search in Glimpse?

Two basic queries:
- keyword search: the user specifies one or more words and requires a list of all documents (or blocks) in which those words occur in any positions,
- phrase search: the user specifies two or more words and requires a list of all documents (or blocks) in which those words occur as a phrase, i.e. one by one.

Why is phrase search important? Imagine a keyword search for new Scotland and a phrase search for "new Scotland". What's the difference?
19. How to search in Glimpse, cont'd
(fig. from http://nlp.stanford.edu/IR-book/pdf/01bool.pdf)

The key operation is to intersect several block lists. Imagine the query has 4 words, and their corresponding lists have lengths 10, 120, 5, 30. How do you perform the intersection?

It's best to start with the two shortest lists, i.e. of lengths 5 and 10 in our example. The intersection output will have length at most 5 (but usually less, even 0, when we just stop!). Then we intersect the obtained list with the list of length 30, and finally with the longest list. No matter the intersection order, the result is the same, but the speeds are (vastly) different!
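A sketch of this heuristic: process the lists from shortest to longest and stop early if the running intersection becomes empty (the lists are assumed to be plain Python lists of block numbers):

# Intersect several occurrence lists, shortest first,
# stopping as soon as the partial result becomes empty.
def intersect_lists(lists):
    result = None
    for lst in sorted(lists, key=len):           # shortest lists first
        result = set(lst) if result is None else result & set(lst)
        if not result:                           # early exit: answer is already empty
            break
    return sorted(result) if result else []

print(intersect_lists([[1, 3, 5, 7], [3, 5, 9], [0, 3, 5, 6, 8]]))  # -> [3, 5]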
20. How to search in Glimpse, cont'd

We have obtained the intersection of all lists: what then?

It depends on the query: for the keyword query, we're done (we can now retrieve the found blocks / docs). For the phrase query, we have yet to scan the resulting blocks / documents and check if and where the phrase occurs in them. To this end, we can use any fast exact string matching algorithm, e.g. Boyer-Moore-Horspool. Conclusion: the smaller the resulting list, the faster the phrase query is handled.
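A sketch of the verification step for a phrase query; Python's built-in str.find() stands in here for a fast exact matcher such as Boyer-Moore-Horspool, and the 'blocks' list and the whitespace normalization are assumptions:

# After intersecting the block lists, scan only the surviving blocks
# and verify whether the phrase really occurs in them.
def verify_phrase(blocks, candidate_block_ids, phrase):
    hits = []
    phrase = " ".join(phrase.lower().split())
    for b in candidate_block_ids:
        text = " ".join(blocks[b].lower().split())   # normalize whitespace
        if text.find(phrase) != -1:                  # exact substring search
            hits.append(b)
    return hits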
21. Approximate matching with Glimpse

Imagine we want to find occurrences of a given phrase, but with up to k (Levenshtein) errors. How to do it?

Assume for example k = 2 and the phrase grey cat. The phrase has two words, so there are the following errors-per-word possibilities: 0+0, 0+1, 0+2, 1+0, 1+1, 2+0.
22. Approximate matching with Glimpse, cont'd

E.g. 0+2 means here that the first word (grey) must be matched exactly, but the second with 2 errors (e.g. rats). So, the approximate query grey cat translates to many exact queries (many of them rather silly...): grey cat, gray cat, grey rat, great cat, grey colt, gore cat, grey date...

All those possibilities are obtained from traversing the vocabulary structure (e.g., a trie). Another option is on-line approximate matching over the vocabulary represented as plain text (a concatenation of words); see the figure on the next slide.
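A sketch of the vocabulary-scanning variant: enumerate the ways of distributing at most k errors over the phrase words and, for each word, collect all vocabulary entries within the allowed Levenshtein distance (the toy 'vocab' list and the plain dynamic-programming edit distance are assumptions, not the trie traversal of the original method):

from itertools import product

def edit_distance(a, b):
    # Standard dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def expand_query(phrase_words, vocabulary, k):
    # alts[i][e] = vocabulary words within distance e of phrase word i
    alts = [{e: [v for v in vocabulary if edit_distance(w, v) <= e]
             for e in range(k + 1)} for w in phrase_words]
    # all ways of splitting at most k errors over the phrase words
    splits = [s for s in product(range(k + 1), repeat=len(phrase_words)) if sum(s) <= k]
    queries = set()
    for split in splits:
        for combo in product(*(alts[i][e] for i, e in enumerate(split))):
            queries.add(" ".join(combo))
    return queries

vocab = ["grey", "gray", "great", "cat", "rat", "rats", "colt", "cart"]
print(sorted(expand_query(["grey", "cat"], vocab, 2)))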
23. Approximate matching with Glimpse, query example with a single word x
(fig. from Baeza-Yates & Navarro, Fast Approximate String Matching in a Dictionary, SPIRE'98)
24. Block boundaries problem

If the Glimpse blocks never cross document boundaries (i.e., they are natural), we don't have this problem... But if the block boundaries are artificial, then we may be unlucky and have one of our keywords at the very end of a block Bj and the next keyword at the beginning of Bj+1. How not to miss an occurrence?

There is a simple solution: blocks may overlap a little. E.g. the last 30 words of each block are repeated at the beginning of the next block. Assuming the phrase length / keyword set has no more than 30 words, we are safe. But we may then need to scan more blocks than necessary (why?).
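A small sketch of splitting a word list into blocks with such an overlap (the 1000-word block size is arbitrary; 30 is the slide's example overlap):

# Split a list of words into blocks, repeating the last 'overlap'
# words of each block at the start of the next one.
def overlapping_blocks(words, block_size=1000, overlap=30):
    blocks, start = [], 0
    while start < len(words):
        blocks.append(words[start:start + block_size])
        start += block_size - overlap   # step back by 'overlap' words for the next block
    return blocks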
25. Glimpse: issues and limitations

The authors claim their index takes only 2-4% of the original text size. But it works for text collections only up to about 200 MB; then it starts to degenerate, i.e., the block lists tend to be long and many queries are handled not much faster than with online search (without any index).

How can we help it? The overall idea is fine, so we must take care of the details. One major idea is to apply compression to the index...
26. (Glimpse-like) index compression

The purpose of data compression in inverted indexes is not only to save space (storage). It is also to make queries faster! (One big reason is less I/O, e.g. one disk access where without compression we'd need two accesses. Another reason is more cache-friendly memory access.)
27. Compression opportunities (Navarro et al., 2000)

- Text (partitioned into blocks) may be compressed on the word level (faster text search in the last stage).
- Long lists may be represented as their complements, i.e. the numbers of the blocks in which a given word does NOT occur.
- Lists store increasing numbers, so the gaps (differences) between them may be encoded, e.g. 2, 15, 23, 24, 100 -> 2, 13, 8, 1, 76 (smaller numbers); see the sketch after this list.
- The resulting gaps may be statistically encoded (e.g. with some Elias code; next slides...).
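A minimal sketch of the gap (delta) transform and its inverse (the statistical coding of the gaps is shown on the later slides):

# Turn an increasing occurrence list into gaps and back.
def to_gaps(occ):
    return [occ[0]] + [b - a for a, b in zip(occ, occ[1:])]

def from_gaps(gaps):
    occ = []
    for g in gaps:
        occ.append(g if not occ else occ[-1] + g)
    return occ

print(to_gaps([2, 15, 23, 24, 100]))   # -> [2, 13, 8, 1, 76]
print(from_gaps([2, 13, 8, 1, 76]))    # -> [2, 15, 23, 24, 100]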
28. Compression paradigm: modeling + coding

Modeling: the way we look at the input data. The data can be perceived as individual (context-free) 1-byte characters, or pairs of bytes, or triples, etc. We can look for matching sequences in the past buffer (bounded or unbounded, sliding or not), the minimum match length can be set to some value, etc., etc. We can apply lossy or lossless transforms (DCT in JPEG, the Burrows-Wheeler transform), etc. Modeling is difficult. Sometimes more art than science. Often data-specific.

Coding: what we do with the data transformed in the modeling phase.
29. Intro to coding theory

A uniquely decodable code: any concatenation of its codewords can be uniquely parsed. A prefix code: no codeword is a proper prefix of any other codeword. Also called an instantaneous code.

Trivially, any prefix code is uniquely decodable. (But not vice versa!)
30. Average codeword length

Let an alphabet have s symbols, with probabilities p_0, p_1, ..., p_{s-1}. Let's have a (uniquely decodable) code C = {c_0, c_1, ..., c_{s-1}}. The average codeword length for a given probability distribution is defined as
L(C, p_0, ..., p_{s-1}) = Σ_{i=0..s-1} p_i · |c_i|.

So this is a weighted average length over the individual codewords. More frequent symbols have a stronger influence.
31. Entropy

Entropy is the average information in a symbol. Or: the lower bound on the average number (may be fractional) of bits needed to encode an input symbol. Higher entropy means less compressible data. What "the entropy of the data" is, is a vague issue. We always measure the entropy according to a given model (e.g. context-free, aka order-0, statistics).

Shannon's entropy formula (S is the source emitting messages / symbols):
H(S) = -Σ_{i=0..s-1} p_i · log2 p_i.
32. Redundancy

Simply speaking, redundancy is the excess in the representation of data. Redundant data means compressible data. A redundant code is a non-optimal (or far from optimal) code. The code redundancy (for a given probability distribution) is
R(C, p_0, p_1, ..., p_{s-1}) = L(C, p_0, p_1, ..., p_{s-1}) - H(p_0, p_1, ..., p_{s-1}) >= 0.
The redundancy of a code is the average excess (over the entropy) per symbol. It can't be below 0, of course.
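A small sketch putting the last three slides together: compute the average codeword length, the order-0 entropy and the redundancy for a toy distribution and code (the probabilities and codewords below are made-up illustration values):

import math

probs = [0.5, 0.25, 0.125, 0.125]            # example probabilities
code  = ["0", "10", "110", "111"]            # example prefix code

L = sum(p * len(c) for p, c in zip(probs, code))     # average codeword length
H = -sum(p * math.log2(p) for p in probs if p > 0)   # order-0 entropy
R = L - H                                            # redundancy

print("avg codeword length L =", L)          # 1.75 bits/symbol
print("entropy H             =", H)          # 1.75 bits/symbol
print("redundancy R = L - H  =", R)          # 0.0 here: the code matches the distribution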
33. Basic codes. Unary code

In the unary code, a positive integer n is represented as n-1 ones followed by a single zero (or the symmetric variant with the roles of 0 and 1 swapped). It is extremely simple, though. Application: a very skewed distribution (expected for a given problem).
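A minimal sketch of the unary code under the convention used in the gamma example later (n-1 ones and a terminating zero); the choice of convention is an assumption:

# Unary code for positive integers: (n-1) ones followed by a zero.
def unary_encode(n):
    return "1" * (n - 1) + "0"

def unary_decode(bits):
    return bits.index("0") + 1    # number of leading ones, plus one

print([unary_encode(n) for n in (1, 2, 3, 4)])   # ['0', '10', '110', '1110']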
34. Basic codes. Elias gamma code

The gamma code of an integer v >= 1 writes z = floor(log2 v) in unary (z ones and a zero), followed by the z least significant bits of v. Still simple, and usually much better than the unary code.
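A small sketch of the encoder and decoder under that definition (the script on slide 36 encodes a whole list in the same way):

# Elias gamma code for integers v >= 1.
def gamma_encode(v):
    z = v.bit_length() - 1                     # z = floor(log2 v)
    if z == 0:
        return "0"
    return "1" * z + "0" + format(v - (1 << z), "b").zfill(z)

def gamma_decode(bits):
    z = bits.index("0")                        # number of leading ones
    rest = bits[z + 1:z + 1 + z]
    return (1 << z) + (int(rest, 2) if rest else 0)

print([gamma_encode(v) for v in (1, 2, 3, 4, 5)])  # ['0', '100', '101', '11000', '11001']
print(gamma_decode("11001"))                        # -> 5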
35. Elias gamma code in Glimpse (example)

an exemplary occurrence list: 2, 4, 5, 6, 9, 11, 40, 42, 43, 94, 96, 120, 133, 134, 151, 203
list of deltas (differences): 2, 2, 1, 1, 3, 2, 29, 2, 1, 51, 2, 24, 13, 1, 17, 52
list of deltas minus 1 (as zero was previously impossible, except perhaps at the 1st position): 2, 1, 0, 0, 2, 1, 28, 1, 0, 50, 1, 23, 12, 0, 16, 51

no compression (one item = one byte): 16 * 8 = 128 bits
with Elias coding: 78 bits
101 100 0 0 101 100 111101101 100 0 11111010011 100 111101000 1110101 0 111100001 11111010100
36. Python code for the previous example

occ_list = [2, 4, 5, 6, 9, 11, 40, 42, 43, 94, 96, 120, 133, 134, 151, 203]
import math
delta_list = [occ_list[0]]
for i in range(1, len(occ_list)):
    delta_list += [occ_list[i] - occ_list[i-1] - 1]
print occ_list
print delta_list

total_len = 0
total_seq = ""
for x in delta_list:
    z = int(math.log(x+1, 2))
    code1 = "1"*z + "0"                                  # unary part
    code2 = bin(x+1 - 2**z)[2:].zfill(z) if z >= 1 else ""  # binary part
    total_seq += code1 + code2 + " "
    total_len += len(code1 + code2)

print "no compression:", len(occ_list)*8, "bits"
print "with Elias gamma coding:", total_len, "bits"
print total_seq

(Python v2.6 needed, as the function bin() is used.)
37. Huffman coding (1952): basic facts

Elias codes assume that we know (roughly) the symbol distribution before encoding. What if we guessed it badly...? If we know the symbol distribution, we may construct an optimal code; more precisely, optimal among the uniquely decodable codes having a codebook. It is called Huffman coding.

Example: symbol frequencies above, final Huffman tree on the right.
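A compact sketch of the standard Huffman construction with a heap (the frequencies in the example call are made up, since the slide's figure is not reproduced here):

import heapq
from itertools import count

def huffman_code(freqs):
    # freqs: dict symbol -> frequency; returns dict symbol -> codeword.
    tie = count()                                   # tie-breaker for equal weights
    heap = [(f, next(tie), {s: ""}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)             # the two lightest subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

print(huffman_code({'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}))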
38. Huffman coding (1952): basic facts, cont'd

Its redundancy is always less than 1 bit / symbol (but may be arbitrarily close to it). In most practical applications (data not very skewed), the Huffman code's average length is only 1-3% worse than the entropy. E.g. the order-0 entropy of book1 (an English 19th-century novel, plain text) from the Calgary corpus is 4.567 bpc; the Huffman average length is 4.569 bpc.
39. Why not use only Huffman for encoding occurrence lists?
40. Kraft inequality

Given the integers k_0, k_1, ..., k_{s-1}, we can construct a prefix code C = {c_0, c_1, ..., c_{s-1}} for which |c_i| = k_i, 0 <= i < s, iff
Σ_{i=0..s-1} 2^(-k_i) <= 1.

Question: can we have a prefix code for 5 symbols with the following codeword lengths: (a) 2, 3, 3, 3, 4; (b) 1, 2, 2, 4, 5?
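A one-liner sketch that answers the question by checking the inequality directly:

# Check the Kraft inequality for a list of codeword lengths.
def kraft_ok(lengths):
    return sum(2 ** -k for k in lengths) <= 1

print(kraft_ok([2, 3, 3, 3, 4]))   # True : sum = 0.6875, such a prefix code exists
print(kraft_ok([1, 2, 2, 4, 5]))   # False: sum = 1.09375 > 1, no such prefix code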
41. Example of a uniquely decodable, albeit not prefix, code

Consider the string 11000000010. Do you know how to parse it?
From http://ocw.mit.edu/NR/rdonlyres/Electrical-Engineering-and-Computer-Science/6-441Transmission-of-InformationSpring2003/02665B0D-F6AF-42BB-87D6-771EA9D1BF38/0/6441lecture6.pdf