Indexing and Complexity - PowerPoint PPT Presentation


Slides: 18

Provided by: Preferr654

1
Indexing and Complexity
2
Agenda

Inverted indexes
Computational complexity

3
Some Interesting Questions

How long will it take to find a document?
Is there any work we can do in advance?
If so, how long will that take?
How big a computer will I need?
How much disk space? How much RAM?
What if more documents arrive?
How much of the advance work must be repeated?
Will searching become slower?
How much more disk space will be needed?

4
A Cautionary Tale

Searching is easy - just ask Microsoft!
Find can search my 1 GB disk in 30 seconds
Well, actually it only looks at the file names...
How long do you think Find would take for
The 100 GB disk we just got?
For the World Wide Web?
Computers are getting faster, but
How does AltaVista give answers in 5 seconds?

5
The Inverted File Trick

Organize the bag of words matrix by terms
You know the terms that you are looking for
Look up terms like you search phone books
For each letter, jump directly to the right spot
For terms of reasonable length, this is very fast
For each term, store the document identifiers
For every document that contains that term
At query time, use the document identifiers
Consult a postings file
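
The trick above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the implementation behind the slides: the function name `build_inverted_index` and the three toy documents are made up for the example, and a plain dictionary stands in for the on-disk inverted file.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    # Sorted lists play the role of the postings for each term.
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical toy collection (not the one used on the slides).
docs = {
    1: "the quick brown fox",
    2: "now is the time",
    3: "the lazy brown dog",
}

index = build_inverted_index(docs)
print(index["brown"])         # documents containing "brown"
print(index.get("cat", []))   # a term that was never indexed
```

At query time the cost depends on finding the term and reading its postings, not on scanning every document.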

6
An Example

Term    Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8  Inverted File  Postings
aid     0      0      0      1      0      0      0      1      A > AI         4, 8
all     0      1      0      1      0      1      0      0      A > AL         2, 4, 6
back    1      0      1      0      0      0      1      0      B > BA         1, 3, 7
brown   1      0      1      0      1      0      1      0      B > BR         1, 3, 5, 7
come    0      1      0      1      0      1      0      1      C              2, 4, 6, 8
dog     0      0      1      0      1      0      0      0      D              3, 5
fox     0      0      1      0      1      0      1      0      F              3, 5, 7
good    0      1      0      1      0      1      0      1      G              2, 4, 6, 8
jump    0      0      1      0      0      0      0      0      J              3
lazy    1      0      1      0      1      0      1      0      L              1, 3, 5, 7
men     0      1      0      1      0      0      0      1      M              2, 4, 8
now     0      1      0      0      0      1      0      1      N              2, 6, 8
over    1      0      1      0      1      0      1      1      O              1, 3, 5, 7, 8
party   0      0      0      0      0      1      0      1      P              6, 8
quick   1      0      1      0      0      0      0      0      Q              1, 3
their   1      0      0      0      1      0      1      0      T > TH         1, 5, 7
time    0      1      0      1      0      1      0      0      T > TI         2, 4, 6

In the inverted file, the single letters A, B, and T are shared first-letter entries that branch to AI/AL, BA/BR, and TH/TI.

7
The Finished Product

Term    Inverted File  Postings
aid     A > AI         4, 8
all     A > AL         2, 4, 6
back    B > BA         1, 3, 7
brown   B > BR         1, 3, 5, 7
come    C              2, 4, 6, 8
dog     D              3, 5
fox     F              3, 5, 7
good    G              2, 4, 6, 8
jump    J              3
lazy    L              1, 3, 5, 7
men     M              2, 4, 8
now     N              2, 6, 8
over    O              1, 3, 5, 7, 8
party   P              6, 8
quick   Q              1, 3
their   T > TH         1, 5, 7
time    T > TI         2, 4, 6
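
With postings like the ones on this slide, a Boolean AND query reduces to intersecting two sorted lists. The sketch below copies a few of the slide's postings into a Python dictionary; the helper name `intersect` is made up for illustration, and a real system would read the lists from the postings file rather than from memory.

```python
# A few postings lists copied from the slide (term -> sorted document ids).
postings = {
    "brown": [1, 3, 5, 7],
    "fox":   [3, 5, 7],
    "good":  [2, 4, 6, 8],
    "party": [6, 8],
    "quick": [1, 3],
}

def intersect(a, b):
    """Return the ids present in both sorted postings lists (Boolean AND)."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1          # advance whichever list is behind
        else:
            j += 1
    return result

print(intersect(postings["brown"], postings["fox"]))    # brown AND fox
print(intersect(postings["good"], postings["party"]))   # good AND party
```

Because both lists are sorted, the merge touches each posting at most once. For these lists, "brown AND fox" matches documents 3, 5, and 7, and "good AND party" matches documents 6 and 8.
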
8
What Goes in a Postings File?

Boolean retrieval
Just the document number
Ranked Retrieval
Document number and term weight (TFIDF, ...)
Proximity operators
Word offsets for each occurrence of the term
Example: Doc 3 (t17, t36), Doc 13 (t3, t45)
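
As a rough illustration of the third option, the sketch below stores word offsets for every occurrence of a term and uses them to answer a "within k words" query. The names `build_positional_index` and `within`, and the two sample documents, are assumptions made for this example only.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Build term -> {doc_id: [word offsets of each occurrence]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for offset, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(offset)
    return index

def within(index, term_a, term_b, k):
    """Return ids of documents where the two terms occur within k words."""
    docs_a = index.get(term_a, {})
    docs_b = index.get(term_b, {})
    hits = []
    for doc_id in sorted(docs_a.keys() & docs_b.keys()):
        if any(abs(pa - pb) <= k
               for pa in docs_a[doc_id]
               for pb in docs_b[doc_id]):
            hits.append(doc_id)
    return hits

# Hypothetical documents, chosen so the two terms are 1 and 4 words apart.
docs = {
    1: "the quick brown fox jumps over the lazy dog",
    2: "the dog sat while the brown bear and the fox slept",
}

index = build_positional_index(docs)
print(index["fox"])                       # offsets per document
print(within(index, "brown", "fox", 1))   # adjacent only
print(within(index, "brown", "fox", 4))   # looser window
```

The same structure covers the other two cases: the document ids alone are enough for Boolean retrieval, and the number of offsets per document is a raw term frequency, one ingredient of a TFIDF-style weight.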

9
How Big Is the Postings File?

Very compact for Boolean retrieval
About 10% of the size of the documents
If an aggressive stopword list is used!
Not much larger for ranked retrieval
Perhaps 20%
Enormous for proximity operators
Sometimes larger than the documents!
But access is fast - you know where to look

10
Building an Inverted Index

Simplest solution is a single sorted array
Fast lookup using binary search
But sorting large files on disk is very slow
And adding one document means starting over
Tree structures allow easy insertion
But the worst case lookup time is linear
Balanced trees provide the best of both
Fast lookup and easy insertion
But they require 45% more disk space
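
The first two options can be compared with a small sketch. It keeps everything in memory, so it only illustrates the shape of the trade-off, not disk behavior. `SortedTermArray` and `bst_depth_after_inserts` are names invented for this example; the binary search uses Python's standard `bisect` module.

```python
import bisect

class SortedTermArray:
    """Term dictionary kept as a single sorted list."""

    def __init__(self, terms=()):
        self.terms = sorted(set(terms))

    def contains(self, term):
        # Binary search: about log2(n) comparisons.
        i = bisect.bisect_left(self.terms, term)
        return i < len(self.terms) and self.terms[i] == term

    def insert(self, term):
        i = bisect.bisect_left(self.terms, term)
        if i == len(self.terms) or self.terms[i] != term:
            # list.insert shifts every later entry, so this is O(n).
            self.terms.insert(i, term)

def bst_depth_after_inserts(keys):
    """Insert keys into an unbalanced binary search tree; return its depth."""
    root = None          # each node is [key, left_child, right_child]
    depth = 0
    for key in keys:
        if root is None:
            root, depth = [key, None, None], 1
            continue
        node, level = root, 1
        while key != node[0]:
            side = 1 if key < node[0] else 2
            if node[side] is None:
                node[side] = [key, None, None]
                depth = max(depth, level + 1)
                break
            node, level = node[side], level + 1
    return depth

array = SortedTermArray("now is the time for all good".split())
array.insert("men")
print(array.terms)
print(array.contains("men"), array.contains("party"))

# Keys arriving in sorted order turn the unbalanced tree into a chain.
print(bst_depth_after_inserts(["all", "good", "men", "now", "time"]))
print(bst_depth_after_inserts(["men", "good", "now", "all", "time"]))
```

The sorted insertion order produces a tree of depth 5 for 5 keys, which is the linear worst case; the second order gives depth 3. A balanced tree keeps the depth logarithmic regardless of arrival order.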

11
Starting a B Tree Inverted File

Text indexed so far: "Now is the time for all good"
[Figure: B tree diagram. Node labels in extraction order: aaaaa, now, now, time, good, all]

12
Adding a New Term
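
Text indexed so far: "Now is the time for all good men"
[Figure: B tree diagram after adding "men". Node labels in extraction order: aaaaa, now, aaaaa, men, now, time, good, all, men]
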
13
How Big is the Inverted Index?

Typically smaller than the postings file
Depends on number of terms, not documents
Eventually almost all terms will be indexed
But the postings file will continue to grow
Postings dominate asymptotic space complexity
Linear in the number of documents
Assuming that the documents remain about the same size

14
Some Facts About Disks

It takes a long time to get the first byte
A Pentium can do 1,000,000 operations in 10 ms
But you can get 1,000 bytes just about as fast
40 MB/sec transfer rates are typical
So it pays to put related stuff in each block
M-ary trees (B+ trees) are better than binary trees
Time complexity is measured in disk blocks read
Since computing time is negligible by comparison
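
A back-of-the-envelope calculation shows why wide nodes help. The sketch assumes about 10 ms to reach the first byte of a block (the figure implied by the comparison above), one block read per tree level, a hypothetical vocabulary of 1,000,000 terms, and a hypothetical fanout of 100 keys per block.

```python
def tree_height(num_keys, fanout):
    """Levels a balanced tree with this fanout needs to hold num_keys keys."""
    height = 1
    capacity = fanout
    while capacity < num_keys:
        capacity *= fanout
        height += 1
    return height

ACCESS_MS = 10                        # assumed time to reach the first byte
TRANSFER_BYTES_PER_SEC = 40_000_000   # 40 MB/sec, as quoted on the slide
NUM_TERMS = 1_000_000                 # hypothetical vocabulary size

# Time to transfer 1,000 bytes once the disk head is in place, in ms.
transfer_ms = 1000 / TRANSFER_BYTES_PER_SEC * 1000
print("transfer of 1,000 bytes:", transfer_ms, "ms")

for name, fanout in [("binary", 2), ("100-ary", 100)]:
    reads = tree_height(NUM_TERMS, fanout)   # one block read per level
    print(name, "tree:", reads, "block reads,",
          reads * ACCESS_MS, "ms of access time")
```

Under these assumptions the binary tree needs 20 block reads (about 200 ms) and the 100-ary tree needs 3 (about 30 ms), while transferring 1,000 bytes adds only about 0.025 ms per block.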

15
Time Complexity

Indexing
Walk the inverted file, splitting if needed
Insert into the postings file in sorted order
Hours or days for large collections
Query processing
Walk the inverted file
Read the postings file
Seconds, even for enormous collections

16
Summary

Slow indexing yields fast query processing
We use extra disk space to save query time
Index space is in addition to document space
Time and space complexity must be balanced
Disk block reads are the critical resource
Fast disks are more useful than fast computers

17
A Question

If insertions are more common than queries (for
example, filtering news stories as they arrive
and then never looking at them again), what kind
of an index should you build?

18
Indexing High Volume Streams