Title: INF 2914 Information Retrieval and Web Search
1. INF 2914 Information Retrieval and Web Search
- Lecture 6: Index Construction
- These slides are adapted from Stanford's class CS276 / LING 286 - Information Retrieval and Web Mining
2. (Offline) Search Engine Data Flow
[Figure: four-stage pipeline.]
1. Crawler: fetches each web page.
2. Parse / Tokenize: parse, tokenize, per-page analysis; output: tokenized web pages.
3. Global Analysis (runs in background): dup detection (dup table), static rank (rank table), anchor text (anchor text), spam analysis (spam table), ...
4. Index Build: scan tokenized web pages, anchor text, etc., and generate the text index; output: inverted text index.
3. Inverted index
- For each term T, we must store a list of all
documents that contain T.
Each docID in a list is a posting. Example postings lists, sorted by docID (more later on why):
- Brutus -> 2, 4, 8, 16, 32, 64, 128
- Caesar -> 2, 3, 5, 8, 13, 21, 34
- Calpurnia -> 1, 13, 16
4. Inverted index construction
[Figure: documents to be indexed, e.g. "Friends, Romans, countrymen.", flow through the indexing pipeline.]
5. Indexer steps
- Sequence of (Modified token, Document ID) pairs.
Doc 1: I did enact Julius Caesar I was killed i' the Capitol Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.
6. Core indexing step: sort the pairs by term.
7. - Multiple term entries in a single document are merged.
- Frequency information is added. Why frequency? Will discuss later.
8. - The result is split into a Dictionary file and a Postings file.
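A minimal sketch of slides 5-8 in Python (names and the naive tokenization are mine, not the course's code): build (token, docID) pairs, sort them, merge duplicate entries while accumulating frequencies, then split the result into a dictionary and postings.

    from collections import defaultdict

    def build_index(docs):
        # docs: {doc_id: text}; naive lowercase/whitespace tokenization
        pairs = []
        for doc_id, text in docs.items():
            for token in text.lower().split():
                pairs.append((token, doc_id))
        pairs.sort()                         # core indexing step: sort by term, then docID

        postings = defaultdict(list)         # term -> [(doc_id, freq), ...]
        for term, doc_id in pairs:
            plist = postings[term]
            if plist and plist[-1][0] == doc_id:
                plist[-1] = (doc_id, plist[-1][1] + 1)   # merge duplicates, add frequency
            else:
                plist.append((doc_id, 1))

        dictionary = {term: len(plist) for term, plist in postings.items()}
        return dictionary, postings          # dictionary file + postings file analogues

    dictionary, postings = build_index({1: "I did enact Julius Caesar",
                                        2: "So let it be with Caesar"})
    print(postings["caesar"])   # [(1, 1), (2, 1)]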
9. The index we just built
- How do we process a query?
10. Query processing: AND
- Consider processing the query
- Brutus AND Caesar
- Locate Brutus in the Dictionary
- Retrieve its postings.
- Locate Caesar in the Dictionary
- Retrieve its postings.
- Merge the two postings
[Figure: the postings lists for Brutus (ending at 128) and Caesar (ending at 34) feeding the merge.]
11. The merge
- Walk through the two postings simultaneously, in time linear in the total number of postings entries.
[Figure: the two lists (Brutus ending at 128, Caesar at 34) walked in lockstep starting from docID 2.]
If the list lengths are x and y, the merge takes O(x + y) operations. Crucial: postings sorted by docID.
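A sketch of that linear merge for the AND query; the function name and the plain-list encoding are illustrative assumptions:

    def intersect(p1, p2):
        # walk both docID-sorted lists in lockstep: O(x + y) comparisons
        answer = []
        i = j = 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i += 1
                j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    # Brutus AND Caesar, using the postings from slide 3:
    print(intersect([2, 4, 8, 16, 32, 64, 128], [2, 3, 5, 8, 13, 21, 34]))  # [2, 8]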
12. Index construction
- How do we construct an index?
- What strategies can we use with limited main
memory?
13. Our corpus for this lecture
- Number of docs: n = 1M
- Each doc has 1K terms
- Number of distinct terms: m = 500K
- 667 million postings entries
14. How many postings?
- Number of 1s in the i-th block = nJ/i
- Summing this over the m/J blocks, we have Σ_{i=1..m/J} nJ/i ≈ nJ ln(m/J)
- For our numbers, this should be about 667 million postings.
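A quick numeric check of the estimate. The slides do not state the block size J behind the 667M figure, so J below is an arbitrary assumption; the point is that the exact block sum is a harmonic series, which the nJ ln(m/J) approximation captures up to Euler's constant:

    import math

    n, m = 1_000_000, 500_000    # docs and distinct terms, from slide 13
    J = 76                       # hypothetical block size, not given in the slides

    exact = sum(n * J / i for i in range(1, m // J + 1))
    approx = n * J * math.log(m / J)
    print(f"exact ≈ {exact:.3g}, nJ·ln(m/J) ≈ {approx:.3g}")
    # exact ≈ 7.1e+08, approximation ≈ 6.7e+08: the same order as 667M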
15. Recall index construction
- Documents are processed to extract words and
these are saved with the Document ID.
Doc 1: I did enact Julius Caesar I was killed i' the Capitol Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.
16. Key step
- After all documents have been processed, the inverted file is sorted by terms.
We focus on this sort step. We have 667M items to sort.
17. Index construction
- At 10-12 bytes per postings entry, sorting demands several gigabytes of temporary space.
18. System parameters for design
- Disk seek: 10 milliseconds
- Block transfer from disk (following a seek): 1 microsecond per byte
- All other ops: 1 microsecond
  - E.g., compare two postings entries and decide their merge order
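These parameters can be wrapped in a tiny cost model (my own helper, not from the slides) to reproduce the running-time estimates on the following slides:

    SEEK = 0.01   # disk seek: 10 ms
    BYTE = 1e-6   # block transfer: 1 microsecond per byte, after a seek
    OP = 1e-6     # any other operation, e.g. one comparison: 1 microsecond

    def read_write_block(n_bytes):
        # time in seconds to stream one block off disk and write it back
        return 2 * (SEEK + n_bytes * BYTE)

    print(read_write_block(120_000_000))   # a 120 MB block (10M 12-byte records): ~240 s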
19. Bottleneck
- Build postings entries one doc at a time
- Now sort postings entries by term (then by doc within each term)
- Doing this with random disk seeks would be too slow: must sort N = 667M records
If every comparison took 2 disk seeks, and N items could be sorted with N log2 N comparisons, how long would this take?
20. Disk-based sorting
- Build postings entries one doc at a time
- Now sort postings entries by term
- Doing this with random disk seeks would be too slow: must sort N = 667M records
If every comparison took 2 disk seeks, and N items could be sorted with N log2 N comparisons, how long would this take? 12.4 years!!!
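The arithmetic behind that answer, as a quick check using the parameters of slide 18:

    import math

    N = 667e6
    comparisons = N * math.log2(N)        # about 2.0e10 comparisons
    seconds = comparisons * 2 * 0.01      # 2 seeks of 10 ms per comparison
    print(seconds / (365 * 24 * 3600))    # about 12.4 years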
21. Sorting with fewer disk seeks
- 12-byte (4+4+4) records (term, doc, freq).
- These are generated as we process docs.
- Must now sort 667M such 12-byte records by term.
- Define a Block = 10M such records
  - Can easily fit a couple into memory.
- Will have 64 such blocks to start with.
- Will sort within blocks first, then merge the blocks into one long sorted order.
22. Sorting 64 blocks of 10M records
- First, read each block and sort within:
  - Quicksort takes 2N ln N expected steps
  - In our case 2 x (10M ln 10M) steps
- Time to Quicksort each block: 320 seconds
- Total time to read each block from disk and write it back: 120M x 2 x 10^-6 s = 240 seconds
- 64 times this estimate gives us 64 sorted runs of 10M records each
- Total Quicksort time: 5.6 hours
- Total read/write time: 4.2 hours
- Total for this phase: 10 hours
- Need 2 copies of data on disk throughout
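A sketch of this run-generation phase, assuming records arrive as (term, doc, freq) tuples and runs are pickled to local files (file names and serialization are my assumptions):

    import pickle

    def make_sorted_runs(records, block_size=10_000_000):
        # records: an iterable of (term, doc, freq) tuples
        run_paths, block = [], []
        for rec in records:
            block.append(rec)
            if len(block) == block_size:
                run_paths.append(flush_run(block, len(run_paths)))
                block = []
        if block:
            run_paths.append(flush_run(block, len(run_paths)))
        return run_paths                  # e.g. 64 runs of 10M records each

    def flush_run(block, run_id):
        block.sort()                      # in-memory sort of one block
        path = "run_%02d.pkl" % run_id
        with open(path, "wb") as f:
            pickle.dump(block, f)
        return path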
23. Merging 64 sorted runs
- Merge tree of log2 64 = 6 layers.
- During each layer, read runs into memory in blocks of 10M, merge, write back.
[Figure: two sorted runs on disk are read block by block, merged in memory, and written back as a single merged run.]
24. Merge tree
- 1 run = ?
- 2 runs = ?
- 4 runs = ?
- 8 runs, 80M/run
- 16 runs, 40M/run
- 32 runs, 20M/run
- Bottom level of tree: the 64 sorted runs of 10M records each.
25. Merging 64 runs
- Time estimate for disk transfer:
  - 6 x (time to read/write 64 blocks) = 6 x 4.2 hours ≈ 25 hours
- Time estimate for the merge operation:
  - 6 x 640M x 10^-6 s ≈ 1 hour
- Time estimate for the overall algorithm:
  - Sort time + Merge time = 10 + 26 = 36 hours
- Lower bound (main-memory sort):
  - Time to read/write: 4.2 hours
  - Time to sort in memory: 10.7 hours
  - Total time: 15 hours
26. Some indexing numbers
27. How to improve indexing time?
- Compression of the sorted runs
- Multi-way merge
  - Heap-merge all runs at once
- Radix sort (linear-time sorting)
- Pipelining the reading, sorting, and writing phases
28. Multi-way merge
[Figure: a heap holds the current head record of each of the 64 sorted runs; the minimum is repeatedly popped to the output.]
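A minimal multi-way merge sketch using Python's heapq, reading the pickled runs from the earlier sketch; production code would stream blocks rather than load whole runs:

    import heapq
    import pickle

    def merge_runs(run_paths):
        # heap of one cursor per run; pops the globally smallest record each step
        def read_run(path):
            with open(path, "rb") as f:
                yield from pickle.load(f)
        return heapq.merge(*(read_run(p) for p in run_paths))

    # usage: stream the final sorted order in a single pass over all runs
    # for record in merge_runs(["run_%02d.pkl" % i for i in range(64)]):
    #     ...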
29. Indexing improvements
- Radix sort:
  - Linear-time sorting
  - Flexibility in defining the sort criteria
  - Bigger sort buffers increase performance (contradicting previous literature; see the VLDB paper in the references)
- Pipelining the read, sort, and write phases
[Figure: timeline in which buffers B1 and B2 alternate, so the read, sort, and write phases overlap across buffers.]
30. Positional indexing
- Given documents:
  - D1: This is a test
  - D2: Is this a test
  - D3: This is not a test
- Reorganize by term:

TERM  DOC  LOC  DATA (caps)
this    1    0    1
is      1    1    0
a       1    2    0
test    1    3    0
is      2    0    1
this    2    1    0
a       2    2    0
test    2    3    0
this    3    0    1
is      3    1    0
not     3    2    0
a       3    3    0
test    3    4    0
31. Positional indexing
In postings-list format:
- a -> (1,2,0), (2,2,0), (3,3,0)
- is -> (1,1,0), (2,0,1), (3,1,0)
- not -> (3,2,0)
- test -> (1,3,0), (2,3,0), (3,4,0)
- this -> (1,0,1), (2,1,0), (3,0,1)

Sort by <term, doc, loc>:

TERM  DOC  LOC  DATA (caps)
a       1    2    0
a       2    2    0
a       3    3    0
is      1    1    0
is      2    0    1
is      3    1    0
not     3    2    0
test    1    3    0
test    2    3    0
test    3    4    0
this    1    0    1
this    2    1    0
this    3    0    1
32. Positional indexing with radix sort
- Radix key:
  - Token hash: 8 bytes
  - Document ID: 8 bytes
  - Location: 4 bytes, but no need to sort by location, since radix sort is stable!
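A sketch of an LSD (least-significant-digit-first) radix sort over these keys; the record layout is my assumption. The parser emits locations in increasing order within each document, so, because every counting pass is stable, locations stay sorted within each (term, doc) group with no passes over the 4 location bytes:

    def radix_sort(records):
        # records: (term_hash, doc_id, loc) tuples; LSD passes, one byte at a
        # time, over the 8 doc_id bytes first and the 8 term_hash bytes last
        for field, width in ((1, 8), (0, 8)):
            for shift in range(0, width * 8, 8):
                buckets = [[] for _ in range(256)]
                for rec in records:
                    buckets[(rec[field] >> shift) & 0xFF].append(rec)
                records = [rec for bucket in buckets for rec in bucket]  # stable
        return records

    # input in parser order (doc-major, locations increasing per doc);
    # output is sorted by (term_hash, doc_id, loc) without sorting on loc:
    recs = [(2, 1, 0), (1, 1, 1), (1, 1, 2), (1, 2, 0), (2, 2, 1)]
    print(radix_sort(recs))
    # [(1, 1, 1), (1, 1, 2), (1, 2, 0), (2, 1, 0), (2, 2, 1)]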
33. Distributed indexing
- Maintain a master machine directing the indexing job, considered "safe"
- Break up indexing into sets of (parallel) tasks
- Master machine assigns each task to an idle machine from a pool
34. Parallel tasks
- We will use two sets of parallel tasks:
  - Parsers
  - Inverters
- Break the input document corpus into splits
  - Each split is a subset of documents
- Master assigns a split to an idle parser machine
- Parser reads a document at a time and emits (term, doc) pairs
35. Parallel tasks
- Parser writes pairs into j partitions
  - Each covers a range of terms' first letters (e.g., a-f, g-p, q-z); here j = 3 (a routing sketch follows below).
- Now to complete the index inversion
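A sketch of the first-letter router a parser might apply to each emitted pair (the function name and exact ranges are illustrative):

    def partition(term):
        # route a (term, doc) pair by the term's first letter:
        # a-f -> partition 0, g-p -> partition 1, q-z -> partition 2
        c = term[0].lower()
        if c <= 'f':
            return 0
        if c <= 'p':
            return 1
        return 2

    print(partition("brutus"), partition("killed"), partition("romans"))  # 0 1 2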
36. Data flow
[Figure: the master assigns splits to parsers and term-range partitions to inverters; each parser emits a-f, g-p, and q-z segment files, and each inverter collects one range (a-f, g-p, or q-z) into postings.]
37. Inverters
- Collect all (term, doc) pairs for one term partition
- Sort and write to postings lists
- Each partition contains a set of postings
The above process flow is a special case of MapReduce.
38. MapReduce
- Model for processing large data sets.
- Contains Map and Reduce functions.
- Runs on a large cluster of machines.
- A lot of MapReduce programs are executed on Google's cluster every day.
39. Motivation
- Input data is large
  - The whole Web: billions of pages
- Lots of machines
  - Use them efficiently
40. A real example
- Term frequencies through the whole Web repository.
- Count of URL access frequency.
- Reverse web-link graph.
- ...
41. Programming model
- Input and Output: each a set of key/value pairs
- Programmer specifies two functions:
- map(in_key, in_value) -> list(out_key, intermediate_value)
  - Processes an input key/value pair
  - Produces a set of intermediate pairs
- reduce(out_key, list(intermediate_value)) -> list(out_value)
  - Combines all intermediate values for a particular key
  - Produces a set of merged output values (usually just one)
42. Example
- Page 1: the weather is good
- Page 2: today is good
- Page 3: good weather is good.
43. Example: Count word occurrences

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
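A hedged Python rendering of the same computation, run on the three pages from slide 42; the explicit "shuffle" grouping below is what the MapReduce library does between the two phases:

    from collections import defaultdict

    def map_fn(doc_name, contents):
        return [(word, 1) for word in contents.split()]

    def reduce_fn(word, counts):
        return (word, sum(counts))

    pages = {"Page 1": "the weather is good",
             "Page 2": "today is good",
             "Page 3": "good weather is good"}

    intermediate = []                             # map phase
    for name, text in pages.items():
        intermediate.extend(map_fn(name, text))

    groups = defaultdict(list)                    # shuffle: group values by key
    for key, value in intermediate:
        groups[key].append(value)

    print(sorted(reduce_fn(k, vs) for k, vs in groups.items()))
    # [('good', 4), ('is', 3), ('the', 1), ('today', 1), ('weather', 2)]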
44. Map output
- Worker 1: (the, 1), (weather, 1), (is, 1), (good, 1).
- Worker 2: (today, 1), (is, 1), (good, 1).
- Worker 3: (good, 1), (weather, 1), (is, 1), (good, 1).
45. Reduce Input
- Worker 1: (the, 1)
- Worker 2: (is, 1), (is, 1), (is, 1)
- Worker 3: (weather, 1), (weather, 1)
- Worker 4: (today, 1)
- Worker 5: (good, 1), (good, 1), (good, 1), (good, 1)
46. Reduce Output
- Worker 1: (the, 1)
- Worker 2: (is, 3)
- Worker 3: (weather, 2)
- Worker 4: (today, 1)
- Worker 5: (good, 4)
49. Fault tolerance
- Typical cluster:
  - 100s/1000s of 2-CPU x86 machines, 2-4 GB of memory
  - Storage is on local IDE disks
  - GFS distributed file system manages data (SOSP '03)
  - Job scheduling system: jobs made up of tasks; scheduler assigns tasks to machines
- Implementation is a C++ library linked into user programs
50. Fault tolerance
- On worker failure:
  - Detect failure via periodic heartbeats
  - Re-execute completed and in-progress map tasks
  - Re-execute in-progress reduce tasks
  - Task completion committed through master
- Master failure:
  - Could handle, but don't yet (master failure unlikely)
51. Performance
- Scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records): ~150 seconds.
- Sort 10^10 100-byte records (modeled after the TeraSort benchmark): 839 seconds.
52. More and more MapReduce
53. Experience: Rewrite of Production Indexing System
- Rewrote Google's production indexing system using MapReduce
- Set of 24 MapReduce operations
- New code is simpler, easier to understand
- MapReduce takes care of failures and slow machines
- Easy to make indexing faster by adding more machines
54. MapReduce Overview
- MapReduce has proven to be a useful abstraction
- Greatly simplifies large-scale computations at Google
- Fun to use: focus on the problem, let the library deal with messy details
55. Resources
- MG, Chapter 5
- "MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat, OSDI 2004
- "Indexing Shared Content in Information Retrieval Systems", A. Broder et al., EDBT 2006
- "High Performance Index Build Algorithms for Intranet Search Engines", M. F. Fontoura et al., VLDB 2004