INF 2914 Information Retrieval and Web Search

Transcript and Presenter's Notes


1
INF 2914Information Retrieval and Web Search
  • Lecture 6 Index Construction
  • These slides are adapted from Stanford's class
    CS276 / LING 286
  • Information Retrieval and Web Mining

2
(Offline) Search Engine Data Flow
[Diagram: four-stage offline pipeline]
  1. Crawler → web pages
  2. Parse / Tokenize: parse, tokenize, per-page analysis → tokenized web pages
  3. Global Analysis (runs in background): dup detection, static rank, anchor text, spam analysis → dup table, rank table, anchor text, spam table
  4. Index Build: scan tokenized web pages, anchor text, etc., and generate the text index → inverted text index
3
Inverted index
  • For each term T, we must store a list of all
    documents that contain T.

Example postings lists, one per term (each docID entry is a posting):
term1 → 2, 4, 8, 16, 32, 64, 128
term2 → 2, 3, 5, 8, 13, 21, 34
term3 → 1, 13, 16
Sorted by docID (more later on why).
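A minimal Python sketch of this structure, using an in-memory dict (the term names are illustrative stand-ins; real indexes keep compressed postings on disk):

# Inverted index: each term maps to a docID-sorted postings list.
index = {
    "term1": [2, 4, 8, 16, 32, 64, 128],
    "term2": [2, 3, 5, 8, 13, 21, 34],
    "term3": [1, 13, 16],
}
# Keeping postings sorted by docID is what makes the linear-time
# query merge (later slides) possible.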
4
Inverted index construction
Documents to be indexed.
Friends, Romans, countrymen.
5
Indexer steps
  • Sequence of (Modified token, Document ID) pairs.

Doc 1
Doc 2
I did enact Julius Caesar I was killed i' the
Capitol Brutus killed me.
So let it be with Caesar. The noble Brutus hath
told you Caesar was ambitious
6
  • Sort by terms.

Core indexing step.
7
  • Multiple term entries in a single document are
    merged.
  • Frequency information is added.

Why frequency? Will discuss later.
8
  • The result is split into a Dictionary file and a
    Postings file (sketched below).
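A toy Python sketch of slides 5-8, assuming whitespace tokenization and naive lowercasing (a simplification, not the lecture's tokenizer):

from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

# 1. Emit (modified token, docID) pairs, then sort by term: the core step.
pairs = sorted(
    (token.strip(".,'").lower(), doc_id)
    for doc_id, text in docs.items()
    for token in text.split()
)

# 2. Merge multiple entries for the same (term, doc), adding frequencies.
postings = defaultdict(list)        # term -> [(docID, term frequency), ...]
for term, doc_id in pairs:
    if postings[term] and postings[term][-1][0] == doc_id:
        postings[term][-1] = (doc_id, postings[term][-1][1] + 1)
    else:
        postings[term].append((doc_id, 1))

# 3. Split into a dictionary (term -> document frequency) and postings.
dictionary = {term: len(plist) for term, plist in postings.items()}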

9
The index we just built
  • How do we process a query?

10
Query processing: AND
  • Consider processing the query
  • Brutus AND Caesar
  • Locate Brutus in the Dictionary
  • Retrieve its postings.
  • Locate Caesar in the Dictionary
  • Retrieve its postings.
  • Merge the two postings

Brutus → 2, 4, 8, 16, 32, 64, 128
Caesar → 2, 3, 5, 8, 13, 21, 34
11
The merge
  • Walk through the two postings simultaneously, in
    time linear in the total number of postings
    entries

[Diagram: the two postings lists above being merged]
If the list lengths are x and y, the merge takes
O(x+y) operations (see the sketch below). Crucial:
postings sorted by docID.
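A Python sketch of the merge, assuming both postings lists are docID-sorted:

def intersect(p1, p2):
    # Walk both lists in lockstep: O(x + y) comparisons total.
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

# Brutus AND Caesar on the lists above:
print(intersect([2, 4, 8, 16, 32, 64, 128], [2, 3, 5, 8, 13, 21, 34]))  # [2, 8]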
12
Index construction
  • How do we construct an index?
  • What strategies can we use with limited main
    memory?

13
Our corpus for this lecture
  • Number of docs: n = 1M
  • Each doc has 1K terms
  • Number of distinct terms: m = 500K
  • 667 million postings entries

14
How many postings?
  • Order terms by frequency and group them into
    blocks of J terms each (a Zipf's-law argument);
    the number of 1s in the i-th block is ≈ nJ/i
  • Summing this over m/J blocks, we have
    Σ nJ/i ≈ nJ ln(m/J)
  • For our numbers, this should be about 667 million
    postings.

15
Recall index construction
  • Documents are processed to extract words and
    these are saved with the Document ID.

Doc 1
Doc 2
I did enact Julius Caesar I was killed i' the
Capitol Brutus killed me.
So let it be with Caesar. The noble Brutus hath
told you Caesar was ambitious
16
Key step
  • After all documents have been processed, the
    inverted file is sorted by terms.

We focus on this sort step. We have 667M items to
sort.
17
Index construction
  • At 10-12 bytes per postings entry, the sort
    demands several gigabytes of temporary space

18
System parameters for design
  • Disk seek: 10 milliseconds
  • Block transfer from disk: 1 microsecond per byte
    (following a seek)
  • All other ops: 1 microsecond
  • E.g., compare two postings entries and decide
    their merge order

19
Bottleneck
  • Build postings entries one doc at a time
  • Now sort postings entries by term (then by doc
    within each term)
  • Doing this with random disk seeks would be too
    slow: must sort N = 667M records

If every comparison took 2 disk seeks, and N
items could be sorted with N log2N comparisons,
how long would this take?
20
Disk-based sorting
  • Build postings entries one doc at a time
  • Now sort postings entries by term
  • Doing this with random disk seeks would be too
    slow: must sort N = 667M records

If every comparison took 2 disk seeks, and N
items could be sorted with N log2N comparisons,
how long would this take? 12.4 years!!!
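A quick sanity check of that figure in Python, under the stated assumptions (2 seeks per comparison at 10 ms each, N log2 N comparisons):

import math

N = 667e6                             # records to sort
seek = 10e-3                          # 10 ms per disk seek
comparisons = N * math.log2(N)
seconds = comparisons * 2 * seek      # 2 seeks per comparison
print(seconds / (365 * 24 * 3600))    # ~12.4 years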
21
Sorting with fewer disk seeks
  • 12-byte (4+4+4) records (term, doc, freq).
  • These are generated as we process docs.
  • Must now sort 667M such 12-byte records by term.
  • Define a block = 10M such records
  • can easily fit a couple into memory.
  • Will have 64 such blocks to start with.
  • Will sort within blocks first, then merge the
    blocks into one long sorted order.

22
Sorting 64 blocks of 10M records
  • First, read each block and sort within.
  • Quicksort takes 2N ln N expected steps
  • In our case 2 x (10M ln 10M) steps
  • Time to Quicksort each block: 320 seconds
  • Total time to read each block from disk and write
    it back:
  • 120MB x 2 x 10^-6 s/byte = 240 seconds
  • 64 times these estimates gives us 64 sorted runs
    of 10M records each
  • Total Quicksort time: 5.6 hours
  • Total read/write time: 4.2 hours
  • Total for this phase: 10 hours
  • Need 2 copies of data on disk throughout

23
Merging 64 sorted runs
  • Merge tree of log2 64 = 6 layers.
  • During each layer, read into memory runs in
    blocks of 10M, merge, write back.

[Diagram: runs 1-4 on disk being merged into one merged run]
24
Merge tree
1 run, 640M records
2 runs, 320M/run
4 runs, 160M/run
8 runs, 80M/run
16 runs, 40M/run
32 runs, 20M/run
Bottom level of tree: 64 sorted runs, 10M records each.
25
Merging 64 runs
  • Time estimate for disk transfer:
  • 6 x time to read/write 64 blocks
  • 6 x 4.2 hours = 25 hours
  • Time estimate for the merge operation:
  • 6 x 640M x 10^-6 s = 1 hour
  • Time estimate for the overall algorithm:
  • Sort time + merge time = 10 + 26 = 36 hours
    (checked in the sketch below)
  • Lower bound (main-memory sort):
  • Time to read/write: 4.2 hours
  • Time to sort in memory: 10.7 hours
  • Total time: 15 hours
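The arithmetic behind these estimates, reproduced in Python under the slide's parameters (1 microsecond per byte transferred and per in-memory operation):

import math

op = 1e-6                     # 1 us per byte / per operation
block = 10e6                  # records per block
blocks = 64
rec_bytes = 12

qsort_one = 2 * block * math.log(block) * op        # ~320 s per block
io_one = rec_bytes * block * 2 * op                 # read + write: ~240 s
sort_phase = blocks * (qsort_one + io_one) / 3600   # ~10 hours

merge_io = 6 * blocks * io_one / 3600               # 6 layers: ~25 hours
merge_ops = 6 * 640e6 * op / 3600                   # ~1 hour
print(sort_phase + merge_io + merge_ops)            # ~36 hours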

26
Some indexing numbers
27
How to improve indexing time?
  • Compression of the sorted runs
  • Multi-way merge: heap-merge all runs at once
  • Radix sort (linear-time sorting)
  • Pipelining the reading, sorting, and writing
    phases

28
Multi-way merge
[Diagram: a heap merging all 64 sorted runs at once]
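A Python sketch using the standard library's heap-based k-way merge; three short runs stand in for the 64 sorted runs of 10M records:

import heapq

runs = [[2, 8, 13], [3, 5, 21], [1, 34]]   # each run already sorted
merged = list(heapq.merge(*runs))           # one heap, one pass over the data
print(merged)                               # [1, 2, 3, 5, 8, 13, 21, 34]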
29
Indexing improvements
  • Radix sort
  • Linear-time sorting
  • Flexibility in defining the sort criteria
  • Bigger sort buffers increase performance
    (contradicting previous literature; see the VLDB
    paper in the references)
  • Pipelining the read, sort, and write phases

[Diagram: pipelined timeline; while buffer B1 is sorted and written, buffer B2 is read, and vice versa]
30
Positional indexing
  • Given documents:
    D1: This is a test
    D2: Is this a test
    D3: This is not a test
  • Reorganize by term:

TERM   DOC  LOC  DATA(caps)
this   1    0    1
is     1    1    0
a      1    2    0
test   1    3    0
is     2    0    1
this   2    1    0
a      2    2    0
test   2    3    0
this   3    0    1
is     3    1    0
not    3    2    0
a      3    3    0
test   3    4    0

31
Positional indexing
In postings-list format:
a    → (1,2,0), (2,2,0), (3,3,0)
is   → (1,1,0), (2,0,1), (3,1,0)
not  → (3,2,0)
test → (1,3,0), (2,3,0), (3,4,0)
this → (1,0,1), (2,1,0), (3,0,1)

Sort by <term, doc, loc>:

TERM   DOC  LOC  DATA(caps)
a      1    2    0
a      2    2    0
a      3    3    0
is     1    1    0
is     2    0    1
is     3    1    0
not    3    2    0
test   1    3    0
test   2    3    0
test   3    4    0
this   1    0    1
this   2    1    0
this   3    0    1
32
Positional indexing with radix sort
  • Radix key:
  • Token hash = 8 bytes
  • Document ID = 8 bytes
  • Location = 4 bytes, but no need to sort by
    location, since radix sort is stable (see the
    sketch below)!
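A Python sketch of the stability argument, substituting Python's stable list sort for radix sort (the records reuse the positional-indexing example; the hash key is illustrative):

records = [                 # (term, doc, loc, caps) in parse order
    ("this", 1, 0, 1), ("is", 1, 1, 0), ("a", 1, 2, 0), ("test", 1, 3, 0),
    ("is", 2, 0, 1), ("this", 2, 1, 0), ("a", 2, 2, 0), ("test", 2, 3, 0),
]
# Sort by (token hash, docID) only: since records were generated in
# location order and the sort is stable, loc stays ascending within
# each (term, doc) group without ever being part of the key.
records.sort(key=lambda r: (hash(r[0]), r[1]))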

33
Distributed indexing
  • Maintain a master machine directing the indexing
    job (considered safe)
  • Break up indexing into sets of (parallel) tasks
  • Master machine assigns each task to an idle
    machine from a pool

34
Parallel tasks
  • We will use two sets of parallel tasks
  • Parsers
  • Inverters
  • Break the input document corpus into splits
  • Each split is a subset of documents
  • Master assigns a split to an idle parser machine
  • Parser reads a document at a time and emits
    (term, doc) pairs

35
Parallel tasks
  • Parser writes pairs into j partitions
  • Each covers a range of terms' first letters
    (e.g., a-f, g-p, q-z); here j = 3 (see the sketch
    below)
  • Now to complete the index inversion
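A possible partition function, sketched in Python (the term ranges come from the slide; routing by first letter like this is an assumption about the parser):

def partition(term: str) -> int:
    # Map a term to one of j = 3 partitions by its first letter.
    c = term[0].lower()
    if c <= 'f':
        return 0    # a-f
    if c <= 'p':
        return 1    # g-p
    return 2        # q-z

print(partition("caesar"), partition("noble"), partition("told"))  # 0 1 2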

36
Data flow
[Diagram: the Master assigns splits to Parsers and partitions to Inverters; each Parser writes a-f, g-p, and q-z segment files, and one Inverter per term range collects its segments into postings]
37
Inverters
  • Collect all (term, doc) pairs for a partition
  • Sort and write to postings lists
  • Each partition contains a set of postings

The process flow above is a special case of MapReduce.
38
MapReduce
  • Model for processing large data sets.
  • Contains Map and Reduce functions.
  • Runs on a large cluster of machines.
  • Many MapReduce programs are executed on
    Google's cluster every day.

39
Motivation
  • Input data is large
  • The whole Web: billions of pages
  • Lots of machines
  • Use them efficiently

40
A real example
  • Term frequencies across the whole Web
    repository.
  • Count of URL access frequency.
  • Reverse web-link graph.
  • ...

41
Programming model
  • Input, Output: each a set of key/value pairs
  • Programmer specifies two functions:
  • map(in_key, in_value) -> list(out_key,
    intermediate_value)
  • Processes an input key/value pair
  • Produces a set of intermediate pairs
  • reduce(out_key, list(intermediate_value)) ->
    list(out_value)
  • Combines all intermediate values for a particular
    key
  • Produces a set of merged output values (usually
    just one)

42
Example
  • Page 1: "the weather is good"
  • Page 2: "today is good"
  • Page 3: "good weather is good."

43
Example: Count word occurrences

map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
        EmitIntermediate(w, "1")

reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // intermediate_values: a list of counts
    int result = 0
    for each v in intermediate_values:
        result += ParseInt(v)
    Emit(AsString(result))
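The same word count as a runnable single-process Python simulation; sorting plus groupby plays the role of the shuffle phase that the real system distributes across machines:

from itertools import groupby

pages = {
    "Page 1": "the weather is good",
    "Page 2": "today is good",
    "Page 3": "good weather is good",
}

def map_fn(key, value):
    # Emit (word, 1) for every word in the page.
    return [(w, 1) for w in value.split()]

def reduce_fn(key, values):
    # Combine all counts for one word.
    return sum(values)

intermediate = sorted(kv for k, v in pages.items() for kv in map_fn(k, v))
for word, group in groupby(intermediate, key=lambda kv: kv[0]):
    print(word, reduce_fn(word, [count for _, count in group]))
# good 4 / is 3 / the 1 / today 1 / weather 2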

44
Map output
  • Worker 1
  • (the 1), (weather 1), (is 1), (good 1).
  • Worker 2
  • (today 1), (is 1), (good 1).
  • Worker 3
  • (good 1), (weather 1), (is 1), (good 1).

45
Reduce Input
  • Worker 1
  • (the 1)
  • Worker 2
  • (is 1), (is 1), (is 1)
  • Worker 3
  • (weather 1), (weather 1)
  • Worker 4
  • (today 1)
  • Worker 5
  • (good 1), (good 1), (good 1), (good 1)

46
Reduce Output
  • Worker 1
  • (the 1)
  • Worker 2
  • (is 3)
  • Worker 3
  • (weather 2)
  • Worker 4
  • (today 1)
  • Worker 5
  • (good 4)

49
Fault tolerance
  • Typical cluster
  • 100s/1000s of 2-CPU x86 machines, 2-4 GB of
    memory
  • Storage is on local IDE disks
  • GFS distributed file system manages data
    (SOSP'03)
  • Job scheduling system jobs made up of tasks,
    scheduler assigns tasks to machines
  • Implementation is a C++ library linked into user
    programs

50
Fault tolerance
  • On worker failure
  • Detect failure via periodic heartbeats
  • Re-execute completed and in-progress map tasks
  • Re-execute in-progress reduce tasks
  • Task completion committed through master
  • Master failure
  • Could handle, but don't yet (master failure
    unlikely)

51
Performance
  • Scan 10^10 100-byte records to extract records
    matching a rare pattern (92K matching records):
    150 seconds.
  • Sort 10^10 100-byte records (modeled after the
    TeraSort benchmark): 839 seconds.

52
More and more MapReduce
53
Experience: Rewrite of Production Indexing System
  • Rewrote Google's production indexing system using
    MapReduce
  • Set of 24 MapReduce operations
  • New code is simpler, easier to understand
  • MapReduce takes care of failures, slow machines
  • Easy to make indexing faster by adding more
    machines

54
MapReduce Overview
  • MapReduce has proven to be a useful abstraction
  • Greatly simplifies large-scale computations at
    Google
  • Fun to use: focus on the problem, let the
    library deal w/ messy details

55
Resources
  • MG (Managing Gigabytes), Chapter 5
  • MapReduce: Simplified Data Processing on Large
    Clusters, Jeffrey Dean and Sanjay Ghemawat,
    OSDI 2004
  • Indexing Shared Content in Information Retrieval
    Systems, A. Broder et al., EDBT 2006
  • High Performance Index Build Algorithms for
    Intranet Search Engines, M. Fontoura et al.,
    VLDB 2004
