Title: Building a Distributed Full-Text Index for the Web
1Building a Distributed Full-Text Index for the Web
S. Melnik, S. Raghavan, B. Yang, H. Garcia-Molina
2- Introduction.
- Testbed architecture.
- Design of the indexer.
- Distributed indexing.
4Pages:
1: Pig Cat Fish Cat
2: Fly Dog Pig
3: Dog Cat Fish Dog
Inverted index (one inverted list per term; each pair is a location):
Cat -> (1,2), (1,4), (3,2)
Dog -> (2,2), (3,1), (3,4)
Fish -> (1,3), (3,3)
Pig -> (1,1), (2,3)
5- An inverted index consists of one inverted list per index term, with the terms kept in sorted order.
- An inverted list is a sorted list of locations.
- A location consists of (page identifier, position in the page).
- A posting consists of (index term, location).
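As a minimal sketch of these definitions (not part of the original slides), the Python below derives postings and inverted lists for the three example pages. The names `PAGES`, `extract_postings`, and `build_index` are illustrative assumptions, as are the choices to lowercase terms and count positions from 1.

```python
from collections import defaultdict

# The three example pages, keyed by page identifier.
PAGES = {1: "Pig Cat Fish Cat", 2: "Fly Dog Pig", 3: "Dog Cat Fish Dog"}

def extract_postings(page_id, text):
    """Return postings (term, (page_id, position)); positions start at 1."""
    return [(word.lower(), (page_id, pos))
            for pos, word in enumerate(text.split(), start=1)]

def build_index(pages):
    """Group postings by term; each inverted list is a sorted list of locations."""
    lists = defaultdict(list)
    for page_id, text in pages.items():
        for term, location in extract_postings(page_id, text):
            lists[term].append(location)
    # Keep terms in sorted order and sort the locations inside each list.
    return {term: sorted(locs) for term, locs in sorted(lists.items())}

index = build_index(PAGES)
for term, locations in index.items():
    print(term, "->", locations)
```

Because the sketch indexes every word, it also produces a list for Fly, which is located at (2,1).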
6Building an inverted index over a collection of web pages involves:
1. Processing each page to extract postings.
2. Building an inverted list for each term.
3. Writing the inverted lists out to disk.
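The three steps can be sketched for a tiny in-memory collection as follows. This is only an illustration under stated assumptions: the function name `build_and_write`, the one-list-per-line text format, and the temporary output file are all hypothetical, and a web-scale indexer could not hold all postings in memory as this sketch does.

```python
import itertools
import os
import tempfile

def build_and_write(pages, path):
    # Step 1: process each page to extract postings (term, page_id, position).
    postings = [(word.lower(), page_id, pos)
                for page_id, text in pages.items()
                for pos, word in enumerate(text.split(), start=1)]
    # Step 2: sort by term, then by location, so that each run of equal
    # terms forms one inverted list.
    postings.sort()
    # Step 3: write one inverted list per line (assumed text format).
    with open(path, "w", encoding="utf-8") as out:
        for term, group in itertools.groupby(postings, key=lambda p: p[0]):
            locations = " ".join(f"({pid},{pos})" for _, pid, pos in group)
            out.write(f"{term} {locations}\n")

pages = {1: "Pig Cat Fish Cat", 2: "Fly Dog Pig", 3: "Dog Cat Fish Dog"}
path = os.path.join(tempfile.mkdtemp(), "index.txt")
build_and_write(pages, path)
with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print("\n".join(lines))
```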
7Important problems when building a web-scale inverted index:
1. Scale and growth rate.
2. Rate of change.
10- Distributors.
- Indexers.
- Query servers.
11- Distributed inverted index organization
- Local inverted files.
- Global inverted files.
12Global inverted files
Pages:
1: Pig Cat Fish Cat
2: Fly Dog Pig
3: Dog Cat Fish Dog
Query server 1 (terms a-e):
Cat -> (1,2), (1,4), (3,2)
Dog -> (2,2), (3,1), (3,4)
Query server 2 (terms f-z):
Fish -> (1,3), (3,3)
Pig -> (1,1), (2,3)
13Local inverted files
Pages:
1: Pig Cat Fish Cat
2: Fly Dog Pig
3: Dog Cat Fish Dog
Query server 1 (pages 1 and 2):
Cat -> (1,2), (1,4)
Dog -> (2,2)
Fish -> (1,3)
Fly -> (2,1)
Pig -> (1,1), (2,3)
Query server 2 (page 3):
Cat -> (3,2)
Dog -> (3,1), (3,4)
Fish -> (3,3)
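The two organizations can be contrasted with a short sketch. It assumes the page groups and term ranges shown on the slides (pages 1 and 2 versus page 3 for local files; a-e versus f-z for global files). The function names are hypothetical, and the sketch also indexes Fly, which the global-files slide does not list.

```python
from collections import defaultdict

PAGES = {1: "Pig Cat Fish Cat", 2: "Fly Dog Pig", 3: "Dog Cat Fish Dog"}

def invert(pages):
    """Build term -> sorted list of (page_id, position) for the given pages."""
    index = defaultdict(list)
    for page_id in sorted(pages):
        for pos, word in enumerate(pages[page_id].split(), start=1):
            index[word.lower()].append((page_id, pos))
    return dict(index)

def local_files(pages, page_groups):
    """Local organization: each server indexes only its own subset of pages."""
    return [invert({p: pages[p] for p in group}) for group in page_groups]

def global_files(pages, term_ranges):
    """Global organization: each server holds the complete lists for the
    terms whose first letter falls in its range."""
    full = invert(pages)
    return [{t: locs for t, locs in full.items() if lo <= t[0] <= hi}
            for lo, hi in term_ranges]

local = local_files(PAGES, [[1, 2], [3]])
global_ = global_files(PAGES, [("a", "e"), ("f", "z")])
print(local[1]["cat"], global_[0]["cat"])
```

In this sketch a lookup for "cat" has to consult both servers under the local organization, but only the a-e server under the global one.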
14Local vs. Global
- Resilience to failures.
- Network load.
15Testbed environment
The indexers and the query servers are single-processor PCs with 350-500 MHz processors and 300-500 MB of main memory, each equipped with multiple disks. All the machines are interconnected by a 100 Mbps Ethernet LAN.
16The WebBase collection
To study some properties of web pages that are relevant to text indexing, we analyzed 5 samples, of 100,000 pages each, from different portions of the WebBase repository.
17Property | Value
Average number of words per page | 438
Average number of distinct words per page | 171
Average size of each page (as HTML) | 8650
Average size of each page after removing HTML tags | 2815
Average size of a word in the vocabulary | 8
Table 1: Properties of the WebBase collection
21- Design of the Indexer
- Software pipeline.
- The storage of the inverted files generated by
the process.
22- Software pipeline
- The process can logically be split into 3 phases:
- Processing -> CPU intensive.
- Flushing -> disk.
- Loading -> network.
24The goal of our pipelining technique is to design an execution schedule for the different indexing phases that will result in minimal overall running time.
Example: [Figure: execution of the pipeline, showing the L (loading), P (processing), and F (flushing) phases]
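To see why overlapping the phases helps, here is a rough timing sketch. It is a simplified model and not the analysis from the talk: it assumes `n` identical batches with constant phase times `l`, `p`, and `f` (all hypothetical symbols), that each phase uses a different resource, that each resource serves one batch at a time, and that buffering between phases is unlimited.

```python
def sequential_time(n, l, p, f):
    """No overlap: every batch is loaded, processed, and flushed in turn."""
    return n * (l + p + f)

def pipelined_time(n, l, p, f):
    """Overlap phases of different batches; each resource (network, CPU,
    disk) still handles only one batch at a time."""
    load_done = proc_done = flush_done = 0
    for _ in range(n):
        load_done = load_done + l                     # network is busy back to back
        proc_done = max(load_done, proc_done) + p     # CPU waits for data and for itself
        flush_done = max(proc_done, flush_done) + f   # disk waits for a processed batch
    return flush_done

# Hypothetical phase times: 4 batches, l=3, p=5, f=2 time units.
print(sequential_time(4, 3, 5, 2))  # 40
print(pipelined_time(4, 3, 5, 2))   # 25
```

In this model the pipelined running time is dominated by the slowest phase, here processing.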
26[Figure: Pipeline time]
27Theoretical analysis vs. experimental results
31Storage schemes
We considered three storage schemes for storing inverted files as sets of (key, value) pairs in a B-tree:
1. Full list.
2. Single payload.
3. Mixed list.
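The slides name the three schemes without defining them. The sketch below shows one plausible reading of each as a list of (key, value) pairs, and every layout in it is an assumption: full list stores a term with its whole inverted list, single payload stores one pair per posting, and mixed list stores a leading posting as the key with a group of following postings as the value. The `group_size` parameter is hypothetical, and plain Python lists stand in for the B-tree.

```python
# Sorted postings for a small example: (term, (page_id, position)).
POSTINGS = sorted([
    ("cat", (1, 2)), ("cat", (1, 4)), ("cat", (3, 2)),
    ("dog", (2, 2)), ("dog", (3, 1)), ("dog", (3, 4)),
])

def full_list(postings):
    """Assumed layout: key = term, value = that term's whole inverted list."""
    pairs = {}
    for term, loc in postings:
        pairs.setdefault(term, []).append(loc)
    return sorted(pairs.items())

def single_payload(postings):
    """Assumed layout: one (key, value) pair per posting."""
    return [(term, loc) for term, loc in postings]

def mixed_list(postings, group_size=4):
    """Assumed layout: key = first posting of a group, value = the postings
    that follow it, possibly spanning more than one term."""
    pairs = []
    for start in range(0, len(postings), group_size):
        group = postings[start:start + group_size]
        pairs.append((group[0], group[1:]))
    return pairs

for scheme in (full_list, single_payload, mixed_list):
    print(scheme.__name__, len(scheme(POSTINGS)), "pairs")
```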
33- A qualitative comparison of these storage schemes
- Index size
- Zig-zag joins
- Hot updates
34Zig-zag join using ordered indexes
[Figure: two ordered lists being joined]
List 1: 1, 2, 3, 4, 7, 9, 18
List 2: 1, 7, 9, 11, 12, 17, 19
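A zig-zag join intersects two ordered lists by letting each side skip ahead to the other side's current value instead of scanning every entry. The sketch below is a generic illustration and not the talk's implementation: it uses Python's `bisect_left` on in-memory lists as a stand-in for a seek in an ordered index, and it takes the values from the slide in sorted order, which is an assumption about the garbled figure.

```python
from bisect import bisect_left

def zigzag_join(a, b):
    """Intersect two sorted lists, seeking forward on whichever side is behind."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Seek in a to the first entry >= b[j], skipping everything between.
            i = bisect_left(a, b[j], i + 1)
        else:
            # Seek in b to the first entry >= a[i].
            j = bisect_left(b, a[i], j + 1)
    return result

list_1 = [1, 2, 3, 4, 7, 9, 18]
list_2 = [1, 7, 9, 11, 12, 17, 19]
print(zigzag_join(list_1, list_2))  # [1, 7, 9]
```

After the match on 1, the join jumps from 2 directly to 7 in the first list, skipping 3 and 4.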
35Experimental results (using mixed list)
36Number of pages (million) | Input size (GB) | Index size (GB) | Index size (%age)
0.1 | 0.81 | 0.05 | 6.17
0.5 | 4.03 | 0.27 | 6.70
2.0 | 16.11 | 1.13 | 7.01
5.0 | 40.28 | 2.78 | 6.90
Table 5: Mixed-list scheme index sizes
Only one posting was generated for all the occurrences of a word in a page.
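The percentage column appears to be the index size expressed as a percentage of the input size. That reading is an inference from the numbers and not something the slide states, and the short check below recomputes it from the two size columns. The names `ROWS` and `index_percentage` are illustrative.

```python
# Rows of Table 5: (pages in millions, input size in GB, index size in GB,
# reported index size percentage).
ROWS = [
    (0.1, 0.81, 0.05, 6.17),
    (0.5, 4.03, 0.27, 6.70),
    (2.0, 16.11, 1.13, 7.01),
    (5.0, 40.28, 2.78, 6.90),
]

def index_percentage(input_gb, index_gb):
    """Index size as a percentage of input size."""
    return 100.0 * index_gb / input_gb

computed = [round(index_percentage(inp, idx), 2) for _, inp, idx, _ in ROWS]
for (pages, _, _, reported), value in zip(ROWS, computed):
    print(f"{pages} million pages: computed {value:.2f}, reported {reported:.2f}")
```

The recomputed values agree with the reported column to two decimal places, so the index stays at roughly 6-7% of the input size across all four collection sizes.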
39- Two problems that must be addressed when building an inverted index on a distributed architecture
- Page distribution: the question of when and how to distribute pages to the indexing nodes.
- Collecting global statistics: the question of where, when, and how to compute and distribute global statistics.
40- Two strategies for page distribution
- A priori distribution.
- Runtime distribution.
41- Three advantages of runtime distribution
- Space.
- Load balancing.
- Effective pipelining.
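The load-balancing advantage can be illustrated with a toy simulation. Everything in it is an assumption made for illustration: indexers differ only in a fixed per-page cost, a priori distribution splits the pages evenly in advance, and runtime distribution hands the next page to whichever indexer becomes free first.

```python
import heapq

def a_priori_makespan(num_pages, secs_per_page):
    """Pages are split evenly up front; the slowest indexer sets the finish time."""
    n = len(secs_per_page)
    shares = [num_pages // n + (1 if i < num_pages % n else 0) for i in range(n)]
    return max(share * cost for share, cost in zip(shares, secs_per_page))

def runtime_makespan(num_pages, secs_per_page):
    """Each page goes to the indexer that becomes free earliest."""
    free_at = [(0, i) for i in range(len(secs_per_page))]  # (time free, indexer id)
    heapq.heapify(free_at)
    finish = 0
    for _ in range(num_pages):
        t, i = heapq.heappop(free_at)
        t += secs_per_page[i]
        finish = max(finish, t)
        heapq.heappush(free_at, (t, i))
    return finish

# Hypothetical costs: three indexers taking 1, 2, and 3 time units per page.
print(a_priori_makespan(12, [1, 2, 3]))  # 12
print(runtime_makespan(12, [1, 2, 3]))   # 7
```

With these made-up costs the fastest indexer ends up handling most of the pages under runtime distribution, so no indexer sits idle while another is still working through a fixed share.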
42- Collecting global statistics
- A dedicated server known as the statistician.
- Parallel computation.
- Minimize the number of conversations among servers.
- Avoid extra disk I/O.
- Reduce network overhead.
43- Two strategies for sending information to the statistician
- ME strategy: sending local information during merging.
- FL strategy: sending local information during flushing.
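A minimal sketch of the statistician idea follows. The slides do not say which statistic is collected, so per-term document frequency is used here purely as an assumed example, and the `Statistician` class and `local_document_frequencies` function are hypothetical names. Under the two strategies above, the only difference would be when an indexer calls `receive`: during merging (ME) or during flushing (FL).

```python
from collections import Counter

def local_document_frequencies(pages):
    """Count, for each term, how many of this indexer's pages contain it."""
    df = Counter()
    for text in pages.values():
        df.update(set(text.lower().split()))  # each page counts a term once
    return df

class Statistician:
    """Hypothetical stand-in for the dedicated statistics server."""

    def __init__(self):
        self.global_df = Counter()

    def receive(self, local_df):
        # Sum the local counts sent by one indexer into the global totals.
        self.global_df.update(local_df)

# Two indexers, each holding a subset of the example pages.
indexer_pages = [
    {1: "Pig Cat Fish Cat", 2: "Fly Dog Pig"},
    {3: "Dog Cat Fish Dog"},
]
statistician = Statistician()
for pages in indexer_pages:
    statistician.receive(local_document_frequencies(pages))
print(dict(statistician.global_df))
```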
46Comparison