Title: Searching the Web
1 Searching the Web
- Yoram Bachrach
- Yiftah Ben-Aharon
Based on the paper by Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, and Sriram Raghavan
2 Goal
- To better understand Web search engines
- Fundamental concepts
- Main challenges
- Design issues
- Implementation techniques and algorithms
3 Schedule
- Search engine requirements
- Components overview
- Specific modules
- Purpose
- Implementation
- Performance metrics
- Conclusion
4 What does it do?
- Processes users' queries
- Finds pages with related information
- Returns a list of resources
- Is it really that simple?
5 What does it do?
- Processes users' queries
- How is a query represented?
- Finds pages with related information
- Returns a list of resources
- Is it really that simple?
6 What does it do?
- Processes users' queries
- Finds pages with related information
- How do we find pages?
- Where in the web do we look?
- How do we store the data?
- Returns a list of resources
- Is it really that simple?
7 What does it do?
- Processes users' queries
- Finds pages with related information
- Returns a list of resources
- In what order?
- How are the pages ranked?
- Is it really that simple?
8 What does it do?
- Processes users' queries
- Finds pages with related information
- Returns a list of resources
- Is it really that simple?
- Limited resources
- Time / quality tradeoff
9 Search Engine Structure
- General Design
- Crawling
- Storage
- Indexing
- Ranking
10 Motivation
- The web is
- Used by millions
- Contains lots of information
- Link-based
- Incoherent
- Changes rapidly
- Distributed
- Traditional information retrieval was built with the exact opposite in mind
11 The Web's Characteristics
- Size
- Over a billion pages available
- 5-10K per page → tens of terabytes
- Size doubles every 2 years
- Change
- 23% change daily
- Half-life of about 10 days
- Poisson model for changes
- Bowtie structure
12 Search Engine Structure
[Architecture diagram with components: Crawl Control, Page Repository, Indexer, Collection Analysis, Indexes (Text, Structure, Utility), Query Engine, Ranking; Queries come in and Results go out.]
19 Terms
- Crawler
- Crawler control
- Indexes: text, structure, utility
- Page repository
- Indexer
- Collection analysis module
- Query engine
- Ranking module
20 Crawling
- Itsy Bitsy Spider crawling up the web!
21 Search Engine Structure
[Architecture diagram repeated; see slide 12.]
22 Crawling web pages
- What pages to download
- When to refresh
- Minimize load on web sites
- How to parallelize the process
23 Page selection
- Importance metric
- Web crawler model
- Crawler method for choosing page to download
24 Importance Metrics
- Given a page P, define how good that page is.
- Several metric types
- Interest driven
- Popularity driven
- Location driven
- Combined
25 Interest Driven
- Define a driving query Q
- Find the textual similarity between P and Q
- Define a word vocabulary w1..wn
- Define a vector for P and for Q
- Vp, Vq = <w1,...,wn>
- wi = 0 if the word does not appear in the document
- wi = inverse document frequency (IDF) otherwise
- IDF(wi) = 1 / number of appearances in the entire collection
- Importance: IS(P) = Vp · Vq (cosine product)
- Computing IDF exactly requires going over the entire web
- Estimate IDF from the pages already visited to calculate an estimated IS(P); a sketch follows
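The similarity computation above can be sketched in a few lines; this is a minimal illustration rather than the engine's actual code, and the tokenization, the example documents, and estimating IDF only from already-crawled pages are assumptions noted in the comments.

```python
import math
from collections import Counter

def idf_weights(crawled_docs):
    """IDF(w) = 1 / (number of already-crawled documents containing w).
    IDF is estimated from crawled pages only, as the slide suggests."""
    df = Counter()
    for doc in crawled_docs:
        df.update(set(doc))
    return {w: 1.0 / df[w] for w in df}

def idf_vector(tokens, idf):
    """Weight is the IDF for words present in the document, 0 otherwise."""
    return {w: idf.get(w, 0.0) for w in set(tokens)}

def importance_is(page_tokens, query_tokens, idf):
    """IS(P): cosine product of the page vector and the query vector."""
    vp, vq = idf_vector(page_tokens, idf), idf_vector(query_tokens, idf)
    dot = sum(w * vq.get(t, 0.0) for t, w in vp.items())
    norm_p = math.sqrt(sum(w * w for w in vp.values()))
    norm_q = math.sqrt(sum(w * w for w in vq.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

# Hypothetical crawled pages and driving query
idf = idf_weights([["stanford", "web", "search"], ["web", "crawler", "pages"]])
print(importance_is(["web", "search", "engine"], ["web", "search"], idf))
```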
26 Popularity Driven
- How popular a page is
- Backlink count
- IB(P): the number of pages containing a link to P
- Estimated from previous crawls as IB'(P)
- A more sophisticated metric, called PageRank: IR(P)
27 Location Driven
- IL(P): a function of the URL of P
- Words appearing in the URL
- Number of '/' characters in the URL
- Easily evaluated, requires no data from previous crawls
28 Combined Metrics
- IC(P): a function of several other metrics
- Allows using local metrics for a first stage and estimated metrics for a second stage
- IC(P) = a·IS(P) + b·IB(P) + c·IL(P)
29 Crawler Models
- A crawler
- Tries to visit more important pages first
- Only has estimates of the importance metrics
- Can only download a limited number of pages
- How well does a crawler perform?
- Crawl and Stop
- Crawl and Stop with Threshold
30 Crawl and Stop
- A crawler stops after visiting K pages
- A perfect crawler
- Visits the pages with ranks R1,...,RK
- These are called hot pages
- A real crawler
- Visits only M < K hot pages
- Performance rate: the fraction of the K visited pages that are hot
- For a random crawler the expected rate is much lower (formulas sketched after the next slide)
31 Crawl and Stop with Threshold
- A crawler stops after visiting K pages
- Hot pages are pages with an importance metric higher than a threshold G
- The crawler visits V hot pages
- Metric: the percentage of all hot pages that were visited
- Perfect crawler and random crawler baselines (see the sketch below)
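The performance formulas on these two slides did not survive extraction; the following is a hedged reconstruction in the spirit of the crawler-ordering work the slides draw on, with T the total number of pages, H the number of hot pages, M the hot pages visited under Crawl and Stop, and V the hot pages visited under the threshold variant.

```latex
% Crawl and Stop: K pages visited, M of them hot
P_{CS}(C) = \frac{M}{K}\cdot 100\%,
\qquad \text{perfect crawler: } 100\%,
\qquad \text{random crawler: } \approx \frac{K}{T}\cdot 100\%

% Crawl and Stop with Threshold: V of the H hot pages visited
P_{ST}(C) = \frac{V}{H}\cdot 100\%,
\qquad \text{perfect crawler: } \min\!\big(1, K/H\big)\cdot 100\%,
\qquad \text{random crawler: } \approx \frac{K}{T}\cdot 100\%
```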
32 Ordering Metrics
- The crawler's queue is prioritized according to an ordering metric
- The ordering metric is based on an importance metric
- Location metrics - used directly
- Popularity metrics - via estimates from previous crawls
- Similarity metrics - via estimates from the anchor text
33 Case Study - WebBase
- Using Stanford's 225,000 web pages as the entire collection
- Use the popularity importance metric IB(P)
- Assume Crawl and Stop with Threshold, G = 100
- Start at http://www.stanford.edu
- Use PageRank, backlink count, and BFS as ordering metrics
34 WebBase Results
35 Page Refresh
- Make sure pages are up-to-date
- Many possible strategies
- Uniform refresh
- Proportional to change frequency
- Need to define a metric
36 Freshness Metric
- Freshness: 1 if the local copy is fresh (up to date), 0 otherwise
- Age of a page: 0 if fresh, otherwise the time since the page was modified
37 Average Freshness
- Freshness changes over time
- Take the average freshness over a long period of time (formulas sketched below)
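The freshness and age formulas were images in the original slides; a hedged reconstruction following the standard formulation, for the local copy of a page p at time t:

```latex
F(p;t) = \begin{cases} 1 & \text{if the local copy of } p \text{ is up to date at time } t,\\
                       0 & \text{otherwise,} \end{cases}
\qquad
A(p;t) = \begin{cases} 0 & \text{if the local copy is up to date at } t,\\
                       t - t_{\mathrm{mod}}(p) & \text{otherwise,} \end{cases}

\text{where } t_{\mathrm{mod}}(p) \text{ is the time of the first change not yet re-downloaded, and}
\qquad
\bar F(p) = \lim_{t\to\infty}\frac{1}{t}\int_0^t F(p;\tau)\,d\tau
\quad\text{(average freshness; } \bar A(p) \text{ is defined analogously).}
```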
38 Refresh Strategy
- Crawlers can refresh only a certain number of pages in a given period of time.
- The page-download resource can be allocated in many ways.
- The proportional refresh policy allocates the resource proportionally to the pages' change rates.
39 Example
- The collection contains 2 pages
- E1 changes 9 times a day
- E2 changes once a day
- Simplified change model
- The day is split into 9 equal intervals, and E1 changes once in each interval
- E2 changes once during the entire day
- The only unknown is when the pages change within the intervals
- The crawler can download one page a day.
- Our goal is to maximize the freshness
40 Example (2)
41 Example (3)
- Which page do we refresh?
- If we refresh E2 at midday
- If E2 changes in the first half of the day and we refresh at midday, it remains fresh for the remaining half of the day.
- 50% chance of a 0.5-day freshness increase
- 50% chance of no increase
- Expected gain: 0.25 day of freshness
- If we refresh E1 at midday
- If E1 changes in the first half of its interval and we refresh at midday (which is the middle of that interval), it remains fresh for the remaining half of the interval, i.e. 1/18 of a day.
- 50% chance of a 1/18-day freshness increase
- 50% chance of no increase
- Expected gain: 1/36 day of freshness (a simulation sketch follows)
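A small simulation sketch of this calculation; the uniform change time within each period is an assumption, and, as on the slide, the gain is only counted until the end of the current change period.

```python
import random

def expected_gain(page, trials=200_000):
    """Expected extra fresh time (in days) from a single refresh at midday (t = 0.5).

    E1 changes once per 1/9-day interval, E2 once per day; the change time is
    drawn uniformly within the period containing midday (an assumption)."""
    period = 1 / 9 if page == "E1" else 1.0
    start = 4 / 9 if page == "E1" else 0.0      # change period containing midday
    total = 0.0
    for _ in range(trials):
        change = start + random.random() * period
        if change < 0.5:                        # the page already changed, so refreshing helps
            total += (start + period) - 0.5     # fresh until the end of the period
    return total / trials

print("refresh E1:", expected_gain("E1"))       # about 1/36 of a day
print("refresh E2:", expected_gain("E2"))       # about 0.25 day
```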
42 Example (4)
- This gives a nice estimation
- But things are more complex in real life
- Not sure that a page will change within an interval
- Have to worry about age
- Using a Poisson model shows that a uniform policy always performs better than a proportional one.
43 Example (5)
- Studies have found the best policy for a similar example
- Assume page changes follow a Poisson process.
- Assume 5 pages, which change 1, 2, 3, 4, and 5 times a day
44 Repository
45 Search Engine Structure
[Architecture diagram repeated; see slide 12.]
46 Storage
- The page repository is a scalable storage system for web pages
- Allows the Crawler to store pages
- Allows the Indexer and Collection Analysis to retrieve them
- Similar to other data storage systems - DBs or file systems
- Does not have to provide some of those systems' features: transactions, logging, a directory structure.
47 Storage Issues
- Scalability and seamless load distribution
- Dual access modes
- Random access (used by the query engine for cached pages)
- Streaming access (used by the Indexer and Collection Analysis)
- Large bulk updates - reclaim old space, avoid access/update conflicts
- Obsolete pages - remove pages no longer on the web
48 Designing a Distributed Web Repository
- The repository is designed to work over a cluster of interconnected nodes
- Page distribution across nodes
- Physical organization within a node
- Update strategy
49 Page Distribution
- How to choose a node to store a page
- Uniform distribution - any page can be sent to any node
- Hash distribution policy - hash the page ID space into the node ID space (a sketch follows)
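A minimal sketch of the hash distribution policy, using the page-identifier scheme described later for the WebBase repository (a signature of the normalized URL); the normalization rules, the MD5 signature, and the node count are illustrative assumptions.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

NUM_NODES = 4   # assumed cluster size

def normalize(url: str) -> str:
    """Very rough URL normalization: lowercase scheme and host, drop the fragment."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    return urlunsplit((parts.scheme.lower(), host, parts.path or "/", parts.query, ""))

def page_id(url: str) -> int:
    """Page ID: a signature (here MD5) of the normalized URL."""
    return int(hashlib.md5(normalize(url).encode()).hexdigest(), 16)

def node_for(url: str) -> int:
    """Hash distribution policy: map the page ID space onto the node ID space."""
    return page_id(url) % NUM_NODES

print(node_for("http://www.stanford.edu/"))
```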
50 Organization Within a Node
- Several operations required
- Add / remove a page
- High-speed streaming
- Random page access
- Hashed organization
- Treat each disk as a hash bucket
- Assign pages according to their IDs
- Log organization
- Treat the disk as one file, and add each page at the end
- Support random access using a B-tree
- Hybrid
- Hash-map a page to an extent and use a log structure within the extent.
51 Distribution Performance
52 Update Strategies
- Updates are generated by the crawler
- Several characteristics
- The time at which the crawl occurs and the repository receives the information
- Whether the crawl's information replaces the entire database or modifies parts of it
53 Batch vs. Steady
- Batch mode
- Periodically executed
- Allocated a certain amount of time
- Steady mode
- Runs all the time
- Always sends results back to the repository
54 Partial vs. Complete Crawls
- A batch-mode crawler can
- Do a complete crawl every run, and replace the entire collection
- Recrawl only a specific subset and apply updates to the existing collection - a partial crawl
- The repository can implement
- In-place updates
- Quickly refresh pages
- Shadowing - updates applied as a separate stage
- Avoids refresh-access conflicts
55 Partial vs. Complete Crawls
- Shadowing resolves the conflicts between updates and reads for queries
- Batch mode fits well with shadowing
- A steady crawler fits well with in-place updates
56 The WebBase Repository
- Distributed storage that works with the Stanford WebCrawler
- Uses a node manager for monitoring storage nodes and collecting status information
- Each page is assigned a unique identifier, a signature of its normalized URL
- URLs are normalized since the same resource can be pointed to by several URLs
- The Stanford crawler runs in batch mode, so shadowing is used by the repository
57 The WebBase Repository
58 Indexing
- Excuse me, where can I find...
59 Search Engine Structure
[Architecture diagram repeated; see slide 12.]
60 The Indexer Module
- Creates two indexes
- Text (content) index: uses traditional indexing methods such as inverted indexing.
- Structure (links) index: uses a directed graph of pages and links. Sometimes an inverted graph is also created.
61 The Collection Analysis Module
- Uses the two basic indexes created by the indexer module in order to assemble utility indexes.
- e.g. a site index.
62 Inverted Index
- A set of inverted lists, one per index term (word).
- Inverted list of a term: a sorted list of the locations in which the term appears.
- Posting: a pair (w, l) where w is a word and l is one of its locations.
- Lexicon: holds all the index's terms, with statistics about each term (not the postings); a small sketch follows.
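A minimal sketch of building an inverted index with postings and a lexicon; the whitespace tokenization and the single statistic kept per term (document frequency) are simplifying assumptions.

```python
from collections import defaultdict

def build_inverted_index(pages):
    """pages: {page_id: text}. Returns (inverted_index, lexicon).

    inverted_index maps each term to its sorted list of postings (page_id, offset);
    the lexicon keeps per-term statistics (here only the document frequency)."""
    inverted = defaultdict(list)
    lexicon = defaultdict(lambda: {"df": 0})
    for page_id, text in pages.items():
        words = text.lower().split()                   # naive tokenization (assumption)
        for offset, word in enumerate(words):
            inverted[word].append((page_id, offset))   # one posting (w, l)
        for word in set(words):
            lexicon[word]["df"] += 1                   # document frequency
    for postings in inverted.values():
        postings.sort()
    return dict(inverted), dict(lexicon)

# Hypothetical two-page collection
index, lexicon = build_inverted_index({1: "searching the web", 2: "the web changes rapidly"})
print(index["web"], lexicon["web"])    # [(1, 2), (2, 1)] {'df': 2}
```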
63 Challenges
- The index build must be
- Fast
- Economical
- (unlike traditional index building)
- Incremental indexing must be supported
- Storage: compression vs. speed
64 Index Partitioning
- Distributed text indexing can be done with
- Local inverted files (IFL)
- Each node contains a disjoint, random subset of the pages.
- The query is broadcast to all nodes.
- The result is the union of the nodes' answers.
- Global inverted files (IFG)
- Each node is responsible for only a subset of the terms in the collection.
- The query is sent only to the appropriate nodes.
65 The WebBase Indexer: Architecture
- Distributors: store the pages fetched by the crawler that need to be indexed.
- Indexers: perform the core indexing.
- Query servers: hold the inverted index, partitioned using IFL.
66 The WebBase Indexer: Stages
- Stage 1
- Loading pages from the distributors.
- Processing the pages.
- Flushing the results to disk.
- Stage 2
- Pairs of (inverted file, lexicon) are created by merging stage 1's files.
- Each pair is transferred to a query server.
67 The WebBase Indexer: Parallelizing Stage 1
- Use a 3-step pipeline, one step per action in stage 1 (loading, processing, flushing).
- Each action has a different resource profile (I/O or CPU intensive).
68 The WebBase Indexer: Parallelizing Results
- Sequential index building is about 30-40% slower than the pipelined version.
69 The WebBase Indexer: Statistics Collection Concept
- Term-level statistics must be collected
- e.g. IDF - inverse document frequency
- 1 / (number of appearances in the collection)
- Statistics are computed as part of index creation (instead of at query time).
- A special server, the Statistician, is dedicated to this task.
70 The WebBase Indexer: Statistics Collection Process
- Stage 1
- Indexers pass local information to the statistician.
- The statistician processes it (globally) and returns the results to the indexers.
- Stage 2
- Global statistics are integrated into the lexicons.
71 The WebBase Indexer: Statistics Collection Optimizations
- Send data to the statistician while it is already in memory (avoiding explicit I/O)
- FL - when flushing data to disk.
- ME - when merging the flushed data.
- Local aggregation: aggregate counts locally to send fewer messages (see the sketch below).
- e.g. one ("cat", 1000) message instead of 1000 occurrences of "cat"
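A tiny sketch of the local-aggregation idea; the (term, count) message format is an assumption used for illustration.

```python
from collections import Counter

def local_aggregate(term_stream):
    """Aggregate term occurrences locally before sending them to the statistician:
    one ("cat", 1000) message instead of 1000 separate "cat" messages."""
    return sorted(Counter(term_stream).items())

# Hypothetical stream of terms seen by one indexer while flushing a batch
print(local_aggregate(["cat"] * 1000 + ["web"] * 3))   # [('cat', 1000), ('web', 3)]
```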
72 Indexing, Conclusion
- Indexing web pages is complicated because of the scale involved (millions of pages, hundreds of gigabytes).
- Challenges: incremental indexing and personalization.
73 Ranking
- Everybody wants to rule the world
74 Search Engine Structure
[Architecture diagram repeated; see slide 12.]
75 Traditional Ranking Faults
- Many pages containing a term may be of poor quality or not relevant.
- Insufficient self-description vs. spamming.
- No use of link analysis.
76 PageRank
- Tries to capture the notion of the importance of a page.
- Uses backlinks for ranking.
- Avoids trivial spamming: distributes a page's voting power among the pages it links to.
- An important page linking to a page will raise its rank more than an unimportant one.
77 Simple PageRank
- Given by r(i) = Σ_{j ∈ B(i)} r(j) / N(j)
- Where
- B(i): the set of pages that link to i
- N(j): the number of outgoing links from j
- Well defined if the link graph is strongly connected
- Based on the Random Surfer Model - the rank of a page equals the probability of the surfer being at that page.
78 Computation of Simple PageRank (1)
79 Computation of Simple PageRank (2)
- Given a matrix A, an eigenvalue c and the corresponding eigenvector v are defined by Av = cv
- Hence r is an eigenvector of A^T for eigenvalue 1, i.e. A^T r = r
- If G is strongly connected then r is unique.
80 Computation of Simple PageRank (3)
- Simple PageRank can be computed by iterating r ← A^T r from an initial uniform vector until convergence (a sketch follows)
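A minimal sketch of the iterative (power-iteration) computation on a small hypothetical link graph, assuming it is strongly connected so that simple PageRank is well defined.

```python
def simple_pagerank(out_links, iterations=100):
    """Power iteration for simple PageRank: repeatedly apply r <- A^T r.

    out_links: {page: [pages it links to]}; every page is assumed to have
    at least one outgoing link (strongly connected graph)."""
    pages = list(out_links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for j, targets in out_links.items():
            share = rank[j] / len(targets)     # r(j) / N(j)
            for i in targets:
                new_rank[i] += share           # sum over j in B(i)
        rank = new_rank
    return rank

# Hypothetical strongly connected 3-page graph
print(simple_pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
```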
81 Simple PageRank: Example
82 Practical PageRank: The Problem
- The web is not a strongly connected graph. It contains:
- Rank sinks: clusters of pages with no links pointing outside the cluster. Pages outside the cluster will be ranked 0.
- Rank leaks: a page without any outgoing links. All pages will be ranked 0.
83 Practical PageRank: The Solution
- Remove all rank leaks.
- Add a decay factor d to simple PageRank (one formulation is sketched below)
- Based on the Bored Surfer Model - occasionally the surfer jumps to a random page instead of following a link
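The modified formula on this slide was lost in extraction; one common parameterization (the slides may use a slightly different convention for d), with m the total number of pages, is:

```latex
r(i) \;=\; \frac{1-d}{m} \;+\; d \sum_{j \in B(i)} \frac{r(j)}{N(j)}
```

With probability d the bored surfer follows a link from the current page, and with probability 1-d jumps to a page chosen uniformly at random.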
84 Practical PageRank: In Practice
- Google uses IR techniques combined with practical PageRank to determine the ranking for a query.
85 HITS: Hypertext Induced Topic Search
- A query-dependent technique.
- Produces two scores:
- Authority: a page most likely to be relevant to a given query.
- Hub: a page that points to many authorities.
- Consists of two parts:
- Identifying the focused subgraph.
- Link analysis.
86 HITS: Identifying the Focused Subgraph
- The subgraph is built from a t-sized set of pages returned for the query, expanded with pages that link to or are linked from them
- (d limits the number of in-linking pages added per page, reducing the influence of extremely popular pages like yahoo.com)
87 HITS: Link Analysis
- Calculates authority and hub scores (a_i, h_i) for each page in S
88 HITS: Link Analysis Computation
- The scores can be computed as eigenvectors, by iterating a = A^T h and h = A a
- Where
- a: the vector of authority scores
- h: the vector of hub scores
- A: the adjacency matrix, in which A_{i,j} = 1 if i points to j
- (a sketch of the iteration follows)
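A minimal sketch of the HITS iteration on a hypothetical focused subgraph; the L2 normalization and the example graph are assumptions.

```python
def hits(adj, iterations=50):
    """HITS link analysis: iterate a = A^T h and h = A a with normalization.

    adj: {page: [pages it points to]}; every linked page is assumed to
    appear as a key. Returns (authority_scores, hub_scores)."""
    pages = list(adj)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # a_i = sum of hub scores of the pages pointing to i
        auth = {p: 0.0 for p in pages}
        for j, targets in adj.items():
            for i in targets:
                auth[i] += hub[j]
        # h_j = sum of authority scores of the pages j points to
        hub = {j: sum(auth[i] for i in adj[j]) for j in pages}
        # L2-normalize so the scores stay bounded
        na = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        nh = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return auth, hub

# Hypothetical focused subgraph
authorities, hubs = hits({"p": ["q", "r"], "q": ["r"], "r": ["p"]})
```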
89 Other Link-Based Techniques
- Identifying communities: sets of pages created and used by people sharing a common interest.
- Related pages: sibling pages may be related.
- Classification / resource compilation:
- Automatic vs. manual classification.
- Identifying high-quality pages for a topic.
90 Ranking, Conclusion
- The link structure of the web contains useful information.
- Ranking methods:
- PageRank: a global, query-independent scheme for ranking search results.
- HITS: computes authorities and hubs for a given query.
- Future directions: use of other information sources, more sophisticated text analysis.
91 Conclusion
92 The Motivation
- The web's vast scale.
- Limited resources.
- The web changes rapidly.
- An important field that is in high demand.
93 The Basic Architecture
- Crawlers: travel the web, retrieving pages.
- Repositories: store the pages locally.
- Indexers: index and analyze the pages stored in the repository.
- Ranking modules: return the most promising pages to the query engine.
94 The End