Title: Searching the Web
1 Searching the Web
- Yoram Bachrach
- Yiftah Ben-Aharon
Based on the paper by Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, and Sriram Raghavan
2 Goal
- To better understand Web search engines
- Fundamental concepts
- Main challenges
- Design issues
- Implementation techniques and algorithms
3 Schedule
- Search engine requirements
- Components overview
- Specific modules
- Purpose
- Implementation
- Performance metrics
- Conclusion
4 What does it do?
- Processes users' queries
- Finds pages with related information
- Returns a list of resources
- Is it really that simple?
5 What does it do?
- Processes users' queries
- How is a query represented?
- Finds pages with related information
- Returns a list of resources
- Is it really that simple?
6 What does it do?
- Processes users' queries
- Finds pages with related information
- How do we find pages?
- Where in the web do we look?
- How do we store the data?
- Returns a list of resources
- Is it really that simple?
7 What does it do?
- Processes users' queries
- Finds pages with related information
- Returns a list of resources
- In what order?
- How are the pages ranked?
- Is it really that simple?
8 What does it do?
- Processes users' queries
- Finds pages with related information
- Returns a list of resources
- Is it really that simple?
- Limited resources
- Time / quality tradeoff
9 Search Engine Structure
- General Design
- Crawling
- Storage
- Indexing
- Ranking
10 Motivation
- The web is
- Used by millions
- Contains lots of information
- Link-based
- Incoherent
- Changes rapidly
- Distributed
- Traditional information retrieval was built with the exact opposite in mind
11 The Web's Characteristics
- Size
- Over a billion pages available
- 5-10K per page → tens of terabytes
- Size doubles every 2 years
- Change
- 23% change daily
- Half-life of about 10 days
- Poisson model for changes
- Bowtie structure
12 Search Engine Structure
[Architecture diagram with components: Crawl Control, Page Repository, Indexer, Collection Analysis, Indexes (Text, Structure, Utility), Query Engine, Ranking; Queries come in and Results go out.]
19 Terms
- Crawler
- Crawler control
- Indexes: text, structure, utility
- Page repository
- Indexer
- Collection analysis module
- Query engine
- Ranking module
20 Crawling
- Itsy Bitsy Spider crawling up the web!
21 Search Engine Structure
[Architecture diagram repeated; see slide 12.]
22 Crawling web pages
- What pages to download
- When to refresh
- Minimize load on web sites
- How to parallelize the process
23 Page selection
- Importance metric
- Web crawler model
- Crawler method for choosing page to download
24 Importance Metrics
- Given a page P, define how good that page is.
- Several metric types
- Interest driven
- Popularity driven
- Location driven
- Combined
25 Interest Driven
- Define a driving query Q
- Find the textual similarity between P and Q
- Define a word vocabulary w1..wn
- Define a vector for P and for Q
- Vp, Vq = <w1,...,wn>
- wi = 0 if the word does not appear in the document
- wi = inverse document frequency (IDF) otherwise
- IDF(wi) = 1 / number of appearances in the entire collection
- Importance: IS(P) = Vp · Vq (cosine product)
- Computing IDF exactly requires going over the entire web
- Estimate IDF from the pages already visited to calculate an estimated IS(P); a sketch follows
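The similarity computation above can be sketched in a few lines; this is a minimal illustration rather than the engine's actual code, and the tokenization, the example documents, and estimating IDF only from already-crawled pages are assumptions noted in the comments.

```python
import math
from collections import Counter

def idf_weights(crawled_docs):
    """IDF(w) = 1 / (number of already-crawled documents containing w).
    IDF is estimated from crawled pages only, as the slide suggests."""
    df = Counter()
    for doc in crawled_docs:
        df.update(set(doc))
    return {w: 1.0 / df[w] for w in df}

def idf_vector(tokens, idf):
    """Weight is the IDF for words present in the document, 0 otherwise."""
    return {w: idf.get(w, 0.0) for w in set(tokens)}

def importance_is(page_tokens, query_tokens, idf):
    """IS(P): cosine product of the page vector and the query vector."""
    vp, vq = idf_vector(page_tokens, idf), idf_vector(query_tokens, idf)
    dot = sum(w * vq.get(t, 0.0) for t, w in vp.items())
    norm_p = math.sqrt(sum(w * w for w in vp.values()))
    norm_q = math.sqrt(sum(w * w for w in vq.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

# Hypothetical crawled pages and driving query
idf = idf_weights([["stanford", "web", "search"], ["web", "crawler", "pages"]])
print(importance_is(["web", "search", "engine"], ["web", "search"], idf))
```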
26 Popularity Driven
- How popular a page is
- Backlink count
- IB(P): the number of pages containing a link to P
- Estimated from previous crawls as IB'(P)
- A more sophisticated metric, called PageRank: IR(P)
27 Location Driven
- IL(P): a function of the URL of P
- Words appearing in the URL
- Number of '/' characters in the URL
- Easily evaluated, requires no data from previous crawls
28 Combined Metrics
- IC(P): a function of several other metrics
- Allows using local metrics for a first stage and estimated metrics for a second stage
- IC(P) = a·IS(P) + b·IB(P) + c·IL(P)
29 Crawler Models
- A crawler
- Tries to visit more important pages first
- Only has estimates of the importance metrics
- Can only download a limited number of pages
- How well does a crawler perform?
- Crawl and Stop
- Crawl and Stop with Threshold
30 Crawl and Stop
- A crawler stops after visiting K pages
- A perfect crawler
- Visits the pages with ranks R1,...,RK
- These are called hot pages
- A real crawler
- Visits only M < K hot pages
- Performance rate: the fraction of the K visited pages that are hot
- For a random crawler the expected rate is much lower (formulas sketched after the next slide)
31 Crawl and Stop with Threshold
- A crawler stops after visiting K pages
- Hot pages are pages with an importance metric higher than a threshold G
- The crawler visits V hot pages
- Metric: the percentage of all hot pages that were visited
- Perfect crawler and random crawler baselines (see the sketch below)
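The performance formulas on these two slides did not survive extraction; the following is a hedged reconstruction in the spirit of the crawler-ordering work the slides draw on, with T the total number of pages, H the number of hot pages, M the hot pages visited under Crawl and Stop, and V the hot pages visited under the threshold variant.

```latex
% Crawl and Stop: K pages visited, M of them hot
P_{CS}(C) = \frac{M}{K}\cdot 100\%,
\qquad \text{perfect crawler: } 100\%,
\qquad \text{random crawler: } \approx \frac{K}{T}\cdot 100\%

% Crawl and Stop with Threshold: V of the H hot pages visited
P_{ST}(C) = \frac{V}{H}\cdot 100\%,
\qquad \text{perfect crawler: } \min\!\big(1, K/H\big)\cdot 100\%,
\qquad \text{random crawler: } \approx \frac{K}{T}\cdot 100\%
```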
32 Ordering Metrics
- The crawler's queue is prioritized according to an ordering metric
- The ordering metric is based on an importance metric
- Location metrics - used directly
- Popularity metrics - via estimates from previous crawls
- Similarity metrics - via estimates from the anchor text
33 Case Study - WebBase
- Using Stanford's 225,000 web pages as the entire collection
- Use the popularity importance metric IB(P)
- Assume Crawl and Stop with Threshold, G = 100
- Start at http://www.stanford.edu
- Use PageRank, backlink count, and BFS as ordering metrics
34 WebBase Results
35 Page Refresh
- Make sure pages are up-to-date
- Many possible strategies
- Uniform refresh
- Proportional to change frequency
- Need to define a metric
36 Freshness Metric
- Freshness: 1 if the local copy is fresh (up to date), 0 otherwise
- Age of a page: 0 if fresh, otherwise the time since the page was modified
37 Average Freshness
- Freshness changes over time
- Take the average freshness over a long period of time (formulas sketched below)
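The freshness and age formulas were images in the original slides; a hedged reconstruction following the standard formulation, for the local copy of a page p at time t:

```latex
F(p;t) = \begin{cases} 1 & \text{if the local copy of } p \text{ is up to date at time } t,\\
                       0 & \text{otherwise,} \end{cases}
\qquad
A(p;t) = \begin{cases} 0 & \text{if the local copy is up to date at } t,\\
                       t - t_{\mathrm{mod}}(p) & \text{otherwise,} \end{cases}

\text{where } t_{\mathrm{mod}}(p) \text{ is the time of the first change not yet re-downloaded, and}
\qquad
\bar F(p) = \lim_{t\to\infty}\frac{1}{t}\int_0^t F(p;\tau)\,d\tau
\quad\text{(average freshness; } \bar A(p) \text{ is defined analogously).}
```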
38 Refresh Strategy
- Crawlers can refresh only a certain number of pages in a given period of time.
- The page-download resource can be allocated in many ways.
- The proportional refresh policy allocates the resource proportionally to the pages' change rates.
39 Example
- The collection contains 2 pages
- E1 changes 9 times a day
- E2 changes once a day
- Simplified change model
- The day is split into 9 equal intervals, and E1 changes once in each interval
- E2 changes once during the entire day
- The only unknown is when the pages change within the intervals
- The crawler can download one page a day.
- Our goal is to maximize the freshness
40 Example (2)
41 Example (3)
- Which page do we refresh?
- If we refresh E2 at midday
- If E2 changes in the first half of the day and we refresh at midday, it remains fresh for the remaining half of the day.
- 50% chance of a 0.5-day freshness increase
- 50% chance of no increase
- Expected gain: 0.25 day of freshness
- If we refresh E1 at midday
- If E1 changes in the first half of its interval and we refresh at midday (which is the middle of that interval), it remains fresh for the remaining half of the interval, i.e. 1/18 of a day.
- 50% chance of a 1/18-day freshness increase
- 50% chance of no increase
- Expected gain: 1/36 day of freshness (a simulation sketch follows)
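A small simulation sketch of this calculation; the uniform change time within each period is an assumption, and, as on the slide, the gain is only counted until the end of the current change period.

```python
import random

def expected_gain(page, trials=200_000):
    """Expected extra fresh time (in days) from a single refresh at midday (t = 0.5).

    E1 changes once per 1/9-day interval, E2 once per day; the change time is
    drawn uniformly within the period containing midday (an assumption)."""
    period = 1 / 9 if page == "E1" else 1.0
    start = 4 / 9 if page == "E1" else 0.0      # change period containing midday
    total = 0.0
    for _ in range(trials):
        change = start + random.random() * period
        if change < 0.5:                        # the page already changed, so refreshing helps
            total += (start + period) - 0.5     # fresh until the end of the period
    return total / trials

print("refresh E1:", expected_gain("E1"))       # about 1/36 of a day
print("refresh E2:", expected_gain("E2"))       # about 0.25 day
```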
42 Example (4)
- This gives a nice estimation
- But things are more complex in real life
- Not sure that a page will change within an interval
- Have to worry about age
- Using a Poisson model shows that a uniform policy always performs better than a proportional one.
43 Example (5)
- Studies have found the best policy for a similar example
- Assume page changes follow a Poisson process.
- Assume 5 pages, which change 1, 2, 3, 4, and 5 times a day
44 Repository
45 Search Engine Structure
[Architecture diagram repeated; see slide 12.]
46 Storage
- The page repository is a scalable storage system for web pages
- Allows the Crawler to store pages
- Allows the Indexer and Collection Analysis to retrieve them
- Similar to other data storage systems - DBs or file systems
- Does not have to provide some of those systems' features: transactions, logging, a directory structure.
47 Storage Issues
- Scalability and seamless load distribution
- Dual access modes
- Random access (used by the query engine for cached pages)
- Streaming access (used by the Indexer and Collection Analysis)
- Large bulk updates - reclaim old space, avoid access/update conflicts
- Obsolete pages - remove pages no longer on the web
48 Designing a Distributed Web Repository
- The repository is designed to work over a cluster of interconnected nodes
- Page distribution across nodes
- Physical organization within a node
- Update strategy
49 Page Distribution
- How to choose a node to store a page
- Uniform distribution - any page can be sent to any node
- Hash distribution policy - hash the page ID space into the node ID space (a sketch follows)
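A minimal sketch of the hash distribution policy, using the page-identifier scheme described later for the WebBase repository (a signature of the normalized URL); the normalization rules, the MD5 signature, and the node count are illustrative assumptions.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

NUM_NODES = 4   # assumed cluster size

def normalize(url: str) -> str:
    """Very rough URL normalization: lowercase scheme and host, drop the fragment."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    return urlunsplit((parts.scheme.lower(), host, parts.path or "/", parts.query, ""))

def page_id(url: str) -> int:
    """Page ID: a signature (here MD5) of the normalized URL."""
    return int(hashlib.md5(normalize(url).encode()).hexdigest(), 16)

def node_for(url: str) -> int:
    """Hash distribution policy: map the page ID space onto the node ID space."""
    return page_id(url) % NUM_NODES

print(node_for("http://www.stanford.edu/"))
```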
50 Organization Within a Node
- Several operations required
- Add / remove a page
- High-speed streaming
- Random page access
- Hashed organization
- Treat each disk as a hash bucket
- Assign pages according to their IDs
- Log organization
- Treat the disk as one file, and add each page at the end
- Support random access using a B-tree
- Hybrid
- Hash-map a page to an extent and use a log structure within the extent.
51 Distribution Performance
52 Update Strategies
- Updates are generated by the crawler
- Several characteristics
- The time at which the crawl occurs and the repository receives the information
- Whether the crawl's information replaces the entire database or modifies parts of it
53 Batch vs. Steady
- Batch mode
- Periodically executed
- Allocated a certain amount of time
- Steady mode
- Runs all the time
- Always sends results back to the repository
54 Partial vs. Complete Crawls
- A batch-mode crawler can
- Do a complete crawl every run, and replace the entire collection
- Recrawl only a specific subset and apply updates to the existing collection - a partial crawl
- The repository can implement
- In-place updates
- Quickly refresh pages
- Shadowing - updates applied as a separate stage
- Avoids refresh-access conflicts
55 Partial vs. Complete Crawls
- Shadowing resolves the conflicts between updates and reads for queries
- Batch mode fits well with shadowing
- A steady crawler fits well with in-place updates
56 The WebBase Repository
- Distributed storage that works with the Stanford WebCrawler
- Uses a node manager for monitoring storage nodes and collecting status information
- Each page is assigned a unique identifier, a signature of its normalized URL
- URLs are normalized since the same resource can be pointed to by several URLs
- The Stanford crawler runs in batch mode, so shadowing is used by the repository
57 The WebBase Repository
58 Indexing
- Excuse me, where can I find...
59 Search Engine Structure
[Architecture diagram repeated; see slide 12.]
60 The Indexer Module
- Creates two indexes
- Text (content) index: uses traditional indexing methods such as inverted indexing.
- Structure (links) index: uses a directed graph of pages and links. Sometimes an inverted graph is also created.
61 The Collection Analysis Module
- Uses the two basic indexes created by the indexer module in order to assemble utility indexes.
- e.g. a site index.
62 Inverted Index
- A set of inverted lists, one per index term (word).
- Inverted list of a term: a sorted list of the locations in which the term appears.
- Posting: a pair (w, l) where w is a word and l is one of its locations.
- Lexicon: holds all the index's terms, with statistics about each term (not the postings); a small sketch follows.
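A minimal sketch of building an inverted index with postings and a lexicon; the whitespace tokenization and the single statistic kept per term (document frequency) are simplifying assumptions.

```python
from collections import defaultdict

def build_inverted_index(pages):
    """pages: {page_id: text}. Returns (inverted_index, lexicon).

    inverted_index maps each term to its sorted list of postings (page_id, offset);
    the lexicon keeps per-term statistics (here only the document frequency)."""
    inverted = defaultdict(list)
    lexicon = defaultdict(lambda: {"df": 0})
    for page_id, text in pages.items():
        words = text.lower().split()                   # naive tokenization (assumption)
        for offset, word in enumerate(words):
            inverted[word].append((page_id, offset))   # one posting (w, l)
        for word in set(words):
            lexicon[word]["df"] += 1                   # document frequency
    for postings in inverted.values():
        postings.sort()
    return dict(inverted), dict(lexicon)

# Hypothetical two-page collection
index, lexicon = build_inverted_index({1: "searching the web", 2: "the web changes rapidly"})
print(index["web"], lexicon["web"])    # [(1, 2), (2, 1)] {'df': 2}
```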
63 Challenges
- The index build must be
- Fast
- Economical
- (unlike traditional index building)
- Incremental indexing must be supported
- Storage: compression vs. speed
64 Index Partitioning
- Distributed text indexing can be done with
- Local inverted files (IFL)
- Each node contains a disjoint, random subset of the pages.
- The query is broadcast to all nodes.
- The result is the union of the nodes' answers.
- Global inverted files (IFG)
- Each node is responsible for only a subset of the terms in the collection.
- The query is sent only to the appropriate nodes.
65 The WebBase Indexer: Architecture
- Distributors: store the pages fetched by the crawler that need to be indexed.
- Indexers: perform the core indexing.
- Query servers: hold the inverted index, partitioned using IFL.
66 The WebBase Indexer: Stages
- Stage 1
- Loading pages from the distributors.
- Processing the pages.
- Flushing the results to disk.
- Stage 2
- Pairs of (inverted file, lexicon) are created by merging stage 1's files.
- Each pair is transferred to a query server.
67 The WebBase Indexer: Parallelizing Stage 1
- Use a 3-step pipeline, one step per action in stage 1 (loading, processing, flushing).
- Each action has a different resource profile (I/O or CPU intensive).
68 The WebBase Indexer: Parallelizing Results
- Sequential index building is about 30-40% slower than the pipelined version.
69 The WebBase Indexer: Statistics Collection Concept
- Term-level statistics must be collected
- e.g. IDF - inverse document frequency
- 1 / (number of appearances in the collection)
- Statistics are computed as part of index creation (instead of at query time).
- A special server, the Statistician, is dedicated to this task.
70 The WebBase Indexer: Statistics Collection Process
- Stage 1
- Indexers pass local information to the statistician.
- The statistician processes it (globally) and returns the results to the indexers.
- Stage 2
- Global statistics are integrated into the lexicons.
71 The WebBase Indexer: Statistics Collection Optimizations
- Send data to the statistician while it is already in memory (avoiding explicit I/O)
- FL - when flushing data to disk.
- ME - when merging the flushed data.
- Local aggregation: aggregate counts locally to send fewer messages (see the sketch below).
- e.g. one ("cat", 1000) message instead of 1000 occurrences of "cat"
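A tiny sketch of the local-aggregation idea; the (term, count) message format is an assumption used for illustration.

```python
from collections import Counter

def local_aggregate(term_stream):
    """Aggregate term occurrences locally before sending them to the statistician:
    one ("cat", 1000) message instead of 1000 separate "cat" messages."""
    return sorted(Counter(term_stream).items())

# Hypothetical stream of terms seen by one indexer while flushing a batch
print(local_aggregate(["cat"] * 1000 + ["web"] * 3))   # [('cat', 1000), ('web', 3)]
```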
72 Indexing, Conclusion
- Indexing web pages is complicated because of the scale involved (millions of pages, hundreds of gigabytes).
- Challenges: incremental indexing and personalization.
73 Ranking
- Everybody wants to rule the world
74 Search Engine Structure
[Architecture diagram repeated; see slide 12.]
75 Traditional Ranking Faults
- Many pages containing a term may be of poor quality or not relevant.
- Insufficient self-description vs. spamming.
- No use of link analysis.
76 PageRank
- Tries to capture the notion of the importance of a page.
- Uses backlinks for ranking.
- Avoids trivial spamming: distributes a page's voting power among the pages it links to.
- An important page linking to a page will raise its rank more than an unimportant one.
77 Simple PageRank
- Given by r(i) = Σ_{j ∈ B(i)} r(j) / N(j)
- Where
- B(i): the set of pages that link to i
- N(j): the number of outgoing links from j
- Well defined if the link graph is strongly connected
- Based on the Random Surfer Model - the rank of a page equals the probability of the surfer being at that page.
78 Computation of Simple PageRank (1)
79 Computation of Simple PageRank (2)
- Given a matrix A, an eigenvalue c and the corresponding eigenvector v are defined by Av = cv
- Hence r is an eigenvector of A^T for eigenvalue 1, i.e. A^T r = r
- If G is strongly connected then r is unique.
80 Computation of Simple PageRank (3)
- Simple PageRank can be computed by iterating r ← A^T r from an initial uniform vector until convergence (a sketch follows)
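A minimal sketch of the iterative (power-iteration) computation on a small hypothetical link graph, assuming it is strongly connected so that simple PageRank is well defined.

```python
def simple_pagerank(out_links, iterations=100):
    """Power iteration for simple PageRank: repeatedly apply r <- A^T r.

    out_links: {page: [pages it links to]}; every page is assumed to have
    at least one outgoing link (strongly connected graph)."""
    pages = list(out_links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for j, targets in out_links.items():
            share = rank[j] / len(targets)     # r(j) / N(j)
            for i in targets:
                new_rank[i] += share           # sum over j in B(i)
        rank = new_rank
    return rank

# Hypothetical strongly connected 3-page graph
print(simple_pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
```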
81 Simple PageRank: Example
82 Practical PageRank: The Problem
- The web is not a strongly connected graph. It contains:
- Rank sinks: clusters of pages with no links pointing outside the cluster. Pages outside the cluster will be ranked 0.
- Rank leaks: a page without any outgoing links. All pages will be ranked 0.
83 Practical PageRank: The Solution
- Remove all rank leaks.
- Add a decay factor d to simple PageRank (one formulation is sketched below)
- Based on the Bored Surfer Model - occasionally the surfer jumps to a random page instead of following a link
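The modified formula on this slide was lost in extraction; one common parameterization (the slides may use a slightly different convention for d), with m the total number of pages, is:

```latex
r(i) \;=\; \frac{1-d}{m} \;+\; d \sum_{j \in B(i)} \frac{r(j)}{N(j)}
```

With probability d the bored surfer follows a link from the current page, and with probability 1-d jumps to a page chosen uniformly at random.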
84 Practical PageRank: In Practice
- Google uses IR techniques combined with practical PageRank to determine the ranking for a query.
85 HITS: Hypertext Induced Topic Search
- A query-dependent technique.
- Produces two scores:
- Authority: a page most likely to be relevant to a given query.
- Hub: a page that points to many authorities.
- Consists of two parts:
- Identifying the focused subgraph.
- Link analysis.
86 HITS: Identifying the Focused Subgraph
- The subgraph is built from a t-sized set of pages returned for the query, expanded with pages that link to or are linked from them
- (d limits the number of in-linking pages added per page, reducing the influence of extremely popular pages like yahoo.com)
87 HITS: Link Analysis
- Calculates authority and hub scores (a_i, h_i) for each page in S
88 HITS: Link Analysis Computation
- The scores can be computed as eigenvectors, by iterating a = A^T h and h = A a
- Where
- a: the vector of authority scores
- h: the vector of hub scores
- A: the adjacency matrix, in which A_{i,j} = 1 if i points to j
- (a sketch of the iteration follows)
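A minimal sketch of the HITS iteration on a hypothetical focused subgraph; the L2 normalization and the example graph are assumptions.

```python
def hits(adj, iterations=50):
    """HITS link analysis: iterate a = A^T h and h = A a with normalization.

    adj: {page: [pages it points to]}; every linked page is assumed to
    appear as a key. Returns (authority_scores, hub_scores)."""
    pages = list(adj)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # a_i = sum of hub scores of the pages pointing to i
        auth = {p: 0.0 for p in pages}
        for j, targets in adj.items():
            for i in targets:
                auth[i] += hub[j]
        # h_j = sum of authority scores of the pages j points to
        hub = {j: sum(auth[i] for i in adj[j]) for j in pages}
        # L2-normalize so the scores stay bounded
        na = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        nh = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return auth, hub

# Hypothetical focused subgraph
authorities, hubs = hits({"p": ["q", "r"], "q": ["r"], "r": ["p"]})
```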
89 Other Link-Based Techniques
- Identifying communities: sets of pages created and used by people sharing a common interest.
- Related pages: sibling pages may be related.
- Classification / resource compilation:
- Automatic vs. manual classification.
- Identifying high-quality pages for a topic.
90 Ranking, Conclusion
- The link structure of the web contains useful information.
- Ranking methods:
- PageRank: a global, query-independent scheme for ranking search results.
- HITS: computes authorities and hubs for a given query.
- Future directions: use of other information sources, more sophisticated text analysis.
91 Conclusion
92 The Motivation
- The web's vast scale.
- Limited resources.
- The web changes rapidly.
- An important field that is in high demand.
93 The Basic Architecture
- Crawlers: travel the web, retrieving pages.
- Repositories: store the pages locally.
- Indexers: index and analyze the pages stored in the repository.
- Ranking modules: return the most promising pages to the query engine.
94 The End