Transcript and Presenter's Notes

Title: Searching the Web


1
Searching the Web
  • Yoram Bachrach
  • Yiftah Ben-Aharon

Based on the paper by Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, and Sriram Raghavan
2
Goal
  • To better understand Web search engines
  • Fundamental concepts
  • Main challenges
  • Design issues
  • Implementation techniques and algorithms

3
Schedule
  • Search engine requirements
  • Components overview
  • Specific modules
  • Purpose
  • Implementation
  • Performance metrics
  • Conclusion

4
What does it do?
  • Processes users' queries
  • Finds pages with related information
  • Returns a list of resources
  • Is it really that simple?

5
What does it do?
  • Processes users' queries
  • How is a query represented?
  • Finds pages with related information
  • Returns a list of resources
  • Is it really that simple?

6
What does it do?
  • Processes users' queries
  • Finds pages with related information
  • How do we find pages?
  • Where in the web do we look?
  • How do we store the data?
  • Returns a list of resources
  • Is it really that simple?

7
What does it do?
  • Processes users' queries
  • Finds pages with related information
  • Returns a list of resources
  • In what order?
  • How are the pages ranked?
  • Is it really that simple?

8
What does it do?
  • Processes users' queries
  • Finds pages with related information
  • Returns a list of resources
  • Is it really that simple?
  • Limited resources
  • Time/quality tradeoff

9
Search Engine Structure
  • General Design
  • Crawling
  • Storage
  • Indexing
  • Ranking

10
Motivation
  • The web is
  • Used by millions
  • Contains lots of information
  • Link based
  • Incoherent
  • Changes rapidly
  • Distributed
  • Traditional information retrieval was built with
    the exact opposite in mind

11
The Web's Characteristics
  • Size
  • Over a billion pages available
  • 5-10 KB per page → tens of terabytes
  • Size doubles every 2 years
  • Change
  • 23% change daily
  • Half-life of about 10 days
  • Poisson model for changes
  • Bow-tie structure

12
Search Engine Structure
  [Architecture diagram: Crawl Control directs the crawler, which feeds the Page Repository; the Indexer and Collection Analysis modules build the Text, Structure, and Utility indexes; the Query Engine, with the Ranking module, uses the indexes to answer queries with results.]
19
Terms
  • Crawler
  • Crawler control
  • Indexes: text, structure, utility
  • Page repository
  • Indexer
  • Collection analysis module
  • Query engine
  • Ranking module

20
Crawling
  • Itsy Bitsy Spider, crawling up the web!

21
Search Engine Structure
  [Architecture diagram repeated; this section covers the crawler and Crawl Control.]
22
Crawling web pages
  • What pages to download
  • When to refresh
  • Minimize load on web sites
  • How to parallelize the process

23
Page selection
  • Importance metric
  • Web crawler model
  • The crawler's method for choosing pages to download

24
Importance Metrics
  • Given a page P, define how good that page is.
  • Several metric types
  • Interest driven
  • Popularity driven
  • Location driven
  • Combined

25
Interest Driven
  • Define a driving query Q
  • Find the textual similarity between P and Q
  • Define a word vocabulary W1…Wn
  • Define vectors for P and Q: Vp, Vq = <w1,…,wn>
  • wi = 0 if word Wi does not appear in the document
  • wi = inverse document frequency (IDF) otherwise
  • IDF(Wi) = 1 / number of appearances in the entire collection
  • Importance: IS(P) = Vp · Vq (cosine product)
  • Finding IDF requires going over the entire web
  • Estimate IDF from pages already visited, to calculate an estimate IS' (a sketch follows below)
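A minimal sketch of this similarity computation (the whitespace tokenizer and the visited-pages IDF estimate are illustrative assumptions, not the paper's implementation):

    import math
    from collections import Counter

    def idf_weights(visited_pages):
        """Estimate IDF from pages already visited: 1 / document frequency."""
        df = Counter()
        for text in visited_pages:
            df.update(set(text.lower().split()))
        return {w: 1.0 / df[w] for w in df}

    def weight_vector(text, idf):
        """wi = IDF(Wi) if the word appears in the document, absent (0) otherwise."""
        return {w: idf.get(w, 0.0) for w in set(text.lower().split())}

    def IS(page_text, query_text, idf):
        """Cosine product of the IDF-weighted vectors of P and Q."""
        vp, vq = weight_vector(page_text, idf), weight_vector(query_text, idf)
        dot = sum(wp * vq.get(w, 0.0) for w, wp in vp.items())
        norm = (math.sqrt(sum(x * x for x in vp.values()))
                * math.sqrt(sum(x * x for x in vq.values())))
        return dot / norm if norm else 0.0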

26
Popularity Driven
  • How popular a page is
  • Backlink count
  • IB(P): the number of pages containing a link to P
  • Estimated from previous crawls: IB'(P)
  • A more sophisticated metric, called PageRank: IR(P)

27
Location Driven
  • IL(P): a function of the URL of P
  • Words appearing in the URL
  • Number of "/" in the URL
  • Easily evaluated; requires no data from previous crawls

28
Combined Metrics
  • IC(P): a function of several other metrics
  • Allows using local metrics for a first stage and estimated metrics for a second stage
  • IC(P) = a·IS(P) + b·IB(P) + c·IL(P)

29
Crawler Models
  • A crawler
  • Tries to visit more important pages first
  • Only has estimates of importance metrics
  • Can only download a limited number of pages
  • How well does a crawler perform?
  • Crawl and Stop
  • Crawl and Stop with Threshold

30
Crawl and Stop
  • A crawler stops after visiting K pages
  • A perfect crawler
  • Visits pages with ranks R1,…,Rk
  • These are called Hot Pages
  • A real crawler
  • Visits only M < K hot pages
  • Performance rate: (M / K) · 100%
  • For a random crawler: expected performance of (K / T) · 100%, where T is the total number of pages

31
Crawl and Stop with Threshold
  • A crawler stops after visiting K pages
  • Hot pages are pages with a metric higher than G; suppose there are H of them
  • A crawler visits V hot pages
  • Metric: the percentage of hot pages visited, (V / H) · 100%
  • Perfect crawler: 100% if K ≥ H, otherwise (K / H) · 100%
  • Random crawler: expected (K / T) · 100%, where T is the total number of pages

32
Ordering Metrics
  • The crawler's queue is prioritized according to an ordering metric
  • The ordering metric is based on an importance metric
  • Location metrics: used directly
  • Popularity metrics: via estimates from previous crawls
  • Similarity metrics: via estimates from anchor text (a queue sketch follows below)
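A minimal sketch of such a prioritized crawler (the fetch and link-extraction helpers and the metric function are hypothetical placeholders):

    import heapq

    def crawl(seeds, ordering_metric, fetch, extract_links, budget):
        """Crawl-and-stop: visit up to `budget` pages, best estimate first."""
        # heapq is a min-heap, so scores are negated to pop the best page first.
        frontier = [(-ordering_metric(url), url) for url in seeds]
        heapq.heapify(frontier)
        visited = set()
        while frontier and len(visited) < budget:
            _, url = heapq.heappop(frontier)
            if url in visited:
                continue
            visited.add(url)
            for link in extract_links(fetch(url)):
                if link not in visited:
                    heapq.heappush(frontier, (-ordering_metric(link), link))
        return visited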

33
Case Study - WebBase
  • Using Stanford's 225,000 web pages as the entire collection
  • Use the popularity importance metric IB(P)
  • Assume Crawl and Stop with Threshold, G = 100
  • Start at http://www.stanford.edu
  • Use PageRank, backlink count, and BFS as ordering metrics

34
WebBase Results
  [Results graph omitted from transcript]
35
Page Refresh
  • Make sure pages are up-to-date
  • Many possible strategies
  • Uniform refresh
  • Proportional to change frequency
  • Need to define a metric

36
Freshness Metric
  • Freshness F(P, t)
  • 1 if the local copy of P is fresh (up to date) at time t, 0 otherwise
  • Age A(P, t)
  • 0 if fresh, otherwise the time since P was modified

37
Average Freshness
  • Freshness changes over time
  • Take the average freshness over a long period of time (see the formula below)
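In symbols, a standard reconstruction of the time-averaged freshness of a page P, given the freshness function F above:

    \bar{F}(P) = \lim_{t \to \infty} \frac{1}{t} \int_{0}^{t} F(P, u) \, du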

38
Refresh Strategy
  • Crawlers can refresh only a certain number of pages in a period of time.
  • The page-download resource can be allocated in many ways
  • The proportional refresh policy allocates the resource proportionally to each page's change rate.

39
Example
  • The collection contains 2 pages
  • E1 changes 9 times a day
  • E2 changes once a day
  • Simplified change model
  • The day is split into 9 equal intervals, and E1
    changes once on each interval
  • E2 changes once during the entire day
  • The only unknown is when the pages change within
    the intervals
  • The crawler can download one page per day.
  • Our goal is to maximize the average freshness.

40
Example (2)
  [Figure omitted from transcript]
41
Example (3)
  • Which page do we refresh?
  • If we refresh E2 at midday
  • If E2 changes in the first half of the day and we refresh at midday, it remains fresh for the remaining half of the day.
  • 50% chance of a 0.5-day freshness increase
  • 50% chance of no increase
  • Expected freshness increase: 0.25 day
  • If we refresh E1 at midday
  • If E1 changes in the first half of its interval and we refresh at midday (the middle of an interval), it remains fresh for the remaining half of the interval: 1/18 of a day.
  • 50% chance of a 1/18-day freshness increase
  • 50% chance of no increase
  • Expected freshness increase: 1/36 day
  • So refreshing E2, the page that changes less often, yields the larger expected gain.

42
Example (4)
  • This gives a nice estimate
  • But things are more complex in real life
  • We are not sure that a page will change within an interval
  • We also have to worry about age
  • Under a Poisson change model, a uniform refresh policy always performs better than a proportional one (a simulation sketch follows below).
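A small simulation sketch of this claim (the step size, horizon, and the two-page setup from the example are illustrative assumptions, not the paper's experiment):

    import random

    def avg_freshness(change_rates, refresh_rates, days=2000, steps_per_day=100):
        """Average freshness of pages changing at Poisson `change_rates`
        (changes/day), refreshed at Poisson `refresh_rates` (refreshes/day)."""
        fresh = [True] * len(change_rates)
        total, dt = 0.0, 1.0 / steps_per_day
        for _ in range(days * steps_per_day):
            for i, (lam, mu) in enumerate(zip(change_rates, refresh_rates)):
                if random.random() < lam * dt:   # the page changed
                    fresh[i] = False
                if random.random() < mu * dt:    # the crawler refreshed it
                    fresh[i] = True
                total += fresh[i]
        return total / (days * steps_per_day * len(change_rates))

    rates = [9.0, 1.0]                        # E1 and E2 from the example
    print(avg_freshness(rates, [0.5, 0.5]))   # uniform policy: higher freshness
    print(avg_freshness(rates, [0.9, 0.1]))   # proportional policy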

43
Example (5)
  • Studies have found the best refresh policy for a similar example
  • Assume page changes follow a Poisson process.
  • Assume 5 pages, which change 1, 2, 3, 4, and 5 times a day

44
Repository
  • Hidden Treasures

45
Search Engine Structure
  [Architecture diagram repeated; this section covers the Page Repository.]
46
Storage
  • The page repository is a scalable storage system for web pages
  • Allows the Crawler to store pages
  • Allows the Indexer and Collection Analysis to retrieve them
  • Similar to other data storage systems: databases or file systems
  • Does not have to provide some of those systems' features: transactions, logging, directories.

47
Storage Issues
  • Scalability and seamless load distribution
  • Dual access modes
  • Random access (used by the query engine for
    cached pages)
  • Streaming access (used by the Indexer and
    Collection Analysis)
  • Large bulk updates: reclaim old space, avoid access/update conflicts
  • Obsolete pages: remove pages no longer on the web

48
Designing a Distributed Web Repository
  • Repository designed to work over a cluster of
    interconnected nodes
  • Page distribution across nodes
  • Physical organization within a node
  • Update strategy

49
Page Distribution
  • How to choose a node to store a page
  • Uniform distribution: any page can be sent to any node
  • Hash distribution policy: hash the page ID space into the node ID space (a sketch follows below)
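A minimal sketch of the hash distribution policy (using a URL's MD5 digest as the page ID is an illustrative assumption):

    import hashlib

    def node_for(url, num_nodes):
        """Hash the page ID space into the node ID space."""
        page_id = hashlib.md5(url.encode("utf-8")).hexdigest()
        return int(page_id, 16) % num_nodes

    print(node_for("http://www.stanford.edu/", 4))  # a stable node in 0..3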

50
Organization Within a Node
  • Several operations are required
  • Add / remove a page
  • High-speed streaming
  • Random page access
  • Hashed organization
  • Treat each disk as a hash bucket
  • Assign pages to buckets according to their IDs
  • Log organization
  • Treat the disk as one file, and add each page at the end
  • Support random access using a B-tree
  • Hybrid (sketched below)
  • Hash-map each page to an extent, and use a log structure within the extent.
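A toy in-memory sketch of the hybrid organization (a Python dict stands in for the B-tree, and lists stand in for on-disk logs):

    class HybridRepository:
        """Hash pages to extents; store pages log-style within each extent."""

        def __init__(self, num_extents):
            self.extents = [[] for _ in range(num_extents)]  # append-only logs
            self.locations = {}  # page_id -> (extent, offset); B-tree stand-in

        def add(self, page_id, content):
            extent = hash(page_id) % len(self.extents)
            self.extents[extent].append(content)             # add at the end
            self.locations[page_id] = (extent, len(self.extents[extent]) - 1)

        def get(self, page_id):                              # random access
            extent, offset = self.locations[page_id]
            return self.extents[extent][offset]

        def stream(self):                                    # streaming access
            for extent in self.extents:
                yield from extent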

51
Distribution Performance
  [Performance comparison figure omitted from transcript]
52
Update Strategies
  • Updates are generated by the crawler
  • Several characteristics
  • The time at which the crawl occurs and the repository receives the information
  • Whether the crawl's information replaces the entire collection or modifies only parts of it

53
Batch vs. Steady
  • Batch mode
  • Periodically executed
  • Allocated a certain amount of time
  • Steady mode
  • Runs all the time
  • Always sends results back to the repository

54
Partial vs. Complete Crawls
  • A batch-mode crawler can
  • Do a complete crawl every run, and replace the entire collection
  • Recrawl only a specific subset, and apply updates to the existing collection: a partial crawl
  • The repository can implement
  • In-place update
  • Quickly refreshes pages
  • Shadowing: updates are applied as a separate stage
  • Avoids refresh/access conflicts

55
Partial vs. Complete Crawls
  • Shadowing resolves the conflicts between updates and reads for queries
  • Batch mode fits well with shadowing
  • A steady crawler fits with in-place updates

56
The WebBase Repository
  • Distributed storage that works with the Stanford WebCrawler
  • Uses a node manager for monitoring storage nodes and collecting status information
  • Each page is assigned a unique identifier: a signature of its normalized URL
  • URLs are normalized since the same resource can be pointed to by several URLs
  • The Stanford crawler runs in batch mode, so shadowing is used by the repository

57
The WebBase Repository
  [Repository architecture diagram omitted from transcript]
58
Indexing
  • Excuse me, where can I find…?

59
Search Engine Structure
  [Architecture diagram repeated; this section covers the Indexer and Collection Analysis modules.]
60
The Indexer Module
  • Creates two indexes
  • Text (content) index: uses traditional indexing methods, such as inverted indexing.
  • Structure (links) index: uses a directed graph of pages and links. Sometimes an inverted graph is also created.

61
The Collection Analysis Module
  • Uses the two basic indexes created by the indexer module to assemble utility indexes.
  • E.g., a site index.

62
Inverted Index
  • A set of inverted lists, one per index term (word).
  • Inverted list of a term: a sorted list of locations at which the term appears.
  • Posting: a pair (w, l) where w is a word and l is one of its locations.
  • Lexicon: holds all the index's terms, with statistics about each term (but not the postings). A construction sketch follows below.
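A toy construction sketch (whitespace tokenization and word-offset locations are illustrative assumptions):

    from collections import defaultdict

    def build_index(pages):
        """pages: dict page_id -> text. Returns (inverted_index, lexicon)."""
        inverted = defaultdict(list)          # term -> list of postings (w, l)
        for page_id, text in sorted(pages.items()):
            for offset, word in enumerate(text.lower().split()):
                inverted[word].append((page_id, offset))
        # The lexicon keeps per-term statistics, not the postings themselves.
        lexicon = {term: {"occurrences": len(postings),
                          "documents": len({p for p, _ in postings})}
                   for term, postings in inverted.items()}
        return dict(inverted), lexicon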

63
Challenges
  • Index building must be
  • Fast
  • Economical
  • (unlike traditional index building)
  • Incremental indexing must be supported
  • Storage: compression vs. speed

64
Index Partitioning
  • Distributed text indexing can be done by
  • Local inverted file (IFL)
  • Each node contains a disjoint, random subset of the pages.
  • A query is broadcast to all nodes.
  • The result is the join of the per-node answers.
  • Global inverted file (IFG)
  • Each node is responsible for only a subset of the terms in the collection.
  • A query is sent only to the appropriate nodes. (A sketch of IFL querying follows below.)
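A minimal sketch of querying IFL partitions, reusing build_index from the sketch above (joining answers by concatenation is an assumption):

    def ifl_query(term, node_indexes):
        """Broadcast the query to every node and join the per-node answers."""
        results = []
        for inverted, _lexicon in node_indexes:
            results.extend(inverted.get(term, []))
        return results

    # Each node indexes a disjoint subset of the pages.
    nodes = [build_index({"p1": "the cat sat"}),
             build_index({"p2": "the cat ran"})]
    print(ifl_query("cat", nodes))   # postings from both nodes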

65
The WebBase Indexer: Architecture
  • Distributors: store the pages collected by the crawler that need to be indexed.
  • Indexers: perform the core indexing.
  • Query servers: hold the inverted index, partitioned using IFL.

66
The WebBase Indexer: Stages
  • Stage 1
  • Load pages from the distributors.
  • Process the pages.
  • Flush the results to disk.
  • Stage 2
  • Pairs of (inverted file, lexicon) are created by merging stage 1's files.
  • Each pair is transferred to a query server.

67
The WebBase Indexer: Parallelizing Stage 1
  • Uses a 3-step pipeline, one step per action in stage 1.
  • Each step stresses a different resource (I/O- or CPU-intensive), so the steps overlap well.

68
The WebBase Indexer: Parallelizing Results
  • Sequential index building is about 30-40% slower than the pipelined version.

69
The WebBase Indexer: Statistics Collection Concept
  • Term-level statistics must be collected
  • E.g., IDF (inverse document frequency): 1 / (number of appearances in the collection)
  • Statistics are computed as part of index creation (instead of at query time).
  • A special server, the statistician, is dedicated to this goal.

70
The WebBase Indexer: Statistics Collection Process
  • Stage 1
  • Indexers pass local information to the statistician.
  • The statistician processes it (globally) and returns results to the indexers.
  • Stage 2
  • Global statistics are integrated into the lexicons.

71
The WebBase Indexer: Statistics Collection Optimizations
  • Send data to the statistician while it is still in memory (avoiding explicit I/O)
  • FL: when flushing data to disk.
  • ME: when merging the flushed data.
  • Local aggregation: send partial aggregates to reduce the number of messages.
  • E.g., send ("cat", 1000) instead of 1,000 separate "cat" messages (a sketch follows below).
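A one-function sketch of local aggregation (the (term, count) message format is an illustrative assumption):

    from collections import Counter

    def aggregated_messages(term_occurrences):
        """Send one (term, count) pair instead of one message per occurrence."""
        return list(Counter(term_occurrences).items())

    msgs = aggregated_messages(["cat"] * 1000 + ["dog"] * 3)
    print(msgs)  # [('cat', 1000), ('dog', 3)] -- 2 messages instead of 1003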

72
Indexing: Conclusion
  • Indexing web pages is complicated by the web's scale (millions of pages, hundreds of gigabytes).
  • Remaining challenges: incremental indexing and personalization.

73
Ranking
  • Everybody wants to rule the world

74
Search Engine Structure
  [Architecture diagram repeated; this section covers the Ranking module and Query Engine.]
75
Traditional Ranking Faults
  • Many pages containing a term may be of poor
    quality or not relevant.
  • Insufficient self-description vs. spamming.
  • No use of link analysis.

76
PageRank
  • Tries to capture the notion of the importance of a page.
  • Uses backlinks for ranking.
  • Avoids trivial spamming: distributes a page's voting power among the pages it links to.
  • An important page linking to a page raises that page's rank more than an unimportant one does.

77
Simple PageRank
  • Given by: r(i) = Σ over j in B(i) of r(j) / N(j)
  • Where
  • B(i): the set of pages that link to i.
  • N(j): the number of outgoing links from j.
  • Well defined if the link graph is strongly connected.
  • Based on the Random Surfer model: the rank of a page equals the probability of the surfer being at that page.

78
Computation Of Simple PageRank (1)
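The rank equation can be restated in matrix form (the convention chosen for A here is an assumption, consistent with the next slide):

    r = A^{T} r,
    \qquad
    A_{i,j} =
    \begin{cases}
      1 / N(i) & \text{if page } i \text{ links to page } j \\
      0        & \text{otherwise}
    \end{cases}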
79
Computation Of Simple PageRank (2)
  • Given a matrix A, an eigenvalue c and the corresponding eigenvector v are defined by Av = cv
  • Hence r is an eigenvector of A^T for the eigenvalue 1
  • If the graph G is strongly connected, then r is unique.

80
Computation Of Simple PageRank (3)
  • Simple PageRank can be computed by iterating r ← A^T·r from any initial vector until it converges (power iteration); a sketch follows below.
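A minimal power-iteration sketch (the adjacency-list input format and the fixed iteration count are illustrative assumptions; the graph must be strongly connected):

    def simple_pagerank(links, iterations=100):
        """links: dict page -> list of pages it links to. Iterates r <- A^T r."""
        pages = list(links)
        r = {p: 1.0 / len(pages) for p in pages}      # uniform starting vector
        for _ in range(iterations):
            nxt = {p: 0.0 for p in pages}
            for j, outs in links.items():
                share = r[j] / len(outs)              # j splits its rank evenly
                for i in outs:
                    nxt[i] += share
            r = nxt
        return r

    print(simple_pagerank({"a": ["b"], "b": ["c"], "c": ["a", "b"]}))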

81
Simple PageRank: Example
  [Example graph and ranks omitted from transcript]
82
Practical PageRank: The Problem
  • The web is not a strongly connected graph. It contains:
  • Rank sinks: clusters of pages without outgoing links. Pages outside the cluster will be ranked 0.
  • Rank leaks: single pages without outgoing links. All pages will be ranked 0.

83
Practical PageRank: The Solution
  • Remove all rank leaks.
  • Add a decay factor d to Simple PageRank: r(i) = (1 - d)/n + d · Σ over j in B(i) of r(j)/N(j) (one common normalization; conventions vary)
  • Based on the Bored Surfer model: the surfer usually follows links, but occasionally gets bored and jumps to a random page.

84
Practical PageRank: In Practice
  • Google uses IR techniques combined with Practical PageRank to rank the results of a query.

85
HITS: Hypertext Induced Topic Search
  • A query-dependent technique.
  • Produces two scores:
  • Authority: a page most likely to be relevant to a given query.
  • Hub: a page that points to many authorities.
  • Consists of two parts:
  • Identifying the focused subgraph.
  • Link analysis.

86
HITS: Identifying the Focused Subgraph
  • The subgraph is grown from a root set of t pages, together with the pages they link to and pages that link to them.
  • (A parameter d caps how many in-linking pages are added per root page, reducing the influence of extremely popular pages like yahoo.com.)

87
HITS: Link Analysis
  • Calculates Authority and Hub scores (a_i and h_i) for each page in S:
  • a_i = the sum of h_j over pages j pointing to i; h_i = the sum of a_j over pages j that i points to.

88
HITS: Link Analysis Computation
  • The scores can be computed as eigenvectors: a is the principal eigenvector of A^T·A, and h of A·A^T
  • Where
  • a: the vector of Authority scores
  • h: the vector of Hub scores.
  • A: the adjacency matrix, in which A_i,j = 1 if page i points to page j. (An iterative sketch follows below.)
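A minimal iterative sketch of these updates (the fixed iteration count and per-round normalization are illustrative assumptions):

    import math

    def hits(adj, iterations=50):
        """adj: dict page -> pages it points to (the focused subgraph S).
        Returns (authority, hub) score dicts."""
        pages = list(adj)
        a = {p: 1.0 for p in pages}
        h = {p: 1.0 for p in pages}
        for _ in range(iterations):
            for p in pages:   # authority: sum of hub scores of in-linking pages
                a[p] = sum(h[q] for q in pages if p in adj[q])
            for p in pages:   # hub: sum of authority scores of linked pages
                h[p] = sum(a[q] for q in adj[p])
            for scores in (a, h):                    # normalize each round
                norm = math.sqrt(sum(x * x for x in scores.values())) or 1.0
                for p in scores:
                    scores[p] /= norm
        return a, h

    a, h = hits({"x": ["y", "z"], "y": ["z"], "z": ["x"]})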

89
Other Link-Based Techniques
  • Identifying communities: sets of pages created and used by people sharing a common interest.
  • Related pages: sibling pages may be related.
  • Classification and resource compilation:
  • Automatic vs. manual classification.
  • Identifying high-quality pages for a topic.

90
Ranking: Conclusion
  • The link structure of the web contains useful information.
  • Ranking methods:
  • PageRank: a global scheme for ranking search results.
  • HITS: computes the Authorities and Hubs for a given query.
  • Future directions: use of other information sources, more sophisticated text analysis.

91
Conclusion
  • What was it all about?

92
The Motivation
  • The web's vast scale.
  • Limited resources.
  • The web changes rapidly.
  • An important field, in high demand.

93
The Basic Architecture
  • Crawlers: travel the web, retrieving pages.
  • Repositories: store pages locally.
  • Indexers: index and analyze pages stored in the repository.
  • Ranking modules: return to the query engine the most promising pages.

94
The End
  • Questions, anyone?