Transcript and Presenter's Notes

1
The Anatomy of a Large-Scale Hypertextual Web
Search Engine
  • S. Brin and L. Page, Computer Networks and ISDN
    Systems, Vol. 30, No. 1-7, pages 107-117, April
    1998

2006. 3. 14 Young Geun Han
2
Contents
  • System Anatomy
  • Crawling the Web
  • Indexing the Web
  • Searching
  • Results and Performance
  • Storage Requirements
  • System Performance
  • Search Performance
  • Conclusions

3
Crawling the Web (1)
  • Crawler
  • Crawling is the most fragile application
  • It involves interacting with hundreds of
    thousands of web servers and various name
    servers, all beyond the control of the system
  • Running a web crawler
  • Raises tricky performance and reliability
    issues, and even more importantly, social issues

4
Crawling the Web (2)
  • Tricky performance
  • Google has a fast distributed crawling system
  • Each crawler keeps roughly 300 connections open
    at once
  • At peak speeds, Google can crawl over 100 web
    pages per second using four crawlers (roughly
    600K per second of data)
  • Each crawler maintains its own DNS cache
  • The crawler uses asynchronous IO and a number of
    queues to move page fetches from state to state

(Figure: states of connections)
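
To make the asynchronous IO, queues, and DNS caching
concrete, here is a minimal sketch of such a crawler in
Python. It is an illustration only: Google's crawler was
not written this way, and aiohttp (a third-party HTTP
library) plus every name below are assumptions of the
sketch.

  import asyncio
  import aiohttp  # third-party HTTP client; an assumption of this sketch

  MAX_CONNECTIONS = 300   # each crawler keeps ~300 connections open at once

  async def worker(session, frontier, pages):
      # Each fetch moves through states (DNS lookup, connect, send
      # request, receive response); asyncio multiplexes hundreds of
      # them in one process instead of one thread per connection.
      while True:
          url = await frontier.get()
          try:
              async with session.get(url) as resp:
                  pages.append((url, await resp.text()))
          except Exception:
              pass            # a real crawler would log and maybe retry
          finally:
              frontier.task_done()

  async def crawl(urls):
      frontier = asyncio.Queue()      # queue of URLs waiting to be fetched
      for u in urls:
          frontier.put_nowait(u)
      pages = []
      # The connector's DNS cache plays the role of the per-crawler
      # DNS cache described above.
      conn = aiohttp.TCPConnector(limit=MAX_CONNECTIONS, use_dns_cache=True)
      async with aiohttp.ClientSession(connector=conn) as session:
          workers = [asyncio.create_task(worker(session, frontier, pages))
                     for _ in range(MAX_CONNECTIONS)]
          await frontier.join()       # wait until every URL is processed
          for w in workers:
              w.cancel()
          await asyncio.gather(*workers, return_exceptions=True)
      return pages

  # pages = asyncio.run(crawl(["http://example.com/"]))  # hypothetical seed
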
5
Crawling the Web (3)
  • Reliability issues
  • There are many people who don't know what a
    crawler is
  • Running a crawler that connects to their servers
    generates a fair amount of email and phone calls
  • Some of them assume that Google must like their
    web site very much
  • There are also people who don't know about the
    robots exclusion protocol
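
The robots exclusion protocol is the robots.txt
convention; a polite crawler checks it before fetching a
URL. Python's standard library can perform this check
(the host and user agent below are hypothetical):

  from urllib import robotparser

  rp = robotparser.RobotFileParser()
  rp.set_url("http://example.com/robots.txt")   # hypothetical host
  rp.read()                                     # fetch and parse robots.txt

  # Skip any URL the site's robots.txt disallows for our user agent.
  if rp.can_fetch("ExampleCrawler", "http://example.com/private.html"):
      print("allowed to crawl")
  else:
      print("disallowed by robots.txt")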

6
Crawling the Web (4)
  • Social issues
  • Because of the huge amount of data involved,
    unexpected things will happen
  • Problems that are easy to fix may not come up
    until tens of millions of pages have been
    downloaded
  • It is virtually impossible to test a crawler
    without running it on a large part of the
    Internet
  • Crawlers need to be designed to be very robust
    and carefully tested

7
Indexing the Web (1)
  • Parsing
  • Any parser must handle a huge array of possible
    errors, from typos in HTML tags to kilobytes of
    zeros in the middle of a tag
  • For maximum speed, Google uses flex to generate
    a lexical analyzer instead of a CFG parser
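
flex emits a C lexer, so the real analyzer is not
Python; the sketch below only illustrates the idea of a
tokenizer that tolerates malformed markup instead of
rejecting it (the regular expression is an invented
example):

  import re

  # One pass over the raw text: recognize tags, entities, and words,
  # and silently skip anything else rather than raising an error.
  TOKEN = re.compile(r"<[^>]*>|&[a-zA-Z]+;|\w+", re.ASCII)

  def tokenize(html):
      for match in TOKEN.finditer(html):
          tok = match.group()
          if tok.startswith("<") or tok.startswith("&"):
              continue                    # drop markup and entities
          yield tok.lower()

  print(list(tokenize("<TITLE>Bill &amp; Clinton</TITLE>")))
  # -> ['bill', 'clinton']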

8
Indexing the Web (2)
  • Indexing Documents into Barrels
  • After each document is parsed, it is encoded
    into a number of barrels
  • Every word is converted into a wordID by using
    an in-memory hash table -- the lexicon
  • New additions to the lexicon hash table are
    logged to a file
  • The words in the current document are translated
    into hit lists
  • The hit lists are written into the forward
    barrels
  • For parallelization, each indexer writes a log
    of extra words to a file instead of sharing the
    lexicon
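
A toy version of this flow (plain Python dicts and lists
stand in for Google's binary barrel format; the
barrel-assignment rule here is a simplification, since
Google assigns each barrel a range of wordIDs):

  lexicon = {}        # word -> wordID, the in-memory hash table
  lexicon_log = []    # stand-in for the on-disk log of new words
  NUM_BARRELS = 4
  forward_barrels = [[] for _ in range(NUM_BARRELS)]

  def word_id(word):
      # New words get the next free wordID and are logged to "disk".
      if word not in lexicon:
          lexicon[word] = len(lexicon)
          lexicon_log.append(word)
      return lexicon[word]

  def index_document(doc_id, words):
      # Translate each word's occurrences into a hit list (positions
      # only here; real hits also record font and capitalization).
      hits = {}
      for pos, word in enumerate(words):
          hits.setdefault(word_id(word), []).append(pos)
      # Write (docID, wordID, hit list) records into forward barrels.
      for wid, hitlist in sorted(hits.items()):
          forward_barrels[wid % NUM_BARRELS].append((doc_id, wid, hitlist))

  index_document(1, ["bill", "clinton", "bill"])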

9
Indexing the Web (3)
  • Sorting
  • The sorter takes each of the forward barrels
  • Sorts it by wordID to produce an inverted barrel
  • The sorting phase is parallelized
  • Each barrel is subdivided into baskets that can
    be loaded into main memory, because the barrels
    don't fit into memory
  • Each basket is sorted and its contents are
    written into the inverted barrel
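
A toy in-memory version of the basket trick (real
barrels live on disk, which is the whole reason baskets
exist):

  def invert_barrel(forward_barrel, num_baskets=4, max_word_id=1 << 24):
      # Subdivide the barrel into baskets by wordID range, so each
      # basket is small enough to sort in main memory on its own.
      span = max_word_id // num_baskets
      baskets = [[] for _ in range(num_baskets)]
      for doc_id, wid, hitlist in forward_barrel:
          b = min(wid // span, num_baskets - 1)
          baskets[b].append((wid, doc_id, hitlist))
      # Baskets cover increasing wordID ranges, so sorting each basket
      # and concatenating yields a barrel fully sorted by wordID.
      inverted = []
      for basket in baskets:
          basket.sort(key=lambda rec: rec[:2])   # by (wordID, docID)
          inverted.extend(basket)
      return inverted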

10
Searching (1)
  1. Parse the query
  2. Convert words into wordIDs
  3. Seek to the start of the doclist in the short
     barrel for every word
  4. Scan through the doclists until there is a
     document that matches all the search terms
  5. Compute the rank of that document for the query
  6. If we are in the short barrels and at the end
     of any doclist, seek to the start of the
     doclist in the full barrel for every word and
     go to step 4
  7. If we are not at the end of any doclist, go to
     step 4
  8. Sort the documents that have matched by rank
     and return the top k
  • Figure 4. Google Query Evaluation (a sketch of
    these steps follows below)
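
A compact sketch of the eight steps, with two dicts
standing in for the short and full barrels and a
placeholder rank function; the real scan is a merge over
sorted doclists, which a set intersection approximates
here:

  def search(query, lexicon, short_barrel, full_barrel, rank, k=10):
      # Steps 1-2: parse the query and convert words into wordIDs.
      wids = [lexicon[w] for w in query.lower().split() if w in lexicon]
      results = []
      # Step 6: if the short barrels are exhausted, fall back to the
      # full barrels and repeat the scan.
      for barrel in (short_barrel, full_barrel):
          # Step 3: fetch the doclist for every word.
          doclists = [barrel.get(wid, []) for wid in wids]
          if not doclists or any(not d for d in doclists):
              continue
          # Steps 4-5, 7: find documents matching all terms, rank each.
          matches = set(doclists[0]).intersection(*map(set, doclists[1:]))
          results += [(rank(doc, wids), doc) for doc in matches]
      # Step 8: sort the matches by rank and return the top k.
      return [doc for score, doc in sorted(results, reverse=True)[:k]]

  lexicon = {"bill": 0, "clinton": 1}
  short = {0: [3, 7], 1: [7, 9]}      # wordID -> sorted docIDs
  full = {0: [], 1: []}
  print(search("Bill Clinton", lexicon, short, full,
               rank=lambda doc, wids: 1.0))   # -> [7]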

11
Searching (2)
  • The Ranking System
  • Every hit list includes position, font, and
    capitalization information
  • Hits from anchor text and the PageRank of the
    document are also factored in
  • The ranking function is designed so that no
    particular factor can have too much influence
  • For a single word search
  • To rank a document, Google looks at that
    document's hit list for the word and computes an
    IR score, which is combined with PageRank
  • For a multi-word search
  • Hits occurring close together in a document are
    weighted higher than hits occurring far apart
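
The paper gives the shape of this function but not its
constants, so the sketch below is a guess at that shape:
the type-weights, the count cap, the proximity bonus,
and the way PageRank is combined are invented for
illustration, not Google's actual values.

  TYPE_WEIGHT = {"title": 8.0, "anchor": 6.0, "plain": 1.0}   # invented

  def ir_score(hits):
      # Count-weights grow with the number of hits of each type but
      # cap out, so repetition alone cannot dominate the score.
      return sum(TYPE_WEIGHT.get(t, 1.0) * min(len(pos), 8)
                 for t, pos in hits.items())

  def proximity_bonus(pos_a, pos_b):
      # Multi-word case: the closer the nearest pair of hits, the
      # larger the bonus.
      gap = min(abs(a - b) for a in pos_a for b in pos_b)
      return 10.0 / (1.0 + gap)

  def rank(word_hits, pagerank):
      # word_hits: one {hit_type: [positions]} dict per query word.
      score = sum(ir_score(h) for h in word_hits)
      if len(word_hits) == 2:         # toy: proximity for word pairs only
          a = word_hits[0].get("plain", [])
          b = word_hits[1].get("plain", [])
          if a and b:
              score += proximity_bonus(a, b)
      return score * pagerank         # IR score combined with PageRank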

12
Searching (3)
  • Feedback
  • Google has a user feedback mechanism because
    figuring out the right values for many parameters
    is very difficult
  • When the ranking function is modified, this
    mechanism gives developers some idea of how a
    change in the ranking function affects the search
    results

13
Results and Performance (1)
14
Results and Performance (2)
  • Google's results for a search on "bill clinton"
  • A number of results are from the whitehouse.gov
    domain
  • Most major commercial search engines do not
    return any results from whitehouse.gov
  • One result has no title because its page was not
    crawled
  • Instead, Google relied on anchor text to
    determine that it was a good answer to the query
  • There are no results about a Bill other than
    Clinton or about a Clinton other than Bill

15
Results and Performance (3)
  • Storage Requirements

Table 1. Statistics
16
Results and Performance (4)
  • System Performance
  • In total, it took roughly 9 days to download the
    26 million pages (including errors)
  • The last 11 million pages were downloaded in
    just 63 hours, averaging just over 4 million
    pages per day, or 48.5 pages per second
  • The indexer ran just faster than the crawlers,
    at roughly 54 pages per second
  • Using four machines, the whole process of
    sorting takes about 24 hours
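
The two crawl rates quoted above are consistent with
each other, as a quick back-of-the-envelope check shows:

  pages, hours = 11_000_000, 63
  print(pages / (hours * 3600))   # ~48.5 pages per second
  print(pages / hours * 24)       # ~4.19 million pages per day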

17
Results and Performance (5)
  • Search Performance
  • Google answers most queries in between 1 and 10
    seconds
  • The search time is mostly dominated by disk IO
    over NFS

Table 2. Search Times
18
Conclusions (1)
  • Google
  • A scalable search engine
  • Incorporates PageRank, anchor text, and
    proximity information
  • A complete architecture for gathering web pages,
    indexing them, and performing search queries
    over them

19
Conclusions (2)
  • Future Work
  • Improve search efficiency and scale to
    approximately 100 million web pages
  • Smart algorithms to decide what old web pages
    should be recrawled and what new ones should be
    crawled
  • High Quality Search
  • Google makes heavy use of hypertextual
    information consisting of link structure and link
    text
  • Google also uses proximity and font information
  • The analysis of link structure and PageRank
    allows Google to evaluate the quality of web pages
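
The PageRank behind this link analysis can be stated
compactly: PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... +
PR(Tn)/C(Tn)), where the pages Ti link to A, C(T) is T's
outlink count, and d is a damping factor around 0.85. A
minimal power-iteration sketch over a hypothetical
three-page web (assumes every page has at least one
outlink):

  def pagerank(links, d=0.85, iterations=50):
      # links: page -> list of pages it links to.
      pr = {page: 1.0 for page in links}
      for _ in range(iterations):
          # PR(A) = (1 - d) + d * sum over pages t linking to A
          # of PR(t) / C(t), where C(t) is t's outlink count.
          pr = {page: (1 - d) + d * sum(pr[t] / len(links[t])
                                        for t in links if page in links[t])
                for page in links}
      return pr

  web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}    # hypothetical graph
  print(pagerank(web))   # "c" ranks highest: it has the most inlinks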

20
Conclusions (3)
  • Scalable Architecture
  • Google is efficient in both space and time
  • Google's major data structures make efficient
    use of available storage space
  • The crawling, indexing, and sorting operations
    are efficient in time
  • Google overcomes a number of bottlenecks
  • A Research Tool
  • Google is not only a high quality search engine
    but also a research tool
  • It is a necessary tool for a wide range of
    research applications