The Anatomy of a Large-Scale Hypertextual Web Search Engine

1
The Anatomy of a Large-Scale Hypertextual Web
Search Engine
  • Sergey Brin, Lawrence Page
  • Presented by
  • Siddharth Sriram, Joseph Xavier
  • Department of Electrical and Computer Engineering

2
Overview
  • @ Stanford University
  • Presented as a prototype of a large-scale search
    engine
  • 26 million pages, 147 GB
  • "Google": a common spelling of googol (10^100)
  • Issues
  • Scaling
  • Exploiting structure in Hypertext
  • PageRank Algorithm
  • Architecture
  • Data Structures, Crawling, Indexing, Searching
  • Results

3
  • PageRank Algorithm using link graph
  • Anchor Text
  • Associate the anchor text of a link to the page
    it points to
  • Information Retrieval
  • TREC -> well-controlled, homogeneous collections
  • Not equipped to handle Hypertext documents
  • Vector Space Model not enough
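
The PageRank bullet above can be illustrated on a toy link graph. This is a minimal sketch, assuming the formula from the Brin and Page paper, PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over the pages T that link to A, where C(T) is the number of outgoing links of T. The graph, the damping factor d = 0.85, the iteration count, and the function name are illustrative assumptions and do not come from the slides.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_rank = {p: 1.0 - d for p in pages}
        for source, targets in links.items():
            if not targets:
                continue  # dangling page: contributes nothing in this sketch
            share = d * rank[source] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_graph)
```

In this toy graph C collects links from both A and B, so it ends up ranked above B.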

4
Architecture
  • URL Server
  • Distributed Crawlers
  • Storeserver
  • Repository
  • Indexer
  • Barrels
  • URL Resolver
  • Sorter
  • DumpLexicon
  • Searcher

5
Data Structures
  • BigFiles
  • Repository
  • Document Index
  • Lexicon
  • Hit Lists
  • Forward Index
  • Inverted Index

6
Repository
  • Full HTML of every webpage
  • Compressed using zlib
  • Prefixed by docID, length, URL
  • Files stored one after another
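
The record layout described above (docID, length, URL, then zlib-compressed HTML, with records stored back to back) might look roughly like the sketch below. The exact field widths, the byte order, and the names `HEADER`, `pack_record`, and `read_records` are assumptions made for illustration; the slides do not specify them.

```python
import struct
import zlib

# Assumed header: 4-byte docID, 4-byte compressed length, 2-byte URL length.
HEADER = struct.Struct("<IIH")


def pack_record(doc_id, url, html):
    """Build one repository record: header, URL bytes, compressed page."""
    body = zlib.compress(html.encode("utf-8"))
    url_bytes = url.encode("utf-8")
    return HEADER.pack(doc_id, len(body), len(url_bytes)) + url_bytes + body


def read_records(blob):
    """Walk the records stored one after another and yield (docID, URL, HTML)."""
    offset = 0
    while offset < len(blob):
        doc_id, body_len, url_len = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        url = blob[offset:offset + url_len].decode("utf-8")
        offset += url_len
        html = zlib.decompress(blob[offset:offset + body_len]).decode("utf-8")
        offset += body_len
        yield doc_id, url, html


repository = (pack_record(1, "http://a.example/", "<html>first</html>")
              + pack_record(2, "http://b.example/", "<html>second</html>"))
records = list(read_records(repository))
```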

7
Document Index
  • Fixed width ISAM index
  • Stores document status, pointer to repository,
    document checksum
  • If document has been crawled, ptr to variable
    length docinfo file stored
  • Otherwise ptr to URLlist stored
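
Because every entry has the same width, a lookup by docID reduces to a single offset computation. The sketch below shows that idea only; the field list follows the bullets above, but the field sizes, the dense docIDs starting at 0, and the names `ENTRY`, `build_index`, and `lookup` are assumptions.

```python
import struct

# Assumed fixed-width entry: docID, status byte, repository pointer,
# checksum, and a pointer to either the docinfo or the URL list.
ENTRY = struct.Struct("<IBQII")


def build_index(entries):
    """Concatenate fixed-width entries; entries must be ordered by docID from 0."""
    return b"".join(ENTRY.pack(*entry) for entry in entries)


def lookup(index_bytes, doc_id):
    """Fetch one entry by jumping straight to doc_id * entry size."""
    return ENTRY.unpack_from(index_bytes, doc_id * ENTRY.size)


entries = [
    (0, 1, 0, 0xDEADBEEF, 100),    # crawled: last field points to docinfo
    (1, 0, 0, 0, 7),               # not crawled: last field points to URL list
    (2, 1, 4096, 0x12345678, 164),
]
index_bytes = build_index(entries)
```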

8
Hit Lists
  • Plain and Fancy hits
  • 2 bytes for each hit
  • Length of the hit list
  • stored before the hits themselves
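
A 2-byte plain hit can be packed with bit shifts. The sketch below assumes the bit layout given in the original paper (1 capitalization bit, 3 bits of relative font size, 12 bits of word position, with font size 7 reserved to flag a fancy hit); the slides only state the 2-byte size. The function names and the clamping of large positions to 4095 are illustrative choices.

```python
def pack_plain_hit(capitalized, font_size, position):
    """Pack a plain hit into 16 bits: [cap:1][font:3][position:12]."""
    if not 0 <= font_size < 7:
        raise ValueError("font size 7 (binary 111) is reserved to flag a fancy hit")
    position = min(position, 4095)  # positions beyond 12 bits are clamped
    return (int(capitalized) << 15) | (font_size << 12) | position


def unpack_plain_hit(hit):
    """Return (capitalized, font_size, position) from a packed 16-bit hit."""
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF


hit = pack_plain_hit(True, 3, 42)
```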

9
Forward Index
  • Stored in 64 barrels.
  • If a document contains words in a barrel, then
    the docID is recorded into the barrel, with the
    list of wordIDs and hitlists.
  • Each wordID stored as a relative difference from
    the minimum wordID in a barrel. (24 bits for the
    wordID, 8 for hitlist length).
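
The 24 + 8 bit split above fits one 32-bit word per (wordID, hit list length) pair. A minimal sketch, assuming big-endian packing and hypothetical helper names:

```python
import struct


def pack_word_entry(word_id, barrel_min_word_id, hit_count):
    """Store the wordID as a 24-bit offset from the barrel minimum, plus an 8-bit hit count."""
    delta = word_id - barrel_min_word_id
    if not 0 <= delta < (1 << 24):
        raise ValueError("wordID does not fit in this barrel's 24-bit range")
    if not 0 <= hit_count < (1 << 8):
        raise ValueError("hit list length does not fit in 8 bits")
    return struct.pack(">I", (delta << 8) | hit_count)


def unpack_word_entry(data, barrel_min_word_id):
    """Recover (wordID, hit count) from a 4-byte entry."""
    (value,) = struct.unpack(">I", data)
    return barrel_min_word_id + (value >> 8), value & 0xFF


entry = pack_word_entry(word_id=1_000_123, barrel_min_word_id=1_000_000, hit_count=5)
```

Storing the offset rather than the full wordID is what frees the remaining 8 bits for the hit list length.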

10
Inverted Index
  • Same barrels as forward index, but processed by
    the sorter.
  • For every wordID, doclist of docIDs generated,
    with corresponding hitlists.
  • Two sets of inverted barrels, one for hitlists
    with anchor or title text, another for all
    hitlists.
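
The sorter's job, turning a docID-ordered forward barrel into wordID-ordered doclists, can be sketched in memory as follows. The real sorter works on disk-resident barrels; the in-memory data shapes and the name `invert` are assumptions for illustration.

```python
from collections import defaultdict


def invert(forward_barrel):
    """Regroup (docID, [(wordID, hitlist), ...]) entries by wordID.

    Returns {wordID: [(docID, hitlist), ...]} with doclists ordered by docID.
    """
    inverted = defaultdict(list)
    for doc_id, words in sorted(forward_barrel, key=lambda entry: entry[0]):
        for word_id, hits in words:
            inverted[word_id].append((doc_id, hits))
    return dict(sorted(inverted.items()))


forward = [
    (2, [(10, [5]), (11, [1, 9])]),
    (1, [(10, [3, 4])]),
]
inverted_barrel = invert(forward)
```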

11
Indexing the Web
  • Parser: flex used to generate a lexical analyzer;
    involved a fair amount of work
  • Indexing Documents into barrels
  • Every word hashed into wordID
  • Occurrences translated into hitlists and written
    into forward barrels
  • Lexicon needs to be shared
  • Extra words written into a log, processed by one
    final indexer
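
The log-based approach in the last two bullets can be sketched as follows: each parallel indexer resolves words against a fixed base lexicon and logs the words it does not know, and one final pass assigns wordIDs to the logged words. The function names and the dictionary-based lexicon are assumptions; the slides do not describe the implementation.

```python
def index_words(words, base_lexicon, extra_log):
    """Map known words to wordIDs; append unknown words to this indexer's log."""
    word_ids = []
    for word in words:
        word_id = base_lexicon.get(word)
        if word_id is None:
            extra_log.append(word)
        else:
            word_ids.append(word_id)
    return word_ids


def final_pass(base_lexicon, extra_logs):
    """Single final indexer: assign new wordIDs to every logged extra word."""
    lexicon = dict(base_lexicon)
    next_id = max(lexicon.values(), default=-1) + 1
    for log in extra_logs:
        for word in log:
            if word not in lexicon:
                lexicon[word] = next_id
                next_id += 1
    return lexicon


base = {"web": 0, "search": 1}
log_a, log_b = [], []
ids_a = index_words(["web", "crawler"], base, log_a)
ids_b = index_words(["search", "crawler", "barrel"], base, log_b)
full_lexicon = final_pass(base, [log_a, log_b])
```

Since the base lexicon is never modified while the indexers run, they need no shared, mutable state.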

12
Searching
  1. Parse the query.
  2. Convert words into wordIDs.
  3. Seek to the start of the doclist in the short
     barrel for every word.
  4. Scan through the doclists until there is a
     document that matches all the search terms.
  5. Compute the rank of that document for the query.
  6. If we are in the short barrels and at the end of
     any doclist, seek to the start of the doclist in
     the full barrel for every word and go to step 4.
  7. If we are not at the end of any doclist go to
     step 4.
  8. Sort the documents that have matched by rank and
     return the top k.
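
The query evaluation steps above can be sketched in a simplified, in-memory form. This version intersects whole doclists instead of scanning them incrementally, and it always continues from the short barrel to the full barrel; the barrel dictionaries, the `rank` callback, and the function names are assumptions for illustration.

```python
import heapq


def matching_docs(doclists):
    """Return the sorted docIDs that appear in every doclist."""
    if not doclists:
        return []
    common = set(doclists[0])
    for doclist in doclists[1:]:
        common &= set(doclist)
    return sorted(common)


def search(query, lexicon, short_barrel, full_barrel, rank, k=10):
    words = query.lower().split()                          # parse the query
    if not words or any(w not in lexicon for w in words):
        return []                                          # no document can match every term
    word_ids = [lexicon[w] for w in words]                 # words -> wordIDs
    scored = {}
    for barrel in (short_barrel, full_barrel):             # short barrel first, then full
        doclists = [barrel.get(word_id, []) for word_id in word_ids]
        for doc_id in matching_docs(doclists):             # documents matching all terms
            if doc_id not in scored:
                scored[doc_id] = rank(doc_id, word_ids)    # rank each match once
    return heapq.nlargest(k, scored, key=scored.get)       # top k by rank


lexicon = {"web": 0, "search": 1}
short_barrel = {0: [1, 3], 1: [3]}
full_barrel = {0: [1, 2, 3], 1: [2, 3]}
toy_rank = lambda doc_id, word_ids: {2: 0.5, 3: 0.9}[doc_id]
results = search("web search", lexicon, short_barrel, full_barrel, toy_rank)
```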

13
Ranking
  • Count weight generated for each word in query
  • Dot product taken with type weight vector (for
    single word queries) or with type-prox weight
    vector (for multiple word queries)
  • Combined with PageRank to give final score.
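
For a single-word query, the scoring above can be sketched as a dot product between count-weights and type-weights. All numeric values here are made up, the capped-linear count-weight is only one way to model weights that stop growing after a certain count, and the weighted sum used to combine the IR score with PageRank is an assumption; the slides do not say how the two are combined.

```python
# Hypothetical type-weights: hits in titles and anchors count more than plain text.
TYPE_WEIGHTS = {"title": 8.0, "anchor": 6.0, "plain": 1.0}


def count_weight(count, cap=8):
    """Grow linearly with the hit count, then stop growing (assumed shape)."""
    return float(min(count, cap))


def ir_score(hit_counts):
    """Dot product of count-weights with type-weights; hit_counts maps type -> count."""
    return sum(count_weight(c) * TYPE_WEIGHTS[t] for t, c in hit_counts.items())


def final_score(hit_counts, pagerank, alpha=0.5):
    """Combine the IR score with PageRank; the weighted sum is an assumption."""
    return alpha * ir_score(hit_counts) + (1 - alpha) * pagerank


score = final_score({"title": 1, "plain": 20}, pagerank=2.0)
```

With the cap in place, 20 plain-text hits count no more than 8, so a page cannot win on repetition alone.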

14
Results
  • High quality pages
  • zlib 3:1 compression ratio
  • 9 days to download 26 million pages
  • Indexer and crawler ran simultaneously
  • Future work
  • Query caching, smart disk allocation, updates
  • User context, relevance feedback

15
Footnote: foot in mouth!!
  • "we expect that advertising funded search engines
    will be inherently biased towards the advertisers
    and away from the needs of the consumers."