Title: The Anatomy of a Large-Scale Search Engine
1. The Anatomy of a Large-Scale Search Engine
- Based on a paper by Sergey Brin and Lawrence Page, Computer Science Department, Stanford University, submitted to WWW7 (1998)
- Lecture by Tal Blum for the SDBI seminar
2. Index
- Introduction
- Design Goals
- System Features
- Related Work
- System Anatomy
- Results and Performance
- Conclusions
3. What is Google?
- A large-scale search engine
- makes extensive use of hypertext
- designed to crawl and index the web efficiently
- gives better results than existing systems
- prototype at http://google.stanford.edu or http://www.google.com
- the name plays on "googol" = 10^100
4. Why Talk About Google?
- Engineering a search engine (SE) is a challenging task
- millions of pages, terms, and queries
- little academic research on the subject
- a SE today is not what it was 5 years ago
- this is the first detailed public description of a large-scale SE
- better results through use of hypertext
- works on uncontrolled hypertext collections
5. The Web: an IR Challenge
- Two main ways of navigating the web:
- high-quality, human-maintained lists (Yahoo)
- too slow to improve
- cannot cover esoteric topics
- expensive to build and maintain
- search engines (Google, AltaVista)
- search by keywords
- too many low-quality matches
- people try to mislead automated search engines
6. Web Growth
7. Web Search Engine Scaling-Up: 1994-2000
- The first SE, the World Wide Web Worm (WWWW, 1994), had an index of 110,000 web pages and answered about 1,500 queries per day
- November 1997: indexes of 2-100 million web pages, about 20 million queries per day (AltaVista)
- By 2000, SEs are expected to index about a billion web pages and handle hundreds of millions of queries per day
8. Web Search Engine Scaling-Up: 1999
- Challenges in creating a search engine that scales even to today's web:
- Fast crawling technology
- gather documents and keep them up to date
- Efficient use of storage space
- for indices and, optionally, the documents themselves
- Handle queries quickly
- at a rate of thousands per second
9. Google: Scaling with the Web
- Improved hardware performance helps
- exceptions: disk seek time, OS robustness
- Google is designed to scale well to extremely large data sets
- Google's data structures are optimized for fast, efficient access
- Google is a centralized SE
10. Design Goals
- Improved search quality
- junk results
- the number of documents has increased by many factors
- the user's ability to look at documents has not
- as the collection grows, we need tools with very high precision, even at the expense of recall
- Use of hypertextual information
- in Google: link structure and anchor text
11. Design Goals (2)
- Academic search engine research
- SEs have migrated from the academic domain to the commercial one
- SE technology has become mostly a black art and is advertising-oriented
- Obtain usage information
- it is considered commercially valuable
- Support novel research activities on large-scale web data
12. System Features
- PageRank: bringing order to the web
- most web SEs have largely ignored the link graph
- Google's link map contains 518 million hyperlinks
- corresponds well with people's idea of importance
- PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) (see the sketch below)
- differences from traditional methods:
- links from different pages are not counted equally
- normalizing by the number of links on a page
- different from Kleinberg's HITS
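To make the formula concrete, here is a minimal power-iteration sketch in Python. The toy graph, the damping value d = 0.85, and the convergence threshold are illustrative assumptions, not part of the paper:

```python
# Minimal power-iteration sketch of the PageRank formula above.
# The toy graph and tolerance are assumptions for illustration only.
def pagerank(links, d=0.85, tol=1e-6):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}  # initial rank of 1 per page
    while True:
        new = {}
        for a in pages:
            # sum PR(T)/C(T) over every page T that links to A
            backlink_sum = sum(pr[t] / len(links[t])
                               for t in pages if a in links[t])
            new[a] = (1 - d) + d * backlink_sum
        if max(abs(new[p] - pr[p]) for p in pages) < tol:
            return new
        pr = new

print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
```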
13. System Features (2)
- Anchor text
- associate the text of a link with the page the link points to (sketch below)
- advantages:
- anchors often provide more accurate descriptions than the pages themselves
- anchors can exist for documents that cannot be indexed
- images, programs, databases, mp3s, non-text docs, e-mails
- can return web pages that haven't even been crawled
- first used in the WWW Worm (1994)
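A minimal sketch of the idea in Python, assuming a simple in-memory index (in Google itself, anchor text is turned into special anchor hits stored in the regular barrels):

```python
# Sketch: index each link's text under the page it POINTS TO, so a query
# can match documents that were never crawled or cannot be parsed.
# The dict-based index is an assumption for illustration.
from collections import defaultdict

anchor_index = defaultdict(set)  # term -> URLs the anchors describe

def add_link(source_url, target_url, anchor_text):
    for term in anchor_text.lower().split():
        anchor_index[term].add(target_url)

add_link("http://cs.stanford.edu", "http://example.com/logo.gif", "the logo")
print(anchor_index["logo"])  # finds the image, though images are unparseable
```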
14. System Features (3)
- Other features
- Location information
- enables the use of proximity in search
- Visualization information
- relative font size
- The full raw HTML of pages is available
- users can view a cached version of the page
- users can view the page as it was when indexed
- can be used for research
15. Related Work
- SEs have a short history (WWWW, 1994)
- commercial services closely guard the details of their databases
- work exists on specialized features of SEs
- especially on post-processing the results of existing SEs
- work exists on information retrieval systems
- especially in well-controlled environments
16. IR: Differences Between the Web and Well-Controlled Collections
- TREC-96's Very Large Corpus is only 20 GB, compared to the 147 GB of Google's crawl
- the Web is a vast collection of heterogeneous documents
- in language, vocabulary, and format
- things that work well on TREC often do not produce good results on the web
- there is no control over what people put on the web
17. System Anatomy
18. Major Data Structures
- BigFiles (sketch below)
- virtual files spanning multiple file systems
- addressable by 64-bit integers
- handle allocation and deallocation of file descriptors, since what the OSs provide is not enough
- support rudimentary compression
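A rough sketch of the BigFiles idea, assuming a fixed chunk size and lazily opened files (the real system also manages compression and its own descriptor pool):

```python
# Sketch of a virtual file addressed by 64-bit offsets and backed by
# several ordinary files, possibly on different file systems. The 1 GB
# chunk size and lazy opening are assumptions, not Google's parameters.
class BigFile:
    CHUNK = 2**30  # assumed size of each underlying file

    def __init__(self, paths):
        self.paths = paths  # one ordinary file per chunk

    def read(self, offset, size):
        """Read bytes at a 64-bit virtual offset (single-chunk reads only)."""
        chunk, local = divmod(offset, self.CHUNK)
        with open(self.paths[chunk], "rb") as f:  # opened on demand, so the
            f.seek(local)                         # OS descriptor is freed
            return f.read(size)
```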
19. Major Data Structures (2)
- Repository
- a tradeoff between speed and compression ratio
- chose zlib (3:1) over bzip (4:1) for its speed (see the comparison below)
- requires no other data structure to access it
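The tradeoff is easy to reproduce with Python's standard library; the sample data below is an assumption, and exact ratios depend on the input:

```python
# Compare compression ratio and speed of zlib vs. bzip2 on sample data.
import bz2, time, zlib

data = b"<html><body>" + b"the quick brown fox jumps " * 5000 + b"</body></html>"

for name, compress in [("zlib", zlib.compress), ("bzip2", bz2.compress)]:
    start = time.perf_counter()
    out = compress(data)
    ms = (time.perf_counter() - start) * 1000
    print(f"{name}: {len(data) / len(out):.1f}:1 in {ms:.2f} ms")
```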
20. Major Data Structures (3)
- Document index
- keeps information about each document
- a fixed-width ISAM (Index Sequential Access Mode) index
- includes various statistics
- a pointer into the repository and, if the document was crawled, a pointer to its info list
- a compact data structure
- a record can be fetched in one disk seek during search (sketch below)
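Why one seek suffices: with fixed-width records, a document's byte offset is simply docID times the record size. A sketch, with an assumed field layout:

```python
# Fetch a document-index record in a single seek. The field layout
# (repository pointer, checksum, status byte) is an assumed example,
# not the paper's actual format.
import struct

RECORD = struct.Struct("<QIB")  # 8-byte repo pointer, 4-byte checksum, status

def fetch(index_file, doc_id):
    index_file.seek(doc_id * RECORD.size)  # offset computed, not searched
    return RECORD.unpack(index_file.read(RECORD.size))
```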
21. Major Data Structures (4)
- URL-to-docID file
- used to convert URLs into docIDs
- a list of URL checksums with their docIDs
- sorted by checksum
- given a URL, a binary search is performed (sketch below)
- conversions are done in batch mode
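A sketch of the lookup, with CRC32 standing in for the checksum function (the paper does not specify one):

```python
# URLs are reduced to checksums kept in sorted order; each lookup is then
# a binary search. CRC32 and the toy table are assumptions.
import bisect, zlib

def checksum(url):
    return zlib.crc32(url.encode())

# built offline, in batch: sorted (checksum, docID) pairs
table = sorted((checksum(u), i) for i, u in
               enumerate(["http://a.com/", "http://b.com/"]))
keys = [c for c, _ in table]

def url_to_docid(url):
    i = bisect.bisect_left(keys, checksum(url))
    if i < len(keys) and keys[i] == checksum(url):
        return table[i][1]
    return None  # not yet assigned a docID

print(url_to_docid("http://b.com/"))
```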
22. Major Data Structures (5)
- Lexicon
- can fit in the memory of a machine of reasonable price
- currently 256 MB
- contains 14 million words
- two parts:
- a list of the words
- a hash table of pointers
23. Major Data Structures (6)
- Hit lists
- record the position, font size, and capitalization of each word occurrence
- account for most of the space used in the indexes
- three encoding alternatives: simple, Huffman-coded, and hand-optimized
- the hand-optimized encoding uses 2 bytes for every hit (sketch below)
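The paper's hand-optimized layout packs a plain hit into 16 bits: 1 capitalization bit, 3 bits of font size, 12 bits of position. A sketch:

```python
# Pack/unpack a "plain hit" into 2 bytes, following the layout described
# in the paper: capitalization (1 bit), font size (3 bits), position
# (12 bits; positions beyond 4095 are clamped).
def pack_hit(capitalized, font_size, position):
    return (int(capitalized) << 15) | ((font_size & 0b111) << 12) | min(position, 0xFFF)

def unpack_hit(hit):
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF

assert unpack_hit(pack_hit(True, 3, 42)) == (True, 3, 42)
```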
24. Major Data Structures (7)
25. Major Data Structures (8)
- Forward index
- partially sorted
- uses 64 barrels
- each barrel holds a range of wordIDs
- requires slightly more storage because of duplicated wordIDs
- each wordID is stored as a relative difference from the minimum wordID of its barrel (sketch below)
- saves considerable time in the sorting phase
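A sketch of the relative-difference trick; the even split of the wordID space across barrels is an assumption (the paper only says each barrel holds a range of wordIDs):

```python
# Assign a wordID to one of 64 barrels and store it relative to the
# barrel's minimum wordID, so it needs fewer bits inside the barrel.
NUM_BARRELS = 64
LEXICON_SIZE = 14_000_000                  # word count from the slides
RANGE = -(-LEXICON_SIZE // NUM_BARRELS)    # ceiling division

def to_barrel(word_id):
    barrel = word_id // RANGE
    return barrel, word_id - barrel * RANGE  # (barrel index, relative wordID)
```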
26. Major Data Structures (9)
- Inverted index
- 64 barrels (the same ones as the forward index)
- for each wordID, the lexicon contains a pointer to the barrel that the wordID falls into
- the pointer points to a doclist of docIDs together with their hit lists
- the order of the docIDs is important
- by docID, or by a document word-ranking
- Google chose a compromise between the two
27. Major Data Structures (10)
- Crawling the web
- a fast, distributed crawling system
- the URLserver and the crawlers are implemented in Python
- each crawler keeps about 300 connections open at once (sketch below)
- at peak, the rate is about 100 pages (600 KB) per second
- uses an internal cached DNS lookup
- asynchronous IO to handle events
- a number of queues to move page fetches from state to state
- robust and carefully tested
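A sketch of the same event-driven design in modern Python, using asyncio and aiohttp (both are assumptions; the original used Python 1.x with non-blocking IO and its own DNS cache):

```python
# Keep ~300 fetches in flight per crawler process, as the slides describe.
# aiohttp and the connection cap are stand-ins for the original design.
import asyncio
import aiohttp

async def crawl(urls, max_connections=300):
    connector = aiohttp.TCPConnector(limit=max_connections)
    async with aiohttp.ClientSession(connector=connector) as session:
        async def fetch(url):
            async with session.get(url) as resp:
                return url, await resp.read()
        return await asyncio.gather(*(fetch(u) for u in urls))

# pages = asyncio.run(crawl(["http://example.com/"]))
```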
28. Major Data Structures (11)
- Indexing the web
- Parsing
- the parser must handle errors gracefully:
- HTML typos
- kilobytes of zeros in the middle of a tag
- non-ASCII characters
- HTML tags nested hundreds deep
- they developed their own parser
- it involved a fair amount of work
- it did not become a bottleneck
29. Major Data Structures (12)
- Indexing documents into barrels
- turning words into wordIDs
- via an in-memory hash table: the lexicon
- new additions to the lexicon are logged to a file
- parallelization:
- a shared base lexicon of 14 million words
- a log of all the extra words
30. Major Data Structures (13)
- Indexing the web
- Sorting
- creates the inverted index from the forward barrels
- produces two types of barrels:
- one for title and anchor hits
- one for full text
- each barrel is sorted separately
- the sorters run in parallel
- the sorting is done in main memory
31. Searching
- The algorithm:
- 1. Parse the query
- 2. Convert words into wordIDs
- 3. Seek to the start of the doclist in the short barrel for every word
- 4. Scan through the doclists until there is a document that matches all of the search terms
- 5. Compute the rank of that document
- 6. If we are at the end of the short barrels, start at the doclists of the full barrels, unless we have enough results
- 7. If we are not at the end of any doclist, go to step 4
- 8. Sort the matched documents by rank and return the top K (sketch below)
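A compressed sketch of steps 4-8; real doclists are streamed from the barrels on disk, so the in-memory sets and the rank stub here are assumptions:

```python
# Intersect the doclists of all query words, rank the matches, return top K.
import heapq

def search(query_words, doclists, rank, k=10):
    """doclists: word -> set of docIDs containing it (assumed in-memory)."""
    # step 4: only documents matching ALL search terms survive
    matching = set.intersection(*(doclists[w] for w in query_words))
    # steps 5 and 8: compute each match's rank and return the top K
    return heapq.nlargest(k, ((rank(doc), doc) for doc in matching))

docs = {"bill": {1, 2, 5}, "clinton": {2, 5, 9}}
print(search(["bill", "clinton"], docs, rank=lambda d: 1.0 / d, k=2))
```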
32. The Ranking System
- The information used:
- position, font size, capitalization
- anchor text
- PageRank
- Hit types:
- title, anchor, URL, etc.
- small font, large font, etc.
33. The Ranking System (2)
- Each hit type has its own type-weight
- count-weights increase linearly with counts at first but quickly taper off; together these give the IR score of the document
- the IR score is combined with PageRank to give the final rank (sketch below)
- For a multi-word query:
- a proximity score is computed for every set of hits, with a proximity-type weight
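A sketch of how these pieces might combine; the type weights, the taper curve, and the IR-to-PageRank combination are all assumptions, since the paper does not publish them:

```python
# Taper per-type hit counts, dot with type weights for an IR score, then
# combine with PageRank. All constants here are illustrative guesses.
import math

TYPE_WEIGHTS = {"title": 8.0, "anchor": 6.0, "url": 4.0, "plain": 1.0}

def taper(count, knee=8):
    # linear at first, then flattens quickly, as the slide describes
    return min(count, knee) + math.log1p(max(count - knee, 0))

def final_rank(hit_counts, pagerank):
    ir_score = sum(TYPE_WEIGHTS[t] * taper(c) for t, c in hit_counts.items())
    return ir_score * pagerank  # the combination function is an assumption

print(final_rank({"title": 1, "plain": 40}, pagerank=0.8))
```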
34. Feedback
- A trusted user may optionally evaluate the results
- the feedback is saved
- when the ranking function is modified, we can see the impact of the change on all previously ranked searches
35. Results
- Google produces better results than the major commercial search engines for most searches
- Example query: "bill clinton"
- returns results from whitehouse.gov
- including email addresses of the president
- all the results are high-quality pages
- no broken links
- no "Bill" without "Clinton", and no "Clinton" without "Bill"
36. Storage Requirements
- With compression of the repository, all the data used by the SE totals about 55 GB
- most queries can be answered using just the short inverted index
- with better compression, a high-quality SE could fit on the 7 GB drive of a new PC
37. Storage Statistics
Web Page Statistics
38. System Performance
- It took 9 days to download 26 million pages
- 48.5 pages per second
- The indexer and the crawler ran simultaneously
- the indexer runs at 54 pages per second
- The sorters ran in parallel on 4 machines; the whole sorting process took 24 hours
39. Conclusions
- Scalable Search Engine
- High Quality Search Results
- Search techniques
- PageRank
- Anchor Text
- Proximity Information
- A Complete Architecture
40. Future Work
- Improve search efficiency
- Scale to approximately 100 million web pages
- Boolean operators
- Use text surrounding links
- Personalized PageRank
- Result summarization
41. New Features
- GoogleScout
- document caching
- Uncle Sam
- the link: option
42. The End