1
Intro to Web Search
  • Michael J. Cafarella
  • December 5, 2007

2
Searching the Web
  • Web Search is basically a database problem, but
    no one uses SQL databases
  • Every query is a top-k query
  • Every query plan is the same
  • Massive numbers of queries and data
  • Read-only data
  • A search query can be thought of in SQL terms,
    but the engineered system is completely different

3
Nutch & Hadoop: A case study
  • Open-source, free to use and change
  • Nutch is a search engine that can handle 200M pages
  • Hadoop is backend infrastructure; its biggest
    deployment is a 2,000-machine cluster
  • There have been many different search engine
    designs, but Nutch is pretty standard and easy to
    learn from

4
Outline
  • Search basics
  • What are the elementary steps?
  • Nutch design
  • Link database, fetcher, indexer, etc.
  • Hadoop support
  • Distributed filesystem, job control

5
Search document model
  • Think of a web document as a tuple with several
    columns
  • Incoming link text
  • Title
  • Page content (maybe many sub-parts)
  • Unique docid
  • A web search is really SELECT * FROM docs WHERE
    docs.text LIKE '%userquery%' AND docs.title LIKE
    '%userquery%' ORDER BY relevance
  • Where "relevance" is very complicated

6
Search document model (2)
  • Three main challenges to processing a query
  • Processing speed
  • Result relevance
  • Scaling to many documents

7
Processing speed
  • You could grep, but each query will need to touch
    each document
  • Key to fast processing is the inverted index
  • Basic idea: for each word, list all the
    documents where that word can be found
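
A minimal Python sketch of the idea, over a toy in-memory corpus (the names and data are illustrative, not Nutch code):

    from collections import defaultdict

    def build_inverted_index(docs):
        # docs: dict of docid -> page text (toy stand-in for fetched pages)
        index = defaultdict(set)
        for docid, text in docs.items():
            for word in text.lower().split():
                index[word].add(docid)
        # keep each posting list sorted so queries can merge with pointers
        return {word: sorted(ids) for word, ids in index.items()}

    index = build_inverted_index({
        0: "mayors give nickels",
        1: "cities such as seattle are friendly",
    })
    print(index["seattle"])  # [1]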

8
[Diagram: an inverted index. Each term (as, billy, cities, friendly, give, mayors, nickels, seattle, such, words) points to a sorted posting list of docids: docid0, docid1, docid2, ..., docid|docs|-1]
9
Query: "such as"
[Diagram: the sorted posting lists for "such" and "as" (docid0, docid1, docid2, ..., docid|docs|-1) are walked with one pointer each]
  1. Test for equality
  2. Advance smaller pointer
  3. Abort when a list is exhausted
Returned docs: 322
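
The three numbered steps are the classic two-pointer AND merge over sorted posting lists; a small Python sketch (illustrative, not Nutch's actual query code):

    def intersect(a, b):
        # a, b: sorted posting lists for the two query terms
        i, j, hits = 0, 0, []
        while i < len(a) and j < len(b):   # 3. abort when a list is exhausted
            if a[i] == b[j]:               # 1. test for equality
                hits.append(a[i])
                i += 1
                j += 1
            elif a[i] < b[j]:              # 2. advance the smaller pointer
                i += 1
            else:
                j += 1
        return hits

    print(intersect([1, 29, 322, 5000], [29, 322, 4000]))  # [29, 322]

Each list is touched at most once, so the AND query costs on the order of the two list lengths rather than a scan of every document.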
10
Result Relevance
  • Modern search engines use hundreds of different
    clues for ranking
  • Page title, meta tags, bold text, text position
    on page, word frequency, etc.
  • A few are usually considered standard
  • tf-idf(t, d) = freq(t in d) / freq(t in corpus)
    (see the sketch below)
  • Link analysis: link counting, PageRank, etc.
  • Incoming anchor text
  • Big gains are now hard to find
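
Taking the slide's ratio literally (a simplified variant; production tf-idf formulas are log-scaled and smoothed), a Python sketch:

    def tfidf(term, doc_words, corpus_counts, corpus_size):
        # slide's ratio: how often t occurs in d vs. in the corpus overall
        freq_in_doc = doc_words.count(term) / len(doc_words)
        freq_in_corpus = corpus_counts.get(term, 1) / corpus_size
        return freq_in_doc / freq_in_corpus

    doc = "such a friendly city".split()
    print(tfidf("friendly", doc, {"friendly": 10, "a": 900}, 1000))
    # 0.25 / 0.01 = 25.0: terms rare in the corpus score high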

11
Scaling to many documents
  • Not even the inverted index can handle billions
    of docs on a single machine
  • Need to parallelize query
  • Segment by document
  • Segment by search term

12
Scaling (2): doc segmenting
[Diagram: the query "britney" is broadcast to five index segments (Docs 0-1M, 1-2M, 2-3M, 3-4M, 4-5M); each returns its local hits (docs 1 and 29; 1.2M and 1.7M; 2.3M and 2.9M; 3.1M and 3.2M; 4.4M and 4.5M), which are merged into one answer: 29, 1.2M, 4.4M, ...]
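
A sequential Python stand-in for the scatter-gather in the diagram (real segments are separate machines queried in parallel; the helper names are made up for illustration):

    def search_segment(segment_index, term):
        # each machine holds an inverted index over one slice of the docids
        return segment_index.get(term, [])

    def distributed_search(segments, term, k=10):
        # scatter the query, then gather and merge the per-segment hits
        partials = [search_segment(seg, term) for seg in segments]
        return sorted(doc for hits in partials for doc in hits)[:k]

    segments = [{"britney": [1, 29]}, {"britney": [1_200_000, 1_700_000]}]
    print(distributed_search(segments, "britney"))  # [1, 29, 1200000, 1700000]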
13
Scaling (3)
  • Segment by document, pros/cons
  • Easy to partition (just MOD the docid)
  • Easy to add new documents
  • If a machine fails, quality goes down but queries
    don't die
  • Segment by term, pros/cons
  • Harder to partition (terms are uneven)
  • Trickier to add a new document (need to touch
    many machines)
  • If a machine fails, a search term might disappear,
    but not critical pages (e.g., yahoo.com/index.html)

14
Intro to Nutch
  • A search engine is more than just the query
    system. Simply obtaining the pages and
    constructing the index is a lot of work

15
WebDB
[Diagram: Nutch architecture; only the WebDB label is recoverable]
16
Moving Parts
  • Acquisition cycle
  • WebDB
  • Fetcher
  • Index generation
  • Indexing
  • Link analysis (maybe)
  • Serving results

17
WebDB
  • Contains info on all pages, links
  • Per page: URL, last download, failures, link score,
    content hash, ref counting
  • Per link: source hash, target URL
  • Must always be consistent
  • Designed to minimize disk seeks
  • 19 ms seek time x 200M new pages/month
  • = 3.8M seconds, i.e. 44 days of disk seeks!

18
Fetcher
  • Fetcher is very stupid; it's not a crawler
  • Divide the "to-fetch list" into k pieces, one for
    each fetcher machine
  • URLs for one domain go to the same list, otherwise
    random (see the partitioning sketch below)
  • Politeness w/o inter-fetcher protocols
  • Can observe robots.txt similarly
  • Better DNS, robots caching
  • Easy parallelism
  • Two outputs: pages, WebDB edits
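
A sketch of that partitioning rule (hash the host, not the URL); an illustration of the stated policy, not Nutch's actual fetcher code:

    import zlib
    from urllib.parse import urlparse

    def partition_fetchlist(urls, k):
        # all URLs for one host land on the same fetcher, so per-host
        # politeness and robots.txt need no inter-fetcher coordination
        pieces = [[] for _ in range(k)]
        for url in urls:
            host = urlparse(url).netloc
            pieces[zlib.crc32(host.encode()) % k].append(url)
        return pieces

A side effect is that each fetcher's DNS and robots.txt caches stay hot, since any given host is only ever contacted from one machine.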

19
WebDB/Fetcher Updates
URL http//www.about.com/index.html
LastUpdated 3/22/05
ContentHash MD5_sdflkjweroiwelksd
Edit DOWNLOAD_CONTENT
URL http//www.yahoo/index.html
ContentHash MD5_toewkekqmekkalekaa
URL http//www.cnn.com/index.html
LastUpdated Never
ContentHash None
Edit DOWNLOAD_CONTENT
URL http//www.cnn.com/index.html
ContentHash MD5_balboglerropewolefbag
URL http//www.cnn.com/index.html
LastUpdated Today!
ContentHash MD5_balboglerropewolefbag
URL http//www.flickr/com/index.html
LastUpdated Never
ContentHash None
URL http//www.yahoo/index.html
LastUpdated 4/07/05
ContentHash MD5_toewkekqmekkalekaa
Edit NEW_LINK
URL http//www.flickr.com/index.html
ContentHash None
URL http//www.yahoo.com/index.html
LastUpdated Today!
ContentHash MD5_toewkekqmekkalekaa
Fetcher edits
WebDB
2. Sort edits (externally, if necessary)
1. Write down fetcher edits
3. Read streams in parallel, emitting new database
4. Repeat for other tables
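
Steps 1-3 amount to a sort-merge join of the old records with the edits; a Python sketch (records and edits as (url, fields) pairs, both sorted by URL; illustrative only):

    def merge_webdb(old_records, edits):
        # one sequential pass over two URL-sorted streams: no random seeks
        out, i, j = [], 0, 0
        while i < len(old_records) or j < len(edits):
            if j == len(edits) or (i < len(old_records)
                                   and old_records[i][0] < edits[j][0]):
                out.append(old_records[i])   # untouched old record
                i += 1
            elif i == len(old_records) or edits[j][0] < old_records[i][0]:
                out.append(edits[j])         # NEW_LINK: a brand-new page
                j += 1
            else:
                out.append(edits[j])         # DOWNLOAD_CONTENT: replaces record
                i += 1
                j += 1
        return out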
20
Indexing
  • Iterate through all k page sets in parallel,
    constructing inverted index
  • Creates a searchable document like we saw
    earlier
  • URL text
  • Content text
  • Incoming anchor text
  • Inverted index provided by the Lucene open source
    project
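
A toy version of the per-field document model in Python (the field names mirror slide 5; in real Nutch the index structure is Lucene's, not this):

    from collections import defaultdict

    def index_pages(pages):
        # one inverted index per field of the searchable document
        fields = ("url", "content", "anchors")
        index = {f: defaultdict(set) for f in fields}
        for docid, page in enumerate(pages):
            for f in fields:
                for word in page.get(f, "").lower().split():
                    index[f][word].add(docid)
        return index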

21
Administering Nutch
  • Admin costs are critical
  • It's a hassle when you have 25 machines
  • Google has maybe >100k
  • Files
  • WebDB content, working files
  • Fetchlists, fetched pages
  • Link analysis outputs, working files
  • Inverted indices
  • Jobs
  • Emit fetchlists, fetch, update WebDB
  • Run link analysis
  • Build inverted indices

22
Administering Nutch (2)
  • Admin sounds boring, but it's not!
  • Really
  • I swear
  • Large-file maintenance
  • Google File System (Ghemawat, Gobioff, Leung)
  • Nutch Distributed File System
  • Job Control
  • Map/Reduce (Dean and Ghemawat)
  • Result: Hadoop, a Nutch spinoff

23
Nutch Distributed File System
  • Similar, but not identical, to GFS
  • Requirements are fairly strange
  • Extremely large files
  • Most files read once, from start to end
  • Low admin costs per GB
  • Equally strange design
  • Write-once, with delete
  • Single file can exist across many machines
  • Wholly automatic failure recovery

24
NDFS (2)
  • Data divided into blocks
  • Blocks can be copied, replicated
  • Datanodes hold and serve blocks
  • Namenode holds metainfo
  • Filename → block list
  • Block → datanode location(s)
  • Datanodes report in to namenode every few seconds
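
A toy Python rendering of that metadata (illustrative; NDFS itself is Java):

    class Namenode:
        def __init__(self):
            self.files = {}      # filename -> ordered list of block ids
            self.blocks = {}     # block id -> set of datanodes holding it
            self.last_seen = {}  # datanode -> time of last heartbeat

        def heartbeat(self, datanode, held_blocks, now):
            # datanodes report in every few seconds with what they hold
            self.last_seen[datanode] = now
            for blk in held_blocks:
                self.blocks.setdefault(blk, set()).add(datanode)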

25
NDFS File Read
[Diagram: a client, the namenode, and datanodes 0 through 5]
crawl.txt: (block-33 → datanodes 1, 4), (block-95 → datanodes 0, 2), (block-65 → datanodes 1, 4, 5)
  1. Client asks namenode for filename info
  2. Namenode responds with blocklist, and location(s)
    for each block
  3. Client fetches each block, in sequence, from a
    datanode
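
The read path, using the toy Namenode sketched above (stores stands in for the datanodes' block storage; all names are illustrative):

    def read_file(nn, stores, filename):
        # stores: datanode -> {block id: bytes}
        data = b""
        for blk in nn.files[filename]:        # steps 1-2: blocklist + locations
            loc = next(iter(nn.blocks[blk]))  # any replica will do
            data += stores[loc][blk]          # step 3: fetch blocks in sequence
        return data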

26
NDFS Replication
[Diagram: the namenode plus Datanode 0 (33, 95), Datanode 1 (46, 95), Datanode 2 (33, 104), Datanode 3 (21, 33, 46), Datanode 4 (90), Datanode 5 (21, 90, 104); the namenode tells datanode 5: "Blk 90 to dn 0"]
  1. Always keep at least k copies of each blk
  2. Imagine datanode 4 dies; a copy of blk 90 is lost
  3. Namenode loses the heartbeat, decrements blk 90's
    reference count. Asks datanode 5 to replicate
    blk 90 to datanode 0
  4. Choosing a replication target is tricky
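
Continuing the toy Namenode, a sketch of the repair loop (pick_target is a deliberately naive stand-in; as point 4 says, real target choice is tricky):

    def pick_target(nn, holders, dead):
        # naive placement: any live datanode not already holding the block
        candidates = set(nn.last_seen) - holders - dead
        return next(iter(candidates), None)

    def check_replication(nn, k, now, timeout=30):
        # a missed heartbeat means every replica on that datanode is gone
        dead = {d for d, t in nn.last_seen.items() if now - t > timeout}
        commands = []
        for blk, holders in nn.blocks.items():
            holders = holders - dead
            if 0 < len(holders) < k:   # under-replicated, but not lost
                src = next(iter(holders))
                commands.append((src, blk, pick_target(nn, holders, dead)))
        return commands  # e.g. [("datanode 5", 90, "datanode 0")]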

27
Map/Reduce
  • Map/Reduce is a programming model from Lisp (and
    other places)
  • Easy to distribute across nodes
  • Nice retry/failure semantics
  • map(key, val) is run on each item in set
  • emits key/val pairs
  • reduce(key, vals) is run for each unique key
    emitted by map()
  • emits final output
  • Many problems can be phrased this way
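
The whole contract fits in a few lines; a single-process Python stand-in (the real system runs each phase distributed, with retries):

    from collections import defaultdict

    def run_mapreduce(map_fn, reduce_fn, inputs):
        # map every item, group emitted pairs by key, reduce per unique key
        grouped = defaultdict(list)
        for key, val in inputs:
            for k, v in map_fn(key, val):
                grouped[k].append(v)
        return [out for k, vals in grouped.items()
                for out in reduce_fn(k, vals)]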

28
Map/Reduce (2)
  • Task: count words in docs
  • Input consists of (url, contents) pairs
  • map(key=url, val=contents)
  • For each word w in contents, emit (w, 1)
  • reduce(key=word, values=uniq_counts)
  • Sum all 1s in values list
  • Emit result (word, sum)
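
The word-count task, expressed against the run_mapreduce stand-in above:

    def wc_map(url, contents):
        for w in contents.split():
            yield (w, 1)

    def wc_reduce(word, counts):
        yield (word, sum(counts))

    print(run_mapreduce(wc_map, wc_reduce, [("u0", "to be or not to be")]))
    # [('to', 2), ('be', 2), ('or', 1), ('not', 1)]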

29
Map/Reduce (3)
  • Task: grep
  • Input consists of (url+offset, single line)
  • map(key=url+offset, val=line)
  • If line matches regexp, emit (line, 1)
  • reduce(key=line, values=uniq_counts)
  • Don't do anything; just emit line
  • We can also do graph inversion, link analysis,
    WebDB updates, etc.
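
Grep in the same mold (the pattern is an arbitrary example), again using run_mapreduce from above:

    import re

    PATTERN = re.compile("nutch")  # whatever we are grepping for

    def grep_map(url_offset, line):
        if PATTERN.search(line):
            yield (line, 1)

    def grep_reduce(line, counts):
        yield line  # identity: just pass matching lines through

    lines = [("u0:0", "all about nutch"), ("u0:16", "and hadoop")]
    print(run_mapreduce(grep_map, grep_reduce, lines))  # ['all about nutch']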

30
Map/Reduce (4)
  • How is this distributed?
  • Partition input key/value pairs into chunks, run
    map() tasks in parallel
  • After all map()s are complete, consolidate all
    emitted values for each unique emitted key
  • Now partition space of output map keys, and run
    reduce() in parallel
  • If map() or reduce() fails, re-execute!

31
Map/Reduce Job Processing
[Diagram: a JobTracker coordinating TaskTrackers 0-5 on a "grep" job]
  1. Client submits grep job, indicating code and
    input files
  2. JobTracker breaks input file into k chunks (in
    this case 6). Assigns work to tasktrackers.
  3. After map(), tasktrackers exchange map-output to
    build reduce() keyspace
  4. JobTracker breaks reduce() keyspace into m chunks
    (in this case 6). Assigns work.
  5. reduce() output may go to NDFS

32
Conclusion
  • http://www.nutch.org/
  • Partial documentation
  • Source code
  • Developer discussion board
  • "Lucene in Action" by Hatcher, Gospodnetic
  • http://www.hadoop.org/
  • Or, take 490H