Title: CHAPTER 1: INTRODUCTION
1. Green Island (Coral Cay, Great Barrier Reef, Australia, 9/18)
3. Agincourt Reef (Outer Barrier Reef, Australia, 9/19)
4. Search Engine
- A search engine is essentially a text retrieval system for web pages plus a Web interface.
- So what's new???
5. Some Characteristics of the Web
- Web pages are
  - very voluminous and diversified,
  - widely distributed on many servers,
  - extremely dynamic/volatile.
  (So we need to crawl and maintain an index.)
- Web pages have
  - more structure (extensively tagged),
  - extensive links,
  - often other associated metadata.
  (Use the links, tags, and metadata!)
- Web search is
  - noisy (pages with high similarity to the query may still differ in relevance),
  - adversarial! A page can advertise itself falsely just so it will be retrieved.
  (Use the social structure of the web.)
- Web users are
  - ordinary folks (dolts?) without special training; they tend to submit short queries,
  - a very large user community.
  (Easily impressed?)
6. Crawlers: Main Issues
- General-purpose crawling
- Context-specific crawling
- Building topic-specific search engines
10. SPIDER CASE STUDY
11. Web Crawling (Search) Strategy
- Starting location(s)
- Traversal order (a breadth-first frontier sketch follows below)
  - Depth-first
  - Breadth-first
  - Or ???
- Cycles?
- Coverage?
- Load?
[Figure: example web graph with nodes b through j, illustrating the order in which a crawl visits pages]
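As a concrete illustration of the traversal choices above, here is a minimal breadth-first crawl loop with a visited set for cycle avoidance. This is only a sketch, not the course's crawler; the fetch and extract_links helpers are assumed to be supplied by the caller.

```python
from collections import deque

def crawl_bfs(seed_urls, fetch, extract_links, max_pages=1000):
    """Breadth-first crawl: a FIFO frontier plus a visited set to avoid cycles.

    fetch(url) -> page text and extract_links(page, url) -> absolute URLs are
    hypothetical caller-supplied helpers. Popping from the same end as pushes
    (a stack instead of a queue) would give depth-first order instead.
    """
    frontier = deque(seed_urls)      # FIFO -> breadth-first traversal
    visited = set(seed_urls)         # cycle avoidance
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            page = fetch(url)
        except Exception:
            continue                  # skip unreachable pages
        pages[url] = page
        for link in extract_links(page, url):
            if link not in visited:   # improves coverage without revisiting
                visited.add(link)
                frontier.append(link)
    return pages
```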
12. Robot (2)
- Some specific issues
  - What initial URLs to use?
  - The choice depends on the type of search engine to be built.
  - For general-purpose search engines, use URLs that are likely to reach a large portion of the Web, such as the Yahoo home page.
  - For local search engines covering one or several organizations, use the URLs of the home pages of these organizations. In addition, use an appropriate domain constraint (a sketch follows below).
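A minimal sketch of the domain constraint mentioned above: before adding a link to the frontier, check that its host falls under one of the organizations' domains. The domain names here are hypothetical placeholders.

```python
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"example.edu", "example.org"}   # hypothetical organizations

def within_domain(url):
    """Keep only URLs whose host is an allowed domain or a subdomain of one."""
    host = urlparse(url).netloc.lower()
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)

# Usage: filter extracted links before they enter the crawl frontier.
links = ["http://www.example.edu/cs/index.html", "http://other.com/page"]
frontier_links = [u for u in links if within_domain(u)]   # keeps only the first URL
```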
13. Robot (7)
- Several research issues about robots
  - Fetching more important pages first with limited resources.
    - Can use measures of page importance.
  - Fetching web pages in a specified subject area, such as movies and sports, for creating domain-specific search engines.
    - Focused crawling.
  - Efficient re-fetching of web pages to keep the web page index up-to-date.
    - Keeping track of the change rate of a page.
14. Storing Summaries
- Can't store complete page text
  - The whole WWW doesn't fit on any server
  - Stop words
  - Stemming
- What (compact) summary should be stored?
  - Per URL
    - Title, snippet
  - Per word
    - URL, word number (a positional-index sketch follows below)
But look at Google's cache copy... and its privacy violations.
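A small sketch of the "per word: URL, word number" summary, i.e. a positional inverted index with stop-word removal. The tiny stop-word list and the stand-in stemmer are illustrative assumptions, not what any particular engine stores.

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "and", "of", "to"}      # tiny illustrative stop-word list

def stem(word):
    """Stand-in stemmer; a real system would use Porter stemming or similar."""
    return word[:-1] if word.endswith("s") else word

def build_positional_index(pages):
    """Map each word to the (URL, word-number) pairs where it occurs."""
    index = defaultdict(list)
    for url, text in pages.items():
        for position, token in enumerate(text.lower().split()):
            if token in STOP_WORDS:
                continue                          # drop stop words
            index[stem(token)].append((url, position))
    return index

index = build_positional_index({"http://example.com": "airplane tickets and hotels"})
# index["ticket"] -> [("http://example.com", 1)]
```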
17. Robot (4)
- How to extract URLs from a web page?
  - Need to identify all possible tags and attributes that hold URLs.
  - Anchor tag: <a href="URL"> ... </a>
  - Option tag: <option value="URL"> ... </option>
  - Map: <area href="URL">
  - Frame: <frame src="URL">
  - Link to an image: <img src="URL">
  - Relative path vs. absolute path: <base href="..."> (an extraction sketch follows below)
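A sketch of URL extraction over the tags listed above, using Python's html.parser and resolving relative paths against the page's base URL; the class and function names are illustrative.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect URLs from the tag/attribute pairs listed above, resolving relative paths."""
    URL_ATTRS = {"a": "href", "area": "href", "frame": "src",
                 "img": "src", "option": "value"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and attrs.get("href"):      # <base href="..."> changes resolution
            self.base_url = attrs["href"]
        attr = self.URL_ATTRS.get(tag)
        if attr and attrs.get(attr):
            self.urls.append(urljoin(self.base_url, attrs[attr]))

def extract_links(page_html, page_url):
    parser = LinkExtractor(page_url)
    parser.feed(page_html)
    return parser.urls
```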
22. Mercator's Way of Maintaining the URL Frontier
- Extracted URLs enter a front queue.
- Each URL goes into a front queue based on its priority (priority is assigned based on page importance and change rate).
- URLs are shifted from the front queues to the back queues. Each back queue corresponds to a single host, and each has a time t_e at which the host can be hit again.
- URLs are removed from a back queue when the crawler wants a page to crawl.
(A sketch of this scheme follows below.)
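A loose sketch of the front-queue/back-queue idea, not Mercator's actual implementation: URLs enter by priority, are binned into one queue per host, and each host carries a next-allowed-hit time t_e. The fixed politeness delay and all names are assumptions.

```python
import heapq, time
from collections import deque, defaultdict
from urllib.parse import urlparse

class MercatorStyleFrontier:
    """Front queue ordered by priority; one back queue per host with a next-hit time."""
    def __init__(self, politeness_delay=2.0):
        self.front = []                      # heap of (priority, seq, url)
        self.back = defaultdict(deque)       # host -> FIFO of URLs
        self.next_hit = {}                   # host -> earliest time the host may be hit (t_e)
        self.delay = politeness_delay
        self.seq = 0

    def add(self, url, priority):
        """Lower priority value = more important (importance + change rate)."""
        heapq.heappush(self.front, (priority, self.seq, url))
        self.seq += 1

    def _refill_back_queues(self):
        # Drain the front queue into per-host back queues (simplified: drain everything).
        while self.front:
            _, _, url = heapq.heappop(self.front)
            self.back[urlparse(url).netloc].append(url)

    def next_url(self):
        """Return a URL whose host is allowed to be hit now, or None if none is ready."""
        self._refill_back_queues()
        now = time.time()
        for host, queue in self.back.items():
            if queue and self.next_hit.get(host, 0) <= now:
                self.next_hit[host] = now + self.delay
                return queue.popleft()
        return None
```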
23. Focused Crawling
- Classifier: is the crawled page P relevant to the topic?
  - An algorithm that maps a page to relevant/irrelevant.
  - Semi-automatic.
  - Based on page vicinity.
- Distiller: is the crawled page P likely to lead to relevant pages?
  - An algorithm that maps a page to likely/unlikely.
  - Could be just A/H (authorities/hubs) computation, taking the hubs.
  - The distiller determines the priority of following links off of P. (A sketch follows below.)
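A minimal sketch of the focused-crawl loop, with the classifier and distiller abstracted as caller-supplied functions; is_relevant and link_priority are hypothetical stand-ins for, e.g., a topic classifier and a hubs-based score.

```python
import heapq

def focused_crawl(seeds, fetch, extract_links, is_relevant, link_priority, max_pages=500):
    """Priority-driven focused crawl: is_relevant plays the classifier role,
    link_priority the distiller role; both are caller-supplied stand-ins."""
    frontier = [(0.0, url) for url in seeds]       # (negated priority, url); lower pops first
    heapq.heapify(frontier)
    seen, relevant_pages = set(seeds), {}
    while frontier and len(relevant_pages) < max_pages:
        _, url = heapq.heappop(frontier)
        page = fetch(url)
        if is_relevant(page):                      # classifier: keep only on-topic pages
            relevant_pages[url] = page
        for link in extract_links(page, url):
            if link not in seen:
                seen.add(link)
                # Distiller: how likely is this link to lead to relevant pages?
                heapq.heappush(frontier, (-link_priority(page, link), link))
    return relevant_pages
```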
26. Large-Scale Indexing
29. Map-Reduce Parallelism
(added after class, based on class discussion)
- Named after the Lisp constructs map and reduce:
  - (reduce fn2 (map fn1 list))
  - Run function fn1 on every item of the list, and reduce the resulting list using fn2.
  - (reduce * (map 1+ '(4 5 6 7 8 9)))
  - = (reduce * '(5 6 7 8 9 10))
  - = 151200 (= 5*6*7*8*9*10)
  - (reduce + (map primality-test '(num1 num2 ...)))
- So where is the parallelism?
  - All the map operations can be done in parallel (e.g., you can test the primality of each of the numbers in parallel).
  - The overall reduce operation has to be done after the map operation, but it can also be parallelized: e.g., assuming the primality test returns a 0 or 1, the reduce operation can partition the list into k smaller lists, add the elements of each of the lists in parallel, and then add the results.
  - Note that the parallelism in both of the above examples depends on the length of the input (the larger the input list, the more parallel operations you can do, in theory).
- Map-reduce on clusters of computers involves writing your task in a map-reduce form.
  - The cluster computing infrastructure will then parallelize the map and reduce parts using the available pool of machines (you don't need to think, while writing the program, about how many machines and which specific machines are used to do the parallel tasks).
  - An open-source environment that provides such an infrastructure is Hadoop: http://hadoop.apache.org/core/
- Qn: Can we bring map-reduce parallelism to indexing? (A Python sketch of the primality example follows below.)
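A small Python analogue of the slide's primality example, assuming a multiprocessing pool of four workers: the map phase tests each number in parallel, and the reduce phase simply sums the 0/1 results. The is_prime helper and the pool size are illustrative.

```python
from multiprocessing import Pool

def is_prime(n):
    """Return 1 if n is prime, else 0, so the reduce step is just a sum."""
    if n < 2:
        return 0
    return 0 if any(n % d == 0 for d in range(2, int(n ** 0.5) + 1)) else 1

if __name__ == "__main__":
    numbers = list(range(2, 100_000))
    with Pool(4) as pool:                    # map phase: runs in parallel across workers
        flags = pool.map(is_prime, numbers)
    prime_count = sum(flags)                 # reduce phase: add up the 0/1 flags
    print(prime_count)
```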
30. Partition the set of documents into blocks, construct an index for each block separately, then merge the indexes (a sketch follows below).
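A minimal sketch of block-wise index construction and merging in a map-reduce style: each block is mapped to a partial inverted index, and the reduce step unions the postings term by term. The toy documents and set-valued postings are illustrative simplifications.

```python
from collections import defaultdict

def index_block(block):
    """Map step: build a partial inverted index for one block of (doc_id, text) pairs."""
    partial = defaultdict(set)
    for doc_id, text in block:
        for term in text.lower().split():
            partial[term].add(doc_id)
    return partial

def merge_indexes(partials):
    """Reduce step: union the posting sets term by term."""
    merged = defaultdict(set)
    for partial in partials:
        for term, postings in partial.items():
            merged[term] |= postings
    return merged

blocks = [[(1, "web crawling and indexing")], [(2, "indexing at web scale")]]
full_index = merge_indexes(index_block(b) for b in blocks)
# full_index["indexing"] -> {1, 2}
```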
36. Distributing Indexes over Hosts
- At web scale, the entire inverted index can't be held on a single host.
- How to distribute?
  - Split the index by terms.
  - Split the index by documents.
- The preferred method is to split it by docs (!)
  - Each index only points to docs in a specific barrel.
  - Different strategies for assigning docs to barrels.
- At retrieval time
  - Compute the top-k docs from each barrel.
  - Merge the top-k lists to generate the final top-k (a sketch follows below).
  - Result merging can be tricky... so try to punt it.
- Idea
  - Consider putting the most important docs in the top few barrels.
  - This way, we can avoid worrying about the other barrels unless the top barrels don't return enough results.
- Another idea
  - Split the top 20% and the bottom 80% of the doc occurrences into different indexes.
  - Short vs. long barrels.
  - Do the search on the short ones first, and then go to the long ones as needed.
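A sketch of document-partitioned retrieval: score each barrel independently, take its top-k, and merge the per-barrel lists into the final top-k. The barrel representation and the term-count scoring are toy assumptions.

```python
import heapq

def barrel_top_k(barrel, query_terms, k):
    """Score docs in one barrel and return its top-k as (score, doc_id) pairs.
    barrel is a hypothetical {doc_id: {term: tf}} mapping; scoring is a toy tf sum."""
    scored = ((sum(tfs.get(t, 0) for t in query_terms), doc_id)
              for doc_id, tfs in barrel.items())
    return heapq.nlargest(k, scored)

def global_top_k(barrels, query_terms, k):
    """Merge the per-barrel top-k lists into the final top-k."""
    candidates = [hit for b in barrels for hit in barrel_top_k(b, query_terms, k)]
    return heapq.nlargest(k, candidates)

barrels = [{1: {"web": 3}, 2: {"web": 1}}, {3: {"web": 2, "search": 1}}]
print(global_top_k(barrels, ["web", "search"], k=2))   # [(3, 3), (3, 1)]
```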
37. Dynamic Indexing: simplest approach
38. IR for Web Pages
39. Use of Tag Information (1)
- Web pages are mostly HTML documents (for now).
- HTML tags allow the author of a web page to
  - control the display of page contents on the Web,
  - express their emphasis on different parts of the page.
- HTML tags provide additional information about the contents of a web page.
- Can we make use of the tag information to improve the effectiveness of a search engine?
40. Use of Tag Information (2)
A document is indexed not just with its contents, but also with the contents of others' descriptions of it.
- Two main ideas for using tags:
  - Associate different importance to term occurrences in different tags.
  - Use anchor text to index referenced documents (a sketch follows below).
Example: Page 1 contains the anchor text "... airplane ticket and hotel ..." in a link to Page 2 (http://travelocity.com/), so those anchor terms also index Page 2.
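A minimal sketch of anchor-text indexing as described above: each referenced page is indexed under the terms of the anchor text that points to it. The page URLs and the link representation are illustrative.

```python
from collections import defaultdict

def index_with_anchor_text(pages, links):
    """Index each page by its own words plus the anchor text of links pointing to it.
    pages maps url -> page text; links is a list of (source_url, anchor_text, target_url)."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    for _source, anchor_text, target in links:
        for term in anchor_text.lower().split():
            index[term].add(target)          # the referenced page gets the anchor terms
    return index

pages = {"http://page1.example/": "find cheap flights here"}
links = [("http://page1.example/", "airplane ticket and hotel", "http://travelocity.com/")]
index = index_with_anchor_text(pages, links)
# index["hotel"] -> {"http://travelocity.com/"}
```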
41. Google Bombs: The Other Side of Anchor Text
A document is indexed not just with its contents, but also with the contents of others' descriptions of it.
- You can tar someone's page just by linking to it with some damning anchor text.
- If the anchor text is unique enough, then even a few pages linking with that keyword will make sure the page comes up high.
  - E.g., link your SO's page with
    - "my cuddly-bubbly woogums"
    - "Shmoopie" (unfortunately already taken by Seinfeld)
- For more commonplace keywords (such as "unelectable" or "my sweetheart") you need a lot more links.
  - Which, in the case of the latter, may defeat the purpose.
42. Use of Tag Information (3)
- Many search engines use tags to improve retrieval effectiveness.
  - Associating different importance to term occurrences is used in AltaVista, HotBot, Yahoo, Lycos, LASER, SIBRIS.
  - WWWW and Google use terms in anchor tags to index a referenced page.
- Qn: What should the exact weights be for different kinds of terms?
43. Use of Tag Information (4)
- The Webor method (Cutler 97, Cutler 99)
  - Partition HTML tags into six ordered classes:
    - title, header, list, strong, anchor, plain
  - Extend the term frequency value of a term in a document into a term frequency vector (TFV).
  - Suppose term t appears in the ith class tf_i times, i = 1..6. Then TFV = (tf_1, tf_2, tf_3, tf_4, tf_5, tf_6).
  - Example: if for page p the term "binghamton" appears 1 time in the title, 2 times in the headers, and 8 times in the anchors of hyperlinks pointing to p, then for this term in p: TFV = (1, 2, 0, 0, 8, 0).
44. Use of Tag Information (6)
- The Webor method (continued)
  - Challenge: how to find the (optimal) CIV = (civ_1, civ_2, civ_3, civ_4, civ_5, civ_6) such that retrieval performance can be improved the most?
  - One solution: find the optimal CIV experimentally, using a hill-climbing search in the space of CIVs. Details skipped. (A sketch of combining TFV and CIV follows below.)
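A small sketch of how a TFV could be combined with a CIV to weight a term; the dot-product combination and the CIV values are illustrative assumptions (the slides leave the optimal CIV to an experimental hill-climbing search).

```python
TAG_CLASSES = ("title", "header", "list", "strong", "anchor", "plain")

def term_frequency_vector(counts):
    """Build the TFV from a {tag_class: count} mapping; missing classes count as 0."""
    return tuple(counts.get(c, 0) for c in TAG_CLASSES)

def weighted_term_score(tfv, civ):
    """Combine TFV and CIV; a dot product is one simple, illustrative choice."""
    return sum(tf * w for tf, w in zip(tfv, civ))

# The slide's example: the term "binghamton" on page p.
tfv = term_frequency_vector({"title": 1, "header": 2, "anchor": 8})
print(tfv)                                   # (1, 2, 0, 0, 8, 0)
civ = (8.0, 4.0, 2.0, 2.0, 6.0, 1.0)         # hypothetical class importance weights
print(weighted_term_score(tfv, civ))         # 1*8 + 2*4 + 8*6 = 64.0
```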
45. Use of LINK Information: Why?
- Pure query similarity will be unable to pinpoint the right pages because of the sheer volume of pages.
  - There may be too many pages that have the same keyword similarity with the query.
  - The "even if you are one in a million, there are still 300 more like you" phenomenon.
- Web content creators are autonomous/uncontrolled...
  - No one stops me from making a page and writing on it "this is the homepage of President Bush".
- ...and adversarial.
  - I may intentionally create pages with keywords just to drive traffic to my page.
  - I might even use spoofing techniques to show one face to the search engine and another to the user.
- So we need some metrics about the trustworthiness/importance of a page.
  - Let's look at social networks, since these topics have been investigated there.
46. Connection to Citation Analysis
- Mirror, mirror on the wall, who is the biggest computer scientist of them all?
  - The guy who wrote the most papers
    - that are considered important by most people
      - by citing them in their own papers
  - Science Citation Index
  - Should I write survey papers or original papers?
- Infometrics; bibliometrics
47. What the Citation Index Says About Rao's Papers
48. Scholar.google