Title: Collaborative Search
1. Collaborative Search
2.
- Traditional IR
- Web search
- Crawlers
  - parallel crawler
  - intelligent crawler
- Collaborative Search
- References
3. Traditional IR
[Diagram: the classic IR model. System side: acquisition (documents, objects), representation (indexing), database of indexed documents, matching (searching). User side: problem (information need), representation (question), query (search formulation), feedback over the retrieved objects.]
4. Classic Information Retrieval
- Homogeneous documents
- Well categorized
- Small, well-controlled collection
- Closed, static environment
- Controlled collection growth
5. Web Search
- Web
  - open, dynamic environment
  - vast, uncontrolled collection of PAGES
- Web page
  - heterogeneous: various formats, languages
  - content may change over time!
- Importance of LINKS
- Existing search facilities
  - Generic: Yahoo, AskJeeves, Google, etc.
  - Specialized: Pluribus, Collaborative Spider
6. Common operations
- Indexing
  - identifies potential index terms in documents
- Query processing
  - forms keywords
- Search
  - accesses the indexed file
- Ranking
7. Ranking
- Ranking is important
- Factors which influence rank:
  - term location or frequency
  - proximity to query terms
  - date of publication
  - length
  - popularity
- Heuristics: proper nouns may have higher weights
- WWW: link analysis / popularity (e.g., Google); a toy scoring sketch follows
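A minimal sketch of how such factors might be combined into one score. The weights, field names, and documents are invented for illustration; no real engine's formula is implied.

def rank_score(doc, query_terms):
    tokens = doc["text"].lower().split()
    score = 0.0
    for term in (t.lower() for t in query_terms):
        score += tokens.count(term)      # term frequency
        if term in tokens[:10]:          # term location: early occurrence
            score += 2.0
    score += 0.1 * doc.get("popularity", 0)   # link-based popularity
    return score

docs = [
    {"text": "Collaborative search with parallel web crawlers", "popularity": 8},
    {"text": "Generic portal news", "popularity": 40},
]
best_first = sorted(docs, key=lambda d: rank_score(d, ["collaborative", "search"]),
                    reverse=True)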
8. The Web: indexing
- Web pages are heterogeneous documents
- They contain both text information and meta-information
- External meta-information can be inferred
- They must be processed before their pertinence can be established
9. Indexing WWW documents
- Web pages require preprocessing to get a uniform data structure
  - normalizes the document stream to a predefined format
  - breaks the document stream into desired retrievable units
  - isolates and metatags subdocument pieces
[Diagram: web pages 1..n pass through preprocessing into a uniform format.]
10. Computing weights
- Assign a weight to each descriptor of a document and add it to the index
- Weights are based on:
  - term frequency within the document (tf)
  - global term frequency within the corpus
- The global frequency will be a problem when using parallel, independent agents to do the indexing (see the sketch below)
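A minimal tf-idf sketch, one common way to combine the two frequencies above. Note that the idf factor needs the global document frequency, which is exactly what independent parallel indexing agents cannot observe locally.

import math
from collections import Counter

def tfidf(documents):
    # documents: list of token lists; returns a {term: weight} dict per doc
    n = len(documents)
    df = Counter(term for doc in documents for term in set(doc))
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in documents]

docs = [["web", "crawler", "crawler"], ["web", "cache"], ["web", "agent"]]
print(tfidf(docs)[0])  # 'web' occurs in every document, so its weight is 0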
11. IR on the Web
[Diagram: crawlers fetch web pages from the Web; a document processor and page ranking build the indexed files; a query processor matches queries against the index and returns responses, which the user browses.]
12. Web document discovery
- The corpus is very large
- Dynamic
- Open
- Documents must be discovered
- → use a Web crawler
13. Web Crawler
- What is a crawler?
[Diagram: the crawler loop. init: initial URLs → get next URL (from the scheduled URLs) → get page from the Web (recording visited URLs) → extract URLs from the fetched web pages → back to the scheduled URLs.]
14. Parallel Crawler
- Advantages
  - Faster
  - Imperative for large-scale crawling
  - Can be run on cheaper machines
  - Network load dispersion
  - Network load reduction
[Diagram: Crawler1, Crawler2, ..., CrawlerN download web pages from the Web into a shared collection.]
Parallel Crawlers, Junghoo Cho et al., University of California, WWW2002, Honolulu, Hawaii, USA
15. Evaluation Metrics
- Overlap
  - 1 - (# of unique pages downloaded / # of pages downloaded by the team of crawlers)
- Coverage
  - # of pages downloaded by the parallel crawler / total # of reachable pages
- Communication overhead
  - # of exchanged messages / # of page downloads
(A sketch of these three metrics follows.)
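A sketch of the three metrics, computed from per-crawler download logs (lists of fetched URLs); the function and variable names are my own, not from Cho et al. [2].

def overlap(logs):
    total = sum(len(log) for log in logs)
    unique = len(set().union(*logs))
    return 1 - unique / total

def coverage(logs, reachable):
    return len(set().union(*logs)) / len(reachable)

def comm_overhead(messages_exchanged, logs):
    return messages_exchanged / sum(len(log) for log in logs)

logs = [["a", "b", "c", "g"], ["f", "g", "h", "i"]]  # 'g' downloaded twice
print(overlap(logs))                      # 1 - 7/8 = 0.125
print(coverage(logs, set("abcdefghi")))   # 7/9
print(comm_overhead(2, logs))             # 2/8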
16. Assignment of search areas
- Partitioning the Web
  - Address division: .net, .ca, UdeM.ca
  - Topic
- Static assignment (see next page)
- Dynamic assignment (see multi-agent collaborative search)
17. Partition function
- There is a multitude of ways to partition the web:
  - Site-hashing: based on the hash value of the site name of a URL (see the sketch below)
  - URL-hashing: based on the hash value of the whole URL
  - Hierarchical: partition the web hierarchically based on the URLs of the pages
- Partitioning will come up again with Agents!
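A minimal site-hashing partitioner: a URL is assigned to the crawler that owns the hash of its host name, so a whole site stays with one crawler. The crawler count and URLs below are made up.

import hashlib
from urllib.parse import urlparse

def responsible_crawler(url, n_crawlers):
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode()).hexdigest()
    return int(digest, 16) % n_crawlers

for url in ("http://www.udem.ca/a.html",
            "http://www.udem.ca/b.html",
            "http://example.net/c.html"):
    print(url, "-> crawler", responsible_crawler(url, 3))
# both www.udem.ca pages land on the same crawler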
18. Crawling modes (examples)
- Firewall mode, cross-over mode, exchange mode
- Site1 (Crawler1), Site2 (Crawler2)
[Diagram: example link graph; Site1 contains pages a, b, c, d, e and Site2 contains pages f, g, h, i.]
Parallel Crawlers, Junghoo Cho et al., University of California, Los Angeles, WWW2002, Honolulu, Hawaii, USA
19. Firewall mode: download within partitions
- Crawler1: a→b, a→c
- Crawler2: f→g, g→h, g→i
[Diagram: the same Site1/Site2 link graph; each crawler follows only links inside its own partition.]
20. Cross-over mode: download between partitions
- Crawler1: a→b, a→c, a→g, g→h, h→d, d→e, g→i
- Crawler2: f→g, g→h, g→i, h→d, d→e
[Diagram: the same Site1/Site2 link graph; crawlers also follow links into the other partition, duplicating work.]
21. Exchange mode: download within partitions, exchange info
- Crawler1: a→b, a→c, then sends g to Crawler2
- Crawler2: f→g, g→h, g→i, then sends d to Crawler1
[Diagram: the same Site1/Site2 link graph; inter-partition links are forwarded to the responsible crawler instead of being followed.]
22. Minimizing communication in exchange mode
- Batch communication
- Allow replication:
  1) links to pages follow a Zipf distribution (the 20-80 rule)
  2) so replicate some popular URLs at each crawler (see the sketch below)
[Plot: Zipf distribution of incoming links per page.]
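A sketch of the replication trick: because in-link counts are Zipf-like, a small replicated set of the most-linked URLs absorbs most would-be exchanges. The counts and the threshold k are illustrative assumptions.

from collections import Counter

def replication_set(inlink_counts, k):
    # inlink_counts: {url: number of incoming links seen so far}
    return {url for url, _ in Counter(inlink_counts).most_common(k)}

inlinks = {"hub.com": 900, "popular.org": 300, "page1": 3, "page2": 1}
replicated = replication_set(inlinks, 2)

def must_forward(url, owner, me):
    # only non-replicated URLs owned by another crawler need a message
    return owner != me and url not in replicated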
23. Evaluating quality
- We want important pages
- Quality measure: |Pages ∩ Top_k| / |Top_k|
  - Pages: the k pages downloaded
  - Top_k: the top k most important pages
- Indication of importance: backlink count (sketch below)
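A sketch of the quality measure with backlink count as the importance proxy; the page names and counts are illustrative.

def quality(downloaded, backlinks, k):
    top_k = set(sorted(backlinks, key=backlinks.get, reverse=True)[:k])
    return len(set(downloaded) & top_k) / k

backlinks = {"a": 50, "b": 40, "c": 30, "d": 2, "e": 1}
print(quality(["a", "c", "e"], backlinks, 3))  # got 2 of the top 3 -> 0.67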
24. Comparison [2]
- From the experiments [2]:
  1) firewall mode: keep the number of parallel crawlers < 4, otherwise quality suffers
  2) exchange mode: small network traffic, maximizes quality
  3) replicating between 10,000 and 100,000 (sic) popular URLs reduces communication overhead by 40%

Mode        Coverage  Overlap  Quality  Comm. overhead
Firewall    Bad       Good     Bad      Good
Cross-over  Good      Bad      Bad      Good
Exchange    Good      Good     Good     Bad
25. Intelligent crawling
- Indiscriminate crawlers (e.g., for Google)
  - any new page is good
- Topic-oriented crawlers
  - e.g., calls for tenders
  - we just want new pages on a topic of interest
- → Intelligent crawler
Intelligent Crawling on the WWW with Arbitrary Predicates, C. Aggarwal et al., IBM T.J. Watson Res. Ctr., WWW10, Hong Kong, 2001
26. Focused Crawling
- Which node to explore next?
- Depth-first? Breadth-first?
- Best-first! But what is best?
- Focused crawling is best; how to establish focus?
  - linkage locality
  - sibling locality
[Diagram: linkage locality: pages on topic X tend to link to other pages on topic X; sibling locality: if several children of a page are on topic Y, its remaining children are likely on topic Y too.]
27. Focused Crawling
- Objective: given a specific query, find
  - good sources of content (authorities): many links TO them
  - good sources of links (hubs): many links FROM them
- Given an arbitrary query, can we auto-focus?
  - learning capability
  - learning model
28. Learning Model
- Analyze links from pages on the search periphery
- Learn how to pick good links to follow
[Diagram: hyperlinks lead from visited web pages, through the crawl frontier, to pages still to visit.]
29. Learning Model
- Clues based on:
  - content
  - URL tokens
  - linkage info
  - sibling structure
- Different needs require different learning:
  - the crawler needs to learn during the crawl
  - learning information should be reusable
- The crawler should be intelligent
30. Intelligent Crawling
- Priority list of URLs to be explored (PList)
- User-defined predicate to compute the interest of a page (the processed query)
- KB: knowledge base
31. Intelligent Crawling
Algorithm Intelligent-Crawler()
Begin
  Priority-List (PList) <- Starting Seeds
  While not (termination) do
  begin
    Reorder URLs on PList using KB
    Drop unimportant items from PList
    W <- pop the first element on PList
    Fetch the Web page W
    Parse W and add all the outlinks in W to PList
    If W satisfies the user-defined predicate, then store W
    Update KB using content and link information for W
  end
End
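A Python rendering of the pseudocode above; a sketch, not Aggarwal et al.'s implementation. fetch, parse_outlinks, predicate, score (the KB's interest estimate) and update_kb are caller-supplied stubs, and a heap keyed on the score at push time stands in for the explicit reorder step.

import heapq

def intelligent_crawler(seeds, score, fetch, parse_outlinks, predicate,
                        update_kb, max_pages=1000):
    # PList as a max-heap: most interesting URL (per the KB) first
    plist = [(-score(u), u) for u in seeds]
    heapq.heapify(plist)
    seen, stored = set(seeds), []
    while plist and len(stored) < max_pages:
        _, url = heapq.heappop(plist)        # W <- first element of PList
        page = fetch(url)
        for link in parse_outlinks(page):    # add all outlinks to PList
            if link not in seen:
                seen.add(link)
                heapq.heappush(plist, (-score(link), link))
        if predicate(page):                  # store W if it satisfies the
            stored.append(page)              # user-defined predicate
        update_kb(page)                      # update KB with content/links
    return stored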
32. Intelligent Crawler
- During the crawling process, we can accumulate some information, like:
  - the number of URLs crawled, N1
  - the number of URLs crawled which satisfy the predicate, N2
  - pages in which word i occurs which satisfy the predicate, N3
  - pages with a keyword in the URL which satisfy (or do not satisfy) the predicate
- How to create a KB?
- A later example will illustrate URL-based learning
33. Intelligent Crawler
Example: the user is interested in "online malls", BUT only 0.1% of web pages contain "online malls". HOWEVER, if the word "eshop" is in the URL, then the probability of the page containing "online malls" is 5%. Thus we should add to the KB the fact that "eshop" in the URL is a useful criterion in choosing pages to explore.
34. Formal view
- C: the event that a crawled web page satisfies the given predicate
- P(C): the probability of event C; P(C) = N2 / N1
- E: a fact that we know about a candidate URL
- Knowledge of the event E may increase the probability P(C):
  P(C|E) = P(C ∧ E) / P(E)
- Calculate the interest ratio for the event C given event E as:
  IR(C,E) = P(C|E) / P(C) = P(C ∧ E) / (P(C) · P(E))
- The values of P(C ∧ E) and P(E) can be calculated during the crawl
from Intelligent Crawling on the WWW with Arbitrary Predicates, C. Aggarwal et al.
35. Mall example
- Example
  - 0.1% of web pages contain "online malls", i.e. satisfy the predicate (P(C))
  - if the word "eshop" occurs in the URL (E), the probability P(C|E) of satisfying it increases to 5%
  - so the interest ratio is 5 / 0.1 = 50
  - IR(C,E) = P(C|E) / P(C)
(A small bookkeeping sketch follows.)
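A bookkeeping sketch for IR(C,E): accumulate the counts during the crawl and compute P(C ∧ E) / (P(C) · P(E)). The class and method names are my own.

class InterestRatio:
    def __init__(self):
        self.n = self.n_c = self.n_e = self.n_ce = 0
    def observe(self, satisfies_predicate, event_holds):
        self.n += 1
        self.n_c += satisfies_predicate          # pages satisfying C
        self.n_e += event_holds                  # pages where E holds
        self.n_ce += satisfies_predicate and event_holds
    def value(self):
        p_c, p_e = self.n_c / self.n, self.n_e / self.n
        return (self.n_ce / self.n) / (p_c * p_e)

# equivalently, from the slide's numbers: IR = P(C|E) / P(C) = 0.05 / 0.001 = 50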
36. Collaborative Search
- 3 ways to search for information:
  - browsing, querying and filtering
- Collaboration types [10]:
  - collaborative browsing
  - mediated searching
  - collaborative information filtering
  - collaborative agents
  - collaborative reuse of results
37. Collaborative Search
- What do we mean by collaboration?
  - Human → computer → Human
  - Human ↔ Computer
  - Computer agent ↔ Computer agent
38. Collaborative Search
- Man - machine
  - Collaborative browsing --- Ariadne system [23]
  - Collaborative reuse of results --- Pluribus [21] (2000)
  - Collaborative information filtering --- collaborative filtering [25]
  - Mediated searching --- DIAMS [22] (2000)
- Machine - machine (→ collaborative agents)
  - meta-search engines: MetaCrawler, Mamma, Metagopher, Copernic
  - topic-oriented collaborative crawler [11] (2002)
  - Collaborative Spider [16] (2002)
  - UbiCrawler [5] (2003)
  - Collaborator [19] (under development)
39. Existing systems
- meta-search engines
  - MetaCrawler, Mamma, Metagopher, Copernic
  - pass the query to other search engines
  - collect the results from the other search engines
  - combine the results for the user
40. Topic-oriented collaborative crawlers [11] (2002)
- Each crawler is given a specific topic
- It knows the topics of its colleagues
- It sends URLs of pages it doesn't care about to the crawler responsible for that topic
- Problems
  - static, predefined topic categories
  - static assignment partition function
  - a controller assigns sites to each crawler
41. Collaborative spiders [16] (2002)
- JATLite (Java Agent Template Lite)
- uses KQML
- user agents, ONE scheduler agent
- collaborator agent (as a mediator)
- search, content mining, post-retrieval analysis system
- groups of users sharing information
42. UbiCrawler [5] (2003)
- consistent hashing as the partition function: buckets are agents, keys are hosts
- failure detector: the only synchronous component
- each agent keeps track of the visited URLs in a hash table
- pure Java application, RMI based, multi-threaded agents
43. Collaborator [19] (under development)
- a shared workspace framework for virtual teams
- 3-tier architecture, J2EE + Agents (BlueJADE)
- client tier, middle tier, enterprise information systems tier
- personal agents, session management agents
- desktop or wireless device
- JADE, FIPA
44. Conclusion
- Current collaborative search is:
  - collaborative
  - dynamic
  - adaptive in its exploring
  - intelligent
  - decentralized
- Trend → Agents
45. Multi-agent collaborative search
- Challenges?
[Diagram: a query is dispatched to agent_1, agent_2, ..., agent_n; each agent explores the Web and maintains its own DataStore.]
46. Challenges
- Dynamic partitioning? Dynamically assign the web domain to the agents
- Load balancing? Each cache stores roughly the same # of pages
- Content lookup? An agent can easily locate the storage holding particular content
- Solution: Web Cache + Consistent Hashing
47. Web Caching
- Content (URL -> content)
  - for download efficiency
- Indexing information (keyword -> URL)
  - for search efficiency
48. Browser caching
1. For efficiency
2. Each client has its own cache
[Diagram: clients, each with a private cache, accessing www.abc.com.]
49. Proxy caches
1. Each cache stores a subset of all pages
2. Each client knows several caches
[Diagram: clients share domain caches in front of www.abc.com.]
50. Agents' web cache
[Diagram: users interact with agents; the agents communicate with one another, and each manages a Web cache fed from the Web.]
51. Content Look-Up
- Summary cache
- Distributed hash
- Consistent hash
  - also achieves load balancing
  - and dynamic partitioning
52. Summary cache
- Each cache knows the content of all the others (sketch below)
[Diagram: C1 holds {A, B, C} and records that C2 holds {D, E} and C3 holds {F, G}; C2 and C3 keep the symmetric summaries. A client asking "F?" is directed to C3.]
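A minimal summary-cache lookup: each cache holds a summary of every other cache's keys, so any node can answer "who has F?" locally. Real systems use compact summaries (e.g., Bloom filters); plain sets keep the sketch short.

summaries = {"C1": {"A", "B", "C"}, "C2": {"D", "E"}, "C3": {"F", "G"}}

def locate(key):
    # any node can answer locally from its summaries, without broadcasting
    for cache, keys in summaries.items():
        if key in keys:
            return cache
    return None

print(locate("F"))  # -> C3, as in the figure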
53. Distributed hashing
- Distribute the work amongst many agents
- Efficient, O(1), determination of the agent responsible for a given keyword or URL
- Problem: redistribution of data when the number of agents changes
- Solution: consistent hashing →
54. Consistent Hashing
- Use a standard hash function H to map
  - items 1, 2, 3, 4, 5 and agents A, B to a unit circle
- Map each item to the closest cache
  - A holds 1, 2, 3
  - B holds 4, 5
[4] Web Caching with Consistent Hashing, David Karger et al., MIT Lab
55. Consistent Hashing
- To add a new agent C, hash the agent id
- Move the items close to it
- Other items don't move
  - A holds 3
  - B holds 4, 5
  - C holds 1, 2
- this example will be reused under "partition dynamic"
56. Consistent Hashing
- Designed features
  - Load balancing: each bucket stores roughly the same # of pages
  - Content lookup: easily locate a given key via the hash function H
  - Smoothness: little impact on hash bucket contents when buckets are added/removed
- Applications of consistent hashing
  - Freenet [6], UbiCrawler [5]
(A minimal ring sketch follows.)
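A minimal consistent-hash ring (no virtual nodes), matching the unit-circle picture of slides 54-55: agents and items hash onto a ring, and an item belongs to the first agent at or after its point. Names and items are illustrative.

import bisect, hashlib

def h(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, agents):
        self.points = sorted((h(a), a) for a in agents)
    def owner(self, item):
        i = bisect.bisect([p for p, _ in self.points], h(item))
        return self.points[i % len(self.points)][1]  # wrap around the circle
    def add(self, agent):
        bisect.insort(self.points, (h(agent), agent))

ring = Ring(["A", "B"])
items = ["1", "2", "3", "4", "5"]
before = {i: ring.owner(i) for i in items}
ring.add("C")
moved = [i for i in items if ring.owner(i) != before[i]]
print(moved)  # only the items nearest to C change owner (smoothness)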
57. Partition dynamic
- Suppose, in the above example, that items 1, 2, 3, 4, 5 are the site names of scheduled URLs and at first only agents A and B explore the web; the partition looks like:
  - Agent_A: 1, 2, 3
  - Agent_B: 4, 5
[Diagram: sites 1-5 of the Web split between Agent_A and Agent_B.]
58. Partition dynamic
- After adding the new agent C, the web domain is reassigned to the agents like this:
  - Agent_A: 3
  - Agent_B: 4, 5
  - Agent_C: 1, 2
[Diagram: the same sites, now split among Agent_A, Agent_B and Agent_C.]
59. Concrete model
- Multiagent layer
  - a general agent paradigm is not practical
- Agent types
  - interface agent
  - collector agent
  - information agent
- Agent functionality
  - the interface agent interactively collects query information from the user
  - the collector agent collects information, forms the plan, composes the results
  - the information agent does focused crawling with the plan and builds the indexed files
60. Concrete model
[Diagram: users 1..n send queries to interface agents 1..k; collector agents 1..j work collaboratively; info agents 1..m crawl, each with its own database; answers flow back up to the users.]
61. Concrete model
[Diagram: infoAgent_1 ... infoAgent_n communicate with one another; each contains a crawler working on the Web, a document processor, indexing, a KB and local storage.]
62. References
- [1] How a Search Engine Works, Elizabeth Liddy, School of Information Studies, Syracuse University. http://www.infotoday.com/searcher/may01/liddy.htm
- [2] Parallel Crawlers, Junghoo Cho and Hector Garcia-Molina. http://dbpubs.stanford.edu:8090/pub/2002-9
- [3] Mercator: A Scalable, Extensible Web Crawler. http://research.compaq.com/SRC/mercator/papers/www/paper.html
- [4] David Karger, Tom Leighton, Danny Lewin, and Alex Sherman. Web Caching with Consistent Hashing. In Proc. of the 8th International World Wide Web Conference, Toronto, Canada, 1999.
- [5] UbiCrawler: A Scalable Fully Distributed Web Crawler (2003). http://ausweb.scu.edu.au/aw02/papers/refereed/vigna/paper.html
- [6] Freenet: A Distributed Anonymous Information Storage and Retrieval System. http://citeseer.nj.nec.com/clarke00freenet.html
63.
- [7] Looking Up Data in P2P Systems, Hari Balakrishnan et al. http://www.utsc.utoronto.ca/rosselet/cscd58s/tut03/pres03/p2p-lookups.pdf
- [8] Web Caching, Ion Stoica. www.cs.berkeley.edu/istoica/cs268/notes/lecture21.pdf
- [9] The Effects of Cooperation on Multiagent Search in Task-Oriented Domains. http://citeseer.nj.nec.com/557884.html
- [10] Collaborative Search and Retrieval: Finding Information Together. https://doc.telin.nl/dscgi/ds.py/Get/File-8269/GigaCE-Collaborative_Search_and_Retrieval__Finding_Information_Together.pdf
- [11] Topic-Oriented Collaborative Crawling, Chiasen Chung and Charles L.A. Clarke. http://citeseer.nj.nec.com/538331.html
- [12] Intelligent Crawling on the World Wide Web with Arbitrary Predicates (2001). http://citeseer.nj.nec.com/aggarwal01intelligent.html
- [13] Scaling Question Answering to the Web, Cody Kwok et al. http://www10.org/cdrom/papers/120/
64.
- [14] The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998). http://citeseer.nj.nec.com/brin98anatomy.html
- [15] Design and Evaluation of a Multi-Agent Collaborative Web Mining System (2003). http://citeseer.nj.nec.com/chau03design.html
- [16] Text-Learning and Related Intelligent Agents (1999), Dunja Mladenic. http://citeseer.nj.nec.com/mladenic99textlearning.html
- [17] Text Learning and Related Intelligent Agents, Dunja Mladenic. http://www.cs.cmu.edu/TextLeauning/pww/
- [18] Coordination of Multiple Intelligent Software Agents, K. Sycara and D. Zeng. http://www.cs.cmu.edu/softagents/publications.html
- [19] Enhancing Collaborative Work through Agents, F. Bergenti et al. http://www-dii.ing.unisi.it/aiia2002/paper/AGENTI/bergenti-aiia02.pdf
- [20] Agents that Reduce Work and Information Overload, Pattie Maes. http://www.cs.brandeis.edu/cs125a/content/agentsmaes.doc
- [21] Collaboratively Searching the Web: An Initial, Agustin Schapira. http://none.cs.umass.edu/schapira/thesis/report/
65.
- [22] Collaborative Information Agents on the World Wide Web, James R. Chen. http://ic.arc.nasa.gov/ic/projects/aim/papers/dl98.pdf
- [23] Collaborative Browsing and Visualisation of the Search Process. http://www.comp.lancs.ac.uk/computing/research/cseg/projects/ariadne/docs/elvira96.html
- [24] Collaborative Design That Used the Shared Cognitive Space. www.jaist.ac.jp/library/thesis/is-master-2002/paper/t-kizaki/abstract.ps
- [25] Collaborative Filtering, Berkeley Workshop. http://www.sims.berkeley.edu/resources/collab/
66. Thanks!