Personalized Web Spiders - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Personalized Web Spiders
  • Unit 10 of Web Intelligence: Web Information
    Retrieval

2
Introduction
3
Web Spiders
  • Software programs that traverse the WWW
    information space by following hypertext links
    (or other methods) and retrieving Web documents
    by standard HTTP protocol
  • Also known as spiders, crawlers, Web robots, Web
    agents, or webbots
  • Research on spiders began in the early 1990s

4
Web Spider Research
  • Speed and efficiency
  • Increase the harvest speed of a spider
  • Scalability
  • Spider policy
  • The behaviors of spiders and their impact on
    other individuals and the Web as a whole
  • Polite spiders obey robots.txt or the robots
    META tag
  • Information retrieval
  • Spidering algorithms and heuristics to retrieve
    relevant information from the Web more
    effectively
  • General and focused spiders
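
The "polite spider" convention above can be checked with Python's standard library. A minimal sketch, in which the robots.txt content, URLs, and user-agent name are all hypothetical:

```python
# Check a robots.txt exclusion rule with the standard urllib.robotparser.
from urllib.robotparser import RobotFileParser

def allowed_to_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots.txt permits user_agent to fetch url."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())  # parse an already-downloaded robots.txt
    return rp.can_fetch(user_agent, url)

robots = """User-agent: *
Disallow: /private/
"""
print(allowed_to_fetch(robots, "MySpider", "http://example.com/private/a.html"))  # False
print(allowed_to_fetch(robots, "MySpider", "http://example.com/public/b.html"))   # True
```

A polite spider would run such a check (after downloading the site's real robots.txt) before every fetch.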

5
Applications of Web Spiders
  • Personal search
  • Search for Web pages of interest to a particular
    user
  • Client-based: more computational power is
    available for the search process and more
    functionality is possible
  • Building collections
  • Collect Web pages to create the index of any
    search engine
  • Other purposes: building a lexicon, collecting
    e-mail addresses or resumes
  • Archiving
  • Web statistics
  • Total number of servers, average size of an HTML
    document, frequency of 404 errors

6
Analysis of Web Content and Structure
  • How to represent and analyze the content and
    structure of the Web
  • Web spiders rely on such information to guide
    their searches
  • Web analysis techniques
  • Content-based approaches
  • Link-based approaches

7
Content-Based Approaches
  • Analyze HTML content of a Web page to induce
    information about the page
  • Analyze body text to determine if the page is
    relevant to a domain
  • Use indexing techniques to extract the key
    concepts that represent a page
  • Words and phrases that appear in the HTML title
    or headings
  • Domain knowledge (domain-specific terminology)
  • The URL address of a Web page
    http://ourworld.compuserve.com/homepages/LungCancer
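
The content-based signals above (title, headings, URL, body text) can be combined into a simple relevance score. A minimal sketch in which the weights and the set of domain terms are hypothetical:

```python
# Score a page's relevance to a domain by counting domain terms in its
# title, URL, and body, with hypothetical per-field weights.
import re

def relevance_score(title: str, url: str, body: str, domain_terms: set) -> float:
    tokens = lambda s: re.findall(r"[a-z]+", s.lower())
    score = 0.0
    score += 3.0 * sum(t in domain_terms for t in tokens(title))  # title terms weighted highest
    score += 2.0 * sum(t in domain_terms for t in tokens(url))    # URL path often names the topic
    score += 1.0 * sum(t in domain_terms for t in tokens(body))   # body terms weighted least
    return score

terms = {"lung", "cancer", "tumor"}
print(relevance_score("Lung Cancer Support",
                      "http://ourworld.compuserve.com/homepages/LungCancer",
                      "info about lung cancer", terms))
```

A spider would follow links out of pages whose score exceeds some threshold.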

8
Link-Based Approaches
  • If the author of a Web page A places a link to a
    Web page B, then B is presumed to be relevant or
    similar to A, or of good quality
  • In-links: the hyperlinks pointing to a given
    page
  • The larger the number of in-links, the better
    (or more important) a page is considered to be
  • Anchor text (or text nearby) may provide a good
    description of the target page
  • Applications of link-based approaches
  • Web page classification and clustering
  • Identify cyber communities

9
Link-Based Approaches Page Rank
  • Give a link from an authoritative source a
    higher weight than a link from an unimportant
    personal homepage
  • PageRank computes a page's score by weighting
    each in-link to the page proportionally to the
    quality of the page containing the in-link

PR(p) = (1 - d) + d × Σ_{q → p} PR(q) / c(q)
where d is a damping factor between 0 and 1 and
c(q) is the number of outgoing links in q
Massive computation time!!
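
The PageRank score can be computed by simple iteration of the formula above. A minimal sketch on a hypothetical three-page graph (dangling pages and convergence tests are omitted):

```python
# Iterate PR(p) = (1 - d) + d * sum(PR(q) / c(q)) over pages q linking to p.
def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}                 # initial scores
    for _ in range(iterations):
        new = {}
        for p in pages:
            # weight each in-link by the linking page's score and out-degree
            incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = (1 - d) + d * incoming
        pr = new
    return pr

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = pagerank(graph)
print(max(scores, key=scores.get))  # "c" receives the most link weight
```

On the real Web graph this naive loop is far too slow, which is the "massive computation time" noted on the slide.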
10
Link-Based Approaches HITS
  • HITS: Hyperlink-Induced Topic Search
  • Authority pages: high-quality pages related to a
    particular topic or search query
  • A page to which many others point
  • Hub pages: not necessarily authorities
    themselves, but pages that provide pointers to
    authority pages
  • A page that points to many others
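
The hub and authority definitions above are mutually recursive, so HITS computes them by alternating updates. A minimal sketch on a hypothetical four-page graph:

```python
# Alternate the HITS updates: authority scores come from the hubs pointing
# at a page; hub scores come from the authorities a page points to.
def hits(links, iterations=50):
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        hub = {p: sum(auth[t] for t in links[p]) for p in pages}
        # normalize so the scores do not grow without bound
        na = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        nh = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return hub, auth

graph = {"hub1": ["a1", "a2"], "hub2": ["a1", "a2"], "a1": [], "a2": []}
hub, auth = hits(graph)
print(max(auth, key=auth.get) in {"a1", "a2"})   # a1/a2 emerge as authorities
print(max(hub, key=hub.get) in {"hub1", "hub2"}) # hub1/hub2 emerge as hubs
```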

11
PageRank and HITS
[Diagram: pages q1, q2, q3, ..., qi link in to page p,
and p links out to pages r1, r2, r3, ..., rj]
12
Graph Traversal Algorithms
  • The Web can be viewed as a directed graph
  • Uninformed search
  • Breadth-first search and depth-first search
  • Do not make use of any information to guide the
    search process
  • Informed search
  • Some information about each search node is
    available
  • Best-first search
  • Different metrics, such as the number of
    in-links, PageRank score, keyword frequency, and
    similarity to search query, have been used as
    guiding heuristics
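
A best-first spider keeps its frontier ordered by such a guiding heuristic. A minimal sketch using a priority queue, where the link graph and the score function are hypothetical stand-ins for metrics like in-link counts or query similarity:

```python
# Best-first crawl: always expand the highest-scoring unvisited URL next.
import heapq

def best_first_crawl(start, score, neighbors, limit=10):
    """Visit up to `limit` pages in best-first order of `score`."""
    frontier = [(-score(start), start)]   # max-heap via negated scores
    visited, order = set(), []
    while frontier and len(order) < limit:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for nxt in neighbors(url):        # enqueue out-links for later expansion
            if nxt not in visited:
                heapq.heappush(frontier, (-score(nxt), nxt))
    return order

links = {"seed": ["low", "high"], "low": [], "high": ["mid"], "mid": []}
scores = {"seed": 1.0, "low": 0.1, "high": 0.9, "mid": 0.5}
print(best_first_crawl("seed", scores.get, lambda u: links.get(u, [])))
```

With breadth-first search the frontier would be a plain FIFO queue instead; the heuristic ordering is the only difference.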

13
Graph Traversal Algorithms (Cont.)
  • Parallel search
  • Explore the different parts of a search space in
    parallel
  • Spreading-activation algorithm, as used in
    neural networks
  • Genetic algorithm

14
Web Spiders for Personal Search
15
Overview
  • Usually run on the client machine
  • More CPU time and memory can be allocated to the
    search process, and more functionality is
    possible
  • Allow users to have more control and
    personalization options during the search process

16
Personal Web Spiders
  • Allow users to enter keywords, specify the depth
    and width of search for links contained in the
    current homepages displayed, and request the
    spider agent to fetch homepages connected to the
    current homepage
  • Search Web neighborhoods to find relevant pages
    and return a list of links that look promising
  • Show the search results as graphs
  • Perform linguistic analysis and clustering of
    search results
  • Improve search effectiveness by sharing relevant
    search sessions among users (Collaborative
    Spider)

17
Personal Web Spiders (Cont.)
  • Use more advanced algorithms during search
  • Best-first search and genetic algorithms
  • Each URL is modeled as an individual in the
    initial population
  • Crossover: extracting the URLs that are pointed
    to by multiple starting URLs
  • Mutation: retrieving random URLs from Yahoo
  • Locate Web pages relevant to a pre-defined set of
    topics based on example pages provided by the
    user
  • Use a Naïve Bayesian classifier to guide the
    search process
  • Relevance feedback
  • Commercial Web spiders that collect, monitor,
    download, analyze Web documents
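
The Naïve Bayesian guidance mentioned above can be sketched as a two-class text classifier that scores a candidate page before the spider follows its links. The tiny training set and add-one smoothing below are illustrative assumptions, not the actual model of any cited system:

```python
# A two-class Naive Bayes scorer for focused crawling.
import math
from collections import Counter

def train_nb(examples):
    """examples: list of (tokens, label). Returns log-priors and word counts."""
    counts, priors = {}, Counter()
    for tokens, label in examples:
        priors[label] += 1
        counts.setdefault(label, Counter()).update(tokens)
    total = sum(priors.values())
    return {c: math.log(n / total) for c, n in priors.items()}, counts

def nb_score(tokens, priors, counts, label):
    """Log-probability of tokens under `label`, with add-one smoothing."""
    vocab = set().union(*[set(c) for c in counts.values()])
    total = sum(counts[label].values())
    s = priors[label]
    for t in tokens:
        s += math.log((counts[label][t] + 1) / (total + len(vocab)))
    return s

data = [(["lung", "cancer", "treatment"], "relevant"),
        (["football", "score", "league"], "irrelevant")]
priors, counts = train_nb(data)
page = ["cancer", "treatment", "news"]
rel = nb_score(page, priors, counts, "relevant")
irr = nb_score(page, priors, counts, "irrelevant")
print(rel > irr)  # the page scores as relevant, so the spider would follow it
```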

18
Personal Web Spiders (Cont.)
  • Composed of meta spiders
  • Programs that connect to different search engines
    to retrieve search results
  • Use a domain ontology in meta spiders to assist
    users in query formulation
  • Download and analyze all the documents in the
    result set
  • Cluster the search results from multiple search
    engines
  • Hierarchical and graph-based classification
  • P2P Web spiders

19
Case Study
  • The architecture of two search agents enhanced
    with post-retrieval analysis capabilities
  • Competitive Intelligence Spider (CI Spider)
  • Collect Web pages on a real-time basis from the
    Web sites specified by the user
  • Perform indexing and categorization analysis
  • Meta Spider
  • Connect to different search engines and integrate
    the results

20
Architecture of CI Spider and Meta Spider
[Diagram: the user submits a search query through the
User Interface; Search Spiders fetch results from a
general search engine or from user-specified Web
sites; the AZ Noun Phraser extracts key phrases;
selected phrases are fed to the SOM, which produces a
2D topic map]
21
Case Study (Cont.)
  • Components
  • User interface
  • Internet spiders
  • Arizona noun phraser
  • Index the key phrases that appear in each
    document
  • Part-of-speech tagging and linguistic rules
  • Self-organizing map (SOM)
  • A neural network that automatically clusters the
    Web pages into different regions on a 2D map

22
Future of Web Spiders
  • As the amount of dynamic content on the Web
    increases, spiders need to be able to retrieve
    and manipulate dynamic content autonomously
  • Spiders can perform better indexing by applying
    computational linguistic analysis to extract
    meaningful entities, rather than mere keywords,
    from Web pages (Semantic Web)
  • As the quality and credibility of Web pages vary
    considerably, spiders need to use more advanced
    intelligent techniques to distinguish between
    good and bad pages
  • An ideal personal spider should behave like a
    human librarian who tries to understand and
    answer user queries through an autonomous or
    interactive process using natural language

23
Goal-Oriented SOM System for Document Clustering
24
Introduction
  • Document Clustering
  • Assign documents which are similar in some
    aspect into the same group
  • How do we define the similarity? What is the
    rule for assigning documents?
  • Document Clustering with Neural Network
  • Kohonen's self-organizing map (SOM) is a good
    tool for clustering
  • The result can be presented as a map, which is
    more convenient for users to explore

25
Introduction Motivation
  • SOM has long been used to solve the clustering
    problem
  • However, SOM cannot be adapted to a user's
    interests
  • The clustering process is fully automatic, with
    no options for the user
  • The meaning of "Goal-Oriented" SOM
  • Use the user's interests to guide the clustering
    process of SOM
  • Produce a result map that matches the user's
    interests

26
Introduction Scenario
  • A user wants to survey documents about some
    topic on the Web
  • First, he uses a search engine to get the
    matching documents
  • He may get too many documents about the topic
  • He needs an intelligent system to cluster these
    by his interests and thoughts
  • The purpose of this research
  • To design an intelligent system that clusters
    documents in a personalized way

27
Related Work
  • Latent Semantic Analysis
  • Self-Organizing Map

28
Latent Semantic Analysis (LSA)
  • LSI: using Latent Semantic Analysis (LSA) for
    indexing, to better represent the relationships
    between terms or documents
  • LSA: extracting and representing the
    contextual-usage meaning of words by statistical
    computations applied to a large corpus of text.
    The two main components are
  • Singular Value Decomposition (SVD)
  • A form of factor analysis that can extract the
    abstract knowledge of documents into a semantic
    space
  • Dimension Reduction (DR)
  • Makes LSA represent the abstract knowledge of
    documents more precisely

29
LSA Process
[Diagram: matrix X is transformed by SVD (Singular
Value Decomposition) and DR (Dimension Reduction)
into the modified matrix X]
30
LSA An Example
  • An example of text data: titles of some
    technical memos
  • HCI (Human-Computer Interaction)
  • c1: Human machine interface for ABC computer
    application
  • c2: A survey of user opinion of computer system
    response time
  • c3: The EPS user interface management system
  • c4: System and human system engineering testing
    of EPS
  • c5: Relation of user perceived response time to
    error measurement
  • Mathematical graph theory
  • m1: The generation of random, binary, ordered
    trees
  • m2: The intersection graph of paths in trees
  • m3: Graph minors IV: Widths of trees and
    well-quasi-ordering
  • m4: Graph minors: A survey
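
The LSA pipeline can be run on these nine titles: build a term-document matrix, factor it with SVD, and keep only the top singular values (dimension reduction). A minimal NumPy sketch, with the term list restricted to a few index terms for brevity:

```python
# LSA on the nine memo titles: term-document matrix -> SVD -> rank-2 approximation.
import numpy as np

docs = {
    "c1": "human machine interface for abc computer application",
    "c2": "a survey of user opinion of computer system response time",
    "c3": "the eps user interface management system",
    "c4": "system and human system engineering testing of eps",
    "c5": "relation of user perceived response time to error measurement",
    "m1": "the generation of random binary ordered trees",
    "m2": "the intersection graph of paths in trees",
    "m3": "graph minors iv widths of trees and well quasi ordering",
    "m4": "graph minors a survey",
}
terms = ["human", "interface", "computer", "user", "system",
         "survey", "trees", "graph", "minors"]
# X[i, j] = count of term i in document j
X = np.array([[d.split().count(t) for d in docs.values()] for t in terms],
             dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                                          # keep two latent dimensions
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-2 approximation of X

# In the reduced space, documents on the same topic become more similar,
# even when they share no literal terms.
print(X.shape, X_k.shape)
```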

31
LSA An Example
[Matrix example with Dimension Reduction = 2]
32
Self-Organizing Map (SOM)
  • Suppose there are N data points to be clustered
  • Input
  • A quantified representation of each record
  • Each input node is a vector, say X
  • Output
  • K output nodes, usually arranged in a regular 2D
    grid
  • The number of output nodes must be decided
    before the learning process
  • Each output node has a model vector m, whose
    dimension must be the same as the input
    vector's

33
SOM Network Topology
[Diagram: N input nodes, each connected to all K
output nodes in the Kohonen layer]
34
SOM Process
  • Present each input node in order; run several
    iterations, or until the SOM converges
  • An iteration is as follows
  • Suppose we have an input vector X
  • Compute the similarity between X and all output
    nodes' model vectors. Pick the closest output
    node as the winner node
  • Update all output nodes' model vectors according
    to the following rule and the winner node,
    called c(x)
  • After this, the model vectors become more
    similar to the input vector
  • Repeat the above for all other input vectors in
    order

35
SOM Process
  • The model vectors are updated by
    m_i(t+1) = m_i(t) + h_{c(x),i}(t) [X - m_i(t)]
  • h_{c(x),i}(t) decides the speed and width of
    learning
  • (a) Gaussian neighborhood:
    h_{c(x),i}(t) = a(t) exp(-d(c(x),i)² / 2s(t)²)
  • 0 < a(t) < 1, monotonically decreasing
  • s(t) monotonically decreasing
  • (b) Bubble neighborhood: h_{c(x),i}(t) = a(t)
    if d(c(x),i) ≤ R, else 0, where R is called the
    neighbor radius
  • Finally, assign each input vector X to some
    output node as follows
  • Compute the similarity between X and all output
    nodes' model vectors. Assign X to the closest
    one.
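
The iteration above can be sketched in NumPy. The grid size, the decay schedules for a(t) and s(t), and the synthetic two-blob data are all illustrative assumptions:

```python
# A minimal SOM training loop: find the winner c(x), then pull every model
# vector toward the input with a shrinking learning rate and neighborhood.
import numpy as np

rng = np.random.default_rng(0)

def train_som(data, grid=(3, 3), iters=100):
    k = grid[0] * grid[1]
    m = rng.random((k, data.shape[1]))            # model vectors, one per output node
    coords = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])])
    for t in range(iters):
        a = 0.5 * (1 - t / iters)                 # a(t): monotonically decreasing
        sigma = 1.5 * (1 - t / iters) + 0.1       # s(t): shrinking neighborhood width
        for x in data:
            c = np.argmin(((m - x) ** 2).sum(axis=1))      # winner node c(x)
            d2 = ((coords - coords[c]) ** 2).sum(axis=1)   # grid distance to winner
            h = a * np.exp(-d2 / (2 * sigma ** 2))         # Gaussian neighborhood
            m += h[:, None] * (x - m)                      # update rule
    return m

# two well-separated blobs of 2D points
data = np.vstack([rng.normal(0.2, 0.05, (20, 2)), rng.normal(0.8, 0.05, (20, 2))])
m = train_som(data)
# assign each input to its closest model vector (the final clustering step)
assign = [int(np.argmin(((m - x) ** 2).sum(axis=1))) for x in data]
print(len(set(assign)) >= 2)  # the two blobs map to different output nodes
```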

36
SOM Result Map
37
Goal-Oriented SOM Flow Chart
[Flow chart: the user submits query keywords and a
specified goal through the User Interface; the Search
Engine returns the matched Web pages; the Segment
Tool produces document vectors; the LSI tool computes
relations between terms; the Goal-Oriented SOM (a
modified SOM) clusters the documents; the result map
is shown to the user, who can tune arguments and give
relevance feedback]
38
Similarity Function Concept
  • Original SOM
  • Based on some presumed similarity function
  • Goal-Oriented SOM
  • Based on the user's perception of the similarity
    between documents
  • More personalized!

39
Similarity Function Traditional
  • SOM clusters similar data, but how is similarity
    defined?
  • In WEBSOM and Adaptive Search
  • In the vector space model, each dimension
    represents a unique term. In some situations,
    some terms are more important than others
  • For example, in distinguishing between two
    related topics, the discriminating terms will be
    more important than the terms the topics share
  • If we multiply each dimension by a different
    weight, making the weight of the dimension that
    corresponds to a more important term larger,
    then the dimension with the bigger weight
    contributes more when computing similarity
  • Therefore, we can achieve the target:
    clustering by the user's interests

40
Similarity Function GOSOM
  • The modified similarity function in
    Goal-Oriented SOM weights each dimension by the
    importance of its term:
    d(X, m) = sqrt( Σ_i w_i (X_i - m_i)² )
  • w_i is the weight of the i-th term
  • Next, how do we decide the importance of terms?
  • By LSI, we get a matrix of the relationships
    between terms
  • The terms which are closer to the user's goals
    are more important
  • For example, if the relationships between some
    term and the user's three goals are 0.8, 0.2,
    and 0.2, then the weight of that term is 0.8
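
The goal-weighted similarity can be sketched as a weighted Euclidean distance; the vectors and per-term weights below are hypothetical:

```python
# Weighted Euclidean distance: high-weight (goal-relevant) dimensions
# dominate the similarity computation.
import math

def weighted_distance(x, m, w):
    """Euclidean distance with a per-dimension (per-term) weight."""
    return math.sqrt(sum(wi * (xi - mi) ** 2 for xi, mi, wi in zip(x, m, w)))

x  = [1.0, 0.0, 0.0]
m1 = [0.0, 1.0, 0.0]   # differs from x on two high-weight terms
m2 = [0.0, 0.0, 1.0]   # differs from x on one high- and one low-weight term
w  = [0.8, 0.8, 0.2]   # terms 1 and 2 matter to the user's goal, term 3 less

print(weighted_distance(x, m1, w) > weighted_distance(x, m2, w))  # True
```

With all weights equal this reduces to the ordinary SOM distance; the goal weights are the only modification.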

41
Labeling Traditional
  • After SOM learning, the rough clustering is
    done: each output node is a cluster
  • Next, how do we label them?
  • In Adaptive Search
  • Pick the term corresponding to the dimension
    with the largest value as the label of the node
  • If an adjacent node has the same label, merge
    them into a bigger cluster

42
Labeling Our Approach
  • In our approach
  • Using the matrix of relationships between terms
    produced by LSI, we can decide which user goal
    each term belongs to. For example
  • The user's goals are goal1, goal2, and goal3
  • The model vector of some output node is (3.67,
    5.32, 4.11, 7.58)
  • By the term-relationship matrix, we find that
    the term corresponding to the 1st dimension is
    closest to goal1, the 2nd is closest to goal2,
    the 3rd is closest to goal3, and the 4th is
    closest to goal1
  • Then we calculate the sum for each user goal as
    follows
  • goal1: 1st dim + 4th dim = 3.67 + 7.58 = 11.25
  • goal2: 2nd dim = 5.32
  • goal3: 3rd dim = 4.11
  • Thus, assign this output node to the goal1
    cluster
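
The labeling computation above can be sketched directly; the goal names are hypothetical placeholders:

```python
# Label an output node: map each model-vector dimension to its closest user
# goal, sum the entries per goal, and pick the goal with the largest sum.
def label_node(model_vector, dim_to_goal, goals):
    """Return (winning goal, per-goal sums) for one output node."""
    sums = {g: 0.0 for g in goals}
    for value, goal in zip(model_vector, dim_to_goal):
        sums[goal] += value
    return max(sums, key=sums.get), sums

model = [3.67, 5.32, 4.11, 7.58]
dim_to_goal = ["goal1", "goal2", "goal3", "goal1"]  # per-dimension closest goal
label, sums = label_node(model, dim_to_goal, ["goal1", "goal2", "goal3"])
print(label, round(sums["goal1"], 2))  # goal1 11.25, matching the slide's example
```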

43
Labeling Our Approach (Cont.)
  • Considering that some terms are more important
    to a specific goal, we calculate the sum with
    weights. Continuing the previous example
  • By the term-relationship matrix, the importance
    of the term corresponding to the 1st dimension
    for goal1 is 0.63, and that of the 4th dimension
    is 0.48, so the weighted sum for goal1 is
  • 1st dim × 0.63 + 4th dim × 0.48 =
    3.67 × 0.63 + 7.58 × 0.48 = 5.9505

44
Labeling Our Approach (Cont.)
  • In fact, there are some totally unrelated
    documents
  • If every per-goal sum is less than threshold ×
    the original (unweighted) sum, assign this
    output node to "Unrelated"
  • For example
  • Threshold = 0.3
  • The model vector of some output node is (3.67,
    5.32, 4.11, 7.58), so the original sum = 20.68
  • By the method of the previous page, the weighted
    sum for goal1 = 5.9505, for goal2 = 4.0432, and
    for goal3 = 3.7401
  • Because 3.7401 < 4.0432 < 5.9505 < 0.3 × 20.68 =
    6.204, assign this output node to the
    "Unrelated" cluster

45
Result Map
  • Different colors represent different clusters.
    There are 3 clusters: 2 goals and the
    "Unrelated" cluster

46
User Relevance Feedback
  • After clustering, let the user pick the correct
    documents
  • Pick the main terms of those documents (say, 5
    terms)
  • Increase the weights of these main terms
  • Run the SOM again with the modified weights