Title: Personalized Web Spiders
1. Personalized Web Spiders
- Unit 10 of Web Intelligence: Web Information Retrieval
2. Introduction
3. Web Spiders
- Software programs that traverse the WWW information space by following hypertext links (or other methods) and retrieving Web documents via the standard HTTP protocol
- Also known as spiders, crawlers, Web robots, Web agents, Webbots, ...
- Research in spiders began in the early 1990s
4. Web Spider Research
- Speed and efficiency
- Increase the harvest speed of a spider
- Scalability
- Spider policy
- The behavior of spiders and their impact on other individuals and on the Web as a whole
- Polite spiders obey robots.txt or the robots META tag (a minimal politeness check is sketched after this list)
- Information retrieval
- Spidering algorithms and heuristics to retrieve relevant information from the Web more effectively
- General and focused spiders
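A minimal sketch of the "polite spider" check mentioned above, using Python's standard urllib.robotparser; the site URL and user-agent string are made-up examples.

```python
# Minimal "polite spider" sketch: consult robots.txt before fetching a URL.
# The site and user-agent string are made-up examples.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()                                      # download and parse robots.txt

user_agent = "MySpider/0.1"                    # hypothetical crawler name
url = "https://example.com/some/page.html"

if rp.can_fetch(user_agent, url):
    print("robots.txt allows fetching", url)
else:
    print("robots.txt disallows fetching", url)
```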
5. Applications of Web Spiders
- Personal search
- Search for Web pages of interest to a particular user
- Client-based: more computational power is available for the search process and more functionalities are possible
- Building collections
- Collect Web pages to create the index of a search engine
- Other purposes: building a lexicon, collecting e-mail addresses or resumes
- Archiving
- Web statistics
- Total number of servers, average size of an HTML document, number of 404 errors
6. Analysis of Web Content and Structure
- How to represent and analyze the content and structure of the Web
- Web spiders rely on such information to guide their searches
- Web analysis techniques
- Content-based approaches
- Link-based approaches
7. Content-Based Approaches
- Analyze the HTML content of a Web page to induce information about the page
- Analyze the body text to determine whether the page is relevant to a domain
- Use indexing techniques to extract the key concepts that represent a page
- Words and phrases that appear in the HTML title or headings
- Domain knowledge (domain-specific terminology)
- The URL address of a Web page, e.g. http://ourworld.compuserve.com/homepages/LungCancer
8. Link-Based Approaches
- If the author of Web page A places a link to Web page B, then B is presumed to be relevant or similar to A, or of good quality
- In-links: the hyperlinks pointing to a given page
- The larger the number of in-links, the better (or more important) a page is considered to be
- Anchor text (or nearby text) may provide a good description of the target page
- Applications of link-based approaches
- Web page classification and clustering
- Identifying cyber communities
9. Link-Based Approaches: PageRank
- Give a link from an authoritative source a higher weight than a link from an unimportant personal homepage
- PageRank computes a page's score by weighting each in-link to the page proportionally to the quality of the page containing the in-link
- In its simplest form, PR(p) = (1 − d) + d · Σ_{q→p} PR(q) / c(q), where d is a damping factor between 0 and 1 and c(q) is the number of outgoing links of q (an iterative sketch follows this slide)
- Massive computation time!
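The PageRank computation above can be sketched as a simple iterative procedure; the toy link graph, damping factor, and iteration count below are assumptions for illustration, not part of the slides.

```python
# Illustrative PageRank iteration over a tiny hand-made link graph.
def pagerank(out_links, d=0.85, iterations=50):
    """out_links: dict mapping page -> list of pages it links to."""
    pages = list(out_links)
    pr = {p: 1.0 for p in pages}                      # initial scores
    for _ in range(iterations):
        new_pr = {p: (1.0 - d) for p in pages}        # PR(p) = (1 - d) + d * sum(...)
        for q, targets in out_links.items():
            if targets:
                share = d * pr[q] / len(targets)      # d * PR(q) / c(q)
                for p in targets:
                    if p in new_pr:
                        new_pr[p] += share
        pr = new_pr
    return pr

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(graph))
```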
10. Link-Based Approaches: HITS
- HITS: Hyperlink-Induced Topic Search (a small iteration sketch follows this slide)
- Authority pages: high-quality pages related to a particular topic or search query
- A page to which many others point
- Hub pages: not necessarily authorities themselves, but they provide pointers to authority pages
- A page that points to many others
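A small sketch of the HITS mutual-reinforcement iteration implied above; the toy graph and iteration count are assumptions.

```python
# Minimal HITS sketch: authority and hub scores reinforce each other.
import math

def hits(out_links, iterations=50):
    pages = list(out_links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority score: sum of hub scores of the pages that link to it
        auth = {p: sum(hub[q] for q in pages if p in out_links[q]) for p in pages}
        # hub score: sum of authority scores of the pages it links to
        hub = {p: sum(auth[t] for t in out_links[p] if t in auth) for p in pages}
        # normalize so the scores do not blow up
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return auth, hub

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
authorities, hubs = hits(graph)
print(authorities, hubs)
```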
11. PageRank and HITS
[Figure: a page p with incoming links from pages q1, q2, q3, ..., qi and outgoing links to pages r1, r2, r3, ..., rj]
12. Graph Traversal Algorithms
- The Web can be viewed as a directed graph
- Uninformed search
- Breadth-first search and depth-first search
- Do not make use of any information to guide the search process
- Informed search
- Some information about each search node is available
- Best-first search (a minimal crawler sketch follows this slide)
- Different metrics, such as the number of in-links, PageRank score, keyword frequency, and similarity to the search query, have been used as guiding heuristics
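A rough best-first crawler sketch, using keyword frequency as the guiding heuristic; the seed URL, query terms, page limit, and the simple href-matching regex are illustrative assumptions (a real spider would also honour robots.txt and politeness delays).

```python
# Best-first crawl sketch: a priority queue orders frontier URLs by a simple
# keyword-frequency heuristic computed on the parent page.
import heapq
import re
from urllib.request import urlopen
from urllib.parse import urljoin

def score(html, query_terms):
    """Heuristic: how often the query terms occur in the page text."""
    text = html.lower()
    return sum(text.count(t) for t in query_terms)

def best_first_crawl(seed, query_terms, max_pages=10):
    frontier = [(0, seed)]            # (negative score, url): heapq is a min-heap
    visited = set()
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue
        for link in re.findall(r'href="([^"]+)"', html):
            child = urljoin(url, link)
            if child.startswith("http") and child not in visited:
                # a child inherits its parent's score as an estimate until fetched
                heapq.heappush(frontier, (-score(html, query_terms), child))
    return visited

print(best_first_crawl("https://example.com", ["spider", "crawler"]))
```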
13. Graph Traversal Algorithms (Cont.)
- Parallel search
- Explore different parts of a search space in parallel
- Spreading activation algorithm, as used in neural networks
- Genetic algorithms
14. Web Spiders for Personal Search
15. Overview
- Usually run on the client machine
- More CPU time and memory can be allocated to the search process and more functionalities are possible
- Allow users to have more control and personalization options during the search process
16. Personal Web Spiders
- Allow users to enter keywords, specify the depth and width of the search for links contained in the currently displayed homepages, and request the spider agent to fetch homepages connected to the current homepage
- Search Web neighborhoods to find relevant pages and return a list of links that look promising
- Show the search results as graphs
- Perform linguistic analysis and clustering of search results
- Improve search effectiveness by sharing relevant search sessions among users (Collaborative Spider)
17. Personal Web Spiders (Cont.)
- Use more advanced algorithms during the search
- Best-first search, genetic algorithms
- Each URL is modeled as an individual in the initial population
- Crossover: extracting the URLs that are pointed to by multiple starting URLs
- Mutation: retrieving random URLs from Yahoo
- Locate Web pages relevant to a pre-defined set of topics based on example pages provided by the user
- Use a Naïve Bayesian classifier to guide the search process (a small scoring sketch follows this slide)
- Relevance feedback
- Commercial Web spiders that collect, monitor, download, and analyze Web documents
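A sketch of how a Naïve Bayes classifier could score fetched pages for relevance so the spider can prioritize its frontier; the training snippets are invented, and scikit-learn is assumed to be available (the slides do not prescribe a particular library).

```python
# Naive Bayes relevance scoring sketch: train on example pages the user marks
# as relevant/irrelevant, then score newly fetched pages.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = [
    "lung cancer treatment clinical trial",      # relevant examples (invented)
    "cancer symptoms diagnosis oncology",
    "football league match score",               # irrelevant examples (invented)
    "stock market quarterly earnings report",
]
labels = [1, 1, 0, 0]                             # 1 = relevant to the topic

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_texts)
classifier = MultinomialNB().fit(X, labels)

def relevance(page_text):
    """Probability that a page belongs to the relevant class."""
    return classifier.predict_proba(vectorizer.transform([page_text]))[0][1]

print(relevance("new clinical trial for lung cancer patients"))
```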
18. Personal Web Spiders (Cont.)
- Composed of meta spiders
- Programs that connect to different search engines to retrieve search results
- Use a domain ontology in meta spiders to assist users in query formulation
- Download and analyze all the documents in the result set
- Cluster the search results from multiple search engines
- Hierarchical and graph-based classification
- P2P Web spiders
19. Case Study
- The architecture of two search agents enhanced with post-retrieval analysis capabilities
- Competitive Intelligence Spider (CI Spider)
- Collect Web pages on a real-time basis from the Web sites specified by the user
- Perform indexing and categorization analysis
- Meta Spider
- Connect to different search engines and integrate the results
20. Architecture of CI Spider and Meta Spider
[Architecture diagram: the user interface accepts a search query; search spiders fetch pages from user-specified Web sites (CI Spider) or from general search engines (Meta Spider); the search results are passed to the AZ NounPhraser, which extracts key phrases; the selected phrases are fed to the SOM, which produces a 2D topic map]
21. Case Study (Cont.)
- Components
- User interface
- Internet spiders
- Arizona Noun Phraser
- Index the key phrases that appear in each document
- Part-of-speech tagging and linguistic rules
- Self-organizing map (SOM)
- A neural network that automatically clusters the Web pages into different regions on a 2D map
22. Future of Web Spiders
- As the amount of dynamic content on the Web increases, spiders need to be able to retrieve and manipulate dynamic content autonomously
- Spiders can perform better indexing by applying computational linguistic analysis to extract meaningful entities rather than mere keywords from Web pages (Semantic Web)
- As the quality and credibility of Web pages vary considerably, spiders need to use more advanced intelligent techniques to distinguish between good and bad pages
- An ideal personal spider should behave like a human librarian who tries to understand and answer user queries through an autonomous or interactive process using natural language
23. Goal-Oriented SOM System for Document Clustering
24. Introduction
- Document clustering
- Assign documents that are similar in some aspect to the same group
- How do we define the similarity? What is the rule for assigning documents?
- Document clustering with neural networks
- Kohonen's self-organizing map (SOM) is a good tool for clustering
- The result can be presented as a map, which is more convenient for the user to explore
25. Introduction: Motivation
- SOM has been used for a long time to solve the clustering problem
- However, SOM cannot be adapted to the user's interests
- The clustering process is fully automatic, without user options
- The meaning of "Goal-Oriented SOM"
- Use the user's interests to lead the clustering process of SOM
- Produce a result map that matches the user's interests
26. Introduction: Scenario
- A user wants to survey documents about a given topic on the Web
- First, he uses a search engine to get the matching documents
- He may get too many documents about that topic
- He needs an intelligent system to cluster these documents by his interests / thoughts
- The purpose of this research
- To design an intelligent system to cluster documents in a personalized way
27. Related Work
- Latent Semantic Analysis
- Self-Organizing Map
28. Latent Semantic Analysis (LSA)
- LSI: using Latent Semantic Analysis (LSA) for indexing, to better represent the relationships between terms or documents
- LSA: extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text. The two main components are:
- Singular Value Decomposition (SVD)
- A form of factor analysis that can extract the abstract knowledge of documents into a semantic space
- Dimension Reduction (DR)
- Makes LSA represent the abstract knowledge of documents more precisely
29. LSA Process
[Process diagram: term-document matrix X → SVD (Singular Value Decomposition) → DR (Dimension Reduction) → modified matrix X; a NumPy sketch of this process follows]
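A minimal NumPy sketch of the X → SVD → dimension reduction → modified X pipeline; the tiny term-document matrix and the choice of k = 2 retained dimensions are assumptions for illustration.

```python
# LSA process sketch: term-document matrix -> SVD -> keep top-k factors.
import numpy as np

# rows = terms, columns = documents (raw term frequencies, made up)
X = np.array([
    [1, 0, 0, 1, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # singular value decomposition

k = 2                                              # dimension reduction: keep top-k factors
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # modified (rank-k) matrix X

# term-term relationships in the reduced semantic space (used later for weights)
reduced_terms = U[:, :k] * s[:k]                   # term coordinates in the semantic space
unit = reduced_terms / np.linalg.norm(reduced_terms, axis=1, keepdims=True)
term_sims = unit @ unit.T                          # cosine similarity between terms

print(np.round(X_k, 2))
print(np.round(term_sims, 2))
```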
30. LSA: An Example
- Example text data: titles of some technical memos
- HCI (Human-Computer Interaction):
- c1: Human machine interface for ABC computer application
- c2: A survey of user opinion of computer system response time
- c3: The EPS user interface management system
- c4: System and human system engineering testing of EPS
- c5: Relation of user perceived response time to error measurement
- Mathematical graph theory:
- m1: The generation of random, binary, ordered trees
- m2: The intersection graph of paths in trees
- m3: Graph minors IV: Widths of trees and well-quasi-ordering
- m4: Graph minors: A survey
31. LSA: An Example (Cont.)
[Figure: the example term-document matrix after SVD and dimension reduction to 2 dimensions]
32. Self-Organizing Map (SOM)
- Suppose there are N data items to be clustered
- Input
- A quantified representation of each record
- Each input node is a vector, say X
- Output
- K output nodes, usually arranged in a regular 2D grid
- The number of output nodes must be decided before the learning process
- Each output node has a model vector m, whose dimension must be the same as that of the input vector
33. SOM Network Topology
[Figure: N input nodes connected to K output nodes arranged in a 2D grid (the Kohonen layer)]
34. SOM Process
- Present each input node in order; run several iterations, or until the SOM converges
- An iteration is as follows:
- Suppose we have an input vector X
- Compute the similarity between X and all output nodes' model vectors; pick the closest output node as the winner node, called c(X)
- Update all output nodes' model vectors according to the winner node, using the rule m_i(t+1) = m_i(t) + h_{c(X),i}(t) · (X(t) − m_i(t))
- After this, the model vectors become more similar to the input vector
- Repeat the above for all other input vectors in order
35. SOM Process (Cont.)
- The neighborhood function h_{c(X),i}(t) decides the speed and width of learning; two common forms are:
- (a) Gaussian: h_{c(X),i}(t) = α(t) · exp(−‖r_i − r_{c(X)}‖² / (2σ²(t))), where 0 < α(t) < 1 is monotonically decreasing and σ(t) is monotonically decreasing
- (b) Bubble: h_{c(X),i}(t) = α(t) if output node i lies within distance R of the winner c(X), and 0 otherwise, where R is called the neighbor radius
- Finally, assign each input vector X to an output node as follows: compute the similarity between X and all output nodes' model vectors, and assign X to the closest one (a training sketch follows this slide)
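A compact SOM training sketch following the update rule and Gaussian neighborhood above; the grid size, learning-rate schedule, and random data are assumptions.

```python
# Minimal SOM training sketch: m_i(t+1) = m_i(t) + h_{c(x),i}(t) * (x - m_i(t)).
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 5))                     # N = 100 input vectors, 5 terms each
grid_w, grid_h = 4, 4                        # K = 16 output nodes on a 2D grid
M = rng.random((grid_w * grid_h, 5))         # model vectors m_i
coords = np.array([(i, j) for i in range(grid_w) for j in range(grid_h)], float)

iterations = 20
for t in range(iterations):
    alpha = 0.5 * (1 - t / iterations)       # monotonically decreasing learning rate
    sigma = 2.0 * (1 - t / iterations) + 0.5 # monotonically decreasing neighborhood width
    for x in X:
        winner = np.argmin(((M - x) ** 2).sum(axis=1))      # closest model vector c(x)
        d2 = ((coords - coords[winner]) ** 2).sum(axis=1)   # grid distance to winner
        h = alpha * np.exp(-d2 / (2 * sigma ** 2))          # Gaussian neighborhood
        M += h[:, None] * (x - M)                           # update all model vectors

# finally, assign each input vector to its closest output node (its cluster)
assignments = [int(np.argmin(((M - x) ** 2).sum(axis=1))) for x in X]
print(assignments[:10])
```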
36. SOM Result Map
37. Goal-Oriented SOM: Flow Chart
[Flow chart: the user submits query keywords and a specified goal through the user interface; the system requests matching Web pages from a search engine; a segment tool converts the matched Web pages into document vectors; an LSI tool derives the relations between terms, which are used to tune the arguments of the modified (goal-oriented) SOM; the SOM clusters the document vectors and the resulting map is shown to the user, who can give relevance feedback]
38. Similarity Function: Concept
- Original SOM
- Based on some presumed similarity function
- Goal-Oriented SOM
- Based on the user's perception of the similarity between documents
- More personalized!
39. Similarity Function: Traditional
- SOM clusters similar data, but how do we define similarity?
- As in WEBSOM / Adaptive Search
- In the vector space model, each dimension represents a unique term; in some situations, some terms are more important
- For example, in distinguishing ???? from ????, the terms ??? and ??? will be more important than ?? and ??
- If we assign different weights to different dimensions, making the weight of the dimension corresponding to a more important term bigger, then the dimensions with bigger weights contribute more when computing similarity
- Therefore, we can achieve the target: clustering according to the user's interests
40. Similarity Function: GOSOM
- The modified similarity function in Goal-Oriented SOM weights each term dimension when comparing an input vector with a model vector, where w_k is the weight of the k-th term (a minimal sketch follows this slide)
- Next, how do we decide the importance of the terms?
- From LSI we get a matrix of relationships between terms
- The terms that are closer to the user's goals are more important
- For example, if the user's goals are ??, ????, and ??, and the relationships between ??? and them are 0.8, 0.2, and 0.2, then the weight of ??? is 0.8
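A sketch of the goal-weighted similarity: each term dimension is scaled by its weight w_k before comparing an input vector with a model vector. A weighted Euclidean distance is assumed here, since the slides only state that w_k is the weight of the term; the relationship values other than 0.8/0.2/0.2 are made up.

```python
# Goal-weighted similarity sketch for the modified SOM.
import numpy as np

def term_weights(term_goal_rel):
    """term_goal_rel[k][g]: LSI relationship between term k and user goal g.
    A term's weight is its strongest relationship to any goal."""
    return term_goal_rel.max(axis=1)

def weighted_distance(x, m, w):
    """Distance between input vector x and model vector m, weighted per term by w."""
    return np.sqrt(np.sum(w * (x - m) ** 2))

# toy example: 4 terms, 3 user goals (only the first row comes from the slides)
rel = np.array([
    [0.8, 0.2, 0.2],
    [0.1, 0.7, 0.3],
    [0.2, 0.1, 0.9],
    [0.6, 0.2, 0.1],
])
w = term_weights(rel)                       # -> [0.8, 0.7, 0.9, 0.6]
x = np.array([3.0, 1.0, 0.0, 2.0])
m = np.array([2.5, 1.5, 0.5, 2.0])
print(w, weighted_distance(x, m, w))
```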
41. Labeling: Traditional
- After SOM learning, a rough clustering is done: each output node is a cluster
- Next, how do we label them?
- In Adaptive Search:
- Pick the term corresponding to the dimension with the largest value as the label of the node
- If adjoining nodes have the same label, merge them into a bigger cluster
42. Labeling: Our Approach
- In our approach
- Using the matrix of relationships between terms produced by LSI, we can decide which user goal each term belongs to. For example:
- The user's goals are ??, ????, and ??
- The model vector of some output node is (3.67, 5.32, 4.11, 7.58)
- From the term relationship matrix, we find that the term corresponding to the 1st dimension is closest to ??, the 2nd is closest to ????, the 3rd is closest to ??, and the 4th is closest to ??
- Then we calculate the sum for each user goal as follows:
- ??: 1st dim + 4th dim = 3.67 + 7.58 = 11.25
- ????: 2nd dim = 5.32
- ??: 3rd dim = 4.11
- Thus, we assign this output node to the ?? cluster
43. Labeling: Our Approach (Cont.)
- Considering that some terms are more important to a specified goal, we calculate the sum with weights. Continuing the previous example:
- From the term relationship matrix, the importance of the term corresponding to the 1st dimension for ?? is 0.63 and that of the 4th dimension is 0.48, so the weighted sum for ?? is
- 1st dim × 0.63 + 4th dim × 0.48 = 3.67 × 0.63 + 7.58 × 0.48 = 5.9505
44. Labeling: Our Approach (Cont.)
- In fact, some documents are totally unrelated
- If the weighted sum for every user goal is less than threshold × the original (unweighted) sum, then assign the output node to "Unrelated"
- For example:
- Threshold = 0.3
- The model vector of some output node is (3.67, 5.32, 4.11, 7.58), so its total sum is 20.68
- By the method of the previous page, the weighted sum for ?? is 5.9505, for ???? it is 4.0432, and for ?? it is 3.7401
- Because 3.7401 < 4.0432 < 5.9505 < 0.3 × 20.68 = 6.204, we assign this output node to the "Unrelated" cluster (a small labeling sketch follows)
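A sketch of the labeling step with weighted per-goal sums and the "Unrelated" threshold, reproducing the worked example above; the goal names are placeholders (the originals are garbled in the source), and the importance values 0.76 and 0.91 are back-computed from the sums 4.0432 and 3.7401 given in the slides.

```python
# Goal-oriented labeling sketch: weighted sum per goal, or "Unrelated" if all
# goals fall below threshold * total sum of the model vector.
import numpy as np

model_vector = np.array([3.67, 5.32, 4.11, 7.58])
goals = ["goal-1", "goal-2", "goal-3"]       # placeholders for the garbled goal terms
closest_goal = [0, 1, 2, 0]                  # dims 1 and 4 -> goal-1, dim 2 -> goal-2, dim 3 -> goal-3
importance = [0.63, 0.76, 0.91, 0.48]        # 0.63 and 0.48 are from the slides; the others are back-computed
threshold = 0.3

def label_node(m, closest_goal, importance, threshold):
    sums = {g: 0.0 for g in goals}
    for k, value in enumerate(m):
        sums[goals[closest_goal[k]]] += value * importance[k]   # weighted sum per goal
    if max(sums.values()) < threshold * m.sum():                # every goal is too weak
        return "Unrelated", sums
    return max(sums, key=sums.get), sums

# -> ('Unrelated', {'goal-1': 5.9505, 'goal-2': 4.0432, 'goal-3': 3.7401})
print(label_node(model_vector, closest_goal, importance, threshold))
```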
45. Result Map
- Different colors represent different clusters; there are 3 clusters: the 2 goals and "Unrelated"
46. User Relevance Feedback
- After clustering, let the user pick the right documents
- Pick the main terms of the right documents, say the top 5
- Increase the weights of these main terms
- Run the SOM again with the modified weights (a minimal sketch follows)
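A sketch of the relevance-feedback loop: boost the weights of the main terms of the documents the user marked as right, then re-run the SOM with the modified weights. The boost factor, the top-N selection, and the train_som() helper are assumptions not specified in the slides.

```python
# Relevance-feedback sketch: boost the main terms of user-approved documents.
import numpy as np

def feedback_weights(weights, term_doc_matrix, relevant_doc_ids, top_n=5, boost=1.5):
    """Return a new weight vector with the relevant documents' main terms boosted."""
    new_w = weights.copy()
    # sum term frequencies over the documents the user judged relevant
    term_scores = term_doc_matrix[:, relevant_doc_ids].sum(axis=1)
    main_terms = np.argsort(term_scores)[-top_n:]     # the top-N main terms
    new_w[main_terms] *= boost                        # add weight to those terms
    return new_w

# toy data: 6 terms x 4 documents; the user marked documents 0 and 2 as right
X_td = np.array([
    [2, 0, 1, 0],
    [0, 1, 0, 1],
    [3, 0, 2, 0],
    [0, 2, 0, 1],
    [1, 0, 1, 0],
    [0, 0, 0, 2],
])
weights = np.ones(6)
new_weights = feedback_weights(weights, X_td, [0, 2], top_n=3)
print(new_weights)

# the modified weights would then be fed back into the goal-oriented SOM run,
# e.g. som_map = train_som(X_td, new_weights)   # train_som is a hypothetical helper
```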