Personalized Web Spiders - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Personalized Web Spiders
  • Unit 10 of Web Intelligence: Web Information
    Retrieval

2
Introduction
3
Web Spiders
  • Software programs that traverse the WWW
    information space by following hypertext links
    (or other methods) and retrieving Web documents
    by standard HTTP protocol
  • Also known as spiders, crawlers, Web robots, Web
    agents, or webbots
  • Research on spiders began in the early 1990s

4
Web Spider Research
  • Speed and efficiency
  • Increase the harvest speed of a spider
  • Scalability
  • Spider policy
  • The behaviors of spiders and their impact on
    other individuals and the Web as a whole
  • Polite spiders obey robots.txt or the robots
    META tag
  • Information retrieval
  • Spidering algorithms and heuristics to retrieve
    relevant information from the Web more
    effectively
  • General and focused spiders
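
The "polite spider" convention above can be checked with Python's standard library. A minimal sketch, in which the robots.txt content, URLs, and user-agent name are all hypothetical:

```python
# Check a robots.txt exclusion rule with the standard urllib.robotparser.
from urllib.robotparser import RobotFileParser

def allowed_to_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots.txt permits user_agent to fetch url."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())  # parse an already-downloaded robots.txt
    return rp.can_fetch(user_agent, url)

robots = """User-agent: *
Disallow: /private/
"""
print(allowed_to_fetch(robots, "MySpider", "http://example.com/private/a.html"))  # False
print(allowed_to_fetch(robots, "MySpider", "http://example.com/public/b.html"))   # True
```

A polite spider would run such a check (after downloading the site's real robots.txt) before every fetch.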

5
Applications of Web Spiders
  • Personal search
  • Search for Web pages of interest to a particular
    user
  • Client-based: more computational power is
    available for the search process and more
    functionality is possible
  • Building collections
  • Collect Web pages to create the index of any
    search engine
  • Other purposes: building a lexicon, collecting
    e-mail addresses or resumes
  • Archiving
  • Web statistics
  • Total number of servers, average size of an HTML
    document, frequency of 404 errors

6
Analysis of Web Content and Structure
  • How to represent and analyze the content and
    structure of the Web
  • Web spiders rely on such information to guide
    their searches
  • Web analysis techniques
  • Content-based approaches
  • Link-based approaches

7
Content-Based Approaches
  • Analyze HTML content of a Web page to induce
    information about the page
  • Analyze body text to determine if the page is
    relevant to a domain
  • Use indexing techniques to extract the key
    concepts that represent a page
  • Words and phrases that appear in the HTML title
    or headings
  • Domain knowledge (domain-specific terminology)
  • The URL address of a Web page
    http://ourworld.compuserve.com/homepages/LungCancer
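
The content-based signals above (title, headings, URL, body text) can be combined into a simple relevance score. A minimal sketch in which the weights and the set of domain terms are hypothetical:

```python
# Score a page's relevance to a domain by counting domain terms in its
# title, URL, and body, with hypothetical per-field weights.
import re

def relevance_score(title: str, url: str, body: str, domain_terms: set) -> float:
    tokens = lambda s: re.findall(r"[a-z]+", s.lower())
    score = 0.0
    score += 3.0 * sum(t in domain_terms for t in tokens(title))  # title terms weighted highest
    score += 2.0 * sum(t in domain_terms for t in tokens(url))    # URL path often names the topic
    score += 1.0 * sum(t in domain_terms for t in tokens(body))   # body terms weighted least
    return score

terms = {"lung", "cancer", "tumor"}
print(relevance_score("Lung Cancer Support",
                      "http://ourworld.compuserve.com/homepages/LungCancer",
                      "info about lung cancer", terms))
```

A spider would follow links out of pages whose score exceeds some threshold.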

8
Link-Based Approaches
  • If the author of a Web page A places a link to a
    Web page B, then B is presumed to be relevant or
    similar to A, or of good quality
  • In-links: the hyperlinks pointing to a given
    page
  • The larger the number of in-links, the better
    (or more important) a page is considered to be
  • Anchor text (or text nearby) may provide a good
    description of the target page
  • Applications of link-based approaches
  • Web page classification and clustering
  • Identify cyber communities

9
Link-Based Approaches Page Rank
  • Give a link from an authoritative source a
    higher weight than a link from an unimportant
    personal homepage
  • PageRank computes a page's score by weighting
    each in-link to the page proportionally to the
    quality of the page containing the in-link

PR(p) = (1 - d) + d × Σ_{q → p} PR(q) / c(q)
where d is a damping factor between 0 and 1 and
c(q) is the number of outgoing links in q
Massive computation time!!
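
The PageRank score can be computed by simple iteration of the formula above. A minimal sketch on a hypothetical three-page graph (dangling pages and convergence tests are omitted):

```python
# Iterate PR(p) = (1 - d) + d * sum(PR(q) / c(q)) over pages q linking to p.
def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}                 # initial scores
    for _ in range(iterations):
        new = {}
        for p in pages:
            # weight each in-link by the linking page's score and out-degree
            incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = (1 - d) + d * incoming
        pr = new
    return pr

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = pagerank(graph)
print(max(scores, key=scores.get))  # "c" receives the most link weight
```

On the real Web graph this naive loop is far too slow, which is the "massive computation time" noted on the slide.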
10
Link-Based Approaches HITS
  • HITS: Hyperlink-Induced Topic Search
  • Authority pages: high-quality pages related to a
    particular topic or search query
  • A page to which many others point
  • Hub pages: not necessarily authorities
    themselves, but pages that provide pointers to
    authority pages
  • A page that points to many others
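
The hub and authority definitions above are mutually recursive, so HITS computes them by alternating updates. A minimal sketch on a hypothetical four-page graph:

```python
# Alternate the HITS updates: authority scores come from the hubs pointing
# at a page; hub scores come from the authorities a page points to.
def hits(links, iterations=50):
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        hub = {p: sum(auth[t] for t in links[p]) for p in pages}
        # normalize so the scores do not grow without bound
        na = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        nh = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return hub, auth

graph = {"hub1": ["a1", "a2"], "hub2": ["a1", "a2"], "a1": [], "a2": []}
hub, auth = hits(graph)
print(max(auth, key=auth.get) in {"a1", "a2"})   # a1/a2 emerge as authorities
print(max(hub, key=hub.get) in {"hub1", "hub2"}) # hub1/hub2 emerge as hubs
```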

11
PageRank and HITS
[Diagram: pages q1, q2, q3, ..., qi link in to page p,
and p links out to pages r1, r2, r3, ..., rj]
12
Graph Traversal Algorithms
  • The Web can be viewed as a directed graph
  • Uninformed search
  • Breadth-first search and depth-first search
  • Do not make use of any information to guide the
    search process
  • Informed search
  • Some information about each search node is
    available
  • Best-first search
  • Different metrics, such as the number of
    in-links, PageRank score, keyword frequency, and
    similarity to search query, have been used as
    guiding heuristics
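
A best-first spider keeps its frontier ordered by such a guiding heuristic. A minimal sketch using a priority queue, where the link graph and the score function are hypothetical stand-ins for metrics like in-link counts or query similarity:

```python
# Best-first crawl: always expand the highest-scoring unvisited URL next.
import heapq

def best_first_crawl(start, score, neighbors, limit=10):
    """Visit up to `limit` pages in best-first order of `score`."""
    frontier = [(-score(start), start)]   # max-heap via negated scores
    visited, order = set(), []
    while frontier and len(order) < limit:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for nxt in neighbors(url):        # enqueue out-links for later expansion
            if nxt not in visited:
                heapq.heappush(frontier, (-score(nxt), nxt))
    return order

links = {"seed": ["low", "high"], "low": [], "high": ["mid"], "mid": []}
scores = {"seed": 1.0, "low": 0.1, "high": 0.9, "mid": 0.5}
print(best_first_crawl("seed", scores.get, lambda u: links.get(u, [])))
```

With breadth-first search the frontier would be a plain FIFO queue instead; the heuristic ordering is the only difference.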

13
Graph Traversal Algorithms (Cont.)
  • Parallel search
  • Explore the different parts of a search space in
    parallel
  • Spreading-activation algorithm, as used in
    neural networks
  • Genetic algorithm

14
Web Spiders for Personal Search
15
Overview
  • Usually run on the client machine
  • More CPU time and memory can be allocated to the
    search process, and more functionality is
    possible
  • Allow users to have more control and
    personalization options during the search process

16
Personal Web Spiders
  • Allow users to enter keywords, specify the depth
    and width of search for links contained in the
    current homepages displayed, and request the
    spider agent to fetch homepages connected to the
    current homepage
  • Search Web neighborhoods to find relevant pages
    and return a list of links that look promising
  • Show the search results as graphs
  • Perform linguistic analysis and clustering of
    search results
  • Improve search effectiveness by sharing relevant
    search sessions among users (Collaborative
    Spider)

17
Personal Web Spiders (Cont.)
  • Use more advanced algorithms during search
  • Best-first search and genetic algorithms
  • Each URL is modeled as an individual in the
    initial population
  • Crossover: extracting the URLs that are pointed
    to by multiple starting URLs
  • Mutation: retrieving random URLs from Yahoo
  • Locate Web pages relevant to a pre-defined set of
    topics based on example pages provided by the
    user
  • Use a Naïve Bayesian classifier to guide the
    search process
  • Relevance feedback
  • Commercial Web spiders that collect, monitor,
    download, analyze Web documents
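
The Naïve Bayesian guidance mentioned above can be sketched as a two-class text classifier that scores a candidate page before the spider follows its links. The tiny training set and add-one smoothing below are illustrative assumptions, not the actual model of any cited system:

```python
# A two-class Naive Bayes scorer for focused crawling.
import math
from collections import Counter

def train_nb(examples):
    """examples: list of (tokens, label). Returns log-priors and word counts."""
    counts, priors = {}, Counter()
    for tokens, label in examples:
        priors[label] += 1
        counts.setdefault(label, Counter()).update(tokens)
    total = sum(priors.values())
    return {c: math.log(n / total) for c, n in priors.items()}, counts

def nb_score(tokens, priors, counts, label):
    """Log-probability of tokens under `label`, with add-one smoothing."""
    vocab = set().union(*[set(c) for c in counts.values()])
    total = sum(counts[label].values())
    s = priors[label]
    for t in tokens:
        s += math.log((counts[label][t] + 1) / (total + len(vocab)))
    return s

data = [(["lung", "cancer", "treatment"], "relevant"),
        (["football", "score", "league"], "irrelevant")]
priors, counts = train_nb(data)
page = ["cancer", "treatment", "news"]
rel = nb_score(page, priors, counts, "relevant")
irr = nb_score(page, priors, counts, "irrelevant")
print(rel > irr)  # the page scores as relevant, so the spider would follow it
```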

18
Personal Web Spiders (Cont.)
  • Composed of meta spiders
  • Programs that connect to different search engines
    to retrieve search results
  • Use a domain ontology in meta spiders to assist
    users in query formulation
  • Download and analyze all the documents in the
    result set
  • Cluster the search results from multiple search
    engines
  • Hierarchical and graph-based classification
  • P2P Web spiders

19
Case Study
  • The architecture of two search agents enhanced
    with post-retrieval analysis capabilities
  • Competitive Intelligence Spider (CI Spider)
  • Collect Web pages on a real-time basis from the
    Web sites specified by the user
  • Perform indexing and categorization analysis
  • Meta Spider
  • Connect to different search engines and integrate
    the results

20
Architecture of CI Spider and Meta Spider
[Diagram: the user submits a search query through the
User Interface; Search Spiders fetch results from a
general search engine or from user-specified Web
sites; the AZ Noun Phraser extracts key phrases;
selected phrases are fed to the SOM, which produces a
2D topic map]
21
Case Study (Cont.)
  • Components
  • User interface
  • Internet spiders
  • Arizona noun phraser
  • Index the key phrases that appear in each
    document
  • Part-of-speech tagging and linguistic rules
  • Self-organizing map (SOM)
  • A neural network that automatically clusters the
    Web pages into different regions on a 2D map

22
Future of Web Spiders
  • As the amount of dynamic content on the Web
    increases, spiders need to be able to retrieve
    and manipulate dynamic content autonomously
  • Spiders can perform better indexing by applying
    computational linguistic analysis to extract
    meaningful entities, rather than mere keywords,
    from Web pages (Semantic Web)
  • As the quality and credibility of Web pages vary
    considerably, spiders need to use more advanced
    intelligent techniques to distinguish between
    good and bad pages
  • An ideal personal spider should behave like a
    human librarian who tries to understand and
    answer user queries through an autonomous or
    interactive process using natural language

23
Goal-Oriented SOM System for Document Clustering
24
Introduction
  • Document Clustering
  • Assign documents which are similar in some
    aspect into the same group
  • How do we define the similarity? What is the
    rule for assigning documents?
  • Document Clustering with Neural Network
  • Kohonen's self-organizing map (SOM) is a good
    tool for clustering
  • The result can be presented as a map, which is
    more convenient for users to explore

25
Introduction Motivation
  • SOM has long been used to solve the clustering
    problem
  • However, SOM cannot be adapted to a user's
    interests
  • The clustering process is fully automatic, with
    no options for the user
  • The meaning of "Goal-Oriented" SOM
  • Use the user's interests to guide the clustering
    process of SOM
  • Produce a result map that matches the user's
    interests

26
Introduction Scenario
  • A user wants to survey documents about some
    topic on the Web
  • First, he uses a search engine to get the
    matching documents
  • He may get too many documents about the topic
  • He needs an intelligent system to cluster these
    by his interests and thoughts
  • The purpose of this research
  • To design an intelligent system that clusters
    documents in a personalized way

27
Related Work
  • Latent Semantic Analysis
  • Self-Organizing Map

28
Latent Semantic Analysis (LSA)
  • LSI: using Latent Semantic Analysis (LSA) for
    indexing, to better represent the relationships
    between terms or documents
  • LSA: extracting and representing the
    contextual-usage meaning of words by statistical
    computations applied to a large corpus of text.
    The two main components are
  • Singular Value Decomposition (SVD)
  • A form of factor analysis that can extract the
    abstract knowledge of documents into a semantic
    space
  • Dimension Reduction (DR)
  • Makes LSA represent the abstract knowledge of
    documents more precisely

29
LSA Process
[Diagram: matrix X is transformed by SVD (Singular
Value Decomposition) and DR (Dimension Reduction)
into the modified matrix X]
30
LSA An Example
  • An example of text data: titles of some
    technical memos
  • HCI (Human-Computer Interaction)
  • c1: Human machine interface for ABC computer
    application
  • c2: A survey of user opinion of computer system
    response time
  • c3: The EPS user interface management system
  • c4: System and human system engineering testing
    of EPS
  • c5: Relation of user perceived response time to
    error measurement
  • Mathematical graph theory
  • m1: The generation of random, binary, ordered
    trees
  • m2: The intersection graph of paths in trees
  • m3: Graph minors IV: Widths of trees and
    well-quasi-ordering
  • m4: Graph minors: A survey
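
The LSA pipeline can be run on these nine titles: build a term-document matrix, factor it with SVD, and keep only the top singular values (dimension reduction). A minimal NumPy sketch, with the term list restricted to a few index terms for brevity:

```python
# LSA on the nine memo titles: term-document matrix -> SVD -> rank-2 approximation.
import numpy as np

docs = {
    "c1": "human machine interface for abc computer application",
    "c2": "a survey of user opinion of computer system response time",
    "c3": "the eps user interface management system",
    "c4": "system and human system engineering testing of eps",
    "c5": "relation of user perceived response time to error measurement",
    "m1": "the generation of random binary ordered trees",
    "m2": "the intersection graph of paths in trees",
    "m3": "graph minors iv widths of trees and well quasi ordering",
    "m4": "graph minors a survey",
}
terms = ["human", "interface", "computer", "user", "system",
         "survey", "trees", "graph", "minors"]
# X[i, j] = count of term i in document j
X = np.array([[d.split().count(t) for d in docs.values()] for t in terms],
             dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                                          # keep two latent dimensions
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-2 approximation of X

# In the reduced space, documents on the same topic become more similar,
# even when they share no literal terms.
print(X.shape, X_k.shape)
```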

31
LSA An Example
[Matrix example with Dimension Reduction = 2]
32
Self-Organizing Map (SOM)
  • Suppose there are N data points to be clustered
  • Input
  • A quantified representation of each record
  • Each input node is a vector, say X
  • Output
  • K output nodes, usually arranged in a regular 2D
    grid
  • The number of output nodes must be decided
    before the learning process
  • Each output node has a model vector m, whose
    dimension must be the same as the input
    vector's

33
SOM Network Topology
[Diagram: N input nodes, each connected to all K
output nodes in the Kohonen layer]
34
SOM Process
  • Present each input node in order; run several
    iterations, or until the SOM converges
  • An iteration is as follows
  • Suppose we have an input vector X
  • Compute the similarity between X and all output
    nodes' model vectors. Pick the closest output
    node as the winner node
  • Update all output nodes' model vectors according
    to the following rule and the winner node,
    called c(x)
  • After this, the model vectors become more
    similar to the input vector
  • Repeat the above for all other input vectors in
    order

35
SOM Process
  • The model vectors are updated by
    m_i(t+1) = m_i(t) + h_{c(x),i}(t) [X - m_i(t)]
  • h_{c(x),i}(t) decides the speed and width of
    learning
  • (a) Gaussian neighborhood:
    h_{c(x),i}(t) = a(t) exp(-d(c(x),i)² / 2s(t)²)
  • 0 < a(t) < 1, monotonically decreasing
  • s(t) monotonically decreasing
  • (b) Bubble neighborhood: h_{c(x),i}(t) = a(t)
    if d(c(x),i) ≤ R, else 0, where R is called the
    neighbor radius
  • Finally, assign each input vector X to some
    output node as follows
  • Compute the similarity between X and all output
    nodes' model vectors. Assign X to the closest
    one.
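
The iteration above can be sketched in NumPy. The grid size, the decay schedules for a(t) and s(t), and the synthetic two-blob data are all illustrative assumptions:

```python
# A minimal SOM training loop: find the winner c(x), then pull every model
# vector toward the input with a shrinking learning rate and neighborhood.
import numpy as np

rng = np.random.default_rng(0)

def train_som(data, grid=(3, 3), iters=100):
    k = grid[0] * grid[1]
    m = rng.random((k, data.shape[1]))            # model vectors, one per output node
    coords = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])])
    for t in range(iters):
        a = 0.5 * (1 - t / iters)                 # a(t): monotonically decreasing
        sigma = 1.5 * (1 - t / iters) + 0.1       # s(t): shrinking neighborhood width
        for x in data:
            c = np.argmin(((m - x) ** 2).sum(axis=1))      # winner node c(x)
            d2 = ((coords - coords[c]) ** 2).sum(axis=1)   # grid distance to winner
            h = a * np.exp(-d2 / (2 * sigma ** 2))         # Gaussian neighborhood
            m += h[:, None] * (x - m)                      # update rule
    return m

# two well-separated blobs of 2D points
data = np.vstack([rng.normal(0.2, 0.05, (20, 2)), rng.normal(0.8, 0.05, (20, 2))])
m = train_som(data)
# assign each input to its closest model vector (the final clustering step)
assign = [int(np.argmin(((m - x) ** 2).sum(axis=1))) for x in data]
print(len(set(assign)) >= 2)  # the two blobs map to different output nodes
```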

36
SOM Result Map
37
Goal-Oriented SOM Flow Chart
[Flow chart: the user submits query keywords and a
specified goal through the User Interface; the Search
Engine returns the matched Web pages; the Segment
Tool produces document vectors; the LSI tool computes
relations between terms; the Goal-Oriented SOM (a
modified SOM) clusters the documents; the result map
is shown to the user, who can tune arguments and give
relevance feedback]
38
Similarity Function Concept
  • Original SOM
  • Based on some presumed similarity function
  • Goal-Oriented SOM
  • Based on the user's perception of the similarity
    between documents
  • More personalized!

39
Similarity Function Traditional
  • SOM clusters similar data, but how is similarity
    defined?
  • In WEBSOM and Adaptive Search
  • In the vector space model, each dimension
    represents a unique term. In some situations,
    some terms are more important than others
  • For example, in distinguishing between two
    related topics, the discriminating terms will be
    more important than the terms the topics share
  • If we multiply each dimension by a different
    weight, making the weight of the dimension that
    corresponds to a more important term larger,
    then the dimension with the bigger weight
    contributes more when computing similarity
  • Therefore, we can achieve the target:
    clustering by the user's interests

40
Similarity Function GOSOM
  • The modified similarity function in
    Goal-Oriented SOM weights each dimension by the
    importance of its term:
    d(X, m) = sqrt( Σ_i w_i (X_i - m_i)² )
  • w_i is the weight of the i-th term
  • Next, how do we decide the importance of terms?
  • By LSI, we get a matrix of the relationships
    between terms
  • The terms which are closer to the user's goals
    are more important
  • For example, if the relationships between some
    term and the user's three goals are 0.8, 0.2,
    and 0.2, then the weight of that term is 0.8
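
The goal-weighted similarity can be sketched as a weighted Euclidean distance; the vectors and per-term weights below are hypothetical:

```python
# Weighted Euclidean distance: high-weight (goal-relevant) dimensions
# dominate the similarity computation.
import math

def weighted_distance(x, m, w):
    """Euclidean distance with a per-dimension (per-term) weight."""
    return math.sqrt(sum(wi * (xi - mi) ** 2 for xi, mi, wi in zip(x, m, w)))

x  = [1.0, 0.0, 0.0]
m1 = [0.0, 1.0, 0.0]   # differs from x on two high-weight terms
m2 = [0.0, 0.0, 1.0]   # differs from x on one high- and one low-weight term
w  = [0.8, 0.8, 0.2]   # terms 1 and 2 matter to the user's goal, term 3 less

print(weighted_distance(x, m1, w) > weighted_distance(x, m2, w))  # True
```

With all weights equal this reduces to the ordinary SOM distance; the goal weights are the only modification.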

41
Labeling Traditional
  • After SOM learning, the rough clustering is
    done: each output node is a cluster
  • Next, how do we label them?
  • In Adaptive Search
  • Pick the term corresponding to the dimension
    with the largest value as the label of the node
  • If an adjacent node has the same label, merge
    them into a bigger cluster

42
Labeling Our Approach
  • In our approach
  • Using the matrix of relationships between terms
    produced by LSI, we can decide which user goal
    each term belongs to. For example
  • The user's goals are goal1, goal2, and goal3
  • The model vector of some output node is (3.67,
    5.32, 4.11, 7.58)
  • By the term-relationship matrix, we find that
    the term corresponding to the 1st dimension is
    closest to goal1, the 2nd is closest to goal2,
    the 3rd is closest to goal3, and the 4th is
    closest to goal1
  • Then we calculate the sum for each user goal as
    follows
  • goal1: 1st dim + 4th dim = 3.67 + 7.58 = 11.25
  • goal2: 2nd dim = 5.32
  • goal3: 3rd dim = 4.11
  • Thus, assign this output node to the goal1
    cluster
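
The labeling computation above can be sketched directly; the goal names are hypothetical placeholders:

```python
# Label an output node: map each model-vector dimension to its closest user
# goal, sum the entries per goal, and pick the goal with the largest sum.
def label_node(model_vector, dim_to_goal, goals):
    """Return (winning goal, per-goal sums) for one output node."""
    sums = {g: 0.0 for g in goals}
    for value, goal in zip(model_vector, dim_to_goal):
        sums[goal] += value
    return max(sums, key=sums.get), sums

model = [3.67, 5.32, 4.11, 7.58]
dim_to_goal = ["goal1", "goal2", "goal3", "goal1"]  # per-dimension closest goal
label, sums = label_node(model, dim_to_goal, ["goal1", "goal2", "goal3"])
print(label, round(sums["goal1"], 2))  # goal1 11.25, matching the slide's example
```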

43
Labeling Our Approach (Cont.)
  • Considering that some terms are more important
    to a specific goal, we calculate the sum with
    weights. Continuing the previous example
  • By the term-relationship matrix, the importance
    of the term corresponding to the 1st dimension
    for goal1 is 0.63, and that of the 4th dimension
    is 0.48, so the weighted sum for goal1 is
  • 1st dim × 0.63 + 4th dim × 0.48 =
    3.67 × 0.63 + 7.58 × 0.48 = 5.9505

44
Labeling Our Approach (Cont.)
  • In fact, there are some totally unrelated
    documents
  • If every per-goal sum is less than threshold ×
    the original (unweighted) sum, assign this
    output node to "Unrelated"
  • For example
  • Threshold = 0.3
  • The model vector of some output node is (3.67,
    5.32, 4.11, 7.58), so the original sum = 20.68
  • By the method of the previous page, the weighted
    sum for goal1 = 5.9505, for goal2 = 4.0432, and
    for goal3 = 3.7401
  • Because 3.7401 < 4.0432 < 5.9505 < 0.3 × 20.68 =
    6.204, assign this output node to the
    "Unrelated" cluster

45
Result Map
  • Different colors represent different clusters.
    There are 3 clusters: 2 goals and the
    "Unrelated" cluster

46
User Relevance Feedback
  • After clustering, let the user pick the correct
    documents
  • Pick the main terms of those documents (say, 5
    terms)
  • Increase the weights of these main terms
  • Run the SOM again with the modified weights