Learning Based Web Query Processing

Slides: 67
Provided by: dom1

1
Learning Based Web Query Processing
  • Yanlei Diao
  • Computer Science Department
  • Hong Kong U. of Science and Technology

2
Outline
  • Background
  • Learning Based Web Query Processing
  • FACT: A Prototype System
  • Preliminary System Evaluation
  • Conclusions
  • Demonstration

3
Searching the Web
  • Want to find a piece of information on the Web?

  • Heterogeneity
  • Huge Size
  • Lack of Structure
  • Diversified User Bases
  • Ever-Changing

4
Search Engines
  • Maintain indices, keyword input, match input
    keywords with indices, return relevant documents.
  • Problems
  • Large hit lists with low precision. Users find
    relevant documents by browsing.
  • URLs but not the required information are
    returned. Users read the pages for the required
    information.

5
Web Information Retrieval
  • IR: vector-space model, search and browse
    capabilities
  • Web IR: Web navigation, indexing, query
    languages, query-document matching, output
    ranking, user relevance feedback
  • Recent improvements: hierarchical classification,
    better presentation of results, hypertext study,
    metasearching...

6
Web IR for Query Processing
  • Problems
  • A list of URLs or documents is returned. Users
    browse a lot to find information.
  • It asks users for precise query requirements,
    which is hard for casual users.
  • It lacks a well-defined underlying model.
    Vector-space model does not convey as much as
    Hypertext.
  • → Large hit lists with low precision, reliance on
    input queries

7
Intelligent Agents
  • The agents learn user profiles/models from their
    search behaviors and employ the knowledge to
    predict URLs of interest to the user.
  • Some rely on search engines and heuristics to
    find targets of a specific type, e.g. papers or
    homepages.
  • Some help users in an interactive mode: they
    learn while users are browsing.
  • Some adaptive agents work autonomously: they use
    heuristics, recommend pages of interest and take
    user feedback to improve.

8
Agents for Query Processing
  • Problems
  • Recommending pages of interest, but not
    information of interest to the user
  • Using vector-space model or converting HTML to
    text documents
  • Requiring a priori knowledge, such as user
    profiles, or using heuristics for a particular
    domain
  • → Not well suited for ad hoc queries

9
Database Approaches
  • The Web is a directed graph: nodes are Web pages
    and edges are hyperlinks between pages.
  • Query languages: the 1st generation combines
    content-based and structure-based queries. The 2nd
    generation accesses the structure of Web objects and
    creates complex objects.
  • Wrappers and mediators: they present an
    integrated view of the resources.

10
DB Approaches for Query Processing
  • Problems
  • Wrapper generation is only feasible for a number
    of sites in a domain. The Web is growing very
    fast!
  • Web query languages require knowledge of the Web
    sites (content and linkage) and the language
    syntax. They are hard to use.
  • → Not scalable: good for Web site management but
    not for queries on the entire Web.

11
Our Goal
  • A Web query processing system for any Web user
    that
  • processes ad hoc queries on HTML pages
  • automatically extracts succinct and precise query
    results (a result may take the form of a table,
    a list or a paragraph).
  • → Learn the knowledge for query processing from
    the user!

12
Proposed Approach
  • An approach with learning capabilities
  • Keyword input (probably not precise)
  • Search engines return a URL list
  • During browsing, learns from users
  • to navigate through the web pages
  • to identify the required information on a web
    page
  • Processes the remaining URLs automatically
  • Returns succinct and precise results

13
Unique Features
  • Returning succinct and precise results, i.e.
    segments of pages
  • No a priori knowledge or preprocessing, suited for
    ad hoc queries
  • Exploiting page formatting and linkage
    information simultaneously, making good use of the
    rich information conveyed by HTML.

14
Benefits from Learning
  • Bridging the gap between keyword input and real
    query requirements
  • Capable of navigating in the neighborhoods of
    documents returned by search engines
  • Automating the processing of all possibly
    relevant documents in one query
  • Almost imperceptible to users, user-friendly

15
Outline
  • Background
  • Learning Based Web Query Processing
  • FACT: A Prototype System
  • Preliminary System Evaluation
  • Conclusions
  • Demonstration

16
Modeling a Web Page
  • Segment: a group of tag-delimited elements and the
    unit of query processing, e.g. a paragraph, table
    or list. Segments nest (from atomic segments up to
    the document), forming a Segment Tree
  • Attributes of a segment
  • content: text in the scope of the segment
  • description: summary of the content
  • Hyperlink: represented as a segment so that links
    and segments are comparable
  • content: URL
  • description: anchor text
  • associated with the parent segment
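
The page model above can be sketched as a small data structure. This is only an illustration: the class names `Segment` and `Hyperlink`, the field names, and the sample page are assumptions, not FACT's actual types.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Hyperlink:
    url: str          # "content" of a link
    anchor_text: str  # "description" of a link


@dataclass
class Segment:
    tag: str          # delimiting HTML tag, e.g. "p", "table", "ul"
    content: str      # text in the scope of the segment
    description: str  # summary of the content
    children: List["Segment"] = field(default_factory=list)
    links: List[Hyperlink] = field(default_factory=list)  # links owned by this segment

    def walk(self) -> Iterator["Segment"]:
        """Yield this segment and every nested segment, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


# A hypothetical two-segment page forming a small segment tree.
page = Segment(
    tag="body", content="Rates Single 999.00", description="Hotel page",
    children=[
        Segment(tag="table", content="Single 999.00", description="Rates"),
        Segment(tag="p", content="", description="Navigation",
                links=[Hyperlink(url="rates.html", anchor_text="Hotel Rates")]),
    ],
)
tags = [s.tag for s in page.walk()]
```

Keeping links attached to their parent segment mirrors the last bullet: a link is scored like a segment but stays associated with the segment that contains it.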

17
A Sample
18
Modeling a Web Site
  • Ignore backward links, links pointing to
    themselves, and links outside a site.
  • A Web site is modeled as hyperlink-connected
    segment trees, called a Segment Graph.
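
A minimal sketch of the link filtering described above, using Python's standard `urllib.parse`. The function name `site_links` is hypothetical, and treating a "backward link" as a link to a page already visited is an assumption; the slides do not define the term precisely.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def site_links(page_url, hrefs, visited):
    """Return the absolute URLs worth adding as edges of the segment graph."""
    base = urldefrag(page_url).url
    site = urlparse(base).netloc
    kept = []
    for href in hrefs:
        target = urldefrag(urljoin(base, href)).url
        if target == base:
            continue  # link pointing to the page itself
        if urlparse(target).netloc != site:
            continue  # link outside the site
        if target in visited:
            continue  # backward link (assumption: target was already visited)
        if target not in kept:
            kept.append(target)
    return kept


# Hypothetical example: only the rates page survives the filter.
visited = {"http://hotel.example/"}
links = site_links(
    "http://hotel.example/info/index.html",
    ["rates.html", "#top", "http://other.example/", "/", "rates.html"],
    visited,
)
```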

19
Knowledge for the Locating Task
The locating task is to find a segment in the
Segment Graph of a site as the query result.
20
Two Types of Knowledge
A link conveys a description of the page it points
to, while a queried segment contains both a
description and the result itself.
21
Navigation Knowledge
  • Concerns descriptive information and helps find
    the navigational path
  • A set of (term, weight) pairs
  • Term: a selected word from the description of
    segments and links on the navigational path
  • Weight: indicates the importance of the term in
    leading to the queried segment

22
Learning Navigation Knowledge
  • Navigational path: (link →) segment, e.g.
    L2 → L4 → S41.
  • Extended navigational path: ((segment →) link →)
    ((segment →) segment), e.g. (S1 → S11 → L2) →
    (S3 → S31 → L4) → (S4 → S41).

Step 1. Assign a weight to each component on the
path, e.g. L2, S31, S41. The closer to the
target, the higher the weight.
Step 2. Assign a weight to each term in the
description of a component on the path.
The weight of a term can be summed up over
navigational paths. The set of (term, weight)
pairs is stored in the navigation knowledge
base.
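
The two steps can be sketched as follows. The slides only say that components closer to the target get higher weights, so the concrete scheme here (position divided by path length, with each term inheriting its component's weight) is an assumption, as are the function name and the sample paths.

```python
import re
from collections import defaultdict


def learn_navigation(paths, knowledge=None):
    """Accumulate (term, weight) pairs over navigational paths.

    Each path is a list of component descriptions ordered from the
    entry point to the queried segment.
    """
    knowledge = defaultdict(float) if knowledge is None else knowledge
    for path in paths:
        n = len(path)
        for i, description in enumerate(path, start=1):
            weight = i / n  # Step 1 (assumed scheme): the target component gets 1.0
            # Step 2 (assumed scheme): every distinct term inherits the component weight.
            for term in set(re.findall(r"[a-z]+", description.lower())):
                knowledge[term] += weight  # summed over all paths
    return knowledge


# Hypothetical paths taken from two browsing sessions.
paths = [
    ["Hotel Rates", "Room Rates single double"],
    ["Reservation", "Rates", "standard room deluxe room"],
]
nav_kb = learn_navigation(paths)
```

Terms that recur near the end of several paths, such as "rates" and "room" here, end up with the largest accumulated weights.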
23
Classification knowledge
  • Checks if a segment meets query requirements on
    both descriptive information and the result.
  • Cast in the Bayesian learning framework.
  • A set of triples (feature, NP, NN)
  • Feature: a typed token contained in a segment,
    e.g. word, integer, real, symbol, date, time,
    email address
  • NP: occurrences of the feature in positive
    samples
  • NN: occurrences of the feature in negative
    samples

24
Learning Classification knowledge
The queried segment is a positive sample. All
other segments on the same page are negative
samples.
The content of each segment is parsed into a set
of features of either simple or complex types.
Count NP and NN cumulatively for each feature
over all samples. Store all triples (feature, NP,
NN) in the classification knowledge base.
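
A minimal sketch of this counting step. The regular expressions, the `<type>` feature labels, and the choice to count a feature at most once per segment are all assumptions made for illustration; the slides do not give the parser's rules.

```python
import re
from collections import defaultdict

# Hypothetical patterns for a few typed features; anything else is a word.
FEATURE_PATTERNS = [
    ("email", re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")),
    ("real", re.compile(r"^\d{1,3}(,\d{3})*\.\d+$|^\d+\.\d+$")),
    ("integer", re.compile(r"^\d+$")),
]


def features(content):
    """Parse segment content into a set of features."""
    feats = set()
    for token in content.split():
        for name, pattern in FEATURE_PATTERNS:
            if pattern.match(token):
                feats.add("<" + name + ">")
                break
        else:
            word = token.strip(".,;:!?()").lower()
            if word:
                feats.add(word)
    return feats


def learn_classification(segments, queried_index, knowledge=None):
    """Update feature -> [NP, NN] counts from the segments of one page."""
    knowledge = defaultdict(lambda: [0, 0]) if knowledge is None else knowledge
    for i, content in enumerate(segments):
        slot = 0 if i == queried_index else 1  # 0 = positive, 1 = negative
        for feat in features(content):
            knowledge[feat][slot] += 1
    return knowledge


segments = [
    "Holiday Inn Golden Mile is your number one choice",
    "single double standard room 999.00 1,039.00",
]
cls_kb = learn_classification(segments, queried_index=1)
```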
25
Query Processing Using Learned Knowledge
  • After a Web page is retrieved, the segment graph
    is built
  • For each segment and link, a score is computed by
    applying the navigation knowledge
    (ApplyNavigation).
  • Segments/links are sorted on the score
  • If a link has the highest score, the system
    navigates through the link
  • If a segment has the highest score, all segments
    on the page are checked to see if there is a
    queried segment
  • The process is repeated until either a segment is
    found or it can be concluded that the site
    does not contain the queried information.

26
Locating Algorithm
On each page, if the result is not found:
  choose an unprocessed component with the highest score
    if a link is chosen → navigate through the link
    if a segment is chosen
27
Locating Algorithm
On each page, if the result is not found:
  choose an unprocessed component with the highest score
    if a link is chosen
    if a segment is chosen → (ApplyClassification)
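
The locating loop might look like the sketch below. It is not FACT's implementation: the in-memory `pages` dictionary, the component tuples, the depth-first backtracking, and the default threshold are assumptions, and the two scoring functions are passed in as stand-ins for ApplyNavigation and ApplyClassification.

```python
def locate(pages, start, nav_score, confidence, threshold=0.5):
    """Search a site for the queried segment.

    pages maps a URL to a list of (kind, description, payload) components,
    where payload is the target URL of a link or the content of a segment.
    Returns (segment content or None, number of pages visited).
    """
    visited = set()

    def visit(url):
        if url in visited or url not in pages:
            return None
        visited.add(url)
        # Process components in descending navigation score.
        components = sorted(pages[url], key=lambda c: nav_score(c[1]), reverse=True)
        segments = [c for c in components if c[0] == "segment"]
        checked = False
        for kind, _description, payload in components:
            if kind == "link":
                found = visit(payload)  # navigate through the link
                if found is not None:
                    return found
            elif not checked:
                checked = True  # classify all segments on the page once
                best = max(segments, key=lambda c: confidence(c[2]))
                if confidence(best[2]) >= threshold:
                    return best[2]
        return None  # nothing on or below this page qualifies

    return visit(start), len(visited)


# Hypothetical three-page site with toy scoring functions.
pages = {
    "home": [("segment", "welcome", "Located in the hub of Wanchai"),
             ("link", "dining", "dining"),
             ("link", "hotel rates", "rates")],
    "dining": [("segment", "restaurants", "Chinese and Western cuisine")],
    "rates": [("segment", "room rates", "single 999.00 double 1,039.00")],
}
weights = {"rates": 2.0, "room": 1.5, "dining": 0.1}
result = locate(
    pages, "home",
    nav_score=lambda d: max((weights.get(t, 0.0) for t in d.split()), default=0.0),
    confidence=lambda content: 1.0 if "999.00" in content else 0.0,
)
```

In this toy run the "hotel rates" link outranks everything else on the home page, so the rates page is the second and last page visited.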
28
Applying Learned Knowledge
  • Application of Navigation Knowledge
  • extracts terms in the description of a
    link/segment
  • reads the weights of the terms and assigns a
    score to the link/segment by a certain function
    (max currently)
  • sorts all links and segments by their scores
  • Application of Classification Knowledge
  • computes the confidence C to classify a segment
    as the queried result
  • chooses the segment on a page with the largest C.
    If the largest C is over a threshold, returns the
    segment
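
Both applications can be sketched in a few lines. The `max` scoring follows the slide. The confidence formula does not come from the slides: the version below is one naive-Bayes-style option, assuming equal priors and add-one smoothing over the (feature, NP, NN) counts, and all names are hypothetical.

```python
import math


def apply_navigation(description, nav_kb):
    """Score a link or segment as the maximum weight among its description terms."""
    terms = description.lower().split()
    return max((nav_kb.get(t, 0.0) for t in terms), default=0.0)


def apply_classification(feats, cls_kb):
    """Confidence C in (0, 1) that a segment with these features is the result.

    Assumed formula: sum of smoothed per-feature log-odds, squashed by a sigmoid.
    """
    log_odds = 0.0
    for f in feats:
        n_pos, n_neg = cls_kb.get(f, (0, 0))
        p = (n_pos + 1) / (n_pos + n_neg + 2)  # add-one smoothing keeps 0 < p < 1
        log_odds += math.log(p / (1 - p))
    return 1 / (1 + math.exp(-log_odds))


# Hypothetical knowledge bases.
nav_kb = {"rates": 2.0, "hotel": 0.5}
cls_kb = {"<real>": (3, 0), "holiday": (0, 3)}

ranked = sorted(["Dining", "Hotel Rates", "Hotel"],
                key=lambda d: apply_navigation(d, nav_kb), reverse=True)
c_rates = apply_classification({"<real>", "room"}, cls_kb)
c_intro = apply_classification({"holiday"}, cls_kb)
```

Unseen features contribute nothing under this formula, so a segment is judged only on evidence present in the knowledge base.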

29
[Screenshot: Hotel 1, Hotel 2]
User browses it!
30
User clicks here!
31
Room information
User marks it!
32
Generating Navigation Knowledge
  • The navigation path looks like
  • Hotel Reservation → single hk double hk standard
    room deluxe room executive room
  • By our weighting scheme, a weight is assigned to
    each term

33
Generating Classification Knowledge
  • Training Samples
  • Occurrences of each feature are counted

Negative: Holiday Inn Golden Mile. In the heart
of Tsim Sha Tsui - Kowloon, Holiday Inn Golden
Mile is your number one choice for accommodation,
dining, meetings and banquets. Ideally situated
in the heart of ...
Positive:
                  single hk   double hk
  standard room     999.00    1,039.00
  deluxe room     1,199.00    1,239.00
  executive room  1,399.00    1,499.00
34
FACT starts here!
36
Applying Navigation Knowledge
  • The page contains
  • Navigation knowledge shows

Paragraph: 57 - 73 Lockhart Road, Wanchai, Hong
Kong, SAR, PRC
Paragraph: Located in the hub of Wanchai, the
Wharney Hotel is within walking distance of the
Hong Kong Arts Centre, Convention and Exhibition
Centre, busy commercial complexes and shopping
malls. ...
Paragraph: TEL (852) 2861-1000 FAX (852) 2865-6023
Links: Main Features, Services, Dining and
Banqueting, Hotel Rates, Reservation ...
37
Navigation Knowledge assigns scores
FACT chooses it!
38
Navigation Knowledge assigns scores
39
Classification Knowledge computes confidence
Apply Classification Knowledge to all Segments
40
FACT finds it!
41
Outline
  • Background
  • Learning Based Web Query Processing
  • FACT: A Prototype System
  • Preliminary System Evaluation
  • Conclusions
  • Demonstration

42
A Query Processing System
  • A learning based query processing system
  • User Interface: accepts user queries, presents
    query results, and provides a browser capable of
    capturing user actions
  • Query Analyzer: analyzes and transforms user
    queries
  • Session Controller: coordinates learning and
    locating
  • Learner: generates knowledge from captured user
    actions
  • Locator: applies knowledge and locates query
    results
  • Retriever and Parser: retrieves pages and parses
    them into trees
  • Knowledge Base: stores learned knowledge

43
Reference Architecture
44
A Query Session
45
Training Strategies
  • Sequential
  • First n sites: user browses and system learns
  • Next N - n sites: system processes
  • Random
  • Randomly choose n sites: user browses and system
    learns
  • The system processes the rest
  • Interleaved
  • First n0 sites: user browses and system learns
  • Next n - n0 sites: system makes decisions. For
    incorrect ones, user browses and system re-learns
  • Next N - n sites: system processes
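
The three strategies differ only in which sites feed the learner. A rough sketch follows, with hypothetical function names and callbacks (`learn`, `decide`, `is_correct`) standing in for the user and the system:

```python
import random


def split_sites(sites, n, strategy="sequential", seed=None):
    """Return (training sites, sites left for automatic processing)."""
    if strategy == "sequential":
        return sites[:n], sites[n:]
    if strategy == "random":
        chosen = set(random.Random(seed).sample(range(len(sites)), n))
        train = [s for i, s in enumerate(sites) if i in chosen]
        rest = [s for i, s in enumerate(sites) if i not in chosen]
        return train, rest
    raise ValueError("unknown strategy: " + strategy)


def interleaved(sites, n0, n, learn, decide, is_correct):
    """Learn on the first n0 sites, re-learn on mistakes among the next n - n0."""
    for site in sites[:n0]:
        learn(site)  # user browses, system learns
    for site in sites[n0:n]:
        if not is_correct(site, decide(site)):
            learn(site)  # user browses the site, system re-learns
    return [decide(site) for site in sites[n:]]  # processed automatically


# Toy run: the system is "wrong" only on site "c".
learned = []
results = interleaved(
    ["a", "b", "c", "d", "e"], n0=1, n=3,
    learn=learned.append,
    decide=lambda s: s.upper(),
    is_correct=lambda s, r: s != "c",
)
train, rest = split_sites(list("abcde"), 2)
```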

46
Outline
  • Background
  • Learning Based Web Query Processing
  • FACT: A Prototype System
  • Preliminary System Evaluation
  • Conclusions
  • Demonstration

47
System Evaluation
  • System Capabilities
  • Performance
  • Effectiveness: precision, recall, correctness
  • Efficiency: in a site, how many pages the system
    visits to find a result or to recognize the
    irrelevancy
  • Training efficiency: how many training samples
    are needed
  • Key Issues
  • Effectiveness of the knowledge
  • Effectiveness of training strategies
  • Tests on a Range of Queries

48
A System Output Sample
49
System Capabilities
  • The system returns segments of the Web pages
  • The segments may not contain any input keyword
    but meet the requirement of room rates.
  • The system learned the query requirement from the
    user!
  • Segments can be from pages whose URLs are not
    directly returned by Yahoo!.
  • The system learned how to follow the hyperlinks
    to the queried segment!

50
System Evaluation - Effectiveness
  • Given a set of URLs in a query session, the
    system makes N decisions
  • N = N1 + N2 + N3 + N4
  • Precision = N1 / (N1 + N3),
  • Recall = N1 / (number of sites that contain results),
  • Correctness = (N1 + N2) / N.
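
The three measures translate directly into code. The transcript does not preserve the definitions of N1 to N4, so the sketch below relies only on the formulas. The function name and the sample counts are made up for illustration.

```python
def effectiveness(n1, n2, n3, n4, sites_with_results):
    """Compute precision, recall and correctness from the decision counts."""
    n = n1 + n2 + n3 + n4  # total number of decisions
    return {
        "precision": n1 / (n1 + n3),
        "recall": n1 / sites_with_results,
        "correctness": (n1 + n2) / n,
    }


# Hypothetical session: 20 decisions, 12 sites actually contain a result.
scores = effectiveness(n1=8, n2=6, n3=2, n4=4, sites_with_results=12)
```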

51
System Evaluation - Efficiency
  • How efficiently does the system find a queried
    segment in a site?
  • Level of a Queried Segment: the length of the
    shortest path to find it
  • Absolute Path Length = number of visited pages,
  • Relative Path Length = number of visited pages /
    Level of the Queried Segment.
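
A small sketch of these measures, assuming the level is counted in pages on the shortest path (so a segment on the entry page has level 1 and an ideal search has relative path length 1). The graph and function names are hypothetical.

```python
from collections import deque


def level(graph, start, target):
    """Number of pages on the shortest path from start to the target page."""
    queue = deque([(start, 1)])
    seen = {start}
    while queue:
        page, depth = queue.popleft()
        if page == target:
            return depth
        for nxt in graph.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return None  # target not reachable within the site


def relative_path_length(visited_pages, segment_level):
    return visited_pages / segment_level


# Hypothetical site: the queried segment sits on the "rooms" page.
graph = {"home": ["dining", "rates"], "rates": ["rooms"]}
lvl = level(graph, "home", "rooms")
rel = relative_path_length(4, lvl)  # the system visited 4 pages
```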

52
Basic Performance

Q11: Hong Kong Hotel Room Rate
Q12: Hong Kong Hotel
Sequential training
53
Effectiveness of Knowledge
  • Two other systems implemented for comparison
  • Classification Knowledge Only: treats links and
    segments the same, using the Bayes classifier
  • Learning

Action            Positive       Negative
click a link      the link       other links on the page
mark a segment    the segment    other segments on the page

  • Locating

Classify all segments and links. If a link has the
highest confidence, follow the link. If a segment
has the highest confidence and passes the
threshold, return it.
54
Effectiveness of Knowledge
  • Navigation Knowledge Only: only checks the
    descriptive information of links and segments
  • Learning

Navigational path → Navigation Knowledge

  • Locating

Assign scores to all links and segments using
navigation knowledge. If a link has the highest
score, follow the link. If a segment has the
highest score, return it.
55
Effectiveness of Knowledge
56
Effects of Training Strategies
Query: Q12. Training size: 3-10
57
Effects of Training Strategies
  • Random training performs badly, low in recall
  • As the training size increases, interleaved
    training outperforms sequential training
  • Best accuracy reaches or exceeds 90% in all
    metrics when the interleaved training strategy is
    used
  • Enlarging the training size for random and
    sequential training is not effective

58
Improved Performance
Interleaved training
59
A Range of Queries
  • Hotel room rates: targets prices, which are easy
    to identify
  • Admission requirements for graduate students:
    include items such as degree, GPA, GRE, etc.
    that are not easy to specify in keywords but easy
    to show by marking
  • Data Mining Researcher: a subjective concept,
    with evidence including research interests,
    projects, professional activity, etc.

60
Results of A Range of Queries
Interleaved training
61
Performance for the Queries
  • Effectiveness
  • First 4 queries: accuracy ranges from 80% to
    above 90%
  • The last query: still capable of filtering out
    irrelevant sites
  • Efficiency
  • Relative path length to locate a queried segment
    is close to 1
  • Absolute path length to conclude irrelevancy is
    no more than 2.5 pages.
  • The performance is not affected much by how
    precise the keyword query is. The system learns
    query requirements.

62
Outline
  • Background
  • Learning Based Web Query Processing
  • FACT: A Prototype System
  • Preliminary System Evaluation
  • Conclusions
  • Demonstration

63
Conclusions
  • Proposed and implemented learning based Web query
    processing with the following features
  • Returning succinct results: segments of pages
  • No a priori knowledge or preprocessing, suited for
    ad hoc queries
  • Exploiting page formatting and linkage
    information simultaneously.
  • The preliminary results are promising

64
Future Work
  • Better segmentation for HTML documents
  • Better knowledge, key factor that affects system
    performance
  • other weighting schemes for navigation knowledge
  • other implementation of classification knowledge
  • More system evaluation
  • Dynamic web pages

65
Outline
  • Background
  • Learning Based Web Query Processing
  • FACT: A Prototype System
  • Preliminary System Evaluation
  • Conclusions
  • Demonstration

66
Demonstration