Schema Mapping of Form-based Web Interfaces

1
Schema Mapping of Form-based Web Interfaces
  • Aravind Kalavagattu
  • Raju Balakrishnan
  • Garrett Wolf
  • Yochan Information Integration Group

2
Motivation
  • Deep Web Problem on the WWW
  • Tons of data reside in hidden databases
  • Many current web search engines fail to crawl
    this valuable data
  • Data Integrators
  • A data integration application works by
    integrating data from several online databases
  • They all need schema mapping as their first and
    foremost step to query/crawl the databases!
  • Many sites provide only restricted query access
    through form-based interfaces.

3
Problem Statement
  • Formally, the web interface is a schema F with n
    form-fields f1, f2, f3, ..., fn.
  • Given two HTML form-schemas G and H, the problem
    of semantic schema mapping boils down to mapping
    a subset of fields in G, i.e., (g1, g2, g3, ..., gm),
    to a subset of fields in H, i.e., (h1, h2,
    h3, ..., hn), such that each gi corresponds to hi.
  • For example, in the above picture, taking G as
    cars.com and H as Autoads.com, we have the
    mappings G.Make ↔ H.Make, G.Model ↔ H.Model,
    G.MaxPrice ↔ H.Price, G.SearchWithin ↔
    H.Distance, G.YourZip ↔ H.ZipCode.
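
As a minimal sketch, the example mapping above can be held in a plain dictionary from fields of G to fields of H. The field names come from the slide; the dictionary layout and the helper is_one_to_one are illustrative assumptions, not part of the presented system.

```python
# Field names are taken from the slide's cars.com (G) / Autoads.com (H) example.
# Representing the mapping as a dict is an assumption made for illustration.
mapping = {
    "Make": "Make",
    "Model": "Model",
    "MaxPrice": "Price",
    "SearchWithin": "Distance",
    "YourZip": "ZipCode",
}


def is_one_to_one(m):
    """Return True if no two fields of G are mapped to the same field of H."""
    return len(set(m.values())) == len(m)


for g_field, h_field in mapping.items():
    print(f"G.{g_field} <-> H.{h_field}")
```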

4
Steps Involved
  • Feature Identification
  • Distance/Similarity Metrics
  • Methodology
  • Feature Weights Learning
  • Clustering Approach
  • SVM based Classification
  • Empirical Evaluation
  • Results and Discussion

5
Feature Identification
  • Name: the actual name of the HTML input element
  • Description: the text commonly located to the left of
    or above the field when viewing a rendered HTML page
  • Values List: applies to selection lists and radio
    buttons
  • HTML Control Type: selection list, text box, check
    box, or radio button
  • Data Type: the data type of the values
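
One way to carry these five features around is a small record type. The sketch below is an assumption about representation only: the class name FormField, the attribute names, the defaults, and the sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FormField:
    """One form field described by the five features listed above."""
    name: str                                        # name of the HTML input element
    description: str                                 # label text rendered next to the field
    values: List[str] = field(default_factory=list)  # options of a selection list or radio group
    control_type: str = "text"                       # e.g. "select", "text", "checkbox", "radio"
    data_type: str = "string"                        # data type of the values


# Hypothetical example: a "Make" drop-down on a car search form.
make = FormField(
    name="mk",
    description="Make",
    values=["Ford", "Honda", "Toyota"],
    control_type="select",
)
```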

6
Similarity Metrics
  • Name Similarity (NS), of two types:
  • Jaccard coefficient
  • Semantic similarity: distance to the common
    hypernym of the two words
  • Example: Maharashtra and Arizona now count as
    similar.
  • Description Similarity (DS): same as above
  • Value List Similarity (VLS): Jaccard coefficient
  • Control Type Similarity (CTS): for differing types,
    we decide whether one subsumes the other
  • Data Type Similarity (DTS): computed in a manner
    similar to control type similarity
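
Two of these metrics are easy to sketch without external resources: the Jaccard coefficient over word tokens and a subsumption check for control types. The tokenizer, the SUBSUMES table, and the partial score of 0.5 are assumptions chosen for illustration; the slides do not give the actual rules. The WordNet-based semantic similarity is not covered here.

```python
import re


def tokens(text):
    """Split a name or description into lower-cased word tokens (camelCase aware)."""
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)
    return set(re.findall(r"[a-z0-9]+", spaced.lower()))


def jaccard(a, b):
    """Jaccard coefficient: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


# Assumed subsumption table: the key type can express what the listed types can.
SUBSUMES = {
    "text": {"select", "radio", "checkbox"},
    "select": {"radio"},
}


def control_type_similarity(t1, t2, partial=0.5):
    """1.0 for equal types, `partial` if one type subsumes the other, else 0.0."""
    if t1 == t2:
        return 1.0
    if t2 in SUBSUMES.get(t1, set()) or t1 in SUBSUMES.get(t2, set()):
        return partial
    return 0.0
```

For instance, jaccard(tokens("MaxPrice"), tokens("Price")) compares {"max", "price"} with {"price"} and yields 0.5.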

7
Steps Involved
  • Feature Identification
  • Distance/Similarity Metrics
  • Methodology
  • Feature Weights Learning
  • Clustering Approach
  • SVM based Classification
  • Empirical Evaluation
  • Results and Discussion

8
Learning Feature Weights
  • The similarity between two form fields f and g
    will be
  • Sim(f, g) = W^T · SIM_fg + b
  • Perceptron-based gradient descent, with the ideal
    kernel matrix elements as target values, is used
    to optimize the weights
  • Perceptron learning: equal numbers of positive
    and negative samples are used across domains
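
A minimal sketch of this kind of weight learning follows, assuming a single linear unit trained by batch gradient descent on squared error (the delta rule), with target 1 for field pairs that denote the same concept and 0 otherwise. The exact update rule, targets, learning rate, and the toy samples are assumptions, not the authors' settings.

```python
def weighted_sim(w, b, sims):
    """Weighted similarity of one field pair: dot(w, sims) + b."""
    return sum(wi * si for wi, si in zip(w, sims)) + b


def mse(w, b, samples):
    """Mean squared error between predictions and ideal-kernel targets."""
    return sum((t - weighted_sim(w, b, s)) ** 2 for s, t in samples) / len(samples)


def learn_weights(samples, lr=0.5, epochs=500):
    """Batch gradient descent on squared error, starting from uniform weights.

    samples: list of (sims, target), where sims holds the per-feature
    similarities of one field pair and target is the ideal kernel entry.
    """
    n_features = len(samples[0][0])
    n = len(samples)
    w = [1.0 / n_features] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for sims, target in samples:
            err = target - weighted_sim(w, b, sims)
            for k, s in enumerate(sims):
                grad_w[k] += err * s
            grad_b += err
        w = [wk + lr * gk / n for wk, gk in zip(w, grad_w)]
        b += lr * grad_b / n
    return w, b


# Toy data (made up): similarities for (name, description, values),
# two matching pairs and two non-matching pairs.
samples = [
    ([0.9, 0.8, 0.5], 1.0),
    ([1.0, 0.7, 0.1], 1.0),
    ([0.1, 0.2, 0.5], 0.0),
    ([0.0, 0.3, 0.1], 0.0),
]
w, b = learn_weights(samples)
```

Using the same number of positive and negative pairs, as the slide describes, keeps the fit from being dominated by one class.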

9
Clustering Approach
  • Given the similarity matrix S, where Sij = weighted
    similarity between form-fields i and j, we
    perform agglomerative hierarchical clustering
    using three different distance measures:
  • Single-link, Complete-link, Average-link
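
The sketch below shows agglomerative clustering driven directly by a similarity matrix, with the three linkage rules expressed in terms of similarity (highest pairwise value for single-link, lowest for complete-link, mean for average-link). Stopping at a fixed number of clusters and the toy matrix are assumptions; the slides do not state the stopping criterion.

```python
def agglomerative(sim, n_clusters, linkage="average"):
    """Merge the two most similar clusters until n_clusters remain.

    sim: symmetric list-of-lists of pairwise similarities.
    Returns a list of clusters, each a list of field indices.
    """
    def cluster_sim(a, b):
        vals = [sim[i][j] for i in a for j in b]
        if linkage == "single":      # closest pair of members
            return max(vals)
        if linkage == "complete":    # farthest pair of members
            return min(vals)
        if linkage == "average":     # mean over all cross pairs
            return sum(vals) / len(vals)
        raise ValueError(f"unknown linkage: {linkage}")

    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = cluster_sim(clusters[x], clusters[y])
                if best is None or s > best[0]:
                    best = (s, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Toy similarity matrix (made up): fields 0 and 1 match, fields 2 and 3 match.
S = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
print(agglomerative(S, 2, linkage="average"))
```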

10
SVM Classification Approach
  • The kernel Gram matrix is formed as a weighted sum of
    the individual similarity measures.
  • The negative eigenvalues are set to zero to make
    the kernel matrix positive semi-definite.
  • N one-vs-all SVM classifiers were trained to
    classify fields, where N is the number of classes,
    with five
  • The trained SVMs are tested against the test data
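
The first two bullets can be written out as follows. The notation is assumed for illustration and does not appear in the slides: K^{(k)} is the Gram matrix of the k-th similarity measure, w_k is its weight, and V and Λ hold the eigenvectors and eigenvalues of the symmetric matrix K.

```latex
K = \sum_{k} w_k \, K^{(k)}, \qquad
K = V \Lambda V^{\top}, \qquad
K_{+} = V \, \max(\Lambda, 0) \, V^{\top}
```

Here the maximum is taken entry by entry on the diagonal of Λ, so K_{+} has no negative eigenvalues and is therefore positive semi-definite.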

11
Steps Involved
  • Feature Identification
  • Distance/Similarity Metrics
  • Methodology
  • Feature Weights Learning
  • Clustering Approach
  • SVM based Classification
  • Empirical Evaluation
  • Results and Discussion

12
Experiments
  • Distance Metrics
  • Bag of Words Similarity (Jaccard): name,
    description, values
  • Semantic Similarity (WordNet): name, description
  • Is-A Relationship Similarity (Attribute
    Hierarchy): data type
  • Boolean Similarity (Exact Match Only): html type
  • Data Set
  • UIUC Information Integration Repository: TEL-8
    Query Interfaces data set
  • 4 of 8 domains: Airfares, Automobiles, Hotels,
    Car Rentals
  • Features: name, description, values, data type,
    html type

13
Which Distance Metric to Use?

14
How to Assign Feature Weights?

15
Can Weights be Used Across Domains?
What Effect Do the Negative Similarities Have?
  • Assigning negative similarity between fields in
    the same form improves accuracy when compared to
    zero similarity assignments.
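
A rough sketch of this adjustment: before clustering, every pair of distinct fields that come from the same form gets a fixed negative similarity. The function name, the form_of list, and the value -1.0 are assumptions; the slides do not state how negative the assigned similarity was.

```python
def apply_same_form_penalty(sim, form_of, penalty=-1.0):
    """Return a copy of sim where distinct fields of the same form get `penalty`.

    sim: list-of-lists of pairwise similarities.
    form_of: form_of[i] is an identifier of the form that field i belongs to.
    """
    out = [row[:] for row in sim]
    for i in range(len(sim)):
        for j in range(len(sim)):
            if i != j and form_of[i] == form_of[j]:
                out[i][j] = penalty
    return out


# Toy example (made up): fields 0 and 1 come from form G, field 2 from form H.
sim = [
    [1.0, 0.4, 0.7],
    [0.4, 1.0, 0.2],
    [0.7, 0.2, 1.0],
]
adjusted = apply_same_form_penalty(sim, ["G", "G", "H"])
```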

16
Observations
  • Clustering using Single-link distance performs
    badly! Complete-link and Average-link perform
    equally well, but on the whole, Average-link tends
    to win!
  • The perceptron approach to learning the feature
    weights does best compared to uniform/manual
    assignments!
  • Clustering using Average-link and
    perceptron-learned feature weights gave us higher
    accuracy than the SVM-based approach.
  • Feature weights can be used across domains.

17
Thank You