1
A Label-based Metadata for Schema Clustering
  • by
  • Pitsanu Lousangfa, Naiyana Sahavechaphan
  • Large-scale Simulation Research Laboratory
  • National Electronics and Computer Technology
    Center, Thailand

2
Outline
  • Introduction to schema matching
  • Ideal schema matching
  • Label-based metadata
  • Label-based clustering
  • Preliminary experiments
  • Conclusion and future work

3
Information Integration
[Figure: Data Sources A, B, C, D and E]
  • Information integration is affected by
  • data schema (structure)
  • data representation, e.g. XML, text, database
  • data format, e.g. temperature can be expressed in
    °C or °F

4
Schema Matching
Schema 1: Table Cust
  • CId
  • Cname
  • FirstName
  • LastName

Schema 2: Table Customer
  • CustId
  • Company
  • Contact
  • Phone

[Figure: schema matching produces a mapping between Schema 1 and Schema 2]

5
Element-to-Element Comparison
  • Pairwise approach: m x n comparisons
  • Cluster-based approach: < m x n comparisons
[Figure: elements of (source) Schema 1 compared with elements of (target) Schema 2]
6
Filtering-based Approach
Do all elements represent semantically similar (not
the same) information?
7
Filtering-based Approach
Schema1.element1 = Schema2.element1 ?
8
Cluster Metadata
  • An optimal set of clusters
  • Each cluster contains elements representing
    semantically similar (not the same) information
  • An efficient mapping results in good performance
  • Relevant elements are discovered via matched
    clusters
  • Irrelevant elements are filtered out via
    non-matched clusters
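
The filtering effect can be illustrated with a small counting sketch. This is only an illustration: the cluster names, the cluster contents and the `matched` pairs are hypothetical, and the cost of comparing cluster metadata is ignored.

```python
def pairwise_comparisons(source, target):
    """Every source element is compared with every target element (m x n)."""
    return len(source) * len(target)


def clustered_comparisons(source_clusters, target_clusters, matched):
    """Compare elements only inside matched cluster pairs.

    `matched` is a set of (source_cluster, target_cluster) names; how the
    clusters get matched is outside the scope of this sketch.
    """
    return sum(
        len(source_clusters[s]) * len(target_clusters[t]) for s, t in matched
    )


# Hypothetical clusters over the Cust / Customer example schemas.
source_clusters = {"id": ["CId"], "name": ["Cname", "FirstName", "LastName"]}
target_clusters = {
    "id": ["CustId"],
    "party": ["Company", "Contact"],
    "phone": ["Phone"],
}
matched = {("id", "id"), ("name", "party")}

source = [e for members in source_clusters.values() for e in members]
target = [e for members in target_clusters.values() for e in members]

print(pairwise_comparisons(source, target))  # 4 * 4 = 16
print(clustered_comparisons(source_clusters, target_clusters, matched))  # 1*1 + 3*2 = 7
```

Elements in the non-matched "phone" cluster are never compared, which is where the saving comes from in this toy case.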

9
Related Work
  • SemInt (VLDB 1994)
  • Creates cluster metadata based on
  • Element specification, e.g. type, key, length
  • Data value, e.g. min/max value, standard deviation
  • Advantage: offers automatic metadata creation
  • Disadvantage: too few or too many clusters, due to the
    lack of semantic information or the absence of data values
  • Cupid (VLDB 2001)
  • Creates metadata based on pre-defined concepts
    of element labels
  • Advantage: provides appropriate metadata
  • Disadvantage: prevents automatic metadata creation,
    affecting schema clustering

10
Label-based Metadata
  • Distinct-based Metadata Creation (DM)
  • Synonym-based Metadata Creation (SM)
  • Relation-based Metadata Creation (RM)
[Figure: the current metadata M and a to-be-added element label L are passed to DM/SM/RM, which produces the new metadata M']
11
Distinct-based Metadata creation (DM)
  • Definition
  • T_M' = T_M ∪ T_L
  • Description
  • Distinct-based metadata is simply the collection of
    lexically different tokens defined in T_M and T_L

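A minimal sketch of DM in Python. The tokenizer is an assumption: it splits camelCase labels into lower-case tokens, which the slides do not specify, and it does not handle acronyms or digits.

```python
import re


def tokenize(label):
    """Split a camelCase label into lower-case tokens (assumed tokenization)."""
    return {t.lower() for t in re.findall(r"[A-Z]?[a-z]+", label)}


def distinct_metadata(t_m, t_l):
    """DM: T_M' = T_M ∪ T_L, i.e. the lexically distinct tokens of both sets."""
    return set(t_m) | set(t_l)


t_m = tokenize("provinceIdentification")  # tokens of the current metadata
t_l = tokenize("countryIdentification")   # tokens of the label to be added
print(sorted(distinct_metadata(t_m, t_l)))
# ['country', 'identification', 'province']
```
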
12
Synonym-based Metadata creation (SM)
  • Definition
  • T_M' = (T_M ∪ T_L) - S
  • Description
  • A synonym-based metadata is a collection of
    tokens defined in T_M and T_L, where such tokens
    must be lexically different and must not be
    synonyms of each other

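A minimal sketch of SM in Python. Two things are assumptions here: S is taken to be the tokens of T_L that are synonyms of a token already in T_M, and the synonym lookup is a small hypothetical table rather than a real thesaurus.

```python
# Hypothetical synonym pairs; a real system would consult a thesaurus.
SYNONYMS = {frozenset({"department", "division"}), frozenset({"name", "title"})}


def are_synonyms(a, b):
    """True if a and b are different tokens listed as a synonym pair."""
    return a != b and frozenset({a, b}) in SYNONYMS


def synonym_metadata(t_m, t_l):
    """SM: T_M' = (T_M ∪ T_L) - S.

    S is assumed to be the new tokens that are synonyms of a token already
    in the metadata; the slide does not spell S out.
    """
    s = {new for new in t_l if any(are_synonyms(new, old) for old in t_m)}
    return (set(t_m) | set(t_l)) - s


print(sorted(synonym_metadata({"department", "name"}, {"division", "name", "code"})))
# ['code', 'department', 'name']
```
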
13
Relation-based Metadata creation (RM)
  • Definition
  • T_M' = ((T_M ∪ T_L) - S - R) ∪ A
  • Description
  • A relation-based metadata is a collection of
    tokens defined in T_M and T_L, where such tokens
    must be lexically different and must not be
    related to each other

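The RM definition can be written directly as set operations. The transcript does not say how S, R and A are obtained, so this sketch takes them as inputs; the values in the example are placeholders chosen only to exercise the formula.

```python
def relation_metadata(t_m, t_l, s, r, a):
    """RM: T_M' = ((T_M ∪ T_L) - S - R) ∪ A, with S, R and A supplied by the caller."""
    return ((set(t_m) | set(t_l)) - set(s) - set(r)) | set(a)


result = relation_metadata(
    t_m={"province", "identification"},
    t_l={"country", "identification"},
    s=set(),          # placeholder: no tokens removed as synonyms
    r={"country"},    # placeholder: tokens removed as related
    a={"region"},     # placeholder: tokens added
)
print(sorted(result))
# ['identification', 'province', 'region']
```
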
14
Outline
  • Introduction to schema matching
  • Ideal schema matching
  • Label-based metadata
  • Label-based clustering
  • Preliminary experiments
  • Conclusion and future work

15
lCluster Architecture
16
lCluster Algorithm
for each element e in Elements
    if no cluster exists then
        create a new cluster c
        add element e to cluster c
        create a metadata m for cluster c
    else
        for each cluster c in Clusters
            if distance between the metadata of c and the label of e < threshold then
                add element e to cluster c
                create a metadata m for cluster c
        if element e was added to no cluster then
            create a new cluster c
            add element e to cluster c
            create a metadata m for cluster c
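
The loop above can be sketched in Python as follows. Several parts are stand-ins rather than the presenters' method: the camelCase tokenizer, a Jaccard token distance in place of the label-based difference distance, and a DM-style union for the metadata update. The threshold of 0.7 is chosen for the stand-in distance and is unrelated to the 0.15 used in the slides.

```python
import re


def tokenize(label):
    """Split a camelCase label into lower-case tokens (assumed tokenization)."""
    return frozenset(t.lower() for t in re.findall(r"[A-Z]?[a-z]+", label))


def jaccard_distance(metadata, label_tokens):
    """Stand-in distance: 1 - |A ∩ B| / |A ∪ B| (not the slides' measure)."""
    union = metadata | label_tokens
    if not union:
        return 0.0
    return 1.0 - len(metadata & label_tokens) / len(union)


def l_cluster(labels, threshold, distance=jaccard_distance):
    """Assign each label to every cluster whose metadata is close enough."""
    clusters = []  # each cluster: {"members": [labels], "metadata": set of tokens}
    for label in labels:
        tokens = tokenize(label)
        placed = False
        for cluster in clusters:
            if distance(cluster["metadata"], tokens) < threshold:
                cluster["members"].append(label)
                # DM-style metadata update: union of old and new tokens.
                cluster["metadata"] = cluster["metadata"] | tokens
                placed = True
        if not placed:
            # Also covers the very first element, when no cluster exists yet.
            clusters.append({"members": [label], "metadata": set(tokens)})
    return clusters


labels = ["provinceIdentification", "countryIdentification", "department", "division"]
for cluster in l_cluster(labels, threshold=0.7):
    print(cluster["members"])
# ['provinceIdentification', 'countryIdentification']
# ['department']
# ['division']
```

With a purely lexical distance, "department" and "division" stay apart; a linguistic similarity measure could place them together.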

17
lCluster Algorithm Walkthrough
  • Input: provinceIdentification, countryIdentification,
    department, division
  • Threshold: 0.15
  • Note: the example uses RM to create metadata.
20
Label-based Difference Distance Measure
  • Label-based Difference Distance(T_S, T_T) = 1 - sLSim(T_S, T_T)
  • Where
  • T_S: a source element label
  • T_T: a target element label
  • t_s: a source token
  • t_t: a target token
  • |T_S|: the number of source tokens
  • |T_T|: the number of target tokens
  • lingSim: linguistic similarity measure of two given tokens
    (using WordNet::Similarity)
21
Label-based similarity measure (cont.)
  • Example
  • Source: employeeStatusName
  • Target: firstName
  • sLSim = ((0.67 + 0.88 + 1) + (0.88 + 1)) / 5 = 0.89
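
The arithmetic in this example is consistent with a measure that takes, for each token on either side, its best lingSim score against the other label, sums those scores, and divides by |T_S| + |T_T|. The formula itself did not survive in the transcript, so the form below is an assumption. The function name `sl_sim` and the exact-match `ling_sim` used to exercise it are hypothetical.

```python
def sl_sim(source_tokens, target_tokens, ling_sim):
    """Assumed form: (sum of best matches in both directions) / (|T_S| + |T_T|)."""
    best_for_source = [
        max(ling_sim(s, t) for t in target_tokens) for s in source_tokens
    ]
    best_for_target = [
        max(ling_sim(s, t) for s in source_tokens) for t in target_tokens
    ]
    total = sum(best_for_source) + sum(best_for_target)
    return total / (len(source_tokens) + len(target_tokens))


# Exercising the function with a trivial exact-match similarity.
exact = lambda a, b: 1.0 if a == b else 0.0
print(sl_sim(["employee", "status", "name"], ["first", "name"], exact))  # 0.4

# Reproducing the slide's arithmetic from its best-match scores.
slide_value = (sum([0.67, 0.88, 1.0]) + sum([0.88, 1.0])) / 5
print(round(slide_value, 2))      # 0.89
print(round(1 - slide_value, 2))  # difference distance: 0.11
```
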
22
Outline
  • Introduction to schema matching
  • Ideal schema matching
  • Label-based metadata
  • Label-based clustering
  • Preliminary experiments
  • Conclusion and future work

23
Preliminary Experiments
  • Efficiency Measurement
  • Clustering accuracy: F-measure
  • Schema matching performance: the number of
    token comparisons
  • Research domain
  • Source research schema (17 elements)
  • Target research schemas
  • NSTDA research schema (15 elements)
  • NECTEC research schema (38 elements)
  • MTEC research schema (80 elements)
  • BIOTEC research schema (76 elements)

24
The Clustering Accuracy
25
Schema Matching Improvement
26
The Cumulative Distribution of Number of Tokens
27
Conclusion
  • RM is a promising metadata creation method compared
    with DM and SM
  • A label-based metadata provides 60% schema
    clustering accuracy, affecting schema matching
    accuracy
  • WordNet::Similarity provides the maximum
    similarity value, regardless of the right word
    sense
  • Creating metadata based only on the new element
    increases the difference distance between the
    new metadata and existing element members

28
Future Work
  • Improve the metadata creation algorithm to take
    existing element members into account
  • Improve schema clustering by using other
    information such as length, data type and key
  • Extensively evaluate our approach against previous
    approaches and across various information domains

29
Thank you
Q & A