A Label-based Metadata for Schema Clustering

Transcript and Presenter's Notes

Title: A Label-based Metadata for Schema Clustering

1
A Label-based Metadata for Schema Clustering

by
Pitsanu Lousangfa, Naiyana Sahavechaphan
Large-scale Simulation Research Laboratory
National Electronics and Computer Technology Center, Thailand

2
Outline

Introduction to schema matching
Ideal schema matching
Label-based metadata
Label-based clustering
Preliminary experiments
Conclusion and future work

3
Information Integration
[Figure: Data Sources A, B, C, D, and E]

Information integration is affected by:
data schema (structure)
data representation, e.g. XML, text, database
data format, e.g. temperature can be expressed in °C or °F
4
Schema Matching

Schema 1: Table Cust (CId, Cname, FirstName, LastName)
Schema 2: Table Customer (CustId, Company, Contact, Phone)
Schema matching takes Schema 1 and Schema 2 and produces a mapping between their elements.
5
Element-to-Element Comparison
Pairwise approach: (source) Schema 1 against (target) Schema 2, number of comparisons = m x n
Cluster-based approach: number of comparisons < m x n
6
Filtering-based Approach
All elements represent semantically similar (not the same) information
?
7
Filtering-based Approach
Schema1.element1 = Schema2.element1 ?
8
Cluster Metadata

An optimal set of clusters
Each cluster contains elements representing semantically similar (not the same) information
An efficient mapping results in good performance
Relevant elements are discovered via matched clusters
Irrelevant elements are filtered out via non-matched clusters
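
The filtering idea on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the cluster contents, the `matched` cluster pairs, and the function name `candidate_pairs` are all hypothetical.

```python
from itertools import product


def candidate_pairs(source_clusters, target_clusters, matched):
    """Return the element pairs that still need comparing after filtering.

    source_clusters / target_clusters: dict mapping a cluster id to a list of
    element labels. matched: iterable of (source_id, target_id) cluster pairs
    judged to match. All names here are illustrative assumptions.
    """
    pairs = []
    for s_id, t_id in matched:
        # Only elements inside a matched pair of clusters are compared.
        pairs.extend(product(source_clusters[s_id], target_clusters[t_id]))
    return pairs


# Hypothetical clusters built from the Cust / Customer example.
source = {"id": ["CId"], "name": ["Cname", "FirstName", "LastName"]}
target = {"id": ["CustId"], "name": ["Company", "Contact"], "phone": ["Phone"]}

pairs = candidate_pairs(source, target, matched=[("id", "id"), ("name", "name")])
m = sum(len(v) for v in source.values())  # number of source elements
n = sum(len(v) for v in target.values())  # number of target elements
print(len(pairs), "element comparisons instead of", m * n)
```

The count above ignores the cost of matching the clusters themselves, which a full comparison would also have to include.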

9
Related Work

SemInt (VLDB 1994)
Creates cluster metadata based on:
element specification, e.g. type, key, length
data values, e.g. min/max value, standard deviation
Advantage: offers automatic metadata creation
Disadvantage: produces too few or too many clusters due to the lack of semantic information or the absence of data values
Cupid (VLDB 2001)
Creates metadata based on pre-defined concepts of element labels
Advantage: provides appropriate metadata
Disadvantage: prevents automatic metadata creation, affecting schema clustering

10
Label-based Metadata
Distinct-based Metadata Creation (DM)
Synonym-based Metadata Creation (SM)
Relation-based Metadata Creation (RM)
[Figure: the current metadata M and the to-be-added element label L are passed to DM/SM/RM, which produces the new metadata M']
11
Distinct-based Metadata creation (DM)

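Definition
$T_{M'} = T_M \cup T_L$
where M is the current metadata, L is the to-be-added element label, and M' is the new metadata
Description
Distinct-based metadata is simply a collection of lexically different tokens defined in $T_M$ and $T_L$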
12
Synonym-based Metadata creation (SM)

Definition
$T_{M'} = (T_M \cup T_L) - S$
Description
Synonym-based metadata is a collection of tokens defined in $T_M$ and $T_L$, where the tokens must be lexically different and must not be synonyms of each other
13
Relation-based Metadata creation (RM)

Definition
$T_{M'} = ((T_M \cup T_L) - S - R) \cup A$
Description
Relation-based metadata is a collection of tokens defined in $T_M$ and $T_L$, where the tokens must be lexically different and must not be related to each other
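
The three definitions are plain set operations, which the sketch below mirrors. The slides do not define S, R, and A, so the sketch simply takes them as caller-supplied token sets (presumably synonym tokens, related tokens, and tokens to add; that reading is an assumption). The function names and sample tokens are hypothetical.

```python
def distinct_metadata(t_m, t_l):
    """DM: union of the current metadata tokens and the new label's tokens."""
    return set(t_m) | set(t_l)


def synonym_metadata(t_m, t_l, s):
    """SM: the DM result minus the tokens in S (assumed: synonym tokens)."""
    return (set(t_m) | set(t_l)) - set(s)


def relation_metadata(t_m, t_l, s, r, a):
    """RM: the SM result minus R, plus A (both assumed to be given sets)."""
    return ((set(t_m) | set(t_l)) - set(s) - set(r)) | set(a)


# Hypothetical tokens, chosen only to exercise the set operations.
t_m = {"customer", "name"}
t_l = {"client", "name", "phone"}

dm = distinct_metadata(t_m, t_l)
sm = synonym_metadata(t_m, t_l, s={"client"})
rm = relation_metadata(t_m, t_l, s={"client"}, r={"phone"}, a={"contact"})
print(sorted(dm), sorted(sm), sorted(rm))
```

Because each result is a set, lexically identical tokens such as `name` are kept only once, which matches the "lexically different" requirement in all three descriptions.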
14
Outline

Introduction to schema matching
Ideal schema matching
Label-based metadata
Label-based clustering
Preliminary experiments
Conclusion and future work

15
lCluster Architecture
16
lCluster Algorithm

for each element e in Elements
    if no existing cluster then
        create a new cluster c
        add element e into cluster c
        create a metadata m for cluster c
    else
        for each cluster c in Clusters
            if distance between the metadata of c and the label of e < threshold then
                add element e into cluster c
                create a metadata m for cluster c
            else
                create a new cluster c
                add element e into cluster c
                create a metadata m for cluster c

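
A runnable sketch of this loop follows. It is one possible reading of the pseudocode, not the authors' code: it assumes an element joins the first cluster whose metadata is within the threshold, and that a new cluster is created only when no existing cluster qualifies. The `distance` and `make_metadata` callables stand in for the label-based difference distance and for DM/SM/RM. The toy grouping used to exercise it is hypothetical and is not the result shown in the walkthrough slides.

```python
def lcluster(elements, distance, make_metadata, threshold):
    """Incrementally cluster element labels (a sketch under stated assumptions).

    distance(metadata, label) -> float
    make_metadata(old_metadata_or_None, label) -> new metadata
    """
    clusters = []  # each cluster: {"members": [...], "metadata": ...}
    for label in elements:
        for cluster in clusters:
            if distance(cluster["metadata"], label) < threshold:
                cluster["members"].append(label)
                cluster["metadata"] = make_metadata(cluster["metadata"], label)
                break
        else:
            # No cluster is close enough (this also covers the empty case).
            clusters.append(
                {"members": [label], "metadata": make_metadata(None, label)}
            )
    return clusters


# Toy stand-ins: a hand-written grouping replaces the real distance measure.
TOY_GROUP = {
    "provinceIdentification": "place",
    "countryIdentification": "place",
    "department": "unit",
    "division": "unit",
}


def toy_distance(metadata, label):
    return 0.0 if metadata == TOY_GROUP[label] else 1.0


def toy_metadata(old_metadata, label):
    return TOY_GROUP[label]


clusters = lcluster(list(TOY_GROUP), toy_distance, toy_metadata, threshold=0.15)
for c in clusters:
    print(c["metadata"], c["members"])
```

Taking the first qualifying cluster keeps the loop simple; choosing the closest qualifying cluster instead would be an equally plausible reading.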

17
lCluster Algorithm Walkthrough
Input: provinceIdentification, countryIdentification, department, division
Threshold: 0.15
Note: the example uses RM to create metadata.
18
lCluster Algorithm Walkthrough
Input: provinceIdentification, countryIdentification, department, division
Threshold: 0.15
Note: the example uses RM to create metadata.
19
lCluster Algorithm Walkthrough
Input: provinceIdentification, countryIdentification, department, division
Threshold: 0.15
Note: the example uses RM to create metadata.
20
Label-based Difference Distance Measure

Label-based Difference Distance $(T_S, T_T) = 1 - \mathrm{sLSim}(T_S, T_T)$

Where:
$T_S$: a source element label
$T_T$: a target element label
$t_s$: a source token
$t_t$: a target token
$|T_S|$: the number of source tokens
$|T_T|$: the number of target tokens
lingSim: linguistic similarity measure of two given tokens (using WordNetSimilarity)
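
The sLSim formula itself was an image on this slide and is not in the transcript. One form that is consistent with the symbol list above and with the arithmetic of the example on the next slide is given below. Treat it as a reconstruction under that assumption, not as the authors' exact definition.

```latex
\mathrm{sLSim}(T_S, T_T) =
  \frac{\displaystyle\sum_{t_s \in T_S} \max_{t_t \in T_T} \mathrm{lingSim}(t_s, t_t)
        + \sum_{t_t \in T_T} \max_{t_s \in T_S} \mathrm{lingSim}(t_s, t_t)}
       {|T_S| + |T_T|}
```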
21
Label-based similarity measure (cont.)
Example
Source: employeeStatusName
Target: firstName
sLSim = ((0.67 + 0.88 + 1) + (0.88 + 1)) / 5 = 0.89
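
The arithmetic of this example can be checked with a short sketch. It assumes that labels are split into tokens at camelCase boundaries and that each token contributes its best similarity against the other label. Both are inferred from the example, not stated in the slides. The names `tokenize`, `sl_sim`, and `exact_match` are hypothetical, and the per-token scores are copied from the slide without knowing which token pair produced which value.

```python
import re


def tokenize(label):
    """Split a camelCase label into lowercase tokens (assumed convention)."""
    return [t.lower() for t in re.findall(r"[A-Za-z][a-z]*", label)]


def sl_sim(source_tokens, target_tokens, ling_sim):
    """Average of each token's best similarity against the other label."""
    best_source = [max(ling_sim(s, t) for t in target_tokens) for s in source_tokens]
    best_target = [max(ling_sim(s, t) for s in source_tokens) for t in target_tokens]
    total = sum(best_source) + sum(best_target)
    return total / (len(source_tokens) + len(target_tokens))


def exact_match(a, b):
    """Toy stand-in for lingSim: 1.0 for identical tokens, else 0.0."""
    return 1.0 if a == b else 0.0


source_tokens = tokenize("employeeStatusName")  # ['employee', 'status', 'name']
target_tokens = tokenize("firstName")           # ['first', 'name']
toy_score = sl_sim(source_tokens, target_tokens, exact_match)

# Best per-token scores as given on the slide (3 source tokens, 2 target tokens).
slide_score = ((0.67 + 0.88 + 1.0) + (0.88 + 1.0)) / 5
print(round(slide_score, 2))  # 0.89
```

With the slide's value, the label-based difference distance for this pair would be 1 - 0.89 = 0.11.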
22
Outline

Introduction to schema matching
Ideal schema matching
Label-based metadata
Label-based clustering
Preliminary experiments
Conclusion and future work

23
Preliminary Experiments

Efficiency measurement
Clustering accuracy: F-measure
Schema matching performance: the number of token comparisons
Research domain
Source research schema (17 elements)
Target research schemas:
NSTDA research schema (15 elements)
NECTEC research schema (38 elements)
MTEC research schema (80 elements)
BIOTEC research schema (76 elements)
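
For reference, the F-measure named above is conventionally the harmonic mean of precision and recall. The sketch below shows only that standard formula. How precision and recall were computed for clusters in these experiments is not stated in the slides, and the input values are made up.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up inputs, for illustration only.
example = f_measure(0.5, 1.0)
print(round(example, 3))
```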

24
The Clustering Accuracy
25
Schema Matching Improvement
26
The Cumulative Distribution of Number of Tokens
27
Conclusion

RM is a promising metadata approach compared with DM and SM
Label-based metadata provides 60% schema clustering accuracy, which affects schema matching accuracy
WordNetSimilarity returns the maximum similarity value, regardless of the correct word sense
Creating metadata based only on the new element increases the difference distance between the new metadata and the existing element members

28
Future Work

Improve the metadata creation algorithm to take existing element members into account
Improve schema clustering by using other information such as length, data type, and key
Extensively evaluate our approach against previous approaches and across various information domains

29
Thank you
Q & A
