Title: A Label-based Metadata for Schema Clustering
1. A Label-based Metadata for Schema Clustering
- by Pitsanu Lousangfa, Naiyana Sahavechaphan
- Large-scale Simulation Research Laboratory, National Electronics and Computer Technology Center, Thailand
2. Outline
- Introduction to schema matching
- Ideal schema matching
- Label-based metadata
- Label-based clustering
- Preliminary experiments
- Conclusion and future work
3. Information Integration
[Figure: Data Sources A, B, C, D and E]
- Information integration is affected by
- data schema (structure)
- data representation, e.g., XML, text, database
- data format, e.g., temperature can be expressed in °C or °F
4. Schema Matching
- Schema 1, Table Cust: CId, Cname, FirstName, LastName
- Schema 2, Table Customer: CustId, Company, Contact, Phone
- Schema 1 + Schema 2 -> Schema Matching -> Mapping
5. Element-to-Element Comparison
- Pairwise approach: (source) Schema 1 against (target) Schema 2, comparisons = m x n
- Cluster-based approach: (source) Schema 1 against (target) Schema 2, comparisons < m x n
6. Filtering-based Approach
- All elements represent semantically similar (not the same) information
7. Filtering-based Approach
- Schema1.element1 and Schema2.element1: semantically similar?
8. Cluster Metadata
- An optimal set of clusters
- Each cluster contains elements representing semantically similar (not the same) information
- An efficient mapping results in good performance
- Relevant elements are discovered via matched clusters
- Irrelevant elements are filtered out via non-matched clusters
9. Related Work
- SemInt (VLDB 1994)
- Creates cluster metadata based on
- element specification, e.g., type, key, length
- data values, e.g., min/max value, standard deviation
- Advantage
- Offers automatic metadata creation
- Disadvantage
- Too few or too many clusters, due to the lack of semantic information or the absence of data values
- Cupid (VLDB 2001)
- Creates metadata based on pre-defined concepts of element labels
- Advantage
- Provides appropriate metadata
- Disadvantage
- Prohibits automatic metadata creation, affecting schema clustering
10. Label-based Metadata
- Distinct-based Metadata Creation (DM)
- Synonym-based Metadata Creation (SM)
- Relation-based Metadata Creation (RM)
- Each method (DM/SM/RM) takes the current metadata M and the to-be-added element label L, and produces the new metadata M
11. Distinct-based Metadata Creation (DM)
- Definition
- $T_M = T_M \cup T_L$
- Description
- Distinct-based metadata is simply a collection of lexically different tokens defined in $T_M$ and $T_L$
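The DM definition above can be sketched as a plain set union over label tokens. This is a minimal sketch: the camelCase tokenizer and the function names `tokenize` and `distinct_metadata` are assumptions for illustration, not part of the slides.

```python
import re


def tokenize(label: str) -> list[str]:
    """Split a camelCase or snake_case label into lowercase tokens (assumed tokenizer)."""
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", label)
    return [part.lower() for part in parts]


def distinct_metadata(current: set[str], label: str) -> set[str]:
    """DM: the new T_M is the union of the current T_M and the label tokens T_L."""
    return current | set(tokenize(label))


metadata = distinct_metadata(set(), "provinceIdentification")
metadata = distinct_metadata(metadata, "countryIdentification")
print(sorted(metadata))  # ['country', 'identification', 'province']
```

Because the metadata is a set, a token that appears in both labels (here `identification`) is stored only once.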
12. Synonym-based Metadata Creation (SM)
- Definition
- $T_M = (T_M \cup T_L) - S$
- Description
- Synonym-based metadata is a collection of tokens defined in $T_M$ and $T_L$, where such tokens must be lexically different and must not be synonyms of each other
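One possible reading of the SM definition is sketched here. The slide does not define S, so the sketch assumes S holds the incoming label tokens that are synonyms of tokens already kept. The synonym table `SYNONYMS` is a hypothetical stand-in for a thesaurus lookup, and all function names are illustrative.

```python
# Hypothetical synonym pairs; a real system would query a thesaurus instead.
SYNONYMS = {
    frozenset({"customer", "client"}),
    frozenset({"phone", "telephone"}),
}


def are_synonyms(a: str, b: str) -> bool:
    """Return True when the two tokens are listed as a synonym pair."""
    return frozenset({a, b}) in SYNONYMS


def synonym_metadata(current: set[str], label_tokens: set[str]) -> set[str]:
    """SM sketch: union of both token sets, skipping new tokens that have a synonym already kept."""
    merged = set(current)
    for token in sorted(label_tokens):
        if not any(are_synonyms(token, kept) for kept in merged):
            merged.add(token)
    return merged


print(sorted(synonym_metadata({"customer", "id"}, {"client", "name"})))
# ['customer', 'id', 'name']
```

Here `client` is dropped because `customer` is already in the metadata, while `name` is added as a new distinct token.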
13. Relation-based Metadata Creation (RM)
- Definition
- $T_M = ((T_M \cup T_L) - S - R) \cup A$
- Description
- Relation-based metadata is a collection of tokens defined in $T_M$ and $T_L$, where such tokens must be lexically different and not related
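The RM definition can be applied literally as set algebra. The slide does not spell out S, R, or A, so this sketch simply takes them as inputs. The example values (S empty, R holding two related tokens, A holding a replacement token `region`) are hypothetical and chosen only to show the mechanics.

```python
def relation_metadata(
    t_m: set[str], t_l: set[str], s: set[str], r: set[str], a: set[str]
) -> set[str]:
    """RM as written on the slide: ((T_M ∪ T_L) - S - R) ∪ A."""
    return ((t_m | t_l) - s - r) | a


new_metadata = relation_metadata(
    t_m={"province", "identification"},
    t_l={"country", "identification"},
    s=set(),                   # assumed: no synonym tokens in this example
    r={"province", "country"}, # assumed: tokens treated as related
    a={"region"},              # assumed: token added in their place
)
print(sorted(new_metadata))  # ['identification', 'region']
```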
15. lCluster Architecture
16. lCluster Algorithm
for each element e in Elements
    if no cluster exists then
        create a new cluster c
        add element e to cluster c
        create metadata m for cluster c
    else
        for each cluster c in Clusters
            if distance between the metadata of c and the label of e < threshold then
                add element e to cluster c
                create metadata m for cluster c
        if no cluster was within the threshold then
            create a new cluster c
            add element e to cluster c
            create metadata m for cluster c
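The algorithm above can be sketched in Python as follows. This is a sketch under stated assumptions, not the authors' implementation: the distance is a Jaccard distance over tokens standing in for the label-based difference distance, the metadata is built with a DM-style union, and the threshold of 0.7 is a toy value chosen for that stand-in distance. All function names are illustrative.

```python
import re


def tokenize(label: str) -> set[str]:
    """Split a camelCase label into a set of lowercase tokens (assumed tokenizer)."""
    return {p.lower() for p in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", label)}


def jaccard_distance(metadata: set[str], label: str) -> float:
    """Stand-in distance: 1 - |intersection| / |union| of metadata and label tokens."""
    tokens = tokenize(label)
    union = metadata | tokens
    return 1.0 - len(metadata & tokens) / len(union) if union else 0.0


def distinct_metadata(metadata: set[str], label: str) -> set[str]:
    """DM-style metadata update: union of existing tokens and label tokens."""
    return metadata | tokenize(label)


def l_cluster(labels, distance, make_metadata, threshold):
    """Assign each label to every cluster within the threshold, else open a new cluster."""
    clusters = []
    for label in labels:
        placed = False
        for cluster in clusters:
            if distance(cluster["metadata"], label) < threshold:
                cluster["members"].append(label)
                cluster["metadata"] = make_metadata(cluster["metadata"], label)
                placed = True
        if not placed:  # also covers the case where no cluster exists yet
            clusters.append({"members": [label], "metadata": make_metadata(set(), label)})
    return clusters


labels = ["provinceIdentification", "countryIdentification", "department", "division"]
result = l_cluster(labels, jaccard_distance, distinct_metadata, threshold=0.7)
print([c["members"] for c in result])
# [['provinceIdentification', 'countryIdentification'], ['department'], ['division']]
```

As in the pseudocode, an element is added to every cluster that falls within the threshold, so it may end up in more than one cluster.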
17. lCluster Algorithm Walkthrough
- Input: provinceIdentification, countryIdentification, department, division
- Threshold: 0.15
- Note: the example uses RM to create metadata.
20. Label-based Difference Distance Measure
- Label-based Difference Distance$(T_S, T_T) = 1 - sLSim(T_S, T_T)$
- Where
- $T_S$: a source element label
- $T_T$: a target element label
- $t_s$: a source token
- $t_t$: a target token
- $|T_S|$: the number of source tokens
- $|T_T|$: the number of target tokens
- lingSim: linguistic similarity measure of two given tokens (using WordNetSimilarity)
21. Label-based Similarity Measure (cont.)
- Example
- Source: employeeStatusName
- Target: firstName
- sLSim = ((0.67 + 0.88 + 1) + (0.88 + 1)) / 5 = 0.89
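The worked example can be reproduced with a short sketch. The full sLSim formula is not given in the text, so the sketch assumes the form the example suggests: each source token contributes its best lingSim score against the target tokens, each target token contributes its best score against the source tokens, and the total is divided by the number of tokens on both sides. The pairwise scores in `TOY_SCORES` are hypothetical, picked only so that the per-token maxima equal the values in the example; they are not WordNetSimilarity output.

```python
def sl_sim(source_tokens, target_tokens, ling_sim):
    """Assumed sLSim: sum of best matches in both directions over the total token count."""
    best_for_source = sum(max(ling_sim(s, t) for t in target_tokens) for s in source_tokens)
    best_for_target = sum(max(ling_sim(s, t) for s in source_tokens) for t in target_tokens)
    return (best_for_source + best_for_target) / (len(source_tokens) + len(target_tokens))


def label_distance(source_tokens, target_tokens, ling_sim):
    """Label-based difference distance: 1 - sLSim."""
    return 1.0 - sl_sim(source_tokens, target_tokens, ling_sim)


# Hypothetical pairwise scores (source token, target token).
TOY_SCORES = {
    ("employee", "first"): 0.50, ("employee", "name"): 0.67,
    ("status", "first"): 0.88, ("status", "name"): 0.60,
    ("name", "first"): 0.40,
}


def toy_ling_sim(a: str, b: str) -> float:
    """Look up a toy score; identical tokens score 1.0, unknown pairs 0.0."""
    if a == b:
        return 1.0
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), 0.0))


source = ["employee", "status", "name"]  # employeeStatusName
target = ["first", "name"]               # firstName
print(round(sl_sim(source, target, toy_ling_sim), 2))          # 0.89
print(round(label_distance(source, target, toy_ling_sim), 2))  # 0.11
```

With these scores the distance is about 0.11, which would fall below the 0.15 threshold used in the walkthrough.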
23. Preliminary Experiments
- Efficiency measurement
- The clustering accuracy: F-Measure
- The schema matching performance: the number of token comparisons
- Research domain
- Source research schema (17 elements)
- Target research schemas
- NSTDA research schema (15 elements)
- NECTEC research schema (38 elements)
- MTEC research schema (80 elements)
- BIOTEC research schema (76 elements)
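The slides name the F-Measure as the clustering accuracy metric but do not state which variant is used. As one common choice, a pairwise F-measure can be sketched as below: precision and recall are computed over pairs of elements placed in the same cluster, comparing predicted clusters with expected ones. The function names and the toy clusters are assumptions for illustration.

```python
from itertools import combinations


def co_clustered_pairs(clusters):
    """Collect every unordered pair of elements that share a cluster."""
    pairs = set()
    for members in clusters:
        pairs.update(frozenset(pair) for pair in combinations(sorted(members), 2))
    return pairs


def pairwise_f_measure(predicted, expected):
    """F = 2PR / (P + R) over co-clustered element pairs."""
    predicted_pairs = co_clustered_pairs(predicted)
    expected_pairs = co_clustered_pairs(expected)
    if not predicted_pairs or not expected_pairs:
        return 0.0
    true_positives = len(predicted_pairs & expected_pairs)
    precision = true_positives / len(predicted_pairs)
    recall = true_positives / len(expected_pairs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


predicted = [["a", "b", "c"], ["d"]]
expected = [["a", "b"], ["c", "d"]]
print(round(pairwise_f_measure(predicted, expected), 2))  # 0.4
```

In this toy case one of the three predicted pairs is correct (precision 1/3) and one of the two expected pairs is found (recall 1/2), giving F = 0.4.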
24. The Clustering Accuracy
25. Schema Matching Improvement
26. The Cumulative Distribution of Number of Tokens
27. Conclusion
- RM is a promising form of metadata compared with DM and SM
- Label-based metadata provides 60% schema clustering accuracy, which affects schema matching accuracy
- WordNetSimilarity provides the maximum similarity value, regardless of the right word sense
- Creating metadata based only on the new element increases the difference distance between the new metadata and the existing element members
28. Future Work
- Improve the metadata creation algorithm to take existing element members into account
- Improve schema clustering by utilizing other information such as length, data type and key
- Extensively evaluate our approach against previous approaches and across various information domains
29. Thank you
- Q & A