Title: A survey of approaches to automatic schema matching
1 A survey of approaches to automatic schema matching
- Presenter: Pantouvakis Stelios
2 Introduction
- Schema matching is typically done by hand in current implementations
- Drawbacks: time-consuming, effort-consuming, error-prone
- Need for automatic schema matching in
  - data (schema) integration
  - e-business
  - data warehousing
  - semantic query processing
3 The Match operator
- Schema: a set of elements connected by some structure
- Independent of representation (XML, ER model, OO model, directed graph, etc.)
- Mapping: a set of mapping elements (certain elements of S1 mapped to certain elements of S2), plus a mapping expression for each mapping element that specifies how they are related
4 The Match operator
- Mapping expressions may be
  - scalar (=, <)
  - functions (addition, concatenation)
  - ER-style relationships (is-a, part-of)
  - set-oriented relationships (overlaps, contains)
- The symbol ≅ denotes a mapping element without determining its mapping expression
5 The Match operator
- The match operation is a function that takes two schemas S1 and S2 as input and returns a mapping between them (the match result)
- Implementation is similar to Join: for each element of S1, check whether each element of S2 matches, and produce an output. But there are differences:
  - it operates on metadata (schema elements)
  - each element of S1 may match multiple elements of S2
  - many comparison expressions may be used
  - mappings may have multiple mapping expressions
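
The join-like evaluation described on this slide can be sketched as a nested loop. This is only an illustration: representing a schema as a list of element names, the `name_similarity` matcher and the 0.6 threshold are all assumptions made for the example, not part of the survey.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

# Hypothetical types for illustration: a schema is a list of element names,
# and a matcher scores a pair of elements with a similarity in [0, 1].
Schema = List[str]
Matcher = Callable[[str, str], float]


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (illustrative matcher)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match(s1: Schema, s2: Schema, matcher: Matcher,
          threshold: float = 0.6) -> List[Tuple[str, str, float]]:
    """Compare every element of s1 with every element of s2, join-style.

    Unlike a join, one s1 element may be paired with several s2 elements,
    and the output is a list of match candidates, not a final answer.
    """
    candidates = []
    for e1 in s1:
        for e2 in s2:
            score = matcher(e1, e2)
            if score >= threshold:
                candidates.append((e1, e2, score))
    return candidates


cust = ["CustNo", "FirstName", "LastName"]
customer = ["CustID", "Contact", "Name"]
result = match(cust, customer, name_similarity)
```

With these made-up schemas, both `FirstName` and `LastName` come back as candidates for `Name`, which shows how one element of S2 can take part in several candidate pairs.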
6 Example: Mappings
- Mappings may be
  - Cust.C = Customer.CustID
  - Concatenate(Cust.FirstName, Cust.LastName) = Customer.Contact
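
To make the two mapping expressions concrete, here is a minimal sketch of how mapping elements could be stored and evaluated on one record. The `MappingElement` class, the `apply_mapping` helper, the sample row and the choice of a space as separator in `concatenate` are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class MappingElement:
    """Hypothetical representation: source elements, a target, an expression."""
    sources: Tuple[str, ...]
    target: str
    expression: Callable[..., object]


def concatenate(first: str, last: str) -> str:
    # Assumption: the two name parts are joined with a single space.
    return f"{first} {last}"


mapping: List[MappingElement] = [
    # Scalar equality: the value is copied unchanged.
    MappingElement(("Cust.C",), "Customer.CustID", lambda c: c),
    # Function expression over two source elements.
    MappingElement(("Cust.FirstName", "Cust.LastName"),
                   "Customer.Contact", concatenate),
]


def apply_mapping(record: Dict[str, object],
                  elements: List[MappingElement]) -> Dict[str, object]:
    """Evaluate each mapping expression on one source record."""
    return {m.target: m.expression(*(record[s] for s in m.sources))
            for m in elements}


row = {"Cust.C": 17, "Cust.FirstName": "Ada", "Cust.LastName": "Lovelace"}
converted = apply_mapping(row, mapping)
```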
7 Architecture of generic match
8 Architecture of generic match
- In general, it is not possible to determine fully automatically all matches between two schemas
- The implementation of Match should therefore only determine match candidates
- The user has to accept, reject or change them
- The user should be able to specify matches for elements for which the system was unable to find satisfactory match candidates
9 Classification of schema matching approaches
- One Match operator may use multiple matching algorithms (matchers)
- Different matchers work better in different application domains
- The categorization of individual matchers is examined first
10 Classification of schema matching approaches
- Instance vs. schema: matchers can consider instance data or only schema-level information
- Element vs. structure matching: match individual schema elements or combinations of elements
- Language vs. constraint: a matcher can use a linguistic approach or a constraint-based approach
11 Classification of schema matching approaches
- Matching cardinality: a match result may relate multiple elements of the two schemas
- Auxiliary information: matchers may also use dictionaries, global schemas, previous matching decisions and user input
13 Schema-level matchers
- In general
  - Consider schema information such as names, descriptions, data types, relationship types (part-of, is-a, etc.), constraints and schema structure
  - Matchers may find multiple match candidates, attaching to each a degree of similarity in the range 0-1, in order to identify the best candidates
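
Selecting the best candidates from similarity degrees can be sketched as a cutoff followed by a per-element ranking. The `best_candidates` function, the `top_k` and `min_similarity` parameters and all scores below are invented for illustration.

```python
from typing import Dict, List, Tuple

# (S1 element, S2 element, similarity in [0, 1])
Candidate = Tuple[str, str, float]


def best_candidates(candidates: List[Candidate], top_k: int = 2,
                    min_similarity: float = 0.5
                    ) -> Dict[str, List[Tuple[str, float]]]:
    """Keep, for each S1 element, the top_k S2 candidates above a cutoff."""
    grouped: Dict[str, List[Tuple[str, float]]] = {}
    for e1, e2, sim in candidates:
        if sim >= min_similarity:
            grouped.setdefault(e1, []).append((e2, sim))
    # Sort each group by similarity, highest first, and truncate to top_k.
    return {e1: sorted(pairs, key=lambda p: p[1], reverse=True)[:top_k]
            for e1, pairs in grouped.items()}


# Made-up similarity degrees, as a matcher might produce them.
scored = [
    ("Address.ZIP", "CustomerAddress.PostalCode", 0.9),
    ("Address.ZIP", "CustomerAddress.City", 0.3),
    ("Address.Street", "CustomerAddress.Street", 1.0),
    ("Address.Street", "CustomerAddress.City", 0.6),
]
ranked = best_candidates(scored)
```

The ranked list is still only a set of proposals; the user decides which candidates to accept.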
14 Schema-level matchers: 1. Granularity of match (element-level vs. structure-level)
- Element-level matching
  - for each element of S1, determine matching elements in S2
  - may be at the atomic level (attributes) or at a higher level (entities, classes, relational tables), but considers each element in isolation, ignoring its substructure and components
15 Schema-level matchers: 1. Granularity of match (element-level vs. structure-level)
- Structure-level matching
  - matches combinations of elements that appear together in a structure in S1 with combinations of elements in S2
  - full match: complete structures match
  - partial match: some components of each structure match
  - may use equivalence patterns (from a library), e.g. an is-a hierarchy is equivalent to a single structure with a Boolean attribute
16 Example: Full and partial structural match
- Atomic-level match: Address.ZIP ≅ CustomerAddress.PostalCode
17 Example: Equivalence pattern
18 Schema-level matchers: 2. Match cardinality
- Each element of S1 (or S2) may participate in 0, 1 or many mapping elements
- Within an individual mapping element, one or more S1 elements can match one or more S2 elements. Cases are
  - 1:1, 1:n, n:1 (local cardinality)
  - n:m (global cardinality requires structural match)
- Most existing approaches do 1:1 and 1:n
19 Example: Match cardinalities
20 Schema-level matchers: 3. Linguistic approaches
- Matchers use names and text to find semantically similar schema elements
- Need dictionaries (general-purpose, domain- or enterprise-specific, even multilingual)
- These specific dictionaries require much effort to build
- Homonyms may mislead the matcher
21 Schema-level matchers: 3. Linguistic approaches
- Name matching
  - equality of names
  - equality of canonical names (Cust ≅ CustNo)
  - equality of synonyms (make ≅ brand)
  - equality of hypernyms (book is-a publication and article is-a publication imply book ≅ article)
  - similarity based on pronunciation or soundex
  - user-provided name matches (reportsTo ≅ manager)
- May be used for element- or structure-based matchers, or even to match different levels (author.name ≅ AuthorName)
- Not limited to 1:1 matches (phone ≅ homePhone, officePhone)
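
Several of the name criteria above can be chained in one function. The sketch below covers exact equality, canonical names, synonyms and user-provided matches; hypernyms and soundex are left out. The lookup tables, the tokenizing regular expression and the function names are assumptions, and a real matcher would load proper dictionaries instead.

```python
import re
from typing import Dict, FrozenSet, Set

# Assumed lookup tables; a real system would load dictionaries or thesauri.
ABBREVIATIONS: Dict[str, str] = {"cust": "customer", "no": "number",
                                 "num": "number"}
SYNONYMS: Set[FrozenSet[str]] = {frozenset({"make", "brand"})}
USER_MATCHES: Set[FrozenSet[str]] = {frozenset({"reportsto", "manager"})}


def canonical(name: str) -> str:
    """Split on case changes and separators, then expand abbreviations."""
    tokens = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", name)
    return " ".join(ABBREVIATIONS.get(t.lower(), t.lower()) for t in tokens)


def names_match(a: str, b: str) -> bool:
    """Apply the criteria in order: equal, canonical, synonym, user-provided."""
    if a == b:
        return True
    if canonical(a) == canonical(b):
        return True
    pair = frozenset({a.lower(), b.lower()})
    return pair in SYNONYMS or pair in USER_MATCHES


checks = {
    ("CustNo", "Cust_Num"): names_match("CustNo", "Cust_Num"),
    ("make", "brand"): names_match("make", "brand"),
    ("reportsTo", "manager"): names_match("reportsTo", "manager"),
    ("CustNo", "Address"): names_match("CustNo", "Address"),
}
```

Returning a boolean keeps the example short; a matcher that feeds a ranking step would more likely return a similarity in the range 0-1 for each criterion.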
22 Schema-level matchers: 3. Linguistic approaches
- Description matching
  - Use natural-language comments on schema elements to match them
  - this can be as simple as extracting words for synonym comparison
  - or as sophisticated as using natural language understanding technology to find semantically equivalent expressions
Example
23 Schema-level matchers: 4. Constraint-based approaches
- Schemas often contain constraints that define data types and value ranges, uniqueness, optionality, relationship types and cardinalities
- If both schemas have such information, the matcher can use it to match elements
- Used alone, this criterion will produce many false matches
- Still, this approach can be combined with other matchers to limit match candidates
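
A small sketch shows both sides of this trade-off: constraints rule out clearly incompatible elements, but cannot separate elements that share the same constraints. The `Column` class, the three constraints compared and the sample columns are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Hypothetical schema element carrying a few constraints."""
    name: str
    data_type: str
    unique: bool
    nullable: bool


def constraint_similarity(a: Column, b: Column) -> float:
    """Fraction of compared constraints that agree (type, uniqueness, optionality)."""
    checks = [a.data_type == b.data_type,
              a.unique == b.unique,
              a.nullable == b.nullable]
    return sum(checks) / len(checks)


pno = Column("Pno", "int", unique=True, nullable=False)
emp_no = Column("EmpNo", "int", unique=True, nullable=False)
dept_no = Column("DeptNo", "int", unique=True, nullable=False)
birth = Column("BirthDate", "date", unique=False, nullable=True)

sim_emp = constraint_similarity(pno, emp_no)      # all three constraints agree
sim_dept = constraint_similarity(pno, dept_no)    # also all three agree
sim_birth = constraint_similarity(pno, birth)     # no constraint agrees
```

Here `EmpNo` and `DeptNo` tie as candidates for `Pno`, so another matcher is needed to choose between them, while `BirthDate` can be dropped from the candidate list.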
24 Schema-level matchers: 5. Reusing schema and mapping information
- Improve the effectiveness of Match by supporting the reuse of common schema components (schemas from the same domain are often very similar)
- Reusable components range from atomic-level components to entire schema fragments
- Reuse of previously determined mappings: if S has already been matched with S2 and a match between S1 and S2 is needed, S1 can optionally be matched with S instead (if that is easier) and the existing mapping reused
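
Reusing a stored mapping amounts to composing two mappings. The sketch below assumes a mapping is a dictionary from an element to the list of elements it matches; the `compose` function and the element names are made up for the example.

```python
from typing import Dict, List

# element -> matched elements in the other schema
Mapping = Dict[str, List[str]]


def compose(s1_to_s: Mapping, s_to_s2: Mapping) -> Mapping:
    """Derive S1-to-S2 candidates by following S1-to-S, then the stored S-to-S2."""
    result: Mapping = {}
    for e1, mids in s1_to_s.items():
        targets: List[str] = []
        for mid in mids:
            for e2 in s_to_s2.get(mid, []):
                if e2 not in targets:  # avoid duplicate targets
                    targets.append(e2)
        if targets:
            result[e1] = targets
    return result


# Previously determined and stored mapping between S and S2.
stored_s_to_s2 = {"CustomerNo": ["CustID"], "Name": ["Contact"]}
# Newly determined mapping between S1 and S.
new_s1_to_s = {"C_No": ["CustomerNo"], "FirstName": ["Name"],
               "LastName": ["Name"], "Fax": ["FaxNo"]}
derived = compose(new_s1_to_s, stored_s_to_s2)
```

`Fax` has no counterpart in the stored mapping, so it yields no candidate and would have to be matched directly. The derived pairs are still only candidates, since composing two correct matches does not guarantee a correct one.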
25 Example: Reuse of previously determined mappings
26 Instance-level matchers
- Instance-level data can give insight into the contents and meaning of schema elements, using word frequencies, word combinations, value ranges, etc.
- Useful when schema information is limited and when semi-structured data is used
- Even when schema information is available, this approach can help decide between equally plausible matches
27 Instance-level matchers
- Applicable to most of the above approaches, but especially to
  - linguistic approaches
  - constraint-based approaches
  - e.g. a constraint-based matcher may use an instance-level check to choose Pno ≅ EmpNo rather than Pno ≅ DeptNo, based on the value ranges of the three attributes
- Main drawback: the possibly large number of schema elements for which instances have to be evaluated
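
The Pno example can be sketched with a simple value-range comparison. The `range_overlap` measure and all sample values are assumptions for illustration; the slides do not say which instance-level check is used.

```python
from typing import Sequence


def range_overlap(a: Sequence[float], b: Sequence[float]) -> float:
    """Overlap of the two value ranges divided by their combined span (0 to 1)."""
    lo = max(min(a), min(b))
    hi = min(max(a), max(b))
    span = max(max(a), max(b)) - min(min(a), min(b))
    if span == 0:
        return 1.0  # both columns hold the same single value
    return max(0.0, hi - lo) / span


# Made-up instance data for the three attributes.
pno_values = [1001, 1017, 1250, 1999]
emp_no_values = [1000, 1100, 1500, 2000]
dept_no_values = [10, 20, 30]

scores = {
    "EmpNo": range_overlap(pno_values, emp_no_values),
    "DeptNo": range_overlap(pno_values, dept_no_values),
}
best = max(scores, key=scores.get)
```

With this data the ranges of `Pno` and `EmpNo` almost coincide while `DeptNo` does not overlap at all, so `best` is `EmpNo`. The cost grows with the number of elements and instances scanned, which is the drawback noted above.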
28 Combining different matchers
- There are several types of matchers. They can be combined into a single Match operator in two ways:
  - Hybrid matcher, which integrates multiple matching criteria
  - Composite matcher, which combines the results of independently executed matchers (including hybrid matchers)
- Approaches must evaluate the possibility of using criteria simultaneously or in a specific order
29 Combining different matchers: Hybrid matcher
- Typically uses a hard-wired combination of particular matching techniques that are executed simultaneously or in a fixed order
- Better match candidates and better performance than a composite matcher
  - poor match candidates can be filtered out early
  - reduced number of passes
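
A hybrid matcher can be sketched as one pass with a fixed order of criteria. In this illustration the data type check runs first and the name similarity second; the `Element` tuple, the 0.5 threshold and the choice of these two criteria are assumptions.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

# (name, data type); a hypothetical minimal schema element
Element = Tuple[str, str]


def hybrid_match(s1: List[Element], s2: List[Element],
                 threshold: float = 0.5) -> List[Tuple[str, str, float]]:
    """Fixed order in a single pass: type check first, name similarity second."""
    candidates = []
    for name1, type1 in s1:
        for name2, type2 in s2:
            if type1 != type2:
                continue  # cheap constraint check filters poor candidates early
            score = SequenceMatcher(None, name1.lower(), name2.lower()).ratio()
            if score >= threshold:
                candidates.append((name1, name2, score))
    return candidates


s1 = [("Pno", "int"), ("Pname", "str")]
s2 = [("EmpNo", "int"), ("EmpName", "str")]
hybrid_result = hybrid_match(s1, s2)
```

Because the order of the criteria is hard-wired, changing it means changing the code.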
30 Combining different matchers: Composite matcher
- Allows a selection among several matchers
- The user can choose the matchers to be executed, either simultaneously or in a specific order, and the way their results are combined, so that the operator better suits the particular domain
- The composite matcher may determine the selection and order automatically
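
A composite matcher can be sketched as a set of independently executed matchers whose result tables are merged afterwards. The weighted average used here is only one possible combination rule, and the matcher names, weights and sample elements are assumptions.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Tuple

# (name, data type); a hypothetical minimal schema element
Element = Tuple[str, str]
Matcher = Callable[[Element, Element], float]


def name_matcher(a: Element, b: Element) -> float:
    return SequenceMatcher(None, a[0].lower(), b[0].lower()).ratio()


def type_matcher(a: Element, b: Element) -> float:
    return 1.0 if a[1] == b[1] else 0.0


def composite_match(s1: List[Element], s2: List[Element],
                    matchers: Dict[str, Tuple[Matcher, float]]
                    ) -> Dict[Tuple[str, str], float]:
    """Run each selected matcher on its own, then merge by weighted average."""
    pairs = [(a, b) for a in s1 for b in s2]
    # Each matcher produces a complete result table independently.
    results = {label: {(a[0], b[0]): m(a, b) for a, b in pairs}
               for label, (m, _) in matchers.items()}
    total_weight = sum(w for _, w in matchers.values())
    return {(a[0], b[0]): sum(results[label][(a[0], b[0])] * w
                              for label, (_, w) in matchers.items()) / total_weight
            for a, b in pairs}


# The user selects the matchers and how much each one counts.
selected = {"name": (name_matcher, 0.7), "type": (type_matcher, 0.3)}
combined = composite_match([("Pno", "int")],
                           [("EmpNo", "int"), ("EmpName", "str")], selected)
```

Changing the `selected` dictionary swaps matchers or weights without touching the matchers themselves, which is the flexibility this slide describes.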
34 Conclusion
- User interaction is necessary in any case, because the implementation of Match can only determine match candidates, which a user can accept, reject or change
- The more configurable the matcher is, the better the results that can be obtained
- Current implementations have yet to explore a more general view of the problem (independence of schema representation, more criteria for the user to choose among, applicability in various domains)
35 Comments
- If the user must check all matches and has to intervene in most of the matcher's steps, where do we save time and effort by doing the work automatically?
- Time and space complexity of the (multiple) algorithms?