Similarity Flooding


Provided by: yishai


Transcript and Presenter's Notes

1
Similarity Flooding

A Versatile Graph Matching Algorithm
by
Sergey Melnik, Hector Garcia-Molina, Erhard Rahm

2
Introduction: Motivation

Goal: matching elements of related, complex objects
Matching elements of two data schemas
Matching elements of two data instances
Many conceivable uses for object matching
Looking for a generic algorithm with wide applicability

3
Applications

Comparing data schemas
Items from different shopping sites
Merger between two corporations
Preparation of data for data warehousing and analysis processes
Comparing data instances
Bio-informatics
Collaboration: allowing multiple users to edit a program / system

4
Existing Approaches

Comparing SQL: can use type information
Comparing XML: can use hierarchy
Requires domain-specific knowledge and coding
Solution:
Generic algorithm that is agnostic to domain
Structural model: relies on structural similarities to find a matching

5
Part I: Algorithm Framework

General Discussion of Algorithm Input, Output,
and Main Components

6
Algorithm Framework

Input: two objects to match
Representation of objects as graphs: G1 = (V1, E1), G2 = (V2, E2)
Matching between graphs gives a mapping σ: V1 × V2 → ℝ
Filtering of the mapping to obtain a meaningful match
Output: mapping between elements of input objects
Human verification sometimes required

7
Input → Graph → Mapping → Filtering

The input is two objects to be matched
The match will be between sub-elements of the two objects
Each match of sub-elements is scored. High scores indicate a strong similarity
Assumption: objects can be represented as graphs

8
Input → Graph → Mapping → Filtering

Represent objects as directed, labeled graphs
Choose any sensible graph representation (this is domain-specific) that maintains structural information
Structural information in graphs will be used for mapping.
Intuition: similar elements have similar neighbors
G1 = (V1, E1), G2 = (V2, E2)

9
Input → Graph → Mapping → Filtering

We want a mapping σ: V1 × V2 → ℝ
Convenient to normalize such that 0 ≤ σ(v,u) ≤ 1
Begin with an initial mapping function:
Null function: σ(v, u) = 1 for all v in V1, u in V2
String matching function
Other domain-specific function
Perform an iterative fixpoint calculation. Each iteration floods the similarity value σ(v,u) to the neighbors of v and u

10
Input → Graph → Mapping → Filtering

We have a mapping σ: V1 × V2 → ℝ
We are usually not interested in all pairs V1 × V2
Applying filtering functions yields a partial mapping:
Threshold (only when σ(v,u) > some constant)
Wedding (each v mapped to only one u and vice versa)
Result is a useful mapping that matches elements of V1 with elements of V2

11
Part II: An Example - Relational Schemas

An Example Employing the Algorithm to Match Two
Simple Relational Schemas

12
Example: Relational Schemas

Scenario: two relational schemas that describe similar or the same data
Goal: match elements of two given relational schemas
Input: SQL statements for creating each schema
Desired output: a meaningful mapping between the elements of the two schemas

13
Example: Relational Schemas
Input → Graph → Mapping → Filtering

CREATE TABLE Personnel (
Pno int,
Pname string,
Dept string,
Born date,
UNIQUE perskey(Pno)
)
S1

CREATE TABLE Employee (
EmpNo int PRIMARY KEY,
EmpName varchar(50),
DeptNo int REFERENCES Department,
Salary dec(15,2),
Birthdate date
)
CREATE TABLE Department (
DeptNo int PRIMARY KEY,
DeptName varchar(70)
)
S2

14
Example: Relational Schemas

Algorithm script:
G1 = SQLDDL2Graph(S1)
G2 = SQLDDL2Graph(S2)
initialMap = StringMatch(G1, G2)
product = SFJoin(G1, G2, initialMap)
result = SelectThreshold(product)

15
Example: Relational Schemas
Input → Graph → Mapping → Filtering

Any graph representation of schemas can be chosen
Representation should maintain as much
information as possible, in particular structural
information
Example uses Open Information Model (OIM) based
graph representation

16
Example: Relational Schemas
Input → Graph → Mapping → Filtering
17
Example: Relational Schemas
Input → Graph → Mapping → Filtering

Calculate an initial mapping to improve performance
The initial mapping can apply domain knowledge
In this example StringMatch is used:
Compares common prefixes and suffixes of literals
Assumes elements with similar names have similar meaning
Applies to all elements, including elements that are created by the graph representation (e.g. type)
The initial mapping is still far from satisfactory
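The slides do not give the scoring rule used by StringMatch, only that it compares common prefixes and suffixes. The following is a minimal sketch of one such measure, assuming a hypothetical score of "shared prefix plus shared suffix length, divided by the longer name". The function names are illustrative, and the scores are not expected to reproduce the values in the table on the next slide.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Number of leading characters that a and b share."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n


def string_similarity(a: str, b: str) -> float:
    """Hypothetical prefix/suffix similarity in [0, 1] (case-insensitive)."""
    a, b = a.lower(), b.lower()
    if not a or not b:
        return 0.0
    prefix = common_prefix_len(a, b)
    suffix = common_prefix_len(a[::-1], b[::-1])
    # Cap the shared length so identical names are not counted twice.
    shared = min(prefix + suffix, min(len(a), len(b)))
    return shared / max(len(a), len(b))


def initial_map(nodes1, nodes2):
    """Score every node pair; keep only pairs with a positive score."""
    scores = {}
    for x in nodes1:
        for y in nodes2:
            s = string_similarity(x, y)
            if s > 0:
                scores[(x, y)] = s
    return scores


sigma0 = initial_map(["Dept", "Pname", "Column"], ["DeptNo", "EmpName", "Column"])
```

A name-only measure like this one cannot tell that Born and Birthdate are related, which is why the flooding step that follows is needed.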

18
Example: Relational Schemas
Input → Graph → Mapping → Filtering

Top values of similarity mapping σ after StringMatch

σ      Node in G1    Node in G2
1.0    Column        Column
0.66   ColumnType    Column
0.66   Dept          DeptNo
0.66   Dept          DeptName
0.5    UniqueKey     PrimaryKey
0.26   Pname         DeptName
0.26   Pname         EmpName
0.22   date          BirthDate
0.11   Dept          Department
0.06   int           Department

19
Example: Relational Schemas
Input → Graph → Mapping → Filtering

Next step: similarity flooding (SFJoin)
Initial similarity values are taken from the initial mapping
In each iteration, the similarity of two elements affects the similarity of their respective neighbors (e.g. similarity of type names such as string adds to the similarity of columns of the same type)
Iterate until similarity values are stable

20
Example: Relational Schemas
Input → Graph → Mapping → Filtering

After the fixpoint calculation, the mapping σ is filtered to provide a meaningful mapping
The filter operator SelectThreshold removes node pairs for which σ(u,v) < some constant
In this example, the mapping product contained 211 node pairs with positive similarities, which were filtered to a total of 12 node pairs
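A threshold filter of this kind is simple to sketch. The function name `select_threshold`, the cutoff value, and the sample scores below are illustrative assumptions, not the operator's actual implementation or the example's real 211 pairs.

```python
def select_threshold(sigma, threshold):
    """Keep only node pairs whose similarity is at least `threshold`."""
    return {pair: score for pair, score in sigma.items() if score >= threshold}


# Illustrative scores only; the real product holds many more pairs.
product = {
    ("Personnel", "Employee"): 0.81,
    ("int", "int"): 0.44,
    ("UniqueKey", "PrimaryKey"): 0.19,
    ("int", "Department"): 0.06,
}
result = select_threshold(product, threshold=0.15)
# result keeps the first three pairs and drops ("int", "Department")
```

Choosing the cutoff is a judgment call: a low value keeps spurious pairs, a high value discards weak but correct ones.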

21
Example: Relational Schemas

Similarity mapping σ after SelectThreshold

σ      Node in G1            Node in G2
1.0    Column                Column
0.81   Personnel             Employee
0.66   ColType               ColType
0.44   int                   int
0.43   Table                 Table
0.35   date                  date
0.29   UniqueKey perskey     PrimaryKey on EmpNo
0.28   Personnel / Dept      Department / DeptName
0.25   Personnel / Pno       Employee / EmpNo
0.19   UniqueKey             PrimaryKey
0.18   Personnel / Pname     Employee / EmpName
0.17   Personnel / Born      Employee / Birthdate

Legend (node kinds): Table, SQL column type, Column
22
Example: Relational Schemas

Summary of example:
Good results without domain-specific knowledge
Graph representation may vary
Similarity flooding results need to be filtered

23
Part III: Similarity Flooding Calculation

Details of the Similarity Flooding Calculation
Algorithm

24
Similarity Flooding Calculation

Start with directed, labeled graphs A, B
Every edge e in a graph is represented by a triplet (s,p,o): an edge labeled p from s to o
Define the pairwise connectivity graph PCG(A, B)
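The definition itself did not survive in this transcript. A minimal sketch, assuming the usual construction in which PCG(A, B) contains an edge ((x, y), p, (x', y')) whenever (x, p, x') is an edge of A and (y, p, y') is an edge of B, could look like this. The function name and the two small graphs are illustrative.

```python
from collections import defaultdict


def pairwise_connectivity_graph(a_edges, b_edges):
    """Return PCG edges ((x, y), p, (x2, y2)) for every pair of edges
    (x, p, x2) in A and (y, p, y2) in B that carry the same label p."""
    b_by_label = defaultdict(list)
    for s, p, o in b_edges:
        b_by_label[p].append((s, o))

    pcg = []
    for s, p, o in a_edges:
        for s2, o2 in b_by_label[p]:
            pcg.append(((s, s2), p, (o, o2)))
    return pcg


# Two small illustrative graphs, edges given as (source, label, target).
A = [("a", "l1", "a1"), ("a", "l1", "a2"), ("a1", "l2", "a2")]
B = [("b", "l1", "b1"), ("b", "l2", "b2"), ("b2", "l2", "b1")]

pcg = pairwise_connectivity_graph(A, B)
for edge in pcg:
    print(edge)
```

Each PCG node is a candidate match (x, y); an edge between two such nodes says that if the sources are similar, the targets are likely similar as well.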

25
Similarity Flooding Calculation
Pairwise Connectivity Graph Example
26
Similarity Flooding Calculation

Induced propagation graph: add edges in the opposite direction
Edge weights are propagation coefficients. They measure how the similarity propagates to neighbors
One way to calculate weights: each edge type (label) contributes a total of 1.0 outgoing propagation
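The weighting rule above can be sketched as follows, assuming that forward edges and the added reverse edges are counted separately for each label, so that every (node, label, direction) group shares a total weight of 1.0. The function name and the hard-coded PCG edges are illustrative.

```python
from collections import Counter, defaultdict


def induced_propagation_graph(pcg_edges):
    """Return {source_pair: {target_pair: coefficient}}.

    Every PCG edge yields a forward arc and a reverse arc. Arcs leaving a
    node with the same label and direction split a total weight of 1.0.
    """
    arcs = []
    group_size = Counter()
    for src, label, dst in pcg_edges:
        for u, v, direction in ((src, dst, "fwd"), (dst, src, "rev")):
            key = (u, label, direction)
            arcs.append((u, v, key))
            group_size[key] += 1

    weights = defaultdict(dict)
    for u, v, key in arcs:
        weights[u][v] = weights[u].get(v, 0.0) + 1.0 / group_size[key]
    return dict(weights)


# Illustrative PCG edges: ((x, y), label, (x2, y2)).
pcg = [
    (("a", "b"), "l1", ("a1", "b1")),
    (("a", "b"), "l1", ("a2", "b1")),
    (("a1", "b"), "l2", ("a2", "b2")),
    (("a1", "b2"), "l2", ("a2", "b1")),
]
weights = induced_propagation_graph(pcg)
# ("a", "b") has two outgoing l1 arcs, so each carries 0.5;
# ("a1", "b1") has a single reverse l1 arc, so it carries 1.0.
```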

27
Similarity Flooding Calculation
Induced Propagation Graph Example
28
Similarity Flooding Calculation

Similarity measure σ(x,y) ≥ 0 for all x ∈ A and y ∈ B.
We also call σ a mapping
Iterative computation of σ, with propagation in each iteration
σ^i is the mapping after the i-th iteration
σ^0 is the initial mapping
Each iteration computes σ^i based on σ^(i-1) and the propagation graph
Stop when a stable mapping is reached

29
Similarity Flooding Calculation
Propagation from σ^i for the similarity of x and y is the sum of all similarities from neighbors, each multiplied by the corresponding propagation coefficient
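The slide's own formula is missing from this transcript. Reconstructed from the sentence above, the propagated amount can be written as below. Here σ^i is the mapping after iteration i, N(x, y) is assumed to denote the neighbors of the pair (x, y) in the propagation graph, w is the propagation coefficient on the edge into (x, y), and φ is an assumed name for the propagation function.

```latex
\varphi(\sigma^{i})(x, y) \;=\; \sum_{(x', y') \in N(x, y)} \sigma^{i}(x', y') \cdot w\bigl((x', y'), (x, y)\bigr)
```

The next mapping is then obtained by combining this amount with the current and initial mappings and normalizing, so that the largest similarity is 1.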
30
Similarity Flooding Calculation

Many ways to iterate

Choice will aim to achieve high quality and fast
convergence

31
Similarity Flooding Calculation

Basic: each iteration propagates from neighbors. Initial mapping has diminishing effect
A: initial mapping has high importance. Propagation has diminishing effect

32
Similarity Flooding Calculation

B: initial mapping has high importance, recurring in propagation
C: initial mapping and current mapping have identical importance
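The four formulas were images and are missing from this transcript. The sketch below uses the forms given in the Melnik, Garcia-Molina and Rahm paper as best recalled, so treat them as an assumption: Basic is normalize(σ^i + φ(σ^i)), A is normalize(σ^0 + φ(σ^i)), B is normalize(φ(σ^0 + σ^i)), and C is normalize(σ^0 + σ^i + φ(σ^0 + σ^i)). The function names, the Euclidean stopping test and the sample coefficients are illustrative.

```python
def propagate(sigma, weights):
    """phi(sigma): each pair receives neighbor similarity times coefficient."""
    phi = {pair: 0.0 for pair in sigma}
    for src, targets in weights.items():
        for dst, w in targets.items():
            phi[dst] += sigma[src] * w
    return phi


def add(*maps):
    """Pointwise sum of mappings that share the same keys."""
    return {k: sum(m[k] for m in maps) for k in maps[0]}


def normalize(sigma):
    """Scale so that the largest similarity is 1."""
    top = max(sigma.values())
    return {k: v / top for k, v in sigma.items()} if top > 0 else dict(sigma)


def similarity_flooding(sigma0, weights, variant="basic", eps=1e-4, max_iter=100):
    sigma = dict(sigma0)
    for _ in range(max_iter):
        if variant == "basic":
            nxt = add(sigma, propagate(sigma, weights))
        elif variant == "A":
            nxt = add(sigma0, propagate(sigma, weights))
        elif variant == "B":
            nxt = propagate(add(sigma0, sigma), weights)
        elif variant == "C":
            both = add(sigma0, sigma)
            nxt = add(both, propagate(both, weights))
        else:
            raise ValueError(f"unknown variant: {variant}")
        nxt = normalize(nxt)
        # Stop when the Euclidean length of the change is small.
        delta = sum((nxt[k] - sigma[k]) ** 2 for k in sigma) ** 0.5
        sigma = nxt
        if delta < eps:
            break
    return sigma


# Illustrative propagation coefficients for a six-node propagation graph.
weights = {
    ("a", "b"): {("a1", "b1"): 0.5, ("a2", "b1"): 0.5},
    ("a1", "b1"): {("a", "b"): 1.0},
    ("a2", "b1"): {("a", "b"): 1.0, ("a1", "b2"): 1.0},
    ("a1", "b2"): {("a2", "b1"): 1.0},
    ("a1", "b"): {("a2", "b2"): 1.0},
    ("a2", "b2"): {("a1", "b"): 1.0},
}
sigma0 = {pair: 1.0 for pair in weights}  # null initial mapping

basic = similarity_flooding(sigma0, weights, "basic")
variant_c = similarity_flooding(sigma0, weights, "C")
```

With the null initial mapping, the basic variant ranks ("a", "b") highest, while variants that re-inject σ^0 keep weakly connected pairs from fading toward zero.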

33
Part IV: Filtering

Overview of Various Approaches to Filtering of SF
Mapping

34
Filtering

Result of iterations is a mapping σ between all pairs in V1 and V2. We usually want much less information!
Filtering will remove pairs, leaving us with only
the interesting ones
There are many ways to filter. Filter choice is
domain-specific

35
Filtering

Possible filtering directions:
Remove uninteresting pairs according to domain-specific knowledge (e.g. column, table, string from SQL matches) and typing information.
Cardinality considerations: do we want a 1:1 mapping? An n:m mapping?
Threshold: remove matches with low scores

36
Filtering: Cardinality

Cardinality-based filters can use techniques from bipartite graph (marriage) problems
Stable marriage
Assignment problem: max. of Σσ(x,y)
Maximum mapping: max. number of 1:1 matches
Maximal mapping: not contained in any other mapping
Perfect/Complete: all are married
All the above give [0,1]-[0,1] (monogamous) matches, and can be found in polynomial time
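Two of these selection criteria can disagree on the same similarity values. The sketch below assumes a plain Gale-Shapley procedure for the stable marriage and a brute-force search for the assignment problem (suitable only for tiny inputs; polynomial-time algorithms exist). Function names and similarity values are illustrative.

```python
from itertools import permutations


def stable_marriage(sigma, left, right):
    """Gale-Shapley: left elements propose in decreasing order of sigma."""
    prefs = {
        x: sorted(right, key=lambda y: sigma.get((x, y), 0.0), reverse=True)
        for x in left
    }
    next_choice = {x: 0 for x in left}
    engaged = {}  # right element -> left element
    free = list(left)
    while free:
        x = free.pop()
        if next_choice[x] >= len(prefs[x]):
            continue  # x has been rejected by everyone and stays unmatched
        y = prefs[x][next_choice[x]]
        next_choice[x] += 1
        current = engaged.get(y)
        if current is None:
            engaged[y] = x
        elif sigma.get((x, y), 0.0) > sigma.get((current, y), 0.0):
            engaged[y] = x
            free.append(current)
        else:
            free.append(x)
    return {x: y for y, x in engaged.items()}


def best_assignment(sigma, left, right):
    """Brute-force maximum of the summed similarity; needs len(left) <= len(right)."""
    best = max(
        permutations(right, len(left)),
        key=lambda perm: sum(sigma.get((x, y), 0.0) for x, y in zip(left, perm)),
    )
    return dict(zip(left, best))


sigma = {
    ("a1", "b1"): 1.0, ("a1", "b2"): 0.81,
    ("a2", "b1"): 0.54, ("a2", "b2"): 0.27,
}
left, right = ["a1", "a2"], ["b1", "b2"]

stable = stable_marriage(sigma, left, right)    # a1-b1, a2-b2 (sum 1.27)
assigned = best_assignment(sigma, left, right)  # a1-b2, a2-b1 (sum 1.35)
```

The stable marriage gives a1 its best partner at the expense of a2, whereas the assignment solution maximizes the total similarity.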

37
Filtering: Relative Similarity

σ(x,y) is the absolute similarity of x and y
We can also define a relative similarity

Relative similarity is directed. The reverse direction is defined in an analogous manner
Bipartite graph methods can also handle directed graphs
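The definition on this slide did not survive extraction. One formulation consistent with the text, and assumed here, expresses the relative similarity as a fraction of the absolute similarity of the best match candidate for a given element. The arrow superscripts are illustrative notation for the two directions.

```latex
\sigma_{\mathrm{rel}}^{\rightarrow}(x, y) = \frac{\sigma(x, y)}{\max_{y'} \sigma(x, y')},
\qquad
\sigma_{\mathrm{rel}}^{\leftarrow}(x, y) = \frac{\sigma(x, y)}{\max_{x'} \sigma(x', y)}
```

Under this definition the best candidate of every element has relative similarity 1.0 in that element's direction.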

38
Filtering: Threshold

Threshold can be applied to absolute or relative similarities
A useful example: a threshold of t_rel = 1.0 gives a perfectionist egalitarian polygamy, i.e. no man/woman is willing to accept any but the best match
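A minimal sketch of this filter follows, assuming relative similarity is the absolute similarity divided by the best absolute similarity of the same element, computed once per direction. The function names and values are illustrative.

```python
def relative_similarities(sigma):
    """Return (left_to_right, right_to_left) relative similarity maps."""
    best_for_left, best_for_right = {}, {}
    for (x, y), s in sigma.items():
        best_for_left[x] = max(best_for_left.get(x, 0.0), s)
        best_for_right[y] = max(best_for_right.get(y, 0.0), s)
    forward = {
        (x, y): (s / best_for_left[x] if best_for_left[x] > 0 else 0.0)
        for (x, y), s in sigma.items()
    }
    backward = {
        (x, y): (s / best_for_right[y] if best_for_right[y] > 0 else 0.0)
        for (x, y), s in sigma.items()
    }
    return forward, backward


def select_relative_threshold(sigma, t_rel=1.0):
    """Keep pairs whose relative similarity reaches t_rel in both directions."""
    forward, backward = relative_similarities(sigma)
    return {p for p in sigma if forward[p] >= t_rel and backward[p] >= t_rel}


sigma = {
    ("a1", "b1"): 1.0, ("a1", "b2"): 0.81,
    ("a2", "b1"): 0.54, ("a2", "b2"): 0.27,
}
matches = select_relative_threshold(sigma, t_rel=1.0)
# Only ("a1", "b1") survives: each is the other's best candidate.
```

With t_rel = 1.0, a2 and b2 remain unmatched because their best candidates prefer someone else. Ties would let one element keep several partners, hence "polygamy".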

39
Part V: Examples

Examples of Algorithm Application to Various
Problems

40
Example: Change Detection

Goal: change detection in two labeled trees
Original tree T1 was changed to give T2:
Node names were replaced
Subtrees were copied and moved
A new node was inserted
We want the best match for every node of T2
Cardinality constraint: [0,n]-[1,1]

41
Example: Change Detection

Algorithm script:
product = SFJoin(T2, T1)
result = SelectLeft(product)

42
Example: Change Detection

No initial mapping
SelectLeft operator selects best absolute match
for each element in left argument
Results can also provide hints on type of change
that was performed!
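The behaviour described for SelectLeft can be sketched as below. The function name `select_left`, the node names and the scores are illustrative assumptions rather than the operator's actual implementation.

```python
def select_left(sigma):
    """For each left element, keep the right element with the highest
    absolute similarity. Returns {left: (right, score)}."""
    best = {}
    for (left, right), score in sigma.items():
        if left not in best or score > best[left][1]:
            best[left] = (right, score)
    return best


# Hypothetical scores for nodes of the new tree T2 (left) against T1 (right).
product = {
    ("t2.title", "t1.name"): 0.72,
    ("t2.title", "t1.id"): 0.10,
    ("t2.note", "t1.name"): 0.05,
    ("t2.note", "t1.id"): 0.04,
}
result = select_left(product)
# "t2.title" maps confidently to "t1.name"; "t2.note" only finds a weak match.
```

A left node whose best score is very low, such as `t2.note` here, is a plausible hint that the node was newly inserted.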

43
Example: Change Detection
44
Example: Matching Schemas Using Instance Data

Goal: match two XML Schemas using instance data
Two XML product descriptions from two shopping
websites
We want to use the instance data to match the XML
schemas

45
Example: Matching Schemas Using Instance Data
46
Example: Matching Schemas Using Instance Data

Algorithm script:
G1 = XML2DOMGraph(db1)
G2 = XML2DOMGraph(db2)
initialMap = StringMatch(G1, G2)
product = SFJoin(G1, G2, initialMap)
result = XMLMapFilter(product, G1, G2)
The only new piece of code is the XMLMapFilter operator

47
Example: Schemas, Instance Data
48
Part VI: Analysis

Match Quality, Algorithm Complexity, Convergence
and Limitations

49
Match Quality

Assessing match quality is difficult
Human verification and tuning of matching is
often required
A useful metric would be to measure the amount of
human work required to reach the perfect match
Recall: how many of the good matches did we show?
Precision: how many of the matches we show are good?
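Both measures reduce to set arithmetic once a human-verified reference mapping is available. In this minimal sketch the function name and the sample pairs are illustrative.

```python
def precision_recall(proposed, intended):
    """Precision: share of proposed matches that are good.
    Recall: share of good matches that were proposed."""
    proposed, intended = set(proposed), set(intended)
    hits = proposed & intended
    precision = len(hits) / len(proposed) if proposed else 0.0
    recall = len(hits) / len(intended) if intended else 0.0
    return precision, recall


proposed = {("Pno", "EmpNo"), ("Pname", "EmpName"), ("Dept", "DeptName"), ("Born", "Salary")}
intended = {("Pno", "EmpNo"), ("Pname", "EmpName"), ("Dept", "DeptNo"), ("Born", "Birthdate")}

precision, recall = precision_recall(proposed, intended)
# Two of four proposed pairs are good, and two of four good pairs were found.
```

Neither number directly measures the human effort needed to repair the result, which is the metric the slide suggests would be more useful.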

50
Convergence

Fixpoint iterations are an eigenvector
computation for the matrix that corresponds to
the propagation graph
Computation converges iff graph is strongly
connected
To achieve this we use dampening: use σ^0 in the fixpoint formula, where σ^0(x,y) > 0 for all x,y
Convergence rate depends on spectral radius of
the matrix, and can be improved by high dampening
values

51
Convergence

In many cases we are only interested in the order of map pairs, and not the absolute values of σ.
The order usually stabilizes before the actual values do
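An order-based stopping test can be sketched by comparing the ranking of pairs between two successive iterations. The function names and sample values are illustrative assumptions.

```python
def ranking(sigma):
    """Pairs sorted by decreasing similarity; ties broken by pair name."""
    return sorted(sigma, key=lambda pair: (-sigma[pair], pair))


def order_is_stable(previous, current):
    """True if both mappings rank their pairs identically."""
    return ranking(previous) == ranking(current)


previous = {("a", "b"): 1.0, ("a2", "b1"): 0.80, ("a1", "b2"): 0.60}
current = {("a", "b"): 1.0, ("a2", "b1"): 0.91, ("a1", "b2"): 0.69}
swapped = {("a", "b"): 1.0, ("a2", "b1"): 0.50, ("a1", "b2"): 0.69}

stable_now = order_is_stable(previous, current)      # values moved, order did not
stable_swapped = order_is_stable(previous, swapped)  # two pairs changed places
```

In practice one might require the order to stay unchanged for several consecutive iterations before stopping.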

52
Complexity

Usually 5-30 iterations
Each iteration is O(|E|) (edges in propagation graph)
|E| = O(|E1|·|E2|)
|E1| = O(|V1|^2) if G1 is highly connected
|E2| = O(|V2|^2) if G2 is highly connected
Worst case of each iteration is O(|V1|^2·|V2|^2)
Average case of each iteration is O(|V1|·|V2|)

53
Limitations

Algorithm requires representation as directed,
labeled graph
Degrades when edges are unlabeled or undirected
Degrades when labeling is more uniform
Assumes structural adjacency contributes to
similarity
Will not work for matching HTML
Requires matched objects to be of same type and
with same graph representation

54
Limitations

Algorithm cannot utilize order and aggregation information (e.g. for XML)
Order: the order of sub-elements within an element
Aggregation: an element containing an array of sub-elements

55
Part VII: Variability and Applications

Discussion of Algorithm Variability Areas and
Possible Applications

56
Variability in Algorithm

Graph representation of input objects
Calculation of propagation coefficients
Initial mapping function
Iteration formula
Filtering function

57
Graph Representation

Graph representation of input objects is arbitrary: sub-elements can be modeled as nodes, edges, or both.
On one hand:
Richer graph captures more structure information
Type information about sub-elements can be
modeled
On the other hand:
Larger graphs mean longer computation
Rich graph often implies more uniform labeling

58
Propagation Coefficients

Propagation coefficients can be calculated in
many ways
Sum of all outgoing edges is 1.0
Equal weight (1.0) for all edges
Sum of all outgoing edges of label p is 1.0
Sum of all incoming edges is 1.0
Label-specific weight allocation
Etc.

59
Initial Mapping Function

Initial mapping can improve performance and help
convergence
Initial mapping function can be naïve, or it can
employ domain-specific knowledge

60
Iteration Formula

Each iteration calculates σ^(i+1) from σ^i, σ^0, and φ(σ^i)
The iteration formula can vary, giving different weight and effect to these components
Example: if the initial mapping is good, give higher weight to σ^0
Formula affects convergence speed as well as
resultant mapping

61
Filtering Function

Results of iterations require filtering to become
a meaningful mapping
Many approaches to filtering are possible, as
discussed
Choice usually stems from graph representation
and specific goal. For example
If graphs contain many type-related nodes, they
can be pruned from results
If goal is to detect changes, we want a match for
each element of the newer object

62
Applications

There are many possible applications besides the
ones described
Comparing websites
Old vs. new versions of website
Two websites with information about same subject
Structural information gained from containment
and links

63
Applications

Natural language processing and speech
recognition
Match given sentence to XML template
Match two text segments that refer to the same
subject
Finding self-similarities and related data items
by running SFJoin(G,G)
Preparation of data and schemas for data
warehousing and data mining
Canonization of data and meta-data

64
Semantic Interpretation - Example

For example (1st approach), the user utterance
"I would like a medium coca cola and a large pizza with pepperoni and mushrooms."
could be converted to the following semantic result:
drink
  beverage "coke"
  drinksize "medium"
pizza
  pizzasize "large"
  topping "pepperoni", "mushrooms"