Similarity Flooding - PowerPoint PPT Presentation

About This Presentation
Title:

Similarity Flooding

Slides: 67
Provided by: yishai
Transcript and Presenter's Notes

Title: Similarity Flooding


1
Similarity Flooding
  • A Versatile Graph Matching Algorithm
  • by
  • Sergey Melnik, Hector Garcia-Molina, Erhard Rahm

2
Introduction: Motivation
  • Goal: matching elements of related, complex
    objects
  • Matching elements of two data schemas
  • Matching elements of two data instances
  • Many conceivable uses for object matching
  • Looking for a generic algorithm with wide
    applicability

3
Applications
  • Comparing data schemas
  • Items from different shopping sites
  • Merger between two corporations
  • Preparation of data for data warehousing and
    analyzing processes
  • Comparing data instances
  • Bio-informatics
  • Collaboration allowing multiple users to edit a
    program / system

4
Existing Approaches
  • Comparing SQL: can use type information
  • Comparing XML: can use hierarchy
  • Requires domain-specific knowledge and coding
  • Solution:
  • Generic algorithm that is agnostic to domain
  • Structural model: relies on structural
    similarities to find a matching

5
Part I: Algorithm Framework
  • General Discussion of Algorithm Input, Output,
    and Main Components

6
Algorithm Framework
  • Input: two objects to match
  • Representation of objects as graphs
  • G1 = (V1, E1), G2 = (V2, E2)
  • Matching between graphs gives mapping
  • σ: V1 × V2 → ℝ
  • Filtering of mapping to obtain meaningful match
  • Output: mapping between elements of input objects
  • Human verification sometimes required

7
Input → Graph → Mapping → Filtering
  • Inputs are two objects to be matched
  • Match will be between sub-elements of the two
    objects
  • Matches of sub-elements will be scored. High scores
    indicate a strong similarity
  • Assumption: objects can be represented as graphs

8
Input → Graph → Mapping → Filtering
  • Represent objects as directed, labeled graphs
  • Choose any sensible graph representation (this is
    domain-specific) that maintains structural
    information
  • Structural information in graphs will be used for
    mapping.
  • Intuition: similar elements have similar
    neighbors
  • G1 = (V1, E1), G2 = (V2, E2)

9
Input → Graph → Mapping → Filtering
  • We want a mapping σ: V1 × V2 → ℝ
  • Convenient to normalize such that 0 ≤ σ(v,u) ≤ 1
  • Begin with initial mapping function
  • Null function: σ(v, u) = 1 for all v in V1, u in
    V2
  • String matching function
  • Other domain-specific function
  • Perform an iterative fixpoint calculation. Each
    iteration floods the similarity value σ(v,u) to
    the neighbors of v and u

10
Input → Graph → Mapping → Filtering
  • We have a mapping σ: V1 × V2 → ℝ
  • We are usually not interested in all pairs V1 × V2
  • Applying filtering functions yields a partial
    mapping
  • Threshold (only when σ(v,u) > some constant)
  • Wedding (each v mapped to only one u and vice
    versa)
  • Result is a useful mapping that matches elements
    of V1 with elements of V2

11
Part II: An Example - Relational Schemas
  • An Example Employing the Algorithm to Match Two
    Simple Relational Schemas

12
Example: Relational Schemas
  • Scenario: two relational schemas that describe
    similar or same data
  • Goal: match elements of two given relational
    schemas
  • Input: SQL statements for creating each schema
  • Desired output: a meaningful mapping between the
    elements of the two schemas

13
Example: Relational Schemas
Input → Graph → Mapping → Filtering
  • S1:
      CREATE TABLE Personnel (
        Pno int,
        Pname string,
        Dept string,
        Born date,
        UNIQUE perskey(Pno)
      )
  • S2:
      CREATE TABLE Employee (
        EmpNo int PRIMARY KEY,
        EmpName varchar(50),
        DeptNo int REFERENCES Department,
        Salary dec(15,2),
        Birthdate date
      )
      CREATE TABLE Department (
        DeptNo int PRIMARY KEY,
        DeptName varchar(70)
      )

14
Example: Relational Schemas
  • Algorithm script:
  • G1 = SQLDDL2Graph(S1)
  • G2 = SQLDDL2Graph(S2)
  • initialMap = StringMatch(G1, G2)
  • product = SFJoin(G1, G2, initialMap)
  • result = SelectThreshold(product)

15
Example: Relational Schemas
Input → Graph → Mapping → Filtering
  • Any graph representation of schemas can be chosen
  • Representation should maintain as much
    information as possible, in particular structural
    information
  • Example uses Open Information Model (OIM) based
    graph representation

16
Example: Relational Schemas
Input → Graph → Mapping → Filtering

17
Example: Relational Schemas
Input → Graph → Mapping → Filtering
  • Calculate initial mapping to improve performance
  • Initial mapping can apply domain knowledge
  • In this example, StringMatch is used
  • Compares common prefixes and suffixes of literals
  • Assumes elements with similar names have similar
    meaning
  • Applies to all elements, including elements that
    are created by the graph representation (e.g.
    type)
  • Initial mapping still far from satisfactory
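
The slide does not give the scoring formula behind StringMatch, so the following is only a minimal sketch of a prefix/suffix comparison. The function names and the scoring rule (shared prefix plus suffix length, divided by the length of the longer name) are assumptions made for illustration, and the scores will not reproduce the similarity values reported for this example.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n


def string_match(a: str, b: str) -> float:
    """Hypothetical name similarity in [0, 1] from shared prefix and suffix."""
    a, b = a.lower(), b.lower()
    if not a or not b:
        return 0.0
    prefix = common_prefix_len(a, b)
    suffix = common_prefix_len(a[::-1], b[::-1])
    # Cap the overlap so identical names do not count their characters twice.
    shared = min(prefix + suffix, len(a), len(b))
    return shared / max(len(a), len(b))


def initial_map(names1, names2):
    """Score every pair of node names and keep only positive similarities."""
    scores = {}
    for n1 in names1:
        for n2 in names2:
            s = string_match(n1, n2)
            if s > 0:
                scores[(n1, n2)] = s
    return scores


sigma0 = initial_map(["Pno", "Pname", "Dept"], ["EmpNo", "EmpName", "DeptNo"])
for pair, score in sorted(sigma0.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

Pairs that share neither a prefix nor a suffix are simply left out of the map, which keeps the initial mapping sparse.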

18
Example: Relational Schemas
Input → Graph → Mapping → Filtering
Top values of similarity mapping σ after StringMatch

σ     Node in G1   Node in G2   |  σ     Node in G1   Node in G2
1.0   Column       Column       |  0.26  Pname        DeptName
0.66  ColumnType   Column       |  0.26  Pname        EmpName
0.66  Dept         DeptNo       |  0.22  date         BirthDate
0.66  Dept         DeptName     |  0.11  Dept         Department
0.5   UniqueKey    PrimaryKey   |  0.06  int          Department

19
Example: Relational Schemas
Input → Graph → Mapping → Filtering
  • Next step: similarity flooding (SFJoin)
  • Initial similarity values taken from initial
    mapping
  • In each iteration, the similarity of two elements
    affects the similarity of their respective
    neighbors (e.g. similarity of type names such as
    string adds to the similarity of columns of the
    same type)
  • Iterate until similarity values are stable

20
Example: Relational Schemas
Input → Graph → Mapping → Filtering
  • After fixpoint calculation, the mapping σ is
    filtered to provide a meaningful mapping
  • The filter operator SelectThreshold removes node
    pairs for which σ(u,v) < some constant
  • In this example, the mapping product contained
    211 node pairs with positive similarities, which
    were filtered to a total of 12 node pairs

21
Example: Relational Schemas
Similarity mapping σ after SelectThreshold

σ     Node in G1   Node in G2   |  σ     Node in G1          Node in G2
1.0   Column       Column       |  0.29  UniqueKey perskey   PrimaryKey on EmpNo
0.81  Personnel    Employee     |  0.28  Personnel / Dept    Department / DeptName
0.66  ColType      ColType      |  0.25  Personnel / Pno     Employee / EmpNo
0.44  int          int          |  0.19  UniqueKey           PrimaryKey
0.43  Table        Table        |  0.18  Personnel / Pname   Employee / EmpName
0.35  date         date         |  0.17  Personnel / Born    Employee / Birthdate

22
Example: Relational Schemas
  • Summary of example
  • Good results without domain-specific knowledge
  • Graph representation may vary
  • Similarity flooding results need to be filtered

23
Part III: Similarity Flooding Calculation
  • Details of the Similarity Flooding Calculation
    Algorithm

24
Similarity Flooding Calculation
  • Start with directed, labeled graphs A, B
  • Every edge e in a graph is represented by a
    triplet (s, p, o): an edge labeled p from s to o
  • Define pairwise connectivity graph PCG(A, B)
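
The formal definition did not survive in this transcript. Assuming the usual construction, in which the PCG contains an edge ((x, y), p, (x', y')) exactly when (x, p, x') is an edge of A and (y, p, y') is an edge of B, a minimal sketch could look like this. The two small graphs and the function name are made up for illustration.

```python
def pairwise_connectivity_graph(edges_a, edges_b):
    """Combine edges of A and B that carry the same label.

    Each input edge is a triple (source, label, target). Each result edge
    connects the node pair (x, y) to the node pair (x2, y2).
    """
    pcg = set()
    for x, p, x2 in edges_a:
        for y, q, y2 in edges_b:
            if p == q:
                pcg.add(((x, y), p, (x2, y2)))
    return pcg


# Illustrative graphs with two edge labels, l1 and l2.
A = [("a", "l1", "a1"), ("a", "l1", "a2"), ("a1", "l2", "a2")]
B = [("b", "l1", "b1"), ("b", "l2", "b2"), ("b2", "l2", "b1")]

pcg = pairwise_connectivity_graph(A, B)
for edge in sorted(pcg):
    print(edge)
```

Only node pairs that take part in at least one same-label edge pair appear in the result, so the PCG is usually much smaller than the full product V1 × V2.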

25
Similarity Flooding Calculation
Pairwise Connectivity Graph Example
26
Similarity Flooding Calculation
  • Induced propagation graph: add edges in the opposite
    direction
  • Edge weights are propagation coefficients. They
    measure how the similarity propagates to
    neighbors
  • One way to calculate weights: each edge type
    (label) contributes a total of 1.0 outgoing
    propagation
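
As a rough sketch of that weighting rule, the code below splits a total of 1.0 equally among the same-label edges leaving a node pair, and does the same for the added reverse edges. The edge list and the function name are illustrative assumptions, not taken from the slides.

```python
from collections import Counter


def induced_propagation_graph(pcg_edges):
    """Return {(source_pair, target_pair): coefficient} for both directions."""
    # Count same-label edges leaving each source and entering each target.
    out_count = Counter((src, label) for src, label, _ in pcg_edges)
    in_count = Counter((dst, label) for _, label, dst in pcg_edges)

    weights = {}
    for src, label, dst in pcg_edges:
        forward = 1.0 / out_count[(src, label)]
        backward = 1.0 / in_count[(dst, label)]
        weights[(src, dst)] = weights.get((src, dst), 0.0) + forward
        weights[(dst, src)] = weights.get((dst, src), 0.0) + backward
    return weights


# Illustrative pairwise connectivity graph: (source pair, label, target pair).
pcg_edges = [
    (("a", "b"), "l1", ("a1", "b1")),
    (("a", "b"), "l1", ("a2", "b1")),
    (("a1", "b"), "l2", ("a2", "b2")),
    (("a1", "b2"), "l2", ("a2", "b1")),
]

weights = induced_propagation_graph(pcg_edges)
for (src, dst), w in sorted(weights.items()):
    print(src, "->", dst, w)
```

Here the pair (a, b) has two outgoing l1 edges, so each gets 0.5, while every reverse edge is the only one of its label and gets 1.0.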

27
Similarity Flooding Calculation
Induced Propagation Graph Example
28
Similarity Flooding Calculation
  • Similarity measure σ(x,y) ≥ 0 for all x ∈ A and y ∈ B.
    We also call σ a mapping
  • Iterative computation of σ, with propagation in
    each iteration
  • σ^i is the mapping after the ith iteration
  • σ^0 is the initial mapping
  • Each iteration computes σ^i based on σ^(i-1) and the
    propagation graph
  • Stop when a stable mapping is reached

29
Similarity Flooding Calculation
Propagation from σ^i for the similarity of x and y is
the sum of all similarities from neighbors, each
multiplied by the propagation coefficients
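
The formula itself is not preserved in this transcript. The sketch below assumes the basic update σ^(i+1) = normalize(σ^i + φ(σ^i)), where φ is the weighted sum over neighbors described above and normalize divides by the largest value. The propagation coefficients, tolerance, and function names are illustrative assumptions.

```python
def propagate(sigma, weights):
    """phi(sigma): sum of neighbor similarities times propagation coefficients."""
    phi = {pair: 0.0 for pair in sigma}
    for (src, dst), w in weights.items():
        phi[dst] += sigma[src] * w
    return phi


def normalize(sigma):
    """Scale all values so that the largest similarity becomes 1.0."""
    top = max(sigma.values())
    if top <= 0:
        return dict(sigma)
    return {pair: value / top for pair, value in sigma.items()}


def similarity_flooding(sigma0, weights, max_iter=100, eps=1e-6):
    """Iterate the assumed basic update until the mapping stops changing."""
    sigma = dict(sigma0)
    for _ in range(max_iter):
        phi = propagate(sigma, weights)
        nxt = normalize({pair: sigma[pair] + phi[pair] for pair in sigma})
        delta = sum((nxt[p] - sigma[p]) ** 2 for p in sigma) ** 0.5
        sigma = nxt
        if delta < eps:
            break
    return sigma


# Illustrative propagation graph over six node pairs.
ab, a1b1, a2b1 = ("a", "b"), ("a1", "b1"), ("a2", "b1")
a1b2, a1b, a2b2 = ("a1", "b2"), ("a1", "b"), ("a2", "b2")
weights = {
    (ab, a1b1): 0.5, (ab, a2b1): 0.5,
    (a1b1, ab): 1.0, (a2b1, ab): 1.0,
    (a1b2, a2b1): 1.0, (a2b1, a1b2): 1.0,
    (a1b, a2b2): 1.0, (a2b2, a1b): 1.0,
}
sigma0 = {pair: 1.0 for pair in (ab, a1b1, a2b1, a1b2, a1b, a2b2)}

result = similarity_flooding(sigma0, weights)
for pair, value in sorted(result.items(), key=lambda kv: -kv[1]):
    print(pair, round(value, 2))
```

Starting from the null mapping (all ones), the ranking of pairs is driven entirely by graph structure.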
30
Similarity Flooding Calculation
  • Many ways to iterate
  • Choice will aim to achieve high quality and fast
    convergence

31
Similarity Flooding Calculation
  • Basic: each iteration propagates from neighbors.
    Initial mapping has diminishing effect
  • A: initial mapping has high importance.
    Propagation has diminishing effect

32
Similarity Flooding Calculation
  • B: initial mapping has high importance, recurring
    in propagation
  • C: initial mapping and current mapping have
    identical importance
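
The formulas for these variants are missing from the transcript. The sketch below assumes they take the forms recalled from the published Similarity Flooding paper: Basic = normalize(σ^i + φ(σ^i)), A = normalize(σ^0 + φ(σ^i)), B = normalize(φ(σ^0 + σ^i)), C = normalize(σ^0 + σ^i + φ(σ^0 + σ^i)). The two-pair graph and all function names are hypothetical.

```python
def add(s, t):
    """Pointwise sum of two mappings defined over the same pairs."""
    return {k: s[k] + t[k] for k in s}


def normalize(sigma):
    """Scale all values so that the largest similarity becomes 1.0."""
    top = max(sigma.values())
    if top <= 0:
        return dict(sigma)
    return {k: v / top for k, v in sigma.items()}


def make_phi(weights):
    """Build the propagation function phi for a given propagation graph."""
    def phi(sigma):
        out = {k: 0.0 for k in sigma}
        for (src, dst), w in weights.items():
            out[dst] += sigma[src] * w
        return out
    return phi


def step(variant, sigma0, sigma, phi):
    """One fixpoint iteration under the assumed formula variants."""
    if variant == "basic":
        raw = add(sigma, phi(sigma))
    elif variant == "A":
        raw = add(sigma0, phi(sigma))
    elif variant == "B":
        raw = phi(add(sigma0, sigma))
    elif variant == "C":
        both = add(sigma0, sigma)
        raw = add(both, phi(both))
    else:
        raise ValueError(f"unknown variant: {variant}")
    return normalize(raw)


# Tiny illustrative propagation graph over two node pairs p and q.
phi = make_phi({("p", "q"): 1.0, ("q", "p"): 0.5})
sigma0 = {"p": 1.0, "q": 0.5}

first_step = {v: step(v, sigma0, sigma0, phi) for v in ("basic", "A", "B", "C")}
for variant, sigma in first_step.items():
    print(variant, sigma)
```

Only the first iteration is shown. Variants Basic and A coincide on that step because σ^i still equals σ^0, and they diverge afterwards.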

33
Part IV: Filtering
  • Overview of Various Approaches to Filtering of SF
    Mapping

34
Filtering
  • Result of iterations is a mapping σ between all
    pairs in V1 and V2. We usually want much less
    information!
  • Filtering will remove pairs, leaving us with only
    the interesting ones
  • There are many ways to filter. Filter choice is
    domain-specific

35
Filtering
  • Possible filtering directions
  • Remove uninteresting pairs according to
    domain-specific knowledge (e.g. column,
    table, string from SQL matches) and typing
    information.
  • Cardinality considerations: do we want a 1:1
    mapping? An n:m mapping?
  • Threshold: remove matches with low scores

36
Filtering: Cardinality
  • Cardinality-based filters can use techniques from
    bipartite graph (marriage) problems
  • Stable marriage
  • Assignment problem: maximize Σ σ(x,y)
  • Maximum mapping: maximum number of 1:1 matches
  • Maximal mapping: not contained in any other mapping
  • Perfect/complete: all are married
  • All the above give [0,1]-[0,1] (monogamous)
    matches, and can be found in polynomial time
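
As one possible illustration of a cardinality-based filter, here is a minimal sketch of the stable marriage selection using the classic Gale-Shapley proposal scheme, with similarity values acting as preferences on both sides. The similarity numbers and the function name are made up for the example.

```python
def stable_marriage(sigma, left, right):
    """Return a set of 1:1 pairs (x, y) that is stable with respect to sigma."""
    # Each left node ranks right nodes by decreasing similarity.
    prefs = {
        x: sorted(right, key=lambda y: -sigma.get((x, y), 0.0)) for x in left
    }
    next_choice = {x: 0 for x in left}
    partner_of = {}  # right node -> currently accepted left node
    free = list(left)

    while free:
        x = free.pop()
        if next_choice[x] >= len(prefs[x]):
            continue  # x has proposed to everyone and stays unmatched
        y = prefs[x][next_choice[x]]
        next_choice[x] += 1
        current = partner_of.get(y)
        if current is None:
            partner_of[y] = x
        elif sigma.get((x, y), 0.0) > sigma.get((current, y), 0.0):
            partner_of[y] = x      # y prefers the new proposer
            free.append(current)   # the previous partner becomes free again
        else:
            free.append(x)         # y rejects x, who will try the next choice
    return {(x, y) for y, x in partner_of.items()}


# Illustrative similarities between two nodes on each side.
sigma = {
    ("a1", "b1"): 1.0, ("a1", "b2"): 0.81,
    ("a2", "b1"): 0.54, ("a2", "b2"): 0.27,
}
matching = stable_marriage(sigma, ["a1", "a2"], ["b1", "b2"])
print(sorted(matching))
```

With these numbers the stable result pairs a1 with b1 and a2 with b2 (total 1.27), whereas the assignment-problem criterion would prefer a1 with b2 and a2 with b1 (total 1.35), so the choice of filter changes the outcome.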

37
Filtering: Relative Similarity
  • σ(x,y) is the absolute similarity of x and y
  • We can also define a relative similarity
  • Relative similarity is directed. The reverse
    direction is defined in an analogous manner
  • Bipartite graph methods can also handle directed
    graphs

38
Filtering: Threshold
  • Threshold can be applied to absolute or relative
    similarities
  • A useful example: a threshold of t_rel = 1.0 gives a
    perfectionist egalitarian polygamy, i.e. no
    man/woman is willing to accept any but the best
    match
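
The definition of relative similarity is not preserved in the transcript. The sketch below assumes that the relative similarity of y for x is σ(x,y) divided by the best absolute similarity of x to any candidate, and symmetrically in the reverse direction. The function names and sample values are hypothetical.

```python
def relative_similarities(sigma):
    """Return (left-to-right, right-to-left) relative similarity mappings."""
    best_left, best_right = {}, {}
    for (x, y), s in sigma.items():
        best_left[x] = max(best_left.get(x, 0.0), s)
        best_right[y] = max(best_right.get(y, 0.0), s)

    rel_lr = {
        (x, y): (s / best_left[x] if best_left[x] > 0 else 0.0)
        for (x, y), s in sigma.items()
    }
    rel_rl = {
        (x, y): (s / best_right[y] if best_right[y] > 0 else 0.0)
        for (x, y), s in sigma.items()
    }
    return rel_lr, rel_rl


def select_threshold(sigma, t_rel=1.0):
    """Keep pairs whose relative similarity reaches t_rel in both directions."""
    rel_lr, rel_rl = relative_similarities(sigma)
    return {
        pair for pair in sigma
        if rel_lr[pair] >= t_rel and rel_rl[pair] >= t_rel
    }


# Illustrative similarities between two nodes on each side.
sigma = {
    ("a1", "b1"): 1.0, ("a1", "b2"): 0.81,
    ("a2", "b1"): 0.54, ("a2", "b2"): 0.27,
}
print(select_threshold(sigma, t_rel=1.0))
print(sorted(select_threshold(sigma, t_rel=0.5)))
```

With t_rel = 1.0 only pairs that are each other's best candidate survive. Lowering the threshold admits more pairs and can produce n:m matches.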

39
Part V: Examples
  • Examples of Algorithm Application to Various
    Problems

40
Example: Change Detection
  • Goal: change detection in two labeled trees
  • Original tree T1 was changed to give T2
  • Node names were replaced
  • Subtrees were copied and moved
  • New node was inserted
  • We want the best match for every node of T2
  • Cardinality constraint: [0,n]-[1,1]

41
Example: Change Detection
  • Algorithm script:
  • product = SFJoin(T2, T1)
  • result = SelectLeft(product)

42
Example: Change Detection
  • No initial mapping
  • SelectLeft operator selects best absolute match
    for each element in left argument
  • Results can also provide hints on type of change
    that was performed!
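
A minimal sketch of what a SelectLeft-style filter does, assuming the mapping is stored as a dictionary from node pairs to similarity values. The node names and scores are invented for illustration and do not come from the change detection example.

```python
def select_left(sigma):
    """For every left element, keep the right element with the highest score."""
    best = {}
    for (x, y), s in sigma.items():
        if x not in best or s > best[x][1]:
            best[x] = (y, s)
    return {x: y for x, (y, _) in best.items()}


# Hypothetical similarities between nodes of a new tree T2 (left) and the
# original tree T1 (right).
sigma = {
    ("n1'", "n1"): 1.0, ("n1'", "n2"): 0.2,
    ("n2'", "n2"): 0.7, ("n2'", "n3"): 0.4,
    ("n9'", "n3"): 0.1,
}
print(select_left(sigma))
```

Several left elements may end up matched to the same right element, which is what the [0,n]-[1,1] cardinality constraint allows.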

43
Example: Change Detection

44
Example: Matching Schemas Using Instance Data
  • Goal: match two XML schemas using instance data
  • Two XML product descriptions from two shopping
    websites
  • We want to use the instance data to match the XML
    schemas

45
Example: Matching Schemas Using Instance Data

46
Example: Matching Schemas Using Instance Data
  • Algorithm script:
  • G1 = XML2DOMGraph(db1)
  • G2 = XML2DOMGraph(db2)
  • initialMap = StringMatch(G1, G2)
  • product = SFJoin(G1, G2, initialMap)
  • result = XMLMapFilter(product, G1, G2)
  • Only new piece of code is the XMLMapFilter
    operator

47
Example: Schemas, Instance Data

48
Part VI: Analysis
  • Match Quality, Algorithm Complexity, Convergence
    and Limitations

49
Match Quality
  • Assessing match quality is difficult
  • Human verification and tuning of matching is
    often required
  • A useful metric would be to measure the amount of
    human work required to reach the perfect match
  • Recall: how many of the good matches did we show?
  • Precision: how many of the matches we show are
    good?
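
These two measures can be computed directly when a set of intended (human-verified) matches is available. The following sketch uses invented match sets purely for illustration.

```python
def precision_recall(proposed, intended):
    """Return (precision, recall) of proposed matches against intended ones."""
    proposed, intended = set(proposed), set(intended)
    correct = len(proposed & intended)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(intended) if intended else 0.0
    return precision, recall


# Hypothetical matcher output and hypothetical human-verified matches.
proposed = {
    ("Pno", "EmpNo"), ("Pname", "EmpName"),
    ("Dept", "DeptName"), ("Born", "Salary"),
}
intended = {
    ("Pno", "EmpNo"), ("Pname", "EmpName"), ("Dept", "DeptName"),
    ("Born", "Birthdate"), ("Personnel", "Employee"),
}

precision, recall = precision_recall(proposed, intended)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Three of the four proposed pairs are correct, and three of the five intended pairs were found.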

50
Convergence
  • Fixpoint iterations are an eigenvector
    computation for the matrix that corresponds to
    the propagation graph
  • Computation converges iff graph is strongly
    connected
  • To achieve this we use dampening: use σ^0 in the
    fixpoint formula, where σ^0(x,y) > 0 for all x,y
  • Convergence rate depends on spectral radius of
    the matrix, and can be improved by high dampening
    values

51
Convergence
  • In many cases we are only interested in order of
    map pairs, and not absolute values of σ.
  • The order usually stabilizes before the actual
    values do

52
Complexity
  • Usually 5-30 iterations
  • Each iteration is O(|E|) (edges in propagation
    graph)
  • |E| = O(|E1|·|E2|)
  • |E1| = O(|V1|²) if G1 is highly connected
  • |E2| = O(|V2|²) if G2 is highly connected
  • Worst case of each iteration is O(|V1|²·|V2|²)
  • Average case of each iteration is O(|V1|·|V2|)

53
Limitations
  • Algorithm requires representation as directed,
    labeled graph
  • Degrades when edges are unlabeled or undirected
  • Degrades when labeling is more uniform
  • Assumes structural adjacency contributes to
    similarity
  • Will not work for matching HTML
  • Requires matched objects to be of same type and
    with same graph representation

54
Limitations
  • Algorithm cannot utilize order and aggregation
    information (e.g. for XML)
  • Order: the order of sub-elements within an
    element
  • Aggregation: an element containing an array of
    sub-elements

55
Part VII: Variability and Applications
  • Discussion of Algorithm Variability Areas and
    Possible Applications

56
Variability in Algorithm
  • Graph representation of input objects
  • Calculation of propagation coefficients
  • Initial mapping function
  • Iteration formula
  • Filtering function

57
Graph Representation
  • Graph representation of input objects is
    arbitrary: sub-elements can be modeled as nodes,
    edges, or both.
  • On one hand:
  • Richer graph captures more structural information
  • Type information about sub-elements can be
    modeled
  • On the other hand:
  • Larger graphs mean longer computation
  • Rich graph often implies more uniform labeling

58
Propagation Coefficients
  • Propagation coefficients can be calculated in
    many ways
  • Sum of all outgoing edges is 1.0
  • Equal weight (1.0) for all edges
  • Sum of all outgoing edges of label p is 1.0
  • Sum of all incoming edges is 1.0
  • Label-specific weight allocation
  • Etc.

59
Initial Mapping Function
  • Initial mapping can improve performance and help
    convergence
  • Initial mapping function can be naïve, or it can
    employ domain-specific knowledge

60
Iteration Formula
  • Each iteration calculates σ^(i+1) from σ^i, σ^0, and
    φ(σ^i)
  • Iteration formula can vary, giving different
    weight and effect to these components
  • Example: if initial mapping is good, give higher
    weight to σ^0
  • Formula affects convergence speed as well as
    resultant mapping

61
Filtering Function
  • Results of iterations require filtering to become
    a meaningful mapping
  • Many approaches to filtering are possible, as
    discussed
  • Choice usually stems from graph representation
    and specific goal. For example:
  • If graphs contain many type-related nodes, they
    can be pruned from results
  • If goal is to detect changes, we want a match for
    each element of the newer object

62
Applications
  • There are many possible applications besides the
    ones described
  • Comparing websites
  • Old vs. new versions of website
  • Two websites with information about same subject
  • Structural information gained from containment
    and links

63
Applications
  • Natural language processing and speech
    recognition
  • Match given sentence to XML template
  • Match two text segments that refer to the same
    subject
  • Finding self-similarities and related data items
    by running SFJoin(G,G)
  • Preparation of data and schemas for data
    warehousing and data mining
  • Canonization of data and meta-data

64
Semantic Interpretation - Example
  • For example (1st approach), the user utterance
  • "I would like a medium coca cola and a large
    pizza with pepperoni and mushrooms."
  • could be converted to the following semantic
    result
  • drink
  • beverage: "coke"
  • drinksize: "medium"
  • pizza
  • pizzasize: "large"
  • topping: "pepperoni", "mushrooms"

65
Applications
  • More

66
Summary
  • Generic algorithm with many applications
  • Relies on structural information captured in
    graph representation
  • Domain-specific customizations can improve
    performance and match quality
  • Useful, but does not deliver 100% exact results;
    human verification often required