Title: An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web
1An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web
- Wensheng Wu1, Clement Yu2, AnHai Doan1, Weiyi Meng3
- 1 University of Illinois at Urbana-Champaign
- 2 University of Illinois at Chicago
- 3 SUNY at Binghamton
- June 2004, Paris, France
2Access Deep Web Sources
united.com
airtravel.com
delta.com
hotwire.com
3Global Query Interface
united.com
airtravel.com
delta.com
hotwire.com
4Constructing Global Query Interface
- A unified query interface with these desired features
- Conciseness: Combine semantically similar fields over source interfaces
- Completeness: Retain source-specific fields
- User-friendliness: Highly related fields are close together
- Two-phased integration
- Interface Matching: Identify semantically similar fields
- Interface Integration: Merge the source query interfaces
5Interface Matching Challenges
- Field A in one interface is semantically similar to field B in another interface, but the two have nothing in common. E.g., [figure]
- sim(A,B) = sim(A,C): which field should A match? E.g., [figure]
6Interface Matching Challenges (Contd)
- 1:m mappings. E.g., [figure]
- Determine matching threshold
7Existing Common Limitations
- Limitation 1: Non-hierarchical modeling
- Limitation 2: Do not handle 1:m mappings, or handle them with low accuracy
- Limitation 3: Do not allow limited user interactions
- Detailed comparisons are given in the paper
8The IceQ Approach (SIGMOD-04)
- Hierarchical modeling
- Let's get out of flatland
- Greedy is good
- Always start with the most confident matching
- Bridging effect
- a2 and c2 might not look similar themselves, but they might both be similar to b3
- 1:m mappings
- Aggregate and is-a types
- User interaction helps in
- Interactive learning of matching threshold
- Resolution of uncertain mappings
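[Figure: two candidate matchings with similarities 0.8 and 0.5; the 0.8 one is marked "Pick this!" and the 0.5 one is crossed out]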
9Hierarchical Modeling
Ordered Tree Representation
Source Query Interface
Capture ordering and grouping of fields
10Field Similarity Function
- Each field may have a label, a name and a set of values, e.g., [figure]
- Evaluate the similarity sim(A,B) between two fields, A and B, based on
- Linguistic similarity: label similarity, name similarity and name vs. label similarity, each measured by the Cosine function
- Domain similarity: domain type and domain value similarity
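A minimal sketch of how such a field similarity might be computed. Only the use of the Cosine function for the linguistic parts comes from the slide; the tokenizer, taking the maximum of the linguistic scores, the Jaccard overlap used for domain values, and the weights `w_ling` and `w_dom` are all assumptions made for illustration. Domain type similarity is left out.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase alphanumeric tokens (a simplified normalization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between the term-frequency vectors of two strings."""
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def linguistic_sim(f, g):
    # Label vs. label, name vs. name, and name vs. label (both directions).
    # Combining them with max() is an assumption of this sketch.
    return max(
        cosine(f["label"], g["label"]),
        cosine(f["name"], g["name"]),
        cosine(f["name"], g["label"]),
        cosine(f["label"], g["name"]),
    )


def domain_sim(f, g):
    # Jaccard overlap of the value sets stands in for domain value
    # similarity (assumption).
    a, b = set(f["values"]), set(g["values"])
    return len(a & b) / len(a | b) if a | b else 0.0


def field_sim(f, g, w_ling=0.6, w_dom=0.4):
    # Hypothetical weights, not taken from the slides.
    return w_ling * linguistic_sim(f, g) + w_dom * domain_sim(f, g)


# Made-up fields for illustration.
a = {"label": "Departure city", "name": "dep_city", "values": []}
b = {"label": "From city", "name": "from", "values": []}
score = field_sim(a, b)
```

Here the two labels share one of their two tokens, so the label cosine is about 0.5 and the empty value sets contribute nothing to the combined score.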
11Find 1:1 Mappings via Clustering
[Figure: interfaces a, b and c; initial similarity matrix (threshold .3); matrix after one merge; final clusters]
Final clusters: {a1,b1,c1}, {b2,c2}, {a2,b3}
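The clustering step could be sketched as a greedy agglomerative loop that always merges the most similar pair of clusters first and stops at the threshold. The single-link cluster similarity, the rule that a cluster holds at most one field per interface, the convention that a field's interface is the first character of its name, and the similarity values below are assumptions for illustration; the slide's actual matrix is not recoverable.

```python
from itertools import combinations


def cluster_fields(fields, sim, threshold):
    """Greedy agglomerative clustering of fields into 1:1 mapping clusters.

    fields: field names such as "a1"; the first character names the interface
    sim:    dict mapping frozenset({x, y}) -> similarity (missing pairs are 0)
    """
    clusters = [frozenset([f]) for f in fields]

    def interfaces(cluster):
        return {f[0] for f in cluster}

    def cluster_sim(c1, c2):
        # Single-link similarity (assumption of this sketch).
        return max(sim.get(frozenset([x, y]), 0.0) for x in c1 for y in c2)

    while True:
        best, best_pair = threshold, None
        for c1, c2 in combinations(clusters, 2):
            if interfaces(c1) & interfaces(c2):
                continue  # at most one field per interface in a 1:1 cluster
            s = cluster_sim(c1, c2)
            if s >= best:
                best, best_pair = s, (c1, c2)
        if best_pair is None:
            return clusters
        c1, c2 = best_pair
        # Greedy step: merge the most confident pair first.
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]


# Made-up similarities, chosen only to exercise the loop.
sim = {
    frozenset(["a1", "b1"]): 0.9,
    frozenset(["b1", "c1"]): 0.8,
    frozenset(["a1", "c1"]): 0.7,
    frozenset(["b2", "c2"]): 0.6,
    frozenset(["a2", "b3"]): 0.5,
    frozenset(["a2", "c2"]): 0.2,
}
fields = ["a1", "a2", "b1", "b2", "b3", "c1", "c2"]
result = cluster_fields(fields, sim, threshold=0.3)
```

With these values the loop ends with the clusters {a1,b1,c1}, {b2,c2} and {a2,b3}; the last two are never merged because both contain a field from interface b.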
12Bridging Effect
[Figure: fields A, B and C; can A be matched with B?]
Observations
- It is difficult to match the vehicle field, A, with the make field, B
- But A's instances are similar to C's, and C's label is similar to B's
- Thus, C might serve as a bridge to connect A and B!
13Bridging Effect (Contd)
airtravel.com
hotfares.com
airtickets.com
Connections might also be made via labels
14Field Ordering-based Tie Resolution
[Figure: fields A1 and A2 in one interface, B1 and B2 in another; every cross-interface pair has similarity 0.35]
Question: sim(A1, B1) = sim(A1, B2), which one should A1 match?
Observation: the ordering of fields conveys semantics!
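One way to use field ordering to break such a tie, as a rough sketch: among the equally similar candidates, prefer the one whose position in its interface is closest to the position of the field being matched. The helper `resolve_tie` and this distance rule are assumptions for illustration, not necessarily the rule used by the authors.

```python
def resolve_tie(field_pos, candidates):
    """Pick the most similar candidate; break ties by field position.

    field_pos:  position of the field being matched within its interface
    candidates: list of (name, position, similarity) tuples
    """
    best = max(s for _, _, s in candidates)
    tied = [c for c in candidates if abs(c[2] - best) < 1e-9]
    # Among tied candidates, choose the one closest in position (assumption).
    return min(tied, key=lambda c: abs(c[1] - field_pos))[0]


# The slide's situation: both candidates score 0.35.
b_fields = [("B1", 0, 0.35), ("B2", 1, 0.35)]
match_for_a1 = resolve_tie(0, b_fields)  # A1 is the first field -> "B1"
match_for_a2 = resolve_tie(1, b_fields)  # A2 is the second field -> "B2"
```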
15Complex Mappings
Aggregate type: the contents of the fields on the many side are parts of the content of the field on the one side
Commonalities: (1) field proximity, (2) parent label similarity, and (3) value characteristics
16Complex Mappings (Contd)
Is-a type: the content of the field on the one side is the sum/union of the contents of the fields on the many side
Commonalities: (1) field proximity, (2) parent label similarity, and (3) value characteristics
17Complex Mappings (Contd)
- Final 1-m phase infers new mappings
- Preliminary 1-m phase: a1 ↔ (b1, b2)
- Clustering phase: b1 ↔ c1, b2 ↔ c2
- Final 1-m phase: a1 ↔ (c1, c2)
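The inference in the final 1-m phase can be sketched as propagating each preliminary 1:m mapping through the 1:1 clusters. The function name `infer_one_to_many`, the data layout, and the convention that a field's interface is the first character of its name are assumptions for illustration.

```python
def infer_one_to_many(one_to_many, clusters, iface=lambda f: f[0]):
    """Derive new 1:m mappings from preliminary ones and 1:1 clusters.

    one_to_many: list of (one_field, (many_field, ...)) from the preliminary phase
    clusters:    list of sets of fields found equivalent by clustering
    """
    inferred = []
    for one, many in one_to_many:
        # The cluster of each many-side field (a singleton if it has none).
        many_clusters = [next((c for c in clusters if m in c), {m}) for m in many]
        skip = {iface(one)} | {iface(m) for m in many}
        others = set().union(*[{iface(f) for f in c} for c in many_clusters]) - skip
        for i in sorted(others):
            picks = [[f for f in c if iface(f) == i] for c in many_clusters]
            # Only infer when every many-side field has a counterpart in i.
            if all(picks):
                inferred.append((one, tuple(p[0] for p in picks)))
    return inferred


preliminary = [("a1", ("b1", "b2"))]
clusters = [{"b1", "c1"}, {"b2", "c2"}]
new_mappings = infer_one_to_many(preliminary, clusters)
# new_mappings == [("a1", ("c1", "c2"))]
```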
18Active Learning of Thresholds
- Observation: in an ideal situation,
- if field A matches with some field X, then sim(A, X) ≥ threshold T
- if field A does not match with any field, then for any C, max sim(A, C) < T
List 1: .91 .8 .73 .62 .46 .2 .03
List 2: .62 .53 .5 .48 .46 .32 .1
List 3: .87 .82 .6 .53 .5 .33 .28
Initial B = [0, .4]
Drop rule: 50%
- List 1: (1) question on .2, answer yes, update B = [0, .2], continue on list 1; (2) question on .03, answer no, update B = [.03, .2]
- List 2: question on .1, answer yes, update B = [.03, .1]
- List 3: no values within B
Threshold set to any value between .03 and .1
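The walk-through above can be replayed with a short simulation. The reading of the drop rule used here (a question is asked on a value only if it lies strictly inside B and is less than half of the preceding value in its list) is an assumption that happens to reproduce the slide's questions; the function `learn_threshold` and the `ask` callback standing in for the user are hypothetical.

```python
def learn_threshold(lists, ask, bounds=(0.0, 0.4), drop=0.5):
    """Narrow the interval B that must contain the matching threshold.

    lists:  similarity lists, each sorted in descending order
    ask:    callback ask(list_index, value) -> True if that pair is a real match
    drop:   ask only about a value below (1 - drop) times its predecessor
            (assumed reading of the "drop rule")
    """
    lo, hi = bounds
    questions = []
    for i, values in enumerate(lists):
        for prev, v in zip(values, values[1:]):
            if not (lo < v < hi) or v >= (1 - drop) * prev:
                continue
            is_match = ask(i, v)
            questions.append((i, v, is_match))
            if is_match:
                hi = v  # a real match at v means the threshold is at most v
            else:
                lo = v  # a non-match at v means the threshold is above v
                break   # lower values in this list are non-matches too
    return (lo, hi), questions


lists = [
    [.91, .8, .73, .62, .46, .2, .03],   # List 1
    [.62, .53, .5, .48, .46, .32, .1],   # List 2
    [.87, .82, .6, .53, .5, .33, .28],   # List 3
]
# Simulated user: the pairs scored .2 and .1 are real matches, .03 is not.
bounds, questions = learn_threshold(lists, lambda i, v: v in (.2, .1))
# bounds == (0.03, 0.1), reached after three questions
```

Any value inside the returned interval can then serve as the threshold.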
19Interactive Resolution of Uncertain Mappings
- Resolve potential homonyms
- Observation: two fields are possible homonyms if their labels are highly similar while their domains are not.
- Determine potential synonyms
- Observation: two fields might still be similar if there are common values in their domains even if their label/domain similarities are low
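A minimal sketch of how candidate pairs for these two kinds of questions might be selected before asking the user. The cut-offs `hi` and `lo`, the tuple layout, and the sample pairs are assumptions for illustration; the slides do not give concrete values.

```python
def uncertain_pairs(pairs, hi=0.8, lo=0.3):
    """Split field pairs into potential homonyms and potential synonyms.

    pairs: iterable of (a, b, label_sim, domain_sim, shared_values), where
           shared_values is the number of values the two domains have in common
    """
    homonyms, synonyms = [], []
    for a, b, label_sim, domain_sim, shared in pairs:
        if label_sim >= hi and domain_sim <= lo:
            # Highly similar labels but dissimilar domains.
            homonyms.append((a, b))
        elif label_sim <= lo and domain_sim <= lo and shared > 0:
            # Low similarities, yet the domains share some values.
            synonyms.append((a, b))
    return homonyms, synonyms


# Made-up pairs.
pairs = [
    ("A.type", "B.type", 1.0, 0.0, 0),
    ("A.carrier", "B.airline", 0.0, 0.2, 3),
    ("A.from", "B.from", 1.0, 0.9, 40),
]
homonyms, synonyms = uncertain_pairs(pairs)
```

Each pair returned would become a yes/no question for the user; the third pair is confidently matched and triggers no question.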
20Interactive Resolution of Uncertain Mappings
- Determine potential 1:m mappings
- Observation: A might still match with B and C if (a) sim(A,B) is very close to sim(A,C), (b) B and C are adjacent, and (c) A is the only field in its interface which satisfies (a) and (b)
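Conditions (a) to (c) could be checked along the following lines. The tolerance `eps`, the requirement that both similarities be nonzero, the data layout, and the sample field names are assumptions for illustration.

```python
def potential_one_to_many(sim, positions, b, c, eps=0.05):
    """Return the field A that may map to both b and c, or None.

    sim:       dict mapping each field A of one interface to
               {field_of_other_interface: similarity}
    positions: dict mapping fields of the other interface to their order
    """
    if abs(positions[b] - positions[c]) != 1:
        return None  # (b) fails: b and c are not adjacent
    close = [
        a for a, row in sim.items()
        if abs(row.get(b, 0.0) - row.get(c, 0.0)) <= eps    # (a) near-equal
        and min(row.get(b, 0.0), row.get(c, 0.0)) > 0.0     # ignore 0 vs 0
    ]
    # (c) A must be the only such field in its interface.
    return close[0] if len(close) == 1 else None


# Made-up similarities.
sim = {
    "passengers": {"adults": 0.4, "children": 0.38},
    "class": {"adults": 0.0, "children": 0.0},
}
positions = {"adults": 2, "children": 3}
candidate = potential_one_to_many(sim, positions, "adults", "children")
```

A candidate found this way would be confirmed or rejected by asking the user.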
21Empirical Evaluations
Accuracy with all user interactions
Accuracy with learned thresholds
Automatic field matching
Distribution of questions
22Comparison of Component Contributions
7.3
15.4
On average, 12.6% increase in recall
23Summary
- High accuracy of determining matching fields across multiple user interfaces
- Limited use of user interactions
24Future Research
- Improve the accuracy of determining matching fields further
- Decrease the number of user interactions
- Produce a unified, friendly user interface
- Provide such a tool on the Web