Title: Answering Structured Queries on Unstructured Data
1Answering Structured Queries on Unstructured Data
- Jing Liu, Xin (Luna) Dong, Alon Halevy
- Univ. of Washington
- At WebDB 2006
2Seamless Querying on Structured and Unstructured Data
4Dataspaces and DSSPs
- Dataspace: a collection of both structured and unstructured data (PODS'06 keynote talk)
- Dataspace Support Platforms (DSSPs): services over dataspaces (e.g., search, query, and source discovery)
5SEMEX: Personal Information Management
[Figure: SEMEX schema fragment. Labels: Paper, Title, Year, Author, PublishedIn, CitedBy]
[Figure: labels Time, Paper, FromFile]
Structured query: Article (Title dataspace) (Author Alon Halevy)
[Screenshot text: Mentioned In Article; alon keynote pods06]
[Screenshot text: Web search results by Google; Alon Halevy's Home Page; DBLP; David Maier]
10Current Approach
- Information-extraction approach
- Use supervised learning
- Hard to scale to data in a large number of domains
- Hard to apply when the query schema is unknown beforehand
11Solution in SEMEX
- SEMEX solution
- Transform a structured query into a keyword search
- Run the keyword search on unstructured data
- Advantages
- Apply to different domains
- Handle different queries
12SEMEX: Transform a Structured Query into a Keyword Search
Article (Title dataspace) (Author alon halevy) -> Article dataspace alon halevy
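A minimal sketch of the naive transformation shown above: keep the class name and the attribute values, drop the attribute names. The function name and the tuple representation of the query are illustrative assumptions, not SEMEX code.

```python
def naive_keywords(class_name, predicates):
    """Concatenate the class name with every attribute value.

    predicates is a list of (attribute, value) pairs; attribute names
    are dropped, mirroring the example on the slide.
    """
    terms = [class_name]
    for _attribute, value in predicates:
        terms.extend(value.split())
    return " ".join(terms)


# Article (Title dataspace) (Author alon halevy)
query = ("Article", [("Title", "dataspace"), ("Author", "alon halevy")])
print(naive_keywords(*query))  # prints: Article dataspace alon halevy
```

The Challenges example that follows shows why this simple concatenation is not enough: which schema terms to keep has a large effect on precision.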
13Challenges
- Example
- SELECT title
- FROM paper
- WHERE title LIKE 'Dataspaces' AND year = 2005
Candidate keyword queries and their top-10 precision:
- select title from paper where title LIKE dataspaces and year 2005: 0
- title paper title dataspaces and year 2005: 0
- dataspaces 2005: 0.2
- dataspaces 2005 paper title: 0.2
- dataspaces 2005 paper: 0.6
18Outline
- Motivation
- Problem Definition
- Our Algorithm
- Construct Query Graph
- Extract Keywords
- Experimental Results
- Conclusions and Future Work
19Problem Definition
- Keyword extraction (query transformation)
- Input: a structured query
- Output: a set of keywords
- Measure the quality of the extraction using the top-k precision of the keyword-search answers
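Top-k precision can be computed as sketched below. The slides do not say how result lists shorter than k are handled; dividing by k regardless is an assumption made here, and the document ids are invented for illustration.

```python
def top_k_precision(ranked_answers, relevant, k):
    """Fraction of the first k ranked answers that are relevant.

    Assumption: the denominator is always k, so a result list shorter
    than k is penalized for the missing answers.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for answer in ranked_answers[:k] if answer in relevant)
    return hits / k


# Hypothetical ranked result list and relevance judgments.
ranked = [f"doc{n}" for n in range(1, 11)]
relevant = {"doc1", "doc2", "doc4", "doc5", "doc7", "doc9"}
print(top_k_precision(ranked, relevant, k=10))  # prints: 0.6
```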
20Queries Considered
- Only consider basic SPJ (selection, projection, simple join) queries as a first step
- Do not consider
- Disjunctions
- Comparison predicates (e.g., ≠, <, >)
- Aggregations
21How to Select Keywords?
- Example
- SELECT title
- FROM paper
- WHERE title LIKE 'Dataspaces' AND year = 2005
26Architecture Overview
- Input: SQL queries, XML queries, triple queries
- Query-graph construction -> query graph
- Keyword extraction -> keyword set
27Construct Query Graph
- SELECT title
- FROM Paper, Person
- WHERE title LIKE 'Dataspaces'
- AND Paper.author = Person.id
- AND Person.name LIKE 'Halevy'
[Figure: query graph for the query above. Node labels: ?paper, ?, person, Dataspaces, Halevy. Edge labels: title (twice), author, name.]
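One way to hold the query graph for the query above is sketched here. The class, the node ids, and the kind names are illustrative assumptions; the shape (a paper instance with two title edges and an author edge to a person instance, which has a name edge) is read off the figure labels and the node and edge kinds listed on the score-initialization slide.

```python
from dataclasses import dataclass, field


@dataclass
class QueryGraph:
    nodes: dict = field(default_factory=dict)  # node id -> (kind, label)
    edges: list = field(default_factory=list)  # (source id, kind, label, target id)

    def add_node(self, node_id, kind, label):
        self.nodes[node_id] = (kind, label)

    def add_edge(self, source, kind, label, target):
        self.edges.append((source, kind, label, target))

    def labels(self):
        """All node and edge labels, i.e. the candidate keywords."""
        node_labels = [label for _kind, label in self.nodes.values()]
        edge_labels = [edge[2] for edge in self.edges]
        return node_labels + edge_labels


g = QueryGraph()
g.add_node("p", "instance", "paper")      # the queried instance
g.add_node("a", "instance", "person")
g.add_node("t", "value", "?")             # the projected title, unknown
g.add_node("v1", "text_value", "Dataspaces")
g.add_node("v2", "text_value", "Halevy")
g.add_edge("p", "attribute", "title", "t")
g.add_edge("p", "attribute", "title", "v1")
g.add_edge("p", "association", "author", "a")
g.add_edge("a", "attribute", "name", "v2")
print(g.labels())
```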
32Informativeness and Representativeness
- Example
- A paper authored by a person with name Halevy -> Halevy's paper
- Informativeness: measures the amount of information provided by a label term (i-score)
- Representativeness: roughly corresponds to the probability that searching for the given term returns documents or webpages in the queried domain (r-score = 1 - distractiveness)
- Select a label when informativeness > distractiveness, i.e., i-score + r-score > 1
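The selection test follows directly from the definitions: with distractiveness = 1 - r-score, the condition i-score > distractiveness is the same as i-score + r-score > 1. A small sketch, with a hypothetical function name and score pairs taken from the greedy walkthrough later in the talk:

```python
def worth_selecting(i_score, r_score):
    """True when informativeness exceeds distractiveness (1 - r_score).

    Written as i + r > 1, the form used on the slides.
    """
    return i_score + r_score > 1


print(worth_selecting(1.0, 0.8))     # Halevy (1, 0.8): True
print(worth_selecting(0.5375, 0.6))  # paper after step 1: True
print(worth_selecting(0.2, 0.6))     # person after paper is chosen: False
print(worth_selecting(0.05, 0.2))    # name after step 1: False
```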
33Informativeness of a Label Depends on the Already Selected Labels
- Paper
- Dataspace Paper in 2005
- Dataspace Paper of Halevy and Franklin in 2005
36Effect of a Selected Label on the i-scores of Other Labels
Labels in the query graph, each with its initial (i-score, r-score):
- title (0.8, 0.2)
- ?paper (0.8, 0.6)
- ? (0.8, 0)
- author (0.8, 0.4)
- title (0.8, 0.2)
- person (0.8, 0.6)
- Dataspaces (1, 0.8)
- name (0.8, 0.2)
- Halevy (1, 0.8)
Changes shown step by step in the figure:
- Effect -0.8 on the second title label: (0.8, 0.2) becomes (0, 0.2)
- Effect -0.4 on ?paper: (0.8, 0.6) becomes (0.4, 0.6)
- Effect -0.1 on the first title label and on author: (0.8, 0.2) becomes (0.7, 0.2) and (0.8, 0.4) becomes (0.7, 0.4)
43Extract Keywords Using Greedy Algorithm
Initial (i-score, r-score) of each label:
- title (0.8, 0.2)
- ?paper (0.8, 0.6)
- ? (0.8, 0)
- author (0.8, 0.4)
- title (0.8, 0.2)
- person (0.8, 0.6)
- Dataspaces (1, 0.8)
- name (0.8, 0.2)
- Halevy (1, 0.8)
Step 1: Choose all labels of value nodes, then update the i-scores of the remaining labels
- title (0.6688, 0.2)
- ?paper (0.5375, 0.6)
- ? (0.8, 0)
- author (0.575, 0.4)
- title (0.3688, 0.2)
- person (0.5, 0.6)
- Dataspaces (1, 0.8)
- name (0.05, 0.2)
- Halevy (1, 0.8)
Step 2: Choose the label with the highest i + r, provided i + r > 1, then update the i-scores of the remaining labels
- Candidates: ?paper (0.5375, 0.6) and person (0.5, 0.6)
- After the update: person (0.2, 0.6)
Step 3: Iterate step 2 until no more labels can be added
Final keyword set: Dataspaces, Halevy, paper
50Extract Keywords Using Greedy Algorithm: Another Example
[Figure: query graph. Node labels: ?paper, ?, person (twice), Dataspaces (1, 0.8), Franklin (1, 0.8), Halevy (1, 0.8). Edge labels: title (twice), author (twice), name (twice).]
Final keyword set: Dataspaces, Halevy, Franklin
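The three greedy steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the i-score update is passed in as a callable, and the example uses a stub (replay_update) that simply replays the i-scores printed on the first example's slides. The keys title_q and title_v are hypothetical names for the two title labels.

```python
def greedy_keywords(scores, value_labels, update):
    """scores: label -> [i_score, r_score], modified in place.
    value_labels: labels of value nodes (chosen unconditionally in step 1).
    update(scores, chosen): adjusts i-scores after the labels in chosen are selected.
    """
    # Step 1: choose all value-node labels, then update the remaining i-scores.
    selected = list(value_labels)
    update(scores, selected)
    # Steps 2 and 3: repeatedly add the label with the highest i + r while i + r > 1.
    while True:
        remaining = [label for label in scores if label not in selected]
        if not remaining:
            break
        best = max(remaining, key=lambda label: sum(scores[label]))
        if sum(scores[best]) <= 1:
            break
        selected.append(best)
        update(scores, [best])
    return selected


# Stub update: replays the i-scores shown on the slides (not a real update rule).
REPLAY = {
    ("Dataspaces", "Halevy"): {"title_q": 0.6688, "paper": 0.5375, "author": 0.575,
                               "title_v": 0.3688, "person": 0.5, "name": 0.05},
    ("paper",): {"person": 0.2},
}


def replay_update(scores, chosen):
    for label, new_i in REPLAY.get(tuple(chosen), {}).items():
        scores[label][0] = new_i


scores = {
    "title_q": [0.8, 0.2], "paper": [0.8, 0.6], "?": [0.8, 0.0],
    "author": [0.8, 0.4], "title_v": [0.8, 0.2], "person": [0.8, 0.6],
    "Dataspaces": [1.0, 0.8], "name": [0.8, 0.2], "Halevy": [1.0, 0.8],
}
result = greedy_keywords(scores, ["Dataspaces", "Halevy"], replay_update)
print(result)  # prints: ['Dataspaces', 'Halevy', 'paper']
```

After paper is chosen, person drops to i + r = 0.8 and the best remaining label (author, 0.975) is below the threshold, so the loop stops with the keyword set from the slide.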
52Experiment Setup
- Six different domains: movie, geography, company profiles, bibliography, DBLP, and car profiles
- Randomly select text values
- Vary two parameters in the selected queries
- Values: number of attribute values in the query (information given)
- Length: longest path from a queried instance to other instances (complexity of structure information)
- Measure the quality of extracted keywords with top-k precision
53Initialize i-scores and r-scores: Without Domain Knowledge
- i-scores
- Text-value labels: 1
- Labels of queried instances: 1
- Other labels: 0.8
- r-scores
- Text-value node labels: 0.8
- Labels of association edges between instances of the same type: 0.8
- Instance node labels: 0.6
- Association edge labels: 0.4
- Attribute edge labels: 0.2
- Number-value node labels: 0
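The default scores above amount to a small lookup table. A sketch, where the kind names and the function signature are illustrative assumptions:

```python
# r-scores by label kind, as listed on the slide.
DEFAULT_R_SCORES = {
    "text_value_node": 0.8,
    "same_type_association_edge": 0.8,
    "instance_node": 0.6,
    "association_edge": 0.4,
    "attribute_edge": 0.2,
    "number_value_node": 0.0,
}


def initial_scores(kind, queried=False):
    """Return (i_score, r_score) for a label of the given kind.

    i-score is 1 for text-value labels and for labels of queried
    instances, and 0.8 for every other label.
    """
    i_score = 1.0 if kind == "text_value_node" or queried else 0.8
    return i_score, DEFAULT_R_SCORES[kind]


print(initial_scores("text_value_node"))  # prints: (1.0, 0.8)
print(initial_scores("attribute_edge"))   # prints: (0.8, 0.2)
```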
54High Precisions in All Data Domains w/o Domain
Knowledge
- Average top-2 precision was 0.68
- Average top-10 precision was 0.59
55Initialize i-scores and r-scores: With Domain Knowledge
- Can obtain more meaningful r-scores
- How?
- Do a keyword search on each label
- Calculate the percentage of top-k answers that are related to the queried domain
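The r-score estimate described above can be sketched as below. The search and in_domain callables are hypothetical stand-ins for a keyword-search engine and a domain-relevance judgment; dividing by the number of results actually returned (rather than k) is also an assumption.

```python
def estimate_r_score(label, search, in_domain, k=10):
    """Fraction of the top-k search results for label that fall in the queried domain.

    search(label) returns a ranked list of documents; in_domain(doc)
    says whether a document belongs to the queried domain.
    """
    results = search(label)[:k]
    if not results:
        return 0.0
    return sum(1 for doc in results if in_domain(doc)) / len(results)


# Toy index standing in for a real search engine (invented data).
fake_index = {"paper": ["bib:1", "bib:2", "shop:1", "bib:3", "news:1"]}


def toy_search(term):
    return fake_index.get(term, [])


def is_bibliography(doc):
    return doc.startswith("bib:")


print(estimate_r_score("paper", toy_search, is_bibliography, k=5))  # prints: 0.6
```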
56Applying Domain Knowledge Increases Performance
- Top-10 precision increased by 39% on average.
57Increasing Value Increases Precision
[Figure: precision plots; panels: Movie, Geography]
58Increasing Length Decreases Precision
[Figure: precision plots; panels: Movie, Geography]
59Conclusions
- Dataspace Support Platforms require answering structured queries on unstructured data
- Solution: transform a structured query into a keyword search by keyword extraction
- Our algorithm obtains good results in various domains
60Future Work
- Refine the extracted keyword set by considering the schema or a corpus of schemas
- Use existing structured data to supplement the selected keyword set
- Perform linguistic analysis of the words in the structured query
- Develop methods for ranking answers from structured and unstructured data sources
61Precise Data Integration
[Figure labels: Cost, Benefit, Heterogeneity]
62Approximate Data Integration
[Figure labels: Benefit, Cost, Heterogeneity]
63Answering Structured Queries on Unstructured Data
- Jing Liu, Xin (Luna) Dong, Alon Halevy
- Univ. of Washington
- http://data.cs.washington.edu/semex/semex.html
64Related Work
- SCORE (CIKM 2005)
- Extracts keywords from query results on structured data
- Not generic
65Algorithm of Updating i-scores
- Effect of a selected label on other labels
- The source node (or edge) has flow volume r_s, its r-score
- The flow value is divided among the neighbors
- The flow value decreases exponentially with the number of hops
- Update i-scores
- i_new = i_old - effect of the selected label
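A simplified sketch of this flow-based update follows. It assumes the selected label in the step-by-step effect slides is Dataspaces (r-score 0.8), that the flow is halved at each hop after the first, that it is split equally among not-yet-visited neighbors, that it stops after three hops, and that i-scores are clamped at 0. All of these are assumptions chosen to match the effects -0.8, -0.4, and -0.1 on those slides. It does not reproduce the i-scores shown in the greedy walkthrough, so the authors' actual update rule evidently differs in its details. Element names such as title_v and title_q are hypothetical.

```python
from collections import deque


def propagate_effect(neighbors, i_scores, source, r_source, max_hops=3):
    """Subtract the effect of selecting source from nearby i-scores (in place).

    neighbors: element -> adjacent elements; nodes and edges are both elements.
    """
    visited = {source}
    queue = deque([(source, r_source, 0)])  # (element, flow to pass on, hops so far)
    while queue:
        element, flow, hops = queue.popleft()
        targets = [n for n in neighbors[element] if n not in visited]
        if hops >= max_hops or not targets:
            continue
        share = flow / len(targets)  # the flow is divided among the neighbors
        for target in targets:
            visited.add(target)
            # i_new = i_old - effect, clamped at 0 (the clamp is an assumption)
            i_scores[target] = max(0.0, i_scores[target] - share)
            queue.append((target, share / 2, hops + 1))  # halve the flow per hop
    return i_scores


# Label graph of the running example.
neighbors = {
    "Dataspaces": ["title_v"], "title_v": ["Dataspaces", "paper"],
    "paper": ["title_v", "title_q", "author"], "title_q": ["paper", "?"],
    "?": ["title_q"], "author": ["paper", "person"],
    "person": ["author", "name"], "name": ["person", "Halevy"], "Halevy": ["name"],
}
i_scores = {"title_q": 0.8, "paper": 0.8, "?": 0.8, "author": 0.8, "title_v": 0.8,
            "person": 0.8, "Dataspaces": 1.0, "name": 0.8, "Halevy": 1.0}
propagate_effect(neighbors, i_scores, "Dataspaces", r_source=0.8)
print(i_scores)
```

Up to floating-point rounding, title_v ends at 0, paper at 0.4, and title_q and author at 0.7, while the other labels are unchanged.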