Title: Answering Structured Queries on Unstructured Data

1
Answering Structured Queries on Unstructured Data

Jing Liu, Xin (Luna) Dong, Alon Halevy
Univ. of Washington
@ WebDB 2006

2
Seamless Querying on the Structured and
Unstructured Data
4
Dataspaces and DSSPs

Dataspace: a collection of both structured and unstructured data (PODS '06 keynote talk)
Dataspace Support Platforms (DSSPs): services over dataspaces (e.g., search, query, and source discovery)

5
SEMEX: Personal Information Management
[Figure labels: Paper, Title, Year, Author, PublishedIn, CitedBy]
6
SEMEX: Personal Information Management
[Figure labels: Paper, Time, FromFile]
7
SEMEX: Personal Information Management
Article (Title: dataspace) (Author: Alon Halevy)
8
SEMEX: Personal Information Management
Mentioned In: Article
alon keynote pods06
9
SEMEX: Personal Information Management
Web search results by Google: Alon Halevy's Home Page, DBLP, David Maier
10
Current Approach

Information-extraction approach
Use supervised learning
Hard to scale to data in a large number of
domains
Hard to apply to the case where the query schema
is unknown beforehand

11
Solution in SEMEX

Semex solution
Transform a structured query into keyword search
Keyword search on unstructured data.
Advantages
Apply to different domains
Handle different queries

12
SEMEX: Transform a Structured Query into a Keyword Search
Article (Title: dataspace) (Author: alon halevy) → Article dataspace alon halevy
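The transformation on this slide can be sketched in a few lines. This is only an illustration of the naive flattening shown in the example, not the SEMEX implementation; the function name `to_keyword_query` and the dictionary representation of the predicates are assumptions made for the sketch.

```python
from typing import Dict


def to_keyword_query(class_name: str, predicates: Dict[str, str]) -> str:
    """Flatten a structured query into keywords.

    Keeps the class name and the attribute values, and drops the
    attribute names, mirroring the slide's example.
    """
    terms = [class_name] + list(predicates.values())
    return " ".join(terms)


query = to_keyword_query("Article", {"Title": "dataspace", "Author": "alon halevy"})
print(query)  # prints: Article dataspace alon halevy
```

The slides that follow show why this simple flattening is not enough in general: which schema terms to keep has a large effect on precision.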
13
Challenges

Example
SELECT title
FROM paper
WHERE title LIKE '%Dataspaces%' AND year = 2005

Candidate keyword queries and their top-10 precision:
select title from paper where title LIKE dataspaces and year 2005: 0
title paper title dataspaces and year 2005: 0
dataspaces 2005: 0.2
dataspaces 2005 paper title: 0.2
dataspaces 2005 paper: 0.6
18
Outline

Motivation
Problem Definition
Our Algorithm
Construct Query Graph
Extract Keywords
Experimental Results
Conclusions and Future Work

19
Problem Definition

Keyword extraction (query transformation)
Input: a structured query
Output: a set of keywords
Measure the quality of the extraction using the top-k precision of the keyword-search answers
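The quality measure named above can be made concrete with a small sketch. It assumes the usual definition of top-k precision (the fraction of the first k answers that are relevant); the slides do not spell out the definition, and the names `top_k_precision` and `is_relevant` are hypothetical.

```python
from typing import Callable, Sequence


def top_k_precision(answers: Sequence[str],
                    is_relevant: Callable[[str], bool],
                    k: int) -> float:
    """Fraction of the first k answers that are relevant (assumed definition)."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = answers[:k]
    # Missing answers (fewer than k returned) count as misses.
    return sum(1 for answer in top if is_relevant(answer)) / k


# Toy data: ten ranked answers, six of which are marked relevant.
ranked = ["r1", "x1", "r2", "r3", "x2", "r4", "x3", "r5", "x4", "r6"]
print(top_k_precision(ranked, lambda a: a.startswith("r"), k=10))  # prints: 0.6
```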

20
Queries Considered

Only consider basic SPJ (selection, projection, simple join) queries as a first step
Do not consider:
Disjunctions
Comparison predicates (e.g., ≠, <, >)
Aggregations

21
How to Select Keywords?

Example
SELECT title
FROM paper
WHERE title LIKE '%Dataspaces%' AND year = 2005

25
Outline

Motivation
Problem Definition
Our Algorithm
Construct Query Graph
Extract Keywords
Experimental Results
Conclusions and Future Work

26
Architecture Overview
Input: SQL queries, XML queries, or triple queries
→ Query-graph construction → Query graph
→ Keyword extraction → Keyword set
27
Construct Query Graph

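SELECT title
FROM Paper, Person
WHERE title LIKE '%Dataspaces%'
AND Paper.author = Person.id
AND Person.name LIKE '%Halevy%'

[Query graph: instance nodes ?paper and person; value nodes ?, Dataspaces, and Halevy; attribute edges title (?paper to ?), title (?paper to Dataspaces), and name (person to Halevy); association edge author (?paper to person)]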
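One way to picture the query graph for this example is as a small list of labeled edges. The sketch below builds it by hand rather than by parsing SQL; the `Node` and `Edge` classes and the `kind` strings are assumptions chosen for illustration, not the data structures used in SEMEX.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Node:
    label: str
    kind: str  # "instance" or "value" (hypothetical categories)


@dataclass(frozen=True)
class Edge:
    label: str
    kind: str  # "attribute" or "association" (hypothetical categories)
    source: Node
    target: Node


def example_query_graph() -> List[Edge]:
    """Hand-built graph for the Paper/Person query on this slide."""
    paper = Node("paper", "instance")
    person = Node("person", "instance")
    return [
        Edge("title", "attribute", paper, Node("?", "value")),           # SELECT title
        Edge("title", "attribute", paper, Node("Dataspaces", "value")),  # title LIKE ...
        Edge("author", "association", paper, person),                    # join condition
        Edge("name", "attribute", person, Node("Halevy", "value")),      # name LIKE ...
    ]


for edge in example_query_graph():
    print(f"{edge.source.label} --{edge.label}--> {edge.target.label}")
```

Every node label and edge label in this structure is a candidate keyword for the extraction step.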
32
Informativeness and Representativeness

Example
A paper authored by a person with name Halevy → Halevy's paper
Informativeness: measures the amount of information provided by a label term (i-score)
Representativeness: roughly corresponds to the probability that searching for the given term returns documents or web pages in the queried domain (r-score = 1 - distractiveness)
Informativeness > distractiveness, i.e., i-score + r-score > 1
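The condition on this slide can be written as a one-line check. The function name `worth_adding` is hypothetical; the sample scores are the (i-score, r-score) pairs that appear on later slides for the labels Halevy and ?.

```python
def worth_adding(i_score: float, r_score: float) -> bool:
    """True when informativeness exceeds distractiveness (1 - r-score)."""
    return i_score + r_score > 1.0


print(worth_adding(1.0, 0.8))  # Halevy: informative and representative
print(worth_adding(0.8, 0.0))  # "?": an unknown value says nothing about the domain
```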
33
Informativeness of a Label Depends on the Already
Selected Labels
Paper
34
Informativeness of a Label Depends on the Already
Selected Labels
Dataspace Paper in 2005
35
Informativeness of a Label Depends on the Already
Selected Labels
Dataspace Paper of Halevy and Franklin in 2005
36
Effect of a Selected Label on the i-scores of Other Labels
Each label is annotated with (i-score, r-score). Initial scores:
title (0.8,0.2)
?paper (0.8,0.6)
? (0.8,0)
author (0.8,0.4)
title (0.8,0.2)
person (0.8,0.6)
Dataspaces (1,0.8)
name (0.8,0.2)
Halevy (1,0.8)
Successive updates to the i-scores:
Second title: 0.8 - 0.8 = 0
?paper: 0.8 - 0.4 = 0.4
First title and author: 0.8 - 0.1 = 0.7
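The numbers on these slides are consistent with a penalty that shrinks with distance from the selected label. The sketch below reproduces them under two assumptions that the slides do not state: that the selected label is Dataspaces, and that the decrements 0.8, 0.4, and 0.1 apply at graph distances 1, 2, and 3. The adjacency table, the names `title_1` and `title_2` for the two title edges, and the function `propagate` are all hypothetical; the actual update rule is not given here.

```python
from collections import deque
from typing import Dict, List

# Hypothetical adjacency over graph elements (nodes and edge labels alike).
ADJACENT: Dict[str, List[str]] = {
    "Dataspaces": ["title_2"],
    "title_2": ["Dataspaces", "paper"],
    "paper": ["title_2", "title_1", "author"],
    "title_1": ["paper", "?"],
    "?": ["title_1"],
    "author": ["paper", "person"],
    "person": ["author", "name"],
    "name": ["person", "Halevy"],
    "Halevy": ["name"],
}

# Decrements read off the slides, keyed by assumed graph distance.
DECREMENT_BY_DISTANCE = {1: 0.8, 2: 0.4, 3: 0.1}


def propagate(i_scores: Dict[str, float], selected: str) -> Dict[str, float]:
    """Breadth-first walk from the selected label, lowering nearby i-scores."""
    updated = dict(i_scores)
    distance = {selected: 0}
    queue = deque([selected])
    while queue:
        current = queue.popleft()
        for neighbour in ADJACENT[current]:
            if neighbour in distance:
                continue
            distance[neighbour] = distance[current] + 1
            queue.append(neighbour)
            drop = DECREMENT_BY_DISTANCE.get(distance[neighbour], 0.0)
            # Round to hide floating-point noise; never go below zero.
            updated[neighbour] = max(0.0, round(updated[neighbour] - drop, 4))
    return updated


initial = {label: 0.8 for label in ADJACENT}
initial["Dataspaces"] = 1.0
initial["Halevy"] = 1.0
after = propagate(initial, "Dataspaces")
print(after["title_2"], after["paper"], after["title_1"], after["author"])
```

Labels farther than three steps away (?, person, name, Halevy) keep their initial i-scores, as on the slides.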
43
Extract Keywords Using Greedy Algorithm
Initial (i-score, r-score) of each label:
title (0.8,0.2)
?paper (0.8,0.6)
? (0.8,0)
author (0.8,0.4)
title (0.8,0.2)
person (0.8,0.6)
Dataspaces (1,0.8)
name (0.8,0.2)
Halevy (1,0.8)
Step 1: Choose all labels of value nodes, then update the i-scores of the remaining labels
title (0.6688,0.2)
?paper (0.5375,0.6)
? (0.8,0)
author (0.575,0.4)
title (0.3688,0.2)
person (0.5,0.6)
Dataspaces (1,0.8)
name (0.05,0.2)
Halevy (1,0.8)
Step 2: Choose the label with the highest i + r, provided i + r > 1, then update the i-scores of the remaining labels
Only ?paper (0.5375,0.6) and person (0.5,0.6) have i + r > 1; ?paper has the higher sum and is chosen
The i-score of person then drops to 0.2, giving person (0.2,0.6) with i + r = 0.8
Step 3: Iterate step 2 until no more labels can be added
No remaining label has i + r > 1, so the iteration stops
Final keyword set: Dataspaces, Halevy, paper
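The three steps can be sketched as a short loop. This is a minimal illustration, not the authors' code: the rule for updating i-scores is not given on the slides, so it is passed in as a function, and the stand-in `slide_update` simply looks up the i-scores transcribed from the slides. The names `greedy_keywords`, `slide_update`, `title_1`, and `title_2` are assumptions.

```python
from typing import Callable, Dict, List, Tuple

Scores = Dict[str, Tuple[float, float]]  # label -> (i-score, r-score)
Update = Callable[[Scores, List[str]], Scores]


def greedy_keywords(scores: Scores, value_labels: List[str], update: Update) -> List[str]:
    """Pick value labels first, then keep adding the best label while i + r > 1."""
    selected = list(value_labels)
    remaining = {k: v for k, v in scores.items() if k not in value_labels}
    remaining = update(remaining, selected)  # step 1: rescore after the value labels
    while remaining:
        best = max(remaining, key=lambda label: sum(remaining[label]))
        if sum(remaining[best]) <= 1:
            break  # step 3: nothing left that is worth adding
        selected.append(best)  # step 2: take the label with the highest i + r
        del remaining[best]
        remaining = update(remaining, selected)
    return selected


# i-scores after step 1, transcribed from the slides.
AFTER_VALUES = {"title_1": 0.6688, "paper": 0.5375, "?": 0.8, "author": 0.575,
                "title_2": 0.3688, "person": 0.5, "name": 0.05}


def slide_update(remaining: Scores, selected: List[str]) -> Scores:
    """Stand-in for the real update rule: a lookup of the slide values."""
    table = dict(AFTER_VALUES)
    if "paper" in selected:
        table["person"] = 0.2  # the only post-selection change shown on the slides
    return {label: (table[label], r) for label, (_, r) in remaining.items()}


INITIAL: Scores = {
    "title_1": (0.8, 0.2), "paper": (0.8, 0.6), "?": (0.8, 0.0),
    "author": (0.8, 0.4), "title_2": (0.8, 0.2), "person": (0.8, 0.6),
    "Dataspaces": (1.0, 0.8), "name": (0.8, 0.2), "Halevy": (1.0, 0.8),
}

print(greedy_keywords(INITIAL, ["Dataspaces", "Halevy"], slide_update))
# prints: ['Dataspaces', 'Halevy', 'paper']
```

Passing the update rule as a parameter keeps the selection loop independent of how i-scores are actually recomputed.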
50
Extract Keywords Using Greedy Algorithm: Another Example
[Query graph: ?paper with a title edge to ?, a title edge to Dataspaces (1,0.8), and two author edges to person nodes whose name edges lead to Franklin (1,0.8) and Halevy (1,0.8)]
Final keyword set: Dataspaces, Halevy, Franklin
51
Outline

Motivation
Problem Definition
Our Algorithm
Construct Query Graph
Extract Keywords
Experimental Results
Conclusions and Future Work

52
Experiment Setup

Six different domains: movie, geography, company profiles, bibliography, DBLP, and car profiles
Randomly select text values
Vary two parameters in the selected queries:
values: number of attribute values in the query (information given)
length: longest path from a queried instance to other instances (complexity of structure information)
Measure the quality of the extracted keywords with top-k precision

53
Initialize i-scores and r-scores: Without Domain Knowledge

i-scores
Text-value labels: 1
Labels of queried instances: 1
Other labels: 0.8
r-scores
Text-value node labels: 0.8
Labels of association edges between instances of the same type: 0.8
Instance node labels: 0.6
Association edge labels: 0.4
Attribute edge labels: 0.2
Number-value node labels: 0
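The default values listed above can be collected into a small lookup. The sketch below is only one way to encode the table; the `kind` strings, the `queried` and `same_type` flags, and the function name `default_scores` are assumptions made for illustration.

```python
from typing import Tuple

# r-scores from the slide, keyed by hypothetical label kinds.
DEFAULT_R = {
    "text_value": 0.8,
    "instance": 0.6,
    "association_edge": 0.4,
    "attribute_edge": 0.2,
    "number_value": 0.0,
}


def default_scores(kind: str, queried: bool = False, same_type: bool = False) -> Tuple[float, float]:
    """Return the (i-score, r-score) defaults used without domain knowledge."""
    if kind not in DEFAULT_R:
        raise ValueError(f"unknown label kind: {kind}")
    # i-score: 1 for text values and queried instances, 0.8 otherwise.
    is_queried_instance = kind == "instance" and queried
    i_score = 1.0 if kind == "text_value" or is_queried_instance else 0.8
    # r-score: association edges between instances of the same type get 0.8.
    if kind == "association_edge" and same_type:
        r_score = 0.8
    else:
        r_score = DEFAULT_R[kind]
    return i_score, r_score


print(default_scores("text_value"))      # (1.0, 0.8), e.g. Dataspaces
print(default_scores("attribute_edge"))  # (0.8, 0.2), e.g. title
```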

54
High Precisions in All Data Domains w/o Domain
Knowledge

Average top-2 precision was 0.68
Average top-10 precision was 0.59

55
Initialize i-scores and r-scores: With Domain Knowledge

Can obtain more meaningful r-scores
How:
Do keyword search on the labels
Calculate the percentage of top-k answers that are related to the queried domain
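The two steps above can be sketched as follows. The search engine and the domain test are passed in as functions because the slides do not say how either is implemented; `estimate_r_scores`, `search`, and `in_domain` are hypothetical names, and the tiny index in the usage example is toy data invented for the demonstration.

```python
from typing import Callable, Dict, Iterable, List


def estimate_r_scores(labels: Iterable[str],
                      search: Callable[[str], List[str]],
                      in_domain: Callable[[str], bool],
                      k: int = 10) -> Dict[str, float]:
    """For each label, the share of its top-k search answers that fall in the domain."""
    r_scores = {}
    for label in labels:
        top = search(label)[:k]
        hits = sum(1 for answer in top if in_domain(answer))
        r_scores[label] = hits / len(top) if top else 0.0
    return r_scores


# Toy stand-ins for a search engine and a domain classifier.
TOY_INDEX = {
    "paper": ["bib: dataspaces paper", "shop: paper towels"],
    "title": ["shop: title deeds", "news: title fight"],
}
scores = estimate_r_scores(
    ["paper", "title"],
    search=lambda label: TOY_INDEX.get(label, []),
    in_domain=lambda answer: answer.startswith("bib:"),
    k=2,
)
print(scores)  # prints: {'paper': 0.5, 'title': 0.0}
```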

56
Applying Domain Knowledge Increases Performance

Top-10 precision increased by 39% on average.

57
Increasing Value Increases Precision
[Charts: Movie and Geography domains]
58
Increasing Length Decreases Precision
[Charts: Movie and Geography domains]
59
Conclusions

Dataspace Support Platforms require answering
structured queries on unstructured data
Solution: transform a structured query into keyword search by keyword extraction
Our algorithm obtains good results in various
domains

60
Future Work

Refine the extracted keyword set by considering
the schema or a corpus of schemas
Use existing structured data to supplement the
selected keyword set
Perform linguistic analysis of the words in the
structured query
Develop methods for ranking answers from
structured and unstructured data sources

61
Precise Data Integration
[Chart labels: Cost, Benefit, Heterogeneity]
62
Approximate Data Integration
[Chart labels: Benefit, Cost, Heterogeneity]
63
Answering Structured Queries on Unstructured Data