Answering Structured Queries on Unstructured Data

1
Answering Structured Queries on Unstructured Data
  • Jing Liu, Xin (Luna) Dong, Alon Halevy
  • Univ. of Washington
  • at WebDB 2006

2
Seamless Querying on the Structured and
Unstructured Data
4
Dataspaces and DSSPs
  • Dataspace: a collection of both structured and
    unstructured data [PODS06 keynote talk]
  • Dataspace Support Platforms (DSSPs): services
    over dataspaces (e.g., search, query, and source
    discovery)

5-9
SEMEX: Personal Information Management
  • Slide 5 [figure labels]: Title, Year, Paper,
    Author, PublishedIn, CitedBy
  • Slide 6 [figure labels]: Time, Paper, FromFile
  • Slide 7: Article (Title dataspace) (Author Alon
    Halevy)
  • Slide 8: Mentioned In Article alon keynote pods06
  • Slide 9: Web search results by Google: Alon
    Halevy's Home Page, DBLP David Maier
10
Current Approach
  • Information-extraction approach
  • Use supervised learning
  • Hard to scale to data in a large number of
    domains
  • Hard to apply to the case where the query schema
    is unknown beforehand

11
Solution in SEMEX
  • SEMEX solution
    • Transform a structured query into a keyword
      search
    • Run the keyword search on unstructured data
  • Advantages
    • Applies to different domains
    • Handles different queries

12
SEMEX: Transform a Structured Query into a
Keyword Search
Article (Title dataspace) (Author alon
halevy) → Article dataspace alon halevy
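A minimal sketch of the flattening shown on this slide, assuming a hypothetical query representation (an element type plus attribute/value pairs). The function name `flatten_query` is illustrative and not part of SEMEX.

```python
def flatten_query(element_type, predicates):
    """Concatenate the element type and predicate values into a keyword string.

    `predicates` is a list of (attribute, value) pairs. Attribute names are
    dropped, mirroring the example on the slide.
    """
    keywords = [element_type]
    for _attribute, value in predicates:
        keywords.extend(value.split())
    return " ".join(keywords)


keyword_query = flatten_query(
    "Article", [("Title", "dataspace"), ("Author", "alon halevy")]
)
print(keyword_query)  # Article dataspace alon halevy
```

This naive version keeps every value and the element type; the following slides show why choosing which terms to keep matters.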
13-17
Challenges
  • Example
    SELECT title
    FROM paper
    WHERE title LIKE 'Dataspaces' AND year = 2005
  • Top-10 precision of candidate keyword queries
    • select title from paper where title LIKE
      dataspaces and year 2005: 0
    • title paper title dataspaces and year 2005: 0
    • dataspaces 2005: 0.2
    • dataspaces 2005 paper title: 0.2
    • dataspaces 2005 paper: 0.6
18
Outline
  • Motivation
  • Problem Definition
  • Our Algorithm
    • Construct Query Graph
    • Extract Keywords
  • Experimental Results
  • Conclusions and Future Work

19
Problem Definition
  • Keyword extraction (query transformation)
    • Input: a structured query
    • Output: a set of keywords
  • Measure the quality of the extraction using top-k
    precision of keyword-search answers
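
The quality measure can be sketched as follows. This is an illustrative helper, not code from the talk; it assumes the denominator is k even when fewer than k answers come back, and the document ids are made up.

```python
def top_k_precision(ranked_results, relevant, k):
    """Fraction of the first k ranked results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in ranked_results[:k] if doc in relevant)
    return hits / k


# Hypothetical ranked answers and relevance judgments.
ranked = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
relevant = {"d1", "d4", "d5", "d7", "d8", "d10"}

print(top_k_precision(ranked, relevant, 10))  # 0.6
print(top_k_precision(ranked, relevant, 2))   # 0.5
```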

20
Queries Considered
  • Only consider basic SPJ (selection, projection,
    simple joining) queries in our first step
  • Do not consider
  • Disjunctions
  • Comparison predicates (e.g., ≠, <, >)
  • Aggregations

21
How to Select Keywords?
  • Example
    SELECT title
    FROM paper
    WHERE title LIKE 'Dataspaces' AND year = 2005

26
Architecture Overview
SQL Queries / XML Queries / Triple Queries
→ Query-graph Construction → Query Graph
→ Keyword Extraction → Keyword Set
27
Construct Query Graph
  SELECT title
  FROM Paper, Person
  WHERE title LIKE 'Dataspaces'
  AND Paper.author = Person.id
  AND Person.name LIKE 'Halevy'

[Figure: query graph with labels title, ?paper, ?,
author, title, person, Dataspaces, name, Halevy]
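
One plausible way to represent this query graph is sketched below. The mapping is an assumption read off the slide labels, not the authors' implementation: each table becomes an instance node, each LIKE predicate an attribute edge to a value node, the join an association edge, and the selected column an attribute edge to an unknown value node `?`. The `QueryGraph` class and the kind names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class QueryGraph:
    """Labelled graph: instance and value nodes joined by labelled edges."""
    nodes: dict = field(default_factory=dict)  # node id -> (label, kind)
    edges: list = field(default_factory=list)  # (source, label, kind, target)

    def add_node(self, node_id, label, kind):
        self.nodes[node_id] = (label, kind)

    def add_edge(self, source, label, kind, target):
        self.edges.append((source, label, kind, target))

    def labels(self):
        """All node and edge labels, i.e. the candidate keywords."""
        node_labels = [label for label, _kind in self.nodes.values()]
        edge_labels = [edge[1] for edge in self.edges]
        return node_labels + edge_labels


g = QueryGraph()
g.add_node("p", "paper", "instance")
g.add_node("a", "person", "instance")
g.add_node("t", "?", "value")              # projected title, value unknown
g.add_node("v1", "Dataspaces", "value")
g.add_node("v2", "Halevy", "value")
g.add_edge("p", "title", "attribute", "t")
g.add_edge("p", "title", "attribute", "v1")
g.add_edge("p", "author", "association", "a")
g.add_edge("a", "name", "attribute", "v2")

print(sorted(g.labels()))
```

Under these assumptions the graph carries nine labels, the same multiset listed for the figure on this slide (with `paper` in place of `?paper`).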
32
Informativeness and Representativeness
  • Example
    • A paper authored by a person with name Halevy
      → Halevy's paper
  • Informativeness (i-score): measures the amount of
    information provided by a label term
  • Representativeness (r-score): roughly corresponds
    to the probability that searching the given term
    returns documents or webpages in the queried
    domain (r-score = 1 - distractiveness)
  • Criterion: informativeness > distractiveness,
    i.e., i-score + r-score > 1 (because
    distractiveness = 1 - r-score)

33-35
Informativeness of a Label Depends on the Already
Selected Labels
  • Paper
  • Dataspace Paper in 2005
  • Dataspace Paper of Halevy and Franklin in 2005
36-42
Effect of a Selected Label on the i-scores of
Other Labels
Labels with (i-score, r-score) before the update:
title (0.8,0.2)
?paper (0.8,0.6)
? (0.8,0)
author (0.8,0.4)
title (0.8,0.2)
person (0.8,0.6)
Dataspaces (1,0.8)
name (0.8,0.2)
Halevy (1,0.8)
Changes shown step by step across the slides:
  • -0.8: the title label listed after author goes
    from (0.8,0.2) to (0,0.2)
  • -0.4: ?paper goes from (0.8,0.6) to (0.4,0.6)
  • -0.1 each: the first title label goes from
    (0.8,0.2) to (0.7,0.2), and author goes from
    (0.8,0.4) to (0.7,0.4)
43-49
Extract Keywords Using Greedy Algorithm
  • Step 1: Choose all labels of value nodes, then
    update the i-scores of the remaining labels
    • Before: title (0.8,0.2), ?paper (0.8,0.6),
      ? (0.8,0), author (0.8,0.4), title (0.8,0.2),
      person (0.8,0.6), Dataspaces (1,0.8),
      name (0.8,0.2), Halevy (1,0.8)
    • After: title (0.6688,0.2), ?paper (0.5375,0.6),
      ? (0.8,0), author (0.575,0.4),
      title (0.3688,0.2), person (0.5,0.6),
      Dataspaces (1,0.8), name (0.05,0.2),
      Halevy (1,0.8)
  • Step 2: Choose the label with the highest i + r,
    provided i + r > 1, then update the i-scores of
    the remaining labels
    • Labels still shown with scores:
      ?paper (0.5375,0.6), person (0.5,0.6)
    • After the update: person (0.2,0.6)
  • Step 3: Repeat step 2 until no more labels can
    be added
  • Final keyword set: Dataspaces, Halevy, paper
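
The three steps can be sketched as the loop below. It is an illustration under stated assumptions, not the authors' code: `greedy_extract` and the `update` hook are hypothetical names, the two `title` edges are given suffixes to keep dictionary keys unique, and the hook simply replays the i-scores printed on the slides instead of computing them.

```python
def greedy_extract(scores, value_labels, update):
    """Greedy keyword selection over [i-score, r-score] pairs.

    update(scores, chosen) is a hook that lowers the i-scores of the
    remaining labels after the labels in `chosen` have been selected.
    """
    selected = list(value_labels)           # step 1: all value-node labels
    update(scores, list(value_labels))
    while True:                             # steps 2 and 3
        candidates = [label for label in scores if label not in selected]
        if not candidates:
            break
        best = max(candidates, key=lambda label: sum(scores[label]))
        if sum(scores[best]) <= 1:          # require i + r > 1
            break
        selected.append(best)
        update(scores, [best])
    return selected


# Initial [i-score, r-score] pairs as listed on the slides.
scores = {
    "title#1": [0.8, 0.2], "paper": [0.8, 0.6], "?": [0.8, 0.0],
    "author": [0.8, 0.4], "title#2": [0.8, 0.2], "person": [0.8, 0.6],
    "Dataspaces": [1.0, 0.8], "name": [0.8, 0.2], "Halevy": [1.0, 0.8],
}

# Replay table: i-scores shown on the slides after each selection.
replay = {
    ("Dataspaces", "Halevy"): {
        "title#1": 0.6688, "paper": 0.5375, "author": 0.575,
        "title#2": 0.3688, "person": 0.5, "name": 0.05,
    },
    ("paper",): {"person": 0.2},
}


def replay_update(scores, chosen):
    for label, new_i in replay.get(tuple(chosen), {}).items():
        scores[label][0] = new_i


keywords = greedy_extract(scores, ["Dataspaces", "Halevy"], replay_update)
print(keywords)  # ['Dataspaces', 'Halevy', 'paper']
```

After `paper` is added, the best remaining candidate is `author` with i + r = 0.975, which is not above 1, so the loop stops.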
50
Extract Keywords Using Greedy Algorithm: Another
Example
[Figure: query graph with labels title, ?paper, ?,
author, title, author, person, person,
Dataspaces (1,0.8), name, name, Franklin (1,0.8),
Halevy (1,0.8)]
Final keyword set: Dataspaces, Halevy, Franklin
52
Experiment Setup
  • Six different domains: movie, geography, company
    profiles, bibliography, DBLP, and car profiles
  • Randomly select text values
  • Vary two parameters in the selected queries
    • Values: number of attribute values in the query
      (information given)
    • Length: longest path from a queried instance to
      other instances (complexity of structure
      information)
  • Measure the quality of extracted keywords with
    top-k precision

53
Initialize i-scores and r-scores: Without Domain
Knowledge
  • i-scores
    • Text-value labels: 1
    • Labels of queried instances: 1
    • Other labels: 0.8
  • r-scores
    • Text-value node labels: 0.8
    • Labels of association edges between instances
      of the same type: 0.8
    • Instance node labels: 0.6
    • Association edge labels: 0.4
    • Attribute edge labels: 0.2
    • Number-value node labels: 0
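
These defaults can be written down as a small lookup, sketched here. The kind names (`text_value`, `attribute_edge`, and so on) and the `initial_scores` helper are hypothetical; only the numeric values come from the slide.

```python
# r-score defaults per label kind, as listed on the slide.
DEFAULT_R_SCORES = {
    "text_value": 0.8,
    "same_type_association_edge": 0.8,
    "instance": 0.6,
    "association_edge": 0.4,
    "attribute_edge": 0.2,
    "number_value": 0.0,
}


def initial_scores(kind, queried=False):
    """Return (i-score, r-score) for a label of the given kind.

    i-score is 1 for text values and for queried instances, 0.8 otherwise.
    """
    if kind == "text_value" or (kind == "instance" and queried):
        i_score = 1.0
    else:
        i_score = 0.8
    return i_score, DEFAULT_R_SCORES[kind]


print(initial_scores("text_value"))              # (1.0, 0.8)
print(initial_scores("attribute_edge"))          # (0.8, 0.2)
print(initial_scores("instance", queried=True))  # (1.0, 0.6)
```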

54
High Precisions in All Data Domains w/o Domain
Knowledge
  • Average top-2 precision was 0.68
  • Average top-10 precision was 0.59

55
Initialize i-scores and r-scores: With Domain
Knowledge
  • Can obtain more meaningful r-scores
  • How:
    • Do keyword search on the labels
    • Calculate the percentage of top-k answers that
      are related to the queried domain
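
A sketch of that estimate, assuming two hypothetical callables: `search`, which returns ranked hits for a keyword, and `in_domain`, which judges whether a hit belongs to the queried domain. The stub index and URLs below are invented for illustration, and dividing by the number of hits actually returned (rather than by k) is an assumption.

```python
def estimate_r_score(label, search, in_domain, k=10):
    """Share of the top-k hits for `label` that fall in the queried domain."""
    hits = search(label)[:k]
    if not hits:
        return 0.0
    return sum(1 for hit in hits if in_domain(hit)) / len(hits)


# Stub search index standing in for a real search engine.
fake_index = {
    "paper": [
        "dblp.org/a", "acm.org/b", "paper-mill.example/c",
        "origami.example/d", "arxiv.org/e",
    ],
}
bibliography_sites = ("dblp.org", "acm.org", "arxiv.org")

r_paper = estimate_r_score(
    "paper",
    search=lambda query: fake_index.get(query, []),
    in_domain=lambda url: url.startswith(bibliography_sites),
    k=5,
)
print(r_paper)  # 0.6
```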

56
Applying Domain Knowledge Increases Performance
  • Top-10 precision increased by 39% on average.

57
Increasing Value Increases Precision
[Figure: precision plots for the Movie and Geography
domains]
58
Increasing Length Decreases Precision
[Figure: precision plots for the Movie and Geography
domains]
59
Conclusions
  • Dataspace Support Platforms require answering
    structured queries on unstructured data
  • Solution: transform a structured query into a
    keyword search by keyword extraction
  • Our algorithm obtains good results in various
    domains

60
Future Work
  • Refine the extracted keyword set by considering
    the schema or a corpus of schemas
  • Use existing structured data to supplement the
    selected keyword set
  • Perform linguistic analysis of the words in the
    structured query
  • Develop methods for ranking answers from
    structured and unstructured data sources

61
Precise Data Integration
[Figure: labels Cost, Benefit, Heterogeneity]
62
Approximate Data Integration
[Figure: labels Benefit, Cost, Heterogeneity]
63
Answering Structured Queries on Unstructured Data
  • Jing Liu, Xin (Luna) Dong, Alon Halevy
  • Univ. of Washington
  • http://data.cs.washington.edu/semex/semex.html

64
Related Work
  • SCORE [CIKM 2005]
  • Extract keywords from query results on structured
    data.
  • Not generic.

65
Algorithm of Updating i-scores
  • Effect of selected label on other labels
    • Source node (or edge) has flow volume rs.
    • The flow value is divided among the neighbors.
    • The flow value decreases exponentially with the
      number of hops.
  • Update i-scores
    • i_new = i_old + effect of selected label (the
      effect is negative, as in the -0.8, -0.4, and
      -0.1 annotations on the earlier slides)
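
A rough sketch of such a flow-based update follows. Several choices are assumptions made only so that the sketch reproduces the -0.8, -0.4, and -0.1 annotations on the earlier "Effect of a Selected Label" slides: "rs" is read as the r-score of the source, the first hop is not decayed, later hops are halved, and propagation stops after three hops. It does not reproduce the i-scores shown in the greedy-algorithm slides, and the graph layout and label suffixes are hypothetical.

```python
def propagate_effect(adjacency, i_scores, source, volume, decay=0.5, max_hops=3):
    """Lower i-scores around `source` by a flow that splits and decays per hop."""
    visited = {source}
    frontier = [(source, volume)]
    for hop in range(1, max_hops + 1):
        next_frontier = []
        for element, flow in frontier:
            targets = [n for n in adjacency[element] if n not in visited]
            if not targets:
                continue
            share = flow / len(targets)      # divide among the neighbors
            if hop > 1:
                share *= decay               # shrink with each further hop
            for target in targets:
                visited.add(target)
                i_scores[target] = max(0.0, i_scores[target] - share)
                next_frontier.append((target, share))
        frontier = next_frontier
    return i_scores


# Nodes and edges of the example query graph, both treated as graph elements.
adjacency = {
    "Dataspaces": ["title#2"],
    "title#2": ["Dataspaces", "paper"],
    "paper": ["title#2", "title#1", "author"],
    "title#1": ["paper", "?"],
    "?": ["title#1"],
    "author": ["paper", "person"],
    "person": ["author", "name"],
    "name": ["person", "Halevy"],
    "Halevy": ["name"],
}
i_scores = {label: 0.8 for label in adjacency}
i_scores["Dataspaces"] = i_scores["Halevy"] = 1.0

propagate_effect(adjacency, i_scores, source="Dataspaces", volume=0.8)
print({label: round(score, 4) for label, score in i_scores.items()})
```

With these settings `title#2` drops to 0, `paper` to 0.4, and `title#1` and `author` to 0.7, while labels more than three hops away keep their initial i-scores.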