Integrating Online and Geospatial Information Sources

About This Presentation

Slides: 116

Provided by: craigkn9

Transcript and Presenter's Notes

Title: Integrating Online and Geospatial Information Sources

1
Integrating Online and Geospatial Information
Sources

Craig A. Knoblock
University of Southern California

2
Acknowledgements

Slide Integration
Thanks to Subbarao Kambhampati
We jointly gave a version of this tutorial at
AAAI-02
Also many thanks to
Jose Luis Ambite
Greg Barish
Alon (Ha)Levy
Steve Minton
Ion Muslea
Sheila Tejada
for permission to use/mutilate some of their
slides

3
Overview

Motivation for Information Integration
Database Refresher
Accessing Information Sources
Record Linkage
Data Integration/Query Planning
Plan Execution
Standards for Integration/Mediation
Discussion

4
Preamble Platitudes

Internet is growing at an enormous rate
Even after the bubble-burst
All kinds of information sources are online
Web pages, databases masquerading as web pages,
Services, Sensors
Promise of unprecedented information access to
every Tom, Dick and Mary..
But, right now, they still need to know where
to go, and be willing to manually put together
bits and pieces of information gleaned from
various sources and services
Information Integration aims to do this
automatically.

5
Isn't the web mostly text?

The invisible web is mostly structured
Most web servers have back end database servers
They dynamically convert (wrap) the structured
data into readable English
<India, New Delhi> => The capital of India is
New Delhi.
So, if we can unwrap the text, we have
structured data!
(un)wrappers, learning wrappers etc
Note also that such dynamic pages cannot be
crawled...
The (coming) Semi-structured web
Most pages are at least semi-structured
XML standard is expected to ease the transfer of
such pages.
The Services
Travel services, mapping services
The Sensors
Stock quotes, current temperatures, ticket prices

6
Why isn't this just search engines?

Search engines do text-based retrieval of URLs
Works reasonably well for single document texts,
or for finding sites based on single document
text
Cannot integrate information from multiple
documents
Cannot do effective query relaxation or
generalization
Cannot link documents and databases
The aim of Information integration is to support
query processing over structured and
semi-structured sources as well as services.

7
Why isn't this just databases or distributed databases?

No common schema
Sources with heterogeneous schemas (and
ontologies)
Semi-structured sources
Legacy Sources
Not relational-complete
Variety of access/process limitations
Autonomous sources
No central administration
Uncontrolled source content overlap
Unpredictable run-time behavior
Makes query execution hard
Presence of services
Need to compose services

8
(An exceedingly brief) Database Refresher
9
Traditional Database Architecture
Query (SQL)
Database Manager (DBMS)
- Storage management
- Query processing
- View management
- (Transaction processing)
Database (relational)
Answer (relation)
10
Database Outline

What we care about
Structured data representations
Relational databases
Deductive databases
Structured query languages
SQL
Views (and materialized views)

11
Relational Data Terminology
Product
Name         Price    Category     Manufacturer    <- attribute names
gizmo        19.99    gadgets      GizmoWorks
Power gizmo  29.99    gadgets      GizmoWorks
SingleTouch  149.99   photography  Canon
MultiTouch   203.99   household    Hitachi
Rows are tuples (arity = 4)
Schema:
Product(name: string, price: real, category: enum, manufacturer: string)
12
Relational Algebra

Operators
tuple sets as input, new set as output
Operations
Union, intersection, difference, ...
Selection (σ)
Projection (π)
Cartesian product (×)
Join (⋈)
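
The operators above can be sketched in a few lines. This is only an illustration, assuming relations are represented as Python lists of dicts; the helper names (select, project, join) and the Company relation are hypothetical and not from the slides. The Product rows are the ones from the Relational Data Terminology slide.

```python
# Hypothetical helpers: a relation is a list of dicts (one dict per tuple).

def select(rel, pred):
    """Selection: keep the tuples that satisfy the predicate."""
    return [t for t in rel if pred(t)]

def project(rel, attrs):
    """Projection: keep only the named attributes, dropping duplicate tuples."""
    seen, out = set(), []
    for t in rel:
        key = tuple(t[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attrs, key)))
    return out

def join(left, right, left_attr, right_attr):
    """Equi-join: pair up tuples whose join attributes are equal."""
    return [{**l, **r} for l in left for r in right if l[left_attr] == r[right_attr]]

product = [
    {"name": "gizmo", "price": 19.99, "category": "gadgets", "manufacturer": "GizmoWorks"},
    {"name": "Power gizmo", "price": 29.99, "category": "gadgets", "manufacturer": "GizmoWorks"},
    {"name": "SingleTouch", "price": 149.99, "category": "photography", "manufacturer": "Canon"},
    {"name": "MultiTouch", "price": 203.99, "category": "household", "manufacturer": "Hitachi"},
]
# Assumed second relation, invented for the join example.
company = [
    {"cname": "GizmoWorks", "country": "USA"},
    {"cname": "Canon", "country": "Japan"},
]

cheap_names = project(select(product, lambda t: t["price"] < 100), ["name"])
makers = project(join(product, company, "manufacturer", "cname"), ["name", "country"])
```

Each operator takes tuple sets as input and returns a new set, so they compose freely, which is the property the slide emphasizes.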

13
SQL: A query language for Relational Algebra
Many standards out there: SQL92, SQL2, SQL3, SQL99
Select attributes
From relations (possibly multiple, joined)
Where conditions (selections)
Find companies that manufacture products bought by Joe Blow:
    SELECT Company.name
    FROM Company, Product
    WHERE Company.name = Product.maker
      AND Product.name IN
          (SELECT product
           FROM Purchase
           WHERE buyer = 'Joe Blow')
Other features: aggregation, group-by, etc.
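
To try the Joe Blow query, here is a minimal sketch using Python's built-in sqlite3 module. The table layouts and all sample rows are assumptions made up for illustration; only the query shape comes from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schemas and sample rows, invented for this sketch.
conn.executescript("""
CREATE TABLE Company  (name TEXT);
CREATE TABLE Product  (name TEXT, maker TEXT);
CREATE TABLE Purchase (buyer TEXT, product TEXT);
INSERT INTO Company  VALUES ('GizmoWorks'), ('Canon'), ('Hitachi');
INSERT INTO Product  VALUES ('gizmo', 'GizmoWorks'),
                            ('SingleTouch', 'Canon'),
                            ('MultiTouch', 'Hitachi');
INSERT INTO Purchase VALUES ('Joe Blow', 'gizmo'),
                            ('Joe Blow', 'SingleTouch'),
                            ('Jane Doe', 'MultiTouch');
""")

rows = conn.execute("""
    SELECT Company.name
    FROM Company, Product
    WHERE Company.name = Product.maker
      AND Product.name IN
          (SELECT product FROM Purchase WHERE buyer = 'Joe Blow')
    ORDER BY Company.name
""").fetchall()
companies = [r[0] for r in rows]
```

The ORDER BY is added only to make the result order deterministic for the example.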
14
Deductive Databases

Relations viewed as predicates.
Interrelations between relations expressed as
datalog rules
(Horn clauses, without function symbols)
Queries correspond to datalog programs
Conjunctive queries are datalog programs with a
single non-recursive rule. They correspond to SPJ
queries in SQL.
Emprelated(Name,Dname) :- Empdep(Name,Dname)
Emprelated(Name,Dname) :- Empdep(Name,D1),
Emprelated(D1,Dname)

EDB predicate: Empdep
IDB predicate: Emprelated
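
The two Emprelated rules can be evaluated bottom-up by repeating the recursive rule until no new facts appear. The following is a sketch of naive fixpoint evaluation, assuming facts are stored as Python sets of pairs; the sample Empdep facts are invented for illustration.

```python
def emprelated(empdep):
    """Naive bottom-up evaluation of the two Emprelated rules over EDB facts."""
    derived = set(empdep)  # rule 1: Emprelated(N, D) :- Empdep(N, D)
    while True:
        # rule 2: Emprelated(N, D) :- Empdep(N, D1), Emprelated(D1, D)
        new = {
            (name, dname)
            for (name, d1) in empdep
            for (d1b, dname) in derived
            if d1 == d1b
        }
        if new <= derived:  # fixpoint reached: nothing new was derived
            return derived
        derived |= new

# Assumed EDB facts, invented for this sketch.
empdep = {
    ("alice", "databases"),
    ("databases", "engineering"),
    ("engineering", "company"),
}
closure = emprelated(empdep)
```

Because the second rule is recursive, this program is not a conjunctive query; the loop is what a single select-project-join query cannot express.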
15
Datalog

Datalog program: set of datalog rules
Datalog rule: conjunctive query
big-LA-buyers(Buyer, Seller, Price) :-
    person(Buyer, "Los Angeles"),
    purchase(Buyer, Seller, Product, Price),
    Price > 10000.

16
Datalog

Datalog program: set of datalog rules
Datalog rule: conjunctive query

Datalog:
big-LA-buyers(Buyer, Seller, Price) :-
    person(Buyer, "Los Angeles"),
    purchase(Buyer, Seller, Product, Price),
    Price > 10000.

First-Order Logic:
∀ Buyer, Seller, Price
    [ ∃ Product ( person(Buyer, "Los Angeles")
                  ∧ purchase(Buyer, Seller, Product, Price)
                  ∧ Price > 10000 )
      ⇒ big-LA-buyers(Buyer, Seller, Price) ]
17
Views
    CREATE VIEW Seattle-view AS
    SELECT buyer, seller, product, store
    FROM Person, Purchase
    WHERE Person.city = 'Seattle' AND
          Person.name = Purchase.buyer
Virtual vs. Materialized
We can later use the views:
    SELECT name, store
    FROM Seattle-view, Product
    WHERE Seattle-view.product = Product.name AND
          Product.category = 'shoes'
What's really happening when we query a view?
18
Conjunctive Queries and Views

    CREATE VIEW Big-LA-buyers AS
    SELECT buyer, seller, price
    FROM Person, Purchase
    WHERE Person.city = 'Los Angeles' AND
          Person.name = Purchase.buyer AND
          Purchase.price > 10000

big-LA-buyers(Buyer, Seller, Price) :-
    person(Buyer, "Los Angeles"),
    purchase(Buyer, Seller, Product, Price),
    Price > 10000.

Datalog rule = view definition
Rule body = select-from-where construct of SQL

19
Integrator vs. DBMS
Reprise

No common schema
Sources with heterogeneous schemas
Semi-structured sources
Legacy Sources
Not relational-complete
Variety of access/process limitations
Autonomous sources
No central administration
Uncontrolled source content overlap
Lack of source statistics
Tradeoffs between query plan cost, coverage,
quality etc.
Multi-objective cost models
Unpredictable run-time behavior
Makes query execution hard
Presence of services
Need to compose services

20
Accessing Information Sources

Wrapper Learning

21
Wrappers and Information Agents
[Figure: an agent receives the request "GIVE ME Thai food < 20, A-rated" and combines wrapped sources answering "Thai < 20" and "A-rated"]
22
Wrapper Induction

Problem description
Web sources present data in human-readable format
take user query
apply it to database
present results in template HTML page
To integrate data from multiple sources, one must
first extract relevant information from Web pages
Task: learn extraction rules based on labeled
examples
Hand-writing rules is tedious, error prone, and
time consuming

23
Example of Extraction Task
NAME: Casablanca Restaurant
STREET: 220 Lincoln Boulevard
CITY: Venice
PHONE: (310) 392-5751
24
Rule Learning

Machine learning
Use past experiences to improve performance
Rule learning
INPUT
Labeled examples: training and testing data
Admissible rules (hypothesis space)
Search strategy
Desired output
Rule that performs well both on training and
testing data

25
STALKER [Muslea et al., 98, 99, 01]

Hierarchical wrapper induction
Decomposes a hard problem into several easier ones
Extracts items independently of each other
Each rule is a finite automaton
Advantages
Powerful extraction language (e.g., embedded list)
One hard-to-extract item does not affect others
Disadvantage
Does not exploit item order (sometimes may help)

26
STALKER The Wrapper Architecture
Data
Query
Information Extractor
Extraction Rules
EC Tree
27
Extraction Rules
Extraction rule: sequence of landmarks
SkipTo(Phone) SkipTo(<i>)
SkipTo(</i>)
Name Joel's <p> Phone <i> (310) 777-1111
</i><p> Review
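
A landmark rule of this kind can be sketched as a sequence of substring searches. This is a simplified illustration, not STALKER itself: it assumes landmarks are literal strings (the real extraction language also allows wildcards and token classes), and the function names are hypothetical.

```python
def skip_to(text, pos, landmark):
    """Return the index just past the first occurrence of landmark at or after pos, or -1."""
    i = text.find(landmark, pos)
    return -1 if i < 0 else i + len(landmark)

def apply_rule(text, landmarks, start=0):
    """Consume each landmark in order; return the final position, or -1 if one is missing."""
    pos = start
    for landmark in landmarks:
        pos = skip_to(text, pos, landmark)
        if pos < 0:
            return -1
    return pos

def extract(text, start_rule, end_landmark):
    """Extract the text between the end of the start rule and the end landmark."""
    begin = apply_rule(text, start_rule)
    if begin < 0:
        return None
    end = text.find(end_landmark, begin)
    if end < 0:
        return None
    return text[begin:end].strip()

page = "Name Joel's <p> Phone <i> (310) 777-1111 </i><p> Review"
phone = extract(page, ["Phone", "<i>"], "</i>")
```

Here the start rule is SkipTo(Phone) SkipTo(<i>) and the item ends at the next </i>.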
28
The Embedded Catalog Tree (ECT)
RESTAURANT
    Name
    List(Locations)
        City
        List(PhoneNumbers)
            AreaCode
            Phone
    Cuisine

Name: KFC
Cuisine: Fast Food
Locations:
    Venice: (310) 123-4567, (800) 888-4412.
    L.A.: (213) 987-6543.
    Encino: (818) 999-4567, (888) 727-3131.

29
Example of Rule Induction
Training Examples

Name Del Taco <p> Phone (toll free) <b> ( 800
) 123-4567 </b><p>Cuisine ...
Name Burger King <p> Phone ( 310 ) 987-9876
<p> Cuisine

30
Learning the Extraction Rules
GUI
Inductive Learning System
Extraction Rules
Labeled Pages
31
Example of Rule Induction
Training Examples

Name Del Taco <p> Phone (toll free) <b> ( 800
) 123-4567 </b><p>Cuisine ...
Name Burger King <p> Phone ( 310 ) 987-9876
<p> Cuisine

Initial candidate: SkipTo( ( )
32
Example of Rule Induction
Training Examples

Name Del Taco <p> Phone (toll free) <b> ( 800
) 123-4567 </b><p>Cuisine ...
Name Burger King <p> Phone ( 310 ) 987-9876
<p> Cuisine

SkipTo( <b> ( ) ... SkipTo(Phone)
SkipTo( ( ) ... SkipTo() SkipTo(()
Initial candidate: SkipTo( ( )
33
Example of Rule Induction
Training Examples

Name Del Taco <p> Phone (toll free) <b> ( 800
) 123-4567 </b><p>Cuisine ...
Name Burger King <p> Phone ( 310 ) 987-9876
<p> Cuisine

Initial candidate: SkipTo( ( )

SkipTo(Phone) SkipTo() SkipTo( ( )
...
SkipTo( <b> ( ) ... SkipTo(Phone)
SkipTo( ( ) ... SkipTo() SkipTo(()
34
Active Learning and Information Agents

Active Learning
Idea: system selects the most informative examples to
label
Advantage: fewer examples to reach same accuracy
Information Agents
One agent may use hundreds of extraction rules
Small reduction of examples per rule => big
impact on user
Want to achieve 100% accuracy with as few
examples as possible

35
Which example should be labeled next?
SkipTo( Phone )
Training Examples
Name Joel's <p> Phone (310) 777-1111
<p>Review The chef
Name Kim's <p> Phone (213) 757-1111
<p>Review Korean
Unlabeled Examples
Name Chez Jean <p> Phone (310) 666-1111
<p> Review
Name Burger King <p> Phone(818) 789-1211
<p> Review ...
Name Café del Rey <p> Phone (310) 111-1111
<p> Review ...
Name KFC <p> Phone<b> (800) 111-7171 </b>
<p> Review...
36

Multi-view Learning
Two ways to find start of the phone number
SkipTo( Phone )
BackTo( ( Number ) )
Name KFC <p> Phone (310) 111-1111 <p>
Review Fried chicken
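
The two views can be sketched as two independent functions that each predict where the phone number starts; an unlabeled example on which they disagree is a contention point. This is a rough illustration only: it assumes the backward view looks for the last parenthesized digit group, approximated with a regular expression, and all function names are hypothetical.

```python
import re

def forward_view(text):
    """SkipTo(Phone): position right after the first 'Phone', skipping whitespace."""
    i = text.find("Phone")
    if i < 0:
        return None
    i += len("Phone")
    while i < len(text) and text[i].isspace():
        i += 1
    return i

def backward_view(text):
    """BackTo( ( Number ) ): start of the last '(digits)' group, scanning from the end."""
    matches = list(re.finditer(r"\(\s*\d+\s*\)", text))
    return matches[-1].start() if matches else None

def is_contention_point(text):
    """An example is a contention point when the two views disagree."""
    return forward_view(text) != backward_view(text)

agree = "Name KFC <p> Phone (310) 111-1111 <p> Review Fried chicken"
disagree = "Phone (800) 171-1771 <p> Fax (111) 111-1111 <p> Review"
```

On the first string both views point at "(310)". On the second, the forward view finds the phone number while the backward view lands on the fax number, so that example would be a candidate to show the user for labeling.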
37
Co-Testing

-
Labeled data
Unlabeled data
38
Co-Testing for Wrapper Induction
BackTo( (Number) )
SkipTo( Phone )
Name Joel's <p> Phone (310) 777-1111
<p>Review ...
Name Kim's <p> Phone (213) 757-1111
<p>Review ...
39
Not all queries are equally informative
Phone (800) 171-1771 <p> Fax (111)
111-1111 <p> Review
Phone<i> - </i><p> Review Founded a
century ago (1891), this
40
Weak Views

Learn content description for item to be
extracted
Too general for extraction
( Nmb ) Nmb Nmb: can't tell a phone number from
a fax number
Useful at discriminating among query candidates
Learned field description
Starts with ( Nmb )
Ends with Nmb Nmb
Contains Nmb Punct
Length 6,6

41
Naïve and Aggressive Co-Testing

Naïve Co-Testing
Query: randomly chosen contention point
Output: rule with fewest mistakes on queries
Aggressive Co-Testing
Query: contention point that most violates weak
view
Output: committee vote (2 rules + weak view)

42
Empirical Results: 33 Difficult Tasks

33 most difficult of the 140 extraction tasks
Each view: > 7 labeled examples for best accuracy
At least 100 examples per task

43
Results in 33 Difficult Domains
Extraction Tasks
44
Results in 33 Difficult Domains
Extraction Tasks
45
Results in 33 Difficult Domains
Extraction Tasks
46
Automatic Wrapper Generation [Crescenzi, Mecca,
Merialdo, 2001]

Automatically generates wrappers for web pages
Supports nested structures and lists
Applies to large, complex pages with regular
structure
Approach
Start with the first page and create a union-free
regular expression that defines the wrapper
Match each successive sample against the wrapper
Mismatches result in generalizations of the
regular expression

47
Example Matching
48
Types of Mismatches

String mismatches are used to discover fields of
the document
Tag mismatches can indicate either optional
elements or iterators
For iterations, mismatch is caused by repeated
elements in a list
End of the list corresponds to last matching
token
Beginning of list corresponds to one of the
mismatched tokens
These create possible squares

49
Limitations

Assumptions
Pages are well-structured
Want to extract at the level of entire fields
Structure can be modeled without disjunctions
Search space for explaining mismatches is huge
Uses a number of heuristics to prune space
Limited backtracking
Limit on number of choices to explore
Patterns cannot be delimited by optionals
Will result in pruning possible wrappers

50
Record Linkage

Integrating Data Across Sources

51
Record Linkage

Problem
Different sources typically represent and format
information differently.
As a result, determining if two sources are
referring to the same object can be difficult.
Example
Is Joe Cool the same person as Joseph B.
Cool?
What if they have the same telephone number?
What if Joe Cool's number is 310-322-0730 and
Joseph B. Cool's number is 310-640-2973?

52
Example Data Integration Problem

How to align (or join) the objects across
different sources

Zagat's Restaurant Guide Source:
    Art's Delicatessen
    Ca' Brea
    CPK
    The Grill
    Patina
    Philippe's The Original
    The Tillerman
Department of Health Restaurant Source:
    Art's Deli
    California Pizza Kitchen
    Campanile
    Citrus
    Grill, The
    Philippe The Original
    Spago
53
Information Retrieval Approach [Cohen, 1998]

Idea: Evaluate the similarity of records via
textual similarity. Used in Whirl (Cohen 1998).
Follows the same approach used by classical IR
algorithms (including web search engines).
First, stemming is applied to each entry.
E.g., Joe's Diner -> Joe s Diner
Then, entries are compared by counting the number
of words in common.
Note: Infrequent words are weighted more heavily by
the TF-IDF metric (Term Frequency x Inverse Document
Frequency)
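
A rough sketch of TF-IDF weighted record comparison follows. It is not Whirl's exact weighting scheme: the idf formula log(1 + N/df), the crude tokenizer (no stemming), and all function names are assumptions chosen for brevity. The record strings are taken from the restaurant example.

```python
import math
from collections import Counter

def tokenize(s):
    """Lowercase, treat commas as spaces, split on whitespace (no stemming)."""
    return s.lower().replace(",", " ").split()

def tfidf_vectors(records):
    """Weight each token by term frequency times an assumed idf of log(1 + N/df)."""
    docs = [Counter(tokenize(r)) for r in records]
    n = len(docs)
    df = Counter(tok for d in docs for tok in d)  # document frequency per token
    return [{t: tf * math.log(1 + n / df[t]) for t, tf in d.items()} for d in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

records = ["Art's Deli", "Art's Delicatessen", "The Grill",
           "Grill, The", "California Pizza Kitchen", "Spago"]
vecs = tfidf_vectors(records)
sim_grill = cosine(vecs[2], vecs[3])   # same words, different order
sim_arts = cosine(vecs[0], vecs[1])    # one shared word
sim_other = cosine(vecs[2], vecs[5])   # no shared words
```

Word order does not matter, so "The Grill" and "Grill, The" match perfectly, while "Deli" versus "Delicatessen" only match on the shared word. Abbreviations such as CPK get no credit at all, which motivates the learned transformations discussed later.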

54
Unsupervised Record Linkage

Idea: Analyze data and automatically cluster
pairs into three groups
Let R = P(obs | Same) / P(obs | Different)
Matched if R > threshold TU
Unmatched if R < threshold TL
Ambiguous if TL < R < TU
This model for computing decision rules was
introduced by Fellegi & Sunter in 1969
Particularly useful for statistically linking
large sets of data, e.g., by US Census Bureau
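
The three-way decision rule can be written down directly. This is a minimal sketch: the probabilities and thresholds below are made-up numbers, and in practice both would be estimated from data.

```python
def classify(p_obs_same, p_obs_diff, t_upper, t_lower):
    """Decision rule on the likelihood ratio R = P(obs | Same) / P(obs | Different)."""
    r = float("inf") if p_obs_diff == 0 else p_obs_same / p_obs_diff
    if r > t_upper:
        return "matched"
    if r < t_lower:
        return "unmatched"
    return "ambiguous"

# Assumed thresholds and probabilities, for illustration only.
T_UPPER, T_LOWER = 10.0, 0.1
decision_a = classify(0.90, 0.01, T_UPPER, T_LOWER)  # R = 90
decision_b = classify(0.05, 0.80, T_UPPER, T_LOWER)  # R = 0.0625
decision_c = classify(0.30, 0.20, T_UPPER, T_LOWER)  # R = 1.5
```

Pairs in the ambiguous band are the ones typically sent for clerical review.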

55
Unsupervised Record Linkage (cont.)

Winkler (1998) used the EM algorithm to estimate
P(obs | Same) and P(obs | Different)
EM computes the maximum likelihood estimate
The algorithm iteratively determines the
parameters most likely to generate the observed
data.
Additional mathematical techniques must be used
to adjust for relative frequencies
E.g., the last name Smith is much more frequent
than Knoblock.

56
Supervised Active Learning Approach [Tejada,
Knoblock & Minton, 2001]

Supervised learning. System learns
Which attributes to weight more heavily
Transformation rules

57
Mapping Rules

Set of Similarity Scores
Name    Street  Phone
.967    .973    .3
.17     .3      .74
.8      .542    .49
.95     .97     .67

Mapping Rules
Name > .8 & Street > .79 => mapped
Name > .89 => mapped
Street < .57 => not mapped
58
Transformation Weights

Appropriate transformations depend on the
application domain
Restaurants, companies, airports
and on the different attributes within an
application
Acronym is more appropriate for restaurant name
than phone number
Learn the likelihood that, if a transformation is
applied, then two objects match

Transformation Weight = P(match | transformation)
59
Learning Object Mappings
Active Atlas
Set of Mapped Objects
Source 1
Candidate Generator
Mapping Learner
Source 2

Candidate Generator
Judge textual similarity of mappings
Reduce number of mappings considered for
classification
Mapping Learner
Active learning technique to learn mapping rules
and transformation weights
System chooses most informative example for the
user to label
Minimize the amount of user interaction

User Input
60
Mapping Rule Learner
61
Committee Disagreement

Chooses an example based on the disagreement of
the query committee
In this case CPK, California Pizza Kitchen is the
most informative example based on disagreement

Committee
Examples                         M1    M2    M3
Art's Deli, Art's Delicatessen   Yes   Yes   Yes
CPK, California Pizza Kitchen    Yes   No    Yes
Ca'Brea, La Brea Bakery          No    No    No
62
Data Integration/Query Planning
63
Principal Dimensions of Data Integration

Virtual vs. materialized architecture
Access: query only, or query and update?
Mediated schema and query reformulation
Content Descriptions
Global-as-view
Local-as-view
Language for descriptions and queries
conjunctive queries (CQs), union of CQs, Datalog
(recursion), first-order logic (?,?,?),
description logics,
Types of Sources
Structured (DBs) vs. semi-structured (Web)
Source capabilities positive and negative

64
Materialized Architecture: Data Warehouse
65
Virtual Architecture: Mediator
66
Virtual Integration Architecture

Leave the data in the sources
When a query comes in
Determine the relevant sources to the query
Break down the query into sub-queries for the
sources
Get the answers from the sources, and combine
them appropriately
Data is fresh. Approach scalable
Issues
Relating sources and mediator
Reformulating the query
Efficient planning and execution

[Figure: user queries are posed against the mediated schema; the mediator contains a reformulation engine, an optimizer, and an execution engine, with a data source catalog; one wrapper sits in front of each data source]
Systems: Garlic [IBM], Hermes [UMD], Tsimmis and
InfoMaster [Stanford], DISCO [INRIA], Information
Manifold [AT&T], SIMS/Ariadne [USC], Emerac/Havasu [ASU]
67
Desiderata for Relating Source-Mediator Schemas

Expressive power: distinguish between sources
with closely related data. Hence, be able to
prune access to irrelevant sources.
Easy addition: make it easy to add new data
sources.
Reformulation: be able to reformulate a user
query into a query on the sources efficiently and
effectively.
Nonlossy: be able to handle all queries that can
be answered by directly accessing the sources
68
Source Descriptions

Elements of source descriptions
Contents: source contains movies, directors,
cast.
Constraints: only movies produced after 1965.
Completeness: contains all American movies.
Capabilities
Negative: source requires movie title or director
as input
Positive: source can perform selections, joins, ...

69
Approaches to Specification of Source Descriptions

Global-as-View (GAV)
Mediator relation defined as a view over source
relations
E.g., TSIMMIS (Stanford), HERMES (Maryland)
Local-as-View (LAV)
Source relation defined as view over mediator
relations
E.g., Information Manifold (AT&T), Tukwila (UW),
InfoMaster (Stanford), Ariadne (USC)

View = named query = logical formula
70
Query Reformulation

Problem: rewrite the user query expressed in the
mediated schema into a query expressed in the
source schemas
Given a query Q in terms of the mediated-schema
relations, and descriptions of the information
sources,
Find a query Q' that uses only the source
relations, such that
Q' ⊆ Q (i.e., answers are correct)
and
Q' provides all possible answers to Q given the
sources

71
Answering queries using views

Given query q and view definitions V = {V1, ..., Vn}
q' is an Equivalent Rewriting of q using V if
q' refers only to views in V, and
q' ≡ q
q' is a Maximally-Contained Rewriting of q using
V if
q' refers only to views in V, and
q' ⊆ q, and
there is no rewriting q1 such that q' ⊆ q1 ⊆ q
and q1 ≢ q'

72
Global-as-View (GAV)

Each mediator relation is defined as a view over
source relations.
MovieActor(title, actor) ⊇ DB1(id, title, actor, year)
MovieActor(title, actor) ⊇ DB2(title, director, actor, year)
MovieReview(title, review) ⊇ DB1(id, title, actor, year) ∧ DB3(id, review)

73
Query Reformulation in GAV

Query reformulation = rule unfolding + simplification
Query: Find reviews for DeNiro movies
q(title, review) :- MovieActor(title, DeNiro),
                    MovieReview(title, review)
1. q(title, review) :- DB1(id, title, DeNiro, year),
                       DB1(id, title, actor, year),
                       DB3(id, review)
2. q(title, review) :- DB2(title, director, DeNiro, year),
                       DB1(id, title, actor, year),
                       DB3(id, review)
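
Rule unfolding can be sketched as replacing each mediator atom by each of its view definitions and taking all combinations. This is an illustrative sketch only: atoms are represented as (predicate, arguments) tuples, view-local variables are renamed apart with a numeric suffix (the slide reuses the names instead), and the simplification step is omitted.

```python
from itertools import product

# GAV definitions: mediator relation -> list of (head variables, body atoms).
VIEWS = {
    "MovieActor": [
        (("title", "actor"), [("DB1", ("id", "title", "actor", "year"))]),
        (("title", "actor"), [("DB2", ("title", "director", "actor", "year"))]),
    ],
    "MovieReview": [
        (("title", "review"),
         [("DB1", ("id", "title", "actor", "year")), ("DB3", ("id", "review"))]),
    ],
}

def unfold_atom(pred, args, n):
    """Yield one source-level body per definition of a mediator atom.

    Head variables are bound to the query's arguments; any other variable
    gets the suffix _n so that different atoms do not share local names.
    """
    for head_vars, body in VIEWS[pred]:
        binding = dict(zip(head_vars, args))
        yield [(p, tuple(binding.get(v, f"{v}_{n}") for v in vs)) for p, vs in body]

def unfold(query_body):
    """Return every combination of definitions, flattened into one body each."""
    alternatives = [list(unfold_atom(p, a, i)) for i, (p, a) in enumerate(query_body)]
    return [[atom for part in combo for atom in part] for combo in product(*alternatives)]

query = [("MovieActor", ("title", "'DeNiro'")), ("MovieReview", ("title", "review"))]
rewritings = unfold(query)
```

The two bodies produced correspond to rewritings 1 and 2 above, one per definition of MovieActor.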

74
Local-as-View (LAV)

Each source relation is defined as a view over
mediator relations
V1(title, year, director) ⊆ Movie(title, year, director, genre)
    ∧ American(director) ∧ year ≥ 1960 ∧ genre = Comedy
V2(title, review) ⊆ Movie(title, year, director, genre)
    ∧ year ≥ 1990 ∧ MovieReview(title, review)
75
Query Reformulation in LAV
Query: Reviews for comedies produced after 1950
q(title, review) :- Movie(title, year, director, Comedy),
                    year ≥ 1950, MovieReview(title, review)

Reformulated query:
q'(title, review) :- V1(title, year, director),
                     V2(title, review)

q' ⊆ q
V1(title, year, director) ⊆ Movie(title, year, director, genre)
    ∧ American(director) ∧ year ≥ 1960 ∧ genre = Comedy
V2(title, review) ⊆ Movie(title, year, director, genre)
    ∧ year ≥ 1990 ∧ MovieReview(title, review)
76
Integrating GIS and Imagery: Global-as-View
Approach [Gupta et al.]

GIS Source
Soil maps
Parcel maps
Digital elevation maps
Transportation network maps
Image Library
Satellite imagery
Aerial images
Property photographs

77
Mediation in MIX

Mediator defined by building a structured
representation of both GIS and image sources
Mediator relations defined by
Containment conditions
Spatial or temporal joins
Logical associations
Queries and results in XML

78
Mediation in MIX (cont.)

Wrappers
Construct wrappers for the GIS and image data
sources
Evaluating spatial queries
Determine subqueries to each of the sources
Compose results and produce integrated XML
document
Spatial data converter used to handle conversions
between sources (e.g., UTM to USGS 7.5 quad)

79
Example

Produce a table of aerial imagery and photographs
of houses broken down by 5-year increments and
Total Assessed Value

80
Result
81
Plan Execution
82
Motivation

Problem
Information gathering may involve accessing and
integrating data from many sources
Total time to execute these plans may be large
Why?
Unpredictable network latencies
Varying remote source capabilities
Thus, execution is often I/O-bound
Complicating factor: binding patterns
During execution, many sources cannot be queried
until a previous source query has been answered

83
GAV vs. LAV

GAV
Not modular
Addition of new sources changes the mediated
schema
Can be awkward to write mediated schema without
loss of information
Query reformulation easy
reduces to view unfolding (polynomial)
Can build hierarchies of mediated schemas
Best when
Few, stable, data sources
well-known to the mediator (e.g. corporate
integration)
Garlic, TSIMMIS, HERMES

LAV
Modular--adding new sources is easy
Very flexible--power of the entire query language
available to describe sources
Reformulation is hard
Involves answering queries only using views (can
be intractable)
Best when
Many, relatively unknown data sources
possibility of addition/deletion of sources
Information Manifold, InfoMaster, Emerac

84
Traditional Approaches

Executing information gathering plans
Generate a plan
Plan typically consists of a partial ordering of
the operators
Execute the plan based on the given order
Operators process all of their input data before
transmitting any results to consumer(s)
Operators are only as fast as their most latent input
Long delays due to the dependencies in the plan

85
Dataflow vs Von-Neumann
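Example expression: ((a + b) * (c + d))
[Figure: the expression evaluated under each model; inputs a, b, c, d feed two ADD operations whose results feed one MUL. In the dataflow graph each operation is an actor and each connection is an arc.]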
86
Streaming Dataflow

Plans consist of a network of operators
Each operator like a function
Examples: Wrapper, Select, etc.
Operators produce and consume data
Operators fire when any part of any input data
becomes available
Data routed between operators are relations
Zero or more tuples with one or more attributes

[Figure: example plan with an input and an output, built from two Wrapper operators, a Join, and a Select]
87
Parallelism of Streaming Dataflow

Dataflow (horizontal parallelism)
Decentralized, independent operator execution
Enables "maximally parallel" operator execution
Also known as the "dataflow limit"
Streaming/pipelining (vertical parallelism)
Producer emits tuples to consumer ASAP
Producer and consumer can process same relation
simultaneously
Effective because information gathering latencies
can be high even at the tuple level
Data often "trickles" out of I/O-bound operators
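
Streaming between operators can be imitated with Python generators, where a consumer receives each tuple as soon as the producer emits it. This sketch only illustrates pipelining (vertical parallelism) in a single thread, not the decentralized execution of a real dataflow engine; the operator names and the sample tuples are hypothetical.

```python
trace = []  # records the order in which events happen

def wrapper(name, rows):
    """Stand-in for a source wrapper: yields tuples one at a time as they 'arrive'."""
    for row in rows:
        trace.append(f"{name} emits {row['name']}")
        yield row

def select(stream, pred):
    """Streaming Select: passes each qualifying tuple on immediately."""
    for row in stream:
        if pred(row):
            yield row

def consume(stream):
    """Final consumer: collects results and logs when each one is received."""
    out = []
    for row in stream:
        trace.append(f"consumer got {row['name']}")
        out.append(row)
    return out

# Assumed sample data, invented for this sketch.
officials = [
    {"name": "A", "office": "senator"},
    {"name": "B", "office": "mayor"},
    {"name": "C", "office": "house"},
]
plan = select(wrapper("officials", officials),
              lambda r: r["office"] in ("senator", "house"))
results = consume(plan)
```

The trace shows tuple A reaching the consumer before the wrapper has emitted B or C, which is the behavior that hides slow, trickling sources.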

88
Example: The RepInfo Agent

INPUT
Any street address
e.g., 4767 Admiralty Way, Marina del Rey, CA,
90292
OUTPUT
Federal reps
2 senators,
1 house member
For each rep
Recent news
Real-time funding
information

89
RepInfo Sources
90
RepInfo Sources
91
RepInfo Sources
92
OpenSecrets Navigation Fetching!
93
OpenSecrets Navigation Fetching!
94
OpenSecrets Navigation Fetching!
95
OpenSecrets Navigation Fetching!
96
RepInfo agent plan
[Figure: plan dataflow. Data passed between operators: address, all officials, senators and house reps, member URL, funding URL, graph URL, recent news, combined results. Operators: Wrapper Vote-Smart; Select senators, house reps; Wrapper OpenSecrets (names page); Wrapper OpenSecrets (member page); Wrapper OpenSecrets (funding page); Wrapper Yahoo News; Join on name]
97
Adaptive Query Execution

Network Query Engines
Tukwila
Operator reordering
Optimized operators
Telegraph
Tuple-level adaptivity
Niagara
Partial results for blocking operators
Agent Execution Language
Theseus
Speculative execution

98
How to speculate?

General problem
Means for issuing and confirming predictions
Two new operators
Speculate: Makes predictions based on "hints"
Confirm: Prevents errant results from exiting
plan

[Figure: Speculate receives hints and answers and emits predictions/additions and confirmations; Confirm receives probable results and confirmations and emits actual results]
99
How to speculate?

Example: RepInfo
Make predictions about officials based on address
Makes practical sense
Representatives do not change often
Addresses-to-reps is a many-to-one relationship

100
Speedups beyond 2

Cascading speculation
Speculation on speculation
Functional dependencies
Enable early confirmation because subsequent FD
processing is deterministic

101
Learning to Speculate

Accurate predictions
The better our prediction accuracy, the better
the speedup
Example
Predict federal officials given an address
Categories of predictions
How do we deal with?
New hints
Making novel predictions

102
Caching

Associate answers with previously seen hints
Advantages
Simple
Disadvantages
Requires lots of space
Only supports previously seen predictions on
previously seen data (category A)

103
Other ways to predict

Classification
4780 Admiralty Way, Marina del Rey, CA 90292
Likely reps (Boxer, Feinstein, and Harman)
We have learned that zip code and city are the
features that most likely indicate the
representative
Translation
Some data have predictable transformations
Example the OpenSecrets source
Member URL:
http://www.opensecrets.org/politicians/summary.asp?CID=N00006750
Funding URL:
http://www.opensecrets.org/politicians/sector.asp?CID=N00006750
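
The translation between the two OpenSecrets URLs amounts to swapping the page name while keeping the member id. A minimal sketch follows, assuming the query string has the form ?CID=<id> (the "=" appears to have been lost in this transcript) and using a hypothetical function name.

```python
def member_to_funding_url(member_url):
    """Predict the funding URL by swapping the page name in the member URL."""
    return member_url.replace("/summary.asp?", "/sector.asp?", 1)

# URL from the slide, with the assumed "=" restored.
member = "http://www.opensecrets.org/politicians/summary.asp?CID=N00006750"
funding = member_to_funding_url(member)
```

A Speculate operator could issue such a predicted URL before the member page has actually been fetched, leaving Confirm to check it afterwards.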

104
Standards for Integration/Mediation
105
The X-standards

XML: an on-the-wire representation for data
XQuery: a query language for XML
XSchema/DTD: a schema description language for
XML data
RDF: a language for meta-data description
WSDL/SOAP/UDDI: languages for describing services

106
HTML vs. XML

XML:
<bibliography>
  <book> <title> Foundations </title>
    <author> Abiteboul </author>
    <author> Hull </author>
    <author> Vianu </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
</bibliography>

HTML:
<h1> Bibliography </h1>
<p> <i> Foundations of Databases </i>
    Abiteboul, Hull, Vianu
    <br> Addison Wesley, 1995
<p> <i> Data on the Web </i>
    Abiteboul, Buneman, Suciu
    <br> Morgan Kaufmann, 1999

XML is self-describing
- Schema info is part of the data
- Good for data exchange (albeit baroque for storage)
107
XML Terminology

tags: book, title, author, ...
start tag: <book>, end tag: </book>
elements: <book>...</book>, <author>...</author>
elements are nested
empty element: <red></red>, abbreviated <red/>
an XML document: single root element

well-formed XML document: if it has matching tags
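
Well-formedness can be checked with any XML parser. The following sketch uses Python's standard xml.etree.ElementTree module; the sample documents and the helper name is_well_formed are made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the text parses as a single-rooted XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

good = "<book><title>Foundations</title><author>Hull</author><red/></book>"
bad_tags = "<book><title>Foundations</book>"   # <title> is never closed
two_roots = "<book/><book/>"                   # more than one root element

root = ET.fromstring(good)
child_tags = [child.tag for child in root]     # nested elements under the root
```

Note that well-formedness says nothing about a schema: the parser accepts any tag names as long as they nest and match.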
108
Why are Database folks so excited about XML?

XML is just a syntax for (self-describing) data
This is still exciting because
No standard syntax for relational data
With XML, we can
Translate any legacy data to XML
Can exchange data in XML format
Ship over the web, input to any application

109
XML vs. Relational Data

XML is meant as a language that supports both
Text and Structured Data
Conflicting demands...
XML supports semi-structured data
In essence, the schema can be union of multiple
schemas
Easy to represent books with or without prices,
books with any number of authors etc.
XML supports free mixing of text and data
using the PCDATA type
XML is ordered (while relational data is
unordered)

110
Querying XML

Requirements
Need to handle lack of schema.
We may not know much about the data, so we need
to navigate the XML.
Need to support both information retrieval and
SQL-style queries.
Ordered vs. un-ordered XML
Human readable
like SQL?
Candidates
Many based on conflicting requirements
XSL: Makes IR folks happy
XML-QL: Makes DB folks happy
XQuery: W3C's attempt to make everybody (un)happy

111
Example Query
Query:

<bib>
{
  for $b in /bib/book
  where $b/publisher = "Addison-Wesley"
    and $b/@year > 1991
  return <book year="{ $b/@year }">
           { $b/title }
         </book>
}
</bib>

For all books after 1991,
return with Year changed from
a tag to an attribute

Result:

<bib>
  <book year="1994"> <title>TCP/IP Illustrated</title> </book>
  <book year="1992"> <title>Advanced Programming in the Unix environment</title> </book>
</bib>
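
For readers without an XQuery engine at hand, the same selection can be approximated with Python's standard xml.etree.ElementTree module. This is a hand-written equivalent, not a translation produced by any tool: the source document is invented (only the two result titles come from the slide), and the function name books_after is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed input document; the last two books are invented to exercise the filters.
SOURCE = """<bib>
  <book year="1994"><title>TCP/IP Illustrated</title><publisher>Addison-Wesley</publisher></book>
  <book year="1992"><title>Advanced Programming in the Unix environment</title><publisher>Addison-Wesley</publisher></book>
  <book year="2000"><title>Data on the Web</title><publisher>Morgan Kaufmann</publisher></book>
  <book year="1990"><title>An Older Book</title><publisher>Addison-Wesley</publisher></book>
</bib>"""

def books_after(source_xml, publisher, year):
    """Build a <bib> element holding year and title of the qualifying books."""
    src = ET.fromstring(source_xml)
    out = ET.Element("bib")
    for book in src.findall("book"):
        if book.findtext("publisher") == publisher and int(book.get("year")) > year:
            new_book = ET.SubElement(out, "book", year=book.get("year"))
            ET.SubElement(new_book, "title").text = book.findtext("title")
    return out

result = books_after(SOURCE, "Addison-Wesley", 1991)
titles = [b.findtext("title") for b in result]
years = [b.get("year") for b in result]
result_xml = ET.tostring(result, encoding="unicode")
```

The for/where/return clauses of the query map onto the loop, the if test, and the element construction respectively.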
112
Impact of XML on Integration

If and when all sources accept Xqueries and
exchange data in XML format, then
Mediator can accept user queries in Xquery
Access sources using Xquery
Get data back in XML format
Merge results and send to user in XML format
How about now?
Sources can use XML adapters (middle-ware)

113
XML middleware for Databases

XML adapters (middle-ware) received significant
attention in DB community
SilkRoute (AT&T)
Xperanto (IBM)
Issues
Need to convert relational data into XML
Tagging (easy)
Need to convert Xquery queries into equivalent
SQL queries
Trickier as Xquery supports schema querying
A single query may be mapped into a union of SQL
queries

114
Is XML standardization a magical solution for
Integration?

If all Web sources standardize into XML format
Source access (wrapper generation issues) becomes
easier to manage
BUT all other problems remain
Still need to relate source (XML) schemas to
mediator (XML) schema
Still need to reason about source overlap, source
access limitations etc.
Still need to manage execution in the presence of
source/network uncertainties

115
Semantic Web

The LAV/GAV approaches assume that some human
expert will do the actual schema mapping
The semantic-web initiative attempts to
automate schema mapping
Idea: Allow pages to write logical axioms
relating their vocabulary (tags) to other
external tags
Support automatic inference of relations between
source and mediator schema using these rules
DAML+OIL

116
Review