Title: Integrating Online and Geospatial Information Sources
1Integrating Online and Geospatial Information
Sources
- Craig A. Knoblock
- University of Southern California
2Acknowledgements
- Slide Integration
- Thanks to Subbarao Kambhampati
- We jointly gave a version of this tutorial at AAAI-02
- Also many thanks to
- Jose Luis Ambite
- Greg Barish
- Alon (Ha)Levy
- Steve Minton
- Ion Muslea
- Sheila Tejada
- for permission to use/mutilate some of their
slides
3Overview
- Motivation for Information Integration
- Database Refresher
- Accessing Information Sources
- Record Linkage
- Data Integration/Query Planning
- Plan Execution
- Standards for Integration/Mediation
- Discussion
4Preamble Platitudes
- Internet is growing at an enormous rate
- Even after the bubble-burst
- All kinds of information sources are online
- Web pages, databases masquerading as web pages, services, sensors
- Promise of unprecedented information access to every Tom, Dick and Mary
- But, right now, they still need to know where to go, and be willing to manually put together bits and pieces of information gleaned from various sources and services
- Information Integration aims to do this automatically.
5Isn't the web mostly text?
- The invisible web is mostly structured
- Most web servers have back end database servers
- They dynamically convert (wrap) the structured data into readable English
- <India, New Delhi> => The capital of India is New Delhi.
- So, if we can unwrap the text, we have structured data!
- (un)wrappers, learning wrappers etc.
- Note also that such dynamic pages cannot be crawled...
- The (coming) semi-structured web
- Most pages are at least semi-structured
- XML standard is expected to ease the transfer of such pages.
- The Services
- Travel services, mapping services
- The Sensors
- Stock quotes, current temperatures, ticket prices
6Why isn't this just
- Search engines do text-based retrieval of URLs
- Works reasonably well for single document texts, or for finding sites based on single document text
- Cannot integrate information from multiple documents
- Cannot do effective query relaxation or generalization
- Cannot link documents and databases
- The aim of information integration is to support query processing over structured and semi-structured sources as well as services.
7Why isn't this just
Databases
Distributed Databases
- No common schema
- Sources with heterogeneous schemas (and ontologies)
- Semi-structured sources
- Legacy Sources
- Not relational-complete
- Variety of access/process limitations
- Autonomous sources
- No central administration
- Uncontrolled source content overlap
- Unpredictable run-time behavior
- Makes query execution hard
- Presence of services
- Need to compose services
8(An exceedingly brief) Database Refresher
9Traditional Database Architecture
Query (SQL)
Database Manager (DBMS)
- Storage mgmt
- Query processing
- View management
- (Transaction processing)
Database (relational)
Answer (relation)
10Database Outline
- What we care about
- Structured data representations
- Relational databases
- Deductive databases
- Structured query languages
- SQL
- Views (and materialized views)
11Relational Data Terminology
Product (first row: attribute names; each following row: one tuple)
Name          Price    Category      Manufacturer
gizmo         19.99    gadgets       GizmoWorks
Power gizmo   29.99    gadgets       GizmoWorks
SingleTouch   149.99   photography   Canon
MultiTouch    203.99   household     Hitachi
(Arity = 4)
Schema: Product(Name: string, Price: real, Category: enum, Manufacturer: string)
12Relational Algebra
- Operators
- tuple sets as input, new set as output
- Operations
- Union, Intersection, difference, ..
- Selection (σ)
- Projection (π)
- Cartesian product (×)
- Join (⋈)
13SQL: A query language for Relational Algebra
Many standards out there: SQL92, SQL2, SQL3, SQL99
Select attributes
From relations (possibly multiple, joined)
Where conditions (selections)
Find companies that manufacture products bought by Joe Blow:
SELECT Company.name
FROM Company, Product
WHERE Company.name = Product.maker
  AND Product.name IN
      (SELECT product
       FROM Purchase
       WHERE buyer = 'Joe Blow')
Other features: aggregation, group-by etc.
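The nested query on this slide can be tried out with Python's built-in sqlite3 module. This is only a sketch: the table and column names come from the slide, while the rows are made up for illustration.

```python
import sqlite3

# Hypothetical sample rows; only the table and column names come from the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Company (name TEXT);
    CREATE TABLE Product (name TEXT, maker TEXT);
    CREATE TABLE Purchase (buyer TEXT, product TEXT);
    INSERT INTO Company VALUES ('GizmoWorks'), ('Canon'), ('Hitachi');
    INSERT INTO Product VALUES ('gizmo', 'GizmoWorks'),
                               ('SingleTouch', 'Canon'),
                               ('MultiTouch', 'Hitachi');
    INSERT INTO Purchase VALUES ('Joe Blow', 'gizmo'),
                                ('Joe Blow', 'SingleTouch'),
                                ('Jane Doe', 'MultiTouch');
""")

# Companies that manufacture products bought by Joe Blow.
query = """
    SELECT Company.name
    FROM Company, Product
    WHERE Company.name = Product.maker
      AND Product.name IN (SELECT product
                           FROM Purchase
                           WHERE buyer = 'Joe Blow')
"""
companies = sorted(row[0] for row in conn.execute(query))
print(companies)
```

With these rows, only the makers of the two products Joe Blow bought are returned.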
14Deductive Databases
- Relations viewed as predicates.
- Interrelations between relations expressed as datalog rules
- (Horn clauses, without function symbols)
- Queries correspond to datalog programs
- Conjunctive queries are datalog programs with a single non-recursive rule; they correspond to SPJ (select-project-join) queries in SQL
- Emprelated(Name,Dname) :- Empdep(Name,Dname)
- Emprelated(Name,Dname) :- Empdep(Name,D1), Emprelated(D1,Dname)
- Empdep is an EDB predicate (stored facts); Emprelated is an IDB predicate (defined by rules)
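One way to see what the two Emprelated rules compute is naive bottom-up evaluation: apply the rules repeatedly until no new facts appear. The sketch below assumes Empdep is given as a set of pairs; the facts themselves are hypothetical.

```python
def emprelated(empdep):
    """Naive bottom-up evaluation of the two Emprelated rules."""
    derived = set(empdep)                      # rule 1: copy the EDB facts
    while True:
        new = {(name, dname)
               for (name, d1) in empdep        # rule 2: Empdep(Name, D1),
               for (d1b, dname) in derived     #         Emprelated(D1, Dname)
               if d1 == d1b}
        if new <= derived:                     # fixpoint: nothing new was derived
            return derived
        derived |= new

# Hypothetical EDB facts forming a chain ann -> sales -> west -> corp.
empdep = {("ann", "sales"), ("sales", "west"), ("west", "corp")}
closure = emprelated(empdep)
print(sorted(closure))
```

The recursive rule makes Emprelated the transitive closure of Empdep, which a single conjunctive (SPJ) query cannot express.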
15Datalog
- Datalog program: set of datalog rules
- Datalog rule: conjunctive query
- big-LA-buyers(Buyer, Seller, Price) :-
- person(Buyer, 'Los Angeles'),
- purchase(Buyer, Seller, Product, Price),
- Price > 10000.
16Datalog
- Datalog program: set of datalog rules
- Datalog rule: conjunctive query
- big-LA-buyers(Buyer, Seller, Price) :-
- person(Buyer, 'Los Angeles'),
- purchase(Buyer, Seller, Product, Price),
- Price > 10000.
- ∀ Buyer, Seller, Price
- [ ∃ Product ( person(Buyer, 'Los Angeles')
- ∧ purchase(Buyer, Seller, Product, Price)
- ∧ Price > 10000 )
- ⇒ big-LA-buyers(Buyer, Seller, Price) ]
Datalog
First-Order Logic
17Views
CREATE VIEW Seattle-view AS
SELECT buyer, seller, product, store
FROM Person, Purchase
WHERE Person.city = 'Seattle' AND
      Person.name = Purchase.buyer
Virtual vs. Materialized
We can later use the views:
SELECT name, store
FROM Seattle-view, Product
WHERE Seattle-view.product = Product.name AND
      Product.category = 'shoes'
What's really happening when we query a view?
18Conjunctive Queries and Views
- CREATE VIEW Big-LA-buyers AS
- SELECT buyer, seller, price
- FROM Person, Purchase
- WHERE Person.city = 'Los Angeles' AND
- Person.name = Purchase.buyer AND
- Purchase.price > 10000
- big-LA-buyers(Buyer, Seller, Price) :-
- person(Buyer, 'Los Angeles'),
- purchase(Buyer, Seller, Product, Price),
- Price > 10000.
- Datalog rule: view definition
- Rule body: select-from-where construct of SQL
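A small sketch of the view definition above, using Python's sqlite3 module. The rows are hypothetical, and the view is named Big_LA_buyers with underscores because SQLite does not accept hyphens in an unquoted identifier. Querying the view before and after an insert shows that a virtual view is re-evaluated against the base tables each time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (name TEXT, city TEXT);
    CREATE TABLE Purchase (buyer TEXT, seller TEXT, product TEXT, price REAL);
    INSERT INTO Person VALUES ('ann', 'Los Angeles'), ('bob', 'Seattle');
    INSERT INTO Purchase VALUES ('ann', 'acme', 'car', 25000),
                                ('ann', 'acme', 'pen', 2),
                                ('bob', 'acme', 'boat', 90000);
    CREATE VIEW Big_LA_buyers AS
        SELECT buyer, seller, price
        FROM Person, Purchase
        WHERE Person.city = 'Los Angeles'
          AND Person.name = Purchase.buyer
          AND Purchase.price > 10000;
""")

rows_before = conn.execute("SELECT * FROM Big_LA_buyers").fetchall()

# The view stores no data: a new base tuple shows up on the next query.
conn.execute("INSERT INTO Purchase VALUES ('ann', 'zed', 'house', 500000)")
rows_after = conn.execute("SELECT * FROM Big_LA_buyers").fetchall()
print(rows_before, rows_after)
```

A materialized view would instead store the result and need to be refreshed after the insert.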
19Integrator vs. DBMS
Reprise
- No common schema
- Sources with heterogeneous schemas
- Semi-structured sources
- Legacy Sources
- Not relational-complete
- Variety of access/process limitations
- Autonomous sources
- No central administration
- Uncontrolled source content overlap
- Lack of source statistics
- Tradeoffs between query plan cost, coverage, quality etc.
- Multi-objective cost models
- Unpredictable run-time behavior
- Makes query execution hard
- Presence of services
- Need to compose services
20Accessing Information Sources
21Wrappers Information Agents
[Figure: a user request "GIVE ME: Thai food < 20, A-rated" goes to an AGENT, which splits it into the conditions "Thai < 20" and "A-rated"]
22Wrapper Induction
- Problem description
- Web sources present data in human-readable format
- take user query
- apply it to data base
- present results in template HTML page
- To integrate data from multiple sources, one must first extract relevant information from Web pages
- Task: learn extraction rules based on labeled examples
- Hand-writing rules is tedious, error prone, and time consuming
23Example of Extraction Task
NAME: Casablanca Restaurant
STREET: 220 Lincoln Boulevard
CITY: Venice
PHONE: (310) 392-5751
24Rule Learning
- Machine learning
- Use past experiences to improve performance
- Rule learning
- INPUT
- Labeled examples: training and testing data
- Admissible rules (hypotheses space)
- Search strategy
- Desired output
- Rule that performs well both on training and
testing data
25STALKER [Muslea et al., 98, 99, 01]
- Hierarchical wrapper induction
- Decomposes a hard problem into several easier ones
- Extracts items independently of each other
- Each rule is a finite automaton
- Advantages
- Powerful extraction language (e.g., embedded list)
- One hard-to-extract item does not affect others
- Disadvantage
- Does not exploit item order (sometimes may help)
26STALKER The Wrapper Architecture
Data
Query
Information Extractor
Extraction Rules
EC Tree
27Extraction Rules
Extraction rule: sequence of landmarks
SkipTo(Phone) SkipTo(<i>)
SkipTo(</i>)
Name Joel's <p> Phone <i> (310) 777-1111 </i><p> Review
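A landmark-based rule of this kind can be sketched in a few lines. The tokenizer and the helper names `tokenize`, `skip_to`, and `apply_rule` are assumptions made for illustration, not STALKER's actual implementation; for simplicity the end rule is also applied forward from the start position.

```python
import re

def tokenize(page):
    # Split into HTML tags, words/numbers, and single punctuation characters.
    return re.findall(r"</?\w+>|\w+|[^\w\s]", page)

def skip_to(tokens, start, landmark):
    """Index just past the first occurrence of landmark at or after start, else None."""
    n = len(landmark)
    for i in range(start, len(tokens) - n + 1):
        if tokens[i:i + n] == landmark:
            return i + n
    return None

def apply_rule(tokens, landmarks, start=0):
    """Apply a sequence of SkipTo landmarks; return the final position or None."""
    pos = start
    for landmark in landmarks:
        pos = skip_to(tokens, pos, landmark)
        if pos is None:
            return None
    return pos

page = "Name Joel's <p> Phone <i> (310) 777-1111 </i><p> Review"
tokens = tokenize(page)
begin = apply_rule(tokens, [["Phone"], ["<i>"]])   # start rule: SkipTo(Phone) SkipTo(<i>)
end = skip_to(tokens, begin, ["</i>"]) - 1         # end rule: stop just before </i>
phone = " ".join(tokens[begin:end])
print(phone)
```

Each landmark consumes the input up to and including its match, so the rule behaves like a small finite automaton over tokens.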
28The Embedded Catalog Tree (ECT)
RESTAURANT
  Name
  List(Locations)
    City
    List(PhoneNumbers)
      AreaCode
      Phone
  Cuisine
- Name: KFC
- Cuisine: Fast Food
- Locations:
- Venice (310) 123-4567,
- (800) 888-4412.
- L.A. (213) 987-6543.
- Encino (818) 999-4567,
- (888) 727-3131.
29Example of Rule Induction
Training Examples
- Name Del Taco <p> Phone (toll free) <b> ( 800 ) 123-4567 </b><p>Cuisine ...
- Name Burger King <p> Phone ( 310 ) 987-9876 <p> Cuisine
30Learning the Extraction Rules
GUI
Inductive Learning System
Extraction Rules
Labeled Pages
31Example of Rule Induction
Training Examples
- Name Del Taco <p> Phone (toll free) <b> ( 800 ) 123-4567 </b><p>Cuisine ...
- Name Burger King <p> Phone ( 310 ) 987-9876 <p> Cuisine
Initial candidate: SkipTo( ( )
32Example of Rule Induction
Training Examples
- Name Del Taco <p> Phone (toll free) <b> ( 800 ) 123-4567 </b><p>Cuisine ...
- Name Burger King <p> Phone ( 310 ) 987-9876 <p> Cuisine
SkipTo( <b> ( ) ... SkipTo(Phone)
SkipTo( ( ) ... SkipTo() SkipTo(()
Initial candidate: SkipTo( ( )
33Example of Rule Induction
Training Examples
- Name Del Taco <p> Phone (toll free) <b> ( 800 ) 123-4567 </b><p>Cuisine ...
- Name Burger King <p> Phone ( 310 ) 987-9876 <p> Cuisine
Initial candidate: SkipTo( ( )
SkipTo(Phone) SkipTo() SkipTo( ( )
...
SkipTo( <b> ( ) ... SkipTo(Phone)
SkipTo( ( ) ... SkipTo() SkipTo(()
34Active Learning Information Agents
- Active Learning
- Idea: system selects most informative examples to label
- Advantage: fewer examples to reach same accuracy
- Information Agents
- One agent may use hundreds of extraction rules
- Small reduction of examples per rule => big impact on user
- Want to achieve 100% accuracy with as few examples as possible
35Which example should be labeled next?
SkipTo( Phone )
Training Examples
Name Joel's <p> Phone (310) 777-1111 <p>Review The chef
Name Kim's <p> Phone (213) 757-1111 <p>Review Korean
Unlabeled Examples
Name Chez Jean <p> Phone (310) 666-1111 <p> Review
Name Burger King <p> Phone(818) 789-1211 <p> Review ...
Name Café del Rey <p> Phone (310) 111-1111 <p> Review ...
Name KFC <p> Phone<b> (800) 111-7171 </b> <p> Review...
36Multi-view Learning
Two ways to find start of the phone number
SkipTo( Phone )
BackTo( ( Number ) )
Name KFC <p> Phone (310) 111-1111 <p> Review Fried chicken
37Co-Testing
Labeled data
Unlabeled data
38Co-Testing for Wrapper Induction
BackTo( (Number) )
SkipTo( Phone )
Name Joel's <p> Phone (310) 777-1111 <p>Review ...
Name Kim's <p> Phone (213) 757-1111 <p>Review ...
39Not all queries are equally informative
Phone (800) 171-1771 <p> Fax (111) 111-1111 <p> Review
Phone<i> - </i><p> Review Founded a century ago (1891) , this
40Weak Views
- Learn content description for item to be extracted
- Too general for extraction
- ( Nmb ) Nmb - Nmb can't tell a phone number from a fax number
- Useful at discriminating among query candidates
- Learned field description
- Starts with: ( Nmb )
- Ends with: Nmb - Nmb
- Contains: Nmb Punct
- Length: [6,6]
41Naïve and Aggressive Co-Testing
- Naïve Co-Testing
- Query: randomly chosen contention point
- Output: rule with fewest mistakes on queries
- Aggressive Co-Testing
- Query: contention point that most violates weak view
- Output: committee vote (2 rules + weak view)
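Contention points are the unlabeled examples on which the two views disagree. The sketch below approximates the forward rule SkipTo(Phone) and the backward rule BackTo( ( Number ) ) with regular expressions; the function names are hypothetical, and the pages are taken from the unlabeled examples shown earlier.

```python
import re

def forward_view(page):
    # SkipTo(Phone): the phone number starts right after the word "Phone".
    m = re.search(r"Phone\s*", page)
    return m.end() if m else None

def backward_view(page):
    # BackTo( ( Number ) ): scanning from the end, the phone number starts
    # at the last "(" that is followed by a number and ")".
    matches = list(re.finditer(r"\(\s*\d+\s*\)", page))
    return matches[-1].start() if matches else None

def contention_points(unlabeled):
    """Unlabeled pages on which the two views predict different start positions."""
    return [p for p in unlabeled if forward_view(p) != backward_view(p)]

unlabeled = [
    "Name Chez Jean <p> Phone (310) 666-1111 <p> Review",
    "Name Burger King <p> Phone(818) 789-1211 <p> Review",
    "Name KFC <p> Phone<b> (800) 111-7171 </b> <p> Review",
]
queries = contention_points(unlabeled)
print(queries)
```

Naïve Co-Testing would ask the user to label a randomly chosen element of `queries`; Aggressive Co-Testing would rank them by how strongly each prediction violates the weak view.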
42Empirical Results 33 Difficult Tasks
- 33 most difficult of the 140 extraction tasks
- Each view: > 7 labeled examples for best accuracy
- At least 100 examples for task
43Results in 33 Difficult Domains
Extraction Tasks
46Automatic Wrapper Generation [Crescenzi, Mecca, Merialdo, 2001]
- Automatically generates wrappers for web pages
- Supports nested structures and lists
- Applies to large, complex pages with regular structure
- Approach
- Start with the first page and create a union-free regular expression that defines the wrapper
- Match each successive sample against the wrapper
- Mismatches result in generalizations of the regular expression
47Example Matching
48Types of Mismatches
- String mismatches are used to discover fields of the document
- Tag mismatches can indicate either optional elements or iterators
- For iterations, mismatch is caused by repeated elements in a list
- End of the list corresponds to last matching token
- Beginning of list corresponds to one of the mismatched tokens
- These create possible squares
49Limitations
- Assumptions
- Pages are well-structured
- Want to extract at the level of entire fields
- Structure can be modeled without disjunctions
- Search space for explaining mismatches is huge
- Uses a number of heuristics to prune space
- Limited backtracking
- Limit on number of choices to explore
- Patterns cannot be delimited by optionals
- Will result in pruning possible wrappers
50Record Linkage
- Integrating Data Across Sources
51Record Linkage
- Problem
- Different sources typically represent and format information differently.
- As a result, determining if two sources are referring to the same object can be difficult.
- Example
- Is "Joe Cool" the same person as "Joseph B. Cool"?
- What if they have the same telephone number?
- What if Joe Cool's number is 310-322-0730 and Joseph B. Cool's number is 310-640-2973?
52Example Data Integration Problem
- How to align (or join) the objects across
different sources
Zagat's Restaurant Guide Source
Department of Health Restaurant Source
Art's Delicatessen; Ca Brea; CPK; The Grill; Patina; Philippe's The Original; The Tillerman
Art's Deli; California Pizza Kitchen; Campanile; Citrus; Grill, The; Philippe The Original; Spago
53Information Retrieval Approach [Cohen, 1998]
- Idea: Evaluate the similarity of records via textual similarity. Used in Whirl (Cohen 1998).
- Follows the same approach used by classical IR algorithms (including web search engines).
- First, stemming is applied to each entry.
- E.g. "Joe's Diner" -> "Joe s Diner"
- Then, entries are compared by counting the number of words in common.
- Note: Infrequent words weighted more heavily by TFIDF metric = Term Frequency * Inverse Document Frequency
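To make the weighting concrete, here is a minimal TF-IDF and cosine-similarity sketch over restaurant names. It is a generic textbook formulation (raw term frequency times log inverse document frequency, no stemming), not Whirl's exact scoring; the function names are made up for illustration.

```python
import math
import re
from collections import Counter

def tfidf_vectors(entries):
    """One {word: tf * idf} dictionary per entry, over lower-cased words."""
    docs = [Counter(re.findall(r"\w+", e.lower())) for e in entries]
    n = len(docs)
    df = Counter(word for d in docs for word in d)      # document frequency
    return [{w: tf * math.log(n / df[w]) for w, tf in d.items()} for d in docs]

def cosine(u, v):
    dot = sum(weight * v.get(w, 0.0) for w, weight in u.items())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

names = ["The Grill", "Grill, The", "The Tillerman", "Philippe The Original"]
vecs = tfidf_vectors(names)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Because "The" occurs in every name its weight is zero, so "The Grill" and "Grill, The" match on the rare word "Grill" alone, while "The Grill" and "The Tillerman" share nothing that counts.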
54Unsupervised Record Linkage
- Idea: Analyze data and automatically cluster pairs into three groups
- Let R = P(obs | Same) / P(obs | Different)
- Matched if R > threshold TU
- Unmatched if R < threshold TL
- Ambiguous if TL < R < TU
- This model for computing decision rules was introduced by Fellegi & Sunter in 1969
- Particularly useful for statistically linking large sets of data, e.g., by US Census Bureau
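The decision rule can be sketched as follows. The sketch assumes the observation is a vector of per-field agreement flags and that fields are conditionally independent, which is a common simplification rather than something stated on the slide; the probabilities and thresholds are hypothetical.

```python
def likelihood_ratio(agreements, m, u):
    """R = P(obs | Same) / P(obs | Different), assuming independent fields.

    agreements[i] is True when field i agrees; m[i] and u[i] are the
    probabilities that field i agrees for same and for different pairs.
    """
    r = 1.0
    for agrees, m_i, u_i in zip(agreements, m, u):
        r *= (m_i / u_i) if agrees else ((1 - m_i) / (1 - u_i))
    return r

def classify(r, t_lower, t_upper):
    if r > t_upper:
        return "matched"
    if r < t_lower:
        return "unmatched"
    return "ambiguous"      # includes the boundary cases R == TL and R == TU

# Hypothetical parameters for the fields (name, street, phone).
m = [0.9, 0.8, 0.95]
u = [0.1, 0.2, 0.01]
T_LOWER, T_UPPER = 0.01, 100.0

all_agree = classify(likelihood_ratio([True, True, True], m, u), T_LOWER, T_UPPER)
none_agree = classify(likelihood_ratio([False, False, False], m, u), T_LOWER, T_UPPER)
name_only = classify(likelihood_ratio([True, False, False], m, u), T_LOWER, T_UPPER)
print(all_agree, none_agree, name_only)
```

Pairs that land in the ambiguous band are the ones typically sent for clerical review.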
55Unsupervised Record Linkage (cont.)
- Winkler (1998) used EM algorithm to estimate P(obs | Same) and P(obs | Different)
- EM computes the maximum likelihood estimate
- The algorithm iteratively determines the parameters most likely to generate the observed data.
- Additional mathematical techniques must be used to adjust for relative frequencies
- I.e. last name of Smith is much more frequent than Knoblock.
56Supervised Active Learning Approach [Tejada, Knoblock & Minton, 2001]
- Supervised learning. System learns
- Which attributes to weight more heavily
- Transformation rules
57Mapping Rules
- Set of Similarity Scores => Mapping Rules
Similarity scores:
Name   Street   Phone
.967   .973     .3
.17    .3       .74
.8     .542     .49
.95    .97      .67
Mapping rules:
Name > .8 and Street > .79 => mapped
Name > .89 => mapped
Street < .57 => not mapped
58Transformation Weights
- Appropriate transformations depend on the application domain
- Restaurants, companies, airports
- and on the different attributes within an application
- Acronym is more appropriate for restaurant name than phone number
- Learn likelihood that if a transformation is applied then two objects match
Transformation Weight = P(match | transformation)
59Learning Object Mappings
Active Atlas
Set of Mapped Objects
Source 1
Candidate Generator
Mapping Learner
Source 2
- Candidate Generator
- Judge textual similarity of mappings
- Reduce number of mappings considered for classification
- Mapping Learner
- Active learning technique to learn mapping rules and transformation weights
- System chooses most informative example for the user to label
- Minimize the amount of user interaction
User Input
60Mapping Rule Learner
61Committee Disagreement
- Chooses an example based on the disagreement of the query committee
- In this case CPK, California Pizza Kitchen is the most informative example based on disagreement
Committee
Examples                          M1    M2    M3
Art's Deli, Art's Delicatessen    Yes   Yes   Yes
CPK, California Pizza Kitchen     Yes   No    Yes
CaBrea, La Brea Bakery            No    No    No
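Selecting by committee disagreement can be sketched as picking the example whose votes are most evenly split. The votes below are the committee's Yes/No answers from this slide read as three per example; the function name and the minority-size measure are assumptions made for illustration.

```python
def most_informative(votes):
    """Return the example whose committee votes are most evenly split."""
    def disagreement(ballot):
        yes = sum(ballot)
        return min(yes, len(ballot) - yes)     # size of the minority
    return max(votes, key=lambda example: disagreement(votes[example]))

# Votes of committee members M1, M2, M3 (True = "Yes, these map to each other").
votes = {
    ("Art's Deli", "Art's Delicatessen"): [True, True, True],
    ("CPK", "California Pizza Kitchen"): [True, False, True],
    ("CaBrea", "La Brea Bakery"): [False, False, False],
}
choice = most_informative(votes)
print(choice)
```

Unanimous examples teach the learner little, so the user is only asked about the pair the committee cannot agree on.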
62Data Integration/Query Planning
63Principal Dimensions of Data Integration
- Virtual vs. materialized architecture
- Access: query only or query and update?
- Mediated schema and query reformulation
- Content Descriptions
- Global-as-view
- Local-as-view
- Language for descriptions and queries: conjunctive queries (CQs), union of CQs, Datalog (recursion), first-order logic (?,?,?), description logics, ...
- Types of Sources
- Structured (DBs) vs. semi-structured (Web)
- Source capabilities: positive and negative
64Materialized Architecture: Data Warehouse
65Virtual Architecture: Mediator
66Virtual Integration Architecture
- Leave the data in the sources
- When a query comes in
- Determine the relevant sources to the query
- Break down the query into sub-queries for the sources
- Get the answers from the sources, and combine them appropriately
- Data is fresh. Approach scalable
- Issues
- Relating Sources and Mediator
- Reformulating the query
- Efficient planning and execution
[Figure labels: User queries; Mediated schema; Mediator (reformulation engine, optimizer, execution engine, catalog); three wrappers, each over a data source]
Systems: Garlic [IBM], Hermes [UMD]; Tsimmis, InfoMaster [Stanford]; DISCO [INRIA]; Information Manifold [AT&T]; SIMS/Ariadne [USC]; Emerac/Havasu [ASU]
67Desiderata for Relating Source-Mediator Schemas
- Expressive power: distinguish between sources with closely related data. Hence, be able to prune access to irrelevant sources.
- Easy addition: make it easy to add new data sources.
- Reformulation: be able to reformulate a user query into a query on the sources efficiently and effectively.
- Nonlossy: be able to handle all queries that can be answered by directly accessing the sources
Reformulation
68Source Descriptions
- Elements of source descriptions
- Contents: source contains movies, directors, cast.
- Constraints: only movies produced after 1965.
- Completeness: contains all American movies.
- Capabilities
- Negative: source requires movie title or director as input
- Positive: source can perform selections, joins, ...
69Approaches to Specification of Source Descriptions
- Global-as-View (GAV)
- Mediator relation defined as a view over source relations
- Ex: TSIMMIS (Stanford), HERMES (Maryland)
- Local-as-View (LAV)
- Source relation defined as view over mediator relations
- Ex: Information Manifold (AT&T), Tukwila (UW), InfoMaster (Stanford), Ariadne (USC)
View: named query, logical formula
70Query Reformulation
- Problem: rewrite the user query expressed in the mediated schema into a query expressed in the source schemas
- Given a query Q in terms of the mediated-schema relations, and descriptions of the information sources,
- Find a query Q' that uses only the source relations, such that
- Q' ⊆ Q (i.e., answers are correct) and
- Q' provides all possible answers to Q given the sources
71Answering queries using views
- Given query q and view definitions V = {V1, ..., Vn}
- q' is an Equivalent Rewriting of q using V if
- q' refers only to views in V, and
- q' ≡ q
- q' is a Maximally-Contained Rewriting of q using V if
- q' refers only to views in V, and
- q' ⊆ q, and
- there is no rewriting q1, such that q' ⊆ q1 ⊆ q and q1 ≢ q'
72Global-as-View (GAV)
- Each mediator relation is defined as a view over source relations.
- MovieActor(title,actor) ⊇ DB1(id,title,actor,year)
- MovieActor(title,actor) ⊇ DB2(title,director,actor,year)
- MovieReview(title,review) ⊇ DB1(id,title,actor,year) ∧ DB3(id,review)
73Query Reformulation in GAV
- Query reformulation = rule unfolding + simplification
- Query: Find reviews for DeNiro movies
- q(title,review) :- MovieActor(title,DeNiro), MovieReview(title,review)
- 1. q(title,review) :- DB1(id,title,DeNiro,year), DB1(id,title,actor,year), DB3(id,review)
- 2. q(title,review) :- DB2(title,director,DeNiro,year), DB1(id,title,actor,year), DB3(id,review)
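Rule unfolding for this query can be sketched as below. The data layout (atoms as predicate/argument tuples) and the names `GAV`, `expand`, and `unfold` are assumptions made for illustration. Unlike the slide, which reuses variable names, the sketch renames the non-head variables of each definition apart, and it does not perform the simplification step.

```python
from itertools import count, product

# GAV definitions: (head arguments, body atoms) per mediator relation.
# An atom is (predicate, argument tuple); all arguments here are variable names.
GAV = {
    "MovieActor": [
        (("title", "actor"), [("DB1", ("id", "title", "actor", "year"))]),
        (("title", "actor"), [("DB2", ("title", "director", "actor", "year"))]),
    ],
    "MovieReview": [
        (("title", "review"), [("DB1", ("id", "title", "actor", "year")),
                               ("DB3", ("id", "review"))]),
    ],
}

_fresh = count(1)

def expand(atom, definition):
    """Unfold one query atom with one view definition."""
    _, args = atom
    head_args, body = definition
    binding = dict(zip(head_args, args))       # head variable -> query term
    suffix = f"_{next(_fresh)}"                # rename the other body variables apart
    return [(pred, tuple(binding.get(a, a + suffix) for a in body_args))
            for pred, body_args in body]

def unfold(query_body):
    """All rewritings: one per combination of definitions of the query atoms."""
    choices = [[(atom, d) for d in GAV[atom[0]]] for atom in query_body]
    return [[src for atom, d in combo for src in expand(atom, d)]
            for combo in product(*choices)]

# q(title, review) :- MovieActor(title, DeNiro), MovieReview(title, review)
query = [("MovieActor", ("title", "DeNiro")), ("MovieReview", ("title", "review"))]
rewritings = unfold(query)
for body in rewritings:
    print(body)
```

The two rewritings correspond to the two numbered rules on the slide, one per definition of MovieActor; this is why GAV reformulation is polynomial in the size of the query and the definitions.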
74Local-as-View (LAV)
- Each source relation is defined as a view over mediator relations
- V1(title,year,director) ⊆ Movie(title,year,director,genre) ∧ American(director) ∧ year ≥ 1960 ∧ genre = Comedy
- V2(title,review) ⊆ Movie(title,year,director,genre) ∧ year ≥ 1990 ∧ MovieReview(title,review)
75Query Reformulation in LAV
Query: Reviews for comedies produced after 1950
q(title,review) :- Movie(title,year,director,Comedy), year ≥ 1950, MovieReview(title,review)
- Reformulated query
- q'(title,review) :- V1(title,year,director),
- V2(title,review)
q' ⊆ q
V1(title,year,director) ⊆ Movie(title,year,director,genre) ∧ American(director) ∧ year ≥ 1960 ∧ genre = Comedy
V2(title,review) ⊆ Movie(title,year,director,genre) ∧ year ≥ 1990 ∧ MovieReview(title,review)
76Integrating GIS and Imagery: Global as View Approach [Gupta et al.]
- GIS Source
- Soil maps
- Parcel maps
- Digital elevation maps
- Transportation network maps
- Image Library
- Satellite imagery
- Aerial images
- Property photographs
77Mediation in MIX
- Mediator defined by building a structured representation of both GIS and image sources
- Mediator relations defined by
- Containment conditions
- Spatial or temporal joins
- Logical associations
- Queries and results in XML
78Mediation in MIX (cont.)
- Wrappers
- Construct wrappers for the GIS and image data sources
- Evaluating spatial queries
- Determine subqueries to each of the sources
- Compose results and produce integrated XML document
- Spatial data converter used to handle conversions between sources (e.g., UTM to USGS 7.5 quad)
79Example
- Produce a table of aerial imagery and photographs
of houses broken down by 5-year increments and
Total Assessed Value
80Result
81Plan Execution
82Motivation
- Problem
- Information gathering may involve accessing and integrating data from many sources
- Total time to execute these plans may be large
- Why?
- Unpredictable network latencies
- Varying remote source capabilities
- Thus, execution is often I/O-bound
- Complicating factor: binding patterns
- During execution, many sources cannot be queried until a previous source query has been answered
83 GAV vs. LAV
GAV
- Not modular
- Addition of new sources changes the mediated schema
- Can be awkward to write mediated schema without loss of information
- Query reformulation easy
- reduces to view unfolding (polynomial)
- Can build hierarchies of mediated schemas
- Best when
- Few, stable data sources
- well-known to the mediator (e.g. corporate integration)
- Garlic, TSIMMIS, HERMES
LAV
- Modular: adding new sources is easy
- Very flexible: power of the entire query language available to describe sources
- Reformulation is hard
- Involves answering queries only using views (can be intractable)
- Best when
- Many, relatively unknown data sources
- possibility of addition/deletion of sources
- Information Manifold, InfoMaster, Emerac
84Traditional Approaches
- Executing information gathering plans
- Generate a plan
- Plan typically consists of a partial ordering of the operators
- Execute the plan based on the given order
- Operators process all of their input data before transmitting any results to consumer(s)
- Operators as fast as their most latent input
- Long delays due to the dependencies in the plan
85Dataflow vs Von-Neumann
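[Figure: the expression ((a + b) * (c + d)) drawn for each execution model, with inputs a, b, c, d feeding ADD nodes and the ADD results feeding a MUL node; in the dataflow graph the nodes are labeled as actors and the edges as arcs]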
86Streaming Dataflow
- Plans consist of a network of operators
- Each operator like a function
- Example: Wrapper, Select, etc.
- Operators produce and consume data
- Operators fire when any part of any input data becomes available
- Data routed between operators are relations
- Zero or more tuples with one or more attributes
[Figure: example plan with Input, two Wrapper operators, Join, Select, and Output]
87Parallelism of Streaming Dataflow
- Dataflow (horizontal parallelism)
- Decentralized, independent operator execution
- Enables "maximally parallel" operator execution
- Also known as the "dataflow limit"
- Streaming/pipelining (vertical parallelism)
- Producer emits tuples to consumer ASAP
- Producer and consumer can process same relation simultaneously
- Effective because information gathering latencies can be high even at the tuple level
- Data often "trickles" out of I/O-bound operators
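Streaming between a producer and a consumer can be illustrated with Python generators, where each tuple flows through the whole pipeline before the next one is fetched. This single-threaded sketch only shows the pipelining (vertical) aspect; horizontal parallelism would need threads or separate processes. The operator names and the sample tuples are hypothetical.

```python
log = []

def wrapper(rows):
    """Stand-in for a source wrapper: emits each tuple as soon as it 'arrives'."""
    for row in rows:
        log.append(("fetched", row))
        yield row

def select(tuples, predicate):
    # Streaming select: a qualifying tuple is passed on before the next is read.
    for t in tuples:
        if predicate(t):
            yield t

def consume(tuples):
    for t in tuples:
        log.append(("output", t))

officials = [("Boxer", "senator"), ("Harman", "house"), ("Feinstein", "senator")]
consume(select(wrapper(officials), lambda t: t[1] == "senator"))
for event in log:
    print(event)
```

The first output appears in the log before the second tuple has even been fetched, which is the behavior a blocking operator (one that waits for all of its input) cannot provide.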
88Example The RepInfo Agent
- INPUT
- Any street address
- e.g., 4767 Admiralty Way, Marina del Rey, CA, 90292
- OUTPUT
- Federal reps
- 2 senators,
- 1 house member
- For each rep
- Recent news
- Real-time funding information
89RepInfo Sources
92OpenSecrets Navigation Fetching!
96RepInfo agent plan
address
senators house reps
combined results
recent news
Join name
Wrapper Yahoo News
Select senators, house reps
Wrapper Vote-Smart
graph URL
Wrapper OpenSecrets (funding page)
Wrapper OpenSecrets (member page)
Wrapper OpenSecrets (names page)
all officials
member URL
funding URL
97Adaptive Query Execution
- Network Query Engines
- Tukwila
- Operator reordering
- Optimized operators
- Telegraph
- Tuple-level adaptivity
- Niagara
- Partial results for blocking operators
- Agent Execution Language
- Theseus
- Speculative execution
98How to speculate?
- General problem
- Means for issuing and confirming predictions
- Two new operators
- Speculate: Makes predictions based on "hints"
- Confirm: Prevents errant results from exiting plan
hints
predictions/additions
Speculate
answers
confirmations
probable results
Confirm
actual results
confirmations
99How to speculate?
- Example: RepInfo
- Make predictions about officials based on address
- Makes practical sense
- Representatives do not change often
- Addresses-to-reps is a many-to-one relationship
100Speedups beyond 2
- Cascading speculation
- Speculation on speculation
- Functional dependencies
- Enable early confirmation because subsequent FD processing is deterministic
[Figure: plan graph with nodes labeled S, W, and G]
101Learning to Speculate
- Accurate predictions
- The better our prediction accuracy, the better the speedup
- Example
- Predict federal officials given an address
- Categories of predictions
- How do we deal with?
- New hints
- Making novel predictions
102Caching
- Associate answers with previously seen hints
- Advantages
- Simple
- Disadvantages
- Requires lots of space
- Only supports previously seen predictions on
previously seen data (category A)
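A caching predictor of this kind might look like the sketch below. The class and method names are hypothetical and are not taken from Theseus; the address and officials are the ones used as an example elsewhere in these slides.

```python
class CachingPredictor:
    """Remember the answer last seen for each hint; predict only for known hints."""

    def __init__(self):
        self._cache = {}

    def predict(self, hint):
        return self._cache.get(hint)          # None means no prediction is possible

    def confirm(self, hint, actual):
        """Record the actual answer; report whether the cached prediction was right."""
        was_right = self._cache.get(hint) == actual
        self._cache[hint] = actual
        return was_right

predictor = CachingPredictor()
address = "4780 Admiralty Way, Marina del Rey, CA 90292"
officials = ("Boxer", "Feinstein", "Harman")

first = predictor.predict(address)            # unseen hint: nothing to speculate on
predictor.confirm(address, officials)         # actual answer arrives from the source
second = predictor.predict(address)           # seen hint: prediction available
unseen = predictor.predict("1 Main Street, Venice, CA 90291")
print(first, second, unseen)
```

The last call shows the limitation named on the slide: a cache cannot generalize to a hint it has not seen, however similar it is to earlier ones.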
103Other ways to predict
- Classification
- 4780 Admiralty Way, Marina del Rey, CA 90292
- Likely reps (Boxer, Feinstein, and Harman)
- We have learned that zip code and city are the features that most likely indicate the representative
- Translation
- Some data have predictable transformations
- Example: the OpenSecrets source
- Member URL
- http://www.opensecrets.org/politicians/summary.asp?CID=N00006750
- Funding URL
- http://www.opensecrets.org/politicians/sector.asp?CID=N00006750
104Standards for Integration/Mediation
105The X-standards
- XML: an on-the-wire representation for data
- XQuery: a query language for XML
- Xschema/DTD: a schema description language for XML data
- RDF: a language for meta-data description
- WSDL/SOAP/UDDI: languages for describing services
106HTML vs. XML
- <bibliography>
- <book> <title> Foundations </title>
- <author> Abiteboul </author>
- <author> Hull </author>
- <author> Vianu </author>
- <publisher> Addison Wesley </publisher>
- <year> 1995 </year>
- </book>
-
- </bibliography>
- <h1> Bibliography </h1>
- <p> <i> Foundations of Databases </i>
- Abiteboul, Hull, Vianu
- <br> Addison Wesley, 1995
- <p> <i> Data on the Web </i>
- Abiteboul, Buneman, Suciu
- <br> Morgan Kaufmann, 1999
Self-describing
- Schema info part of the data
- Good for data exchange (albeit baroque for storage)
107XML Terminology
- tags: book, title, author, ...
- start tag: <book>, end tag: </book>
- elements: <book>...</book>, <author>...</author>
- elements are nested
- empty element: <red></red>, abbreviated <red/>
- an XML document: single root element
well formed XML document: if it has matching tags
108Why are Database folks so excited about XML?
- XML is just a syntax for (self-describing) data
- This is still exciting because
- No standard syntax for relational data
- With XML, we can
- Translate any legacy data to XML
- Can exchange data in XML format
- Ship over the web, input to any application
109XML vs. Relational Data
- XML is meant as a language that supports both Text and Structured Data
- Conflicting demands...
- XML supports semi-structured data
- In essence, the schema can be union of multiple schemas
- Easy to represent books with or without prices, books with any number of authors etc.
- XML supports free mixing of text and data
- using the PCDATA type
- XML is ordered (while relational data is unordered)
110Querying XML
- Requirements
- Need to handle lack of schema.
- We may not know much about the data, so we need to navigate the XML.
- Need to support both information retrieval and SQL-style queries.
- Ordered vs. un-ordered XML
- Human readable
- like SQL?
- Candidates
- Many based on conflicting requirements
- XSL: Makes IR folks happy
- XML-QL: Makes DB folks happy
- XQuery: W3C's attempt to make everybody (un)happy
111Example Query
Query
- <bib> {
- for $b in /bib/book
- where $b/publisher = "Addison-Wesley"
- and $b/@year > 1991
- return <book year="{ $b/@year }">
- { $b/title }
- </book>
- } </bib>
- For all books after 1991,
- return with Year changed from
- a tag to an attribute
Result
<bib>
<book year="1994"> <title>TCP/IP Illustrated</title> </book>
<book year="1992"> <title>Advanced Programming in the Unix environment</title> </book>
</bib>
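For readers without an XQuery engine, roughly the same selection and restructuring can be sketched with Python's standard xml.etree.ElementTree module. The input document is hypothetical (the slide does not show it); it is chosen so that the output matches the result shown on the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical input document; only the first two books should be returned.
source = """
<bib>
  <book year="1994"><title>TCP/IP Illustrated</title><publisher>Addison-Wesley</publisher></book>
  <book year="1992"><title>Advanced Programming in the Unix environment</title><publisher>Addison-Wesley</publisher></book>
  <book year="1999"><title>Data on the Web</title><publisher>Morgan Kaufmann</publisher></book>
</bib>
"""

bib = ET.fromstring(source)
result = ET.Element("bib")
for b in bib.findall("book"):                           # for $b in /bib/book
    if (b.findtext("publisher") == "Addison-Wesley"     # where ... and ...
            and int(b.get("year")) > 1991):
        book = ET.SubElement(result, "book", year=b.get("year"))
        ET.SubElement(book, "title").text = b.findtext("title")   # return ...

xml_out = ET.tostring(result, encoding="unicode")
print(xml_out)
```

The loop mirrors the for/where/return clauses one to one; what XQuery adds is that the whole thing is a declarative expression a source or mediator can optimize.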
112Impact of XML on Integration
- If and when all sources accept XQuery queries and exchange data in XML format, then
- Mediator can accept user queries in XQuery
- Access sources using XQuery
- Get data back in XML format
- Merge results and send to user in XML format
- How about now?
- Sources can use XML adapters (middle-ware)
113XML middleware for Databases
- XML adapters (middle-ware) received significant attention in DB community
- SilkRoute (AT&T)
- Xperanto (IBM)
- Issues
- Need to convert relational data into XML
- Tagging (easy)
- Need to convert XQuery queries into equivalent SQL queries
- Trickier as XQuery supports schema querying
- A single query may be mapped into a union of SQL queries
114Is XML standardization a magical solution for
Integration?
- If all WEB sources standardize into XML format
- Source access (wrapper generation issues) becomes easier to manage
- BUT all other problems remain
- Still need to relate source (XML) schemas to mediator (XML) schema
- Still need to reason about source overlap, source access limitations etc.
- Still need to manage execution in the presence of source/network uncertainties
115Semantic Web
- The LAV/GAV approaches assume that some human expert will do the actual schema mapping
- The semantic-web initiative attempts to automate schema mapping
- Idea: Allow pages to write logical axioms relating their vocabulary (tags) to other external tags
- Support automatic inference of relations between source and mediator schema using these rules
- DAML+OIL
116Review
- Motivation for Information Integration
- Database Refresher
- Accessing Information Sources
- Record Linkage
- Data Integration/Query Planning
- Plan Execution
- Standards for Integration/Mediation
- Discussion
117Discussion
- Many opportunities for integrating online sources with geospatial sources
- There are many online sources that can be related to geospatial sources
- Online databases, text documents, phone books, schedules, ...
- Possible uses
- Augmenting what is known about geospatial entities
- Using online sources to analyze imagery
- Combining online sources with geospatial sources for research and analysis
- Effect of coastal pollution on the economy of coastal regions
- Placing text documents in a geospatial context