Title: Distributed Query Processing
1Distributed Query Processing
- Based on "The State of the Art in Distributed Query Processing" by Donald Kossmann (ACM Computing Surveys, 2000)
2Motivation
- Cost and scalability: network of off-the-shelf machines
- Integration of different software vendors (with own DBMS)
- Integration of legacy systems
- Applications inherently distributed, such as workflow or collaborative design
- State-of-the-art distributed information technologies (e-businesses)
3Part 1 Basics
- Query Processing Basics
- centralized query processing
- distributed query processing
4Problem Statement
- Input: Query such as "Biological objects in study A referenced in the literature in journal Y"
- Output: Answer
- Objectives
- response time, throughput, first answers, little I/O, ...
- Centralized vs. Distributed Query Processing
- same basic problem
- but more and different parameters (such as data sites or available machine power) and objectives
5Steps in Query Processing
- Input: Declarative Query
- SQL, XQuery, ...
- Step 1: Translate Query into Algebra
- Tree of operators (query plan generation)
- Step 2: Optimize Query
- Tree of operators (logical), also select partitions of table
- Tree of operators (physical), also site annotations
- (Compilation)
- Step 3: Execution
- Interpretation, query result generation
6Algebra
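SELECT A.d FROM A, B WHERE A.a = B.b AND A.c = 35
[Figure: operator tree for the query. Projection on A.d, over a selection (A.a = B.b, A.c = 35), over the cross product (X) of A and B]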
- relational algebra for SQL very well understood
- algebra for XQuery mostly understood
7Query Optimization
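[Figure: two plans for the same query. Logical plan: projection on A.d, over a selection (A.a = B.b, A.c = 35), over the cross product (X) of A and B. Physical plan: projection on A.d, over a hashjoin whose inputs are an index access on A.c and a projection on B.b of B]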
- logical, e.g., push down cheap predicates
- enumerate alternative plans, apply cost model
- use search heuristics to find cheapest plan
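The effect of pushing down a cheap predicate can be illustrated with a minimal sketch. The toy relations, the column values, and the function names below are assumptions made for illustration. Both plans return the same answer, but the second one produces far fewer intermediate tuples.

```python
from itertools import product

# Hypothetical toy relations; column names follow the query on the Algebra slide.
A = [{"a": "CS", "c": 35, "d": "John"},
     {"a": "EE", "c": 35, "d": "Mary"},
     {"a": "CS", "c": 40, "d": "Ann"}]
B = [{"b": "CS"}, {"b": "AS"}]


def naive_plan(A, B):
    """Cross product first, then apply both predicates."""
    pairs = list(product(A, B))  # |A| * |B| intermediate tuples
    out = [ra["d"] for ra, rb in pairs if ra["a"] == rb["b"] and ra["c"] == 35]
    return out, len(pairs)


def pushed_down_plan(A, B):
    """Apply the cheap predicate A.c = 35 first, then hash join on A.a = B.b."""
    filtered = [ra for ra in A if ra["c"] == 35]
    table = {}  # hash table on B.b (build side)
    for rb in B:
        table.setdefault(rb["b"], []).append(rb)
    out = [ra["d"] for ra in filtered for _ in table.get(ra["a"], [])]
    return out, len(filtered)


print(naive_plan(A, B))        # (['John'], 6)
print(pushed_down_plan(A, B))  # (['John'], 2)
```

The second element of each result is the number of intermediate tuples, a crude stand-in for the cost model an optimizer would apply.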
8Basic Query Optimization
- Classical Dynamic Programming algorithm
- Performs join order optimization
- Input: Join query on n relations
- Output: Best join order
9The Dynamic Prog. Algorithm
for i = 1 to n do {
    optPlan({Ri}) = accessPlans(Ri)
    prunePlans(optPlan({Ri}))
}
for i = 2 to n do
    for all S ⊆ {R1, R2, ..., Rn} such that |S| = i do {
        optPlan(S) = ∅
        for all O ⊂ S do
            optPlan(S) = optPlan(S) ∪ joinPlans(optPlan(O), optPlan(S − O))
        prunePlans(optPlan(S))
    }
return optPlan({R1, R2, ..., Rn})
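A minimal runnable sketch of the same idea follows. It is not the survey's exact algorithm: the cost model (sum of intermediate result sizes), the single uniform join selectivity, and the name `dp_join_order` are simplifying assumptions made here, and pruning keeps only the single cheapest plan per subset.

```python
from itertools import combinations


def dp_join_order(cards, selectivity=0.1):
    """cards maps relation name -> cardinality.

    Returns (cost, plan, cardinality) for joining all relations, where
    cost is the sum of intermediate result sizes (an assumed cost model).
    """
    rels = sorted(cards)
    # optPlan for single relations: the access plan, with zero join cost.
    opt = {frozenset([r]): (0.0, r, float(cards[r])) for r in rels}
    for i in range(2, len(rels) + 1):
        for subset in combinations(rels, i):
            S = frozenset(subset)
            best = None
            # Try every split of S into O and S - O (bushy plans included).
            for k in range(1, i):
                for left in combinations(subset, k):
                    O = frozenset(left)
                    cost_l, plan_l, card_l = opt[O]
                    cost_r, plan_r, card_r = opt[S - O]
                    out_card = card_l * card_r * selectivity
                    cost = cost_l + cost_r + out_card
                    # prunePlans: keep only the cheapest plan for S.
                    if best is None or cost < best[0]:
                        best = (cost, (plan_l, plan_r), out_card)
            opt[S] = best
    return opt[frozenset(rels)]


cost, plan, card = dp_join_order({"A": 1000, "B": 10, "C": 100})
print(plan, cost)
```

With these assumed cardinalities the sketch joins the two small relations B and C first, since that keeps the intermediate result small. The table `opt` holds one entry per subset of relations, which is why time and space grow exponentially with n.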
10Query Execution
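[Figure: execution of the physical plan. The index access on A.c yields (John, 35, CS) and (Mary, 35, EE). B yields (Edinburgh, CS, 5.0) and (Edinburgh, AS, 6.0), which the projection on B.b reduces to (CS) and (AS). The hashjoin outputs (John, 35, CS), and the projection on A.d returns John]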
- library of operators (hash join, merge join, ...)
- exploit indexes and clustering in database
- pipelining (iterator model)
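The iterator model can be sketched with Python generators, where each operator pulls tuples from its children on demand. The operator names, the dictionary row format, and the column names of B below are assumptions for illustration only.

```python
def scan(rows):
    """Leaf operator: produce the stored tuples one at a time."""
    for row in rows:
        yield row


def select(child, pred):
    """Pass on only the tuples that satisfy the predicate."""
    for row in child:
        if pred(row):
            yield row


def project(child, cols):
    """Keep only the requested columns."""
    for row in child:
        yield tuple(row[c] for c in cols)


def hash_join(build, probe, build_key, probe_key):
    """Build a hash table on one input (blocking), then stream the other."""
    table = {}
    for row in build:
        table.setdefault(row[build_key], []).append(row)
    for row in probe:
        for match in table.get(row[probe_key], []):
            yield {**match, **row}


# Hypothetical data mirroring the tuples shown in the figure.
A = [{"d": "John", "c": 35, "a": "CS"}, {"d": "Mary", "c": 35, "a": "EE"}]
B = [{"site": "Edinburgh", "b": "CS", "score": 5.0},
     {"site": "Edinburgh", "b": "AS", "score": 6.0}]

plan = project(
    hash_join(scan(B), select(scan(A), lambda r: r["c"] == 35), "b", "a"),
    ["d"],
)
print(list(plan))  # [('John',)]
```

Nothing runs until the consumer asks for the first tuple, and only the build phase of the hash join blocks. The other operators pass tuples through one by one.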
11Summary Centralized Queries
- Basic SQL (SPJG, nesting) well understood
- Very good extensibility
- spatial joins, time series, UDF, XQuery, etc.
- Current problems
- Better statistics and cost models for optimization
- Physical database design is expensive and complex
- Some Trends
- interactiveness during execution
- approximate answers, top-k
- self-tuning capabilities (adaptive, robust, etc.)
12Distributed Query Processing Basics
- Idea
- Extension of centralized query processing (System R et al. in the 80s)
- What is different?
- extend physical algebra: send/receive operators
- other metrics: optimize for response time
- resource vectors, network interconnect matrix
- caching and replication
- less predictability in cost model (adaptive algos)
- heterogeneity in data formats and data models
13Issues in Distributed Databases
- Plan enumeration
- The time and space complexity of the traditional dynamic programming algorithm is very large
- Iterative Dynamic Programming (heuristic for large queries)
- Cost Models
- Classic Cost Model
- Response Time Model
- Economic Models
14Distributed Query Plan
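[Figure: distributed query plan. Projection on A.d over a hashjoin. Each hashjoin input arrives through a receive operator fed by a send operator: one branch ships the result of the index access on A.c, the other ships the projection on B.b of B]
Forms of parallelism?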
15Cost Resource Utilization
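Total Cost = Sum of Cost of Ops; Cost = 40
[Figure: the distributed query plan annotated with per-operator costs 1, 8, 1, 6, 1, 6, 2, 5, 10, which sum to 40]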
16Another Metric Response Time
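Total Cost = 40; first tuple = 25; last tuple = 33
[Figure: the distributed query plan annotated with (first tuple, last tuple) times per operator: (25, 33), (24, 32), (0, 7), (0, 24), (0, 6), (0, 18), (0, 12), (0, 5), (0, 10). One annotation is spelled out as first tuple = 0, last tuple = 10. Labels in the figure: Pipelined parallelism, Independent parallelism]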
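The difference between the two metrics can be sketched in a few lines. This is a deliberately simplified model and not the one behind the figures above: it assumes the children of an operator run in parallel on different sites (independent parallelism) and ignores pipelining. The operator names and costs are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    name: str
    cost: float
    children: list = field(default_factory=list)


def total_cost(op):
    """Resource utilization: the sum of the costs of all operators."""
    return op.cost + sum(total_cost(c) for c in op.children)


def response_time(op):
    """Assumed model: children run in parallel, the operator starts once
    its slowest child has finished (no pipelining)."""
    return op.cost + max((response_time(c) for c in op.children), default=0)


# Hypothetical plan: two scans on different sites feed a join through sends.
plan = Op("join", 5, [
    Op("send_a", 2, [Op("scan_a", 6)]),
    Op("send_b", 1, [Op("scan_b", 10)]),
])

print(total_cost(plan))     # 24
print(response_time(plan))  # 16
```

The two branches overlap in time, so the response time is smaller than the total cost. An optimizer that minimizes one metric may therefore choose a different plan than one that minimizes the other.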
17Query Execution Techniques for Distributed Databases
- Row Blocking
- Multi-cast optimization
- Multi-threaded execution
- Joins with horizontal partitioning
- Semi joins
- Top n queries
18Query Execution Techniques for DD
- Row Blocking
- SEND and RECEIVE operators in query plan to model communication
- Implemented by TCP/IP, UDP, etc.
- Ship tuples in block-wise fashion (batch), which smooths burstiness
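Row blocking can be sketched as a pair of generators. The names `send_blocks` and `receive` and the block size are assumptions for illustration. A real system would hand each block to the network layer instead of yielding it.

```python
from itertools import islice


def send_blocks(tuples, block_size):
    """SEND side: group tuples into blocks so that one message carries many."""
    it = iter(tuples)
    while True:
        block = list(islice(it, block_size))
        if not block:
            return
        yield block  # one network message per block


def receive(blocks):
    """RECEIVE side: unpack blocks and hand tuples to the consuming operator."""
    for block in blocks:
        yield from block


messages = list(send_blocks(range(10), block_size=4))
print(len(messages))            # 3
print(list(receive(messages)))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Ten tuples travel in three messages instead of ten, which amortizes the per-message overhead. A buffered block also lets the consumer keep working through short delays in arrival.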
19Query Execution Techniques for DD
- Multi-cast Optimization
- Location of sending/receiving may affect communication costs: forwarding versus multi-casting
- Multi-threaded execution
- Several threads for operators at the same site (intra-query parallelism)
- May be useful to enable concurrent reads for diverse machines (while continuing query processing)
- Must consider if resources warrant concurrent operator execution (say two sorts each needing all memory)
20Query Execution Techniques for DD
- Joins with Data (horizontal) partitioning
- Hash-based partitioning to conduct joins on independent partitions
- Semi Joins
- Reduce communication costs: send only join keys instead of complete tuples to the site to extract relevant join partners
- Double-pipelined hash joins
- Non-blocking join operators to deliver first results quickly, fully exploit pipelined parallelism, and reduce overall response time
- Top n queries
- Isolate top n tuples quickly and only perform other expensive operations (like sort, join, etc.) on those few (use stop operators)
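The semi-join idea can be sketched as follows, assuming relation A lives on site 1 and relation B on site 2. The function name, the row format, and the way shipped items are counted are illustrative assumptions.

```python
def semi_join(A, B, a_key, b_key):
    """Join A (site 1) with B (site 2) while shipping as little as possible."""
    # Step 1: site 1 ships only the distinct join keys of A to site 2.
    keys = {ra[a_key] for ra in A}
    # Step 2: site 2 ships back only the B tuples that have a join partner.
    reduced_B = [rb for rb in B if rb[b_key] in keys]
    # Step 3: site 1 joins A with the reduced B using a hash table.
    table = {}
    for rb in reduced_B:
        table.setdefault(rb[b_key], []).append(rb)
    result = [(ra, rb) for ra in A for rb in table.get(ra[a_key], [])]
    return result, len(keys), len(reduced_B)


# Hypothetical data.
A = [{"a": "CS", "d": "John"}, {"a": "EE", "d": "Mary"}]
B = [{"b": "CS", "v": 5.0}, {"b": "AS", "v": 6.0}, {"b": "CS", "v": 7.0}]

result, keys_shipped, tuples_shipped = semi_join(A, B, "a", "b")
print(len(result), keys_shipped, tuples_shipped)  # 2 2 2
```

Whether this pays off depends on the data: the extra round trip for the keys is only worthwhile when it filters out enough of B.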
21Adaptive Algorithms
- Deal with unpredictable events at run time
- delays in arrival of data, burstiness of network
- autonomy of nodes, changes in policies
- Example: double pipelined hash joins
- build hash table for both input streams
- read inputs in separate threads
- good for bursty arrival of data
- Re-optimization at run time (LEO, etc.)
- monitor execution of query
- adjust estimates of cost model
- re-optimize if delta is too large
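A double pipelined (symmetric) hash join can be sketched as below. To stay short and deterministic, the sketch is single-threaded and takes one interleaved stream of arrivals tagged "L" or "R". This tagging replaces the separate reader threads mentioned above and is an assumption of the sketch, as are the names used.

```python
def symmetric_hash_join(arrivals, left_key, right_key):
    """arrivals: iterable of ("L", row) or ("R", row) in arrival order.

    Every arriving tuple is inserted into the hash table of its own side
    and immediately probed against the hash table of the other side.
    """
    left_table, right_table = {}, {}
    for side, row in arrivals:
        if side == "L":
            key = row[left_key]
            left_table.setdefault(key, []).append(row)
            for match in right_table.get(key, []):
                yield (row, match)
        else:
            key = row[right_key]
            right_table.setdefault(key, []).append(row)
            for match in left_table.get(key, []):
                yield (match, row)


# Hypothetical bursty arrival order.
arrivals = [
    ("L", {"a": "CS", "d": "John"}),
    ("R", {"b": "AS"}),
    ("R", {"b": "CS"}),
    ("L", {"a": "AS", "d": "Ann"}),
]
names = [left["d"] for left, right in symmetric_hash_join(arrivals, "a", "b")]
print(names)  # ['John', 'Ann']
```

Neither input has to be read completely before results appear, so a delay on one input does not stall the tuples arriving on the other. The price is memory for two hash tables.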
22Special Techniques for Client-Server Architectures
- Shipping techniques
- Query shipping
- Data shipping
- Hybrid shipping
- Query Optimization
- Site Selection
- Where to optimize
- Two Phase Optimization
23Special Techniques for Federated Database Systems
- Wrapper architecture
- Query optimization
- Query capabilities
- Cost estimation
- Calibration Approach
- Wrapper Cost Model
- Parameter Binding
24Heterogeneity
- Use Wrappers to hide heterogeneity
- Wrappers take care of data format, packaging
- Wrappers map from local to global schema
- Wrappers carry out caching
- connections, cursors, data, ...
- Wrappers map queries into local dialect
- Wrappers participate in query planning!!!
- define the subset of queries that can be handled
- give cost information, statistics
- capability-based rewriting
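How a wrapper might take part in planning can be sketched with a tiny capability check. The `Wrapper` class, its methods, and the restriction to equality predicates are hypothetical and not taken from any particular system. The middleware pushes down only the predicates the source declares it can evaluate and applies the rest itself.

```python
class Wrapper:
    """Hypothetical wrapper around a source that can filter on some columns."""

    def __init__(self, rows, pushable_columns):
        self.rows = rows
        self.pushable = set(pushable_columns)

    def can_handle(self, column):
        """Capability description consulted by the planner."""
        return column in self.pushable

    def execute(self, equals):
        """Evaluate the pushed-down equality predicates at the source."""
        unsupported = [c for c in equals if not self.can_handle(c)]
        if unsupported:
            raise ValueError(f"source cannot filter on {unsupported}")
        return [r for r in self.rows
                if all(r[c] == v for c, v in equals.items())]


def plan_and_run(wrapper, equals):
    """Capability-based rewriting: split predicates into pushed and residual."""
    pushed = {c: v for c, v in equals.items() if wrapper.can_handle(c)}
    residual = {c: v for c, v in equals.items() if c not in pushed}
    fetched = wrapper.execute(pushed)
    result = [r for r in fetched
              if all(r[c] == v for c, v in residual.items())]
    return result, len(fetched)


rows = [{"id": 1, "name": "John", "dept": "CS"},
        {"id": 2, "name": "Mary", "dept": "EE"},
        {"id": 3, "name": "John", "dept": "EE"}]
source = Wrapper(rows, pushable_columns=["id", "name"])
print(plan_and_run(source, {"name": "John", "dept": "EE"}))
```

Here the filter on `name` runs at the source, so only two rows are fetched, and the filter on `dept` is applied in the middleware. A fuller wrapper would also report cost estimates and statistics.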
25Summary
- Theory well understood
- extend traditional (centralized) query processing
- add many more details
- heterogeneity needs manual work and wrappers
- Problems in Practice
- cost model, statistics
- architectures are not fit for adaptivity, heterogeneity
- optimizers do not scale for 10,000s of sites
- autonomy of sites: systems not built for asynchronous communication
26Middleware
- Two kinds of middleware
- data warehouses
- virtual integration
- Data Warehouses
- good: query response times
- good: materializes results of data cleaning
- bad: high resource requirements in middleware
- bad: staleness of data
- Virtual Integration
- the opposite
- caching possible to improve response times
27Virtual Integration
[Figure: a query enters the middleware (query decomposition, result composition), which sends one sub query through each of two wrappers to DB1 and DB2]
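The roles of middleware and wrappers can be sketched with two in-memory SQLite databases standing in for DB1 and DB2. The table layout, the single-parameter query, and the function names are assumptions for illustration.

```python
import sqlite3


def make_source(rows):
    """Create a stand-in data source with a hypothetical emp table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE emp (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO emp VALUES (?, ?)", rows)
    return conn


def wrapper(conn, min_age):
    """Translate the sub query into the source's dialect and run it there."""
    cur = conn.execute("SELECT name, age FROM emp WHERE age >= ?", (min_age,))
    return cur.fetchall()


def middleware(sources, min_age):
    """Decompose into one sub query per source, then compose the results."""
    partial = [wrapper(conn, min_age) for conn in sources]
    return sorted(row for rows in partial for row in rows)


db1 = make_source([("John", 35), ("Ann", 28)])
db2 = make_source([("Mary", 35), ("Bob", 41)])
print(middleware([db1, db2], min_age=35))
# [('Bob', 41), ('John', 35), ('Mary', 35)]
```

The selection is evaluated inside each source, so only qualifying rows reach the middleware. The data stays in the sources and is read at query time, in contrast to a data warehouse.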
28IBM Data Joiner
[Figure: an SQL query enters Data Joiner, which sends one sub query through each of two wrappers to SQL DB1 and SQL DB2]
29Adding XML
[Figure: a query enters an XML Publishing layer on top of the middleware (SQL), which sends one sub query through each of two wrappers to DB1 and DB2]
30XML Data Integration
[Figure: an XML query enters the middleware (XML), which sends one XML query to each of two wrappers in front of DB1 and DB2]
31XML Data Integration
- Example: BEA Liquid Data
- Advantage
- Availability of XML wrappers for all major databases
- Problems
- XML - SQL mapping is very difficult
- XML is not always the right language (e.g., decision support style queries)
32Web Services
- Idea: Encapsulate Data Source
- provide WSDL interface to access data
- works very well if query pattern is known
- Problem: Exploit Capability of Source
- WSDL limits capabilities of data source
- good optimization requires white box
- example: access by id, access by name, full scan; should all combinations be listed in WSDL?
- Solution: WSDL for Query Planning
33Summary
- Middleware looks like a homogeneous centralized database
- location transparency
- data model transparency
- Middleware provides global schema
- data sources map local schemas to global schema
- Various kinds of middleware (SQL, XML)
- Stacks of middleware possible
- Data cleaning requires special attention