Title: Scoped and Approximate Queries in a Relational Grid Information Service
1. Scoped and Approximate Queries in a Relational Grid Information Service
- Dong Lu, Peter A. Dinda, Jason A. Skicewicz
- Prescience Lab, Dept. of Computer Science
- Northwestern University, Evanston, IL 60201
2. Outline
- Introduction and motivation
- Powerful queries, but expensive to execute
- Trade-off between result size and query time
- Our solutions: Scoped query, Approximate query, Scoped Approximate query
- Nondeterministic query (SC talk on Tuesday)
- Performance Evaluation
3. What is RGIS?
- GIS: A Grid Information Service stores information about the resources and services in a distributed computing environment and answers queries about it.
- RGIS: A Grid Information Service based on the relational data model.
4. Why RGIS?
- RGIS can answer complex compositional queries
- Relational algebra (SQL)
- Joins
- Difficult in a hierarchical model (directory service)
- Other reasons
- Indexes separate from data model
- Schema evolution
- Transactional insert/update/delete
- Consistency
5. RGIS Model of a Grid
- Annotated network topology graph
- Annotation examples
- Hosts: memory, disk, OS, NICs, etc.
- Router/Switch: backplane bandwidth, ports
- Link: latency and bandwidth
- Highly dynamic data in streams, not DB
- Virtualization, Futures, Leases
- Virtual machines
[Figure: layered model with Software, Network, Data link, and Physical layers; entities shown: module, endpoint, router, iplink, host, maclink, macswitch, connectorswitch, connectorlink]
6. The RGIS Design (Per Site)
7. Challenge/Trade-off
- Complex queries to a relational database can take a long time.
- Hours, days, or even weeks when we want seconds.
- Typically, the returned result set is unnecessarily big.
- Get back all results
- We need mechanisms to trade off query time against the size of the result set.
8. Challenge/Trade-off
[Figure: all results, approximate results, nondeterministic results, scoped results]
9. Example: Cluster Finder
Find N hosts connected to the same router, with total memory of at least N*512 MB, all running Linux, where the bisection bandwidth of the cluster is no less than 100 Mbits/sec.
10. Original SQL for 2 Host Cluster Finder
SELECT scoped-approx h1.distip, h2.distip
FROM hosts h1, hosts h2, iplinks l1, iplinks l2, routers r
WHERE h1.mem_mb + h2.mem_mb >= 1024
  AND h1.os = 'linux' AND h2.os = 'linux'
  AND ((l1.src = r.distip AND l2.src = r.distip
        AND l1.dest = h1.distip AND l2.dest = h2.distip)
    OR (l1.dest = r.distip AND l2.dest = r.distip
        AND l1.src = h1.distip AND l2.src = h2.distip))
  AND h1.distip <> h2.distip
  AND l1.bw_mbs >= 100 AND l2.bw_mbs >= 100
SCOPED BY r.distip = X
WITHIN 100 seconds
11. Original SQL for Cluster Finder
- It is a (2N+1)-way join to look for an N-node cluster. Not scalable.
[Figure: routers, IP links, and hosts, with two candidate clusters (Cluster 1, Cluster 2)]
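The 2N+1 count comes from N host rows, N link rows, and one router row. The following is a minimal Python sketch that generates the exact (unscoped) query text for an arbitrary N. The helper names `cluster_tables` and `build_cluster_query` are hypothetical, the table and column names are taken from the 2-host query on the earlier slide, and the comparison operators are assumptions based on the "at least" and "no less than" wording of the example.

```python
from itertools import combinations


def cluster_tables(n: int) -> list[str]:
    """FROM-clause entries for an n-host cluster: n hosts, n links, 1 router."""
    if n < 1:
        raise ValueError("n must be at least 1")
    hosts = [f"hosts h{i}" for i in range(1, n + 1)]
    links = [f"iplinks l{i}" for i in range(1, n + 1)]
    return hosts + links + ["routers r"]


def build_cluster_query(n: int, per_host_mb: int = 512, min_bw: int = 100) -> str:
    """Build the exact cluster-finder SQL text (illustrative, not RGIS code)."""
    ids = range(1, n + 1)
    # Total memory across all chosen hosts must reach n * per_host_mb.
    total_mem = " + ".join(f"h{i}.mem_mb" for i in ids)
    conds = [f"{total_mem} >= {n * per_host_mb}"]
    conds += [f"h{i}.os = 'linux'" for i in ids]
    conds += [f"l{i}.bw_mbs >= {min_bw}" for i in ids]
    # Each link joins the shared router to one host, in either direction.
    down = " AND ".join(
        f"l{i}.src = r.distip AND l{i}.dest = h{i}.distip" for i in ids)
    up = " AND ".join(
        f"l{i}.dest = r.distip AND l{i}.src = h{i}.distip" for i in ids)
    conds.append(f"(({down}) OR ({up}))")
    # All hosts in the cluster must be distinct.
    conds += [f"h{a}.distip <> h{b}.distip" for a, b in combinations(ids, 2)]
    select = ", ".join(f"h{i}.distip" for i in ids)
    return (f"SELECT {select}\nFROM " + ", ".join(cluster_tables(n))
            + "\nWHERE " + "\n  AND ".join(conds))


if __name__ == "__main__":
    for n in (2, 4, 8):
        print(n, "hosts ->", len(cluster_tables(n)), "tables joined")
    print(build_cluster_query(2))
```

The number of joined tables grows linearly with N, and the pairwise distinctness conditions grow quadratically, which is one way to see why the exact query does not scale.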
12. Scoped Cluster Finder
- Query the hosts around a random router.
[Figure: routers, IP links, and hosts]
13. Scoped Cluster Finder
14. Approximate Cluster Finder
- When searching for N hosts with total memory of at least N*512 MB, we can approximate the query with a search for N hosts that each have at least 512 MB.
- This reduces the number of joins, or avoids them entirely.
- However, this won't find, say, N/2 hosts with 256 MB and N/2 hosts with 768 MB.
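The approximation is a sufficient but not a necessary condition for the original constraint. A small Python sketch makes the difference concrete; the function names are hypothetical, and the per-host threshold is treated as "at least 512 MB", which is an assumption about the intended comparison.

```python
def meets_exact(mem_mb: list[int], per_host_mb: int = 512) -> bool:
    """Original constraint: total memory is at least N * per_host_mb."""
    return sum(mem_mb) >= len(mem_mb) * per_host_mb


def meets_approx(mem_mb: list[int], per_host_mb: int = 512) -> bool:
    """Approximate constraint: every host individually has per_host_mb."""
    return all(m >= per_host_mb for m in mem_mb)


uniform = [512, 1024, 768, 512]   # every host qualifies on its own
mixed = [256, 256, 768, 768]      # N/2 hosts at 256 MB, N/2 at 768 MB

if __name__ == "__main__":
    for name, cluster in (("uniform", uniform), ("mixed", mixed)):
        print(name, "exact:", meets_exact(cluster),
              "approx:", meets_approx(cluster))
```

Any cluster accepted by the approximate check also passes the exact one, so the approximate query returns a subset of the original results. The mixed cluster satisfies the original constraint but is missed by the approximation.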
15. Approximate Cluster Finder
SELECT R.DISTIP, H1.DISTIP
FROM HOSTS H1, IPLINKS L1, ROUTERS R
WHERE H1.MEM_MB >= 512 AND H1.OS = 'LINUX'
  AND L1.BW_MBS >= 100
  AND ((L1.SRC = R.DISTIP AND L1.DEST = H1.DISTIP)
    OR (L1.DEST = R.DISTIP AND L1.SRC = H1.DISTIP))
  AND R.DISTIP IN (
    SELECT R.DISTIP
    FROM HOSTS H1, IPLINKS L1, ROUTERS R
    WHERE H1.MEM_MB >= 512 AND H1.OS = 'LINUX'
      AND L1.BW_MBS >= 100
      AND ((L1.SRC = R.DISTIP AND L1.DEST = H1.DISTIP)
        OR (L1.DEST = R.DISTIP AND L1.SRC = H1.DISTIP))
    GROUP BY R.DISTIP
    HAVING COUNT(*) >= 2)
ORDER BY R.DISTIP
16. Scoped Approximate Cluster Finder
- Combine approximate query with scoped query.
- Scope to one randomly chosen router at a time; if no results are found, choose another random router and repeat the query.
- Approximate the N-host join for N*512 MB total memory with searches for N hosts, each with at least 512 MB.
- Always a three-way join, regardless of the size of the cluster being searched for. Thus very scalable.
- May need to search multiple routers.
17. Scoped Approximate Cluster Finder
The scoped approximate cluster finder has a fixed number of joins.
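The retry loop (scope to one random router, run the three-way approximate join, move to another router if too few hosts are found) can be sketched in Python against an in-memory SQLite database. This is an illustration under stated assumptions, not the RGIS implementation: the schema is reduced to the columns used in the slides, the demo data and the helper names `make_demo_grid` and `find_cluster` are hypothetical, and picking routers by shuffling a list in the client is an assumption about how the random choice might be made.

```python
import random
import sqlite3

# Fixed three-way join (hosts, iplinks, routers), scoped to a single router.
SCOPED_APPROX_SQL = """
SELECT DISTINCT h.distip
FROM hosts h, iplinks l, routers r
WHERE r.distip = ?
  AND h.mem_mb >= ? AND h.os = 'linux'
  AND l.bw_mbs >= ?
  AND ((l.src = r.distip AND l.dest = h.distip)
    OR (l.dest = r.distip AND l.src = h.distip))
LIMIT ?
"""


def make_demo_grid() -> sqlite3.Connection:
    """Tiny hypothetical grid: router r1 has 1 qualifying host, r2 has 4."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE routers (distip TEXT PRIMARY KEY);
        CREATE TABLE hosts (distip TEXT PRIMARY KEY, mem_mb INTEGER, os TEXT);
        CREATE TABLE iplinks (src TEXT, dest TEXT, bw_mbs INTEGER);
    """)
    conn.executemany("INSERT INTO routers VALUES (?)", [("r1",), ("r2",)])
    hosts = [("a1", 1024, "linux"), ("a2", 256, "linux"),
             ("a3", 1024, "windows"), ("b1", 512, "linux"),
             ("b2", 768, "linux"), ("b3", 2048, "linux"),
             ("b4", 1024, "linux")]
    conn.executemany("INSERT INTO hosts VALUES (?, ?, ?)", hosts)
    links = ([("r1", h[0], 1000) for h in hosts[:3]]
             + [("r2", h[0], 1000) for h in hosts[3:]])
    conn.executemany("INSERT INTO iplinks VALUES (?, ?, ?)", links)
    return conn


def find_cluster(conn, n, rng=None, per_host_mb=512, min_bw=100):
    """Try routers in random order; return (router, hosts) or None."""
    rng = rng or random.Random()
    routers = [row[0] for row in conn.execute("SELECT distip FROM routers")]
    rng.shuffle(routers)
    for router in routers:
        rows = conn.execute(
            SCOPED_APPROX_SQL, (router, per_host_mb, min_bw, n)).fetchall()
        if len(rows) == n:  # enough qualifying hosts behind this router
            return router, [row[0] for row in rows]
    return None  # a subset technique may report nothing


if __name__ == "__main__":
    print(find_cluster(make_demo_grid(), 3, random.Random(0)))
```

The SQL text is the same for every cluster size; only the `LIMIT` parameter and the number of routers visited change with N.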
18. Time-bounded queries
- The query rewriter starts the query as a child process.
- The parent kills the child process if no results are returned within the deadline.
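The parent/child deadline pattern can be illustrated with Python's standard `subprocess` module. This is a sketch of the general technique only; how the RGIS query rewriter actually launches and kills its child is not described in the slides, and the helper name `run_with_deadline` and the stand-in child commands are hypothetical.

```python
import subprocess
import sys


def run_with_deadline(argv: list[str], deadline_s: float):
    """Run argv as a child process; return its stdout, or None on timeout.

    subprocess.run kills the child when the timeout expires and then
    raises TimeoutExpired, which is mapped to "no results" here.
    """
    try:
        done = subprocess.run(
            argv, capture_output=True, text=True, timeout=deadline_s)
    except subprocess.TimeoutExpired:
        return None
    return done.stdout


if __name__ == "__main__":
    # Stand-ins for a fast query and a query that overruns its deadline.
    fast = [sys.executable, "-c", "print('row')"]
    slow = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_deadline(fast, 10.0))
    print(run_with_deadline(slow, 0.5))
```

Running the query in a separate process means the parent can abandon it without cooperation from the query itself, at the cost of discarding any partial work.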
19. Limitations of Scoped and Approximate queries
- The returned results are a subset of the original query's results, and it is possible to report no results when the original query would return results after running for a long time.
- Not all queries can be written as Scoped or Approximate queries.
- It is hard to automate the Scoped and Approximate query rewriting.
20. Performance Evaluation
- Need to populate the database with a large amount of data.
- Computational grids are still in early stages.
- No large data sets available.
- Use Smith MDS data for memory
- We generate synthetic grids that are representative of the Internet.
- Can generate very large grids
21. GridG Generated Synthetic Grids
- Three-level network: WAN, MAN, LAN. Nodes on the WAN and MAN are routers, while nodes on the LAN are hosts.
- Links: IP links annotated with bandwidth and latency.
- Hosts: annotated with memory size, architecture, number of processors, CPU clock rate, disk size, etc.
- User can control all the distributions and the size of the network.
22. GridG: Synthesizing Realistic Computational Grids
SC talk on Tuesday!
http://www.cs.northwestern.edu/urgis/GridG
23. Experimental Setup
- Dell PowerEdge 4400: dual Xeon 1 GHz processors, 2 GB memory, 240 GB RAID 5 storage system.
- Oracle 9i Enterprise Edition, Red Hat Linux 7.1.
- Each test is repeated either 25 or 100 times, and we report the average value.
24. Performance of Various Query Techniques with Cluster Finder
(Time to run query, in seconds)

Cluster size   Standard   Scoped    Approx   Scoped Approx
2              21.44      2.27      7.62     1.16
4              >7200      2047.9    7.48     1.32
8              >9000      >3600     7.46     1.43
16             N/A        >3600     7.51     1.45
32             N/A        >3600     7.65     5.96
64             N/A        >3600     >120     9.58
25. Performance of Scoped Approximate Queries
- Cluster Finder: Find N hosts, each running Linux, with total memory of at least N*512 MB, all connected to the same router, where the bisection bandwidth is at least 100 Mbits/sec.
- Our running example
- Non-network query: Find N hosts with total memory of at least N*512 MB.
- No joins needed at all
26. Performance of Scoped Approximate Queries (2)
- Scalability with database size.
- Scalability with the complexity of queries.
- Scalability with concurrent users and update load.
27. Performance of Scoped Approximate Query (9.8K hosts, Cluster Finder)
28. Performance of Scoped Approximate Query (101K hosts, Cluster Finder)
29. Performance of Scoped Approximate Query (980K hosts, Cluster Finder)
30. Performance of Scoped Approximate Query (9.8K hosts, Non-network query)
31. Performance of Scoped Approximate Query (101K hosts, Non-network query)
32. Performance of Scoped Approximate Query (980K hosts, Non-network query)
33. Scalability with multiple concurrent users and background load
- Other research has shown that GIS servers undergo frequent updates while serving requests.
- GIS servers serve multiple concurrent users.
- Evaluate scoped approximate queries with concurrent users and update load.
- Concurrent users execute queries repeatedly
- The update load executes transactional updates on randomly selected hosts as fast as possible.
- About 200 updates/second
34. Performance of Scoped Approximate Query (9.8K hosts, Cluster Finder, with Concurrent Users, looking for 64 nodes)
35. Performance of Scoped Approximate Query (9.8K hosts, Non-network query, with Concurrent Users, looking for 64 nodes)
36. Conclusions
- Described and evaluated two query techniques that trade off query time against the size of the result set: Scoped and Approximate queries.
- The combination of Scoped and Approximate queries can dramatically reduce response time and server load.
37. For more information
- GridG and related paper: http://www.cs.northwestern.edu/urgis/GridG
- Synthesizing Realistic Computational Grids, in Proceedings of SC03.
- RGIS and related paper: http://www.cs.northwestern.edu/urgis/
- Nondeterministic Queries in a Relational Grid Information Service, in Proceedings of SC03.