Federating Scientific Data

1 / 45

About This Presentation

Title:

Federating Scientific Data

Description:

... environment introduces new requirements on the selectivity estimation problem ... Assume selectivity is known for some queries (say 25%), in addition, say the ... – PowerPoint PPT presentation

Number of Views:37

Avg rating:3.0/5.0

Slides: 46

Provided by: tanum

more less

Transcript and Presenter's Notes

Title: Federating Scientific Data

1
Federating Scientific Data

Tanu Malik
Dept. of Computer Science
Johns Hopkins University

2
New trends in scientific research

Explore the same scientific phenomena in related
sub-disciplines
In Astronomy, the same sky is observed in
different electromagnetic spectra
In Biology, cures for the same disease is
searched in genetics, physiology and molecular
biology
Trend is important for new discoveries to take
place
Correlation between new discoveries and the
number of connections between fundamental
properties
Number of connections increase by federating data
from sub-disciplines

3
Issues in building federations

Systems in scientific sub-disciplines are
autonomous and thus heterogeneous
Sub-disciplines make independent choices in
hardware and software
Federations can become non-inter-operable
Scientific sub-disciplines are data-intensive
Data acquisition governed by Moores law
Federated tasks (queries) compare, merge and join
large data-sets from
multiple sub-disciplines
Efficient execution of tasks
By optimizing individual tasks before execution
By minimizing volume of data transfer from all
tasks
Collection of task-specific statistics might be
essential for efficient execution
Need for effective solutions to allow for
efficient federation,
management, and correlation of data

4
Astronomy Our representative science

Astronomy needs federations
Currently, more than 100 independent surveys
published on the Internet
Typically each contains 10-100 million objects
Different wavelengths, different properties, but
same objects
Several questions for finding similar objects
Real, high-dimensional data
Astronomy data has no commercial value
No privacy concerns. Free for sharing
Great for experimenting

5
Our contributions

A federation of astronomy archives The SkyQuery
Architecture, specification of federated queries,
algorithms for optimizing federated queries
A caching framework
Consists of algorithms that minimize network
traffic. Algorithms build upon economic
principles
Estimating query result size
A model that exploits data mining techniques to
estimate result size of federated queries

6
The SkyQuery An astronomy federation

Ability to query data across archives.
Typical Query Find all objects near the Pole
star that were observed by archives measuring
infra-red wavelengths but not by archives
measuring radio wavelengths
This is a distributed probabilistic spatial join
query as there is error associated with the
recorded location of an object.
Federation is necessary to run this query as
infra-red and radio archives are independently
managed
Queries are to be optimized as the near
criteria can potentially transfer thousands of
objects over the Internet

7
SkyQuery Architecture
8
SkyQuery status

Uses Web services standard for federation
Currently a federation of 30 sites
Is expected to grow to 120 sites soon
Actively used by astronomers all over the world
The spin-off is OpenSkyQuery that provides
intuitive interfaces and
fully-developed language
for cross-match specification
http//www.openskyquery.net (User site)
http//www.skyquery.net (Research and
Development site)

9
Bottleneck in SkyQuery

In January 2004, a participating site of SkyQuery
resulted in
Total number of queries 26,667
Total number of object accesses 45,154
Total bytes on the network 1200 GB ( More than 1
TB!)
Total number of unique object accesses 3890
Total size of objects that were accessed 90 GB
SkyQuery does not respect good network
citizenship
Locality in object accesses can be translated
into effective caching methods

10
Naïve caching is dangerous

Risks of caching
Caching may transfer unfiltered data
May transfer columns or tables across network
Caching moves data to programs/queries
Caching may reduce parallelism
May move queries from many servers to few caches

Benefits of a federation
Filters data prior to transfer
Selection and aggregation queries reduce data
size
Moves the program (fewer bytes) to the data (more
bytes)
Executes queries in parallel
Experiment scales with the number of federated
sites

11
Bypass-Yield An effective caching method

Distinguish between queries
To bypass
To satisfy in cache
Bypass queries
Sent directly to database servers
Caching them does not save network traffic
Allows for benefits of federation
moving-queries-not-data,
parallelism, and data filtering
Queries satisfied in cache
Save network traffic
Defines yield of a query to
Distinguish queries
Helps decide accessed data objects to load from
servers to cache
Caching these data objects saves network traffic
Measures the network savings rate per unit of
cache space used

12
Network flow in a Bypass-Yield Cache
DL load cost, DB bypass cost Minimize WAN cost
on (DL DB)
13
The BYC metric

Let si be the size and fi be the fetch cost of
object oi
Let yi,j be the size of the query-result of the
jth query that accesses oi
Let pi,j be the probability of occurrence of the
jth query above
Byte-Yield Hit Rate (BYHR) for object oi defined
as
BYHR has two components
Expected benefit of caching oi due to yield
Scaled by the cost/size ratio

14
BYC algorithms make economic decisions

Rate-Profile An optimistic algorithm
Predicts object access patterns from workload
Combines recency and frequency to measure
probability of access for objects in and outside
cache
Evicts objects with lowest rate of savings.
Optimistically, loads objects whose rate of
savings are increasing
OnlineBY A defensive algorithm
Makes no assumptions about workload
Loads an object into the cache when network
traffic equal to the size of object has been
incurred
Maintains the cache according to an web caching
algorithm (GDS)
Is lg2(k) competitive, k is the ratio of size of
cache, to the smallest object in cache
SpaceEffBY A randomized algorithm
Maintains no state about workload
Loads objects with probability equal to
yield/size of object

15
Current researchYield estimation

Selectivity Estimation
Estimating the number of tuples that satisfy a
query
Accurate estimates can help in
For choosing best plans in query optimization
As feedback to users
Resource consumption, load balancing
For BYC to achieve maximum network savings
Accurate selectivity estimation is crucial to BYC
performance
Governs load and evict decisions

16
Selectivity estimation in BYC

Common methods to estimate selectivity
Random sampling, histogram, wavelet based
compressions
A caching environment introduces new requirements
on the selectivity estimation problem
Caches are close to clients not servers often
have no access to databases
Cache is a constrained resource in terms of
storage
Caching systems are adaptive and online

17
Template example

Assume there are large no of queries to the
following query template
SELECT order date,required date FROM Order
WHERE freight OP
VALUE
VALUE varies over the domain A,B. A,B are real
values. Operator OP comes from the set lt,gt

18
Freight distribution

Cumulative frequency distribution for freight
attribute for OP lt
Needs to known precisely to estimate selectivity
of all queries

19
Template example

Assume there are large no of queries to the
following query template
SELECT order date,required date FROM Order
WHERE freight OP
VALUE
VALUE varies in the range M,N. M,N are real
values. Operator OP comes
from the set lt,gt

The feature vector consists of OP,VALUE
Consider feature vectors over all such queries.
Assume
selectivity is known for some queries (say 25),
in addition, say the
selectivity can be classified into 3 classes
low, medium and high

20
Estimation example

Decision tree learns u, and v from query
attributes
Use linear regression in each class to obtain
actual yield values

21
Properties of our solution

Uses a query based approach
Current methods use a data-based approach
Detect templates in the workload
Most scientific experiments adhere to query
templates
Very fast and compact
Use decision trees and regression over templates
Crude estimates for the bypass phase, accurate
estimates for loading data
Able to process complex queries
Extracts feature vectors from templates

22
Future directions

To improve query execution performance in the
cache
In the absence of indices and materialized views
Dynamic physical design which reorganizes itself
according to workload
Open questions
Finding the best file organization
How to best represent queries in terms of
database schema, without actually changing the
physical organization?
How to update changes to schema when the workload
changes?
Plan to work on the above problem till Spring of
2006.

23
Acknowledgements

Inside
Comp. Science Randal Burns, Members of HSSL
Phy Astro Alex Szalay, The SDSS Team
JHU The GBO Committee
Outside
Microsoft Research Jim Gray
Univ. of Notre Dame Amitabh Chaudhary, Nitesh
Chawla
Carnegie Mellon Anastassia Ailamaki, Stratos
Papadomanolakis

24
(No Transcript)
25
Web services

SkyQuery uses Web services for interoperability
Use of Internet standards
Communication Protocol HTTP
Message Exchange Model
Simple Object Access Protocol(SOAP) with
eXtensible Markup Language(XML) encoding
Describing, Defining and Discovering Web
Services
Web Services Description Language (WSDL),
Uniform Discovery, Description and Integration
(UDDI) of Web Services

26
The object cross-match

Matching objects across archives, if they
correspond to the same astronomical body.
Given A set of observations, one from each
archive.
Question Do these observations refer to the same
astronomical body?
Answer
Easy, if positions can be measured precisely
the X Match answer is deterministic.

27
Probabilistic Cross Match

Measured positions have an error due to
instrument inaccuracies.
This error follows a Gaussian distribution
(and is known for each archive)
Given A set of observations, one from each
archive
Question What is the probability that these
observations refer to the same astronomical body?
Algorithm ?
Probabilistic Cross Match

28
Computing the probability of a X Match

To compute the probability of a X Match, we
assume a position for the real body.
The probability of a X Match varies with this
considered position of the real body.
We consider that position for which probability
is highest.
This position is a kind of weighted mean.
Probability at this position can be represented
by a circular region about mean.
Our X Match algorithm reports those sets of
observations
probability gt threshold

29
The probabilistic cross-match
30
Background Page Caching

Fixed size objects/pages, different fetch cost
Cache hit is equivalent to an entire page being
accessed
Caches pages that have high fetch cost
Used in operating systems

P6
P6
Goal Minimize total fetch cost of pages
31
Background Object Caching

Variable size objects, different fetch cost
Cache hit is equivalent to accessing an entire
object
Caches objects that have high cost/size ratio.
Used in proxy web caching

O8
08
Goal Minimize total fetch cost of objects
32
Bypass-Yield Caching

Objects are of variable size
For each object, cost of access varies
Cache hit implies fetching an entire object, part
of an object, or an aggregate computed over an
object

Client Requests
Q8
Goal Minimize total network cost (load and
bypass)
33
Rate-Profile

Estimating probability of object access to
estimate savings
By characterizing workload
For objects in cache
Estimates probability as frequency-counts over
cache lifetime
Computes a rate of network savings
Every in-cache access increase the rate in
proportion to query yield
Every time step (query access) decays rate

34
Rate-Profile

For objects outside cache
Separates queries accesses for an object into
clusters (episodes)
For each episode maintains frequency-counts over
episode length
Maximum rate in each episode is the maximum
benefit of caching An
optimistic decision
Ages maximum benefits in each episode by weight
Computes the expected rate of network savings

35
Rate-Profile Algorithm

Discount object load cost from expected savings
Penalize object for the network cost incurred in
loading an object into cache
Load Vs Bypass
Compares discounted expected rate of network
savings of an object not in the
cache with the rate of network savings
of all objects in the cache
If greater, load. Else, bypass
Eviction
Evict object(s) with lowest current savings

36
OnlineBY

Formulates the bypass-yield caching problem as a
combination of
Rent-or-buy problem, and
Object caching problem
The bypass Vs. load decision A defensive
decision
Bypasses queries (rents) until total bypass
traffic exceeds fetch cost
Then, loads (buys) an object
Eviction decisions and in-cache policies
Cache is maintained according to an algorithm for
object
caching problem
This is O(lg2k) competitive
k size of cache / size of smallest object

37
Randomized Ski Rental

Randomized version of OnlineBY
Does not store the amount of traffic bypassed
Load decision A random decision
Load an object with probability equal to the
ratio of the yield of the
current query to the fetch cost
Requires no metadata for making load Vs bypass
decision
Reduces total amount of meta data required
Performs well in practice ( no bounds yet )

38
Network Cost of a Trace (columns)