Title: Federating Scientific Data
1Federating Scientific Data
- Tanu Malik
- Dept. of Computer Science
- Johns Hopkins University
2New trends in scientific research
- Explore the same scientific phenomena in related sub-disciplines
- In Astronomy, the same sky is observed in different electromagnetic spectra
- In Biology, cures for the same disease are sought in genetics, physiology and molecular biology
- This trend is important for new discoveries to take place
- New discoveries correlate with the number of connections between fundamental properties
- The number of connections increases when data from sub-disciplines is federated
3Issues in building federations
- Systems in scientific sub-disciplines are autonomous and thus heterogeneous
- Sub-disciplines make independent choices in hardware and software
- Federations can become non-interoperable
- Scientific sub-disciplines are data-intensive
- Data acquisition is governed by Moore's law
- Federated tasks (queries) compare, merge and join large data sets from multiple sub-disciplines
- Efficient execution of tasks
- By optimizing individual tasks before execution
- By minimizing the volume of data transferred across all tasks
- Collection of task-specific statistics might be essential for efficient execution
- Need for effective solutions that allow efficient federation, management, and correlation of data
4Astronomy: Our representative science
- Astronomy needs federations
- Currently, more than 100 independent surveys are published on the Internet
- Typically each contains 10-100 million objects
- Different wavelengths, different properties, but the same objects
- Several questions for finding similar objects
- Real, high-dimensional data
- Astronomy data has no commercial value
- No privacy concerns. Free for sharing
- Great for experimenting
5Our contributions
- A federation of astronomy archives: The SkyQuery
- Architecture, specification of federated queries, algorithms for optimizing federated queries
- A caching framework
- Consists of algorithms that minimize network traffic. The algorithms build upon economic principles
- Estimating query result size
- A model that exploits data mining techniques to estimate the result size of federated queries
6The SkyQuery: An astronomy federation
- Ability to query data across archives.
- Typical query: Find all objects near the Pole star that were observed by archives measuring infra-red wavelengths but not by archives measuring radio wavelengths
- This is a distributed probabilistic spatial join query, as there is error associated with the recorded location of an object.
- Federation is necessary to run this query, as infra-red and radio archives are independently managed
- Queries must be optimized, as the "near" criterion can potentially transfer thousands of objects over the Internet
7SkyQuery Architecture
8SkyQuery status
- Uses Web services standard for federation
- Currently a federation of 30 sites
- Is expected to grow to 120 sites soon
- Actively used by astronomers all over the world
- The spin-off is OpenSkyQuery, which provides intuitive interfaces and a fully developed language for cross-match specification
- http://www.openskyquery.net (User site)
- http://www.skyquery.net (Research and Development site)
9Bottleneck in SkyQuery
- In January 2004, the workload at one participating SkyQuery site resulted in
- Total number of queries: 26,667
- Total number of object accesses: 45,154
- Total bytes on the network: 1200 GB (more than 1 TB!)
- Total number of unique object accesses: 3890
- Total size of objects that were accessed: 90 GB
- SkyQuery does not respect good network citizenship
- Locality in object accesses can be translated into effective caching methods
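A short piece of arithmetic shows why these figures point to caching. The sketch below only recombines the numbers quoted on this slide; the variable names are illustrative, and 1 GB is taken as 1000 MB for the per-query average.

```python
# Figures quoted for one SkyQuery site, January 2004.
queries = 26_667
object_accesses = 45_154
network_gb = 1200         # total bytes shipped over the network
unique_accesses = 3_890   # "unique object accesses" as reported
accessed_gb = 90          # total size of the objects that were accessed

# How many times over the accessed data was shipped across the network.
redundancy = network_gb / accessed_gb
# Average traffic per query in MB (assuming 1 GB = 1000 MB).
mb_per_query = network_gb * 1000 / queries
# Share of object accesses that were repeats of an earlier access.
repeat_share = 1 - unique_accesses / object_accesses

print(f"traffic / accessed data: {redundancy:.1f}x")
print(f"average traffic per query: {mb_per_query:.0f} MB")
print(f"repeat accesses: {repeat_share:.1%}")
```

The network carried roughly thirteen times the volume of the data that was actually touched, and more than nine in ten accesses were repeats, which is the locality a cache can exploit.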
10Naïve caching is dangerous
- Risks of caching
- Caching may transfer unfiltered data
- May transfer columns or tables across network
- Caching moves data to programs/queries
- Caching may reduce parallelism
- May move queries from many servers to few caches
- Benefits of a federation
- Filters data prior to transfer
- Selection and aggregation queries reduce data size
- Moves the program (fewer bytes) to the data (more bytes)
- Executes queries in parallel
- Experiment scales with the number of federated sites
11Bypass-Yield: An effective caching method
- Distinguish between queries
- To bypass
- To satisfy in cache
- Bypass queries
- Sent directly to database servers
- Caching them does not save network traffic
- Allows for the benefits of federation: moving queries not data, parallelism, and data filtering
- Queries satisfied in cache
- Save network traffic
- Defines the yield of a query to
- Distinguish queries
- Help decide which accessed data objects to load from servers to cache
- Caching these data objects saves network traffic
- Measures the network savings rate per unit of cache space used
12Network flow in a Bypass-Yield Cache
D_L: load cost, D_B: bypass cost
Goal: Minimize WAN cost (D_L + D_B)
13The BYC metric
- Let s_i be the size and f_i be the fetch cost of object o_i
- Let y_{i,j} be the size of the query result of the jth query that accesses o_i
- Let p_{i,j} be the probability of occurrence of the jth query above
- Byte-Yield Hit Rate (BYHR) for object o_i is defined as
- BYHR has two components
- Expected benefit of caching o_i due to yield
- Scaled by the cost/size ratio
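The defining expression for BYHR is not reproduced in the bullets above. One form consistent with the two listed components is BYHR_i = (sum_j p_{i,j} * y_{i,j} / s_i) * (f_i / s_i), that is, expected yield per byte of cache space scaled by the cost/size ratio. The sketch below assumes that form; the function name and the sample numbers are hypothetical.

```python
def byhr(size, fetch_cost, queries):
    """Byte-Yield Hit Rate of one object, under the assumed form above.

    size       -- s_i, object size in bytes
    fetch_cost -- f_i, cost of loading the object into the cache
    queries    -- iterable of (p_ij, y_ij): query probability and result size
    """
    if size <= 0:
        raise ValueError("object size must be positive")
    expected_yield = sum(p * y for p, y in queries)  # sum_j p_ij * y_ij
    # Expected yield per byte of cache, scaled by the cost/size ratio.
    return (expected_yield / size) * (fetch_cost / size)


# Hypothetical objects of equal size and fetch cost but different workloads.
hot = byhr(size=100.0, fetch_cost=100.0, queries=[(0.5, 40.0), (0.5, 20.0)])
cold = byhr(size=100.0, fetch_cost=100.0, queries=[(0.1, 5.0)])
print(hot, cold)
```

Under this form, an object that is queried often and returns large results ranks above an equally sized object that is rarely queried, so it is the better candidate for the limited cache space.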
14BYC algorithms make economic decisions
- Rate-Profile: An optimistic algorithm
- Predicts object access patterns from workload
- Combines recency and frequency to measure probability of access for objects in and outside cache
- Evicts objects with the lowest rate of savings. Optimistically loads objects whose rate of savings is increasing
- OnlineBY: A defensive algorithm
- Makes no assumptions about workload
- Loads an object into the cache when network traffic equal to the size of the object has been incurred
- Maintains the cache according to a web caching algorithm (GDS)
- Is O(lg^2 k)-competitive, where k is the ratio of the size of the cache to the size of the smallest object in cache
- SpaceEffBY: A randomized algorithm
- Maintains no state about workload
- Loads objects with probability equal to yield/size of object
15Current research: Yield estimation
- Selectivity estimation
- Estimating the number of tuples that satisfy a query
- Accurate estimates can help
- In choosing the best plans during query optimization
- As feedback to users
- With resource consumption and load balancing
- BYC achieve maximum network savings
- Accurate selectivity estimation is crucial to BYC performance
- Governs load and evict decisions
16Selectivity estimation in BYC
- Common methods to estimate selectivity
- Random sampling, histograms, wavelet-based compression
- A caching environment introduces new requirements on the selectivity estimation problem
- Caches are close to clients, not servers, and often have no access to databases
- Cache is a constrained resource in terms of storage
- Caching systems are adaptive and online
17Template example
- Assume there are a large number of queries to the following query template
- SELECT order date, required date FROM Order WHERE freight OP VALUE
- VALUE varies over the domain [A,B], where A and B are real values. Operator OP comes from the set {<, >}
18Freight distribution
- Cumulative frequency distribution of the freight attribute for OP = <
- Needs to be known precisely to estimate the selectivity of all queries
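To make the link between the cumulative distribution and selectivity concrete, the sketch below computes the selectivity of `freight OP VALUE` directly from a sorted copy of the attribute. This is the data-based baseline, assuming full access to the column, which a cache often does not have; the function name and the freight values are hypothetical.

```python
from bisect import bisect_left, bisect_right


def selectivity(sorted_values, op, value):
    """Fraction of rows satisfying `attribute OP value`, with OP in {'<', '>'}."""
    n = len(sorted_values)
    if n == 0:
        return 0.0
    if op == "<":
        # Number of values strictly below `value`, i.e. the empirical CDF.
        return bisect_left(sorted_values, value) / n
    if op == ">":
        # Number of values strictly above `value`.
        return (n - bisect_right(sorted_values, value)) / n
    raise ValueError(f"unsupported operator: {op!r}")


# Hypothetical freight values, sorted once up front.
freight = sorted([3.5, 8.0, 12.25, 12.25, 20.0, 41.0, 65.5, 99.0])
print(selectivity(freight, "<", 12.25))
print(selectivity(freight, ">", 12.25))
```

Each lookup is a binary search, but the approach needs the whole sorted column, which is the storage and access cost that a query-based estimator tries to avoid.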
19Template example
- Assume there are a large number of queries to the following query template
- SELECT order date, required date FROM Order WHERE freight OP VALUE
- VALUE varies in the range [M,N], where M and N are real values. Operator OP comes from the set {<, >}
- The feature vector consists of (OP, VALUE)
- Consider feature vectors over all such queries. Assume selectivity is known for some queries (say 25) and, in addition, that the selectivity can be classified into 3 classes: low, medium and high
20Estimation example
- Decision tree learns u and v from query attributes
- Use linear regression in each class to obtain actual yield values
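The two steps can be sketched as follows. The sketch assumes that u and v are thresholds on VALUE separating the low, medium and high classes for a fixed operator, which the slide does not state explicitly. It also takes u and v as given rather than learning them with a decision tree, and fits the per-class line with ordinary least squares. All names and sample points are hypothetical.

```python
def fit_line(points):
    """Ordinary least squares for y = a*x + b over (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:  # only one distinct x: fall back to a constant estimate
        return 0.0, mean_y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    return a, mean_y - a * mean_x


def classify(value, u, v):
    """Selectivity class of a query; u < v are the assumed thresholds."""
    if value < u:
        return "low"
    return "medium" if value < v else "high"


def train(samples, u, v):
    """samples: (value, observed_yield) pairs for one operator."""
    by_class = {}
    for value, observed in samples:
        by_class.setdefault(classify(value, u, v), []).append((value, observed))
    return {cls: fit_line(pts) for cls, pts in by_class.items()}


def estimate(models, value, u, v):
    """Yield estimate; raises KeyError if the class had no training samples."""
    a, b = models[classify(value, u, v)]
    return a * value + b


# Hypothetical training queries with known yields.
samples = [(1, 10), (2, 20), (3, 30), (5, 100), (7, 200), (9, 300),
           (10, 1000), (12, 1000)]
models = train(samples, u=5, v=10)
print(estimate(models, 6, u=5, v=10))
```

Only two coefficients per class are kept, so the stored model stays small regardless of how large the underlying table is.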
21Properties of our solution
- Uses a query-based approach
- Current methods use a data-based approach
- Detects templates in the workload
- Most scientific experiments adhere to query templates
- Very fast and compact
- Uses decision trees and regression over templates
- Crude estimates for the bypass phase, accurate estimates for loading data
- Able to process complex queries
- Extracts feature vectors from templates
22Future directions
- To improve query execution performance in the cache
- In the absence of indices and materialized views
- Dynamic physical design which reorganizes itself according to workload
- Open questions
- Finding the best file organization
- How to best represent queries in terms of database schema, without actually changing the physical organization?
- How to update changes to schema when the workload changes?
- Plan to work on the above problem until Spring 2006.
23Acknowledgements
- Inside
- Comp. Science: Randal Burns, Members of HSSL
- Phy. & Astro.: Alex Szalay, The SDSS Team
- JHU: The GBO Committee
- Outside
- Microsoft Research: Jim Gray
- Univ. of Notre Dame: Amitabh Chaudhary, Nitesh Chawla
- Carnegie Mellon: Anastassia Ailamaki, Stratos Papadomanolakis
24(No Transcript)
25Web services
- SkyQuery uses Web services for interoperability
- Use of Internet standards
- Communication protocol: HTTP
- Message exchange model
- Simple Object Access Protocol (SOAP) with eXtensible Markup Language (XML) encoding
- Describing, defining and discovering Web services
- Web Services Description Language (WSDL), Universal Description, Discovery and Integration (UDDI) of Web services
26The object cross-match
- Matching objects across archives, if they correspond to the same astronomical body.
- Given: A set of observations, one from each archive.
- Question: Do these observations refer to the same astronomical body?
- Answer
- Easy, if positions can be measured precisely
- The X Match answer is deterministic.
27Probabilistic Cross Match
- Measured positions have an error due to instrument inaccuracies.
- This error follows a Gaussian distribution (and is known for each archive)
- Given: A set of observations, one from each archive
- Question: What is the probability that these observations refer to the same astronomical body?
- Algorithm?
- Probabilistic Cross Match
28Computing the probability of a X Match
- To compute the probability of an X Match, we assume a position for the real body.
- The probability of an X Match varies with this assumed position of the real body.
- We consider the position for which the probability is highest.
- This position is a kind of weighted mean.
- Probability at this position can be represented by a circular region about the mean.
- Our X Match algorithm reports those sets of observations whose probability > threshold
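A minimal sketch of this computation follows, under simplifying assumptions that are not from the slides: positions are (x, y) offsets on a small flat patch of sky, and each archive has an isotropic Gaussian error sigma. In that setting the most likely position is the inverse-variance weighted mean, and the likelihood of the tuple relative to perfectly coincident observations is exp(-chi^2 / 2). Function names and the threshold value are hypothetical.

```python
import math


def best_position(observations):
    """Inverse-variance weighted mean of (x, y, sigma) observations."""
    weights = [1.0 / (sigma * sigma) for _, _, sigma in observations]
    total = sum(weights)
    x = sum(w * ox for w, (ox, _, _) in zip(weights, observations)) / total
    y = sum(w * oy for w, (_, oy, _) in zip(weights, observations)) / total
    return x, y


def chi_square(observations):
    """Sum of squared, sigma-normalised distances from the best position."""
    x, y = best_position(observations)
    return sum(((ox - x) ** 2 + (oy - y) ** 2) / (sigma * sigma)
               for ox, oy, sigma in observations)


def match_likelihood(observations):
    """Likelihood at the best position, relative to a perfect coincidence."""
    return math.exp(-0.5 * chi_square(observations))


def is_match(observations, threshold):
    """Report the tuple when its relative likelihood exceeds the threshold."""
    return match_likelihood(observations) > threshold


# Hypothetical tuples: one observation per archive, as (x, y, sigma).
close_pair = [(0.0, 0.0, 1.0), (2.0, 0.0, 1.0)]
far_pair = [(0.0, 0.0, 1.0), (10.0, 0.0, 1.0)]
print(is_match(close_pair, 0.3), is_match(far_pair, 0.3))
```

Archives with smaller sigma pull the best position towards their own measurement, which is the sense in which the position is a weighted mean.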
29The probabilistic cross-match
30Background: Page Caching
- Fixed-size objects/pages, different fetch cost
- Cache hit is equivalent to an entire page being accessed
- Caches pages that have high fetch cost
- Used in operating systems
Goal: Minimize total fetch cost of pages
31Background: Object Caching
- Variable-size objects, different fetch cost
- Cache hit is equivalent to accessing an entire object
- Caches objects that have a high cost/size ratio.
- Used in proxy web caching
Goal: Minimize total fetch cost of objects
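One well-known object-caching policy built on the cost/size ratio is GreedyDual-Size, which is presumably what the abbreviation GDS elsewhere in this talk refers to. The sketch below is a compact illustration of that policy, not the implementation used in the work presented here. It finds the victim with a linear scan for brevity, where a priority queue would normally be used.

```python
class GreedyDualSize:
    """Keep objects with high cost/size; evict the lowest priority H first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.inflation = 0.0  # the running value L of the policy
        self.entries = {}     # key -> (priority H, size, cost)

    def access(self, key, size, cost):
        """Return True on a hit; on a miss, admit the object if it can fit."""
        if key in self.entries:
            _, size, cost = self.entries[key]
            # A hit restores the priority to L + cost/size.
            self.entries[key] = (self.inflation + cost / size, size, cost)
            return True
        if size > self.capacity:
            return False  # can never fit: serve without caching
        while self.used + size > self.capacity:
            victim = min(self.entries, key=lambda k: self.entries[k][0])
            priority, victim_size, _ = self.entries.pop(victim)
            self.inflation = priority  # L rises to the evicted priority
            self.used -= victim_size
        self.entries[key] = (self.inflation + cost / size, size, cost)
        self.used += size
        return False


cache = GreedyDualSize(capacity=10)
cache.access("a", size=5, cost=5)    # cheap to refetch
cache.access("b", size=5, cost=50)   # expensive to refetch
cache.access("c", size=5, cost=5)    # forces an eviction
print(sorted(cache.entries))
```

The rising value L ages objects that are not re-accessed, so recency is taken into account without storing timestamps.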
32Bypass-Yield Caching
- Objects are of variable size
- For each object, cost of access varies
- Cache hit implies fetching an entire object, part of an object, or an aggregate computed over an object
Goal: Minimize total network cost (load and bypass)
33Rate-Profile
- Estimating probability of object access to estimate savings
- By characterizing workload
- For objects in cache
- Estimates probability as frequency counts over cache lifetime
- Computes a rate of network savings
- Every in-cache access increases the rate in proportion to query yield
- Every time step (query access) decays the rate
34Rate-Profile
- For objects outside cache
- Separates query accesses for an object into clusters (episodes)
- For each episode, maintains frequency counts over episode length
- Maximum rate in each episode is the maximum benefit of caching: an optimistic decision
- Ages maximum benefits in each episode by weight
- Computes the expected rate of network savings
35Rate-Profile Algorithm
- Discount object load cost from expected savings
- Penalize an object for the network cost incurred in loading it into cache
- Load vs. bypass
- Compares the discounted expected rate of network savings of an object not in the cache with the rate of network savings of all objects in the cache
- If greater, load. Else, bypass
- Eviction
- Evict object(s) with lowest current savings
36OnlineBY
- Formulates the bypass-yield caching problem as a combination of
- Rent-or-buy problem, and
- Object caching problem
- The bypass vs. load decision: A defensive decision
- Bypasses queries (rents) until total bypass traffic exceeds fetch cost
- Then, loads (buys) an object
- Eviction decisions and in-cache policies
- Cache is maintained according to an algorithm for the object caching problem
- This is O(lg^2 k)-competitive
- k = size of cache / size of smallest object
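The rent-or-buy half of the decision can be sketched as below. This is a minimal illustration of the bypass-until-paid-for rule only: cache capacity and eviction are left out, the class and method names are hypothetical, and whether the triggering query is answered before or after the load is left open.

```python
class RentOrBuy:
    """Bypass (rent) until bypassed traffic reaches the fetch cost, then load (buy)."""

    def __init__(self):
        self.bypassed = {}   # object id -> bytes of query results bypassed so far
        self.cached = set()  # ids of objects currently held in the cache

    def on_query(self, obj_id, yield_bytes, fetch_cost):
        """Return 'hit', 'bypass' or 'load' for one query against obj_id."""
        if obj_id in self.cached:
            return "hit"
        spent = self.bypassed.get(obj_id, 0) + yield_bytes
        if spent >= fetch_cost:
            # Bypassing has now cost as much as loading would have: buy.
            self.cached.add(obj_id)
            self.bypassed.pop(obj_id, None)
            return "load"
        self.bypassed[obj_id] = spent
        return "bypass"


policy = RentOrBuy()
decisions = [policy.on_query("table_t", yield_bytes=40, fetch_cost=100)
             for _ in range(4)]
print(decisions)
```

Because an object is loaded only after bypass traffic has matched its fetch cost, the load decision never costs more than about twice what an optimal choice for that object would have paid, which is the classic rent-or-buy argument.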
37Randomized Ski Rental
- Randomized version of OnlineBY
- Does not store the amount of traffic bypassed
- Load decision: A random decision
- Load an object with probability equal to the ratio of the yield of the current query to the fetch cost
- Requires no metadata for making the load vs. bypass decision
- Reduces the total amount of metadata required
- Performs well in practice (no bounds yet)
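The randomized load decision fits in a few lines. The sketch below assumes the probability is capped at 1 when a query's yield exceeds the fetch cost, a detail the slide does not specify. The function name is hypothetical, and the `rng` parameter exists only so that runs can be made repeatable.

```python
import random


def should_load(yield_bytes, fetch_cost, rng=random):
    """Load with probability min(1, yield / fetch cost); keeps no per-object state."""
    if fetch_cost <= 0:
        raise ValueError("fetch cost must be positive")
    probability = min(1.0, max(0.0, yield_bytes / fetch_cost))
    return rng.random() < probability


# Repeatable simulation: queries yielding 10% of the fetch cost.
rng = random.Random(0)
loads = sum(should_load(10, 100, rng) for _ in range(10_000))
print(loads)
```

With a constant yield y and fetch cost f, the expected number of queries before a load is f / y, so the expected bypassed traffic is close to the fetch cost, mirroring the deterministic rule without storing a counter per object.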
38Network Cost of a Trace (columns)
- Rate-Profile dynamically chooses between no cache and inline cache
- Rate-Profile compares well with an optimal static cache
- Other algorithms show similar results
39Algorithm Cost Analysis
- Rate-Profile is better than OnlineBY in column caching
- SpaceEffBY always lags behind
- In all algorithms, column caching is better than table caching
40Schema Locality
- Schema locality implies reuse of schema elements rather than data elements
- Both tables and columns show heavy and long-lasting periods of reuse
41Containment
- It is imperative for the workload to show both containment and locality for query caching to be viable (Luo et al)
- Determining exact query containment is NP-complete (Chandra and Merlin)
- Few objects experience reuse
42Architecture
- Status
- We are currently implementing the above system within BYC.
- Preliminary experiments support the efficacy of the above model.
43Result (1): Range Query
44Result (2): User-defined Function Queries
45Result (3): Index Queries