Federating Scientific Data

1 / 45
About This Presentation
Title:

Federating Scientific Data

Description:

... environment introduces new requirements on the selectivity estimation problem ... Assume selectivity is known for some queries (say 25%), in addition, say the ... – PowerPoint PPT presentation

Number of Views:37
Avg rating:3.0/5.0
Slides: 46
Provided by: tanum

less

Transcript and Presenter's Notes

Title: Federating Scientific Data


1
Federating Scientific Data
  • Tanu Malik
  • Dept. of Computer Science
  • Johns Hopkins University

2
New trends in scientific research
  • Explore the same scientific phenomena in related
    sub-disciplines
  • In Astronomy, the same sky is observed in
    different electromagnetic spectra
  • In Biology, cures for the same disease is
    searched in genetics, physiology and molecular
    biology
  • Trend is important for new discoveries to take
    place
  • Correlation between new discoveries and the
    number of connections between fundamental
    properties
  • Number of connections increase by federating data
    from sub-disciplines

3
Issues in building federations
  • Systems in scientific sub-disciplines are
    autonomous and thus heterogeneous
  • Sub-disciplines make independent choices in
    hardware and software
  • Federations can become non-inter-operable
  • Scientific sub-disciplines are data-intensive
  • Data acquisition governed by Moores law
  • Federated tasks (queries) compare, merge and join
    large data-sets from
    multiple sub-disciplines
  • Efficient execution of tasks
  • By optimizing individual tasks before execution
  • By minimizing volume of data transfer from all
    tasks
  • Collection of task-specific statistics might be
    essential for efficient execution
  • Need for effective solutions to allow for
    efficient federation,
    management, and correlation of data

4
Astronomy Our representative science
  • Astronomy needs federations
  • Currently, more than 100 independent surveys
    published on the Internet
  • Typically each contains 10-100 million objects
  • Different wavelengths, different properties, but
    same objects
  • Several questions for finding similar objects
  • Real, high-dimensional data
  • Astronomy data has no commercial value
  • No privacy concerns. Free for sharing
  • Great for experimenting

5
Our contributions
  • A federation of astronomy archives The SkyQuery
  • Architecture, specification of federated queries,
    algorithms for optimizing federated queries
  • A caching framework
  • Consists of algorithms that minimize network
    traffic. Algorithms build upon economic
    principles
  • Estimating query result size
  • A model that exploits data mining techniques to
    estimate result size of federated queries

6
The SkyQuery An astronomy federation
  • Ability to query data across archives.
  • Typical Query Find all objects near the Pole
    star that were observed by archives measuring
    infra-red wavelengths but not by archives
    measuring radio wavelengths
  • This is a distributed probabilistic spatial join
    query as there is error associated with the
    recorded location of an object.
  • Federation is necessary to run this query as
    infra-red and radio archives are independently
    managed
  • Queries are to be optimized as the near
    criteria can potentially transfer thousands of
    objects over the Internet

7
SkyQuery Architecture
8
SkyQuery status
  • Uses Web services standard for federation
  • Currently a federation of 30 sites
  • Is expected to grow to 120 sites soon
  • Actively used by astronomers all over the world
  • The spin-off is OpenSkyQuery that provides
    intuitive interfaces and
    fully-developed language
    for cross-match specification
  • http//www.openskyquery.net (User site)
  • http//www.skyquery.net (Research and
    Development site)

9
Bottleneck in SkyQuery
  • In January 2004, a participating site of SkyQuery
    resulted in
  • Total number of queries 26,667
  • Total number of object accesses 45,154
  • Total bytes on the network 1200 GB ( More than 1
    TB!)
  • Total number of unique object accesses 3890
  • Total size of objects that were accessed 90 GB
  • SkyQuery does not respect good network
    citizenship
  • Locality in object accesses can be translated
    into effective caching methods

10
NaĂŻve caching is dangerous
  • Risks of caching
  • Caching may transfer unfiltered data
  • May transfer columns or tables across network
  • Caching moves data to programs/queries
  • Caching may reduce parallelism
  • May move queries from many servers to few caches
  • Benefits of a federation
  • Filters data prior to transfer
  • Selection and aggregation queries reduce data
    size
  • Moves the program (fewer bytes) to the data (more
    bytes)
  • Executes queries in parallel
  • Experiment scales with the number of federated
    sites

11
Bypass-Yield An effective caching method
  • Distinguish between queries
  • To bypass
  • To satisfy in cache
  • Bypass queries
  • Sent directly to database servers
  • Caching them does not save network traffic
  • Allows for benefits of federation
    moving-queries-not-data,
  • parallelism, and data filtering
  • Queries satisfied in cache
  • Save network traffic
  • Defines yield of a query to
  • Distinguish queries
  • Helps decide accessed data objects to load from
    servers to cache
  • Caching these data objects saves network traffic
  • Measures the network savings rate per unit of
    cache space used

12
Network flow in a Bypass-Yield Cache
DL load cost, DB bypass cost Minimize WAN cost
on (DL DB)
13
The BYC metric
  • Let si be the size and fi be the fetch cost of
    object oi
  • Let yi,j be the size of the query-result of the
    jth query that accesses oi
  • Let pi,j be the probability of occurrence of the
    jth query above
  • Byte-Yield Hit Rate (BYHR) for object oi defined
    as
  • BYHR has two components
  • Expected benefit of caching oi due to yield
  • Scaled by the cost/size ratio

14
BYC algorithms make economic decisions
  • Rate-Profile An optimistic algorithm
  • Predicts object access patterns from workload
  • Combines recency and frequency to measure
    probability of access for objects in and outside
    cache
  • Evicts objects with lowest rate of savings.
    Optimistically, loads objects whose rate of
    savings are increasing
  • OnlineBY A defensive algorithm
  • Makes no assumptions about workload
  • Loads an object into the cache when network
    traffic equal to the size of object has been
    incurred
  • Maintains the cache according to an web caching
    algorithm (GDS)
  • Is lg2(k) competitive, k is the ratio of size of
    cache, to the smallest object in cache
  • SpaceEffBY A randomized algorithm
  • Maintains no state about workload
  • Loads objects with probability equal to
    yield/size of object

15
Current researchYield estimation
  • Selectivity Estimation
  • Estimating the number of tuples that satisfy a
    query
  • Accurate estimates can help in
  • For choosing best plans in query optimization
  • As feedback to users
  • Resource consumption, load balancing
  • For BYC to achieve maximum network savings
  • Accurate selectivity estimation is crucial to BYC
    performance
  • Governs load and evict decisions

16
Selectivity estimation in BYC
  • Common methods to estimate selectivity
  • Random sampling, histogram, wavelet based
    compressions
  • A caching environment introduces new requirements
    on the selectivity estimation problem
  • Caches are close to clients not servers often
    have no access to databases
  • Cache is a constrained resource in terms of
    storage
  • Caching systems are adaptive and online

17
Template example
  • Assume there are large no of queries to the
    following query template
  • SELECT order date,required date FROM Order
  • WHERE freight OP
    VALUE
  • VALUE varies over the domain A,B. A,B are real
    values. Operator OP comes from the set lt,gt

18
Freight distribution
  • Cumulative frequency distribution for freight
    attribute for OP lt
  • Needs to known precisely to estimate selectivity
    of all queries

19
Template example
  • Assume there are large no of queries to the
    following query template
  • SELECT order date,required date FROM Order
  • WHERE freight OP
    VALUE
  • VALUE varies in the range M,N. M,N are real
    values. Operator OP comes
    from the set lt,gt
  • The feature vector consists of OP,VALUE
  • Consider feature vectors over all such queries.
    Assume
    selectivity is known for some queries (say 25),
    in addition, say the
    selectivity can be classified into 3 classes
    low, medium and high

20
Estimation example
  • Decision tree learns u, and v from query
    attributes
  • Use linear regression in each class to obtain
    actual yield values

21
Properties of our solution
  • Uses a query based approach
  • Current methods use a data-based approach
  • Detect templates in the workload
  • Most scientific experiments adhere to query
    templates
  • Very fast and compact
  • Use decision trees and regression over templates
  • Crude estimates for the bypass phase, accurate
    estimates for loading data
  • Able to process complex queries
  • Extracts feature vectors from templates

22
Future directions
  • To improve query execution performance in the
    cache
  • In the absence of indices and materialized views
  • Dynamic physical design which reorganizes itself
    according to workload
  • Open questions
  • Finding the best file organization
  • How to best represent queries in terms of
    database schema, without actually changing the
    physical organization?
  • How to update changes to schema when the workload
    changes?
  • Plan to work on the above problem till Spring of
    2006.

23
Acknowledgements
  • Inside
  • Comp. Science Randal Burns, Members of HSSL
  • Phy Astro Alex Szalay, The SDSS Team
  • JHU The GBO Committee
  • Outside
  • Microsoft Research Jim Gray
  • Univ. of Notre Dame Amitabh Chaudhary, Nitesh
    Chawla
  • Carnegie Mellon Anastassia Ailamaki, Stratos
    Papadomanolakis

24
(No Transcript)
25
Web services
  • SkyQuery uses Web services for interoperability
  • Use of Internet standards
  • Communication Protocol HTTP
  • Message Exchange Model
  • Simple Object Access Protocol(SOAP) with
    eXtensible Markup Language(XML) encoding
  • Describing, Defining and Discovering Web
    Services
  • Web Services Description Language (WSDL),
    Uniform Discovery, Description and Integration
    (UDDI) of Web Services

26
The object cross-match
  • Matching objects across archives, if they
    correspond to the same astronomical body.
  • Given A set of observations, one from each
    archive.
  • Question Do these observations refer to the same
    astronomical body?
  • Answer
  • Easy, if positions can be measured precisely
  • the X Match answer is deterministic.

27
Probabilistic Cross Match
  • Measured positions have an error due to
    instrument inaccuracies.
  • This error follows a Gaussian distribution
    (and is known for each archive)
  • Given A set of observations, one from each
    archive
  • Question What is the probability that these
    observations refer to the same astronomical body?
  • Algorithm ?
  • Probabilistic Cross Match

28
Computing the probability of a X Match
  • To compute the probability of a X Match, we
    assume a position for the real body.
  • The probability of a X Match varies with this
    considered position of the real body.
  • We consider that position for which probability
    is highest.
  • This position is a kind of weighted mean.
  • Probability at this position can be represented
    by a circular region about mean.
  • Our X Match algorithm reports those sets of
    observations
  • probability gt threshold

29
The probabilistic cross-match
30
Background Page Caching
  • Fixed size objects/pages, different fetch cost
  • Cache hit is equivalent to an entire page being
    accessed
  • Caches pages that have high fetch cost
  • Used in operating systems

P6
P6
Goal Minimize total fetch cost of pages
31
Background Object Caching
  • Variable size objects, different fetch cost
  • Cache hit is equivalent to accessing an entire
    object
  • Caches objects that have high cost/size ratio.
  • Used in proxy web caching

O8
08
Goal Minimize total fetch cost of objects
32
Bypass-Yield Caching
  • Objects are of variable size
  • For each object, cost of access varies
  • Cache hit implies fetching an entire object, part
    of an object, or an aggregate computed over an
    object

Client Requests
Q8
Goal Minimize total network cost (load and
bypass)
33
Rate-Profile
  • Estimating probability of object access to
    estimate savings
  • By characterizing workload
  • For objects in cache
  • Estimates probability as frequency-counts over
    cache lifetime
  • Computes a rate of network savings
  • Every in-cache access increase the rate in
    proportion to query yield
  • Every time step (query access) decays rate

34
Rate-Profile
  • For objects outside cache
  • Separates queries accesses for an object into
    clusters (episodes)
  • For each episode maintains frequency-counts over
    episode length
  • Maximum rate in each episode is the maximum
    benefit of caching An
    optimistic decision
  • Ages maximum benefits in each episode by weight
  • Computes the expected rate of network savings

35
Rate-Profile Algorithm
  • Discount object load cost from expected savings
  • Penalize object for the network cost incurred in
    loading an object into cache
  • Load Vs Bypass
  • Compares discounted expected rate of network
    savings of an object not in the
    cache with the rate of network savings
    of all objects in the cache
  • If greater, load. Else, bypass
  • Eviction
  • Evict object(s) with lowest current savings

36
OnlineBY
  • Formulates the bypass-yield caching problem as a
    combination of
  • Rent-or-buy problem, and
  • Object caching problem
  • The bypass Vs. load decision A defensive
    decision
  • Bypasses queries (rents) until total bypass
    traffic exceeds fetch cost
  • Then, loads (buys) an object
  • Eviction decisions and in-cache policies
  • Cache is maintained according to an algorithm for
    object
  • caching problem
  • This is O(lg2k) competitive
  • k size of cache / size of smallest object

37
Randomized Ski Rental
  • Randomized version of OnlineBY
  • Does not store the amount of traffic bypassed
  • Load decision A random decision
  • Load an object with probability equal to the
    ratio of the yield of the
    current query to the fetch cost
  • Requires no metadata for making load Vs bypass
    decision
  • Reduces total amount of meta data required
  • Performs well in practice ( no bounds yet )

38
Network Cost of a Trace (columns)
  • Rate-Profile dynamically chooses between no cache
    and inline cache
  • Rate-Profile compares well with an optimal static
    cache
  • Other algorithms show similar results

39
Algorithm Cost Analysis
  • Rate-Profile is better than OnlineBY in column
    caching
  • SpaceEffBY always lags behind
  • In all algorithms, column caching is better than
    table caching

40
Schema Locality
  • Schema locality implies reuse of schema elements
  • than data elements
  • Both tables and columns show heavy and long
    lasting
  • periods of reuse

41
Containment
  • It is imperative for workload to show both
    containment and
  • locality for query caching to be viable (Luo et
    al)
  • Determining exact query containment is NP
    complete
  • (Chandra and Merlin)
  • Few objects experience reuse

42
Architecture
  • Status
  • We are currently implementing the above system
    within BYC.
  • We currently have preliminary experiments prove
    the efficacy of the above model.

43
Results(1) Range Query
44
Result(2) User-defined Function Queries
45
Result(3) Index Queries
Write a Comment
User Comments (0)