Databases - PowerPoint PPT Presentation


Transcript and Presenter's Notes

Title: Databases


1
Databases & Information Systems: Projects &
Research Overview (DBIS.ucdavis.edu)
  • Michael Gertz
  • Bertram Ludäscher
  • Dept. of Computer Science
  • University of California, Davis
  • {gertz,ludaesch}@ucdavis.edu

2
Databases and Information Systems (DBIS)
  • DBIS.ucdavis.edu @ Dept. of Computer Science (CS)
  • DAKS.ucdavis.edu (Data & Knowledge Systems) @
    Genome Center (GC)
  • Faculty
  • Michael Gertz, Bertram Ludäscher
  • Researchers
  • Drs. Shawn Bowers (GC), Timothy McPhillips (GC),
    Norbert Podhorszki (CS)
  • Current Students
  • Omar Alonso, Michael Byrd, Conny Franke,
  • Quinn Hart, Carlos Rueda, Dave Thau, Alex Chen

3
Projects & Research Areas
  • Ongoing Collaborations
  • NSF/ITR GeoStreams
  • NSF/ITR GEON (Geosciences Network)
  • NSF/ITR SEEK (Science Environment for Ecological
    Knowledge)
  • DOE/SciDAC SDM (Scientific Data Management
    Center)
  • DOE/CPES (Center for Plasma Edge Simulation)
  • New Projects
  • NSF/CEOP COMET (Coast-to-Mountain Environmental
    Transect)
  • NSF/CEOP Kepler (Real-time Problem Solving
    Environment)
  • NSF/AToL pPOD (Processing Phylogenetic Data)
  • NSF/SEII ChIP-chip (Bioinformatics Workflows)
  • Research Areas
  • scientific data management, scientific workflows
  • streaming data, geospatial data, data security
  • knowledge representation, data integration

4
The Diversity & Unity of Science
Natural Sciences: Earth Sciences, Life Sciences, Physical Sciences
Observations, Measurements, Models, Simulations,
Analyses, Hypotheses → Understanding, Prediction, ...
in vivo, in vitro, in situ, in silico, ...
Data-, Knowledge-, and Workflow-Management is
central to most of them!
compute-intensive, data-intensive, metadata-intensive,
structurally & semantics-intensive
5
Towards 2020 Science Report (MSR)
http://research.microsoft.com/towards2020science
  • new development at the intersection of computer
    science and the sciences: a leap from the
    application of computing to support scientists to
    do science (i.e. computational science) to
    the integration of computer science concepts,
    tools and theorems into the very fabric of
    science. We believe this development
    represents the foundations of a new revolution in
    science
  • we believe computer science is poised to become
    as fundamental to biology as mathematics has
    become to physics
  • to understand cells and cellular systems
    requires viewing them as information processing
    systems, as evidenced by the fundamental
    similarity between molecular machines of the
    living cell and computational automata, and by
    the natural fit between computer process algebras
    and biological signalling and between
    computational logical circuits and regulatory
    systems in the cell
  • We highlight that an immediate and important
    challenge is that of end-to-end scientific data
    management, from data acquisition and data
    integration, to data treatment, provenance and
    persistence.
  • dramatic in its impact, will be the integration
    of new conceptual and technological tools from
    computer science into the sciences.

6
Types of Information Integration
  • Conventional data integration
  • schema-based
  • view-based
  • at the data-level
  • Spatial (co-)registration/overlay of different
    data
  • from 2D, 3D, 4D (x,y,z,t), (4+n)D → GIS
  • Extended DI approaches using ontologies
  • controlled vocabularies, metadata, annotations
  • Scientific Information Integration
  • data + process/application integration
  • Scientific Workflows
  • can include all the others, plus
  • statistics, data mining, visualization, ...

7
e-Science (UK) and Cyberinfrastructure (US)
  • "e-Science is about global collaboration in key
    areas of science and the next generation of
    computing infrastructure that will enable it."
  • Sir John Taylor, Director, Office of Science and
    Technology, UK
  • "Cyberinfrastructure is the coordinated aggregate
    of software, hardware and other technologies, as
    well as human expertise, required to support
    current and future discoveries in science and
    engineering. The challenge of Cyberinfrastructure
    is to integrate relevant and often disparate
    resources to provide a useful, usable, and
    enabling framework for research and discovery
    characterized by broad access and 'end-to-end'
    coordination."
  • Fran Berman, San Diego Supercomputer Center, UCSD

8
Integrated Cyberinfrastructure System: meeting
the needs of multiple communities (Source: Dr.
Deborah Crawford, Chair, NSF CyberInfrastructure
Working Group)
  • Applications
  • Environmental Science
  • High Energy Physics
  • Biomedical Informatics
  • Geoscience

Development Tools & Libraries
Education and Training
Discovery & Innovation
Grid Services & Middleware
Hardware
9
Scientific Workflows: Cyberinfrastructure
"UPPER-WARE"
10
Scientific Information Integration
  • Conventional Data Integration
  • syntactic & structural heterogeneities, schema
    mappings, schema matching, query rewriting
    (GAV, GLAV, ...), ...
  • dealing with fundamentally same kind of
    information
  • that happens to be represented differently,
    incompletely,
  • find the correct, best way to integrate
    different representations
  • Scientific Information Integration (SII)
  • has the traditional II as a small (but very
    important) piece
  • but often deals with combining fundamentally
    different information
  • not a single correct / best way to integrate
  • invokes scientific theories or models that cannot
    be inferred from the data, schema, ontologies
  • → joining of data and chaining of tools is in
    the scientist's head!
  • → scientific workflows can provide the end-to-end
    framework

11
Information Integration: A Tree of Life (AToL)
  • many AToL projects
  • → need to integrate the integrators' (biologists')
    data

12
Source: Junhyong Kim, Department of Biology, Penn
Center for Bioinformatics, U Penn
13
Inferring a phylogenetic tree from disparate data
[Figure: workflow of datasets and actors. Aligned DNA
sequences → maximum likelihood tree (DNA); discrete
morphological data → maximum parsimony tree; continuous
characters → maximum likelihood tree (continuous
characters); the trees are integrated into consensus
tree(s).]
14
Scientific Workflow
  • Capture how a scientist works with data and
    analytical tools
  • data access, transformation, analysis,
    visualization
  • possible worldview: dataflow-oriented (cf.
    signal-processing)
  • Scientific workflow (wf) benefits (compared with
    script-based approaches)
  • wf automation
  • wf component reuse
  • wf design, documentation
  • wf archival, sharing
  • built-in concurrency (task- and
    pipeline-parallelism)
  • built-in provenance support
  • distributed execution (Grid) support

15
Kepler: Ecological Niche Modeling Pipeline
  • Scientific Workflow paradigm
  • Reusable components (actors): a scientist's
    verbs/actions
  • Top-level workflows: conceptual representation
    of the science process, sentences in the
    scientist's language
  • Sub-workflows: increasing levels of detail
  • Separation of concerns
  • actors: what to do
  • parameters: configurable behavior
  • channels: dataflow, pipeline composition
  • directors: fix execution model, scheduling
  • semantic types: smart discovery, linking
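
The separation of concerns listed above can be illustrated with a toy sketch. This is not Kepler's API: the class names `Channel`, `Actor`, and `SequentialDirector`, and the three example steps, are hypothetical and chosen only to mirror the slide's vocabulary.

```python
from collections import deque
from typing import Callable, Deque, List, Optional


class Channel:
    """Dataflow edge: a FIFO queue carrying tokens between actors."""

    def __init__(self) -> None:
        self.tokens: Deque[object] = deque()


class Actor:
    """What to do: consumes one token and emits one token."""

    def __init__(self, name: str, func: Callable[[object], object]) -> None:
        self.name = name
        self.func = func
        self.inp = Channel()
        self.out: Optional[Channel] = None


class SequentialDirector:
    """Execution model (an assumption): fire actors in pipeline order until idle."""

    def run(self, actors: List[Actor], source: List[object]) -> List[object]:
        # Wire the pipeline: each actor's output is the next actor's input.
        for upstream, downstream in zip(actors, actors[1:]):
            upstream.out = downstream.inp
        sink = Channel()
        actors[-1].out = sink
        actors[0].inp.tokens.extend(source)

        fired = True
        while fired:
            fired = False
            for actor in actors:
                if actor.inp.tokens:
                    token = actor.inp.tokens.popleft()
                    actor.out.tokens.append(actor.func(token))
                    fired = True
        return list(sink.tokens)


# Parameters (here the scale factor 2.0) configure behavior without changing the actor class.
pipeline = [
    Actor("parse", float),
    Actor("scale", lambda x: x * 2.0),
    Actor("round", round),
]
result = SequentialDirector().run(pipeline, ["1.2", "3.6"])
```

Swapping in a different director class would change scheduling without touching the actors, which is the point of keeping the two roles separate.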

D Pennington, D Higgins, AT Peterson, M Jones, B
Ludaescher, S Bowers. Ecological Niche Modeling
using the Kepler Workflow System. Workflows for
e-Science, Springer.
16
Kepler and Sensor Networks
  • New NSF CEOP projects
  • Management and Analysis of Environmental
    Observatory Data using the Kepler Scientific
    Workflow System, NCEAS, SDSC, UC Davis, OSU,
    CENS (UCLA), OPeNDAP
  • standardize services for sensor networks, support
    multiple views, protocols
  • COMET: Coast-to-Mountain Environmental Transect,
    UC Davis, Bodega Marine Lab, Lake Tahoe Research
    Center
  • study how environmental factors affect ecosystems
    along an elevation gradient from coastal
    California to the summit of the Sierra Nevada

CEOP/COMET
CEOP/Kepler
17
Simple Kepler workflow using R (a statistics
package)
18
Plumbing with Style (Norbert Podhorszki, UC
Davis; Scott Klasky, ORNL)
Monitor
  • Plasma physics simulation on 2048 processors on
    Seaborg @ NERSC (LBL)
  • Gyrokinetic Toroidal Code (GTC) to study energy
    transport in fusion devices (plasma
    microturbulence)
  • Generating 800GB of data (3000 files, 6000
    timesteps, 267MB/timestep), 30 hour simulation
    run
  • Under workflow control
  • Monitor (watch) simulation progress (via wf
    actors)
  • Transfer from NERSC to ORNL concurrently with the
    simulation run
  • Convert each file to an HDF5 file
  • Archive files in 4GB chunks into HPSS
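
As a rough consistency check on the numbers above, the following sketch computes the total volume and a possible archive plan. It assumes that each of the 3000 files is about 267 MB (which is what makes the 800GB total add up), that all files have equal size, and that units are decimal; the helper `plan_chunks` is hypothetical and not part of the actual workflow.

```python
import math
from typing import List

FILES = 3000
FILE_MB = 267      # assumption: the 267 MB figure applies per file
CHUNK_MB = 4000    # 4 GB archive chunk, decimal units

total_gb = FILES * FILE_MB / 1000
files_per_chunk = CHUNK_MB // FILE_MB
chunk_count = math.ceil(FILES / files_per_chunk)


def plan_chunks(n_files: int, per_chunk: int) -> List[List[int]]:
    """Group consecutive file indices into archive chunks."""
    return [
        list(range(start, min(start + per_chunk, n_files)))
        for start in range(0, n_files, per_chunk)
    ]


plan = plan_chunks(FILES, files_per_chunk)
print(f"{total_gb:.0f} GB total, {files_per_chunk} files per chunk, {chunk_count} chunks")
```

Under these assumptions each chunk holds 14 files, so the archive step can start as soon as 14 converted files are available rather than waiting for the 30-hour run to finish.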

19
Some Research Challenges
  • Goal: helping scientists and workflow engineers
    in SII
  • to optimize the human resource
  • workflow modeling & design
  • software engineering, query optimization, type
    inference
  • rich provenance support
  • data models, computation models, query languages
  • use/exploit semantic information
  • logic-based reasoning
  • and to optimize system resources
  • resource scheduling, distributed execution,
  • cost models, scheduling, distributed computing

20
Behold the Beauty of Scientific Workflow Design
Author: Kristian Stevens, UC Davis
Source: Kristian Stevens, ECS-289F, 2006
21
Shimology Part 2: the ugly truth inside
Author: Kristian Stevens, UC Davis
22
Challenge: Modeling & Design Paradigms
  • Vanilla Process Network
  • Functional Programming Dataflow Network
  • XML Transformation Network
  • Collection-Oriented Modeling & Design framework
    (COMAD)

"The limitations of my modeling language are the
limitations of my design world." (BL)
23
CS Challenge: Hybrid (semantic & structural) Types
S Bowers, B Ludaescher. A Calculus for
Propagating Semantic Annotations through
Scientific Workflow Queries. ICDE Workshop on
Query Languages and Query Processing (QLQP),
LNCS, 2006.
24
CS Challenge: Propagating Semantic Types
  • Creating semantic annotations is difficult
  • Potentially large numbers of derived data
    products
  • Thousands of workflow components
  • Getting it right can be difficult for the
    domain scientist
  • → Annotation Propagation

[Figure: Forward Propagation and Backward Propagation,
each automatically deriving annotations]
25
CS Research Problems in Propagation
  • Computing Forward and Backward Propagation
  • Under different schema constraint languages
  • What can and cannot be computed
  • Approximate what cannot be computed
  • Algorithms for propagation through a single actor
  • Algorithms for propagation through an entire
    workflow

join:
  Biom1(ob, yr, seas, plt, spp, bm) :-
    Biom(ob, yr, seas, plt, spp, bm), Sscd(spp).
aggregation:
  Biom2(yr, plt, spp, z = sum(bm)) :-
    Biom1(ob, yr, seas, plt, spp, bm).
union:
  Biom3(yr, plt, spp, 1) :- Biom2(yr, plt, spp, bm), bm > 0.
  Biom3(yr, plt, spp, 0) :- Biom2(yr, plt, spp, bm), bm < 0.
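
To make the three rule shapes concrete, here is a minimal Python emulation over toy facts. All data values are invented for illustration, and the reading of the aggregate as a sum of `bm` grouped by `(yr, plt, spp)` is an assumption about the slide's rule.

```python
from collections import defaultdict

# Toy facts: Biom(ob, yr, seas, plt, spp, bm). All values are invented.
biom = [
    (1, 2005, "spring", "p1", "oak", 2.0),
    (2, 2005, "fall", "p1", "oak", 3.0),
    (3, 2005, "spring", "p1", "fern", -1.0),
    (4, 2005, "spring", "p2", "moss", 5.0),
]
sscd = {"oak", "fern"}  # Sscd(spp): the species of interest

# Join: keep Biom tuples whose species also appears in Sscd.
biom1 = [row for row in biom if row[4] in sscd]

# Aggregation: sum bm per (yr, plt, spp) group.
totals = defaultdict(float)
for ob, yr, seas, plt, spp, bm in biom1:
    totals[(yr, plt, spp)] += bm
biom2 = [(yr, plt, spp, z) for (yr, plt, spp), z in totals.items()]

# Union: two rules with the same head contribute to one relation.
positive = [(yr, plt, spp, 1) for yr, plt, spp, bm in biom2 if bm > 0]
negative = [(yr, plt, spp, 0) for yr, plt, spp, bm in biom2 if bm < 0]
biom3 = positive + negative
```

Each of these steps corresponds to one actor-level query, which is why propagation has to be worked out separately for joins, aggregations, and unions before it can be chained through a whole workflow.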
26
Example queries and annotations
[Figure: Actor A reads R1(o, x, y, t, v) and R2(u, p)
and writes S(o, x, y, v, u, p)]
q = π_{o,x,y,v}(σ_{t=c}(R1)) ⋈ σ_{u=d}(R2)
  • Forward propagation
  • α1 = R1(o, x, y, t, v) → Observation(o) ∧
    hasVal(o, v)
  • α2 = R2(u, p) → Site(u) ∧ Species(p) ∧
    observedIn(p, u)
  • α′ = α(q⁻¹) where α = α1 ∧ α2
  • Backward propagation
  • α′ = S(o, x, y, v, u, p) → Observation(o) ∧
    hasVal(o, v) ∧ Species(p)
  • α = α′(q)
27
Results on S-T Finite Dependencies (Fagin et al)
  • Full dependencies Lfull (e.g., ?/??, ?, ?/??,
    ?): ∀x φ(x) → ψ(x)
  • Embedded dependencies Lem (e.g., ??): ∀x φ(x) →
    ∃y ψ(x, y)
  • Skolemized dependencies LSko:
  • ∃f ∀x φ(x), χ(x) → ψ(x)
  • Composition (we want L?(Lq?) ? L? )
  • Lfull(Lfull) ? Lfull Lfull(Lem) ? Lfull
  • Lem(Lfull) ? Lem Lem(Lem) ? Lem
  • LSko(LSko) ? LSko
  • In general, annotations take the form of
    embedded (or Skolemized) s-t dependencies

28
A Scientific Publication (the final PROVENANCE
frontier)
Title (Statement, Theorem)
Abstract (1st-Level Expansion)
Main Text (2nd-Level Expansion)
Nature 443, 167-172 (14 September 2006),
doi:10.1038/nature05113. Received 27 June 2006;
Accepted 25 July 2006; Published online 16 August
2006
some metadata
29
More Evidence
data reference
type of evidence
tool reference
trust me on this one
  • provenance/data lineage: show the history and
    evidence
  • related to proof trees
  • unlike with scripts, an SWF system can keep track
    of what happened
  • In the future: deposit your data & workflows in a
    repository
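
The idea that a workflow system records what happened, where a plain script does not, can be sketched with a small tracing wrapper. The decorator `traced`, the log format, and the two example steps are hypothetical; a real scientific workflow system would record far richer lineage.

```python
import functools
from typing import Any, Callable, Dict, List

provenance_log: List[Dict[str, Any]] = []


def traced(step: Callable[..., Any]) -> Callable[..., Any]:
    """Record each step's name, inputs, and output as it runs."""

    @functools.wraps(step)
    def wrapper(*args: Any) -> Any:
        result = step(*args)
        provenance_log.append(
            {"step": step.__name__, "inputs": args, "output": result}
        )
        return result

    return wrapper


@traced
def normalize(values):
    top = max(values)
    return [v / top for v in values]


@traced
def mean(values):
    return sum(values) / len(values)


answer = mean(normalize([2.0, 4.0]))
```

Reading `provenance_log` backwards from the final answer gives the chain of evidence for it, which is the sense in which lineage resembles a proof tree.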

30
SUMMARY (Part 1)
Data Integration
Knowledge Representation
Process Integration
31
Geospatial Data (everything happens somewhere,
sometime)
  • Spatial data
  • Data with a spatial location in a given
    reference frame. A reference frame is the
    viewer's perspective for describing physical
    quantities (e.g., position, velocity); a
    coordinate system is a way to express those
    quantities within that perspective.
  • Geospatial data
  • Data whose underlying reference frame is the
    Earth's surface
  • concerns phenomena above, on, and below the
    Earth's surface.
  • Sources of (geo)spatial data
  • Remotely-sensed data
  • Aerial photography
  • Digitized maps
  • Field surveys
  • Sensor networks
  • Simulations

32
Geospatial Data (cont.)
  • About 80% of all data have a spatial component!
  • Interest in geospatial data and Geographic
    Information Systems (GIS) is increasing
    dramatically, well beyond traditional GIS uses.
  • Further applications include
  • Economic development
  • Geo-marketing
  • Mobile services
  • Utility management
  • Disaster management
  • Transportation networks
  • Biodiversity
  • Climatology
  • Earthquake monitoring
  • ...

33
Remotely-Sensed Data

34
Remotely-Sensed Data
  • Several hundred operational satellites are up
    there, streaming dozens of terabytes of
    geospatial data down to Earth per day.
  • Example: Geostationary Operational Environmental
    Satellite (GOES)
  • Data is obtained in row-scan order, from north
    to south and east to west
  • Data collection is based on a routine schedule
    for 11 sectors
  • A 3,000x3,000 km region is scanned in about
    3 minutes
  • Downlink rate is about 2.1Mbps (approximately
    21GB/day)
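
The daily volume follows from the downlink rate by simple arithmetic. The sketch below assumes 1 Mbps means 10^6 bits per second and that the link runs continuously for 24 hours.

```python
RATE_MBPS = 2.1                      # downlink rate from the slide
SECONDS_PER_DAY = 24 * 60 * 60

bytes_per_day = RATE_MBPS * 1_000_000 * SECONDS_PER_DAY / 8
gb_decimal = bytes_per_day / 1e9     # 10^9 bytes per GB
gib_binary = bytes_per_day / 2**30   # 2^30 bytes per GiB

print(f"{gb_decimal:.2f} GB/day (decimal), {gib_binary:.2f} GiB/day (binary)")
```

The slide's figure of roughly 21GB/day matches the binary-unit result; in decimal units the same rate comes to a little under 23 GB/day.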

135° W longitude
75° W longitude
Typical approach to data processing: file-based,
i.e., store raster image data and derive some
standard data products, or let users upload data
for further processing. Only 6-12% of the data
is actually used!
35
The GeoStreams Project

Goal: Process (continuous) user queries over
streams of raster images
36
GeoStreams: Research Problems
  • Objectives
  • (1) Develop a stream processing framework for
    remotely-sensed imagery (RSI)
  • (2) Support complex (non-standard DBMS)
    operations on various forms of streams of RSI
  • (3) Exploit techniques and concepts developed
    for traditional stream systems
  • (4) Use real data sets/streams and data product
    requirements
  • Problems
  • How does one model streams of raster image data?
  • Image algebra, point sets (pixels), value
    sets (pixel values), ...
  • What are operations on streams of raster image
    data?
  • Standard operations
  • Spatial, temporal, and value selections (e.g.,
    give me the temperature values for the query box
    over Davis every day at 1pm)
  • Value transforms (e.g., contrast/Gaussian
    stretch, histogram equalization)
  • Spatial transforms (e.g., map re-projection,
    zooming)
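
The standard operations above can be sketched as composable generators over a stream of timestamped frames. This is an illustrative sketch, not the GeoStreams implementation: the `Frame` type, the function names, and the toy data are assumptions, and pixel indices stand in for real geographic coordinates.

```python
from datetime import datetime
from typing import Callable, Iterable, Iterator, List, Tuple

Frame = Tuple[datetime, List[List[float]]]  # (timestamp, raster rows)


def temporal_select(stream: Iterable[Frame], hour: int) -> Iterator[Frame]:
    """Temporal selection: keep frames captured during the given hour."""
    for ts, img in stream:
        if ts.hour == hour:
            yield ts, img


def spatial_select(stream: Iterable[Frame], r0: int, r1: int,
                   c0: int, c1: int) -> Iterator[Frame]:
    """Spatial selection: restrict each frame to a query box."""
    for ts, img in stream:
        yield ts, [row[c0:c1] for row in img[r0:r1]]


def value_transform(stream: Iterable[Frame],
                    func: Callable[[float], float]) -> Iterator[Frame]:
    """Value transform: apply a pointwise function to every pixel."""
    for ts, img in stream:
        yield ts, [[func(v) for v in row] for row in img]


# Toy stream: one 4x4 frame per hour at 12:00, 13:00, and 14:00.
frames = [
    (datetime(2006, 10, 1, h), [[float(h + r + c) for c in range(4)] for r in range(4)])
    for h in (12, 13, 14)
]

# "Values for the query box, at 1pm", followed by a pointwise rescaling.
query = value_transform(
    spatial_select(temporal_select(frames, 13), 1, 3, 1, 3),
    lambda v: v * 0.5,
)
result = list(query)
```

Because each operator is a generator, frames pass through one at a time and nothing is buffered, which is the property a non-blocking stream operator needs.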

37
GeoStreams: Research Problems (cont.)
  • How to compose streams G1 and G2?
  • G1 ⊙ G2 = {(x, G1(x) ⊙ G2(x)) | x ∈ X}, ⊙ ∈ {+, −,
    ×, /}
  • How to build complex queries? Example:
  • G1 = NIR (near-infrared), G2 = VIS (visible)
  • G = ((f_val ∘ ((G1 − G2) / (G2 + G1))) ∘ f_UTM)|R
  • What about spatio-temporal aggregates?
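
Pointwise composition of two images can be sketched with dictionaries keyed by pixel location. The function name `compose` and the sample values are assumptions; the value transform and the UTM re-projection from the example query are left out, so this covers only the band arithmetic (G1 − G2) / (G2 + G1).

```python
import operator
from typing import Callable, Dict, Tuple

Point = Tuple[int, int]
Image = Dict[Point, float]


def compose(g1: Image, g2: Image, op: Callable[[float, float], float]) -> Image:
    """Apply op pointwise over the locations present in both images."""
    return {x: op(g1[x], g2[x]) for x in g1.keys() & g2.keys()}


# Toy single-frame bands; values are invented.
nir: Image = {(0, 0): 0.8, (0, 1): 0.6}
vis: Image = {(0, 0): 0.2, (0, 1): 0.6}

difference = compose(nir, vis, operator.sub)
total = compose(vis, nir, operator.add)
index = compose(difference, total, operator.truediv)
```

A real operator would also need a policy for pixels where the denominator is zero, which this sketch does not handle.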

38
GeoStreams: Research Problems (cont.)
  • How to implement individual operators (blocking
    vs. non-blocking)?
  • What other operators are of practical relevance?
  • For example, change detection, combination of
    stream data with persistent (static) geospatial
    data, ...
  • Optimization issues
  • Minimal algebraic query optimization framework
    (e.g., push down of spatial selection over
    spatial transform)
  • Multiple-query optimization, i.e., how to share
    data and operators among multiple (continuous)
    queries?
  • How to design a scalable distributed query
    processing framework?
  • cluster computing → operator scheduling
    techniques
  • Web services → each service provides some data
    products
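
The push-down of a spatial selection over a spatial transform can be illustrated with integer zooming. This sketch assumes zooming by pixel replication with an integer factor and rectangular query boxes in output pixel coordinates; the function names are hypothetical.

```python
from typing import List

Image = List[List[int]]


def zoom(img: Image, k: int) -> Image:
    """Spatial transform: enlarge by an integer factor k (pixel replication)."""
    rows, cols = len(img), len(img[0])
    return [[img[r // k][c // k] for c in range(cols * k)] for r in range(rows * k)]


def crop(img: Image, r0: int, r1: int, c0: int, c1: int) -> Image:
    """Spatial selection: keep rows r0..r1-1 and columns c0..c1-1."""
    return [row[c0:c1] for row in img[r0:r1]]


def select_after_zoom(img: Image, k: int, r0: int, r1: int, c0: int, c1: int) -> Image:
    """Naive plan: transform the whole image, then select."""
    return crop(zoom(img, k), r0, r1, c0, c1)


def select_pushed_down(img: Image, k: int, r0: int, r1: int, c0: int, c1: int) -> Image:
    """Optimized plan: select the contributing source pixels first, then transform."""
    ir0, ic0 = r0 // k, c0 // k
    ir1, ic1 = -(-r1 // k), -(-c1 // k)  # ceiling division
    small = zoom(crop(img, ir0, ir1, ic0, ic1), k)
    # Trim the partial source pixels at the box edges.
    return crop(small, r0 - ir0 * k, r1 - ir0 * k, c0 - ic0 * k, c1 - ic0 * k)


image = [[r * 4 + c for c in range(4)] for r in range(4)]
plan_a = select_after_zoom(image, 2, 3, 7, 2, 6)
plan_b = select_pushed_down(image, 2, 3, 7, 2, 6)
```

Both plans return the same pixels, but the second one zooms a 3x2 sub-image instead of the full 4x4 frame; on large satellite frames with small query boxes that difference dominates the cost.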

For more information about this NSF ITR-funded
project, visit http://geostreams.ucdavis.edu
39
Beyond Streaming Data: The COMET Project
  • COMET: Coast-to-Mountain Environmental Transect
  • Funded in October 2006 through the NSF
    Cyberinfrastructure for Earth Observatories
    Program (CEOP) at a level of $2.1M over three
    years.
  • Participants
  • M. Gertz (PI), B. Ludäscher (Computer Science)
  • G. Schladow (Director, Tahoe Environmental
    Research Center)
  • S. Williams (Director, Bodega Bay Marine Lab), I.
    Faloona, J. Largier
  • S. Ustin (Director, CalSpace, Cstars), Q. Hart
  • K.T. Paw U, Shu-Hua Chen (Atmospheric Sciences &
    Climatology)
  • Objective

Develop a practical cyberinfrastructure (CI)
prototype to facilitate the study of the way in
which multiple environmental factors, including
climatic variability, affect major ecosystems
along an elevation gradient from coastal
California to the summit of the Sierra Nevada.
This CI will be based around the integration of
access to distributed and varied data collections
and sensor data streams, registration of data,
models and analysis tools, semantically-aware
data query mechanisms, and an orchestration
system for advanced scientific workflows. Access
to this CI will be provided through a Web-based
portal.
40
Beyond Streaming Data: The COMET Project
  • What transect, what data?

Ecological data, NEXRAD (Doppler radar), NOAA
AVHRR, CalTrans sensor data, ...
41
The COMET Project: Vision
CISAME: CyberInfrastructure System for Data
Assimilation and Model Management for the
Environment

42
The COMET Project: Research Problems
  • There are many non-CS questions regarding
    climate variability, impact on ecosystems, El
    Niño, coastal marine communities, changes in
    upwelling strength, carbon flux, particle fluxes
    and depositions, ...
  • Scientific data management issues
  • Make all data and data products readily
    accessible to users and applications at the
    necessary spatial and temporal resolution
  • Provide semantic data registration for streaming
    sensor data, satellite imagery, various forms of
    geospatial data, and so on
  • Provide data and standard data products in
    different data formats
  • Spatially and temporally synchronize data
  • Fully integrate complex climate models and
    applications and make them readily accessible to
    many users/scientists through the Portal
  • Requires modeling of complex models as scientific
    workflows that ingest and produce diverse types
    of data
  • Such workflows need to be optimized and
    coordinated (multiple-workflow optimization)

43
The COMET Project: Research Problems
  • A simple example: the Weather Research &
    Forecasting Model

44
Security and Privacy of Geospatial Data
  • Project recently started with UT Dallas and
    Purdue University
  • Objectives
  • Improve the security of geospatial data
    repositories that are managed by different state,
    county, and municipal organizations and accessed
    through GIS and applications.
  • Introduce and advance concepts, techniques, and
    architectures for security models and policies
    for geospatial data, including topographic/thematic
    maps and aerial/satellite imagery.
  • Modular and compositional security policies as a
    comprehensive framework to model and reason about
    multi-granular, context-driven, dynamic, and
    location-aware security and privacy requirements
    of GIS repositories and applications.
  • Trust and integrity management models and
    techniques for geospatial data

45
S&P of Geospatial Data (cont.)
  • What is the problem?
  • Numerous government, county, and municipal
    organizations manage thematic and topographical
    maps in support of disaster and emergency
    management, homeland security, and environmental
    crises; they provide geospatial data for various
    features of U.S. locations and facilities at very
    fine-grained levels of detail
  • GIS repositories and GIS Web services have no
    mechanisms for securing geospatial data
  • Overlay of GIS layers may reveal sensitive
    information (inference problem for geospatial
    data)
  • What is an appropriate policy specification
    framework that combines field-based and
    feature-based data, active policies (e.g.,
    obfuscation of objects), event-based policies,
    context-based policies (e.g., location
    awareness)?
  • How to combine developments with OGC standards
    and technologies?

46
S&P of Geospatial Data (cont.)
  • Framework and architecture

DATA PRESENTATION LAYER: Traditional GIS; Open
Geospatial Consortium Framework (Core Application
Schemas, Geospatial Features, Geography Markup
Language, Metadata); GIS Web Services; Wrapper
SECURITY LAYER: Trust & Privacy Management; Policy
Specifications; Access Control Mechanisms; Authentic
Data Publication; Policy Reasoning Engine
DATA INTEROPERATION & ACCESS LAYER: GIS
Interoperation Services; GIS Data Repository Access;
GIS Data Repositories
47
S&P of Geospatial Data (cont.)
  • Question: What are privacy threats in the
    context of geospatial data, in particular
    satellite imagery?

48
Teaching
  • We offer several classes related to our research
    areas and project activities
  • ECS 165A (Database Systems)
  • ECS 165B (Advanced Database Systems)
  • ECS 166 (Scientific Data Management)
  • ECS 265 (Distributed Database Systems)
  • ECS 289F (Spatial Databases)
  • ECS 289F (Topics in Scientific Data Management)
  • ECS 289A/F (Logics and Knowledge Representation)
  • DBIS Seminar (Fridays 1-2:30pm)

49
Q & A
DBIS.ucdavis.edu
DAKS.ucdavis.edu
kepler-project.org