1
Active Data Repository, DataCutter
  • Joel Saltz
  • Biomedical Informatics Department
  • The Ohio State University

University of Maryland: Alan Sussman, Charlie Chang, Mike Beynon, Henrique Andrade, Christian Hansen
Ohio State University: Tahsin Kurc, Umit Catalyurek, Gagan Agrawal, Renato Ferreira
2
Data exploration and analysis
  • Identify trends or interesting phenomena
  • Spatial/multi-dimensional, multi-scale, multi-resolution datasets
  • Applications select portions of one or more datasets
  • Selection of data subsets makes use of a spatial index (e.g., R-tree, quad-tree, etc.)
  • Data is not used as-is; processing is generally done to generate a data product, often reducing data volume
  • Access to data by multiple users in a
    collaborative environment

3
Data Query and Processing Scenario
[Figure: source data elements are read and passed through a reduction function into intermediate data elements (accumulator elements), which are finalized into result data elements]
4
Some Specifics
  DU ← Select(Output, R);  DI ← Select(Input, R)
  for (ue in DU) {
      read ue;  ae ← Initialize(ue);  A ← A ∪ {ae}
  }
  for (ie in DI) {
      read ie;  SA ← Map(ie, A)
      for (ae in SA) { ae ← Aggregate(ie, ae) }
  }
  for (ae in A) {
      ue ← Finalize(ae);  write ue
  }
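A minimal, runnable C++ instantiation of this loop may make the roles concrete. The Initialize/Map/Aggregate/Finalize structure mirrors the pseudocode above; the element types, the max aggregation, and the hard-coded data are illustrative assumptions, not ADR code.

#include <cstdio>
#include <vector>

// Illustrative element types (not ADR's): an input element carries a value
// and the output cell it maps to; an accumulator element holds a running max.
struct InputElem { int out_cell; double value; };
struct AccumElem { double max_value; };

int main() {
    // DI: input elements selected by the range query (hard-coded here).
    std::vector<InputElem> DI = {{0, 1.5}, {1, 2.0}, {0, 3.25}, {1, 0.5}};
    const int num_output_elems = 2;   // |DU|

    // Initialize: one accumulator element per selected output element.
    std::vector<AccumElem> A(num_output_elems, AccumElem{-1e300});

    // Local reduction: Map each input element to the accumulator element it
    // contributes to, then Aggregate with a commutative/associative op (max).
    for (const InputElem& ie : DI) {
        AccumElem& ae = A[ie.out_cell];                        // Map(ie, A)
        if (ie.value > ae.max_value) ae.max_value = ie.value;  // Aggregate(ie, ae)
    }

    // Finalize: turn each accumulator element into an output element and "write" it.
    for (int u = 0; u < num_output_elems; ++u)
        std::printf("output element %d -> %g\n", u, A[u].max_value);
    return 0;
}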

5
Active Data Repository
6
Active Data Repository (ADR)
  • C++ class library and runtime system
  • Storage, retrieval, and manipulation of multi-dimensional datasets
  • Runtime system support for common operations
  • data retrieval, memory management, and scheduling of processing across a parallel machine
  • Data declustering/clustering, indexing
  • Buffer management and scheduling of operations

7
ADR System Architecture
  • Front-end: the interface between clients and the back-end. Provides support for clients
  • to connect to ADR,
  • to query ADR for information about already registered datasets and user-defined methods,
  • to create ADR queries and submit them.
  • Back-end: data storage, retrieval, and processing
  • Distributed-memory parallel machine or cluster of workstations, with multiple disks attached to each node
  • Customizable services for application-specific processing
  • Internal services for data retrieval and resource management

8
ADR ARCHITECTURE
[Architecture diagram: a visualization client submits a query (grid id, time steps, iso-surface value, viewing parameters) to the application front end (Query Interface Service, Query Submission Service) and receives a 2D image back. A 3D volume is stored in ADR; the ADR back end executes the query through its Query Execution and Query Planning Services plus the customizable Dataset, Attribute Space, Data Aggregation, and Indexing Services, customized here using VTK's iso-surface rendering functions.]
9
Datasets in ADR
  • ADR expects the input datasets to be partitioned into data chunks.
  • A data chunk, the unit of I/O and communication,
  • contains a subset of the input data values (and associated points in the input space)
  • is associated with a minimum bounding rectangle (MBR) that covers all the points in the chunk.
  • Data chunks are distributed across all the disks in the system.
  • An index is built on the minimum bounding rectangles of the chunks (see the sketch below).
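A minimal C++ sketch of the per-chunk metadata described above and the MBR intersection test an index applies during a range query. Field names and the brute-force search are assumptions for illustration, not ADR's actual classes.

#include <vector>

struct MBR {
    double lo[3], hi[3];            // axis-aligned bounds in a 3-D input space
};

struct ChunkInfo {
    int  disk_id;                   // which disk the chunk lives on
    long offset, length;            // where the chunk sits in its file
    MBR  mbr;                       // covers every point stored in the chunk
};

// True if two rectangles overlap; this is the test an R-tree applies at each
// node when answering a range query.
bool intersects(const MBR& a, const MBR& b) {
    for (int d = 0; d < 3; ++d)
        if (a.hi[d] < b.lo[d] || b.hi[d] < a.lo[d]) return false;
    return true;
}

// Brute-force stand-in for the index: return the chunks whose MBRs intersect
// the query region, i.e. the only chunks that need to be read from disk.
std::vector<const ChunkInfo*> select_chunks(const std::vector<ChunkInfo>& all,
                                            const MBR& query) {
    std::vector<const ChunkInfo*> hits;
    for (const ChunkInfo& c : all)
        if (intersects(c.mbr, query)) hits.push_back(&c);
    return hits;
}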

10
Loading Datasets into ADR
  • A user
  • should partition the dataset into data chunks
  • can distribute the chunks across the disks and provide an index for accessing them
  • ADR, given data chunks and associated minimum bounding rectangles in a set of files,
  • can distribute data chunks across the disks using a Hilbert curve-based declustering algorithm (see the sketch below),
  • can create an R-tree based index on the dataset.
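The sketch below illustrates the idea of Hilbert curve-based declustering for a 2-D input space: order chunks by the Hilbert index of their MBR centroids, then deal them out to disks round-robin so that spatially nearby chunks land on different disks. The grid size, chunk list, and disk count are assumptions; ADR's actual declustering service is not shown.

#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

static void rot(int n, int& x, int& y, int rx, int ry) {
    if (ry == 0) {
        if (rx == 1) { x = n - 1 - x; y = n - 1 - y; }
        std::swap(x, y);
    }
}

// Map a point on an n x n grid (n a power of two) to its Hilbert curve index.
static long hilbert_index(int n, int x, int y) {
    long d = 0;
    for (int s = n / 2; s > 0; s /= 2) {
        int rx = (x & s) ? 1 : 0;
        int ry = (y & s) ? 1 : 0;
        d += (long)s * s * ((3 * rx) ^ ry);
        rot(n, x, y, rx, ry);
    }
    return d;
}

struct Chunk { int id; int cx, cy; };   // cx, cy: MBR centroid on the grid

int main() {
    const int grid = 16, num_disks = 4;
    std::vector<Chunk> chunks = {{0, 1, 1}, {1, 2, 1}, {2, 14, 3},
                                 {3, 7, 9}, {4, 8, 8}, {5, 0, 15}};

    // Sort chunks along the Hilbert curve ...
    std::sort(chunks.begin(), chunks.end(), [&](const Chunk& a, const Chunk& b) {
        return hilbert_index(grid, a.cx, a.cy) < hilbert_index(grid, b.cx, b.cy);
    });
    // ... and assign them to disks round-robin in that order.
    for (size_t i = 0; i < chunks.size(); ++i)
        std::printf("chunk %d -> disk %zu\n", chunks[i].id, i % num_disks);
    return 0;
}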

11
Loading Datasets into ADR
  • Partition dataset into data chunks -- each chunk
    contains a set of data elements
  • Each chunk is associated with a bounding box
  • ADR Data Loading Service
  • Distributes chunks across the disks in the system
  • Constructs an R-tree index using bounding boxes
    of the data chunks

[Figure: data chunks distributed across a disk farm]
12
ADR Back-end Processing
[Figure: query processing on each back-end processor proceeds through four phases -- Initialization, Local Reduction, Global Combine, and Output Handling -- with results returned to the client]
13
DataCutter
14
DataCutter
  • Purpose: specialized components for processing data
  • Based on Active Disks research [Acharya, Uysal, Saltz, ASPLOS'98]
  • Filters: logical units of computation
  • read data from input buffers, compute/filter/aggregate data, then write data to output buffers
  • a filter can only carry out subsetting and commutative/associative data aggregations
  • subsetting is implemented by (among other things) multilevel hierarchical indexes based on the R-tree indexing method
  • filter computations are pipelined

15
DataCutter
  • Streams: how filters communicate
  • unidirectional buffered pipes
  • the application developer specifies connectivity between filters
  • Copies: because of the way filters are defined, each filter's computation can be carried out by a system-defined number of transparent copies
  • Support for spatial queries and indexing: inherits ADR's R-tree indexing and spatial query support (specialized filters)
  • Support for data aggregation: filter groups carry out ADR-style data aggregation

16
DataCutter/Globus/NWS
  • DataCutter filters subset, filter, and aggregate data
  • Network Weather Service (NWS) provides network and processor performance information, used in placing filter copies and in determining which data replicas to use
  • Globus is used to run filter code, track the location of data replicas, maintain updated NWS performance information, and record the location of active filter copies.

17
SRB/DataCutter
  • Support for Data Filtering and Range Queries
  • Creation of indices over data sets
  • Subsetting of data sets
  • Search for files or portions of files that
    intersect a given range query
  • Filter operations on portions of files (data
    segments) before returning them to the client (to
    perform filtering or aggregation to reduce data
    volume)

18
Filter Framework
  • C++ language binding
  • Each filter is a logical entity; many copies of each filter can be instantiated
  • One thread for each instantiated filter
  • Supports pipelined computations
  • Heuristics used for placement; now uses NWS

class MyFilter : public AS_Filter_Base {
public:
    int init(int argc, char *argv[]);
    int process(stream_t st);
    int finalize(void);
};
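The class above only declares the filter hooks. The sketch below shows the general idea behind the runtime -- one thread per instantiated filter, bounded streams between filters, pipelined execution -- using plain C++ threads; the Stream class, buffer size, and the three toy filters are assumptions, not DataCutter's implementation.

#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

// A tiny bounded stream between two filters; std::nullopt marks end-of-stream.
class Stream {
    std::queue<std::optional<int>> q_;
    std::mutex m_;
    std::condition_variable cv_;
    const size_t cap_ = 4;
public:
    void write(std::optional<int> v) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return q_.size() < cap_; });  // block when buffer is full
        q_.push(v);
        cv_.notify_all();
    }
    std::optional<int> read() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty(); });        // block when buffer is empty
        auto v = q_.front(); q_.pop();
        cv_.notify_all();
        return v;
    }
};

int main() {
    Stream s1, s2;

    // Producer filter: emits "data buffers" (here just integers).
    std::thread produce([&] {
        for (int i = 0; i < 8; ++i) s1.write(i);
        s1.write(std::nullopt);
    });
    // Transform filter: reads from its input stream, writes to its output stream.
    std::thread transform([&] {
        while (auto v = s1.read()) s2.write(*v * *v);
        s2.write(std::nullopt);
    });
    // Consumer filter: aggregates with a commutative/associative operation (sum).
    std::thread consume([&] {
        long sum = 0;
        while (auto v = s2.read()) sum += *v;
        std::printf("sum of squares = %ld\n", sum);
    });

    produce.join(); transform.join(); consume.join();
    return 0;
}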
19
Example Applications
  • Virtual Microscope
  • Iso-surface Rendering

20
Filter Connectivity / Placement
[filter.A]
outs = stream1 stream3
[filter.B]
ins = stream1
outs = stream2
[filter.C]
ins = stream2 stream3
  • Manual placement specification via file
  • Off-line calculation of placement via a sample run (Beynon PhD Thesis)
  • On-line placement using NWS/Globus

[placement]
A = host1.cs.umd.edu
B = host2.cs.umd.edu
C = host3.cs.umd.edu
21
Group Instances (Batches)
Work is issued in batches of instances until all are complete, matching instances to the environment (CPU capacity).
22
Transparent Copies
[Figure: transparent copies -- work elements E0 ... EK and EK+1 ... EN are distributed across filter copies on host1 and host2, grouped into Clusters 1, 2, and 3]
23
Exploration and Visualization of Oil Reservoir
Simulation Data
Mary Wheeler, Steven Bryant, Malgorzata Peszynska, Ryan Martino
Center for Subsurface Modeling, University of Texas at Austin
http://www.ticam.utexas.edu/CSM

Joel Saltz, Umit Catalyurek, Tahsin Kurc
Biomedical Informatics Department, The Ohio State University
http://medicine.osu.edu/informatics

Alan Sussman, Michael Beynon
Department of Computer Science, University of Maryland
http://www.cs.umd.edu/projects/adr

Don Stredney, Dennis Sessanna
Interface Laboratory, The Ohio Supercomputer Center
http://www.osc.edu
24
System Architecture
25
Dataset
  • Data size 1.5TB
  • 207 simulations, selected from
  • 18 Geostatistics Models (GM)
  • 10 Realizations of each model (R)
  • 4 Well Patterns (WP)
  • Each simulation is 6.9GB
  • 10,000 time steps
  • 9,000 grid elements
  • 8 scalars + 3 vectors = 17 variables
  • Stored on UMD Storage Cluster
  • 9TB of disk on 50 nodes (PIII-650, 128MB, switched Ethernet)

26
Economic Model
  • Economic assessment
  • Net Present Value (NPV)
  • Return on Investment (ROI)
  • Sweep Efficiency (SE)
  • Queries
  • return the R-WP for a given GM that has NPV > avg
  • return, for all GMs, the R-WP that has the max NPV
  • ...
  • Economic model uses
  • well rates (time series data)
  • cost and price (e.g., oil) parameters
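For reference, net present value discounts a per-time-step cash flow by a rate r: NPV = Σ_t CF_t / (1 + r)^t. The sketch below computes this over a well-rate time series; the linear cash-flow model, prices, and discount rate are assumptions, not the CSM economic model.

#include <cstdio>
#include <vector>

// Hypothetical per-time-step cash flow: revenue from produced oil minus a
// fixed operating cost. The real economic model is not reproduced here.
double cash_flow(double oil_rate, double oil_price, double op_cost) {
    return oil_rate * oil_price - op_cost;
}

// NPV = sum over time steps t of cash_flow_t / (1 + r)^t
double npv(const std::vector<double>& oil_rates,
           double oil_price, double op_cost, double r) {
    double value = 0.0, discount = 1.0;
    for (double rate : oil_rates) {
        value += cash_flow(rate, oil_price, op_cost) / discount;
        discount *= 1.0 + r;
    }
    return value;
}

int main() {
    std::vector<double> rates = {120, 110, 95, 80, 60};   // barrels per time step
    std::printf("NPV = %.2f\n", npv(rates, 70.0, 2000.0, 0.05));
    return 0;
}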

27
Results
  • Economic model shows range of winners and losers
  • We want to also understand the physics behind
    this
  • An example is looking at bypassed oil
  • Turns out to be strongly correlated to economics

28
Representative Realization
  • Select the simulation/realization that has values closest to user-defined criteria.
  • Analyze that simulation, or use its initial conditions for further simulation studies.
  • Find the dataset, among a set of datasets, whose values of oil concentration, water pressure, and gas pressure are closest to the averages of these values across the set of datasets.
  • User selects a set of datasets (D) and a set of time steps (T1, T2, ..., TN).
  • Query: find the dataset that minimizes
        Σ (over all grid points)  |Oc - Oc_avg| + |Wp - Wp_avg| + |Gp - Gp_avg|

29
Representative Realization
[Figure: filter group -- the Client issues a set of requests (units of work); RD feeds SUM and DIFF, SUM feeds AVG, and AVG feeds DIFF, which returns differences to the Client]
  • RD: Read filter. Accesses the datasets; a data buffer is one time step. The read filter sends data from each dataset to SUM and DIFF.
  • SUM: Sum filter. Computes the sum of Oc, Wp, and Gp at each grid point across datasets.
  • AVG: Average filter. Carries out the average operation on Oc, Wp, and Gp values. AVG and SUM together execute Step 1 of the average algorithm.
  • DIFF: Difference filter. Finds the sum of differences between grid values and average values for each dataset (Step 2). Sends the difference to the Client.
  • Client: Keeps track of the differences for each time step and averages them over all time steps for each dataset (Step 3). Note: this could be another filter. (A compact sketch of these steps follows below.)
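A compact, single-process C++ sketch of the steps these filters implement, for one variable and one time step; the data values and layout are assumptions, and the real computation is distributed across the RD, SUM, AVG, and DIFF filters described above.

#include <cmath>
#include <cstdio>
#include <vector>

// dataset[d][g]: value of one variable (e.g. oil concentration) at grid point g
// for dataset d, for a single time step. Only one of the three variables
// (Oc, Wp, Gp) is shown; the others are handled the same way.
int main() {
    std::vector<std::vector<double>> dataset = {
        {1.0, 2.0, 3.0},
        {1.5, 2.5, 2.0},
        {0.5, 1.0, 4.0},
    };
    const size_t D = dataset.size(), G = dataset[0].size();

    // Step 1 (SUM + AVG filters): average over datasets at each grid point.
    std::vector<double> avg(G, 0.0);
    for (size_t d = 0; d < D; ++d)
        for (size_t g = 0; g < G; ++g) avg[g] += dataset[d][g];
    for (size_t g = 0; g < G; ++g) avg[g] /= D;

    // Step 2 (DIFF filter): per dataset, sum |value - average| over grid points.
    // Step 3 (Client): pick the dataset with the smallest difference.
    size_t best = 0;
    double best_diff = 1e300;
    for (size_t d = 0; d < D; ++d) {
        double diff = 0.0;
        for (size_t g = 0; g < G; ++g) diff += std::fabs(dataset[d][g] - avg[g]);
        std::printf("dataset %zu: difference %.3f\n", d, diff);
        if (diff < best_diff) { best_diff = diff; best = d; }
    }
    std::printf("representative realization: dataset %zu\n", best);
    return 0;
}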

30
Representative Realization
[Figure: filter placement across 20 nodes -- RD runs as transparent copies (one copy per node, Nodes 1..20); SUM and DIFF run as transparent copies (one copy per node on four nodes without data); AVG and the Client each run as a single instance]
31
High Resolution Imaging and Feature Detection
  • 10-micron resolution microCT, fluorescence microscopy
  • Characterize osteoporosis-related changes in bone trabecular structure
  • MicroCT: obtain 360 projection radiographs at 1-degree intervals
  • Carry out error correction and generate a 3D solid model
  • Invoke various feature detection / texture analysis algorithms to characterize changes in bone
  • Collaborator: Kim Powell, Cleveland Clinic

32
Optimized DataCutter implementation of Background
Correction
[Figure: filter pipeline -- read, subsample, correction, air attenuation, scale air, read-create-first-sino, CT warp]
33
(No Transcript)
34
(No Transcript)
35
TeraGrid Software Development
  • DataCutter will provide data filtering, subsetting, and spatial query services as part of the community TeraGrid infrastructure
  • Integrated with SRB; will be integrated with Globus/NWS TeraGrid services
  • DataCutter/ADR is supported by NSF (NPACI, ACR, ITR), DOE ASCI, DoD, DARPA, NIH