Active Data Repository, DataCutter

1
Active Data Repository, DataCutter

Joel Saltz
Biomedical Informatics Department
The Ohio State University

University of Maryland: Alan Sussman, Charlie Chang, Mike Beynon, Henrique Andrade, Christian Hansen
Ohio State University: Tahsin Kurc, Umit Catalyurek, Gagan Agrawal, Renato Ferreira
2
Data exploration and analysis

Identify trends or interesting phenomena
Spatial/multi-dimensional, multi-scale, multi-resolution datasets
Applications select portions of one or more datasets
Selection of a data subset makes use of a spatial index (e.g., R-tree, quad-tree, etc.)
Data is not used as-is; generally, processing is done to generate a data product, and data volume is often reduced
Access to data by multiple users in a collaborative environment

3
Data Query and Processing Scenario
[Figure labels: source data elements; reduction function; intermediate data elements (accumulator elements); result data elements]
4
Some Specifics

DU ← Select(Output, R)
DI ← Select(Input, R)
for (ue in DU)
    read ue
    ae ← Initialize(ue)
    A ← A ∪ {ae}
for (ie in DI)
    read ie
    SA ← Map(ie, A)
    for (ae in SA)
        ae ← Aggregate(ie, ae)
for (ae in A)
    ue ← Finalize(ae)
    write ue
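
The loop above can be sketched in C++ roughly as follows. This is a minimal sequential illustration, not ADR's API: the element types, the 1-D `Range`, the function name `runQuery`, and the choice of a per-cell average as the Initialize/Map/Aggregate/Finalize functions are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical types for illustration; ADR's real classes differ.
struct InputElement { double x; double value; };  // position and data value
struct Accumulator  { double sum = 0.0; int count = 0; };
struct Range        { double lo, hi; };           // query range R: [lo, hi)

// Runs the select / initialize / map / aggregate / finalize loop for a
// 1-D output grid of `cells` equal-width cells covering the range r.
// The reduction is a per-cell average (an assumed example).
std::vector<double> runQuery(const std::vector<InputElement>& input,
                             Range r, std::size_t cells) {
    assert(cells > 0 && r.hi > r.lo);
    // Initialize: one accumulator element per selected output element.
    std::vector<Accumulator> A(cells);
    const double width = (r.hi - r.lo) / static_cast<double>(cells);

    for (const InputElement& ie : input) {
        // Select: skip input elements outside the query range.
        if (ie.x < r.lo || ie.x >= r.hi) continue;
        // Map: find the accumulator element this input element falls on.
        std::size_t cell = static_cast<std::size_t>((ie.x - r.lo) / width);
        if (cell >= cells) cell = cells - 1;  // guard against rounding
        // Aggregate: a commutative and associative update.
        A[cell].sum += ie.value;
        A[cell].count += 1;
    }

    // Finalize: turn each accumulator element into an output element.
    std::vector<double> out(cells, 0.0);
    for (std::size_t i = 0; i < cells; ++i)
        if (A[i].count > 0) out[i] = A[i].sum / A[i].count;
    return out;
}
```

Because Aggregate is commutative and associative here, the input elements can be visited in any order, which is what lets the input loop be split across processors.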

5
Active Data Repository
6
Active Data Repository (ADR)

C++ class library and runtime system for storage, retrieval, and manipulation of multi-dimensional datasets.
Runtime system support for common operations: data retrieval, memory management, and scheduling of processing across a parallel machine.
Data declustering/clustering, indexing
Buffer management and scheduling of operations

7
ADR System Architecture

Front-end: the interface between clients and the back-end. Provides support for clients
    to connect to ADR,
    to query ADR to get information about already registered datasets and user-defined methods,
    to create ADR queries and submit them.
Back-end: data storage, retrieval, and processing.
    Distributed memory parallel machine or cluster of workstations, with multiple disks attached to each node
    Customizable services for application-specific processing
    Internal services for data retrieval, resource management

8
ADR ARCHITECTURE
[Figure: ADR architecture, customized for iso-surface rendering. A Visualization Client sends a query (grid id, time steps, iso-surface value, viewing parameters) to the Front End and receives a 2D image. Application Front End: Query Interface Service, Query Submission Service. The 3D volume is stored in ADR. ADR Back End: Query Execution Service, Query Planning Service, Dataset Service, Attribute Space Service, Data Aggregation Service, Indexing Service. The customizable ADR services are customized using VTK's iso-surface rendering functions.]
9
Datasets in ADR

ADR expects the input datasets to be partitioned into data chunks.
A data chunk, the unit of I/O and communication,
    contains a subset of input data values (and associated points in input space),
    is associated with a minimum bounding rectangle, which covers all the points in the chunk.
Data chunks are distributed across all the disks in the system.
An index has to be built on the minimum bounding rectangles of the chunks.
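
As a rough illustration of the chunk model above, the sketch below computes a chunk's minimum bounding rectangle (MBR) and selects the chunks whose MBRs intersect a query window. All type and function names (`MBR`, `ChunkInfo`, `boundingRectangle`, `selectChunks`) are hypothetical, the data is assumed to be 2-D, and a linear scan stands in for the index lookup.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 2-D minimum bounding rectangle; names are illustrative.
struct MBR {
    double xmin, ymin, xmax, ymax;
    // Two rectangles intersect unless one lies entirely to one side.
    bool intersects(const MBR& o) const {
        return xmin <= o.xmax && o.xmin <= xmax &&
               ymin <= o.ymax && o.ymin <= ymax;
    }
};

struct Point { double x, y; };

// Per-chunk metadata: where the chunk is stored and the region it covers.
struct ChunkInfo { int disk; long offset; MBR mbr; };

// Computes the MBR that covers all the points in a chunk.
MBR boundingRectangle(const std::vector<Point>& pts) {
    assert(!pts.empty());
    MBR m{pts[0].x, pts[0].y, pts[0].x, pts[0].y};
    for (const Point& p : pts) {
        if (p.x < m.xmin) m.xmin = p.x;
        if (p.y < m.ymin) m.ymin = p.y;
        if (p.x > m.xmax) m.xmax = p.x;
        if (p.y > m.ymax) m.ymax = p.y;
    }
    return m;
}

// Linear scan standing in for the index: returns the positions of the
// chunks whose MBR intersects the query window.
std::vector<std::size_t> selectChunks(const std::vector<ChunkInfo>& chunks,
                                      const MBR& query) {
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < chunks.size(); ++i)
        if (chunks[i].mbr.intersects(query)) hits.push_back(i);
    return hits;
}
```

A spatial index such as an R-tree answers the same question without visiting every chunk; only the chunks it returns need to be read from disk.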

10
Loading Datasets into ADR

A user
    should partition the dataset into data chunks,
    can distribute chunks across the disks, and provide an index for accessing them.
ADR, given data chunks and associated minimum bounding rectangles in a set of files,
    can distribute data chunks across the disks using a Hilbert-curve based declustering algorithm,
    can create an R-tree based index on the dataset.
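
One simple form of Hilbert-curve based declustering is sketched below: chunks are ordered by the Hilbert index of their bounding-rectangle centers and then dealt out to disks round-robin, so that spatially neighboring chunks tend to land on different disks. The slides do not say which variant ADR uses, so the round-robin assignment, the 2-D grid, and the names `hilbertIndex`, `Chunk`, and `decluster` are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Position of grid cell (x, y) along a Hilbert curve filling an n x n
// grid, where n is a power of two (standard iterative formulation).
unsigned long hilbertIndex(unsigned n, unsigned x, unsigned y) {
    assert(n > 0 && (n & (n - 1)) == 0 && x < n && y < n);
    unsigned long d = 0;
    for (unsigned s = n / 2; s > 0; s /= 2) {
        unsigned rx = (x & s) ? 1u : 0u;
        unsigned ry = (y & s) ? 1u : 0u;
        d += static_cast<unsigned long>(s) * s * ((3u * rx) ^ ry);
        // Rotate/flip the quadrant so the sub-curve is oriented correctly.
        if (ry == 0) {
            if (rx == 1) { x = n - 1 - x; y = n - 1 - y; }
            std::swap(x, y);
        }
    }
    return d;
}

// Hypothetical chunk record: id plus the grid cell of its MBR center.
struct Chunk { int id; unsigned cx, cy; };

// Sorts chunks along the Hilbert curve and assigns them to disks
// round-robin. Returns disk[i] for the chunk with id i (ids 0..N-1).
std::vector<int> decluster(std::vector<Chunk> chunks, unsigned n, int disks) {
    assert(disks > 0);
    std::sort(chunks.begin(), chunks.end(),
              [n](const Chunk& a, const Chunk& b) {
                  return hilbertIndex(n, a.cx, a.cy) <
                         hilbertIndex(n, b.cx, b.cy);
              });
    std::vector<int> disk(chunks.size(), 0);
    for (std::size_t i = 0; i < chunks.size(); ++i)
        disk[static_cast<std::size_t>(chunks[i].id)] =
            static_cast<int>(i % static_cast<std::size_t>(disks));
    return disk;
}
```

With this assignment, a range query that touches a compact region is likely to pull its chunks from many disks at once rather than queueing on one.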

11
Loading Datasets into ADR

Partition dataset into data chunks -- each chunk contains a set of data elements
Each chunk is associated with a bounding box
ADR Data Loading Service
    Distributes chunks across the disks in the system
    Constructs an R-tree index using bounding boxes of the data chunks

[Figure label: Disk Farm]
12
ADR Back-end Processing
[Figure: query processing phases on each back-end processor, in execution order: Initialization Phase, Local Reduction Phase, Global Combine Phase, Output Handling Phase. Output is returned to the Client.]
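
The four phases can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions: the back-end processors are modeled as entries of a vector rather than real parallel nodes, each processor is assumed to hold a full copy of the accumulator, and the reduction (summing values into `value % cells`) and the name `processQuery` are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential simulation of the four phases. Each inner vector of
// localData holds the input values stored on one back-end processor.
std::vector<long> processQuery(const std::vector<std::vector<int> >& localData,
                               std::size_t cells) {
    assert(cells > 0);
    const std::size_t procs = localData.size();

    // Initialization phase: every processor allocates its own accumulator.
    std::vector<std::vector<long> > acc(procs, std::vector<long>(cells, 0));

    // Local reduction phase: each processor aggregates only its local data.
    for (std::size_t p = 0; p < procs; ++p) {
        for (int v : localData[p]) {
            assert(v >= 0);
            acc[p][static_cast<std::size_t>(v) % cells] += v;
        }
    }

    // Global combine phase: partial results are merged across processors.
    std::vector<long> combined(cells, 0);
    for (std::size_t p = 0; p < procs; ++p)
        for (std::size_t c = 0; c < cells; ++c)
            combined[c] += acc[p][c];

    // Output handling phase: the final output is produced for the client.
    return combined;
}
```

The local reduction phase needs no communication, so the only data exchanged between processors is the accumulator, which is usually much smaller than the input.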
13
DataCutter
14
DataCutter

Purpose: specialized components for processing data
Based on Active Disks research (Acharya, Uysal, Saltz, ASPLOS '98)
filters: logical unit of computation
    reads data from input buffers, computes/filters/aggregates data, then writes data to output buffers
    a filter can only carry out subsetting and commutative/associative data aggregations
    subsetting is implemented by (among other things) multilevel hierarchical indexes based on the R-tree indexing method
    filter computations are pipelined

15
DataCutter

streams: how filters communicate
    unidirectional buffer pipes
    application developer specifies connectivity between filters
copies: because of the way filters are defined, each filter's computations can be carried out by a system-defined number of transparent copies
support for spatial queries and indexing: inherits ADR's R-tree indexing and spatial query support (specialized filters)
support for data aggregation: filter groups carry out ADR-style data aggregation

16
DataCutter/Globus/NWS

DataCutter filters subset, filter, and aggregate data
Network Weather Service (NWS) is used to provide network and processor performance information, which is used in placing filter copies and determining which data replicas to use
Globus is used to run filter code, track the location of data replicas, maintain updated NWS performance information, and record the location of active filter copies.

17
SRB/DataCutter

Support for Data Filtering and Range Queries
Creation of indices over data sets
Subsetting of data sets
Search for files or portions of files that
intersect a given range query
Filter operations on portions of files (data
segments) before returning them to the client (to
perform filtering or aggregation to reduce data
volume)

18
Filter Framework

C++ language binding
each filter is a logical entity; many copies of each filter can be instantiated
one thread for each instantiated filter
supports pipelined computations
heuristics used for placement; now uses NWS

class MyFilter : public AS_Filter_Base {
public:
    int init(int argc, char *argv[]);
    int process(stream_t st);
    int finalize(void);
};
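
To show how the init/process/finalize interface might be used, here is a self-contained mock. The `AS_Filter_Base` and `stream_t` definitions below are stand-ins written only so the example compiles; their contents are assumptions, not the framework's real definitions. The mock also passes `stream_t` by reference so that output buffers are visible to the caller, which differs from the signature on the slide. `ThresholdFilter` is a hypothetical subsetting filter.

```cpp
#include <cassert>
#include <cstdlib>
#include <deque>
#include <vector>

// Stand-in for the framework's stream type (assumed layout).
struct stream_t {
    std::deque<std::vector<int> > in;   // buffers arriving on the input stream
    std::deque<std::vector<int> > out;  // buffers written to the output stream
};

// Stand-in for the framework's filter base class (assumed interface).
class AS_Filter_Base {
public:
    virtual ~AS_Filter_Base() {}
    virtual int init(int argc, char* argv[]) = 0;
    virtual int process(stream_t& st) = 0;
    virtual int finalize(void) = 0;
};

// Hypothetical subsetting filter: keeps values >= a threshold taken
// from its first argument.
class ThresholdFilter : public AS_Filter_Base {
public:
    int init(int argc, char* argv[]) override {
        assert(argc >= 1);
        threshold_ = (argc > 1) ? std::atoi(argv[1]) : 0;
        return 0;
    }
    // Reads each input buffer, filters it, and writes one output buffer.
    int process(stream_t& st) override {
        while (!st.in.empty()) {
            std::vector<int> kept;
            for (int v : st.in.front())
                if (v >= threshold_) kept.push_back(v);
            st.in.pop_front();
            st.out.push_back(kept);
            ++buffers_;
        }
        return 0;
    }
    // Returns the number of buffers processed (a choice made for this mock).
    int finalize(void) override { return buffers_; }
private:
    int threshold_ = 0;
    int buffers_ = 0;
};
```

Because the filter touches only its own input and output buffers, several copies of it could process different buffers independently, which is the property transparent copies rely on.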
19
Example Applications

Virtual Microscope

Iso-surface Rendering

20
Filter Connectivity / Placement
filter.A
    outs stream1 stream3
filter.B
    ins stream1
    outs stream2
filter.C
    ins stream2 stream3

Manual placement specification via file
Off-line calculation of placement via sample run (Beynon PhD Thesis)
On-line placement using NWS/Globus

placement
    A host1.cs.umd.edu
    B host2.cs.umd.edu
    C host3.cs.umd.edu
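
A connectivity specification like the one on this slide can be turned into a graph of filters and checked for dangling streams. The sketch below is illustrative only: `FilterSpec`, `Edge`, and `buildEdges` are hypothetical names, the layout is assumed to be already parsed into a map, and the rule that every stream has exactly one producer and one consumer is an assumption of this example rather than a documented DataCutter constraint.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical in-memory form of one filter's entry in the layout.
struct FilterSpec {
    std::vector<std::string> ins;   // streams the filter reads
    std::vector<std::string> outs;  // streams the filter writes
};

// One connection: producer filter -> consumer filter over a named stream.
struct Edge { std::string from, to, stream; };

// Derives the edges from the layout. Returns false if any stream does
// not have exactly one producer and exactly one consumer.
bool buildEdges(const std::map<std::string, FilterSpec>& layout,
                std::vector<Edge>& edges) {
    std::map<std::string, std::vector<std::string> > producers, consumers;
    for (const auto& kv : layout) {
        assert(!kv.first.empty());
        for (const auto& s : kv.second.outs) producers[s].push_back(kv.first);
        for (const auto& s : kv.second.ins) consumers[s].push_back(kv.first);
    }
    edges.clear();
    for (const auto& kv : producers) {
        auto it = consumers.find(kv.first);
        if (kv.second.size() != 1 || it == consumers.end() ||
            it->second.size() != 1)
            return false;
        edges.push_back(Edge{kv.second[0], it->second[0], kv.first});
    }
    // A stream that is read but never written is also an error.
    for (const auto& kv : consumers)
        if (producers.find(kv.first) == producers.end()) return false;
    return true;
}
```

For the layout on the slide this yields three edges: A to B over stream1, B to C over stream2, and A to C over stream3.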
21
Group Instances (Batches)
Work issued in instance batches until all
complete. Matching instances to environment
(CPU capacity)
22
Transparent Copies
[Figure: transparent copies E0 ... EK and EK+1 ... EN placed on host1 and host2; Cluster 1, Cluster 2, Cluster 3]
23
Exploration and Visualization of Oil Reservoir
Simulation Data
Mary Wheeler, Steven Bryant, Malgorzata Peszynska, Ryan Martino
Center for Subsurface Modeling, University of Texas at Austin
http://www.ticam.utexas.edu/CSM
Joel Saltz, Umit Catalyurek, Tahsin Kurc
Biomedical Informatics Department, The Ohio State University
http://medicine.osu.edu/informatics
Alan Sussman, Michael Beynon
Department of Computer Science, University of Maryland
http://www.cs.umd.edu/projects/adr
Don Stredney, Dennis Sessanna
Interface Laboratory, The Ohio Supercomputer Center
http://www.osc.edu
24
System Architecture
25
Dataset

Data size: 1.5TB
207 simulations, selected from
    18 Geostatistics Models (GM)
    10 Realizations of each model (R)
    4 Well Patterns (WP)
Each simulation is 6.9GB
    10,000 time steps
    9,000 grid elements
    8 scalars + 3 vectors = 17 variables
Stored on UMD Storage Cluster
    9TB disks on 50 nodes (PIII-650, 128MB, Switched Ethernet)

26
Economic Model

Economic assessment
Net Present Value (NPV)
Return on Investment (ROI)
Sweep Efficiency (SE)
Queries
return R-WP for given GM that has NPV > avg
return R-WP for all GM which has max NPV
.

Economic model uses
well rates (time series data)
cost and price (e.g., oil) parameters
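
The slides do not give the economic model's formulas. As a rough illustration of how well-rate time series and price/cost parameters could feed an NPV query, the sketch below uses the textbook discounted cash flow definition, NPV = sum over t of CF_t / (1 + r)^t. The `EconParams` fields, the per-period cash flow (rate times price minus a fixed cost), and both function names are assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical economic parameters; not taken from the slides.
struct EconParams {
    double oilPrice;       // revenue per unit of oil produced
    double costPerPeriod;  // fixed operating cost per period
    double discountRate;   // per-period discount rate r
};

// Textbook net present value of a production time series:
// each period's cash flow is discounted by (1 + r)^t, t = 1, 2, ...
double netPresentValue(const std::vector<double>& oilRate,
                       const EconParams& p) {
    assert(p.discountRate > -1.0);
    double npv = 0.0;
    for (std::size_t t = 0; t < oilRate.size(); ++t) {
        const double cash = oilRate[t] * p.oilPrice - p.costPerPeriod;
        npv += cash / std::pow(1.0 + p.discountRate,
                               static_cast<double>(t + 1));
    }
    return npv;
}

// Query sketch: positions of the candidates whose NPV is above average.
std::vector<std::size_t> aboveAverage(const std::vector<double>& npv) {
    std::vector<std::size_t> winners;
    if (npv.empty()) return winners;
    double sum = 0.0;
    for (double v : npv) sum += v;
    const double avg = sum / static_cast<double>(npv.size());
    for (std::size_t i = 0; i < npv.size(); ++i)
        if (npv[i] > avg) winners.push_back(i);
    return winners;
}
```

In the setting described here, one NPV would be computed per realization and well pattern of a geostatistics model, and the "NPV > avg" query would return the positions reported by `aboveAverage`.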

27
Results

Economic model shows range of winners and losers
We want to also understand the physics behind
this
An example is looking at bypassed oil
Turns out to be strongly correlated to economics

28
Representative Realization

Select the simulation/realization that has values closest to user-defined criteria.
Analyze that simulation or use its initial conditions for further simulation studies.
Find the dataset, among a set of datasets, whose values of oil concentration, water pressure, and gas pressure are closest to the average of these values across the set of datasets
User selects
A set of datasets (D) and a set of time steps (T1, T2, ..., TN).
Query: Find the dataset that is closest to the average.
\min \sum_{\text{all grid points}} \left( |Oc - Oc_{avg}| + |Wp - Wp_{avg}| + |Gp - Gp_{avg}| \right)

29
Representative Realization
[Figure: filter layout with the filters RD, SUM, AVG, and DIFF and the Client, which issues a set of requests (units of work)]

RD: Read filter. Accesses data sets. A data buffer is one time step. Read filter sends data from each dataset to SUM and DIFF.
SUM: Sum filter. Performs sum of Co, Wp, and Gp at each grid point across datasets.
AVG: Average filter. Carries out average operation on Co, Wp, and Gp values. AVG and SUM together execute step 1 of the average algorithm.
DIFF: Difference filter. Finds the sum of differences between grid values and average values for each dataset (Step 2). Sends the difference to the Client.
Client: Keeps track of differences for each time step, carries out average over all time steps for each dataset (Step 3). Note: this could be another filter.
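
The three steps carried out by these filters can be sketched sequentially in C++. This is a single-process illustration, not the filter implementation: the types `GridValue`, `TimeStep`, and `Dataset`, the function `representative`, and the use of absolute differences as the distance are assumptions, and every dataset is assumed to have the same number of time steps and grid points.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical layout: oil concentration, water pressure, gas pressure.
struct GridValue { double oc, wp, gp; };
typedef std::vector<GridValue> TimeStep;  // one value per grid point
typedef std::vector<TimeStep> Dataset;    // one entry per time step

// Returns the position of the dataset closest to the average.
std::size_t representative(const std::vector<Dataset>& D) {
    assert(!D.empty() && !D[0].empty());
    const std::size_t steps = D[0].size();
    const double n = static_cast<double>(D.size());
    std::vector<double> score(D.size(), 0.0);

    for (std::size_t t = 0; t < steps; ++t) {
        const std::size_t points = D[0][t].size();
        // Step 1 (SUM + AVG): average at each grid point across datasets.
        std::vector<GridValue> avg(points, GridValue{0.0, 0.0, 0.0});
        for (const Dataset& d : D) {
            assert(d.size() == steps && d[t].size() == points);
            for (std::size_t g = 0; g < points; ++g) {
                avg[g].oc += d[t][g].oc;
                avg[g].wp += d[t][g].wp;
                avg[g].gp += d[t][g].gp;
            }
        }
        for (GridValue& a : avg) { a.oc /= n; a.wp /= n; a.gp /= n; }
        // Step 2 (DIFF): per dataset, sum of differences from the average.
        for (std::size_t k = 0; k < D.size(); ++k)
            for (std::size_t g = 0; g < points; ++g)
                score[k] += std::fabs(D[k][t][g].oc - avg[g].oc) +
                            std::fabs(D[k][t][g].wp - avg[g].wp) +
                            std::fabs(D[k][t][g].gp - avg[g].gp);
    }

    // Step 3 (Client): average over time steps, then pick the minimum.
    std::size_t best = 0;
    for (std::size_t k = 0; k < D.size(); ++k) {
        score[k] /= static_cast<double>(steps);
        if (score[k] < score[best]) best = k;
    }
    return best;
}
```

Each time step is processed independently of the others until Step 3, which is why one time step per data buffer is a natural unit of work for the filters.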

30
Representative Realization
[Figure: filter placement for the representative realization query. RD: transparent copies, one copy per node (Node 1 ... Node 20). SUM: transparent copies, one copy per node on four nodes without data. DIFF: transparent copies, one copy per node on four nodes without data. AVG. Client.]
31
High Resolution Imaging and Feature Detection

10 micron resolution microCT, fluorescence microscopy
Characterize osteoporosis-related changes in bone trabecular structure
MicroCT: obtain 360 projection radiographs at 1 degree intervals
Carry out error correction and generate 3D solid model
Invoke various feature detection/texture analysis algorithms to characterize changes in bone
Collaborator: Kim Powell, Cleveland Clinic

32
Optimized DataCutter implementation of Background Correction
[Figure: filter pipeline with the filters read, subsample, correction, air attenuation, scale air, read-create-first-sino, and CT warp]
35
Teragrid Software Development

DataCutter will provide data filtering, subsetting, and spatial query services as part of community TeraGrid infrastructure
Integrated with SRB; will be integrated with Globus/NWS TeraGrid services
DataCutter/ADR supported by NSF (NPACI, ACR, ITR), DOE ASCI, DoD, DARPA, NIH
