Title: Active Data Repository, DataCutter
1Active Data Repository, DataCutter
- Joel Saltz
- Biomedical Informatics Department
- The Ohio State University
- University of Maryland: Alan Sussman, Charlie Chang, Mike Beynon, Henrique Andrade, Christian Hansen
- Ohio State University: Tahsin Kurc, Umit Catalyurek, Gagan Agrawal, Renato Ferreira
2Data exploration and analysis
- Identify trends or interesting phenomena
- Spatial/multi-dimensional, multi-scale, multi-resolution datasets
- Applications select portions of one or more datasets
- Selection of data subset makes use of spatial index (e.g., R-tree, quad-tree, etc.)
- Data not used as-is; generally processing is done to generate a data product, and data volume is often reduced
- Access to data by multiple users in a collaborative environment
3Data Query and Processing Scenario
[Figure: source data elements -> reduction function -> intermediate data elements (accumulator elements) -> result data elements]
4Some Specifics
- DU ← Select(Output, R); DI ← Select(Input, R)
- for (ue in DU)
  - read ue; ae ← Initialize(ue); add ae to A
- for (ie in DI)
  - read ie; SA ← Map(ie, A)
  - for (ae in SA): ae ← Aggregate(ie, ae)
- for (ae in A)
  - ue ← Finalize(ae); write ue
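The loop above can be sketched in C++ as a small generic reduction. This is only an illustration under simplifying assumptions: elements are 1-D, output cells have unit width, and the aggregation is a sum. The names `Element`, `select_range`, `map_to_accumulator`, and `run_query` are hypothetical and are not part of the ADR API.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical 1-D data element: a position and a value.
struct Element { double pos; double value; };

// Select: keep the elements whose position falls inside the query range R = [lo, hi].
std::vector<Element> select_range(const std::vector<Element>& data, double lo, double hi) {
    std::vector<Element> out;
    for (const Element& e : data)
        if (e.pos >= lo && e.pos <= hi) out.push_back(e);
    return out;
}

// Map: index of the accumulator whose output cell covers the input element
// (assumption: output cells have unit width and start at out_lo).
std::size_t map_to_accumulator(const Element& ie, double out_lo) {
    return static_cast<std::size_t>(ie.pos - out_lo);
}

// Initialize, Map + Aggregate, Finalize, with "sum" as the aggregation.
std::vector<double> run_query(const std::vector<Element>& input,
                              double lo, double hi, std::size_t n_out) {
    std::vector<double> acc(n_out, 0.0);               // Initialize: one accumulator per output element
    for (const Element& ie : select_range(input, lo, hi)) {
        std::size_t a = map_to_accumulator(ie, lo);     // Map
        if (a < n_out) acc[a] += ie.value;              // Aggregate (commutative and associative)
    }
    return acc;                                         // Finalize: identity in this sketch
}
```

Because the aggregation is commutative and associative, the input elements can be processed in any order, which is what lets a runtime system schedule the reads freely.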
5Active Data Repository
6Active Data Repository (ADR)
- C++ class library and runtime system
  - storage, retrieval and manipulation of multi-dimensional datasets
- Runtime system support for common operations
  - data retrieval, memory management, and scheduling of processing across a parallel machine
- Data declustering/clustering, indexing
- Buffer management and scheduling of operations
7ADR System Architecture
- Front-end: the interface between clients and back-end. Provides support for clients
  - to connect to ADR,
  - to query ADR to get information about already registered datasets and user-defined methods,
  - to create ADR queries and submit them.
- Back-end: data storage, retrieval, and processing.
  - Distributed memory parallel machine or cluster of workstations, with multiple disks attached to each node
  - Customizable services for application-specific processing
  - Internal services for data retrieval, resource management
8ADR ARCHITECTURE
[Figure: ADR architecture.
 Visualization client: sends a query (grid id, time steps, iso-surface value, viewing parameters) to the front end and receives a 2D image.
 Application front end: Query Interface Service, Query Submission Service.
 ADR back end (stores the 3D volume in ADR): Query Execution Service, Query Planning Service, Dataset Service, Attribute Space Service, Data Aggregation Service, Indexing Service.
 Customizable ADR services: customized using VTK's iso-surface rendering functions.]
9Datasets in ADR
- ADR expects the input datasets to be partitioned into data chunks.
- A data chunk, the unit of I/O and communication,
  - contains a subset of input data values (and associated points in input space),
  - is associated with a minimum bounding rectangle, which covers all the points in the chunk.
- Data chunks are distributed across all the disks in the system.
- An index has to be built on the minimum bounding rectangles of the chunks.
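As an illustration of the chunk and bounding-rectangle bookkeeping described above, here is a minimal C++ sketch. It assumes 2-D points, and a linear scan stands in for the index lookup. The types `Point`, `MBR`, and `Chunk` and the functions `bounding_box`, `intersects`, and `find_chunks` are hypothetical names, not ADR's interface.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Point { double x, y; };

// Axis-aligned minimum bounding rectangle (2-D for brevity).
struct MBR { double xmin, ymin, xmax, ymax; };

// Hypothetical chunk descriptor: which disk holds the chunk and what region it covers.
struct Chunk { int id; int disk; MBR box; };

// Smallest rectangle covering all points of a chunk (assumes pts is non-empty).
MBR bounding_box(const std::vector<Point>& pts) {
    MBR b = {pts[0].x, pts[0].y, pts[0].x, pts[0].y};
    for (const Point& p : pts) {
        b.xmin = std::min(b.xmin, p.x);
        b.ymin = std::min(b.ymin, p.y);
        b.xmax = std::max(b.xmax, p.x);
        b.ymax = std::max(b.ymax, p.y);
    }
    return b;
}

// Two rectangles intersect unless one lies entirely to one side of the other.
bool intersects(const MBR& a, const MBR& b) {
    return a.xmin <= b.xmax && b.xmin <= a.xmax &&
           a.ymin <= b.ymax && b.ymin <= a.ymax;
}

// Ids of the chunks whose rectangle meets the query rectangle.
// A linear scan is used here; a spatial index answers the same question
// without visiting every chunk.
std::vector<int> find_chunks(const std::vector<Chunk>& chunks, const MBR& query) {
    std::vector<int> hits;
    for (const Chunk& c : chunks)
        if (intersects(c.box, query)) hits.push_back(c.id);
    return hits;
}
```

Only the chunks returned by the lookup need to be read from disk, which is why the chunk is the natural unit of I/O.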
10Loading Datasets into ADR
- A user
  - should partition the dataset into data chunks,
  - can distribute chunks across the disks, and provide an index for accessing them.
- ADR, given data chunks and associated minimum bounding rectangles in a set of files,
  - can distribute data chunks across the disks using a Hilbert-curve based declustering algorithm,
  - can create an R-tree based index on the dataset.
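One simple way to decluster along a Hilbert curve is sketched below in C++. It assumes each chunk is represented by the grid cell containing the center of its bounding rectangle, on an n x n grid with n a power of two. Chunks are sorted by Hilbert index and dealt to disks round-robin, so chunks that are close in space tend to land on different disks. This is a simplified stand-in, not ADR's actual declustering algorithm; `Cell`, `hilbert_index`, and `decluster` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Grid cell holding the center of a chunk's bounding rectangle.
struct Cell { int x, y; };

// Position of cell (x, y) along the Hilbert curve on an n x n grid (n a power of two).
long hilbert_index(int n, int x, int y) {
    long d = 0;
    for (int s = n / 2; s > 0; s /= 2) {
        int rx = (x & s) ? 1 : 0;
        int ry = (y & s) ? 1 : 0;
        d += static_cast<long>(s) * s * ((3 * rx) ^ ry);
        if (ry == 0) {                       // rotate/flip the quadrant
            if (rx == 1) { x = n - 1 - x; y = n - 1 - y; }
            std::swap(x, y);
        }
    }
    return d;
}

// Returns the disk assigned to each chunk: sort by Hilbert index, then round-robin.
// Assumes n_disks >= 1.
std::vector<int> decluster(const std::vector<Cell>& centers, int n, int n_disks) {
    std::vector<std::size_t> order(centers.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
        return hilbert_index(n, centers[a].x, centers[a].y) <
               hilbert_index(n, centers[b].x, centers[b].y);
    });
    std::vector<int> disk(centers.size());
    for (std::size_t rank = 0; rank < order.size(); ++rank)
        disk[order[rank]] = static_cast<int>(rank % static_cast<std::size_t>(n_disks));
    return disk;
}
```

With this assignment a range query that touches a compact region is likely to be served by several disks in parallel instead of queuing on one.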
11Loading Datasets into ADR
- Partition dataset into data chunks: each chunk contains a set of data elements
- Each chunk is associated with a bounding box
- ADR Data Loading Service
  - Distributes chunks across the disks in the system
  - Constructs an R-tree index using bounding boxes of the data chunks
[Figure: data chunks distributed across a disk farm]
12ADR Back-end Processing
[Figure: ADR back-end processing. Each back-end processor runs an Initialization Phase, a Local Reduction Phase, a Global Combine Phase, and an Output Handling Phase; output is returned to the client.]
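The four back-end phases (initialization, local reduction, global combine, output handling) can be illustrated with a sequential C++ sketch in which each "processor" is just a vector of local input items. It assumes a replicated accumulator per processor and an average as the final product. All names (`Item`, `Acc`, `run_backend`, and so on) are hypothetical and do not come from ADR.

```cpp
#include <cstddef>
#include <vector>

struct Item { std::size_t cell; double value; };   // input element already mapped to an output cell
struct Acc  { double sum; int count; };            // accumulator element

// Initialization phase: every back-end processor gets its own accumulator copy.
std::vector<Acc> initialize(std::size_t n_cells) {
    return std::vector<Acc>(n_cells, Acc{0.0, 0});
}

// Local reduction phase: aggregate the items stored on this processor's disks.
void local_reduce(const std::vector<Item>& local, std::vector<Acc>& acc) {
    for (const Item& it : local) {
        acc[it.cell].sum += it.value;
        acc[it.cell].count += 1;
    }
}

// Global combine phase: merge the partial accumulators of all processors.
std::vector<Acc> global_combine(const std::vector<std::vector<Acc> >& partial,
                                std::size_t n_cells) {
    std::vector<Acc> merged = initialize(n_cells);
    for (const std::vector<Acc>& p : partial)
        for (std::size_t c = 0; c < n_cells; ++c) {
            merged[c].sum += p[c].sum;
            merged[c].count += p[c].count;
        }
    return merged;
}

// Output handling phase: turn accumulators into the final product (here, averages).
std::vector<double> output_handling(const std::vector<Acc>& acc) {
    std::vector<double> out;
    for (const Acc& a : acc) out.push_back(a.count ? a.sum / a.count : 0.0);
    return out;
}

// Runs the four phases; processors are simulated one after another.
std::vector<double> run_backend(const std::vector<std::vector<Item> >& per_processor,
                                std::size_t n_cells) {
    std::vector<std::vector<Acc> > partial;
    for (const std::vector<Item>& local : per_processor) {
        std::vector<Acc> acc = initialize(n_cells);
        local_reduce(local, acc);
        partial.push_back(acc);
    }
    return output_handling(global_combine(partial, n_cells));
}
```

On a real parallel machine the local reductions would run concurrently and the global combine would involve communication between nodes; the sketch only shows the data flow.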
13DataCutter
14DataCutter
- Purpose: specialized components for processing data
- Based on Active Disks research (Acharya, Uysal, Saltz, ASPLOS'98)
- filters: logical unit of computation
  - a filter reads data from input buffers, computes/filters/aggregates data, then writes data to output buffers
  - a filter can only carry out subsetting and commutative/associative data aggregations
  - subsetting is implemented by (among other things) multilevel hierarchical indexes based on the R-tree indexing method
  - filter computations are pipelined
15DataCutter
- streams: how filters communicate
  - unidirectional buffer pipes
  - application developer specifies connectivity between filters
- copies: because of the way filters are defined, each filter's computations can be carried out by a system-defined number of transparent copies
- support for spatial queries and indexing: inherits ADR's R-tree indexing and spatial query support (specialized filters)
- support for data aggregation: filter groups carry out ADR-style data aggregation
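The filter, stream, and transparent-copy ideas can be sketched in C++ as follows. This is a single-threaded illustration that runs one stage to completion before the next, whereas real filters run concurrently and pipelined. The `Stream`, `Filter`, `RangeFilter`, `SumFilter`, and `run_pipeline` names are hypothetical and are not the DataCutter API.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

typedef std::vector<int> Buffer;

// Unidirectional stream: the producer writes buffers, the consumer reads them in order.
struct Stream {
    std::deque<Buffer> q;
    void write(const Buffer& b) { q.push_back(b); }
    bool read(Buffer& b) {
        if (q.empty()) return false;
        b = q.front();
        q.pop_front();
        return true;
    }
};

// Hypothetical filter interface for this sketch: consume one buffer, emit one buffer.
struct Filter {
    virtual ~Filter() {}
    virtual void process(Stream& in, Stream& out) = 0;
};

// Subsetting filter: forwards only the values inside [lo, hi].
struct RangeFilter : Filter {
    int lo, hi;
    RangeFilter(int l, int h) : lo(l), hi(h) {}
    void process(Stream& in, Stream& out) {
        Buffer b;
        if (!in.read(b)) return;
        Buffer kept;
        for (int v : b) if (v >= lo && v <= hi) kept.push_back(v);
        out.write(kept);
    }
};

// Aggregation filter: emits the sum of each buffer (commutative and associative).
struct SumFilter : Filter {
    void process(Stream& in, Stream& out) {
        Buffer b;
        if (!in.read(b)) return;
        int s = 0;
        for (int v : b) s += v;
        out.write(Buffer(1, s));
    }
};

// Each stage is served by `copies` transparent copies that take buffers in turn.
// Assumes copies >= 1.
int run_pipeline(const std::vector<Buffer>& input, int lo, int hi, std::size_t copies) {
    Stream s0, s1, s2;
    for (const Buffer& b : input) s0.write(b);
    std::vector<RangeFilter> subset(copies, RangeFilter(lo, hi));
    std::vector<SumFilter> sum(copies);
    for (std::size_t i = 0; !s0.q.empty(); ++i) subset[i % copies].process(s0, s1);
    for (std::size_t i = 0; !s1.q.empty(); ++i) sum[i % copies].process(s1, s2);
    int total = 0;
    Buffer b;
    while (s2.read(b)) total += b[0];   // final combine of the partial sums
    return total;
}
```

Because the filters only subset and apply commutative/associative aggregations, the result does not depend on which copy handles which buffer, so the number of copies can be chosen by the system without the application noticing.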
16DataCutter/Globus/NWS
- DataCutter filters subset, filter, and aggregate data
- Network Weather Service (NWS) provides network and processor performance information, used in placing filter copies and in determining which data replicas to use
- Globus is used to run filter code, track the location of data replicas, maintain updated NWS performance information, and record the location of active filter copies
17SRB/DataCutter
- Support for Data Filtering and Range Queries
  - Creation of indices over data sets
  - Subsetting of data sets
  - Search for files or portions of files that intersect a given range query
  - Filter operations on portions of files (data segments) before returning them to the client (to perform filtering or aggregation to reduce data volume)
18Filter Framework
- C++ language binding
- each filter is a logical entity; many copies of each filter can be instantiated
- one thread for each instantiated filter
- supports pipelined computations
- heuristics used for placement; now uses NWS

class MyFilter : public AS_Filter_Base {
public:
    int init(int argc, char *argv[]);
    int process(stream_t st);
    int finalize(void);
};
19Example Applications
20Filter Connectivity / Placement
filter.A  outs: stream1 stream3
filter.B  ins: stream1  outs: stream2
filter.C  ins: stream2 stream3

- Manual placement specification via file
- Off-line calculation of placement via sample run (Beynon PhD Thesis)
- On-line placement using NWS/Globus

placement  A: host1.cs.umd.edu  B: host2.cs.umd.edu  C: host3.cs.umd.edu
21Group Instances (Batches)
Work issued in instance batches until all
complete. Matching instances to environment
(CPU capacity)
22Transparent Copies
[Figure: transparent filter copies E0 through EN running on host1 and host2, with Cluster 1, Cluster 2, and Cluster 3]
23Exploration and Visualization of Oil Reservoir
Simulation Data
- Mary Wheeler, Steven Bryant, Malgorzata Peszynska, Ryan Martino: Center for Subsurface Modeling, University of Texas at Austin, http://www.ticam.utexas.edu/CSM
- Joel Saltz, Umit Catalyurek, Tahsin Kurc: Biomedical Informatics Department, The Ohio State University, http://medicine.osu.edu/informatics
- Alan Sussman, Michael Beynon: Department of Computer Science, University of Maryland, http://www.cs.umd.edu/projects/adr
- Don Stredney, Dennis Sessanna: Interface Laboratory, The Ohio Supercomputer Center, http://www.osc.edu
24System Architecture
25Dataset
- Data size: 1.5TB
- 207 simulations, selected from
  - 18 Geostatistics Models (GM)
  - 10 Realizations of each model (R)
  - 4 Well Patterns (WP)
- Each simulation is 6.9GB
  - 10,000 time steps
  - 9,000 grid elements
  - 8 scalars + 3 vectors = 17 variables
- Stored on UMD Storage Cluster
  - 9TB of disk on 50 nodes (PIII-650, 128MB, switched Ethernet)
26Economic Model
- Economic assessment
  - Net Present Value (NPV)
  - Return on Investment (ROI)
  - Sweep Efficiency (SE)
- Queries
  - return R-WP for a given GM that has NPV > avg
  - return R-WP for all GM which has max NPV
  - ...
- Economic model uses
  - well rates (time series data)
  - cost and price (e.g., oil) parameters
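A query of the first kind (realization/well-pattern pairs whose NPV is above average for a given geostatistics model) could look like the C++ sketch below. The slides do not give the economic model, so the NPV formula here is an assumption: the textbook discounted cash flow, with a single oil price, a fixed cost per time step, and a constant discount rate. `Simulation`, `Hit`, `npv`, and `above_average` are hypothetical names.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One simulation, identified by geostatistics model, realization, and well pattern.
struct Simulation {
    int gm;
    int realization;
    int well_pattern;
    std::vector<double> oil_rate;   // well rate time series (one value per time step)
};

// Assumed model: NPV = sum over t of (price * rate[t] - cost) / (1 + discount)^t.
double npv(const std::vector<double>& oil_rate, double price, double cost, double discount) {
    double total = 0.0;
    for (std::size_t t = 0; t < oil_rate.size(); ++t)
        total += (price * oil_rate[t] - cost) /
                 std::pow(1.0 + discount, static_cast<double>(t));
    return total;
}

struct Hit { int realization; int well_pattern; double value; };

// Returns the R-WP pairs of model `gm` whose NPV is above the average NPV of that model.
std::vector<Hit> above_average(const std::vector<Simulation>& sims, int gm,
                               double price, double cost, double discount) {
    std::vector<Hit> all;
    double sum = 0.0;
    for (const Simulation& s : sims) {
        if (s.gm != gm) continue;
        Hit h = {s.realization, s.well_pattern, npv(s.oil_rate, price, cost, discount)};
        all.push_back(h);
        sum += h.value;
    }
    std::vector<Hit> out;
    if (all.empty()) return out;
    double avg = sum / static_cast<double>(all.size());
    for (const Hit& h : all)
        if (h.value > avg) out.push_back(h);
    return out;
}
```

The per-simulation NPV is a reduction over the time series, so it fits the same aggregate-then-finalize pattern as the other queries in this talk.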
27Results
- Economic model shows range of winners and losers
- We want to also understand the physics behind this
- An example is looking at bypassed oil
  - Turns out to be strongly correlated to economics
28Representative Realization
- Select the simulation/realization that has values closest to a user-defined criterion.
  - Analyze that simulation or use its initial conditions for further simulation studies.
- Find the dataset among a set of datasets
  - whose values of oil concentration, water pressure, and gas pressure are closest to the average of these values across the set of datasets
- User selects
  - a set of datasets (D) and a set of time steps (T1, T2, ..., TN).
- Query: find the dataset that is closest to the average.
  - min Σ(all grid points) |Oc − Oc_avg| + |Wp − Wp_avg| + |Gp − Gp_avg|
29Representative Realization
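[Figure: filter graph. The Client issues a set of requests (units of work); RD feeds SUM and DIFF; SUM feeds AVG; AVG feeds DIFF; DIFF reports to the Client.]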
- RD: Read filter. Accesses data sets. A data buffer is one time step. The Read filter sends data from each dataset to SUM and DIFF.
- SUM: Sum filter. Performs the sum of Oc, Wp, and Gp at each grid point across datasets.
- AVG: Average filter. Carries out the average operation on Oc, Wp, and Gp values. AVG and SUM together execute Step 1 of the average algorithm.
- DIFF: Difference filter. Finds the sum of differences between grid values and average values for each dataset (Step 2). Sends the difference to the Client.
- Client: Keeps track of differences for each time step, carries out the average over all time steps for each dataset (Step 3). Note: this could be another filter.
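The three steps carried out by SUM/AVG, DIFF, and the Client can be sketched in C++ as plain functions. This assumes absolute differences, all datasets sharing the same grid and time steps, and at least one dataset with at least one time step. The function and type names are hypothetical and the filters are called directly instead of being connected by streams.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Point { double oc, wp, gp; };      // oil concentration, water pressure, gas pressure
typedef std::vector<Point> TimeStep;      // one data buffer: all grid points of one time step
typedef std::vector<TimeStep> Dataset;    // indexed by time step

// SUM: per-grid-point sum of the three values across datasets, for time step t.
TimeStep sum_filter(const std::vector<Dataset>& ds, std::size_t t) {
    TimeStep s(ds[0][t].size(), Point{0.0, 0.0, 0.0});
    for (const Dataset& d : ds)
        for (std::size_t g = 0; g < s.size(); ++g) {
            s[g].oc += d[t][g].oc;
            s[g].wp += d[t][g].wp;
            s[g].gp += d[t][g].gp;
        }
    return s;
}

// AVG: divides the sums by the number of datasets (Step 1 together with SUM).
TimeStep avg_filter(TimeStep s, std::size_t n_datasets) {
    const double n = static_cast<double>(n_datasets);
    for (Point& p : s) { p.oc /= n; p.wp /= n; p.gp /= n; }
    return s;
}

// DIFF: sum over grid points of the absolute differences from the average (Step 2).
double diff_filter(const TimeStep& step, const TimeStep& avg) {
    double d = 0.0;
    for (std::size_t g = 0; g < step.size(); ++g)
        d += std::fabs(step[g].oc - avg[g].oc) +
             std::fabs(step[g].wp - avg[g].wp) +
             std::fabs(step[g].gp - avg[g].gp);
    return d;
}

// Client: averages the differences over time steps and picks the closest dataset (Step 3).
std::size_t representative(const std::vector<Dataset>& ds) {
    const std::size_t n_steps = ds[0].size();
    std::vector<double> score(ds.size(), 0.0);
    for (std::size_t t = 0; t < n_steps; ++t) {
        TimeStep avg = avg_filter(sum_filter(ds, t), ds.size());
        for (std::size_t d = 0; d < ds.size(); ++d)
            score[d] += diff_filter(ds[d][t], avg) / static_cast<double>(n_steps);
    }
    std::size_t best = 0;
    for (std::size_t d = 1; d < ds.size(); ++d)
        if (score[d] < score[best]) best = d;
    return best;
}
```

Each time step is independent until the Client's final average, which is what allows the RD, SUM, and DIFF filters to be replicated as transparent copies.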
30Representative Realization
[Figure: filter placement on Node 1 through Node 20. RD: transparent copies, one copy per node. SUM: transparent copies, one copy per node on four nodes without data. DIFF: transparent copies, one copy per node on four nodes without data. AVG and Client also shown.]
31High Resolution Imaging and Feature Detection
- 10 micron resolution microCT, fluorescence microscopy
- Characterize osteoporosis-related changes in bone trabecular structure
- MicroCT: obtain 360 projection radiographs at 1 degree intervals
- Carry out error correction and generate 3D solid model
- Invoke various feature detection/texture analysis algorithms to characterize changes in bone
- Collaborator: Kim Powell, Cleveland Clinic
32Optimized DataCutter implementation of Background
Correction
[Figure: filter graph with filters read, subsample, correction, air attenuation, scale air, read-create-first-sino, CT warp]
35Teragrid Software Development
- DataCutter will provide data filtering, subsetting, and spatial query services as part of community TeraGrid infrastructure
- Integrated with SRB; will be integrated with Globus/NWS TeraGrid services
- DataCutter/ADR supported by NSF (NPACI, ACR, ITR), DOE ASCI, DoD, DARPA, NIH