1
GPS Data Management and Visualization
  • Scott Klasky
  • Micah Beck, Kwan-Liu Ma
  • Viraj Bhat, Manish Parashar

2
Overview
  • Data Model/Data Files.
  • Data Movement.
  • Data Access to TBs of Data.
  • Data Visualization problem.
  • Workflow Automation.
  • Data Analysis.

3
Data Model
  • New data structures in GTC.
  • Adams, Decyk, Ethier, Klasky, Nishimura
  • Definition of particles and fields.
  • Includes metadata necessary for long-term access
    to the data.
  • Includes provenance, so we can understand where
    this output came from (a sketch of such a record
    appears below), including
  • Input data.
  • Code version.
  • Machine the code ran on.
  • Number of processors.
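
A minimal sketch, assuming a C representation, of the kind of provenance record described above; the struct and field names are illustrative, not the actual GTC data structures:

    typedef struct {
        char input_file[256];   /* input deck used for the run            */
        char code_version[64];  /* e.g., a version tag of the source code */
        char machine[64];       /* machine the code ran on                */
        int  num_processors;    /* number of processors used              */
    } provenance_t;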

4
Data File
  • We will use HDF5.
  • Could use parallel netCDF or XML.
  • We will provide simple calls for the HDF5 routines.
  • Open statement, close statement, write
    attributes, write coordinates, write raw data.
  • The HDF5 routines will be written in C (see the
    sketch below).
  • We will make it transparent to stream or write
    data.
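
A minimal sketch of the kind of thin wrapper these calls suggest, assuming the HDF5 C API; the function name gps_write_field and the dataset/attribute names are hypothetical, not the project's actual interface:

    #include <hdf5.h>

    /* Hypothetical wrapper: open a file, write one 2-D field plus a
       time-step attribute, and close everything. */
    int gps_write_field(const char *fname, const double *data,
                        hsize_t nx, hsize_t ny, double timestep)
    {
        hsize_t dims[2] = { nx, ny };
        hid_t file  = H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        hid_t space = H5Screate_simple(2, dims, NULL);
        hid_t dset  = H5Dcreate(file, "/potential", H5T_NATIVE_DOUBLE, space,
                                H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

        /* attach the time step as an attribute (metadata) */
        hid_t ascl = H5Screate(H5S_SCALAR);
        hid_t attr = H5Acreate(dset, "timestep", H5T_NATIVE_DOUBLE, ascl,
                               H5P_DEFAULT, H5P_DEFAULT);
        H5Awrite(attr, H5T_NATIVE_DOUBLE, &timestep);

        H5Aclose(attr); H5Sclose(ascl);
        H5Dclose(dset); H5Sclose(space); H5Fclose(file);
        return 0;
    }

Because the caller only sees the wrapper, the same interface could later stream the data instead of writing a local file without changing application code.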

5
Data Movement
  • We generate TBs of data at NERSC, and possibly
    ORNL.
  • Need to move the data to local resources for
    multiple collaborators.
  • We have developed data streaming methods to
    automate this process.

6
Data movement motivation
  • Ad hoc transfer mechanisms are used by scientists
    for data transfer.
  • Transfer is done after the simulation.
  • scp is the norm for transferring data,
    typically 3 Mbps over the WAN.
  • Pattern of usage.
  • Scientists typically run large simulations on a
    large number of processors at a remote location
    (NERSC, ORNL).
  • Repeatedly analyze the simulation data on our
    local clusters.
  • Fusion codes such as GTC can spend over 10% of
    their time doing I/O.
  • Time on the supercomputer is for supercomputing.
  • Data analysis/visualization requires far fewer
    (<<) processors, O(10).

7
Data streaming challenges
  • Latency in transferring data
  • NERSC ↔ PPPL.
  • >50 ms TCP latency each way.
  • We can only run in batch on supercomputers.
  • Valuable CPU time on supercomputers.
  • Minimal overhead for IO during the calculations.
  • Basis of an Adaptive Workflow for fusion
    scientists.
  • Failure of links in a Wide Area Network.

8
Data Streaming development
  • Threaded parallel data streaming
  • Logistical Networking (LN).
  • Transfer TBs of data from simulations at NERSC
    to local analysis/visualization as the simulation
    executes.
  • High performance (up to 97 Mbps on a 100 Mbps link).
  • Low overhead (<5% overhead).
  • Adaptive transfer technology leading to latency
    aware transfer mechanisms.
  • First step in setting up a data pipeline between
    the simulation and processing routines
  • General workflow and data management in the
    fusion community.

9
Adaptive Buffer Data Management
  • Simple algorithm to manage the buffer
  • Adapts to both the computation's data output rate
    and network conditions.
  • Data generation rate is the data generated per
    time step (Mbps).
  • Data transfer rate is the transfer rate (Mbps).
  • Dynamically adjusts to the data generation rate.
  • Done by sending all the data that has accumulated
    since the last transfer, maximizing the network
    throughput.
  • Data sent in the current step is dependent on
    data generated in the previous step.
  • Loose feedback mechanism (see the sketch below).
  • If the data generation rate exceeds the data
    transfer rate, the queue manager
  • Concatenates multiple blocks to improve
    throughput (latency aware).
  • Increases multithreading in the transfer routines
    to improve throughput.
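
A minimal sketch of the adaptive loop described above, in C; every function and variable name here is hypothetical, not the actual implementation:

    /* One buffering thread: send whatever has accumulated since the
       last transfer, and adjust aggregation from the observed rates. */
    while (simulation_running) {
        /* queue all blocks the simulation produced this time step */
        enqueue_blocks(&queue, blocks_from_current_step());

        /* send everything accumulated since the last transfer, so the
           send size tracks the data generation rate */
        size_t pending   = queue_size(&queue);
        double xfer_rate = send_blocks(&queue, pending);   /* Mbps observed */

        /* loose feedback: if generation outpaces the network, the queue
           manager concatenates blocks and adds transfer threads */
        if (generation_rate(&queue) > xfer_rate) {
            concatenate_pending_blocks(&queue);
            increase_transfer_threads();
        }
    }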

10
Failsafe Mechanism
  • Primary reasons for failure (failover logic
    sketched below)
  • Buffer overflow at the receiving end.
  • Buffer not able to hold the simulated data.
  • Binary files are written to GPFS at NERSC.
  • A status signal is sent to the exnodercv program.
  • Data is re-fetched from the depots.
  • Network connection to our local depot is severed.
  • The transfer mechanism tries to upload to a
    close-by depot (NERSC/UTK).
  • Data is fetched during the creation of the
    post-processing routines.
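
A minimal sketch of this failover logic; the function names are hypothetical (only exnodercv, GPFS, and the NERSC/UTK depots come from the slide):

    if (receiver_buffer_overflow(status)) {
        /* fall back to binary files on GPFS at NERSC and tell the
           receiver, so the data can be re-fetched from the depots later */
        write_binary_to_gpfs(blocks, nblocks);
        notify_exnodercv(STATUS_WROTE_TO_GPFS);
    } else if (!depot_reachable(local_depot)) {
        /* local depot link severed: try a close-by depot instead */
        upload_blocks(nearest_depot(/* NERSC or UTK */), blocks, nblocks);
    }
    /* data parked on depots is fetched when the post-processing
       routines are created */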

11
Data Streaming adapts to the network
Figure 7 - Latency aware.
Figure 8 - Network-adaptive, self-adjusting buffer.
  • Latency awareness is inherent in the threaded
    buffer mechanism.
  • Concatenation of blocks occurs as soon as the
    first block is transferred: 1 MB is transferred,
    followed by 20 MB.
  • Whenever the data transfer rate dips, more data
    is sent out in the next step.
  • The buffer stabilizes itself after the initial
    transfer of 1 block of data
  • Stabilizes to send 16 blocks of data.
  • The data generation rate matches the transfer rate.
  • A state of equilibrium is reached in the transfer
    mechanism.
  • The scheme is network aware.

12
Data Streaming has high throughput
  • The buffering scheme can keep up with data
    generation rates of 85 Mbps from NERSC to PPPL.
  • The data was generated across 32 processors.
  • The network is easily saturated using the
    buffering scheme.
  • The maximum traffic between NERSC and PPPL is
    100 Mbps, of which we hit 97 Mbps.
  • ESnet router statistics confirm this.
  • The network traffic jumps at 22:00, when the
    simulation is in progress.

Figure 9 - Data generation rate of 85 Mbps on 32
nodes at NERSC streamed to clusters at PPPL.
Figure 10 - ESnet router statistics: peak transfer
rates of 97 Mbps on a 100 Mbps link at around 22:00
(5-minute average).
13
Data Streaming has low overhead
  • Definition of I/O overhead (see the formula below).
  • One way to compare is to write binary files (F77
    writes) to GPFS on the supercomputer nodes as
    they are being generated.
  • Concatenate them and do a concatenated block
    write instead of single-block writes.
  • Observation
  • GTC codes writing data to the GPFS (2 Mbps or less
    per node) have an overhead of 10%.
  • Our scheme always does better than local I/O.
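
As a rough reading of this definition (our interpretation, not a formula from the slides), the I/O overhead is the relative increase in per-step wall-clock time caused by writing or streaming the data:

    overhead (%) = 100 * (T_step_with_IO - T_step_compute_only) / T_step_compute_only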

Overhead of the buffering scheme as compared to
writing to the General Parallel File System (GPFS) at NERSC.
14
Why do you care?
[Timeline diagram comparing the two approaches:]
  • With ad hoc transfer mechanisms (ftp, scp), the
    simulation alternates computation and
    communication phases, and the analysis and
    transformation of data at the local end can only
    begin after the simulation ends and the local I/O
    completes.
  • With our buffering scheme, data is streamed during
    the run, so transformation and analysis of data
    begin at the local end while the simulation is
    still executing, and the initial data analysis and
    transformation finish shortly after the simulation
    ends.
15
Recent Work -1
  • We are hardening the routines.
  • They need to work for a large number of
    processors (N) streaming to a small number of
    processors (M).
  • Buffer overflows occur when transferring when
    N >> M.
  • Solution (see the sketch below).
  • Form MPI groups of G processors and transfer from
    G to 1.
  • On SMPs we transfer to local processors.
  • We are examining the extra overhead, but it does
    not affect the overhead by much.
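
A minimal sketch, assuming MPI in C, of the G-to-1 aggregation described above; the function and buffer names are illustrative, not the project's code:

    #include <mpi.h>
    #include <stdlib.h>

    /* Split the N compute ranks into groups of G and gather each group's
       buffer onto the group leader, which hands it to the streaming layer. */
    void stream_via_group_leader(const char *buf, int nbytes, int G)
    {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Comm group;
        MPI_Comm_split(MPI_COMM_WORLD, rank / G, rank, &group);

        int grank, gsize;
        MPI_Comm_rank(group, &grank);
        MPI_Comm_size(group, &gsize);

        /* gather per-rank sizes, then the buffers themselves, on rank 0 */
        int *counts = NULL, *displs = NULL, total = 0;
        char *agg = NULL;
        if (grank == 0) counts = malloc(gsize * sizeof(int));
        MPI_Gather(&nbytes, 1, MPI_INT, counts, 1, MPI_INT, 0, group);

        if (grank == 0) {
            displs = malloc(gsize * sizeof(int));
            for (int i = 0; i < gsize; i++) { displs[i] = total; total += counts[i]; }
            agg = malloc(total);
        }
        MPI_Gatherv(buf, nbytes, MPI_CHAR, agg, counts, displs, MPI_CHAR,
                    0, group);

        /* if (grank == 0) enqueue_for_streaming(agg, total);  hypothetical */

        free(agg); free(displs); free(counts);
        MPI_Comm_free(&group);
    }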

16
Recent Work - 2
  • We are providing more XML metadata inside the
    eXnode.
  • Provides us with information to understand raw
    data from the G → 1 processor transformation.
  • Provides us with simple statistics (min, max,
    mean) of the raw data (see the sketch below).
  • Provides us with information for the next stage
    of the workflow.
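
A minimal sketch of how such statistics could be computed per block before being embedded in the eXnode XML; the function name is illustrative:

    #include <stddef.h>

    /* Compute min, max, and mean of one raw-data block. */
    void block_stats(const double *data, size_t n,
                     double *min, double *max, double *mean)
    {
        double lo = data[0], hi = data[0], sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (data[i] < lo) lo = data[i];
            if (data[i] > hi) hi = data[i];
            sum += data[i];
        }
        *min = lo; *max = hi; *mean = sum / (double)n;
    }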

17
Global Terascale Data Management using Logistical
Networking
  • Micah Beck, Director, Associate Professor,
    Logistical Computing and Internetworking
    Laboratory

18
What is Logistical Networking (LN)?
  • Logistical Networking is a storage architecture
    based on exposed, generic, shared infrastructure
  • Resources directly accessible by applications
  • Disk and tape have a consistent interface
  • Lightweight authorization, no accounting!
  • Storage depots are easy to deploy and operate
  • Service levels are set by the operator
  • Maximum allocation size and duration can be imposed
  • Open-source, portable client tools and libraries
  • Linux, Solaris, Mac OS X, Windows, AIX, others
  • Some Java tools

19
The exNode and LN Tools
  • The exNode is a form for representing files that
    are stored on distributed LN depots
  • Large files are implemented by aggregating
    smaller allocations, possibly on multiple depots
  • Replicated data can be represented, possibly on
    different media types (e.g., disk caches, HPSS)
  • exNodes are stored and transferred as XML-encoded
    files, much smaller than the dataset
  • Logistical Networking user-level libraries and
    tools read, write, and manage exNodes like local
    files
  • Logistical networking puts applications in
    control!

20
Data Management Tools: NetCDF and HDF5
  • Limitations of the current implementation
  • Requires localization of the entire file to the
    access site
  • Datasets can be too large for the user to store
    locally - if so, they can't browse
  • Scientists sometimes require reduced views or
    subsets which are much smaller than the entire
    file (see the example below)
  • Adapting data localization to application
    requirements
  • Data movement must be performed on demand
  • Local caching of data must be informed by
    application-specific predictions
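
For example, a reduced view can be read with the standard NetCDF C API by requesting only a hyperslab; the file name, variable name, and dimensions below are hypothetical:

    #include <netcdf.h>

    /* Read one time slice of a large 3-D variable instead of
       localizing the whole file. */
    int read_subset(float *slab)
    {
        int ncid, varid;
        size_t start[3] = { 100, 0, 0 };    /* time, theta, radius */
        size_t count[3] = { 1, 64, 128 };

        if (nc_open("gtc_potential.nc", NC_NOWRITE, &ncid)) return 1;
        if (nc_inq_varid(ncid, "potential", &varid)) return 1;
        if (nc_get_vara_float(ncid, varid, start, count, slab)) return 1;
        return nc_close(ncid);
    }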

21
NetCDF/L and HDF5/L
  • Modified NetCDF that reads/writes data directly
    from/to the network using the exNode file
    representation
  • Implements the standard NetCDF API
  • Passes the NetCDF acceptance test
  • Allows NetCDF applications to be ported without
    modification to the source code (see the example
    below)
  • Recognizes a special URL that indicates a file
    should be interpreted as an exNode (lors://)
  • Based on libxio, a port of the UNIX I/O library
    to exNodes
  • NetCDF/L is available; HDF5/L is under development
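
A minimal sketch of what the slide describes: the application code is unchanged, and only the file name passed to nc_open is a lors:// URL that NetCDF/L interprets as an exNode (the URL itself is a made-up example):

    #include <netcdf.h>

    int open_exnode(void)
    {
        int ncid;
        int status = nc_open("lors://depot.example.org/gtc_run.xnd",
                             NC_NOWRITE, &ncid);
        if (status != NC_NOERR) return status;
        /* ... standard NetCDF calls, unchanged ... */
        return nc_close(ncid);
    }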

22
Future Work
  • Integration with tape archives, HPSS
  • Implementation of application-driven DM policy
  • Caching/prefetching
  • Data distribution
  • Integration of exNodes with directories and
    databases containing application metadata
  • Improved adaptation to transient network faults
    and performance problems
  • Recovery from catastrophic storage failures
  • Working in disconnected environments
  • DM operations performed at LN depots

23
Data Visualization
  • UC Davis is working on high-end visualization; we
    will hear about this in the next talk.
  • We will loosely couple this into our Integrated
    Data Analysis Environment (IDAVE).
  • The scene graph will come from their routines;
    reading will be done in Express.
  • We have incorporated Lin's IDL routines into C.
  • We are putting these inside of Express.
  • We really need to finalize our file format for
    dumps and viz files.

24
UC Davis
  • Accomplishments
  • A new interpolation scheme for the twisted-grid
    potential data
  • Hardware-accelerated visualization of potential
    data
  • Direct visualization of particle data
  • Future Plans
  • We plan to investigate the following problems
  • Interactive browsing of particle data
  • Feature extraction for particle data
  • Interactive browsing of time-varying potential
    data
  • Expressive rendering methods that enhance the
    perception of spatial and temporal features of
    interest.
  • Simultaneous visualization of particle and
    potential data

25
Particle Visualization
26
Workflows
  • We are starting to work with the SDM center on
    workflows.
  • We have a funded proposal with Nevins to
    integrate GKV IDL code into our workflow.
  • Later, the IDL code will be converted to a real
    language.
  • We will look at parallel 3D correlation
    functions.
  • Nevins' routines could be replaced by other
    routines.
  • It will be up to GTC users to decide the data
    analysis routines they want automated.

27
Future Work
  • Incorporate the LN HDF5 routines into our DM
    section.
  • Work on automated workflows.
  • Incorporate advanced 3D particle viz. routines
    into our viz. package.
  • Harden our data streaming routines.
  • Enhance our IDAVE.