Integrating Online Compression to Accelerate Large-Scale Data Analytics Applications

1
Integrating Online Compression to Accelerate
Large-Scale Data Analytics Applications


  • Tekin Bicer, Jian Yin, David Chiu, Gagan
    Agrawal and Karen Schuchardt
  • Ohio State University, Washington State
    University, Pacific Northwest National Laboratory



2
Introduction
  • Scientific simulations and instruments can
    generate large amounts of data
  • E.g. Global Cloud Resolving Model (GCRM)
  • 1 PB of data for 4 km grid cells
  • Higher resolutions, more and more data
  • I/O operations become the bottleneck
  • Problems
  • Storage, I/O performance
  • Compression

3
Motivation
  • Generic compression algorithms
  • Good for low-entropy sequences of bytes
  • Scientific datasets are hard to compress
  • Floating point numbers: exponent and mantissa
  • Mantissa can be highly entropic
  • Using compression in applications is challenging
  • Suitable compression algorithms
  • Utilization of available resources
  • Integration of compression algorithms

4
Outline
  • Introduction
  • Motivation
  • Compression Methodology
  • Online Compression Framework
  • Experimental Results
  • Related Work
  • Conclusion

5
Compression Methodology
  • Common properties of scientific datasets
  • Multidimensional arrays
  • Consist of floating point numbers
  • Relationship between neighboring values
  • Domain specific solutions can help
  • Approach
  • Prediction-based differential compression
  • Predict the values of neighboring cells
  • Store the difference
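The approach above can be sketched in a few lines. This is a minimal illustration, not the authors' algorithm: it assumes single-precision values, uses the previous cell as the predictor, and takes the XOR of the two bit patterns as the "difference". The function names are hypothetical.

```python
import struct

def float_bits(x):
    """IEEE 754 single-precision bit pattern of x, as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_float(b):
    """Inverse of float_bits."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

def encode_residuals(values):
    """Predict each value as its predecessor and keep only the XOR difference."""
    prev = 0
    residuals = []
    for v in values:
        cur = float_bits(v)
        residuals.append(cur ^ prev)
        prev = cur
    return residuals

def decode_residuals(residuals):
    """Rebuild the original values by undoing the XOR chain."""
    prev = 0
    values = []
    for r in residuals:
        cur = r ^ prev
        values.append(bits_float(cur))
        prev = cur
    return values

# Neighboring temperatures share sign and exponent bits, so every residual
# after the first has its top 9 bits equal to zero.
temps = [287.25, 287.5, 287.75, 288.0, 288.0]
residuals = encode_residuals(temps)
restored = decode_residuals(residuals)
```

The residuals are small integers with many leading zero bits, which is what makes them cheap to store.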

6
Example: GCRM Temperature Variable Compression
  • E.g. Temperature record
  • The values of neighboring cells are highly
    related
  • X table (after prediction)
  • X compressed values
  • 5 bits for prediction difference
  • Lossless and lossy compression
  • Fast and good compression ratios
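One plausible reading of the 5-bit field, shown here only as an assumption, is a header that records how many leading zero bits a 32-bit prediction difference has, followed by the remaining bits. The sketch below packs differences that way into a string of '0'/'1' characters for readability; `pack` and `unpack` are hypothetical names, and the slides do not confirm this exact layout.

```python
def pack(residuals):
    """Encode each 32-bit residual as a 5-bit leading-zero count plus payload."""
    parts = []
    for r in residuals:
        lz = min(32 - r.bit_length(), 31)  # a 5-bit header can hold 0..31
        width = 32 - lz                    # payload bits that still carry data
        parts.append(format(lz, "05b"))
        parts.append(format(r, "0{}b".format(width)))
    return "".join(parts)

def unpack(bitstring, count):
    """Decode `count` residuals written by pack()."""
    pos = 0
    residuals = []
    for _ in range(count):
        lz = int(bitstring[pos:pos + 5], 2)
        pos += 5
        width = 32 - lz
        residuals.append(int(bitstring[pos:pos + width], 2))
        pos += width
    return residuals

# Differences between neighboring cells are mostly small, so few payload bits remain.
sample = [0x438FA000, 24576, 8192, 0]
packed = pack(sample)
unpacked = unpack(packed, len(sample))
```

With these four sample differences the packed form is shorter than the 128 bits of the raw values, and decoding returns the input unchanged, so the scheme is lossless.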

7
Compression Framework
  • Improve end-to-end application performance
  • Minimize the application I/O time
  • Pipelining I/O and (de)compression operations
  • Hide computational overhead
  • Overlapping application computation with the
    compression framework
  • Easy implementation of different compression
    algorithms
  • Easy integration with applications
  • Similar API to POSIX I/O
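The pipelining idea can be illustrated with a small sketch: while the application computes on chunk i, a background thread already fetches and decompresses chunk i+1. This is an assumption-laden toy, with `zlib` standing in for the compression algorithm and a Python list standing in for storage; `load` and `process_all` are hypothetical names.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def load(store, i):
    """Stand-in for 'read chunk i from storage and decompress it'."""
    return zlib.decompress(store[i])

def process_all(store, compute):
    """Run compute() on every chunk while the next chunk loads in the background."""
    results = []
    if not store:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load, store, 0)
        for i in range(len(store)):
            chunk = pending.result()            # wait for the chunk we need now
            if i + 1 < len(store):
                pending = pool.submit(load, store, i + 1)  # start the next one
            results.append(compute(chunk))      # overlaps with the background load
    return results

store = [zlib.compress(bytes([i]) * 1000) for i in range(4)]
sums = process_all(store, lambda chunk: sum(chunk))
```

When the computation per chunk takes at least as long as a load, the loading cost is hidden almost entirely behind it.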

8
A Compression Framework for Data Intensive
Applications
  • Chunk Resource Allocation (CRA) Layer
  • Initialization of the system
  • Generate chunk requests, enqueue processing
  • Converts original offset and data size requests
    to their compressed equivalents
  • Parallel Compression Engine (PCE)
  • Applies encode(), decode() functions to chunks
  • Manages in-memory cache with informed prefetching
  • Creates I/O requests
  • Parallel I/O Layer (PIOL)
  • Creates parallel chunk requests to storage medium
  • Each chunk request is handled by a group of
    threads
  • Provides abstraction for different data transfer
    protocols
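The offset conversion done by the CRA layer can be sketched as follows. The sketch assumes fixed-size chunks of original data and a per-chunk index of (compressed offset, compressed size) pairs; both the index layout and the name `chunk_requests` are illustrative assumptions, not details taken from the slides.

```python
def chunk_requests(offset, size, chunk_size, index):
    """Map a byte range of the original data to compressed-chunk reads.

    Returns tuples (chunk_id, comp_offset, comp_size, start, end), where
    start/end select the wanted bytes inside the decompressed chunk.
    """
    if size <= 0:
        return []
    first = offset // chunk_size
    last = (offset + size - 1) // chunk_size
    requests = []
    for cid in range(first, last + 1):
        comp_offset, comp_size = index[cid]
        base = cid * chunk_size
        start = max(offset, base) - base
        end = min(offset + size, base + chunk_size) - base
        requests.append((cid, comp_offset, comp_size, start, end))
    return requests

# Three 100-byte chunks compressed to 40, 55 and 30 bytes.
index = [(0, 40), (40, 55), (95, 30)]
reqs = chunk_requests(offset=150, size=100, chunk_size=100, index=index)
```

Here a 100-byte read starting at original offset 150 touches the second half of chunk 1 and the first half of chunk 2, so two compressed reads are issued.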

9
Compression Framework API
  • User-defined functions
  • encode_t() (R): code for compression
  • decode_t() (R): code for decompression
  • prefetch_t() (O): informed prefetching function
  • Application can use the functions below
  • comp_read: applies decode_t to a compressed chunk
  • comp_write: applies encode_t to an original chunk
  • comp_seek: mimics fseek, also utilizes prefetch_t
  • comp_init: initializes the system (thread pools,
    cache, etc.)
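To show how these calls might fit together, here is a hypothetical Python analog. Only the names comp_init, comp_read, comp_write and comp_seek come from the slide; the signatures, the `CompFile` class, the in-memory chunk list and the use of `zlib` as the encode/decode pair are all assumptions. The sketch also assumes one sequential write, so that only the last chunk may be short.

```python
import zlib

class CompFile:
    """In-memory stand-in for a compressed file handle."""

    def __init__(self, encode, decode, prefetch=None, chunk_size=1024):
        self.encode, self.decode, self.prefetch = encode, decode, prefetch
        self.chunk_size = chunk_size
        self.chunks = []   # compressed chunks, standing in for storage
        self.pos = 0       # position in the original (uncompressed) data
        self.hints = []    # chunk ids suggested by the prefetch function

    def comp_write(self, data):
        """Split data into chunks and store each one encoded."""
        for i in range(0, len(data), self.chunk_size):
            self.chunks.append(self.encode(data[i:i + self.chunk_size]))

    def comp_seek(self, offset):
        """Move the read position, fseek-style, and record prefetch hints."""
        self.pos = offset
        if self.prefetch:
            self.hints = self.prefetch(offset // self.chunk_size)

    def comp_read(self, size):
        """Decode the chunks covering the next `size` original bytes."""
        out = bytearray()
        while size > 0:
            cid, start = divmod(self.pos, self.chunk_size)
            if cid >= len(self.chunks):
                break
            piece = self.decode(self.chunks[cid])[start:start + size]
            if not piece:
                break
            out += piece
            self.pos += len(piece)
            size -= len(piece)
        return bytes(out)

def comp_init(encode, decode, prefetch=None, chunk_size=1024):
    return CompFile(encode, decode, prefetch, chunk_size)

f = comp_init(zlib.compress, zlib.decompress,
              prefetch=lambda cid: [cid + 1], chunk_size=8)
f.comp_write(b"abcdefghijklmnopqrstuvwxyz")
f.comp_seek(6)
data = f.comp_read(5)
```

The application addresses the data by original offsets throughout; the chunking and the encode/decode calls stay hidden behind the read and write functions.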

10
Prefetching and In-Memory Cache
  • Overlapping application layer computation with
    I/O
  • Reuse of already accessed data is low
  • Prefetching and caching the prospective chunks
  • Default is LRU
  • User can analyze history and provide prospective
    chunk list
  • Cache uses row-based locking scheme for efficient
    consecutive chunk requests
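A chunk cache with LRU eviction and a user-supplied prefetch hint can be sketched as below. This is a simplified, single-threaded illustration: prefetching happens synchronously, the row-based locking mentioned above is left out, and `ChunkCache`, `load` and `prefetch` are hypothetical names.

```python
from collections import OrderedDict

class ChunkCache:
    """LRU cache of decompressed chunks with optional informed prefetching."""

    def __init__(self, capacity, load, prefetch=None):
        self.capacity = capacity          # assumed to be at least 1
        self.load = load                  # fetches and decodes one chunk
        self.prefetch = prefetch          # maps a chunk id to likely next ids
        self.entries = OrderedDict()
        self.loads = 0

    def _put(self, cid):
        if cid in self.entries:
            self.entries.move_to_end(cid)      # mark as most recently used
            return
        self.entries[cid] = self.load(cid)
        self.loads += 1
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict least recently used

    def get(self, cid):
        """Return (chunk, hit) and pull in the chunks the hint function names."""
        hit = cid in self.entries
        self._put(cid)
        chunk = self.entries[cid]
        if self.prefetch:
            for nxt in self.prefetch(cid):
                self._put(nxt)
        return chunk, hit

cache = ChunkCache(capacity=2,
                   load=lambda cid: ("chunk%d" % cid).encode(),
                   prefetch=lambda cid: [cid + 1])
hits = [cache.get(cid)[1] for cid in (0, 1, 2)]
```

With a sequential access pattern and a "next chunk" hint, only the first access misses; every later chunk is already in the cache when it is requested.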

(Figure: informed prefetching via prefetch())
11
Integration with a Data-Intensive Computing System
  • MapReduce style API
  • Remote data processing
  • Sensitive to I/O bandwidth
  • Processes data in
  • local cluster
  • cloud
  • or both (Hybrid Cloud)

12
Outline
  • Introduction
  • Motivation
  • Compression Methodology
  • Online Compression Framework
  • Experimental Results
  • Related Work
  • Conclusion

13
Experimental Setup
  • Two datasets
  • GCRM: 375 GB (local: 270 GB, remote: 105 GB)
  • NPB: 237 GB (local: 166 GB, remote: 71 GB)
  • 16 x 8 cores (Intel Xeon, 2.53 GHz)
  • Storage of datasets
  • Lustre FS (14 storage nodes)
  • Amazon S3 (Northern Virginia)
  • Compression algorithms
  • CC, FPC, LZO, bzip, gzip, lzma
  • Applications: AT, MMAT, KMeans

14
Performance of MMAT
Compression Ratios
        Ratio    Compressed size
CC      51.68%   186 GB
LZO     20.40%   299 GB

Speedups
        Local   Remote   Hybrid
CC      1.63    1.90     1.85
LZO     1.04    1.24     1.14

I/O Throughput (128np), GB/sec
        Orig.   CC
Local   1.62    3.21
Remote  0.1     0.19
  • Breakdown of Performance
  • Overhead (Local): 15.41
  • Read Speedup: 1.96

15
Lossy Compression (MMAT)
  • Lossy
  • e: dropped bits
  • Error bound: 5 x (1/10^5)
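Dropping bits can be illustrated by zeroing the e lowest mantissa bits of a single-precision value, which leaves more zero bits for the lossless stage to exploit. This is a sketch under that assumption; the slides do not say which bits are dropped. For normalized values the relative error of such a truncation stays below 2^(e-23). That bound belongs to this sketch and is not claimed to match the error bound quoted on the slide.

```python
import struct

def to_f32(x):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def drop_bits(x, e):
    """Zero the e lowest mantissa bits of the single-precision value x."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    mask = 0xFFFFFFFF ^ ((1 << e) - 1)
    return struct.unpack("<f", struct.pack("<I", bits & mask))[0]

def max_relative_error(values, e):
    """Largest relative error introduced by dropping e bits."""
    return max(abs(v - drop_bits(v, e)) / abs(v) for v in values)

values = [to_f32(v) for v in (287.3, 288.1, 290.7)]
errors = {e: max_relative_error(values, e) for e in (2, 4, 8)}
```

Each additional dropped bit doubles the worst-case error, so the number of dropped bits is a direct knob for trading accuracy against compressed size.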

Compression Ratios
           Ratio    Compressed size
Lossless   51.68%
2e         56.88%   162 GB
4e         62.93%   139 GB

Speedups
              Local   Remote   Hybrid
2e vs CC      1.07    1.18     1.09
4e vs CC      1.13    1.43     1.18
4e vs orig.   1.76    2.41     2.18
16
Performance of KMeans
  • NPB dataset
  • Compression ratio: 24.01% (180 GB)
  • More computation
  • More opportunity to fetch and decompress

Speedups
        Local   Remote   Hybrid
FPC     0.75    1.30     1.12

Speedups w/ multithreading
                    Local   Remote   Hybrid
2P - 4IO            1.25    1.17     1.19
4P - 8IO            1.37    1.16     1.21
4P - 8IO vs Orig.   1.03    1.51     1.36
17
Conclusion
  • Management and analysis of scientific datasets
    are challenging
  • Generic compression algorithms are inefficient
    for scientific datasets
  • We proposed a compression framework and
    methodology
  • Domain specific compression algorithms are fast
    and space efficient
  • 51.68% compression ratio
  • 53.27% improvement in exec. time
  • Easy plug-and-play of compression
  • Integration of the proposed framework and
    methodology with a data analysis middleware

18
Thanks!
19
Multithreading and Prefetching
  • Different numbers of PCE and I/O threads
  • 2P - 4IO
  • 2 PCE threads, 4 I/O threads
  • One core is assigned to the compression framework
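The "2P - 4IO" configuration can be pictured as two separate thread pools, one for I/O and one for the compression engine. The sketch below is an assumption about how such a split might look, with a Python list standing in for storage and `zlib` for the decode function; `run` is a hypothetical name.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def run(store, pce_threads=2, io_threads=4):
    """Fetch chunks on the I/O pool and decompress them on the PCE pool."""
    with ThreadPoolExecutor(io_threads) as io_pool, \
         ThreadPoolExecutor(pce_threads) as pce_pool:
        # Each I/O task stands in for one chunk request to the storage medium.
        fetched = [io_pool.submit(store.__getitem__, i)
                   for i in range(len(store))]
        # Each PCE task waits for its chunk, then decodes it.
        decoded = [pce_pool.submit(lambda f=f: zlib.decompress(f.result()))
                   for f in fetched]
        return [d.result() for d in decoded]

original = [bytes([i]) * 512 for i in range(8)]
store = [zlib.compress(chunk) for chunk in original]
chunks = run(store)
```

Keeping the pools separate means a slow storage read ties up only an I/O thread, and the ratio between the two pool sizes can be tuned independently.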

Speedups
           Local   Remote   Hybrid
2P - 4IO   0.88    1.13     1.05
4P - 8IO   0.86    1.10     1.04
20
Related Work
  • (Scientific) data management
  • NetCDF, PNetCDF, HDF5
  • Nicolae et al. (BlobSeer)
  • Distributed data management service for efficient
    reading, writing and appending ops.
  • Compression
  • Generic: LZO, bzip, gzip, szip, LZMA, etc.
  • Scientific
  • Schendel and Jin et al. (ISOBAR)
  • Organizes highly entropic data into compressible
    data chunks
  • Burtscher et al. (FPC)
  • Efficient double-precision floating point
    compression
  • Lakshminarasimhan et al. (ISABELA)