Multi-dimensional Index on Hadoop Distributed File System

Learn more at: https://www.cs.odu.edu

1
Multi-dimensional Index on Hadoop Distributed
File System

Haojun Liao, Jizhong Han and Jinyun Fang

- Vikas Gonti
2
Contents

Introduction
Design Overview
Details of Implementation
Experimental Evaluation
Conclusion and Future work

3
Introduction

Why do we need to improve query performance?
Large volumes of data
Cost
Scalability
Spatial access methods

4
HDFS

HDFS is an open-source implementation of a distributed file system (DFS).

5
Index Structure in HDFS

HDFS offers an effective way of manipulating and
maintaining an index, as in a single-processor
environment.
No significant modifications to the original
index structure are required for it to function
properly.
Query processing using the index can be integrated
with the MapReduce framework to improve query
efficiency.
Answering queries through access methods can be
more efficient than sequential scanning.

6
Design Overview

R-tree-like Index Structure
The R-tree, a disk-based hierarchical structure
based on the B-tree, is one of the most commonly
used multi-dimensional indexes.
Each R-tree node is implemented as a disk page.
All index nodes in an index are the same size.

7
Structure of Index Frame
8
Query Processing

Only a small portion of the index nodes is involved
in handling high-selectivity query types, i.e.,
point queries, range queries, and nearest-neighbor
queries.
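
The pruning behind this claim can be sketched as follows. This is a minimal in-memory illustration, not the authors' code: the `Node` class, the `range_query` function, and the rectangle layout `(xmin, ymin, xmax, ymax)` are assumptions made for the example.

```python
from dataclasses import dataclass, field

# A rectangle (MBR) is assumed to be a tuple (xmin, ymin, xmax, ymax).

def intersects(a, b):
    """Return True when two rectangles overlap (touching edges count)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

@dataclass
class Node:
    is_leaf: bool
    # Leaf entries: (mbr, object_id). Internal entries: (mbr, child Node).
    entries: list = field(default_factory=list)

def range_query(root, window):
    """Collect ids of objects whose MBR intersects the window.

    Returns (result_ids, visited_node_count) so the pruning is visible.
    """
    results, visited = [], 0
    stack = [root]
    while stack:
        node = stack.pop()
        visited += 1
        for mbr, payload in node.entries:
            if not intersects(mbr, window):
                continue  # prune: this object or whole subtree is skipped
            if node.is_leaf:
                results.append(payload)
            else:
                stack.append(payload)
    return results, visited

# Tiny hypothetical tree: one root with two leaves.
left = Node(True, [((0, 0, 1, 1), "a"), ((2, 2, 3, 3), "b")])
right = Node(True, [((10, 10, 11, 11), "c"), ((12, 12, 13, 13), "d")])
root = Node(False, [((0, 0, 3, 3), left), ((10, 10, 13, 13), right)])

ids, visited = range_query(root, (0.5, 0.5, 2.5, 2.5))
```

In this toy tree the query window touches only the left subtree, so the right leaf is never read.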

9
Details of Implementation

Buffer Management
Node Size
Index Structure
Query Optimization issues
Data Transfer Model

10
Details of Implementation

Buffer Management
Different strategies are applied to the internal
nodes and the leaf nodes of the index.
Internal nodes account for a small fraction of
the total space.
All internal nodes are kept in the buffer once
loaded.
They are accessed frequently.
In contrast:
Only a certain number of buffer pages are
allocated for leaf nodes.
Leaf nodes take up a relatively large space.
They are visited on demand.
More buffer space for leaf nodes decreases disk
accesses and reduces the response time.

11
Buffer Management (cont..)

Buffer strategy
Internal nodes are pinned in the buffer once loaded,
for future node accesses.
Leaf nodes are allocated a limited number of buffer
pages, distributed evenly across the data nodes
and managed by an LRU policy.
Data transfer procedure
When a data request arrives, the data node checks
its buffer first.
If the required data packet is held in the
buffer, it is sent to the client.
Otherwise, a disk access is invoked.
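
The buffer strategy above can be sketched for a single data node as below. This is an illustrative sketch only: the `NodeBuffer` class, its `get` method, and the `read_from_disk` callback are hypothetical names, and distributing the pages across several data nodes is not modelled.

```python
from collections import OrderedDict

class NodeBuffer:
    """Buffer that pins internal nodes and keeps leaf nodes under LRU."""

    def __init__(self, leaf_capacity, read_from_disk):
        self.leaf_capacity = leaf_capacity    # max number of buffered leaf pages
        self.read_from_disk = read_from_disk  # callable: node_id -> page bytes
        self.pinned = {}                      # internal nodes, never evicted
        self.leaves = OrderedDict()           # leaf nodes, oldest first
        self.disk_reads = 0

    def get(self, node_id, is_leaf):
        if not is_leaf:
            if node_id not in self.pinned:
                self.pinned[node_id] = self._load(node_id)
            return self.pinned[node_id]
        if node_id in self.leaves:
            self.leaves.move_to_end(node_id)  # mark as most recently used
            return self.leaves[node_id]
        page = self._load(node_id)
        self.leaves[node_id] = page
        if len(self.leaves) > self.leaf_capacity:
            self.leaves.popitem(last=False)   # evict least recently used leaf
        return page

    def _load(self, node_id):
        self.disk_reads += 1                  # buffer miss: go to disk
        return self.read_from_disk(node_id)

# Usage with a fake disk that fabricates a page for any node id.
buf = NodeBuffer(leaf_capacity=2,
                 read_from_disk=lambda nid: f"page-{nid}".encode())
buf.get("root", is_leaf=False)
buf.get("root", is_leaf=False)                # served from the pinned set
for leaf in ["L1", "L2", "L1", "L3", "L2"]:
    buf.get(leaf, is_leaf=True)
```

With room for only two leaf pages, the access sequence evicts L2 and then L1, so five of the seven requests reach the disk.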

12
Node Size

Many factors are involved in determining the
index node size, such as transfer overhead, I/O
costs and CPU time.
Small index nodes incur more data transmissions.
Large index nodes reduce the data transfer costs.
Large nodes require more data to be processed per
query.
The size of the index node should be aligned to
the size of the data packet, the minimum unit of
data transfer in HDFS.
This avoids transmitting unnecessary data.
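
The alignment argument can be made concrete with a little arithmetic. The sketch below assumes that a node starts on a packet boundary and that the data node always sends whole packets; `transfer_cost` is a hypothetical helper, not part of HDFS.

```python
def transfer_cost(node_size, packet_size):
    """Return (packets, bytes_sent, wasted_bytes) for reading one index node.

    Assumes the node starts on a packet boundary and that only whole
    packets are transmitted.
    """
    packets = -(-node_size // packet_size)    # ceiling division
    bytes_sent = packets * packet_size
    return packets, bytes_sent, bytes_sent - node_size

KB = 1024
for node_kb in (8, 16, 64):
    # One node read with the default 64 KB packet size.
    print(node_kb, transfer_cost(node_kb * KB, 64 * KB))
```

Under these assumptions an 8 KB node read through 64 KB packets carries 56 KB of unneeded data, while a node whose size equals the packet size wastes nothing.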

13
Index Structure

Metadata is located at the front of the file and
occupies 1 KB of disk space.
Internal nodes, as well as the metadata, need to fit
in the first data chunk.
The remaining space in the first data chunk of the
index file is left blank for the extension of
internal nodes.

14
Index Structure (cont..)

Leaf nodes are grouped into several data chunks
according to their location proximity.
We align the next leaf node to the start position
of the next data chunk.
In an internal node, each entry includes the
sub-tree identifier and the MBR (minimum bounding
rectangle).
In a leaf node, each entry includes the MBR
and a pointer to the corresponding data
object.
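
One way to read this layout is as a set of offset calculations. The sketch below is an interpretation under stated assumptions: the 1 KB metadata size comes from the slides, while the function names, the chunk size used in the example, and the rule that a leaf never straddles a chunk boundary are assumptions. Grouping leaves by proximity is not modelled.

```python
METADATA_SIZE = 1024  # 1 KB of metadata at the front of the index file

def max_internal_nodes(chunk_size, node_size):
    """How many internal nodes fit in the first chunk after the metadata."""
    return (chunk_size - METADATA_SIZE) // node_size

def internal_node_offset(i, node_size):
    """Byte offset of the i-th internal node, stored right after the metadata."""
    return METADATA_SIZE + i * node_size

def leaf_node_offset(j, node_size, chunk_size):
    """Byte offset of the j-th leaf node.

    Leaves start at the second chunk. When the next leaf would not fit
    in the current chunk, it is placed at the start of the next chunk.
    """
    per_chunk = chunk_size // node_size
    chunk_index = 1 + j // per_chunk          # chunk 0 is reserved
    return chunk_index * chunk_size + (j % per_chunk) * node_size

# Hypothetical sizes: 1 MB chunks and 64 KB nodes.
CHUNK, NODE = 1024 * 1024, 64 * 1024
first_leaf = leaf_node_offset(0, NODE, CHUNK)
seventeenth_leaf = leaf_node_offset(16, NODE, CHUNK)
```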

15
Query Optimization issues

Ordered entries can speed up the in-memory search
procedure by reducing the computation required.
Once the entries of each index node are sorted
according to some spatial criterion, the order
can be well preserved in a static environment.
The point and range query processing within each
node costs O(n/2w) time, where n is the number
of entries in each node and w is the cardinality
of the object set.
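
How ordering can cut the work inside a node is sketched below. The slides do not say which spatial criterion is used, so sorting by the lower x coordinate is an assumption, and `search_sorted_node` is a hypothetical helper. Rectangles are `(xmin, ymin, xmax, ymax)` tuples.

```python
def search_sorted_node(entries, window):
    """Scan one node whose entries are sorted by xmin in ascending order.

    entries: list of (mbr, payload) pairs.
    Returns (matches, examined), where examined counts the entries that
    were actually tested against the window.
    """
    matches, examined = [], 0
    for mbr, payload in entries:
        if mbr[0] > window[2]:
            break  # every later entry starts to the right of the window
        examined += 1
        if mbr[2] >= window[0] and mbr[1] <= window[3] and mbr[3] >= window[1]:
            matches.append(payload)
    return matches, examined

node_entries = [
    ((0, 0, 1, 1), "a"),
    ((2, 0, 3, 1), "b"),
    ((5, 0, 6, 1), "c"),
    ((8, 0, 9, 1), "d"),
]
matches, examined = search_sorted_node(node_entries, (1.5, 0, 2.5, 1))
```

Here the scan stops at the third entry, so only two of the four entries are compared in full.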

16
Data Transfer Model

The data node pushes data to the client in
data packets of 64 KB by default, to obtain the best
performance for sequential reads.
We implement a new data transfer protocol to
facilitate random reads for block-based index
structures.
The main difference from the PUSH model is that the
data node blocks after transmitting one data
packet.
The data node blocks in favor of random
access, where the client might not need the
packets that follow sequentially.
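
The difference between the two models can be illustrated with a toy simulation. This is not the HDFS wire protocol: `PushDataNode` and `BlockingDataNode` are hypothetical classes, and the push variant models the worst case in which the data node streams to the end of the block before the client can stop it.

```python
class PushDataNode:
    """Streams every packet from the requested position to the block end."""

    def __init__(self, packets):
        self.packets = packets
        self.sent = 0                  # total packets put on the wire

    def read(self, start):
        out = self.packets[start:]
        self.sent += len(out)
        return out[0]                  # the client only wanted the first one

class BlockingDataNode:
    """Sends one packet, then waits for the next explicit request."""

    def __init__(self, packets):
        self.packets = packets
        self.sent = 0

    def read(self, index):
        self.sent += 1
        return self.packets[index]

block = [f"packet-{i}" for i in range(8)]   # one block of 8 packets
wanted = [7, 2, 5]                          # random index-node reads

push, blocking = PushDataNode(block), BlockingDataNode(block)
push_result = [push.read(i) for i in wanted]
blocking_result = [blocking.read(i) for i in wanted]
```

Both clients receive the same three packets, but in this worst-case model the push node transmits ten packets while the blocking node transmits three.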

17
Data Transfer Model (Cont..)

[Figure: Original Push Model compared with Our Transfer Model]
18
Experimental Evaluations

Datasets
The following real-world datasets are used in
the experiments.
CAR contains 2,249,727 road segments extracted
from the TIGER/Line datasets.
HYD contains 40,995,718 line segments
representing rivers of China.
TLK contains up to 157,425,887 points extracted
from the elevation data of China.

19
Data transfer overhead

The transferred data is compared with the required
data while varying the size and count of read
operations on the HYD dataset.

20
Performance

This result is based on the TLK dataset. The data
packet size during transfer ranges from 8 KB to 64 KB.
The new protocol offers the best performance for
all access patterns when the data packet size is set
to 64 KB.

21
Effects of index node size

Response time of range queries and point queries on
the CAR dataset.
Range query performance is worst when the
index node size is set to 16 KB, and it
degrades rapidly as the query window grows.
The response time for point queries is longest when
the index node size is set to 8 KB, because the index
has more levels and more node visits are involved
with a smaller node size.
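
The link between node size and the number of levels can be checked with a back-of-the-envelope calculation. The sketch below assumes completely full nodes and a 40-byte entry (four 8-byte MBR coordinates plus an 8-byte pointer); both are assumptions, and `tree_height` is a hypothetical helper rather than a measurement from the experiments.

```python
def tree_height(num_objects, node_size, entry_size=40):
    """Minimum number of levels needed to index num_objects.

    Assumes every node is completely full, so the fan-out is
    node_size // entry_size.
    """
    fanout = node_size // entry_size
    if fanout < 2:
        raise ValueError("node too small to hold two entries")
    height, capacity = 1, fanout
    while capacity < num_objects:      # add levels until all objects fit
        capacity *= fanout
        height += 1
    return height

CAR_OBJECTS = 2_249_727                # road segments in the CAR dataset
heights = {kb: tree_height(CAR_OBJECTS, kb * 1024) for kb in (8, 16, 32, 64)}
```

Under these assumptions the tree needs three levels with 8 KB nodes and two levels with 64 KB nodes; real trees are not fully packed, so actual heights may be larger.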

22
Effects of Buffer

We vary the buffer size to evaluate the effect of
the buffer on query performance, while the other
parameters are kept constant.
We perform 100 range queries, each covering 1% of
the total space, on the TLK dataset.
The response time decreases as the available
buffer increases. Further improvement in
response time can be expected as the buffer
size increases.

23
Conclusion and Future work

Conclusion
We propose a method for organizing hierarchical
structures on HDFS that applies to both the B-tree
and the R-tree.
We investigate several system parameters, such as
node size, index distribution, buffer, and query
processing techniques.
A data transfer protocol designed for block-wise
random reads is integrated with HDFS.
Future work
Investigate how to combine MapReduce with the
index structure.
Explore an efficient multi-dimensional data
distribution strategy based on the index
structure to further enhance I/O performance.