Multi-dimensional Index on Hadoop Distributed File System

Provided by: vgo8
Learn more at: https://www.cs.odu.edu


1
Multi-dimensional Index on Hadoop Distributed
File System
  • Haojun Liao, Jizhong Han and Jinyun Fang

- Vikas Gonti
2
Contents
  • Introduction
  • Design Overview
  • Details of Implementation
  • Experimental Evaluation
  • Conclusion and Future work

3
Introduction
  • Why do we need to improve query performance?
  • Large volumes of data
  • Cost
  • Scalability
  • Spatial access methods.

4
HDFS
  • HDFS is an open-source implementation of a
    distributed file system (DFS)

5
Index Structure in HDFS
  • HDFS offers an effective way of manipulating and
    maintaining an index, as in a single-processor
    environment.
  • No significant modifications of the original
    index structure are required to enable its
    appropriate function.
  • Index-based query processing can be integrated
    with the MapReduce framework to improve query
    efficiency.
  • Answering queries by using access methods can be
    more efficient than by using sequential scanning.

6
Design Overview
  • R-tree-like index structure
  • The R-tree, a disk-based hierarchical structure
    based on the B-tree, is one of the most commonly
    used multi-dimensional indexes.
  • Each R-tree node is implemented as a disk page.
  • All index nodes in an index are the same size.

7
Structure of Index Frame
8
Query Processing
  • Only a small portion of index nodes are involved
    in handling high selectivity query types, i.e.,
    point query, range query, and nearest neighbor
    query.

9
Details of Implementation
  • Buffer Management
  • Node Size
  • Index Structure
  • Query Optimization issues
  • Data Transfer Model

10
Details of Implementation
  • Buffer Management
  • Different strategies are applied to the internal
    nodes and leaf nodes of the index.
  • Internal nodes account for a small fraction of
    the total space.
  • All internal nodes are kept in the buffer once
    loaded.
  • They are accessed frequently.
  • In contrast,
  • Only a limited number of buffer pages are
    allocated for leaf nodes.
  • They take up a relatively large space.
  • They are visited on demand.
  • More buffer space for leaf nodes decreases disk
    accesses and reduces response time.

11
Buffer Management (cont..)
  • Buffer strategy
  • Internal nodes are pinned in the buffer once
    loaded, for future node accesses.
  • Leaf nodes are allocated a limited number of
    buffer pages, distributed evenly across the data
    nodes and managed by an LRU policy.
  • Data transfer procedure
  • When a data request arrives, the data node checks
    its buffer first.
  • If the required data packet is held in the
    buffer, it is sent to the client.
  • Otherwise, a disk access is invoked.
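
The buffer strategy above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name NodeBuffer, the read_from_disk callback, and the disk_reads counter are all hypothetical names introduced here. Internal nodes are pinned once loaded, while leaf nodes share a fixed number of pages managed by LRU.

```python
from collections import OrderedDict


class NodeBuffer:
    """Hypothetical buffer: pinned internal nodes, LRU-managed leaf nodes."""

    def __init__(self, leaf_capacity, read_from_disk):
        self.pinned = {}                  # internal nodes, never evicted
        self.leaves = OrderedDict()       # leaf nodes, least recently used first
        self.leaf_capacity = leaf_capacity
        self.read_from_disk = read_from_disk  # assumed callback: node_id -> page
        self.disk_reads = 0

    def _load(self, node_id):
        # Buffer miss: fall back to a disk access.
        self.disk_reads += 1
        return self.read_from_disk(node_id)

    def get(self, node_id, is_leaf):
        if not is_leaf:
            # Internal node: load once, then keep it pinned.
            if node_id not in self.pinned:
                self.pinned[node_id] = self._load(node_id)
            return self.pinned[node_id]
        if node_id in self.leaves:
            # Leaf hit: mark as most recently used.
            self.leaves.move_to_end(node_id)
            return self.leaves[node_id]
        page = self._load(node_id)
        self.leaves[node_id] = page
        if len(self.leaves) > self.leaf_capacity:
            # Evict the least recently used leaf page.
            self.leaves.popitem(last=False)
        return page
```

With a leaf capacity of 2, reading leaves 10, 11, 10, 12 in that order evicts leaf 11, because leaf 10 was touched more recently.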

12
Node Size
  • Many factors are involved in determining the
    index node size, such as transfer overhead, I/O
    costs and CPU time.
  • Small index nodes incur more data transmissions.
  • Large index nodes are used in order to reduce
    the data transfer costs.
  • Large nodes require more data to be processed
    per query.
  • The size of the index node should be aligned to
    the size of the data packet, the minimum unit of
    data transfer in HDFS.
  • This avoids transmitting unnecessary data.
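
To make the alignment argument concrete, the sketch below counts how many data packets must be sent to deliver one index node and how many of the transferred bytes are not part of that node. The function names and the byte-offset model are assumptions made for illustration; the 64 KB default packet size is the figure given in the Data Transfer Model slide.

```python
PACKET_SIZE = 64 * 1024  # default data packet size stated in the slides


def packets_transferred(node_offset, node_size, packet_size=PACKET_SIZE):
    """Number of whole packets that cover the byte range of one node."""
    first = node_offset // packet_size
    last = (node_offset + node_size - 1) // packet_size
    return last - first + 1


def wasted_bytes(node_offset, node_size, packet_size=PACKET_SIZE):
    """Bytes sent over the wire that do not belong to the requested node."""
    sent = packets_transferred(node_offset, node_size, packet_size) * packet_size
    return sent - node_size
```

Under this model, a 64 KB node starting on a packet boundary wastes nothing, the same node shifted by 1 KB needs two packets, and an 8 KB node still costs a full 64 KB packet.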

13
Index Structure
  • Metadata is located at the front of the file and
    occupies 1 KB of disk space.
  • Internal nodes, as well as the metadata, need to
    fit in the first data chunk.
  • The remaining space in the first data chunk of
    the index file is left blank for the extension of
    internal nodes.

14
Index Structure (cont..)
  • Leaf nodes are grouped in several data chunks,
    according to the location proximity.
  • We align the next leaf node to the start position
    of the next data chunk.
  • In an internal node, each entry holds the
    sub-tree identifier and the MBR (minimum bounding
    rectangle).
  • In a leaf node, each entry holds the MBR and a
    pointer to the corresponding data object.
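
The file layout described on these two slides can be sketched roughly as below. Only the 1 KB metadata size comes from the slides; the function names, the chunk and node sizes passed in, and the rule that a leaf node which would straddle a chunk boundary is moved to the start of the next chunk are assumptions made for this illustration.

```python
METADATA_SIZE = 1024  # from the slides: metadata occupies 1 KB


def internal_node_capacity(chunk_size, node_size):
    """How many internal nodes fit in the first chunk after the metadata."""
    return (chunk_size - METADATA_SIZE) // node_size


def leaf_offsets(num_leaves, node_size, chunk_size):
    """Byte offsets of leaf nodes, which begin at the second chunk.

    Assumes node_size <= chunk_size. A leaf that would cross a chunk
    boundary is pushed to the start of the next chunk.
    """
    offsets = []
    pos = chunk_size  # the first chunk is reserved for metadata + internal nodes
    for _ in range(num_leaves):
        remaining = chunk_size - pos % chunk_size
        if remaining < node_size:
            pos += remaining  # skip the tail of the current chunk
        offsets.append(pos)
        pos += node_size
    return offsets
```

For example, with a 64 MB chunk and 64 KB nodes (both assumed values), internal_node_capacity returns 1023, since the metadata takes part of the space that a 1024th node would need.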

15
Query Optimization issues
  • Ordered entries can speed up the in-memory search
    procedure by reducing the computation needed.
  • Once the entries of each index node are sorted
    according to some spatial criterion, the order
    can be well preserved in a static environment.
  • The point and range query processing within each
    node costs O(n/2w) time, where n is the number
    of entries in each node and w is the cardinality
    of the object set.
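
One way ordered entries can cut the in-node search is sketched below. The slides do not name the sorting criterion, so sorting by the lower x-coordinate of each MBR is an assumption, as are the names Entry and SortedNode. The sketch does not attempt to reproduce the cost bound quoted above; it only shows that a binary search can discard every entry that starts to the right of the query window.

```python
from bisect import bisect_right
from typing import NamedTuple


class Entry(NamedTuple):
    xmin: float
    ymin: float
    xmax: float
    ymax: float
    ref: int  # sub-tree id or data object pointer


class SortedNode:
    """Hypothetical index node whose entries are kept sorted by xmin."""

    def __init__(self, entries):
        self.entries = sorted(entries, key=lambda e: e.xmin)
        self.xmins = [e.xmin for e in self.entries]  # precomputed search keys

    def range_search(self, qxmin, qymin, qxmax, qymax):
        # Entries with xmin > qxmax cannot intersect the window: skip them.
        cut = bisect_right(self.xmins, qxmax)
        return [
            e for e in self.entries[:cut]
            if e.xmax >= qxmin and e.ymin <= qymax and e.ymax >= qymin
        ]
```

A point query is the special case where the window collapses to a single point, that is, qxmin == qxmax and qymin == qymax.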

16
Data Transfer Model
  • By default, the data node pushes data to the
    client in 64 KB data packets to obtain the best
    performance for sequential reads.
  • We implement a new data transfer protocol to
    facilitate random reads for a block-based index
    structure.
  • The main difference from the push model is that
    the data node blocks after transmitting one data
    packet.
  • The data node blocks to favor random access,
    where the client might not need the subsequent
    data packets.
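
The contrast between the two models can be illustrated with Python generators. This is a toy simulation under stated assumptions, not the HDFS wire protocol: push_stream and blocking_data_node are hypothetical names, and a block is modeled as an in-memory byte string.

```python
def push_stream(block, start, packet_size):
    """Push model: after one request, consecutive packets keep coming."""
    for offset in range(start, len(block), packet_size):
        yield block[offset:offset + packet_size]


def blocking_data_node(block, packet_size):
    """Blocking model: send one packet, then wait for the next requested offset."""
    offset = yield  # wait for the first request
    while True:
        # Send exactly one packet, then block until the client asks again.
        offset = yield block[offset:offset + packet_size]


# Usage: a client doing random reads only receives the packets it asks for.
block = bytes(range(16))
node = blocking_data_node(block, 4)
next(node)                 # start the data node; it now waits for a request
packet_a = node.send(8)    # bytes 8..11
packet_b = node.send(0)    # bytes 0..3, no intermediate packets were sent
```

In the push version, a request at offset 8 would also deliver the packet at offset 12 unless the client cancels the stream, which is the overhead the blocking variant is meant to avoid for random access.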

17
Data Transfer Model (Cont..)
Original Push Model
Our Transfer Model
18
Experimental Evaluations
  • Datasets
  • The following real world datasets are used for
    the experimentation.
  • CAR contains 2,249,727 road segments extracted
    from the TIGER/Line datasets.
  • HYD contains 40,995,718 line segments
    representing rivers of China.
  • TLK contains up to 157,425,887 points extracted
    from the elevation data of China.

19
Data transfer overhead
  • Transferred data is compared with the required
    data by varying size and count of read operations
    on HYD data set.

20
Performance
  • This result is based on the TLK dataset. The data
    packet size during transfer ranges from 8 KB to
    64 KB.
  • The new protocol offers the best performance for
    all access patterns when the data packet size is
    set to 64 KB.

21
Effects of index node size
  • Response time of range query and point query on
    CAR dataset.
  • Range query performance is worst when the index
    node size is set to 16 KB, and it degrades
    rapidly as the query window grows.
  • The response time for point queries is longest
    when the index node size is set to 8 KB, because
    with smaller nodes the index has more levels and
    more node visits are involved.
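
The link between node size and the number of index levels can be checked with a rough calculation. The sketch assumes fully packed nodes and a 40-byte entry, neither of which is stated in the slides, so the resulting level counts are illustrative only; the object count is the CAR dataset size from the Datasets slide.

```python
import math


def tree_height(num_objects, node_size, entry_size=40):
    """Levels in a fully packed tree; entry_size is an assumed value."""
    fanout = node_size // entry_size
    nodes = math.ceil(num_objects / fanout)  # number of leaf nodes
    levels = 1
    while nodes > 1:
        nodes = math.ceil(nodes / fanout)    # nodes on the level above
        levels += 1
    return levels


CAR_OBJECTS = 2_249_727
height_8k = tree_height(CAR_OBJECTS, 8 * 1024)
height_64k = tree_height(CAR_OBJECTS, 64 * 1024)
```

Under these assumptions the 8 KB configuration needs three levels and the 64 KB configuration needs two, so a point query visits at least one more node with the smaller node size.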

22
Effects of Buffer
  • Vary the buffer size to evaluate the effects of
    buffer on query performance, whereas other
    parameters are kept constant.
  • We perform 100 range queries, each covering 1% of
    the total space, on the TLK dataset.
  • The response time decreases as the available
    buffer increases. Further improvement in response
    time can be expected as the buffer size
    increases.

23
Conclusion and Future work
  • Conclusion
  • We propose a method for organizing hierarchical
    structures, applicable to both the B-tree and the
    R-tree, on HDFS.
  • We investigate several system parameters, such as
    node size, index distribution, buffer, and query
    processing techniques.
  • We integrate a data transfer protocol designed
    for block-wise random reads into HDFS.
  • Future work
  • Investigate the problem of combination of
    MapReduce and index structure.
  • Explore efficient multi-dimensional data
    distribution strategy according to index
    structure to further enhance the I/O performance

24
  • Questions ?

-- Thank You