The Hadoop Distributed File System: Architecture and Design by Dhruba Borthakur

1
The Hadoop Distributed File System: Architecture
and Design by Dhruba Borthakur
  • Presented by Bryant Yao

2
Introduction
  • What is it? It's a file system!
  • Supports most of the operations a normal file
    system would.
  • Open source implementation of GFS (Google File
    System).
  • Written in Java
  • Designed primarily for GNU/Linux
  • Some support for Windows
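To make the "it's a file system" point concrete, here is a
minimal sketch of ordinary file operations through Hadoop's
Java FileSystem API. The namenode URI and the path are
hypothetical placeholders, and the hadoop-client library is
assumed to be on the classpath.

// Minimal sketch: create, read, and delete a file through the
// org.apache.hadoop.fs.FileSystem API. The URI and path are made up.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;
import java.nio.charset.StandardCharsets;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical namenode address; substitute your cluster's.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        Path file = new Path("/tmp/hello.txt");

        // Write once ...
        try (FSDataOutputStream out = fs.create(file)) {
            out.write("hello, hdfs".getBytes(StandardCharsets.UTF_8));
        }
        // ... read many.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }

        fs.delete(file, false);  // the Java API deletes directly; the shell's -rm uses trash
        fs.close();
    }
}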

3
Design Goals
  • HDFS is designed to store large files (think TB
    or PB).
  • HDFS is designed to run on computer clusters
    made up of racks.
  • Write once, read many model
  • Suited to bulk reads of large amounts of data,
    not low-latency access to individual files.
  • Streaming access to data
  • Data arrives as a continuous stream rather than
    in bursts.
  • Make use of commodity computers
  • Expect hardware to fail
  • Moving computation is cheaper than moving data

(Diagram: a cluster made up of Rack 1 and Rack 2.)
4
Master/Slave Architecture
(Diagram: a single namenode and its datanodes.)
5
Master/Slave Architecture cont.
  • 1 master, many slaves
  • The master manages the file system namespace and
    regulates access to files by clients.
  • Data distributed across slaves. The slaves store
    the data as blocks.
  • What is a block?
  • A portion of a file.
  • Files are broken down into and stored as a
    sequence of blocks.

(Diagram: File 1 is broken down into blocks A, B, and C.)
6
Task Flow
7
Namenode
  • Master
  • Handles metadata operations
  • Changes are recorded in a transaction log called
    the EditLog
  • Manages datanodes
  • Passes I/O requests to datanodes
  • Informs the datanode when to perform block
    operations.
  • Maintains a BlockMap which keeps track of which
    blocks each datanode is responsible for.
  • Stores all file metadata in memory
  • File attributes, number of replicas, a file's
    blocks, block locations, and block checksums.
  • Stores a copy of the namespace in the FsImage on
    disk.
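A toy sketch (plain Java, not Hadoop's actual classes) of the
in-memory bookkeeping described above: per-file block lists, a
BlockMap recording which blocks each datanode is responsible
for, and each file's replication factor.

// Toy sketch (not Hadoop source) of the namenode's in-memory metadata.
import java.util.*;

class NamenodeState {
    // File path -> ordered list of block IDs that make up the file.
    final Map<String, List<Long>> fileToBlocks = new HashMap<>();
    // "BlockMap": datanode -> blocks that datanode is responsible for.
    final Map<String, Set<Long>> blockMap = new HashMap<>();
    // File path -> desired number of replicas.
    final Map<String, Short> replication = new HashMap<>();

    void addBlock(String file, long blockId, short replicas) {
        fileToBlocks.computeIfAbsent(file, f -> new ArrayList<>()).add(blockId);
        replication.putIfAbsent(file, replicas);
    }

    void recordReplica(String datanode, long blockId) {
        blockMap.computeIfAbsent(datanode, d -> new HashSet<>()).add(blockId);
    }
}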

8
Datanode
  • Slave
  • Handles data I/O.
  • Handles block creation, deletion, and replication
  • Local storage is organized so blocks are spread
    over multiple directories
  • Storing everything in a single directory would
    perform poorly with a huge number of files.

9
Data Replication
  • Makes copies of the data!
  • Replication factor determines the number of
    copies.
  • Specified per file at creation time (and can be
    changed later); the namenode makes the
    replication decisions.
  • Replication is pipelined!
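As an illustration (the file path and values are hypothetical),
a client can set a default replication factor through
configuration or change it per file through the public API; the
namenode then takes care of creating or removing copies.

// Sketch of influencing the replication factor from a client.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");        // default factor for new files
        FileSystem fs = FileSystem.get(conf);

        // Change the factor for an existing file; the namenode then
        // schedules the extra copies (or removals) on its own.
        fs.setReplication(new Path("/data/input.log"), (short) 2);
        fs.close();
    }
}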

10
Pipelining Data Replication
  • Blocks are streamed in small portions (4 KB)
    through a pipeline of datanodes.

(Diagram: a block is split into three portions A, B, and C and
pushed through datanodes 1, 2, and 3; each datanode forwards
every portion it receives to the next datanode in the pipeline.)
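A toy simulation (not Hadoop code) of what the diagram shows:
each portion is handed to the first datanode, which forwards it
to the second, which forwards it to the third, so later
portions can already be in flight while earlier ones move down
the chain. Node names are made up.

// Toy simulation of pipelined replication: portions flow through a chain
// of datanodes, each node forwarding what it receives to the next.
import java.util.List;

public class PipelineSketch {
    static final int PORTION_SIZE = 4 * 1024;   // 4 KB portions, as on the slide

    public static void main(String[] args) {
        List<String> pipeline = List.of("datanode1", "datanode2", "datanode3");
        String[] portions = {"A", "B", "C"};    // one block split into three portions

        for (String portion : portions) {
            // In the real pipeline the client only talks to the first node;
            // each node streams the portion onward as soon as it arrives.
            for (int i = 0; i < pipeline.size(); i++) {
                String from = (i == 0) ? "client" : pipeline.get(i - 1);
                System.out.printf("portion %s: %s -> %s%n", portion, from, pipeline.get(i));
            }
        }
    }
}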
11
Replication Policy
  • Communication bandwidth between computers in the
    same rack is greater than between computers in
    different racks.
  • We could replicate data across racks, but this
    would consume the most bandwidth.
  • We could replicate data across all computers in a
    single rack, but if the rack dies we're in the
    same position as before.

12
Replication Policy cont.
  • Assume only three replicas are created.
  • Split the replicas between 2 racks.
  • Rack failure is rare, so we're still able to
    maintain good data reliability while minimizing
    bandwidth cost.
  • Version 0.18.0
  • 2 replicas in current rack (2 different nodes)
  • 1 replica in remote rack
  • Version 0.20.3.x
  • 1 replica in current rack
  • 2 replicas in remote rack (2 different nodes)
  • What happens if the replication factor is 2 or > 3?
  • No answer in this paper.
  • Some other papers state that the minimum is 3.
  • The author wrote a separate paper stating every
    replica after the 3rd is placed randomly.
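A toy sketch of the 0.18-style placement listed above: two
replicas on different nodes of the writer's rack, one on a
remote rack, and any replica beyond the third on a randomly
chosen node (per the author's follow-up paper). Rack and node
names are made up, and the toy may pick the same extra node twice.

// Toy replica placement sketch following the 0.18 rule described above.
import java.util.*;

public class PlacementSketch {
    static List<String> placeReplicas(String localRack,
                                      Map<String, List<String>> racks, int factor) {
        List<String> local = racks.get(localRack);
        List<String> remote = new ArrayList<>();
        racks.forEach((rack, nodes) -> { if (!rack.equals(localRack)) remote.addAll(nodes); });

        Random rnd = new Random();
        List<String> chosen = new ArrayList<>();
        chosen.add(local.get(0));                                    // 1st: local rack
        if (factor > 1) chosen.add(local.get(1));                    // 2nd: same rack, other node
        if (factor > 2) chosen.add(remote.get(rnd.nextInt(remote.size()))); // 3rd: remote rack
        for (int i = 3; i < factor; i++) {                           // extras: random node
            List<String> all = new ArrayList<>(local);
            all.addAll(remote);
            chosen.add(all.get(rnd.nextInt(all.size())));
        }
        return chosen;
    }

    public static void main(String[] args) {
        Map<String, List<String>> racks = Map.of(
                "rack1", List.of("r1n1", "r1n2", "r1n3"),
                "rack2", List.of("r2n1", "r2n2"));
        System.out.println(placeReplicas("rack1", racks, 3));
    }
}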

13
Reading Data
  • Read the data that's closest to you!
  • If the block/replica of data you want is on the
    datanode/rack/data center you're on, read it from
    there!
  • Read from datanodes directly.
  • Can be done in parallel.
  • Namenode is used to generate the list of
    datanodes which host a requested file as well as
    getting checksum values to validate blocks
    retrieved from the datanodes.
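A brief sketch (the path is hypothetical) of how a client
learns where a file's blocks live and then reads them:
getFileBlockLocations asks the namenode, while the actual byte
reads go to the datanodes behind the input stream.

// Sketch: list block locations for a file, then read from it.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class ReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/input.log");

        FileStatus status = fs.getFileStatus(file);
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.printf("offset %d, length %d, hosts %s%n",
                    loc.getOffset(), loc.getLength(), String.join(",", loc.getHosts()));
        }

        // Actual reads go straight to the datanodes behind this stream.
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buf = new byte[4096];
            int n = in.read(buf);
            System.out.println("read " + n + " bytes");
        }
        fs.close();
    }
}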

14
Writing Data
  • Data is written once
  • Split into blocks, typically of size 64MB
  • The larger the block size, the less metadata
    stored by the namenode
  • Data is written to a temporary local block on the
    client side and then flushed to a datanode, once
    the block is full.
  • If a file is closed while the temporary block
    isn't full, the remaining data is flushed to the
    datanode.
  • If the namenode dies during file creation, the
    file is lost!
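A small worked example of the block arithmetic above (sizes are
illustrative): with 64 MB blocks, a 200 MB file becomes three
full blocks plus one 8 MB tail block, so the namenode tracks
four block entries for it.

// Worked example: how many blocks a file occupies at a given block size.
public class BlockMath {
    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;      // 64 MB
        long fileSize  = 200L * 1024 * 1024;     // 200 MB

        long fullBlocks  = fileSize / blockSize;               // 3
        long tailBytes   = fileSize % blockSize;               // 8 MB
        long totalBlocks = fullBlocks + (tailBytes > 0 ? 1 : 0);

        System.out.printf("%d blocks (%d full + %s tail)%n",
                totalBlocks, fullBlocks,
                tailBytes > 0 ? (tailBytes / (1024 * 1024)) + " MB" : "no");
    }
}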

15
Hardware Failure
Imagine a file broken into three blocks spread over
three datanodes.
(Diagram: datanode 1 holds block A, datanode 2 holds
block B, and datanode 3 holds block C.)
If the third datanode dies, we have no access to
block C and can't retrieve the file.
16
Designing for Hardware Failure
  • Data replication
  • Safemode
  • Heartbeat
  • Block report
  • Checkpoints
  • Re-replication

17
Checkpoints
(Diagram: the EditLog and the FsImage together produce the
file system namespace.)
18
Checkpoints
  • FsImage is a snapshot of the file system
    namespace taken before any subsequent changes
    have occurred.
  • EditLog is a log of all the changes to the
    namespace since the namenode's startup.
  • Upon startup, the namenode applies all the
    EditLog changes to the FsImage to create an
    up-to-date version of itself.
  • The resulting FsImage is the checkpoint.
  • If either the FsImage or the EditLog is corrupt,
    HDFS will not start!
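A toy sketch (not Hadoop's on-disk formats) of the idea: start
from the last FsImage, replay the EditLog records in order, and
the result is the new, up-to-date FsImage, i.e. the checkpoint.

// Toy sketch of checkpointing: apply the EditLog on top of the FsImage.
import java.util.*;

public class CheckpointSketch {
    static Set<String> checkpoint(Set<String> fsImage, List<String> editLog) {
        Set<String> namespace = new HashSet<>(fsImage);   // start from the old image
        for (String entry : editLog) {                    // replay changes in order
            String[] parts = entry.split(" ", 2);         // e.g. "create /c"
            if (parts[0].equals("create")) namespace.add(parts[1]);
            else if (parts[0].equals("delete")) namespace.remove(parts[1]);
        }
        return namespace;                                 // the new FsImage (checkpoint)
    }

    public static void main(String[] args) {
        Set<String> image = Set.of("/a", "/b");
        List<String> log = List.of("create /c", "delete /a");
        System.out.println(checkpoint(image, log));       // prints /b and /c
    }
}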

19
Heartbeat and Blockreport
  • A heartbeat is a message sent from the datanode
    to the namenode.
  • Periodically sent to the namenode, letting the
    namenode know it's alive.
  • If it's dead, assume it can't be used.
  • Blockreport
  • A list of blocks the datanode is handling.
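A toy sketch of the heartbeat bookkeeping: remember when each
datanode last reported in and treat any node silent for longer
than a timeout as dead (the 10-minute value and node names here
are just illustrations).

// Toy sketch of heartbeat tracking on the namenode side.
import java.util.*;

public class HeartbeatSketch {
    static final long TIMEOUT_MS = 10 * 60 * 1000;             // illustrative timeout
    final Map<String, Long> lastHeartbeat = new HashMap<>();

    void onHeartbeat(String datanode, long nowMs) {
        lastHeartbeat.put(datanode, nowMs);                    // node is alive right now
    }

    List<String> deadNodes(long nowMs) {
        List<String> dead = new ArrayList<>();
        lastHeartbeat.forEach((node, last) -> {
            if (nowMs - last > TIMEOUT_MS) dead.add(node);     // silent too long: assume dead
        });
        return dead;
    }

    public static void main(String[] args) {
        HeartbeatSketch tracker = new HeartbeatSketch();
        tracker.onHeartbeat("datanode1", 0);
        tracker.onHeartbeat("datanode2", 11 * 60 * 1000);
        System.out.println(tracker.deadNodes(12 * 60 * 1000)); // [datanode1]
    }
}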

20
Safemode
  • Upon startup, the namenode enters safemode to
    check the health status of the cluster. Only done
    once.
  • Heartbeat is used to ensure all datanodes are
    available to use.
  • Blockreport is used to check data integrity.
  • If the number of replicas retrieved is different
    from the number of replicas expected, there is a
    problem.
(Diagram: the number of replicas found for block A is compared
with the number expected.)
21
Re-replication/De-replication
  • During startup and when receiving heartbeats, the
    namenode will check to see if the number of
    replicas for each block is satisfied.
  • If the number of replicas found was lower than
    expected, perform data replication for each block
    that does not satisfy the above criteria.
  • If the number of replicas found was higher than
    expected, the namenode randomly selects datanodes
    to remove blocks from, for each block that does
    not satisfy the above criteria.
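A toy sketch of that check: for each block, compare the
replicas found against the expected factor, then queue new
copies when there are too few or pick a random holder to drop
one when there are too many. Block IDs and node names are made up.

// Toy sketch of the re-replication / de-replication decision.
import java.util.*;

public class ReplicaCheckSketch {
    public static void main(String[] args) {
        int expected = 3;
        Map<Long, Set<String>> found = Map.of(
                1L, Set.of("dn1", "dn2"),                    // under-replicated
                2L, Set.of("dn1", "dn2", "dn3", "dn4"));     // over-replicated

        Random rnd = new Random();
        found.forEach((blockId, holders) -> {
            if (holders.size() < expected) {
                System.out.println("block " + blockId + ": schedule "
                        + (expected - holders.size()) + " new replica(s)");
            } else if (holders.size() > expected) {
                List<String> list = new ArrayList<>(holders);
                System.out.println("block " + blockId + ": remove replica from "
                        + list.get(rnd.nextInt(list.size())));
            }
        });
    }
}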

22
Other
  • Can view file system through FS Shell or the web
  • Communicates through TCP/IP
  • File deletes are a move operation to a trash
    folder which auto-deletes files after a specified
    time (default is 6 hours).
  • Rebalancer moves data off datanodes which are
    close to filling up their local storage.

23
Relation with Search Engines
  • Originally built for Nutch.
  • Intended to be the backbone for a search engine.
  • HDFS is the file system used by Hadoop.
  • Hadoop also contains a MapReduce engine which has
    many applications, like indexing the web!
  • Analyzing large amounts of data.
  • Used by many, many companies
  • Google, Yahoo!, Facebook, etc.
  • It can store the web!
  • Just kidding.

24
Pros/Cons
  • The goal of this paper is to describe the system,
    not analyze it. It gives a great beginning
    overview.
  • Probably could've been condensed/organized
    better.
  • Some information is missing
  • SecondaryNameNode
  • CheckpointNode
  • Etc.

25
Pros/Cons of HDFS: In and Beyond the Paper
  • Pros
  • It accomplishes everything it set out to do.
  • Horizontally scalable: just add a new datanode!
  • Cheap, cheap, cheap to build.
  • Good for reading and storing large amounts of
    data.
  • Cons
  • Security
  • No redundancy of namenode
  • Single point of failure
  • The namenode is not scalable
  • Doesn't handle small files well
  • Still in development, many features missing

26
Questions?
  • Thank you for listening!