Title: The Hadoop Distributed File System: Architecture and Design by Dhruba Borthakur
1. The Hadoop Distributed File System: Architecture and Design
by Dhruba Borthakur
2. Introduction
- What is it? It's a file system!
- Supports most of the operations a normal file system would.
- Open-source implementation of GFS (the Google File System).
- Written in Java
- Designed primarily for GNU/Linux
- Some support for Windows
3. Design Goals
- HDFS is designed to store large files (think TB or PB).
- HDFS is designed for computer clusters made up of racks.
- Write once, read many model
- Useful for reading many files at once, but not single files.
- Streaming access to data
- Data arrives in a constant stream, not in waves.
- Make use of commodity computers
- Expect hardware to fail
- Moving computation is cheaper than moving data
[Diagram: a cluster composed of Rack 1 and Rack 2]
4. Master/Slave Architecture
[Diagram: one namenode managing multiple datanodes]
5. Master/Slave Architecture (cont.)
- 1 master, many slaves
- The master manages the file system namespace and regulates access to files by clients.
- Data is distributed across the slaves, which store the data as blocks.
- What is a block?
- A portion of a file.
- Files are broken down into and stored as a sequence of blocks (see the sketch below).
[Diagram: File 1 broken down into blocks A, B, and C]
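A minimal, illustrative sketch (not HDFS's actual code) of splitting a byte stream into fixed-size blocks; the class name and holding blocks in memory are simplifications of ours:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: how a file maps onto a sequence of fixed-size blocks.
// The 64 MB figure matches HDFS's typical default block size.
public class BlockSplitter {
    static final int BLOCK_SIZE = 64 * 1024 * 1024;

    public static List<byte[]> split(InputStream in) throws IOException {
        List<byte[]> blocks = new ArrayList<>();
        byte[] buf = new byte[BLOCK_SIZE];
        int n;
        // readNBytes fills the buffer unless the stream ends first,
        // so only the final block may be shorter than BLOCK_SIZE.
        while ((n = in.readNBytes(buf, 0, BLOCK_SIZE)) > 0) {
            blocks.add(Arrays.copyOf(buf, n));
        }
        return blocks;
    }
}
```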
6. Task Flow
7. Namenode
- Master
- Handles metadata operations
- Stored in a transaction log called the EditLog
- Manages datanodes
- Passes I/O requests to datanodes
- Informs the datanodes when to perform block operations
- Maintains a BlockMap which keeps track of which blocks each datanode is responsible for (see the sketch after this list)
- Stores all file metadata in memory
- File attributes, number of replicas, a file's blocks, block locations, and block checksums
- Stores a copy of the namespace in the FsImage on disk
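A rough sketch of what a BlockMap might look like (illustrative, not the namenode's actual data structure):

```java
import java.util.*;

// Illustrative: map each datanode to the set of block IDs it holds,
// plus a reverse index from block ID to the datanodes hosting it.
public class BlockMap {
    private final Map<String, Set<Long>> blocksByDatanode = new HashMap<>();
    private final Map<Long, Set<String>> datanodesByBlock = new HashMap<>();

    public void addBlock(String datanodeId, long blockId) {
        blocksByDatanode.computeIfAbsent(datanodeId, k -> new HashSet<>()).add(blockId);
        datanodesByBlock.computeIfAbsent(blockId, k -> new HashSet<>()).add(datanodeId);
    }

    // Which datanodes host this block?
    public Set<String> locate(long blockId) {
        return datanodesByBlock.getOrDefault(blockId, Collections.emptySet());
    }
}
```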
8. Datanode
- Slave
- Handles data I/O
- Handles block creation, deletion, and replication
- Local storage is optimized so blocks are stored over multiple directories rather than in a single directory
9. Data Replication
- Makes copies of the data!
- The replication factor determines the number of copies.
- Specified by the namenode or at file creation time (see the example below)
- Replication is pipelined!
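For example, using Hadoop's Java client API (the path and the factor of 3 are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Change the replication factor of an existing file to 3.
        // (FileSystem.create also has overloads that take a replication factor.)
        Path file = new Path("/user/example/data.txt"); // illustrative path
        fs.setReplication(file, (short) 3);
        fs.close();
    }
}
```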
10. Pipelining Data Replication
- Blocks are split into small portions (4 KB) for transfer (a code sketch follows the diagram below).
Assume a block is split into 3 portions A, B, and C.
[Diagram: portion A flows from datanode 1 to 2 to 3; while datanode 1 forwards A, it is already receiving B, with C following behind in the pipeline]
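A conceptual sketch of one pipeline stage (illustrative; real datanodes stream portions over TCP sockets):

```java
import java.io.*;

// Illustrative: a datanode writes each received portion to local disk,
// then immediately forwards it downstream while the rest of the block
// is still arriving.
public class PipelineStage {
    static final int PORTION_SIZE = 4 * 1024; // 4 KB portions

    public static void relay(InputStream fromUpstream,
                             OutputStream toLocalDisk,
                             OutputStream toDownstream) throws IOException {
        byte[] portion = new byte[PORTION_SIZE];
        int n;
        while ((n = fromUpstream.read(portion)) > 0) {
            toLocalDisk.write(portion, 0, n);      // persist locally
            if (toDownstream != null) {            // last stage has no downstream
                toDownstream.write(portion, 0, n); // forward down the pipeline
            }
        }
    }
}
```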
11. Replication Policy
- Communication bandwidth between computers within a rack is greater than between computers in different racks.
- We could replicate data across racks, but this would consume the most bandwidth.
- We could replicate data across all computers in a rack, but if the rack dies we're in the same position as before.
12. Replication Policy (cont.)
- Assume only three replicas are created.
- Split the replicas between 2 racks.
- Rack failure is rare, so we're still able to maintain good data reliability while minimizing bandwidth cost.
- Version 0.18.0
- 2 replicas in the current rack (on 2 different nodes)
- 1 replica in a remote rack
- Version 0.20.3.x
- 1 replica in the current rack
- 2 replicas in a remote rack (on 2 different nodes)
- What happens if the replication factor is 2 or greater than 3?
- No answer in this paper.
- Some other papers state that the minimum is 3.
- The author wrote a separate paper stating that every replica after the 3rd is placed randomly (see the sketch below).
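A toy sketch of the 0.20-style placement described above (node and rack representations are ours; the real policy also weighs node load and disk capacity):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Illustrative rack-aware placement. Assumes the cluster has enough
// distinct nodes and racks to satisfy the requested replica count.
public class PlacementSketch {
    public static List<String> chooseTargets(String localRack,
                                             Map<String, List<String>> nodesByRack,
                                             int replicas, Random rnd) {
        List<String> targets = new ArrayList<>();

        // 1st replica: a node on the writer's own rack.
        List<String> local = nodesByRack.get(localRack);
        targets.add(local.get(rnd.nextInt(local.size())));

        // 2nd and 3rd replicas: two different nodes on one remote rack.
        List<String> racks = new ArrayList<>(nodesByRack.keySet());
        racks.remove(localRack);
        if (replicas >= 2 && !racks.isEmpty()) {
            List<String> remote = nodesByRack.get(racks.get(rnd.nextInt(racks.size())));
            for (int i = 1; i < Math.min(replicas, 3); i++) {
                String node;
                do { node = remote.get(rnd.nextInt(remote.size())); } while (targets.contains(node));
                targets.add(node);
            }
        }

        // Replicas beyond the 3rd: random nodes anywhere in the cluster.
        List<String> all = new ArrayList<>();
        nodesByRack.values().forEach(all::addAll);
        while (targets.size() < replicas) {
            String node = all.get(rnd.nextInt(all.size()));
            if (!targets.contains(node)) targets.add(node);
        }
        return targets;
    }
}
```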
13. Reading Data
- Read the data that's closest to you!
- If the block/replica of data you want is on the datanode/rack/data center you're on, read it from there!
- Read from datanodes directly.
- Can be done in parallel.
- The namenode is used to generate the list of datanodes which host a requested file, as well as to provide checksum values to validate blocks retrieved from the datanodes (see the example below).
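A minimal read example using Hadoop's Java client API (the path is illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // The client asks the namenode for block locations, then
        // streams the bytes directly from the datanodes.
        try (FSDataInputStream in = fs.open(new Path("/user/example/data.txt"))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) > 0) {
                System.out.write(buf, 0, n);
            }
        }
        fs.close();
    }
}
```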
14. Writing Data
- Data is written once
- Split into blocks, typically of size 64 MB
- The larger the block size, the less metadata the namenode stores
- Data is written to a temporary local block on the client side and then flushed to a datanode once the block is full (see the example below).
- If a file is closed while the temporary block isn't full, the remaining data is flushed to the datanode.
- If the namenode dies during file creation, the file is lost!
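A minimal write example using Hadoop's Java client API (path and contents are illustrative):

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Bytes are buffered client-side and shipped to datanodes
        // block by block; the file becomes visible once closed.
        try (FSDataOutputStream out = fs.create(new Path("/user/example/output.txt"))) {
            out.write("hello, HDFS".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```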
15. Hardware Failure
Imagine a file is broken into blocks A, B, and C, spread over three datanodes.
[Diagram: datanodes 1, 2, and 3 holding blocks A, B, and C]
If the third datanode died, we would have no access to block C, and we can't retrieve the file.
[Diagram: datanode 3 down, block C unavailable]
16. Designing for Hardware Failure
- Data replication
- Safemode
- Heartbeat
- Block report
- Checkpoints
- Re-replication
17. Checkpoints
[Diagram: the EditLog and the FsImage combine to form the file system namespace]
18. Checkpoints (cont.)
- The FsImage is a copy of the namespace taken before any changes have occurred.
- The EditLog is a log of all changes to the namenode since its startup.
- Upon startup, the namenode applies all the changes in the EditLog to the FsImage to create an up-to-date version of itself.
- The resulting FsImage is the checkpoint (see the sketch below).
- If either the FsImage or the EditLog is corrupt, HDFS will not start!
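A toy sketch of checkpointing, assuming a namespace of paths and only create/delete edits (both simplifications of ours):

```java
import java.util.*;

// Illustrative: a checkpoint is the old snapshot (FsImage) with the
// logged edits (EditLog) replayed on top of it.
public class CheckpointSketch {
    record Edit(String op, String path) {} // e.g. ("create", "/a"), ("delete", "/a")

    public static Set<String> checkpoint(Set<String> fsImage, List<Edit> editLog) {
        Set<String> namespace = new HashSet<>(fsImage);
        for (Edit e : editLog) {
            switch (e.op()) {
                case "create" -> namespace.add(e.path());
                case "delete" -> namespace.remove(e.path());
            }
        }
        return namespace; // the new FsImage; the EditLog can now be truncated
    }
}
```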
19. Heartbeat and Blockreport
- A heartbeat is a message sent from a datanode to the namenode.
- Sent periodically, letting the namenode know the datanode is alive (see the sketch after this list).
- If a datanode is dead, assume you can't use it.
- Blockreport
- A list of blocks the datanode is handling.
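A minimal sketch of heartbeat-based liveness tracking (class name and timeout handling are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative: the namenode marks a datanode dead if no heartbeat
// has arrived within some timeout window.
public class HeartbeatMonitor {
    private final Map<String, Long> lastHeartbeatMillis = new HashMap<>();
    private final long timeoutMillis;

    public HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    public void onHeartbeat(String datanodeId) {
        lastHeartbeatMillis.put(datanodeId, System.currentTimeMillis());
    }

    public boolean isDead(String datanodeId) {
        Long last = lastHeartbeatMillis.get(datanodeId);
        return last == null || System.currentTimeMillis() - last > timeoutMillis;
    }
}
```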
20. Safemode
- Upon startup, the namenode enters safemode to check the health of the cluster. This is only done once.
- Heartbeats are used to ensure all datanodes are available.
- Blockreports are used to check data integrity.
- If the number of replicas found differs from the number of replicas expected, there is a problem (see the sketch below).
[Diagram: replicas of block A expected vs. replicas of block A actually found]
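A minimal sketch of the comparison performed during safemode (illustrative names; the real check runs over blockreport data):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative safemode check: compare the replica count reported in
// blockreports against the expected replication factor per block.
public class SafemodeCheck {
    public static List<Long> underReplicated(Map<Long, Integer> expected,
                                             Map<Long, Integer> found) {
        List<Long> problems = new ArrayList<>();
        for (Map.Entry<Long, Integer> e : expected.entrySet()) {
            int have = found.getOrDefault(e.getKey(), 0);
            if (have < e.getValue()) {
                problems.add(e.getKey()); // block needs re-replication
            }
        }
        return problems;
    }
}
```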
21. Re-replication/De-replication
- During startup and when receiving heartbeats, the namenode checks whether the number of replicas for each block is satisfied.
- If the number of replicas found is lower than expected, it performs data replication for each block that does not satisfy the criterion.
- If the number of replicas found is higher than expected, the namenode randomly selects datanodes to remove the block from, for each block that exceeds its replication factor (see the sketch below).
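A sketch of both actions in one place (illustrative; scheduleReplication and scheduleRemoval are hypothetical stand-ins for the namenode's internal scheduling):

```java
import java.util.List;
import java.util.Random;

// Illustrative: for each block, compare found vs. expected replicas and
// either schedule re-replication or randomly drop surplus replicas.
public class ReplicationManager {
    private final Random rnd = new Random();

    // holders is a mutable list of datanode IDs currently storing the block.
    public void reconcile(long blockId, int expected, List<String> holders) {
        if (holders.size() < expected) {
            // Under-replicated: copy the block to more datanodes.
            scheduleReplication(blockId, expected - holders.size());
        } else {
            // Over-replicated: remove the block from randomly chosen datanodes.
            while (holders.size() > expected) {
                String victim = holders.remove(rnd.nextInt(holders.size()));
                scheduleRemoval(blockId, victim);
            }
        }
    }

    private void scheduleReplication(long blockId, int copies) { /* ... */ }
    private void scheduleRemoval(long blockId, String datanodeId) { /* ... */ }
}
```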
22. Other
- The file system can be viewed through the FS Shell or the web interface.
- Communicates over TCP/IP.
- File deletes are a move to a trash folder, which auto-deletes files after a specified time (default is 6 hours).
- The Rebalancer moves data off datanodes that are close to filling up their local storage.
23. Relation with Search Engines
- Originally built for Nutch.
- Intended to be the backbone of a search engine.
- HDFS is the file system used by Hadoop.
- Hadoop also contains MapReduce, which has many applications, like indexing the web!
- Analyzing large amounts of data.
- Used by many, many companies
- Google, Yahoo!, Facebook, etc.
- It can store the web!
- Just kidding!
24. Pros/Cons
- The goal of this paper is to describe the system, not analyze it. It gives a great beginning overview.
- It probably could've been condensed and organized better.
- Some information is missing
- SecondaryNameNode
- CheckpointNode
- Etc.
25. Pros/Cons of HDFS: In and Beyond the Paper
- Pros
- It accomplishes everything it set out to do.
- Horizontally scalable: just add a new datanode!
- Cheap, cheap, cheap to build.
- Good for reading and storing large amounts of data.
- Cons
- Security
- No redundancy of namenode
- Single point of failure
- The namenode is not scalable
- Doesn't handle small files well
- Still in development, many features missing
26. Questions?