15-440, Hadoop Distributed File System Allison Naaktgeboren - PowerPoint PPT Presentation

Slides: 16
Provided by: andre4
Transcript and Presenter's Notes


1
15-440, Hadoop Distributed File System
Allison Naaktgeboren
  • Ur doin' it rong kitteh
  • Wut u mean? I iz loadin a HA-doop fileh

2
Announcements
  • Go Vote!
  • Interpretive Dances happen only after Lecture
  • Office Hour Change
  • Mon 6:30-9:30
  • Tues 6-7:30
  • Exams are graded

3
Hadoop Core at 30,000 ft
4
Back to the Map Reduce Model
  • Recall that
  • map (in_key, in_value) ->
  • (inter_key, inter_value) list
  • combine (inter_key, inter_value) -> (inter_key,
    inter_value)
  • reduce (inter_key, inter_value list) ->
  • (out_key, out_value)
  • What resource are we most constrained by?
  • Oceans of Data, Skinny pipes
  • How many types of data will the file system care
    about?
  • How long will we need each kind?
  • What is the common case for each?
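
The three signatures above can be sketched as a tiny in-memory word count. This is only an illustration, not how Hadoop runs a job: the function names, the `group` helper, and the sample documents are all made up here, and the combiner is assumed to receive the list of values a single mapper produced for a key.

```python
from collections import defaultdict

def map_fn(in_key, in_value):
    # in_key: a (hypothetical) document name; in_value: its text.
    return [(word, 1) for word in in_value.split()]

def combine_fn(inter_key, inter_values):
    # Local pre-aggregation on the mapper's machine, so less data
    # has to cross the skinny pipes.
    return (inter_key, sum(inter_values))

def reduce_fn(inter_key, inter_values):
    return (inter_key, sum(inter_values))

def group(pairs):
    # Collect all values that share a key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def run(documents):
    combined = []
    for name, text in documents.items():
        local = group(map_fn(name, text))
        combined.extend(combine_fn(k, v) for k, v in local.items())
    return dict(reduce_fn(k, v) for k, v in group(combined).items())

counts = run({"a.txt": "cat hat cat", "b.txt": "hat cat"})
```

The combiner matters for the bandwidth question on this slide: each mapper ships one pair per distinct word instead of one pair per occurrence.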

5
(No Transcript)
6
What would an MR Filesystem need?
  • General use case: large files
  • Mostly append to end, long sequential reads, few
    deletes
  • Appends might be concurrent
  • Scalability
  • Adding (or losing) machines should be relatively
    painless
  • Nodes work on nearby data
  • Minimize moving data between machines
  • Bandwidth is our limiting resource
  • Remember how much data
  • Failure (handling) is common
  • Yea, yea we know, we took 213, we know hardware
    sucks
  • No, really, failure (handling) is common
    (constant)
  • Disks, processors, whole nodes, racks, and
    datacenters

7
Addressing Those Concerns
  • Sequential Reads, appends need to be fast
  • Deletes can be painful
  • Hot plug machines
  • Add or lose machines while system is running jobs
  • System should auto detect the change
  • HDFS should distribute data somewhat evenly
  • So that all workers have a reasonable amount of
    data to chew on
  • And coordinating with the Jobtracker (job
    master)
  • Data Replication
  • Should be spread out. Why?
  • What type of problems could arise?

8
Moving into the Details
  • Nodes in HDFS
  • NameNode (master) (like GFS Master)
  • DataNodes (slaves) (like GFS chunkservers)
  • NB: Hadoop and HDFS are closely paired
  • careful use of jargon defines the true expert
  • worker node A and data node 1 are frequently
    the same machine
  • Two types of Masters
  • Jobtracker (Hadoop Job Master)
  • NameNode (file system Master)
  • What I mean by 'master' for the rest of the
    lecture

9
Your Data goes in ....
  • Files are divided into Chunks
  • 64 MB
  • The mapping between filename and chunks goes to
    the Master
  • Each chunk is replicated and sent off to
    DataNodes
  • By default, 3
  • The master determines which dataNodes
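
A rough sketch of the bookkeeping on this slide: cut a file into 64 MB chunks and pick 3 dataNodes per chunk. The slide does not say how the master chooses dataNodes, so the round-robin rule in `place_replicas` is purely an assumption, and both function names and the node names are hypothetical.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, as on the slide
REPLICATION = 3                # default replica count, as on the slide

def split_into_chunks(file_size, chunk_size=CHUNK_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    return [(offset, min(chunk_size, file_size - offset))
            for offset in range(0, file_size, chunk_size)]

def place_replicas(num_chunks, datanodes, replication=REPLICATION):
    """Assumed round-robin placement: each chunk lands on distinct nodes."""
    if replication > len(datanodes):
        raise ValueError("not enough dataNodes for the requested replication")
    return {i: [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
            for i in range(num_chunks)}

chunks = split_into_chunks(200 * 1024 * 1024)  # a 200 MB file
placement = place_replicas(len(chunks), ["dn1", "dn2", "dn3", "dn4"])
```

A 200 MB file yields three full chunks plus one 8 MB tail chunk; the master would keep the filename-to-chunk mapping and the `placement` table as metadata.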

10
What the Clients Do
  • Where the data starts
  • On file creation, creates a separate file
    w/checksum
  • When data fetched back from a dataNode, checksum
    computed again
  • Cache file data
  • Avoid bothering the Master too often
  • When a Client has 1 chunk's worth of data
  • Contacts the Master,
  • Master sends names of the dataNodes to send it to
  • ONLY sends it to the 1st
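
The checksum round trip described above might look like the sketch below. The slide does not name the checksum algorithm, so CRC32 via Python's `zlib` is an assumption, the `.crc` suffix is a made-up naming convention, and a plain dict stands in for storage on a dataNode.

```python
import zlib

def checksum(data: bytes) -> int:
    # Assumption: CRC32; the slide only says "checksum".
    return zlib.crc32(data)

def write_with_checksum(store, name, data):
    # Keep the checksum in a separate entry next to the data.
    store[name] = data
    store[name + ".crc"] = checksum(data)

def read_verified(store, name):
    # Recompute on fetch and compare with the stored value.
    data = store[name]
    if checksum(data) != store[name + ".crc"]:
        raise IOError(f"checksum mismatch for {name}")
    return data

store = {}
write_with_checksum(store, "chunk-0", b"some chunk bytes")
ok = read_verified(store, "chunk-0")
```

If the stored bytes are altered after the write, `read_verified` raises instead of handing back corrupt data, which is the point of computing the checksum a second time on fetch.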

11
What the DataNodes Do
  • Heartbeat to the Master
  • Opens, closes, or replicates a chunk if requested
    by the Master
  • During replication, sends data to next dataNode
    in chain
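
The replication chain can be sketched as follows: the client hands the chunk to the first dataNode only, and each dataNode stores it and passes it along. The `DataNode` class and `client_write` function are hypothetical stand-ins, and real transfers would go over the network rather than through direct method calls.

```python
class DataNode:
    def __init__(self, name):
        self.name = name
        self.chunks = {}

    def receive(self, chunk_id, data, rest_of_chain):
        # Store locally, then forward to the next dataNode in the chain, if any.
        self.chunks[chunk_id] = data
        if rest_of_chain:
            next_node, *remaining = rest_of_chain
            next_node.receive(chunk_id, data, remaining)

def client_write(chunk_id, data, chain):
    # The client contacts only the first dataNode named by the master.
    first, *rest = chain
    first.receive(chunk_id, data, rest)

nodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
client_write("chunk-0", b"payload", nodes)
```

With this shape the client uploads each chunk once, and the copying work is spread across the dataNodes.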

12
What the Namespace Node Does
  • System metadata!
  • Holds Name->ID mapping
  • Chunk replica locations
  • Transaction Logs
  • EditLog
  • FSImage
  • It is responsible for coherency
  • Uses the logs atomically
  • Addresses the concurrent writes issue
  • It is checkpointed
  • Similar to AFS volume snapshots
  • Will pull last consistent log upon restart
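
One way to picture the EditLog and FSImage working together is a checkpoint plus a replayable log, sketched below. This is a simplified in-memory model under assumptions: the `Namespace` class, its method names, and the two operation types are invented for illustration, and a real master would persist both structures to disk.

```python
class Namespace:
    def __init__(self):
        self.fsimage = {}    # last checkpointed name -> chunk-ID list
        self.edit_log = []   # operations applied since that checkpoint
        self.live = {}       # current in-memory view

    def apply(self, op, name, chunk_ids=None):
        # Log first, then mutate, so a restart can replay the operation.
        self.edit_log.append((op, name, chunk_ids))
        self._mutate(self.live, op, name, chunk_ids)

    @staticmethod
    def _mutate(table, op, name, chunk_ids):
        if op == "create":
            table[name] = list(chunk_ids)
        elif op == "delete":
            table.pop(name, None)

    def checkpoint(self):
        # Fold the current state into the image and start a fresh log.
        self.fsimage = {k: list(v) for k, v in self.live.items()}
        self.edit_log = []

    def restart(self):
        # Rebuild the live view: load the image, then replay the log.
        self.live = {k: list(v) for k, v in self.fsimage.items()}
        for op, name, chunk_ids in self.edit_log:
            self._mutate(self.live, op, name, chunk_ids)

ns = Namespace()
ns.apply("create", "/a", [1, 2])
ns.checkpoint()
ns.apply("create", "/b", [3])
ns.apply("delete", "/a")
ns.live = {}   # simulate losing the in-memory state
ns.restart()
```

After the simulated restart, the live view contains only `/b`: the image supplied `/a`, and replaying the log added `/b` and removed `/a`.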

13
What the Namespace Node Does
  • Listens for Heartbeats
  • Listens for Client Requests
  • If no heartbeat
  • marks a node as dead
  • Its data is deregistered
  • It selects dataNodes
  • Which nodes get which chunks
  • Signals creating, opening, closing
  • Deletes
  • Orders move to /trash
  • Starts delete timer
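
The heartbeat handling above can be sketched as a timeout sweep. The slides give no timeout value or data layout, so the `HeartbeatMonitor` class, its fields, and the 10-second timeout in the usage lines are all assumptions; timestamps are passed in explicitly to keep the example deterministic.

```python
class HeartbeatMonitor:
    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}         # dataNode name -> time of last heartbeat
        self.chunk_locations = {}   # chunk ID -> set of dataNode names

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def register_chunk(self, chunk_id, node):
        self.chunk_locations.setdefault(chunk_id, set()).add(node)

    def sweep(self, now):
        """Mark silent nodes dead and deregister their replicas."""
        dead = [n for n, t in self.last_seen.items() if now - t > self.timeout]
        for node in dead:
            del self.last_seen[node]
            for holders in self.chunk_locations.values():
                holders.discard(node)
        return dead

monitor = HeartbeatMonitor(timeout_seconds=10)
monitor.heartbeat("dn1", now=0)
monitor.heartbeat("dn2", now=0)
monitor.register_chunk("c0", "dn1")
monitor.register_chunk("c0", "dn2")
monitor.heartbeat("dn2", now=8)
dead = monitor.sweep(now=12)
```

Once a node's replicas are deregistered, any chunk left with fewer holders than the replication target is a candidate for re-replication onto a live dataNode.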

14
All together Now!
15
Additional Resources
  • Hadoop wiki
  • YouTube -> Hadoop -> Google developer videos (1-3
    will be helpful)
  • Google University
  • Includes UW course, the other UW course, a couple
    others
  • Use at your own risk
  • The Google File System paper is rather readable
    as research papers go