Introduction to - PowerPoint PPT Presentation

Provided by: 123se111
1
Introduction to
Tanya Chaturvedi MBA(ISM) 500026401
2
What is Hadoop?
  • Hadoop is a software framework for distributed
    processing of large datasets across large
    clusters of computers.
  • Large datasets: terabytes or petabytes of data
  • Large clusters: hundreds or thousands of nodes
  • Hadoop is written in the Java programming
    language and requires Java Runtime Environment
    (JRE) 1.6 or higher.

3
Innovation
  • The ideas behind this technology (the MapReduce
    computing model and a distributed file system)
    were developed at Google in its early days so
    that it could usefully index all the textual and
    structural information it was collecting and then
    present meaningful results to users. Hadoop is an
    open-source implementation of those ideas.
  • Hadoop is based on a simple data model; any data
    will fit.

4
Hadoop Master/Slave Architecture
  • Hadoop is designed as a master/slave
    architecture.
  • One master node
  • Many slave nodes
5
Design Principles of Hadoop
  • Need to process big data.
  • Need to parallelize computation across thousands
    of nodes.
  • Commodity hardware: a large number of low-end,
    cheap machines working in parallel to solve a
    computing problem, rather than a small number of
    high-end, expensive machines.
  • Fault tolerance and automatic recovery: nodes and
    tasks will fail, and the framework recovers from
    those failures automatically.
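
The automatic-recovery principle above can be illustrated with a small, self-contained Python sketch. It is not Hadoop code: the function `run_with_recovery`, the node names, and the round-robin choice of nodes are all assumptions made for illustration. It only shows the idea that work assigned to a failed node is moved to a healthy one.

```python
def run_with_recovery(tasks, nodes, failed_nodes):
    """Assign each task to a node, moving work off nodes that have failed.

    tasks: list of task names
    nodes: list of node names, in preference order
    failed_nodes: set of node names that are down (hypothetical input)
    Returns a dict mapping each task to the node that finally runs it.
    """
    healthy = [n for n in nodes if n not in failed_nodes]
    if not healthy:
        raise RuntimeError("no healthy nodes left to run tasks")

    placement = {}
    for i, task in enumerate(tasks):
        first_choice = nodes[i % len(nodes)]
        if first_choice in failed_nodes:
            # The first choice is down: run the task on a healthy node instead.
            placement[task] = healthy[i % len(healthy)]
        else:
            placement[task] = first_choice
    return placement


placement = run_with_recovery(
    tasks=["t0", "t1", "t2", "t3"],
    nodes=["node-a", "node-b", "node-c"],
    failed_nodes={"node-b"},
)
print(placement)
```

Every task still gets a node even though `node-b` is down, which is the behaviour the slide describes.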

6
Users of Hadoop
  • Google: inventors of the MapReduce computing
    paradigm.
  • Yahoo: index calculation for the Yahoo search
    engine.
  • IBM, Microsoft, Oracle, Apple, HP, Twitter
  • Facebook, Amazon, AOL, Netflix
  • Many other universities and research labs

7
Main Reasons for Using Hadoop
8
Hadoop Architecture
The Hadoop framework consists of two main layers:
  • Distributed file system (HDFS)
  • Execution engine (MapReduce)
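
To make the execution-engine layer more concrete, here is a minimal single-machine sketch of the MapReduce pattern in plain Python, using word counting as the example. It does not use the Hadoop API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are illustrative assumptions. On a real cluster the map and reduce calls would run in parallel on many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["big data big clusters", "Big data"])
print(counts)
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across machines without changing the result.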

9
  • A small Hadoop cluster will include a single
    master and multiple slave nodes. The master node
    consists of a JobTracker, TaskTracker, NameNode
    and DataNode. A slave or worker node acts as both
    a DataNode and TaskTracker, though it is possible
    to have data-only worker nodes and compute-only
    worker nodes.
  • The JobTracker runs on the master node; it
    receives the user's job.
  • Hadoop requires Java Runtime Environment (JRE)
    1.6 or higher.
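
The division of labour between the JobTracker and the TaskTrackers can be sketched as follows. This is a toy model, not Hadoop's implementation: the classes reuse the component names from the slide, but their methods and the round-robin assignment are assumptions made for this example.

```python
class TaskTracker:
    """Worker-side process: runs the tasks it is handed."""

    def __init__(self, name):
        self.name = name
        self.completed = []

    def run(self, task):
        self.completed.append(task)
        return f"{task} done on {self.name}"


class JobTracker:
    """Master-side process: receives a job and farms its tasks out."""

    def __init__(self, trackers):
        self.trackers = trackers

    def submit(self, job_tasks):
        results = []
        for i, task in enumerate(job_tasks):
            # Round-robin assignment is an assumption of this sketch.
            tracker = self.trackers[i % len(self.trackers)]
            results.append(tracker.run(task))
        return results


workers = [TaskTracker("slave-1"), TaskTracker("slave-2")]
master = JobTracker(workers)
results = master.submit(["map-0", "map-1", "map-2"])
print(results)
```

The user talks only to the master; the workers never see the job as a whole, only the individual tasks they are given.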

10
Hadoop Distributed File System
  • HDFS is a distributed, scalable, and portable
    file system written in Java for the Hadoop
    framework.
  • HDFS keeps multiple copies of the data in
    different locations.
  • The goal of HDFS is to reduce the impact of a
    power failure or switch failure, so that the data
    remains available even if one occurs.

11
Properties of HDFS
  • Large: An HDFS instance may consist of thousands
    of server machines, each storing part of the file
    system's data.
  • Replication: Each data block is replicated
    multiple times (the default is 3).
  • Fault tolerance: Detection of faults and quick,
    automatic recovery from them is a core
    architectural goal of HDFS.
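
The effect of replication on availability can be shown with a short simulation. This is a simplified sketch: the helper names `place_blocks` and `readable_blocks`, the node and block names, and the round-robin placement are assumptions, and real HDFS placement is more involved. Only the replication factor of 3 comes from the slide.

```python
def place_blocks(blocks, nodes, replication=3):
    """Place each block on `replication` distinct nodes (round-robin)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [
            nodes[(i + r) % len(nodes)] for r in range(replication)
        ]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one live replica."""
    return [
        block
        for block, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    ]


nodes = ["dn1", "dn2", "dn3", "dn4"]
placement = place_blocks(["blk_0", "blk_1"], nodes)
survivors = readable_blocks(placement, failed_nodes={"dn1", "dn2"})
print(placement)
print(survivors)
```

With three replicas per block, both blocks remain readable after two of the four nodes fail, because each block still has a copy on a surviving node.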

12
Advantages of Using Hadoop
  • Hadoop is a framework that provides both
    distributed storage and computational
    capabilities.
  • It is extremely scalable.
  • HDFS uses a large block size, which works best
    when manipulating large data sets.
  • HDFS maintains multiple replicas of files, which
    makes it fault tolerant.
  • Hadoop uses the MapReduce framework, which is a
    batch-based, distributed computing framework.
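
A quick calculation shows why a large block size suits large data sets. The slides do not state a block size, so the 64 MB figure below is an assumption (it was the default in early Hadoop releases), and the 4 KB figure is only a stand-in for a typical local file system block. The helper `block_count` is hypothetical.

```python
MB = 1024 * 1024


def block_count(file_size_bytes, block_size_bytes=64 * MB):
    """Number of blocks needed to store a file (ceiling division)."""
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


one_gb = 1024 * MB
large_blocks = block_count(one_gb)                             # 64 MB blocks
small_blocks = block_count(one_gb, block_size_bytes=4 * 1024)  # 4 KB blocks
print(large_blocks, small_blocks)
```

A 1 GB file needs 16 blocks at 64 MB but 262,144 blocks at 4 KB, so a large block size keeps the number of pieces the system has to track and fetch small.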

13
Limitations of Hadoop
  • Security: does not offer storage- or
    network-level encryption.
  • Inefficient for handling small files.
  • Single-master model: can result in a single point
    of failure.
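
One way to see the small-files limitation is to count how many entries the master has to track for the same amount of data. The sketch below rests on background not stated in the slides: it assumes the master keeps one entry per file plus one per block, and it assumes a 64 MB block size. The function `namespace_objects` is hypothetical.

```python
MB = 1024 * 1024
KB = 1024


def namespace_objects(file_sizes, block_size=64 * MB):
    """Count file entries plus block entries the master must track."""
    total = 0
    for size in file_sizes:
        blocks = (size + block_size - 1) // block_size
        total += 1 + blocks  # one entry for the file, one per block
    return total


# The same 1 GB of data, stored two different ways.
one_large_file = namespace_objects([1024 * MB])
many_small_files = namespace_objects([128 * KB] * 8192)
print(one_large_file, many_small_files)
```

Under these assumptions, one 1 GB file costs 17 entries, while the same data split into 8,192 files of 128 KB costs 16,384, which puts far more load on the single master.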

14
Hadoop Vs. Other Systems