PPT – The solution for bigdata - Hadoop PowerPoint presentation

About This Presentation

Title:

The solution for bigdata - Hadoop

Description:

An introduction to the Hadoop framework and a brief description of its structure and how it works

Number of Views:19110

Slides: 27

Provided by: krishnaj.sai

Category: Medicine, Science & Technology

Transcript and Presenter's Notes

1
The solution for big data: Hadoop

J. Sai Krishna and G. Sravya Lahari
2nd B.Tech (CSE)
K.O.R.M College of Engineering
Kadapa

2
Contents

Data trends in storing data
Big data problems in the IT industry
Introduction to Hadoop
HDFS (Hadoop Distributed File System)
MapReduce
Prominent users of Hadoop
Conclusion

3
Data trends in storing data

What is data? Any real-world symbol (character,
numeral, special character), or a group of them,
is said to be data. It may be visual, audio,
textual, etc.

4
Big data

What is big data? In IT, it is a collection of
data sets so large and complex that it becomes
difficult to process using on-hand database
management tools or traditional data processing
applications.
As of 2012, limits on the size of data sets that
are feasible to process in reasonable time were
on the order of exabytes of data.

5
Big data and the problems with it

About 0.5 petabytes of updates are made to
Facebook every day, including 40 million photos.
Every day, YouTube receives enough uploaded video
to be watched continuously for one year.
Limitations are encountered due to large data
sets in many areas, including meteorology,
genomics, complex physics simulations, and
biological and environmental research.
They also affect Internet search, finance, and
business informatics.
The challenges include capture, retrieval,
storage, search, sharing, analysis, and
visualization.

6
HADOOP

Then what could be the solution for big data?

7
What is Hadoop?

It is open-source software written in Java.
The Hadoop software library is a framework that
allows for the distributed processing of large
data sets across clusters of computers using
simple programming models.
It is designed to scale up from single servers to
thousands of machines, each offering local
computation and storage.

8
The project includes these modules

Hadoop Common
Hadoop Distributed File System (HDFS)
Hadoop MapReduce

9
1. Hadoop Common

It provides access to the filesystems supported
by Hadoop.
The Hadoop Common package contains the JAR files
and scripts needed to start Hadoop.
The package also provides source code,
documentation, and a contribution section which
includes projects from the Hadoop community
(Avro, Cassandra, Chukwa, HBase, Hive, Mahout,
Pig, ZooKeeper).

10
2. Hadoop Distributed File System (HDFS)

Hadoop uses HDFS, a distributed file system based
on GFS (Google File System), as its shared
filesystem.
HDFS divides files into large blocks (64 MB by
default; the size is configurable) that are
distributed across data servers.
It has a namenode and datanodes.
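
The block splitting described above can be sketched in a few lines of Python. This is only an illustration: the 64 MB figure comes from the slide, while the function names and the round-robin placement are assumptions made for the example and do not reflect how HDFS actually chooses datanodes.

```python
BLOCK_SIZE_MB = 64  # block size quoted in the slide; configurable in a real cluster


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file of this size would occupy."""
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("sizes must be non-negative and block size positive")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover data
    return sizes


def place_blocks(num_blocks, datanodes):
    """Assign block i to a datanode round-robin (hypothetical policy, not HDFS's)."""
    return {i: datanodes[i % len(datanodes)] for i in range(num_blocks)}


if __name__ == "__main__":
    blocks = split_into_blocks(200)
    print(blocks)
    print(place_blocks(len(blocks), ["datanode-1", "datanode-2"]))
```

A 200 MB file therefore occupies three full blocks plus one partial block, and the namenode only needs to remember which datanode holds each block.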

11
What does HDFS contain?

With federation, HDFS uses multiple independent
namenodes, each managing its own namespace.
The datanodes are used as common storage for
blocks by all the namenodes.
Each datanode registers with all the namenodes in
the cluster.
Datanodes send periodic heartbeats and block
reports and handle commands from the namenodes.
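
To make the registration and heartbeat flow concrete, here is a toy Python model. It is a sketch under assumptions: the class names, method names, and the 30-second timeout are hypothetical and chosen for illustration, not taken from Hadoop's code or defaults. Timestamps are passed in explicitly so the behaviour is easy to follow.

```python
class NameNode:
    """Toy namenode that tracks which datanodes have reported recently."""

    def __init__(self, name, timeout_s=30.0):
        self.name = name
        self.timeout_s = timeout_s   # hypothetical timeout, not Hadoop's default
        self.last_heartbeat = {}     # datanode id -> time of last heartbeat
        self.block_reports = {}      # datanode id -> set of block ids it holds

    def register(self, datanode_id, now):
        self.last_heartbeat[datanode_id] = now
        self.block_reports[datanode_id] = set()

    def heartbeat(self, datanode_id, now):
        self.last_heartbeat[datanode_id] = now

    def block_report(self, datanode_id, block_ids):
        self.block_reports[datanode_id] = set(block_ids)

    def live_datanodes(self, now):
        """Datanodes whose last heartbeat is within the timeout."""
        return sorted(dn for dn, seen in self.last_heartbeat.items()
                      if now - seen <= self.timeout_s)


class DataNode:
    """Toy datanode that talks to every namenode in the cluster."""

    def __init__(self, datanode_id, namenodes):
        self.datanode_id = datanode_id
        self.namenodes = namenodes
        self.blocks = set()

    def register_all(self, now):
        for namenode in self.namenodes:
            namenode.register(self.datanode_id, now)

    def send_heartbeats(self, now):
        for namenode in self.namenodes:
            namenode.heartbeat(self.datanode_id, now)
            namenode.block_report(self.datanode_id, self.blocks)
```

In this model a datanode that stops sending heartbeats simply drops out of `live_datanodes` once the timeout passes, which is the kind of signal a namenode could use to decide that blocks need attention.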

12
Structure of the Hadoop system
13
Master Node

Master node
Keeps track of the namespace and metadata about
items.
Keeps track of MapReduce jobs in the system.
Hadoop is currently configured with centurion064
as the master node.
Hadoop is installed locally on each system.
The install location is
/localtmp/hadoop/hadoop-0.15.3

14
Slave Nodes

Slave nodes
Manage blocks of data sent from the master node.
These are commonly known as chunkservers (in HDFS
terms, datanodes).
Currently centurion060 and centurion064 are the
two slave nodes in use.
Slave nodes store their data in
/localtmp/hadoop/hadoop-dfs (this is
automatically created by the DFS).
Once you use the DFS, relative paths are from
/usr/<your usr id>

15
Advantages and Limitations of HDFS

Reduces traffic on job scheduling.
File access is available through the native Java
API or, via the Thrift API, from a client in the
language of the user's choice (C++, Java, Python,
PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa,
Smalltalk, and OCaml).

It cannot be directly mounted by an existing
operating system.
It should be run on a UNIX or Linux system.

16
3. Hadoop MapReduce

The Hadoop MapReduce framework harnesses a
cluster of machines and executes user-defined
MapReduce jobs across the nodes in the cluster.
A MapReduce computation has two phases: a map
phase and a reduce phase.

17
Map and reduce methods usage
18
Word count over a given set of strings

Input:
We love India
We play tennis
Map output:
We 1, love 1, India 1, We 1, play 1, tennis 1
Reduce output:
love 1, India 1, We 2, tennis 1, play 1
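
The word count on this slide can be reproduced with a small single-process Python sketch. It only mimics the map, group, and reduce steps in memory; it is not the Hadoop API, and the function names are made up for the example. Words are lower-cased here so that "We" and "we" would count as the same key, which is an assumption the slide does not state.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(word, counts):
    """Reduce: sum all the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(word, counts)
                for word, counts in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["We love India", "We play tennis"]))
```

For the two input lines on the slide, "we" ends up with a count of 2 and every other word with a count of 1.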
19
MapReduce with no reduce tasks
20

MapReduce with two reduce tasks - Automatic
Parallel Execution in MapReduce

21
MapReduce - lifecycle
Map function
Map phase
Reduce phase
22
Shuffle and sort in MapReduce with multiple
reduce tasks
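
As a rough illustration of shuffle and sort with several reduce tasks, the Python sketch below routes each key to one reducer and sorts the keys within each reducer's input. The partitioning rule (a CRC32 hash of the key modulo the number of reducers) and the function names are assumptions chosen so the example is deterministic; Hadoop's own partitioner may differ.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick a reducer for a key (illustrative hash-modulo rule)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(mapped_pairs, num_reducers):
    """Return, for each reducer, its (key, [values]) groups sorted by key."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in mapped_pairs:
        buckets[partition(key, num_reducers)][key].append(value)
    return [sorted(bucket.items()) for bucket in buckets]


if __name__ == "__main__":
    pairs = [("we", 1), ("love", 1), ("india", 1),
             ("we", 1), ("play", 1), ("tennis", 1)]
    for reducer_id, groups in enumerate(shuffle_and_sort(pairs, 2)):
        print(reducer_id, groups)
```

The key property is that every occurrence of a given key lands on the same reducer, so each reduce task can compute its totals independently and in parallel with the others.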
23
Prominent users of HADOOP

Amazon: 100 nodes
Facebook: two clusters of 8000 and 3000 nodes
Adobe: 80-node system
eBay: 532-node cluster
Yahoo!: cluster of about 4500 nodes
IIIT Hyderabad: 30-node cluster

24
Achievements

March 2011 - Apache Hadoop takes top prize at
Media Guardian Innovation Awards
July 2008 - Hadoop wins Terabyte Sort Benchmark
25
Conclusion

It reduces traffic on capture, storage, search,
sharing, analysis, and visualization.
A huge amount of data can be stored, and large
computations carried out, in a single system
with safety and security at low cost.
Big data and big data solutions are among the
most pressing issues in today's IT industry, so
working on them will make you more valuable to
it.