Title: The solution for bigdata - Hadoop
1The solution for Big data HADOOP
- J. Sai Krishna and G. Sravya Lahari
- 2nd B.Tech (CSE)
- K.O.R.M College of Engineering
- Kadapa
2Contents
- Data trends in storing data.
- Bigdata problems in IT industry
- Introduction to HADOOP
- HDFS (Hadoop Distributed File System)
- MapReduce
- Prominent users of Hadoop.
- Conclusion
3Data trends in storing data
- What is data? Any real-world symbol (character,
numeral, special character), or a group of them,
is said to be data. It may be visual, audio,
textual, etc.
4Big data
- What is big data? In IT, it is a collection of
data sets so large and complex that it becomes
difficult to process using on-hand database
management tools or traditional data processing
applications.
- As of 2012, limits on the size of data sets that
are feasible to process in a reasonable time were
on the order of exabytes of data.
5 BIGDATA and problems with it.
- Every day, about 0.5 petabytes of updates are
made to Facebook, including 40 million photos.
- Every day, YouTube receives enough new video to
be watched continuously for one year.
- Limitations are encountered due to large data
sets in many areas, including meteorology,
genomics, complex physics simulations, and
biological and environmental research.
- They also affect Internet search, finance, and
business informatics.
- The challenges include capture, retrieval,
storage, search, sharing, analysis, and
visualization.
6HADOOP
- Then what could be the solution for big data?
7What is Hadoop?
- It is open-source software written in Java.
- The Hadoop software library is a framework that
allows for the distributed processing of large
data sets across clusters of computers using
simple programming models.
- It is designed to scale up from single servers to
thousands of machines, each offering local
computation and storage.
8The project includes these modules
- Hadoop Common
- Hadoop Distributed File System (HDFS)
- Hadoop MapReduce
9 1. Hadoop Common
- It provides access to the filesystems supported
by Hadoop.
- The Hadoop Common package contains the JAR files
and scripts needed to start Hadoop.
- The package also provides source code,
documentation, and a contribution section which
includes projects from the Hadoop community
(Avro, Cassandra, Chukwa, HBase, Hive, Mahout,
Pig, ZooKeeper).
10 2. Hadoop Distributed File System (HDFS)
- Hadoop uses HDFS, a distributed file system based
on GFS (Google File System), as its shared
filesystem.
- The HDFS architecture divides files into large
chunks (64 MB by default; this is configurable)
distributed across data servers.
- It has a namenode and datanodes.
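As a rough illustration of the chunking described on this slide, the sketch below works out how a file would be cut into 64 MB blocks. The function name `split_into_blocks` and the 200 MB example file are hypothetical; this is only the arithmetic behind the idea, not HDFS code.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size quoted on the slide (configurable)


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return a list of (offset, length) pairs covering a file of file_size bytes."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


if __name__ == "__main__":
    # Hypothetical 200 MB file: three full 64 MB blocks plus one 8 MB block.
    for offset, length in split_into_blocks(200 * 1024 * 1024):
        print(offset, length)
```

Each (offset, length) pair stands for one chunk that could be placed on a different data server.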
11What does HDFS contain?
- HDFS consists of multiple namenodes (namespaces),
which are federated.
- The datanodes are used as common storage for
blocks by all the namenodes.
- Each datanode registers with all the namenodes in
the cluster.
- Datanodes send periodic heartbeats and block
reports, and handle commands from the namenodes.
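The registration and heartbeat behaviour listed on this slide can be sketched with a toy model. Everything here is an assumption made for illustration: the class names `NameNode` and `DataNode`, the 30-second timeout, and the explicit `now` argument (used instead of a real clock so the example is repeatable). Commands sent back from the namenodes are left out.

```python
class NameNode:
    """Toy namenode that remembers when each datanode last reported in."""

    def __init__(self, name, timeout=30.0):
        self.name = name
        self.timeout = timeout  # assumed: seconds of silence before a node counts as dead
        self.last_seen = {}     # datanode id -> time of last heartbeat
        self.blocks = {}        # datanode id -> block ids from the last block report

    def register(self, datanode_id, now):
        self.last_seen[datanode_id] = now
        self.blocks[datanode_id] = set()

    def heartbeat(self, datanode_id, now):
        self.last_seen[datanode_id] = now

    def block_report(self, datanode_id, block_ids):
        self.blocks[datanode_id] = set(block_ids)

    def live_datanodes(self, now):
        return sorted(d for d, t in self.last_seen.items() if now - t <= self.timeout)


class DataNode:
    """Toy datanode that reports to every namenode in the cluster."""

    def __init__(self, node_id, namenodes):
        self.node_id = node_id
        self.namenodes = namenodes
        self.block_ids = set()

    def register_all(self, now):
        for namenode in self.namenodes:
            namenode.register(self.node_id, now)

    def send_heartbeat(self, now):
        for namenode in self.namenodes:
            namenode.heartbeat(self.node_id, now)
            namenode.block_report(self.node_id, self.block_ids)


if __name__ == "__main__":
    namenodes = [NameNode("nn1"), NameNode("nn2")]
    dn1, dn2 = DataNode("dn1", namenodes), DataNode("dn2", namenodes)
    dn1.register_all(now=0.0)
    dn2.register_all(now=0.0)
    dn1.block_ids.add("blk_1")
    dn1.send_heartbeat(now=40.0)              # dn2 stays silent
    print(namenodes[0].live_datanodes(45.0))  # ['dn1']
```

Because every datanode reports to every namenode, each namenode ends up with the same view of which nodes are alive.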
12Structure of Hadoop system
13 Master Node
- Master node
- Keeps track of namespace and metadata about items
- Keeps track of MapReduce jobs in the system
- Hadoop is currently configured with centurion064
as the master node.
- Hadoop is locally installed on each system.
- The installed location is
/localtmp/hadoop/hadoop-0.15.3
14 Slave Nodes
- Slave nodes
- Manage blocks of data sent from master node
- These correspond to the chunkservers in GFS terms
- Currently centurion060 and centurion064 are the
two slave nodes being used.
- Slave nodes store their data in
/localtmp/hadoop/hadoop-dfs (this is
automatically created by the DFS).
- Once you use the DFS, relative paths are from
/usr/<your user id>
15Advantages and Limitations of HDFS
- It reduces traffic on job scheduling.
- File access can be achieved through the native
Java API or a language of the user's choice (C++,
Java, Python, PHP, Ruby, Erlang, Perl, Haskell,
C#, Cocoa, Smalltalk, and OCaml).
- It cannot be directly mounted by an existing
operating system.
- It requires a UNIX or Linux system.
16 3. Hadoop MapReduce system
- The Hadoop MapReduce framework harnesses a
cluster of machines and executes user-defined
MapReduce jobs across the nodes in the cluster.
- A MapReduce computation has two phases:
- a map phase and
- a reduce phase.
17Map and reduce methods usage
18Word Count over a Given Set of strings
- Input: "We love India", "We play tennis"
- Map output: We 1, love 1, India 1, We 1, play 1,
tennis 1
- Reduce output: love 1, India 1, We 2, tennis 1,
play 1
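The word count on this slide can be reproduced with a minimal in-memory sketch. This is plain Python, not the Hadoop API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical and chosen only to mirror the phases named in the deck.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Group the emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum all the counts emitted for one word."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["We love India", "We play tennis"]))
    # {'We': 2, 'love': 1, 'India': 1, 'play': 1, 'tennis': 1}
```

In a real cluster the map calls would run on different machines; here they run one after another in a single process.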
19MapReduce with no reduce tasks
20MapReduce with two reduce tasks: automatic
parallel execution in MapReduce
21MapReduce - lifecycle
- Figure: MapReduce lifecycle, showing the map
function, the map phase, and the reduce phase
22Shuffle and sort in MapReduce with multiple reduce tasks
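With more than one reduce task, each key has to be routed to exactly one reducer, and each reducer sees its keys in sorted order. The sketch below illustrates that idea under stated assumptions: the function names are hypothetical, and `zlib.crc32` is swapped in as the partitioning hash only because it gives the same result on every run. It is not the partitioner Hadoop itself uses.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick a reduce task for a key using a hash that is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(pairs, num_reducers):
    """Route each (key, value) pair to one reducer, then sort each reducer's keys."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)][key].append(value)
    return [sorted(bucket.items()) for bucket in buckets]


def run_reducers(partitions):
    """Sum the values for every key, one output list per reduce task."""
    return [[(key, sum(values)) for key, values in part] for part in partitions]


if __name__ == "__main__":
    map_output = [("We", 1), ("love", 1), ("India", 1),
                  ("We", 1), ("play", 1), ("tennis", 1)]
    for task, output in enumerate(run_reducers(shuffle_and_sort(map_output, 2))):
        print("reduce task", task, output)
```

Because both occurrences of "We" hash to the same reduce task, the final count for each word is produced in one place, and the reducers can run independently of each other.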
23Prominent users of HADOOP
- Amazon: 100 nodes
- Facebook: two clusters of 8000 and 3000 nodes
- Adobe: 80-node system
- eBay: 532-node cluster
- Yahoo: cluster of about 4500 nodes
- IIIT Hyderabad: 30-node cluster
24Achievements
- March 2011 - Apache Hadoop takes top prize at
Media Guardian Innovation Awards
- July 2012 - Hadoop wins Terabyte Sort Benchmark
25Conclusion
- It reduces traffic on capture, storage, search,
sharing, analysis, and visualization.
- A huge amount of data can be stored and large
computations can be done in one place, safely and
securely, at low cost.
- Big data and big data solutions are among the
burning issues in the present IT industry, so
working on them will surely make you more
valuable to that industry.
26Thank you