Transcript and Presenter's Notes

Title: Big Data Overview of storage and processing


1
Big Data: Overview of Storage and Processing
  • David Gibbs and Govardhan Tanniru
  • Georgia State University
  • Department of Computer Science
  • P.O. Box 3965 Atlanta, GA 30302-3965.

2
Big Data
  • Big Data does not only relate to the size of data
  • Complexity: missing information, dummy data, organization
  • Processing: software, processing power, parallel and
    distributed computing
  • Data Transfer: limitations of current systems, CPU intensive
  • Storage: data sets beyond relational databases, clusters,
    data centers, distributed data
  • User Interaction: non-programmers need to work with complex
    information, real-time GUI interfaces, visualization of data

3
Where the Field is
  • Primary sources of big data
  • Meteorology
  • Complex physics simulations
  • Biology
  • Business
  • Web searching
  • Social networking
  • Telecommunications
  • Many programs for storage and processing
  • Most popular: HDFS, GFS, Hadoop, and MapReduce
  • No standard for processing/storing data
  • No common off the shelf software
  • Increases the difficulty in mining data within a
    field or industry

4
Difficulties
  • Storage
  • Developing a system in which very large amounts
    of data can be stored securely and accessed
    quickly
  • Transfer
  • Transfer from the storage site to the processing
    site
  • Moving large amounts of data over TCP is costly (see the
    rough estimate below)
  • Processing
  • How powerful of a system is needed?
  • There is a lot of data but no information
  • Processing the data in an efficient manner and
    obtaining the correct information
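  • As a rough illustration of the transfer cost mentioned
    above, a minimal Python sketch (the data size, link speed,
    and 70% efficiency figure are assumed example values, not
    numbers from the slides):

    # Back-of-the-envelope transfer time; illustrative numbers only.
    def transfer_hours(data_terabytes, link_gbps, efficiency=0.7):
        """Estimate hours to move a data set over a single link.

        efficiency is an assumed factor for TCP/protocol overhead.
        """
        bits = data_terabytes * 8e12              # terabytes -> bits
        effective_bps = link_gbps * 1e9 * efficiency
        return bits / effective_bps / 3600.0

    # Example: 10 TB over a 1 Gbps link takes roughly 32 hours.
    print(round(transfer_hours(10, 1), 1))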

5
The Direction of Data Storage
  • NoSQL
  • Allows storage of massive data sets without the
    need for overwhelming tables and indexing
  • Each cluster stores part of the data and
    replicates it on other clusters
  • Master/Slave architecture
  • HDFS (Hadoop Distributed File System)
  • P2P architecture
  • Cassandra
  • ColumnFamily data model
  • Increased difficulty for data mining
  • No Join operations
  • Pulling in more data than needed
  • Increased transfer times, processing power
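  • A minimal sketch of the ColumnFamily idea above, with plain
    Python dictionaries standing in for a system such as
    Cassandra (the row and column names are invented for
    illustration). Because there are no joins, related data is
    denormalized into one wide row, and a read pulls the whole
    row even when only part of it is needed:

    # Toy column-family store: {row_key: {column_name: value}}.
    # Illustrative only; a real system shards and replicates rows across nodes.
    users = {
        "user:42": {                        # one wide, denormalized row
            "name": "Ada",
            "email": "ada@example.com",
            "order:1001": "3 books",        # order data folded into the user row
            "order:1002": "1 laptop",       # (no join against an orders table)
        }
    }

    def get_row(store, row_key):
        # Reads fetch the whole row, often more data than the caller needs.
        return store.get(row_key, {})

    row = get_row(users, "user:42")
    orders = {c: v for c, v in row.items() if c.startswith("order:")}
    print(orders)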

6
More About NoSQL (ACID)
  • The key advantage of schema-free design is that
    it enables applications to quickly upgrade the
    structure of data without table rewrites.
  • The data validity and integrity aspect is
    enforced at the data management layer.
  • NoSQL systems typically do not maintain complete
    consistency across servers because of the burden this
    places on databases, particularly distributed ones.
  • The Consistency, Availability, Partition tolerance (CAP)
    theorem states that a distributed system can guarantee at
    most two of consistency, availability, and partition
    tolerance at any time.
  • Traditional relational databases enforce strict
    transactional semantics to preserve consistency,
    but many NoSQL databases have more scalable
    architectures that relax the consistency
    requirement.
  • Some NoSQL databases put objects into a conflict state
    when concurrent updates diverge; ultimately, it is the
    responsibility of the application to deal with these
    conflicts.
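  • A minimal sketch of what application-level conflict
    handling can look like, assuming a simple last-write-wins
    policy based on timestamps (one common choice, not the only
    one; the version format here is invented for illustration):

    # Two replicas returned different versions of the same key after a partition.
    # Each version is (timestamp, value); the application must pick a winner.
    replica_a = (1700000100, {"cart": ["book"]})
    replica_b = (1700000175, {"cart": ["book", "laptop"]})

    def resolve_last_write_wins(*versions):
        # Keep the version with the newest timestamp; older writes are discarded.
        # Real systems may instead merge values or surface the conflict to a user.
        return max(versions, key=lambda version: version[0])

    winner = resolve_last_write_wins(replica_a, replica_b)
    print(winner[1])   # {'cart': ['book', 'laptop']}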

7
Important Papers by Google
  • Google File System
  • MapReduce
  • BigTable

8
Google File System
  • Google has reexamined traditional choices/assumptions and
    explored radically different points in the design space.
  • First, component failures are the norm rather
    than the exception.
  • -> The system is built from many inexpensive commodity
    components that often fail. It must constantly monitor
    itself and detect, tolerate, and recover promptly from
    component failures on a routine basis.
  • Second, files are huge by traditional standards.
    Multi-GB files are common.
  • Third, most files are mutated by appending new
    data rather than overwriting existing data.
  • Fourth, co-designing the applications and the file system
    API benefits the overall system by increasing flexibility.

9
Consistency Model
  • Random writes within a file are practically
    non-existent. Once written, the files are only
    read, and often only sequentially.
  • A variety of data share these characteristics.
  • Appending becomes the focus of performance
    optimization and atomicity guarantees, while
    caching data blocks in the client loses its
    appeal.
  • Google has introduced an atomic append operation
    so that multiple clients can append concurrently
    to a file without extra synchronization between
    them.
  • Snapshot creates a copy of a file or a directory tree at
    low cost.
  • Record append allows multiple clients to append data to
    the same file concurrently while guaranteeing the atomicity
    of each individual client's append, without additional
    locking.
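  • To illustrate the record-append guarantee, a toy in-process
    sketch (not the GFS protocol) in which many writers append
    concurrently and every record stays intact, because
    atomicity is provided inside the storage layer rather than
    by coordination between the clients:

    import threading

    class RecordLog:
        """Toy append-only log: each append lands atomically as one record."""
        def __init__(self):
            self._records = []
            self._lock = threading.Lock()    # stands in for the server-side guarantee

        def record_append(self, data):
            with self._lock:                 # clients never coordinate with each other
                self._records.append(data)

    log = RecordLog()
    writers = [threading.Thread(target=log.record_append, args=(f"record-{i}",))
               for i in range(100)]
    for w in writers: w.start()
    for w in writers: w.join()
    print(len(log._records))                 # 100: no record lost or torn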

10
GFS Architecture
  • Master servers keep metadata on the various data
    files.
  • Chunk servers store the actual data on disk. Each chunk is
    replicated across three different chunk servers to create
    redundancy in case of server crashes.
  • Once directed by a master server, a client
    application retrieves files directly from chunk
    servers.
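  • A minimal sketch of the read path described above, with
    dictionaries standing in for the master's metadata and the
    chunk servers (the file name and chunk handles are invented
    for illustration; the 64 MB chunk size follows the GFS
    paper):

    # Master metadata: file -> chunk handles, and handle -> replica servers.
    CHUNK_SIZE = 64 * 1024 * 1024            # 64 MB chunks, as in the GFS paper
    file_table = {"/logs/web.log": ["chunk-0001", "chunk-0002"]}
    locations = {"chunk-0001": ["cs-a", "cs-b", "cs-c"],    # three replicas each
                 "chunk-0002": ["cs-b", "cs-c", "cs-d"]}
    chunkservers = {"cs-a": {"chunk-0001": b"...data..."},
                    "cs-b": {"chunk-0001": b"...data...", "chunk-0002": b"...more..."},
                    "cs-c": {"chunk-0001": b"...data...", "chunk-0002": b"...more..."},
                    "cs-d": {"chunk-0002": b"...more..."}}

    def read(path, offset):
        # 1) Ask the master which chunk covers the offset and where its replicas live.
        handle = file_table[path][offset // CHUNK_SIZE]
        replicas = locations[handle]
        # 2) Fetch the data directly from a chunk server; the master is not involved.
        return chunkservers[replicas[0]][handle]

    print(read("/logs/web.log", 0))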

11
Map Reduce Operation
  • MapReduce is a programming model and an
    associated implementation for processing and
    generating large data sets.
  • Users specify a map function that processes a
    key/value pair to generate a set of intermediate
    key/value pairs.
  • They also specify a reduce function that merges all
    intermediate values associated with the same intermediate
    key.
  • The MapReduce system has three different types of servers.
    The master server assigns user tasks to map and reduce
    servers and tracks the state of those tasks.
  • The map servers accept user input and perform map
    operations on it. The results are written to intermediate
    files.
  • The reduce servers accept the intermediate files produced
    by the map servers and perform reduce operations on them.
  • The steps look like GFS -> Map -> Shuffle -> Reduce ->
    store results back into GFS.
  • In MapReduce, a map transforms one view of the data into
    another, producing key/value pairs.
  • Data transferred between map and reduce servers
    is compressed. The idea is that because servers
    aren't CPU bound it makes sense to spend on data
    compression and decompression in order to save on
    bandwidth and I/O.
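  • A small sketch of that compression trade-off using Python's
    standard zlib module on some made-up intermediate key/value
    pairs: CPU is spent before the shuffle to save network
    bandwidth and I/O:

    import json, zlib

    # Made-up intermediate map output: (word, "1") pairs, highly repetitive.
    intermediate = [("the", "1"), ("data", "1"), ("the", "1")] * 1000

    payload = json.dumps(intermediate).encode()
    compressed = zlib.compress(payload)          # extra CPU on the map side...

    print(len(payload), "->", len(compressed), "bytes on the wire")

    restored = json.loads(zlib.decompress(compressed))   # ...and on the reduce side
    assert [tuple(pair) for pair in restored] == intermediate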

12
Map and Reduce (contd.)
    map(String key, String value):
      // key: document name
      // value: document contents
      for each word w in value:
        EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
      // key: a word
      // values: a list of counts
      int result = 0;
      for each v in values:
        result += ParseInt(v);
      Emit(AsString(result));
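  • A runnable Python version of the pseudocode above, with the
    shuffle step written out so the GFS -> Map -> Shuffle ->
    Reduce flow from the previous slide is visible (a
    single-process sketch, not a distributed implementation):

    from collections import defaultdict

    def map_phase(doc_name, doc_contents):
        # Emit an intermediate (word, "1") pair for every word in the document.
        return [(word, "1") for word in doc_contents.split()]

    def shuffle(pairs):
        # Group all intermediate values by key, as the framework does between phases.
        grouped = defaultdict(list)
        for key, value in pairs:
            grouped[key].append(value)
        return grouped

    def reduce_phase(word, counts):
        # Sum the list of counts for one word.
        return word, sum(int(c) for c in counts)

    documents = {"doc1": "big data big storage", "doc2": "big processing"}
    intermediate = [pair for name, text in documents.items()
                    for pair in map_phase(name, text)]
    results = dict(reduce_phase(w, vs) for w, vs in shuffle(intermediate).items())
    print(results)   # {'big': 3, 'data': 1, 'storage': 1, 'processing': 1}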

13
Big Table
  • BigTable is a large scale, fault tolerant, self
    managing system that includes terabytes of memory
    and petabytes of storage. It can handle millions
    of reads/writes per second.
  • BigTable is a distributed hash mechanism built on
    top of GFS. It is not a relational database. It
    doesn't support joins or SQL type queries.
  • It provides a lookup mechanism to access structured data
    by key. GFS stores opaque data, while many applications
    need data with structure.
  • Machines can be added and deleted while the
    system is running and the whole system just
    works.
  • Each data item is stored in a cell which can be accessed
    using a row key, a column key, and a timestamp.
  • BigTable has three different types of servers (master,
    tablet, and lock servers).
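  • A minimal sketch of the cell-addressing idea above: each
    cell is keyed by row, column, and timestamp, and a lookup
    returns the newest version (plain Python dictionaries; the
    table contents are invented for illustration):

    from collections import defaultdict

    # table[(row_key, column_key)] -> list of (timestamp, value) versions.
    table = defaultdict(list)

    def put(row, column, timestamp, value):
        table[(row, column)].append((timestamp, value))

    def get_latest(row, column):
        # Return the most recent version of the cell, if any.
        versions = table.get((row, column), [])
        return max(versions, key=lambda v: v[0])[1] if versions else None

    put("com.example.www", "contents:html", 1, "<html>v1</html>")
    put("com.example.www", "contents:html", 2, "<html>v2</html>")
    print(get_latest("com.example.www", "contents:html"))   # <html>v2</html>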

14
Hardware strategy
  • Use ultra-cheap commodity hardware and build software on
    top to handle its failures.
  • A 1,000-fold computer power increase can be had for a 33
    times lower cost if you use a failure-prone infrastructure
    rather than an infrastructure built on highly reliable
    components. You must build reliability on top of
    unreliability for this strategy to work.

15
Mixed Architectures
  • Many papers focus on the integration of traditional and
    big data architectures.
  • We need architectures that handle both types of data.
  • The accompanying diagram is from an Oracle white paper.

16
Other Areas of Focus
  • Knowledge Discovery in Databases.
  • Bringing the big data and big compute
    communities together is an active area of
    research.
  • Hybrid ways of storing unstructured data (file systems and
    DBMS).
  • Efficient data transfer protocols for Big Data
    (high-performance network data movement).
  • Use of cloud computing for Big Data.
  • Compression aspects and I/O performance analysis for Big
    Data clustering.
  • Privacy implications of social networking sites (e.g.,
    friends tagging another person).
  • Faults with Hadoop might help our research.

17
Questions