Transcript and Presenter's Notes

Title: Big Data Overview of storage and processing


1
Big Data: Overview of Storage and Processing
  • David Gibbs and Govardhan Tanniru
  • Georgia State University
  • Department of Computer Science
  • P.O. Box 3965 Atlanta, GA 30302-3965.

2
Big Data
  • Big Data does not only relate to the size of data
  • Complexity: missing information, dummy data, organization
  • Processing: software, processing power, parallel and
    distributed computing
  • Data Transfer: limitations of current systems, CPU intensive
  • Storage: data sets beyond relational databases, clusters,
    data centers, distributed data
  • User Interaction: non-programmers need to work with complex
    information, real-time GUI interfaces, visualization of data

3
Where the Field is
  • Primary sources of big data
  • Meteorology
  • Complex physics simulations
  • Biology
  • Business
  • Web searching
  • Social networking
  • Telecommunications
  • Many programs for storage and processing
  • Most popular: HDFS, GFS, Hadoop, and MapReduce
  • No standard for processing/storing data
  • No common off the shelf software
  • Increases the difficulty in mining data within a
    field or industry

4
Difficulties
  • Storage
  • Developing a system in which very large amounts
    of data can be stored securely and accessed
    quickly
  • Transfer
  • Transfer from the storage site to the processing
    site
  • Moving large amounts of data over TCP is costly (see the
    rough estimate below)
  • Processing
  • How powerful of a system is needed?
  • There is a lot of data but no information
  • Processing the data in an efficient manner and
    obtaining the correct information
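  • As a rough illustration of the transfer cost mentioned
    above, a minimal Python sketch (the data size, link speed,
    and 70% efficiency figure are assumed example values, not
    numbers from the slides):

    # Back-of-the-envelope transfer time; illustrative numbers only.
    def transfer_hours(data_terabytes, link_gbps, efficiency=0.7):
        """Estimate hours to move a data set over a single link.

        efficiency is an assumed factor for TCP/protocol overhead.
        """
        bits = data_terabytes * 8e12              # terabytes -> bits
        effective_bps = link_gbps * 1e9 * efficiency
        return bits / effective_bps / 3600.0

    # Example: 10 TB over a 1 Gbps link takes roughly 32 hours.
    print(round(transfer_hours(10, 1), 1))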

5
The Direction of Data Storage
  • NoSQL
  • Allows storage of massive data sets without the
    need for overwhelming tables and indexing
  • Each cluster stores part of the data and
    replicates it on other clusters
  • Master/Slave architecture
  • HDFS (Hadoop Distributed File System)
  • P2P architecture
  • Cassandra
  • ColumnFamily data model
  • Increased difficulty for data mining
  • No Join operations
  • Pulling in more data than needed
  • Increased transfer times, processing power
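  • A minimal sketch of the ColumnFamily idea above, with plain
    Python dictionaries standing in for a system such as
    Cassandra (the row and column names are invented for
    illustration). Because there are no joins, related data is
    denormalized into one wide row, and a read pulls the whole
    row even when only part of it is needed:

    # Toy column-family store: {row_key: {column_name: value}}.
    # Illustrative only; a real system shards and replicates rows across nodes.
    users = {
        "user:42": {                        # one wide, denormalized row
            "name": "Ada",
            "email": "ada@example.com",
            "order:1001": "3 books",        # order data folded into the user row
            "order:1002": "1 laptop",       # (no join against an orders table)
        }
    }

    def get_row(store, row_key):
        # Reads fetch the whole row, often more data than the caller needs.
        return store.get(row_key, {})

    row = get_row(users, "user:42")
    orders = {c: v for c, v in row.items() if c.startswith("order:")}
    print(orders)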

6
More About NoSQL (ACID)
  • The key advantage of schema-free design is that
    it enables applications to quickly upgrade the
    structure of data without table rewrites.
  • The data validity and integrity aspect is
    enforced at the data management layer.
  • NoSQL systems typically do not maintain complete
    consistency across servers because of the burden this
    places on databases, particularly distributed ones.
  • The Consistency, Availability, Partition tolerance (CAP)
    theorem states that a distributed system can guarantee at
    most two of consistency, availability, and partition
    tolerance at any time.
  • Traditional relational databases enforce strict
    transactional semantics to preserve consistency,
    but many NoSQL databases have more scalable
    architectures that relax the consistency
    requirement.
  • Some NoSQL databases put objects into a conflict state
    when concurrent updates diverge; ultimately, it is the
    responsibility of the application to deal with these
    conflicts.
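  • A minimal sketch of what application-level conflict
    handling can look like, assuming a simple last-write-wins
    policy based on timestamps (one common choice, not the only
    one; the version format here is invented for illustration):

    # Two replicas returned different versions of the same key after a partition.
    # Each version is (timestamp, value); the application must pick a winner.
    replica_a = (1700000100, {"cart": ["book"]})
    replica_b = (1700000175, {"cart": ["book", "laptop"]})

    def resolve_last_write_wins(*versions):
        # Keep the version with the newest timestamp; older writes are discarded.
        # Real systems may instead merge values or surface the conflict to a user.
        return max(versions, key=lambda version: version[0])

    winner = resolve_last_write_wins(replica_a, replica_b)
    print(winner[1])   # {'cart': ['book', 'laptop']}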

7
Important Papers by Google
  • Google File System
  • MapReduce
  • BigTable

8
Google File System
  • Google has reexamined traditional choices/assumptions and
    explored radically different points in the design space.
  • First, component failures are the norm rather
    than the exception.
  • -> The system is built from many inexpensive commodity
    components that often fail. It must constantly monitor
    itself and detect, tolerate, and recover promptly from
    component failures on a routine basis.
  • Second, files are huge by traditional standards.
    Multi-GB files are common.
  • Third, most files are mutated by appending new
    data rather than overwriting existing data.
  • Fourth, co-designing the applications and the file system
    API benefits the overall system by increasing flexibility.

9
Consistency Model
  • Random writes within a file are practically
    non-existent. Once written, the files are only
    read, and often only sequentially.
  • A variety of data share these characteristics.
  • Appending becomes the focus of performance
    optimization and atomicity guarantees, while
    caching data blocks in the client loses its
    appeal.
  • Google has introduced an atomic append operation
    so that multiple clients can append concurrently
    to a file without extra synchronization between
    them.
  • Snapshot creates a copy of a file or a directory tree at
    low cost.
  • Record append allows multiple clients to append data to
    the same file concurrently while guaranteeing the atomicity
    of each individual client's append, without additional
    locking.
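  • To illustrate the record-append guarantee, a toy in-process
    sketch (not the GFS protocol) in which many writers append
    concurrently and every record stays intact, because
    atomicity is provided inside the storage layer rather than
    by coordination between the clients:

    import threading

    class RecordLog:
        """Toy append-only log: each append lands atomically as one record."""
        def __init__(self):
            self._records = []
            self._lock = threading.Lock()    # stands in for the server-side guarantee

        def record_append(self, data):
            with self._lock:                 # clients never coordinate with each other
                self._records.append(data)

    log = RecordLog()
    writers = [threading.Thread(target=log.record_append, args=(f"record-{i}",))
               for i in range(100)]
    for w in writers: w.start()
    for w in writers: w.join()
    print(len(log._records))                 # 100: no record lost or torn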

10
GFS Architecture
  • Master servers keep metadata on the various data
    files.
  • Chunk servers store the actual data on disk. Each chunk is
    replicated across three different chunk servers to create
    redundancy in case of server crashes.
  • Once directed by a master server, a client
    application retrieves files directly from chunk
    servers.
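  • A minimal sketch of the read path described above, with
    dictionaries standing in for the master's metadata and the
    chunk servers (the file name and chunk handles are invented
    for illustration; the 64 MB chunk size follows the GFS
    paper):

    # Master metadata: file -> chunk handles, and handle -> replica servers.
    CHUNK_SIZE = 64 * 1024 * 1024            # 64 MB chunks, as in the GFS paper
    file_table = {"/logs/web.log": ["chunk-0001", "chunk-0002"]}
    locations = {"chunk-0001": ["cs-a", "cs-b", "cs-c"],    # three replicas each
                 "chunk-0002": ["cs-b", "cs-c", "cs-d"]}
    chunkservers = {"cs-a": {"chunk-0001": b"...data..."},
                    "cs-b": {"chunk-0001": b"...data...", "chunk-0002": b"...more..."},
                    "cs-c": {"chunk-0001": b"...data...", "chunk-0002": b"...more..."},
                    "cs-d": {"chunk-0002": b"...more..."}}

    def read(path, offset):
        # 1) Ask the master which chunk covers the offset and where its replicas live.
        handle = file_table[path][offset // CHUNK_SIZE]
        replicas = locations[handle]
        # 2) Fetch the data directly from a chunk server; the master is not involved.
        return chunkservers[replicas[0]][handle]

    print(read("/logs/web.log", 0))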

11
Map Reduce Operation
  • MapReduce is a programming model and an
    associated implementation for processing and
    generating large data sets.
  • Users specify a map function that processes a
    key/value pair to generate a set of intermediate
    key/value pairs.
  • They also specify a reduce function that merges all
    intermediate values associated with the same intermediate
    key.
  • The MapReduce system has three different types of servers.
    The master server assigns user tasks to map and reduce
    servers and tracks the state of those tasks.
  • The map servers accept user input and perform map
    operations on it. The results are written to intermediate
    files.
  • The reduce servers accept the intermediate files produced
    by the map servers and perform reduce operations on them.
  • The steps look like GFS -> Map -> Shuffle -> Reduce ->
    store results back into GFS.
  • In MapReduce, a map transforms one view of the data into
    another, producing key/value pairs.
  • Data transferred between map and reduce servers
    is compressed. The idea is that because servers
    aren't CPU bound it makes sense to spend on data
    compression and decompression in order to save on
    bandwidth and I/O.
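  • A small sketch of that compression trade-off using Python's
    standard zlib module on some made-up intermediate key/value
    pairs: CPU is spent before the shuffle to save network
    bandwidth and I/O:

    import json, zlib

    # Made-up intermediate map output: (word, "1") pairs, highly repetitive.
    intermediate = [("the", "1"), ("data", "1"), ("the", "1")] * 1000

    payload = json.dumps(intermediate).encode()
    compressed = zlib.compress(payload)          # extra CPU on the map side...

    print(len(payload), "->", len(compressed), "bytes on the wire")

    restored = json.loads(zlib.decompress(compressed))   # ...and on the reduce side
    assert [tuple(pair) for pair in restored] == intermediate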

12
Map and Reduce (contd.)
    map(String key, String value):
      // key: document name
      // value: document contents
      for each word w in value:
        EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
      // key: a word
      // values: a list of counts
      int result = 0;
      for each v in values:
        result += ParseInt(v);
      Emit(AsString(result));
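  • A runnable Python version of the pseudocode above, with the
    shuffle step written out so the GFS -> Map -> Shuffle ->
    Reduce flow from the previous slide is visible (a
    single-process sketch, not a distributed implementation):

    from collections import defaultdict

    def map_phase(doc_name, doc_contents):
        # Emit an intermediate (word, "1") pair for every word in the document.
        return [(word, "1") for word in doc_contents.split()]

    def shuffle(pairs):
        # Group all intermediate values by key, as the framework does between phases.
        grouped = defaultdict(list)
        for key, value in pairs:
            grouped[key].append(value)
        return grouped

    def reduce_phase(word, counts):
        # Sum the list of counts for one word.
        return word, sum(int(c) for c in counts)

    documents = {"doc1": "big data big storage", "doc2": "big processing"}
    intermediate = [pair for name, text in documents.items()
                    for pair in map_phase(name, text)]
    results = dict(reduce_phase(w, vs) for w, vs in shuffle(intermediate).items())
    print(results)   # {'big': 3, 'data': 1, 'storage': 1, 'processing': 1}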

13
Big Table
  • BigTable is a large scale, fault tolerant, self
    managing system that includes terabytes of memory
    and petabytes of storage. It can handle millions
    of reads/writes per second.
  • BigTable is a distributed hash mechanism built on
    top of GFS. It is not a relational database. It
    doesn't support joins or SQL type queries.
  • It provides a lookup mechanism to access structured data
    by key. GFS stores opaque data, while many applications
    need data with structure.
  • Machines can be added and deleted while the
    system is running and the whole system just
    works.
  • Each data item is stored in a cell which can be accessed
    using a row key, a column key, and a timestamp.
  • BigTable has three different types of servers (master,
    tablet, and lock servers).
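  • A minimal sketch of the cell-addressing idea above: each
    cell is keyed by row, column, and timestamp, and a lookup
    returns the newest version (plain Python dictionaries; the
    table contents are invented for illustration):

    from collections import defaultdict

    # table[(row_key, column_key)] -> list of (timestamp, value) versions.
    table = defaultdict(list)

    def put(row, column, timestamp, value):
        table[(row, column)].append((timestamp, value))

    def get_latest(row, column):
        # Return the most recent version of the cell, if any.
        versions = table.get((row, column), [])
        return max(versions, key=lambda v: v[0])[1] if versions else None

    put("com.example.www", "contents:html", 1, "<html>v1</html>")
    put("com.example.www", "contents:html", 2, "<html>v2</html>")
    print(get_latest("com.example.www", "contents:html"))   # <html>v2</html>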

14
Hardware strategy
  • Use ultra-cheap commodity hardware and build software on
    top to handle its failures.
  • A 1,000-fold computer power increase can be had for a 33
    times lower cost if you use a failure-prone infrastructure
    rather than an infrastructure built on highly reliable
    components. You must build reliability on top of
    unreliability for this strategy to work.

15
Mixed Architectures
  • Many papers focus on the integration of traditional and
    big data architectures.
  • We need architectures that handle both types of data.
  • The accompanying diagram is from an Oracle white paper.

16
Other Areas of Focus
  • Knowledge Discovery in Databases.
  • Bringing the big data and big compute
    communities together is an active area of
    research.
  • Hybrid ways of storing unstructured data (file systems and
    DBMS).
  • Efficient data transfer protocols for Big Data
    (high-performance network data movement).
  • Use of cloud computing for Big Data.
  • Compression aspects and I/O performance analysis for Big
    Data clustering.
  • Privacy implications of social networking sites (e.g.,
    friends tagging another person).
  • Faults with Hadoop might help our research.

17
Questions