CSC590 Selected Topics

1
CSC590 Selected Topics
  • Bigtable: A Distributed Storage System for
    Structured Data
  • Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson
    C. Hsieh, Deborah A. Wallach
  • Mike Burrows, Tushar Chandra, Andrew Fikes,
    Robert E. Gruber

by Haifa Alyahya 432920323
2
Outline
  • Introduction
  • Data Model
  • APIs
  • Building Blocks
  • Implementation
  • Refinements
  • Performance
  • Real Applications
  • Conclusion

3
Discussion
  • Bigtable (Bt) is a distributed storage system for
    managing structured data that is designed to
    scale to a very large size.
  • Many projects at Google store data in Bigtable,
    including web indexing, Google Earth, and Google
    Finance.

4
Introduction
  • Bigtable is designed to reliably scale to
    petabytes of data and thousands of machines.
  • Bigtable has achieved several goals:
      • Wide applicability
      • Scalability
      • High performance
      • High availability

5
Motivation
  • Scale problem
      • Lots of data
      • Millions of machines
      • Different projects/applications
      • Hundreds of millions of users
  • Storage for (semi-)structured data
  • No commercial system big enough
      • Couldn't afford one if there were
  • Low-level storage optimizations help performance
    significantly
      • Much harder to do when running on top of a
        database layer

6
Data Model
  • A sparse, distributed, persistent
    multi-dimensional sorted map
  • (row, column, timestamp) -> cell contents
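
The mapping above can be sketched as a small in-memory structure. This is only a toy illustration, not Bigtable's implementation: the class name `TinyTable`, its methods, and the sample row and column names are assumptions made for this example. Only cells that were written are stored (sparse), and keys are kept in sorted order with newer timestamps first.

```python
from bisect import insort


class TinyTable:
    """Toy stand-in for the (row, column, timestamp) -> cell contents map."""

    def __init__(self):
        self._keys = []   # sorted list of (row, column, -timestamp)
        self._cells = {}  # same key -> cell contents (uninterpreted bytes)

    def put(self, row, column, timestamp, value):
        # Negate the timestamp so newer versions of a cell sort first.
        key = (row, column, -timestamp)
        if key not in self._cells:
            insort(self._keys, key)
        self._cells[key] = value

    def get(self, row, column, timestamp):
        # Absent cells cost nothing to store: the map is sparse.
        return self._cells.get((row, column, -timestamp))

    def items(self):
        """Yield ((row, column, timestamp), value) in sorted key order."""
        for row, column, neg_ts in self._keys:
            yield (row, column, -neg_ts), self._cells[(row, column, neg_ts)]


table = TinyTable()
table.put("com.cnn.www", "contents:", 3, b"older page")
table.put("com.cnn.www", "contents:", 5, b"newer page")
table.put("com.aaa.www", "anchor:x", 1, b"link text")
for key, value in table.items():
    print(key, value)
```

A real system would persist and distribute this map; the sketch only shows the key shape and the sort order.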

7
Data Model
  • Rows
      • Arbitrary string
      • Access to data in a row is atomic
      • Ordered lexicographically
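
Lexicographic ordering has a practical consequence for row-key design, which the short sketch below illustrates. The helper names `scan_order` and `padded_row_key` are hypothetical, and Python's string sort is used as a stand-in for byte-wise ordering (the two agree for ASCII keys).

```python
def scan_order(row_keys):
    """Return row keys in the order a lexicographically sorted table holds them."""
    return sorted(row_keys)


def padded_row_key(n, width=6):
    """Zero-pad a numeric id so lexicographic order matches numeric order."""
    return f"row{n:0{width}d}"


# Unpadded numeric suffixes do not sort numerically.
print(scan_order(["row2", "row10", "row1"]))

# Zero-padded keys keep numerically adjacent rows next to each other.
print(scan_order([padded_row_key(n) for n in (2, 10, 1)]))
```

Because rows that sort near each other are stored near each other, choosing keys this way keeps related rows together for range reads.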

8
Data Model
  • Column
      • Two-level name structure
          • family:qualifier
      • Column family is the unit of access control
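
A minimal sketch of the two-level column name and of per-family access control follows. The function `split_column`, the class `FamilyAcl`, and the user and column names are assumptions made for illustration; they are not part of any Bigtable API.

```python
def split_column(column_key):
    """Split 'family:qualifier' into its two parts; the qualifier may be empty."""
    family, _, qualifier = column_key.partition(":")
    if not family:
        raise ValueError(f"column key has no family: {column_key!r}")
    return family, qualifier


class FamilyAcl:
    """Toy access check that is decided per column family, not per column."""

    def __init__(self):
        self._readers = {}  # family -> set of users allowed to read it

    def grant_read(self, family, user):
        self._readers.setdefault(family, set()).add(user)

    def can_read(self, user, column_key):
        family, _ = split_column(column_key)
        return user in self._readers.get(family, set())


acl = FamilyAcl()
acl.grant_read("anchor", "crawler")
print(split_column("anchor:cnnsi.com"))
print(acl.can_read("crawler", "anchor:cnnsi.com"))
print(acl.can_read("crawler", "contents:"))
```

Every column in the same family shares one permission decision, which is what makes the family the unit of access control.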

9
Data Model
  • Timestamps
      • Store different versions of data in a cell
      • Lookup options
          • Return most recent K values
          • Return all values
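
The two lookup options can be sketched for a single cell as follows. `VersionedCell` and its method names are hypothetical; the example only shows versions being kept per timestamp and returned newest first.

```python
class VersionedCell:
    """Toy cell that keeps one value per timestamp."""

    def __init__(self):
        self._versions = {}  # timestamp -> value

    def write(self, timestamp, value):
        self._versions[timestamp] = value

    def most_recent(self, k):
        """Return up to k (timestamp, value) pairs, newest first."""
        newest_first = sorted(self._versions.items(), reverse=True)
        return newest_first[:k]

    def all_versions(self):
        """Return every stored version, newest first."""
        return self.most_recent(len(self._versions))


cell = VersionedCell()
cell.write(100, "v1")
cell.write(300, "v3")
cell.write(200, "v2")
print(cell.most_recent(2))
print(cell.all_versions())
```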

10
Data Model
  • The row range for a table is dynamically
    partitioned
  • Each row range is called a tablet
  • Tablet is the unit for distribution and load
    balancing
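
One way to picture the partitioning is a sorted list of split points, where each gap between split points is a tablet. The sketch below makes that assumption; `TabletMap`, `tablet_index`, and `split` are illustrative names, and real tablet location lookup is more involved than a single in-memory list.

```python
from bisect import bisect_right, insort


class TabletMap:
    """Toy partitioning of the sorted row space into contiguous tablets."""

    def __init__(self, split_points):
        # Tablet i covers rows from split_points[i-1] (inclusive)
        # up to split_points[i] (exclusive); the ends are unbounded.
        self.split_points = sorted(split_points)

    def tablet_index(self, row_key):
        """Return the index of the tablet whose row range contains row_key."""
        return bisect_right(self.split_points, row_key)

    def split(self, row_key):
        """Add a split point, as when a tablet grows and is divided in two."""
        if row_key not in self.split_points:
            insort(self.split_points, row_key)


tablets = TabletMap(["g", "p"])
print(tablets.tablet_index("apple"), tablets.tablet_index("kiwi"), tablets.tablet_index("zebra"))
tablets.split("c")  # the first tablet is divided at row "c"
print(tablets.tablet_index("apple"), tablets.tablet_index("dog"))
```

Since each tablet is a contiguous row range, it can be moved between servers as a unit, which is why it serves as the unit of distribution and load balancing.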

11
APIs
  • Metadata operations
      • Create/delete tables and column families,
        change metadata
  • Writes
      • Set(): write cells in a row
      • DeleteCells(): delete cells in a row
      • DeleteRow(): delete all cells in a row
  • Reads
      • Scanner: read arbitrary cells in a Bigtable
          • Each row read is atomic
          • Can restrict returned rows to a particular
            range
          • Can ask for just data from 1 row, all rows,
            etc.
          • Can ask for all columns, just certain column
            families, or specific columns
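
The write and read operations listed above can be mimicked with a toy in-memory table. This is a sketch under stated assumptions, not the Bigtable client API: the class `ToyTable` and the Python method names `set`, `delete_cells`, `delete_row`, and `scan` merely echo the operation names on the slide, timestamps are ignored, and the row and column names are sample data.

```python
class ToyTable:
    """Toy table: row key -> {column key -> value}; timestamps are ignored."""

    def __init__(self):
        self._rows = {}

    def set(self, row, column, value):
        """Write one cell in a row."""
        self._rows.setdefault(row, {})[column] = value

    def delete_cells(self, row, column):
        """Delete one cell; drop the row entirely if it becomes empty."""
        cells = self._rows.get(row, {})
        cells.pop(column, None)
        if not cells:
            self._rows.pop(row, None)

    def delete_row(self, row):
        """Delete all cells in a row."""
        self._rows.pop(row, None)

    def scan(self, start_row="", end_row=None, families=None):
        """Yield (row, cells) in row order, limited by row range and families."""
        for row in sorted(self._rows):
            if row < start_row or (end_row is not None and row >= end_row):
                continue
            cells = {
                column: value
                for column, value in sorted(self._rows[row].items())
                if families is None or column.partition(":")[0] in families
            }
            if cells:
                yield row, cells


t = ToyTable()
t.set("com.cnn.www", "anchor:www.c-span.org", "CNN")
t.set("com.cnn.www", "contents:", "<html>...</html>")
t.set("org.example", "contents:", "<html>hi</html>")
t.delete_cells("com.cnn.www", "contents:")
for row, cells in t.scan(families={"anchor"}):
    print(row, cells)
```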

12
APIs
13
Building Blocks
  • Google File System (GFS)
      • Stores persistent data (SSTable file format)
  • Scheduler
      • Schedules jobs onto machines
  • Chubby
      • Lock service: distributed lock manager
      • Master election, location bootstrapping
  • MapReduce (optional)
      • Data processing
      • Read/write Bigtable data

14
Chubby
  • Lock/file/name service
  • Coarse-grained locks
  • Each client has a session with Chubby.
  • A session expires if the client cannot renew its
    session lease within the lease expiration time.
  • 5 replicas; a majority must be up for the service
    to be active
  • Also an OSDI '06 paper
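
The majority rule and the lease rule can each be written down in a few lines. The sketch below is not Chubby's protocol; `has_quorum`, `Session`, and the use of plain numbers for time are assumptions chosen to keep the example deterministic.

```python
def has_quorum(replicas_up, total_replicas=5):
    """The service is active only while a strict majority of replicas is up."""
    return replicas_up > total_replicas // 2


class Session:
    """Toy client session guarded by a renewable lease."""

    def __init__(self, lease_seconds, now):
        self.lease_seconds = lease_seconds
        self.expires_at = now + lease_seconds

    def is_expired(self, now):
        return now >= self.expires_at

    def renew(self, now):
        """Extend the lease; fails if the lease has already run out."""
        if self.is_expired(now):
            return False
        self.expires_at = now + self.lease_seconds
        return True


print(has_quorum(3), has_quorum(2))

session = Session(lease_seconds=10, now=0)
print(session.renew(now=5))   # renewed in time, lease now runs to t=15
print(session.renew(now=20))  # too late, the session has expired
```

With 5 replicas, the service tolerates 2 failures, since 3 replicas still form a majority.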

15
Implementation
  • The Bigtable implementation has three major
    components:
      • A library that is linked into every client
      • One master server
      • Many tablet servers

16
Tablet Location Management
17
Refinements
  • Locality groups
      • Clients can group multiple column families
        together into a locality group.
  • Compression
      • Many clients use a two-pass scheme: Bentley and
        McIlroy's scheme for long common strings, then
        a fast compression algorithm.
  • Caching for read performance
      • Uses a Scan Cache and a Block Cache.
  • Bloom filters
      • Reduce the number of disk accesses on reads.
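
To show why a Bloom filter saves accesses, here is a minimal sketch. The class `BloomFilter`, its parameters, and the use of SHA-256 to derive bit positions are choices made for this example only, not details of Bigtable's filters. A negative answer is definite, so the stored file need not be read at all; a positive answer may be a false positive and still requires the read.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive several bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


stored = BloomFilter()
stored.add("com.cnn.www/anchor:cnnsi.com")

# Only consult the stored file when the filter says the key might be there.
for key in ("com.cnn.www/anchor:cnnsi.com", "org.example/contents:"):
    if stored.might_contain(key):
        print("read from storage:", key)
    else:
        print("skip storage read:", key)
```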

18
Performance Evaluation
19
Real Applications
  • Google Analytics
      • http://analytics.google.com
  • Google Earth
      • http://earth.google.com
  • Personalized search
      • www.google.com/psearch

20
Conclusions
  • Users like:
      • the performance and high availability provided
        by the Bigtable implementation
      • that they can scale the capacity of their
        clusters by simply adding more machines to the
        system as their resource demands change over
        time
  • There are significant advantages to building a
    custom storage solution
  • Challenges
      • User adoption and acceptance of a new interface
      • Implementation issues