Transcript and Presenter's Notes

Title: Introduction to cloud computing


1
Introduction to cloud computing
  • Jiaheng Lu
  • Department of Computer Science
  • Renmin University of China
  • www.jiahenglu.net

2
Cloud computing

3
(No Transcript)
4
Google Cloud computing techniques

5
The Google File System
6
The Google File System (GFS)
  • A scalable distributed file system for large
    distributed data-intensive applications
  • Multiple GFS clusters are currently deployed.
  • The largest ones have
  • 1000 storage nodes
  • 300 terabytes of disk storage
  • heavily accessed by hundreds of clients on
    distinct machines

7
Introduction
  • Shares many of the same goals as previous
    distributed file systems
  • performance, scalability, reliability, etc.
  • The GFS design has been driven by four key
    observations of Google's application workloads
    and technological environment

8
Intro: Observations 1
  • 1. Component failures are the norm
  • constant monitoring, error detection, fault
    tolerance and automatic recovery are integral to
    the system
  • 2. Huge files (by traditional standards)
  • Multi-GB files are common
  • I/O operations and block sizes must be revisited

9
Intro: Observations 2
  • 3. Most files are mutated by appending new data
  • This is the focus of performance optimization and
    atomicity guarantees
  • 4. Co-designing the applications and APIs
    benefits the overall system by increasing
    flexibility

10
The Design
  • A cluster consists of a single master and
    multiple chunkservers, and is accessed by
    multiple clients

11
The Master
  • Maintains all file system metadata.
  • namespace, access control info, file-to-chunk
    mappings, chunk (and replica) locations, etc.
  • Periodically communicates with chunkservers via
    HeartBeat messages to give instructions and check
    state

12
The Master
  • Helps make sophisticated chunk placement and
    replication decisions, using global knowledge
  • For reading and writing, a client contacts the
    Master to get chunk locations, then deals
    directly with chunkservers
  • The Master is not a bottleneck for reads/writes

13
Chunkservers
  • Files are broken into chunks. Each chunk has an
    immutable, globally unique 64-bit chunk handle.
  • The handle is assigned by the master at chunk
    creation
  • Chunk size is 64 MB
  • Each chunk is replicated on 3 (default)
    chunkservers

14
Clients
  • Linked into applications using the file system
    API.
  • Communicate with the master and chunkservers for
    reading and writing (see the sketch below)
  • Master interactions only for metadata
  • Chunkserver interactions for data
  • Clients only cache metadata information
  • Data is too large to cache.
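
A minimal sketch of that read path in Python, under the assumptions above (64 MB chunks; get_chunk_locations and read are hypothetical stand-ins, not the real GFS client API):

  CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks

  def gfs_read(master, filename, offset, length):
      """Sketch of a client read: metadata from the master, data from a chunkserver."""
      chunk_index = offset // CHUNK_SIZE                 # which chunk holds this offset
      # Master interaction: metadata only (chunk handle and replica locations).
      handle, replicas = master.get_chunk_locations(filename, chunk_index)
      # Chunkserver interaction: the data itself, read directly from one replica.
      return replicas[0].read(handle, offset % CHUNK_SIZE, length)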

15
Chunk Locations
  • The Master does not keep a persistent record of
    the locations of chunks and replicas.
  • Instead, it polls chunkservers for this
    information at startup and when chunkservers
    join the cluster.
  • Stays up to date by controlling placement of new
    chunks and through HeartBeat messages (when
    monitoring chunkservers)

16
Operation Log
  • Record of all critical metadata changes
  • Stored on Master and replicated on other machines
  • Defines order of concurrent operations
  • Changes not visible to clients until they
    propagate to all chunk replicas
  • Also used to recover the file system state

17
System Interactions: Leases and Mutation Order
  • Leases maintain a consistent mutation order
    across all chunk replicas
  • The Master grants a lease to one replica, called
    the primary
  • The primary chooses a serial order for mutations,
    and all replicas follow this order
  • Minimizes management overhead for the Master

18
System Interactions: Leases and Mutation Order
19
Atomic Record Append
  • The client specifies the data to write; GFS
    chooses and returns the offset it writes to, and
    appends the data to each replica at least once
  • Heavily used by Google's distributed
    applications.
  • No need for a distributed lock manager
  • GFS chooses the offset, not the client

20
Atomic Record Append: How?
  • Follows a similar control flow as other mutations
  • The primary tells secondary replicas to append at
    the same offset as the primary
  • If the append fails at any replica, the client
    retries it.
  • So replicas of the same chunk may contain
    different data, including duplicates of the same
    record, in whole or in part

21
Atomic Record Append: How?
  • GFS does not guarantee that all replicas are
    bitwise identical.
  • Only guarantees that data is written at least
    once in an atomic unit.
  • Data must be written at the same offset for all
    chunk replicas for success to be reported.

22
Replica Placement
  • Placement policy maximizes data reliability and
    network bandwidth
  • Spread replicas not only across machines, but
    also across racks
  • Guards against machine failures, and racks
    getting damaged or going offline
  • Reads for a chunk exploit aggregate bandwidth of
    multiple racks
  • Writes have to flow through multiple racks
  • a tradeoff made willingly

23
Chunk creation
  • Chunks are created and placed by the master
    (see the placement sketch below).
  • Placed on chunkservers with below-average disk
    utilization
  • The master also limits the number of recent
    creations on each chunkserver
  • with creations come lots of writes
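
A rough sketch of that placement heuristic, assuming each chunkserver object exposes disk_utilization and recent_creations fields (the names and the creation cap are illustrative):

  def pick_chunkservers(chunkservers, replicas=3, max_recent_creations=10):
      """Prefer chunkservers with below-average disk utilization and few
      recent chunk creations (new chunks soon attract heavy write traffic)."""
      avg_util = sum(cs.disk_utilization for cs in chunkservers) / len(chunkservers)
      candidates = [cs for cs in chunkservers
                    if cs.disk_utilization <= avg_util
                    and cs.recent_creations < max_recent_creations]
      pool = candidates if len(candidates) >= replicas else chunkservers
      return sorted(pool, key=lambda cs: cs.disk_utilization)[:replicas]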

24
Detecting Stale Replicas
  • The Master keeps a chunk version number to
    distinguish up-to-date and stale replicas
  • The version is increased when granting a lease
  • If a replica is not available, its version is not
    increased
  • The master detects stale replicas when
    chunkservers report their chunks and versions
    (a version-comparison sketch follows)
  • Stale replicas are removed during garbage
    collection
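
The check itself is a simple version comparison; a minimal sketch with hypothetical dictionaries mapping chunk handle to version:

  def find_stale_replicas(master_versions, chunkserver_report):
      """Replicas whose reported version is older than the master's are stale."""
      stale = []
      for handle, reported_version in chunkserver_report.items():
          current = master_versions.get(handle)
          if current is not None and reported_version < current:
              stale.append(handle)   # removed later during garbage collection
      return stale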

25
Garbage collection
  • When a client deletes a file, the master logs the
    deletion like other changes and renames the file
    to a hidden name.
  • The master removes files that have been hidden
    for longer than 3 days when scanning the file
    system namespace
  • their metadata is also erased
  • During HeartBeat messages, each chunkserver sends
    the master a subset of its chunks, and the
    master tells it which chunks no longer belong to
    any file.
  • The chunkserver removes these chunks on its own

26
Fault Tolerance: High Availability
  • Fast recovery
  • The Master and chunkservers can restart in
    seconds
  • Chunk replication
  • Master replication
  • shadow masters provide read-only access when the
    primary master is down
  • mutations are not considered complete until
    recorded on all master replicas

27
Fault Tolerance: Data Integrity
  • Chunkservers use checksums to detect corrupt data
  • Since replicas are not bitwise identical,
    chunkservers maintain their own checksums
  • For reads, the chunkserver verifies the checksum
    before sending the data (see the sketch below)
  • Checksums are updated during writes
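
A small sketch of per-block checksum verification on the read path, using CRC-32 as a stand-in for the chunkserver's checksum function and an assumed 64 KB block size:

  import zlib

  BLOCK_SIZE = 64 * 1024   # assumed checksum block size

  def verify_block(chunk_data, stored_checksums, block_index):
      """Verify one block of a chunk before returning it to the reader."""
      start = block_index * BLOCK_SIZE
      block = chunk_data[start:start + BLOCK_SIZE]
      if zlib.crc32(block) != stored_checksums[block_index]:
          # Report the corruption and let the client read another replica.
          raise IOError("checksum mismatch in block %d" % block_index)
      return block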

28
Introduction to MapReduce
29
MapReduce Insight
  • Consider the problem of counting the number of
    occurrences of each word in a large collection of
    documents
  • How would you do it in parallel?

30
MapReduce Programming Model
  • Inspired by the map and reduce operations
    commonly used in functional programming languages
    like Lisp.
  • Users implement an interface of two primary
    methods
  • 1. Map: (key1, val1) → (key2, val2)
  • 2. Reduce: (key2, val2) → val3

31
Map operation
  • Map, a pure function written by the user, takes
    an input key/value pair and produces a set of
    intermediate key/value pairs.
  • e.g. (docid, doc-content)
  • Drawing an analogy to SQL, map can be visualized
    as the group-by clause of an aggregate query.

32
Reduce operation
  • On completion of the map phase, all the
    intermediate values for a given output key are
    combined together into a list and given to a
    reducer.
  • Can be visualized as an aggregate function (e.g.,
    average) that is computed over all the rows with
    the same group-by attribute.

33
Pseudo-code

  map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
      EmitIntermediate(w, "1");

  reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // intermediate_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
      result += ParseInt(v);
    Emit(AsString(result));
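
The same word-count logic as a runnable Python simulation of the two phases; the dictionary in the middle plays the role of MapReduce's shuffle/group-by-key step:

  from collections import defaultdict

  def map_fn(doc_name, doc_contents):
      for word in doc_contents.split():
          yield word, 1                      # EmitIntermediate(w, "1")

  def reduce_fn(word, counts):
      return word, sum(counts)               # Emit(AsString(result))

  def mapreduce_word_count(documents):
      intermediate = defaultdict(list)
      for name, contents in documents.items():            # map phase
          for key, value in map_fn(name, contents):
              intermediate[key].append(value)              # group by key
      return dict(reduce_fn(k, v) for k, v in intermediate.items())   # reduce phase

  docs = {"doc1": "the quick brown fox", "doc2": "the lazy dog and the fox"}
  print(mapreduce_word_count(docs))
  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1, 'and': 1}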

34
MapReduce Execution overview

35
MapReduce Example

36
MapReduce in Parallel Example

37
MapReduce Fault Tolerance
  • Handled via re-execution of tasks.
  • Task completion is committed through the master
  • What happens if a mapper fails?
  • Re-execute completed and in-progress map tasks
  • What happens if a reducer fails?
  • Re-execute in-progress reduce tasks
  • What happens if the master fails?
  • Potential trouble!

38
MapReduce
  • Walk-through of one more application

39
(No Transcript)
40
MapReduce: PageRank
  • PageRank models the behavior of a random surfer
    (the update formula is given below).
  • C(t) is the out-degree of t, and (1 - d) is the
    probability of a random jump (d is the damping
    factor)
  • The random surfer keeps clicking on successive
    links at random, not taking content into
    consideration.
  • Each page distributes its PageRank equally among
    all pages it links to.
  • The damping factor models the surfer getting
    bored and typing an arbitrary URL.
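
For reference, the update the slide describes (the formula itself appears only on the slide image) is the standard PageRank recurrence:

  PR(p) = (1 - d) + d * sum over pages t linking to p of PR(t) / C(t)

where d is the damping factor and C(t) is the out-degree of t; some formulations divide the (1 - d) term by the total number of pages.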

41
PageRank: Key Insights
  • The effect of each iteration is local: the
    (i+1)th iteration depends only on the ith
    iteration
  • At iteration i, the PageRank of individual nodes
    can be computed independently

42
PageRank using MapReduce
  • Use a sparse matrix representation (M)
  • Map each row of M to a list of PageRank credit
    to assign to out-link neighbours.
  • These prestige scores are reduced to a single
    PageRank value for a page by aggregating over
    them.

43
PageRank using MapReduce
Source of image: Lin 2008
44
Phase 1: Process HTML
  • Map task takes (URL, page-content) pairs and maps
    them to (URL, (PRinit, list-of-urls))
  • PRinit is the seed PageRank for URL
  • list-of-urls contains all pages pointed to by URL
  • Reduce task is just the identity function

45
Phase 2: PageRank Distribution
  • The reduce task gets (URL, url_list) and many
    (URL, val) values
  • Sum the vals and fix up with d to get the new
    PageRank (see the sketch below)
  • Emit (URL, (new_rank, url_list))
  • Check for convergence using a non-parallel
    component
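
A compact simulation of one Phase 2 iteration, in the same in-process style as the word-count sketch above (the graph format, URL -> (rank, out-links), and the names are illustrative):

  from collections import defaultdict

  D = 0.85   # damping factor

  def pr_map(url, rank, out_links):
      yield url, ("links", out_links)                   # pass the link structure through
      for target in out_links:                          # distribute rank equally
          yield target, ("credit", rank / len(out_links))

  def pr_reduce(url, values):
      out_links, credit = [], 0.0
      for kind, value in values:
          if kind == "links":
              out_links = value
          else:
              credit += value
      return url, ((1 - D) + D * credit, out_links)     # fix up with d

  def pagerank_iteration(graph):
      intermediate = defaultdict(list)
      for url, (rank, out_links) in graph.items():
          for key, value in pr_map(url, rank, out_links):
              intermediate[key].append(value)
      return dict(pr_reduce(u, vals) for u, vals in intermediate.items())

  graph = {"a": (1.0, ["b", "c"]), "b": (1.0, ["c"]), "c": (1.0, ["a"])}
  print(pagerank_iteration(graph))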

46
MapReduce: Some More Apps
  • Distributed Grep.
  • Count of URL Access Frequency.
  • Clustering (K-means)
  • Graph Algorithms.
  • Indexing Systems

MapReduce Programs In Google Source Tree
47
MapReduce: Extensions and Similar Apps
  • Pig (Yahoo!)
  • Hadoop (Apache)
  • DryadLINQ (Microsoft)

48
Large Scale Systems Architecture using MapReduce
49
BigTable: A Distributed Storage System for
Structured Data
50
Introduction
  • BigTable is a distributed storage system for
    managing structured data.
  • Designed to scale to a very large size
  • Petabytes of data across thousands of servers
  • Used for many Google projects
  • Web indexing, Personalized Search, Google Earth,
    Google Analytics, Google Finance, ...
  • Flexible, high-performance solution for all of
    Google's products

51
Motivation
  • Lots of (semi-)structured data at Google
  • URLs
  • Contents, crawl metadata, links, anchors,
    pagerank, ...
  • Per-user data
  • User preference settings, recent queries/search
    results, ...
  • Geographic locations
  • Physical entities (shops, restaurants, etc.),
    roads, satellite image data, user annotations, ...
  • Scale is large
  • Billions of URLs, many versions/page
    (20K/version)
  • Hundreds of millions of users, thousands of
    queries/sec
  • 100 TB of satellite image data

52
Why not just use a commercial DB?
  • Scale is too large for most commercial databases
  • Even if it weren't, the cost would be very high
  • Building internally means the system can be
    applied across many projects for low incremental
    cost
  • Low-level storage optimizations help performance
    significantly
  • Much harder to do when running on top of a
    database layer

53
Goals
  • Want asynchronous processes to be continuously
    updating different pieces of data
  • Want access to most current data at any time
  • Need to support
  • Very high read/write rates (millions of ops per
    second)
  • Efficient scans over all or interesting subsets
    of data
  • Efficient joins of large one-to-one and
    one-to-many datasets
  • Often want to examine data changes over time
  • E.g. Contents of a web page over multiple crawls

54
BigTable
  • Distributed multi-level map
  • Fault-tolerant, persistent
  • Scalable
  • Thousands of servers
  • Terabytes of in-memory data
  • Petabytes of disk-based data
  • Millions of reads/writes per second, efficient
    scans
  • Self-managing
  • Servers can be added/removed dynamically
  • Servers adjust to load imbalance

55
Building Blocks
  • Building blocks
  • Google File System (GFS): raw storage
  • Scheduler: schedules jobs onto machines
  • Lock service: distributed lock manager
  • MapReduce: simplified large-scale data processing
  • BigTable's uses of the building blocks
  • GFS: stores persistent data (SSTable file format
    for storage of data)
  • Scheduler: schedules jobs involved in BigTable
    serving
  • Lock service: master election, location
    bootstrapping
  • MapReduce: often used to read/write BigTable data

56
Basic Data Model
  • A BigTable is a sparse, distributed, persistent
    multi-dimensional sorted map
  • (row, column, timestamp) -> cell contents
    (illustrated below)
  • Good match for most Google applications
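
A toy illustration of that map in Python (purely illustrative; the real system stores the data in SSTables on GFS):

  class ToyBigtable:
      """(row, column, timestamp) -> cell contents, with versioned cells."""
      def __init__(self):
          self._cells = {}                       # (row, column) -> {timestamp: value}

      def set(self, row, column, value, timestamp):
          self._cells.setdefault((row, column), {})[timestamp] = value

      def get(self, row, column, max_timestamp=float("inf")):
          """Most recent value at or before max_timestamp, or None."""
          versions = self._cells.get((row, column), {})
          valid = [ts for ts in versions if ts <= max_timestamp]
          return versions[max(valid)] if valid else None

  t = ToyBigtable()
  t.set("com.cnn.www", "contents:", "<html>v1</html>", timestamp=1)
  t.set("com.cnn.www", "contents:", "<html>v2</html>", timestamp=2)
  print(t.get("com.cnn.www", "contents:"))                      # newest version
  print(t.get("com.cnn.www", "contents:", max_timestamp=1))     # older version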

57
WebTable Example
  • Want to keep copy of a large collection of web
    pages and related information
  • Use URLs as row keys
  • Various aspects of web page as column names
  • Store contents of web pages in the contents
    column under the timestamps when they were
    fetched.

58
Rows
  • Name is an arbitrary string
  • Access to data in a row is atomic
  • Row creation is implicit upon storing data
  • Rows ordered lexicographically
  • Rows close together lexicographically usually on
    one or a small number of machines

59
Rows (cont.)
  • Reads of short row ranges are efficient and
    typically require communication with a small
    number of machines.
  • Can exploit this property by selecting row keys
    so they get good locality for data access.
  • Example
  • math.gatech.edu, math.uga.edu, phys.gatech.edu,
    phys.uga.edu
  • VS
  • edu.gatech.math, edu.gatech.phys, edu.uga.math,
    edu.uga.phys
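
The second form is just the hostname with its dot-separated components reversed, so pages from the same domain become adjacent row keys; a small transformation shows the effect:

  def domain_row_key(hostname):
      # "math.gatech.edu" -> "edu.gatech.math"
      return ".".join(reversed(hostname.split(".")))

  hosts = ["math.gatech.edu", "math.uga.edu", "phys.gatech.edu", "phys.uga.edu"]
  print(sorted(domain_row_key(h) for h in hosts))
  # ['edu.gatech.math', 'edu.gatech.phys', 'edu.uga.math', 'edu.uga.phys']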

60
Columns
  • Columns have a two-level name structure
  • family:optional_qualifier
  • Column family
  • Unit of access control
  • Has associated type information
  • Qualifier gives unbounded columns
  • Additional levels of indexing, if desired

61
Timestamps
  • Used to store different versions of data in a
    cell
  • New writes default to current time, but
    timestamps for writes can also be set explicitly
    by clients
  • Lookup options
  • Return most recent K values
  • Return all values in timestamp range (or all
    values)
  • Column families can be marked w/ attributes
  • Only retain most recent K values in a cell
  • Keep values until they are older than K seconds
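
A sketch of how those two per-family settings could be applied to one cell's versions (the structures and names are hypothetical):

  import time

  def gc_cell_versions(versions, keep_last_k=None, max_age_seconds=None, now=None):
      """versions: {timestamp: value} for one cell; returns the survivors."""
      now = time.time() if now is None else now
      kept = sorted(versions.items(), reverse=True)       # newest first
      if keep_last_k is not None:
          kept = kept[:keep_last_k]                        # retain only the most recent K
      if max_age_seconds is not None:
          kept = [(ts, v) for ts, v in kept if now - ts <= max_age_seconds]
      return dict(kept)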

62
Implementation: Three Major Components
  • Library linked into every client
  • One master server
  • Responsible for
  • Assigning tablets to tablet servers
  • Detecting the addition and expiration of tablet
    servers
  • Balancing tablet-server load
  • Garbage collection
  • Many tablet servers
  • Tablet servers handle read and write requests to
    their tablets
  • Split tablets that have grown too large

63
Implementation (cont.)
  • Client data doesn't move through the master
    server. Clients communicate directly with tablet
    servers for reads and writes.
  • Most clients never communicate with the master
    server, leaving it lightly loaded in practice.

64
Tablets
  • Large tables broken into tablets at row
    boundaries
  • Tablet holds contiguous range of rows
  • Clients can often choose row keys to achieve
    locality
  • Aim for 100MB to 200MB of data per tablet
  • Serving machine responsible for 100 tablets
  • Fast recovery
  • 100 machines each pick up 1 tablet for failed
    machine
  • Fine-grained load balancing
  • Migrate tablets away from overloaded machine
  • Master makes load-balancing decisions

65
Tablet Location
  • Since tablets move around from server to server,
    given a row, how do clients find the right
    machine?
  • Need to find tablet whose row range covers the
    target row
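
In BigTable the answer comes from a METADATA hierarchy that clients cache, but the core lookup is just a search over tablets sorted by their start rows; a minimal sketch:

  import bisect

  def find_tablet(tablets, row_key):
      """tablets: list of (start_row, tablet_server), sorted by start_row.
      Each tablet holds the rows from its start_row up to the next start_row."""
      starts = [start for start, _ in tablets]
      i = bisect.bisect_right(starts, row_key) - 1    # last tablet starting at or before the row
      return tablets[i][1]

  tablets = [("", "tabletserver-1"), ("g", "tabletserver-2"), ("n", "tabletserver-3")]
  print(find_tablet(tablets, "math.gatech.edu"))      # falls in the "g".."n" range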

66
Tablet Assignment
  • Each tablet is assigned to one tablet server at a
    time.
  • The master server keeps track of the set of live
    tablet servers and the current assignment of
    tablets to servers. It also keeps track of
    unassigned tablets.
  • When a tablet is unassigned, the master assigns
    the tablet to a tablet server with sufficient
    room.

67
API
  • Metadata operations
  • Create/delete tables and column families, change
    metadata
  • Writes (atomic)
  • Set(): write cells in a row
  • DeleteCells(): delete cells in a row
  • DeleteRow(): delete all cells in a row
  • Reads
  • Scanner: read arbitrary cells in a bigtable
    (a toy stand-in follows this slide)
  • Each row read is atomic
  • Can restrict returned rows to a particular range
  • Can ask for just data from 1 row, all rows, etc.
  • Can ask for all columns, just certain column
    families, or specific columns
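
An in-memory stand-in that mirrors the operations named on the slide (not the real BigTable client library, which is a C++ interface):

  class ToyTable:
      def __init__(self):
          self.rows = {}                                   # row -> {column: value}

      def set(self, row, column, value):                   # Set(): write a cell in a row
          self.rows.setdefault(row, {})[column] = value

      def delete_cells(self, row, column):                 # DeleteCells()
          self.rows.get(row, {}).pop(column, None)

      def delete_row(self, row):                           # DeleteRow()
          self.rows.pop(row, None)

      def scan(self, start_row="", end_row="\uffff", families=None):
          # Scanner: rows in lexicographic order, optionally restricted
          # to a row range and to certain column families.
          for row in sorted(self.rows):
              if start_row <= row < end_row:
                  yield row, {c: v for c, v in self.rows[row].items()
                              if families is None or c.split(":")[0] in families}

  t = ToyTable()
  t.set("edu.gatech.math", "contents:", "<html>...</html>")
  t.set("edu.gatech.math", "anchor:www.gatech.edu", "Math Dept")
  for row, cells in t.scan(start_row="edu.gatech", families=["anchor"]):
      print(row, cells)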

68
Refinements: Locality Groups
  • Can group multiple column families into a
    locality group
  • A separate SSTable is created for each locality
    group in each tablet.
  • Segregating column families that are not
    typically accessed together enables more
    efficient reads.
  • In WebTable, page metadata can be in one group
    and the contents of the page in another group.

69
Refinements: Compression
  • Many opportunities for compression
  • Similar values in the same row/column at
    different timestamps
  • Similar values in different columns
  • Similar values across adjacent rows
  • Two-pass custom compression scheme
  • First pass: compress long common strings across a
    large window
  • Second pass: look for repetitions in a small
    window
  • Speed is emphasized, but space reduction is still
    good (10-to-1)

70
Refinements: Bloom Filters
  • A read operation has to read from disk when the
    desired SSTable isn't in memory
  • Reduce the number of accesses by specifying a
    Bloom filter (see the sketch below).
  • Allows us to ask whether an SSTable might contain
    data for a specified row/column pair.
  • A small amount of memory for Bloom filters
    drastically reduces the number of disk seeks for
    read operations
  • Use implies that most lookups for non-existent
    rows or columns do not need to touch disk
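
A minimal Bloom filter sketch showing why this works: a negative answer is definite, so lookups for rows/columns that are not in an SSTable can skip the disk seek entirely (the bit-array size and hash count here are arbitrary):

  import hashlib

  class BloomFilter:
      def __init__(self, num_bits=1024, num_hashes=4):
          self.num_bits, self.num_hashes = num_bits, num_hashes
          self.bits = bytearray(num_bits)

      def _positions(self, key):
          for i in range(self.num_hashes):
              digest = hashlib.sha256(("%d:%s" % (i, key)).encode()).hexdigest()
              yield int(digest, 16) % self.num_bits

      def add(self, key):
          for pos in self._positions(key):
              self.bits[pos] = 1

      def might_contain(self, key):
          # False means definitely absent; True means possibly present.
          return all(self.bits[pos] for pos in self._positions(key))

  bf = BloomFilter()
  bf.add("com.cnn.www/contents:")                      # index (row, column) pairs
  print(bf.might_contain("com.cnn.www/contents:"))     # True
  print(bf.might_contain("com.cnn.www/anchor:xyz"))    # almost certainly False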
