Transcript and Presenter's Notes

Title: Introduction to cloud computing


1
Introduction to cloud computing
  • Jiaheng Lu
  • Department of Computer Science
  • Renmin University of China
  • www.jiahenglu.net

2
Cloud computing

3
(No Transcript)
4
Google Cloud computing techniques

5
The Google File System
6
The Google File System (GFS)
  • A scalable distributed file system for large
    distributed data-intensive applications
  • Multiple GFS clusters are currently deployed.
  • The largest ones have
  • 1000 storage nodes
  • 300 terabytes of disk storage
  • heavily accessed by hundreds of clients on
    distinct machines

7
Introduction
  • Shares many of the same goals as previous
    distributed file systems
  • performance, scalability, reliability, etc.
  • The GFS design has been driven by four key
    observations of Google's application workloads
    and technological environment

8
Intro: Observations 1
  • 1. Component failures are the norm
  • constant monitoring, error detection, fault
    tolerance and automatic recovery are integral to
    the system
  • 2. Huge files (by traditional standards)
  • Multi-GB files are common
  • I/O operations and block sizes must be revisited

9
Intro: Observations 2
  • 3. Most files are mutated by appending new data
  • This is the focus of performance optimization and
    atomicity guarantees
  • 4. Co-designing the applications and APIs
    benefits the overall system by increasing
    flexibility

10
The Design
  • A cluster consists of a single master and
    multiple chunkservers, and is accessed by
    multiple clients

11
The Master
  • Maintains all file system metadata.
  • namespace, access control info, file-to-chunk
    mappings, chunk (and replica) locations, etc.
  • Periodically communicates with chunkservers via
    HeartBeat messages to give instructions and check
    state

12
The Master
  • Helps make sophisticated chunk placement and
    replication decisions, using global knowledge
  • For reading and writing, a client contacts the
    Master to get chunk locations, then deals
    directly with chunkservers
  • The Master is not a bottleneck for reads/writes

13
Chunkservers
  • Files are broken into chunks. Each chunk has an
    immutable, globally unique 64-bit chunk handle.
  • The handle is assigned by the master at chunk
    creation
  • Chunk size is 64 MB
  • Each chunk is replicated on 3 (default)
    chunkservers

14
Clients
  • Linked into applications using the file system
    API.
  • Communicate with the master and chunkservers for
    reading and writing (see the sketch below)
  • Master interactions only for metadata
  • Chunkserver interactions for data
  • Clients only cache metadata information
  • Data is too large to cache.
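
A minimal sketch of that read path in Python, under the assumptions above (64 MB chunks; get_chunk_locations and read are hypothetical stand-ins, not the real GFS client API):

  CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks

  def gfs_read(master, filename, offset, length):
      """Sketch of a client read: metadata from the master, data from a chunkserver."""
      chunk_index = offset // CHUNK_SIZE                 # which chunk holds this offset
      # Master interaction: metadata only (chunk handle and replica locations).
      handle, replicas = master.get_chunk_locations(filename, chunk_index)
      # Chunkserver interaction: the data itself, read directly from one replica.
      return replicas[0].read(handle, offset % CHUNK_SIZE, length)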

15
Chunk Locations
  • The Master does not keep a persistent record of
    the locations of chunks and replicas.
  • Instead, it polls chunkservers for this
    information at startup and when chunkservers
    join the cluster.
  • Stays up to date by controlling placement of new
    chunks and through HeartBeat messages (when
    monitoring chunkservers)

16
Operation Log
  • Record of all critical metadata changes
  • Stored on Master and replicated on other machines
  • Defines order of concurrent operations
  • Changes not visible to clients until they
    propagate to all chunk replicas
  • Also used to recover the file system state

17
System Interactions: Leases and Mutation Order
  • Leases maintain a consistent mutation order
    across all chunk replicas
  • The Master grants a lease to one replica, called
    the primary
  • The primary chooses a serial order for mutations,
    and all replicas follow this order
  • Minimizes management overhead for the Master

18
System Interactions: Leases and Mutation Order
19
Atomic Record Append
  • The client specifies the data to write; GFS
    chooses and returns the offset it writes to, and
    appends the data to each replica at least once
  • Heavily used by Google's distributed
    applications.
  • No need for a distributed lock manager
  • GFS chooses the offset, not the client

20
Atomic Record Append: How?
  • Follows a similar control flow as other mutations
  • The primary tells secondary replicas to append at
    the same offset as the primary
  • If the append fails at any replica, the client
    retries it.
  • So replicas of the same chunk may contain
    different data, including duplicates of the same
    record, in whole or in part

21
Atomic Record Append: How?
  • GFS does not guarantee that all replicas are
    bitwise identical.
  • Only guarantees that data is written at least
    once in an atomic unit.
  • Data must be written at the same offset for all
    chunk replicas for success to be reported.

22
Replica Placement
  • Placement policy maximizes data reliability and
    network bandwidth
  • Spread replicas not only across machines, but
    also across racks
  • Guards against machine failures, and racks
    getting damaged or going offline
  • Reads for a chunk exploit aggregate bandwidth of
    multiple racks
  • Writes have to flow through multiple racks
  • a tradeoff made willingly

23
Chunk creation
  • Chunks are created and placed by the master
    (see the placement sketch below).
  • Placed on chunkservers with below-average disk
    utilization
  • The master also limits the number of recent
    creations on each chunkserver
  • with creations come lots of writes
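
A rough sketch of that placement heuristic, assuming each chunkserver object exposes disk_utilization and recent_creations fields (the names and the creation cap are illustrative):

  def pick_chunkservers(chunkservers, replicas=3, max_recent_creations=10):
      """Prefer chunkservers with below-average disk utilization and few
      recent chunk creations (new chunks soon attract heavy write traffic)."""
      avg_util = sum(cs.disk_utilization for cs in chunkservers) / len(chunkservers)
      candidates = [cs for cs in chunkservers
                    if cs.disk_utilization <= avg_util
                    and cs.recent_creations < max_recent_creations]
      pool = candidates if len(candidates) >= replicas else chunkservers
      return sorted(pool, key=lambda cs: cs.disk_utilization)[:replicas]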

24
Detecting Stale Replicas
  • The Master keeps a chunk version number to
    distinguish up-to-date and stale replicas
  • The version is increased when granting a lease
  • If a replica is not available, its version is not
    increased
  • The master detects stale replicas when
    chunkservers report their chunks and versions
    (a version-comparison sketch follows)
  • Stale replicas are removed during garbage
    collection
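
The check itself is a simple version comparison; a minimal sketch with hypothetical dictionaries mapping chunk handle to version:

  def find_stale_replicas(master_versions, chunkserver_report):
      """Replicas whose reported version is older than the master's are stale."""
      stale = []
      for handle, reported_version in chunkserver_report.items():
          current = master_versions.get(handle)
          if current is not None and reported_version < current:
              stale.append(handle)   # removed later during garbage collection
      return stale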

25
Garbage collection
  • When a client deletes a file, the master logs the
    deletion like other changes and renames the file
    to a hidden name.
  • The master removes files that have been hidden
    for longer than 3 days when scanning the file
    system namespace
  • their metadata is also erased
  • During HeartBeat messages, each chunkserver sends
    the master a subset of its chunks, and the
    master tells it which chunks no longer belong to
    any file.
  • The chunkserver removes these chunks on its own

26
Fault Tolerance: High Availability
  • Fast recovery
  • The Master and chunkservers can restart in
    seconds
  • Chunk replication
  • Master replication
  • shadow masters provide read-only access when the
    primary master is down
  • mutations are not considered complete until
    recorded on all master replicas

27
Fault Tolerance: Data Integrity
  • Chunkservers use checksums to detect corrupt data
  • Since replicas are not bitwise identical,
    chunkservers maintain their own checksums
  • For reads, the chunkserver verifies the checksum
    before sending the data (see the sketch below)
  • Checksums are updated during writes
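
A small sketch of per-block checksum verification on the read path, using CRC-32 as a stand-in for the chunkserver's checksum function and an assumed 64 KB block size:

  import zlib

  BLOCK_SIZE = 64 * 1024   # assumed checksum block size

  def verify_block(chunk_data, stored_checksums, block_index):
      """Verify one block of a chunk before returning it to the reader."""
      start = block_index * BLOCK_SIZE
      block = chunk_data[start:start + BLOCK_SIZE]
      if zlib.crc32(block) != stored_checksums[block_index]:
          # Report the corruption and let the client read another replica.
          raise IOError("checksum mismatch in block %d" % block_index)
      return block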

28
Introduction to MapReduce
29
MapReduce Insight
  • Consider the problem of counting the number of
    occurrences of each word in a large collection of
    documents
  • How would you do it in parallel?

30
MapReduce Programming Model
  • Inspired by the map and reduce operations
    commonly used in functional programming languages
    like Lisp.
  • Users implement an interface of two primary
    methods
  • 1. Map: (key1, val1) → (key2, val2)
  • 2. Reduce: (key2, val2) → val3

31
Map operation
  • Map, a pure function written by the user, takes
    an input key/value pair and produces a set of
    intermediate key/value pairs.
  • e.g. (docid, doc-content)
  • Drawing an analogy to SQL, map can be visualized
    as the group-by clause of an aggregate query.

32
Reduce operation
  • On completion of the map phase, all the
    intermediate values for a given output key are
    combined together into a list and given to a
    reducer.
  • Can be visualized as an aggregate function (e.g.,
    average) that is computed over all the rows with
    the same group-by attribute.

33
Pseudo-code

  map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
      EmitIntermediate(w, "1");

  reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // intermediate_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
      result += ParseInt(v);
    Emit(AsString(result));
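
The same word-count logic as a runnable Python simulation of the two phases; the dictionary in the middle plays the role of MapReduce's shuffle/group-by-key step:

  from collections import defaultdict

  def map_fn(doc_name, doc_contents):
      for word in doc_contents.split():
          yield word, 1                      # EmitIntermediate(w, "1")

  def reduce_fn(word, counts):
      return word, sum(counts)               # Emit(AsString(result))

  def mapreduce_word_count(documents):
      intermediate = defaultdict(list)
      for name, contents in documents.items():            # map phase
          for key, value in map_fn(name, contents):
              intermediate[key].append(value)              # group by key
      return dict(reduce_fn(k, v) for k, v in intermediate.items())   # reduce phase

  docs = {"doc1": "the quick brown fox", "doc2": "the lazy dog and the fox"}
  print(mapreduce_word_count(docs))
  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1, 'and': 1}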

34
MapReduce Execution overview

35
MapReduce Example

36
MapReduce in Parallel Example

37
MapReduce Fault Tolerance
  • Handled via re-execution of tasks.
  • Task completion is committed through the master
  • What happens if a mapper fails?
  • Re-execute completed and in-progress map tasks
  • What happens if a reducer fails?
  • Re-execute in-progress reduce tasks
  • What happens if the master fails?
  • Potential trouble!

38
MapReduce
  • Walk-through of one more application

39
(No Transcript)
40
MapReduce: PageRank
  • PageRank models the behavior of a random surfer
    (the update formula is given below).
  • C(t) is the out-degree of t, and (1 - d) is the
    probability of a random jump (d is the damping
    factor)
  • The random surfer keeps clicking on successive
    links at random, not taking content into
    consideration.
  • Each page distributes its PageRank equally among
    all pages it links to.
  • The damping factor models the surfer getting
    bored and typing an arbitrary URL.
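
For reference, the update the slide describes (the formula itself appears only on the slide image) is the standard PageRank recurrence:

  PR(p) = (1 - d) + d * sum over pages t linking to p of PR(t) / C(t)

where d is the damping factor and C(t) is the out-degree of t; some formulations divide the (1 - d) term by the total number of pages.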

41
PageRank: Key Insights
  • The effect of each iteration is local: the
    (i+1)th iteration depends only on the ith
    iteration
  • At iteration i, the PageRank of individual nodes
    can be computed independently

42
PageRank using MapReduce
  • Use a sparse matrix representation (M)
  • Map each row of M to a list of PageRank credit
    to assign to out-link neighbours.
  • These prestige scores are reduced to a single
    PageRank value for a page by aggregating over
    them.

43
PageRank using MapReduce
Source of image: Lin 2008
44
Phase 1: Process HTML
  • Map task takes (URL, page-content) pairs and maps
    them to (URL, (PRinit, list-of-urls))
  • PRinit is the seed PageRank for URL
  • list-of-urls contains all pages pointed to by URL
  • Reduce task is just the identity function

45
Phase 2: PageRank Distribution
  • The reduce task gets (URL, url_list) and many
    (URL, val) values
  • Sum the vals and fix up with d to get the new
    PageRank (see the sketch below)
  • Emit (URL, (new_rank, url_list))
  • Check for convergence using a non-parallel
    component
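
A compact simulation of one Phase 2 iteration, in the same in-process style as the word-count sketch above (the graph format, URL -> (rank, out-links), and the names are illustrative):

  from collections import defaultdict

  D = 0.85   # damping factor

  def pr_map(url, rank, out_links):
      yield url, ("links", out_links)                   # pass the link structure through
      for target in out_links:                          # distribute rank equally
          yield target, ("credit", rank / len(out_links))

  def pr_reduce(url, values):
      out_links, credit = [], 0.0
      for kind, value in values:
          if kind == "links":
              out_links = value
          else:
              credit += value
      return url, ((1 - D) + D * credit, out_links)     # fix up with d

  def pagerank_iteration(graph):
      intermediate = defaultdict(list)
      for url, (rank, out_links) in graph.items():
          for key, value in pr_map(url, rank, out_links):
              intermediate[key].append(value)
      return dict(pr_reduce(u, vals) for u, vals in intermediate.items())

  graph = {"a": (1.0, ["b", "c"]), "b": (1.0, ["c"]), "c": (1.0, ["a"])}
  print(pagerank_iteration(graph))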

46
MapReduce: Some More Apps
  • Distributed Grep.
  • Count of URL Access Frequency.
  • Clustering (K-means)
  • Graph Algorithms.
  • Indexing Systems

MapReduce Programs In Google Source Tree
47
MapReduce: Extensions and Similar Apps
  • Pig (Yahoo!)
  • Hadoop (Apache)
  • DryadLINQ (Microsoft)

48
Large Scale Systems Architecture using MapReduce
49
BigTable: A Distributed Storage System for
Structured Data
50
Introduction
  • BigTable is a distributed storage system for
    managing structured data.
  • Designed to scale to a very large size
  • Petabytes of data across thousands of servers
  • Used for many Google projects
  • Web indexing, Personalized Search, Google Earth,
    Google Analytics, Google Finance, ...
  • Flexible, high-performance solution for all of
    Google's products

51
Motivation
  • Lots of (semi-)structured data at Google
  • URLs
  • Contents, crawl metadata, links, anchors,
    pagerank, ...
  • Per-user data
  • User preference settings, recent queries/search
    results, ...
  • Geographic locations
  • Physical entities (shops, restaurants, etc.),
    roads, satellite image data, user annotations, ...
  • Scale is large
  • Billions of URLs, many versions/page
    (20K/version)
  • Hundreds of millions of users, thousands of
    queries/sec
  • 100 TB of satellite image data

52
Why not just use a commercial DB?
  • Scale is too large for most commercial databases
  • Even if it weren't, the cost would be very high
  • Building internally means the system can be
    applied across many projects for low incremental
    cost
  • Low-level storage optimizations help performance
    significantly
  • Much harder to do when running on top of a
    database layer

53
Goals
  • Want asynchronous processes to be continuously
    updating different pieces of data
  • Want access to most current data at any time
  • Need to support
  • Very high read/write rates (millions of ops per
    second)
  • Efficient scans over all or interesting subsets
    of data
  • Efficient joins of large one-to-one and
    one-to-many datasets
  • Often want to examine data changes over time
  • E.g. Contents of a web page over multiple crawls

54
BigTable
  • Distributed multi-level map
  • Fault-tolerant, persistent
  • Scalable
  • Thousands of servers
  • Terabytes of in-memory data
  • Petabytes of disk-based data
  • Millions of reads/writes per second, efficient
    scans
  • Self-managing
  • Servers can be added/removed dynamically
  • Servers adjust to load imbalance

55
Building Blocks
  • Building blocks
  • Google File System (GFS): raw storage
  • Scheduler: schedules jobs onto machines
  • Lock service: distributed lock manager
  • MapReduce: simplified large-scale data processing
  • BigTable's uses of the building blocks
  • GFS: stores persistent data (SSTable file format
    for storage of data)
  • Scheduler: schedules jobs involved in BigTable
    serving
  • Lock service: master election, location
    bootstrapping
  • MapReduce: often used to read/write BigTable data

56
Basic Data Model
  • A BigTable is a sparse, distributed, persistent
    multi-dimensional sorted map
  • (row, column, timestamp) -> cell contents
    (illustrated below)
  • Good match for most Google applications
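
A toy illustration of that map in Python (purely illustrative; the real system stores the data in SSTables on GFS):

  class ToyBigtable:
      """(row, column, timestamp) -> cell contents, with versioned cells."""
      def __init__(self):
          self._cells = {}                       # (row, column) -> {timestamp: value}

      def set(self, row, column, value, timestamp):
          self._cells.setdefault((row, column), {})[timestamp] = value

      def get(self, row, column, max_timestamp=float("inf")):
          """Most recent value at or before max_timestamp, or None."""
          versions = self._cells.get((row, column), {})
          valid = [ts for ts in versions if ts <= max_timestamp]
          return versions[max(valid)] if valid else None

  t = ToyBigtable()
  t.set("com.cnn.www", "contents:", "<html>v1</html>", timestamp=1)
  t.set("com.cnn.www", "contents:", "<html>v2</html>", timestamp=2)
  print(t.get("com.cnn.www", "contents:"))                      # newest version
  print(t.get("com.cnn.www", "contents:", max_timestamp=1))     # older version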

57
WebTable Example
  • Want to keep copy of a large collection of web
    pages and related information
  • Use URLs as row keys
  • Various aspects of web page as column names
  • Store contents of web pages in the contents
    column under the timestamps when they were
    fetched.

58
Rows
  • Name is an arbitrary string
  • Access to data in a row is atomic
  • Row creation is implicit upon storing data
  • Rows ordered lexicographically
  • Rows close together lexicographically usually on
    one or a small number of machines

59
Rows (cont.)
  • Reads of short row ranges are efficient and
    typically require communication with a small
    number of machines.
  • Can exploit this property by selecting row keys
    so they get good locality for data access.
  • Example
  • math.gatech.edu, math.uga.edu, phys.gatech.edu,
    phys.uga.edu
  • VS
  • edu.gatech.math, edu.gatech.phys, edu.uga.math,
    edu.uga.phys
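
The second form is just the hostname with its dot-separated components reversed, so pages from the same domain become adjacent row keys; a small transformation shows the effect:

  def domain_row_key(hostname):
      # "math.gatech.edu" -> "edu.gatech.math"
      return ".".join(reversed(hostname.split(".")))

  hosts = ["math.gatech.edu", "math.uga.edu", "phys.gatech.edu", "phys.uga.edu"]
  print(sorted(domain_row_key(h) for h in hosts))
  # ['edu.gatech.math', 'edu.gatech.phys', 'edu.uga.math', 'edu.uga.phys']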

60
Columns
  • Columns have a two-level name structure
  • family:optional_qualifier
  • Column family
  • Unit of access control
  • Has associated type information
  • Qualifier gives unbounded columns
  • Additional levels of indexing, if desired

61
Timestamps
  • Used to store different versions of data in a
    cell
  • New writes default to current time, but
    timestamps for writes can also be set explicitly
    by clients
  • Lookup options
  • Return most recent K values
  • Return all values in timestamp range (or all
    values)
  • Column families can be marked w/ attributes
  • Only retain most recent K values in a cell
  • Keep values until they are older than K seconds
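
A sketch of how those two per-family settings could be applied to one cell's versions (the structures and names are hypothetical):

  import time

  def gc_cell_versions(versions, keep_last_k=None, max_age_seconds=None, now=None):
      """versions: {timestamp: value} for one cell; returns the survivors."""
      now = time.time() if now is None else now
      kept = sorted(versions.items(), reverse=True)       # newest first
      if keep_last_k is not None:
          kept = kept[:keep_last_k]                        # retain only the most recent K
      if max_age_seconds is not None:
          kept = [(ts, v) for ts, v in kept if now - ts <= max_age_seconds]
      return dict(kept)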

62
Implementation: Three Major Components
  • Library linked into every client
  • One master server
  • Responsible for
  • Assigning tablets to tablet servers
  • Detecting the addition and expiration of tablet
    servers
  • Balancing tablet-server load
  • Garbage collection
  • Many tablet servers
  • Tablet servers handle read and write requests to
    their tablets
  • Split tablets that have grown too large

63
Implementation (cont.)
  • Client data doesn't move through the master
    server. Clients communicate directly with tablet
    servers for reads and writes.
  • Most clients never communicate with the master
    server, leaving it lightly loaded in practice.

64
Tablets
  • Large tables broken into tablets at row
    boundaries
  • Tablet holds contiguous range of rows
  • Clients can often choose row keys to achieve
    locality
  • Aim for 100MB to 200MB of data per tablet
  • Serving machine responsible for 100 tablets
  • Fast recovery
  • 100 machines each pick up 1 tablet for failed
    machine
  • Fine-grained load balancing
  • Migrate tablets away from overloaded machine
  • Master makes load-balancing decisions

65
Tablet Location
  • Since tablets move around from server to server,
    given a row, how do clients find the right
    machine?
  • Need to find tablet whose row range covers the
    target row
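
In BigTable the answer comes from a METADATA hierarchy that clients cache, but the core lookup is just a search over tablets sorted by their start rows; a minimal sketch:

  import bisect

  def find_tablet(tablets, row_key):
      """tablets: list of (start_row, tablet_server), sorted by start_row.
      Each tablet holds the rows from its start_row up to the next start_row."""
      starts = [start for start, _ in tablets]
      i = bisect.bisect_right(starts, row_key) - 1    # last tablet starting at or before the row
      return tablets[i][1]

  tablets = [("", "tabletserver-1"), ("g", "tabletserver-2"), ("n", "tabletserver-3")]
  print(find_tablet(tablets, "math.gatech.edu"))      # falls in the "g".."n" range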

66
Tablet Assignment
  • Each tablet is assigned to one tablet server at a
    time.
  • The master server keeps track of the set of live
    tablet servers and the current assignment of
    tablets to servers. It also keeps track of
    unassigned tablets.
  • When a tablet is unassigned, the master assigns
    the tablet to a tablet server with sufficient
    room.

67
API
  • Metadata operations
  • Create/delete tables and column families, change
    metadata
  • Writes (atomic)
  • Set(): write cells in a row
  • DeleteCells(): delete cells in a row
  • DeleteRow(): delete all cells in a row
  • Reads
  • Scanner: read arbitrary cells in a bigtable
    (a toy stand-in follows this slide)
  • Each row read is atomic
  • Can restrict returned rows to a particular range
  • Can ask for just data from 1 row, all rows, etc.
  • Can ask for all columns, just certain column
    families, or specific columns
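
An in-memory stand-in that mirrors the operations named on the slide (not the real BigTable client library, which is a C++ interface):

  class ToyTable:
      def __init__(self):
          self.rows = {}                                   # row -> {column: value}

      def set(self, row, column, value):                   # Set(): write a cell in a row
          self.rows.setdefault(row, {})[column] = value

      def delete_cells(self, row, column):                 # DeleteCells()
          self.rows.get(row, {}).pop(column, None)

      def delete_row(self, row):                           # DeleteRow()
          self.rows.pop(row, None)

      def scan(self, start_row="", end_row="\uffff", families=None):
          # Scanner: rows in lexicographic order, optionally restricted
          # to a row range and to certain column families.
          for row in sorted(self.rows):
              if start_row <= row < end_row:
                  yield row, {c: v for c, v in self.rows[row].items()
                              if families is None or c.split(":")[0] in families}

  t = ToyTable()
  t.set("edu.gatech.math", "contents:", "<html>...</html>")
  t.set("edu.gatech.math", "anchor:www.gatech.edu", "Math Dept")
  for row, cells in t.scan(start_row="edu.gatech", families=["anchor"]):
      print(row, cells)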

68
Refinements: Locality Groups
  • Can group multiple column families into a
    locality group
  • A separate SSTable is created for each locality
    group in each tablet.
  • Segregating column families that are not
    typically accessed together enables more
    efficient reads.
  • In WebTable, page metadata can be in one group
    and the contents of the page in another group.

69
Refinements: Compression
  • Many opportunities for compression
  • Similar values in the same row/column at
    different timestamps
  • Similar values in different columns
  • Similar values across adjacent rows
  • Two-pass custom compression scheme
  • First pass: compress long common strings across a
    large window
  • Second pass: look for repetitions in a small
    window
  • Speed is emphasized, but space reduction is still
    good (10-to-1)

70
Refinements: Bloom Filters
  • A read operation has to read from disk when the
    desired SSTable isn't in memory
  • Reduce the number of accesses by specifying a
    Bloom filter (see the sketch below).
  • Allows us to ask whether an SSTable might contain
    data for a specified row/column pair.
  • A small amount of memory for Bloom filters
    drastically reduces the number of disk seeks for
    read operations
  • Use implies that most lookups for non-existent
    rows or columns do not need to touch disk
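
A minimal Bloom filter sketch showing why this works: a negative answer is definite, so lookups for rows/columns that are not in an SSTable can skip the disk seek entirely (the bit-array size and hash count here are arbitrary):

  import hashlib

  class BloomFilter:
      def __init__(self, num_bits=1024, num_hashes=4):
          self.num_bits, self.num_hashes = num_bits, num_hashes
          self.bits = bytearray(num_bits)

      def _positions(self, key):
          for i in range(self.num_hashes):
              digest = hashlib.sha256(("%d:%s" % (i, key)).encode()).hexdigest()
              yield int(digest, 16) % self.num_bits

      def add(self, key):
          for pos in self._positions(key):
              self.bits[pos] = 1

      def might_contain(self, key):
          # False means definitely absent; True means possibly present.
          return all(self.bits[pos] for pos in self._positions(key))

  bf = BloomFilter()
  bf.add("com.cnn.www/contents:")                      # index (row, column) pairs
  print(bf.might_contain("com.cnn.www/contents:"))     # True
  print(bf.might_contain("com.cnn.www/anchor:xyz"))    # almost certainly False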
