Title: Introduction to cloud computing
1. Introduction to cloud computing
- Jiaheng Lu
- Department of Computer Science
- Renmin University of China
- www.jiahenglu.net
2. Cloud computing
3. (No transcript)
4. Review: Why distributed systems?
- What are the advantages? Distributed vs. centralized? Multi-server vs. client-server?
- Geography
- Concurrency > speed
- High availability (if failures occur)
5. Review: What Should Be Distributed?
- Users and User Interface
- Thin client
- Processing
- Trim client
- Data
- Fat client
- Will discuss tradeoffs later
(Diagram: Presentation → Workflow → Business Objects → Database)
6. Review: Work Distribution Spectrum
- Presentation and plug-ins
- Workflow manages the session and invokes objects
- Business objects
- Database
(Diagram: Presentation → Workflow → Business Objects → Database)
7. The Pattern: Three-Tier Computing
- Clients do presentation and gather input
- Clients do some workflow (Xscript)
- Clients send high-level requests to the ORB (Object Request Broker)
- The ORB dispatches workflows and business objects, which act as proxies for the client and orchestrate flows and queues
- Server-side workflow scripts call on distributed business objects to execute the task
(Diagram: Presentation → Workflow → Business Objects → Database)
8. The Three Tiers
(Diagram labels: object server, data server)
9. Why Did Everyone Go to Three-Tier?
- Manageability
- Business rules must be with data
- Middleware and operations tools
- Performance (scalability)
- Server resources are precious
- ORB dispatches requests to server pools
- Technology and physics
- Put UI processing near user
- Put shared data processing near shared data
(Diagram: Presentation → Workflow → Business Objects → Database)
10. Google cloud computing techniques
11. The Google File System
12. The Google File System (GFS)
- A scalable distributed file system for large distributed data-intensive applications
- Multiple GFS clusters are currently deployed.
- The largest ones have:
- 1000 storage nodes
- 300 terabytes of disk storage
- heavy access by hundreds of clients on distinct machines
13. Introduction
- Shares many of the same goals as previous distributed file systems: performance, scalability, reliability, etc.
- The GFS design has been driven by four key observations of Google's application workloads and technological environment
14. Intro: Observations 1
- 1. Component failures are the norm
- Constant monitoring, error detection, fault tolerance, and automatic recovery are integral to the system
- 2. Huge files (by traditional standards)
- Multi-GB files are common
- I/O operations and block sizes must be revisited
15. Intro: Observations 2
- 3. Most files are mutated by appending new data
- This is the focus of performance optimization and atomicity guarantees
- 4. Co-designing the applications and APIs benefits the overall system by increasing flexibility
16. The Design
- A cluster consists of a single master and multiple chunkservers, and is accessed by multiple clients
17. The Master
- Maintains all file system metadata:
- namespace, access control info, file-to-chunk mappings, chunk (and replica) locations, etc.
- Periodically communicates with chunkservers via HeartBeat messages to give instructions and check state
18. The Master
- Helps make sophisticated chunk placement and replication decisions using global knowledge
- For reading and writing, a client contacts the master to get chunk locations, then deals directly with the chunkservers
- The master is not a bottleneck for reads/writes
19. Chunkservers
- Files are broken into chunks. Each chunk has an immutable, globally unique 64-bit chunk handle.
- The handle is assigned by the master at chunk creation
- Chunk size is 64 MB
- Each chunk is replicated on 3 (default) chunkservers
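As a rough illustration of how a client turns a file offset into a chunk request, here is a minimal Python sketch; the `master` stub and its `get_chunk` call are hypothetical stand-ins for the client library, not the real GFS API. The client computes the chunk index itself, asks the master only for the handle and replica locations, and then reads the data directly from a chunkserver.

    CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the GFS chunk size

    def locate(master, filename, byte_offset):
        """Translate a byte offset into (chunk handle, replica locations, offset in chunk)."""
        chunk_index = byte_offset // CHUNK_SIZE        # which 64 MB chunk holds the byte
        offset_in_chunk = byte_offset % CHUNK_SIZE     # where inside that chunk
        handle, replicas = master.get_chunk(filename, chunk_index)  # hypothetical call
        return handle, replicas, offset_in_chunk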
20. Clients
- Linked into apps using the file system API.
- Communicate with the master and chunkservers for reading and writing
- Master interactions are only for metadata
- Chunkserver interactions are for data
- Clients cache only metadata
- Data is too large to cache.
21. Chunk Locations
- The master does not keep a persistent record of the locations of chunks and replicas.
- It polls chunkservers for this at startup and when chunkservers join or leave.
- It stays up to date by controlling the placement of new chunks and through HeartBeat messages (while monitoring chunkservers)
22. Operation Log
- A record of all critical metadata changes
- Stored on the master and replicated on other machines
- Defines the order of concurrent operations
- Changes are not visible to clients until they propagate to all chunk replicas
- Also used to recover the file system state
23. System Interactions: Leases and Mutation Order
- Leases maintain a consistent mutation order across all chunk replicas
- The master grants a lease to one replica, called the primary
- The primary chooses the serial mutation order, and all replicas follow this order
- This minimizes management overhead for the master
24. System Interactions: Leases and Mutation Order (figure)
25. Atomic Record Append
- The client specifies the data to write; GFS chooses the offset, appends the data to each replica at least once, and returns that offset to the client
- Heavily used by Google's distributed applications.
- No need for a distributed lock manager
- GFS chooses the offset, not the client
26. Atomic Record Append: How?
- Follows a similar control flow as other mutations
- The primary tells the secondary replicas to append at the same offset as the primary
- If the append fails at any replica, it is retried by the client.
- So replicas of the same chunk may contain different data, including duplicates, in whole or in part, of the same record
27. Atomic Record Append: How?
- GFS does not guarantee that all replicas are bitwise identical.
- It only guarantees that the data is written at least once as an atomic unit.
- Data must be written at the same offset on all chunk replicas for success to be reported.
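The at-least-once guarantee implies a simple client-side retry loop; the sketch below is a minimal Python illustration under that assumption, with a hypothetical `gfs.append` call rather than the real client API. Each failed attempt can leave fragments or duplicates on some replicas, which is exactly the behavior the slide describes.

    def record_append(gfs, path, record, max_attempts=5):
        """Append `record` at a GFS-chosen offset, retrying on failure.

        GFS only promises the record lands at least once as an atomic unit,
        at the same offset on every replica of the chunk; earlier failed
        attempts may leave partial or duplicate copies behind.
        """
        for attempt in range(max_attempts):
            ok, offset = gfs.append(path, record)   # hypothetical call
            if ok:
                return offset                        # offset chosen by GFS
        raise IOError("record append failed after retries")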
28. Replica Placement
- The placement policy maximizes data reliability and network bandwidth utilization
- Replicas are spread not only across machines but also across racks
- This guards against machine failures and against whole racks being damaged or going offline
- Reads for a chunk exploit the aggregate bandwidth of multiple racks
- Writes have to flow through multiple racks
- A tradeoff made willingly
29. Chunk Creation
- Chunks are created and placed by the master.
- Placed on chunkservers with below-average disk utilization
- The master limits the number of recent creations on each chunkserver
- With creations come lots of writes
30. Detecting Stale Replicas
- The master keeps a chunk version number to distinguish up-to-date and stale replicas
- The version is increased when granting a lease
- If a replica is not available, its version is not increased
- The master detects stale replicas when chunkservers report their chunks and versions
- Stale replicas are removed during garbage collection
31. Garbage Collection
- When a client deletes a file, the master logs it like other changes and renames the file to a hidden name.
- The master removes files hidden for longer than 3 days when scanning the file system namespace
- Their metadata is also erased
- During HeartBeat messages, each chunkserver sends the master a subset of its chunks, and the master tells it which chunks are no longer in its metadata.
- The chunkserver removes these chunks on its own
32. Fault Tolerance: High Availability
- Fast recovery
- The master and chunkservers can restart in seconds
- Chunk replication
- Master replication
- Shadow masters provide read-only access when the primary master is down
- Mutations are not considered done until recorded on all master replicas
33. Fault Tolerance: Data Integrity
- Chunkservers use checksums to detect corrupt data
- Since replicas are not bitwise identical, chunkservers maintain their own checksums
- For reads, the chunkserver verifies the checksum before sending the chunk
- Checksums are updated during writes
34. Introduction to MapReduce
35. MapReduce Insight
- Consider the problem of counting the number of occurrences of each word in a large collection of documents
- How would you do it in parallel?
36. MapReduce Programming Model
- Inspired by the map and reduce operations commonly used in functional programming languages like Lisp.
- Users implement an interface of two primary methods:
- 1. Map: (key1, val1) → (key2, val2)
- 2. Reduce: (key2, val2) → val3
37. Map Operation
- Map, a pure function written by the user, takes an input key/value pair and produces a set of intermediate key/value pairs.
- e.g. (doc-id, doc-content)
- Drawing an analogy to SQL, map can be visualized as the group-by clause of an aggregate query.
38. Reduce Operation
- On completion of the map phase, all the intermediate values for a given output key are combined into a list and given to a reducer.
- It can be visualized as an aggregate function (e.g., average) computed over all the rows with the same group-by attribute.
39. Pseudo-code (word count)
- map(String input_key, String input_value):
- // input_key: document name
- // input_value: document contents
- for each word w in input_value:
- EmitIntermediate(w, "1");
- reduce(String output_key, Iterator intermediate_values):
- // output_key: a word
- // intermediate_values: a list of counts
- int result = 0;
- for each v in intermediate_values:
- result += ParseInt(v);
- Emit(AsString(result));
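The same word count expressed as a small, runnable Python sketch of the map, shuffle, and reduce phases; this is a single-process simulation for illustration only, not Google's implementation.

    from collections import defaultdict

    def map_fn(doc_name, doc_contents):
        # Emit (word, "1") for every word in the document.
        return [(w, "1") for w in doc_contents.split()]

    def reduce_fn(word, counts):
        # Sum the partial counts for one word.
        return str(sum(int(c) for c in counts))

    def mapreduce(documents):
        # Shuffle: group intermediate values by key.
        groups = defaultdict(list)
        for name, contents in documents.items():
            for key, value in map_fn(name, contents):
                groups[key].append(value)
        # Reduce each group independently (in the real system, in parallel).
        return {word: reduce_fn(word, counts) for word, counts in groups.items()}

    print(mapreduce({"d1": "the cat sat", "d2": "the cat ran"}))
    # {'the': '2', 'cat': '2', 'sat': '1', 'ran': '1'}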
40. MapReduce: Execution Overview
41. MapReduce: Example
42. MapReduce in Parallel: Example
43. MapReduce: Fault Tolerance
- Handled via re-execution of tasks.
- Task completion is committed through the master
- What happens if a mapper fails?
- Re-execute completed and in-progress map tasks
- What happens if a reducer fails?
- Re-execute in-progress reduce tasks
- What happens if the master fails?
- Potential trouble!
44. MapReduce
- Walk-through of one more application
45. (No transcript)
46. MapReduce: PageRank
- PageRank models the behavior of a random surfer. (The standard update formula, in the slide's notation, is shown below.)
- C(t) is the out-degree of page t, and (1 - d) is a damping factor (random jump)
- The random surfer keeps clicking on successive links at random, not taking content into consideration.
- A page distributes its PageRank equally among all pages it links to.
- The damping factor accounts for the surfer getting bored and typing an arbitrary URL.
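Written out with the symbols the slide uses (d the damping factor, C(t) the out-degree of t), the standard per-page update is the following; a common variant divides the (1 - d) term by the total number of pages N.

\[ PR(p) \;=\; (1 - d) \;+\; d \sum_{t \to p} \frac{PR(t)}{C(t)} \]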
47. PageRank: Key Insights
- The effect of each iteration is local: the (i+1)-th iteration depends only on the i-th iteration
- Within iteration i, the PageRank of individual nodes can be computed independently
48. PageRank using MapReduce
- Use a sparse matrix representation (M)
- Map each row of M to a list of PageRank "credit" to assign to out-link neighbours.
- These prestige scores are reduced to a single PageRank value for a page by aggregating over them.
49. PageRank using MapReduce
(Source of image: Lin 2008)
50. Phase 1: Process HTML
- The map task takes (URL, page-content) pairs and maps them to (URL, (PR_init, list-of-urls))
- PR_init is the seed PageRank for the URL
- list-of-urls contains all pages pointed to by the URL
- The reduce task is just the identity function
51. Phase 2: PageRank Distribution
- The reduce task gets (URL, url_list) and many (URL, val) values
- Sum the vals and fix up with d to get the new PR
- Emit (URL, (new_rank, url_list))
- Check for convergence using a non-parallel component
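A minimal Python sketch of one Phase-2 iteration. The reduce side follows the slide; the map side is not spelled out on the slide, so the assumption here is that it splits a page's current rank among its out-links and passes the link list through so the reducer can re-emit it.

    from collections import defaultdict

    D = 0.85  # damping factor (assumed value for illustration)

    def map_fn(url, value):
        rank, url_list = value
        out = [(url, ("links", url_list))]              # pass the graph structure through
        share = rank / len(url_list) if url_list else 0.0
        out += [(u, ("credit", share)) for u in url_list]
        return out

    def reduce_fn(url, values):
        url_list, total = [], 0.0
        for kind, v in values:
            if kind == "links":
                url_list = v
            else:
                total += v
        new_rank = (1 - D) + D * total                  # "fix up with d"
        return url, (new_rank, url_list)

    def iterate(ranks):
        # ranks: {url: (rank, url_list)} -> same shape after one iteration
        groups = defaultdict(list)
        for url, value in ranks.items():
            for key, val in map_fn(url, value):
                groups[key].append(val)
        return dict(reduce_fn(url, vals) for url, vals in groups.items())

Convergence would then be checked outside MapReduce, as the slide notes, by comparing successive rank vectors.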
52. MapReduce: Some More Apps
- Distributed grep
- Count of URL access frequency
- Clustering (k-means)
- Graph algorithms
- Indexing systems
(Chart: MapReduce programs in the Google source tree)
53. MapReduce: Extensions and Similar Systems
- Pig (Yahoo!)
- Hadoop (Apache)
- DryadLINQ (Microsoft)
54. Large-Scale Systems Architecture Using MapReduce
55. BigTable: A Distributed Storage System for Structured Data
56. Introduction
- BigTable is a distributed storage system for managing structured data.
- Designed to scale to a very large size
- Petabytes of data across thousands of servers
- Used for many Google projects
- Web indexing, Personalized Search, Google Earth, Google Analytics, Google Finance, ...
- A flexible, high-performance solution for all of Google's products
57. Motivation
- Lots of (semi-)structured data at Google
- URLs: contents, crawl metadata, links, anchors, PageRank, ...
- Per-user data: user preference settings, recent queries/search results, ...
- Geographic locations: physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, ...
- Scale is large
- Billions of URLs, many versions per page (about 20 KB per version)
- Hundreds of millions of users, thousands of queries/sec
- 100 TB of satellite image data
58. Why Not Just Use a Commercial DB?
- Scale is too large for most commercial databases
- Even if it weren't, the cost would be very high
- Building internally means the system can be applied across many projects for low incremental cost
- Low-level storage optimizations help performance significantly
- Much harder to do when running on top of a database layer
59. Goals
- Want asynchronous processes to be continuously updating different pieces of data
- Want access to the most current data at any time
- Need to support:
- Very high read/write rates (millions of ops per second)
- Efficient scans over all or interesting subsets of data
- Efficient joins of large one-to-one and one-to-many datasets
- Often want to examine data changes over time
- E.g., contents of a web page over multiple crawls
60. BigTable
- Distributed multi-level map
- Fault-tolerant, persistent
- Scalable
- Thousands of servers
- Terabytes of in-memory data
- Petabyte of disk-based data
- Millions of reads/writes per second, efficient scans
- Servers can be added/removed dynamically
- Servers adjust to load imbalance
61. Building Blocks
- Building blocks:
- Google File System (GFS): raw storage
- Scheduler: schedules jobs onto machines
- Lock service: distributed lock manager
- MapReduce: simplified large-scale data processing
- BigTable's use of the building blocks:
- GFS: stores persistent data (SSTable file format for storage of data)
- Scheduler: schedules jobs involved in BigTable serving
- Lock service: master election, location bootstrapping
- MapReduce: often used to read/write BigTable data
62. Basic Data Model
- A BigTable is a sparse, distributed, persistent, multi-dimensional sorted map
- (row, column, timestamp) -> cell contents
- A good match for most Google applications
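A toy Python picture of that map, just to make the three-part key concrete: an in-memory dict, nothing like the real storage layout. The example row and column names follow the WebTable example used in the next slides; a read typically wants the newest version at or before a given timestamp.

    # (row, column, timestamp) -> cell contents
    table = {
        ("com.cnn.www", "contents:", 5): "<html>v1...",
        ("com.cnn.www", "contents:", 9): "<html>v2...",
        ("com.cnn.www", "anchor:cnnsi.com", 8): "CNN",
    }

    def read_latest(table, row, column, as_of):
        """Return the most recent version of a cell at or before `as_of`."""
        versions = [(ts, v) for (r, c, ts), v in table.items()
                    if r == row and c == column and ts <= as_of]
        return max(versions)[1] if versions else None

    print(read_latest(table, "com.cnn.www", "contents:", 9))  # "<html>v2..."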
63. WebTable Example
- Want to keep a copy of a large collection of web pages and related information
- Use URLs as row keys
- Various aspects of the web page as column names
- Store the contents of web pages in the "contents:" column under the timestamps when they were fetched.
64. Rows
- The row name is an arbitrary string
- Access to data in a row is atomic
- Row creation is implicit upon storing data
- Rows are ordered lexicographically
- Rows close together lexicographically usually sit on one or a small number of machines
65. Rows (cont.)
- Reads of short row ranges are efficient and typically require communication with only a small number of machines.
- Clients can exploit this property by selecting row keys so they get good locality for data access.
- Example:
- math.gatech.edu, math.uga.edu, phys.gatech.edu, phys.uga.edu
- vs.
- edu.gatech.math, edu.gatech.phys, edu.uga.math, edu.uga.phys
66. Columns
- Columns have a two-level name structure:
- family:optional_qualifier
- Column family
- Unit of access control
- Has associated type information
- Qualifier gives unbounded columns
- Additional levels of indexing, if desired
67. Timestamps
- Used to store different versions of data in a cell
- New writes default to the current time, but timestamps for writes can also be set explicitly by clients
- Lookup options:
- Return the most recent K values
- Return all values in a timestamp range (or all values)
- Column families can be marked with attributes:
- Only retain the most recent K values in a cell
- Keep values until they are older than K seconds
68. Implementation: Three Major Components
- Library linked into every client
- One master server
- Responsible for:
- Assigning tablets to tablet servers
- Detecting the addition and expiration of tablet servers
- Balancing tablet-server load
- Garbage collection
- Many tablet servers
- Each tablet server handles read and write requests to its tablets
- Splits tablets that have grown too large
69. Implementation (cont.)
- Client data doesn't move through the master server. Clients communicate directly with tablet servers for reads and writes.
- Most clients never communicate with the master server, leaving it lightly loaded in practice.
70. Tablets
- Large tables are broken into tablets at row boundaries
- A tablet holds a contiguous range of rows
- Clients can often choose row keys to achieve locality
- Aim for 100 MB to 200 MB of data per tablet
- Each serving machine is responsible for about 100 tablets
- Fast recovery:
- 100 machines each pick up 1 tablet from a failed machine
- Fine-grained load balancing:
- Migrate tablets away from an overloaded machine
- The master makes load-balancing decisions
71. Tablet Location
- Since tablets move around from server to server, given a row, how do clients find the right machine? (A simplified lookup sketch follows.)
- Need to find the tablet whose row range covers the target row
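A simplified illustration of that range lookup in Python: a single sorted list of tablet end-row keys stands in for BigTable's real location mechanism, and the server names and keys are made up for the example. Because rows are sorted, finding the covering tablet is a binary search for the first tablet whose end key is at or after the target row.

    import bisect

    # Tablets sorted by end row key; each tablet covers rows up to and including its end key.
    tablet_end_keys = ["aardvark", "cnn.com", "kangaroo", "zzz"]   # hypothetical
    tablet_servers  = ["ts1",      "ts2",     "ts3",      "ts4"]   # hypothetical

    def find_tablet(row_key):
        """Return the server for the tablet whose row range covers `row_key`."""
        i = bisect.bisect_left(tablet_end_keys, row_key)  # first end key >= row_key
        if i == len(tablet_end_keys):
            raise KeyError("row key beyond the last tablet")
        return tablet_servers[i]

    print(find_tablet("cat.com"))  # -> "ts2" (the tablet ending at "cnn.com")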
72. Tablet Assignment
- Each tablet is assigned to one tablet server at a time.
- The master server keeps track of the set of live tablet servers and the current assignment of tablets to servers. It also keeps track of unassigned tablets.
- When a tablet is unassigned, the master assigns it to a tablet server with sufficient room.
73. API
- Metadata operations
- Create/delete tables and column families, change metadata
- Writes (atomic):
- Set(): write cells in a row
- DeleteCells(): delete cells in a row
- DeleteRow(): delete all cells in a row
- Reads:
- Scanner: read arbitrary cells in a BigTable
- Each row read is atomic
- Can restrict returned rows to a particular range
- Can ask for just data from 1 row, all rows, etc.
- Can ask for all columns, just certain column families, or specific columns
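To make the shape of these operations concrete, here is a tiny runnable in-memory stand-in in Python modelling the calls named on the slide (Set, DeleteCells, DeleteRow, Scanner). It is illustrative only, not the real BigTable client library, and timestamps are omitted for brevity.

    class ToyTable:
        def __init__(self):
            self.cells = {}                          # (row, column) -> value

        def set(self, row, column, value):           # Set(): write cells in a row
            self.cells[(row, column)] = value

        def delete_cells(self, row, column):         # DeleteCells(): delete cells in a row
            self.cells.pop((row, column), None)

        def delete_row(self, row):                   # DeleteRow(): delete all cells in a row
            for key in [k for k in self.cells if k[0] == row]:
                del self.cells[key]

        def scan(self, start_row, end_row, families=None):
            # Scanner: stream cells in a row range, optionally filtered by column family.
            for (row, column), value in sorted(self.cells.items()):
                if start_row <= row < end_row and \
                   (families is None or any(column.startswith(f) for f in families)):
                    yield row, column, value

    t = ToyTable()
    t.set("com.cnn.www", "contents:", "<html>...")
    t.set("com.cnn.www", "anchor:cnnsi.com", "CNN")
    for cell in t.scan("com.", "com.z", families=["anchor:"]):
        print(cell)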
74. Refinements: Locality Groups
- Can group multiple column families into a locality group
- A separate SSTable is created for each locality group in each tablet.
- Segregating column families that are not typically accessed together enables more efficient reads.
- In WebTable, page metadata can be in one group and the contents of the page in another group.
75. Refinements: Compression
- Many opportunities for compression
- Similar values in the same row/column at different timestamps
- Similar values in different columns
- Similar values across adjacent rows
- Two-pass custom compression scheme
- First pass: compress long common strings across a large window
- Second pass: look for repetitions in a small window
- Speed is emphasized, but there is still good space reduction (10-to-1)
76. Refinements: Bloom Filters
- A read operation has to read from disk when the desired SSTable isn't in memory
- Reduce the number of accesses by specifying a Bloom filter.
- Allows us to ask whether an SSTable might contain data for a specified row/column pair.
- A small amount of memory for Bloom filters drastically reduces the number of disk seeks for read operations
- Their use implies that most lookups for non-existent rows or columns do not need to touch disk
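A minimal generic Bloom filter in Python to show the idea ("might contain", with no false negatives); this is a sketch of the data structure, not BigTable's implementation, and the sizes and hash scheme are arbitrary choices for illustration.

    import hashlib

    class BloomFilter:
        def __init__(self, num_bits=1 << 16, num_hashes=4):
            self.m, self.k = num_bits, num_hashes
            self.bits = bytearray(self.m // 8)

        def _positions(self, key):
            # Derive k bit positions from salted SHA-256 hashes of the key.
            for i in range(self.k):
                digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.m

        def add(self, key):
            for p in self._positions(key):
                self.bits[p // 8] |= 1 << (p % 8)

        def might_contain(self, key):
            # False means definitely absent (skip the disk read);
            # True means the SSTable might hold the row/column pair.
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

    bf = BloomFilter()
    bf.add("com.cnn.www/anchor:cnnsi.com")
    print(bf.might_contain("com.cnn.www/anchor:cnnsi.com"))   # True
    print(bf.might_contain("com.example.www/contents:"))      # almost surely False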