1
Introduction to cloud computing
  • Jiaheng Lu
  • Department of Computer Science
  • Renmin University of China
  • www.jiahenglu.net

2
Cloud computing

3
(No Transcript)
4
Review: Why distributed systems?
What are the advantages?
Distributed vs. centralized? Multi-server vs.
client-server?
  • Geography
  • Concurrency -> Speed
  • High-availability (if failures occur).

5
Review: What Should Be Distributed?
  • Users and User Interface
  • Thin client
  • Processing
  • Trim client
  • Data
  • Fat client
  • Will discuss tradeoffs later

(Diagram: Presentation -> Workflow -> Business Objects -> Database)
6
Review: Work Distribution Spectrum
  • Presentation and plug-ins
  • Workflow manages session, invokes objects
  • Business objects
  • Database

(Diagram: Presentation -> Workflow -> Business Objects -> Database)
7
The Pattern: Three-Tier Computing
Presentation
  • Clients do presentation, gather input
  • Clients do some workflow (Xscript)
  • Clients send high-level requests to ORB (Object
    Request Broker)
  • ORB dispatches workflows and business objects --
    proxies for clients, orchestrate flows and queues
  • Server-side workflow scripts call on distributed
    business objects to execute task

(Diagram: Workflow -> Business Objects -> Database)
8
The Three Tiers
(Diagram: clients, object server, data server)
9
Why Did Everyone Go To Three-Tier?
  • Manageability
  • Business rules must be with data
  • Middleware operations tools
  • Performance (scalability)
  • Server resources are precious
  • ORB dispatches requests to server pools
  • Technology & physics
  • Put UI processing near user
  • Put shared data processing near shared data

(Diagram: Presentation -> Workflow -> Business Objects -> Database)
10
Google Cloud computing techniques

11
The Google File System
12
The Google File System (GFS)
  • A scalable distributed file system for large
    distributed data intensive applications
  • Multiple GFS clusters are currently deployed.
  • The largest ones have
  • 1000 storage nodes
  • 300 TeraBytes of disk storage
  • heavily accessed by hundreds of clients on
    distinct machines

13
Introduction
  • Shares many of the same goals as previous
    distributed file systems
  • performance, scalability, reliability, etc.
  • GFS design has been driven by four key
    observations of Google's application workloads and
    technological environment

14
Intro Observations 1
  • 1. Component failures are the norm
  • constant monitoring, error detection, fault
    tolerance and automatic recovery are integral to
    the system
  • 2. Huge files (by traditional standards)
  • Multi GB files are common
  • I/O operations and block sizes must be revisited

15
Intro Observations 2
  • 3. Most files are mutated by appending new data
  • This is the focus of performance optimization and
    atomicity guarantees
  • 4. Co-designing the applications and APIs
    benefits the overall system by increasing
    flexibility

16
The Design
  • Cluster consists of a single master and multiple
    chunkservers and is accessed by multiple clients

17
The Master
  • Maintains all file system metadata.
  • namespace, access control info, file-to-chunk
    mappings, chunk (and replica) locations,
    etc.
  • Periodically communicates with chunkservers in
    HeartBeat messages to give instructions and check
    state

18
The Master
  • Helps make sophisticated chunk placement and
    replication decisions, using global knowledge
  • For reading and writing, client contacts Master
    to get chunk locations, then deals directly with
    chunkservers
  • Master is not a bottleneck for reads/writes

19
Chunkservers
  • Files are broken into chunks. Each chunk has an
    immutable, globally unique 64-bit chunk handle.
  • handle is assigned by the master at chunk
    creation
  • Chunk size is 64 MB
  • Each chunk is replicated on 3 (default) servers
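A minimal sketch (not the GFS client API) of how a byte offset inside a file maps onto a 64 MB chunk index, which the client then asks the master to resolve into a chunk handle and replica locations. The master lookup here is a hypothetical in-memory dict standing in for the real metadata:

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the GFS chunk size

def chunk_index(byte_offset: int) -> int:
    # A file is a sequence of fixed-size chunks; the index is a simple division.
    return byte_offset // CHUNK_SIZE

# Hypothetical master-side metadata: (file name, chunk index) -> (handle, replica servers)
chunk_table = {
    ("/logs/web.0", 0): (0xA1B2C3D4, ["cs12", "cs47", "cs80"]),
}

def locate(path: str, byte_offset: int):
    # The client sends (path, chunk index) to the master and gets back
    # the immutable 64-bit chunk handle plus current replica locations.
    idx = chunk_index(byte_offset)
    handle, replicas = chunk_table[(path, idx)]
    return handle, replicas, byte_offset % CHUNK_SIZE  # offset within the chunk

print(locate("/logs/web.0", 10 * 1024 * 1024))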

20
Clients
  • Linked to apps using the file system API.
  • Communicates with master and chunkservers for
    reading and writing
  • Master interactions only for metadata
  • Chunkserver interactions for data
  • Only caches metadata information
  • Data is too large to cache.

21
Chunk Locations
  • Master does not keep a persistent record of
    locations of chunks and replicas.
  • Instead, polls chunkservers for this information at
    startup and whenever chunkservers join or leave.
  • Stays up to date by controlling placement of new
    chunks and through HeartBeat messages (when
    monitoring chunkservers)

22
Operation Log
  • Record of all critical metadata changes
  • Stored on Master and replicated on other machines
  • Defines order of concurrent operations
  • Changes not visible to clients until they
    propagate to all chunk replicas
  • Also used to recover the file system state

23
System Interactions: Leases and Mutation Order
  • Leases maintain a mutation order across all chunk
    replicas
  • Master grants a lease to a replica, called the
    primary
  • The primary chooses the serial mutation order, and
    all replicas follow this order
  • Minimizes management overhead for the Master

24
System Interactions: Leases and Mutation Order
25
Atomic Record Append
  • Client specifies the data to write; GFS chooses
    and returns the offset it writes to, and appends
    the data to each replica at least once
  • Heavily used by Google's distributed
    applications.
  • No need for a distributed lock manager
  • GFS chooses the offset, not the client

26
Atomic Record Append: How?
  • Follows similar control flow as mutations
  • Primary tells secondary replicas to append at the
    same offset as the primary
  • If the append fails at any replica, it is
    retried by the client.
  • So replicas of the same chunk may contain
    different data, including duplicates, whole or in
    part, of the same record

27
Atomic Record Append: How?
  • GFS does not guarantee that all replicas are
    bitwise identical.
  • Only guarantees that data is written at least
    once in an atomic unit.
  • Data must be written at the same offset for all
    chunk replicas for success to be reported.
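A rough sketch of the at-least-once contract described above, using a hypothetical record_append(path, data) call rather than the real GFS client library; the point is that the client retries on failure, so readers must tolerate duplicate records and padding:

import random

def record_append(path: str, data: bytes) -> int:
    # Hypothetical stand-in for the GFS record-append call: it either appends
    # the record to every replica at a GFS-chosen offset and returns that
    # offset, or fails partway (possibly leaving fragments on some replicas).
    if random.random() < 0.3:
        raise IOError("append failed on some replica")
    return random.randrange(0, 64 * 1024 * 1024)

def append_at_least_once(path: str, data: bytes) -> int:
    # The client simply retries; a record that failed partway may now exist
    # as a duplicate or fragment on some replicas, so readers deduplicate
    # (e.g. by a record ID) and skip padding.
    while True:
        try:
            return record_append(path, data)
        except IOError:
            continue

offset = append_at_least_once("/logs/clicks", b"user=42 ts=1700000000")
print("record landed at offset", offset)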

28
Replica Placement
  • Placement policy maximizes data reliability and
    network bandwidth
  • Spread replicas not only across machines, but
    also across racks
  • Guards against machine failures, and racks
    getting damaged or going offline
  • Reads for a chunk exploit aggregate bandwidth of
    multiple racks
  • Writes have to flow through multiple racks
  • tradeoff made willingly

29
Chunk creation
  • created and placed by master.
  • placed on chunkservers with below average disk
    utilization
  • limit number of recent creations on a
    chunkserver
  • with creations come lots of writes

30
Detecting Stale Replicas
  • Master has a chunk version number to distinguish
    up-to-date and stale replicas
  • Increase version when granting a lease
  • If a replica is not available, its version is not
    increased
  • Master detects stale replicas when chunkservers
    report their chunks and versions
  • Remove stale replicas during garbage collection

31
Garbage collection
  • When a client deletes a file, the master logs it
    like other changes and renames the file to a
    hidden name.
  • Master removes files hidden for longer than 3
    days when scanning file system name space
  • metadata is also erased
  • During HeartBeat messages, each chunkserver sends
    the master a subset of its chunks, and the
    master replies with the chunks that no longer
    have metadata.
  • The chunkserver deletes those chunks on its own
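A toy sketch of the lazy-deletion idea above, assuming an in-memory namespace dict and the 3-day grace period; the names and structures are illustrative, not the master's actual data structures:

import time

GRACE_SECONDS = 3 * 24 * 3600  # master reclaims files hidden for more than 3 days

namespace = {"/webtable/part-0": {"chunks": [0xA1, 0xA2]}}

def delete_file(path: str) -> None:
    # Deletion is just a rename to a hidden name that records when it happened.
    meta = namespace.pop(path)
    namespace[f".deleted{path}.{int(time.time())}"] = meta

def scan_namespace(now: float) -> None:
    # Periodic scan: drop hidden entries older than the grace period;
    # their chunks become orphans that chunkservers later discard.
    for name in list(namespace):
        if name.startswith(".deleted"):
            hidden_at = int(name.rsplit(".", 1)[1])
            if now - hidden_at > GRACE_SECONDS:
                del namespace[name]

delete_file("/webtable/part-0")
scan_namespace(now=time.time() + 4 * 24 * 3600)  # pretend 4 days have passed
print(namespace)  # the hidden entry has been reclaimed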

32
Fault Tolerance: High Availability
  • Fast recovery
  • Master and chunkservers can restart in seconds
  • Chunk Replication
  • Master Replication
  • shadow masters provide read-only access when
    primary master is down
  • mutations not done until recorded on all master
    replicas

33
Fault Tolerance: Data Integrity
  • Chunkservers use checksums to detect corrupt data
  • Since replicas are not bitwise identical,
    chunkservers maintain their own checksums
  • For reads, chunkserver verifies checksum before
    sending chunk
  • Update checksums during writes
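A simplified sketch of block-level checksumming, assuming 64 KB blocks and CRC32 (illustrative choices, not necessarily the chunkserver's exact format); the chunkserver verifies a block's checksum before returning its data:

import zlib

BLOCK_SIZE = 64 * 1024  # checksum granularity: blocks within a chunk

def build_checksums(chunk: bytes) -> list[int]:
    # One 32-bit checksum per block, kept by the chunkserver.
    return [zlib.crc32(chunk[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk), BLOCK_SIZE)]

def read_block(chunk: bytes, checksums: list[int], block_no: int) -> bytes:
    # Verify before serving; a mismatch means corruption, so the read fails
    # and the reader fetches the block from another replica.
    data = chunk[block_no * BLOCK_SIZE:(block_no + 1) * BLOCK_SIZE]
    if zlib.crc32(data) != checksums[block_no]:
        raise IOError(f"checksum mismatch in block {block_no}")
    return data

chunk = bytes(200 * 1024)           # a small fake chunk
sums = build_checksums(chunk)
assert read_block(chunk, sums, 2)   # verifies cleanly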

34
Introduction to MapReduce
35
MapReduce Insight
  • Consider the problem of counting the number of
    occurrences of each word in a large collection of
    documents
  • How would you do it in parallel ?

36
MapReduce Programming Model
  • Inspired by map and reduce operations commonly
    used in functional programming languages like
    Lisp.
  • Users implement an interface of two primary
    methods
  • 1. Map (key1, val1) -> (key2, val2)
  • 2. Reduce (key2, list of val2) -> val3

37
Map operation
  • Map, a pure function, written by the user, takes
    an input key/value pair and produces a set of
    intermediate key/value pairs.
  • e.g. (docid, doc-content)
  • Drawing an analogy to SQL, map can be visualized
    as the group-by clause of an aggregate query.

38
Reduce operation
  • On completion of map phase, all the intermediate
    values for a given output key are combined
    together into a list and given to a reducer.
  • Can be visualized as aggregate function (e.g.,
    average) that is computed over all the rows with
    the same group-by attribute.

39
Pseudo-code
  map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
      EmitIntermediate(w, "1");

  reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // output_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
      result += ParseInt(v);
    Emit(AsString(result));
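The same word count expressed as plain Python, a sketch of what the pseudo-code above computes rather than code for any particular MapReduce framework:

from collections import defaultdict
from typing import Iterator

def map_fn(doc_name: str, doc_contents: str):
    # Emit an intermediate (word, "1") pair for every word in the document.
    for word in doc_contents.split():
        yield word, "1"

def reduce_fn(word: str, counts: Iterator[str]) -> str:
    # Sum the per-occurrence counts for one word.
    return str(sum(int(c) for c in counts))

# Tiny sequential "framework": shuffle intermediate pairs by key, then reduce.
docs = {"d1": "the cat sat on the mat", "d2": "the dog"}
shuffled = defaultdict(list)
for name, text in docs.items():
    for key, value in map_fn(name, text):
        shuffled[key].append(value)

print({word: reduce_fn(word, iter(values)) for word, values in shuffled.items()})
# {'the': '3', 'cat': '1', ...}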

40
MapReduce Execution overview

41
MapReduce Example

42
MapReduce in Parallel Example

43
MapReduce Fault Tolerance
  • Handled via re-execution of tasks.
  • Task completion committed through master
  • What happens if Mapper fails ?
  • Re-execute completed and in-progress map tasks
  • What happens if Reducer fails ?
  • Re-execute in-progress reduce tasks
  • What happens if Master fails ?
  • Potential trouble !!

44
MapReduce
  • Walk-through of one more application

45
(No Transcript)
46
MapReduce PageRank
  • PageRank models the behavior of a random
    surfer.
  • C(t) is the out-degree of t, and (1-d) is a
    damping factor (random jump)
  • The random surfer keeps clicking on successive
    links at random, not taking content into
    consideration.
  • A page distributes its rank equally among all
    pages it links to.
  • The damping factor models the surfer getting
    bored and typing an arbitrary URL.
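For reference, the update these bullets describe, written as a small Python function; d is the damping factor and C(t) the out-degree of page t. This uses the classic un-normalized form of the formula, which the slide itself does not show:

def pagerank_update(in_links: list[tuple[float, int]], d: float = 0.85) -> float:
    # in_links: one (rank_of_t, out_degree_of_t) pair per page t linking here.
    # Each t passes rank_of_t / C(t) to this page; (1 - d) models the random jump.
    return (1 - d) + d * sum(rank_t / c_t for rank_t, c_t in in_links)

# A page pointed to by two pages with ranks 1.0 (3 out-links) and 0.5 (1 out-link):
print(pagerank_update([(1.0, 3), (0.5, 1)]))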

47
PageRank Key Insights
  • Effects at each iteration are local: the (i+1)th
    iteration depends only on the ith iteration
  • At iteration i, PageRank for individual nodes can
    be computed independently

48
PageRank using MapReduce
  • Use Sparse matrix representation (M)
  • Map each row of M to a list of PageRank credit
    to assign to out-link neighbours.
  • These prestige scores are reduced to a single
    PageRank value for a page by aggregating over
    them.

49
PageRank using MapReduce
Source of image: Lin 2008
50
Phase 1: Process HTML
  • Map task takes (URL, page-content) pairs and maps
    them to (URL, (PRinit, list-of-urls))
  • PRinit is the seed PageRank for URL
  • list-of-urls contains all pages pointed to by URL
  • Reduce task is just the identity function

51
Phase 2: PageRank Distribution
  • Reduce task gets (URL, url_list) and many (URL,
    val) values
  • Sum vals and fix up with d to get new PR
  • Emit (URL, (new_rank, url_list))
  • Check for convergence using a non-parallel
    component
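A compact sketch of one PageRank iteration in map/reduce form, matching the two phases above; the graph encoding and function names are illustrative, not taken from a specific framework:

def pr_map(url, record):
    rank, out_links = record
    # Pass the adjacency list through so the reducer can re-emit it,
    # and give each out-link an equal share of this page's rank.
    yield url, ("links", out_links)
    for target in out_links:
        yield target, ("credit", rank / len(out_links))

def pr_reduce(url, values, d=0.85):
    out_links, total = [], 0.0
    for kind, payload in values:
        if kind == "links":
            out_links = payload
        else:
            total += payload
    new_rank = (1 - d) + d * total      # fix up the summed credit with d
    return url, (new_rank, out_links)

# One iteration over a toy 2-page graph, shuffled by hand:
graph = {"a": (1.0, ["b"]), "b": (1.0, ["a"])}
shuffled = {}
for url, rec in graph.items():
    for key, val in pr_map(url, rec):
        shuffled.setdefault(key, []).append(val)
print(dict(pr_reduce(u, vs) for u, vs in shuffled.items()))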

52
MapReduce Some More Apps
  • Distributed Grep.
  • Count of URL Access Frequency.
  • Clustering (K-means)
  • Graph Algorithms.
  • Indexing Systems

MapReduce Programs In Google Source Tree
53
MapReduce Extensions and similar apps
  • Pig (Yahoo!)
  • Hadoop (Apache)
  • DryadLINQ (Microsoft)

54
Large Scale Systems Architecture using MapReduce
55
BigTable: A Distributed Storage System for
Structured Data
56
Introduction
  • BigTable is a distributed storage system for
    managing structured data.
  • Designed to scale to a very large size
  • Petabytes of data across thousands of servers
  • Used for many Google projects
  • Web indexing, Personalized Search, Google Earth,
    Google Analytics, Google Finance,
  • Flexible, high-performance solution for all of
    Google's products

57
Motivation
  • Lots of (semi-)structured data at Google
  • URLs
  • Contents, crawl metadata, links, anchors,
    pagerank,
  • Per-user data
  • User preference settings, recent queries/search
    results,
  • Geographic locations
  • Physical entities (shops, restaurants, etc.),
    roads, satellite image data, user annotations,
  • Scale is large
  • Billions of URLs, many versions/page
    (20K/version)
  • Hundreds of millions of users, thousands of queries/sec
  • 100TB of satellite image data

58
Why not just use commercial DB?
  • Scale is too large for most commercial databases
  • Even if it weren't, cost would be very high
  • Building internally means system can be applied
    across many projects for low incremental cost
  • Low-level storage optimizations help performance
    significantly
  • Much harder to do when running on top of a
    database layer

59
Goals
  • Want asynchronous processes to be continuously
    updating different pieces of data
  • Want access to most current data at any time
  • Need to support
  • Very high read/write rates (millions of ops per
    second)
  • Efficient scans over all or interesting subsets
    of data
  • Efficient joins of large one-to-one and
    one-to-many datasets
  • Often want to examine data changes over time
  • E.g. Contents of a web page over multiple crawls

60
BigTable
  • Distributed multi-level map
  • Fault-tolerant, persistent
  • Scalable
  • Thousands of servers
  • Terabytes of in-memory data
  • Petabyte of disk-based data
  • Millions of reads/writes per second, efficient
    scans
  • Self-managing
  • Servers can be added/removed dynamically
  • Servers adjust to load imbalance

61
Building Blocks
  • Building blocks
  • Google File System (GFS): raw storage
  • Scheduler: schedules jobs onto machines
  • Lock service: distributed lock manager
  • MapReduce: simplified large-scale data processing
  • BigTable uses of building blocks
  • GFS stores persistent data (SSTable file format
    for storage of data)
  • Scheduler schedules jobs involved in BigTable
    serving
  • Lock service: master election, location
    bootstrapping
  • MapReduce: often used to read/write BigTable data

62
Basic Data Model
  • A BigTable is a sparse, distributed, persistent,
    multi-dimensional sorted map
  • (row, column, timestamp) -> cell contents
  • Good match for most Google applications
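A throwaway sketch of that map shape using nested Python dicts, just to make the (row, column, timestamp) -> value indexing concrete; this is not the Bigtable client API:

# table[row_key][column][timestamp] -> cell contents
table = {
    "com.cnn.www": {
        "contents:": {
            1060: "<html>version fetched at t=1060</html>",
            1045: "<html>version fetched at t=1045</html>",
        },
        "anchor:cnnsi.com": {1050: "CNN"},
    },
}

def read_cell(row, column, timestamp=None):
    versions = table[row][column]
    ts = timestamp if timestamp is not None else max(versions)  # newest by default
    return versions[ts]

print(read_cell("com.cnn.www", "contents:"))             # most recent crawl
print(read_cell("com.cnn.www", "anchor:cnnsi.com", 1050))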

63
WebTable Example
  • Want to keep copy of a large collection of web
    pages and related information
  • Use URLs as row keys
  • Various aspects of web page as column names
  • Store contents of web pages in the contents
    column under the timestamps when they were
    fetched.

64
Rows
  • Name is an arbitrary string
  • Access to data in a row is atomic
  • Row creation is implicit upon storing data
  • Rows ordered lexicographically
  • Rows close together lexicographically usually on
    one or a small number of machines

65
Rows (cont.)
  • Reads of short row ranges are efficient and
    typically require communication with a small
    number of machines.
  • Can exploit this property by selecting row keys
    so they get good locality for data access.
  • Example
  • math.gatech.edu, math.uga.edu, phys.gatech.edu,
    phys.uga.edu
  • vs.
  • edu.gatech.math, edu.gatech.phys, edu.uga.math,
    edu.uga.phys
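A small helper, written here only to illustrate the second key scheme in the example; reversing the host name puts pages from the same domain next to each other in the row order:

def row_key(host: str) -> str:
    # "math.gatech.edu" -> "edu.gatech.math", so all *.gatech.edu rows sort together.
    return ".".join(reversed(host.split(".")))

hosts = ["math.gatech.edu", "math.uga.edu", "phys.gatech.edu", "phys.uga.edu"]
print(sorted(hosts))                      # interleaves gatech and uga
print(sorted(row_key(h) for h in hosts))  # groups edu.gatech.* then edu.uga.*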

66
Columns
  • Columns have two-level name structure
  • family:optional_qualifier
  • Column family
  • Unit of access control
  • Has associated type information
  • Qualifier gives unbounded columns
  • Additional levels of indexing, if desired

67
Timestamps
  • Used to store different versions of data in a
    cell
  • New writes default to current time, but
    timestamps for writes can also be set explicitly
    by clients
  • Lookup options
  • Return most recent K values
  • Return all values in timestamp range (or all
    values)
  • Column families can be marked w/ attributes
  • Only retain most recent K values in a cell
  • Keep values until they are older than K seconds
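A minimal sketch of the first garbage-collection attribute (keep only the most recent K values in a cell); the cell here is just a timestamp -> value dict, not Bigtable's real storage format:

def trim_to_k_newest(cell: dict[int, str], k: int) -> dict[int, str]:
    # Keep the k largest timestamps, drop the older versions.
    newest = sorted(cell, reverse=True)[:k]
    return {ts: cell[ts] for ts in newest}

cell = {1001: "v1", 1005: "v2", 1010: "v3", 1020: "v4"}
print(trim_to_k_newest(cell, k=2))   # {1020: 'v4', 1010: 'v3'}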

68
Implementation: Three Major Components
  • Library linked into every client
  • One master server
  • Responsible for
  • Assigning tablets to tablet servers
  • Detecting addition and expiration of tablet
    servers
  • Balancing tablet-server load
  • Garbage collection
  • Many tablet servers
  • Each tablet server handles read and write
    requests to its tablets
  • Splits tablets that have grown too large

69
Implementation (cont.)
  • Client data doesn't move through the master server.
    Clients communicate directly with tablet servers
    for reads and writes.
  • Most clients never communicate with the master
    server, leaving it lightly loaded in practice.

70
Tablets
  • Large tables broken into tablets at row
    boundaries
  • Tablet holds contiguous range of rows
  • Clients can often choose row keys to achieve
    locality
  • Aim for 100MB to 200MB of data per tablet
  • Serving machine responsible for 100 tablets
  • Fast recovery
  • 100 machines each pick up 1 tablet for failed
    machine
  • Fine-grained load balancing
  • Migrate tablets away from overloaded machine
  • Master makes load-balancing decisions

71
Tablet Location
  • Since tablets move around from server to server,
    given a row, how do clients find the right
    machine?
  • Need to find tablet whose row range covers the
    target row

72
Tablet Assignment
  • Each tablet is assigned to one tablet server at a
    time.
  • Master server keeps track of the set of live
    tablet servers and current assignments of tablets
    to servers. Also keeps track of unassigned
    tablets.
  • When a tablet is unassigned, the master assigns
    the tablet to a tablet server with sufficient room.

73
API
  • Metadata operations
  • Create/delete tables, column families, change
    metadata
  • Writes (atomic)
  • Set(): write cells in a row
  • DeleteCells(): delete cells in a row
  • DeleteRow(): delete all cells in a row
  • Reads
  • Scanner: read arbitrary cells in a bigtable
  • Each row read is atomic
  • Can restrict returned rows to a particular range
  • Can ask for just data from 1 row, all rows, etc.
  • Can ask for all columns, just certain column
    families, or specific columns
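A hedged sketch of how these operations might look from a client, using invented Python-flavoured names and an in-memory table; the real Bigtable client API is C++ and differs in detail:

# Hypothetical client wrapper; method names mirror the bullets above.
class BigtableClient:
    def __init__(self):
        self.rows = {}

    def set_cell(self, row, column, value, ts=0):          # Set(): write cells in a row
        self.rows.setdefault(row, {}).setdefault(column, {})[ts] = value

    def delete_cells(self, row, column):                   # DeleteCells()
        self.rows.get(row, {}).pop(column, None)

    def delete_row(self, row):                             # DeleteRow()
        self.rows.pop(row, None)

    def scan(self, start_row, end_row, columns=None):      # Scanner over a row range
        for row in sorted(self.rows):
            if start_row <= row < end_row:
                cells = self.rows[row]
                yield row, {c: v for c, v in cells.items()
                            if columns is None or c in columns}

t = BigtableClient()
t.set_cell("com.cnn.www", "contents:", "<html>...</html>", ts=1060)
t.set_cell("com.cnn.www", "anchor:cnnsi.com", "CNN", ts=1050)
for row, cells in t.scan("com.a", "com.z", columns={"contents:"}):
    print(row, cells)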

74
Refinements: Locality Groups
  • Can group multiple column families into a
    locality group
  • Separate SSTable is created for each locality
    group in each tablet.
  • Segregating column families that are not
    typically accessed together enables more
    efficient reads.
  • In WebTable, page metadata can be in one group
    and contents of the page in another group.

75
Refinements: Compression
  • Many opportunities for compression
  • Similar values in the same row/column at
    different timestamps
  • Similar values in different columns
  • Similar values across adjacent rows
  • Two-pass custom compression scheme
  • First pass: compress long common strings across a
    large window
  • Second pass: look for repetitions in a small
    window
  • Speed emphasized, but good space reduction
    (10-to-1)

76
Refinements: Bloom Filters
  • Read operation has to read from disk when the
    desired SSTable isn't in memory
  • Reduce number of accesses by specifying a Bloom
    filter.
  • Allows us to ask whether an SSTable might contain
    data for a specified row/column pair.
  • Small amount of memory for Bloom filters
    drastically reduces the number of disk seeks for
    read operations
  • Use implies that most lookups for non-existent
    rows or columns do not need to touch disk
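A tiny Bloom filter sketch to show the trade-off being described: membership tests can say "definitely not present" (skip the disk seek) or "maybe present" (go to disk), with no false negatives. The hashing scheme below is illustrative only:

import hashlib

class BloomFilter:
    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: str):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:
        # False -> the row/column pair is certainly not in this SSTable.
        # True  -> it may be; only then is a disk seek needed.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

bf = BloomFilter()
bf.add("com.cnn.www/contents:")
print(bf.might_contain("com.cnn.www/contents:"))   # True
print(bf.might_contain("com.example/contents:"))   # almost certainly False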
