Title: www.jiahenglu.net
1 Cloud Computing and Cloud Data Management
- Jiaheng Lu
- Renmin University of China
- www.jiahenglu.net
2CLOUD COMPUTING
3 Outline
- Cloud computing overview
- Google cloud computing techniques: GFS, Bigtable and MapReduce
- Yahoo cloud computing technique: Hadoop
- Summary of cloud data management systems
4Cloud computing
5(No Transcript)
6Why we use cloud computing?
7Why we use cloud computing?
- Case 1
- Write a file
- Save
- The computer crashes and the file is lost
- With cloud storage, files are kept in the cloud and are never lost
8Why we use cloud computing?
- Case 2
- Use IE --- download, install, use
- Use QQ --- download, install, use
- Use C --- download, install, use
- With cloud computing: get the service from the cloud, with nothing to install
9What is cloud and cloud computing?
- Cloud
- On-demand resources or services delivered over the Internet
- With the scale and reliability of a data center
10What is cloud and cloud computing?
- Cloud computing is a style of computing in which dynamically scalable and often virtualized resources are provided as a service over the Internet.
- Users need not have knowledge of, expertise in, or control over the technology infrastructure in the "cloud" that supports them.
11Characteristics of cloud computing
- Virtual: software, databases, Web servers, operating systems, storage and networking are delivered as virtual servers.
- On demand: processors, memory, network bandwidth and storage can be added and removed as needed.
12Types of cloud service
SaaS Software as a Service
PaaS Platform as a Service
IaaS Infrastructure as a Service
13SaaS
- Software delivery model
- No hardware or software to manage
- Service delivered through a browser
- Customers use the service on demand
- Instant Scalability
14SaaS
- Examples
- Your current CRM package is not managing the load, or you simply don't want to host it in-house: use a SaaS provider such as Salesforce.com.
- Your email is hosted on an Exchange server in your office and it is very slow: outsource this using Hosted Exchange.
15PaaS
- Platform delivery model
- Platforms are built upon Infrastructure, which is
expensive - Estimating demand is not a science!
- Platform management is not fun!
16PaaS
- Examples
- You need to host a large file (5Mb) on your website and make it available to 35,000 users for only two months: use CloudFront from Amazon.
- You want to start storage services on your network for a large number of files and you do not have the storage capacity: use Amazon S3.
17IaaS
- Computer infrastructure delivery model
- A platform virtualization environment
- Computing resources, such as storage and processing capacity
- Virtualization taken a step further
18IaaS
- Examples
- You want to run a batch job but you don't have the infrastructure necessary to run it in a timely manner: use Amazon EC2.
- You want to host a website, but only for a few days: use Flexiscale.
19Cloud computing and other computing techniques
20CLOUD COMPUTING
21The 21st Century Vision Of Computing
Leonard Kleinrock, one of the chief scientists of the original Advanced Research Projects Agency Network (ARPANET) project which seeded the Internet, said: "As of now, computer networks are still in their infancy, but as they grow up and become sophisticated, we will probably see the spread of computer utilities which, like present electric and telephone utilities, will service individual homes and offices across the country."
22The 21st Century Vision Of Computing
Sun Microsystems co-founder Bill Joy
23The 21st Century Vision Of Computing
24Definitions
utility
25Definitions
Utility computing is the packaging of computing
resources, such as computation and storage, as a
metered service similar to a traditional public
utility
26Definitions
A computer cluster is a group of linked
computers, working together closely so that in
many respects they form a single computer.
27Definitions
Grid computing is the application of several
computers to a single problem at the same time
usually to a scientific or technical problem that
requires a great number of computer processing
cycles or access to large amounts of data
28Definitions
Cloud computing is a style of computing in which
dynamically scalable and often virtualized
resources are provided as a service over the
Internet.
29Grid Computing Cloud Computing
- Share a lot of commonality: intention, architecture and technology
- Differences: programming model, business model, compute model, applications, and virtualization
30Grid Computing Cloud Computing
- the problems are mostly the same
- manage large facilities
- define methods by which consumers discover,
request and use resources provided by the central
facilities - implement the often highly parallel computations
that execute on those resources.
31Grid Computing Cloud Computing
- Virtualization
- Grid
- Grids do not rely on virtualization as much as Clouds do; each individual organization maintains full control of its resources
- Cloud
- Virtualization is an indispensable ingredient for almost every Cloud
32(No Transcript)
33 Any questions or comments?
34 Outline
- Cloud computing overview
- Google cloud computing techniques: GFS, Bigtable and MapReduce
- Yahoo cloud computing technique: Hadoop
- Summary of cloud data management systems
35Google Cloud computing techniques
36 Cloud Systems
(Figure: a survey of cloud data systems, grouped as BigTable-like, MapReduce-based and DBMS-based, with the publication venue of each: BigTable (OSDI '06), HBase, HyperTable, Hive (VLDB '09), HadoopDB (VLDB '09), GreenPlum, CouchDB, Voldemort, PNUTS (VLDB '08), SQL Azure.)
37The Google File System
38The Google File System (GFS)
- A scalable distributed file system for large distributed data-intensive applications
- Multiple GFS clusters are currently deployed.
- The largest ones have
- 1000 storage nodes
- 300 TB of disk storage
- heavily accessed by hundreds of clients on distinct machines
39Introduction
- Shares many of the same goals as previous distributed file systems: performance, scalability, reliability, etc.
- The GFS design has been driven by four key observations of Google's application workloads and technological environment
40Intro Observations 1
- 1. Component failures are the norm
- constant monitoring, error detection, fault tolerance and automatic recovery are integral to the system
- 2. Huge files (by traditional standards)
- Multi-GB files are common
- I/O operations and block sizes must be revisited
41Intro Observations 2
- 3. Most files are mutated by appending new data
- This is the focus of performance optimization and atomicity guarantees
- 4. Co-designing the applications and the file system API benefits the overall system by increasing flexibility
42The Design
- Cluster consists of a single master and multiple
chunkservers and is accessed by multiple clients
43The Master
- Maintains all file system metadata
- namespace, access control info, file-to-chunk mappings, chunk (and replica) locations, etc.
- Periodically communicates with chunkservers in HeartBeat messages to give instructions and check state
44The Master
- Makes sophisticated chunk placement and replication decisions using global knowledge
- For reading and writing, a client contacts the Master to get chunk locations, then deals directly with chunkservers
- The Master is therefore not a bottleneck for reads/writes
45Chunkservers
- Files are broken into chunks. Each chunk has an immutable, globally unique 64-bit chunk handle
- The handle is assigned by the master at chunk creation
- Chunk size is 64 MB (see the sketch below)
- Each chunk is replicated on 3 (default) chunkservers
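Since chunks have a fixed 64 MB size, a client can compute which chunk a byte offset falls into before asking the master for that chunk's handle and replica locations. A minimal sketch (the helper names are hypothetical, not GFS client code):

    # Minimal sketch: mapping a byte offset in a file to a chunk index,
    # given the fixed 64 MB chunk size described above.
    CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB

    def chunk_index(byte_offset):
        # The chunk that holds this offset, and the offset within that chunk.
        return byte_offset // CHUNK_SIZE, byte_offset % CHUNK_SIZE

    # Example: a read at offset 200,000,000 falls in chunk index 2 (the third chunk).
    idx, offset_in_chunk = chunk_index(200_000_000)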
46Clients
- Linked to apps using the file system API.
- Communicates with master and chunkservers for
reading and writing - Master interactions only for metadata
- Chunkserver interactions for data
- Only caches metadata information
- Data is too large to cache.
47Chunk Locations
- Master does not keep a persistent record of
locations of chunks and replicas. - Polls chunkservers at startup, and when new
chunkservers join/leave for this. - Stays up to date by controlling placement of new
chunks and through HeartBeat messages (when
monitoring chunkservers)
48Operation Log
- Record of all critical metadata changes
- Stored on Master and replicated on other machines
- Defines order of concurrent operations
- Also used to recover the file system state
49System Interactions Leases and Mutation Order
- Leases maintain a mutation order across all chunk
replicas - Master grants a lease to a replica, called the
primary - The primary choses the serial mutation order, and
all replicas follow this order - Minimizes management overhead for the Master
50Atomic Record Append
- The client specifies only the data to write; GFS chooses the offset, appends the data to each replica at least once, and returns the offset to the client
- Heavily used by Google's distributed applications
- No need for a distributed lock manager
- GFS chooses the offset, not the client
51Atomic Record Append How?
- Follows similar control flow as mutations
- The primary tells the secondary replicas to append at the same offset as the primary
- If the append fails at any replica, the client retries it
- So replicas of the same chunk may contain different data, including duplicates, in whole or in part, of the same record
52Atomic Record Append How?
- GFS does not guarantee that all replicas are
bitwise identical. - Only guarantees that data is written at least
once in an atomic unit. - Data must be written at the same offset for all
chunk replicas for success to be reported.
53Detecting Stale Replicas
- Master has a chunk version number to distinguish
up to date and stale replicas - Increase version when granting a lease
- If a replica is not available, its version is not increased
- The master detects stale replicas when chunkservers report their chunks and versions
- Stale replicas are removed during garbage collection
54Garbage collection
- When a client deletes a file, the master logs the deletion like other changes and renames the file to a hidden name
- The master removes files hidden for longer than 3 days when scanning the file system namespace; their metadata is also erased
- During HeartBeat messages, each chunkserver sends the master a subset of its chunks, and the master tells it which of those chunks no longer belong to any file
- The chunkserver removes these chunks on its own
55Fault ToleranceHigh Availability
- Fast recovery
- Master and chunkservers can restart in seconds
- Chunk Replication
- Master Replication
- shadow masters provide read-only access when the primary master is down
- mutations are not considered done until recorded on all master replicas
56Fault ToleranceData Integrity
- Chunkservers use checksums to detect corrupt data
- Since replicas are not bitwise identical,
chunkservers maintain their own checksums - For reads, chunkserver verifies checksum before
sending chunk - Update checksums during writes
57Introduction to MapReduce
58MapReduce Insight
-
- Consider the problem of counting the number of
occurrences of each word in a large collection of
documents - How would you do it in parallel ?
59MapReduce Programming Model
- Inspired by the map and reduce operations commonly used in functional programming languages like Lisp.
- Users implement an interface of two primary methods
- 1. Map: (key1, val1) -> list(key2, val2)
- 2. Reduce: (key2, list of val2) -> val3
60Map operation
- Map, a pure function written by the user, takes an input key/value pair and produces a set of intermediate key/value pairs.
- e.g. (docid, doc-content)
- Drawing an analogy to SQL, map can be visualized as the group-by clause of an aggregate query.
61Reduce operation
- On completion of the map phase, all the intermediate values for a given output key are combined into a list and given to a reducer.
- Can be visualized as an aggregate function (e.g., average) computed over all the rows with the same group-by attribute.
62 Pseudo-code
    map(String input_key, String input_value):
      // input_key: document name
      // input_value: document contents
      for each word w in input_value:
        EmitIntermediate(w, "1")

    reduce(String output_key, Iterator intermediate_values):
      // output_key: a word
      // intermediate_values: a list of counts
      int result = 0
      for each v in intermediate_values:
        result += ParseInt(v)
      Emit(AsString(result))
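For reference, a small runnable Python version of the same word count, with the shuffle (grouping intermediate pairs by key) done in memory; a real MapReduce runtime performs the map, shuffle and reduce steps across many machines:

    # Single-process sketch of the word-count pseudo-code above.
    from collections import defaultdict

    def map_fn(doc_name, doc_contents):
        for word in doc_contents.split():
            yield (word, 1)

    def reduce_fn(word, counts):
        return (word, sum(counts))

    def mapreduce(documents):
        intermediate = defaultdict(list)
        for name, contents in documents.items():          # map phase
            for key, value in map_fn(name, contents):
                intermediate[key].append(value)            # shuffle: group by key
        return dict(reduce_fn(k, v) for k, v in intermediate.items())  # reduce phase

    print(mapreduce({"d1": "the cat sat", "d2": "the dog sat"}))
    # {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}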
63MapReduce Execution overview
64MapReduce Example
65MapReduce in Parallel Example
66MapReduce Fault Tolerance
- Handled via re-execution of tasks.
- Task completion committed through the master
- What happens if a Mapper fails?
- Re-execute completed and in-progress map tasks
- What happens if a Reducer fails?
- Re-execute in-progress reduce tasks
- What happens if the Master fails?
- Potential trouble!
67MapReduce
- Walk through of One more Application
68(No Transcript)
69MapReduce PageRank
- PageRank models the behavior of a random surfer (the update formula is shown below)
- C(t) is the out-degree of t, and (1-d) is a damping factor (random jump)
- The random surfer keeps clicking on successive links at random, not taking content into consideration
- It distributes its page's rank equally among all pages it links to
- The damping factor models the surfer getting bored and typing an arbitrary URL
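The update rule implied by these bullets is the classic PageRank formula, shown here for reference (some formulations divide the (1 - d) term by the total number of pages N):

    PR(p) = (1 - d) + d * sum over all pages t that link to p of [ PR(t) / C(t) ]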
70PageRank Key Insights
- The effect of each iteration is local: the (i+1)-th iteration depends only on the i-th iteration
- At iteration i, the PageRank of individual nodes can be computed independently
71PageRank using MapReduce
- Use a sparse matrix representation (M)
- Map each row of M to a list of PageRank credit to assign to out-link neighbours
- These prestige scores are reduced to a single PageRank value for a page by aggregating over them
72 PageRank using MapReduce
Source of Image Lin 2008
73Phase 1 Process HTML
-
- Map task takes (URL, page-content) pairs and maps
them to (URL, (PRinit, list-of-urls)) - PRinit is the seed PageRank for URL
- list-of-urls contains all pages pointed to by URL
- Reduce task is just the identity function
74Phase 2 PageRank Distribution
- The reduce task gets (URL, url_list) and many (URL, val) values
- Sum the vals and fix up with d to get the new PageRank
- Emit (URL, (new_rank, url_list))
- Check for convergence using a non-parallel component (a sketch of both phases follows)
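A hypothetical Python sketch of the two phases just described (the function and variable names are illustrative, not from the original slides): the map function passes the link structure through and distributes rank credit to neighbours; the reduce function sums the credit and fixes it up with the damping factor d.

    D = 0.85  # damping factor

    def pr_map(url, value):
        rank, out_links = value
        yield (url, ("links", out_links))            # pass the graph structure through
        if out_links:
            share = rank / len(out_links)
            for target in out_links:
                yield (target, ("credit", share))    # distribute rank to neighbours

    def pr_reduce(url, values):
        out_links, credit_sum = [], 0.0
        for kind, payload in values:
            if kind == "links":
                out_links = payload
            else:
                credit_sum += payload
        new_rank = (1 - D) + D * credit_sum          # fix up with the damping factor
        return (url, (new_rank, out_links))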
75MapReduce Some More Apps
- Distributed Grep.
- Count of URL Access Frequency.
- Clustering (K-means)
- Graph Algorithms.
- Indexing Systems
MapReduce Programs In Google Source Tree
76MapReduce Extensions and similar apps
-
- PIG (Yahoo)
- Hadoop (Apache)
- DryadLinq (Microsoft)
77Large Scale Systems Architecture using MapReduce
78BigTable A Distributed Storage System for
Structured Data
79Introduction
- BigTable is a distributed storage system for
managing structured data. - Designed to scale to a very large size
- Petabytes of data across thousands of servers
- Used for many Google projects
- Web indexing, Personalized Search, Google Earth,
Google Analytics, Google Finance, ... - Flexible, high-performance solution for all of
Google's products
80Motivation
- Lots of (semi-)structured data at Google
- URLs
- Contents, crawl metadata, links, anchors,
pagerank, - Per-user data
- User preference settings, recent queries/search
results, - Geographic locations
- Physical entities (shops, restaurants, etc.),
roads, satellite image data, user annotations, ... - Scale is large
- Billions of URLs, many versions per page (about 20 KB per version)
- Hundreds of millions of users, thousands of queries/sec
81Why not just use commercial DB?
- Scale is too large for most commercial databases
- Even if it weren't, the cost would be very high
- Building internally means system can be applied
across many projects for low incremental cost - Low-level storage optimizations help performance
significantly - Much harder to do when running on top of a
database layer
82Goals
- Want asynchronous processes to be continuously
updating different pieces of data - Want access to most current data at any time
- Need to support
- Very high read/write rates (millions of ops per
second) - Efficient scans over all or interesting subsets
of data - Efficient joins of large one-to-one and
one-to-many datasets - Often want to examine data changes over time
- E.g. Contents of a web page over multiple crawls
83BigTable
- Distributed multi-level map
- Fault-tolerant, persistent
- Scalable
- Thousands of servers
- Terabytes of in-memory data
- Petabyte of disk-based data
- Millions of reads/writes per second, efficient
scans - Self-managing
- Servers can be added/removed dynamically
- Servers adjust to load imbalance
84Building Blocks
- Building blocks
- Google File System (GFS) Raw storage
- Scheduler schedules jobs onto machines
- Lock service distributed lock manager
- MapReduce simplified large-scale data processing
- BigTable uses of building blocks
- GFS stores persistent data (SSTable file format
for storage of data) - Scheduler schedules jobs involved in BigTable
serving - Lock service master election, location
bootstrapping - Map Reduce often used to read/write BigTable data
85Basic Data Model
- A BigTable is a sparse, distributed, persistent multi-dimensional sorted map
- (row, column, timestamp) -> cell contents (illustrated by the sketch below)
- Good match for most Google applications
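As a toy illustration (not BigTable's actual implementation), the data model can be pictured as a map keyed by (row, column, timestamp); the example row and column names follow the WebTable example from the BigTable paper.

    # Toy illustration of the (row, column, timestamp) -> contents map.
    # BigTable itself keeps this map sorted and distributed across tablet servers.
    table = {}

    def put(row, column, value, timestamp):
        table[(row, column, timestamp)] = value

    def get_latest(row, column):
        versions = [(ts, v) for (r, c, ts), v in table.items() if r == row and c == column]
        return max(versions)[1] if versions else None

    put("com.cnn.www", "contents:", "<html>...</html>", timestamp=3)
    put("com.cnn.www", "anchor:cnnsi.com", "CNN", timestamp=9)
    print(get_latest("com.cnn.www", "contents:"))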
86WebTable Example
- Want to keep copy of a large collection of web
pages and related information - Use URLs as row keys
- Various aspects of web page as column names
- Store contents of web pages in the contents
column under the timestamps when they were
fetched.
87Rows
- Name is an arbitrary string
- Access to data in a row is atomic
- Row creation is implicit upon storing data
- Rows ordered lexicographically
- Rows close together lexicographically usually on
one or a small number of machines
88Rows (cont.)
- Reads of short row ranges are efficient and typically require communication with only a small number of machines
- Clients can exploit this property by selecting row keys so that related data gets good locality (see the sketch below)
- Example
- math.gatech.edu, math.uga.edu, phys.gatech.edu, phys.uga.edu
- VS
- edu.gatech.math, edu.gatech.phys, edu.uga.math, edu.uga.phys
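A small sketch of the idea behind the second key scheme above: reversing host names so that pages from the same domain sort next to each other and therefore land on the same (or neighbouring) tablets.

    # Build row keys with reversed host names, as in the edu.gatech.* example.
    def row_key(url_host):
        return ".".join(reversed(url_host.split(".")))

    hosts = ["math.gatech.edu", "phys.gatech.edu", "math.uga.edu", "phys.uga.edu"]
    print(sorted(row_key(h) for h in hosts))
    # ['edu.gatech.math', 'edu.gatech.phys', 'edu.uga.math', 'edu.uga.phys']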
89Columns
- Columns have a two-level name structure
- family:optional_qualifier
- Column family
- Unit of access control
- Has associated type information
- Qualifier gives unbounded columns
- Additional levels of indexing, if desired
90Timestamps
- Used to store different versions of data in a
cell - New writes default to current time, but
timestamps for writes can also be set explicitly
by clients - Lookup options
- Return most recent K values
- Return all values in timestamp range (or all
values) - Column families can be marked w/ attributes
- Only retain most recent K values in a cell
- Keep values until they are older than K seconds
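A sketch of the two per-column-family garbage-collection settings just listed, assuming versions are kept as (timestamp, value) pairs (illustrative only, not BigTable code):

    import time

    def gc_keep_last_k(versions, k):
        # versions: list of (timestamp, value) pairs; keep the K most recent.
        return sorted(versions, key=lambda tv: tv[0], reverse=True)[:k]

    def gc_keep_recent(versions, ttl_seconds, now=None):
        # Keep only versions newer than a time-to-live of ttl_seconds.
        now = now if now is not None else time.time()
        return [(ts, v) for ts, v in versions if now - ts <= ttl_seconds]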
91Implementation Three Major Components
- Library linked into every client
- One master server
- Responsible for
- Assigning tablets to tablet servers
- Detecting addition and expiration of tablet
servers - Balancing tablet-server load
- Garbage collection
- Many tablet servers
- Tablet servers handle read and write requests to their tablets
- Split tablets that have grown too large
92Implementation (cont.)
- Client data doesn't move through the master server. Clients communicate directly with tablet servers for reads and writes.
- Most clients never communicate with the master server, leaving it lightly loaded in practice.
93Tablets
- Large tables broken into tablets at row
boundaries - Tablet holds contiguous range of rows
- Clients can often choose row keys to achieve
locality - Aim for 100MB to 200MB of data per tablet
- Serving machine responsible for 100 tablets
- Fast recovery
- 100 machines each pick up 1 tablet for failed
machine - Fine-grained load balancing
- Migrate tablets away from overloaded machine
- Master makes load-balancing decisions
94Tablet Location
- Since tablets move around from server to server,
given a row, how do clients find the right
machine? - Need to find tablet whose row range covers the
target row
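One way to picture the lookup: binary-search a sorted list of tablet end keys to find the tablet whose row range covers the target row. BigTable actually stores this mapping in a METADATA table; the names in the sketch below are made up for illustration.

    import bisect

    tablet_end_keys = ["apple", "grape", "orange", "\xff"]   # sorted last row of each tablet
    tablet_servers  = ["ts1",   "ts2",   "ts3",    "ts4"]

    def locate(row_key):
        # Index of the first tablet whose end key is >= the target row.
        i = bisect.bisect_left(tablet_end_keys, row_key)
        return tablet_servers[i]

    print(locate("banana"))   # ts2: the tablet covering rows after "apple" up to "grape"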
95Tablet Assignment
- Each tablet is assigned to one tablet server at a
time. - Master server keeps track of the set of live
tablet servers and current assignments of tablets
to servers. Also keeps track of unassigned
tablets. - When a tablet is unassigned, master assigns the
tablet to an tablet server with sufficient room.
96API
- Metadata operations
- Create/delete tables, column families, change
metadata - Writes (atomic)
- Set() write cells in a row
- DeleteCells() delete cells in a row
- DeleteRow() delete all cells in a row
- Reads
- Scanner read arbitrary cells in a bigtable
- Each row read is atomic
- Can restrict returned rows to a particular range
- Can ask for just data from 1 row, all rows, etc.
- Can ask for all columns, just certain column
families, or specific columns
97Refinements Compression
- Many opportunities for compression
- Similar values in the same row/column at
different timestamps - Similar values in different columns
- Similar values across adjacent rows
- Two-pass custom compression scheme
- First pass: compress long common strings across a large window
- Second pass: look for repetitions in a small window
- Speed is emphasized, but there is still good space reduction (10-to-1)
98Refinements Bloom Filters
- A read operation has to read from disk when the desired SSTable isn't in memory
- Reduce the number of accesses by specifying a Bloom filter (a sketch follows below)
- Allows us to ask whether an SSTable might contain data for a specified row/column pair
- A small amount of memory for Bloom filters drastically reduces the number of disk seeks for read operations
- In practice, most lookups for non-existent rows or columns do not need to touch disk
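A minimal Bloom filter sketch (illustrative, not BigTable's implementation): it answers "definitely not present" or "possibly present" for a row/column pair, so most lookups for non-existent data never reach disk.

    import hashlib

    class BloomFilter:
        def __init__(self, num_bits=1024, num_hashes=3):
            self.num_bits, self.num_hashes = num_bits, num_hashes
            self.bits = 0                      # bit array stored as one big integer

        def _positions(self, key):
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
                yield int(digest, 16) % self.num_bits

        def add(self, key):
            for pos in self._positions(key):
                self.bits |= (1 << pos)

        def might_contain(self, key):
            return all(self.bits & (1 << pos) for pos in self._positions(key))

    bf = BloomFilter()
    bf.add("com.cnn.www/contents:")
    print(bf.might_contain("com.cnn.www/contents:"))   # True
    print(bf.might_contain("com.example/contents:"))   # almost certainly False: skip the disk read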
100 Outline
- Cloud computing overview
- Google cloud computing techniques: GFS, Bigtable and MapReduce
- Yahoo cloud computing technique: Hadoop
- Summary of cloud data management systems
101Yahoo! Cloud computing
102Search Results of the Future
yelp.com
Gawker
babycenter
New York Times
epicurious
LinkedIn
answers.com
webmd
103Whats in the Horizontal Cloud?
Simple Web Service APIs
Horizontal Cloud Services
Edge Content Services e.g., YCS, YCPI
Provisioning Virtualization e.g., EC2
Batch Storage Processing e.g., Hadoop Pig
Operational Storage e.g., S3, MObStor, Sherpa
Other Services Messaging, Workflow, virtual
DBs Webserving
ID Account Management
Shared Infrastructure
Metering, Billing, Accounting
Monitoring QoS
Common Approaches to QA, Production
Engineering, Performance Engineering, Datacenter
Management, and Optimization
104 Yahoo! Cloud Stack
(Figure: the Yahoo! cloud stack, with horizontal cloud services at every layer.)
- EDGE: YCS, YCPI, Brooklyn
- WEB: VM/OS, yApache, PHP, App Engine
- APP: VM/OS, Serving Grid, Data Highway
- STORAGE: Sherpa, MObStor
- BATCH: Hadoop
- Cross-cutting: Provisioning (self-serve), Monitoring/Metering/Security
105 Web Data Management
- Structured record storage (PNUTS/Sherpa)
- CRUD, point lookups and short scans
- Index-organized table and random I/Os
- $ per latency
- Large data analysis (Hadoop)
- Scan-oriented workloads
- Focus on sequential disk I/O
- $ per CPU cycle
- Blob storage (SAN/NAS)
- Object retrieval and streaming
- Scalable file storage
- $ per GB
106The World Has Changed
- Web serving applications need
- Scalability!
- Preferably elastic
- Flexible schemas
- Geographic distribution
- High availability
- Reliable storage
- Web serving applications can do without
- Complicated queries
- Strong transactions
107PNUTS / SHERPA To Help You Scale Your Mountains
of Data
108Yahoo! Serving Storage Problem
- Small records 100KB or less
- Structured records lots of fields, evolving
- Extreme data scale - Tens of TB
- Extreme request scale - Tens of thousands of
requests/sec - Low latency globally - 20 datacenters worldwide
- High Availability - outages cost millions
- Variable usage patterns - as applications and
users change
108
109The PNUTS/Sherpa Solution
- The next generation global-scale record store
- Record-orientation: routing and data storage optimized for low-latency record access
- Scale out: add machines to scale throughput (while keeping latency low)
- Asynchrony: pub-sub replication to far-flung datacenters to mask propagation delay
- Consistency model: reduce the complexity of asynchrony for the application programmer
- Cloud deployment model: hosted, managed service to reduce app time-to-market and enable on-demand scale and elasticity
109
110What is PNUTS/Sherpa?
CREATE TABLE Parts ( ID VARCHAR, StockNumber
INT, Status VARCHAR )
Structured, flexible schema
Geographic replication
Parallel database
Hosted, managed infrastructure
110
111What Will It Become?
Indexes and views
CREATE TABLE Parts ( ID VARCHAR, StockNumber
INT, Status VARCHAR )
Geographic replication
Parallel database
Structured, flexible schema
Hosted, managed infrastructure
112What Will It Become?
Indexes and views
113Design Goals
- Scalability
- Thousands of machines
- Easy to add capacity
- Restrict query language to avoid costly queries
- Geographic replication
- Asynchronous replication around the globe
- Low-latency local access
- High availability and fault tolerance
- Automatically recover from failures
- Serve reads and writes despite failures
- Consistency
- Per-record guarantees
- Timeline model
- Option to relax if needed
- Multiple access paths
- Hash table, ordered table
- Primary, secondary access
- Hosted service
- Applications plug and play
- Share operational cost
113
114Technology Elements
Applications
Tabular API
PNUTS API
- PNUTS
- Query planning and execution
- Index maintenance
- Distributed infrastructure for tabular data
- Data partitioning
- Update consistency
- Replication
YCA Authorization
- Tribble
- Pub/sub messaging
- Zookeeper
- Consistency service
114
115Data Manipulation
- Per-record operations
- Get
- Set
- Delete
- Multi-record operations
- Multiget
- Scan
- Getrange
115
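To make the operation list concrete, here is a hypothetical client with that shape of API; the class and method names are made up for illustration and are not the real PNUTS/Sherpa client library.

    class SherpaLikeClient:
        def __init__(self):
            self.store = {}                       # in-memory stand-in for the remote table

        def set(self, table, key, record):
            self.store[(table, key)] = record

        def get(self, table, key):
            return self.store.get((table, key))

        def delete(self, table, key):
            self.store.pop((table, key), None)

        def multiget(self, table, keys):
            return {k: self.get(table, k) for k in keys}

        def get_range(self, table, start, end):   # short ordered scan over [start, end)
            return {k: v for (t, k), v in sorted(self.store.items())
                    if t == table and start <= k < end}

    c = SherpaLikeClient()
    c.set("Parts", "p100", {"StockNumber": 7, "Status": "in-stock"})
    print(c.multiget("Parts", ["p100", "p200"]))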
116 Tablets: Hash Table
Hash        Name        Price  Description
0x0000
            Grape       12     Grapes are good to eat
            Lime        9      Limes are green
            Apple       1      Apple is wisdom
            Strawberry  900    Strawberry shortcake
0x2AF3
            Orange      2      Arrgh! Don't get scurvy!
            Avocado     3      But at what price?
            Lemon       1      How much did you pay for this lemon?
            Tomato      14     Is this a vegetable?
0x911F
            Banana      2      The perfect fruit
            Kiwi        8      New Zealand
0xFFFF
117 Tablets: Ordered Table
Key range   Name        Price  Description
A
            Apple       1      Apple is wisdom
            Avocado     3      But at what price?
            Banana      2      The perfect fruit
            Grape       12     Grapes are good to eat
H
            Kiwi        8      New Zealand
            Lemon       1      How much did you pay for this lemon?
            Lime        9      Limes are green
            Orange      2      Arrgh! Don't get scurvy!
Q
            Strawberry  900    Strawberry shortcake
            Tomato      14     Is this a vegetable?
Z
118 Flexible Schema
Posted date  Listing id  Item   Price
6/1/07       424252      Couch  570
6/1/07       763245      Bike   86
6/3/07       211242      Car    1123
6/5/07       421133      Lamp   15
Some rows also carry extra columns: Condition (Good, Fair) and Color (Red)
119Detailed Architecture
Local region
Remote regions
Clients
REST API
Routers
Tribble
Tablet Controller
Storage units
119
120Tablet Splitting and Balancing
Each storage unit has many tablets (horizontal
partitions of the table)
Storage unit may become a hotspot
Tablets may grow over time
Overfull tablets split
Shed load by moving tablets to other servers
120
121QUERY PROCESSING
121
122Accessing Data
Get key k
SU
SU
SU
122
123Bulk Read
SU
SU
SU
123
124Range Queries in YDOT
- Clustered, ordered retrieval of records
Apple Avocado Banana Blueberry
Canteloupe Grape Kiwi Lemon
Lime Mango Orange
Strawberry Tomato Watermelon
Apple Avocado Banana Blueberry
Canteloupe Grape Kiwi Lemon
Lime Mango Orange
Strawberry Tomato Watermelon
125Updates
Write key k
Sequence for key k
Routers
Message brokers
Write key k
Sequence for key k
SUCCESS
Write key k
125
126ASYNCHRONOUS REPLICATION AND CONSISTENCY
126
127Asynchronous Replication
127
128Consistency Model
- Goal Make it easier for applications to reason
about updates and cope with asynchrony - What happens to a record with primary key
Alice?
Record inserted
Delete
Update
Update
Update
Update
Update
Update
Update
v. 1
v. 2
v. 3
v. 4
v. 5
v. 7
v. 6
v. 8
Time
Time
Generation 1
As the record is updated, copies may get out of
sync.
128
129Example Social Alice
East
Record Timeline
West
User Status
Alice ___
___
User Status
Alice Busy
Busy
User Status
Alice Busy
User Status
Alice Free
Free
User Status
Alice ???
User Status
Alice ???
Free
130Consistency Model
Read
Current version
Stale version
Stale version
v. 1
v. 2
v. 3
v. 4
v. 5
v. 7
v. 6
v. 8
Time
Generation 1
In general, reads are served using a local copy
130
131Consistency Model
Read up-to-date
Current version
Stale version
Stale version
v. 1
v. 2
v. 3
v. 4
v. 5
v. 7
v. 6
v. 8
Time
Generation 1
But application can request and get current
version
131
132Consistency Model
Read v.6
Current version
Stale version
Stale version
v. 1
v. 2
v. 3
v. 4
v. 5
v. 7
v. 6
v. 8
Time
Generation 1
Or variations such as "read forward": while copies may lag the master record, every copy goes through the same sequence of changes
132
133Consistency Model
Write
Current version
Stale version
Stale version
v. 1
v. 2
v. 3
v. 4
v. 5
v. 7
v. 6
v. 8
Time
Generation 1
Achieved via a per-record primary copy protocol (to maximize availability, record masterships are automatically transferred if a site fails). Can be selectively weakened to eventual consistency (local writes that are reconciled using version vectors)
133
134Consistency Model
Write if v.7
ERROR
Current version
Stale version
Stale version
v. 1
v. 2
v. 3
v. 4
v. 5
v. 7
v. 6
v. 8
Time
Generation 1
Test-and-set writes facilitate per-record transactions (see the sketch below)
134
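A sketch of a test-and-set write, assuming each record carries a version number: the write succeeds only if the record is still at the version the writer read, otherwise the writer must re-read and retry (illustrative, not the PNUTS API).

    class VersionConflict(Exception):
        pass

    def test_and_set(record, expected_version, new_value):
        if record["version"] != expected_version:
            raise VersionConflict("record changed since it was read; re-read and retry")
        record["value"] = new_value
        record["version"] += 1
        return record

    alice = {"value": "Busy", "version": 7}
    test_and_set(alice, expected_version=7, new_value="Free")    # succeeds, version becomes 8
    # test_and_set(alice, expected_version=7, new_value="Away")  # would raise VersionConflict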
135Consistency Techniques
- Per-record mastering
- Each record is assigned a master region
- May differ between records
- Updates to the record forwarded to the master
region - Ensures consistent ordering of updates
- Tablet-level mastering
- Each tablet is assigned a master region
- Inserts and deletes of records forwarded to the
master region - Master region decides tablet splits
- These details are hidden from the application
- Except for the latency impact!
136 Mastering
(Figure: three regional copies of the same tablet. Each record carries its per-record master region (A 42342 E, B 42521 W, C 66354 W, D 12352 E, E 75656 C, F 15677 E), and one of the regions also holds the tablet master.)
137Bulk Insert/Update/Replace
- Client feeds records to bulk manager
- Bulk loader transfers records to SUs in batches
- Bypass routers and message brokers
- Efficient import into storage unit
Client
Bulk manager
Source Data
138Bulk Load in YDOT
- YDOT bulk inserts can cause performance hotspots
- Solution: preallocate tablets
139Index Maintenance
- How to have lots of interesting indexes and views without killing performance?
- Solution: asynchrony!
- Indexes and views are updated asynchronously when the base table is updated
140SHERPAIN CONTEXT
140
141 Types of Record Stores
- S3: simple; object retrieval
- PNUTS: retrieval from a single table of objects/records
- Oracle: feature rich; SQL
142 Types of Record Stores
- S3: best effort; eventual consistency; object-centric consistency
- PNUTS: timeline consistency; object-centric consistency
- Oracle: strong guarantees; ACID; program-centric consistency
143 Types of Record Stores
- PNUTS, CouchDB: flexibility, schema evolution; object-centric consistency
- Oracle: optimized for fixed schemas; consistency spans objects
144 Types of Record Stores
- Elasticity (ability to add resources on demand)
- Oracle: inelastic, or limited (via data distribution)
- PNUTS, S3: elastic; VLSD (very large scale distribution/replication)
145 Data Stores Comparison (versus PNUTS)
- User-partitioned SQL stores (Microsoft Azure SDS, Amazon SimpleDB): more expressive queries, but users must control partitioning and elasticity is limited
- Multi-tenant application databases (Salesforce.com, Oracle on Demand): highly optimized for complex workloads, but limited flexibility for evolving applications, and they inherit the limitations of the underlying data management system
- Mutable object stores (Amazon S3): object storage versus record management
146 Application Design Space
(Figure: systems plotted along two axes: "get a few things" vs. "scan everything", and records vs. files. Sherpa, MObStor, YMDB, MySQL, Oracle, Filer and BigTable sit toward the "get a few things" end; Hadoop and Everest sit toward "scan everything".)
147 Alternatives Matrix
(Figure: a matrix rating Sherpa, Y! UDB, MySQL, Oracle, HDFS, BigTable, Dynamo and Cassandra on: consistency model, structured access, global low latency, SQL/ACID, availability, operability, updates, and elasticity.)
148QUESTIONS?
148
149Hadoop
150Problem
- How do you scale up applications?
- Run jobs processing 100s of terabytes of data
- Takes 11 days to read on 1 computer
- Need lots of cheap computers
- Fixes speed problem (15 minutes on 1000
computers), but - Reliability problems
- In large clusters, computers fail every day
- Cluster size is not fixed
- Need common infrastructure
- Must be efficient and reliable
151Solution
- Open Source Apache Project
- Hadoop Core includes
- Distributed File System - distributes data
- Map/Reduce - distributes application
- Written in Java
- Runs on
- Linux, Mac OS/X, Windows, and Solaris
- Commodity hardware
152 Hardware Cluster of Hadoop
- Typically a 2-level architecture
- Nodes are commodity PCs
- 40 nodes/rack
- Uplink from each rack is 8 gigabit
- Rack-internal is 1 gigabit
153Distributed File System
- Single namespace for entire cluster
- Managed by a single namenode.
- Files are single-writer and append-only.
- Optimized for streaming reads of large files.
- Files are broken into large blocks.
- Typically 128 MB
- Replicated to several datanodes, for reliability
- Access from Java, C, or command line.
154Block Placement
- Default is 3 replicas, but settable
- Blocks are placed (writes are pipelined)
- On same node
- On different rack
- On the other rack
- Clients read from closest replica
- If the replication for a block drops below
target, it is automatically re-replicated.
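A sketch of the placement policy described above, assuming the first replica goes to the writer's node and the remaining replicas go to a single remote rack (the real policy lives inside the HDFS namenode; the names here are illustrative).

    import random

    def place_replicas(writer_node, writer_rack, nodes_by_rack, replication=3):
        targets = [writer_node]                             # 1st copy: the writer's own node
        other_racks = [r for r in nodes_by_rack if r != writer_rack]
        remote_rack = random.choice(other_racks)            # pick one different rack
        candidates = [n for n in nodes_by_rack[remote_rack] if n not in targets]
        targets += random.sample(candidates, min(replication - 1, len(candidates)))
        return targets

    racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4", "n5"]}
    print(place_replicas("n1", "rack1", racks))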
155How is Yahoo using Hadoop?
- Started with building better applications
- Scale up web scale batch applications (search,
ads, ) - Factor out common code from existing systems, so
new applications will be easier to write - Manage the many clusters
156Running Production WebMap
- Search needs a graph of the known web
- Invert edges, compute link text, whole graph
heuristics - Periodic batch job using Map/Reduce
- Uses a chain of 100 map/reduce jobs
- Scale
- 1 trillion edges in graph
- Largest shuffle is 450 TB
- Final output is 300 TB compressed
- Runs on 10,000 cores
- Raw disk used 5 PB
157Terabyte Sort Benchmark
- Started by Jim Gray at Microsoft in 1998
- Sorting 10 billion 100 byte records
- Hadoop won the general category in 209 seconds
- 910 nodes
- 2 quad-core Xeons @ 2.0GHz / node
- 4 SATA disks / node
- 8 GB RAM / node
- 1 gigabit Ethernet / node
- 40 nodes / rack
- 8 gigabit Ethernet uplink / rack
- Previous record was 297 seconds
158Hadoop clusters
- We have 20,000 machines running Hadoop
- Our largest clusters are currently 2000 nodes
- Several petabytes of user data (compressed,
unreplicated) - We run hundreds of thousands of jobs every month
159Research Cluster Usage
160Who Uses Hadoop?
- Amazon/A9
- AOL
- Facebook
- Fox interactive media
- Google / IBM
- New York Times
- PowerSet (now Microsoft)
- Quantcast
- Rackspace/Mailtrust
- Veoh
- Yahoo!
- More at http://wiki.apache.org/hadoop/PoweredBy
161 Q&A
- For more information
- Website: http://hadoop.apache.org/core
- Mailing lists
- core-dev@hadoop.apache.org
- core-user@hadoop.apache.org
162 Outline
- Cloud computing overview
- Google cloud computing techniques: GFS, Bigtable and MapReduce
- Yahoo cloud computing technique: Hadoop
- Summary of cloud data management systems
163 Summary of Applications
- BigTable, HBase, HyperTable, Hive, HadoopDB: data analysis, Internet services, private clouds
- PNUTS: web applications, and operations that can tolerate relaxed consistency
164 Architecture
- MapReduce-based: BigTable, HBase, Hypertable, Hive
- Strengths: scalability, fault tolerance, ability to run in a heterogeneous environment, data replication in the file system
- Weakness: a lot of work to do to support SQL
- DBMS-based: SQL Azure, PNUTS, Voldemort
- Strengths: easy to support SQL, easy to utilize indexes and optimization methods, data replication on top of the DBMS
- Weakness: data storage can become a bottleneck
- Hybrid of MapReduce and DBMS: HadoopDB
- Sounds good, but what about performance?
165 Consistency
- Two kinds of consistency
- strong consistency: ACID (Atomicity, Consistency, Isolation, Durability)
- weak consistency: BASE (Basically Available, Soft-state, Eventual consistency)
(Figure: the systems placed on the CAP triangle (Consistency, Availability, Partition tolerance): BigTable, HBase, Hive, Hypertable and HadoopDB on one side; PNUTS on another; SQL Azure marked with a question mark.)
166A tailor
RDBMS
LOCK
ACID
SAFETY
TRANSACTION
3NF
167 Further Reading
- Efficient Bulk Insertion into a Distributed Ordered Table. Adam Silberstein, Brian Cooper, Utkarsh Srivastava, Erik Vee, Ramana Yerneni, Raghu Ramakrishnan. SIGMOD 2008.
- PNUTS: Yahoo!'s Hosted Data Serving Platform. Brian Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Phil Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver, Ramana Yerneni. VLDB 2008.
- Asynchronous View Maintenance for VLSD Databases. Parag Agrawal, Adam Silberstein, Brian F. Cooper, Utkarsh Srivastava, Raghu Ramakrishnan. SIGMOD 2009.
- Cloud Storage Design in a PNUTShell. Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava. In Beautiful Data, O'Reilly Media, 2009.
168Further Reading
- F. Chang et al. Bigtable: A distributed storage system for structured data. In OSDI, 2006.
- J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004.
- G. DeCandia et al. Dynamo: Amazon's highly available key-value store. In SOSP, 2007.
- S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google File System. In Proc. SOSP, 2003.
- D. Kossmann. The state of the art in distributed query processing. ACM Computing Surveys, 32(4):422-469, 2000.
169 A new textbook on distributed systems and cloud computing (June 2010)
170-174 Textbook table of contents (selected chapters)
- Chapter 1: Introduction
- Chapter 5: Distributed object systems (CORBA): CORBA architecture and services, Java IDL examples
- Chapter 7: Google cloud computing techniques
- Chapter 8: Yahoo cloud techniques: PNUTS, Pig, and ZooKeeper
- Chapter 9: The Aneka cloud platform
- Chapter 10: The Greenplum cloud data management system
- Chapter 11: Amazon Dynamo
- Chapter 13: Hadoop and Map/Reduce programming
- Chapter 14: HBase programming
- Chapter 15: Google Apps and Google App Engine
- Chapter 16: Microsoft Azure (Windows Azure)
- Chapter 17: Amazon EC2 (Elastic Compute Cloud) programming
175QA Thanks