Title: CS 525 Advanced Distributed Systems Spring 09
1. CS 525 Advanced Distributed Systems, Spring 09
Indranil Gupta (Indy)
Lecture 4: The Grid. Clouds.
January 29, 2009
2. Two Questions We'll Try to Answer
- What is the Grid? Basics, no hype.
- What is its relation to p2p?
3. Example: Rapid Atmospheric Modeling System, ColoState U
- Hurricane Georges, 17 days in Sept 1998
- RAMS modeled the mesoscale convective complex
that dropped so much rain, in good agreement with
recorded data
- Used 5 km spacing instead of the usual 10 km
- Ran on 256 processors
- Can one run such a program without access to a
supercomputer?
4. Distributed Computing Resources
Wisconsin
NCSA
MIT
5. An Application Coded by a Physicist
Output files of Job 0 Input to Job 2
Job 0
Job 1
Job 2
Jobs 1 and 2 can be concurrent
Output files of Job 2 Input to Job 3
Job 3
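A minimal Python sketch of this dependency structure (job names and the thread-based execution are purely illustrative; a real grid workflow would hand such a DAG to a scheduler): Jobs 1 and 2 start only after Job 0 and can run concurrently, and Job 3 runs last.

```python
# Illustrative sketch of the job graph above: Job 0 -> {Job 1, Job 2} -> Job 3.
from concurrent.futures import ThreadPoolExecutor

def run_job(name):
    print(f"running {name}")  # placeholder for stage-in / execute / stage-out

run_job("job0")                                 # Job 0's output files feed Jobs 1 and 2
with ThreadPoolExecutor() as pool:              # Jobs 1 and 2 have no mutual dependency,
    list(pool.map(run_job, ["job1", "job2"]))   # so they may execute concurrently
run_job("job3")                                 # Job 3 consumes the output files of Job 2
```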
6. An Application Coded by a Physicist
Output files of Job 0 Input to Job 2
Several GBs
- May take several hours/days
- 4 stages of a job
- Init
- Stage in
- Execute
- Stage out
- Publish
- Computation Intensive, so Massively Parallel
Job 2
Output files of Job 2 Input to Job 3
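A hypothetical sketch of these per-job stages as plain Python placeholders (none of these function names belong to a real grid toolkit; they only make the lifecycle on the slide concrete):

```python
# Illustrative lifecycle of one job at a remote grid site; all functions are stubs.
def init(job, site):      print(f"init {job} at {site}")        # allocate resources, working dir
def stage_in(job, site):  print(f"stage in inputs of {job}")    # copy multi-GB input files to the site
def execute(job, site):   print(f"execute {job}")               # the massively parallel computation
def stage_out(job, site): print(f"stage out outputs of {job}")  # copy result files back
def publish(job, site):   print(f"publish outputs of {job}")    # register outputs for downstream jobs

def run_grid_job(job, site):
    for stage in (init, stage_in, execute, stage_out, publish):
        stage(job, site)

run_grid_job("job2", "NCSA")
```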
7. Wisconsin
Job 0
Job 2
Job 1
Job 3
Allocation? Scheduling?
NCSA
MIT
8. Job 0
Wisconsin
Condor Protocol
Job 2
Job 1
Job 3
Globus Protocol
NCSA
MIT
9. Wisconsin
Job 3
Job 0
Internal structure of different sites invisible
to Globus
Globus Protocol
Job 1
NCSA
MIT
Job 2
External: Allocation & Scheduling, Stage in & Stage
out of Files
10. Wisconsin
Condor Protocol
Job 3
Job 0
Internal: Allocation & Scheduling, Monitoring,
Distribution and Publishing of Files
11. Tiered Architecture (OSI 7-layer-like)
High energy Physics apps
Resource discovery, replication, brokering
Globus, Condor
Workstations, LANs
Opportunity for Crossover ideas from p2p systems
12. The Grid Today
Some are 40Gbps links! (The TeraGrid links)
A parallel Internet
13. Globus Alliance
- Alliance involves U. Illinois Chicago, Argonne
National Laboratory, USC-ISI, U. Edinburgh, and the
Swedish Center for Parallel Computers
- Activities: research, testbeds, software tools,
applications
- Globus Toolkit (latest version: GT3)
- The Globus Toolkit includes software services
and libraries for resource monitoring, discovery,
and management, plus security and file
management. Its latest version, GT3, is the
first full-scale implementation of the new Open
Grid Services Architecture (OGSA).
14. More
- Entire community, with multiple conferences,
get-togethers (GGF), and projects
- Grid Projects
- http://www-fp.mcs.anl.gov/foster/grid-projects/
- Grid Users
- Today: the core is the physics community (since the
Grid originates from the GriPhyN project)
- Tomorrow: biologists, large-scale computations
(nug30 already)?
15. Some Things Grid Researchers Consider Important
- Single sign-on: a collective job set should require
once-only user authentication
- Mapping to local security mechanisms: some sites
use Kerberos, others use Unix
- Delegation: credentials to access resources are
inherited by subcomputations, e.g., from job 0 to job 1
- Community authorization, e.g., third-party
authentication
16. Grid History: 1990s
- CASA network: linked 4 labs in California and New
Mexico
- Paul Messina: massively parallel and vector
supercomputers for computational chemistry,
climate modeling, etc.
- Blanca: linked sites in the Midwest
- Charlie Catlett, NCSA: multimedia digital
libraries and remote visualization
- More testbeds in Germany and Europe than in the US
- I-WAY experiment: linked 11 experimental networks
- Tom DeFanti, U. Illinois at Chicago, and Rick
Stevens, ANL: for a week in Nov 1995, a national
high-speed network infrastructure; 60 application
demonstrations, from distributed computing to
virtual reality collaboration
- I-Soft: secure sign-on, etc.
17. Trends: Technology
- Doubling periods: storage 12 mos, bandwidth 9
mos, and (what law is this?) CPU speed 18 mos
- Then and Now
- Bandwidth
- 1985: mostly 56 Kbps links nationwide
- 2004: 155 Mbps links widespread
- Disk capacity
- Today's PCs have 100 GBs, the same as a 1990
supercomputer
18. Trends: Users
- Then and Now
- Biologists
- 1990: were running small single-molecule
simulations
- 2004: want to calculate structures of complex
macromolecules, want to screen thousands of drug
candidates
- Physicists
- 2006: CERN's Large Hadron Collider produced 10^15
B/year
- Trends in Technology and User Requirements:
Independent or Symbiotic?
19. Prophecies
- In 1965, MIT's Fernando Corbató and the other
designers of the Multics operating system
envisioned a computer facility operating like a
power company or water company.
- Plug your thin client into the computing Utility,
and Play your favorite Intensive Compute &
Communicate Application
- Will this be a reality with the Grid?
20. P2P
Grid
21. Definitions
- Infrastructure that provides dependable,
consistent, pervasive, and inexpensive access to
high-end computational capabilities (1998)
- A system that coordinates resources not subject
to centralized control, using open,
general-purpose protocols to deliver nontrivial
QoS (2002)
- Applications that take advantage of resources
at the edges of the Internet (2000)
- Decentralized, self-organizing distributed
systems, in which all or most communication is
symmetric (2002)
22. Definitions
- Infrastructure that provides dependable,
consistent, pervasive, and inexpensive access to
high-end computational capabilities (1998)
- A system that coordinates resources not subject
to centralized control, using open,
general-purpose protocols to deliver nontrivial
QoS (2002)
- Applications that take advantage of resources
at the edges of the Internet (2000)
- Decentralized, self-organizing distributed
systems, in which all or most communication is
symmetric (2002)
525 (good legal applications without
intellectual fodder)
525 (clever designs without good, legal
applications)
23. Grid versus P2P - Pick your favorite
24. Applications
- P2P
- Some
- File sharing
- Number crunching
- Content distribution
- Measurements
- Legal Applications?
- Consequence
- Low Complexity
- Grid
- Often complex, involving various combinations of
- Data manipulation
- Computation
- Tele-instrumentation
- Wide range of computational models, e.g.,
- Embarrassingly parallel
- Tightly coupled
- Workflow
- Consequence
- Complexity often inherent in the application
itself
26. Scale and Failure
- P2P
- V. large numbers of entities
- Moderate activity
- E.g., 1-2 TB in Gnutella (01)
- Diverse approaches to failure
- Centralized (SETI)
- Decentralized and Self-Stabilizing
- Grid
- Moderate number of entities
- 10s institutions, 1000s users
- Large amounts of activity
- 4.5 TB/day (D0 experiment)
- Approaches to failure reflect assumptions
- E.g., centralized components
Users per P2P network (www.slyck.com, 2/19/03):
FastTrack 4,277,745
iMesh 1,398,532
eDonkey 500,289
DirectConnect 111,454
Blubster 100,266
FileNavigator 14,400
Ares 7,731
28. Services and Infrastructure
- Grid
- Standard protocols (Global Grid Forum, etc.)
- De facto standard software (open source Globus
Toolkit)
- Shared infrastructure (authentication, discovery,
resource access, etc.)
- Consequences
- Reusable services
- Large developer and user communities
- Interoperability and code reuse
- P2P
- Each application defines and deploys a completely
independent infrastructure
- JXTA, BOINC, XtremWeb?
- Efforts started to define common APIs, albeit
with limited scope to date
- Consequences
- New (albeit simple) install per application
- Interoperability and code reuse not achieved
30. Coolness Factor
31. Coolness Factor
32. Summary: Grid and P2P
- 1) Both are concerned with the same general
problem
- Resource sharing within virtual communities
- 2) Both take the same general approach
- Creation of overlays that need not correspond in
structure to underlying organizational structures
- 3) Each has made genuine technical advances, but
in complementary directions
- Grid addresses infrastructure but not yet scale
and failure
- P2P addresses scale and failure but not yet
infrastructure
- 4) Complementary strengths and weaknesses -> room
for collaboration (Ian Foster at UChicago)
33. Crossover Ideas
- Some P2P ideas useful in the Grid
- Resource discovery (DHTs), e.g., how do you make
filenames more expressive, i.e., name a computer
cluster resource?
- Replication models, for fault-tolerance,
security, reliability
- Membership, i.e., which workstations are
currently available?
- Churn-resistance, i.e., users log in and out; the
problem is difficult since a free host gets entire
computations, not just small files
- All the above are open research directions, waiting
to be explored!
34. Cloud Computing
- What's it all about?
- A First Step
35. Life of Ra (a Research Area)
Where is the Grid? Where is cloud computing?
[Figure: popularity of an area plotted over time: hype ("Wow!"), first peak at
the end of the hype ("This is a hot area!"), first trough ("I told you so!"),
then a gradual climb]
Phases: Young (interesting problems), Adolescent (low-hanging fruits),
Middle Age (solid base, hybrid algorithms), Old Age (incremental solutions)
36. How do I identify what stage a research area is in?
- 1) If there are no publications in the research
area older than 1-2 years, it is in the Young
phase.
- 2) Pick a paper published in the research area in the
last year. Read it. If you think that you
could have come up with the core idea in that
paper (given all the background etc.), then the
research area is in its Young phase.
- 3) Find the latest published paper that you think
you could have come up with the idea for. If this
paper has been cited by one round of papers (but
these citing papers themselves have not been
cited), then the research area is in the
Adolescent phase.
- 4) Do Step 3 above, and if you find that the citing
papers themselves have been cited, and so on,
then the research area is at least in the Middle
Age phase.
- 5) Pick a paper from the last 1-2 years. If you find
that there are only incremental developments in
these latest published papers, and the ideas may
be innovative but are not yielding large enough
performance benefits, then the area is mature.
- 6) If no one works in the research area, or everyone
you talk to thinks negatively about the area
(except perhaps the inventors of the area), then
the area is dead.
37. What is a cloud?
- It's a cluster! It's a supercomputer! It's a
datastore!
- It's Superman!
- None of the above
- Cloud = lots of storage + compute cycles nearby
38. Data-intensive Computing
- Computation-Intensive Computing
- Example areas: MPI-based, high-performance
computing, Grids
- Typically run on supercomputers (e.g., NCSA Blue
Waters)
- Data-Intensive Computing
- Typically store data at datacenters
- Use compute nodes nearby
- Compute nodes run computation services
- In data-intensive computing, the focus shifts
from computation to the data; problem areas
include
- Storage
- Communication bottleneck
- Moving tasks to data (rather than vice-versa)
- Security
- Availability of data
- Scalability
39. Distributed Clouds
- A single-site cloud consists of
- Compute nodes (split into racks)
- Switches, connecting the racks
- Storage (backend) nodes connected to the network
- Front-end for submitting jobs
- Services: physical resource set, software
services
- A geographically distributed cloud consists of
- Multiple such sites
- Each site perhaps with a different structure and
services
40. Cirrus Cloud at University of Illinois
[Diagram: only the internal switches used for data transfers are shown (1 GbE,
48 ports each). Storage nodes and the head node connect to a pair of ProCurve
switches via the 8-port and 2-port links in the figure.]
Note: System management, monitoring, and the operator
console will use a different set of switches not
pictured here.
41. Example: Cirrus Cloud at U. Illinois
- 128 servers. Each has:
- 8 cores (total 1024 cores)
- 16 GB RAM
- 2 TB disk
- Backing store of about 250 TB
- Total storage 0.5 PB
- Gigabit Networking
42. 6 Diverse Sites within Cirrus
- UIUC: systems research for cloud computing,
cloud computing applications
- Karlsruhe Institute of Tech (KIT, Germany):
Grid-style jobs
- IDA, Singapore
- Intel
- HP
- Yahoo!: CMU's M45 cluster
- All will be networked together; see
http://www.cloudtestbed.org
43. What Services?
- Different clouds export different services
- Industrial Clouds
- Amazon S3 (Simple Storage Service): store
arbitrary datasets
- Amazon EC2 (Elastic Compute Cloud): upload and
run arbitrary images
- Google AppEngine: develop applications within
their AppEngine framework, upload data that will
be imported into their format, and run
- Academic Clouds
- Google-IBM Cloud (U. Washington): run apps
programmed atop Hadoop
- Cirrus cloud: run (i) apps programmed atop Hadoop
and Pig, and (ii) systems-level research on this
first generation of cloud computing models
44. Software Services
- Computational
- MapReduce (Hadoop)
- Pig Latin
- Naming and Management
- Zookeeper
- Tivoli, OpenView
- Storage
- HDFS
- PNUTS
45. Sample Service: MapReduce
- Google uses MapReduce to run 100K jobs per day,
processing up to 20 PB of data
- Yahoo! has released the open-source software Hadoop,
which implements MapReduce
- Other companies that have used MapReduce to
process their data: A9.com, AOL, Facebook, The
New York Times
- Highly parallel data processing
46. What is MapReduce?
- Terms are borrowed from functional languages
(e.g., Lisp)
- Sum of squares:
- (map square (1 2 3 4))
- Output: (1 4 9 16)
- processes each record sequentially and
independently
- (reduce + (1 4 9 16))
- (+ 16 (+ 9 (+ 4 1)))
- Output: 30
- processes the set of all records in a batch
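The same sum-of-squares example written in Python's functional style, just to make the borrowed terms concrete:

```python
# Sum of squares in Python, mirroring the Lisp example above.
from functools import reduce

squares = list(map(lambda x: x * x, [1, 2, 3, 4]))   # map: each record independently -> [1, 4, 9, 16]
total = reduce(lambda a, b: a + b, squares)          # reduce: fold the whole batch -> 30
print(squares, total)
```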
47. Map
- Processes an individual key/value pair to generate
intermediate key/value pairs.
Input: <filename, file text>
Example input text: Welcome Everyone Hello Everyone
Intermediate pairs: (Welcome, 1) (Everyone, 1) (Hello, 1) (Everyone, 1)
48. Reduce
- Processes and merges all intermediate values
associated with each given key assigned to it
Intermediate pairs: (Welcome, 1) (Everyone, 1) (Hello, 1) (Everyone, 1)
Output: (Everyone, 2) (Hello, 1) (Welcome, 1)
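A toy, single-process Python version of the word-count Map and Reduce shown on the last two slides (this is not the Hadoop API; the grouping loop stands in for the shuffle):

```python
# Toy word-count Map and Reduce matching the example above.
from collections import defaultdict

def map_fn(filename, text):
    for word in text.split():          # each input record handled independently
        yield (word, 1)                # emit one intermediate (word, 1) pair per word

def reduce_fn(word, counts):
    return (word, sum(counts))         # merge all intermediate values for one key

pairs = list(map_fn("doc1", "Welcome Everyone Hello Everyone"))
groups = defaultdict(list)
for word, count in pairs:              # group intermediate pairs by key
    groups[word].append(count)
print(sorted(reduce_fn(w, c) for w, c in groups.items()))
# [('Everyone', 2), ('Hello', 1), ('Welcome', 1)]
```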
49. Some Applications
- Distributed Grep
- Map: emits a line if it matches the supplied
pattern
- Reduce: copies the intermediate data to the
output
- Count of URL access frequency
- Map: processes the web log and outputs <URL, 1>
- Reduce: emits <URL, total count>
- Reverse Web-Link Graph
- Map: processes the web log and outputs <target,
source>
- Reduce: emits <target, list(source)>
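A toy Python sketch of the Distributed Grep pair above (the pattern and the input line are made up for illustration):

```python
# Toy Map and Reduce for distributed grep, following the description above.
import re

PATTERN = re.compile(r"error")         # the supplied pattern (illustrative)

def grep_map(filename, line):
    if PATTERN.search(line):           # Map: emit a line only if it matches
        yield (line, 1)

def grep_reduce(line, values):
    return line                        # Reduce: copy the matching line to the output

for key, _ in grep_map("log1", "disk error on node 7"):
    print(grep_reduce(key, [1]))
```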
50. Programming MapReduce
- Externally: for the user
- Write a Map program (short), write a Reduce
program (short)
- Submit the job; wait for the result
- Need to know nothing about parallel/distributed
programming!
- Internally: for the cloud (and for us distributed
systems researchers)
- Parallelize Map
- Transfer data from Map to Reduce
- Parallelize Reduce
- Implement storage for Map input, Map output,
Reduce input, and Reduce output
51. Inside MapReduce
- For the cloud (and for us distributed systems
researchers)
- Parallelize Map: easy! Each map task is
independent of the others!
- Transfer data from Map to Reduce:
- All Map output records with the same key are
assigned to the same Reduce task
- use a partitioning function (more soon)
- Parallelize Reduce: easy! Each reduce task is
independent of the others!
- Implement storage for Map input, Map output,
Reduce input, and Reduce output
- Map input: from the distributed file system
- Map output: to local disk (at the Map node); uses the
local file system
- Reduce input: from (multiple) remote disks; uses the
local file systems
- Reduce output: to the distributed file system
- local file system = Linux FS, etc.
- distributed file system = GFS (Google File
System), HDFS (Hadoop Distributed File System)
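A minimal, single-process Python sketch of these internal steps; the parallelism and the local/distributed file systems are only indicated in comments, so this illustrates the data flow rather than a real implementation:

```python
# Single-process sketch of the internal steps above; in a real cloud each map
# task and each reduce task would run on a different node.
from collections import defaultdict

def run_mapreduce(map_fn, reduce_fn, inputs, R=2):
    # 1) "Parallelize Map": each input split is independent of the others.
    #    (Map input would come from the distributed FS; Map output would be
    #    written to local disk at each map node.)
    map_outputs = [list(map_fn(k, v)) for k, v in inputs]
    # 2) Transfer data from Map to Reduce: partition by key so that all records
    #    with the same key land at the same reduce task (Hash(key) mod R).
    partitions = [defaultdict(list) for _ in range(R)]
    for records in map_outputs:
        for key, value in records:
            partitions[hash(key) % R][key].append(value)
    # 3) "Parallelize Reduce": reduce tasks are independent of each other too.
    #    (Reduce input is fetched from remote map nodes; Reduce output would
    #    go to the distributed FS.)
    results = []
    for part in partitions:
        for key, values in part.items():
            results.append(reduce_fn(key, values))
    return results

def wc_map(fname, text):
    for w in text.split():
        yield (w, 1)

def wc_reduce(word, counts):
    return (word, sum(counts))

print(run_mapreduce(wc_map, wc_reduce,
                    [("f1", "Welcome Everyone"), ("f2", "Hello Everyone")]))
```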
52. Internal Workings of MapReduce
53. Flow of Data
- Input slices are typically 16 MB to 64 MB.
- Map workers use a partitioning function to store
intermediate key/value pairs on the local disk.
- e.g., Hash(key) mod R
[Diagram: input splits -> Map workers -> partitioning -> Reduce workers ->
output files]
54. Fault Tolerance
- Worker Failure
- Master keeps 3 states for each worker's task
- (idle, in-progress, completed)
- Master sends periodic pings to each worker to
keep track of it (central failure detector)
- If a worker fails while in-progress, mark the task as idle
- If a map worker fails after completing, mark the task as idle
- Notify the reduce tasks about the map worker
failure
- Master Failure
- Checkpoint
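A small Python sketch of the master's bookkeeping described above (task and worker names are hypothetical):

```python
# Sketch of the master's per-task state tracking on worker failure.
IDLE, IN_PROGRESS, COMPLETED = "idle", "in-progress", "completed"

task_state = {"map-0": COMPLETED, "map-1": IN_PROGRESS, "reduce-0": IDLE}
task_worker = {"map-0": "workerA", "map-1": "workerA"}   # who ran / is running what

def on_worker_failure(worker):
    # called when a worker misses the master's periodic pings (central detector)
    for task, w in task_worker.items():
        if w != worker:
            continue
        if task_state[task] == IN_PROGRESS:
            task_state[task] = IDLE      # unfinished work gets rescheduled
        elif task.startswith("map-") and task_state[task] == COMPLETED:
            task_state[task] = IDLE      # its output lived on the failed node's
                                         # local disk, so it must be redone;
                                         # reduce tasks are notified of the change

on_worker_failure("workerA")
print(task_state)   # both map tasks are idle again; reduce-0 is untouched
```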
55. Locality and Backup Tasks
- Locality
- Since the cloud has a hierarchical topology
- GFS stores 3 replicas of each 64 MB chunk
- Maybe on different racks
- Attempt to schedule a map task on a machine that
contains a replica of the corresponding input data
(why?)
- Stragglers (slow nodes)
- Due to a bad disk, network bandwidth, CPU, or
memory
- Perform backup (replicated) execution of the
straggler task; the task is done when the first replica
completes
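A toy Python illustration of backup (speculative) execution, where a thread stands in for a worker machine: two replicas of the straggler task are launched, and whichever copy finishes first wins.

```python
# Toy sketch of backup execution of a straggler task.
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait
import random
import time

def run_task(attempt):
    time.sleep(random.uniform(0.1, 1.0))       # stand-in for a possibly slow machine
    return f"task finished by the {attempt} attempt"

with ThreadPoolExecutor() as pool:
    primary = pool.submit(run_task, "primary")
    backup = pool.submit(run_task, "backup")   # replicated execution of the straggler
    done, _ = wait([primary, backup], return_when=FIRST_COMPLETED)
    print(done.pop().result())                 # done when the first replica completes
```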
56. Grep
Testbed: 1800 servers, each with 4 GB RAM, dual
2 GHz Xeons, dual 160 GB IDE disks, Gigabit
Ethernet per machine, ~100 Gbps bisection bandwidth
- Locality optimization helps
- 1800 machines read 1 TB at a peak of 31 GB/s
- Without this, rack switches would limit it to 10 GB/s
- Startup overhead is significant for short jobs
Workload: 10^10 100-byte records; extract
records matching a rare pattern (92K matching
records)
57. Sort
M = 15,000; R = 4,000
[Three runs compared: normal, no backup tasks, 200 processes killed]
- Backup tasks reduce job completion time a lot!
- System deals well with failures
Workload: 10^10 100-byte records (modeled after the
TeraSort benchmark)
58. Discussion Points
- Storage: Is the local-write/remote-read model
good for Map output/Reduce input?
- What happens on node failure?
- The entire Reduce phase needs to wait for all Map
tasks to finish
- Why? What is the disadvantage?
- What are the other issues related to our
challenges?
- Storage
- Communication bottleneck
- Moving tasks to data (rather than vice-versa)
- Security
- Availability of data
- Scalability
- Locality within clouds, or across them
- Inter-cloud/multi-cloud computations
- Other programming models?
- Based on MapReduce
- Beyond MapReduce-based ones
- Concern: Do clouds run the risk of going the
Grid way?
59. P2P and Clouds/Grid
- Opportunity to use p2p design techniques,
principles, and algorithms in cloud computing
- Cloud computing vs. Grid computing: what are the
differences?
60. Prophecies
Are we there yet?
- In 1965, MIT's Fernando Corbató and the other
designers of the Multics operating system
envisioned a computer facility operating like a
power company or water company.
- Plug your thin client into the computing Utility,
and Play your favorite Intensive Compute,
Storage & Communicate Application
- Will this be a reality with the Grid and Clouds?
Are we going towards it?
61. Administrative Announcements
- Student-led paper presentations (see instructions
on the website)
- Start from February 12th
- Groups of up to 2 students per class,
responsible for a set of 3 Main Papers on a
topic
- 45-minute presentations (total) followed by
discussion
- Set up an appointment with me to show slides by 5 pm
the day prior to the presentation
- The list of papers is up on the website
- Each of the other students (non-presenters) is
expected to read the papers before class and turn
in a one- to two-page review of any two of the
main set of papers (summary, comments, criticisms,
and possible future directions)
62. Announcements (contd.)
- Presentation deadline: form groups by midnight of
January 31 by dropping by my office hours (10.45
am - 12 pm, Tu/Th in 3112 SC)
- Hurry! Some interesting topics are already taken!
- I can help you find partners
- Use the course newsgroup for forming groups and
discussion: class.cs525
63. Announcements (contd.)
- Projects
- Groups of 2 (need not be the same as presentation
groups)
- We'll start detailed discussions soon (a few
classes into the student-led presentations)
- Please turn in filled-out Student Infosheets
today or next lecture.
64. Next Week
- No lecture Tuesday, February 3 (no office hours
either)
- For Thursday's (February 5) lecture, read the Basic
Distributed Computing Concepts papers
65. Backup Slides
66. Example: Rapid Atmospheric Modeling System, ColoState U
- Weather Prediction is inaccurate
- Hurricane Georges, 17 days in Sept 1998
68. Next Week Onwards
- Student-led presentations start
- Organization of the presentation is up to you
- Suggested: describe the background and motivation for
the session topic, present an example or two,
then get into the paper topics
- Reviews: you have to submit both an email copy
(which will appear on the course website) and a
hardcopy (on which I will give you feedback). See
the website for detailed instructions.
- 1-2 pages only, 2 papers only
69. Refinements and Extensions
- Local Execution
- For debugging purposes
- Users have control over specific Map tasks
- Status Information
- Master runs an HTTP server
- Status page shows the status of the computation
- Link to the output file
- Standard error list
70. Refinements and Extensions
- Combiner Function
- User-defined
- Done within the map task
- Saves network bandwidth (see the sketch after this
slide)
- Skipping Bad Records
- The best solution is to debug & fix
- Not always possible: third-party source
libraries
- On a segmentation fault:
- Send a UDP packet to the master from the signal handler
- Include the sequence number of the record being processed
- If the master sees two failures for the same record:
- The next worker is told to skip the record
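A toy Python sketch of a combiner for the word-count example used earlier; local pre-aggregation inside the map task is what saves network bandwidth (this is not Hadoop's Combiner interface, just the idea):

```python
# Toy combiner: pre-aggregate (word, 1) pairs inside the map task so fewer
# intermediate records cross the network to the reducers.
from collections import Counter

def map_with_combiner(filename, text):
    # user-defined combine step, done within the map task: the two
    # ('Everyone', 1) pairs collapse into ('Everyone', 2) before being sent
    local_counts = Counter(text.split())
    return list(local_counts.items())

print(map_with_combiner("doc1", "Welcome Everyone Hello Everyone"))
```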