Title: CS 525 Advanced Distributed Systems Spring 09
1. CS 525 Advanced Distributed Systems, Spring 09
Indranil Gupta (Indy)
Lecture 4: The Grid. Clouds.
January 29, 2009
2. Two Questions We'll Try to Answer
- What is the Grid? Basics, no hype.
- What is its relation to p2p?
3. Example: Rapid Atmospheric Modeling System, ColoState U
- Hurricane Georges, 17 days in Sept 1998
- RAMS modeled the mesoscale convective complex
that dropped so much rain, in good agreement with
recorded data
- Used 5 km spacing instead of the usual 10 km
- Ran on 256 processors
- Can one run such a program without access to a
supercomputer?
4. Distributed Computing Resources
Wisconsin
NCSA
MIT
5. An Application Coded by a Physicist
Output files of Job 0 Input to Job 2
Job 0
Job 1
Job 2
Jobs 1 and 2 can be concurrent
Output files of Job 2 Input to Job 3
Job 3
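A minimal Python sketch of this dependency structure (job names and the thread-based execution are purely illustrative; a real grid workflow would hand such a DAG to a scheduler): Jobs 1 and 2 start only after Job 0 and can run concurrently, and Job 3 runs last.

```python
# Illustrative sketch of the job graph above: Job 0 -> {Job 1, Job 2} -> Job 3.
from concurrent.futures import ThreadPoolExecutor

def run_job(name):
    print(f"running {name}")  # placeholder for stage-in / execute / stage-out

run_job("job0")                                 # Job 0's output files feed Jobs 1 and 2
with ThreadPoolExecutor() as pool:              # Jobs 1 and 2 have no mutual dependency,
    list(pool.map(run_job, ["job1", "job2"]))   # so they may execute concurrently
run_job("job3")                                 # Job 3 consumes the output files of Job 2
```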
6. An Application Coded by a Physicist
Output files of Job 0 Input to Job 2
Several GBs
- May take several hours/days
- 4 stages of a job
- Init
- Stage in
- Execute
- Stage out
- Publish
- Computation Intensive, so Massively Parallel
Job 2
Output files of Job 2 Input to Job 3
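A hypothetical sketch of these per-job stages as plain Python placeholders (none of these function names belong to a real grid toolkit; they only make the lifecycle on the slide concrete):

```python
# Illustrative lifecycle of one job at a remote grid site; all functions are stubs.
def init(job, site):      print(f"init {job} at {site}")        # allocate resources, working dir
def stage_in(job, site):  print(f"stage in inputs of {job}")    # copy multi-GB input files to the site
def execute(job, site):   print(f"execute {job}")               # the massively parallel computation
def stage_out(job, site): print(f"stage out outputs of {job}")  # copy result files back
def publish(job, site):   print(f"publish outputs of {job}")    # register outputs for downstream jobs

def run_grid_job(job, site):
    for stage in (init, stage_in, execute, stage_out, publish):
        stage(job, site)

run_grid_job("job2", "NCSA")
```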
7. Wisconsin
Job 0
Job 2
Job 1
Job 3
Allocation? Scheduling?
NCSA
MIT
8. Job 0
Wisconsin
Condor Protocol
Job 2
Job 1
Job 3
Globus Protocol
NCSA
MIT
9. Wisconsin
Job 3
Job 0
Internal structure of different sites invisible
to Globus
Globus Protocol
Job 1
NCSA
MIT
Job 2
External: Allocation & Scheduling, Stage in & Stage
out of Files
10. Wisconsin
Condor Protocol
Job 3
Job 0
Internal: Allocation & Scheduling, Monitoring,
Distribution and Publishing of Files
11. Tiered Architecture (OSI 7-layer-like)
High energy Physics apps
Resource discovery, replication, brokering
Globus, Condor
Workstations, LANs
Opportunity for Crossover ideas from p2p systems
12. The Grid Today
Some are 40Gbps links! (The TeraGrid links)
A parallel Internet
13. Globus Alliance
- Alliance involves U. Illinois Chicago, Argonne
National Laboratory, USC-ISI, U. Edinburgh, and the
Swedish Center for Parallel Computers
- Activities: research, testbeds, software tools,
applications
- Globus Toolkit (latest version: GT3)
- The Globus Toolkit includes software services
and libraries for resource monitoring, discovery,
and management, plus security and file
management. Its latest version, GT3, is the
first full-scale implementation of the new Open
Grid Services Architecture (OGSA).
14. More
- Entire community, with multiple conferences,
get-togethers (GGF), and projects
- Grid Projects
- http://www-fp.mcs.anl.gov/foster/grid-projects/
- Grid Users
- Today: the core is the physics community (since the
Grid originates from the GriPhyN project)
- Tomorrow: biologists, large-scale computations
(nug30 already)?
15. Some Things Grid Researchers Consider Important
- Single sign-on: a collective job set should require
once-only user authentication
- Mapping to local security mechanisms: some sites
use Kerberos, others use Unix
- Delegation: credentials to access resources are
inherited by subcomputations, e.g., from job 0 to job 1
- Community authorization, e.g., third-party
authentication
16. Grid History: 1990s
- CASA network: linked 4 labs in California and New
Mexico
- Paul Messina: massively parallel and vector
supercomputers for computational chemistry,
climate modeling, etc.
- Blanca: linked sites in the Midwest
- Charlie Catlett, NCSA: multimedia digital
libraries and remote visualization
- More testbeds in Germany and Europe than in the US
- I-WAY experiment: linked 11 experimental networks
- Tom DeFanti, U. Illinois at Chicago, and Rick
Stevens, ANL: for a week in Nov 1995, a national
high-speed network infrastructure; 60 application
demonstrations, from distributed computing to
virtual reality collaboration
- I-Soft: secure sign-on, etc.
17. Trends: Technology
- Doubling periods: storage 12 mos, bandwidth 9
mos, and (what law is this?) CPU speed 18 mos
- Then and Now
- Bandwidth
- 1985: mostly 56 Kbps links nationwide
- 2004: 155 Mbps links widespread
- Disk capacity
- Today's PCs have 100 GBs, the same as a 1990
supercomputer
18. Trends: Users
- Then and Now
- Biologists
- 1990: were running small single-molecule
simulations
- 2004: want to calculate structures of complex
macromolecules, want to screen thousands of drug
candidates
- Physicists
- 2006: CERN's Large Hadron Collider produced 10^15
B/year
- Trends in Technology and User Requirements:
Independent or Symbiotic?
19. Prophecies
- In 1965, MIT's Fernando Corbató and the other
designers of the Multics operating system
envisioned a computer facility operating like a
power company or water company.
- Plug your thin client into the computing Utility,
and Play your favorite Intensive Compute &
Communicate Application
- Will this be a reality with the Grid?
20. P2P
Grid
21. Definitions
- Infrastructure that provides dependable,
consistent, pervasive, and inexpensive access to
high-end computational capabilities (1998)
- A system that coordinates resources not subject
to centralized control, using open,
general-purpose protocols to deliver nontrivial
QoS (2002)
- Applications that take advantage of resources
at the edges of the Internet (2000)
- Decentralized, self-organizing distributed
systems, in which all or most communication is
symmetric (2002)
22. Definitions
- Infrastructure that provides dependable,
consistent, pervasive, and inexpensive access to
high-end computational capabilities (1998)
- A system that coordinates resources not subject
to centralized control, using open,
general-purpose protocols to deliver nontrivial
QoS (2002)
- Applications that take advantage of resources
at the edges of the Internet (2000)
- Decentralized, self-organizing distributed
systems, in which all or most communication is
symmetric (2002)
525 (good legal applications without
intellectual fodder)
525 (clever designs without good, legal
applications)
23. Grid versus P2P - Pick your favorite
24. Applications
- P2P
- Some
- File sharing
- Number crunching
- Content distribution
- Measurements
- Legal Applications?
- Consequence
- Low Complexity
- Grid
- Often complex, involving various combinations of
- Data manipulation
- Computation
- Tele-instrumentation
- Wide range of computational models, e.g.,
- Embarrassingly parallel
- Tightly coupled
- Workflow
- Consequence
- Complexity often inherent in the application
itself
26. Scale and Failure
- P2P
- V. large numbers of entities
- Moderate activity
- E.g., 1-2 TB in Gnutella (01)
- Diverse approaches to failure
- Centralized (SETI)
- Decentralized and Self-Stabilizing
- Grid
- Moderate number of entities
- 10s institutions, 1000s users
- Large amounts of activity
- 4.5 TB/day (D0 experiment)
- Approaches to failure reflect assumptions
- E.g., centralized components
Users per P2P network (www.slyck.com, 2/19/03):
FastTrack 4,277,745
iMesh 1,398,532
eDonkey 500,289
DirectConnect 111,454
Blubster 100,266
FileNavigator 14,400
Ares 7,731
28. Services and Infrastructure
- Grid
- Standard protocols (Global Grid Forum, etc.)
- De facto standard software (open source Globus
Toolkit)
- Shared infrastructure (authentication, discovery,
resource access, etc.)
- Consequences
- Reusable services
- Large developer and user communities
- Interoperability and code reuse
- P2P
- Each application defines and deploys a completely
independent infrastructure
- JXTA, BOINC, XtremWeb?
- Efforts started to define common APIs, albeit
with limited scope to date
- Consequences
- New (albeit simple) install per application
- Interoperability and code reuse not achieved
30. Coolness Factor
31. Coolness Factor
32. Summary: Grid and P2P
- 1) Both are concerned with the same general
problem
- Resource sharing within virtual communities
- 2) Both take the same general approach
- Creation of overlays that need not correspond in
structure to underlying organizational structures
- 3) Each has made genuine technical advances, but
in complementary directions
- Grid addresses infrastructure but not yet scale
and failure
- P2P addresses scale and failure but not yet
infrastructure
- 4) Complementary strengths and weaknesses -> room
for collaboration (Ian Foster at UChicago)
33. Crossover Ideas
- Some P2P ideas useful in the Grid
- Resource discovery (DHTs), e.g., how do you make
filenames more expressive, i.e., name a computer
cluster resource?
- Replication models, for fault-tolerance,
security, reliability
- Membership, i.e., which workstations are
currently available?
- Churn-resistance, i.e., users log in and out; the
problem is difficult since a free host gets entire
computations, not just small files
- All the above are open research directions, waiting
to be explored!
34. Cloud Computing
- What's it all about?
- A First Step
35. Life of Ra (a Research Area)
Where is the Grid? Where is cloud computing?
[Figure: popularity of an area plotted over time: hype ("Wow!"), first peak at
the end of the hype ("This is a hot area!"), first trough ("I told you so!"),
then a gradual climb]
Phases: Young (interesting problems), Adolescent (low-hanging fruits),
Middle Age (solid base, hybrid algorithms), Old Age (incremental solutions)
36. How do I identify what stage a research area is in?
- 1) If there are no publications in the research
area older than 1-2 years, it is in the Young
phase.
- 2) Pick a paper published in the research area in the
last year. Read it. If you think that you
could have come up with the core idea in that
paper (given all the background etc.), then the
research area is in its Young phase.
- 3) Find the latest published paper that you think
you could have come up with the idea for. If this
paper has been cited by one round of papers (but
these citing papers themselves have not been
cited), then the research area is in the
Adolescent phase.
- 4) Do Step 3 above, and if you find that the citing
papers themselves have been cited, and so on,
then the research area is at least in the Middle
Age phase.
- 5) Pick a paper from the last 1-2 years. If you find
that there are only incremental developments in
these latest published papers, and the ideas may
be innovative but are not yielding large enough
performance benefits, then the area is mature.
- 6) If no one works in the research area, or everyone
you talk to thinks negatively about the area
(except perhaps the inventors of the area), then
the area is dead.
37. What is a cloud?
- It's a cluster! It's a supercomputer! It's a
datastore!
- It's Superman!
- None of the above
- Cloud = lots of storage + compute cycles nearby
38. Data-intensive Computing
- Computation-Intensive Computing
- Example areas: MPI-based, high-performance
computing, Grids
- Typically run on supercomputers (e.g., NCSA Blue
Waters)
- Data-Intensive Computing
- Typically store data at datacenters
- Use compute nodes nearby
- Compute nodes run computation services
- In data-intensive computing, the focus shifts
from computation to the data; problem areas
include
- Storage
- Communication bottleneck
- Moving tasks to data (rather than vice-versa)
- Security
- Availability of data
- Scalability
39. Distributed Clouds
- A single-site cloud consists of
- Compute nodes (split into racks)
- Switches, connecting the racks
- Storage (backend) nodes connected to the network
- Front-end for submitting jobs
- Services: physical resource set, software
services
- A geographically distributed cloud consists of
- Multiple such sites
- Each site perhaps with a different structure and
services
40. Cirrus Cloud at University of Illinois
[Diagram: only the internal switches used for data transfers are shown (1 GbE,
48 ports each). Storage nodes and the head node connect to a pair of ProCurve
switches via the 8-port and 2-port links in the figure.]
Note: System management, monitoring, and the operator
console will use a different set of switches not
pictured here.
41. Example: Cirrus Cloud at U. Illinois
- 128 servers. Each has:
- 8 cores (total 1024 cores)
- 16 GB RAM
- 2 TB disk
- Backing store of about 250 TB
- Total storage 0.5 PB
- Gigabit Networking
42. 6 Diverse Sites within Cirrus
- UIUC: systems research for cloud computing,
cloud computing applications
- Karlsruhe Institute of Tech (KIT, Germany):
Grid-style jobs
- IDA, Singapore
- Intel
- HP
- Yahoo!: CMU's M45 cluster
- All will be networked together; see
http://www.cloudtestbed.org
43. What Services?
- Different clouds export different services
- Industrial Clouds
- Amazon S3 (Simple Storage Service): store
arbitrary datasets
- Amazon EC2 (Elastic Compute Cloud): upload and
run arbitrary images
- Google AppEngine: develop applications within
their AppEngine framework, upload data that will
be imported into their format, and run
- Academic Clouds
- Google-IBM Cloud (U. Washington): run apps
programmed atop Hadoop
- Cirrus cloud: run (i) apps programmed atop Hadoop
and Pig, and (ii) systems-level research on this
first generation of cloud computing models
44. Software Services
- Computational
- MapReduce (Hadoop)
- Pig Latin
- Naming and Management
- Zookeeper
- Tivoli, OpenView
- Storage
- HDFS
- PNUTS
45. Sample Service: MapReduce
- Google uses MapReduce to run 100K jobs per day,
processing up to 20 PB of data
- Yahoo! has released the open-source software Hadoop,
which implements MapReduce
- Other companies that have used MapReduce to
process their data: A9.com, AOL, Facebook, The
New York Times
- Highly parallel data processing
46. What is MapReduce?
- Terms are borrowed from functional languages
(e.g., Lisp)
- Sum of squares:
- (map square (1 2 3 4))
- Output: (1 4 9 16)
- processes each record sequentially and
independently
- (reduce + (1 4 9 16))
- (+ 16 (+ 9 (+ 4 1)))
- Output: 30
- processes the set of all records in a batch
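The same sum-of-squares example written in Python's functional style, just to make the borrowed terms concrete:

```python
# Sum of squares in Python, mirroring the Lisp example above.
from functools import reduce

squares = list(map(lambda x: x * x, [1, 2, 3, 4]))   # map: each record independently -> [1, 4, 9, 16]
total = reduce(lambda a, b: a + b, squares)          # reduce: fold the whole batch -> 30
print(squares, total)
```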
47. Map
- Processes an individual key/value pair to generate
intermediate key/value pairs.
Input: <filename, file text>
Example input text: Welcome Everyone Hello Everyone
Intermediate pairs: (Welcome, 1) (Everyone, 1) (Hello, 1) (Everyone, 1)
48. Reduce
- Processes and merges all intermediate values
associated with each given key assigned to it
Intermediate pairs: (Welcome, 1) (Everyone, 1) (Hello, 1) (Everyone, 1)
Output: (Everyone, 2) (Hello, 1) (Welcome, 1)
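A toy, single-process Python version of the word-count Map and Reduce shown on the last two slides (this is not the Hadoop API; the grouping loop stands in for the shuffle):

```python
# Toy word-count Map and Reduce matching the example above.
from collections import defaultdict

def map_fn(filename, text):
    for word in text.split():          # each input record handled independently
        yield (word, 1)                # emit one intermediate (word, 1) pair per word

def reduce_fn(word, counts):
    return (word, sum(counts))         # merge all intermediate values for one key

pairs = list(map_fn("doc1", "Welcome Everyone Hello Everyone"))
groups = defaultdict(list)
for word, count in pairs:              # group intermediate pairs by key
    groups[word].append(count)
print(sorted(reduce_fn(w, c) for w, c in groups.items()))
# [('Everyone', 2), ('Hello', 1), ('Welcome', 1)]
```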
49. Some Applications
- Distributed Grep
- Map: emits a line if it matches the supplied
pattern
- Reduce: copies the intermediate data to the
output
- Count of URL access frequency
- Map: processes the web log and outputs <URL, 1>
- Reduce: emits <URL, total count>
- Reverse Web-Link Graph
- Map: processes the web log and outputs <target,
source>
- Reduce: emits <target, list(source)>
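A toy Python sketch of the Distributed Grep pair above (the pattern and the input line are made up for illustration):

```python
# Toy Map and Reduce for distributed grep, following the description above.
import re

PATTERN = re.compile(r"error")         # the supplied pattern (illustrative)

def grep_map(filename, line):
    if PATTERN.search(line):           # Map: emit a line only if it matches
        yield (line, 1)

def grep_reduce(line, values):
    return line                        # Reduce: copy the matching line to the output

for key, _ in grep_map("log1", "disk error on node 7"):
    print(grep_reduce(key, [1]))
```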
50. Programming MapReduce
- Externally: for the user
- Write a Map program (short), write a Reduce
program (short)
- Submit the job; wait for the result
- Need to know nothing about parallel/distributed
programming!
- Internally: for the cloud (and for us distributed
systems researchers)
- Parallelize Map
- Transfer data from Map to Reduce
- Parallelize Reduce
- Implement storage for Map input, Map output,
Reduce input, and Reduce output
51. Inside MapReduce
- For the cloud (and for us distributed systems
researchers)
- Parallelize Map: easy! Each map task is
independent of the others!
- Transfer data from Map to Reduce:
- All Map output records with the same key are
assigned to the same Reduce task
- use a partitioning function (more soon)
- Parallelize Reduce: easy! Each reduce task is
independent of the others!
- Implement storage for Map input, Map output,
Reduce input, and Reduce output
- Map input: from the distributed file system
- Map output: to local disk (at the Map node); uses the
local file system
- Reduce input: from (multiple) remote disks; uses the
local file systems
- Reduce output: to the distributed file system
- local file system = Linux FS, etc.
- distributed file system = GFS (Google File
System), HDFS (Hadoop Distributed File System)
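A minimal, single-process Python sketch of these internal steps; the parallelism and the local/distributed file systems are only indicated in comments, so this illustrates the data flow rather than a real implementation:

```python
# Single-process sketch of the internal steps above; in a real cloud each map
# task and each reduce task would run on a different node.
from collections import defaultdict

def run_mapreduce(map_fn, reduce_fn, inputs, R=2):
    # 1) "Parallelize Map": each input split is independent of the others.
    #    (Map input would come from the distributed FS; Map output would be
    #    written to local disk at each map node.)
    map_outputs = [list(map_fn(k, v)) for k, v in inputs]
    # 2) Transfer data from Map to Reduce: partition by key so that all records
    #    with the same key land at the same reduce task (Hash(key) mod R).
    partitions = [defaultdict(list) for _ in range(R)]
    for records in map_outputs:
        for key, value in records:
            partitions[hash(key) % R][key].append(value)
    # 3) "Parallelize Reduce": reduce tasks are independent of each other too.
    #    (Reduce input is fetched from remote map nodes; Reduce output would
    #    go to the distributed FS.)
    results = []
    for part in partitions:
        for key, values in part.items():
            results.append(reduce_fn(key, values))
    return results

def wc_map(fname, text):
    for w in text.split():
        yield (w, 1)

def wc_reduce(word, counts):
    return (word, sum(counts))

print(run_mapreduce(wc_map, wc_reduce,
                    [("f1", "Welcome Everyone"), ("f2", "Hello Everyone")]))
```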
52. Internal Workings of MapReduce
53. Flow of Data
- Input slices are typically 16 MB to 64 MB.
- Map workers use a partitioning function to store
intermediate key/value pairs on the local disk.
- e.g., Hash(key) mod R
[Diagram: input splits -> Map workers -> partitioning -> Reduce workers ->
output files]
54. Fault Tolerance
- Worker Failure
- Master keeps 3 states for each worker's task
- (idle, in-progress, completed)
- Master sends periodic pings to each worker to
keep track of it (central failure detector)
- If a worker fails while in-progress, mark the task as idle
- If a map worker fails after completing, mark the task as idle
- Notify the reduce tasks about the map worker
failure
- Master Failure
- Checkpoint
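A small Python sketch of the master's bookkeeping described above (task and worker names are hypothetical):

```python
# Sketch of the master's per-task state tracking on worker failure.
IDLE, IN_PROGRESS, COMPLETED = "idle", "in-progress", "completed"

task_state = {"map-0": COMPLETED, "map-1": IN_PROGRESS, "reduce-0": IDLE}
task_worker = {"map-0": "workerA", "map-1": "workerA"}   # who ran / is running what

def on_worker_failure(worker):
    # called when a worker misses the master's periodic pings (central detector)
    for task, w in task_worker.items():
        if w != worker:
            continue
        if task_state[task] == IN_PROGRESS:
            task_state[task] = IDLE      # unfinished work gets rescheduled
        elif task.startswith("map-") and task_state[task] == COMPLETED:
            task_state[task] = IDLE      # its output lived on the failed node's
                                         # local disk, so it must be redone;
                                         # reduce tasks are notified of the change

on_worker_failure("workerA")
print(task_state)   # both map tasks are idle again; reduce-0 is untouched
```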
55. Locality and Backup Tasks
- Locality
- Since the cloud has a hierarchical topology
- GFS stores 3 replicas of each 64 MB chunk
- Maybe on different racks
- Attempt to schedule a map task on a machine that
contains a replica of the corresponding input data
(why?)
- Stragglers (slow nodes)
- Due to a bad disk, network bandwidth, CPU, or
memory
- Perform backup (replicated) execution of the
straggler task; the task is done when the first replica
completes
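A toy Python illustration of backup (speculative) execution, where a thread stands in for a worker machine: two replicas of the straggler task are launched, and whichever copy finishes first wins.

```python
# Toy sketch of backup execution of a straggler task.
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait
import random
import time

def run_task(attempt):
    time.sleep(random.uniform(0.1, 1.0))       # stand-in for a possibly slow machine
    return f"task finished by the {attempt} attempt"

with ThreadPoolExecutor() as pool:
    primary = pool.submit(run_task, "primary")
    backup = pool.submit(run_task, "backup")   # replicated execution of the straggler
    done, _ = wait([primary, backup], return_when=FIRST_COMPLETED)
    print(done.pop().result())                 # done when the first replica completes
```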
56. Grep
Testbed: 1800 servers, each with 4 GB RAM, dual
2 GHz Xeons, dual 160 GB IDE disks, Gigabit
Ethernet per machine, ~100 Gbps bisection bandwidth
- Locality optimization helps
- 1800 machines read 1 TB at a peak of 31 GB/s
- Without this, rack switches would limit it to 10 GB/s
- Startup overhead is significant for short jobs
Workload: 10^10 100-byte records; extract
records matching a rare pattern (92K matching
records)
57. Sort
M = 15,000; R = 4,000
[Three runs compared: normal, no backup tasks, 200 processes killed]
- Backup tasks reduce job completion time a lot!
- System deals well with failures
Workload: 10^10 100-byte records (modeled after the
TeraSort benchmark)
58. Discussion Points
- Storage: Is the local-write/remote-read model
good for Map output/Reduce input?
- What happens on node failure?
- The entire Reduce phase needs to wait for all Map
tasks to finish
- Why? What is the disadvantage?
- What are the other issues related to our
challenges?
- Storage
- Communication bottleneck
- Moving tasks to data (rather than vice-versa)
- Security
- Availability of data
- Scalability
- Locality within clouds, or across them
- Inter-cloud/multi-cloud computations
- Other programming models?
- Based on MapReduce
- Beyond MapReduce-based ones
- Concern: Do clouds run the risk of going the
Grid way?
59. P2P and Clouds/Grid
- Opportunity to use p2p design techniques,
principles, and algorithms in cloud computing
- Cloud computing vs. Grid computing: what are the
differences?
60. Prophecies
Are we there yet?
- In 1965, MIT's Fernando Corbató and the other
designers of the Multics operating system
envisioned a computer facility operating like a
power company or water company.
- Plug your thin client into the computing Utility,
and Play your favorite Intensive Compute,
Storage & Communicate Application
- Will this be a reality with the Grid and Clouds?
Are we going towards it?
61. Administrative Announcements
- Student-led paper presentations (see instructions
on the website)
- Start from February 12th
- Groups of up to 2 students per class,
responsible for a set of 3 Main Papers on a
topic
- 45-minute presentations (total) followed by
discussion
- Set up an appointment with me to show slides by 5 pm
the day prior to the presentation
- The list of papers is up on the website
- Each of the other students (non-presenters) is
expected to read the papers before class and turn
in a one- to two-page review of any two of the
main set of papers (summary, comments, criticisms,
and possible future directions)
62. Announcements (contd.)
- Presentation deadline: form groups by midnight of
January 31 by dropping by my office hours (10.45
am - 12 pm, Tu/Th in 3112 SC)
- Hurry! Some interesting topics are already taken!
- I can help you find partners
- Use the course newsgroup for forming groups and
discussion: class.cs525
63. Announcements (contd.)
- Projects
- Groups of 2 (need not be the same as presentation
groups)
- We'll start detailed discussions soon (a few
classes into the student-led presentations)
- Please turn in filled-out Student Infosheets
today or next lecture.
64. Next Week
- No lecture Tuesday, February 3 (no office hours
either)
- For Thursday's (February 5) lecture, read the Basic
Distributed Computing Concepts papers
65. Backup Slides
66. Example: Rapid Atmospheric Modeling System, ColoState U
- Weather Prediction is inaccurate
- Hurricane Georges, 17 days in Sept 1998
68. Next Week Onwards
- Student-led presentations start
- Organization of the presentation is up to you
- Suggested: describe the background and motivation for
the session topic, present an example or two,
then get into the paper topics
- Reviews: you have to submit both an email copy
(which will appear on the course website) and a
hardcopy (on which I will give you feedback). See
the website for detailed instructions.
- 1-2 pages only, 2 papers only
69. Refinements and Extensions
- Local Execution
- For debugging purposes
- Users have control over specific Map tasks
- Status Information
- Master runs an HTTP server
- Status page shows the status of the computation
- Link to the output file
- Standard error list
70. Refinements and Extensions
- Combiner Function
- User-defined
- Done within the map task
- Saves network bandwidth (see the sketch after this
slide)
- Skipping Bad Records
- The best solution is to debug & fix
- Not always possible: third-party source
libraries
- On a segmentation fault:
- Send a UDP packet to the master from the signal handler
- Include the sequence number of the record being processed
- If the master sees two failures for the same record:
- The next worker is told to skip the record
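A toy Python sketch of a combiner for the word-count example used earlier; local pre-aggregation inside the map task is what saves network bandwidth (this is not Hadoop's Combiner interface, just the idea):

```python
# Toy combiner: pre-aggregate (word, 1) pairs inside the map task so fewer
# intermediate records cross the network to the reducers.
from collections import Counter

def map_with_combiner(filename, text):
    # user-defined combine step, done within the map task: the two
    # ('Everyone', 1) pairs collapse into ('Everyone', 2) before being sent
    local_counts = Counter(text.split())
    return list(local_counts.items())

print(map_with_combiner("doc1", "Welcome Everyone Hello Everyone"))
```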