CS 525 Advanced Distributed Systems Spring 09

Transcript and Presenter's Notes

1
CS 525 Advanced Distributed Systems, Spring 09
Indranil Gupta (Indy)
Lecture 4: The Grid. Clouds.
January 29, 2009
2
Two Questions We'll Try to Answer
  • What is the Grid? Basics, no hype.
  • What is its relation to p2p?

3
Example: Rapid Atmospheric Modeling System,
ColoState U
  • Hurricane Georges, 17 days in Sept 1998
  • RAMS modeled the mesoscale convective complex
    that dropped so much rain, in good agreement with
    recorded data
  • Used 5 km spacing instead of the usual 10 km
  • Ran on 256 processors
  • Can one run such a program without access to a
    supercomputer?

4
Distributed Computing Resources
[Figure: compute resources at three sites: Wisconsin, NCSA, MIT]
5
An Application Coded by a Physicist
[Figure: a four-job workflow. Output files of Job 0 are input to Job 2; Jobs 1 and 2 can be concurrent; output files of Job 2 are input to Job 3.]
6
An Application Coded by a Physicist
  • Output files of Job 0 are input to Job 2, and may
    be several GBs
  • A job may take several hours/days
  • 4 stages of a job
  • Init
  • Stage in
  • Execute
  • Stage out
  • Publish
  • Computation intensive, so massively parallel
[Figure: Job 2 in the workflow; its output files are input to Job 3. A sketch of the whole workflow follows below.]
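To make the workflow concrete, here is a minimal Python sketch of the dependencies above, run wave by wave. The job names come from the figure; Job 1 is assumed to also consume Job 0's output (the slide does not say), so that Jobs 1 and 2 come out concurrent.

deps = {
    "job0": [],
    "job1": ["job0"],  # assumed input, so Jobs 1 and 2 can be concurrent
    "job2": ["job0"],  # output files of Job 0 are input to Job 2
    "job3": ["job2"],  # output files of Job 2 are input to Job 3
}

done = set()
while len(done) < len(deps):
    # every job whose inputs have been staged in can run concurrently
    ready = [j for j, d in deps.items()
             if j not in done and all(p in done for p in d)]
    print("run concurrently:", ready)
    done.update(ready)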
7
[Figure: Jobs 0-3 spread across Wisconsin, NCSA, and MIT. Allocation? Scheduling?]
8
[Figure: the Condor protocol manages Jobs 0-3 within Wisconsin, while the Globus protocol connects Wisconsin to NCSA and MIT.]
9
[Figure: jobs at Wisconsin, NCSA, and MIT. The internal structure of the different sites is invisible to Globus; the Globus protocol handles external allocation and scheduling, and stage in / stage out of files.]
10
[Figure: within Wisconsin, the Condor protocol handles internal allocation and scheduling, monitoring, and distribution and publishing of files.]
11
Tiered Architecture (OSI 7-layer-like)
[Figure: layers, top to bottom: high-energy physics apps; resource discovery, replication, brokering; Globus, Condor; workstations, LANs. Opportunity for crossover ideas from p2p systems.]
12
The Grid Today
Some are 40Gbps links! (The TeraGrid links)
A parallel Internet
13
Globus Alliance
  • Alliance involves U. Illinois Chicago, Argonne
    National Laboratory, USC-ISI, U. Edinburgh,
    Swedish Center for Parallel Computers
  • Activities: research, testbeds, software tools,
    applications
  • Globus Toolkit (latest ver - GT3)
  • The Globus Toolkit includes software services
    and libraries for resource monitoring, discovery,
    and management, plus security and file
    management. Its latest version, GT3, is the
    first full-scale implementation of the new Open
    Grid Services Architecture (OGSA).

14
More
  • Entire community, with multiple conferences,
    get-togethers (GGF), and projects
  • Grid Projects
  • http://www-fp.mcs.anl.gov/foster/grid-projects/
  • Grid Users
  • Today: core is the physics community (since the
    Grid originates from the GriPhyN project)
  • Tomorrow: biologists, large-scale computations
    (nug30 already)?

15
Some Things Grid Researchers Consider Important
  • Single sign-on: a collective job set should
    require once-only user authentication
  • Mapping to local security mechanisms: some sites
    use Kerberos, others Unix
  • Delegation: credentials to access resources are
    inherited by subcomputations, e.g., job 0 to job
    1
  • Community authorization: e.g., third-party
    authentication

16
Grid History: 1990s
  • CASA network: linked 4 labs in California and New
    Mexico
  • Paul Messina: massively parallel and vector
    supercomputers for computational chemistry,
    climate modeling, etc.
  • Blanca: linked sites in the Midwest
  • Charlie Catlett, NCSA: multimedia digital
    libraries and remote visualization
  • More testbeds in Germany and Europe than in the US
  • I-Way experiment: linked 11 experimental networks
  • Tom DeFanti (U. Illinois at Chicago) and Rick
    Stevens (ANL) ran, for a week in Nov 1995, a
    national high-speed network infrastructure, with
    60 application demonstrations ranging from
    distributed computing to virtual reality
    collaboration
  • I-Soft: secure sign-on, etc.

17
Trends: Technology
  • Doubling periods: storage 12 mos, bandwidth 9
    mos, and (what law is this?) CPU speed 18 mos
  • Then and Now
  • Bandwidth
  • 1985: mostly 56 Kbps links nationwide
  • 2004: 155 Mbps links widespread
  • Disk capacity
  • Today's PCs have 100 GBs, the same as a 1990
    supercomputer

18
Trends: Users
  • Then and Now
  • Biologists
  • 1990: running small single-molecule simulations
  • 2004: want to calculate structures of complex
    macromolecules, and to screen thousands of drug
    candidates
  • Physicists
  • 2006: CERN's Large Hadron Collider to produce
    10^15 B/year
  • Trends in technology and user requirements:
    independent or symbiotic?

19
Prophecies
  • In 1965, MIT's Fernando Corbató and the other
    designers of the Multics operating system
    envisioned a computer facility operating like a
    power company or water company.
  • Plug your thin client into the computing Utility
  • and Play your favorite Intensive Compute &
    Communicate Application
  • Will this be a reality with the Grid?

20
[Figure: P2P and Grid]
21
Definitions
  • Grid
  • Infrastructure that provides dependable,
    consistent, pervasive, and inexpensive access to
    high-end computational capabilities (1998)
  • A system that coordinates resources not subject
    to centralized control, using open,
    general-purpose protocols to deliver nontrivial
    QoS (2002)
  • P2P
  • Applications that take advantage of resources at
    the edges of the Internet (2000)
  • Decentralized, self-organizing distributed
    systems, in which all or most communication is
    symmetric (2002)

22
Definitions (contd.)
  • 525 ≠ good, legal applications without
    intellectual fodder
  • 525 ≠ clever designs without good, legal
    applications
23
Grid versus P2P - Pick your favorite
24
Applications
  • P2P
  • Some
  • File sharing
  • Number crunching
  • Content distribution
  • Measurements
  • Legal Applications?
  • Consequence
  • Low Complexity
  • Grid
  • Often complex, involving various combinations of
  • Data manipulation
  • Computation
  • Tele-instrumentation
  • Wide range of computational models, e.g.
  • Embarrassingly parallel
  • Tightly coupled
  • Workflow
  • Consequence
  • Complexity often inherent in the application
    itself

26
Scale and Failure
  • P2P
  • V. large numbers of entities
  • Moderate activity
  • E.g., 1-2 TB in Gnutella ('01)
  • Diverse approaches to failure
  • Centralized (SETI)
  • Decentralized and Self-Stabilizing
  • Grid
  • Moderate number of entities
  • 10s institutions, 1000s users
  • Large amounts of activity
  • 4.5 TB/day (D0 experiment)
  • Approaches to failure reflect assumptions
  • E.g., centralized components

FastTrack 4,277,745
iMesh 1,398,532
eDonkey 500,289
DirectConnect 111,454
Blubster 100,266
FileNavigator 14,400
Ares 7,731
(www.slyck.com, 2/19/03)
28
Services and Infrastructure
  • Grid
  • Standard protocols (Global Grid Forum, etc.)
  • De facto standard software (open source Globus
    Toolkit)
  • Shared infrastructure (authentication, discovery,
    resource access, etc.)
  • Consequences
  • Reusable services
  • Large developer and user communities
  • Interoperability and code reuse
  • P2P
  • Each application defines and deploys a completely
    independent infrastructure
  • JXTA, BOINC, XtremWeb?
  • Efforts started to define common APIs, albeit
    with limited scope to date
  • Consequences
  • New (albeit simple) install per application
  • Interoperability and code reuse not achieved

30
Coolness Factor
  • Grid
  • P2P

32
Summary: Grid and P2P
  • 1) Both are concerned with the same general
    problem
  • Resource sharing within virtual communities
  • 2) Both take the same general approach
  • Creation of overlays that need not correspond in
    structure to underlying organizational structures
  • 3) Each has made genuine technical advances, but
    in complementary directions
  • Grid addresses infrastructure but not yet scale
    and failure
  • P2P addresses scale and failure but not yet
    infrastructure
  • 4) Complementary strengths and weaknesses =>
    room for collaboration (Ian Foster at UChicago)

33
Crossover Ideas
  • Some P2P ideas useful in the Grid
  • Resource discovery (DHTs), e.g., how do you make
    filenames more expressive, say to name a computer
    cluster resource?
  • Replication models, for fault-tolerance,
    security, reliability
  • Membership, i.e., which workstations are
    currently available?
  • Churn-resistance, i.e., users log in and out; the
    problem is difficult since a free host gets
    entire computations, not just small files
  • All above are open research directions, waiting
    to be explored!

34
Cloud Computing
  • What's it all about?
  • A First Step

35
Life of Ra (a Research Area)
Where is Grid? Where is cloud computing?
[Figure: popularity of a research area over time. Hype ("Wow!") rises to a first peak at the end of hype ("This is a hot area!"), falls to a first trough ("I told you so!"), then the area ages through Young (low-hanging fruits), Adolescent (interesting problems), Middle Age (solid base, hybrid algorithms), and Old Age (incremental solutions).]
36
How do I identify what stage a research area is
in?
  1. If there are no publications in the research
    area more than 1-2 years old, it is in the Young
    phase.
  2. Pick a paper in the last 1 year published in the
    research area. Read it. If you think that you
    could have come up with the core idea in that
    paper (given all the background etc.), then the
    research area is in its Young phase.
  3. Find the latest published paper that you think
    you could have come up with the idea for. If this
    paper has been cited by one round of papers (but
    these citing papers themselves have not been
    cited), then the research area is in the
    Adolescent phase.
  4. Do Step 3 above, and if you find that the citing
    papers themselves have been cited, and so on,
    then the research area is at least in the Middle
    Age phase.
  5. Pick a paper in the last 1-2 years. If you find
    that there are only incremental developments in
    these latest published papers, and the ideas may
    be innovative but are not yielding large enough
    performance benefits, then the area is mature.
  6. If no one works in the research area, or everyone
    you talk to thinks negatively about the area
    (except perhaps the inventors of the area), then
    the area is dead.

37
What is a cloud?
  • It's a cluster! It's a supercomputer! It's a
    datastore!
  • It's superman!
  • None of the above
  • Cloud = Lots of storage + compute cycles nearby

38
Data-intensive Computing
  • Computation-Intensive Computing
  • Example areas: MPI-based high-performance
    computing, Grids
  • Typically run on supercomputers (e.g., NCSA Blue
    Waters)
  • Data-Intensive Computing
  • Typically store data at datacenters
  • Use compute nodes nearby
  • Compute nodes run computation services
  • In data-intensive computing, the focus shifts
    from computation to the data. Problem areas
    include:
  • Storage
  • Communication bottleneck
  • Moving tasks to data (rather than vice-versa)
  • Security
  • Availability of Data
  • Scalability

39
Distributed Clouds
  • A single-site cloud consists of
  • Compute nodes (split into racks)
  • Switches, connecting the racks
  • Storage (backend) nodes connected to the network
  • Front-end for submitting jobs
  • Services: physical resource set, software
    services
  • A geographically distributed cloud consists of
  • Multiple such sites
  • Each site perhaps with a different structure and
    services

40
Cirrus Cloud at University of Illinois
[Figure: only the internal switches used for data transfers are shown, 1 GbE with 48 ports: two ProCurve switches connecting the storage nodes (8 ports each) and the head node (2 ports). Note: system management, monitoring, and the operator console use a different set of switches, not pictured here.]
41
Example: Cirrus Cloud at U. Illinois
  • 128 servers. Each has
  • 8 cores (total 1024 cores)
  • 16 GB RAM
  • 2 TB disk
  • Backing store of about 250 TB
  • Total storage 0.5 PB
  • Gigabit Networking

42
6 Diverse Sites within Cirrus
  • UIUC: systems research for cloud computing,
    cloud computing applications
  • Karlsruhe Institute of Tech (KIT, Germany):
    Grid-style jobs
  • IDA, Singapore
  • Intel
  • HP
  • Yahoo!: CMU's M45 cluster
  • All will be networked together; see
    http://www.cloudtestbed.org

43
What Services?
  • Different clouds export different services
  • Industrial Clouds
  • Amazon S3 (Simple Storage Service): store
    arbitrary datasets
  • Amazon EC2 (Elastic Compute Cloud): upload and
    run arbitrary images
  • Google AppEngine: develop applications within
    their appengine framework, upload data that will
    be imported into their format, and run
  • Academic Clouds
  • Google-IBM Cloud (U. Washington): run apps
    programmed atop Hadoop
  • Cirrus cloud: run (i) apps programmed atop Hadoop
    and Pig, and (ii) systems-level research on this
    first generation of cloud computing models

44
Software Services
  • Computational
  • MapReduce (Hadoop)
  • Pig Latin
  • Naming and Management
  • Zookeeper
  • Tivoli, OpenView
  • Storage
  • HDFS
  • PNUTS

45
Sample Service MapReduce
  • Google uses MapReduce to run 100K jobs per day,
    processing up to 20 PB of data
  • Yahoo! has released open-source software Hadoop
    that implements MapReduce
  • Other companies that have used MapReduce to
    process their data: A9.com, AOL, Facebook, The
    New York Times
  • Highly-Parallel Data-Processing

46
What is MapReduce?
  • Terms are borrowed from functional languages
    (e.g., Lisp)
  • Sum of squares (a Python analogue follows below):
  • (map square '(1 2 3 4))
  • Output: (1 4 9 16)
  • processes each record sequentially and
    independently
  • (reduce + '(1 4 9 16))
  • (+ 16 (+ 9 (+ 4 1)))
  • Output: 30
  • processes set of all records in a batch
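The same example as a minimal Python sketch; the driver below is ours, purely illustrative:

from functools import reduce

def square(x):
    return x * x

records = [1, 2, 3, 4]

# map: processes each record sequentially and independently
squares = list(map(square, records))         # [1, 4, 9, 16]

# reduce: processes the set of all records in a batch
total = reduce(lambda a, b: a + b, squares)  # 30, like (+ 16 (+ 9 (+ 4 1)))
print(squares, total)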

47
Map
  • Processes an individual key/value pair to
    generate intermediate key/value pairs
Input: <filename, file text>, e.g.,
<doc, "Welcome Everyone Hello Everyone">
Emitted pairs: <Welcome, 1> <Everyone, 1> <Hello, 1>
<Everyone, 1>
48
Reduce
  • Processes and merges all intermediate values
    associated with each given key assigned to it (a
    sketch of the full word count follows below)
Input: <Welcome, 1> <Everyone, 1> <Hello, 1>
<Everyone, 1>
Output: <Everyone, 2> <Hello, 1> <Welcome, 1>
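A self-contained Python sketch of the word-count Map and Reduce above, with a toy in-process "shuffle" standing in for the framework; the function names and driver are illustrative, not Hadoop's actual API:

from collections import defaultdict

def map_fn(filename, file_text):
    # emit <word, 1> for every word in the input split
    for word in file_text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # merge all intermediate values for the given key
    return (word, sum(counts))

# the framework's shuffle step: group map output by key
groups = defaultdict(list)
for key, value in map_fn("doc", "Welcome Everyone Hello Everyone"):
    groups[key].append(value)

for word in sorted(groups):
    print(reduce_fn(word, groups[word]))
# ('Everyone', 2) ('Hello', 1) ('Welcome', 1)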
49
Some Applications
  • Distributed Grep (a sketch follows after this
    list)
  • Map: emits a line if it matches the supplied
    pattern
  • Reduce: copies the intermediate data to output
  • Count of URL access frequency
  • Map: processes web log and outputs <URL, 1>
  • Reduce: emits <URL, total count>
  • Reverse Web-Link Graph
  • Map: processes web log and outputs <target,
    source>
  • Reduce: emits <target, list(source)>
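A hypothetical map/reduce pair for distributed grep in the same Python style; the pattern and the names are assumptions, not from the source:

import re

PATTERN = re.compile(r"ERROR")  # the supplied pattern (an assumption)

def grep_map(_, line):
    # emit a line if it matches the supplied pattern
    if PATTERN.search(line):
        yield (line, "")

def grep_reduce(line, _values):
    # copy the intermediate data to output
    return line

for key, _ in grep_map(None, "2009-01-29 ERROR disk full"):
    print(grep_reduce(key, None))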

50
Programming MapReduce
  • Externally (for the user)
  • Write a Map program (short), write a Reduce
    program (short)
  • Submit job; wait for result
  • Need to know nothing about parallel/distributed
    programming!
  • Internally (for the cloud, and for us distributed
    systems researchers)
  • Parallelize Map
  • Transfer data from Map to Reduce
  • Parallelize Reduce
  • Implement Storage for Map input, Map output,
    Reduce input, and Reduce output

51
Inside MapReduce
  • For the cloud (and for us distributed systems
    researchers)
  • Parallelize Map: easy! Each map task is
    independent of the others!
  • Transfer data from Map to Reduce
  • All Map output records with same key assigned to
    same Reduce task
  • use partitioning function (more soon)
  • Parallelize Reduce: easy! Each reduce task is
    independent of the others!
  • Implement Storage for Map input, Map output,
    Reduce input, and Reduce output
  • Map input: from distributed file system
  • Map output: to local disk (at Map node); uses
    local file system
  • Reduce input: from (multiple) remote disks; uses
    local file systems
  • Reduce output: to distributed file system
  • local file system = Linux FS, etc.
  • distributed file system = GFS (Google File
    System), HDFS (Hadoop Distributed File System)

52
Internal Workings of MapReduce
53
Flow of Data
  • Input slices are typically 16MB to 64MB.
  • Map workers use a partitioning function to store
    each intermediate key/value pair to the local
    disk, e.g., hash(key) mod R (a sketch follows
    below)

[Figure: data flows from input slices through Map workers, via partitioning, to Reduce workers and output files.]
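A minimal Python sketch of such a partitioning function; R and the hash choice are illustrative. A stable hash is used so that every map worker routes a given key to the same reduce task:

import hashlib

R = 4  # number of reduce tasks (illustrative)

def partition(key, r=R):
    # stable across processes, unlike Python's salted built-in hash()
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % r

# every occurrence of a key goes to the same reduce task
assert partition("Everyone") == partition("Everyone")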
54
Fault Tolerance
  • Worker Failure
  • Master keeps 3 states for each worker task
  • (idle, in-progress, completed)
  • Master sends periodic pings to each worker to
    keep track of it (a central failure detector)
  • If a task fails while in-progress, mark it idle
  • If a map worker fails after its tasks completed,
    mark them idle too: their output was on the
    failed machine's local disk
  • Notify the reduce tasks about the map worker
    failure (a sketch follows below)
  • Master Failure
  • Checkpoint
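A minimal Python sketch, with assumed names, of the master's bookkeeping described above:

import time

IDLE, IN_PROGRESS, COMPLETED = "idle", "in-progress", "completed"

class Master:
    def __init__(self, task_ids):
        self.state = {t: IDLE for t in task_ids}  # 3 states per task
        self.last_ping = {}                       # worker -> last ping time

    def on_ping(self, worker):
        # central failure detector: remember when we last heard from worker
        self.last_ping[worker] = time.time()

    def on_worker_failure(self, tasks, is_map):
        for t in tasks:
            # in-progress tasks go back to idle; completed *map* tasks do
            # too, since their output lived on the failed machine's disk
            if self.state[t] == IN_PROGRESS or (is_map and self.state[t] == COMPLETED):
                self.state[t] = IDLE
        # reduce tasks would be notified of the map failure here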

55
Locality and Backup tasks
  • Locality
  • Since cloud has hierarchical topology
  • GFS stores 3 replicas of each of 64MB chunks
  • Maybe on different racks
  • Attempt to schedule a map task on a machine that
    contains a replica of the corresponding input
    data (why?)
  • Stragglers (slow nodes)
  • Due to bad disk, network bandwidth, CPU, or
    memory
  • Perform backup (replicated) execution of the
    straggler task; the task is done when the first
    replica completes (a sketch follows below)
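A toy Python sketch of backup (speculative) execution, using two local threads in place of two workers; a real framework would schedule the replica on a different machine:

from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def run_with_backup(task, *args):
    with ThreadPoolExecutor(max_workers=2) as pool:
        replicas = [pool.submit(task, *args) for _ in range(2)]
        done, not_done = wait(replicas, return_when=FIRST_COMPLETED)
        for f in not_done:
            f.cancel()  # best-effort; the slower replica's result is discarded
        return next(iter(done)).result()

print(run_with_backup(sum, [1, 2, 3]))  # task done when first replica completes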

56
Grep
Testbed: 1800 servers, each with 4GB RAM, dual
2GHz Xeons, dual 160GB IDE disks, and gigabit
Ethernet (about 100 Gbps of aggregate bandwidth)
  • Locality optimization helps
  • 1800 machines read 1 TB at peak 31 GB/s
  • W/out this, rack switches would limit to 10 GB/s
  • Startup overhead is significant for short jobs

Workload: 10^10 100-byte records; extract records
matching a rare pattern (92K matching records)
57
Sort
M = 15,000, R = 4,000
[Figure: three runs compared: normal execution, no
backup tasks, and 200 processes killed]
  • Backup tasks reduce job completion time a lot!
  • System deals well with failures

Workload: 10^10 100-byte records (modeled after the
TeraSort benchmark)
58
Discussion Points
  • Storage: is the local write, remote read model
    good for Map output/Reduce input?
  • What happens on node failure?
  • Entire Reduce phase needs to wait for all Map
    tasks to finish
  • Why? What is the disadvantage?
  • What are the other issues related to our
    challenges?
  • Storage
  • Communication bottleneck
  • Moving tasks to data (rather than vice-versa)
  • Security
  • Availability of Data
  • Scalability
  • Locality within clouds, or across them
  • Inter-cloud/multi-cloud computations
  • Other Programming Models?
  • Based on MapReduce
  • Beyond MapReduce-based ones
  • Concern: do clouds run the risk of going the
    Grid way?

59
P2P and Clouds/Grid
  • Opportunity to use p2p design techniques,
    principles, and algorithms in cloud computing
  • Cloud computing vs. Grid computing: what are the
    differences?

60
Prophecies
Are we there yet?
  • In 1965, MIT's Fernando Corbató and the other
    designers of the Multics operating system
    envisioned a computer facility operating like a
    power company or water company.
  • Plug your thin client into the computing Utility
  • and Play your favorite Intensive Compute &
    Storage & Communicate Application
  • Will this be a reality with the Grid and Clouds?

Are we going towards it?
61
Administrative Announcements
  • Student-led paper presentations (see instructions
    on website)
  • Start from February 12th
  • Groups of up to 2 students each class,
    responsible for a set of 3 Main Papers on a
    topic
  • 45 minute presentations (total) followed by
    discussion
  • Set up appointment with me to show slides by 5 pm
    day prior to presentation
  • List of papers is up on the website
  • Each of the other students (non-presenters) is
    expected to read the papers before class and turn
    in a one to two page review of any two of the
    main set of papers (summary, comments,
    criticisms, and possible future directions)

62
Announcements (contd.)
  • Presentation deadline: form groups by midnight
    of January 31 by dropping by my office hours
    (10:45 am-12 pm, Tu and Th, in 3112 SC)
  • Hurry! Some interesting topics are already taken!
  • I can help you find partners
  • Use the course newsgroup, class.cs525, for
    forming groups and discussion

63
Announcements (contd.)
  • Projects
  • Groups of 2 (need not be same as presentation
    groups)
  • We'll start detailed discussions soon (a few
    classes into the student-led presentations)
  • Please turn in filled-out Student Infosheets
    today or next lecture.

64
Next week
  • No lecture Tuesday, February 3 (no office hours
    either)
  • For the Thursday (February 5) lecture, read the
    Basic Distributed Computing Concepts papers

65
Backup Slides
66
Example: Rapid Atmospheric Modeling System,
ColoState U
  • Weather Prediction is inaccurate
  • Hurricane Georges, 17 days in Sept 1998

68
Next Week Onwards
  • Student-led presentations start
  • Organization of the presentation is up to you
  • Suggested: describe background and motivation for
    the session topic, present an example or two,
    then get into the paper topics
  • Reviews: you have to submit both an email copy
    (which will appear on the course website) and a
    hardcopy (on which I will give you feedback). See
    website for detailed instructions.
  • 1-2 pages only, 2 papers only

69
Refinements and Extensions
  • Local Execution
  • For debugging purposes
  • Users have control over specific Map tasks
  • Status Information
  • Master runs an HTTP server
  • Status page shows the status of computation
  • Link to output file
  • Standard Error list

70
Refinements and Extensions
  • Combiner Function
  • User-defined
  • Done within the map task
  • Saves network bandwidth (a sketch follows below)
  • Skipping Bad Records
  • Best solution is to debug and fix
  • Not always possible: third-party source
    libraries
  • On segmentation fault
  • Send UDP packet to master from signal handler
  • Include sequence number of record being processed
  • If master sees two failures for same record
  • Next worker is told to skip the record
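A Python sketch of a user-defined combiner for the word count example: it aggregates map output locally, within the map task, before anything crosses the network. Names are illustrative:

from collections import Counter

def map_fn(_, text):
    for word in text.split():
        yield (word, 1)

def combiner(pairs):
    # same shape as reduce, but runs within the map task
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return local.items()  # far fewer <word, n> pairs leave the map node

print(list(combiner(map_fn(None, "Welcome Everyone Hello Everyone"))))
# [('Welcome', 1), ('Everyone', 2), ('Hello', 1)]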