Title: Distributed Systems
1. Chapter 11: Advanced Distributed Systems
2. P2P Computing
- Definition 1: a class of applications that take advantage of resources (e.g., storage, cycles, content) available at the edge of the Internet.
  - Edge machines are often turned off, lack permanent IP addresses, etc.
- Definition 2: a class of decentralized, self-organizing distributed systems in which all or most communication is symmetric. (IPTPS '02)
- Lots of other definitions fit in between.
3. Applications: Computing
- Examples: SETI@home, United Devices, Genome@home, and many others
- The approach is suitable for a particular class of problems:
  - Massive parallelism
  - Low bandwidth-to-computation ratio
  - Error tolerance; independence from solving a particular task
- Problems:
  - Centralized
  - How to extend the model to problems that are not massively parallel?
  - Ability to operate in an environment with limited trust and dynamic resources
4. Applications: File Sharing
- The killer application to date
- Too many to list them all: Napster, FastTrack (KaZaA, iMesh), Gnutella (LimeWire, BearShare), Overnet, BitTorrent, etc.
- Decentralized control
- Building a (relatively) reliable data-delivery service from a large, heterogeneous set of unreliable components
(figure: FastTrack (KaZaA), 2003)
5. Applications: Content Streaming
- Streaming: the user plays the data as it arrives
- Examples: PPLive, SplitStream, etc.
6. Many Other P2P Applications
- Backup storage (HiveNet, OceanStore)
- Collaborative environments (Groove Networks)
- Web serving communities (uServ)
- Instant messaging (Yahoo, AOL)
- Anonymous email
- Censorship-resistant publishing systems (Eternity, Freenet)
- Spam filtering
7. Client/Server vs. P2P
8. Client/Server vs. P2P
9. Overlay Network
10. Overlay Network
- An abstract layer built on top of the physical network
- Neighbors in the overlay can be several hops apart in the physical network
- Why do we need overlays?
- Flexibility in:
  - Choosing neighbors
  - Forming and customizing the topology to fit application needs (e.g., short delay, reliability, high bandwidth)
  - Designing communication protocols among nodes
- A way around limitations in legacy networks
11. Abstract P2P Overlay Architecture
12. Network Communications Layer
- Describes the network characteristics of desktop machines connected over the Internet, or of small wireless or sensor-based devices connected in an ad-hoc manner.
13. Overlay Nodes Management Layer
- Covers the management of peers, including discovery of peers and routing algorithms for optimization.
14. Features Management Layer
- Deals with the security, reliability, fault-resiliency, and aggregated-resource-availability aspects of maintaining the robustness of P2P systems.
15. Services-Specific Layer
- Supports the underlying P2P infrastructure and the application-specific components through scheduling of parallel and computation-intensive tasks, and through content and file management.
16. Application-Level Layer
- Concerned with tools, applications, and services that are implemented with specific functionalities on top of the underlying P2P overlay infrastructure.
17. P2P Systems: Simple Model
18. Peer Software Architecture Model
- P2P substrate (key component)
  - Overlay management
    - Construction
    - Maintenance (peer join/leave/fail and network dynamics)
  - Resource management
    - Allocation (storage)
    - Discovery (routing and lookup)
- Substrates can be classified according to the flexibility of placing objects at peers
19. P2P Substrates: Classification
- Structured (tightly controlled; DHT-based)
  - Objects are rigidly assigned to specific peers
  - Looks like a distributed hash table (DHT)
  - Efficient search; guarantee of finding
  - No support for partial-name and keyword queries
  - Maintenance overhead
  - Examples: Chord, CAN, Pastry, Tapestry, Kademlia (Overnet)
- Unstructured (loosely controlled)
  - Objects can be anywhere
  - Supports partial-name and keyword queries
  - Inefficient search; no guarantee of finding
  - Some heuristics exist to enhance performance
  - Examples: Gnutella, KaZaA (super-nodes), GIA
20. Types of P2P Systems
21. Napster (1)
- Sharing of music files
- Lists of files are uploaded to the Napster server
- Queries contain keywords of the required file
- The server returns the IP addresses of user machines holding the file
- File transfer is direct
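The central-index interaction can be sketched in a few lines. Class and method names below are hypothetical illustrations, not Napster's actual protocol:

```python
# Hypothetical sketch of a Napster-style central index.
# Peers register their file lists; queries return peer addresses,
# and the actual file transfer then happens directly between peers.

class CentralIndex:
    """Server-side index: filename -> set of peer addresses."""

    def __init__(self):
        self.index = {}

    def register(self, peer_ip, filenames):
        # Peers upload their file lists when they connect.
        for name in filenames:
            self.index.setdefault(name, set()).add(peer_ip)

    def query(self, keyword):
        # Return peers holding files whose name contains the keyword.
        return {name: peers for name, peers in self.index.items()
                if keyword in name}
```

The server never touches file contents; it only resolves locations, which is exactly why it becomes a scalability bottleneck and a single point of failure.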
22. Napster (2)
- Centralized model
- The Napster server ensures correct results
- The server is only used for finding the location of files
- Scalability bottleneck
- Single point of failure
- Denial-of-service attacks possible
- Lawsuits
23. Gnutella (1)
- Sharing of any type of file
- Decentralized search
- Queries are sent to the neighbor nodes
- Neighbors ask their own neighbors, and so on
- A Time-To-Live (TTL) field on queries limits propagation
- File transfer is direct
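TTL-limited flooding can be sketched as a breadth-first traversal. The topology and names below are made up for illustration; real Gnutella messages also carry hop counts and descriptor IDs:

```python
# Sketch of Gnutella-style query flooding with a TTL.
from collections import deque

def flood_query(graph, origin, holders, ttl):
    """Flood a query from `origin`; return the nodes holding the file
    that the query reached before its TTL expired."""
    seen = {origin}
    frontier = deque([(origin, ttl)])
    found = set()
    while frontier:
        node, t = frontier.popleft()
        if node in holders:
            found.add(node)      # in Gnutella the reply is back-propagated
        if t == 0:
            continue             # TTL exhausted: do not forward further
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, t - 1))
    return found
```

Note how the TTL trades reach for traffic: a file three hops away is invisible to a query sent with TTL 2.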
24-30. Gnutella Network
(figure: animation over a seven-node overlay, nodes 1-7; node 2 searches for file A)
- Steps:
  - Node 2 initiates a search for A (it does not know where A is), using flooding
  - It sends the query message to all its neighbors
  - Neighbors forward the message
  - Nodes that have A initiate a reply message
  - The query reply message is back-propagated along the query path
  - Node 2 gets the replies
  - Node 2 downloads A directly from a node that has it
31. Gnutella (2)
- Decentralized model
- No single point of failure
- Less susceptible to denial of service
- Poor scalability (flooding)
- Cannot ensure correct results
32. KaZaA
- A hybrid of Napster and Gnutella
- Super-peers act as local search hubs
  - Each super-peer is like a constrained Napster server
  - Super-peers are chosen automatically based on capacity and availability
- Lists of files are uploaded to a super-peer
- Super-peers periodically exchange file lists
- Queries are sent to super-peers
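The two-tier idea can be sketched as follows. Class and method names are invented for the example, not KaZaA's actual (proprietary) protocol:

```python
# Illustrative sketch of a super-peer search hub: ordinary peers
# register locally; super-peers periodically swap file lists so a
# query at one hub can also return peers known to other hubs.

class SuperPeer:
    def __init__(self):
        self.local = {}   # filename -> peers registered at this hub
        self.remote = {}  # filename -> peers learned from other hubs

    def register(self, peer, filenames):
        for name in filenames:
            self.local.setdefault(name, set()).add(peer)

    def exchange(self, other):
        # Periodic file-list exchange between super-peers.
        for name, peers in other.local.items():
            self.remote.setdefault(name, set()).update(peers)

    def query(self, name):
        return self.local.get(name, set()) | self.remote.get(name, set())
```

This limits flooding to the small super-peer tier while keeping the index itself decentralized.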
33. Freenet
- Ensures anonymity
- Decentralized search
- Queries are sent to the neighbor nodes
- Neighbors ask their own neighbors, and so on
- The query process is sequential (depth-first rather than flooding)
- Learning ability: nodes remember where earlier requests were satisfied
34. Structured P2P
- Second-generation P2P (overlay) networks
- Self-organizing
- Load-balanced
- Fault-tolerant
- Guarantees on the number of hops to answer a query
- Based on a distributed hash table interface
35. Distributed Hash Tables (DHT)
- A distributed version of the hash table data structure
- Stores (key, value) pairs
  - The key is like a filename
  - The value can be the file contents
- Goal: efficiently insert/lookup/delete (key, value) pairs
- Each peer stores a subset of the (key, value) pairs in the system
- Core operation: find the node responsible for a key
  - Map the key to a node
  - Efficiently route the insert/lookup/delete request to this node
36. DHT Generic Interface
- Node id: m-bit identifier (similar to an IP address)
- Key: sequence of bytes
- Value: sequence of bytes
- put(key, value)
  - Stores (key, value) at the node responsible for the key
- value = get(key)
  - Retrieves the value associated with the key (from the appropriate node)
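The interface can be demonstrated with a toy, in-process DHT. The hashing scheme below (SHA-1 truncated to a 16-bit space, successor-based responsibility) is a simplification for illustration, not any particular DHT's design:

```python
# Toy in-process DHT illustrating the generic put/get interface.
import hashlib

M = 2 ** 16  # size of the identifier space (m = 16 bits)

def key_id(key):
    # Map an arbitrary byte-string key into the m-bit identifier space.
    return int(hashlib.sha1(key).hexdigest(), 16) % M

class ToyDHT:
    def __init__(self, node_ids):
        # One local store per node, ordered by node id.
        self.nodes = {n: {} for n in sorted(node_ids)}

    def responsible(self, key):
        # First node id >= the key's id, wrapping around the ring.
        kid = key_id(key)
        for n in self.nodes:
            if n >= kid:
                return n
        return next(iter(self.nodes))  # wrap to the smallest id

    def put(self, key, value):
        self.nodes[self.responsible(key)][key] = value

    def get(self, key):
        return self.nodes[self.responsible(key)].get(key)
```

A real DHT differs in one essential way: `responsible` is not computed from a global node list but discovered by routing the request through the overlay.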
37. DHT Applications
- File sharing
- Databases
- Service discovery
- Chat service
- Publish/subscribe networks
38. DHT Desirable Properties
- Keys are mapped evenly to all nodes in the network
- Each node maintains information about only a few other nodes
- Messages are routed to nodes efficiently
- Node insertion/deletion affects only a few nodes
39. Chord API
- Node id: m-bit identifier (similar to an IP address)
- Key: m-bit identifier (hash of a sequence of bytes)
- Value: sequence of bytes
- API:
  - insert(key, value)
  - lookup(key)
  - update(key, newval)
  - join(n)
  - leave()
40. Consistent Hashing
41. Chord Operation (1)
- Nodes form a circle based on their node identifiers
- Each node is responsible for storing a portion of the keys
- The hash function ensures an even distribution of keys and nodes on the circle
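The property Chord relies on can be shown concretely: each key belongs to its successor node on the circle, so when a node joins, only the keys between it and its predecessor move. The identifier values below are chosen by hand for clarity (a 0..99 circle instead of 2^m):

```python
# Consistent hashing: keys map to their successor node on the circle,
# and adding a node remaps only one arc of keys.

def successor(ring, kid):
    """Node responsible for identifier `kid`: the first node id >= kid,
    wrapping around the circle."""
    for n in sorted(ring):
        if n >= kid:
            return n
    return min(ring)

ring = [10, 40, 80]      # node identifiers on a 0..99 circle
keys = [5, 25, 55, 90]   # key identifiers

before = {k: successor(ring, k) for k in keys}
after = {k: successor(ring + [60], k) for k in keys}  # node 60 joins
moved = [k for k in keys if before[k] != after[k]]    # only key 55 moves
```

With an ordinary hash-mod-N scheme, changing N would remap almost every key; here the join disturbs a single arc, which is what makes node churn affordable.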
42. Chord Ring Definition
- Finger table: node k stores pointers to the successors of k+1, k+2, k+4, ..., k+2^(m-1) (mod 2^m)
- Every data item can be located in O(log N) steps, with O(log N) routing entries stored per node
43. Chord Operation (2)
44. Chord Operation (3)
- Lookup: forward the query to the furthest known node that precedes the key
- The query reaches the target node in O(log N) hops
45. Scalable Lookup Scheme
- Finger table for N8
- finger[k] = first node that succeeds (n + 2^(k-1)) mod 2^m
46Lookup Using Finger Table
N1
lookup(54)
N56
N8
N51
N48
N14
N42
N38
N21
N32
47. Scalable Lookup Scheme

    // ask node n to find the successor of id
    n.find_successor(id)
      if (id ∈ (n, successor])
        return successor
      else
        n' = closest_preceding_node(id)
        return n'.find_successor(id)

    // search the local table for the highest predecessor of id
    n.closest_preceding_node(id)
      for i = m downto 1
        if (finger[i] ∈ (n, id))
          return finger[i]
      return n
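The pseudocode above can be simulated end-to-end over a static ring. The sketch below is mine (not Chord's reference implementation): it builds the finger tables for the example ring on the preceding slides (m = 6, ten nodes) and replays lookup(54) starting at N8:

```python
# Chord lookup simulation over a static ring (m = 6, ids 0..63).
M = 6
SPACE = 2 ** M
NODES = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]  # must stay sorted

def successor(i):
    """First node whose id is >= i, wrapping around the circle."""
    i %= SPACE
    for n in NODES:
        if n >= i:
            return n
    return NODES[0]

def fingers(n):
    """finger[k] = successor(n + 2**(k-1)) for k = 1..m."""
    return [successor(n + 2 ** (k - 1)) for k in range(1, M + 1)]

def _in_open(x, a, b):        # x in the ring interval (a, b)
    return a < x < b if a < b else (x > a or x < b)

def _in_half_open(x, a, b):   # x in the ring interval (a, b]
    return a < x <= b if a < b else (x > a or x <= b)

def closest_preceding(n, key):
    """Highest finger of n lying strictly between n and key."""
    for f in reversed(fingers(n)):
        if f != n and _in_open(f, n, key):
            return f
    return successor(n + 1)   # fall back to the direct successor

def find_successor(n, key, hops=0):
    succ = successor(n + 1)   # n's immediate successor on the ring
    if _in_half_open(key, n, succ):
        return succ, hops
    return find_successor(closest_preceding(n, key), key, hops + 1)
```

Running `find_successor(8, 54)` forwards the query N8 → N42 → N51 and returns node 56 after two forwarding hops, matching the lookup(54) figure; each hop roughly halves the remaining ring distance, which is where the O(log N) bound comes from.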
48. Chord Properties
- In a system with N nodes and K keys:
  - Each node manages about K/N keys (with high probability)
  - The routing information stored at every node is bounded
- Lookups are resolved in O(log N) hops
- No delivery guarantees
- Poor network locality
49. Network Locality
- Nodes close on the ring can be far apart in the underlying network
50. Grid Computing
- What is a Grid? An integrated, advanced cyber-infrastructure that delivers:
  - Computing capacity
  - Data capacity
  - Communication capacity
- Analogy to the electrical power grid
51. History
- For many years, a few wacky computer scientists have been trying to help other scientists use distributed computing:
  - Interactive simulation (climate modeling)
  - Very large-scale simulation and analysis (galaxy formation, gravity waves, battlefield simulation)
  - Engineering (parameter studies, linked component models)
  - Experimental data analysis (high-energy physics)
  - Image and sensor analysis (astronomy, climate study, ecology)
  - Online instrumentation (microscopes, x-ray devices, etc.)
  - Remote visualization (climate studies, biology)
  - Engineering (large-scale structural testing, chemical engineering)
- In these cases, the scientific problems are big enough that they require people in several organizations to collaborate and share computing resources, data, and instruments.
52. Some Core Problems
- Too hard to keep track of authentication data (ID/password) across institutions
- Too hard to monitor system and application status across institutions
- Too many ways to submit jobs
- Too many ways to store and access files and data
- Too many ways to keep track of data
- Too easy to leave "dangling" resources lying around (robustness)
53. Challenging Applications
- The applications that Grid technology is aimed at are not easy applications!
- The reason these things haven't been done before is that people believed it was too hard to bother trying.
- If you're trying to do these things, you'd better be prepared for it to be challenging.
- Grid technologies are aimed at helping to overcome the challenges:
  - They solve some of the most common problems
  - They encourage standard solutions that make future interoperability easier
  - They were developed as parts of real projects
  - In many cases, they benefit from years of lessons from multiple applications
  - Ever-improving documentation, installation, configuration, and training
54. Earth System Grid
- Goal: address technical obstacles to the sharing and analysis of high-volume data from advanced earth-system models
55. Other Examples of Grids
- TeraGrid: NSF-funded, linking 5 major research sites at 40 Gb/s (www.teragrid.org)
- European Union Data Grid: grid for applications in high-energy physics, environmental science, and bioinformatics (www.eu-datagrid.org)
- Access Grid: collaboration systems using commodity technologies (www.accessgrid.org)
- Network for Earthquake Engineering Simulations Grid: grid for earthquake engineering (www.nees.org)
56. Current Status of the Grid
- Dozens of Grid projects in scientific and technical computing in the academic research community
- Consensus on key concepts and technologies (GGF: Global Grid Forum)
- The open-source Globus Toolkit is a standard for major protocols and services
- Funding agencies are funding many grid projects
- Business interest is emerging rapidly
- Standards are still emerging: grid services, Web Services Resource Framework
- Requires significant user training
57. Using the Grid Now
- A lot of work is needed to make applications grid-ready:
  - Adopt new algorithms for parallel computation
  - Change the user interface
- Applications have to be built on different architectures
- Applications and data need to be moved to different computers
- Security and licensing issues
- Requires a lot of system-administration expertise
- Largely UNIX-based
58. Software Layers
- Web browser or command window (user interface)
- Globus client on the user's workstation (certificates, job submission)
- Globus server on the master node (job manager)
- Queue managers and schedulers on the master node
- Applications running on grid clusters
59. Developing Grid Standards
(figure: increasing functionality and standardization over time)
60. Sand-Glass (Hourglass) Model
- Trying to force homogeneity on users is futile; everyone has their own preferences, sometimes even dogma.
- The Internet provides the model.
61Evolution of the Grid
App-specific Services
Open Grid Services Arch
Increased functionality, standardization
Web services
GGF OGSI, WSRF, (leveraging OASIS, W3C,
IETF) Multiple implementations, including Globus
Toolkit
X.509, LDAP, FTP,
Globus Toolkit
Defacto standards GGF GridFTP, GSI (leveraging
IETF)
Custom solutions
Time
62. Open Grid Services Architecture
- Defines a service-oriented architecture:
  - the key to effective virtualization
- Addresses vital Grid requirements:
  - utility, on-demand, system management, collaborative computing, etc.
- Builds on web service standards, extending them where needed
63. Grid and Web Services Convergence
- The definition of WSRF means that the Grid and web services communities can move forward on a common base.
64. Who Is the Grid For?
- Any Grid (distributed/collaborative) application or system involves several classes of people:
  - End users (e.g., scientists, engineers, customers)
  - Application/product developers
  - System administrators
  - System architects and integrators
- Each user class has unique skills and unique requirements.
- Which user class's needs are met varies from tool to tool (even within the Globus Toolkit).
65. What End Users Need
- Secure, reliable, on-demand access to data, software, people, and other resources (ideally all via a web browser!)
66. General Architecture
67. Grid Community Software
68. Social Policies/Procedures
- How will people use the system?
- Who will set up access control?
- Who creates the data?
- How will computational resources be added to the system?
- How will simulation capabilities be used?
- What will accounting data be used for?
- Not all problems are solved by technology!
- Understanding how the system will be used is important for narrowing the requirements.
69. What Is the Globus Toolkit?
- The Globus Toolkit is a collection of solutions to problems that frequently come up when trying to build collaborative distributed applications.
- Heterogeneity:
  - To date (v1.0 - v4.0), the Toolkit has focused on simplifying heterogeneity for application developers.
  - More vertical solutions are planned for future versions.
- Standards:
  - The goal has been to capitalize on and encourage use of existing standards (IETF, W3C, OASIS, GGF).
  - The Toolkit also includes reference implementations of new/proposed standards from these organizations.
70. What Does the Globus Toolkit Cover?
71. Globus Toolkit Components
72. Comparisons of P2P and Grid