Title: Processes and Threads
2. Problems with Scheduling
- Priority systems are ad hoc at best
- highest priority always wins
- Fair share is implemented by adjusting priorities with a feedback loop - a complex mechanism
- Priority inversion: high-priority jobs can be blocked behind low-priority jobs
- Schedulers are complex and difficult to control
- What we need:
- proportional sharing
- dynamic flexibility
- simplicity
3. Tickets in Lottery Scheduling
- Priority is determined by the number of tickets each process has
- Scheduler picks a winning ticket randomly and gives its owner the resource
- Tickets can be used for a wide variety of different resources (uniform) and are machine independent (abstract)
4. Performance Characteristics
- If a client has probability p of winning, then the expected number of wins over n lotteries is np
- Variance of the binomial distribution: np(1-p)
- Accuracy improves with √n (see the short derivation below)
- need frequent lotteries
- Big picture: the answer is mostly accurate, but short-term inaccuracies are possible
- see stride scheduling below
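A short derivation of the √n claim, added for clarity; it uses only the standard binomial facts already quoted on this slide. With n lotteries and win probability p:

    \mu = E[\text{wins}] = np, \qquad
    \sigma^2 = \mathrm{Var}[\text{wins}] = np(1-p)

    \frac{\sigma}{\mu} = \frac{\sqrt{np(1-p)}}{np}
                       = \sqrt{\frac{1-p}{np}}
                       = O\!\left(\tfrac{1}{\sqrt{n}}\right)

So the relative deviation from a client's entitled share shrinks as 1/√n, i.e. accuracy improves as √n grows, which is why the lotteries must be frequent.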
5. Ticket Inflation
- Make up your own tickets (print your own money)
- Only works among mutually trusting clients
- Presumably works best if inflation is temporary
- Allows clients to adjust their priority dynamically with zero communication
6. Ticket Transfer
- Basic idea: if you are blocked on someone else, give them your tickets
- Example: client-server (see the sketch below)
- server has no tickets of its own
- clients give the server all of their tickets during an RPC
- server's priority is the sum of the priorities of all of its active clients
- server can use lottery scheduling to give preferential service to high-priority clients
- A very elegant solution to a long-standing problem
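A minimal sketch of the transfer idea under these assumptions: tasks are funded by an integer ticket count, and rpc_send_and_wait() is a stub standing in for a blocking RPC. The names are hypothetical, not the actual lottery-scheduling implementation.

    #include <stdio.h>

    struct task { int tickets; };

    /* Stub standing in for a blocking RPC primitive (assumption). */
    static void rpc_send_and_wait(struct task *server) { (void)server; }

    /* The client lends all of its tickets to the server for the duration of
     * the call, so the server competes with the combined funding of its
     * active clients and can itself run a lottery among their requests. */
    static void rpc_call(struct task *client, struct task *server)
    {
        int lent = client->tickets;
        server->tickets += lent;   /* server inherits the client's funding  */
        client->tickets = 0;       /* the blocked client needs no tickets   */

        rpc_send_and_wait(server);

        server->tickets -= lent;   /* tickets are returned on reply         */
        client->tickets = lent;
    }

    int main(void)
    {
        struct task client = { 100 }, server = { 0 };
        rpc_call(&client, &server);
        printf("client=%d server=%d\n", client.tickets, server.tickets);
        return 0;
    }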
7. Trust Boundaries
- A group contains mutually trusting clients
- A unique currency is used inside a group
- simplifies mini-lotteries inside a group (e.g., for a mutex)
- supports fine-grain allocation decisions
- An exchange rate is needed between groups
- the effect of inflation can be localized to a group
8. Compensation tickets
- What happens if a thread is I/O bound and blocks before its quantum expires?
- the thread gets less than its share of the processor
- Basic idea: if you complete only fraction f of the quantum, your tickets are inflated by 1/f until the next time you win (see the sketch below)
- example: if B on average uses 1/5 of a quantum, its tickets will be inflated 5x, so it will win 5 times as often and get its correct share overall
- What if B alternates between 1/5 quanta and whole quanta?
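A minimal sketch of the compensation rule as just described; base_tickets and fraction_used are illustrative names, and a real implementation would track the fraction of the quantum actually consumed.

    #include <stdio.h>

    /* Compensation tickets: a thread that used only fraction f of its
     * quantum before blocking competes with tickets inflated by 1/f until
     * it next wins, so over time it still receives its entitled share. */
    static double effective_tickets(double base_tickets, double fraction_used)
    {
        if (fraction_used <= 0.0 || fraction_used >= 1.0)
            return base_tickets;             /* full quantum: no compensation */
        return base_tickets / fraction_used; /* e.g. f = 1/5 -> 5x tickets    */
    }

    int main(void)
    {
        /* B uses 1/5 of a quantum on average: it competes with 5x tickets. */
        printf("%.1f\n", effective_tickets(100.0, 0.2));   /* prints 500.0 */
        return 0;
    }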
9. Implementation
- Frequent lotteries mean that lotteries must be efficient
- a fast random number generator
- fast selection of the winning ticket from the random number
- Ticket selection
- straightforward algorithm: O(n) (sketched below)
- tree-based implementation: O(log n)
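A minimal sketch of the straightforward O(n) selection, assuming a linked list of clients and using rand() purely for illustration: draw a winning ticket number in [0, total) and walk the list until the running sum passes it.

    #include <stdlib.h>

    struct client {
        int tickets;
        struct client *next;
    };

    /* O(n) lottery: pick a winning ticket uniformly at random, then scan
     * the client list accumulating ticket counts until the client whose
     * range contains the winning ticket is found. */
    struct client *hold_lottery(struct client *head, int total_tickets)
    {
        int winner = rand() % total_tickets;   /* winning ticket number */
        int sum = 0;

        for (struct client *c = head; c != NULL; c = c->next) {
            sum += c->tickets;
            if (winner < sum)
                return c;      /* this client holds the winning ticket */
        }
        return NULL;           /* unreachable if ticket counts are consistent */
    }

The tree-based O(log n) variant stores partial ticket sums in internal nodes, so the same search descends a balanced tree instead of scanning the list.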
10. Implementation: Ticket Object
11. Currency Graph
12. Problems
- Not as fair as we'd like
- the mutex experiment comes out 1.8:1 instead of 2:1
- possible starvation
- the multimedia apps come out 1.92:1.50:1 instead of 3:2:1
- possible jitter
- Every queue is an implicit scheduling decision...
- Every spinlock ignores priority...
- Can we force it to be unfair? Is there a way to use compensation tickets to get more time, e.g., quit early to get compensation tickets and then run for the full time next time?
- What about kernel cycles? If a process uses a lot of cycles indirectly, such as through the ethernet driver, does it get higher priority implicitly? (probably)
13. Stride Scheduling
- Basic idea: make a deterministic version to reduce short-term variability
- Mark time virtually, using passes as the unit
- A process has a stride, which is the number of passes between executions. Strides are inversely proportional to the number of tickets, so high-priority jobs have low strides and thus run often.
- Very regular: a job with priority p will run every 1/p passes
14. Stride Scheduling (cont'd)
- Algorithm (roughly): always pick the job with the lowest pass number, then update its pass number by adding its stride (see the sketch below)
- Similar mechanism to compensation tickets: if a job uses only fraction f of its quantum, advance its pass number by f times the stride instead of the full stride
- Overall result: it is far more accurate than lottery scheduling, and the error can be bounded absolutely instead of probabilistically
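A minimal stride scheduler sketch following the description above; the constant STRIDE1, the struct layout, and the fixed job set are assumptions for illustration. The sketch always charges a full quantum, so the fractional f-times-stride case is only noted in the comment.

    #include <stdio.h>

    #define STRIDE1 (1 << 20)   /* large constant so strides stay integral */
    #define NJOBS   3

    struct job {
        int  tickets;
        long stride;   /* STRIDE1 / tickets: passes between executions */
        long pass;     /* virtual time of this job's next execution    */
    };

    /* Pick the job with the lowest pass, run it for one quantum, then
     * advance its pass by its stride (or by f * stride if it used only a
     * fraction f of the quantum, mirroring compensation tickets). */
    static struct job *schedule(struct job jobs[], int n)
    {
        struct job *best = &jobs[0];
        for (int i = 1; i < n; i++)
            if (jobs[i].pass < best->pass)
                best = &jobs[i];
        best->pass += best->stride;   /* full quantum used */
        return best;
    }

    int main(void)
    {
        struct job jobs[NJOBS] = {
            { 3, STRIDE1 / 3, 0 },    /* high priority: small stride, runs often */
            { 2, STRIDE1 / 2, 0 },
            { 1, STRIDE1 / 1, 0 },
        };
        for (int t = 0; t < 12; t++)
            printf("%d ", (int)(schedule(jobs, NJOBS) - jobs));
        printf("\n");   /* over 12 quanta the allocation matches the 3:2:1 ticket ratio */
        return 0;
    }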
15. Stride Scheduling Example
16. Distributed System
- Distributed System (DS)
- consists of a collection of autonomous computers linked by a computer network and equipped with distributed system software
- DS software
- enables computers to coordinate their activities and to share the resources of the system, i.e., hardware, software and data
- Users of a DS should perceive a single, integrated computing facility even though it may be implemented by many computers in different locations
17. Characteristics of Distributed Systems
- The following characteristics are primarily responsible for the usefulness of distributed systems:
- Resource Sharing
- Openness
- Concurrency
- Scalability
- Fault tolerance
- Transparency
- They are not automatic consequences of distribution; system and application software must be carefully designed
18. Design Goals
- Key design goals
- Performance, Reliability, Consistency, Scalability, Security
- Basic design issues
- Naming
- Communication: optimize the implementation while retaining a high-level programming model
- Software structure: structure a system so that new services can be introduced that will interwork fully with existing services
- Workload allocation: deploy the processing, communication and other resources for optimum effect when processing a changing workload
- Consistency maintenance: maintain consistency at reasonable cost
19. Naming
- Distributed systems are based on the sharing of resources and on the transparency of resource distribution
- Names assigned to resources must
- have global meanings that are independent of location
- be supported by a name interpretation system that can translate names to enable programs to access the resources
- Design issue
- design a naming scheme that will scale and that translates names efficiently enough to meet appropriate performance goals
20. Communication
- Communication between a pair of processes involves
- transfer of data from the sending process to the receiving process
- synchronization of the receiving process with the sending process, where required
- Programming Primitives
- Communication Structure
- Client-Server
- Group Communication
21. Software Structure
- Addition of new services should be easy
- The main categories of software in a distributed system (layered, top to bottom):
- Applications
- Open services
- Distributed programming support
- Operating system kernel services
- Computer and network hardware
22. Workload Allocation
- How is work allocated amongst resources in a DS?
- Workstation-Server Model
- puts the processor cycles near the user - good for interactive applications
- the capacity of the workstation determines the size of the largest task that can be performed on behalf of the user
- does not optimize the use of processing and memory resources
- a single user with a large computing task is not able to obtain additional resources
- Some modifications of the workstation-server model
- processor pool model, shared-memory multiprocessor
23. Processor Pool Model
- Processor pool model
- allocates processors dynamically to users
- a processor pool usually consists of a collection of low-cost computers
- each processor in a pool has an independent network connection
- processors do not have to be homogeneous
- processors are allocated to processes for their lifetime
- Users
- use a simple computer or X-terminal
- a user's work can be performed partly or entirely on the pool processors
- examples: Amoeba, Clouds, Plan 9
24. Use of Idle Workstations
- A significant proportion of workstations on a network may be unused, or used only for lightweight activities, at some times (especially overnight)
- Idle workstations can be used to run jobs for users who are logged on at other stations and do not have sufficient capacity on their own machine
- In the Sprite OS
- the target workstation is chosen transparently by the system
- includes a facility for process migration
- NOW (Networks of Workstations)
- MPPs are expensive and workstations are NOT
- the network is getting faster than any other component
- for what?
- network RAM, cooperative file caching, software RAID, parallel computing, etc.
25. Consistency Maintenance
- Update Consistency
- arises when several processes access and update data concurrently
- changing a data value cannot be performed instantaneously
- desired effect: the update looks atomic; a related set of changes made by a given process should appear to all other processes as if it were done instantaneously
- Significant because
- many processes share data
- the operation of the system itself depends on the consistency of file directories managed by file services, naming databases, etc.
26. Consistency Maintenance (cont'd)
- Replication Consistency
- motivations for data replication: increased availability and performance
- if data have been copied to several computers and subsequently modified at one or more of them, the possibility of inconsistencies arises between the values of data items at different computers
27. Consistency Maintenance (cont'd)
- Cache Consistency
- caching vs. replication
- same consistency problem as replication
- examples
- multiprocessor caches
- file caches
- cluster web servers
28. User Requirements
- Functionality
- what the system should do for users
- Quality of Service
- issues of performance, reliability and security
- Reconfigurability
- accommodate changes without causing disruption to existing services
29. Distributed File System
- Introduction
- The Sun Network File System
- The Andrew File System
- The Coda File System
- The xFS
30. Introduction
- Three practical implementations
- Sun Network File System
- Andrew File System
- Coda File System
- These systems aim to emulate the UNIX file system interface
- Emulation of a UNIX file system interface
- caching of file data in client computers is an essential design feature, but the conventional UNIX file system offers one-copy update semantics
- one-copy update semantics: the file contents seen by all of the concurrent processes are those that they would see if only a single copy of the file contents existed
- These three implementations allow some deviation from one-copy semantics
- the one-copy model has not been strictly adhered to
31. Server Structure
- Connectionless
- Connection-Oriented
- Iterative Server
- Concurrent Server
32. Stateful Server
[Diagram: client A issues fopen(...) followed by fread(fp, nbytes) against the server's file system; the server keeps a file descriptor for client A, and the file position is updated at the server.]
33. Stateless Server
[Diagram: client A issues fopen(fp, read), fread(..., position, ...) and fclose(fp) against the server's file system; the file descriptor and file position are kept and updated at the client, so each read carries the position explicitly. A toy example follows.]
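A runnable toy illustrating the stateless style, added for concreteness: the client keeps the file position and passes it with every read, so the server holds no per-client state. The file handle "fh-42" and the in-memory data are made up for the example; in the stateful style of the previous slide the position would instead live in the server's per-client descriptor.

    #include <stdio.h>
    #include <string.h>

    static const char file_data[] = "hello, distributed world";

    /* Stateless read: the request carries everything needed (a file handle
     * and an explicit offset), so the server keeps no per-client state and
     * a restarted server can answer immediately. */
    static size_t stateless_read(const char *handle, size_t offset,
                                 char *buf, size_t nbytes)
    {
        (void)handle;                              /* single file in this stub */
        size_t len = strlen(file_data);
        if (offset >= len) return 0;
        if (nbytes > len - offset) nbytes = len - offset;
        memcpy(buf, file_data + offset, nbytes);
        return nbytes;
    }

    int main(void)
    {
        char buf[8];
        size_t pos = 0, got;                       /* position lives at the client */
        while ((got = stateless_read("fh-42", pos, buf, sizeof buf)) > 0) {
            fwrite(buf, 1, got, stdout);
            pos += got;                            /* client updates the position  */
        }
        putchar('\n');
        return 0;
    }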
34. The Sun NFS
- provides transparent access to remote files for client programs
- each computer has client and server modules in its kernel
- the client and server relationship is symmetric
- each computer in an NFS network can act as both a client and a server
- larger installations may be configured as dedicated servers
- available for almost every major system
35. The Sun NFS (cont'd)
- Design goals with respect to transparency
- Access transparency
- the API is identical to the local OS's interface; thus, in a UNIX client, no modifications to existing programs are required to access remote files
- Location transparency
- each client establishes its own file name space by adding remote file systems to its local name space (mount)
- NFS does not enforce a single network-wide file name space
- each client may see a unique name space
36. The Sun NFS (cont'd)
- Failure transparency
- the NFS server is stateless and most file access operations are idempotent
- UNIX file operations are translated to NFS operations by an NFS client module
- the stateless and idempotent nature of NFS ensures that the failure semantics for remote file access are similar to those for local file access
- Performance transparency
- both the client and the server employ caching to achieve satisfactory performance
- for clients, the maintenance of cache coherence is somewhat complex, because several clients may be using and updating the same file
37. The Sun NFS (cont'd)
- Migration transparency
- Mount service
- establishes the file name space in client computers
- file systems may be moved between servers, but the remote mount tables in each client must then be separately updated to enable the clients to access the file system in its new location
- migration transparency is not fully achieved by NFS
- Automounter
- runs in each NFS client and enables pathnames to be used that refer to unmounted file systems
38. The Sun NFS (cont'd)
- Replication transparency
- NFS does not support file replication in a general sense
- Concurrency transparency
- UNIX supports only rudimentary locking facilities for concurrency control
- NFS does not aim to improve upon the UNIX approach to the control of concurrent updates to files
39. The Sun NFS (cont'd)
- Scalability
- the scalability of NFS is limited, due to the lack of replication
- the number of clients that can simultaneously access a shared file is restricted by the performance of the server that holds the file
- that server can become a system-wide performance bottleneck for heavily used files
40. Implementation of NFS
- User-level client process: a process using NFS
- NFS client and server modules communicate using remote procedure calls
41. The Andrew File System
- Andrew
- a distributed computing environment developed at CMU
- Andrew File System (AFS)
- reflects an intention to support information sharing on a large scale
- provides transparent access to remote shared files for UNIX programs
- scalability is the most important design goal
- implemented on workstations and servers running BSD 4.3 UNIX or Mach
42. The Andrew File System (cont'd)
- Two unusual design characteristics
- whole-file serving
- the entire contents of files are transmitted to client computers by AFS servers
- whole-file caching
- a copy of a file is stored in a cache on the client's local disk
- the cache is permanent, surviving reboots of the client computer
43. The Andrew File System (cont'd)
- The design strategy is based on some assumptions
- files are small
- reads are much more common than writes (about 6 times)
- sequential access is common and random access is rare
- most files are read and written by only one user
- temporal locality of reference for files is high
- Databases do not fit the design assumptions of AFS
- they are typically shared by many users and are often updated quite frequently
- databases are handled by their own storage mechanisms anyway
44. Implementation
- Some questions about the implementation of AFS
- How does AFS gain control when an open or close system call referring to a file in the shared file space is issued by a client?
- How is the server holding the required file located?
- What space is allocated to cached files in workstations?
- How does AFS ensure that the cached copies of files are up-to-date when files may be updated by several clients?
45. Implementation (cont'd)
- Vice: the name given to the server software that runs as a user-level UNIX process in each server computer
- Venus: a user-level process that runs in each client computer
46. Cache coherence
- Callback promise
- a mechanism for ensuring that cached copies of files are updated when another client closes the same file after updating it
- Vice supplies a copy of a file to a Venus together with a callback promise
- callback promises are stored with the cached files
- state of a callback promise: either valid or cancelled
- When Vice updates a file, it notifies all of the Venus processes to which it has issued callback promises by sending a callback
- a callback is an RPC from a server to a client (i.e., Venus)
- When a Venus receives a callback, it sets the callback promise token for the relevant file to cancelled
47. Cache coherence (cont'd)
- Handling open in Venus (see the sketch below)
- if the required file is found in the cache, its token is checked
- if its value is cancelled, get a new copy
- if valid, use the cached copy
- Restart of a client computer after a failure
- some callbacks may have been missed
- for each file with a valid token, Venus sends a timestamp to the server
- if the timestamp is current, the server responds with valid
- otherwise, the server responds with cancelled
48. Cache coherence (cont'd)
- Callback promise renewal interval
- callback promises must be renewed before an open if a time T (say, 10 minutes) has elapsed without communication from the server about a cached file
- this deals with communication failures
49. Update semantics
- For a client C operating on a file F held at a server S, the following are guaranteed
- Update semantics for AFS-1
- after a successful open: latest(F,S)
- after a failed open: failure(S)
- after a successful close: updated(F,S)
- after a failed close: failure(S)
- latest(F,S): the current value of F at C is the same as the value at S
- failure(S): the open or close has not been performed at S
- updated(F,S): C's value of F has been successfully propagated to S
50. Update semantics (2)
- Update semantics for AFS-2
- the currency guarantee for open is slightly weaker
- after a successful open: latest(F,S,0) or (lostCallback(S,T) and inCache(F) and latest(F,S,T))
- latest(F,S,T): the copy of F seen by the client is no more than T out of date
- lostCallback(S,T): a callback message from S to C has been lost during the last T time units
- inCache(F): F was in the cache at C before the open was attempted
51. Update semantics (3)
- AFS does not provide any further concurrency control mechanism
- If clients on different workstations open, write and close the same file concurrently
- only the updates from the last close remain, and all others are silently lost (no error report)
- clients must implement concurrency control independently if they require it
- When two client processes on the same workstation open a file
- they share the same cached copy, and updates are performed in the normal UNIX fashion, block by block
52. The Coda File System
- Coda File System
- a descendant of AFS that addresses several new requirements (CMU)
- replication for a large-scale system
- improvement in fault tolerance
- mobile use of portable computers
- Goal
- constant data availability
- provide users with the benefits of a shared file repository, but allow them to rely entirely on local resources when the repository is partially or totally inaccessible
- retain the original goals of AFS with regard to scalability and the emulation of UNIX
53. The Coda File System (cont'd)
- Read-write volumes
- can be stored on several servers
- higher throughput of file accesses and a greater degree of fault tolerance
- Support for disconnected operation
- an extension of the AFS mechanism for caching copies of files at workstations
- enables workstations to operate when disconnected from the network
54. The Coda File System (cont'd)
- Volume storage group (VSG)
- the set of servers holding replicas of a file volume
- Available volume storage group (AVSG)
- the subset of the VSG currently accessible to a client wishing to open a file
- Callback promise mechanism
- clients are notified of a change, as in AFS
- updates are propagated instead of invalidations
55. The Coda File System (cont'd)
- Coda version vector (CVV)
- attached to each version of a file
- a vector of integers with one element for each server in the VSG: (server-i1, server-i2, ..., server-ik)
- each element of the CVV counts the number of modifications to the version of the file held at the corresponding server
- provides information about the update history of each file version, enabling inconsistencies to be detected and corrected automatically if updates do not conflict, or with manual intervention if they do
56. The Coda File System (cont'd)
- Repair of inconsistency (see the sketch below)
- if every element of the CVV at one site is greater than or equal to the corresponding element at all other sites (i.e., that CVV dominates the others)
- the inconsistency can be automatically repaired
- otherwise, the conflict cannot in general be resolved automatically
- the file is marked as inoperable and the owner of the file is informed of the conflict
- manual intervention is needed
57. The Coda File System (cont'd)
- Scenario
- when a modified file is closed, Venus sends to each site in the AVSG an update message (the new contents of the file and the CVV)
- Vice at each site checks the CVV
- if consistent, it stores the new contents and returns an ACK
- Venus increments the elements of the CVV for the servers that responded positively to the update message, and distributes the new CVV to the members of the AVSG
58. The Coda File System: Example
- F is a file in a volume replicated at servers S1, S2 and S3
- C1 and C2 are clients
- VSG for F: {S1, S2, S3}
- AVSG for C1: {S1, S2}; AVSG for C2: {S3}
- Initially, the CVVs for F at all three servers are [1, 1, 1]
- C1 modifies F
- the CVVs for F at S1 and S2 become [2, 2, 1]
- C2 modifies F
- the CVV for F at S3 becomes [1, 1, 2]
- No CVV dominates all the other CVVs
- a conflict requiring manual intervention
- Suppose F is not modified in step 3 above. Then [2, 2, 1] dominates [1, 1, 1], so the version of the file at S1 or S2 should replace that at S3
59. Update semantics
- The currency guarantees offered by Coda when a file is opened at a client are weaker than those of AFS
- Guarantees offered by a successful open
- it provides the most recent copy of the file from the current AVSG
- if no server is accessible, a locally cached copy of the file is used, if available
- Guarantees offered by a successful close
- the file has been propagated to the currently accessible set of servers
- if no server is available, the file has been marked for propagation at the earliest opportunity
60. Update semantics (cont'd)
- S: the set of servers holding the file (the file's VSG)
- s: the AVSG for the file as seen by client C
- after a successful open: (s ≠ ∅ and (latest(F,s,0) or (latest(F,s,T) and lostCallback(s,T) and inCache(F)))) or (s = ∅ and inCache(F))
- after a failed open: (s ≠ ∅ and conflict(F,s)) or (s = ∅ and ¬inCache(F))
- after a successful close: (s ≠ ∅ and updated(F,s)) or (s = ∅)
- after a failed close: s ≠ ∅ and conflict(F,s)
- conflict(F,s): the values of F at some servers in s are currently in conflict
61. Cache coherence
- Venus at each client must detect the following events within T seconds
- enlargement of an AVSG
- due to a previously inaccessible server becoming accessible
- shrinking of an AVSG
- due to a server becoming inaccessible
- a lost callback
- Multicast messages to the VSG
62. xFS
- xFS: Serverless Network File System
- in the paper "A Case for NOW", "Experience with a ..."
- Idea
- the file system as a parallel program
- exploit fast LANs
- Cooperative Caching
- use remote memory to avoid going to disk
- manage client memory as a global resource
- much of client memory is unused
- the server gets a file from a client's memory instead of from disk
- better to send a replaced file copy to an idle client than to discard it
63. xFS Cache Coherence
- Write-Ownership Cache Coherence
- each node can own a file
- the owner has the most up-to-date copy
- the server just keeps track of who "owns" each file
- any request for a file is forwarded to its owner
- a file is either
- owned: only one copy exists
- read-only: multiple copies exist
- to modify a file
- secure the file as owned
- modify it as many times as you want
- if someone else reads the file, send the up-to-date version and mark the file as read-only
64. xFS Cache Coherence
[State diagram: states are invalid, read-only and owned. A read takes invalid to read-only; a write takes invalid or read-only to owned; a read by another node takes owned back to read-only; a write by another node takes read-only or owned to invalid. A minimal sketch of these transitions follows.]
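A minimal per-file transition function matching the diagram above; the enum names and event set are chosen for this sketch, not taken from the xFS sources.

    /* Write-ownership coherence states from the diagram above. */
    enum xfs_state { INVALID, READ_ONLY, OWNED };

    enum xfs_event {
        LOCAL_READ,          /* this node reads the file        */
        LOCAL_WRITE,         /* this node writes the file       */
        REMOTE_READ,         /* another node reads our copy     */
        REMOTE_WRITE         /* another node takes ownership    */
    };

    /* One step of the per-file state machine kept at each client. */
    enum xfs_state xfs_next_state(enum xfs_state s, enum xfs_event e)
    {
        switch (e) {
        case LOCAL_READ:
            return (s == INVALID) ? READ_ONLY : s;   /* fetch a readable copy  */
        case LOCAL_WRITE:
            return OWNED;                            /* secure ownership first */
        case REMOTE_READ:
            return (s == OWNED) ? READ_ONLY : s;     /* ship data, demote copy */
        case REMOTE_WRITE:
            return INVALID;                          /* our copy is now stale  */
        }
        return s;
    }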
65. xFS Software RAID
- Cooperative caching makes availability a nightmare
- any crash will damage a part of the file system
- Stripe data redundantly over multiple disks
- software RAID
- reconstruct the missing part from the remaining parts
- logging makes reconstruction easy
66. xFS Software RAID
- Motivations
- high bandwidth requirements from
- multimedia
- parallel computing
- economical workstations
- high-speed networks
- let's learn from RAID
- parallel I/O from inexpensive hard disks
- fault management
- Limitations
- single server
- small-write problem
67. xFS Software RAID
- Approaches
- stripe each file across multiple file servers
- Small-file problems
- when the striping unit is too small
- the ideal size is tens of Kbytes
- a write costs two reads and two writes (to read and rebuild the parity; see the sketch below)
- when a whole file fits in one striping unit
- parity consumes as much space as the data
- the load cannot be spread across servers
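Why a small write costs two reads and two writes: read the old data block and the old parity, then write the new data and the updated parity. A sketch of the parity update, assuming standard RAID-5 style XOR parity and a 4 KB block (assumptions for illustration, not xFS specifics):

    #include <stddef.h>

    #define BLOCK 4096

    /* RAID-5 style small write: read the old data block and the old parity
     * block, then write the new data block and the updated parity block.
     *   new_parity = old_parity XOR old_data XOR new_data
     * Hence the "two reads and two writes" per small write noted above. */
    void update_parity(const unsigned char old_data[BLOCK],
                       const unsigned char new_data[BLOCK],
                       unsigned char parity[BLOCK])
    {
        for (size_t i = 0; i < BLOCK; i++)
            parity[i] ^= old_data[i] ^ new_data[i];
    }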
68. xFS Experiences
- Need for a formal method for cache coherence
- it is much more complicated than it looks
- lots of transient states
- 3 formal states turned into 22 implementation states
- ad hoc test-and-retry leaves unknown errors in place permanently
- no one is sure about the correctness
- software portability is poor
69. xFS Experiences
- Threads in a server
- a nice concept, but
- it incurs too much concurrency
- too many data races
- the most difficult thing in the world to understand
- difficult to debug
- Solution: an iterative server
- difficult to design but simple to debug
- less error-prone
- efficient
- RPC
- not suitable for multi-party communication
- need to gather/scatter RPC servers