CS556: Distributed Systems - PowerPoint PPT Presentation

1
CS-556 Distributed Systems
Condor, Mariposa
  • Manolis Marazakis
  • maraz_at_csd.uoc.gr

2
Condor
  • Distributed batch processing
  • Large-scale numerical simulations
  • U. Wisconsin, Fermi Labs
  • Schedule job execution on idle workstations
  • More efficient resource utilization
  • Suspend execution when the owner begins to use the workstation
  • Either migrate to another idle workstation
  • checkpoint of process state
  • Data and stack segments, CPU state, pending
    signals, open FDs
  • Executables need to be re-linked (not
    re-compiled)
  • Allow access to files even without a DFS
  • remote system calls
  • Unmodified OS kernel
  • There are some limitations
  • Or queue until an idle workstation is available

3
Checkpointing (I)
  • UNIX process state
  • Text
  • Data
  • Initialized (static) data
  • Segment starting at first page-size boundary
    above text area
  • Uninitialized data
  • Heap
  • Grows toward higher addresses
  • Stack
  • Scratch space for function call mechanism
  • Automatic variables, arguments, return values
  • Complications with user-level thread packages
  • The stack may be in the process data segment
  • Kernel state
  • Signals, open FDs, registers
  • Mapped segments
  • Dynamically linked libraries

4
Checkpointing (II)
  • setjmp() and longjmp() for manipulating the stack
  • ioctl() in /proc file system to find mapped
    memory segments
  • write() to produce checkpoint file
  • or write to socket
  • Ensure that dynamic library text and data are
    included
  • must be in the same addresses at restart
  • mmap() for /dev/zero, read() to restore
    checkpoint
  • Checkpoint initiated by signal (SEGV)
  • restart as if returning from signal handler
  • Re-open active file descriptors
  • Recorded using a modified version of open()
  • Uses syscall()
  • sigprocmask(), sigaction(), sigpending()

5
Checkpointing (III)
  • Condor init. handler is linked with the user's code
  • Initialize data structures and install signal handler
  • checkpoint()
  • Invoke the user's main()
  • Upon signal reception, perform checkpoint
  • Continue execution (periodic snapshot)
  • or vacate workstation
  • restart() invoked by Condor code
  • Instead of the user's main()
  • Overwrites its own data segment
  • Returns to checkpoint(), which is a signal
    handler
  • When checkpoint() returns, the user's code is
    resumed
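
The control flow on this slide can be illustrated with a minimal Python sketch. It is an application-level analogy only: Condor snapshots the whole process image (data, stack, registers) transparently, whereas this sketch saves an explicit state dictionary. The names `Job`, `install_handler`, the checkpoint file layout, and the choice of SIGUSR1 (Unix only) are assumptions made for illustration.

```python
import os
import pickle
import signal


class Job:
    """Toy job whose entire state lives in one dictionary (an assumption)."""

    def __init__(self, ckpt_path):
        self.ckpt_path = ckpt_path
        self.state = {"next_i": 0, "total": 0}

    def checkpoint(self, signum=None, frame=None):
        """Signal handler: snapshot the state, then return so work resumes."""
        tmp = self.ckpt_path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(self.state, f)
        os.replace(tmp, self.ckpt_path)  # atomic rename, no half-written file

    def restart(self):
        """Run instead of a fresh start: overwrite our state from the file."""
        with open(self.ckpt_path, "rb") as f:
            self.state = pickle.load(f)

    def run(self, n, max_steps=None):
        """Sum 0..n-1; max_steps mimics being vacated before finishing."""
        steps = 0
        while self.state["next_i"] < n:
            if max_steps is not None and steps >= max_steps:
                return False
            i, total = self.state["next_i"], self.state["total"]
            # Swap in a new dict in one assignment so the handler never
            # observes a half-updated state.
            self.state = {"next_i": i + 1, "total": total + i}
            steps += 1
        return True


def install_handler(job, signum=signal.SIGUSR1):
    """Init step: install the checkpoint handler before the work starts."""
    signal.signal(signum, job.checkpoint)
```

A periodic snapshot corresponds to the handler returning into `run()`; vacating corresponds to exiting after the snapshot and later creating a new `Job` on another machine, calling `restart()`, and then `run()` again.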

6
Checkpointing (IV)
  • Shadow process
  • Handles RPCs for operations on file descriptors
    opened on a host from which the process was
    vacated
  • Useful when there is no common FS
  • Modified read() and write()
  • Using syscall()
  • Limitations
  • Communicating processes?
  • sockets, signals, pipes
  • Programs that invoke fork() or exec()
  • What about code that cannot be re-linked?

7
Top-down approaches to resource management
  • Optimization of system-wide metrics
  • Average response time and throughput
  • Overall
  • OR per-class
  • Representing requirements of individual
    application classes
  • Centralized approaches
  • Resource Manager
  • Co-operative approaches
  • Consensus
  • Multicast

Consistent global state ?
8
Resource management in open systems
  • Heterogeneity of applications and components
  • How to define a global performance metric ?
  • Conflicting goals
  • Dynamic changes in the environment
  • Dynamic participation of providers
  • New application classes
  • Changes in resource consumption patterns
  • Communication overhead
  • Limited resilience to failures

9
Bottom-up resource management
  • Market
  • A system where independent individuals interact
    via trading to achieve a fair allocation of
    resources
  • Mechanisms and protocols
  • Price systems, auctions
  • Economic theory
  • Analytical framework for reasoning about
    properties of resource allocations
  • No attempt to optimize a global metric
  • Each producer/consumer has its own goals and
    definition of optimality
  • Independent decisions
  • Competition, selfish optimization
  • Global coherent behavior emerges when an
    equilibrium state is reached

10
Pareto optimality
  • An allocation is Pareto optimal if no subset of
    the agents can improve on their allocations
    without making some other agent worse off
  • No requirement for
  • Comparable preferences (utility functions)
  • Central co-ordination
  • Multiple independent optimization problems
  • One for each decision agent
  • Trading reveals no private state
  • Only exchanges acceptance/rejection of offers
  • Initial endowments to agents
  • Reflect relative priorities

11
General equilibrium
  • Perfect balance of supply and demand
  • For all traded goods
  • How to make allocation decisions ?
  • Some approaches require that an equilibrium state
    is reached
  • Tatonnement process (Walras, 1874)
  • Arbitrary ordering of resources
  • Adjust price of a resource so as to balance
    demand with supply, given the prices of all other
    resources
  • Multiple rounds, as a change in the price of a
    resource may trigger a change of the excess
    demand for all resources
  • Other approaches are more dynamic
  • Stock market metaphor
  • Basic assumption: an agent that selfishly
    competes will not voluntarily trade with others
    unless it is made better off by trading

12
Auctions
  • Mechanisms for adjusting prices so as to match
    supply and demand
  • English auction
  • Dutch auction
  • Sealed bid
  • Double auction
  • Stock and commodity exchanges
  • Vickrey auction
  • Similar to sealed-bid, but the winner pays the
    2nd-highest bid
  • The optimal strategy is to reveal true valuations
  • thus avoiding multiple rounds
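
A Vickrey (second-price sealed-bid) auction is short enough to sketch directly. The function name and the dictionary representation of bids are assumptions for illustration.

```python
def vickrey_auction(bids):
    """bids: dict mapping bidder -> sealed bid amount.

    Returns (winner, price): the highest bidder wins but pays the
    second-highest bid. Ties go to the bidder listed first.
    """
    if len(bids) < 2:
        raise ValueError("a second-price auction needs at least two bids")
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1]
    return winner, price
```

Because the price paid does not depend on the winner's own bid, shading a bid below the true valuation can only lose the auction, never lower the price, so a single sealed round suffices.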

13
A load-balancing economy
  • Allocation of CPU time and link bandwidth between
    competing jobs
  • $d_{ij}$: delay over link (i, j)
  • $r_i$: processing rate of node i
  • $\mu_j$: processing requirement of job j
  • Preferences of jobs
  • Min $C_k$: cost of processing at node k,
    including cost of communication B/W from origin
    node to node k
  • Min $ST_k$: service time at node k
  • $ST_k$
  • Min $C_k + a \cdot ST_k$
  • Auctions held by processors, bidding by the jobs
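
The three job preferences listed above can be expressed as one selection rule, assuming the combined objective is a weighted sum of cost and service time with weight a. The `Offer` type and `choose_node` function are hypothetical names; a = 0 gives a pure price preference, and a large a approximates a pure service-time preference.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    node: str
    cost: float          # C_k: processing cost at node k, incl. communication
    service_time: float  # ST_k: service time at node k


def choose_node(offers, a):
    """Pick the offer minimizing C_k + a * ST_k (assumed weighted-sum form)."""
    return min(offers, key=lambda o: o.cost + a * o.service_time)
```

For example, with offers (cost 5, time 10) and (cost 8, time 2), a job with a = 0 picks the cheaper node while a job with a = 1 picks the faster one.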

14
A data management economy
  • Management of data migration and replication
  • Minimize expected Tx response time
  • Control variables
  • Number of copies of each data object
  • Assignment of copies to nodes
  • Pricing strategies of data suppliers
  • Txs pay for data access at a processor
  • which leases copies of objects from data
    managers
  • The number of read-only copies adapts to the
    read/write ratio
  • Without any further coordination

15
Mariposa
  • DDBMS built based on Distributed INGRES
  • Wide-area distribution
  • Local autonomy
  • Does not make the conventional DDBMS assumptions:
  • Static data allocation
  • Single administration authority
  • Uniformity in CPU, network connectivity, query
    processing capacity
  • Microeconomic resource management
  • Query routing and scheduling
  • Replication of data fragments
  • Naming service
  • Execute queries within their budget
  • By contracting processing sites for query
    fragments

16
Dynamic environment
  • Naming service
  • advertisements for available data objects
  • Replicated (not centralized)
  • Contract other instances to receive updates
    (asynchronously)
  • A server can join the system by buying copies of
    data objects and advertising its services
  • Per-site bidder and storage manager
  • Attempt to maximize profit per unit of proc. time
  • Total autonomy
  • Some queries may not be completed
  • Some data objects may be dropped
  • Data mobility
  • No notion of home node

17
Replica management (I)
  • Storage managers contract others to receive
    asynchronous notifications of updates
  • Define payment stream for updates delivered
    within a specified time interval
  • Trade-off between currency of data and
    replication cost
  • Updates are merged
  • Data returned by queries may be out-of-date
  • by varying degrees
  • Budget: a non-increasing function of time
  • Administered by the query's broker
  • Obtains bids $\langle C_i, D_i, E_i \rangle$
  • for each of the sub-queries in the query's
    execution plan
  • $C_i$: cost for processing a sub-query within $D_i$ sec
    after its receipt
  • $E_i$: expiration time (validity period)
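
The broker's task can be sketched as choosing one bid per sub-query so that the total cost stays within the budget available at the resulting completion time. This is not Mariposa's actual algorithm; it is a brute-force illustration under stated assumptions: sub-queries run one after another (so delays add up), the budget falls linearly with time, and each bid is a `(cost, delay, expiry, site)` tuple. All names and numbers are hypothetical.

```python
from itertools import product


def budget(t, b0=100.0, slope=2.0):
    """Hypothetical non-increasing budget: what an answer after t sec is worth."""
    return max(0.0, b0 - slope * t)


def choose_bids(bids_per_subquery, now, budget_fn=budget):
    """Pick one unexpired bid per sub-query, maximizing budget minus cost.

    bids_per_subquery: list (one entry per sub-query) of lists of
    (cost, delay, expiry, site) tuples.
    Returns (surplus, chosen_bids), or None if no combination fits the budget.
    """
    live = [[b for b in bids if b[2] >= now] for bids in bids_per_subquery]
    best = None
    for combo in product(*live):
        total_cost = sum(b[0] for b in combo)
        total_delay = sum(b[1] for b in combo)  # sequential execution assumed
        surplus = budget_fn(total_delay) - total_cost
        if surplus >= 0 and (best is None or surplus > best[0]):
            best = (surplus, combo)
    return best
```

Exhaustive search is only practical for a handful of sub-queries and bids; it is used here to keep the trade-off between cost and delay visible.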

18
Replica management (II)
  • Per-site billing rate for each data object
  • Allows site admin. to express bias towards
    specific objects
  • Hot list
  • Data objects for which the site always issues
    bids
  • Actual bid = Computed bid × Load average
  • Low price when idle
  • High price when overloaded
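
A site's bidder could combine the per-object billing rate with the load average roughly as follows. The sketch assumes the actual bid is the computed bid multiplied by the load average, and that the computed bid is the billing rate times an estimate of the work involved; the slide does not define how the computed bid is derived, so that part and all names are hypothetical.

```python
def make_bid(obj, estimated_work, billing_rates, load_average):
    """Return the site's price for a sub-query on obj, or None if it declines.

    billing_rates: dict mapping data object name -> price per unit of work.
    """
    rate = billing_rates.get(obj)
    if rate is None:
        return None  # no billing rate configured: the site does not bid
    computed = rate * estimated_work   # assumed form of the computed bid
    return computed * load_average     # scale by the current load average
```

With a rate of 2.0 and 10 units of work, the site asks 10.0 at load average 0.5 and 60.0 at load average 3.0, so work drifts toward idle sites.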

19
B/W management
  • Table with entries of the form <BW, t1, t2>
  • Available B/W between sites, for a given time
    interval
  • Network bid requests
  • <Tx, request, from, to>
  • Network bid
  • <Tx, request, reservation price, time>
  • Step-by-step calculation of available B/W
  • Starting at destination
  • Propagation of B/W profiles
  • When the B/W profile reaches the source, it
    provides the minimum B/W over all links along the
    path
  • Source-to-destination pass to determine the price
    for B/W along the path
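
The destination-to-source propagation of bandwidth profiles can be sketched as a fold over the links of the path. The sketch assumes each link's table is a list of non-overlapping `(bw, t1, t2)` entries and treats any time not covered by an entry as zero available bandwidth; function names are hypothetical, and the pricing pass is not modeled.

```python
from functools import reduce


def bw_at(profile, start, end):
    """Bandwidth a profile offers over [start, end], or 0 if not covered."""
    for bw, t1, t2 in profile:
        if t1 <= start and end <= t2:
            return bw
    return 0


def merge_min(link_table, downstream_profile):
    """Combine one link's table with the profile received from downstream."""
    entries = link_table + downstream_profile
    points = sorted({t for _, t1, t2 in entries for t in (t1, t2)})
    merged = []
    for start, end in zip(points, points[1:]):
        bw = min(bw_at(link_table, start, end),
                 bw_at(downstream_profile, start, end))
        if merged and merged[-1][0] == bw:
            merged[-1] = (bw, merged[-1][1], end)  # coalesce equal neighbours
        else:
            merged.append((bw, start, end))
    return merged


def path_profile(links_src_to_dst):
    """Start at the destination and propagate the profile back to the source."""
    hops = list(reversed(links_src_to_dst))
    return reduce(lambda prof, link: merge_min(link, prof), hops[1:], hops[0])
```

When the fold reaches the first link, the resulting profile gives, for each time interval, the minimum bandwidth over all links along the path.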

20
References
  • M. Litzkow, T. Tannenbaum, J. Basney, M. Livny,
    Checkpointing and migration of UNIX processes in
    the Condor distributed processing system,
    Technical report 1346, U. Wisconsin-Madison,
    Computer Sciences Dept., 1997.
  • M. Stonebraker, Mariposa: A wide-area
    distributed database system, VLDB Journal,
    October, 1996.