Title: UCB Millennium and the Vineyard Cluster Architecture
1 UCB Millennium and the Vineyard Cluster Architecture
- Phil Buonadonna
- University of California, Berkeley
- http://www.millennium.berkeley.edu
2 Millennium Project
- Hierarchical Cluster of Clusters
[Figure: clusters linked by Gigabit Ethernet (GbE). Labels: PIII-X 64x4; Ninja; PIII 32x2; PII PIII; five PII 8x2 clusters; Math, Astro, Bio, Physics, CE.]
3 Millennium Agenda
- Investigate recent PC technologies in Clusters
- NT/Linux
- VI Architecture / GbE / Distributed I/O
- Harvest the lessons learned from NOW
- Robust, flexible remote execution
- Distributed resource management
- Investigate clusters that span administrative units
- Turn-key cluster deployment
- Sense of ownership
- Investigate the Computational Economy Approach
- Resource management with a natural sense of ownership
- Enough heterogeneous interests to be worthwhile
- Form basis for Sci. Computing, Internet Services, etc.
4 Vineyard Cluster Architecture
- Distributed resource utilization and management in a Vineyard of Clusters.
[Figure: Vineyard software stack. Labels: Applications / Services; Mgmt / Monitoring; PBS; MPI; VEXEC; I/O; TOOLS; REXEC; VIA / GM, GbE; Multicast; NT / Linux (2.2.x); Stride Scheduler; Rootstock Distribution.]
5 Outline
- Millennium Project
- Vineyard Cluster SW Architecture
- Important Component Technologies
- Rootstock: cluster SW distribution facility
- REXEC: Robust Linux Remote Execution
- Economic-based Resource allocation
- CAN communication over VIA
- IO Rivers
- Directions and Discussion
6 Rootstock
- Disseminate easy-to-build PC cluster system software
- Variety of cluster designs
- well-engineered high-performance clusters
- low-cost casual workgroup clusters
- server farms
- scalable internet servers
- Root Cluster Server (CS)
- Provides cluster software stock
- Second-level customized distribution within each cluster from its own CS node
7 Rootstock Cluster
- Collection of nodes with IP connectivity
- can be dedicated subnet, w/ or w/o NAT, or any collection
- run nfsd (within cluster), httpd, ssl
- One node designated as Cluster Root
- serves as the root of administrative operations and mgmt.
- may be same or different from other nodes
- may participate in normal cluster operation or not
- is trusted by other nodes and has storage for dialtone
- May have designated front-end nodes or not
- May have dedicated cluster-area-network (e.g., Myrinet) or not.
8 Rootstock Mechanics
[Figure: Cluster System Distribution Center holding the cluster stock (build, os, drvrs, mill SW, os mods); cs nodes; IP network; CAN.]
1. Cluster Stock
- Rootstock build pages
- Full Current Linux
- all fixes and pckgs
- SSL, SSH
- Cluster Drivers
- Cluster System Layers
- rexec, mpe, pbs
- Optional SW ()
- Cluster Kernel Mods
...
5. Cluster Update button (future)
- 2nd dialtone, CF engine, rolling update
9 Computational Economy
- Market-based approach to resource allocation
- Optimizes for user value
[Figure labels: TimeShare; API; BatchQueue; Economic F.E.; Access Modules; Resources; Resource Managers; Apps (Value).]
10 REXEC: Remote Execution
- Secure, decentralized remote execution environment
- Features
- Decouples resource discovery and selection
- Multiple Allocation Policies (VEXECs)
- Decentralized control
- Each client rexec is the root for a distributed task.
- Dynamic discovery and configuration
- Resource announcements on a cluster multi-cast channel
- All Soft State
- Simple, well-defined failure and cleanup models
- They all fall down
- Secure
- Translates Pricing Mechanism to Resource Allocation
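The "all soft state" idea above can be pictured with a small simulation: a node is known only while its announcements keep arriving, so a crashed node disappears without any explicit cleanup. This is a minimal sketch under stated assumptions, not REXEC code. The class name `SoftStateTable`, the 3-second lifetime, and the node names are hypothetical, and the multicast channel is replaced by direct method calls.

```python
import time
from dataclasses import dataclass, field


@dataclass
class SoftStateTable:
    """Node table kept alive only by periodic announcements (soft state)."""
    ttl: float = 3.0  # assumed entry lifetime in seconds; not from the slides
    _last_seen: dict = field(default_factory=dict)

    def announce(self, node, now=None):
        """Record an announcement heard on the (simulated) multicast channel."""
        self._last_seen[node] = time.monotonic() if now is None else now

    def live_nodes(self, now=None):
        """Drop entries whose announcements stopped, then list the survivors."""
        now = time.monotonic() if now is None else now
        self._last_seen = {
            n: t for n, t in self._last_seen.items() if now - t <= self.ttl
        }
        return sorted(self._last_seen)


table = SoftStateTable(ttl=3.0)
table.announce("nodeA", now=0.0)
table.announce("nodeB", now=0.0)
table.announce("nodeA", now=2.5)   # nodeA keeps announcing
print(table.live_nodes(now=4.0))   # nodeB went silent, so only nodeA remains
```

Because nothing is stored beyond the last announcement time, failure handling reduces to waiting for entries to expire.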
11 REXEC / VEXEC
- Components
- rexecd, rexec, vexecd
[Figure: Nodes A, B, C, and D, each running rexecd, on the Cluster IP Multicast Channel with vexecd (Policy A) and vexecd (Policy B). Labels: "Node A"; "run indexer on Nodes A & B at 3 credits/min minimum"; client command "rexec -n 2 -r 3 indexer".]
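Since selection is decoupled from discovery, a vexecd can be thought of as a function from the announced node state to a list of nodes. The sketch below shows one hypothetical policy (prefer the nodes with the lowest current aggregate bid rate). The slides do not describe the actual policies, so `select_nodes`, the announcement format, and the numbers are all assumptions for illustration.

```python
def select_nodes(announcements, n):
    """Hypothetical vexecd policy: pick the n nodes with the lowest bid rate.

    announcements maps node name -> current aggregate bid rate (credits/min).
    Ties are broken by node name so the result is deterministic.
    """
    if n > len(announcements):
        raise ValueError("not enough nodes have announced themselves")
    ranked = sorted(announcements, key=lambda name: (announcements[name], name))
    return ranked[:n]


# Assumed state gathered from the multicast channel.
announcements = {"A": 0, "B": 1, "C": 7, "D": 4}
nodes = select_nodes(announcements, n=2)
print(f"run indexer on nodes {nodes} at 3 credits/min")
```

A second vexecd could implement a different ranking over the same announcements, which is what makes multiple allocation policies possible without changing rexecd.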
12 Interactive Pricing Mechanism
- Most work on economic mechanisms focuses on single item or batch case
- hold auctions (e.g., second-price sealed bid)
- integrated into Vineyard PBS
- interactive case needs to be very simple
- Bidder $i$ gets $b_i / \sum_k b_k$ of CPU at rate $b_i$
- enforced by stride scheduler
- Running cluster mirror usage experiment
- two identical clusters for one user community with accounts
- one free and uncontrolled
- one for bid and controlled
- which is more desirable to use?
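The proportional-share rule and its enforcement by a stride scheduler can be sketched as follows. This is an illustrative simulation, not the Millennium scheduler: the function names, the `stride1` constant, and the bids are assumptions, and bids are taken to be positive integers used directly as ticket counts.

```python
from fractions import Fraction


def cpu_shares(bids):
    """Bidder i is entitled to b_i / sum_k b_k of the CPU."""
    total = sum(bids.values())
    return {user: Fraction(b, total) for user, b in bids.items()}


def stride_schedule(bids, quanta, stride1=1 << 20):
    """Simulate stride scheduling with tickets equal to bids.

    Each client has stride = stride1 // tickets; at every quantum the client
    with the smallest pass value runs and its pass advances by its stride.
    """
    stride = {user: stride1 // b for user, b in bids.items()}
    passes = dict(stride)                 # initial pass = one stride
    counts = {user: 0 for user in bids}
    for _ in range(quanta):
        user = min(passes, key=lambda u: (passes[u], u))
        counts[user] += 1
        passes[user] += stride[user]
    return counts


bids = {"alice": 3, "bob": 1}             # credits/min, assumed values
print(cpu_shares(bids))                   # entitlements: 3/4 and 1/4
print(stride_schedule(bids, quanta=400))  # quanta received track those shares
```

The scheduler needs no knowledge of prices beyond the ticket counts, which keeps the interactive mechanism simple: changing a bid only changes one client's stride.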
13 Communication / VIA
- Multiple Physical Layers
- Fast Ethernet
- Gigabit Ethernet (Inter & Intra cluster net)
- Myrinet w/ Lanai7 (Intra cluster net)
- Transports
- IP, IP Multicast
- VI Architecture / GM
- Explore integrated IPC and distributed I/O
14 AM Architecture
- Components
- Endpoints
- Virtual Networks
- Bundles
- Operations
- Request / Reply
- Short, Med, Long
- Create, Map, Free
- Poll, Wait
- Credit based flow control
[Figure: Proc A, Proc B, Proc C]
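Credit-based flow control in a request/reply protocol can be illustrated with a short sketch: each request spends one credit and each reply returns one, so a sender can never have more requests outstanding than the receiver has buffers for. The class and method names below are hypothetical and the credit count is arbitrary; this is not the AM API.

```python
from collections import deque


class CreditSender:
    """Sender side of a request/reply channel with a fixed credit budget."""

    def __init__(self, credits):
        self.credits = credits      # requests we may still put in flight
        self.outstanding = deque()  # requests awaiting a reply

    def request(self, msg):
        """Send a request if a credit is available; otherwise refuse."""
        if self.credits == 0:
            return False            # caller must poll for replies first
        self.credits -= 1
        self.outstanding.append(msg)
        return True

    def reply_arrived(self):
        """A reply returns one credit and retires the oldest request."""
        self.credits += 1
        return self.outstanding.popleft()


sender = CreditSender(credits=2)
results = [sender.request(m) for m in ("r1", "r2", "r3")]
print(results)                 # the third request is refused: no credits left
sender.reply_arrived()         # the reply to r1 frees a credit
print(sender.request("r3"))    # the retry is now accepted
```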
15 AM-VIA Architecture
- VI Queue (VIQ)
- Logical channel for AM message type
- VI independent Send/Receive Queues
- Independent request credit scheme (counter n)
- MAP Object
- Container for 3 VIQs
- Short, Medium, Long
- Single Registered Memory Region
[Figure: MAP Object]
16 AM-VIA Integration
- Endpoints: Collection of MAP objects
- Virtual network emulated by point-to-point connections
- Bundle: Pair of VI Completion Queues
- Send/Receive
[Figure: Proc A, Proc B, Proc C]
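The relationship between VIQs, MAP objects, and endpoints can be modeled as plain data structures. The sketch below is only a toy model under stated assumptions: the field names, the credit count of 4, the 4096-byte region, and the `connect` and `send` helpers are hypothetical, and no VI Architecture calls are involved. It shows one MAP object per peer, three independently credited VIQs per MAP object, and a virtual network built from pairwise connections.

```python
from dataclasses import dataclass, field


@dataclass
class VIQ:
    """One logical channel (one AM message type) with its own credit counter."""
    credits: int
    send_queue: list = field(default_factory=list)
    recv_queue: list = field(default_factory=list)


@dataclass
class MapObject:
    """Three VIQs for one peer, sharing a single registered memory region."""
    short: VIQ
    medium: VIQ
    long: VIQ
    region: bytearray


def make_map(credits=4, region_bytes=4096):
    # Credit count and region size are arbitrary illustrative values.
    return MapObject(VIQ(credits), VIQ(credits), VIQ(credits),
                     bytearray(region_bytes))


@dataclass
class Endpoint:
    """An endpoint is a collection of MAP objects, one per connected peer."""
    name: str
    maps: dict = field(default_factory=dict)


def connect(a, b):
    """Emulate one virtual-network link with a point-to-point connection."""
    a.maps[b.name] = make_map()
    b.maps[a.name] = make_map()


def send(src, dst, kind, payload):
    """Post a request on the VIQ for this message type, if credits allow."""
    viq = getattr(src.maps[dst.name], kind)
    if viq.credits == 0:
        return False
    viq.credits -= 1
    viq.send_queue.append(payload)
    getattr(dst.maps[src.name], kind).recv_queue.append(payload)
    return True


a, b, c = Endpoint("A"), Endpoint("B"), Endpoint("C")
for x, y in ((a, b), (a, c), (b, c)):
    connect(x, y)
print(send(a, b, "short", b"ping"), sorted(a.maps))
```

Keeping a separate credit counter per VIQ means that exhausting credits for one message type does not stall the other two.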