Title: A View from the Top End of Year 1
1A View from the Top End of Year 1
Al Geist October 10-11 Houston TX
2Participating Organizations
Coordinator Al Geist
ORNL ANL LBNL PNNL
PSC SDSC IBM
SNL LANL Ames NCSA
Cray Intel Unlimited Scale
Main Web Site
www.scidac.org/ScalableSystems
3Review of Last Meeting
Scalable Systems Software Center
June 13-14 Houston TX
Details in Main project notebook
4Progress Reports at June mtg
- Al Geist: working groups, notebooks, telecoms
- Working Group Leaders:
  - What areas their working group is addressing
  - Progress report on what their group has done
  - Present problems being addressed
  - Next steps for the group
  - Discussion items for the larger group to consider
- Demonstrations of Prototype Components: one big intra-component demo
Slides can be found in Main Notebook page 22
5Consensus and Voting
- Event Manager Proposal: Much discussion. Revised the proposal to say that Event Management is an important feature of our Software Suite, independent of whether it is in a central component or inside components, and that the proposed tuple API is an initial starting point. Passed strawvote: 13 for / 0 against / 0 abstain
- Adopt HTTP POST (byte count) as standard: Proposal passed strawvote: 10 for / 0 against / 1 abstain
- Adopt W3C standard for XML signature syntax and processing: Long discussion. Decided more discussion is needed before a vote
- Bugzilla site now up and running. Link is on the ScalableSystems home page.
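The "HTTP POST (byte count)" item can be illustrated with a minimal sketch. It assumes that "byte count" means the standard HTTP `Content-Length` header, which tells the receiver exactly how many body bytes to read. The path `/SSS`, the host name, and the XML payload are hypothetical placeholders, not part of any adopted spec.

```python
def frame_post(body: bytes, path: str = "/SSS", host: str = "localhost") -> bytes:
    """Wrap a payload in an HTTP/1.1 POST whose Content-Length is the byte count.

    The path and host defaults are illustrative assumptions.
    """
    head = (
        f"POST {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Content-Type: text/xml\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("ascii")
    return head + body


def unframe_post(data: bytes) -> bytes:
    """Return exactly Content-Length bytes of body from a framed message."""
    head, sep, rest = data.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("missing blank line after headers")
    length = None
    # Skip the request line, then scan the headers for the byte count.
    for line in head.split(b"\r\n")[1:]:
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip())
    if length is None:
        raise ValueError("missing Content-Length header")
    if len(rest) < length:
        raise ValueError("body shorter than declared byte count")
    return rest[:length]


if __name__ == "__main__":
    message = b"<Request action='Query'/>"  # hypothetical payload
    framed = frame_post(message)
    print(framed.decode("ascii"))
    print(unframe_post(framed) == message)
```

Because the receiver reads a declared number of bytes, several messages can share one connection without a delimiter inside the XML.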
6Progress Since Last Meeting
Scalable Systems Software Center
June-October
7Five Project Notebooks filling up
- A main notebook for general information
- And individual notebooks for each working group
- Over 200 total pages, 34 added since last meeting
- A lot of new material in Resource Management notebook (way to go)
Get to all notebooks through main web site www.scidac.org/ScalableSystems
Click on side bar or on project notebooks at bottom of page
8Four Bi-weekly Working Group Telecoms
Less talk, more work
- Resource management, scheduling, and accounting: Tuesday 3:00 pm (Eastern), 1-800-664-0771, keyword SSS mtg
- Validation and Testing: Wednesday 1:00 pm (Eastern), 1-877-540-9892, mtg code 999157
- Process management, system monitoring, and checkpointing: Thursday 1:00 pm (Eastern), 1-877-252-5250, mtg code 160910
- Node build, configuration, and information service: Friday 3:00 pm (Eastern), 1-888-469-1934, mtg code 58145 (changes)
9Scalable Systems Integrated Component
Demonstration
Done June 2002
Messages, in the order numbered on the diagram:
0 Service-Lookup
1 Submit-Job
2 Query-Job
3 Query-Node
4 Create-Reservation
5 Run-Job
6 Exec-Process
7 Query-Job
8 Delete-Job
9 Withdraw-Allocation
Components: Job Submission Client, Queue Manager, Local Scheduler, Allocation Manager, Node Monitor, Process Manager, Discovery Service
Color Key (Working Group): Resource Management and Accounting; Process Management and Monitoring; Node Configuration and Build Infrastructure
10Meta Scheduler D. Jackson
Meta Manager S. Scott
System/Job Monitors M. Showerman
Node Manager T. Naughton
Allocation Management S. Jackson
Job Manager B. Bode
Package Services J. Mugler
Accounting S. Jackson
C-Plant XML interface E. Debenedictis
Service Directory N. Desai
Scheduler D. Jackson
Process Manager R. Lusk
Information Services JP Navarro
Checkpoint / Restart P. Hargrove
Authentication Communication R. Lusk
Queue Manager B. Bode
SSSlib Used by all components
Process Mgmt Working Group
Resource Mgmt Working Group
Build Configure Working Group
11This Meeting
Scalable Systems Software Center
October 10-11, 2002
12SciDAC Booth
13SciDAC Systems Poster
14SciDAC Booth
15SciDAC Systems Poster (2)
16Agenda October 10
8:00 Breakfast
8:30 Al Geist: Project Status. Getting ready for SC 2002
9:00 External Project review in February (start planning)
Working Group Reports
9:30 Scott Jackson: Resource Management
10:30 Break
11:00 Erik Debenedictis: Validation and Testing
12:00 Lunch (on own, but go somewhere as a group)
1:00 Paul Hargrove: Process Management
2:00 Narayan Desai: Node Build, Configure
3:00 Break
3:30 SC Demos and Hacking: big multi-component demo
5:00 Open Discussion
5:30 Adjourn. Working groups may wish to get together in the evening
17Agenda October 11
8:00 Breakfast
8:30 Discussion, proposals, strawvotes
- THANKS to Airport Security Meeting for open access to their internet access!
- ssslib
- meatball GUI (who?)
- Chiba City for SC demos (Nov 4?)
- cross group issues
- test packaging?
10:30 Break
11:00 Al Geist: Summary
- SC: booth, demos, theater, software, handout (Brett)
- February review: reviewers, advisor, talks
- next meeting date: day before review
12:00 meeting ends
18External SciDAC Review mtg
- Late February 2003, may bubble over to early March
- 18 month checkup by MICS
- Each SciDAC Project is reviewed separately. Scalable Systems is the only thing on the agenda
- Full two days of detailed presentations, so many of us will have to give presentations
- External review panel (different for each ISIC). We can suggest names. Can't be from our organizations or affiliated
- They will have been given our proposal beforehand
19External SciDAC Review metrics
I asked Fred and McGraw about metrics:
1. How have we helped SciDAC Apps? Can we show use in CCS and NERSC and others?
2. Put Advisory Panel into place: Apps and Computer Center personnel. I've asked Drake (Climate), Mezzacappa (Astro), Bland (CCS), Nichols (Chemistry). We need a NERSC rep and others?
3. Show short term successes and use
20External Review Panel Suggestions
External review panel (different for each ISIC). We can suggest names - who?
- Barney McCabe
- Russ Miller
- Bart Miller
- Jose M (IBM)
- Someone from Cray
- Someone from Etnus, John Delsignore
- Someone from Unlimited Scale?
- Walt Ligon
- Andrew Lumsdaine
- Jim Garlick
- Steve Chapin
21Meeting Notes
Scott Jackson: RM progress
- Scope: queue manager, job manager, scheduler, allocation, meta
- Demo: CCS, NERSC, and Chiba meta-schedule would be good
- Scheduler: enhance internal scalability to 64K nodes, add support for HTTP framing protocol. Qbank security enhanced. Interface to PBS, LSF, LL for suspend/resume and requeue mgt
- Queue Manager: conforms to SSSRMAP XML spec, full wire protocol compatibility, new interface to Event Manager
- Allocation Manager: survey of 15 sites for requirements. Implemented HTTP framing, SHA1-HMAC security working with Qbank/Maui. Reframed bank objects (accounts, users, allocations) as dynamic object actions defined in metadata cache. Creation of dynamic web-GUI using PHP and JavaScript
- Meta scheduler: interoperates with Grid (Globus), fault tolerance, global jobID tracking, scheduler reconnection. Improved user interface
- Current issues: job state mgt, data staging, job signaling, job steps
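The SHA1-HMAC security mentioned for the Allocation Manager can be sketched with the Python standard library. This is only an illustration of keyed message authentication over a message body: the shared key, the payload, and the choice to carry the digest as a hex string are assumptions, not the SSSRMAP or Qbank wire format.

```python
import hashlib
import hmac


def sign(body: bytes, key: bytes) -> str:
    """Return the HMAC-SHA1 of the body as a 40-character hex string."""
    return hmac.new(key, body, hashlib.sha1).hexdigest()


def verify(body: bytes, key: bytes, signature: str) -> bool:
    """Recompute the HMAC and compare in constant time."""
    return hmac.compare_digest(sign(body, key), signature)


if __name__ == "__main__":
    shared_key = b"example-shared-secret"          # hypothetical key
    request = b"<Request action='Query'/>"         # hypothetical payload
    tag = sign(request, shared_key)
    print(verify(request, shared_key, tag))        # sender and receiver agree
    print(verify(request + b" ", shared_key, tag)) # any change breaks the tag
```

Both sides need the same secret key, so a scheme like this authenticates a message but does not by itself hide its contents.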
22Meeting Notes
Scott Jackson: RM progress (cont.)
- Next work: prepare for SC demos, scalability testing. BIG thing is release of v1.0 RM system. Documentation, security authentication, extend suspend/resume schema beyond what PBS, LL does today
- Discussion of the need for a scalability testbed
Erik Debenedictis: validation progress
- Create machine independent test for testing supercomputers
- Infrastructure: QMTest
- Tests (from all sources)
- Value: improved method to execute the SSS Standard Test body
- Recent Activity: QMTest on SNL SciDAC cluster, test package definition
Will McClendon: test architecture (diagram in slides)
- QMTest is a scriptable test driver in Python
- HTTP based interface, Zope
- Running at SNL and PSC
- Requires exact match on STDOUT/STDERR
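The exact-match requirement on STDOUT/STDERR can be shown with a small sketch. It does not use QMTest's own API; it is a hypothetical stand-in built on `subprocess` that runs a command and passes only when both output streams and the exit code match the expected values exactly.

```python
import subprocess
import sys


def run_exact_match(cmd, expected_stdout, expected_stderr="", expected_code=0):
    """Run cmd and compare stdout, stderr, and exit code byte-for-byte.

    Returns ("PASS", []) or ("FAIL", [names of the fields that differed]).
    """
    proc = subprocess.run(cmd, capture_output=True, text=True)
    failures = []
    if proc.stdout != expected_stdout:
        failures.append("stdout")
    if proc.stderr != expected_stderr:
        failures.append("stderr")
    if proc.returncode != expected_code:
        failures.append("exit_code")
    return ("PASS" if not failures else "FAIL", failures)


if __name__ == "__main__":
    # Hypothetical command standing in for a real cluster test.
    cmd = [sys.executable, "-c", "print('node0 up')"]
    print(run_exact_match(cmd, "node0 up\n"))
    # A missing trailing newline is enough to fail an exact-match test.
    print(run_exact_match(cmd, "node0 up"))
```

Exact matching is simple to automate but brittle: output containing timestamps or host names differs between runs, so it has to be filtered or interpreted before comparison.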
23Meeting Notes
Will McClendon: test architecture (cont.)
- QMTest screenshot and discussion of how tests are done
- Raw results need to be interpreted to determine pass or fail
Mike ???: goes over the package details
- How to create a test package to add to the suite
- Package file layout
- Make-like
- Will present as a proposal tomorrow
Paul Hargrove: PM group progress
- Prototyping and development continue
- How to interface to something we can't imagine
- Validating schema for process manager
- Node monitor schema created
- Checkpoint Manager types: serial checkpoints (independent but potentially multithreaded), done; parallel checkpoints (MPI)
- Scalable systems XML interfaces
24Meeting Notes
Rusty Lusk: process manager (see diagram in his slides)
- MPD1 (C) overview: added capabilities required by pmWG. MPD is one prototype for SSS Process Manager
- MPD2 (Python): diagram in slides for new design. Python about 5X slower with this untuned version
Mike Showerman: system monitoring component
- Craig Steffen full time on this project, and a student
- Using new XML schema defined by
- Need to write graphical display that uses this new XML interface
- Run a small cluster in NCSA booth with SSS software stack
- Discussion: how about an animated meatball diagram
Paul returns
- Data migration meatball removed
- Next steps: interfaces continue to stabilize (chkpt, PM, monitors)
- Monitoring data... details need defining
25Meeting Notes
Narayan Desai: Build and configure update
- Components: service directory (solid and on Chiba now), event manager completely rewritten, stable XML
- SSSlib robust (bindings for C, Java, Python, Perl) (wire protocol modules: basic, challenge, http, http-rm)
- Build and Config Management (third try at the abstraction): cluster HW build system (OSCAR module for this one in the works), node state manager
- Issues: abstraction problems with second try. Multiple implementations important to validate abstraction
DEMOS
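The role of a service directory can be sketched as a lookup table from component name to network location and wire protocol. This is a toy in-memory model, not the SSS service directory interface: the class, method names, host, and port are hypothetical, and only the four wire protocol names come from the notes above.

```python
from dataclasses import dataclass

# Wire protocol module names listed in the notes.
WIRE_PROTOCOLS = {"basic", "challenge", "http", "http-rm"}


@dataclass(frozen=True)
class Location:
    host: str
    port: int
    protocol: str


class ServiceDirectory:
    """Hypothetical in-memory registry: component name -> Location."""

    def __init__(self):
        self._entries = {}

    def register(self, component, host, port, protocol):
        if protocol not in WIRE_PROTOCOLS:
            raise ValueError(f"unknown wire protocol: {protocol}")
        self._entries[component] = Location(host, port, protocol)

    def lookup(self, component):
        try:
            return self._entries[component]
        except KeyError:
            raise LookupError(f"no such component: {component}") from None


if __name__ == "__main__":
    directory = ServiceDirectory()
    # Host and port are made-up values for illustration.
    directory.register("queue-manager", "node0.example.org", 7000, "http")
    print(directory.lookup("queue-manager"))
```

With a directory like this, a client needs to know only where the directory itself lives; every other component is found by name at run time.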
26Refined Picture on Next Slide
Service Directory
File System
Meta Scheduler
Meta Monitor
Meta Manager
Event Manager
Meta Services
Node Configuration Build Manager
System Job Monitor
Accounting
Scheduler
Allocation Management
Process Manager
Job Queue Manager
User DB
High Performance Communication I/O
Checkpoint / Restart
Usage Reports
User Utilities
Testing Validation
Application Environment
Blue text uses ssslib
Red text talks ssslib protocol
27Grid Interfaces
Meta Scheduler
Meta Monitor
Meta Manager
Meta Services
Accounting
Scheduler
System Job Monitor
Node Configuration Build Manager
These Interface To all
Service Directory
File System
Event Manager
Allocation Management
User DB
Process Manager
Job Queue Manager
Usage Reports
High Performance Communication I/O
User Utilities
Checkpoint / Restart
Application Environment