Title: DIVISION INFRASTRUCTURE and PLANNING
1 - DIVISION INFRASTRUCTURE and PLANNING
- Vicky White
- Fermilab
- March 17, 2004
2 - Division Direction (26 FTE)
- Head: Vicky White (also the lab's Cyber Security Executive, with active deputy Dane Skow)
- Deputy Head: Robert Tschirhart
  - Chief Scientist of the Division
  - MOUs and stakeholder requirements
- 3 Associate Heads with cross-cutting responsibilities and a small team of staff each
  - Facilities, ESH: Gerry Bellendir (6.5 FTE)
  - Budget, computing resource planning: Steve Wolbers (7 FTE)
  - External communications, admin staff, project initiation status, Division web: Ruth Pordes (9 FTE)
- 1 Assistant Head, DOE relations: Irwin Gaines (at DOE)
3 - How the Division works
- Unique (among HEP labs) in having a Computing Division that fully contributes to the scientific program
- Mix of Scientists, Engineers, Computing Professionals, Technical and Administrative Staff
- We think this works and we are very proud of it
- We encourage our scientists to do science and are proud of their scientific contributions
- We think communication with our stakeholders is outstanding and aided by this
- We believe in
  - System solutions: hardware and software
  - Matrixed project work across organizations
  - Common services and solutions
  - Evolving all of our systems aggressively (e.g. -> Grid)
4 - Computing Division ESH Program
- 793 days without a lost-time injury!
- 3 first-aid cases (15-month period)
5 - Computing Division ESH Program
- Training: Ergonomics, Beryllium Handling, Lead, Computer Room, GERT, Emergency Warden, Service Coordinator
- 96% complete on required ESH courses
- Ergonomic Workstation Reviews (about ¼ of the division has had their workstation reviewed)
- Hold annual fire and tornado drills
6 - Computing Division ESH Program
- Monthly walkthroughs with Department Heads (average 2 per month)
- Quarterly walkthroughs with DOE
- Assist in improving or maintaining building safety (ongoing)
- Investigate and record injuries (first-aid and recordable)
- Assist in writing and review of Hazard Analyses
7 - Facility Operations and Planning
- This is a big job!!
- We now have 3 facilities:
  - smaller computing room in Wilson Hall
  - Feynman Computing Center
  - New Muon center for Lattice Gauge
- High Density Computing Facility demolition and reconstruction starting April 5
- We have a posted opening for an assistant building manager
- Computer facility planning: space, power, networking, installations, etc.
- Facility construction planning, working with FESS
- Monitoring all of the systems in our facility (a minimal monitoring sketch follows this list)
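A minimal sketch of the sort of system monitoring this slide refers to, assuming a hypothetical inventory of hosts and a simple "is the machine reachable" check; the host names, port numbers, polling interval, and alert output are illustrative placeholders, not the Division's actual monitoring setup.

```python
#!/usr/bin/env python
"""Minimal facility-monitoring sketch: poll hosts, report the ones that fail.

Illustrative only -- the host list, ports and reporting are hypothetical,
not the Computing Division's actual monitoring configuration.
"""
import socket
import time

# Hypothetical inventory: (host, TCP port that should be listening)
HOSTS = [
    ("fileserver01.example.gov", 22),
    ("workernode17.example.gov", 22),
    ("tapemover03.example.gov", 22),
]

CHECK_INTERVAL_SECONDS = 300  # poll every five minutes


def host_is_up(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
        sock.close()
        return True
    except OSError:
        return False


def run_once():
    """Check every host once and print a one-line status report."""
    down = [host for host, port in HOSTS if not host_is_up(host, port)]
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    if down:
        print("%s ALERT: unreachable hosts: %s" % (stamp, ", ".join(down)))
    else:
        print("%s OK: all %d hosts responding" % (stamp, len(HOSTS)))


if __name__ == "__main__":
    while True:
        run_once()
        time.sleep(CHECK_INTERVAL_SECONDS)
```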
8 - Meetings and Communications inside CD
- Kitchen Cabinet meeting of Division Head, Deputy and Associate Heads weekly
- Department Heads meetings monthly
- Operations meeting weekly
- Budget meeting monthly, 2 budget retreats per year
- Briefings on issues/project proposals weekly
- Facility planning meetings regularly
- Project status reports weekly
- Matrixed projects meetings
  - R2D2 (Run 2 Data handling) -> Grid Projects
  - Accel Projects meeting (monthly)
  - CMS Projects (monthly)
- All-hands division meetings (2 or 3 per year)
9 - Stakeholder communications
- Bi-weekly meeting slot with CDF and D0 spokespersons and computing leaders
- As-needed meetings with other experiment spokespersons and/or computing leaders
- CD Head is a member of the CMS SWC PMG and the CMS Advisory Software & Computing Board
- CD Deputy attends the BTeV PMG
- CD representative on the CDF and D0 PMGs
- Stakeholders participate in briefings and status meetings to present needs, requests, etc.
- Lab Scheduling, LAM and All-Experimenters meetings
- Windows Policy committee
- Computer Security Working Group
10 - How do we set priorities?
- Listen and discuss with
  - Director, Associate Director for Research, Deputy Director, other division/section heads
  - PAC, HEPAP
  - Experiment spokespersons and computing leaders and liaisons
  - Run II Reviews
  - Project Reviews
  - US-CMS SWC PMG, ASCB and Reviews
  - Budget Retreat discussions
  - External project steering groups and collaborators
  - Funding agency contacts
- Then we just make decisions and do it!
11 - Evolving our workforce
- I issued a challenge in Jan 2003 to each department to become 10% more efficient in operational areas
  - so we would be able to invest and move forward into the future
- Big emphasis on measuring what we are doing: define your own metrics, but show us
- Strong encouragement to reassign staff, train, and offer opportunities to change assignments
12 - Has it worked?
- I believe it has worked to a large extent
- But there is much more to do
- We are down from 275 FTEs in Sep 03 to 258 FTEs, but that has brought stress in places
- We have taken on 15 FTE of work in the Accelerator Division (some taken from BTeV)
- We need to hire: we have 10 openings posted
- We will go to lights-out operations
  - We need more skilled computer professionals and fewer limited-skill operational staff
- We have taken a tough stance on performance
- We have no fat
  - no one is messing around on some unapproved project
  - everyone effort-reports each month
  - we need to work smarter, not harder, in some areas
13 - Some Common Services

Common Service | Customer/Stakeholder | Comments
Storage, data movement and caching | CDF, D0, CMS, MINOS, Theory, SDSS, KTeV, all | Enstore: 1.5 Petabytes of data! dCache, SRM (an illustrative caching sketch follows this table)
Databases | CDF, D0, MINOS, CMS, Accelerator, ourselves | Oracle 24x7; MySQL, Postgres
Networks, mail, print servers, helpdesk, Windows, Linux, etc. | Everyone! | First class, many 24x7; services lead Cyber Security
SAM-Grid | CDF, D0, MINOS | Aligning with LHC
Simulation, MC and analysis tools | CDF, D0, CMS, MINOS, Fixed Target, Accel. Div. | Growing needs
Farms | All experiments | Moving to Grid
Engineering support and R&D | CDF, D0, BTeV, JDEM, Accel. Div. projects | Queue outside our door
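To make the storage/caching row above concrete, here is a toy read-through cache in the spirit of a disk cache (dCache) sitting in front of a tape-backed store (Enstore). The classes and method names are hypothetical stand-ins, not the real dCache or Enstore interfaces; the point is only the pattern of staging missing files in from tape and serving repeat reads from disk.

```python
"""Read-through cache sketch: a disk cache in front of a tape store.

Illustrative only -- this is not the dCache or Enstore API, just a toy
model of the idea behind them: clients read through a disk cache, and
files missing from the cache are staged in from tape-backed storage.
"""


class TapeStore:
    """Stand-in for a tape-backed store such as Enstore (hypothetical API)."""

    def __init__(self, files):
        self._files = files  # {logical file name: bytes}

    def stage(self, lfn):
        """Simulate a slow tape mount and read."""
        print("staging %s from tape..." % lfn)
        return self._files[lfn]


class DiskCache:
    """Stand-in for a disk cache such as dCache (hypothetical API)."""

    def __init__(self, backend, capacity=3):
        self._backend = backend
        self._capacity = capacity
        self._cache = {}   # {lfn: bytes}
        self._order = []   # least recently used first

    def read(self, lfn):
        if lfn in self._cache:                 # cache hit: no tape access
            self._order.remove(lfn)
        else:                                  # cache miss: stage from tape
            if len(self._cache) >= self._capacity:
                evicted = self._order.pop(0)   # evict least recently used
                del self._cache[evicted]
            self._cache[lfn] = self._backend.stage(lfn)
        self._order.append(lfn)                # mark as most recently used
        return self._cache[lfn]


if __name__ == "__main__":
    tape = TapeStore({"run2/event%03d.dat" % i: b"data" for i in range(5)})
    cache = DiskCache(tape, capacity=2)
    for name in ["run2/event000.dat", "run2/event001.dat", "run2/event000.dat"]:
        cache.read(name)  # second read of event000 is served from disk
```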
14 - Budget FY04-FY06
15 - (No transcript)
16 - (No transcript)
17 - FTE spread FY04
18 - FTE spread FY05
19 - FTE spread FY06
20 - Risks

Risk | Type of Risk | Plan/mitigation
Provision of computer-center building infrastructure fails to keep up with programmatic demands for power and cooling for computing | Infrastructure | Multi-year plan to re-use existing buildings; separate plan each year to build to match the characteristics of the systems, given changing technologies.
Processing time for CDF or D0 events, and/or the need to reprocess, pushes computing needs outside the planning envelope | Programmatic | Establish a Grid model for provision of computing resources in a seamless way (already close to established). Execute the plan at Fermilab to make all computing generic Grid computing, to meet peak demands by load sharing.
Demands for serving up Run II data, both on-site and off-site, escalate to a point where the central storage and caching systems fail to scale | Programmatic | Much work has been done to assure the scalability of the central storage system. We have many robots and can add tape drives to robots in a scalable way.
Tape technologies do not continue to follow the cost/GB curve we plan for, or tape technologies become obsolete | Programmatic | We have two different types of robots, including two large ADIC flexible-media robots that can take a broad range of media types. If STK silos become obsolete and STK makes no new media, we expect LTO drives or their descendants to continue for several years. Our caching strategy allows us to go transparently to an all-disk solution, and to replicate data on disk, should this become cost effective. (A worked cost comparison sketch follows this table.)
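A small worked sketch of the cost/GB comparison behind the tape-technology mitigation above. Every price and decline rate here is a hypothetical placeholder chosen only to show the shape of the calculation, not an FY04 figure or a Fermilab projection.

```python
"""Toy tape-vs-disk cost comparison for storage planning.

All prices and yearly decline rates below are hypothetical placeholders
used only to illustrate the kind of cost/GB comparison behind the
mitigation above; they are not actual FY04 figures or lab projections.
"""

# Hypothetical starting costs (dollars per GB) and yearly decline rates.
TAPE_COST_PER_GB = 0.40
DISK_COST_PER_GB = 2.00
TAPE_YEARLY_DECLINE = 0.20   # assume tape cost/GB falls 20% per year
DISK_YEARLY_DECLINE = 0.35   # assume disk cost/GB falls 35% per year

tape, disk = TAPE_COST_PER_GB, DISK_COST_PER_GB
for year in range(2004, 2012):
    marker = "  <- all-disk becomes cost effective" if disk <= tape else ""
    print("%d  tape $%.3f/GB  disk $%.3f/GB%s" % (year, tape, disk, marker))
    tape *= 1.0 - TAPE_YEARLY_DECLINE
    disk *= 1.0 - DISK_YEARLY_DECLINE
```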
21 - Risks

Risk | Type of Risk | Plan/mitigation
We rely on Grid computing to solve many problems. If the Grid has been oversold or oversubscribed, and Run II experiments have increasing difficulty getting resources as we approach LHC turn-on, this could limit the physics from Run II. | Programmatic | We plan to maintain a solid base of processing capability at Fermilab. Experiments will have to make hard choices that could limit the physics.
Success with Accelerator Division joint projects means we are likely to be asked to stay engaged in this work longer; this is already happening. Applying resources to BTeV has to be balanced against these needs. | Programmatic | Plan carefully what we take on.
For the Grid to work, the network infrastructure must be highly performant to all locations. | Programmatic | Fermilab is procuring a fiber connection to StarLight. Fermilab worked on the ESnet roadmap report in the Office of Science and is now working with ESnet to use the fiber for a Metropolitan Area Network, with ANL. R&D proposals and a continual push on improved networking capabilities worldwide (ICFA SCIC), Internet working group, etc.
22 - Risks

Risk | Type of Risk | Plan/mitigation
All data tapes are in FCC, and all data tapes for one experiment are in one, or at most two, tape silos. The risk of catastrophic data loss is low, but non-zero. | Programmatic and Infrastructure | Working on physical infrastructure to house silo(s); combining all silos into one logical system; dispersal of data to multiple physical locations.
Satellite computer-center buildings will not have generator backup, only UPS, to allow for orderly shutdown of systems on power failure. | Programmatic | Need 10% more processors to mitigate the effects of power outages, which leave many dead systems in their wake. Have adopted a policy on use of buildings to minimize the effects of downtime of worker nodes, keeping file servers and machines with state in FCC. (A shutdown-ordering sketch follows this table.)
Satellite computer centers need money to run them. FCC costs us a lot to run; FESS do not provide all of the services, and we have to pay for many contracts ourselves. Each additional building will need maintenance, to high standards, if millions of dollars of computing are to be housed and monitored within it. | Infrastructure | We still need to squeeze these costs out of the budget. If necessary we will have to tax purchasers of computing.
The plan for a lights-out computing center could get derailed. Two legacy tape systems are being migrated to robotic storage. Building monitoring systems need improvement. | Programmatic | Finish executing the plan to put all active data into a robot. Work with FESS on enhanced and secure access to building monitoring information is ongoing.
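A minimal sketch of the orderly-shutdown idea in the UPS row above, assuming a hypothetical node inventory and an ssh-based shutdown command; it only illustrates the stated policy of powering off stateless worker nodes first while file servers and machines with state stay in FCC. None of the names or commands below are the Division's actual procedure.

```python
"""Orderly-shutdown sketch for a UPS-only satellite computer room.

Illustrative only -- node names, roles and the shutdown command are
hypothetical placeholders, not the Division's actual procedure. The idea:
on loss of utility power, stateless worker nodes are shut down first so
the limited UPS runtime is spent on anything holding state.
"""
import subprocess

# Hypothetical inventory: (hostname, role). Machines with state stay in FCC
# under generator power, so a satellite room should hold mostly workers.
NODES = [
    ("worker001.example.gov", "worker"),
    ("worker002.example.gov", "worker"),
    ("cachepool01.example.gov", "stateful"),
]

# Shut down stateless nodes first, stateful nodes last.
SHUTDOWN_ORDER = ("worker", "stateful")


def shutdown(host):
    """Ask a node to power off cleanly (hypothetical ssh-based command)."""
    return subprocess.call(["ssh", host, "sudo", "shutdown", "-h", "now"])


def on_power_failure():
    """Called when the UPS reports loss of utility power."""
    for role in SHUTDOWN_ORDER:
        for host, node_role in NODES:
            if node_role == role:
                print("shutting down %s (%s)" % (host, role))
                shutdown(host)


if __name__ == "__main__":
    on_power_failure()
```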