Title: DIVISION INFRASTRUCTURE and PLANNING
1 - DIVISION INFRASTRUCTURE and PLANNING
- Vicky White
- Fermilab
- March 17, 2004
2 - Division Direction (26 FTE)
- Head: Vicky White (also the lab's Cyber Security Executive, with active deputy Dane Skow)
- Deputy Head: Robert Tschirhart
  - Chief Scientist of the Division
  - MOUs and stakeholder requirements
- 3 Associate Heads with cross-cutting responsibilities and a small team of staff each
  - Facilities, ESH: Gerry Bellendir (6.5 FTE)
  - Budget, computing resource planning: Steve Wolbers (7 FTE)
  - External communications, admin staff, project initiation status, Division web: Ruth Pordes (9 FTE)
- 1 Assistant Head, DOE relations: Irwin Gaines (at DOE)
3 - How the Division works
- Unique (among HEP labs) in having a Computing Division that fully contributes to the scientific program
- Mix of Scientists, Engineers, Computing Professionals, Technical and Administrative Staff
- We think this works and we are very proud of it
- We encourage our scientists to do science and are proud of their scientific contributions
- We think communication with our stakeholders is outstanding and aided by this
- We believe in
  - System solutions: hardware and software
  - Matrixed project work across organizations
  - Common services and solutions
  - Evolving all of our systems aggressively (e.g. -> Grid)
4 - Computing Division ESH Program
- 793 days without a lost-time injury!
- 3 first-aid cases (15-month period)
5 - Computing Division ESH Program
- Training: Ergonomics, Beryllium Handling, Lead, Computer Room, GERT, Emergency Warden, Service Coordinator
- 96% complete on required ESH courses
- Ergonomic Workstation Reviews (about ¼ of the division has had their workstation reviewed)
- Hold annual fire and tornado drills
6 - Computing Division ESH Program
- Monthly walkthroughs with Department Heads (average 2 per month)
- Quarterly walkthroughs with DOE
- Assist in improving or maintaining building safety (ongoing)
- Investigate and record injuries (first-aid and recordable)
- Assist in writing and review of Hazard Analyses
7 - Facility Operations and Planning
- This is a big job!!
- We now have 3 facilities:
  - smaller computing room in Wilson Hall
  - Feynman Computing Center
  - New Muon center for Lattice Gauge
- High Density Computing Facility demolition and reconstruction starting April 5
- We have a posted opening for an assistant building manager
- Computer facility planning: space, power, networking, installations, etc.
- Facility construction planning, working with FESS
- Monitoring all of the systems in our facility (a minimal monitoring sketch follows this list)
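A minimal sketch of the sort of system monitoring this slide refers to, assuming a hypothetical inventory of hosts and a simple "is the machine reachable" check; the host names, port numbers, polling interval, and alert output are illustrative placeholders, not the Division's actual monitoring setup.

```python
#!/usr/bin/env python
"""Minimal facility-monitoring sketch: poll hosts, report the ones that fail.

Illustrative only -- the host list, ports and reporting are hypothetical,
not the Computing Division's actual monitoring configuration.
"""
import socket
import time

# Hypothetical inventory: (host, TCP port that should be listening)
HOSTS = [
    ("fileserver01.example.gov", 22),
    ("workernode17.example.gov", 22),
    ("tapemover03.example.gov", 22),
]

CHECK_INTERVAL_SECONDS = 300  # poll every five minutes


def host_is_up(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
        sock.close()
        return True
    except OSError:
        return False


def run_once():
    """Check every host once and print a one-line status report."""
    down = [host for host, port in HOSTS if not host_is_up(host, port)]
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    if down:
        print("%s ALERT: unreachable hosts: %s" % (stamp, ", ".join(down)))
    else:
        print("%s OK: all %d hosts responding" % (stamp, len(HOSTS)))


if __name__ == "__main__":
    while True:
        run_once()
        time.sleep(CHECK_INTERVAL_SECONDS)
```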
8 - Meetings and Communications inside CD
- Kitchen Cabinet meeting of Division Head, Deputy and Associate Heads weekly
- Department Heads meetings monthly
- Operations meeting weekly
- Budget meeting monthly, 2 budget retreats per year
- Briefings on issues/project proposals weekly
- Facility planning meetings regularly
- Project status reports weekly
- Matrixed projects meetings
  - R2D2 (Run 2 Data handling) -> Grid Projects
  - Accel Projects meeting (monthly)
  - CMS Projects (monthly)
- All-hands division meetings (2 or 3 per year)
9 - Stakeholder communications
- Bi-weekly meeting slot with CDF and D0 spokespersons and computing leaders
- As-needed meetings with other experiment spokespersons and/or computing leaders
- CD Head is a member of the CMS SWC PMG and the CMS Advisory Software & Computing Board
- CD Deputy attends the BTeV PMG
- CD representative on the CDF and D0 PMGs
- Stakeholders participate in briefings and status meetings to present needs, requests, etc.
- Lab Scheduling, LAM and All-Experimenters meetings
- Windows Policy committee
- Computer Security Working Group
10 - How do we set priorities?
- Listen and discuss with
  - Director, Associate Director for Research, Deputy Director, other division/section heads
  - PAC, HEPAP
  - Experiment spokespersons and computing leaders and liaisons
  - Run II Reviews
  - Project Reviews
  - US-CMS SWC PMG, ASCB and Reviews
  - Budget Retreat discussions
  - External project steering groups and collaborators
  - Funding agency contacts
- Then we just make decisions and do it!
11 - Evolving our workforce
- I issued a challenge in Jan 2003 to each department to become 10% more efficient in operational areas
  - so we would be able to invest and move forward into the future
- Big emphasis on measuring what we are doing: define your own metrics, but show us
- Strong encouragement to reassign staff, train, and offer opportunities to change assignments
12 - Has it worked?
- I believe it has worked to a large extent
- But there is much more to do
- We are down from 275 FTEs in Sep 03 to 258 FTEs, but that has brought stress in places
- We have taken on 15 FTE of work in the Accelerator Division (some taken from BTeV)
- We need to hire: we have 10 openings posted
- We will go to lights-out operations
  - We need more skilled computer professionals and fewer limited-skill operational staff
- We have taken a tough stance on performance
- We have no fat
  - no one is messing around on some unapproved project
  - everyone effort-reports each month
  - we need to work smarter, not harder, in some areas
13 - Some Common Services

Common Service | Customer/Stakeholder | Comments
Storage, data movement and caching | CDF, D0, CMS, MINOS, Theory, SDSS, KTeV, all | Enstore: 1.5 Petabytes of data! dCache, SRM (an illustrative caching sketch follows this table)
Databases | CDF, D0, MINOS, CMS, Accelerator, ourselves | Oracle 24x7; MySQL, Postgres
Networks, mail, print servers, helpdesk, Windows, Linux, etc. | Everyone! | First class, many 24x7; services lead Cyber Security
SAM-Grid | CDF, D0, MINOS | Aligning with LHC
Simulation, MC and analysis tools | CDF, D0, CMS, MINOS, Fixed Target, Accel. Div. | Growing needs
Farms | All experiments | Moving to Grid
Engineering support and R&D | CDF, D0, BTeV, JDEM, Accel. Div. projects | Queue outside our door
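To make the storage/caching row above concrete, here is a toy read-through cache in the spirit of a disk cache (dCache) sitting in front of a tape-backed store (Enstore). The classes and method names are hypothetical stand-ins, not the real dCache or Enstore interfaces; the point is only the pattern of staging missing files in from tape and serving repeat reads from disk.

```python
"""Read-through cache sketch: a disk cache in front of a tape store.

Illustrative only -- this is not the dCache or Enstore API, just a toy
model of the idea behind them: clients read through a disk cache, and
files missing from the cache are staged in from tape-backed storage.
"""


class TapeStore:
    """Stand-in for a tape-backed store such as Enstore (hypothetical API)."""

    def __init__(self, files):
        self._files = files  # {logical file name: bytes}

    def stage(self, lfn):
        """Simulate a slow tape mount and read."""
        print("staging %s from tape..." % lfn)
        return self._files[lfn]


class DiskCache:
    """Stand-in for a disk cache such as dCache (hypothetical API)."""

    def __init__(self, backend, capacity=3):
        self._backend = backend
        self._capacity = capacity
        self._cache = {}   # {lfn: bytes}
        self._order = []   # least recently used first

    def read(self, lfn):
        if lfn in self._cache:                 # cache hit: no tape access
            self._order.remove(lfn)
        else:                                  # cache miss: stage from tape
            if len(self._cache) >= self._capacity:
                evicted = self._order.pop(0)   # evict least recently used
                del self._cache[evicted]
            self._cache[lfn] = self._backend.stage(lfn)
        self._order.append(lfn)                # mark as most recently used
        return self._cache[lfn]


if __name__ == "__main__":
    tape = TapeStore({"run2/event%03d.dat" % i: b"data" for i in range(5)})
    cache = DiskCache(tape, capacity=2)
    for name in ["run2/event000.dat", "run2/event001.dat", "run2/event000.dat"]:
        cache.read(name)  # second read of event000 is served from disk
```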
14 - Budget FY04-FY06
15 - (No transcript)
16 - (No transcript)
17 - FTE spread FY04
18 - FTE spread FY05
19 - FTE spread FY06
20 - Risks

Risk | Type of Risk | Plan/mitigation
Provision of computer-center building infrastructure fails to keep up with programmatic demands for power and cooling for computing | Infrastructure | Multi-year plan to re-use existing buildings; separate plan each year to build to match the characteristics of the systems, given changing technologies.
Processing time for CDF or D0 events, and/or the need to reprocess, pushes computing needs outside the planning envelope | Programmatic | Establish a Grid model for provision of computing resources in a seamless way (already close to established). Execute the plan at Fermilab to make all computing generic Grid computing, to meet peak demands by load sharing.
Demands for serving up Run II data, both on-site and off-site, escalate to a point where the central storage and caching systems fail to scale | Programmatic | Much work has been done to assure the scalability of the central storage system. We have many robots and can add tape drives to robots in a scalable way.
Tape technologies do not continue to follow the cost/GB curve we plan for, or tape technologies become obsolete | Programmatic | We have two different types of robots, including two large ADIC flexible-media robots that can take a broad range of media types. If STK silos become obsolete and STK makes no new media, we expect LTO drives or their descendants to continue for several years. Our caching strategy allows us to go transparently to an all-disk solution, and to replicate data on disk, should this become cost effective. (A worked cost comparison sketch follows this table.)
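A small worked sketch of the cost/GB comparison behind the tape-technology mitigation above. Every price and decline rate here is a hypothetical placeholder chosen only to show the shape of the calculation, not an FY04 figure or a Fermilab projection.

```python
"""Toy tape-vs-disk cost comparison for storage planning.

All prices and yearly decline rates below are hypothetical placeholders
used only to illustrate the kind of cost/GB comparison behind the
mitigation above; they are not actual FY04 figures or lab projections.
"""

# Hypothetical starting costs (dollars per GB) and yearly decline rates.
TAPE_COST_PER_GB = 0.40
DISK_COST_PER_GB = 2.00
TAPE_YEARLY_DECLINE = 0.20   # assume tape cost/GB falls 20% per year
DISK_YEARLY_DECLINE = 0.35   # assume disk cost/GB falls 35% per year

tape, disk = TAPE_COST_PER_GB, DISK_COST_PER_GB
for year in range(2004, 2012):
    marker = "  <- all-disk becomes cost effective" if disk <= tape else ""
    print("%d  tape $%.3f/GB  disk $%.3f/GB%s" % (year, tape, disk, marker))
    tape *= 1.0 - TAPE_YEARLY_DECLINE
    disk *= 1.0 - DISK_YEARLY_DECLINE
```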
21 - Risks

Risk | Type of Risk | Plan/mitigation
We rely on Grid computing to solve many problems. If the Grid has been oversold or oversubscribed, and Run II experiments have increasing difficulty getting resources as we approach LHC turn-on, this could limit the physics from Run II. | Programmatic | We plan to maintain a solid base of processing capability at Fermilab. Experiments will have to make hard choices that could limit the physics.
Success with Accelerator Division joint projects means we are likely to be asked to stay engaged in this work longer; this is already happening. Applying resources to BTeV has to be balanced against these needs. | Programmatic | Plan carefully what we take on.
For the Grid to work, the network infrastructure must be highly performant to all locations. | Programmatic | Fermilab is procuring a fiber connection to StarLight. Fermilab worked on the ESnet roadmap report in the Office of Science and is now working with ESnet to use the fiber for a Metropolitan Area Network, with ANL. R&D proposals and a continual push on improved networking capabilities worldwide (ICFA SCIC), Internet working group, etc.
22 - Risks

Risk | Type of Risk | Plan/mitigation
All data tapes are in FCC, and all data tapes for one experiment are in one, or at most two, tape silos. The risk of catastrophic data loss is low, but non-zero. | Programmatic and Infrastructure | Working on physical infrastructure to house silo(s); combining all silos into one logical system; dispersal of data to multiple physical locations.
Satellite computer-center buildings will not have generator backup, only UPS, to allow for orderly shutdown of systems on power failure. | Programmatic | Need 10% more processors to mitigate the effects of power outages, which leave many dead systems in their wake. Have adopted a policy on use of buildings to minimize the effects of downtime of worker nodes, keeping file servers and machines with state in FCC. (A shutdown-ordering sketch follows this table.)
Satellite computer centers need money to run them. FCC costs us a lot to run; FESS do not provide all of the services, and we have to pay for many contracts ourselves. Each additional building will need maintenance, to high standards, if millions of dollars of computing are to be housed and monitored within it. | Infrastructure | We still need to squeeze these costs out of the budget. If necessary we will have to tax purchasers of computing.
The plan for a lights-out computing center could get derailed. Two legacy tape systems are being migrated to robotic storage. Building monitoring systems need improvement. | Programmatic | Finish executing the plan to put all active data into a robot. Work with FESS on enhanced and secure access to building monitoring information is ongoing.
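A minimal sketch of the orderly-shutdown idea in the UPS row above, assuming a hypothetical node inventory and an ssh-based shutdown command; it only illustrates the stated policy of powering off stateless worker nodes first while file servers and machines with state stay in FCC. None of the names or commands below are the Division's actual procedure.

```python
"""Orderly-shutdown sketch for a UPS-only satellite computer room.

Illustrative only -- node names, roles and the shutdown command are
hypothetical placeholders, not the Division's actual procedure. The idea:
on loss of utility power, stateless worker nodes are shut down first so
the limited UPS runtime is spent on anything holding state.
"""
import subprocess

# Hypothetical inventory: (hostname, role). Machines with state stay in FCC
# under generator power, so a satellite room should hold mostly workers.
NODES = [
    ("worker001.example.gov", "worker"),
    ("worker002.example.gov", "worker"),
    ("cachepool01.example.gov", "stateful"),
]

# Shut down stateless nodes first, stateful nodes last.
SHUTDOWN_ORDER = ("worker", "stateful")


def shutdown(host):
    """Ask a node to power off cleanly (hypothetical ssh-based command)."""
    return subprocess.call(["ssh", host, "sudo", "shutdown", "-h", "now"])


def on_power_failure():
    """Called when the UPS reports loss of utility power."""
    for role in SHUTDOWN_ORDER:
        for host, node_role in NODES:
            if node_role == role:
                print("shutting down %s (%s)" % (host, role))
                shutdown(host)


if __name__ == "__main__":
    on_power_failure()
```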