Title: [Amusing title goes here] GridPP project management
- Sarah Pearce
- 8 September 2009
- GridPP23
2. Project Map: GridPP3 Q4 08
3. Project Map: GridPP3 Q2 09
4. Project map - statistics
Status                           Q2 08  Q3 08  Q4 08  Q1 09  Q2 09
Metric OK                           99    142    155    172    184
Metric close to target              24     47     39     32     22
Metric not OK                       41     32     32     21     27
Not able to be measured             27     22     11     10      3
Milestone achieved                  11     22     32     42     57
Milestone overdue                    2      7     13     17      4
Milestone not due / metric n/a     101     80     69     60     58
Suspended                            0      6      6      9     12
Awaiting input                      34      5     12     10      3
Total                              339    363    369    373    370
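The table can be checked with a short Python script. It re-sums each quarter's status counts, compares them with the Total row, and computes the share of all tracked items that are "Metric OK". The counts are transcribed from the table. The function names are illustrative only and are not part of any GridPP tool.

```python
# Status counts transcribed from the project-map statistics table.
QUARTERS = ["Q2 08", "Q3 08", "Q4 08", "Q1 09", "Q2 09"]

COUNTS = {
    "Metric OK": [99, 142, 155, 172, 184],
    "Metric close to target": [24, 47, 39, 32, 22],
    "Metric not OK": [41, 32, 32, 21, 27],
    "Not able to be measured": [27, 22, 11, 10, 3],
    "Milestone achieved": [11, 22, 32, 42, 57],
    "Milestone overdue": [2, 7, 13, 17, 4],
    "Milestone not due / metric n/a": [101, 80, 69, 60, 58],
    "Suspended": [0, 6, 6, 9, 12],
    "Awaiting input": [34, 5, 12, 10, 3],
}

# The "Total" row as reported on the slide.
TOTALS = [339, 363, 369, 373, 370]


def column_sums(counts):
    """Add up every status row for each quarter (i.e. sum down each column)."""
    return [sum(column) for column in zip(*counts.values())]


def share(row, counts, totals):
    """Fraction of all tracked items that are in status `row`, per quarter."""
    return [n / t for n, t in zip(counts[row], totals)]


if __name__ == "__main__":
    for quarter, computed, reported in zip(QUARTERS, column_sums(COUNTS), TOTALS):
        verdict = "matches" if computed == reported else "MISMATCH"
        print(f"{quarter}: rows sum to {computed}, Total row {reported} ({verdict})")
    for quarter, fraction in zip(QUARTERS, share("Metric OK", COUNTS, TOTALS)):
        print(f"{quarter}: {fraction:.0%} of items are 'Metric OK'")
```

In every quarter the rows sum exactly to the reported Total. The "Metric OK" share rises from about 29% in Q2 08 to about 50% in Q2 09.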
5. Experiments - red metrics
- ATLAS, CMS and Other experiments
  - No red metrics (although ATLAS has lots of amber)
  - Previously, ATLAS job failures were dominated by access to storage.
- LHCb
  - 1.2.2 - MC production (generation) efficiency (84/target 95)
  - 1.2.3 - T1 MC production (reconstruction, stripping) efficiency (55/90)
  - 1.2.4 - T1 MC/Event user analysis UK efficiency (43/70)
  - 1.2.11 - LHCb SAM tests uptime T1 (98/82)
  - 1.2.23 - Keep LHCb GANGA user training material updated
The shortfalls are mainly due to problems with LHCb application software. There were also various scheduled downtimes at the T1 (the move to the new building, plus CASTOR and network developments). LHCb note that support at the UK Tier-1 and Tier-2 sites for MC production has been excellent. They also note that communications between sites and the experiment have improved.
6. Grid services
- Operations
  - 2.1.3 - Proportion of available jobslots used (51/target 80)
  - 2.1.6 - Job success rates (85/95), down from 90 in Q4 08. Expected to decrease as sites get busier.
  - 2.1.10 - GridPP deployment web-pages up-to-date: review underway
- Rest of Grid Services
  - No red metrics or milestones
7. Tier-1 - metrics
- Front end systems
  - 3.1.8 - Availability of CE service (91/99): scheduled downtime, R89, network
- Resource delivery
  - 3.2.11 - Farm Occupancy (67/target 80), up from 45 in Q4 08
  - 3.2.13 - Quarterly report not available
  - Previously 3.2.10 - Job Efficiency (now 88, was 69)
- Storage systems
  - 3.4.4 - % met of UB Allocation for Disk (87/100). UB allocations to be revised.
  - 3.4.8 - CASTOR SAM tests, LHC VOs (93/99): CASTOR development, R89, network
8. Tier-1 overdue milestones
- Front end systems
  - 3.1.22 - LHC Monitoring infrastructure operational at RAL: waiting on work by Dante
- Resource delivery
  - 3.2.16 - Disaster and Business Continuity Plan available
  - 3.2.18 - Disaster Plan fully implemented
    - The new disaster management system is operational and working well, but some contingency plans remain to be completed.
- Storage systems
  - General ADS Service Ends: this has not been a priority, but the closure process has started.
9. Tier-2s
- % of promised disk and CPU available: green for all Tier-2s (metrics 1-2)
- SAM availability and reliability tests: green for most Tier-2s (metrics 3-4). This is not a weighted average, so a couple of poorly performing small sites can bring it down.
- Other red metrics
  - Metric 5 - SLL ATLAS test performance: LondonGrid, SouthGrid
  - 4.2.6 - Average SLL SE test performance: ScotGrid (86/95)
  - CPU utilisation (wall clock time vs CPU time, metrics 7-8): LondonGrid, SouthGrid
  - % of disk used (metric 9): ScotGrid, SouthGrid
  - Number of management meetings (metric 11): NorthGrid
  - Middleware upgrading (metric 14): LondonGrid
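A small Python sketch shows why an unweighted average can be dragged down by a couple of small, poorly performing sites. The site names, availability figures and weights are all hypothetical, invented only to show the effect; they are not GridPP data. Using site capacity as the weight is also an assumption made for the example.

```python
# Hypothetical illustration only: site names, availabilities (%) and weights
# are invented; "weight" stands in for something like site capacity.
SITES = [
    ("large-site-a", 98.0, 1000),
    ("large-site-b", 97.0, 800),
    ("large-site-c", 96.0, 600),
    ("small-site-d", 60.0, 50),
    ("small-site-e", 55.0, 40),
]


def unweighted_mean(sites):
    """Plain average: every site counts equally, whatever its size."""
    return sum(avail for _, avail, _ in sites) / len(sites)


def weighted_mean(sites):
    """Average in which each site counts in proportion to its weight."""
    total_weight = sum(weight for _, _, weight in sites)
    return sum(avail * weight for _, avail, weight in sites) / total_weight


if __name__ == "__main__":
    print(f"Unweighted average availability: {unweighted_mean(SITES):.1f}%")
    print(f"Weighted average availability:   {weighted_mean(SITES):.1f}%")
```

With these made-up numbers, the plain average is 81.2% while the capacity-weighted average is about 95.7%. The difference arises because the two small sites carry little weight once size is taken into account.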
10. Management and external
- Project execution red metrics
  - Nearly all staff now in post
  - 5.2.9 - CB meetings (target 1 per year)
- NGI
  - Milestones amended in light of EGI developments
- Outreach, LCG and EGEE
  - No red metrics
11. Risk register
12. High level risks
- R1 Recruitment and retention difficulties
  - Likelihood 3, impact 3 (reduced from 4, 3)
  - Nearly all staff now in place, but staff turnover remains a concern
- R12 Machine room problems compromise Tier-1
  - Likelihood 4, impact 3
  - Transfer to R89 went smoothly, but subsequent issues (an air conditioning problem and a water leak) have triggered the Tier-1 disaster planning process
- R5 Service insufficiently resilient with respect to storage: downgraded to medium risk (2, 4)
  - Resilience expressed explicitly
  - CASTOR more stable
  - But the impact of problems increases as data taking approaches
13. Finances
- £335k of Tier-1 hardware rolled over from FY08 to FY09
- £1m of Tier-1 hardware delayed until FY10
- Most Tier-2 hardware grants should be in early FY10; a small number of sites require them in FY09
14. Staffing
- Some areas have not finished recruiting, so funded effort is below that expected
- But in all cases this is more than compensated for by unfunded effort
15. Next steps
- New Oversight Committee next week
- Transition plan for EGEE posts to EGI
- This quarter finishes at the end of September; a reminder will be sent out then for the next round of reports
- Aim to keep the Quarterly reports metric green, but NOT using the Dilbert method