Title: GridPP use- interoper- communic- ability
1GridPP use-interoper-communic-ability
Tony Doyle
2Introduction
- Is the system usable?
- How will GridPP and NGS interoperate?
- Communication and discussion introduction
3A. Usability (Prequel)
- GridPP runs a major part of the EGEE/LCG Grid, which supports 3000 users
- The Grid is not (yet) as transparent as end-users want it to be
- The underlying overall failure rate is 10%
- User (interface)s, middleware and operational procedures (need to) adapt
- (see talk by Jeremy for more info. on performance and operations)
- Procedures to manage the underlying problems such that the system is usable are highlighted
4EGEE CPU hours (1 April 2006 to 31 July 2006)
Active User requires thousands of CPU hours
5 million hours
5Virtual Organisations
- Users are grouped into Virtual Organisations
- Users/VO varies from 1 to 806 members (and growing..)
- Broadly four classes of VO
- LHC experiments
- EGEE supported
- Worldwide (mainly non-LHC particle physics)
- Local/regional e.g. UK PhenoGrid
- Sites can choose which VOs to support, subject to MOU/funding commitments
- Most GridPP sites support 20 VOs
- GridPP nominally allocates 1% of resources to EGEE non-HEP VOs
- GridPP currently contributes 30% of the EGEE CPU resources
6User View?
- Perspective matters
- This is not
- a usability survey
- unbiased
- representative
- Straw poll
- users overcame initial registration hurdles within two weeks
- users adapt to Grid in (un-)coordinated ways
- The Grid was sufficiently flexible for many analysis applications
7Physics Analysis
[Figure: physics analysis data flow. Labels: Raw Data; ESD Data or Monte Carlo; Event Tags; Event Selection; Calibration Data; Analysis, Skims; Physics Objects; Collaboration-wide Tasks; Analysis Groups; Individual Physicists; Increasing Data Flow]
8User evolution
- Number of UK Grid users (exc. Deployment Team)
  Quarter   05Q4   06Q2   06Q3
  Value     1342   1831   2777
- Many EGEE VOs supported c.f. 3000 EGEE target
- Number of active users (> 10 jobs per month)
  Quarter   05Q4   06Q1   06Q2
  Value     83     166    201
  Fraction  6.2%          11.0%
- Viewpoint: growing fairly rapidly, but not as active as they could be? Depends on the "active" definition
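The quoted fractions can be checked by dividing active users by registered users for the quarters where the slide gives both counts. The short Python sketch below does this. The dictionary and function names are illustrative only, and it assumes the two fraction values belong to 05Q4 and 06Q2, the only quarters with both figures.

```python
# Figures copied from the slide. 06Q1 has no registered-user count and
# 06Q3 has no active-user count, so no fraction can be formed for them.
registered = {"05Q4": 1342, "06Q2": 1831, "06Q3": 2777}
active = {"05Q4": 83, "06Q1": 166, "06Q2": 201}


def active_fraction(quarter):
    """Return active users as a percentage of registered users (one decimal),
    or None when either count is missing for that quarter."""
    if quarter not in registered or quarter not in active:
        return None
    return round(100.0 * active[quarter] / registered[quarter], 1)


for quarter in sorted(set(registered) | set(active)):
    print(quarter, active_fraction(quarter))
```

Changing the threshold that defines "active" (here more than 10 jobs per month) changes the numerator only, which is why the viewpoint depends on that definition.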
9Know your users? UK-enabled VOs
- 806 atlas, 763 dzero, 577 cms, 566 dteam, 150 lhcb, 131 alice, 75 bio, 65 dteamsgm, 41 esr,
  31 ilc, 27 atlassgm, 27 alicesgm, 21 cmsprg, 18 atlasprg, 17 fusn, 15 zeus, 13 dteamprg, 13 cmssgm,
  11 hone, 9 pheno, 9 geant, 7 babar, 6 aliceprg, 5 lhcbsgm, 5 biosgm, 3 babarsgm, 2 zeussgm,
  2 t2k, 2 geantsgm, 2 cedar, 1 phenosgm, 1 minossgm, 1 lhcbprg, 1 ilcsgm, 1 honesgm, 1 cdf
10User Interface
[Figure: screenshot of the Ganga GUI, showing dockable windows]
- The GUI is relatively low-level (jobs, file collections)
- Dynamic panels for higher level functions
11Complex Applications
12WLCG MoU
- Particle physicists collaborate, play roles and delegate
- e.g. prg: production group; sgm: software group managers
- Underpinned by Memoranda of Understanding
- Current MoU signatories
- China, France, Germany, Italy, India, Japan, Netherlands, Pakistan, Portugal, Romania, Taiwan, UK, USA
- Pending signatures
- Australia, Belgium, Canada, Czech Republic, Nordic, Poland, Russia, Spain, Switzerland, Ukraine
- Negotiation w.r.t. resource and service level
13Resource allocation
- Need to assign quotas and priorities to VOs and measure delivery
- VOMS provides group/role information in the proxy
- Tools to control quotas and priorities in site services being developed
- So far only at whole-VO level
- Maui batch scheduler is flexible, easy to map to groups/roles
- Sites set the target shares
- Can publish VO/group-specific values in GLUE schema, hence the RB can use them for scheduling
- Accounting tool (APEL) measures CPU use at global level (UK task)
- Storage accounting currently being added
- GridPP monitors storage across UK
- Privacy issues around user-level accounting, being solved by encryption
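As an illustration of mapping VOMS group/role information to site-set target shares, the Python sketch below parses an attribute string of the form "/atlas/Role=production" and looks up a share. It is a minimal sketch under stated assumptions: the function names, the share table and its percentages are hypothetical and do not describe any real site policy, Maui configuration or GLUE attribute.

```python
def parse_fqan(fqan):
    """Split a VOMS-style attribute string such as
    '/atlas/higgs/Role=production/Capability=NULL' into (vo, group, role).
    A role of 'NULL' is treated as no role."""
    role = None
    groups = []
    for part in fqan.split("/"):
        if not part:
            continue
        if part.startswith("Role="):
            value = part[len("Role="):]
            role = None if value == "NULL" else value
        elif part.startswith("Capability="):
            continue  # capability is ignored in this sketch
        else:
            groups.append(part)
    if not groups:
        raise ValueError("no VO component in %r" % fqan)
    return groups[0], "/" + "/".join(groups), role


# Hypothetical target shares (percent) keyed by (VO, role); a real site
# would set these itself in its batch scheduler configuration.
TARGET_SHARES = {
    ("atlas", "production"): 30,
    ("atlas", None): 20,
    ("pheno", None): 5,
}


def target_share(fqan, default=1):
    """Look up the share for a VO/role pair, falling back first to the
    whole-VO entry and then to a default share."""
    vo, _group, role = parse_fqan(fqan)
    if (vo, role) in TARGET_SHARES:
        return TARGET_SHARES[(vo, role)]
    return TARGET_SHARES.get((vo, None), default)


print(target_share("/atlas/Role=production"))
print(target_share("/atlas/higgs"))
```

The fallback to the whole-VO entry mirrors the current situation described above, where control exists so far only at whole-VO level.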
14User Support
- Becoming vital as the number of users grows
- But modest effort available in the various projects
- Global Grid User Support (GGUS) portal at Karlsruhe provides a central ticket interface
- Problems are categorised
- Tickets are classified by an on-duty Ticket Process Manager, and assigned to an appropriate support unit
- UK (GridPP) contributes support effort
- GGUS has a web-service interface to ticketing systems at each ROC
- Other support units are local mailing lists
- Mostly best-effort support, working hours only
- Currently tens of tickets/week
- Manageable, but may not scale much further
- Some tickets slip through the net
15Documentation & Training
- Need documentation and training for both system managers and users
- Mostly expert users up to now, but user community is expanding
- Induction of new VOs is a particular problem: no peer support
- EGEE is running User Fora for users to share experience
- Next in Manchester in May 07 (with OGF)
- EGEE has a dedicated training activity run by NeSC/Edinburgh
- Documentation is often a low priority, little dedicated effort
- The rapid pace of change means that material requires constant review
- Effort on documentation is now increasing
- GridPP has appointed a documentation officer
- GridPP web site, wiki
- Installation manual for admins is good
- There is also a wiki for admins to share experience
- Focus is now on user documentation
- New EGEE web site coming soon
16Alternative view?
- The number of users in the Grid School for the Gifted is manageable now
- The system may be too complex, requiring too much work by the average user?
- Or the (virtual) help desk may not be enough?
- Or the documentation may be misleading?
- Or..
- Having smart users helps (the current ones are)
17B. Interoperability
- GridPP/NGS meeting - Nottingham EMCC, September 2006
- Present: Tony Doyle, David Britton, Paul Jeffreys, David Wallom, Robin Middleton, Andy Richards, Stephen Pickles, Steven Young, Dave Colling, Peter Clarke, Neil Geddes
- Agenda
- Ultimate goals and the model for achieving them and any constraints
- Timetables
- Required software (in both directions)
18B. Interoperability
- Goals: a general discussion on what we might hope to achieve and why.
- Several key points made...
- Open question whether we ever need to actually have any closer partnership
- GridPP is focused on a relatively immediate goal and will always be constrained in some way by the broader LCG requirements
- NGS should be further from the bleeding edge in grid developments
- NGS affiliation and partnership model exists
- GridPP T2s all have MoUs which will need revamping under GridPP3. This will be an ideal opportunity to formalise any relationship between GridPP (T2s) and the NGS.
- It is unclear who is using EGEE (in the UK) and who could or would want to use it
- EGEE-UKI needs to do a better PR job within the UK
- Phenogrid are registering with EGEE
19B. Interoperability
- The current "minimal software stack" approach of NGS is being reviewed as a greater variety of partner resources are considered (data centres and research facilities)
- Different "stacks" will be relevant to different sorts of partners, i.e. there is likely to be a range of "NGS Profiles"
- For the foreseeable future, NGS is likely to exist in a world with multiple parallel software stacks and it will not be possible to merge them
- Installing parallel stacks or profiles is not a problem if they are easy to install and do not interfere
- One possibility is that the different NGS profiles would reflect different stacks such as GT4 or gLite
- Operations: can we present accounting information consistently?
20B. Interoperability
- What benefit is there in a GridPP site joining NGS?
- Much less relevant for sites where the resources are essentially dedicated for HEP. Where there are shared facilities with other fields then the generic and shared nature of the NGS can provide ready-made interfaces for the broader communities. We are clearly a long way from being able to merge both activities completely, e.g. GridPP requirements on monitoring and accounting could not currently be met by NGS nodes and NGS would not require all partners to report a la GridPP. (Of course this does not preclude project-specific layers such as this accounting on top of the basic NGS profiles, for relevant partners.)
- There is a concern that "joining" the NGS would put an additional load on the GridPP sites. Looking further ahead of course, the intention is that this is not the case, but that supporting the standard NGS profiles is exactly the same work as required to meet (a subset of) the GridPP requirements. This can only be guaranteed if there is sufficient representation of GridPP sites within the NGS.
21B. Interoperability
- Next steps/timetable
- GridPP3 MoUs - No action required. Can wait until next year and should be informed by lessons learned over the next 6-12 months. GridPP sites currently meet the minimal requirements for NGS through the standard GridPP installations.
- If sites enable the NGS VO then this effectively gives NGS affiliation if they wish.
- Formal affiliation would, however, require that the interface be monitored by NGS. Agreed that the next step should be to understand in detail what is actually required for NGS partnership.
22B. Interoperability
- Next steps/timetable
- Agreed to focus on two sites, Glasgow and LeSC. Aim to be ready to achieve NGS partnership by Christmas 2006.
- The decision as to whether or not to actually apply for formal partnership can be left to later in the year.
- The principal goal is to understand the steps and requirements etc.
- It was agreed that NGS should provide a gLite CE for core NGS nodes which would allow the nodes to be a part of the EGEE/LCG SAM infrastructure.
- Accounting and monitoring are areas which are still developing and where it is not clear what the best solution is (for NGS)
- Meet once more before Christmas..
23> Implementation
- GU should concentrate on delivering:
  1. A job submission mechanism
  2. A method to prepare the job's environment (what input files, etc.)
- This means we can offer:
  1. gsissh login to head node, with access to some shared space (e.g. the home directory for the NGS pool accounts)
  2. job submission from head node to the gatekeeper, which can use either GRAM (globus-job-submit) or EGEE methods (edg-job-submit)
- This would seem to qualify us as an NGS partner site, comparing with http://www.grid-support.ac.uk/index.php?optioncontenttaskviewid143
- The SLAs on offer seem none too onerous
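The two submission routes named on this slide can be sketched as command lines. The Python sketch below only builds and prints them; it does not run anything. It rests on assumptions: the host names are hypothetical placeholders, the helper names are invented for illustration, and the JDL attributes shown (Executable, StdOutput, StdError, InputSandbox, OutputSandbox) should be checked against the middleware documentation for the installed release.

```python
import shlex


def login_command(head_node):
    """gsissh login to the head node (host name supplied by the caller)."""
    return ["gsissh", head_node]


def gram_command(gatekeeper, executable, *args):
    """GRAM route: globus-job-submit <gatekeeper contact> <executable> [args]."""
    return ["globus-job-submit", gatekeeper, executable] + list(args)


def edg_command(jdl_path):
    """EGEE route: edg-job-submit with a job description (JDL) file."""
    return ["edg-job-submit", jdl_path]


def minimal_jdl(executable, inputs=()):
    """Build a small JDL text; 'inputs' lists files that prepare the job's
    environment and are shipped in the input sandbox."""
    lines = [
        'Executable = "%s";' % executable,
        'StdOutput = "std.out";',
        'StdError = "std.err";',
        'OutputSandbox = {"std.out", "std.err"};',
    ]
    if inputs:
        quoted = ", ".join('"%s"' % name for name in inputs)
        lines.append("InputSandbox = {%s};" % quoted)
    return "\n".join(lines) + "\n"


# Hypothetical host names, for illustration only.
for command in (
    login_command("headnode.example.ac.uk"),
    gram_command("gatekeeper.example.ac.uk", "/bin/hostname"),
    edg_command("hello.jdl"),
):
    print(" ".join(shlex.quote(word) for word in command))

print(minimal_jdl("/bin/hostname", ["data.txt"]))
```

Keeping the commands as argument lists makes it straightforward to hand them to a process launcher later without shell quoting problems.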
24C. Communicability
- "T0-T1-T2 Service Challenges" Panel Members: Tony Cass, Jeremy Coles, Dave Colling, John Gordon, Dave Kant, Mark Leese, Jamie Shiers. Notes recorded by Neasan O'Neill
- "Analysis on the Grid" Panel Members: Roger Barlow, Giuliano Castelli, David Grellscheid, Mike Kenyon, Gennady Kuznetsov, Steve Lloyd, Andrew McNab, Caitriana Nicholson, James Werner. Notes recorded by Giuseppe Mazza
- "How is/will data be managed at the T1/T2s?" Panel Members: Phil Clark, Greig Cowan, Brian Davies, Alessandra Forti, David Martin, Paul Millar, Jens Jensen, Sam Skipsey, Gianfranco Sciacca, Robin Tasker, Paul Trepka. Notes recorded by Tom Doherty
- "Experiment Service Challenges" Panel Members: Dave Colling, Catalin Condurache, Peter Hobson, Roger Jones, Raja Nandakumar, Glenn Patrick. Notes recorded by Caitriana Nicholson
- "Beyond GridPP2 and e-Infrastructure" Panel Members: Pete Clarke, Dave Britton, Tony Doyle, Neil Geddes, John Gordon, Neasan O'Neill, Joanna Schmidt, John Walsh, Pete Watkins. Notes recorded by Duncan Rand
- "Site Installation and Management" Panel Members: Tony Cass, Pete Gronbech, Dave Kelsey, Winnie Lacesso, Colin Morey, Mark Nelson, Derek Ross, Graeme Stewart, Steve Thorn, John Walsh. Notes recorded by Mark Leese
- "What is a workable Tier-2 Deployment Model?" Panel Members: Olivier van der Aa, Jeremy Coles, Santanu Das, Alessandra Forti, Pete Gronbech, Peter Love, Giuseppe Mazza, Duncan Rand, Graeme Stewart, Pete Watkins. Notes recorded by Gianfranco Sciacca
- "What is Middleware Support?" Panel Members: Mona Aggarwal, Tom Doherty, Barney Garrett, Jens Jensen, Andrew McNab, Robin Middleton, Paul Millar, Robin Tasker. Notes recorded by Catalin Condurache
251. "LCG Service Challenges"
- This was a session which brought out the detailed
planning of Service Challenges.
1. SC is a great idea which is a kind of reality check: reality is imminent data, increasing complexity of experiment-led initiatives, and more users
2. Need more documentation and support: still true(!) despite effort
3. Time scales and deadlines are needed for deployment: well known and widely communicated via Jamie and Jeremy
4. Storage model is an important issue, especially for storage group: increasingly large issue, dedicated discussion
5. Communication on experience: forthcoming discussions will be discussed at DTeam and PMB meetings
6. Networks will play an important part in SC4: underpins file transfer tests, but needs to be embedded within these - disk performance (being understood) v network performance (many hidden variables)
26There was a list of specific actions
- Implement a better user support model ONGOING
- Support the deployment of an SRM at every Tier-2 site DONE
- Revisit site plans for implementing promised resources DONE
- Support the installation of any required local catalogues at sites GENERALLY LIMITED TO TIER-1. DONE
- Investigate the experiment VO box requests. Make a recommendation to Tier-2s. Revisit as GridPP. NOT REQD. (CURRENTLY)
- Better understand network links to sites (we do not want to saturate links) ONGOING
- Schedule transfer tests from Tier-1 to Tier-2: test rates and stability DONE AND ONGOING
- Work closer with experiments? CAN IMPROVE
27There was a list of specific actions
- user support (mail lists, web form, TPMs, GGUS integration) NEED TO ENSURE USERS KNOW (AND KEEP REMINDING THEM)
- SRM at T2 (almost done) DONE
- site plans revised (SRIF3, FEC) ONGOING
- local catalogues (wiki, SC3, plan for rest)
- VO boxes (review group) DISAPPEARING..
- network links (10 easy questions, wiki) FIREWALLGRID http://www.ggf.org/documents/GFD.83.pdf
- T1-T2 tests (plan, stalled, dcache/dpm) DONE
- Experiment links (some progress) MORE REQD.
282. "Running Applications on the Grid"
- (Why won't my jobs run?)
- Summary
- A number of people say things are working well - pleasant surprise - easier than LSF! A SUBSET OF USERS ATTEND GRIDPP MEETINGS
- VO setup and requirements: don't want each VO to have to talk to each site. VO should provide list of requirements for site to support VO. THERE ARE A LARGE NUMBER OF RESPONSIBILITIES TO BE HANDLED BY EACH EXPT.
- Certificates: need to improve situation. Once over this hurdle using the grid is plainer sailing. INTRINSIC TIME DEPENDENCE OF CA-RA-USER TRUST ESTABLISHMENT (NECESSARY)
- Data management issues more of a problem than job or RB problems. How to get information to user re failures and support channels. INCREASINGLY TRUE: MANY AD-HOC DELETIONS FOLLOWING E.G. FTS FAILURES
- Monitoring real file transfers would be an interesting addition. USER MECHANISMS TO TRACE OVERALL PROGRESS, BUT NOT MANY INDIVIDUAL USER TOOLS/SCRIPTS APPEARING, E.G. TNT (Tag Navigator Tool) PLUG-IN TO GANGA FOR ATLAS FILE COLLECTIONS WOULD NEED TO COMMUNICATE WITH THE MonAMI FTS PLUG-IN
293. "Grid Documentation"
- (What documentation is needed/missing? Is it a question of organisation?)
- Could updates to documents be raised at meetings?
- A mailing list specifically for document updates may be useful.
- Competition between different solutions to one problem.
- For all experiments - link in all documentation and give responsibility to a line manager (for example) to oversee its maintenance.
- What are the mechanisms, or how do we find out what is inadequate within a document? A document should be checked every few months to point out its inadequacies. Should a review process be set up by SB?
- Roles and responsibilities should be established.
- Important documents should be highlighted - an index of useful docs and what sources of documents are available may be useful.
- Much progress made by Stephen Burke in many of these areas. Steve attends PMB
305. "Beyond GridPP2 and e-Infrastructure"
- (What is the current status of planning?)
- EGEE II may be superseded by European infrastructure. EGEE III NOW BEING PLANNED
- DTI planning a UK infrastructure
- Integrate better with NGS - SEE EARLIER SLIDES
- More things developed by GridPP will be supported centrally. NEED TO CONVINCE UK COMMUNITY OF THE USEFULNESS AND ADAPTABILITY OF GLITE AS A COMPONENT PART OF PERVASIVE INFRASTRUCTURE
316. "Managing Large Facilities in the LHC era"
- (What works? What doesn't? What won't?)
- Sys admins seem happy with their package managers.
- We should share common knowledge (about software tools) more. ONGOING
- Extra costs (over and above the price of the hardware) involved in having large clusters. ONGOING
- IMPROVED, BUT CAN IMPROVE FURTHER. METRIC DT (INSTALL USER AVAILABILITY) AVAILABILITY
327. "What is a workable Tier-2 Deployment Model?"
- Conclusion: deployment is under control
- testing has made good progress
- operations still an issue
- METRIC DT (INSTALL USER AVAILABILITY) OVERALL AVAILABILITY SYSTEM MANAGER(S)
- EXCELLENT T2 SUPPORT STRUCTURE REQD.
338. "What is Middleware Support?"
- (really all about)
- gLite test bed
- EGEE2 - dedicated testing/certification system
- using wiki was a good idea. Consolidate into documents.
- need some structure to make sure wiki doesn't get out of control.
- need some moderators for the wiki.
- developers not getting correct requirements for s/w. Sysadmin questions are not the same questions that were in the minds of the developers..
- bad if the wiki is incorrect.
- need someone to move what is in the wiki to some sort of more formal docs (LaTeX or DocBook) which have been properly checked and signed off by the developers.
- ONGOING, LIMITED PROGRESS. INTRINSIC LIMITATION? (THERE WILL ALWAYS BE OUT OF DATE/LIMITED DOCUMENTATION?)
- NEED A DOCUMENTATION REVIEW CHALLENGE?
34Conclusion
- All sessions were felt to be worthwhile
- Some produced hard actions
- Some areas have made progress since
- Positive correlation between subjects which made progress and where GridPP had existing structures in place (Deployment, Documentation)
- Counter examples: middleware, experiments
- Let's do this again but next time take more care to task people with subsequent progress and look for new structures to deliver results.
- MAKE IT SO
- The logical end of a talk on Gridability (or the emperor's new clothes?)