1
LCG Internal Review
P. Sphicas, on behalf of the review committee:
U. Egede, P. Hristov, S. C. Lin, L. Perini,
M. Pimia, A. Putzer, K. Woller
LCG Internal Review, November 17-19, 2003
  • Introduction
  • Charge
  • Technical assessment
  • Grid Deployment
  • GTA (Grid Technology Area)
  • Fabric
  • Global issues management
  • Presentation/LHCC, etc.

2
Introduction
3
Charge
  • Les, Matthias
  • Would like to see an opportunity for people
    working on the project to present the work
    they've done
  • Get guidance and advice
  • Things that are missing
  • Things that are inconsistent
  • Milestones, etc. are discussed extensively in SC2.
  • So leave those out
  • Concentrate on technical aspects
  • Suggestions for LHCC review
  • To make the message more clear/complete

4
The actual review
  • Two persons assigned per area to facilitate the
    writing of the report
  • Grid technology and middleware
  • Laura Perini, Simon Lin
  • Grid deployment
  • Alois Putzer, Ulrik Egede
  • Fabric (Tier-0/Tier-1)
  • Knut Woller, Peter Hristov
  • Resources/management
  • Martti Pimia
  • To the contributors: many thanks for the very
    interesting and complete presentations
  • You should be proud of the work done and the
    work that is going on

5
Technical Assessment
6
Grid Deployment (I)
  • Deployment overview
  • In general the LCG mgmt has a good overview of
    the status of the project. This is reflected in
  • the awareness of the weak areas within the
    current deployment and
  • the risks related to the increased usage for the
    2004 Data Challenges
  • The importance of a stable LCG testbed
  • The experiments will have to use experience
    gained during 2004 for the creation of the TDRs
    in early 2005. This means that a stable LCG
    implementation needs to be in place across Tier
    0, 1 and 2 centres for most of 2004.
  • Tier-2 less well defined

7
Grid Deployment (II)
  • ARDA
  • Acceptance of the recommendations from ARDA
    implies that the LCG Grid will undergo a
    transition to the OGSI services provided by the
    ARDA prototype.
  • While testing of OGSI services has a high
    priority, it should not be allowed to interfere in
    a destructive manner with the running of LCG-2
    (required for the data challenges).
  • The experiments plan to use parts of the
    ARDA prototype in their 2004 data challenges.
    The success of this seems uncertain.
  • The success of different experiments with
    distributed Monte Carlo production using a model
    where they pull the jobs from CEs was noted.
  • The model used for this in 2003 had limited
    workload management and security, making it
    unsuitable for distributed analysis
  • This model for Resource Brokering (which bypasses
    most issues related to the information system and
    GRAM) could be developed further, and the security
    model should support it (a schematic pull-model
    agent is sketched below).
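As an aside on the pull model noted above, here is a minimal, hypothetical sketch of such a job agent running at a Computing Element. The queue endpoint, the JSON job format, and all names are assumptions made for this sketch, not actual LCG or experiment interfaces.

```python
# Hypothetical pull-model job agent at a Computing Element (illustrative only).
# Instead of a broker pushing jobs through the information system and GRAM,
# the agent pulls work from a central task queue and runs it locally.
import json
import subprocess
import time
import urllib.request

QUEUE_URL = "https://taskqueue.example.org/next-job"  # assumed endpoint

def fetch_job():
    """Ask the central queue for a job; return None if none is available."""
    try:
        with urllib.request.urlopen(QUEUE_URL, timeout=30) as resp:
            return json.loads(resp.read())  # assumed: {"id": ..., "cmd": [...]}
    except OSError:
        return None  # queue empty or unreachable

def run_agent():
    while True:
        job = fetch_job()
        if job is None:
            time.sleep(60)  # back off while there is no work
            continue
        # Matching of work to resources happens at pull time, which is why
        # this model bypasses the broker; it also shows why workload
        # management and security stay limited without further development.
        subprocess.run(job["cmd"], check=False)

if __name__ == "__main__":
    run_agent()
```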

8
Grid Deployment (III)
  • Certification and testing
  • The CT testbed at CERN is an essential part of
    the deployment of LCG. It has identified
    numerous problems in LCG-1 before its
    implementation at Tier-1 sites and at the moment
    supports the testing of LCG-2 before its upcoming
    release. The CT testbed is also used for
    experiments to test their software against
    upcoming releases and to test security patches of
    present LCG releases.
  • The multiple use of the same testbed causes
    problems, with too little time for each test
  • We recommend that a specific small validation
    testbed be created (at CERN) with the specific
    purpose of letting the experiments test their
    software before official LCG releases.
  • The Tier-1 centres have had many problems with
    the installation of LCG-1. This shows that there
    are issues related to the installation and
    configuration that were not caught during the
    test phase. The LCG area/task should pay closer
    attention to this issue.

9
Grid Deployment (IV)
  • Actual deployment (I)
  • This is an area of concern. Some important
    improvements for LCG-2 are planned:
  • Easy installation outside LCFGng
  • Experiment-driven installation of
    experiment-specific s/w.
  • We applaud/support these plans and eagerly await
    their completion
  • We were also told that there is no formal
    problem-tracking tool used for Grid deployment
    but just one-to-one emails. This is worrisome.
  • An integration with the problem-tracking tools of
    the Grid Operations Centre might be beneficial.
  • The Tier-1 experience presentation demonstrated
    that there are many shortfalls in the current
    deployment. Many of them are real; others seem
    to be due to insufficient communication. Given
    that a perceived problem is almost as bad as a
    real problem:
  • We recommend that closer links between the
    relevant actors be put in place. Current GDB
    meetings do not include the 'troops'; perhaps a
    regular (monthly/bi-monthly) meeting of a
    technical nature would help.
  • Particular attention should be paid to the issue
    of newcomers (and synchronizing them to the old
    and established practices).

10
Grid Deployment (V)
  • Actual deployment (II)
  • At CERN, deficiencies in the automated software
    packaging led to significant delay in the
    conversion of LXBatch nodes (from Fabric
    presentations).
  • A few nodes have been converted manually; the
    conversion of 100 nodes, the next step before
    converting all of LXBatch, has been postponed to
    early 2004, i.e. by three months.
  • The effort on easing the installation and
    reducing/clarifying its dependencies should be
    strengthened.

11
Grid Deployment (VI)
  • Analysis on the Grid
  • The biggest challenge, i.e. chaotic access to the
    LCG by a large number of potentially
    inexperienced users, is still ahead.
  • All work done until now and the plans for the
    early part of next year deal with running
    production-style jobs on LCG.
  • Note that from early next summer the experiments
    plan to let "physicists" use LCG for analysis.
  • What is needed is a clear strategy towards full
    tests of the computing models before the LCG TDR.
  • Identified as a global issue → see later
  • There is also a clear dependence on middleware
    deployment → see GTA section

12
Grid Deployment (VII)
  • Security
  • What has been done so far seems quite good
  • User certification (authentication/authorization)
  • Not many users have actually gone through the
    certificate application system so far. The current
    certification scheme, based on standard X.509
    certificates (the generic request flow is sketched
    after this slide), can be used in the near future
  • A re-evaluation needs to be carried out for the
    long term, especially when other (non-HEP) users
    enter
  • Site security
  • The system works, but no site has yet had to
    cope with a major security breach.
  • The real test of all this is yet to come,
    especially when we enter global,
    production-mode Grids.
  • Grid Operations Centre
  • Same as above. Good plans, but minimal
    experience so far.
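As background to the certification bullets above: grid user certification rests on standard X.509 certificates, in which the user keeps a private key and sends only a signing request to a certificate authority. A minimal sketch of that generic flow, using the modern Python 'cryptography' package purely as an illustration (this is not the 2003 LCG tooling):

```python
# Generic X.509 request flow behind grid user certification (illustrative).
from cryptography import x509
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.x509.oid import NameOID

# 1. The user generates a private key that never leaves their machine.
key = rsa.generate_private_key(public_exponent=65537, key_size=2048)

# 2. They build a certificate signing request carrying their identity;
#    the subject names below are invented examples.
csr = (
    x509.CertificateSigningRequestBuilder()
    .subject_name(x509.Name([
        x509.NameAttribute(NameOID.ORGANIZATION_NAME, "ExampleVO"),
        x509.NameAttribute(NameOID.COMMON_NAME, "Jane Physicist"),
    ]))
    .sign(key, hashes.SHA256())
)

# 3. Only the request (not the key) goes to the certificate authority,
#    whose signature on the returned certificate is what sites verify.
print(csr.public_bytes(serialization.Encoding.PEM).decode())
```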

13
GTA (I)
  • It is evident that the M/W is one of the most
    important components of LCG.
  • But the current plan for the LCG M/W area
    (partly thanks to ARDA) is incomplete.
  • Also, GTA milestones have been referred to as
    mostly reports (not good).
  • While the M/W is not under the exclusive control
    of the LCG project, its milestones are very
    important and need to be included in the project
    overview.
  • They will clearly need to be negotiated between
    LCG and EGEE
  • Having the same person in charge of both is
    clearly good
  • It would be useful to have a plan, complete with
    milestones and manpower (including manpower outside
    CERN), prepared already by January 2004 (i.e.
    before EGEE starts in April).
  • This plan will clearly have to be agreed upon
    with the EGEE Management, who should arrange for
    hires of new people to proceed immediately at the
    project start (or even before).

14
GTA (II)
  • M/W planning
  • ARDA planning should be established by end 2003,
    involving both the experiments and EGEE m/w
    experts, as well as AliEn, NorduGrid and US m/w
    experts.
  • The way to implement the ARDA service
    Architecture in a fast prototype framework should
    be proposed by a technical team as soon as
    possible.
  • The plan should clearly be
  • consistent with experiment requirements
  • compliant with the EGEE engagements
  • submitted to SC2.
  • The plan should address (obvious, here for
    completeness)
  • real timescale and milestones
  • team available for executing the work
  • the m/w components to be taken and their sources
  • the relationship and synchronization with LCG-2
    and OGSI

15
GTA (III)
  • The six-month timescale for the ARDA prototype
    should be negotiated with EGEE and the
    experiments
  • The real point is to have a new release for users
    before end 2004
  • this requires the experiments to be exposed to it
    earlier (by end summer 2004?)
  • This timescale does not seem incompatible with the
    EGEE proposal/TA sent to the EU → see below
  • M/W development cycle and scope
  • Component-by-component deployment and avoiding
    big-bang releases are critical parts of a
    strategy for avoiding some of the worst problems
    experienced with EDG.
  • Intermediate (EGEE) releases should be in the
    milestones of the LCG m/w manager
  • EGEE has to produce a coherent set of functions
    and code with clear interfaces allowing multiple
    implementations of some components

16
GTA (IV)
  • OGSI and GRAM
  • ARDA report recommends a fast OGSI-compliant
    prototype
  • The question is whether OGSI offers a mature
    solution (on a short timescale) for an ARDA
    implementation. Numerous issues:
  • The Information System and GRAM (critical parts of
    the GLOBUS toolkit) are used by LCG, but both
    services have problems of scalability and
    reliability.
  • In GT3/OGSI, GRAM is both slow and not scalable
    (expected, given it is but the GT2 version
    wrapped). Condor-G/GT3 will be (has been?)
    demonstrated at SC2003.
  • A performance evaluation is eagerly awaited; the
    Condor/VDT team (should have) incorporated much
    of the EDG/LCG feedback on the GT2 GRAM.
  • The GT3 IndexService is totally new and looks well
    designed, but some problems are still present.
  • The GLOBUS team has been informed, and an improved
    bi-directional channel has been established with
    the GTA.

17
GTA (V)
  • Federated Grids
  • Currently LHC experiments use a number of
    different Grids
  • Sometimes multiple systems (Grids) are used even
    within a single experiment
  • Clear that different Grids will coexist (e.g. US
    Tier2, NorduGrid)
  • numerous reasons (funding included)
  • EGEE alone will also require some federation
    concept (national grids with identical m/w)
  • First priority should be to show that a single
    Grid can achieve real production quality.
  • Fortunately, this is the LCG
  • Side remark: ARDA may offer a good opportunity
    for harmonisation of the different efforts

18
GTA (VI)
  • Realistic Grid analysis tests by experiments
    before end 2004 are necessary before the
    Computing TDRs can be written (and their
    underlying computing model established).
  • This testing should also involve Tier-2s and
    possibly even Tier-3s, and requires the s/w from
    the Application Area to be available in time.
  • These tests will need a stable LCG service
  • but with a significant risk of failure (for the
    tests)
  • the experiments should chart strategies for
    failures in specific components, leaving the
    rest of the system in a functioning state (a
    fallback sketch follows this slide)
  • Fallback solutions
  • A fallback solution for Grid m/w is very
    important
  • especially if LCG-2 evolution does not deliver
    production-quality m/w in time for the experiment
    C-TDRs
  • The experiments need (at least) some Grid
    functionality, available with production quality.
  • The actual functionalities and how such a
    fallback solution will be organized should be
    included in the LCG plan
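A minimal sketch of the kind of per-component fallback charted above: try the Grid path, and degrade to a local path so the rest of the test keeps functioning. Both submit functions are hypothetical stand-ins, not real LCG interfaces.

```python
# Illustrative per-component fallback; both paths are invented stand-ins.

def submit_via_grid(job: dict) -> str:
    """Stand-in for submission through the Grid workload management."""
    raise ConnectionError("broker unreachable")  # simulate a component failure

def submit_via_local_batch(job: dict) -> str:
    """Stand-in for direct submission to a site's local batch system."""
    return f"local-{job['name']}"

def submit_with_fallback(job: dict) -> str:
    try:
        return submit_via_grid(job)
    except ConnectionError:
        # One failing component must not stop the test: record the
        # degradation and continue with reduced functionality.
        print(f"grid submission failed for {job['name']}; using local batch")
        return submit_via_local_batch(job)

print(submit_with_fallback({"name": "reco-001"}))
```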

19
Fabric (I)
  • The fabric area is successfully developing the
    Tier-0/1 centre at CERN within the given time
    scale.
  • Detailed computing models from the experiments
    are essential for the design, resource planning,
    and organisation of Tier-0/1.
  • The 2004 physics data challenges should therefore
    be Grid-oriented and should include analysis
    procedures.
  • The computing data challenges are important for
    stress-testing the fabric system
  • they should thus be conducted according to the
    presented schedule
  • The export of full raw data sets will require
    increased WAN bandwidth (a rough sizing
    illustration follows this slide)
  • We thus support the current plans for increased
    networking capability
  • We should keep an eye on the adequacy of
    external entities, e.g. GEANT, which constitute
    an integral part of the system
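A rough sizing illustration for the WAN statement above; the data volume and link efficiency below are assumed values chosen for the arithmetic, not figures from the review.

```python
# Back-of-the-envelope WAN sizing; all inputs are assumed example values.
raw_data_pb_per_year = 1.0     # assumed raw data exported per year, in PB
seconds_per_year = 365 * 24 * 3600
link_efficiency = 0.5          # assume links sustain ~50% of nominal capacity

bits_per_year = raw_data_pb_per_year * 1e15 * 8
required_gbps = bits_per_year / seconds_per_year / link_efficiency / 1e9
print(f"sustained WAN capacity needed: {required_gbps:.2f} Gb/s")
# ~0.51 Gb/s per PB/year at 50% efficiency; several PB/year pushes the
# requirement into the multi-Gb/s range, hence the networking plans.
```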

20
Fabric (II)
  • The fabric demonstrates inertia to rapid
    middleware developments.
  • This is good, but also creates friction between
    the two areas.
  • The middleware should clearly not depend on
    technological specifics of the underlying fabric.
  • The coupling between the two, therefore,
    introduces a risk.
  • As already stated in the Hoffman Report, the
    major risk for the fabric infrastructure is not
    technology, which can be adequately foreseen for
    two to three years in advance, but market
    development.
  • Falling off the commodity trail (which is moving
    further towards portables) may drastically increase
    projected prices for high-powered CPUs on short
    time scales.

21
Fabric (III)
  • The restructuring plans for the computer centre
    up to 2008, as outlined, are good.
  • All major requirements as foreseeable today have
    been understood and can be met according to
    reasonable estimates of power and space
    consumption and cost development. Infrastructure
    delays are minor and still acceptable
  • Major challenges will be the installation of up
    to 1200 CPU servers and 550 disk servers per year.
  • Acquisition preparations should start soon.
  • The Total Cost of Ownership (TCO) estimates for
    different architecture types, e.g. from OpenLab,
    should clearly influence the purchase decision
    (i.e. don't just buy the cheapest stuff around); a
    toy TCO illustration follows this slide.
  • Perhaps the acquisition of the online farms by
    the experiments can be somehow included in
    the overall purchase plans.
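A toy TCO comparison in the spirit of the recommendation above; every price and power figure is an invented example, not an OpenLab or CERN number.

```python
# Toy TCO model: purchase price plus power plus administration over the
# server lifetime. All inputs are assumed example values.
def tco(purchase_eur, watts, lifetime_years,
        eur_per_kwh=0.10, admin_eur_per_year=50.0):
    power_cost = watts / 1000.0 * 24 * 365 * lifetime_years * eur_per_kwh
    return purchase_eur + power_cost + admin_eur_per_year * lifetime_years

cheap = tco(purchase_eur=1500, watts=400, lifetime_years=3)
efficient = tco(purchase_eur=1700, watts=200, lifetime_years=3)
print(f"cheap box:     {cheap:.0f} EUR over 3 years")      # ~2701 EUR
print(f"efficient box: {efficient:.0f} EUR over 3 years")  # ~2376 EUR
# The cheaper sticker price loses once power is counted, which is why
# TCO rather than purchase price should drive the decision.
```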

22
Fabric (IV)
  • Choices of tools for fabric management at CERN
    seem to be mostly made.
  • It seems natural (and desirable) to use identical
    tools for the online systems.
  • IT and the collaborations should get together
    soon to prevent diverging solutions.
  • The Fabric Management milestones and deliverables
    seem to be less well-defined than the
    infrastructure ones.
  • If it's only a matter of presentation, fix it for
    next week
  • A few milestones have been delayed, namely the
    display and state-management system.

23
Fabric (V)
  • Storage
  • Question: what are the requirements and goals of
    the ongoing improvements on AFS and CASTOR?
  • The benefits and problems in the existing
    AFS/CASTOR setup have not been elaborated on
  • Where does dCache fit in here (does it fit in
    here?)
  • While we consider it (clearly) wise to keep track
    of technology in the storage area, we recommend
    that priorities be set on the various activities
  • Licensing
  • We appreciate the TCO approach in selecting the
    appropriate solution. One side effect of making
    this decision on a case-by-case basis is
    inconsistent combinations, e.g. lack of Oracle
    support on non-enterprise Linux servers.
  • We believe that this need not have serious
    consequences if the decisions are sufficiently
    flexible to allow a mix of enterprise with
    non-enterprise servers.
  • Insourcing RedHat support for the next year gains
    time for negotiations with RedHat.
  • These negotiations are clearly critical for all
    development activities
  • We agree the deal must be made with the full HEP
    community in mind.

24
Management and Global issues
25
Global issues (I)
  • Planning realistic tests of analysis on the Grid
    (by the experiments) before end 2004 looks like a
    necessary step before the Computing TDRs can be
    written (and their underlying computing model
    established).
  • This testing should also involve Tier-2s and
    possibly Tier-3s, and requires the s/w from the
    Application Area to be available in time
  • A point of concern: we will know whether the LCG
    will have the functionality needed too close to
    the time of the TDRs.

26
Global Issues (II)
  • The LCG-EGEE relationship should be clarified
    (especially to LHCC)
  • Having the same persons responsible for the M/W
    and Deployment tasks in both projects (LCG and
    EGEE) is clearly a good idea
  • The question is whether this is sufficient
  • It should also be clarified whether the m/w
    responsibility in LCG is going to be shared
    between two leaders (like, at present, between
    the CTO and the m/w manager).
  • If yes, the rationale should be given and the
    respective responsibilities should be clarified,
    thus minimizing the area of overlap
  • If not, the unification of the responsibilities
    should be accelerated, without disrupting current
    activities and their short-term evolution. Just
    like with new hires, it would be good to have the
    final m/w management structure in place before
    the official start date of EGEE.
  • The lack of a Fabric area in EGEE needs to be
    addressed as well
  • The expected loss of 6 out of 15 EDG/LCG
    developers by mid-2004 can have a high (negative)
    impact on the Fabric.

27
Global issues (III)
  • Lacking authorization for phase 2 of LCG, the
    long-term support of software packages developed
    in EDG and LCG is a concern
  • This needs to be addressed in 2004, well before
    the end of phase 1
  • In the future there will be different fabrics
  • Remember this when sealing the RedHat deal
  • Others are likely to use SVR-Linux instead
  • The relationships between LCG-2 and the
    post-prototype ARDA implementation should be
    stated more clearly
  • The parallelism between LCG-2 in production and
    the ARDA prototype should be made clearer in
    the presentation

28
Management
  • Manpower
  • The manpower situation, in absolute FTEs, seems
    to be almost OK for the next year, but in the
    longer term there are many problems, which may
    lead to an untenable situation. Even the
    'optimistic' count shows a shortfall of 14 FTE in
    2005
  • Additional issues
  • LCG Phase II manpower is in the CERN planning,
    but not yet funded
  • EGEE manpower needs to be hired and assigned
    early enough to avoid gaps
  • LCG integration with experiments seems to need
    more effort both from the experiments and from
    the LCG
  • The Computing MOUs, synchronization with the
    Computing RRB, the overall plan, etc. together
    form a huge issue
  • We fully support the appearance of the Computing
    Coordinators in the PEB.

29
LHCC Comp. Review and Summary
30
Suggestions
  • Fewer presentations
  • Switch order of presentations (GTA before
    Deployment)
  • Add introduction to the Grid
  • Add that many flavors of middleware exist, that
    there is software beyond the middleware that
    needs to be added/used
  • Go through all project acronyms and all
    (important) 3- and 4-letter abbreviations
  • Take out technical details; aim for simplicity
  • Maintain critical self-assessment
  • Avoid arguments in front of the committee
  • Add relationship with EGEE, describe the
    complicated world you/we live in

31
Parting remarks
  • There is a lot of work going on
  • Really, a lot of work is going on
  • The field (Grids) is in a high state of flux
  • HEP back on the frontier (this time on software)
  • A fine, huge balancing act between computer
    science, the frontier of software development, High
    Energy Physics, LHC computing needs, the EU, CERN,
    and people
  • This leads to inefficiencies: potentially the
    biggest threat to the project
  • We need an analysis of a number of what-if risk
    scenarios
  • Experiments and LCG are in client-server mode
  • Must move towards a collaboration of LCG and 4
    experiments
  • This will be the key to the success of the project