Service Challenges - PowerPoint PPT Presentation

Description:

Jamie Shiers: Progress since last CR. Experimental views. Franco Carminati: ALICE ... Easter w/e. Target 10 day period. Just! Most of the time below target ...

Slides: 20
Provided by: pauld181

Transcript and Presenter's Notes

Title: Service Challenges


1
Service Challenges
  • Paul Dauncey, Rainer Mankel, Carsten Niebuhr,
    Michel Gonin

2
Talks given
  • Jamie Shiers: Progress since last CR
  • Experimental views
  • Franco Carminati: ALICE
  • Dietrich Liko: ATLAS
  • Michael Ernst: CMS
  • Nick Brook: LHCb
  • Last CR was halfway through SC3
  • The SC3 targets were missed by factors of 2
  • A lot learned and applied in SC4 since
  • Overall progress has been positive
  • A lot of work and great improvement over the last
    year
  • Some of the same areas are still causing a lot of
    problems

3
(No Transcript)
4
Transfers achieved
Just!
  • Most of the time below target
  • But much better than a factor of 2 below

5
(No Transcript)
6
Comparing to new figures
  • Rates are pretty good compared to new figures

7
In reality, there will be other transfers
  • Actual rates were at much lower levels during
    T0→T1 tests than nominal rates above
  • But network bandwidth not likely to be main limit

8
Jamie Shiers' comments
  • But there were a lot of other things achieved
  • Some examples follow

9
Experiment-driven transfers
10
ATLAS MC jobs
  • Aim is to be at 10M/week by end of 2007, x10 the
    current rate

11
CMS job submissions
12
LHCb reconstruction
  • Aim is to reconstruct 3000M events/month at
    nominal rate
  • Possibly fairer comparison is number of input
    files
  • Nominal is 30k/month; the rate above is
    50-100k/month, so already above nominal
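
The input-file comparison above is simple arithmetic. A minimal sketch, assuming only the two figures quoted on the slide (30k files/month nominal, 50-100k files/month achieved); the function and variable names are illustrative, not from the talk:

```python
def ratio_to_nominal(achieved_per_month, nominal_per_month):
    """Return the achieved rate as a multiple of the nominal rate."""
    return achieved_per_month / nominal_per_month

NOMINAL_FILES = 30_000               # nominal input files/month, from the slide
ACHIEVED_RANGE = (50_000, 100_000)   # achieved range in files/month, from the slide

low, high = (ratio_to_nominal(a, NOMINAL_FILES) for a in ACHIEVED_RANGE)
print(f"{low:.1f}x to {high:.1f}x nominal")  # prints "1.7x to 3.3x nominal"
```

On these figures the file rate is already between roughly 1.7 and 3.3 times nominal, which is the sense in which the slide says "already above nominal".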

13
What were the problems in SC4?
  • No simple answer
  • Many, many individual one-off problems were
    mentioned
  • Little quantitative information was presented
  • Many reports of instabilities
  • T1 sites (ATLAS report that all 9 T1s were
    available together for only a few hours/month)
  • Hardware failures
  • SRM/mass storage/Castor/dCache
  • File catalogues
  • Site differences
  • Firewalls
  • Badly configured nodes/sites
  • EGEE software
  • File access (GFAL)
  • File transfer (FTS)
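
One way to see why all 9 T1s are so rarely available together is to multiply per-site availabilities. This is a rough sketch only: it assumes each site is up independently with the same probability, which is an assumption and not something stated in the talks, and the per-site values below are hypothetical, chosen purely to illustrate the effect.

```python
HOURS_PER_MONTH = 30 * 24  # 720 hours, taking a 30-day month

def joint_availability(per_site, n_sites=9):
    """Probability that all n_sites are up at once, assuming independence."""
    return per_site ** n_sites

for p in (0.99, 0.90, 0.70):  # hypothetical per-site availabilities
    joint = joint_availability(p)
    print(f"per-site {p:.2f}: all 9 up {joint:.3f} of the time, "
          f"about {joint * HOURS_PER_MONTH:.0f} hours/month")
```

Even modest per-site instability compounds quickly across nine sites, so the joint availability can be far lower than any single site's figure.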

14
From Sijbrand's talk in last CR
  • gLite (and dCache to some extent) were less of an
    issue recently
  • But the other usual suspects are still giving
    problems
  • But don't forget the overall level of service is
    much improved

15
Schedule for future commissioning
16
(No Transcript)
17
How can further improvements be made?
  • Many comments that manual intervention was
    required
  • heroic efforts
  • at the limit of what the system can do
  • Jamie Shiers' talk was dominated by communication
    improvements and problem reporting between the
    sites
  • Error reporting, tutorials, phone meetings,
    workshops, Wikis, etc.
  • He sees this as the way to improve performance
    and reliability
  • Have to live with this level of problems; just
    get more efficient at overcoming them when they
    occur
  • Castor is a notable exception
  • However, must also put a lot of effort into bug
    fixing
  • Not sexy; may need to push to keep the effort
    in the right direction
  • Effectively a division of effort in maintenance
    vs. development
  • Important to get the balance of effort right here

18
Other points mentioned
  • Experiments will not ramp up to nominal rates by
    Jul07
  • E.g. ATLAS simulation is x10 below right now
  • Most are aiming for this around early 2008
  • No direct DAQ output has been included yet
  • Hence, service commissioning period will not be
    based on realistic loads
  • Should commissioning targets be relaxed for 2007,
    given LHC schedule? Only makes sense if it frees
    up effort to use elsewhere; not clear if true
  • Almost all service performance reported as data
    transfer rates
  • Obviously critical to get data out, both for
    storage and analysis
  • Some information given on job performance
  • Very little on CPU usage efficiency; this seems
    to be underutilised
  • Scheduled outages can be worse than unscheduled
    ones
  • They hit more than one site simultaneously
  • More than one item tends to be removed from
    service

19
Conclusions
  • A lot of progress since the last CR
  • Jamie Shiers: "Despite the problems encountered
    and those yet to be faced and resolved, I
    believe that it is correct to say we have a
    usable service (not a perfect one)"
  • Several critical components still to be deployed
  • Without disrupting services
  • With a somewhat uncertain schedule
  • Service problems seen are amorphous and not easy
    to categorise
  • Many one-offs, so progress will be slow in fixing
    them