The Coming Crisis in Computational Science

1
The Coming Crisis in Computational Science
  • Douglass Post
  • Los Alamos National Laboratory
  • SCIDAC Workshop
  • Charleston, South Carolina, March 22-24, 2004

LA-UR-04-0388. Approved for public release;
distribution is unlimited.
Los Alamos National Laboratory, an affirmative
action/equal opportunity employer, is operated by
the University of California for the U.S.
Department of Energy under contract
W-7405-ENG-36. By acceptance of this article, the
publisher recognizes that the U.S. Government
retains a nonexclusive, royalty-free license to
publish or reproduce the published form of this
contribution, or to allow others to do so, for
U.S. Government purposes. Los Alamos National
Laboratory requests that the publisher identify
this article as work performed under the auspices
of the U.S. Department of Energy. Los Alamos
National Laboratory strongly supports academic
freedom and a researcher's right to publish; as
an institution, however, the Laboratory does not
endorse the viewpoint of a publication or
guarantee its technical correctness.
2
Outline
  • Introduction
  • The Performance and Programming Challenges
  • The Prediction Challenge
  • Lessons Learned from ASCI
  • Case study lessons
  • Quantitative Estimation
  • Verification and Validation
  • Software Quality
  • Conclusions and Path Forward

3
Introduction
  • The growth of computing power over the last 50
    years has enabled us to address and solve many
    important technical problems for society
  • Codes contain Realistic models, good spatial and
    temporal resolution, realistic geometries,
    realistic physical data, etc.

Stanford ASCI Alliance: Jet Engine Simulation
U. of Illinois ASCI Alliance: Shuttle Rocket
Booster Simulation
L. Winter et al.: Rio Grande Watershed
G. Gisler et al.: Impact of Dinosaur-Killer Asteroid
4
Three Challenges: Performance, Programming and
Prediction
  • Performance Challenge: building powerful computers
  • Programming Challenge: programming for complex
    computers
  • Rapid code development
  • Optimize codes for good performance
  • Prediction Challenge: developing predictive codes
    with complex scientific models
  • Develop codes that have reliable predictive
    capability
  • Verification
  • Validation
  • Code Project Management and Quality

5
Outline
  • Introduction
  • The Performance and Programming Challenges
  • The Prediction Challenge
  • Lessons Learned from ASCI
  • Case study lessons
  • Quantitative Estimation
  • Verification and Validation
  • Software Quality
  • Conclusions and Path Forward

6
Programming Challenge
  • Modern computers are complicated machines
  • 1000s of processors linked by an intricate set of
    networks for data transfer
  • Terabytes of data
  • Challenges
  • Memory bandwidth not keeping up with processor
    speed
  • Memory management and cache utilization
  • Interprocessor communication and message passing
    conflicts
  • Reliability and fault tolerance

7
High Performance Computing Community must address
Programming Challenge.
  • Problem is severe
  • Code development time is often longer than
    platform turnaround time, many codes have a 20
    year or longer lifetime, with a 5 to 10 or more
    year replacement time
  • Performance is often poor and not likely to
    improve
  • Trade-off between portability and performance
    favors portability
  • More needs to be done to simplify application
    code development
  • Better tools, language enhancements for easier
    parallelization, stable architectures are needed
  • Addressing Programming Challenge is a major
    thrust of the DARPA High Productivity Computing
    Systems Project (HPCS).

8
Outline
  • Introduction
  • The Performance and Programming Challenges
  • The Prediction Challenge
  • Lessons Learned from ASCI
  • Case study lessons
  • Quantitative Estimation
  • Verification and Validation
  • Software Quality
  • Conclusions and Path Forward

9
Predictive Challenge is even more serious than
Programming Challenge.
  • Programming Challenge is a matter of efficiency
  • Programming for more complex computers takes
    longer, is more difficult, and requires more
    people, but with enough resources and time, codes
    can be written to run on these computers
  • But the Predictive Challenge is a matter of
    survival
  • If the results of the complicated application
    codes cannot be believed, then there is no reason
    to develop the codes or the platforms to run them
    on.

10
Many things can be wrong with a computer
generated prediction.
  • Experimental and theoretical science are mature
    methodologies but computational science is not.
  • Code could have bugs in either the models or the
    solution methods that result in answers that are
    incorrect
  • e.g. 2 + 2 = 54.22, sin(90°) = 673.44, etc.
  • Models in the code could be incomplete or not
    applicable to problem or have wrong data
  • E.g. climate models without an ocean current
    model
  • User could be inexperienced, not know how to use
    the code correctly
  • CRATER analysis of Columbia Shuttle damage
  • Two examples: Columbia Space Shuttle and
    sonoluminescence
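Bugs of the kind listed above are exactly what a verification regression suite is meant to catch. A minimal sketch in Python (the specific checks and the tolerance are illustrative, not from the slides):

```python
import math

def run_regression_suite():
    # Each entry pairs a computed value with its known
    # correct answer; a buggy code would return results
    # like 2 + 2 = 54.22 or sin(90 deg) = 673.44
    checks = [
        (2 + 2, 4),
        (math.sin(math.radians(90)), 1.0),
    ]
    # Report every result that drifts from the known answer
    return [(got, want) for got, want in checks
            if abs(got - want) > 1e-12]

failures = run_regression_suite()  # empty list means the suite passes
```

In practice such suites are run automatically after every code change, so that errors in models or solution methods are caught before they reach users.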

11
Lessons Learned are important
  • 4 stages of design maturity for a methodology to
    mature (Henry Petroski, Design Paradigms)
  • Suspension bridges: case studies of failures (and
    successes) were essential for reaching
    reliability and credibility

Tacoma Narrows Bridge buckled and fell 4 months
after construction!
Case studies conducted after each crash. Lessons
learned identified and adopted by community
12
Computational Science is at the third stage.
  • Computational Science is in the midst of the
    third stage.
  • Prior generations of code developers were deeply
    scared of failure and didn't trust the codes
  • New generation of code developers trained as
    computational scientists
  • New codes are more complex and more ambitious but
    not as closely coupled to experiments and theory
  • Disasters occurring now
  • We need to assess the successes and failures,
    develop lessons learned and adopt them
  • Otherwise we will fail to fulfill the promise of
    computational science
  • Computational science has to develop the same
    professional integrity as theoretical and
    experimental science

13
Computational analysis of Columbia Space Shuttle
damage was flawed.
  • The Columbia Space Shuttle wing failed during
    re-entry due to hot gases entering a portion of
    the wing damaged by a piece of foam that broke
    off during launch
  • Shortly after launch, Boeing did an analysis
    using the code CRATER to determine likelihood
    that the wing was seriously damaged
  • Problems with analysis
  • The analysis was carried out by an inexperienced
    user, someone who had only used the code once
    before
  • CRATER was designed to study the effects of
    micrometeorite impacts, and had been validated
    only for projectiles less than 1/400 the size and
    mass of the piece of foam that struck the wing
  • Didn't use a code like LS-DYNA that was the
    industry standard for assessing impact damage
  • The prior CRATER validation results indicated
    that the code gave conservative predictions
  • Analysis indicated that there might be some
    damage, but probably not at a level to warrant
    concern
  • Concerns due to CRATER analysis were downplayed
    by middle management and not passed up the
    system.
  • Result was that no one tried hard to look at the
    wing and figure out how to do something to avoid
    the crash (maybe there was no way to fix it).

NASA Columbia Shuttle Accident Report
[Figures: validation object vs. foam objects;
Columbia re-entry; Columbia breakup]
14
Computational predictions led us astray with
sonoluminescent fusion
  • Taleyarkhan and co-workers at ORNL used intense
    sound waves to form sonoluminescent bubbles with
    deuterated acetone
  • They observed Tritium formation and 14 MeV
    neutrons, indicating that nuclear fusion was
    occurring
  • Observed temperature was 10⁷ K instead of
    the usual 10³ K
  • If true, we could have a fusion reactor in every
    house
  • Computer simulations were done that matched the
    experimental results if the driving pressure in
    the codes was increased by a factor of 10 (well
    outside reasonable uncertainties)
  • Based on the agreement of theory and
    experiment, the results were published in
    Science and generated intense interest
  • Experiments were repeated, especially at ORNL. No
    significant Tritium or 14 MeV neutrons were
    found.
  • Sigh! No fusion reactor in everyone's house.
  • Simulation was misleading
  • Bad experiment, bad simulation?

Taleyarkhan et al., Science 295 (2002), p. 1868;
Shapira et al., PRL 89 (2002), p. 104302
15
Outline
  • Introduction
  • The Performance and Programming Challenges
  • The Prediction Challenge
  • Lessons Learned from ASCI
  • Case study lessons
  • Quantitative Estimation
  • Verification and Validation
  • Software Quality
  • Conclusions and Path Forward

16
ASCI
  • In late 1996, the DOE launched the Accelerated
    Strategic Computing Initiative (ASCI) to develop
    the enhanced predictive capability by 2004 at
    LANL, LLNL and SNL that was required to certify
    the US nuclear stockpile without testing
  • ASCI codes were to have much better physics,
    better resolution and better materials data
  • Need a 10⁵ increase in computer power from the
    1995 level
  • Develop massively parallel platforms (20 TFlops
    at LANL this year, 100 TFlops at LLNL in
    2005-2006)
  • ASCI included development of applications,
    development and analysis tools, massively
    parallel platforms, operating and networking
    systems and physics models
  • $6B expended so far

17
Lessons Learned
  • Build on successful code development history and
    prototypes
  • Highly competent and motivated people in a good
    team are essential
  • Risk identification, management and mitigation
    are essential
  • Software Project Management: run the code project
    like a project
  • Schedule and resources are determined by the
    requirements
  • Customer focus is essential
  • For code teams and for stakeholder support
  • Better physics is much more important than better
    computer science
  • Use modern but proven computer science
    techniques
  • Don't make the code project a computer science
    research project
  • Provide training for the team
  • Software Quality Engineering: best practices
    rather than processes
  • Validation and Verification are essential

18
LLNL and LANL had no big code project
experience before ASCI
  • But they did have a lot of successful small team
    experience
  • Code teams of 1 to 5 staff developed
    multi-physics codes one module at a time and then
    integrated them
  • ASCI needed rapid code development on an
    accelerated time scale
  • LLNL and LANL launched large code projects with
    very mixed success
  • Didn't look at the lessons learned from other
    communities

19
Teams
  • Tom DeMarco states that there are four essentials
    of good management:
  • Get the right people
  • Match them to the right jobs
  • Keep them motivated
  • Help their teams to jell and stay jelled
  • "All the rest is administrivia." (The Deadline)
  • Success is all about teams!
  • Management's key role is the support and
    nurturing of teams

Crestone Project Team
T. DeMarco, 2000; DeMarco and Lister, 1999;
Cockburn and Highsmith, 2001; Thomsett, 2002;
McBreen, 2001
20
Risk identification, management and mitigation
are essential
  • Tom DeMarco lists five major risks for software
    projects
  • 1. Uncertain or rapidly changing requirements,
    goals and deliverables
  • Almost always fatal
  • 2. Inadequate resources or schedule to meet the
    requirements
  • 3. Institutional turmoil, including lack of
    management support for code project team, rapid
    turnover, unstable computing environment, etc.
  • 4. Inadequate reserve and allowance for
    requirements creep and scope changes
  • 5. Poor team performance
  • To these we add two:
  • 6. Inadequate support by stakeholder groups that
    need to supply essential modules, etc.
  • 7. Problem is too hard to be solved within
    existing constraints
  • Poor team performance is usually blamed for
    problems
  • But all risks except 5 are the responsibility of
    management!
  • ASCI experience: management attention to 1-4, 6
    and 7 has been inadequate
  • Risk mitigation by contingency, back-up staff
    and activities is key

T. DeMarco, 2002a
21
Software Project Management
  • Good organization of the work is essential
  • Good project management is more important for
    success than good Software Quality Assurance
  • Manage the code project as a project
  • Clearly defined deliverables, a work breakdown
    structure for the tasks, a schedule and a plan
    tied to resources
  • Execute the plan, monitor and track progress with
    quantitative metrics, re-deploy resources to keep
    the project on track as necessary
  • Insist on support from sponsors and stakeholders
  • Project leader must control the resources,
    otherwise the leader is just a cheerleader!

Brooks, 1987; Remer, 2000; Rifkin, 2002;
Thomsett, 2002; Highsmith, 2001
22
Requirements, Schedule and Resources must be
consistent.
  • Planning Software Projects is more restrictive
    than planning conventional projects
  • Normally, one can pick two out of requirements,
    schedule and resources, and the third is then
    determined
  • For software, the requirements determine the
    optimal schedule and the optimal resources. You
    can do worse, but not better than optimal.
  • Schedule
  • The rate of software development, like all
    knowledge based work, is limited by the speed
    that people can think, analyze problems and
    develop solutions.
  • Resources
  • The size of the code team is determined by the
    number of people who can coordinate their work
    together
  • Specifying the schedule and/or resources plus the
    requirements has been one of the greatest
    problems for ASCI (and for many other code
    development projects in our experience).
  • Then the code development plan is over-specified.

D. Remer, 2000; T. Jones, 1988; Post and Kendall,
2002; S. Rifkin, 2002
23
ASCI codes take about 8 years to develop.
  • We have studied the record of most of the ASCI
    code projects at LANL and LLNL to identify the
    necessary resources and schedule
  • The requirements are well known and fixed. LANL
    and LLNL have been modeling nuclear weapons for
    50 to 60 years.
  • We find that it takes about 8 years and a team
    of at least 10 to 20 professionals to develop a
    code with the minimum capability to provide a
    3-dimensional simulation of a nuclear explosion
  • No code project that hasn't had 8 years has
    succeeded
  • All of the code projects that have succeeded have
    been at least 8 years old (although some have
    failed even with 8 years of work)

Post and Cook, 2000; Post and Kendall, 2002
24
Initial use of metrics to estimate schedule and
resource requirements
  • Key parameter is a function point (FP), a
    weighted total of inputs, outputs, inquiries,
    logical files and interfaces
  • SLOC = source lines of code
  • Corrections for Lab environment compared to
    industry:
  • Schedule: FP schedule delays of up to 1.5 years
    for recruiting, clearance and learning curve
  • Schedule multiplier of 1.6 for complexity of
    project, based on the contingency required
    compared to industry due to the complexity of
    the classified/unclassified computing
    environment, unstable and evolving ASCI
    platforms, the paradigm shift to massively
    parallel computing, the need for algorithm R&D,
    complexity of models, etc.

T. Capers-Jones, 1998; D. Remer, 2000;
E. Yourdon, 1997
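The estimation procedure can be sketched in a few lines of Python. The component weights are the standard IFPUG average-complexity values, and the schedule and staffing rules of thumb (schedule in months ≈ FP^0.4, one developer per roughly 150 FP) are Capers Jones's published heuristics; the 1.6 multiplier and 1.5-year startup delay are the Lab corrections from the slide. The component counts below are hypothetical, for illustration only:

```python
def unadjusted_function_points(inputs, outputs, inquiries,
                               logical_files, interfaces):
    # IFPUG average-complexity weights for the five
    # component types that make up a function point count
    weights = (4, 5, 4, 10, 7)
    counts = (inputs, outputs, inquiries, logical_files, interfaces)
    return sum(w * c for w, c in zip(weights, counts))

def estimate(fp, schedule_multiplier=1.6, startup_delay_years=1.5):
    # Capers Jones rules of thumb: schedule in calendar
    # months ~ FP**0.4; one developer per ~150 FP
    schedule_months = fp ** 0.4
    staff = fp / 150
    # Lab corrections from the slide: a 1.6x complexity
    # multiplier plus up to 1.5 years of recruiting,
    # clearance and learning-curve delay
    years = (schedule_months / 12) * schedule_multiplier + startup_delay_years
    return years, staff

# Hypothetical component counts for a mid-sized code
fp = unadjusted_function_points(100, 150, 80, 40, 30)
years, staff = estimate(fp)
```

For a code of a few thousand function points this kind of calculation lands in the multi-year, 10-to-20-person range, consistent with the ASCI history discussed below.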
25
Comparison of ASCI code history and estimation
results
Post and Cook, 2000; Post and Kendall, 2002
26
ASCI codes require about 8 years and 20
professionals
  • Estimation procedures are consistent with LANL
    and LLNL history
  • About 8 years and 20 professionals are required.

D. Remer, 2000; Post and Kendall, 2002
27
Result of not estimating the time and resources
needed to complete the requirements was failure
of many code projects to meet their milestones.
  • Code Projects were asked to do 8 years of work in
    3.5 years
  • Management changed the requirements 6 months
    before the due date
  • Result
  • 3 out of 5 code projects failed to meet the
    milestones (those with less than 8 years of
    experience)
  • 2 out of 5 code project succeeded in meeting the
    milestones (those with more than 8 years of
    experience)
  • Management at LLNL and LANL blamed the code
    teams, and in some cases punished them
  • Morale was devastated, project teams were reduced
  • Only now, 3 years later, are some failed projects
    beginning to recover
  • Real cause was the lack of realistic planning and
    estimation by upper level management
  • DOE and labs now drafting more appropriate
    milestones and schedules

Post and Cook, 2000; Post and Kendall, 2002
28
Focus on Customer
  • Customer focus is a feature of every successful
    project
  • Inadequate customer focus is a feature of many,
    if not most, unsuccessful projects
  • Code projects are unsuccessful unless the users
    can solve real problems with their codes
  • LLNL and LANL encourage customer focus by
  • Organizationally embedding the code teams in user
    groups
  • Collocation of users and code teams
  • Encourages daily informal interactions
  • Involve users in the code development process,
    e.g. testing, setting requirements, assessing
    progress, etc.
  • Code development teams share in success of users
  • Trust between users and developers is essential
  • Common management helps balance competing
    concerns and issues
  • Incremental delivery has helped build and
    maintain customer focus

McBreen, 2001; D. Phillips, 1997; R. Thomsett,
2002; E. Verzuh, 1999
29
Better Physics is the key
  • Paradigm shift from interpolation among nuclear
    test data points to extrapolation to new regimes
    and conditions requires much better physics
  • Can't rely on adjusting parameters to exploit
    compensating errors
  • Predictive ability of codes depends on the
    quality of the physics and the solution
    algorithms
  • Correct and appropriate equations for the
    problem, physical data and models, and accurate
    solutions of the equations
  • ASCI has funded a parallel effort to develop
    better algorithms, physical databases and
    physics models (turbulence, opacities, cross
    sections, etc.)

R. Laughlin, 2002; Post and Kendall, 2002
30
Employ modern computer science techniques, but
don't do computer science research.
  • Main value of the project is improved science
    (e.g. physics)
  • Implementing improved physics (3-D, higher
    resolution, better algorithms, etc.) on the
    newest, largest massively parallel platforms that
    change every two years is challenging enough.
    Don't increase the challenge!
  • Use standard industrial computer science.
    Computer science must be a reliable and usable
    tool! Don't do computer science research
  • Beyond-state-of-the-art computer science (new
    frameworks, languages, etc.) has generally proven
    to be at best an impediment and at worst a
    project killer
  • LANL spent over 50% of its code development
    resources on a project that had a major computer
    science research component. It was a massive
    failure.

T. DeMarco and T. Lister, 2002; Post and Cook,
2000; Post and Kendall, 2002
31
Software Quality Engineering: Practices rather
than Processes
  • Attention to project management is more important
  • SQA and SQE are big issues for DOE and DOD codes
    (10 CFR 830, etc.)
  • Stan Rifkin defines 3 categories of software (S.
    Rifkin, 2002)
  • Operationally excellent: one bug kills, e.g.
    airplane control software
  • Customer intimate: flexible with lots of
    features, e.g. MS Office
  • Product innovative: new and innovative
    capabilities, e.g. scientific codes
  • Heavy emphasis on process is necessary for
    operationally excellent software, where bugs
    are fatal
  • Heavy process stifles innovative software, the
    kind of innovation necessary to solve
    computational scientific and technical problems
  • Scientists are trained to question authority and
    look for the value added by any procedure
  • ASCI had more success emphasizing best
    practices than good processes
  • If the code team doesn't implement SQA on their
    own terms, the sponsors may implement it on
    theirs, and the teams will like it much less

DeMarco and Boehm, 2002 DPhillips,1997 Remer,
2000 DeMarco and Lister, 2002 Rifkin, 2002
Post and Cook, 2000 Post and Kendall, 2002
32
Verification and Validation
  • Customers (e.g. DOD) want to know why they should
    believe code results
  • Codes are only a model of reality
  • Verification and Validation are essential
  • Verification
  • Equations are solved correctly
  • Mathematical exercise
  • Regression suites of test problems, convergence
    tests, manufactured solutions, analytic test
    problems
  • Validation
  • Ensure models reflect nature
  • Check code results with experimental data
  • NNSA is funding a large experimental program to
    provide validation data
  • National Ignition Facility, DARHT, ATLAS, Z, etc.
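
The verification techniques listed above can be made concrete with a convergence test against a manufactured solution: pick an exact solution, derive the source term it implies, and check that the discretization error shrinks at the scheme's theoretical order. The toy 1-D Poisson solver below is purely illustrative, not one of the codes discussed here:

```python
import numpy as np

def solve_poisson_1d(f, n):
    # Second-order finite-difference solve of -u'' = f on
    # (0, 1) with u(0) = u(1) = 0, on n interior points
    h = 1.0 / (n + 1)
    x = np.linspace(h, 1 - h, n)
    # Standard 3-point stencil (dense matrix for brevity)
    A = (np.diag(np.full(n, 2.0))
         - np.diag(np.ones(n - 1), 1)
         - np.diag(np.ones(n - 1), -1)) / h**2
    return x, np.linalg.solve(A, f(x))

# Manufactured solution: choose u(x) = sin(pi x),
# which forces the source term f = pi^2 sin(pi x)
u_exact = lambda x: np.sin(np.pi * x)
f = lambda x: np.pi**2 * np.sin(np.pi * x)

errors = []
for n in (16, 32, 64):
    x, u = solve_poisson_1d(f, n)
    errors.append(np.max(np.abs(u - u_exact(x))))

# Observed order of convergence; it should approach 2
# for this second-order scheme as the grid is refined
orders = [np.log2(errors[i] / errors[i + 1])
          for i in range(len(errors) - 1)]
```

If the observed order falls short of the theoretical one, the equations are not being solved correctly, regardless of how plausible the output looks.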

Roache, 1998; Roache, 2002; Salari and Knupp,
2000; Lindl, 1998; Lewis, 1992; Laughlin, 2002
33
Conclusions
  • If Computational Science is to fulfill its
    promise for society, it will need to become as
    mature as theoretical and experimental
    methodologies.
  • Prediction Challenge
  • Need to analyze past experiences, successes and
    failures, develop lessons learned and implement
    them; DARPA HPCS is doing case studies of 20
    major US code projects (DoD, DOE, NASA, NOAA,
    academia, industry, etc.)
  • Major lesson is that we need to improve
  • Verification
  • Validation
  • Software Project Management and Software Quality
  • Programming Challenge
  • HPC community needs to reduce the difficulty of
    developing codes for modern platforms; DARPA
    HPCS is developing new benchmarks and performance
    measurement methodologies, encouraging new
    development tools, etc.