Title: The Coming Crisis in Computational Science
1. The Coming Crisis in Computational Science
- Douglass Post
- Los Alamos National Laboratory
- SCIDAC Workshop
- Charleston, South Carolina, March 22-24, 2004
LA-UR-04-0388. Approved for public release; distribution is unlimited.
Los Alamos National Laboratory, an affirmative action/equal opportunity employer, is operated by the University of California for the U.S. Department of Energy under contract W-7405-ENG-36. By acceptance of this article, the publisher recognizes that the U.S. Government retains a nonexclusive, royalty-free license to publish or reproduce the published form of this contribution, or to allow others to do so, for U.S. Government purposes. Los Alamos National Laboratory requests that the publisher identify this article as work performed under the auspices of the U.S. Department of Energy. Los Alamos National Laboratory strongly supports academic freedom and a researcher's right to publish; as an institution, however, the Laboratory does not endorse the viewpoint of a publication or guarantee its technical correctness.
2. Outline
- Introduction
- The Performance and Programming Challenges
- The Prediction Challenge
- Lessons Learned from ASCI
- Case study lessons
- Quantitative Estimation
- Verification and Validation
- Software Quality
- Conclusions and Path Forward
3. Introduction
- The growth of computing power over the last 50 years has enabled us to address and solve many important technical problems for society
- Codes contain realistic models, good spatial and temporal resolution, realistic geometries, realistic physical data, etc.

Figures: Stanford ASCI Alliance, jet engine simulation; U. of Illinois ASCI Alliance, shuttle rocket booster simulation; L. Winter et al., Rio Grande watershed; G. Gisler et al., impact of "dinosaur killer" asteroid.
4. Three Challenges: Performance, Programming, and Prediction
- Performance Challenge: building powerful computers
- Programming Challenge: programming for complex computers
  - Rapid code development
  - Optimizing codes for good performance
- Prediction Challenge: developing predictive codes with complex scientific models
  - Developing codes that have reliable predictive capability
  - Verification
  - Validation
  - Code project management and quality
5. Outline
- Introduction
- The Performance and Programming Challenges
- The Prediction Challenge
- Lessons Learned from ASCI
- Case study lessons
- Quantitative Estimation
- Verification and Validation
- Software Quality
- Conclusions and Path Forward
6. Programming Challenge
- Modern computers are complicated machines
  - 1000s of processors linked by an intricate set of networks for data transfer
  - Terabytes of data
- Challenges
  - Memory bandwidth not keeping up with processor speed
  - Memory management: cache utilization
  - Interprocessor communication and message-passing conflicts
  - Reliability: fault tolerance
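One standard response to the cache-utilization challenge above is loop blocking (tiling): traversing an array in small tiles so each tile stays resident in cache. The sketch below is a generic illustration, not ASCI code; the pure-Python version only demonstrates the traversal pattern, since the actual speedup shows up in compiled languages where memory access dominates.

```python
# Loop blocking (tiling) sketch: transpose a matrix tile by tile so that
# each (block x block) tile of the source and destination fits in cache.
def blocked_transpose(a, block=64):
    n = len(a)
    out = [[0] * n for _ in range(n)]
    for ii in range(0, n, block):          # tile row start
        for jj in range(0, n, block):      # tile column start
            for i in range(ii, min(ii + block, n)):
                for j in range(jj, min(jj + block, n)):
                    out[j][i] = a[i][j]
    return out

# Small correctness check of the traversal pattern.
a = [[i * 8 + j for j in range(8)] for i in range(8)]
t = blocked_transpose(a, block=4)
assert all(t[j][i] == a[i][j] for i in range(8) for j in range(8))
```

The block size is a tuning parameter chosen so two tiles fit in the target cache level; in compiled code this is typically found by measurement.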
7. The High Performance Computing community must address the Programming Challenge.
- The problem is severe
  - Code development time is often longer than platform turnaround time; many codes have a 20-year or longer lifetime, with a 5- to 10-year (or more) replacement time
  - Performance is often poor and not likely to improve
  - The trade-off between portability and performance favors portability
- More needs to be done to simplify application code development
  - Better tools, language enhancements for easier parallelization, and stable architectures are needed
- Addressing the Programming Challenge is a major thrust of the DARPA High Productivity Computing Systems (HPCS) project.
8. Outline
- Introduction
- The Performance and Programming Challenges
- The Prediction Challenge
- Lessons Learned from ASCI
- Case study lessons
- Quantitative Estimation
- Verification and Validation
- Software Quality
- Conclusions and Path Forward
9. The Predictive Challenge is even more serious than the Programming Challenge.
- The Programming Challenge is a matter of efficiency
  - Programming for more complex computers takes longer, is more difficult, and takes more people, but with enough resources and time, codes can be written to run on these computers
- But the Predictive Challenge is a matter of survival
  - If the results of the complicated application codes cannot be believed, then there is no reason to develop the codes or the platforms to run them on.
10. Many things can be wrong with a computer-generated prediction.
- Experimental and theoretical science are mature methodologies, but computational science is not.
- The code could have bugs in either the models or the solution methods that produce incorrect answers
  - e.g., 2 + 2 = 54.22, sin(90°) = 673.44, etc.
- The models in the code could be incomplete, not applicable to the problem, or have wrong data
  - e.g., climate models without an ocean current model
- The user could be inexperienced and not know how to use the code correctly
  - CRATER analysis of Columbia Shuttle damage
- Two examples: the Columbia Space Shuttle, sonoluminescence
11. Lessons learned are important
- Four stages of design maturity are required for a methodology to mature (Henry Petroski, Design Paradigms)
- Suspension bridges: case studies of failures (and successes) were essential for reaching reliability and credibility
  - The Tacoma Narrows Bridge buckled and fell 4 months after construction!
  - Case studies were conducted after each crash; lessons learned were identified and adopted by the community
12. Computational science is at the third stage.
- Computational science is in the midst of the third stage.
- Prior generations of code developers were deeply scared of failure and didn't trust the codes.
- The new generation of code developers is trained as computational scientists
- New codes are more complex and more ambitious, but not as closely coupled to experiments and theory
- Disasters are occurring now
- We need to assess the successes and failures, develop lessons learned, and adopt them
  - Otherwise we will fail to fulfill the promise of computational science
- Computational science has to develop the same professional integrity as theoretical and experimental science
13. Computational analysis of Columbia Space Shuttle damage was flawed.
- The Columbia Space Shuttle wing failed during re-entry due to hot gases entering a portion of the wing damaged by a piece of foam that broke off during launch
- Shortly after launch, Boeing did an analysis using the code CRATER to determine the likelihood that the wing was seriously damaged
- Problems with the analysis
  - The analysis was carried out by an inexperienced user, someone who had only used the code once before
  - CRATER was designed to study the effects of micrometeorite impacts, and had been validated only for projectiles less than 1/400 the size and mass of the piece of foam that struck the wing
  - They didn't use a code like LS-DYNA, the industry standard for assessing impact damage
  - The prior CRATER validation results indicated that the code gave conservative predictions
- The analysis indicated that there might be some damage, but probably not at a level to warrant concern
- Concerns due to the CRATER analysis were downplayed by middle management and not passed up the system.
- The result was that no one tried hard to look at the wing and figure out how to do something to avoid the crash (maybe there was no way to fix it).

NASA Columbia Shuttle Accident Report
[Figures: validation object, Columbia re-entry, foam objects, Columbia breakup]
14. Computational predictions led us astray with sonoluminescent fusion
- Taleyarkhan and co-workers at ORNL used intense sound waves to form sonoluminescent bubbles in deuterated acetone
- They observed tritium formation and 14 MeV neutrons, indicating that nuclear fusion was occurring
  - The observed temperature was 10^7 K instead of the usual 10^3 K
- If true, we could have a fusion reactor in every house
- Computer simulations matched the experimental results only if the driving pressure in the codes was increased by a factor of 10 (well outside reasonable uncertainties)
- Based on the "agreement" of theory and experiment, the results were published in Science and generated intense interest
- Experiments were repeated, especially at ORNL. No significant tritium or 14 MeV neutrons were found.
- Sigh: no fusion reactor in everyone's house.
- The simulation was misleading
  - Bad experiment plus bad simulation?

Taleyarkhan et al., Science 295 (2002), p. 1868; Shapira et al., PRL 89 (2002), p. 104302.
15. Outline
- Introduction
- The Performance and Programming Challenges
- The Prediction Challenge
- Lessons Learned from ASCI
- Case study lessons
- Quantitative Estimation
- Verification and Validation
- Software Quality
- Conclusions and Path Forward
16. ASCI
- In late 1996, the DOE launched the Accelerated Strategic Computing Initiative (ASCI) to develop, by 2004 at LANL, LLNL, and SNL, the enhanced predictive capability required to certify the US nuclear stockpile without testing
- ASCI codes were to have much better physics, better resolution, and better materials data
  - This requires a 10^5 increase in computer power over the 1995 level
  - Develop massively parallel platforms (20 TFlops at LANL this year, 100 TFlops at LLNL in 2005-2006)
- ASCI included development of applications, development and analysis tools, massively parallel platforms, operating and networking systems, and physics models
- $6B expended so far
17. Lessons Learned
- Build on successful code development history and prototypes
- Highly competent and motivated people in a good team are essential
- Risk identification, management, and mitigation are essential
- Software project management: run the code project like a project
  - Schedule and resources are determined by the requirements
- Customer focus is essential
  - For code teams and for stakeholder support
- Better physics is much more important than better computer science
  - Use modern but proven computer science techniques
  - Don't make the code project a computer science research project
- Provide training for the team
- Software quality engineering: best practices rather than processes
- Validation and verification are essential
18. LLNL and LANL had no big code project experience before ASCI
- But they did have a lot of successful small-team experience
  - Code teams of 1 to 5 staff developed multi-physics codes one module at a time and then integrated them
- ASCI needed rapid code development on an accelerated time scale
- LLNL and LANL launched large code projects, with very mixed success
- They didn't look at the lessons learned from other communities
19. Teams
- Tom DeMarco states that there are four essentials of good management:
  - Get the right people
  - Match them to the right jobs
  - Keep them motivated
  - Help their teams to jell and stay jelled
  - "(All the rest is administrivia.)" (The Deadline)
- Success is all about teams!
- Management's key role is the support and nurturing of teams

Crestone Project Team
T. DeMarco, 2000; DeMarco and Lister, 1999; Cockburn and Highsmith, 2001; Thomsett, 2002; McBreen, 2001
20. Risk identification, management, and mitigation are essential
- Tom DeMarco lists five major risks for software projects:
  - 1. Uncertain or rapidly changing requirements, goals, and deliverables (almost always fatal)
  - 2. Inadequate resources or schedule to meet the requirements
  - 3. Institutional turmoil, including lack of management support for the code project team, rapid turnover, unstable computing environment, etc.
  - 4. Inadequate reserve and allowance for requirements creep and scope changes
  - 5. Poor team performance
- To these we add two:
  - 6. Inadequate support by stakeholder groups that need to supply essential modules, etc.
  - 7. The problem is too hard to be solved within existing constraints
- Poor team performance is usually blamed for problems
  - But all risks except 5 are the responsibility of management!
- ASCI experience: management attention to risks 1-4, 6, and 7 has been inadequate.
- Risk mitigation through contingency, back-up staff, and back-up activities is key

T. DeMarco, 2002a
21. Software Project Management
- Good organization of the work is essential
  - Good project management is more important for success than good software quality assurance
- Manage the code project as a project
  - Clearly defined deliverables, a work breakdown structure for the tasks, a schedule, and a plan tied to resources
  - Execute the plan, monitor and track progress with quantitative metrics, and re-deploy resources as necessary to keep the project on track
- Insist on support from sponsors and stakeholders
- The project leader must control the resources; otherwise the leader is just a cheerleader!

Brooks, 1987; Remer, 2000; Rifkin, 2002; Thomsett, 2002; Highsmith, 2001
22. Requirements, schedule, and resources must be consistent.
- Planning software projects is more restrictive than planning conventional projects
  - Normally, one can pick two out of requirements, schedule, and resources, and the third is then determined
  - For software, the requirements determine the optimal schedule and the optimal resources. You can do worse, but not better, than optimal.
- Schedule
  - The rate of software development, like all knowledge-based work, is limited by the speed at which people can think, analyze problems, and develop solutions.
- Resources
  - The size of the code team is limited by the number of people who can coordinate their work together
- Specifying the schedule and/or resources in addition to the requirements has been one of the greatest problems for ASCI (and for many other code development projects in our experience).
  - The code development plan is then over-specified.

D. Remer, 2000; T. Jones, 1988; Post and Kendall, 2002; S. Rifkin, 2002
23. ASCI codes take about 8 years to develop.
- We have studied the record of most of the ASCI code projects at LANL and LLNL to identify the necessary resources and schedule
- The requirements are well known and fixed: LANL and LLNL have been modeling nuclear weapons for 50 to 60 years.
- We find that it takes about 8 years and a team of at least 10 to 20 professionals to develop a code with the minimum capability to provide a 3-dimensional simulation of a nuclear explosion
  - No code project that hasn't had 8 years has succeeded
  - All of the code projects that have succeeded have been at least 8 years old (although some have failed even with 8 years of work)

Post and Cook, 2000; Post and Kendall, 2002
24. Initial use of metrics to estimate schedule and resource requirements
- The key parameter is the function point (FP), a weighted total of inputs, outputs, inquiries, logical files, and interfaces
- SLOC = source lines of code
(T. Capers Jones, 1998)
- Corrections for the Lab environment compared to industry:
  - Schedule: FP-based schedule plus delays; delays of up to 1.5 years for recruiting, clearances, and the learning curve
  - Schedule multiplier of 1.6 for the complexity of the project
  - The 1.6 is based on the contingency required compared to industry due to the complexity of the classified/unclassified computing environment, unstable and evolving ASCI platforms, the paradigm shift to massively parallel computing, the need for algorithm R&D, the complexity of the models, etc.

D. Remer, 2000; E. Yourdon, 1997
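The arithmetic behind this kind of estimate can be sketched as follows. The exponent and divisor are Capers Jones's generic published rules of thumb (schedule in months roughly FP^0.4, staff roughly FP/150); the 1.6 complexity multiplier and 1.5-year startup delay are the Lab corrections quoted on this slide. The 3,000-FP project size is an assumed input for illustration only, and the result is not a reproduction of the slide's estimates; generic industry rules in fact understate the roughly 8 years observed for ASCI codes, which is why Lab-specific corrections matter.

```python
def estimate(function_points, complexity_multiplier=1.6, startup_delay_years=1.5):
    """Rough schedule/staff estimate from a function-point count."""
    schedule_months = function_points ** 0.4   # Jones rule of thumb: months ~ FP^0.4
    staff = function_points / 150.0            # Jones rule of thumb: staff ~ FP/150
    # Lab-environment corrections quoted on the slide: a 1.6x complexity
    # multiplier plus up to 1.5 years of recruiting/clearance/learning delay.
    years = (schedule_months / 12.0) * complexity_multiplier + startup_delay_years
    return years, staff

# Assumed, illustrative project size: ~3,000 FP implies a ~20-person team
# under the FP/150 rule.
years, staff = estimate(3000)
print(f"~{years:.1f} years with a team of ~{staff:.0f}")
```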
25. Comparison of ASCI code history and estimation results

Post and Cook, 2000; Post and Kendall, 2002
26. ASCI codes require about 8 years and 20 professionals
- The estimation procedures are consistent with LANL and LLNL history
- About 8 years and 20 professionals are required.

D. Remer, 2000; Post and Kendall, 2002
27. The result of not estimating the time and resources needed to meet the requirements was the failure of many code projects to meet their milestones.
- Code projects were asked to do 8 years of work in 3.5 years
- Management changed the requirements 6 months before the due date
- Result:
  - 3 out of 5 code projects failed to meet the milestones (those with less than 8 years of experience)
  - 2 out of 5 code projects succeeded in meeting the milestones (those with more than 8 years of experience)
- Management at LLNL and LANL blamed the code teams, and in some cases punished them
  - Morale was devastated, and project teams were reduced
  - Only now, 3 years later, are some failed projects beginning to recover
- The real cause was a lack of realistic planning and estimation by upper-level management
- DOE and the labs are now drafting more appropriate milestones and schedules

Post and Cook, 2000; Post and Kendall, 2002
28. Focus on the Customer
- Customer focus is a feature of every successful project
- Inadequate customer focus is a feature of many, if not most, unsuccessful projects
- Code projects are unsuccessful unless the users can solve real problems with their codes
- LLNL and LANL encourage customer focus by:
  - Organizationally embedding the code teams in user groups
  - Collocating users and code teams, which encourages daily informal interactions
  - Involving users in the code development process, e.g. testing, setting requirements, assessing progress, etc.
  - Having code development teams share in the success of the users
- Trust between users and developers is essential
  - Common management helps balance competing concerns and issues
- Incremental delivery has helped build and maintain customer focus

McBreen, 2001; D. Phillips, 1997; R. Thomsett, 2002; E. Verzuh, 1999
29. Better physics is the key
- The paradigm shift from interpolation among nuclear test data points to extrapolation to new regimes and conditions requires much better physics
  - We can't rely on adjusting parameters to exploit compensating errors
- The predictive ability of the codes depends on the quality of the physics and the solution algorithms
  - Correct and appropriate equations for the problem, physical data and models, and accurate solutions of the equations
- ASCI has funded a parallel effort to develop better algorithms, physical databases, and physics models (turbulence, opacities, cross sections, etc.)

R. Laughlin, 2002; Post and Kendall, 2002
30. Employ modern computer science techniques, but don't do computer science research.
- The main value of the project is improved science (e.g. physics)
- Implementing improved physics (3-D, higher resolution, better algorithms, etc.) on the newest, largest massively parallel platforms, which change every two years, is challenging enough. Don't increase the challenge!
- Use standard industrial computer science. Computer science must be a reliable and usable tool! Don't do computer science research.
- Beyond-state-of-the-art computer science (new frameworks, languages, etc.) has generally proven to be at best an impediment and at worst a project killer
  - LANL spent over 50% of its code development resources on a project that had a major computer science research component. It was a massive failure.

T. DeMarco and T. Lister, 2002; Post and Cook, 2000; Post and Kendall, 2002
31. Software quality engineering: practices rather than processes
- Attention to project management is more important
- SQA and SQE are big issues for DOE and DOD codes (10 CFR 830, etc.)
- Stan Rifkin defines 3 categories of software (S. Rifkin, 2002):
  - Operationally excellent: one bug kills, e.g. airplane control software
  - Customer intimate: flexible with lots of features, e.g. MS Office
  - Product innovative: new and innovative capabilities, e.g. scientific codes
- Heavy emphasis on process is necessary for operationally excellent software, where bugs are fatal
- Heavy process stifles innovative software, the kind of innovation necessary to solve computational scientific and technical problems
  - Scientists are trained to question authority and look for the value added by any procedure
- ASCI had more success emphasizing "best practices" than "good processes"
- If the code team doesn't implement SQA on their own terms, the sponsors may implement it on theirs, and the teams won't like that much

DeMarco and Boehm, 2002; D. Phillips, 1997; Remer, 2000; DeMarco and Lister, 2002; Rifkin, 2002; Post and Cook, 2000; Post and Kendall, 2002
32. Verification and Validation
- Customers (e.g. DOD) want to know why they should believe code results
  - Codes are only a model of reality
- Verification and validation are essential
- Verification
  - Ensure the equations are solved correctly; a mathematical exercise
  - Regression suites of test problems, convergence tests, manufactured solutions, analytic test problems
- Validation
  - Ensure the models reflect nature
  - Check code results against experimental data
- NNSA is funding a large experimental program to provide validation data
  - National Ignition Facility (NIF), DARHT, ATLAS, Z

[Figure: DARHT]

Roache, 1998; Roache, 2002; Salari and Knupp, 2000; Lindl, 1998; Lewis, 1992; Laughlin, 2002
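A minimal sketch of the "convergence tests" and "manufactured solutions" ideas above: pick an exact solution, here u(x) = sin(x) so that u'' = -sin(x), apply a discrete operator to it, and confirm that the error shrinks at the operator's theoretical order. The specific operator (the 3-point second difference) is chosen only for illustration.

```python
import math

def second_diff_error(n):
    """Max error of the 3-point second difference vs. u'' = -sin(x) on [0, pi]."""
    h = math.pi / n
    err = 0.0
    for i in range(1, n):  # interior points only
        x = i * h
        approx = (math.sin(x - h) - 2 * math.sin(x) + math.sin(x + h)) / h**2
        err = max(err, abs(approx - (-math.sin(x))))
    return err

# Halving h should cut the error by ~4x for a 2nd-order operator, so the
# observed order log2(e_h / e_{h/2}) should come out close to 2.
e1, e2 = second_diff_error(64), second_diff_error(128)
order = math.log(e1 / e2, 2)
print(f"observed order ~ {order:.2f}")
```

A code that fails such a test is solving its equations incorrectly regardless of how plausible its output looks, which is precisely the distinction the slide draws between verification and validation.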
33. Conclusions
- If computational science is to fulfill its promise for society, it will need to become as mature as the theoretical and experimental methodologies.
- Prediction Challenge
  - We need to analyze past experiences, successes, and failures, develop lessons learned, and implement them; DARPA HPCS is doing case studies of 20 major US code projects (DoD, DOE, NASA, NOAA, academia, industry, etc.)
  - The major lesson is that we need to improve:
    - Verification
    - Validation
    - Software project management and software quality
- Programming Challenge
  - The HPC community needs to reduce the difficulty of developing codes for modern platforms; DARPA HPCS is developing new benchmarks and performance measurement methodologies, encouraging new development tools, etc.