Title: NCAR's Response to Upcoming OCI Solicitations
1 NCAR's Response to Upcoming OCI Solicitations
- Richard Loft
- SCD Deputy Director for R&D
2 Outline
- NSF Cyberinfrastructure Strategy (Track-1, Track-2)
- NCAR generic strategy for NSFXX-625s (Track-2)
- NCAR response to NSF05-625
- NSF Petascale Initiative Strategy
- NCAR response to NSF Petascale Initiative
3 NSF's Cyberinfrastructure Strategy
- NSF's HPC acquisition strategy (through FY10) comprises three Tracks:
- Track 1: High End, O(1 PFLOPS sustained)
- Track 2: Mid-level systems, O(100 TFLOPS), NSFXX-625
- First instance (NSF05-625) submitted Feb 10, 2006
- Next instances due
- November 30, 2006
- November 30, 2007
- November 30, 2008
- Track 3: Typical university HPC, O(1-10 TFLOPS)
- The purpose of the Track-1 system will be to
achieve revolutionary advancement and
breakthroughs in science and engineering.
4 Solicitation NSF05-625: Towards a Petascale Computing Environment for Science and Engineering
- Award: September 2006
- System in production by May 31, 2007
- $30,000,000 or $15,000,000.
- Operating costs funded under separate action.
- RP serves the broad science community - open
access.
- Allocations by LRAC/MRAC or their successors
- Two 10 Gb/s TeraGrid links
5 NCAR's Overall NSFXX-625 Strategy
- Leverage NCAR/SCD expertise in production HPC.
- Get a production system:
- No white box Linux solutions.
- Stay on path to usable petascale systems
- NCAR is a TeraGrid outsider and must address two areas:
- Leverage experience with general scientific users
- Lack of Grid consulting experience
- Emphasize, but don't over-emphasize, geosciences.
- In proposing, NCAR has a facility problem
- Minimize costs - power, administrative staff,
level of support.
- Creative plan for remote user support and
education.
6 NSF05-625 Partners
- Facility Partner
- End-to-End System Supplier
- User Support Network:
- NCAR Consulting Service Group
- University partners
7 NSF05-625 Facility Partner
- The NCAR ML facility will be FULL after ICESS.
- Key Points
- A new datacenter is needed whether NCAR wins the
NSF05-625 solicitation or not.
- Because of the short timeline, a new datacenter never factors into the strategy for NSFXX-625.
- Identified a colocation facility
- Facility features:
- local (Denver-Boulder area)
- State of the Art, High Availability Center
- Currently 4 x 2 MW generators available
- Familiar with large scale deployments
- Dark Fibre readily available (good connectivity)
8 NSF05-625 Supercomputer System Details
- Two systems: capability and capacity
- 80 TFLOPS combined
- Robotic tape storage system: 12 PB
9 NCAR NSF05-625 User Support Plan
- Largest potential differentiator in the proposal: let's do something unique!
- System will be used by the generic scientist; the support plan must:
- Be extensible to domains other than geoscience
- Address grid user support
- Strategy leverages the OSCER-led IGERT proposal:
- Combine teaching of computational science with
user support
- Embed application support expertise in key
institutions
- Build education and training materials through
university partnerships.
10 Track-1 System Background
- Source of funds: Presidential Innovation Initiative announced in the SOTU.
- Performance goal: 1 PFLOPS sustained on interesting problems.
- Science goal: breakthroughs.
- Use model: 12 research teams per year using the whole system for days or weeks at a time.
- Capability system: large everything; fault tolerant.
- Single system in one location.
- Not a requirement that machine be upgradable.
11 Track-1 Project Parameters
- Funds: $200M over 4 years, starting FY07
- Single award
- Money is for end-to-end system (as in 625)
- Not intended to fund facility.
- Release of funds tied to meeting hardware and software milestones.
- Deployment Stages
- Simulator
- Prototype
- Petascale system operates FY10-FY15
- Operations funding for FY10-15 provided separately.
12 Two-Stage Award Process Timeline
- Solicitation out: May 2006 (???)
- HPCS down-select: June 2006
- Preliminary Proposal due: August 2006
- Down selection (invitation to 3-4 teams to write a Full Proposal)
- Full Proposal due: January 2007
- Site visits: Spring 2007
- Award: September 2007
13 NSF's view of the problem
- NSF recognizes the facility (power, cooling,
space) challenge of this system.
- Therefore NSF welcomes collaborative approaches:
- University + Federal Lab
- University + commercial data center
- University + State Government
- University consortium
- NSF recognizes that applications will need
significant modification to run on this system.
- User support plan:
- Expects proposer to discuss needs in this area
with experts in key applications areas.
14 The Cards in NCAR's Hand
- NCAR
- Is a leader in making the case that geoscience
grand challenge problems need petascale
computing.
- Has many grand challenge problems of its own to offer.
- Has experience at large processor counts.
- Has recently connected to the TeraGrid, and is
moving towards becoming a full-fledged Resource
Provider.
15 NCAR Response Options
- Do Nothing
- Focus on Petascale Geoscience Applications
- Partner with a lead institution or consortium
- Lead a Track-1 proposal
17 Questions, Comments?
18 The Relationship Between OCI's Roadmap and NCAR's Datacenter Project
- Richard Loft
- SCD Deputy Director for R&D
19 Projected CCSM Computing Requirements Exceed Moore's Law
Thanks to Jeff Kiehl/Bill Collins
20 NSF's Cyberinfrastructure Strategy
- NSF's HPC acquisition strategy (through FY10) comprises three Tracks:
- Track 1: High End, O(1 PFLOPS sustained)
- Track 2: Mid-level systems, O(100 TFLOPS), NSFXX-625
- First instance (NSF05-625) submitted Feb 10, 2006
- Next instances due
- November 30, 2006
- November 30, 2007
- November 30, 2008
- Track 3: Typical university HPC, O(1-10 TFLOPS)
- The purpose of the Track-1 system will be to
achieve revolutionary advancement and
breakthroughs in science and engineering.
21 NCAR Strategic Goals
- NCAR will stay in the top echelon of geoscience computing centers.
- NCAR's immediate strategic goal is to be a Track-2 center.
- To do this, NCAR must be integrated with NSF's cyberinfrastructure plans.
- This means both connecting to and ultimately operating within the TeraGrid framework.
- The TeraGrid is evolving, so this is a moving target.
22 NCAR New Facility
- The NCAR ML facility will be FULL after ICESS.
- Key Points
- A new datacenter is needed whether NCAR wins the
NSF05-625 solicitation or not.
- Because of the short timeline, a new datacenter
never factors into the strategy for NSFXX-625.
- Right now, we can't handle a modest budget
augmentation for computing with the current
facility.
23 Mesa Lab is full after the ICESS procurement
- ICESS: Integrated Computing Environment for Scientific Simulation
- We're sitting at 980 kW right now.
- Deinstall of bluesky will give us back 450 kW.
- This leaves about 600 kW of headroom.
- The ICESS procurement is expected to deliver a system with a maximum power requirement of 500-600 kW.
- This is not enough to house $15M-$30M of equipment from NSF05-625, for example.
24 We're fast running out of power
Max power at the Mesa Lab is 1.2 MW!
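As a rough illustration of the arithmetic on the two slides above, here is a minimal Python sketch. All figures (1.2 MW ceiling, 980 kW load, 450 kW from bluesky, 500-600 kW for ICESS) come from slides 23-24; treating the slides' "about 600 kW" as this calculation rounded down for margin is an assumption.

```python
# Back-of-envelope Mesa Lab power budget, using only the figures on slides 23-24.

MAX_POWER_KW = 1200        # Mesa Lab ceiling (slide 24)
CURRENT_LOAD_KW = 980      # current draw (slide 23)
BLUESKY_RETURN_KW = 450    # recovered when bluesky is deinstalled
ICESS_MAX_KW = 600         # upper end of the expected ICESS requirement

headroom = MAX_POWER_KW - (CURRENT_LOAD_KW - BLUESKY_RETURN_KW)
print(f"Post-bluesky headroom: {headroom} kW")              # ~670 kW, i.e. "about 600 kW"
print(f"Headroom left after ICESS: {headroom - ICESS_MAX_KW} kW")  # little room for NSF05-625 gear
```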
25 Preparing for the Petascale
- Richard Loft
- SCD Deputy Director for R&D
26 What to expect in HEC?
- Much more parallelism.
- A good deal of uncertainty regarding node
architectures.
- Many threads per node.
- Continued ubiquity of Linux/Intel systems.
- There will be vector systems
- Emergence of exotic architectures.
- Largest (petascale) system likely to have special
features
- Power aware design (small memory?)
- Fault tolerant design features
- Light-weight compute node kernels
- Custom networks
27 Top 500: Speed of Supercomputers vs. Time
28 Top 500: Number of Processors vs. Time
29 HEC in 2010
- Based on history, we should expect 4K-8K CPU systems to be commonplace by the end of the decade.
- The largest systems on the Top500 list should be 1-10 PFLOPS.
- Parallelism in the largest system: an estimate for 2010 (worked out in the sketch below).
- Assume a clock speed of 5 GHz; a dual-FMA CPU then delivers 20 GFLOPS peak.
- 1 PFLOPS peak requires roughly 50K CPUs.
- 10 PFLOPS peak requires roughly 500K CPUs.
- Large vector systems (if they exist) will still be highly parallel.
- To justify using the largest systems, one must use a sizable fraction of the resource.
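The parallelism estimate above can be reproduced in a few lines of Python. The 5 GHz clock and dual-FMA (4 flops per cycle) figures are the slide's assumptions; the helper name is illustrative.

```python
# CPU-count estimate from slide 29, under the slide's assumptions.
CLOCK_HZ = 5.0e9                 # assumed 2010-era clock speed
FLOPS_PER_CYCLE = 4              # dual FMA = 2 multiply-adds = 4 flops per cycle
PEAK_PER_CPU = CLOCK_HZ * FLOPS_PER_CYCLE   # 2.0e10 flops = 20 GFLOPS per CPU

def cpus_needed(system_peak_flops: float) -> int:
    """CPUs required to reach a given peak, assuming 20 GFLOPS per CPU."""
    return round(system_peak_flops / PEAK_PER_CPU)

for peak in (1.0e15, 1.0e16):    # 1 PFLOPS and 10 PFLOPS
    print(f"{peak / 1e15:.0f} PFLOPS peak -> ~{cpus_needed(peak):,} CPUs")
# Prints ~50,000 CPUs for 1 PFLOPS and ~500,000 CPUs for 10 PFLOPS.
```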
30 Range of Plausible Architectures, 2010
- Power issues will slow rate of increase in clock
frequency.
- This will drive trend towards massive
parallelism.
- All scalar systems will have multiple CPUs per socket (chip).
- Currently 2 CPUs per socket; by 2008, 4 CPUs per socket will be commonplace.
- 2010 scalar architectures will likely continue this trend. 8 CPUs per socket are possible; the Cell chip already has 8 synergistic processors.
- The key unknown is which cluster-on-a-chip architecture will be most effective.
- Vector systems will be around, but at what
price?
- Wildcards
- Impact of DARPA HPCS program
- Exotics: FPGAs, PIMs, GPUs.
31 How to make science staff aware of coming changes?
- NCAR must develop a science driven plan for
exploiting petascale systems at the end of the
decade.
- Briefed NCAR Director, DD, CISL and ESSL
Directors
- Meetings (SEWG at CCSM Breckenridge)
- Organizing NSF workshops on petascale geoscience benchmarking, scheduled in DC (June 1-2) and at NCAR (TBD)
- Have initiated internal petascale discussions
- CGD-SCD joint meetings
- Peta_ccsm mail list.
- Peta_ccsm Swiki site.
- Through activities like this, NSA should take a leadership role.
32 What must be done to secure resources to improve scalability?
- We must help ourselves.
- Invest judiciously in computational science where
possible.
- Leverage application development partnerships
(SciDAC, etc.)
- Write proposals.
- Support for applications development for the Track-1 system can be built into an NCAR partnership deal.
- NSF has indicated an independent funding track
for applications. NCAR should aggressively pursue
those funding sources.
- New ideas can help - e.g. POP
33 POP Space-Filling Curve partition for 8 processors
Credit: John Dennis, SCD
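For readers unfamiliar with the technique in this figure, here is a minimal Python sketch of space-filling-curve (SFC) partitioning. It uses a simple Morton (Z-order) curve on an 8x8 grid of blocks as a stand-in; POP's actual curve construction differs and also drops land-only blocks before cutting the curve, which this sketch does not attempt.

```python
# Illustrative SFC partitioning: order blocks along a Morton (Z-order) curve,
# then cut the curve into equal contiguous pieces, one per processor.

def morton_index(i: int, j: int, bits: int = 8) -> int:
    """Interleave the bits of (i, j) to get the block's position on a Z-order curve."""
    idx = 0
    for b in range(bits):
        idx |= ((i >> b) & 1) << (2 * b + 1)
        idx |= ((j >> b) & 1) << (2 * b)
    return idx

def sfc_partition(nx: int, ny: int, nprocs: int):
    """Assign each processor a contiguous segment of the curve (a compact patch of blocks)."""
    blocks = sorted((morton_index(i, j), (i, j)) for i in range(nx) for j in range(ny))
    chunk = len(blocks) // nprocs
    return [[ij for _, ij in blocks[p * chunk:(p + 1) * chunk]] for p in range(nprocs)]

# Each of the 8 "processors" gets a spatially compact set of 8 blocks,
# which keeps nearest-neighbor halo exchanges mostly on-processor.
for rank, owned in enumerate(sfc_partition(8, 8, 8)):
    print(rank, owned)
```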
34 POP 1/10 Degree BG/L Improvements
35 POP 1/10 Degree performance: BG/L SFC improvement
36 Questions, Comments?
37 Top 500 Processor Types: Intel taking over
Today, Intel is inside 2/3 of the Top500 machines.
39 The commodity onslaught
- The Linux/Intel cluster is taking over the Top500.
- Linux has not penetrated the major weather, ocean, and climate centers yet; reasons:
- System maturity (SCD experience)
- Scalability of dominant commodity interconnects
- Combinatorics (Linux flavor, processor,
interconnect, compiler)
- But it affects NCAR indirectly because:
- Ubiquity = Opportunity
- Universities are deploying them.
- NCAR must rethink services provided to the
Universities.
- Puts strain on all community software development
activities.