Title: NERSC Status and Plans
1 - NERSC Status and Plans
- for the NERSC User Group Meeting, February 22, 2001
- BILL KRAMER
- DEPUTY DIVISION DIRECTOR
- DEPARTMENT HEAD, HIGH PERFORMANCE COMPUTING DEPARTMENT
- kramer_at_nersc.gov
- 510-486-7577
2 - Agenda
- Update on NERSC activities
- IBM SP Phase 2 status and plans
- NERSC-4 plans
- NERSC-2 decommissioning
3 - ACTIVITIES AND ACCOMPLISHMENTS
4 - NERSC Facility Mission
To provide reliable, high-quality, state-of-the-art computing resources and client support in a timely manner independent of client location, while wisely advancing the state of computational and computer science.
5 - 2001 GOALS
- PROVIDE RELIABLE AND TIMELY SERVICE
  - Systems: Gross Availability, Scheduled Availability, MTBF/MTBI, MTTR (see the sketch after this slide)
  - Services: Responsiveness, Timeliness, Accuracy, Proactivity
- DEVELOP INNOVATIVE APPROACHES TO ASSIST THE CLIENT COMMUNITY IN EFFECTIVELY USING NERSC SYSTEMS
- DEVELOP AND IMPLEMENT WAYS TO TRANSFER RESEARCH PRODUCTS AND KNOWLEDGE INTO PRODUCTION SYSTEMS AT NERSC AND ELSEWHERE
- NEVER BE A BOTTLENECK TO MOVING NEW TECHNOLOGY INTO SERVICE
- ENSURE ALL NEW TECHNOLOGY AND CHANGES IMPROVE (OR AT LEAST DO NOT DIMINISH) SERVICE TO OUR CLIENTS
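To make the reliability metrics above concrete, here is a minimal sketch (using made-up outage records, not NERSC data) of how gross availability, scheduled availability, MTBF/MTBI, and MTTR relate:

```python
# Minimal sketch of the reliability metrics named above, using hypothetical
# outage records. Each outage is (hours_down, scheduled); scheduled outages
# are planned maintenance, unscheduled ones count against MTBF/MTTR.
period_hours = 30 * 24          # a 30-day reporting period
outages = [
    (4.0, True),                # scheduled maintenance window
    (1.5, False),               # unscheduled interrupt
    (0.5, False),               # unscheduled interrupt
]

total_down = sum(h for h, _ in outages)
unsched = [h for h, sched in outages if not sched]
sched_down = total_down - sum(unsched)

gross_availability = 1.0 - total_down / period_hours
# Scheduled availability excludes planned maintenance from the denominator.
scheduled_availability = 1.0 - sum(unsched) / (period_hours - sched_down)
mtbf = (period_hours - total_down) / len(unsched)   # mean time between interrupts
mttr = sum(unsched) / len(unsched)                  # mean time to repair

print(f"gross availability:     {gross_availability:.1%}")
print(f"scheduled availability: {scheduled_availability:.1%}")
print(f"MTBF: {mtbf:.0f} h, MTTR: {mttr:.1f} h")
```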
6 - GOALS (CONT.)
- NERSC AND LBNL WILL BE A LEADER IN LARGE SCALE SYSTEMS MANAGEMENT SERVICES
- EXPORT KNOWLEDGE, EXPERIENCE, AND TECHNOLOGY DEVELOPED AT NERSC, PARTICULARLY TO AND WITHIN NERSC CLIENT SITES
- NERSC WILL BE ABLE TO THRIVE AND IMPROVE IN AN ENVIRONMENT WHERE CHANGE IS THE NORM
- IMPROVE THE EFFECTIVENESS OF NERSC STAFF BY IMPROVING INFRASTRUCTURE, CARING FOR STAFF, ENCOURAGING PROFESSIONALISM AND PROFESSIONAL IMPROVEMENT
[Diagram: goal themes supporting the mission - new technology, timely information, innovative assistance, reliable service, technology transfer, success for clients and facility, consistent service system architecture, large scale leadership, staff effectiveness, change, research flow, wise integration]
7 - Major Accomplishments Since Last Meeting (June 2000)
- IBM SP placed into full service April 4, 2000 (more later)
- Augmented the allocations by 1M hours in FY 2000
- Contributed to 11M PE hours in FY 2000, more than doubling the FY 2000 allocation
- SP is fully utilized
- Moved entire facility to Oakland (more later)
- Completed the second PAC allocation process with lessons learned from the first year
8 - Activities and Accomplishments
- Improved Mass Storage System
- Upgraded HPSS
- New versions of HSI
- Implementing Gigabit Ethernet
- Two STK robots added
- Replaced 3490 with 9840 tape drives
- Higher density and higher speed tape drives
- Formed Network and Security Group
- Succeeded in external reviews
- Policy Board
- SCAC
9 - Activities and Accomplishments
- Implemented new accounting system, NIM
- Old system was
- Difficult to maintain
- Difficult to integrate to new system
- Limited by 32 bits
- Not Y2K compliant
- New system
- Web focused
- Available database software
- Works for any type of system
- Thrived in a state of increased security
- Open Model
- Audits, tests
10 - 2000 Activities and Accomplishments
- NERSC firmly established as a leader in system evaluation
- Effective System Performance (ESP) recognized as a major step in system evaluation and is influencing a number of sites and vendors
- Sustained System Performance measures (an illustrative sketch follows this slide)
- Initiated a formal benchmarking effort, the NERSC Application Performance Simulation Suite (NAPs), which may become the next widely recognized parallel evaluation suite
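As an illustration only (not necessarily NERSC's exact SSP formula), one common way to form a sustained-system-performance figure of merit is to aggregate measured per-processor application rates and scale by the machine's compute processor count. The benchmark names and rates below are hypothetical; the CPU count is the Phase 2a figure from slide 23.

```python
# Illustrative sketch of a "sustained system performance" style metric:
# geometric mean of per-processor application benchmark rates, scaled by the
# number of parallel application CPUs. Benchmark values are hypothetical.
from math import prod

# measured per-processor performance in Gflop/s for each application benchmark
per_cpu_gflops = {"app_a": 0.12, "app_b": 0.09, "app_c": 0.15}

compute_cpus = 2144   # Phase 2a parallel application CPUs (slide 23)

geo_mean = prod(per_cpu_gflops.values()) ** (1.0 / len(per_cpu_gflops))
ssp_estimate = geo_mean * compute_cpus   # sustained Gflop/s across the machine

print(f"geometric mean per CPU: {geo_mean:.3f} Gflop/s")
print(f"sustained system performance estimate: {ssp_estimate:.0f} Gflop/s")
```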
11 - Activities and Accomplishments
- Formed the NERSC Cluster team to investigate the impact of SMP commodity clusters for high performance, parallel computing and to assure the most effective use of division resources related to cluster computing
  - Coordinates all NERSC Division cluster computing activities (research, development, advanced prototypes, pre-production, production, and user support)
- Initiated a formal procurement for a mid-range cluster
  - In consultation with DOE, decided not to award as part of NERSC program activities
12 - NERSC Division
- DIVISION ADMINISTRATOR / FINANCIAL MANAGER: WILLIAM FORTNEY
- CHIEF TECHNOLOGIST: DAVID BAILEY
- DISTRIBUTED SYSTEMS DEPARTMENT: WILLIAM JOHNSTON, Department Head; DEB AGARWAL, Deputy
- HIGH PERFORMANCE COMPUTING DEPARTMENT: WILLIAM KRAMER, Department Head
- HIGH PERFORMANCE COMPUTING RESEARCH DEPARTMENT: ROBERT LUCAS, Department Head
- APPLIED NUMERICAL ALGORITHMS: PHIL COLELLA
- IMAGING & COLLABORATIVE COMPUTING: BAHRAM PARVIN
- COLLABORATORIES: DEB AGARWAL
- ADVANCED SYSTEMS: TAMMY WELCOME
- HENP COMPUTING: DAVID QUARRIE
- CENTER FOR BIOINFORMATICS & COMPUTATIONAL GENOMICS: MANFRED ZORN
- DATA INTENSIVE DIST. COMPUTING: BRIAN TIERNEY (CERN), WILLIAM JOHNSTON (acting)
- COMPUTATIONAL SYSTEMS: JIM CRAW
- MASS STORAGE: NANCY MEYER
- SCIENTIFIC COMPUTING: ESMOND NG
- DISTRIBUTED SECURITY RESEARCH: MARY THOMPSON
- COMPUTER OPERATIONS & NETWORKING SUPPORT: WILLIAM HARRIS
- USER SERVICES: FRANCESCA VERDIER
- CENTER FOR COMPUTATIONAL SCIENCE & ENGR.: JOHN BELL
- SCIENTIFIC DATA MANAGEMENT: ARIE SHOSHANI
- SCIENTIFIC DATA MGMT RESEARCH: ARIE SHOSHANI
- NETWORKING: WILLIAM JOHNSTON (acting)
- FUTURE INFRASTRUCTURE, NETWORKING & SECURITY: HOWARD WALTER
- FUTURE TECHNOLOGIES: ROBERT LUCAS (acting)
- VISUALIZATION: ROBERT LUCAS (acting)
Rev 02/01/01
13 - HIGH PERFORMANCE COMPUTING DEPARTMENT (WILLIAM KRAMER, Department Head)
- USER SERVICES (FRANCESCA VERDIER): Mikhail Avrekh, Harsh Anand, Majdi Baddourah, Jonathan Carter, Tom DeBoni, Jed Donnelley, Therese Enright, Richard Gerber, Frank Hale, John McCarthy, R.K. Owen, Iwona Sakrejda, David Skinner, Michael Stewart (C), David Turner, Karen Zukor
- ADVANCED SYSTEMS (TAMMY WELCOME): Greg Butler, Thomas Davis, Adrian Wong
- COMPUTATIONAL SYSTEMS (JAMES CRAW): Terrence Brewer (C), Scott Burrow (I), Tina Butler, Shane Canon, Nicholas Cardo, Stephan Chan, William Contento (C), Bryan Hardy (C), Stephen Luzmoor (C), Ron Mertes (I), Kenneth Okikawa, David Paul, Robert Thurman (C), Cary Whitney
- COMPUTER OPERATIONS & NETWORKING SUPPORT (WILLIAM HARRIS): Clayton Bagwell Jr., Elizabeth Bautista, Richard Beard, Del Black, Aaron Garrett, Mark Heer, Russell Huie, Ian Kaufman, Yulok Lam, Steven Lowe, Anita Newkirk, Robert Neylan, Alex Ubungen
- MASS STORAGE (NANCY MEYER): Harvard Holmes, Wayne Hurlbert, Nancy Johnston, Rick Un (V)
- HENP COMPUTING (DAVID QUARRIE; CRAIG TULL, Deputy): Paolo Calafiura, Christopher Day, Igor Gaponenko, Charles Leggett (P), Massimo Marino, Akbar Mokhtarani, Simon Patton
- FUTURE INFRASTRUCTURE, NETWORKING & SECURITY (HOWARD WALTER): Eli Dart, Brent Draney, Stephen Lau
Key: (C) Cray, (FB) Faculty UC Berkeley, (FD) Faculty UC Davis, (G) Graduate Student Research Assistant, (I) IBM, (M) Mathematical Sciences Research Institute, (MS) Masters Student, (P) Postdoctoral Researcher, (SA) Student Assistant, (V) Visitor
On leave to CERN
Rev 02/01/01
14 - HIGH PERFORMANCE COMPUTING RESEARCH DEPARTMENT (ROBERT LUCAS, Department Head)
- APPLIED NUMERICAL ALGORITHMS (PHILLIP COLELLA): Susan Graham (FB), Anton Kast, Peter McCorquodale (P), Brian Van Straalen, Daniel Graves, Daniel Martin (P), Greg Miller (FD)
- IMAGING & COLLABORATIVE COMPUTING (BAHRAM PARVIN): Hui H. Chan (MS), Gerald Fontenay, Sonia Sachs, Qing Yang, Ge Cong (V), Masoud Nikravesh (V), John Taylor
- SCIENTIFIC COMPUTING (ESMOND NG): Julian Borrill, Xiaofeng He (V), Jodi Lamoureux (P), Lin-Wang Wang, Andrew Canning, Yun He, Sherry Li, Michael Wehner (V), Chris Ding, Parry Husbands (P), Osni Marques, Chao Yang, Tony Drummond, Niels Jensen (FD), Peter Nugent, Woo-Sun Yang (P), Ricardo da Silva (V), Plamen Koev (G), David Raczkowski (P)
- CENTER FOR BIOINFORMATICS & COMPUTATIONAL GENOMICS (MANFRED ZORN): Donn Davy, Inna Dubchak, Sylvia Spengler
- SCIENTIFIC DATA MANAGEMENT (ARIE SHOSHANI): Carl Anderson, Andreas Mueller, Ekow Etoo, M. Shinkarsky (SA), Mary Anderson, Vijaya Natarajan, Elaheh Pourabbas (V), Alexander Sim, Junmin Gu, Frank Olken, Arie Segev (FB), John Wu, Jinbaek Kim (G)
- CENTER FOR COMPUTATIONAL SCIENCE & ENGINEERING (JOHN BELL): Ann Almgren, William Crutchfield, Michael Lijewski, Charles Rendleman, Vincent Beckner, Marcus Day
- FUTURE TECHNOLOGIES (ROBERT LUCAS, acting): David Culler (FB), Paul Hargrove, Eric Roman, Michael Welcome, James Demmel (FB), Leonid Oliker, Erich Stromeier, Katherine Yelick (FB)
- VISUALIZATION (ROBERT LUCAS, acting): Edward Bethel, James Hoffman (M), Terry Ligocki, Soon Tee Teoh (G), James Chen (G), David Hoffman (M), John Shalf, Gunther Weber (G), Bernd Hamann (FD), Oliver Kreylos (G)
Key: (FB) Faculty UC Berkeley, (FD) Faculty UC Davis, (G) Graduate Student Research Assistant, (M) Mathematical Sciences Research Institute, (MS) Masters Student, (P) Postdoctoral Researcher, (S) SGI, (SA) Student Assistant, (V) Visitor
Life Sciences Div.; On Assignment to NSF
Rev 02/01/01
15 - FY00 MPP Users/Usage by Discipline
16 - FY00 PVP Users/Usage by Discipline
17 - NERSC FY00 MPP Usage by Site
18 - NERSC FY00 PVP Usage by Site
19 - FY00 MPP Users/Usage by Institution Type
20 - FY00 PVP Users/Usage by Institution Type
21 - NERSC System Architecture
[Diagram: FDDI / Ethernet 10/100/Gigabit network connecting the Remote Visualization Server, Max Strat, Symbolic Manipulation Server, IBM and STK robots, DPSS, PDSF, Research Cluster, IBM SP NERSC-3 (604 processors / 304 GB memory), CRI T3E 900 (644 PEs / 256 MB each), CRI SV1, Millennium, IBM SP NERSC-3 Phase 2a (2532 processors / 1824 GB memory), LBNL Cluster, and Vis Lab]
22 - Current Systems
23 - Major Systems
- MPP
  - IBM SP Phase 2a
    - 158 16-way SMP nodes
    - 2144 parallel application CPUs / 12 GB per node
    - 20 TB shared GPFS
    - 11,712 GB swap space - local to nodes
    - 8.6 TB of temporary scratch space
    - 7.7 TB of permanent home space
    - 4-20 GB home quotas
    - 240 Mbps aggregate I/O - measured from user nodes (6 HiPPI, 2 GE, 1 ATM)
  - T3E-900 LC with 696 PEs - UNICOS/mk
    - 644 application PEs / 256 MB per PE
    - 383 GB of swap space - 582 GB checkpoint file system
    - 1.5 TB /usr/tmp temporary scratch space - 1 TB permanent home space
    - 7-25 GB home quotas, DMF managed
    - 35 MBps aggregate I/O measured from user nodes (2 HiPPI, 2 FDDI)
    - 1.0 TB local /usr/tmp
- Serial
  - PVP - three J90 SV-1 systems running UNICOS
    - 64 CPUs total / 8 GB of memory per system (24 GB total)
    - 1.0 TB local /usr/tmp
- PDSF - Linux cluster
  - 281 IA-32 CPUs
  - 3 Linux and 3 Solaris file servers
  - DPSS integration
  - 7.5 TB aggregate disk space
  - 4 striped Fast Ethernet connections to HPSS
- LBNL mid-range cluster
  - 160 IA-32 CPUs
  - Linux with enhancements
  - 1 TB aggregate disk space
  - Myrinet 2000 interconnect
  - Gigabit Ethernet connections to HPSS
- Storage
  - HPSS
    - 8 STK tape libraries
    - 3490 tape drives
    - 7.4 TB of cache disk
    - 20 HiPPI interconnects, 12 FDDI connections, 2 GE connections
    - Total capacity 960 TB; 160 TB in use
  - HPSS - Probe
24 - T3E Utilization: 95% Gross Utilization
[Chart annotations: allocation starvation; full scheduling functionality; 4.4% improvement per month; checkpointing - start of capability jobs; allocation starvation; systems merged]
25 - SP Utilization
- In the 80-85% range, which is above original expectations for the first year
- More variation than T3E
26 - T3E Job Size
More than 70% of the jobs are large
27 - SP Job Size
Full size jobs are more than 10% of usage
60% of the jobs are > ¼ the maximum size
28 - Storage: HPSS
29 - NERSC Network Architecture
30 - CONTINUE NETWORK IMPROVEMENTS
31 - LBNL Oakland Scientific Facility
32 - Oakland Facility
- 20,000 sf computer room, 7,000 sf office space
- 16,000 sf of computer space built out
- NERSC occupying 12,000 sf
- Ten-year lease with 3 five-year options
- $10.5M computer room construction costs
- Option for additional 20,000 sf computer room
33 - LBNL Oakland Scientific Facility
Move accomplished between Oct 26 and Nov 4
System: Scheduled / Actual
- SP: 10/27 9 am / no outage
- T3E: 11/3 10 am / 11/3 3 am
- SV1s: 11/3 10 am / 11/2 3 pm
- HPSS: 11/3 10 am / 10/31 9:30 am
- PDSF: 11/6 10 am / 11/2 11 am
- Other systems: 11/3 10 am / 11/1 8 am
34 - Computer Room Layout
Up to 20,000 sf of computer space
Direct ESnet node at OC12
35 - 2000 Activities and Accomplishments
- PDSF Upgrade in conjunction with building move
36 - 2000 Activities and Accomplishments
- netCDF parallel support developed by NERSC staff for the Cray T3E
  - A similar effort is being planned to port netCDF to the IBM SP platform
- Communication for clusters: M-VIA and MVICH (see the sketch after this slide)
  - M-VIA and MVICH are VIA-based software for low-latency, high-bandwidth, inter-process communication
  - M-VIA is a modular implementation of the VIA standard for Linux
  - MVICH is an MPICH-based implementation of MPI for VIA
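Because MVICH implements the standard MPI interface, ordinary MPI codes run on it unchanged. The sketch below is illustrative only: a minimal ping-pong latency test of the kind used to evaluate low-latency communication layers, shown with the mpi4py binding for brevity (any MPI implementation runs the same calls).

```python
# Minimal ping-pong sketch of a latency measurement, the kind of test that
# motivates VIA-based MPI layers such as MVICH. Illustrative only.
from mpi4py import MPI
import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
reps = 1000
buf = bytearray(8)              # tiny message to expose latency, not bandwidth

comm.Barrier()
start = time.perf_counter()
for _ in range(reps):
    if rank == 0:
        comm.Send(buf, dest=1)      # rank 0 sends, then waits for the echo
        comm.Recv(buf, source=1)
    elif rank == 1:
        comm.Recv(buf, source=0)    # rank 1 echoes every message back
        comm.Send(buf, dest=0)
elapsed = time.perf_counter() - start

if rank == 0:
    # one round trip is two messages, so half the round-trip time ~ latency
    print(f"half round-trip latency: {elapsed / reps / 2 * 1e6:.1f} microseconds")
```

Run with two ranks, e.g. `mpiexec -n 2 python pingpong.py`.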
37 - FY 2000 User Survey Results
- Areas of most importance to users
  - Available hardware (cycles)
  - Overall running of the center
  - Network access to NERSC
  - Allocations process
- Highest satisfaction (score > 6.4)
  - Problem reporting/consulting services (timely response, quality, follow-up)
  - Training
  - Uptime (SP and T3E)
  - Fortran (T3E and PVP)
- Lowest satisfaction (score < 4.5)
  - PVP batch wait time
  - T3E batch wait time
- Largest increases in satisfaction from FY 1999
  - PVP cluster (we introduced interactive SV1 services)
  - HPSS performance
  - Hardware management and configuration (we monitor and improve this continuously)
  - HPCF website (all areas are continuously improved, with a special focus on topics highlighted as needing improvement in the surveys)
  - T3E Fortran compilers
38 - Client Comments from Survey
- "Very responsive consulting staff that makes the user feel that his problem, and its solution, is important to NERSC"
- "Provide excellent computing resources with high reliability and ease of use."
- "The announcement managing and web-support is very professional."
- "Manages large simulations and data. The oodles of scratch space on mcurie and gseaborg help me process large amounts of data in one go."
- "NERSC has been the most stable supercomputer center in the country particularly with the migration from the T3E to the IBM SP."
- "Makes supercomputing easy."
39 - NERSC 3 Phase 2a/b
40 - Result: NERSC-3 Phase 2a
- System built and configured
  - Started factory tests 12/13
  - Expect delivery 1/5
- Undergoing acceptance testing
- General production April 2001
- What is different that needs testing
  - New processors
  - New nodes, new memory system
  - New switch fabric
  - New operating system
  - New parallel file system software
41 - IBM Configuration
                                      Phase 1          Phase 2a/b
Compute nodes                         256              134
Processors                            256 x 2 = 512    134 x 16 = 2144
Networking nodes                      8                2
Interactive nodes                     8                2
GPFS nodes                            16               16
Service nodes                         16               4
Total nodes (CPUs)                    304 (604)        158 (2528)
Total memory (compute nodes)          256 GB           1.6 TB
Total global disk (user accessible)   10 TB            20 TB
Peak (compute nodes)                  409.6 GF         3.2 TF
Peak (all nodes)                      486.4 GF         3.8 TF
Sustained System Perf.                33 GF            235 GF / 280 GF*
Production dates                      April 1999       April 2001 / Oct 2001
* Minimum - may increase due to the sustained system performance measure
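A quick arithmetic check of the Phase 2a/b figures above, as a minimal sketch (the per-CPU peak rate is derived from the table, not stated on the slide):

```python
# Quick arithmetic check of the IBM configuration table above.
phase2a_compute_nodes = 134
cpus_per_node = 16
peak_compute_tf = 3.2            # TF, from the table (compute nodes)

cpus = phase2a_compute_nodes * cpus_per_node
print(f"compute CPUs: {cpus}")                     # 2144, matches the table
print(f"implied peak per CPU: {peak_compute_tf * 1e3 / cpus:.2f} Gflop/s")

# Total nodes: 134 compute + 2 networking + 2 interactive + 16 GPFS + 4 service
total_nodes = 134 + 2 + 2 + 16 + 4
print(f"total nodes: {total_nodes}")               # 158, matches the table
```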
42 - What has been completed
- 6 nodes added to the configuration
- Memory per node increased to 12 GB for 140 compute nodes
  - Loan of full memory for Phase 2
- System installed and braced
- Switch adapters and memory added to system
- System configuration
- Security audit
- System testing for many functions
- Benchmarks being run and problems being diagnosed
43 - Current Issues
- Failures of two benchmarks need to be resolved
  - Best case: indicates broken hardware, likely with the switch adapters
  - Worst case: indicates fundamental design and load issues
- Variation
  - Loading and switch contention
- Remaining tests
  - Throughput, ESP
  - Full system tests
  - I/O
  - Functionality
44 - General Schedule
- Complete testing - TBD based on problem correction
- Production configuration set up
  - 3rd party s/w, local tools, queues, etc.
- Availability test
- Add early users 10 days after successful testing is complete
- Gradually add other users - complete 40 days after successful testing
- Shut down Phase 1 10 days after system is open to all users
- Move 10 TB of disk space - configuration will require Phase 2 downtime
- Upgrade to Phase 2b in late summer / early fall
45 - NERSC-3 Sustained System Performance Projections
- Estimates the amount of scientific computation that can really be delivered
- Depends on delivery of Phase 2b functionality
- The higher the last number is, the better, since the system remains at NERSC for 4 more years
[Chart annotations: Test/Config, Acceptance, etc.; Software lags hardware]
46 - NERSC Computational Power vs. Moore's Law
47 - NERSC 4
48 - NERSC-4
- NERSC 4 IS ALREADY ON OUR MINDS
- PLAN IS FOR FY 2003 INSTALLATION
- PROCUREMENT PLANS BEING FORMULATED
- EXPERIMENTATION AND EVALUATION OF VENDORS IS STARTING
  - ESP, ARCHITECTURES, BRIEFINGS
- CLUSTER EVALUATION EFFORTS
- USER REQUIREMENTS DOCUMENT (GREENBOOK) IMPORTANT
49 - How Big Can NERSC-4 Be?
- Assume a delivery in FY 2003
- Assume no other space is used in Oakland until NERSC-4
- Assume cost is not an issue (at least for now)
- Assume technology still progresses
- ASCI will have a 30 Tflop/s system running for over 2 years
50 - How Close Is 100 Tflop/s?
- Available gross space in Oakland is 3,000 sf without major changes
- Assume it is 70% usable
  - The rest goes to air handlers, columns, etc.
  - That gives 3,000 sf of space for racks
- IBM system used for estimates
  - Other vendors are similar
- Each processor is 1.5 GHz, to yield 6 Gflop/s
- An SMP node is made up of 32 processors
- 2 nodes in a frame
  - 64 processors in a frame = 384 Gflop/s per frame
- Frames are 32-36" wide and 48" deep
  - Service clearance of 3 feet in front and back (which can overlap)
  - 3 ft by 7 ft is 21 sf per frame
51 - Practical System Peak
- Rack distribution
  - 60% of racks are for CPUs
    - 90% are user/computation nodes
    - 10% are system support nodes
  - 20% of racks are for switch fabric
  - 20% of racks for disks
- 5,400 sf / 21 sf per frame = 257 frames
- 277 nodes that are directly used by computation
  - 8,870 CPUs for computation
  - System total is 9,856 CPUs (308 nodes)
- Practical system peak is 53 Tflop/s (arithmetic sketched after this slide)
  - 0.192 Tflop/s per node x 277 nodes
  - Some other places would claim 60 Tflop/s
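A minimal sketch of the frame/node/CPU arithmetic from the last two slides. All inputs come from the slides; small differences (e.g. 8,864 vs. the deck's 8,870 CPUs) reflect rounding in the original.

```python
# Sketch of the sizing arithmetic on slides 50-51, using the slides' numbers.
floor_sf = 5400                 # usable rack floor space used on slide 51
sf_per_frame = 21               # 3 ft x 7 ft footprint including clearance
frames = floor_sf // sf_per_frame          # 257 frames

cpu_frames = int(frames * 0.60)            # 60% of racks hold CPUs
nodes = cpu_frames * 2                     # 2 SMP nodes per frame -> ~308 nodes
compute_nodes = int(nodes * 0.90)          # 90% user/computation nodes -> ~277
cpus = compute_nodes * 32                  # 32 processors per node (~8,864)

peak_per_node_tf = 32 * 6 / 1000           # 32 CPUs x 6 Gflop/s = 0.192 Tflop/s
practical_peak_tf = compute_nodes * peak_per_node_tf

print(f"frames: {frames}, compute nodes: {compute_nodes}, CPUs: {cpus}")
print(f"practical peak: {practical_peak_tf:.0f} Tflop/s")                   # ~53
print(f"peak over all {nodes} nodes: {nodes * peak_per_node_tf:.0f} Tflop/s")  # ~59
```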
52 - How Much Use Will It Be?
- Sustained vs. peak performance (see the sketch after this slide)
  - Class A codes on the T3E sampled at 11%
  - LSMS
    - 44% of peak on T3E
    - So far 60% of peak on Phase 2a (maybe more)
- Efficiency
  - T3E runs at a 30-day average of about 95%
  - SP runs at a 30-day average of about 80%
  - Additional functionality still planned
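A minimal sketch combining the sustained-vs-peak and efficiency figures above into a delivered-performance estimate; the 53 Tflop/s practical peak is from slide 51, and the other ratios are the examples given on this slide.

```python
# Sketch: delivered performance = practical peak x sustained fraction x utilization.
practical_peak_tf = 53.0        # from slide 51

scenarios = {
    # (fraction of peak sustained by applications, 30-day system utilization)
    "typical code (11% of peak), SP-like 80% utilization": (0.11, 0.80),
    "LSMS-like code (44% of peak), T3E-like 95% utilization": (0.44, 0.95),
}

for name, (sustained_frac, utilization) in scenarios.items():
    delivered = practical_peak_tf * sustained_frac * utilization
    print(f"{name}: ~{delivered:.1f} Tflop/s delivered")
```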
53 - How Much Will It Cost?
- Current cost for a balanced system is about $7.8M per Tflop/s
- Aggressive
  - Cost should drop by a factor of 4
  - $1-2M per Tflop/s
  - Many assumptions
- Conservative
  - $3.5M per Tflop/s
- Added costs to install, operate, and balance the facility are 20%
- The full cost is $140M to $250M (see the sketch after this slide)
- Too bad
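A minimal sketch of where the $140M-$250M range comes from, applying the per-Tflop/s costs and 20% facility adder above to the roughly 60 Tflop/s peak from slide 51:

```python
# Sketch of the cost range quoted above. The 20% adder covers installation,
# operation, and balancing the facility.
peak_tf = 60.0                  # approximate peak from slide 51
adder = 1.20                    # 20% added costs

for label, cost_per_tf_m in [("aggressive", 2.0), ("conservative", 3.5)]:
    total_m = peak_tf * cost_per_tf_m * adder
    print(f"{label}: ${total_m:.0f}M")   # ~$144M and ~$252M, i.e. $140M-$250M
```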
54 - The Real Strategy
- Traditional strategy within existing NERSC Program funding: acquire new computational capability every three years
  - 3 to 4 times the capability of existing systems
- Early, commercial, balanced systems with focus on:
  - stable programming environment
  - mature system management tools
  - good sustained-to-peak performance ratio
- Total value of $25M-$30M
  - About $9-10M/yr. using lease-to-own
- Have two generations in service at a time
  - e.g., T3E and IBM SP
- Phased introduction if technology indicates
- Balance other system architecture components
55 - Necessary Steps
- 1) Accumulate and evaluate benchmark candidates
- 2) Create a draft benchmark suite and run it on several systems
- 3) Create the draft benchmark rules
- 4) Set basic goals and options for procurement and then create a draft RFP document
- 5) Conduct market surveys (vendor briefings, intelligence gathering, etc.) - we do this after the first items so we can be looking for the right information and can also tell the vendors what to expect. It is often the case that we have to "market" to the vendors on why they should be bidding, since it costs them a lot.
- 6) Evaluate alternatives and options for the RFP and tests. This is also where we do a technology schedule (what is available when) and estimate prices - price/performance, etc.
- 7) Refine RFP and benchmark rules for final release
- 8) Go through reviews
- 9) Release RFP
- 10) Answer questions from vendors
- 11) Get responses - evaluate
- 12) Determine best value - present results and get concurrence
- 13) Prepare to negotiate
- 14) Negotiate
- 15) Put contract package together
- 16) Get concurrence and approval
- 17) Vendor builds the system
- 18) Factory test
- 19) Vendor delivers it
56 - Rough Schedule
- Goal: NERSC-4 installation in the first half of CY 2003
- Vendor responses (11) in early CY 2002
- Award in late summer/fall of CY 2002
  - This is necessary in order to assure delivery and acceptance (22) in FY 2003
- A lot of work and long lead times (for example, we have to account for review and approval times, 90 days for vendors to craft responses, time to negotiate, ...)
- NERSC staff kick-off meeting the first week of March
  - Some planning work has already been done
57 - NERSC-2 Decommissioning
- RETIRING NERSC 2 IS ALREADY ON OUR MINDS
- IF POSSIBLE WE WOULD LIKE TO KEEP NERSC 2 IN SERVICE UNTIL 6 MONTHS BEFORE NERSC 4 INSTALLATION
  - Therefore, expect retirement at the end of FY 2002
- It is risky to assume there will be a viable vector replacement
- A team is working to determine possible paths for traditional vector users
  - Report due in early summer
58 - SUMMARY
- NERSC does an exceptionally effective job delivering services to DOE and other researchers
- NERSC has made significant upgrades this year that position it well for future growth and continued excellence
- NERSC has a well-mapped strategy for the next several years