Title: Interdisciplinary High-Performance Computing: Renaissance Computing Institute
1Interdisciplinary High-Performance Computing
Renaissance Computing Institute
- Dan Reed
- Director
- Alan Blatecky
- Deputy Director
- Diane Pozefsky
- Research Scientist and Professor
- Duke University
- North Carolina State University
- University of North Carolina at Chapel Hill
2Renaissance Computing Institute
- Vision
- a multidisciplinary institute
- academe, commerce and society
- broad in scope and participation
- from art to zoology
- Objectives
- enrich and empower human potential
- communities at all levels
- create multidisciplinary partnerships
- science, engineering and computing
- commerce, humanities and the arts
- develop and deploy world-leading infrastructure
- computing, communications and data management
- visualization, collaboration and manufacturing
3The Big Questions
- Life and nature
- structures, processes and interactions
- Matter and universe
- origins, structure, manipulation and futures
- interactions, systems, and context
- Humanity
- creativity, socialization and community
- Answering big questions (usually) requires
- boldness to engage opportunities
- expandable approaches
- world-leading infrastructure
- collaborations and interdisciplinary partnerships
4How Big Is Big?
- Every 10X brings new challenges
- 64 processors was once considered large
- it's now a research cluster in a closet
- 1024 processors is today's medium size
- 2048-8096 processors is today's large
- we're struggling even here
- 10K-100K processors is in sight
- we have fundamental challenges
- and no integrated research program
- Grids bring a complementary set of challenges
- diversity
- unreliable communication links
- shared data stores
- widely varying system support
- maintenance and software stability
Source: Norman et al.
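To make the "every 10X brings new challenges" point concrete, here is a minimal sketch. It assumes node failures are independent and exponentially distributed, which is a simplifying assumption and not a claim from the slides. Under that assumption the system MTTF is the node MTTF divided by the node count. The function name and the 5-year node MTTF are hypothetical, chosen only for illustration.

```python
HOURS_PER_YEAR = 24 * 365


def system_mttf_hours(node_mttf_hours: float, nodes: int) -> float:
    """MTTF of a system that fails as soon as any one node fails.

    Assumes independent, exponentially distributed node failures,
    so failure rates add and the system MTTF is node MTTF / nodes.
    """
    if nodes < 1:
        raise ValueError("nodes must be >= 1")
    if node_mttf_hours <= 0:
        raise ValueError("node_mttf_hours must be positive")
    return node_mttf_hours / nodes


if __name__ == "__main__":
    node_mttf = 5 * HOURS_PER_YEAR  # hypothetical per-node MTTF
    for n in (64, 1024, 10_000, 100_000):
        print(f"{n:>7} nodes: system MTTF ~ {system_mttf_hours(node_mttf, n):.2f} h")
```

Under these assumptions, each tenfold increase in node count cuts the expected time between failures by a factor of ten, which is why reliability becomes a first-order design concern at 10K-100K processors.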
5Big System Reliability
- Facing the issues
- ASCI Q boot time is 8 hours
- not far from the system MTTF
- Cost of frequent application checkpoints
- It's time to take RAS seriously
- systems do provide warnings
- soft bit errors
- disk read/write retries, packet loss
- status and health provide guidance
- node temperature/fan duty cycles
- Potential software and algorithmic responses
- diagnostic-mediated checkpointing
- domain-specific fault tolerance
- optimal system size for minimum execution time
Source: Jack Horner and Charng-da Lu
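The trade-off between checkpoint cost and system MTTF can be sketched with Young's first-order approximation for the checkpoint interval, sqrt(2 x checkpoint cost x MTTF). This is a standard textbook estimate swapped in for illustration. It is not the diagnostic-mediated scheme named on the slide, and it holds only when the checkpoint cost is much smaller than the MTTF. Function names and the numbers in the example are hypothetical.

```python
import math


def young_interval(checkpoint_cost: float, mttf: float) -> float:
    """Young's approximation for the compute time between checkpoints.

    Both arguments use the same time unit (e.g. hours).
    Valid when checkpoint_cost is much smaller than mttf.
    """
    if checkpoint_cost <= 0 or mttf <= 0:
        raise ValueError("checkpoint_cost and mttf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mttf)


def checkpoint_overhead_fraction(interval: float, checkpoint_cost: float) -> float:
    """Fraction of wall-clock time spent writing checkpoints (failures ignored)."""
    return checkpoint_cost / (interval + checkpoint_cost)


if __name__ == "__main__":
    cost_h, mttf_h = 0.1, 20.0  # hypothetical: 6-minute checkpoint, 20-hour MTTF
    tau = young_interval(cost_h, mttf_h)
    print(f"checkpoint every ~{tau:.2f} h, "
          f"overhead ~{100 * checkpoint_overhead_fraction(tau, cost_h):.1f}%")
```

As the MTTF shrinks with system size, the interval shrinks with it and the overhead fraction grows, which is one way to see why there is an optimal system size for minimum execution time.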
6Representative Research Projects
- VGrADS
- (Virtual Grid Application Development Software)
- LEAD
- (Linked Environments for Atmospheric Discovery)
- PERC
- (Performance Evaluation Research Center)
- LACSI
- (Los Alamos Computer Science Institute)
- NCSA
- (National Computational Science Alliance)
- Grids, HPC, biology, atmospheric science, astronomy
- NC Bio Portal
7Portals and Grids
[Figure: OGCE Science Portal architecture. Labels: OGCE Portlets with Container, Service API, Grid Service Stubs, Grid Protocols, Grid Services, Grid Resources, Local Portal Services, Open Source Tools, Apache Jetspeed Internal Services, HTTP, Remote Content Services, Remote Content Servers]
8What's A Grid?
- Web: uniform access to documents
- Grid: flexible, high-performance access to resources and services for distributed communities
- software catalogs
- computers
- sensors and instruments
- colleagues
- data archives
9Science and Engineering Grids
10Web Services
Source: Globus Team
11Grid and Web Services
Source: Globus/IBM
12VGrADS
- Virtual Grid Application Development Software
- Goals
- simplify and accelerate the development of Grid applications and services
- high levels of performance and resource efficiency
- expand the community of Grid users and developers
- Contributions
- Introduction of virtual grids (vgrids)
- classification of grid types
- language to define type of grid needed
13VGrADS
- Application targets: LEAD and Encyclopedia of Life (EOL)
14LEAD and Cyclic Tornado Genesis
- LEAD (large NSF project) for atmospheric science Grid
- Linked Environments for Atmospheric Discovery
- UNC, Oklahoma, Indiana, National Center for Atmospheric Research, Alabama, Illinois
- What vertical profiles of wind, temperature, and humidity lead to multiple mesocyclones and/or tornadoes?
15How will LEAD help?
- Allows the use of analysis tools, forecast models, and data repositories as dynamically adaptive, on-demand systems that can
- change configuration rapidly and automatically in response to weather
- continually be steered by new data (i.e., the weather)
- respond to decision-driven inputs from users
- initiate other processes automatically and
- steer remote observing technologies to optimize data collection for the problem at hand.
16LEAD System
[Figure: LEAD system. Labels: Local Resources and Services, Grid Resources and Services, Tools Sub-System]
17Two Level Instrumentation/Monitoring
- SvPablo for modules
- Graphical user interface tool for
- Source code instrumentation
- Browsing runtime performance data
- Autopilot for workflow
- Sensors and actuators
- distributed measurement and software control
- Fuzzy logic decision procedures
- distributed performance control
- Standard performance daemons
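The sensor, fuzzy-logic decision, actuator pattern listed above can be sketched as a small control loop. This is an illustrative sketch only and does not use or reproduce Autopilot's actual API. The function names, the membership thresholds, and the choice of "worker count" as the controlled knob are all hypothetical.

```python
from typing import Callable


def ramp_up(x: float, lo: float, hi: float) -> float:
    """Fuzzy membership rising linearly from 0 at `lo` to 1 at `hi`."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)


def decide_workers(utilization: float, current: int, max_workers: int) -> int:
    """Apply two fuzzy rules and return the new worker count.

    Rule 1: IF utilization is HIGH THEN add a worker.
    Rule 2: IF utilization is LOW  THEN remove a worker.
    Thresholds are hypothetical; HIGH and LOW never overlap here.
    """
    high = ramp_up(utilization, 0.6, 0.9)
    low = 1.0 - ramp_up(utilization, 0.2, 0.5)
    delta = round(high - low)  # defuzzify to a step of -1, 0, or +1
    return max(1, min(max_workers, current + delta))


def control_step(sensor: Callable[[], float],
                 actuator: Callable[[int], None],
                 current: int, max_workers: int) -> int:
    """One loop iteration: read the sensor, decide, drive the actuator."""
    new_count = decide_workers(sensor(), current, max_workers)
    actuator(new_count)
    return new_count


if __name__ == "__main__":
    readings = iter([0.95, 0.95, 0.55, 0.10])  # hypothetical sensor trace
    workers = 4
    for _ in range(4):
        workers = control_step(lambda: next(readings), lambda n: None, workers, 8)
        print("workers:", workers)
```

Keeping the decision procedure separate from the sensor and actuator callables mirrors the slide's split between distributed measurement and distributed performance control.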
18AutoPilot
19Research Opportunities
- Large-scale Parallel Systems
- fault-tolerance, performance analysis, scheduling
- I/O, applications
- Computational and Data Grids
- resource management, policies, applications