Title: Data-Driven Scholarship
1 Data-Driven Scholarship
- Dr. Fran Berman
- Director, San Diego Supercomputer Center
- Professor and HPC Endowed Chair, UCSD
2 Data-Driven Scholarship Starts with Data-Driven Scholars
- What is required for data-driven research and education?
- The on-line world has had tremendous impact on:
- Investigation
- Dissemination
- Collaboration
- Education, outreach, training
3 Research in the Information Age
Data-oriented Applications
Extreme researchers are increasingly dependent on
both High Performance Computing (HPC) and
Highly Reliable Data (HRD)
Large-scale data required as input, intermediate,
output for many modern research applications
[Figure. Axis: Compute (more FLOPS). Categories: Home, Lab, Campus, Desktop Applications; Medium, Large, and Leadership HPC Applications. Examples shown: TeraShake, PDB applications, NVO, World of Warcraft, Molecular Modeling, Quicken.]
4 Researchers require the ability to manage, store,
preserve, and use data for extended periods
Data Cyberinfrastructure
- What are the trends and what is the noise in my data?
- How should I display my data?
- How should I organize my data?
- How can I preserve my data after the project is over?
- How do I combine my data with my colleagues' data?
- My data is confidential; how do I make sure it is seen/used only by the right people?
Many Data Sources
5 SDSC Data Cyberinfrastructure
- SDSC HIGH PERFORMANCE COMPUTING SYSTEMS
- DataStar
- 15.6 TFLOPS Power 4 system
- 7.125 TB total memory
- Up to 4 GBps I/O to disk
- 115 TB GPFS filesystem
- Intimidata
- First academic IBM Blue Gene system
- 17.1 TF
- 1.5 TB total memory
- 3 racks, each with 2,048 PowerPC processors and 128 I/O nodes
- TeraGrid Cluster
- 524 Itanium2 IA-64 processors
- 2 TB total memory
- Also 16 2-way data I/O nodes
- http://www.sdsc.edu/user_services/
- SDSC DATA COLLECTIONS, ARCHIVAL AND STORAGE SYSTEMS
- 2.4 PB Storage-area Network (SAN)
- 25 PB StorageTek/IBM tape library
- HPSS and SAM-QFS archival systems
- DB2, Oracle, MySQL
- Storage Resource Broker
- Supporting servers: IBM 32-way p690s, 72-CPU SunFire 15K, etc.
- http://datacentral.sdsc.edu/
Support for community data collections and databases
Data management, mining, analysis, and preservation
- SDSC SCIENCE and TECHNOLOGY STAFF, SOFTWARE, SERVICES
- Data-oriented Community SW, toolkits, portals, codes
- DataCentral national hosting repository
- Chronopolis services (w/ UCSDL)
- Data User Services
- Application/Community Collaborations
- Education and Training
- http://www.sdsc.edu/
6 Good Data Infrastructure Incurs Real Costs
Capacity Costs
- Most valuable data must be replicated
- SDSC research collections have been doubling every 15 months.
- SDSC storage is 25 PB and counting. Data is from supercomputer simulations, digital library collections, etc.
Information courtesy of Richard Moore
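To make the capacity-cost point concrete, here is a minimal sketch of what "doubling every 15 months" implies if it continues as steady exponential growth. That steadiness is an assumption, not a claim from the talk. The function names, the 100 TB starting size, and the two-copy replication factor are all hypothetical and chosen only for illustration.

```python
import math

DOUBLING_MONTHS = 15.0  # doubling period quoted on the slide


def growth_factor(months: float, doubling_months: float = DOUBLING_MONTHS) -> float:
    """Multiplicative growth after `months`, assuming steady exponential doubling."""
    return 2.0 ** (months / doubling_months)


def months_to_grow(factor: float, doubling_months: float = DOUBLING_MONTHS) -> float:
    """Months needed for a collection to grow by `factor` under the same assumption."""
    return doubling_months * math.log2(factor)


def raw_capacity_tb(logical_tb: float, replicas: int) -> float:
    """Raw storage needed when every byte is kept in `replicas` copies."""
    return logical_tb * replicas


if __name__ == "__main__":
    start_tb = 100.0  # hypothetical collection size, not a figure from the talk
    for years in (1, 2, 5):
        size = start_tb * growth_factor(12 * years)
        print(f"after {years} y: {size:,.0f} TB logical, "
              f"{raw_capacity_tb(size, replicas=2):,.0f} TB with 2 copies")
```

Under this assumption a collection grows by a factor of about 1.74 per year and about tenfold in roughly 50 months, and replication multiplies whatever raw capacity that growth requires.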
7 New Conversations for Researchers and Repositories
- Researcher: Can you preserve the digital data from my project?
- Repository:
- How much data?
- For how long?
- What is the form of the data?
- Who will be in charge of cleaning, annotating, organizing, refreshing, modifying, etc. the data?
- Who should have access to the data?
- Can the data be regenerated?
- Who will pay for infrastructure support?
- Etc.
- New opportunities for partnership
- Users
- Libraries
- Publishers
- Data Centers
- Private sector
- Public sector
- Federal agencies, etc.
- Key Issues for partners
- Trust
- Service agreements
- Incentives and penalties
- Security, privacy, confidentiality
- Expectation management
8 Key Challenges: Incorporating the "ilities"
- Reliability
- How can we maximize data reliability?
- Replicants, UPS systems, heterogeneity, etc.
- How can we measure data reliability?
- Network availability: 99.999% uptime (5 nines)
- What is the equivalent number of 0s for data reliability?
- Responsibility
- Who owns the data?
- Who takes care of the data?
- Who pays for the data?
- Who can see the data?
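The "5 nines" figure in the reliability bullets above can be turned into arithmetic, and the same notation can be tried on data loss. The sketch below is illustrative only. The function names are hypothetical, the per-copy loss probability is a made-up input, and the replica calculation assumes failures are independent. Real failures are often correlated, which is one reason heterogeneity appears in the list.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of `nines` nines (5 -> 99.999%)."""
    return MINUTES_PER_YEAR * 10.0 ** (-nines)


def loss_probability(per_copy_loss: float, replicas: int) -> float:
    """Probability that every replica is lost, ASSUMING independent failures."""
    return per_copy_loss ** replicas


def nines_of(probability_of_failure: float) -> float:
    """Express a failure probability as a count of nines (1e-5 -> 5.0)."""
    return -math.log10(probability_of_failure)


if __name__ == "__main__":
    print(f"5 nines allows {downtime_minutes_per_year(5):.3f} minutes of downtime per year")
    p = 0.01  # hypothetical chance of losing one copy in a year
    for k in (1, 2, 3):
        print(f"{k} replica(s): about {nines_of(loss_probability(p, k)):.1f} nines")
```

Five nines of availability corresponds to a little over five minutes of downtime per year. With the hypothetical 1% per-copy loss rate, each additional independent replica adds two nines.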
9 The Grand Challenge: Economic Sustainability
Relay Funding
- Making Infinite Funding Finite
- Difficult to support infrastructure for data preservation as an infinite, increasing mortgage
- Creative solutions can be used to create sustainable economic models
User fees, recharges
Consortium support
Endowments
Hybrid solutions
10 We can't achieve success in the Information Age
without a solid foundation of Information