Data-Driven Scholarship

1
Data-Driven Scholarship
  • Dr. Fran Berman
  • Director, San Diego Supercomputer Center
  • Professor and HPC Endowed Chair, UCSD

2
Data-Driven Scholarship Starts with Data-Driven Scholars
  • What is required for data-driven research and
    education?
  • The on-line world has had tremendous impact on:
  • Investigation
  • Dissemination
  • Collaboration
  • Education, outreach, training

3
Research in the Information Age
Data-oriented Applications
Extreme researchers increasingly dependent on
both High Performance Computing (HPC) and
Highly Reliable Data (HRD)
Large-scale data required as input, intermediate,
output for many modern research applications
[Figure: applications placed by compute demand (axis: Compute, more FLOPS). Regions: Home, Lab, Campus, Desktop Applications; Medium, Large, and Leadership HPC Applications. Examples shown: TeraShake, PDB applications, NVO, World of Warcraft, Molecular Modeling, Quicken.]
4
Researchers require the ability to manage, store,
preserve, and use data for extended periods
Data Cyberinfrastructure
  • What are the trends and what is the noise in my
    data?
  • How should I display my data?
  • How should I organize my data?
  • How can I preserve my data after the project is
    over?
  • How do I combine my data with my colleagues'
    data?
  • My data is confidential; how do I make sure it is
    seen/used only by the right people?

Many Data Sources
5
SDSC Data Cyberinfrastructure
  • SDSC HIGH PERFORMANCE COMPUTING SYSTEMS
  • DataStar
  • 15.6 TFLOPS Power 4 system
  • 7.125 TB total memory
  • Up to 4 GBps I/O to disk
  • 115 TB GPFS filesystem
  • Intimidata
  • First academic IBM Blue Gene system
  • 17.1 TF
  • 1.5 TB total memory
  • 3 racks, each with 2,048 PowerPC processors and
    128 I/O nodes
  • TeraGrid Cluster
  • 524 Itanium2 IA-64 processors
  • 2 TB total memory
  • Also 16 2-way data I/O nodes
  • http://www.sdsc.edu/user_services/
  • SDSC DATA COLLECTIONS, ARCHIVAL AND STORAGE
    SYSTEMS
  • 2.4 PB Storage-area Network (SAN)
  • 25 PB StorageTek/IBM tape library
  • HPSS and SAM-QFS archival systems
  • DB2, Oracle, MySQL
  • Storage Resource Broker
  • Supporting servers: IBM 32-way p690s,
    72-CPU SunFire 15K, etc.
  • http://datacentral.sdsc.edu/

Support for community data collections and databases
Data management, mining, analysis, and preservation
  • SDSC SCIENCE and TECHNOLOGY STAFF, SOFTWARE,
    SERVICES
  • Data-oriented Community SW, toolkits, portals,
    codes
  • DataCentral national hosting repository
  • Chronopolis services (w/ UCSDL)
  • Data User Services
  • Application/Community Collaborations
  • Education and Training
  • http://www.sdsc.edu/

6
Good Data Infrastructure Incurs Real Costs
Capacity Costs
  • Most valuable data must be replicated
  • SDSC research collections have been doubling
    every 15 months.
  • SDSC storage is 25 PB and counting. Data is from
    supercomputer simulations, digital library
    collections, etc.

Information courtesy of Richard Moore
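
The doubling rate quoted on this slide can be turned into a rough capacity projection. The sketch below is illustrative only: it assumes steady exponential growth and, as a simplification not stated on the slide, applies the 15-month doubling time of the research collections to the full 25 PB. The function names are hypothetical.

```python
import math

# Figures quoted on the slide.
CURRENT_PB = 25.0        # SDSC storage, in petabytes
DOUBLING_MONTHS = 15.0   # doubling time of research collections

def projected_size_pb(current_pb, months_ahead, doubling_months=DOUBLING_MONTHS):
    """Size after `months_ahead` months, assuming steady exponential doubling."""
    return current_pb * 2 ** (months_ahead / doubling_months)

def months_until(current_pb, target_pb, doubling_months=DOUBLING_MONTHS):
    """Months needed to grow from `current_pb` to `target_pb` at the same rate."""
    return doubling_months * math.log2(target_pb / current_pb)

# Print a small projection table (assumption: the growth rate never changes).
for years in (1, 2, 5):
    size = projected_size_pb(CURRENT_PB, 12 * years)
    print(f"after {years} year(s): about {size:.0f} PB")
```

Under these assumptions the holdings double after 15 months and quadruple after 30, which is why capacity is listed as a real cost.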
7
New Conversations for Researchers and Repositories
  • Researcher: Can you preserve the digital data
    from my project?
  • Repository:
  • How much data?
  • For how long?
  • What is the form of the data?
  • Who will be in charge of cleaning, annotating,
    organizing, refreshing, modifying etc. the data?
  • Who should have access to the data?
  • Can the data be regenerated?
  • Who will pay for infrastructure support?
  • Etc.
  • New opportunities for partnership
  • Users
  • Libraries
  • Publishers
  • Data Centers
  • Private sector
  • Public sector
  • Federal agencies, etc.
  • Key Issues for partners
  • Trust
  • Service agreements
  • Incentives and penalties
  • Security, privacy, confidentiality
  • Expectation management

8
Key Challenges: Incorporating the "ilities"
  • Reliability
  • How can we maximize data reliability?
  • Replicants, UPS systems, heterogeneity, etc.
  • How can we measure data reliability?
  • Network availability: 99.999% uptime (5 nines).
  • What is the equivalent number of 0s for data
    reliability?
  • Responsibility
  • Who owns the data?
  • Who takes care of the data?
  • Who pays for the data?
  • Who can see the data?
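
To make the "5 nines" figure for network availability concrete, the sketch below converts a number of nines into allowed downtime per year. It also includes one possible data-side analogue, offered only as an illustration and not as the presenter's answer: the chance of losing every replica of an item, under the strong assumption that replicas fail independently. All function names are hypothetical.

```python
import math

MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, including leap days

def downtime_minutes_per_year(nines):
    """Allowed downtime per year for an availability of `nines` nines."""
    unavailability = 10 ** (-nines)   # 5 nines -> 0.00001
    return unavailability * MINUTES_PER_YEAR

def nines_of(availability):
    """Number of nines in an availability fraction, e.g. 0.99999 -> about 5."""
    return -math.log10(1.0 - availability)

def all_replicas_lost(p_loss_single, replicas):
    """Probability that every replica is lost, ASSUMING independent failures.

    Real failures are often correlated (same site, same vendor), which is
    one reason heterogeneity appears on the slide next to replication.
    """
    return p_loss_single ** replicas

print(f"5 nines: {downtime_minutes_per_year(5):.2f} minutes of downtime per year")
```

Five nines works out to a little over five minutes of downtime per year. The replica calculation is only as good as its independence assumption, so it should be read as a best case.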

9
The Grand Challenge: Economic Sustainability
Relay Funding
  • Making Infinite Funding Finite
  • Difficult to support infrastructure for data
    preservation as an infinite, increasing mortgage
  • Creative solutions can be used to create
    sustainable economic models

User fees, recharges
Consortium support
Endowments
Hybrid solutions
10
We can't achieve success in the Information Age
without a solid foundation of Information