Title: Dr. Francine Berman
1. One Hundred Years of Data
- Dr. Francine Berman
- Director, San Diego Supercomputer Center
- Professor and High Performance Computing Endowed
Chair, UC San Diego
2. The Digital World
Entertainment
Shopping
Information
3. How Much Data Is There?
- 1 low-resolution photo: 100 KiloBytes
- 1 novel: 1 MegaByte
- iPod Shuffle (up to 120 songs): 512 MegaBytes
- Printed materials in the Library of Congress: 10 TeraBytes
- 1 human brain at the micron level: 1 PetaByte
- SDSC HPSS tape archive: 6 PetaBytes
- All worldwide information in one year: 2 ExaBytes
(Rough/average estimates)
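To make these orders of magnitude easier to compare, here is a small sketch that converts the slide's rough figures to bytes (decimal units; the figures themselves are the slide's own estimates):

```python
# Rough estimates from the slide, converted to bytes (decimal units).
UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9,
         "TB": 10**12, "PB": 10**15, "EB": 10**18}

estimates = {
    "Low-resolution photo":            (100, "KB"),
    "One novel":                       (1,   "MB"),
    "iPod Shuffle (512 MB)":           (512, "MB"),
    "Library of Congress (print)":     (10,  "TB"),
    "Human brain at micron level":     (1,   "PB"),
    "SDSC HPSS tape archive":          (6,   "PB"),
    "Worldwide information, one year": (2,   "EB"),
}

# Print from smallest to largest to show the spread of magnitudes.
for name, (value, unit) in sorted(estimates.items(),
                                  key=lambda kv: kv[1][0] * UNITS[kv[1][1]]):
    print(f"{name:33s} ~{value * UNITS[unit]:.0e} bytes")
```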
4. Research, Education, and Data
Example data collections across disciplines (rough sizes):
- Arts and Humanities: Japanese Art Images, 70.6 GB
- Life Sciences: JCSG/SLAC, 15.7 TB
- Astronomy: NVO, 100 TB
- Engineering: TeraBridge, 800 GB
- Geosciences: SCEC, 153 TB
- Physics: projected LHC data, 10 PB/year
5. Data-oriented Science and Engineering Applications Driving the Next Generation of Technology Challenges
[Chart: application subclasses plotted by data needs vs. compute needs (more FLOPS). Examples:
- Data-oriented science and engineering applications: TeraShake, PDB applications, NVO
- Home, lab, campus, desktop applications: Everquest, Quicken
- Traditional HPC applications: molecular modeling]
6. Data Stewardship
- What is required for stewardship of data for the science and engineering community?
- Who needs it?
- How does data drive new discovery?
- What facilities are required?
- What's involved in preserving data for the foreseeable future?
7. Data Stewardship
- What is required for stewardship of data for the science and engineering community?
- Who needs it?
- How does data drive new discovery?
- What facilities are required?
- What's involved in preserving data for the foreseeable future?
8. PDB: A Resource for the Global Biology Community
- The Protein Data Bank (PDB)
- Largest repository on the planet for structural information about proteins
- Provides free worldwide public access 24/7 to accurate protein data
- The PDB is maintained by the Worldwide PDB (wwPDB) and administered by the Research Collaboratory for Structural Bioinformatics (RCSB), directed by Helen Berman
Molecule of the Month: glucose oxidase, the enzyme used to make the measurement of glucose (e.g. in monitoring diabetes) fast, easy, and inexpensive.
Growth of yearly/total structures in the PDB: from 1976-1990, roughly 500 or fewer structures per year; in 2006, more than 5,000 structures in one year and more than 36,000 total structures.
9. How Does the PDB Work?
10. Supporting and Sustaining the PDB
- Consortium funding (NSF, NIGMS, DOE, NLM, NCI, NCRR, NIBIB, NINDS)
- Industrial support (Advanced Chemistry Development Inc., IBM, Sybase, Compaq, Silicon Graphics, Sun Microsystems)
- Multiple wwPDB sites: RCSB (USA), PDBj (Japan), MSD-EBI (Europe)
- Tool Development
- Data extraction and preparation
- Data format conversion
- Data validation
- Dictionary and data management
- Tools supporting the OMG CORBA standard for macromolecular structure data, etc.
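The tool categories above all operate on structure entries in the standard PDB format. As a minimal, illustrative sketch of that kind of data extraction (not a wwPDB tool), the snippet below parses a previously downloaded coordinate file with Biopython's Bio.PDB module; the file name and entry label are placeholders.

```python
# Illustrative only: summarize a locally downloaded PDB-format file with
# Biopython. "entry.pdb" and the label "entry" are placeholder names.
from Bio.PDB import PDBParser

parser = PDBParser(QUIET=True)                       # tolerant PDB-format reader
structure = parser.get_structure("entry", "entry.pdb")

# Count standard residues per chain -- the kind of quick check a curation
# or validation pipeline might run on a deposited structure.
for model in structure:
    for chain in model:
        residues = [r for r in chain if r.id[0] == " "]   # skip waters/heteroatoms
        print(f"Chain {chain.id}: {len(residues)} residues")
    break    # the first model is enough for a summary
```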
11. Data Stewardship
- What is required for stewardship of data for the science and engineering community?
- Who needs it?
- How does data drive new discovery?
- What facilities are required?
- What's involved in preserving data for the foreseeable future?
12. Major Earthquakes on the San Andreas Fault, 1680-present
Earthquake Simulations
- Simulation results provide new scientific information enabling better
- Estimation of seismic risk
- Emergency preparation, response, and planning
- Design of the next generation of earthquake-resistant structures
- Results provide information which can help save many lives and billions in economic losses
- Researchers use geological, historical, and environmental data to simulate massive earthquakes.
- These simulations are critical for understanding seismic movement and assessing potential impact.
How dangerous is the San Andreas Fault? [Map: major historical earthquakes on the fault: 1906 M 7.8, 1857 M 7.8, 1680 M 7.7.]
13. TeraShake Simulation
- Simulation of a magnitude 7.7 earthquake on the lower (southern) San Andreas Fault
- Physics-based dynamic-source model simulation on a mesh of 1.8 billion cubes with a spatial resolution of 200 m
- Builds on 10 years of data and models from the Southern California Earthquake Center
- Simulated the first 3 minutes of a magnitude 7.7 earthquake: 22,728 time steps of 0.011 second each
- Simulation generates 45 TB of data
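A back-of-envelope calculation shows why the output reaches tens of terabytes. This is an illustration only: it assumes one 4-byte float per mesh cell per field, and the actual output layout (which fields and time steps were archived) is not specified on the slide.

```python
# Back-of-envelope sketch: size of one full-volume snapshot of a single
# field, assuming one 4-byte float per mesh cell (an assumption).
cells     = 1.8e9        # mesh cells (from the slide)
bytes_per = 4            # single-precision float
timesteps = 22_728       # 0.011 s each (from the slide)

snapshot = cells * bytes_per                       # one field, one time step
print(f"One volume snapshot: ~{snapshot / 1e9:.1f} GB")             # ~7.2 GB
print(f"Dumping one field at every step: ~{snapshot * timesteps / 1e12:.0f} TB")

# The archived 45 TB therefore corresponds to only a few thousand
# field/time-step combinations, i.e. a chosen subset of the wavefield.
print(f"45 TB / snapshot size: ~{45e12 / snapshot:.0f} snapshots")
```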
14. SCEC Data Requirements
Resources must support a complicated orchestration of computation and data movement: parallel file system, data parking.
The next-generation simulation will require even more resources: researchers plan to double the temporal/spatial resolution of TeraShake.
"I have desired to see a large earthquake simulation for over a decade. This dream has been accomplished." (Bernard Minster, Scripps Institution of Oceanography)
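For scale, doubling the spatial and temporal resolution compounds quickly. The sketch below assumes the time step must halve when the mesh spacing halves (a CFL-style assumption, not stated on the slide):

```python
# Rough scaling for the planned higher-resolution run (assumptions above).
spatial_factor  = 2 ** 3       # 2x finer in each of 3 dimensions -> 8x more cells
temporal_factor = 2            # 2x more time steps

print(f"Snapshot size grows ~{spatial_factor}x")
print(f"Total cell-updates (compute) grow ~{spatial_factor * temporal_factor}x")
print(f"TeraShake's 45 TB would become roughly "
      f"{45 * spatial_factor}-{45 * spatial_factor * temporal_factor} TB")
```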
15. Behind the Scenes: Enabling Infrastructure for TeraShake
- Computers and Systems
- 80,000 hours on 240 processors of DataStar
- 256 GB-memory p690 used for testing, p655s used for the production run, TeraGrid used for porting
- 30 TB Global Parallel File System (GPFS)
- Run-time 100 MB/s data transfer from GPFS to SAM-QFS
- 27,000 hours of post-processing for high-resolution rendering
- People
- 20 people involved in information technology support
- 20 people involved in geoscience modeling and simulation
- Data Storage
- 47 TB archival tape storage on Sun StorEdge SAM-QFS
- 47 TB backup on the High Performance Storage System (HPSS)
- SRB collection with 1,000,000 files
- Funding
- SDSC cyberinfrastructure resources for TeraShake funded by NSF
- The Southern California Earthquake Center is an NSF-funded geoscience research and development center
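One implication of the numbers above: at the quoted 100 MB/s run-time rate, simply moving the 45 TB output from GPFS to the SAM-QFS archive takes days. This is a rough sustained-rate estimate that ignores overlap with computation:

```python
# Time to move 45 TB at a sustained 100 MB/s (decimal units).
volume = 45e12          # bytes
rate   = 100e6          # bytes per second

seconds = volume / rate
print(f"~{seconds / 86400:.1f} days of sustained transfer")   # about 5.2 days
```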
16. Data Partner: The Data-Oriented Supercomputer
- A balanced system provides support for tightly-coupled and I/O-intensive applications
- Grid platforms are not a strong option:
- Data is local to computation
- I/O rates exceed WAN capabilities
- Continuous and frequent I/O is latency-intolerant
- Scalability
- Need high-bandwidth and large-capacity local parallel file systems and archival storage
[Chart: applications plotted by data needs vs. compute needs; DoD applications plotted for locality.]
17. Data Stewardship
- What is required for stewardship of data for the science and engineering community?
- Who needs it?
- How does data drive new discovery?
- What facilities are required?
- What's involved in preserving data for the foreseeable future?
18. National Data Cyberinfrastructure Resources at SDSC
- DATA-ORIENTED COMPUTE SYSTEMS
- DataStar
- 15.6 TFLOPS POWER4 system
- 7.125 TB total memory
- Up to 4 GB/s I/O to disk
- 115 TB GPFS filesystem
- TeraGrid Cluster
- 524 Itanium2 IA-64 processors
- 2 TB total memory
- Also 12 2-way data nodes
- Blue Gene Data
- First academic IBM Blue Gene system
- 2,048 PowerPC processors
- 128 I/O nodes
- http://www.sdsc.edu/user_services/
- DATA COLLECTIONS, ARCHIVAL AND STORAGE SYSTEMS
- 1.4 PB storage-area network (SAN)
- 6 PB StorageTek tape library
- HPSS and SAM-QFS archival systems
- DB2, Oracle, MySQL
- Storage Resource Broker (SRB)
- 72-CPU Sun Fire 15K
- IBM p690s (HPSS, DB2, etc.)
- http://datacentral.sdsc.edu/
Support for community data collections and databases; data management, mining, analysis, and preservation
- SCIENCE and TECHNOLOGY STAFF, SOFTWARE, SERVICES
- User Services
- Application/Community Collaborations
- Education and Training
- SDSC Synthesis Center
- Data-oriented community software, toolkits, portals, codes
- http://www.sdsc.edu/
19. National Data Repository: SDSC DataCentral
- First broad program of its kind to support national research and community data collections and databases
- Data allocations provided on SDSC resources
- Data collection and database hosting
- Batch-oriented access and collection management services
- Comprehensive data resources: disk, tape, databases, SRB, web services, tools, 24/7 operations, collection specialists, etc.
Web-based portal access
20. DataCentral Allocated Collections include
21. Working with Data: Data Integration for New Discovery
- Data Integration in the Biosciences
- Data Integration in the Geosciences
Example questions: Where can we most safely build a nuclear waste dump? Where should we drill for oil? What are the distribution and U/Pb zircon ages of A-type plutons in VA? How do they relate to host rock structures?
[Diagram: data integration as complex multiple-worlds mediation across geo-chemical, geo-physical, and geo-chronologic sources, foliation maps, and geologic maps.]
22. Services, Tools, and Technologies Key for Data Integration and Management
- Data Systems
- SAM-QFS
- HPSS
- GPFS
- SRB
- Data Services
- Data migration/upload, usage, and support (SRB)
- Database selection and schema design (Oracle, DB2, MySQL)
- Database application tuning and optimization
- Portal creation and collection publication
- Data analysis (e.g. Matlab) and mining (e.g. WEKA)
- DataCentral
- Data-oriented Toolkits and Tools
- Biology Workbench
- Montage (astronomy mosaicking)
- Kepler (workflow management)
- Vista volume renderer (visualization), etc.
23. 100 Years of Data: What's involved in preserving data for the foreseeable future?
24. Who Cares about Digital Preservation?
The Public Sector
UCSD Libraries
The Private Sector
The Entertainment Industry
Researchers and Educators
25. Many Science, Cultural, and Official Collections Must Be Sustained for the Foreseeable Future
- Critical collections
- Community reference data collections (e.g. the Protein Data Bank)
- Irreplaceable collections (e.g. the Shoah collection)
- Longitudinal data (e.g. PSID, the Panel Study of Income Dynamics)
- No plan for preservation often means that data is lost or damaged
"...the progress of science and useful arts depends on the reliable preservation of knowledge and information for generations to come." (Preserving Our Digital Heritage, Library of Congress)
26. Key Challenges for Digital Preservation
- What should we preserve?
- What materials must be rescued?
- How do we plan for preservation of materials by design?
- How should we preserve it?
- Formats
- Storage media
- Stewardship: who is responsible, and for how long?
- Who should pay for preservation?
- The content generators?
- The government?
- The users?
- Who should have access?
Print media provides easy access for long periods of time but is hard to data-mine.
Digital media is easier to data-mine but requires management of the evolution of media and resource planning over time.
27. Preservation and Risk
Less risk means more replicas, more resources, and more people.
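One way to make that statement concrete: if each replica of a collection is lost independently with some annual probability, the chance of losing every copy falls geometrically with the number of replicas. The sketch below uses a made-up per-copy loss probability purely for illustration; it is not a figure from the talk.

```python
# Illustrative only: probability of total loss with k independent replicas,
# each failing with (assumed) probability p in a given year.
p = 0.05    # placeholder annual loss probability per copy

for k in (1, 2, 3):
    print(f"{k} replica(s): P(all copies lost) = {p ** k:.6f}")
```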
28. Chronopolis: An Integrated Approach to Long-term Digital Preservation
- Chronopolis provides a comprehensive approach to infrastructure for long-term preservation, integrating
- Collection ingestion
- Access and services
- Research and development for new functionality and adaptation to evolving technologies
- Business model, data policies, and management issues critical to the success of the infrastructure
29. Chronopolis Replication and Distribution
- 3 replicas of valuable collections are considered reasonable mitigation for the risk of data loss
- The Chronopolis Consortium will store 3 copies of preservation collections:
- Bright copy: the Chronopolis site supports ingestion, collection management, and user access
- Dim copy: the Chronopolis site supports a remote replica of the bright copy and supports user access
- Dark copy: the Chronopolis site supports a reference copy that may be used for disaster recovery but allows no user access
- Each site may play different roles for different collections
[Diagram: three sites, each holding a different mix of bright, dim, and dark copies of collections C1 and C2.]
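The role rotation in the diagram can be read as a simple mapping from (site, collection) to role. The sketch below mirrors that idea with hypothetical site names; it is an illustration of the scheme, not Chronopolis software.

```python
# Hypothetical role assignment mirroring the diagram: each site holds a
# different mix of bright/dim/dark copies for collections C1 and C2.
roles = {
    "site_A": {"C1": "bright", "C2": "dark"},
    "site_B": {"C1": "dim",    "C2": "bright"},
    "site_C": {"C1": "dark",   "C2": "dim"},
}

def copies_of(collection):
    """Return {role: site} for one collection: 3 copies, 3 distinct roles."""
    return {role: site
            for site, held in roles.items()
            for coll, role in held.items()
            if coll == collection}

print(copies_of("C1"))   # e.g. {'bright': 'site_A', 'dim': 'site_B', 'dark': 'site_C'}
print(copies_of("C2"))
```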
30. Killer App in Data Preservation: Sustainability
31. Data in the News
- Newsworthy items about supercomputing
- "Simulating Earthquakes for Science and Society" (HPCWire, January 27, 2006): simulation of a 7.7 earthquake on the lower San Andreas Fault
- "Japanese supercomputer simulates Earth" (BBC, April 26, 2002): "A new Japanese supercomputer was switched on this month and immediately outclassed its nearest rival."
- Newsworthy items about data
- "Bank Data Loss May Affect Officials" (Boston Globe, February 27, 2005): data tapes lost with information on more than 60 U.S. senators and others
- "Data Loss Bug Afflicts Linux" (ZDNet News, December 6, 2002): "Programmers have found a bug that, under unusual circumstances, could cause systems to drop data."
32. Data Preservation Requires a Different
Sustainability Model than Supercomputing
33. The Branscomb Pyramid for Computing (circa 1993)
[Pyramid diagram with FACILITIES on one side and APPLICATIONS on the other, three tiers:]
- High-end
- Facilities: leadership-class facilities maintained by national labs and centers; substantive professional workforce
- Applications: community codes and professional software, maintained by large groups of professionals (NASTRAN, PowerPoint, WRF, Everquest)
- Campus, research lab
- Facilities: mid-range university and research lab facilities, maintained by professionals and non-professionals
- Applications: community software and highly-used project codes, developed and maintained by some professionals and academics (CHARMM, GAMESS, etc.)
- Small-scale, home
- Facilities: private, home, and personal facilities, supported by users or their proxies
- Applications: research and individual codes, supported by developers or their proxies
34. The Berman Pyramid for Data (circa 2006)?
[Pyramid diagram with FACILITIES on one side and COLLECTIONS on the other, three tiers:]
- High-end
- Facilities: national-scale data repositories, archives, and libraries; high-capacity, high-reliability environment maintained by a professional workforce
- Collections: reference, important, and irreplaceable data collections (PDB, PSID, Shoah, Presidential Libraries, etc.)
- Campus, library, data center
- Facilities: local libraries and data centers, commercial data storage; medium capacity, medium-high reliability, maintained by professionals
- Collections: research data collections, developed and maintained by some professionals and academics
- Small-scale, home
- Facilities: private repositories, supported by users or their proxies; low-medium reliability, low capacity
- Collections: personal data collections, supported by developers or their proxies
35. What's the Funding Model for the Data Pyramid?
[The same FACILITIES/COLLECTIONS pyramid as the previous slide, annotated with "Commercial Opportunities" at the low end:]
- High-end facilities: national-scale data repositories, archives, and libraries; high-capacity, high-reliability environment maintained by a professional workforce
- High-end collections: reference, important, and irreplaceable data collections (PDB, PSID, Shoah, Presidential Libraries, etc.)
- Mid-level facilities: local libraries and data centers, commercial data storage; medium capacity, medium-high reliability, maintained by professionals
- Mid-level collections: research data collections, developed and maintained by some professionals and academics
- Low-end facilities: private repositories, supported by users or their proxies; low-medium reliability, low capacity
- Low-end collections: personal data collections, supported by developers or their proxies
- Commercial opportunities at the low end
36. Commercial Opportunities at the Low End
- Cheap commercial data storage is moving us from a "Napster" model (data is accessible and free) to an "iTunes" model (data is accessible and inexpensive)
37. Amazon S3 (Simple Storage Service)
- Storage for rent
- Storage is $0.15 per GB per month
- $0.20 per GB data transfer (to and from)
- Write, read, and delete objects of 1 GB-5 GB (the number of objects is unlimited), with access controlled by the user
- For $2.00, you can store for one year:
- Lots of high-resolution family photos
- Multiple videos of your children's recitals
- Personal documentation equivalent to up to 1,000 novels, etc.
Should we store the NVO with Amazon S3? The National Virtual Observatory (NVO) is a critical reference collection for the astronomy community, holding data from the world's large telescopes and sky surveys.
38. A Thought Experiment
- What would it cost to store the SDSC NVO collection (100 TB) on Amazon?
- 100,000 GB x $2 (1 ingest, no accesses, storage for a year) = $200K/year
- 100,000 GB x $3 (1 ingest, an average of 5 accesses per GB stored, storage for a year) = $300K/year
- It is not clear:
- How many copies Amazon stores
- Whether the format is well-suited for the NVO
- Whether the usage model would make the costs of data transfer, ingest, access, etc. infeasible
- Whether Amazon constitutes a trusted repository
- What happens to your data when you stop paying, etc.
- What about the CERN LHC collection (10 PB/year)?
- 10,000,000 GB x $2 (1 ingest, no accesses per item, storage for a year) = $20M/year
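The per-year figures above follow directly from the S3 prices quoted on the previous slide ($0.15 per GB per month to store, $0.20 per GB transferred); a minimal sketch of the arithmetic:

```python
# Reproducing the slide's estimates from the quoted 2006-era S3 prices.
STORE_PER_GB_MONTH = 0.15      # dollars per GB per month
TRANSFER_PER_GB    = 0.20      # dollars per GB moved in or out

def yearly_cost(gb, accesses_per_gb=0):
    storage = gb * STORE_PER_GB_MONTH * 12            # one year of storage
    ingest  = gb * TRANSFER_PER_GB                    # upload everything once
    access  = gb * accesses_per_gb * TRANSFER_PER_GB  # downloads
    return storage + ingest + access

print(f"NVO, no accesses:       ${yearly_cost(100_000):,.0f}/year")     # ~$200K
print(f"NVO, ~5 accesses/GB:    ${yearly_cost(100_000, 5):,.0f}/year")  # ~$300K
print(f"LHC year, no accesses:  ${yearly_cost(10_000_000):,.0f}/year")  # ~$20M
```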
39. What is the Business Model for the Upper Levels
of the Data Pyramid?
[Pyramid of FACILITIES and COLLECTIONS, with the funding model for the upper two levels left as open questions:]
- Top: national-scale data repositories, libraries, and archives; critical, valuable, and irreplaceable reference collections and very large collections (funding model?)
- Middle: libraries and data centers; important research collections and large data collections (funding model?)
- Low end: personal repositories and personal data collections (commercial opportunities)
40. Partnership Opportunities at the Middle Level
- Creative investment opportunities
- Short-term investments: building collections, website and tool development, finite support for facilities and collections, transition support for media, formats, etc.
- Longer-term investments: maintaining collections, maintaining facilities, evolving and maintaining software
- Public/private partnerships must ensure reliability and trust
- Do you trust Amazon with your data? Google? Your university library? Your public library?
- How much are content generators willing to pay to store their data? How much are users willing to pay to use the data?
[Pyramid annotation: opportunities for creative public, private, and philanthropic partnerships at the middle level (collections: important research collections and large data collections; facilities: libraries and data centers), with commercial opportunities at the low end.]
41. Public Support Needed at the Top (National-scale) Level
- National-scale collections and facilities constitute critical infrastructure for the academic, public, and private sectors
- National-scale facilities must
- Be trusted repositories
- Be highly reliable
- Provide high-capacity, state-of-the-art storage
- Have a 5-year, 50-year, and 100-year plan
- Serve a national community, etc.
- Public leadership, funding, and engagement are critical for success
[Pyramid annotation: public support needed at the top level (facilities: national-scale libraries, archives, and data repositories; collections: critical, valuable, irreplaceable reference collections and very large data collections), with opportunities for creative public, private, and philanthropic partnerships at the middle level and commercial opportunities at the low end.]
42. Thank You
berman@sdsc.edu / www.sdsc.edu