Title: Dr. Francine Berman
1. One Hundred Years of Data
- Dr. Francine Berman
- Director, San Diego Supercomputer Center
- Professor and High Performance Computing Endowed
Chair, UC San Diego
2. The Digital World
Entertainment
Shopping
Information
3. How Much Data Is There?
- 1 low-resolution photo: 100 KiloBytes
- 1 novel: 1 MegaByte
- iPod Shuffle (up to 120 songs): 512 MegaBytes
- Printed materials in the Library of Congress: 10 TeraBytes
- 1 human brain at the micron level: 1 PetaByte
- SDSC HPSS tape archive: 6 PetaBytes
- All worldwide information in one year: 2 ExaBytes
(Rough/average estimates)
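To make these orders of magnitude easier to compare, here is a small sketch that converts the slide's rough figures to bytes (decimal units; the figures themselves are the slide's own estimates):

```python
# Rough estimates from the slide, converted to bytes (decimal units).
UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9,
         "TB": 10**12, "PB": 10**15, "EB": 10**18}

estimates = {
    "Low-resolution photo":            (100, "KB"),
    "One novel":                       (1,   "MB"),
    "iPod Shuffle (512 MB)":           (512, "MB"),
    "Library of Congress (print)":     (10,  "TB"),
    "Human brain at micron level":     (1,   "PB"),
    "SDSC HPSS tape archive":          (6,   "PB"),
    "Worldwide information, one year": (2,   "EB"),
}

# Print from smallest to largest to show the spread of magnitudes.
for name, (value, unit) in sorted(estimates.items(),
                                  key=lambda kv: kv[1][0] * UNITS[kv[1][1]]):
    print(f"{name:33s} ~{value * UNITS[unit]:.0e} bytes")
```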
4. Research, Education, and Data
Example data collections across disciplines (rough sizes):
- Arts and Humanities: Japanese Art Images, 70.6 GB
- Life Sciences: JCSG/SLAC, 15.7 TB
- Astronomy: NVO, 100 TB
- Engineering: TeraBridge, 800 GB
- Geosciences: SCEC, 153 TB
- Physics: projected LHC data, 10 PB/year
5. Data-oriented Science and Engineering Applications Driving the Next Generation of Technology Challenges
[Chart: application subclasses plotted by data needs vs. compute needs (more FLOPS). Examples:
- Data-oriented science and engineering applications: TeraShake, PDB applications, NVO
- Home, lab, campus, desktop applications: Everquest, Quicken
- Traditional HPC applications: molecular modeling]
6. Data Stewardship
- What is required for stewardship of data for the science and engineering community?
- Who needs it?
- How does data drive new discovery?
- What facilities are required?
- What's involved in preserving data for the foreseeable future?
7. Data Stewardship
- What is required for stewardship of data for the science and engineering community?
- Who needs it?
- How does data drive new discovery?
- What facilities are required?
- What's involved in preserving data for the foreseeable future?
8. PDB: A Resource for the Global Biology Community
- The Protein Data Bank (PDB)
- Largest repository on the planet for structural information about proteins
- Provides free worldwide public access 24/7 to accurate protein data
- The PDB is maintained by the Worldwide PDB (wwPDB) and administered by the Research Collaboratory for Structural Bioinformatics (RCSB), directed by Helen Berman
Molecule of the Month: glucose oxidase, the enzyme used to make the measurement of glucose (e.g. in monitoring diabetes) fast, easy, and inexpensive.
Growth of yearly/total structures in the PDB: from 1976-1990, roughly 500 or fewer structures per year; in 2006, more than 5,000 structures in one year and more than 36,000 total structures.
9. How Does the PDB Work?
10. Supporting and Sustaining the PDB
- Consortium funding (NSF, NIGMS, DOE, NLM, NCI, NCRR, NIBIB, NINDS)
- Industrial support (Advanced Chemistry Development Inc., IBM, Sybase, Compaq, Silicon Graphics, Sun Microsystems)
- Multiple wwPDB sites: RCSB (USA), PDBj (Japan), MSD-EBI (Europe)
- Tool Development
- Data extraction and preparation
- Data format conversion
- Data validation
- Dictionary and data management
- Tools supporting the OMG CORBA standard for macromolecular structure data, etc.
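The tool categories above all operate on structure entries in the standard PDB format. As a minimal, illustrative sketch of that kind of data extraction (not a wwPDB tool), the snippet below parses a previously downloaded coordinate file with Biopython's Bio.PDB module; the file name and entry label are placeholders.

```python
# Illustrative only: summarize a locally downloaded PDB-format file with
# Biopython. "entry.pdb" and the label "entry" are placeholder names.
from Bio.PDB import PDBParser

parser = PDBParser(QUIET=True)                       # tolerant PDB-format reader
structure = parser.get_structure("entry", "entry.pdb")

# Count standard residues per chain -- the kind of quick check a curation
# or validation pipeline might run on a deposited structure.
for model in structure:
    for chain in model:
        residues = [r for r in chain if r.id[0] == " "]   # skip waters/heteroatoms
        print(f"Chain {chain.id}: {len(residues)} residues")
    break    # the first model is enough for a summary
```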
11. Data Stewardship
- What is required for stewardship of data for the science and engineering community?
- Who needs it?
- How does data drive new discovery?
- What facilities are required?
- What's involved in preserving data for the foreseeable future?
12. Major Earthquakes on the San Andreas Fault, 1680-present
Earthquake Simulations
- Simulation results provide new scientific information enabling better
- Estimation of seismic risk
- Emergency preparation, response, and planning
- Design of the next generation of earthquake-resistant structures
- Results provide information which can help save many lives and billions in economic losses
- Researchers use geological, historical, and environmental data to simulate massive earthquakes.
- These simulations are critical for understanding seismic movement and assessing potential impact.
How dangerous is the San Andreas Fault? [Map: major historical earthquakes on the fault: 1906 M 7.8, 1857 M 7.8, 1680 M 7.7.]
13. TeraShake Simulation
- Simulation of a magnitude 7.7 earthquake on the lower (southern) San Andreas Fault
- Physics-based dynamic-source model simulation on a mesh of 1.8 billion cubes with a spatial resolution of 200 m
- Builds on 10 years of data and models from the Southern California Earthquake Center
- Simulated the first 3 minutes of a magnitude 7.7 earthquake: 22,728 time steps of 0.011 second each
- Simulation generates 45 TB of data
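A back-of-envelope calculation shows why the output reaches tens of terabytes. This is an illustration only: it assumes one 4-byte float per mesh cell per field, and the actual output layout (which fields and time steps were archived) is not specified on the slide.

```python
# Back-of-envelope sketch: size of one full-volume snapshot of a single
# field, assuming one 4-byte float per mesh cell (an assumption).
cells     = 1.8e9        # mesh cells (from the slide)
bytes_per = 4            # single-precision float
timesteps = 22_728       # 0.011 s each (from the slide)

snapshot = cells * bytes_per                       # one field, one time step
print(f"One volume snapshot: ~{snapshot / 1e9:.1f} GB")             # ~7.2 GB
print(f"Dumping one field at every step: ~{snapshot * timesteps / 1e12:.0f} TB")

# The archived 45 TB therefore corresponds to only a few thousand
# field/time-step combinations, i.e. a chosen subset of the wavefield.
print(f"45 TB / snapshot size: ~{45e12 / snapshot:.0f} snapshots")
```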
14. SCEC Data Requirements
Resources must support a complicated orchestration of computation and data movement: parallel file system, data parking.
The next-generation simulation will require even more resources: researchers plan to double the temporal/spatial resolution of TeraShake.
"I have desired to see a large earthquake simulation for over a decade. This dream has been accomplished." (Bernard Minster, Scripps Institution of Oceanography)
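For scale, doubling the spatial and temporal resolution compounds quickly. The sketch below assumes the time step must halve when the mesh spacing halves (a CFL-style assumption, not stated on the slide):

```python
# Rough scaling for the planned higher-resolution run (assumptions above).
spatial_factor  = 2 ** 3       # 2x finer in each of 3 dimensions -> 8x more cells
temporal_factor = 2            # 2x more time steps

print(f"Snapshot size grows ~{spatial_factor}x")
print(f"Total cell-updates (compute) grow ~{spatial_factor * temporal_factor}x")
print(f"TeraShake's 45 TB would become roughly "
      f"{45 * spatial_factor}-{45 * spatial_factor * temporal_factor} TB")
```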
15. Behind the Scenes: Enabling Infrastructure for TeraShake
- Computers and Systems
- 80,000 hours on 240 processors of DataStar
- 256 GB-memory p690 used for testing, p655s used for the production run, TeraGrid used for porting
- 30 TB Global Parallel File System (GPFS)
- Run-time 100 MB/s data transfer from GPFS to SAM-QFS
- 27,000 hours of post-processing for high-resolution rendering
- People
- 20 people involved in information technology support
- 20 people involved in geoscience modeling and simulation
- Data Storage
- 47 TB archival tape storage on Sun StorEdge SAM-QFS
- 47 TB backup on the High Performance Storage System (HPSS)
- SRB collection with 1,000,000 files
- Funding
- SDSC cyberinfrastructure resources for TeraShake funded by NSF
- The Southern California Earthquake Center is an NSF-funded geoscience research and development center
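One implication of the numbers above: at the quoted 100 MB/s run-time rate, simply moving the 45 TB output from GPFS to the SAM-QFS archive takes days. This is a rough sustained-rate estimate that ignores overlap with computation:

```python
# Time to move 45 TB at a sustained 100 MB/s (decimal units).
volume = 45e12          # bytes
rate   = 100e6          # bytes per second

seconds = volume / rate
print(f"~{seconds / 86400:.1f} days of sustained transfer")   # about 5.2 days
```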
16. Data Partner: The Data-Oriented Supercomputer
- A balanced system provides support for tightly-coupled and I/O-intensive applications
- Grid platforms are not a strong option:
- Data is local to computation
- I/O rates exceed WAN capabilities
- Continuous and frequent I/O is latency-intolerant
- Scalability
- Need high-bandwidth and large-capacity local parallel file systems and archival storage
[Chart: applications plotted by data needs vs. compute needs; DoD applications plotted for locality.]
17. Data Stewardship
- What is required for stewardship of data for the science and engineering community?
- Who needs it?
- How does data drive new discovery?
- What facilities are required?
- What's involved in preserving data for the foreseeable future?
18. National Data Cyberinfrastructure Resources at SDSC
- DATA-ORIENTED COMPUTE SYSTEMS
- DataStar
- 15.6 TFLOPS POWER4 system
- 7.125 TB total memory
- Up to 4 GB/s I/O to disk
- 115 TB GPFS filesystem
- TeraGrid Cluster
- 524 Itanium2 IA-64 processors
- 2 TB total memory
- Also 12 2-way data nodes
- Blue Gene Data
- First academic IBM Blue Gene system
- 2,048 PowerPC processors
- 128 I/O nodes
- http://www.sdsc.edu/user_services/
- DATA COLLECTIONS, ARCHIVAL AND STORAGE SYSTEMS
- 1.4 PB storage-area network (SAN)
- 6 PB StorageTek tape library
- HPSS and SAM-QFS archival systems
- DB2, Oracle, MySQL
- Storage Resource Broker (SRB)
- 72-CPU Sun Fire 15K
- IBM p690s (HPSS, DB2, etc.)
- http://datacentral.sdsc.edu/
Support for community data collections and databases; data management, mining, analysis, and preservation
- SCIENCE and TECHNOLOGY STAFF, SOFTWARE, SERVICES
- User Services
- Application/Community Collaborations
- Education and Training
- SDSC Synthesis Center
- Data-oriented community software, toolkits, portals, codes
- http://www.sdsc.edu/
19. National Data Repository: SDSC DataCentral
- First broad program of its kind to support national research and community data collections and databases
- Data allocations provided on SDSC resources
- Data collection and database hosting
- Batch-oriented access and collection management services
- Comprehensive data resources: disk, tape, databases, SRB, web services, tools, 24/7 operations, collection specialists, etc.
Web-based portal access
20. DataCentral Allocated Collections include
21. Working with Data: Data Integration for New Discovery
- Data Integration in the Biosciences
- Data Integration in the Geosciences
Example questions: Where can we most safely build a nuclear waste dump? Where should we drill for oil? What are the distribution and U/Pb zircon ages of A-type plutons in VA? How do they relate to host rock structures?
[Diagram: data integration as complex multiple-worlds mediation across geo-chemical, geo-physical, and geo-chronologic sources, foliation maps, and geologic maps.]
22. Services, Tools, and Technologies Key for Data Integration and Management
- Data Systems
- SAM-QFS
- HPSS
- GPFS
- SRB
- Data Services
- Data migration/upload, usage, and support (SRB)
- Database selection and schema design (Oracle, DB2, MySQL)
- Database application tuning and optimization
- Portal creation and collection publication
- Data analysis (e.g. Matlab) and mining (e.g. WEKA)
- DataCentral
- Data-oriented Toolkits and Tools
- Biology Workbench
- Montage (astronomy mosaicking)
- Kepler (workflow management)
- Vista volume renderer (visualization), etc.
23. 100 Years of Data: What's involved in preserving data for the foreseeable future?
24. Who Cares about Digital Preservation?
The Public Sector
UCSD Libraries
The Private Sector
The Entertainment Industry
Researchers and Educators
25. Many Science, Cultural, and Official Collections Must Be Sustained for the Foreseeable Future
- Critical collections
- Community reference data collections (e.g. the Protein Data Bank)
- Irreplaceable collections (e.g. the Shoah collection)
- Longitudinal data (e.g. PSID, the Panel Study of Income Dynamics)
- No plan for preservation often means that data is lost or damaged
"...the progress of science and useful arts depends on the reliable preservation of knowledge and information for generations to come." (Preserving Our Digital Heritage, Library of Congress)
26. Key Challenges for Digital Preservation
- What should we preserve?
- What materials must be rescued?
- How do we plan for preservation of materials by design?
- How should we preserve it?
- Formats
- Storage media
- Stewardship: who is responsible, and for how long?
- Who should pay for preservation?
- The content generators?
- The government?
- The users?
- Who should have access?
Print media provides easy access for long periods of time but is hard to data-mine.
Digital media is easier to data-mine but requires management of the evolution of media and resource planning over time.
27. Preservation and Risk
Less risk means more replicas, more resources, and more people.
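One way to make that statement concrete: if each replica of a collection is lost independently with some annual probability, the chance of losing every copy falls geometrically with the number of replicas. The sketch below uses a made-up per-copy loss probability purely for illustration; it is not a figure from the talk.

```python
# Illustrative only: probability of total loss with k independent replicas,
# each failing with (assumed) probability p in a given year.
p = 0.05    # placeholder annual loss probability per copy

for k in (1, 2, 3):
    print(f"{k} replica(s): P(all copies lost) = {p ** k:.6f}")
```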
28. Chronopolis: An Integrated Approach to Long-term Digital Preservation
- Chronopolis provides a comprehensive approach to infrastructure for long-term preservation, integrating
- Collection ingestion
- Access and services
- Research and development for new functionality and adaptation to evolving technologies
- Business model, data policies, and management issues critical to the success of the infrastructure
29. Chronopolis Replication and Distribution
- 3 replicas of valuable collections are considered reasonable mitigation for the risk of data loss
- The Chronopolis Consortium will store 3 copies of preservation collections:
- Bright copy: the Chronopolis site supports ingestion, collection management, and user access
- Dim copy: the Chronopolis site supports a remote replica of the bright copy and supports user access
- Dark copy: the Chronopolis site supports a reference copy that may be used for disaster recovery but allows no user access
- Each site may play different roles for different collections
[Diagram: three sites, each holding a different mix of bright, dim, and dark copies of collections C1 and C2.]
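The role rotation in the diagram can be read as a simple mapping from (site, collection) to role. The sketch below mirrors that idea with hypothetical site names; it is an illustration of the scheme, not Chronopolis software.

```python
# Hypothetical role assignment mirroring the diagram: each site holds a
# different mix of bright/dim/dark copies for collections C1 and C2.
roles = {
    "site_A": {"C1": "bright", "C2": "dark"},
    "site_B": {"C1": "dim",    "C2": "bright"},
    "site_C": {"C1": "dark",   "C2": "dim"},
}

def copies_of(collection):
    """Return {role: site} for one collection: 3 copies, 3 distinct roles."""
    return {role: site
            for site, held in roles.items()
            for coll, role in held.items()
            if coll == collection}

print(copies_of("C1"))   # e.g. {'bright': 'site_A', 'dim': 'site_B', 'dark': 'site_C'}
print(copies_of("C2"))
```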
30. Killer App in Data Preservation: Sustainability
31. Data in the News
- Newsworthy items about supercomputing
- "Simulating Earthquakes for Science and Society" (HPCWire, January 27, 2006): simulation of a 7.7 earthquake on the lower San Andreas Fault
- "Japanese supercomputer simulates Earth" (BBC, April 26, 2002): "A new Japanese supercomputer was switched on this month and immediately outclassed its nearest rival."
- Newsworthy items about data
- "Bank Data Loss May Affect Officials" (Boston Globe, February 27, 2005): data tapes lost with information on more than 60 U.S. senators and others
- "Data Loss Bug Afflicts Linux" (ZDNet News, December 6, 2002): "Programmers have found a bug that, under unusual circumstances, could cause systems to drop data."
32. Data Preservation Requires a Different
Sustainability Model than Supercomputing
33. The Branscomb Pyramid for Computing (circa 1993)
[Pyramid diagram with FACILITIES on one side and APPLICATIONS on the other, three tiers:]
- High-end
- Facilities: leadership-class facilities maintained by national labs and centers; substantive professional workforce
- Applications: community codes and professional software, maintained by large groups of professionals (NASTRAN, PowerPoint, WRF, Everquest)
- Campus, research lab
- Facilities: mid-range university and research lab facilities, maintained by professionals and non-professionals
- Applications: community software and highly-used project codes, developed and maintained by some professionals and academics (CHARMM, GAMESS, etc.)
- Small-scale, home
- Facilities: private, home, and personal facilities, supported by users or their proxies
- Applications: research and individual codes, supported by developers or their proxies
34. The Berman Pyramid for Data (circa 2006)?
[Pyramid diagram with FACILITIES on one side and COLLECTIONS on the other, three tiers:]
- High-end
- Facilities: national-scale data repositories, archives, and libraries; high-capacity, high-reliability environment maintained by a professional workforce
- Collections: reference, important, and irreplaceable data collections (PDB, PSID, Shoah, Presidential Libraries, etc.)
- Campus, library, data center
- Facilities: local libraries and data centers, commercial data storage; medium capacity, medium-high reliability, maintained by professionals
- Collections: research data collections, developed and maintained by some professionals and academics
- Small-scale, home
- Facilities: private repositories, supported by users or their proxies; low-medium reliability, low capacity
- Collections: personal data collections, supported by developers or their proxies
35. What's the Funding Model for the Data Pyramid?
[The same FACILITIES/COLLECTIONS pyramid as the previous slide, annotated with "Commercial Opportunities" at the low end:]
- High-end facilities: national-scale data repositories, archives, and libraries; high-capacity, high-reliability environment maintained by a professional workforce
- High-end collections: reference, important, and irreplaceable data collections (PDB, PSID, Shoah, Presidential Libraries, etc.)
- Mid-level facilities: local libraries and data centers, commercial data storage; medium capacity, medium-high reliability, maintained by professionals
- Mid-level collections: research data collections, developed and maintained by some professionals and academics
- Low-end facilities: private repositories, supported by users or their proxies; low-medium reliability, low capacity
- Low-end collections: personal data collections, supported by developers or their proxies
- Commercial opportunities at the low end
36. Commercial Opportunities at the Low End
- Cheap commercial data storage is moving us from a "Napster" model (data is accessible and free) to an "iTunes" model (data is accessible and inexpensive)
37. Amazon S3 (Simple Storage Service)
- Storage for rent
- Storage is $0.15 per GB per month
- $0.20 per GB data transfer (to and from)
- Write, read, and delete objects of 1 GB-5 GB (the number of objects is unlimited), with access controlled by the user
- For $2.00, you can store for one year:
- Lots of high-resolution family photos
- Multiple videos of your children's recitals
- Personal documentation equivalent to up to 1,000 novels, etc.
Should we store the NVO with Amazon S3? The National Virtual Observatory (NVO) is a critical reference collection for the astronomy community, holding data from the world's large telescopes and sky surveys.
38. A Thought Experiment
- What would it cost to store the SDSC NVO collection (100 TB) on Amazon?
- 100,000 GB x $2 (1 ingest, no accesses, storage for a year) = $200K/year
- 100,000 GB x $3 (1 ingest, an average of 5 accesses per GB stored, storage for a year) = $300K/year
- It is not clear:
- How many copies Amazon stores
- Whether the format is well-suited for the NVO
- Whether the usage model would make the costs of data transfer, ingest, access, etc. infeasible
- Whether Amazon constitutes a trusted repository
- What happens to your data when you stop paying, etc.
- What about the CERN LHC collection (10 PB/year)?
- 10,000,000 GB x $2 (1 ingest, no accesses per item, storage for a year) = $20M/year
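The per-year figures above follow directly from the S3 prices quoted on the previous slide ($0.15 per GB per month to store, $0.20 per GB transferred); a minimal sketch of the arithmetic:

```python
# Reproducing the slide's estimates from the quoted 2006-era S3 prices.
STORE_PER_GB_MONTH = 0.15      # dollars per GB per month
TRANSFER_PER_GB    = 0.20      # dollars per GB moved in or out

def yearly_cost(gb, accesses_per_gb=0):
    storage = gb * STORE_PER_GB_MONTH * 12            # one year of storage
    ingest  = gb * TRANSFER_PER_GB                    # upload everything once
    access  = gb * accesses_per_gb * TRANSFER_PER_GB  # downloads
    return storage + ingest + access

print(f"NVO, no accesses:       ${yearly_cost(100_000):,.0f}/year")     # ~$200K
print(f"NVO, ~5 accesses/GB:    ${yearly_cost(100_000, 5):,.0f}/year")  # ~$300K
print(f"LHC year, no accesses:  ${yearly_cost(10_000_000):,.0f}/year")  # ~$20M
```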
39. What is the Business Model for the Upper Levels
of the Data Pyramid?
[Pyramid of FACILITIES and COLLECTIONS, with the funding model for the upper two levels left as open questions:]
- Top: national-scale data repositories, libraries, and archives; critical, valuable, and irreplaceable reference collections and very large collections (funding model?)
- Middle: libraries and data centers; important research collections and large data collections (funding model?)
- Low end: personal repositories and personal data collections (commercial opportunities)
40. Partnership Opportunities at the Middle Level
- Creative investment opportunities
- Short-term investments: building collections, website and tool development, finite support for facilities and collections, transition support for media, formats, etc.
- Longer-term investments: maintaining collections, maintaining facilities, evolving and maintaining software
- Public/private partnerships must ensure reliability and trust
- Do you trust Amazon with your data? Google? Your university library? Your public library?
- How much are content generators willing to pay to store their data? How much are users willing to pay to use the data?
[Pyramid annotation: opportunities for creative public, private, and philanthropic partnerships at the middle level (collections: important research collections and large data collections; facilities: libraries and data centers), with commercial opportunities at the low end.]
41. Public Support Needed at the Top (National-scale) Level
- National-scale collections and facilities constitute critical infrastructure for the academic, public, and private sectors
- National-scale facilities must
- Be trusted repositories
- Be highly reliable
- Provide high-capacity, state-of-the-art storage
- Have a 5-year, 50-year, and 100-year plan
- Serve a national community, etc.
- Public leadership, funding, and engagement are critical for success
[Pyramid annotation: public support needed at the top level (facilities: national-scale libraries, archives, and data repositories; collections: critical, valuable, irreplaceable reference collections and very large data collections), with opportunities for creative public, private, and philanthropic partnerships at the middle level and commercial opportunities at the low end.]
42. Thank You
berman@sdsc.edu / www.sdsc.edu