Transcript and Presenter's Notes

Title: Dr. Francine Berman


1
One Hundred Years of Data
  • Dr. Francine Berman
  • Director, San Diego Supercomputer Center
  • Professor and High Performance Computing Endowed
    Chair, UC San Diego

2
The Digital World
Entertainment
Shopping
Information
3
How much Data is There?
1 low-resolution photo: ~100 kilobytes
1 novel: ~1 megabyte
iPod Shuffle (up to 120 songs): 512 megabytes
Printed materials in the Library of Congress: ~10 terabytes
1 human brain at the micron level: ~1 petabyte
SDSC HPSS tape archive: 6 petabytes
All worldwide information in one year: ~2 exabytes
(Rough/average estimates; compared numerically in the sketch below.)
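To make the ratios concrete, here is a tiny comparison using the slide's rough figures (a sketch only, with decimal units; the numbers are the approximations quoted above, not measurements):

```python
# Quick comparison of the rough sizes quoted above, normalized to bytes
# (decimal units; figures are the slide's approximations, not measurements).
KB, MB, GB, TB, PB, EB = 1e3, 1e6, 1e9, 1e12, 1e15, 1e18

sizes = {
    "1 low-resolution photo": 100 * KB,
    "1 novel": 1 * MB,
    "iPod Shuffle (up to 120 songs)": 512 * MB,
    "printed Library of Congress": 10 * TB,
    "1 human brain at the micron level": 1 * PB,
    "SDSC HPSS tape archive": 6 * PB,
    "worldwide information in one year": 2 * EB,
}

for name, nbytes in sizes.items():
    print(f"{name:36s} {nbytes / MB:>18,.0f} MB")
# The yearly worldwide total (~2e12 MB) is about 200,000 times the
# printed Library of Congress (~1e7 MB).
```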
4
Research, Education, and Data
Example collections across the disciplines:
  • Japanese Art Images (arts and humanities): 70.6 GB
  • JCSG/SLAC (life sciences): 15.7 TB
  • NVO (astronomy): 100 TB
  • TeraBridge (engineering): 800 GB
  • SCEC (geosciences): 153 TB
  • Projected LHC data (physics): 10 PB/year
5
Data-oriented Science and Engineering Applications Driving the Next
Generation of Technology Challenges
[Chart: applications plotted by data needs vs. compute needs (more FLOPS).
Data-oriented science and engineering applications (examples and application
subclasses: TeraShake, PDB applications, NVO) sit at the high-data end;
traditional HPC applications (e.g., molecular modeling) at the high-compute
end; home, lab, campus, and desktop applications (e.g., Everquest, Quicken)
at the low end.]
6
Data Stewardship
  • What is required for stewardship of data for the
    science and engineering community?
  • Who needs it?
  • How does data drive new discovery?
  • What facilities are required?
  • What's involved in preserving data for the
    foreseeable future?

7
Data Stewardship
  • What is required for stewardship of data for the
    science and engineering community?
  • Who needs it?
  • How does data drive new discovery?
  • What facilities are required?
  • What's involved in preserving data for the
    foreseeable future?

8
PDB: A resource for the global biology community
  • The Protein Data Bank
  • Largest repository on the planet for structural
    information about proteins
  • Provides free worldwide public access 24/7 to
    accurate protein data
  • PDB maintained by the Worldwide PDB (wwPDB) and
    administered by the Research Collaboratory for
    Structural Bioinformatics (RCSB), directed by
    Helen Berman

Molecule of the Month: glucose oxidase, an enzyme
used to make measuring glucose (e.g., in
monitoring diabetes) fast, easy, and inexpensive.
2006: >5,000 structures in one year; >36,000
total structures.
1976-1990: roughly 500 structures or fewer per year.
[Chart: growth of yearly and total structures in the PDB]
9
How Does the PDB Work?
10
Supporting and Sustaining the PDB
  • Consortium Funding (NSF, NIGMS, DOE, NLM, NCI,
    NCRR, NIBIB, NINDS)
  • Industrial Support (Advanced Chemistry
    Development Inc., IBM, Sybase, Compaq, Silicon
    Graphics, Sun Microsystems)
  • Multiple sites: wwPDB = RCSB (USA), PDBj
    (Japan), MSD-EBI (Europe)
  • Tool Development
  • Data Extraction and Preparation
  • Data Format Conversion
  • Data Validation
  • Dictionary and Data management
  • Tools supporting the OMG CORBA standard for
    Macromolecular Structure Data, etc.

11
Data Stewardship
  • What is required for stewardship of data for the
    science and engineering community?
  • Who needs it?
  • How does data drive new discovery?
  • What facilities are required?
  • What's involved in preserving data for the
    foreseeable future?

12
Major Earthquakes on the San Andreas Fault,
1680-present
Earthquake Simulations
  • Simulation results provide new scientific
    information enabling better:
    - estimation of seismic risk
    - emergency preparation, response, and planning
    - design of the next generation of
      earthquake-resistant structures
  • Results provide information that can help save
    many lives and billions in economic losses
  • Researchers use geological, historical, and
    environmental data to simulate massive
    earthquakes.
  • These simulations are critical to understanding
    seismic movement and assessing potential impact.

[Map: major earthquakes on the San Andreas Fault: 1906 M 7.8, 1857 M 7.8,
1680 M 7.7, with one segment marked "?". How dangerous is the San Andreas
Fault?]
13
TeraShake Simulation
  • Simulation of a magnitude 7.7 earthquake on the
    southern (lower) San Andreas Fault
  • Physics-based dynamic source model; simulation
    mesh of 1.8 billion cubes at 200 m spatial
    resolution
  • Builds on 10 years of data and models from the
    Southern California Earthquake Center
  • Simulated the first 3 minutes of a magnitude 7.7
    earthquake in 22,728 time steps of 0.011 second
    each
  • Simulation generated 45 TB of data (see the
    back-of-the-envelope sketch below)
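A rough back-of-the-envelope check on these figures; the per-cell output size is an assumption (3 velocity components at single precision, which the slide does not state), and the 100 MB/s archive rate is taken from the next slide, so treat this as a sketch rather than the project's own accounting:

```python
# Back-of-the-envelope check on the TeraShake numbers above (sketch only).
# Assumed, not stated on the slide: 3 velocity components per mesh cell and
# 4-byte single-precision values.
CELLS = 1.8e9                  # mesh cells (from the slide)
TIME_STEPS = 22_728            # time steps of 0.011 s each (from the slide)
BYTES_PER_CELL = 3 * 4         # assumed: 3 components x 4 bytes

per_step_gb = CELLS * BYTES_PER_CELL / 1e9          # ~21.6 GB per saved step
full_output_tb = per_step_gb * TIME_STEPS / 1e3     # ~491 TB if every step saved
saved_steps = 45.0 * 1e3 / per_step_gb              # ~2,083 steps explain 45 TB

print(f"~{per_step_gb:.1f} GB per saved step")
print(f"~{full_output_tb:.0f} TB if every step were written out")
print(f"45 TB corresponds to roughly {saved_steps:.0f} saved steps")

# Streaming 45 TB to the SAM-QFS archive at the 100 MB/s run-time rate
# quoted on the next slide:
days = 45e12 / 100e6 / 86_400
print(f"~{days:.1f} days to archive at 100 MB/s")   # ~5.2 days
```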

14
SCEC Data Requirements
Resources must support a complicated orchestration
of computation and data movement: parallel file
system, data parking.
The next-generation simulation will require even
more resources; researchers plan to double the
temporal/spatial resolution of TeraShake.
"I have desired to see a large earthquake
simulation for over a decade. This dream has been
accomplished." (Bernard Minster, Scripps
Institution of Oceanography)
15
Behind the Scenes: Enabling Infrastructure for TeraShake
  • Computers and systems
    - 80,000 hours on 240 processors of DataStar
    - 256 GB memory p690 used for testing; p655s used
      for the production run; TeraGrid used for porting
    - 30 TB global parallel file system (GPFS)
    - Run-time 100 MB/s data transfer from GPFS to
      SAM-QFS
    - 27,000 hours of post-processing for
      high-resolution rendering
  • People
    - 20 people involved in information technology
      support
    - 20 people involved in geoscience modeling and
      simulation
  • Data storage
    - 47 TB archival tape storage on Sun StorEdge
      SAM-QFS
    - 47 TB backup on the High Performance Storage
      System (HPSS)
    - SRB collection with 1,000,000 files
  • Funding
    - SDSC cyberinfrastructure resources for TeraShake
      funded by NSF
    - The Southern California Earthquake Center is an
      NSF-funded geoscience research and development
      center

16
Data Partner: The Data-Oriented Supercomputer
  • A balanced system provides support for
    tightly coupled and I/O-intensive applications
  • Grid platforms are not a strong option:
    - data needs to be local to computation
    - I/O rates exceed WAN capabilities
    - continuous and frequent I/O is latency-intolerant
  • Scalability
  • Need high-bandwidth, large-capacity local
    parallel file systems and archival storage
[Chart: DoD applications plotted along data and compute axes to show locality]
17
Data Stewardship
  • What is required for stewardship of data for the
    science and engineering community?
  • Who needs it?
  • How does data drive new discovery?
  • What facilities are required?
  • What's involved in preserving data for the
    foreseeable future?

18
National Data Cyberinfrastructure Resources at
SDSC
  • DATA-ORIENTED COMPUTE SYSTEMS
  • DataStar
  • 15.6 TFLOPS Power 4 system
  • 7.125 TB total memory
  • Up to 4 GBps I/O to disk
  • 115 TB GPFS filesystem
  • TeraGrid Cluster
  • 524 Itanium2 IA-64 processors
  • 2 TB total memory
  • Also 12 2-way data nodes
  • Blue Gene Data
  • First academic IBM Blue Gene system
  • 2,048 PowerPC processors
  • 128 I/O nodes
  • http://www.sdsc.edu/user_services/
  • DATA COLLECTIONS, ARCHIVAL AND STORAGE SYSTEMS
  • 1.4 PB Storage-area Network (SAN)
  • 6 PB StorageTek tape library
  • HPSS and SAM-QFS archival systems
  • DB2, Oracle, MySQL
  • Storage Resource Broker
  • 72-CPU Sun Fire 15K
  • IBM p690s (HPSS, DB2, etc.)
  • http://datacentral.sdsc.edu/

Support for community data collections and
databases; data management, mining, analysis, and
preservation
  • SCIENCE and TECHNOLOGY STAFF, SOFTWARE, SERVICES
  • User Services
  • Application/Community Collaborations
  • Education and Training
  • SDSC Synthesis Center
  • Data-oriented Community SW, toolkits, portals,
    codes
  • http://www.sdsc.edu/

19
National Data Repository: SDSC DataCentral
  • First broad program of its kind to support
    national research and community data collections
    and databases
  • Data allocations provided on SDSC resources
  • Data collection and database hosting
  • Batch-oriented access and collection management
    services
  • Comprehensive data resources: disk, tape,
    databases, SRB, web services, tools, 24/7
    operations, collection specialists, etc.

Web-based portal access
20
DataCentral Allocated Collections include
21
Working with Data: Data Integration for New
Discovery
  • Data Integration in the Biosciences
  • Data Integration in the Geosciences

Where can we most safely build a nuclear waste
dump? Where should we drill for oil? What is
the distribution and U/Pb zircon ages of A-type
plutons in VA? How does it relate to host rock
structures?
[Diagram: data integration as complex multiple-worlds mediation across
geochemical, geophysical, and geochronologic data, foliation maps, and
geologic maps]
22
Services, Tools, and Technologies Key for Data
Integration and Management
  • Data Systems
  • SAM/QFS
  • HPSS
  • GPFS
  • SRB
  • Data Services
  • Data migration/upload, usage and support (SRB)
  • Database selection and Schema design (Oracle,
    DB2, MySQL)
  • Database application tuning and optimization
  • Portal creation and collection publication
  • Data analysis (e.g. Matlab) and mining (e.g.
    WEKA)
  • DataCentral
  • Data-oriented Toolkits and Tools
  • Biology Workbench
  • Montage (astronomy mosaicking)
  • Kepler (Workflow management)
  • Vista Volume renderer (visualization), etc.

23
100 Years of Data: What's involved in preserving
data for the foreseeable future?
24
Who Cares about Digital Preservation?
The Public Sector
UCSD Libraries
The Private Sector
The Entertainment Industry
Researchers and Educators
25
Many Science, Cultural, and Official Collections
Must be Sustained for the Foreseeable Future
  • Critical collections:
    - community reference data collections (e.g., the
      Protein Data Bank)
    - irreplaceable collections (e.g., the Shoah
      collection)
    - longitudinal data (e.g., the PSID, Panel Study
      of Income Dynamics)
  • No plan for preservation often means that data is
    lost or damaged
  • "...the progress of science and useful arts
    depends on the reliable preservation of knowledge
    and information for generations to come."
    (Preserving Our Digital Heritage, Library of
    Congress)

26
Key Challenges for Digital Preservation
  • What should we preserve?
  • What materials must be rescued?
  • How to plan for preservation of materials by
    design?
  • How should we preserve it?
  • Formats
  • Storage media
  • Stewardship: who is responsible, and for how
    long?
  • Who should pay for preservation?
  • The content generators?
  • The government?
  • The users?
  • Who should have access?

Print media provides easy access over long periods
of time but is hard to data-mine; digital media is
easier to data-mine but requires managing the
evolution of media and resource planning over time.
27
Preservation and Risk
Less risk means more replicas, more resources, and
more people.
28
Chronopolis: An Integrated Approach to Long-term
Digital Preservation
  • Chronopolis provides a comprehensive approach to
    infrastructure for long-term preservation,
    integrating:
    - collection ingestion
    - access and services
    - research and development for new functionality
      and adaptation to evolving technologies
  • Business model, data policies, and management
    issues are critical to the success of the
    infrastructure

29
Chronopolis Replication and Distribution
  • 3 replicas of valuable collections are considered
    reasonable mitigation for the risk of data loss
  • The Chronopolis Consortium will store 3 copies of
    preservation collections:
    - Bright copy: the Chronopolis site supports
      ingestion, collection management, and user access
    - Dim copy: the site holds a remote replica of the
      bright copy and supports user access
    - Dark copy: the site holds a reference copy that
      may be used for disaster recovery but offers no
      user access
  • Each site may play different roles for different
    collections (see the sketch below)

[Diagram: three sites, each holding a mix of bright, dim, and dark copies of
collections C1 and C2]
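One way to picture the bright/dim/dark arrangement is as a small placement map; this is an illustrative sketch with made-up site names and structure, not the Chronopolis Consortium's actual software or configuration:

```python
# Minimal sketch of the three-copy policy (illustrative only: site names,
# placements, and this structure are assumptions, not Chronopolis software).
from enum import Enum

class Role(Enum):
    BRIGHT = "ingestion, collection management, user access"
    DIM = "remote replica of the bright copy, user access"
    DARK = "reference copy for disaster recovery, no user access"

# Each site may play a different role for each collection.
placement = {
    "C1": {"Site A": Role.BRIGHT, "Site B": Role.DIM, "Site C": Role.DARK},
    "C2": {"Site B": Role.BRIGHT, "Site C": Role.DIM, "Site A": Role.DARK},
}

def user_accessible_sites(collection):
    """Sites from which users may access a collection (bright or dim copies)."""
    return [site for site, role in placement[collection].items()
            if role in (Role.BRIGHT, Role.DIM)]

print(user_accessible_sites("C1"))   # ['Site A', 'Site B']
print(user_accessible_sites("C2"))   # ['Site B', 'Site C']
```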
30
Killer App in Data Preservation: Sustainability
31
Data in the News
  • Newsworthy items about supercomputing:
    - "Simulating Earthquakes for Science and Society"
      (HPCWire, January 27, 2006): simulation of a 7.7
      earthquake on the lower San Andreas Fault
    - "Japanese supercomputer simulates Earth" (BBC,
      April 26, 2002): "A new Japanese supercomputer
      was switched on this month and immediately
      outclassed its nearest rival."
  • Newsworthy items about data:
    - "Bank Data Loss May Affect Officials" (Boston
      Globe, February 27, 2005): data tapes lost with
      information on more than 60 U.S. senators and
      others
    - "Data Loss Bug Afflicts Linux" (ZDNet News,
      December 6, 2002): "Programmers have found a bug
      that, under unusual circumstances, could cause
      systems to drop data."

32
Data Preservation Requires a Different
Sustainability Model than Supercomputing
33
The Branscomb Pyramid for Computing (circa 1993)
[Pyramid, high end to low end; each tier pairs facilities with applications]
  • High-end
    - Facilities: leadership-class facilities,
      maintained by national labs and centers with a
      substantive professional workforce
    - Applications: community codes and professional
      software, maintained by large groups of
      professionals (NASTRAN, PowerPoint, WRF,
      Everquest)
  • Mid-range (campus, research lab)
    - Facilities: mid-range university and research
      lab facilities, maintained by professionals and
      non-professionals
    - Applications: community software and highly used
      project codes, developed and maintained by some
      professionals and academics (CHARMM, GAMESS,
      etc.)
  • Small-scale (home)
    - Facilities: private, home, and personal
      facilities, supported by users or their proxies
    - Applications: research and individual codes,
      supported by developers or their proxies
34
The Berman Pyramid for Data (circa 2006) ?
[Pyramid, high end to low end; each tier pairs facilities with collections]
  • High-end
    - Facilities: national-scale data repositories,
      archives, and libraries; high-capacity,
      high-reliability environment maintained by a
      professional workforce
    - Collections: reference, important, and
      irreplaceable data collections (PDB, PSID, the
      Shoah collection, Presidential Libraries, etc.)
  • Mid-range (campus, library, data center)
    - Facilities: local libraries and data centers;
      commercial data storage; medium capacity,
      medium-high reliability; maintained by
      professionals
    - Collections: research data collections,
      developed and maintained by some professionals
      and academics
  • Small-scale (home)
    - Facilities: private repositories, supported by
      users or their proxies; low-medium reliability,
      low capacity
    - Collections: personal data collections,
      supported by developers or their proxies
35
What's the Funding Model for the Data Pyramid?
[Pyramid repeated from the previous slide (facilities and collections at
each tier), with "Commercial Opportunities" marked at the lower tiers]
36
Commercial Opportunities at the Low End
  • Cheap commercial data storage is moving us from a
    "Napster" model (data is accessible and free) to
    an "iTunes" model (data is accessible and
    inexpensive)

37
Amazon S3 (Simple Storage Service)
  • Storage for rent:
    - storage is $0.15 per GB per month
    - $0.20 per GB data transfer (to and from)
    - write, read, and delete objects of 1 byte to
      5 GB (number of objects is unlimited), with
      access controlled by the user
  • For $2.00, you can store for one year:
    - lots of high-resolution family photos
    - multiple videos of your children's recitals
    - personal documentation equivalent to up to 1,000
      novels, etc.

Should we store the NVO with Amazon S3?
The National Virtual Observatory (NVO) is a
critical reference collection for the astronomy
community, with data from the world's large
telescopes and sky surveys.
38
A Thought Experiment
  • What would it cost to store the SDSC NVO
    collection (100 TB) on Amazon? (See the cost
    sketch below.)
    - 100,000 GB × $2 (1 ingest, no accesses; storage
      for a year) = ~$200K/year
    - 100,000 GB × $3 (1 ingest, an average of 5
      accesses per GB stored; storage for a year) =
      ~$300K/year
  • Not clear:
    - how many copies Amazon stores
    - whether the format is well suited to the NVO
    - whether the usage model would make the costs of
      data transfer, ingest, access, etc. infeasible
    - whether Amazon constitutes a trusted repository
    - what happens to your data when you stop paying,
      etc.
  • What about the CERN LHC collection (10 PB/year)?
    - 10,000,000 GB × $2 (1 ingest, no accesses;
      storage for a year) = ~$20M/year

39
What is the Business Model for the Upper levels
of the Data Pyramid?
[Pyramid: facilities (national-scale data repositories, libraries, and
archives; libraries and data centers; personal repositories) paired with
collections (critical, valuable, and irreplaceable reference collections,
very large; important research collections, large; personal data
collections). Funding for the upper tiers is an open question ("?");
commercial opportunities cover the low end.]
40
Partnership Opportunities at the Middle Level
  • Creative investment opportunities:
    - Short-term investments: building collections,
      website and tool development, finite support for
      facilities and collections, transition support
      for media, formats, etc.
    - Longer-term investments: maintaining
      collections, maintaining facilities, evolving
      and maintaining software
  • Public/private partnerships must ensure
    reliability and trust
    - Do you trust Amazon with your data? Google?
      Your university library? Your public library?
    - How much are content generators willing to pay
      to store their data? How much are users willing
      to pay to use the data?

[Diagram: the middle tier (facilities: libraries and data centers;
collections: important research collections, large data collections) offers
opportunities for creative public, private, and philanthropic partnerships;
commercial opportunities remain at the low end]
41
Public Support Needed at the Top
(National-scale) level
Collections: critical, valuable, irreplaceable
reference collections; very large data collections
  • National-scale collections and facilities
    constitute critical infrastructure for the
    academic, public, and private sectors
  • National-scale facilities must:
    - be trusted repositories
    - be highly reliable
    - provide high-capacity, state-of-the-art storage
    - have a 5-year, 50-year, 100-year plan
    - serve a national community, etc.
  • Public leadership, funding, and engagement are
    critical for success

[Diagram: public support is needed at the top tier (facilities:
national-scale libraries, archives, and data repositories); creative public,
private, and philanthropic partnerships serve the middle; commercial
opportunities cover the low end]
42
Thank You
berman@sdsc.edu    www.sdsc.edu