Title: Scientific Data Libraries
1Scientific Data Libraries
New paradigm for science.
Old style: form hypothesis, design experiment, run experiment, analyze results, evaluate hypothesis.
New style: form hypothesis, look up data to test it, evaluate hypothesis.
Molecular biology has been first, astronomy next; many other fields will follow.
2Model or lookup?
Weather: measure today and run equations, or measure today and find a similar day in the past?
Chess: the opening and endgame are done by lookup; the middle game is done by calculation.
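The "find a similar day in the past" option can be sketched as a nearest-neighbor lookup. This is a minimal illustration only: the archive, its three measurements, and the function names are hypothetical, and the numbers are made up.

```python
import math

# Hypothetical archive of past days (all values invented for illustration).
# Each record: ((pressure_hPa, temperature_C, humidity_pct), next_day_temperature_C)
ARCHIVE = [
    ((1012.0, 18.0, 60.0), 19.5),
    ((1003.0, 12.0, 85.0), 10.0),
    ((1020.0, 25.0, 40.0), 26.0),
    ((1008.0, 15.0, 70.0), 14.0),
]


def distance(a, b):
    """Plain Euclidean distance; a real system would rescale each measurement first."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def forecast_by_lookup(today, archive=ARCHIVE):
    """Return the outcome recorded after the most similar past day."""
    _, outcome = min(archive, key=lambda record: distance(record[0], today))
    return outcome


today = (1011.0, 17.0, 62.0)
print(forecast_by_lookup(today))  # prints 19.5
```

No equations are run here: the quality of the answer depends entirely on how much past data is stored and how similarity is measured.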
3Protein Data Bank
22,700 protein structures; growth over the last thirty years.
4Alcohol dehydrogenase
5Sky pictures
6National Virtual Observatory
Traditionally, astronomers figured out what they
needed to see in the sky to test their theories,
then signed up for two weeks at an observatory
such as Kitt Peak, and sat there at night taking
photographs. Now the Sloan Digital Sky Survey
and other resources may let them do their work
without using a telescope. The Large Synoptic
Survey Telescope will gather 7-10 terabytes PER
NIGHT and 10 petabytes/yr.
7 2Micron, Sloan survey
Finding a brown dwarf.
8IRIS seismic data consortium
10Rhododendron in CalFlora database
11Medical MRI scan (UCLA)
12Digital orthophotoquad
13Eckerd College dolphin digital library
14Tim Rowe: vertebrate fossil CT scans
15Peter Allen: Beauvais Cathedral
16Marc Levoy: Forma Urbis Romae
17Q: Where will the Data Come From? A: Sensor Applications
- Earth Observation
  - 15 PB by 2007
- Medical Images and Information, Health Monitoring
  - Potential 1 GB/patient/y → 1 EB/y
- Video Monitoring
  - 1E8 video cameras @ 1E5 MBps → 10 TB/s → 100 EB/y → filtered???
- Airplane Engines
  - 1 GB sensor data/flight,
  - 100,000 engine hours/day
  - 30 PB/y
- Smart Dust: ?? EB/y
This slide taken from a presentation by Jim Gray
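The order-of-magnitude estimates on this slide can be checked with a few lines of arithmetic. The sketch below rests on assumptions that are not on the slide: the video figure reproduces 10 TB/s only if each camera delivers 1E5 bytes per second, the engine figure treats 1 GB per flight as roughly 1 GB per engine hour, and the patient count is simply what 1 EB/y implies at 1 GB/patient/y.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000
GB, TB, PB, EB = 10**9, 10**12, 10**15, 10**18


def sensor_estimates():
    """Rough yearly data volumes; every input marked 'assumption' is not from the slide."""
    # Video monitoring: 1E8 cameras (slide) at 1E5 bytes/s each (assumption).
    video_bytes_per_s = 10**8 * 10**5
    video_bytes_per_y = video_bytes_per_s * SECONDS_PER_YEAR

    # Airplane engines: 100,000 engine hours/day (slide),
    # about 1 GB of sensor data per engine hour (assumption).
    engine_bytes_per_y = 100_000 * 365 * GB

    # Medical: number of patients implied by 1 EB/y at 1 GB/patient/y.
    patients_implied = EB // GB

    return {
        "video_TB_per_s": video_bytes_per_s / TB,
        "video_EB_per_y": video_bytes_per_y / EB,
        "engine_PB_per_y": engine_bytes_per_y / PB,
        "patients_implied": patients_implied,
    }


for name, value in sensor_estimates().items():
    print(f"{name}: {value:,.2f}")
```

Under these assumptions the video total comes out near 315 EB/y and the engine total near 36.5 PB/y, the same order of magnitude as the slide's rounded 100 EB/y and 30 PB/y.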
18Data sharing ethics
- Vary by field
- Molecular biology: you can't publish a paper reporting a protein structure without depositing the structure in the public data bank. Genomic data is also public.
- Astronomy: the convention is that you get two years' use of the data you collect, then must make it available to others.
- Dead Sea Scrolls: kept secret for forty years.
- Yet molecular biology data has potentially enormous economic value, whereas cosmology and ancient scrolls have none.
- What should we urge on new fields?
19Cyberinfrastructure
- NSF has traditionally paid for some infrastructure
  - Supercomputer centers (now some $80M/yr)
  - Backbone networking (perhaps some $40M/yr)
- What about content? (NSF/NIH support much already)
- The Cyberinfrastructure task force looked at this; however, the recommendation for support of data is mixed with a proposal to go to 7 supercomputer centers, and the total is $1B/yr, which is politically unrealistic.
- Librarians, and scientists with data, don't have the organization or political weight of supercomputing.
20Large scale storage
Where are these resources? Generally in computer centers, or in scientific departments, or sometimes at private corporations (Microsoft, in particular).
Not enough in libraries; in general, libraries do not have funds to support such services and are not well placed to get them.
We need more cooperative projects; examples are UCSB and UIUC.
21Guess at the future
- Written material, as a storage problem, is insignificant compared with data. The data requires too much specialized knowledge to share easily.
- Each project, as well as storing its data, is likely to store its own publications.
- Libraries might be marginalized, with only old stuff.
- What to do?
  - Develop techniques for general data storage to let libraries share this work
  - Create an ethic for public sharing
  - Find public funding for data storage.