Title: Scientific databases
1Scientific databases
Increasing access to raw data. Requires more care
to administer and to use. But it's transforming
science by providing data in advance. Not clear
whether such databases end up in libraries or in
the academic departments where the data is
created and used, and where the specialists are.
2Protein Data Bank
22,700 protein structures; growth over the last
thirty years
3Alcohol dehydrogenase
4Sky pictures
5National Virtual Observatory
Traditionally, astronomers figured out what they
needed to see in the sky to test their theories,
then signed up for two weeks at an observatory
such as Kitt Peak, and sat there at night taking
photographs. Now the Sloan Digital Sky Survey
and other resources may let them do their work
without using a telescope. The Large Synoptic
Survey Telescope will gather 7-10 terabytes PER
NIGHT and 10 petabytes/yr.
6Two Micron survey, Sloan survey
Finding a brown dwarf.
7IRIS seismic data consortium, or
see http://aslwww.cr.usgs.gov/Seismic_Data/telemetry_data/map_sta_eq.shtml
9Rhododendron in CalFlora database
10Medical MRI scan (UCLA)
11ICPSR
Inter-university Consortium for Political and
Social Research (in Ann Arbor)
12Download Data or Explore it Online
- Data services provide generalized transformations and exploratory data analysis
- Data services can be applied automatically to studies from other archives
- Intelligent exploration based on variable type
- Results by email, or direct download
- Load balancing, distributed analysis
- Subsets and conversions
- STATA, SPSS, S-Plus, TSV
- On-line data analysis
- Univariate and multivariate analysis
- Powerful statistical language back-end
- Benchmarked, gold standard accuracy
- Capable of sophisticated multivariate analysis
- Replication code always available
Harvard virtual data center
Managing Collections in a VDC
13Federated Facts and Figures
Joe Hellerstein, Berkeley
14Canonical XML View
Clothing(pid, item, category, description, price, cost)
<001, Suede Jacket, outerwear, Hip length, 325.00, 175.00>
<002, Rain boots, outerwear, Ankle height, 45.00, 19.99>
Discount(pid, discount)
<001, 0.70>
<002, 0.50>
- Oracle 9i XSQL
- Canonical XML View
- Use XSLT (out of engine) to transform canonical XML
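A canonical XML view of the Clothing rows might look like the output of the sketch below. It is not Oracle XSQL; it uses Python's sqlite3 and ElementTree as stand-ins, and the ROWSET/ROW element names are an assumption about the canonical form. The function name canonical_xml is made up for this illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET


def canonical_xml(conn, query):
    """Render a query result as <ROWSET><ROW num="n"><COL>value</COL>...</ROW></ROWSET>."""
    cursor = conn.execute(query)
    # Column names become element names (upper-cased here by assumption).
    columns = [d[0].upper() for d in cursor.description]
    rowset = ET.Element("ROWSET")
    for num, row in enumerate(cursor, start=1):
        row_el = ET.SubElement(rowset, "ROW", num=str(num))
        for name, value in zip(columns, row):
            ET.SubElement(row_el, name).text = str(value)
    return ET.tostring(rowset, encoding="unicode")


# The two Clothing rows from the slide, loaded into an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Clothing (pid TEXT, item TEXT, category TEXT, "
    "description TEXT, price REAL, cost REAL)"
)
conn.executemany(
    "INSERT INTO Clothing VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("001", "Suede Jacket", "outerwear", "Hip length", 325.00, 175.00),
        ("002", "Rain boots", "outerwear", "Ankle height", 45.00, 19.99),
    ],
)

xml_view = canonical_xml(conn, "SELECT pid, item, price FROM Clothing ORDER BY pid")
print(xml_view)
```

An XSLT stylesheet applied outside the database engine could then reshape this generic row-and-column XML into whatever layout an application wants.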
15SQL software
There are also various commercial products: SAP,
DB2, Interbase, Oracle, and Microsoft Access. SQL
looks like: SELECT Lastname FROM Payroll WHERE
Department = 344; The visual interface in
Microsoft Access is easier.
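To try the query without installing any of those products, here is a minimal sketch using the SQLite engine bundled with Python. The Payroll rows (names, departments, salaries) are invented for the example; only the table and column names in the SELECT come from the slide.

```python
import sqlite3

# In-memory database with a hypothetical Payroll table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Payroll (Lastname TEXT, Department INTEGER, Salary REAL)")
conn.executemany(
    "INSERT INTO Payroll VALUES (?, ?, ?)",
    [
        ("Garcia", 344, 52000.0),
        ("Okafor", 212, 61000.0),
        ("Lindqvist", 344, 58000.0),
    ],
)

# Same query as on the slide, with the department passed as a parameter.
rows = conn.execute(
    "SELECT Lastname FROM Payroll WHERE Department = ? ORDER BY Lastname",
    (344,),
).fetchall()
lastnames = [row[0] for row in rows]
print(lastnames)  # ['Garcia', 'Lindqvist']
```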
16Database integration: joins
The normal operator for merging database tables
is the join operator, which is discussed at
exhaustive length in any database course or
book. The key point is that multiple tables must
contain names with the same semantics. If you
have LastName in an Address database and a
Payroll database, spell them all the same way:
labels and contents. It turns out to be much more
painful to merge numerical databases than to
merge textual databases; try to avoid it.
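A small sketch of why spelling matters, again using Python's bundled SQLite. The rows are invented; one last name is stored in a different case in the two tables, so a plain join silently drops that person, while a join on a normalized key recovers the match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Address (LastName TEXT, City TEXT);
    CREATE TABLE Payroll (LastName TEXT, Department INTEGER);
    INSERT INTO Address VALUES ('Garcia', 'Tucson'), ('Lindqvist', 'Ann Arbor');
    -- Hypothetical inconsistency: same person, different spelling convention.
    INSERT INTO Payroll VALUES ('Garcia', 344), ('LINDQVIST', 344);
    """
)

# Plain join: contents must match exactly, so the upper-case row is lost.
strict = conn.execute(
    "SELECT a.LastName, a.City, p.Department "
    "FROM Address a JOIN Payroll p ON a.LastName = p.LastName "
    "ORDER BY a.LastName"
).fetchall()

# Join on a normalized key: both rows match.
normalized = conn.execute(
    "SELECT a.LastName, a.City, p.Department "
    "FROM Address a JOIN Payroll p ON lower(a.LastName) = lower(p.LastName) "
    "ORDER BY a.LastName"
).fetchall()

print(strict)
print(normalized)
```

Case folding is only the easiest kind of mismatch; different units or precision in numerical columns have no such one-line fix.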
17Question answering
Natural language processing for question
answering can mean either translating English
into a formal database language or retrieving
facts from text. Early examples were Baseball and
Lunar (1960s), which did simple matching of
sentence patterns to possible queries.
Surprisingly, you can do pretty well on the first
question, but dialog is too difficult. (Which
sample had the most iron? Sample 13. Where was it
from? What is "it"??)
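Matching sentence patterns to queries can be sketched in a few lines. This is a toy, not a reconstruction of how Baseball or Lunar worked; the two patterns, the samples table, and its column names are all invented for the illustration.

```python
import re

# Each entry pairs a sentence pattern with a query template (both hypothetical).
PATTERNS = [
    (
        re.compile(r"which sample had the most (\w+)\??$", re.IGNORECASE),
        "SELECT sample_id FROM samples ORDER BY {0}_pct DESC LIMIT 1",
    ),
    (
        re.compile(r"where was sample (\d+) from\??$", re.IGNORECASE),
        "SELECT site FROM samples WHERE sample_id = {0}",
    ),
]


def translate(question):
    """Return a query for the first matching pattern, or None if nothing matches."""
    text = question.strip()
    for pattern, template in PATTERNS:
        match = pattern.match(text)
        if match:
            return template.format(*match.groups())
    return None


first = translate("Which sample had the most iron?")
follow_up = translate("Where was it from?")
print(first)
print(follow_up)  # None: the pattern has no way to resolve "it"
```

The first question translates cleanly; the follow-up fails because nothing in a sentence-by-sentence matcher remembers that "it" means sample 13.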
18Geo- and time- reference queries
Perseus project (Greg Crane)
19Data provenance (Buneman)
20Data quality
It's harder to judge numerical data than textual
essays. It's not likely that there's another
source, you haven't got cues from the quality of
the writing, and it's harder to judge whether
something makes sense. The lack of labeling on
many web pages makes it hard to know who the
publisher might be, even if that name would
help you. Calculations based on databases are
even harder to deal with; logical deductions may
be worse. For example: the tacR gene regulates
the human nervous system; the tacQ gene is
similar to tacR but is found in E. coli; so the
tacQ gene regulates the E. coli nervous system.
21Questions are repetitive
The best known question answering system is
Ask Jeeves, which uses a pre-stored list of
answers to a few questions, and then a search
engine beyond that. Sort of a new interface to a
FAQ plus a general search. Mark Ackerman proposed
(1996) the Answer Garden as a system where
you'd be presented with a set of choices (sort of
like twenty questions) until you reached either a
previously answered question or a consultant; if
the latter, the answer was then inserted in the
system at the point you reached.
22Natural language problems
Even if you can translate natural language to SQL
and then feed it to a database, you still have to
train the users in what kinds of queries can be
answered. IBM Heidelberg had a system in the
1970s which took 90 minutes of training to ask
questions in English; you can make a start on
teaching SQL in 90 minutes. The reality is that
real question-answering systems use very
restricted subject domains (weather, stocks,
airline schedules) while the science fiction
writers always imagine completely general "tell
me what to do next" systems. Perhaps the best
strategy is query clustering: just link each new
query to an earlier question whose answer is
known.
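One very simple way to link a new query to an earlier one is word overlap. The sketch below uses Jaccard similarity on word sets, which is just one possible choice; the stored questions, their answers, and the 0.5 threshold are invented for the illustration.

```python
def tokens(text):
    """Lower-cased set of words with surrounding punctuation stripped."""
    return {word.strip("?.,!").lower() for word in text.split()} - {""}


def jaccard(a, b):
    """Size of the intersection over size of the union of two word sets."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


# Hypothetical store of previously answered questions.
ANSWERED = {
    "What is the weather in Boston today?": "Cloudy, 55 F",
    "When does the next flight to Chicago leave?": "6:40 pm",
}


def link_query(query, answered=ANSWERED, threshold=0.5):
    """Return (earlier question, its answer) if one is similar enough, else None."""
    query_words = tokens(query)
    best = max(answered, key=lambda known: jaccard(query_words, tokens(known)))
    if jaccard(query_words, tokens(best)) >= threshold:
        return best, answered[best]
    return None


hit = link_query("What is the weather today in Boston?")
miss = link_query("Who won the World Series?")
print(hit)
print(miss)
```

A query that falls below the threshold would go to a search engine or a human, in the spirit of the Answer Garden.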
23Data lookup, not experiment
In the future, many experiments won't be
necessary because the answers will already be
online. Data acquisition is being automated and
enormous quantities of information are online
(petabytes). Molecular biology is first,
replacing wet chemistry with lookups in the
protein and genome data banks (e.g. to determine
the function of a gene or protein). Astronomy is
probably coming next. Many earth-observing fields
are getting ready.
24Q: Where will the Data Come From? A: Sensor Applications
- Earth Observation
- 15 PB by 2007
- Medical Images and Information, Health Monitoring
- Potential 1 GB/patient/y -> 1 EB/y
- Video Monitoring
- 1E8 video cameras @ 1E5 MBps -> 10 TB/s -> 100 EB/y -> filtered???
- Airplane Engines
- 1 GB sensor data/flight,
- 100,000 engine hours/day
- 30 PB/y
- Smart Dust: ?? EB/y
This slide taken from a presentation by Jim Gray
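A rough order-of-magnitude check of some of these figures, as a sketch. Two inputs are assumptions not stated on the slide: a billion patients, and one gigabyte of sensor data per engine hour (that is, flights averaging about an hour). Decimal units are used throughout.

```python
# Decimal byte units.
GB, TB, PB, EB = 1e9, 1e12, 1e15, 1e18
SECONDS_PER_YEAR = 365 * 24 * 3600

# Medical: 1 GB per patient per year, for an assumed one billion patients.
patients = 1e9  # assumption, not from the slide
medical_eb_per_year = patients * 1 * GB / EB

# Video: a sustained 10 TB/s stream accumulated over one year.
video_eb_per_year = 10 * TB * SECONDS_PER_YEAR / EB

# Airplane engines: 100,000 engine hours per day, assuming 1 GB per engine hour.
engine_hours_per_day = 100_000
gb_per_engine_hour = 1  # assumption: one flight is roughly one engine hour
engine_pb_per_year = engine_hours_per_day * 365 * gb_per_engine_hour * GB / PB

print(f"medical: {medical_eb_per_year:.1f} EB/y")
print(f"video:   {video_eb_per_year:.0f} EB/y")
print(f"engines: {engine_pb_per_year:.1f} PB/y")
```

Under these assumptions the medical figure matches the slide's 1 EB/y, while video (about 315 EB/y) and engines (36.5 PB/y) land within the same order of magnitude as the slide's rounded 100 EB/y and 30 PB/y.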
25Kinds of sensors
- Photography: aerial and satellite
- Environmental sensors: temperature, pressure, etc.
- Cameras in cities: crime prevention, traffic, etc.
- Transaction monitoring: widely used in business
- Biological sensors: detect germs, toxins
- GPS-based position sensors: now even placed on animals
- RFID tags: all kinds of objects can be labeled
- Many purposes: security, commercial, health, weather, ...
26Wireless to be ubiquitous
Wireless mote from Crossbow Technology.
Biometrics also coming: every device will know
who used it. Again, is this an invasion of
privacy or a useful service?
27Data rates high
Most sensors are dumb: they collect and/or
transmit vast amounts of data, but it isn't
selected in useful ways. Thus, the results
depend on data mining.
Commercial: Wal-Mart asked what sold best before
the last hurricane in Florida, and the answer was
beer.
Security: a British study alleged a 1/3 drop in
burglary after announcing that they have
surveillance cameras linked to a database of
known criminals.
Medical: firefighters wear gadgets that raise an
alarm if they have not moved in 30 seconds;
similar devices are suggested for the elderly (to
detect people who fall).
Many privacy issues are as yet not understood.
28Biological sensor: furry thermometer
Cat asleep in sun: temperature below 70 F; cat
asleep in shade: temperature above 75 F. Only
one bit of accuracy, and unreliable: needs clear
weather and is easily distracted by any chance of
being petted. But it does not use expensive and
toxic mercury and can be dropped on the floor
without damage.
29Data sharing ethics
- Vary by field
- Molecular biology: you can't publish a paper reporting a protein structure without depositing the structure in the public data bank. Genomic data is also public.
- Astronomy: the convention is that you get two years' use of the data you collect, then must make it available to others.
- Dead Sea Scrolls: kept secret for forty years.
- Yet molecular biology data has potentially enormous economic value, whereas cosmology and ancient scrolls have none.
- What should we urge on new fields?
30Scientific Data Libraries
New paradigm for science.
Old style: form hypothesis, design experiment,
run experiment, analyze results, evaluate
hypothesis.
New style: form hypothesis, look up data to test
it, evaluate hypothesis.
Molecular biology has been first, astronomy is
next, and many other fields will follow.
31Model or lookup?
Weather: measure today and run equations, or
measure today and find a similar day in the
past?
Chess: the opening and endgame are done by
lookup; the middle game is done by calculation.
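The lookup alternative for weather can be sketched as a nearest-neighbor search: find the past day most like today and report what followed it. The archive values below are invented, and using raw Euclidean distance on unscaled temperature and pressure is a crude simplification.

```python
import math

# Hypothetical archive of consecutive daily observations:
# (temperature in F, pressure in hPa).
ARCHIVE = [
    (68.0, 1015.0),
    (71.0, 1012.0),
    (75.0, 1008.0),
    (64.0, 1003.0),
    (60.0, 1010.0),
]


def forecast_by_lookup(today, archive=ARCHIVE):
    """Find the most similar past day and return the day that followed it."""
    # The last archived day has no successor, so it cannot be a candidate.
    candidates = range(len(archive) - 1)
    nearest = min(candidates, key=lambda i: math.dist(today, archive[i]))
    return archive[nearest + 1]


tomorrow = forecast_by_lookup((70.0, 1012.5))
print(tomorrow)  # (75.0, 1008.0)
```

No equations are run at all; the quality of the answer depends entirely on how much past data is available to search.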
32Large scale storage
Where are these resources? Generally in
computer centers, or in scientific departments,
or sometimes at private corporations (Microsoft,
in particular). Not enough in libraries; in
general, libraries do not have funds to support
such services and are not well placed to get
them. We need more cooperative projects;
examples are UCSB and UIUC.
33Guess at the future
- Written material, as a storage problem, is insignificant compared with data. The data requires too much specialized knowledge to share easily.
- Each project, as well as storing its data, is likely to store its own publications.
- Libraries might be marginalized, with only old stuff.
- What to do?
- Develop techniques for general data storage to let libraries share this work
- Create an ethic for public sharing
- Find public funding for data storage.