Title: Information Management in a Non-Bibliographic Environment: Scientific Data
1Information Management in a Non-Bibliographic Environment: Scientific Data
- Joseph A. Hourclé
- 2007-Nov-20
- FLICC Learning at Lunch
2About Me
3STEREO: Solar TErrestrial RElations Observatory
4The Virtual Solar Observatory
5The Virtual Solar Observatory
- Federated Search of Solar Physics Data
- 14 organizations (currently)
- 4 more organizations being integrated
- 62 instruments
- Hundreds of distinct data collections
- 10s of millions of records
- Terabytes of Data
6The data is growing
- STEREO
- Launched Oct 2006
- Over 1.5 million images at up to 8 MB
- Hinode (Sunrise aka Solar-B)
- Launched Sept 2006
- Over 3 million images at up to 8 MB
- SDO
- Scheduled to launch Aug 2008
- 1 image per second at 32 MB
- 1.5TB/day dedicated connection
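The figures above can be turned into rough volumes with a little arithmetic. The following is a minimal sketch, assuming decimal units (1 TB = 1,000,000 MB) and treating every image as if it were the maximum size quoted; the function names are illustrative, and the slide itself gives only the image counts, sizes, and cadence.

```python
# Back-of-the-envelope data volumes from the figures on the slide.
# Assumption: decimal units (1 TB = 1,000,000 MB); the slide does not say.
MB_PER_TB = 1_000_000
SECONDS_PER_DAY = 24 * 60 * 60


def rough_tb(image_count: int, max_image_mb: float) -> float:
    """Rough archive size if every image were at the maximum quoted size."""
    return image_count * max_image_mb / MB_PER_TB


def daily_tb(images_per_second: float, image_mb: float) -> float:
    """Raw daily volume for a fixed cadence and a fixed image size."""
    return images_per_second * image_mb * SECONDS_PER_DAY / MB_PER_TB


stereo_tb = rough_tb(1_500_000, 8)   # "over 1.5 million images" at up to 8 MB
hinode_tb = rough_tb(3_000_000, 8)   # "over 3 million images" at up to 8 MB
sdo_tb_per_day = daily_tb(1, 32)     # 1 image per second at 32 MB

print(f"STEREO: on the order of {stereo_tb:.0f} TB")
print(f"Hinode: on the order of {hinode_tb:.0f} TB")
print(f"SDO:    about {sdo_tb_per_day:.2f} TB/day of raw images")
```

Under these assumptions the raw SDO image stream works out to a little under 2.8 TB/day, which is more than the 1.5 TB/day connection listed; the slide does not say how the two figures are reconciled.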
7Other disciplines have even more data
- NVO: US National Virtual Observatory
- LSST (Large Synoptic Survey Telescope)
- Scheduled to start observing in 2012
- 7-10 TB/night, 3.2Gpix images
- 10 PB/yr
- EOS/DIS: Earth Observing System / Data Information System
- About 2 TB/day, per satellite (8?)
- Planned to be 16 PB
8 ... and we're not the only one
- Heliospheric
- Magnetospheric
- Radiation Belt
- ITM (upper atmosphere)
- NVO / IVOA: nighttime astronomy
- PDS: planetary
- EOS: Earth
9What is Scientific Data?
10How is Scientific Data Gathered?
- Scientist thinks up a problem
- Scientist (and engineers) create an instrument to conduct an investigation
- The instrument collects data via sensors
- Data are calibrated
- Data are written into scientifically useful formats
- Data are distributed to the scientists
11But really, what is data?
- There is no formal definition.
- It's as ambiguous as the term "book"
- "Data" may be shorthand for:
- Data Collection
- Data Series
- Data Set
- Data Product
- Data Granule
12The problem with data
- Every investigation has different data needs
- Each investigation organizes and catalogs the data to answer their scientific question
- What is good data for one group may not be useful for another
- Because data is being collected continuously, there may not be a consistent boundary on one granule of data
- Some data is tracked as individual values, and only packaged upon request
- Mostly time-series data, not images
13Types of Data Archives
- Instrument Archives
- Maintained by the PI team
- Little or no consideration towards re-use
- Resident Archive
- Maintained by a specific discipline
- Re-use within the given discipline
- Long-Term Archive
- Required for federally funded studies
- Focus on preservation, not use of data
14Active Archives
- Still changing
- May be ingesting from an active mission
- May still be processing their data
- May serve multiple editions or processed states of the data
- Final Data in Physical Units typically isn't available until one or more years after the mission
- Not directly comparable with data from other instruments until then
15Isn't this just Knowledge Management?
- There is no knowledge in the raw data
- But there is knowledge in the design of the instrument's sensors
- What spectral range are the instruments sensitive to?
- What are the instrument's possible operating modes?
- Knowledge of the instrument's sensors affects how the scientists interpret data
- The scientists have to interpret the results to determine the knowledge
- May be reluctant to have others catalog their data, as it requires understanding the science
16Multiple Operating Modes: Filters on SOHO/EIT
171Å
195Å
284Å
304Å
17Known Sensor Issues: SOHO/LASCO
18Knowledge Mgmt, cont'd
- We do have Event and Feature Catalogs
- Scientists will record when/where they think
something interesting is occurring, and share
with others.
19Data Processing: Raw Image (Linear)
20Data Processing: Calibrated (Greyscale)
21Data Processing: Before Calibration
22Data Processing: Best Calibration
23Data Processing: CCD Aging
24CCD Calibration
195Å
171Å
304Å
284Å
25Higher Level Data
26The Problems
- Cross-discipline translation is difficult
- Concepts of what makes data useful differ between disciplines
- Different disciplines use different search parameters
- VSO: time, spectral range, location on sun
- Always looking at the same object
- VHO: location of observer, time, spectral range
- Observatories are moving, in situ measurements
- EOS: location of object observed
- NVO: direction of pointing (assumed from Earth)
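The mismatch in search parameters can be made concrete with a small sketch. It takes the lists on this slide literally, which is an assumption: each system almost certainly supports more parameters than the few named here, and the dictionary and function names are illustrative only.

```python
# Search parameters per system, copied from the slide's bullets.
# Assumption: these are only the primary parameters named on the slide,
# not a complete description of any system's query interface.
SEARCH_PARAMETERS = {
    "VSO": {"time", "spectral range", "location on sun"},
    "VHO": {"location of observer", "time", "spectral range"},
    "EOS": {"location of object observed"},
    "NVO": {"direction of pointing"},
}


def shared_parameters(*systems: str) -> set:
    """Parameters that every named system lists as a search term."""
    parameter_sets = [SEARCH_PARAMETERS[name] for name in systems]
    return set.intersection(*parameter_sets)


solar_and_helio = shared_parameters("VSO", "VHO")
all_four = shared_parameters("VSO", "VHO", "EOS", "NVO")

print("VSO and VHO share:", sorted(solar_and_helio))
print("All four share:", sorted(all_four))
```

On the slide's lists, VSO and VHO overlap on time and spectral range, while no single parameter is common to all four systems, which is one way to see why a cross-discipline query is hard to phrase.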
27Problems, cont'd.
- Even when there is agreement, there are still problems
- Which time is important?
- Start time?
- Average time?
- Spacecraft time?
- Which coordinate system is used?
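The "which time?" question is easy to illustrate. The sketch below assumes that "average time" means the midpoint of an exposure, which is only one possible reading; the `Exposure` class and the 12-second exposure are hypothetical, and spacecraft clock offsets are not modeled.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Exposure:
    """One observation, described by when it began and how long it lasted."""
    start: datetime
    duration: timedelta

    @property
    def midpoint(self) -> datetime:
        # One reading of "average time": halfway through the exposure.
        return self.start + self.duration / 2

    @property
    def end(self) -> datetime:
        return self.start + self.duration


# Hypothetical 12-second exposure; the values are made up for illustration.
exposure = Exposure(start=datetime(2007, 11, 20, 12, 0, 0),
                    duration=timedelta(seconds=12))

# Two catalogs describing the same exposure under different conventions.
catalog_a_time = exposure.start      # catalog A records the start time
catalog_b_time = exposure.midpoint   # catalog B records the midpoint

offset = catalog_b_time - catalog_a_time
print(f"Same exposure, timestamps differ by {offset.total_seconds():.0f} s")
```

A search that matches records on exact timestamps would treat these two catalog entries as different observations unless it knows which convention each catalog uses.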
28Problems, still cont'd
- Each discipline is working on solutions within their field
- Build systems that suit the needs of their community
- Each discipline has different first class data
- Currently working on metadata standards so data can be discovered and used by other disciplines
- SPASE, MMI, GEON
- Some work on ontologies to help with discovery and use
- VSTO, SWEET, GEON, SESDI
29Lots of Permutations
30I know what you're thinking
31And it mostly works
32How does this affect libraries?
- The library is a changing organism
- Data is relatively unanalyzed in LIS
- Data connects to bibliographic records, and vice versa
- What data was used in this journal article?
- Where can I get documentation on using this data?
- Has anyone published anything using this data?
- Data connects to other data
- What other instruments observed a given event?
- Is there an alternate version that better meets my needs?
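The question "what other instruments observed a given event?" comes down to an interval-overlap test once observation times are cataloged. This is a minimal sketch under that assumption; the `Observation` record, the instrument names, and all times are hypothetical and do not come from any real catalog.

```python
from datetime import datetime
from typing import NamedTuple


class Observation(NamedTuple):
    """A cataloged observation with its time coverage (hypothetical schema)."""
    instrument: str
    start: datetime
    end: datetime


def instruments_observing(observations, event_start, event_end):
    """Names of instruments with any observation overlapping the event."""
    return sorted({
        obs.instrument
        for obs in observations
        # Two intervals overlap when each one starts before the other ends.
        if obs.start <= event_end and obs.end >= event_start
    })


# Made-up records for illustration only.
catalog = [
    Observation("instrument A", datetime(2007, 1, 1, 10, 0), datetime(2007, 1, 1, 10, 30)),
    Observation("instrument B", datetime(2007, 1, 1, 12, 0), datetime(2007, 1, 1, 12, 30)),
    Observation("instrument C", datetime(2007, 1, 1, 10, 20), datetime(2007, 1, 1, 11, 0)),
]

observers = instruments_observing(
    catalog,
    event_start=datetime(2007, 1, 1, 10, 25),
    event_end=datetime(2007, 1, 1, 10, 45),
)
print(observers)
```

The test only works if every catalog agrees on what its start and end times mean, which ties this question back to the time and coordinate problems raised earlier.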
33There's funding for research
- NSF
- CDI: Cyber-Enabled Discovery and Innovation
- INTEROP: Community-based Data Interoperability Networks
- IIS: Information and Intelligent Systems
- DataNet: Sustainable Digital Data Preservation and Access Network Partners
- NASA
- AISR: Advanced Info. Systems Research
- ACCESS: Advancing Collaborative Connections for Earth Science Access
34 Sunspot on 15 July 2002 from the Swedish 1-m Solar Telescope on La Palma
35http://virtualsolar.org/ http://stereo.gsfc.nasa.gov
- joseph.a.hourcle_at_nasa.gov