Title: Regional Databases and Archives: the Effects of Scale
1 Regional Databases and Archives: the Effects of Scale
- A Presentation for Scalable Information Networks
for the Environment Workshop - October 31, 2001
- San Diego, California
- Raymond McCord
- Oak Ridge National Laboratory
- Oak Ridge National Laboratory is operated by
UT-Battelle, LLC, for the U.S. Department of
Energy under contract DE-AC05-00OR22725
2 Credits
- Concepts are derived from managing data for
environmental projects over the past 25 years.
- Variations of the concepts have been observed
in these disciplines.
- plant community research
- impact assessment in marine systems
- national acid rain surveys
- Environmental monitoring and cleanup projects at
DOE facilities
- Military land use assessment
- Climate change research (atmospheric research)
- Ideas are freely traded with Dick Olson (ORNL)
3 Presentation Strategy
- Motivation and concerns
- Archive overview
- Definition, components, functions, why and why not,
examples
- Archives and scale
- Effects of scale
- Mitigate scale effects
- Generate and manage metadata
- Future Archive issues to resolve
4 My Motivation and Concerns
The enemy is our behavior. Will we change or
whine???
- Motivation
- Describe observations about the effects of scale
on Archives
- Describe remedies to minimize scale effects
- Minimize remedy pain
- Concerns
- Preaching to the choir!!
- Nothing new will happen!!
- Continuing unnecessary limits to future science!!
5 "You can't keep running in here and demanding
data every two years"
Challenge: engage scientists in the process of
archiving their data and provide the
mechanism for archiving.
Source: American Scientist, Vol. 88(6), p. 525.
6 Archives and Scale: Presumptions
- Regional data live in Archives
- Information sharing is important
- The archiving can be improved
- Archive neurons are metadata
- Multidisciplinary data will foster broader
ecological discoveries
- The limited number of permanent data archives for
ecological data will increase
7 What Is an Archive?
8 What Is a Data Archive?
- A data archive is a permanent, electronic
collection of datasets with accompanying metadata
such that users of the data can acquire,
understand, and use the data.
- More than a long-term backup
- More than an index or catalog with pointers to
datasets stored elsewhere
- For more details, see Michener, W. K. and J. W.
Brunt. 2000. Ecological Data: Design,
Management and Processing. Blackwell Science.
180 pp.
9 Components of an Archive
- Data and metadata
- Storage devices
- Information system
- Network connections
- Staff
- Data/metadata preparation and review
- Systems development and maintenance
- User support
10 Archive Functions
- Store data
- Submitted by others
- Build catalog and structure
- Maintain storage across technology generations
- Review new data (QA, metadata)
- Advertise contents
- Find data for users
- Query and browse logic
- Distribute data
- Provide access to data
- References to documentation
11 Data Centers at ORNL
- CDIAC - Carbon Dioxide Information Analysis
Center
- ARM Archive - Atmospheric Radiation Measurement
Program
- ORNL DAAC - Distributed Active Archive Center for
Biogeochemical Dynamics
- NARSTO - tropospheric air pollution information
for North America
- OREIS - Oak Ridge Environmental Information System
12 Atmospheric Radiation Measurement (ARM) Program
- ARM research questions
- What happens to all of the sunlight energy?
- How is light absorbed by clouds?
- What does partly cloudy mean? Statistically?
Spatially?
- What types of clouds form? When and How?
- ARM is a once in a lifetime research adventure
for atmospheric scientists
- ARM research includes instrumentation, system
development, data analysis, and modeling (climate
and process)
13 ARM Measurements: Scope
All data collection is highly automated -- a
REAL BLAST!!
Data collection is now a peer outcome with
scientific discovery
14 ARM Archive
- ARM Archive stores and provides access to the
entire accumulation of data
- Currently 5 million files and 14,000 GB and
growing
- The ARM data in the Archive will be accessed for
research for many years (decades)
- Currently distributes 50,000-100,000 files per month
(100-200 GB)
- More information
- ARM Program: www.arm.gov
- ARM Archive: www.archive.arm.gov
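As a rough check on the figures above, the average file size implied by the holdings and by the monthly distribution can be worked out directly. This is only a sketch: it assumes decimal units (1 GB = 1,000 MB) and pairs the low file count with the low volume, and the function name is illustrative.

```python
def average_file_mb(total_gb: float, n_files: int) -> float:
    """Average file size in MB, assuming decimal units (1 GB = 1000 MB)."""
    return total_gb * 1000 / n_files

# Holdings quoted on the slide: 5 million files, 14,000 GB.
holdings_avg = average_file_mb(14_000, 5_000_000)

# Monthly distribution quoted on the slide: 50,000-100,000 files, 100-200 GB.
monthly_low = average_file_mb(100, 50_000)
monthly_high = average_file_mb(200, 100_000)

print(f"holdings: {holdings_avg:.1f} MB per file")
print(f"distributed: {monthly_low:.1f} to {monthly_high:.1f} MB per file")
```

Under these assumptions the typical file is a few megabytes, so the Archive's challenge is the number of files at least as much as the total volume.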
15 ARM Archive Schematic: Archive Input and Output
[Figure: ARM Archive schematic. Labels: Archive web user interface; query specifications (location, date, measurement); catalog metadata; file list; Data Retrieval; requested files; user copy; Other ARM Systems; incoming data files; Data Reception; Mass Storage System; backup data files; operations metadata.]
16 Data Flow
[Figure: data flow diagram. Labels: Data; Metadata; User Interface; Network; Core archive functions.]
17 Why Archive??
I am doing Science. Trust me.
18 Cycles of Research: An Information View
[Figure: research cycle diagram. Labels: Problem Definition (Research Objectives); Planning; Measurement Collection; Original Observations; Secondary Observations; Automation and review; Archive of Data; Selection and extraction; Analysis and modeling; Information review; Publications; time scales of 20 yrs and 200 yrs.]
19 Why Don't I Archive My Data?
- No incentives - what's in it for me?
- No acknowledgment - does a dataset count as a paper?
- Give up publication rights - will somebody scoop
me?
- Poor planning - it was not in the Plan
- No resources - who's going to pay for it?
- Lack of training - what do I do first?
- Unsure about metadata content - how much is
enough?
20 Why Should I Archive My Data? (management hints!!)
- Career advancement (give them credit)
- you will get some recognition
- you can publish a data paper in ESA Ecological
Archives
- it may help me do science with broader scope
- Professional incentives (give them training)
- good scientific practice (create peer pressure)
- Institutional incentives (have expectations)
- required by the sponsor
- Technological advances (give them systems)
- it's easier and there are more options
21 Archiving Supports Science
- Metadata required for archiving will improve data
quality
- Extends data usefulness
- Increases your information base for doing
research
- data volume and diversity
- Permits replication of results
A KEY concept of Science
22 The Effects of Project Scale on Archives
Metadata are archive neurons??
23 Metadata Depends on Your World View
- Investigator
- Doesn't need extensive formal metadata
- Project
- Metadata needed for project integration and
modeling activities
- Project data manager may help write metadata
- Data archive
- More detailed metadata (e.g., spatial
coordinates)
- More standardization (e.g., keywords) to
communicate clearly with future users
- Who writes the metadata?
24 (In the beginning was the measurement. It was
formless and desolate. Without context.)
Measurement
25 Single Experiment View
[Figure: Measurement, with parameter name, sample ID, location, and date.]
26 Research Project View
[Figure: Measurement, with parameter name, media, QA flag, sample ID, location, and date.]
27 Long-term or Multidisciplinary View
[Figure: Measurement, with method, parameter name, units, media, QA flag, records, generator, sample ID, location, and date.]
28 Integrated System Archive View
[Figure: Measurement and its fields (method, parameter name, units, media, QA flag, records, generator, sample ID, location, date) linked to supporting definitions. Labels: Parameter def. (words, words, units, method); Method def. (lab, field); Units def.; QA def. (date, words, words); Record system; GIS (coord., elev., type, depth); generator details (org. type, name, custodian, address, etc.); Sample def. (type, date, location, generator).]
29 Another View of Scale
30 Project Scale and Recorded Metadata
[Figure: recorded metadata plotted against increasing user scope (PI, Group, Program, Archive). Metadata elements shown:]
- Units
- Method
- QA flag
- Media
- Parameter name
- Measurement
- Date
- Sample ID
- Location
- Generator
- Records
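The growth of required metadata with scope can be sketched as nested field sets. The field groupings below are read off the single-experiment, research-project, and long-term views earlier in the deck; the Python names and the sample record are hypothetical, and a real archive would likely require more than this.

```python
# Field sets follow the single-experiment, research-project, and long-term
# views; the Python names themselves are hypothetical.
SINGLE_EXPERIMENT = {"measurement", "parameter_name", "sample_id", "location", "date"}
RESEARCH_PROJECT = SINGLE_EXPERIMENT | {"media", "qa_flag"}
LONG_TERM = RESEARCH_PROJECT | {"method", "units", "records", "generator"}

REQUIRED_FIELDS = {
    "single_experiment": SINGLE_EXPERIMENT,
    "research_project": RESEARCH_PROJECT,
    "long_term": LONG_TERM,
}

def missing_metadata(record: dict, scope: str) -> list:
    """Return the required fields that are absent or empty for a scope."""
    required = REQUIRED_FIELDS[scope]
    return sorted(f for f in required if record.get(f) in (None, ""))

# Made-up record that is complete for a single experiment only.
record = {
    "measurement": 7.2, "parameter_name": "pH", "sample_id": "S-001",
    "location": "Site A", "date": "2001-10-31",
}
print(missing_metadata(record, "single_experiment"))
print(missing_metadata(record, "long_term"))
```

A record that is complete for its investigator can still be missing most of what a long-term archive needs, which is the scale effect the slide describes.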
31 Data Maturation and Scale
- Individual Investigators
- collect data, quality assure, document, analyze,
publish
- Groups or Science Teams
- collate data, enhance, synthesize, model, publish
- Project Information System
- collate data, review completeness, maintain data
for project
- Data Distribution and Archive Center
- long-term archive, distribute freely to users
- Master Data Directory
- searchable index with pointers to data
32 Preparing for Archiving
I will not wait. I will not wait. I will not
wait. I will not
33 Generic Environmental Data Model (Which Piece Is
First?)
[Figure: the same data model diagram as slide 28, Integrated System Archive View.]
34 Sequence of Information Birth
[Figure: the same data model diagram as slide 28, Integrated System Archive View.]
35 Research Publishing Metadata
- Metadata design can be a checklist for research
planning
- Metadata preparation can be integrated with
publication process
- Metadata are an investment in current and future
science
36 Where to Archive Data?
37 Archive Choices
- What determines your options?
- Sponsor requirements
- Repository access
- Metadata requirements
- Scalable storage
- Personal web pages and files
- Project or network data centers
- Federal data centers
- Links transcend storage structures
- Master directory
- Mercury
38 Personal Web Page
- It's fun, rewarding, relatively easy, can share
data quickly, can control access to data
- Data issues??
- complete metadata
- QA checks
- Connected to basic archival center functions??
- ready access to data (24 h/d, 7 d/wk)
- user support
- data available on multiple media
- secure, backed-up, long-term storage
39 ESA Ecological Archives
- Publishing datasets as peer reviewed, citable
papers (with volume and page numbers)
- Data papers are announced in abstract form in a
print journal with data available electronically
- Citation example
- Esser, G., H.F.H. Lieth, J.M.O. Scurlock and R.J.
Olson. 2000. Osnabrück net primary productivity
data set. (Ecological Archives data paper
E081-011). Ecology 81, 1177-1177.
- Bill Michener, Editor
- http://esa.sdsc.edu/esapubs/Journals_main.htm
40 Master Data Directory
- Provides search capability and pointers to a
source of the data (Center does not archive data)
- Maintains standard keywords/indices
- Collects metadata from many sources
- Examples
- Global Change Master Directory (GCMD)
http://gcmd.gsfc.nasa.gov
- ORNL DAAC Mercury System http://mercury.ornl.gov
41 What is Mercury?
Mercury is used to assist an investigator with
documenting data and making these data available
to others.
[Figure label: NASA / ORNL Metadata Index]
1. The data provider uses the Metadata Editor to
create a metadata file containing links to the
data and documentation
2. Mercury harvests the metadata and builds an
index
3. Users query the index
4. Full metadata are returned to the user,
including links back to the data provider
5. User links to the data provider's server
6. Data and documentation are downloaded
directly from the data provider
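The harvest-index-query pattern described in the six steps can be illustrated with a toy in-memory index. This is a minimal sketch of the general idea, not Mercury's actual implementation or API: the record layout, function names, and example URLs are all assumptions made for illustration.

```python
from collections import defaultdict

# Step 1: each provider publishes a metadata record with links to its data.
# These records and URLs are made-up placeholders.
provider_metadata = [
    {"title": "Soil moisture, Site A", "keywords": ["soil", "moisture"],
     "data_url": "http://provider-a.example/soil.csv"},
    {"title": "Stream chemistry, Site B", "keywords": ["water", "chemistry"],
     "data_url": "http://provider-b.example/stream.csv"},
]

def harvest(records):
    """Step 2: build a keyword -> record-position index from harvested metadata."""
    index = defaultdict(set)
    for pos, rec in enumerate(records):
        for kw in rec["keywords"]:
            index[kw.lower()].add(pos)
    return index

def query(index, records, keyword):
    """Steps 3-4: look up a keyword and return the full metadata records."""
    return [records[pos] for pos in sorted(index.get(keyword.lower(), ()))]

index = harvest(provider_metadata)
for hit in query(index, provider_metadata, "soil"):
    # Steps 5-6: the user follows the link; the data stay with the provider.
    print(hit["title"], "->", hit["data_url"])
```

The index holds only metadata and pointers, so the directory never has to store or migrate the data themselves.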
42 Regional Archives
43 Sources of Regional Data
- Carbon Dioxide Information Analysis Center
- National Geophysical Data Center
- National Environmental Satellite, Data, and
Information Service
- National Soils Data Access Facility
- National Water Information System
- Forest Inventory and Analysis
- Breeding Bird Survey
- Threatened and Endangered Species
- Global Change Master Directory
44 NASA EOSDIS Distributed Active Archive Centers
45 Global scale, 280 parameters: surface,
atmospheric, fluxes
46 Future Issues to Resolve
- Size, diversity, and longevity
- Accommodating change
- Teaching good practices
47 Issues: Size, Diversity, Longevity
- Size
- Online vs. Offline
- Database vs. File structure
- Multiple institutions
- Too big for technology migration??
- Diversity
- Increased logic and documentation for finding
data
- Spatial distribution
- Increased potential for uniqueness conflicts
- Longevity
- Too old to explain or decode
- Too much evolution of methods and practices
- Asynchronous change in data and metadata
48 Issues: Planning and Requirements
- Plan for archiving early and ongoing
- Avoids missing metadata
- Avoids panic
- Improves overall data quality and consistency
- Consider the timing of requirements
- Requirements
- Standards: to be or not to be?
- Documentation expectations
- Accessibility
It's mine!! It's my data!! You CAN'T have it!!
49 Research Implies Change
[Figure: cycle diagram. Labels: Research; Discovery; New questions; New information requirements; repeat.]
Not always true for other information systems
50 Issues: Accommodating Change
- Change must be considered in the design
- Things that will change
- Access expectations
- Logical hierarchy of information scope
- New parameters
- New disciplines
- New study sites
- New data sources or methods
51 Issues: More Changes
- Unpredictable variation is no excuse!!
- Often used as an excuse to avoid standards
- Cannot avoid all of it, but try
- Missing values will occur: plan ahead
- Do not mix Temp, temp, t, T, temperature for one parameter
- Be clear, avoid ambiguity
- Minimal observational intensity is no excuse!!
- Quick study, no documentation??
The unexpected are rare and most valuable??
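One way to enforce a single spelling per parameter is to map every incoming label to a controlled name before the data are stored. The alias table and function below are hypothetical; a real project would maintain its own controlled vocabulary.

```python
# Hypothetical alias table: every spelling seen in incoming files maps to
# one controlled parameter name.
PARAMETER_ALIASES = {
    "temp": "temperature",
    "t": "temperature",
    "temperature": "temperature",
}

def canonical_parameter(name: str) -> str:
    """Map a raw column label to its controlled name, or fail loudly."""
    key = name.strip().lower()
    if key not in PARAMETER_ALIASES:
        raise KeyError(f"unknown parameter name: {name!r}")
    return PARAMETER_ALIASES[key]

labels = ["Temp", "temp", "t", "T", "temperature"]
print({canonical_parameter(label) for label in labels})
```

Raising an error on an unknown label, instead of passing it through, keeps new variants from entering the archive unnoticed.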
52 Rules for Creating Databases for Archiving
- Unique occurrences
- Each type of measurement is represented in a
consistent way
- Each measurement event is represented by only one
value
- Identifiers
- Each value is associated with a parameter name
- Each measurement value has a quality indicator
and link to a method description
- Place and time
- Each value is associated with a unique place name
with a quantitatively defined location
(geographic coordinates)
- Each value is associated with a date and time
- Data Storage and Transport
- Data are stored or managed with a database
management system or equivalent
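These rules translate fairly directly into database constraints. The sketch below uses Python's built-in sqlite3 module; the table layout, column names, and sample values are assumptions chosen for illustration, not a schema from the presentation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE place (
    place_name TEXT PRIMARY KEY,        -- unique place name
    latitude   REAL NOT NULL,           -- quantitatively defined location
    longitude  REAL NOT NULL
);
CREATE TABLE method (
    method_id   TEXT PRIMARY KEY,
    description TEXT NOT NULL
);
CREATE TABLE measurement (
    parameter_name TEXT NOT NULL,
    place_name     TEXT NOT NULL REFERENCES place(place_name),
    measured_at    TEXT NOT NULL,       -- date and time, ISO 8601
    value          REAL,
    qa_flag        TEXT NOT NULL,       -- quality indicator
    method_id      TEXT NOT NULL REFERENCES method(method_id),
    -- one value per measurement event
    PRIMARY KEY (parameter_name, place_name, measured_at)
);
""")
# Made-up sample rows.
conn.execute("INSERT INTO place VALUES ('Site A', 35.93, -84.31)")
conn.execute("INSERT INTO method VALUES ('M1', 'Hypothetical field thermometer')")
row = ("temperature", "Site A", "2001-10-31T12:00:00", 14.5, "good", "M1")
conn.execute("INSERT INTO measurement VALUES (?, ?, ?, ?, ?, ?)", row)

try:
    # A second value for the same event violates the uniqueness rule.
    conn.execute("INSERT INTO measurement VALUES (?, ?, ?, ?, ?, ?)", row)
except sqlite3.IntegrityError as err:
    print("rejected duplicate:", err)
```

Putting the rules in the schema means the database rejects a violation when the data are loaded, not years later when a user finds it.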
53 Best Practices for Preparing Ecological and
Ground-Based Data Sets to Share and Archive
- Best Practices include
- Assign descriptive file names
- Use consistent and stable file formats
- Define the parameters
- Use consistent data organization
- Perform basic quality assurance
- Assign descriptive data set titles
- Provide documentation
- Published: Cook et al. 2001. Ecological
Bulletin
- http://www.daac.ornl.gov/DAAC/bestpractices.html
54 Reflecting Into the Future
55 Workshop Reactions
- Distributed (sensor) processing
- Yes / No
- Automated QA
- Getting data dirty
- Metadata early
- 10X easier, scalable
- Differentiate standards
- Intentional variance only
- Partition / isolate exceptions when possible
- Look for 3, 5, 10X changes
- 20-30 not worthwhile
56 Summary Points
- Archives need structure and standards
- Social and education solutions VERY important
- Metadata are the neurons of Archives
- Metadata early better than late
- Need to think about our choices.
57 Future Thoughts
- Will we be able to know "Where are we?" in the
information structure?
- How many 30 KB files are on a 100 GB tape
cartridge?
- The future limits will not be technology
- But our minds
- We need to plan NOW how to best leverage the
future
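The tape question has a quick answer. Assuming decimal units and ignoring any per-file overhead on the tape (both are simplifications), the count works out as follows; the constant names are illustrative.

```python
TAPE_BYTES = 100 * 10**9   # 100 GB cartridge, decimal units assumed
FILE_BYTES = 30 * 10**3    # 30 KB file, decimal units assumed

# Whole files that fit, ignoring file marks and other tape overhead.
files_per_tape = TAPE_BYTES // FILE_BYTES
print(f"{files_per_tape:,} files per cartridge")
```

That is more than three million files on one cartridge, so the hard part is finding the right file, not holding it.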
58 A Future Scientist's View
- I told my college-age daughter about the Japanese
announcement of 1 TB of optical memory in 1 cubic
centimeter.
- Her reply:
- "We need to know how to think critically and
select what kinds of projects and data we need to
keep because the limiting factor will be our
minds, not the technology."
59 Looking Forward to a Future With Archives!!