Title: EaGLe: Data Archiving and Metadata
R-8286750
Why archive the EaGLe data?
- To ensure its preservation for future generations of scientists
- To ensure it is broadly available for current scientists to use
- To create the broadest possible public benefit from this taxpayer-funded program
- To help EPA retain the data that is collected / created through its funding
- Because we wish that earlier researchers had archived their data for us to use
EaGLe Data Committee Mission Statement
- Develop an information management plan to archive EaGLe data with appropriate metadata so that EPA can make it readily available
- Ensure that data usefulness outlives the EaGLe project (and does not require continued maintenance by EaGLe researchers)
DATA METADATA
1 Types of data
2 What to archive
3 Data objects
4 Data packages
5 What is metadata?
6 Why collect metadata?
7 The "cons" of metadata
8 The "locs" of metadata
9 Sample metadata file 1
10 Sample metadata file 2
11 Getting in gear
12 EaGLe metadata entry
13 Metadata checklist
14 Checklist continued
15 Data file formats
EML and XML
A) Metadata standards
B) What is EML?
C) Ecological metadata
D) What good is EML?
E) EML vocabulary
F) What is XML?
G) What good is XML?
H) Do I have to learn XML?
METADATA RETRIEVAL
IJ) EIMS overview
K) Data flow: You to EIMS
L) EaGLe home page
M-N) Global search
O-S) Metadata report
T-U) Searches
COST-EFFECTIVENESS
1 Seems awfully complicated
2 How much will it cost me?
3 How long does it take?
4 What good are metadata?
5 Who needs metadata?
SECURITY ISSUES
(1) Access controls
(2) Approval process
(3) The "locs" of metadata
MORE INFORMATION
Optional data archival
Do NOT archive
Non-standardized metadata
EaGLe contacts
1 EaGLe Data Types
- Geospatial Imagery
- Genomic
- Remote Sensing
- Biological
- Routine Monitoring
2 What data must be archived?
- All new data created or collected using EaGLe funds
  - Field data
  - Genomics experiments
  - New GIS coverages
  - New remote sensing data
  - Other images, models
- All important summary, supplemental, and explanatory information
  - Journal articles
  - Poster sessions
  - Presentations
  - Rules governing data QC or transforms
  - SOPs, protocols, experimental design documents, QA/QC documents
3 Types of Data Objects
- Literature Objects
  - Journal articles, bibliographies, books, Adobe .pdf files, etc.
- Flat Files
  - Stand-alone tables (e.g., SAS tables), spreadsheet data
- Relational Databases
  - Many normalized tables joined by relational rules
  - Data views, query objects combining bits from separate tables
- Graphical Objects
  - Maps, photos, digital sounds, presentations, Web sites
- Material Objects
  - Soil samples, stained slides, microfiche, posters, video tapes, etc.
4 What is a Data Package?
- Together, electronic data objects and their metadata file constitute a Data Package.
- The metadata file is like the box, inventory tag, and instruction manual
- The data themselves are the content of the package
- Data inventory requires good-quality metadata
- Even material objects can have electronic metadata
5 What's metadata?
- Metadata means "beside the data" or "data about data"
- Metadata files contain summary and reference data about primary data objects
- Any information needed to identify, decode, interpret, track, store, locate, assign ownership of, or control access to a data object
- Everyday examples
  - Library card catalogue, key to map symbols, checkbook register
- Scientific metadata examples
  - Particulate matter instruments: equipment models and settings, detection limits, replication, sample handling details
  - Journal article citation, methods citation
- Sample indented metadata
6 Why Collect Metadata?
- Long-term Storage
  - Keep EaGLe data safely banked for future reuse
  - Support long-term data tracking and retrieval
- Data Broadcasting
  - Publish metadata via the Environmental Research and Science Library (ERSL) public interface
- Foster collaborative and cross-cutting research
  - Meta-analyses made possible: small dataset mergers
  - Cross-regional data, cross-media data
  - Longitudinal time-series analyses: data recombining
7 The "cons" of metadata
- Content: What is in the data object?
  - Data descriptions, citation info, electronic file formats
- Contacts: Who owns the data?
  - Authors, contact person, organization
- Context: What is the provenance of the data?
  - Applicable knowledge areas, methods, project origins, etc.
8 The "locs" of metadata
- Location
  - Where is the electronic file located?
  - What is the geographic coverage of the data object?
- Locks
  - Final version (protected against inadvertent updates)
  - Viewing access controls
  - Editing/downloading access controls
  - Release date, expiration date
9 Sample Indented Metadata File
10 Sample 2 Indented Metadata
- Switch to Normal view
- Click on icon
- Press Page Down key to view PDF
- When finished, press ESC key to restore Normal view
- Use slide show icon to resume
11 Getting in Gear
- Feb. 1, 2004: Begin metadata creation.
- Summer 2004: Begin EaGLe data uploading.
- Jan. 2005: EaGLe metadata completed.
- End of no-cost extensions (early 2006): Most EaGLe datasets archived but password-protected.
- Jan. 2008: Most EaGLe data released to the public.
12 Metadata Creation / Data Uploading
- Metadata Entry Form (MEF)
  - Generates an EML-compliant metadata file in XML format
  - Automatic upload to ERSL
- Data packages stored in EIMS repository (ERSL backend)
- EaGLe Portal: intranet interface for grantees
- Review, Approval, and Release Processes
- Post-Release Search, Store and Update
  - Searchable metadata records in one area of EIMS/ERSL
  - Actual datasets stored in EIMS/ERSL repository
13 Metadata Checklist
- General Information
  - Data set title
  - Point of contact
  - Time period of the information contained in the dataset
  - Abstract (brief description) of the dataset
  - Geographic coverage of the dataset
  - Data format (e.g., shape-file, coverage, spreadsheet)
- Dataset Creation
  - Formal authors
  - Others who contributed
  - Research objectives for dataset
  - Common misinterpretations of the data, if any
14 Metadata Checklist (continued)
- Dataset Contents
  - Was a georeferencing system used? If so, what is it?
  - What does each dataset record describe?
  - What are the attributes that describe these features?
  - Define each attribute and provide measurement units. Also provide resolution and estimated accuracy, if possible
  - Define or reference coded attributes (e.g., FIPS codes, error codes)
- Dataset Processes
  - Citation of source of original data, if applicable (e.g., GIS data)
  - Types of major data processing steps
  - Detailed methodology of data collection, including study designs, protocols, equipment, analyses, etc., and any changes in data collection procedures during the study
  - Record any QA tests performed and their results
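The two checklist slides can be read as a list of required fields. The sketch below is one way a draft record might be checked against that list before entry. It is illustrative only: the field names (`point_of_contact`, `record_description`, and so on), the choice of which items count as required, and the `draft` record are assumptions made for this example, not the Metadata Entry Form's actual schema.

```python
# Field names below are hypothetical labels for the checklist items;
# they are not an official EaGLe, MEF, or EML schema.
REQUIRED_FIELDS = {
    "general": ["title", "point_of_contact", "time_period", "abstract",
                "geographic_coverage", "data_format"],
    "creation": ["authors", "research_objectives"],
    "contents": ["record_description", "attributes"],
    "processes": ["processing_steps", "methodology"],
}


def missing_fields(record):
    """Return 'section.field' names that are absent or empty in the record."""
    missing = []
    for section, fields in REQUIRED_FIELDS.items():
        section_data = record.get(section, {})
        for name in fields:
            if not section_data.get(name):
                missing.append(f"{section}.{name}")
    return missing


def incomplete_attributes(record):
    """Return names of attributes lacking a definition or measurement units."""
    attributes = record.get("contents", {}).get("attributes", [])
    return [attr.get("name", "<unnamed>") for attr in attributes
            if not attr.get("definition") or not attr.get("units")]


# Invented example record, for illustration only.
draft = {
    "general": {"title": "Example wetland survey", "abstract": "Placeholder text."},
    "contents": {
        "record_description": "One row per site visit",
        "attributes": [
            {"name": "water_temp", "definition": "Water temperature", "units": "degC"},
            {"name": "site_code", "definition": "Coded site identifier"},
        ],
    },
}

print(missing_fields(draft))         # checklist items still to be filled in
print(incomplete_attributes(draft))  # attributes missing a definition or units
```

Running a check like this each field season fits the deck's suggestion to use metadata collection as an ongoing quality-control step.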
15 Data File Formats
Acceptable
- Files converted into character-delimited ASCII files (e.g., comma-delimited .csv files)
- jpeg, jpg, tiff, gif, img, png, geo-tiff, ecw, ArcView, simple html or htm, xml, LaTeX, TeX, pdf (method files)
- Programs in programming language (must have text support).
Unacceptable
- Excel spreadsheets (convert to .csv)
- Presentation files such as PowerPoint (convert to .pdf)
- Word-processing files (convert to ASCII)
- Proprietary files
- RTF files
- Special characters (Greek letters and other symbols not found in ASCII)
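As a rough illustration of the format rules on this slide, the sketch below sorts file names by extension and flags non-ASCII characters. The mapping from file types to extensions (`.xls` for Excel, `.ppt` for PowerPoint, `.doc` for word-processing files, `.txt` for ASCII text) is an assumption made for the example, as are the function names. ArcView and geo-tiff are left out because the slide does not tie them to a single extension.

```python
from pathlib import Path

# Extensions taken from the slide's "acceptable" list. ".txt" is an assumed
# stand-in for character-delimited ASCII files.
ACCEPTABLE = {".csv", ".txt", ".jpeg", ".jpg", ".tiff", ".gif", ".img", ".png",
              ".ecw", ".html", ".htm", ".xml", ".tex", ".pdf"}

# Assumed extensions for the file types the slide says to convert.
CONVERT_TO = {".xls": ".csv", ".ppt": ".pdf", ".doc": "ASCII text"}


def classify(filename):
    """Return a short verdict for a file name based on its extension."""
    ext = Path(filename).suffix.lower()
    if ext in ACCEPTABLE:
        return "acceptable"
    if ext in CONVERT_TO:
        return f"convert to {CONVERT_TO[ext]}"
    return "not on the acceptable list"


def non_ascii_characters(text):
    """Return the distinct characters in text that fall outside ASCII."""
    return sorted({ch for ch in text if ord(ch) > 127})


for name in ["sites.csv", "budget.xls", "talk.ppt", "notes.rtf"]:
    print(name, "->", classify(name))
```

The `non_ascii_characters` helper can be run over column headers or unit labels to catch Greek letters and similar symbols before a file is submitted.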
A) Standards for Metadata Creation
- FGDC Content Standard for Digital Geospatial Metadata: http://www.fgdc.gov/metadata/contstan.html, http://www.fgdc.gov/metadata/metadata.html
- National Biological Information Infrastructure: http://www.nbii.gov/
- Ecological Metadata Language: http://knb.ecoinformatics.org/software/eml
- Knowledge Network for Biocomplexity (MORPHO): http://knb.ecoinformatics.org/
- Dublin Core Metadata Element Set: www.dublincore.org
- Encoded Archival Description (EAD): http://www.loc.gov/ead/
- Data Documentation Initiative: http://www.icpsr.umich.edu/DDI/
B) So, what's EML?
- Ecological Metadata Language
- A metadata standard designed to handle cross-disciplinary research
- A wrapper that holds metadata for many different types of primary data (geospatial, biological, genomic, etc.)
- Widely accepted standard in the ecological communities of interest.
- A container that meshes with other types of metadata standards
- A metadata standard based on XML vocabulary.
- An information tree that can graft on new branches of knowledge when they become necessary to the knowledge community
C) EML Standard for Ecological Metadata
- Core Definitions and units of the columns (fields or attributes) in all data tables
- Methods, procedures, and protocols
- Research questions and hypotheses
- Site selection
- Authors, contacts, and proper citation for use
- Sampling extent: spatial, biological, temporal
- Sample Indented Metadata
D) What good is EML?
- Ease of data interchange with other scientists
- Enhances precision in data documentation
- Forces clarity in defining measurement units
- Missing-data codes, other interpretative codes
- Enforces data access rules
- Improves rapid search capability
E) EML Specialty Terms
Common usage | EML term
Field, independent variable, column name, header | Attribute
Abstract, Brief, Executive Summary | Abstract
Project Officer, Primary investigator | Party
F) What is XML?
- eXtensible Markup Language
- A subset of Standard Generalized Markup Language (SGML)
- A method for marking up plain text
  - To distinguish clearly between the content (text) and the document structure (title, paragraph, line, etc.)
  - Note: Textual attributes (bold, large, italic, etc.) are NOT included.
  - To make electronic documents readily machine-readable
- Makes document structures explicit and modular
- Permits easy transformations between document formats
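A small sketch may make the content/structure distinction concrete. It uses Python's standard `xml.etree.ElementTree` module on an invented three-element document; the tag names and text are made up for the example and are not EML.

```python
import xml.etree.ElementTree as ET

# Invented sample document: tags carry the structure, text carries the content.
DOCUMENT = """\
<article>
  <title>Water temperature survey</title>
  <paragraph>Samples were collected weekly.</paragraph>
  <paragraph>Probes were calibrated before each visit.</paragraph>
</article>
"""

root = ET.fromstring(DOCUMENT)

# Structure: the element names, in document order.
structure = [child.tag for child in root]

# Content: the text held inside each element.
content = [child.text for child in root]

print(structure)
print(content)

# Because structure is explicit, a transformation to another format is a
# simple walk over the elements (here: plain text with the title in capitals).
plain_text = "\n".join(
    child.text.upper() if child.tag == "title" else child.text
    for child in root
)
print(plain_text)
```

Note that nothing in the markup says how the title should look. Appearance is decided by whatever program transforms the document, which is the point of the "textual attributes are NOT included" bullet.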
G) What good is XML?
- Allows document contents to be re-used in new ways
- Allows document elements to be stored just like tables of numerical data
- Enforces precise translation of document look and feel from one presentation mode (hard-copy) to another (web)
- Transparency of markup to future readers
- Can accommodate new kinds of text markup at need (audio tags, motion tags, etc.)
- Converts information to platform- and software-independent formats to maximize long-term utility
H) Do I have to learn XML?
- NO!
- The Metadata Entry Form automatically creates a valid XML document
- Data entered into the form automatically follows the EML constraints on mandatory inclusion of metadata elements
- Only system administrators and metadata librarians need XML expertise
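To give a feel for what a form-to-XML step involves, here is a minimal sketch, not the actual Metadata Entry Form. It assumes a hypothetical set of mandatory fields and writes them as flat child elements of a `dataset` element. Real EML nests these elements more deeply and is validated against its own schema, so the element layout here should be read as illustrative only.

```python
import xml.etree.ElementTree as ET

# Hypothetical mandatory fields for this sketch; the real EML rules differ.
MANDATORY = ("title", "creator", "abstract", "contact")


def form_to_xml(form):
    """Turn a dict of form entries into an XML string, refusing incomplete input."""
    missing = [name for name in MANDATORY if not form.get(name)]
    if missing:
        raise ValueError("missing mandatory metadata: " + ", ".join(missing))
    dataset = ET.Element("dataset")
    for name in MANDATORY:
        ET.SubElement(dataset, name).text = form[name]
    return ET.tostring(dataset, encoding="unicode")


# Invented form entries, for illustration only.
entries = {
    "title": "Example coastal wetland survey",
    "creator": "A. Researcher",
    "abstract": "Placeholder abstract for a demonstration record.",
    "contact": "researcher@example.org",
}

print(form_to_xml(entries))
```

The person filling in the form only supplies the dictionary values. The XML syntax, escaping, and the mandatory-element check are handled by the code, which is the division of labor the slide describes.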
IJ) EIMS overview
K) Data Flow: From You to EIMS and Back
EaGLe
Metadata entry into existing EaGLe system
Data load into EIMS
EIMS
Data update / retrieval from EaGLe intranet portal into EIMS
L) EaGLe Prototype Home Page
M) EaGLe Prototype Global Search
N) EaGLe Prototype Search Results
O) EaGLe Metadata Report
Links to headers in the Metadata Report
P) EaGLe Metadata Report (continued)
Q) EaGLe Metadata Report (continued)
R) EaGLe Metadata Report (continued)
S) EaGLe Prototype Simple Search
T) EaGLe Prototype Advanced Search
U) EaGLe Prototype Advanced Search (continued)
Optional Data Archival
- Historical data owned by EaGLe researchers
- Data used strictly for QA/QC
  - e.g., temperature of experimental tanks
- Work that produced no analyzable data
- Qualitative reports
- Pilot data
Do NOT Archive
- Data not owned by EaGLe researchers
- Data already archived elsewhere
  - e.g., many GIS coverages
- Dirty data
  - Sans quality controls
  - Containing many missing values, duplicates, etc.
Non-standardized metadata
- Field notes
- Marginalia
- Large object free text fields
- Index cards
- Voice recordings
- Personal communications
- Mental notes (non-transcribed knowledge)
Who is working on EaGLe data archiving?
- EaGLe data committee (EDC)
  - Valerie Brady (chair)
  - Terry Brown (GLEI)
  - Peter Noble (CEER-GOM)
  - Lexia Valdes (ACE INC)
  - Webb Sprague (PEEIR)
  - Chris Pfeiffer (ASC)
- Environmental Information Management System (EIMS)
  - John Sykes (USEPA EIMS)
- Computer Sciences Corporation (CSC)
  - Derek Lane
  - Susan Eversole
  - Steve Walata III
  - Geoff Blair
  - Wally Schwab
  - And others
1 Seems awfully complicated
- ...but it's easier than statistics
- No need to learn the whole of EML to use the relevant bits
- No more complicated than programming a VCR
  - Time, Date, Channel, Skip commercials
- Similar to writing a journal article
  - Abstract, Background, Protocol,
  - Methods, Analysis, Discussion, Results,
  - Caveats, Secondary analysis potential
  - Author Names, Affiliations, Bibliography
- The EaGLe MEF or Morpho user interfaces allow production of the most useful metadata
2 How much does it cost to collect metadata?
- Estimate the value of your research results
  - Total amount of research grant(s) plus 15% added value
  - Divide by number of years project is funded
- Allocate 10% of resulting $/efforts to metadata collection
- Distribute amounts evenly over years; don't stint!
- Collecting metadata at the beginning of a study captures important data decisions and research design elements
- Use metadata collection as an ad hoc method of data quality control during each year of the study.
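The budgeting rule on this slide is simple arithmetic, sketched below. The sketch reads the slide's bare figures "15" and "10" as percentages (15% added value, 10% allocated to metadata). That reading is an assumption, since the percent signs do not appear in the extracted text. The function name and the example grant amount are invented.

```python
def metadata_budget_per_year(grant_total, years_funded,
                             added_value_rate=0.15, metadata_share=0.10):
    """Estimate a yearly metadata budget from a grant total.

    Steps, following the slide:
      1. research value = grant total plus added value (assumed 15%)
      2. divide by the number of funded years
      3. allocate a share (assumed 10%) of each year's value to metadata
    """
    if years_funded <= 0:
        raise ValueError("years_funded must be positive")
    research_value = grant_total * (1 + added_value_rate)
    value_per_year = research_value / years_funded
    return value_per_year * metadata_share


# Invented example: a 1,000,000 grant funded over 4 years.
yearly = metadata_budget_per_year(1_000_000, 4)
print(f"Metadata budget per year: {yearly:,.2f}")
```

Because the amount is spread evenly, the same figure applies to every funded year, including the first, when design decisions are easiest to capture.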
3 How much time is this going to take?
- Between 8 and 40 hours per data group
  - All similar data bundled together: not a per-dataset cost!
  - More complex datasets take more time
  - Loading or linking to pre-written material can save time
- Training for use of Metadata Entry Form
  - One-time 3-hour training session
  - Minimum 3 hours hands-on practice
  - Availability of live help during first solo MEF work
4 What Good are Metadata?
- High-quality metadata serve 5 purposes
- Data integrity maintenance over the long term (20-year rule)
  - Across expected changes in data storage technology, compression, etc.
- Tracking, searching for, and retrieving datasets
  - Like a library card catalogue: where to find data, where to shelve it.
- Scientific collaboration
  - Joint analysis and secondary analysis potential
- Cathedral effect
  - Pooling data across regions contributes to an environmental big picture
  - Longitudinal studies: building science efforts upon a shared data foundation.
- Economical
  - Extending the shelf life of data gives taxpayers more return on investment
5 Who needs the EaGLe metadata?
Other scientists
Today's colleagues, scientific collaborators
Tomorrow's meta-analysts
The next generation
Archivists
Data Librarians
The Public
Data Exchange Tools (CDX)
Citizens and Citizen Groups
Legislators and other decision-makers
1) Data Access and Security
- Only registered users may enter or edit a metadata record
  - Record-level edit permissions required for input and update
- Only registered Data Librarians can release records to a designated user base (Public, EPA Only, Group, Owner)
- Confidential records can be restricted to a subset of users
  - EPA Only: accessible only to EPA registered users
  - Group: accessible only to members of a specified group of users (including system users outside the EPA firewall, if necessary)
  - Owner: accessible only by the designated owner of the EIMS record
- Post-release, any internet user may view metadata records.
- Separate access controls for actual datasets
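The four release levels named on this slide (Public, EPA Only, Group, Owner) can be sketched as a simple viewing check. This is a hypothetical model for illustration, not the EIMS implementation. The `User` and `Record` classes, their fields, and the rule that each level is checked independently are assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Access(Enum):
    PUBLIC = "Public"
    EPA_ONLY = "EPA Only"
    GROUP = "Group"
    OWNER = "Owner"


@dataclass
class User:
    name: str
    is_epa: bool = False
    groups: frozenset = frozenset()


@dataclass
class Record:
    owner: str
    access: Access
    group: Optional[str] = None  # only meaningful when access is GROUP


def can_view(user, record):
    """Return True if the user (or None for an anonymous visitor) may view the record."""
    if record.access is Access.PUBLIC:
        return True
    if user is None:
        return False
    if record.access is Access.EPA_ONLY:
        return user.is_epa
    if record.access is Access.GROUP:
        return record.group in user.groups
    return user.name == record.owner  # Access.OWNER


# Invented users and records, for illustration only.
grantee = User("grantee", groups=frozenset({"example-group"}))
epa_staff = User("epa_staff", is_epa=True)
group_record = Record(owner="grantee", access=Access.GROUP, group="example-group")

print(can_view(grantee, group_record))    # member of the record's group
print(can_view(epa_staff, group_record))  # EPA user outside the group
print(can_view(None, Record("grantee", Access.PUBLIC)))
```

Since the slide notes that datasets have separate access controls, a fuller model would keep one `Access` setting for the metadata record and another for the data files it describes.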
Generations of Research
- For a true confluence of research efforts, clarity in metadata is the key