Title: Three stories
1. Three stories
- Astronomy
- Medical Imaging
- Genetics
2. Astronomy Data Growth
- From glass plates to CCDs
- detectors follow Moore's law
- The result: a data tsunami
- available data doubles every two years
- Telescope growth
- 30X glass (concentration)
- 3000X in pixels (resolution)
- Single images
- 16Kx16K pixels
- Large Synoptic Survey Telescope
- wide field imaging at 5 terabytes/night
[Figure legend: 3 M telescopes area (m2); CCD area (mpixels)]
Source: Alex Szalay/Jim Gray
3. Large Synoptic Survey Telescope (LSST)
- Top project of the astronomy decadal survey
- Celestial cinematography
- 2 gigapixel detector for wide field imaging
- Science
- beyond the standard model
- non-baryonic dark matter
- non-zero Λ (cosmological constant) and neutrino oscillations
- observation targets
- near Earth object survey
- weak lensing of wide fields
- supernovae measurements
- Features
- 7 square degree field/6.9 meter effective
aperture
- > 5 TB of data/night from a mountain in Chile
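To put the nightly figure in context, the arithmetic below estimates how many observing nights it takes to accumulate a petabyte at the quoted rate. This is a minimal sketch, not a statement about LSST operations: the 5 TB/night value is the slide's lower bound, decimal units (1 PB = 1000 TB) are assumed, and the function name is illustrative.

```python
import math

TB_PER_PB = 1000  # assumption: decimal units, 1 petabyte = 1000 terabytes


def nights_to_reach(target_pb: float, tb_per_night: float = 5.0) -> int:
    """Whole number of nights needed to accumulate target_pb petabytes."""
    if tb_per_night <= 0:
        raise ValueError("tb_per_night must be positive")
    return math.ceil(target_pb * TB_PER_PB / tb_per_night)


print(nights_to_reach(1.0))  # 200
```

At the quoted minimum rate, a petabyte accumulates in roughly 200 nights of observing, which is why a single instrument of this class can routinely produce petabyte-scale archives.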
4. Distributed Virtual Astronomy
- Capabilities
- homogeneous, multi-wavelength data
- observations of millions of objects
- mega-sky surveys (2MASS, SLOAN, ...)
- Initiatives
- U.S. National Virtual Observatory (NVO)
- Caltech, JHU, ALMA, HST, ...
- EU Astrophysical Virtual Observatory (AVO)
- ESO, CNRS, CDS, ...
- Grid data mining and archives
- discovering significant patterns
- analysis of rich image/catalog databases
- understanding complex astrophysical systems
- integrated data/large numerical simulations
[Figure: HST Data Access]
5. Biomedical Imaging Challenges
Source: Chris Johnson, Utah and Art Toga, UCLA
6. Medical Imaging Needs
- Many imaging modalities in medicine. Most are
based on raster scanning (the scanned image is
represented as a pixel matrix).
- Most are 2D slices, some are time series
(videos).
- Radiology example: a multi-slice CT scanner.
Each slice is 512x512 pixels, with up to a
thousand slices in a study (0.5 gigabytes per
study).
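The 0.5 gigabyte figure can be checked with a short calculation. The sketch below assumes uncompressed pixel data stored at 2 bytes (16 bits) per pixel; the slide does not state the bit depth, so that value and the function name are assumptions for illustration.

```python
def ct_study_bytes(width: int = 512, height: int = 512,
                   slices: int = 1000, bytes_per_pixel: int = 2) -> int:
    """Uncompressed size of a CT study in bytes (pixel data only, no headers)."""
    return width * height * slices * bytes_per_pixel


size = ct_study_bytes()
print(size)                  # 524288000
print(round(size / 1e9, 2))  # 0.52 (decimal gigabytes)
```

Under these assumptions a thousand-slice study comes to about 0.52 GB, consistent with the slide's estimate. Time series or higher-resolution modalities scale the same product by additional factors.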
7. Genetics
- Genetic sequences are the very simplest
representation. There are many more complex ones,
including images.
8. Data Heterogeneity and Complexity
Genomic, proteomic, transcriptomic, metabolomic,
protein-protein interactions, regulatory
bio-networks, alignments, disease, patterns and
motifs, protein structure, protein
classifications, specialist proteins (enzymes,
receptors), ...
Proteome
Source: Carole Goble (Manchester)
9. Gene Expression and Microarrays
- Concurrent evaluation
- expression levels for thousands of genes
- Photolithography
- up to 500K 10-20 micron cells
- each containing millions of identical DNA
molecules
- Image capture and analysis
- laser scanning and intensity calculation
Source: Affymetrix
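The cell counts and sizes above imply a physically small device. As a rough sketch, the calculation below assumes square cells tiled in a square grid with no gaps between them; that layout and the function name are simplifying assumptions, not details from the slide.

```python
import math


def array_side_cm(cells: int, cell_size_um: float) -> float:
    """Side length (cm) of a square grid of `cells` square cells with no gaps."""
    area_um2 = cells * cell_size_um ** 2
    return math.sqrt(area_um2) * 1e-4  # 1 micron = 1e-4 cm


for size_um in (10, 20):
    print(size_um, round(array_side_cm(500_000, size_um), 2))
# 10 0.71
# 20 1.41
```

Under these assumptions, 500K cells of 10-20 microns fit in a square roughly 0.7 to 1.4 cm on a side.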
10. Quantitative Begets Qualitative Change
11. Why is it important to capture?
- Previous research was documented in scientific
literature and books. Increasingly, though, our
theories and methods are based on empirical
measurements: data gathered or sampled from our
environment. Without a preserved record of this
data, we cannot verify previous work or build on
existing work.
12. Memex Still Prescient
- Consider a future device for individual use,
which is a sort of mechanized private file and
library. It needs a name, and to coin one at
random, memex will do. A memex is a device in
which an individual stores all his books,
records, and communications, and which is
mechanized so that it may be consulted with
exceeding speed and flexibility. It is an
enlarged intimate supplement to his memory.
- Vannevar Bush, As We May Think, 1945
13. Human-Computer Symbiosis
- It seems reasonable to envision, for a time 10 or
15 years hence, a 'thinking center' that will
incorporate the functions of present-day
libraries together with anticipated advances in
information storage and retrieval.
- The picture readily enlarges itself into a
network of such centers, connected to one another
by wide-band communication lines and to
individual users by leased-wire services. In such
a system, the speed of the computers would be
balanced, and the cost of the gigantic memories
and the sophisticated programs would be divided
by the number of users.
- J.C.R. Licklider, 1960
14. 21st Century Challenges
- The threefold way
- theory and scholarship
- experiment and measurement
- computation and analysis
- Supported by
- distributed, multidisciplinary teams
- multimodal collaboration systems
- distributed, large scale data sources
- leading edge computing systems
- distributed experimental facilities
- Socialization and community
- multidisciplinary groups
- geographic distribution
- new enabling technologies
- creation of 21st century IT infrastructure
- sustainable, multidisciplinary communities
- "Come as you are" response
[Figure labels: Computation, Experiment, Theory]
15. Example: Linking Genotype and Phenotype to Study Diseases
[Figure labels: Phenotype 1, Phenotype 2, Phenotype 3, Phenotype 4; Ethnicity, Environment, Age, Gender; Identify Genes; Pharmacokinetics, Metabolism, Endocrine, Biomarker Signatures, Physiology, Proteome, Transcriptome, Immune, Morphometrics; Predictive Disease Susceptibility]
Source: Terry Magnuson
16. Challenges
- Technical
- Data storage, computing power, data ingest.
- Knowledge (scholarly communications)
- How do we share information (different terms,
different languages)?
- How do we preserve?
- Medical imaging formats last 10-20 years (larger
capital investment, clinical care)
- Commercial sequencer data format no longer exists
2 years after product introduced (rapid
technology changes, research-only settings)
17. Technical Data Challenges
- Multitudes of input sources (data being
generated)
- Computing power requirements
- Storage requirements
18. The Data Tsunami
- Many sources
- agricultural
- biomedical
- environmental
- engineering
- manufacturing
- financial
- social and policy
- historical
- Many causes and enablers
- increased detector resolution
- increased storage capability
- The challenge: extracting insight!
[Figure annotation: We Are Here!]
19. Sensor Data Overload
Source: Robert Morris, IBM
20. Computing History
- 1890-1945
- mechanical, relay
- 7 year doubling
- 1945-1985
- tube, transistor, ...
- 2.3 year doubling
- 1985-2003
- microprocessor
- 1 year doubling
- Exponentials
- chip transistor density 2X in 18 months
- WAN bandwidth 64X in two years
- storage 7X in two years
- graphics 100X in three years
[Figure labels: Microcomputer Revolution; 4K bit core plane]
Source: Jim Gray
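The exponentials above are quoted over different time spans, which makes them hard to compare directly. One way to put them on a common scale is to convert each "N-fold in T years" figure into an implied doubling time, T / log2(N). The sketch below assumes steady exponential growth over each quoted interval; the function and dictionary names are illustrative.

```python
import math


def doubling_time_years(factor: float, years: float) -> float:
    """Doubling time implied by growing `factor`-fold over `years` years."""
    if factor <= 1 or years <= 0:
        raise ValueError("factor must exceed 1 and years must be positive")
    return years / math.log2(factor)


# (growth factor, interval in years) as quoted on the slide
trends = {
    "chip transistor density": (2, 1.5),
    "WAN bandwidth": (64, 2),
    "storage": (7, 2),
    "graphics": (100, 3),
}
for name, (factor, years) in trends.items():
    months = 12 * doubling_time_years(factor, years)
    print(f"{name}: doubles about every {months:.1f} months")
```

On this scale the quoted figures correspond to doubling roughly every 18 months for transistor density, 4 months for WAN bandwidth, 8.5 months for storage, and 5.4 months for graphics.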
21. Computing Power Trends
http://www.transhumanist.com/volume1/moravec.htm
22. Storage: Qualitative Change
[Figure labels: 5 MB in 1956; 1972; 80 GB in 2004]
23. Storage in practical terms
- Megabyte
- a small novel
- Gigabyte
- a pickup truck filled with paper or a DVD
- Terabyte: one thousand gigabytes ($1000 today)
- the text in one million books
- entire U.S. Library of Congress is ten terabytes
of text
- Petabyte: one thousand terabytes
- 1-2 petabytes equals all academic research
library holdings
- coming soon to a pocket near you!
- soon routinely generated annually by many
scientific instruments
- Exabyte: one thousand petabytes
- 5 exabytes of words spoken in the history of
humanity
- See www.sims.berkeley.edu/research/projects/how-much-info-2003/
Source: Hal Varian, UC-Berkeley
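The unit ladder above uses steps of one thousand. A small helper makes it easy to express raw byte counts in those terms; this is a minimal sketch that assumes decimal (power-of-1000) units as on the slide, and the names `UNITS` and `human_size` are illustrative.

```python
UNITS = ["bytes", "kilobytes", "megabytes", "gigabytes",
         "terabytes", "petabytes", "exabytes"]


def human_size(n_bytes: float) -> str:
    """Format a byte count using decimal (powers of 1000) units."""
    value = float(n_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value is below 1000,
        # or at the largest unit we know about.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000


print(human_size(10 * 1000**4))  # 10 terabytes
print(human_size(5 * 1000**6))   # 5 exabytes
```

The two calls correspond to the Library of Congress text estimate and the estimate for all words spoken, respectively.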
24. Knowledge preservation requires standards
- Storage formats
- Media (CD-ROM, DVD, tapes)
- File formats (PDF, JPEG, MPEG)
- Standards that define meaning for a particular
domain (metadata, controlled vocabularies,
taxonomies). Examples from the medical and
biological sciences: MeSH, DICOM, MIAME, GO,
caBIG.
25. Who pursues standards?
- Users (scientists) (GO, multitudes of
domain-specific examples)
- Manufacturers (companies making products)
- Storage media (CD-ROM, DVD, DVD 2nd generation not
quite?)
- Knowledge standards not so frequent, more often
in conjunction with push from user community
(DICOM, MIAME)
- Government (MeSH, GenBank, caBIG)
26. What role does/should the government play?
- Has developed standards in areas where
significant support was provided (medicine and
science via NLM, e.g. Medline, MeSH, UMLS,
GenBank, Entrez, etc.), and future ones (NIH
caBIG, etc.).
- Successful with high-cost shared resources
(colliders (CERN), astronomy telescopes, XXX).
- Many other areas not addressed, though (?).
27. Critical Steps to Success?
- Standards for long-term preservation of important
descriptive information should be developed by the
community. Not the details of a lab machine's
format, but a requirement that generated data
follow national standards (controlled vocabularies).
- Central repositories (federated OK) are set up to
store, preserve and provide access.
- Bring about usage by
- mandates from funding agencies (NIH, NSF)
- requirement for publication (GenBank for
sequences).
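As an illustration of what "data follows a controlled vocabulary" could mean in practice, a repository might check each deposited record's descriptive fields before accepting it. The sketch below is hypothetical throughout: the field names, the tiny vocabulary, and `validate_record` are invented for illustration and do not represent any particular standard or repository.

```python
# Hypothetical controlled vocabulary; a real one would come from a
# community standard rather than being hard-coded here.
ALLOWED_MODALITIES = {"CT", "MR", "US"}
# Hypothetical descriptive fields every deposit must carry.
REQUIRED_FIELDS = ("modality", "acquired_on", "depositor")


def validate_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record is acceptable."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS
                if name not in record]
    modality = record.get("modality")
    if modality is not None and modality not in ALLOWED_MODALITIES:
        problems.append(f"modality not in controlled vocabulary: {modality}")
    return problems


good = {"modality": "CT", "acquired_on": "2005-03-01", "depositor": "lab-a"}
bad = {"modality": "cat scan", "depositor": "lab-b"}
print(validate_record(good))  # []
print(validate_record(bad))
```

The check deliberately ignores how the instrument stored its raw data. It only asks that the description attached to a deposit uses agreed terms, which is the distinction the slide draws.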
29. Challenges
- Lack of an overall semantically interoperable
framework for multidisciplinary research.
- Motivation for research labs to adhere to
standards, especially when storing and describing
data (example: UNC stories).
- Ownership, provenance and context
- Privacy and security (IRB approval for future
research)
- Indexing, data mining.
30. University Data Challenges
- Multiple cultures
- arts, humanities and social sciences
- sciences and engineering
- Many scholarly communication approaches
- books, monographs, journals, conferences
- access time, priority and intellectual property
- multiple media and expression
- text, audio, video, artifacts, performances, ...
- primary and secondary source materials
- professional societies and private publishers
- Institutional repositories
- multiple visions and roles
- digital archives and/or alternative publication
venues
- research and education
- access modes and goals, not just articles or
books
- longitudinal access and lifelong learning
- what and how much to save
- declining cost of storage and simplicity of
deposit
31. PITAC Report Contents
- Computational Science: Ensuring America's
Competitiveness
- A Wake-up Call: The Challenges to U.S.
Preeminence and Competitiveness
- Medieval or Modern? Research and Education
Structures for the 21st Century
- Multi-decade Roadmap for Computational Science
- Sustained Infrastructure for Discovery and
Competitiveness
- Research and Development Challenges
- Two key appendices
- Examples of Computational Science at Work
- Computational Science Warnings: A Message Rarely
Heeded
- Available at www.nitrd.gov
32. PITAC Recommendation
- The Federal government must implement
coordinated, long-term computational science
programs that include funding for interconnecting
the software sustainability centers, national
data and software repositories, and national
high-end leadership centers with the researchers
who use those resources, forming a balanced,
coherent system that also includes regional and
local resources.
- Such funding methods are customary practice in
research communities that use scientific
instruments such as light sources and telescopes,
increasingly in data-centered communities such as
those that use the genome database, and in the
national defense sector.