Three stories - PowerPoint PPT Presentation

Slides: 33
Provided by: ncsa

Transcript and Presenter's Notes

Title: Three stories


1
Three stories
  • Astronomy
  • Medical Imaging
  • Genetics

2
Astronomy Data Growth
  • From glass plates to CCDs
  • detectors follow Moore's law
  • The result: a data tsunami
  • available data doubles every two years
  • Telescope growth
  • 30X glass (concentration)
  • 3000X in pixels (resolution)
  • Single images
  • 16Kx16K pixels
  • Large Synoptic Survey Telescope
  • wide field imaging at 5 terabytes/night

[Figure: 3 M telescopes area (m2) and CCD area (mpixels). Source: Alex Szalay/Jim Gray]
3
Large Synoptic Survey Telescope (LSST)
  • Top project of the astronomy decadal survey
  • Celestial cinematography
  • 2 gigapixel detector for wide field imaging
  • Science
  • beyond the standard model
  • non-baryonic dark matter
  • non-zero Λ and neutrino oscillations
  • observation targets
  • near Earth object survey
  • weak lensing of wide fields
  • supernovae measurements
  • Features
  • 7 square degree field/6.9 meter effective
    aperture
  • > 5 TB of data/night from a mountain in Chile

4
Distributed Virtual Astronomy
  • Capabilities
  • homogeneous, multi-wavelength data
  • observations of millions of objects
  • mega-sky surveys (2MASS, SLOAN, ...)
  • Initiatives
  • U.S. National Virtual Observatory (NVO)
  • Caltech, JHU, ALMA, HST, ...
  • EU Astrophysical Virtual Observatory (AVO)
  • ESO, CNRS, CDS, ...
  • Grid data mining and archives
  • discovering significant patterns
  • analysis of rich image/catalog databases
  • understanding complex astrophysical systems
  • integrated data/large numerical simulations

HST Data Access
5
Biomedical Imaging Challenges
Source: Chris Johnson, Utah and Art Toga, UCLA
6
Medical Imaging Needs
  • Many imaging modalities in medicine. Most are
    based on raster scanning (a pixel matrix
    represents the scanned image).
  • Most are 2D slices, some are time series
    (videos).
  • Radiology example: a multi-slice CT scanner.
    Each slice is 512x512 pixels, with up to a
    thousand slices in a study (0.5 gigabytes per
    study).
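
The 0.5 gigabyte figure can be checked with a little arithmetic. The sketch below assumes 2 bytes (16 bits) per pixel and ignores file headers and compression; the slide does not state the pixel depth, so that value is an assumption, and the function name is purely illustrative.

```python
def study_size_bytes(rows, cols, slices, bytes_per_pixel=2):
    """Uncompressed size of an image stack, ignoring headers and metadata.

    bytes_per_pixel=2 is an assumption (16-bit pixels), not a value
    taken from the slide.
    """
    return rows * cols * slices * bytes_per_pixel


# A 512x512 slice, one thousand slices per study.
size = study_size_bytes(512, 512, 1000)
print(f"{size / 1e9:.2f} GB per study")
```

Under that assumption a thousand-slice study comes to roughly half a gigabyte, in line with the slide.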

7
Genetics
  • Genetic sequences are the very simplest
    representation. Many more complex ones exist,
    including images.

8
Data Heterogeneity and Complexity
Genomic, proteomic, transcriptomic, metabolomic,
protein-protein interactions, regulatory
bio-networks, alignments, disease, patterns and
motifs, protein structure, protein
classifications, specialist proteins (enzymes,
receptors), ...
Proteome
Source: Carole Goble (Manchester)
9
Gene Expression and Microarrays
  • Concurrent evaluation
  • expression levels for thousands of genes
  • Photolithography
  • up to 500K 10-20 micron cells
  • each containing millions of identical DNA
    molecules
  • Image capture and analysis
  • laser scanning and intensity calculation

Source: Affymetrix
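
As a rough illustration of the "intensity calculation" step, the sketch below averages pixel values over a regular grid of square cells in a scanned image. It is a toy model only: the grid layout, the cell size, and the function name are assumptions made for illustration, and it does not represent any vendor's actual analysis software.

```python
def cell_intensities(image, cell_size):
    """Mean pixel value of each cell_size x cell_size block of a 2D image.

    `image` is a list of equal-length rows of numbers. Partial blocks at
    the right and bottom edges are ignored.
    """
    rows, cols = len(image), len(image[0])
    means = []
    for r in range(0, rows - cell_size + 1, cell_size):
        row_means = []
        for c in range(0, cols - cell_size + 1, cell_size):
            block = [
                image[r + i][c + j]
                for i in range(cell_size)
                for j in range(cell_size)
            ]
            row_means.append(sum(block) / len(block))
        means.append(row_means)
    return means


# Hypothetical 4x4 scan containing four 2x2 cells.
scan = [
    [10, 12, 200, 198],
    [11, 13, 202, 200],
    [90, 92, 0, 2],
    [91, 93, 1, 3],
]
print(cell_intensities(scan, 2))
```

Each number in the result stands for the measured brightness of one cell, which is then read as a relative expression level.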
10
Quantitative Begets Qualitative Change
11
Why is it important to capture?
  • Previous research was documented in scientific
    literature and books. Increasingly, though, our
    theories and methods are based on empirical
    measurements: data gathered or sampled from our
    environment. Without a preserved record of this
    data, we cannot verify previous work or build on
    existing work.

12
Memex Still Prescient
  • Consider a future device for individual use,
    which is a sort of mechanized private file and
    library. It needs a name, and to coin one at
    random, 'memex' will do. A memex is a device in
    which an individual stores all his books,
    records, and communications, and which is
    mechanized so that it may be consulted with
    exceeding speed and flexibility. It is an
    enlarged intimate supplement to his memory.
  • Vannevar Bush
  • As We May Think, 1945

13
Human-Computer Symbiosis
  • It seems reasonable to envision, for a time 10 or
    15 years hence, a 'thinking center' that will
    incorporate the functions of present-day
    libraries together with anticipated advances in
    information storage and retrieval.
  • The picture readily enlarges itself into a
    network of such centers, connected to one another
    by wide-band communication lines and to
    individual users by leased-wire services. In such
    a system, the speed of the computers would be
    balanced, and the cost of the gigantic memories
    and the sophisticated programs would be divided
    by the number of users.
  • J.C.R. Licklider, 1960

14
21st Century Challenges
  • The threefold way
  • theory and scholarship
  • experiment and measurement
  • computation and analysis
  • Supported by
  • distributed, multidisciplinary teams
  • multimodal collaboration systems
  • distributed, large scale data sources
  • leading edge computing systems
  • distributed experimental facilities
  • Socialization and community
  • multidisciplinary groups
  • geographic distribution
  • new enabling technologies
  • creation of 21st century IT infrastructure
  • sustainable, multidisciplinary communities
  • "Come as you are" response

Computation
Experiment
Theory
15
Example: Linking Genotype and Phenotype to study
diseases
[Figure labels: Phenotype 1-4; Ethnicity; Environment; Age; Gender; Identify Genes; Pharmacokinetics; Metabolism; Endocrine; Biomarker Signatures; Physiology; Proteome; Transcriptome; Immune; Morphometrics; Predictive Disease Susceptibility]
Source: Terry Magnuson
16
Challenges
  • Technical
  • Data storage, computing power, data ingest.
  • Knowledge (scholarly communications)
  • How do we share information (different terms,
    different languages)
  • How do we preserve?
  • Medical imaging formats last 10-20 years (larger
    capital investment, clinical care)
  • A commercial sequencer data format no longer
    exists 2 years after the product is introduced
    (rapid technology changes, research-only
    settings)

17
Technical Data Challenges
  • Multitudes of input sources (data being
    generated)
  • Computing power requirements
  • Storage requirements

18
The Data Tsunami
  • Many sources
  • agricultural
  • biomedical
  • environmental
  • engineering
  • manufacturing
  • financial
  • social and policy
  • historical
  • Many causes and enablers
  • increased detector resolution
  • increased storage capability
  • The challenge: extracting insight!

We Are Here!
19
Sensor Data Overload
Source: Robert Morris, IBM
20
Computing History
  • 1890-1945
  • mechanical, relay
  • 7 year doubling
  • 1945-1985
  • tube, transistor, ...
  • 2.3 year doubling
  • 1985-2003
  • microprocessor
  • 1 year doubling
  • Exponentials
  • chip transistor density 2X in 18 months
  • WAN bandwidth 64X in two years
  • storage 7X in two years
  • graphics 100X in three years
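
The growth factors above are easier to compare once each is converted to a doubling time. A minimal sketch, assuming steady exponential growth over the stated period (a simplification; the function name is illustrative):

```python
import math


def doubling_time_months(growth_factor, period_months):
    """Months needed to double, given growth_factor over period_months.

    Assumes constant exponential growth: factor = 2 ** (period / T),
    so T = period * ln(2) / ln(factor).
    """
    if growth_factor <= 1:
        raise ValueError("growth_factor must be greater than 1")
    return period_months * math.log(2) / math.log(growth_factor)


# (growth factor, period in months), as listed on the slide.
rates = {
    "chip transistor density": (2, 18),
    "WAN bandwidth": (64, 24),
    "storage": (7, 24),
    "graphics": (100, 36),
}
for name, (factor, months) in rates.items():
    t = doubling_time_months(factor, months)
    print(f"{name}: doubles every {t:.1f} months")
```

On these numbers, WAN bandwidth doubles about every 4 months, graphics about every 5.4 months, and storage about every 8.5 months, all faster than the 18-month doubling of transistor density.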

Microcomputer Revolution
4K bit core plane
Source: Jim Gray
21
Computing Power Trends
http://www.transhumanist.com/volume1/moravec.htm
22
Storage Qualitative Change
1972
80 GB in 2004
5 MB in 1956
23
Storage in practical terms
  • Megabyte
  • a small novel
  • Gigabyte
  • a pickup truck filled with paper or a DVD
  • Terabyte: one thousand gigabytes; $1000 today
  • the text in one million books
  • entire U.S. Library of Congress is ten terabytes
    of text
  • Petabyte: one thousand terabytes
  • 1-2 petabytes equals all academic research
    library holdings
  • coming soon to a pocket near you!
  • soon routinely generated annually by many
    scientific instruments
  • Exabyte: one thousand petabytes
  • 5 exabytes of words spoken in the history of
    humanity
  • See www.sims.berkeley.edu/research/projects/how-much-info-2003/

Source: Hal Varian, UC-Berkeley
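
To make the scale concrete, the sketch below converts between these units and asks how long an instrument takes to produce a petabyte. It uses decimal (power-of-1000) units as on the slide, and the 5 TB/night rate quoted for LSST earlier in the talk; the helper names are illustrative.

```python
# Decimal storage units, each one thousand times the previous.
UNITS = {"MB": 10**6, "GB": 10**9, "TB": 10**12, "PB": 10**15, "EB": 10**18}


def to_bytes(value, unit):
    """Convert a value in the given unit to bytes."""
    return value * UNITS[unit]


def nights_to_fill(capacity, capacity_unit, nightly, nightly_unit):
    """Nights needed to accumulate `capacity` at a fixed nightly data rate."""
    return to_bytes(capacity, capacity_unit) / to_bytes(nightly, nightly_unit)


# One petabyte at 5 terabytes per night.
print(nights_to_fill(1, "PB", 5, "TB"))
```

At that rate a single instrument passes a petabyte in 200 nights of observing, which is what "routinely generated annually" looks like in practice.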
24
Knowledge preservation requires standards
  • Storage formats
  • Media (CDROM, DVD, tapes)
  • File formats (PDF, JPEG, MPEG)
  • Standards that define meaning for particular
    domain (metadata, controlled vocabularies,
    taxonomies). Examples from medical and
    biological sciences: MeSH, DICOM, MIAME, GO,
    caBIG.
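
What a controlled vocabulary buys you can be shown with a toy check: a metadata record is accepted only if each described field uses an agreed term. The vocabulary, field names, and function below are hypothetical and made up for illustration; they are not taken from MeSH, DICOM, MIAME, GO, or caBIG, which are far larger and richer.

```python
# Hypothetical controlled vocabulary: field name -> set of allowed terms.
VOCABULARY = {
    "modality": {"CT", "MR", "US"},
    "organism": {"Homo sapiens", "Mus musculus"},
}


def validate(record, vocabulary=VOCABULARY):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field, allowed in vocabulary.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif record[field] not in allowed:
            problems.append(f"{field}: {record[field]!r} is not a controlled term")
    return problems


print(validate({"modality": "CT", "organism": "Homo sapiens"}))
print(validate({"modality": "cat scan"}))
```

The second record fails twice: it uses a free-text term where a controlled one is expected, and it omits a required field. Both problems make data hard to find and reuse later.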

25
Who pursues standards?
  • Users (scientists) (GO, multitudes of domain
    specific examples)
  • Manufacturers (companies making products)
  • Storage media (CDROM, DVD; DVD 2nd generation not
    quite?)
  • Knowledge standards not so frequent, more often
    in conjunction with push from user community
    (DICOM, MIAME)
  • Government (MeSH, GenBank, caBIG)

26
What role does/should the government play?
  • Has developed standards in areas where
    significant support was provided (medicine and
    science via NLM, e.g. Medline, MeSH, UMLS,
    GenBank, Entrez, etc.), with more to come (NIH
    caBIG, etc.).
  • Successful with high cost shared resources
    (colliders (CERN), astronomy telescopes, XXX).
  • Many other areas not addressed though (?).

27
Critical Steps to Success?
  • Standards for long-term preservation of important
    descriptive information should be developed by
    the community. The point is not the format
    details of a lab machine, but that the data
    generated follows national standards (controlled
    vocabularies).
  • Central repositories (federated OK) are set up to
    store, preserve and provide access.
  • Bring about usage by
  • mandates from funding agencies (NIH, NSF)
  • requirement for publication (GenBank for
    sequences).

28
(No Transcript)
29
Challenges
  • Lack of an overall semantically interoperable
    framework for multidisciplinary research.
  • Motivation for research labs to adhere to
    standards, especially when storing and describing
    data (example: UNC stories).
  • Ownership, provenance and context
  • Privacy and security (IRB approval for future
    research)
  • Indexing, data mining.

30
University Data Challenges
  • Multiple cultures
  • arts, humanities and social sciences
  • sciences and engineering
  • Many scholarly communication approaches
  • books, monographs, journals, conferences
  • access time, priority and intellectual property
  • multiple media and expression
  • text, audio, video, artifacts, performances, ...
  • primary and secondary source materials
  • professional societies and private publishers
  • Institutional repositories
  • multiple visions and roles
  • digital archives and/or alternative publication
    venues
  • research and education
  • access modes and goals, not just articles or
    books
  • longitudinal access and lifelong learning
  • what and how much to save
  • declining cost of storage and simplicity of
    deposit

31
PITAC Report Contents
  • Computational Science: Ensuring America's
    Competitiveness
  • A Wake-up Call: The Challenges to U.S.
    Preeminence and Competitiveness
  • Medieval or Modern? Research and Education
    Structures for the 21st Century
  • Multi-decade Roadmap for Computational Science
  • Sustained Infrastructure for Discovery and
    Competitiveness
  • Research and Development Challenges
  • Two key appendices
  • Examples of Computational Science at Work
  • Computational Science Warnings: A Message Rarely
    Heeded
  • Available at www.nitrd.gov

32
PITAC Recommendation
  • The Federal government must implement
    coordinated, long-term computational science
    programs that include funding for interconnecting
    the software sustainability centers, national
    data and software repositories, and national
    high-end leadership centers with the researchers
    who use those resources, forming a balanced,
    coherent system that also includes regional and
    local resources.
  • Such funding methods are customary practice in
    research communities that use scientific
    instruments such as light sources and telescopes,
    increasingly in data-centered communities such as
    those that use the genome database, and in the
    national defense sector.