eScience -- A Transformed Scientific Method

1
eScience -- A Transformed Scientific Method
  • Jim Gray,
  • eScience Group,
  • Microsoft Research
  • http://research.microsoft.com/Gray

2
Outline
  • What's Computer Science?
  • What Do I Do?
  • eScience? What's that?
  • Peer-Reviewed Literature and Data online? How
    would that work?

3
What's Computer Science?
  • We have the patent on
      • the byte (aka information)
      • the algorithm (aka process)
  • This covers just about everything interesting
      • Music is software
      • Literature is software
      • Life is just software
          • DNA is information
          • Metabolism is a process
          • It's digital

4
What's Science? Pasteur's Quadrant
[Figure: quadrant chart with axes "Theoretical" and "Practical"; quadrant labels: Einstein, Pasteur, Anti-Intellectual, Edison]
5
The Scholarly Life
Applies to science, engineering, medicine, law,
philosophy, art, ...
6
An Amazing Thing
  • Intellectual property is property (has value)
  • Cyberspace is Real Estate!
  • Columbus discovered a New World: lots of new real
    estate
  • Cyberspace is a new world
      • EverQuest and Second Life and ...
      • And Windows, Office, Google, ...
      • And music and medicine and ...
  • A dollar invested in research pays off 10x or more
    in NEW IDEAS.

7
Outline
  • What's Computer Science?
  • What Do I Do?
  • eScience? What's that?
  • Peer-Reviewed Literature and Data online? How
    would that work?

8
What I Do
  • Meditation
      • Inventing new ways to organize data
      • Inventing new ways to search data
      • Using scientific data as the vehicle (eScience)
  • Service
      • Serve on government boards
      • Professional societies
      • Trying to help scientists
      • Trying to get scientific literature and data
        online
  • Teaching
      • here I am :)
      • Advise students
      • Mentor (younger) colleagues

9
Outline
  • What's Computer Science?
  • What Do I Do?
  • eScience? What's that?
  • Peer-Reviewed Literature and Data online? How
    would that work?

10
eScience: What is it?
  • Synthesis of information technology and science.
  • Science methods are changing.
  • Science is being codified/objectified. How to
    represent scientific information and knowledge in
    computers?
  • Science faces a data deluge. How to manage and
    analyze information?
  • Scientific communication is changing.

11
Science Paradigms
  • A thousand years ago: science was empirical
      • describing natural phenomena
  • Last few hundred years: theoretical branch
      • using models, generalizations
  • Last few decades: a computational branch
      • simulating complex phenomena
  • Today: data exploration (eScience)
      • unify theory, experiment, and simulation
      • Data captured by instruments or generated by
        simulator
      • Processed by software
      • Information/knowledge stored in computer
      • Scientist analyzes database / files using data
        management and statistics

12
X-Info
  • The evolution of X-Info and Comp-X for each
    discipline X
  • How to codify and represent our knowledge

The Generic Problems
  • Data ingest
  • Managing a petabyte
  • Common schema
  • How to organize it
  • How to reorganize it
  • How to coexist with others
  • Query and Vis tools
  • Building and executing models
  • Integrating data and Literature
  • Support/training
  • Performance

13
Experiment Budgets: ¼ to ½ Software
  • Millions of lines of code
  • Repeated for experiment after experiment
  • Not much sharing or learning
  • CS can change this
  • Build generic tools
      • Workflow schedulers
      • Databases and libraries
      • Analysis packages
      • Visualizers
  • Software for
      • Instrument scheduling
      • Instrument control
      • Data gathering
      • Data reduction
      • Database
      • Analysis
      • Modeling
      • Visualization

14
New Approaches to Data Analysis
  • Looking for
      • Needles in haystacks: the Higgs particle
      • Haystacks: dark matter, dark energy
  • Needles are easier than haystacks
  • Global statistics have poor scaling
      • Correlation functions are N^2, likelihood
        techniques N^3
  • As data and computers grow at the same rate, we can
    only keep up with N log N
  • A way out?
      • Discard notion of optimal (data is fuzzy, answers
        are approximate)
      • Don't assume infinite computational resources or
        memory
  • Requires combination of statistics & computer
    science
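
To make the scaling point concrete, here is a minimal sketch (not from the talk) that counts pairs of points lying within a distance r of each other, the basic operation behind a correlation function. The 1-D setting and the function names are assumptions chosen for illustration. The first version checks every pair and costs N^2; the second sorts once and uses binary search, which costs N log N.

```python
import bisect
import random


def count_pairs_bruteforce(xs, r):
    """Count unordered pairs with |xi - xj| <= r by testing every pair: O(N^2)."""
    count = 0
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            if abs(xs[i] - xs[j]) <= r:
                count += 1
    return count


def count_pairs_sorted(xs, r):
    """Same count using one sort plus a binary search per point: O(N log N)."""
    s = sorted(xs)
    count = 0
    for i, x in enumerate(s):
        # Index just past the last point that is <= x + r.
        hi = bisect.bisect_right(s, x + r)
        count += hi - i - 1  # points to the right of i, excluding i itself
    return count


# Integer positions keep the comparison exact (no floating-point rounding).
rng = random.Random(0)
data = [rng.randrange(1000) for _ in range(200)]
same_answer = count_pairs_sorted(data, 5) == count_pairs_bruteforce(data, 5)
```

Both functions return the same count; only the cost differs, and that gap widens quickly as N grows.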

15
Analysis and Databases
  • Much statistical analysis deals with
      • Creating uniform samples
      • Data filtering
      • Assembling relevant subsets
      • Estimating completeness
      • Censoring bad data
      • Counting and building histograms
      • Generating Monte-Carlo subsets
      • Likelihood calculations
      • Hypothesis testing
  • Traditionally these are performed on files
  • Most of these tasks are much better done inside a
    database
  • Move Mohamed to the mountain, not the mountain to
    Mohamed.
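
As an illustration of doing the analysis inside the database, the sketch below filters out flagged rows and builds a histogram in a single SQL statement, so only the small summary leaves the database. It uses Python's built-in sqlite3 module; the table `obj`, its columns, and the sample rows are hypothetical and not taken from the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical catalog: an object id, a magnitude, and a bad-data flag.
conn.execute("CREATE TABLE obj (id INTEGER PRIMARY KEY, mag REAL, flag INTEGER)")
conn.executemany(
    "INSERT INTO obj VALUES (?, ?, ?)",
    [(1, 14.2, 0), (2, 14.8, 0), (3, 15.1, 1),
     (4, 15.6, 0), (5, 16.3, 0), (6, 16.9, 0)],
)

# Censor bad data (flag = 0 keeps good rows) and count per 1-magnitude bin.
histogram = conn.execute(
    """
    SELECT CAST(mag AS INTEGER) AS bin, COUNT(*) AS n
    FROM obj
    WHERE flag = 0
    GROUP BY bin
    ORDER BY bin
    """
).fetchall()
```

The client receives one row per bin rather than one row per object, which is the point of moving the computation to the data.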

16
Outline
  • What's Computer Science?
  • What Do I Do?
  • eScience? What's that?
  • Peer-Reviewed Literature and Data online? How
    would that work?

17
Peer-Reviewed Science Literature Is Coming Online
  • Agencies and foundations are mandating that research
    be public domain.
      • NIH ($30 B/y, 40k PIs, ...) (see
        http://www.taxpayeraccess.org/)
      • Wellcome Trust
      • Japan, China, Italy, South Africa, ...
      • Public Library of Science, ...
  • Other agencies will follow NIH
  • Publishers will resist (not surprising)
  • Professional societies will resist (amazing!)

18
PubMed Central International
  • Information at your fingertips
  • Deployed in US, China, England, Italy, South Africa,
    Japan (not public on Internet yet)
  • Each site can accept documents
  • Archives replicated
  • Federate through web services
  • Working to integrate Word/Excel/... with
    PubMed Central, e.g. WordML, XSD, ...
  • To be clear: NCBI is doing 99% of the work.

19
Peer Review
  • Currently support a conference peer-review system
    (300 conferences)
      • Form committee
      • Accept manuscripts
      • Declare interest/recuse
      • Review
      • Decide
      • Form program
      • Notify
      • Revise

20
Publishing Peer Review
  • Add publishing steps
      • Form committee
      • Accept manuscripts
      • Declare interest/recuse
      • Review
      • Decide
      • Form program
      • Notify
      • Revise
      • Publish
  • Improve author-reader experience
      • Manage versions
      • Capture data
      • Interactive documents
  • Capture workshop
      • presentations
      • proceedings
  • Capture classroom: ConferenceXP
  • Moderated discussions of published articles
  • Connect to Archives

21
So What about Publishing Data?
  • The answer is 42.
  • But
      • What are the units?
      • How precise? How accurate? 42.5 ± .01
      • Show your work: data provenance
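
One way to picture what "publishing a number" involves is a record that carries the value together with its uncertainty, its unit, and the steps that produced it. The class below is a hypothetical sketch, not something defined in the talk; the field names, the unit "cm", and the processing-step strings are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    value: float
    uncertainty: float      # how precise / how accurate
    unit: str               # what are the units?
    provenance: tuple = ()  # show your work: ordered processing steps

    def derive(self, step, value, uncertainty):
        """Return a new measurement with one more processing step recorded."""
        return Measurement(value, uncertainty, self.unit,
                           self.provenance + (step,))

    def __str__(self):
        return f"{self.value} ± {self.uncertainty} {self.unit}"


raw = Measurement(42.0, 0.5, "cm", ("instrument reading",))
published = raw.derive("calibration applied", 42.5, 0.01)
```

Here `str(published)` yields "42.5 ± 0.01 cm", and `published.provenance` lists both the raw reading and the calibration step, so a reader can see how the number was obtained.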

22
Thought Experiment
  • You have collected some data and want to publish
    science based on it.
  • How do you publish the data so that others can
    read it and reproduce your results in 100
    years?
      • Document collection process?
      • How to document data processing (scrubbing &
        reducing the data)?
      • Where do you put it?

23
Objectifying Knowledge
  • This requires agreement about
      • Units: cgs
      • Measurements: who/what/when/where/how
      • CONCEPTS
          • What's a planet, star, galaxy, ...?
          • What's a gene, protein, pathway?
  • Need to objectify science
      • What are the objects?
      • What are the attributes?
      • What are the methods (in the OO sense)?
  • This is mostly Physics/Bio/Eco/Econ/... But CS
    can do generic things

24
Objectifying Knowledge
Warning! Painful discussions ahead:
  • The "O" word: Ontology
  • The "S" word: Schema
  • The "CV" words: Controlled Vocabulary
  • Domain experts do not agree
25
The Best Example: Entrez-GenBank
http://www.ncbi.nlm.nih.gov/
  • Sequence data deposited with GenBank
  • Literature references GenBank ID
  • BLAST searches GenBank
  • Entrez integrates and searches
      • PubMed Central
      • PubChem
      • GenBank
      • Proteins, SNP, ...
      • Structure, ...
      • Taxonomy
      • Many more

26
The Vision: Global Data Federation
  • Massive datasets live near their owners
      • Near the instrument's software pipeline
      • Near the applications
      • Near data knowledge and curation
  • Each Archive publishes a (web) service
      • Schema: documents the data
      • Methods on objects (queries)
  • Scientists get personalized extracts
  • Uniform access to multiple Archives
      • A common global schema

27
Web Services Enable Federation
  • Web SERVER
      • Given a URL + parameters
      • Returns a web page (often dynamic)
  • Web SERVICE
      • Given an XML document (SOAP msg)
      • Returns an XML document
      • Tools make this look like an RPC.
          • F(x,y,z) returns (u, v, w)
      • Distributed objects for the web.
          • naming, discovery, security, ...
      • Internet-scale distributed computing
  • Now: Find object models for each science.
28
Outline
  • What's Computer Science?
  • What Do I Do?
  • eScience? What's that?
  • Peer-Reviewed Literature and Data online? How
    would that work?
  • And give an example ?

29
World Wide Telescope: Virtual Observatory
http://www.us-vo.org/
http://www.ivoa.net/
  • Premise: Most data is (or could be) online
  • So, the Internet is the world's best telescope
      • It has data on every part of the sky
      • In every measured spectral band: optical, x-ray,
        radio, ...
      • As deep as the best instruments (2 years ago).
      • It is up when you are up. The seeing is always
        great (no working at night, no clouds, no moons,
        no ...).
      • It's a smart telescope: links objects and
        data to literature on them.

30
Why Astronomy Data?
  • It has no commercial value
  • No privacy concerns
  • Can freely share results with others
  • Great for experimenting with algorithms
  • It is real and well documented
  • High-dimensional data (with confidence intervals)
  • Spatial data
  • Temporal data
  • Many different instruments from many different
    places and many different times
  • Federation is a goal
  • There is a lot of it (petabytes)

31
Time and Spectral Dimensions: The Multiwavelength
Crab Nebula
Crab star 1054 AD
X-ray, optical, infrared, and radio views of
the nearby Crab Nebula, which is now in a state
of chaotic expansion after a supernova explosion
first sighted in 1054 A.D. by Chinese astronomers.
Slide courtesy of Robert Brunner @ Caltech.
32
SkyServer.SDSS.org
  • A modern archive
      • Access to Sloan Digital Sky Survey: spectroscopic
        and optical surveys
      • Raw pixel data lives in file servers
      • Catalog data (derived objects) lives in database
      • Online query to any and all
  • Also used for education
      • 150 hours of online astronomy
      • Implicitly teaches data analysis
  • Interesting things
      • Spatial data search
      • Client query interface via Java applet
      • Query from Emacs, Python, ...
      • Cloned by other surveys (a template design)
      • Web services are core of it.

33
SkyServer: SkyServer.SDSS.org
  • Like the TerraServer, but looking the other way:
    a picture of ¼ of the universe
  • Sloan Digital Sky Survey Data: Pixels + Data
    Mining
  • About 400 attributes per object
  • Spectrograms for 1% of objects

34
Demo of SkyServer
  • Shows standard web server
  • Pixel/image data
  • Point and click
  • Explore one object
  • Explore sets of objects (data mining)

35
SkyQuery (http://skyquery.net/)
  • Distributed query tool using a set of web
    services
  • Many astronomy archives from Pasadena, Chicago,
    Baltimore, Cambridge (England)
  • Has grown from 4 to 15 archives, now becoming
    an international standard
  • Web Service poster child
  • Allows queries like

SELECT o.objId, o.r, o.type, t.objId
FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
WHERE XMATCH(o,t) < 3.5 AND AREA(181.3,-0.76,6.5)
  AND o.type = 3 AND (o.I - t.m_j) > 2
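
The query above pairs up objects from two archives that appear to be the same source within a region of sky. As a rough illustration of that idea, the sketch below restricts one catalog to a circular area and then matches objects by angular separation. This is a simplified stand-in, not SkyQuery's actual algorithm: the slide does not define the units or exact semantics of XMATCH and AREA, so treating both as plain angular thresholds in degrees is an assumption, as are the dictionary keys and sample coordinates.

```python
import math


def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(min(1.0, a))))


def cross_match(cat_a, cat_b, center, radius_deg, tol_deg):
    """Pair objects from two catalogs: the first must lie inside the
    circular area, and the pair must be within tol_deg of each other.
    Brute force over all pairs, for clarity rather than speed."""
    ra0, dec0 = center
    matches = []
    for a in cat_a:
        if angular_sep_deg(a["ra"], a["dec"], ra0, dec0) > radius_deg:
            continue  # outside the search area
        for b in cat_b:
            if angular_sep_deg(a["ra"], a["dec"], b["ra"], b["dec"]) <= tol_deg:
                matches.append((a["id"], b["id"]))
    return matches


# Hypothetical sample catalogs (coordinates in decimal degrees).
survey_a = [{"id": 1, "ra": 181.30, "dec": -0.76},
            {"id": 2, "ra": 190.00, "dec": 5.00}]
survey_b = [{"id": 10, "ra": 181.3003, "dec": -0.7601},
            {"id": 11, "ra": 181.40, "dec": -0.76}]

pairs = cross_match(survey_a, survey_b, center=(181.3, -0.76),
                    radius_deg=0.1, tol_deg=3.5 / 3600)
```

A production system would use a spatial index instead of the double loop, for the scaling reasons discussed earlier in the talk.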
36
SkyQuery Structure
  • Each SkyNode publishes
      • Schema Web Service
      • Database Web Service
  • Portal
      • Plans query (2 phase)
      • Integrates answers
      • Is itself a web service

37
Schema (aka metadata)
  • Everyone starts with the same schema:
    <stuff/>. Then they start arguing about semantics.
  • Virtual Observatory: http://www.ivoa.net/
  • Metadata based on Dublin Core:
    http://www.ivoa.net/Documents/latest/RM.html
  • Universal Content Descriptors (UCD):
    http://vizier.u-strasbg.fr/doc/UCD.htx
    Captures quantitative concepts and their units.
    Reduced from 100,000 tables in literature to 1,000
    terms.
  • VOTable: a schema for answers to questions
    http://www.us-vo.org/VOTable/
  • Common queries: Cone Search and Simple Image
    Access Protocol, SQL
  • Registry: http://www.ivoa.net/Documents/latest/RMExp.html
    (still a work in progress)
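
Of the common queries listed, Cone Search is the simplest: a position on the sky plus a search radius, sent as URL parameters. The sketch below builds such a request URL. To the best of my knowledge the IVOA Simple Cone Search convention uses the parameters RA, DEC, and SR in decimal degrees, but treat that as an assumption to check against the standard; the base URL is a hypothetical placeholder, not a real service.

```python
from urllib.parse import urlencode


def cone_search_url(base_url, ra_deg, dec_deg, radius_deg):
    """Build a cone-search request URL: position (RA, DEC) and radius (SR),
    all in decimal degrees."""
    if not 0.0 <= ra_deg <= 360.0:
        raise ValueError("RA must be between 0 and 360 degrees")
    if not -90.0 <= dec_deg <= 90.0:
        raise ValueError("DEC must be between -90 and 90 degrees")
    if radius_deg < 0.0:
        raise ValueError("search radius must be non-negative")

    # Pick the right separator depending on how the base URL ends.
    if base_url.endswith(("?", "&")):
        sep = ""
    elif "?" in base_url:
        sep = "&"
    else:
        sep = "?"
    query = urlencode({"RA": ra_deg, "DEC": dec_deg, "SR": radius_deg})
    return base_url + sep + query


# Hypothetical endpoint, for illustration only.
url = cone_search_url("http://example.org/cone", 181.3, -0.76, 0.1)
```

The service would answer such a request with a table of matching objects, which is where a common answer schema such as VOTable comes in.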

38
SkyServer/SkyQuery Evolution: MyDB and Batch Jobs
  • Problem: need multi-step data analysis (not just
    single query).
  • Solution: Allow personal databases on portal
  • Problem: some queries are monsters
  • Solution: Batch schedule on portal. Deposits
    answer in personal database.
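
The two solutions fit together: a batch queue runs long queries against the archive and deposits each answer as a table in the submitting user's personal database, where it can feed the next step of the analysis. The sketch below illustrates that flow with in-memory SQLite databases. It is not the SkyServer implementation; the user name, table names, schema, and queue design are all hypothetical.

```python
import sqlite3
from collections import deque

# Shared archive (hypothetical schema and data).
archive = sqlite3.connect(":memory:")
archive.execute("CREATE TABLE obj (id INTEGER, mag REAL)")
archive.executemany("INSERT INTO obj VALUES (?, ?)",
                    [(1, 14.2), (2, 15.6), (3, 16.9)])

mydb = {"alice": sqlite3.connect(":memory:")}  # one personal database per user
queue = deque()


def submit(user, table, sql):
    """Queue a query; its answer will land in `table` in the user's MyDB."""
    queue.append((user, table, sql))


def run_batch():
    """Run queued queries in order and deposit each answer in a personal DB."""
    while queue:
        user, table, sql = queue.popleft()
        cursor = archive.execute(sql)
        columns = [d[0] for d in cursor.description]
        rows = cursor.fetchall()
        db = mydb[user]
        # Table and column names are interpolated directly; a real portal
        # would have to validate them.
        db.execute(f"CREATE TABLE {table} ({', '.join(columns)})")
        marks = ", ".join("?" for _ in columns)
        db.executemany(f"INSERT INTO {table} VALUES ({marks})", rows)
        db.commit()


# Step 1 runs in batch; step 2 is a follow-up query on the saved result.
submit("alice", "bright", "SELECT id, mag FROM obj WHERE mag < 16")
run_batch()
step2 = mydb["alice"].execute("SELECT id FROM bright ORDER BY mag").fetchall()
```

The follow-up query touches only the small personal table, so the multi-step analysis does not have to rerun the expensive archive query.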

39
Outline
  • The Evolution of X-Info
  • Online Literature
  • Online Data
  • The World Wide Telescope as Archetype

The Big Problems
  • Data ingest
  • Managing a petabyte
  • Common schema
  • How to organize it
  • How to reorganize it
  • How to coexist with others
  • Query and Vis tools
  • Integrating data and Literature
  • Support/training
  • Performance
      • Execute queries in a minute
      • Batch query scheduling

40
Outline
  • What's Computer Science?
  • What Do I Do?
  • eScience? What's that?
  • Peer-Reviewed Literature and Data online? How
    would that work?