About This Presentation

Title:

Four Talks

Slides: 36

Provided by: jimg178

Transcript and Presenter's Notes

1
Four Talks
Jim Gray, Microsoft
Alex Szalay, Ani Thakar, Jan Vandenberg, JHU
Chris Stoughton, Fermilab

The article we actually wrote: Online Scientific Publication, Curation, Archiving
The advertised talk: Computer Science Challenges in the VO
A Web Server (SkyServer) tour
A Web Service (SdssCutout) tour
These lead up to Alex Szalay's talk on Web Services

2
The Paper We Wrote: Online Scientific Publication, Curation, Archiving

Jim Gray, Microsoft
Alex Szalay, Ani Thakar, Jan Vandenberg, JHU
Chris Stoughton, Fermilab

3
Outline

The Virtual Observatory will be an ecosystem of authors, curators, publishers, archivers, and readers, contributing using shared data.
The process and roles: author, curator, publisher, and archiver roles are changing.
Ephemeral and derived data: must capture ephemeral information. All design metadata info is ephemeral. Can trade off recomputing derived data.
Economics: publish/archive cost is zero; author/curator cost dominates.
SDSS: data inflation, what we are doing.

4
Premise

Once published, scientific data needs to be available forever, so that the science can be reproduced/extended.
What does that mean?
Data
Ephemeral: data could not be reproduced
Stable: data could be derived from ephemeral data
Meta-data: how the data was collected/derived is ephemeral
Must be preserved
Includes design docs, software, email, pubs, personal notes

5
Changing Roles

Exponential growth
Projects last at least 3-5 years
Project data online during project lifetime.
Data sent to central archive only at the end of
the project
At any instant, only 1/8 of data is centralized
New project responsibilities
Becoming Publishers and Curators
Larger fraction of budget spent on software
Standards are needed
Easier data interchange, fewer tools
Templates are needed
Much development duplicated, wasted
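
One way to read the 1/8 figure is sketched below. It assumes two things that are not stated on the slide: that the total data volume doubles every year, and that a project hands its data to the central archive after about T = 3 years. Under those assumptions only data older than T years is centralized at time t:

```latex
\[
\frac{\text{centralized data}}{\text{all data}}
  \;\approx\; \frac{2^{\,t-T}}{2^{\,t}}
  \;=\; 2^{-T},
\qquad
T = 3 \;\Rightarrow\; 2^{-3} = \tfrac{1}{8}.
\]
```

With a 5-year project lifetime and the same assumed doubling time, the centralized fraction would drop to 1/32.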

6
Publishing Data
Roles        Authors         Publishers        Curators           Archives          Consumers
Traditional  Scientists      Journals          Libraries          Archives          Scientists
Emerging     Collaborations  Project web site  Data+Doc Archives  Digital Archives  Scientists

7
The Core Problem: No Economic Model

The archive user has not yet been born. How can he pay you to curate the data?
The scientist gathered data for his own purpose. Why should he pay (invest time) for your needs?
Answer to both: that's the scientific method.
Curating data (documenting the design, the acquisition, and the processing) is very hard, and there is no reward for doing it. The results are rewarded, not the process of getting them.
Storage/archive is NOT the problem (it's almost free).
Curating/publishing is expensive.

8
What SDSS is Doing: Capture the Bits

Best-effort documenting of data and process.
Publishing data often by UPS (~5 TB today (DR1), and so ~$15k for a copy).
Replicating data on 3 continents.
EVERYTHING online (tape data is dead data).
Archiving all email, discussions, ...
Keeping all web logs.
Now we need to figure out how to organize/search all this metadata.

9
SDSS Data Inflation: Data Pyramid

Level 2: derived data products are 10x smaller, but there are many catalogs.
Publish new edition each year
Fixes bugs in data.
Must preserve old editions
Creates data pyramid
Store each edition
1, 2, 3, 4, ..., N ~ N^2 bytes
Net data inflation: L2 ≥ L1
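
The "N^2 bytes" figure on this slide can be checked with a short sum. The sketch assumes (the slide does not say so explicitly) that edition k is roughly k units in size, because each yearly edition republishes everything collected so far, and that every edition is kept:

```latex
\[
\text{total stored after } N \text{ editions}
  \;=\; \sum_{k=1}^{N} k
  \;=\; \frac{N(N+1)}{2}
  \;\approx\; \frac{N^{2}}{2}.
\]
```

Under that assumption the archive of derived products grows quadratically in the number of editions, even though a single edition is much smaller than the pixel data.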

Level 1A: grows 5 TB pixels/year, growing to 25 TB; ~2 TB/year compressed, growing to 13 TB; ~4 TB today (Level 1A in NASA terms)

10
Summary

The Virtual Observatory will be an ecosystem of authors, curators, publishers, archivers, and readers, contributing using shared data.
The process and roles are changing: author, project, publisher, curator.
Ephemeral and stable data: capture ephemeral information. All design metadata info is ephemeral. Can trade off recomputing derived data.
Economics: author/curate cost dominates.
SDSS: data inflation, data pyramid.

11
Four Talks

The article we actually wrote: Online Scientific Publication, Curation, Archiving
The advertised talk: Computer Science Challenges in the VO
A Web Server (SkyServer) tour
A Web Service (SdssCutout) tour
These lead up to Alex Szalay's talk on Web Services

12
The Advertised Talk: Computer Science Challenges in the VO

Jim Gray, Microsoft

13
Virtual Observatory

Premise: most data is (or could be) online.
So, the Internet is the world's best telescope:
It has data on every part of the sky
In every measured spectral band: optical, x-ray, radio...
As deep as the best instruments (2 years ago).
It is up when you are up. The seeing is always great (no working at night, no clouds, no moons, no ...).
It's a smart telescope: links objects and data to literature on them.

14
Virtual Observatory: Data Federation of Web Services

Massive datasets live near their owners
Near the instrument's software pipeline
Near the applications
Near data knowledge and curation
Computer centers become Data Centers
Archives are replicated for
Performance
Availability/Reliability
Each Archive publishes a web service
Schema documents the data
Methods on objects (queries)
Scientists get personalized extracts
Uniform access to multiple Archives
A common global schema

15
Some Unique Things About Astro Data

There is a desire to compare data from different instruments
Most astronomers publish their data (especially surveys)
Combining data from different instruments gives more info
Szalay observes Metcalfe's law: utility grows as N^2
This is less true in some other fields
It's tractable
Sizes fit in current regimes (10s of terabytes today)
Tasks fit Beowulfs
Astro data is a great sandbox for CS research.
High-dimensional data
Temporal, spatial, image datatypes
Few privacy/commercial concerns
There is lots of it

16
My #1 Challenge: going beyond files (a file is an array of bytes). Science vs. Commerce

Data in files: FTP a local copy/subset. ASCII or binary.
Each scientist builds own analysis toolkit
Analysis is a tcl script of toolkit on local data.
Some simple visualization tools: x vs. y

Data in a database
Standard reports for standard things.
Report writers for non-standard things
GUI tools to explore data.
Decision trees
Clustering
Anomaly finders

17
But... some science is hitting a wall: FTP and GREP are not adequate

You can GREP 1 MB in a second
You can GREP 1 GB in a minute
You can GREP 1 TB in 2 days
You can GREP 1 PB in 3 years.
Oh!, and 1 PB ~ 10,000 disks
At some point you need indices to limit search, and parallel data search and analysis tools.
This is where databases can help

You can FTP 1 MB in 1 sec
You can FTP 1 GB / min (~1 $/GB)
... 2 days and $1K
... 3 years and $1M
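
The GREP figures on this slide are round numbers, and a quick calculation shows what scan rate each one implies. The sketch below is only illustrative: decimal units (1 MB = 10^6 bytes) and the helper names `implied_mb_per_second` and `scan_days` are assumptions, not anything from the talk.

```python
# Implied sequential scan rate for each row of the slide's GREP list.
# Assumption: decimal units (1 MB = 10**6 bytes) and 365-day years.

SECONDS = {"second": 1, "minute": 60, "day": 86_400, "year": 365 * 86_400}

# (label, bytes scanned, elapsed seconds), taken from the four GREP rows.
ROWS = [
    ("1 MB in a second", 10**6, 1 * SECONDS["second"]),
    ("1 GB in a minute", 10**9, 1 * SECONDS["minute"]),
    ("1 TB in 2 days", 10**12, 2 * SECONDS["day"]),
    ("1 PB in 3 years", 10**15, 3 * SECONDS["year"]),
]


def implied_mb_per_second(size_bytes: int, elapsed_seconds: int) -> float:
    """Return the scan rate in MB/s implied by a size and an elapsed time."""
    return size_bytes / elapsed_seconds / 10**6


def scan_days(size_bytes: int, mb_per_second: float) -> float:
    """Return the days needed to scan size_bytes at a fixed rate, with no index."""
    return size_bytes / (mb_per_second * 10**6) / SECONDS["day"]


if __name__ == "__main__":
    for label, size, elapsed in ROWS:
        rate = implied_mb_per_second(size, elapsed)
        print(f"{label:18s} -> {rate:6.1f} MB/s")
    print(f"1 PB at 10 MB/s: {scan_days(10**15, 10.0):.0f} days")
```

The implied rates all land within roughly an order of magnitude of 10 MB/s, so the slide's point is the linear scaling: without an index or parallelism, a thousand times more data takes a thousand times longer.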

18
What's needed? (not drawn to scale)
19
CS Challenges For Astronomers

Objectify your field
Precisely define what you are talking about.
Objects and Methods / Attributes
This is REALLY difficult.
UCDs are a great start, but there is a long way to go
"Software is like entropy, it always increases." -- Norman Augustine, Augustine's Laws
Beware of legacy software: cost can eat you alive
Share software where possible.
Use standard software where possible.
Expect it will cost you 25% to 40% of project?
Explain what you want to do with the VO
20 queries or something like that.

20
Challenge to Data Miners: Linear and Sub-Linear Algorithms
Techniques

Today most correlation / clustering algorithms are polynomial: N^2 or N^3 or ...
N^2 is VERY big when N is big (10^18 is big)
Need sub-linear algorithms
Current approaches are near optimal given current assumptions.
So, need new assumptions, probably heuristic and approximate
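
As a toy illustration of trading exactness for cost, the sketch below estimates the fraction of point pairs closer than a radius from a fixed number of random pairs instead of all N^2 pairs. It is not a technique from the talk; the function names, the 2-D point representation, and the sample size are assumptions chosen for the example.

```python
import math
import random


def exact_close_pair_fraction(points, radius):
    """O(N^2): examine every unordered pair, return the fraction closer than radius."""
    n = len(points)
    close = 0
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) < radius:
                close += 1
    return close / (n * (n - 1) / 2)


def sampled_close_pair_fraction(points, radius, samples, rng):
    """Estimate the same fraction from `samples` random pairs.

    The cost depends on `samples`, not on N^2; the price is sampling error.
    """
    close = 0
    for _ in range(samples):
        a, b = rng.sample(points, 2)  # two distinct points
        if math.dist(a, b) < radius:
            close += 1
    return close / samples


if __name__ == "__main__":
    rng = random.Random(42)
    pts = [(rng.random(), rng.random()) for _ in range(400)]
    print("exact  :", exact_close_pair_fraction(pts, 0.2))
    print("sampled:", sampled_close_pair_fraction(pts, 0.2, 20_000, rng))
```

The estimate is approximate by construction, which matches the slide's expectation that sub-linear methods will be heuristic rather than exact.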

21
Challenge to Data Miners: Rediscover Astronomy

Astronomy needs deep understanding of physics.
But, some was discovered as variable correlations, then explained with physics.
Famous example: Hertzsprung-Russell Diagram, star luminosity vs. color (temperature)
Challenge 1 (the student test): How much of astronomy can data mining discover?
Challenge 2 (the Turing test): Can data mining discover NEW correlations?

22
Plumbers: Organize and Search Petabytes

Automate instrument-to-archive pipelines
It is a messy business, very labor intensive
Most current designs do not scale (too many manual steps)
BaBar (1 TB/day) and ESO pipeline seem promising.
A job-scheduling or workflow system
Physical database design and access
Data access patterns are difficult to anticipate
Aggressively and automatically use indexing, sub-setting.
Search in parallel
Goals
Answer easy queries in 10 seconds.
Answer hard queries (correlations) in 10 minutes.

23
Q: How can a computer scientist help, without learning a LOT of astronomy? A: Scenario design: 20 questions.

Astronomers proposed 20 questions
Typical of things they want to do
Each would require a week (or month) of programming in tcl / C++ / FTP
Goal: make it easy to answer questions
DB and tools design motivated by this goal
Implemented DB utility procedures
JHU built GUI for Linux clients

24
The 20 Queries

Q1: Find all galaxies without unsaturated pixels within 1' of a given point of ra=75.327, dec=21.023
Q2: Find all galaxies with blue surface brightness between 23 and 25 mag per square arcsecond, and -10 < super galactic latitude (sgb) < 10, and declination less than zero.
Q3: Find all galaxies brighter than magnitude 22, where the local extinction is > 0.75.
Q4: Find galaxies with an isophotal surface brightness (SB) larger than 24 in the red band, with an ellipticity > 0.5, and with the major axis of the ellipse having a declination of between 30" and 60" arc seconds.
Q5: Find all galaxies with a de Vaucouleurs profile (r^(1/4) falloff of intensity on disk) and the photometric colors consistent with an elliptical galaxy. The de Vaucouleurs profile
Q6: Find galaxies that are blended with a star, output the deblended galaxy magnitudes.
Q7: Provide a list of star-like objects that are 1% rare.
Q8: Find all objects with unclassified spectra.
Q9: Find quasars with a line width > 2000 km/s and 2.5 < redshift < 2.7.
Q10: Find galaxies with spectra that have an equivalent width in Hα > 40 Å (Hα is the main hydrogen spectral line.)
Q11: Find all elliptical galaxies with spectra that have an anomalous emission line.
Q12: Create a gridded count of galaxies with u-g > 1 and r < 21.5 over 60 < declination < 70, and 200 < right ascension < 210, on a grid of 2', and create a map of masks over the same grid.
Q13: Create a count of galaxies for each of the HTM triangles which satisfy a certain color cut, like 0.7u - 0.5g - 0.2i < 1.25 and r < 21.75, output it in a form adequate for visualization.
Q14: Find stars with multiple measurements that have magnitude variations > 0.1. Scan for stars that have a secondary object (observed at a different time) and compare their magnitudes.
Q15: Provide a list of moving objects consistent with an asteroid.
Q16: Find all objects similar to the colors of a quasar at 5.5 < redshift < 6.5.
Q17: Find binary stars where at least one of them has the colors of a white dwarf.
Q18: Find all objects within 30 arcseconds of one another that have very similar colors: that is, where the color ratios u-g, g-r, r-i are less than 0.05m.
Q19: Find quasars with a broad absorption line in their spectra and at least one galaxy within 10 arcseconds. Return both the quasars and the galaxies.
Q20: For each galaxy in the BCG data set (brightest color galaxy), in 160 < right ascension < 170, -25 < declination < 35, count the galaxies within 30" of it that have a photoz within 0.05 of that galaxy.
Also some good queries at http://www.sdss.jhu.edu/ScienceArchive/sxqt/sxQT/Example_Queries.html
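
To make one of these concrete, here is a minimal sketch of the counting step in Q12, written over an in-memory list rather than the SDSS database. The dictionary keys (`ra`, `dec`, `u`, `g`, `r`), the function name, and the reading of the grid size as 2 arcminutes are all assumptions for illustration; the flat binning also ignores the cos(dec) narrowing of right-ascension cells and the mask map the query asks for.

```python
from collections import Counter


def q12_gridded_counts(galaxies, cell_arcmin=2.0):
    """Count galaxies per (ra, dec) grid cell after the Q12 cuts.

    galaxies: iterable of dicts with hypothetical keys ra, dec (degrees)
    and u, g, r (magnitudes).
    Returns a Counter keyed by integer (ra_cell, dec_cell) indices.
    """
    cell_deg = cell_arcmin / 60.0
    counts = Counter()
    for gal in galaxies:
        # Color and magnitude cut from the query text: u-g > 1 and r < 21.5.
        if not (gal["u"] - gal["g"] > 1 and gal["r"] < 21.5):
            continue
        # Sky region from the query text.
        if not (60 < gal["dec"] < 70 and 200 < gal["ra"] < 210):
            continue
        cell = (int(gal["ra"] // cell_deg), int(gal["dec"] // cell_deg))
        counts[cell] += 1
    return counts


if __name__ == "__main__":
    sample = [
        {"ra": 200.01, "dec": 60.01, "u": 20.5, "g": 19.0, "r": 18.5},
        {"ra": 200.02, "dec": 60.02, "u": 21.0, "g": 19.5, "r": 19.0},
        {"ra": 205.00, "dec": 65.00, "u": 19.2, "g": 19.0, "r": 18.0},  # fails u-g cut
    ]
    print(q12_gridded_counts(sample))
```

In the database version the same cuts become a WHERE clause and the cell indices become GROUP BY expressions, which is the kind of query the 20-questions exercise was meant to make easy.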
25
Two kinds of SDSS data in an SQL DB (objects and images all in DB)

15M Photo Objects, 400 attributes

50K Spectra with 30 lines/spectrum
26
An Easy Query: Q15 Find asteroids

Sounds hard, but there are 5 pictures of the object at 5 different times (color filters), and so can see velocity.
Image pipeline computes velocity.
Computing it from the 5 color x,y would also be fast
Finds 1,303 objects in 3 minutes, 140 MBps (could go 2x faster with more disks)

select objId,                                         -- return object ID
       dbo.fGetUrlEq(ra,dec) as url,                  -- and url
       sqrt(power(rowv,2)+power(colv,2)) as velocity
from   photoObj                                       -- check each object.
where  (power(rowv,2) + power(colv, 2))               -- square of velocity
       between 50 and 1000                            -- huge values = error
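
The filter in this query is simple enough to restate outside SQL. The sketch below applies the same test to plain tuples; the tuple layout `(obj_id, rowv, colv)` and the function name are assumptions for illustration, and the bounds are the 50 and 1000 from the query.

```python
import math


def moving_objects(rows, min_v2=50.0, max_v2=1000.0):
    """Yield (obj_id, velocity) for rows whose squared velocity is in range.

    rows: iterable of (obj_id, rowv, colv) tuples (a hypothetical layout).
    The comparison is inclusive at both ends, like SQL BETWEEN.
    """
    for obj_id, rowv, colv in rows:
        v2 = rowv**2 + colv**2
        if min_v2 <= v2 <= max_v2:
            yield obj_id, math.sqrt(v2)


if __name__ == "__main__":
    rows = [(1, 0.0, 0.0), (2, 6.0, 8.0), (3, 100.0, 0.0)]
    print(list(moving_objects(rows)))
```

Filtering on the squared velocity avoids a square root for rejected rows; the upper bound drops the very large values that the query comment treats as measurement errors.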
27
Q15 Fast Moving Objects

Find near earth asteroids

SELECT r.objID as rId, g.objId as gId,
       dbo.fGetUrlEq(g.ra, g.dec) as url
FROM   PhotoObj r, PhotoObj g
WHERE  r.run = g.run and r.camcol = g.camcol
  and abs(g.field-r.field) < 2                        -- nearby
  -- the red selection criteria
  and ((power(r.q_r,2) + power(r.u_r,2)) > 0.111111)
  and r.fiberMag_r between 6 and 22
  and r.fiberMag_r < r.fiberMag_g
  and r.fiberMag_r < r.fiberMag_i
  and r.parentID = 0
  and r.fiberMag_r < r.fiberMag_u
  and r.fiberMag_r < r.fiberMag_z
  and r.isoA_r/r.isoB_r > 1.5
  and r.isoA_r > 2.0
  -- the green selection criteria
  and ((power(g.q_g,2) + power(g.u_g,2)) > 0.111111)
  and g.fiberMag_g between 6 and 22
  and g.fiberMag_g < g.fiberMag_r
  and g.fiberMag_g < g.fiberMag_i
  and g.fiberMag_g < g.fiberMag_u
  and g.fiberMag_g < g.fiberMag_z
  and g.parentID = 0
  and g.isoA_g/g.isoB_g > 1.5
  and g.isoA_g > 2.0
  -- the matchup of the pair
  and sqrt(power(r.cx-g.cx,2) + power(r.cy-g.cy,2) + power(r.cz-g.cz,2)) * (10800/PI()) < 4.0
  and abs(r.fiberMag_r-g.fiberMag_g) < 2.0
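
The pair-matching term in this query works on unit vectors (cx, cy, cz): the straight-line distance between two such vectors is, for small angles, the separation in radians, and 10800/pi converts radians to arcminutes (180 * 60 / pi). The sketch below reproduces that arithmetic; the function names are assumptions, and it presumes cx, cy, cz are the Cartesian components of the unit vector for (ra, dec).

```python
import math


def unit_vector(ra_deg, dec_deg):
    """Return the (cx, cy, cz) unit vector for a sky position in degrees."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (
        math.cos(dec) * math.cos(ra),
        math.cos(dec) * math.sin(ra),
        math.sin(dec),
    )


def chord_arcmin(a, b):
    """Chord length between two unit vectors, scaled by 10800/pi to arcminutes."""
    return math.dist(a, b) * (10800 / math.pi)


if __name__ == "__main__":
    red = unit_vector(10.0, 0.0)
    green = unit_vector(10.0, 3.0 / 60.0)  # 3 arcminutes north of the first point
    print(chord_arcmin(red, green))         # close to 3, so inside the 4.0 cut
```

Comparing Cartesian distances keeps the match free of trigonometric calls per pair and avoids the wrap-around problems that differences in right ascension would bring.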
28
(No Transcript)
29
Data Visualization (and human-computer interface)