Title: Four Talks
1. Four Talks
Jim Gray, Microsoft
Alex Szalay, Ani Thakar, Jan Vandenberg, JHU
Chris Stoughton, Fermilab
- The article we actually wrote: Online Scientific Publication, Curation, Archiving
- The advertised talk: Computer Science Challenges in the VO
- A Web Server (SkyServer) tour
- A Web Service (SdssCutout) tour
- These lead up to Alex Szalay's talk on Web Services
2. The Paper We Wrote: Online Scientific Publication, Curation, Archiving
- Jim Gray, Microsoft
- Alex Szalay, Ani Thakar, Jan Vandenberg, JHU
- Chris Stoughton, Fermilab
- The Virtual Observatory will be an ecosystem of authors, curators, publishers, archivers, and readers, all contributing using shared data.
- The process and roles: author, curator, publisher, and archiver roles are changing.
- Ephemeral vs. derived data: must capture ephemeral information. All design metadata is ephemeral. Can trade off recomputing derived data against storing it.
- Economics: publish/archive cost is zero; author/curator cost dominates.
- SDSS: data inflation, and what we are doing.
- Once published, scientific data needs to be available forever, so that the science can be reproduced and extended.
- What does that mean?
- Data
- Ephemeral data cannot be reproduced.
- Stable data can be derived from ephemeral data.
- Metadata (how the data was collected or derived) is ephemeral.
- It must be preserved.
- It includes design docs, software, email, publications, and personal notes.
5. Changing Roles
- Exponential growth
- Projects last at least 3-5 years
- Project data is online during the project lifetime.
- Data is sent to a central archive only at the end of the project.
- At any instant, only 1/8 of the data is centralized.
- New project responsibilities
- Becoming publishers and curators
- Larger fraction of budget spent on software
- Standards are needed
- Easier data interchange, fewer tools
- Templates are needed
- Much development is duplicated and wasted
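The 1/8 figure can be reproduced with a back-of-the-envelope model. This is only a sketch under assumptions that are not stated on the slide: total data volume doubles every year, and a project hands its data to the central archive only when it ends, after about three years. The function name and parameters are hypothetical.

```python
def centralized_fraction(doubling_time_years: float, project_years: float) -> float:
    """Fraction of all data collected so far that is older than one project lifetime.

    Assumes total volume grows as 2 ** (t / doubling_time_years), so the data
    that existed project_years ago is 2 ** (-project_years / doubling_time_years)
    of today's total. Only that older data is assumed to be in the archive.
    """
    return 2.0 ** (-project_years / doubling_time_years)


# Yearly doubling (an assumption) with 3, 4, and 5 year projects.
for years in (3, 4, 5):
    print(years, centralized_fraction(1.0, years))
```

With yearly doubling and three-year projects the model gives exactly 1/8; longer projects leave an even smaller share in the central archive.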
6. Publishing Data
Roles        Traditional   Emerging
Authors      Scientists    Collaborations
Publishers   Journals      Project web site
Curators     Libraries     Data+Doc Archives
Archives     Archives      Digital Archives
Consumers    Scientists    Scientists
7. The Core Problem: No Economic Model
- The archive user has not yet been born. How can he pay you to curate the data?
- The scientist gathered data for his own purpose. Why should he pay (invest time) for your needs?
- Answer to both: that's the scientific method.
- Curating data (documenting the design, the acquisition, and the processing) is very hard, and there is no reward for doing it. The results are rewarded, not the process of getting them.
- Storage/archive is NOT the problem (it's almost free).
- Curating/publishing is expensive.
8. What SDSS is Doing: Capture the Bits
- Best-effort documenting of data and process.
- Publishing data often by UPS (about 5 TB today (DR1), and so about $15k for a copy).
- Replicating data on 3 continents.
- EVERYTHING online (tape data is dead data).
- Archiving all email, discussions, ...
- Keeping all web logs.
- Now we need to figure out how to organize/search all this metadata.
9. SDSS Data Inflation: Data Pyramid
- Level 1A (in NASA terms): grows 5 TB of pixels/year, growing to 25 TB; about 2 TB/year compressed, growing to 13 TB; about 4 TB today.
- Level 2: derived data products are about 10x smaller, but there are many catalogs.
- Publish a new edition each year.
- Fixes bugs in data.
- Must preserve old editions.
- Creates a data pyramid.
- Store each edition.
- Editions of size 1, 2, 3, 4, ..., N total about N^2 bytes.
- Net data inflation: L2 >= L1.
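The pyramid arithmetic can be checked with a few lines. This is a sketch, assuming edition k holds k units of data (one unit per year of survey) and that every edition is kept; the function name is hypothetical. The exact sum is N(N+1)/2, which grows like N^2 as the slide says, with a constant factor of about one half.

```python
def pyramid_total(n_editions: int, first_edition_size: float = 1.0) -> float:
    """Total storage when edition k is k times the first edition and all are kept."""
    return sum(k * first_edition_size for k in range(1, n_editions + 1))


for n in (1, 2, 5, 10):
    # Second column: summed editions. Third column: closed form n(n+1)/2.
    print(n, pyramid_total(n), n * (n + 1) / 2)
```

So ten yearly editions cost 55 times the first one, even though the newest edition is only 10 times larger.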
- The Virtual Observatory will be an ecosystem of authors, curators, publishers, archivers, and readers, all contributing using shared data.
- The process and roles are changing: author, project, publisher, curator.
- Ephemeral vs. stable data: capture ephemeral information. All design metadata is ephemeral. Can trade off recomputing derived data against storing it.
- Economics: author/curate cost dominates.
- SDSS: data inflation, data pyramid.
12. The Advertised Talk: Computer Science Challenges in the VO
13. Virtual Observatory
- Premise: most data is (or could be) online.
- So, the Internet is the world's best telescope:
- It has data on every part of the sky
- In every measured spectral band: optical, x-ray, radio, ...
- As deep as the best instruments (2 years ago).
- It is up when you are up. The seeing is always great (no working at night, no clouds, no moons, no ...).
- It's a smart telescope: it links objects and data to the literature on them.
14. Virtual Observatory: Data Federation of Web Services
- Massive datasets live near their owners
- Near the instrument's software pipeline
- Near the applications
- Near data knowledge and curation
- Computer centers become Data Centers
- Archives are replicated for
- Performance
- Availability/Reliability
- Each Archive publishes a web service
- Schema documents the data
- Methods on objects (queries)
- Scientists get personalized extracts
- Uniform access to multiple Archives
- A common global schema
15. Some Unique Things About Astro Data
- There is a desire to compare data from different instruments
- Most astronomers publish their data (especially surveys)
- Combining data from different instruments gives more info
- Szalay observes Metcalfe's law: utility grows as N^2
- This is less true in some other fields
- It's tractable
- Sizes fit in current regimes (10s of terabytes today)
- Tasks fit Beowulfs
- Astro data is a great sandbox for CS research.
- High-dimensional data
- Temporal, spatial, image datatypes
- Few privacy/commercial concerns
- There is lots of it
16. My #1 Challenge: going beyond files (a file is an array of bytes). Science vs. Commerce
- Science: data in files. FTP a local copy/subset, ASCII or binary.
- Each scientist builds own analysis toolkit
- Analysis is a tcl script of toolkit on local data.
- Some simple visualization tools: x vs y
- Commerce: data in a database
- Standard reports for standard things.
- Report writers for non-standard things
- GUI tools to explore data.
- Decision trees
- Clustering
- Anomaly finders
17. But some science is hitting a wall: FTP and GREP are not adequate
- You can GREP 1 MB in a second
- You can GREP 1 GB in a minute
- You can GREP 1 TB in 2 days
- You can GREP 1 PB in 3 years.
- Oh, and 1 PB is about 10,000 disks
- At some point you need indices to limit search, plus parallel data search and analysis tools.
- This is where databases can help
- You can FTP 1 MB in 1 sec
- You can FTP 1 GB per minute (about $1/GB)
- 1 TB: 2 days and $1K
- 1 PB: 3 years and $1M
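A rough model shows why indices and parallel search matter at these sizes. This is a sketch only: the 10 MB/s per-disk scan rate is an assumption chosen to land near the slide's "1 PB in 3 years", and the function name and parameters are hypothetical.

```python
SECONDS_PER_DAY = 86_400
PB = 1e15  # bytes


def scan_seconds(num_bytes: float, bytes_per_second: float = 10e6,
                 disks: int = 1, fraction_read: float = 1.0) -> float:
    """Time to scan data sequentially.

    bytes_per_second is an assumed per-disk scan rate, disks models scanning
    all disks in parallel, and fraction_read models an index that lets the
    search skip most of the data.
    """
    return num_bytes * fraction_read / (bytes_per_second * disks)


# Serial scan of 1 PB, in years.
print(scan_seconds(PB) / SECONDS_PER_DAY / 365)
# The same scan spread over 10,000 disks, in hours.
print(scan_seconds(PB, disks=10_000) / 3600)
# Parallel scan plus an index that touches 0.1% of the data, in seconds.
print(scan_seconds(PB, disks=10_000, fraction_read=0.001))
```

Under these assumptions the serial scan takes about three years, the parallel scan under three hours, and the indexed parallel scan about ten seconds.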
18. What's needed? (not drawn to scale)
19. CS Challenges For Astronomers
- Objectify your field
- Precisely define what you are talking about.
- Objects and Methods / Attributes
- This is REALLY difficult.
- UCDs are a great start, but there is a long way to go
- "Software is like entropy, it always increases." -- Norman Augustine, Augustine's Laws
- Beware of legacy software: cost can eat you alive
- Share software where possible.
- Use standard software where possible.
- Expect it will cost you 25% to 40% of the project (?)
- Explain what you want to do with the VO
- 20 queries or something like that.
20. Challenge to Data Miners: Linear and Sub-Linear
- Today most correlation / clustering algorithms are polynomial: N^2 or N^3 or ...
- N^2 is VERY big when N is big (10^18 is big)
- Need sub-linear algorithms
- Current approaches are near optimal given current assumptions.
- So, need new assumptions, probably heuristic and
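To make the N^2 problem concrete: with N = 10^9 objects, comparing every pair means roughly 10^18 distance tests. The sketch below contrasts brute-force close-pair counting with a uniform grid that tests only neighbouring cells. The grid is a standard spatial-hashing trick swapped in for illustration, not a method from the talk, and it is near-linear rather than sub-linear. All names are hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations


def close_pairs_brute(points, radius):
    """Compare every pair: N * (N - 1) / 2 distance tests."""
    r2 = radius * radius
    return sum(1 for (x1, y1), (x2, y2) in combinations(points, 2)
               if (x1 - x2) ** 2 + (y1 - y2) ** 2 <= r2)


def close_pairs_grid(points, radius):
    """Bucket points into radius-sized cells and test only adjacent cells."""
    r2 = radius * radius
    cells = defaultdict(list)
    for i, (x, y) in enumerate(points):
        cells[(int(x // radius), int(y // radius))].append(i)
    count = 0
    for (cx, cy), members in cells.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                # .get avoids creating empty cells while iterating.
                for j in cells.get((cx + dx, cy + dy), ()):
                    for i in members:
                        if i < j:  # count each unordered pair once
                            (x1, y1), (x2, y2) = points[i], points[j]
                            if (x1 - x2) ** 2 + (y1 - y2) ** 2 <= r2:
                                count += 1
    return count


random.seed(0)
sample = [(random.random(), random.random()) for _ in range(300)]
print(close_pairs_brute(sample, 0.05), close_pairs_grid(sample, 0.05))
```

Both functions return the same count; the grid version does work proportional to the number of nearby points rather than to all pairs, which is the kind of indexing the slide argues for.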
21. Challenge to Data Miners: Rediscover Astronomy
- Astronomy needs deep understanding of physics.
- But, some was discovered as variable correlations, then explained with physics.
- Famous example: Hertzsprung-Russell diagram, star luminosity vs color (temperature)
- Challenge 1 (the student test): How much of astronomy can data mining discover?
- Challenge 2 (the Turing test): Can data mining discover NEW correlations?
22. Plumbers: Organize and Search Petabytes
- Automate
- Instrument-to-archive pipelines: it is a messy business, very labor intensive. Most current designs do not scale (too many manual steps). BaBar (1 TB/day) and ESO pipeline seem promising.
- A job-scheduling or workflow system
- Physical database design and access
- Data access patterns are difficult to anticipate
- Aggressively and automatically use indexing, sub-setting.
- Search in parallel
- Goals
- Answer easy queries in 10 seconds.
- Answer hard queries (correlations) in 10 minutes.
23. Q: How can a computer scientist help, without learning a LOT of Astronomy? A: Scenario design: 20 questions.
- Astronomers proposed 20 questions
- Typical of things they want to do
- Each would require a week (or month) of programming in tcl / C++ / FTP
- Goal: make it easy to answer questions
- DB and tools design motivated by this goal
- Implemented DB utility procedures
- JHU built GUI for Linux clients
24. The 20 Queries
- Q1: Find all galaxies without saturated pixels within 1' of a given point of ra=75.327, dec=21.023.
- Q2: Find all galaxies with blue surface brightness between 23 and 25 mag per square arcsecond, and -10 < super galactic latitude (sgb) < 10, and declination less than zero.
- Q3: Find all galaxies brighter than magnitude 22, where the local extinction is > 0.75.
- Q4: Find galaxies with an isophotal surface brightness (SB) larger than 24 in the red band, with an ellipticity > 0.5, and with the major axis of the ellipse having a declination of between 30 and 60 arcseconds.
- Q5: Find all galaxies with a de Vaucouleurs profile (r^(1/4) falloff of intensity on disk) and the photometric colors consistent with an elliptical galaxy.
- Q6: Find galaxies that are blended with a star, output the deblended galaxy magnitudes.
- Q7: Provide a list of star-like objects that are 1% rare.
- Q8: Find all objects with unclassified spectra.
- Q9: Find quasars with a line width > 2000 km/s and 2.5 < redshift < 2.7.
- Q10: Find galaxies with spectra that have an equivalent width in Hα > 40 Å (Hα is the main hydrogen spectral line).
- Q11: Find all elliptical galaxies with spectra that have an anomalous emission line.
- Q12: Create a gridded count of galaxies with u-g > 1 and r < 21.5 over 60 < declination < 70, and 200 < right ascension < 210, on a grid of 2', and create a map of masks over the same grid.
- Q13: Create a count of galaxies for each of the HTM triangles which satisfy a certain color cut, like 0.7u - 0.5g - 0.2i < 1.25 and r < 21.75, output it in a form adequate for visualization.
- Q14: Find stars with multiple measurements that have magnitude variations > 0.1. Scan for stars that have a secondary object (observed at a different time) and compare their magnitudes.
- Q15: Provide a list of moving objects consistent with an asteroid.
- Q16: Find all objects similar to the colors of a quasar at 5.5 < redshift < 6.5.
- Q17: Find binary stars where at least one of them has the colors of a white dwarf.
- Q18: Find all objects within 30 arcseconds of one another that have very similar colors: that is, where the color ratios u-g, g-r, r-i are less than 0.05m.
- Q19: Find quasars with a broad absorption line in their spectra and at least one galaxy within 10 arcseconds. Return both the quasars and the galaxies.
- Q20: For each galaxy in the BCG data set (brightest color galaxy), in 160 < right ascension < 170, -25 < declination < 35, count the galaxies within 30" of it that have a photoz within 0.05 of that galaxy.
Also some good queries at http://www.sdss.jhu.edu
25. Two kinds of SDSS data in an SQL DB (objects and images all in DB)
- 15M photo objects, 400 attributes
- 50K spectra with 30 lines/spectrum
26. An Easy Query: Q15 Find asteroids
- Sounds hard, but there are 5 pictures of the object at 5 different times (color filters), and so can see velocity.
- Image pipeline computes velocity.
- Computing it from the 5 color x,y would also be fast
- Finds 1,303 objects in 3 minutes, 140 MBps. (could go 2x faster with more disks)
select objId, dbo.fGetUrlEq(ra,dec) as url,              -- return object ID and url
       sqrt(power(rowv,2) + power(colv,2)) as velocity
from photoObj                                            -- check each object.
where (power(rowv,2) + power(colv,2))                    -- square of velocity
      between 50 and 1000                                -- huge values = error
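The velocity filter can be tried on a toy table. This is a sketch, not the SkyServer setup: it uses Python's built-in sqlite3 instead of SQL Server, the three rows are made up, the url column is dropped because dbo.fGetUrlEq is a SkyServer function, and power() and sqrt() are registered by hand because a given SQLite build may not provide them.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# Register power() and sqrt() so the query reads like the one on the slide.
conn.create_function("power", 2, lambda x, p: x ** p)
conn.create_function("sqrt", 1, math.sqrt)

conn.execute("create table photoObj (objId integer primary key, rowv real, colv real)")
# Made-up rows: a static object, a plausible mover, and a bogus huge velocity.
conn.executemany("insert into photoObj values (?, ?, ?)",
                 [(1, 0.1, 0.2), (2, 6.0, 8.0), (3, 900.0, 900.0)])

movers = conn.execute("""
    select objId, sqrt(power(rowv, 2) + power(colv, 2)) as velocity
    from photoObj
    where power(rowv, 2) + power(colv, 2) between 50 and 1000
""").fetchall()
print(movers)
```

Only object 2 survives: its squared velocity is 100, inside the 50 to 1000 window, while the static object falls below it and the huge value is rejected as an error.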
27. Q15: Fast Moving Objects
- Find near earth asteroids
SELECT r.objID as rId, g.objId as gId,
       dbo.fGetUrlEq(g.ra, g.dec) as url
FROM PhotoObj r, PhotoObj g
WHERE r.run = g.run and r.camcol = g.camcol
  and abs(g.field - r.field) < 2                       -- nearby
  -- the red selection criteria
  and ((power(r.q_r,2) + power(r.u_r,2)) > 0.111111)
  and r.fiberMag_r between 6 and 22
  and r.fiberMag_r < r.fiberMag_g
  and r.fiberMag_r < r.fiberMag_i
  and r.parentID = 0
  and r.fiberMag_r < r.fiberMag_u
  and r.fiberMag_r < r.fiberMag_z
  and r.isoA_r/r.isoB_r > 1.5
  and r.isoA_r > 2.0
  -- the green selection criteria
  and ((power(g.q_g,2) + power(g.u_g,2)) > 0.111111)
  and g.fiberMag_g between 6 and 22
  and g.fiberMag_g < g.fiberMag_r
  and g.fiberMag_g < g.fiberMag_i
  and g.fiberMag_g < g.fiberMag_u
  and g.fiberMag_g < g.fiberMag_z
  and g.parentID = 0
  and g.isoA_g/g.isoB_g > 1.5
  and g.isoA_g > 2.0
  -- the matchup of the pair
  and sqrt(power(r.cx - g.cx,2) + power(r.cy - g.cy,2) + power(r.cz - g.cz,2)) * (10800/PI()) < 4.0
  and abs(r.fiberMag_r - g.fiberMag_g) < 2.0
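The pair matchup turns a distance between two points into an angle. A minimal sketch of that conversion, assuming cx, cy, cz are the components of a unit vector pointing at (ra, dec): for small separations the chord length is nearly the angle in radians, and multiplying by 10800/pi converts radians to arcminutes. The helper names are hypothetical.

```python
import math


def unit_vector(ra_deg: float, dec_deg: float):
    """Cartesian unit vector (cx, cy, cz) for a sky position given in degrees."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra),
            math.cos(dec) * math.sin(ra),
            math.sin(dec))


def separation_arcmin(a, b) -> float:
    """Chord length between two unit vectors, scaled from radians to arcminutes."""
    return math.dist(a, b) * (10800 / math.pi)


red = unit_vector(180.0, 30.0)
green = unit_vector(180.0, 30.0 + 3.0 / 60.0)  # 3 arcminutes further north
print(separation_arcmin(red, green))           # close to 3, so it passes the < 4.0 cut
```

The chord approximation slightly underestimates the true angle, but at a few arcminutes the difference is far below the precision the query needs.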
29. Data Visualization (and human-computer interface)
- Make it easy to ask questions
- Make it easy to understand the answers.
- Bad news: we have had no takers on the visualization 20 questions
- This is still a VERY retro area.
- But the following demos show some progress.
31. SkyServer Tour: http://skyserver.sdss.org/
- Shows benefit of a database
- Everything online
- Easy to find things: index helps
- Automatic parallel search is essential
- Beware
- I'm a lunatic re using databases for everything
- Most people do not put images in DB
- I do, because it is
- Simpler
- Easier to manage
- The right thing to do.
33. What's a Web Service
- Web server: given a URL + parameters
- Returns a web page (often dynamic)
- Web service: given an XML document (SOAP msg)
- Returns an XML document
- Tools make this look like an RPC.
- F(x,y,z) returns (u, v, w)
- Distributed objects for the web.
- Plus naming, discovery, security, ...
- Internet-scale distributed computing
[Diagram: your program sends an HTTP URL to a Web Server and gets back a web page; your program sends a request to a Web Service and gets back an object in XML, which becomes data in your address space.]
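The "looks like an RPC" idea can be sketched without a network. This toy is not SOAP and not the SdssCutout interface: the XML layout, the fake_service function, and the arithmetic it performs are all made up for illustration. It only shows the shape of the exchange: arguments go out as an XML document, results come back as an XML document, and the caller sees a plain function.

```python
import xml.etree.ElementTree as ET


def encode_call(method: str, **args) -> str:
    """Client side: turn F(x, y, z) into an XML request document."""
    root = ET.Element("call", {"method": method})
    for name, value in args.items():
        ET.SubElement(root, "arg", {"name": name}).text = repr(value)
    return ET.tostring(root, encoding="unicode")


def fake_service(request_xml: str) -> str:
    """Stand-in for a remote service: parse the request, compute, reply in XML."""
    call = ET.fromstring(request_xml)
    args = {a.get("name"): float(a.text) for a in call.findall("arg")}
    result = ET.Element("result")
    ET.SubElement(result, "u").text = repr(args["x"] + args["y"])
    ET.SubElement(result, "v").text = repr(args["y"] * args["z"])
    ET.SubElement(result, "w").text = repr(args["x"] - args["z"])
    return ET.tostring(result, encoding="unicode")


def F(x: float, y: float, z: float):
    """What a generated proxy looks like to the caller: an ordinary function."""
    reply = ET.fromstring(fake_service(encode_call("F", x=x, y=y, z=z)))
    return tuple(float(reply.find(tag).text) for tag in ("u", "v", "w"))


print(F(1.0, 2.0, 3.0))
```

In a real deployment the call to fake_service would be an HTTP request, and the proxy code would be generated by tools from the service's published description rather than written by hand.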
34. Data Federations of Web Services
- Massive datasets live near their owners
- Near the instrument's software pipeline
- Near the applications
- Near data knowledge and curation
- Super Computer centers become Super Data Centers
- Each Archive publishes web services
- Schema documents the data
- Methods on objects (queries)
- Scientists get personalized extracts
- Uniform access to multiple Archives
- A common global schema
35. Grid and Web Services Synergy
- I believe the Grid will be many web services
- IETF standards provide
- Naming
- Authorization / Security / Privacy
- Distributed Objects
- Discovery, Definition, Invocation, Object Model
- Higher level services: workflow, transactions, DB, ...
- Synergy: commercial Internet and Grid tools
36. SDSS Cutout: http://SkyService.pha.jhu.edu/SdssCuto
- A simple web service
- You can have a copy of the code
- Needs an online database backend
38. References and Links
- SkyServer
- http://skyserver.sdss.org/
- http://SkyService.pha.jhu.edu/SdssCutout/
- Virtual Observatory
- http://www.us-vo.org/
- http://www.voforum.org/
- World-Wide Telescope
- Paper in Science, V. 293, pp. 2037-2038, 14 Sept 2001 (MS-TR-2001-77, Word or PDF).
- SDSS DB
- Get your personal copy at http://research.microsof