Title: The Sloan Digital Sky Survey
1 The Sloan Digital Sky Survey
- Alex Szalay
- Department of Physics and Astronomy
- The Johns Hopkins University
2 The Sloan Digital Sky Survey
A project run by the Astrophysical Research Consortium (ARC):
The University of Chicago, Princeton University, The Johns Hopkins University, The University of Washington, Fermi National Accelerator Laboratory, US Naval Observatory, The Japanese Participation Group, The Institute for Advanced Study, Max Planck Institute, Heidelberg
Funded by the Sloan Foundation, NSF, DOE, NASA
Goal: to create a detailed multicolor map of the Northern Sky over 5 years, with a budget of approximately $80M
Data size: 40 TB raw, 2 TB processed
3 Scientific Motivation
Create the ultimate map of the Universe: the Cosmic Genome Project!
Study the distribution of galaxies
- What is the origin of fluctuations?
- What is the topology of the distribution?
Measure the global properties of the Universe
- How much dark matter is there?
Local census of the galaxy population
- How did galaxies form?
Find the most distant objects in the Universe
- What are the highest quasar redshifts?
4 Cosmology Primer
The Universe is expanding: the galaxies move away from us, so their spectral lines are redshifted
v = H0 r (Hubble's law)
The fate of the universe depends on the balance between gravity and the expansion velocity
Ω = density/critical density; if Ω < 1, the Universe expands forever
Most of the mass in the Universe is dark matter, and it may be cold (CDM): Ωdark > Ωluminous
The spatial distribution of galaxies is correlated, due to small ripples in the early Universe
P(k): the power spectrum
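Hubble's law is simple enough to check numerically. A minimal sketch, assuming an illustrative H0 = 70 km/s/Mpc (inside the 55-75 range quoted on the next slide):

```python
# Hubble's law: v = H0 * r.
# H0 = 70 km/s/Mpc is an assumed illustrative value, not a measurement.
def recession_velocity(r_mpc, h0=70.0):
    """Recession velocity in km/s for a galaxy at distance r_mpc (in Mpc)."""
    return h0 * r_mpc

# A galaxy 100 Mpc away recedes at about 7000 km/s
print(recession_velocity(100.0))  # 7000.0
```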
5 The Naught Problem
What are the global parameters of the Universe?
H0, the Hubble constant: 55-75 km/s/Mpc
Ω0, the density parameter: 0.25-1
Λ0, the cosmological constant: 0-0.7
Their values are still quite uncertain today...
Goal: measure these parameters with an accuracy of a few percent
High Precision Cosmology!
6 The Cosmic Genome Project
The SDSS will create the ultimate map of the Universe, with much more detail than any other measurement before
7 Area and Size of Redshift Surveys
8 Clustering of Galaxies
We will measure the spectrum of the density fluctuations to high precision, even on very large scales
The error in the amplitude of the fluctuation spectrum:
1970: x100
1990: x2
1995: 0.4
1998: 0.2
1999: 0.1
2002: 0.05
9 Relevant Scales
Distances are measured in Mpc (megaparsec): 1 Mpc = 3 x 10^24 cm
5 Mpc: typical distance between galaxies
3000 Mpc: scale of the Universe
If λ > 200 Mpc, fluctuations have a PRIMORDIAL shape; if λ < 100 Mpc, gravity creates sharp features, like walls, filaments and voids
Biasing: the conversion of mass into light is nonlinear, so light is much more clumpy than the mass
10 The Topology of the Local Universe
Measure the topology of the Universe: does it consist of walls and voids, or is it randomly distributed?
11 Finding the Most Distant Objects
Intermediate and high redshift QSOs
Multicolor selection function
Luminosity functions and spatial clustering
High redshift QSOs (z > 5)
12 Features of the SDSS
Special 2.5m telescope, located at Apache Point, NM: 3 degree field of view, zero distortion focal plane
Two surveys in one: a photometric survey in 5 bands and a spectroscopic redshift survey
Huge CCD mosaic: 30 CCDs 2K x 2K (imaging), 22 CCDs 2K x 400 (astrometry)
Two high resolution spectrographs: 2 x 320 fibers with 3 arcsec diameter, R = 2000 resolution with 4096 pixels, spectral coverage from 3900 Å to 9200 Å
Automated data reduction: over 100 man-years of development effort (Fermilab collaboration scientists)
Very high data volume: expect over 40 TB of raw data, about 2 TB of processed products
Data made available to the public
13 Apache Point Observatory
Located in New Mexico, near White Sands National Monument
14 The Telescope
Special 2.5m telescope
3 degree field of view
Zero distortion focal plane
Wind screen moved separately
15 The Photometric Survey
Northern Galactic Cap:
5 broad-band filters (u', g', r', i', z')
limiting magnitudes (22.3, 23.3, 23.1, 22.3, 20.8)
drift scan of 10,000 square degrees
55 sec exposure time
40 TB raw imaging data → pipeline → 100,000,000 galaxies, 50,000,000 stars
calibration to 2% at r' = 19.8
only done in the best seeing (20 nights/yr)
pixel size is 0.4 arcsec, astrometric precision is 60 milliarcsec
Southern Galactic Cap:
multiple scans (> 30 times) of the same stripe
Continuous data rate of 8 Mbytes/sec
16 Survey Strategy
Overlapping 2.5 degree wide stripes
Avoiding the Galactic Plane (dust)
Multiple exposures on the three Southern stripes
17 The Spectroscopic Survey
Measure redshifts of objects → distance
SDSS Redshift Survey: 1 million galaxies, 100,000 quasars, 100,000 stars
Two high throughput spectrographs: spectral range 3900-9200 Å, 640 spectra simultaneously, R = 2000 resolution
Automated reduction of spectra
Very high sampling density and completeness
Objects in other catalogs also targeted
18 Optimal Tiling
Fields have 3 degree diameter
Centers determined by an optimization procedure
A total of 2200 pointings
640 fibers assigned simultaneously
19 The Mosaic Camera
20 Photometric Calibrations
The SDSS will create a new photometric system: u' g' r' i' z'
Primary standards observed with the USNO 40-inch telescope in Flagstaff
Secondary standards observed with the SDSS 20-inch telescope at Apache Point, calibrating the SDSS imaging data
21 The Spectrographs
Two double spectrographs
very high throughput
two 2048x2048 CCD detectors
mounted on the telescope
light fed through the slithead
22 The Fiber Feed System
Galaxy images are captured by optical fibers lined up on the spectrograph slit
Manually plugged during the day into Al plugboards
640 fibers in each bundle
The largest fiber system today
23 First Light Images
Telescope first light: May 9th, 1998
Equatorial scans
24 The First Stripes
Camera: 5 color imaging of > 100 square degrees
Multiple scans across the same fields
Photometric limits as expected
25 NGC 2068
26 UGC 3214
27 NGC 6070
28 The First Quasars
The four highest redshift quasars have been found in the first SDSS test data!
29 Methane/T Dwarf
- Discovery of several new objects by SDSS and 2MASS
30 Detection of Gravitational Lensing
28,000 foreground galaxies and 2,045,000 background galaxies in test data (McKay et al. 1999)
31 SDSS Data Flow
32 Distributed Collaboration
Sites linked over ESNET and VBNS: Fermilab, U. Chicago, U. Washington, Institute for Advanced Study, Japan, Princeton U., JHU, Apache Point Observatory, USNO, NMSU
33 Data Processing Pipelines
34 Concept of the SDSS Archive
Science Archive (products accessible to users)
Operational Archive (raw and processed data)
35 SDSS Data Products
Object catalog: 400 GB, parameters of > 10^8 objects
Redshift catalog: 1 GB, parameters of 10^6 objects
Atlas images: 1.5 TB, 5 color cutouts of > 10^8 objects
Spectra: 60 GB, in a one-dimensional form
Derived catalogs: 20 GB - clusters, QSO absorption lines
4x4 pixel all-sky map: 60 GB, heavily compressed
All raw data saved in a tape vault at Fermilab
36 Who will be using the archive?
Power users: sophisticated, with lots of resources; research is centered around the archive data; a moderate number of very intensive queries, mostly statistical, with large output sizes
General astronomy public: frequent but casual lookup of objects/regions; the archives help their research, but are not central to it; a large number of small queries, with a lot of cross-identification requests
Wide public: browsing a Virtual Telescope can have large public appeal; needs special packaging; could be a very large number of requests
37 How will the data be analyzed?
The data are inherently multidimensional → positions, colors, size, redshift
Improved classifications result in complex N-dimensional volumes → complex constraints, not ranges
Spatial relations will be investigated → nearest neighbors, other objects within a radius
Data mining: finding the needle in the haystack → separate typical from rare, recognize patterns in the data
Output size can be prohibitively large for intermediate files → import output directly into analysis tools
38 Geometric Approach
- The Main Problem
- fast, indexed, complex searches of Terabytes in k-dim space
- searches are not necessarily parallel to the axes → traditional indexing (B-tree) does not work
- Geometric Approach
- use the geometric nature of the k-dimensional data
- quantize data into containers of "friends": objects of similar colors, close on the sky, are stored together → efficient cache performance
- containers represent a coarse grained density map of the data: a multidimensional index tree (k-d tree, R-tree)
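The container idea above can be sketched in a few lines. This is my own illustration of a median-split k-d partition, not the SDSS Science Archive code:

```python
# Sketch of the "containers of friends" idea: points in k-dim space
# (colors, sky position) are split along the median of alternating axes
# until each container holds at most leaf_size objects, so similar
# objects end up stored together.
def build_containers(points, depth=0, leaf_size=4):
    """Recursively partition points (tuples of floats) into k-d containers."""
    if len(points) <= leaf_size:
        return [points]                      # one container of "friends"
    k = len(points[0])
    axis = depth % k                         # alternate the split axis
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2                      # median split -> balanced tree
    return (build_containers(pts[:mid], depth + 1, leaf_size) +
            build_containers(pts[mid:], depth + 1, leaf_size))

# 16 points in 2-d split into 4 balanced containers of 4 points each
pts = [(float(i % 4), float(i // 4)) for i in range(16)]
containers = build_containers(pts)
print(len(containers))  # 4
```

A real index would also store the bounding box of each container, as slide 47 notes, so queries can skip containers that cannot intersect the query region.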
39 Organization of Searches
Queries are inherently geometric: the primitive constraint is a half-space formed by a linear combination → a k-dimensional hyperplane
Boolean combinations are allowed: the constraints form k-dimensional polyhedra
Queries are run on the coarse grained map: determine the intersections of the index tree and the query polyhedron
A list of containers is prepared for the query: projections of the full query time and output volume are created
The list of containers and the query are sent to the Search Engine: the actual searches are quantized by containers
Searches can be optimized and executed in parallel
40 Geometric Indexing
Divide and conquer partitioning
Hierarchical Triangular Mesh
Split as a k-d tree, stored as an R-tree of bounding boxes
Using regular indexing techniques
41 Sky Coordinates
Stored as Cartesian coordinates projected onto a unit sphere
Longitude and latitude lines: intersections of planes and the sphere
Boolean combinations: a query polyhedron
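The scheme above can be sketched directly, with function names of my own choosing: a sky position becomes a unit vector, and a latitude-line constraint becomes a half-space test against a plane cutting the sphere.

```python
# Sky positions stored as unit vectors; a "Dec > d" constraint becomes
# the half-space test n . x > dist against a plane slicing the sphere.
import math

def radec_to_xyz(ra_deg, dec_deg):
    """Unit-sphere Cartesian coordinates for (RA, Dec) in degrees."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra),
            math.cos(dec) * math.sin(ra),
            math.sin(dec))

def in_halfspace(xyz, normal, dist):
    """True if the vector lies on the positive side of the plane."""
    return sum(a * b for a, b in zip(xyz, normal)) > dist

# "Dec > 30 degrees" is the half-space z > sin(30 deg) = 0.5
p = radec_to_xyz(120.0, 45.0)
print(in_halfspace(p, (0.0, 0.0, 1.0), 0.5))  # True
```

Intersecting several such half-spaces gives exactly the query polyhedron the slide describes.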
42 Sky Partitioning
Hierarchical Triangular Mesh, based on an octahedron
43 Hierarchical Subdivision
Hierarchical subdivision of spherical triangles, represented as a quadtree
In SDSS the tree is 5 levels deep - 8192 triangles
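The triangle count is easy to verify: the octahedron contributes 8 base triangles, and each quadtree level splits every triangle into 4 children through its edge midpoints.

```python
# Leaf-triangle count of an HTM-style quadtree over an octahedron:
# 8 base faces, each subdivided into 4 children per level.
def htm_triangle_count(levels, base_faces=8, children=4):
    """Number of leaf triangles after `levels` subdivisions."""
    return base_faces * children ** levels

print(htm_triangle_count(5))  # 8192, matching the 5-level SDSS tree
```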
44 Result of the Query
45 Magnitudes and Multicolor Searches
- Galaxy fluxes
- large dynamic range
- errors divergent as x → 0!
For multicolor magnitudes the error contours can be very anisotropic and skewed: extremely poor localization!
But this is an artifact of the logarithm at zero flux; in flux space the object is well localized
46Novel Magnitude Scale
b softnessc set to match normal magnitudes
- Advantages
- monotonic
- degrades gracefully
- objects have small error ellipse
- unified handling of detections and upper
limits! - Disadvantages
- unusual
- (Lupton, Gunn and Szalay, AJ 99)
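The asinh magnitude from the Lupton, Gunn and Szalay paper can be sketched as follows; the softening value b below is an illustrative choice, not an actual SDSS band parameter.

```python
# Asinh magnitude: agrees with the classic logarithmic magnitude at high
# flux, but stays finite (and well localized) as flux -> 0.
import math

def asinh_mag(flux_ratio, b=1e-10):
    """Asinh magnitude for flux_ratio = f/f0; b is the softening parameter."""
    a = 2.5 / math.log(10.0)
    return -a * (math.asinh(flux_ratio / (2.0 * b)) + math.log(b))

def classic_mag(flux_ratio):
    """Conventional magnitude; diverges as flux -> 0."""
    return -2.5 * math.log10(flux_ratio)

# At high flux the two scales agree; at zero flux only asinh stays finite.
print(abs(asinh_mag(1e-5) - classic_mag(1e-5)) < 1e-6)  # True
print(round(asinh_mag(0.0), 6))  # 25.0 (finite, unlike classic_mag)
```

This is exactly the "degrades gracefully" property: a zero or negative measured flux still maps to a well-defined value, so detections and upper limits are handled uniformly.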
47 Flux Indexing
Split along alternating flux directions
Create balanced partitions
Store bounding boxes at each step
Build a 10-12 level tree in each triangle
48 How to build compact cells?
The SDSS will measure fluxes in 5 bands → asinh magnitudes
Axis-parallel splits in median flux, in 8 separate zones in Galactic latitude → 5 dimensional bounding boxes
The fluxes are strongly correlated → an approximately 2-dimensional distribution of typical objects, widely scattered rare objects, large density contrasts
Therefore first create a local density and split on its value (Csabai et al. 1996): typical (98%), rare (2%)
49 Coarse Grained Design
Archive
50 Distributed Implementation
The User Interface and Analysis Engine connect to a Master SX Engine
Objectivity Federation: a master Objectivity database with multiple Slave servers, each backed by Objectivity databases on RAID
51 JHU Contributions
Fiber spectrographs: P. Feldman, A. Uomoto, S. Friedman, S. Smee
- Science Archive
- A. Szalay, A. Thakar, P. Kunszt
- I. Csabai, Gy. Szokoly, A. Connolly, A. Chaudhaury
- A lot of help from
- Jim Gray, Microsoft
Management: T. Heckman, T. Poehler, A. Davidsen, A. Uomoto, A. Szalay
52 Processing Platforms
- At Fermilab
- 2 AlphaServer 8200: data processing
- 1 SGI Origin 2000: databases
- Archive at JHU
- 1 AlphaServer 1000A (development)
- 10 Intel based servers with LVD RAID
- software verified on Digital Unix, IRIX, Solaris, Linux
53 Exploring New Methods
New spectral classification techniques: galaxy spectra can be expressed as a superposition of a few (< 5) principal components → objective classification of 1 million spectra!
Photometric redshifts: galaxy colors change systematically with redshift, so the SDSS photometry works like a 5-pixel spectrograph → σz ≈ 0.05, but with 100 million objects!
Measuring cosmological parameters: before, the analysis was limited by small number statistics; after, the dominant errors are systematic (extinction) → new analysis methods are required!
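The principal-component idea can be illustrated on synthetic data. This is a toy sketch of PCA via SVD, not the SDSS spectroscopic pipeline: spectra built from a few underlying shapes compress to the same few components.

```python
# Toy PCA: 200 fake "spectra" that are random mixtures of 3 smooth basis
# shapes; SVD of the mean-subtracted data recovers that 3 components
# carry essentially all of the variance.
import numpy as np

rng = np.random.default_rng(0)
n_spectra, n_pixels = 200, 50

x = np.linspace(0.0, 1.0, n_pixels)
basis = np.stack([np.sin(2 * np.pi * k * x) for k in (1, 2, 3)])
spectra = rng.normal(size=(n_spectra, 3)) @ basis   # rank-3 data matrix

centered = spectra - spectra.mean(axis=0)
_, s, _ = np.linalg.svd(centered, full_matrices=False)
explained = (s ** 2) / (s ** 2).sum()               # variance fractions

print(round(float(explained[:3].sum()), 6))  # 1.0
```

Real galaxy spectra are noisier, but the slide's claim is the same: fewer than 5 components suffice, which is what makes objective classification of a million spectra tractable.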
54 Photometric Redshifts
Multicolor photometry maps physical parameters: luminosity L, redshift z, spectral type T
Inversion: observed fluxes (u', g', r', i', z') → z, L, T
Redshifts are statistical, with large errors: σz ≈ 0.05
The data set is huge: more than 100 million galaxies
Easy to subdivide into coarse z bins, and by type → study evolution → enormous volume, ~1 Gpc^3
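An empirical flavor of the inversion can be sketched with a nearest-neighbor lookup; this is my own illustration with made-up numbers, not the SDSS photometric-redshift method.

```python
# Toy empirical photo-z: assign a galaxy the redshift of its nearest
# neighbor in color space, using a (hypothetical) training set of
# galaxies with spectroscopic redshifts.
def photo_z(colors, training):
    """training: list of (color_tuple, z); returns z of nearest neighbor."""
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(training, key=lambda t: dist2(colors, t[0]))[1]

# Illustrative training set of (u'-g', g'-r') colors with known redshifts
train = [((1.2, 0.5), 0.05), ((1.8, 0.9), 0.15), ((2.5, 1.4), 0.30)]
print(photo_z((1.9, 1.0), train))  # 0.15
```

The statistical nature of the estimate (σz ≈ 0.05) comes from the scatter of true redshifts around each point in color space, which a single nearest neighbor cannot remove.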
55 Measuring P(k)
Karhunen-Loeve transform: signal-to-noise eigenmodes of the redshift survey
Optimal extraction of the clustering signal, maximal rejection of systematic errors (Vogeley and Szalay 1996; Matsubara, Szalay and Landy 1999)
Pilot project using the Las Campanas Redshift Survey with 22,000 galaxies
We simultaneously measure the values of the redshift-distortion parameter (β ≈ Ω^0.6/b), the normalization (σ8) and the CDM shape parameter (Γ ≈ Ωh).
56 Trends
- Future dominated by detector improvements
- Moore's Law growth in CCD capabilities
- Gigapixel arrays on the horizon
- Improvements in computing and storage will track the growth in data volume
- Investment in software is critical, and growing
Total area of 3m-class telescopes in the world in m^2, and total number of CCD pixels in Megapixels, as a function of time. Growth over 25 years is a factor of 30 in glass, 3000 in pixels.
57 The Age of Mega-Surveys
The next generation of astronomical archives with Terabyte catalogs will dramatically change astronomy:
top-down design, large sky coverage
built on sound statistical plans
uniform, homogeneous, well calibrated
well controlled and documented systematics
The technology to acquire, store and index the data is here: we are riding Moore's Law
Data mining in such vast archives will be a challenge, but the possibilities are hard to imagine today
Integrating these archives into a single entity is a project for the whole community → National Virtual Observatory
58 New Astronomy: Different!
- Systematic Data Exploration
- will have a central role in the New Astronomy
- Digital Archives of the Sky
- will be the main access to data
- Data Avalanche
- the flood of Terabytes of data is already happening, whether we like it or not!
- Transition to the new
- may be organized or chaotic
59 NVO: The Challenges
- Size of the archived data
- 40,000 square degrees is 2 trillion pixels
- one band: 4 Terabytes
- multi-wavelength: 10-100 Terabytes
- time dimension: a few Petabytes
- The development of
- new archival methods
- new analysis tools
- new standards (metadata, interchange formats)
- Hardware/networking requirements
- Training the next generation!
60 Summary
The SDSS project combines astronomy, physics, and
computer science
It promises to fundamentally change our view of
the universe
It will determine how the largest structures in
the universe were formed
It will serve as the standard astronomy
reference for several decades
Its virtual universe can be explored by both
scientists and the public
Through its archive it will create a new paradigm
in astronomy
61 www.sdss.org  www.sdss.jhu.edu