Store Everything Online In A Database

About This Presentation

Title:

Store Everything Online In A Database

Description:

http://research.microsoft.com/~gray/talks/cern_2001.ppt. 3. How Much ... Most bytes will never be ... impedance mismatch: rowsets == ragged arrays of objects ... – PowerPoint PPT presentation

Number of Views:122

Avg rating:3.0/5.0

Slides: 72

Provided by: jimg178

Category:

more less

Transcript and Presenter's Notes

Title: Store Everything Online In A Database

1
Store EverythingOnlineIn A Database

Jim Gray
Microsoft Research
Gray_at_Microsoft.com
http//research.microsoft.com/gray/talks

2
Outline

Store Everything
Online (Disk not Tape)
In a Database
A Federated DB
Two Examples

3
How Much is Everything?
Yotta Zetta Exa Peta Tera Giga Mega Kilo
Everything! Recorded

Soon everything can be recorded and indexed
Most bytes will never be seen by humans.
Data summarization, trend detection anomaly
detection are key technologies
See Mike Lesk How much information is there
http//www.lesk.com/mlesk/ksg97/ksg.html
See Lyman Varian
How much information
http//www.sims.berkeley.edu/research/projects/how
-much-info/

All Books MultiMedia
All LoC books (words)
.Movie
A Photo
A Book
24 Yecto, 21 zepto, 18 atto, 15 femto, 12 pico, 9
nano, 6 micro, 3 milli
4
Storage capacity beating Moores law

2.3 k/TB today (raw disk)
1k/TB by end of 2002

5
Outline

Store Everything
Online (Disk not Tape)
In a Database
A Federated DB
Two Examples

6
Online Data

Can build 1PB of NAS disk for 5M today
Can SCAN (read or write) entire PB in 3 hours.
Operate it as a data pump continuous sequential
scan
Can deliver 1PB for 1M over Internet
Access charge is 300/Mbps bulk rate
Need to Geoplex data (store it in two places).
Need to filter/process data near the source,
To minimize network costs.

7
The Absurd Disk

2.5 hr scan time (poor sequential access)
1 access per second / 5 GB (VERY cold data)
Its a tape!

1 TB
100 MB/s
200 Kaps
8
Disk vs Tape

Tape
40 GB
10 MBps
10 sec pick time
30-120 second seek time
2/GB for media8/GB for drivelibrary
10 TB/rack
1 week scan

Disk
80 GB
35 MBps
5 ms seek time
3 ms rotate latency
2/GB for drive 2/GB for ctlrs/cabinet
15 TB/rack
1 hour scan

Guestimates Cern 200 TB 3480 tapes 2 col
50GB Rack 1 TB 12 drives
The price advantage of disk is growing the
performance advantage of disk is huge! At
10K/TB, disk is competitive with nearline tape.
9
Building a Petabyte Disk Store

Cadillac 500k/TB 500M/PB plus FC
switches plus 800M/PB
TPC-C SANs (Brand PC 18GB/) 60M/PB
Brand PC local SCSI 15M/PB
Do it yourself ATA
5M/PB

10
Cheap Storage and/or Balanced System

Low cost storage (2 x 3k servers) 5K TB2x (
800 Mhz, 256Mb 8x80GB disks 100MbE)raid5
costs 6K/TB
Balanced server (5k/.64 TB)
2x 1Ghz (2k)
512 MB
8 x 80 GB drives (2K)
Gbps Ethernet switch (300/port)
9k/TB 18K/mirrored TB

11
Next step in the Evolution

Disks become supercomputers
Controller will have 1bips, 1 GB ram, 1 GBps net
And a disk arm.
Disks will run full-blown app/web/db/os stack
Distributed computing
Processors migrate to transducers.

12
Its Hard to Archive a PetabyteIt takes a LONG
time to restore it.

At 1GBps it takes 12 days!
Store it in two (or more) places online (on
disk?). A geo-plex
Scrub it continuously (look for errors)
On failure,
use other copy until failure repaired,
refresh lost copy from safe copy.
Can organize the two copies differently
(e.g. one by time, one by space)

13
Outline

Store Everything
Online (Disk not Tape)
In a Database
A Federated DB
Two Examples

14
Why Not file object GREP?

It works if you have thousands of objects (and
you know them all)
But hard to search millions/billions/trillions
with GREP
Hard to put all attributes in file name.
Minimal metadata
Hard to do chunking right.
Hard to pivot on space/time/version/attributes.

15
The Reality its build vs buy

If you use a file system you will eventually
build a database system
metadata,
Query,
parallel ops,
security,.
reorganize,
recovery,
distributed,
replication,

16
OK so Ill put lots of objects in a fileDo It
Yourself Database

Good news
Your implementation will be 10x faster (at
least!)
easier to understand and use
Bad news
It will cost 10x more to build and maintain
Someday you will get bored maintaining/evolving
it
It will lack some killer features
Parallel search
Self-describing via metadata
SQL, XML,
Replication
Online update reorganization
Chunking is problematic (what granularity, how to
aggregate)

17
Top 10 reasons to put Everything in a DB

Someone else writes the million lines of code
Captures data and Metadata,
Standard interfaces give tools and quick learning
Allows Schema Evolution without breaking old apps
Index and Pivot on multiple attributes
space-time-attribute-version.
Parallel terabyte searches in seconds or minutes
Moves processing search close to the disk
arm (moves fewer bytes (qestons return datons).
Chunking is easier (can aggregate chunks at
server).
Automatic geo-replication
Online update and reorganization.
Security
If you pick the right vendor, ten years from now,
there will be software that can read the data.

18
DB Centric Examples

TerraServer
All images and all data in the database (chunked
as small tiles).www.TerraServer.Microsoft.com/
http//research.microsoft.com/gray/Papers/MSR_TR_
99_29_TerraServer.doc
SkyServer Virtual Sky
Both image and semantic data in a relational
store.
Parallel search NonProcedural access are
important.
http//research.microsoft.com/gray/Papers/MS_TR_9
9_30_Sloan_Digital_Sky_Survey.doc
http//skyserver.sdss.org/getMosaic.asp?Z1A1T
4H1S10M30
http//virtualsky.org/servlet/Page?F3RA16h10m
1.0sDE2B0d42m45sT4P12S10X5096Y4121
W4Z-1tile.2.1.x55tile.2.1.y20

19
OK Why dont they use our stuff?

Wrong metaphor HDF with hyper-slab is better
match.
Impedance match Getting stuff in/out of DB
is too hard
We sold them OODBs and they did not work
(unreliable, poor performance, no tools).

20
So, why will the future be different?

They have MUCH more data (109 files?)
Java / C eases impedance mismatch rowsets
ragged arrays of objectsiterators, exceptions,..
built in language
Tools are better
Optimizers are better
CPU and disk parallelism actually works now
Statistical packages are better.

21
Outline

Store Everything
Online (Disk not Tape)
In a Database
A Federated DB
Two Examples

22
But

Distributed shared databases have failed even
on their home turf. blocks, files, tables are
wrong abstraction for networks. (too low
level) Objects are the right abstraction So,
UDDI / WSDL / SOAP is the solution (not
SQL) Replays the NAS is better than SAN
argument methods gt sql gt file gt disk XML is the
wire format, XLANG is the workflow protocol,
Query will be in there somewhere.
23
DDB technology GREAT in a Cluster Beowulf

Uniform architecture
Trust among nodes
High bandwidth-low latency communication
Programs have single system image
Queries run in parallel
Global optimizer does query decomposition

24
But in a Distributed System(a Grid vs Beowulf)

Higher level abstraction give modularity
minimize round trips
Change is constant need modularity.
Heterogeneous architecture makes query planning
much harder
No trust
Communication is slow and expensive (minimize
it).

25
Federate Databases

Each Database exposes services
Self describing
Discoverable
Easy for programs to use/understand
Built on standards (W3C, IETF,..)
Client-side or server-side apps
Integrate these services
Combine information to produce answers

26
DDB the Trust Issue

Customers serve themselves
Follow the rules posted on the door
No Overhead, no staff!

Clerks serve Customers
Take order, fill order, fill out invoice, collect
money.
Overhead staff, training, rules,

Client/Server Groceries
DDB Grocery
27
Outline

Store Everything
Online (Disk not Tape)
In a Database
A Federated DB
Two Examples

28
How do we find information today?

Human searches web (with an index)
Human browses pages

29
How do we find information tomorrow?
Digital Dashboard

Agents gather and digest it for us.
Q How?
AW3C
Discovery UDDI, DISCO, WSDL
Use
SOAP

My Agents
SOAP WSDL
Web Services
30
How do we publish information?

Get the data.
Conceptualize the data schema
Provide methods that return data subsets.
Challenge how much processing on your server?
Publish the schema and methods.
We (you and I) are exploring these issues.

f, g, x, y
31
TerraServer Example

TerraServer
3TB Internet Map DB available since June 1998
USGS photo and topo maps of the US
Integrated with Home Advisor
Shows off SQL Server availability scalability
Designed for thin clients and voice network
TerraService
A .NET web service
Makes TerraServer data available to other apps

32
Demo http//terraserver.microsoft.com
Show photo topo gazetteer demographics
33
TerraServer Experience

Successful Web Site
Top 1000 Web Site continues to be popular
Met goals interesting, big, real, public, fast,
easy, accessible, and free
High Availability Windows Data Center Compaq
SAN Technology
New Feature Requests
Programmable access to meta-data
User selectable image sizes, i.e. a map server
Permission to use TerraServer data within server
applications

34
What is a Web Service?
A programmable application component accessible
via standard Web protocols
35
TerraService Architecture
HTML
Standard Browsers
Image/jpeg
Image/jpeg
Smart Clients
SOAP/XML
OLEDB
36
Terra Services
Landmark-Service
Terra-Tile-Service

Geo-coded places, e.g. Schools, Golf Courses,
Hospitals, etc.
Place Polygons e.g. Zip Codes, Cities, etc.

Query Gazetteer
Retrieve imagery meta-data
Retrieve imagery
Simple Projection conversions

allows overlay information for
Terra-Tile-Service applications
Clients can present TerraServer imagery in new
ways.
37
Terra Services

Tile
GetAreaFromPt
GetAreaFromRect
GetAreaFromTileId
GetTileMetaFromLonLat
GetTileMetaFromTileId
GetTile (Image)
Landmark
GetLandmarkTypes
CountOfLandmarkPointsByRect
GetLandmarkPointsByRect
CountOfLandmarkShapesByRect
GetLandmarkShapesByRect

Place Search
GetPlaceFacts
GetPlaceList
GetPlaceListInRect
CountPlacesInRect
Projection
ConvertLonLatToUtm
ConvertUtmToLonLat
ConvertLonLatTo NearestPlace
GetTheme
GetLatLonMetrics

http//terraservice.net
38
Soil Viewer Uses TerraService
39
Custom End Product
Web Soil Data Viewer
XML Soil Report
Soil Interpretation Map
40
What You Saw

Converted a Web Server
HTML get post
Server returns pictures to people
to a Web Service
SOAP service
returns XML self-describing data
Application integrates data (Agriculture and Geo
data)

41
Rosetta Stone

Dot Net
UDDI Universal description, discovery, and
integration
Schema, XLANG
SOAP simple object access protocol
WSDL web services definition language
XML- eXtended Markup Language

Distributed computing basic services
Yellow Pages
?
RPC remote procedure call, CORBA, DCOM, RMI
IDL interface definition language
XDR - eXternal Data Representation

42
SkyServer

Collaborating with
Alex Szalay, Peter Kunszt, Ani Thakar _at_ JHU
Robert Brunner, Roy Williams _at_ Caltech
George Djorgovski, Julian Bunn _at_ Caltech
FermiLab operates Sky Server
Compaq donated hardware
Microsoft donated software and money

43
Sky Server

Like TerraServer pictures of the sky.
But also LOTS of data on each object So a
data mining web service
Luminosity (multi-spectra), morphology, spectrum
So, it is a data mining application
Cross-correlation is challenging because
Multi-resolution
Data is dirty/fuzzy (error bars, cosmic rays,
airplanes)
Time varying

44
Astronomy Data

In the old days astronomers took photos.
Starting in the 1960s they began to digitize.
New instruments are digital (100s of GB/nite)
Detectors are following Moores law.
Data avalanche double every year

Total area of 3m telescopes in the world in m2,
total number of CCD pixels in megapixel, as a
function of time. Growth over 25 years is a
factor of 30 in glass, 3000 in pixels.
3 M telescopes area m2
Courtesy of Alex Szalay
CCD area mpixels
45
Astronomy Data

Astronomers have a few Petabytes now.
1 pixel (byte) / sq arc second 4TB
Multi-spectral, temporal, ? 1PB
They mine it looking for new (kinds of) objects
or more of interesting ones(quasars),
density variations in 400-D space correlations
in 400D space
Data doubles every year.
Data is public after a year.
So, 50 of the data is public.
Some have private access to 5 more data.
So 50 vs 55 access for everyone

46
Astronomy Data

But..
How do I get at that 50 of the data?
Astronomers have culture of publishing.
FITS files and many tools.http//fits.gsfc.nasa.g
ov/fits_home.html
Encouraged by NASA.
Publishing data details is difficult.
Astronomers want to do it but it is VERY
hard.(What programs where used? what were the
processing steps? How were errors treated?)
File is wrong abstraction.

47
Virtual Observatoryhttp//www.astro.caltech.edu/n
voconf/http//www.voforum.org/

Premise Most data is (or could be online)
So, the Internet is the worlds best telescope
It has data on every part of the sky
In every measured spectral band optical, x-ray,
radio..
As deep as the best instruments (1 year ago).
It is up when you are up.The seeing is always
great (no working at night, no clouds no moons
no..).
Its a smart telescope links objects and
data to literature on them.

48
Virtual Observatory Golden Age of Mega-Surveys

Many new surveys
multi-TB in size, 100 million objects or more
individual archives planned, or under way
Data publication an integral part of the survey
Software bill a major cost in the survey
Multi-wavelength view of the sky
more than 13 wavelength coverage in 5 years
Impressive early discoveries
finding exotic objects by unusual colors
L,T dwarfs, high-z quasars
finding objects by time variability
gravitational micro-lensing

MACHO 2MASS DENIS SDSS PRIME DPOSS GSC-II COBE
MAP NVSS FIRST GALEX ROSAT OGLE ...
Slide courtesy of Alex Szalay, modified by jim
49
Virtual Observatory Federating the Archives

The next generation mega-surveys are different
top-down design
large sky coverage
sound statistical plans
well controlled/documented data processing
Each survey has a publication plan
Data mining will lead to stunning new discoveries
Federating these archives
? Virtual Observatory

Slide courtesy of Alex Szalay
50
The Multiwavelength Crab Nebula
Crab star 1053 AD
Nova first sighted 1054 A.D. by Chinese
Astronomers Now Crab Nebula X-ray, optical,
infrared, and radio
Slide courtesy of Robert Brunner _at_ CalTech.
51
Exploring Parameter Space

Given an arbitrary parameter space
Data Clusters
Points between Data Clusters
Isolated Data Clusters
Isolated Data Groups
Holes in Data Clusters
Isolated Points

Nichol et al. 2001
Slide courtesy of Robert Brunner _at_ CalTech.
52
Virtual Observatory and Education

In the beginning science was empirical.
Then theoretical branches evolved.
Now, we have a computational branches.
The computational branch has been simulation
It is becoming data analysis/visualization
The Virtual Observatory can be used to
Teach astronomy make it interactive,
demonstrate ideas and phenomena
Teach computational science skillsand the
process of scientific discovery

53
Sloan Digital Sky Survey http//sdss.org/

A group of astronomers has been building a
telescope (with 90M from Sloan Foundation, NSF,
and a dozen universities). for the last 12
years
Now data is arriving
250GB/nite (20 nights per year).
100 M stars, 100 M galaxies, 1 M spectra.
Public data at http//sdss.org/
5 of the survey, 600 sq degrees, 15 M objects
60GB.
This data includes most of the known high z
quasars.
It has a lot of science left in it but that is
just the start.

54
Demo of Sky Server

Alex built SkyServer (based on TerraServer
design).
http//skyserver.sdss.org/

Demo famous places navigator data
shopping cart spectrum SQL? ?
55
Virtual Observatory Challenges

Size multi-Petabyte
40,000 square degrees is 2 Trillion pixels
One band (at 1 sq arcsec) 4 Terabytes
Multi-wavelength 10-100 Terabytes
Time dimension gtgt 10 Petabytes
Need auto parallelism tools
Unsolved Meta-Data problem
Hard to publish data programs
Hard to find/understand data programs
Current tools inadequate
new analysis visualization tools
Transition to the new astronomy
Sociological issues

56
The Challenges

How to federate the Archives to make a VO?
The hope XML is the answer.
The reality XML is syntax and tools FITS
on XML will be good but.. Explaining the data
will still be very difficult.
Define Astronomy Objects and Methods.
Based on UDDI, WSDL, SOAP.
Each archive is a service
http//TerraService.net/ shows the idea.
Working with Caltech (Brunner, Williams,
Djorgovski, Bunn)
But, how does data mining work?

57
Three Steps to a VO 0.01

Get SDSS and Palomar online
Alex Szalay, Jan Vandenberg, Ani Thakar.
Roy Williams, Robert Brunner, Julian Bunn
Do queries and crossID matches with CalTech and
SDSS to expose
Schema, Units,
Dataset problems
the typical use scenarios.
Implement WebServices at CalTech and SDSS

58
Summary

All information at your fingertips.
How do we publish information so that our
agents can digest it?
Example TerraServer -gt TerraService
The Virtual Observatory Concept
The Internet is worlds best telescope
For astronomy
For teaching astronomy and
For teaching computational science

59
Outline

Store Everything
Online (Disk not Tape)
In a Database
A Federated DB
Two Examples

60
No time for what follows
61
SDSS what I have been doing

Work with Alex Szalay, Don Slutz, and others to
define 20 canonical queries and 10 visualization
tasks.
Don Slutz did a first cut of the queries, Im
continuing that work.
Working with Alex Szalay on building Sky Server
and making data it public (send out 80GB
SQL DBs)

62
Two kinds of SDSS data

15M Photo Objects 400 attributes

20K Spectra with 10 lines/ spectrum
63
Spatial Data Access(Szalay, Kunszt,
Brunner)http//www.sdss.jhu.edu/ look at the HTM
link

Implemented Hierarchical Triangular Mesh (HTM) as
table-valued function for spatial joins.
Every object has a 20-deep Mesh ID.
Given a spatial definitionRoutine returns up to
500 covering triangles.
Spatial query is then up to 500 range queries.
Very fast 1,000s of triangles per second.

64
The 20 Queries

Q11 Find all elliptical galaxies with spectra
that have an anomalous emission line.
Q12 Create a grided count of galaxies with u-ggt1
and rlt21.5 over 60ltdeclinationlt70, and 200ltright
ascensionlt210, on a grid of 2, and create a map
of masks over the same grid.
Q13 Create a count of galaxies for each of the
HTM triangles which satisfy a certain color cut,
like 0.7u-0.5g-0.2ilt1.25 rlt21.75, output it in
a form adequate for visualization.
Q14 Find stars with multiple measurements and
have magnitude variations gt0.1. Scan for stars
that have a secondary object (observed at a
different time) and compare their magnitudes.
Q15 Provide a list of moving objects consistent
with an asteroid.
Q16 Find all objects similar to the colors of a
quasar at 5.5ltredshiftlt6.5.
Q17 Find binary stars where at least one of them
has the colors of a white dwarf.
Q18 Find all objects within 30 arcseconds of one
another that have very similar colors that is
where the color ratios u-g, g-r, r-I are less
than 0.05m.
Q19 Find quasars with a broad absorption line in
their spectra and at least one galaxy within 10
arcseconds. Return both the quasars and the
galaxies.
Q20 For each galaxy in the BCG data set
(brightest color galaxy), in 160ltright
ascensionlt170, -25ltdeclinationlt35 count of
galaxies within 30"of it that have a photoz
within 0.05 of that galaxy.

Q1 Find all galaxies without unsaturated pixels
within 1' of a given point of ra75.327,
dec21.023
Q2 Find all galaxies with blue surface
brightness between and 23 and 25 mag per square
arcseconds, and -10ltsuper galactic latitude (sgb)
lt10, and declination less than zero.
Q3 Find all galaxies brighter than magnitude 22,
where the local extinction is gt0.75.
Q4 Find galaxies with an isophotal surface
brightness (SB) larger than 24 in the red band,
with an ellipticitygt0.5, and with the major axis
of the ellipse having a declination of between
30 and 60arc seconds.
Q5 Find all galaxies with a deVaucouleours
profile (r¼ falloff of intensity on disk) and the
photometric colors consistent with an elliptical
galaxy. The deVaucouleours profile
Q6 Find galaxies that are blended with a star,
output the deblended galaxy magnitudes.
Q7 Provide a list of star-like objects that are
1 rare.
Q8 Find all objects with unclassified spectra.
Q9 Find quasars with a line width gt2000 km/s and
2.5ltredshiftlt2.7.
Q10 Find galaxies with spectra that have an
equivalent width in Ha gt40Å (Ha is the main
hydrogen spectral line.)

Also some good queries at http//www.sdss.jhu.edu
/ScienceArchive/sxqt/sxQT/Example_Queries.html
65
An easy oneQ7 Provide a list of star-like
objects that are 1 rare.

Found 14,681 buckets, first 140 buckets have
99 time 104 seconds
Disk bound, reads 3 disks at 68 MBps.

Select cast((u-g) as int) as ug, cast((g-r) as
int) as gr, cast((r-i) as int) as ri,
cast((i-z) as int) as iz, count()
as Population from stars where (ugriz) lt
150 group by cast((u-g) as int), cast((g-r) as
int), cast((r-i) as int), cast((i-z) as int)
order by count()
66
Another easy oneQ15 Provide a list of moving
objects consistent with an asteroid.

Looks hard but there are 5 pictures of the object
at 5 different times (colors) and so can compute
velocity.
Image pipeline computes velocity.
Computing it from the 5 color x,y would also be
fast
Finds 2167 objects in 7 minutes, 70MBps.

select object_id, -- return object ID
sqrt(power(rowv,2)power(colv,2)) as velocity
from sxPhotObj -- check each
object. where (power(rowv,2) power(colv, 2)) gt
50 -- square of velocity and rowv gt 0
and colv gt0 -- negative values indicate error
67
A Hard One Q14 Find stars with multiple
measurements that have magnitude variations
gt0.1.

This should work, but SQL Server does not allow
table values to be piped to table-valued
functions.

This should work, but SQL Server does not allow
table values to be piped to table-valued
functions.

68
A Hard one Second TryQ14 Find stars with
multiple measurements that have magnitude
variations gt0.1.

Write a program with a cursor, ran for 2 days

--------------------------------------------------
----------------------------- -- Table-valued
function that returns the binary stars within a
certain radius -- of another (in arc-minutes)
(typically 5 arc seconds). -- Returns the ID
pairs and the distance between them (in
arcseconds). create function BinaryStars(_at_MaxDista
nceArcMins float) returns _at_BinaryCandidatesTable
table( S1_object_ID bigint not null, -- Star
1 S2_object_ID bigint not null, -- Star
2 distance_arcSec float) -- distance between
them as begin declare _at_star_ID bigint,
_at_binary_ID bigint-- Star's ID and binary ID
declare _at_ra float, _at_dec float -- Star's
position declare _at_u float, _at_g float, _at_r float,
_at_i float,_at_z float -- Star's colors
----------------Open a cursor over stars and get
position and colors declare star_cursor cursor
for select object_ID, ra, dec, u, g, r, i,
z from Stars open star_cursor while
(11) -- for each star begin -- get its
attribues fetch next from star_cursor into
_at_star_ID, _at_ra, _at_dec, _at_u, _at_g, _at_r, _at_i, _at_z if
(_at__at_fetch_status -1) break -- end if no more
stars insert into _at_BinaryCandidatesTable --
insert its binaries select _at_star_ID,
S1.object_ID, -- return stars pairs
sqrt(N.DotProd)/PI()10800 -- and distance in
arc-seconds from getNearbyObjEq(_at_ra, _at_dec,
-- Find objects nearby S. _at_MaxDistanceArcMins)
as N, -- call them N. Stars as S1 --
S1 gets N's color values where _at_star_ID lt
N.Object_ID -- S1 different from S and
N.objType dbo.PhotoType('Star') -- S1 is a
star and N.object_ID S1.object_ID -- join
stars to get colors of S1N and
(abs(_at_u-S1.u) gt 0.1 -- one of the colors is
different. or abs(_at_g-S1.g) gt 0.1 or
abs(_at_r-S1.r) gt 0.1 or abs(_at_i-S1.i) gt 0.1
or abs(_at_z-S1.z) gt 0.1 ) end -- end
of loop over all stars -------------- Looped
over all stars, close cursor and exit. close
star_cursor -- deallocate star_cursor
return -- return table end -- end of
BinaryStars GO select from dbo.BinaryStars(.05)
69
A Hard one Third TryQ14 Find stars with
multiple measurements that have magnitude
variations gt0.1.

Use pre-computed neighbors table.
Ran in 17 minutes, found 31k pairs.

-- Plan 2 Use
the precomputed neighbors table select top 100
S.object_ID, S1.object_ID, -- return star pairs
and distance str(N.Distance_mins 60,6,1) as
DistArcSec from Stars S, -- S is a
star Neighbors N, -- N within 3 arcsec (10
pixels) of S. Stars S1 -- S1 N has the
color attibutes where S.Object_ID
N.Object_ID -- connect S and N. and
S.Object_ID lt N.Neighbor_Object_ID -- S1
different from S and N.Neighbor_objType
dbo.PhotoType('Star')-- S1 is a star (an
optimization) and N.Distance_mins lt .05 --
the 3 arcsecond test and N.Neighbor_object_ID
S1.Object_ID -- N S1 and (
abs(S.u-S1.u) gt 0.1 -- one of the colors is
different. or abs(S.g-S1.g) gt 0.1 or
abs(S.r-S1.r) gt 0.1 or abs(S.i-S1.i) gt 0.1 or
abs(S.z-S1.z) gt 0.1 ) -- Found 31,355 pairs
(out of 4.4 m stars) in 17 min 14 sec.
70
The Pain of Going Outside SQL(its fortunate that
all the queries are single statements)

Count parent objects
503 seconds for 14.7 M objects in 33.3 GB
66 MBps
IO bound (30 of one cpu)
100 k records/cpu sec

Use a cursor
No cpu parallelism
CPU bound
6 MBps, 2.7 k rps
5,450 seconds (10x slower)

declare _at_count int declare _at_sum int set _at_sum
0 declare PhotoCursor cursor for select nChild
from sxPhotoObj open PhotoCursor while (11)
begin fetch next from PhotoCursor into
_at_count if (_at__at_fetch_status -1) break set
_at_sum _at_sum _at_count end close
PhotoCursor deallocate PhotoCursor print 'Sum
is 'cast(_at_sum as varchar(12))
select count() from sxPhotoObj where nChild
gt 0
71
Summary of Current Status

All 20 queries run(still checking science)
Also 15 user queries
Run times on 3k PC (2 cpu, 4 disk, 256MB)

100 IO/cpu sec 5MB/cpu sec
72
Summary of Current Status

16 of the queries are simple
2 are iterative, 2 complex
Many are sequential one-pass and two-pass over
data
Covering indices make scans run fast
Table valued functions are wonderful but
limitations on parameters are a pain.
Counting is VERY common.
Binning (grouping by some set of attributes) is
common
Did not request cube, but that may be cultural.

73
Reflections on the 20 Queries

Data loading/scrubbing is labor intensive
tedious
AUTOMATE!!!
This is 5 of the data, and some queries take an
hour.
But this is not tuned (disk bound).
All queries benefit from parallelism (both disk
and cpu)(if you can state the query right, e.g.
inside SQL).
Parallel database machines will do great on this
Hash machines
Data pumps
See paper in word or pdf on my web site.
SQL looks good. The answers, need
visualization

74
Call to Action

If you do data visualization we need you(and we
know it).
If you do databaseshere is some data you can
practice on.
If you do distributed systemshere is a
federation you can practice on.
If you do astronomy educational outreachhere is
a tool for you.
The astronomy folks are very good, and very
smart, and a pleasure to work with, and the
questions are cosmic, so