Title: Store Everything Online In A Database
1Store EverythingOnlineIn A Database
- Jim Gray
- Microsoft Research
- Gray_at_Microsoft.com
- http//research.microsoft.com/gray/talks
2Outline
- Store Everything
- Online (Disk not Tape)
- In a Database
- A Federated DB
- Two Examples
3How Much is Everything?
Yotta Zetta Exa Peta Tera Giga Mega Kilo
Everything! Recorded
- Soon everything can be recorded and indexed
- Most bytes will never be seen by humans.
- Data summarization, trend detection anomaly
detection are key technologies - See Mike Lesk How much information is there
http//www.lesk.com/mlesk/ksg97/ksg.html - See Lyman Varian
- How much information
- http//www.sims.berkeley.edu/research/projects/how
-much-info/
All Books MultiMedia
All LoC books (words)
.Movie
A Photo
A Book
24 Yecto, 21 zepto, 18 atto, 15 femto, 12 pico, 9
nano, 6 micro, 3 milli
4Storage capacity beating Moores law
- 2.3 k/TB today (raw disk)
- 1k/TB by end of 2002
-
5Outline
- Store Everything
- Online (Disk not Tape)
- In a Database
- A Federated DB
- Two Examples
6Online Data
- Can build 1PB of NAS disk for 5M today
- Can SCAN (read or write) entire PB in 3 hours.
- Operate it as a data pump continuous sequential
scan - Can deliver 1PB for 1M over Internet
- Access charge is 300/Mbps bulk rate
- Need to Geoplex data (store it in two places).
- Need to filter/process data near the source,
- To minimize network costs.
7The Absurd Disk
- 2.5 hr scan time (poor sequential access)
- 1 access per second / 5 GB (VERY cold data)
- Its a tape!
1 TB
100 MB/s
200 Kaps
8Disk vs Tape
- Tape
- 40 GB
- 10 MBps
- 10 sec pick time
- 30-120 second seek time
- 2/GB for media8/GB for drivelibrary
- 10 TB/rack
- 1 week scan
- Disk
- 80 GB
- 35 MBps
- 5 ms seek time
- 3 ms rotate latency
- 2/GB for drive 2/GB for ctlrs/cabinet
- 15 TB/rack
- 1 hour scan
Guestimates Cern 200 TB 3480 tapes 2 col
50GB Rack 1 TB 12 drives
The price advantage of disk is growing the
performance advantage of disk is huge! At
10K/TB, disk is competitive with nearline tape.
9Building a Petabyte Disk Store
- Cadillac 500k/TB 500M/PB plus FC
switches plus 800M/PB - TPC-C SANs (Brand PC 18GB/) 60M/PB
- Brand PC local SCSI 15M/PB
- Do it yourself ATA
5M/PB
10Cheap Storage and/or Balanced System
- Low cost storage (2 x 3k servers) 5K TB2x (
800 Mhz, 256Mb 8x80GB disks 100MbE)raid5
costs 6K/TB - Balanced server (5k/.64 TB)
- 2x 1Ghz (2k)
- 512 MB
- 8 x 80 GB drives (2K)
- Gbps Ethernet switch (300/port)
- 9k/TB 18K/mirrored TB
11Next step in the Evolution
- Disks become supercomputers
- Controller will have 1bips, 1 GB ram, 1 GBps net
- And a disk arm.
- Disks will run full-blown app/web/db/os stack
- Distributed computing
- Processors migrate to transducers.
12Its Hard to Archive a PetabyteIt takes a LONG
time to restore it.
- At 1GBps it takes 12 days!
- Store it in two (or more) places online (on
disk?). A geo-plex - Scrub it continuously (look for errors)
- On failure,
- use other copy until failure repaired,
- refresh lost copy from safe copy.
- Can organize the two copies differently
(e.g. one by time, one by space)
13Outline
- Store Everything
- Online (Disk not Tape)
- In a Database
- A Federated DB
- Two Examples
14Why Not file object GREP?
- It works if you have thousands of objects (and
you know them all) - But hard to search millions/billions/trillions
with GREP - Hard to put all attributes in file name.
- Minimal metadata
- Hard to do chunking right.
- Hard to pivot on space/time/version/attributes.
15The Reality its build vs buy
- If you use a file system you will eventually
build a database system - metadata,
- Query,
- parallel ops,
- security,.
- reorganize,
- recovery,
- distributed,
- replication,
16OK so Ill put lots of objects in a fileDo It
Yourself Database
- Good news
- Your implementation will be 10x faster (at
least!) - easier to understand and use
- Bad news
- It will cost 10x more to build and maintain
- Someday you will get bored maintaining/evolving
it - It will lack some killer features
- Parallel search
- Self-describing via metadata
- SQL, XML,
- Replication
- Online update reorganization
- Chunking is problematic (what granularity, how to
aggregate)
17Top 10 reasons to put Everything in a DB
- Someone else writes the million lines of code
- Captures data and Metadata,
- Standard interfaces give tools and quick learning
- Allows Schema Evolution without breaking old apps
- Index and Pivot on multiple attributes
space-time-attribute-version. - Parallel terabyte searches in seconds or minutes
- Moves processing search close to the disk
arm (moves fewer bytes (qestons return datons). - Chunking is easier (can aggregate chunks at
server). - Automatic geo-replication
- Online update and reorganization.
- Security
- If you pick the right vendor, ten years from now,
there will be software that can read the data.
18DB Centric Examples
- TerraServer
- All images and all data in the database (chunked
as small tiles).www.TerraServer.Microsoft.com/ - http//research.microsoft.com/gray/Papers/MSR_TR_
99_29_TerraServer.doc - SkyServer Virtual Sky
- Both image and semantic data in a relational
store. - Parallel search NonProcedural access are
important. - http//research.microsoft.com/gray/Papers/MS_TR_9
9_30_Sloan_Digital_Sky_Survey.doc - http//skyserver.sdss.org/getMosaic.asp?Z1A1T
4H1S10M30 - http//virtualsky.org/servlet/Page?F3RA16h10m
1.0sDE2B0d42m45sT4P12S10X5096Y4121
W4Z-1tile.2.1.x55tile.2.1.y20
19OK Why dont they use our stuff?
- Wrong metaphor HDF with hyper-slab is better
match. - Impedance match Getting stuff in/out of DB
is too hard - We sold them OODBs and they did not work
(unreliable, poor performance, no tools).
20So, why will the future be different?
- They have MUCH more data (109 files?)
- Java / C eases impedance mismatch rowsets
ragged arrays of objectsiterators, exceptions,..
built in language - Tools are better
- Optimizers are better
- CPU and disk parallelism actually works now
- Statistical packages are better.
21Outline
- Store Everything
- Online (Disk not Tape)
- In a Database
- A Federated DB
- Two Examples
22But
Distributed shared databases have failed even
on their home turf. blocks, files, tables are
wrong abstraction for networks. (too low
level) Objects are the right abstraction So,
UDDI / WSDL / SOAP is the solution (not
SQL) Replays the NAS is better than SAN
argument methods gt sql gt file gt disk XML is the
wire format, XLANG is the workflow protocol,
Query will be in there somewhere.
23 DDB technology GREAT in a Cluster Beowulf
- Uniform architecture
- Trust among nodes
- High bandwidth-low latency communication
- Programs have single system image
- Queries run in parallel
- Global optimizer does query decomposition
24But in a Distributed System(a Grid vs Beowulf)
- Higher level abstraction give modularity
minimize round trips - Change is constant need modularity.
- Heterogeneous architecture makes query planning
much harder - No trust
- Communication is slow and expensive (minimize
it).
25Federate Databases
- Each Database exposes services
- Self describing
- Discoverable
- Easy for programs to use/understand
- Built on standards (W3C, IETF,..)
- Client-side or server-side apps
- Integrate these services
- Combine information to produce answers
26DDB the Trust Issue
- Customers serve themselves
- Follow the rules posted on the door
- No Overhead, no staff!
- Clerks serve Customers
- Take order, fill order, fill out invoice, collect
money. - Overhead staff, training, rules,
Client/Server Groceries
DDB Grocery
27Outline
- Store Everything
- Online (Disk not Tape)
- In a Database
- A Federated DB
- Two Examples
28How do we find information today?
- Human searches web (with an index)
- Human browses pages
29How do we find information tomorrow?
Digital Dashboard
- Agents gather and digest it for us.
- Q How?
- AW3C
- Discovery UDDI, DISCO, WSDL
- Use
- SOAP
My Agents
SOAP WSDL
Web Services
30How do we publish information?
- Get the data.
- Conceptualize the data schema
- Provide methods that return data subsets.
- Challenge how much processing on your server?
- Publish the schema and methods.
- We (you and I) are exploring these issues.
f, g, x, y
31TerraServer Example
- TerraServer
- 3TB Internet Map DB available since June 1998
- USGS photo and topo maps of the US
- Integrated with Home Advisor
- Shows off SQL Server availability scalability
- Designed for thin clients and voice network
- TerraService
- A .NET web service
- Makes TerraServer data available to other apps
32Demo http//terraserver.microsoft.com
Show photo topo gazetteer demographics
33TerraServer Experience
- Successful Web Site
- Top 1000 Web Site continues to be popular
- Met goals interesting, big, real, public, fast,
easy, accessible, and free - High Availability Windows Data Center Compaq
SAN Technology - New Feature Requests
- Programmable access to meta-data
- User selectable image sizes, i.e. a map server
- Permission to use TerraServer data within server
applications
34What is a Web Service?
A programmable application component accessible
via standard Web protocols
35TerraService Architecture
HTML
Standard Browsers
Image/jpeg
Image/jpeg
Smart Clients
SOAP/XML
OLEDB
36Terra Services
Landmark-Service
Terra-Tile-Service
- Geo-coded places, e.g. Schools, Golf Courses,
Hospitals, etc. - Place Polygons e.g. Zip Codes, Cities, etc.
- Query Gazetteer
- Retrieve imagery meta-data
- Retrieve imagery
- Simple Projection conversions
allows overlay information for
Terra-Tile-Service applications
Clients can present TerraServer imagery in new
ways.
37Terra Services
- Tile
- GetAreaFromPt
- GetAreaFromRect
- GetAreaFromTileId
- GetTileMetaFromLonLat
- GetTileMetaFromTileId
- GetTile (Image)
- Landmark
- GetLandmarkTypes
- CountOfLandmarkPointsByRect
- GetLandmarkPointsByRect
- CountOfLandmarkShapesByRect
- GetLandmarkShapesByRect
- Place Search
- GetPlaceFacts
- GetPlaceList
- GetPlaceListInRect
- CountPlacesInRect
- Projection
- ConvertLonLatToUtm
- ConvertUtmToLonLat
- ConvertLonLatTo NearestPlace
- GetTheme
- GetLatLonMetrics
http//terraservice.net
38Soil Viewer Uses TerraService
39Custom End Product
Web Soil Data Viewer
XML Soil Report
Soil Interpretation Map
40What You Saw
- Converted a Web Server
- HTML get post
- Server returns pictures to people
- to a Web Service
- SOAP service
- returns XML self-describing data
- Application integrates data (Agriculture and Geo
data)
41Rosetta Stone
- Dot Net
- UDDI Universal description, discovery, and
integration - Schema, XLANG
- SOAP simple object access protocol
- WSDL web services definition language
- XML- eXtended Markup Language
- Distributed computing basic services
- Yellow Pages
- ?
- RPC remote procedure call, CORBA, DCOM, RMI
- IDL interface definition language
- XDR - eXternal Data Representation
42SkyServer
-
- Collaborating with
- Alex Szalay, Peter Kunszt, Ani Thakar _at_ JHU
- Robert Brunner, Roy Williams _at_ Caltech
- George Djorgovski, Julian Bunn _at_ Caltech
- FermiLab operates Sky Server
- Compaq donated hardware
- Microsoft donated software and money
43Sky Server
- Like TerraServer pictures of the sky.
- But also LOTS of data on each object So a
data mining web service - Luminosity (multi-spectra), morphology, spectrum
- So, it is a data mining application
- Cross-correlation is challenging because
- Multi-resolution
- Data is dirty/fuzzy (error bars, cosmic rays,
airplanes) - Time varying
44Astronomy Data
- In the old days astronomers took photos.
- Starting in the 1960s they began to digitize.
- New instruments are digital (100s of GB/nite)
- Detectors are following Moores law.
- Data avalanche double every year
Total area of 3m telescopes in the world in m2,
total number of CCD pixels in megapixel, as a
function of time. Growth over 25 years is a
factor of 30 in glass, 3000 in pixels.
3 M telescopes area m2
Courtesy of Alex Szalay
CCD area mpixels
45Astronomy Data
- Astronomers have a few Petabytes now.
- 1 pixel (byte) / sq arc second 4TB
- Multi-spectral, temporal, ? 1PB
- They mine it looking for new (kinds of) objects
or more of interesting ones(quasars),
density variations in 400-D space correlations
in 400D space - Data doubles every year.
- Data is public after a year.
- So, 50 of the data is public.
- Some have private access to 5 more data.
- So 50 vs 55 access for everyone
46Astronomy Data
- But..
- How do I get at that 50 of the data?
- Astronomers have culture of publishing.
- FITS files and many tools.http//fits.gsfc.nasa.g
ov/fits_home.html - Encouraged by NASA.
- Publishing data details is difficult.
Astronomers want to do it but it is VERY
hard.(What programs where used? what were the
processing steps? How were errors treated?) - File is wrong abstraction.
47Virtual Observatoryhttp//www.astro.caltech.edu/n
voconf/http//www.voforum.org/
- Premise Most data is (or could be online)
- So, the Internet is the worlds best telescope
- It has data on every part of the sky
- In every measured spectral band optical, x-ray,
radio.. - As deep as the best instruments (1 year ago).
- It is up when you are up.The seeing is always
great (no working at night, no clouds no moons
no..). - Its a smart telescope links objects and
data to literature on them.
48Virtual Observatory Golden Age of Mega-Surveys
- Many new surveys
- multi-TB in size, 100 million objects or more
- individual archives planned, or under way
- Data publication an integral part of the survey
- Software bill a major cost in the survey
- Multi-wavelength view of the sky
- more than 13 wavelength coverage in 5 years
- Impressive early discoveries
- finding exotic objects by unusual colors
- L,T dwarfs, high-z quasars
- finding objects by time variability
- gravitational micro-lensing
MACHO 2MASS DENIS SDSS PRIME DPOSS GSC-II COBE
MAP NVSS FIRST GALEX ROSAT OGLE ...
Slide courtesy of Alex Szalay, modified by jim
49Virtual Observatory Federating the Archives
- The next generation mega-surveys are different
- top-down design
- large sky coverage
- sound statistical plans
- well controlled/documented data processing
- Each survey has a publication plan
- Data mining will lead to stunning new discoveries
- Federating these archives
- ? Virtual Observatory
Slide courtesy of Alex Szalay
50The Multiwavelength Crab Nebula
Crab star 1053 AD
Nova first sighted 1054 A.D. by Chinese
Astronomers Now Crab Nebula X-ray, optical,
infrared, and radio
Slide courtesy of Robert Brunner _at_ CalTech.
51Exploring Parameter Space
- Given an arbitrary parameter space
- Data Clusters
- Points between Data Clusters
- Isolated Data Clusters
- Isolated Data Groups
- Holes in Data Clusters
- Isolated Points
Nichol et al. 2001
Slide courtesy of Robert Brunner _at_ CalTech.
52Virtual Observatory and Education
- In the beginning science was empirical.
- Then theoretical branches evolved.
- Now, we have a computational branches.
- The computational branch has been simulation
- It is becoming data analysis/visualization
- The Virtual Observatory can be used to
- Teach astronomy make it interactive,
demonstrate ideas and phenomena - Teach computational science skillsand the
process of scientific discovery
53Sloan Digital Sky Survey http//sdss.org/
- A group of astronomers has been building a
telescope (with 90M from Sloan Foundation, NSF,
and a dozen universities). for the last 12
years - Now data is arriving
- 250GB/nite (20 nights per year).
- 100 M stars, 100 M galaxies, 1 M spectra.
- Public data at http//sdss.org/
- 5 of the survey, 600 sq degrees, 15 M objects
60GB. - This data includes most of the known high z
quasars. - It has a lot of science left in it but that is
just the start.
54Demo of Sky Server
- Alex built SkyServer (based on TerraServer
design). - http//skyserver.sdss.org/
Demo famous places navigator data
shopping cart spectrum SQL? ?
55Virtual Observatory Challenges
- Size multi-Petabyte
- 40,000 square degrees is 2 Trillion pixels
- One band (at 1 sq arcsec) 4 Terabytes
- Multi-wavelength 10-100 Terabytes
- Time dimension gtgt 10 Petabytes
- Need auto parallelism tools
- Unsolved Meta-Data problem
- Hard to publish data programs
- Hard to find/understand data programs
- Current tools inadequate
- new analysis visualization tools
- Transition to the new astronomy
- Sociological issues
56The Challenges
- How to federate the Archives to make a VO?
- The hope XML is the answer.
- The reality XML is syntax and tools FITS
on XML will be good but.. Explaining the data
will still be very difficult. - Define Astronomy Objects and Methods.
- Based on UDDI, WSDL, SOAP.
- Each archive is a service
- http//TerraService.net/ shows the idea.
- Working with Caltech (Brunner, Williams,
Djorgovski, Bunn) - But, how does data mining work?
57Three Steps to a VO 0.01
- Get SDSS and Palomar online
- Alex Szalay, Jan Vandenberg, Ani Thakar.
- Roy Williams, Robert Brunner, Julian Bunn
- Do queries and crossID matches with CalTech and
SDSS to expose - Schema, Units,
- Dataset problems
- the typical use scenarios.
- Implement WebServices at CalTech and SDSS
58Summary
- All information at your fingertips.
- How do we publish information so that our
agents can digest it? - Example TerraServer -gt TerraService
- The Virtual Observatory Concept
- The Internet is worlds best telescope
- For astronomy
- For teaching astronomy and
- For teaching computational science
59Outline
- Store Everything
- Online (Disk not Tape)
- In a Database
- A Federated DB
- Two Examples
60No time for what follows
61SDSS what I have been doing
- Work with Alex Szalay, Don Slutz, and others to
define 20 canonical queries and 10 visualization
tasks. - Don Slutz did a first cut of the queries, Im
continuing that work. - Working with Alex Szalay on building Sky Server
and making data it public (send out 80GB
SQL DBs)
62Two kinds of SDSS data
- 15M Photo Objects 400 attributes
20K Spectra with 10 lines/ spectrum
63Spatial Data Access(Szalay, Kunszt,
Brunner)http//www.sdss.jhu.edu/ look at the HTM
link
- Implemented Hierarchical Triangular Mesh (HTM) as
table-valued function for spatial joins. - Every object has a 20-deep Mesh ID.
- Given a spatial definitionRoutine returns up to
500 covering triangles. - Spatial query is then up to 500 range queries.
- Very fast 1,000s of triangles per second.
64The 20 Queries
- Q11 Find all elliptical galaxies with spectra
that have an anomalous emission line. - Q12 Create a grided count of galaxies with u-ggt1
and rlt21.5 over 60ltdeclinationlt70, and 200ltright
ascensionlt210, on a grid of 2, and create a map
of masks over the same grid. - Q13 Create a count of galaxies for each of the
HTM triangles which satisfy a certain color cut,
like 0.7u-0.5g-0.2ilt1.25 rlt21.75, output it in
a form adequate for visualization. - Q14 Find stars with multiple measurements and
have magnitude variations gt0.1. Scan for stars
that have a secondary object (observed at a
different time) and compare their magnitudes. - Q15 Provide a list of moving objects consistent
with an asteroid. - Q16 Find all objects similar to the colors of a
quasar at 5.5ltredshiftlt6.5. - Q17 Find binary stars where at least one of them
has the colors of a white dwarf. - Q18 Find all objects within 30 arcseconds of one
another that have very similar colors that is
where the color ratios u-g, g-r, r-I are less
than 0.05m. - Q19 Find quasars with a broad absorption line in
their spectra and at least one galaxy within 10
arcseconds. Return both the quasars and the
galaxies. - Q20 For each galaxy in the BCG data set
(brightest color galaxy), in 160ltright
ascensionlt170, -25ltdeclinationlt35 count of
galaxies within 30"of it that have a photoz
within 0.05 of that galaxy.
- Q1 Find all galaxies without unsaturated pixels
within 1' of a given point of ra75.327,
dec21.023 - Q2 Find all galaxies with blue surface
brightness between and 23 and 25 mag per square
arcseconds, and -10ltsuper galactic latitude (sgb)
lt10, and declination less than zero. - Q3 Find all galaxies brighter than magnitude 22,
where the local extinction is gt0.75. - Q4 Find galaxies with an isophotal surface
brightness (SB) larger than 24 in the red band,
with an ellipticitygt0.5, and with the major axis
of the ellipse having a declination of between
30 and 60arc seconds. - Q5 Find all galaxies with a deVaucouleours
profile (r¼ falloff of intensity on disk) and the
photometric colors consistent with an elliptical
galaxy. The deVaucouleours profile - Q6 Find galaxies that are blended with a star,
output the deblended galaxy magnitudes. - Q7 Provide a list of star-like objects that are
1 rare. - Q8 Find all objects with unclassified spectra.
- Q9 Find quasars with a line width gt2000 km/s and
2.5ltredshiftlt2.7. - Q10 Find galaxies with spectra that have an
equivalent width in Ha gt40Å (Ha is the main
hydrogen spectral line.)
Also some good queries at http//www.sdss.jhu.edu
/ScienceArchive/sxqt/sxQT/Example_Queries.html
65An easy oneQ7 Provide a list of star-like
objects that are 1 rare.
- Found 14,681 buckets, first 140 buckets have
99 time 104 seconds - Disk bound, reads 3 disks at 68 MBps.
Select cast((u-g) as int) as ug, cast((g-r) as
int) as gr, cast((r-i) as int) as ri,
cast((i-z) as int) as iz, count()
as Population from stars where (ugriz) lt
150 group by cast((u-g) as int), cast((g-r) as
int), cast((r-i) as int), cast((i-z) as int)
order by count()
66Another easy oneQ15 Provide a list of moving
objects consistent with an asteroid.
- Looks hard but there are 5 pictures of the object
at 5 different times (colors) and so can compute
velocity. - Image pipeline computes velocity.
- Computing it from the 5 color x,y would also be
fast - Finds 2167 objects in 7 minutes, 70MBps.
select object_id, -- return object ID
sqrt(power(rowv,2)power(colv,2)) as velocity
from sxPhotObj -- check each
object. where (power(rowv,2) power(colv, 2)) gt
50 -- square of velocity and rowv gt 0
and colv gt0 -- negative values indicate error
67A Hard One Q14 Find stars with multiple
measurements that have magnitude variations
gt0.1.
- This should work, but SQL Server does not allow
table values to be piped to table-valued
functions.
- This should work, but SQL Server does not allow
table values to be piped to table-valued
functions.
68A Hard one Second TryQ14 Find stars with
multiple measurements that have magnitude
variations gt0.1.
- Write a program with a cursor, ran for 2 days
--------------------------------------------------
----------------------------- -- Table-valued
function that returns the binary stars within a
certain radius -- of another (in arc-minutes)
(typically 5 arc seconds). -- Returns the ID
pairs and the distance between them (in
arcseconds). create function BinaryStars(_at_MaxDista
nceArcMins float) returns _at_BinaryCandidatesTable
table( S1_object_ID bigint not null, -- Star
1 S2_object_ID bigint not null, -- Star
2 distance_arcSec float) -- distance between
them as begin declare _at_star_ID bigint,
_at_binary_ID bigint-- Star's ID and binary ID
declare _at_ra float, _at_dec float -- Star's
position declare _at_u float, _at_g float, _at_r float,
_at_i float,_at_z float -- Star's colors
----------------Open a cursor over stars and get
position and colors declare star_cursor cursor
for select object_ID, ra, dec, u, g, r, i,
z from Stars open star_cursor while
(11) -- for each star begin -- get its
attribues fetch next from star_cursor into
_at_star_ID, _at_ra, _at_dec, _at_u, _at_g, _at_r, _at_i, _at_z if
(_at__at_fetch_status -1) break -- end if no more
stars insert into _at_BinaryCandidatesTable --
insert its binaries select _at_star_ID,
S1.object_ID, -- return stars pairs
sqrt(N.DotProd)/PI()10800 -- and distance in
arc-seconds from getNearbyObjEq(_at_ra, _at_dec,
-- Find objects nearby S. _at_MaxDistanceArcMins)
as N, -- call them N. Stars as S1 --
S1 gets N's color values where _at_star_ID lt
N.Object_ID -- S1 different from S and
N.objType dbo.PhotoType('Star') -- S1 is a
star and N.object_ID S1.object_ID -- join
stars to get colors of S1N and
(abs(_at_u-S1.u) gt 0.1 -- one of the colors is
different. or abs(_at_g-S1.g) gt 0.1 or
abs(_at_r-S1.r) gt 0.1 or abs(_at_i-S1.i) gt 0.1
or abs(_at_z-S1.z) gt 0.1 ) end -- end
of loop over all stars -------------- Looped
over all stars, close cursor and exit. close
star_cursor -- deallocate star_cursor
return -- return table end -- end of
BinaryStars GO select from dbo.BinaryStars(.05)
69A Hard one Third TryQ14 Find stars with
multiple measurements that have magnitude
variations gt0.1.
- Use pre-computed neighbors table.
- Ran in 17 minutes, found 31k pairs.
-- Plan 2 Use
the precomputed neighbors table select top 100
S.object_ID, S1.object_ID, -- return star pairs
and distance str(N.Distance_mins 60,6,1) as
DistArcSec from Stars S, -- S is a
star Neighbors N, -- N within 3 arcsec (10
pixels) of S. Stars S1 -- S1 N has the
color attibutes where S.Object_ID
N.Object_ID -- connect S and N. and
S.Object_ID lt N.Neighbor_Object_ID -- S1
different from S and N.Neighbor_objType
dbo.PhotoType('Star')-- S1 is a star (an
optimization) and N.Distance_mins lt .05 --
the 3 arcsecond test and N.Neighbor_object_ID
S1.Object_ID -- N S1 and (
abs(S.u-S1.u) gt 0.1 -- one of the colors is
different. or abs(S.g-S1.g) gt 0.1 or
abs(S.r-S1.r) gt 0.1 or abs(S.i-S1.i) gt 0.1 or
abs(S.z-S1.z) gt 0.1 ) -- Found 31,355 pairs
(out of 4.4 m stars) in 17 min 14 sec.
70The Pain of Going Outside SQL(its fortunate that
all the queries are single statements)
- Count parent objects
- 503 seconds for 14.7 M objects in 33.3 GB
- 66 MBps
- IO bound (30 of one cpu)
- 100 k records/cpu sec
- Use a cursor
- No cpu parallelism
- CPU bound
- 6 MBps, 2.7 k rps
- 5,450 seconds (10x slower)
declare _at_count int declare _at_sum int set _at_sum
0 declare PhotoCursor cursor for select nChild
from sxPhotoObj open PhotoCursor while (11)
begin fetch next from PhotoCursor into
_at_count if (_at__at_fetch_status -1) break set
_at_sum _at_sum _at_count end close
PhotoCursor deallocate PhotoCursor print 'Sum
is 'cast(_at_sum as varchar(12))
select count() from sxPhotoObj where nChild
gt 0
71Summary of Current Status
- All 20 queries run(still checking science)
- Also 15 user queries
- Run times on 3k PC (2 cpu, 4 disk, 256MB)
100 IO/cpu sec 5MB/cpu sec
72Summary of Current Status
- 16 of the queries are simple
- 2 are iterative, 2 complex
- Many are sequential one-pass and two-pass over
data - Covering indices make scans run fast
- Table valued functions are wonderful but
limitations on parameters are a pain. - Counting is VERY common.
- Binning (grouping by some set of attributes) is
common - Did not request cube, but that may be cultural.
73Reflections on the 20 Queries
- Data loading/scrubbing is labor intensive
tedious - AUTOMATE!!!
- This is 5 of the data, and some queries take an
hour. - But this is not tuned (disk bound).
- All queries benefit from parallelism (both disk
and cpu)(if you can state the query right, e.g.
inside SQL). - Parallel database machines will do great on this
- Hash machines
- Data pumps
- See paper in word or pdf on my web site.
- SQL looks good. The answers, need
visualization
74Call to Action
- If you do data visualization we need you(and we
know it). - If you do databaseshere is some data you can
practice on. - If you do distributed systemshere is a
federation you can practice on. - If you do astronomy educational outreachhere is
a tool for you. - The astronomy folks are very good, and very
smart, and a pleasure to work with, and the
questions are cosmic, so