Building PetaByte Data Stores - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Building PetaByte Data Stores


1
Building Peta-Byte Data Stores
  • Jim Gray
  • at the Claus Shira Anniversary, European Media Lab
  • 12 February 2001

2
How Much Information Is There?
  • Soon everything can be recorded and indexed
  • Most data will never be seen by humans
  • Precious resource: human attention
    Auto-summarization and auto-search are the key
    technology. www.lesk.com/mlesk/ksg97/ksg.html

Scale: Yotta, Zetta, Exa, Peta, Tera, Giga, Mega, Kilo
Figure labels: Everything! Recorded; All Books MultiMedia; All LoC books (words); A Movie; A Photo; A Book
Small prefixes: 24 yocto, 21 zepto, 18 atto, 15 femto, 12 pico, 9 nano, 6 micro, 3 milli
3
Ops/s/$ Had Three Growth Phases: Now Doubling
Every Year
  • 1890-1945
      • Mechanical
      • Relay
      • 7-year doubling
  • 1945-1985
      • Tube, transistor, ...
      • 2.3-year doubling
  • 1985-2000
      • Microprocessor
      • 1.0-year doubling
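
As a rough illustration of what these doubling times imply, the sketch below computes the total improvement over each phase, assuming a constant doubling time within the phase. The phase boundaries and doubling times come from the slide; the function name is only illustrative.

```python
# Growth implied by a constant doubling time over each phase.
# Phase boundaries and doubling times are taken from the slide.
phases = [
    ("1890-1945", 1945 - 1890, 7.0),
    ("1945-1985", 1985 - 1945, 2.3),
    ("1985-2000", 2000 - 1985, 1.0),
]


def growth_factor(years: float, doubling_years: float) -> float:
    """Return 2 ** (number of doublings that fit into the period)."""
    return 2.0 ** (years / doubling_years)


for label, years, doubling in phases:
    doublings = years / doubling
    factor = growth_factor(years, doubling)
    print(f"{label}: {doublings:5.1f} doublings, roughly {factor:.3g}x")
```

Under this simplification, the 15 years of microprocessor-era growth alone contribute fifteen doublings, which is more than the seven to eight doublings of the whole mechanical and relay era.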

4
Gilder's Law: 3x bandwidth/year for 25 more years
  • Today
      • 10 Gbps per channel (per lambda)
      • 4 channels per fiber: 40 Gbps
      • 32 fibers/bundle: 1.2 Tbps/bundle
  • In lab: 3 Tbps/fiber (400 x WDM)
  • In theory: 25 Tbps per fiber
  • 1 Tbps: USA 1996 WAN bisection bandwidth
  • Aggregate bandwidth doubles every 8 months!
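
The figures on this slide can be cross-checked with a little arithmetic. The sketch below multiplies out the per-fiber and per-bundle rates and converts a steady 3x/year growth rate into an equivalent doubling time. The constant names are illustrative, and decimal units (1 Tbps = 1000 Gbps) are an assumption.

```python
import math

GBPS_PER_CHANNEL = 10    # from the slide
CHANNELS_PER_FIBER = 4   # from the slide
FIBERS_PER_BUNDLE = 32   # from the slide

fiber_gbps = GBPS_PER_CHANNEL * CHANNELS_PER_FIBER
bundle_tbps = fiber_gbps * FIBERS_PER_BUNDLE / 1000  # assumes decimal Tbps

# A quantity that triples every year doubles every 12 * ln(2) / ln(3) months.
doubling_months = 12 * math.log(2) / math.log(3)

print(f"per fiber:  {fiber_gbps} Gbps")
print(f"per bundle: {bundle_tbps:.2f} Tbps")
print(f"3x/year is a doubling every {doubling_months:.1f} months")
```

The bundle figure comes out slightly above the slide's rounded 1.2 Tbps, and tripling every year corresponds to a doubling time a little under 8 months, consistent with the last bullet.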

1 fiber 25 Tbps
5
Redmond/Seattle, WA; San Francisco, CA; New York; Arlington, VA
Information Sciences Institute, Microsoft, Qwest, University of Washington,
Pacific Northwest Gigapop, HSCC (High Speed Connectivity Consortium), DARPA
5626 km, 10 hops
6
Storage Capacity Beating Moore's Law
  • 3 k$/TB today (raw disk)
  • 3 M$/PB
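
The two prices are the same figure at different scales. A minimal check, assuming the amounts are US dollars (the currency symbol is missing from the transcript) and decimal units (1 PB = 1000 TB):

```python
COST_PER_TB = 3_000   # "3 k/TB" from the slide; dollars are an assumption
TB_PER_PB = 1_000     # decimal petabyte, an assumption

cost_per_pb = COST_PER_TB * TB_PER_PB
print(f"{cost_per_pb / 1e6:.0f} M per PB of raw disk")
```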

7
Microsoft TerraServer http://TerraServer.Microsoft.com/
  • Build a multi-TB SQL Server database
  • Data must be
      • 1 TB
      • Unencumbered
      • Interesting to everyone everywhere
      • And not offensive to anyone anywhere
  • Loaded
      • 1.5 M place names from Encarta World Atlas
      • 7 M sq km USGS DOQ (1 meter resolution)
      • 10 M sq km USGS topos (2 m)
      • 1 M sq km from Russian Space Agency (2 m)
  • On the web (world's largest atlas)
  • Sell images with commerce server.
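
To get a feel for why this is a multi-terabyte problem, the sketch below converts each loaded area and ground resolution into a pixel count. The areas and resolutions come from the slide. The one-byte-per-pixel figure is purely an assumption for illustration, and compression and image pyramids are ignored.

```python
M2_PER_KM2 = 1_000_000
BYTES_PER_PIXEL = 1  # assumption for illustration only

# (name, area in square km, metres per pixel), as listed on the slide
datasets = [
    ("USGS DOQ", 7_000_000, 1.0),
    ("USGS topo", 10_000_000, 2.0),
    ("Russian imagery", 1_000_000, 2.0),
]


def pixel_count(area_km2: float, metres_per_pixel: float) -> float:
    """Number of square pixels needed to tile the area at this resolution."""
    return area_km2 * M2_PER_KM2 / metres_per_pixel ** 2


for name, area, resolution in datasets:
    pixels = pixel_count(area, resolution)
    terabytes = pixels * BYTES_PER_PIXEL / 1e12
    print(f"{name}: {pixels:.3g} pixels, about {terabytes:.2f} TB uncompressed")
```

Halving the resolution from 1 m to 2 m cuts the pixel count by four, which is why the larger topo area needs fewer pixels than the DOQ imagery.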

8
TerraServer 4.0 Configuration
3 Active Database Servers
  • SQL\Inst1: Topo/Relief Data
  • SQL\Inst2: Aerial Imagery
  • SQL\Inst3: Aerial Imagery
Logical Volume Structure
  • One rack per database
  • All volumes triple mirrored (3x)
  • MetaData on 15k rpm 18.2 GB drives
  • Image Data on 10k rpm 72.8 GB drives
  • MetaData: 101 GB
  • Image1-10: 3.4 TB cooked, 10 x 339 GB volumes
  • Spread across 3 servers: 2x4 to photo servers,
    1x2 for topo/relief server
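
The volume figures above fit together as follows. The sketch recomputes the cooked capacity, the raw capacity behind triple mirroring, and a lower bound on the number of image drives. The drive count is a derived estimate, not a number from the slide, and it ignores spares and formatting overhead.

```python
import math

VOLUMES = 10          # Image1-10
VOLUME_GB = 339       # per volume, from the slide
MIRROR_COPIES = 3     # "triple mirrored (3x)"
DRIVE_GB = 72.8       # image-data drive size, from the slide

cooked_gb = VOLUMES * VOLUME_GB
raw_gb = cooked_gb * MIRROR_COPIES
min_image_drives = math.ceil(raw_gb / DRIVE_GB)  # derived estimate only

# 2 photo servers x 4 volumes, plus 1 topo/relief server x 2 volumes
volumes_by_server = 2 * 4 + 1 * 2

print(f"cooked: {cooked_gb / 1000:.2f} TB, raw: {raw_gb / 1000:.2f} TB")
print(f"at least {min_image_drives} image drives")
print(f"volumes placed: {volumes_by_server}")
```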
9
TerraServer Activity
10
TerraServer.Microsoft.NET: A Web Service
Before .NET
With .NET
11
TerraServer Recent/Current Effort
  • Added USGS Topographic maps (4 TB)
  • High availability (4 node cluster with failover)
  • Integrated with Encarta Online
  • The other 25% of the US DOQs (photos)
  • Adding digital elevation maps
  • Open architecture: publish SOAP interfaces.
  • Adding multi-layer maps (with UC Berkeley)
  • Geo-Spatial extension to SQL Server

12
Astronomy is Changing (and so are other
sciences): The World Virtual Observatory
  • Astronomy data doubles every 2 years.
  • Astronomers have a few PB
  • Data is public after 2 years.
  • So everyone has ½ the data
  • Some people have 5% more private data
  • So, it's a nearly level playing field
  • Most accessible data is public.
  • Cyberspace is the new telescope
  • Multi-spectral, very deep
  • Computer Science challenge: organize these
    datasets and provide easy access to them.
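
The "everyone has ½ the data" claim follows from the two numbers on the slide. If the total volume grows exponentially with a fixed doubling time, the share that is older than the embargo period is a constant. A minimal sketch, assuming smooth exponential growth (the function name is illustrative):

```python
def public_fraction(doubling_years: float, embargo_years: float) -> float:
    """Share of all data collected so far that is older than the embargo.

    Assumes the total volume grows as 2 ** (t / doubling_years), so the
    volume that existed embargo_years ago is a fixed fraction of today's.
    """
    return 2.0 ** (-embargo_years / doubling_years)


print(public_fraction(doubling_years=2, embargo_years=2))
```

With a 2-year doubling time and a 2-year embargo the public share is exactly one half. A shorter embargo would push the share higher.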

13
The Sloan Digital Sky Survey
The University of Chicago, Princeton University, The Johns Hopkins
University, The University of Washington, Fermi National Accelerator
Laboratory, US Naval Observatory, The Japanese Participation Group,
The Institute for Advanced Study
SLOAN Foundation, NSF, DOE, NASA
Goal: Create a detailed multicolor map of the
Northern Sky over 5 years
  • Special 2.5 m telescope
  • Two surveys in one
      • Photometric survey in 5 bands.
      • Spectroscopic redshift survey.
  • Huge CCD mosaic
      • 30 CCDs 2K x 2K (imaging)
      • 22 CCDs 2K x 400 (astrometry)
  • Two high-resolution spectrographs
      • 2 x 320 fibers, with 3 arcsec diameter.
      • R = 2000 resolution with 4096 pixels.
      • Spectral coverage from 3900 Å to 9200 Å.
  • Automated data reduction
      • Over 70 man-years of development effort.
        (Fermilab and collaboration scientists)
  • Very high data volume
      • 40 TB of raw, 3 TB cooked data (all public).
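
For scale, the CCD mosaic and data-volume figures can be multiplied out. The sketch assumes "2K" means 2048 pixels. Everything else is taken from the slide.

```python
K = 2048  # assumption: "2K" read as 2048 pixels

imaging_pixels = 30 * K * K        # 30 CCDs, 2K x 2K
astrometry_pixels = 22 * K * 400   # 22 CCDs, 2K x 400

raw_tb, cooked_tb = 40, 3          # from the slide
reduction_ratio = raw_tb / cooked_tb

print(f"imaging:    {imaging_pixels / 1e6:.1f} Mpixels")
print(f"astrometry: {astrometry_pixels / 1e6:.1f} Mpixels")
print(f"raw to cooked reduction: about {reduction_ratio:.0f}x")
```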
14
The Cosmic Genome Project
The SDSS will create the ultimate map of the
Universe, with much more detail than any other
measurement before
15
Area and Size of Redshift Surveys
16
Experiment with Relational DBMS
  • See if SQL's Good Indexing and Scanning
    Compensates for Poor Object Support.
  • Leverage Fast/Big/Cheap Commodity Hardware.
  • Ported 40 GB Sample Database (from SDSS Sample
    Scan) to SQL Server 2000
  • Building public web site and data server

17
20 Astronomy Queries
  • Implemented spatial access extension to SQL (HTM)
  • Implement 20 Astronomy Queries in SQL (see paper
    for details).
  • 15M rows, 378 cols, 30 GB. Can scan it in 8
    minutes (disk IO limited).
  • Many queries run in seconds
  • Create Covering Indexes on queried columns.
  • Create Neighbors Table listing objects within 1
    arc-minute (5 neighbors on the average) for
    spatial joins.
  • Install some more disks!
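
The table size and scan time on this slide imply a sequential read rate and an average row width. The sketch below works both out, assuming decimal units (1 GB = 10^9 bytes) and reading "15M" as 15 million rows.

```python
TABLE_BYTES = 30e9        # 30 GB, decimal units assumed
ROWS = 15e6               # 15M rows
COLUMNS = 378
SCAN_SECONDS = 8 * 60     # full scan in 8 minutes

scan_mb_per_s = TABLE_BYTES / SCAN_SECONDS / 1e6
bytes_per_row = TABLE_BYTES / ROWS
bytes_per_column = bytes_per_row / COLUMNS

print(f"scan rate: {scan_mb_per_s:.1f} MB/s")
print(f"row width: {bytes_per_row:.0f} bytes, "
      f"about {bytes_per_column:.1f} bytes per column")
```

Rows this wide are the reason a covering index helps. A query that touches only a handful of the 378 columns can be answered from a much narrower index instead of reading every row in full.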

18
Query to Find Gravitational Lenses
Find all objects within 1 arc-minute of each
other that have very similar colors (the color
ratios u-g, g-r, r-i are less than 0.05m)
1 arc-minute
19
SQL Query to Find Gravitational Lenses
  • Find nearby objects with similar color ratios.
  • select count(*)
    from Objects L, Objects O, neighbors N
    where L.Obj_id = N.Obj_id
      and O.Obj_id = N.neighbor_Obj_id
      and L.Obj_id < O.Obj_id -- no dups
      and ABS((L.u-L.g)-(O.u-O.g)) < 0.05
      and ABS((L.g-L.r)-(O.g-O.r)) < 0.05
      and ABS((L.r-L.i)-(O.r-O.i)) < 0.05
      and ABS((L.z-L.r)-(O.z-O.r)) < 0.05
  • Finds 5223 objects, executes in 6 minutes.
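
The neighbors-table approach can be tried on a toy catalogue. The sketch below is not the SDSS schema or the SQL Server setup from the talk. It uses Python's built-in sqlite3 module, four invented objects, a brute-force small-angle distance in place of the HTM spatial index, and assumes the missing comparison operators are `=`, `<` and `< 0.05`.

```python
import math
import sqlite3

# Invented toy data: (obj_id, ra_deg, dec_deg, u, g, r, i, z)
objects = [
    (1, 10.0000, 0.0000, 19.00, 18.00, 17.50, 17.20, 17.00),
    (2, 10.0100, 0.0050, 20.02, 19.00, 18.51, 18.22, 18.00),   # near 1, similar colors
    (3, 10.0050, -0.0050, 19.00, 18.50, 17.50, 17.20, 17.00),  # near 1, different colors
    (4, 50.0000, 20.0000, 19.00, 18.00, 17.50, 17.20, 17.00),  # far from the others
]


def separation_arcmin(ra1, dec1, ra2, dec2):
    """Small-angle approximation of the angular distance, in arc-minutes."""
    dra = (ra1 - ra2) * math.cos(math.radians((dec1 + dec2) / 2))
    return math.hypot(dra, dec1 - dec2) * 60


conn = sqlite3.connect(":memory:")
conn.execute(
    "create table Objects (obj_id integer primary key, ra_deg real, "
    "dec_deg real, u real, g real, r real, i real, z real)"
)
conn.execute("create table neighbors (obj_id integer, neighbor_obj_id integer)")
conn.executemany("insert into Objects values (?, ?, ?, ?, ?, ?, ?, ?)", objects)

# Precompute the neighbors table once: every ordered pair within 1 arc-minute.
pairs = [
    (a[0], b[0])
    for a in objects
    for b in objects
    if a[0] != b[0] and separation_arcmin(a[1], a[2], b[1], b[2]) <= 1.0
]
conn.executemany("insert into neighbors values (?, ?)", pairs)

LENS_QUERY = """
select count(*)
from Objects L, Objects O, neighbors N
where L.obj_id = N.obj_id
  and O.obj_id = N.neighbor_obj_id
  and L.obj_id < O.obj_id                    -- count each pair once
  and abs((L.u - L.g) - (O.u - O.g)) < 0.05
  and abs((L.g - L.r) - (O.g - O.r)) < 0.05
  and abs((L.r - L.i) - (O.r - O.i)) < 0.05
  and abs((L.z - L.r) - (O.z - O.r)) < 0.05
"""
lens_candidates = conn.execute(LENS_QUERY).fetchone()[0]
print(len(pairs), "neighbor rows;", lens_candidates, "candidate pair(s)")
```

Precomputing the neighbors table turns the spatial join into an ordinary equi-join, so the query itself needs no distance computation. The brute-force pair loop here is quadratic and only suitable for a toy example.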

20
SQL Results so far.
  • Have run 17 of 20 queries so far. Working on
    spectra load and queries now.
  • Most queries are IO bound (80 MB/sec on 4 disks in 6
    minutes)
  • Covering indexes reduce execution to
  • Common to get grid distributions:
    select convert(int, ra*30)/30.0 as ra_bucket,
           convert(int, dec*30)/30.0 as dec_bucket,
           count(*) as bucket_count
    from Galaxies
    where (u-g) > 1
      and r
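
The grid-distribution idea is to truncate each coordinate to a cell of 1/30 degree (2 arc-minutes) and count objects per cell. The sketch below reproduces that pattern with Python's built-in sqlite3 module on four invented rows. SQLite's `cast(... as integer)` stands in for `convert(int, ...)`, the `>` in the color cut and the `group by` are assumptions, and the cut on `r` is left out because it is truncated in the transcript.

```python
import sqlite3

# Invented toy rows: (ra_deg, dec_deg, u, g, r)
galaxies = [
    (10.01, 5.01, 19.5, 18.0, 17.0),
    (10.02, 5.02, 19.6, 18.2, 17.1),  # same 2-arc-minute cell as the first row
    (10.05, 5.01, 19.4, 18.1, 17.3),  # next cell in right ascension
    (10.01, 5.01, 18.5, 18.0, 17.0),  # fails the (u - g) cut
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table Galaxies (ra_deg real, dec_deg real, u real, g real, r real)"
)
conn.executemany("insert into Galaxies values (?, ?, ?, ?, ?)", galaxies)

GRID_QUERY = """
select cast(ra_deg * 30 as integer) / 30.0  as ra_bucket,
       cast(dec_deg * 30 as integer) / 30.0 as dec_bucket,
       count(*)                             as bucket_count
from Galaxies
where (u - g) > 1
group by ra_bucket, dec_bucket
order by ra_bucket, dec_bucket
"""
grid = conn.execute(GRID_QUERY).fetchall()
for ra_bucket, dec_bucket, bucket_count in grid:
    print(f"ra {ra_bucket:.4f}  dec {dec_bucket:.4f}  count {bucket_count}")
```

Truncation toward zero gives uniform cells only for non-negative coordinates. Negative declinations would need a floor instead of a cast.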

21
Summary
  • Technology
      • 1 M$/PB: store everything online (twice!)
      • Gigabit to the desktop: store it anywhere
  • So: you can store everything,
      • Anywhere in the world
      • Online everywhere
  • Research driven by apps
      • TerraServer
      • National Virtual Astronomy Observatory.