Title: Building Petabyte Data Stores
1 Building Petabyte Data Stores
- Jim Gray
- Microsoft Research
- Research.Microsoft.com/Gray
2 The Asilomar Report on Database Research
- Phil Bernstein, Michael Brodie, Stefano Ceri, David DeWitt, Mike Franklin, Hector Garcia-Molina, Jim Gray, Jerry Held, Joe Hellerstein, H. V. Jagadish, Michael Lesk, Dave Maier, Jeff Naughton, Hamid Pirahesh, Mike Stonebraker, and Jeff Ullman
- September 1998
- The field needs to radically broaden its research focus to attack the issues of capturing, storing, analyzing, and presenting the vast array of online data.
- Broadening the definition of database management to embrace all the content of the Web and other online data stores, and rethinking our fundamental assumptions in light of technology shifts.
- Encouraging more speculative and long-range work, moving conferences to a poster format, and publishing all research literature on the Web.
- http://research.microsoft.com/gray/Asilomar_DB_98.html
3 So, how are we doing?
- Capture, store, analyze, present terabytes?
- Making web data accessible?
- Publishing on the web (CoRR?)
- Posters-Workshops vs Conferences-Journals?
4 Outline
- Technology
- $1M/PB: store everything online (twice!)
- End-to-end high-speed networks
- Gigabit to the desktop
- So you can store everything,
- Anywhere in the world
- Online everywhere
- Research driven by apps
- TerraServer
- National Virtual Astronomy Observatory.
5 Reality Check
- Good news
- In the limit, processing, storage, and network are free
- Processing and network are infinitely fast
- Bad news
- Most of us live in the present.
- People are getting more expensive. Management/programming cost exceeds hardware cost.
- Speed of light is not improving.
- WAN prices have not changed much in the last 8 years.
6 How Much Information Is There?
- Soon everything can be recorded and indexed
- Most data will never be seen by humans
- Precious resource: human attention
- Auto-summarization and auto-search are the key technology. www.lesk.com/mlesk/ksg97/ksg.html
[Chart: a photo, a book, a movie, all LoC books (words), all books plus multimedia, and everything recorded, placed on a scale from kilo (10^3) through mega, giga, tera, peta, exa, zetta, to yotta (10^24)]
7 Trends: ops/s/$ Had Three Growth Phases
- 1890-1945
- Mechanical
- Relay
- 7-year doubling
- 1945-1985
- Tube, transistor,..
- 2.3 year doubling
- 1985-2000
- Microprocessor
- 1.0 year doubling
8 Storage capacity beating Moore's Law
9 Cheap Storage and/or Balanced System
- Low-cost storage (2 x $3K servers): $6K/TB
- 2 x (800 MHz, 256 MB, 8 x 80 GB disks, 100 MbE)
- Balanced server ($5K / 0.64 TB):
- 2 x 800 MHz ($2K)
- 512 MB
- 8 x 80 GB drives ($2.4K)
- Gbps Ethernet switch ($500/port)
- $10K/TB, $20K/RAIDed TB
10 Hot Swap Drives for Archive or Data Interchange
- 35 MBps write (so can write N x 80 GB in 40 minutes)
- 80 GB/overnight = N x 3 MB/second @ $19.95/night
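A rough check of the N x 3 MB/second figure, assuming "overnight" means roughly eight hours in transit:
$$ \frac{80\ \text{GB}}{8\ \text{hr}} \approx \frac{8 \times 10^{10}\ \text{B}}{2.9 \times 10^{4}\ \text{s}} \approx 2.8\ \text{MB/s per drive shipped} $$
So a box of N drives is an N x 3 MB/s "network", which is what makes shipping drives attractive for data interchange.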
11 The Absurd Disk (1 TB, 100 MB/s, 200 Kaps)
- 2.5 hr scan time (poor sequential access)
- 1 access per second per 5 GB (VERY cold data)
- It's a tape!
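Both headline numbers follow directly from the hypothetical drive's parameters:
$$ \frac{1\ \text{TB}}{100\ \text{MB/s}} = 10^{4}\ \text{s} \approx 2.8\ \text{hr}, \qquad \frac{1\ \text{TB}}{200\ \text{accesses/s}} = 5\ \text{GB per (access/s)} $$
roughly the 2.5-hour scan and the 1-access-per-second-per-5-GB coldness quoted above.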
12 Disk vs Tape
- Disk
- 80 GB
- 35 MBps
- 5 ms seek time
- 3 ms rotate latency
- $4/GB for drive, $3/GB for controllers/cabinet
- 4 TB/rack
- 1 hour scan
- Tape
- 40 GB
- 10 MBps
- 10 sec pick time
- 30-120 second seek time
- $2/GB for media, $8/GB for drive+library
- 10 TB/rack
- 1 week scan
Guesstimates: CERN has ~200 TB on 3480 tapes; a tape rack holds ~1 TB with 12 drives.
The price advantage of tape is narrowing, and the performance advantage of disk is growing. At $10K/TB, disk is competitive with nearline tape.
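A sanity check on the scan columns, assuming the racks stream all drives in parallel: the disk rack is limited by a single drive's scan, while the 10 TB tape rack funnels through its 12 drives:
$$ \frac{80\ \text{GB}}{35\ \text{MB/s}} \approx 2{,}300\ \text{s} \approx 40\ \text{min}, \qquad \frac{10\ \text{TB}}{12 \times 10\ \text{MB/s}} \approx 8.3 \times 10^{4}\ \text{s} \approx 1\ \text{day} $$
With pick times and seeks across thousands of mounts, the tape figure stretches toward the quoted week.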
13 It's Hard to Archive a Petabyte: It takes a LONG time to restore it.
- At 1 GBps it takes 12 days!
- Store it in two (or more) places online (on disk?): a geo-plex
- Scrub it continuously (look for errors)
- On failure,
- use the other copy until the failure is repaired,
- refresh the lost copy from the safe copy.
- Can organize the two copies differently (e.g. one by time, one by space)
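The 12-day figure is just capacity over bandwidth:
$$ \frac{1\ \text{PB}}{1\ \text{GB/s}} = \frac{10^{15}\ \text{B}}{10^{9}\ \text{B/s}} = 10^{6}\ \text{s} \approx 11.6\ \text{days} $$
which is why the slide prefers two live copies plus continuous scrubbing over restore-from-archive.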
14 Next Step in the Evolution
- Disks become supercomputers
- Controller will have 1 bips (billion instructions/sec), 1 GB RAM, 1 GBps net
- And a disk arm.
- Disks will run full-blown app/web/db/os stack
- Distributed computing
- Processors migrate to transducers.
15 Terabyte (Petabyte) Processing Requires Parallelism
- Parallelism: use many little devices in parallel
16 Parallelism Must Be Automatic
- There are thousands of MPI programmers.
- There are hundreds of millions of people using parallel database search.
- Parallel programming is HARD!
- Find design patterns and automate them.
- Data search/mining has parallel design patterns; a minimal sketch follows.
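A minimal illustration of the point, in the SQL Server dialect used later in this talk (table and column names are hypothetical): the user writes one declarative statement, and the engine is free to partition the scan across disks and processors; nobody writes message-passing code.

    -- Histogram a billion-row table in 0.1-magnitude color bins.
    -- The parallelism is the optimizer's problem, not the user's.
    select convert(int, (g - r) * 10) / 10.0 as color_bucket,
           count(*)                          as n_objects
    from   Galaxies
    where  r < 21.5
    group by convert(int, (g - r) * 10) / 10.0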
17 Gilder's Law: 3x bandwidth/year for 25 more years
- Today
- 10 Gbps per channel
- 4 channels per fiber: 40 Gbps
- 32 fibers/bundle: 1.2 Tbps/bundle
- In the lab: 3 Tbps/fiber (400 x WDM)
- In theory: 25 Tbps per fiber
- 1 Tbps: USA's 1996 WAN bisection bandwidth
- Aggregate bandwidth doubles every 8 months!
- 1 fiber = 25 Tbps
18 Sense of Scale: How fat is your pipe?
- 300 MBps: OC48, G2, or memcpy()
- 94 MBps: coast to coast
- 90 MBps: PCI
- 20 MBps: disk / ATM / OC3
- The fattest pipe on the MS campus is the WAN!
19 Redmond/Seattle, WA to Arlington, VA: 5626 km, 10 hops
[Map: coast-to-coast path via San Francisco, CA and New York, over the Pacific Northwest Gigapop and Qwest; partners: Information Sciences Institute, Microsoft, University of Washington, and the DARPA HSCC (High Speed Connectivity Consortium)]
20 Outline
- Technology
- $1M/PB: store everything online (twice!)
- End-to-end high-speed networks
- Gigabit to the desktop
- So you can store everything,
- Anywhere in the world
- Online everywhere
- Research driven by apps
- TerraServer
- National Virtual Astronomy Observatory.
21 Interesting Apps
- EOS/DIS
- TerraServer
- Sloan Digital Sky Survey
[Scale: Kilo 10^3, Mega 10^6, Giga 10^9, Tera 10^12 (today, we are here), Peta 10^15, Exa 10^18]
22 The Challenge: EOS/DIS
- Antarctica is melting: 77% of fresh water liberated
- Sea level rises 70 meters
- Chico and Memphis are beach-front property
- New York, Washington, SF, LA, London, Paris
- Let's study it! Mission to Planet Earth
- EOS: Earth Observing System ($17B -> $10B)
- 50 instruments on 10 satellites, 1999-2003
- Landsat (added later)
- EOS DIS: Data Information System
- 3-5 MB/s raw, 30-50 MB/s processed
- 4 TB/day,
- 15 PB by year 2007
23 The Process Flow
- Data arrives and is pre-processed.
- Instrument data is calibrated, gridded, averaged
- Geophysical data is derived
- Users ask for stored data OR to analyze and combine data.
- Can make the pull-push split dynamically
[Diagram: push processing on arrival feeds the archive; pull processing combines stored data with other data on demand]
24 Key Architecture Features
- 2+N data center design
- Scaleable OR-DBMS
- Emphasize pull vs. push processing
- Storage hierarchy
- Data pump
- Just-in-time acquisition
25 2+N Data Center Design
- Duplex the archive (for fault tolerance)
- Let anyone build an extract (the N)
- Partition data by time and by space (store 2 or 4 ways).
- Each partition is a free-standing OR-DBMS (similar to Tandem, Teradata designs).
- Clients and partitions interact via standard protocols: HTTP, XML, ... (a partitioning sketch follows this list)
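One concrete way to realize the time partitioning in the SQL Server of the day is a partitioned view: each partition is a free-standing table with a CHECK constraint, and a UNION ALL view glues them into one logical archive. A sketch, with all table, column, and granule names hypothetical:

    -- Each time partition stands alone (possibly on its own server).
    create table Obs1999 (
        obs_day    datetime not null
                   check (obs_day >= '19990101' and obs_day < '20000101'),
        granule_id int      not null,
        payload    image,
        primary key (obs_day, granule_id))

    create table Obs2000 (
        obs_day    datetime not null
                   check (obs_day >= '20000101' and obs_day < '20010101'),
        granule_id int      not null,
        payload    image,
        primary key (obs_day, granule_id))

    -- Clients see one archive; date predicates touch only the
    -- relevant member table.
    create view Observations as
        select obs_day, granule_id, payload from Obs1999
        union all
        select obs_day, granule_id, payload from Obs2000

The same trick applies along the spatial dimension, giving the 2-or-4-way storage mentioned above.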
26 Data Pump
- Some queries require reading ALL the data (for reprocessing)
- Each data center scans ALL the data every 2 days.
- Data rate: 10 PB/day = 10 TB/node/day = 120 MB/s per node
- Compute on demand for small jobs:
- less than 100 M disk accesses
- less than 100 TeraOps
- (less than 30 minute response time)
- For BIG JOBS, scan the entire 15 PB database
- Queries (and extracts) snoop this data pump.
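A consistency check on the quoted rates: 10 PB/day at 10 TB/node/day implies roughly a thousand nodes, and per node
$$ \frac{10\ \text{TB/day}}{86{,}400\ \text{s/day}} \approx 116\ \text{MB/s} \approx 120\ \text{MB/s} $$
which is a few commodity disks streaming flat out.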
27 Just-in-Time Acquisition (30%)
- Hardware prices decline 20%-40%/year
- So buy at the last moment
- Buy the best product that day: commodity
- Depreciate over 3 years so that the facility is fresh.
- (After 3 years, cost is ~23% of original.)
- A 60% decline rate peaks at $10M (see chart)
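Checking the parenthetical at the steep end of the decline range:
$$ (1 - 0.40)^3 = 0.216 \approx 22\% $$
so a 3-year-old facility is worth roughly the quoted 23% of its purchase price, which is what makes late buying plus 3-year depreciation pay.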
[Chart: EOS DIS disk storage size and cost, 1994-2008, assuming a 40% price decline/year; data need (TB) vs. storage cost ($M); 2 PB @ $100M]
28 Problems
- Management (and HSM)
- Design and Meta-data
- Ingest
- Data discovery, search, and analysis
- Auto Parallelism
- reorg-reprocess
29 What This System Taught Me
- Traditional storage metrics
- KAPS: KB objects accessed per second
- $/GB: storage cost
- New metrics
- MAPS: megabyte objects accessed per second
- SCANS: time to scan the archive
- Admin cost dominates (!!)
- Auto-parallelism is essential.
30 Outline
- Technology
- $1M/PB: store everything online (twice!)
- End-to-end high-speed networks
- Gigabit to the desktop
- So you can store everything,
- Anywhere in the world
- Online everywhere
- Research driven by apps
- TerraServer
- National Virtual Astronomy Observatory.
31 Microsoft TerraServer: http://TerraServer.Microsoft.com/
- Build a multi-TB SQL Server database
- Data must be
- 1 TB+
- Unencumbered
- Interesting to everyone everywhere
- And not offensive to anyone anywhere
- Loaded
- 1.5 M place names from Encarta World Atlas
- 7 M sq km USGS DOQs (1-meter resolution)
- 10 M sq km USGS topos (2 m)
- 1 M sq km from Russian Space Agency (2 m)
- On the web (world's largest atlas)
- Sell images with commerce server.
32 Background
- Earth is 500 tera-square-meters (Tm²)
- USA is 10 Tm²
- 100 Tm² of land lies between 70°N and 70°S
- We have pictures of 9% of it
- 7 Tm² from USGS
- 1 Tm² from Russian Space Agency
- Compress 5:1 (JPEG) to 1.5 TB.
- Slice into 10 KB chunks (200x200 pixels)
- Store chunks in DB (a schema sketch follows this list)
- Navigate with
- Encarta Atlas
- globe
- gazetteer
- Someday
- multi-spectral image
- of everywhere
- once a day / hour
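The chunking above maps naturally onto a single keyed table: each 200x200-pixel JPEG tile is one row, addressed by theme, resolution, and grid position. A minimal sketch (the real TerraServer 4.0 schema on the next slides is richer; all names and values here are illustrative):

    create table Tile (
        theme  tinyint not null,   -- aerial photo, topo map, relief, ...
        res    tinyint not null,   -- level in the resolution pyramid
        tile_x int     not null,   -- tile column in the grid
        tile_y int     not null,   -- tile row in the grid
        jpeg   image   not null,   -- ~10 KB compressed 200x200 chunk
        primary key (theme, res, tile_x, tile_y))

    -- One screenful of map is a small range scan over adjacent tiles:
    select tile_x, tile_y, jpeg
    from   Tile
    where  theme = 1 and res = 2
      and  tile_x between 4961 and 4963
      and  tile_y between 5880 and 5882

Clustering on (theme, res, tile_x, tile_y) keeps a screenful of tiles nearly contiguous on disk.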
33 TerraServer 4.0 Configuration
- 3 active database servers:
- SQL\Inst1: topo and relief data
- SQL\Inst2: aerial imagery
- SQL\Inst3: aerial imagery
- Logical volume structure: one rack per database; all volumes triple-mirrored (3x)
- Meta-data on 15k rpm 18.2 GB drives; image data on 10k rpm 72.8 GB drives
- 2 spare volumes allocated per cluster
- 6 additional 339 GB volumes to be added by year end (2 per DB server)
34 TerraServer 4.0 Schema
35 File System Config
- Use StorageWorks to form 28 RAID5 sets; each RAID set has 11 disks (+ 16 spare drives)
- Use NTFS to form four 595 GB NT volumes, each striped over 7 RAID sets on 7 controllers
- DB is a file group of 80 x 20,000 MB files (1.5 TB)
36 BAD OLD Load
37 Load Process
[Diagram: TerraCutter reads image files at the Executive Briefing Center (Redmond, WA); TerraScale moves tiles over the corporate network (read 4 images, write 1) into three 2 TB databases at the Internet Data Center (Tukwila, WA)]
38 After a Year
- 15 TB of data (raw), 3B records
- 2.3 billion hits
- 2.0 billion DB queries
- 1.7 billion images sent (2 TB of download)
- 368 million page views
- 99.93% DB availability
- 4th design now online
- Built and operated by a team of 4 people
[Chart: TerraServer daily traffic, June 22, 1998 through June 22, 1999: hit count, page views, DB queries, images, and sessions, ranging 0-30M per day]
39 TerraServer Activity
40 TerraServer.Microsoft.NET: A Web Service
Before .NET
With .NET
41 TerraServer Recent/Current Effort
- Added USGS topographic maps (4 TB)
- High availability (4-node cluster with failover)
- Integrated with Encarta Online
- The other 25% of the US DOQs (photos)
- Adding digital elevation maps
- Open architecture: publish SOAP interfaces.
- Adding multi-layer maps (with UC Berkeley)
- Geo-spatial extension to SQL Server
42 Thank You!
43 Outline
- Technology
- $1M/PB: store everything online (twice!)
- End-to-end high-speed networks
- Gigabit to the desktop
- So you can store everything,
- Anywhere in the world
- Online everywhere
- Research driven by apps
- TerraServer
- National Virtual Astronomy Observatory.
44 Astronomy Is Changing (and so are other sciences)
- Astronomers have a few PB
- Doubles every 2 years.
- Data is public after 2 years.
- So everyone has half the data
- Some people have 5% more private data
- So, it's a nearly level playing field
- Most accessible data is public.
45 (Inter)National Virtual Observatory
- Almost all astronomy datasets will be online
- Some are big (>> 10 TB)
- Total is a few petabytes
- Bigger datasets are coming
- Data is public
- Scientists can mine these datasets
- Computer science challenge: organize these datasets and provide easy access to them.
46 The Sloan Digital Sky Survey (slides by Alex Szalay)
A project run by the Astrophysical Research Consortium (ARC): The University of Chicago, Princeton University, The Johns Hopkins University, The University of Washington, Fermi National Accelerator Laboratory, US Naval Observatory, The Japanese Participation Group, The Institute for Advanced Study; funded by the SLOAN Foundation, NSF, DOE, NASA.
Goal: create a detailed multicolor map of the Northern Sky over 5 years, with a budget of approximately $80M.
Data size: 40 TB raw, 1 TB processed
47 Features of the SDSS
- Special 2.5m telescope, located at Apache Point, NM: 3-degree field of view, zero-distortion focal plane.
- Two surveys in one: photometric survey in 5 bands; spectroscopic redshift survey.
- Huge CCD mosaic: 30 CCDs 2K x 2K (imaging); 22 CCDs 2K x 400 (astrometry).
- Two high-resolution spectrographs: 2 x 320 fibers with 3 arcsec diameter; R ~ 2000 resolution with 4096 pixels; spectral coverage from 3900 Å to 9200 Å.
- Automated data reduction: over 70 man-years of development effort (Fermilab and collaboration scientists).
- Very high data volume: expect over 40 TB of raw data, about 3 TB of processed catalogs. Data made available to the public.
48 Apache Point Observatory
- Located in New Mexico, near White Sands National Monument
- Special 2.5m telescope: 3-degree field of view, zero-distortion focal plane, wind screen moved separately
49 Scientific Motivation
- Create the ultimate map of the Universe -> the Cosmic Genome Project!
- Study the distribution of galaxies -> What is the origin of fluctuations? What is the topology of the distribution?
- Measure the global properties of the Universe -> How much dark matter is there?
- Local census of the galaxy population -> How did galaxies form?
- Find the most distant objects in the Universe -> What are the highest quasar redshifts?
50 Cosmology Primer
- The Universe is expanding: galaxies move away from us, so spectral lines are redshifted: v = H0 x r (Hubble's law)
- The fate of the universe depends on the balance between gravity and the expansion velocity: Ω = density / critical density; if Ω < 1, it expands forever
- Most of the mass in the Universe is dark matter (the dark Ω exceeds the luminous Ω), and it may be cold (CDM)
- The spatial distribution of galaxies is correlated, due to small ripples in the early Universe: P(k), the power spectrum
51 The Naught Problem
- What are the global parameters of the Universe?
- H0, the Hubble constant: 55-75 km/s/Mpc
- Ω0, the density parameter: 0.25-1
- Λ0, the cosmological constant: 0-0.7
- Their values are still quite uncertain today...
- Goal: measure these parameters with an accuracy of a few percent.
- High-precision cosmology!
52 The Cosmic Genome Project
The SDSS will create the ultimate map of the Universe, with much more detail than any other measurement before
53 Area and Size of Redshift Surveys
54 The Topology of the Local Universe
- Measure the topology of the Universe: does it consist of walls and voids, or is it randomly distributed?
55 Finding the Most Distant Objects
- Intermediate and high redshift QSOs
- Multicolor selection function
- Luminosity functions and spatial clustering
- High redshift QSOs (z > 5)
56 The Photometric Survey
- Northern Galactic Cap: 5 broad-band filters (u', g', r', i', z')
- Limiting magnitudes (22.3, 23.3, 23.1, 22.3, 20.8)
- Drift scan of 10,000 square degrees, 55 sec exposure time
- 40 TB raw imaging data -> pipeline -> 100,000,000 galaxies + 50,000,000 stars
- Calibration to 2% at r' ~ 19.8; only done in the best seeing (20 nights/yr)
- Pixel size is 0.4 arcsec; astrometric precision is 60 milliarcsec
- Southern Galactic Cap: multiple scans (> 30 times) of the same stripe
- Continuous data rate of 8 MB/sec
57 Survey Strategy
- Overlapping 2.5-degree-wide stripes
- Avoiding the Galactic Plane (dust)
- Multiple exposures on the three Southern stripes
58 The Spectroscopic Survey
- Measure redshifts of objects -> distance
- SDSS Redshift Survey: 1 million galaxies, 100,000 quasars, 100,000 stars
- Two high-throughput spectrographs: spectral range 3900-9200 Å; 640 spectra simultaneously; R ~ 2000 resolution
- Automated reduction of spectra
- Very high sampling density and completeness
- Objects in other catalogs also targeted
59 First Light Images
- Telescope first light: May 9th, 1998
- Equatorial scans
60 The First Stripes
- Camera: 5-color imaging of > 100 square degrees
- Multiple scans across the same fields
- Photometric limits as expected
61 NGC 6070
62 The First Quasars
- Three of the four highest-redshift quasars have been found in the first SDSS test data!
63 SDSS Data Flow
64 Data Processing Pipelines
65 SDSS Data Products
- Object catalog: 400 GB, parameters of > 10^8 objects
- Redshift catalog: 2 GB, parameters of 10^6 objects
- Atlas images: 1.5 TB, 5-color cutouts of > 10^9 objects
- Spectra: 60 GB, 10^6 spectra in one-dimensional form
- Derived catalogs: 60 GB, clusters and QSO absorption lines
- 4x4-pixel all-sky map: 1 TB, heavily compressed (5 x 10^5)
- All raw data saved in a tape vault at Fermilab
66 Concept of the SDSS Archive
- Science Archive (products accessible to users)
- Operational Archive (raw and processed data)
67 Parallel Query Implementation
- Getting 200 MBps/node through SQL today
- 4 GB/s on a 20-node cluster.
[Diagram: user interface and analysis engine submit queries to a master SX engine, which federates many slave DBMS nodes, each with its own RAID storage]
68 Who will be using the archive?
- Power users: sophisticated, with lots of resources; research is centered around the archive data; a moderate number of very intensive queries, mostly statistical, with large output sizes.
- General astronomy public: frequent but casual lookup of objects/regions; the archives help their research but are not central to it; a large number of small queries; a lot of cross-identification requests.
- Wide public: browsing a Virtual Telescope can have large public appeal; needs special packaging; could be a very large number of requests.
69 How will the data be analyzed?
- The data are inherently multidimensional -> positions, colors, size, redshift
- Improved classifications result in complex N-dimensional volumes -> complex constraints, not ranges
- Spatial relations will be investigated -> nearest neighbors; other objects within a radius
- Data mining, finding the needle in the haystack -> separate typical from rare; recognize patterns in the data
- Output size can be prohibitively large for intermediate files -> import output directly into analysis tools
70 Summary
- SDSS combines astronomy, physics, and computer science
- Promises to fundamentally change our view of the universe
- High-precision cosmology
- Serves as the standard astronomy reference for several decades
- Virtual universe can be explored by both scientists and the public
- A new paradigm in astronomy.
71 Designing and Mining Multi-Terabyte Astronomy Archives: The Sloan Digital Sky Survey (SDSS) http://www.sdss.org/
- Scan 10,000 sq. degrees (50%) of the northern sky.
- 200,000,000 objects.
- 100 dimensions.
- 40 TB of raw data.
- 1 TB of catalog data.
Alex S. Szalay, Peter J. Kunszt, Ani Thakar (The Johns Hopkins University); Jim Gray, Don Slutz (Microsoft Research); Robert J. Brunner (Calif. Institute of Technology)
72 Astronomical Growth of Collected Data
- Data gathering rate doubles every 20 months. (Moore's Law here too)
- Several orders of magnitude more data now!
- SDSS telescope has 120 million CCD pixels
- 55-second photometric exposure.
- 8 MB/sec data rate.
- 0.4 arc-sec pixel size.
- Also a spectroscopic survey of 1 million objects.
73 Major Changes in Astronomy
- Visual observation -> photographic plates -> massive scans of the sky collecting terabytes.
- A practice scan of the SDSS telescope discovered 3 of the 4 most distant quasars!
- SDSS plus other surveys will yield a Digital Sky
- Telescope-quality data available online.
- Spatial data mining will find new objects.
- New research areas, e.g. study of density fluctuations.
74 A Different Kind of Spatial Data
- All objects are on the celestial sphere's surface
- Position a point by 2 spherical angles (RA, DEC).
- Position by Cartesian x, y, z: easier to search within 1 arc-minute (a sketch follows this list).
- Hierarchy of spherical triangles for indexing.
- SDSS tree is 5 levels deep: 8192 triangles
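The Cartesian trick in SQL: precompute each object's unit vector (x, y, z); then "within 1 arc-minute of a direction" is a dot-product test, since two unit vectors separated by angle θ satisfy dot = cos θ. A sketch, with SkyObjects(obj_id, x, y, z) a hypothetical table of precomputed unit vectors:

    -- Search center (a unit vector) and the cosine of the radius.
    declare @cx float, @cy float, @cz float, @cosr float
    select @cx = 0.5, @cy = 0.5, @cz = sqrt(0.5)   -- example direction
    select @cosr = cos(radians(1.0 / 60.0))        -- 1 arc-minute

    -- An object is inside the cap iff its dot product with the
    -- center exceeds cos(radius).
    select obj_id
    from   SkyObjects
    where  x * @cx + y * @cy + z * @cz > @cosr

In the full design, the HTM triangle index first narrows the candidates to a few triangles; the dot-product predicate then applies the exact cut.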
75 Experiment with Relational DBMS
- See if SQL's good indexing and scanning compensates for poor object support.
- Leverage fast/big/cheap commodity hardware.
- Ported a 40 GB sample database (from an SDSS sample scan) to SQL Server 2000
- Building a public web site and data server
76 20 Astronomy Queries
- Implemented a spatial access extension to SQL (HTM)
- Implemented 20 astronomy queries in SQL (see paper for details).
- 15M rows x 378 cols, 30 GB. Can scan it in 8 minutes (disk-IO limited).
- Many queries run in seconds:
- Create covering indexes on queried columns (a DDL sketch follows this list).
- Create a Neighbors table listing objects within 1 arc-minute (5 neighbors on average) for spatial joins.
- Install some more disks!
77 Query to Find Gravitational Lenses
Find all objects within 1 arc-minute of each other whose colors (u-g, g-r, r-i) each agree to within 0.05 mag.
[Figure: two candidate objects within 1 arc-minute of each other]
78 SQL Query to Find Gravitational Lenses
    select count(*)
    from sxTag T, sxTag U, neighbors N
    where T.UObj_id = N.UObj_id
      and U.UObj_id = N.neighbor_UObj_id
      and N.UObj_id < N.neighbor_UObj_id          -- no dups
      and T.u > 0 and T.g > 0 and T.r > 0 and T.i > 0
      and U.u > 0 and U.g > 0 and U.r > 0 and U.i > 0
      and ABS((T.u-T.g) - (U.u-U.g)) < 0.05       -- similar color
      and ABS((T.g-T.r) - (U.g-U.r)) < 0.05
      and ABS((T.r-T.i) - (U.r-U.i)) < 0.05
- Finds 5223 objects; executes in 6 minutes.
79 SQL Results So Far
- Have run 17 of 20 queries so far.
- Most queries are IO-bound, scanning at 80 MB/sec on 4 disks in 6 minutes (at the PCI bus limit)
- Covering indexes reduce execution to < 30 secs.
- Common to get grid distributions:
    select convert(int, ra*30)/30.0,   -- ra bucket
           convert(int, dec*30)/30.0,  -- dec bucket
           count(*)                    -- bucket count
    from Galaxies
    where (u-g) > 1 and r < 21.5
    group by convert(int, ra*30)/30.0, convert(int, dec*30)/30.0
80 Distribution of Galaxies
81 Outline
- Technology
- $1M/PB: store everything online (twice!)
- End-to-end high-speed networks
- Gigabit to the desktop
- So you can store everything,
- Anywhere in the world
- Online everywhere
- Research driven by apps
- TerraServer
- National Virtual Astronomy Observatory.