Building PetaByte Servers

Title: Building PetaByte Data Servers
Author: Jim Gray
Last modified by: Jim Gray
Created Date: 3/18/1997 6:05:45 PM
Document presentation format: PowerPoint (PPT) presentation

Slides: 63

Transcript and Presenter's Notes

1
Building PetaByte Servers

Jim Gray
Microsoft Research
Gray@Microsoft.com
http://www.Research.Microsoft.com/Gray/talks

Kilo 10^3
Mega 10^6
Giga 10^9
Tera 10^12 (today, we are here)
Peta 10^15
Exa 10^18
2
Outline

The challenge: Building GIANT data stores
for example, the EOS/DIS 15 PB system
Conclusion 1:
Think about MOX and SCANS
Conclusion 2:
Think about Clusters

3
The Challenge -- EOS/DIS

Antarctica is melting -- 77% of fresh water liberated
sea level rises 70 meters
Chico & Memphis are beach-front property
New York, Washington, SF, LA, London, Paris
Let's study it! Mission to Planet Earth
EOS: Earth Observing System ($17B => $10B)
50 instruments on 10 satellites, 1997-2001
Landsat (added later)
EOS DIS: Data Information System
3-5 MB/s raw, 30-50 MB/s processed.
4 TB/day,
15 PB by year 2007
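
The volume figures on this slide hang together arithmetically. The sketch below is a rough check only, assuming decimal units (1 TB = 10^12 bytes) and continuous ingest, neither of which the slide states; the helper names are illustrative.

```python
# Back-of-envelope check of the EOS/DIS data volumes quoted on the slide.
SECONDS_PER_DAY = 24 * 60 * 60
MB, TB, PB = 10**6, 10**12, 10**15  # decimal units (an assumption)


def daily_volume_tb(rate_mb_per_s: float) -> float:
    """TB produced per day by a stream running at rate_mb_per_s."""
    return rate_mb_per_s * MB * SECONDS_PER_DAY / TB


def years_to_reach(target_pb: float, tb_per_day: float) -> float:
    """Years of continuous ingest needed to accumulate target_pb."""
    return target_pb * PB / (tb_per_day * TB) / 365


if __name__ == "__main__":
    for rate in (30, 50):
        print(f"{rate} MB/s processed -> {daily_volume_tb(rate):.1f} TB/day")
    print(f"15 PB at 4 TB/day takes {years_to_reach(15, 4):.1f} years")
```

At the upper processed rate the stream comes to a little over 4 TB/day, and at 4 TB/day the archive takes roughly a decade to reach 15 PB, which fits a 1997 start and a 2007 target.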

4
The Process Flow

Data arrives and is pre-processed.
instrument data is calibrated, gridded, averaged
Geophysical data is derived
Users ask for stored data OR to analyze and combine data.
Can make the pull-push split dynamically

(Diagram labels: Pull Processing, Push Processing, Other Data)
5
Designing EOS/DIS

Expect that millions will use the system (online)
Three user categories:
NASA: 500 -- funded by NASA to do science
Global Change: 10 k -- other dirt bags
Internet: 20 m -- everyone else
Grain speculators
Environmental Impact Reports
New applications => discovery & access must be automatic
Allow anyone to set up a peer-node (DAAC & SCF)
Design for Ad Hoc queries, Not Standard Data Products
If push is 90%, then 10% of data is read (on average).
=> A failure: no one uses the data; in DSS, push is 1% or less.
=> computation demand is enormous (pull:push is 100:1)

6
The architecture

2N data center design
Scaleable OR-DBMS
Emphasize Pull vs Push processing
Storage hierarchy
Data Pump
Just in time acquisition

7
Obvious Point: EOS/DIS will be a cluster of SMPs

It needs 16 PB storage
= 1 M disks in current technology
= 500K tapes in current technology
It needs 100 TeraOps of processing
= 100K processors (current technology)
and 100 Terabytes of DRAM
1997 requirements are 1000x smaller
smaller data rate
almost no re-processing work

8
2N data center design

duplex the archive (for fault tolerance)
let anyone build an extract (the N)
Partition data by time and by space (store 2 or 4 ways).
Each partition is a free-standing OR-DBMS (similar to Tandem, Teradata designs).
Clients and Partitions interact via standard protocols:
OLE-DB, DCOM/CORBA, HTTP, ...

9
Hardware Architecture

2 Huge Data Centers
Each has 50 to 1,000 nodes in a cluster
Each node has about 25 to 250 TB of storage

Component          Range                   Cost
SMP                .5 Bips to 50 Bips      $20K
DRAM               50 GB to 1 TB           $50K
100 disks          2.3 TB to 230 TB        $200K
10 tape robots     25 TB to 250 TB         $200K
2 Interconnects    1 GBps to 100 GBps      $20K

Node costs $500K
Data Center costs $25M (capital cost)
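
The node and data-center totals can be rebuilt from the component prices. This is a small sketch, reading the slide's figures as thousands of dollars and assuming the $25M total refers to the smallest (50-node) configuration; the slide does not say which node count it used.

```python
# Component prices per node in thousands of dollars, as listed on the slide.
NODE_PARTS_K = {
    "SMP": 20,
    "DRAM": 50,
    "100 disks": 200,
    "10 tape robots": 200,
    "2 interconnects": 20,
}


def node_cost_k(parts: dict[str, int]) -> int:
    """Sum of the component prices for one node, in $K."""
    return sum(parts.values())


def data_center_cost_m(nodes: int, per_node_k: float) -> float:
    """Capital cost of a data center in $M for a given node count."""
    return nodes * per_node_k / 1000


if __name__ == "__main__":
    print(f"parts add up to ${node_cost_k(NODE_PARTS_K)}K per node")
    print(f"50 nodes at $500K -> ${data_center_cost_m(50, 500):.0f}M")
```

The parts sum to $490K, which the slide rounds to $500K, and 50 such nodes give the quoted $25M.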

10
Scaleable OR-DBMS

Adopt cluster approach (Tandem, Teradata, VMScluster, ...)
System must scale to many processors, disks, links
OR DBMS based on standard object model
CORBA or DCOM (not vendor specific)
Grow by adding components
System must be self-managing

11
Storage Hierarchy

Cache hot 10% (1.5 PB) on disk.
Keep cold 90% on near-line tape.
Remember recent results on speculation
(more on this later: MOX/GOX/SCANS)

12
Data Pump

Some queries require reading ALL the data (for reprocessing)
Each Data Center scans the data every 2 weeks.
Data rate: 10 PB/day, 10 TB/node/day, 120 MB/s
Compute on demand: small jobs
less than 1,000 tape mounts
less than 100 M disk accesses
less than 100 TeraOps.
(less than 30 minute response time)
For BIG JOBS: scan entire 15PB database
Queries (and extracts) snoop this data pump.
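
The per-node pump rate follows from the daily volume per node. The sketch below assumes decimal units and a perfectly even spread of data across nodes; the function names are illustrative and not from the talk.

```python
import math

SECONDS_PER_DAY = 86_400
MB, TB, PB = 10**6, 10**12, 10**15  # decimal units (an assumption)


def node_rate_mb_per_s(tb_per_node_per_day: float) -> float:
    """Sustained rate one node needs to stream its daily share."""
    return tb_per_node_per_day * TB / SECONDS_PER_DAY / MB


def nodes_needed(archive_pb: float, scan_days: float, tb_per_node_per_day: float) -> int:
    """Nodes required to scan the archive once every scan_days."""
    bytes_per_day = archive_pb * PB / scan_days
    return math.ceil(bytes_per_day / (tb_per_node_per_day * TB))


if __name__ == "__main__":
    print(f"10 TB/node/day -> {node_rate_mb_per_s(10):.0f} MB/s per node")
    print(f"15 PB every 14 days -> {nodes_needed(15, 14, 10)} nodes at that rate")
```

Streaming 10 TB per node per day works out to about 116 MB/s, which the slide rounds to 120 MB/s.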

13
Just-in-time acquisition 30%

Hardware prices decline 20%-40%/year
So buy at last moment
Buy best product that day: commodity
Depreciate over 3 years so that facility is fresh.
(after 3 years, cost is 23% of original). 60% decline peaks at $10M

(Chart: EOS DIS Disk Storage Size and Cost, assuming 40% price decline/year. Data Need (TB) and Storage Cost ($M) by year, 1994 to 2008.)
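
The argument for buying at the last moment is compound price decline. A minimal sketch, assuming the decline compounds once per year at a constant rate (the slide gives only a range); the function name is illustrative.

```python
def remaining_price_fraction(annual_decline: float, years: int) -> float:
    """Fraction of today's price still paid after compounding the decline."""
    if not 0 <= annual_decline < 1:
        raise ValueError("annual_decline must be in [0, 1)")
    return (1 - annual_decline) ** years


if __name__ == "__main__":
    for decline in (0.20, 0.30, 0.40):
        frac = remaining_price_fraction(decline, 3)
        print(f"{decline:.0%}/year for 3 years -> {frac:.1%} of original price")
```

At 40%/year, three years of waiting leaves about 22% of the original price, in the neighborhood of the figure quoted on the slide.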
14
Problems

HSM
Design and Meta-data
Ingest
Data discovery, search, and analysis
reorg-reprocess
disaster recovery
cost

15
What this system teaches us

Traditional storage metrics
KOX: KB objects accessed per second
$/GB: Storage cost
New metrics
MOX: megabyte objects accessed per second
SCANS: Time to scan the archive

16
Thesis: Performance = Storage Accesses, not Instructions Executed

In the old days we counted instructions and IOs
Now we count memory references
Processors wait most of the time

17
The Pico Processor
1 M SPECmarks
10^6 clocks/fault to bulk RAM
Event-horizon on chip.
VM reincarnated
Multi-program cache
Terror Bytes!
18
Storage Latency: How Far Away is the Data?

Clocks   Storage level          Distance      Travel time
10^9     Tape/Optical Robot     Andromeda     2,000 Years
10^6     Disk                   Pluto         2 Years
100      Memory                 Sacramento    1.5 hr
10       On Board Cache         This Campus   10 min
2        On Chip Cache          This Room
1        Registers              My Head       1 min
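
The analogy on this slide works by stretching one clock tick to one minute. The sketch below applies that scaling to latencies of 1, 2, 10, 100, 10^6 and 10^9 clocks; those clock counts are read from the slide's residue, and the level names and formatting are assumptions for illustration.

```python
MIN_PER_HOUR = 60
MIN_PER_YEAR = 60 * 24 * 365

# Latency of each storage level in processor clocks (read from the slide).
LEVELS = {
    "Registers": 1,
    "On-chip cache": 2,
    "On-board cache": 10,
    "Memory": 100,
    "Disk": 10**6,
    "Tape/optical robot": 10**9,
}


def human_time(clocks: int) -> str:
    """Render a latency in clocks as wall time, at one minute per clock."""
    minutes = clocks
    if minutes >= MIN_PER_YEAR:
        return f"{minutes / MIN_PER_YEAR:.1f} years"
    if minutes >= MIN_PER_HOUR:
        return f"{minutes / MIN_PER_HOUR:.1f} hours"
    return f"{minutes} min"


if __name__ == "__main__":
    for level, clocks in LEVELS.items():
        print(f"{level:20s} {clocks:>13,d} clocks -> {human_time(clocks)}")
```

A million minutes is just under two years and a billion minutes is about nineteen centuries, which is where the slide's "2 Years" and "2,000 Years" come from.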
19
DataFlow Programming: Prefetch & Postwrite Hide Latency
Can't wait for the data to arrive (2,000 years!)
Need a memory that gets the data in advance (100 MB/s)
Solution: Pipeline data to/from the processor
Pipe data from source (tape, disc, RAM...) to CPU cache
20
Meta-Message: Technology Ratios Are Important

If everything gets faster & cheaper at the same rate THEN nothing really changes.
Things getting MUCH BETTER:
communication speed & cost: 1,000x
processor speed & cost: 100x
storage size & cost: 100x
Things staying about the same:
speed of light (more or less constant)
people (10x more expensive)
storage speed (only 10x better)

21
Trends: Storage Got Cheaper

$/byte got 10^4 better
$/access got 10^3 better
capacity grew 10^3
Latency improved 10x
Bandwidth improved 10x

(Chart: Storage Capacity. Unit storage size by year, 1960 to 2000, for Tape (kB), Disk (kB), and RAM (b).)
22
Trends: Access Times Improved Little
(Two charts, 1960 to 2000, log scale. Access Times Improved Little: Tape, Disk, RAM. Processor Speedups: Processors in instructions/second, WANs in bits/second.)
23
Trends: Storage Bandwidth Improved Little
(Two charts, 1960 to 2000, log scale. Transfer Rates Improved Little: RAM, Disk, Tape. Processor Speedups: Processors, WANs.)
24
Today's Storage Hierarchy: Speed & Capacity vs Cost Tradeoffs
(Two charts over access time, 10^-9 to 10^3 seconds. Size vs Speed: typical system size in bytes. Price vs Speed: $/MB. Levels plotted: Cache, Main, Secondary (Disc), Online Tape, Nearline Tape, Offline Tape.)
25
Trends: Application Storage Demand Grew

The New World
Billions of objects
Big objects (1MB)

The Old World
Millions of objects
100-byte objects

26
Trends: New Applications
Multimedia: Text, voice, image, video, ...
The paperless office
Library of congress online (on your campus)
All information comes electronically: entertainment, publishing, business
Information Network, Knowledge Navigator, Information at Your Fingertips
27
What's a Terabyte
1 Terabyte:
1,000,000,000 business letters (150 miles of bookshelf)
100,000,000 book pages (15 miles of bookshelf)
50,000,000 FAX images (7 miles of bookshelf)
10,000,000 TV pictures (mpeg) (10 days of video)
4,000 LandSat images
Library of Congress (in ASCII) is 25 TB
1980: $200 M of disc (10,000 discs), $5 M of tape silo (10,000 tapes)
1997: $200 K of magnetic disc (120 discs), $300 K of optical disc robot (250 platters), $50 K of tape silo (50 tapes)
Terror Byte!! .1% of a PetaByte!!
28
The Cost of Storage & Access

File Cabinet:
cabinet (4 drawer) $250
paper (24,000 sheets) $250
space (2x3 @ $10/ft2) $180
total $700, 3 ¢/sheet
Disk:
disk (9 GB) $2,000
ASCII: 5 m pages, 0.2 ¢/sheet (50x cheaper)
Image: 200 k pages, 1 ¢/sheet (similar to paper)

29
Standard Storage Metrics

Capacity
RAM: MB and $/MB: today at 10MB & $100/MB
Disk: GB and $/GB: today at 5GB and $500/GB
Tape: TB and $/TB: today at .1TB and $100k/TB (nearline)
Access time (latency)
RAM: 100 ns
Disk: 10 ms
Tape: 30 second pick, 30 second position
Transfer rate
RAM: 1 GB/s
Disk: 5 MB/s - - - Arrays can go to 1GB/s
Tape: 3 MB/s - - - not clear that striping works

30
New Storage Metrics: KOXs, MOXs, GOXs, SCANs?

KOX: How many kilobyte objects served per second
the file server, transaction processing metric
MOX: How many megabyte objects served per second
the Mosaic metric
GOX: How many gigabyte objects served per hour
the video & EOSDIS metric
SCANS: How many scans of all the data per day
the data mining and utility metric

31
How To Get Lots of MOX, GOX, SCANS

parallelism: use many little devices in parallel
Beware of the media myth
Beware of the access time myth

At 10 MB/s: 1.2 days to scan
1,000 x parallel: 15 minute SCAN.
Parallelism: divide a big problem into many smaller ones to be solved in parallel.
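
The scan-time claim can be sketched as archive size divided by aggregate bandwidth. The code below assumes a 1 TB archive, which is the size that makes a single 10 MB/s device take about 1.2 days; the slide itself does not state the size. It also assumes ideal linear speedup with no startup, skew, or contention, so the parallel result is a lower bound and not a measured figure.

```python
TB = 10**12
MB = 10**6
SECONDS_PER_DAY = 86_400


def scan_seconds(archive_bytes: float, device_bytes_per_s: float, devices: int = 1) -> float:
    """Time to read the whole archive when it is split evenly over devices."""
    if devices < 1:
        raise ValueError("devices must be at least 1")
    return archive_bytes / (device_bytes_per_s * devices)


if __name__ == "__main__":
    serial = scan_seconds(1 * TB, 10 * MB)
    parallel = scan_seconds(1 * TB, 10 * MB, devices=1000)
    print(f"1 device:      {serial / SECONDS_PER_DAY:.2f} days")
    print(f"1,000 devices: {parallel:.0f} seconds (ideal lower bound)")
```

Under these assumptions the scan time falls in direct proportion to the number of devices, which is the point of using many little devices in parallel.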
32
Tape & Optical: Beware of the Media Myth
Optical is cheap: $200/platter, 2 GB/platter => $100/GB (2x cheaper than disc)
Tape is cheap: $30/tape, 20 GB/tape => $1.5/GB (100x cheaper than disc).
33
Tape & Optical Reality: Media is 10% of System Cost
Tape needs a robot ($10 k ... $3 m)
10 ... 1000 tapes (at 20GB each) => $20/GB ... $200/GB (1x to 10x cheaper than disc)
Optical needs a robot ($100 k)
100 platters = 200GB (TODAY) => $400/GB (more expensive than mag disc)
Robots have poor access times
Not good for Library of Congress (25TB)
Data motel: data checks in but it never checks out!
34
The Access Time Myth

The Myth: seek or pick time dominates
The reality: (1) Queuing dominates
(2) Transfer dominates BLOBs
(3) Disk seeks often short
Implication: many cheap servers better than one fast expensive server
shorter queues
parallel transfer
lower cost/access and cost/byte
This is now obvious for disk arrays
This will be obvious for tape arrays

35
The Disk Farm On a Card

The 100GB disc card
An array of discs
Can be used as
100 discs
1 striped disc
10 Fault Tolerant discs
....etc
LOTS of accesses/second
bandwidth

(Diagram label: 14")
Life is cheap, it's the accessories that cost ya.
Processors are cheap, it's the peripherals that cost ya (a $10k disc card).
36
My Solution to Tertiary Storage: Tape Farms, Not Mainframe Silos
Each robot: $10K, 10 tapes, 500 GB, 6 MB/s, $20/GB, 30 MOX, 15 GOX
Scan in 24 hours.
Farm of 100 robots: $1M, 50 TB, $50/GB, 3K MOX, 1.5K GOX, 1 Scans
many independent tape robots (like a disc farm)
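
A farm keeps the one-day scan because bandwidth grows with capacity. The sketch below uses the per-robot figures from this slide ($10K, 500 GB, 6 MB/s) and a farm of 100 robots; the class names are hypothetical, and decimal units plus perfectly independent robots are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TapeRobot:
    """One small tape robot, with the per-robot figures from the slide."""
    price_usd: float = 10_000
    capacity_gb: float = 500
    bandwidth_mb_s: float = 6

    def scan_hours(self) -> float:
        return self.capacity_gb * 1000 / self.bandwidth_mb_s / 3600

    def usd_per_gb(self) -> float:
        return self.price_usd / self.capacity_gb


@dataclass(frozen=True)
class TapeFarm:
    """Many independent robots; capacity and bandwidth both scale with count."""
    robot: TapeRobot
    count: int

    def price_usd(self) -> float:
        return self.robot.price_usd * self.count

    def capacity_tb(self) -> float:
        return self.robot.capacity_gb * self.count / 1000

    def bandwidth_mb_s(self) -> float:
        return self.robot.bandwidth_mb_s * self.count

    def scan_hours(self) -> float:
        return self.capacity_tb() * 1_000_000 / self.bandwidth_mb_s() / 3600


if __name__ == "__main__":
    robot = TapeRobot()
    farm = TapeFarm(robot, count=100)
    print(f"robot: {robot.scan_hours():.1f} h scan, ${robot.usd_per_gb():.0f}/GB")
    print(f"farm:  ${farm.price_usd():,.0f}, {farm.capacity_tb():.0f} TB, "
          f"{farm.bandwidth_mb_s():.0f} MB/s, {farm.scan_hours():.1f} h scan")
```

One robot and the whole farm both scan their contents in roughly 23 hours. A single large silo with a handful of readers would not keep that ratio.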
37
The Metrics: Disk and Tape Farms Win
Data Motel: Data checks in, but it never checks out
(Chart, log scale from 0.01 to 1,000,000: GB/K$, KOX, MOX, GOX, and SCANS/Day for a 1000x Disc Farm, a 100x DLT Tape Farm, and an STC Tape Robot with 6,000 tapes and 8 readers.)
38
Cost Per Access (3-year)
(Chart: KOX/$, MOX/$, GOX/$, and SCANS/k$ for a 1000x Disc Farm, a 100x DLT Tape Farm, and an STC Tape Robot with 6,000 tapes and 16 readers. Bar values range from 0.1 to 540,000.)
39
Summary (of new ideas)

Storage accesses are the bottleneck
Accesses are getting larger (MOX, GOX, SCANS)
Capacity and cost are improving
BUT
Latencies and bandwidth are not improving much
SO
Use parallel access (disk and tape farms)

40
Meta-Message: Technology Ratios Are Important

If everything gets faster & cheaper at the same rate, nothing really changes.
Some things getting MUCH BETTER:
communication speed & cost: 1,000x
processor speed & cost: 100x
storage size & cost: 100x
Some things staying about the same:
speed of light (more or less constant)
people (10x worse)
storage speed (only 10x better)

41
Ratios Changed

10x better access time
10x more bandwidth
10,000x lower media price
DRAM/DISK: 100:1 to 10:10 to 50:1

42
The Five Minute Rule

Trade DRAM for Disk Accesses
Cost of an access (DriveCost / Access_per_second)
Cost of a DRAM page ($/MB / pages_per_MB)
Break even has two terms:
Technology term and an Economic term
Grew page size to compensate for changing ratios.
Now at 10 minute for random, 2 minute sequential
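
The break-even interval is the reference interval at which holding a page in DRAM costs the same as re-reading it from disk. A minimal sketch of the two-term product, using illustrative mid-1990s values (8 KB pages, 64 accesses/second per drive, a $2,000 drive, $15/MB DRAM) that are assumptions and do not appear on the slide:

```python
def break_even_seconds(
    pages_per_mb_dram: float,
    accesses_per_sec_per_disk: float,
    price_per_disk: float,
    price_per_mb_dram: float,
) -> float:
    """Reference interval at which caching a page costs the same as re-reading it."""
    technology_term = pages_per_mb_dram / accesses_per_sec_per_disk
    economic_term = price_per_disk / price_per_mb_dram
    return technology_term * economic_term


if __name__ == "__main__":
    # Illustrative inputs only; substitute current prices and device specs.
    seconds = break_even_seconds(
        pages_per_mb_dram=128,         # 8 KB pages
        accesses_per_sec_per_disk=64,
        price_per_disk=2000,
        price_per_mb_dram=15,
    )
    print(f"break-even interval: {seconds:.0f} s (~{seconds / 60:.1f} min)")
```

With these inputs the interval lands at a few minutes. Larger pages shrink the technology term, which is why page size can be grown to compensate for changing price ratios.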

43
Shows Best Page Index Page Size 16KB
44
The Ideal Interconnect
(Comparison columns: SCSI, Comm)

High bandwidth
Low latency
No software stack
Zero Copy
User mode access to device
Low HBA latency
Error Free
(required if no software stack)
Flow Controlled
WE NEED A NEW PROTOCOL
best of SCSI and Comm
allow push & pull
industry is doing it: SAN + VIA

45
Outline

The challenge: Building GIANT data stores
for example, the EOS/DIS 15 PB system
Conclusion 1:
Think about MOX and SCANS
Conclusion 2:
Think about Clusters
SMP report
Cluster report

46
Scaleable Computers: BOTH SMP and Cluster
Grow Up with SMP: 4xP6 is now standard
Grow Out with Cluster: Cluster has inexpensive parts
(Diagram labels: SMP Super Server, Departmental Server, Personal System, Cluster of PCs)
47
TPC-C Current Results

Best Performance is 30,390 tpmC @ $305/tpmC (Oracle/DEC)
Best Price/Perf. is 7,693 tpmC @ $43.5/tpmC (MS SQL/Dell)
Graphs show
UNIX high price
UNIX scaleup diseconomy
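
The scaleup diseconomy shows up when the two results are turned back into system prices. This sketch assumes that price per tpmC is the total priced system cost divided by throughput, so multiplying the two recovers the system price; the function name is illustrative.

```python
def system_price_usd(tpmc: float, usd_per_tpmc: float) -> float:
    """Implied total system price from throughput and price/performance."""
    return tpmc * usd_per_tpmc


if __name__ == "__main__":
    best_perf = system_price_usd(30_390, 305)
    best_price_perf = system_price_usd(7_693, 43.5)
    print(f"best performance system: ${best_perf:,.0f}")
    print(f"best price/perf system:  ${best_price_perf:,.0f}")
    print(f"throughput ratio: {30_390 / 7_693:.1f}x")
    print(f"price ratio:      {best_perf / best_price_perf:.1f}x")
```

Under that assumption the faster system delivers about 4x the throughput for roughly 28x the price.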

48
Compare SMP Performance
49
Where the money goes
50
TPC-C improved fast
40% hardware, 100% software, 100% PC Technology
51
What does this mean?

PC Technology is 3x cheaper than high-end SMPs
PC node performance is 1/2 that of high-end SMPs
4xP6 vs 20xUltraSparc
Peak performance is a cluster
Tandem 100 node cluster
DEC Alpha 4x8 cluster
Commodity solutions WILL come to this market

52
Cluster: Shared What?

Shared Memory Multiprocessor
Multiple processors, one memory
all devices are local
DEC, SG, Sun, Sequent: 16..64 nodes
easy to program, not commodity
Shared Disk Cluster
an array of nodes
all share common disks
VAXcluster, Oracle
Shared Nothing Cluster
each device local to a node
ownership may change
Tandem, SP2, Wolfpack

53
Clusters being built

Teradata: 500 nodes ($50k/slice)
Tandem, VMScluster: 150 nodes ($100k/slice)
Intel: 9,000 nodes @ $55M ($6k/slice)
Teradata, Tandem, DEC moving to NT + low slice price
IBM: 512 nodes @ $100M ($200k/slice)
PC clusters (bare handed) at dozens of nodes: web servers (msn, PointCast, ...), DB servers
KEY TECHNOLOGY HERE IS THE APPS.
Apps distribute data
Apps distribute execution
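
The per-slice prices for the two systems with quoted totals follow directly from total price divided by node count. A small sketch, reading the totals as dollars; the function name is illustrative.

```python
def slice_price_k(total_usd: float, nodes: int) -> float:
    """Price per node ("slice") in thousands of dollars."""
    return total_usd / nodes / 1000


if __name__ == "__main__":
    print(f"Intel: 9,000 nodes @ $55M -> ${slice_price_k(55_000_000, 9_000):.1f}k/slice")
    print(f"IBM:   512 nodes @ $100M  -> ${slice_price_k(100_000_000, 512):.0f}k/slice")
```

That gives roughly $6k and $195k per slice, matching the rounded figures on the slide.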

54
Cluster Advantages

Clients and Servers made from the same stuff.
Inexpensive: Built with commodity components
Fault tolerance
Spare modules mask failures
Modular growth
grow by adding small modules
Parallel data search
use multiple processors and disks

55
Clusters are winning the high end

You saw that a 4x8 cluster has best TPC-C
performance
This year, a 32xUltraSparc cluster won the
MinuteSort Speed Trophy (see NOWsort at
www.now.cs.berkeley.edu)
Ordinal 16x on SGI Origin is close (but the
loser!).

56
Clusters (Plumbing)

Single system image
naming
protection/security
management/load balance
Fault Tolerance
Wolfpack Demo
Hot Pluggable hardware & software

57
So, What's New?

When slices cost $50k, you buy 10 or 20.
When slices cost $5k you buy 100 or 200.
Manageability, programmability, usability become key issues (total cost of ownership).
PCs are MUCH easier to use and program

MPP Vicious Cycle: No Customers!
CP/Commodity Virtuous Cycle: Standards allow progress and investment protection
(Cycle diagram labels: Apps, Standard OS & Hardware, Customers)
58
Windows NT Server Clustering: High Availability On Standard Hardware

Standard API for clusters on many platforms
No special hardware required.
Resource Group is unit of failover
Typical resources:
shared disk, printer, ...
IP address, NetName
Service (Web, SQL, File, Print, Mail, MTS, ...)
API to define
resource groups,
dependencies,
resources,
GUI administrative interface
A consortium of 60 HW & SW vendors (everybody who is anybody)

2-Node Cluster in beta test now. Available 97H1
>2 node is next
SQL Server and Oracle Demo on it today
Key concepts:
System: a node
Cluster: systems working together
Resource: hard/soft-ware module
Resource dependency: resource needs another
Resource group: fails over as a unit
Dependencies do not cross group boundaries
59
Wolfpack NT Clusters 1.0

Two node file and print failover

(Diagram: Clients connect to two nodes, Alice and Betty. Each node has private disks, and the two nodes share SCSI disk strings.)
60
SQL Server 6.5 Failover

Failover unit is DB Server

Client failover via reconnect IP impersonation
or ODBC or DBlib reconnect in SQL Server 6.5

61
What is Wolfpack?
(Architecture diagram labels: Cluster Management Tools; Cluster API DLL; RPC; Cluster Service, containing Database Manager, Global Update Manager, Node Manager, Event Processor, Failover Mgr, Resource Mgr, and Communication Manager to Other Nodes; Resource Monitors; Resource Management Interface with entry points Open, Online, IsAlive, LooksAlive, Offline, Close; Physical Resource DLL, Logical Resource DLL, App Resource DLL; App; Cluster Aware App.)
62
Where We Are Today