Title: Scaleability
1. Scaleability
- Jim Gray
- Gray@Microsoft.com
- (with help from Gordon Bell, George Spix, Catharine van Ingen)
        Mon         Tue           Wed          Thur            Fri
 9:00   Overview    TP mons       Log          Files Buffers   B-tree
11:00   Faults      Lock Theory   ResMgr       COM             Access Paths
 1:30   Tolerance   Lock Techniq  CICS Inet    Corba           Groupware
 3:30   T Models    Queues        Adv TM       Replication     Benchmark
 7:00   Party       Workflow      Cyberbrick   Party
2. A peta-op business app?
- P&G and friends pay for the web (like they paid for broadcast television): no new money, but given Moore's law, traditional advertising revenues can pay for all of our connectivity (voice, video, data), presuming we figure out how to let them brand the experience.
- Advertisers pay for impressions and the ability to analyze same.
- A terabyte sort a minute, growing to one a second.
- Bisection bandwidth of 20 GB/s to 200 GB/s.
- Really a tera-op business app (today's portals).
3. Scaleability: Scale Up and Scale Out
Grow up with SMP: a 4xP6 is now standard.
Grow out with clusters: a cluster is built from inexpensive parts, a cluster of PCs.
4. There'll be Billions (Trillions) of Clients
- Every device will be intelligent
- Doors, rooms, cars
- Computing will be ubiquitous
5. Billions of Clients Need Millions of Servers
- All clients networked to servers
- May be nomadic or on-demand
- Fast clients want faster servers
- Servers provide
- Shared data
- Control
- Coordination
- Communication
[Diagram: trillions of mobile and fixed clients connecting to billions of servers and super servers]
6. Thesis: Many Little Beat Few Big
[Figure: the spectrum from mainframe (14") and mini (9") down to micro, nano, and pico processors; disk form factors 14", 9", 5.25", 3.5", 2.5", 1.8"; memories from 1 MB to 100 TB; prices from 10 K to 1 million; 10 pico-second RAM]
- 1 M SPECmarks, 1 TFLOP, 10^6 clocks to bulk RAM, event horizon on chip, VM reincarnated, multi-program cache, on-chip SMP
- Smoking, hairy golf ball
- How to connect the many little parts?
- How to program the many little parts?
- Fault tolerance & management?
7. 4 B PCs (1 Bips, 0.1 GB DRAM, 10 GB disk, 1 Gbps net): The Bricks of Cyberspace
- Cost: $1,000
- Come with
- NT
- DBMS
- High speed Net
- System management
- GUI / OOUI
- Tools
- Compatible with everyone else
- CyberBricks
8. Computers Shrink to a Point
Kilo Mega Giga Tera Peta Exa Zetta Yotta
- Disks: 100x in 10 years; a 2 TB 3.5" drive
- Shrunk to 1", that is 200 GB
- Disk is super computer!
- This is already true of printers and terminals
9. Super Server: 4T Machine
- Array of 1,000 4B machines
- 1 Bips processors
- 1 BB DRAM
- 10 BB disks
- 1 Bbps comm lines
- 1 TB tape robot
- A few megabucks
- Challenge
- Manageability
- Programmability
- Security
- Availability
- Scaleability
- Affordability
- As easy as a single system
Cyber Brick a 4B machine
Future servers are CLUSTERS of processors and discs.
Distributed database techniques make clusters work.
10. Cluster Vision: Buying Computers by the Slice
- Rack & Stack
- Mail-order components
- Plug them into the cluster
- Modular growth without limits
- Grow by adding small modules
- Fault tolerance
- Spare modules mask failures
- Parallel execution & data search
- Use multiple processors and disks
- Clients and servers made from the same stuff
- Inexpensive built with commodity CyberBricks
11. Systems 30 Years Ago
- MegaBuck per Mega Instruction Per Second (mips)
- MegaBuck per MegaByte
- Sys Admin & Data Admin per MegaBuck
12. Disks of 30 Years Ago
- 10 MB
- Failed every few weeks
13. 1988: IBM DB2 + CICS Mainframe, 65 tps
- IBM 4391
- Simulated network of 800 clients
- $2M computer
- Staff of 6 to do benchmark
2 x 3725 network controllers
Refrigerator-sized CPU
16 GB disk farm 4 x 8 x .5GB
14. 1987: Tandem Mini @ 256 tps
- $14M computer (Tandem)
- A dozen people ($1.8M/y)
- False floor, 2 rooms of machines
Admin expert
32 node processor array
Performance expert
Hardware experts
Simulate 25,600 clients
Network expert
Auditor
Manager
40 GB disk array (80 drives)
OS expert
DB expert
15. 1997, 9 years later: 1 Person and 1 Box, 1,250 tps
- 1 breadbox does 5x the 1987 machine room
- 23 GB is hand-held
- One person does all the work
- Cost/tps is 100,000x less: 5 micro-dollars per transaction
4 x 200 MHz CPUs, 1/2 GB DRAM, 12 x 4 GB disks (3 x 7 x 4 GB disk arrays).
One person is the hardware expert, OS expert, net expert, DB expert, and app expert.
16. What Happened? Where did the 100,000x come from?
- Moore's law: 100X (at most)
- Software improvements: 10X (at most)
- Commodity pricing: 100X (at least)
- Total: 100,000X
- 100x from commodity
- (DBMS was $100K to start, now $1K to start)
- IBM 390 MIPS is $7.5K today
- Intel MIPS is $10 today
- Commodity disk is $50/GB vs $1,500/GB
- ...
17. Web Server Farms, Server Consolidation: $/sqft
http://www.exodus.com (charges by Mbps times sqft)
Standard package: full height, fully populated with 3.5" disks.
HP, DELL, and Compaq are trading places wrt the rack-mount lead.
PoPC: Celeron NLX shoeboxes, 1,000 nodes in 48 (24x2) sq ft, $650K from Arrow (3-yr warranty!), on-chip at-speed L2.
18. Application Taxonomy
General purpose, non-parallelizable codes: PCs have it!
Vectorizable
Vectorizable & //able (supers & small DSMs)
Hand tuned, one-of: MPP coarse grain, MPP embarrassingly // (clusters of PCs)
Database, Database/TP, Web host, stream audio/video
(Axes: technical vs commercial)
If central control & rich, then IBM or large SMPs; else PC clusters.
19. Peta-Scale Computing
10x every 5 years, 100x every 10 (1,000x in 20 if SC).
Except: memory & IO bandwidth.
20. "I think there is a world market for maybe five computers."
Thomas Watson Senior, Chairman of IBM, 1943
21. Microsoft.com: 150x4 nodes: a crowd
22. HotMail (a year ago): 400 computers, a crowd (now 2x bigger)
23. DB Clusters (crowds)
- 16-node Cluster
- 64 cpus
- 2 TB of disk
- Decision support
- 45-node Cluster
- 140 cpus
- 14 GB DRAM
- 4 TB RAID disk
- OLTP (Debit Credit)
- 1 B tpd (14 k tps)
24. The Microsoft TerraServer Hardware
- Compaq AlphaServer 8400
- 8 x 400 MHz Alpha CPUs
- 10 GB DRAM
- 324 9.2 GB StorageWorks Disks
- 3 TB raw, 2.4 TB of RAID5
- STK 9710 tape robot (4 TB)
- WindowsNT 4 EE, SQL Server 7.0
25. TerraServer: Lots of Web Hits
- A billion web hits!
- 1 TB, largest SQL DB on the Web
- 100 Qps average, 1,000 Qps peak
- 877 M SQL queries so far
26. TerraServer Availability
- Operating for 13 months
- Unscheduled outage: 2.9 hrs
- Scheduled outage: 2.0 hrs (software upgrades)
- Availability: 99.93% overall uptime
- No NT failures (ever)
- One SQL 7 Beta 2 bug
- One major operator-assisted outage
27. Backup / Restore
28. Windows NT Versus UNIX: Best Results on an SMP
Semi-log plot shows a 3x (2-year) lead by UNIX.
Does not show the Oracle/Alpha cluster at 100,000 tpmC.
All these numbers are off-scale huge (40,000 active users?).
29. TPC-C Improvements (MS SQL): 250%/year on price, 100%/year on performance; the bottleneck is the 3 GB address space
40% hardware, 100% software, 100% PC technology
30. UNIX (dis)Economy of Scale
31. Two Different Pricing Regimes (these are late-1998 prices)
32. Storage Latency: How Far Away is the Data?
33. Thesis: Performance = Storage Accesses, not Instructions Executed
- In the old days we counted instructions and I/Os
- Now we count memory references
- Processors wait most of the time (see the sketch after the chart below)
[Chart: where the time goes; clock ticks used by AlphaSort components: disc wait, sort, OS, memory wait]
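A back-of-envelope model of the point (not from the talk; the clock rate, miss rate, and miss latency below are illustrative assumptions) shows how memory wait can dwarf compute time:

```cpp
#include <cstdio>

// Toy model: where do the cycles go for a memory-intensive loop?
// All parameters are illustrative assumptions, not measurements from the talk.
int main() {
    const double clock_hz     = 200e6;   // 200 MHz processor (assumed)
    const double instructions = 1e9;     // work to do
    const double cpi_ideal    = 1.0;     // cycles/instruction if memory were free
    const double mem_refs     = 0.3e9;   // ~0.3 loads/stores per instruction
    const double miss_rate    = 0.10;    // cache misses per memory reference
    const double miss_cycles  = 100;     // clocks to reach bulk DRAM

    double cpu_cycles  = instructions * cpi_ideal;
    double wait_cycles = mem_refs * miss_rate * miss_cycles;
    double total       = cpu_cycles + wait_cycles;

    std::printf("compute: %.2f s, memory wait: %.2f s (%.0f%% of total)\n",
                cpu_cycles / clock_hz, wait_cycles / clock_hz,
                100.0 * wait_cycles / total);
    return 0;
}
```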
34. Storage Hierarchy (10 levels)
- Registers, Cache L1, L2
- Main (1, 2, 3 if NUMA)
- Disk (1 (cached), 2)
- Tape (1 (mounted), 2)
35. Today's Storage Hierarchy: Speed & Capacity vs Cost Tradeoffs
[Charts: size vs speed and price ($/MB) vs speed for a typical system (bytes): cache, main memory, secondary (disc), and online / nearline / offline tape; access times range from 10^-9 to 10^3 seconds]
36. Meta-Message: Technology Ratios Are Important
- If everything gets faster & cheaper at the same rate, THEN nothing really changes.
- Things getting MUCH BETTER:
- communication speed & cost: 1,000x
- processor speed & cost: 100x
- storage size & cost: 100x
- Things staying about the same:
- speed of light (more or less constant)
- people (10x more expensive)
- storage speed (only 10x better)
37. Storage Ratios Changed
- 10x better access time
- 10x more bandwidth
- 4,000x lower media price
- DRAM/DISK media price ratio: 100:1 to 10:1 to 50:1
38. The Pico Processor
1 M SPECmarks. 10^6 clocks/fault to bulk RAM. Event horizon on chip. VM reincarnated. Multi-program cache.
Terror Bytes!
39. Bottleneck Analysis
- Theoretical bus bandwidth: 422 MBps (66 MHz x 64 bits)
- Memory read/write: 150 MBps
- MemCopy: 50 MBps (a minimal measurement sketch follows)
- Disk R/W: 9 MBps
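A minimal probe in the spirit of the MemCopy line above; the buffer size and repetition count are arbitrary choices, and the number it prints will of course reflect whatever machine runs it:

```cpp
#include <chrono>
#include <cstdio>
#include <cstring>
#include <vector>

// Time repeated memcpy of a large buffer and report sustained MB/s.
int main() {
    const size_t bytes = 64 << 20;            // 64 MB per copy (assumed)
    const int    reps  = 16;
    std::vector<char> src(bytes, 1), dst(bytes, 0);

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) {
        std::memcpy(dst.data(), src.data(), bytes);
        dst[i] ^= 1;                           // keep the optimizer honest
    }
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    std::printf("memcpy bandwidth: %.1f MB/s\n",
                (double)bytes * reps / (1 << 20) / secs);
    return 0;
}
```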
40. Bottleneck Analysis
- NTFS Read/Write
- 18 Ultra-3 SCSI disks on 4 strings (2x4 and 2x5), 3 64-bit PCI buses
- 155 MBps unbuffered read (175 raw)
- 95 MBps unbuffered write
- Good, but 10x down from our UNIX brethren (SGI, SUN)
41. PennySort
- Hardware
- 266 Mhz Intel PPro
- 64 MB SDRAM (10ns)
- Dual Fujitsu DMA 3.2GB EIDE disks
- Software
- NT workstation 4.3
- NT 5 sort
- Performance
- sort 15 M 100-byte records (1.5 GB)
- Disk to disk
- elapsed time 820 sec
- cpu time 404 sec
42. PennySort Ground Rules (http://research.microsoft.com/barc/SortBenchmark)
- How much can you sort for a penny?
- Hardware and software cost
- Depreciated over 3 years
- A $1M system gets about 1 second,
- a $1K system gets about 1,000 seconds
- Time (seconds) = 946,080 / SystemPrice ($)  (see the sketch below)
- Input and output are disk resident
- Input is
- 100-byte records (random data)
- key is the first 10 bytes
- Must create the output file and fill it with a sorted version of the input file
- Daytona (product) and Indy (special) categories
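A small sketch that just replays the slide's own budget arithmetic (3-year depreciation, a penny of system time); the two prices are the ones quoted above:

```cpp
#include <cstdio>

// PennySort budget: a system depreciated over 3 years (~94,608,000 seconds)
// gives 0.01 * 94,608,000 / price = 946,080 / price seconds per penny.
int main() {
    const double three_years = 3.0 * 365 * 24 * 3600;   // 94,608,000 s
    const double prices[] = {1e6, 1e3};                  // the $1M and $1K examples
    for (double price : prices) {
        double seconds_per_penny = 0.01 * three_years / price;
        std::printf("$%.0f system: about %.0f seconds of use per penny\n",
                    price, seconds_per_penny);
    }
    return 0;
}
```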
43. How Good is NT5 Sort?
- CPU and IO are not overlapped (a double-buffering sketch follows)
- System should be able to sort 2x more
- RAM has spare capacity
- Disk is space saturated (1.5 GB in, 1.5 GB out on a 3 GB drive). Need an extra 3 GB drive or a >6 GB drive
[Chart: disk, CPU, fixed, and RAM utilization]
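One plausible way to get the missing overlap is double buffering: prefetch the next chunk while the CPU sorts the current one. This is only a sketch, not NT5 sort's actual implementation; the file name and chunk size are hypothetical:

```cpp
#include <algorithm>
#include <fstream>
#include <future>
#include <vector>

// Read chunk i+1 in the background while sorting chunk i,
// so disk wait and CPU time overlap instead of adding up.
static std::vector<int> read_chunk(std::ifstream& in, size_t n) {
    std::vector<int> buf(n);
    in.read(reinterpret_cast<char*>(buf.data()), n * sizeof(int));
    buf.resize(in.gcount() / sizeof(int));
    return buf;
}

int main() {
    const size_t chunk = 1 << 20;                     // 1M ints per chunk (assumed)
    std::ifstream in("input.dat", std::ios::binary);  // hypothetical input file

    auto pending = std::async(std::launch::async, read_chunk, std::ref(in), chunk);
    while (true) {
        std::vector<int> current = pending.get();     // wait for the prefetched chunk
        if (current.empty()) break;
        pending = std::async(std::launch::async,      // start the next read...
                             read_chunk, std::ref(in), chunk);
        std::sort(current.begin(), current.end());    // ...while sorting this one
        // (write the sorted run to disk here)
    }
    return 0;
}
```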
44. Sandia/Compaq/ServerNet/NT Sort
- Sorted 1.1 terabytes (13 billion records) in 47 minutes
- 68 nodes (dual 450 MHz processors), 543 disks, $1.5M
- 1.2 GBps network rap (2.8 GBps pap)
- 5.2 GBps of disk rap (same as pap)
- (rap = real application performance, pap = peak advertised performance)
45. SPsort
46. Progress on Sorting: NT now leads both price and performance
- Speedup comes from Moore's law: 40%/year
- Processor/disk/network arrays: 60%/year (this is a software speedup)
47. Recent Results
- NOW Sort: 9 GB on a cluster of 100 UltraSparcs in 1 minute
- MilleniumSort: 16x Dell NT cluster, 100 MB in 1.18 sec (Datamation)
- Tandem/Sandia Sort: 68-CPU ServerNet, 1 TB in 47 minutes
- IBM SPsort
- 408 nodes, 1,952 CPUs
- 2,168 disks
- 17.6 minutes (1,057 sec)
- (all for 1/3 of $94M;
- slice price is $64K for 4 CPUs, 2 GB RAM, 6 9-GB disks & interconnect)
48. Data Gravity: Processing Moves to Transducers
- Move Processing to data sources
- Move to where the power (and sheet metal) is
- Processor in
- Modem
- Display
- Microphones (speech recognition) & cameras (vision)
- Storage: data storage and analysis
- System is distributed (a cluster/mob)
49. SAN: Standard Interconnect
Gbps SAN: 110 MBps
- LAN faster than memory bus?
- 1 GBps links in the lab
- $100 port cost soon
- Port is computer
- Winsock: 110 MBps (10% CPU utilization at each end)
PCI: 70 MBps
UW SCSI: 40 MBps
FW SCSI: 20 MBps
SCSI: 5 MBps
50. Disk Node
- has magnetic storage (100 GB?)
- has processor & DRAM
- has SAN attachment
- has execution environment
Applications
Services
DBMS
File System
RPC, ...
SAN driver
Disk driver
OS Kernel
51. end
52. Standard Storage Metrics
- Capacity
- RAM: MB and $/MB; today at 10 MB and $100/MB
- Disk: GB and $/GB; today at 10 GB and $200/GB
- Tape: TB and $/TB; today at 0.1 TB and $25K/TB (nearline)
- Access time (latency)
- RAM: 100 ns
- Disk: 10 ms
- Tape: 30-second pick, 30-second position
- Transfer rate
- RAM: 1 GB/s
- Disk: 5 MB/s (arrays can go to 1 GB/s)
- Tape: 5 MB/s (striping is problematic)
53. New Storage Metrics: Kaps, Maps, SCAN?
- Kaps: how many KB objects served per second
- The file server, transaction processing metric
- This is the OLD metric.
- Maps: how many MB objects served per second
- The multi-media metric
- SCAN: how long to scan all the data
- The data mining and utility metric
- And
- Kaps/$, Maps/$, TBscan/$
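A minimal calculator for the three metrics, using round device numbers like those in the Standard Storage Metrics slide (10 ms access, 5 MB/s transfer, 10 GB capacity); the parameters are illustrative, not a specific product:

```cpp
#include <cstdio>

// Kaps = 1 KB objects/s, Maps = 1 MB objects/s, SCAN = time to read every byte once.
int main() {
    const double access_s    = 0.010;     // seek + rotate + settle
    const double transfer_Bs = 5e6;       // sustained transfer rate
    const double capacity_B  = 10e9;      // disk capacity

    double kaps = 1.0 / (access_s + 1e3 / transfer_Bs);   // 1 KB per request
    double maps = 1.0 / (access_s + 1e6 / transfer_Bs);   // 1 MB per request
    double scan_hours = capacity_B / transfer_Bs / 3600;

    std::printf("Kaps: %.0f   Maps: %.1f   SCAN: %.1f hours\n",
                kaps, maps, scan_hours);
    return 0;
}
```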
54For the Record (good 1998 devices packaged in
systemhttp//www.tpc.org/results/individual_resul
ts/Dell/dell.6100.9801.es.pdf)
X 14
55For the Record (good 1998 devices packaged in
systemhttp//www.tpc.org/results/individual_resul
ts/Dell/dell.6100.9801.es.pdf)
X 14
56. How To Get Lots of Maps, SCANs
At 10 MB/s: 1.2 days to scan a terabyte.
1,000 x parallel: 100-second SCAN.
- Parallelism: use many little devices in parallel (see the sketch below)
- Beware of the media myth
- Beware of the access time myth
Parallelism: divide a big problem into many smaller ones to be solved in parallel.
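The slide's arithmetic, spelled out (the parameters are the ones quoted: 1 TB of data, 10 MB/s per drive):

```cpp
#include <cstdio>

// SCAN time for 1 TB as a function of how many drives scan it in parallel.
int main() {
    const double terabyte = 1e12;     // bytes to scan
    const double rate     = 10e6;     // 10 MB/s per drive
    for (int drives : {1, 10, 100, 1000}) {
        double seconds = terabyte / (rate * drives);
        std::printf("%4d drives: %10.0f s (%.2f days)\n",
                    drives, seconds, seconds / 86400);
    }
    return 0;
}
```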
57. The Disk Farm On a Card
- The 1 TB disc card
- An array of discs
- Can be used as
- 100 discs
- 1 striped disc
- 10 Fault Tolerant discs
- ....etc
- LOTS of accesses/second & bandwidth
14" card
Life is cheap, it's the accessories that cost ya. Processors are cheap, it's the peripherals that cost ya (a $10K disc card).
58. Tape Farms for Tertiary Storage, Not Mainframe Silos
One $10K robot: 14 tapes, 500 GB, 5 MB/s, $20/GB, 30 Maps, 27-hr scan.
100 robots: $1M, 50 TB, $50/GB, 3K Maps, still a 27-hour scan.
Scan in 27 hours: many independent tape robots (like a disc farm).
59. Tape & Optical: Beware of the Media Myth
Optical is cheap: $200/platter, 2 GB/platter => $100/GB (2x cheaper than disc).
Tape is cheap: $30/tape, 20 GB/tape => $1.50/GB (100x cheaper than disc).
60. Tape & Optical Reality: Media is 10% of System Cost
Tape needs a robot ($10K ... $3M), 10 ... 1,000 tapes (at 20 GB each) => $20/GB ... $200/GB (1x-10x cheaper than disc).
Optical needs a robot ($100K), 100 platters = 200 GB (TODAY) => $400/GB (more expensive than mag disc).
Robots have poor access times.
Not good for the Library of Congress (25 TB).
Data motel: data checks in but it never checks out!
61. The Access Time Myth
- The myth: seek or pick time dominates
- The reality: (1) queuing dominates
- (2) transfer dominates for BLOBs
- (3) disk seeks are often short
- Implication: many cheap servers are better than one fast expensive server
- shorter queues
- parallel transfer
- lower cost/access and cost/byte
- This is now obvious for disk arrays
- This will be obvious for tape arrays
62. What To Do About HIGH Availability
- Need a remote MIRRORED site to tolerate environmental failures (power, net, fire, flood) & operations failures
- Replicate changes across the net
- Failover servers across the net (some distance)
- Allows software upgrades, site moves, fires, ...
- Tolerates operations errors, Heisenbugs, ...
[Diagram: a client sends state changes to a server, which replicates them to a second server >100 feet or >100 miles away]
63. Scaleup Has Limits (chart courtesy of Catharine van Ingen)
- Vector Supers 10x supers
- 3 Gflops/cpu
- bus/memory 20 GBps
- IO 1GBps
- Supers 10x PCs
- 300 Mflops/cpu
- bus/memory 2 GBps
- IO 1 GBps
- PCs are slow
- 30 Mflops/cpu
- and bus/memory 200MBps
- and IO 100 MBps
64. TOP500 Systems by Vendor (courtesy of Larry Smarr, NCSA)
[Chart: number of TOP500 systems by vendor, Jun-93 through Jun-98: CRI, SGI, IBM, Convex, HP, Sun, TMC, Intel, DEC, Japanese vector machines, other]
TOP500 Reports: http://www.netlib.org/benchmark/top500.html
65. NCSA Super Cluster
http://access.ncsa.uiuc.edu/CoverStories/SuperCluster/super.html
- National Center for Supercomputing Applications, University of Illinois @ Urbana
- 512 Pentium II CPUs, 2,096 disks, SAN
- Compaq + HP + Myricom + WindowsNT
- A supercomputer for $3M
- Classic Fortran/MPI programming
- DCOM programming model
66. Avalon: Alpha Clusters for Science (http://cnls.lanl.gov/avalon/)
- 140 Alpha processors (533 MHz)
- x 256 MB RAM & 3 GB disk each
- Fast Ethernet switches
- 45 GB RAM & 550 GB disk total
- Linux...
- 10 real Gflops for $313,000
- => 34 real Mflops/$K
- (150 benchmark Mflops/$K)
- Beowulf project is the parent
- http://www.cacr.caltech.edu/beowulf/naegling.html
- 114 nodes, $2K/node
- Scientists want cheap MIPS.
67. Your Tax Dollars At Work: ASCI for Stockpile Stewardship
- Intel/Sandia 9000x1 node Ppro
- LLNL/IBM 512x8 PowerPC (SP2)
- LANL/Cray ?
- Maui Supercomputer Center
- 512x1 SP2
68. Observations
- Uniprocessor RAP << PAP
- real application performance << peak advertised performance
- Growth has slowed (Bell Prize):
- 1987: 0.5 GFLOPS
- 1988: 1.0 GFLOPS (1 year)
- 1990: 14 GFLOPS (2 years)
- 1994: 140 GFLOPS (4 years)
- 1997: 604 GFLOPS
- 1998: 1600 G__OPS (4 years)
69. Two Generic Kinds of Computing
- Many little
- embarrassingly parallel
- Fit RPC model
- Fit partitioned data and computation model
- Random works OK
- OLTP, File Server, Email, Web,..
- Few big
- sometimes not obviously parallel
- Do not fit RPC model (BIG rpcs)
- Scientific, simulation, data mining, ...
70. Many Little Programming Model
- many small requests
- route requests to data
- encapsulate data with procedures (objects)
- three-tier computing
- RPC is a convenient/appropriate model
- Transactions are a big help in error handling
- Auto partition (e.g. hash data and computation)
- Works fine.
- Software CyberBricks
71. Object-Oriented Programming: Parallelism From Many Little Jobs
- Gives location transparency
- ORB/web/tpmon multiplexes clients to servers
- Enables distribution
- Exploits embarrassingly parallel apps (transactions)
- HTTP and RPC (DCOM, CORBA, RMI, IIOP, ...) are the basis
TP monitor / ORB / web server
72. Few Big Programming Model
- Finding parallelism is hard
- Pipelines are short (3x to 6x speedup)
- Spreading objects/data is easy, but getting
locality is HARD - Mapping big job onto cluster is hard
- Scheduling is hard
- coarse grained (job) and fine grain (co-schedule)
- Fault tolerance is hard
73. Kinds of Parallel Execution
Pipeline: any sequential program feeds its output to any other sequential program.
Partition: outputs split N ways, inputs merge M ways; each partition runs an ordinary sequential program.
[Diagram: pipeline parallelism and partitioned parallelism among sequential programs]
74. Why Parallel Access To Data?
At 10 MB/s: 1.2 days to scan a terabyte.
1,000 x parallel: 100-second SCAN. BANDWIDTH!
Parallelism: divide a big problem into many smaller ones to be solved in parallel.
75. Why are Relational Operators Successful for Parallelism?
Relational data model: uniform operators on uniform data streams, closed under composition. Each operator consumes 1 or 2 input streams; each stream is a uniform collection of data; sequential data in and out; pure dataflow. Partitioning some operators (e.g. aggregates, non-equi-join, sort, ...) requires innovation. AUTOMATIC PARALLELISM.
76. Database Systems Hide Parallelism
- Automate system management via tools
- data placement
- data organization (indexing)
- periodic tasks (dump / recover / reorganize)
- Automatic fault tolerance
- duplex failover
- transactions
- Automatic parallelism
- among transactions (locking)
- within a transaction (parallel execution)
77. SQL: a Non-Procedural Programming Language
- SQL: a functional programming language that describes the answer set.
- The optimizer picks the best execution plan
- Picks the data flow web (pipeline),
- degree of parallelism (partitioning),
- other execution parameters (process placement, memory, ...)
[Diagram: GUI, schema, optimizer, plan, execution planning, monitor, executors, rivers]
78. Partitioned Execution
Spreads computation and IO among processors.
Partitioned data gives NATURAL parallelism (a minimal sketch follows).
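A minimal sketch of partitioned execution: hash-partition the records across N workers, let each scan only its own partition, then merge the partial results. The record type, partition count, and predicate are all illustrative:

```cpp
#include <cstdio>
#include <thread>
#include <vector>

struct Record { int key; int value; };

int main() {
    const int N = 4;                                 // one worker per processor/disk
    std::vector<std::vector<Record>> partition(N);
    for (int i = 0; i < 1000000; ++i)                // partition by hash of the key
        partition[i % N].push_back({i, i % 100});

    std::vector<long> partial(N, 0);
    std::vector<std::thread> workers;
    for (int p = 0; p < N; ++p)
        workers.emplace_back([&, p] {                // each worker scans its slice
            for (const Record& r : partition[p])
                if (r.value > 90) partial[p]++;      // local predicate + aggregate
        });
    for (auto& w : workers) w.join();

    long total = 0;
    for (long c : partial) total += c;               // merge the partial results
    std::printf("matching records: %ld\n", total);
    return 0;
}
```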
79. N x M-way Parallelism
N inputs, M outputs, no bottlenecks. Partitioned data; partitioned and pipelined data flows.
80. Automatic Parallel Object-Relational DB
  select image from landsat
  where date between 1970 and 1990
    and overlaps(location, Rockies)
    and snow_cover(image) > .7
The predicates are temporal, spatial, and image tests.
Assign one process per processor/disk: find images with the right date & location, analyze the image, and if > 70% snow, return it.
[Diagram: Landsat table (date 1/2/72 ... 4/8/95, location 33N 120W ... 34N 120W, image) partitioned across nodes; each node applies the date, location, and image tests to produce the answer]
81. Data Rivers: Split + Merge Streams
Producers add records to the river; consumers consume records from the river. Purely sequential programming. The river does flow control and buffering, and does the partition and merge of data records. The river is the Split/Merge in Gamma and the Exchange operator in Volcano / SQL Server. (A minimal river sketch follows.)
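A minimal single-producer / single-consumer river, assuming a bounded blocking queue stands in for the real thing: the producer and consumer are purely sequential loops, and the river does the flow control and buffering. The record type and capacity are illustrative:

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

// A bounded, blocking queue of records: put() blocks when full, get() blocks
// when empty, and close() lets the consumer drain and stop.
class River {
    std::queue<int> q_;
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
    const size_t cap_ = 1024;
    bool closed_ = false;
public:
    void put(int rec) {                       // producer: flow control when full
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [&] { return q_.size() < cap_; });
        q_.push(rec);
        not_empty_.notify_one();
    }
    void close() {
        std::lock_guard<std::mutex> lk(m_);
        closed_ = true;
        not_empty_.notify_all();
    }
    std::optional<int> get() {                // consumer: blocks when empty
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;  // closed and drained
        int rec = q_.front(); q_.pop();
        not_full_.notify_one();
        return rec;
    }
};

int main() {
    River river;
    std::thread producer([&] {                     // adds records to the river
        for (int i = 0; i < 100000; ++i) river.put(i);
        river.close();
    });
    long sum = 0;
    while (auto rec = river.get()) sum += *rec;    // consumes records from the river
    producer.join();
    std::printf("consumed sum = %ld\n", sum);
    return 0;
}
```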
82. Generalization: Object-Oriented Rivers
- Rivers transport a sub-class of record-set (a stream of objects)
- record type and partitioning are part of the subclass
- Node transformers are data pumps
- an object with river inputs and outputs
- do late binding to the record type
- Programming becomes data flow programming
- specify the pipelines
- Compiler/scheduler does data partitioning and transformer placement (see the sketch below)
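A sketch of a node transformer as a data pump: an object with a typed input river and a typed output river, composable into pipelines. Plain vectors stand in for rivers here, and the two example transformers are made up for illustration:

```cpp
#include <cstdio>
#include <cstdlib>
#include <vector>

// A transformer drains its input river and fills its output river.
template <class In, class Out>
struct Transformer {
    virtual std::vector<Out> pump(const std::vector<In>& in) = 0;
    virtual ~Transformer() = default;
};

struct Parse : Transformer<const char*, int> {          // text record -> int record
    std::vector<int> pump(const std::vector<const char*>& in) override {
        std::vector<int> out;
        for (const char* s : in) out.push_back(std::atoi(s));
        return out;
    }
};

struct Square : Transformer<int, long> {                // int record -> long record
    std::vector<long> pump(const std::vector<int>& in) override {
        std::vector<long> out;
        for (int v : in) out.push_back(1L * v * v);
        return out;
    }
};

int main() {
    std::vector<const char*> raw = {"3", "4", "5"};
    Parse parse;
    Square square;
    for (long v : square.pump(parse.pump(raw)))          // specify the pipeline
        std::printf("%ld ", v);
    std::printf("\n");
    return 0;
}
```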
83. NT Cluster Sort as a Prototype
- Using
- data generation and
- sort as a prototypical app
- Hello world of distributed processing
- goal: easy install & execute
84. Remote Install
- Add a Registry entry to each remote node (sketched below):
RegConnectRegistry() + RegCreateKeyEx()
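A sketch of that step with the two Win32 calls named above; the node name, key path, and value are hypothetical and error handling is minimal:

```cpp
#include <windows.h>
#include <stdio.h>

// Write an install-path registry entry on one remote node.
int main() {
    HKEY hRemote = NULL, hKey = NULL;
    // Connect to HKEY_LOCAL_MACHINE on the remote node (name is an assumption).
    if (RegConnectRegistry(TEXT("\\\\node01"), HKEY_LOCAL_MACHINE, &hRemote)
            != ERROR_SUCCESS) {
        printf("cannot reach remote registry\n");
        return 1;
    }
    DWORD disposition = 0;
    // Create (or open) a hypothetical install key on that node.
    if (RegCreateKeyEx(hRemote, TEXT("SOFTWARE\\ClusterSort"), 0, NULL,
                       REG_OPTION_NON_VOLATILE, KEY_WRITE, NULL,
                       &hKey, &disposition) == ERROR_SUCCESS) {
        const TCHAR path[] = TEXT("C:\\ClusterSort");
        RegSetValueEx(hKey, TEXT("InstallPath"), 0, REG_SZ,
                      (const BYTE*)path, sizeof(path));
        RegCloseKey(hKey);
    }
    RegCloseKey(hRemote);
    return 0;
}
```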
85. Cluster Startup & Execution
- Setup
- MULTI_QI struct
- COSERVERINFO struct
- Retrieve the remote object handle from the MULTI_QI struct (sketched below)
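A sketch of remote activation with CoCreateInstanceEx, filling in the COSERVERINFO (which node) and MULTI_QI (which interfaces) structs named above; CLSID_SortWorker and the node name are hypothetical:

```cpp
#include <windows.h>
#include <objbase.h>
#include <stdio.h>

// Placeholder class id for a hypothetical sort-worker COM class; a real
// deployment would use the GUID the worker registers on each node.
static const CLSID CLSID_SortWorker =
    { 0x12345678, 0x1234, 0x1234, { 0x12, 0x34, 0x56, 0x78, 0x9a, 0xbc, 0xde, 0xf0 } };

int main() {
    CoInitializeEx(NULL, COINIT_MULTITHREADED);

    COSERVERINFO server = {};                          // which node to activate on
    server.pwszName = const_cast<LPWSTR>(L"node01");   // hypothetical node name

    MULTI_QI qi = {};                                  // which interface(s) we want back
    qi.pIID = &IID_IUnknown;

    HRESULT hr = CoCreateInstanceEx(CLSID_SortWorker, NULL, CLSCTX_REMOTE_SERVER,
                                    &server, 1, &qi);
    if (SUCCEEDED(hr) && SUCCEEDED(qi.hr)) {
        IUnknown* worker = qi.pItf;                    // the remote object handle
        // ... QueryInterface for the worker's real interface, start the sort ...
        worker->Release();
    } else {
        printf("remote activation failed: 0x%08lX\n", (unsigned long)hr);
    }
    CoUninitialize();
    return 0;
}
```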
86. Cluster Sort Conceptual Model
- Multiple data sources
- Multiple data destinations
- Multiple nodes
- Disks -> Sockets -> Disk -> Disk
[Diagram: records A, B, C start mixed (AAA BBB CCC) on every source node; the sort repartitions them over sockets so each destination node ends up with only its own key range]
87. How Do They Talk to Each Other?
- Each node has an OS
- Each node has local resources: a federation.
- Each node does not completely trust the others.
- Nodes use RPC to talk to each other
- CORBA? DCOM? IIOP? RMI?
- One or all of the above.
- Huge leverage in high-level interfaces.
- Same old distributed system story.
[Diagram: on each node, applications sit on RPC, streams, and datagrams (and ?), over VIAL/VIPL, over the wire(s)]