Title: Scaleability
1. Scaleability
- Jim Gray
- Gray@Microsoft.com
- (with help from Gordon Bell, George Spix, Catharine van Ingen)
        Mon         Tue           Wed          Thur            Fri
 9:00   Overview    TP mons       Log          Files Buffers   B-tree
11:00   Faults      Lock Theory   ResMgr       COM             Access Paths
 1:30   Tolerance   Lock Techniq  CICS Inet    Corba           Groupware
 3:30   T Models    Queues        Adv TM       Replication     Benchmark
 7:00   Party       Workflow      Cyberbrick   Party
2. A peta-op business app?
- P&G and friends pay for the web (like they paid for broadcast television): no new money, but given Moore's law, traditional advertising revenues can pay for all of our connectivity (voice, video, data), presuming we figure out how to let them brand the experience.
- Advertisers pay for impressions and the ability to analyze same.
- A terabyte sort a minute, growing to one a second.
- Bisection bandwidth of 20 GB/s to 200 GB/s.
- Really a tera-op business app (today's portals).
3. Scaleability: Scale Up and Scale Out
Grow up with SMP: a 4xP6 is now standard.
Grow out with clusters: a cluster is built from inexpensive parts, a cluster of PCs.
4. There'll be Billions (Trillions) of Clients
- Every device will be intelligent
- Doors, rooms, cars
- Computing will be ubiquitous
5. Billions of Clients Need Millions of Servers
- All clients networked to servers
- May be nomadic or on-demand
- Fast clients want faster servers
- Servers provide
- Shared data
- Control
- Coordination
- Communication
[Diagram: trillions of mobile and fixed clients connecting to billions of servers and super servers]
6. Thesis: Many Little Beat Few Big
[Figure: the spectrum from mainframe (14") and mini (9") down to micro, nano, and pico processors; disk form factors 14", 9", 5.25", 3.5", 2.5", 1.8"; memories from 1 MB to 100 TB; prices from 10 K to 1 million; 10 pico-second RAM]
- 1 M SPECmarks, 1 TFLOP, 10^6 clocks to bulk RAM, event horizon on chip, VM reincarnated, multi-program cache, on-chip SMP
- Smoking, hairy golf ball
- How to connect the many little parts?
- How to program the many little parts?
- Fault tolerance & management?
7. 4 B PCs (1 Bips, 0.1 GB DRAM, 10 GB disk, 1 Gbps net): The Bricks of Cyberspace
- Cost: $1,000
- Come with
- NT
- DBMS
- High speed Net
- System management
- GUI / OOUI
- Tools
- Compatible with everyone else
- CyberBricks
8. Computers Shrink to a Point
Kilo Mega Giga Tera Peta Exa Zetta Yotta
- Disks: 100x in 10 years; a 2 TB 3.5" drive
- Shrunk to 1", that is 200 GB
- Disk is super computer!
- This is already true of printers and terminals
9. Super Server: 4T Machine
- Array of 1,000 4B machines
- 1 Bips processors
- 1 BB DRAM
- 10 BB disks
- 1 Bbps comm lines
- 1 TB tape robot
- A few megabucks
- Challenge
- Manageability
- Programmability
- Security
- Availability
- Scaleability
- Affordability
- As easy as a single system
Cyber Brick a 4B machine
Future servers are CLUSTERS of processors and discs.
Distributed database techniques make clusters work.
10. Cluster Vision: Buying Computers by the Slice
- Rack & Stack
- Mail-order components
- Plug them into the cluster
- Modular growth without limits
- Grow by adding small modules
- Fault tolerance
- Spare modules mask failures
- Parallel execution & data search
- Use multiple processors and disks
- Clients and servers made from the same stuff
- Inexpensive built with commodity CyberBricks
11. Systems 30 Years Ago
- MegaBuck per Mega Instruction Per Second (mips)
- MegaBuck per MegaByte
- Sys Admin & Data Admin per MegaBuck
12. Disks of 30 Years Ago
- 10 MB
- Failed every few weeks
13. 1988: IBM DB2 + CICS Mainframe, 65 tps
- IBM 4391
- Simulated network of 800 clients
- $2M computer
- Staff of 6 to do benchmark
2 x 3725 network controllers
Refrigerator-sized CPU
16 GB disk farm 4 x 8 x .5GB
14. 1987: Tandem Mini @ 256 tps
- $14M computer (Tandem)
- A dozen people ($1.8M/y)
- False floor, 2 rooms of machines
Admin expert
32 node processor array
Performance expert
Hardware experts
Simulate 25,600 clients
Network expert
Auditor
Manager
40 GB disk array (80 drives)
OS expert
DB expert
15. 1997, 9 years later: 1 Person and 1 Box, 1,250 tps
- 1 breadbox does 5x the 1987 machine room
- 23 GB is hand-held
- One person does all the work
- Cost/tps is 100,000x less: 5 micro-dollars per transaction
4 x 200 MHz CPUs, 1/2 GB DRAM, 12 x 4 GB disks (3 x 7 x 4 GB disk arrays).
One person is the hardware expert, OS expert, net expert, DB expert, and app expert.
16. What Happened? Where did the 100,000x come from?
- Moore's law: 100X (at most)
- Software improvements: 10X (at most)
- Commodity pricing: 100X (at least)
- Total: 100,000X
- 100x from commodity
- (DBMS was $100K to start, now $1K to start)
- IBM 390 MIPS is $7.5K today
- Intel MIPS is $10 today
- Commodity disk is $50/GB vs $1,500/GB
- ...
17. Web Server Farms, Server Consolidation: $/sqft
http://www.exodus.com (charges by Mbps times sqft)
Standard package: full height, fully populated with 3.5" disks.
HP, DELL, and Compaq are trading places wrt the rack-mount lead.
PoPC: Celeron NLX shoeboxes, 1,000 nodes in 48 (24x2) sq ft, $650K from Arrow (3-yr warranty!), on-chip at-speed L2.
18. Application Taxonomy
General purpose, non-parallelizable codes: PCs have it!
Vectorizable
Vectorizable & //able (supers & small DSMs)
Hand tuned, one-of: MPP coarse grain, MPP embarrassingly // (clusters of PCs)
Database, Database/TP, Web host, stream audio/video
(Axes: technical vs commercial)
If central control & rich, then IBM or large SMPs; else PC clusters.
19. Peta-Scale Computing
10x every 5 years, 100x every 10 (1,000x in 20 if SC).
Except: memory & IO bandwidth.
20. "I think there is a world market for maybe five computers."
Thomas Watson Senior, Chairman of IBM, 1943
21. Microsoft.com: 150x4 nodes: a crowd
22. HotMail (a year ago): 400 computers, a crowd (now 2x bigger)
23. DB Clusters (crowds)
- 16-node Cluster
- 64 cpus
- 2 TB of disk
- Decision support
- 45-node Cluster
- 140 cpus
- 14 GB DRAM
- 4 TB RAID disk
- OLTP (Debit Credit)
- 1 B tpd (14 k tps)
24. The Microsoft TerraServer Hardware
- Compaq AlphaServer 8400
- 8 x 400 MHz Alpha CPUs
- 10 GB DRAM
- 324 9.2 GB StorageWorks Disks
- 3 TB raw, 2.4 TB of RAID5
- STK 9710 tape robot (4 TB)
- WindowsNT 4 EE, SQL Server 7.0
25. TerraServer: Lots of Web Hits
- A billion web hits!
- 1 TB, largest SQL DB on the Web
- 100 Qps average, 1,000 Qps peak
- 877 M SQL queries so far
26. TerraServer Availability
- Operating for 13 months
- Unscheduled outage: 2.9 hrs
- Scheduled outage: 2.0 hrs (software upgrades)
- Availability: 99.93% overall uptime
- No NT failures (ever)
- One SQL 7 Beta 2 bug
- One major operator-assisted outage
27. Backup / Restore
28. Windows NT Versus UNIX: Best Results on an SMP
Semi-log plot shows a 3x (2-year) lead by UNIX.
Does not show the Oracle/Alpha cluster at 100,000 tpmC.
All these numbers are off-scale huge (40,000 active users?).
29. TPC-C Improvements (MS SQL): 250%/year on price, 100%/year on performance; the bottleneck is the 3 GB address space
40% hardware, 100% software, 100% PC technology
30. UNIX (dis)Economy of Scale
31. Two Different Pricing Regimes (these are late-1998 prices)
32. Storage Latency: How Far Away is the Data?
33. Thesis: Performance = Storage Accesses, not Instructions Executed
- In the old days we counted instructions and I/Os
- Now we count memory references
- Processors wait most of the time (see the sketch after the chart below)
[Chart: where the time goes; clock ticks used by AlphaSort components: disc wait, sort, OS, memory wait]
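A back-of-envelope model of the point (not from the talk; the clock rate, miss rate, and miss latency below are illustrative assumptions) shows how memory wait can dwarf compute time:

```cpp
#include <cstdio>

// Toy model: where do the cycles go for a memory-intensive loop?
// All parameters are illustrative assumptions, not measurements from the talk.
int main() {
    const double clock_hz     = 200e6;   // 200 MHz processor (assumed)
    const double instructions = 1e9;     // work to do
    const double cpi_ideal    = 1.0;     // cycles/instruction if memory were free
    const double mem_refs     = 0.3e9;   // ~0.3 loads/stores per instruction
    const double miss_rate    = 0.10;    // cache misses per memory reference
    const double miss_cycles  = 100;     // clocks to reach bulk DRAM

    double cpu_cycles  = instructions * cpi_ideal;
    double wait_cycles = mem_refs * miss_rate * miss_cycles;
    double total       = cpu_cycles + wait_cycles;

    std::printf("compute: %.2f s, memory wait: %.2f s (%.0f%% of total)\n",
                cpu_cycles / clock_hz, wait_cycles / clock_hz,
                100.0 * wait_cycles / total);
    return 0;
}
```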
34. Storage Hierarchy (10 levels)
- Registers, Cache L1, L2
- Main (1, 2, 3 if NUMA)
- Disk (1 (cached), 2)
- Tape (1 (mounted), 2)
35. Today's Storage Hierarchy: Speed & Capacity vs Cost Tradeoffs
[Charts: size vs speed and price ($/MB) vs speed for a typical system (bytes): cache, main memory, secondary (disc), and online / nearline / offline tape; access times range from 10^-9 to 10^3 seconds]
36. Meta-Message: Technology Ratios Are Important
- If everything gets faster & cheaper at the same rate, THEN nothing really changes.
- Things getting MUCH BETTER:
- communication speed & cost: 1,000x
- processor speed & cost: 100x
- storage size & cost: 100x
- Things staying about the same:
- speed of light (more or less constant)
- people (10x more expensive)
- storage speed (only 10x better)
37. Storage Ratios Changed
- 10x better access time
- 10x more bandwidth
- 4,000x lower media price
- DRAM/DISK media price ratio: 100:1 to 10:1 to 50:1
38. The Pico Processor
1 M SPECmarks. 10^6 clocks/fault to bulk RAM. Event horizon on chip. VM reincarnated. Multi-program cache.
Terror Bytes!
39. Bottleneck Analysis
- Theoretical bus bandwidth: 422 MBps (66 MHz x 64 bits)
- Memory read/write: 150 MBps
- MemCopy: 50 MBps (a minimal measurement sketch follows)
- Disk R/W: 9 MBps
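A minimal probe in the spirit of the MemCopy line above; the buffer size and repetition count are arbitrary choices, and the number it prints will of course reflect whatever machine runs it:

```cpp
#include <chrono>
#include <cstdio>
#include <cstring>
#include <vector>

// Time repeated memcpy of a large buffer and report sustained MB/s.
int main() {
    const size_t bytes = 64 << 20;            // 64 MB per copy (assumed)
    const int    reps  = 16;
    std::vector<char> src(bytes, 1), dst(bytes, 0);

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) {
        std::memcpy(dst.data(), src.data(), bytes);
        dst[i] ^= 1;                           // keep the optimizer honest
    }
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    std::printf("memcpy bandwidth: %.1f MB/s\n",
                (double)bytes * reps / (1 << 20) / secs);
    return 0;
}
```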
40. Bottleneck Analysis
- NTFS Read/Write
- 18 Ultra-3 SCSI disks on 4 strings (2x4 and 2x5), 3 64-bit PCI buses
- 155 MBps unbuffered read (175 raw)
- 95 MBps unbuffered write
- Good, but 10x down from our UNIX brethren (SGI, SUN)
41. PennySort
- Hardware
- 266 Mhz Intel PPro
- 64 MB SDRAM (10ns)
- Dual Fujitsu DMA 3.2GB EIDE disks
- Software
- NT workstation 4.3
- NT 5 sort
- Performance
- sort 15 M 100-byte records (1.5 GB)
- Disk to disk
- elapsed time 820 sec
- cpu time 404 sec
42. PennySort Ground Rules (http://research.microsoft.com/barc/SortBenchmark)
- How much can you sort for a penny?
- Hardware and software cost
- Depreciated over 3 years
- A $1M system gets about 1 second,
- a $1K system gets about 1,000 seconds
- Time (seconds) = 946,080 / SystemPrice ($)  (see the sketch below)
- Input and output are disk resident
- Input is
- 100-byte records (random data)
- key is the first 10 bytes
- Must create the output file and fill it with a sorted version of the input file
- Daytona (product) and Indy (special) categories
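A small sketch that just replays the slide's own budget arithmetic (3-year depreciation, a penny of system time); the two prices are the ones quoted above:

```cpp
#include <cstdio>

// PennySort budget: a system depreciated over 3 years (~94,608,000 seconds)
// gives 0.01 * 94,608,000 / price = 946,080 / price seconds per penny.
int main() {
    const double three_years = 3.0 * 365 * 24 * 3600;   // 94,608,000 s
    const double prices[] = {1e6, 1e3};                  // the $1M and $1K examples
    for (double price : prices) {
        double seconds_per_penny = 0.01 * three_years / price;
        std::printf("$%.0f system: about %.0f seconds of use per penny\n",
                    price, seconds_per_penny);
    }
    return 0;
}
```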
43. How Good is NT5 Sort?
- CPU and IO are not overlapped (a double-buffering sketch follows)
- System should be able to sort 2x more
- RAM has spare capacity
- Disk is space saturated (1.5 GB in, 1.5 GB out on a 3 GB drive). Need an extra 3 GB drive or a >6 GB drive
[Chart: disk, CPU, fixed, and RAM utilization]
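One plausible way to get the missing overlap is double buffering: prefetch the next chunk while the CPU sorts the current one. This is only a sketch, not NT5 sort's actual implementation; the file name and chunk size are hypothetical:

```cpp
#include <algorithm>
#include <fstream>
#include <future>
#include <vector>

// Read chunk i+1 in the background while sorting chunk i,
// so disk wait and CPU time overlap instead of adding up.
static std::vector<int> read_chunk(std::ifstream& in, size_t n) {
    std::vector<int> buf(n);
    in.read(reinterpret_cast<char*>(buf.data()), n * sizeof(int));
    buf.resize(in.gcount() / sizeof(int));
    return buf;
}

int main() {
    const size_t chunk = 1 << 20;                     // 1M ints per chunk (assumed)
    std::ifstream in("input.dat", std::ios::binary);  // hypothetical input file

    auto pending = std::async(std::launch::async, read_chunk, std::ref(in), chunk);
    while (true) {
        std::vector<int> current = pending.get();     // wait for the prefetched chunk
        if (current.empty()) break;
        pending = std::async(std::launch::async,      // start the next read...
                             read_chunk, std::ref(in), chunk);
        std::sort(current.begin(), current.end());    // ...while sorting this one
        // (write the sorted run to disk here)
    }
    return 0;
}
```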
44. Sandia/Compaq/ServerNet/NT Sort
- Sorted 1.1 terabytes (13 billion records) in 47 minutes
- 68 nodes (dual 450 MHz processors), 543 disks, $1.5M
- 1.2 GBps network rap (2.8 GBps pap)
- 5.2 GBps of disk rap (same as pap)
- (rap = real application performance, pap = peak advertised performance)
45. SPsort
46. Progress on Sorting: NT now leads both price and performance
- Speedup comes from Moore's law: 40%/year
- Processor/disk/network arrays: 60%/year (this is a software speedup)
47. Recent Results
- NOW Sort: 9 GB on a cluster of 100 UltraSparcs in 1 minute
- MilleniumSort: 16x Dell NT cluster, 100 MB in 1.18 sec (Datamation)
- Tandem/Sandia Sort: 68-CPU ServerNet, 1 TB in 47 minutes
- IBM SPsort
- 408 nodes, 1,952 CPUs
- 2,168 disks
- 17.6 minutes (1,057 sec)
- (all for 1/3 of $94M;
- slice price is $64K for 4 CPUs, 2 GB RAM, 6 9-GB disks & interconnect)
48. Data Gravity: Processing Moves to Transducers
- Move Processing to data sources
- Move to where the power (and sheet metal) is
- Processor in
- Modem
- Display
- Microphones (speech recognition) & cameras (vision)
- Storage: data storage and analysis
- System is distributed (a cluster/mob)
49. SAN: Standard Interconnect
Gbps SAN: 110 MBps
- LAN faster than memory bus?
- 1 GBps links in the lab
- $100 port cost soon
- Port is computer
- Winsock: 110 MBps (10% CPU utilization at each end)
PCI: 70 MBps
UW SCSI: 40 MBps
FW SCSI: 20 MBps
SCSI: 5 MBps
50. Disk Node
- has magnetic storage (100 GB?)
- has processor & DRAM
- has SAN attachment
- has execution environment
Applications
Services
DBMS
File System
RPC, ...
SAN driver
Disk driver
OS Kernel
51. end
52. Standard Storage Metrics
- Capacity
- RAM: MB and $/MB; today at 10 MB and $100/MB
- Disk: GB and $/GB; today at 10 GB and $200/GB
- Tape: TB and $/TB; today at 0.1 TB and $25K/TB (nearline)
- Access time (latency)
- RAM: 100 ns
- Disk: 10 ms
- Tape: 30-second pick, 30-second position
- Transfer rate
- RAM: 1 GB/s
- Disk: 5 MB/s (arrays can go to 1 GB/s)
- Tape: 5 MB/s (striping is problematic)
53. New Storage Metrics: Kaps, Maps, SCAN?
- Kaps: how many KB objects served per second
- The file server, transaction processing metric
- This is the OLD metric.
- Maps: how many MB objects served per second
- The multi-media metric
- SCAN: how long to scan all the data
- The data mining and utility metric
- And
- Kaps/$, Maps/$, TBscan/$
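A minimal calculator for the three metrics, using round device numbers like those in the Standard Storage Metrics slide (10 ms access, 5 MB/s transfer, 10 GB capacity); the parameters are illustrative, not a specific product:

```cpp
#include <cstdio>

// Kaps = 1 KB objects/s, Maps = 1 MB objects/s, SCAN = time to read every byte once.
int main() {
    const double access_s    = 0.010;     // seek + rotate + settle
    const double transfer_Bs = 5e6;       // sustained transfer rate
    const double capacity_B  = 10e9;      // disk capacity

    double kaps = 1.0 / (access_s + 1e3 / transfer_Bs);   // 1 KB per request
    double maps = 1.0 / (access_s + 1e6 / transfer_Bs);   // 1 MB per request
    double scan_hours = capacity_B / transfer_Bs / 3600;

    std::printf("Kaps: %.0f   Maps: %.1f   SCAN: %.1f hours\n",
                kaps, maps, scan_hours);
    return 0;
}
```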
54For the Record (good 1998 devices packaged in
systemhttp//www.tpc.org/results/individual_resul
ts/Dell/dell.6100.9801.es.pdf)
X 14
55For the Record (good 1998 devices packaged in
systemhttp//www.tpc.org/results/individual_resul
ts/Dell/dell.6100.9801.es.pdf)
X 14
56. How To Get Lots of Maps, SCANs
At 10 MB/s: 1.2 days to scan a terabyte.
1,000 x parallel: 100-second SCAN.
- Parallelism: use many little devices in parallel (see the sketch below)
- Beware of the media myth
- Beware of the access time myth
Parallelism: divide a big problem into many smaller ones to be solved in parallel.
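The slide's arithmetic, spelled out (the parameters are the ones quoted: 1 TB of data, 10 MB/s per drive):

```cpp
#include <cstdio>

// SCAN time for 1 TB as a function of how many drives scan it in parallel.
int main() {
    const double terabyte = 1e12;     // bytes to scan
    const double rate     = 10e6;     // 10 MB/s per drive
    for (int drives : {1, 10, 100, 1000}) {
        double seconds = terabyte / (rate * drives);
        std::printf("%4d drives: %10.0f s (%.2f days)\n",
                    drives, seconds, seconds / 86400);
    }
    return 0;
}
```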
57. The Disk Farm On a Card
- The 1 TB disc card
- An array of discs
- Can be used as
- 100 discs
- 1 striped disc
- 10 Fault Tolerant discs
- ....etc
- LOTS of accesses/second & bandwidth
14" card
Life is cheap, it's the accessories that cost ya. Processors are cheap, it's the peripherals that cost ya (a $10K disc card).
58. Tape Farms for Tertiary Storage, Not Mainframe Silos
One $10K robot: 14 tapes, 500 GB, 5 MB/s, $20/GB, 30 Maps, 27-hr scan.
100 robots: $1M, 50 TB, $50/GB, 3K Maps, still a 27-hour scan.
Scan in 27 hours: many independent tape robots (like a disc farm).
59. Tape & Optical: Beware of the Media Myth
Optical is cheap: $200/platter, 2 GB/platter => $100/GB (2x cheaper than disc).
Tape is cheap: $30/tape, 20 GB/tape => $1.50/GB (100x cheaper than disc).
60. Tape & Optical Reality: Media is 10% of System Cost
Tape needs a robot ($10K ... $3M), 10 ... 1,000 tapes (at 20 GB each) => $20/GB ... $200/GB (1x-10x cheaper than disc).
Optical needs a robot ($100K), 100 platters = 200 GB (TODAY) => $400/GB (more expensive than mag disc).
Robots have poor access times.
Not good for the Library of Congress (25 TB).
Data motel: data checks in but it never checks out!
61. The Access Time Myth
- The myth: seek or pick time dominates
- The reality: (1) queuing dominates
- (2) transfer dominates for BLOBs
- (3) disk seeks are often short
- Implication: many cheap servers are better than one fast expensive server
- shorter queues
- parallel transfer
- lower cost/access and cost/byte
- This is now obvious for disk arrays
- This will be obvious for tape arrays
62. What To Do About HIGH Availability
- Need a remote MIRRORED site to tolerate environmental failures (power, net, fire, flood) & operations failures
- Replicate changes across the net
- Failover servers across the net (some distance)
- Allows software upgrades, site moves, fires, ...
- Tolerates operations errors, Heisenbugs, ...
[Diagram: a client sends state changes to a server, which replicates them to a second server >100 feet or >100 miles away]
63. Scaleup Has Limits (chart courtesy of Catharine van Ingen)
- Vector Supers 10x supers
- 3 Gflops/cpu
- bus/memory 20 GBps
- IO 1GBps
- Supers 10x PCs
- 300 Mflops/cpu
- bus/memory 2 GBps
- IO 1 GBps
- PCs are slow
- 30 Mflops/cpu
- and bus/memory 200MBps
- and IO 100 MBps
64. TOP500 Systems by Vendor (courtesy of Larry Smarr, NCSA)
[Chart: number of TOP500 systems by vendor, Jun-93 through Jun-98: CRI, SGI, IBM, Convex, HP, Sun, TMC, Intel, DEC, Japanese vector machines, other]
TOP500 Reports: http://www.netlib.org/benchmark/top500.html
65. NCSA Super Cluster
http://access.ncsa.uiuc.edu/CoverStories/SuperCluster/super.html
- National Center for Supercomputing Applications, University of Illinois @ Urbana
- 512 Pentium II CPUs, 2,096 disks, SAN
- Compaq + HP + Myricom + WindowsNT
- A supercomputer for $3M
- Classic Fortran/MPI programming
- DCOM programming model
66. Avalon: Alpha Clusters for Science (http://cnls.lanl.gov/avalon/)
- 140 Alpha processors (533 MHz)
- x 256 MB RAM & 3 GB disk each
- Fast Ethernet switches
- 45 GB RAM & 550 GB disk total
- Linux...
- 10 real Gflops for $313,000
- => 34 real Mflops/$K
- (150 benchmark Mflops/$K)
- Beowulf project is the parent
- http://www.cacr.caltech.edu/beowulf/naegling.html
- 114 nodes, $2K/node
- Scientists want cheap MIPS.
67. Your Tax Dollars At Work: ASCI for Stockpile Stewardship
- Intel/Sandia 9000x1 node Ppro
- LLNL/IBM 512x8 PowerPC (SP2)
- LANL/Cray ?
- Maui Supercomputer Center
- 512x1 SP2
68. Observations
- Uniprocessor RAP << PAP
- real application performance << peak advertised performance
- Growth has slowed (Bell Prize):
- 1987: 0.5 GFLOPS
- 1988: 1.0 GFLOPS (1 year)
- 1990: 14 GFLOPS (2 years)
- 1994: 140 GFLOPS (4 years)
- 1997: 604 GFLOPS
- 1998: 1600 G__OPS (4 years)
69. Two Generic Kinds of Computing
- Many little
- embarrassingly parallel
- Fit RPC model
- Fit partitioned data and computation model
- Random works OK
- OLTP, File Server, Email, Web,..
- Few big
- sometimes not obviously parallel
- Do not fit RPC model (BIG rpcs)
- Scientific, simulation, data mining, ...
70. Many Little Programming Model
- many small requests
- route requests to data
- encapsulate data with procedures (objects)
- three-tier computing
- RPC is a convenient/appropriate model
- Transactions are a big help in error handling
- Auto partition (e.g. hash data and computation)
- Works fine.
- Software CyberBricks
71. Object-Oriented Programming: Parallelism From Many Little Jobs
- Gives location transparency
- ORB/web/tpmon multiplexes clients to servers
- Enables distribution
- Exploits embarrassingly parallel apps (transactions)
- HTTP and RPC (DCOM, CORBA, RMI, IIOP, ...) are the basis
TP monitor / ORB / web server
72. Few Big Programming Model
- Finding parallelism is hard
- Pipelines are short (3x to 6x speedup)
- Spreading objects/data is easy, but getting
locality is HARD - Mapping big job onto cluster is hard
- Scheduling is hard
- coarse grained (job) and fine grain (co-schedule)
- Fault tolerance is hard
73. Kinds of Parallel Execution
Pipeline: any sequential program feeds its output to any other sequential program.
Partition: outputs split N ways, inputs merge M ways; each partition runs an ordinary sequential program.
[Diagram: pipeline parallelism and partitioned parallelism among sequential programs]
74. Why Parallel Access To Data?
At 10 MB/s: 1.2 days to scan a terabyte.
1,000 x parallel: 100-second SCAN. BANDWIDTH!
Parallelism: divide a big problem into many smaller ones to be solved in parallel.
75. Why are Relational Operators Successful for Parallelism?
Relational data model: uniform operators on uniform data streams, closed under composition. Each operator consumes 1 or 2 input streams; each stream is a uniform collection of data; sequential data in and out; pure dataflow. Partitioning some operators (e.g. aggregates, non-equi-join, sort, ...) requires innovation. AUTOMATIC PARALLELISM.
76. Database Systems Hide Parallelism
- Automate system management via tools
- data placement
- data organization (indexing)
- periodic tasks (dump / recover / reorganize)
- Automatic fault tolerance
- duplex failover
- transactions
- Automatic parallelism
- among transactions (locking)
- within a transaction (parallel execution)
77. SQL: a Non-Procedural Programming Language
- SQL: a functional programming language that describes the answer set.
- The optimizer picks the best execution plan
- Picks the data flow web (pipeline),
- degree of parallelism (partitioning),
- other execution parameters (process placement, memory, ...)
[Diagram: GUI, schema, optimizer, plan, execution planning, monitor, executors, rivers]
78. Partitioned Execution
Spreads computation and IO among processors.
Partitioned data gives NATURAL parallelism (a minimal sketch follows).
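A minimal sketch of partitioned execution: hash-partition the records across N workers, let each scan only its own partition, then merge the partial results. The record type, partition count, and predicate are all illustrative:

```cpp
#include <cstdio>
#include <thread>
#include <vector>

struct Record { int key; int value; };

int main() {
    const int N = 4;                                 // one worker per processor/disk
    std::vector<std::vector<Record>> partition(N);
    for (int i = 0; i < 1000000; ++i)                // partition by hash of the key
        partition[i % N].push_back({i, i % 100});

    std::vector<long> partial(N, 0);
    std::vector<std::thread> workers;
    for (int p = 0; p < N; ++p)
        workers.emplace_back([&, p] {                // each worker scans its slice
            for (const Record& r : partition[p])
                if (r.value > 90) partial[p]++;      // local predicate + aggregate
        });
    for (auto& w : workers) w.join();

    long total = 0;
    for (long c : partial) total += c;               // merge the partial results
    std::printf("matching records: %ld\n", total);
    return 0;
}
```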
79. N x M-way Parallelism
N inputs, M outputs, no bottlenecks. Partitioned data; partitioned and pipelined data flows.
80. Automatic Parallel Object-Relational DB
  select image from landsat
  where date between 1970 and 1990
    and overlaps(location, Rockies)
    and snow_cover(image) > .7
The predicates are temporal, spatial, and image tests.
Assign one process per processor/disk: find images with the right date & location, analyze the image, and if > 70% snow, return it.
[Diagram: Landsat table (date 1/2/72 ... 4/8/95, location 33N 120W ... 34N 120W, image) partitioned across nodes; each node applies the date, location, and image tests to produce the answer]
81. Data Rivers: Split + Merge Streams
Producers add records to the river; consumers consume records from the river. Purely sequential programming. The river does flow control and buffering, and does the partition and merge of data records. The river is the Split/Merge in Gamma and the Exchange operator in Volcano / SQL Server. (A minimal river sketch follows.)
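A minimal single-producer / single-consumer river, assuming a bounded blocking queue stands in for the real thing: the producer and consumer are purely sequential loops, and the river does the flow control and buffering. The record type and capacity are illustrative:

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

// A bounded, blocking queue of records: put() blocks when full, get() blocks
// when empty, and close() lets the consumer drain and stop.
class River {
    std::queue<int> q_;
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
    const size_t cap_ = 1024;
    bool closed_ = false;
public:
    void put(int rec) {                       // producer: flow control when full
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [&] { return q_.size() < cap_; });
        q_.push(rec);
        not_empty_.notify_one();
    }
    void close() {
        std::lock_guard<std::mutex> lk(m_);
        closed_ = true;
        not_empty_.notify_all();
    }
    std::optional<int> get() {                // consumer: blocks when empty
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;  // closed and drained
        int rec = q_.front(); q_.pop();
        not_full_.notify_one();
        return rec;
    }
};

int main() {
    River river;
    std::thread producer([&] {                     // adds records to the river
        for (int i = 0; i < 100000; ++i) river.put(i);
        river.close();
    });
    long sum = 0;
    while (auto rec = river.get()) sum += *rec;    // consumes records from the river
    producer.join();
    std::printf("consumed sum = %ld\n", sum);
    return 0;
}
```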
82. Generalization: Object-Oriented Rivers
- Rivers transport a sub-class of record-set (a stream of objects)
- record type and partitioning are part of the subclass
- Node transformers are data pumps
- an object with river inputs and outputs
- do late binding to the record type
- Programming becomes data flow programming
- specify the pipelines
- Compiler/scheduler does data partitioning and transformer placement (see the sketch below)
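A sketch of a node transformer as a data pump: an object with a typed input river and a typed output river, composable into pipelines. Plain vectors stand in for rivers here, and the two example transformers are made up for illustration:

```cpp
#include <cstdio>
#include <cstdlib>
#include <vector>

// A transformer drains its input river and fills its output river.
template <class In, class Out>
struct Transformer {
    virtual std::vector<Out> pump(const std::vector<In>& in) = 0;
    virtual ~Transformer() = default;
};

struct Parse : Transformer<const char*, int> {          // text record -> int record
    std::vector<int> pump(const std::vector<const char*>& in) override {
        std::vector<int> out;
        for (const char* s : in) out.push_back(std::atoi(s));
        return out;
    }
};

struct Square : Transformer<int, long> {                // int record -> long record
    std::vector<long> pump(const std::vector<int>& in) override {
        std::vector<long> out;
        for (int v : in) out.push_back(1L * v * v);
        return out;
    }
};

int main() {
    std::vector<const char*> raw = {"3", "4", "5"};
    Parse parse;
    Square square;
    for (long v : square.pump(parse.pump(raw)))          // specify the pipeline
        std::printf("%ld ", v);
    std::printf("\n");
    return 0;
}
```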
83. NT Cluster Sort as a Prototype
- Using
- data generation and
- sort as a prototypical app
- Hello world of distributed processing
- goal: easy install & execute
84. Remote Install
- Add a Registry entry to each remote node (sketched below):
RegConnectRegistry() + RegCreateKeyEx()
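A sketch of that step with the two Win32 calls named above; the node name, key path, and value are hypothetical and error handling is minimal:

```cpp
#include <windows.h>
#include <stdio.h>

// Write an install-path registry entry on one remote node.
int main() {
    HKEY hRemote = NULL, hKey = NULL;
    // Connect to HKEY_LOCAL_MACHINE on the remote node (name is an assumption).
    if (RegConnectRegistry(TEXT("\\\\node01"), HKEY_LOCAL_MACHINE, &hRemote)
            != ERROR_SUCCESS) {
        printf("cannot reach remote registry\n");
        return 1;
    }
    DWORD disposition = 0;
    // Create (or open) a hypothetical install key on that node.
    if (RegCreateKeyEx(hRemote, TEXT("SOFTWARE\\ClusterSort"), 0, NULL,
                       REG_OPTION_NON_VOLATILE, KEY_WRITE, NULL,
                       &hKey, &disposition) == ERROR_SUCCESS) {
        const TCHAR path[] = TEXT("C:\\ClusterSort");
        RegSetValueEx(hKey, TEXT("InstallPath"), 0, REG_SZ,
                      (const BYTE*)path, sizeof(path));
        RegCloseKey(hKey);
    }
    RegCloseKey(hRemote);
    return 0;
}
```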
85. Cluster Startup & Execution
- Setup
- MULTI_QI struct
- COSERVERINFO struct
- Retrieve the remote object handle from the MULTI_QI struct (sketched below)
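A sketch of remote activation with CoCreateInstanceEx, filling in the COSERVERINFO (which node) and MULTI_QI (which interfaces) structs named above; CLSID_SortWorker and the node name are hypothetical:

```cpp
#include <windows.h>
#include <objbase.h>
#include <stdio.h>

// Placeholder class id for a hypothetical sort-worker COM class; a real
// deployment would use the GUID the worker registers on each node.
static const CLSID CLSID_SortWorker =
    { 0x12345678, 0x1234, 0x1234, { 0x12, 0x34, 0x56, 0x78, 0x9a, 0xbc, 0xde, 0xf0 } };

int main() {
    CoInitializeEx(NULL, COINIT_MULTITHREADED);

    COSERVERINFO server = {};                          // which node to activate on
    server.pwszName = const_cast<LPWSTR>(L"node01");   // hypothetical node name

    MULTI_QI qi = {};                                  // which interface(s) we want back
    qi.pIID = &IID_IUnknown;

    HRESULT hr = CoCreateInstanceEx(CLSID_SortWorker, NULL, CLSCTX_REMOTE_SERVER,
                                    &server, 1, &qi);
    if (SUCCEEDED(hr) && SUCCEEDED(qi.hr)) {
        IUnknown* worker = qi.pItf;                    // the remote object handle
        // ... QueryInterface for the worker's real interface, start the sort ...
        worker->Release();
    } else {
        printf("remote activation failed: 0x%08lX\n", (unsigned long)hr);
    }
    CoUninitialize();
    return 0;
}
```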
86. Cluster Sort Conceptual Model
- Multiple data sources
- Multiple data destinations
- Multiple nodes
- Disks -> Sockets -> Disk -> Disk
[Diagram: records A, B, C start mixed (AAA BBB CCC) on every source node; the sort repartitions them over sockets so each destination node ends up with only its own key range]
87. How Do They Talk to Each Other?
- Each node has an OS
- Each node has local resources: a federation.
- Each node does not completely trust the others.
- Nodes use RPC to talk to each other
- CORBA? DCOM? IIOP? RMI?
- One or all of the above.
- Huge leverage in high-level interfaces.
- Same old distributed system story.
[Diagram: on each node, applications sit on RPC, streams, and datagrams (and ?), over VIAL/VIPL, over the wire(s)]