Title: Data Centric Computing
1. Data Centric Computing
Yotta Zetta Exa Peta Tera Giga Mega Kilo
- Jim Gray
- Microsoft Research
- Research.Microsoft.com/Gray/talks
- FAST 2002
- Monterey, CA, 14 Oct 1999
2. Put Everything in Future (Disk) Controllers (it's not if, it's when?)
- Jim Gray, Microsoft Research
- http://Research.Microsoft.com/Gray/talks
- FAST 2002, Monterey, CA, 14 Oct 1999
Acknowledgements
- Dave Patterson explained this to me long ago
- Leonard Chung, Kim Keeton, Erik Riedel, Catharine Van Ingen helped me sharpen these arguments
3. First Disk, 1956
- IBM 305 RAMAC
- 4 MB
- 50 x 24" disks
- 1200 rpm
- 100 ms access
- $35k/y rent
- Included computer & accounting software (tubes, not transistors)
4. 10 years later
1.6 meters
5. Disk Evolution
Kilo Mega Giga Tera Peta Exa Zetta Yotta
- Capacity: 100x in 10 years; 1 TB 3.5" drive in 2005; 20 GB as 1" micro-drive
- System on a chip
- High-speed SAN
- Disk replacing tape
- Disk is super computer!
6. Disks are becoming computers
- Smart drives
- Camera with micro-drive
- Replay / Tivo / Ultimate TV
- Phone with micro-drive
- MP3 players
- Tablet
- Xbox
- Many more
[Figure: smart-drive stack - Applications (Web, DBMS, Files) over OS over Disk Ctlr (1 GHz cpu, 1 GB RAM), with Comm via Infiniband, Ethernet, radio]
7. Data Gravity: Processing Moves to Transducers
smart displays, microphones, printers, NICs, disks
- Processing decentralized
- Moving to data sources
- Moving to power sources
- Moving to sheet metal
- ? The end of computers ?
8. It's Already True of Printers: Peripheral = CyberBrick
- You buy a printer
- You get:
- several network interfaces
- a Postscript engine
- cpu,
- memory,
- software,
- a spooler (soon)
- and a print engine.
9. The (absurd?) consequences of Moore's Law
- 256-way NUMA?
- Huge main memories: now 500 MB - 64 GB, then 10 GB - 1 TB
- Huge disks: now 20-200 GB 3.5" disks, then 0.1 - 1 TB disks
- Petabyte storage farms
- (that you can't back up or restore).
- Disks >> tapes
- Small disks: one platter, one inch, 10 GB
- SAN convergence: 1 GBps point-to-point is easy
- 1 GB RAM chips
- MAD at 200 Gbpsi
- Drives shrink one quantum
- 10 GBps SANs are ubiquitous
- 1 bips cpus for $10
- 10 bips cpus at high end
10. The Absurd Design?
- Further segregate processing from storage
- Poor locality
- Much useless data movement
- Amdahl's laws: bus: 10 B/ips; io: 1 b/ips
[Figure: 100 TB of disks with 10 TBps of aggregate bandwidth feeding 1 Tips of processors across a 100 GBps link]
11. What's a Balanced System? (40 disk arms / cpu)
12. Amdahl's Balance Laws Revised
- Laws right, just need interpretation (imagination?)
- Balanced System Law: a system needs 8 MIPS/MBpsIO, but the instruction rate must be measured on the workload.
- Sequential workloads have low CPI (clocks per instruction); random workloads tend to have higher CPI.
- Alpha (the MB/MIPS ratio) is rising from 1 to 6. This trend will likely continue.
- One random IO per 50k instructions.
- Sequential IOs are larger: one sequential IO per 200k instructions.
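A minimal sketch of these balance rules as arithmetic (the 1,000-MIPS example processor is an assumption for illustration):

    # Amdahl's revised balance rules from this slide:
    #   8 MIPS per MBps of IO; 1 random IO per 50k instructions;
    #   1 (larger) sequential IO per 200k instructions.
    def balanced_io(mips):
        ips = mips * 1e6                 # instructions per second
        return ips / 50e3, ips / 200e3, mips / 8

    rand_ios, seq_ios, io_mbps = balanced_io(1000)   # a 1 bips processor
    print(f"{rand_ios:,.0f} random IO/s or {seq_ios:,.0f} sequential IO/s, "
          f"balanced by {io_mbps:.0f} MBps of IO")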
13. Observations re TPC-C, TPC-H systems
- More than ½ the hardware cost is in disks
- Most of the mips are in the disk controllers
- 20 mips/arm is enough for TPC-C
- 50 mips/arm is enough for TPC-H
- Need 128 MB to 256 MB/arm
- Ref:
- Gray & Shenoy, Rules of Thumb
- Keeton, Riedel, Uysal PhD theses.
- ? The end of computers ?
14. TPC systems
- Normalize for CPI (clocks per instruction)
- TPC-C has about 7 ins/byte of IO
- TPC-H has 3 ins/byte of IO
- TPC-H needs ½ as many disks: sequential vs random
- Both use 9 GB 10 krpm disks (need arms, not bytes)
15. TPC systems: What's alpha (MB/MIPS)?
- Hard to say:
- Intel: 32-bit addressing (4 GB limit). Known CPI.
- IBM, HP, Sun have 64 GB limit. Unknown CPI.
- Look at both, guess CPI for IBM, HP, Sun
- Alpha is between 1 and 6

                 Mips                Memory   Alpha
    Amdahl       1                   1        1
    tpcC Intel   8 x 262 = 2 Gips    4 GB     2
    tpcH Intel   8 x 458 = 4 Gips    4 GB     1
    tpcC IBM     24 cpus ? 12 Gips   64 GB    6
    tpcH HP      32 cpus ? 16 Gips   32 GB    2
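The Alpha column can be re-derived from the table's own Gips and memory figures:

    # alpha = MB/MIPS = GB/Gips, using the (Gips, GB) pairs above
    systems = {"tpcC Intel": (2, 4), "tpcH Intel": (4, 4),
               "tpcC IBM": (12, 64), "tpcH HP": (16, 32)}
    for name, (gips, gb) in systems.items():
        print(f"{name}: alpha = {gb / gips:.1f}")   # 2.0, 1.0, 5.3, 2.0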
16. When each disk has 1 bips, no need for cpu
17. Implications
Conventional:
- Offload device handling to NIC/HBA
- Higher-level protocols: I2O, NASD, VIA, IP, TCP
- SMP and cluster parallelism is important.
Radical:
- Move app to NIC/device controller
- Higher-higher level protocols: CORBA / COM.
- Cluster parallelism is VERY important.
18. Interim Step: Shared Logic
- Brick with 8-12 disk drives
- 200 mips/arm (or more)
- 2 x Gbps Ethernet
- General-purpose OS (except NetApp)
- $10k/TB to $50k/TB
- Shared:
- sheet metal
- power
- support/config
- security
- network ports
Examples: Snap 1 TB (12 x 80 GB) NAS; NetApp 0.5 TB (8 x 70 GB) NAS; Maxtor 2 TB (12 x 160 GB) NAS
19. Next step in the Evolution
- Disks become supercomputers
- Controller will have 1 bips, 1 GB RAM, 1 GBps net
- and a disk arm.
- Disks will run full-blown app/web/db/os stack
- Distributed computing
- Processors migrate to transducers.
20. Gordon Bell's Seven Price Tiers
- $10: wrist-watch computers
- $100: pocket/palm computers
- $1,000: portable computers
- $10,000: personal computers (desktop)
- $100,000: departmental computers (closet)
- $1,000,000: site computers (glass house)
- $10,000,000: regional computers (glass castle)
Super-server: costs more than $100,000
Mainframe: costs more than $1M; must be an array of processors, disks, tapes, comm ports
21. Bell's Evolution of Computer Classes
Technology enables two evolutionary paths:
1. constant performance, decreasing cost
2. constant price, increasing performance
[Chart annotations: 1.26x/yr = 2x per 3 yrs = 10x per decade (price decline 1/1.26 = 0.8x/yr); 1.6x/yr = 4x per 3 yrs = 100x per decade (1/1.6 = 0.62x/yr)]
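The chart's growth arithmetic checks out:

    # 1.26x/yr compounds to ~2x in 3 years and ~10x in a decade;
    # 1.6x/yr compounds to ~4x in 3 years and ~100x in a decade.
    for rate in (1.26, 1.6):
        print(f"{rate}x/yr -> {rate**3:.1f}x per 3 yrs, {rate**10:.0f}x per decade")
    # 1.26x/yr -> 2.0x per 3 yrs, 10x per decade
    # 1.6x/yr  -> 4.1x per 3 yrs, 110x per decade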
22. NAS vs SAN
High-level interfaces are better
- Network Attached Storage
- File servers
- Database servers
- Application servers
- (it's a slippery slope, as Novell showed)
- Storage Area Network
- A lower life form
- Block server: get block / put block
- Wrong abstraction level (too low level)
- Security is VERY hard to understand
- (who can read that disk block?)
SCSI and iSCSI are popular.
23. How Do They Talk to Each Other?
- Each node has an OS
- Each node has local resources: a federation.
- Each node does not completely trust the others.
- Nodes use RPC to talk to each other
- WebServices/SOAP? CORBA? COM? RMI?
- One or all of the above.
- Huge leverage in high-level interfaces.
- Same old distributed-system story.
[Figure: two application stacks talking across a SAN; each side layers datagrams, streams, RPC, and "?" over SIO above the SAN]
24. Basic Argument for x-Disks
- Future disk controller is a super-computer:
- 1 bips processor
- 256 MB dram
- 1 TB disk plus one arm
- Connects to SAN via high-level protocols
- RPC, HTTP, SOAP, COM, Kerberos, Directory Services, ...
- Commands are RPCs
- management, security, ...
- Services file/web/db requests
- Managed by general-purpose OS with good dev environment
- Move apps to disk to save data movement
- need programming environment in controller
25. The Slippery Slope
- If you add function to the server
- then you add more function to the server.
- Function gravitates to data.
Nothing (sector server) -> Something (fixed-app server) -> Everything (app server)
26. Why Not a Sector Server? (let's get physical!)
- Good idea, that's what we have today.
- But:
- cache added for performance
- sector remap added for fault tolerance
- error reporting and diagnostics added
- SCSI commands (reserve, ...) are growing
- sharing problematic (space mgmt, security, ...)
- Slipping down the slope to a 1-D block server
27. Why Not a 1-D Block Server? Put A LITTLE on the Disk Server
- Tried and true design
- HSC - VAX cluster
- EMC
- IBM Sysplex (3980?)
- But look inside:
- has a cache
- has space management
- has error reporting & management
- has RAID 0, 1, 2, 3, 4, 5, 10, 50, ...
- has locking
- has remote replication
- has an OS
- Security is problematic
- Low-level interface moves too many bytes
28. Why Not a 2-D Block Server? Put A LITTLE on the Disk Server
- Tried and true design
- Cedar -> NFS
- file server, cache, space, ...
- Open file is many fewer msgs
- Grows to have:
- directories & naming
- authentication & access control
- RAID 0, 1, 2, 3, 4, 5, 10, 50, ...
- locking
- backup/restore/admin
- cooperative caching with client
29. Why Not a File Server? Put a Little on the 2-D Block Server
- Tried and true design
- NetWare, Windows, Linux, NetApp, Cobalt, SNAP, ... WebDAV
- Yes, but look at NetWare:
- file interface grew
- became an app server
- mail, DB, web, ...
- NetWare had a primitive OS
- hard to program, so optimized the wrong thing
30. Why Not Everything? Allow Everything on the Disk Server (thin clients)
- Tried and true design
- Mainframes, minis, ...
- Web servers, ...
- Encapsulates data
- Minimizes data moves
- Scaleable
- It is where everyone ends up.
- All the arguments against are short-term.
31. The Slippery Slope
- If you add function to the server
- then you add more function to the server.
- Function gravitates to data.
Nothing (sector server) -> Something (fixed-app server) -> Everything (app server)
32. Disk Node
- has magnetic storage (1 TB?)
- has processor & DRAM
- has SAN attachment
- has execution environment
[Figure: disk-node software stack - Applications / Services / DBMS / File System / RPC, ... / SAN driver / Disk driver / OS Kernel]
33. Hardware
- Homogeneous machines lead to quick response through reallocation
- HP desktop machines, 320 MB RAM, 3U high, 4 x 100 GB IDE drives
- $4k/TB (street), 2.5 processors/TB, 1 GB RAM/TB
- 3 weeks from ordering to operational
Slide courtesy of Brewster Kahle, @ Archive.org
34. Disk as Tape
- Tape is unreliable, specialized, slow, low density, not improving fast, and expensive
- Using removable hard drives to replace tape's function has been successful
- When a tape is needed, the drive is put in a machine and it is online. No need to copy from tape before it is used.
- Portable, durable, fast, media cost ~ raw tapes, dense. Unknown longevity, suspected good.
Slide courtesy of Brewster Kahle, @ Archive.org
35. Disk As Tape: What format?
- Today I send NTFS/SQL disks.
- But that is not a good format for Linux.
- Solution: ship NFS/CIFS/ODBC servers (not disks)
- Plug disk into LAN.
- DHCP, then file or DB server via standard interface.
- Web Service in the long term
36. Some Questions
- Will the disk folks deliver?
- What is the product?
- How do I manage 1,000 nodes (disks)?
- How do I program 1,000 nodes (disks)?
- How does RAID work?
- How do I backup a PB?
- How do I restore a PB?
37. Will the disk folks deliver? Maybe! Hard Drive Unit Shipments
Source: DiskTrend/IDC
Not a pretty picture (lately)
38. Most Disks are Personal
- 85% of disks are desktop/mobile (not SCSI)
- Personal media is AT LEAST 50% of the problem.
- How to manage your shoebox of:
- Documents
- Voicemail
- Photos
- Music
- Videos
39. What is the Product? (see next section on media management)
- Concept: Plug it in and it works!
- Music/Video/Photo appliance (home)
- Game appliance
- PC
- File server appliance
- Data archive/interchange appliance
- Web appliance
- Email appliance
- Application appliance
- Router appliance
[Figure: appliance with only network and power connections]
40. Auto-Manage Storage
- 1980 rule of thumb:
- a DataAdmin per 10 GB, a SysAdmin per mips
- 2000 rule of thumb:
- a DataAdmin per 5 TB
- a SysAdmin per 100 clones (varies with app).
- Problem:
- 5 TB is $50k today, $5k in a few years.
- Admin cost >> storage cost !!!!
- Challenge:
- automate ALL storage admin tasks
41. How do I manage 1,000 nodes?
- You can't manage 1,000 x (for any x).
- They manage themselves.
- You manage exceptional exceptions.
- Auto-manage:
- Plug & Play hardware
- auto load-balance placement of storage & processing
- simple parallel programming model
- fault masking
- Some positive signs:
- few admins at Google (10k nodes, 2 PB), Yahoo! (? nodes, 0.3 PB), Hotmail (10k nodes, 0.3 PB)
42. How do I program 1,000 nodes?
- You can't program 1,000 x (for any x).
- They program themselves.
- You write embarrassingly parallel programs (see the sketch below)
- Examples: SQL, Web, Google, Inktomi, HotMail, ...
- PVM and MPI prove it must be automatic (unless you have a PhD)!
- Auto-parallelism is ESSENTIAL
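For flavor, a minimal embarrassingly parallel sketch (the shard files and word-count task are hypothetical, standing in for a per-node job):

    # Embarrassingly parallel: one independent task per input, with no
    # communication between tasks; the pool handles the parallelism.
    from multiprocessing import Pool

    def count_words(path):                    # hypothetical per-node task
        with open(path) as f:
            return path, sum(len(line.split()) for line in f)

    if __name__ == "__main__":
        shards = [f"shard{i}.txt" for i in range(1000)]   # one input per node
        with Pool() as pool:
            for path, words in pool.imap_unordered(count_words, shards):
                print(path, words)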
43. Plug & Play Software
- RPC is standardizing (SOAP/HTTP, COM, RMI/IIOP)
- Gives huge TOOL LEVERAGE
- Solves the hard problems:
- naming,
- security,
- directory service,
- operations, ...
- Commoditized programming environments
- FreeBSD, Linux, Solaris tools
- NetWare tools
- WinCE, WinNT tools
- JavaOS tools
- Apps gravitate to data.
- General-purpose OS on a dedicated ctlr can run apps.
44. It's Hard to Archive a Petabyte. It takes a LONG time to restore it.
- At 1 GBps it takes 12 days!
- Store it in two (or more) places online (on disk?): a geo-plex
- Scrub it continuously (look for errors)
- On failure:
- use other copy until failure repaired,
- refresh lost copy from safe copy.
- Can organize the two copies differently (e.g. one by time, one by space)
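The 12-day figure is straight division:

    # 1 PB at 1 GBps: 10**15 B / 10**9 B/s = 10**6 s
    print(1e15 / 1e9 / 86400)    # ~11.6 days, i.e. the slide's "12 days"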
45. Disk vs Tape

               Disk                                Tape
    Capacity   160 GB                              100 GB
    Bandwidth  25 MBps                             10 MBps
    Latency    5 ms seek + 3 ms rotate             30 sec pick + many-minute seek
    Cost       $2/GB drive + $1/GB ctlrs/cabinet   $5/GB media + $10/GB drive+library
    Density    4 TB/rack                           10 TB/rack

Guesstimates: CERN: 200 TB, 3480 tapes, 2 col = 50 GB; rack = 1 TB = 20 drives
The price advantage of tape is narrowing, and the performance advantage of disk is growing
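Summing the table's cost columns per gigabyte:

    disk = 2 + 1      # $/GB: drive + ctlrs/cabinet
    tape = 5 + 10     # $/GB: media + drive/library
    print(f"disk ${disk}/GB vs tape ${tape}/GB: tape costs {tape / disk:.0f}x more")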
46. I'm a disk bigot
- I hate tape, tape hates me.
- Unreliable hardware
- Unreliable software
- Poor human factors
- Terrible latency, bandwidth
- Disk
- Much easier to use
- Much faster
- Cheaper!
- But needs new concepts
47. Disk as Tape: Challenges
- Offline disk (safe from virus)
- Trivialize Backup/Restore software
- Things never change
- Just object versions
- Snapshot for continuous change (databases)
- RAID in a SAN
- (cross-disk journaling)
- Massive replication (a la Farsite)
48. Summary
- Disks will become supercomputers
- Compete in Linux appliance space
- Build best NAS software (compete with NetApp, ...)
- Auto-manage huge storage farms: FarSite, SQL autoAdmin, ...
- Build world's best disk-based backup system, including geoplex (compete with Veritas, ...)
- Push faster on 64-bit
49. Storage capacity beating Moore's law
- $2k/TB today (raw disk)
- $1k/TB by end of 2002
50. Trends: Magnetic Storage Densities
- Amazing progress
- Ratios have changed:
- capacity grows 60%/y
- access speed grows 10x more slowly
51. Trends: Density Limits
- The end is near!
- Products: 23 Gbpsi; lab: 50 Gbpsi; limit: 60 Gbpsi
- But the limit keeps rising: there are alternatives (NEMS, fluorescent? holographic, DNA?)
[Figure: bit density (b/µm² and Gb/in², roughly 0.6 to 3,000) vs time, 1990-2008, plotting CD, DVD, and ODD against the wavelength limit, and magnetic recording against the superparamagnetic limit]
Figure adapted from Franco Vitaliano, "The NEW new media: the growing attraction of nonmagnetic storage", Data Storage, Feb 2000, pp 21-32, www.datastorage.com
52. CyberBricks
- Disks are becoming supercomputers.
- Each disk will be a file server, then a SOAP server
- Multi-disk bricks are transitional
- Long term, a brick will have an OS per disk.
- Systems will be built from bricks.
- There will also be:
- network bricks
- display bricks
- camera bricks
- ...
53. Data Centric Computing
Yotta Zetta Exa Peta Tera Giga Mega Kilo
- Jim Gray
- Microsoft Research
- Research.Microsoft.com/Gray/talks
- FAST 2002
- Monterey, CA, 14 Oct 1999
54. Communications Excitement!!
[Figure: a 2x2 grid of Point-to-Point vs Broadcast against Immediate vs Time-Shifted: conversation and money (point-to-point, immediate); lecture and concert (broadcast, immediate); mail (point-to-point, time-shifted); book and newspaper (broadcast, time-shifted); network and database at the center]
It's ALL going electronic. Information is being stored for analysis (so, ALL database). Analysis & automatic processing are being added.
Slide borrowed from Craig Mundie
55. Information Excitement!
- But comm just carries information
- Real value added is:
- information capture & render: speech, vision, graphics, animation, ...
- information storage & retrieval,
- information analysis
56. Information At Your Fingertips
- All information will be in an online database (somewhere)
- You might record everything you
- read: 10 MB/day, 400 GB/lifetime (5 disks today)
- hear: 400 MB/day, 16 TB/lifetime (2 disks/year today)
- see: 1 MB/s, 40 GB/day, 1.6 PB/lifetime (150 disks/year, maybe someday)
- Data storage, organization, and analysis is the challenge:
- text, speech, sound, vision, graphics, spatial, time
- Information at Your Fingertips:
- make it easy to capture
- make it easy to store, organize, analyze
- make it easy to present, access
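The lifetime figures are simple rate arithmetic; a minimal check, assuming roughly a 100-year lifetime and about 12 waking hours/day for "see" (both assumptions, not from the slide):

    # read: 10 MB/day; hear: 400 MB/day; see: 1 MB/s for ~12 h/day
    day_mb = {"read": 10, "hear": 400, "see": 1 * 3600 * 12}
    for what, mb in day_mb.items():
        lifetime_tb = mb * 365 * 100 / 1e6   # MB/day over ~100 years, in TB
        print(f"{what}: {mb / 1e3:.2f} GB/day, {lifetime_tb:,.1f} TB/lifetime")
    # read ~0.4 TB (400 GB), hear ~15 TB, see ~1,577 TB (~1.6 PB)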
57. How much information is there?
- Soon everything can be recorded and indexed
- Most bytes will never be seen by humans.
- Data summarization, trend detection, anomaly detection are key technologies
- See Mike Lesk, How much information is there: http://www.lesk.com/mlesk/ksg97/ksg.html
- See Lyman & Varian, How much information: http://www.sims.berkeley.edu/research/projects/how-much-info/
[Figure: the byte-scale ladder (kilo ... yotta, and milli down to yocto) locating a book, a photo, a movie, all LoC books (words), all books multimedia, and "everything recorded"]
58. Why Put Everything in Cyberspace?
- Low rent: min $/byte
- Shrinks time: now or later
- Shrinks space: here or there
- Automate processing: knowbots
[Figure: point-to-point OR broadcast; immediate OR time-delayed; locate, process, analyze, summarize]
59. Disk Storage Cheaper than Paper
- File cabinet: cabinet (4 drawer) $250 + paper (24,000 sheets) $250 + space (2' x 3' @ $10/ft2) $180 = $700 total, about 3 cents/sheet
- Disk: disk (160 GB) $300; as ASCII: 100 m pages, 0.0001 cents/sheet (10,000x cheaper)
- As image: 1 m photos, 0.03 cents/sheet (100x cheaper)
- Store everything on disk
60. Gordon Bell's MainBrain: Digitize Everything. A BIG shoebox?
- Scans: 20 k pages tiff @ 300 dpi = 1 GB
- Music: 2 k tracks = 7 GB
- Photos: 13 k images = 2 GB
- Video: 10 hrs = 3 GB
- Docs: 3 k (ppt, word, ...) = 2 GB
- Mail: 50 k messages = 1 GB
- Total: 16 GB
61. Gary Starkweather
- Scan EVERYTHING
- 400 dpi TIFF
- 70k pages = 14 GB
- OCR all scans (98% recognition accuracy)
- All indexed (5-second access to anything)
- All on his laptop.
62. Q: What happens when the personal terabyte arrives?
- A: Things will run SLOWLY... unless we add good software
63. Summary
- Disks will morph to appliances
- Main barriers to this happening:
- lack of cool apps
- cost of information management
64. The Absurd Disk
- 2.5 hr scan time (poor sequential access)
- 1 aps / 5 GB (VERY cold data)
- It's a tape!
[Figure: a 1 TB disk, 100 MB/s, 200 Kaps]
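Both headline numbers follow from the drive parameters:

    print(1e12 / 100e6 / 3600)   # 1 TB at 100 MB/s: ~2.8 hr to scan it
    print(1e12 / 5e9)            # 1 access/sec per 5 GB over 1 TB: only 200 aps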
65. Crazy Disk Ideas
- Disk farm on a card: surface-mount disks
- Disk (magnetic store) on a chip (micro-machines in silicon)
- Full apps (e.g. SAP, Exchange/Notes, ...) in the disk controller (a processor with 128 MB dram)
The Innovator's Dilemma: When New Technologies Cause Great Firms to Fail, Clayton M. Christensen. ISBN 0875845851
66. The Disk Farm On a Card
- The 500 GB disc card
- An array of discs
- Can be used as:
- 100 discs
- 1 striped disc
- 50 fault-tolerant discs
- ... etc
- LOTS of accesses/second
- and bandwidth
[Figure: a 14" card]
67. Trends & promises: NEMS (Nano Electro Mechanical Systems)
(http://www.nanochip.com/; also Cornell, IBM, CMU, ...)
- 250 Gbpsi by using a tunneling electron microscope
- Disk replacement:
- capacity: 180 GB now, 1.4 TB in 2 years
- transfer rate: 100 MB/sec R/W
- latency: 0.5 msec
- power: 23 W active, 0.05 W standby
- $10k/TB now, $2k/TB in 2004
68. Trends: Gilder's Law: 3x bandwidth/year for 25 more years
- Today:
- 40 Gbps per channel (?)
- 12 channels per fiber (WDM): 500 Gbps
- 32 fibers/bundle: 16 Tbps/bundle
- In lab: 3 Tbps/fiber (400 x WDM)
- In theory: 25 Tbps per fiber
- 1 Tbps = USA 1996 WAN bisection bandwidth
- Aggregate bandwidth doubles every 8 months!
1 fiber = 25 Tbps
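The bundle arithmetic, checked:

    per_fiber = 40e9 * 12            # 40 Gbps x 12 WDM channels ~ 480 Gbps ("500")
    per_bundle = per_fiber * 32      # x 32 fibers ~ 15.4 Tbps ("16 Tbps")
    print(per_fiber / 1e9, "Gbps/fiber;", per_bundle / 1e12, "Tbps/bundle")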
69. Technology Drivers: What if Networking Was as Cheap As Disk IO?
- TCP/IP:
- Unix/NT: 100% cpu @ 40 MBps
- Disk:
- Unix/NT: 8% cpu @ 40 MBps
70. SAN: Standard Interconnect
- LAN faster than memory bus?
- 1 GBps links in lab.
- $100 port cost soon
- Port is computer

    Gbps Ethernet   110 MBps
    PCI             70 MBps
    UW SCSI         40 MBps
    FW SCSI         20 MBps
    SCSI            5 MBps
71. Building a Petabyte Store
- EMC: $500k/TB = $500M/PB; plus FC switches = $800M/PB
- TPC-C SANs (Dell, 18 GB disks): $62M/PB
- Dell local SCSI, 3ware: $20M/PB
- Do it yourself: $5M/PB
72. The Cost of Storage (heading for $1K/TB soon)
73. Cheap Storage or Balanced System
- Low-cost storage (2 x $1.5k servers): $6K/TB
- 2 x ($1K system + 8 x 80 GB disks + 100 Mb Ethernet)
- Balanced server ($7k / 0.5 TB):
- 2 x 800 MHz ($2k)
- 256 MB ($400)
- 8 x 80 GB drives ($2K)
- Gbps Ethernet switch ($1k)
- $11k/TB, $22K per RAIDed TB
74. 320 GB, $2k (now)
- 4 x 80 GB IDE (2 hot-pluggable)
- ($1,000)
- SCSI-IDE bridge
- $200
- Box:
- 500 MHz cpu
- 256 MB SRAM
- fan, power, Enet
- $700
- Or 8 disks/box: 640 GB for $3K (or 300 GB RAID)
75. (no transcript)
76. Hot-Swap Drives for Archive or Data Interchange
- 25 MBps write (so can write N x 160 GB in 3 hours)
- 160 GB/overnite = N x 4 MB/second
- @ $19.95/nite
77. Data delivery costs $1/GB today
- Rent for big customers: $300 per megabit per second per month
- Improved 3x in last 6 years (!).
- That translates to $1/GB at each end.
- You can mail a 160 GB disk for $20.
- That's 16x cheaper.
- If overnight, it's 3 MBps.
3 x 160 GB = ½ TB
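A quick check that $300 per Mbps per month is about $1/GB (assuming a 30-day month):

    mbps_month_gb = 1e6 / 8 * 86400 * 30 / 1e9   # GB moved by 1 Mbps in a month
    print(mbps_month_gb, 300 / mbps_month_gb)    # ~324 GB, ~$0.93/GB at each end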
78. Data on Disk Can Move to RAM in 8 years
[Chart: disk vs RAM price per byte over time; ratio roughly 30:1, converging over ~6 years]
79. Storage Latency: How Far Away is the Data?
[Figure: the latency ladder, in clocks, with distance/time analogies]

    Registers           1      My Head       1 min
    On-chip cache       2      This Room
    On-board cache      10     This Campus   10 min
    Memory              100    Springfield   1.5 hr
    Disk                10^6   Pluto         2 years
    Tape/Optical robot  10^9   Andromeda     2,000 years
80. More Kaps and Kaps/$, but...
- Disk accesses got much less expensive: better disks, cheaper disks!
- But disk arms are expensive: the scarce resource
- 1-hour scan, vs 5 minutes in 1990
81. Backup: 3 scenarios
- Disaster Recovery: preservation through replication
- Hardware Faults: different solutions for different situations
- clusters,
- load balancing,
- replication,
- tolerate machine/disk outages
- (avoided RAID and expensive, low-volume solutions)
- Programmer Error: versioned duplicates (no deletes)
82. Online Data
- Can build 1 PB of NAS disk for $5M today
- Can SCAN (read or write) the entire PB in 3 hours.
- Operate it as a data pump: continuous sequential scan
- Can deliver 1 PB for $1M over the Internet
- Access charge is $300/Mbps bulk rate
- Need to geoplex data (store it in two places).
- Need to filter/process data near the source,
- to minimize network costs.
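The scan and delivery claims line up with the per-drive numbers on slide 45 (the 160 GB / 25 MBps drive figures are taken from there; the drive count is derived, not stated):

    drives = 1e15 / 160e9        # ~6,250 drives to hold 1 PB
    agg = drives * 25e6          # ~156 GBps aggregate bandwidth
    print(1e15 / agg / 3600)     # ~1.8 hours raw scan; ~3 hours with overhead
    print(1e15 / 1e9 * 1)        # 1e6 GB x $1/GB ~ $1M to deliver 1 PB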