Title: The Data Avalanche
1 The Data Avalanche
Talk at the National Youth Leadership Forum on Technology (aka nerd camp), July 2004
- Jim Gray
- Microsoft Research
- Gray@Microsoft.com
- http://research.microsoft.com/Gray
2 Numbers: TeraBytes and Gigabytes are BIG!
- Mega: a house in San Francisco
- Giga: a very rich person
- Tera: the Bush national debt
- Peta: more than all the money in the world
- A Gigabyte: the Human Genome
- A Terabyte: a 150-mile-long shelf of books.
3 Outline
- Historical trends imply that in 20 years:
- we can store everything in cyberspace: the personal petabyte.
- computers will have natural interfaces: speech recognition/synthesis, vision, object recognition beyond OCR.
- Implications:
- The information avalanche will only get worse.
- The user interface will change: less typing, more writing, talking, gesturing, more seeing and hearing.
- Organizing, summarizing, and prioritizing information is a key technology.
[Scale bar: Yotta, Zetta, Exa, Peta, Tera, Giga, Mega, Kilo, with a "We are here" marker]
4 How much information is there?
- Soon everything can be recorded and indexed.
- Most bytes will never be seen by humans.
- Data summarization, trend detection, and anomaly detection are key technologies.
- See Mike Lesk, How much information is there: http://www.lesk.com/mlesk/ksg97/ksg.html
- See Lyman & Varian, How much information: http://www.sims.berkeley.edu/research/projects/how-much-info/
[Scale chart: Yotta, Zetta, Exa, Peta, Tera, Giga, Mega, Kilo, with markers for Everything Recorded, All Books MultiMedia, All books (words), Movie, A Photo, A Book]
Small prefixes: 10^-24 yocto, 10^-21 zepto, 10^-18 atto, 10^-15 femto, 10^-12 pico, 10^-9 nano, 10^-6 micro, 10^-3 milli
5 Things Have Changed
1956
- IBM 305 RAMAC
- 10 MB disk
- $1M (year-2004 dollars)
6 The Next 50 years will see MORE CHANGE
Had Three Growth Curves 1890-1990
[Chart: ops/s/$ over time. Combination of Hans Moravec + Larry Roberts + Gordon Bell data: WordSize*ops/s/sysprice]
- 1890-1945
- Mechanical
- Relay
- 7-year doubling
- 1945-1985
- Tube, transistor,..
- 2.3 year doubling
- 1985-2004
- Microprocessor
- 1.0 year doubling
7 Constant Cost or Constant Function?
- 100x improvement per decade
- Same function 100x cheaper
- 100x more function for same price
[Diagram labels, in slide order: Mainframe; SMP Constellation; Cluster; Constant Price; Mini; SMP Constellation; Workstation; Graphics/storage; Lower Price; New Category; PDA; Camera/browser]
8 Growth Comes From NEW Apps
- The $10M computer of 1980 costs $1k today
- If we were still doing the same things, IT would be a $0 B/y industry
- NEW things absorb the new capacity
9 The Surprise-Free Future in 20 years.
- 10,000x more power for same price
- Personal supercomputer
- Personal petabyte stores
- Same function for 10,000x less cost.
- Smart dust: the penny PC?
- The 10 peta-op computer (for $1,000).
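Note on the arithmetic: a 10,000x gain in 20 years is the same trend as the 100x-per-decade rule, and it implies a doubling time of about a year and a half. The sketch below only restates that relationship; the function names are illustrative and are not from the talk.

```python
import math


def doubling_time_years(improvement: float, years: float) -> float:
    """Doubling time implied by an overall improvement factor over `years`."""
    return years / math.log2(improvement)


def improvement_factor(doubling_years: float, years: float) -> float:
    """Overall improvement after `years` at a fixed doubling time."""
    return 2.0 ** (years / doubling_years)


# 10,000x over 20 years, the figure used on the slide.
t = doubling_time_years(10_000, 20)
print(f"implied doubling time: {t:.2f} years")
print(f"gain per decade at that rate: {improvement_factor(t, 10):.0f}x")
```

At that doubling time one decade gives 100x and two decades give 10,000x, so the two slides are describing the same curve.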
10 10,000x would change things
- Human computer interface
- Decent computer vision
- Decent computer speech recognition
- Decent computer speech synthesis
- Vast information stores
- Ability to search and abstract the stores.
11 How Good is HCI Today?
- Surprisingly good.
- Demo of making faces
- http://research.microsoft.com/research/pubs/view.aspx?pubid=290
- Demo of speech synthesis
- Daisy, Hal
- Synthetic voice
- Speech recognition is improving fast,
- Vision getting better
- Pen computing finally a reality.
- Displays improving fast (compared to last 30 years)
12 Outline
- Historical trends imply that in 20 years:
- we can store everything in cyberspace: the personal petabyte.
- computers will have natural interfaces: speech recognition/synthesis, vision, object recognition beyond OCR.
- Implications:
- The information avalanche will only get worse.
- The user interface will change: less typing, more writing, talking, gesturing, more seeing and hearing.
- Organizing, summarizing, and prioritizing information is a key technology.
[Scale bar: Yotta, Zetta, Exa, Peta, Tera, Giga, Mega, Kilo, with a "We are here" marker]
13 How much information is there?
- Almost everything is recorded digitally.
- Most bytes are never seen by humans.
- Data summarization, trend detection, and anomaly detection are key technologies.
- See Mike Lesk, How much information is there: http://www.lesk.com/mlesk/ksg97/ksg.html
- See Lyman & Varian, How much information: http://www.sims.berkeley.edu/research/projects/how-much-info/
[Scale chart: Yotta, Zetta, Exa, Peta, Tera, Giga, Mega, Kilo, with markers for Everything Recorded, All Books MultiMedia, All books (words), Movie, A Photo, A Book]
14 And >90% in Cyberspace, Because:
- Low rent: min $/byte
- Shrinks time: now or later
- Shrinks space: here or there
- Automate processing: knowbots
[Diagram labels: Point-to-Point OR Broadcast; Immediate OR Time Delayed; Locate, Process, Analyze, Summarize]
15 MyLifeBits: The guinea pig
- Gordon Bell is digitizing his life
- Has now scanned virtually all:
- Books written (and read when possible)
- Personal documents (correspondence, memos, email, bills, legal, etc.)
- Photos
- Posters, paintings, photos of things (artifacts, medals, plaques)
- Home movies and videos
- CD collection
- And, of course, all PC files
- Recording: phone, radio, TV, web pages, conversations
- Paperless throughout 2002. 12 scanned, 12 discarded.
- Only 30GB, excluding videos
- Video is 2 TB and growing fast
16 Capture and encoding
17 I mean everything
18 25K-day life: Personal Petabyte
[Chart label: 1PB]
Will anyone look at web pages in 2020? Probably new modalities and media will dominate then.
19 Challenges
- Capture: Get the bits in
- Organize: Index them
- Manage: No worries about loss or space
- Curate/Annotate: automate where possible
- Privacy: Keep safe from theft.
- Summarize: Give thumbnail summaries
- Interface: how to ask/anticipate questions
- Present: show it in understandable ways.
20 Memex: As We May Think, Vannevar Bush, 1945
- "A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility"
- "yet if the user inserted 5000 pages of material a day it would take him hundreds of years to fill the repository, so that he can be profligate and enter material freely"
21 Too much storage? Try to fill a terabyte in a year
Item                          Items/TB   Items/day
300 KB JPEG                   3 M        9,800
1 MB Doc                      1 M        2,900
1 hour 256 kb/s MP3 audio     9 K        26
1 hour 1.5 Mb/s MPEG video    290        0.8
Petabyte volume has to be some form of video.
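Note on the table arithmetic: the sketch below recomputes the rows under stated assumptions (a binary terabyte of 2^40 bytes, a 365-day year, and constant-bitrate audio and video). These assumptions are mine, not the slide's. With them, the JPEG, document, and MP3 rows land close to the slide's figures. The video row comes out near 1,600 hours per TB rather than 290, so the slide presumably assumes a higher effective video bitrate than 1.5 Mb/s.

```python
TB = 2 ** 40            # assumption: binary terabyte
KB = 2 ** 10
MB = 2 ** 20
DAYS_PER_YEAR = 365     # assumption: fill the terabyte in exactly one year


def stream_bytes(bits_per_second: float, seconds: float = 3600) -> float:
    """Size in bytes of a constant-bitrate stream (default: one hour)."""
    return bits_per_second * seconds / 8


def items_per_tb(item_bytes: float) -> float:
    return TB / item_bytes


def items_per_day(item_bytes: float) -> float:
    return items_per_tb(item_bytes) / DAYS_PER_YEAR


rows = {
    "300 KB JPEG": 300 * KB,
    "1 MB Doc": 1 * MB,
    "1 hour 256 kb/s MP3 audio": stream_bytes(256_000),
    "1 hour 1.5 Mb/s MPEG video": stream_bytes(1_500_000),
}

for name, size in rows.items():
    print(f"{name:28s} {items_per_tb(size):12,.0f} {items_per_day(size):10,.1f}")
```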
22 How Will We Find Anything?
- Need Queries, Indexing, Pivoting, Scalability, Backup, Replication, Online update, Set-oriented access
- If you don't use a DBMS, you will implement one!
- Simple logical structure:
- Blob and link is all that is inherent
- Additional properties (facets: extra tables) and methods on those tables (encapsulation)
- More than a file system
- Unifies data and meta-data
[Diagram label: SQL DBMS]
23 Photos
24 Searching: the most useful app?
- Challenge: What questions for useful results?
- Many ways to present answers
25 (No Transcript)
26 Detail view
27 Resource explorer: Ancestor (collections), annotations, descendant preview panes turned on
28 Synchronized timelines with histogram guide
29 Value of media depends on annotations
- "It's just bits until it is annotated"
30 System annotations provide base level of value
31 Tracking usage even better
- Date 7/7/2000. Opened 30 times, emailed to 10 people (it's valued by the user!)
32 Getting the user to say a little something is a big jump
- Date 7/7/2000. Opened 30 times, emailed to 10 people. "BARC dim sum intern farewell Lunch"
33 Getting the user to tell a story is the ultimate in media value
- A story is a layout in time and space
- Most valuable content (by selection, and by being well annotated)
- Stories must include links to any media they use (for future navigation/search: transclusion).
- Cf: MovieMaker, Creative Memories PhotoAlbums
34 Value of media depends on annotations
- "It's just bits until it is annotated"
- Auto-annotate whenever possible, e.g. GPS cameras
- Make manual annotation as easy as possible: XP photo capture, voice, photos with voice, etc.
- Support gang annotation
- Make stories easy
35 80% of data is personal / individual. But, what about the other 20%?
- Business
- Wal-Mart online: 1PB and growing.
- Paradox: most "transaction" systems < 1 PB.
- Have to go to image/data monitoring for big data
- Government
- Government is the biggest business.
- Science
- LOTS of data.
36 Instruments: CERN LHC, Petabytes per Year
- Looking for the Higgs Particle
- Sensors: 1000 GB/s (1 TB/s, about 30 EB/y)
- Events: 75 GB/s
- Filtered: 5 GB/s
- Reduced: 0.1 GB/s, about 2 PB/y
- Data pyramid: 100GB : 1TB : 100TB : 1PB : 10PB
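Note on the rates: converting a sustained rate into a yearly volume is a one-line calculation. The sketch below assumes decimal units and, by default, continuous operation all year; the `duty_cycle` parameter is my own addition. Continuous running gives about 31.5 EB/y for the 1 TB/s sensor stream, in line with the slide's 30 EB/y, and about 3 PB/y for the 0.1 GB/s reduced stream. The slide's 2 PB/y would correspond to running roughly two thirds of the year, which is an assumption rather than something the slide states.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600   # 31,536,000
GB = 10 ** 9                         # assumption: decimal units throughout
PB = 10 ** 15
EB = 10 ** 18


def bytes_per_year(rate_gb_per_s: float, duty_cycle: float = 1.0) -> float:
    """Yearly volume for a sustained rate; duty_cycle is the fraction of the year running."""
    return rate_gb_per_s * GB * SECONDS_PER_YEAR * duty_cycle


stages = {"Sensors": 1000, "Events": 75, "Filtered": 5, "Reduced": 0.1}

for name, rate in stages.items():
    print(f"{name:9s} {rate:7.1f} GB/s  ->  {bytes_per_year(rate) / PB:10,.1f} PB/y")
```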
37 Information Avalanche
- Both
- better observational instruments and
- better simulations
- are producing a data avalanche
- Examples
- Turbulence: 100 TB simulation, then mine the information
- BaBar: grows 1TB/day; 2/3 simulation information, 1/3 observational information
- CERN LHC will generate 1GB/s, 10 PB/y
- VLBA (NRAO) generates 1GB/s today
- NCBI: "only" ½ TB but doubling each year, very rich dataset.
- Pixar: 100 TB/Movie
Image courtesy of C. Meneveau & A. Szalay @ JHU
38 Q: Where will the Data Come From? A: Sensor Applications
- Earth Observation
- 15 PB by 2007
- Medical Images & Information + Health Monitoring
- Potential 1 GB/patient/y -> 1 EB/y
- Video Monitoring
- 1E8 video cameras @ 1E5 MBps -> 10TB/s -> 100 EB/y -> filtered???
- Airplane Engines
- 1 GB sensor data/flight,
- 100,000 engine hours/day
- 30PB/y
- Smart Dust: ?? EB/y
http://robotics.eecs.berkeley.edu/pister/SmartDust/
http://www-bsac.eecs.berkeley.edu/shollar/macro_motes/macromotes.html
39 The Big Picture
[Diagram labels: Experiments & Instruments, Other Archives, Literature, Simulations, facts, questions, answers]
The Big Problems
- Data ingest
- Managing a petabyte
- Common schema
- How to organize it?
- How to reorganize it
- How to coexist with others
- Query and Vis tools
- Support/training
- Performance
- Execute queries in a minute
- Batch query scheduling
40 FTP - GREP
- Download (FTP and GREP) are not adequate
- You can GREP 1 MB in a second
- You can GREP 1 GB in a minute
- You can GREP 1 TB in 2 days
- You can GREP 1 PB in 3 years.
- Oh!, and 1PB ~ 3,000 disks
- At some point we need indices to limit search, and parallel data search and analysis
- This is where databases can help
- Next generation technique: Data Exploration
- Bring the analysis to the data!
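Note on the GREP times: the slide's figures are round numbers, but they are roughly what a single sequential scan delivers at around 10 MB/s. That rate is my assumption, chosen to land near the slide's numbers; the talk does not state one. A minimal sketch:

```python
SCAN_RATE = 10 * 10 ** 6   # assumption: 10 MB/s sustained sequential scan

UNITS = [
    ("years", 365 * 24 * 3600),
    ("days", 24 * 3600),
    ("minutes", 60),
    ("seconds", 1),
]


def scan_seconds(n_bytes: float, rate: float = SCAN_RATE) -> float:
    """Time to read n_bytes once, start to finish, on a single stream."""
    return n_bytes / rate


def humanize(seconds: float) -> str:
    """Express a duration in the largest unit that gives a value of at least 1."""
    for name, size in UNITS:
        if seconds >= size:
            return f"{seconds / size:.1f} {name}"
    return f"{seconds:.1f} seconds"


for label, size in [("1 MB", 10 ** 6), ("1 GB", 10 ** 9), ("1 TB", 10 ** 12), ("1 PB", 10 ** 15)]:
    print(f"GREP {label}: {humanize(scan_seconds(size))}")
```

The point of the slide survives any reasonable choice of rate: scan time grows linearly with size, so a petabyte cannot be searched by reading it end to end on one stream.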
41 The Speed Problem
- Many users want to search the whole DB: ad hoc queries, often combinatorial
- Want ~1 minute response
- Brute force (parallel search):
- 1 disk = 50MBps => ~1M disks/PB ~ $300M/PB
- Indices (limit search, do column store)
- 1,000x less equipment: $1M/PB
- Pre-compute answer
- No one knows how to do it for all questions.
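Note on the brute-force estimate: the number of disks needed for a full parallel scan follows directly from the data size, the per-disk rate, and the response-time target. The sketch below uses the slide's 50 MB/s per disk and a one-minute target, and assumes a perfectly parallel scan with no overhead, which is optimistic. It gives a few hundred thousand disks per petabyte, the same order of magnitude as the slide's figure of about a million.

```python
import math

DISK_RATE = 50 * 10 ** 6   # 50 MB/s per disk, as on the slide
PB = 10 ** 15


def disks_needed(n_bytes: int, seconds: float, disk_rate: int = DISK_RATE) -> int:
    """Disks required to read n_bytes within `seconds`, assuming a perfectly parallel scan."""
    return math.ceil(n_bytes / (disk_rate * seconds))


full_scan = disks_needed(PB, 60)
# Assumption: an index or column store cuts the bytes touched by about 1,000x.
indexed = disks_needed(PB // 1000, 60)

print(f"full scan of 1 PB in a minute: {full_scan:,} disks")
print(f"with 1,000x less data touched: {indexed:,} disks")
```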
42 Next-Generation Data Analysis
- Looking for
- Needles in haystacks: the Higgs particle
- Haystacks: Dark matter, Dark energy
- Needles are easier than haystacks
- Global statistics have poor scaling
- Correlation functions are N^2, likelihood techniques N^3
- As data and computers grow at same rate, we can only keep up with N log N
- A way out?
- Relax notion of optimal (data is fuzzy, answers are approximate)
- Don't assume infinite computational resources or memory
- Combination of statistics & computer science
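Note on the scaling claim: if the data size N and the available compute both grow by the same factor k, an algorithm costing f(N) takes f(kN) / (k * f(N)) times as long as it does today. The sketch below evaluates that ratio for the three cost functions named on the slide. The starting size of 10^9 objects is an arbitrary assumption that only affects the N log N row.

```python
import math


def slowdown(cost, n: int, growth: int) -> float:
    """Relative runtime after data and compute both grow by `growth`."""
    return cost(n * growth) / (growth * cost(n))


costs = {
    "N log N": lambda n: n * math.log2(n),
    "N^2": lambda n: n ** 2,
    "N^3": lambda n: n ** 3,
}

n = 10 ** 9   # assumption: a billion objects today
for growth in (10, 100, 1000):
    row = ", ".join(f"{name}: {slowdown(f, n, growth):,.2f}x" for name, f in costs.items())
    print(f"growth {growth:5d}x -> {row}")
```

N log N stays within a small constant of today's runtime, N^2 gets slower by the growth factor itself, and N^3 by its square, which is why the slide says only N log N can keep up.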
43 Analysis and Databases
- Much statistical analysis deals with
- Creating uniform samples
- data filtering
- Assembling relevant subsets
- Estimating completeness
- censoring bad data
- Counting and building histograms
- Generating Monte-Carlo subsets
- Likelihood calculations
- Hypothesis testing
- Traditionally these are performed on files
- Most of these tasks are much better done inside a database
- Move Mohamed to the mountain, not the mountain to Mohamed.
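Note on doing the analysis inside the database: filtering, censoring bad data, and building a histogram can all be expressed as one query, so only the small result leaves the database. The sketch below uses Python's built-in sqlite3 module with a made-up table (`obs`, with hypothetical `mag` and `quality` columns) and synthetic data. It illustrates the idea only and is not the schema of any archive mentioned in the talk.

```python
import random
import sqlite3

random.seed(0)

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per observation, quality = 0 means "good".
conn.execute("CREATE TABLE obs (id INTEGER PRIMARY KEY, mag REAL, quality INTEGER)")

rows = [(random.gauss(20.0, 1.5), random.choice([0, 0, 0, 1])) for _ in range(10_000)]
conn.executemany("INSERT INTO obs (mag, quality) VALUES (?, ?)", rows)

# Censor bad data and build a one-unit-wide histogram in a single query.
histogram = conn.execute(
    """
    SELECT CAST(mag AS INTEGER) AS bin, COUNT(*) AS n
    FROM obs
    WHERE quality = 0
    GROUP BY bin
    ORDER BY bin
    """
).fetchall()

for bin_start, count in histogram:
    print(f"{bin_start:3d} {count:6d}")
```

Ten thousand rows go in and about a dozen histogram bins come out, which is the "move the analysis to the data" pattern in miniature.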
44 Outline
- Historical trends imply that in 20 years:
- we can store everything in cyberspace: the personal petabyte.
- computers will have natural interfaces: speech recognition/synthesis, vision, object recognition beyond OCR.
- Implications:
- The information avalanche will only get worse.
- The user interface will change: less typing, more writing, talking, gesturing, more seeing and hearing.
- Organizing, summarizing, and prioritizing information is a key technology.
[Scale bar: Yotta, Zetta, Exa, Peta, Tera, Giga, Mega, Kilo, with a "We are here" marker]
45 Information Avalanche
- In science, industry, government, ...
- better observational instruments
- and better simulations
- are producing a data avalanche
- Examples
- BaBar: grows 1TB/day; 2/3 simulation information, 1/3 observational information
- CERN LHC will generate 1GB/s, 10 PB/y
- VLBA (NRAO) generates 1GB/s today
- Pixar: 100 TB/Movie
- New emphasis on informatics:
- Capturing, Organizing, Summarizing, Analyzing, Visualizing
Image courtesy C. Meneveau & A. Szalay @ JHU
[Image labels: BaBar, Stanford; PE Gene Sequencer, from http://www.genome.uci.edu/; Space Telescope]
46 The Evolution of Science
- Observational Science
- Scientist gathers data by direct observation
- Scientist analyzes data
- Analytical Science
- Scientist builds analytical model
- Makes predictions.
- Computational Science
- Simulate analytical model
- Validate model and make predictions
- Data Exploration Science
- Data captured by instruments, or data generated by simulator
- Processed by software
- Placed in a database / files
- Scientist analyzes database / files
47 e-Science
- Data captured by instruments, or data generated by simulator
- Processed by software
- Placed in files or a database
- Scientist analyzes files / database
- Virtual laboratories
- Networks connecting e-Scientists
- Strong support from funding agencies
- Better use of resources
- Primitive today
48 The Big Picture
[Diagram labels: Experiments & Instruments, Other Archives, Literature, Simulations, facts, questions, answers]
The Big Problems
- Data ingest
- Managing a petabyte
- Common schema
- How to organize it?
- How to reorganize it
- How to coexist with others
- Query and Vis tools
- Support/training
- Performance
- Execute queries in a minute
- Batch query scheduling
49 e-Science is Data Mining
- There are LOTS of data
- people cannot examine most of it.
- Need computers to do analysis.
- Manual or Automatic Exploration
- Manual: person suggests hypothesis, computer checks hypothesis
- Automatic: computer suggests hypothesis, person evaluates significance
- Given an arbitrary parameter space:
- Data Clusters
- Points between Data Clusters
- Isolated Data Clusters
- Isolated Data Groups
- Holes in Data Clusters
- Isolated Points
Nichol et al. 2001. Slide courtesy of and adapted from Robert Brunner @ CalTech.
50 Data Analysis
- Looking for
- Needles in haystacks: the Higgs particle
- Haystacks: Dark matter, Dark energy
- Needles are easier than haystacks
- Global statistics have poor scaling
- Correlation functions are N^2, likelihood techniques N^3
- As data and computers grow at same rate, we can only keep up with N log N
- A way out?
- Discard notion of optimal (data is fuzzy, answers are approximate)
- Don't assume infinite computational resources or memory
- Requires combination of statistics & computer science
51 TerraServer/TerraService http://terraService.Net/
- US Geological Survey Photo (DOQ) & Topo (DRG) images online.
- On Internet since June 1998
- Operated by Microsoft Corporation
- Cross Indexed with
- Home sales,
- Demographics,
- Encyclopedia
- A web service
- 20 TB data source
- 10 M web hits/day
52 USGS Image Data
- Digital Raster Graphics
- 1 TB compressed TIFF, 65,000 files
- Scanned topographic maps
- 100% U.S. coverage
- 1:24,000, 1:100,000 and 1:250,000 scale maps
- Maps vary in age
- Digital OrthoQuads
- 18 TB, 260,000 files uncompressed
- Digitized aerial imagery
- 88% coverage conterminous US
- 1 meter resolution
- < 10 years old
53 User Interface Concept
Display Imagery:
- 316 m 200 x 200 pixel images
- 7 level image pyramid
- Resolution 1 meter/pixel to 64 meter/pixel
Navigation Tools:
- 1.5 m place names
- Click-on Coverage map
- Longitude and Latitude search
- U.S. Address Search
External Geo-Spatial Links to:
- USGS On-line Stream Gauges
- Home Advisor Demographics
- Home Advisor Real Estate
- Encarta Articles
- Stream flow gauges
Concept: User navigates an almost seamless image of earth
- Click on image to zoom in
- Buttons to pan NW, N, NE, W, E, SW, S, SE
- Links to switch between Topo, Imagery, and Relief data
- Links to Print, Download and view meta-data information
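Note on the image pyramid: seven levels running from 1 meter/pixel to 64 meter/pixel is what you get if each level halves the resolution of the one below it. The factor of 2 is my inference from those endpoints, not something the slide states. The sketch below lists the levels and the ground distance covered by one 200 x 200 pixel tile at each level.

```python
TILE_PIXELS = 200   # tile edge in pixels, from the slide


def pyramid_resolutions(base_m_per_pixel: float = 1.0, levels: int = 7) -> list[float]:
    """Meters per pixel at each level, assuming each level halves the resolution."""
    return [base_m_per_pixel * 2 ** level for level in range(levels)]


def tile_extent_m(m_per_pixel: float, tile_pixels: int = TILE_PIXELS) -> float:
    """Ground distance covered by one tile edge at the given resolution."""
    return m_per_pixel * tile_pixels


for level, res in enumerate(pyramid_resolutions()):
    print(f"level {level}: {res:5.0f} m/pixel, tile edge {tile_extent_m(res):7,.0f} m")
```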
54 TerraService New Things
- A popular web service
- Exactly the map you want.
- Dynamic Map Re-projection
- UTM to Geographic projection
- Dynamic texture mapping?
- New Data
- 1 foot resolution natural color imagery
- Census Tiger data
- Lights Out Management
- MOM
- Auto-backup / restore on drive failure
55 New Urban Area Data
Microsoft Campus at 4 meter resolution
Redundant Bunch 1
Ball field at .25 meter resolution
56 TerraServer Becomes a Web Service: TerraServer.net -> TerraService.Net
- Web server is for people.
- Web Service is for programs
- The end of screen scraping
- No faking a URL: pass real parameters.
- No parsing the answer: data formatted into your address space.
- Hundreds of users, but a specific example:
- US Department of Agriculture
57 TerraServer Web Services
Terra-Tile-Service
Landmark-Service
- Get image meta-data
- Query TS Gazetteer
- Retrieve TS ImageTiles
- Projection conversions
- Web Map Client
- OpenGIS like
- Landmarks layered on TerraServer imagery
- Geo-coded data of well-known objects (points), e.g. Schools, Golf Courses, Hospitals, etc.
- Polygons of well-known objects (shapes), e.g. Zip Codes, Cities, etc.
- Fat Map Client
- Visual Basic / C# Windows Form
- Access Web Services for all data
Sample Apps
http://terraservice.net
58 Web Services
- Web SERVER:
- Given a url + parameters
- Returns a web page (often dynamic)
- Web SERVICE:
- Given an XML document (soap msg)
- Returns an XML document
- Tools make this look like an RPC.
- F(x,y,z) returns (u, v, w)
- Distributed objects for the web.
- naming, discovery, security, ...
- Internet-scale distributed computing
[Diagram: Your program -> http -> Web Server -> Web page; Your program -> soap -> Web Service -> object in xml -> Data in your address space]
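Note on "an XML document in, an XML document out": the sketch below builds a SOAP 1.1 style envelope for a call F(x, y, z) and unpacks such a document back into a dictionary, using only Python's standard library. The application namespace and the element names are hypothetical, and nothing here talks to TerraService or any real endpoint. In practice, tools generate this plumbing from the service description so that the call looks like an ordinary function.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"   # SOAP 1.1 envelope namespace
APP_NS = "http://example.org/demo"                       # hypothetical service namespace


def build_request(method: str, params: dict) -> str:
    """Wrap a method call and its parameters in a SOAP-style envelope."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{APP_NS}}}{method}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{APP_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


def parse_body(xml_text: str) -> dict:
    """Return the children of the first element in the Body as a name -> text mapping."""
    root = ET.fromstring(xml_text)
    body = root.find(f"{{{SOAP_NS}}}Body")
    payload = body[0]
    return {child.tag.split("}")[-1]: child.text for child in payload}


request = build_request("F", {"x": 1, "y": 2, "z": 3})
print(request)
print(parse_body(request))
```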
59 TerraServer Hardware
- Storage Bricks
- White-box commodity servers
- 4 TB raw / 2 TB RAID1 SATA storage
- Dual Hyper-threaded Xeon 2.4 GHz, 4GB RAM
- Partitioned Databases (PACS: partitioned array)
- 3 Storage Bricks = 1 TerraServer data
- Data partitioned across 20 databases
- More data partitions coming
- Low Cost Availability
- 4 copies of the data
- RAID1 SATA Mirroring
- 2 redundant Bunches
- Spare brick to repair failed brick: 2N+1 design
- Web Application bunch aware
- Load balances between redundant databases
- Fails over to surviving database on failure
- ~$100K capital expense.
60 Research Objectives
Technology Goals
- Test/show scalability
- Test/show availability
- Test/show lights out
- all operations & maintenance occurs remotely
- Minimal ops and dev staff
- web service poster child
User/App Goals
- Public: Access to remote sensing data with no GIS expertise required
- Ubiquitous: No special hw/sw required by client
- Delivery: All OnLine/Internet Based, no tape or CD distribution
- Simple: Designed to be used by a 6th grade geography student
61 Virtual Observatory http://www.astro.caltech.edu/nvoconf/ http://www.voforum.org/
- Premise: Most data is (or could be) online
- So, the Internet is the world's best telescope:
- It has data on every part of the sky
- In every measured spectral band: optical, x-ray, radio, ...
- As deep as the best instruments (2 years ago).
- It is up when you are up. The "seeing" is always great (no working at night, no clouds, no moons, no ...).
- It's a smart telescope: links objects and data to literature on them.
62 Why Astronomy Data?
- It has no commercial value
- No privacy concerns
- Can freely share results with others
- Great for experimenting with algorithms
- It is real and well documented
- High-dimensional data (with confidence intervals)
- Spatial data
- Temporal data
- Many different instruments from many different places and many different times
- Federation is a goal
- The questions are interesting
- How did the universe form?
- There is a lot of it (petabytes)
63 Time and Spectral Dimensions: The Multiwavelength Crab Nebula
Crab star 1054 AD
X-ray, optical, infrared, and radio views of the nearby Crab Nebula, which is now in a state of chaotic expansion after a supernova explosion first sighted in 1054 A.D. by Chinese astronomers.
Slide courtesy of Robert Brunner @ CalTech.
64 SkyServer.SDSS.org
- A modern archive
- Raw Pixel data lives in file servers
- Catalog data (derived objects) lives in Database
- Online query to any and all
- Also used for education
- 150 hours of online Astronomy
- Implicitly teaches data analysis
- Interesting things
- Spatial data search
- Client query interface via Java Applet
- Query interface via Emacs
- Popular: 1% of TerraServer?
- Cloned by other surveys (a template design)
- Web services are core of it.
65 Demo of SkyServer
- Shows standard web server
- Pixel/image data
- Point and click
- Explore one object
- Explore sets of objects (data mining)
66 Data Federations of Web Services
- Massive datasets live near their owners
- Near the instrument's software pipeline
- Near the applications
- Near data knowledge and curation
- Super Computer centers become Super Data Centers
- Each Archive publishes a web service
- Schema documents the data
- Methods on objects (queries)
- Scientists get personalized extracts
- Uniform access to multiple Archives
- A common global schema
Federation
67 Federation: SkyQuery.Net
- Combine 4 archives initially
- Just added 10 more
- Send query to portal, portal joins data from archives.
- Problem: want to do multi-step data analysis (not just single query).
- Solution: Allow personal databases on portal
- Problem: some queries are monsters
- Solution: batch schedule on portal server, deposits answer in personal database.
68 SkyQuery Structure
- Each SkyNode publishes
- Schema Web Service
- Database Web Service
- Portal
- Plans Query (2 phase)
- Integrates answers
- Is itself a web service
69 SkyQuery http://skyquery.net/
- Distributed Query tool using a set of web services
- Four astronomy archives from Pasadena, Chicago, Baltimore, Cambridge (England).
- Feasibility study, built in 6 weeks
- Tanu Malik (JHU CS grad student)
- Tamas Budavari (JHU astro postdoc)
- With help from Szalay, Thakar, Gray
- Implemented in C# and .NET
- Allows queries like:
SELECT o.objId, o.r, o.type, t.objId
FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
WHERE XMATCH(o,t) < 3.5 AND AREA(181.3,-0.76,6.5)
AND o.type = 3 AND (o.I - t.m_j) > 2
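Note on cross-matching: the slide does not define XMATCH or AREA, so the sketch below does not reproduce SkyQuery's algorithm. It only illustrates the general idea of a positional cross-match: pair up objects from two catalogs whose sky positions lie within a small angular distance of each other. The catalog layout (id, ra, dec in degrees), the plain distance threshold, and the brute-force loop are all my assumptions, and a real archive would use a spatial index instead of comparing every pair.

```python
import math


def angular_separation_arcsec(ra1: float, dec1: float, ra2: float, dec2: float) -> float:
    """Great-circle separation (haversine formula); inputs in degrees, result in arcseconds."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (
        math.sin((dec2 - dec1) / 2) ** 2
        + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2
    )
    return math.degrees(2 * math.asin(math.sqrt(h))) * 3600


def crossmatch(cat_a, cat_b, max_arcsec: float):
    """Brute-force match: every pair of objects closer than max_arcsec."""
    pairs = []
    for id_a, ra_a, dec_a in cat_a:
        for id_b, ra_b, dec_b in cat_b:
            if angular_separation_arcsec(ra_a, dec_a, ra_b, dec_b) < max_arcsec:
                pairs.append((id_a, id_b))
    return pairs


# Made-up positions near the region used in the slide's example query.
catalog_a = [("a1", 181.3, -0.76)]
catalog_b = [("b1", 181.3 + 1 / 3600, -0.76), ("b2", 181.4, -0.76)]

print(crossmatch(catalog_a, catalog_b, max_arcsec=3.5))
```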
70 SkyNode Basic Web Services
- Metadata information about resources
- Waveband
- Sky coverage
- Translation of names to universal dictionary (UCD)
- Simple search patterns on the resources
- Cone Search
- Image mosaic
- Unit conversions
- Simple filtering, counting, histogramming
- On-the-fly recalibrations
71 Portals: Higher Level Services
- Built on Atomic Services
- Perform more complex tasks
- Examples
- Automated resource discovery
- Cross-identifications
- Photometric redshifts
- Outlier detections
- Visualization facilities
- Goal
- Build custom portals in days from existing
building blocks (like today in IRAF or IDL)
72 MyDB added to SkyQuery
- Moves analysis to the data
- Users can cooperate (share MyDB)
- Still exploring this
- Let users add personal DB: 1GB for now.
- Use it as a workbook.
- Online and batch queries.
MyDB
73 The Big Picture
[Diagram labels: Experiments & Instruments, Other Archives, Literature, Simulations, facts, questions, answers]
The Big Problems
- Data ingest
- Managing a petabyte
- Common schema
- How to organize it?
- How to reorganize it
- How to coexist with others
- Query and Vis tools
- Support/training
- Performance
- Execute queries in a minute
- Batch query scheduling