Title: How much information
1How much information?
- Adapted from a presentation by
- Jim Gray, Microsoft Research
- http://research.microsoft.com/gray
- Alex Szalay
- Johns Hopkins University
- http://tarkus.pha.jhu.edu/szalay/
2How much information is there in the world
- Infometrics - the measurement of information
- What can we store
- What do we intend to store.
- What is stored.
- Why are we interested.
3Infinite Storage?
- The Terror Bytes are Here
- 1 TB costs
- 1 TB costs $300k/y to own
- Management & curation are expensive
- Searching without indexing 1TB
- takes minutes or hours
- Petrified by Peta Bytes?
- But people can afford them, so they will be used.
- Solution: Automate processes
Yotta Zetta Exa Peta Tera Giga Mega Kilo
4Digital Information Created, Captured,
Replicated Worldwide
Exabytes
10-fold Growth in 5 Years!
DVD, RFID, Digital TV, MP3 players, Digital cameras, Camera phones, VoIP, Medical imaging, Laptops, Data center applications, Games, Satellite images, GPS, ATMs, Scanners, Sensors, Digital radio, DLP theaters, Telematics, Peer-to-peer, Email, Instant messaging, Videoconferencing, CAD/CAM, Toys, Industrial machines, Security systems, Appliances
Source: IDC, 2008
5Scale of things to come
- Information
- In 2002, recorded media and electronic information flows generated about 22 exabytes (10^18 bytes) of information
- In 2006, the amount of digital information created, captured, and replicated was 161 EB
- In 2010, the amount of information added annually to the digital universe will be about 988 EB (almost 1 ZB)
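The figures on this slide and the "10-fold growth in 5 years" claim on the previous one can be cross-checked with a little arithmetic. The following is a minimal sketch, assuming simple compound growth between the quoted years; the function name `annual_growth` is illustrative and not from the presentation.

```python
def annual_growth(start_eb: float, end_eb: float, years: int) -> float:
    """Compound annual growth rate between two volumes, as a fraction."""
    return (end_eb / start_eb) ** (1 / years) - 1


# Figures quoted on the slides, in exabytes.
rate_2006_2010 = annual_growth(161, 988, 4)

# "10-fold growth in 5 years", expressed as a yearly rate.
rate_tenfold = annual_growth(1, 10, 5)

print(f"2006 to 2010: {rate_2006_2010:.0%} per year")
print(f"10-fold in 5 years: {rate_tenfold:.0%} per year")
```

Both rates come out a little under 60% per year, so the two IDC statements are consistent with each other under this assumption.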
6Digital Universe Environmental Footprint
- In our physical universe, 98.5% of the known mass is invisible, composed of interstellar dust or what scientists call dark matter. In the digital universe, we have our own form of dark matter: the tiny signals from sensors and RFID tags and the voice packets that make up less than 6% of the digital universe by gigabyte, but account for more than 99% of the units, information containers, or files in it.
- Tenfold growth of the digital universe in five years will have a measurable impact on the environment, in terms of both power consumed and electronic waste.
7How much information is there?
Yotta Zetta Exa Peta Tera Giga Mega Kilo
- Soon most everything will be recorded and indexed
- Most bytes will never be seen by humans.
- Data summarization, trend detection, and anomaly detection are key technologies
- See Mike Lesk, How much information is there: http://www.lesk.com/mlesk/ksg97/ksg.html
- See Lyman & Varian, How much information: http://www.sims.berkeley.edu/research/projects/how-much-info/
Everything! Recorded
All Books MultiMedia
All books (words)
.Movie
A Photo
A Book
10^-24 yocto, 10^-21 zepto, 10^-18 atto, 10^-15 femto, 10^-12 pico, 10^-9 nano, 10^-6 micro, 10^-3 milli
9Digital Immortality
Bell, Gray, CACM, '01
Requirements for storing various media for a single person's lifetime at modest fidelity
10What is Digital Immortality?
- Preservation and interaction of digitized experiences for individuals and/or groups
- Preservation and access
- Active interaction with archives through queries and/or an avatar (agents)
- Avatar interactions for group experiences
- Issues
- Archiving
- Indexing
- Veracity
- Access
11Information Census: Lesk; Lyman & Varian
- 10 Exabytes
- 90% digital
- 55% personal
- Print: .003% of bytes, 5 TB/y, but text has lowest entropy
- Email is (10 Bmpd) 4 PB/y and is 20% text (estimate by Gray)
- WWW is 50 TB; deep web 50 PB
- Growth: 50%/y
12New Information Flows
- Telephone increase is significant
13Internet
14First Disk 1956
- IBM 305 RAMAC
- 4 MB
- 50 x 24" disks
- 1200 rpm
- 100 ms access
- $35k/y rent
- Included computer & accounting software (tubes not transistors)
15 10 years later
30 MB
1.6 meters
16Now - Terabytes on your desk
Terabyte external drive for $200 - 20 cents a gigabyte. In 5 years, 1 cent/gigabyte, $10 for a terabyte?
17The Cost of Storage: about $1K/TB
18Storage capacity beating Moore's law
- Improvements: Capacity 60%/y, Bandwidth 40%/y, Access time 16%/y
- $1000/TB today
- $100/TB in 2007
[Chart: Moore's law 58.70%/year; TB growth 112.30%/year since 1993; price decline 50.70%/year since 1993]
Most (80%) data is personal (not enterprise). This will likely remain true.
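Annual percentage rates like the ones on this chart are easier to compare as doubling (or halving) times. The sketch below assumes the chart's numbers are compound annual percentages, which the transcript does not state explicitly; the helper names are illustrative.

```python
import math


def doubling_time_years(annual_growth_percent: float) -> float:
    """Years needed to double at a constant compound annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth_percent / 100)


def halving_time_years(annual_decline_percent: float) -> float:
    """Years needed to halve at a constant compound annual decline rate."""
    return math.log(2) / -math.log(1 - annual_decline_percent / 100)


moore = doubling_time_years(58.70)       # Moore's law rate from the chart
tb_growth = doubling_time_years(112.30)  # TB growth since 1993
price = halving_time_years(50.70)        # price decline since 1993

print(f"Moore's law doubles every {moore:.2f} years")
print(f"Shipped terabytes double every {tb_growth:.2f} years")
print(f"Price halves every {price:.2f} years")
```

Under that reading, 58.70%/year corresponds to doubling roughly every 18 months, while the storage curves double (or halve) in a bit under a year, which is the sense in which storage is "beating" Moore's law.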
19Disk Evolution
Kilo Mega Giga Tera Peta Exa Zetta Yotta
- Capacity: 100x in 10 years; 1 TB 3.5" drive in 2006; 20 GB as 1" micro-drive
- System on a chip
- High-speed LAN
- Disk replacing tape
- Disk is super computer!
20Disk Storage Cheaper Than Paper
- File Cabinet: cabinet (4 drawer) $250; paper (24,000 sheets) $250; space (2x3 @ $10/ft2) $180; total $700; $0.03/sheet, 3 pennies per page
- Disk: disk (250 GB) $250
- ASCII: 100 m pages, $2e-6/sheet (10,000x cheaper), a micro-dollar per page
- Image: 1 m photos, $3e-4/photo (100x cheaper), a milli-dollar per photo
- Store everything on disk
- Note: Disk is 100x to 1000x cheaper than RAM
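The per-page comparison can be reproduced directly from the slide's inputs. This is a rough sketch, assuming the amounts are US dollars (the currency symbols did not survive in the transcript) and that "m" means million; variable names are illustrative.

```python
# Paper: 4-drawer cabinet + paper + floor space, as itemized on the slide.
cabinet_total = 250 + 250 + 180          # the slide rounds this to 700
sheets = 24_000
paper_per_sheet = cabinet_total / sheets

# Disk: one 250 GB drive holding either ASCII pages or photos.
disk_cost = 250
ascii_pages = 100e6                      # "100 m pages"
photos = 1e6                             # "1 m photos"
disk_per_page = disk_cost / ascii_pages
disk_per_photo = disk_cost / photos

ratio_page = paper_per_sheet / disk_per_page
ratio_photo = paper_per_sheet / disk_per_photo

print(f"paper: {paper_per_sheet:.3f} per sheet")
print(f"disk:  {disk_per_page:.1e} per ASCII page ({ratio_page:,.0f}x cheaper)")
print(f"disk:  {disk_per_photo:.1e} per photo ({ratio_photo:,.0f}x cheaper)")
```

The ratios land close to the slide's rounded "10,000x" and "100x" figures.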
21Why Put Everything in Cyberspace?
- Low rent: min $/byte
- Shrinks time: now or later
- Shrinks space: here or there
- Automate processing: knowbots
Point-to-Point OR Broadcast
Immediate OR Time Delayed
Locate, Process, Analyze, Summarize
22Memex: As We May Think, Vannevar Bush, 1945
- "A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility"
- "yet if the user inserted 5000 pages of material a day it would take him hundreds of years to fill the repository, so that he can be profligate and enter material freely"
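Bush's "hundreds of years" can be compared with a modern disk using numbers from elsewhere in this deck. The sketch below makes two assumptions that are not Bush's: the repository is a 1 TB disk, and a page is plain ASCII at the size implied by "250 GB = 100 m pages" on the disk-versus-paper slide.

```python
PAGES_PER_DAY = 5_000               # Bush's figure
BYTES_PER_PAGE = 250e9 / 100e6      # 2,500 bytes, implied by the disk-vs-paper slide
TERABYTE = 1e12                     # assumed repository size

pages_per_terabyte = TERABYTE / BYTES_PER_PAGE
years_to_fill = pages_per_terabyte / PAGES_PER_DAY / 365

print(f"{pages_per_terabyte:,.0f} pages fit in 1 TB")
print(f"about {years_to_fill:.0f} years to fill at {PAGES_PER_DAY} pages/day")
```

With these assumptions a single terabyte already matches Bush's "hundreds of years" for text, which is why the following slides turn to richer media when trying to fill one.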
23Trying to fill a terabyte in a year
24Projected Portable Computer for 2006
- 100 Gips processor
- 1 GB RAM
- 1 TB disk
- 1 Gbps network
- Some of your software finding things is a
data mining challenge
25The Personal Terabyte(s) (All Your Stuff Online): So you've got it, now what do you do with it?
- TREASURED (what's the one thing you would save in a fire?)
- Can you find anything?
- Can you organize that many objects?
- Once you find it, will you know what it is?
- Once you've found it, could you find it again?
- Information Science Goal: Have GOOD answers for all these Questions
26How Will We Find Anything?
- Need Queries, Indexing, Pivoting, Scalability, Backup, Replication, Online update, Set-oriented access
- If you don't use a DBMS, you will implement one!
- Simple logical structure
- Blob and link is all that is inherent
- Additional properties (facets = extra tables) and methods on those tables (encapsulation)
- More than a file system
- Unifies data and meta-data
SQL DBMS
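The "blob and link, plus facets as extra tables" structure can be sketched in a few lines of SQL. This is only an illustration using Python's built-in `sqlite3` module; the table and column names (`item`, `link`, `photo_facet`) and the sample values are hypothetical and not taken from the presentation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The inherent structure: opaque blobs and links between them.
    CREATE TABLE item (id INTEGER PRIMARY KEY, content BLOB NOT NULL);
    CREATE TABLE link (src INTEGER REFERENCES item(id),
                       dst INTEGER REFERENCES item(id));
    -- A facet: an extra table adding typed properties to some items.
    CREATE TABLE photo_facet (item_id INTEGER PRIMARY KEY REFERENCES item(id),
                              taken TEXT, place TEXT);
    CREATE INDEX photo_by_place ON photo_facet(place);
""")

photo_id = conn.execute("INSERT INTO item (content) VALUES (?)",
                        (b"fake image bytes",)).lastrowid
note_id = conn.execute("INSERT INTO item (content) VALUES (?)",
                       (b"Trip notes",)).lastrowid
conn.execute("INSERT INTO link VALUES (?, ?)", (note_id, photo_id))
conn.execute("INSERT INTO photo_facet VALUES (?, ?, ?)",
             (photo_id, "2003-07-04", "Baltimore"))

# Set-oriented query: which items link to photos taken in a given place?
rows = conn.execute("""
    SELECT link.src, photo_facet.taken
    FROM photo_facet JOIN link ON link.dst = photo_facet.item_id
    WHERE photo_facet.place = ?
""", ("Baltimore",)).fetchall()
print(rows)
```

Data and metadata live in the same store, and the index and join come from the DBMS rather than from hand-written file-system code.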
27How Do We Represent It To The Outside World? Schematized Storage
[Figure: XML DataSet listing. An XML Schema section (namespace urn:schemas-microsoft-com:xml-msdata) declares a DataSet named "radec" with a "Table" element whose fields, including "dec", are xs:double with minOccurs="0". A diffgram section (namespace urn:schemas-microsoft-com:xml-diffgram-v1) follows with row data such as 184.028935351008, -1.12590950121524 and 184.025719033547, -1.2179582792018.]
- File metaphor too primitive: just a blob
- Table metaphor too primitive: just records
- Need Metadata describing data context
- Format
- Provenance (author/publisher/citations/)
- Rights
- History
- Related documents
- In a standard format
- XML and XML schema
- DataSet is great example of this
- World is now defining standard schemas
schema
Data or diffgram
28 80% of data is personal / individual. But what about the other 20%?
- Business
- Wal-Mart online: 1 PB and growing.
- Paradox: most transaction systems
- Have to go to image/data monitoring for big data
- Government
- Government is the biggest business.
- Science
- LOTS of data.
29Q: Where will the Data Come From? A: Sensor Applications
- Earth Observation
- 15 PB by 2007
- Medical Images & Information, Health Monitoring
- Potential 1 GB/patient/y -> 1 EB/y
- Video Monitoring
- 1E8 video cameras @ 1E5 MBps -> 10TB/s -> 100 EB/y -> filtered???
- Airplane Engines
- 1 GB sensor data/flight,
- 100,000 engine hours/day
- 30PB/y
- Smart Dust: ?? EB/y
http://robotics.eecs.berkeley.edu/pister/SmartDust/
http://www-bsac.eecs.berkeley.edu/shollar/macro_motes/macromotes.html
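These back-of-envelope estimates can be unpacked to see what each one implies. The sketch below works backwards from the totals stated on the slide; it assumes continuous operation all year and decimal units (1 TB = 10^12 bytes), neither of which the slide states.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
TB, PB, EB = 1e12, 1e15, 1e18

# Video monitoring: the slide's aggregate of 10 TB/s over 1E8 cameras.
video_total_bytes_per_s = 10 * TB
cameras = 1e8
per_camera_bytes_per_s = video_total_bytes_per_s / cameras
video_eb_per_year = video_total_bytes_per_s * SECONDS_PER_YEAR / EB

# Medical: 1 GB/patient/y adding up to 1 EB/y implies this many patients.
patients_implied = 1 * EB / 1e9

# Airplane engines: 30 PB/y spread over 100,000 engine hours per day.
engine_hours_per_year = 100_000 * 365
gb_per_engine_hour = 30 * PB / engine_hours_per_year / 1e9

print(f"per camera: {per_camera_bytes_per_s:.0e} bytes/s")
print(f"video, unfiltered: {video_eb_per_year:.0f} EB/y")
print(f"patients implied: {patients_implied:.0e}")
print(f"engine data: {gb_per_engine_hour:.2f} GB per engine hour")
```

The 10 TB/s aggregate corresponds to about 1E5 bytes per second per camera, so the per-camera unit printed on the slide may have been garbled in transcription. Unfiltered, the video stream is a few hundred EB/y, the same order of magnitude as the slide's 100 EB/y.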
30Instruments: CERN LHC, Peta Bytes per Year
- Looking for the Higgs Particle
- Sensors: 1000 GB/s (1 TB/s, 30 EB/y)
- Events: 75 GB/s
- Filtered: 5 GB/s
- Reduced: 0.1 GB/s, 2 PB/y
- Data pyramid: 100GB : 1TB : 100TB : 1PB : 10PB
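The filtering cascade is easier to appreciate as reduction factors and yearly volumes. A minimal sketch, assuming decimal gigabytes and a detector running every second of the year (an assumption the slide does not make):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

# Stage name and data rate in GB/s, as listed on the slide.
stages = [("Sensors", 1000.0), ("Events", 75.0),
          ("Filtered", 5.0), ("Reduced", 0.1)]

# Yearly volume in petabytes if each stage ran continuously.
annual_pb = {name: rate * 1e9 * SECONDS_PER_YEAR / 1e15 for name, rate in stages}

# Reduction factor contributed by each step of the cascade.
step_reduction = {
    f"{prev_name} -> {name}": prev_rate / rate
    for (prev_name, prev_rate), (name, rate) in zip(stages, stages[1:])
}
overall_reduction = stages[0][1] / stages[-1][1]

for name, pb in annual_pb.items():
    print(f"{name:9s} {pb:12,.1f} PB/y")
for step, factor in step_reduction.items():
    print(f"{step}: {factor:.1f}x")
print(f"overall: {overall_reduction:,.0f}x")
```

Continuous running gives about 31.5 EB/y at the sensors, in line with the slide's 30 EB/y, and about 3.2 PB/y after reduction. The slide's 2 PB/y would then correspond to the instrument taking data roughly two thirds of the time, though that interpretation is a guess.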
31Science Data Volume: ESO/STECF Science Archive
- 100 TB archive
- Similar at Hubble, Keck, SDSS,
- 1PB aggregate
32Premise: DataGrid Computing
- Store exabytes twice (for redundancy)
- Access them from anywhere
- Implies huge archive/data centers
- Supercomputer centers become super data centers
- Examples: Google, Yahoo!, Hotmail, BaBar, CERN, Fermilab, SDSC,
33Thesis
- Most new information is digital (and old information is being digitized)
- An Information Science Grand Challenge
- Capture
- Organize
- Summarize
- Visualize
- this information
- Optimize Human Attention as a resource
- Improve information quality
34Access!
35The Evolution of Science
- Observational Science
- Scientist gathers data by direct observation
- Scientist analyzes data
- Analytical Science
- Scientist builds analytical model
- Makes predictions.
- Computational Science
- Simulate analytical model
- Validates model and makes predictions
- Data Exploration Science
- Data captured by instruments, or data generated by simulator
- Processed by software
- Placed in a database / files
- Scientist analyzes database / files
36Computational Science Evolves
- Historically, Computational Science = simulation.
- New emphasis on informatics
- Capturing,
- Organizing,
- Summarizing,
- Analyzing,
- Visualizing
- Largely driven by observational science, but also needed by simulations.
- Too soon to say if comp-X and X-info will unify or compete.
BaBar, Stanford
PE Gene Sequencer, from http://www.genome.uci.edu/
Space Telescope
37Next-Generation Data Analysis
- Looking for
- Needles in haystacks: the Higgs particle
- Haystacks: Dark matter, Dark energy
- Needles are easier than haystacks
- Global statistics have poor scaling
- Correlation functions are N^2, likelihood techniques N^3
- As data and computers grow at same rate, we can only keep up with N log N
- A way out?
- Discard notion of optimal (data is fuzzy, answers are approximate)
- Don't assume infinite computational resources or memory
- Requires combination of statistics & computer science
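The gap between N^2 and N log N can be illustrated with the simplest ingredient of a correlation function: counting pairs of points closer than some separation r. The sketch below is a one-dimensional toy, not the method used by any survey mentioned here; real two- and three-dimensional data would need spatial trees or grids. The function names are illustrative.

```python
from bisect import bisect_right


def count_pairs_brute(xs, r):
    """O(N^2): compare every pair of points."""
    n = len(xs)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if abs(xs[i] - xs[j]) <= r)


def count_pairs_sorted(xs, r):
    """O(N log N): sort once, then binary-search how far each point reaches."""
    s = sorted(xs)
    total = 0
    for i, x in enumerate(s):
        # Index of the first point beyond x + r, searching only to the right of i.
        j = bisect_right(s, x + r, i + 1)
        total += j - (i + 1)
    return total


positions = [5, 1, 9, 3, 4, 10]
print(count_pairs_brute(positions, 2), count_pairs_sorted(positions, 2))
```

Both functions return the same count, but only the second keeps pace when the data volume and the computer speed grow at the same rate.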
38Smart Data (active databases)
- If there is too much data to move around,
- take the analysis to the data!
- Do all data manipulations at database
- Build custom procedures and functions in the database
- Automatic parallelism guaranteed
- Easy to build-in custom functionality
- Databases & Procedures being unified
- Example: temporal and spatial indexing
- Pixel processing
- Easy to reorganize the data
- Multiple views, each optimal for certain types of analyses
- Building hierarchical summaries is trivial
- Scalable to Petabyte datasets
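"Take the analysis to the data" can be shown in miniature by registering a custom function inside the database and letting the query return only a summary. This sketch uses Python's built-in `sqlite3` module as a stand-in; the table, columns, and the `mag_bin` function are hypothetical, and SQLite does not provide the automatic parallelism the slide attributes to server-class systems.

```python
import sqlite3


def magnitude_bin(mag: float) -> int:
    """Custom function run inside the database: bin a magnitude to an integer."""
    return int(mag)


conn = sqlite3.connect(":memory:")
conn.create_function("mag_bin", 1, magnitude_bin)

conn.execute("CREATE TABLE objects (ra_deg REAL, dec_deg REAL, mag REAL)")
conn.executemany("INSERT INTO objects VALUES (?, ?, ?)", [
    (184.03, -1.13, 17.2),
    (184.02, -1.22, 17.9),
    (185.10, 0.40, 19.5),
])

# Only the small summary leaves the database, not the individual rows.
summary = conn.execute(
    "SELECT mag_bin(mag) AS bin, COUNT(*) FROM objects GROUP BY bin ORDER BY bin"
).fetchall()
print(summary)
```

The same pattern extends to hierarchical summaries: each level is another grouped query over the one below it.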
39Data Mining in the Image Domain: Can We Discover New Types of Phenomena Using Automated Pattern Recognition? (Every object detection algorithm has its biases and limitations)
- Effective parametrization of source morphologies and environments
- Multiscale analysis (also in the time/lightcurve domain)
40Challenge: Make Data Publication & Access Easy
- Augment FTP with data query: Return intelligent data subsets
- Make it easy to
- Publish: Record structured data
- Find
- Find data anywhere in the network
- Get the subset you need
- Explore datasets interactively
- Realistic goal
- Make it as easy as publishing/reading web sites today.
41Data Federations of Web Services
- Massive datasets live near their owners
- Near the instrument's software pipeline
- Near the applications
- Near data knowledge and curation
- Super Computer centers become Super Data Centers
- Each Archive publishes a web service
- Schema documents the data
- Methods on objects (queries)
- Scientists get personalized extracts
- Uniform access to multiple Archives
- A common global schema
- Challenge
- What is the object model for your science?
Federation
42Web Services: The Key?
- Web SERVER
- Given a URL + parameters
- Returns a web page (often dynamic)
- Web SERVICE
- Given an XML document (SOAP msg)
- Returns an XML document
- Tools make this look like an RPC.
- F(x,y,z) returns (u, v, w)
- Distributed objects for the web.
- naming, discovery, security,..
- Internet-scale distributed computing
[Diagram: Your program -> http -> Web Server, which returns a Web page. Your program -> soap -> Web Service, which returns an object in xml that becomes data in your address space.]
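The "XML document in, XML document out, made to look like an RPC" idea can be sketched without any network at all. The code below is a toy: the message layout is invented for illustration and is not a real SOAP envelope, the `service` function stands in for a remote endpoint, and what `F` computes is arbitrary.

```python
import xml.etree.ElementTree as ET


def encode_call(method: str, **args: float) -> bytes:
    """Client side: turn a method name and arguments into an XML request."""
    root = ET.Element("call", {"method": method})
    for name, value in args.items():
        ET.SubElement(root, "arg", {"name": name}).text = repr(value)
    return ET.tostring(root)


def service(request: bytes) -> bytes:
    """Stand-in for the remote web service: XML document in, XML document out."""
    call = ET.fromstring(request)
    if call.get("method") != "F":
        raise ValueError("unknown method")
    a = {arg.get("name"): float(arg.text) for arg in call.findall("arg")}
    outputs = {"u": a["x"] + a["y"], "v": a["y"] * a["z"], "w": a["z"] - a["x"]}
    root = ET.Element("result")
    for name, value in outputs.items():
        ET.SubElement(root, "out", {"name": name}).text = repr(value)
    return ET.tostring(root)


def F(x: float, y: float, z: float) -> tuple:
    """Client stub: to the caller this looks like an ordinary function."""
    reply = ET.fromstring(service(encode_call("F", x=x, y=y, z=z)))
    out = {o.get("name"): float(o.text) for o in reply.findall("out")}
    return out["u"], out["v"], out["w"]


u, v, w = F(1.0, 2.0, 3.0)
print(u, v, w)
```

In a real deployment the call to `service` would be an HTTP request, and the stub would typically be generated by tooling from the service's published description rather than written by hand.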
43Web Services Architecture
44Information Science and Data Generation Trends
- What do large amounts of information provide?
- New opportunities for search!
- New discoveries
- Business opportunities?
- Research opportunities?
- Problems?