How much information - PowerPoint PPT Presentation

About This Presentation

Title:

How much information

Slides: 40

Provided by: jimg180

Transcript and Presenter's Notes

Title: How much information

1
How much information?

Adapted from a presentation by
Jim Gray, Microsoft Research
http://research.microsoft.com/gray
Alex Szalay
Johns Hopkins University
http://tarkus.pha.jhu.edu/szalay/

2
How much information is there in the world?

Infometrics - the measurement of information
What can we store?
What do we intend to store?
What is stored?
Why are we interested?

3
Infinite Storage?

The Terror Bytes are Here
1 TB costs
1 TB costs $300k/y to own
Management & curation are expensive
Searching without indexing 1 TB
takes minutes or hours
Petrified by Peta Bytes?
But people can afford them, so they will
be used.
Solution: Automate processes

Yotta Zetta Exa Peta Tera Giga Mega Kilo
4
Digital Information Created, Captured, Replicated Worldwide
[Chart, axis in exabytes: 10-fold growth in 5 years]
Contributors: DVD, RFID, digital TV, MP3 players, digital cameras, camera phones, VoIP, medical imaging, laptops, data center applications, games, satellite images, GPS, ATMs, scanners, sensors, digital radio, DLP theaters, telematics, peer-to-peer, email, instant messaging, videoconferencing, CAD/CAM, toys, industrial machines, security systems, appliances
Source: IDC, 2008
5
Scale of things to come

Information
In 2002, recorded media and electronic
information flows generated about 22 exabytes
(10^18 bytes) of information
In 2006, the amount of digital information
created, captured, and replicated was 161 EB
In 2010, the amount of information added annually
to the digital universe will be about 988 EB
(almost 1 ZB)
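
The yearly growth rate implied by these figures can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name `annual_growth_rate` is ours, and it assumes smooth compound growth between the quoted years.

```python
def annual_growth_rate(start, end, years):
    """Compound annual growth rate implied by going from start to end."""
    return (end / start) ** (1.0 / years) - 1.0

# Figures quoted on the slide, in exabytes per year.
rate_2006_2010 = annual_growth_rate(161, 988, 4)
# "10-fold growth in 5 years" expressed as a yearly rate.
rate_tenfold = annual_growth_rate(1, 10, 5)

print(f"2006 -> 2010: {rate_2006_2010:.1%} per year")
print(f"10x in 5 years: {rate_tenfold:.1%} per year")
```

Both rates come out a little under 60% per year, so the 2006 and 2010 figures are consistent with the tenfold-in-five-years claim.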

6
Digital Universe Environmental Footprint

In our physical universe, 98.5% of the known mass
is invisible, composed of interstellar dust or
what scientists call dark matter. In the
digital universe, we have our own form of dark
matter: the tiny signals from sensors and RFID
tags and the voice packets that make up less than
6% of the digital universe by gigabyte, but
account for more than 99% of the units,
information containers, or files in it.
Tenfold growth of the digital universe in five
years will have a measurable impact on the
environment, in terms of both power consumed and
electronic waste.

7
How much information is there?
Yotta Zetta Exa Peta Tera Giga Mega Kilo

Soon most everything will be recorded and
indexed
Most bytes will never be seen by humans.
Data summarization, trend detection, and anomaly
detection are key technologies
See Mike Lesk: How much information is there?
http://www.lesk.com/mlesk/ksg97/ksg.html
See Lyman & Varian:
How much information?
http://www.sims.berkeley.edu/research/projects/how-much-info/

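[Figure: byte scale from Kilo to Yotta, marking A Book, A Photo, A Movie, All books (words), All Books MultiMedia, and Everything Recorded]
Small-scale prefixes: 10^-24 yocto, 10^-21 zepto, 10^-18 atto, 10^-15 femto, 10^-12 pico, 10^-9 nano, 10^-6 micro, 10^-3 milli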
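
As a small aid for reading the Kilo-to-Yotta ladder, the sketch below scales a byte count to the largest fitting prefix. It assumes decimal prefixes (powers of 1000), which matches the 10^18 used for an exabyte earlier in the talk. The names `PREFIXES` and `describe_bytes` are made up for this illustration.

```python
PREFIXES = ["", "Kilo", "Mega", "Giga", "Tera", "Peta", "Exa", "Zetta", "Yotta"]

def describe_bytes(n):
    """Scale n bytes to the largest decimal prefix that keeps the value >= 1."""
    value = float(n)
    index = 0
    while value >= 1000 and index < len(PREFIXES) - 1:
        value /= 1000
        index += 1
    return f"{value:g} {PREFIXES[index]}bytes"

print(describe_bytes(10**12))        # one terabyte
print(describe_bytes(161 * 10**18))  # the 2006 digital universe figure
```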
8
(No Transcript)
9
Digital Immortality
Bell, Gray, CACM, 01
Requirements for storing various media for a
single person's lifetime at modest fidelity
10
What is Digital Immortality?

Preservation and interaction of digitized
experiences for individuals and/or groups
Preservation and access
Active interaction with archives through queries
and/or an avatar (agents)
Avatar interactions for group experiences
Issues
Archiving
Indexing
Veracity
Access

11
Information Census: Lesk, Varian & Lyman

10 Exabytes
90% digital
55% personal
Print: .003% of bytes, 5 TB/y, but text has lowest
entropy
Email is (10 Bmpd) 4 PB/y and is 20% text
(estimate by Gray)
WWW is 50 TB, deep web 50 PB
Growth: 50%/y

12
New Information Flows

Telephone increase is significant

13
Internet
14
First Disk 1956

IBM 305 RAMAC
4 MB
50 x 24-inch disks
1200 rpm
100 ms access
$35k/y rent
Included computer & accounting software (tubes,
not transistors)

15
10 years later
30 MB
1.6 meters
16
Now - Terabytes on your desk
Terabyte external drive for $200 - 20 cents a
gigabyte. In 5 years, 1 cent/gigabyte, $10 for a
terabyte?
17
The Cost of Storage: about $1K/TB
18
Storage capacity beating Moore's law

Improvements: Capacity 60%/y, Bandwidth 40%/y, Access time 16%/y
$1000/TB today
$100/TB in 2007

Moore's law: 58.70%/year
TB growth: 112.30%/year since 1993
Price decline: 50.70%/year since 1993
Most (80%) data is personal (not enterprise). This
will likely remain true.
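
Annual rates are easier to compare as doubling or halving times. The sketch below reads the slide's three figures as percent per year (an assumption, since the units did not survive transcription) and converts them. The function names are ours.

```python
import math

def doubling_time_years(annual_growth):
    """Years to double at a compound annual growth (0.587 means 58.7%)."""
    return math.log(2) / math.log(1 + annual_growth)

def halving_time_years(annual_decline):
    """Years to halve at a compound annual decline (0.507 means 50.7%)."""
    return math.log(0.5) / math.log(1 - annual_decline)

moore = doubling_time_years(0.587)     # Moore's law figure from the slide
capacity = doubling_time_years(1.123)  # TB growth since 1993
price = halving_time_years(0.507)      # price decline since 1993

print(f"Moore's law doubles every {moore * 12:.0f} months")
print(f"Disk capacity doubles every {capacity * 12:.0f} months")
print(f"Price halves every {price * 12:.0f} months")
```

Under that reading, 58.7% per year is the familiar 18-month doubling, while shipped capacity doubles in under a year. That is the sense in which storage is "beating" Moore's law.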
19
Disk Evolution
Kilo Mega Giga Tera Peta Exa Zetta Yotta

Capacity: 100x in 10 years; 1 TB 3.5-inch drive in
2006; 20 GB as 1-inch micro-drive
System on a chip
High-speed LAN
Disk replacing tape
Disk is super computer!

20
Disk Storage Cheaper Than Paper

File Cabinet (4 drawer): $250
Paper (24,000 sheets): $250
Space (2x3 @ $10/ft2): $180
Total: $700
0.03 $/sheet, 3 pennies per page
Disk (250 GB): $250
ASCII: 100 m pages, 2e-6 $/sheet (10,000x cheaper), a micro-dollar per page
Image: 1 m photos, 3e-4 $/photo (100x cheaper), a milli-dollar per photo
Store everything on disk
Note: Disk is 100x to 1000x cheaper than RAM
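
The per-page comparison is simple division, and it is worth redoing to see how rounded the slide's numbers are. The sketch below takes the amounts as US dollars and uses the slide's own totals. The variable names are ours.

```python
# Figures from the slide (US dollars assumed).
cabinet_total = 700            # cabinet + paper + floor space, as totalled on the slide
sheets = 24_000
disk_price = 250               # one 250 GB disk
disk_bytes = 250e9
ascii_pages_on_disk = 100e6    # "100 m pages"
photos_on_disk = 1e6           # "1 m photos"

paper_per_sheet = cabinet_total / sheets
disk_per_page = disk_price / ascii_pages_on_disk
disk_per_photo = disk_price / photos_on_disk

print(f"Paper: {paper_per_sheet:.3f} $/sheet")
print(f"Disk:  {disk_per_page:.1e} $/page, {paper_per_sheet / disk_per_page:,.0f}x cheaper")
print(f"Disk:  {disk_per_photo:.1e} $/photo, {paper_per_sheet / disk_per_photo:,.0f}x cheaper")
print(f"Implied sizes: {disk_bytes / ascii_pages_on_disk:,.0f} bytes/page, "
      f"{disk_bytes / photos_on_disk / 1e3:,.0f} KB/photo")
```

The ratios land near the slide's rounded "10,000x" and "100x", and the implied sizes (about 2.5 KB per text page, 250 KB per photo) show what the page and photo counts assume.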

21
Why Put Everything in Cyberspace?
Low rent: min $/byte
Shrinks time: now or later
Shrinks space: here or there
Automate processing: knowbots
Point-to-Point OR Broadcast
Immediate OR Time Delayed
Locate Process Analyze Summarize
22
Memex: As We May Think, Vannevar Bush, 1945

A memex is a device in which an individual
stores all his books, records, and
communications, and which is mechanized so that
it may be consulted with exceeding speed and
flexibility
yet if the user inserted 5000 pages of material
a day it would take him hundreds of years to fill
the repository, so that he can be profligate and
enter material freely

23
Trying to fill a terabyte in a year
24
Projected Portable Computer for 2006

100 Gips processor
1 GB RAM
1 TB disk
1 Gbps network
Some of your software finding things is a
data mining challenge

25
The Personal Terabyte(s) (All Your Stuff
Online): So you've got it, now what do you do
with it?

TREASURED (what's the one thing you would save
in a fire?)
Can you find anything?
Can you organize that many objects?
Once you find it, will you know what it is?
Once you've found it, could you find it again?
Information Science Goal: Have GOOD answers for
all these Questions

26
How Will We Find Anything?

Need Queries, Indexing, Pivoting, Scalability,
Backup, Replication, Online update, Set-oriented
access
If you don't use a DBMS, you will
implement one!
Simple logical structure
Blob and link is all that is inherent
Additional properties (facets = extra
tables) and methods on those tables
(encapsulation)
More than a file system
Unifies data and meta-data

SQL DBMS
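
The "blob and link, plus facets as extra tables" structure can be sketched in any SQL DBMS. The example below uses SQLite through Python's standard `sqlite3` module purely for convenience. The table names, column names, and sample rows are all hypothetical and do not come from the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE blob_item (id INTEGER PRIMARY KEY, content BLOB NOT NULL);
CREATE TABLE link (src INTEGER REFERENCES blob_item(id),
                   dst INTEGER REFERENCES blob_item(id));
-- A facet is just an extra table keyed by the blob it describes.
CREATE TABLE photo_facet (blob_id INTEGER PRIMARY KEY REFERENCES blob_item(id),
                          taken_on TEXT, place TEXT);
""")
conn.executemany("INSERT INTO blob_item (id, content) VALUES (?, ?)",
                 [(1, b"jpeg bytes"), (2, b"trip notes"), (3, b"more jpeg bytes")])
conn.executemany("INSERT INTO link VALUES (?, ?)", [(2, 1), (2, 3)])
conn.executemany("INSERT INTO photo_facet VALUES (?, ?, ?)",
                 [(1, "2005-07-04", "Baltimore"), (3, "2005-07-05", "Seattle")])

# Set-oriented access: photos linked from item 2, filtered through the facet.
rows = conn.execute("""
    SELECT b.id, f.place
    FROM link AS l
    JOIN blob_item AS b ON b.id = l.dst
    JOIN photo_facet AS f ON f.blob_id = b.id
    WHERE l.src = ? AND f.taken_on >= ?
    ORDER BY b.id
""", (2, "2005-07-05")).fetchall()
print(rows)
```

The blobs themselves stay opaque. Everything that makes them findable lives in the link and facet tables, which is where data and metadata are unified.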
27
How Do We Represent It To The Outside
World? Schematized Storage
[XML example, garbled in transcription: a DataSet serialized as an XML Schema (name "radec", msdata:IsDataSet="true", a "Table" element whose fields include "dec" of type xs:double with minOccurs="0"), followed by a diffgram whose rows hold value pairs such as 184.028935351008, -1.12590950121524 and 184.025719033547, -1.21795827920186]

File metaphor too primitive: just a blob
Table metaphor too primitive: just records
Need Metadata describing data context
Format
Provenance (author/publisher/citations/...)
Rights
History
Related documents
In a standard format
XML and XML schema
DataSet is a great example of this
World is now defining standard schemas

[Diagram labels: schema; data or diffgram]
28
80% of data is personal / individual. But, what
about the other 20%?

Business
Walmart online: 1 PB and growing.
Paradox: most transaction systems
Have to go to image/data monitoring for big data
Government
Government is the biggest business.
Science
LOTS of data.

29
Q: Where will the Data Come From? A: Sensor
Applications

Earth Observation
15 PB by 2007
Medical Images & Information, Health Monitoring
Potential 1 GB/patient/y -> 1 EB/y
Video Monitoring
1E8 video cameras @ 1E5 MBps -> 10 TB/s -> 100
EB/y -> filtered???
Airplane Engines
1 GB sensor data/flight,
100,000 engine hours/day
30 PB/y
Smart Dust: ?? EB/y

http://robotics.eecs.berkeley.edu/pister/SmartDust/
http://www-bsac.eecs.berkeley.edu/shollar/macro_motes/macromotes.html
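
These back-of-envelope volumes are easy to reproduce, and doing so shows which unstated inputs they rely on. In the sketch below every "assumed" value is our inference, not a figure from the slide. In particular, the 10 TB/s video total only follows if each camera produces about 1E5 bytes per second, so the per-camera rate printed in this transcript may be garbled.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

# Video: 1E8 cameras, assumed 1E5 bytes/s each (see the note above).
cameras = 1e8
bytes_per_camera_per_s = 1e5
video_rate = cameras * bytes_per_camera_per_s   # bytes per second
video_year = video_rate * SECONDS_PER_YEAR      # bytes per year, unfiltered

# Medical: 1 GB per patient per year. 1E9 patients is the population
# implied by the slide's 1 EB/y total (an inference).
medical_year = 1e9 * 1e9

# Engines: 100,000 engine hours/day, assuming roughly 1 GB per engine hour.
engine_year = 100_000 * 365 * 1e9

print(f"Video:   {video_rate / 1e12:.0f} TB/s, {video_year / 1e18:.0f} EB/y")
print(f"Medical: {medical_year / 1e18:.0f} EB/y")
print(f"Engines: {engine_year / 1e15:.1f} PB/y")
```

The results agree with the slide to within its order-of-magnitude rounding: continuous video at that rate is a few hundred EB/y, and the engine figure is in the tens of petabytes.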
30
Instruments: CERN LHC, Peta Bytes per Year

Looking for the Higgs Particle
Sensors: 1000 GB/s (1 TB/s, about 30 EB/y)
Events: 75 GB/s
Filtered: 5 GB/s
Reduced: 0.1 GB/s, about 2 PB/y
Data pyramid: 100 GB : 1 TB : 100 TB : 1 PB : 10 PB
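
The filtering stages can be turned into yearly volumes to see how much the pipeline throws away. The sketch below assumes decimal gigabytes and a detector that runs continuously all year. The second assumption is ours and is probably too generous, which may be why the slide quotes about 2 PB/y for the reduced stream rather than the roughly 3 PB/y computed here.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

# Stage rates from the slide, in GB/s (decimal gigabytes assumed).
stages_gb_per_s = {"Sensors": 1000, "Events": 75, "Filtered": 5, "Reduced": 0.1}

# Yearly volume in petabytes if each stage ran continuously all year.
pb_per_year = {name: rate * 1e9 * SECONDS_PER_YEAR / 1e15
               for name, rate in stages_gb_per_s.items()}

overall_reduction = stages_gb_per_s["Sensors"] / stages_gb_per_s["Reduced"]

for name, volume in pb_per_year.items():
    print(f"{name:9s} {stages_gb_per_s[name]:7g} GB/s  {volume:10.1f} PB/y")
print(f"Overall reduction: about {overall_reduction:,.0f}x")
```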

31
Science Data Volume: ESO/STECF Science Archive

100 TB archive
Similar at Hubble, Keck, SDSS,
1PB aggregate

32
Premise: DataGrid Computing

Store exabytes twice (for redundancy)
Access them from anywhere
Implies huge archive/data centers
Supercomputer centers become super data centers
Examples: Google, Yahoo!, Hotmail, BaBar, CERN,
Fermilab, SDSC, ...

33
Thesis

Most new information is digital (and old
information is being digitized)
An Information Science Grand Challenge
Capture
Organize
Summarize
Visualize
this information
Optimize Human Attention as a resource
Improve information quality

34
Access!
35
The Evolution of Science

Observational Science
Scientist gathers data by direct observation
Scientist analyzes data
Analytical Science
Scientist builds analytical model
Makes predictions.
Computational Science
Simulate analytical model
Validate model and make predictions
Data Exploration Science
Data captured by instruments
or data generated by simulator
Processed by software
Placed in a database / files
Scientist analyzes database / files

36
Computational Science Evolves

Historically, Computational Science = simulation.
New emphasis on informatics
Capturing,
Organizing,
Summarizing,
Analyzing,
Visualizing
Largely driven by observational science, but
also needed by simulations.
Too soon to say if comp-X and X-info will unify
or compete.

[Images: BaBar, Stanford; PE Gene Sequencer (from http://www.genome.uci.edu/); Space Telescope]
37
Next-Generation Data Analysis

Looking for
Needles in haystacks: the Higgs particle
Haystacks: Dark matter, Dark energy
Needles are easier than haystacks
Global statistics have poor scaling
Correlation functions are N^2, likelihood
techniques N^3
As data and computers grow at same rate, we can
only keep up with N log N
A way out?
Discard notion of optimal (data is fuzzy, answers
are approximate)
Don't assume infinite computational resources or
memory
Requires combination of statistics & computer
science
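
The scaling argument can be made concrete: if the data volume and the available compute both grow by the same factor, the running time of an algorithm with cost f(N) changes by f(gN) / (g f(N)). The sketch below evaluates that ratio for the three cost classes named on the slide. The starting size and growth factor are hypothetical.

```python
import math

def relative_runtime(cost, n, growth):
    """Runtime after data and compute both grow by `growth`, relative to now."""
    return cost(n * growth) / (growth * cost(n))

costs = {
    "N log N": lambda n: n * math.log(n),
    "N^2": lambda n: n ** 2,
    "N^3": lambda n: n ** 3,
}

n = 1e9        # hypothetical number of objects today
growth = 10.0  # data volume and compute power both grow tenfold

slowdown = {name: relative_runtime(f, n, growth) for name, f in costs.items()}
for name, factor in slowdown.items():
    print(f"{name:8s} takes {factor:6.2f}x as long")
```

With these numbers the N log N method slows by about 11%, the N^2 method by 10x, and the N^3 method by 100x. Only the first keeps up.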

38
Smart Data (active databases)

If there is too much data to move around,
take the analysis to the data!
Do all data manipulations at database
Build custom procedures and functions in the
database
Automatic parallelism guaranteed
Easy to build-in custom functionality
Databases & Procedures being unified
Example: temporal and spatial indexing
Pixel processing
Easy to reorganize the data
Multiple views, each optimal for certain types of
analyses
Building hierarchical summaries is trivial
Scalable to Petabyte datasets
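
"Take the analysis to the data" can be illustrated in miniature with a custom function registered inside the database and a summary computed there, so that only the small result leaves the store. The sketch below uses SQLite via Python's `sqlite3` module. The table, its columns, the sample rows, and the `magnitude` function are all made up for this example, and SQLite itself offers none of the automatic parallelism the slide refers to.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obj (id INTEGER PRIMARY KEY, ra REAL, dec REAL, flux REAL)")
conn.executemany("INSERT INTO obj VALUES (?, ?, ?, ?)", [
    (1, 184.03, -1.13, 10.0),
    (2, 184.02, -1.22, 100.0),
    (3, 10.50, 41.20, 1000.0),
])

def magnitude(flux):
    """Custom function installed in the database: flux to a log-scale magnitude."""
    return -2.5 * math.log10(flux)

conn.create_function("magnitude", 1, magnitude)

# The analysis runs where the data lives; only the small summary comes back.
summary = conn.execute("""
    SELECT CAST(dec / 10 AS INTEGER) AS dec_band,
           COUNT(*) AS n,
           MIN(magnitude(flux)) AS brightest
    FROM obj
    GROUP BY dec_band
    ORDER BY dec_band
""").fetchall()
print(summary)
```

The grouped query is a one-level hierarchical summary. Coarser levels would be further GROUP BY queries over the same table or over this result.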

39
Data Mining in the Image Domain: Can We
Discover New Types of Phenomena Using Automated
Pattern Recognition?
(Every object detection algorithm has its biases and limitations)
Effective parametrization of source
morphologies and environments
Multiscale analysis (also in the
time/lightcurve domain)
40
Challenge: Make Data Publication & Access Easy

Augment FTP with data query: Return
intelligent data subsets
Make it easy to
Publish: Record structured data
Find:
Find data anywhere in the network
Get the subset you need
Explore datasets interactively
Realistic goal
Make it as easy as publishing/reading web sites
today.

41
Data Federations of Web Services

Massive datasets live near their owners
Near the instrument's software pipeline
Near the applications
Near data knowledge and curation
Super Computer centers become Super Data Centers
Each Archive publishes a web service
Schema documents the data
Methods on objects (queries)
Scientists get personalized extracts
Uniform access to multiple Archives
A common global schema
Challenge
What is the object model for your science?

Federation
42
Web Services The Key?

Web SERVER
Given a URL + parameters
Returns a web page (often dynamic)
Web SERVICE
Given an XML document (SOAP msg)
Returns an XML document
Tools make this look like an RPC.
F(x,y,z) returns (u, v, w)
Distributed objects for the web.
naming, discovery, security,..
Internet-scale distributed computing
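
The "XML document in, XML document out, wrapped so it looks like an RPC" pattern can be shown without any network. The sketch below is not SOAP: there is no envelope, no HTTP, and no discovery or security, and the element names and the arithmetic inside `service` are invented for illustration. It only demonstrates how a client stub `F(x, y, z)` can hide the XML exchange and hand back `(u, v, w)`.

```python
import xml.etree.ElementTree as ET

def service(request_xml):
    """The 'web service' side: takes an XML document, returns an XML document."""
    request = ET.fromstring(request_xml)
    x, y, z = (float(request.findtext(tag)) for tag in ("x", "y", "z"))
    response = ET.Element("response")
    for tag, value in (("u", x + y), ("v", y * z), ("w", x - z)):
        ET.SubElement(response, tag).text = repr(value)
    return ET.tostring(response, encoding="unicode")

def F(x, y, z):
    """The client stub: hides the XML so the call looks like an ordinary RPC."""
    request = ET.Element("request")
    for tag, value in (("x", x), ("y", y), ("z", z)):
        ET.SubElement(request, tag).text = repr(float(value))
    reply = ET.fromstring(service(ET.tostring(request, encoding="unicode")))
    return tuple(float(reply.findtext(tag)) for tag in ("u", "v", "w"))

u, v, w = F(1, 2, 3)
print(u, v, w)
```

In a real deployment the call to `service` would be replaced by an HTTP request, and the stub would typically be generated by tooling from the service's published schema.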

[Diagram: your program calls a Web Server over http and gets back a web page; your program calls a Web Service over soap and gets back data in your address space, as an object in xml]
43
Web Services Architecture
44
Information Science and Data Generation Trends