Title: The Information Avalanche: Reducing Information Overload
1. The Information Avalanche: Reducing Information Overload
- Jim Gray
- Microsoft Research
- Onassis Foundation Science Lecture Series
- http://www.forth.gr/onassis/lectures/2002-07-15/index.html
- Heraklion, Crete, Greece, 15-19 July 2002
2. Thesis
- Most new information is digital (and old information is being digitized)
- A Computer Science Grand Challenge:
  - Capture
  - Organize
  - Summarize
  - Visualize
  - ... this information
- Optimize human attention as a resource
- Improve information quality
3. Information Avalanche
- The Situation: a census of the data
  - We can record everything
  - Everything is a LOT!
- The Good News
  - Changes science, education, medicine, entertainment, ...
  - Shrinks time and space
  - Can augment human intelligence
- The Bad News
  - The end of privacy
  - Cyber crime / cyber terrorism
  - Monoculture
- The Technical Challenges
  - Amplify human intellect
  - Organize, summarize, and prioritize information
  - Make programming easy
4. How much information is there?
- Soon everything can be recorded and indexed
- Most bytes will never be seen by humans
- Data summarization, trend detection, and anomaly detection are key technologies
- See Mike Lesk, "How Much Information Is There?", http://www.lesk.com/mlesk/ksg97/ksg.html
- See Lyman & Varian, "How Much Information?", http://www.sims.berkeley.edu/research/projects/how-much-info/
(Figure: the scale of data from kilo through mega, giga, tera, peta, exa, and zetta to yotta bytes, with examples ranging from a book or a photo, to a movie, to all Library of Congress books (words), to all books plus multimedia, up to "everything recorded"; the small prefixes milli through yocto are listed for contrast.)
5. Information Census (Lesk; Lyman & Varian)
- ~10 exabytes total
- 90% digital
- > 55% personal
- Print is ~0.003% of the bytes (~5 TB/y), but text has the lowest entropy
- Email is ~10 B messages/day, ~4 PB/y, and ~20% text (estimate by Gray)
- WWW is ~50 TB; the deep web ~50 PB
- Growth ~50%/y
7. Storage capacity is beating Moore's law
- Improvements: capacity 60%/y, bandwidth 40%/y, access time 16%/y
- ~$1,000/TB today
- ~$100/TB in 2007
8. Disk Storage Cheaper than Paper
- File cabinet: cabinet (4 drawer) $250 + paper (24,000 sheets) $250 + space (2x3 ft @ $10/ft2) $180 = $700 total, about $0.03/sheet
- Disk: a 160 GB disk is $200; as ASCII it holds ~500 M pages, ~2e-7 $/sheet (10,000x cheaper)
- As images: ~1 M photos, ~3e-4 $/photo (100x cheaper)
- Store everything on disk
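A rough back-of-envelope check of the numbers above, as a sketch (the per-page text size and per-photo image size are assumptions, so treat the printed ratios as order-of-magnitude only):

# Paper vs. disk cost per item, using the slide's 2002 prices.
cabinet_cost = 250 + 250 + 180            # cabinet + paper + floor space ($)
sheets_per_cabinet = 24_000
paper_per_sheet = cabinet_cost / sheets_per_cabinet    # about $0.03/sheet

disk_cost = 200.0                          # 160 GB disk ($)
ascii_pages = 500e6                        # assumes a few hundred bytes of text per page
photos = 1e6                               # assumes ~160 KB per compressed photo

disk_per_page = disk_cost / ascii_pages
disk_per_photo = disk_cost / photos
print(f"paper:      ${paper_per_sheet:.3f}/sheet")
print(f"disk text:  ${disk_per_page:.1e}/page  ({paper_per_sheet / disk_per_page:,.0f}x cheaper)")
print(f"disk photo: ${disk_per_photo:.1e}/photo ({paper_per_sheet / disk_per_photo:,.0f}x cheaper)")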
9. Why Put Everything in Cyberspace?
- Low rent: minimal $/byte
- Shrinks time: now or later (immediate OR time-delayed)
- Shrinks space: here or there (point-to-point OR broadcast)
- Automate processing: knowbots that locate, process, analyze, and summarize
10. Storage trends
- Right now, it's affordable to buy 100 GB/year
- In 5 years you will be able to afford 1 TB/year! (assuming storage doubles every 18 months)
11. Trying to fill a terabyte in a year
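The slide's table did not survive extraction, so here is an illustrative calculation in the same spirit; the item sizes below are assumptions, not figures from the talk:

# How much material per day does it take to fill 1 TB in a year?
TB = 1e12
item_sizes = {
    "photo (JPEG, ~300 KB)":        300e3,
    "office document (~1 MB)":      1e6,
    "hour of MP3 (~128 kb/s)":      128e3 / 8 * 3600,
    "hour of DVD video (~4 GB)":    4e9,
}
for name, size in item_sizes.items():
    per_day = TB / size / 365
    print(f"{name:28s} -> {per_day:10,.1f} per day")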
12. Memex ("As We May Think", Vannevar Bush, 1945)
- "A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility"
- "yet if the user inserted 5000 pages of material a day it would take him hundreds of years to fill the repository, so that he can be profligate and enter material freely"
13. Gordon Bell's MainBrain: Digitize Everything. A BIG shoebox?
- Scans: 20 K pages, TIFF @ 300 dpi, ~1 GB
- Music: 2 K tracks, ~7 GB
- Photos: 13 K images, ~2 GB
- Video: 10 hrs, ~3 GB
- Docs: 3 K (ppt, word, ...), ~2 GB
- Mail: 100 K messages, ~3 GB
- Total: ~18 GB
14. Gary Starkweather
- Scan EVERYTHING at 400 dpi TIFF
- 70 K pages = 14 GB
- OCR all scans (98% recognition accuracy)
- All indexed (5-second access to anything)
- All on his laptop
15. Access!
16. 50% is personal; what about the other 50%?
- Business
  - Wal-Mart online: ~1 PB and growing
  - Paradox: most transaction systems have mere PBs
  - Have to go to image/data monitoring for big data
- Government
  - Online government is a big thrust (cheaper, better, ...)
- Science
17. Instruments: CERN LHC - Petabytes per Year
- Looking for the Higgs particle
- Sensors: 1000 GB/s (1 TB/s)
- Events: 75 GB/s
- Filtered: 5 GB/s
- Reduced: 0.1 GB/s = ~2 PB/y
- Data pyramid: 100 GB : 1 TB : 100 TB : 1 PB : 10 PB
18. LHC Requirements (2005-)
- 1E9 events p.a. @ 1 MB/event = 1 PB/year/experiment
- Reconstructed: ~100 TB/reconstruction/year/experiment
- Send to Tier-1 regional centres
  - > 400 TB/year to RAL?
- Keep one set of derivatives on disk, and the rest on tape
- But the UK plans a Tier-1 clone
  - Many data clones
Source: John Gordon, IT Department, CLRC/RAL, CUF Meeting, October 2000
19. Science Data Volume: ESO/ST-ECF Science Archive
- ~100 TB archive
- Similar at Hubble, Keck, SDSS, ...
- ~1 PB aggregate
20. Data Pipeline: NASA
- Level 0: raw data (data stream)
- Level 1: calibrated data (measured values)
- Level 1A: calibrated and normalized (flux/magnitude/...)
- Level 2: derived data (metrics, e.g., vegetation index)
- Data volume: Level 0 ≈ 1 ≈ 1A << Level 2
- Level 2 >> Level 1 because
  - there are MANY data products
  - all published data editions (versions) must be kept
Source: EOSDIS Core System Information for Scientists, http://observer.gsfc.nasa.gov/sec3/ProductLevels.html
21. TerraServer: http://TerraService.net/
- 3 x 2 TB databases
- 18 TB of disk, tri-plexed (6 TB usable)
- 3 + 1 cluster
- 99.96% uptime
- 1 B page views, 5 B DB queries
- Now a .NET web service
22. Image Data
- USGS topo maps
- USGS aerial photos (DOQ)
- Encarta Virtual Globe: 1 km resolution, 100% world coverage
- All in the database as 200x200-pixel compressed tiles
- Spatial access via a z-transform B-tree
23. Hardware
- 8 Compaq DL360 "Photon" web servers
- 4 Compaq ProLiant 8500 DB servers
- Fiber SAN switches
- One SQL database per rack; each rack contains 4.5 TB; 261 drives / 13.7 TB total
- Metadata stored on 101 GB of fast, small disks (18 x 18.2 GB) - SQL\Inst1
- Imagery data stored on 4 x 339 GB of slow, big disks (15 x 73.8 GB) - SQL\Inst2, SQL\Inst3
- Plus a spare
- 90 x 72.8 GB disks to be added in Feb 2001 to create an 18 TB SAN
24. TerraServer Lessons Learned
- Hardware is 5 9s (with clustering)
- Software is 5 9s (with clustering)
- Admin is 4 9s (offline maintenance)
- Network is 3 9s (mistakes, environment)
- Simple designs are best
- A 10 TB DB is the management limit: 1 PB = 100 x 10 TB DBs; this is 100x better than 5 years ago
- Minimize use of tape
  - Backup to disk (snapshots)
  - Portable disk TBs
25. Sensor Applications
- Earth observation
  - ~15 PB by 2007
- Medical images and information, health monitoring
  - Potential 1 GB/patient/y => ~1 EB/y
- Video monitoring
  - 1E8 video cameras @ 1E5 Bps => 10 TB/s => ~100 EB/y => filtered???
- Airplane engines
  - 1 GB sensor data/flight, 100,000 engine hours/day => ~30 PB/y
- Smart dust: ?? EB/y
  - http://robotics.eecs.berkeley.edu/pister/SmartDust/
  - http://www-bsac.eecs.berkeley.edu/shollar/macro_motes/macromotes.html
26. What do they do with the data? (business, government, science; more later in the talk)
- Look for anomalies (see the sketch below)
  - 1, 2, 1, 2, 1, 1, 1, 2, -5, 1, 0, 2, ...
- Look for trends and patterns
  - 1, 2, 3, 4, 5, ...
- Look for correlations
  - e.g., ln(x) = ln(y) + c ln(z)
- Look at summaries, then drill down to details
  - LOTS of histograms
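A minimal sketch of the "look for anomalies" idea, using a simple two-sigma rule on the example sequence above (real monitoring systems use more robust statistics):

# Flag values that sit far from the mean of the sequence.
data = [1, 2, 1, 2, 1, 1, 1, 2, -5, 1, 0, 2]
mean = sum(data) / len(data)
std = (sum((x - mean) ** 2 for x in data) / len(data)) ** 0.5
anomalies = [(i, x) for i, x in enumerate(data) if abs(x - mean) > 2 * std]
print(anomalies)        # [(8, -5)] - the -5 stands out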
27. Premise: Grid Computing
- Store exabytes once or twice (for redundancy)
- Access them from anywhere
- Implies huge archive/data centers
- Supercomputer centers become super data centers
- Examples: Google, Yahoo!, Hotmail, CERN, Fermilab, SDSC
28. Bandwidth: 3x bandwidth/year for 25 more years
- Today
  - 40 Gbps per channel (λ)
  - 12 channels per fiber (WDM): ~500 Gbps
  - 32 fibers/bundle: ~16 Tbps/bundle
- In the lab: 3 Tbps/fiber (~400x WDM)
- In theory: 25 Tbps per fiber
  - 1 Tbps = USA 1996 WAN bisection bandwidth
- Aggregate bandwidth doubles every 8 months!
- 1 fiber ≈ 25 Tbps
29. Underlying Theme
- Digital everything
  - From words and numbers to sights and sounds
- New devices
  - From isolated to adaptive, synchronized, and connected
- Automation
  - From dumb to Web services
  - From manual to self-tuning, self-organizing, and self-maintaining
  - Beyond reliability to availability
- One inter-connected network
  - From stand-alone/basic connectivity to always wired (and wireless)
  - Everything over IP
30. Information Avalanche
- The Situation: a census of the data
  - We can record everything
  - Everything is a LOT!
- The Good News
  - Changes science, education, medicine, entertainment, ...
  - Shrinks time and space
  - Can augment human intelligence
- The Bad News
  - The end of privacy
  - Cyber crime / cyber terrorism
  - Monoculture
- The Technical Challenges
  - Amplify human intellect
  - Organize, summarize, and prioritize information
  - Make programming easy
31. Online Science
- All literature online
- All data online
- All instruments online
- Great analysis tools.
32. Online Education
- All literature online
- All lectures online
- Interactive and time-shifted education
- Just-in-time education
- Available to everyone, everywhere
- The economic model is not understood (who pays?)
  - One model: society pays
33. Online Business
- Frictionless economy
  - Near-perfect information
  - Very efficient
- Fully customized products
- Example: Wal-Mart / Dell
  - Traditional business: 1-10 inventory turns/y
  - eBusiness: 100-500 turns/y (no inventory)
  - VERY efficient, a huge economic advantage
  - Your customers and suppliers loan you money!
34. Online Medicine
- Traditional medicine
  - Can monitor your health continuously
  - Instant diagnosis
  - Personalized drugs
- New biology
  - DNA is software
  - Solve each disease
  - Huge impact on agriculture too
35. Cyberspace Shrinks Time and Distance
- Everyone is always connected
  - Can get the information they want
  - Can communicate with friends and family
- Everything is online
  - You never miss a meeting/game/party/movie (you can always watch it)
  - You never forget anything (it's there somewhere)
36. Sustainable Society
- Year 2050: ~9 B people living at Europe's standard of living
  - 100 M people in a city?
- The environment can't sustain it
  - Need more efficient cities/transportation/...
  - 20% consume 60% now; if 100% consume at 1/3 of current levels, net consumption is unchanged
  - Need to reduce energy/water/metal consumption ~3x in the developed world
37. Cyberspace (Data) and Tools Can Augment Human Intelligence
- See the next talk (12 CS challenges)
- MyMainBrain is a personal example: improved memory
- Data-mining tools are promising
38. Information Avalanche
- The Situation: a census of the data
  - We can record everything
  - Everything is a LOT!
- The Good News
  - Changes science, education, medicine, entertainment, ...
  - Shrinks time and space
  - Can augment human intelligence
- The Bad News
  - The end of privacy
  - Cyber crime / cyber terrorism
  - Monoculture
- The Technical Challenges
  - Amplify human intellect
  - Organize, summarize, and prioritize information
  - Make programming easy
39. The End of Privacy
- You can find out all about me
- Organizations can precisely track us
  - Credit cards, email, cellphone, ...
- Animals have tags in them; I will probably get a tag (eventually) (I already carry a dozen ID smart cards)
- "You have no privacy, get over it" - Scott McNealy
40. The Centralization of Power
- Computers enable an Orwellian future (1984)
- The government can know everything you ever
  - buy
  - say
  - hear
  - see/read/...
- And where you are (the phone company already knows), and who you see and talk to
- OK now, but what if a Nero/Hitler/Stalin/... comes to power?
41. Cyber Crime
- You can steal my identity
  - Sell my house
  - Accumulate huge debts
  - Make a video of me doing terrible things
- You can steal on a grand scale
  - Trillions of dollars are now online
  - A LARGE honey-pot for criminals
42. Cyber Terrorism
- It is easier to attack/destroy than to steal.
- Viruses, data corruption, data modification
- Denial of Service
- Hijacking and then destroying equipment
- Utilities (water, energy, transportation)
- Production (factories)
43. Monoculture
- Radio, TV, movies, and the Internet are making the world more homogeneous
- Half the world has never made a phone call
  - But this is changing fast (they want to make phone calls!)
- The wired world enables communities to form very easily, e.g., Sanskrit scholars
  - But the community has to speak a common language
44. Information Clutter
- Most mail is junk mail
- Most email will soon be junk mail
  - ~30% of Hotmail, ~75% of my mail (~130 messages/day)
- Telemarketing wastes people's time
- Creates info-glut
  - "You have 50,000 new mail messages"
- Need systems and interfaces to filter, summarize, and prioritize information
45. Information Avalanche
- The Situation: a census of the data
  - We can record everything
  - Everything is a LOT!
- The Good News
  - Changes science, education, medicine, entertainment, ...
  - Shrinks time and space
  - Can augment human intelligence
- The Bad News
  - The end of privacy
  - Cyber crime / cyber terrorism
  - Monoculture
- The Technical Challenges
  - Amplify human intellect
  - Organize, summarize, and prioritize information
  - Make programming easy
46. Technical Challenges
- Storing information
- Organizing information
- Summarizing information
- Visualizing information
- Make programming easy
47. The Personal Terabyte (all your stuff online). So you've got it - now what do you do with it?
- Probably not accessed very often, but TREASURED (what's the one thing you would save in a fire?)
- Can you find anything?
- Can you organize that many objects?
- Once you find it, will you know what it is?
- Once you've found it, could you find it again?
- Research goal: have GOOD answers to all these questions
48. Bell, Gemmell, Lueder: MyLifeBits Guiding Principles
- Freedom from strict hierarchy
  - Full-text search and collections
- Many visualizations
  - Don't metaphor me in
- Annotations add value
  - So make them easy!
- Keep the links when you author
  - Transclusion
- Everything goes in a database
49. How will we find it? Put everything in the DB (and index it)
- Need DBMS features: consistency, indexing, pivoting, queries, speed/scalability, backup, replication. If you don't use one, you end up creating one!
- Simple logical structure
  - Blob and link is all that is inherent (see the sketch below)
  - Additional properties (facets, i.e., extra tables) and methods on those tables (encapsulation)
- More than a file system
  - Unifies data and metadata
  - Simpler to manage
  - Easier to subset and reorganize
  - Set-oriented access
  - Allows online updates
  - Automatic indexing, replication
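A minimal sketch of the "blob and link" idea using sqlite3 from the Python standard library; the table and column names are illustrative, not MyLifeBits' actual schema:

import sqlite3

# Every item is a blob row; typed properties and links live in side tables
# so they can be indexed and queried.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE item     (id INTEGER PRIMARY KEY, kind TEXT, content BLOB);
CREATE TABLE property (item_id INTEGER REFERENCES item(id), name TEXT, value TEXT);
CREATE TABLE link     (from_id INTEGER, to_id INTEGER, label TEXT);
CREATE INDEX prop_idx ON property(name, value);
""")

photo = db.execute("INSERT INTO item(kind, content) VALUES ('photo', ?)",
                   (b"...jpeg bytes...",)).lastrowid
note = db.execute("INSERT INTO item(kind, content) VALUES ('note', ?)",
                  (b"Crete, July 2002",)).lastrowid
db.execute("INSERT INTO property VALUES (?, 'place', 'Heraklion')", (photo,))
db.execute("INSERT INTO link VALUES (?, ?, 'annotates')", (note, photo))

# Set-oriented access: find photos by property, regardless of any folder hierarchy.
print(db.execute("""
    SELECT i.id, p.value FROM item i
    JOIN property p ON p.item_id = i.id
    WHERE i.kind = 'photo' AND p.name = 'place'
""").fetchall())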
50. How do we represent it to the outside world?
<?xml version="1.0" encoding="utf-8" ?>
<DataSet xmlns="http://WWT.sdss.org/">
  <xs:schema id="radec" xmlns="" xmlns:xs="http://www.w3.org/2001/XMLSchema"
             xmlns:msdata="urn:schemas-microsoft-com:xml-msdata">
    <xs:element name="radec" msdata:IsDataSet="true">
      <xs:element name="Table">
        <xs:element name="ra" type="xs:double" minOccurs="0" />
        <xs:element name="dec" type="xs:double" minOccurs="0" />
        ...
  <diffgr:diffgram xmlns:msdata="urn:schemas-microsoft-com:xml-msdata"
                   xmlns:diffgr="urn:schemas-microsoft-com:xml-diffgram-v1">
    <radec xmlns="">
      <Table diffgr:id="Table1" msdata:rowOrder="0">
        <ra>184.028935351008</ra> <dec>-1.12590950121524</dec>
      </Table>
      ...
      <Table diffgr:id="Table10" msdata:rowOrder="9">
        <ra>184.025719033547</ra> <dec>-1.21795827920186</dec>
      </Table>
    </radec>
  </diffgr:diffgram>
</DataSet>
- The file metaphor is too primitive: just a blob
- The table metaphor is too primitive: just records
- Need metadata describing the data's context
  - Format
  - Provenance (author/publisher/citations/...)
  - Rights
  - History
  - Related documents
- In a standard format
  - XML and XML Schema
  - The DataSet above (schema plus data/diffgram) is a great example of this
  - The world is now defining standard schemas
51. There is a problem
Niklaus Wirth: Algorithms + Data Structures = Programs
- GREAT!!!!
  - XML documents are portable objects
  - XML documents are complex objects
  - WSDL defines the methods on objects (the class)
- But will all the implementations match?
  - Think of UNIX or SQL or C or ...
- This is a work in progress.
52. PhotoServer: Managing Photos
- Load all photos into the database
- Annotate the photos
- View by various attributes
- Do similarity search
- Use XML for interchange
- Use dbObject / Template for access
(Architecture figure: IIS with JScript, templates, and schema in front of SQL (FOR XML); requests carry SQL, templates, and XML data; responses return XML datasets and MIME data.)
53. How Similarity Search Works
- For each picture, the loader
  - Inserts thumbnails
  - Extracts 270 features into a blob
- When looking for a similar picture
  - Scan all photos, comparing features (dot product of vectors; see the sketch below)
  - Sort by similarity
- The feature blob is an array
  - Today I fake the array with functions and casts: cast(substring(feature,72,8) as float)
  - When SQL Server gets C# I will not have to fake it
  - And it will run 100x faster (compiled managed code)
- Idea pioneered by IBM Research; we use a variant by MS Beijing Research
(Figure: example photos described as "no black squares, ~20% orange, ..." and "many black squares, ~10% orange, ...", scoring 72% and 27% matches.)
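A sketch of the scan-and-rank step in Python/NumPy (not the PhotoServer code; the feature values here are random stand-ins for the 270 extracted color/texture features):

import numpy as np

rng = np.random.default_rng(0)
features = rng.random((10_000, 270))      # one feature vector per stored photo
query = rng.random(270)                   # feature vector of the query photo

def unit(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

scores = unit(features) @ unit(query)     # normalized dot product per photo
top5 = np.argsort(scores)[::-1][:5]       # sort by similarity, keep the best matches
print(list(zip(top5.tolist(), scores[top5].round(3))))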
54. Key Observations
- Data
  - XML data sets are a universal way to represent answers
  - XML data sets minimize round trips: one request/response
- Search
  - It is BEST to index
  - You can put objects and attributes in a row (SQL puts big blobs off-page)
  - If you can't index, you can extract attributes and quickly compare
  - SQL can scan at ~2 M records/CPU/second
  - Sequential scans are embarrassingly parallel
55. What about Big Data?
- We have talked about organizing personal data
- What about BIG data?
- Most of the following slides are inspired by (or even copied from)
  - Alex Szalay, JHU, and
  - George Djorgovski, Caltech
56. Data → Knowledge?
- Exponential growth of data volume, complexity, and quality
- But only SLOW growth of knowledge and understanding
- Why? A methodology bottleneck and human wetware limitations
- Need AI-assisted discovery
Adapted from slides by Alex Szalay and George Djorgovski
57. What's needed? (not drawn to scale)
58. How Are Discoveries Made? (adapted from a slide by George Djorgovski)
- Conceptual discoveries, e.g., Relativity, QM, brane worlds, inflation: theoretical, may be inspired by observations
- Phenomenological discoveries, e.g., dark matter, QSOs, GRBs, the CMBR, extrasolar planets, the obscured universe: empirical, inspire theories, can be motivated by them
- New technical capabilities drive observational discoveries, which feed back into theory
- Phenomenological discoveries: explore parameter space, make new connections (e.g., multi-wavelength)
- Understanding of complex phenomena requires complex, information-rich data (and simulations?)
59. Data Mining in the Image Domain
- Can we discover new types of phenomena using automated pattern recognition?
- (Every object-detection algorithm has its biases and limitations)
- Effective parametrization of source morphologies and environments
- Multiscale analysis
- (Also in the time/lightcurve domain)
60. Exploration of Parameter Spaces in the Catalog Domain (Source Attributes)
- Clustering analysis (supervised and unsupervised; a sketch follows this list)
  - How many different types of objects are there?
  - Are there any rare or new types, outliers?
- Multivariate correlation search
  - Are there significant, nontrivial correlations present in the data?
- Clusters vs. correlations
  - Science → correlations
  - Correlations → reduction of the statistical dimensionality
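A toy sketch of the clustering-and-outliers step on synthetic catalog-style data (assumes scikit-learn is available; real catalogs have far more attributes and objects):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
types = [rng.normal(center, 0.3, size=(500, 2)) for center in ([0, 0], [3, 0], [0, 3])]
outliers = rng.uniform(-2, 6, size=(5, 2))          # a few "rare or new" objects
X = np.vstack(types + [outliers])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
dist = km.transform(X).min(axis=1)                  # distance to the nearest cluster center
threshold = np.percentile(dist, 99.5)
print("objects per cluster:", np.bincount(km.labels_))
print("outlier candidates:", np.where(dist > threshold)[0])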
61. New Science from Multivariate Correlations
- If the statistical dimension DS is less than the data dimension DD, then multivariate correlations are present: the data lie near a surface f(xi, xj, ...) = 0 rather than filling the whole parameter space
- (Figure: a DD = 2 cloud with DS = 2, versus DD = 2 points along a curve with DS = 1, plotted against axes xi, xj, xk)
- Example: the Fundamental Plane of elliptical galaxies
- Correlations objectively define types of objects, e.g., the Tully-Fisher relation → normal spirals, the Fundamental Plane → normal ellipticals, and can lead to some new insights
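A toy PCA illustration of DS < DD on synthetic data: five measured parameters that really carry only two degrees of freedom, which the singular values reveal:

import numpy as np

rng = np.random.default_rng(2)
latent = rng.normal(size=(2000, 2))                 # 2 hidden degrees of freedom
mixing = rng.normal(size=(2, 5))                    # how they appear in 5 observables
X = latent @ mixing + 0.05 * rng.normal(size=(2000, 5))

Xc = X - X.mean(axis=0)
_, s, _ = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)
print(np.round(explained, 3))   # two large values, three near zero: DS ≈ 2 while DD = 5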
62. The Curse of Hyper-dimensionality
- Data complexity → multidimensionality → discoveries; but DD >> 1 and DS >> 1, and the bad news is the computational cost of clustering analysis:
  - K-means: ~ K × N × I × D
  - Expectation Maximization: ~ K × N × I × D^2
  - Monte Carlo cross-validation: ~ M × Kmax^2 × N × I × D^2
- where N = no. of data vectors (~1e12), D = no. of data dimensions (~1e4), K = no. of clusters chosen, Kmax = max no. of clusters tried, I = no. of iterations, M = no. of Monte Carlo trials/partitions
- This calls for exascale computing and/or better algorithms
- Some dimensionality-reduction methods do exist (e.g., PCA, class prototypes, hierarchical methods, etc.), but more work is needed
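Plugging the slide's N and D into the cost formulas above is simple arithmetic; the K, Kmax, I, and M values here are illustrative assumptions:

N, D = 1e12, 1e4                       # data vectors and dimensions (from the slide)
K, Kmax, I, M = 10, 30, 100, 10        # assumed clusters, max clusters, iterations, MC trials

costs = {
    "K-means":              K * N * I * D,
    "EM":                   K * N * I * D**2,
    "MC cross-validation":  M * Kmax**2 * N * I * D**2,
}
ops_per_exaflop_year = 1e18 * 3.15e7   # sustained exaflop machine, ~3.15e7 s/year
for name, ops in costs.items():
    print(f"{name:22s} ~{ops:.1e} ops  (~{ops / ops_per_exaflop_year:.1e} exaflop-years)")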
63. The Curse of Hyper-dimensionality: Visualization
- A fundamental limitation of human perception: DMAX ≈ 3? 5? (NB: we can certainly understand much higher dimensionalities mathematically, but cannot really visualize them; our own neural nets are powerful pattern-recognition tools)
- Interactive visualization is a key part of the data-mining process
- Some methodology exists, but much more is needed
(Figure: a feedback loop between the DM algorithm, visualization, and the user.)
64. Online Multivariate Analysis Challenges
- Data heterogeneity, biases, selection effects
- Non-Gaussianity of clusters (data models)
- Non-trivial topology of clustering
- Useful vs. useless parameters
(Figure: an outlier population, or a non-Gaussian tail?)
65. Useful vs. Useless Parameters
- Clusters (classes) and correlations may exist/separate in some parameter subspaces, but not in others
(Figure: the same data plotted in two parameter subspaces, with axes xi, xj and xn, xm.)
66. Optimal Statistics (the following slides adapted from Alex Szalay)
- Statistics algorithms scale poorly
  - Correlation functions are ~N^2, likelihood techniques ~N^3
  - Even if data and computers grow at the same rate, computers can only keep up with N log N algorithms
- Current methods assume infinite computational resources
  - And assume that the only source of error is statistical
  - Cosmic variance: we can only observe the Universe from one location (finite sample size)
- Possible solutions require a combination of statistics and CS
  - New algorithms that are no worse than N log N
67. Clever Data Structures
- Heavy use of tree structures
  - Initial cost ~N log N
  - Large speedup later
  - Tree codes for correlations (A. Moore et al. 2001)
- Fast, approximate heuristic algorithms
  - No need to be more accurate than cosmic variance
  - Fast CMB analysis by Szapudi et al. (2001): N log N instead of N^3, i.e., 1 day instead of 10 million years
- Take the cost of computation into account
  - Controlled level of accuracy
  - Best result in a given time, given our computing resources
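A toy illustration of the tree-code idea: counting close pairs (the raw ingredient of a two-point correlation function) with a KD-tree instead of an explicit N^2 double loop. This is a sketch assuming SciPy, not the Moore or Szapudi codes used for the real analyses:

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(3)
points = rng.random((100_000, 2))       # mock 2-D "galaxy" positions
r = 0.01                                # separation of interest

tree = cKDTree(points)                  # build cost ~N log N
count = tree.count_neighbors(tree, r)   # counts ordered pairs, including self-pairs
pairs = (count - len(points)) // 2      # remove self-pairs; each pair was counted twice
print("pairs within r:", pairs)         # the naive loop would touch ~5e9 pairs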
68. Angular Clustering with Photo-z
- w(θ), introduced by Peebles and Groth: the first example of publishing and analyzing large data
- Samples based on rest-frame quantities
- Strictly volume-limited samples
- Largest angular correlation study to date
- Very clear detection of luminosity and color dependence
- Results consistent with 3D clustering
T. Budavari, A. Connolly, I. Csabai, I. Szapudi, A. Szalay, S. Dodelson, J. Frieman, R. Scranton, D. Johnston, and the SDSS Collaboration
69. The Samples
- 2800 square degrees in 10 stripes; data in a custom DB
- All: 50 M
- mr < 21: 15 M
- 10 stripes: 10 M
- 0.1 < z < 0.3, -20 > Mr: 2.2 M
- 0.1 < z < 0.5, -21.4 > Mr: 3.1 M
- -20 > Mr > -21: 1182 k
- -21 > Mr > -23: 931 k
- -21 > Mr > -22: 662 k
- -22 > Mr > -23: 269 k
70. The Stripes
- 10 stripes over the SDSS area, covering about 2800 square degrees
- About 20% is lost due to bad seeing
- Masks: seeing, bright stars, etc.
- Images generated from a query by a web service
71. The Masks
- Stripe 11 masks
- Masks are derived from the database
- Search and intersect extended objects with
boundaries
72. The Analysis
- eSpICE: I. Szapudi, S. Colombi, and S. Prunet
- Integrated with the database by T. Budavari
- Extremely fast processing (N log N)
  - 1 stripe with about 1 million galaxies is processed in 3 minutes
  - The usual figure was 10 min for 10,000 galaxies, i.e., about 70 days
- Each stripe processed separately for each cut
- 2D angular correlation function computed
- w(θ) averaged with rejection of pixels along the scan
  - The flat-field vector causes mock correlations
73. Angular Correlations I
- Luminosity dependence: 3 cuts
  - -20 > M > -21
  - -21 > M > -22
  - -22 > M > -23
74. Angular Correlations II
- Color dependence
- 4 bins by rest-frame SED type
75. If there's time
- Better user interfaces: 0 TaskGalary.MPG
- Organizing photos: 1 Digital Photo.mpg
- Organizing newsgroups: 2 Communities.mpg
- Enhancing meetings: 3 flows.mpg
- Attentional interfaces: 4 Side Show.mpg
76. Thesis
- Most new information is digital (and old information is being digitized)
- A Computer Science Grand Challenge:
  - Capture
  - Organize
  - Summarize
  - Visualize
  - ... this information
- Optimize human attention as a resource
- Improve information quality
77. Information Avalanche
- The Situation: a census of the data
  - We can record everything
  - Everything is a LOT!
- The Good News
  - Changes science, education, medicine, entertainment, ...
  - Shrinks time and space
  - Can augment human intelligence
- The Bad News
  - The end of privacy
  - Cyber crime / cyber terrorism
  - Monoculture
- The Technical Challenges
  - Amplify human intellect
  - Organize, summarize, and prioritize information
  - Make programming easy