Title: The Information Avalanche: Reducing Information Overload
1. The Information Avalanche: Reducing Information Overload
- Jim Gray
- Microsoft Research
- Onassis Foundation Science Lecture Series
- http://www.forth.gr/onassis/lectures/2002-07-15/index.html
- Heraklion, Crete, Greece, 15-19 July 2002
2. Thesis
- Most new information is digital (and old information is being digitized)
- A Computer Science Grand Challenge:
  - Capture
  - Organize
  - Summarize
  - Visualize
  - ... this information
- Optimize human attention as a resource
- Improve information quality
3. Information Avalanche
- The Situation: a census of the data
  - We can record everything
  - Everything is a LOT!
- The Good News
  - Changes science, education, medicine, entertainment, ...
  - Shrinks time and space
  - Can augment human intelligence
- The Bad News
  - The end of privacy
  - Cyber crime / cyber terrorism
  - Monoculture
- The Technical Challenges
  - Amplify human intellect
  - Organize, summarize, and prioritize information
  - Make programming easy
4. How much information is there?
- Soon everything can be recorded and indexed
- Most bytes will never be seen by humans
- Data summarization, trend detection, and anomaly detection are key technologies
- See Mike Lesk, "How Much Information Is There?", http://www.lesk.com/mlesk/ksg97/ksg.html
- See Lyman & Varian, "How Much Information?", http://www.sims.berkeley.edu/research/projects/how-much-info/
(Figure: the scale of data from kilo through mega, giga, tera, peta, exa, and zetta to yotta bytes, with examples ranging from a book or a photo, to a movie, to all Library of Congress books (words), to all books plus multimedia, up to "everything recorded"; the small prefixes milli through yocto are listed for contrast.)
5. Information Census (Lesk; Lyman & Varian)
- ~10 exabytes total
- 90% digital
- > 55% personal
- Print is ~0.003% of the bytes (~5 TB/y), but text has the lowest entropy
- Email is ~10 B messages/day, ~4 PB/y, and ~20% text (estimate by Gray)
- WWW is ~50 TB; the deep web ~50 PB
- Growth ~50%/y
7. Storage capacity is beating Moore's law
- Improvements: capacity 60%/y, bandwidth 40%/y, access time 16%/y
- ~$1,000/TB today
- ~$100/TB in 2007
8. Disk Storage Cheaper than Paper
- File cabinet: cabinet (4 drawer) $250 + paper (24,000 sheets) $250 + space (2x3 ft @ $10/ft2) $180 = $700 total, about $0.03/sheet
- Disk: a 160 GB disk is $200; as ASCII it holds ~500 M pages, ~2e-7 $/sheet (10,000x cheaper)
- As images: ~1 M photos, ~3e-4 $/photo (100x cheaper)
- Store everything on disk
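A rough back-of-envelope check of the numbers above, as a sketch (the per-page text size and per-photo image size are assumptions, so treat the printed ratios as order-of-magnitude only):

# Paper vs. disk cost per item, using the slide's 2002 prices.
cabinet_cost = 250 + 250 + 180            # cabinet + paper + floor space ($)
sheets_per_cabinet = 24_000
paper_per_sheet = cabinet_cost / sheets_per_cabinet    # about $0.03/sheet

disk_cost = 200.0                          # 160 GB disk ($)
ascii_pages = 500e6                        # assumes a few hundred bytes of text per page
photos = 1e6                               # assumes ~160 KB per compressed photo

disk_per_page = disk_cost / ascii_pages
disk_per_photo = disk_cost / photos
print(f"paper:      ${paper_per_sheet:.3f}/sheet")
print(f"disk text:  ${disk_per_page:.1e}/page  ({paper_per_sheet / disk_per_page:,.0f}x cheaper)")
print(f"disk photo: ${disk_per_photo:.1e}/photo ({paper_per_sheet / disk_per_photo:,.0f}x cheaper)")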
9. Why Put Everything in Cyberspace?
- Low rent: minimal $/byte
- Shrinks time: now or later (immediate OR time-delayed)
- Shrinks space: here or there (point-to-point OR broadcast)
- Automate processing: knowbots that locate, process, analyze, and summarize
10. Storage trends
- Right now, it's affordable to buy 100 GB/year
- In 5 years you will be able to afford 1 TB/year! (assuming storage doubles every 18 months)
11. Trying to fill a terabyte in a year
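The slide's table did not survive extraction, so here is an illustrative calculation in the same spirit; the item sizes below are assumptions, not figures from the talk:

# How much material per day does it take to fill 1 TB in a year?
TB = 1e12
item_sizes = {
    "photo (JPEG, ~300 KB)":        300e3,
    "office document (~1 MB)":      1e6,
    "hour of MP3 (~128 kb/s)":      128e3 / 8 * 3600,
    "hour of DVD video (~4 GB)":    4e9,
}
for name, size in item_sizes.items():
    per_day = TB / size / 365
    print(f"{name:28s} -> {per_day:10,.1f} per day")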
12. Memex ("As We May Think", Vannevar Bush, 1945)
- "A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility"
- "yet if the user inserted 5000 pages of material a day it would take him hundreds of years to fill the repository, so that he can be profligate and enter material freely"
13. Gordon Bell's MainBrain: Digitize Everything. A BIG shoebox?
- Scans: 20 K pages, TIFF @ 300 dpi, ~1 GB
- Music: 2 K tracks, ~7 GB
- Photos: 13 K images, ~2 GB
- Video: 10 hrs, ~3 GB
- Docs: 3 K (ppt, word, ...), ~2 GB
- Mail: 100 K messages, ~3 GB
- Total: ~18 GB
14. Gary Starkweather
- Scan EVERYTHING at 400 dpi TIFF
- 70 K pages = 14 GB
- OCR all scans (98% recognition accuracy)
- All indexed (5-second access to anything)
- All on his laptop
15. Access!
16. 50% is personal; what about the other 50%?
- Business
  - Wal-Mart online: ~1 PB and growing
  - Paradox: most transaction systems have mere PBs
  - Have to go to image/data monitoring for big data
- Government
  - Online government is a big thrust (cheaper, better, ...)
- Science
17. Instruments: CERN LHC - Petabytes per Year
- Looking for the Higgs particle
- Sensors: 1000 GB/s (1 TB/s)
- Events: 75 GB/s
- Filtered: 5 GB/s
- Reduced: 0.1 GB/s = ~2 PB/y
- Data pyramid: 100 GB : 1 TB : 100 TB : 1 PB : 10 PB
18. LHC Requirements (2005-)
- 1E9 events p.a. @ 1 MB/event = 1 PB/year/experiment
- Reconstructed: ~100 TB/reconstruction/year/experiment
- Send to Tier-1 regional centres
  - > 400 TB/year to RAL?
- Keep one set of derivatives on disk, and the rest on tape
- But the UK plans a Tier-1 clone
  - Many data clones
Source: John Gordon, IT Department, CLRC/RAL, CUF Meeting, October 2000
19. Science Data Volume: ESO/ST-ECF Science Archive
- ~100 TB archive
- Similar at Hubble, Keck, SDSS, ...
- ~1 PB aggregate
20. Data Pipeline: NASA
- Level 0: raw data (data stream)
- Level 1: calibrated data (measured values)
- Level 1A: calibrated and normalized (flux/magnitude/...)
- Level 2: derived data (metrics, e.g., vegetation index)
- Data volume: Level 0 ≈ 1 ≈ 1A << Level 2
- Level 2 >> Level 1 because
  - there are MANY data products
  - all published data editions (versions) must be kept
Source: EOSDIS Core System Information for Scientists, http://observer.gsfc.nasa.gov/sec3/ProductLevels.html
21. TerraServer: http://TerraService.net/
- 3 x 2 TB databases
- 18 TB of disk, tri-plexed (6 TB usable)
- 3 + 1 cluster
- 99.96% uptime
- 1 B page views, 5 B DB queries
- Now a .NET web service
22. Image Data
- USGS topo maps
- USGS aerial photos (DOQ)
- Encarta Virtual Globe: 1 km resolution, 100% world coverage
- All in the database as 200x200-pixel compressed tiles
- Spatial access via a z-transform B-tree
23. Hardware
- 8 Compaq DL360 "Photon" web servers
- 4 Compaq ProLiant 8500 DB servers
- Fiber SAN switches
- One SQL database per rack; each rack contains 4.5 TB; 261 drives / 13.7 TB total
- Metadata stored on 101 GB of fast, small disks (18 x 18.2 GB) - SQL\Inst1
- Imagery data stored on 4 x 339 GB of slow, big disks (15 x 73.8 GB) - SQL\Inst2, SQL\Inst3
- Plus a spare
- 90 x 72.8 GB disks to be added in Feb 2001 to create an 18 TB SAN
24. TerraServer Lessons Learned
- Hardware is 5 9s (with clustering)
- Software is 5 9s (with clustering)
- Admin is 4 9s (offline maintenance)
- Network is 3 9s (mistakes, environment)
- Simple designs are best
- A 10 TB DB is the management limit: 1 PB = 100 x 10 TB DBs; this is 100x better than 5 years ago
- Minimize use of tape
  - Backup to disk (snapshots)
  - Portable disk TBs
25. Sensor Applications
- Earth observation
  - ~15 PB by 2007
- Medical images and information, health monitoring
  - Potential 1 GB/patient/y => ~1 EB/y
- Video monitoring
  - 1E8 video cameras @ 1E5 Bps => 10 TB/s => ~100 EB/y => filtered???
- Airplane engines
  - 1 GB sensor data/flight, 100,000 engine hours/day => ~30 PB/y
- Smart dust: ?? EB/y
  - http://robotics.eecs.berkeley.edu/pister/SmartDust/
  - http://www-bsac.eecs.berkeley.edu/shollar/macro_motes/macromotes.html
26. What do they do with the data? (business, government, science; more later in the talk)
- Look for anomalies (see the sketch below)
  - 1, 2, 1, 2, 1, 1, 1, 2, -5, 1, 0, 2, ...
- Look for trends and patterns
  - 1, 2, 3, 4, 5, ...
- Look for correlations
  - e.g., ln(x) = ln(y) + c ln(z)
- Look at summaries, then drill down to details
  - LOTS of histograms
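A minimal sketch of the "look for anomalies" idea, using a simple two-sigma rule on the example sequence above (real monitoring systems use more robust statistics):

# Flag values that sit far from the mean of the sequence.
data = [1, 2, 1, 2, 1, 1, 1, 2, -5, 1, 0, 2]
mean = sum(data) / len(data)
std = (sum((x - mean) ** 2 for x in data) / len(data)) ** 0.5
anomalies = [(i, x) for i, x in enumerate(data) if abs(x - mean) > 2 * std]
print(anomalies)        # [(8, -5)] - the -5 stands out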
27. Premise: Grid Computing
- Store exabytes once or twice (for redundancy)
- Access them from anywhere
- Implies huge archive/data centers
- Supercomputer centers become super data centers
- Examples: Google, Yahoo!, Hotmail, CERN, Fermilab, SDSC
28. Bandwidth: 3x bandwidth/year for 25 more years
- Today
  - 40 Gbps per channel (λ)
  - 12 channels per fiber (WDM): ~500 Gbps
  - 32 fibers/bundle: ~16 Tbps/bundle
- In the lab: 3 Tbps/fiber (~400x WDM)
- In theory: 25 Tbps per fiber
  - 1 Tbps = USA 1996 WAN bisection bandwidth
- Aggregate bandwidth doubles every 8 months!
- 1 fiber ≈ 25 Tbps
29. Underlying Theme
- Digital everything
  - From words and numbers to sights and sounds
- New devices
  - From isolated to adaptive, synchronized, and connected
- Automation
  - From dumb to Web services
  - From manual to self-tuning, self-organizing, and self-maintaining
  - Beyond reliability to availability
- One inter-connected network
  - From stand-alone/basic connectivity to always wired (and wireless)
  - Everything over IP
30. Information Avalanche
- The Situation: a census of the data
  - We can record everything
  - Everything is a LOT!
- The Good News
  - Changes science, education, medicine, entertainment, ...
  - Shrinks time and space
  - Can augment human intelligence
- The Bad News
  - The end of privacy
  - Cyber crime / cyber terrorism
  - Monoculture
- The Technical Challenges
  - Amplify human intellect
  - Organize, summarize, and prioritize information
  - Make programming easy
31. Online Science
- All literature online
- All data online
- All instruments online
- Great analysis tools.
32. Online Education
- All literature online
- All lectures online
- Interactive and time-shifted education
- Just-in-time education
- Available to everyone, everywhere
- The economic model is not understood (who pays?)
  - One model: society pays
33. Online Business
- Frictionless economy
  - Near-perfect information
  - Very efficient
- Fully customized products
- Example: Wal-Mart / Dell
  - Traditional business: 1-10 inventory turns/y
  - eBusiness: 100-500 turns/y (no inventory)
  - VERY efficient, a huge economic advantage
  - Your customers and suppliers loan you money!
34. Online Medicine
- Traditional medicine
  - Can monitor your health continuously
  - Instant diagnosis
  - Personalized drugs
- New biology
  - DNA is software
  - Solve each disease
  - Huge impact on agriculture too
35. Cyberspace Shrinks Time and Distance
- Everyone is always connected
  - Can get the information they want
  - Can communicate with friends and family
- Everything is online
  - You never miss a meeting/game/party/movie (you can always watch it)
  - You never forget anything (it's there somewhere)
36. Sustainable Society
- Year 2050: ~9 B people living at Europe's standard of living
  - 100 M people in a city?
- The environment can't sustain it
  - Need more efficient cities/transportation/...
  - 20% consume 60% now; if 100% consume at 1/3 of current levels, net consumption is unchanged
  - Need to reduce energy/water/metal consumption ~3x in the developed world
37. Cyberspace (Data) and Tools Can Augment Human Intelligence
- See the next talk (12 CS challenges)
- MyMainBrain is a personal example: improved memory
- Data-mining tools are promising
38. Information Avalanche
- The Situation: a census of the data
  - We can record everything
  - Everything is a LOT!
- The Good News
  - Changes science, education, medicine, entertainment, ...
  - Shrinks time and space
  - Can augment human intelligence
- The Bad News
  - The end of privacy
  - Cyber crime / cyber terrorism
  - Monoculture
- The Technical Challenges
  - Amplify human intellect
  - Organize, summarize, and prioritize information
  - Make programming easy
39. The End of Privacy
- You can find out all about me
- Organizations can precisely track us
  - Credit cards, email, cellphone, ...
- Animals have tags in them; I will probably get a tag (eventually) (I already carry a dozen ID smart cards)
- "You have no privacy, get over it" - Scott McNealy
40. The Centralization of Power
- Computers enable an Orwellian future (1984)
- The government can know everything you ever
  - buy
  - say
  - hear
  - see/read/...
- And where you are (the phone company already knows), and who you see and talk to
- OK now, but what if a Nero/Hitler/Stalin/... comes to power?
41. Cyber Crime
- You can steal my identity
  - Sell my house
  - Accumulate huge debts
  - Make a video of me doing terrible things
- You can steal on a grand scale
  - Trillions of dollars are now online
  - A LARGE honey-pot for criminals
42. Cyber Terrorism
- It is easier to attack/destroy than to steal.
- Viruses, data corruption, data modification
- Denial of Service
- Hijacking and then destroying equipment
- Utilities (water, energy, transportation)
- Production (factories)
43. Monoculture
- Radio, TV, movies, and the Internet are making the world more homogeneous
- Half the world has never made a phone call
  - But this is changing fast (they want to make phone calls!)
- The wired world enables communities to form very easily, e.g., Sanskrit scholars
  - But the community has to speak a common language
44. Information Clutter
- Most mail is junk mail
- Most email will soon be junk mail
  - ~30% of Hotmail, ~75% of my mail (~130 messages/day)
- Telemarketing wastes people's time
- Creates info-glut
  - "You have 50,000 new mail messages"
- Need systems and interfaces to filter, summarize, and prioritize information
45. Information Avalanche
- The Situation: a census of the data
  - We can record everything
  - Everything is a LOT!
- The Good News
  - Changes science, education, medicine, entertainment, ...
  - Shrinks time and space
  - Can augment human intelligence
- The Bad News
  - The end of privacy
  - Cyber crime / cyber terrorism
  - Monoculture
- The Technical Challenges
  - Amplify human intellect
  - Organize, summarize, and prioritize information
  - Make programming easy
46. Technical Challenges
- Storing information
- Organizing information
- Summarizing information
- Visualizing information
- Make programming easy
47. The Personal Terabyte (all your stuff online). So you've got it - now what do you do with it?
- Probably not accessed very often, but TREASURED (what's the one thing you would save in a fire?)
- Can you find anything?
- Can you organize that many objects?
- Once you find it, will you know what it is?
- Once you've found it, could you find it again?
- Research goal: have GOOD answers to all these questions
48. Bell, Gemmell, Lueder: MyLifeBits Guiding Principles
- Freedom from strict hierarchy
  - Full-text search and collections
- Many visualizations
  - Don't metaphor me in
- Annotations add value
  - So make them easy!
- Keep the links when you author
  - Transclusion
- Everything goes in a database
49. How will we find it? Put everything in the DB (and index it)
- Need DBMS features: consistency, indexing, pivoting, queries, speed/scalability, backup, replication. If you don't use one, you end up creating one!
- Simple logical structure
  - Blob and link is all that is inherent (see the sketch below)
  - Additional properties (facets, i.e., extra tables) and methods on those tables (encapsulation)
- More than a file system
  - Unifies data and metadata
  - Simpler to manage
  - Easier to subset and reorganize
  - Set-oriented access
  - Allows online updates
  - Automatic indexing, replication
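A minimal sketch of the "blob and link" idea using sqlite3 from the Python standard library; the table and column names are illustrative, not MyLifeBits' actual schema:

import sqlite3

# Every item is a blob row; typed properties and links live in side tables
# so they can be indexed and queried.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE item     (id INTEGER PRIMARY KEY, kind TEXT, content BLOB);
CREATE TABLE property (item_id INTEGER REFERENCES item(id), name TEXT, value TEXT);
CREATE TABLE link     (from_id INTEGER, to_id INTEGER, label TEXT);
CREATE INDEX prop_idx ON property(name, value);
""")

photo = db.execute("INSERT INTO item(kind, content) VALUES ('photo', ?)",
                   (b"...jpeg bytes...",)).lastrowid
note = db.execute("INSERT INTO item(kind, content) VALUES ('note', ?)",
                  (b"Crete, July 2002",)).lastrowid
db.execute("INSERT INTO property VALUES (?, 'place', 'Heraklion')", (photo,))
db.execute("INSERT INTO link VALUES (?, ?, 'annotates')", (note, photo))

# Set-oriented access: find photos by property, regardless of any folder hierarchy.
print(db.execute("""
    SELECT i.id, p.value FROM item i
    JOIN property p ON p.item_id = i.id
    WHERE i.kind = 'photo' AND p.name = 'place'
""").fetchall())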
50. How do we represent it to the outside world?
<?xml version="1.0" encoding="utf-8" ?>
<DataSet xmlns="http://WWT.sdss.org/">
  <xs:schema id="radec" xmlns="" xmlns:xs="http://www.w3.org/2001/XMLSchema"
             xmlns:msdata="urn:schemas-microsoft-com:xml-msdata">
    <xs:element name="radec" msdata:IsDataSet="true">
      <xs:element name="Table">
        <xs:element name="ra" type="xs:double" minOccurs="0" />
        <xs:element name="dec" type="xs:double" minOccurs="0" />
        ...
  <diffgr:diffgram xmlns:msdata="urn:schemas-microsoft-com:xml-msdata"
                   xmlns:diffgr="urn:schemas-microsoft-com:xml-diffgram-v1">
    <radec xmlns="">
      <Table diffgr:id="Table1" msdata:rowOrder="0">
        <ra>184.028935351008</ra> <dec>-1.12590950121524</dec>
      </Table>
      ...
      <Table diffgr:id="Table10" msdata:rowOrder="9">
        <ra>184.025719033547</ra> <dec>-1.21795827920186</dec>
      </Table>
    </radec>
  </diffgr:diffgram>
</DataSet>
- The file metaphor is too primitive: just a blob
- The table metaphor is too primitive: just records
- Need metadata describing the data's context
  - Format
  - Provenance (author/publisher/citations/...)
  - Rights
  - History
  - Related documents
- In a standard format
  - XML and XML Schema
  - The DataSet above (schema plus data/diffgram) is a great example of this
  - The world is now defining standard schemas
51. There is a problem
Niklaus Wirth: Algorithms + Data Structures = Programs
- GREAT!!!!
  - XML documents are portable objects
  - XML documents are complex objects
  - WSDL defines the methods on objects (the class)
- But will all the implementations match?
  - Think of UNIX or SQL or C or ...
- This is a work in progress.
52. PhotoServer: Managing Photos
- Load all photos into the database
- Annotate the photos
- View by various attributes
- Do similarity search
- Use XML for interchange
- Use dbObject / Template for access
(Architecture figure: IIS with JScript, templates, and schema in front of SQL (FOR XML); requests carry SQL, templates, and XML data; responses return XML datasets and MIME data.)
53. How Similarity Search Works
- For each picture, the loader
  - Inserts thumbnails
  - Extracts 270 features into a blob
- When looking for a similar picture
  - Scan all photos, comparing features (dot product of vectors; see the sketch below)
  - Sort by similarity
- The feature blob is an array
  - Today I fake the array with functions and casts: cast(substring(feature,72,8) as float)
  - When SQL Server gets C# I will not have to fake it
  - And it will run 100x faster (compiled managed code)
- Idea pioneered by IBM Research; we use a variant by MS Beijing Research
(Figure: example photos described as "no black squares, ~20% orange, ..." and "many black squares, ~10% orange, ...", scoring 72% and 27% matches.)
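A sketch of the scan-and-rank step in Python/NumPy (not the PhotoServer code; the feature values here are random stand-ins for the 270 extracted color/texture features):

import numpy as np

rng = np.random.default_rng(0)
features = rng.random((10_000, 270))      # one feature vector per stored photo
query = rng.random(270)                   # feature vector of the query photo

def unit(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

scores = unit(features) @ unit(query)     # normalized dot product per photo
top5 = np.argsort(scores)[::-1][:5]       # sort by similarity, keep the best matches
print(list(zip(top5.tolist(), scores[top5].round(3))))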
54. Key Observations
- Data
  - XML data sets are a universal way to represent answers
  - XML data sets minimize round trips: one request/response
- Search
  - It is BEST to index
  - You can put objects and attributes in a row (SQL puts big blobs off-page)
  - If you can't index, you can extract attributes and quickly compare
  - SQL can scan at ~2 M records/CPU/second
  - Sequential scans are embarrassingly parallel
55. What about Big Data?
- We have talked about organizing personal data
- What about BIG data?
- Most of the following slides are inspired by (or even copied from)
  - Alex Szalay, JHU, and
  - George Djorgovski, Caltech
56. Data → Knowledge?
- Exponential growth of data volume, complexity, and quality
- But only SLOW growth of knowledge and understanding
- Why? A methodology bottleneck and human wetware limitations
- Need AI-assisted discovery
Adapted from slides by Alex Szalay and George Djorgovski
57. What's needed? (not drawn to scale)
58. How Are Discoveries Made? (adapted from a slide by George Djorgovski)
- Conceptual discoveries, e.g., Relativity, QM, brane worlds, inflation: theoretical, may be inspired by observations
- Phenomenological discoveries, e.g., dark matter, QSOs, GRBs, the CMBR, extrasolar planets, the obscured universe: empirical, inspire theories, can be motivated by them
- New technical capabilities drive observational discoveries, which feed back into theory
- Phenomenological discoveries: explore parameter space, make new connections (e.g., multi-wavelength)
- Understanding of complex phenomena requires complex, information-rich data (and simulations?)
59. Data Mining in the Image Domain
- Can we discover new types of phenomena using automated pattern recognition?
- (Every object-detection algorithm has its biases and limitations)
- Effective parametrization of source morphologies and environments
- Multiscale analysis
- (Also in the time/lightcurve domain)
60. Exploration of Parameter Spaces in the Catalog Domain (Source Attributes)
- Clustering analysis (supervised and unsupervised; a sketch follows this list)
  - How many different types of objects are there?
  - Are there any rare or new types, outliers?
- Multivariate correlation search
  - Are there significant, nontrivial correlations present in the data?
- Clusters vs. correlations
  - Science → correlations
  - Correlations → reduction of the statistical dimensionality
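A toy sketch of the clustering-and-outliers step on synthetic catalog-style data (assumes scikit-learn is available; real catalogs have far more attributes and objects):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
types = [rng.normal(center, 0.3, size=(500, 2)) for center in ([0, 0], [3, 0], [0, 3])]
outliers = rng.uniform(-2, 6, size=(5, 2))          # a few "rare or new" objects
X = np.vstack(types + [outliers])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
dist = km.transform(X).min(axis=1)                  # distance to the nearest cluster center
threshold = np.percentile(dist, 99.5)
print("objects per cluster:", np.bincount(km.labels_))
print("outlier candidates:", np.where(dist > threshold)[0])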
61. New Science from Multivariate Correlations
- If the statistical dimension DS is less than the data dimension DD, then multivariate correlations are present: the data lie near a surface f(xi, xj, ...) = 0 rather than filling the whole parameter space
- (Figure: a DD = 2 cloud with DS = 2, versus DD = 2 points along a curve with DS = 1, plotted against axes xi, xj, xk)
- Example: the Fundamental Plane of elliptical galaxies
- Correlations objectively define types of objects, e.g., the Tully-Fisher relation → normal spirals, the Fundamental Plane → normal ellipticals, and can lead to some new insights
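A toy PCA illustration of DS < DD on synthetic data: five measured parameters that really carry only two degrees of freedom, which the singular values reveal:

import numpy as np

rng = np.random.default_rng(2)
latent = rng.normal(size=(2000, 2))                 # 2 hidden degrees of freedom
mixing = rng.normal(size=(2, 5))                    # how they appear in 5 observables
X = latent @ mixing + 0.05 * rng.normal(size=(2000, 5))

Xc = X - X.mean(axis=0)
_, s, _ = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)
print(np.round(explained, 3))   # two large values, three near zero: DS ≈ 2 while DD = 5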
62. The Curse of Hyper-dimensionality
- Data complexity → multidimensionality → discoveries; but DD >> 1 and DS >> 1, and the bad news is the computational cost of clustering analysis:
  - K-means: ~ K × N × I × D
  - Expectation Maximization: ~ K × N × I × D^2
  - Monte Carlo cross-validation: ~ M × Kmax^2 × N × I × D^2
- where N = no. of data vectors (~1e12), D = no. of data dimensions (~1e4), K = no. of clusters chosen, Kmax = max no. of clusters tried, I = no. of iterations, M = no. of Monte Carlo trials/partitions
- This calls for exascale computing and/or better algorithms
- Some dimensionality-reduction methods do exist (e.g., PCA, class prototypes, hierarchical methods, etc.), but more work is needed
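Plugging the slide's N and D into the cost formulas above is simple arithmetic; the K, Kmax, I, and M values here are illustrative assumptions:

N, D = 1e12, 1e4                       # data vectors and dimensions (from the slide)
K, Kmax, I, M = 10, 30, 100, 10        # assumed clusters, max clusters, iterations, MC trials

costs = {
    "K-means":              K * N * I * D,
    "EM":                   K * N * I * D**2,
    "MC cross-validation":  M * Kmax**2 * N * I * D**2,
}
ops_per_exaflop_year = 1e18 * 3.15e7   # sustained exaflop machine, ~3.15e7 s/year
for name, ops in costs.items():
    print(f"{name:22s} ~{ops:.1e} ops  (~{ops / ops_per_exaflop_year:.1e} exaflop-years)")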
63. The Curse of Hyper-dimensionality: Visualization
- A fundamental limitation of human perception: DMAX ≈ 3? 5? (NB: we can certainly understand much higher dimensionalities mathematically, but cannot really visualize them; our own neural nets are powerful pattern-recognition tools)
- Interactive visualization is a key part of the data-mining process
- Some methodology exists, but much more is needed
(Figure: a feedback loop between the DM algorithm, visualization, and the user.)
64. Online Multivariate Analysis Challenges
- Data heterogeneity, biases, selection effects
- Non-Gaussianity of clusters (data models)
- Non-trivial topology of clustering
- Useful vs. useless parameters
(Figure: an outlier population, or a non-Gaussian tail?)
65. Useful vs. Useless Parameters
- Clusters (classes) and correlations may exist/separate in some parameter subspaces, but not in others
(Figure: the same data plotted in two parameter subspaces, with axes xi, xj and xn, xm.)
66. Optimal Statistics (the following slides adapted from Alex Szalay)
- Statistics algorithms scale poorly
  - Correlation functions are ~N^2, likelihood techniques ~N^3
  - Even if data and computers grow at the same rate, computers can only keep up with N log N algorithms
- Current methods assume infinite computational resources
  - And assume that the only source of error is statistical
  - Cosmic variance: we can only observe the Universe from one location (finite sample size)
- Possible solutions require a combination of statistics and CS
  - New algorithms that are no worse than N log N
67. Clever Data Structures
- Heavy use of tree structures
  - Initial cost ~N log N
  - Large speedup later
  - Tree codes for correlations (A. Moore et al. 2001)
- Fast, approximate heuristic algorithms
  - No need to be more accurate than cosmic variance
  - Fast CMB analysis by Szapudi et al. (2001): N log N instead of N^3, i.e., 1 day instead of 10 million years
- Take the cost of computation into account
  - Controlled level of accuracy
  - Best result in a given time, given our computing resources
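A toy illustration of the tree-code idea: counting close pairs (the raw ingredient of a two-point correlation function) with a KD-tree instead of an explicit N^2 double loop. This is a sketch assuming SciPy, not the Moore or Szapudi codes used for the real analyses:

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(3)
points = rng.random((100_000, 2))       # mock 2-D "galaxy" positions
r = 0.01                                # separation of interest

tree = cKDTree(points)                  # build cost ~N log N
count = tree.count_neighbors(tree, r)   # counts ordered pairs, including self-pairs
pairs = (count - len(points)) // 2      # remove self-pairs; each pair was counted twice
print("pairs within r:", pairs)         # the naive loop would touch ~5e9 pairs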
68. Angular Clustering with Photo-z
- w(θ), introduced by Peebles and Groth: the first example of publishing and analyzing large data
- Samples based on rest-frame quantities
- Strictly volume-limited samples
- Largest angular correlation study to date
- Very clear detection of luminosity and color dependence
- Results consistent with 3D clustering
T. Budavari, A. Connolly, I. Csabai, I. Szapudi, A. Szalay, S. Dodelson, J. Frieman, R. Scranton, D. Johnston, and the SDSS Collaboration
69. The Samples
- 2800 square degrees in 10 stripes; data in a custom DB
- All: 50 M
- mr < 21: 15 M
- 10 stripes: 10 M
- 0.1 < z < 0.3, -20 > Mr: 2.2 M
- 0.1 < z < 0.5, -21.4 > Mr: 3.1 M
- -20 > Mr > -21: 1182 k
- -21 > Mr > -23: 931 k
- -21 > Mr > -22: 662 k
- -22 > Mr > -23: 269 k
70. The Stripes
- 10 stripes over the SDSS area, covering about 2800 square degrees
- About 20% is lost due to bad seeing
- Masks: seeing, bright stars, etc.
- Images generated from a query by a web service
71. The Masks
- Stripe 11 masks
- Masks are derived from the database
- Search and intersect extended objects with
boundaries
72. The Analysis
- eSpICE: I. Szapudi, S. Colombi, and S. Prunet
- Integrated with the database by T. Budavari
- Extremely fast processing (N log N)
  - 1 stripe with about 1 million galaxies is processed in 3 minutes
  - The usual figure was 10 min for 10,000 galaxies, i.e., about 70 days
- Each stripe processed separately for each cut
- 2D angular correlation function computed
- w(θ) averaged with rejection of pixels along the scan
  - The flat-field vector causes mock correlations
73. Angular Correlations I
- Luminosity dependence: 3 cuts
  - -20 > M > -21
  - -21 > M > -22
  - -22 > M > -23
74. Angular Correlations II
- Color dependence
- 4 bins by rest-frame SED type
75. If there's time
- Better user interfaces: 0 TaskGalary.MPG
- Organizing photos: 1 Digital Photo.mpg
- Organizing newsgroups: 2 Communities.mpg
- Enhancing meetings: 3 flows.mpg
- Attentional interfaces: 4 Side Show.mpg
76. Thesis
- Most new information is digital (and old information is being digitized)
- A Computer Science Grand Challenge:
  - Capture
  - Organize
  - Summarize
  - Visualize
  - ... this information
- Optimize human attention as a resource
- Improve information quality
77. Information Avalanche
- The Situation: a census of the data
  - We can record everything
  - Everything is a LOT!
- The Good News
  - Changes science, education, medicine, entertainment, ...
  - Shrinks time and space
  - Can augment human intelligence
- The Bad News
  - The end of privacy
  - Cyber crime / cyber terrorism
  - Monoculture
- The Technical Challenges
  - Amplify human intellect
  - Organize, summarize, and prioritize information
  - Make programming easy