Title: Distributed Tera-Mining
1 Distributed Tera-Mining
R. L. Grossman
Laboratory for Advanced Computing, University of Illinois
Magnify, Inc.
2 1. Background
Three Fundamental Trends.
3 Trend 1. Explosion of Data
4 All in the Wrong Format
With no one to analyze it.
5 The Data Gap
Most data arrives a gigabyte or a terabyte at a time.
[Chart: The Data Gap. Series: total new disk (TB) since 1995; new Ph.D.s]
6 Data Mining is Inevitable
The goal of data mining is to close this gap.
7 Trend 2. SONET is Dead. Lambda Rules.
Gigabytes can be moved in seconds.
8 Gigabytes can be Moved in Minutes
- 1 TB in 1.5 hours; 10 GB in 1 minute
- 1 TB in 6 hours; 10 GB in 4 minutes
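Each pair of transfer times above implies a sustained throughput. The following is a minimal sketch of that arithmetic, assuming decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes) and ignoring protocol overhead. The function name `implied_gbps` is illustrative and does not come from the slides.

```python
def implied_gbps(num_bytes: float, seconds: float) -> float:
    """Sustained rate in gigabits per second needed to move num_bytes in seconds."""
    return num_bytes * 8 / seconds / 1e9

GB = 10**9   # assumption: decimal gigabyte
TB = 10**12  # assumption: decimal terabyte

# (label, bytes moved, elapsed seconds) for the four figures on the slide
transfers = [
    ("1 TB in 1.5 hours", 1 * TB, 1.5 * 3600),
    ("10 GB in 1 minute", 10 * GB, 60),
    ("1 TB in 6 hours", 1 * TB, 6 * 3600),
    ("10 GB in 4 minutes", 10 * GB, 4 * 60),
]

rates = {label: implied_gbps(n, s) for label, n, s in transfers}

if __name__ == "__main__":
    for label, rate in rates.items():
        print(f"{label}: about {rate:.2f} Gb/s")
```

Under these assumptions the first row works out to roughly 1.3 to 1.5 Gb/s and the second to roughly 0.33 to 0.37 Gb/s.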
9 Trend 3. Most Data is Distributed
- Bush's Law: The usefulness of a column of data
varies as the square of the number of columns it
is compared to.
10 Example 1: ENSO and Cholera
El Niño data at NCAR
Cholera data at WHO
11 Example 2: Voting
12 Correlation: Reform Voters vs. Votes for Buchanan
Palm Beach
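The voting example amounts to joining two per-county columns held in separate databases, measuring how well they correlate, and looking for the county that departs most from the trend. Below is a minimal sketch of that idea. The county labels "A" to "F" and all counts are made-up placeholders, not election data, and the least-squares residual used to flag the outlier is an assumed technique, not necessarily the one behind the slide.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def largest_residual(keys, xs, ys):
    """Fit y = a + b*x by least squares; return the key with the largest |residual|."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    residuals = {k: abs(y - (a + b * x)) for k, x, y in zip(keys, xs, ys)}
    return max(residuals, key=residuals.get)

# Hypothetical placeholder columns, keyed by county, from two separate sources.
reform_voters = {"A": 10, "B": 20, "C": 30, "D": 40, "E": 50, "F": 60}
buchanan_votes = {"A": 105, "B": 195, "C": 310, "D": 400, "E": 1500, "F": 590}

# Join the two columns on the counties they have in common.
counties = sorted(set(reform_voters) & set(buchanan_votes))
xs = [reform_voters[c] for c in counties]
ys = [buchanan_votes[c] for c in counties]

r = pearson(xs, ys)
outlier = largest_residual(counties, xs, ys)

if __name__ == "__main__":
    print(f"correlation: {r:.2f}, largest residual: county {outlier}")
```

In this placeholder data, county "E" plays the role that Palm Beach plays on the slide: it sits far from the line that fits the other counties.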
13 2. Internet Infrastructures for Data
Data Webs, Semantic Webs, Data Grids, Distributed
Data Mining, Digital Libraries, and all that
14 Data Mining
<pmml> <tree weight="0.3"> <tree-node
node-id="8" threshold="0.239494" etc. > </pmml>
[Diagram: learning set -> data mining algorithm -> statistical model]
- Data mining is the semi-automatic extraction of
patterns, models, changes, associations, and
anomalies from large data sets.
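The path from a learning set through an algorithm to a statistical model can be sketched in a few lines. The example below is an illustration under stated assumptions: it uses a one-node decision stump as a stand-in for the data mining algorithm, and it writes the result with element and attribute names copied from the fragment on the slide (`pmml`, `tree`, `tree-node`, `node-id`, `threshold`), which do not follow the actual PMML schema. The function names and the tiny learning set are hypothetical.

```python
import xml.etree.ElementTree as ET

def fit_stump(xs, ys):
    """Pick the threshold on xs that best separates binary labels ys.

    The stump predicts 1 when x > threshold and 0 otherwise.
    """
    pairs = sorted(zip(xs, ys))
    # Candidate thresholds: midpoints between neighbouring distinct values.
    candidates = [(a + b) / 2 for (a, _), (b, _) in zip(pairs, pairs[1:]) if a != b]

    def errors(t):
        # Number of learning-set records the stump would misclassify.
        return sum((x > t) != bool(y) for x, y in pairs)

    return min(candidates, key=errors)

def stump_to_xml(threshold, node_id=1):
    """Serialize the stump using the slide's element names (not real PMML)."""
    root = ET.Element("pmml")
    tree = ET.SubElement(root, "tree")
    ET.SubElement(tree, "tree-node",
                  {"node-id": str(node_id), "threshold": str(threshold)})
    return ET.tostring(root, encoding="unicode")

# Hypothetical learning set: one numeric attribute and a binary label.
learning_set_x = [1.0, 2.0, 3.0, 4.0]
learning_set_y = [0, 0, 1, 1]

threshold = fit_stump(learning_set_x, learning_set_y)
model_xml = stump_to_xml(threshold)

if __name__ == "__main__":
    print(model_xml)
```

Writing the model out as markup rather than keeping it as an in-memory object is what lets a different program, on a different machine, apply it later.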
15 Data Mining Process: End-to-End Viewpoint
16 DataSpace: One Approach to Making Data Useful
Complementary to the grid, which we view as a
distributed computer.
Today's Multi-media Web
- html
- http
- search by keyword
- workstations and servers
- 16 terabytes of documents
- 4 billion documents
Tomorrow's Data Web
- pmml and dtml
- dstp
- correlate and mine
- data and compute clusters
- petabytes of data
- tens of billions to trillions of records
17 View Data as a Collection of Distributed Columns
18 Data Servers and Data Browsers
WHO data in Geneva
NCAR data in Boulder
DataSpace
19 UCK uckid
attributes aid
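One way to read slides 17 to 19 is that each data server holds columns of attribute values, and that columns from different servers can be lined up through a shared key (the `uckid` on the slide). The sketch below illustrates only that idea. It assumes each server's column is available as a mapping from key to value, does not model the dstp protocol, and uses hypothetical keys, column names, and values throughout.

```python
from typing import Dict, List, Tuple

# A column maps a shared key (a stand-in for uckid) to one attribute value.
Column = Dict[str, float]

def merge_columns(columns: Dict[str, Column]) -> Tuple[List[str], List[dict]]:
    """Inner-join named columns on the keys they all share; return (keys, rows)."""
    shared = set.intersection(*(set(col) for col in columns.values()))
    keys = sorted(shared)
    rows = [{name: col[k] for name, col in columns.items()} for k in keys]
    return keys, rows

# Hypothetical columns, as if fetched from two separate data servers.
server_a_column = {"k1": 0.4, "k2": 0.9, "k3": 1.6}
server_b_column = {"k2": 12.0, "k3": 30.0, "k4": 41.0}

keys, rows = merge_columns({"anomaly": server_a_column, "cases": server_b_column})

if __name__ == "__main__":
    for k, row in zip(keys, rows):
        print(k, row)
```

Only the keys present on both servers survive the join, so a client can correlate the merged columns without either server needing to hold the other's data.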
20 3. Summary and Conclusion
21 Terra Mining Testbed
Optical testbed for distributed tera-mining of
scientific data.
A further goal is to serve as a testbed for
broadband-based business services.
22 Lessons Learned
- It's the data, stupid. Cycles, cylinders, and
lambdas are all commodities.
- The fundamental challenge: lower the cost to make
data useful.
- The emergence of internet infrastructure for data
is inevitable.
This opens up possibilities for new types of
scientific discoveries.
23 For More Information
- DataSpace
- http://www.dataspaceweb.net
- http://www.ncdm.uic.edu
- DataSpace Standards
- http://www.dmg.org
- Selected articles
- http://www.twocultures.net
- Magnify
- http://www.magnify.com
24 End of Slides
25 FTP Still Lives
26 Trend 2. Bandwidth is a Commodity
27 El Niño Anomalies
28 Indonesia Cholera Cases
29 Cholera Cases
30 Distributed Exabytes (New Disks)
[Chart: axis in petabytes; marked level at 1 exabyte]
Source: IDC (1999), "1999 Winchester Disk Drive
Market Forecast and Review"
31 Trend 3. Most Data is Distributed
- W's Law: The usefulness of a column of data
varies as the square of the number of columns it
is compared to.
32 Example 2: Voting
33 Database 1: Total Votes for Buchanan by County
34 Database 2: Total Registered Reform Voters by
County
35 Correlation: Total Votes vs. Buchanan Votes by
County
Palm Beach