Title: Distributed TeraMining


1
Distributed Tera-Mining
R. L. Grossman
Laboratory for Advanced Computing, University of Illinois
Magnify, Inc.
2
1. Background
Three Fundamental Trends.
3
Trend 1. Explosion of Data
4
All in the Wrong Format
With no one to analyze it.
5
The Data Gap
Most data comes a GB and a TB at a time.
Chart: total new disk (TB) since 1995 vs. new Ph.D.s
6
Data Mining is Inevitable
The goal of data mining is to close this gap.
7
Trend 2. SONET is dead. Lambda rules.
Gigabytes can be moved in seconds.
8
Gigabytes can be Moved in Minutes
1 TB in 1.5 hours; 10 GB in 1 minute
1 TB in 6 hours; 10 GB in 4 minutes
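The slide gives transfer times but not link rates. As a rough check, the sustained throughput each figure implies can be worked out as below. This is a sketch that assumes decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes) and ignores protocol overhead; the function name `implied_gbps` is made up for this example.

```python
def implied_gbps(size_bytes: float, seconds: float) -> float:
    """Sustained rate, in gigabits per second, needed to move size_bytes in seconds."""
    return size_bytes * 8 / seconds / 1e9


GB = 1e9   # assumption: decimal gigabyte
TB = 1e12  # assumption: decimal terabyte

# The four figures quoted on the slide.
transfers = [
    ("1 TB in 1.5 hours", TB, 1.5 * 3600),
    ("10 GB in 1 minute", 10 * GB, 60),
    ("1 TB in 6 hours", TB, 6 * 3600),
    ("10 GB in 4 minutes", 10 * GB, 4 * 60),
]

for label, size, secs in transfers:
    print(f"{label}: about {implied_gbps(size, secs):.2f} Gbit/s")
```

Under these assumptions the first pair of figures corresponds to a sustained rate somewhat above 1 Gbit/s, and the second pair to a few hundred Mbit/s.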
9
Trend 3. Most Data is Distributed
  • Bush's Law: The usefulness of a column of data
    varies as the square of the number of columns it
    is compared to.
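One way to see where quadratic growth might come from (an interpretation, not something the slide states) is to count the distinct pairs of columns that can be compared: with n columns there are n(n-1)/2 such pairs. The column names below are hypothetical placeholders.

```python
from itertools import combinations


def pair_count(n_columns: int) -> int:
    """Number of distinct column pairs among n_columns columns: n(n-1)/2."""
    return n_columns * (n_columns - 1) // 2


# Hypothetical column names, for illustration only.
columns = ["sea_surface_temp", "cholera_cases", "rainfall", "population"]

# Enumerate every unordered pair of columns that could be compared.
pairs = list(combinations(columns, 2))

for n in (2, 4, 10, 100):
    print(f"{n} columns -> {pair_count(n)} possible pairwise comparisons")
```

Each new column adds one comparison against every column already available, which is why bringing distributed columns together pays off faster than linearly.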

10
Example 1: ENSO and Cholera
El Niño Data at NCAR
Cholera Data at WHO
11
Example 2: Voting
12
Correlation: Reform Voters vs. Votes for Buchanan
Palm Beach
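The correlation on this slide relates two per-county columns. A minimal sketch of a Pearson correlation over two such columns follows. The numbers are made up for illustration and are not election data, and the slide does not say which correlation measure was used.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical per-county values, one entry per county (NOT real data).
reform_voters = [100, 200, 300, 400, 500]
buchanan_votes = [210, 390, 610, 800, 990]

r = pearson(reform_voters, buchanan_votes)
print(f"correlation: {r:.4f}")
```

A county that falls far from the trend set by the others, as the plot label suggests for Palm Beach, would stand out in a scatter plot of the two columns.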
13
2. Internet Infrastructures for Data
Data Webs, Semantic Webs, Data Grids, Distributed
Data Mining, Digital Libraries and all that
14
Data Mining
<pmml> <tree weight="0.3"> <tree-node
node-id="8" threshold="0.239494" etc. > </pmml>
data mining algorithm
learning set
statistical model
  • Data mining is the semi-automatic extraction of
    patterns, models, changes, associations, and
    anomalies from large data sets.

15
Data Mining Process: End-to-End Viewpoint
50 0 50
16
DataSpace: One Approach to Making Data Useful
Complementary to the grid, which we view as a
distributed computer.
Today's Multi-media Web
  • html
  • http
  • search by keyword
  • workstations and servers
  • 16 terabytes of documents
  • 4 billion documents

Tomorrow's Data Web
  • pmml and dtml
  • dstp
  • correlate and mine
  • data and compute clusters
  • petabytes of data
  • tens of billions to trillions of records

17
View Data as a Collection of Distributed Columns
18
Data Servers and Data Browsers
WHO data in Geneva
NCAR data in Boulder
DataSpace
19
UCK uckid
attributes aid
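The slide pairs a key (uckid) with attributes (aid). As a sketch of how columns held on different servers could be lined up by a shared key, the example below treats each column as a mapping from key to value and joins on the keys both sides have. The dictionaries, their values, and the function `join_columns` are hypothetical; this is not the DSTP protocol or any DataSpace API.

```python
def join_columns(left: dict, right: dict) -> dict:
    """Inner-join two key -> value columns on the keys they share."""
    shared_keys = left.keys() & right.keys()
    return {key: (left[key], right[key]) for key in sorted(shared_keys)}


# Hypothetical columns, each imagined as living on a different data server.
# Keys play the role of the shared correlation key; values are made up.
enso_by_month = {"1997-10": 2.3, "1997-11": 2.4, "1997-12": 2.3}
cholera_by_month = {"1997-11": 120, "1997-12": 150, "1998-01": 90}

joined = join_columns(enso_by_month, cholera_by_month)
for key, (enso, cases) in joined.items():
    print(key, enso, cases)
```

Only keys present in both columns survive the join, so the result can be fed directly into a correlation or mining step.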
20
3. Summary and Conclusion
21
Terra Mining Testbed
Optical testbed for distributed tera mining of
scientific data.
Goal: also to be a testbed for broadband-based
business services.
22
Lessons Learned
  • It's the data, stupid. Cycles, cylinders, and
    lambdas are all commodities.
  • The fundamental challenge: lower the cost to make
    data useful.
  • The emergence of internet infrastructure for data
    is inevitable.

Opens up possibilities for new types of
scientific discoveries.
23
For More Information
  • DataSpace
  • http://www.dataspaceweb.net
  • http://www.ncdm.uic.edu
  • DataSpace Standards
  • http://www.dmg.org
  • Selected articles
  • http://www.twocultures.net
  • Magnify
  • http://www.magnify.com

24
End of Slides
25
FTP Still Lives
26
Trend 2. Bandwidth is a Commodity
27
El Nina Anomalies
28
Indonesia Cholera Cases
29
Cholera Cases
30
Distributed Exabytes (New Disks)
Petabytes
1 Exabyte
Source: IDC (1999) "1999 Winchester Disk Drive
Market Forecast and Review"
31
Trend 3. Most Data is Distributed
  • W's Law: The usefulness of a column of data
    varies as the square of the number of columns it
    is compared to.

32
Example 2: Voting
33
Database 1: Total Votes for Buchanan by County
34
Database 2: Total Registered Reform Voters by
County
35
Correlation: Total Votes vs. Buchanan Votes by
County
Palm Beach