Title: Affordable Knowledge Discovery Through Distributed Data Mining
1 Affordable Knowledge Discovery Through Distributed Data Mining
- Rex E. Gantenbein and Chris Sung
- Computer Science Department
- University of Wyoming
2 What Is Data Mining?
Data mining is the extraction of interesting information or patterns from data in large databases.
Also known as: knowledge discovery in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence.
3 Data Mining: A Confluence of Multiple Disciplines
- Database Technology
- Statistics
- Machine Learning
- Visualization
- Information Science
- Other Disciplines
4 What is Distributed Data Mining?
- Data Mining in Distributed and Parallel Computing Environment
- Data Mining
- Distributed Parallel Processing
5 What is Distributed Data Mining?
- Data Mining (DM)
- Data Warehousing (DW)
- Distributed Computing (DC)
6 How can we distribute data mining?
- (Single/Multiple) Instruction, (Single/Multiple) Data
- Client/Server vs. P2P
- Homogeneous vs. Heterogeneous
- Local to Global vs. Local and Global
7 DC(2): A Survey of Distributed KDD Approaches
8 What is the problem?
- Accuracy
- DC(-) / DW()
- Performance
- DM, DW(-) / DC, Virtual DW()
- Affordability
- DM, DW(-) / DC, Virtual DW()
9 Affordable Distributed Data Mining (ADDM) approach
- Virtual DW used as an integrator only
- Assumes heterogeneous and homogeneous databases exist
- No aggregation
- Locally networked
- No relational DB server
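One way to picture a virtual DW that acts only as an integrator is a small metadata catalog that maps logical attribute names to the site and column where the data actually lives, reading rows on demand instead of copying them into a central store. The sketch below is illustrative only: the `VirtualDW` class, the site names, the column names, and the in-memory lists standing in for the distributed databases are all hypothetical and are not taken from the ADDM implementation.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical stand-ins for three locally networked databases.
# Each site uses its own column names (heterogeneous schemas).
SITES: Dict[str, List[dict]] = {
    "A": [{"cust": 1, "item": "bread"}, {"cust": 2, "item": "milk"}],
    "B": [{"customer_id": 1, "product": "milk"}],
    "C": [{"cid": 3, "sku": "bread"}],
}


class VirtualDW:
    """Holds only metadata; no rows are stored or aggregated here."""

    def __init__(self) -> None:
        # logical attribute -> {site name: physical column name}
        self.catalog: Dict[str, Dict[str, str]] = {}

    def register(self, logical: str, site: str, column: str) -> None:
        """Record where a logical attribute lives at a given site."""
        self.catalog.setdefault(logical, {})[site] = column

    def scan(self, *logical: str) -> Iterator[Tuple]:
        """Yield rows from every site that maps all requested attributes."""
        for site, rows in SITES.items():
            try:
                cols = [self.catalog[name][site] for name in logical]
            except KeyError:
                continue  # this site does not provide every attribute
            for row in rows:
                yield tuple(row[c] for c in cols)


dw = VirtualDW()
for site, cust_col, item_col in [
    ("A", "cust", "item"),
    ("B", "customer_id", "product"),
    ("C", "cid", "sku"),
]:
    dw.register("customer", site, cust_col)
    dw.register("item", site, item_col)

# Rows are pulled from the sites at query time, under unified names.
rows = list(dw.scan("customer", "item"))
```

In this sketch the catalog is the only state the integrator keeps, which is one reading of "no aggregation" and "no relational DB server"; a real deployment would replace the in-memory lists with connections to the actual sources.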
10 ADDM Architecture
11 ADDM Example: distributed DB
[Figure labels: A, B, C]
12 ADDM Example: build virtual DW
[Figure labels: B, A, C]
13 ADDM Example: data preparation
[Figure labels: A, B, C; Revised DW metadata]
14 ADDM Example: Affinity Grouping
Probabilities of 3 items and their combinations
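As a rough illustration of what such a table of probabilities could contain, the sketch below estimates the probability of every non-empty combination of three items as the fraction of transactions that contain it. The item names and transactions are made up for the example, and `itemset_probabilities` is a hypothetical helper, not part of ADDM.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def itemset_probabilities(
    transactions: List[Set[str]], items: List[str]
) -> Dict[FrozenSet[str], float]:
    """Estimate P(itemset) as the fraction of transactions containing it."""
    if not transactions:
        raise ValueError("at least one transaction is required")
    total = len(transactions)
    probs: Dict[FrozenSet[str], float] = {}
    # Enumerate singles, pairs, and the full triple.
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            hits = sum(1 for t in transactions if itemset <= t)
            probs[itemset] = hits / total
    return probs


# Made-up transactions over three hypothetical items.
transactions = [
    {"bread", "milk"},
    {"bread", "eggs"},
    {"bread", "milk", "eggs"},
    {"milk"},
]
probs = itemset_probabilities(transactions, ["bread", "milk", "eggs"])

for itemset, p in probs.items():
    print(sorted(itemset), p)
```

Three items give seven non-empty combinations, so the table stays small; the count grows exponentially with the number of items, which is why practical miners prune infrequent combinations early.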
15 ADDM Example: mining result
Measures for Association Rules
Confidence (highest):
$$\text{Confidence} = \frac{P(\text{condition and result})}{P(\text{condition})}$$
Improvement (> 1):
$$\text{Improvement} = \frac{P(\text{condition and result})}{P(\text{condition})\,P(\text{result})}$$
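A short worked example may make the two measures concrete: confidence divides P(condition and result) by P(condition), and improvement divides it by the product P(condition) P(result). The function names and the probabilities below are assumptions chosen for illustration, not results from the presentation.

```python
def confidence(p_both: float, p_condition: float) -> float:
    """P(condition and result) / P(condition)."""
    if p_condition <= 0:
        raise ValueError("P(condition) must be positive")
    return p_both / p_condition


def improvement(p_both: float, p_condition: float, p_result: float) -> float:
    """P(condition and result) / (P(condition) * P(result))."""
    if p_condition <= 0 or p_result <= 0:
        raise ValueError("P(condition) and P(result) must be positive")
    return p_both / (p_condition * p_result)


# Made-up probabilities for a single rule "condition -> result".
p_condition = 0.5
p_result = 0.5
p_both = 0.375

rule_confidence = confidence(p_both, p_condition)
rule_improvement = improvement(p_both, p_condition, p_result)

print(rule_confidence, rule_improvement)
```

With these numbers the confidence is 0.75 and the improvement is 1.5. An improvement above 1 means the condition and result occur together more often than they would if they were independent, which is why rules are kept only when improvement exceeds 1.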
16 Summary
- Goal: provide an architecture for Distributed DM
- Basic communications
- Will support a variety of DM techniques
- Satisfy accuracy, performance, affordability constraints
- Limitation: available only in particular DM environments (Visual C++, Windows, OLE DB (ADO))
17 Future Work
- Extend to more general DM environments (Java, ODBC)
- Upgrade the architecture to other distributed DM protocols (Sun's JXTA)