Mining Distributed Databases - PowerPoint PPT Presentation

Description:

Mining Distributed Databases Raj Bhatnagar University of Cincinnati – PowerPoint PPT presentation

Slides: 22
Provided by: rajb152
Learn more at: https://eecs.ceas.uc.edu

Transcript and Presenter's Notes


1
Mining Distributed Databases
  • Raj Bhatnagar
  • University of Cincinnati

2
Distributed Databases
  • D = D1 X D2 X . . . X Dn
  • - D is implicitly specified
  • Goal: Discover patterns in the implicit D, using
    the explicit Di's

Geographically distributed nodes
Limitations
  • Can't move the Di's to a common site (size, communication cost, privacy)
  • Can't update local databases
  • Can't send actual data tuples
3
Explicit and Implicit Databases

Implicit Database
4
Decomposition of Computations
  • - Since D is implicit, F cannot be evaluated on D directly
  • - For a computation F(D)
  • - Decompose F into G and the gi's
  • - Decomposition depends on
  • - F
  • - the Di's
  • - the set of shared attributes
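The decomposition named on this slide can be written compactly as follows. This is a sketch: the names G and g_i follow the slide's "G and gs", but the exact argument lists are an assumption, since the slide's own formula did not survive in the transcript.

```latex
F(D) \;=\; G\bigl(g_1(D_1),\; g_2(D_2),\; \ldots,\; g_n(D_n)\bigr)
```

Under this reading, each g_i runs at the node that holds D_i and returns only a summary, and G combines the summaries at the learning site, so no data tuples leave their home database.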

5
Decomposition of Computations
  • Computational primitives
  • Arithmetic primitives
  • - Count of tuples in implicit D
  • - Mean value of an attribute in D
  • - Informational entropy for a subset of D
  • - Covariance matrix for D
  • Non-numeric primitives
  • - Median value of an attribute in D
  • - Sorting subsets of tuples in D

6
Decomposition of Computations
  • Computational cost of decomposition
  • Communication cost
  • - Number of messages exchanged
  • - Number of database queries
  • Who does the decomposition?
  • - Algorithm itself, at run time
  • - Depending on the nature of overlap in the Di's

7
Count All Tuples in Implicit D
Can be decomposed as
$N(D) = \sum_{J} \prod_{t=1}^{n} N(D_t)_{cond_J}$
  • $cond_J$: the Jth tuple in Shareds
  • $n$: number of participating databases (Di's)
  • $N(D_t)_{cond_J}$: count of tuples in $D_t$ satisfying
    $cond_J$
  • Local computation: $g_t(D_t, cond_J) = N(D_t)_{cond_J}$
  • G is a sum-of-products
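The sum-of-products count can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's implementation: the toy databases `D1` and `D2`, the shared attribute `A`, and the function names are all hypothetical, and every database is assumed to contain every shared attribute.

```python
from collections import Counter
from itertools import product
from math import prod

# Hypothetical toy data: two explicit databases that share attribute "A".
D1 = [{"A": 1, "B": 10}, {"A": 1, "B": 11}, {"A": 2, "B": 12}]
D2 = [{"A": 1, "C": 7}, {"A": 2, "C": 8}, {"A": 2, "C": 9}]
SHARED = ("A",)

def local_counts(db, shared=SHARED):
    """Local computation: count of tuples for each combination of shared values."""
    return Counter(tuple(row[a] for a in shared) for row in db)

def count_implicit(summaries):
    """Global computation G: sum over shared tuples of the product of local counts."""
    conds = set().union(*summaries)  # every shared-value combination seen anywhere
    return sum(prod(s[c] for s in summaries) for c in conds)

def count_by_join(dbs, shared=SHARED):
    """Reference check only: materialize the join, which the setting forbids."""
    return sum(
        1
        for rows in product(*dbs)
        if len({tuple(r[a] for a in shared) for r in rows}) == 1
    )

# Only the small Counter summaries would travel between nodes.
summaries = [local_counts(D1), local_counts(D2)]
total = count_implicit(summaries)  # 4 for this toy data
```

The summaries have one entry per combination of shared values that occurs locally, so with l shared attributes of k values each they hold at most k^l counts, regardless of how many tuples each database contains.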

8
Implementing Decomposed Computations

[Figure labels: Stationary Agents, Mobile Agents, Aglet, Messages]
9
Implementation of Count(D)
  • Stationary Agents
  • - Request / Send Summaries
  • - Simple SQL interface
  • - 1 count / message
  • - l attributes having k values each
  • - Query-code interface
  • - counts/message
  • - l attributes having k values each
  • Mobile Agents

Cost measures: messages exchanged (for each stationary-agent interface); number of hops (mobile agents)
10
Implementation of Count(D-test)
  • Stationary Agents
  • - Simple SQL interface
  • - Query-code interface
  • Mobile Agents

Shareds
L attributes k values each
tuples
11
Average Value of an attribute in D
  • Compute counts for each value of an attribute

  • Stationary Agents
  • - Simple SQL interface: messages exchanged (1 integer/message)
  • - Query-code interface: messages exchanged (integers/message)
  • Mobile Agents: number of hops
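The idea of computing an average from per-value counts can be sketched as below. This is an illustrative reading, not the presenter's code: it assumes the attribute of interest (`B`) lives in a single database, that one attribute (`A`) is shared, and the data and function names are hypothetical.

```python
from collections import Counter

# Hypothetical toy data: "B" lives only in D1; "A" is the shared attribute.
D1 = [{"A": 1, "B": 10}, {"A": 1, "B": 11}, {"A": 2, "B": 12}]
D2 = [{"A": 1, "C": 7}, {"A": 2, "C": 8}, {"A": 2, "C": 9}]

def value_counts(db, attr, shared):
    """Owner-site summary: count for each (shared value, attribute value) pair."""
    return Counter((row[shared], row[attr]) for row in db)

def shared_counts(db, shared):
    """Summary from every other site: count for each shared value."""
    return Counter(row[shared] for row in db)

def mean_in_implicit(owner_summary, other_summaries):
    """Mean of the attribute over the implicit D, using counts only."""
    weighted_sum = 0
    n_tuples = 0
    for (cond, value), cnt in owner_summary.items():
        weight = cnt
        for summary in other_summaries:
            weight *= summary[cond]  # how often this local tuple repeats in D
        weighted_sum += value * weight
        n_tuples += weight
    return weighted_sum / n_tuples  # raises ZeroDivisionError if D is empty

mean_b = mean_in_implicit(
    value_counts(D1, "B", "A"),
    [shared_counts(D2, "A")],
)  # 11.25 for this toy data
```

Each value of `B` is weighted by the number of tuples of the implicit D in which it appears, which is why the other sites only need to report counts per shared value.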
12
Exception Tuples
  • Database of interest may exclude some tuples of D
  • Learning site keeps a relation E of exception
    tuples
  • E may have explicit tuples
  • E may have rules to generate exception tuples

[Figure: example over attributes A, B, C, E showing the SharedSet, the Exceptions relation, and the Explicit Databases]
13
Computing Informational Entropy
  • Consists of various counts only
  • Stationary agent/Simple SQL interface
  • Stationary agent/Query-code interface
  • Mobile agent

Messages exchanged
Messages exchanged
Number of messages/hops is independent of
the size of D
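Because entropy "consists of various counts only", it can be computed from the same kind of summaries used for counting. The sketch below assumes the class attribute (`C`) lives in one database and a single attribute (`A`) is shared; the data and names are hypothetical.

```python
from collections import Counter
from math import log2

# Hypothetical toy data: class attribute "C" lives in D2; "A" is shared.
D1 = [{"A": 1, "B": 10}, {"A": 1, "B": 11}, {"A": 2, "B": 12}]
D2 = [{"A": 1, "C": 7}, {"A": 2, "C": 8}, {"A": 2, "C": 9}]

def class_counts(owner_db, class_attr, shared, other_summaries):
    """Count of each class value in the implicit D, built from count summaries."""
    owner = Counter((row[shared], row[class_attr]) for row in owner_db)
    counts = Counter()
    for (cond, label), cnt in owner.items():
        weight = cnt
        for summary in other_summaries:
            weight *= summary[cond]
        counts[label] += weight
    return counts

def entropy(counts):
    """Shannon entropy (base 2) of a distribution given as raw counts."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values() if c > 0)

d1_summary = Counter(row["A"] for row in D1)  # the only data D1 has to send
counts = class_counts(D2, "C", "A", [d1_summary])
h = entropy(counts)  # 1.5 bits for this toy data
```

The amount of data exchanged depends on the number of shared-value combinations and class values, not on the number of tuples in D.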
14
Decomposition of Algorithms
  • Arithmetic primitives are 1-step decompositions
  • - Counts, averages, entropy
  • Algorithms involve
  • - Arithmetic primitives
  • - Non-numeric primitives
  • - Control structure
  • Decomposition studied for
  • - Decision tree induction algorithm
  • - Mining of association rules
  • Control structure is unaltered
  • Primitive computations are decomposed
  • Learner Node
  • - Control structure
  • - Decomposition
  • - Composition

15
Building a Decision Tree
  • To induce a decision tree having
  • - d levels, m attributes in n databases,
    l shared attributes
  • - k values/attribute
  • Stationary agent/Simple SQL interface
  • Stationary agent/Query-code interface
  • Mobile agent

Number of messages/hops is independent of
the size of D
16
Mining Association Rules
  • Main operations
  • - Enumerate item-sets
  • - Compute support/confidence
  • - Basic computation: count of tuples
  • Communication Complexity
  • - m (avg.) item sets at each level of
    enumeration tree
  • - j levels of enumeration tree
  • - Query-code can count for all item sets at a
    level simultaneously
  • - Therefore, we need

Number of Counts Needed
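Counting support for all item sets at one level in a single exchange could look roughly like this. It is a sketch under assumptions, not the presenter's query-code interface: item sets are modeled as frozensets of hypothetical (attribute, value) pairs, one attribute (`A`) is shared, and each node is non-empty and filters only on the attributes it holds.

```python
from collections import Counter
from math import prod

# Hypothetical toy data: "A" is shared; "B" and "C" live on different nodes.
D1 = [{"A": 1, "B": 10}, {"A": 1, "B": 11}, {"A": 2, "B": 12}]
D2 = [{"A": 1, "C": 7}, {"A": 2, "C": 8}, {"A": 2, "C": 9}]
SHARED = "A"

def local_reply(db, itemsets, shared=SHARED):
    """One reply per node: per-shared-value counts for every item set of the level."""
    reply = {}
    for items in itemsets:
        # Keep only the items this node can check (attributes it stores).
        local = [(a, v) for a, v in items if a in db[0]]
        reply[items] = Counter(
            row[shared] for row in db if all(row[a] == v for a, v in local)
        )
    return reply

def supports(dbs, itemsets):
    """Support counts in the implicit D for all item sets of one level."""
    replies = [local_reply(db, itemsets) for db in dbs]
    result = {}
    for items in itemsets:
        per_node = [r[items] for r in replies]
        conds = set().union(*per_node)
        result[items] = sum(prod(c[t] for c in per_node) for t in conds)
    return result

level = [frozenset({("B", 12)}), frozenset({("B", 12), ("C", 8)})]
support = supports([D1, D2], level)
```

Here each node answers once for the whole level, so the number of exchanges grows with the number of levels rather than with the number of item sets.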
17
More Complex Computations
  • Covariance matrix for D
  • - Useful for eigenvectors/principal components
  • - Needs second-order moments
  • Graph/Network algorithms
  • - Each node has part of a graph
  • - Some nodes are shared
  • - Determine MST
  • - Paths of min/max flow
  • - Flow patterns

18
Sum of Products
  • Sum of products for two attributes
  • There are six different ways in which x and y may
    be distributed
  • Each requires a different decomposition
  • Case 1: x is the same as y and x belongs to the
    SharedSet.
  • Case 2: x is the same as y and x does not belong to the
    SharedSet.
  • Case 3: x and y both belong to the SharedSet.

19
Sum of Products
  • Case 4: x belongs to the SharedSet and y does not.
  • Case 5: x and y don't belong to the SharedSet and
    reside on different nodes.
  • - For each tuple t in the SharedSet, obtain
  • - and then
  • Case 6: x and y don't belong to the SharedSet and
    reside on the same node.
  • - where Prod(t) is the average of the product of x and y for
    cond-t of the SharedSet
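For Case 5 (x and y on different nodes, neither shared), one decomposition consistent with the description is to exchange per-tuple sums and multiply them at the learner. The slide's own formulas did not survive in the transcript, so this is an assumed reconstruction for exactly two nodes, with hypothetical data and names.

```python
from collections import defaultdict

# Hypothetical toy data: x = "B" lives in D1, y = "C" lives in D2, "A" is shared.
D1 = [{"A": 1, "B": 10}, {"A": 1, "B": 11}, {"A": 2, "B": 12}]
D2 = [{"A": 1, "C": 7}, {"A": 2, "C": 8}, {"A": 2, "C": 9}]

def local_sums(db, attr, shared):
    """Local summary: sum of `attr` for each value of the shared attribute."""
    sums = defaultdict(int)
    for row in db:
        sums[row[shared]] += row[attr]
    return dict(sums)

def sum_of_products(x_sums, y_sums):
    """Sum of x*y over the implicit D for two nodes.

    Every D1 tuple with shared value t pairs with every D2 tuple with the
    same t, so the contribution of t is (sum of x) * (sum of y).
    """
    return sum(x_sums[t] * y_sums[t] for t in x_sums.keys() & y_sums.keys())

sxy = sum_of_products(
    local_sums(D1, "B", "A"),
    local_sums(D2, "C", "A"),
)  # 351 for this toy data
```

With more than two nodes, each term would also need to be multiplied by the per-tuple counts from the remaining nodes, in the same way as the count decomposition.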
20
Self-decomposing Algorithms
  • Easy decomposability of arithmetic primitives
  • - Average, covariance matrix, entropy
  • Control structure of algorithms is not altered
  • - More gains possible by altering control
    structure
  • Decomposition is driven by the set of shared
    attributes
  • - Algorithm can determine shared attributes in n
    messages/hops
  • Algorithms decompose in accordance with attribute
    sharing
  • - No human intervention needed
  • Message complexity is independent of sizes of
    databases

21
Continuing Work
  • Determine patterns of flow in a network
  • Communication network traffic
  • Geographic/economic flows

[Figure: network of nodes, each holding local flow data]