Title: Mining Distributed Databases
1Mining Distributed Databases
- Raj Bhatnagar
- University of Cincinnati
2Distributed Databases
- D = D1 × D2 × . . . × Dn
- - D is implicitly specified
- Goal: discover patterns in the implicit D, using the explicit Di's
Geographically distributed nodes
- Limitations
- - Can't move the Di's to a common site (size, communication cost, privacy)
- - Can't update local databases
- - Can't send actual data tuples
3Explicit and Implicit Databases
Implicit Database
4Decomposition of Computations
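- Since D is implicit, a computation F on D cannot be evaluated directly
- - For a computation F(D)
- - Decompose F into G and the gi's (each gi is a local computation on Di; G combines the local results)
- Decomposition depends on
- - F
- - the Di's
- - the set of shared attributes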
5Decomposition of Computations
- Computational primitives
- Arithmetic primitives
- Count of tuples in implicit D
- Mean Value of an attribute in D
- Informational entropy for a subset of D
- Covariance matrix for D
- Non-numeric primitives
- Median value of an attribute in D
- Sorting subsets of tuples in D
6Decomposition of Computations
- Computational cost of decomposition
- Communication cost
- Number of messages exchanged
- Number of database queries
- Who does the decomposition?
- Algorithm itself, at run time
- Depending on the nature of overlap in the Di's
7Count All Tuples in Implicit D
Can be decomposed as
$N(D) = \sum_{J} \prod_{t=1}^{n} N(D_t)_{cond_J}$
- cond_J: the J-th tuple in Shareds
- n: number of participating databases (the Di's)
- N(D_t)_{cond_J}: count of tuples in D_t satisfying cond_J
- Local computation: g_t(D_t, cond_J) = N(D_t)_{cond_J}
- G is a sum-of-products
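A minimal sketch of the sum-of-products count, assuming the implicit D is the join of the explicit databases on the shared attributes and that every database holds every shared attribute. The names `local_count` and `implicit_count`, the dictionary-based rows, and the sample data are hypothetical and only illustrate the idea.

```python
from itertools import product
from math import prod

def local_count(db, cond):
    """Local computation: count the tuples of one explicit database matching cond."""
    return sum(all(row[a] == v for a, v in cond.items()) for row in db)

def implicit_count(databases, shared_values):
    """G: sum over the tuples of Shareds of the product of the local counts."""
    attrs = sorted(shared_values)
    total = 0
    for combo in product(*(shared_values[a] for a in attrs)):
        cond = dict(zip(attrs, combo))  # one tuple of Shareds (cond_J)
        total += prod(local_count(db, cond) for db in databases)
    return total

# Two hypothetical explicit databases sharing attribute "A"
D1 = [{"A": 1, "B": 10}, {"A": 1, "B": 11}, {"A": 2, "B": 12}]
D2 = [{"A": 1, "C": 7}, {"A": 2, "C": 8}, {"A": 2, "C": 9}]

count = implicit_count([D1, D2], {"A": [1, 2]})
print(count)  # 2*1 for A=1 plus 1*2 for A=2
```

Only counts cross node boundaries here; no data tuple leaves its database.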
8Implementing Decomposed Computations
Stationary Agents
Mobile Agents
Aglet
Messages
9Implementation of Count(D)
- Stationary Agents
- - Request / Send Summaries
- - Simple SQL interface
- - 1 count / message
- - l attributes having k values each
- - Query-code interface
- - counts/message
- - l attributes having k values each
- Mobile Agents
Messages exchanged
Messages exchanged
Number of hops
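The difference between the two stationary-agent interfaces can be illustrated with a small simulation. This is a sketch under assumptions, not the implementation from the talk: the `StationaryAgent` class, its method names, and the sample data are hypothetical, and only request messages are counted (each request also has a reply).

```python
from itertools import product

class StationaryAgent:
    """Hypothetical agent at one explicit database; counts the requests it receives."""
    def __init__(self, db):
        self.db = db
        self.messages = 0

    def _count(self, cond):
        return sum(all(r[a] == v for a, v in cond.items()) for r in self.db)

    def simple_sql(self, cond):
        """Simple SQL interface: one request returns one count."""
        self.messages += 1
        return self._count(cond)

    def query_code(self, conds):
        """Query-code interface: one request returns the counts for all conditions."""
        self.messages += 1
        return [self._count(c) for c in conds]

def shared_conditions(shared_values):
    """All value combinations of the shared attributes (the tuples of Shareds)."""
    attrs = sorted(shared_values)
    return [dict(zip(attrs, c)) for c in product(*(shared_values[a] for a in attrs))]

def count_simple_sql(agents, conds):
    total = 0
    for cond in conds:
        term = 1
        for agent in agents:
            term *= agent.simple_sql(cond)
        total += term
    return total

def count_query_code(agents, conds):
    per_agent = [agent.query_code(conds) for agent in agents]
    total = 0
    for j in range(len(conds)):
        term = 1
        for counts in per_agent:
            term *= counts[j]
        total += term
    return total

# Two hypothetical databases sharing attributes "A" and "S" (l = 2, k = 2)
D1 = [{"A": 0, "S": 0, "B": 5}, {"A": 0, "S": 1, "B": 6}, {"A": 1, "S": 1, "B": 7}]
D2 = [{"A": 0, "S": 0, "C": 1}, {"A": 1, "S": 1, "C": 2}, {"A": 1, "S": 1, "C": 3}]
conds = shared_conditions({"A": [0, 1], "S": [0, 1]})

agents_sql = [StationaryAgent(D1), StationaryAgent(D2)]
agents_qc = [StationaryAgent(D1), StationaryAgent(D2)]
total_sql = count_simple_sql(agents_sql, conds)
total_qc = count_query_code(agents_qc, conds)
msgs_sql = sum(a.messages for a in agents_sql)
msgs_qc = sum(a.messages for a in agents_qc)
print(total_sql, msgs_sql, total_qc, msgs_qc)
```

In this simulation the simple SQL interface needs one request per database per shared-value combination, while the query-code interface needs one request per database. Neither number depends on how many tuples the databases hold.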
10Implementation of Count(D-test)
- Stationary Agents
- - Simple SQL interface
- - Query-code interface
- Mobile Agents
Shareds
L attributes k values each
tuples
11Average Value of an attribute in D
- Compute counts for each value of an attribute
- Stationary Agents
- - Simple SQL interface
- - Query-code interface
- Mobile Agents
Messages exchanged
(1 integer/message)
Messages exchanged
integers/message
Number of hops
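Computing the average from per-value counts can be sketched as follows. The sketch assumes that D is the join of the explicit databases on the shared attributes, that the attribute of interest is not shared and lives in exactly one database (index `owner`), and that its possible values are known. All function names and the sample data are hypothetical.

```python
from itertools import product

def local_count(db, cond):
    """Count the tuples of one explicit database matching cond."""
    return sum(all(row[a] == v for a, v in cond.items()) for row in db)

def implicit_count_where(databases, shared_values, owner, attr, value):
    """Count tuples of the implicit D with attr == value.

    The extra test attr == value is applied only at databases[owner],
    the one database assumed to hold attr.
    """
    names = sorted(shared_values)
    total = 0
    for combo in product(*(shared_values[a] for a in names)):
        cond = dict(zip(names, combo))
        term = 1
        for i, db in enumerate(databases):
            local_cond = {**cond, attr: value} if i == owner else cond
            term *= local_count(db, local_cond)
        total += term
    return total

def implicit_mean(databases, shared_values, owner, attr, values):
    """Mean of attr over the implicit D, built from one count per value of attr."""
    counts = {v: implicit_count_where(databases, shared_values, owner, attr, v)
              for v in values}
    n = sum(counts.values())  # assumed non-zero in this sketch
    return sum(v * c for v, c in counts.items()) / n

D1 = [{"A": 1, "B": 10}, {"A": 1, "B": 20}, {"A": 2, "B": 10}]
D2 = [{"A": 1, "C": 7}, {"A": 2, "C": 8}, {"A": 2, "C": 9}]

mean_b = implicit_mean([D1, D2], {"A": [1, 2]}, owner=0, attr="B", values=[10, 20])
print(mean_b)
```

The joined D holds the B values 10, 20, 10, 10, so the per-value counts are 3 and 1. Only integers travel between nodes.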
12Exception Tuples
- Database of interest may exclude some tuples of D
- Learning site keeps a relation E of exception tuples
- - E may have explicit tuples
- - E may have rules to generate exception tuples
[Figure: example tables labeled SharedSet, Exceptions, and Explicit Databases; column labels C, A, E, B]
13Computing Informational Entropy
- Consists of various counts only
- Stationary agent/Simple SQL interface
- Stationary agent/Query-code interface
- Mobile agent
Messages exchanged
Messages exchanged
Number of messages/hops is independent of
the size of D
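Because entropy consists of counts only, the learner can compute it once the per-class counts have arrived. A minimal sketch using the standard Shannon entropy in bits; the function name and the example counts are hypothetical, and the counts are assumed to have been obtained through a decomposed count over the implicit D.

```python
from math import log2

def entropy_from_counts(counts):
    """Shannon entropy (in bits) of a distribution given as a list of counts."""
    n = sum(counts)
    # Zero counts contribute nothing and are skipped to avoid log2 of zero.
    return sum((c / n) * log2(n / c) for c in counts if c)

# Hypothetical class counts for a subset of D
h = entropy_from_counts([3, 1])
print(round(h, 4))
```

The input is one integer per class value, so the amount of data exchanged does not grow with the number of tuples.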
14Decomposition of Algorithms
- Arithmetic primitives are 1-step decompositions
- Counts, averages, entropy
- Algorithms involve
- Arithmetic primitives
- Non-numeric primitives
- Control structure
- Decomposition studied for
- Decision tree induction algorithm
- Mining of association rules
- Control structure is unaltered
- Primitive computations are decomposed
- Learner Node
- Control structure
- Decomposition
- Composition
15Building a Decision Tree
- To induce a decision tree having
- - d levels
- - m attributes in n databases
- - l shared attributes
- - k values/attribute
- Stationary agent/Simple SQL interface
- Stationary agent/Query-code interface
- Mobile agent
Number of messages/hops is independent of
the size of D
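One way to picture a learner node that keeps the control structure of tree induction and touches the data only through counts is sketched below. This is an illustration under assumptions, not the algorithm from the talk: `best_split`, the `count` callable, and the toy table are hypothetical. In a distributed setting `count` would be replaced by a decomposed count over the implicit D.

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return sum((c / n) * log2(n / c) for c in counts if c)

def best_split(count, attributes, domains, target, path):
    """Pick the attribute with the largest information gain using only count queries.

    count(cond) is the sole access to the data; path holds the tests made so far.
    """
    def class_counts(cond):
        return [count({**cond, target: c}) for c in domains[target]]

    base_counts = class_counts(path)
    n = sum(base_counts)
    base = entropy(base_counts)
    gains = {}
    for a in attributes:
        remainder = 0.0
        for v in domains[a]:
            cc = class_counts({**path, a: v})
            if sum(cc):
                remainder += sum(cc) / n * entropy(cc)
        gains[a] = base - remainder
    return max(gains, key=gains.get), gains

# Stand-in for the decomposed count: a single local table (toy data)
rows = [
    {"outlook": "sun", "windy": 0, "play": "no"},
    {"outlook": "sun", "windy": 1, "play": "no"},
    {"outlook": "rain", "windy": 0, "play": "yes"},
    {"outlook": "rain", "windy": 1, "play": "yes"},
]

def count(cond):
    return sum(all(r[a] == v for a, v in cond.items()) for r in rows)

domains = {"outlook": ["sun", "rain"], "windy": [0, 1], "play": ["no", "yes"]}
best, gains = best_split(count, ["outlook", "windy"], domains, "play", {})
print(best, gains)
```

Swapping `count` for a distributed version leaves `best_split` untouched, which is the sense in which only the primitive computations are decomposed.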
16Mining Association Rules
- Main operations
- - Enumerate item-sets
- - Compute support/confidence
- - Basic computation Count-of-tuples
- Communication Complexity
- - m (avg.) item sets at each level of enumeration tree
- - j levels of enumeration tree
- - Query-code can count for all item sets at a level simultaneously
- - Therefore, we need
Number of Counts Needed
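Support and confidence reduce to count-of-tuples, which the following sketch makes explicit. It uses the standard definitions and assumes item sets are represented as `frozenset` objects. The `count_level` helper, which returns the counts for all item sets of one level in a single call, and the sample transactions are hypothetical stand-ins for a query-code request over the implicit D.

```python
def count_level(transactions, itemsets):
    """Counts for all item sets of one enumeration level, gathered in one call."""
    return {s: sum(1 for t in transactions if s <= t) for s in itemsets}

def support(counts, itemset, n_tuples):
    """Fraction of tuples that contain the item set."""
    return counts[itemset] / n_tuples

def confidence(counts, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    return counts[antecedent | consequent] / counts[antecedent]

# Toy transactions standing in for the tuples of D
transactions = [frozenset("ab"), frozenset("ac"), frozenset("abc"), frozenset("b")]
a, b = frozenset("a"), frozenset("b")

counts = {}
counts.update(count_level(transactions, [a, b]))   # level 1
counts.update(count_level(transactions, [a | b]))  # level 2

sup_ab = support(counts, a | b, len(transactions))
conf_a_b = confidence(counts, a, b)
print(sup_ab, conf_a_b)
```

With one call per level, the number of requests follows the depth of the enumeration tree rather than the number of item sets.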
17More Complex Computations
- Covariance matrix for D
- Useful for eigenvectors/principal components
- Needs second-order moments
- Graph/Network algorithms
- Each node has part of a graph
- Some nodes are shared
- Determine MST
- Paths of Min/Max flow
- flow patterns
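For the covariance matrix item above, the standard identity below shows why second-order moments are needed. It is the textbook population covariance, not a formula taken from the slides; N denotes the count of tuples in D, and x_i, y_i are the values of attributes x and y in the i-th tuple.

```latex
\mathrm{Cov}(x, y)
  = \frac{1}{N} \sum_{i=1}^{N} x_i y_i
  - \left( \frac{1}{N} \sum_{i=1}^{N} x_i \right)
    \left( \frac{1}{N} \sum_{i=1}^{N} y_i \right)
```

The two means are first-order quantities; the sum of products of x and y is the second-order term that has to be decomposed separately.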
18Sum of Products
- Sum of products for two attributes
- There are six different ways in which x and y may be distributed
- Each requires a different decomposition
- Case 1: x same as y and x belongs to the SharedSet.
- Case 2: x same as y and x does not belong to the SharedSet.
- Case 3: x and y both belong to the SharedSet.
19Sum of Products
- Case 4: x belongs to SharedSet and y does not.
- Case 5: x, y don't belong to the SharedSet and reside on different nodes.
- - For each tuple t in SharedSet, obtain
- - and then
- Case 6: x, y don't belong to the SharedSet and reside on the same node.
where
Prod(t) is the average of the product of x and y for cond-t of SharedSet
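The formulas for the individual cases did not survive in this transcript. The sketch below is one possible decomposition for Case 5, derived under the assumption that D is the join of the explicit databases on the shared attributes; it is not necessarily the decomposition shown in the talk, and it works with local sums rather than averages. For each shared tuple t, every matching x-tuple pairs with every matching y-tuple, so the sum of products for t factors into (sum of x) times (sum of y), scaled by the counts from any other databases. All names and data are hypothetical.

```python
from itertools import product

def matches(row, cond):
    return all(row[a] == v for a, v in cond.items())

def sum_xy_case5(db_x, db_y, others, shared_values, x, y):
    """Sum of x*y over the implicit D when x and y sit on different nodes."""
    names = sorted(shared_values)
    total = 0
    for combo in product(*(shared_values[a] for a in names)):
        cond = dict(zip(names, combo))
        sum_x = sum(r[x] for r in db_x if matches(r, cond))  # local to x's node
        sum_y = sum(r[y] for r in db_y if matches(r, cond))  # local to y's node
        term = sum_x * sum_y
        for db in others:                                    # remaining nodes send counts
            term *= sum(1 for r in db if matches(r, cond))
        total += term
    return total

def brute_force(db_x, db_y, shared, x, y):
    """Reference value: materialize the join of the two databases (check only)."""
    return sum(r[x] * s[y] for r in db_x for s in db_y
               if all(r[a] == s[a] for a in shared))

Dx = [{"A": 1, "x": 2}, {"A": 1, "x": 3}, {"A": 2, "x": 4}]
Dy = [{"A": 1, "y": 10}, {"A": 2, "y": 1}, {"A": 2, "y": 2}]

decomposed = sum_xy_case5(Dx, Dy, [], {"A": [1, 2]}, "x", "y")
reference = brute_force(Dx, Dy, ["A"], "x", "y")
print(decomposed, reference)
```

The brute-force join is included only to check the decomposition on toy data; in the distributed setting it is exactly the step that cannot be performed.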
20Self-decomposing Algorithms
- Easy decomposability of arithmetic primitives
- Average/Covariance matrix/Entropy
- Control structure of algorithms is not altered
- More gains possible, by altering control structure
- Decomposition is driven by the set of shared attributes
- Algorithm can determine shared attributes in n messages/hops
- Algorithms decompose in accordance with attribute sharing
- No human intervention needed
- Message complexity is independent of sizes of databases
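Determining the shared attributes in n messages can be sketched as one schema report per node. The sketch assumes that an attribute counts as shared when it appears in at least two databases; that definition, the function name, and the sample schemas are assumptions for illustration.

```python
from collections import Counter

def shared_attributes(schemas):
    """Return the shared attributes and the number of messages used.

    schemas holds one attribute list per node; each list is assumed to
    arrive in a single message (or be collected in a single hop).
    """
    seen = Counter()
    messages = 0
    for attrs in schemas:
        messages += 1
        seen.update(set(attrs))  # count each attribute once per node
    shared = sorted(a for a, c in seen.items() if c >= 2)
    return shared, messages

schemas = [["A", "B"], ["A", "C"], ["C", "D"]]
shared, messages = shared_attributes(schemas)
print(shared, messages)
```

Only attribute names are exchanged, so the cost depends on the number of nodes and not on the sizes of the databases.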
21Continuing Work
- Determine patterns of flow in a network
- Communication network traffic
- Geographic/economic flows
[Figure: network of four nodes, each labeled "Local flow data"]