Information Theory For Data Management

Transcript and Presenter's Notes
1
Information Theory For Data Management
Divesh Srivastava Suresh Venkatasubramanian
2
Motivation
-- Abstruse Goose (177)
Information Theory is relevant to all of
humanity...
3
Background
  • Many problems in data management need precise
    reasoning about information content, transfer and
    loss
  • Structure Extraction
  • Privacy preservation
  • Schema design
  • Probabilistic data ?

4
Information Theory
  • First developed by Shannon as a way of
    quantifying capacity of signal channels.
  • Entropy, relative entropy and mutual information
    capture intrinsic informational aspects of a
    signal
  • Today
  • Information theory provides a domain-independent
    way to reason about structure in data
  • More information ⇒ interesting structure
  • Less information linkage ⇒ decoupling of structures

5
Tutorial Thesis
  • Information theory provides a mathematical
    framework for the quantification of information
    content, linkage and loss.
  • This framework can be used in the design of data
    management strategies that rely on probing the
    structure of information in data.

6
Tutorial Goals
  • Introduce information-theoretic concepts to DB
    audience
  • Give a data-centric perspective on information
    theory
  • Connect these to applications in data management
  • Describe underlying computational primitives
  • Illuminate when and how information theory might
    be of use in new areas of data management.

7
Outline
  • Part 1
  • Introduction to Information Theory
  • Application Data Anonymization
  • Application Database Design
  • Part 2
  • Review of Information Theory Basics
  • Application Data Integration
  • Computing Information Theoretic Primitives
  • Open Problems

8
Histograms And Discrete Distributions
9
Histograms And Discrete Distributions
X f(x)w(X)
x1 4520
x2 236
x3 122
x4 122
normalize
reweight
10
From Columns To Random Variables
  • We can think of a column of data as represented
    by a random variable
  • X is a random variable
  • p(X) is the column of probabilities p(X = x1), p(X = x2), and so on
  • Also known (in unweighted case) as the empirical
    distribution induced by the column X.
  • Notation
  • X (upper case) denotes a random variable (column)
  • x (lower case) denotes a value taken by X (field
    in a tuple)
  • p(x) is the probability p(X = x) (a small sketch of building p(X) from a column follows)
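A minimal sketch of this step in Python (the helper name and the toy column are ours, matching the running example):

    from collections import Counter

    def empirical_distribution(column):
        # p(X = x) = (# rows with value x) / (# rows)
        counts = Counter(column)
        n = len(column)
        return {x: c / n for x, c in counts.items()}

    X = ["x1", "x1", "x1", "x1", "x2", "x2", "x3", "x4"]
    print(empirical_distribution(X))  # {'x1': 0.5, 'x2': 0.25, 'x3': 0.125, 'x4': 0.125}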

11
Joint Distributions
  • Discrete distribution: probability p(X,Y,Z)
  • p(Y) = Σx p(X=x, Y) = Σx Σz p(X=x, Y, Z=z)

X Y Z p(X,Y,Z)
x1 y1 z1 0.125
x1 y2 z2 0.125
x1 y1 z2 0.125
x1 y2 z1 0.125
x2 y3 z3 0.125
x2 y3 z4 0.125
x3 y3 z5 0.125
x4 y3 z6 0.125
X p(X)
x1 0.5
x2 0.25
x3 0.125
x4 0.125
X Y p(X,Y)
x1 y1 0.25
x1 y2 0.25
x2 y3 0.25
x3 y3 0.125
x4 y3 0.125
Y p(Y)
y1 0.25
y2 0.25
y3 0.5
12
Entropy Of A Column
X p(X) h(X)
x1 0.5 1
x2 0.25 2
x3 0.125 3
x4 0.125 3
  • Let h(x) = log2(1/p(x))
  • h(X) is the column of h(x) values
  • H(X) = E_X[h(X)] = Σx p(x) log2(1/p(x)) (see the sketch below)
  • Two views of entropy
  • It captures uncertainty in the data: higher entropy, more unpredictability
  • It captures information content: higher entropy, more information

H(X) = 1.75 < log2 |X| = 2
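A small sketch of the entropy computation above in Python (assuming the p(X) column from the running example):

    import math

    def entropy(p):
        # H(X) = sum_x p(x) * log2(1 / p(x)), in bits
        return sum(px * math.log2(1.0 / px) for px in p.values() if px > 0)

    pX = {"x1": 0.5, "x2": 0.25, "x3": 0.125, "x4": 0.125}
    print(entropy(pX))         # 1.75
    print(math.log2(len(pX)))  # 2.0, the maximum possible for 4 values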
13
Examples
  • X uniform over {1, ..., 4}: H(X) = 2
  • Y is 1 with probability 0.5, and uniform over {2, 3, 4} otherwise
  • H(Y) = 0.5 log 2 + 0.5 log 6 ≈ 1.8 < 2
  • Y is more sharply defined, and so has less uncertainty
  • Z uniform over {1, ..., 8}: H(Z) = 3 > 2
  • Z spans a larger range, and captures more information

(Figure: histograms of X, Y, and Z)
14
Comparing Distributions
  • How do we measure the difference between two distributions?
  • Kullback-Leibler divergence
  • d_KL(p, q) = E_p[h_q − h_p] = Σi p_i log(p_i/q_i)

(Figure: a prior belief q is transformed by an inference mechanism into a resulting belief p)
15
Comparing Distributions
  • Kullback-Leibler divergence
  • d_KL(p, q) = E_p[h_q − h_p] = Σi p_i log(p_i/q_i) (see the sketch below)
  • d_KL(p, q) ≥ 0
  • Captures the extra information needed to describe p given q
  • Is asymmetric: d_KL(p, q) ≠ d_KL(q, p)
  • Is not a metric (does not satisfy the triangle inequality)
  • There are other measures:
  • χ²-distance, variational distance, f-divergences, ...
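A small numeric sketch of d_KL in Python (the two toy distributions are ours, chosen only to show the asymmetry):

    import math

    def kl_divergence(p, q):
        # d_KL(p || q) = sum_i p_i * log2(p_i / q_i); assumes q_i > 0 wherever p_i > 0
        return sum(pi * math.log2(pi / q[i]) for i, pi in p.items() if pi > 0)

    p = {"a": 0.75, "b": 0.25}
    q = {"a": 0.5, "b": 0.5}
    print(kl_divergence(p, q))  # ~0.19
    print(kl_divergence(q, p))  # ~0.21 -- asymmetric, as noted above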

16
Conditional Probability
  • Given a joint distribution on random variables X, Y, how much information about X can we glean from Y?
  • Conditional probability p(X|Y)
  • p(X = x1 | Y = y1) = p(X = x1, Y = y1) / p(Y = y1)

X Y p(X,Y) p(X|Y) p(Y|X)
x1 y1 0.25 1.0 0.5
x1 y2 0.25 1.0 0.5
x2 y3 0.25 0.5 1.0
x3 y3 0.125 0.25 1.0
x4 y3 0.125 0.25 1.0
X p(X)
x1 0.5
x2 0.25
x3 0.125
x4 0.125
Y p(Y)
y1 0.25
y2 0.25
y3 0.5
17
Conditional Entropy
  • Let h(x|y) = log2(1/p(x|y))
  • H(X|Y) = E_{x,y}[h(x|y)] = Σx Σy p(x,y) log2(1/p(x|y))
  • H(X|Y) = H(X,Y) − H(Y)
  • H(X|Y) = H(X,Y) − H(Y) = 2.25 − 1.5 = 0.75 (computed in the sketch below)
  • If X, Y are independent, H(X|Y) = H(X)

X Y p(X,Y) p(X|Y) h(X|Y)
x1 y1 0.25 1.0 0.0
x1 y2 0.25 1.0 0.0
x2 y3 0.25 0.5 1.0
x3 y3 0.125 0.25 2.0
x4 y3 0.125 0.25 2.0
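A sketch in Python of H(X|Y) = H(X,Y) − H(Y) on the joint distribution above:

    import math

    def entropy(p):
        return sum(v * math.log2(1.0 / v) for v in p.values() if v > 0)

    pXY = {("x1", "y1"): 0.25, ("x1", "y2"): 0.25, ("x2", "y3"): 0.25,
           ("x3", "y3"): 0.125, ("x4", "y3"): 0.125}
    pY = {}
    for (x, y), v in pXY.items():
        pY[y] = pY.get(y, 0.0) + v

    print(entropy(pXY) - entropy(pY))  # 2.25 - 1.5 = 0.75 = H(X|Y)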
18
Mutual Information
  • Mutual information captures the difference between the joint distribution on X and Y and the product of the marginal distributions on X and Y
  • Let i(x;y) = log [p(x,y) / (p(x)p(y))]
  • I(X;Y) = E_{x,y}[i(x;y)] = Σx Σy p(x,y) log [p(x,y) / (p(x)p(y))]

X Y p(X,Y) h(X,Y) i(XY)
x1 y1 0.25 2.0 1.0
x1 y2 0.25 2.0 1.0
x2 y3 0.25 2.0 1.0
x3 y3 0.125 3.0 1.0
x4 y3 0.125 3.0 1.0
X p(X) h(X)
x1 0.5 1.0
x2 0.25 2.0
x3 0.125 3.0
x4 0.125 3.0
Y p(Y) h(Y)
y1 0.25 2.0
y2 0.25 2.0
y3 0.5 1.0
19
Mutual Information: Strength of Linkage
  • I(X;Y) = H(X) + H(Y) − H(X,Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)
  • If X, Y are independent, then I(X;Y) = 0
  • H(X,Y) = H(X) + H(Y), so I(X;Y) = H(X) + H(Y) − H(X,Y) = 0
  • I(X;Y) ≤ max(H(X), H(Y))
  • Suppose Y = f(X) (deterministically)
  • Then H(Y|X) = 0, and so I(X;Y) = H(Y) − H(Y|X) = H(Y)
  • Mutual information captures higher-order interactions
  • Covariance captures linear interactions only
  • Two variables can be uncorrelated (covariance = 0) and have nonzero mutual information
  • X uniform in [-1, 1], Y = X²: Cov(X,Y) = 0, yet I(X;Y) > 0 (see the sketch below)
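A sketch in Python of the last point, using a discrete stand-in for the slide's example (X uniform on {-1, 0, 1} instead of the interval [-1, 1]); the helper name is ours:

    import math
    from collections import Counter

    def mutual_information(pairs):
        # I(X;Y) = sum p(x,y) * log2( p(x,y) / (p(x) p(y)) ) over the empirical distribution
        n = len(pairs)
        pxy = {k: c / n for k, c in Counter(pairs).items()}
        px = {k: c / n for k, c in Counter(x for x, _ in pairs).items()}
        py = {k: c / n for k, c in Counter(y for _, y in pairs).items()}
        return sum(v * math.log2(v / (px[x] * py[y])) for (x, y), v in pxy.items())

    samples = [(x, x * x) for x in (-1, 0, 1)]   # Y = X^2
    cov = sum(x * y for x, y in samples) / 3      # E[XY]; E[X] = 0 here, so this is Cov(X,Y)
    print(cov)                                    # 0.0  -> uncorrelated
    print(mutual_information(samples))            # ~0.918 bits -> clearly dependent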

20
Information Theory Summary
  • We can represent data as discrete distributions
    (normalized histograms)
  • Entropy captures uncertainty or information
    content in a distribution
  • The Kullback-Leibler distance captures the
    difference between distributions
  • Mutual information and conditional entropy
    capture linkage between variables in a joint
    distribution

21
Outline
  • Part 1
  • Introduction to Information Theory
  • Application Data Anonymization
  • Application Database Design
  • Part 2
  • Review of Information Theory Basics
  • Application Data Integration
  • Computing Information Theoretic Primitives
  • Open Problems

22
Data Anonymization Using Randomization
  • Goal: publish anonymized microdata to enable accurate ad hoc analyses, but ensure privacy of individuals' sensitive attributes
  • Key ideas
  • Randomize numerical data: add noise from a known distribution
  • Reconstruct the original data distribution using the published noisy data
  • Issues
  • How can the original data distribution be
    reconstructed?
  • What kinds of randomization preserve privacy of
    individuals?

23
Data Anonymization Using Randomization
  • Many randomization strategies proposed AS00, AA01, EGS03
  • Example randomization strategies: X in [0, 10]
  • R = X + μ (mod 11), μ uniform in {-1, 0, 1}
  • R = X + μ (mod 11), μ ∈ {-1 (p = 0.25), 0 (p = 0.5), 1 (p = 0.25)}
  • R = X (p = 0.6), R = μ with μ uniform in [0, 10] (p = 0.4)
  • Question
  • Which randomization strategy has higher privacy preservation?
  • Quantify the loss of privacy due to publication of randomized data

24
Data Anonymization Using Randomization
  • X in [0, 10], R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}

Id X
s1 0
s2 3
s3 5
s4 0
s5 8
s6 0
s7 6
s8 0
25
Data Anonymization Using Randomization
  • X in [0, 10], R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}

Id X µ
s1 0 -1
s2 3 0
s3 5 1
s4 0 0
s5 8 1
s6 0 -1
s7 6 1
s8 0 0
Id R1
s1 10
s2 3
s3 6
s4 0
s5 9
s6 10
s7 7
s8 0
26
Data Anonymization Using Randomization
  • X in [0, 10], R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}

Id X µ
s1 0 0
s2 3 -1
s3 5 0
s4 0 1
s5 8 1
s6 0 -1
s7 6 -1
s8 0 1
Id R1
s1 0
s2 2
s3 5
s4 1
s5 9
s6 10
s7 5
s8 1
27
Reconstruction of Original Data Distribution
  • X in [0, 10], R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}
  • Reconstruct the distribution of X using knowledge of R1 and μ
  • The EM algorithm converges to the MLE of the original distribution AA01

Id X µ
s1 0 0
s2 3 -1
s3 5 0
s4 0 1
s5 8 1
s6 0 -1
s7 6 -1
s8 0 1
Id R1
s1 0
s2 2
s3 5
s4 1
s5 9
s6 10
s7 5
s8 1
Id X|R1
s1 10, 0, 1
s2 1, 2, 3
s3 4, 5, 6
s4 0, 1, 2
s5 8, 9, 10
s6 9, 10, 0
s7 4, 5, 6
s8 0, 1, 2
28
Analysis of Privacy AS00
  • X in [0, 10], R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}
  • If X is uniform in [0, 10], privacy is determined by the range of μ

Id X µ
s1 0 0
s2 3 -1
s3 5 0
s4 0 1
s5 8 1
s6 0 -1
s7 6 -1
s8 0 1
Id R1
s1 0
s2 2
s3 5
s4 1
s5 9
s6 10
s7 5
s8 1
Id X|R1
s1 10, 0, 1
s2 1, 2, 3
s3 4, 5, 6
s4 0, 1, 2
s5 8, 9, 10
s6 9, 10, 0
s7 4, 5, 6
s8 0, 1, 2
29
Analysis of Privacy AA01
  • X in [0, 10], R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}
  • If X is uniform in {0, 1} ∪ {5, 6}, privacy is smaller than the range of μ

Id X µ
s1 0 0
s2 1 -1
s3 5 0
s4 6 1
s5 0 1
s6 1 -1
s7 5 -1
s8 6 1
Id R1
s1 0
s2 0
s3 5
s4 7
s5 1
s6 0
s7 4
s8 7
Id X|R1
s1 10, 0, 1
s2 10, 0, 1
s3 4, 5, 6
s4 6, 7, 8
s5 0, 1, 2
s6 10, 0, 1
s7 3, 4, 5
s8 6, 7, 8
30
Analysis of Privacy AA01
  • X in [0, 10], R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}
  • If X is uniform in {0, 1} ∪ {5, 6}, privacy is smaller than the range of μ
  • In some cases, the sensitive value is revealed

Id X µ
s1 0 0
s2 1 -1
s3 5 0
s4 6 1
s5 0 1
s6 1 -1
s7 5 -1
s8 6 1
Id R1
s1 0
s2 0
s3 5
s4 7
s5 1
s6 0
s7 4
s8 7
Id X|R1
s1 0, 1
s2 0, 1
s3 5, 6
s4 6
s5 0, 1
s6 0, 1
s7 5
s8 6
31
Quantify Loss of Privacy AA01
  • Goal: quantify the loss of privacy based on mutual information I(X;R)
  • Smaller H(X|R) ⇒ more loss of privacy in X from knowledge of R
  • Larger I(X;R) ⇒ more loss of privacy in X from knowledge of R
  • I(X;R) = H(X) − H(X|R)
  • I(X;R) is used to capture the correlation between X and R
  • p(X) is the prior knowledge of the sensitive attribute X
  • p(X, R) is the joint distribution of X and R

32
Quantify Loss of Privacy AA01
  • Goal: quantify the loss of privacy based on mutual information I(X;R)
  • X is uniform in {5, 6}, R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}

X R1 p(X,R1) h(X,R1) i(XR1)
5 4
5 5
5 6
6 5
6 6
6 7
X p(X) h(X)
5
6
R1 p(R1) h(R1)
4
5
6
7
33
Quantify Loss of Privacy AA01
  • Goal: quantify the loss of privacy based on mutual information I(X;R)
  • X is uniform in {5, 6}, R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}

X R1 p(X,R1) h(X,R1) i(XR1)
5 4 0.17
5 5 0.17
5 6 0.17
6 5 0.17
6 6 0.17
6 7 0.17
X p(X) h(X)
5 0.5
6 0.5
R1 p(R1) h(R1)
4 0.17
5 0.34
6 0.34
7 0.17
34
Quantify Loss of Privacy AA01
  • Goal: quantify the loss of privacy based on mutual information I(X;R)
  • X is uniform in {5, 6}, R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}

X R1 p(X,R1) h(X,R1) i(XR1)
5 4 0.17 2.58
5 5 0.17 2.58
5 6 0.17 2.58
6 5 0.17 2.58
6 6 0.17 2.58
6 7 0.17 2.58
X p(X) h(X)
5 0.5 1.0
6 0.5 1.0
R1 p(R1) h(R1)
4 0.17 2.58
5 0.34 1.58
6 0.34 1.58
7 0.17 2.58
35
Quantify Loss of Privacy AA01
  • Goal: quantify the loss of privacy based on mutual information I(X;R)
  • X is uniform in {5, 6}, R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}
  • I(X;R1) = 0.33

X R1 p(X,R1) h(X,R1) i(XR1)
5 4 0.17 2.58 1.0
5 5 0.17 2.58 0.0
5 6 0.17 2.58 0.0
6 5 0.17 2.58 0.0
6 6 0.17 2.58 0.0
6 7 0.17 2.58 1.0
X p(X) h(X)
5 0.5 1.0
6 0.5 1.0
R1 p(R1) h(R1)
4 0.17 2.58
5 0.34 1.58
6 0.34 1.58
7 0.17 2.58
36
Quantify Loss of Privacy AA01
  • Goal: quantify the loss of privacy based on mutual information I(X;R)
  • X is uniform in {5, 6}, R2 = X + μ (mod 11), μ uniform in {0, 1}
  • I(X;R1) = 0.33, I(X;R2) = 0.5 ⇒ R2 is a bigger privacy risk than R1 (see the sketch below)

X R2 p(X,R2) h(X,R2) i(XR2)
5 5 0.25 2.0 1.0
5 6 0.25 2.0 0.0
6 6 0.25 2.0 0.0
6 7 0.25 2.0 1.0
X p(X) h(X)
5 0.5 1.0
6 0.5 1.0
R2 p(R2) h(R2)
5 0.25 2.0
6 0.5 1.0
7 0.25 2.0
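A sketch in Python that reproduces the two numbers above (function names are ours; exact fractions are used only to avoid rounding):

    import math
    from fractions import Fraction
    from itertools import product

    def mutual_information(pXR):
        pX, pR = {}, {}
        for (x, r), v in pXR.items():
            pX[x] = pX.get(x, 0) + v
            pR[r] = pR.get(r, 0) + v
        return sum(float(v) * math.log2(float(v / (pX[x] * pR[r]))) for (x, r), v in pXR.items())

    def joint(noise):
        # X uniform on {5, 6}; R = X + mu (mod 11), mu uniform on the given noise values
        pXR = {}
        for x, mu in product((5, 6), noise):
            key = (x, (x + mu) % 11)
            pXR[key] = pXR.get(key, Fraction(0)) + Fraction(1, 2 * len(noise))
        return pXR

    print(mutual_information(joint((-1, 0, 1))))  # ~0.33 for R1
    print(mutual_information(joint((0, 1))))      # 0.5  for R2, the bigger privacy risk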
37
Quantify Loss of Privacy AA01
  • Equivalent goal: quantify the loss of privacy based on H(X|R)
  • X is uniform in {5, 6}, R2 = X + μ (mod 11), μ uniform in {0, 1}
  • Intuition: we know more about X given R2 than about X given R1
  • H(X|R1) = 0.67, H(X|R2) = 0.5 ⇒ R2 is a bigger privacy risk than R1

X R2 p(X,R2) p(X|R2) h(X|R2)
5 5 0.25 1.0 0.0
5 6 0.25 0.5 1.0
6 6 0.25 0.5 1.0
6 7 0.25 1.0 0.0
X R1 p(X,R1) p(X|R1) h(X|R1)
5 4 0.17 1.0 0.0
5 5 0.17 0.5 1.0
5 6 0.17 0.5 1.0
6 5 0.17 0.5 1.0
6 6 0.17 0.5 1.0
6 7 0.17 1.0 0.0
38
Quantify Loss of Privacy
  • Example: X is uniform in {0, 1}
  • R3 = e (p = 0.9999), R3 = X (p = 0.0001)
  • R4 = X (p = 0.6), R4 = 1 − X (p = 0.4)
  • Is R3 or R4 a bigger privacy risk?

39
Worst Case Loss of Privacy EGS03
  • Example: X is uniform in {0, 1}
  • R3 = e (p = 0.9999), R3 = X (p = 0.0001)
  • R4 = X (p = 0.6), R4 = 1 − X (p = 0.4)
  • I(X;R3) = 0.0001 << I(X;R4) = 0.028

X R3 p(X,R3) h(X,R3) i(XR3)
0 e 0.49995 1.0 0.0
0 0 0.00005 14.29 1.0
1 e 0.49995 1.0 0.0
1 1 0.00005 14.29 1.0
X R4 p(X,R4) h(X,R4) i(XR4)
0 0 0.3 1.74 0.26
0 1 0.2 2.32 -0.32
1 0 0.2 2.32 -0.32
1 1 0.3 1.74 0.26
40
Worst Case Loss of Privacy EGS03
  • Example: X is uniform in {0, 1}
  • R3 = e (p = 0.9999), R3 = X (p = 0.0001)
  • R4 = X (p = 0.6), R4 = 1 − X (p = 0.4)
  • I(X;R3) = 0.0001 << I(X;R4) = 0.028
  • But R3 has a larger worst-case risk

X R3 p(X,R3) h(X,R3) i(XR3)
0 e 0.49995 1.0 0.0
0 0 0.00005 14.29 1.0
1 e 0.49995 1.0 0.0
1 1 0.00005 14.29 1.0
X R4 p(X,R4) h(X,R4) i(XR4)
0 0 0.3 1.74 0.26
0 1 0.2 2.32 -0.32
1 0 0.2 2.32 -0.32
1 1 0.3 1.74 0.26
41
Worst Case Loss of Privacy EGS03
  • Goal: quantify the worst-case loss of privacy in X from knowledge of R
  • Use the maximum KL divergence, instead of mutual information
  • Mutual information can be formulated as an expected KL divergence
  • I(X;R) = Σx Σr p(x,r) log2(p(x,r)/(p(x)p(r))) = KL(p(X,R) || p(X)p(R))
  • I(X;R) = Σr p(r) Σx p(x|r) log2(p(x|r)/p(x)) = E_R[ KL(p(X|r) || p(X)) ]
  • The AA01 measure quantifies the expected loss of privacy over R
  • EGS03 propose a measure based on the worst-case loss of privacy
  • I_W(X;R) = max_r KL(p(X|r) || p(X)) (see the sketch below)
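A sketch in Python of both measures side by side (the channel dictionaries and function names are ours; R3 and R4 are the examples from the previous slides):

    import math

    def kl(p, q):
        return sum(pi * math.log2(pi / q[x]) for x, pi in p.items() if pi > 0)

    def privacy_losses(pX, channel):
        # channel[x][r] = p(R = r | X = x); returns (expected KL, worst-case KL)
        pR = {}
        for x, row in channel.items():
            for r, v in row.items():
                pR[r] = pR.get(r, 0.0) + pX[x] * v
        per_r = []
        for r, pr in pR.items():
            pX_given_r = {x: pX[x] * channel[x].get(r, 0.0) / pr for x in pX}
            per_r.append((pr, kl(pX_given_r, pX)))
        return sum(pr * d for pr, d in per_r), max(d for _, d in per_r)

    pX = {0: 0.5, 1: 0.5}
    R3 = {0: {"e": 0.9999, "0": 0.0001}, 1: {"e": 0.9999, "1": 0.0001}}
    R4 = {0: {"0": 0.6, "1": 0.4}, 1: {"1": 0.6, "0": 0.4}}
    print(privacy_losses(pX, R3))  # (~0.0001, 1.0): tiny expected loss, worst case fully reveals X
    print(privacy_losses(pX, R4))  # (~0.029, ~0.029)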

42
Worst Case Loss of Privacy EGS03
  • Example: X is uniform in {0, 1}
  • R3 = e (p = 0.9999), R3 = X (p = 0.0001)
  • R4 = X (p = 0.6), R4 = 1 − X (p = 0.4)
  • I_W(X;R3) = max{0.0, 1.0, 1.0} > I_W(X;R4) = max{0.028, 0.028}

X R3 p(X,R3) p(X|R3) i(X;R3)
0 e 0.49995 0.5 0.0
0 0 0.00005 1.0 1.0
1 e 0.49995 0.5 0.0
1 1 0.00005 1.0 1.0
X R4 p(X,R4) p(X|R4) i(X;R4)
0 0 0.3 0.6 0.26
0 1 0.2 0.4 -0.32
1 0 0.2 0.4 -0.32
1 1 0.3 0.6 0.26
43
Worst Case Loss of Privacy EGS03
  • Example: X is uniform in {5, 6}
  • R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}
  • R2 = X + μ (mod 11), μ uniform in {0, 1}
  • I_W(X;R1) = max{1.0, 0.0, 0.0, 1.0} = 1.0 = I_W(X;R2) = max{1.0, 0.0, 1.0}
  • Unable to capture that R2 is a bigger privacy risk than R1

X R1 p(X,R1) p(X|R1) i(X;R1)
5 4 0.17 1.0 1.0
5 5 0.17 0.5 0.0
5 6 0.17 0.5 0.0
6 5 0.17 0.5 0.0
6 6 0.17 0.5 0.0
6 7 0.17 1.0 1.0
X R2 p(X,R2) p(X|R2) i(X;R2)
5 5 0.25 1.0 1.0
5 6 0.25 0.5 0.0
6 6 0.25 0.5 0.0
6 7 0.25 1.0 1.0
44
Data Anonymization Summary
  • Randomization techniques useful for microdata
    anonymization
  • Randomization techniques differ in their loss of
    privacy
  • Information theoretic measures useful to capture
    loss of privacy
  • Expected KL divergence captures expected privacy
    loss AA01
  • Maximum KL divergence captures worst case privacy
    loss EGS03
  • Both are useful in practice

45
Outline
  • Part 1
  • Introduction to Information Theory
  • Application Data Anonymization
  • Application Database Design
  • Part 2
  • Review of Information Theory Basics
  • Application Data Integration
  • Computing Information Theoretic Primitives
  • Open Problems

46
Information Dependencies DR00
  • Goal use information theory to examine and
    reason about information content of the
    attributes in a relation instance
  • Key ideas
  • A novel InD measure between attribute sets X, Y based on H(Y|X)
  • Identify numeric inequalities between InD
    measures
  • Results
  • InD measures are a broader class than FDs and
    MVDs
  • Armstrong axioms for FDs derivable from InD
    inequalities
  • MVD inference rules derivable from InD
    inequalities

47
Information Dependencies DR00
  • Functional dependency X → Y
  • FD X → Y holds iff ∀ t1, t2: (t1[X] = t2[X]) ⇒ (t1[Y] = t2[Y])

X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
48
Information Dependencies DR00
  • Functional dependency X → Y
  • FD X → Y holds iff ∀ t1, t2: (t1[X] = t2[X]) ⇒ (t1[Y] = t2[Y])

X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
49
Information Dependencies DR00
  • Result: FD X → Y holds iff H(Y|X) = 0
  • Intuition: once X is known, there is no remaining uncertainty in Y
  • Here H(Y|X) = 0.5, so the FD does not hold (see the sketch below)

X Y p(X,Y) p(Y|X) h(Y|X)
x1 y1 0.25 0.5 1.0
x1 y2 0.25 0.5 1.0
x2 y3 0.25 1.0 0.0
x3 y3 0.125 1.0 0.0
x4 y3 0.125 1.0 0.0
X p(X)
x1 0.5
x2 0.25
x3 0.125
x4 0.125
Y p(Y)
y1 0.25
y2 0.25
y3 0.5
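A sketch in Python of the H(Y|X) test on the running relation instance (helper names are ours):

    import math
    from collections import Counter

    def H(rows, cols):
        # entropy of the projection of the instance onto the given column indexes
        counts = Counter(tuple(r[c] for c in cols) for r in rows)
        n = len(rows)
        return sum((c / n) * math.log2(n / c) for c in counts.values())

    def fd_holds(rows, X, Y):
        # FD X -> Y holds on this instance iff H(Y|X) = H(XY) - H(X) = 0
        return abs(H(rows, X + Y) - H(rows, X)) < 1e-12

    R = [("x1", "y1", "z1"), ("x1", "y2", "z2"), ("x1", "y1", "z2"), ("x1", "y2", "z1"),
         ("x2", "y3", "z3"), ("x2", "y3", "z4"), ("x3", "y3", "z5"), ("x4", "y3", "z6")]
    print(H(R, [0, 1]) - H(R, [0]))  # H(Y|X) = 0.5, so X -> Y does not hold here
    print(fd_holds(R, [0], [1]))     # False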
50
Information Dependencies DR00
  • Multi-valued dependency X →→ Y
  • MVD X →→ Y holds iff R(X,Y,Z) = R(X,Y) ⋈ R(X,Z)

X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
51
Information Dependencies DR00
  • Multi-valued dependency X →→ Y
  • MVD X →→ Y holds iff R(X,Y,Z) = R(X,Y) ⋈ R(X,Z)

X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
X Y
x1 y1
x1 y2
x2 y3
x3 y3
x4 y3
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6

52
Information Dependencies DR00
  • Multi-valued dependency X →→ Y
  • MVD X →→ Y holds iff R(X,Y,Z) = R(X,Y) ⋈ R(X,Z)

X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
X Y
x1 y1
x1 y2
x2 y3
x3 y3
x4 y3
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6

53
Information Dependencies DR00
  • Result: MVD X →→ Y holds iff H(Y,Z|X) = H(Y|X) + H(Z|X)
  • Intuition: once X is known, the uncertainties in Y and Z are independent
  • H(Y|X) = 0.5, H(Z|X) = 0.75, H(Y,Z|X) = 1.25 (see the sketch below)

X Y h(Y|X)
x1 y1 1.0
x1 y2 1.0
x2 y3 0.0
x3 y3 0.0
x4 y3 0.0
X Z h(Z|X)
x1 z1 1.0
x1 z2 1.0
x2 z3 1.0
x2 z4 1.0
x3 z5 0.0
x4 z6 0.0
X Y Z h(Y,Z|X)
x1 y1 z1 2.0
x1 y2 z2 2.0
x1 y1 z2 2.0
x1 y2 z1 2.0
x2 y3 z3 1.0
x2 y3 z4 1.0
x3 y3 z5 0.0
x4 y3 z6 0.0
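The MVD test, sketched the same way in Python (helpers are ours):

    import math
    from collections import Counter

    def H(rows, cols):
        counts = Counter(tuple(r[c] for c in cols) for r in rows)
        n = len(rows)
        return sum((c / n) * math.log2(n / c) for c in counts.values())

    def mvd_holds(rows, X, Y, Z):
        # MVD X ->> Y holds iff H(Y,Z|X) = H(Y|X) + H(Z|X)
        hX = H(rows, X)
        lhs = H(rows, X + Y + Z) - hX
        rhs = (H(rows, X + Y) - hX) + (H(rows, X + Z) - hX)
        return abs(lhs - rhs) < 1e-12

    R = [("x1", "y1", "z1"), ("x1", "y2", "z2"), ("x1", "y1", "z2"), ("x1", "y2", "z1"),
         ("x2", "y3", "z3"), ("x2", "y3", "z4"), ("x3", "y3", "z5"), ("x4", "y3", "z6")]
    print(mvd_holds(R, [0], [1], [2]))  # True: 1.25 = 0.5 + 0.75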

54
Information Dependencies DR00
  • Result: Armstrong axioms for FDs are derivable from InD inequalities
  • Reflexivity: if Y ⊆ X, then X → Y
  • H(Y|X) = 0 for Y ⊆ X
  • Augmentation: X → Y ⇒ X,Z → Y,Z
  • 0 ≤ H(Y,Z|X,Z) = H(Y|X,Z) ≤ H(Y|X) = 0
  • Transitivity: X → Y ∧ Y → Z ⇒ X → Z
  • 0 = H(Y|X) + H(Z|Y) ≥ H(Z|X) ≥ 0

55
Database Normal Forms
  • Goal eliminate update anomalies by good database
    design
  • Need to know the integrity constraints on all
    database instances
  • Boyce-Codd normal form
  • Input: a set Σ of functional dependencies
  • For every (non-trivial) FD R.X → R.Y ∈ Σ, R.X is a key of R
  • 4NF
  • Input: a set Σ of functional and multi-valued dependencies
  • For every (non-trivial) MVD R.X →→ R.Y ∈ Σ, R.X is a key of R

56
Database Normal Forms
  • Functional dependency X → Y
  • Which design is better?

X Y Z
x1 y1 z1
x1 y1 z2
x2 y2 z3
x2 y2 z4
x3 y3 z5
x4 y4 z6
X Y
x1 y1
x2 y2
x3 y3
x4 y4
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6

57
Database Normal Forms
  • Functional dependency X → Y
  • Which design is better?
  • The decomposition is in BCNF

X Y Z
x1 y1 z1
x1 y1 z2
x2 y2 z3
x2 y2 z4
x3 y3 z5
x4 y4 z6
X Y
x1 y1
x2 y2
x3 y3
x4 y4
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6

58
Database Normal Forms
  • Multi-valued dependency X →→ Y
  • Which design is better?

X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
X Y
x1 y1
x1 y2
x2 y3
x3 y3
x4 y3
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6

59
Database Normal Forms
  • Multi-valued dependency X →→ Y
  • Which design is better?
  • The decomposition is in 4NF

X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
X Y
x1 y1
x1 y2
x2 y3
x3 y3
x4 y3
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6

60
Well-Designed Databases AL03
  • Goal use information theory to characterize
    goodness of a database design and reason about
    normalization algorithms
  • Key idea
  • Information content measure of cell in a DB
    instance w.r.t. ICs
  • Redundancy reduces information content measure of
    cells
  • Results
  • Well-designed DB iff each cell has information content > 0
  • Normalization algorithms never decrease
    information content

61
Well-Designed Databases AL03
  • Information content of cell c in database D satisfying FD X → Y
  • Uniform distribution p(V) on the values for c consistent with D\c and the FD
  • Information content of cell c is the entropy H(V)
  • H(V62) = 2.0

X Y Z
x1 y1 z1
x1 y1 z2
x2 y2 z3
x2 y2 z4
x3 y3 z5
x4 y4 z6
V62 p(V62) h(V62)
y1 0.25 2.0
y2 0.25 2.0
y3 0.25 2.0
y4 0.25 2.0
62
Well-Designed Databases AL03
  • Information content of cell c in database D satisfying FD X → Y
  • Uniform distribution p(V) on the values for c consistent with D\c and the FD
  • Information content of cell c is the entropy H(V)
  • H(V22) = 0.0 (see the sketch below)

X Y Z
x1 y1 z1
x1 y1 z2
x2 y2 z3
x2 y2 z4
x3 y3 z5
x4 y4 z6
V22 p(V22) h(V22)
y1 1.0 0.0
y2 0.0
y3 0.0
y4 0.0
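A simplified sketch in Python of this cell measure for an FD (helper names and the 0-based cell indexing are ours; AL03's full definition carries the active-domain technicalities noted on the next slide):

    import math

    def cell_information_content(rows, cell, lhs, rhs, domain):
        # H(V): uniform over the domain values that can fill the cell while keeping the FD
        # lhs -> rhs satisfied, with the rest of the instance fixed
        r, c = cell
        consistent = 0
        for v in domain:
            candidate = [list(t) for t in rows]
            candidate[r][c] = v
            mapping, ok = {}, True
            for t in candidate:
                key, val = tuple(t[i] for i in lhs), tuple(t[i] for i in rhs)
                if mapping.setdefault(key, val) != val:
                    ok = False
                    break
            if ok:
                consistent += 1
        return math.log2(consistent)  # entropy of the uniform distribution on consistent values

    R = [("x1", "y1", "z1"), ("x1", "y1", "z2"), ("x2", "y2", "z3"),
         ("x2", "y2", "z4"), ("x3", "y3", "z5"), ("x4", "y4", "z6")]
    dom = ["y1", "y2", "y3", "y4"]
    print(cell_information_content(R, (5, 1), [0], [1], dom))  # H(V62) = 2.0
    print(cell_information_content(R, (1, 1), [0], [1], dom))  # H(V22) = 0.0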
63
Well-Designed Databases AL03
  • Information content of cell c in database D satisfying FD X → Y
  • Information content of cell c is the entropy H(V)
  • Schema S is in BCNF iff ∀ D ∈ S, H(V) > 0 for all cells c in D
  • Technicalities w.r.t. the size of the active domain

X Y Z
x1 y1 z1
x1 y1 z2
x2 y2 z3
x2 y2 z4
x3 y3 z5
x4 y4 z6
c H(V)
c12 0.0
c22 0.0
c32 0.0
c42 0.0
c52 2.0
c62 2.0
64
Well-Designed Databases AL03
  • Information content of cell c in database D satisfying FD X → Y
  • Information content of cell c is the entropy H(V)
  • H(V12) = 2.0, H(V42) = 2.0

V42 p(V42) h(V42)
y1 0.25 2.0
y2 0.25 2.0
y3 0.25 2.0
y4 0.25 2.0
X Y
x1 y1
x2 y2
x3 y3
x4 y4
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6
V12 p(V12) h(V12)
y1 0.25 2.0
y2 0.25 2.0
y3 0.25 2.0
y4 0.25 2.0
65
Well-Designed Databases AL03
  • Information content of cell c in database D satisfying FD X → Y
  • Information content of cell c is the entropy H(V)
  • Schema S is in BCNF iff ∀ D ∈ S, H(V) > 0 for all cells c in D

X Y
x1 y1
x2 y2
x3 y3
x4 y4
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6
c H(V)
c12 2.0
c22 2.0
c32 2.0
c42 2.0
66
Well-Designed Databases AL03
  • Information content of cell c in DB D satisfying MVD X →→ Y
  • Information content of cell c is the entropy H(V)
  • H(V52) = 0.0, H(V53) = 2.32

X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
V52 p(V52) h(V52)
y3 1.0 0.0
V53 p(V53) h(V53)
z1 0.2 2.32
z2 0.2 2.32
z3 0.2 2.32
z4 0.0
z5 0.2 2.32
z6 0.2 2.32
67
Well-Designed Databases AL03
  • Information content of cell c in DB D satisfying MVD X →→ Y
  • Information content of cell c is the entropy H(V)
  • Schema S is in 4NF iff ∀ D ∈ S, H(V) > 0 for all cells c in D

X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
c H(V)
c12 0.0
c22 0.0
c32 0.0
c42 0.0
c52 0.0
c62 0.0
c72 1.58
c82 1.58
c H(V)
c13 0.0
c23 0.0
c33 0.0
c43 0.0
c53 2.32
c63 2.32
c73 2.58
c83 2.58
68
Well-Designed Databases AL03
  • Information content of cell c in DB D satisfying MVD X →→ Y
  • Information content of cell c is the entropy H(V)
  • H(V32) = 1.58, H(V34) = 2.32

V34 p(V34) h(V34)
z1 0.2 2.32
z2 0.2 2.32
z3 0.2 2.32
z4 0.0
z5 0.2 2.32
z6 0.2 2.32
X Y
x1 y1
x1 y2
x2 y3
x3 y3
x4 y3
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6
V32 p(V32) h(V32)
y1 0.33 1.58
y2 0.33 1.58
y3 0.33 1.58
69
Well-Designed Databases AL03
  • Information content of cell c in DB D satisfying MVD X →→ Y
  • Information content of cell c is the entropy H(V)
  • Schema S is in 4NF iff ∀ D ∈ S, H(V) > 0 for all cells c in D

X Y
x1 y1
x1 y2
x2 y3
x3 y3
x4 y3
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6
c H(V)
c12 1.0
c22 1.0
c32 1.58
c42 1.58
c52 1.58
c H(V)
c14 2.32
c24 2.32
c34 2.32
c44 2.32
c54 2.58
c64 2.58
70
Well-Designed Databases AL03
  • Normalization algorithms never decrease
    information content
  • Information content of cell c is entropy H(V)

X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
c H(V)
c13 0.0
c23 0.0
c33 0.0
c43 0.0
c53 2.32
c63 2.32
c73 2.58
c83 2.58
71
Well-Designed Databases AL03
  • Normalization algorithms never decrease
    information content
  • Information content of cell c is entropy H(V)

c H(V)
c14 2.32
c24 2.32
c34 2.32
c44 2.32
c54 2.58
c64 2.58
X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
X Y
x1 y1
x1 y2
x2 y3
x3 y3
x4 y3
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6
c H(V)
c13 0.0
c23 0.0
c33 0.0
c43 0.0
c53 2.32
c63 2.32
c73 2.58
c83 2.58

72
Well-Designed Databases AL03
  • Normalization algorithms never decrease
    information content
  • Information content of cell c is entropy H(V)

c H(V)
c14 2.32
c24 2.32
c34 2.32
c44 2.32
c54 2.58
c64 2.58
X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
X Y
x1 y1
x1 y2
x2 y3
x3 y3
x4 y3
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6
c H(V)
c13 0.0
c23 0.0
c33 0.0
c43 0.0
c53 2.32
c63 2.32
c73 2.58
c83 2.58

73
Database Design Summary
  • Good database design is essential for preserving data integrity
  • Information-theoretic measures are useful for integrity constraints
  • FD X → Y holds iff the InD measure H(Y|X) = 0
  • MVD X →→ Y holds iff H(Y,Z|X) = H(Y|X) + H(Z|X)
  • Information theory models correlations in a specific database
  • Information-theoretic measures are useful for normal forms
  • Schema S is in BCNF/4NF iff ∀ D ∈ S, H(V) > 0 for all cells c in D
  • Information theory models distributions over possible databases

74
Outline
  • Part 1
  • Introduction to Information Theory
  • Application Data Anonymization
  • Application Database Design
  • Part 2
  • Review of Information Theory Basics
  • Application Data Integration
  • Computing Information Theoretic Primitives
  • Open Problems

75
Review of Information Theory Basics
  • Discrete distribution: probability p(X)
  • p(X,Y) = Σz p(X, Y, Z=z)

X Y Z p(X,Y,Z)
x1 y1 z1 0.125
x1 y2 z2 0.125
x1 y1 z2 0.125
x1 y2 z1 0.125
x2 y3 z3 0.125
x2 y3 z4 0.125
x3 y3 z5 0.125
x4 y3 z6 0.125
X p(X)
x1 0.5
x2 0.25
x3 0.125
x4 0.125
X Y p(X,Y)
x1 y1 0.25
x1 y2 0.25
x2 y3 0.25
x3 y3 0.125
x4 y3 0.125
Y p(Y)
y1 0.25
y2 0.25
y3 0.5
76
Review of Information Theory Basics
  • Discrete distribution: probability p(X)
  • p(Y) = Σx p(X=x, Y) = Σx Σz p(X=x, Y, Z=z)

X Y Z p(X,Y,Z)
x1 y1 z1 0.125
x1 y2 z2 0.125
x1 y1 z2 0.125
x1 y2 z1 0.125
x2 y3 z3 0.125
x2 y3 z4 0.125
x3 y3 z5 0.125
x4 y3 z6 0.125
X p(X)
x1 0.5
x2 0.25
x3 0.125
x4 0.125
X Y p(X,Y)
x1 y1 0.25
x1 y2 0.25
x2 y3 0.25
x3 y3 0.125
x4 y3 0.125
Y p(Y)
y1 0.25
y2 0.25
y3 0.5
77
Review of Information Theory Basics
  • Discrete distribution: conditional probability p(X|Y)
  • p(X,Y) = p(X|Y) p(Y) = p(Y|X) p(X)

X Y p(X,Y) p(X|Y) p(Y|X)
x1 y1 0.25 1.0 0.5
x1 y2 0.25 1.0 0.5
x2 y3 0.25 0.5 1.0
x3 y3 0.125 0.25 1.0
x4 y3 0.125 0.25 1.0
X p(X)
x1 0.5
x2 0.25
x3 0.125
x4 0.125
Y p(Y)
y1 0.25
y2 0.25
y3 0.5
78
Review of Information Theory Basics
  • Discrete distribution: entropy H(X)
  • h(x) = log2(1/p(x))
  • H(X) = Σ_{X=x} p(x) h(x) = 1.75
  • H(Y) = Σ_{Y=y} p(y) h(y) = 1.5 (< log2 |Y| = 1.58)
  • H(X,Y) = Σ_{X=x} Σ_{Y=y} p(x,y) h(x,y) = 2.25 (< log2 |X,Y| = 2.32)

X Y p(X,Y) h(X,Y)
x1 y1 0.25 2.0
x1 y2 0.25 2.0
x2 y3 0.25 2.0
x3 y3 0.125 3.0
x4 y3 0.125 3.0
X p(X) h(X)
x1 0.5 1.0
x2 0.25 2.0
x3 0.125 3.0
x4 0.125 3.0
Y p(Y) h(Y)
y1 0.25 2.0
y2 0.25 2.0
y3 0.5 1.0
79
Review of Information Theory Basics
  • Discrete distribution: conditional entropy H(X|Y)
  • h(x|y) = log2(1/p(x|y))
  • H(X|Y) = Σ_{X=x} Σ_{Y=y} p(x,y) h(x|y) = 0.75
  • H(X|Y) = H(X,Y) − H(Y) = 2.25 − 1.5

X Y p(X,Y) p(X|Y) h(X|Y)
x1 y1 0.25 1.0 0.0
x1 y2 0.25 1.0 0.0
x2 y3 0.25 0.5 1.0
x3 y3 0.125 0.25 2.0
x4 y3 0.125 0.25 2.0
X p(X) h(X)
x1 0.5 1.0
x2 0.25 2.0
x3 0.125 3.0
x4 0.125 3.0
Y p(Y) h(Y)
y1 0.25 2.0
y2 0.25 2.0
y3 0.5 1.0
80
Review of Information Theory Basics
  • Discrete distribution mutual information I(XY)
  • i(xy) log2(p(x,y)/p(x)p(y))
  • I(XY) ?Xx ?Yy p(x,y)i(xy) 1.0
  • I(XY) H(X) H(Y) H(X,Y) 1.75 1.5 2.25

X Y p(X,Y) h(X,Y) i(XY)
x1 y1 0.25 2.0 1.0
x1 y2 0.25 2.0 1.0
x2 y3 0.25 2.0 1.0
x3 y3 0.125 3.0 1.0
x4 y3 0.125 3.0 1.0
X p(X) h(X)
x1 0.5 1.0
x2 0.25 2.0
x3 0.125 3.0
x4 0.125 3.0
Y p(Y) h(Y)
y1 0.25 2.0
y2 0.25 2.0
y3 0.5 1.0
81
Outline
  • Part 1
  • Introduction to Information Theory
  • Application Data Anonymization
  • Application Database Design
  • Part 2
  • Review of Information Theory Basics
  • Application Data Integration
  • Computing Information Theoretic Primitives
  • Open Problems

82
Schema Matching
  • Goal: align columns across database tables to be integrated
  • A fundamental problem in database integration
  • Early useful approach: textual similarity of column names
  • False positives: Address vs. IP_Address
  • False negatives: Customer_Id vs. Client_Number
  • Early useful approach: overlap of values in columns, e.g., Jaccard
  • False positives: Emp_Id vs. Project_Id
  • False negatives: Emp_Id vs. Personnel_Number

83
Opaque Schema Matching KN03
  • Goal align columns when column names, data
    values are opaque
  • Databases belong to different government
    bureaucracies ?
  • Treat column names and data values as
    uninterpreted (generic)
  • Example EMP_PROJ(Emp_Id, Proj_Id, Task_Id,
    Status_Id)
  • Likely that all Id fields are from the same
    domain
  • Different databases may have different column
    names

W X Y Z
w2 x1 y1 z2
w4 x2 y3 z3
w3 x3 y3 z1
w1 x2 y1 z2
A B C D
a1 b2 c1 d1
a3 b4 c2 d2
a1 b1 c1 d2
a4 b3 c2 d3
84
Opaque Schema Matching KN03
  • Approach: build a complete, labeled graph GD for each database D
  • Nodes are columns; label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y) (see the sketch below)
  • Perform graph matching between GD1 and GD2, minimizing distance
  • Intuition
  • Entropy H(X) captures the distribution of values in database column X
  • Mutual information I(X;Y) captures correlations between X, Y
  • Efficiency: graph matching between schema-sized graphs
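A sketch in Python of computing these node and edge labels for one table (graph matching itself is not shown; helper names are ours):

    import math
    from collections import Counter
    from itertools import combinations

    def entropy(column):
        n = len(column)
        return sum((c / n) * math.log2(n / c) for c in Counter(column).values())

    def labeled_graph(table):
        # node label = H(column); edge label = I(X;Y) = H(X) + H(Y) - H(X,Y)
        nodes = {name: entropy(col) for name, col in table.items()}
        edges = {}
        for a, b in combinations(table, 2):
            edges[(a, b)] = nodes[a] + nodes[b] - entropy(list(zip(table[a], table[b])))
        return nodes, edges

    table = {"A": ["a1", "a3", "a1", "a4"], "B": ["b2", "b4", "b1", "b3"],
             "C": ["c1", "c2", "c1", "c2"], "D": ["d1", "d2", "d2", "d3"]}
    nodes, edges = labeled_graph(table)
    print(nodes)               # H(A)=1.5, H(B)=2.0, H(C)=1.0, H(D)=1.5, as on the next slides
    print(edges[("A", "B")])   # I(A;B) = 1.5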

85
Opaque Schema Matching KN03
  • Approach: build a complete, labeled graph GD for each database D
  • Nodes are columns; label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)

A B C D
a1 b2 c1 d1
a3 b4 c2 d2
a1 b1 c1 d2
a4 b3 c2 d3
A p(A)
a1 0.5
a3 0.25
a4 0.25
B p(B)
b1 0.25
b2 0.25
b3 0.25
b4 0.25
C p(C)
c1 0.5
c2 0.5
D p(D)
d1 0.25
d2 0.5
d3 0.25
86
Opaque Schema Matching KN03
  • Approach: build a complete, labeled graph GD for each database D
  • Nodes are columns; label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
  • H(A) = 1.5, H(B) = 2.0, H(C) = 1.0, H(D) = 1.5

A B C D
a1 b2 c1 d1
a3 b4 c2 d2
a1 b1 c1 d2
a4 b3 c2 d3
A h(A)
a1 1.0
a3 2.0
a4 2.0
B h(B)
b1 2.0
b2 2.0
b3 2.0
b4 2.0
C h(C)
c1 1.0
c2 1.0
D h(D)
d1 2.0
d2 1.0
d3 2.0
87
Opaque Schema Matching KN03
  • Approach: build a complete, labeled graph GD for each database D
  • Nodes are columns; label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
  • H(A) = 1.5, H(B) = 2.0, H(C) = 1.0, H(D) = 1.5, I(A;B) = 1.5

A B C D
a1 b2 c1 d1
a3 b4 c2 d2
a1 b1 c1 d2
a4 b3 c2 d3
A h(A)
a1 1.0
a3 2.0
a4 2.0
B h(B)
b1 2.0
b2 2.0
b3 2.0
b4 2.0
A B h(A,B) i(AB)
a1 b2 2.0 1.0
a3 b4 2.0 2.0
a1 b1 2.0 1.0
a4 b3 2.0 2.0
88
Opaque Schema Matching KN03
  • Approach: build a complete, labeled graph GD for each database D
  • Nodes are columns; label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)

A B C D
a1 b2 c1 d1
a3 b4 c2 d2
a1 b1 c1 d2
a4 b3 c2 d3
89
Opaque Schema Matching KN03
  • Approach: build a complete, labeled graph GD for each database D
  • Nodes are columns; label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
  • Perform graph matching between GD1 and GD2, minimizing distance
  • KN03 uses Euclidean and normal distance metrics

90
Opaque Schema Matching KN03
  • Approach: build a complete, labeled graph GD for each database D
  • Nodes are columns; label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
  • Perform graph matching between GD1 and GD2, minimizing distance

91
Opaque Schema Matching KN03
  • Approach: build a complete, labeled graph GD for each database D
  • Nodes are columns; label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
  • Perform graph matching between GD1 and GD2, minimizing distance

92
Heterogeneity Identification DKOSV06
  • Goal identify columns with semantically
    heterogeneous values
  • Can arise due to opaque schema matching KN03
  • Key ideas
  • Heterogeneity based on distribution,
    distinguishability of values
  • Use Information Theory to quantify heterogeneity
  • Issues
  • Which information theoretic measure characterizes
    heterogeneity?
  • How do we actually cluster the data ?

93
Heterogeneity Identification DKOSV06
  • Example semantically homogeneous, heterogeneous
    columns

Customer_Id
h8742@yyy.com
kkjj@haha.org
qwerty@keyboard.us
555-1212@fax.in
alpha@beta.ga
john.smith@noname.org
jane.doe@1973law.us
jamesbond.007@action.com
Customer_Id
h8742@yyy.com
kkjj@haha.org
qwerty@keyboard.us
555-1212@fax.in
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
94
Heterogeneity Identification DKOSV06
  • Example semantically homogeneous, heterogeneous
    columns

Customer_Id
h8742@yyy.com
kkjj@haha.org
qwerty@keyboard.us
555-1212@fax.in
alpha@beta.ga
john.smith@noname.org
jane.doe@1973law.us
jamesbond.007@action.com
Customer_Id
h8742@yyy.com
kkjj@haha.org
qwerty@keyboard.us
555-1212@fax.in
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
95
Heterogeneity Identification DKOSV06
  • Example semantically homogeneous, heterogeneous
    columns
  • More semantic types in a column ⇒ greater heterogeneity
  • Only email versus email + phone

Customer_Id
h8742@yyy.com
kkjj@haha.org
qwerty@keyboard.us
555-1212@fax.in
alpha@beta.ga
john.smith@noname.org
jane.doe@1973law.us
jamesbond.007@action.com
Customer_Id
h8742@yyy.com
kkjj@haha.org
qwerty@keyboard.us
555-1212@fax.in
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
96
Heterogeneity Identification DKOSV06
  • Example semantically homogeneous, heterogeneous
    columns

Customer_Id
h8742@yyy.com
kkjj@haha.org
qwerty@keyboard.us
555-1212@fax.in
alpha@beta.ga
john.smith@noname.org
jane.doe@1973law.us
(877)-807-4596
Customer_Id
h8742@yyy.com
kkjj@haha.org
qwerty@keyboard.us
555-1212@fax.in
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
97
Heterogeneity Identification DKOSV06
  • Example semantically homogeneous, heterogeneous
    columns
  • The relative distribution of semantic types impacts heterogeneity
  • Mainly email + a few phone versus balanced email + phone

Customer_Id
h8742@yyy.com
kkjj@haha.org
qwerty@keyboard.us
555-1212@fax.in
alpha@beta.ga
john.smith@noname.org
jane.doe@1973law.us
(877)-807-4596
Customer_Id
h8742@yyy.com
kkjj@haha.org
qwerty@keyboard.us
555-1212@fax.in
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
98
Heterogeneity Identification DKOSV06
  • Example semantically homogeneous, heterogeneous
    columns

Customer_Id
187-65-2468
987-64-6837
789-15-4321
987-65-4321
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
Customer_Id
h8742@yyy.com
kkjj@haha.org
qwerty@keyboard.us
555-1212@fax.in
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
99
Heterogeneity Identification DKOSV06
  • Example semantically homogeneous, heterogeneous
    columns

Customer_Id
187-65-2468
987-64-6837
789-15-4321
987-65-4321
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
Customer_Id
h8742@yyy.com
kkjj@haha.org
qwerty@keyboard.us
555-1212@fax.in
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
100
Heterogeneity Identification DKOSV06
  • Example semantically homogeneous, heterogeneous
    columns
  • More easily distinguished types ⇒ greater heterogeneity
  • Phone + (possibly) SSN versus balanced email + phone

Customer_Id
187-65-2468
987-64-6837
789-15-4321
987-65-4321
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
Customer_Id
h8742@yyy.com
kkjj@haha.org
qwerty@keyboard.us
555-1212@fax.in
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
101
Heterogeneity Identification DKOSV06
  • Heterogeneity = complexity of describing the data
  • More, balanced clusters ⇒ greater heterogeneity
  • More distinguishable clusters ⇒ greater heterogeneity
  • Use two views of mutual information
  • It quantifies the complexity of describing the data (compression)
  • It quantifies the quality of the compression

102
Heterogeneity Identification DKOSV06
  • Hard clustering

X Customer_Id T Cluster_Id
187-65-2468 t1
987-64-6837 t1
789-15-4321 t1
987-65-4321 t1
(908)-555-1234 t2
973-360-0000 t1
360-0007 t3
(877)-807-4596 t2
103
Measuring complexity of clustering
  • Take 1: complexity of a clustering = number of clusters
  • The standard model of complexity
  • Doesn't capture the fact that clusters have different sizes
104
Measuring complexity of clustering
  • Take 2: complexity of a clustering = number of bits needed to describe it
  • Writing down k needs log k bits
  • In general, let cluster t ∈ T have |t| elements
  • Set p(t) = |t| / n
  • Bits to write down the cluster sizes: H(T) = Σ_t p_t log(1/p_t)

(Figure: H(skewed clustering) < H(balanced clustering))
105
Heterogeneity Identification DKOSV06
  • Soft clustering: cluster membership probabilities
  • How do we compute a good soft clustering?

X Customer_Id T Cluster_Id p(T|X)
789-15-4321 t1 0.75
987-65-4321 t1 0.75
789-15-4321 t2 0.25
987-65-4321 t2 0.25
(908)-555-1234 t1 0.25
973-360-0000 t1 0.5
(908)-555-1234 t2 0.75
973-360-0000 t2 0.5
106
Measuring complexity of clustering
  • Take 1
  • p(t) = Σx p(x) p(t|x)
  • Compute H(T) as before
  • Problem: H(T1) = H(T2)!

T1    t1    t2        T2    t1    t2
x1    0.5   0.5       x1    0.99  0.01
x2    0.5   0.5       x2    0.01  0.99
p(T)  0.5   0.5       p(T)  0.5   0.5
107
Measuring complexity of clustering
  • By averaging the memberships, we've lost useful information
  • Take 2: compute I(T;X)!
  • Even better: if T is a hard clustering of X, then I(T;X) = H(T)

X T1 p(X,T) i(XT)
x1 t1 0.25 0
x1 t2 0.25 0
x2 t1 0.25 0
x2 t2 0.25 0
X T2 p(X,T) i(XT)
x1 t1 0.495 0.99
x1 t2 0.005 -5.64
x2 t1 0.25 0
x2 t2 0.25 0
I(T1;X) = 0
I(T2;X) = 0.46
108
Measuring cost of a cluster
Given objects Xt = {X1, X2, ..., Xm} in cluster t, Cost(t) = sum of distances from the Xi to the cluster center. If we take the distance to be the KL-distance, then the cost of a cluster = I(Xt, V).
109
Cost of a clustering
  • If we partition X into k clusters X1, ..., Xk
  • Cost(clustering) = Σi p_i I(X_i, V), where p_i = |X_i| / |X|
  • Suppose we treat each cluster center itself as a point?

110
Cost of a clustering
  • We can write down the cost of this clustering
  • Cost(T) = I(T;V)
  • Key result BMDG05
  • Cost(clustering) = I(X;V) − I(T;V)
  • Minimizing Cost(clustering) ⇔ maximizing I(T;V)

111
Heterogeneity Identification DKOSV06
  • Represent strings as q-gram distributions (a small sketch follows the tables below)

X Customer_Id V 4-grams p(X,V)
987-65-4321 987- 0.10
987-65-4321 87-6 0.13
987-65-4321 7-65 0.12
987-65-4321 -65- 0.15
987-65-4321 65-4 0.05
987-65-4321 5-43 0.20
987-65-4321 -432 0.15
987-65-4321 4321 0.10
Customer_Id
187-65-2468
987-64-6837
789-15-4321
987-65-4321
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
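A sketch in Python of extracting a q-gram distribution from one string (the slide's p(X,V) values are weighted over the whole column; this toy version just uses uniform per-string weights):

    from collections import Counter

    def qgram_distribution(s, q=4):
        grams = [s[i:i + q] for i in range(len(s) - q + 1)]
        n = len(grams)
        return {g: c / n for g, c in Counter(grams).items()}

    print(qgram_distribution("987-65-4321"))
    # {'987-': 0.125, '87-6': 0.125, '7-65': 0.125, '-65-': 0.125,
    #  '65-4': 0.125, '5-43': 0.125, '-432': 0.125, '4321': 0.125}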
112
Heterogeneity Identification DKOSV06
  • iIB: find a soft clustering T of X that minimizes I(T;X) − β I(T;V)
  • Allow iIB to use arbitrarily many clusters; use β = H(X)/I(X;V)
  • Closest to the point with minimum space and maximum quality

X Customer_Id V 4-grams p(X,V)
987-65-4321 987- 0.10
987-65-4321 87-6 0.13
987-65-4321 7-65 0.12
987-65-4321 -65- 0.15
987-65-4321 65-4 0.05
987-65-4321 5-43 0.20
987-65-4321 -432 0.15
987-65-4321 4321 0.10
Customer_Id
187-65-2468
987-64-6837
789-15-4321
987-65-4321
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
113
Heterogeneity Identification DKOSV06
  • Rate-distortion curve: I(T;V)/I(X;V) vs. I(T;X)/H(X)
(Figure: rate-distortion curve, with the operating point β marked)

114
Heterogeneity Identification DKOSV06
  • Heterogeneity = mutual information I(T;X) of the iIB clustering T at β
  • 0 ≤ I(T;X) (= 0.126) ≤ H(X) (= 2.0), H(T) (= 1.0)
  • Ideally, use iIB with an arbitrarily large number of clusters in T

X Customer_Id T Cluster_Id p(T|X) i(T;X)
789-15-4321 t1 0.75 0.41
987-65-4321 t1 0.75 0.41
789-15-4321 t2 0.25 -0.81
987-65-4321 t2 0.25 -0.81
(908)-555-1234 t1 0.25 -1.17
973-360-0000 t1 0.5 -0.17
(908)-555-1234 t2 0.75 0.77
973-360-0000 t2 0.5 0.19
115
Heterogeneity Identification DKOSV06
  • Heterogeneity = mutual information I(T;X) of the iIB clustering T at β

116
Data Integration Summary
  • Analyzing the database instance is critical for effective data integration
  • Matching and quality assessments are key components
  • Information-theoretic measures are useful for schema matching
  • Align columns when column names and data values are opaque
  • Mutual information I(X;V) captures correlations between X, V
  • Information-theoretic measures are useful for heterogeneity testing
  • Identify columns with semantically heterogeneous values
  • I(T;X) of the iIB clustering T at β captures column heterogeneity

117
Outline
  • Part 1
  • Introduction to Information Theory
  • Application Data Anonymization
  • Application Database Design
  • Part 2
  • Review of Information Theory Basics
  • Application Data Integration
  • Computing Information Theoretic Primitives
  • Open Problems

118
Domain size matters
  • For a random variable X, the domain size is |supp(X)| = |{xi : p(X = xi) > 0}|
  • Different solutions exist depending on whether the domain size is small or large
  • Probability vectors are usually very sparse

119
Entropy Case I - Small domain size
  • Suppose the number of unique values for a random variable X is small (i.e., fits in memory)
  • Maximum likelihood estimator
  • p(x) = (# times x is encountered) / (total number of items in the set)

(Example: the stream 1, 2, 1, 2, 1, 5, 4 over the domain {1, 2, 3, 4, 5})
120
Entropy Case I - Small domain size
  • H_MLE = Σx p(x) log(1/p(x))
  • This is a biased estimate: E[H_MLE] < H
  • Miller-Madow correction (see the sketch below)
  • H ≈ H_MLE + (m − 1)/2n
  • m is an estimate of the number of non-empty bins
  • n = number of samples
  • Bad news: ALL estimators for H are biased
  • Good news: we can quantify the bias and variance of the MLE
  • Bias ≤ log(1 + m/N)
  • Var(H_MLE) ≤ (log n)² / N
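A sketch in Python of the plug-in estimate and the Miller-Madow correction, applied exactly as stated on the slide (function names and the toy stream are ours):

    import math
    from collections import Counter

    def entropy_mle(samples):
        n = len(samples)
        return sum((c / n) * math.log2(n / c) for c in Counter(samples).values())

    def entropy_miller_madow(samples):
        # H_MLE + (m - 1) / (2n), with m = number of non-empty bins
        n, m = len(samples), len(set(samples))
        return entropy_mle(samples) + (m - 1) / (2 * n)

    stream = [1, 2, 1, 2, 1, 5, 4]        # the small example stream from the earlier slide
    print(entropy_mle(stream))            # ~1.84
    print(entropy_miller_madow(stream))   # ~2.06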

121
Entropy Case II - Large domain size
  • |X| is too large to fit in main memory, so we can't maintain explicit counts
  • Streaming algorithms for H(X)
  • Long history of work on this problem
  • Bottom line
  • A (1+ε)-relative approximation for H(X) that allows updates to the frequencies and requires almost constant, optimal space HNO08

122
Streaming Entropy CCM07
  • High-level idea: sample randomly from the stream, and track counts of the elements picked (as in AMS)
  • Problem: a skewed distribution prevents us from sampling lower-frequency elements (and the entropy is small)
  • Idea: estimate the largest frequency, and the distribution of what's left (higher entropy)

123
Streaming Entropy CCM07
  • Maintain a set of samples from the original distribution and from the distribution without the most frequent element