Title: Information Theory For Data Management
1. Information Theory For Data Management
Divesh Srivastava Suresh Venkatasubramanian
2. Motivation
-- Abstruse Goose (177)
Information Theory is relevant to all of
humanity...
3. Background
- Many problems in data management need precise reasoning about information content, transfer, and loss
  - Structure extraction
  - Privacy preservation
  - Schema design
  - Probabilistic data?
4. Information Theory
- First developed by Shannon as a way of quantifying the capacity of signal channels
- Entropy, relative entropy, and mutual information capture intrinsic informational aspects of a signal
- Today
  - Information theory provides a domain-independent way to reason about structure in data
  - More information ⇒ interesting structure
  - Less information linkage ⇒ decoupling of structures
5. Tutorial Thesis
- Information theory provides a mathematical framework for the quantification of information content, linkage, and loss.
- This framework can be used in the design of data management strategies that rely on probing the structure of information in data.
6. Tutorial Goals
- Introduce information-theoretic concepts to a DB audience
- Give a data-centric perspective on information theory
- Connect these to applications in data management
- Describe the underlying computational primitives
- Illuminate when and how information theory might be of use in new areas of data management
7. Outline
- Part 1
  - Introduction to Information Theory
  - Application: Data Anonymization
  - Application: Database Design
- Part 2
  - Review of Information Theory Basics
  - Application: Data Integration
  - Computing Information-Theoretic Primitives
  - Open Problems
9. Histograms And Discrete Distributions
X    f(x)  w(X)
x1   45    20
x2   23     6
x3   12     2
x4   12     2
(normalize the frequencies f(x), or reweight by w(X), to obtain a discrete distribution)
10. From Columns To Random Variables
- We can think of a column of data as represented by a random variable
  - X is a random variable
  - p(X) is the column of probabilities p(X = x1), p(X = x2), and so on
  - Also known (in the unweighted case) as the empirical distribution induced by the column X
- Notation
  - X (upper case) denotes a random variable (column)
  - x (lower case) denotes a value taken by X (a field in a tuple)
  - p(x) is the probability p(X = x)
11. Joint Distributions
- Discrete distribution: probability p(X,Y,Z)
- p(Y) = Σx p(X=x, Y) = Σx Σz p(X=x, Y, Z=z)
X Y Z p(X,Y,Z)
x1 y1 z1 0.125
x1 y2 z2 0.125
x1 y1 z2 0.125
x1 y2 z1 0.125
x2 y3 z3 0.125
x2 y3 z4 0.125
x3 y3 z5 0.125
x4 y3 z6 0.125
X p(X)
x1 0.5
x2 0.25
x3 0.125
x4 0.125
X Y p(X,Y)
x1 y1 0.25
x1 y2 0.25
x2 y3 0.25
x3 y3 0.125
x4 y3 0.125
Y p(Y)
y1 0.25
y2 0.25
y3 0.5
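The marginalization rule above is easy to check mechanically. A minimal Python sketch, using the joint distribution from this slide (the helper name `marginal` is our own, illustrative choice):

```python
from collections import defaultdict

# Joint distribution p(X, Y, Z) from the slide: 8 equally likely tuples.
joint = {
    ("x1", "y1", "z1"): 0.125, ("x1", "y2", "z2"): 0.125,
    ("x1", "y1", "z2"): 0.125, ("x1", "y2", "z1"): 0.125,
    ("x2", "y3", "z3"): 0.125, ("x2", "y3", "z4"): 0.125,
    ("x3", "y3", "z5"): 0.125, ("x4", "y3", "z6"): 0.125,
}

def marginal(p, keep):
    """Sum out every coordinate except the indices listed in `keep`."""
    out = defaultdict(float)
    for point, mass in p.items():
        out[tuple(point[i] for i in keep)] += mass
    return dict(out)

print(marginal(joint, (0,)))  # p(X): x1 -> 0.5, x2 -> 0.25, x3/x4 -> 0.125
print(marginal(joint, (1,)))  # p(Y): y1 -> 0.25, y2 -> 0.25, y3 -> 0.5
```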
12. Entropy Of A Column
X p(X) h(X)
x1 0.5 1
x2 0.25 2
x3 0.125 3
x4 0.125 3
- Let h(x) = log2(1/p(x))
- h(X) is the column of h(x) values
- H(X) = E_X[h(x)] = Σx p(x) log2(1/p(x))
- Two views of entropy
  - It captures uncertainty in the data: higher entropy, more unpredictability
  - It captures information content: higher entropy, more information
- H(X) = 1.75 < log2 |X| = 2
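A minimal Python sketch of the entropy computation above, on the same column distribution (stdlib only):

```python
import math

def entropy(p):
    """Shannon entropy (bits) of a discrete distribution given as a dict."""
    return sum(q * math.log2(1.0 / q) for q in p.values() if q > 0)

p_X = {"x1": 0.5, "x2": 0.25, "x3": 0.125, "x4": 0.125}
print(entropy(p_X))          # 1.75
print(math.log2(len(p_X)))   # 2.0, the maximum for a 4-value domain
```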
13. Examples
- X uniform over {1, ..., 4}: H(X) = 2
- Y is 1 with probability 0.5, and uniform over {2, 3, 4} otherwise
  - H(Y) = 0.5 log 2 + 0.5 log 6 ≈ 1.8 < 2
  - Y is more sharply defined, and so has less uncertainty
- Z uniform over {1, ..., 8}: H(Z) = 3 > 2
  - Z spans a larger range, and captures more information
(Figure: histograms of X, Y, and Z)
14. Comparing Distributions
- How do we measure the difference between two distributions?
- Kullback-Leibler divergence
  - d_KL(p, q) = E_p[h_q(x) - h_p(x)] = Σi p_i log(p_i/q_i)
(Figure: prior belief + inference mechanism → resulting belief)
15. Comparing Distributions
- Kullback-Leibler divergence
  - d_KL(p, q) = E_p[h_q(x) - h_p(x)] = Σi p_i log(p_i/q_i)
- d_KL(p, q) ≥ 0
- Captures the extra information needed to describe p given q
- Is asymmetric: d_KL(p, q) ≠ d_KL(q, p)
- Is not a metric (does not satisfy the triangle inequality)
- There are other measures: χ²-distance, variational distance, f-divergences, ...
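A small Python sketch of d_KL that illustrates the asymmetry; the two-point distributions p and q here are hypothetical values of our choosing:

```python
import math

def kl_divergence(p, q):
    """d_KL(p || q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.75, "b": 0.25}
print(kl_divergence(p, q))  # ~0.208
print(kl_divergence(q, p))  # ~0.189: different, so KL is not symmetric
```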
16. Conditional Probability
- Given a joint distribution on random variables X, Y, how much information about X can we glean from Y?
- Conditional probability p(X|Y)
  - p(X = x1 | Y = y1) = p(X = x1, Y = y1) / p(Y = y1)
X Y p(X,Y) p(X|Y) p(Y|X)
x1 y1 0.25 1.0 0.5
x1 y2 0.25 1.0 0.5
x2 y3 0.25 0.5 1.0
x3 y3 0.125 0.25 1.0
x4 y3 0.125 0.25 1.0
X p(X)
x1 0.5
x2 0.25
x3 0.125
x4 0.125
Y p(Y)
y1 0.25
y2 0.25
y3 0.5
17. Conditional Entropy
- Let h(x|y) = log2(1/p(x|y))
- H(X|Y) = E_{x,y}[h(x|y)] = Σx Σy p(x,y) log2(1/p(x|y))
- H(X|Y) = H(X,Y) - H(Y)
  - H(X|Y) = H(X,Y) - H(Y) = 2.25 - 1.5 = 0.75
- If X, Y are independent, H(X|Y) = H(X)
X Y p(X,Y) p(X|Y) h(X|Y)
x1 y1 0.25 1.0 0.0
x1 y2 0.25 1.0 0.0
x2 y3 0.25 0.5 1.0
x3 y3 0.125 0.25 2.0
x4 y3 0.125 0.25 2.0
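A short Python check of the chain rule H(X|Y) = H(X,Y) - H(Y) on the joint distribution from this slide:

```python
import math

joint = {("x1", "y1"): 0.25, ("x1", "y2"): 0.25, ("x2", "y3"): 0.25,
         ("x3", "y3"): 0.125, ("x4", "y3"): 0.125}

def H(p):
    """Entropy in bits of a distribution given as a dict of probabilities."""
    return sum(q * math.log2(1 / q) for q in p.values() if q > 0)

p_Y = {}
for (x, y), p in joint.items():
    p_Y[y] = p_Y.get(y, 0.0) + p

print(H(joint) - H(p_Y))   # H(X|Y) = 2.25 - 1.5 = 0.75
```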
18. Mutual Information
- Mutual information captures the difference between the joint distribution on X and Y and the product of the marginal distributions on X and Y
- Let i(x;y) = log [p(x,y) / (p(x)p(y))]
- I(X;Y) = E_{x,y}[i(x;y)] = Σx Σy p(x,y) log [p(x,y) / (p(x)p(y))]
X Y p(X,Y) h(X,Y) i(x;y)
x1 y1 0.25 2.0 1.0
x1 y2 0.25 2.0 1.0
x2 y3 0.25 2.0 1.0
x3 y3 0.125 3.0 1.0
x4 y3 0.125 3.0 1.0
X p(X) h(X)
x1 0.5 1.0
x2 0.25 2.0
x3 0.125 3.0
x4 0.125 3.0
Y p(Y) h(Y)
y1 0.25 2.0
y2 0.25 2.0
y3 0.5 1.0
19. Mutual Information: Strength of Linkage
- I(X;Y) = H(X) + H(Y) - H(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
- If X, Y are independent, then I(X;Y) = 0
  - H(X,Y) = H(X) + H(Y), so I(X;Y) = H(X) + H(Y) - H(X,Y) = 0
- I(X;Y) ≤ min(H(X), H(Y))
- Suppose Y = f(X) (deterministically)
  - Then H(Y|X) = 0, and so I(X;Y) = H(Y) - H(Y|X) = H(Y)
- Mutual information captures higher-order interactions
  - Covariance captures linear interactions only
  - Two variables can be uncorrelated (covariance = 0) and still have nonzero mutual information
  - X uniform in [-1, 1], Y = X²: Cov(X, Y) = 0, yet I(X;Y) > 0
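The uncorrelated-but-dependent phenomenon can be reproduced exactly with a small discrete stand-in for the slide's example; the choice of X uniform on {-1, 0, 1} is our assumption, made so both quantities are finite and exact:

```python
import math
from collections import Counter

# Discrete stand-in: X uniform on {-1, 0, 1}, Y = X * X.
joint = Counter()
for x in (-1, 0, 1):
    joint[(x, x * x)] += 1 / 3

p_x, p_y = Counter(), Counter()
for (x, y), p in joint.items():
    p_x[x] += p
    p_y[y] += p

cov = sum(p * x * y for (x, y), p in joint.items()) \
      - sum(p * x for x, p in p_x.items()) * sum(p * y for y, p in p_y.items())
mi = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in joint.items())
print(cov, mi)   # cov = 0.0, mi ~ 0.918 bits: dependence without correlation
```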
20. Information Theory Summary
- We can represent data as discrete distributions (normalized histograms)
- Entropy captures the uncertainty or information content in a distribution
- The Kullback-Leibler divergence captures the difference between distributions
- Mutual information and conditional entropy capture linkage between variables in a joint distribution
21. Outline
- Part 1
  - Introduction to Information Theory
  - Application: Data Anonymization
  - Application: Database Design
- Part 2
  - Review of Information Theory Basics
  - Application: Data Integration
  - Computing Information-Theoretic Primitives
  - Open Problems
22. Data Anonymization Using Randomization
- Goal: publish anonymized microdata to enable accurate ad hoc analyses, but ensure privacy of individuals' sensitive attributes
- Key ideas
  - Randomize numerical data: add noise from a known distribution
  - Reconstruct the original data distribution using the published noisy data
- Issues
  - How can the original data distribution be reconstructed?
  - What kinds of randomization preserve privacy of individuals?
23. Data Anonymization Using Randomization
- Many randomization strategies proposed [AS00, AA01, EGS03]
- Example randomization strategies: X in [0, 10]
  - R = X + µ (mod 11), µ uniform in {-1, 0, +1}
  - R = X + µ (mod 11), µ = -1 (p = 0.25), 0 (p = 0.5), +1 (p = 0.25)
  - R = X (p = 0.6); R = µ, µ uniform in [0, 10] (p = 0.4)
- Questions
  - Which randomization strategy has higher privacy preservation?
  - Quantify the loss of privacy due to publication of randomized data
24. Data Anonymization Using Randomization
- X in [0, 10], R1 = X + µ (mod 11), µ uniform in {-1, 0, +1}
Id X
s1 0
s2 3
s3 5
s4 0
s5 8
s6 0
s7 6
s8 0
25. Data Anonymization Using Randomization
- X in [0, 10], R1 = X + µ (mod 11), µ uniform in {-1, 0, +1}
Id X µ
s1 0 -1
s2 3 0
s3 5 1
s4 0 0
s5 8 1
s6 0 -1
s7 6 1
s8 0 0
Id R1
s1 10
s2 3
s3 6
s4 0
s5 9
s6 10
s7 7
s8 0
26. Data Anonymization Using Randomization
- X in [0, 10], R1 = X + µ (mod 11), µ uniform in {-1, 0, +1}
Id X µ
s1 0 0
s2 3 -1
s3 5 0
s4 0 1
s5 8 1
s6 0 -1
s7 6 -1
s8 0 1
Id R1
s1 0
s2 2
s3 5
s4 1
s5 9
s6 10
s7 5
s8 1
27. Reconstruction of Original Data Distribution
- X in [0, 10], R1 = X + µ (mod 11), µ uniform in {-1, 0, +1}
- Reconstruct the distribution of X using knowledge of R1 and µ
- The EM algorithm converges to the MLE of the original distribution [AA01]
Id X µ
s1 0 0
s2 3 -1
s3 5 0
s4 0 1
s5 8 1
s6 0 -1
s7 6 -1
s8 0 1
Id R1
s1 0
s2 2
s3 5
s4 1
s5 9
s6 10
s7 5
s8 1
Id X-candidates given R1
s1 10, 0, 1
s2 1, 2, 3
s3 4, 5, 6
s4 0, 1, 2
s5 8, 9, 10
s6 9, 10, 0
s7 4, 5, 6
s8 0, 1, 2
28. Analysis of Privacy [AS00]
- X in [0, 10], R1 = X + µ (mod 11), µ uniform in {-1, 0, +1}
- If X is uniform in [0, 10], privacy is determined by the range of µ
Id X µ
s1 0 0
s2 3 -1
s3 5 0
s4 0 1
s5 8 1
s6 0 -1
s7 6 -1
s8 0 1
Id R1
s1 0
s2 2
s3 5
s4 1
s5 9
s6 10
s7 5
s8 1
Id X-candidates given R1
s1 10, 0, 1
s2 1, 2, 3
s3 4, 5, 6
s4 0, 1, 2
s5 8, 9, 10
s6 9, 10, 0
s7 4, 5, 6
s8 0, 1, 2
29. Analysis of Privacy [AA01]
- X in [0, 10], R1 = X + µ (mod 11), µ uniform in {-1, 0, +1}
- If X is uniform in {0, 1} ∪ {5, 6}, privacy is smaller than the range of µ suggests
Id X µ
s1 0 0
s2 1 -1
s3 5 0
s4 6 1
s5 0 1
s6 1 -1
s7 5 -1
s8 6 1
Id R1
s1 0
s2 0
s3 5
s4 7
s5 1
s6 0
s7 4
s8 7
Id X-candidates given R1
s1 10, 0, 1
s2 10, 0, 1
s3 4, 5, 6
s4 6, 7, 8
s5 0, 1, 2
s6 10, 0, 1
s7 3, 4, 5
s8 6, 7, 8
30. Analysis of Privacy [AA01]
- X in [0, 10], R1 = X + µ (mod 11), µ uniform in {-1, 0, +1}
- If X is uniform in {0, 1} ∪ {5, 6}, privacy is smaller than the range of µ suggests
- In some cases, the sensitive value is revealed
Id X µ
s1 0 0
s2 1 -1
s3 5 0
s4 6 1
s5 0 1
s6 1 -1
s7 5 -1
s8 6 1
Id R1
s1 0
s2 0
s3 5
s4 7
s5 1
s6 0
s7 4
s8 7
Id X-candidates given R1
s1 0, 1
s2 0, 1
s3 5, 6
s4 6
s5 0, 1
s6 0, 1
s7 5
s8 6
31. Quantify Loss of Privacy [AA01]
- Goal: quantify loss of privacy based on mutual information I(X;R)
  - Smaller H(X|R) ⇒ more loss of privacy in X from knowledge of R
  - Larger I(X;R) ⇒ more loss of privacy in X from knowledge of R
  - I(X;R) = H(X) - H(X|R)
- I(X;R) is used to capture the correlation between X and R
  - p(X) is the prior knowledge of the sensitive attribute X
  - p(X, R) is the joint distribution of X and R
32. Quantify Loss of Privacy [AA01]
- Goal: quantify loss of privacy based on mutual information I(X;R)
- X uniform in {5, 6}, R1 = X + µ (mod 11), µ uniform in {-1, 0, +1}
X R1 p(X,R1) h(X,R1) i(X;R1)
5 4
5 5
5 6
6 5
6 6
6 7
X p(X) h(X)
5
6
R1 p(R1) h(R1)
4
5
6
7
33. Quantify Loss of Privacy [AA01]
- Goal: quantify loss of privacy based on mutual information I(X;R)
- X uniform in {5, 6}, R1 = X + µ (mod 11), µ uniform in {-1, 0, +1}
X R1 p(X,R1) h(X,R1) i(X;R1)
5 4 0.17
5 5 0.17
5 6 0.17
6 5 0.17
6 6 0.17
6 7 0.17
X p(X) h(X)
5 0.5
6 0.5
R1 p(R1) h(R1)
4 0.17
5 0.34
6 0.34
7 0.17
34. Quantify Loss of Privacy [AA01]
- Goal: quantify loss of privacy based on mutual information I(X;R)
- X uniform in {5, 6}, R1 = X + µ (mod 11), µ uniform in {-1, 0, +1}
X R1 p(X,R1) h(X,R1) i(X;R1)
5 4 0.17 2.58
5 5 0.17 2.58
5 6 0.17 2.58
6 5 0.17 2.58
6 6 0.17 2.58
6 7 0.17 2.58
X p(X) h(X)
5 0.5 1.0
6 0.5 1.0
R1 p(R1) h(R1)
4 0.17 2.58
5 0.34 1.58
6 0.34 1.58
7 0.17 2.58
35. Quantify Loss of Privacy [AA01]
- Goal: quantify loss of privacy based on mutual information I(X;R)
- X uniform in {5, 6}, R1 = X + µ (mod 11), µ uniform in {-1, 0, +1}
- I(X;R1) = 0.33
X R1 p(X,R1) h(X,R1) i(X;R1)
5 4 0.17 2.58 1.0
5 5 0.17 2.58 0.0
5 6 0.17 2.58 0.0
6 5 0.17 2.58 0.0
6 6 0.17 2.58 0.0
6 7 0.17 2.58 1.0
X p(X) h(X)
5 0.5 1.0
6 0.5 1.0
R1 p(R1) h(R1)
4 0.17 2.58
5 0.34 1.58
6 0.34 1.58
7 0.17 2.58
36. Quantify Loss of Privacy [AA01]
- Goal: quantify loss of privacy based on mutual information I(X;R)
- X uniform in {5, 6}, R2 = X + µ (mod 11), µ uniform in {0, 1}
- I(X;R1) = 0.33, I(X;R2) = 0.5 ⇒ R2 is a bigger privacy risk than R1
X R2 p(X,R2) h(X,R2) i(X;R2)
5 5 0.25 2.0 1.0
5 6 0.25 2.0 0.0
6 6 0.25 2.0 0.0
6 7 0.25 2.0 1.0
X p(X) h(X)
5 0.5 1.0
6 0.5 1.0
R2 p(R2) h(R2)
5 0.25 2.0
6 0.5 1.0
7 0.25 2.0
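A minimal sketch of this comparison, computing I(X;R1) and I(X;R2) from first principles; the joint-construction helper `randomized_joint` is our own name, and exact fractions avoid rounding:

```python
import math
from collections import Counter
from fractions import Fraction

def mutual_information(joint):
    """I(X;R) in bits from a joint distribution {(x, r): prob}."""
    px, pr = Counter(), Counter()
    for (x, r), p in joint.items():
        px[x] += p
        pr[r] += p
    return sum(p * math.log2(p / (px[x] * pr[r]))
               for (x, r), p in joint.items() if p)

def randomized_joint(noise):
    """X uniform on {5, 6}; R = X + mu (mod 11) with noise {mu: prob}."""
    joint = Counter()
    for x in (5, 6):
        for mu, q in noise.items():
            joint[(x, (x + mu) % 11)] += Fraction(1, 2) * q
    return joint

R1 = randomized_joint({-1: Fraction(1, 3), 0: Fraction(1, 3), 1: Fraction(1, 3)})
R2 = randomized_joint({0: Fraction(1, 2), 1: Fraction(1, 2)})
print(mutual_information(R1), mutual_information(R2))  # ~0.333 vs 0.5
```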
37. Quantify Loss of Privacy [AA01]
- Equivalent goal: quantify loss of privacy based on H(X|R)
- X uniform in {5, 6}, R2 = X + µ (mod 11), µ uniform in {0, 1}
- Intuition: we know more about X given R2 than about X given R1
- H(X|R1) = 0.67, H(X|R2) = 0.5 ⇒ R2 is a bigger privacy risk than R1
X R2 p(X,R2) p(X|R2) h(X|R2)
5 5 0.25 1.0 0.0
5 6 0.25 0.5 1.0
6 6 0.25 0.5 1.0
6 7 0.25 1.0 0.0
X R1 p(X,R1) p(X|R1) h(X|R1)
5 4 0.17 1.0 0.0
5 5 0.17 0.5 1.0
5 6 0.17 0.5 1.0
6 5 0.17 0.5 1.0
6 6 0.17 0.5 1.0
6 7 0.17 1.0 0.0
38. Quantify Loss of Privacy
- Example: X uniform in {0, 1}
  - R3 = e (p = 0.9999), R3 = X (p = 0.0001)
  - R4 = X (p = 0.6), R4 = 1 - X (p = 0.4)
- Is R3 or R4 the bigger privacy risk?
39. Worst Case Loss of Privacy [EGS03]
- Example: X uniform in {0, 1}
  - R3 = e (p = 0.9999), R3 = X (p = 0.0001)
  - R4 = X (p = 0.6), R4 = 1 - X (p = 0.4)
  - I(X;R3) = 0.0001 << I(X;R4) = 0.028
X R3 p(X,R3) h(X,R3) i(X;R3)
0 e 0.49995 1.0 0.0
0 0 0.00005 14.29 1.0
1 e 0.49995 1.0 0.0
1 1 0.00005 14.29 1.0
X R4 p(X,R4) h(X,R4) i(X;R4)
0 0 0.3 1.74 0.26
0 1 0.2 2.32 -0.32
1 0 0.2 2.32 -0.32
1 1 0.3 1.74 0.26
40. Worst Case Loss of Privacy [EGS03]
- Example: X uniform in {0, 1}
  - R3 = e (p = 0.9999), R3 = X (p = 0.0001)
  - R4 = X (p = 0.6), R4 = 1 - X (p = 0.4)
  - I(X;R3) = 0.0001 << I(X;R4) = 0.028
- But R3 has a larger worst-case risk
X R3 p(X,R3) h(X,R3) i(X;R3)
0 e 0.49995 1.0 0.0
0 0 0.00005 14.29 1.0
1 e 0.49995 1.0 0.0
1 1 0.00005 14.29 1.0
X R4 p(X,R4) h(X,R4) i(X;R4)
0 0 0.3 1.74 0.26
0 1 0.2 2.32 -0.32
1 0 0.2 2.32 -0.32
1 1 0.3 1.74 0.26
41. Worst Case Loss of Privacy [EGS03]
- Goal: quantify worst-case loss of privacy in X from knowledge of R
- Use maximum KL divergence instead of mutual information
- Mutual information can be formulated as an expected KL divergence
  - I(X;R) = Σx Σr p(x,r) log2(p(x,r)/(p(x)p(r))) = KL(p(X,R) || p(X)p(R))
  - I(X;R) = Σr p(r) Σx p(x|r) log2(p(x|r)/p(x)) = E_R[KL(p(X|r) || p(X))]
  - The [AA01] measure quantifies the expected loss of privacy over R
- [EGS03] propose a measure based on worst-case loss of privacy
  - I_W(X;R) = max_r KL(p(X|r) || p(X))
42. Worst Case Loss of Privacy [EGS03]
- Example: X uniform in {0, 1}
  - R3 = e (p = 0.9999), R3 = X (p = 0.0001)
  - R4 = X (p = 0.6), R4 = 1 - X (p = 0.4)
  - I_W(X;R3) = max{0.0, 1.0, 1.0} > I_W(X;R4) = max{0.028, 0.028}
X R3 p(X,R3) p(X|R3) i(X;R3)
0 e 0.49995 0.5 0.0
0 0 0.00005 1.0 1.0
1 e 0.49995 0.5 0.0
1 1 0.00005 1.0 1.0
X R4 p(X,R4) p(X|R4) i(X;R4)
0 0 0.3 0.6 0.26
0 1 0.2 0.4 -0.32
1 0 0.2 0.4 -0.32
1 1 0.3 0.6 0.26
43. Worst Case Loss of Privacy [EGS03]
- Example: X uniform in {5, 6}
  - R1 = X + µ (mod 11), µ uniform in {-1, 0, +1}
  - R2 = X + µ (mod 11), µ uniform in {0, 1}
  - I_W(X;R1) = max{1.0, 0.0, 0.0, 1.0} = I_W(X;R2) = max{1.0, 0.0, 1.0} = 1.0
  - Unable to capture that R2 is a bigger privacy risk than R1
X R1 p(X,R1) p(X|R1) i(X;R1)
5 4 0.17 1.0 1.0
5 5 0.17 0.5 0.0
5 6 0.17 0.5 0.0
6 5 0.17 0.5 0.0
6 6 0.17 0.5 0.0
6 7 0.17 1.0 1.0
X R2 p(X,R2) p(X|R2) i(X;R2)
5 5 0.25 1.0 1.0
5 6 0.25 0.5 0.0
6 6 0.25 0.5 0.0
6 7 0.25 1.0 1.0
44. Data Anonymization Summary
- Randomization techniques are useful for microdata anonymization
- Randomization techniques differ in their loss of privacy
- Information-theoretic measures are useful to capture loss of privacy
  - Expected KL divergence captures expected privacy loss [AA01]
  - Maximum KL divergence captures worst-case privacy loss [EGS03]
- Both are useful in practice
45. Outline
- Part 1
  - Introduction to Information Theory
  - Application: Data Anonymization
  - Application: Database Design
- Part 2
  - Review of Information Theory Basics
  - Application: Data Integration
  - Computing Information-Theoretic Primitives
  - Open Problems
46. Information Dependencies [DR00]
- Goal: use information theory to examine and reason about the information content of the attributes in a relation instance
- Key ideas
  - A novel InD measure between attribute sets X, Y based on H(Y|X)
  - Identify numeric inequalities between InD measures
- Results
  - InD measures are a broader class than FDs and MVDs
  - The Armstrong axioms for FDs are derivable from InD inequalities
  - MVD inference rules are derivable from InD inequalities
47. Information Dependencies [DR00]
- Functional dependency X → Y
  - FD X → Y holds iff ∀ t1, t2: (t1[X] = t2[X]) ⇒ (t1[Y] = t2[Y])
X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
48. Information Dependencies [DR00]
- Functional dependency X → Y
  - FD X → Y holds iff ∀ t1, t2: (t1[X] = t2[X]) ⇒ (t1[Y] = t2[Y])
X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
49. Information Dependencies [DR00]
- Result: FD X → Y holds iff H(Y|X) = 0
  - Intuition: once X is known, there is no remaining uncertainty in Y
- Here H(Y|X) = 0.5, so X → Y does not hold
X Y p(X,Y) p(Y|X) h(Y|X)
x1 y1 0.25 0.5 1.0
x1 y2 0.25 0.5 1.0
x2 y3 0.25 1.0 0.0
x3 y3 0.125 1.0 0.0
x4 y3 0.125 1.0 0.0
X p(X)
x1 0.5
x2 0.25
x3 0.125
x4 0.125
Y p(Y)
y1 0.25
y2 0.25
y3 0.5
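This test is directly computable from a relation instance. A minimal sketch, with H(Y|X) estimated from the empirical distribution of the table above:

```python
import math
from collections import Counter

def conditional_entropy(rows, lhs, rhs):
    """H(rhs | lhs) in bits; lhs/rhs are lists of column indices."""
    n = len(rows)
    joint = Counter((tuple(t[i] for i in lhs), tuple(t[i] for i in rhs))
                    for t in rows)
    marg = Counter()
    for (xs, _), c in joint.items():
        marg[xs] += c
    return sum((c / n) * math.log2(marg[xs] / c)
               for (xs, _), c in joint.items())

rows = [("x1","y1","z1"), ("x1","y2","z2"), ("x1","y1","z2"), ("x1","y2","z1"),
        ("x2","y3","z3"), ("x2","y3","z4"), ("x3","y3","z5"), ("x4","y3","z6")]
print(conditional_entropy(rows, [0], [1]))     # 0.5 > 0: FD X -> Y fails here
print(conditional_entropy(rows, [0, 1], [1]))  # 0.0: trivial FD X,Y -> Y holds
```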
50. Information Dependencies [DR00]
- Multi-valued dependency X →→ Y
  - MVD X →→ Y holds iff R(X,Y,Z) = R(X,Y) ⋈ R(X,Z)
X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
51. Information Dependencies [DR00]
- Multi-valued dependency X →→ Y
  - MVD X →→ Y holds iff R(X,Y,Z) = R(X,Y) ⋈ R(X,Z)
X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
X Y
x1 y1
x1 y2
x2 y3
x3 y3
x4 y3
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6
52. Information Dependencies [DR00]
- Multi-valued dependency X →→ Y
  - MVD X →→ Y holds iff R(X,Y,Z) = R(X,Y) ⋈ R(X,Z)
X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
X Y
x1 y1
x1 y2
x2 y3
x3 y3
x4 y3
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6
53. Information Dependencies [DR00]
- Result: MVD X →→ Y holds iff H(Y,Z|X) = H(Y|X) + H(Z|X)
  - Intuition: once X is known, the uncertainties in Y and Z are independent
- H(Y|X) = 0.5, H(Z|X) = 0.75, H(Y,Z|X) = 1.25
X Y h(Y|X)
x1 y1 1.0
x1 y2 1.0
x2 y3 0.0
x3 y3 0.0
x4 y3 0.0
X Z h(Z|X)
x1 z1 1.0
x1 z2 1.0
x2 z3 1.0
x2 z4 1.0
x3 z5 0.0
x4 z6 0.0
X Y Z h(Y,Z|X)
x1 y1 z1 2.0
x1 y2 z2 2.0
x1 y1 z2 2.0
x1 y2 z1 2.0
x2 y3 z3 1.0
x2 y3 z4 1.0
x3 y3 z5 0.0
x4 y3 z6 0.0
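The same empirical machinery checks the MVD criterion; a self-contained sketch on the slide's relation (the helper `cond_H` mirrors the FD sketch above):

```python
import math
from collections import Counter

rows = [("x1","y1","z1"), ("x1","y2","z2"), ("x1","y1","z2"), ("x1","y2","z1"),
        ("x2","y3","z3"), ("x2","y3","z4"), ("x3","y3","z5"), ("x4","y3","z6")]

def cond_H(rows, lhs, rhs):
    """Empirical H(rhs | lhs) in bits over column-index lists."""
    n = len(rows)
    joint = Counter((tuple(t[i] for i in lhs), tuple(t[i] for i in rhs))
                    for t in rows)
    marg = Counter()
    for (xs, _), c in joint.items():
        marg[xs] += c
    return sum((c / n) * math.log2(marg[xs] / c)
               for (xs, _), c in joint.items())

# MVD X ->-> Y holds iff H(Y,Z | X) == H(Y | X) + H(Z | X)
hyz = cond_H(rows, [0], [1, 2])
hy, hz = cond_H(rows, [0], [1]), cond_H(rows, [0], [2])
print(hyz, hy + hz)   # 1.25 and 0.5 + 0.75 = 1.25: the MVD holds here
```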
54. Information Dependencies [DR00]
- Result: the Armstrong axioms for FDs are derivable from InD inequalities
  - Reflexivity: if Y ⊆ X, then X → Y
    - H(Y|X) = 0 for Y ⊆ X
  - Augmentation: X → Y ⇒ X,Z → Y,Z
    - 0 ≤ H(Y,Z|X,Z) = H(Y|X,Z) ≤ H(Y|X) = 0
  - Transitivity: X → Y and Y → Z ⇒ X → Z
    - 0 ≤ H(Z|X) ≤ H(Y|X) + H(Z|Y) = 0
55. Database Normal Forms
- Goal: eliminate update anomalies by good database design
  - Need to know the integrity constraints on all database instances
- Boyce-Codd normal form (BCNF)
  - Input: a set Σ of functional dependencies
  - For every (non-trivial) FD R.X → R.Y ∈ Σ, R.X is a key of R
- 4NF
  - Input: a set Σ of functional and multi-valued dependencies
  - For every (non-trivial) MVD R.X →→ R.Y ∈ Σ, R.X is a key of R
56. Database Normal Forms
- Functional dependency X → Y
- Which design is better?
X Y Z
x1 y1 z1
x1 y1 z2
x2 y2 z3
x2 y2 z4
x3 y3 z5
x4 y4 z6
X Y
x1 y1
x2 y2
x3 y3
x4 y4
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6
57. Database Normal Forms
- Functional dependency X → Y
- Which design is better?
- The decomposition is in BCNF
X Y Z
x1 y1 z1
x1 y1 z2
x2 y2 z3
x2 y2 z4
x3 y3 z5
x4 y4 z6
X Y
x1 y1
x2 y2
x3 y3
x4 y4
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6
58. Database Normal Forms
- Multi-valued dependency X →→ Y
- Which design is better?
X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
X Y
x1 y1
x1 y2
x2 y3
x3 y3
x4 y3
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6
59. Database Normal Forms
- Multi-valued dependency X →→ Y
- Which design is better?
- The decomposition is in 4NF
X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
X Y
x1 y1
x1 y2
x2 y3
x3 y3
x4 y3
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6
60. Well-Designed Databases [AL03]
- Goal: use information theory to characterize the goodness of a database design and reason about normalization algorithms
- Key idea
  - An information content measure of each cell in a DB instance w.r.t. the ICs
  - Redundancy reduces the information content measure of cells
- Results
  - Well-designed DB ⇔ every cell has information content > 0
  - Normalization algorithms never decrease information content
61. Well-Designed Databases [AL03]
- Information content of cell c in database D satisfying FD X → Y
  - Uniform distribution p(V) on the values for c consistent with D\c and the FD
  - The information content of cell c is the entropy H(V)
- H(V62) = 2.0 (Vij denotes the cell in row i, column j)
X Y Z
x1 y1 z1
x1 y1 z2
x2 y2 z3
x2 y2 z4
x3 y3 z5
x4 y4 z6
V62 p(V62) h(V62)
y1 0.25 2.0
y2 0.25 2.0
y3 0.25 2.0
y4 0.25 2.0
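A small sketch of the [AL03] cell-entropy idea under a single FD X → Y, on the BCNF-violating instance above; the active domain {y1..y4} and 0-indexed cell addressing are our assumptions (the slides' V62 is row 6, column 2, 1-indexed):

```python
import math

rows = [["x1","y1"], ["x1","y1"], ["x2","y2"], ["x2","y2"],
        ["x3","y3"], ["x4","y4"]]
domain = ["y1", "y2", "y3", "y4"]   # active domain of the Y column

def cell_entropy(rows, row, col, domain):
    """H(V): uniform over the domain values for cell (row, col) that keep
    the instance consistent with the FD X -> Y (columns 0 -> 1)."""
    consistent = []
    for v in domain:
        trial = [list(t) for t in rows]
        trial[row][col] = v
        ok = all(not (a[0] == b[0] and a[1] != b[1])
                 for a in trial for b in trial)
        if ok:
            consistent.append(v)
    return math.log2(len(consistent))

print(cell_entropy(rows, 5, 1, domain))  # 2.0: x4 occurs once, any y works
print(cell_entropy(rows, 1, 1, domain))  # 0.0: forced to y1 by the other x1 row
```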
62. Well-Designed Databases [AL03]
- Information content of cell c in database D satisfying FD X → Y
  - Uniform distribution p(V) on the values for c consistent with D\c and the FD
  - The information content of cell c is the entropy H(V)
- H(V22) = 0.0
X Y Z
x1 y1 z1
x1 y1 z2
x2 y2 z3
x2 y2 z4
x3 y3 z5
x4 y4 z6
V22 p(V22) h(V22)
y1 1.0 0.0
y2 0.0
y3 0.0
y4 0.0
63. Well-Designed Databases [AL03]
- Information content of cell c in database D satisfying FD X → Y
- The information content of cell c is the entropy H(V)
- Schema S is in BCNF iff ∀ D ∈ S, H(V) > 0 for all cells c in D
  - (Technicalities w.r.t. the size of the active domain)
X Y Z
x1 y1 z1
x1 y1 z2
x2 y2 z3
x2 y2 z4
x3 y3 z5
x4 y4 z6
c H(V)
c12 0.0
c22 0.0
c32 0.0
c42 0.0
c52 2.0
c62 2.0
64. Well-Designed Databases [AL03]
- Information content of cell c in database D satisfying FD X → Y
- The information content of cell c is the entropy H(V)
- H(V12) = 2.0, H(V42) = 2.0
V42 p(V42) h(V42)
y1 0.25 2.0
y2 0.25 2.0
y3 0.25 2.0
y4 0.25 2.0
X Y
x1 y1
x2 y2
x3 y3
x4 y4
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6
V12 p(V12) h(V12)
y1 0.25 2.0
y2 0.25 2.0
y3 0.25 2.0
y4 0.25 2.0
65. Well-Designed Databases [AL03]
- Information content of cell c in database D satisfying FD X → Y
- The information content of cell c is the entropy H(V)
- Schema S is in BCNF iff ∀ D ∈ S, H(V) > 0 for all cells c in D
X Y
x1 y1
x2 y2
x3 y3
x4 y4
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6
c H(V)
c12 2.0
c22 2.0
c32 2.0
c42 2.0
66. Well-Designed Databases [AL03]
- Information content of cell c in DB D satisfying MVD X →→ Y
- The information content of cell c is the entropy H(V)
- H(V52) = 0.0, H(V53) = 2.32
X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
V52 p(V52) h(V52)
y3 1.0 0.0
V53 p(V53) h(V53)
z1 0.2 2.32
z2 0.2 2.32
z3 0.2 2.32
z4 0.0
z5 0.2 2.32
z6 0.2 2.32
67. Well-Designed Databases [AL03]
- Information content of cell c in DB D satisfying MVD X →→ Y
- The information content of cell c is the entropy H(V)
- Schema S is in 4NF iff ∀ D ∈ S, H(V) > 0 for all cells c in D
X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
c H(V)
c12 0.0
c22 0.0
c32 0.0
c42 0.0
c52 0.0
c62 0.0
c72 1.58
c82 1.58
c H(V)
c13 0.0
c23 0.0
c33 0.0
c43 0.0
c53 2.32
c63 2.32
c73 2.58
c83 2.58
68. Well-Designed Databases [AL03]
- Information content of cell c in DB D satisfying MVD X →→ Y
- The information content of cell c is the entropy H(V)
- H(V32) = 1.58, H(V34) = 2.32
V34 p(V34) h(V34)
z1 0.2 2.32
z2 0.2 2.32
z3 0.2 2.32
z4 0.0
z5 0.2 2.32
z6 0.2 2.32
X Y
x1 y1
x1 y2
x2 y3
x3 y3
x4 y3
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6
V32 p(V32) h(V32)
y1 0.33 1.58
y2 0.33 1.58
y3 0.33 1.58
69. Well-Designed Databases [AL03]
- Information content of cell c in DB D satisfying MVD X →→ Y
- The information content of cell c is the entropy H(V)
- Schema S is in 4NF iff ∀ D ∈ S, H(V) > 0 for all cells c in D
X Y
x1 y1
x1 y2
x2 y3
x3 y3
x4 y3
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6
c H(V)
c12 1.0
c22 1.0
c32 1.58
c42 1.58
c52 1.58
c H(V)
c14 2.32
c24 2.32
c34 2.32
c44 2.32
c54 2.58
c64 2.58
70. Well-Designed Databases [AL03]
- Normalization algorithms never decrease information content
- The information content of cell c is the entropy H(V)
X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
c H(V)
c13 0.0
c23 0.0
c33 0.0
c43 0.0
c53 2.32
c63 2.32
c73 2.58
c83 2.58
71. Well-Designed Databases [AL03]
- Normalization algorithms never decrease information content
- The information content of cell c is the entropy H(V)
c H(V)
c14 2.32
c24 2.32
c34 2.32
c44 2.32
c54 2.58
c64 2.58
X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
X Y
x1 y1
x1 y2
x2 y3
x3 y3
x4 y3
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6
c H(V)
c13 0.0
c23 0.0
c33 0.0
c43 0.0
c53 2.32
c63 2.32
c73 2.58
c83 2.58
72. Well-Designed Databases [AL03]
- Normalization algorithms never decrease information content
- The information content of cell c is the entropy H(V)
c H(V)
c14 2.32
c24 2.32
c34 2.32
c44 2.32
c54 2.58
c64 2.58
X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
X Y
x1 y1
x1 y2
x2 y3
x3 y3
x4 y3
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6
c H(V)
c13 0.0
c23 0.0
c33 0.0
c43 0.0
c53 2.32
c63 2.32
c73 2.58
c83 2.58
73. Database Design Summary
- Good database design is essential for preserving data integrity
- Information-theoretic measures are useful for integrity constraints
  - FD X → Y holds iff the InD measure H(Y|X) = 0
  - MVD X →→ Y holds iff H(Y,Z|X) = H(Y|X) + H(Z|X)
  - Information theory models correlations in a specific database
- Information-theoretic measures are useful for normal forms
  - Schema S is in BCNF/4NF iff ∀ D ∈ S, H(V) > 0 for all cells c in D
  - Information theory models distributions over possible databases
74. Outline
- Part 1
  - Introduction to Information Theory
  - Application: Data Anonymization
  - Application: Database Design
- Part 2
  - Review of Information Theory Basics
  - Application: Data Integration
  - Computing Information-Theoretic Primitives
  - Open Problems
75. Review of Information Theory Basics
- Discrete distribution: probability p(X)
- p(X,Y) = Σz p(X, Y, Z=z)
X Y Z p(X,Y,Z)
x1 y1 z1 0.125
x1 y2 z2 0.125
x1 y1 z2 0.125
x1 y2 z1 0.125
x2 y3 z3 0.125
x2 y3 z4 0.125
x3 y3 z5 0.125
x4 y3 z6 0.125
X p(X)
x1 0.5
x2 0.25
x3 0.125
x4 0.125
X Y p(X,Y)
x1 y1 0.25
x1 y2 0.25
x2 y3 0.25
x3 y3 0.125
x4 y3 0.125
Y p(Y)
y1 0.25
y2 0.25
y3 0.5
76. Review of Information Theory Basics
- Discrete distribution: probability p(X)
- p(Y) = Σx p(X=x, Y) = Σx Σz p(X=x, Y, Z=z)
X Y Z p(X,Y,Z)
x1 y1 z1 0.125
x1 y2 z2 0.125
x1 y1 z2 0.125
x1 y2 z1 0.125
x2 y3 z3 0.125
x2 y3 z4 0.125
x3 y3 z5 0.125
x4 y3 z6 0.125
X p(X)
x1 0.5
x2 0.25
x3 0.125
x4 0.125
X Y p(X,Y)
x1 y1 0.25
x1 y2 0.25
x2 y3 0.25
x3 y3 0.125
x4 y3 0.125
Y p(Y)
y1 0.25
y2 0.25
y3 0.5
77. Review of Information Theory Basics
- Discrete distribution: conditional probability p(X|Y)
- p(X,Y) = p(X|Y)p(Y) = p(Y|X)p(X)
X Y p(X,Y) p(X|Y) p(Y|X)
x1 y1 0.25 1.0 0.5
x1 y2 0.25 1.0 0.5
x2 y3 0.25 0.5 1.0
x3 y3 0.125 0.25 1.0
x4 y3 0.125 0.25 1.0
X p(X)
x1 0.5
x2 0.25
x3 0.125
x4 0.125
Y p(Y)
y1 0.25
y2 0.25
y3 0.5
78. Review of Information Theory Basics
- Discrete distribution: entropy H(X)
  - h(x) = log2(1/p(x))
  - H(X) = Σx p(x) h(x) = 1.75
  - H(Y) = Σy p(y) h(y) = 1.5 (< log2 |Y| = 1.58)
  - H(X,Y) = Σx Σy p(x,y) h(x,y) = 2.25 (< log2 |X,Y| = 2.32)
X Y p(X,Y) h(X,Y)
x1 y1 0.25 2.0
x1 y2 0.25 2.0
x2 y3 0.25 2.0
x3 y3 0.125 3.0
x4 y3 0.125 3.0
X p(X) h(X)
x1 0.5 1.0
x2 0.25 2.0
x3 0.125 3.0
x4 0.125 3.0
Y p(Y) h(Y)
y1 0.25 2.0
y2 0.25 2.0
y3 0.5 1.0
79. Review of Information Theory Basics
- Discrete distribution: conditional entropy H(X|Y)
  - h(x|y) = log2(1/p(x|y))
  - H(X|Y) = Σx Σy p(x,y) h(x|y) = 0.75
  - H(X|Y) = H(X,Y) - H(Y) = 2.25 - 1.5
X Y p(X,Y) p(X|Y) h(X|Y)
x1 y1 0.25 1.0 0.0
x1 y2 0.25 1.0 0.0
x2 y3 0.25 0.5 1.0
x3 y3 0.125 0.25 2.0
x4 y3 0.125 0.25 2.0
X p(X) h(X)
x1 0.5 1.0
x2 0.25 2.0
x3 0.125 3.0
x4 0.125 3.0
Y p(Y) h(Y)
y1 0.25 2.0
y2 0.25 2.0
y3 0.5 1.0
80. Review of Information Theory Basics
- Discrete distribution: mutual information I(X;Y)
  - i(x;y) = log2(p(x,y)/(p(x)p(y)))
  - I(X;Y) = Σx Σy p(x,y) i(x;y) = 1.0
  - I(X;Y) = H(X) + H(Y) - H(X,Y) = 1.75 + 1.5 - 2.25
X Y p(X,Y) h(X,Y) i(x;y)
x1 y1 0.25 2.0 1.0
x1 y2 0.25 2.0 1.0
x2 y3 0.25 2.0 1.0
x3 y3 0.125 3.0 1.0
x4 y3 0.125 3.0 1.0
X p(X) h(X)
x1 0.5 1.0
x2 0.25 2.0
x3 0.125 3.0
x4 0.125 3.0
Y p(Y) h(Y)
y1 0.25 2.0
y2 0.25 2.0
y3 0.5 1.0
81. Outline
- Part 1
  - Introduction to Information Theory
  - Application: Data Anonymization
  - Application: Database Design
- Part 2
  - Review of Information Theory Basics
  - Application: Data Integration
  - Computing Information-Theoretic Primitives
  - Open Problems
82. Schema Matching
- Goal: align columns across database tables to be integrated
  - A fundamental problem in database integration
- Early useful approach: textual similarity of column names
  - False positives: Address matched to IP_Address (similar names, different semantics)
  - False negatives: Customer_Id not matched to Client_Number (different names, same semantics)
- Early useful approach: overlap of values in columns, e.g., Jaccard similarity
  - False positives: Emp_Id matched to Project_Id
  - False negatives: Emp_Id not matched to Personnel_Number
83. Opaque Schema Matching [KN03]
- Goal: align columns when column names and data values are opaque
  - Databases belong to different government bureaucracies
  - Treat column names and data values as uninterpreted (generic)
- Example: EMP_PROJ(Emp_Id, Proj_Id, Task_Id, Status_Id)
  - Likely that all Id fields are from the same domain
  - Different databases may have different column names
W X Y Z
w2 x1 y1 z2
w4 x2 y3 z3
w3 x3 y3 z1
w1 x2 y1 z2
A B C D
a1 b2 c1 d1
a3 b4 c2 d2
a1 b1 c1 d2
a4 b3 c2 d3
84. Opaque Schema Matching [KN03]
- Approach: build a complete, labeled graph GD for each database D
  - Nodes are columns; label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
- Perform graph matching between GD1 and GD2, minimizing distance
- Intuition
  - Entropy H(X) captures the distribution of values in database column X
  - Mutual information I(X;Y) captures correlations between X, Y
- Efficiency: graph matching is between schema-sized graphs
85. Opaque Schema Matching [KN03]
- Approach: build a complete, labeled graph GD for each database D
  - Nodes are columns; label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
A B C D
a1 b2 c1 d1
a3 b4 c2 d2
a1 b1 c1 d2
a4 b3 c2 d3
A p(A)
a1 0.5
a3 0.25
a4 0.25
B p(B)
b1 0.25
b2 0.25
b3 0.25
b4 0.25
C p(C)
c1 0.5
c2 0.5
D p(D)
d1 0.25
d2 0.5
d3 0.25
86. Opaque Schema Matching [KN03]
- Approach: build a complete, labeled graph GD for each database D
  - Nodes are columns; label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
- H(A) = 1.5, H(B) = 2.0, H(C) = 1.0, H(D) = 1.5
A B C D
a1 b2 c1 d1
a3 b4 c2 d2
a1 b1 c1 d2
a4 b3 c2 d3
A h(A)
a1 1.0
a3 2.0
a4 2.0
B h(B)
b1 2.0
b2 2.0
b3 2.0
b4 2.0
C h(C)
c1 1.0
c2 1.0
D h(D)
d1 2.0
d2 1.0
d3 2.0
87. Opaque Schema Matching [KN03]
- Approach: build a complete, labeled graph GD for each database D
  - Nodes are columns; label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
- H(A) = 1.5, H(B) = 2.0, H(C) = 1.0, H(D) = 1.5, I(A;B) = 1.5
A B C D
a1 b2 c1 d1
a3 b4 c2 d2
a1 b1 c1 d2
a4 b3 c2 d3
A h(A)
a1 1.0
a3 2.0
a4 2.0
B h(B)
b1 2.0
b2 2.0
b3 2.0
b4 2.0
A B h(A,B) i(A;B)
a1 b2 2.0 1.0
a3 b4 2.0 2.0
a1 b1 2.0 1.0
a4 b3 2.0 2.0
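A minimal sketch of the graph construction: entropy node labels and mutual-information edge labels computed from the A/B/C/D table above (the actual [KN03] graph-matching step is not shown):

```python
import math
from collections import Counter
from itertools import combinations

table = [("a1","b2","c1","d1"), ("a3","b4","c2","d2"),
         ("a1","b1","c1","d2"), ("a4","b3","c2","d3")]
cols = list(zip(*table))          # column-oriented view of the table

def H(values):
    """Empirical entropy (bits) of a column given as a sequence of values."""
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in Counter(values).values())

node_labels = {i: H(col) for i, col in enumerate(cols)}
edge_labels = {(i, j): H(cols[i]) + H(cols[j]) - H(list(zip(cols[i], cols[j])))
               for i, j in combinations(range(len(cols)), 2)}
print(node_labels)           # {0: 1.5, 1: 2.0, 2: 1.0, 3: 1.5}
print(edge_labels[(0, 1)])   # I(A;B) = 1.5
```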
88. Opaque Schema Matching [KN03]
- Approach: build a complete, labeled graph GD for each database D
  - Nodes are columns; label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
A B C D
a1 b2 c1 d1
a3 b4 c2 d2
a1 b1 c1 d2
a4 b3 c2 d3
89. Opaque Schema Matching [KN03]
- Approach: build a complete, labeled graph GD for each database D
  - Nodes are columns; label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
- Perform graph matching between GD1 and GD2, minimizing distance
  - [KN03] uses Euclidean and normal distance metrics
92. Heterogeneity Identification [DKOSV06]
- Goal: identify columns with semantically heterogeneous values
  - Can arise due to opaque schema matching [KN03]
- Key ideas
  - Heterogeneity is based on the distribution and distinguishability of values
  - Use information theory to quantify heterogeneity
- Issues
  - Which information-theoretic measure characterizes heterogeneity?
  - How do we actually cluster the data?
93. Heterogeneity Identification [DKOSV06]
- Example: semantically homogeneous vs. heterogeneous columns
Customer_Id
h8742@yyy.com
kkjj@haha.org
qwerty@keyboard.us
555-1212@fax.in
alpha@beta.ga
john.smith@noname.org
jane.doe@1973law.us
jamesbond.007@action.com
Customer_Id
h8742@yyy.com
kkjj@haha.org
qwerty@keyboard.us
555-1212@fax.in
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
95. Heterogeneity Identification [DKOSV06]
- Example: semantically homogeneous vs. heterogeneous columns
- More semantic types in a column ⇒ greater heterogeneity
  - Only email versus email + phone
Customer_Id
h8742@yyy.com
kkjj@haha.org
qwerty@keyboard.us
555-1212@fax.in
alpha@beta.ga
john.smith@noname.org
jane.doe@1973law.us
jamesbond.007@action.com
Customer_Id
h8742@yyy.com
kkjj@haha.org
qwerty@keyboard.us
555-1212@fax.in
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
96. Heterogeneity Identification [DKOSV06]
- Example: semantically homogeneous vs. heterogeneous columns
Customer_Id
h8742@yyy.com
kkjj@haha.org
qwerty@keyboard.us
555-1212@fax.in
alpha@beta.ga
john.smith@noname.org
jane.doe@1973law.us
(877)-807-4596
Customer_Id
h8742@yyy.com
kkjj@haha.org
qwerty@keyboard.us
555-1212@fax.in
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
97. Heterogeneity Identification [DKOSV06]
- Example: semantically homogeneous vs. heterogeneous columns
- The relative distribution of semantic types impacts heterogeneity
  - Mainly email + few phone versus balanced email + phone
Customer_Id
h8742@yyy.com
kkjj@haha.org
qwerty@keyboard.us
555-1212@fax.in
alpha@beta.ga
john.smith@noname.org
jane.doe@1973law.us
(877)-807-4596
Customer_Id
h8742@yyy.com
kkjj@haha.org
qwerty@keyboard.us
555-1212@fax.in
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
98. Heterogeneity Identification [DKOSV06]
- Example: semantically homogeneous vs. heterogeneous columns
Customer_Id
187-65-2468
987-64-6837
789-15-4321
987-65-4321
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
Customer_Id
h8742@yyy.com
kkjj@haha.org
qwerty@keyboard.us
555-1212@fax.in
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
100. Heterogeneity Identification [DKOSV06]
- Example: semantically homogeneous vs. heterogeneous columns
- More easily distinguished types ⇒ greater heterogeneity
  - Phone + (possibly) SSN versus balanced email + phone
Customer_Id
187-65-2468
987-64-6837
789-15-4321
987-65-4321
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
Customer_Id
h8742@yyy.com
kkjj@haha.org
qwerty@keyboard.us
555-1212@fax.in
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
101. Heterogeneity Identification [DKOSV06]
- Heterogeneity = complexity of describing the data
  - More, balanced clusters ⇒ greater heterogeneity
  - More distinguishable clusters ⇒ greater heterogeneity
- Use two views of mutual information
  - It quantifies the complexity of describing the data (compression)
  - It quantifies the quality of the compression
102. Heterogeneity Identification [DKOSV06]
X Customer_Id T Cluster_Id
187-65-2468 t1
987-64-6837 t1
789-15-4321 t1
987-65-4321 t1
(908)-555-1234 t2
973-360-0000 t1
360-0007 t3
(877)-807-4596 t2
103. Measuring Complexity of Clustering
- Take 1: complexity of a clustering = number of clusters
  - The standard model of complexity
  - Doesn't capture the fact that clusters have different sizes
(Figure: two clusterings with the same number of clusters but different cluster sizes)
104. Measuring Complexity of Clustering
- Take 2: complexity of a clustering = number of bits needed to describe it
  - Writing down k needs log k bits
  - In general, let cluster t ∈ T have |t| elements
    - Set p(t) = |t|/n
    - Bits to write down cluster sizes = H(T) = Σt pt log 1/pt
(Figure: H(skewed clustering) < H(balanced clustering))
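A small sketch of this description cost from cluster sizes alone; it reproduces the picture that a skewed clustering is cheaper to describe than a balanced one:

```python
import math

def clustering_description_cost(sizes):
    """H(T) in bits, from cluster sizes, using p(t) = |t| / n."""
    n = sum(sizes)
    return sum((s / n) * math.log2(n / s) for s in sizes if s)

print(clustering_description_cost([4, 4]))  # 1.0 bit (balanced)
print(clustering_description_cost([7, 1]))  # ~0.54 bits (skewed costs less)
```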
105. Heterogeneity Identification [DKOSV06]
- Soft clustering: cluster membership probabilities
- How to compute a good soft clustering?
X Customer_Id T Cluster_Id p(T|X)
789-15-4321 t1 0.75
987-65-4321 t1 0.75
789-15-4321 t2 0.25
987-65-4321 t2 0.25
(908)-555-1234 t1 0.25
973-360-0000 t1 0.5
(908)-555-1234 t2 0.75
973-360-0000 t2 0.5
106. Measuring Complexity of Clustering
- Take 1
  - p(t) = Σx p(x) p(t|x)
  - Compute H(T) as before
- Problem: H(T1) = H(T2)!!

Membership p(t|x):   T1: t1   t2    |   T2: t1   t2
x1                       0.5  0.5   |       0.99 0.01
x2                       0.5  0.5   |       0.01 0.99
p(t)                     0.5  0.5   |       0.5  0.5
107. Measuring Complexity of Clustering
- By averaging the memberships, we've lost useful information.
- Take 2: compute I(T;X)!
- Even better: if T is a hard clustering of X, then I(T;X) = H(T)
X T1 p(X,T) i(X;T)
x1 t1 0.25 0
x1 t2 0.25 0
x2 t1 0.25 0
x2 t2 0.25 0
X T2 p(X,T) i(X;T)
x1 t1 0.495 0.99
x1 t2 0.005 -5.64
x2 t1 0.25 0
x2 t2 0.25 0
I(T1;X) = 0
I(T2;X) = 0.46
108. Measuring Cost of a Cluster
- Given objects Xt = {X1, X2, ..., Xm} in cluster t, Cost(t) = sum of distances from each Xi to the cluster center
- If we set the distance to be the KL-distance, then the cost of a cluster is I(Xt; V)
109. Cost of a Clustering
- If we partition X into k clusters X1, ..., Xk
  - Cost(clustering) = Σi pi I(Xi; V), where pi = |Xi|/|X|
- Suppose we treat each cluster center itself as a point?
110. Cost of a Clustering
- We can write down the cost of this clustering of cluster centers
  - Cost(T) = I(T;V)
- Key result [BMDG05]
  - Cost(clustering) = I(X;V) - I(T;V)
  - Minimizing Cost(clustering) ⇔ maximizing I(T;V)
111. Heterogeneity Identification [DKOSV06]
- Represent strings as q-gram distributions
X (Customer_Id) V (4-grams) p(X,V)
987-65-4321 987- 0.10
987-65-4321 87-6 0.13
987-65-4321 7-65 0.12
987-65-4321 -65- 0.15
987-65-4321 65-4 0.05
987-65-4321 5-43 0.20
987-65-4321 -432 0.15
987-65-4321 4321 0.10
Customer_Id
187-65-2468
987-64-6837
789-15-4321
987-65-4321
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
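A minimal sketch of the q-gram representation; note that the slide's p(X,V) column shows joint weights from [DKOSV06]'s construction, while this helper simply normalizes one string's q-gram counts:

```python
from collections import Counter

def qgram_distribution(s, q=4):
    """Empirical distribution over the q-grams of a single string."""
    grams = [s[i:i + q] for i in range(len(s) - q + 1)]
    n = len(grams)
    return {g: c / n for g, c in Counter(grams).items()}

print(qgram_distribution("987-65-4321"))  # 8 distinct 4-grams, 1/8 weight each
```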
112. Heterogeneity Identification [DKOSV06]
- iIB: find a soft clustering T of X that minimizes I(T;X) - β·I(T;V)
- Allow iIB to use arbitrarily many clusters; use β* = H(X)/I(X;V)
  - Closest to the point with minimum space and maximum quality
X (Customer_Id) V (4-grams) p(X,V)
987-65-4321 987- 0.10
987-65-4321 87-6 0.13
987-65-4321 7-65 0.12
987-65-4321 -65- 0.15
987-65-4321 65-4 0.05
987-65-4321 5-43 0.20
987-65-4321 -432 0.15
987-65-4321 4321 0.10
Customer_Id
187-65-2468
987-64-6837
789-15-4321
987-65-4321
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
113. Heterogeneity Identification [DKOSV06]
- Rate-distortion curve: I(T;V)/I(X;V) vs. I(T;X)/H(X)
(Figure: rate-distortion curve, with the operating point at β* marked)
114. Heterogeneity Identification [DKOSV06]
- Heterogeneity = mutual information I(T;X) of the iIB clustering T at β*
  - 0 ≤ I(T;X) (= 0.126) ≤ H(X) (= 2.0), H(T) (= 1.0)
- Ideally, use iIB with an arbitrarily large number of clusters in T
X Customer_Id T Cluster_Id p(T|X) i(T;X)
789-15-4321 t1 0.75 0.41
987-65-4321 t1 0.75 0.41
789-15-4321 t2 0.25 -0.81
987-65-4321 t2 0.25 -0.81
(908)-555-1234 t1 0.25 -1.17
973-360-0000 t1 0.5 -0.17
(908)-555-1234 t2 0.75 0.77
973-360-0000 t2 0.5 0.19
115. Heterogeneity Identification [DKOSV06]
- Heterogeneity = mutual information I(T;X) of the iIB clustering T at β*
116. Data Integration Summary
- Analyzing the database instance is critical for effective data integration
  - Matching and quality assessments are key components
- Information-theoretic measures are useful for schema matching
  - Align columns when column names and data values are opaque
  - Mutual information I(X;V) captures correlations between X, V
- Information-theoretic measures are useful for heterogeneity testing
  - Identify columns with semantically heterogeneous values
  - I(T;X) of the iIB clustering T at β* captures column heterogeneity
117. Outline
- Part 1
  - Introduction to Information Theory
  - Application: Data Anonymization
  - Application: Database Design
- Part 2
  - Review of Information Theory Basics
  - Application: Data Integration
  - Computing Information-Theoretic Primitives
  - Open Problems
118. Domain Size Matters
- For a random variable X, the domain size is |supp(X)| = |{xi : p(X = xi) > 0}|
- Different solutions exist depending on whether the domain size is small or large
- Probability vectors are usually very sparse
119. Entropy, Case I: Small Domain Size
- Suppose the number of unique values for a random variable X is small (i.e., fits in memory)
- Maximum likelihood estimator
  - p(x) = (# times x is encountered) / (total number of items in the set)
(Figure: a stream of values binned into a histogram over the domain 1..5)
120. Entropy, Case I: Small Domain Size
- H_MLE = Σx p(x) log 1/p(x)
- This is a biased estimate
  - E[H_MLE] < H
- Miller-Madow correction
  - Ĥ = H_MLE + (m̂ - 1)/2n
  - m̂ is an estimate of the number of non-empty bins
  - n = number of samples
- Bad news: ALL estimators for H are biased
- Good news: we can quantify the bias and variance of the MLE
  - Bias ≤ log(1 + m/N)
  - Var(H_MLE) ≤ (log n)²/N
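A small sketch of the MLE estimator with the Miller-Madow correction as stated above; the slide's formula leaves the log base implicit, so we apply the correction verbatim to the base-2 estimate:

```python
import math
from collections import Counter

def entropy_mle(samples):
    """Plug-in (MLE) entropy estimate in bits from a list of samples."""
    n = len(samples)
    return sum((c / n) * math.log2(n / c) for c in Counter(samples).values())

def entropy_miller_madow(samples):
    """MLE estimate plus the (m - 1)/(2n) correction from the slide."""
    n = len(samples)
    m = len(set(samples))   # estimate of the number of non-empty bins
    return entropy_mle(samples) + (m - 1) / (2 * n)

data = [1, 2, 1, 2, 1, 5, 4]
print(entropy_mle(data), entropy_miller_madow(data))
```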
121. Entropy, Case II: Large Domain Size
- |X| is too large to fit in main memory, so we can't maintain explicit counts
- Streaming algorithms for H(X)
  - Long history of work on this problem
- Bottom line
  - A (1±ε)-relative-approximation for H(X) that allows for updates to frequencies, and requires almost constant, and optimal, space [HNO08]
122. Streaming Entropy [CCM07]
- High-level idea: sample randomly from the stream, and track counts of the elements picked [AMS]
- Problem: a skewed distribution prevents us from sampling lower-frequency elements (and then the entropy is small)
- Idea: estimate the largest frequency, and the distribution of what's left (which has higher entropy)
123. Streaming Entropy [CCM07]
- Maintain a set of samples from the original distribution, and a set from the distribution without the most frequent element