Title: Information Theory For Data Management
1Information Theory For Data Management
Divesh Srivastava Suresh Venkatasubramanian
2Motivation
-- Abstruse Goose (177)
Information Theory is relevant to all of
humanity...
3Background
- Many problems in data management need precise
reasoning about information content, transfer and
loss - Structure Extraction
- Privacy preservation
- Schema design
- Probabilistic data ?
4Information Theory
- First developed by Shannon as a way of
quantifying capacity of signal channels. - Entropy, relative entropy and mutual information
capture intrinsic informational aspects of a
signal - Today
- Information theory provides a domain-independent way to reason about structure in data
- More information → interesting structure
- Less information/linkage → decoupling of structures
5Tutorial Thesis
- Information theory provides a mathematical
framework for the quantification of information
content, linkage and loss. - This framework can be used in the design of data
management strategies that rely on probing the
structure of information in data.
6Tutorial Goals
- Introduce information-theoretic concepts to VLDB
audience - Give a data-centric perspective on information
theory - Connect these to applications in data management
- Describe underlying computational primitives
- Illuminate when and how information theory might
be of use in new areas of data management.
7Outline
- Part 1
- Introduction to Information Theory
- Application Data Anonymization
- Application Data Integration
- Part 2
- Review of Information Theory Basics
- Application Database Design
- Computing Information Theoretic Primitives
- Open Problems
8Histograms And Discrete Distributions
9Histograms And Discrete Distributions
(Figure: a column of values is normalized, and possibly reweighted, into a discrete probability distribution.)
10From Columns To Random Variables
- We can think of a column of data as represented by a random variable
- X is a random variable
- p(X) is the column of probabilities p(X = x1), p(X = x2), and so on
- Also known (in the unweighted case) as the empirical distribution induced by the column X
- Notation
- X (upper case) denotes a random variable (column)
- x (lower case) denotes a value taken by X (a field in a tuple)
- p(x) is the probability p(X = x)
11Joint Distributions
- Discrete distribution: probability p(X, Y, Z)
- p(Y) = Σx p(X=x, Y) = Σx Σz p(X=x, Y, Z=z)
12Entropy Of A Column
- Let h(x) = log2 1/p(x)
- h(X) is the column of h(x) values
- H(X) = E_X[h(x)] = Σ_X p(x) log2 1/p(x)
- Two views of entropy
- It captures uncertainty in data: high entropy, more unpredictability
- It captures information content: higher entropy, more information
H(X) = 1.75 < log |X| = 2
13Examples
- X uniform over {1, ..., 4}: H(X) = 2
- Y is 1 with probability 0.5, and uniform over {2, 3, 4} otherwise
- H(Y) = 0.5 log 2 + 0.5 log 6 ≈ 1.8 < 2
- Y is more sharply defined, and so has less uncertainty
- Z uniform over {1, ..., 8}: H(Z) = 3 > 2
- Z spans a larger range, and captures more information
(Figure: histograms of X, Y and Z.)
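A minimal Python sketch (our own illustration, not part of the original tutorial) of computing the empirical entropy of a column; it reproduces the values quoted above for X, Y and Z:
```python
import math
from collections import Counter

def entropy(column):
    """Empirical entropy (in bits) of the distribution induced by a column."""
    counts = Counter(column)
    n = len(column)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

print(entropy([1, 2, 3, 4]))           # X uniform over {1,...,4}: 2.0
print(entropy([1, 1, 1, 2, 3, 4]))     # Y: p(1) = 1/2, rest uniform: ~1.79
print(entropy(list(range(1, 9))))      # Z uniform over {1,...,8}: 3.0
```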
14Comparing Distributions
- How do we measure the difference between two distributions?
- Kullback-Leibler divergence
- d_KL(p, q) = E_p[h(q) - h(p)] = Σ_i p_i log(p_i/q_i)
(Figure: an inference mechanism turns a prior belief q into a resulting belief p.)
15Comparing Distributions
- Kullback-Leibler divergence
- d_KL(p, q) = E_p[h(q) - h(p)] = Σ_i p_i log(p_i/q_i)
- d_KL(p, q) ≥ 0
- Captures the extra information needed to capture p given q
- Is asymmetric: d_KL(p, q) ≠ d_KL(q, p)
- Is not a metric (does not satisfy the triangle inequality)
- There are other measures
- χ2-distance, variational distance, f-divergences, ...
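A small sketch of the KL divergence on explicit probability vectors (the vectors are hypothetical, chosen only to illustrate the asymmetry):
```python
import math

def kl_divergence(p, q):
    """d_KL(p || q) = sum_i p_i log2(p_i/q_i); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
q = [1/3, 1/3, 1/3]
print(kl_divergence(p, q))   # ~0.43
print(kl_divergence(q, p))   # ~0.47 -- in general d_KL(p, q) != d_KL(q, p)
```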
16Conditional Probability
- Given a joint distribution on random variables X, Y, how much information about X can we glean from Y?
- Conditional probability p(X|Y)
- p(X = x1 | Y = y1) = p(X = x1, Y = y1) / p(Y = y1)
17Conditional Entropy
- Let h(x|y) = log2 1/p(x|y)
- H(X|Y) = E_{x,y}[h(x|y)] = Σ_x Σ_y p(x,y) log2 1/p(x|y)
- H(X|Y) = H(X,Y) - H(Y)
- H(X|Y) = H(X,Y) - H(Y) = 2.25 - 1.5 = 0.75
- If X, Y are independent, H(X|Y) = H(X)
18Mutual Information
- Mutual information captures the difference between the joint distribution on X and Y and the product of the marginal distributions on X and Y
- Let i(x;y) = log p(x,y)/(p(x)p(y))
- I(X;Y) = E_{x,y}[i(x;y)] = Σ_x Σ_y p(x,y) log p(x,y)/(p(x)p(y))
19Mutual Information Strength of linkage
- I(X;Y) = H(X) + H(Y) - H(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
- If X, Y are independent, then I(X;Y) = 0
- H(X,Y) = H(X) + H(Y), so I(X;Y) = H(X) + H(Y) - H(X,Y) = 0
- I(X;Y) ≤ max(H(X), H(Y))
- Suppose Y = f(X) (deterministically)
- Then H(Y|X) = 0, and so I(X;Y) = H(Y) - H(Y|X) = H(Y)
- Mutual information captures higher-order interactions
- Covariance captures linear interactions only
- Two variables can be uncorrelated (covariance = 0) and have nonzero mutual information
- X uniform over {-1, 0, 1}, Y = X²: Cov(X, Y) = 0, but I(X;Y) = H(Y) > 0
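A sketch of computing these quantities for two columns of a small (hypothetical) table, using the identities above:
```python
import math
from collections import Counter

def H(values):
    counts, n = Counter(values), len(values)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

# two paired columns X, Y of a toy table
rows = [(1, 'a'), (1, 'a'), (2, 'a'), (2, 'b'), (3, 'b'), (3, 'b'), (4, 'c'), (4, 'c')]
X = [x for x, _ in rows]
Y = [y for _, y in rows]

h_x, h_y, h_xy = H(X), H(Y), H(rows)
print(h_x + h_y - h_xy)   # I(X;Y) = H(X) + H(Y) - H(X,Y)
print(h_xy - h_y)         # H(X|Y) = H(X,Y) - H(Y)
print(h_xy - h_x)         # H(Y|X) = H(X,Y) - H(X)
```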
20Information-Theoretic Clustering
- Clustering takes a collection of objects and
groups them. - Given a distance function between objects
- Choice of measure of complexity of clustering
- Choice of measure of cost for a cluster
- Usually,
- Distance function is Euclidean distance
- Number of clusters is measure of complexity
- Cost measure for a cluster is the sum of squared distances to the center
- Goal: minimize complexity and cost
- Inherent tradeoff between the two
21Feature Representation
Let V = {v1, v2, v3, v4}. X is explained by a distribution over V. The feature vector of X is (0.5, 0.25, 0.125, 0.125).
22Feature Representation
p(v2|X2) = 0.2
(Figure: the feature vector of each Xi is the distribution p(V|Xi).)
23Information-Theoretic Clustering
- Clustering takes a collection of objects and
groups them. - Given a distance function between objects
- Choice of measure of complexity of clustering
- Choice of measure of cost for a cluster
- In information-theoretic setting
- What is the distance function ?
- How do we measure complexity ?
- What is a notion of cost/quality ?
- Goal: minimize complexity and maximize quality
- Inherent tradeoff between the two
24Measuring complexity of clustering
- Take 1: complexity of a clustering = number of clusters
- The standard model of complexity
- Doesn't capture the fact that clusters have different sizes
25Measuring complexity of clustering
- Take 2: complexity of a clustering = number of bits needed to describe it
- Writing down k needs log k bits
- In general, let cluster t ∈ T have |t| elements
- Set p(t) = |t|/n
- Bits to write down cluster sizes = H(T) = Σ_t p_t log 1/p_t
26Information-theoretic Clustering (take I)
- Given data X = {x1, ..., xn} explained by variable V, partition X into clusters (represented by T) such that
- H(T) is minimized and quality is maximized
27Soft clusterings
- In a hard clustering, each point is assigned to
exactly one cluster
- Characteristic function: p(t|x) = 1 if x ∈ t, 0 if not
- Suppose we allow points to partially belong to clusters
- p(T|x) is a distribution
- p(t|x) is the probability of assigning x to t
- How do we describe the complexity of a clustering?
28Measuring complexity of clustering
- Take 1
- p(t) = Σx p(x) p(t|x)
- Compute H(T) as before
- Problem
- H(T1) = H(T2) !!
29Measuring complexity of clustering
- By averaging the memberships, we've lost useful information
- Take 2: compute I(T;X)!
- Even better: if T is a hard clustering of X, then I(T;X) = H(T)
I(T1;X) = 0, I(T2;X) = 0.46
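A sketch of computing I(T;X) from soft membership probabilities p(t|x) (the memberships below are hypothetical; the 0 / 0.46 values above come from the tutorial's own figure):
```python
import math

def soft_clustering_complexity(p_x, p_t_given_x):
    """I(T;X) in bits, given the marginal p(x) and memberships p(t|x)."""
    k = len(p_t_given_x[0])
    p_t = [sum(px * row[t] for px, row in zip(p_x, p_t_given_x)) for t in range(k)]
    return sum(px * row[t] * math.log2(row[t] / p_t[t])
               for px, row in zip(p_x, p_t_given_x)
               for t in range(k) if row[t] > 0)

p_x = [0.25, 0.25, 0.25, 0.25]
hard = [[1, 0], [1, 0], [0, 1], [0, 1]]                    # hard: I(T;X) = H(T) = 1
blurry = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]  # uninformative: I(T;X) = 0
print(soft_clustering_complexity(p_x, hard), soft_clustering_complexity(p_x, blurry))
```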
30Information-theoretic Clustering (take II)
- Given data X = {x1, ..., xn} explained by variable V, partition X into clusters (represented by T) such that
- I(T;X) is minimized and quality is maximized
31Measuring cost of a cluster
Given objects X_t = {X1, X2, ..., Xm} in cluster t,
Cost(t) = (1/m) Σ_i d(Xi, C) = Σ_i p(Xi) d_KL(p(V|Xi), C)
where C = (1/m) Σ_i p(V|Xi) = Σ_i p(Xi) p(V|Xi) = p(V)
32Mutual Information Cost of Cluster
- Cost(t) = (1/m) Σ_i d(Xi, C) = Σ_i p(Xi) d_KL(p(V|Xi), p(V))
- Σ_i p(Xi) d_KL(p(V|Xi), p(V)) = Σ_i p(Xi) Σ_j p(vj|Xi) log p(vj|Xi)/p(vj)
- = Σ_{i,j} p(Xi, vj) log p(vj, Xi)/(p(vj)p(Xi))
- = I(Xt; V) !!
- Cost of a cluster = I(Xt; V)
33Cost of a clustering
- If we partition X into k clusters X1, ..., Xk
- Cost(clustering) = Σ_i p_i I(Xi; V)
- (p_i = |Xi| / |X|)
34Cost of a clustering
- Each cluster center t can be explained in terms
of V
- p(V|t) = Σ_i p(Xi) p(V|Xi)
- Suppose we treat each cluster center itself as a
point
35Cost of a clustering
- We can write down the cost of this cluster
- Cost(T) = I(T;V)
- Key result BMDG05
- Cost(clustering) = I(X; V) - I(T; V)
- Minimizing cost(clustering) ⇒ maximizing I(T; V)
36Information-theoretic Clustering (take III)
- Given data X = {x1, ..., xn} explained by variable V, partition X into clusters (represented by T) such that
- I(T;X) - βI(T;V) is minimized
- This is the Information Bottleneck method TPB98
- Agglomerative techniques exist for the case of hard clusterings
- β is the tradeoff parameter between complexity and cost
- I(T;X) and I(T;V) are in the same units
37Information Theory Summary
- We can represent data as discrete distributions
(normalized histograms) - Entropy captures uncertainty or information
content in a distribution - The Kullback-Leibler distance captures the
difference between distributions - Mutual information and conditional entropy
capture linkage between variables in a joint
distribution - We can formulate information-theoretic clustering
problems
38Outline
- Part 1
- Introduction to Information Theory
- Application Data Anonymization
- Application Data Integration
- Part 2
- Review of Information Theory Basics
- Application Database Design
- Computing Information Theoretic Primitives
- Open Problems
39Data Anonymization Using Randomization
- Goal: publish anonymized microdata to enable accurate ad hoc analyses, but ensure privacy of individuals' sensitive attributes
- Key ideas
- Randomize numerical data: add noise from a known
distribution - Reconstruct original data distribution using
published noisy data - Issues
- How can the original data distribution be
reconstructed? - What kinds of randomization preserve privacy of
individuals?
40Data Anonymization Using Randomization
- Many randomization strategies proposed AS00,
AA01, EGS03
- Example randomization strategies, X in [0, 10]
- R = X + μ (mod 11), μ uniform in {-1, 0, 1}
- R = X + μ (mod 11), μ is -1 (p = 0.25), 0 (p = 0.5), 1 (p = 0.25)
- R = X (p = 0.6), R = μ with μ uniform in [0, 10] (p = 0.4)
- Question
- Which randomization strategy has higher privacy
preservation? - Quantify loss of privacy due to publication of
randomized data
41Data Anonymization Using Randomization
- X in [0, 10], R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}
42Data Anonymization Using Randomization
- X in [0, 10], R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}
43Data Anonymization Using Randomization
- X in [0, 10], R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}
44Reconstruction of Original Data Distribution
- X in [0, 10], R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}
- Reconstruct the distribution of X using knowledge of R1 and μ
- EM algorithm converges to the MLE of the original distribution AA01
45Analysis of Privacy AS00
- X in [0, 10], R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}
- If X is uniform in [0, 10], privacy is determined by the range of μ
46Analysis of Privacy AA01
- X in [0, 10], R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}
- If X is uniform in [0, 1] ∪ [5, 6], privacy is smaller than the range of μ
47Analysis of Privacy AA01
- X in [0, 10], R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}
- If X is uniform in [0, 1] ∪ [5, 6], privacy is smaller than the range of μ
- In some cases, the sensitive value is revealed
48Quantify Loss of Privacy AA01
- Goal: quantify loss of privacy based on mutual information I(X;R)
- Smaller H(X|R) ⇒ more loss of privacy in X by knowledge of R
- Larger I(X;R) ⇒ more loss of privacy in X by knowledge of R
- I(X;R) = H(X) - H(X|R)
- I(X;R) is used to capture the correlation between X and R
- p(X) is the prior knowledge of sensitive attribute X
- p(X, R) is the joint distribution of X and R
49Quantify Loss of Privacy AA01
- Goal: quantify loss of privacy based on mutual information I(X;R)
- X is uniform in {5, 6}, R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}
50Quantify Loss of Privacy AA01
- Goal: quantify loss of privacy based on mutual information I(X;R)
- X is uniform in {5, 6}, R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}
51Quantify Loss of Privacy AA01
- Goal: quantify loss of privacy based on mutual information I(X;R)
- X is uniform in {5, 6}, R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}
52Quantify Loss of Privacy AA01
- Goal: quantify loss of privacy based on mutual information I(X;R)
- X is uniform in {5, 6}, R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}
- I(X;R) = 0.33
53Quantify Loss of Privacy AA01
- Goal: quantify loss of privacy based on mutual information I(X;R)
- X is uniform in {5, 6}, R2 = X + μ (mod 11), μ uniform in {0, 1}
- I(X;R1) = 0.33, I(X;R2) = 0.5 ⇒ R2 is a bigger privacy risk than R1
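The I(X;R) values above can be checked by enumerating the joint distribution p(X, R). A sketch (the helper names are ours, not from the papers):
```python
import math
from collections import defaultdict

def mutual_information(joint):
    """I(X;R) in bits, from a dict {(x, r): probability}."""
    px, pr = defaultdict(float), defaultdict(float)
    for (x, r), p in joint.items():
        px[x] += p
        pr[r] += p
    return sum(p * math.log2(p / (px[x] * pr[r])) for (x, r), p in joint.items() if p > 0)

def randomized(x_values, noise):
    """Joint p(X, R) for R = X + mu (mod 11), X uniform on x_values, mu ~ noise."""
    joint = defaultdict(float)
    for x in x_values:
        for mu, p_mu in noise.items():
            joint[(x, (x + mu) % 11)] += p_mu / len(x_values)
    return dict(joint)

R1 = randomized([5, 6], {-1: 1/3, 0: 1/3, 1: 1/3})
R2 = randomized([5, 6], {0: 1/2, 1: 1/2})
print(mutual_information(R1), mutual_information(R2))   # ~0.33 vs 0.5
```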
54Quantify Loss of Privacy AA01
- Equivalent goal: quantify loss of privacy based on H(X|R)
- X is uniform in {5, 6}, R2 = X + μ (mod 11), μ uniform in {0, 1}
- Intuition: we know more about X given R2 than about X given R1
- H(X|R1) = 0.67, H(X|R2) = 0.5 ⇒ R2 is a bigger privacy risk than R1
55Quantify Loss of Privacy
- Example: X is uniform in {0, 1}
- R3 = e (p = 0.9999), R3 = X (p = 0.0001)
- R4 = X (p = 0.6), R4 = 1 - X (p = 0.4)
- Is R3 or R4 a bigger privacy risk?
56Worst Case Loss of Privacy EGS03
- Example: X is uniform in {0, 1}
- R3 = e (p = 0.9999), R3 = X (p = 0.0001)
- R4 = X (p = 0.6), R4 = 1 - X (p = 0.4)
- I(X;R3) = 0.0001 << I(X;R4) = 0.028
57Worst Case Loss of Privacy EGS03
- Example: X is uniform in {0, 1}
- R3 = e (p = 0.9999), R3 = X (p = 0.0001)
- R4 = X (p = 0.6), R4 = 1 - X (p = 0.4)
- I(X;R3) = 0.0001 << I(X;R4) = 0.028
- But R3 has a larger worst-case risk
58Worst Case Loss of Privacy EGS03
- Goal: quantify the worst-case loss of privacy in X by knowledge of R
- Use max KL divergence, instead of mutual information
- Mutual information can be formulated as expected KL divergence
- I(X;R) = Σx Σr p(x,r) log2(p(x,r)/(p(x)p(r))) = KL(p(X,R) || p(X)p(R))
- I(X;R) = Σr p(r) Σx p(x|r) log2(p(x|r)/p(x)) = E_R[KL(p(X|r) || p(X))]
- The AA01 measure quantifies expected loss of privacy over R
- EGS03 propose a measure based on worst-case loss of privacy
- I_W(X;R) = MAX_R KL(p(X|r) || p(X))
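A sketch of the worst-case measure, reusing a joint distribution p(X, R) as above (helper names and data layout are ours):
```python
import math
from collections import defaultdict

def worst_case_loss(joint):
    """I_W(X;R) = max over r of KL( p(X|r) || p(X) ), in bits."""
    px, pr = defaultdict(float), defaultdict(float)
    for (x, r), p in joint.items():
        px[x] += p
        pr[r] += p
    worst = 0.0
    for r0 in pr:
        kl = sum((p / pr[r0]) * math.log2((p / pr[r0]) / px[x])
                 for (x, r), p in joint.items() if r == r0 and p > 0)
        worst = max(worst, kl)
    return worst

# X uniform in {0, 1}; R3 reveals X with tiny probability, R4 flips it with p = 0.4
R3 = {(0, 'e'): 0.49995, (0, 0): 0.00005, (1, 'e'): 0.49995, (1, 1): 0.00005}
R4 = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 0.3}
print(worst_case_loss(R3), worst_case_loss(R4))   # ~1.0 vs ~0.03
```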
59Worst Case Loss of Privacy EGS03
- Example: X is uniform in {0, 1}
- R3 = e (p = 0.9999), R3 = X (p = 0.0001)
- R4 = X (p = 0.6), R4 = 1 - X (p = 0.4)
- I_W(X;R3) = max{0.0, 1.0, 1.0} > I_W(X;R4) = max{0.028, 0.028}
60Worst Case Loss of Privacy EGS03
- Example: X is uniform in {5, 6}
- R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}
- R2 = X + μ (mod 11), μ uniform in {0, 1}
- I_W(X;R1) = max{1.0, 0.0, 0.0, 1.0} = I_W(X;R2) = max{1.0, 0.0, 1.0}
- Unable to capture that R2 is a bigger privacy risk than R1
61Data Anonymization Summary
- Randomization techniques useful for microdata
anonymization - Randomization techniques differ in their loss of
privacy - Information theoretic measures useful to capture
loss of privacy - Expected KL divergence captures expected loss of
privacy AA01 - Maximum KL divergence captures worst case loss of
privacy EGS03 - Both are useful in practice
62Outline
- Part 1
- Introduction to Information Theory
- Application Data Anonymization
- Application Data Integration
- Part 2
- Review of Information Theory Basics
- Application Database Design
- Computing Information Theoretic Primitives
- Open Problems
63Schema Matching
- Goal: align columns across database tables to be integrated
- Fundamental problem in database integration
- Early useful approach: textual similarity of column names
- False positives: Address ≈ IP_Address
- False negatives: Customer_Id = Client_Number
- Early useful approach: overlap of values in columns, e.g., Jaccard
- False positives: Emp_Id ≈ Project_Id
- False negatives: Emp_Id = Personnel_Number
64Opaque Schema Matching KN03
- Goal: align columns when column names and data values are opaque
- Databases belong to different government bureaucracies?
- Treat column names and data values as uninterpreted (generic)
- Example: EMP_PROJ(Emp_Id, Proj_Id, Task_Id, Status_Id)
- Likely that all Id fields are from the same domain
- Different databases may have different column names
65Opaque Schema Matching KN03
- Approach: build a complete, labeled graph GD for each database D
- Nodes are columns, label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
- Perform graph matching between GD1 and GD2, minimizing distance
- Intuition
- Entropy H(X) captures the distribution of values in database column X
- Mutual information I(X;Y) captures correlations between X, Y
- Efficiency: graph matching between schema-sized graphs
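A sketch of building the labeled graph for one database, with a table given as a dict of equal-length columns (a format we assume purely for illustration):
```python
import math
from collections import Counter
from itertools import combinations

def H(values):
    counts, n = Counter(values), len(values)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

def labeled_graph(table):
    """Nodes labeled H(column); edges labeled I(column A; column B)."""
    nodes = {col: H(vals) for col, vals in table.items()}
    edges = {}
    for a, b in combinations(table, 2):
        edges[(a, b)] = nodes[a] + nodes[b] - H(list(zip(table[a], table[b])))
    return nodes, edges

table = {'Emp_Id': [1, 2, 3, 4], 'Proj_Id': [10, 10, 20, 20], 'Status_Id': [0, 1, 0, 1]}
print(labeled_graph(table))
```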
66Opaque Schema Matching KN03
- Approach: build a complete, labeled graph GD for each database D
- Nodes are columns, label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
67Opaque Schema Matching KN03
- Approach: build a complete, labeled graph GD for each database D
- Nodes are columns, label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
- H(A) = 1.5, H(B) = 2.0, H(C) = 1.0, H(D) = 1.5
68Opaque Schema Matching KN03
- Approach: build a complete, labeled graph GD for each database D
- Nodes are columns, label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
- H(A) = 1.5, H(B) = 2.0, H(C) = 1.0, H(D) = 1.5, I(A;B) = 1.5
69Opaque Schema Matching KN03
- Approach: build a complete, labeled graph GD for each database D
- Nodes are columns, label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
70Opaque Schema Matching KN03
- Approach: build a complete, labeled graph GD for each database D
- Nodes are columns, label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
- Perform graph matching between GD1 and GD2, minimizing distance
- KN03 uses Euclidean and normal distance metrics
71Opaque Schema Matching KN03
- Approach: build a complete, labeled graph GD for each database D
- Nodes are columns, label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
- Perform graph matching between GD1 and GD2, minimizing distance
72Opaque Schema Matching KN03
- Approach: build a complete, labeled graph GD for each database D
- Nodes are columns, label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
- Perform graph matching between GD1 and GD2, minimizing distance
73Heterogeneity Identification DKOSV06
- Goal: identify columns with semantically heterogeneous values
- Can arise due to opaque schema matching KN03
- Key ideas
- Heterogeneity based on distribution,
distinguishability of values - Use Information Bottleneck to compute soft
clustering of values - Issues
- Which information theoretic measure characterizes
heterogeneity? - How to set parameters in the Information
Bottleneck method?
74Heterogeneity Identification DKOSV06
- Example semantically homogeneous, heterogeneous
columns
75Heterogeneity Identification DKOSV06
- Example semantically homogeneous, heterogeneous
columns
76Heterogeneity Identification DKOSV06
- Example: semantically homogeneous, heterogeneous columns
- More semantic types in a column ⇒ greater heterogeneity
- Only email versus email + phone
77Heterogeneity Identification DKOSV06
- Example semantically homogeneous, heterogeneous
columns
78Heterogeneity Identification DKOSV06
- Example: semantically homogeneous, heterogeneous columns
- Relative distribution of semantic types impacts heterogeneity
- Mainly email + few phone versus balanced email + phone
79Heterogeneity Identification DKOSV06
- Example semantically homogeneous, heterogeneous
columns
80Heterogeneity Identification DKOSV06
- Example semantically homogeneous, heterogeneous
columns
81Heterogeneity Identification DKOSV06
- Example: semantically homogeneous, heterogeneous columns
- More easily distinguished types ⇒ greater heterogeneity
- Phone + (possibly) SSN versus balanced email + phone
82Heterogeneity Identification DKOSV06
- Heterogeneity: space complexity of a soft clustering of the data
- More, balanced clusters ⇒ greater heterogeneity
- More distinguishable clusters ⇒ greater heterogeneity
- Soft clustering
- Soft ⇒ assign probabilities to membership of values in clusters
- How many clusters: tradeoff between space and quality
- Use Information Bottleneck to compute a soft clustering of values
83Heterogeneity Identification DKOSV06
84Heterogeneity Identification DKOSV06
- Soft clustering: cluster membership probabilities
- How to compute a good soft clustering?
85Heterogeneity Identification DKOSV06
- Represent strings as q-gram distributions
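A sketch of turning column values into q-gram distributions (q = 2 here); the example strings are hypothetical:
```python
from collections import Counter

def qgram_distribution(s, q=2):
    """Normalized distribution over the q-grams of a string."""
    grams = [s[i:i + q] for i in range(len(s) - q + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

print(qgram_distribution("john.smith@example.com"))
print(qgram_distribution("973-360-0000"))
```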
86Heterogeneity Identification DKOSV06
- iIB: find a soft clustering T of X that minimizes I(T;X) - βI(T;V)
- Allow iIB to use arbitrarily many clusters, use β = H(X)/I(X;V)
- Closest to the point with minimum space and maximum quality
87Heterogeneity Identification DKOSV06
- Rate-distortion curve: I(T;V)/I(X;V) vs I(T;X)/H(X)
88Heterogeneity Identification DKOSV06
- Heterogeneity: mutual information I(T;X) of the iIB clustering T at β
- 0 ≤ I(T;X) (= 0.126) ≤ H(X) (= 2.0), H(T) (= 1.0)
- Ideally, use iIB with an arbitrarily large number of clusters in T
89Heterogeneity Identification DKOSV06
- Heterogeneity: mutual information I(T;X) of the iIB clustering T at β
90Data Integration Summary
- Analyzing database instance critical for
effective data integration - Matching and quality assessments are key
components - Information theoretic measures useful for schema
matching - Align columns when column names, data values are
opaque
- Mutual information I(X;V) captures correlations between X, V
- Information theoretic measures useful for heterogeneity testing
- Identify columns with semantically heterogeneous values
- I(T;X) of the iIB clustering T at β captures column heterogeneity
91Outline
- Part 1
- Introduction to Information Theory
- Application Data Anonymization
- Application Data Integration
- Part 2
- Review of Information Theory Basics
- Application Database Design
- Computing Information Theoretic Primitives
- Open Problems
92Review of Information Theory Basics
- Discrete distribution: probability p(X)
- p(X,Y) = Σz p(X, Y, Z=z)
93Review of Information Theory Basics
- Discrete distribution: probability p(X)
- p(Y) = Σx p(X=x, Y) = Σx Σz p(X=x, Y, Z=z)
94Review of Information Theory Basics
- Discrete distribution: conditional probability p(X|Y)
- p(X,Y) = p(X|Y)p(Y) = p(Y|X)p(X)
95Review of Information Theory Basics
- Discrete distribution: entropy H(X)
- h(x) = log2(1/p(x))
- H(X) = Σ_{X=x} p(x)h(x) = 1.75
- H(Y) = Σ_{Y=y} p(y)h(y) = 1.5 (< log2 |Y| = 1.58)
- H(X,Y) = Σ_{X=x} Σ_{Y=y} p(x,y)h(x,y) = 2.25 (< log2 |X,Y| = 2.32)
96Review of Information Theory Basics
- Discrete distribution: conditional entropy H(X|Y)
- h(x|y) = log2(1/p(x|y))
- H(X|Y) = Σ_{X=x} Σ_{Y=y} p(x,y)h(x|y) = 0.75
- H(X|Y) = H(X,Y) - H(Y) = 2.25 - 1.5
97Review of Information Theory Basics
- Discrete distribution: mutual information I(X;Y)
- i(x;y) = log2(p(x,y)/(p(x)p(y)))
- I(X;Y) = Σ_{X=x} Σ_{Y=y} p(x,y)i(x;y) = 1.0
- I(X;Y) = H(X) + H(Y) - H(X,Y) = 1.75 + 1.5 - 2.25
98Outline
- Part 1
- Introduction to Information Theory
- Application Data Anonymization
- Application Data Integration
- Part 2
- Review of Information Theory Basics
- Application Database Design
- Computing Information Theoretic Primitives
- Open Problems
99Information Dependencies DR00
- Goal: use information theory to examine and reason about the information content of the attributes in a relation instance
- Key ideas
- Novel InD measure between attribute sets X, Y based on H(Y|X)
- Identify numeric inequalities between InD
measures - Results
- InD measures are a broader class than FDs and
MVDs - Armstrong axioms for FDs derivable from InD
inequalities - MVD inference rules derivable from InD
inequalities
100Information Dependencies DR00
- Functional dependency X → Y
- FD X → Y holds iff ∀ t1, t2 ((t1[X] = t2[X]) ⇒ (t1[Y] = t2[Y]))
101Information Dependencies DR00
- Functional dependency X → Y
- FD X → Y holds iff ∀ t1, t2 ((t1[X] = t2[X]) ⇒ (t1[Y] = t2[Y]))
102Information Dependencies DR00
- Result: FD X → Y holds iff H(Y|X) = 0
- Intuition: once X is known, there is no remaining uncertainty in Y
- H(Y|X) = 0.5
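A sketch of checking an FD on a relation instance via H(Y|X) = H(X,Y) - H(X); the toy rows are ours:
```python
import math
from collections import Counter

def H(values):
    counts, n = Counter(values), len(values)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

def fd_holds(rows, x_cols, y_cols):
    """FD X -> Y holds on this instance iff H(Y|X) = H(X,Y) - H(X) = 0."""
    xs = [tuple(r[c] for c in x_cols) for r in rows]
    xys = [tuple(r[c] for c in x_cols + y_cols) for r in rows]
    return abs(H(xys) - H(xs)) < 1e-9

rows = [{'X': 1, 'Y': 'a'}, {'X': 1, 'Y': 'a'}, {'X': 2, 'Y': 'b'}, {'X': 3, 'Y': 'b'}]
print(fd_holds(rows, ['X'], ['Y']))   # True: each X value determines Y
print(fd_holds(rows, ['Y'], ['X']))   # False: Y = 'b' occurs with X in {2, 3}
```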
103Information Dependencies DR00
- Multi-valued dependency X →→ Y
- MVD X →→ Y holds iff R(X,Y,Z) = R(X,Y) ⋈ R(X,Z)
104Information Dependencies DR00
- Multi-valued dependency X →→ Y
- MVD X →→ Y holds iff R(X,Y,Z) = R(X,Y) ⋈ R(X,Z)
105Information Dependencies DR00
- Multi-valued dependency X →→ Y
- MVD X →→ Y holds iff R(X,Y,Z) = R(X,Y) ⋈ R(X,Z)
106Information Dependencies DR00
- Result: MVD X →→ Y holds iff H(Y,Z|X) = H(Y|X) + H(Z|X)
- Intuition: once X is known, the uncertainties in Y and Z are independent
- H(Y|X) = 0.5, H(Z|X) = 0.75, H(Y,Z|X) = 1.25
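The MVD test can be sketched the same way (again on hypothetical rows), using the characterization above:
```python
import math
from collections import Counter

def H(values):
    counts, n = Counter(values), len(values)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

def mvd_holds(rows, x_cols, y_cols, z_cols):
    """MVD X ->> Y (with Z the remaining attributes) holds on this instance
    iff H(Y,Z|X) = H(Y|X) + H(Z|X)."""
    def proj(cols):
        return [tuple(r[c] for c in cols) for r in rows]
    hx = H(proj(x_cols))
    lhs = H(proj(x_cols + y_cols + z_cols)) - hx
    rhs = (H(proj(x_cols + y_cols)) - hx) + (H(proj(x_cols + z_cols)) - hx)
    return abs(lhs - rhs) < 1e-9

# X ->> Y holds: for each X value, the Y values and Z values occur in all combinations
rows = [{'X': 1, 'Y': 'a', 'Z': 'p'}, {'X': 1, 'Y': 'a', 'Z': 'q'},
        {'X': 1, 'Y': 'b', 'Z': 'p'}, {'X': 1, 'Y': 'b', 'Z': 'q'}]
print(mvd_holds(rows, ['X'], ['Y'], ['Z']))   # True
```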
107Information Dependencies DR00
- Result: the Armstrong axioms for FDs are derivable from InD inequalities
- Reflexivity: if Y ⊆ X, then X → Y
- H(Y|X) = 0 for Y ⊆ X
- Augmentation: X → Y ⇒ X,Z → Y,Z
- 0 ≤ H(Y,Z|X,Z) = H(Y|X,Z) ≤ H(Y|X) = 0
- Transitivity: X → Y & Y → Z ⇒ X → Z
- 0 = H(Y|X) + H(Z|Y) ≥ H(Z|X) ≥ 0
108Database Normal Forms
- Goal: eliminate update anomalies by good database design
- Need to know the integrity constraints on all database instances
- Boyce-Codd normal form
- Input: a set Σ of functional dependencies
- For every (non-trivial) FD R.X → R.Y ∈ Σ, R.X is a key of R
- 4NF
- Input: a set Σ of functional and multi-valued dependencies
- For every (non-trivial) MVD R.X →→ R.Y ∈ Σ, R.X is a key of R
109Database Normal Forms
- Functional dependency X → Y
- Which design is better?
110Database Normal Forms
- Functional dependency X → Y
- Which design is better?
- Decomposition is in BCNF
111Database Normal Forms
- Multi-valued dependency X →→ Y
- Which design is better?
112Database Normal Forms
- Multi-valued dependency X →→ Y
- Which design is better?
- Decomposition is in 4NF
113Well-Designed Databases AL03
- Goal: use information theory to characterize the goodness of a database design and reason about normalization algorithms
- Key idea
- Information content measure of a cell in a DB instance w.r.t. ICs
- Redundancy reduces the information content measure of cells
- Results
- Well-designed DB ⇔ each cell has information content > 0
- Normalization algorithms never decrease information content
114Well-Designed Databases AL03
- Information content of cell c in database D satisfying FD X → Y
- Uniform distribution p(V) on values for c consistent with D\c and the FD
- Information content of cell c is the entropy H(V)
- H(V62) = 2.0
115Well-Designed Databases AL03
- Information content of cell c in database D satisfying FD X → Y
- Uniform distribution p(V) on values for c consistent with D\c and the FD
- Information content of cell c is the entropy H(V)
- H(V22) = 0.0
116Well-Designed Databases AL03
- Information content of cell c in database D satisfying FD X → Y
- Information content of cell c is the entropy H(V)
- Schema S is in BCNF iff ∀ D ∈ S, H(V) > 0 for all cells c in D
- Technicalities w.r.t. size of the active domain
117Well-Designed Databases AL03
- Information content of cell c in database D satisfying FD X → Y
- Information content of cell c is the entropy H(V)
- H(V12) = 2.0, H(V42) = 2.0
118Well-Designed Databases AL03
- Information content of cell c in database D satisfying FD X → Y
- Information content of cell c is the entropy H(V)
- Schema S is in BCNF iff ∀ D ∈ S, H(V) > 0 for all cells c in D
119Well-Designed Databases AL03
- Information content of cell c in DB D satisfying MVD X →→ Y
- Information content of cell c is the entropy H(V)
- H(V52) = 0.0, H(V53) = 2.32
120Well-Designed Databases AL03
- Information content of cell c in DB D satisfying MVD X →→ Y
- Information content of cell c is the entropy H(V)
- Schema S is in 4NF iff ∀ D ∈ S, H(V) > 0 for all cells c in D
121Well-Designed Databases AL03
- Information content of cell c in DB D satisfying MVD X →→ Y
- Information content of cell c is the entropy H(V)
- H(V32) = 1.58, H(V34) = 2.32
122Well-Designed Databases AL03
- Information content of cell c in DB D satisfying MVD X →→ Y
- Information content of cell c is the entropy H(V)
- Schema S is in 4NF iff ∀ D ∈ S, H(V) > 0 for all cells c in D
123Well-Designed Databases AL03
- Normalization algorithms never decrease
information content - Information content of cell c is entropy H(V)
124Well-Designed Databases AL03
- Normalization algorithms never decrease
information content - Information content of cell c is entropy H(V)
125Well-Designed Databases AL03
- Normalization algorithms never decrease
information content - Information content of cell c is entropy H(V)
126Database Design Summary
- Good database design essential for preserving
data integrity - Information theoretic measures useful for
integrity constraints
- FD X → Y holds iff the InD measure H(Y|X) = 0
- MVD X →→ Y holds iff H(Y,Z|X) = H(Y|X) + H(Z|X)
- Information theory to model correlations in a specific database
- Information theoretic measures useful for normal forms
- Schema S is in BCNF/4NF iff ∀ D ∈ S, H(V) > 0 for all cells c in D
- Information theory to model distributions over possible databases
127Outline
- Part 1
- Introduction to Information Theory
- Application Data Anonymization
- Application Data Integration
- Part 2
- Review of Information Theory Basics
- Application Database Design
- Computing Information Theoretic Primitives
- Open Problems
128Domain size matters
- For random variable X, the domain size is |supp(X)| = |{xi : p(X = xi) > 0}|
- Different solutions exist depending on whether
domain size is small or large - Probability vectors usually very sparse
129Entropy Case I - Small domain size
- Suppose the number of unique values for a random variable X is small (i.e., fits in memory)
- Maximum likelihood estimator
- p(x) = (number of times x is encountered) / (total number of items in the set)
(Figure: a small multiset of values and its histogram over the domain {1, ..., 5}.)
130Entropy Case I - Small domain size
- H_MLE = Σx p(x) log 1/p(x)
- This is a biased estimate
- E[H_MLE] < H
- Miller-Madow correction
- Ĥ = H_MLE + (m - 1)/2n
- m is an estimate of the number of non-empty bins
- n = number of samples
- Bad news: ALL estimators for H are biased
- Good news: we can quantify the bias and variance of the MLE
- Bias ≤ log(1 + m/N)
- Var(H_MLE) ≤ (log n)²/N
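A sketch of the plug-in (MLE) estimator and the Miller-Madow correction; note the standard (m-1)/2n correction is stated in nats, so we divide by ln 2 to stay in bits:
```python
import math
from collections import Counter

def entropy_mle(samples):
    """Plug-in (maximum likelihood) entropy estimate, in bits."""
    counts, n = Counter(samples), len(samples)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

def entropy_miller_madow(samples):
    """MLE estimate plus the Miller-Madow bias correction (m - 1)/(2n)."""
    counts, n = Counter(samples), len(samples)
    m = len(counts)                          # observed non-empty bins
    return entropy_mle(samples) + (m - 1) / (2 * n * math.log(2))

sample = [1, 2, 1, 2, 1, 5, 4]               # a small sample; the MLE is biased low
print(entropy_mle(sample), entropy_miller_madow(sample))
```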
131Entropy Case II - Large domain size
- |X| is too large to fit in main memory, so we can't maintain explicit counts
- Streaming algorithms for H(X)
- Long history of work on this problem
- Bottom line
- A (1+ε)-relative approximation for H(X) that allows updates to frequencies, and requires almost constant, and optimal, space HNO08
132Streaming Entropy CCM07
- High-level idea: sample randomly from the stream, and track counts of the elements picked (as in AMS)
- PROBLEM: a skewed distribution prevents us from sampling lower-frequency elements (and entropy is small)
- Idea: estimate the largest frequency, and
- the distribution of what's left (higher entropy)
133Streaming Entropy CCM07
- Maintain set of samples from original
distribution and distribution without most
frequent element. - In parallel, maintain estimator for frequency of
most frequent element - normally this is hard
- but if frequency is very large, then simple
estimator exists MG81 (Google interview
puzzle!)
- At the end, compute a function of these two estimates
- Memory usage: roughly (1/ε²) log(1/ε) (ε is the error)
134Entropy and MI are related
- I(X;Y) = H(X) + H(Y) - H(X,Y)
- Suppose we can c-approximate H(X) for any c > 0
- Find H̃(X) s.t. |H̃(X) - H(X)| ≤ c
- Then we can 3c-approximate I(X;Y)
- Ĩ(X;Y) = H̃(X) + H̃(Y) - H̃(X,Y)
- ≤ (H(X) + c) + (H(Y) + c) - (H(X,Y) - c)
- ≤ H(X) + H(Y) - H(X,Y) + 3c
- ≤ I(X;Y) + 3c
- Similarly, we can 2c-approximate H(Y|X) = H(X,Y) - H(X)
- Estimating entropy allows us to estimate I(X;Y) and H(Y|X)
135Computing KL-divergence Small Domains
- Easy algorithm: maintain counts for each of p and q, normalize, and compute the KL-divergence
- PROBLEM! Suppose qi = 0
- pi log pi/qi is undefined!
- General problem with ML estimators: all events not seen have probability zero!!
- Laplace correction: add one to the count of every domain element
- Slightly better: add 0.5 to every count KT81
- Even better, more involved: use the Good-Turing estimator GT53
- Yields non-zero probability for things not seen
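A sketch of estimating KL-divergence from samples with simple pseudocount smoothing over a known domain, so no event gets probability zero (1.0 gives the Laplace correction, 0.5 the KT81 variant):
```python
import math
from collections import Counter

def kl_estimate(sample_p, sample_q, domain, pseudocount=1.0):
    """Estimate d_KL(p || q) in bits from two samples, adding `pseudocount`
    to the count of every domain element before normalizing."""
    cp, cq = Counter(sample_p), Counter(sample_q)
    total_p = len(sample_p) + pseudocount * len(domain)
    total_q = len(sample_q) + pseudocount * len(domain)
    kl = 0.0
    for v in domain:
        p = (cp[v] + pseudocount) / total_p
        q = (cq[v] + pseudocount) / total_q
        kl += p * math.log2(p / q)
    return kl

domain = ['a', 'b', 'c', 'd']
print(kl_estimate(['a', 'a', 'b'], ['c', 'c', 'd', 'd'], domain))
```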
136Computing KL-divergence Large Domains
- Bad news: no good relative approximations exist in small space
- (Partial) good news: additive approximations in small space under certain technical conditions (no pi is too small)
- (Partial) good news: additive approximations for a symmetric variant of KL-divergence, via sampling
- For details, see GMV08, GIM08
137Information-theoretic Clustering
- Given a collection of random variables X, each
explained by a random variable Y, we wish to
find a (hard or soft) clustering T such that
- I(T;X) - βI(T;Y) is minimized
- Features of solutions thus far
- heuristic (general problem is NP-hard)
- address both small-domain and large-domain
scenarios.
138Agglomerative Clustering (aIB) ST00
- Fix the number of clusters k
- While the number of clusters > k
- Determine the two clusters whose merge loses the least information
- Combine these two clusters
- Output clustering
- Merge criterion
- Merge the two clusters so that the change in I(T;V) is minimized
- Note: no consideration of β (the number of clusters is fixed)
139Agglomerative Clustering (aIB) S
- Elegant way of finding the two clusters to be merged
- Let d_JS(p, q) = (1/2)(d_KL(p, m) + d_KL(q, m)), where m = (p + q)/2
- d_JS(p, q) is a symmetric distance between p and q (the Jensen-Shannon distance)
- We merge the two clusters with the smallest d_JS(p, q), weighted by cluster mass
(Figure: distributions p and q and their average m.)
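A sketch of the merge step: the information lost by merging clusters i and j is their mass-weighted Jensen-Shannon divergence, and aIB merges the pair for which this is smallest (the data layout here is our own):
```python
import math

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def merge_cost(wi, pi, wj, pj):
    """Information lost by merging clusters (mass wi, center pi) and (wj, pj):
    the mass-weighted Jensen-Shannon divergence of their centers."""
    m = [(wi * a + wj * b) / (wi + wj) for a, b in zip(pi, pj)]
    return wi * kl(pi, m) + wj * kl(pj, m)

def best_merge(clusters):
    """clusters: list of (mass, p(V|t)); return (cost, i, j) of the cheapest merge."""
    pairs = [(merge_cost(*clusters[i], *clusters[j]), i, j)
             for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
    return min(pairs)

clusters = [(0.4, [0.8, 0.2]), (0.4, [0.7, 0.3]), (0.2, [0.1, 0.9])]
print(best_merge(clusters))   # the two similar clusters get merged first
```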
140Iterative Information Bottleneck (iIB) S
- aIB yields a hard clustering with k clusters
- If you want a soft clustering, use iIB (a variant of EM)
- Step 1: p(t|x) ∝ exp(-β d_KL(p(V|x), p(V|t)))
- Assign elements to clusters in proportion (exponentially) to their distance from the cluster center
- Step 2: compute new cluster centers as weighted centroids
- p(t) = Σx p(t|x) p(x)
- p(V|t) = Σx p(V|x) p(t|x) p(x) / p(t)
- Choose β according to DKOSV06
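A sketch of one iIB iteration under these updates; in the standard formulation the Step 1 assignment is also weighted by the current cluster mass p(t), which we include here, and we assume cluster centers with full support so the KL terms stay finite:
```python
import math

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def iib_iteration(p_x, p_v_given_x, p_t_given_x, beta):
    """One pass of iIB: recompute cluster masses/centers, then re-assign p(t|x)."""
    n, k, m = len(p_x), len(p_t_given_x[0]), len(p_v_given_x[0])
    # Step 2 quantities from the current assignment: p(t) and p(V|t)
    p_t = [sum(p_x[i] * p_t_given_x[i][t] for i in range(n)) for t in range(k)]
    p_v_given_t = [[sum(p_x[i] * p_t_given_x[i][t] * p_v_given_x[i][v] for i in range(n)) / p_t[t]
                    for v in range(m)] for t in range(k)]
    # Step 1: p(t|x) proportional to p(t) * exp(-beta * d_KL(p(V|x) || p(V|t)))
    new_assignment = []
    for i in range(n):
        w = [p_t[t] * math.exp(-beta * kl(p_v_given_x[i], p_v_given_t[t])) for t in range(k)]
        z = sum(w)
        new_assignment.append([wi / z for wi in w])
    return new_assignment

# toy step: two points, two candidate clusters, |V| = 2
print(iib_iteration([0.5, 0.5], [[0.9, 0.1], [0.2, 0.8]], [[0.6, 0.4], [0.4, 0.6]], beta=5.0))
```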
141Dealing with massive data sets
- Clustering on massive data sets is a problem
- Two main heuristics
- Sampling DKOSV06
- pick a small sample of the data, cluster it, and
(if necessary) assign remaining points to
clusters using soft assignment. - How many points to sample to get good bounds ?
- Streaming
- Scan the data in one pass, performing clustering
on the fly - How much memory needed to get reasonable quality
solution ?
142LIMBO (for aIB) ATMS04
- BIRCH-like idea
- Maintain a (sparse) summary for each cluster: (p(t), p(V|t))
- As data streams in, build clusters on groups of
objects - Build next-level clusters on cluster summaries
from lower level
143Outline
- Part 1
- Introduction to Information Theory
- Application Data Anonymization
- Application Data Integration
- Part 2
- Review of Information Theory Basics
- Application Database Design
- Computing Information Theoretic Primitives
- Open Problems
144Open Problems
- Data exploration and mining: information theory as a first-pass filter
- Relation to nonparametric generative models in machine learning (LDA, PPCA, ...)
- Engineering and stability: finding the right knobs to make systems reliable and scalable
- Other information-theoretic concepts? (rate distortion, higher-order entropy, ...)
THANK YOU !
145References Information Theory
- CT Thomas M. Cover, Joy A. Thomas. Elements of Information Theory.
- BMDG05 Arindam Banerjee, Srujana Merugu, Inderjit Dhillon, Joydeep Ghosh. Clustering with Bregman Divergences. JMLR 2005.
- TPB98 Naftali Tishby, Fernando Pereira, William
Bialek. The Information Bottleneck Method. Proc.
37th Annual Allerton Conference, 1998
146References Data Anonymization
- AA01 Dakshi Agrawal, Charu C. Aggarwal On the
design and quantification of privacy preserving
data mining algorithms. PODS 2001. - AS00 Rakesh Agrawal, Ramakrishnan Srikant
Privacy preserving data mining. SIGMOD 2000. - EGS03 Alexandre Evfimievski, Johannes Gehrke,
Ramakrishnan Srikant Limiting privacy breaches
in privacy preserving data mining. PODS 2003.
147References Data Integration
- AMT04 Periklis Andritsos, Renee J. Miller,
Panayiotis Tsaparas Information-theoretic tools
for mining database structure from large data
sets. SIGMOD 2004. - DKOSV06 Bing Tian Dai, Nick Koudas, Beng Chin
Ooi, Divesh Srivastava, Suresh Venkatasubramanian
Rapid identification of column heterogeneity.
ICDM 2006. - DKSTV08 Bing Tian Dai, Nick Koudas, Divesh
Srivastava, Anthony K. H. Tung, Suresh
Venkatasubramanian Validating multi-column
schema matchings by type. ICDE 2008. - KN03 Jaewoo Kang, Jeffrey F. Naughton On
schema matching with opaque column names and data
values. SIGMOD 2003. - PPH05 Patrick Pantel, Andrew Philpot, Eduard
Hovy An information theoretic model for database
alignment. SSDBM 2005.
148References Database Design
- AL03 Marcelo Arenas, Leonid Libkin An
information theoretic approach to normal forms
for relational and XML data. PODS 2003. - AL05 Marcelo Arenas, Leonid Libkin An
information theoretic approach to normal forms
for relational and XML data. JACM 52(2), 246-283,
2005. - DR00 Mehmet M. Dalkilic, Edward L. Robertson
Information dependencies. PODS 2000. - KL06 Solmaz Kolahi, Leonid Libkin On
redundancy vs dependency preservation in
normalization an information-theoretic study of
XML. PODS 2006.
149References Computing IT quantities
- P03 Liam Paninski. Estimation of entropy and mutual information. Neural Computation 15: 1191-1254, 2003.
- GT53 I. J. Good. Turing's anticipation of
Empirical Bayes in connection with the
cryptanalysis of the Naval Enigma. Journal of
Statistical Computation and Simulation, 66(2),
2000. - KT81 R. E. Krichevsky and V. K. Trofimov. The
performance of universal encoding. IEEE Trans.
Inform. Th. 27 (1981), 199--207. - CCM07 Amit Chakrabarti, Graham Cormode and
Andrew McGregor. A near-optimal algorithm for
computing the entropy of a stream. Proc. SODA
- HNO08 Nick Harvey, Jelani Nelson, Krzysztof Onak.
Sketching and Streaming Entropy via Approximation
Theory. FOCS 2008 - ATMS04 Periklis Andritsos, Panayiotis Tsaparas,
Renée J. Miller and Kenneth C. Sevcik. LIMBO
Scalable Clustering of Categorical Data. EDBT 2004
150References Computing IT quantities
- S Noam Slonim. The Information Bottleneck: Theory and Applications. Ph.D. thesis, Hebrew University, 2000.
- GMV08 Sudipto Guha, Andrew McGregor, Suresh
Venkatasubramanian. Streaming and sublinear
approximations for information distances. ACM
Trans Alg. 2008 - GIM08 Sudipto Guha, Piotr Indyk, Andrew
McGregor. Sketching Information Distances. JMLR,
2008.