Title: Information Theory For Data Management
1Information Theory For Data Management
Divesh Srivastava Suresh Venkatasubramanian
2Motivation
-- Abstruse Goose (177)
Information Theory is relevant to all of
humanity...
3Background
- Many problems in data management need precise
reasoning about information content, transfer and
loss - Structure Extraction
- Privacy preservation
- Schema design
- Probabilistic data ?
4Information Theory
- First developed by Shannon as a way of
quantifying capacity of signal channels. - Entropy, relative entropy and mutual information
capture intrinsic informational aspects of a
signal - Today
- Information theory provides a domain-independent way to reason about structure in data
- More information → interesting structure
- Less information/linkage → decoupling of structures
5Tutorial Thesis
- Information theory provides a mathematical
framework for the quantification of information
content, linkage and loss. - This framework can be used in the design of data
management strategies that rely on probing the
structure of information in data.
6Tutorial Goals
- Introduce information-theoretic concepts to VLDB
audience - Give a data-centric perspective on information
theory - Connect these to applications in data management
- Describe underlying computational primitives
- Illuminate when and how information theory might
be of use in new areas of data management.
7Outline
- Part 1
- Introduction to Information Theory
- Application Data Anonymization
- Application Data Integration
- Part 2
- Review of Information Theory Basics
- Application Database Design
- Computing Information Theoretic Primitives
- Open Problems
8Histograms And Discrete Distributions
9Histograms And Discrete Distributions
(Figure: a column of values is normalized, and possibly reweighted, into a discrete probability distribution.)
10From Columns To Random Variables
- We can think of a column of data as represented by a random variable
- X is a random variable
- p(X) is the column of probabilities p(X = x1), p(X = x2), and so on
- Also known (in the unweighted case) as the empirical distribution induced by the column X
- Notation
- X (upper case) denotes a random variable (column)
- x (lower case) denotes a value taken by X (a field in a tuple)
- p(x) is the probability p(X = x)
11Joint Distributions
- Discrete distribution: probability p(X, Y, Z)
- p(Y) = Σx p(X=x, Y) = Σx Σz p(X=x, Y, Z=z)
12Entropy Of A Column
- Let h(x) = log2 1/p(x)
- h(X) is the column of h(x) values
- H(X) = E_X[h(x)] = Σ_X p(x) log2 1/p(x)
- Two views of entropy
- It captures uncertainty in data: high entropy, more unpredictability
- It captures information content: higher entropy, more information
H(X) = 1.75 < log |X| = 2
13Examples
- X uniform over {1, ..., 4}: H(X) = 2
- Y is 1 with probability 0.5, and uniform over {2, 3, 4} otherwise
- H(Y) = 0.5 log 2 + 0.5 log 6 ≈ 1.8 < 2
- Y is more sharply defined, and so has less uncertainty
- Z uniform over {1, ..., 8}: H(Z) = 3 > 2
- Z spans a larger range, and captures more information
(Figure: histograms of X, Y and Z.)
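A minimal Python sketch (our own illustration, not part of the original tutorial) of computing the empirical entropy of a column; it reproduces the values quoted above for X, Y and Z:
```python
import math
from collections import Counter

def entropy(column):
    """Empirical entropy (in bits) of the distribution induced by a column."""
    counts = Counter(column)
    n = len(column)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

print(entropy([1, 2, 3, 4]))           # X uniform over {1,...,4}: 2.0
print(entropy([1, 1, 1, 2, 3, 4]))     # Y: p(1) = 1/2, rest uniform: ~1.79
print(entropy(list(range(1, 9))))      # Z uniform over {1,...,8}: 3.0
```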
14Comparing Distributions
- How do we measure the difference between two distributions?
- Kullback-Leibler divergence
- d_KL(p, q) = E_p[h(q) - h(p)] = Σ_i p_i log(p_i/q_i)
(Figure: an inference mechanism turns a prior belief q into a resulting belief p.)
15Comparing Distributions
- Kullback-Leibler divergence
- d_KL(p, q) = E_p[h(q) - h(p)] = Σ_i p_i log(p_i/q_i)
- d_KL(p, q) ≥ 0
- Captures the extra information needed to capture p given q
- Is asymmetric: d_KL(p, q) ≠ d_KL(q, p)
- Is not a metric (does not satisfy the triangle inequality)
- There are other measures
- χ2-distance, variational distance, f-divergences, ...
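A small sketch of the KL divergence on explicit probability vectors (the vectors are hypothetical, chosen only to illustrate the asymmetry):
```python
import math

def kl_divergence(p, q):
    """d_KL(p || q) = sum_i p_i log2(p_i/q_i); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
q = [1/3, 1/3, 1/3]
print(kl_divergence(p, q))   # ~0.43
print(kl_divergence(q, p))   # ~0.47 -- in general d_KL(p, q) != d_KL(q, p)
```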
16Conditional Probability
- Given a joint distribution on random variables X, Y, how much information about X can we glean from Y?
- Conditional probability p(X|Y)
- p(X = x1 | Y = y1) = p(X = x1, Y = y1) / p(Y = y1)
17Conditional Entropy
- Let h(x|y) = log2 1/p(x|y)
- H(X|Y) = E_{x,y}[h(x|y)] = Σ_x Σ_y p(x,y) log2 1/p(x|y)
- H(X|Y) = H(X,Y) - H(Y)
- H(X|Y) = H(X,Y) - H(Y) = 2.25 - 1.5 = 0.75
- If X, Y are independent, H(X|Y) = H(X)
18Mutual Information
- Mutual information captures the difference between the joint distribution on X and Y and the product of the marginal distributions on X and Y
- Let i(x;y) = log p(x,y)/(p(x)p(y))
- I(X;Y) = E_{x,y}[i(x;y)] = Σ_x Σ_y p(x,y) log p(x,y)/(p(x)p(y))
19Mutual Information Strength of linkage
- I(X;Y) = H(X) + H(Y) - H(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
- If X, Y are independent, then I(X;Y) = 0
- H(X,Y) = H(X) + H(Y), so I(X;Y) = H(X) + H(Y) - H(X,Y) = 0
- I(X;Y) ≤ max(H(X), H(Y))
- Suppose Y = f(X) (deterministically)
- Then H(Y|X) = 0, and so I(X;Y) = H(Y) - H(Y|X) = H(Y)
- Mutual information captures higher-order interactions
- Covariance captures linear interactions only
- Two variables can be uncorrelated (covariance = 0) and have nonzero mutual information
- X uniform over {-1, 0, 1}, Y = X²: Cov(X, Y) = 0, but I(X;Y) = H(Y) > 0
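A sketch of computing these quantities for two columns of a small (hypothetical) table, using the identities above:
```python
import math
from collections import Counter

def H(values):
    counts, n = Counter(values), len(values)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

# two paired columns X, Y of a toy table
rows = [(1, 'a'), (1, 'a'), (2, 'a'), (2, 'b'), (3, 'b'), (3, 'b'), (4, 'c'), (4, 'c')]
X = [x for x, _ in rows]
Y = [y for _, y in rows]

h_x, h_y, h_xy = H(X), H(Y), H(rows)
print(h_x + h_y - h_xy)   # I(X;Y) = H(X) + H(Y) - H(X,Y)
print(h_xy - h_y)         # H(X|Y) = H(X,Y) - H(Y)
print(h_xy - h_x)         # H(Y|X) = H(X,Y) - H(X)
```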
20Information-Theoretic Clustering
- Clustering takes a collection of objects and
groups them. - Given a distance function between objects
- Choice of measure of complexity of clustering
- Choice of measure of cost for a cluster
- Usually,
- Distance function is Euclidean distance
- Number of clusters is measure of complexity
- Cost measure for a cluster is the sum of squared distances to the center
- Goal: minimize complexity and cost
- Inherent tradeoff between the two
21Feature Representation
Let V = {v1, v2, v3, v4}. X is explained by a distribution over V. The feature vector of X is (0.5, 0.25, 0.125, 0.125).
22Feature Representation
p(v2|X2) = 0.2
(Figure: the feature vector of each Xi is the distribution p(V|Xi).)
23Information-Theoretic Clustering
- Clustering takes a collection of objects and
groups them. - Given a distance function between objects
- Choice of measure of complexity of clustering
- Choice of measure of cost for a cluster
- In information-theoretic setting
- What is the distance function ?
- How do we measure complexity ?
- What is a notion of cost/quality ?
- Goal: minimize complexity and maximize quality
- Inherent tradeoff between the two
24Measuring complexity of clustering
- Take 1: complexity of a clustering = number of clusters
- The standard model of complexity
- Doesn't capture the fact that clusters have different sizes
25Measuring complexity of clustering
- Take 2: complexity of a clustering = number of bits needed to describe it
- Writing down k needs log k bits
- In general, let cluster t ∈ T have |t| elements
- Set p(t) = |t|/n
- Bits to write down cluster sizes = H(T) = Σ_t p_t log 1/p_t
26Information-theoretic Clustering (take I)
- Given data X = {x1, ..., xn} explained by variable V, partition X into clusters (represented by T) such that
- H(T) is minimized and quality is maximized
27Soft clusterings
- In a hard clustering, each point is assigned to
exactly one cluster
- Characteristic function: p(t|x) = 1 if x ∈ t, 0 if not
- Suppose we allow points to partially belong to clusters
- p(T|x) is a distribution
- p(t|x) is the probability of assigning x to t
- How do we describe the complexity of a clustering?
28Measuring complexity of clustering
- Take 1
- p(t) = Σx p(x) p(t|x)
- Compute H(T) as before
- Problem
- H(T1) = H(T2) !!
29Measuring complexity of clustering
- By averaging the memberships, we've lost useful information
- Take 2: compute I(T;X)!
- Even better: if T is a hard clustering of X, then I(T;X) = H(T)
I(T1;X) = 0, I(T2;X) = 0.46
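A sketch of computing I(T;X) from soft membership probabilities p(t|x) (the memberships below are hypothetical; the 0 / 0.46 values above come from the tutorial's own figure):
```python
import math

def soft_clustering_complexity(p_x, p_t_given_x):
    """I(T;X) in bits, given the marginal p(x) and memberships p(t|x)."""
    k = len(p_t_given_x[0])
    p_t = [sum(px * row[t] for px, row in zip(p_x, p_t_given_x)) for t in range(k)]
    return sum(px * row[t] * math.log2(row[t] / p_t[t])
               for px, row in zip(p_x, p_t_given_x)
               for t in range(k) if row[t] > 0)

p_x = [0.25, 0.25, 0.25, 0.25]
hard = [[1, 0], [1, 0], [0, 1], [0, 1]]                    # hard: I(T;X) = H(T) = 1
blurry = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]  # uninformative: I(T;X) = 0
print(soft_clustering_complexity(p_x, hard), soft_clustering_complexity(p_x, blurry))
```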
30Information-theoretic Clustering (take II)
- Given data X = {x1, ..., xn} explained by variable V, partition X into clusters (represented by T) such that
- I(T;X) is minimized and quality is maximized
31Measuring cost of a cluster
Given objects X_t = {X1, X2, ..., Xm} in cluster t,
Cost(t) = (1/m) Σ_i d(Xi, C) = Σ_i p(Xi) d_KL(p(V|Xi), C)
where C = (1/m) Σ_i p(V|Xi) = Σ_i p(Xi) p(V|Xi) = p(V)
32Mutual Information Cost of Cluster
- Cost(t) = (1/m) Σ_i d(Xi, C) = Σ_i p(Xi) d_KL(p(V|Xi), p(V))
- Σ_i p(Xi) d_KL(p(V|Xi), p(V)) = Σ_i p(Xi) Σ_j p(vj|Xi) log p(vj|Xi)/p(vj)
- = Σ_{i,j} p(Xi, vj) log p(vj, Xi)/(p(vj)p(Xi))
- = I(Xt; V) !!
- Cost of a cluster = I(Xt; V)
33Cost of a clustering
- If we partition X into k clusters X1, ..., Xk
- Cost(clustering) = Σ_i p_i I(Xi; V)
- (p_i = |Xi| / |X|)
34Cost of a clustering
- Each cluster center t can be explained in terms
of V
- p(V|t) = Σ_i p(Xi) p(V|Xi)
- Suppose we treat each cluster center itself as a
point
35Cost of a clustering
- We can write down the cost of this cluster
- Cost(T) = I(T;V)
- Key result BMDG05
- Cost(clustering) = I(X; V) - I(T; V)
- Minimizing cost(clustering) ⇒ maximizing I(T; V)
36Information-theoretic Clustering (take III)
- Given data X = {x1, ..., xn} explained by variable V, partition X into clusters (represented by T) such that
- I(T;X) - βI(T;V) is minimized
- This is the Information Bottleneck method TPB98
- Agglomerative techniques exist for the case of hard clusterings
- β is the tradeoff parameter between complexity and cost
- I(T;X) and I(T;V) are in the same units
37Information Theory Summary
- We can represent data as discrete distributions
(normalized histograms) - Entropy captures uncertainty or information
content in a distribution - The Kullback-Leibler distance captures the
difference between distributions - Mutual information and conditional entropy
capture linkage between variables in a joint
distribution - We can formulate information-theoretic clustering
problems
38Outline
- Part 1
- Introduction to Information Theory
- Application Data Anonymization
- Application Data Integration
- Part 2
- Review of Information Theory Basics
- Application Database Design
- Computing Information Theoretic Primitives
- Open Problems
39Data Anonymization Using Randomization
- Goal: publish anonymized microdata to enable accurate ad hoc analyses, but ensure privacy of individuals' sensitive attributes
- Key ideas
- Randomize numerical data: add noise from a known
distribution - Reconstruct original data distribution using
published noisy data - Issues
- How can the original data distribution be
reconstructed? - What kinds of randomization preserve privacy of
individuals?
40Data Anonymization Using Randomization
- Many randomization strategies proposed AS00,
AA01, EGS03
- Example randomization strategies, X in [0, 10]
- R = X + μ (mod 11), μ uniform in {-1, 0, 1}
- R = X + μ (mod 11), μ is -1 (p = 0.25), 0 (p = 0.5), 1 (p = 0.25)
- R = X (p = 0.6), R = μ with μ uniform in [0, 10] (p = 0.4)
- Question
- Which randomization strategy has higher privacy
preservation? - Quantify loss of privacy due to publication of
randomized data
41Data Anonymization Using Randomization
- X in [0, 10], R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}
42Data Anonymization Using Randomization
- X in [0, 10], R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}
43Data Anonymization Using Randomization
- X in [0, 10], R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}
44Reconstruction of Original Data Distribution
- X in [0, 10], R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}
- Reconstruct the distribution of X using knowledge of R1 and μ
- EM algorithm converges to the MLE of the original distribution AA01
45Analysis of Privacy AS00
- X in [0, 10], R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}
- If X is uniform in [0, 10], privacy is determined by the range of μ
46Analysis of Privacy AA01
- X in [0, 10], R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}
- If X is uniform in [0, 1] ∪ [5, 6], privacy is smaller than the range of μ
47Analysis of Privacy AA01
- X in [0, 10], R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}
- If X is uniform in [0, 1] ∪ [5, 6], privacy is smaller than the range of μ
- In some cases, the sensitive value is revealed
48Quantify Loss of Privacy AA01
- Goal: quantify loss of privacy based on mutual information I(X;R)
- Smaller H(X|R) ⇒ more loss of privacy in X by knowledge of R
- Larger I(X;R) ⇒ more loss of privacy in X by knowledge of R
- I(X;R) = H(X) - H(X|R)
- I(X;R) is used to capture the correlation between X and R
- p(X) is the prior knowledge of sensitive attribute X
- p(X, R) is the joint distribution of X and R
49Quantify Loss of Privacy AA01
- Goal: quantify loss of privacy based on mutual information I(X;R)
- X is uniform in {5, 6}, R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}
50Quantify Loss of Privacy AA01
- Goal: quantify loss of privacy based on mutual information I(X;R)
- X is uniform in {5, 6}, R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}
51Quantify Loss of Privacy AA01
- Goal: quantify loss of privacy based on mutual information I(X;R)
- X is uniform in {5, 6}, R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}
52Quantify Loss of Privacy AA01
- Goal: quantify loss of privacy based on mutual information I(X;R)
- X is uniform in {5, 6}, R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}
- I(X;R) = 0.33
53Quantify Loss of Privacy AA01
- Goal: quantify loss of privacy based on mutual information I(X;R)
- X is uniform in {5, 6}, R2 = X + μ (mod 11), μ uniform in {0, 1}
- I(X;R1) = 0.33, I(X;R2) = 0.5 ⇒ R2 is a bigger privacy risk than R1
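The I(X;R) values above can be checked by enumerating the joint distribution p(X, R). A sketch (the helper names are ours, not from the papers):
```python
import math
from collections import defaultdict

def mutual_information(joint):
    """I(X;R) in bits, from a dict {(x, r): probability}."""
    px, pr = defaultdict(float), defaultdict(float)
    for (x, r), p in joint.items():
        px[x] += p
        pr[r] += p
    return sum(p * math.log2(p / (px[x] * pr[r])) for (x, r), p in joint.items() if p > 0)

def randomized(x_values, noise):
    """Joint p(X, R) for R = X + mu (mod 11), X uniform on x_values, mu ~ noise."""
    joint = defaultdict(float)
    for x in x_values:
        for mu, p_mu in noise.items():
            joint[(x, (x + mu) % 11)] += p_mu / len(x_values)
    return dict(joint)

R1 = randomized([5, 6], {-1: 1/3, 0: 1/3, 1: 1/3})
R2 = randomized([5, 6], {0: 1/2, 1: 1/2})
print(mutual_information(R1), mutual_information(R2))   # ~0.33 vs 0.5
```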
54Quantify Loss of Privacy AA01
- Equivalent goal: quantify loss of privacy based on H(X|R)
- X is uniform in {5, 6}, R2 = X + μ (mod 11), μ uniform in {0, 1}
- Intuition: we know more about X given R2 than about X given R1
- H(X|R1) = 0.67, H(X|R2) = 0.5 ⇒ R2 is a bigger privacy risk than R1
55Quantify Loss of Privacy
- Example: X is uniform in {0, 1}
- R3 = e (p = 0.9999), R3 = X (p = 0.0001)
- R4 = X (p = 0.6), R4 = 1 - X (p = 0.4)
- Is R3 or R4 a bigger privacy risk?
56Worst Case Loss of Privacy EGS03
- Example: X is uniform in {0, 1}
- R3 = e (p = 0.9999), R3 = X (p = 0.0001)
- R4 = X (p = 0.6), R4 = 1 - X (p = 0.4)
- I(X;R3) = 0.0001 << I(X;R4) = 0.028
57Worst Case Loss of Privacy EGS03
- Example: X is uniform in {0, 1}
- R3 = e (p = 0.9999), R3 = X (p = 0.0001)
- R4 = X (p = 0.6), R4 = 1 - X (p = 0.4)
- I(X;R3) = 0.0001 << I(X;R4) = 0.028
- But R3 has a larger worst-case risk
58Worst Case Loss of Privacy EGS03
- Goal: quantify the worst-case loss of privacy in X by knowledge of R
- Use max KL divergence, instead of mutual information
- Mutual information can be formulated as expected KL divergence
- I(X;R) = Σx Σr p(x,r) log2(p(x,r)/(p(x)p(r))) = KL(p(X,R) || p(X)p(R))
- I(X;R) = Σr p(r) Σx p(x|r) log2(p(x|r)/p(x)) = E_R[KL(p(X|r) || p(X))]
- The AA01 measure quantifies expected loss of privacy over R
- EGS03 propose a measure based on worst-case loss of privacy
- I_W(X;R) = MAX_R KL(p(X|r) || p(X))
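A sketch of the worst-case measure, reusing a joint distribution p(X, R) as above (helper names and data layout are ours):
```python
import math
from collections import defaultdict

def worst_case_loss(joint):
    """I_W(X;R) = max over r of KL( p(X|r) || p(X) ), in bits."""
    px, pr = defaultdict(float), defaultdict(float)
    for (x, r), p in joint.items():
        px[x] += p
        pr[r] += p
    worst = 0.0
    for r0 in pr:
        kl = sum((p / pr[r0]) * math.log2((p / pr[r0]) / px[x])
                 for (x, r), p in joint.items() if r == r0 and p > 0)
        worst = max(worst, kl)
    return worst

# X uniform in {0, 1}; R3 reveals X with tiny probability, R4 flips it with p = 0.4
R3 = {(0, 'e'): 0.49995, (0, 0): 0.00005, (1, 'e'): 0.49995, (1, 1): 0.00005}
R4 = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 0.3}
print(worst_case_loss(R3), worst_case_loss(R4))   # ~1.0 vs ~0.03
```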
59Worst Case Loss of Privacy EGS03
- Example: X is uniform in {0, 1}
- R3 = e (p = 0.9999), R3 = X (p = 0.0001)
- R4 = X (p = 0.6), R4 = 1 - X (p = 0.4)
- I_W(X;R3) = max{0.0, 1.0, 1.0} > I_W(X;R4) = max{0.028, 0.028}
60Worst Case Loss of Privacy EGS03
- Example: X is uniform in {5, 6}
- R1 = X + μ (mod 11), μ uniform in {-1, 0, 1}
- R2 = X + μ (mod 11), μ uniform in {0, 1}
- I_W(X;R1) = max{1.0, 0.0, 0.0, 1.0} = I_W(X;R2) = max{1.0, 0.0, 1.0}
- Unable to capture that R2 is a bigger privacy risk than R1
61Data Anonymization Summary
- Randomization techniques useful for microdata
anonymization - Randomization techniques differ in their loss of
privacy - Information theoretic measures useful to capture
loss of privacy - Expected KL divergence captures expected loss of
privacy AA01 - Maximum KL divergence captures worst case loss of
privacy EGS03 - Both are useful in practice
62Outline
- Part 1
- Introduction to Information Theory
- Application Data Anonymization
- Application Data Integration
- Part 2
- Review of Information Theory Basics
- Application Database Design
- Computing Information Theoretic Primitives
- Open Problems
63Schema Matching
- Goal: align columns across database tables to be integrated
- Fundamental problem in database integration
- Early useful approach: textual similarity of column names
- False positives: Address ≈ IP_Address
- False negatives: Customer_Id = Client_Number
- Early useful approach: overlap of values in columns, e.g., Jaccard
- False positives: Emp_Id ≈ Project_Id
- False negatives: Emp_Id = Personnel_Number
64Opaque Schema Matching KN03
- Goal: align columns when column names and data values are opaque
- Databases belong to different government bureaucracies?
- Treat column names and data values as uninterpreted (generic)
- Example: EMP_PROJ(Emp_Id, Proj_Id, Task_Id, Status_Id)
- Likely that all Id fields are from the same domain
- Different databases may have different column names
65Opaque Schema Matching KN03
- Approach: build a complete, labeled graph GD for each database D
- Nodes are columns, label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
- Perform graph matching between GD1 and GD2, minimizing distance
- Intuition
- Entropy H(X) captures the distribution of values in database column X
- Mutual information I(X;Y) captures correlations between X, Y
- Efficiency: graph matching between schema-sized graphs
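A sketch of building the labeled graph for one database, with a table given as a dict of equal-length columns (a format we assume purely for illustration):
```python
import math
from collections import Counter
from itertools import combinations

def H(values):
    counts, n = Counter(values), len(values)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

def labeled_graph(table):
    """Nodes labeled H(column); edges labeled I(column A; column B)."""
    nodes = {col: H(vals) for col, vals in table.items()}
    edges = {}
    for a, b in combinations(table, 2):
        edges[(a, b)] = nodes[a] + nodes[b] - H(list(zip(table[a], table[b])))
    return nodes, edges

table = {'Emp_Id': [1, 2, 3, 4], 'Proj_Id': [10, 10, 20, 20], 'Status_Id': [0, 1, 0, 1]}
print(labeled_graph(table))
```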
66Opaque Schema Matching KN03
- Approach: build a complete, labeled graph GD for each database D
- Nodes are columns, label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
67Opaque Schema Matching KN03
- Approach: build a complete, labeled graph GD for each database D
- Nodes are columns, label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
- H(A) = 1.5, H(B) = 2.0, H(C) = 1.0, H(D) = 1.5
68Opaque Schema Matching KN03
- Approach: build a complete, labeled graph GD for each database D
- Nodes are columns, label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
- H(A) = 1.5, H(B) = 2.0, H(C) = 1.0, H(D) = 1.5, I(A;B) = 1.5
69Opaque Schema Matching KN03
- Approach: build a complete, labeled graph GD for each database D
- Nodes are columns, label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
70Opaque Schema Matching KN03
- Approach: build a complete, labeled graph GD for each database D
- Nodes are columns, label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
- Perform graph matching between GD1 and GD2, minimizing distance
- KN03 uses Euclidean and normal distance metrics
71Opaque Schema Matching KN03
- Approach: build a complete, labeled graph GD for each database D
- Nodes are columns, label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
- Perform graph matching between GD1 and GD2, minimizing distance
72Opaque Schema Matching KN03
- Approach: build a complete, labeled graph GD for each database D
- Nodes are columns, label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
- Perform graph matching between GD1 and GD2, minimizing distance
73Heterogeneity Identification DKOSV06
- Goal: identify columns with semantically heterogeneous values
- Can arise due to opaque schema matching KN03
- Key ideas
- Heterogeneity based on distribution,
distinguishability of values - Use Information Bottleneck to compute soft
clustering of values - Issues
- Which information theoretic measure characterizes
heterogeneity? - How to set parameters in the Information
Bottleneck method?
74Heterogeneity Identification DKOSV06
- Example semantically homogeneous, heterogeneous
columns
75Heterogeneity Identification DKOSV06
- Example semantically homogeneous, heterogeneous
columns
76Heterogeneity Identification DKOSV06
- Example: semantically homogeneous, heterogeneous columns
- More semantic types in a column ⇒ greater heterogeneity
- Only email versus email + phone
77Heterogeneity Identification DKOSV06
- Example semantically homogeneous, heterogeneous
columns
78Heterogeneity Identification DKOSV06
- Example: semantically homogeneous, heterogeneous columns
- Relative distribution of semantic types impacts heterogeneity
- Mainly email + few phone versus balanced email + phone
79Heterogeneity Identification DKOSV06
- Example semantically homogeneous, heterogeneous
columns
80Heterogeneity Identification DKOSV06
- Example semantically homogeneous, heterogeneous
columns
81Heterogeneity Identification DKOSV06
- Example: semantically homogeneous, heterogeneous columns
- More easily distinguished types ⇒ greater heterogeneity
- Phone + (possibly) SSN versus balanced email + phone
82Heterogeneity Identification DKOSV06
- Heterogeneity: space complexity of a soft clustering of the data
- More, balanced clusters ⇒ greater heterogeneity
- More distinguishable clusters ⇒ greater heterogeneity
- Soft clustering
- Soft ⇒ assign probabilities to membership of values in clusters
- How many clusters: tradeoff between space and quality
- Use Information Bottleneck to compute a soft clustering of values
83Heterogeneity Identification DKOSV06
84Heterogeneity Identification DKOSV06
- Soft clustering: cluster membership probabilities
- How to compute a good soft clustering?
85Heterogeneity Identification DKOSV06
- Represent strings as q-gram distributions
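A sketch of turning column values into q-gram distributions (q = 2 here); the example strings are hypothetical:
```python
from collections import Counter

def qgram_distribution(s, q=2):
    """Normalized distribution over the q-grams of a string."""
    grams = [s[i:i + q] for i in range(len(s) - q + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

print(qgram_distribution("john.smith@example.com"))
print(qgram_distribution("973-360-0000"))
```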
86Heterogeneity Identification DKOSV06
- iIB: find a soft clustering T of X that minimizes I(T;X) - βI(T;V)
- Allow iIB to use arbitrarily many clusters, use β = H(X)/I(X;V)
- Closest to the point with minimum space and maximum quality
87Heterogeneity Identification DKOSV06
- Rate-distortion curve: I(T;V)/I(X;V) vs I(T;X)/H(X)
88Heterogeneity Identification DKOSV06
- Heterogeneity: mutual information I(T;X) of the iIB clustering T at β
- 0 ≤ I(T;X) (= 0.126) ≤ H(X) (= 2.0), H(T) (= 1.0)
- Ideally, use iIB with an arbitrarily large number of clusters in T
89Heterogeneity Identification DKOSV06
- Heterogeneity: mutual information I(T;X) of the iIB clustering T at β
90Data Integration Summary
- Analyzing database instance critical for
effective data integration - Matching and quality assessments are key
components - Information theoretic measures useful for schema
matching - Align columns when column names, data values are
opaque
- Mutual information I(X;V) captures correlations between X, V
- Information theoretic measures useful for heterogeneity testing
- Identify columns with semantically heterogeneous values
- I(T;X) of the iIB clustering T at β captures column heterogeneity
91Outline
- Part 1
- Introduction to Information Theory
- Application Data Anonymization
- Application Data Integration
- Part 2
- Review of Information Theory Basics
- Application Database Design
- Computing Information Theoretic Primitives
- Open Problems
92Review of Information Theory Basics
- Discrete distribution: probability p(X)
- p(X,Y) = Σz p(X, Y, Z=z)
93Review of Information Theory Basics
- Discrete distribution: probability p(X)
- p(Y) = Σx p(X=x, Y) = Σx Σz p(X=x, Y, Z=z)
94Review of Information Theory Basics
- Discrete distribution: conditional probability p(X|Y)
- p(X,Y) = p(X|Y)p(Y) = p(Y|X)p(X)
95Review of Information Theory Basics
- Discrete distribution: entropy H(X)
- h(x) = log2(1/p(x))
- H(X) = Σ_{X=x} p(x)h(x) = 1.75
- H(Y) = Σ_{Y=y} p(y)h(y) = 1.5 (< log2 |Y| = 1.58)
- H(X,Y) = Σ_{X=x} Σ_{Y=y} p(x,y)h(x,y) = 2.25 (< log2 |X,Y| = 2.32)
96Review of Information Theory Basics
- Discrete distribution: conditional entropy H(X|Y)
- h(x|y) = log2(1/p(x|y))
- H(X|Y) = Σ_{X=x} Σ_{Y=y} p(x,y)h(x|y) = 0.75
- H(X|Y) = H(X,Y) - H(Y) = 2.25 - 1.5
97Review of Information Theory Basics
- Discrete distribution: mutual information I(X;Y)
- i(x;y) = log2(p(x,y)/(p(x)p(y)))
- I(X;Y) = Σ_{X=x} Σ_{Y=y} p(x,y)i(x;y) = 1.0
- I(X;Y) = H(X) + H(Y) - H(X,Y) = 1.75 + 1.5 - 2.25
98Outline
- Part 1
- Introduction to Information Theory
- Application Data Anonymization
- Application Data Integration
- Part 2
- Review of Information Theory Basics
- Application Database Design
- Computing Information Theoretic Primitives
- Open Problems
99Information Dependencies DR00
- Goal: use information theory to examine and reason about the information content of the attributes in a relation instance
- Key ideas
- Novel InD measure between attribute sets X, Y based on H(Y|X)
- Identify numeric inequalities between InD
measures - Results
- InD measures are a broader class than FDs and
MVDs - Armstrong axioms for FDs derivable from InD
inequalities - MVD inference rules derivable from InD
inequalities
100Information Dependencies DR00
- Functional dependency X → Y
- FD X → Y holds iff ∀ t1, t2 ((t1[X] = t2[X]) ⇒ (t1[Y] = t2[Y]))
101Information Dependencies DR00
- Functional dependency X → Y
- FD X → Y holds iff ∀ t1, t2 ((t1[X] = t2[X]) ⇒ (t1[Y] = t2[Y]))
102Information Dependencies DR00
- Result: FD X → Y holds iff H(Y|X) = 0
- Intuition: once X is known, there is no remaining uncertainty in Y
- H(Y|X) = 0.5
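A sketch of checking an FD on a relation instance via H(Y|X) = H(X,Y) - H(X); the toy rows are ours:
```python
import math
from collections import Counter

def H(values):
    counts, n = Counter(values), len(values)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

def fd_holds(rows, x_cols, y_cols):
    """FD X -> Y holds on this instance iff H(Y|X) = H(X,Y) - H(X) = 0."""
    xs = [tuple(r[c] for c in x_cols) for r in rows]
    xys = [tuple(r[c] for c in x_cols + y_cols) for r in rows]
    return abs(H(xys) - H(xs)) < 1e-9

rows = [{'X': 1, 'Y': 'a'}, {'X': 1, 'Y': 'a'}, {'X': 2, 'Y': 'b'}, {'X': 3, 'Y': 'b'}]
print(fd_holds(rows, ['X'], ['Y']))   # True: each X value determines Y
print(fd_holds(rows, ['Y'], ['X']))   # False: Y = 'b' occurs with X in {2, 3}
```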
103Information Dependencies DR00
- Multi-valued dependency X →→ Y
- MVD X →→ Y holds iff R(X,Y,Z) = R(X,Y) ⋈ R(X,Z)
104Information Dependencies DR00
- Multi-valued dependency X →→ Y
- MVD X →→ Y holds iff R(X,Y,Z) = R(X,Y) ⋈ R(X,Z)
105Information Dependencies DR00
- Multi-valued dependency X →→ Y
- MVD X →→ Y holds iff R(X,Y,Z) = R(X,Y) ⋈ R(X,Z)
106Information Dependencies DR00
- Result: MVD X →→ Y holds iff H(Y,Z|X) = H(Y|X) + H(Z|X)
- Intuition: once X is known, the uncertainties in Y and Z are independent
- H(Y|X) = 0.5, H(Z|X) = 0.75, H(Y,Z|X) = 1.25
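The MVD test can be sketched the same way (again on hypothetical rows), using the characterization above:
```python
import math
from collections import Counter

def H(values):
    counts, n = Counter(values), len(values)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

def mvd_holds(rows, x_cols, y_cols, z_cols):
    """MVD X ->> Y (with Z the remaining attributes) holds on this instance
    iff H(Y,Z|X) = H(Y|X) + H(Z|X)."""
    def proj(cols):
        return [tuple(r[c] for c in cols) for r in rows]
    hx = H(proj(x_cols))
    lhs = H(proj(x_cols + y_cols + z_cols)) - hx
    rhs = (H(proj(x_cols + y_cols)) - hx) + (H(proj(x_cols + z_cols)) - hx)
    return abs(lhs - rhs) < 1e-9

# X ->> Y holds: for each X value, the Y values and Z values occur in all combinations
rows = [{'X': 1, 'Y': 'a', 'Z': 'p'}, {'X': 1, 'Y': 'a', 'Z': 'q'},
        {'X': 1, 'Y': 'b', 'Z': 'p'}, {'X': 1, 'Y': 'b', 'Z': 'q'}]
print(mvd_holds(rows, ['X'], ['Y'], ['Z']))   # True
```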
107Information Dependencies DR00
- Result: the Armstrong axioms for FDs are derivable from InD inequalities
- Reflexivity: if Y ⊆ X, then X → Y
- H(Y|X) = 0 for Y ⊆ X
- Augmentation: X → Y ⇒ X,Z → Y,Z
- 0 ≤ H(Y,Z|X,Z) = H(Y|X,Z) ≤ H(Y|X) = 0
- Transitivity: X → Y & Y → Z ⇒ X → Z
- 0 = H(Y|X) + H(Z|Y) ≥ H(Z|X) ≥ 0
108Database Normal Forms
- Goal: eliminate update anomalies by good database design
- Need to know the integrity constraints on all database instances
- Boyce-Codd normal form
- Input: a set Σ of functional dependencies
- For every (non-trivial) FD R.X → R.Y ∈ Σ, R.X is a key of R
- 4NF
- Input: a set Σ of functional and multi-valued dependencies
- For every (non-trivial) MVD R.X →→ R.Y ∈ Σ, R.X is a key of R
109Database Normal Forms
- Functional dependency X → Y
- Which design is better?
110Database Normal Forms
- Functional dependency X → Y
- Which design is better?
- Decomposition is in BCNF
111Database Normal Forms
- Multi-valued dependency X →→ Y
- Which design is better?
112Database Normal Forms
- Multi-valued dependency X →→ Y
- Which design is better?
- Decomposition is in 4NF
113Well-Designed Databases AL03
- Goal: use information theory to characterize the goodness of a database design and reason about normalization algorithms
- Key idea
- Information content measure of a cell in a DB instance w.r.t. ICs
- Redundancy reduces the information content measure of cells
- Results
- Well-designed DB ⇔ each cell has information content > 0
- Normalization algorithms never decrease information content
114Well-Designed Databases AL03
- Information content of cell c in database D satisfying FD X → Y
- Uniform distribution p(V) on values for c consistent with D\c and the FD
- Information content of cell c is the entropy H(V)
- H(V62) = 2.0
115Well-Designed Databases AL03
- Information content of cell c in database D satisfying FD X → Y
- Uniform distribution p(V) on values for c consistent with D\c and the FD
- Information content of cell c is the entropy H(V)
- H(V22) = 0.0
116Well-Designed Databases AL03
- Information content of cell c in database D satisfying FD X → Y
- Information content of cell c is the entropy H(V)
- Schema S is in BCNF iff ∀ D ∈ S, H(V) > 0 for all cells c in D
- Technicalities w.r.t. size of the active domain
117Well-Designed Databases AL03
- Information content of cell c in database D satisfying FD X → Y
- Information content of cell c is the entropy H(V)
- H(V12) = 2.0, H(V42) = 2.0
118Well-Designed Databases AL03
- Information content of cell c in database D satisfying FD X → Y
- Information content of cell c is the entropy H(V)
- Schema S is in BCNF iff ∀ D ∈ S, H(V) > 0 for all cells c in D
119Well-Designed Databases AL03
- Information content of cell c in DB D satisfying MVD X →→ Y
- Information content of cell c is the entropy H(V)
- H(V52) = 0.0, H(V53) = 2.32
120Well-Designed Databases AL03
- Information content of cell c in DB D satisfying MVD X →→ Y
- Information content of cell c is the entropy H(V)
- Schema S is in 4NF iff ∀ D ∈ S, H(V) > 0 for all cells c in D
121Well-Designed Databases AL03
- Information content of cell c in DB D satisfying MVD X →→ Y
- Information content of cell c is the entropy H(V)
- H(V32) = 1.58, H(V34) = 2.32
122Well-Designed Databases AL03
- Information content of cell c in DB D satisfying MVD X →→ Y
- Information content of cell c is the entropy H(V)
- Schema S is in 4NF iff ∀ D ∈ S, H(V) > 0 for all cells c in D
123Well-Designed Databases AL03
- Normalization algorithms never decrease
information content - Information content of cell c is entropy H(V)
124Well-Designed Databases AL03
- Normalization algorithms never decrease
information content - Information content of cell c is entropy H(V)
125Well-Designed Databases AL03
- Normalization algorithms never decrease
information content - Information content of cell c is entropy H(V)
126Database Design Summary
- Good database design essential for preserving
data integrity - Information theoretic measures useful for
integrity constraints
- FD X → Y holds iff the InD measure H(Y|X) = 0
- MVD X →→ Y holds iff H(Y,Z|X) = H(Y|X) + H(Z|X)
- Information theory to model correlations in a specific database
- Information theoretic measures useful for normal forms
- Schema S is in BCNF/4NF iff ∀ D ∈ S, H(V) > 0 for all cells c in D
- Information theory to model distributions over possible databases
127Outline
- Part 1
- Introduction to Information Theory
- Application Data Anonymization
- Application Data Integration
- Part 2
- Review of Information Theory Basics
- Application Database Design
- Computing Information Theoretic Primitives
- Open Problems
128Domain size matters
- For random variable X, the domain size is |supp(X)| = |{xi : p(X = xi) > 0}|
- Different solutions exist depending on whether
domain size is small or large - Probability vectors usually very sparse
129Entropy Case I - Small domain size
- Suppose the number of unique values for a random variable X is small (i.e., fits in memory)
- Maximum likelihood estimator
- p(x) = (number of times x is encountered) / (total number of items in the set)
(Figure: a small multiset of values and its histogram over the domain {1, ..., 5}.)
130Entropy Case I - Small domain size
- H_MLE = Σx p(x) log 1/p(x)
- This is a biased estimate
- E[H_MLE] < H
- Miller-Madow correction
- Ĥ = H_MLE + (m - 1)/2n
- m is an estimate of the number of non-empty bins
- n = number of samples
- Bad news: ALL estimators for H are biased
- Good news: we can quantify the bias and variance of the MLE
- Bias ≤ log(1 + m/N)
- Var(H_MLE) ≤ (log n)²/N
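A sketch of the plug-in (MLE) estimator and the Miller-Madow correction; note the standard (m-1)/2n correction is stated in nats, so we divide by ln 2 to stay in bits:
```python
import math
from collections import Counter

def entropy_mle(samples):
    """Plug-in (maximum likelihood) entropy estimate, in bits."""
    counts, n = Counter(samples), len(samples)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

def entropy_miller_madow(samples):
    """MLE estimate plus the Miller-Madow bias correction (m - 1)/(2n)."""
    counts, n = Counter(samples), len(samples)
    m = len(counts)                          # observed non-empty bins
    return entropy_mle(samples) + (m - 1) / (2 * n * math.log(2))

sample = [1, 2, 1, 2, 1, 5, 4]               # a small sample; the MLE is biased low
print(entropy_mle(sample), entropy_miller_madow(sample))
```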
131Entropy Case II - Large domain size
- |X| is too large to fit in main memory, so we can't maintain explicit counts
- Streaming algorithms for H(X)
- Long history of work on this problem
- Bottom line
- A (1+ε)-relative approximation for H(X) that allows updates to frequencies, and requires almost constant, and optimal, space HNO08
132Streaming Entropy CCM07
- High-level idea: sample randomly from the stream, and track counts of the elements picked (as in AMS)
- PROBLEM: a skewed distribution prevents us from sampling lower-frequency elements (and entropy is small)
- Idea: estimate the largest frequency, and
- the distribution of what's left (higher entropy)
133Streaming Entropy CCM07
- Maintain set of samples from original
distribution and distribution without most
frequent element. - In parallel, maintain estimator for frequency of
most frequent element - normally this is hard
- but if frequency is very large, then simple
estimator exists MG81 (Google interview
puzzle!)
- At the end, compute a function of these two estimates
- Memory usage: roughly (1/ε²) log(1/ε) (ε is the error)
134Entropy and MI are related
- I(X;Y) = H(X) + H(Y) - H(X,Y)
- Suppose we can c-approximate H(X) for any c > 0
- Find H̃(X) s.t. |H̃(X) - H(X)| ≤ c
- Then we can 3c-approximate I(X;Y)
- Ĩ(X;Y) = H̃(X) + H̃(Y) - H̃(X,Y)
- ≤ (H(X) + c) + (H(Y) + c) - (H(X,Y) - c)
- ≤ H(X) + H(Y) - H(X,Y) + 3c
- ≤ I(X;Y) + 3c
- Similarly, we can 2c-approximate H(Y|X) = H(X,Y) - H(X)
- Estimating entropy allows us to estimate I(X;Y) and H(Y|X)
135Computing KL-divergence Small Domains
- Easy algorithm: maintain counts for each of p and q, normalize, and compute the KL-divergence
- PROBLEM! Suppose qi = 0
- pi log pi/qi is undefined!
- General problem with ML estimators: all events not seen have probability zero!!
- Laplace correction: add one to the count of every domain element
- Slightly better: add 0.5 to every count KT81
- Even better, more involved: use the Good-Turing estimator GT53
- Yields non-zero probability for things not seen
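A sketch of estimating KL-divergence from samples with simple pseudocount smoothing over a known domain, so no event gets probability zero (1.0 gives the Laplace correction, 0.5 the KT81 variant):
```python
import math
from collections import Counter

def kl_estimate(sample_p, sample_q, domain, pseudocount=1.0):
    """Estimate d_KL(p || q) in bits from two samples, adding `pseudocount`
    to the count of every domain element before normalizing."""
    cp, cq = Counter(sample_p), Counter(sample_q)
    total_p = len(sample_p) + pseudocount * len(domain)
    total_q = len(sample_q) + pseudocount * len(domain)
    kl = 0.0
    for v in domain:
        p = (cp[v] + pseudocount) / total_p
        q = (cq[v] + pseudocount) / total_q
        kl += p * math.log2(p / q)
    return kl

domain = ['a', 'b', 'c', 'd']
print(kl_estimate(['a', 'a', 'b'], ['c', 'c', 'd', 'd'], domain))
```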
136Computing KL-divergence Large Domains
- Bad news: no good relative approximations exist in small space
- (Partial) good news: additive approximations in small space under certain technical conditions (no pi is too small)
- (Partial) good news: additive approximations for a symmetric variant of KL-divergence, via sampling
- For details, see GMV08, GIM08
137Information-theoretic Clustering
- Given a collection of random variables X, each
explained by a random variable Y, we wish to
find a (hard or soft) clustering T such that
- I(T;X) - βI(T;Y) is minimized
- Features of solutions thus far
- heuristic (general problem is NP-hard)
- address both small-domain and large-domain
scenarios.
138Agglomerative Clustering (aIB) ST00
- Fix the number of clusters k
- While the number of clusters > k
- Determine the two clusters whose merge loses the least information
- Combine these two clusters
- Output clustering
- Merge criterion
- Merge the two clusters so that the change in I(T;V) is minimized
- Note: no consideration of β (the number of clusters is fixed)
139Agglomerative Clustering (aIB) S
- Elegant way of finding the two clusters to be merged
- Let d_JS(p, q) = (1/2)(d_KL(p, m) + d_KL(q, m)), where m = (p + q)/2
- d_JS(p, q) is a symmetric distance between p and q (the Jensen-Shannon distance)
- We merge the two clusters with the smallest d_JS(p, q), weighted by cluster mass
(Figure: distributions p and q and their average m.)
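A sketch of the merge step: the information lost by merging clusters i and j is their mass-weighted Jensen-Shannon divergence, and aIB merges the pair for which this is smallest (the data layout here is our own):
```python
import math

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def merge_cost(wi, pi, wj, pj):
    """Information lost by merging clusters (mass wi, center pi) and (wj, pj):
    the mass-weighted Jensen-Shannon divergence of their centers."""
    m = [(wi * a + wj * b) / (wi + wj) for a, b in zip(pi, pj)]
    return wi * kl(pi, m) + wj * kl(pj, m)

def best_merge(clusters):
    """clusters: list of (mass, p(V|t)); return (cost, i, j) of the cheapest merge."""
    pairs = [(merge_cost(*clusters[i], *clusters[j]), i, j)
             for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
    return min(pairs)

clusters = [(0.4, [0.8, 0.2]), (0.4, [0.7, 0.3]), (0.2, [0.1, 0.9])]
print(best_merge(clusters))   # the two similar clusters get merged first
```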
140Iterative Information Bottleneck (iIB) S
- aIB yields a hard clustering with k clusters
- If you want a soft clustering, use iIB (a variant of EM)
- Step 1: p(t|x) ∝ exp(-β d_KL(p(V|x), p(V|t)))
- Assign elements to clusters in proportion (exponentially) to their distance from the cluster center
- Step 2: compute new cluster centers as weighted centroids
- p(t) = Σx p(t|x) p(x)
- p(V|t) = Σx p(V|x) p(t|x) p(x) / p(t)
- Choose β according to DKOSV06
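A sketch of one iIB iteration under these updates; in the standard formulation the Step 1 assignment is also weighted by the current cluster mass p(t), which we include here, and we assume cluster centers with full support so the KL terms stay finite:
```python
import math

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def iib_iteration(p_x, p_v_given_x, p_t_given_x, beta):
    """One pass of iIB: recompute cluster masses/centers, then re-assign p(t|x)."""
    n, k, m = len(p_x), len(p_t_given_x[0]), len(p_v_given_x[0])
    # Step 2 quantities from the current assignment: p(t) and p(V|t)
    p_t = [sum(p_x[i] * p_t_given_x[i][t] for i in range(n)) for t in range(k)]
    p_v_given_t = [[sum(p_x[i] * p_t_given_x[i][t] * p_v_given_x[i][v] for i in range(n)) / p_t[t]
                    for v in range(m)] for t in range(k)]
    # Step 1: p(t|x) proportional to p(t) * exp(-beta * d_KL(p(V|x) || p(V|t)))
    new_assignment = []
    for i in range(n):
        w = [p_t[t] * math.exp(-beta * kl(p_v_given_x[i], p_v_given_t[t])) for t in range(k)]
        z = sum(w)
        new_assignment.append([wi / z for wi in w])
    return new_assignment

# toy step: two points, two candidate clusters, |V| = 2
print(iib_iteration([0.5, 0.5], [[0.9, 0.1], [0.2, 0.8]], [[0.6, 0.4], [0.4, 0.6]], beta=5.0))
```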
141Dealing with massive data sets
- Clustering on massive data sets is a problem
- Two main heuristics
- Sampling DKOSV06
- pick a small sample of the data, cluster it, and
(if necessary) assign remaining points to
clusters using soft assignment. - How many points to sample to get good bounds ?
- Streaming
- Scan the data in one pass, performing clustering
on the fly - How much memory needed to get reasonable quality
solution ?
142LIMBO (for aIB) ATMS04
- BIRCH-like idea
- Maintain a (sparse) summary for each cluster: (p(t), p(V|t))
- As data streams in, build clusters on groups of
objects - Build next-level clusters on cluster summaries
from lower level
143Outline
- Part 1
- Introduction to Information Theory
- Application Data Anonymization
- Application Data Integration
- Part 2
- Review of Information Theory Basics
- Application Database Design
- Computing Information Theoretic Primitives
- Open Problems
144Open Problems
- Data exploration and mining: information theory as a first-pass filter
- Relation to nonparametric generative models in machine learning (LDA, PPCA, ...)
- Engineering and stability: finding the right knobs to make systems reliable and scalable
- Other information-theoretic concepts? (rate distortion, higher-order entropy, ...)
THANK YOU !
145References Information Theory
- CT Thomas M. Cover, Joy A. Thomas. Elements of Information Theory.
- BMDG05 Arindam Banerjee, Srujana Merugu, Inderjit Dhillon, Joydeep Ghosh. Clustering with Bregman Divergences. JMLR 2005.
- TPB98 Naftali Tishby, Fernando Pereira, William
Bialek. The Information Bottleneck Method. Proc.
37th Annual Allerton Conference, 1998
146References Data Anonymization
- AA01 Dakshi Agrawal, Charu C. Aggarwal On the
design and quantification of privacy preserving
data mining algorithms. PODS 2001. - AS00 Rakesh Agrawal, Ramakrishnan Srikant
Privacy preserving data mining. SIGMOD 2000. - EGS03 Alexandre Evfimievski, Johannes Gehrke,
Ramakrishnan Srikant Limiting privacy breaches
in privacy preserving data mining. PODS 2003.
147References Data Integration
- AMT04 Periklis Andritsos, Renee J. Miller,
Panayiotis Tsaparas Information-theoretic tools
for mining database structure from large data
sets. SIGMOD 2004. - DKOSV06 Bing Tian Dai, Nick Koudas, Beng Chin
Ooi, Divesh Srivastava, Suresh Venkatasubramanian
Rapid identification of column heterogeneity.
ICDM 2006. - DKSTV08 Bing Tian Dai, Nick Koudas, Divesh
Srivastava, Anthony K. H. Tung, Suresh
Venkatasubramanian Validating multi-column
schema matchings by type. ICDE 2008. - KN03 Jaewoo Kang, Jeffrey F. Naughton On
schema matching with opaque column names and data
values. SIGMOD 2003. - PPH05 Patrick Pantel, Andrew Philpot, Eduard
Hovy An information theoretic model for database
alignment. SSDBM 2005.
148References Database Design
- AL03 Marcelo Arenas, Leonid Libkin An
information theoretic approach to normal forms
for relational and XML data. PODS 2003. - AL05 Marcelo Arenas, Leonid Libkin An
information theoretic approach to normal forms
for relational and XML data. JACM 52(2), 246-283,
2005. - DR00 Mehmet M. Dalkilic, Edward L. Robertson
Information dependencies. PODS 2000. - KL06 Solmaz Kolahi, Leonid Libkin On
redundancy vs dependency preservation in
normalization an information-theoretic study of
XML. PODS 2006.
149References Computing IT quantities
- P03 Liam Paninski. Estimation of entropy and mutual information. Neural Computation 15: 1191-1254, 2003.
- GT53 I. J. Good. Turing's anticipation of
Empirical Bayes in connection with the
cryptanalysis of the Naval Enigma. Journal of
Statistical Computation and Simulation, 66(2),
2000. - KT81 R. E. Krichevsky and V. K. Trofimov. The
performance of universal encoding. IEEE Trans.
Inform. Th. 27 (1981), 199--207. - CCM07 Amit Chakrabarti, Graham Cormode and
Andrew McGregor. A near-optimal algorithm for
computing the entropy of a stream. Proc. SODA
- HNO08 Nick Harvey, Jelani Nelson, Krzysztof Onak.
Sketching and Streaming Entropy via Approximation
Theory. FOCS 2008 - ATMS04 Periklis Andritsos, Panayiotis Tsaparas,
Renée J. Miller and Kenneth C. Sevcik. LIMBO
Scalable Clustering of Categorical Data. EDBT 2004
150References Computing IT quantities
- S Noam Slonim. The Information Bottleneck: Theory and Applications. Ph.D. thesis, Hebrew University, 2000.
- GMV08 Sudipto Guha, Andrew McGregor, Suresh
Venkatasubramanian. Streaming and sublinear
approximations for information distances. ACM
Trans Alg. 2008 - GIM08 Sudipto Guha, Piotr Indyk, Andrew
McGregor. Sketching Information Distances. JMLR,
2008.