Title: Information Theory For Data Management
1. Information Theory For Data Management
Divesh Srivastava Suresh Venkatasubramanian
2. Motivation
-- Abstruse Goose (177)
Information Theory is relevant to all of
humanity...
3. Background
- Many problems in data management need precise reasoning about information content, transfer, and loss
  - Structure extraction
  - Privacy preservation
  - Schema design
  - Probabilistic data?
4. Information Theory
- First developed by Shannon as a way of quantifying the capacity of signal channels
- Entropy, relative entropy, and mutual information capture intrinsic informational aspects of a signal
- Today
  - Information theory provides a domain-independent way to reason about structure in data
  - More information ⇒ interesting structure
  - Less information linkage ⇒ decoupling of structures
5. Tutorial Thesis
- Information theory provides a mathematical framework for the quantification of information content, linkage, and loss.
- This framework can be used in the design of data management strategies that rely on probing the structure of information in data.
6. Tutorial Goals
- Introduce information-theoretic concepts to a DB audience
- Give a data-centric perspective on information theory
- Connect these to applications in data management
- Describe the underlying computational primitives
- Illuminate when and how information theory might be of use in new areas of data management
7. Outline
- Part 1
  - Introduction to Information Theory
  - Application: Data Anonymization
  - Application: Database Design
- Part 2
  - Review of Information Theory Basics
  - Application: Data Integration
  - Computing Information-Theoretic Primitives
  - Open Problems
9. Histograms And Discrete Distributions
X    f(x)  w(X)
x1   45    20
x2   23     6
x3   12     2
x4   12     2
(normalize the frequencies f(x), or reweight by w(X), to obtain a discrete distribution)
10. From Columns To Random Variables
- We can think of a column of data as represented by a random variable
  - X is a random variable
  - p(X) is the column of probabilities p(X = x1), p(X = x2), and so on
  - Also known (in the unweighted case) as the empirical distribution induced by the column X
- Notation
  - X (upper case) denotes a random variable (column)
  - x (lower case) denotes a value taken by X (a field in a tuple)
  - p(x) is the probability p(X = x)
11. Joint Distributions
- Discrete distribution: probability p(X,Y,Z)
- p(Y) = Σx p(X=x, Y) = Σx Σz p(X=x, Y, Z=z)
X Y Z p(X,Y,Z)
x1 y1 z1 0.125
x1 y2 z2 0.125
x1 y1 z2 0.125
x1 y2 z1 0.125
x2 y3 z3 0.125
x2 y3 z4 0.125
x3 y3 z5 0.125
x4 y3 z6 0.125
X p(X)
x1 0.5
x2 0.25
x3 0.125
x4 0.125
X Y p(X,Y)
x1 y1 0.25
x1 y2 0.25
x2 y3 0.25
x3 y3 0.125
x4 y3 0.125
Y p(Y)
y1 0.25
y2 0.25
y3 0.5
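The marginalization rule above is easy to check mechanically. A minimal Python sketch, using the joint distribution from this slide (the helper name `marginal` is our own, illustrative choice):

```python
from collections import defaultdict

# Joint distribution p(X, Y, Z) from the slide: 8 equally likely tuples.
joint = {
    ("x1", "y1", "z1"): 0.125, ("x1", "y2", "z2"): 0.125,
    ("x1", "y1", "z2"): 0.125, ("x1", "y2", "z1"): 0.125,
    ("x2", "y3", "z3"): 0.125, ("x2", "y3", "z4"): 0.125,
    ("x3", "y3", "z5"): 0.125, ("x4", "y3", "z6"): 0.125,
}

def marginal(p, keep):
    """Sum out every coordinate except the indices listed in `keep`."""
    out = defaultdict(float)
    for point, mass in p.items():
        out[tuple(point[i] for i in keep)] += mass
    return dict(out)

print(marginal(joint, (0,)))  # p(X): x1 -> 0.5, x2 -> 0.25, x3/x4 -> 0.125
print(marginal(joint, (1,)))  # p(Y): y1 -> 0.25, y2 -> 0.25, y3 -> 0.5
```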
12. Entropy Of A Column
X p(X) h(X)
x1 0.5 1
x2 0.25 2
x3 0.125 3
x4 0.125 3
- Let h(x) = log2(1/p(x))
- h(X) is the column of h(x) values
- H(X) = E_X[h(x)] = Σx p(x) log2(1/p(x))
- Two views of entropy
  - It captures uncertainty in the data: higher entropy, more unpredictability
  - It captures information content: higher entropy, more information
- H(X) = 1.75 < log2 |X| = 2
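A minimal Python sketch of the entropy computation above, on the same column distribution (stdlib only):

```python
import math

def entropy(p):
    """Shannon entropy (bits) of a discrete distribution given as a dict."""
    return sum(q * math.log2(1.0 / q) for q in p.values() if q > 0)

p_X = {"x1": 0.5, "x2": 0.25, "x3": 0.125, "x4": 0.125}
print(entropy(p_X))          # 1.75
print(math.log2(len(p_X)))   # 2.0, the maximum for a 4-value domain
```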
13. Examples
- X uniform over {1, ..., 4}: H(X) = 2
- Y is 1 with probability 0.5, and uniform over {2, 3, 4} otherwise
  - H(Y) = 0.5 log 2 + 0.5 log 6 ≈ 1.8 < 2
  - Y is more sharply defined, and so has less uncertainty
- Z uniform over {1, ..., 8}: H(Z) = 3 > 2
  - Z spans a larger range, and captures more information
(Figure: histograms of X, Y, and Z)
14. Comparing Distributions
- How do we measure the difference between two distributions?
- Kullback-Leibler divergence
  - d_KL(p, q) = E_p[h_q(x) - h_p(x)] = Σi p_i log(p_i/q_i)
(Figure: prior belief + inference mechanism → resulting belief)
15. Comparing Distributions
- Kullback-Leibler divergence
  - d_KL(p, q) = E_p[h_q(x) - h_p(x)] = Σi p_i log(p_i/q_i)
- d_KL(p, q) ≥ 0
- Captures the extra information needed to describe p given q
- Is asymmetric: d_KL(p, q) ≠ d_KL(q, p)
- Is not a metric (does not satisfy the triangle inequality)
- There are other measures: χ²-distance, variational distance, f-divergences, ...
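A small Python sketch of d_KL that illustrates the asymmetry; the two-point distributions p and q here are hypothetical values of our choosing:

```python
import math

def kl_divergence(p, q):
    """d_KL(p || q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.75, "b": 0.25}
print(kl_divergence(p, q))  # ~0.208
print(kl_divergence(q, p))  # ~0.189: different, so KL is not symmetric
```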
16. Conditional Probability
- Given a joint distribution on random variables X, Y, how much information about X can we glean from Y?
- Conditional probability p(X|Y)
  - p(X = x1 | Y = y1) = p(X = x1, Y = y1) / p(Y = y1)
X Y p(X,Y) p(X|Y) p(Y|X)
x1 y1 0.25 1.0 0.5
x1 y2 0.25 1.0 0.5
x2 y3 0.25 0.5 1.0
x3 y3 0.125 0.25 1.0
x4 y3 0.125 0.25 1.0
X p(X)
x1 0.5
x2 0.25
x3 0.125
x4 0.125
Y p(Y)
y1 0.25
y2 0.25
y3 0.5
17. Conditional Entropy
- Let h(x|y) = log2(1/p(x|y))
- H(X|Y) = E_{x,y}[h(x|y)] = Σx Σy p(x,y) log2(1/p(x|y))
- H(X|Y) = H(X,Y) - H(Y)
  - H(X|Y) = H(X,Y) - H(Y) = 2.25 - 1.5 = 0.75
- If X, Y are independent, H(X|Y) = H(X)
X Y p(X,Y) p(X|Y) h(X|Y)
x1 y1 0.25 1.0 0.0
x1 y2 0.25 1.0 0.0
x2 y3 0.25 0.5 1.0
x3 y3 0.125 0.25 2.0
x4 y3 0.125 0.25 2.0
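A short Python check of the chain rule H(X|Y) = H(X,Y) - H(Y) on the joint distribution from this slide:

```python
import math

joint = {("x1", "y1"): 0.25, ("x1", "y2"): 0.25, ("x2", "y3"): 0.25,
         ("x3", "y3"): 0.125, ("x4", "y3"): 0.125}

def H(p):
    """Entropy in bits of a distribution given as a dict of probabilities."""
    return sum(q * math.log2(1 / q) for q in p.values() if q > 0)

p_Y = {}
for (x, y), p in joint.items():
    p_Y[y] = p_Y.get(y, 0.0) + p

print(H(joint) - H(p_Y))   # H(X|Y) = 2.25 - 1.5 = 0.75
```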
18. Mutual Information
- Mutual information captures the difference between the joint distribution on X and Y and the product of the marginal distributions on X and Y
- Let i(x;y) = log [p(x,y) / (p(x)p(y))]
- I(X;Y) = E_{x,y}[i(x;y)] = Σx Σy p(x,y) log [p(x,y) / (p(x)p(y))]
X Y p(X,Y) h(X,Y) i(x;y)
x1 y1 0.25 2.0 1.0
x1 y2 0.25 2.0 1.0
x2 y3 0.25 2.0 1.0
x3 y3 0.125 3.0 1.0
x4 y3 0.125 3.0 1.0
X p(X) h(X)
x1 0.5 1.0
x2 0.25 2.0
x3 0.125 3.0
x4 0.125 3.0
Y p(Y) h(Y)
y1 0.25 2.0
y2 0.25 2.0
y3 0.5 1.0
19. Mutual Information: Strength of Linkage
- I(X;Y) = H(X) + H(Y) - H(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
- If X, Y are independent, then I(X;Y) = 0
  - H(X,Y) = H(X) + H(Y), so I(X;Y) = H(X) + H(Y) - H(X,Y) = 0
- I(X;Y) ≤ min(H(X), H(Y))
- Suppose Y = f(X) (deterministically)
  - Then H(Y|X) = 0, and so I(X;Y) = H(Y) - H(Y|X) = H(Y)
- Mutual information captures higher-order interactions
  - Covariance captures linear interactions only
  - Two variables can be uncorrelated (covariance = 0) and still have nonzero mutual information
  - X uniform in [-1, 1], Y = X²: Cov(X, Y) = 0, yet I(X;Y) > 0
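The uncorrelated-but-dependent phenomenon can be reproduced exactly with a small discrete stand-in for the slide's example; the choice of X uniform on {-1, 0, 1} is our assumption, made so both quantities are finite and exact:

```python
import math
from collections import Counter

# Discrete stand-in: X uniform on {-1, 0, 1}, Y = X * X.
joint = Counter()
for x in (-1, 0, 1):
    joint[(x, x * x)] += 1 / 3

p_x, p_y = Counter(), Counter()
for (x, y), p in joint.items():
    p_x[x] += p
    p_y[y] += p

cov = sum(p * x * y for (x, y), p in joint.items()) \
      - sum(p * x for x, p in p_x.items()) * sum(p * y for y, p in p_y.items())
mi = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in joint.items())
print(cov, mi)   # cov = 0.0, mi ~ 0.918 bits: dependence without correlation
```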
20. Information Theory Summary
- We can represent data as discrete distributions (normalized histograms)
- Entropy captures the uncertainty or information content in a distribution
- The Kullback-Leibler divergence captures the difference between distributions
- Mutual information and conditional entropy capture linkage between variables in a joint distribution
21. Outline
- Part 1
  - Introduction to Information Theory
  - Application: Data Anonymization
  - Application: Database Design
- Part 2
  - Review of Information Theory Basics
  - Application: Data Integration
  - Computing Information-Theoretic Primitives
  - Open Problems
22. Data Anonymization Using Randomization
- Goal: publish anonymized microdata to enable accurate ad hoc analyses, but ensure privacy of individuals' sensitive attributes
- Key ideas
  - Randomize numerical data: add noise from a known distribution
  - Reconstruct the original data distribution using the published noisy data
- Issues
  - How can the original data distribution be reconstructed?
  - What kinds of randomization preserve privacy of individuals?
23. Data Anonymization Using Randomization
- Many randomization strategies proposed [AS00, AA01, EGS03]
- Example randomization strategies: X in [0, 10]
  - R = X + µ (mod 11), µ uniform in {-1, 0, +1}
  - R = X + µ (mod 11), µ = -1 (p = 0.25), 0 (p = 0.5), +1 (p = 0.25)
  - R = X (p = 0.6); R = µ, µ uniform in [0, 10] (p = 0.4)
- Questions
  - Which randomization strategy has higher privacy preservation?
  - Quantify the loss of privacy due to publication of randomized data
24. Data Anonymization Using Randomization
- X in [0, 10], R1 = X + µ (mod 11), µ uniform in {-1, 0, +1}
Id X
s1 0
s2 3
s3 5
s4 0
s5 8
s6 0
s7 6
s8 0
25. Data Anonymization Using Randomization
- X in [0, 10], R1 = X + µ (mod 11), µ uniform in {-1, 0, +1}
Id X µ
s1 0 -1
s2 3 0
s3 5 1
s4 0 0
s5 8 1
s6 0 -1
s7 6 1
s8 0 0
Id R1
s1 10
s2 3
s3 6
s4 0
s5 9
s6 10
s7 7
s8 0
26. Data Anonymization Using Randomization
- X in [0, 10], R1 = X + µ (mod 11), µ uniform in {-1, 0, +1}
Id X µ
s1 0 0
s2 3 -1
s3 5 0
s4 0 1
s5 8 1
s6 0 -1
s7 6 -1
s8 0 1
Id R1
s1 0
s2 2
s3 5
s4 1
s5 9
s6 10
s7 5
s8 1
27. Reconstruction of Original Data Distribution
- X in [0, 10], R1 = X + µ (mod 11), µ uniform in {-1, 0, +1}
- Reconstruct the distribution of X using knowledge of R1 and µ
- The EM algorithm converges to the MLE of the original distribution [AA01]
Id X µ
s1 0 0
s2 3 -1
s3 5 0
s4 0 1
s5 8 1
s6 0 -1
s7 6 -1
s8 0 1
Id R1
s1 0
s2 2
s3 5
s4 1
s5 9
s6 10
s7 5
s8 1
Id X-candidates given R1
s1 10, 0, 1
s2 1, 2, 3
s3 4, 5, 6
s4 0, 1, 2
s5 8, 9, 10
s6 9, 10, 0
s7 4, 5, 6
s8 0, 1, 2
28. Analysis of Privacy [AS00]
- X in [0, 10], R1 = X + µ (mod 11), µ uniform in {-1, 0, +1}
- If X is uniform in [0, 10], privacy is determined by the range of µ
Id X µ
s1 0 0
s2 3 -1
s3 5 0
s4 0 1
s5 8 1
s6 0 -1
s7 6 -1
s8 0 1
Id R1
s1 0
s2 2
s3 5
s4 1
s5 9
s6 10
s7 5
s8 1
Id X-candidates given R1
s1 10, 0, 1
s2 1, 2, 3
s3 4, 5, 6
s4 0, 1, 2
s5 8, 9, 10
s6 9, 10, 0
s7 4, 5, 6
s8 0, 1, 2
29. Analysis of Privacy [AA01]
- X in [0, 10], R1 = X + µ (mod 11), µ uniform in {-1, 0, +1}
- If X is uniform in {0, 1} ∪ {5, 6}, privacy is smaller than the range of µ suggests
Id X µ
s1 0 0
s2 1 -1
s3 5 0
s4 6 1
s5 0 1
s6 1 -1
s7 5 -1
s8 6 1
Id R1
s1 0
s2 0
s3 5
s4 7
s5 1
s6 0
s7 4
s8 7
Id X-candidates given R1
s1 10, 0, 1
s2 10, 0, 1
s3 4, 5, 6
s4 6, 7, 8
s5 0, 1, 2
s6 10, 0, 1
s7 3, 4, 5
s8 6, 7, 8
30. Analysis of Privacy [AA01]
- X in [0, 10], R1 = X + µ (mod 11), µ uniform in {-1, 0, +1}
- If X is uniform in {0, 1} ∪ {5, 6}, privacy is smaller than the range of µ suggests
- In some cases, the sensitive value is revealed
Id X µ
s1 0 0
s2 1 -1
s3 5 0
s4 6 1
s5 0 1
s6 1 -1
s7 5 -1
s8 6 1
Id R1
s1 0
s2 0
s3 5
s4 7
s5 1
s6 0
s7 4
s8 7
Id X-candidates given R1
s1 0, 1
s2 0, 1
s3 5, 6
s4 6
s5 0, 1
s6 0, 1
s7 5
s8 6
31. Quantify Loss of Privacy [AA01]
- Goal: quantify loss of privacy based on mutual information I(X;R)
  - Smaller H(X|R) ⇒ more loss of privacy in X from knowledge of R
  - Larger I(X;R) ⇒ more loss of privacy in X from knowledge of R
  - I(X;R) = H(X) - H(X|R)
- I(X;R) is used to capture the correlation between X and R
  - p(X) is the prior knowledge of the sensitive attribute X
  - p(X, R) is the joint distribution of X and R
32. Quantify Loss of Privacy [AA01]
- Goal: quantify loss of privacy based on mutual information I(X;R)
- X uniform in {5, 6}, R1 = X + µ (mod 11), µ uniform in {-1, 0, +1}
X R1 p(X,R1) h(X,R1) i(X;R1)
5 4
5 5
5 6
6 5
6 6
6 7
X p(X) h(X)
5
6
R1 p(R1) h(R1)
4
5
6
7
33. Quantify Loss of Privacy [AA01]
- Goal: quantify loss of privacy based on mutual information I(X;R)
- X uniform in {5, 6}, R1 = X + µ (mod 11), µ uniform in {-1, 0, +1}
X R1 p(X,R1) h(X,R1) i(X;R1)
5 4 0.17
5 5 0.17
5 6 0.17
6 5 0.17
6 6 0.17
6 7 0.17
X p(X) h(X)
5 0.5
6 0.5
R1 p(R1) h(R1)
4 0.17
5 0.34
6 0.34
7 0.17
34. Quantify Loss of Privacy [AA01]
- Goal: quantify loss of privacy based on mutual information I(X;R)
- X uniform in {5, 6}, R1 = X + µ (mod 11), µ uniform in {-1, 0, +1}
X R1 p(X,R1) h(X,R1) i(X;R1)
5 4 0.17 2.58
5 5 0.17 2.58
5 6 0.17 2.58
6 5 0.17 2.58
6 6 0.17 2.58
6 7 0.17 2.58
X p(X) h(X)
5 0.5 1.0
6 0.5 1.0
R1 p(R1) h(R1)
4 0.17 2.58
5 0.34 1.58
6 0.34 1.58
7 0.17 2.58
35. Quantify Loss of Privacy [AA01]
- Goal: quantify loss of privacy based on mutual information I(X;R)
- X uniform in {5, 6}, R1 = X + µ (mod 11), µ uniform in {-1, 0, +1}
- I(X;R1) = 0.33
X R1 p(X,R1) h(X,R1) i(X;R1)
5 4 0.17 2.58 1.0
5 5 0.17 2.58 0.0
5 6 0.17 2.58 0.0
6 5 0.17 2.58 0.0
6 6 0.17 2.58 0.0
6 7 0.17 2.58 1.0
X p(X) h(X)
5 0.5 1.0
6 0.5 1.0
R1 p(R1) h(R1)
4 0.17 2.58
5 0.34 1.58
6 0.34 1.58
7 0.17 2.58
36. Quantify Loss of Privacy [AA01]
- Goal: quantify loss of privacy based on mutual information I(X;R)
- X uniform in {5, 6}, R2 = X + µ (mod 11), µ uniform in {0, 1}
- I(X;R1) = 0.33, I(X;R2) = 0.5 ⇒ R2 is a bigger privacy risk than R1
X R2 p(X,R2) h(X,R2) i(X;R2)
5 5 0.25 2.0 1.0
5 6 0.25 2.0 0.0
6 6 0.25 2.0 0.0
6 7 0.25 2.0 1.0
X p(X) h(X)
5 0.5 1.0
6 0.5 1.0
R2 p(R2) h(R2)
5 0.25 2.0
6 0.5 1.0
7 0.25 2.0
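A minimal sketch of this comparison, computing I(X;R1) and I(X;R2) from first principles; the joint-construction helper `randomized_joint` is our own name, and exact fractions avoid rounding:

```python
import math
from collections import Counter
from fractions import Fraction

def mutual_information(joint):
    """I(X;R) in bits from a joint distribution {(x, r): prob}."""
    px, pr = Counter(), Counter()
    for (x, r), p in joint.items():
        px[x] += p
        pr[r] += p
    return sum(p * math.log2(p / (px[x] * pr[r]))
               for (x, r), p in joint.items() if p)

def randomized_joint(noise):
    """X uniform on {5, 6}; R = X + mu (mod 11) with noise {mu: prob}."""
    joint = Counter()
    for x in (5, 6):
        for mu, q in noise.items():
            joint[(x, (x + mu) % 11)] += Fraction(1, 2) * q
    return joint

R1 = randomized_joint({-1: Fraction(1, 3), 0: Fraction(1, 3), 1: Fraction(1, 3)})
R2 = randomized_joint({0: Fraction(1, 2), 1: Fraction(1, 2)})
print(mutual_information(R1), mutual_information(R2))  # ~0.333 vs 0.5
```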
37. Quantify Loss of Privacy [AA01]
- Equivalent goal: quantify loss of privacy based on H(X|R)
- X uniform in {5, 6}, R2 = X + µ (mod 11), µ uniform in {0, 1}
- Intuition: we know more about X given R2 than about X given R1
- H(X|R1) = 0.67, H(X|R2) = 0.5 ⇒ R2 is a bigger privacy risk than R1
X R2 p(X,R2) p(X|R2) h(X|R2)
5 5 0.25 1.0 0.0
5 6 0.25 0.5 1.0
6 6 0.25 0.5 1.0
6 7 0.25 1.0 0.0
X R1 p(X,R1) p(X|R1) h(X|R1)
5 4 0.17 1.0 0.0
5 5 0.17 0.5 1.0
5 6 0.17 0.5 1.0
6 5 0.17 0.5 1.0
6 6 0.17 0.5 1.0
6 7 0.17 1.0 0.0
38. Quantify Loss of Privacy
- Example: X uniform in {0, 1}
  - R3 = e (p = 0.9999), R3 = X (p = 0.0001)
  - R4 = X (p = 0.6), R4 = 1 - X (p = 0.4)
- Is R3 or R4 the bigger privacy risk?
39. Worst Case Loss of Privacy [EGS03]
- Example: X uniform in {0, 1}
  - R3 = e (p = 0.9999), R3 = X (p = 0.0001)
  - R4 = X (p = 0.6), R4 = 1 - X (p = 0.4)
  - I(X;R3) = 0.0001 << I(X;R4) = 0.028
X R3 p(X,R3) h(X,R3) i(X;R3)
0 e 0.49995 1.0 0.0
0 0 0.00005 14.29 1.0
1 e 0.49995 1.0 0.0
1 1 0.00005 14.29 1.0
X R4 p(X,R4) h(X,R4) i(X;R4)
0 0 0.3 1.74 0.26
0 1 0.2 2.32 -0.32
1 0 0.2 2.32 -0.32
1 1 0.3 1.74 0.26
40. Worst Case Loss of Privacy [EGS03]
- Example: X uniform in {0, 1}
  - R3 = e (p = 0.9999), R3 = X (p = 0.0001)
  - R4 = X (p = 0.6), R4 = 1 - X (p = 0.4)
  - I(X;R3) = 0.0001 << I(X;R4) = 0.028
- But R3 has a larger worst-case risk
X R3 p(X,R3) h(X,R3) i(X;R3)
0 e 0.49995 1.0 0.0
0 0 0.00005 14.29 1.0
1 e 0.49995 1.0 0.0
1 1 0.00005 14.29 1.0
X R4 p(X,R4) h(X,R4) i(X;R4)
0 0 0.3 1.74 0.26
0 1 0.2 2.32 -0.32
1 0 0.2 2.32 -0.32
1 1 0.3 1.74 0.26
41. Worst Case Loss of Privacy [EGS03]
- Goal: quantify worst-case loss of privacy in X from knowledge of R
- Use maximum KL divergence instead of mutual information
- Mutual information can be formulated as an expected KL divergence
  - I(X;R) = Σx Σr p(x,r) log2(p(x,r)/(p(x)p(r))) = KL(p(X,R) || p(X)p(R))
  - I(X;R) = Σr p(r) Σx p(x|r) log2(p(x|r)/p(x)) = E_R[KL(p(X|r) || p(X))]
  - The [AA01] measure quantifies the expected loss of privacy over R
- [EGS03] propose a measure based on worst-case loss of privacy
  - I_W(X;R) = max_r KL(p(X|r) || p(X))
42. Worst Case Loss of Privacy [EGS03]
- Example: X uniform in {0, 1}
  - R3 = e (p = 0.9999), R3 = X (p = 0.0001)
  - R4 = X (p = 0.6), R4 = 1 - X (p = 0.4)
  - I_W(X;R3) = max{0.0, 1.0, 1.0} > I_W(X;R4) = max{0.028, 0.028}
X R3 p(X,R3) p(X|R3) i(X;R3)
0 e 0.49995 0.5 0.0
0 0 0.00005 1.0 1.0
1 e 0.49995 0.5 0.0
1 1 0.00005 1.0 1.0
X R4 p(X,R4) p(X|R4) i(X;R4)
0 0 0.3 0.6 0.26
0 1 0.2 0.4 -0.32
1 0 0.2 0.4 -0.32
1 1 0.3 0.6 0.26
43. Worst Case Loss of Privacy [EGS03]
- Example: X uniform in {5, 6}
  - R1 = X + µ (mod 11), µ uniform in {-1, 0, +1}
  - R2 = X + µ (mod 11), µ uniform in {0, 1}
  - I_W(X;R1) = max{1.0, 0.0, 0.0, 1.0} = I_W(X;R2) = max{1.0, 0.0, 1.0} = 1.0
  - Unable to capture that R2 is a bigger privacy risk than R1
X R1 p(X,R1) p(X|R1) i(X;R1)
5 4 0.17 1.0 1.0
5 5 0.17 0.5 0.0
5 6 0.17 0.5 0.0
6 5 0.17 0.5 0.0
6 6 0.17 0.5 0.0
6 7 0.17 1.0 1.0
X R2 p(X,R2) p(X|R2) i(X;R2)
5 5 0.25 1.0 1.0
5 6 0.25 0.5 0.0
6 6 0.25 0.5 0.0
6 7 0.25 1.0 1.0
44. Data Anonymization Summary
- Randomization techniques are useful for microdata anonymization
- Randomization techniques differ in their loss of privacy
- Information-theoretic measures are useful to capture loss of privacy
  - Expected KL divergence captures expected privacy loss [AA01]
  - Maximum KL divergence captures worst-case privacy loss [EGS03]
- Both are useful in practice
45. Outline
- Part 1
  - Introduction to Information Theory
  - Application: Data Anonymization
  - Application: Database Design
- Part 2
  - Review of Information Theory Basics
  - Application: Data Integration
  - Computing Information-Theoretic Primitives
  - Open Problems
46. Information Dependencies [DR00]
- Goal: use information theory to examine and reason about the information content of the attributes in a relation instance
- Key ideas
  - A novel InD measure between attribute sets X, Y based on H(Y|X)
  - Identify numeric inequalities between InD measures
- Results
  - InD measures are a broader class than FDs and MVDs
  - The Armstrong axioms for FDs are derivable from InD inequalities
  - MVD inference rules are derivable from InD inequalities
47. Information Dependencies [DR00]
- Functional dependency X → Y
  - FD X → Y holds iff ∀ t1, t2: (t1[X] = t2[X]) ⇒ (t1[Y] = t2[Y])
X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
48. Information Dependencies [DR00]
- Functional dependency X → Y
  - FD X → Y holds iff ∀ t1, t2: (t1[X] = t2[X]) ⇒ (t1[Y] = t2[Y])
X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
49. Information Dependencies [DR00]
- Result: FD X → Y holds iff H(Y|X) = 0
  - Intuition: once X is known, there is no remaining uncertainty in Y
- Here H(Y|X) = 0.5, so X → Y does not hold
X Y p(X,Y) p(Y|X) h(Y|X)
x1 y1 0.25 0.5 1.0
x1 y2 0.25 0.5 1.0
x2 y3 0.25 1.0 0.0
x3 y3 0.125 1.0 0.0
x4 y3 0.125 1.0 0.0
X p(X)
x1 0.5
x2 0.25
x3 0.125
x4 0.125
Y p(Y)
y1 0.25
y2 0.25
y3 0.5
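This test is directly computable from a relation instance. A minimal sketch, with H(Y|X) estimated from the empirical distribution of the table above:

```python
import math
from collections import Counter

def conditional_entropy(rows, lhs, rhs):
    """H(rhs | lhs) in bits; lhs/rhs are lists of column indices."""
    n = len(rows)
    joint = Counter((tuple(t[i] for i in lhs), tuple(t[i] for i in rhs))
                    for t in rows)
    marg = Counter()
    for (xs, _), c in joint.items():
        marg[xs] += c
    return sum((c / n) * math.log2(marg[xs] / c)
               for (xs, _), c in joint.items())

rows = [("x1","y1","z1"), ("x1","y2","z2"), ("x1","y1","z2"), ("x1","y2","z1"),
        ("x2","y3","z3"), ("x2","y3","z4"), ("x3","y3","z5"), ("x4","y3","z6")]
print(conditional_entropy(rows, [0], [1]))     # 0.5 > 0: FD X -> Y fails here
print(conditional_entropy(rows, [0, 1], [1]))  # 0.0: trivial FD X,Y -> Y holds
```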
50. Information Dependencies [DR00]
- Multi-valued dependency X →→ Y
  - MVD X →→ Y holds iff R(X,Y,Z) = R(X,Y) ⋈ R(X,Z)
X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
51. Information Dependencies [DR00]
- Multi-valued dependency X →→ Y
  - MVD X →→ Y holds iff R(X,Y,Z) = R(X,Y) ⋈ R(X,Z)
X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
X Y
x1 y1
x1 y2
x2 y3
x3 y3
x4 y3
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6
52. Information Dependencies [DR00]
- Multi-valued dependency X →→ Y
  - MVD X →→ Y holds iff R(X,Y,Z) = R(X,Y) ⋈ R(X,Z)
X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
X Y
x1 y1
x1 y2
x2 y3
x3 y3
x4 y3
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6
53. Information Dependencies [DR00]
- Result: MVD X →→ Y holds iff H(Y,Z|X) = H(Y|X) + H(Z|X)
  - Intuition: once X is known, the uncertainties in Y and Z are independent
- H(Y|X) = 0.5, H(Z|X) = 0.75, H(Y,Z|X) = 1.25
X Y h(Y|X)
x1 y1 1.0
x1 y2 1.0
x2 y3 0.0
x3 y3 0.0
x4 y3 0.0
X Z h(Z|X)
x1 z1 1.0
x1 z2 1.0
x2 z3 1.0
x2 z4 1.0
x3 z5 0.0
x4 z6 0.0
X Y Z h(Y,Z|X)
x1 y1 z1 2.0
x1 y2 z2 2.0
x1 y1 z2 2.0
x1 y2 z1 2.0
x2 y3 z3 1.0
x2 y3 z4 1.0
x3 y3 z5 0.0
x4 y3 z6 0.0
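The same empirical machinery checks the MVD criterion; a self-contained sketch on the slide's relation (the helper `cond_H` mirrors the FD sketch above):

```python
import math
from collections import Counter

rows = [("x1","y1","z1"), ("x1","y2","z2"), ("x1","y1","z2"), ("x1","y2","z1"),
        ("x2","y3","z3"), ("x2","y3","z4"), ("x3","y3","z5"), ("x4","y3","z6")]

def cond_H(rows, lhs, rhs):
    """Empirical H(rhs | lhs) in bits over column-index lists."""
    n = len(rows)
    joint = Counter((tuple(t[i] for i in lhs), tuple(t[i] for i in rhs))
                    for t in rows)
    marg = Counter()
    for (xs, _), c in joint.items():
        marg[xs] += c
    return sum((c / n) * math.log2(marg[xs] / c)
               for (xs, _), c in joint.items())

# MVD X ->-> Y holds iff H(Y,Z | X) == H(Y | X) + H(Z | X)
hyz = cond_H(rows, [0], [1, 2])
hy, hz = cond_H(rows, [0], [1]), cond_H(rows, [0], [2])
print(hyz, hy + hz)   # 1.25 and 0.5 + 0.75 = 1.25: the MVD holds here
```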
54. Information Dependencies [DR00]
- Result: the Armstrong axioms for FDs are derivable from InD inequalities
  - Reflexivity: if Y ⊆ X, then X → Y
    - H(Y|X) = 0 for Y ⊆ X
  - Augmentation: X → Y ⇒ X,Z → Y,Z
    - 0 ≤ H(Y,Z|X,Z) = H(Y|X,Z) ≤ H(Y|X) = 0
  - Transitivity: X → Y and Y → Z ⇒ X → Z
    - 0 ≤ H(Z|X) ≤ H(Y|X) + H(Z|Y) = 0
55. Database Normal Forms
- Goal: eliminate update anomalies by good database design
  - Need to know the integrity constraints on all database instances
- Boyce-Codd normal form (BCNF)
  - Input: a set Σ of functional dependencies
  - For every (non-trivial) FD R.X → R.Y ∈ Σ, R.X is a key of R
- 4NF
  - Input: a set Σ of functional and multi-valued dependencies
  - For every (non-trivial) MVD R.X →→ R.Y ∈ Σ, R.X is a key of R
56. Database Normal Forms
- Functional dependency X → Y
- Which design is better?
X Y Z
x1 y1 z1
x1 y1 z2
x2 y2 z3
x2 y2 z4
x3 y3 z5
x4 y4 z6
X Y
x1 y1
x2 y2
x3 y3
x4 y4
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6
57. Database Normal Forms
- Functional dependency X → Y
- Which design is better?
- The decomposition is in BCNF
X Y Z
x1 y1 z1
x1 y1 z2
x2 y2 z3
x2 y2 z4
x3 y3 z5
x4 y4 z6
X Y
x1 y1
x2 y2
x3 y3
x4 y4
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6
58. Database Normal Forms
- Multi-valued dependency X →→ Y
- Which design is better?
X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
X Y
x1 y1
x1 y2
x2 y3
x3 y3
x4 y3
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6
59. Database Normal Forms
- Multi-valued dependency X →→ Y
- Which design is better?
- The decomposition is in 4NF
X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
X Y
x1 y1
x1 y2
x2 y3
x3 y3
x4 y3
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6
60. Well-Designed Databases [AL03]
- Goal: use information theory to characterize the goodness of a database design and reason about normalization algorithms
- Key idea
  - An information content measure of each cell in a DB instance w.r.t. the ICs
  - Redundancy reduces the information content measure of cells
- Results
  - Well-designed DB ⇔ every cell has information content > 0
  - Normalization algorithms never decrease information content
61. Well-Designed Databases [AL03]
- Information content of cell c in database D satisfying FD X → Y
  - Uniform distribution p(V) on the values for c consistent with D\c and the FD
  - The information content of cell c is the entropy H(V)
- H(V62) = 2.0 (Vij denotes the cell in row i, column j)
X Y Z
x1 y1 z1
x1 y1 z2
x2 y2 z3
x2 y2 z4
x3 y3 z5
x4 y4 z6
V62 p(V62) h(V62)
y1 0.25 2.0
y2 0.25 2.0
y3 0.25 2.0
y4 0.25 2.0
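A small sketch of the [AL03] cell-entropy idea under a single FD X → Y, on the BCNF-violating instance above; the active domain {y1..y4} and 0-indexed cell addressing are our assumptions (the slides' V62 is row 6, column 2, 1-indexed):

```python
import math

rows = [["x1","y1"], ["x1","y1"], ["x2","y2"], ["x2","y2"],
        ["x3","y3"], ["x4","y4"]]
domain = ["y1", "y2", "y3", "y4"]   # active domain of the Y column

def cell_entropy(rows, row, col, domain):
    """H(V): uniform over the domain values for cell (row, col) that keep
    the instance consistent with the FD X -> Y (columns 0 -> 1)."""
    consistent = []
    for v in domain:
        trial = [list(t) for t in rows]
        trial[row][col] = v
        ok = all(not (a[0] == b[0] and a[1] != b[1])
                 for a in trial for b in trial)
        if ok:
            consistent.append(v)
    return math.log2(len(consistent))

print(cell_entropy(rows, 5, 1, domain))  # 2.0: x4 occurs once, any y works
print(cell_entropy(rows, 1, 1, domain))  # 0.0: forced to y1 by the other x1 row
```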
62. Well-Designed Databases [AL03]
- Information content of cell c in database D satisfying FD X → Y
  - Uniform distribution p(V) on the values for c consistent with D\c and the FD
  - The information content of cell c is the entropy H(V)
- H(V22) = 0.0
X Y Z
x1 y1 z1
x1 y1 z2
x2 y2 z3
x2 y2 z4
x3 y3 z5
x4 y4 z6
V22 p(V22) h(V22)
y1 1.0 0.0
y2 0.0
y3 0.0
y4 0.0
63. Well-Designed Databases [AL03]
- Information content of cell c in database D satisfying FD X → Y
- The information content of cell c is the entropy H(V)
- Schema S is in BCNF iff ∀ D ∈ S, H(V) > 0 for all cells c in D
  - (Technicalities w.r.t. the size of the active domain)
X Y Z
x1 y1 z1
x1 y1 z2
x2 y2 z3
x2 y2 z4
x3 y3 z5
x4 y4 z6
c H(V)
c12 0.0
c22 0.0
c32 0.0
c42 0.0
c52 2.0
c62 2.0
64. Well-Designed Databases [AL03]
- Information content of cell c in database D satisfying FD X → Y
- The information content of cell c is the entropy H(V)
- H(V12) = 2.0, H(V42) = 2.0
V42 p(V42) h(V42)
y1 0.25 2.0
y2 0.25 2.0
y3 0.25 2.0
y4 0.25 2.0
X Y
x1 y1
x2 y2
x3 y3
x4 y4
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6
V12 p(V12) h(V12)
y1 0.25 2.0
y2 0.25 2.0
y3 0.25 2.0
y4 0.25 2.0
65. Well-Designed Databases [AL03]
- Information content of cell c in database D satisfying FD X → Y
- The information content of cell c is the entropy H(V)
- Schema S is in BCNF iff ∀ D ∈ S, H(V) > 0 for all cells c in D
X Y
x1 y1
x2 y2
x3 y3
x4 y4
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6
c H(V)
c12 2.0
c22 2.0
c32 2.0
c42 2.0
66. Well-Designed Databases [AL03]
- Information content of cell c in DB D satisfying MVD X →→ Y
- The information content of cell c is the entropy H(V)
- H(V52) = 0.0, H(V53) = 2.32
X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
V52 p(V52) h(V52)
y3 1.0 0.0
V53 p(V53) h(V53)
z1 0.2 2.32
z2 0.2 2.32
z3 0.2 2.32
z4 0.0
z5 0.2 2.32
z6 0.2 2.32
67. Well-Designed Databases [AL03]
- Information content of cell c in DB D satisfying MVD X →→ Y
- The information content of cell c is the entropy H(V)
- Schema S is in 4NF iff ∀ D ∈ S, H(V) > 0 for all cells c in D
X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
c H(V)
c12 0.0
c22 0.0
c32 0.0
c42 0.0
c52 0.0
c62 0.0
c72 1.58
c82 1.58
c H(V)
c13 0.0
c23 0.0
c33 0.0
c43 0.0
c53 2.32
c63 2.32
c73 2.58
c83 2.58
68. Well-Designed Databases [AL03]
- Information content of cell c in DB D satisfying MVD X →→ Y
- The information content of cell c is the entropy H(V)
- H(V32) = 1.58, H(V34) = 2.32
V34 p(V34) h(V34)
z1 0.2 2.32
z2 0.2 2.32
z3 0.2 2.32
z4 0.0
z5 0.2 2.32
z6 0.2 2.32
X Y
x1 y1
x1 y2
x2 y3
x3 y3
x4 y3
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6
V32 p(V32) h(V32)
y1 0.33 1.58
y2 0.33 1.58
y3 0.33 1.58
69. Well-Designed Databases [AL03]
- Information content of cell c in DB D satisfying MVD X →→ Y
- The information content of cell c is the entropy H(V)
- Schema S is in 4NF iff ∀ D ∈ S, H(V) > 0 for all cells c in D
X Y
x1 y1
x1 y2
x2 y3
x3 y3
x4 y3
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6
c H(V)
c12 1.0
c22 1.0
c32 1.58
c42 1.58
c52 1.58
c H(V)
c14 2.32
c24 2.32
c34 2.32
c44 2.32
c54 2.58
c64 2.58
70. Well-Designed Databases [AL03]
- Normalization algorithms never decrease information content
- The information content of cell c is the entropy H(V)
X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
c H(V)
c13 0.0
c23 0.0
c33 0.0
c43 0.0
c53 2.32
c63 2.32
c73 2.58
c83 2.58
71. Well-Designed Databases [AL03]
- Normalization algorithms never decrease information content
- The information content of cell c is the entropy H(V)
c H(V)
c14 2.32
c24 2.32
c34 2.32
c44 2.32
c54 2.58
c64 2.58
X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
X Y
x1 y1
x1 y2
x2 y3
x3 y3
x4 y3
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6
c H(V)
c13 0.0
c23 0.0
c33 0.0
c43 0.0
c53 2.32
c63 2.32
c73 2.58
c83 2.58
72. Well-Designed Databases [AL03]
- Normalization algorithms never decrease information content
- The information content of cell c is the entropy H(V)
c H(V)
c14 2.32
c24 2.32
c34 2.32
c44 2.32
c54 2.58
c64 2.58
X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
X Y
x1 y1
x1 y2
x2 y3
x3 y3
x4 y3
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6
c H(V)
c13 0.0
c23 0.0
c33 0.0
c43 0.0
c53 2.32
c63 2.32
c73 2.58
c83 2.58
73. Database Design Summary
- Good database design is essential for preserving data integrity
- Information-theoretic measures are useful for integrity constraints
  - FD X → Y holds iff the InD measure H(Y|X) = 0
  - MVD X →→ Y holds iff H(Y,Z|X) = H(Y|X) + H(Z|X)
  - Information theory models correlations in a specific database
- Information-theoretic measures are useful for normal forms
  - Schema S is in BCNF/4NF iff ∀ D ∈ S, H(V) > 0 for all cells c in D
  - Information theory models distributions over possible databases
74. Outline
- Part 1
  - Introduction to Information Theory
  - Application: Data Anonymization
  - Application: Database Design
- Part 2
  - Review of Information Theory Basics
  - Application: Data Integration
  - Computing Information-Theoretic Primitives
  - Open Problems
75. Review of Information Theory Basics
- Discrete distribution: probability p(X)
- p(X,Y) = Σz p(X, Y, Z=z)
X Y Z p(X,Y,Z)
x1 y1 z1 0.125
x1 y2 z2 0.125
x1 y1 z2 0.125
x1 y2 z1 0.125
x2 y3 z3 0.125
x2 y3 z4 0.125
x3 y3 z5 0.125
x4 y3 z6 0.125
X p(X)
x1 0.5
x2 0.25
x3 0.125
x4 0.125
X Y p(X,Y)
x1 y1 0.25
x1 y2 0.25
x2 y3 0.25
x3 y3 0.125
x4 y3 0.125
Y p(Y)
y1 0.25
y2 0.25
y3 0.5
76. Review of Information Theory Basics
- Discrete distribution: probability p(X)
- p(Y) = Σx p(X=x, Y) = Σx Σz p(X=x, Y, Z=z)
X Y Z p(X,Y,Z)
x1 y1 z1 0.125
x1 y2 z2 0.125
x1 y1 z2 0.125
x1 y2 z1 0.125
x2 y3 z3 0.125
x2 y3 z4 0.125
x3 y3 z5 0.125
x4 y3 z6 0.125
X p(X)
x1 0.5
x2 0.25
x3 0.125
x4 0.125
X Y p(X,Y)
x1 y1 0.25
x1 y2 0.25
x2 y3 0.25
x3 y3 0.125
x4 y3 0.125
Y p(Y)
y1 0.25
y2 0.25
y3 0.5
77. Review of Information Theory Basics
- Discrete distribution: conditional probability p(X|Y)
- p(X,Y) = p(X|Y)p(Y) = p(Y|X)p(X)
X Y p(X,Y) p(X|Y) p(Y|X)
x1 y1 0.25 1.0 0.5
x1 y2 0.25 1.0 0.5
x2 y3 0.25 0.5 1.0
x3 y3 0.125 0.25 1.0
x4 y3 0.125 0.25 1.0
X p(X)
x1 0.5
x2 0.25
x3 0.125
x4 0.125
Y p(Y)
y1 0.25
y2 0.25
y3 0.5
78. Review of Information Theory Basics
- Discrete distribution: entropy H(X)
  - h(x) = log2(1/p(x))
  - H(X) = Σx p(x) h(x) = 1.75
  - H(Y) = Σy p(y) h(y) = 1.5 (< log2 |Y| = 1.58)
  - H(X,Y) = Σx Σy p(x,y) h(x,y) = 2.25 (< log2 |X,Y| = 2.32)
X Y p(X,Y) h(X,Y)
x1 y1 0.25 2.0
x1 y2 0.25 2.0
x2 y3 0.25 2.0
x3 y3 0.125 3.0
x4 y3 0.125 3.0
X p(X) h(X)
x1 0.5 1.0
x2 0.25 2.0
x3 0.125 3.0
x4 0.125 3.0
Y p(Y) h(Y)
y1 0.25 2.0
y2 0.25 2.0
y3 0.5 1.0
79. Review of Information Theory Basics
- Discrete distribution: conditional entropy H(X|Y)
  - h(x|y) = log2(1/p(x|y))
  - H(X|Y) = Σx Σy p(x,y) h(x|y) = 0.75
  - H(X|Y) = H(X,Y) - H(Y) = 2.25 - 1.5
X Y p(X,Y) p(X|Y) h(X|Y)
x1 y1 0.25 1.0 0.0
x1 y2 0.25 1.0 0.0
x2 y3 0.25 0.5 1.0
x3 y3 0.125 0.25 2.0
x4 y3 0.125 0.25 2.0
X p(X) h(X)
x1 0.5 1.0
x2 0.25 2.0
x3 0.125 3.0
x4 0.125 3.0
Y p(Y) h(Y)
y1 0.25 2.0
y2 0.25 2.0
y3 0.5 1.0
80. Review of Information Theory Basics
- Discrete distribution: mutual information I(X;Y)
  - i(x;y) = log2(p(x,y)/(p(x)p(y)))
  - I(X;Y) = Σx Σy p(x,y) i(x;y) = 1.0
  - I(X;Y) = H(X) + H(Y) - H(X,Y) = 1.75 + 1.5 - 2.25
X Y p(X,Y) h(X,Y) i(x;y)
x1 y1 0.25 2.0 1.0
x1 y2 0.25 2.0 1.0
x2 y3 0.25 2.0 1.0
x3 y3 0.125 3.0 1.0
x4 y3 0.125 3.0 1.0
X p(X) h(X)
x1 0.5 1.0
x2 0.25 2.0
x3 0.125 3.0
x4 0.125 3.0
Y p(Y) h(Y)
y1 0.25 2.0
y2 0.25 2.0
y3 0.5 1.0
81. Outline
- Part 1
  - Introduction to Information Theory
  - Application: Data Anonymization
  - Application: Database Design
- Part 2
  - Review of Information Theory Basics
  - Application: Data Integration
  - Computing Information-Theoretic Primitives
  - Open Problems
82. Schema Matching
- Goal: align columns across database tables to be integrated
  - A fundamental problem in database integration
- Early useful approach: textual similarity of column names
  - False positives: Address matched to IP_Address (similar names, different semantics)
  - False negatives: Customer_Id not matched to Client_Number (different names, same semantics)
- Early useful approach: overlap of values in columns, e.g., Jaccard similarity
  - False positives: Emp_Id matched to Project_Id
  - False negatives: Emp_Id not matched to Personnel_Number
83. Opaque Schema Matching [KN03]
- Goal: align columns when column names and data values are opaque
  - Databases belong to different government bureaucracies
  - Treat column names and data values as uninterpreted (generic)
- Example: EMP_PROJ(Emp_Id, Proj_Id, Task_Id, Status_Id)
  - Likely that all Id fields are from the same domain
  - Different databases may have different column names
W X Y Z
w2 x1 y1 z2
w4 x2 y3 z3
w3 x3 y3 z1
w1 x2 y1 z2
A B C D
a1 b2 c1 d1
a3 b4 c2 d2
a1 b1 c1 d2
a4 b3 c2 d3
84. Opaque Schema Matching [KN03]
- Approach: build a complete, labeled graph GD for each database D
  - Nodes are columns; label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
- Perform graph matching between GD1 and GD2, minimizing distance
- Intuition
  - Entropy H(X) captures the distribution of values in database column X
  - Mutual information I(X;Y) captures correlations between X, Y
- Efficiency: graph matching is between schema-sized graphs
85. Opaque Schema Matching [KN03]
- Approach: build a complete, labeled graph GD for each database D
  - Nodes are columns; label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
A B C D
a1 b2 c1 d1
a3 b4 c2 d2
a1 b1 c1 d2
a4 b3 c2 d3
A p(A)
a1 0.5
a3 0.25
a4 0.25
B p(B)
b1 0.25
b2 0.25
b3 0.25
b4 0.25
C p(C)
c1 0.5
c2 0.5
D p(D)
d1 0.25
d2 0.5
d3 0.25
86. Opaque Schema Matching [KN03]
- Approach: build a complete, labeled graph GD for each database D
  - Nodes are columns; label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
- H(A) = 1.5, H(B) = 2.0, H(C) = 1.0, H(D) = 1.5
A B C D
a1 b2 c1 d1
a3 b4 c2 d2
a1 b1 c1 d2
a4 b3 c2 d3
A h(A)
a1 1.0
a3 2.0
a4 2.0
B h(B)
b1 2.0
b2 2.0
b3 2.0
b4 2.0
C h(C)
c1 1.0
c2 1.0
D h(D)
d1 2.0
d2 1.0
d3 2.0
87. Opaque Schema Matching [KN03]
- Approach: build a complete, labeled graph GD for each database D
  - Nodes are columns; label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
- H(A) = 1.5, H(B) = 2.0, H(C) = 1.0, H(D) = 1.5, I(A;B) = 1.5
A B C D
a1 b2 c1 d1
a3 b4 c2 d2
a1 b1 c1 d2
a4 b3 c2 d3
A h(A)
a1 1.0
a3 2.0
a4 2.0
B h(B)
b1 2.0
b2 2.0
b3 2.0
b4 2.0
A B h(A,B) i(A;B)
a1 b2 2.0 1.0
a3 b4 2.0 2.0
a1 b1 2.0 1.0
a4 b3 2.0 2.0
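A minimal sketch of the graph construction: entropy node labels and mutual-information edge labels computed from the A/B/C/D table above (the actual [KN03] graph-matching step is not shown):

```python
import math
from collections import Counter
from itertools import combinations

table = [("a1","b2","c1","d1"), ("a3","b4","c2","d2"),
         ("a1","b1","c1","d2"), ("a4","b3","c2","d3")]
cols = list(zip(*table))          # column-oriented view of the table

def H(values):
    """Empirical entropy (bits) of a column given as a sequence of values."""
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in Counter(values).values())

node_labels = {i: H(col) for i, col in enumerate(cols)}
edge_labels = {(i, j): H(cols[i]) + H(cols[j]) - H(list(zip(cols[i], cols[j])))
               for i, j in combinations(range(len(cols)), 2)}
print(node_labels)           # {0: 1.5, 1: 2.0, 2: 1.0, 3: 1.5}
print(edge_labels[(0, 1)])   # I(A;B) = 1.5
```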
88. Opaque Schema Matching [KN03]
- Approach: build a complete, labeled graph GD for each database D
  - Nodes are columns; label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
A B C D
a1 b2 c1 d1
a3 b4 c2 d2
a1 b1 c1 d2
a4 b3 c2 d3
89. Opaque Schema Matching [KN03]
- Approach: build a complete, labeled graph GD for each database D
  - Nodes are columns; label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
- Perform graph matching between GD1 and GD2, minimizing distance
  - [KN03] uses Euclidean and normal distance metrics
92. Heterogeneity Identification [DKOSV06]
- Goal: identify columns with semantically heterogeneous values
  - Can arise due to opaque schema matching [KN03]
- Key ideas
  - Heterogeneity is based on the distribution and distinguishability of values
  - Use information theory to quantify heterogeneity
- Issues
  - Which information-theoretic measure characterizes heterogeneity?
  - How do we actually cluster the data?
93. Heterogeneity Identification [DKOSV06]
- Example: semantically homogeneous vs. heterogeneous columns
Customer_Id
h8742@yyy.com
kkjj@haha.org
qwerty@keyboard.us
555-1212@fax.in
alpha@beta.ga
john.smith@noname.org
jane.doe@1973law.us
jamesbond.007@action.com
Customer_Id
h8742@yyy.com
kkjj@haha.org
qwerty@keyboard.us
555-1212@fax.in
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
95. Heterogeneity Identification [DKOSV06]
- Example: semantically homogeneous vs. heterogeneous columns
- More semantic types in a column ⇒ greater heterogeneity
  - Only email versus email + phone
Customer_Id
h8742@yyy.com
kkjj@haha.org
qwerty@keyboard.us
555-1212@fax.in
alpha@beta.ga
john.smith@noname.org
jane.doe@1973law.us
jamesbond.007@action.com
Customer_Id
h8742@yyy.com
kkjj@haha.org
qwerty@keyboard.us
555-1212@fax.in
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
96. Heterogeneity Identification [DKOSV06]
- Example: semantically homogeneous vs. heterogeneous columns
Customer_Id
h8742@yyy.com
kkjj@haha.org
qwerty@keyboard.us
555-1212@fax.in
alpha@beta.ga
john.smith@noname.org
jane.doe@1973law.us
(877)-807-4596
Customer_Id
h8742@yyy.com
kkjj@haha.org
qwerty@keyboard.us
555-1212@fax.in
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
97. Heterogeneity Identification [DKOSV06]
- Example: semantically homogeneous vs. heterogeneous columns
- The relative distribution of semantic types impacts heterogeneity
  - Mainly email + few phone versus balanced email + phone
Customer_Id
h8742@yyy.com
kkjj@haha.org
qwerty@keyboard.us
555-1212@fax.in
alpha@beta.ga
john.smith@noname.org
jane.doe@1973law.us
(877)-807-4596
Customer_Id
h8742@yyy.com
kkjj@haha.org
qwerty@keyboard.us
555-1212@fax.in
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
98. Heterogeneity Identification [DKOSV06]
- Example: semantically homogeneous vs. heterogeneous columns
Customer_Id
187-65-2468
987-64-6837
789-15-4321
987-65-4321
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
Customer_Id
h8742@yyy.com
kkjj@haha.org
qwerty@keyboard.us
555-1212@fax.in
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
100. Heterogeneity Identification [DKOSV06]
- Example: semantically homogeneous vs. heterogeneous columns
- More easily distinguished types ⇒ greater heterogeneity
  - Phone + (possibly) SSN versus balanced email + phone
Customer_Id
187-65-2468
987-64-6837
789-15-4321
987-65-4321
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
Customer_Id
h8742@yyy.com
kkjj@haha.org
qwerty@keyboard.us
555-1212@fax.in
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
101. Heterogeneity Identification [DKOSV06]
- Heterogeneity = complexity of describing the data
  - More, balanced clusters ⇒ greater heterogeneity
  - More distinguishable clusters ⇒ greater heterogeneity
- Use two views of mutual information
  - It quantifies the complexity of describing the data (compression)
  - It quantifies the quality of the compression
102. Heterogeneity Identification [DKOSV06]
X Customer_Id T Cluster_Id
187-65-2468 t1
987-64-6837 t1
789-15-4321 t1
987-65-4321 t1
(908)-555-1234 t2
973-360-0000 t1
360-0007 t3
(877)-807-4596 t2
103. Measuring Complexity of Clustering
- Take 1: complexity of a clustering = number of clusters
  - The standard model of complexity
  - Doesn't capture the fact that clusters have different sizes
(Figure: two clusterings with the same number of clusters but different cluster sizes)
104. Measuring Complexity of Clustering
- Take 2: complexity of a clustering = number of bits needed to describe it
  - Writing down k needs log k bits
  - In general, let cluster t ∈ T have |t| elements
    - Set p(t) = |t|/n
    - Bits to write down cluster sizes = H(T) = Σt pt log 1/pt
(Figure: H(skewed clustering) < H(balanced clustering))
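A small sketch of this description cost from cluster sizes alone; it reproduces the picture that a skewed clustering is cheaper to describe than a balanced one:

```python
import math

def clustering_description_cost(sizes):
    """H(T) in bits, from cluster sizes, using p(t) = |t| / n."""
    n = sum(sizes)
    return sum((s / n) * math.log2(n / s) for s in sizes if s)

print(clustering_description_cost([4, 4]))  # 1.0 bit (balanced)
print(clustering_description_cost([7, 1]))  # ~0.54 bits (skewed costs less)
```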
105. Heterogeneity Identification [DKOSV06]
- Soft clustering: cluster membership probabilities
- How to compute a good soft clustering?
X Customer_Id T Cluster_Id p(T|X)
789-15-4321 t1 0.75
987-65-4321 t1 0.75
789-15-4321 t2 0.25
987-65-4321 t2 0.25
(908)-555-1234 t1 0.25
973-360-0000 t1 0.5
(908)-555-1234 t2 0.75
973-360-0000 t2 0.5
106. Measuring Complexity of Clustering
- Take 1
  - p(t) = Σx p(x) p(t|x)
  - Compute H(T) as before
- Problem: H(T1) = H(T2)!!

Membership p(t|x):   T1: t1   t2    |   T2: t1   t2
x1                       0.5  0.5   |       0.99 0.01
x2                       0.5  0.5   |       0.01 0.99
p(t)                     0.5  0.5   |       0.5  0.5
107. Measuring Complexity of Clustering
- By averaging the memberships, we've lost useful information.
- Take 2: compute I(T;X)!
- Even better: if T is a hard clustering of X, then I(T;X) = H(T)
X T1 p(X,T) i(X;T)
x1 t1 0.25 0
x1 t2 0.25 0
x2 t1 0.25 0
x2 t2 0.25 0
X T2 p(X,T) i(X;T)
x1 t1 0.495 0.99
x1 t2 0.005 -5.64
x2 t1 0.25 0
x2 t2 0.25 0
I(T1;X) = 0
I(T2;X) = 0.46
108. Measuring Cost of a Cluster
- Given objects Xt = {X1, X2, ..., Xm} in cluster t, Cost(t) = sum of distances from each Xi to the cluster center
- If we set the distance to be the KL-distance, then the cost of a cluster is I(Xt; V)
109. Cost of a Clustering
- If we partition X into k clusters X1, ..., Xk
  - Cost(clustering) = Σi pi I(Xi; V), where pi = |Xi|/|X|
- Suppose we treat each cluster center itself as a point?
110. Cost of a Clustering
- We can write down the cost of this clustering of cluster centers
  - Cost(T) = I(T;V)
- Key result [BMDG05]
  - Cost(clustering) = I(X;V) - I(T;V)
  - Minimizing Cost(clustering) ⇔ maximizing I(T;V)
111. Heterogeneity Identification [DKOSV06]
- Represent strings as q-gram distributions
X (Customer_Id) V (4-grams) p(X,V)
987-65-4321 987- 0.10
987-65-4321 87-6 0.13
987-65-4321 7-65 0.12
987-65-4321 -65- 0.15
987-65-4321 65-4 0.05
987-65-4321 5-43 0.20
987-65-4321 -432 0.15
987-65-4321 4321 0.10
Customer_Id
187-65-2468
987-64-6837
789-15-4321
987-65-4321
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
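A minimal sketch of the q-gram representation; note that the slide's p(X,V) column shows joint weights from [DKOSV06]'s construction, while this helper simply normalizes one string's q-gram counts:

```python
from collections import Counter

def qgram_distribution(s, q=4):
    """Empirical distribution over the q-grams of a single string."""
    grams = [s[i:i + q] for i in range(len(s) - q + 1)]
    n = len(grams)
    return {g: c / n for g, c in Counter(grams).items()}

print(qgram_distribution("987-65-4321"))  # 8 distinct 4-grams, 1/8 weight each
```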
112. Heterogeneity Identification [DKOSV06]
- iIB: find a soft clustering T of X that minimizes I(T;X) - β·I(T;V)
- Allow iIB to use arbitrarily many clusters; use β* = H(X)/I(X;V)
  - Closest to the point with minimum space and maximum quality
X (Customer_Id) V (4-grams) p(X,V)
987-65-4321 987- 0.10
987-65-4321 87-6 0.13
987-65-4321 7-65 0.12
987-65-4321 -65- 0.15
987-65-4321 65-4 0.05
987-65-4321 5-43 0.20
987-65-4321 -432 0.15
987-65-4321 4321 0.10
Customer_Id
187-65-2468
987-64-6837
789-15-4321
987-65-4321
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
113. Heterogeneity Identification [DKOSV06]
- Rate-distortion curve: I(T;V)/I(X;V) vs. I(T;X)/H(X)
(Figure: rate-distortion curve, with the operating point at β* marked)
114. Heterogeneity Identification [DKOSV06]
- Heterogeneity = mutual information I(T;X) of the iIB clustering T at β*
  - 0 ≤ I(T;X) (= 0.126) ≤ H(X) (= 2.0), H(T) (= 1.0)
- Ideally, use iIB with an arbitrarily large number of clusters in T
X Customer_Id T Cluster_Id p(T|X) i(T;X)
789-15-4321 t1 0.75 0.41
987-65-4321 t1 0.75 0.41
789-15-4321 t2 0.25 -0.81
987-65-4321 t2 0.25 -0.81
(908)-555-1234 t1 0.25 -1.17
973-360-0000 t1 0.5 -0.17
(908)-555-1234 t2 0.75 0.77
973-360-0000 t2 0.5 0.19
115. Heterogeneity Identification [DKOSV06]
- Heterogeneity = mutual information I(T;X) of the iIB clustering T at β*
116. Data Integration Summary
- Analyzing the database instance is critical for effective data integration
  - Matching and quality assessments are key components
- Information-theoretic measures are useful for schema matching
  - Align columns when column names and data values are opaque
  - Mutual information I(X;V) captures correlations between X, V
- Information-theoretic measures are useful for heterogeneity testing
  - Identify columns with semantically heterogeneous values
  - I(T;X) of the iIB clustering T at β* captures column heterogeneity
117. Outline
- Part 1
  - Introduction to Information Theory
  - Application: Data Anonymization
  - Application: Database Design
- Part 2
  - Review of Information Theory Basics
  - Application: Data Integration
  - Computing Information-Theoretic Primitives
  - Open Problems
118. Domain Size Matters
- For a random variable X, the domain size is |supp(X)| = |{xi : p(X = xi) > 0}|
- Different solutions exist depending on whether the domain size is small or large
- Probability vectors are usually very sparse
119. Entropy, Case I: Small Domain Size
- Suppose the number of unique values for a random variable X is small (i.e., fits in memory)
- Maximum likelihood estimator
  - p(x) = (# times x is encountered) / (total number of items in the set)
(Figure: a stream of values binned into a histogram over the domain 1..5)
120. Entropy, Case I: Small Domain Size
- H_MLE = Σx p(x) log 1/p(x)
- This is a biased estimate
  - E[H_MLE] < H
- Miller-Madow correction
  - Ĥ = H_MLE + (m̂ - 1)/2n
  - m̂ is an estimate of the number of non-empty bins
  - n = number of samples
- Bad news: ALL estimators for H are biased
- Good news: we can quantify the bias and variance of the MLE
  - Bias ≤ log(1 + m/N)
  - Var(H_MLE) ≤ (log n)²/N
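A small sketch of the MLE estimator with the Miller-Madow correction as stated above; the slide's formula leaves the log base implicit, so we apply the correction verbatim to the base-2 estimate:

```python
import math
from collections import Counter

def entropy_mle(samples):
    """Plug-in (MLE) entropy estimate in bits from a list of samples."""
    n = len(samples)
    return sum((c / n) * math.log2(n / c) for c in Counter(samples).values())

def entropy_miller_madow(samples):
    """MLE estimate plus the (m - 1)/(2n) correction from the slide."""
    n = len(samples)
    m = len(set(samples))   # estimate of the number of non-empty bins
    return entropy_mle(samples) + (m - 1) / (2 * n)

data = [1, 2, 1, 2, 1, 5, 4]
print(entropy_mle(data), entropy_miller_madow(data))
```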
121. Entropy, Case II: Large Domain Size
- |X| is too large to fit in main memory, so we can't maintain explicit counts
- Streaming algorithms for H(X)
  - Long history of work on this problem
- Bottom line
  - A (1±ε)-relative-approximation for H(X) that allows for updates to frequencies, and requires almost constant, and optimal, space [HNO08]
122. Streaming Entropy [CCM07]
- High-level idea: sample randomly from the stream, and track counts of the elements picked [AMS]
- Problem: a skewed distribution prevents us from sampling lower-frequency elements (and then the entropy is small)
- Idea: estimate the largest frequency, and the distribution of what's left (which has higher entropy)
123. Streaming Entropy [CCM07]
- Maintain a set of samples from the original distribution, and a set from the distribution without the most frequent element