Information Theory For Data Management
1
Information Theory For Data Management
Divesh Srivastava, Suresh Venkatasubramanian
2
Motivation
-- Abstruse Goose (177)
Information Theory is relevant to all of
humanity...
3
Background
  • Many problems in data management need precise
    reasoning about information content, transfer and
    loss
  • Structure Extraction
  • Privacy preservation
  • Schema design
  • Probabilistic data ?

4
Information Theory
  • First developed by Shannon as a way of
    quantifying capacity of signal channels.
  • Entropy, relative entropy and mutual information
    capture intrinsic informational aspects of a
    signal
  • Today:
  • Information theory provides a domain-independent
    way to reason about structure in data
  • More information → interesting structure
  • Less information linkage → decoupling of
    structures

5
Tutorial Thesis
  • Information theory provides a mathematical
    framework for the quantification of information
    content, linkage and loss.
  • This framework can be used in the design of data
    management strategies that rely on probing the
    structure of information in data.

6
Tutorial Goals
  • Introduce information-theoretic concepts to VLDB
    audience
  • Give a data-centric perspective on information
    theory
  • Connect these to applications in data management
  • Describe underlying computational primitives
  • Illuminate when and how information theory might
    be of use in new areas of data management.

7
Outline
  • Part 1
  • Introduction to Information Theory
  • Application: Data Anonymization
  • Application: Data Integration
  • Part 2
  • Review of Information Theory Basics
  • Application: Database Design
  • Computing Information Theoretic Primitives
  • Open Problems

8
Histograms And Discrete Distributions
9
Histograms And Discrete Distributions
(figure: a histogram is normalized and reweighted into a discrete distribution)
10
From Columns To Random Variables
  • We can think of a column of data as represented
    by a random variable
  • X is a random variable
  • p(X) is the column of probabilities p(X = x1),
    p(X = x2), and so on
  • Also known (in unweighted case) as the empirical
    distribution induced by the column X.
  • Notation
  • X (upper case) denotes a random variable (column)
  • x (lower case) denotes a value taken by X (field
    in a tuple)
  • p(x) is the probability p(X = x)

11
Joint Distributions
  • Discrete distribution: probability p(X,Y,Z)
  • p(Y) = Σx p(X=x, Y) = Σx Σz p(X=x, Y, Z=z)

12
Entropy Of A Column
  • Let h(x) = log2 1/p(x)
  • h(X) is the column of h(x) values.
  • H(X) = EX[h(x)] = Σx p(x) log2 1/p(x)
  • Two views of entropy
  • It captures uncertainty in data: high entropy,
    more unpredictability
  • It captures information content: higher entropy,
    more information.

H(X) = 1.75 < log |X| = 2
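
As a quick illustration (not from the tutorial: the column below is a hypothetical instance whose empirical distribution is (1/2, 1/4, 1/8, 1/8)), the entropy of a column can be computed directly from value counts; a minimal Python sketch:

from collections import Counter
from math import log2

def entropy(column):
    """Empirical (maximum-likelihood) entropy of a column, in bits."""
    n = len(column)
    return sum((c / n) * log2(n / c) for c in Counter(column).values())

# Hypothetical column inducing the distribution (1/2, 1/4, 1/8, 1/8)
col = ['a'] * 4 + ['b'] * 2 + ['c'] + ['d']
print(entropy(col))  # 1.75, below log2(4) = 2 for the uniform distribution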
13
Examples
  • X uniform over {1, ..., 4}. H(X) = 2
  • Y is 1 with probability 0.5, in {2, 3, 4}
    uniformly.
  • H(Y) = 0.5 log 2 + 0.5 log 6 ≈ 1.8 < 2
  • Y is more sharply defined, and so has less
    uncertainty.
  • Z uniform over {1, ..., 8}. H(Z) = 3 > 2
  • Z spans a larger range, and captures more
    information

(figure: histograms of X, Y, and Z)
14
Comparing Distributions
  • How do we measure the difference between two
    distributions?
  • Kullback-Leibler divergence
  • dKL(p, q) = Ep[h(q) - h(p)] = Σi pi log(pi/qi)

(figure: prior belief + inference mechanism → resulting belief)
15
Comparing Distributions
  • Kullback-Leibler divergence
  • dKL(p, q) = Ep[h(q) - h(p)] = Σi pi log(pi/qi)
  • dKL(p, q) ≥ 0
  • Captures the extra information needed to describe
    p given q
  • Is asymmetric! dKL(p, q) ≠ dKL(q, p)
  • Is not a metric (does not satisfy the triangle
    inequality)
  • There are other measures:
  • χ2-distance, variational distance, f-divergences, ...

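A minimal sketch of dKL and its asymmetry (the two distributions below are illustrative assumptions):

from math import log2

def d_kl(p, q):
    """dKL(p, q) = sum_i p_i log2(p_i / q_i); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
print(d_kl(p, q))  # ~0.74
print(d_kl(q, p))  # ~0.53: dKL(p, q) != dKL(q, p)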
16
Conditional Probability
  • Given a joint distribution on random variables X,
    Y, how much information about X can we glean from
    Y?
  • Conditional probability p(X|Y)
  • p(X = x1 | Y = y1) = p(X = x1, Y = y1)/p(Y = y1)

17
Conditional Entropy
  • Let h(x|y) = log2 1/p(x|y)
  • H(X|Y) = Ex,y[h(x|y)] = Σx Σy p(x,y) log2
    1/p(x|y)
  • H(X|Y) = H(X,Y) - H(Y)
  • H(X|Y) = H(X,Y) - H(Y) = 2.25 - 1.5 = 0.75
  • If X, Y are independent, H(X|Y) = H(X)

18
Mutual Information
  • Mutual information captures the difference
    between the joint distribution on X and Y, and
    the product of the marginal distributions on X and Y.
  • Let i(x;y) = log p(x,y)/p(x)p(y)
  • I(X;Y) = Ex,y[i(x;y)] = Σx Σy p(x,y) log
    p(x,y)/p(x)p(y)

19
Mutual Information: Strength of Linkage
  • I(X;Y) = H(X) + H(Y) - H(X,Y) = H(X) - H(X|Y)
    = H(Y) - H(Y|X)
  • If X, Y are independent, then I(X;Y) = 0
  • H(X,Y) = H(X) + H(Y), so I(X;Y) = H(X) + H(Y)
    - H(X,Y) = 0
  • I(X;Y) ≤ min(H(X), H(Y))
  • Suppose Y = f(X) (deterministically)
  • Then H(Y|X) = 0, and so I(X;Y) = H(Y) - H(Y|X)
    = H(Y)
  • Mutual information captures higher-order
    interactions
  • Covariance captures linear interactions only
  • Two variables can be uncorrelated (covariance
    = 0) and have nonzero mutual information
  • X ∈R {-1, 0, 1}, Y = X². Cov(X,Y) = 0, I(X;Y) = H(Y)
    > 0

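A sketch tying these identities together (the joint distribution is given by enumeration; the uncorrelated-but-dependent example follows the bullet above):

from collections import Counter
from math import log2

def mutual_information(pairs):
    """I(X;Y) = sum p(x,y) log2( p(x,y) / (p(x) p(y)) ) over equally likely pairs."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# X uniform on {-1, 0, 1}, Y = X*X: uncorrelated yet dependent
pairs = [(x, x * x) for x in (-1, 0, 1)]
cov = sum(x * y for x, y in pairs) / 3   # = E[XY]; E[X] = 0, so Cov(X,Y) = 0
print(cov, mutual_information(pairs))    # 0.0, ~0.918 (= H(Y)) > 0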
20
Information-Theoretic Clustering
  • Clustering takes a collection of objects and
    groups them.
  • Given a distance function between objects
  • Choice of measure of complexity of clustering
  • Choice of measure of cost for a cluster
  • Usually,
  • Distance function is Euclidean distance
  • Number of clusters is the measure of complexity
  • Cost measure for a cluster is sum-of-squared-distance
    to center
  • Goal: minimize complexity and cost
  • Inherent tradeoff between the two

21
Feature Representation
Let V = {v1, v2, v3, v4}. X is explained by a
distribution over V. The feature vector of X is
(0.5, 0.25, 0.125, 0.125).
22
Feature Representation
p(v2|X2) = 0.2
Feature vector
23
Information-Theoretic Clustering
  • Clustering takes a collection of objects and
    groups them.
  • Given a distance function between objects
  • Choice of measure of complexity of clustering
  • Choice of measure of cost for a cluster
  • In information-theoretic setting
  • What is the distance function?
  • How do we measure complexity?
  • What is a notion of cost/quality?
  • Goal: minimize complexity and maximize quality
  • Inherent tradeoff between the two

24
Measuring complexity of clustering
  • Take 1: complexity of a clustering = # clusters
  • The standard model of complexity.
  • Doesn't capture the fact that clusters have
    different sizes.

25
Measuring complexity of clustering
  • Take 2: complexity of a clustering = number of bits
    needed to describe it.
  • Writing down k needs log k bits.
  • In general, let cluster t ∈ T have |t| elements.
  • Set p(t) = |t|/n
  • Bits to write down cluster sizes = H(T) = Σt pt
    log 1/pt

(figure: H(T) compared for two clusterings of different balance)
26
Information-theoretic Clustering (take I)
  • Given data X = {x1, ..., xn} explained by variable
    V, partition X into clusters (represented by T)
    such that
  • H(T) is minimized and quality is maximized

27
Soft clusterings
  • In a hard clustering, each point is assigned to
    exactly one cluster.
  • Characteristic function:
  • p(t|x) = 1 if x ∈ t, 0 if not.
  • Suppose we allow points to partially belong to
    clusters
  • p(T|x) is a distribution.
  • p(t|x) is the probability of assigning x to t
  • How do we describe the complexity of a
    clustering?

28
Measuring complexity of clustering
  • Take 1:
  • p(t) = Σx p(x) p(t|x)
  • Compute H(T) as before.
  • Problem:
  • H(T1) = H(T2) !!

29
Measuring complexity of clustering
  • By averaging the memberships, we've lost useful
    information.
  • Take 2: compute I(T;X) !
  • Even better: if T is a hard clustering of X, then
    I(T;X) = H(T)

I(T1;X) = 0
I(T2;X) = 0.46
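
A sketch of why averaging loses information (the membership probabilities are assumptions): both clusterings below have p(T) = (0.5, 0.5), hence the same H(T), but different I(T;X):

from math import log2

def soft_cluster_mi(px, pt_given_x):
    """I(T;X) = sum_x sum_t p(x) p(t|x) log2( p(t|x) / p(t) )."""
    k = len(pt_given_x[0])
    pt = [sum(px[i] * row[t] for i, row in enumerate(pt_given_x)) for t in range(k)]
    return sum(px[i] * row[t] * log2(row[t] / pt[t])
               for i, row in enumerate(pt_given_x)
               for t in range(k) if row[t] > 0)

px = [0.25] * 4
T1 = [[0.5, 0.5]] * 4                                  # every point split evenly
T2 = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]  # memberships carry information
print(soft_cluster_mi(px, T1))  # 0.0
print(soft_cluster_mi(px, T2))  # ~0.53 > 0, even though H(T1) = H(T2) = 1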
30
Information-theoretic Clustering (take II)
  • Given data X = {x1, ..., xn} explained by variable
    V, partition X into clusters (represented by T)
    such that
  • I(T;X) is minimized and quality is maximized

31
Measuring cost of a cluster
Given objects Xt = {X1, X2, ..., Xm} in cluster
t, Cost(t) = (1/m) Σi d(Xi, C) = Σi p(Xi)
dKL(p(V|Xi), C), where C = (1/m) Σi p(V|Xi) = Σi
p(Xi) p(V|Xi) = p(V)
32
Mutual Information Cost of Cluster
  • Cost(t) = (1/m) Σi d(Xi, C) = Σi p(Xi)
    dKL(p(V|Xi), p(V))
  • Σi p(Xi) dKL(p(V|Xi), p(V)) = Σi p(Xi) Σj
    p(vj|Xi) log p(vj|Xi)/p(vj)
  • = Σi,j p(Xi, vj) log p(Xi, vj)/p(vj)p(Xi)
  • = I(Xt; V) !!
  • Cost of a cluster = I(Xt; V)

33
Cost of a clustering
  • If we partition X into k clusters X1, ..., Xk
  • Cost(clustering) = Σi pi I(Xi; V)
  • (pi = |Xi|/|X|)

34
Cost of a clustering
  • Each cluster center t can be explained in terms
    of V
  • p(V|t) = Σi p(Xi) p(V|Xi)
  • Suppose we treat each cluster center itself as a
    point

35
Cost of a clustering
  • We can write down the cost of this cluster
  • Cost(T) = I(T;V)
  • Key result [BMDG05]:
  • Cost(clustering) = I(X;V) - I(T;V)
  • Minimizing Cost(clustering) ⇒ maximizing I(T;V)

36
Information-theoretic Clustering (take III)
  • Given data X = {x1, ..., xn} explained by variable
    V, partition X into clusters (represented by T)
    such that
  • I(T;X) - βI(T;V) is minimized
  • This is the Information Bottleneck Method [TPB98]
  • Agglomerative techniques exist for the case of
    hard clusterings
  • β is the tradeoff parameter between complexity
    and cost
  • I(T;X) and I(T;V) are in the same units.

37
Information Theory Summary
  • We can represent data as discrete distributions
    (normalized histograms)
  • Entropy captures uncertainty or information
    content in a distribution
  • The Kullback-Leibler divergence captures the
    difference between distributions
  • Mutual information and conditional entropy
    capture linkage between variables in a joint
    distribution
  • We can formulate information-theoretic clustering
    problems

38
Outline
  • Part 1
  • Introduction to Information Theory
  • Application: Data Anonymization
  • Application: Data Integration
  • Part 2
  • Review of Information Theory Basics
  • Application: Database Design
  • Computing Information Theoretic Primitives
  • Open Problems

39
Data Anonymization Using Randomization
  • Goal: publish anonymized microdata to enable
    accurate ad hoc analyses, but ensure privacy of
    individuals' sensitive attributes
  • Key ideas:
  • Randomize numerical data: add noise from a known
    distribution
  • Reconstruct the original data distribution using the
    published noisy data
  • Issues:
  • How can the original data distribution be
    reconstructed?
  • What kinds of randomization preserve privacy of
    individuals?

40
Data Anonymization Using Randomization
  • Many randomization strategies proposed [AS00,
    AA01, EGS03]
  • Example randomization strategies: X in [0, 10]
  • R = X + µ (mod 11), µ uniform in {-1, 0, 1}
  • R = X + µ (mod 11), µ in {-1 (p = 0.25), 0 (p =
    0.5), 1 (p = 0.25)}
  • R = X (p = 0.6); R = µ, µ uniform in [0, 10]
    (p = 0.4)
  • Question:
  • Which randomization strategy has higher privacy
    preservation?
  • Quantify loss of privacy due to publication of
    randomized data

41
Data Anonymization Using Randomization
  • X in [0, 10], R1 = X + µ (mod 11), µ uniform
    in {-1, 0, 1}

42
Data Anonymization Using Randomization
  • X in [0, 10], R1 = X + µ (mod 11), µ uniform
    in {-1, 0, 1}

43
Data Anonymization Using Randomization
  • X in 0, 10, R1 X µ (mod 11), µ is uniform
    in -1, 0, 1

?
44
Reconstruction of Original Data Distribution
  • X in [0, 10], R1 = X + µ (mod 11), µ uniform
    in {-1, 0, 1}
  • Reconstruct the distribution of X using knowledge of
    R1 and µ
  • The EM algorithm converges to the MLE of the original
    distribution [AA01]

45
Analysis of Privacy AS00
  • X in [0, 10], R1 = X + µ (mod 11), µ uniform
    in {-1, 0, 1}
  • If X is uniform in [0, 10], privacy is determined by
    the range of µ

46
Analysis of Privacy AA01
  • X in [0, 10], R1 = X + µ (mod 11), µ uniform
    in {-1, 0, 1}
  • If X is uniform in {0, 1} ∪ {5, 6}, privacy is
    smaller than the range of µ

47
Analysis of Privacy AA01
  • X in [0, 10], R1 = X + µ (mod 11), µ uniform
    in {-1, 0, 1}
  • If X is uniform in {0, 1} ∪ {5, 6}, privacy is
    smaller than the range of µ
  • In some cases, the sensitive value is revealed

48
Quantify Loss of Privacy AA01
  • Goal: quantify loss of privacy based on mutual
    information I(X;R)
  • Smaller H(X|R) → more loss of privacy in X by
    knowledge of R
  • Larger I(X;R) → more loss of privacy in X by
    knowledge of R
  • I(X;R) = H(X) - H(X|R)
  • I(X;R) used to capture correlation between X and
    R
  • p(X) is the prior knowledge of sensitive
    attribute X
  • p(X, R) is the joint distribution of X and R

49
Quantify Loss of Privacy AA01
  • Goal: quantify loss of privacy based on mutual
    information I(X;R)
  • X is uniform in {5, 6}, R1 = X + µ (mod 11), µ is
    uniform in {-1, 0, 1}

50
Quantify Loss of Privacy AA01
  • Goal: quantify loss of privacy based on mutual
    information I(X;R)
  • X is uniform in {5, 6}, R1 = X + µ (mod 11), µ is
    uniform in {-1, 0, 1}

51
Quantify Loss of Privacy AA01
  • Goal: quantify loss of privacy based on mutual
    information I(X;R)
  • X is uniform in {5, 6}, R1 = X + µ (mod 11), µ is
    uniform in {-1, 0, 1}

52
Quantify Loss of Privacy AA01
  • Goal: quantify loss of privacy based on mutual
    information I(X;R)
  • X is uniform in {5, 6}, R1 = X + µ (mod 11), µ is
    uniform in {-1, 0, 1}
  • I(X;R1) = 0.33

53
Quantify Loss of Privacy AA01
  • Goal: quantify loss of privacy based on mutual
    information I(X;R)
  • X is uniform in {5, 6}, R2 = X + µ (mod 11), µ is
    uniform in {0, 1}
  • I(X;R1) = 0.33, I(X;R2) = 0.5 → R2 is a bigger
    privacy risk than R1

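These numbers can be checked mechanically; a sketch of the computation (using the slides' setup, with exact rational arithmetic for the joint distribution):

from fractions import Fraction
from math import log2

def mutual_information(joint):
    """I(X;R) from a joint distribution {(x, r): probability}."""
    px, pr = {}, {}
    for (x, r), p in joint.items():
        px[x] = px.get(x, 0) + p
        pr[r] = pr.get(r, 0) + p
    return sum(float(p) * log2(p / (px[x] * pr[r])) for (x, r), p in joint.items())

def randomized(xs, noises):
    """Joint distribution of (X, X + mu mod 11), X uniform on xs, mu uniform on noises."""
    joint, p = {}, Fraction(1, len(xs) * len(noises))
    for x in xs:
        for mu in noises:
            key = (x, (x + mu) % 11)
            joint[key] = joint.get(key, 0) + p
    return joint

print(mutual_information(randomized([5, 6], [-1, 0, 1])))  # I(X;R1) ~ 0.33
print(mutual_information(randomized([5, 6], [0, 1])))      # I(X;R2) = 0.5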
54
Quantify Loss of Privacy AA01
  • Equivalent goal: quantify loss of privacy based
    on H(X|R)
  • X is uniform in {5, 6}, R2 = X + µ (mod 11), µ is
    uniform in {0, 1}
  • Intuition: we know more about X given R2 than
    about X given R1
  • H(X|R1) = 0.67, H(X|R2) = 0.5 → R2 is a bigger
    privacy risk than R1

55
Quantify Loss of Privacy
  • Example: X is uniform in {0, 1}
  • R3 = e (p = 0.9999), R3 = X (p = 0.0001)
  • R4 = X (p = 0.6), R4 = 1 - X (p = 0.4)
  • Is R3 or R4 a bigger privacy risk?

56
Worst Case Loss of Privacy EGS03
  • Example: X is uniform in {0, 1}
  • R3 = e (p = 0.9999), R3 = X (p = 0.0001)
  • R4 = X (p = 0.6), R4 = 1 - X (p = 0.4)
  • I(X;R3) = 0.0001 << I(X;R4) = 0.028

57
Worst Case Loss of Privacy EGS03
  • Example: X is uniform in {0, 1}
  • R3 = e (p = 0.9999), R3 = X (p = 0.0001)
  • R4 = X (p = 0.6), R4 = 1 - X (p = 0.4)
  • I(X;R3) = 0.0001 << I(X;R4) = 0.028
  • But R3 has a larger worst-case risk

58
Worst Case Loss of Privacy EGS03
  • Goal: quantify worst case loss of privacy in X by
    knowledge of R
  • Use max KL divergence, instead of mutual
    information
  • Mutual information can be formulated as expected
    KL divergence
  • I(X;R) = Σx Σr p(x,r) log2(p(x,r)/p(x)p(r)) =
    dKL(p(X,R) || p(X)p(R))
  • I(X;R) = Σr p(r) Σx p(x|r) log2(p(x|r)/p(x)) =
    ER[dKL(p(X|r) || p(X))]
  • The [AA01] measure quantifies expected loss of
    privacy over R
  • [EGS03] propose a measure based on worst case
    loss of privacy
  • IW(X;R) = maxr dKL(p(X|r) || p(X))

59
Worst Case Loss of Privacy EGS03
  • Example: X is uniform in {0, 1}
  • R3 = e (p = 0.9999), R3 = X (p = 0.0001)
  • R4 = X (p = 0.6), R4 = 1 - X (p = 0.4)
  • IW(X;R3) = max{0.0, 1.0, 1.0} > IW(X;R4) =
    max{0.028, 0.028}

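A sketch of the worst-case measure on this example (the joint distributions are written out from the definitions above; 'e' is the masking output):

from math import log2

def worst_case_loss(joint):
    """IW(X;R) = max over r of dKL( p(X|r) || p(X) ), from {(x, r): probability}."""
    px, pr = {}, {}
    for (x, r), p in joint.items():
        px[x] = px.get(x, 0) + p
        pr[r] = pr.get(r, 0) + p
    return max(sum((p / pr[r0]) * log2((p / pr[r0]) / px[x])
                   for (x, r), p in joint.items() if r == r0)
               for r0 in pr)

# X uniform on {0, 1}
R3 = {(0, 'e'): 0.49995, (1, 'e'): 0.49995, (0, 0): 0.00005, (1, 1): 0.00005}
R4 = {(0, 0): 0.3, (0, 1): 0.2, (1, 1): 0.3, (1, 0): 0.2}
print(worst_case_loss(R3))  # 1.0: outputs 0 and 1 pin X down exactly
print(worst_case_loss(R4))  # ~0.029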
60
Worst Case Loss of Privacy EGS03
  • Example: X is uniform in {5, 6}
  • R1 = X + µ (mod 11), µ is uniform in {-1, 0, 1}
  • R2 = X + µ (mod 11), µ is uniform in {0, 1}
  • IW(X;R1) = max{1.0, 0.0, 0.0, 1.0} = IW(X;R2) =
    max{1.0, 0.0, 1.0}
  • Unable to capture that R2 is a bigger privacy
    risk than R1

61
Data Anonymization Summary
  • Randomization techniques useful for microdata
    anonymization
  • Randomization techniques differ in their loss of
    privacy
  • Information theoretic measures useful to capture
    loss of privacy
  • Expected KL divergence captures expected loss of
    privacy [AA01]
  • Maximum KL divergence captures worst case loss of
    privacy [EGS03]
  • Both are useful in practice

62
Outline
  • Part 1
  • Introduction to Information Theory
  • Application: Data Anonymization
  • Application: Data Integration
  • Part 2
  • Review of Information Theory Basics
  • Application: Database Design
  • Computing Information Theoretic Primitives
  • Open Problems

63
Schema Matching
  • Goal: align columns across database tables to be
    integrated
  • Fundamental problem in database integration
  • Early useful approach: textual similarity of
    column names
  • False positives: Address ≠ IP_Address
  • False negatives: Customer_Id = Client_Number
  • Early useful approach: overlap of values in
    columns, e.g., Jaccard
  • False positives: Emp_Id ≠ Project_Id
  • False negatives: Emp_Id = Personnel_Number

64
Opaque Schema Matching KN03
  • Goal: align columns when column names and data
    values are opaque
  • Databases belong to different government
    bureaucracies ?
  • Treat column names and data values as
    uninterpreted (generic)
  • Example: EMP_PROJ(Emp_Id, Proj_Id, Task_Id,
    Status_Id)
  • Likely that all Id fields are from the same
    domain
  • Different databases may have different column
    names

65
Opaque Schema Matching KN03
  • Approach: build a complete, labeled graph GD for
    each database D
  • Nodes are columns; label(node(X)) = H(X),
    label(edge(X, Y)) = I(X;Y)
  • Perform graph matching between GD1 and GD2,
    minimizing distance
  • Intuition:
  • Entropy H(X) captures the distribution of values in
    database column X
  • Mutual information I(X;Y) captures correlations
    between X, Y
  • Efficiency: graph matching between schema-sized
    graphs

66
Opaque Schema Matching KN03
  • Approach: build a complete, labeled graph GD for
    each database D
  • Nodes are columns; label(node(X)) = H(X),
    label(edge(X, Y)) = I(X;Y)

67
Opaque Schema Matching KN03
  • Approach: build a complete, labeled graph GD for
    each database D
  • Nodes are columns; label(node(X)) = H(X),
    label(edge(X, Y)) = I(X;Y)
  • H(A) = 1.5, H(B) = 2.0, H(C) = 1.0, H(D) = 1.5

68
Opaque Schema Matching KN03
  • Approach: build a complete, labeled graph GD for
    each database D
  • Nodes are columns; label(node(X)) = H(X),
    label(edge(X, Y)) = I(X;Y)
  • H(A) = 1.5, H(B) = 2.0, H(C) = 1.0, H(D) = 1.5,
    I(A;B) = 1.5

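A sketch of the graph construction (the two-column instance below is hypothetical, chosen so that H(A) = 1.5, H(B) = 2.0 and I(A;B) = 1.5 as on the slide):

from collections import Counter
from itertools import combinations
from math import log2

def H(col):
    n = len(col)
    return sum((c / n) * log2(n / c) for c in Counter(col).values())

def I(x, y):
    return H(x) + H(y) - H(list(zip(x, y)))

def schema_graph(table):
    """Complete labeled graph: nodes labeled H(X), edges labeled I(X;Y)."""
    nodes = {name: H(col) for name, col in table.items()}
    edges = {(a, b): I(table[a], table[b]) for a, b in combinations(table, 2)}
    return nodes, edges

table = {'A': [1, 1, 2, 3], 'B': ['w', 'x', 'y', 'z']}
print(schema_graph(table))  # {'A': 1.5, 'B': 2.0}, {('A', 'B'): 1.5}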
69
Opaque Schema Matching KN03
  • Approach: build a complete, labeled graph GD for
    each database D
  • Nodes are columns; label(node(X)) = H(X),
    label(edge(X, Y)) = I(X;Y)

70
Opaque Schema Matching KN03
  • Approach: build a complete, labeled graph GD for
    each database D
  • Nodes are columns; label(node(X)) = H(X),
    label(edge(X, Y)) = I(X;Y)
  • Perform graph matching between GD1 and GD2,
    minimizing distance
  • [KN03] uses Euclidean and normal distance metrics

71
Opaque Schema Matching KN03
  • Approach: build a complete, labeled graph GD for
    each database D
  • Nodes are columns; label(node(X)) = H(X),
    label(edge(X, Y)) = I(X;Y)
  • Perform graph matching between GD1 and GD2,
    minimizing distance

72
Opaque Schema Matching KN03
  • Approach: build a complete, labeled graph GD for
    each database D
  • Nodes are columns; label(node(X)) = H(X),
    label(edge(X, Y)) = I(X;Y)
  • Perform graph matching between GD1 and GD2,
    minimizing distance

73
Heterogeneity Identification DKOSV06
  • Goal: identify columns with semantically
    heterogeneous values
  • Can arise due to opaque schema matching [KN03]
  • Key ideas:
  • Heterogeneity based on distribution,
    distinguishability of values
  • Use the Information Bottleneck to compute a soft
    clustering of values
  • Issues:
  • Which information theoretic measure characterizes
    heterogeneity?
  • How to set parameters in the Information
    Bottleneck method?

74
Heterogeneity Identification DKOSV06
  • Example: semantically homogeneous, heterogeneous
    columns

75
Heterogeneity Identification DKOSV06
  • Example: semantically homogeneous, heterogeneous
    columns

76
Heterogeneity Identification DKOSV06
  • Example: semantically homogeneous, heterogeneous
    columns
  • More semantic types in a column → greater
    heterogeneity
  • Only email versus email + phone

77
Heterogeneity Identification DKOSV06
  • Example: semantically homogeneous, heterogeneous
    columns

78
Heterogeneity Identification DKOSV06
  • Example: semantically homogeneous, heterogeneous
    columns
  • Relative distribution of semantic types impacts
    heterogeneity
  • Mainly email + few phone versus balanced email +
    phone

79
Heterogeneity Identification DKOSV06
  • Example: semantically homogeneous, heterogeneous
    columns

80
Heterogeneity Identification DKOSV06
  • Example: semantically homogeneous, heterogeneous
    columns

81
Heterogeneity Identification DKOSV06
  • Example: semantically homogeneous, heterogeneous
    columns
  • More easily distinguished types → greater
    heterogeneity
  • Phone + (possibly) SSN versus balanced email +
    phone

82
Heterogeneity Identification DKOSV06
  • Heterogeneity = space complexity of a soft
    clustering of the data
  • More, balanced clusters → greater heterogeneity
  • More distinguishable clusters → greater
    heterogeneity
  • Soft clustering:
  • Soft → assign probabilities to membership of
    values in clusters
  • How many clusters: a tradeoff between space and
    quality
  • Use the Information Bottleneck to compute a soft
    clustering of values

83
Heterogeneity Identification DKOSV06
  • Hard clustering

84
Heterogeneity Identification DKOSV06
  • Soft clustering: cluster membership probabilities
  • How to compute a good soft clustering?

85
Heterogeneity Identification DKOSV06
  • Represent strings as q-gram distributions

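A minimal sketch of the q-gram feature representation (the example strings are assumptions):

from collections import Counter

def qgram_distribution(s, q=2):
    """Normalized q-gram histogram of a string: its p(V|x) feature vector."""
    grams = [s[i:i + q] for i in range(len(s) - q + 1)]
    n = len(grams) or 1
    return {g: c / n for g, c in Counter(grams).items()}

print(qgram_distribution("555-1212"))      # phone-like mass on digit bigrams
print(qgram_distribution("ann@host.com"))  # email-like mass on letter bigrams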
86
Heterogeneity Identification DKOSV06
  • iIB: find soft clustering T of X that minimizes
    I(T;X) - βI(T;V)
  • Allow iIB to use arbitrarily many clusters; use
    β = H(X)/I(X;V)
  • Closest to the point with minimum space and maximum
    quality

87
Heterogeneity Identification DKOSV06
  • Rate distortion curve: I(T;V)/I(X;V) vs
    I(T;X)/H(X)
(figure: rate distortion curve, with the chosen β marked)

88
Heterogeneity Identification DKOSV06
  • Heterogeneity = mutual information I(T;X) of the
    iIB clustering T at β
  • 0 ≤ I(T;X) (= 0.126) ≤ H(X) (= 2.0), H(T) (= 1.0)
  • Ideally use iIB with an arbitrarily large number
    of clusters in T

89
Heterogeneity Identification DKOSV06
  • Heterogeneity = mutual information I(T;X) of the
    iIB clustering T at β

90
Data Integration Summary
  • Analyzing database instance critical for
    effective data integration
  • Matching and quality assessments are key
    components
  • Information theoretic measures useful for schema
    matching
  • Align columns when column names, data values are
    opaque
  • Mutual information I(X;V) captures correlations
    between X, V
  • Information theoretic measures useful for
    heterogeneity testing
  • Identify columns with semantically heterogeneous
    values
  • I(T;X) of the iIB clustering T at β captures column
    heterogeneity

91
Outline
  • Part 1
  • Introduction to Information Theory
  • Application: Data Anonymization
  • Application: Data Integration
  • Part 2
  • Review of Information Theory Basics
  • Application: Database Design
  • Computing Information Theoretic Primitives
  • Open Problems

92
Review of Information Theory Basics
  • Discrete distribution: probability p(X)
  • p(X,Y) = Σz p(X, Y, Z=z)

93
Review of Information Theory Basics
  • Discrete distribution: probability p(X)
  • p(Y) = Σx p(X=x, Y) = Σx Σz p(X=x, Y, Z=z)

94
Review of Information Theory Basics
  • Discrete distribution: conditional probability
    p(X|Y)
  • p(X,Y) = p(X|Y)p(Y) = p(Y|X)p(X)

95
Review of Information Theory Basics
  • Discrete distribution: entropy H(X)
  • h(x) = log2(1/p(x))
  • H(X) = Σx p(x)h(x) = 1.75
  • H(Y) = Σy p(y)h(y) = 1.5 (< log2(|Y|) = 1.58)
  • H(X,Y) = Σx Σy p(x,y)h(x,y) = 2.25 (<
    log2(|X,Y|) = 2.32)

96
Review of Information Theory Basics
  • Discrete distribution: conditional entropy H(X|Y)
  • h(x|y) = log2(1/p(x|y))
  • H(X|Y) = Σx Σy p(x,y)h(x|y) = 0.75
  • H(X|Y) = H(X,Y) - H(Y) = 2.25 - 1.5

97
Review of Information Theory Basics
  • Discrete distribution: mutual information I(X;Y)
  • i(x;y) = log2(p(x,y)/p(x)p(y))
  • I(X;Y) = Σx Σy p(x,y)i(x;y) = 1.0
  • I(X;Y) = H(X) + H(Y) - H(X,Y) = 1.75 + 1.5 - 2.25

98
Outline
  • Part 1
  • Introduction to Information Theory
  • Application: Data Anonymization
  • Application: Data Integration
  • Part 2
  • Review of Information Theory Basics
  • Application: Database Design
  • Computing Information Theoretic Primitives
  • Open Problems

99
Information Dependencies DR00
  • Goal: use information theory to examine and
    reason about the information content of the
    attributes in a relation instance
  • Key ideas:
  • Novel InD measure between attribute sets X, Y
    based on H(Y|X)
  • Identify numeric inequalities between InD
    measures
  • Results:
  • InD measures are a broader class than FDs and
    MVDs
  • Armstrong axioms for FDs derivable from InD
    inequalities
  • MVD inference rules derivable from InD
    inequalities

100
Information Dependencies DR00
  • Functional dependency: X → Y
  • FD X → Y holds iff ∀ t1, t2 ((t1[X] = t2[X]) ⇒
    (t1[Y] = t2[Y]))

101
Information Dependencies DR00
  • Functional dependency: X → Y
  • FD X → Y holds iff ∀ t1, t2 ((t1[X] = t2[X]) ⇒
    (t1[Y] = t2[Y]))

102
Information Dependencies DR00
  • Result: FD X → Y holds iff H(Y|X) = 0
  • Intuition: once X is known, there is no remaining
    uncertainty in Y
  • H(Y|X) = 0.5

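A sketch of the FD test (the instance is hypothetical; the violating column is chosen to give H(Y|X) = 0.5 as above):

from collections import Counter
from math import log2

def H(items):
    n = len(items)
    return sum((c / n) * log2(n / c) for c in Counter(items).values())

def cond_entropy(y, x):
    """H(Y|X) = H(X,Y) - H(X); the FD X -> Y holds iff this is 0."""
    return H(list(zip(x, y))) - H(x)

X = [1, 1, 2, 2]
Y_ok = ['a', 'a', 'b', 'b']    # FD X -> Y holds
Y_bad = ['a', 'b', 'b', 'b']   # violated on X = 1
print(cond_entropy(Y_ok, X))   # 0.0
print(cond_entropy(Y_bad, X))  # 0.5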
103
Information Dependencies DR00
  • Multi-valued dependency: X →→ Y
  • MVD X →→ Y holds iff R(X,Y,Z) = R(X,Y) ⋈
    R(X,Z)

104
Information Dependencies DR00
  • Multi-valued dependency: X →→ Y
  • MVD X →→ Y holds iff R(X,Y,Z) = R(X,Y) ⋈
    R(X,Z)


105
Information Dependencies DR00
  • Multi-valued dependency: X →→ Y
  • MVD X →→ Y holds iff R(X,Y,Z) = R(X,Y) ⋈
    R(X,Z)


106
Information Dependencies DR00
  • Result: MVD X →→ Y holds iff H(Y,Z|X) = H(Y|X) +
    H(Z|X)
  • Intuition: once X is known, uncertainties in Y and Z
    are independent
  • H(Y|X) = 0.5, H(Z|X) = 0.75, H(Y,Z|X) = 1.25


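And the corresponding MVD test, sketched on a hypothetical four-tuple instance in which {Y, Z} is a cross product for each X value:

from collections import Counter
from math import log2

def H(items):
    n = len(items)
    return sum((c / n) * log2(n / c) for c in Counter(items).values())

def cond_H(rows, A, B):
    """H(A|B) on a relation instance; A, B are lists of column indices."""
    proj = lambda idx: [tuple(r[i] for i in idx) for r in rows]
    return H(proj(A + B)) - H(proj(B))

rows = [(1, 'a', 'p'), (1, 'a', 'q'), (1, 'b', 'p'), (1, 'b', 'q')]
X, Y, Z = [0], [1], [2]
lhs = cond_H(rows, Y + Z, X)
rhs = cond_H(rows, Y, X) + cond_H(rows, Z, X)
print(lhs, rhs)  # 2.0 2.0: equal, so the MVD X ->> Y holds here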
107
Information Dependencies DR00
  • Result: Armstrong axioms for FDs derivable from
    InD inequalities
  • Reflexivity: if Y ⊆ X, then X → Y
  • H(Y|X) = 0 for Y ⊆ X
  • Augmentation: X → Y ⇒ X,Z → Y,Z
  • 0 ≤ H(Y,Z|X,Z) = H(Y|X,Z) ≤ H(Y|X) = 0
  • Transitivity: X → Y & Y → Z ⇒ X → Z
  • 0 ≤ H(Z|X) ≤ H(Y|X) + H(Z|Y) = 0

108
Database Normal Forms
  • Goal: eliminate update anomalies by good database
    design
  • Need to know the integrity constraints on all
    database instances
  • Boyce-Codd normal form:
  • Input: a set Σ of functional dependencies
  • For every (non-trivial) FD R.X → R.Y ∈ Σ, R.X is
    a key of R
  • 4NF:
  • Input: a set Σ of functional and multi-valued
    dependencies
  • For every (non-trivial) MVD R.X →→ R.Y ∈ Σ,
    R.X is a key of R

109
Database Normal Forms
  • Functional dependency: X → Y
  • Which design is better?


110
Database Normal Forms
  • Functional dependency: X → Y
  • Which design is better?
  • The decomposition is in BCNF


111
Database Normal Forms
  • Multi-valued dependency: X →→ Y
  • Which design is better?


112
Database Normal Forms
  • Multi-valued dependency: X →→ Y
  • Which design is better?
  • The decomposition is in 4NF


113
Well-Designed Databases AL03
  • Goal: use information theory to characterize the
    goodness of a database design and reason about
    normalization algorithms
  • Key idea:
  • Information content measure of a cell in a DB
    instance w.r.t. ICs
  • Redundancy reduces the information content measure of
    cells
  • Results:
  • Well-designed DB → each cell has information
    content > 0
  • Normalization algorithms never decrease
    information content

114
Well-Designed Databases AL03
  • Information content of cell c in database D
    satisfying FD X → Y
  • Uniform distribution p(V) on values for c
    consistent with D\c and the FD
  • Information content of cell c is the entropy H(V)
  • H(V62) = 2.0

115
Well-Designed Databases AL03
  • Information content of cell c in database D
    satisfying FD X → Y
  • Uniform distribution p(V) on values for c
    consistent with D\c and the FD
  • Information content of cell c is the entropy H(V)
  • H(V22) = 0.0

116
Well-Designed Databases AL03
  • Information content of cell c in database D
    satisfying FD X → Y
  • Information content of cell c is the entropy H(V)
  • Schema S is in BCNF iff ∀ D ∈ S, H(V) > 0 for
    all cells c in D
  • Technicalities w.r.t. the size of the active domain

117
Well-Designed Databases AL03
  • Information content of cell c in database D
    satisfying FD X → Y
  • Information content of cell c is the entropy H(V)
  • H(V12) = 2.0, H(V42) = 2.0

118
Well-Designed Databases AL03
  • Information content of cell c in database D
    satisfying FD X → Y
  • Information content of cell c is the entropy H(V)
  • Schema S is in BCNF iff ∀ D ∈ S, H(V) > 0 for
    all cells c in D

119
Well-Designed Databases AL03
  • Information content of cell c in DB D satisfying
    MVD X →→ Y
  • Information content of cell c is the entropy H(V)
  • H(V52) = 0.0, H(V53) = 2.32

120
Well-Designed Databases AL03
  • Information content of cell c in DB D satisfying
    MVD X →→ Y
  • Information content of cell c is the entropy H(V)
  • Schema S is in 4NF iff ∀ D ∈ S, H(V) > 0 for all
    cells c in D

121
Well-Designed Databases AL03
  • Information content of cell c in DB D satisfying
    MVD X →→ Y
  • Information content of cell c is the entropy H(V)
  • H(V32) = 1.58, H(V34) = 2.32

122
Well-Designed Databases AL03
  • Information content of cell c in DB D satisfying
    MVD X →→ Y
  • Information content of cell c is the entropy H(V)
  • Schema S is in 4NF iff ∀ D ∈ S, H(V) > 0 for all
    cells c in D

123
Well-Designed Databases AL03
  • Normalization algorithms never decrease
    information content
  • Information content of cell c is the entropy H(V)

124
Well-Designed Databases AL03
  • Normalization algorithms never decrease
    information content
  • Information content of cell c is the entropy H(V)


125
Well-Designed Databases AL03
  • Normalization algorithms never decrease
    information content
  • Information content of cell c is the entropy H(V)


126
Database Design Summary
  • Good database design is essential for preserving
    data integrity
  • Information theoretic measures useful for
    integrity constraints
  • FD X → Y holds iff InD measure H(Y|X) = 0
  • MVD X →→ Y holds iff H(Y,Z|X) = H(Y|X) + H(Z|X)
  • Information theory to model correlations in a
    specific database
  • Information theoretic measures useful for normal
    forms
  • Schema S is in BCNF/4NF iff ∀ D ∈ S, H(V) > 0
    for all cells c in D
  • Information theory to model distributions over
    possible databases

127
Outline
  • Part 1
  • Introduction to Information Theory
  • Application: Data Anonymization
  • Application: Data Integration
  • Part 2
  • Review of Information Theory Basics
  • Application: Database Design
  • Computing Information Theoretic Primitives
  • Open Problems

128
Domain size matters
  • For random variable X, domain size = |supp(X)| =
    |{xi : p(X = xi) > 0}|
  • Different solutions exist depending on whether the
    domain size is small or large
  • Probability vectors are usually very sparse

129
Entropy: Case I - Small domain size
  • Suppose the number of unique values for a random
    variable X is small (i.e., fits in memory)
  • Maximum likelihood estimator:
  • p(x) = (# times x is encountered)/(total number of
    items in the set)

(figure: a stream of values 1 2 1 2 1 5 4 tallied into bins 1 through 5)
130
Entropy: Case I - Small domain size
  • HMLE = Σx p(x) log 1/p(x)
  • This is a biased estimate:
  • E[HMLE] < H
  • Miller-Madow correction:
  • H ≈ HMLE + (m - 1)/2n
  • m is an estimate of the number of non-empty bins
  • n = number of samples
  • Bad news: ALL estimators for H are biased.
  • Good news: we can quantify the bias and variance of
    the MLE:
  • Bias ≤ log(1 + m/N)
  • Var(HMLE) ≤ (log n)²/N

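A sketch of both estimators (the sample is synthetic; the correction term follows the slide's formula, whose exact constant depends on the logarithm base):

import random
from collections import Counter
from math import log2

def entropy_mle(sample):
    n = len(sample)
    return sum((c / n) * log2(n / c) for c in Counter(sample).values())

def entropy_miller_madow(sample):
    """HMLE plus the (m - 1)/2n bias correction, m = number of occupied bins."""
    n, m = len(sample), len(set(sample))
    return entropy_mle(sample) + (m - 1) / (2 * n)

random.seed(0)
sample = [random.randrange(8) for _ in range(100)]  # true entropy = 3 bits
print(entropy_mle(sample))           # typically a bit below 3
print(entropy_miller_madow(sample))  # the correction pushes the estimate up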
131
Entropy: Case II - Large domain size
  • |X| is too large to fit in main memory, so we
    can't maintain explicit counts.
  • Streaming algorithms for H(X):
  • Long history of work on this problem
  • Bottom line:
  • (1±ε)-relative-approximation for H(X) that allows
    for updates to frequencies, and requires almost
    constant, and optimal, space [HNO08].

132
Streaming Entropy CCM07
  • High level idea: sample randomly from the stream,
    and track counts of elements picked [AMS]
  • PROBLEM: a skewed distribution prevents us from
    sampling lower-frequency elements (and entropy is
    small)
  • Idea: estimate the largest frequency, and the
    distribution of what's left (higher entropy)

133
Streaming Entropy CCM07
  • Maintain a set of samples from the original
    distribution and from the distribution without the
    most frequent element.
  • In parallel, maintain an estimator for the frequency
    of the most frequent element:
  • normally this is hard,
  • but if the frequency is very large, then a simple
    estimator exists [MG81] (Google interview
    puzzle!)
  • At the end, compute a function of these two
    estimates
  • Memory usage: roughly (1/ε²) log(1/ε) (ε is the
    error)

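The [MG81] trick referred to above is, in its simplest form, the one-counter majority-vote sketch; a sketch (assumes the most frequent element really does dominate the stream):

def majority_candidate(stream):
    """One candidate + one counter: returns the majority element
    whenever some element has frequency > half the stream length."""
    candidate, count = None, 0
    for x in stream:
        if count == 0:
            candidate, count = x, 1
        elif x == candidate:
            count += 1
        else:
            count -= 1
    return candidate

print(majority_candidate([7, 2, 7, 7, 3, 7, 7]))  # 7, in one pass and O(1) space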
134
Entropy and MI are related
  • I(X;Y) = H(X) + H(Y) - H(X,Y)
  • Suppose we can c-approximate H(X) for any c > 0:
  • find H'(X) s.t. |H(X) - H'(X)| < c
  • Then we can 3c-approximate I(X;Y):
  • I'(X;Y) = H'(X) + H'(Y) - H'(X,Y)
  • < (H(X)+c) + (H(Y)+c) -
    (H(X,Y)-c)
  • = H(X) + H(Y) - H(X,Y) + 3c
  • = I(X;Y) + 3c
  • Similarly, we can 2c-approximate H(Y|X) = H(X,Y)
    - H(X)
  • Estimating entropy allows us to estimate I(X;Y)
    and H(Y|X)

135
Computing KL-divergence: Small Domains
  • Easy algorithm: maintain counts for each of p
    and q, normalize, and compute the KL-divergence.
  • PROBLEM! Suppose qi = 0:
  • pi log pi/qi is undefined!
  • General problem with ML estimators: all events
    not seen have probability zero!!
  • Laplace correction: add one to the count of every
    domain element
  • Slightly better: add 0.5 instead [KT81]
  • Even better, more involved: use the Good-Turing
    estimator [GT53]
  • Yields non-zero probability for things not seen.

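A sketch of the smoothed estimate (add-one is the Laplace correction, add-0.5 the [KT81] one; domain and samples are assumptions):

from collections import Counter
from math import log2

def smoothed(sample, domain, alpha=0.5):
    """Add-alpha smoothing: every domain value gets nonzero probability."""
    counts = Counter(sample)
    total = len(sample) + alpha * len(domain)
    return {v: (counts[v] + alpha) / total for v in domain}

def d_kl(p, q):
    return sum(p[v] * log2(p[v] / q[v]) for v in p if p[v] > 0)

domain = range(4)
p = smoothed([0, 0, 1, 2], domain)
q = smoothed([1, 1, 3, 3], domain)  # value 0 unseen, but q[0] > 0 after smoothing
print(d_kl(p, q))                   # finite, well-defined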
136
Computing KL-divergence: Large Domains
  • Bad news: no good relative-approximations exist
    in small space.
  • (Partial) good news: additive approximations in
    small space under certain technical conditions
    (no pi is too small).
  • (Partial) good news: additive approximations for
    a symmetric variant of the KL-divergence, via
    sampling.
  • For details, see [GMV08, GIM08]

137
Information-theoretic Clustering
  • Given a collection of random variables X, each
    explained by a random variable Y, we wish to
    find a (hard or soft) clustering T such that
  • I(T;X) - βI(T;Y) is minimized.
  • Features of solutions thus far:
  • heuristic (the general problem is NP-hard)
  • address both small-domain and large-domain
    scenarios.

138
Agglomerative Clustering (aIB) ST00
  • Fix the number of clusters k
  • While the number of clusters > k:
  • Determine the two clusters whose merge loses the
    least information
  • Combine these two clusters
  • Output the clustering
  • Merge criterion:
  • merge the two clusters so that the change in I(T;V)
    is minimized
  • Note: no consideration of β (the number of clusters
    is fixed)

139
Agglomerative Clustering (aIB) [S]
  • Elegant way of finding the two clusters to be
    merged
  • Let dJS(p,q) = (1/2)(dKL(p,m) + dKL(q,m)), m =
    (p+q)/2
  • dJS(p,q) is a symmetric distance between p, q
    (Jensen-Shannon distance)
  • We merge the clusters that have the smallest dJS(p,q),
    weighted by cluster mass

(figure: distributions p and q with their midpoint m)
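
A sketch of the merge distance (distributions assumed):

from math import log2

def d_kl(p, q):
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def d_js(p, q):
    """Jensen-Shannon distance: symmetric and always finite."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * d_kl(p, m) + 0.5 * d_kl(q, m)

p, q = [0.7, 0.2, 0.1], [0.1, 0.3, 0.6]
print(d_js(p, q), d_js(q, p))  # equal, unlike dKL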
140
Iterative Information Bottleneck (iIB) [S]
  • aIB yields a hard clustering with k clusters.
  • If you want a soft clustering, use iIB (a variant
    of EM):
  • Step 1: p(t|x) ∝ exp(-β dKL(p(V|x), p(V|t)))
  • assign elements to clusters in proportion
    (exponentially) to their distance from the cluster
    center!
  • Step 2: compute new cluster centers by taking
    weighted centroids:
  • p(t) = Σx p(t|x) p(x)
  • p(V|t) = Σx p(V|x) p(t|x) p(x)/p(t)
  • Choose β according to [DKOSV06]

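A sketch of one iIB iteration under these updates (the data and β are assumptions; Step 1 follows the slide, and the fuller derivation in [TPB98] also multiplies the assignment weight by the current p(t)):

from math import exp, log2

def d_kl(p, q):
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def iib_iteration(px, pV_x, pV_t, beta):
    """Step 1: soft-assign points; Step 2: recompute p(t) and centers p(V|t)."""
    n, k, m = len(px), len(pV_t), len(pV_x[0])
    # Step 1: p(t|x) proportional to exp(-beta * dKL(p(V|x), p(V|t)))
    pt_x = []
    for i in range(n):
        w = [exp(-beta * d_kl(pV_x[i], pV_t[t])) for t in range(k)]
        z = sum(w)
        pt_x.append([wi / z for wi in w])
    # Step 2: p(t) = sum_x p(t|x) p(x); p(V|t) = sum_x p(V|x) p(t|x) p(x) / p(t)
    pt = [sum(px[i] * pt_x[i][t] for i in range(n)) for t in range(k)]
    centers = [[sum(px[i] * pt_x[i][t] * pV_x[i][j] for i in range(n)) / pt[t]
                for j in range(m)] for t in range(k)]
    return pt_x, pt, centers

px = [0.5, 0.5]
pV_x = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]  # two points with distinct p(V|x)
pV_t = [[0.6, 0.2, 0.2], [0.2, 0.2, 0.6]]  # two cluster centers seeded apart
print(iib_iteration(px, pV_x, pV_t, beta=5.0))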
141
Dealing with massive data sets
  • Clustering on massive data sets is a problem
  • Two main heuristics:
  • Sampling [DKOSV06]:
  • pick a small sample of the data, cluster it, and
    (if necessary) assign the remaining points to
    clusters using soft assignment.
  • How many points to sample to get good bounds?
  • Streaming:
  • Scan the data in one pass, performing clustering
    on the fly
  • How much memory is needed to get a reasonable
    quality solution?

142
LIMBO (for aIB) [ATMS04]
  • BIRCH-like idea:
  • Maintain a (sparse) summary for each cluster (p(t),
    p(V|t))
  • As data streams in, build clusters on groups of
    objects
  • Build next-level clusters on cluster summaries
    from the lower level

143
Outline
  • Part 1
  • Introduction to Information Theory
  • Application: Data Anonymization
  • Application: Data Integration
  • Part 2
  • Review of Information Theory Basics
  • Application: Database Design
  • Computing Information Theoretic Primitives
  • Open Problems

144
Open Problems
  • Data exploration and mining: information theory
    as a first-pass filter
  • Relation to nonparametric generative models in
    machine learning (LDA, PPCA, ...)
  • Engineering and stability: finding the right knobs
    to make systems reliable and scalable
  • Other information-theoretic concepts? (rate
    distortion, higher-order entropy, ...)

THANK YOU !
145
References Information Theory
  • [CT] Tom Cover and Joy Thomas: Elements of
    Information Theory.
  • [BMDG05] Arindam Banerjee, Srujana Merugu,
    Inderjit Dhillon, Joydeep Ghosh: Clustering with
    Bregman divergences. JMLR 2005.
  • [TPB98] Naftali Tishby, Fernando Pereira, William
    Bialek: The information bottleneck method. Proc.
    37th Annual Allerton Conference, 1998.

146
References Data Anonymization
  • [AA01] Dakshi Agrawal, Charu C. Aggarwal: On the
    design and quantification of privacy preserving
    data mining algorithms. PODS 2001.
  • [AS00] Rakesh Agrawal, Ramakrishnan Srikant:
    Privacy preserving data mining. SIGMOD 2000.
  • [EGS03] Alexandre Evfimievski, Johannes Gehrke,
    Ramakrishnan Srikant: Limiting privacy breaches
    in privacy preserving data mining. PODS 2003.

147
References Data Integration
  • [AMT04] Periklis Andritsos, Renée J. Miller,
    Panayiotis Tsaparas: Information-theoretic tools
    for mining database structure from large data
    sets. SIGMOD 2004.
  • [DKOSV06] Bing Tian Dai, Nick Koudas, Beng Chin
    Ooi, Divesh Srivastava, Suresh Venkatasubramanian:
    Rapid identification of column heterogeneity.
    ICDM 2006.
  • [DKSTV08] Bing Tian Dai, Nick Koudas, Divesh
    Srivastava, Anthony K. H. Tung, Suresh
    Venkatasubramanian: Validating multi-column
    schema matchings by type. ICDE 2008.
  • [KN03] Jaewoo Kang, Jeffrey F. Naughton: On
    schema matching with opaque column names and data
    values. SIGMOD 2003.
  • [PPH05] Patrick Pantel, Andrew Philpot, Eduard
    Hovy: An information theoretic model for database
    alignment. SSDBM 2005.

148
References Database Design
  • [AL03] Marcelo Arenas, Leonid Libkin: An
    information-theoretic approach to normal forms
    for relational and XML data. PODS 2003.
  • [AL05] Marcelo Arenas, Leonid Libkin: An
    information-theoretic approach to normal forms
    for relational and XML data. JACM 52(2), 246-283,
    2005.
  • [DR00] Mehmet M. Dalkilic, Edward L. Robertson:
    Information dependencies. PODS 2000.
  • [KL06] Solmaz Kolahi, Leonid Libkin: On
    redundancy vs dependency preservation in
    normalization: an information-theoretic study of
    XML. PODS 2006.

149
References Computing IT quantities
  • [P03] Liam Paninski: Estimation of entropy and
    mutual information. Neural Computation 15:
    1191-1254, 2003.
  • [GT53] I. J. Good: Turing's anticipation of
    empirical Bayes in connection with the
    cryptanalysis of the Naval Enigma. Journal of
    Statistical Computation and Simulation, 66(2),
    2000.
  • [KT81] R. E. Krichevsky and V. K. Trofimov: The
    performance of universal encoding. IEEE Trans.
    Inform. Theory 27 (1981), 199-207.
  • [CCM07] Amit Chakrabarti, Graham Cormode and
    Andrew McGregor: A near-optimal algorithm for
    computing the entropy of a stream. Proc. SODA
    2007.
  • [HNO08] Nick Harvey, Jelani Nelson, Krzysztof
    Onak: Sketching and streaming entropy via
    approximation theory. FOCS 2008.
  • [ATMS04] Periklis Andritsos, Panayiotis Tsaparas,
    Renée J. Miller and Kenneth C. Sevcik: LIMBO:
    scalable clustering of categorical data. EDBT 2004.

150
References Computing IT quantities
  • [S] Noam Slonim: The Information Bottleneck:
    theory and applications. Ph.D. thesis, Hebrew
    University, 2000.
  • [GMV08] Sudipto Guha, Andrew McGregor, Suresh
    Venkatasubramanian: Streaming and sublinear
    approximations for information distances. ACM
    Trans. Algorithms, 2008.
  • [GIM08] Sudipto Guha, Piotr Indyk, Andrew
    McGregor: Sketching information distances. JMLR,
    2008.
