Title: Additive Data Perturbation: the Basic Problem and Techniques
1. Additive Data Perturbation: the Basic Problem and Techniques
2. Outline
- Motivation
- Definition
- Privacy metrics
- Distribution reconstruction methods
- Privacy-preserving data mining with additive data perturbation
- Summary
- Note: we focus on papers 10 and 11
3. Motivation
- Web-based computing
- Observations
- Only a few sensitive attributes need protection
- Allow individual users to perform the protection at low cost
- Some data mining algorithms work on distributions instead of individual records
4. Definition of dataset
- Column-by-row table
- Each row is a record, or a vector
- Each column represents an attribute
- We also call it multidimensional data
[Figure: two records in a 3-attribute dataset; each row is a 3-dimensional record]
5. Additive perturbation
- Definition
- Z = X + Y
- X is the original value, Y is random noise, and Z is the perturbed value
- Data Z and the parameters of Y are published
- e.g., Y is Gaussian N(0,1)
- History
- Used in statistical databases to protect sensitive attributes (late 80s to 90s); see paper 14
- Benefit
- Allows distribution reconstruction
- Allows individual users to do the perturbation themselves
- Only the noise distribution needs to be published
- (a minimal perturbation sketch follows)
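A minimal sketch of the Z = X + Y step under the Gaussian-noise example above; NumPy, the seed, and the function name `perturb` are my assumptions, not from the papers.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb(x, sigma=1.0):
    # z = x + y, where y ~ N(0, sigma^2); only sigma (the noise parameter)
    # is published alongside z, never y itself
    y = rng.normal(loc=0.0, scale=sigma, size=len(x))
    return x + y

x = np.array([52.0, 67.0, 48.0, 71.0])  # original sensitive values
z = perturb(x)                          # values safe to publish
```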
6. Applications in data mining
- Distribution reconstruction algorithms
- Rakesh's algorithm
- Expectation-Maximization (EM) algorithm
- Column-distribution based algorithms
- Decision tree
- Naïve Bayes classifier
7. Major issues
- Privacy metrics
- Distribution reconstruction algorithms
- Metrics for loss of information
- A tradeoff between loss of information and
privacy
8. Privacy metrics for additive perturbation
- Variance/confidence based definition
- Mutual information based definition
9. Variance/confidence based definition
- Method
- Based on the attacker's view: value estimation
- The attacker knows the perturbed data and the noise distribution
- No other prior knowledge
- Estimation method: a confidence interval, i.e., the range around the perturbed value that contains the real value with probability c
- Y: zero mean, standard deviation σ
- σ is the important factor, i.e., var(Z - X) = σ²
- Given Z, X falls in the Z ± σ range with confidence c
- We often ignore the confidence c and use σ to represent the difficulty of value estimation (see the numeric sketch below)
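A small numeric sketch of the confidence-interval view, assuming Gaussian noise; the use of scipy.stats and the sample values are my choices for illustration.

```python
from scipy.stats import norm

sigma = 2.0                               # published noise std
c = 0.95                                  # desired confidence
half_width = norm.ppf((1 + c) / 2) * sigma  # ~1.96 * sigma for c = 0.95

z = 50.0  # one published perturbed value
print(f"X lies in [{z - half_width:.2f}, {z + half_width:.2f}] with probability {c}")
```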
10. Problems with the var/conf metric
- No knowledge about the original data is incorporated
- Knowledge about the original data distribution
- will be discovered by distribution reconstruction in additive perturbation
- can be known a priori in some applications
- Other prior knowledge may introduce more types of attacks
- Privacy evaluation needs to incorporate these attacks
11. Mutual information based method
- Incorporates the original data distribution
- Concept: uncertainty → entropy
- Difficulty of estimation = the amount of privacy
- Intuition: knowing the perturbed data Z and the distribution of the noise Y, how much is the uncertainty of X reduced?
- If Z and Y do not help in estimating X → all uncertainty of X is preserved: privacy = 1
- Otherwise, 0 < privacy < 1
12. Definition of mutual information
- Entropy h(A): evaluates the uncertainty of A
- Hard to estimate → high entropy
- Among distributions over the same finite range, the uniform one has the largest entropy
- Conditional entropy h(A|B)
- If we know the random variable B, how much uncertainty of A remains?
- If B is not independent of A, the uncertainty of A can be reduced (B helps explain A), i.e., h(A|B) < h(A)
- Mutual information: I(A;B) = h(A) - h(A|B)
- Evaluates the information brought by B in estimating A
- Note: I(A;B) = I(B;A)
13. Inherent privacy of a random variable
- Use a uniform variable as the reference
- Inherent privacy of A: 2^h(A)
- Makes the definition consistent with Rakesh's approach
- MI-based privacy metric
- P(A|B) = 1 - 2^(-I(A;B)), the lost privacy
- I(A;B) = 0 → B does not help estimate A
- Privacy is fully preserved; the lost privacy P(A|B) = 0
- I(A;B) > 0 → 0 < P(A|B) < 1
- Calculation for additive perturbation
- I(X;Z) = h(Z) - h(Z|X) = h(Z) - h(Y) (see the sketch below)
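A hedged sketch of the lost-privacy calculation when both X and the noise Y are Gaussian (an assumption; the metric itself allows arbitrary distributions). For Gaussians, I(X;Z) = 0.5 * log2(1 + var(X)/var(Y)) bits.

```python
import math

def lost_privacy_gaussian(var_x, var_y):
    # I(X;Z) = h(Z) - h(Y); closed form for Gaussian X and Y
    i_xz = 0.5 * math.log2(1.0 + var_x / var_y)
    return 1.0 - 2.0 ** (-i_xz)   # P(X|Z) = 1 - 2^(-I(X;Z))

print(lost_privacy_gaussian(var_x=1.0, var_y=1.0))  # ~0.29; larger var_y loses less
```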
14. Distribution reconstruction
- Problem: Z = X + Y
- Know the noise Y's distribution F_Y
- Know the perturbed values z1, z2, ..., zn
- Estimate the original distribution F_X
- Basic methods
- Rakesh's method
- EM estimation
15. Rakesh's algorithm (paper 10)
- Find the distribution P(X | X+Y)
- Three key points to understand it
- Bayes rule: P(X | X+Y) = P(X+Y | X) P(X) / P(X+Y)
- Conditional probability: f_{X+Y}(X+Y = w | X = x) = f_Y(w - x)
- The density at a point a averages the estimates from all samples:
  f_X'(a) = (1/n) Σ_{i=1..n} [ f_Y(z_i - a) f_X(a) / ∫ f_Y(z_i - t) f_X(t) dt ]
- Iterate, using the new estimate f_X'(a) as f_X(a) in the next round
16. Stop criterion: the difference between two consecutive f_X estimates is small
17. Make it more efficient
- Partition the range of x into bins
- Discretize the previous formula: replace each point x by m(x), the midpoint of the bin that x is in, and replace the integral by a sum over bins of length t
- (a discretized sketch follows)
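A minimal sketch of the discretized iteration as I read it from the formula above; the bin layout, Gaussian noise, and function names are assumptions for illustration, not paper 10's exact code.

```python
import numpy as np
from scipy.stats import norm

def reconstruct(z, sigma, bins=20, iters=100, tol=1e-6):
    """Estimate the density of X over equal-width bins, given z = x + y (NumPy array z)."""
    edges = np.linspace(z.min() - 3 * sigma, z.max() + 3 * sigma, bins + 1)
    mids = (edges[:-1] + edges[1:]) / 2
    width = edges[1] - edges[0]
    fx = np.full(bins, 1.0 / (bins * width))      # start from a uniform estimate

    # fy[i, j] = f_Y(z_i - m(bin j)): noise density linking sample i to bin j
    fy = norm.pdf(z[:, None] - mids[None, :], scale=sigma)
    for _ in range(iters):
        denom = (fy * fx).sum(axis=1) * width      # per-sample normalizer (the integral)
        new_fx = (fy / denom[:, None]).mean(axis=0) * fx
        if np.abs(new_fx - fx).sum() * width < tol:  # stop criterion from slide 16
            fx = new_fx
            break
        fx = new_fx
    return mids, fx
```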
18. Weaknesses of Rakesh's algorithm
- No convergence proof
- We don't know whether the iteration gives the globally optimal result
19. EM algorithm (paper 11)
- Uses discretized bins to approximate the distribution
- Maximum Likelihood Estimation (MLE) method
- x1, x2, ..., xn are independent and identically distributed
- Joint distribution: f(x1, x2, ..., xn; θ) = f(x1; θ) f(x2; θ) ... f(xn; θ)
- MLE principle: find the θ that maximizes f(x1, x2, ..., xn; θ)
- Often maximize log f(x1, x2, ..., xn; θ) = Σ log f(xi; θ)
- The density (height) of bin i is denoted θi
20. Basic idea of the EM algorithm
- Q(θ, θ') is the expected log-likelihood function
- θ is the vector of bin densities (θ1, θ2, ..., θk), and θ' is the previous estimate of θ
- EM algorithm
- Initialize θ to the uniform distribution
- In each iteration, find the θ that maximizes Q(θ, θ') based on the previous estimate θ' and the perturbed values z
- Sample zj can come from bin i only when zj - upper(bin i) < Y < zj - lower(bin i) (see the sketch below)
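A hedged sketch of the EM iteration for binned densities, following the interval condition above; Gaussian noise and the bin layout are my assumptions for illustration.

```python
import numpy as np
from scipy.stats import norm

def em_reconstruct(z, sigma, bins=20, iters=200):
    """MLE of bin densities theta for X, observing z = x + y with y ~ N(0, sigma^2)."""
    edges = np.linspace(z.min() - 3 * sigma, z.max() + 3 * sigma, bins + 1)
    width = edges[1] - edges[0]
    theta = np.full(bins, 1.0 / (bins * width))   # initial theta: uniform

    # mass[j, i] = P(zj - upper_i < Y < zj - lower_i): chance x_j sits in bin i
    mass = norm.cdf(z[:, None] - edges[None, :-1], scale=sigma) \
         - norm.cdf(z[:, None] - edges[None, 1:], scale=sigma)
    for _ in range(iters):
        post = mass * theta                        # E-step: unnormalized posterior
        post /= post.sum(axis=1, keepdims=True)    # P(x_j in bin i | z_j, theta)
        theta = post.mean(axis=0) / width          # M-step: refresh bin densities
    return edges, theta
```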
21. Properties of the EM algorithm
- Unique globally optimal solution
- θ converges to the MLE solution
22. Evaluating loss of information
- The information that additive perturbation wants to preserve: the column distribution
- First metric: the difference between the estimated and the original distribution (one concrete instance is sketched below)
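One plausible instance of the "difference between distributions" metric: total variation over the shared bins. This is my choice for illustration; the slides do not fix a particular distance.

```python
import numpy as np

def distribution_loss(theta_true, theta_est, width):
    """0 = perfect reconstruction; 1 = completely disjoint bin densities."""
    diff = np.abs(np.asarray(theta_true) - np.asarray(theta_est))
    return 0.5 * diff.sum() * width
```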
23. Evaluating loss of information (cont.)
- Indirect metric: modeling quality
- e.g., the accuracy of the classifier, if used for classification modeling
- Evaluation method: compare
- the accuracy of the classifier trained on the original data
- the accuracy of the classifier trained on the reconstructed distribution
24. DM with Additive Perturbation
- Example: decision tree
- A brief introduction to the decision tree algorithm
- There are many versions
- We use one version that works on continuous attributes
25. Split evaluation
- gini(S) = 1 - Σ_j p_j²
- p_j is the relative frequency of class j in S
- gini_split(S) = (n1/n)·gini(S1) + (n2/n)·gini(S2)
- The smaller, the better (see the sketch below)
- Procedure
- Get the distribution of each attribute
- Scan through each bin boundary of the attribute and calculate the gini_split index → problem: how to determine p_j?
- Pick the split with the minimum index
26. An approximate method to determine p_j
- The original domain is partitioned into m bins
- Reconstruction gives a distribution over the bins → counts n1, n2, ..., nm
- Sort the perturbed data by the target attribute
- Assign the records sequentially to the bins according to the reconstructed distribution
- Look at the class labels associated with the records
- → Errors happen because we use perturbed values to determine the bin identification of each record (a sketch of the assignment follows)
27. When to reconstruct the distribution
- Global: calculate once
- By class: calculate once per class
- Local: by class, at each tree node
- Empirical studies show that By-class and Local are more effective
28. Problems with papers 10 and 11
- Privacy evaluation
- Didn't consider in-depth attacking methods
- Data reconstruction methods
- Loss of information
- Negatively related to privacy
- Not directly related to modeling
- Accuracy of distribution reconstruction vs. accuracy of the classifier?
29. Summary
- We discussed the basic methods of additive perturbation
- Definition
- Privacy metrics
- Distribution reconstruction
- Open problem: the privacy evaluation is not complete
- Attacks: covered in the next class