Title: Additive Data Perturbation: the Basic Problem and Techniques
1. Additive Data Perturbation: the Basic Problem and Techniques
2. Outline
- Motivation
- Definition
- Privacy metrics
- Distribution reconstruction methods
- Privacy-preserving data mining with additive data perturbation
- Summary
- Note: the focus is on papers 10 and 11
3. Motivation
- Web-based computing
- Observations
- Only a few sensitive attributes need protection
- Allow individual users to perform protection at low cost
- Some data mining algorithms work on distributions instead of individual records
4. Definition of dataset
- A table of rows and columns
- Each row is a record, or a vector
- Each column represents an attribute
- We also call it multidimensional data
Two records in a 3-attribute dataset (each row is a 3-dimensional record):

  A    B    C
  10   1.0  100
  12   2.0  20
5. Additive perturbation
- Definition
- Z = X + Y
- X is the original value, Y is random noise, and Z is the perturbed value
- The data Z and the parameters of Y are published
- e.g., Y is Gaussian N(0, 1) (a minimal code sketch follows this slide)
- History
- Used in statistical databases to protect sensitive attributes (late 80s to 90s); see paper 14
- Benefit
- Allows distribution reconstruction
- Allows individual users to do the perturbation
- Publish the noise distribution
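As a minimal sketch of this definition (using NumPy; the variable names are illustrative, not from the papers):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(20, 80, size=1000)     # original sensitive values X
    y = rng.normal(0.0, 1.0, size=x.size)  # noise Y ~ N(0, 1)
    z = x + y                              # perturbed values Z = X + Y
    # Only z and the noise parameters (Gaussian, mean 0, std 1) are
    # published; the original values x are never released.

Each user can apply this step locally, which is what makes the low-cost, per-user protection mentioned earlier possible.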
6. Applications in data mining
- Distribution reconstruction algorithms
- Rakesh's algorithm
- Expectation-Maximization (EM) algorithm
- Column-distribution based algorithms
- Decision tree
- Naïve Bayes classifier
7. Major issues
- Privacy metrics
- Distribution reconstruction algorithms
- Metrics for loss of information
- A tradeoff between loss of information and privacy
8. Privacy metrics for additive perturbation
- Variance/confidence-based definition
- Mutual information-based definition
9. Variance/confidence-based definition
- Method
- Based on the attacker's view: value estimation
- The attacker knows the perturbed data and the noise distribution
- No other prior knowledge
- Estimation method
- Confidence interval: the range, centered at the perturbed value, that contains the real value with probability c (a Gaussian example follows this slide)
- Y: zero mean, standard deviation σ
- σ is the important factor, i.e., var(Z − X) = σ²
- Given Z, X lies within the range Z ± α with confidence c, where the half-width α grows with σ
- We often ignore the confidence c and use σ to represent the difficulty of value estimation
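To make the metric concrete, assuming Gaussian noise (an assumption for illustration; other noise distributions work analogously), the interval for a given confidence c can be computed as follows:

    from scipy.stats import norm

    sigma = 1.0  # noise standard deviation; var(Z - X) = sigma**2
    c = 0.95     # desired confidence level

    # Half-width alpha of the interval: X is in [z - alpha, z + alpha]
    # with probability c, since X = Z - Y and Y ~ N(0, sigma).
    alpha = norm.ppf((1 + c) / 2) * sigma

    def attacker_interval(z):
        return (z - alpha, z + alpha)

A larger sigma widens the interval, which is why sigma alone is often used as the privacy measure.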
10. Problem with the Var/conf metric
- No knowledge about the original data is incorporated
- Knowledge about the original data distribution
- will be discovered through distribution reconstruction in additive perturbation
- can be known a priori in some applications
- Other prior knowledge may enable more types of attacks
- Privacy evaluation needs to incorporate these attacks
11. Mutual information-based method
- Incorporates the original data distribution
- Concept: uncertainty → entropy
- Difficulty of estimation = the amount of privacy
- Intuition: knowing the perturbed data Z and the distribution of the noise Y, how much is the uncertainty of X reduced?
- Z, Y do not help in estimating X → all uncertainty of X is preserved, privacy = 1
- Otherwise, 0 < privacy < 1
12. Definition of mutual information
- Entropy h(A) → evaluates the uncertainty of A
- Hard to estimate → high entropy
- Among distributions over the same value range, the uniform distribution has the largest entropy
- Conditional entropy h(A|B)
- If we know the random variable B, how much uncertainty of A remains?
- If B is not independent of A, the uncertainty of A can be reduced (B helps explain A), i.e., h(A|B) < h(A)
- Mutual information I(A;B) = h(A) − h(A|B) (a small computational example follows this slide)
- Evaluates the information brought by B in estimating A
- Note: I(A;B) = I(B;A)
13. Inherent privacy of a random variable
- Uses a uniform variable as the reference
- Inherent privacy: 2^h(A)
- Makes the definition consistent with Rakesh's approach
- MI-based privacy metric
- P(A|B) = 1 − 2^(−I(A;B)), the lost privacy
- I(A;B) = 0 → B does not help estimate A
- Privacy is fully preserved; the lost privacy is P(A|B) = 0
- I(A;B) > 0 → 0 < P(A|B) < 1
- Calculation for additive perturbation
- I(X;Z) = h(Z) − h(Z|X) = h(Z) − h(Y)
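A worked example of this calculation, under the simplifying assumption (mine, for illustration) that X and Y are independent Gaussians: then Z ~ N(0, var_x + var_y), h(N(0, v)) = 0.5·log2(2πe·v), and the 2πe terms cancel in the difference:

    import numpy as np

    def lost_privacy_gaussian(var_x, var_y):
        # I(X;Z) = h(Z) - h(Y) = 0.5 * log2((var_x + var_y) / var_y)
        i_xz = 0.5 * np.log2((var_x + var_y) / var_y)
        # P(X|Z) = 1 - 2**(-I(X;Z)), the lost privacy
        return 1.0 - 2.0 ** (-i_xz)

    # Noise as strong as the data: I(X;Z) = 0.5, lost privacy ~ 0.29
    print(lost_privacy_gaussian(1.0, 1.0))

Increasing var_y drives I(X;Z), and hence the lost privacy, toward 0.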
14. Distribution reconstruction
- Problem: Z = X + Y
- Known: the noise Y's distribution F_Y
- Known: the perturbed values z1, z2, …, zn
- Estimate the distribution F_X
- Basic methods
- Rakesh's method
- EM estimation
15. Rakesh's algorithm (paper 10)
- Find the distribution P(X | X+Y)
- Three key points to understand it
- Bayes rule
- P(X | X+Y) = P(X+Y | X) P(X) / P(X+Y)
- Conditional probability
- f_{X+Y}(X+Y = w | X = x) = f_Y(w − x)
- The probability at a point a averages the estimates from all samples, iterating on the current estimate of f_X(a):

  f_X^{j+1}(a) = (1/n) Σ_{i=1..n} [ f_Y(z_i − a) f_X^j(a) / ∫ f_Y(z_i − u) f_X^j(u) du ]
16. Stop criterion: the difference between two consecutive estimates of f_X is small
17. Make it more efficient
- Partition the range of X into bins
- Discretize the previous formula: evaluate f_X only at m(x), the mid-point of the bin that x is in, and replace the integral with a sum weighted by L_t, the length of interval t:

  f_X^{j+1}(m(t)) ≈ (1/n) Σ_{i=1..n} [ f_Y(z_i − m(t)) f_X^j(m(t)) / Σ_s f_Y(z_i − m(s)) f_X^j(m(s)) L_s ]
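A compact sketch of this discretized iteration (NumPy; the function and variable names are illustrative, not from paper 10):

    import numpy as np

    def reconstruct_fx(z, noise_pdf, bins, max_iter=500, tol=1e-6):
        # z: perturbed values (1-D array); noise_pdf: density f_Y;
        # bins: bin edges covering the range of X
        mids = 0.5 * (bins[:-1] + bins[1:])   # m(x): bin mid-points
        widths = np.diff(bins)                # L_t: interval lengths
        fx = np.full(mids.size, 1.0 / (bins[-1] - bins[0]))  # uniform start
        fy = noise_pdf(z[:, None] - mids[None, :])  # f_Y(z_i - m(t))
        for _ in range(max_iter):
            # denominator approximates the integral of f_Y(z_i - u) f_X(u)
            denom = fy @ (fx * widths)
            fx_new = fx * (fy / denom[:, None]).mean(axis=0)
            fx_new /= np.sum(fx_new * widths)  # renormalize to a density
            if np.max(np.abs(fx_new - fx)) < tol:  # stop criterion (slide 16)
                return mids, fx_new
            fx = fx_new
        return mids, fx

With Gaussian noise, for example, one could pass noise_pdf=lambda y: scipy.stats.norm.pdf(y, 0, sigma).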
18. Weaknesses of Rakesh's algorithm
- No convergence proof
- We don't know whether the iteration gives the globally optimal result
19. EM algorithm (paper 11)
- Uses discretized bins to approximate the distribution; the density (height) of bin i is denoted θ_i
- Maximum Likelihood Estimation (MLE) method
- x1, x2, …, xn are independent and identically distributed
- Joint distribution
- f(x1, x2, …, xn; θ) = f(x1; θ) f(x2; θ) … f(xn; θ)
- MLE principle
- Find the θ that maximizes f(x1, x2, …, xn; θ)
- Often maximize log f(x1, x2, …, xn; θ) = Σ_i log f(xi; θ)
20. Basic idea of the EM algorithm
- Q(θ, θ′) is the MLE objective function
- θ is the vector of bin densities (θ1, θ2, …, θk), and θ′ is the previous estimate of θ
- EM algorithm
- Initialize θ to the uniform distribution
- In each iteration, find the θ that maximizes Q(θ, θ′) based on the previous estimate θ′ and the perturbed data z
- The update uses, for each sample z_j and bin Ω_i, the probability that the noise falls in the corresponding interval: z_j − upper(Ω_i) < Y < z_j − lower(Ω_i)
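A sketch of these iterations for the binned model (NumPy; names are illustrative, not from paper 11):

    import numpy as np

    def em_reconstruct(z, noise_cdf, bins, max_iter=500, tol=1e-8):
        # z: perturbed values (1-D array); noise_cdf: CDF of Y;
        # bins: edges of the intervals Omega_1..Omega_k
        lo, up = bins[:-1], bins[1:]
        widths = up - lo
        theta = np.full(widths.size, 1.0 / (bins[-1] - bins[0]))  # uniform
        # pr[j, i] = Pr(z_j - upper(Omega_i) < Y < z_j - lower(Omega_i))
        pr = (noise_cdf(z[:, None] - lo[None, :])
              - noise_cdf(z[:, None] - up[None, :]))
        for _ in range(max_iter):
            # E-step: posterior probability that x_j lies in bin i
            post = pr * theta[None, :]
            post /= post.sum(axis=1, keepdims=True)
            # M-step: re-estimate the bin densities theta_i
            theta_new = post.mean(axis=0) / widths
            if np.max(np.abs(theta_new - theta)) < tol:
                return theta_new
            theta = theta_new
        return theta

With Gaussian noise, noise_cdf could be lambda y: scipy.stats.norm.cdf(y, 0, sigma).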
21. Properties of the EM algorithm
- Unique globally optimal solution
- θ converges to the MLE solution
22. Evaluating loss of information
- The information that additive perturbation aims to preserve
- The column distribution
- First metric
- The difference between the estimated and the original distribution
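One simple, illustrative instantiation of this difference is half the L1 distance between the two binned densities (0 = identical, 1 = completely disjoint):

    import numpy as np

    def reconstruction_error(fx_orig, fx_est, widths):
        # 0.5 * integral of |f_orig - f_est|, discretized over the bins
        return 0.5 * np.sum(np.abs(fx_orig - fx_est) * widths)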
23. Evaluating loss of information
- Indirect metric
- Modeling quality
- e.g., the accuracy of the classifier, if the data is used for classification modeling
- Evaluation method: compare
- the accuracy of the classifier trained on the original data
- the accuracy of the classifier trained on the reconstructed distribution
24. DM with Additive Perturbation
- Example: decision tree
- A brief introduction to the decision tree algorithm
- There are many versions
- We use one version that works on continuous attributes
25. Split evaluation
- gini(S) = 1 − Σ_j p_j²
- p_j is the relative frequency of class j in S
- gini_split(S) = (n1/n)·gini(S1) + (n2/n)·gini(S2) (a code sketch follows this slide)
- The smaller, the better
- Procedure
- Get the distribution of each attribute
- Scan through each bin of the attribute and calculate the gini_split index → problem: how to determine p_j
- Pick the split with the minimum gini_split
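A direct translation of these formulas (NumPy; illustrative):

    import numpy as np

    def gini(counts):
        # gini(S) = 1 - sum(p_j^2), p_j = relative frequency of class j
        p = np.asarray(counts, dtype=float) / np.sum(counts)
        return 1.0 - np.sum(p ** 2)

    def gini_split(counts1, counts2):
        # gini_split(S) = (n1/n)*gini(S1) + (n2/n)*gini(S2)
        n1, n2 = np.sum(counts1), np.sum(counts2)
        return n1 / (n1 + n2) * gini(counts1) + n2 / (n1 + n2) * gini(counts2)

    # A candidate split sending 40/10 of classes (+, -) to S1 and 5/45 to S2:
    print(gini_split([40, 10], [5, 45]))   # 0.25, lower than no split (0.495)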
26. An approximate method to determine p_j
- The original domain is partitioned into m bins
- Reconstruction gives a distribution over the bins → record counts n1, n2, …, nm
- Sort the perturbed data by the target attribute
- Assign the records sequentially to the bins according to that distribution (sketched after this slide)
- Look at the class labels associated with the records
- → Errors happen because we use perturbed values to determine the bin assignment of each record
27. When to reconstruct the distribution
- Global: calculate once for the whole dataset
- By class: calculate once per class
- Local: calculate by class at each tree node
- Empirical studies show
- By-class and Local are more effective
28. Problems with papers 10 and 11
- Privacy evaluation
- Didn't consider in-depth attacking methods
- Data reconstruction methods
- Loss of information
- Negatively related to privacy
- Not directly related to modeling quality
- Accuracy of distribution reconstruction vs. accuracy of the classifier?
29. Summary
- We discussed the basic methods for additive perturbation
- Definition
- Privacy metrics
- Distribution reconstruction
- The privacy evaluation presented here is not complete
- Attacks
- Covered in the next class