Additive Data Perturbation: the Basic Problem and Techniques

Provided by: keke9
1
Additive Data Perturbation: the Basic Problem and Techniques
2
Outline
  • Motivation
  • Definition
  • Privacy metrics
  • Distribution reconstruction methods
  • Privacy-preserving data mining with additive data
    perturbation
  • Summary
  • Note: focus on papers 10 and 11

3
Motivation
  • Web-based computing
  • Observations
  • Only a few sensitive attributes need protection
  • Allow individual users to perform protection at
    low cost
  • Some data mining algorithms work on distribution
    instead of individual records

4
  • Definition of dataset
  • Column by row table
  • Each row is a record, or a vector
  • Each column represents an attribute
  • We also call it multidimensional data

[Figure: 2 records in the 3-attribute dataset; each is a 3-dimensional record]
5
Additive perturbation
  • Definition
  • Z = X + Y
  • X is the original value, Y is random noise and Z
    is the perturbed value
  • Data Z and the parameters of Y are published
  • e.g., Y is Gaussian N(0,1)
  • History
  • Used in statistical databases to protect
    sensitive attributes (late 80s to 90s); see
    paper 14
  • Benefit
  • Allow distribution reconstruction
  • Allow individual user to do perturbation
  • Publish the noise distribution
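
A minimal sketch of the definition above, assuming Gaussian noise N(0, 1) as in the slide's example. The function name `perturb` and the sample values are illustrative and do not come from the slides.

```python
import random

def perturb(values, sigma=1.0, seed=None):
    """Return z_i = x_i + y_i, with y_i drawn i.i.d. from N(0, sigma^2)."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, sigma) for x in values]

# Hypothetical sensitive attribute values (assumption, for illustration only).
x = [23.0, 35.0, 41.0, 58.0]
# Only z and the noise parameters (mean 0, std sigma) would be published.
z = perturb(x, sigma=1.0, seed=0)
```

Each user can run this locally on their own value, which is why the scheme suits the individual-user setting mentioned in the benefits.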

6
Applications in data mining
  • Distribution reconstruction algorithms
  • Rakesh's algorithm
  • Expectation-Maximization (EM) algorithm
  • Column-distribution based algorithms
  • Decision tree
  • Naïve Bayes classifier

7
Major issues
  • Privacy metrics
  • Distribution reconstruction algorithms
  • Metrics for loss of information
  • A tradeoff between loss of information and
    privacy

8
Privacy metrics for additive perturbation
  • Variance/confidence based definition
  • Mutual information based definition

9
Variance/confidence based definition
  • Method
  • Based on the attacker's view: value estimation
  • Knowing perturbed data, and noise distribution
  • No other prior knowledge
  • Estimation method

[Figure: confidence interval around the perturbed value, i.e., the range
that contains the real value with probability c]
  • Y: zero mean, std σ
  • σ is the important factor, i.e., var(Z-X) = σ^2
  • Given Z, X lies in the range Z ± σ
    with confidence c
  • We often ignore the confidence c and use σ to
    represent the difficulty of value estimation.
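
As a rough illustration of how the interval width relates to the confidence c, the sketch below assumes the noise is Gaussian (the slide only fixes zero mean and a standard deviation). The function name `interval_half_width` is hypothetical.

```python
from statistics import NormalDist

def interval_half_width(sigma, c):
    """Half-width w with P(|Z - X| <= w) = c, assuming Y ~ N(0, sigma^2)."""
    return sigma * NormalDist().inv_cdf((1.0 + c) / 2.0)

# For c = 0.95 the half-width is about 1.96 noise standard deviations.
w95 = interval_half_width(sigma=1.0, c=0.95)
```

Under this assumption the half-width is always a fixed multiple of the noise standard deviation, which is why the standard deviation alone can stand in for the difficulty of value estimation.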

10
Problem with Var/conf metric
  • No knowledge about the original data is
    incorporated
  • Knowledge about the original data distribution
  • which will be discovered by distribution
    reconstruction in additive perturbation
  • can be known a priori in some applications
  • Other prior knowledge may introduce more types of
    attacks
  • Privacy evaluation needs to incorporate these
    attacks

11
  • Mutual information based method
  • incorporating the original data distribution
  • Concept: uncertainty → entropy
  • Difficulty of estimation: the amount of privacy
  • Intuition: knowing the perturbed data Z and the
    distribution of the noise Y, how much is the
    uncertainty of X reduced?
  • Z, Y do not help in estimating X → all uncertainty
    of X is preserved: privacy = 1
  • Otherwise: 0 < privacy < 1

12
  • Definition of mutual information
  • Entropy h(A) → evaluates the uncertainty of A
  • Not easy to estimate → high entropy
  • Distributions over the same bounded range → uniform
    has the largest entropy (for a fixed variance, the
    Gaussian does)
  • Conditional entropy h(A|B)
  • If we know the random variable B, how much is the
    uncertainty of A
  • If B is not independent of A, the uncertainty of
    A can be reduced (B helps explain A), i.e.,
    h(A|B) < h(A)
  • Mutual information I(A;B) = h(A) - h(A|B)
  • Evaluates the information brought by B in
    estimating A
  • Note: I(A;B) = I(B;A)

13
  • Inherent privacy of a random variable
  • Using uniform variable as the reference
  • 2^h(A)
  • Makes the definition consistent with Rakesh's
    approach
  • MI based privacy metric
  • P(A|B) = 1 - 2^(-I(A;B)), the lost privacy
  • I(A;B) = 0 → B does not help estimate A
  • Privacy is fully preserved, the lost privacy
    P(A|B) = 0
  • I(A;B) > 0 → 0 < P(A|B) < 1
  • Calculation for additive perturbation
  • I(X;Z) = h(Z) - h(Z|X) = h(Z) - h(Y)
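
A worked instance of the calculation for additive perturbation, under the assumption (not stated in the slides) that both X and Y are Gaussian and independent, so Z is Gaussian with variance var(X) + var(Y). Entropies are taken in bits because the metric uses powers of 2. The function names are illustrative.

```python
import math

def gaussian_entropy_bits(var):
    """Differential entropy of N(mu, var) in bits."""
    return 0.5 * math.log2(2.0 * math.pi * math.e * var)

def privacy_loss(var_x, var_y):
    """Lost privacy 1 - 2^(-I(X;Z)), with I(X;Z) = h(Z) - h(Y)."""
    i_xz = gaussian_entropy_bits(var_x + var_y) - gaussian_entropy_bits(var_y)
    return 1.0 - 2.0 ** (-i_xz)

# Noise as strong as the signal: I(X;Z) = 0.5 bit.
loss_equal = privacy_loss(1.0, 1.0)
# Much stronger noise: the perturbed data reveals far less about X.
loss_noisy = privacy_loss(1.0, 100.0)
```

In this Gaussian case the loss reduces to 1 - 1/sqrt(1 + var(X)/var(Y)), so it depends only on the ratio of the two variances.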

14
Distribution reconstruction
  • Problem: Z = X + Y
  • Know the noise Y's distribution F_Y
  • Know the perturbed values z1, z2, ..., zn
  • Estimate the distribution F_X
  • Basic methods
  • Rakesh's method
  • EM estimation

15
Rakesh's algorithm (paper 10)
  • Find the distribution P(X | X+Y)
  • Three key points to understand it
  • Bayes rule
  • P(X | X+Y) = P(X+Y | X) P(X) / P(X+Y)
  • Conditional prob
  • f_{X+Y}(X+Y = w | X = x) = f_Y(w - x)
  • Prob at the point a uses the average of all
    sample estimates

Using f_X(a)?
16
  • The iterative algorithm

Stop criterion: the difference between two
consecutive f_X estimates is small
17
Make it more efficient
  • Partition the range of x into bins
  • Discretize the previous formula

m(x): mid-point of the bin that x is in
L_t: length of interval t
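
The iterative procedure with binning can be sketched as follows. This is a simplified reading, not the exact formula from paper 10: only the range of x is discretized (each bin is represented by its mid-point), the perturbed values are used as they are, and the noise is assumed Gaussian. The function name, bin layout and data are all hypothetical.

```python
import random
from statistics import NormalDist

def reconstruct(z, edges, noise, max_iter=200, tol=1e-4):
    """Estimate Pr(X in bin t) for each bin from the perturbed values z."""
    k = len(edges) - 1
    mids = [(edges[t] + edges[t + 1]) / 2.0 for t in range(k)]
    # f_Y(z_i - m_t): noise density that would move mid-point m_t onto z_i.
    like = [[noise.pdf(zi - m) for m in mids] for zi in z]
    probs = [1.0 / k] * k  # initial estimate: uniform over the bins
    for _ in range(max_iter):
        new = [0.0] * k
        for row in like:
            denom = sum(l * p for l, p in zip(row, probs))
            if denom == 0.0:  # z_i is too far from every bin to be informative
                continue
            for t in range(k):
                new[t] += row[t] * probs[t] / denom  # Bayes rule for one sample
        total = sum(new)
        new = [v / total for v in new]  # average over the samples that were used
        change = sum(abs(a - b) for a, b in zip(new, probs))
        probs = new
        if change < tol:  # stop: two consecutive estimates are close
            break
    return probs

# Hypothetical data: X uniform on [6, 8), Y ~ N(0, 1), ten unit-width bins.
rng = random.Random(0)
noise = NormalDist(0.0, 1.0)
x = [rng.uniform(6.0, 8.0) for _ in range(500)]
z = [xi + rng.gauss(0.0, 1.0) for xi in x]
edges = [float(e) for e in range(11)]
probs = reconstruct(z, edges, noise)
```

The current estimate appears on both sides of the update, which is why the procedure has to iterate from an initial guess rather than solve for the distribution in one step.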
18
  • Weakness of Rakesh's algorithm
  • No convergence proof
  • Don't know whether the iteration gives the globally
    optimal result

19
EM algorithm (paper 11)
  • Using discretized bins to approximate the
    distribution
  • Maximum Likelihood Estimation (MLE) method
  • x1, x2, ..., xn are independent and identically
    distributed
  • Joint distribution
  • f(x1, x2, ..., xn | θ) = f(x1 | θ) f(x2 | θ) ... f(xn | θ)
  • MLE principle
  • Find θ that maximizes f(x1, x2, ..., xn | θ)
  • Often maximize log f(x1, x2, ..., xn | θ) = sum log
    f(xi | θ)

[Figure: histogram over x; the density (height) of bin i is denoted θi]
20
  • Basic idea of the EM algorithm
  • Q(θ, θ̂) is the MLE function
  • θ is the bin densities (θ1, θ2, ..., θk), and θ̂ is
    the previous estimate of θ.
  • EM algorithm
  • Initial θ: uniform distribution
  • In each iteration, find the current θ that
    maximizes Q(θ, θ̂) based on the previous estimate θ̂
    and z

zj - upper(θi) < Y < zj - lower(θi)
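
One way to read the EM iteration is sketched below. It is the standard EM update for a piecewise-constant density observed through known additive noise and is not claimed to match paper 11 line by line. The noise is assumed Gaussian, and the function name, bins and data are hypothetical.

```python
import random
from statistics import NormalDist

def em_bin_densities(z, edges, noise, iters=100):
    """Estimate the bin densities (heights) theta_i from perturbed values z."""
    k = len(edges) - 1
    widths = [edges[i + 1] - edges[i] for i in range(k)]
    # Pr(z_j - upper_i < Y < z_j - lower_i): noise mass that maps bin i onto z_j.
    mass = [[noise.cdf(zj - edges[i]) - noise.cdf(zj - edges[i + 1])
             for i in range(k)] for zj in z]
    theta = [1.0 / sum(widths)] * k  # initial estimate: uniform density
    for _ in range(iters):
        resp = [0.0] * k
        for row in mass:
            denom = sum(t * m for t, m in zip(theta, row))
            if denom == 0.0:  # z_j cannot be explained by any bin
                continue
            for i in range(k):
                # E-step: posterior probability that x_j fell in bin i.
                resp[i] += theta[i] * row[i] / denom
        total = sum(resp)
        # M-step: expected fraction of records in bin i, divided by bin width.
        theta = [resp[i] / (total * widths[i]) for i in range(k)]
    return theta

# Hypothetical data: X uniform on [6, 8), Y ~ N(0, 1), ten unit-width bins.
rng = random.Random(1)
noise = NormalDist(0.0, 1.0)
z = [rng.uniform(6.0, 8.0) + rng.gauss(0.0, 1.0) for _ in range(500)]
edges = [float(e) for e in range(11)]
theta = em_bin_densities(z, edges, noise)
```

The interval probability computed in `mass` is the quantity bounded by the inequality on this slide: the noise must fall between zj minus the upper edge and zj minus the lower edge of bin i.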
21
  • EM algorithm has properties
  • Unique global optimal solution
  • θ converges to the MLE solution

22
Evaluating loss of information
  • The information that additive perturbation wants
    to preserve
  • Column distribution
  • First metric
  • Difference between the estimate and the original
    distribution

23
Evaluating loss of information
  • Indirect metric
  • Modeling quality
  • The accuracy of classifier, if used for
    classification modeling
  • Evaluation method
  • Accuracy of the classifier trained on the
    original data
  • Accuracy of the classifier trained on the
    reconstructed distribution

24
DM with Additive Perturbation
  • Example: decision tree
  • A brief introduction to decision tree algorithm
  • There are many versions
  • One version working on continuous attributes

25
  • Split evaluation
  • gini(S) = 1 - sum(pj^2)
  • pj is the relative frequency of class j in S
  • gini_split(S) = (n1/n) gini(S1) + (n2/n) gini(S2)
  • The smaller the better
  • Procedure
  • Get the distribution of each attribute
  • Scan through each bin in the attribute and
    calculate the gini_split index → problem: how to
    determine pj
  • Find the minimum one
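
The two split measures can be written down directly. The sketch below works on lists of class labels; the function names mirror the slide, while the example labels are made up for illustration.

```python
from collections import Counter

def gini(labels):
    """gini(S) = 1 - sum_j pj^2, with pj the relative frequency of class j in S."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(left, right):
    """Size-weighted gini of the two sides of a split; smaller is better."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

mixed = gini(["a", "a", "b", "b"])                # 0.5: classes evenly mixed
clean = gini_split(["a", "a"], ["b", "b"])        # 0.0: a perfect split
uneven = gini_split(["a"], ["a", "b", "b"])       # 1/3: one side still mixed
```

Scanning the candidate split points and keeping the one with the smallest `gini_split` value gives the split chosen at a node.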

26
  • An approximate method to determine pj
  • The original domain is partitioned to m bins
  • Reconstruction gives a distribution over the
    bins → n1, n2, ..., nm
  • Sort the perturbed data by the target attribute
  • Assign the records sequentially to the bins
    according to the distribution
  • Look at the class labels associated with the
    records
  • → Errors happen because we use perturbed values
    to determine the bin identification of each
    record
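
A small sketch of the sequential assignment, assuming the reconstructed distribution has already been turned into integer bin counts that sum to the number of records. The function name and the toy data are hypothetical.

```python
def assign_to_bins(perturbed, labels, bin_counts):
    """Sort records by perturbed value, then fill the bins in order by count."""
    order = sorted(range(len(perturbed)), key=lambda i: perturbed[i])
    bins, start = [], 0
    for count in bin_counts:
        # The next `count` records in sorted order go to this bin.
        bins.append([labels[i] for i in order[start:start + count]])
        start += count
    return bins

# Toy example: five perturbed values, their class labels, and counts n1..n3.
perturbed = [2.9, 0.4, 1.7, 3.5, 0.9]
labels = ["b", "a", "a", "b", "a"]
bins = assign_to_bins(perturbed, labels, bin_counts=[2, 1, 2])
```

The class labels collected per bin are what the relative frequencies pj are computed from. A record whose noise moved it past a neighbour ends up in the wrong bin, which is the source of error noted above.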

27
When to reconstruct distribution
  • Global: calculate once
  • By class: calculate once per class
  • Local: by class at each node
  • Empirical study shows
  • By class and Local are more effective

28
Problems with papers 10 and 11
  • Privacy evaluation
  • Didn't consider in-depth attacking methods
  • Data reconstruction methods
  • Loss of information
  • Negatively related to privacy
  • Not directly related to modeling
  • Accuracy of distribution reconstruction vs.
    accuracy of classifier ?

29
Summary
  • We discussed the basic methods with additive
    perturbation
  • Definition
  • Privacy metrics
  • Distribution reconstruction
  • The treatment of privacy evaluation is not yet
    complete
  • Attacks
  • Covered by next class